Implementation and Evaluation of Deep Neural Networks (DNN) on Mainstream Heterogeneous Systems

Junli Gu, Maohua Zhu∗, Zhitao Zhou∗, Feng Zhang∗, Zhen Lin∗, Qianfeng Zhang, Mauricio Breternitz
{Junli.Gu, Maohua.Zhu, David.Zhou, Feng.Zhang}@amd.com
{Zhen.Lin, QianFeng.Zhang, Mauricio.Breternitz}@amd.com
AMD Research Lab, Beijing, China

Abstract
Deep Neural Networks (DNN), with deep layers and very high-dimensional parameters, have demonstrated breakthrough learning capability in the machine learning area. These days, DNN with Big Data input are leading a new direction in large-scale object recognition. DNN training requires a vast amount of computing power, which poses a great challenge to system design.
DNN training embraces massive thread and data parallelism, which matches naturally with GPUs. There are various heterogeneous systems, including CPUs armed with discrete GPUs and chip-level integrated CPU+GPU heterogeneous processors named APUs. In this paper, we explore the implementation of DNN models on different heterogeneous platforms to provide a systematic evaluation and comparison. Specifically, we implement two well-known DNN kernels, the Multi-Layer Perceptron (MLP) and the autoencoder, on various GPUs and APUs from mainstream processor manufacturers. Evaluation results show that GPUs are faster than APUs but at the cost of burning much more power. APUs achieve up to 2x higher performance-per-watt efficiency, which indicates that APU servers can be an energy-efficient and high-density solution to accelerate DNN applications. This paper also conducts bottleneck analysis and presents optimization techniques for the various platforms.

∗ They contributed to this project during their internship at AMD.
1 Background and Motivation

Deep neural networks (DNN) extend traditional neural networks to have deep layers with a high dimension of parameters (millions to billions). DNN models have evolved into different structures for different applications, including multi-layer perceptrons (MLP), convolutional neural networks (CNN) and deep belief networks (DBN) used in image classification and recognition, and deep autoencoders used in image classification pre-training and content-based retrieval. DNN has demonstrated the highest accuracy in image and voice recognition [1, 4, 9]. These days, DNN with Big Data input are leading a new direction in large-scale object recognition [5]. However, training very large DNN models with a vast amount of data takes weeks [4], which poses a great challenge to parallel system design to provide the required computing power. How to build the right clusters to speed up DNN training remains an open problem.

Existing DNN implementations [8] first pointed out that GPUs surpass CPUs' computational capabilities in accelerating DNN. Their later work [4] built a 16-node server of CPUs plugged with GPUs, which beats the earlier 1000-node CPU servers. The heterogeneous CPU+GPU server demonstrated significant speedup, but it also exposed certain bottlenecks such as heavy data transfer overheads. There is another class of heterogeneous processors, APUs, which have not been explored for DNN acceleration. APUs fuse CPU cores and GPU compute units onto the same chip, so the CPU and GPU can work more closely together through the same memory space and hence avoid data transfers. Intel and AMD have been releasing their APU products, such as Intel's Haswell and AMD's A-Series APU A10-7850K (formerly codenamed "Kaveri"). To date, APUs have taken up more than half of the PC market. In 2012, AMD acquired the microserver producer SeaMicro and later announced plans to combine SeaMicro's customized fabric technology with APUs to deliver high-density and energy-efficient APU servers. As power and energy efficiency become critical for green computing and exascale computing, APU servers provide another promising option for researchers and industry to accelerate their DNN applications.

It is challenging to map deep learning algorithms efficiently onto different platforms. There are different DNN models combined with various training algorithms, and one combination can be quite different from another in computing patterns, data dependency and memory access patterns. So it requires expertise in both hardware architectures and algorithms to select or build the best-fit architecture for a specific DNN application. In this paper, we explore the implementation of DNN kernels on various heterogeneous platforms (focusing on a single node) and provide a systematic evaluation including performance, power and performance-per-watt efficiency. This paper further conducts bottleneck analysis on the various platforms, which can be leveraged by architects to build DNN systems. We also present an initial comparison between APU servers and CPU servers plugged with GPU cards. Note that, due to the significant engineering work required, this paper has not yet fully implemented parallel DNN across different clusters.
2 DNN Models and Implementation
We chose two popular DNN models: MLP, used for classification problems like voice and image recognition, and the autoencoder, used for feature extraction problems like image and document retrieval. The MLP structure and its back propagation training algorithm represent the DNN basics, and its performance can reflect the effect of DNN acceleration on each platform. The autoencoder model with the L-BFGS training algorithm is a mix of CPU-style and GPU-style computation and can explore the utilization of the CPU and GPU resources provided by heterogeneous platforms. This section describes the models in detail and discusses how to implement them on various heterogeneous platforms using existing libraries.
2.1 MLP Model
MLP refers to multi-layer fully connected neural networks used for supervised learning. It was trained successfully in [10] and became popular for classification problems such as voice recognition, image recognition and machine translation. In 2013, Microsoft [1] and Baidu described how using deep MLP decreases the voice recognition error rate by 20%-30%, which basically makes voice recognition usable for real-time cross-language translation. MLP represents the fundamental computing patterns of DNN, which consist of layers of matrix multiplication followed by a nonlinear activation function (sigmoid, hyperbolic tangent, etc.) at each layer.

MLP is usually trained by a back propagation algorithm [10]. The complete training process consists of iterations of a feedforward propagation pass and a backward propagation pass. Feedforward propagation takes a batch of input data, propagates it through layers of multiplication by weight matrices, and passes the results through sigmoid functions; from a computing point of view, it is mainly layers of matrix multiplication. Upon reaching the last layer, a cost/error is calculated by comparing the outputs with the right answers (labels). From the cost function, a gradient is calculated whose opposite direction (gradient descent) points to where the network will converge to a minimum. Back propagation thus refers to propagating the error from the last layer back to the previous layers and calculating the gradients for each layer. The weight matrices are then modified along the gradient descent direction. Both the gradient calculation and the weight update are expressed as matrix multiplication.

The task division for MLP is usually that the CPU prepares the data and the GPU accelerates the feedforward and backward propagation passes. We adopt this CPU-GPU task division for the APU and CPU plus GPU platforms, while mapping everything onto the CPU for CPU-only platforms. When implementing MLP on the various processors, we can leverage existing libraries for matrix multiplication: CUBLAS for Nvidia GPUs, CLAMDBLAS for AMD GPUs and MKL for Intel CPUs. For activation, a sigmoid function is used, which is implemented as a separate simple kernel. We implemented a nine-layer MLP model with 32 million parameters, structured as an input layer of size 1100, seven successive hidden layers each with 2048 neurons, and an output layer of size 9304. To our knowledge, such a model matches the scale of what industry adopts these days.
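To make the per-layer computing pattern concrete, the following sketch shows one feedforward layer as a matrix multiplication followed by an element-wise sigmoid. It is a minimal CPU-only illustration with hypothetical array layouts and names; in the actual implementations the matrix product is delegated to CUBLAS, CLAMDBLAS or MKL and the sigmoid runs as a separate simple kernel.

#include <cmath>
#include <vector>

// One MLP feedforward layer: Y = sigmoid(X * W), where
//   X is a batch x in_dim input matrix (row-major),
//   W is an in_dim x out_dim weight matrix (row-major),
//   Y is the batch x out_dim activation matrix.
// Bias terms are omitted for brevity. This naive loop nest only illustrates
// the computing pattern; production code hands the GEMM to a BLAS library.
void mlp_layer_forward(const std::vector<float>& X,
                       const std::vector<float>& W,
                       std::vector<float>& Y,
                       int batch, int in_dim, int out_dim) {
    Y.assign(static_cast<size_t>(batch) * out_dim, 0.0f);
    for (int b = 0; b < batch; ++b) {
        for (int o = 0; o < out_dim; ++o) {
            float acc = 0.0f;
            for (int i = 0; i < in_dim; ++i)
                acc += X[b * in_dim + i] * W[i * out_dim + o];
            // Element-wise sigmoid activation.
            Y[b * out_dim + o] = 1.0f / (1.0f + std::exp(-acc));
        }
    }
}

Chaining such layers with the sizes given above (1100 inputs, seven hidden layers of 2048, 9304 outputs) reproduces the feedforward pass of the evaluated model; back propagation applies the same GEMM pattern to compute the per-layer gradients and weight updates.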
2.2 Autoencoder Model

The autoencoder model was first proposed by Geoffrey E. Hinton in 2006 [6] and achieved significant improvement in unsupervised learning. Nowadays it is widely used in applications like speech feature extraction and content-based image retrieval. An autoencoder is a fully connected neural network structure that trains a hidden layer as a compressed representation of the input. The learned representation can be understood as features extracted from each input data set. Differently from other DNN models, the autoencoder tries to generate a reconstruction h(x) of the input x from the learned features, such that h(x) ≈ x. Thus, training the autoencoder means minimizing the Euclidean distance between the input and the reconstruction, which is the cost function shown in Equation 1.
\min_{W,b,c} \sum_{i=1}^{n} \left\| \sigma\left(W^{T}\sigma\left(W x^{(i)} + b\right) + c\right) - x^{(i)} \right\|^{2} \quad (1)
where W, b and c are the network parameters and n denotes the size of the training data. We implement a three-layer (3072-6144-1024) autoencoder as a representative case, shown in Figure 1, which is considered a deep autoencoder model. The model includes 25 million parameters.
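As a concrete reading of Equation 1, the sketch below evaluates the reconstruction σ(Wᵀσ(Wx + b) + c) and its squared distance to the input for a single sample. It follows the tied-weight form of the equation; the dimensions and function names are illustrative and do not come from the actual implementation.

#include <cmath>
#include <vector>

// Squared reconstruction error || sigmoid(W^T sigmoid(W x + b) + c) - x ||^2
// for a single sample x, i.e. the summand of Equation 1 with tied weights.
//   W: hidden_dim x input_dim weight matrix (row-major)
//   b: hidden_dim encoder bias, c: input_dim decoder bias
float autoencoder_sample_cost(const std::vector<float>& W,
                              const std::vector<float>& b,
                              const std::vector<float>& c,
                              const std::vector<float>& x,
                              int input_dim, int hidden_dim) {
    auto sigmoid = [](float v) { return 1.0f / (1.0f + std::exp(-v)); };

    // Encode: h = sigmoid(W x + b)
    std::vector<float> h(hidden_dim);
    for (int j = 0; j < hidden_dim; ++j) {
        float acc = b[j];
        for (int i = 0; i < input_dim; ++i)
            acc += W[j * input_dim + i] * x[i];
        h[j] = sigmoid(acc);
    }

    // Decode with the transposed weights and accumulate the squared
    // Euclidean distance between the reconstruction and the input.
    float cost = 0.0f;
    for (int i = 0; i < input_dim; ++i) {
        float acc = c[i];
        for (int j = 0; j < hidden_dim; ++j)
            acc += W[j * input_dim + i] * h[j];
        float diff = sigmoid(acc) - x[i];
        cost += diff * diff;
    }
    return cost;
}

Summing this value over the n training samples gives the objective that the training algorithm minimizes over W, b and c.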
Figure 1: Autoencoder network architecture with reconstruction layers (3072 input, 6144 hidden, 1024 code layer, 6144, 3072 reconstruction, with weights W1, W2, W2ᵀ, W1ᵀ).

To train the autoencoder, i.e., to make it converge to a global minimum, three training algorithms are commonly used: stochastic gradient descent (SGD), conjugate gradient (CG) and the limited-memory BFGS (L-BFGS) algorithm. As pointed out in recent papers, for a deep autoencoder with a small number of hidden layers, the L-BFGS algorithm works better at finding the global minimum and achieves high scalability for Big Data training on large-scale systems [4, 6, 8]. Thus we choose L-BFGS to train the autoencoder.

The L-BFGS training process requires storing m (m=6 in this paper) steps of weight-matrix history on the CPU, and the weight matrices and gradients of all layers on the GPU. The weight matrices grow quadratically with the input data size, which can quickly become a bottleneck for the limited device memory of discrete GPUs. To give an example, when the input data size equals 12k (the size of a 64x64 RGB image), the total data storage exceeds 7GB. Due to the above limitations, we map the L-BFGS algorithm to the CPU and leverage the CPU's memory to store the history of weights and gradients. The L-BFGS algorithm mainly does dot-product computation, which the CPU can handle efficiently. We map the autoencoder structure and computation to the GPU for acceleration. We use a mature open-source C version of the L-BFGS library [7] in the final implementation. The autoencoder forward and backward propagation is similar to MLP, so we still use BLAS libraries for the implementation. The L-BFGS training process requires frequent transfers of weight matrices from CPU to GPU and of gradients from GPU to CPU. Through the autoencoder training process we can also examine the bottlenecks of memory space limitation and data transfer on the different platforms. On heterogeneous platforms, workloads that mix CPU and GPU compute can in theory make better use of the available hardware resources. The autoencoder training process explores the computing efficiency and the tradeoff between the CPU and GPU compute ratio, providing insight for building machine learning systems.
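The reason the CPU side of L-BFGS reduces to dot products can be seen from the standard two-loop recursion that turns the stored history into a search direction. The sketch below is a generic illustration of that recursion, not code from the library used here [7]; it assumes the m (=6) most recent step vectors s_k and gradient-difference vectors y_k are kept in host memory, while the gradients themselves are produced on the GPU.

#include <vector>

// Classic L-BFGS two-loop recursion (illustrative sketch only).
// Given the current gradient g and a history of the last m steps
//   s_k = x_{k+1} - x_k  and  y_k = g_{k+1} - g_k,
// it returns the search direction d ~ -H * g built from that history.
// Every operation is a dot product or an axpy over full-length vectors,
// which is why this part of the algorithm maps well onto the CPU.
std::vector<float> lbfgs_direction(const std::vector<float>& g,
                                   const std::vector<std::vector<float>>& s_hist,
                                   const std::vector<std::vector<float>>& y_hist) {
    const size_t m = s_hist.size();   // e.g. m = 6 in this paper
    const size_t n = g.size();

    auto dot = [n](const std::vector<float>& a, const std::vector<float>& b) {
        float r = 0.0f;
        for (size_t i = 0; i < n; ++i) r += a[i] * b[i];
        return r;
    };

    std::vector<float> q = g;
    if (m == 0) {                     // no history yet: fall back to -g
        for (size_t i = 0; i < n; ++i) q[i] = -q[i];
        return q;
    }

    std::vector<float> alpha(m), rho(m);

    // First loop: newest to oldest history entry.
    for (size_t k = m; k-- > 0; ) {
        rho[k] = 1.0f / dot(y_hist[k], s_hist[k]);
        alpha[k] = rho[k] * dot(s_hist[k], q);
        for (size_t i = 0; i < n; ++i) q[i] -= alpha[k] * y_hist[k][i];
    }

    // Scale by a diagonal initial Hessian approximation gamma * I.
    const float gamma = dot(s_hist[m - 1], y_hist[m - 1]) /
                        dot(y_hist[m - 1], y_hist[m - 1]);
    for (size_t i = 0; i < n; ++i) q[i] *= gamma;

    // Second loop: oldest to newest history entry.
    for (size_t k = 0; k < m; ++k) {
        const float beta = rho[k] * dot(y_hist[k], q);
        for (size_t i = 0; i < n; ++i) q[i] += (alpha[k] - beta) * s_hist[k][i];
    }

    for (size_t i = 0; i < n; ++i) q[i] = -q[i];   // descent direction
    return q;
}

Each pass over the history costs O(m*n) multiply-adds dominated by dot products, while the expensive gradient evaluation (the forward and backward propagation through the autoencoder) stays on the GPU.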
3 Evaluation and Results Analysis
Using the methodology stated in the previous section, we implemented MLP and the autoencoder (trained with the L-BFGS algorithm) on the two categories of heterogeneous platforms: integrated APUs and CPU plus GPU workstations. We pick competitive commodity processors with similar theoretical throughput in each category from AMD, Intel and Nvidia.
APUs: We evaluated the latest available APU processors, including AMD's APU A10-7850K (856 GFLOPS, TDP 95 W) and Intel's IvyBridge i7-4770K (848 GFLOPS, TDP 84 W).
GPUs: We pick current mainstream consumer GPU cards, including the AMD HD7970 (3789 GFLOPS, TDP 250 W) and the Nvidia GTX780 (3977 GFLOPS, TDP 250 W). The GPUs work with an AMD 8-core CPU FX8320.

Languages and libraries: CUDA C and CUBLAS are used in the Nvidia GPU implementations, and OpenCL and CLAMDBLAS for the AMD APUs and GPUs. The same OpenCL code runs on the Intel APU to make use of both the CPU and GPU. For the Intel CPU-only configuration we use a C++ implementation with the MKL library and multiple threads, but with hyperthreading disabled because we found that hyperthreading decreases the overall performance.

The evaluation examines the performance per unit of the training process and provides an apples-to-apples performance comparison across the selected platforms. To provide insight into the systems' power and energy efficiency, we also report realtime power consumption and the performance-per-watt metric. Performance per watt is an important metric for comparing which system provides the highest performance given the same amount of power. For the APU processors, we capture realtime power traces using an external ammeter. GPU power is supplied by both the power lane and PCIe, which makes it difficult to measure; thus we use the maximum allowed power (TDP) as the power for the GPUs.
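For clarity, the snippet below shows one way the performance-per-watt figures can be derived from the measured training time per unit and the power numbers used above (measured average power for the APUs, TDP for the GPUs). The normalization to training units per second is an illustrative assumption; the paper does not spell out the exact formula, and the sample numbers are hypothetical.

#include <cstdio>

// Performance per watt derived from training time per unit and device power.
// "Performance" is taken here as training units processed per second; the
// exact normalization is an assumption for illustration.
double perf_per_watt(double training_time_per_unit_ms, double power_watts) {
    const double units_per_second = 1000.0 / training_time_per_unit_ms;
    return units_per_second / power_watts;
}

int main() {
    // Hypothetical example: a 95 W APU that is 2x slower than a 250 W GPU
    // can still come out ahead on performance per watt.
    std::printf("APU: %.5f units/s per watt\n", perf_per_watt(2000.0, 95.0));
    std::printf("GPU: %.5f units/s per watt\n", perf_per_watt(1000.0, 250.0));
    return 0;
}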
[Figure: Training time per unit (ms) on each evaluated platform, including the A10-7850K and i7-4770.]