
Spark Based Distributed Deep Learning Framework for Big Data Applications

Akhmedov Khumoyun

Yun Cui, Lee Hanku

Department of Internet & Multimedia Engineering, Konkuk University, Seoul, Korea
[email protected]

Department of Internet & Multimedia Engineering, Konkuk University, Seoul, Korea
[email protected], [email protected]

Abstract— Deep Learning architectures, such as deep neural networks, are currently among the hottest emerging areas of data science, especially for Big Data. Deep Learning can be effectively exploited to address major issues of Big Data, including extracting complex patterns from huge volumes of data, fast information retrieval, data classification, and semantic indexing. In this work, we designed and implemented a framework to train deep neural networks using Spark, a fast and general data-flow engine for large-scale data processing. The design is similar to Google's software framework DistBelief, which can utilize computing clusters with thousands of machines to train large-scale deep networks. Training Deep Learning models requires extensive data and computation. Our proposed framework accelerates training by distributing model replicas, trained via stochastic gradient descent, among cluster nodes over data residing on HDFS.

Keywords— Distributed Computing; Deep Learning; Big Data; Spark; HDFS.

I. INTRODUCTION

Distributed and parallel processing of very large datasets has been practiced for years in specialized areas such as e-commerce applications. Recently there has been significant progress in the usability, cost effectiveness, and diversity of parallel computing platforms, and their popularity is rising for a broad set of data analysis and machine learning tasks, in particular Deep Learning. A number of systems, such as Spark, have made it convenient to implement concurrent processing of data instances or their features. This allows fairly uncomplicated parallelization of many learning algorithms that treat the input as an unordered batch of data points and aggregate fast, local computations. The great attention to large-scale Deep Learning is also due to the spread of very large datasets across many modern applications. These datasets are often assembled on distributed storage platforms, driving the development of learning algorithms that can be distributed easily. Furthermore, the dissemination of sensor devices that conduct real-time inference based on high-dimensional, complex feature representations drives additional demand for exploiting parallelism in learning-centric applications. Examples of this trend include image classification, fraud detection, and visual object detection becoming common in many fields of computer science.


Extracting the inner features of raw data, i.e., feature engineering, is a principal element of machine learning. This step consumes a large portion of the effort in a machine learning project, is mostly domain specific, and requires significant human involvement. Conducting feature engineering in a more automated and general way would be a big advantage for machine learning, since it would enable developers to extract features without much manual effort. One of the promising areas of research into the automated extraction of complex data features is Deep Learning. Such models involve a layered, hierarchical architecture for learning and representing data, where higher-level features are defined in terms of lower-level features. Deep Learning models are well suited to learning from huge amounts of raw data and can infer hidden data features in an efficient way. Deep Learning solutions have demonstrated prominent results in a number of machine learning applications, including speech recognition, computer vision, and natural language processing.

The plethora of distributed system options provides a number of platforms on which Deep Learning algorithms can achieve productive gains and handle very large datasets. These platforms include custom processing units (e.g., general-purpose Graphics Processing Units, GPUs), general data-flow engines (Hadoop, Spark), multiprocessor and multicore parallelism, High-Performance Computing (HPC) clusters connected by fast local networks, and data-centre-scale virtual clusters provided by commercial cloud computing providers. In this work, we propose the design and implementation of a method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Learning models. The proposed system was implemented on top of Spark, a fast and general data-flow engine that runs in a cluster environment. Many applications can be run with the proposed system against large datasets such as Twitter, ImageNet, Wikipedia, and others.

The remainder of the paper is structured as follows: in the second section we review several similar large-scale Deep Learning systems, mainly production-ready ones. In the third section, we present our proposed system, the Spark based Distributed Deep Learning Framework for Big Data Applications, its overall architecture, main components, and the system workflow. Then, in the fourth section we discuss some Big Data applications of Deep Learning, such as information retrieval and semantic indexing,

discriminative tasks, and finally sentiment analysis, all of which can be built with our system. Then, in the fifth section we present experiments and results obtained by running an example application, sentiment classification with Deep Learning models, on top of our system. Finally, in the sixth section, we conclude with a short restatement of the proposed system and future directions for its enhancement.

II. RELATED WORK

Deep Learning has shown great promise in many practical applications in which unsupervised feature learning is automated. Compelling performance has been reported in several domains, ranging from speech recognition [1] and visual object recognition [2, 3] to text processing [4]. Empirical results have shown the strong performance of large-scale models, with a special focus on models with a very large number of parameters, which are capable of extracting more complex features and representations. Hence, there is a natural need for systems that can handle very large data efficiently. Of course, we are not the first to propose such a framework for Deep Learning. A few systems have been introduced to exploit distributed environments for computation over Deep Learning models, such as the DistBelief framework by leading software architects at Google, SparkNet, a relatively new system by AMPLab developers, and a MapReduce-based Deep Learning system. In this section, we briefly look at some key features and design decisions of these systems.

A. DistBelief Framework

Google considered the problem of training a deep neural network with billions of parameters using thousands of commodity servers, in the areas of speech recognition and computer vision. A software framework, DistBelief, was developed that could exploit computing clusters with thousands of machines to train large-scale models. DistBelief supports model parallelism within a node (via multithreading) and across nodes (via message passing), with the framework itself managing parallelism, synchronization, and communication [5]. Within the framework, two new techniques for large-scale distributed training were designed and implemented: (1) Downpour SGD, an asynchronous stochastic gradient descent procedure that uses adaptive learning rates and supports a large number of model replicas, and (2) Sandblaster L-BFGS, a distributed implementation of L-BFGS that uses both data and model parallelism. Both methods exploit the concept of a centralized, partitioned parameter server, which model replicas use to share their parameters, and both take advantage of the distributed computation DistBelief allows within each individual replica. The main win, however, is that both techniques are designed to tolerate variance in the execution speed of different model replicas, and even the total failure of model replicas, which might be taken offline or restarted at random. In short, these two optimization algorithms implement an intelligent version of data parallelism: both approaches enable simultaneous execution of distinct training examples in each of the many model replicas and periodically gather their results to optimize the model's objective function.
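To make the Downpour-style update cycle concrete, the following is a minimal sketch of the asynchronous fetch/push interaction between a model replica and a parameter shard. All names here (DownpourShard, ModelReplica, computeGradient) are illustrative assumptions for this sketch; DistBelief's actual interfaces are not public.

```scala
// Minimal sketch of a Downpour-style asynchronous SGD cycle.
// All names are illustrative; DistBelief's real interfaces are not public.
class DownpourShard(var weights: Array[Double], learningRate: Double) {

  // Replicas fetch the latest (possibly slightly stale) weights
  // before processing a mini-batch.
  def fetch(): Array[Double] = synchronized { weights.clone() }

  // Replicas push gradients back asynchronously; updates from slow or
  // failed replicas may arrive late or never, which Downpour tolerates.
  def push(gradient: Array[Double]): Unit = synchronized {
    var i = 0
    while (i < weights.length) {
      weights(i) -= learningRate * gradient(i)
      i += 1
    }
  }
}

class ModelReplica(shard: DownpourShard) {
  def step(miniBatch: Seq[(Array[Double], Double)]): Unit = {
    val w = shard.fetch()                 // no global synchronization barrier
    val g = computeGradient(w, miniBatch) // local forward + backward pass
    shard.push(g)                         // asynchronous update
  }

  // Placeholder: gradient of the model's loss over the mini-batch.
  private def computeGradient(w: Array[Double],
                              batch: Seq[(Array[Double], Double)]): Array[Double] =
    Array.fill(w.length)(0.0)
}
```

The key design point the sketch illustrates is that fetch and push are decoupled: no replica ever waits on another, so stragglers and failures degrade convergence only gracefully.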

B. SparkNet

Distributed data processing frameworks such as Hadoop and Spark have gained widespread acceptance and success over the past decade. The original MapReduce paradigm and its spin-off processing algorithms, derivative technologies, and support tools have become a staple of modern data science and analytics, and show no sign of releasing their grip. This is where SparkNet emerged, a "scalable, distributed algorithm for training deep networks." The authors point out that it lends itself to frameworks such as Spark and works effectively in limited-bandwidth environments out of the box. SparkNet is built on top of Spark and Caffe, and was originally introduced by Moritz, Nishihara, Stoica, and Jordan (2015) [6]. Along with the core concept of a scalable, distributed deep neural network training algorithm, SparkNet includes an interface for reading from Spark's data abstraction, the Resilient Distributed Dataset (RDD), a Scala interface for interacting with the Caffe Deep Learning framework (which is written in C++), and a lightweight tensor library. Perhaps most importantly, the developers claim there are numerous additional benefits to integrating Deep Learning into an existing data processing pipeline: a framework such as Spark allows cleaning, pre-processing, and other data-related tasks to be handled within a single system, and datasets can be kept in memory for the entire processing pipeline, eliminating expensive disk writes.

C. Large-scale MapReduce-based Deep Learning

The main focus of this project is to integrate Deep Learning algorithms with a cloud computing system to handle huge-scale data. The project implements a Deep Learning method to train on the input data, with the MapReduce programming model used to parallelize the execution. A MapReduce job consists of mapper and reducer procedures: the mapper extracts key/value pairs from the input data and emits a list of intermediate key/value pairs, and the reducer combines the intermediate values that share a key to generate output values. In the machine learning setting, the input value is the data from a real-world object that the machine is intended to "learn". The model parameters play the role of the intermediate key/value pairs, allowing the machine to determine whether it has recognized the object properly. The reducer exploits these weights to calculate the machine's "recognition" of a given object [7]. If the difference between the accuracy on the training data and the expected accuracy is not yet acceptable, the MapReduce job loops until an acceptable result is generated. The key issue in the project is training the massive amount of data. Although Deep Learning learns from representations, it cannot exclude the redundancy and noise ingrained in the raw data, and this greatly influences the machine learning algorithm [8]. Hence the most straightforward inference is that the MapReduce job may loop many times, making it difficult to ensure a tolerable accuracy, or it may even fail to generate an output. It is therefore essential to introduce a mechanism that enhances the performance of the neural network; one approach is to delete the offending items by exploiting a diversity-based data sampling technique.
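The mapper/reducer division of labour described above translates naturally into a map-then-aggregate pattern. Below is a hedged sketch of one such synchronous training iteration, rendered in Spark terms rather than classic Hadoop MapReduce; it is our assumed rendering of the pattern, not the cited project's actual code, and localGradient is a placeholder for a real backpropagation step.

```scala
import org.apache.spark.rdd.RDD

object MapReduceSgd {
  // Placeholder per-example gradient; a real model would backpropagate here.
  def localGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] =
    Array.fill(w.length)(0.0)

  // One synchronous iteration: the map phase computes per-example
  // gradients, the reduce phase sums them element-wise, and the driver
  // applies the averaged update before the next loop iteration.
  def trainIteration(data: RDD[(Array[Double], Double)],
                     weights: Array[Double],
                     learningRate: Double): Array[Double] = {
    val n = data.count().toDouble
    val gradSum = data
      .map { case (features, label) => localGradient(weights, features, label) }
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    weights.zip(gradSum).map { case (w, g) => w - learningRate * g / n }
  }
}
```

Note how this pattern is fully synchronous: every iteration waits for all partitions, which is exactly the barrier that the asynchronous parameter-server designs above avoid.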

III. SPARK BASED DISTRIBUTED DEEP LEARNING FRAMEWORK FOR BIG DATA APPLICATIONS (SDDLF)

In this section, we present the overall architecture and workflow of the proposed system, the Spark based Distributed Deep Learning Framework for Big Data Applications (SDDLF). First, we look at the overall architecture along with design and implementation challenges, then we give a detailed overview of the proposed system's key components, namely the Master, the Parameter Server, and the Data Shards, and finally we show the workflow of the proposed system.

A. Overall Architecture

The system is designed for training deep networks using the Stochastic Gradient Descent optimization algorithm with the help of the distributed capabilities of Apache Spark. The system consists of several distributed components, namely the Master, the Parameter Server, and the Data Shards (Figure 1). We defined these services on top of Apache Spark, a general-purpose distributed data processing engine [9]. Spark has two main components: the Spark Driver, which is responsible for the maintenance, coordination, and scheduling of applications on the cluster, and the Worker Nodes, which are responsible for running applications. Each Worker Node spawns several executors, processes launched for an application on that node, which run tasks and keep data in memory or on disk; each application has its own executors. We represent each data shard as an HDFS data node that holds part of the whole dataset. Our system's Master plays the role of Spark's Driver Program and starts the neural network training process by initializing the parameters in the Parameter Server and the neural network layers, all of which reside on HDFS. Each data shard spawns neural network layers according to the network layer size and is responsible for computing the forward and backward passes for each data example in its partition until it has processed all examples.
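As a rough illustration of how a data shard maps onto Spark, the sketch below runs a per-partition training loop over an HDFS-backed RDD. The Replica class, its methods, and the HDFS path are illustrative stubs for this sketch, not the framework's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative stub for a per-shard model replica; the real replica
// would hold the layer structure and talk to the parameter server.
class Replica extends Serializable {
  def forward(example: String): Unit = ()  // forward pass (stub)
  def backward(): Unit = ()                // backward pass (stub)
}

object ShardTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sddlf-sketch"))

    // One RDD partition per HDFS block: each partition acts as a data shard.
    val shards = sc.textFile("hdfs:///data/training")

    shards.foreachPartition { examples =>
      val replica = new Replica() // local model replica for this shard
      examples.foreach { line =>
        replica.forward(line)     // forward pass on one example
        replica.backward()        // backward pass; gradients would be pushed
      }                           // to the parameter server here
    }
  }
}
```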

Figure 1. Overall Architecture of SDDLF

After each iteration, each shard updates the master model by sending its newly computed gradients to the master model in the Parameter Server. Our method for computing gradients exploits both data and model parallelism. Data parallelism: the training data is partitioned across several machines in the Spark cluster, each having its own replica of the model, and each model trains on its data shard (partition) in parallel. Model parallelism: the layers of each deep network model replica are

also distributed across the Spark cluster. The number of parameter shards for each model depends on the network size: for example, if a data shard's model consists of 5 layers, there are 4 parameter shards for that data shard.

B. Main Components

Our system consists of 3 main components, namely the Master, the Parameter Server, and the Data Shards. In the following paragraphs, we briefly describe each component and its functionality.

• The Master, the Spark Driver, is responsible for the maintenance, coordination, and scheduling of the training process on the cluster. The training process in terms of Spark Master-Worker communication is shown in Figure 2. Note that the interaction is iterative, as opposed to MapReduce, which is based on discrete Map and Reduce phases.

• The Parameter Server is partitioned across machines, with each partition corresponding to a layer of the neural network model. When a parameter shard is first created, it is given a unique shardId, a learning rate for the update step, and random initial weights by the Master. When a model replica layer requests the latest parameter values, the parameter shard sends them back to the replica layer wrapped in a LatestParameters message. If a gradient message is received, the parameter shard uses the gradient to update its parameters.

• The Data Shard does several things. First, since each data shard worker has its own model replica, the data shard worker creates the layers for its replica. Once the models are created, the data shard worker waits to receive the ReadyToProcess message; upon receiving data points, a FetchParameters message is sent to each layer in the replica, telling it to retrieve the latest version of its weight parameters from the corresponding parameter shard. At this point, the worker waits until each of its layers has successfully updated its parameters. Once this happens, the worker sends the first data point to the input layer of its replica for processing. When the backpropagation process has finished for that data point, the worker again receives the ReadyToProcess message and the process repeats. Once the worker has processed all of its data points, it is done and stops itself.

C. System Workflow

In our system, distributed SGD (Stochastic Gradient Descent), which is responsible for training the model, works as follows. The data partitions each train their own model replica with their partition of the data (i.e., one model replica per data shard). Furthermore, each model replica is partitioned across machines by layer. The weights for the model are kept in a central parameter server, which is also partitioned across multiple machines. Each parameter partition keeps the parameters for one layer of the model (for example, if the

model replicas have 3 layers, then the parameter server has 2 partitions: one for the weights from layer 1 to layer 2 and one for the weights from layer 2 to layer 3). Hence, as the model replicas are trained in parallel, they asynchronously read (in the forward pass) and update (in the backward pass) their corresponding weight parameters. This implies that the weight parameters a model replica layer reads might have previously been updated by the same layer of a different model replica.
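The following is a minimal sketch of the parameter-shard protocol just described. The message names follow the text; the transport layer (e.g., an actor system) and the exact types are our assumptions, not published interfaces of the system.

```scala
// Minimal sketch of the parameter-shard protocol described above.
// Message names follow the text; transport and types are assumptions.
sealed trait Message
case class FetchParameters(layerId: Int)            extends Message
case class LatestParameters(weights: Array[Double]) extends Message
case class Gradient(values: Array[Double])          extends Message
case object ReadyToProcess                          extends Message

class ParameterShard(val shardId: Int,
                     val learningRate: Double,
                     private var weights: Array[Double]) {

  def handle(msg: Message): Option[Message] = msg match {
    // A replica layer asks for the latest weights of its layer.
    case FetchParameters(_) =>
      Some(LatestParameters(weights.clone()))

    // A replica layer pushes a gradient: apply the SGD step in place.
    // Updates arrive asynchronously from many replicas, so a later read
    // may observe weights already modified by another replica's layer.
    case Gradient(g) =>
      var i = 0
      while (i < weights.length) { weights(i) -= learningRate * g(i); i += 1 }
      None

    case _ => None
  }
}
```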

Figure 2. System Workflow

This process can be more easily understood from the workflow in Figure 2. First, the client sends a Start message telling the Master to begin the training process; the Master in turn sends a ReadyToProcess message to each data shard. The initialization of parameters was described earlier.

IV. BIG DATA APPLICATIONS

In this section, we present some Big Data applications in which Deep Learning can be exploited to obtain better results than traditional methods.

A. Information Retrieval and Semantic Indexing

One of the prominent areas of Big Data Analytics is information retrieval [10]. Effective storage and retrieval of information is a growing problem in Big Data, essentially because extremely large amounts of data such as text, image, video, and audio are being gathered and processed for various purposes across various domains, for instance social networks, web crawling, data aggregation, fraud detection, security systems, online shopping, and internet traffic monitoring. Conventional approaches to information storage and retrieval face challenges with the large amounts of data and the varied data sources, both associated with Big Data. On these platforms, large amounts of data can be processed with semantic indexing rather than being stored as raw data. Semantic indexing represents the data in a more efficient fashion and makes it useful as a source for information discovery and understanding, for instance by enabling search engines to operate faster.

B. Discriminative Tasks

When conducting discriminative tasks in Big Data Analytics, one can exploit Deep Learning models to obtain complex nonlinear features from the raw data, and then apply simple linear methods to execute the discriminative tasks using the acquired features as input [11]. Obtaining nonlinear features from huge datasets enables data scientists to make the knowledge in the vast amounts of data available and useful, by feeding the obtained representations to simpler linear methods for further investigation. A key benefit of exploiting Deep Learning in Big Data Analytics is that it creates a proper environment for practitioners and analysts to perform complex AI-related operations, such as image comprehension and object recognition in images, using simple models. Hence discriminative tasks become adequately manageable in Big Data Analytics with the help of Deep Learning techniques.

C. Sentiment Analysis

With the rise of social media such as blogs and social networks, reviews, ratings, and recommendations are rapidly proliferating; being able to automatically filter them is a current key challenge for businesses looking to sell their wares and identify new market opportunities. This has created a surge of research in sentiment classification (or sentiment analysis), which aims to determine the judgment of a writer with respect to a given topic based on a given textual comment. Sentiment analysis is now a mature machine learning research topic [12].

V. EXPERIMENTS AND RESULTS

So far, we have presented Deep Learning, its applications in Big Data Analytics, and our proposed framework. In this section, we present the experiments and results of a Twitter sentiment analysis application designed on top of our system. The results show that the proposed system is efficient enough to build Deep Learning applications.

Figure 3. Error rate vs. number of iterations for different numbers of nodes

We have collected a corpus of public tweets, roughly one year of data, containing more than 160 million tweets so far, and we are still collecting on a daily basis; the daily harvest is

around 1.2 million tweets. Twitter offers two APIs to retrieve data: the REST API and the Streaming API. For this service, we use the Streaming API, which is part of the Twitter Firehose and provides 1% of all tweets. After obtaining the data, we divided it into training and test datasets for the training and test phases, respectively, with a 70/30 split. The streamed tweets were initially very messy and unclean, so we performed pre-processing before the actual training. The training data was tokenized, then vectorization was performed using word2vec, and finally the output of word2vec was fed into the nonlinear deep learning model.
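A hedged sketch of this pre-processing pipeline in Spark is shown below. The tokenizer is a deliberately naive placeholder, the HDFS path is hypothetical, and the use of MLlib's Word2Vec with mean-pooled tweet vectors is our assumption about one reasonable rendering, not the paper's exact implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object TweetPreprocessing {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tweet-preprocess"))
    val tweets = sc.textFile("hdfs:///data/tweets") // hypothetical path

    // Naive tokenization; real pre-processing would also strip URLs,
    // mentions, hashtags, and other Twitter-specific noise.
    val tokenized = tweets.map(_.toLowerCase.split("\\s+").toSeq)
    tokenized.cache() // reused for Word2Vec training and feature extraction

    // Learn word embeddings with MLlib's Word2Vec.
    val w2v = new Word2Vec().setVectorSize(100).fit(tokenized)

    // Represent each tweet as the mean of its word vectors; words absent
    // from the learned vocabulary are skipped.
    val features = tokenized.map { words =>
      val vecs =
        words.flatMap(w => scala.util.Try(w2v.transform(w).toArray).toOption)
      if (vecs.isEmpty) Array.fill(100)(0.0)
      else vecs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
               .map(_ / vecs.size)
    }

    // 70/30 train/test split, as described above; these feature vectors
    // would then be fed to the deep model.
    val Array(train, test) = features.randomSplit(Array(0.7, 0.3))
  }
}
```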

Figure 4. Overall performance of the system

We measured the error rate of the model classifier; it showed a steady decrease with each iteration, approaching the convergence point, the threshold where the error rate is acceptable and good enough to be used on new, unseen data. From Figure 3, it can be seen that as the number of nodes increases, the convergence time decreases accordingly. In particular, with 10 nodes (the green line in the chart), the model reached the acceptable threshold after 200 iterations. We also achieved good performance in terms of run time: as depicted in Figure 4, as the number of nodes increases, the overall performance also increases. The runtime behaviour of the system also depends on other factors, such as JVM overhead, system processes, and network latency, but we did not report these statistics as their effect is negligible. Overall, we obtained satisfactory results from the proposed system; however, there is much room for further improvement and new features.

VI. CONCLUSION

The main goal of this work was to build a Distributed Deep Learning Framework targeted at Big Data applications. We implemented the proposed system on top of Apache Spark, a well-known general-purpose data processing engine; the deep network training of the proposed system relies on the well-known distributed Stochastic Gradient Descent method, namely Downpour SGD. The system can be

used in building highly scalable Big Data applications or integrated into a Big Data analytics pipeline, as it showed satisfactory performance in terms of both time and accuracy. However, there is much room for further enhancement and new features. The proposed system was implemented and tested in a cluster computing environment, and satisfactory results were obtained. As future work, we plan to add more Deep Learning algorithms to the system stack and tune the overall system so that it can be utilized in real-world Big Data applications.

ACKNOWLEDGMENT

This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (R0113-15-0008) and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the University Information Technology Research Center support program (IITP-2016-R2720-16-0004) supervised by the IITP (Institute for Information & communications Technology Promotion).

REFERENCES

[1] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[2] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep big simple neural nets excel on handwritten digit recognition," CoRR, 2010.
[3] A. Coates, H. Lee, and A. Y. Ng, "An analysis of single-layer networks in unsupervised feature learning," AISTATS 14, 2011.
[4] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, 3:1137–1155, 2003.
[5] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, and A. Y. Ng, "Large scale distributed deep networks," NIPS, 2012.
[6] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan, "SparkNet: Training deep networks in Spark," 2015.
[7] D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed computing for machine learning," Berkeley, 2006.
[8] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan, "SystemML: Declarative machine learning on MapReduce," in Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011, pp. 231–242.
[9] Apache Spark. Available: http://spark.apache.org/
[10] National Research Council, Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC, 2013.
[11] W. Yang, J. Boyd-Graber, and P. Resnik, "A discriminative topic model using document network structure," Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 686–696, Berlin, Germany, August 7-12, 2016.
[12] V. A. Kharde and S. S. Sonawane, "Sentiment analysis of Twitter data: A survey of techniques," International Journal of Computer Applications (0975–8887), Vol. 139, No. 11, April 2016.