Deep Learning with GPU Technology for Image & Feature Recognition Alison B. Lowndes
Submitted in accordance with the requirements for the degree of Bachelor of Science in Artificial Intelligence 2012/2015
Empirical Study © 2015 The University of Leeds and Alison B. Lowndes
Summary

This is an empirical study of the use of deep learning (DL) neural networks, powered by graphics processing units (GPUs), to recognise features in images. The report is aimed at fellow students and researchers, to help them run convolutional neural networks (CNNs) and understand other techniques within the field. My hope is that the University of Leeds will invest in GPUs after I demonstrate that they, along with massive datasets and efficient algorithms, are powering the DL renaissance.

Cognitive computing systems, such as IBM's Watson, allow reasoning via probabilistic natural language processing, supporting more natural interaction with computers than ever before. The pace of progress in cognitive deep learning is so great that NVIDIA is now shipping a plug-and-play CNN "appliance" for the academic and developer community, integrating 4 GPUs and their DIGITS system, a powerful visualisation and configuration tool for neural networks, which I also demonstrate in this report.

On January 3, 2015 China's Baidu announced they had beaten 'GoogLeNet', the DL system that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC2014). They achieved a 5.98% error rate, largely through brute force via GPU. On February 6, Microsoft announced they had achieved 4.94%, beating the "human reference" of Andrej Karpathy of Stanford, who trained and tested himself on ImageNet. On March 2, Google announced a 4.82% error. The race is fuelled by GPU power.

The Swiss AI Laboratory (IDSIA) won the medical imaging ICPR 2012 competition and the MICCAI 2013 Grand Challenge on mitosis detection, training their multi-CNN system with images labelled by expert pathologists. Despite the subtlety of the task, DL systems can learn to recognise the tell-tale signs of cancer and flag them for further human investigation. The IDSIA DL system could soon become available on a mobile phone.

All major research labs are using ensembles of recurrent (RNN) or convolutional (CNN) networks with custom-optimized NVIDIA CUDA code. I have been tutored, directly or indirectly, by many of the top researchers over the past few months, using many new tools and testing GPUs on various platforms. Having full access to open-sourced codesets from major research laboratories is a huge advantage; GPUs and open-sourcing are both very powerful human-engineered tools. I have been supported in this project by Facebook and by NVIDIA, who granted me a $4,000 Tesla K40 GPU for testing and remote use of 2 x 4,992-core K80s.
Acknowledgements

My huge thanks go to Graham Hardman in Faculty IT for helping me with the software tweaks and setting up the new GPU on a racked server, 'csfyp01', in the machine room. Thank you also to Jack Watts and the NVIDIA Corporation for all their support: providing the Tesla K40 and setting up a test drive of the K80 with Boston Limited (UK). They will shortly be running a workshop at Leeds with the ARC team. Special thanks to Soumith Chintala, AI Researcher at Facebook, who refactored the fbcunn code, allowing me to run more easily on the test drive with Boston. It was actually Yann LeCun who suggested Soumith do this, so I am very honoured. Codeset at: https://github.com/soumith/imagenet-multiGPU.torch. Thanks also to Andrej Karpathy of Stanford University and Dr Fei-Fei Li, Director of Stanford's AI Laboratory, for open-sourcing their course on Convolutional Neural Networks, and to John Owens and NVIDIA Research's David Luebke for the excellent Udacity online course on GPU programming. Huge thanks sincerely go to my previous personal tutor in the School of Physics and Astronomy, Dr Rene Oudmaijer, without whom I would never have won the battle with Student Finance to get funding for another 3 years of study, and also to Mark Walkley for 'letting me in' despite knowing this was my second choice. Thank you finally to Derek Magee, my first of 3 personal tutors in this degree course, for the many, many references he wrote me and advice he gave, even when he wasn't my personal tutor, and to my supervisor, Dr Marc de Kamps, for all the hard questions he asked that pushed me to the completion of this report. Lastly, Matt Zeiler of NYU has founded Clarifai, to bring CNNs to market - try it.
Contents
Table of Figures
1. Introduction
1.1 A History of Deep Learning
1.2 The current leader in deep learning for image recognition
1.3 A brief look into the future
1.4 Neuromorphic chips
2.0 Objectives
2.1 Personal Statement
3.0 Image recognition
4.0 Neural Nets
4.1 Perceptrons
4.2 Multilayer perceptron (MLP)
4.3 Convolutional Neural Networks (CNN)
4.4 Backpropagation, steepest gradient descent and momentum
4.5 Recurrent Neural Networks (RNN)
4.6 Sparse Autoencoders
4.7 Long short term memory (LSTM)
4.8 Machine Learning applications
4.8.1 Microsoft Medical Image Analysis
4.8.2 RCNN
4.8.3 Neural Image Caption (NIC) Generator
4.8.4 Neural Turing Machine (NTM)
4.8.5 Face patches
5.0 The GPU's role in deep learning
6.0 Test Environments & frameworks
6.1 Terminal
6.2 Theano
6.2.1 Theano v Torch
6.3 Caffe and DIGITS
6.4 ConvNet
6.5 NVIDIA GRID
6.6 Leeds ARC2
6.7 Leeds cloud testbed
7.0 Virtual Pathology
7.1 Virtual Heart
7.2 Virtual Pathology at Leeds
7.3 Deep Learning in healthcare
8.0 GPGPU
9.0 Implementation
9.1 General information for testing and implementation
9.2 fbcuNN
9.3 Torch 7 in detail
9.4 Internal data representation
10.0 NVIDIA CUDA
10.1 Fast Fourier Transform
10.2 Convolution (cuFFT)
10.3 AlexNet
10.4 Available Image Datasets
11 Test Results
12.0 Conclusion
12.1 Future work
References
Appendix A: External Materials Used
Appendix B: How ethical issues are addressed
Appendix C: NVIDIA quotation for GPUs
Appendix D: Example of a neural network implementation in LuaJIT
Appendix E: Installation instructions for Torch 7
Appendix F: Basics of CUDA GPU Programming
Appendix G: Permission for full image access to ImageNet
Appendix H: Partial history of deep learning from the research of Geoffrey Hinton
Appendix J: Image Processing on the SVHN dataset in Torch/Lua
Appendix K: Test results with fbcunn on ImageNet
Appendix L: Test Results with the MICCAI Grand Challenge medical dataset
Appendix M: Screenshot of live demo
Appendix N: Reverse engineering the neocortex: implications for machine intelligence
Appendix P: DIGITS by NVIDIA
Appendix Q: Timekeeping
Appendix R: Available datasets
Table of Figures

Figure 1 (page 11): One image from IDSIA applying a CNN to histology images
Figure 2 (page 18): Neuromorphic chip designed by Heidelberg University
Figure 3 (page 23): Conversion of an image to digital
Figure 4 (page 26): The LeNet CNN
Figure 5 (page 27): Deconvoluted features picked out at 2 layers of LeNet on images
Figure 6 (page 28): 11-layer architecture for network used by Ciresan et al
Figure 7 (page 29): Supervised Deep Learning image convolution
Figure 8 (page 29): Image from testing on MNIST
Figure 9 (page 31): Gradient, update, momentum
Figure 10 (page 32): Learning rates, Li et al
Figure 11 (page 36): The R-CNN by Girshick et al
Figure 12 (page 36): The memory block contains the core of the LSTM model
Figure 13 (page 37): The Neural Turing Machine
Figure 14 (page 39): Batch size v speedup
Figure 15 (page 48): IEEE Spectrum Virtual Heart
Figure 16 (page 52): How GPU Acceleration works
Figure 17 (page 54): Data v Model parallelism
Figure 18 (page 56): Tesla K80, NVIDIA
Figure 19 (page 60): Schematic representation of a deep neural network
Figure 20 (page 64): Comparison of convolution and correlation
Figure 21 (page 66): Decimation in time decomposition & frequency
Figure 22 (page 66): Results using an 11 x 11 & a 13 x 13 kernel
Figure 23 (page 67): AlexNet layered model of convolution, subsampling/pooling followed by a classifier
Figure 24 (page 75): Basic recipe for machine learning
Figure 24a (page 77): Neural modularity, EvolvingAI.org
Figure 25 (App. J): Accuracy results for a CNN on SVHN
Figure 26 (App. J): Sample of y-channel data from SVHN dataset. Credit Torch.ch
Figure 27 (App. K): Soumith Chintala benchmarks for Torch
Figure 28 (App. L): Summary of MICCAI dataset content
Figure 29 (App. L): Histology images from IDSIA 2013 CNN deep learning run
Figure 30 (App. L): Closeup of one of the images from IDSIA
Figure 31 (App. M): Screenshot of visualisation of a live CNN implementation
Figure 32 (App. N): Cortical facts. Credit J Hawkins, Numenta
Figure 33 (App. N): The HTM Neuron model
Figure 34 (App. N): Simulation where red cubes are sparse active neurons in the grid
Figure 35 (App. N): Active and polarised (yellow) neurons and the multiple patterns formed, processed by the HTM algorithm
Figure 36 (App. P): DIGITS roadmap for development
Figure 37 (App. P): DIGITS visualisation of first layer of a CNN (conv + pool)
Figures 38-41 (App. P): Visualisation from within AlexNet via DIGITS
Figure 42 (App. Q): Timekeeping
1. Introduction

The first computing era was that of tabulating machines; the second, of programmable systems. We are now witnessing the third era, of cognitive augmented reasoning. The co-evolution of technology and biology is allowing us to quantify and personalise healthcare. Initiatives such as the Department of Health's Personalised Health and Care 2020 Framework¹ recognise that cognitive systems and humans together are more powerful than either alone. DL is able to assist tirelessly, powering through vast amounts of structured or unstructured data, reconciling ambiguity and learning patterns we could never see before. Machine intelligence and deep learning are being combined in many fascinating ways as we learn more about 'how we learn'. The human neocortex, for example, does not receive five different senses (sight, sound, touch, taste, smell). It simply receives a very rapid, highly parallel, high-velocity datastream of pattern representations, which it then acts on in another very rapid, highly parallel, high-velocity motorstream. This is further explained in Appendix N, with details of several commercial applications already in existence from Jeff Hawkins' Numenta².
The Swiss Artificial Intelligence Laboratory (IDSIA) is directed by DL pioneer Jürgen Schmidhuber. In 2010 IDSIA's GPU-based deep learners attained a 0.2% error rate on the MNIST dataset. They went on to win competitions such as the 2011 traffic sign recognition challenge, where they attained a 0.56% error rate, beating human performance of 1.16%, a significant and influential result for the self-driving car industry. Regarding health, IDSIA also won the 2012 ISBI brain image segmentation competition and visual object detection competitions in cancer cell image scans (2012, 2013) [ 52 ].
¹ https://www.gov.uk/government/publications/personalisedhealthandcare2020
² www.Numenta.com, California
Fig. 1 Image from MICCAI dataset, larger image at Appendix L [ 60 ].
My wish is that insights from this report will aid the histopathology work being automated and augmented by the Virtual Pathology Group at Leeds, who are currently using only CPUs to train feature detection algorithms (random forests) on expert-classified tumour:stroma images [ 43 ], for use on the massive numbers of images coming from St James' Teaching Hospital. CNNs could help their work massively and are now in wide use for some of the biggest "search" problems. Google won the 2014 ImageNet³ Large Scale Visual Recognition Challenge (ILSVRC) with their DL system 'GoogLeNet' (named after LeNet, Yann LeCun's influential convolutional network). Facebook then hired LeCun to run their new AI research laboratory. Karpathy of Stanford demonstrated a linked CNN and RNN live at the GPU Technology Conference, March 2015, that could describe a picture with a phrase. Jen-Hsun Huang, CEO & President of NVIDIA, dedicated his entire opening keynote to deep learning, as did Jeff Dean, Senior Fellow at Google, and Andrew Ng, Chief Scientist at Baidu. DL's impact on the world at large is now at the forefront of tech.

Medical applications require much less detail for feature recognition, classifying with a simple binary: presence or absence of an anomaly. They are effective on far smaller datasets than ILSVRC2012, and deep learners need only the dimensions and intensity values of the detected feature. Processing of medical images for augmented healthcare is now achieving commercially viable results, with NVIDIA's latest GPU-powered technology making access affordable for those that need it most, like the NHS.
³ ImageNet uses Princeton's WordNet hierarchy of 117,000 synonym sets or "synsets", providing ~1,000 labelled images per synset (1,431,167 images in 2012). http://image-net.org - http://wordnet.princeton.edu/
As part of my research I learnt the basics of the entire process of tissue sample preparation, digital pathology and mitosis counting, in order to run deep learning tests on the MICCAI Grand Challenge dataset. Section 12 and Appendix N further detail current and future work on
neocortex-inspired algorithms that require no training or tweaking of parameters, with strong arguments that such cortical algorithms are the way to true machine intelligence.
1.1 A History of Deep Learning

Over 50 years ago Oliver Selfridge, grandson of the founder of the famous chain of stores, published a paper on a machine learning system called "Pandemonium" (1958), describing multiple layers of feature detectors, each activated by the hidden layer below. Selfridge, though born in England, moved to the Massachusetts Institute of Technology (MIT) where, amongst other achievements, he was a supervisor to Marvin Minsky, the American cognitive scientist who co-founded MIT's AI laboratory. Selfridge organized the first public meetings on AI with Minsky, from 1955.

Perceptrons came about in the late 1950s and 60s, syntactic-symbolic systems in the 70s, expert systems in the 80s, and the second generation of neural nets with backpropagation in the mid-1980s (Rumelhart, Hinton et al, 1986). At this point focus shifted to Vapnik et al and their support vector machines (SVMs), with many researchers ignoring neural nets due to the SVMs' far superior results. The problem with backpropagation was that it required large amounts of labelled training data and, due to the lack of compute power, learning time did not scale well. Backpropagation was actually devised separately by Kelley (1960) and Bryson (1961), who were working on dynamic programming; Dreyfus (1962) then derived the same method with the chain rule. Linnainmaa wrote a Fortran program for backprop in 1970, but it was Werbos who first applied backprop to neural networks (1974), followed by many others including Parker (1985), LeCun (1985) and Rumelhart et al (1986). The technique became an effective way for neural networks to learn. LeCun used it to recognise handwritten digits, and his work led to its application within the US banking system for processing cheques. "Back propagation is by far the most widely used neural-network learning algorithm, and probably the most widely used learning algorithm of any form" [ 71 ]. That statement is still true 17 years later.
The first deep learning came from the Ukrainian "connectionist" Ivakhnenko in 1965. Ivakhnenko et al trained feed-forward neural networks (FFNN); neurons implemented Gabor filters, and layers were grown incrementally and trained by logistic regression. Ivakhnenko's paper of 1971 describes an architecture with 8 layers. Professor Kunihiko Fukushima (1980) proposed the Neocognitron, the first computational model of this kind applied to visual pattern recognition. Fukushima recognised that "when neurons with the same parameters are applied on patches of the previous layer at different locations, a form of translational invariance is obtained" [ 67 ]. LeCun et al continued to research this idea of human-pattern-recognition-based CNN systems.

Jürgen Schmidhuber, Director of IDSIA, had also been working with DL since 1987. His work helped to improve connected handwriting recognition, speech recognition, machine translation, optical character recognition, image caption generation [ 52 ] and medical feature recognition. In 1991 Schmidhuber's PhD student Hochreiter analysed the fundamental DL problem of vanishing and exploding gradients⁴; their purely supervised long short term memory (LSTM) RNN later overcame it. Prior techniques used unsupervised pre-training of a hierarchy of RNNs to predict the next input. Schmidhuber's research is distinct, in the use of this technique, from that of LeCun, Hinton and other senior researchers.

In 1997 LeCun et al published a paper detailing graph transformer networks (GTN), designed specifically for reading and recognizing bank cheque characters with record accuracy. The heart of the system was LeNet-5, LeCun's CNN. At the time LeCun was working at the Speech and Image Processing Services Research Laboratory, AT&T Labs-Research, in New Jersey, US. The system is still deployed commercially to read millions of cheques per day within the US banking system. At the time LeCun also stated that the introduction of object-oriented programming was a factor in the success of LeNet.

Hinton et al's breakthrough at the University of Toronto in 2006, introducing Deep Belief Networks (DBN), coincided with NVIDIA introducing the CUDA parallel programming platform for GPUs. Both provided the pathway to vast improvement.
⁴ http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf (in German, but covered in English by Schmidhuber at http://people.idsia.ch/~juergen/firstdeeplearner.html)
Shallow networks were limited in the complexity of the functions they could compute (and limited by compute power and the availability of data to be trained on). "It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution. All three conditions are now satisfied" [ 69 ].

Hinton himself talks [ 3 ] about how he, LeCun and Bengio were all struggling, around 2005, with neural net applications. They were working on unsupervised layers of features, with features from one layer being used as input to another. It was not until Bengio's paper led to Hinton and his PhD student Alex Krizhevsky designing the AlexNet model for use on speech in 2009 that the field exploded. NVIDIA's CEO described the "Big Bang" in 2012, when AlexNet was put onto Android phones and Krizhevsky, Hinton and Sutskever were all hired by Google. LeCun's group literally didn't have a student working on this at the time, so Hinton et al took LeCun's techniques, added more computer vision techniques and gained massive improvement. Their system ran with GPUs and beat everyone else in ILSVRC2012, achieving 84% accuracy. Stanford actually used Amazon's Mechanical Turk to pay people to label the 1.2 million training images on ImageNet used in the competition and in this report.

The interesting thing at the time was that a joint 2011 paper on a system learning 100x quicker than any other was actually rejected by CVPR, "the premier annual Computer Vision conference", because reviewers said "we don't understand how it works" [ 3 ]. The field of deep learning was still not widely accepted at this time. LeCun gave a talk at the Frontiers in Computer Vision conference, 2011, titled "5 years from now, everyone will learn their features (you might as well start now)". He was wrong: "It only took 3 years" [ 3 ], LeCun stated in interview. There is now a greater understanding, but the situation forced an open-sourcing of the peer review process, captained by LeCun. With the overwhelming results from ILSVRC2012, Hinton, LeCun and Bengio proved that sticking to their beliefs, going against convention and risking failure were all worthwhile. Between 2008 and 2010 LeCun's students were finding it difficult to get jobs after graduating, due to the recession but also because deep learning itself was frowned upon. Since then "they were the only people who had any knowledge and are all millionaires now", LeCun [ 3 ].
The following are excerpts from an audio interview recorded by www.thetalkingmachines.com at NIPS, 2014 with the "pillars of deep learning", Geoffrey Hinton, Yoshua Bengio and Yann LeCun [ 3 ]:
● With better hardware, more data and slight improvements in techniques (unsupervised learning) we made progress (Hinton).
● In early 2000 it was unsupervised learning that brought back neural nets (NN), with CIFAR. We all got together to reboot the interest. Commercial success is usually supervised, but efforts with unsupervised learning don't work well with small datasets. As the datasets grew, so did performance (LeCun).
● The importance of unsupervised techniques will come as humans realise the ease of using massively available unlabelled data (Bengio).
● In the late 90's it was difficult to work on NNs as papers were rejected (Bengio).
● In the mid 2000's, papers (with Bengio) were written to convince the community that NNs would be successful (LeCun).
● The future is unsupervised learning, especially in NLP (LeCun).
● Distributed representations (DR) of words - word embedding - an unsupervised system but with supervised learning algorithms (LeCun).
● DR - a large number of neurons each represent tiny aspects but as a whole define the object - different to symbols, which are binary (either are or are not) - DRs are "big patterns" - no rules needed, just a whole load of connection strengths, and one DR will cause another DR, e.g. translation. No symbols needed (Hinton).
● Embedding vector representations (words, images, people's interests) - the Word2Vec algorithm, Hinton's neural component analysis, Bengio's WSABIE (LeCun).
● Vectors are attributes that are learnt, e.g. a word, concept or image associated to it. People build semantic descriptions all the time - deep learning systems discover all the attributes they need to predict (Bengio).
● Composition - an essential concept - is at the heart of why deep learning works - so many configurations - exponential - so (with fast computers) DRs are so powerful. The extra layers allow more and more abstractions (Bengio).
● RNNs connect to themselves - work with just big DR vectors - Google using them now, state-of-the-art (Hinton).
● Leon Bottou - use vectors as components for a kind of algebra for reasoning. Involves reasoning and search with RNNs. Augmenting with memory - the Neural Turing Machine - associative memory + NN to maintain a state of the world and respond to questions about it (LeCun).

Further detail of the history of deep learning research can be seen in Appendix H, which lists a selection of approximately half of Hinton's published work since his PhD in 1978.
1.2 The current leader in deep learning for image recognition

The race for leadership is impossible to capture in a static report. Chinese search engine Baidu's 'Deep Image' model is only as strong as its supercomputer counterpart, Minwa. Baidu claimed a 5.98% error rate in early January 2015, beating GoogLeNet's 6.66%, which won the 2014 ImageNet competition. On February 6, Microsoft announced they had achieved 4.94%, beating the human performance of 5.1% by Andrej Karpathy of Stanford. On February 11, Google announced a 4.82% error. The race continues, but Baidu's system is described in more detail below, for clarity. The supercomputer consists of 36 server nodes, each with 2 x 6-core Intel Xeon E5-2620 CPUs plus 4 x NVIDIA Tesla K40m GPUs and 1 x FDR InfiniBand (56Gb/s) offering low-latency interconnection. The peak single-precision floating point performance and onboard GPU memory allow Minwa an astonishing "6.9TB host memory, 1.7TB device memory, and approximately 0.6 Petaflops of theoretical single precision peak performance" [ 10 ]. The system therefore allows use of higher-resolution images (512 x 512 pixels as opposed to 256 x 256) and exploits the various "editing choices, lighting situations or other extraneous factors" [ 10 ] naturally provided by Instagram users. Further mathematical detail is provided in section 5.0.
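As a rough sanity check on those headline figures, the arithmetic below reproduces them from the node counts quoted in [ 10 ]; the per-K40m single-precision peak of roughly 4.29 Tflops is my own assumption, not a figure from the paper.

# back-of-the-envelope check of the Minwa specification quoted above
nodes = 36
gpus_per_node = 4
gpu_mem_gb = 12            # GDDR5 per Tesla K40m
gpu_peak_tflops = 4.29     # assumed single-precision peak per K40m (boost clocks)

total_gpus = nodes * gpus_per_node                    # 144 GPUs
device_mem_tb = total_gpus * gpu_mem_gb / 1000.0      # ~1.7 TB device memory
peak_pflops = total_gpus * gpu_peak_tflops / 1000.0   # ~0.6 Pflops single precision
print(total_gpus, round(device_mem_tb, 2), round(peak_pflops, 2))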
1.3 A brief look into the future

Graves et al at Google DeepMind added the fundamental aspects of von Neumann architecture to powerful RNNs: logical flow control (branching) and external memory similar to Turing's ticker tape [ 18 ]. They called their system a "Neural Turing Machine" [ 18 ].
This is an important step toward making computers much more brainlike than ever before, but perhaps trying to mimic the brain will not be the best route forward [ 24 ], in much the same way that aviation engineers realised it was much better to utilise concepts like lift than to try to mimic feathers for aircraft. Professor Michael Jordan states "There is nothing very neural about neural networks ... no spikes in deep-learning systems ... no dendrites" [ 24 ]. Regarding backpropagation, Jordan describes how the brain does not operate the same way, though he acknowledges it led to significant progress. Numenta's Jeff Hawkins argues⁵ that "intelligent machines will be based on models of the neocortex, not the rest of the brain"; there is a big difference.

The most recent paper to have been accepted into the Computer Vision and Pattern Recognition conference, CVPR2015, describes how differently CNNs see the world from humans and how this can also lead to inaccuracy. Nguyen, Clune et al of the University of Wyoming show that CNNs presented with evolved "fooling" images will confidently give false positive results. The technique is demonstrated on www.picbreeder.org, where the user is the fitness function. The paper illustrates that CNNs can be trained on whatever we want them to be trained on, and that the usual linear process can even be flipped: fixed image and tweaked parameters versus fixed parameters and an altered image. The flaw itself is inherited from the linear nature of the classifiers themselves. The CNN is trained via a repeated process of sampling data, calculating gradients and updating. One gradient updates one parameter to increase the correct image score, along with all other gradient-parameter updates. This is repeated epoch after epoch on all images in the training data until a suitable accuracy is obtained. Perhaps more efficient methods lie within self-organisation and in the random way the brain seems to construct its own neural network. Whatever the solution, today the most successful artificial neural networks are deep and multi-layered.
1.4 Neuromorphic chips

Neuromorphic chips are currently being funded by DARPA via HRL Laboratories and IBM, to enhance AI and sidestep Moore's Law and its fundamental performance limit. Deep learning will soon integrate with mobile phones thanks to Qualcomm bundling its Zeroth software with its Snapdragon 820 processor, to recognize speech, music and scenes, pick out objects and even spot patterns of behaviour via the device's sensors.
⁵ http://recode.net/2015/03/02/theterminatorisnotcomingthefuturewillthankus/
Fig 2. Neuromorphic chip designed by Heidelberg University. The chip features 384 neurons, 100,000 synapses and operates at a speed of approx. 100,000x biological real-time (Human Brain Project, May 2013)
The Human Brain Project in Europe has spent €100 million on neuromorphics, and many laboratories are attempting to reconstruct a silicon brain. Neural processing units (NPUs) such as IBM's SyNAPSE chip, unveiled in August 2014, are designed to be extremely energy-efficient and optimized for sensory input processing. Testing showed the chips perform 100x faster while consuming 100,000x less power than a laptop running standard von Neumann architecture and CPUs [ 27 ]. NPUs do not separate memory and processing blocks as in the von Neumann architecture, so bottlenecks and latency are vastly reduced, with neurons and synapses intertwining both functions. Unfortunately these chips also require a completely new approach to programming, being developed by Rice University in the PLINY Project [ 20 ].
2.0 Objectives
This report provides a general introduction to the topic of deep learning, with a history of the field leading to current progress. A short introduction to image recognition follows, prior to a fuller definition of the different flavours of neural networks and techniques used in deep learning, with a number of specific applications in both academia and industry. From section 5 the role of GPUs is introduced in depth, with specific testing further explained. Various frameworks, environments and software are tested before leading into the author's key interest, the use of deep learning in healthcare, specifically recognition of subtle features in histology. Attention is given to the mathematics behind the success of the current state-of-the-art algorithms in use, before a conclusion with details of future work. A substantial list of references and a full literature review are provided, with multiple appendices, should the reader wish to duplicate any tests made or simply to dive into the fascinating field of deep learning with convolutional neural networks. The appendices include actual quotations and advice for obtaining the necessary hardware, though it should be noted that this report leaves behind it a 2,880-core, 12GB GPU currently housed at csfyp01 in the DEC-10 machine room, thanks to the NVIDIA Corporation.
The report will cover:
➢ an overview of the current state-of-the-art deep learning techniques available for natural image recognition and mitosis (feature) detection in medical images.
➢ research and practical implementation of GPGPU (general purpose programmability on a GPU).
➢ research and practical implementation of neural networks with comparison of various machine learning techniques.
➢ comparison between CPU and GPU capability and performance.
➢ application of deep learning for recognition of features and objects in natural and medical images.
➢ presentation of a use-case to persuade the School of Computing to invest in GPUs and assist current and future work in many fields of research.
2.1 Personal Statement

[ removed for confidentiality ]

Timing and numerical results are detailed in section 11, with extensive details of testing in the appendices.
3.0 Image recognition

It is important to understand the semantic gap between what "we see" and what a computer reads. An image of a cute cat implies knowledge, recognition, feelings ('cute'). A computer sees a patch of integers between [0, 255], in 3 dimensions representing the Red, Green and Blue colour channels. All neural net activations are arranged in 3D (height x width x depth). Images from the CIFAR-10 dataset are represented as 32 width x 32 height x 3 depth (RGB) per image, with 10 classes [ airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck ]. The central pixel of a square patch of image data tends to be used to determine the class of the image, and the network is trained on these patches (of known class). The 10 classes label the 50,000 training images used to train a DL system. Once trained, the system is tested with a further 10,000 unlabelled test images. Challenges that humans take for granted, a computer needs help with: pose, viewpoint, lighting, scale, deformation, occlusion, noise, clutter, intra-class variation (e.g. 100 different types of chair). All these are big problems that neural nets have largely solved since the adoption of the data-driven approach. From the mathematical representation, algorithms simply calculate a score function f(x_i, W, b) = W x_i + b, where x_i is the image, W a weight matrix and b a bias vector. The trained network can then classify new, unseen images x_i and trivially detect objects within them by comparing pixels close to the object centroid with the different background pixels [ 52 ].
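A minimal sketch of that score function in numpy follows; the random weights and zero biases are placeholders standing in for trained parameters, and the CIFAR-10-style sizes are chosen purely for illustration.

import numpy as np

num_classes, num_pixels = 10, 32 * 32 * 3                # CIFAR-10-sized example
x_i = np.random.randint(0, 256, num_pixels)              # one image stretched into a column of raw pixel values
W = np.random.randn(num_classes, num_pixels) * 0.0001    # weight matrix (one row of weights per class)
b = np.zeros(num_classes)                                # bias vector

scores = W.dot(x_i) + b            # f(x_i, W, b) = W x_i + b: one score per class
predicted_class = np.argmax(scores)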
Fig 3. Conversion of an image to digital, Li et al, Stanford
Nearest neighbour classifiers can predict the class of an unlabelled image based on the knowledge gained from the training set, using a choice of distance measure as a hyperparameter, for example pixel-wise abs() (L1) or Euclidean (L2) distance. See Appendix J for a full code breakdown of image preprocessing on the Street View House Numbers dataset (Google).
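A minimal sketch of such a classifier using the pixel-wise absolute (L1) distance is shown below; the random arrays are stand-ins for flattened training and test images, not real SVHN or CIFAR-10 data.

import numpy as np

def nearest_neighbour_predict(Xtr, ytr, Xte):
    """Label each test row with the label of its closest training row (L1 distance)."""
    preds = np.zeros(len(Xte), dtype=ytr.dtype)
    for i, x in enumerate(Xte):
        distances = np.sum(np.abs(Xtr - x), axis=1)   # pixel-wise abs() summed per training image
        preds[i] = ytr[np.argmin(distances)]          # label of the nearest neighbour
    return preds

Xtr, ytr = np.random.rand(50, 12), np.random.randint(0, 10, 50)  # 50 toy 'images' of 12 pixels
print(nearest_neighbour_predict(Xtr, ytr, np.random.rand(5, 12)))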
4.0 Neural Nets

The area of deep learning spans a wide spectrum. The models covered, grouped by supervision and depth:

Unsupervised, deep: Sparse Autoencoder; LSTM; Deep Belief Network; Bayesian NP
Unsupervised, shallow: Restricted Boltzmann Machines
Supervised, deep: Neural Networks (MLP); Convolutional NNs; Recurrent NNs
Supervised, shallow: Perceptron; SVM
4.1 Perceptrons

Perceptrons are a close artificial implementation of the basic computational unit of the brain: the biological neuron. A biological neuron differs in one significant way from its artificial representation. Unlike the perceptron, a biological neuron is capable of recognising hundreds of different patterns, not just one pattern of inputs summed and modified by weights. Perceptrons can, however, be concatenated into more complex neural networks. The synaptic strengths (weights) in a perceptron are learnable and control the strength (excitatory = positive weight, inhibitory = negative weight) of one neuron's effect on another. In the brain, dendrites carry signals to the soma (cell body) where they are summed. If the final sum is above a certain threshold, the neuron can fire, sending an electrical 'spike' along its axon. In the perceptron, only the frequency of the firing (rate code) communicates information, modelled by an activation function [ 11 ]. This differs from how the actual neocortex works, which is out of scope of this report. Nevertheless it is an important point, explained further in the context of Numenta's work in Appendix N. Python implementation:

import math
import numpy as np

class Neuron:
    # ... (weights and bias set elsewhere, e.g. in __init__)
    def forward(self, inputs):
        """inputs and weights are 1D numpy arrays; bias is a number"""
        soma_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + math.exp(-soma_sum))  # sigmoid activation function
        return firing_rate
With appropriate weights and threshold/bias, perceptrons can simulate basic logic gates such as AND and OR, distinguishing sets of inputs which are linearly separable. With automatic tuning of weights the perceptron can learn the correct answers (0 or 1) from the input vectors by going through the training set one by one (the entire set passed over several times): if the current output is the same as desired, it passes on to the next example; if not, the algorithm applies a learning rule as follows:

W = W + (d - y) γ X

where d is the desired (correct) output, y is the actual output and γ is the learning rate, 0 < γ < 1 [ 11 ].

To force the perceptron's output close to d, W should be as close to the positive cases (those with d = 1) as possible, therefore:
● Move W towards a positive X if the output is 0 instead of 1 (weak activation)
● Move W away from a negative X if the output is 1 instead of 0 (strong activation)
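A small sketch of that learning rule on a linearly separable toy problem (an AND gate); the learning rate of 0.1 and the 20 epochs are arbitrary choices for illustration.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs for an AND gate
d = np.array([0, 0, 0, 1])                       # desired outputs
W, bias, gamma = np.zeros(2), 0.0, 0.1           # weights, bias and learning rate (0 < gamma < 1)

for epoch in range(20):                          # pass over the training set several times
    for x, target in zip(X, d):
        y = 1 if W.dot(x) + bias > 0 else 0      # threshold activation
        W += (target - y) * gamma * x            # W = W + (d - y) * gamma * X
        bias += (target - y) * gamma

print([1 if W.dot(x) + bias > 0 else 0 for x in X])   # should reproduce d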
4.2 Multilayer perceptron (MLP)

A multilayer perceptron is a feed-forward neural network (FFNN) with one or more hidden layers followed by an output layer. Every neuron has a transfer/activation function (usually sigmoid or tanh). An FFNN with L hidden layers is parameterized by L + 1 weight matrices (W_0, . . . , W_L) and L + 1 vectors of biases (b_1, . . . , b_{L+1}). Concatenating these forms a vector that fully specifies the function computed by the network.
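A sketch of the forward pass through such a network, holding the weight matrices and bias vectors in Python lists; the layer sizes and random initial values here are purely illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Forward pass through an FFNN defined by lists of weight matrices and bias vectors."""
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(W.dot(activation) + b)   # every neuron applies the activation function
    return activation

sizes = [4, 5, 5, 3]    # toy MLP: 4 inputs -> two hidden layers of 5 -> 3 outputs
weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
print(mlp_forward(np.random.randn(4), weights, biases))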
4.3 Convolutional Neural Networks (CNN)
CNNs are a special class of FFNN, well suited to image recognition through the use of image filters: square regions with associated weights applied across an entire image, often several at a time, e.g. four 6x6 filters. (A screenshot of a visualisation of a live CNN implementation is at Appendix M.)
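A minimal sketch of applying one such filter across a greyscale image follows (a plain nested loop over the 'valid' positions); the 6x6 random filter and 28x28 image are placeholders, and real frameworks implement the same operation far more efficiently.

import numpy as np

def apply_filter(image, kernel):
    """Slide a square filter over the image and return the resulting feature map (valid region only)."""
    k = kernel.shape[0]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + k, c:c + k]             # local receptive field
            feature_map[r, c] = np.sum(patch * kernel)  # weighted sum with shared weights
    return feature_map

image = np.random.rand(28, 28)     # toy greyscale image
kernel = np.random.randn(6, 6)     # one 6x6 filter; a CNN learns several of these
print(apply_filter(image, kernel).shape)   # (23, 23) feature map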
Fig 4. The LeNet CNN, LeCun et al
Convolutional networks combine local receptive fields (filters), weight replication and subsampling to cope with invariance, for example, in images, to differences in scale, pose, lighting etc. LeCun's LeNet-5 was based on how the visual cortex works, inspired by Fukushima's Neocognitron (1980), allowing extraction of features such as certain orientations, edges and corners. Sets of neurons at different locations within the image share the same weights; their output is called a feature map. This means that all neurons (or 'units', LeCun prefers the non-brain-referencing term) in a feature map are "constrained to perform the same operation on different parts of the image" [ 71 ]. The advantage of having shared weights is that the features will be detected regardless of their location, while the multiple filters allow each to detect a different set of features; sharing also reduces the number of parameters, i.e. the complexity and computation, and reduces the difference between test and training error. LeNet-5 measures a loss function, the difference between desired and actual output, though the more important measure is the error rate, estimated by "measuring the accuracy on a test set disjoint from the training set" [ 71 ].

Neurons in the first hidden layer of LeNet-5 (figure above) are organized in six planes/feature maps. Each neuron has 25 inputs connected to a 5x5 filter in the input, i.e. 25 "trainable coefficients plus a trainable bias" [ 71 ], with the filters of neighbouring neurons overlapping. The second hidden layer of LeNet-5 is a subsampling layer of six feature maps, one per feature map in the previous layer. Subsampling breaks the symmetry in the network and increases invariance to shifts and distortions. Each neuron computes the average of its four 2 x 2 inputs, multiplies it by a trainable coefficient, adds a bias, and passes the result through an activation function such as a rectified linear unit (ReLU). There are no overlaps, so the layer has only half the rows and columns. If the coefficient is small, subsampling simply blurs the input; if large, it performs a "noisy OR or a noisy AND" [ 71 ] depending on the bias.

A breakdown of LeNet-5 is as follows (C: convolutional, S: subsampling):
● LeNet-5 comprises an input layer (32 x 32 pixels) plus seven layers. Most characters in the MNIST database are 20 x 20, centred in a 28 x 28 field. Input is "normalized so that background level (white) corresponds to a value of -0.1 and the foreground (black) corresponds to 1.175" [ 71 ].
● C1 has six 28 x 28 feature maps (FM). Each neuron in each FM is connected to a 5 x 5 neighbourhood in the input. C1 contains 156 trainable parameters and 122,304 connections.
Fig 5. Actual deconvoluted features picked out at layer 3 (top) and layer 2 by LeNet of image to left, Zeiler (Clarifai Inc) and Fergus (NYU, Facebook)
● S2 has six 14 x 14 feature maps. Each neuron in each FM is connected to a 2 x 2 neighbourhood in the corresponding C1 FM. The four inputs to a neuron in S2 are added, then multiplied by a trainable coefficient, and added to a bias. The result is passed through an activation function. The 2 x 2 filters do not overlap, therefore S2 FMs have half the number of rows and columns of C1 FMs. S2 has 12 trainable parameters and 5,880 connections.
● C3 has 16 FMs. Each neuron in each FM is connected to several 5 x 5 neighbourhoods at identical locations in a subset of S2's FMs. C3 has 1,516 trainable parameters and 156,000 connections.
● S4 has 16 FMs of size 5 x 5. Each neuron in each FM is connected to a 2 x 2 neighbourhood in the corresponding C3 FM. S4 has 32 trainable parameters and 2,000 connections.
● C5 has 120 FMs. Each neuron is connected to a 5 x 5 neighbourhood on all 16 S4 FMs. Since S4 is also 5 x 5, C5 FMs are simply 1 x 1, i.e. fully connected. C5 has 48,120 trainable connections.
● F6 contains 84 neurons fully connected to C5. It has 10,164 trainable parameters.
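The parameter and connection counts quoted above for C1 and S2 can be reproduced with a few lines of arithmetic; this is only a sketch of the shared-weight counting implied by the layer descriptions.

# C1: 6 feature maps, each a 5x5 filter plus one bias, applied over a 28x28 output grid
c1_params = 6 * (5 * 5 + 1)              # 156 trainable parameters
c1_connections = c1_params * 28 * 28     # 122,304 connections

# S2: 6 feature maps, each with one trainable coefficient and one bias, over a 14x14 grid
s2_params = 6 * 2                        # 12 trainable parameters
s2_connections = 6 * (2 * 2 + 1) * 14 * 14   # 5,880 connections (4 inputs + 1 bias per neuron)

print(c1_params, c1_connections, s2_params, s2_connections)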
Fig 6. 11-layer architecture for network used by Ciresan et al, 2012
CNNs therefore differ by exploiting local connectivity. Normalization and dimensions (shifts and distortions in the images) make little difference to results in a CNN - the reason for their robustness - but depth of layers matters, as does a small filter size tuned to prime factorization such as 3x3, 5x5, 7x7 (see Convolution section 10.2 ).
Fig 7. Supervised Deep Learning, Marc'Aurelio Ranzato, Facebook A.I. Research
Once a feature has been detected, its exact location matters less; only its approximate position with respect to other features is relevant. If, for example, an input image contains a "roughly horizontal segment in the upper left area, a corner in the upper right area, and the endpoint of a roughly vertical segment in the lower portion of the image", this has a high probability of being a '7' [ 71 ].
Fig 8. Image from MNIST, LeCun
Subsampling reduces resolution (of the image and therefore of the feature map). For example, if the input is a 32x32 image and the layer has a subsampling region of 2x2, the output is a 16x16 image, which means that 4 pixels (each 2x2 square) of the input image are combined into a single output pixel. The last subsampling (or convolutional) layer is usually connected to one or more fully connected layers, the last of which represents the target data.
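A sketch of that 2x2 subsampling step on a single-channel image is shown below, using plain average pooling via a reshape; LeNet-5 additionally multiplies the average by a trainable coefficient and adds a bias, which is omitted here.

import numpy as np

def average_pool_2x2(image):
    """Combine each non-overlapping 2x2 square of pixels into one output pixel (their mean)."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

image = np.arange(32 * 32, dtype=float).reshape(32, 32)
pooled = average_pool_2x2(image)
print(image.shape, '->', pooled.shape)   # (32, 32) -> (16, 16)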
Training is performed using modified backpropagation that takes the subsampling layers into account and updates the convolutional filter weights based on all values to which that filter is applied [ 44 ].
4.4 Backpropagation, steepest gradient descent and momentum

Vast improvements have turned complexity on its head, with CNNs offering computationally expensive training but cheap testing. Setting hyperparameters is still very problem-dependent, but the open-sourcing of development has slingshotted progress in the field. The 3 key components to training neural nets are: a score function, a loss function and optimization. Loss functions measure the difference between the correct and the incorrect class scores from a data-driven classifier. Given some function f(x), where x is a vector of inputs (training data x weights), the backpropagation algorithm can then compute the gradient of f at x, i.e. ∇f(x). In practice it is usual to compute the gradient only for the parameters (weights, bias) in order to perform updates, but gradients on the input can be useful for visualization. Backpropagation can be thought of as "gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher" [ 11 ]. The technique essentially calculates the gradient with respect to a loss function, a recursive procedure that obtains the gradient of a deep layer in terms of the layer above.

# backpropagation through f(x, w) = sigmoid(w[0]*x[0] + w[1]*x[1] + w[2])
import math
x = [3.0, -4.0]; w = [2.0, -3.0, -3.0]          # example input and weights
dot = w[0]*x[0] + w[1]*x[1] + w[2]              # forward pass
f = 1.0 / (1.0 + math.exp(-dot))                # sigmoid
ddot = (1 - f) * f                              # gradient on dot variable, using the sigmoid
dx = [w[0] * ddot, w[1] * ddot]                 # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]     # backprop into w
In backpropagation it helps greatly to cache the forward-pass variables, which is where the power of GPUs comes in. Gradient checks are used to verify exactness, speed and accuracy, i.e. the optimization of loss functions. Stochastic gradient descent (SGD) uses a single example at a time, and current research shows that this technique is very hard to beat with CNNs.
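A sketch of such a gradient check, comparing an analytic gradient with a centred finite-difference estimate; the toy quadratic loss is only a stand-in for a real network's loss function.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centred finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

loss = lambda w: np.sum(w ** 2)     # toy loss with known analytic gradient 2w
w = np.random.randn(5)
print(np.max(np.abs(2 * w - numerical_gradient(loss, w))))   # should be close to zero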
Fig 9. Gradient, update and momentum, Li et al, Stanford
A good analogy for momentum is that the loss can be interpreted as the height of a hilly terrain. Initializing the parameters with random numbers is like setting a ball down, at rest, at some location. The optimization process simulates the ball rolling on the landscape. The force on the ball is precisely the (negative) gradient of the loss function and, since F = ma, the (negative) gradient is proportional to the acceleration of the ball. Note that with vanilla SGD the gradient directly influences the position; with the momentum update, the gradient influences the velocity, which in turn has an effect on the position. The momentum hyperparameter therefore acts like friction, damping the velocity and reducing the energy of the system so that the ball actually stops at a local (or global) minimum. A typical setting is to start with a value of 0.5 and increase it over multiple epochs. Although the sigmoid function is helpful to "squash" values to the range [0,1], this also "kills" gradients and it is now, in the field of deep learning, largely obsolete. Tanh can be used, but current trends promote use of the more efficient ReLU.

# forward pass of a 3-layer NN
import numpy as np
f = lambda x: 1.0/(1.0 + np.exp(-x))              # sigmoid activation
x = np.random.randn(3, 1)                         # random input (3 x 1)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))  # first hidden layer parameters
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))  # second hidden layer parameters
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))  # output layer parameters
h1 = f(np.dot(W1, x) + b1)                        # first hidden layer (4 x 1)
h2 = f(np.dot(W2, h1) + b2)                       # second hidden layer (4 x 1)
out = f(np.dot(W3, h2) + b3)                      # output (1 x 1)
[ 11 ]

To train a neural net it is important to first initialize:
● set weights to small random numbers
● set biases to zero
Start with small regularization; e.g. halving the weights gives a higher regularization strength (biologically interpreted as 'gradual forgetting'). Dropout [ 69 ] is a common technique: randomly setting some neurons to zero. Then it is important to find the learning rate that makes the loss go down. NB: if the loss is not going down, the learning rate is too low; if the loss is exploding, the learning rate is too high.
Fig 10. High learning rates look exponential and decay the loss faster, but get stuck at worse values of loss (green line) as too much "energy" in the optimization and parameters are bouncing around chaotically [ 11 ]
model = init_two_layer_model(32*32*3, 50, 10)  # input size of CIFAR-10, hidden layers, no. classes
trainer = ClassifierTrainer()  # Caffe model
best_model, stats = trainer.train(X_train, y_train, X_val, y_val,
                                  model, two_layer_net,
                                  num_epochs=10, reg=0.000001,
                                  update='sgd', learning_rate_decay=1,
                                  sample_batches=True,
                                  learning_rate=1e-6, verbose=True)
Much work has been done on optimization, but the current preference is for the Adagrad adaptive learning rate algorithm [ 11 ].
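For reference, a sketch of the three update rules discussed in this section: vanilla SGD, the momentum update and Adagrad. The gradient comes from a toy quadratic loss and the hyperparameter values are typical defaults rather than tuned settings.

import numpy as np

learning_rate, mu, eps = 1e-2, 0.5, 1e-8
w = np.random.randn(10)             # parameters
grad = 2 * w                        # e.g. the gradient of the toy loss sum(w**2)
v = np.zeros_like(w)                # momentum 'velocity'
cache = np.zeros_like(w)            # Adagrad history of squared gradients

# vanilla SGD: the gradient directly changes the position
w_sgd = w - learning_rate * grad

# momentum update: the gradient changes the velocity; mu acts like friction, damping it
v = mu * v - learning_rate * grad
w_momentum = w + v

# Adagrad: a per-parameter learning rate that shrinks as squared gradients accumulate
cache += grad ** 2
w_adagrad = w - learning_rate * grad / (np.sqrt(cache) + eps)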
4.5 Recurrent Neural Networks (RNN)
RNNs are in effect the deepest FFNNs, once unrolled through time: nonlinear dynamical systems that map sequences to sequences, parameterized with weight matrices and bias vectors. RNNs have a layer for each timestep, with weights shared across time [ 55 ]. RNNs rarely generate the same results twice and are optimized to model multivariate data. They are prone to becoming unstable, hence the development of Long Short Term Memory (LSTM) by Schmidhuber, which provides stability via a longer look-back capability. See Section 4.7 for further details. A traditional FFNN is organized in layers, with information flowing unidirectionally from input to output and no cycles in the connectivity pattern. In an RNN, by contrast, the connectivity pattern may contain directed cycles. Because the previous vector of activities is used to compute the vector of activities at each time step, RNNs are able to retain a memory of previous events and utilize this in decision-making. See Section 4.8.3 for a specific application.
4.6 Sparse Autoencoders

"We have been focusing on sparse auto-encoders for unsupervised learning. They have the advantage of scaling well." LeCun [ 28 ]

An autoencoder is an FFNN which aims to learn a compressed, distributed representation (encoding) of a dataset by essentially recreating the input as its output, but compressed/abstracted [ 44 ]. Autoencoders can also be stacked to create a series of inputs, outputs and hidden layers. "The hidden layer of autoencoder t acts as an input layer to autoencoder t + 1" [ 44 ], with the input layer of the first being the input to the entire network. The first autoencoder is trained using backpropagation with all available training data. The second autoencoder is trained on the hidden-layer output of the first, propagating it forward to its own output layer and updating the weights using backpropagation. This is repeated for all the layers. There is no mapping between the input data and output labels. Adding one or more fully connected layer(s) after the last autoencoder fine-tunes the network, providing overall an effective pre-training method for initializing the weights of a network.
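A sketch of that greedy layer-wise stacking is given below, with a deliberately tiny autoencoder trained by plain batch gradient descent on random data. The layer sizes, learning rate and epoch count are arbitrary, and a real sparse autoencoder would add a sparsity penalty and proper data handling.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden_size, lr=0.1, epochs=200):
    """Train one autoencoder (encode then decode) to reconstruct X; return the encoder weights."""
    n, d = X.shape
    W1, b1 = np.random.randn(hidden_size, d) * 0.01, np.zeros(hidden_size)
    W2, b2 = np.random.randn(d, hidden_size) * 0.01, np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X.dot(W1.T) + b1)          # hidden (compressed) representation
        R = sigmoid(H.dot(W2.T) + b2)          # reconstruction of the input
        dR = (R - X) * R * (1 - R)             # gradient of 0.5*||R - X||^2 through the output sigmoid
        dH = dR.dot(W2) * H * (1 - H)          # backpropagated to the hidden layer
        W2 -= lr * dR.T.dot(H) / n; b2 -= lr * dR.mean(axis=0)
        W1 -= lr * dH.T.dot(X) / n; b1 -= lr * dH.mean(axis=0)
    return W1, b1

X = np.random.rand(100, 20)                    # toy unlabelled data
encoders, layer_input = [], X
for hidden_size in (10, 5):                    # hidden layer of autoencoder t feeds autoencoder t+1
    W, b = train_autoencoder(layer_input, hidden_size)
    encoders.append((W, b))
    layer_input = sigmoid(layer_input.dot(W.T) + b)
print([w.shape for w, _ in encoders])          # [(10, 20), (5, 10)]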
4.7 Long short term memory (LSTM)

".. Sepp Hochreiter (1991) analysed the vanishing gradient problem. LSTM falls out of this almost naturally" [ 2 ].

A long short-term memory network, or LSTM, is a specialised RNN that overcomes exploding and vanishing gradients. This problem occurs when an algorithm brings a solution closer and closer to the local minimum but suddenly gets thrown into an almost flat area of the problem space, lengthening the time significantly until a solution is found. The architecture of an LSTM forces constant error flow through the internal state of special memory gates/linear neurons (keep gate, forget gate, read/write gates). When the keep gate is turned on (with an activity of 1), the self-connection has weight one and the memory cell writes its contents into itself. When the keep gate outputs a zero, the memory cell forgets its previous contents. The write gate allows the rest of the neural net to write into the memory cell when it outputs a 1, while the read gate allows the rest of the neural net to read from the memory cell when it outputs a 1. Finally, the cell is read and then cleared, forcing a constant error flow through time to locally protect against exploding and vanishing gradients. See Section 4.8.3 for a specific application.
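A sketch of a single LSTM time step in the now-standard forget/input/output-gate formulation, which corresponds to the keep/write/read gates described above; the sizes and random parameters are placeholders only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H); b has shape (4*H,)."""
    H = h_prev.size
    z = W.dot(np.concatenate([x, h_prev])) + b
    keep = sigmoid(z[0:H])           # 'keep'/forget gate: how much of the old cell to retain
    write = sigmoid(z[H:2*H])        # 'write'/input gate: how much new content to store
    read = sigmoid(z[2*H:3*H])       # 'read'/output gate: how much of the cell to expose
    cand = np.tanh(z[3*H:4*H])       # candidate new cell contents
    c = keep * c_prev + write * cand # cell state: the protected, constant-error path
    h = read * np.tanh(c)            # hidden state passed to the rest of the network
    return h, c

D, H = 8, 4                                    # input and hidden sizes (arbitrary)
W, b = np.random.randn(4 * H, D + H) * 0.1, np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                             # unroll over a short toy sequence
    h, c = lstm_step(np.random.randn(D), h, c, W, b)
print(h)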
4.8 Machine Learning applications
4.8.1 Microsoft Medical Image Analysis Microsoft’s InnerEye research project “focuses on the automatic analysis of patients' medical scans” (research.microsoft.com). In Sep 2012 Microsoft released software for “the automatic detection and localization of anatomy within CT scans”. The software obtained FDA approval and the Medical Image Initiative open-sourced their research, making it available at:
http://research.microsoft.com/apps/mobile/ShowPage.aspx?page=/en-us/people/antcrim/downloads.aspx
Current collaborators include: Johns Hopkins Medical Institute, The University of Oxford, Cornell Medical School, Massachusetts General Hospital, University of Washington, Kings
College London, INRIA Asclepios and Addenbrooke's NHS Hospital in Cambridge, amongst others.

Microsoft's He et al introduced spatial pyramid pooling (SPP) in early January 2015 [ 7a ], which eliminated the need to reduce image sizes to e.g. 256 x 256 or 512 x 512 (as in Baidu's HPC DL system) when running them through a CNN. This is very relevant for high-resolution medical images. He et al's SPP-net computes feature maps from the whole image once, then pools features (with local spatial bins) into "sub-images to generate fixed-length representations for training the detectors" [ 7a ]. SPP also implements "a simple multi-size training method" [ 7a ], where one epoch trains the network with one input size, then switches to another input size for the next epoch, and so on. The method converges with better testing accuracy than traditional methods and yielded a 100x speedup over R-CNN (defined in section 4.8.2), taking 0.5s to process one image (all steps). He et al open-sourced the Matlab code at http://research.microsoft.com/en-us/um/people/kahe/.

LeNet's seven-layer architecture is convolutional/pooling for the first five layers, with the last two fully connected, ending in an n-way softmax (n = number of classes). Only the fully connected layers need fixed-length vectors, so SPP max-pools in local spatial bins on the last convolutional layer, with bin size proportional to image size. This gives a kM-dimensional representation, where M = the number of bins and k = the number of filters in the last convolutional layer; e.g. a 3-level SPP uses 3×3, 2×2 and 1×1 pooling grids, i.e. 9 + 4 + 1 spatial bins. Although GPU implementations tend to need fixed image sizes, He et al preserved the arbitrary-image-size behaviour, as detailed further in the paper [ 7a ].
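A quick sketch of that output-dimension arithmetic; the choice of k = 256 filters in the last convolutional layer is an assumption made for illustration, not a value taken from [ 7a ].

bins_per_level = [3 * 3, 2 * 2, 1 * 1]   # the 3-level pyramid described above
M = sum(bins_per_level)                  # 14 spatial bins in total
k = 256                                  # assumed number of filters in the last conv layer
print(M, k * M)                          # 14 bins -> a fixed 3,584-dimensional representation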
4.8.2 R-CNN

The R-CNN by Girshick et al, October 2014, is a supervised method with pre-training, followed by fine-tuning on a smaller dataset (PASCAL VOC 2010). The paper presents the technique as "an effective paradigm for learning high-capacity CNNs when data is scarce" [ 22 ]. The paper introduces an algorithm combining region proposals with CNNs (R-CNN), comparing it to OverFeat, "a sliding-window detector based on a similar CNN architecture" [ 22 ]. The algorithm solves the CNN localization problem, of maintaining high spatial resolution with precise localization, by operating within regions. Even though R-CNN outperforms OverFeat, it still underperforms compared to other algorithms using more massive datasets, due to the fact that it operates only on the relatively small PASCAL VOC 2010 dataset, improving mean average precision (mAP) from around 20% to 53.7%. Their method focuses on features, generating "around 2000 category-independent region proposals" [ 22 ] per image, extracting features with a CNN, and classifying each region with SVMs. SGD is now known to be much more effective for training CNNs.
Fig 11. The R-CNN by Girshick et al [ 22 ]
4.8.3 Neural Image Caption (NIC) Generator

The NIC by Vinyals et al, November 2014, is trained using maximum likelihood to produce "a target sequence of words S = {S1, S2, . . .}, where each word St comes from a given dictionary, that describes the image adequately" [ 21 ]. Recent advances in machine translation are replacing autoencoder RNNs with CNNs, or combining them, for sequence modelling. An RNN can "encode a variable length input sentence into a fixed dimensional vector, and use this representation to "decode" it to the desired output sentence" [ 21 ]. This same procedure can be used on images.
Fig 12. The memory block contains the core of the LSTM model
The cell 'c' encodes knowledge at each time step and is controlled by three gates that either keep a value (if the gate is 1), zero the value (if the gate is 0), forget the value (forget gate f), or output the new cell value (output gate o). The blue lines show the recurrent connections. "The output m at time t−1 is fed back to the memory at time t via the three gates; the cell value is fed back via the 'forget' gate; the predicted word at time t−1 is fed back in addition to the memory output m at time t into the Softmax for word prediction" [ 21 ]. Overfitting is avoided by initialising the weights of the CNN to those of a model pre-trained on a dataset such as ImageNet and training with SGD and a fixed learning rate. Vinyals et al conclude that, as the size of the dataset increases, so does performance.
4.8.4 Neural Turing Machine (NTM)
Figure 13: During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads to and writes from a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world. [ 18 ]
The NTM architecture "constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest" [ 18 ]. Via weighting, a read/write head can learn about the locations or "attend sharply to the memory at a single location or weakly to the memory at many locations" [ 18 ]. A combination of erase and add operations of all write heads produce the final memory at a certain time, similar to Hopfield networks.
4.8.5 Face patches
Using functional magnetic resonance imaging (fMRI), neuroscientists have discovered "face patches" in monkeys: biological neural networks with specialised functions associated with face recognition. Farzmahdi et al, at the Institute for Research on Fundamental Sciences in Tehran, Iran, put together a six-layer CNN with the first four layers trained to extract primary features (the first two recognise edges, similar to how the visual cortex V1 and V2 areas operate; the next two recognise face parts, similar to the behaviour of the V4 and anterior IT neurons of the brain). The fifth layer recognises the same face from different angles - the view-selective layer - inspired by parts of monkey brains called middle face patches. The sixth and final layer matches the face to an identity - the identity-selective layer - similar to the part of the simian brain known as the anterior face patch. Farzmahdi et al trained the network using a dataset containing 740 face images, consisting of 37 different views of 20 people (547,600). Another dataset contains images of 90 people taken from 37 different viewing angles. They also have a number of datasets for evaluating specific properties of the neural net [ 6 ].
5.0 The GPU's role in deep learning

All major laboratory deep learning systems combine a bespoke GPU-assisted supercomputer with access to natural 'big data'. The Chinese search engine giant, Baidu, published their paper on the 'Deep Image' deep learning system in January 2015, maximising the power of GPUs and big data with deep learning algorithms using SGD, and claiming a 5.98% error rate. Though this error rate has subsequently been beaten by Microsoft and Google, all systems are relatively similar. The Baidu supercomputer, Minwa, is detailed in section 1.2. GPU onboard memory can store all the parameters, so multi-GPU systems make data parallelism trivial at all layers in a CNN. Each GPU handles 1/N of each batch of input images in parallel, computing gradients based on its local training data and a local copy of the weights, which the GPUs then exchange, updating their local copies [ 10 ].
The 'butterfly' synchronisation scheme synchronises "all gradients partitioned into K parts with each GPU responsible for its own partition. After computing, GPU k receives the k-th part from all other GPUs, accumulates results and then broadcasts them to all GPUs" [ 10 ]. "Lazy" updating by each GPU sends gradients asynchronously, which maximizes overlapping between computation and communication and prevents deadlock, similar to how MPI synchronizes CPU workloads with non-blocking Isend/Irecv calls, which return immediately. Communication overhead is independent of the number of GPUs. Wu et al tested a network with 8 convolutional layers and 3 fully-connected layers with a 1000-dimensional softmax. They found that speedup actually increased with batch size: a 47x speedup was achieved by using 64 GPUs (batch size of 1024), since total device memory increases as the number of GPUs increases, allowing more data to be cached on each GPU.
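A plain-numpy sketch of the data-parallel idea follows, simulating each GPU computing gradients on its own 1/N share of a mini-batch before the gradients are averaged and the shared update applied; the toy squared-error loss and the 'GPUs' are stand-ins, not Wu et al's actual implementation.

import numpy as np

num_gpus, batch_size, dim = 4, 1024, 100
w = np.random.randn(dim)                                   # each 'GPU' holds an identical copy of the weights
X, y = np.random.randn(batch_size, dim), np.random.randn(batch_size)
shards = np.array_split(np.arange(batch_size), num_gpus)   # each GPU gets 1/N of the mini-batch

def local_gradient(w, Xs, ys):
    """Gradient of a toy squared-error loss on one shard of the data."""
    err = Xs.dot(w) - ys
    return Xs.T.dot(err) / len(ys)

local_grads = [local_gradient(w, X[idx], y[idx]) for idx in shards]  # done in parallel on real hardware
global_grad = np.mean(local_grads, axis=0)                 # the exchange/accumulate step (all-reduce)
w -= 0.01 * global_grad                                    # every GPU applies the same update to its copy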
Fig 14. The larger the batch size, the larger the speedup, Wu et al
In their paper [ 10 ] Baidu were using NVIDIA Tesla K40s. NVIDIA's K80 now costs less than the K40 to purchase and consists of 2 x Kepler GPUs, giving overall peak single precision floating point performance of 8.74 Tflops (GPU Boost clocks) or 5.6 Tflops (base clocks). The K80 also features double the GDDR5 RAM and over double the number of CUDA cores. By the time I had finished testing with the K80s, Jen-Hsun Huang, CEO of NVIDIA, had announced their latest Titan X, a GPU aimed at the desktop (non-HPC) researcher, and Pascal, which will offer 10x the performance with 3D-stacked memory, 32GB RAM and 5x improved CPU-GPU communication via ‘NVLink’, their upgrade to PCI Express. Essentially the GPU itself is becoming the supercomputer. Pascal's bandwidth will outperform
Maxwell (HPC architecture) three-fold, and will double the install limit per single machine to 8x GPUs. Wu et al designed a model that was less sensitive to colour, intensity and scale and could use larger images (512 x 512 pixels) pulled straight from the massive databases of social media site Instagram. “An accuracy of 80% was achieved in 8.6 hrs with 32 GPU's, compared to 212 hours with only one” [ 10 ].
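As a minimal sketch of this style of data parallelism in Torch, assuming the nn.DataParallelTable container from cunn (the pattern used in soumith/imagenet-multiGPU.torch) and an existing network called 'model':
-- split each minibatch along dimension 1 and replicate the model across GPUs
require 'cunn'
local nGPU = 2                              -- illustrative
local single = model
model = nn.DataParallelTable(1)
for i = 1, nGPU do
   cutorch.setDevice(i)
   model:add(single:clone():cuda(), i)      -- one replica (own copy of the weights) per GPU
end
cutorch.setDevice(1)
-- model:forward(batch) now hands 1/N of each batch to every replica; gradients are
-- gathered back and the replicas are kept in sync during training.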
6.0 Test Environments & frameworks
A standalone machine was used for most testing. Initially this was racked in the DEC-10 machine room cslin155 with a Tesla C2075 left by a previous researcher. We then upgraded, March 5, on receipt of the Tesla K40, to the following:
2008 Precision T7400 workstation with a 950W PSU, 8GB RAM, Intel Xeon E5405 CPU 2GHz, NVIDIA Tesla K40 & 320GB SATA HDD; named csfyp01 on the network IP 129.11.144.99.
This machine was not installed with the standard university CentOs distribution but with straight Ubuntu 14.04 which allowed a simple install as per Facebook Github instructions for Torch and LuaJIT plus Python 3.4, allowing iPython and iTorch which enabled “easy rendering of images, using a standard web browser as a graphics client” (code.cogbits.com). Having been granted full access (see Appendix G ) the ImageNet 2012 dataset was then downloaded:
curl -O http://image-net.org/image/ilsvrc2013/ILSVRC2013_DET_train.tar (plus the corresponding val.tar and test.tar)
Validation and test datasets: 150,000 photographs from Flickr labeled with the presence or absence of 1000 object categories. Training data: a subset of the full ImageNet containing 1000 categories and 1.2 million images: http://image-net.org/challenges/LSVRC/2012/browse-synsets
A build, assisted by Faculty IT, included the CUDA libraries (later upgraded to CUDA 7) alongside Torch7/LuaJIT, essential to run the FAIR code and the ImageNet dataset. Certain historical Facebook dependencies cause problems with CentOS, so Soumith Chintala, lead AI researcher at FAIR under LeCun, also refactored this codebase, isolating
the data parallel version from the dependencies and allowing a simple scripted install. See Appendix E for full instructions.
6.1 Terminal The Terminal virtual machine environment, preconfigured by Stanford, was initially tested as part of their online course on CNNs that I ran through. This was very easy to use as all dependencies were pre-configured, though it is a CPU-only setup with costs ranging from $0.124/hr for 2 CPUs and 3.2GB RAM to $1.984/hr for the best configuration of 32 CPUs and 51.2GB RAM. Stanford gives every online course member a free $10 credit.
Requirements for the non-Terminal virtualenv: Cython==0.21.2, Jinja2==2.7.3, MarkupSafe==0.23, Pillow==2.7.0, backports.ssl-match-hostname==3.4.0.2, certifi==14.05.14, gnureadline==6.3.3, ipython==2.3.1, matplotlib==1.4.2, mock==1.0.1, nose==1.3.4, numpy==1.9.1, pyparsing==2.0.3, python-dateutil==2.4.0, pytz==2014.10, pyzmq==14.4.1, scipy==0.14.1, six==1.9.0, tornado==4.0.2, wsgiref==0.1.2
6.2 Theano Theano is a Python library created for working with multi-dimensional arrays. It features integration with NumPy, transparent use of GPUs, efficient symbolic differentiation, speed and stability optimizations and dynamic C code generation. I tested various algorithms and also learnt about the fairly recent Cython extension; a download can be obtained from www.cython.org . Since CNNs require a very efficient implementation, Cython can assist as an optimising static compiler, giving the combined power of Python and C. Test results are at section 11.
6.2.1 Theano v Torch Theano was introduced to the ML community by Bergstra et al (2010) and subsequently enhanced to implement deep learning models in 2011, when Collobert et al compared Theano with Torch 7. Torch 7's strengths stem from Lua's advantages over Python: lower interpreter overhead, simpler integration with C, and easy embedding in a C application. Torch 7 now uses LuaJIT, a highly optimized just-in-time compiler. Soumith
Chintala, at FAIR has benchmarked the majority of implementation frameworks - see Appendix K . Torch7 was designed to use parallelism on multi-core CPUs with OpenMP parallel directives, notably in the tensor and neural network modules. Theano’s powerful graph optimization and symbolic differentiation engine make it a worthy opponent, except that “users are faced with a more complex workflow: first, define an abstract mathematical graph (without values), then optimize and compile it into a callable function, and finally execute the function. This additional complexity also makes it harder to interpret errors that may be raised during the compilation or execution phase” [ 66 ]. Torch is used internally by Facebook, Google, Twitter, NYU, The Idiap Research Institute (Ecole Polytechnique Fédérale de Lausanne), Purdue (major US research university known for discoveries in science, technology & engineering) plus multiple other companies and research labs. Torch7 is maintained by Ronan Collobert (IDIAP), Clement Farabet (Madbits, a deep learning startup acquired by Twitter in 2014), Koray Kavukcuoglu (Google Deepmind) and Soumith Chintala (Facebook). All are previous PhD students of Yann LeCun. See section 9.3 for further detail on Torch, Appendix E for install instructions.
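The workflow difference is easy to see in a minimal Torch sketch (illustrative layer sizes; there is no separate graph-compilation step before anything runs):
require 'nn'
net = nn.Sequential()
net:add(nn.Linear(100, 50))
net:add(nn.Tanh())
net:add(nn.Linear(50, 10))
x = torch.rand(100)
y = net:forward(x)               -- executes immediately, no compile phase
net:zeroGradParameters()
net:backward(x, torch.rand(10))  -- gradients are available straight away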
6.3 Caffe and DIGITS Caffe is UC Berkeley's deep learning framework, developed by the Berkeley Vision and Learning Center (BVLC), specifically by Yangqing Jia, who created it for his PhD. Caffe can run on the CPU but is now optimized for the GPU, and is capable of switching between the two with a single flag: Caffe::set_mode(Caffe::GPU);
CNNs have a linear structure but Caffe networks can have any directed acyclic graph (FFNN) structure. Caffe is capable of processing at 1ms per image for inference and 4ms per image for learning with a single NVIDIA K40 GPU, and is built on a pure C++ / CUDA architecture with command line, Python and MATLAB interfaces. In March 2015 NVIDIA announced their powerful Python-based DIGITS (deep learning GPU training system), which provides full visualisation of active neural networks, specifically convolutional neural networks. See Appendix P for further details.
Data must be converted to Caffe-format but the full framework is automated and open-sourced to Github: https://github.com/BVLC/caffe by the BVLC http://caffe.berkeleyvision.org
Caffe tends to use models with ReLUs and states a 17x speedup in deep learning with AlexNet on GPU. cuDNN has recently been upgraded, and timings by NVIDIA (March 2015) on native Caffe running the AlexNet model on ILSVRC2012 with a 16-core Haswell E5-2698 2.3GHz CPU + 1 x 3.6GHz NVIDIA Titan X were:
● CPU - 1 (normalised)
● GPU - 9x
● GPU w/cuDNN v2 - 17x speedup
● 15-30% speedup from cuDNN v1 => v2
‘SuperVision’ (2012 ILSVRC winner) by Krizhevsky, Sutskever and Hinton ran on the Caffe framework. The model comprised a fully supervised CNN with 7 hidden layers, 650,000 neurons, 60M parameters, 630M connections and used ReLU, max pooling, dropout and 224 x 224 patches. The system was trained with SGD on 2 x GPUs for 1 week. Section 4.8.2 also details the R-CNN model by Girshick et al, which uses Caffe, as does the ImageNet 2014 winner, GoogLeNet.
Caffe refers to 4D tensors (CUDA) as “blobs”. The user simply needs to define the bottom and top ‘blobs’ and Caffe will connect the network. The model is built on blobs, layers, and nets. "As data and derivatives flow through the network in the forward and backward passes Caffe stores, communicates, and manipulates the information" (Caffe website). The blob is a unified memory interface, a 4D array wrapper over the data being processed. The layers are the foundation of both the model and computation. The net is the collection and connection of layers.
The 4D blob stores data in row-major C-contiguous order of (Num (N), Channel (K), Height (H), Width (W)), where N signifies the batch size (in ImageNet training a batch is normally 256 images) and K is the feature dimension, e.g. for RGB images K = 3. A blob holding 1000 vectors of 16 feature dimensions: 1000 x 16 x 1 x 1. A convolution layer blob with 96 filters of 11 x 11 and 3 inputs: 96 x 3 x 11 x 11. An inner fully-connected layer blob with 1000 output channels and 1024 input channels: 1 x 1 x 1000 x 1024.
Blobs store both data and diff (the gradient computed by the network) in 2 separate chunks of memory, and are a means to abstract away the computational complexity of synchronizing from the CPU host to the GPU device. Since values could be stored on the CPU and/or the GPU, access is via either 'const' or 'mutable':
const Dtype* cpu_data() const;
Dtype* mutable_cpu_data();
It is good practice to use 'const' since GPU programming is most efficient when CPU-GPU data transfer is minimized. Data should be loaded from disk to a blob in CPU code, which calls a device kernel to do GPU computation. All layers should have GPU implementations, to keep all data and gradients in the GPU. Full installation details at: http://caffe.berkeleyvision.org/installation.html#prerequisites
On the K40, every 20 iterations, Caffe's ImageNet solver takes about 26.5s to run, i.e. ~5.2 ms per image for the full forward-backward pass. About 2 ms of this is on the forward pass, and the rest on the backward pass.
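Torch uses the same N x C x H x W layout for its 4D tensors, so the blob shapes above map across directly; a small illustrative sketch (shapes mirror the examples in the text):
batch   = torch.Tensor(256, 3, 224, 224)   -- a minibatch of 256 RGB images
filters = torch.Tensor(96, 3, 11, 11)      -- a convolution filter bank: 96 filters, 3 inputs, 11 x 11
print(batch:size())                        -- 256 3 224 224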
6.4 ConvNet Alex 'alexNet' Krizhevsky, PhD student of Geoff Hinton and now Google employee currently maintains the Cuda-ConvNet codebase, recently upgraded to cuda-convnet2. The C++/ CUDA/ Python code is open-source with extensive install instructions and documentation available at https://code.google.com/p/cuda-convnet2/ . Improvements
include optimization for all Kepler-generation NVIDIA GPUs (Geforce Titan, K20, K40, K80) and future generations such as Tegra. Coupled with CUDA's own cuDNNv2 there are impressive speedups being made (Google currently leads the error-rate race with this code). See section 11 . Multi-GPU training is now further supported with implementations for data-, model- and Krizhevsky's own ‘OWT’ hybrid-parallelization. Layers are defined via simple file definitions and take a neuron=x parameter, e.g. neuron=relu, to describe output from each layer. The code will train any (y x y) size e.g. 512 x 512 pixel image.
NB: A Fermi-generation GPU (GTX 400-series or newer, or Tesla) is required, as is a 64-bit Linux (Ubuntu) OS, the full CUDA toolkit and the CUDA SDK. Various developers have tried to adapt the code to Windows, but resistance is futile. There are, as usual, many prerequisite dependencies, listed with install advice on the wiki, including python-dev, python-numpy, python-magic (libmagic bindings), python-matplotlib for visualisations, OpenCV, git and, essentially, the ATLAS development libraries/headers (libatlas-base-dev).
The codebase can be cloned with:
$ git clone https://code.google.com/p/cuda-convnet2/
Build problems are common with CentOS; it is advised to set up with straight Ubuntu 14.04+. See the wiki FAQ or Appendix P for DIGITS/Caffe install notes. Time limits
prevented me from running CNN's with straight convNet2 but I ran medical images on the DIGITS system using the GoogLeNet model. See section 11 .
6.5 NVIDIA GRID NVIDIA offered a multi-GPU setup with Boston Ltd in St Albans, Herts., with: 2 x 2660v3 CPUs, 128GB DDR4 memory, 2 x K80s and a 256GB Samsung SSD. Boston had to mount a separate 1TB SSD for the ImageNet download (which requires unpacking to subfolders, effectively doubling the 167GB storage load). Facebook do advise using SSDs, as NAND flash does not require power to retain data and is mounted on circuit boards in a way that makes it very shock resistant; HDDs consist of various moving parts and are therefore more susceptible to damage. If using an HDD or a slow SSD, it is recommended to resize the images to 256 x 256 (keeping the aspect ratio intact) to speed up data loading:
$ find . -name "*.JPEG" | xargs -I {} convert {} -resize "256^>" {}
Full details of test at Appendix K. The Boston setup had far superior (faster) performance even in unpacking the 1000 subfolders of ImageNet and transferring the appropriate (of 1.2 million) images into each. I did not reduce the image size, as I did for the csfyp01 test.
fbcunn CNN with ImageNet, on 2 x K80s:
$ th main.lua -data /disk1/imagenet -nGPU 4 -nDonkeys 8 -backend cudnn -netType alexnetowt
Since each K80 comprises 2 x GK210 GPUs the setting was -nGPU 4; -nDonkeys 8 uses 8 separate worker threads pulling the images into the model, e.g.
epoch 1: [982/10000] Time 1.728s ERROR RATE 6.9078%
equal to 1 epoch in roughly 50 minutes. See full test results for comparison at section 11 .
6.6 Leeds ARC2 The Advanced Research Computing (ARC) department at the University of Leeds "provide and support a large HPC resource to support researchers across the University research community". Computational resources of 3,040 cores (380 x 8-core Intel CPUs across 190 blades) are available via ssh, "off the desktop". Over 500TB of disk storage are also available, in addition to support for Cloud services such as Amazon EC2, MapReduce and GPGPU computing ( www.arc.leeds.ac.uk ).
Compute configuration (from https://hec.wiki.leeds.ac.uk ): 190 blades; 380 CPUs; 3,040 cores. Each HP BL460 blade houses one Sandy Bridge server (node). Each node is dual socket with 8-core Intel E5-2670 (2.6GHz) processors (16 cores per node), 32GB of DDR3 1600MHz memory per server, a 500GB hard drive and quad-data-rate (QDR) Connect-X InfiniBand.
Storage: a "Lustre" parallel file system; 2 fail-over pairs, delivering 4GB/s via the InfiniBand network to ~170TB of usable storage mounted on /nobackup.
Network: InfiniBand provides a full-Clos (see footnote 6), 2:1 blocking, 40Gbit/s interconnect to compute blades and access to infrastructure on the edge.
Network topology: all user-facing systems (login and compute) are connected to the InfiniBand network and use it to transfer all data. It is a layered network, with communication latency dependent on the number of (36-port) switch hops required to route between the source and destination devices. Each server has a 4X QDR connection which can send and receive data at 3.6GB/s.
6. A Clos network is a nonblocking switch topology, named after and based on the seminal 1953 paper by Charles Clos, which described how such a network could be constructed using the minimum number of switches.
Each switch has two 4X QDR links up to the core, able to transfer data at ~8GB/s. The latency between servers connected to the same switch is ~1.1 microseconds (1e-6). Between servers connected to different switches, latency is ~1.5 microseconds. Due to complications with using a batch scheduling system and the 20 hr download for ImageNet I decided not to test the Facebook model on ARC2.
6.7 Leeds cloud testbed Recently upgraded to 14 quad-core CPUs over 1Gbit ethernet, the new testbed came online March 9, 2015 managed by OpenNebula, accessed via: $ ssh @csgate1.leeds.ac.uk The cloud enables the testing and configuring of hypervisors, setup and configure of physical and virtual networks and installation of virtual infrastructure managers (VIM). It, of course, also provides unrestricted access to computation resources (tested up to 1000 VM’s). Faculty IT are responsible for the hardware (rack of servers, switches, network/power cables etc), a gateway node (to allow external access to the cloud's private network) and the storage node. Karim Djemame's research group are responsible for configuring the cloud infrastructure that reside on the hardware. D J Armstrong, Dec 2012, wrote his PhD thesis at the University of Leeds on a “Cloud Enhanced Pathology Application”. No GPU's were tested but, assuming a distributed parallelized setup could be coded in CUDA, there would be significantly enhanced scalability and capability, for example, of the Virtual Pathology work. There are two GPU-capable nodes in the cluster already, and with the acquisition of suitable machines the cloud could easily support multi-GPU computation. The granted K40 from my research will be installed in the cloud testbed.
7.0 Virtual Pathology 7.1 Virtual Heart A patient is first given an MRI scan and this is converted into a full 3D model with adaptations made to suit any patient's ailment so that doctors can manipulate the virtualised organ, non-invasively. Assessments for how prone a patient is to arrhythmia, for example, no longer require risky procedures, and doctors have been able to predict with 85% accuracy who is at risk simply by simulating tests. Image processing techniques can identify and map damaged muscle tissue in the walls of the heart's chambers, and images can be used to estimate the orientation of the muscle fibers which determines how electrical signals propagate through the tissue. The patient-specific geometric structure can then be overlaid with the computational model of the inner workings of a generic heart [ 19 ]
Fig 15: IEEE Spectrum Online Magazine Nov 2014 [ 19 ]
7.2 Virtual Pathology at Leeds The digital microscope (Magee et al) assists “analysis and visualisation of gigapixel tissue slides to diagnose disease at a cellular level” [ 65 ] with image processing algorithms
concurrently able to indicate presence or possibility of cancer. The application uses the commercial software ‘Aperio ImageServer’, a high-performance interface which acts as a web server and therefore makes the entire application a good candidate for cloud-based virtualised technology supported by GPUs, not the client-server architecture in current use. The full CPU process pipeline is described in the paper [ 65 ] . The Pathology and Tumour Biology Group, Leeds Institute of Molecular Medicine uses HPC with virtual slides, producing 3D tissue reconstructions at a cellular resolution. Data fusion techniques “allow visualisation of microanatomy and functional information in conjunction with 3D reconstruction” and random forest algorithms explore datasets. An “automated registration process of large amounts of tissue volumes at microscopic resolution give a high throughput of 3D reconstructions involving minimal user interaction” (VPL website). The setup at St James is a DELL R720 with Dual 8-core Xeon processors, 64GB RAM and 6 x 1.2TB 10k SAS HDD with Linux. There is currently no budget for virtual servers with GPU cores but the group are very interested in hearing about capability from NVIDIA.
7.3 Deep Learning in healthcare The paper by Unterthiner, Hochreiter et al [ 1b ] describes how deep learning won the Tox21 Data Challenge (FDA, NIH, EPA), predicting the toxicity of various real-world chemical applications. Co-author Sepp Hochreiter is also responsible for the LSTM-RNN. Toxicity is a large issue in drug development as well as in patient treatments: cytotoxicity can efficiently activate apoptosis in tumour cells but does not discriminate between tumour and healthy cells. Toxicology relies on "hierarchical levels of abstraction .. which maps naturally to DL architecture" [ 1b ]. Tanh and standard deviation were used to normalize the chemical descriptors, which were represented using the standard high-dimensional binary Extended Connectivity FingerPrint (ECFP4), which uses a presence/absence coding, e.g. 0010100011101000. The architecture of the network was approximately 40,000 input
features, with weight parameters stored on 1 x K80 GPU, using 512-sample minibatches for SGD.
The intention in using deep learning for medical image analysis and feature detection is simply to automate the flagging process, reducing time, cost and error for trained medical personnel. Flagged images, classified as "trouble", can then be put aside for trained pathologists to further analyse size and shape, or discard. Cancer is a worldwide problem. Digitisation of processes allows comparisons across the entire world's histology laboratories for instant recognition and analysis, as IBM's Watson is proving today in personalised treatments.
Mitosis is simply cell division. Detection uses the gold-standard process of nuclear staining with haematoxylin and eosin to aid comparison. The majority of stages in mitotic nuclei look the same as non-mitotic ones, and it was thought that only a highly trained human observer could differentiate the two classes. CNNs have been proved, for 3 years now, to also be capable of making that profound differentiation between normal cell division and the typical "clumping" that may indicate abnormal tumour growth. The MICCAI competitions have proved the concept to be commercially viable for detection of the signs of cancer. The parties involved are not open-sourcing models where hyperparameters have been painstakingly set to achieve multi-model high accuracy rates.
Training a detector on labelled images such as those provided by the MICCAI dataset allows each pixel to be assigned either as 'trouble' or not. Constructing the training dataset is out of scope of this report, but it is important to note that the problem is rotationally invariant, so additional training instances can simply be generated by transforming windows (rotation and/or mirroring) [ 60 ]. Windows containing a visible sign around the centre are 'detected'; others containing off-centre or zero signs are dismissed (off-centre detections are generalised as duplicates of the centrally-detected sign to an acceptable level of accuracy, proven to be in the region of an F1 score of 0.611) [ 60 ]. IDSIA's system took only 3 days to train on a single GPU.
The feature vectors produced are classified by softmax (2 classes: the probability of being 'trouble' or not) in the fully connected layers, with weights corrected using SGD. "Each convolutional layer performs a 2D convolution of its input map" [ 60 ]. Feature maps from previous layers have their activations summed with a ReLU nonlinear function and
max-pooling layers output winning features (not sub-sampling layers) via calculation of the "maximum activations over non-overlapping square regions". The paper [ 60 ] explains in more detail how to process a new image using probability of proximity to the centroid of a smoothed mitotic area but the key to IDSIA's success has been in training with a number of different architectures of networks and various rotations and mirrors of source images to reduce variance, averaging outputs for an overall result. This complexity is merely another layer of the complexity already mastered in CNNs, and simply another operation that both deep learning and GPUs take in their stride, to aid medical professionals around the world. Andre Esteva of Stanford very recently [ 1a ] ran the public DermaNet dataset of 23,000 images, 23 classes of skin-disease through a series of the 2014 VGG (Oxford University) CNN architectures using transfer learning and fine-tuning (via learning rate). The Caffe-based models were pre-trained on ImageNet, downloaded from the BVLC Model Zoo ( https://github.com/BVLC/caffe/wiki/Model-Zoo ) . Combined runs with 5 models (3 x 16 layer (VGG16) CNNs and 2 x 19 layer (VGG19) CNNs) using a binary cancerous v. non-cancerous classifier achieved a 90.0% balanced accuracy . Results were tested against a trained dermatologist and an M.D. student who scored 46% and 52%,
respectively on 115 test images with 5 images from each of the 23 classes. Of these 23 classes 4 are cancerous, 19 not, so to ensure chance == 0.5 Esteva randomly selected 4 image categories from the 19 to test against the 4 cancerous classes, with 100 testing images per category, repeating the runs several thousand times, to average the classification accuracy [ 1a ]. By visualising the learned filters Esteva concluded that they "did not resemble typical gabor-like filters of most architectures, but instead look like color and region detectors". Esteva performed all runs on an NVIDIA Titan Black GPU and a Tesla K20 for comparison but does not list timings in the paper [ 1a ]. Similar to work by IDSIA and Ciresan [ 60 ], the highest accuracies were achieved by averaging results from multiple different models. In 2013, IDSIA were able to train each network in 1 day with their GPU-based MATLAB implementation, taking under 30 epochs to reach a minimum validation (approximately 8 minutes per image). Further details in Appendix L . IDSIA's results outperformed all competitors in 2013 and the lab are now comparing performance to human expert histologists "with the ultimate goal of gradually bringing automated mitosis detection into clinical practice" [ 60 ].
8.0 GPGPU "Brain's are parallel, GPU's are parallel" (D Cox et al, 2009). Image processing is embarrassingly parallel with GPUs. GPU (or GPU-accelerated)
computing is the use of GPUs to accelerate applications, hosted by CPU. “Pioneered in 2007 by NVIDIA, GPU accelerators offer unprecedented application performance by
offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU” (www.NVIDIA.com).
Fig 16. Credit NVIDIA
General-purpose computing on graphics processing units (GPGPU) is the utilization of GPUs either alone or in a pipeline of parallel processing between one or more GPUs and CPUs. GPU programming is an art, often requiring weeks just to optimize; due to time constraints I used code and libraries already optimized for the most efficient transfer of control to the GPU. The best applications have both serial code and parallelized sections, offloading specifically the parallelizable "cut" of the work to the GPU. Shared memory gives the best results, managed as a cache, with performance tuned by trading off thread count per block, shared memory size and register count.
A single GPU is most efficient when processing large numbers of computations (1000 matrices or more up to the limit of its memory). Data parallelism (DP) is a technique allowing multiple GPUs to process separate mini-batches of matrices and, in CNNs or RNNs, then combine per-mini-batch computed gradients at each iteration. Typically 2× or 4× DP with zero-padding, where necessary, will yield speedup but suffers from diminishing returns i.e processing 2× as many examples on 2× as many GPUs != 2× speedup in training. DP: workers train the same model on different data examples, share weights or gradients, batch size increases with no. of workers. Since workers exchange messages of size proportional to the number of weights DP is efficient when the computation per weight is high.
Model parallelism (MP): workers train different parts of the same model on the same data examples, sharing neuron activations. Since workers exchange messages of size proportional to the number of neurons, MP is efficient when the computation per neuron is high.
To scale further, MP partitions the model itself across GPUs, each worker computing the forward and backward passes for its own part of the model. In 2014, Krizhevsky published "a new way to parallelize the training of convolutional neural networks across multiple GPUs" using both DP and MP. He realised that the convolutional layers comprise 90-95% of all computation and so should be parallelized by DP, while the fully connected layers, with 95% of the parameters, should be parallelized by MP. I tested AlexNetOWT with the K80 setup and it worked extremely fast; see section 11 for results. The 2014 paper fully explains all processes, but these are out of scope due to the length limits of this report. The title of Krizhevsky's model, "One Weird Trick", comes from the fact that he could not explain the "discrepancy between theory and practice" as to why his heuristic of multiplying the learning rate by k when multiplying the batch size by k worked for batch sizes up to around 1024, but not beyond.
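As a sketch with illustrative numbers, the heuristic amounts to:
baseLR, baseBatch = 0.01, 128               -- hypothetical starting point
k = 4                                       -- e.g. moving from 1 to 4 GPUs under data parallelism
lr, batchSize = baseLR * k, baseBatch * k   -- 0.04 and 512; reported to hold up to batch sizes of ~1024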
Fig 17. K workers train a CNN with three convolutional layers and two fully connected layers
The fbfft and fbcuNN implementations, discussed in more detail in section 9.2 depend entirely upon the NVIDIA Kepler GPU architecture. "NVIDIA GPUs execute code at the
granularity of a warp” [ 14 ] which is defined as an atomic unit of execution, consisting of a set of 32 threads.
“Each thread is assigned a lane within the warp and executes in a single instruction, multiple thread manner. A warp holds a single program counter and can therefore only execute a single instruction at a time across all its threads. Collections of warps are brought together in blocks which together share a region of fast shared memory resident on chip. Blocks of warps can only exchange data via slower global memory, either resident on the GPU or in the host CPU’s address space” (www.NVIDIA.com). Communication overheads are dramatically reduced in CUDA kernels by keeping data in registers as much as possible. Many warps can run in parallel, hiding long latency operations via fast context switching by the GPU hardware. Finite resources of registers and shared memory are partitioned by the compiler. “While a CPU consists of a few cores optimized for sequential serial processing, a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously” (www.NVIDIA.com).
9.0 Implementation The biggest surprise of this whole project was realising that GPUs are so new that not even Leeds is using them. The ARC2 cluster for advanced research is purely CPU-based with 3040 cores of 8-core Intel E5-2670 2.6GHz processors and fast disk storage with Infiniband allowing data transfer rates of around 3.2GB/s. After meeting with Martin Callaghan I was told that ARC3 will now be upgraded with GPUs and I arranged for NVIDIA to run another workshop with the ARC team and any other interested parties at Leeds. In comparison to the ARC2, installing 1 x NVIDIA Tesla K40 will provide a memory bandwidth of 288GB/sec, 12GB of GDDR5 - high Double Data Rate type five synchronous graphics RAM and 2,880 CUDA cores. I submitted an Academic Hardware Request for a K40 to NVIDIA, at the suggestion of Derek Magee (Feb 9, 2015). This request was granted with thanks to the NVIDIA Academic Research Program, http://research.nvidia.com and we set up the card on another machine csfyp01 March 5, 2015.
NVIDIA state their K40 outperforms CPUs by up to 10x, applications run up to 40% faster than the predecessor Tesla K20 and the graphics card features NVIDIA GPU Boost™ for up to 25% speedup, streaming multiprocessor (SMX) for 3x the workload with the same power budget and of course NVIDIA’s Kepler™ architecture is the world’s fastest, most efficient. Error Correcting Codes (ECC) protection for internal memory and ECC plus Dynamic Page Retirement for external GDDR5 addresses accuracy and reliability. This facility can be switched on or off. Peak double precision floating point performance: 1.43 Tflops (K40) 1.17 Tflops (K20). Peak single precision floating point performance: 4.29 Tflops (K40) 3.52 Tflops (K20). The Tesla K80 offers 2x Kepler GK210 GPU's with a total of 4,992 cores. Launched in November 2014 the GPU has a memory bandwidth of 480 GB/sec, gives 8.74 TFlops single precision performance with ECC protection for the 24GB onboard GDDR5 RAM. Despite the K40 costing $5,499 the K80, thanks to the Law of Accelerating Returns, is only $5,000 (£3,438.50 as of Feb, 2015, Scan Computers Ltd quote in Appendix C ).
Fig 18. Tesla K80. Credit NVIDIA
9.1 General information for testing and implementation CNNs require large amounts of computation, especially during the training phase and large amounts of training data to achieve high accuracy. I initially worked on a Terminal virtual machine environment pre-configured by Stanford University, working with familiar libraries such as numpy, scipy and matplotlib. Increasing computational resources allowed me to work with ImageNet , the largest
dataset in the world. ImageNet currently has 14,197,122 total images, of which 1,034,908 have bounding box annotations and 1.2 million with SIFT features, “tagged by humanity” via Stanford and Amazon’s Mechanical Turk crowdsourcing initiative. LeCun, and his laboratory at Facebook use Torch7 , “a numerical/scientific computing extension of LuaJIT with an ML/neural net library on top” (www.torch.ch). It has
significant performance advantages over Python/Theano and interfacing C/C++/CUDA code is very simple. Google DeepMind uses Torch7 because LeCun’s former student and Torch co-maintainer Koray Kavukcuoglu now works for DeepMind. The Google Brain group in Mountain View also use Torch7 with “custom CUDA backends for fast/parallel convolutional network training” [ 28 ]. CNN’s were run on the C2075 GPU on cslin155 by
the end of February, with a 23.7 x speedup against a 12-hyperthreaded single 2.66GHz CPU.
9.2 fbcuNN Facebook AI Research (FAIR) recently (Jan 2015) open-sourced their optimized modules for Torch7, which are significantly faster ( fbcunn ). The release includes GPU-optimized modules for running large convolutional nets as well as networks with sparse activations that are commonly used in NLP applications. It also includes custom CUDA kernels built around NVIDIA's cuFFT library; further details are given later in this report and at https://research.facebook.com/blog/879898285375829/fair-open-sources-deep-learningmodules-for-torch . The Torch/fbcunn implementation allows the following models to be run efficiently: -netType AlexNet , AlexNetowt , GoogleNet , Overfeat , and VGG (Visual Geometry Group, University of Oxford).
9.3 Torch 7 in detail "Torch is actually not extremely difficult to learn — unlike, say, the Theano library. We’ve made it incredibly easy to use. We introduce someone to Torch, and they start churning out research really fast" (Soumith Chintala, Facebook/Torch) Torch is a scientific computing framework with wide support for machine learning algorithms. Full installation details at Appendix E . Core features include powerful N-dimensional array, multiple routines for indexing, slicing, transposing, interface to C, via LuaJIT, linear algebra routines, neural network, and energy-based models, numeric optimization routines, fast and efficient GPU support, embeddable, with ports to iOS and Android. Torch comes with a large ecosystem of community-driven packages in machine learning, computer vision, signal processing, parallel processing, image, video, audio and networking among others, and builds on top of the Lua community. At its heart are the neural network and optimization libraries which are simple but offer maximum flexibility in implementing complex neural network topologies.
Many thanks to Georg Ostrovski [ 9 ] at DeepMind for the tutorial on Tensors, available on GitHub. The Torch Tensor class is the most important class in Torch. Almost every
package depends on this class. It is the class for handling numeric data. Tensors are serializable, potentially multi-dimensional matrices. The number of dimensions is unlimited and can be specified using a LongStorage:
-- creation of a 4D-tensor 4x5x6x2
z = torch.Tensor(4,5,6,2)
-- for more dimensions, use LongStorage:
s = torch.LongStorage(6)
s[1] = 4; s[2] = 5; s[3] = 6; s[4] = 2; s[5] = 7; s[6] = 3
x = torch.Tensor(s)
The number of dimensions of a Tensor can be queried by nDimension() or dim(). The size of the i-th dimension is returned by size(i), and a LongStorage containing all the dimensions can be returned by size().
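For example, checking these accessors from the th interpreter (output shown as comments):
z = torch.Tensor(4,5,6,2)
print(z:nDimension())   -- 4
print(z:dim())          -- 4 (equivalent)
print(z:size(2))        -- 5
print(z:size())         -- 4 5 6 2 (a torch.LongStorage of size 4)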
9.4 Internal data representation The actual data of a Tensor is contained in a Storage .
'Storages' are how Lua accesses memory of a C pointer or array. Storages can also map the contents of a file to memory. A Storage is an array of basic C types. Several Storage classes for all the basic C types exist and have the following self-explanatory names: ByteStorage, CharStorage, ShortStorage, IntStorage,
LongStorage, FloatStorage, DoubleStorage. ByteStorage and CharStorage represent both
arrays of bytes. ByteStorage represents an array of unsigned chars, while CharStorage represents an array of signed chars.
Conversions between two Storage types can be done using copy:
x = torch.IntStorage(10):fill(1)
y = torch.DoubleStorage(10):copy(x)
Data in a Tensor can be accessed using storage(). While the memory of a Tensor has to be contained in this unique Storage, it might not be contiguous: the first position used in the Storage is given by storageOffset() (starting at 1), and the jump needed to go from one element to another element in the i-th dimension is given by stride(i). In other words, given a 3D tensor
x = torch.Tensor(7,7,7)
accessing the element (3,4,5) can be done by
x[3][4][5]
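The following sketch shows how storageOffset() and stride() locate an element in the underlying Storage (a freshly created, contiguous Tensor(7,7,7) has strides 49, 7 and 1):
x = torch.Tensor(7,7,7)
x[3][4][5] = 42
s = x:storage()
print(x:stride(1), x:stride(2), x:stride(3))                              -- 49  7  1
idx = x:storageOffset() + 2*x:stride(1) + 3*x:stride(2) + 4*x:stride(3)   -- (i-1), (j-1), (k-1) steps
print(s[idx])                                                             -- 42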
A Storage only represents a chunk of memory, while the Tensor interprets this chunk of memory as having dimensions.
Tensor factorisation methods, used in particular by Theano and CUDA libraries, work on the basis that data has unobserved properties (as seen in clustering) and patterns and behaviours that can be found with the 40-year-old technique of expectation maximisation, or maximum likelihood learning. This involves guessing and then updating parameters iteratively until a set threshold for significant results is met. Unfortunately this technique is non-convex: it will produce different results each time, as there are an exponential number of regions in the problem space that you may land in - many different local maxima - and finding the global maximum is very hard in this sense. However, to efficiently find the same local maximum you can now compute a large table of statistics and use it to recover the parameters which found that local maximum, which essentially become identifiable "moments". Simple matrix decomposition (with tensors) can now essentially guarantee good results.
10.0 NVIDIA CUDA After its success in the gaming environment, which has fuelled GPGPU, NVIDIA became involved with Elon Musk's Tesla Motors, providing the GPUs for each model. It also started working with many of the Ivy League US university DL researchers, funding and developing libraries such as cuDNN. Further GPGPU programming advice is in Appendix F.
cuDNN has recently been upgraded to v2, with improvements focused on DL performance and optimization for current and future GPUs (Maxwell, Tesla, Titan, Tegra X1). NVIDIA are currently working on supporting 3D datasets, e.g. point clouds. Timings by NVIDIA (March 2015) on native Caffe running the AlexNet model are available at section 6.3 . Recent work with Facebook introduced fbcunn, making it even easier to obtain state-of-the-art performance. 90% of teams in ILSVRC14 used GPUs. cuDNN's convolution routines provided the fastest matrix-multiply implementation, with significantly less memory use, until the Facebook AI Research lab, headed by LeCun, decided to improve the cuFFT implementation too.
Fig 19. Schematic representation of a deep neural network http://devblogs.nvidia.com
The paper by Vasilache, LeCun et al (Dec 2014) describes in detail the technique behind the more effective fbcunn GPU implementation, which is out of scope of this report. NVIDIA also open-sourced their "DIGITS" software, allowing rapid and visualised design and
implementation of the best deep neural network for your chosen dataset, see Appendix P . The software allows combination of results from several different “tweaked” versions of models, the technique currently leading to the highest accuracies in image recognition. By open sourcing fbcunn (Jan 2015) the Facebook laboratory hope to also take advantage of community-led enhancements to algorithms too.
The CUDA platform includes the CUDA C/C++ programming language, the CUDA toolkit, parallel computing extensions, powerful “drop in” accelerated libraries and cloud based computation. NVIDIA work alongside Udacity in offering free online courses in GPU programming, which I undertook as part of my research.
The cuDNN library is targeted at developers of DNN frameworks (eg. CAFFE, Torch). The example code below shows how to allocate storage for an input batch of images and a convolutional filter and how to run the batch in the forward direction through a convolutional layer ( http://devblogs.nvidia.com/ ) . The calls to cudnnSetTensor4dDescriptor() and cudnnSetFilterDescriptor() define the input to this convolutional layer and filter parameters, respectively.
The call to cudnnSetConvolutionDescriptor() initializes the convolution descriptor for this layer using the descriptors from the previous two calls and some layer-specific information such as padding and striding parameters. cudnnGetOutputTensor4dDim() calculates the dimensions of the convolution’s output.
The next calls simply configure and allocate storage for the output of this layer, and then cudnnConvolutionForward() performs the NVIDIA-tuned convolution.
/* Allocate memory for Filter and ImageBatch, fill with data */
cudaMalloc( &ImageInBatch , ... );
cudaMalloc( &Filter , ... );
...
/* Set descriptors */
cudnnSetTensor4dDescriptor(InputDesc, CUDNN_TENSOR_NCHW, 128, 96, 221, 221);
cudnnSetFilterDescriptor(FilterDesc, 256, 96, 7, 7 );
cudnnSetConvolutionDescriptor(convDesc, InputDesc, FilterDesc, pad_x, pad_y, 2, 2, 1, 1, CUDNN_CONVOLUTION);
/* query output layout */
cudnnGetOutputTensor4dDim(convDesc, CUDNN_CONVOLUTION_FWD, &n_out, &c_out, &h_out, &w_out);
/* Set and allocate output tensor descriptor */
cudnnSetTensor4dDescriptor(&OutputDesc, CUDNN_TENSOR_NCHW, n_out, c_out, h_out, w_out);
cudaMalloc(&ImageBatchOut, n_out*c_out*h_out*w_out * sizeof(float));
/* launch convolution on GPU */
cudnnConvolutionForward(handle, InputDesc, ImageInBatch, FilterDesc, Filter, convDesc, OutputDesc, ImageBatchOut, CUDNN_RESULT_NO_ACCUMULATE);
10.1 Fast Fourier Transform A Fourier transform converts a wave in the time domain to the frequency domain (a plot of frequency v. magnitude (amplitude) on an x-y graph). The FFT is a divide-and-conquer algorithm that computes the discrete Fourier transform (DFT) and its inverse rapidly, by factorising the DFT matrix into sparse factors. While computing the DFT of N points in the naive way has a complexity of O(N²), the FFT can compute the same in only O(N log N), meaning massive reductions in time and resources on large datasets. "FFT is one of the most important and widely used numerical algorithms in computational physics and general signal processing" (NVIDIA documentation). The cuFFT library provides a simple interface for computing FFTs on an NVIDIA GPU. "The engineering effort to improve the GPU convolution layers [has achieved] speedups of up to 23.5x compared to the fastest publicly available code" (kernel size 13x13) [ 28 ]. The backpropagation algorithm computes the gradient during training. A custom CUDA implementation of the FFT algorithm was designed by Mathieu et al to parallelize over feature maps and within each 2D transform (2D FFTs parallelize well as they can be broken into two 1D FFTs, one over rows and the other over columns, operating on each in parallel) [ 32 ]. Facebook then developed two new FFT convolution implementations based on NVIDIA's cuFFT library, and another "from-scratch ..of batched 1D FFT and batched 2D FFT", called Facebook FFT ( fbfft ), authored by Vasilache et al, Dec 2014, that gives over 1.5x speedup over cuFFT. Along with NVIDIA's cuDNN implementation, all are now open source (Jan, 2015).
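To give a sense of the scale of that saving, a rough operation count for a 2^20-point transform (illustrative numbers only):
N = 2^20
naive = N * N                             -- direct DFT: ~1.1e12 operations
fft   = N * (math.log(N) / math.log(2))   -- FFT: ~2.1e7 operations
print(naive / fft)                        -- roughly 50,000x fewer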
10.2 Convolution (cuFFT) For an excellent explanation of Caffe convolution, out of scope of this report, please refer to author Yangqing’s memo at: https://github.com/Yangqing/caffe/wiki/Convolution-in-Caffe:-a-memo The NVIDIA cuFFT implementation uses the following naming convention:
input - x_(s,i)
weight - w_(j,i)
output - y_(s,j)
gradOutput - ∂L/∂y_(s,j)
gradInput - ∂L/∂x_(s,i)
gradWeight - ∂L/∂w_(j,i)
where s indexes the minibatch sample, i an input feature plane and j an output feature plane. Torch 7's CUDA backend only supports single precision (higher performance than double precision), and as such all of the above are stored as single-precision floating point 4D tensors in row-major layout. "Multiple factors influence the computational efficiency of FFTs: transform size n, n's prime factor decomposition, and whether batched or iterated single transforms are applied" [ 14 ]. cuFFT implements FFTs with the Cooley-Tukey algorithm (Cooley & Tukey, 1965), which "uses trigonometric equalities to recursively decompose and reuse computations" [ 14 ]. cuFFT is therefore highly optimized for input sizes of the prime factor form 2^a × 3^b × 5^c × 7^d; the smaller the prime factors, the better the performance (sizes of the form 2^n are fastest), though Vasilache et al use zero-padding of the image and kernel to perform the FFT efficiently at any larger size (this gives poor performance for small kernel sizes, e.g. 3 x 3). Full details of the in-order computations are given in the paper [ 14 ]. Forward propagation inputs can be formally written as a set f of input feature planes x_i, i ∈ f. These are cross-correlated with f′ × f different filter kernel weights w_(j,i) (height h / width w), j ∈ f′, i ∈ f, producing output feature planes y_j, j ∈ f′ [ 14 ].
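As an illustration of the size constraint, a small hypothetical helper (not part of cuFFT or fbcunn) that finds the smallest size >= n whose only prime factors are 2, 3, 5 and 7 - the kind of size one would zero-pad up to:
function goodFFTSize(n)
   local m = n
   while true do
      local r = m
      for _, p in ipairs({2, 3, 5, 7}) do
         while r % p == 0 do r = r / p end
      end
      if r == 1 then return m end
      m = m + 1
   end
end
print(goodFFTSize(221))   -- 224 = 2^5 x 7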
Fig 20. Comparison of convolution and correlation by Cmglee via Wikimedia Commons
LeCun’s CNNs consist of convolutional and subsampling layers, with each neuron associated with a fixed location in the image and “the region of the input image that influences the response of the neuron” [ 67 ] likened to our biological receptive fields (the weight vector). “At each location of each layer, there are a number of different neurons, each with its set of input weights, associated with neurons in a rectangular patch in the previous layer. The same set of weights, but a different input rectangular patch, are associated with neurons at different locations” [ 67 ]. This concept of "fan-in" of neurons (a few inputs per neuron) appears to solve the problem of vanishing or exploding gradients. The hierarchical local connectivity structure also acts as a strong prior, setting the parameters in an optimal region (all non-connections have zero weight) and optimizing gradient-based algorithms further.
Basic modules for a CNN (Ronan, Clément, Koray and Soumith, www.Torch.ch):
Filter layer: the input is a 3D array with n1 2D feature maps of size n2 x n3. Each component is denoted x_ijk, and each feature map is denoted x_i. The output is also a 3D array, y, composed of m1 feature maps of size m2 x m3. A trainable filter (kernel) k_ij in the filter bank has size l1 x l2 and connects input feature map x_i to output feature map y_j. The module computes y_j = b_j + Σ_i k_ij ∗ x_i, where ∗ is the 2D discrete convolution operator and b_j is a trainable bias parameter. Each filter detects a particular feature at every location on the input.
Non-linearity layer: a pointwise tanh() function or ReLU applied to each site (ijk). The rectified variant Rabs, abs(g_i · tanh(·)) where g_i is a trainable gain parameter, is sometimes followed by a subtractive and divisive local normalization N, which enforces local competition between adjacent features in a feature map, and between features at the same spatial location.
Feature pooling layer: this layer treats each feature map separately, computing the average (or, for max pooling, the maximum) value over a neighbourhood in each feature map. The neighbourhoods are stepped by a stride > 1 (but no larger than the pooling neighbourhood), reducing the resolution of each feature map.
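A minimal Torch sketch of one convolutional stage, combining the three module types just described (layer sizes are illustrative only):
require 'nn'
stage = nn.Sequential()
stage:add(nn.SpatialConvolution(3, 16, 5, 5))  -- filter bank: 3 input maps -> 16 output maps, 5x5 kernels
stage:add(nn.Tanh())                           -- pointwise non-linearity (nn.ReLU() is the common alternative)
stage:add(nn.SpatialMaxPooling(2, 2, 2, 2))    -- max pooling over 2x2 neighbourhoods, stride 2
out = stage:forward(torch.rand(3, 32, 32))     -- e.g. a 3 x 32 x 32 input image
print(out:size())                              -- 16 x 14 x 14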
-- training routine: one epoch of minibatch optimisation over trainData, with the
-- optimiser (CG / LBFGS / SGD / ASGD) selected via opt.optimization
function train()
   -- epoch tracker, timer and a shuffled ordering of the training set
   epoch = epoch or 1
   local time = sys.clock()
   shuffle = torch.randperm(trsize)

   -- do one epoch
   print('==> doing epoch on training data:')
   print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
   for t = 1,trainData:size(),opt.batchSize do
      -- disp progress
      xlua.progress(t, trainData:size())

      -- create mini batch
      local inputs = {}
      local targets = {}
      for i = t,math.min(t+opt.batchSize-1,trainData:size()) do
         -- load new sample
         local input = trainData.data[shuffle[i]]:double()
         local target = trainData.labels[shuffle[i]]
         table.insert(inputs, input)
         table.insert(targets, target)
      end

      -- create closure to evaluate f(X) and df/dX
      local feval = function(x)
         -- get new parameters
         if x ~= parameters then
            parameters:copy(x)
         end

         -- reset gradients
         gradParameters:zero()

         -- f is the average of all criterions
         local f = 0

         -- evaluate function for complete mini batch
         for i = 1,#inputs do
            -- estimate f
            local output = model:forward(inputs[i])
            local err = criterion:forward(output, targets[i])
            f = f + err

            -- estimate df/dW
            local df_do = criterion:backward(output, targets[i])
            model:backward(inputs[i], df_do)

            -- update confusion
            confusion:add(output, targets[i])
         end

         -- normalize gradients and f(X)
         gradParameters:div(#inputs)
         f = f/#inputs

         -- return f and df/dX
         return f,gradParameters
      end

      -- optimize on current mini-batch
      if opt.optimization == 'CG' then
         config = config or {maxIter = opt.maxIter}
         optim.cg(feval, parameters, config)

      elseif opt.optimization == 'LBFGS' then
         config = config or {learningRate = opt.learningRate,
                             maxIter = opt.maxIter,
                             nCorrection = 10}
         optim.lbfgs(feval, parameters, config)

      elseif opt.optimization == 'SGD' then
         config = config or {learningRate = opt.learningRate,
                             weightDecay = opt.weightDecay,
                             momentum = opt.momentum,
                             learningRateDecay = 5e-7}
         optim.sgd(feval, parameters, config)

      elseif opt.optimization == 'ASGD' then
         config = config or {eta0 = opt.learningRate,
                             t0 = trsize * opt.t0}
         _,_,average = optim.asgd(feval, parameters, config)

      else
         error('unknown optimization method')
      end
   end

   -- time taken
   time = sys.clock() - time
   time = time / trainData:size()
   print("==> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix, log the accuracy, then reset for the next epoch
   print(confusion)
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      trainLogger:plot()
   end
   confusion:zero()

   -- save/log current net
   local filename = paths.concat(opt.save, 'model.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   print('==> saving model to '..filename)
   torch.save(filename, model)

   -- next epoch
   epoch = epoch + 1
end
Appendix E: Installation instructions for Torch 7
Torch7 ( http://torch.ch/ ) provides a Matlab-like environment for state-of-the-art machine learning algorithms. It provides a very efficient implementation due to the easy and fast scripting language, LuaJIT, and an underlying C implementation.
cslin155 & csfyp01 already have the Torch repo with luaJIT, luarocks, NumPy 1.7 and SciPy 0.14.
List of Lua ‘rocks’ (packages) installed: argcheck, cudnn, cunn, cunnx, cutorch, cwrap, dok, env, fftw3, gnuplot, graphicsmagick, image, itorch, lbase64, lua-cjson, luafilesystem, lzmq, nn, nnx, optim, paths, penlight, qtlua, qttorch, sdl2, signal, sundown, sys, threads, torch, trepl, uuid, xlua
School of Computing installation
NB Facebook AI Research refactored their code for CentOS & multi-GPU installs, available here: https://github.com/soumith/imagenet-multiGPU.torch
If using this install option make sure you follow Soumith's installation instructions to the letter. That may sound obvious, but I made some rookie mistakes, like forgetting to put the validation images back into subfolders (with his simple script), that cost me days of debugging.
*NB individual installs require a non-instant registration for a CUDA developer account (it is advisable also to sign on to the Developer Program to access all software, libraries and tools).
● Create your own 'CuDNN' directory in $home.
NB Later instructions assume the 'CuDNN' directory name, so if choosing another name make sure you make the corresponding changes further down too.
● cd into the directory and create a "lib64" and an "include" directory.
● Grab the CuDNN download at https://developer.nvidia.com/cuDNN (login required*)
○ Download cuDNN 6.5 R2-rc2 for Linux.
○ This includes a file cudnn-6.5-linux-x64-v2-rc2.tgz
● Unpack into /tmp and copy the .h files to the new "include" directory and the .so files/symlinks into the new "lib64" directory:
○ $ tar -xvf cudnn-6.5-linux-x64-v2-rc2.tgz
○ $ sudo cp cudnn-6.5-linux-x64-v2-rc2/*.h /usr/local/cuda/include
○ $ sudo cp cudnn-6.5-linux-x64-v2-rc2/*.so* /usr/local/cuda/lib64
● In the $home directory, to install Torch run:
○ $ git clone https://github.com/torch/distro.git torch --recursive
to download the torch tree + dependencies. Then
● cd torch; ./install.sh
to build and install (to $HOME/torch/install ).
This takes approx 130MB of space (check if your current quota allows; contact IT if necessary to increase it).
● Then add to the end of your ~/.bashrc file, noting any change to another chosen directory name:
○ export LD_LIBRARY_PATH=$HOME/CuDNN/lib64:$HOME/torch/install/lib:$LD_LIBRARY_PATH
○ export PATH=$HOME/CuDNN/bin:$HOME/torch/install/bin:$PATH
to add those directories to your environment (the "install.sh" script will try to do some of this). To build against CuDNN you may need to specify "-I$HOME/CuDNN/include" to pull in header files.
General installation
fbcunn requires minimum Ubuntu 14.04+ and an NVIDIA GPU, then:
$ sudo apt-get install build-essential
If using a virtual machine:
$ sudo apt-get update
$ sudo apt-get install linux-generic
Download the CUDA .deb file for Linux Ubuntu 14.04 64-bit from: https://developer.nvidia.com/cuda-downloads (possibly called cuda-repo-ubuntu1404_6.5-14_amd64.deb )
Install:
$ sudo dpkg -i cuda-repo-ubuntu1404_6.5-14_amd64.deb
$ sudo apt-get update
$ sudo apt-get install cuda
$ echo "export PATH=/usr/local/cuda/bin/:\$PATH; export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:\$LD_LIBRARY_PATH; " >>~/.bashrc && source ~/.bashrc
Restart the computer.
Install CuDNN
Go to https://developer.nvidia.com/cuDNN (login to download)
Download cuDNN 6.5 R2-rc2 for Linux.
This includes a file cudnn-6.5-linux-x64-v2-rc2.tgz , then:
$ tar -xvf cudnn-6.5-linux-x64-v2-rc2.tgz
$ sudo cp cudnn-6.5-linux-x64-v2-rc2/*.h /usr/local/cuda/include
$ sudo cp cudnn-6.5-linux-x64-v2-rc2/*.so* /usr/local/cuda/lib64
Install Torch dependencies:
$ curl -sk https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
Install Torch in a local folder:
$ git clone https://github.com/torch/distro.git ~/torch --recursive
To uninstall torch: rm -rf ~/torch
Install Folly, fbthrift, thpp and fblualib:
$ curl -sk https://raw.githubusercontent.com/soumith/fblualib/master/install_all.sh | bash
$ cd ~/torch; ./install.sh
Install fbcunn:
$ git clone https://github.com/torch/nn && cd nn && git checkout getParamsByDevice && luarocks make rocks/nn-scm-1.rockspec
$ git clone https://github.com/facebook/fbcunn.git
$ cd fbcunn && luarocks make rocks/fbcunn-scm-1.rockspec
# go get a coffee
The Torch core is a general numeric library, built around neural network training, image & video processing, parallel computing and much more. The Torch ecosystem relies on Luarocks for package distribution. Each package is distributed via a Git repository. Main packages are hosted on the main Torch portal at https://github.com/torch Torch7 relies on the LuaJIT dynamic interpreter, similar to Python with high flexibility and interoperability but with an unmatched (assembler written, state-of-the-art) low memory footprint. To install see http://luajit.org/install.html or: $ luajit
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
>
BUT the raw LuaJIT interpreter is very basic, so a package was developed called trepl, a read-eval-print loop (REPL), or shell. To use it, simply replace luajit by th (a binary):
$ th
Features:
● Tab-completion on nested namespaces
● History with arrow up or arrow down
● Easy help with: ? funcname
● Self help: ?
● Shell commands with: $ cmd (example: $ ls)
● Print all user globals with who()
● Import a package's symbols globally with import(package)
● Require is overloaded to provide relative search paths: require('./mylocallib/')
By default, the interpreter just preloads torch. Extra packages, such as 'nn', must be 'required' manually:
$ require 'nn'
$ require 'image'
$ itorch.image(image.mypic());
Full Torch tutorial in .lua at https://github.com/torch/tutorials (run 6._cuda)
Installing LuaJIT and other useful tools
First install IPython and Notebook with:
$ pip install "ipython[notebook]"
Or go to https://github.com/ipython/ipython/releases and download, unpack and install with:
$ python setup.py install
Then Torch ( Appendix E ), then iTorch (an IPython kernel for Torch that uses a web browser for great high-res plotting, rendering and visualization of images, video and audio). iTorch in notebook mode works like any other IPython notebook.
iTorch provides useful inline auto-complete with the TAB key and gives inline help using the ? symbol. See https://github.com/facebook/iTorch#requirements
To get iTorch:
$ git clone https://github.com/facebook/iTorch.git
$ cd iTorch
$ luarocks make
If you don't yet have LuaRocks, the package manager that allows you to install Lua modules as self-contained packages called rocks, you can download and install it with:
$ wget http://luarocks.org/releases/luarocks-2.2.0.tar.gz
$ tar zxpf luarocks-2.2.0.tar.gz
$ cd luarocks-2.2.0
$ ./configure; sudo make bootstrap
$ sudo luarocks install luasocket
$ lua
Lua 5.3.0 Copyright (C) 1994-2015 Lua.org, PUC-Rio
>
To start an iTorch notebook interface, at localhost:8888:
$ itorch notebook
You type code into the code cells and press [Shift] + [Enter] together to execute. The code is similar to Matlab, e.g.
> i = image.mypicture()
> itorch.image(i)
Any code can be saved in a .lua file and executed from the shell with:
$ th filename.lua
Appendix F: Basics of CUDA GPU Programming
CPUs optimise for latency whereas GPUs optimise for throughput, and deep learning requires massive throughput. NVIDIA has a well-established parallel programming model, CUDA, written in C (with many extensions), that allows one program to run on both CPU and GPU. CUDA classes the CPU as the 'host' and the GPU as a 'device' attached to the host. The CUDA compiler splits the code and directs it to either the CPU or the GPU. Example API calls are cudaMemcpy (copying data from the CPU to the GPU, and from the GPU back to the CPU) and cudaMalloc (allocating GPU memory). Invoking programs on a GPU to be computed in parallel is known as 'the host launching kernels on the device'. The typical workflow is cudaMalloc -> cudaMemcpy -> launch kernel -> cudaMemcpy.
The key to good GPU programming is to minimise communication between CPU and GPU and maximise the execution of kernels, since GPUs are highly efficient both at launching large numbers of threads and at running kernels on those threads in parallel. NVIDIA's K80 has a single-precision floating-point performance of 8.74 teraflops (8.74 x 10^12 floating-point operations per second).
The CPU allocates the memory and launches the kernel, expressing the parallelism:
square<<<1, 64>>>(outarray, inarray)   // where 'square' is the name given to the kernel that squares each number
Each thread in a kernel is indexed using the C struct threadIdx, with .x, .y and .z members to support up to 3-dimensional structures. CUDA code is similar in spirit to MPI, but uses a convention of naming data on the host h_ and data on the GPU device d_.
Launch syntax: kernel<<<blocks, threads>>>(arg, arg). The default is 1D; for 2D or 3D launches use the dim3(x,y,z) syntax, e.g.
square<<<dim3(bx,by,bz), dim3(tx,ty,tz)>>>(arg1, arg2)   // a grid of bx*by*bz blocks, each of tx*ty*tz threads
A third argument can be used in the launch, <<<blocks, threads, bytes>>>, which allocates the amount of shared memory per block in bytes.
Deep learning programs are well written when they maximise arithmetic intensity, limit or coalesce CPU-GPU-CPU communication, and break the work down into an appropriate ratio of blocks to threads, or Map(elements, function). CUDA is built on the memory hierarchy local (registers/L1 cache) > shared >> global >> CPU memory, in terms of speed of access.
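As a small illustrative sketch of the dim3 launch configuration just described (this example is mine, not taken from the report's codebase, and the kernel and variable names are hypothetical), the following kernel adds two N x N matrices using a 2D grid of 2D blocks; only the kernel and the launch lines are shown:

__global__ void matAdd(const float * d_a, const float * d_b, float * d_c, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y indexes rows
    if (row < n && col < n)
        d_c[row * n + col] = d_a[row * n + col] + d_b[row * n + col];
}

// launch: 16x16 threads per block, enough blocks to cover the whole matrix
// dim3 threads(16, 16);
// dim3 blocks((n + 15) / 16, (n + 15) / 16);
// matAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);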
The following code specifies 1 block running 64 threads. Modern GPUs can support 1024 threads per block, all running in parallel.

#include <stdio.h>

// this is the actual kernel, using the declaration specifier __global__
__global__ void square(float * d_out, float * d_in)
{
    int idx = threadIdx.x;
    float f = d_in[idx];   // f is in local cache memory and therefore access is v. v. fast
    // a trick is to read global memory into a local variable like f and work on that copy
    d_out[idx] = f * f;
}

int main(int argc, char ** argv)
{
    const int ARRAY_SIZE = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    float * d_in;
    float * d_out;

    // allocate GPU memory
    cudaMalloc((void **) &d_in, ARRAY_BYTES);
    cudaMalloc((void **) &d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel on 1 block of 64 elements
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy the result array back to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

To compile with the NVIDIA C Compiler (nvcc):
$ nvcc -o square square.cu
$ ./square
Strides relate to row-major order and memory access. Accessing adjacent memory locations is best, since the accesses are contiguous. If the GPU has to access every other memory location, rather than contiguous memory, the access is strided. The larger the stride, especially if it means jumping to other memory chunks, the longer the access takes and the lower the performance. The array {1,2,3; 4,5,6; 7,8,9} is stored as {1,2,3,4,5,6,7,8,9} in row-major order and as {1,4,7,2,5,8,3,6,9} in column-major order.
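To make the row-major layout concrete, here is a small host-side sketch (my own illustration, not from the report's code): element (r, c) of a row-major array lives at index r * COLS + c, so walking along a row touches adjacent locations, while walking down a column jumps COLS elements per step.

#include <stdio.h>
#define ROWS 3
#define COLS 3

int main(void)
{
    int a[ROWS * COLS] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   /* row-major storage */

    /* contiguous access: consecutive iterations touch adjacent elements */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            printf("%d ", a[r * COLS + c]);   /* prints 1..9 in storage order */
    printf("\n");

    /* strided access: each step jumps COLS elements (a column-major traversal) */
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            printf("%d ", a[r * COLS + c]);   /* prints 1,4,7,2,5,8,3,6,9 */
    printf("\n");
    return 0;
}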
AoS, SoA
An array of structures (AoS) stores, for example: {f,i,f,i,f,i,f,i,f,i,f,i}. A structure of arrays (SoA) is the transpose of an AoS: {f,f,f,f,f,f}, {i,i,i,i,i,i}. On a GPU the SoA layout is generally preferred, because neighbouring threads then read neighbouring (coalesced) memory locations.
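A short sketch of the two layouts (again my own illustration; the struct names are hypothetical). In the AoS version the float and int fields are interleaved in memory, whereas in the SoA version each field is contiguous, which is what lets neighbouring GPU threads read neighbouring addresses:

/* Array of structures (AoS): memory holds f,i,f,i,f,i,... */
struct ElemAoS {
    float f;
    int   i;
};
struct ElemAoS aos[1024];

/* Structure of arrays (SoA): memory holds f,f,f,... then i,i,i,... */
struct ElemSoA {
    float f[1024];
    int   i[1024];
};
struct ElemSoA soa;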
The NVIDIA Kepler series (e.g. the K40 and K80) consists of thousands of CUDA cores organised into streaming multiprocessors, each with its own shared memory accessible to all threads of the blocks resident on it (threads also have their own, fastest, local memory and registers). All threads from all kernels can also read and write global memory, which is also accessible from CPU host memory. The GPU is responsible for allocating blocks of threads to each SM (known as SMX in the Kepler series because of its speedup), so the programmer does not know which blocks run on which SMX. This is great for scalability, but deadlocks must be considered, using synchronisation techniques such as barriers.
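As a minimal sketch of block-level shared memory and a barrier (my own example, not from the report's code; the kernel name is illustrative), each block below reverses a 64-element tile. The __syncthreads() barrier guarantees that every thread has written its element into shared memory before any thread reads another thread's element:

__global__ void reverseTile(float * d_data)
{
    __shared__ float tile[64];          // one tile per block, in fast shared memory
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    tile[t] = d_data[g];
    __syncthreads();                    // barrier: the whole block waits here

    d_data[g] = tile[63 - t];           // now safe to read another thread's element
}

// launched with 64 threads per block, e.g. reverseTile<<<numBlocks, 64>>>(d_data);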
All threads in a block run on the same SMX, and all blocks in a kernel finish before blocks in the next kernel run. CUDA offers built-in 'atomics' such as atomicAdd(), atomicMin() and atomicCAS() (compare-and-swap) to avoid races. For example, instead of a[i] = a[i] + 1; which would give a different result each time with many threads working on it, we use: atomicAdd(&a[i], 1); Atomics slow down performance because they guarantee that only one thread at a time can execute the operation. As with all parallel programming, thread divergence caused by if/else conditions and loops should be avoided, because diverging threads sit idle. A scan is a running cumulative sum of the input, which is very useful in parallel programming, e.g. an inclusive scan of (1,2,3,4) gives (1,3,6,10), with final output 10. As a result of MapReduce principles and parallel operation it is interesting to note that although heapsort >> mergesort on a CPU, mergesort >> heapsort on a GPU. CUDA promotes good software engineering with the use of APOD: analyse-parallelise-optimise-deploy. A great utility is deviceQuery, which reports very useful information about the installed GPUs. Much more information is available at https://developer.nvidia.com/cuda-zone
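The following is a minimal, self-contained sketch of the atomicAdd() pattern described above (my own example; the kernel and variable names are illustrative). Every thread increments the same counter, and the atomic guarantees that no increment is lost, at the cost of serialising the updates:

#include <stdio.h>

__global__ void count(int * d_counter)
{
    atomicAdd(d_counter, 1);            // safe concurrent increment
}

int main(void)
{
    int h_counter = 0;
    int * d_counter;

    cudaMalloc((void **) &d_counter, sizeof(int));
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

    count<<<64, 256>>>(d_counter);      // 64 blocks x 256 threads = 16384 increments

    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d\n", h_counter); // expect 16384
    cudaFree(d_counter);
    return 0;
}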
Appendix G: Permission for full image access to ImageNet
Citation: Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution). ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.
Request made Feb 17, 2015 for permission to access the full images:
Alison Lowndes (the "Researcher") has requested permission to use the ImageNet database (the "Database") at Princeton University and Stanford University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:
● Researcher shall use the Database only for non-commercial research and educational purposes. ● Princeton University and Stanford University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose. ● Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify Princeton University and Stanford University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database. ● Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions. ● Princeton University and Stanford University reserve the right to terminate Researcher's access to the Database at any time. ● If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer. ● The law of the State of New Jersey shall apply to all disputes under this agreement.
1.4M images for training, test and validation (.jpg format) totalling 167GB from ImageNet.
Appendix H: Partial history of deep learning from the research of Geoffrey Hinton
2014
Deep Belief Networks for Natural Language Understanding.
2013
Modeling Documents with a Deep Boltzmann Machine. Speech Recognition with Deep Recurrent Neural Networks. Optimizing with Rectified Linear Units (ReLU) and Dropout. Speech Processing and recognition. ImageNet Classification with Deep Convolutional Neural Networks.
2012
Deep Neural Networks for Acoustic Modeling in Speech Recognition. Robust Boltzmann Machines for Recognition and Denoising.
2011
Generating Text with Recurrent Neural Networks. Using Very Deep Autoencoders for Content-Based Image Retrieval. Deep Belief Nets for Natural Language Call-Routing.
2010
Binary Coding of Speech Spectrograms Using a Deep Auto-encoder. Learning to represent visual input. Learning to detect roads in high-resolution aerial images. Temporal Kernel Recurrent Neural Networks. Factored 3-way restricted Boltzmann machines for modeling natural images.
2009
3-D Object recognition with deep belief nets. Zero-Shot Learning with Semantic Output Codes. Deep Boltzmann Machines. The Recurrent Temporal Restricted Boltzmann Machine.
2008
Generating Facial Expressions with Deep Belief Nets. Deep Narrow Sigmoid Belief Networks are Universal Approximators. Modeling image patches with a directed hierarchy of Markov random fields.
2007
Three New Graphical Models for Statistical Language Modelling. Unsupervised learning of image transformations. Modeling human motion using binary latent variables.
2006
Reducing the dimensionality of data with neural networks.
2005
Improving dimensionality reduction with spectral gradient descent. Exponential Family Harmoniums with an Application to Information Retrieval.
2004
Reinforcement Learning with Factored States and Actions. Energy-Based Models for Sparse Overcomplete Representations.
2003
Learning Sparse Topographic Representations. The ups and downs of Hebb synapses.
2002
Training Products of Experts by Minimizing Contrastive Divergence. Classical and Bayesian Inference in Neuroimaging: Theory. Recognizing Handwritten Digits using Hierarchical Products of Experts.
2001
Training Many Small Hidden Markov Models. Products of Hidden Markov Models. Rate-coded Restricted Boltzmann Machines for Face Recognition. Recognizing Hand-Written Digits Using Hierarchical Products of Experts.
2000
Extracting Distributed Representations of Concepts and Relations from Positive and Negative Propositions. Learning to parse images. Learning Distributed Representations by Mapping Concepts and Relations into a Linear Space.
1999
Supervised learning in multilayer neural networks. Scaling in a hierarchical unsupervised network. Unsupervised Learning: Foundations of Neural Computation. Variational learning in nonlinear gaussian belief networks.
1998
Coaching variables for regression and classification. Glove-TalkII: A neural network interface which maps gestures to parallel formant speech synthesizer controls.
1997
Instantiating deformable models with a neural net. Modeling the manifolds of images of handwritten digits. A mobile robot that learns its place.
1996
Using Generative Models for Handwritten Digit Recognition. Recognizing handwritten digits using mixtures of linear models. Using a neural net to instantiate a deformable model. An alternative model for mixtures of experts.
1995
GloveTalkII: Mapping hand gestures to speech using neural networks. Learning population codes by minimizing description length. The Helmholtz machine. Using neural networks to monitor for rare failures.
1994
Autoencoders, minimum description length, and Helmholtz free energy. Developing population codes by minimizing description length. A modified gating network for the mixtures of experts architectures.
1993
Hand-printed digit recognition using deformable models. Simulating brain damage.
1992
Glove-Talk: A neural network interface between a data-glove and a speech synthesizer. Combining two methods of recognizing hand-printed digits. A self-organizing neural network that discovers surfaces in random-dot stereograms. How neural networks learn from experience.
1991
Adaptive mixtures of local experts.
1990
Traffic: Recognizing objects using hierarchical reference frame transformations. Deterministic Boltzmann learning in networks with asymmetric connectivity. Mapping part-whole hierarchies into connectionist networks. An unsupervised learning procedure that discovers surfaces in random-dot stereograms.
1989
Deterministic Boltzmann learning performs steepest descent in weight-space. Phoneme recognition using time-delay neural networks. GEMINI: Gradient Estimation by Matrix Inversion after Noise Injection. Connectionist learning procedures. A distributed connectionist production system.
1988
Scene-based and viewer-centered representations for comparing shapes.
Learning representations by recirculation. Representing part-whole hierarchies in connectionist networks. Connectionist architectures for Artificial Intelligence.
1987
Learning translation invariant recognition in a massively parallel network. The horizontal-vertical delusion. How learning can guide evolution. Learning sets of filters using back-propagation. Separating figure from ground using a Boltzmann machine.
1986
Learning representations by back-propagating errors. Learning distributed representations of concepts. Experiments on learning by back-propagation. A general framework for Parallel Distributed Processing. Distributed representations. Learning and relearning in Boltzmann machines. G-maximization: An unsupervised learning procedure for discovering regularities.
1985
A learning algorithm for Boltzmann machines. Symbols among the neurons: Details of a connectionist inference architecture. Shape recognition and illusory conjunctions. Learning in parallel networks. Solving random-dot stereograms using the heat equation. Parallel computations for controlling an arm.
1984
Boltzmann Machines: Constraint satisfaction networks that learn. Some computational solutions to Bernstein's problems.
1983
Parallel visual computation. Optimal perceptual inference.
1981
The role of spatial working memory in shape perception. Parallel models of associative memory. Models of information processing in the brain. Implementing semantic networks in parallel hardware. A parallel computation that assigns canonical object-based frames of reference.
Frames of reference and mental imagery. Shape representation in parallel systems.
1979
Some demonstrations of the effects of structural descriptions in mental imagery. Imagery without Arrays.
1978
Representation and control in vision. Relaxation and its role in vision (PhD Thesis, University of Edinburgh).
Appendix J: Image Processing on the SVHN dataset in Torch/Lua
Deep learning on the Street View House Number dataset [http://ufldl.stanford.edu/housenumbers/]
csfyp01 (1 x K40) - epoch 1: 19 minutes
Boston (2 x K80) - epoch 1: 4 minutes, i.e. a 4.75x speedup
Fig. 25 Accuracy results for a CNN on SVHN
Output from running 1_data.lua below:
phy4abl@csfyp01:~/tutorials/2_supervised$ th -i 1_data.lua
==> processing options
==> downloading dataset
--2015-03-06 11:48:55-- http://torch7.s3-website-us-east-1.amazonaws.com/data/housenumbers/train_32x32.t7
Saving to: ‘train_32x32.t7’
100%[================>] 225,197,180 2.95MB/s in 73s
2015-03-06 11:50:08 (2.95 MB/s) - ‘train_32x32.t7’ saved [225197180/225197180]
--2015-03-06 11:50:08-- http://torch7.s3-website-us-east-1.amazonaws.com/data/housenumbers/test_32x32.t7
Saving to: ‘test_32x32.t7’
100%[================>] 80,024,325 3.01MB/s in 25s
2015-03-06 11:50:34 (3.05 MB/s) - ‘test_32x32.t7’ saved [80024325/80024325]
==> using reduced training data, for fast experiments
==> loading dataset ==> preprocessing data ==> preprocessing data: colorspace RGB -> YUV ==> preprocessing data: normalize each feature (channel) globally ==> preprocessing data: normalize all three channels locally ==> verify statistics training data, y-channel, mean: -0.0063139642510632 training data, y-channel, standard deviation: 0.94085703684516 test data, y-channel, mean: 0.064013399310564 test data, y-channel, standard deviation: 1.0712769986069 training data, u-channel, mean: 0.2126413485493 training data, u-channel, standard deviation: 0.7879642894009 test data, u-channel, mean: 0.26287914495139 test data, u-channel, standard deviation: 0.91655089373626 training data, v-channel, mean: 0.22919728013122 training data, v-channel, standard deviation: 0.74868942214652 test data, v-channel, mean: 0.23141433953093 test data, v-channel, standard deviation: 0.89402614240993 ==> visualizing data For visualization, run this script in an itorch notebook
Preprocessing requires a floating-point representation (the original data is stored as bytes). Types can easily be converted in Torch:
dst = src:type('torch.TypeTensor'), where Type == 'Float', 'Double', 'Byte', 'Int', ...
Shortcuts are provided for simplicity (float(), double(), cuda(), ...):
trainData.data = trainData.data:float()
testData.data = testData.data:float()
Preprocessing the data is crucial to any machine learning algorithm. For natural images, intuitive tricks include mapping into YUV space, to separate luminance information from colour information. The luminance channel (Y) is locally normalised using a contrastive normalisation operator: for each neighbourhood, defined by a Gaussian kernel, the mean is subtracted and the standard deviation is normalised to one. Colour channels are normalised globally, across the entire dataset; as a result, each colour component has zero mean and unit norm across the dataset.
Fig. 26 Sample of y-channel data from SVHN dataset. Credit Torch.ch
TEST ON SVHN dataset with csfyp01 (1 x K40) $ th -i doall.lua -size small -type cuda // testing on 10,000 images ==> processing options ==> switching to CUDA ==> executing all ==> downloading dataset ==> using reduced training data, for fast experiments ==> loading dataset ==> preprocessing data ==> preprocessing data: colorspace RGB -> YUV ==> preprocessing data: normalize each feature (channel) globally ==> preprocessing data: normalize all three channels locally ==> verify statistics ….. ==> visualizing data ==> define parameters ==> construct model ==> here is the model: nn.Sequential { [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> output] (1): nn.SpatialConvolutionMM (2): nn.ReLU
(3): nn.SpatialMaxPooling (4): nn.SpatialConvolutionMM (5): nn.ReLU (6): nn.SpatialMaxPooling (7): nn.View (8): nn.Dropout (9): nn.Linear(1600 -> 128) (10): nn.ReLU (11): nn.Linear(128 -> 10) } ==> define loss ==> here is the loss function: nn.ClassNLLCriterion ==> defining some tools ==> configuring optimizer ==> defining training procedure ==> defining test procedure ==> training! ==> doing epoch on training data: ==> online epoch # 1[batchSize = 1] [============= 10000/10000 =============>]ETA: 0ms | Step: 1ms ==> time to learn 1 sample = 26.008045506477ms + average row correct: 10.928883552551% + average rowUcol correct (VOC measure): 3.0105409631506% + global correct: 20.26% ==> saving model to /home/phy4abl/tutorials/2_supervised/results/model.net ==> testing on test set: [============ 2000/2000 ===============>]ETA: 0ms | Step: 0ms ==> time to test 1 sample = 62.619866967201ms + average row correct: 20.878945924342% + average rowUcol correct (VOC measure): 9.8470474407077% + global correct: 32.4% ==> doing epoch on training data:
Reran on the full 73,257-image dataset, 1 epoch of:
$ th -i doall.lua -size full -type cuda
csfyp01 (1 x K40) - epoch 1: 19 minutes
Boston (2 x K80) - epoch 1: 4 minutes, i.e. a 4.75x speedup
==> online epoch # 505 [batchSize = 1]
[=================== 73257/73257 =============>] ETA: 0ms | Step: 1ms
==> time to learn 1 sample = 1.3086422867985ms
+ global correct: 99.46080238066%
==> testing on test set:
[=================== 26032/26032 =============>] ETA: 0ms | Step: 0ms
==> time to test 1 sample = 0.51342198177577ms
+ global correct: 92.958666256915%
Stopped @ 92.95% test accuracy
The net ran overnight. When rerun with 2 x K80s, the normal output shown above (confusion matrix, epoch number, accuracy etc.) could not be read because the run was so fast. This may have been a problem with the remote shell not keeping the scroll history; the only information visible was the very rapid counting through the image set (73,257).
Appendix K: Test results with fbcunn on ImageNet
Simple installation instructions were provided by Facebook after they isolated the dataParallel version of fbcunn, so I could run multi-GPU on CentOS without the problems of dependencies built for Facebook itself. (By this time - March 12 - we actually had csfyp01 with straight Ubuntu anyway.) These instructions are available here: https://github.com/soumith/imagenet-multiGPU.torch
90 epochs (10,000 samples each) on the ILSVRC2012 ImageNet dataset:
● Yadan et al., 2013: 2.2x speedup on 4 NVIDIA GeForce Titan, taking 226.8 hours
● Paine et al., 2013: 3.2x speedup on 8 NVIDIA K20, taking 256.8 hours
● Krizhevsky, 2014: 6.25x speedup on 8 NVIDIA K20, taking 15.68 hours with the reparallelised AlexNetOWT (1024 x 1024)
● Me, 2015: 0.33x speedup on 2 x NVIDIA K80 (4 GPU chips), taking < 3 minutes with AlexNetOWT (512 x 512), error 6.9% for 1000/10000, i.e. approx. 45 hrs for 90 epochs. Since Alex has 19,968 cores and my system has only 9,984, it is interesting that he gets roughly 3x the speedup for only 2x the number of cores.
● Me, 2015: 0.035x speedup on 1 x K40, with 5.61 hrs for 1 epoch. This was 3x quicker than on the 2 x K80 setup, since we upgraded to CUDA 7 and cuDNN v2.
The Boston setup has 2x 2660v3 CPUs, 128GB DDR4 memory, 2x K80s and a 1TB SSD holding ImageNet and the fbcunn codebase. Running AlexNetOWT on ILSVRC2012 with -nGPU 4 (each K80 is in fact 2 GPUs in one) and -nDonkeys 8 (worker threads to pull the images in), LuaJIT, Torch7:
Epoch: [1][1000/10000] Time 1.728 Error 6.9067 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.003
4x GK210 chips achieving 1/10000 of an epoch in 1.7s.
csfyp01: 1 x Tesla K40, now running CUDA 7 (not 6.5) and cuDNN v2, achieved a 3x speedup on only 1 x GK110B chip on the standard AlexNet, even with 3.5x fewer Tflops.
NB: The ImageNet ILSVRC2012 dataset download is ~167GB and needs to unpack into subfolders, so requires ~350GB (preferably SSD). The download will take over 20 hrs (from Stanford in California). The images may or may not need to be resized depending on your GPU capability, which can take another 18 hrs. (For csfyp01 (1 x K40) we resized to 256 x 256.) Once unpacked into train and val directories:
main.lua runs, by default, a 1 x GPU setup on the AlexNet model with the CuDNN backend and 2 data-loader threads, known as 'donkeys'.
$ th main.lua -data [imagenet-folder with train and val folders]
$ th main.lua -data /disk1/imagenet -nGPU 2 -backend cudnn -netType alexnet
or $ th main.lua -data /disk1/imagenet -netType alexnetowt -nGPU 4 -nDonkeys 8 // will use 4-GPUs (2 x K80 = 4 x GPU) for even faster training (Boston setup)
The current Top-1 and Top-5 error plus the objective loss are output per mini-batch (with a hard-coded* learning rate of 0.002).
❖ To test a single unseen image, use 'testHook' (line 103 in 'donkey.lua') to load it and send it through the model:
dofile('donkey.lua')
img = testHook({loadSize}, 'test.jpg')
model = torch.load('model_10.t7')
-- at the end of every epoch the model is saved as model_[xx].t7, where xx is the epoch no.
predictions = model:forward(img:cuda())
main.lua (~30 lines) - loads all other files, starts training
opts.lua (~50 lines) - all the command-line options and descriptions
data.lua (~60 lines) - contains the logic to create K threads for parallel data-loading
donkey.lua (~200 lines) - contains the data-loading logic and details - run by each data-loader thread, with random image cropping, generating 10-crops etc.
model.lua (~80 lines) - creates the AlexNet model and criterion
* train.lua (~190 lines) - logic for training the network (including the hard-coded learning rate + weight-decay schedule that produces good results)
test.lua (~120 lines) - logic for testing the network on the validation set (including calculating top-1 and top-5 errors)
dataset.lua (~430 lines) - a general-purpose data loader, derived from the main repo [https://github.com/soumith/imagenetloader.torch]
Using ImageNet ILSVRC2012 (1.2 million natural images):
$ th main.lua -data /disk1/imagenet -nGPU 4 -nDonkeys 8 -backend cudnn -netType alexnetowt
[via remote ssh to Boston Ltd, St Albans, Herts]
root@ubuntu1404:/disk1/imagenet/imagenet-multiGPU.torch# th main.lua -data /disk1/imagenet -nGPU 4 -nDonkeys 8 -backend cudnn -netType alexnetowt
nDonkeys   8   2
nGPU       4   1
netType    alexnetowt
manualSeed : 2
LR : 0
epochNumber : 1
backend : "cudnn"
nDonkeys : 8
momentum : 0.9
batchSize : 128
nGPU : 4
GPU : 1
data : "/disk1/imagenet"
weightDecay : 0.0005
epochSize : 10000
[printout for each donkey loading train & test metadata from cache]
nClasses: 1000
nTest: 50000
=> Creating model from file: models/alexnetowt_cudnn.lua
=> Model
nn.Sequential {
  [input -> (1) -> (2) -> output]
  (1): DataParallel
  nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output]
    (1): cudnn.SpatialConvolution
    (2): cudnn.ReLU
    (3): cudnn.SpatialMaxPooling
    (4): cudnn.SpatialConvolution
    (5): cudnn.ReLU
    (6): cudnn.SpatialMaxPooling
    (7): cudnn.SpatialConvolution
    (8): cudnn.ReLU
    (9): cudnn.SpatialConvolution
    (10): cudnn.ReLU
    (11): cudnn.SpatialConvolution
    (12): cudnn.ReLU
    (13): cudnn.SpatialMaxPooling
  }
  (2): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
    (1): nn.View
    (2): ModelParallel {
      input
      |`-> (1) [gpu:1] nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> output]
      |      (1): nn.Dropout
      |      (2): nn.Linear(9216 -> 1024)
      |      (3): nn.ReLU
      |    }
      [identical branches for (2) [gpu:2], (3) [gpu:3] and (4) [gpu:4]]
      ... -> output
    }
    (3): ModelParallel {
      input
      |`-> (1) [gpu:1] nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> output]
      |      (1): nn.Dropout
      |      (2): nn.Linear(4096 -> 1024)
      |      (3): nn.ReLU
      |    }
      [identical branches for (2) [gpu:2], (3) [gpu:3] and (4) [gpu:4]]
      ... -> output
    }
    (4): nn.Linear(4096 -> 1000)
    (5): nn.LogSoftMax
  }
}
=> Criterion nn.ClassNLLCriterion
==> Converting model to CUDA
==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/10000]
Time 1.959 Err 6.9073 Top1-%: 0.00 LR 1e-02 DataLoadingTime 5.992
Epoch: [1][2/10000]
Time 1.746 Err 6.9074 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.003
Epoch: [1][3/10000]
Time 1.729 Err 6.9073 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.005
. . Epoch: [1][748/10000]
Time 1.778 Err 6.9092 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][749/10000]
Time 1.768 Err 6.9070 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
. Epoch: [1][977/10000]
Time 1.679 Err 6.9052 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.003
Not a JPEG file: starts with 0x89 0x50 . . Epoch: [1][990/10000]
Time 1.766 Err 6.9056 Top1-%: 0.00 LR 1e-02 DataLoadingTime 0.002
Epoch: [1][991/10000]
Time 1.766 Err 6.9084 Top1-%: 0.00 LR 1e-02 DataLoadingTime 3.253
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/threads/init.lua:171: [thread 2 callback] /disk1/imagenet/imagenet-multiGPU.torch/donkey.lua:32: attempt to index local 'input' (a nil value)
Time lapsed: ~30 minutes. Donkey #2 found a flaw, 1 of 1.4 million images!
phy4abl@csfyp01:~/imagenet-multiGPU.torch$ th main.lua -data ~/imagenet -netType alexnetowt -nGPU 1 -nDonkeys 4
-- ignore option data nDonkeys 4 2 -- ignore option optimState -- ignore option cache netType alexnetowt alexnet -- ignore option retrain { LR : 0 batchSize : 128 data : "/home/phy4abl/imagenet" epochSize : 10000 nDonkeys : 4 save : "/home/phy4abl/fbcunn_imagenet/imagenet_runs_oss/alexnet12,nDonkeys=4,netType=alexnetowt/,W edApr123:59:442015" optimState : "none" nGPU : 1 epochNumber : 1 momentum : 0.9 cache : "/home/phy4abl/fbcunn_imagenet/imagenet_runs_oss" netType : "alexnetowt" nEpochs : 55 backend : "cudnn" GPU : 1 weightDecay : 0.0005 manualSeed : 2 retrain : "none" } Saving everything to: /home/phy4abl/fbcunn_imagenet/imagenet_runs_oss/alexnet12,nDonkeys=4,netType=alexnetowt/,We dApr123:59:442015 Starting donkey with id: 1 seed: 3 Starting donkey with id: 2 seed: 4 Starting donkey with id: 3 seed: 5 Starting donkey with id: 4 seed: 6 Loading train metadata from cache Loading train metadata from cache Loading train metadata from cache Loading train metadata from cache Loading test metadata from cache Loading test metadata from cache Loading test metadata from cache Loading test metadata from cache Loaded mean and std from cache. Loaded mean and std from cache. Loaded mean and std from cache. Loaded mean and std from cache. nClasses: 1000 nTest: 50000 => Creating model from file: models/alexnetowt_cudnn.lua => Model nn.Sequential { [input -> (1) -> (2) -> output] (1): nn.Sequential { [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output] (1): cudnn.SpatialConvolution (2): cudnn.ReLU (3): cudnn.SpatialMaxPooling (4): cudnn.SpatialConvolution (5): cudnn.ReLU (6): cudnn.SpatialMaxPooling (7): cudnn.SpatialConvolution (8): cudnn.ReLU (9): cudnn.SpatialConvolution
(10): cudnn.ReLU (11): cudnn.SpatialConvolution (12): cudnn.ReLU (13): cudnn.SpatialMaxPooling } (2): nn.Sequential { [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output] (1): nn.View (2): nn.Concat { input |`-> (1): nn.Sequential { | [input -> (1) -> (2) -> (3) -> output] | (1): nn.Dropout | (2): nn.Linear(9216 -> 4096) | (3): nn.ReLU | } ... -> output } (3): nn.Concat { input |`-> (1): nn.Sequential { | [input -> (1) -> (2) -> (3) -> output] | (1): nn.Dropout | (2): nn.Linear(4096 -> 4096) | (3): nn.ReLU | } ... -> output } (4): nn.Linear(4096 -> 1000) (5): nn.LogSoftMax } } => Criterion nn.ClassNLLCriterion ==> Converting model to CUDA ==> doing epoch on training data: ==> online epoch # 1 Epoch: [1][1/10000] Time 2.894 Err 6.9069 Top1-%: 0.00 LR 1e-02 DataLoadingTime 19.204
Epoch: [1][10/10000]
Time 0.540 Err 6.9089 Top1-%: 0.00 LR 1e-02 DataLoad 0.017
Epoch: [1][1000/10000] Time 0.530 Err 6.8751 Top1-%: 0.00 LR 1e-02 DataLoad 1.862 Epoch: [1][4000/10000] Time 0.522 Err 6.6706 Top1-%: 2.34 LR 1e-02 DataLoad 0.320 Epoch: [1][10000/10000] Time 0.510 Err 5.0222 Top1-%: 7.03 LR 1e-02 DataLoad 0.004
Alex K.: 8 x K20s == 10.45 mins per epoch! Me: 1 x K40 == 5.61 hrs per epoch!
Epoch: [1][TRAINING SUMMARY] Total Time: 20181.55s, average loss (per batch): 6.18, accuracy(%): Top-1 2.57%
2nd run:
Epoch: [1][10000/10000] Time 0.520 Err 4.8698 Top1-%: 9.38 LR 1e-02 DataLoadingTime 0.004
Epoch: [1][TRAINING SUMMARY] Total Time(s): 18712.56 (5.2 hrs), average loss (per batch): 6.12, accuracy(%): top-1 2.71
Soumith Chintala Benchmarks
From https://github.com/soumith/convnet-benchmarks
*** This is an experimental module which used FFT to calculate convolutions. It uses a lot of memory according to @benanne - the code can be found here: https://github.com/Theano/Theano/blob/master/theano/sandbox/cuda/fftconv.py This module has not yet been used 'in production' due to memory load. Ref: Sander Dieleman (Belgium)
[email protected] aka @benanne
Fig. 27 Soumith Chintala Benchmarks, Github / Facebook
Appendix L: Test Results with the MICCAI Grand Challenge medical dataset
JPEG datasets (582.5MB training & 555.2MB test) downloaded and unzipped to:
phy4abl@csfyp01:/usr/not-backed-up/miccai_test & /miccai_train
Summary from the main website http://amida13.isi.uu.nl/ :
Histologic tumor grading systems assess how closely tumor cells resemble normal tissue under a microscope. Mitotic activity can be used as a prognosticator independently of grading systems, especially when automated via deep learning. The MICCAI Grand Challenge was introduced not only to push forward progress in this area through competition, but also to provide "open access to a high quality annotated dataset" in order to bring "major advancement in the development of a successful mitotic detection method" (MICCAI Challenge website). This is primarily a detection problem, since at this stage only the presence itself is important to the patient (not size and shape). I simply registered with the Image Sciences Institute and Pathology Department of the University Medical Center Utrecht in order to obtain the .jpeg training, test and validation sets. Only academics are allowed access, and I included the Virtual Pathology Group URL. Obviously my access will end with my affiliation to the University of Leeds, but all data will remain available on csfyp01, as well as the codebase. Anyone researching the field can also simply apply for access; the website provides an adequate overview of the entire process of tissue sample preparation, digital pathology and mitosis counting. Login via: http://amida13.isi.uu.nl/?q=user
The dataset uses stained slides from 23 invasive breast carcinoma patients from 2009/10. Regions of interest vary, with a median of 26 mm². The standard is to count mitotic figures in an area of 2 mm², but to gain as many mitotic figures as possible, counting was extended to the entire marked area. The patients themselves were divided into two groups, with slides from one group used for training and the other as an independent testing set. Digitisation was performed with the Aperio ScanScope XT scanner at 40× magnification and a spatial resolution of 0.25 µm/pixel, and high-quality JPEG 2000 compression (quality factor 85) was used in order to reduce the storage requirements. Two expert pathologists independently annotated the locations of mitotic figures, with concordant cases (annotated by both observers) taken as ground truth. The discordant cases (annotated by just one observer) were presented to a panel of an additional two observers, who made the final decision. The annotated regions were exported into separate images (TIFF), each representing one HPF (high power field), defined as 0.5×0.5 mm or 2000×2000 pixels. Only HPFs that contain at least one mitotic figure were included as part of the dataset. Both training and testing sets are organised into numbered folders of HPFs and, if applicable, ground truth data from a single slide (patient). The HPFs are stored as 8-bit
RGB TIF images with PackBits lossless compression. An alternative version with a smaller download size, using light lossy JPEG compression (quality factor 95), is also available for download; this is the dataset I tested with. Each folder contains images (high power fields, HPFs) from one patient. The HPFs that have mitotic figure(s) present have a corresponding CSV file with the same filename, e.g. 09.jpg & 09.csv. Each row in the CSV file corresponds to one mitotic figure, e.g. 505,223, where the two columns give the location of the mitotic figure in image pixel coordinates. This data is only for research or for entry to the MICCAI Challenge. Further questions can be addressed to: Mitko Veta, Image Sciences Institute, UMC Utrecht
[email protected]
IDSIA open-sourced its PyBrain code at https://github.com/pybrain/pybrain/blob/master/docs/documentation.pdf
PyBrain was installed locally on phy4abl@cslin155 ($ cd pybrain) for preliminary testing, but it proved much more efficient to use the much more informative NVIDIA DIGITS/Caffe software.
In the MICCAI Grand Challenge, a detection is considered a true positive if its Euclidean distance to a ground truth location is less than 7.5 µm (30 pixels). Ranking is by the overall F1-score: 2·precision·recall / (precision + recall) (a quick check with IDSIA's numbers follows the list below). In 2013, IDSIA won by achieving:
● Precision: 0.610
● Recall: 0.612, and
● F1: 0.611
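As a quick arithmetic check of the ranking formula using IDSIA's published precision and recall (figures taken from the list above):
F1 = 2 × 0.610 × 0.612 / (0.610 + 0.612) = 0.74664 / 1.222 ≈ 0.611
which matches the reported F1-score of 0.611.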
Fig. 28 Summary of MICCAI dataset
Fig. 29. All 143 detections (29 per row) on IDSIA's 2013 T3 run with scores larger than 0.1, sorted by descending score. For each, the corresponding image patch, score, and whether it is a mitosis (TRUE, bright green background) or a non-mitosis (FALSE, dark red background) are shown. The vertical dotted line at score 0.35 marks the detection threshold t0 determined on the T2 run [60]. Original high-res images are available for download at http://people.idsia.ch/~giusti/mitosis/ .
Fig. 30 Compressed .jpeg of one of the IDSIA images evaluated on the separate MITOS dataset after applying a CNN to raw histology images. Detection (true positives) circled green.
Appendix M: Screenshot of live demo
This image is a screenshot of Dan Cireșan, Senior Researcher at the Dalle Molle Institute for Artificial Intelligence (IDSIA), live at the March 2015 NVIDIA GPU Tech Conference in California, showing a visualisation of a CNN running on a live video feed image of Dan himself. You can clearly see 4 weight-sharing, alternating convolutional & max-pooling layers, with 2 final fully-connected classifier layers.
Fig 31. Credit Ciresan, IDSIA,2015
Appendix N: Reverse engineering the neocortex: implications for machine intelligence
[email protected] , online presentation: http://1.usa.gov/1FKdDaS From the Neuron-inspired computing elements (NICE) Conference, Sandia National Labs Feb 23, 2015
"I'm not against deep learning networks, I just don't think they're on a path to machine intelligence and that's what I'm shooting for" - Jeff C Hawkins, Founder, Numenta.
Numenta marries biological theory and neuroscience with information and mathematical theory. Hawkins believes in the fascination of building machines that are faster, larger, have new senses and new embodiments. He doesn't believe passing the Turing Test is important.
In his talk, Hawkins explains a simplified framework of the human neocortex as a memory system interfacing with the world via a set of sensors. The retina, cochlea, somatics - your body itself - are how the neocortex learns and builds its model of the world. For all the apparently separate behaviours of the spinal cord, the arm, the leg etc., the neocortex makes representations of everything we do: smiling, walking, chewing, etc. The neocortex then learns to associatively control those actions and predict your next action from previous behaviour. Nothing is independent. The optic nerve carries approximately a million fibres from the retina, another million come from the spinal cord, and roughly 30,000 from the auditory system. All these axons carry identical action potentials. The human body does not receive five different senses - sight, sound, touch, taste, smell - it simply receives a very rapid, highly-parallel, high-velocity datastream of pattern representations, which it then acts on with another very rapid, highly-parallel, high-velocity motor stream. Most of the time the neocortex is occupied by our own behaviour - what we're doing, saying, hearing, feeling, touching etc. - and how that relates to its environment. The neocortex learns its model of the world through constant adaptive learning of our own behaviour, constantly inferring and predicting behaviour over time. The neocortex is actually a 2.5mm-thick sheet of grey matter, folded to fit inside the skull.
Fig 32. Credit J Hawkins, Numenta
The neocortex is remarkably uniform, has hierarchy and is layered in depth (between 4 and 10 layers depending on how you count them). Neurons appear in mini-columns cross-sectioned along those layers, and each neuron in each mini-column shares a common feedforward receptive field. The actual processing occurs on the dendrites, not in the soma (the cell body of the neuron). The dendrites act like coincidence detectors, whereas learning is the formation of NEW synapses, new connections. Numenta models its HTM neuron on this.
Fig 33. The HTM Neuron model
A hierarchical temporal memory (HTM) exists where sequences are processed in regions, with each region knowing the activities of the regions below it. In each layer of the neocortex there is some form of sequence memory going on: learning, inferring, predicting and recalling sequences. Layer 4 (L4) performs sensory-motor inference, i.e. predicting what you're about to do based on prior behaviour; L4 receives both sensory input and a copy of the local motor command. L2/3 is for higher-order inference, such as recognising (predicting) a melody. L5 is believed to be for motor sequences. This layer-function is generic: as far as Numenta's research suggests, every cortical region has these layer functions, and there is nothing specific for sight or sound, for example - it's all just patterns. These, Hawkins states, are the essential elements for machine intelligence.
Fig 34. Red cubes are sparse active neurons in the grid connecting, forming patterns
When a cell becomes active it has neighbours that were active right before it, connected via available dendrites, and these form the patterns that can be learnt and recognised. Many, many patterns can be recognised by the neocortex in parallel at any moment.
Fig 35. Active and polarised (yellow) neurons and the multiple patterns formed in multiple layers
The system degrades by generalising more, losing the ability to see these complex patterns - up to catastrophic failure, similar to the ageing human brain. Numenta currently has multiple L2/3 commercial applications capable of streaming-data prediction, classification and anomaly detection (application screenshot) within biometrics, medical, vehicles, social media and many other fields. Any data stream that can be encoded into a sparse distributed representation (SDR) can then be fed into the HTM. There is no tweaking needed and no training period: the same algorithm and codebase is used for all applications on the streaming SDR. Numenta's Grok Solutions product can pull in data from AWS, from any server, and predict anomalies. HTM continually learns and automatically adapts to change. To further progress, Numenta are open-sourcing everything they do at https://github.com/numenta including, by mid-2015, their entire 'Company Monitor' application, purely for demonstration purposes. The app uses stock pricing & volume integrated with Twitter mentions, so anomalies picked up per company can then be further explained via tweets about the company itself.
Appendix P: DIGITS by NVIDIA
DIGITS is the deep learning GPU training system that provides full visualisation of neural networks in action. It is an invaluable tool for researchers and developers, especially those who prefer to see what their networks are doing at each stage or epoch of training. The system allows full and thorough configuration of deep learning models and architectures. There are several default networks, such as LeCun's LeNet and AlexNet, or the user can import one or design their own. This is an evolving product.
Fig 36. DIGITS Roadmap
Full documentation is available at http://developer.nvidia.com/digits and the codebase is open-sourced on Github, from where it can be cloned or developed:
git clone https://github.com/NVIDIA/DIGITS.git digits
cd digits
sudo apt-get install graphviz gunicorn
for req in $(cat requirements.txt); do sudo pip install $req; done
The product was designed for use with Ubuntu & Caffe but will eventually include all major frameworks.
$ python digits-devserver   # starts an instance listening on port 5000
Alternatively, digits-server will use the Gunicorn app, listening on port 8080 at http://localhost. To create a dataset, the user can either upload training and validation sets or supply the path to local storage and let the system create the dataset. At this point the user can also
determine whether to use greyscale or full colour, and the system can automatically resize images at this point too, depending on your available CPU/GPU power. Training/validation percentage parameters can be set, along with the simple addition of rotations, colour distortions, augmentation and noise to a training set, to reduce overfitting or simply increase the number of examples. This is especially useful for medical image datasets. Pre-trained models can be pulled in and manipulated, and NVIDIA are also happy to share an AWS AMI for those unable to procure GPUs themselves. Further info at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html Visualisations are absolutely stunning, with feature extraction possible at any stage of the CNN, at any epoch. There is a simple testhook facility too, which can predict from either a single image or a file of multiple images or text.
Fig 37: visualisation of first layer of a CNN
Notes from installation on csfyp01
To avoid display problems use the -X flag for ssh:
$ ssh -X csfyp01
phy4abl@csfyp01:/$ cd usr/local/digits
phy4abl@csfyp01:/usr/local/digits$ ./digits-devserver
[DIGITS ASCII art banner]
Welcome to the DIGITS config module.
Where is caffe installed? (enter "SYS" if installed system-wide)
[default is SYS] (q to quit) >>> /usr/local/caffe
Attached devices:
Device #0:
  Name                 Tesla K40c
  Compute capability   3.5
  Memory               11.25 GB
  Multiprocessors      15
Input the IDs of the devices you would like to use, separated by commas, in order of preference. (enter "NONE" if you want to run in CPU-only mode)
[default is 0] (q to quit) >>> 0
Where would you like to store jobs? [default is /home/phy4abl/.digits/jobs] (q to quit) >>>
Accepting default value of "/home/phy4abl/.digits/jobs"
What is the minimum log level that you want to save to your logfile? [error/warning/info/debug]
[default is info] (q to quit) >>>
Accepting default value of "info"
New config:
gpu_list   - 0
secret_key - 78823e43fdd54cce2d5c0529
log_level  - info
jobs_dir   - /home/phy4abl/.digits/jobs
caffe_root - /usr/local/caffe
NB: /usr/local/digits/digits-devserver listens on port 5000 via http://csfyp01:5000
Makefile in ~/caffe:
phy4abl@csfyp01:~/caffe$ make all --jobs=4
Utilities include (.cpp files): benchmark, im2colp, cudnn, cudnn_conv_layer, cudnn_softmax_layer, pooling_layer, sigmoid_cross_entropy_loss_layer, cudnn_sigmoid_layer, dropout_layer, tanh_layer, cudnn_tanh_layer, neuron_layer, relu_layer, argmax_layer, hdf5_data_layer, deconv_layer, hdf5_output_layer, threshold_layer, euclidean_loss_layer, convert_imageset, finetune_net
NB: To stop the server you need to kill the devserver process:
$ ps aux | grep 5000
$ kill -9 <processID>   (where processID is taken from the ps output)
Training the MICCAI dataset with AlexNet: 11 minutes 17s
Training with GoogLeNet: 10 minutes 46s - output #8: loss3/top-5 = 0.958333
Full output logs are available but too extensive to list here.
Fig 38. Visualisation from DIGITS
Fig 39. Visualisation from within AlexNet via DIGITS
Fig 40. Visualisation from within AlexNet via DIGITS
Fig 41. Visualisation from within AlexNet via DIGITS
Appendix Q: Timekeeping I used a Google Spreadsheet to keep a note of time and progress, as shown partially below:
Fig 42. Timekeeping
Appendix R: Available datasets (Datasets tested in this report shown in bold)
● SVHN: Street View House Number dataset, a real-world image dataset from Google Street View for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. Images are small cropped digits: over 600,000 digit images, 10 classes, 1 for each digit (digit '1' has label 1, '9' has label 9 and '0' has label 10), with 73,257 digits for training, 26,032 digits for testing, and 531,131 additional images to use as extra training data. Available in two formats: the original images with bounding boxes, or MNIST-like 3x32x32 images centred around a single character (many of the images do contain some distractors at the sides).
● MNIST: the classic handwritten digit dataset
● NIST: similar to MNIST, but larger
● Perturbed NIST: a dataset developed in Yoshua's class (NIST with tons of deformations)
● CIFAR10 / CIFAR100: 32×32 natural image datasets with 10/100 categories (http://www.cs.utoronto.ca/~kriz/cifar.html)
● Caltech 101: pictures of objects belonging to 101 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
● Caltech 256: pictures of objects belonging to 256 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech256/)
● Caltech Silhouettes: 28×28 binary images containing silhouettes of the Caltech 101 dataset
● STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning and self-taught learning algorithms
● NORB: binocular images of toy figurines under various illumination and pose (http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/)
● ImageNet: an image database organised according to the WordNet hierarchy (ILSVRC2012 - 1,431,167 images - 1000 classes: pill bottle, beer bottle, wine bottle, flamingo, quail, minivan, jeep, cab ...) http://www.image-net.org/
● MICCAI: medical images owned by the University of Utrecht, see Appendix L
● Pascal Visual Object Classes (VOC): a benchmark (2005-2012) with 20 classes in 22,591 images (bottle, bird, car ...) http://pascallin.ecs.soton.ac.uk/challenges/VOC/
● LabelMe: a large dataset of annotated images, http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php
● COIL-20: different objects imaged at every angle in a 360° rotation (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)
● COIL-100: different objects imaged at every angle in a 360° rotation (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)
There is a further large, updated list of available datasets at: http://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets
Last page left intentionally blank