Using a CUDA-enabled Graphics Card to Accelerate Neural Network Design for Breast Cancer Computer-aided Diagnosis

By Fumbeya Luis Marungo

MSc Advanced Information Systems project report
Department of Computer Science, Birkbeck, University of London
2010

This report is substantially the result of my own work except where explicitly indicated in the text. I give my permission for it to be submitted to the JISC Plagiarism Detection Service. The report may be freely copied and distributed provided the source is explicitly acknowledged.
Table of Contents

Table of Figures
Abstract
Chapter 1 – Introduction
  1.1 The Need to Accelerate Performance in Computer-aided Diagnosis
  1.2 Report Structure
Chapter 2 – Background
  2.1 Overview
  2.2 Computer-aided Diagnosis in Breast Cancer
    2.2.1 CADx Overview
    2.2.2 How Cases are Characterized in CADx
    2.2.3 Az, the Common CADx Classification Accuracy Measure
    2.2.4 Neural Networks in CADx
    2.2.5 The Challenges Facing CADx Neural Networks
  2.3 Current Parallel Computing Approaches
    2.3.1 Why Parallelism Is Important
    2.3.2 CUDA and the GPU
    2.3.3 The Multi-Core CPU
    2.3.4 Other Parallel Options
  2.4 A Comparison of CPU and GPU Threading Models
    2.4.1 A Comparison of CUDA to CPUs Using Flynn's Taxonomy
    2.4.2 The Impact of the Threading Model on CPU and GPU Optimization
  2.5 Relevant Research
    2.5.1 CADx Literature
    2.5.2 CUDA Literature
    2.5.3 Multi-core CPU Literature
  2.6 Libraries, Tools, and Technologies Employed
Chapter 3 – Analysis and Design
  3.1 Overview
  3.2 Analysis
    3.2.1 Requirements
  3.3 Design
    3.3.1 The Algorithms
    3.3.2 The Object-oriented Design
    3.3.3 Data Structures for SIMD
    3.3.4 The SIMD Sampling Data Structure
    3.3.5 Full Design
    3.3.6 Framework Flexibility
  3.4 Summary
Chapter 4 – Implementation
  4.1 Overview
  4.2 General Implementation Details
    4.2.1 CUDA Development
    4.2.2 SSE Development
    4.2.3 The Choice of Native C++
    4.2.4 Factory Classes and Smart Pointers
    4.2.5 Random Numbers and Distributions
  4.3 Neural Network Output Calculation
    4.3.1 Base Design
  4.4 CPU Implementation
  4.5 GPU Implementation
    4.5.1 The Project's Implementation
    4.5.2 Preliminary GPU Optimization Efforts
  4.6 Summary
Chapter 5 – Testing and Results
  5.1 Overview
    5.1.1 The Tests
    5.1.2 The Datasets
  5.2 The NeuralNetEvaluator Tests
    5.2.1 Functional Test Description and Results
    5.2.2 Performance Test Description and Results
  5.3 The NeuralNetTrainer Tests
  5.4 The GeneticSelector Tests
    5.4.1 Test Description
    5.4.2 Functional Test Results
    5.4.3 Performance Test Results
  5.5 CPU Allocation
Chapter 6 – Summary, Conclusion, Future Work, and Evaluation
  6.1 Summary
    6.1.1 Background
  6.2 Conclusion
    6.2.1 Overall Conclusion
    6.2.2 Design
    6.2.3 Implementation
    6.2.4 Testing and Results
  6.3 Evaluation
  6.4 Future Work
Bibliography
Appendix A – Compute Capability 2.0
Appendix B – Testing Details
Appendix C – Factors in CUDA Performance
  C.1 Thread Hierarchy
  C.2 Memory Layout
  C.3 A Shared Memory Implementation
Appendix D – Systems Manual
Appendix E – Source Code
Table of Figures

Figure 2-1 – BI-RADS Categories
Figure 2-2 – Three ROC Curves with Increasing Area Under the Curve (Az) Values
Figure 2-3 – A Feed Forward Neural Network
Figure 2-4 – A Perceptron Network
Figure 2-5 – Logistic Sigmoid
Figure 2-6 – The SISD Architecture
Figure 2-7 – The SIMD, Processor Array Architecture
Figure 2-8 – The SIMD, Vector Pipeline Architecture
Figure 2-9 – The MIMD Architecture
Figure 2-10 – GeForce GTX 260 Device Information
Figure 3-1 – A Package Based Design Approach
Figure 3-2 – Improved Design
Figure 3-3 – The Array of Structures (AoS) Memory Layout
Figure 3-4 – The Structure of Arrays (SoA) Memory Layout Transposes the AoS Layout
Figure 3-5 – Structure of Arrays (SoA) with Required Padding
Figure 3-6 – SoA Memory Layout for Training/Validation
Figure 3-7 – Data Classes, SoA Implementation
Figure 3-8 – Overall Project Design
Figure 4-1 – SIMD Node Multiply Add Assign
Figure 4-2 – CPU, SSE Implementation (SseGlobal.cpp)
Figure 4-3 – CUDA Invocation Function (CudaBasic.cu)
Figure 4-4 – CUDA Kernel
Figure 4-5 – Device Load, Unload, and Copy Helper Functions (PROJ_MarungoF_Cuda.cu)
Figure 5-1 – Results
Figure 5-2 – Windows Task Manager
Figure C-1 – Memory System of the 8800GTX
Figure C-2 – A Shared Memory Design
Figure D-1 – Project's Visual Studio 2008 Solution Explorer Window
Abstract

The graphics processing unit (GPU) is a high-performance chip that controls the graphics card inside a computer. NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA) in late 2006. CUDA is designed to allow general-purpose programming targeting the GPU. This project examines techniques and issues in using CUDA to accelerate computational processing in breast cancer related computer-aided diagnosis (CADx). It presents the following:

1. Implementations of GPU-based and CPU-based neural network calculators.
2. A design framework for integrating CUDA into typical CADx neural-network-based algorithms.
3. A sample implementation of the framework elements using a genetic algorithm for feature selection, and an evolutionary computing algorithm for network training.
4. Functional and performance testing implementations for the framework elements.

Despite numerous optimizations in the CPU implementation, the GPU implementation provides roughly an 18x speedup in raw network output calculation. Using the GPU yields slightly less than a 4x speedup in total runtime. The GPU implementation also provides better scaling for future hardware upgrades.
Chapter 1 – Introduction

1.1 The Need to Accelerate Performance in Computer-aided Diagnosis
Computer-aided diagnosis (CADx) in breast cancer involves the classification of a previously identified region of interest in a medical image. The term "previously identified" is relevant as it distinguishes CADx from computer-aided detection, also known as CADe (Lo et al. 2006). The most common medical image is a mammogram. Other possible imaging modalities include MRI and ultrasound. Neural networks are frequently used for region classification in CADx. Creating an optimal neural network for a CADx problem presents several challenges, including:

1. Selecting the optimal features to serve as network inputs (Zheng 2009).
2. Selecting an appropriate network architecture (Land et al. 2006).
3. Evaluating classification accuracy with relatively small data sets (Lo et al. 2006).
4. Training networks optimally (Land et al. 2006).
Methods employed to address the difficulties above include:

1. Genetic algorithms for feature selection and network architecture (Campanini & Lanconelli 2006).
2. Sampling techniques such as k-fold cross-validation and the bootstrap (Kohavi 1995).
3. Evolutionary computing for training a neural network (Porto, Fogel & Fogel 1995).

These methods are computationally expensive. Until recently the regular release of newer and faster CPUs led to automatic increases in processing speed with each hardware upgrade. In 2005 the engineering challenges of creating processors faster than 3 GHz led manufacturers to shift focus from creating faster chips to fitting more processor units ("cores") into a single chip. The speed of each core remains about 3 GHz (Geer 2005). A direct consequence of the new trend is that computationally expensive techniques will not experience a reduction in runtime on new hardware unless they are implemented in a scalable parallel¹ manner. The following parallelism technologies are available on workstations:

1. Graphics Processing Units (GPUs) are the chips that control graphics cards in workstations. GPUs are many-core² processors designed for massive computational lockstep parallelism. The Compute Unified Device Architecture (CUDA) is a framework released by NVIDIA in 2006; CUDA provides a means to apply the GPU's processing power to general-purpose³ calculation.
2. Multi-core CPUs⁴ offer concurrency at two levels. Each core is an independent processing unit capable of hosting one or two hardware threads. Each thread, at the register level, is capable of executing a single math operation on a vector of four floating point numbers simultaneously. Intel calls the CPU instructions that execute a single operation over four numbers Streaming SIMD Extensions (SSE).
3. Hosted languages, such as Java and C#, allow CPU thread creation. A recent development is Microsoft's Task Parallel Library, available in Visual Studio 2010. The library supports optimized parallel execution on multi-core workstations using C# version 4.0.

This project explores using CUDA for feature selection and network architecture design in CADx neural networks. The project presents a design that integrates CUDA into a genetic algorithm; the algorithm performs feature selection and designs the network architecture. The project's implementation includes an evolutionary neural network trainer and two neural network output calculators: one calculator uses the CPU and the other uses the GPU. The CPU implementation employs a great deal of optimization with low-level Assembly calls. The optimizations ensure a fair comparison between the two technologies. The project includes functional and performance tests for the genetic algorithm, the evolutionary trainer, and the two neural network calculation implementations.

¹ In this report, the terms concurrent and parallel are used interchangeably unless otherwise stated. Some will describe concurrent systems as MIMD systems and parallel systems as SIMD systems. See Section 2.4.1 for further details on these two terms.
² Many-core, as opposed to multi-core, is a designation of scale. For example, a present-day GPU has more than 200 cores; a present-day CPU may have six cores.
³ Generally the term GPU refers to any graphics processing unit. In this report the term GPU refers to a GPU controlling a CUDA-enabled device unless otherwise stated.
⁴ In this report the term CPU refers to modern multi-core Intel x86 microprocessors unless otherwise stated.
1.2 Report Structure
Chapter 2 through Chapter 5 start with overviews that briefly describe the topics the chapter covers. Chapter 3 and Chapter 4, the chapters that cover the technical work, have summaries to review salient points.

Chapter 2 is a very substantive chapter; its subject matter spans the various topics this project touches upon. The chapter opens with an overview of breast cancer CADx in general and CADx neural networks in particular. An overview of parallelism follows; this covers parallelism's importance in contemporary computing, its role in modern CPUs and GPUs, and other parallelism options that exist. A more in-depth look at the CPU and GPU follows, with an examination of both processors' threading models and optimization methods. The chapter closes with a look at previous work in CADx neural networks and relevant hardware-based acceleration.

Chapter 3 opens with the system requirements. These requirements are driven by the domain needs of CADx. The design section starts with a presentation of the algorithms implemented. Next there is a comparison of two possible object-oriented approaches and a description of the approach adopted in this project. An explanation of the data layout needs of both GPU and CPU programming is the last topic.

Chapter 4 presents the implementation of the ideas presented in Chapter 3. It opens with general features of GPU and CPU development, an explanation of why the project required a native C/C++ implementation, and a description of some design decisions that are specific to a C/C++ implementation. The chapter continues with an explanation of the core calculation of the program, that is, the calculation of a neural network's output. Finally there is an explanation of the specifics of the CPU and GPU calculation.

Chapter 5 describes the project's functional and performance testing. There is a general description of the project's test methodology and the datasets used. Each project component's tests and results are then handled in turn: an explanation of the purpose of each test, a description of the functional and performance (if applicable) test procedures, and the test results.

Chapter 6 opens with a review of the purpose of the project. It continues with an overall discussion of CUDA's potential in CADx as well as conclusions drawn from different phases of the project. Next there is an evaluation covering my personal views of the project. Finally the report concludes with a look at future work.
Chapter 2 – Background

2.1 Overview

This chapter presents background on the major topics that impact the project. It begins with the role of breast cancer computer-aided diagnosis (CADx) in general. The real-world goal of diagnosis is to classify a lesion as malignant or benign. This chapter provides a description of CADx's role in this process, a summary of lesion categorization, and an explanation of the accuracy metric for CADx systems. The chapter continues with a description of neural networks in CADx; the description covers both the role of neural networks and the challenges faced when trying to employ neural networks in CADx.

After the initial coverage of CADx the chapter shifts focus to parallelism. The Compute Unified Device Architecture (CUDA) is a design for using the parallel computing ability of the graphics card to solve problems unrelated to screen rendering. These sections of the chapter cover why parallelism is important, the options for implementing parallelism, and the nature of the CPU-based and GPU-based approaches.

Throughout the chapter there are citations of relevant literature. The penultimate section provides an overview of some of the relevant research; however, if the reader's particular interest is research in the field then it is best to read Chapter 2 in its entirety. The chapter concludes with a list of the libraries, tools, and technologies used for the project's implementation.
2.2 Computer-aided Diagnosis in Breast Cancer
2.2.1 CADx Overview

There are many different opinions on what constitutes computer-aided diagnosis (CADx) as opposed to computer-aided detection (often referred to as CADe). Many no longer recognize the distinction and refer to both processes as computer-aided diagnosis (Lo et al. 2006). For the purpose of this report, computer-aided diagnosis is the automated classification of an identified region of interest, using previously extracted features. Thus CADx does not entail image processing techniques such as segmentation or feature calculation. The regions of interest and the features may come from any combination of automated systems, human experts, patient medical history, etc.

CADx systems have a wide range of applications. One use of CADx systems is to serve as components in CADe systems. After a CADe system detects a region of interest, its CADx subsystem determines the region's likelihood of malignancy. The CADe system then returns the regions with a likelihood that crosses a preset threshold. A recent large-scale study of 31,000 women in the UK demonstrated that the combination of a single reader and a detection system with a CADx component acting as a "second reader" yielded similar mammogram screening accuracy to two human readers. The former detected 198 out of 227 possible cases and the latter detected 199 of the 227 possibilities (Gilbert et al. 2008).

Another application of CADx systems is to help determine if a biopsy is necessary. Mammograms detect 80%-90% of all symptom-free breast cancers in women.
The downside is that mammograms also generate many false positives. 5%-10% of all women have their mammograms interpreted as "abnormal" or "inconclusive." Most of the abnormal and inconclusive interpretations resolve to benign cases after a biopsy and/or further imaging studies (American Cancer Society 2009). 75% of the 700,000 biopsies performed each year in the United States yield benign results. Unnecessary biopsies are an emotional and physical burden on patients and a large addition to the true cost of mammograms (Bilhanan 2004, p. 13). Jiang, et al. (1999) report that radiologists using CADx assistance can increase both sensitivity and selectivity in recommending biopsies.

2.2.2 How Cases are Characterized in CADx

Breast Imaging Reporting and Database System (BI-RADS®) Assessment and Follow-up Recommendations

a. Assessment is Incomplete

Category 0: Need Additional Imaging Evaluation and/or Prior Mammograms for Comparison. Follow-up: additional imaging and/or prior images are needed before a final assessment can be assigned.

b. Assessment is Complete – Final Categories

Category 1: Negative. Follow-up: routine annual screening mammography (for women over age 40).

Category 2: Benign Finding(s). Follow-up: routine annual screening mammography (for women over age 40).

Category 3: Probably Benign Finding – Initial Short-Interval Follow-Up Suggested. Follow-up: initial short-term follow-up (usually 6-month) examination.

Category 4: Suspicious Abnormality – Biopsy Should Be Considered. Follow-up: usually requires biopsy. Optional subdivisions:* 4A: finding needing intervention with a low suspicion for malignancy; 4B: lesions with an intermediate suspicion of malignancy; 4C: findings of moderate concern, but not classic for malignancy.

Category 5: Highly Suggestive of Malignancy – Appropriate Action Should Be Taken. Follow-up: requires biopsy or surgical treatment.

Category 6: Known Biopsy-Proven Malignancy – Appropriate Action Should Be Taken. Follow-up: category reserved for lesions identified on imaging study with biopsy proof of malignancy prior to definitive therapy.

* A subdivision may be used in addition to the Category 4 final assessment; MQSA does not allow a subdivision to replace a Category 4 final assessment. Use of a subdivision is at the discretion of the facility; it is not required by the FDA.

Copied from (American College of Radiology 2009). http://www.acr.org/SecondaryMainMenuCategories/quality_safety/BIRADSAtlas/BIRADSFAQs.aspx

Figure 2-1 – BI-RADS Categories
The American College of Radiology's Breast Imaging Reporting and Database System (BI-RADS®) provides the standard categorization for lesions (D'Orsi, Bassett & Berg 2003). Figure 2-1 above describes the BI-RADS categories. A binary classifier that recommends whether or not to proceed with a biopsy needs to be able to accurately determine the nature of Category 4 lesions. Barring outside information, such as the enlargement of a lesion, Category 3 lesions are unlikely to be malignant and do not require a biopsy (Sickles 1999) (Sickles 1991). Category 5 lesions clearly require a biopsy.

2.2.3 Az, the Common CADx Classification Accuracy Measure

Az, or the area under the Receiver Operating Characteristic (ROC) curve, is the standard measure of classification accuracy in CADx (Lo et al. 2006). The ROC curve captures the tradeoff between detecting positive cases and misdiagnosing negative cases. The ROC curve assumes that a classifier's settings can vary. Point (0, 0) represents the setting where the classifier rejects everything. Point (1, 1) represents the setting where the classifier accepts everything. Figure 2-2 below presents three ROC curves with increasing values for Az. ROC 1 has the smallest Az. Point (.6, .2) lies on ROC 1. This means that ROC 1 depicts a classifier that will misdiagnose 60% of the negative cases when it properly diagnoses 20% of the positive cases. ROC 1's classifier is clearly not desirable. ROC 2 reverses these numbers: when ROC 2's classifier properly detects 60% of the positive cases it misdiagnoses 20% of the negative cases. ROC 3 has the highest Az; when its classifier properly detects 80% of the positive cases it misdiagnoses 20% of the negative cases.
Figure 2-2 – Three ROC Curves with Increasing Area Under the Curve (Az) Values
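As an illustration of the metric, an empirical Az estimate can be computed from a set of ROC points with the trapezoidal rule. The sketch below is a minimal C++ example of that calculation; it is illustrative only, is not taken from the project's source code, and the function and variable names are this report's own.

#include <algorithm>
#include <utility>
#include <vector>

// Estimate Az, the area under the ROC curve, from empirical ROC points
// using the trapezoidal rule. Each point is (false positive fraction,
// true positive fraction); the end points (0, 0) and (1, 1) are appended.
double estimateAz(std::vector<std::pair<double, double> > points)
{
    points.push_back(std::make_pair(0.0, 0.0));
    points.push_back(std::make_pair(1.0, 1.0));
    std::sort(points.begin(), points.end());

    double az = 0.0;
    for (std::size_t i = 1; i < points.size(); ++i) {
        double width = points[i].first - points[i - 1].first;
        double meanHeight = 0.5 * (points[i].second + points[i - 1].second);
        az += width * meanHeight;  // area of one trapezoid
    }
    return az;
}

Applied to the curves in Figure 2-2, such an estimate would rise from ROC 1 to ROC 3.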
2.2.4 Neural Networks in CADx

Neural networks are commonly used as CADx classifiers. By far the most frequently used CADx neural network structure is the single feed-forward network with one hidden layer and a single output node, such as the network in Figure 2-3 below (Zheng 2009).

Neural networks model brain function. Each network node represents a neuron. The hallmark of a neural network is that each node accepts a single input value and emits a single output value. A node's input value is the weighted sum of the outputs of the nodes that are connected to it. The node's output value is the result of a calculation using an activation function and the input value. Figure 2-4 depicts the basic neural network calculation. The input value is x = x1w1 + x2w2 + ... + x7w7, and f is the activation function. The node's output is y = f(x). Typically networks use the simple logistic sigmoid function in Figure 2-5 below as the activation function. The node feeds its output value forward to the nodes it is connected to on the next layer.

The network in Figure 2-4 is a perceptron; it is a network with no hidden layer. All of the inputs feed directly into one output node⁵. The node takes the weighted sum of the inputs and applies an activation function; the result is the perceptron's output. Figure 2-3 depicts a network with a hidden layer. The inputs do not feed directly into the output node; they feed into a layer of hidden nodes. The outputs from the hidden layer serve as the inputs to the output node.

The addition of a hidden layer has a significant influence on a network. A perceptron can only perform linear separation; a network with a hidden layer can model more complex nonlinear relationships. However, networks with hidden layers are opaque because the hidden layer masks the relationship between the inputs fed into the network and the final output calculated.

Networks with one output node can serve as binary classifiers. Set a cutoff value k ∈ (0, 1) and let x be the output from the neural network. x < k yields a "benign" classification, and x ≥ k yields a "malignant" classification. If the output node has a logistic sigmoid activation function, normally k = .5. In the context of ROC, point (0, 0) represents k = 1, point (1, 1) represents k = 0, and the ROC curve represents the tradeoff as k varies. There is no fixed k value in ROC analysis of neural networks.
Figure 2-3 – A Feed Forward Neural Network (copied from Wikipedia, Neural Network)

Figure 2-4 – A Perceptron Network (copied from Wikipedia, Perceptron)
⁵ This project only concerns using neural networks as binary classifiers. Therefore, the report only examines networks with a single node in the output layer.
f(x) = 1 / (1 + e^(-x)), x ∈ ℝ → f(x) ∈ (0, 1)

Figure 2-5 – Logistic Sigmoid
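To make the calculation concrete, the sketch below computes the output of a single-hidden-layer feed-forward network of the kind shown in Figure 2-3, using the logistic sigmoid of Figure 2-5 as the activation function at every node. It is a plain scalar illustration rather than the project's optimized SSE or CUDA implementation; bias terms are omitted for brevity and all names are this report's own.

#include <cmath>
#include <vector>

// Logistic sigmoid: f(x) = 1 / (1 + e^-x), mapping the reals onto (0, 1).
float logisticSigmoid(float x)
{
    return 1.0f / (1.0f + std::exp(-x));
}

// Output of a feed-forward network with one hidden layer and a single
// output node. hiddenWeights[j] holds the input weights of hidden node j;
// outputWeights[j] weights that node's activation at the output node.
float networkOutput(const std::vector<float>& inputs,
                    const std::vector<std::vector<float> >& hiddenWeights,
                    const std::vector<float>& outputWeights)
{
    float outputSum = 0.0f;
    for (std::size_t j = 0; j < hiddenWeights.size(); ++j) {
        float hiddenSum = 0.0f;  // weighted sum of this node's inputs
        for (std::size_t i = 0; i < inputs.size(); ++i)
            hiddenSum += inputs[i] * hiddenWeights[j][i];
        outputSum += logisticSigmoid(hiddenSum) * outputWeights[j];
    }
    return logisticSigmoid(outputSum);
}

An output x from this function classifies the region as "malignant" when x ≥ k and "benign" otherwise, for the chosen cutoff k.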
2.2.5 The Challenges Facing CADx Neural Networks

Failure to Outperform Linear Classifiers

In practice, medical neural networks with hidden layers frequently do not provide classification accuracy that is superior to linear regression (Sargent 2001). Depending on the particular biological process under investigation it is also possible that the best type of classifier is a linear separator (Schwarzer, Vach & Schumacher 2000). Selecting an optimal combination of neural network topology and input features is still not immediately obvious in nonlinear problems where a neural network may provide more accurate results than a linear separator. By trying many network configurations, a genetic algorithm can be an effective method for neural network design and feature selection (Campanini & Lanconelli 2006). The genetic algorithm can compare the accuracy of linear separation to nonlinear separation by permitting perceptron networks. When perceptrons are allowed by a genetic algorithm they often dominate the top-performing networks (Land et al. 2006). This finding is in line with Sargent (2001) and Schwarzer, et al. (2000).

In the case of CADx, having a genetic algorithm for determining optimal neural network design is important. The availability of one feature can significantly alter the problem domain. For example, breast cancer history is a feature that frequently appears in superior networks (Land et al. 2006). If this feature is missing it is quite possible that the classification problem may change from linear to nonlinear separation, i.e. the ideal network topology may change from a perceptron to a neural network with a hidden layer. A genetic algorithm may detect this change.

The Lack of a Large Mammogram Database

There are two distinct types of regions of interest in breast cancer CADx: masses and calcifications. Masses are the well-known "lumps" commonly associated with breast cancer. Calcifications are clusters of small lesions. The two types have different feature sets and are typically evaluated by different CADx systems.
An ideal database would contain 100,000 BI-RADS Category 4 cases for each region type (Sutton 2009). Frequently the problem domain only consists of Category 4 cases for CADx classifiers (see Section 2.2.2). At present, the largest publicly available database is the University of South Florida's Digital Database for Screening Mammography (DDSM) (Heath et al. 2001) (Heath et al. 1998). The DDSM has 2,640 cases. More than a quarter of the cases are normal; the cases are further subdivided between calcifications and masses; few of the remaining cases are BI-RADS Category 4.

Sampling techniques such as cross-validation or the bootstrap are frequently employed in CADx to address the lack of a large database. Bootstrap sampling involves creating multiple datasets by sampling the original dataset with replacement. The bootstrap provides both an assessment of the accuracy of a classifier and an assessment of its variability under different training sets (Efron & Tibshirani 1998) (Marungo 2010). A regular bootstrap methodology is inappropriate with algorithms that have a memory component, such as neural networks (Kohavi 1995). The leave-one-out bootstrap uses the sampled data items for training, and uses the estimated 36.8% of data items that are not in the sample for validation (Jiang & Simon 2007).

Drawbacks of Back Propagation Training

Neural networks require training to determine the appropriate values for the network weights. The typical neural network training method is back propagation. Back propagation uses the method of steepest descent to adjust the network's weights and biases. The method starts with an initial set of weights and continuously modifies the set to incrementally reduce the net classification error. This method only requires that all of the activation functions are differentiable. Back propagation often creates locally optimal, but globally suboptimal, neural network weights and biases. This will lead to an artificially low evaluation of a neural network's accuracy. Evolutionary computing can reduce the likelihood of converging to a local optimum by evaluating a broad distribution of weight combinations.
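A minimal sketch of that idea follows: each generation perturbs every candidate weight vector with Gaussian noise and keeps the fitter half. It illustrates the general technique rather than the project's trainer (described in Chapter 4); the population handling, mutation scale, and fitness callback are placeholders, and it is written against C++11 for brevity while the project itself uses the Boost Random library under Visual C++ 2008.

#include <algorithm>
#include <functional>
#include <random>
#include <utility>
#include <vector>

using WeightVector = std::vector<float>;

// One generation of a simple evolutionary trainer. Every parent spawns a
// Gaussian-perturbed child; the population is then ranked by fitness
// (for example Az on the training set) and truncated to its better half.
void evolveOneGeneration(std::vector<WeightVector>& population,
                         const std::function<double(const WeightVector&)>& fitness,
                         std::mt19937& rng, float sigma)
{
    std::normal_distribution<float> noise(0.0f, sigma);
    const std::size_t parents = population.size();
    for (std::size_t p = 0; p < parents; ++p) {
        WeightVector child = population[p];
        for (float& w : child)
            w += noise(rng);  // mutate every weight
        population.push_back(child);
    }

    // Evaluate each candidate once, sort fittest first, keep the better
    // half. Starting from a broad population lowers the risk of settling
    // on weights that are only locally optimal.
    std::vector<std::pair<double, std::size_t>> ranked;
    for (std::size_t i = 0; i < population.size(); ++i)
        ranked.push_back({-fitness(population[i]), i});
    std::sort(ranked.begin(), ranked.end());

    std::vector<WeightVector> survivors;
    for (std::size_t i = 0; i < parents; ++i)
        survivors.push_back(population[ranked[i].second]);
    population.swap(survivors);
}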
2.3 Current Parallel Computing Approaches

2.3.1 Why Parallelism Is Important

During the 1990s, chip clock speeds increased 60% per year; increases dipped to 40% from 2000 until 2004. By 2004, doubling a single-core CPU's die only led to a 20% speed increase. That year, problems with power consumption and heat generation led Intel to cancel three planned single-core processors. The top end of the planned processors was to be a 4 GHz dual-threaded single-core processor. During that year Apple also had to delay the release of the iMac G5 due to CPU manufacturing problems at IBM (Geer 2005) (Sutter 2005).

Since 2005 clock speeds have remained relatively constant as chip makers used the gains from Moore's Law to build more cores onto a single microprocessor. Using lower clock speeds and adding multiple processing cores significantly reduces power consumption. For example, an Intel Pentium 4 Extreme Edition with a 3.8 GHz clock speed uses up to 130 W; an Intel Core 2 Duo at 2.93 GHz uses 40-80 W (Giles 2009). Six years after the cancellations in 2004, clock speeds are relatively unchanged. Intel's top-of-the-line i7 chip's top clock speed is 3.33 GHz; however, the chip has six dual-threaded cores (Wikipedia 2010b).

The multi-core trend has significant implications for software design. With constant clock speeds, sequential programs do not obtain performance gains with each hardware upgrade. Performance gains must come from scalable concurrent design (Kirk & Hwu 2008).
2.3.2 CUDA and the GPU

A Graphics Processing Unit (GPU) is a microprocessor that controls a graphics card. GPUs are designed to render the real-time, high-definition 3D graphics required by gamers. The modern-day GPU is a massively parallel many-core processor optimized for floating point operations (NVIDIA Corporation 2010b, p. 1). Graphics rendering requires performing the same operation over individual pixels of a screen region in parallel. Rendering each pixel can occur in lockstep fashion; with one parallel execution there is no need for synchronization. GPUs are designed to handle large-scale, highly parallel floating point operations.

Historically, the drawback with attempting to apply the GPU's performance advantage to general-purpose programming is that it requires translation into a graphics metaphor. For example, the Steinkraus, et al. (2005) GPU neural network implementation translated the problem into a matrix multiplication operation using texture mapping. In November 2006, NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA). CUDA removes the translation requirement. CUDA exposes an extended C API that allows for direct execution of general-purpose programs on the GPU (NVIDIA Corporation 2010b, pp. 4-5). CUDA is not simply a software abstraction; it specifies hardware requirements. CUDA-enabled devices can operate in a normal graphics processing mode or in a separate general-purpose compute mode (NVIDIA Corporation 2008).

2.3.3 The Multi-Core CPU

A modern CPU is a multi-core processor. In essence, each core is an independent single-core processor. The CPU's responsibility is not primarily to perform calculations; the CPU manages the entire computer. For example, one core may be managing network I/O, another core may be reading a file, and another core may be calculating the values in a spreadsheet. In that case each activity is entirely independent of the others, and only one activity involves calculation.

2.3.4 Other Parallel Options

FPGAs

Reprogrammable integrated circuits called Field Programmable Gate Arrays (FPGAs) offer another hardware solution for implementing high-performance parallelism. FPGAs have the advantage of having the program physically burned into the chip. This provides FPGAs with an initial performance advantage over GPUs. However, as the amount of data grows beyond the FPGA's internal capacity, memory latency can reduce the FPGA's advantage. Che, et al. (2008) report that when performing calculations using small matrix sizes an FPGA outperforms a GPU by ~6x on Gaussian Elimination and ~50x on Needleman-Wunsch. As the input matrix size increases, the performance advantage falls to 3x in the former case, and virtually nothing in the latter case.

There are implementations of nature-inspired algorithms on FPGAs (Graham & Nelson 1996) (Chai et al. 2009); however, FPGAs offer far less programmer productivity than CPU programming (Benkrid 2008). Che, et al. (2008) compare FPGA programming productivity to GPU programming productivity using lines of code as a metric. They report that the study's FPGA implementations of Gaussian Elimination, DES, and Needleman-Wunsch contain 450, 1400, and 650 lines of code respectively, while the respective CUDA implementations contain 160, 1100, and 320 lines of code.
With typical FPGA development cycles lasting eight months (Feist 2009), GPU and CPU programming offer substantial benefits over custom hardware solutions in terms of speed of implementation and functional flexibility, even where hardware reprogramming is possible.

Distributed Computing

A cluster of workstations can also be used for parallel processing. A distributed implementation of a genetic algorithm can have a central process that assigns members of the cluster the task of calculating the fitness values of a set of chromosomes. Using load balancing, such a system can operate on the cluster workstations in the background, leaving the users undisturbed (Bevilacqua, Campanini & Lanconelli 2001). Distributed computing uses coarse-grained parallelism; therefore it can work in conjunction with the fine-grained parallelism of CPUs and GPUs.

New Parallelism Support Introduced in C# 4.0

Visual C# 2010 (C# v. 4.0) contains a new library specifically designed to support parallelism, the Task Parallel Library (TPL). This library provides a method to manage parallel calls from the managed .NET environment.
2.4 A Comparison of CPU and GPU Threading Models⁶
2.4.1 A Comparison of CUDA to CPUs Using Flynn's Taxonomy

Flynn's taxonomy (Flynn 1972) is a widely used parallelism classification scheme. There are four classifications:

1. Single instruction stream, single data stream (SISD).
2. Single instruction stream, multiple data streams (SIMD).
3. Multiple instruction streams, single data stream (MISD).
4. Multiple instruction streams, multiple data streams (MIMD).

SISD systems are serial computers with no concurrency support. There is one processor that performs one instruction at a time over a single unit of data (see Figure 2-6 below). MISD systems are rare and not relevant to this discussion.

In SIMD systems a single instruction executes across multiple data units. This operation is intrinsically synchronous because the single instruction executes simultaneously over multiple independent data streams. SIMD can be subdivided into two types: processor arrays and vector pipelines (Duncan 1990). Figure 2-7 below depicts a SIMD processor array. Each processor simultaneously executes the same instruction. The value of A(1) is broadcast over all the processors. The processor index number specifies the non-broadcast values.

Figure 2-8 below depicts a SIMD vector pipeline. One instruction executes over a fixed data size (in this case four values).

Figure 2-9 below depicts a MIMD system. In MIMD there is no relationship between the activities of the different processors. The MIMD system is essentially a collection of independent SISD processors.

To apply Flynn's taxonomy to CPU and GPU threading models, replace the term "processor" with the term "thread." GPUs use a SIMD processor array threading model. NVIDIA refers to this model as Single Instruction Multiple Thread (SIMT). The term "kernel" refers to the common function that all of the SIMT threads are executing. CPUs use an MIMD threading model. In addition, each CPU thread can execute a SIMD vector pipeline instruction over four values. That is, in Figure 2-9 the following operations can occur simultaneously:

1. P1 – load the four numbers at B(1).
2. P2 – multiply two sets of four numbers (in this case x, y, and z are each vectors with four components).
3. Pn – multiply one set of four numbers by the four numbers stored at memory address 3.

Streaming SIMD Extensions (SSE) is the name of Intel's SIMD vector pipeline instruction set. Intel introduced SSE in 1999 (Wikipedia 2010d). Processor array based SIMT varies from vector pipeline based SSE in two critical ways:

1. SSE can only operate over a fixed data length; SIMT parallelism is scalable over n threads. A thread index determines which data to read, write, or operate on (NVIDIA Corporation 2010b, pp. 77-78).
2. An SSE instruction is a single instruction executed on an independent MIMD thread; SIMT threads are constrained to executing the same set of instructions over a group of threads in lockstep.

⁶ A CUDA-enabled device's Compute Capability defines the features and technical specifications it offers. This project employs a GTX 260, which has Compute Capability 1.3 (NVIDIA Corporation 2010b, p. 95). This report is based on Compute Capability 1.3. The latest devices, released by NVIDIA on 29 March 2010, have Compute Capability 2.0 (Rizzo 2010). There is a brief overview of the improvements Compute Capability 2.0 offers in Appendix A.
Figure 2-6 – The SISD Architecture.

Figure 2-7 – The SIMD, Processor Array Architecture.
Figure 2-8 – The SIMD, Vector Pipeline Architecture.
Figure 2-9 – The MIMD Architecture.
All four previous diagrams copied from (Barney 2010).

2.4.2 The Impact of the Threading Model on CPU and GPU Optimization

A GPU dedicates 80% of its chip area to processing. The key to high-performance GPU execution is to constantly keep the GPU's raw computational capacity busy. CUDA's execution model packages SIMT threads into groups of 32; NVIDIA refers to these groups as warps. All threads in a warp execute simultaneously in lockstep SIMD fashion⁷. Keeping the processor busy is the job of the thread scheduler. For example, the thread scheduler can switch processing from a warp that is latent, waiting for a read from memory, to a warp that is ready to perform calculations. Because there is no caching or branch prediction, context switching is a zero-cost operation. The principle is to create as many threads as possible; the scheduler will then have the flexibility to keep swapping warps for execution as need be. Therefore, the more SIMT threads there are, the better CUDA performs.
⁷ For the sake of readability this report makes no clear distinction between the logical and physical models in technical descriptions. For example, the actual physical CUDA lockstep execution uses alternating half-warps. From a logical point of view this fact is not relevant, as the second execution immediately follows the first and performs the same instruction. In general this report will not drill into such details. Please see the cited material for further information on the nuts and bolts of CUDA and CPU execution.
Only about 20% of a CPU's available chip area is devoted to computation, with over half devoted to cache memory. The key to high-performance CPU execution is efficient serial execution. A high cache hit rate and efficient reordering of work using branch prediction underlie CPU optimization (NVIDIA Corporation 2008). MIMD threads are heavyweight threads. Each one requires a large cache and instruction pipeline to optimize for serial execution. When the number of MIMD threads is greater than the number of available hardware threads⁸, the CPU must perform context switching and performance therefore declines. The Marowka (2009) study bears this out.

In summary, GPUs consist of raw calculators. The more data that stream into the GPU, the busier the calculators are, and the higher the GPU's performance. CPUs consist of task processors. A CPU thread can remember what it did (caching) and plan for what it must do next (branch prediction). The more a CPU thread's current action is related to the past, or the future, the faster the execution. In addition, CPUs have an SSE instruction set that allows each MIMD thread to perform a simultaneous SIMD instruction over a vector of four numbers. SSE instructions are the major tool for accelerating CPU calculation.

⁸ In a CPU there are one or two hardware threads per core.
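The contrast can be made concrete with two short fragments, shown below as a hedged sketch rather than the project's code (the names are illustrative, and the SSE loop assumes the array length is a multiple of four). The CUDA kernel scales over any number of lockstep threads because each thread computes its own index; the SSE routine issues one instruction over a fixed-width vector of four floats from a single MIMD thread. Both fragments can live in one .cu file, with the SSE routine compiled as host code.

#include <xmmintrin.h>  // SSE intrinsics, handled by the host compiler

// SIMT: n lockstep threads run the same kernel; the computed index i
// decides which element each thread scales.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the launch rounds the thread count
        data[i] *= factor;   // up to a whole number of warps
}

// SSE: a single MIMD thread issues one multiply over four packed floats
// per iteration (n is assumed to be a multiple of four here).
void scaleSse(float* data, float factor, int n)
{
    const __m128 f = _mm_set1_ps(factor);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);          // load four floats
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));  // multiply and store
    }
}

A launch such as scaleKernel<<<(n + 255) / 256, 256>>>(devData, 2.0f, n) creates one thread per element; this is the "create as many threads as possible" principle described in Section 2.4.2.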
2.5 Relevant Research
2.5.1 CADx Literature

Fogel, Wasson III & Boughton (1995) and Land et al. (2006) both report that breast cancer CADx neural network classifiers with few hidden nodes tend to be more accurate than neural networks with many hidden nodes. The Fogel, et al. study compares networks with nine input nodes, nine hidden layer nodes, and one output node (9-9-1 networks) to 9-2-1 networks; the networks' respective mean squared errors were .13 and .11.

Land et al. (2006) present two examples of evolutionary computing in neural networks. One example uses evolutionary computing for network training exclusively; the other uses the technique for both training and architecture. In the former case they manually vary the number of hidden nodes between two and five. They find that networks with two hidden nodes are generally more accurate than networks with more hidden nodes. However, they note that on a particularly difficult sample a five-hidden-node network outperforms a two-node network; in this case accuracy declines with six- or seven-node networks. In the evolutionary computing example that includes both training and node architecture, perceptrons are consistently the top-performing networks. Both Fogel et al. (1995) and Land et al. (2006) demonstrate that basic feed-forward neural networks with few or no hidden nodes are appropriate in CADx.

Porto, Fogel & Fogel (1995) report that neural network training using evolutionary computing with a population size of 50 and an iteration count of 100 created significantly more accurate networks than back propagation training. The best performing network in the Land et al. (2006) training-only example has a population size of 200 and an iteration count of 600.

Campanini & Lanconelli (2006) provide an overview of many CADx-related genetic algorithms. They state that Az is a frequent fitness measure. They also provide examples of modified ROC analysis depending on targeted specificity and sensitivity characteristics.
2.5.2 CUDA Literature

Graham & Nelson (1996) performed pre-GPU research in applying FPGAs to genetic algorithms. They created an FPGA implementation of the selection operator. They then compared the performance of the implementation with the FPGA component to an implementation that only used a contemporary high-end CPU. The FPGA implementation led to a 38x speedup in executing the selection portion of the algorithm. The total FPGA speedup for processing the entire algorithm was 4x. This early study demonstrated the potential for using parallel hardware to accelerate genetic algorithm processing.

Pre-CUDA work in using GPUs for back propagation training of neural networks revealed a 3x speedup (Steinkraus, Buck & Simard 2005). This implementation employs a graphics metaphor using texture mapping to perform matrix inner products. One of the key advantages of CUDA is that it removes the requirement to employ a graphics metaphor. CUDA specifies a separate compute mode that is part of the software and hardware of the architecture (Lindholm et al. 2008). The performance gap between GPUs and CPUs has widened substantially since Steinkraus, et al. (2005) (NVIDIA Corporation 2010b, p. 2).

Implementing a neural network with a combination of CUDA and OpenMP, a C++ and FORTRAN parallelism API for the CPU, creates another level of performance gains (Jang, Park & Jung 2008). This implementation uses a multi-core CPU for feature extraction from an image, and CUDA on the GPU to concurrently process the neural network. The CPU/GPU blend creates a 15x performance gain over a CPU-only implementation, and a 4x gain over a GPU-only implementation. The study reports that using the CPU for feature extraction removes the overhead of transferring large amounts of raw image data from the host to the device.

The literature covering not only CUDA but also earlier hardware-driven implementations of genetic algorithms and neural networks demonstrates speedups between 2x and 15x.

2.5.3 Multi-core CPU Literature

Most high-performance C++ programming techniques are now well known. Synchronization and scheduling can be very costly when using CPU multithreading. Because CPU threads and cores operate in an MIMD fashion, it is critical to avoid communication between threads; the overhead can consume 70% of total runtime on multi-core Intel CPU systems. Additionally, there is a decline in performance if the application software thread count exceeds the number of threads available in hardware. This is the opposite case from CUDA: thread scheduling is an overhead on the CPU; on the GPU it is an optimization method (Marowka 2009).

For high-performance SIMD execution there is an SSE implementation of the logistic sigmoid activation function. The SSE implementation displays up to a 38x speedup over conventional approaches (Milner & Grandison 2008). This project uses this approach for calculating neural network output on the CPU.
2.6 Libraries, Tools, and Technologies Employed

The project used the following libraries, tools, and development environment:
1. The NVIDIA CUDA Toolkit 3.0 (NVIDIA 2010).
2. The XFX GeForce GTX 260 216 Core graphics card. Figure 2-10 below displays the card's performance capabilities.
3. A Dell XPS 400 with a Pentium D 2.79 GHz CPU and 3GB of RAM.
4. Visual C++ 2008 Professional Edition.
5. The Boost C++ Libraries v1.43.0 (Boost Project 2010). Boost is an Open Source project that provides a broad range of useful C++ libraries. The project utilizes the following libraries:
   a. Random.
   b. Smart_Ptr.

The selection above represents the versions and hardware available when project development began in early April, 2010. The latest Windows based C++ compiler NVIDIA supported for CUDA at that time was Visual C++ 2008. The GeForce GTX 260 is the least expensive card in NVIDIA's high end GTX 200 GPU line. The XPS 400 is a personal workstation; its clock speed is consistent with current top CPU clock speeds.
Figure 2-10 – GeForce GTX 260 Device Information
Chapter 3 – Analysis and Design

3.1 Overview
The project's analysis and design bridge the divide between the project's goals and the subsequent implementation. The requirements drive the design; the design forms the program's framework, and items in the design figures map directly to components in the implementation. The analysis section contains the implementation's requirements, which derive from a combination of the domain requirements of CADx and the project's overall goals. The design section begins with descriptions of the genetic selection and evolutionary training algorithms. There is then a review of two object-oriented approaches, one package based and the other decoupled, with an explanation of why the project uses the decoupled approach. A discussion of the specific data layout requirements for SIMD, in the context of both SSE and SIMT, follows (see Section 2.4), with descriptions of both the general nature of the layout and the specific data structures the project's implementation uses to encapsulate this need. The chapter then reviews the full design, a combination of the decoupled approach and the project's SIMD data layout classes. A demonstration of the design's flexibility, featuring the substitution of one evolutionary training algorithm for another, concludes the chapter.
3.2 Analysis
3.2.1 Requirements

The goal of this project was to explore using CUDA to accelerate design algorithms for breast cancer CADx neural networks. Achieving this goal required a baseline measure of CPU performance: it would be incorrect to assume that the GPU will outperform the CPU, so accomplishing the project's goal requires creating a corresponding CPU component for each GPU component. Optimizations for the CPU differ from optimizations for the GPU, and the implementation's design must be flexible enough to support the disparate needs of both approaches. Accomplishing these goals in the reference implementation has the following project implications:
1. To provide a meaningful comparison between the CPU and GPU, the CPU implementation should use the SSE instructions for calculation (see Section 2.4).
2. The neural network topologies should be basic, with very few hidden nodes (see Section 2.5.1).
3. The domain specific Az should be a metric in analysis (see Section 2.2.3).
4. The neural network training algorithm should employ evolutionary computing (see Section 2.2.5).
5. The training and validation should use datasets generated via a sampling technique, such as leave-one-out bootstrap (see Section 2.2.5).
3.3 Design
3.3.1 The Algorithms

The genetic and evolutionary algorithms implemented are typical general purpose versions of the respective methods. The goal of the project is to measure runtime performance, not classification effectiveness; the exact nature of the algorithms is not important so long as they are characterized by large amounts of parallel computation.

Genetic Algorithm

A genetic algorithm performs feature selection and determines the number of nodes in the hidden layer. The genetic algorithm's steps are as follows (Negnevitsky 2005, pp. 222-225):
1. Randomly generate a set of N chromosomes. Each chromosome represents a neural network. The chromosomes have two components: a binary encoded component representing the available features and a single integer gene representing the number of hidden nodes.
2. Calculate the fitness of each neural network (see below).
3. Using roulette wheel selection, generate N - T partner pairs. Each pair will generate an offspring chromosome. Each offspring gene has an equal probability of coming from either parent.
4. The generated chromosomes and the top T chromosomes based on fitness form the next generation.
5. Calculate the fitness of each neural network (see below).
6. If the generation count is not met, go to step 3.
7. Output the population list with fitness values.

To calculate fitness, the genetic algorithm performs the following:
1. For each sample set (the implementation uses leave-one-out bootstrap to generate the data samples: the training set contains the records selected, and the corresponding validation set contains the unselected records; the genetic algorithm implementation supports any sampling technique, however):
   a. Train the neural network using the training data.
   b. Generate a fitness Az value using the validation data.
2. Sort the fitness values in descending order.
3. Use the bottom Cth performance value as the chromosome's fitness.

Using the worst case performance result provides neural networks that perform well under variation.

Evolutionary Trainer

The neural network training algorithm is as follows (Negnevitsky 2005, pp. 288-289):
1. Create a population of 2m weight vectors w_i with random uniformly distributed numbers between -1.0 and +1.0.
2. Calculate the population fitness using the error squared.
3. Sort the population based on fitness.
4. Set the top m weight vectors as "parent" vectors, and drop the bottom m vectors.
5. For each parent, create a "child" weight vector:

   w'_i = w_i + δ^n σ_i, where δ ∈ (0, 1), n is the current generation number, and the elements of σ_i are uniformly distributed numbers in the interval [-1, 1].

6. Calculate fitness on the combined population of parents and children based on squared error.
7. Sort the population based on fitness.
8. If the required number of generations has been reached, then end; otherwise go to step 4.
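The child-creation step (step 5) maps directly to a few lines of code. The following is an illustrative sketch only: the function and variable names are hypothetical, and the standard C++ random header is used for self-containment, whereas the project's implementation draws its random values from Boost's random library (see Section 4.2.5).

#include <cmath>
#include <random>
#include <vector>

// Sketch of step 5 of the evolutionary trainer: each child weight is the
// parent weight plus a perturbation delta^n * sigma, where sigma is drawn
// uniformly from [-1, 1] and n is the current generation number.
std::vector<float> CreateChild(const std::vector<float>& parent,
                               float delta,      // delta in (0, 1)
                               int generation,   // n, the current generation
                               std::mt19937& rng)
{
    std::uniform_real_distribution<float> sigma(-1.0f, 1.0f);
    const float step = std::pow(delta, static_cast<float>(generation));

    std::vector<float> child(parent.size());
    for (std::size_t i = 0; i < parent.size(); ++i)
        child[i] = parent[i] + step * sigma(rng);  // w' = w + delta^n * sigma
    return child;
}

Because delta is less than one, the perturbation shrinks as the generation number grows, so the search narrows around the surviving parents over time.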
3.3.2 The Object-oriented Design

A "Package Based" Object-oriented Design Approach

A key design requirement is to support both CPU and GPU implementations. The two implementations share most functionality; only a small portion requires dual implementation. One possible approach to reusing the shared components is the "package based" approach in Figure 3-1. In this approach the abstract GeneticSelector and EvolutionaryTrainer classes contain the shared functionality, and implementation specific subclasses reside in individual packages; in this case there are CPU and GPU packages. A CPU based GeneticSelector directly depends on using a CPU based EvolutionaryTrainer.

The package approach is frequently used in object oriented systems. One classic example is databases. Often a database API will have a set of defined interfaces and abstract classes. These base classes contain general database functionality such as connecting, querying, and result reading, functionality common to all database systems. Each individual database platform (Oracle, MySQL, SQL Server, DB2, etc.) has a package containing the platform specific implementation, and each class in the implementation package derives from and corresponds to a predefined abstract base class. The classes inside an implementation package are interdependent: you cannot use an Oracle connection object with a SQL Server query object, for example.

While the package approach is very common, in this case it presents problems with coupling (Larman 2002, pp. 229-236). Coupling is the measure of how strongly one element depends on another; problems occur when there is high coupling along volatile dimensions. Database libraries couple across the relatively stable dimensions of a database platform's functionality. In the context of this project, the corresponding EvolutionaryTrainer and GeneticSelector implementation classes couple across two volatile dimensions: the combination of the appropriate algorithm and the training method can frequently change. A new genetic algorithm implementation requires three new classes: one base class implementing the algorithm and two corresponding implementation child classes.

A Decoupled Object-oriented Design

The design in Figure 3-2 decouples the implementation of the two algorithms and the neural network calculation. It separates each class's responsibilities and collaborations (Beck & Cunningham 1989). In this design, a GeneticSelector's responsibility is to execute the genetic algorithm to find optimal features. To fulfill its responsibility, the GeneticSelector uses a NeuralNetTrainer to obtain trained neural networks. The GeneticSelector is independent not only of whether the trainer uses the CPU or GPU, but also of the training technique; it could use a back propagation trainer without any modification. In turn, the NeuralNetTrainer is responsible for training the neural networks, and collaborates with a NeuralNetEvaluator to calculate the output for a set of neural networks given data and weights. The CudaEvaluator contains the GPU implementation and the SseEvaluator contains the CPU implementation. The NeuralNetTrainer uses the abstract NeuralNetEvaluator base class, so there is no need to modify a NeuralNetTrainer to switch between GPU and CPU implementations.
Figure 3-1 – A Package Based Design Approach

Figure 3-2 – Improved Design
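The decoupled collaborations in Figure 3-2 can be expressed as abstract base classes. The sketch below is a hypothetical rendering of that shape: the class names match the design, but the method names and signatures are assumptions, not the project's exact interfaces.

#include <vector>

// The evaluator abstraction: calculate outputs for a set of networks given
// data and weights. CPU and GPU variants derive from this class.
class NeuralNetEvaluator {
public:
    virtual ~NeuralNetEvaluator() {}
    virtual void Evaluate(const float* data, const float* weights,
                          float* output) = 0;
};

// CPU implementation (SSE); body omitted in this sketch.
class SseEvaluator : public NeuralNetEvaluator {
public:
    void Evaluate(const float* data, const float* weights, float* output) {}
};

// GPU implementation (CUDA); body omitted in this sketch.
class CudaEvaluator : public NeuralNetEvaluator {
public:
    void Evaluate(const float* data, const float* weights, float* output) {}
};

// The trainer depends only on the abstract evaluator, so switching between
// the CPU and GPU implementations requires no change to any trainer.
class NeuralNetTrainer {
public:
    explicit NeuralNetTrainer(NeuralNetEvaluator& evaluator)
        : evaluator_(evaluator) {}
    virtual ~NeuralNetTrainer() {}
    virtual std::vector<float> Train() = 0; // returns a trained weight vector
protected:
    NeuralNetEvaluator& evaluator_;
};

A GeneticSelector written against NeuralNetTrainer is likewise unaware of both the training technique and the evaluator behind it.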
3.3.3 Data Structures for SIMD

The typical data layout in programming uses an array of structures (AoS) approach. Both SIMD paradigms (processor array and vector pipeline) require a structure of arrays (SoA) approach (Wald 2004, p. 79). SoA is a transpose of the traditional in-memory data layout. The task of calculating the average grade for four students, each with eight marks, can demonstrate the difference between the two. The normal AoS approach is to define a Student data structure and create an array of four Student structures; in this case, Student contains an eight element array of grade values. Figure 3-3, Figure 3-4, and Figure 3-5 below demonstrate these layouts. Figure 3-3 shows the AoS layout of the grades in memory: a given student's grades are adjacent to each other. Averaging each student's grades is straight-forward:

• For each student:
  o Set the running total to zero.
  o For each course:
    § Add the current student's current course grade to the running total.
  o Divide the running total by eight.

This layout is not compatible with SIMD. SIMD systems calculate the average for all four students simultaneously. A SIMD calculation is as follows:

• Simultaneously set four running total values to zero.
• For each course:
  o Simultaneously add all four course grades to the corresponding running totals.
• Simultaneously divide all four running totals by eight.

In AoS the corresponding grades for each course are not adjacent in memory; they are separated by eight memory locations. The solution is to transpose the data's memory layout. The SoA approach places the corresponding values adjacent to each other; the memory layout depicted in Figure 3-4 accommodates a simultaneous SIMD execution over all four values.

Another constraint in SIMD is that operations come in fixed multiples. SSE is a vector pipeline and must execute an instruction over exactly four adjacent memory locations at a time. If the number of students is not a multiple of four then SSE will still perform the operation over the padded values. Figure 3-5 depicts an example with five students; in this case two SSE calculations occur per course, and care must be taken to ignore the calculation results from the three padded columns. Because CUDA's SIMT model is a processor array, not a vector pipeline, it does not operate over a fixed amount of data. However, CUDA does group execution internally into units of 32 (the number of threads in a warp). For maximum performance it is best to execute a kernel with a thread count that is a multiple of 32 and use padding. A sketch of the two layouts follows.
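As a hedged illustration of the two layouts for the four-student example, the following sketch uses invented type and function names; it is not the project's code.

#include <xmmintrin.h>  // SSE intrinsics

// AoS: one Student after another. A given student's grades are contiguous,
// but the four students' grades for the same course are eight floats apart.
struct StudentAoS {
    float grades[8];
};

// SoA: the transpose. All four students' grades for a given course are
// adjacent, so one SSE instruction can operate on all four at once.
struct GradesSoA {
    float grades[8][4];   // grades[course][student]
};

// Average all four students' grades simultaneously, as described above:
// four running totals, one vector addition per course.
void AverageSoA(const GradesSoA& g, float avg[4])
{
    __m128 total = _mm_set1_ps(0.0f);            // four running totals = 0
    for (int course = 0; course < 8; ++course)
        total = _mm_add_ps(total, _mm_loadu_ps(g.grades[course]));
    total = _mm_mul_ps(total, _mm_set1_ps(1.0f / 8.0f)); // divide by eight
    _mm_storeu_ps(avg, total);
}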
(memory runs left to right; one student's eight grades are contiguous)
Student 0:  g00 g01 g02 g03 g04 g05 g06 g07
Student 1:  g10 g11 g12 g13 g14 g15 g16 g17
Student 2:  g20 g21 g22 g23 g24 g25 g26 g27
Student 3:  g30 g31 g32 g33 g34 g35 g36 g37

Figure 3-3 – The Array of Structures (AoS) Memory Layout

(memory runs left to right; the four students' grades for each course are contiguous)
Course 0:  g00 g10 g20 g30
Course 1:  g01 g11 g21 g31
Course 2:  g02 g12 g22 g32
Course 3:  g03 g13 g23 g33
Course 4:  g04 g14 g24 g34
Course 5:  g05 g15 g25 g35
Course 6:  g06 g16 g26 g36
Course 7:  g07 g17 g27 g37

Figure 3-4 – The Structure of Arrays (SoA) Memory Layout Transposes the AoS Layout

(five students padded to eight slots; the n/a columns are padding whose results must be ignored)
Grade 0:  g00 g10 g20 g30 g40 n/a n/a n/a
Grade 1:  g01 g11 g21 g31 g41 n/a n/a n/a
Grade 2:  g02 g12 g22 g32 g42 n/a n/a n/a
Grade 3:  g03 g13 g23 g33 g43 n/a n/a n/a
Grade 4:  g04 g14 g24 g34 g44 n/a n/a n/a
Grade 5:  g05 g15 g25 g35 g45 n/a n/a n/a
Grade 6:  g06 g16 g26 g36 g46 n/a n/a n/a
Grade 7:  g07 g17 g27 g37 g47 n/a n/a n/a

Figure 3-5 – Structure of Arrays (SoA) with Required Padding
3.3.4 The SIMD Sampling Data Structure

The training and validation datasets use a SoA layout, with the values for a feature adjacent to each other. Figure 3-6 displays the layout: the width is the number of records, with appropriate padding, and the height is the number of features. The SamplingData class in Figure 3-7 contains a group of training datasets (a set of bootstrap samples, for example) with the matching validation datasets. The only difference between the TrainingSet and TestingSet classes is that the number of records in each TestingSet dataset can vary: in leave-one-out bootstrap, the number of records in the validation set is random, since it is the number of records that were not selected in the original bootstrap.

Note: Lo et al. (2006) make a valid distinction between the terms testing and validation. They correctly assert that validation occurs during the learning process, while testing data should never be any part of the learning process. The TestingSet class should therefore be renamed ValidationSet during a future refactoring, with similar name changes in the appropriate variables and functions.
In the TrainingSet and TestingSet classes (see Figure 3-7), the fields are as follows:
1. Alignment: SSE requires that the floating point array start on a 16 byte aligned memory address; that is, the starting memory address must be a multiple of 16.
2. FieldDim: the number of features a record contains.
3. RecordCnt/RecordCnts: the number of records in the dataset. TrainingSet uses the field RecordCnt, and TestingSet uses RecordCnts, because with leave-one-out bootstrap the number of records in the validation set varies randomly.
4. RecordDim/RecordDims: if the number of records is not an appropriate multiple, there is padding. RecordDim/RecordDims is a multiple of RecordDimMultiple; if RecordCnt is five and RecordDimMultiple is four, then RecordDim is eight.
5. RecordDimMultiple: four for SSE, 32 for SIMT/CUDA.
6. SampleDim/TestsetDim: the number of samples the structure contains. In the case of bootstrapping, this is the number of bootstraps executed; in k-fold cross-validation, this is the value of k.

In keeping with the project's generic design approach, SamplingData is not bound to the bootstrap method; it represents any set of sampled data. While the project uses the Bootstrap class to generate a SamplingData instance, nothing prevents GeneticSelector from using a k-fold data sample. A sketch of these fields follows.
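The following is a hypothetical sketch of the TrainingSet fields described above; the actual class declarations may differ.

// Illustrative rendering of the fields listed above; names follow the text.
struct TrainingSet {
    int    Alignment;          // 16 for SSE: data must start on a 16 byte boundary
    int    FieldDim;           // number of features per record
    int    RecordCnt;          // number of real records in the dataset
    int    RecordDimMultiple;  // 4 for SSE, 32 for SIMT/CUDA
    int    RecordDim;          // RecordCnt rounded up, including padding
    int    SampleDim;          // number of samples (e.g. bootstraps)
    float* Data;               // SoA layout: FieldDim rows of RecordDim values
};

// Round a record count up to the required multiple, e.g. 5 -> 8 when the
// multiple is 4, matching the padding example in the text.
int RoundUp(int recordCnt, int multiple)
{
    return ((recordCnt + multiple - 1) / multiple) * multiple;
}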
Figure 3-6 – SoA Memory Layout for Training/Validation (features run down the rows; records, with padding, run across the columns)
Figure 3-7 – Data Classes, SoA Implementation
3.3.5 Full Design

Figure 3-8 below depicts the project's full design: the decoupled approach described in Section 3.3.2 combined with the SoA based data classes described in Section 3.3.4. The GeneticSelector uses a SamplingData class; in the implementation a Bootstrap class has the responsibility of creating the SamplingData instance. There are two NeuralNetTrainer classes in the implementation. The OrigEvolutionaryTrainer is the initial training algorithm implementation, which failed to converge during testing (see Section 3.3.6 for more details).
Figure 3-8 – Overall Project Design
3.3.6 Framework Flexibility

A demonstration of the framework's flexibility occurred during testing. Below is a description of the original evolutionary training algorithm implementation (Land et al. 2006):
1. Create m "parent" weight vectors w_i.
2. Create "child" weight vectors:

   w'_i = w_i + C σ'_i, where C is a standard Cauchy variable, and
   σ'_i = σ_i exp( (1/√(2n)) N(0,1) + (1/√(2√n)) N_i(0,1) ), where n is the total number of weights.

3. Calculate fitness on the combined population of parents and children based on classification error.
4. For each population member, create a tournament by selecting x random competitors. Increase the win count of the tournament member with the highest fitness.
5. Sort the population by win count, and remove the bottom 50%.
6. Reset the win counts and set the surviving members as parents.
7. If the required number of generations has been evaluated, then end; otherwise go to step 2.
The algorithm implementation could not pass testing, due in large part to a failure to converge. Because it was not within the scope of this project to determine the exact cause of an implementation's failure to converge (the failure could be due to subtle floating point issues, for instance), a more typical evolutionary training algorithm was substituted into the project. The original algorithm remains in the project as OrigEvolutionaryTrainer, and the project will still compile with this legacy implementation. The ability to change quickly from the original implementation to the final implementation, with no changes to the surrounding implementation or framework, demonstrates the design's resiliency.
3.4 Summary

The goal of this project was to explore using CUDA to accelerate CADx neural network architecture design. Accomplishing this goal required creating a matching CPU implementation for the GPU component of the system, in order to verify whether the GPU can actually outperform the CPU. The project's implementation employs a genetic algorithm for feature selection and network architecture. The algorithm uses Az to measure fitness (see Section 2.2.3). An evolutionary computing algorithm trains each neural network, and there are separate CPU and GPU implementations to calculate the neural network output.

The object-oriented design for the project employs loose coupling rather than packaging components. Packaging requires creating an abstract base class as well as matching CPU and GPU implementation classes for each algorithm component. Decoupling the different components prevents changes in one portion of the implementation from having a cascading impact on unrelated sections.

SIMD requires data structures with a structure of arrays (SoA) memory layout (Wald 2004, p. 79). The SoA layout is a transpose of the traditional array of structures (AoS) memory layout: in AoS, data from the same record are adjacent to each other, while in SoA, data from the same feature are adjacent to each other. The project's data classes TrainingSet and TestingSet contain data in this layout. The overall design combines the decoupled approach for the processing components with the SIMD compatible SoA approach for the data classes.
Chapter 4 – Implementation

4.1 Overview

This chapter presupposes an understanding of the SIMT and SSE threading models (see Section 2.4), the basics of neural network calculation (see Section 2.2.4), and the SoA data layout (see Section 3.3.3 and Section 3.3.4). The project's implementation applies the preceding design and requirements to the specific needs of CUDA and SSE. The description of CUDA and SSE development drills down from the generic to the implementation specific. The chapter starts with general notes on CUDA and SSE development; this overview describes the overall process for development in the respective environments and is not specific to this project's implementation. Next there is an explanation of why the two environments require native C++, and a review of some implementation details specific to C++. The chapter continues with a description of how the implementation calculates the neural network output in a SIMD fashion using data arranged in a SoA layout. It concludes with separate explanations of how the CPU and the GPU perform the calculation, including a review of the relevant source code from each implementation.
4.2 General Implementation Details

4.2.1 CUDA Development

NVIDIA uses the term "device" instead of the term "graphics card" when describing CUDA hardware. This nomenclature emphasizes that CUDA is an architecture for general-purpose computation. The device typically is a graphics card; however, NVIDIA also sells its Tesla product line, whose cards do not have video outputs because they are solely for high performance computing.

A CUDA device is a separate, computation focused computer. The device resides inside a "host" (typically a workstation). A GPU controls the device; a CPU controls the host. The device has substantial memory that is separate from the host's main memory (the GTX 260 used in this project has 896MB). Program execution on the device is completely separate from the host process. A single C function called a kernel executes on all of the SIMT threads. Each thread is a member of a block and has a thread id; each thread is uniquely identified by the combination of its thread id and its block id. The kernel invocation call passes the number of threads per block and the number of blocks. It is best to have a multiple of 32 threads per block because the 32 thread warp is the unit of CUDA execution. A block can contain a maximum of 512 threads.

The general steps in CUDA program execution are as follows (a minimal sketch appears after the list):
• On the host:
  1. Allocate memory on the device for both input and output.
  2. Copy input data from host memory to device memory.
  3. Define the number of device threads per block and the number of blocks that will execute, and determine how the threads are grouped.
  4. Invoke the kernel function on the device.
• On the device:
  1. Each thread executes the kernel function in SIMT fashion. The threads use the combination of block id and thread id for indexing.
  2. Notify the host when all threads complete execution.
• On the host:
  1. Copy output data from device memory to host memory.
  2. Free device memory.
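The following minimal sketch illustrates these steps end to end. The kernel and sizes are placeholders rather than the project's code; the API calls shown (cudaMalloc, cudaMemcpy, cudaThreadSynchronize, cudaFree) are the standard CUDA 3.0-era runtime functions.

#include <cuda_runtime.h>

// Placeholder kernel: each thread squares one element. It illustrates the
// block id / thread id indexing scheme described above.
__global__ void SquareKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

void RunOnDevice(const float* hostIn, float* hostOut, int n)
{
    float *devIn, *devOut;
    size_t bytes = n * sizeof(float);

    // 1. Allocate device memory for input and output.
    cudaMalloc(&devIn, bytes);
    cudaMalloc(&devOut, bytes);
    // 2. Copy input data from host memory to device memory.
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);
    // 3. Threads per block (a multiple of the 32 thread warp) and blocks.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // 4. Invoke the kernel on the device.
    SquareKernel<<<blocks, threadsPerBlock>>>(devIn, devOut, n);
    cudaThreadSynchronize();   // wait for all device threads to complete
    // 5. Copy output back to the host and free device memory.
    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}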
Creating a CUDA program requires two compilers because execution occurs on both the device and the host. The CUDA Toolkit (NVIDIA 2010) provides the nvcc compiler for the device; nvcc is a C compiler with a small set of extensions. The host compiler is platform specific: either Visual C++ on Windows or gcc on Linux and MacOS. The general program executing on the host must link to the CUDA C program's host component, which performs the operations above.

4.2.2 SSE Development

It is necessary to use SSE instructions for optimal CPU computational performance. There are three options for employing SSE (Wald 2004, p. 80):
1. Relying on automatic compiler optimization.
2. Using SSE compiler intrinsics (Microsoft Corporation 2010c).
3. Writing Assembly manually.

Wald reports that compilers frequently do not recognize program sections eligible for SSE optimization. This project uses compiler intrinsics for SSE calls; the Milner & Grandison (2008) logistic function uses manual Assembly provided by the authors.

SSE also requires a specific memory alignment for calculations. An SSE vector pipeline instruction operates over four 4-byte floating point numbers, that is, 16 bytes of data (4x4). The memory address for the beginning of the four number vector must be a multiple of 16. This alignment requirement precludes the use of the C++ new and delete operators; SSE requires the C functions _aligned_malloc() and _aligned_free() to allocate and release the memory used for SSE calculations.

4.2.3 The Choice of Native C++

CUDA and SSE need a native, unmanaged execution environment because of the necessary linking and memory alignment requirements. Microsoft offers the C++/CLI language as a bridge between native C++ and .NET managed environment languages such as C# (Wikipedia 2010a), and C# also natively supports pointers. Despite the options C++/CLI and C# offer for mixing a managed environment with an unmanaged native environment, the approach is impractical.
Managed environments, whether Java or .NET, have a specific goal of maintaining control over the management and access of the environment's memory space. This is antithetical to CUDA and SSE's linking and memory layout requirements. C++/CLI and C# will not provide direct pointers to the managed memory space. The languages enforce a strict separation between managed and unmanaged memory for the heap, the call stack, and the instruction stack. All data in the managed environment must be copied to the unmanaged environment before the native code sections can use them.

Moving between managed and unmanaged execution is also deleterious to performance. When a managed function calls an unmanaged function, the following occurs (Microsoft Corporation 2010d):
1. The function call arguments are marshaled from CLR to native instances.
2. A managed to unmanaged thunk (a switch from a managed to an unmanaged stack, heap, and instruction context) occurs.
3. The unmanaged function is called, using the marshaled native instances of the arguments.
4. An unmanaged to managed thunk occurs.
5. The return variable and any output arguments are marshaled from native to CLR instances.

Thus, not only does the boundary between managed and unmanaged memory require constant copying between the two regions, but also constant context switching between managed and unmanaged stack frames. An additional problem is double thunking. Functions in programs created for a mixed managed/unmanaged environment by default have both managed and unmanaged entry points. Double thunking occurs when a managed function calls another managed function's native entry point: the native call then routes to the managed entry point, so two thunks occur when none were necessary, as this was a managed to managed call. If an unmanaged entry point exists for a virtual method in a managed class, double thunking will always occur (Microsoft Corporation 2010b).

Linking managed and unmanaged programs is also intricate, because the .NET environment uses a different naming and argument passing protocol (Microsoft Corporation 2010a). During implementation there were a number of attempts to integrate .NET into the project; each time one problem was solved, another cropped up. Ultimately, .NET was excluded from the current implementation. A feasible approach to integrating .NET in the future is to use C# for high-level application tasks such as the user interface and I/O, with C++/CLI serving simply as a bridge to marshal data between the managed and unmanaged environments.

4.2.4 Factory Classes and Smart Pointers

In anticipation of the future need for multithreaded NeuralNetEvaluator and NeuralNetTrainer implementations, the project employs the Factory pattern. Multithreading and concurrency frequently require complex creation logic, and the Factory pattern decouples object use from object creation (Larman 2002, pp. 346-348). Clients use the factory classes; they do not create instances directly.

Since factory classes are responsible for creating NeuralNetEvaluator and NeuralNetTrainer instances, it is not appropriate for the calling classes to destroy them. In a managed environment this is not an issue: garbage collection automatically destroys objects when there are no more references. Unmanaged C++ does not offer automatic garbage collection. To address this challenge the project uses Boost's smart_ptr C++ library (Boost Project 2010); the factory classes return smart pointers to the NeuralNetEvaluator and NeuralNetTrainer. Boost smart pointers are not a panacea: for example, the smart pointer library does not automatically handle circular references. See the Boost documentation for more information.

4.2.5 Random Numbers and Distributions

The Bootstrap, GeneticSelector, and EvolutionaryTrainer use Boost's random library to generate random values. The library supports many different probability distributions, including the uniform, normal, and Cauchy distributions.
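A hedged sketch of both ideas follows, using invented names: the classic Boost 1.43 random API (mersenne_twister, uniform_real, variate_generator) and boost::shared_ptr appear as the text describes, but the factory interface itself is an assumption, not the project's exact declaration.

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/shared_ptr.hpp>

class NeuralNetEvaluator;  // abstract base class (see Section 3.3.2)

// Hypothetical factory: clients obtain evaluator instances through the
// factory and receive a smart pointer, so they never call new or delete
// directly; the instance is destroyed when the last reference goes away.
class EvaluatorFactory {
public:
    virtual ~EvaluatorFactory() {}
    virtual boost::shared_ptr<NeuralNetEvaluator> Create() = 0;
};

// Classic Boost.Random usage: an engine, a distribution, and a
// variate_generator that binds them. Draws uniform floats in [-1, 1), as
// the evolutionary trainer's initial weight vectors require.
float DrawUniformWeight(boost::mt19937& rng)
{
    boost::uniform_real<float> dist(-1.0f, 1.0f);
    boost::variate_generator<boost::mt19937&, boost::uniform_real<float> >
        draw(rng, dist);
    return draw();
}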
4.3 Neural Network Output Calculation

4.3.1 Base Design

The leaf operation for the system is the calculation of neural network outputs during training; therefore, this is where a GPU implementation can provide the most value. The calculation is intrinsically parallel. The calculation of a neural network's output (see Section 2.2.4) is as follows:
1. Set the output node's input value to zero (there is only one output node).
2. For each node in the hidden layer:
   2.1. Set the hidden layer node's input value to zero.
   2.2. For each feature value:
        2.2.1. Multiply the feature value by the appropriate weight value.
        2.2.2. Add the result to the current hidden layer node's input value.
   2.3. Add the current hidden layer node's bias to the node's input value (the bias is simply a threshold; it shifts the activation function's curve to the left or right without changing the curve's shape).
   2.4. Calculate the current hidden layer node's activation function.
   2.5. Multiply the result by the appropriate weight.
   2.6. Add the result to the network output node's input value.
3. Add the bias to the network output node's input value.
4. Calculate the network output node's activation function.
5. The result is the neural network's output; end.

Figure 4-1 below shows the multiply, add, and assign operations from steps 2.2.1 and 2.2.2 occurring on a single feature/node connection in SIMD fashion (see Section 2.4.1) over four records. The calculation uses the same network with the same weight vector. The feature values vary because they come from four different records; the weight value is broadcast and does not vary because the same network and weight vector is calculating all four records. The product is added and assigned to the node's input.
records →
Feature Value:    f0    f1    f2    f3
Multiplication:    *     *     *     *
Current Weight:   w0    w0    w0    w0
Add and Assign:   +=    +=    +=    +=
Node Value:       n0    n1    n2    n3

Figure 4-1 – SIMD Node Multiply Add Assign
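For reference, the calculation in Section 4.3.1 for a single record can be written in plain scalar C++ before any SIMD treatment. This is an illustrative sketch: the weight layout mirrors the description above (per-node feature weights followed by a bias, then output weights and an output bias) rather than reproducing the project's exact code.

#include <cmath>

// Scalar sketch of the output calculation for one record: one hidden
// layer, logistic sigmoid activation, one output node.
float EvaluateOne(const float* features, int featureCnt,
                  const float* weights, int hiddenNodeCnt)
{
    const float* w = weights;                               // hidden weights
    const float* wo = w + (featureCnt + 1) * hiddenNodeCnt; // output weights
    float outputInput = 0.0f;                               // step 1

    for (int j = 0; j < hiddenNodeCnt; ++j) {               // step 2
        float input = 0.0f;                                 // step 2.1
        for (int k = 0; k < featureCnt; ++k)                // steps 2.2.1-2.2.2
            input += features[k] * *w++;
        input += *w++;                                      // step 2.3: add bias
        float activation = 1.0f / (1.0f + std::exp(-input)); // step 2.4
        outputInput += activation * *wo++;                  // steps 2.5-2.6
    }
    outputInput += *wo;                                     // step 3: output bias
    return 1.0f / (1.0f + std::exp(-outputInput));          // steps 4-5
}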
4.4 CPU Implementation
SSE requires a 16 byte memory alignment: the starting address must be a multiple of 16 (see Section 4.2.2). If the memory is not correctly aligned, a CPU instruction level error occurs. Only the C functions _aligned_malloc() and _aligned_free() can manage aligned memory; the C++ new and delete operators will not work. This eliminates the option of using Boost's smart pointer library for direct memory management. To protect against memory leaks, the project places the aligned memory inside the TestingData and TrainingData classes: the memory is allocated during construction and released during destruction, so these wrapper classes can themselves be managed by the smart pointer library (a minimal illustration follows).

Figure 4-2 below contains the source code for the CPU implementation. The code contains special data structures and SSE intrinsics calls that make it vary from typical C/C++ in appearance. __m128 is a special SSE data type that represents four aligned floating point numbers. _mm_set1_ps(0.0f) initializes all four vector node input values to zero; _mm_set1_ps(*w) sets the same weight value into all four vector elements. _mm_add_ps(ipt4, _mm_mul_ps(w4, *d)) executes the multiply/add. SquashingFunctionP4(&ipt4) transforms the input value to an output value using the Milner & Grandison (2008) logistic sigmoid implementation. The function is written purely in Assembly; it is also fast because it executes only nine SSE instructions (see SseGlobal.cpp on page 99).
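The following is a minimal illustration of the ownership pattern just described: aligned memory acquired in a constructor and released in a destructor, so the enclosing object can itself be held by a smart pointer. The class name is hypothetical; the project's TrainingData/TestingData classes follow the same idea.

#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)

// Hypothetical RAII holder for SSE-aligned data: allocate on a 16 byte
// boundary in the constructor, release in the destructor.
class AlignedBuffer {
public:
    explicit AlignedBuffer(size_t floatCount)
        : data_(static_cast<float*>(
              _aligned_malloc(floatCount * sizeof(float), 16))) {}
    ~AlignedBuffer() { _aligned_free(data_); }

    float* Get() { return data_; }

private:
    float* data_;
    // Copying a raw owner is not meaningful; a real class would forbid it.
    AlignedBuffer(const AlignedBuffer&);
    AlignedBuffer& operator=(const AlignedBuffer&);
};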
void SseGlobal::EvaluateNN(float *dStart, int rowDim, int fieldDim,
                           int wgtVectDim, int hidNodCnt, float *weight,
                           float *output, int recCnt)
{
    __m128 w4;   // weights on the hidden layer node
    __m128 wo4;  // weights to the output
    __m128 ipt4; // input node values

    // iterate over 4 records at a time
    int inc = rowDim / 4;
    int wOutOff = (fieldDim + 1)*hidNodCnt;
    int wOff = (fieldDim + 2)*hidNodCnt + 1;
    float *weightEnd = &weight[wOff*wgtVectDim];
    __m128 *dEnd = (__m128 *)&dStart[rowDim*fieldDim];
    while(weight < weightEnd)
    {
        __m128 *opt4 = (__m128 *)output;
        float *data = dStart;
        for(int i = 0; i < recCnt; i += 4, data += 4)
        {
            float *w = weight;
            float *wo = &w[wOutOff];
            // set four network output node input values to zero
            *opt4 = _mm_set1_ps(0.0f);
            // iterate over hidden nodes
            for(int j = 0; j < hidNodCnt; ++j)
            {
                // set four hidden layer input values to zero
                ipt4 = _mm_set1_ps(0.0f);
                // iterate over inputs
                for(__m128 *d = (__m128 *)data; d < dEnd; d += inc)
                {
                    // store the same weight in four consecutive memory
                    // locations for SSE operation
                    w4 = _mm_set1_ps(*w);
                    // execute multiply/add/assign
                    ipt4 = _mm_add_ps(ipt4, _mm_mul_ps(w4, *d));
                    ++w;
                }
                // add bias
                w4 = _mm_set1_ps(*w);
                ipt4 = _mm_add_ps(ipt4, w4);
                ++w;
                // calculate output using logistic sigmoid function
                SquashingFunctionP4(&ipt4);
                wo4 = _mm_set1_ps(*wo);
                // execute multiply/add/assign
                *opt4 = _mm_add_ps(*opt4, _mm_mul_ps(wo4, ipt4));
                ++wo;
            }
            // add bias
            wo4 = _mm_set1_ps(*wo);
            *opt4 = _mm_add_ps(*opt4, wo4);
            ++wo;
            SquashingFunctionP4(opt4); // this is the 4 NNs' output
            // move to the next four records
            ++opt4;
        }
        output += rowDim;
        weight += wOff;
    }
}
Figure 4-2 – CPU, SSE Implementation (SseGlobal.cpp)
4.5 GPU Implementation
4.5.1 The Project's Implementation

There are two components to the GPU implementation: the function on the CPU that manages the host execution (Figure 4-3) and the kernel function that executes on the device (Figure 4-4). Before invoking the kernel function, the thread dimensionality must be set up. SIMT device threads are grouped together into thread blocks; a thread block is a set of threads that can share on-chip memory and synchronization calls. Each block can contain up to 512 threads. The combination of the thread's block id and thread id uniquely identifies a thread. CUDA's data structures allow thread ids to be up to three dimensional and block ids to be up to two dimensional.

In Figure 4-3 each block contains the maximum 512 threads in a 32x1x16 layout: each block calculates 32 records over 16 weight vectors. The number of records and the number of different weight vectors determine the number of blocks. If there are 96 records and 32 weight vectors to evaluate, there will be six blocks arranged in a 3x2 layout; 3,072 threads will execute, and each thread will calculate a neural network's output for a unique record and weight vector combination. NVIDIA describes the setup of the thread ids and block ids as the thread hierarchy. In this case the threads are in blocks with block dimensionality 32x1x16, and the blocks are in a grid with grid dimensionality 3x2 (the arithmetic is sketched below).

The invocation function uses the helper functions in Figure 4-5 to transfer data between the device and the host. The CUDA API functions are similar to the well known C functions for memory allocation, release, and copy operations. The cutilSafeCall and cutilCheckMsg functions are included with the CUDA Toolkit; they provide error checking on the device.

Each SIMT device thread executes the kernel function in Figure 4-4. The first task is to determine which record and weight vector to calculate using the thread id and the block id. The rest of the program is a straight-forward C program that performs the calculations. The kernel reads more like a typical C function than the SSE based function, with its mixture of Assembly, SSE intrinsics, and special data structures.
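The grid arithmetic for the 96-record, 32-weight-vector example works out as follows. This is a hedged recreation for exposition; the project's actual setup code appears in Figure 4-3.

#include <cuda_runtime.h>  // for dim3

void SetupGrid(int records, int weightVectors, dim3* blk, dim3* grd)
{
    // 512 threads per block in a 32x1x16 layout: x indexes records within
    // a block, z indexes weight vectors within a block.
    *blk = dim3(32, 1, 16);
    // Ceil-divide: 96 records / 32 = 3 blocks in x and 32 weight vectors
    // / 16 = 2 blocks in y, giving a 3x2 grid of 6 blocks and
    // 6 * 512 = 3,072 threads in total.
    *grd = dim3((records + blk->x - 1) / blk->x,
                (weightVectors + blk->z - 1) / blk->z);
}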
__host__ void BasicEvaluateNN(NNEvaluationData data)
{
    // number of threads per block
    blkDim.x = 32;
    blkDim.y = 1;
    blkDim.z = 16;

    int recCnt = hostDs.RecordCnt;
    int wgtCnt = data.WeightVectorDim;
    // number of blocks
    grdDim.x = (recCnt & (blkDim.x*blkDim.y - 1)) ?
        (recCnt / (blkDim.x*blkDim.y)) + 1 : recCnt / (blkDim.x*blkDim.y);
    grdDim.y = (wgtCnt & (blkDim.z - 1)) ? wgtCnt/blkDim.z + 1 : wgtCnt/blkDim.z;
    grdDim.z = 1;

    // load the data & workspace
    NNEvaluationData copy = data;
    data.Output[1] = data.Dataset[1] = data.WeightVectors[1] = 0;
    data.WeightSetDim = 1;
    for(int i = 0; i < copy.WeightSetDim; ++i)
    {
        data.Dataset[0] = copy.Dataset[i];
        data.WeightVectors[0] = copy.WeightVectors[i];
        data.Output[0] = copy.Output[i];
        // load the data & workspace
        NNEvaluationData nn = LoadEvalData(data);
        // call kernel
        ExecEvaluateNN<<<grdDim, blkDim>>>();
        cutilCheckMsg("Kernel ExecGlobalMemoryEvaluateNN execution failed");
        cudaThreadSynchronize();
        // copy output
        GetOutputEvalData(nn, (float **)&data.Output);
        // free device memory
        UnloadEvalData(nn);
    }
    return;
}
Figure 4-3 – CUDA Invocation Function (CudaBasic.cu)
static __global__ void ExecEvaluateNN()
{
    // determine which record and weight vector to process
    int recIdx = blockDim.x*(blockDim.y*blockIdx.x + threadIdx.y) + threadIdx.x;
    int wgtIdx = blockDim.z*blockIdx.y + threadIdx.z;
    // iterate through all of the datasets
    float opt = 0.0f;
    if(wgtIdx < Nn.WeightVectorDim)
    {
        float *wgt = &((float *)Nn.WeightVectors[0])[wgtIdx*Nn.WeightEleDim];
        float *oWgt = &wgt[Nn.WeightOutputOffset];
        float *datStrt = &((float *)Ds.Datasets[Nn.Dataset[0]])[recIdx];
        float *datEnd = &datStrt[Ds.RecordDim*Ds.FieldDim];
        for(int i = 0; i < Nn.HiddenNodeCnt; ++i)
        {
            float ipt = 0.0f;
            for(float *curDat = datStrt; curDat < datEnd; curDat += Ds.RecordDim)
            {
                // multiply/add/assign
                ipt += *curDat * *wgt;
                ++wgt;
            }
            // add bias
            ipt += *wgt;
            ++wgt;
            // multiply/add/assign
            opt += *oWgt/(1.0f + expf(-ipt));
            ++oWgt;
        }
        // add bias
        opt += *oWgt;
        ++oWgt;
        // E squared fitness
        // save output to device memory
        ((float *)Nn.Output[0])[Ds.RecordDim*wgtIdx + recIdx] =
            (1.0f/(1.0f + expf(-opt)));
    }
}
Figure 4-4 – CUDA Kernel

Note: There are slight differences between Figure 4-4 and the actual source code. The source code calculates recIdx and wgtIdx using function calls. Nvcc inlines all device functions; these variances are therefore for descriptive purposes only and have no bearing on functionality.
static NNEvaluationData LoadEvalData(NNEvaluationData host)
{
    NNEvaluationData dev = host;

    // zero terminate arrays
    dev.WeightVectors[dev.WeightSetDim] = 0;
    dev.Output[dev.WeightSetDim] = 0;
    dev.Dataset[dev.WeightSetDim] = 0;
    int len0 = sizeof(float) * dev.WeightVectorDim * dev.WeightEleDim;
    int len1 = sizeof(float) * dev.WeightVectorDim * hostDs.RecordDim;
    for(int i = 0; i < host.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaMalloc((void **) &dev.WeightVectors[i], len0) );
        cutilSafeCall( cudaMemcpy((void *)dev.WeightVectors[i],
                                  (void *)host.WeightVectors[i], len0,
                                  cudaMemcpyHostToDevice) );
        cutilSafeCall( cudaMalloc((void **) &dev.Output[i], len1) );
    }
    cutilSafeCall( cudaMemcpyToSymbol("Nn", &dev, sizeof(dev)) );
    return dev;
}

static void UnloadEvalData(NNEvaluationData data)
{
    for(int i = 0; i < data.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaFree((void *)data.WeightVectors[i]) );
        cutilSafeCall( cudaFree((void *)data.Output[i]) );
    }
}

void GetOutputEvalData(NNEvaluationData nn, float **output)
{
    for(int i = 0; i < nn.WeightSetDim; ++i)
    {
        cutilSafeCall( cudaMemcpy(output[i], ((void *)nn.Output[i]),
                       sizeof(float)*nn.WeightVectorDim*hostDs.RecordDim,
                       cudaMemcpyDeviceToHost) );
    }
}
Figure 4-5 – Device Load, Unload, and Copy Helper Functions (PROJ_MarungoF_Cuda.cu)
4.5.2 Preliminary GPU Optimization Efforts

There were preliminary attempts to optimize the kernel in Figure 4-4. The work centered on pre-fetching data from the device's global memory (located on separate memory chips on the device) and storing it in shared memory (located on the actual GPU chip). The interesting result was that the simple implementation performed better than the more complicated attempts at optimization. For more details see Appendix C.
4.6 Summary
Every CUDA program has two parts. One part runs on the host (normally a workstation); the other runs on the device (normally a graphics card). The host component allocates and frees memory on the device, copies data between the device and the host, and invokes the kernel on the device. The host program groups threads into blocks before invoking the kernel; a block can contain a maximum of 512 threads. The kernel invocation passes the number of threads per block and the number of blocks. A device thread is uniquely identified by the combination of its block id and its thread id. The same kernel function runs in each thread on the device in SIMT fashion. Because a CUDA program contains both host and device components, two compilers are necessary: the nvcc compiler generates the device program, while the host compiler is platform specific (Visual C++ on Windows and gcc on Linux and MacOS). The rest of the general program must link with the C based host program.

Employing SSE often requires either using SSE intrinsics (Microsoft Corporation 2010c) or manually writing Assembly (Wald 2004, p. 80). In either case, SSE has very specific memory alignment requirements. The traditional C++ new and delete operators are unavailable because of the alignment requirements; memory management requires the special C functions _aligned_malloc() and _aligned_free().

CUDA's linking needs and SSE's memory alignment requirements necessitate an unmanaged native environment. While C++/CLI and C# have mechanisms to allow interoperability between managed and unmanaged environments, mingling the two environments is not practical. Problems include:
1. Thunking and double thunking issues (Microsoft Corporation 2010d).
2. Linking problems due to different naming and argument passing protocols (Microsoft Corporation 2010a).
3. The need to marshal data between managed and unmanaged environments.

Given the intricate issues involved in interoperability, it is best to use C++/CLI to create a bridge layer between an unmanaged component that handles processing and a managed component that handles file I/O and user interaction.

The CPU implementation uses SSE intrinsics and the Milner & Grandison (2008) logistic sigmoid Assembly function; the intrinsics and the Assembly make the source code difficult to read and understand. The GPU implementation is more readable. The host program uses CUDA functions that are similar to traditional C functions to copy data and manage memory, and the device kernel function is a straight-forward C function with no special calls. Each SIMT thread executing the kernel uses the block id and the thread id to determine which record and weight vector combination to calculate.
Chapter 5 – Testing and Results
Figure 5-1 – Results
5.1 Overview

5.1.1 The Tests

The project uses two forms of testing: functional and performance. Figure 5-1 above displays the output of both the performance and functional tests. The functional tests follow the decoupled collaborations described in Section 3.3.2: the NeuralNetEvaluator tests run first, then the NeuralNetTrainer test uses the confirmed evaluators, and finally the GeneticSelector test uses all of the previously verified components. Performing the tests in this order allows bugs to be isolated and fixed, and allows the performance results of earlier tests to inform the settings for subsequent tests.
Performance testing measures execution time at two levels. The first level looks at raw performance: it measures how quickly a NeuralNetEvaluator can calculate the output of a large group of neural networks. The GeneticSelector performance test is a total runtime test; it measures the total time spent in each part of the program. The tests in Figure 5-1 read from the bottom up: the NeuralNetEvaluator test results are at the bottom, the NeuralNetTrainer test results are in the middle, and the GeneticSelector test results are at the top.

5.1.2 The Datasets

Initially the project goal was to use domain specific data for testing. However, that is not ideal because neural networks are opaque (see Section 2.2.4): when stepping through a program during debugging, it is difficult to tell whether a node's input or output value is correct. Rather than use a domain specific dataset, which can have unpredictable values, the project uses automatically generated XOR datasets. When weight vectors are necessary, the project uses two weight vectors; both are from Negnevitsky (2005, pp. 183-184) and will solve an XOR operation. Using an XOR dataset with known weight vectors simplifies debugging: when the network output is wrong, it is possible to compare the calculated value to the correct value at each step. Appendix B contains the correct node input and output values for the two weight vectors.
5.2 The NeuralNetEvaluator Tests
5.2.1 Functional Test Description and Results

The NeuralNetEvaluator functional test uses a network with two hidden nodes and the two proven weight vectors for network output calculation. The functional test compares the evaluator's calculated output with the correct value. The test performs this comparison over a full NeuralNetEvaluator run and returns the maximum difference between the correct and calculated values over all of the network outputs calculated. It is important to note that the correct output is the result of a floating point calculation; it is never exactly zero or one. Therefore the test uses the maximum difference instead of a hard equality, since different calculation methods lead to slightly different results. If the maximum difference is very small then the calculated value is always "close" to the correct value.

The results of the tests confirming that the CPU and GPU NeuralNetEvaluators properly calculate neural network output are on the lines starting with "Accuracy Test" in Figure 5-1. The SSE implementation is slightly less accurate than the CUDA calculation; the maximum differences are .00127 and .00049 respectively. This result is not surprising, as the CPU implementation trades some accuracy for speed (Milner & Grandison 2008).

5.2.2 Performance Test Description and Results

The NeuralNetEvaluator performance test uses an entirely random dataset, because the goal is to measure computing speed, not accuracy. The settings for the performance test are:
• Number of Records: 1024
• Number of Features: 64
• Number of Hidden Nodes: 2
• Number of Samples: 100
• Number of Weight Vectors: 750
The number of records is based on the maximum expected number of data points in a DDSM training set (see Section 2.2.5). The number of features varies significantly across studies; however, 60 features tends to be towards the upper range, so the system supports a maximum of 64 (the number of bits in a long integer data type). Frequently, the optimal CADx neural networks have only one or two hidden nodes (see Section 2.5.1). The number of weight vectors and the number of samples were chosen to provide a test with a large number of network calculations: in total this test calculates the output for 76,800,000 networks (1024 × 100 × 750), so it measures raw performance.

Despite all of the optimizations in the CPU implementation, the GPU implementation consistently outperforms it in calculating neural network output values. Figure 5-1 shows almost an 18x speedup on raw performance (the time tests). The time measurements are in milliseconds: the GPU implementation takes 2.172 seconds to calculate the output, while the CPU implementation takes 36.578 seconds to perform the same calculation.
5.3 The NeuralNetTrainer Tests

There is only a functional test for the NeuralNetTrainer; performance testing is performed as part of the GeneticSelector test in Section 5.4. The functional test uses the same dataset as the evaluator test but does not include weight vectors, since the trainer's job is to generate the weight vector. The functional test returns the classification accuracy of the top performing weight vector using k = .5 as the cutoff. The evolutionary training test used the following settings:
• Population Size: 50
• Generation Count: 100
• Number of Records: 128
• Number of Weight Sets: 32

The population size and generation count are from Porto et al. (1995), see Section 2.5.1. The number of records and weight sets are based on the threading dimensionality (see Section 4.5.1): with the 32x1x16 block layout that the GPU implementation uses, 128 records by 32 weight sets create eight blocks in a 4x2 block grid. The goal of this test is to confirm that the trainer converges, not to measure speed. Both trainers perform well; the GPU trainer is only slightly more accurate than the CPU trainer (96.875% versus 96.0938%).
5.4 The GeneticSelector Tests

5.4.1 Test Description

The GeneticSelector test combines functional and performance testing. The test uses an XOR dataset with additional random "noise" features; the GeneticSelector must identify the two genuine features. The fitness function does not have a penalty for including noise fields. The performance test measures the total time spent in the different levels of the program. The test settings are:
• Number of Records: 1024
• Number of Features: 64
• Evolutionary Population Size: 50
• Evolutionary Generation Count: 100
• Number of Bootstraps: 10
• Genetic Population: 50
• Genetic Population Generation: 20

All but the last three settings have been previously explained. The settings for the genetic algorithm are from Campanini & Lanconelli (2006), who state that genetic algorithms reviewing approximately 450 networks can provide good results. Setting the number of bootstrap samples is a balancing act between the desire for useful sampling results and total runtime; the NeuralNetEvaluator performance test provides metrics for the time required to perform the neural network calculations. Setting the bootstrap value to 10 is reasonable: many studies use five or ten samples for k-fold cross-validation, leave-one-out bootstrapping is similar to cross-validation, and the project's design can support a decision to switch to cross-validation (see Section 3.3.4). The program will calculate the output for 10,240,000,000 neural networks using these settings, which is 133x the number of networks in the NeuralNetEvaluator performance test. If the ratios hold, the total estimated calculation time is 280 seconds (slightly less than five minutes) for the GPU and 4,877 seconds (about 80 minutes) for the CPU.

5.4.2 Functional Test Results

The GeneticSelector test does find the true features, one and three; every one of the top performing networks contains the two true input features. However, the selector did not eliminate noise features well. In part, this is a result of the network training: a well trained network may have zero weight values for all of the noise features. In other words, the network accepts noise as input but ignores it during processing. Modifying the selector's algorithm to add a cost for each additional feature creates problems as well: a solution that contains many features including the true features may have a lower fitness value than a solution that contains only a few features, even if all of those features are noise. An experiment using a cost component in the algorithm found that the cost necessary to make the selector choose only two input features varies with the number of input features available. Filtering out noise features is not only outside the scope of the project but also may suggest using a different algorithm for feature selection in true experiments. There is a similar problem with the number of hidden nodes: only two nodes are necessary for XOR. Again, this may require re-examination of the algorithm.
Another odd quirk is that the selector results vary between the GPU and CPU implementations. However, this appears to be related to the random numbers generated: when the order of execution is switched, the GPU has consistent fitness values of 100, while none of the CPU values is exactly 100. The GeneticSelector passes the test because every top-performing neural network it returns contains the two true features; the fact that the selector does not filter the noise features is not a factor, and the selector never returned a network that used all, or even a majority, of the features. The algorithm's idiosyncrasies serve to emphasize that algorithm development is a trial-and-error process. The ability to flag unexpected outcomes in the algorithms is another advantage of using predictable data for testing.

5.4.3 Performance Test Results
The actual result for the GPU is 340 seconds, about 20% more than predicted by the NeuralNetEvaluator performance test. This is not surprising given that the individual chunks of data moving to the device are smaller, which can create more overhead in data transfer. The CPU results are much more interesting. The total CPU time is 2,165 seconds (about 36 minutes), less than half the estimate based on the NeuralNetEvaluator performance test. The GPU only has a speedup of about 6x, versus 18x in the evaluator test, and the total runtime advantage is less than 4x. This is still a significant difference in absolute terms: the total runtime for the GPU is about 11.5 minutes, while the total runtime for the CPU is about 42 minutes.

While it is difficult to prove, it is reasonable to believe that the increased performance comes from the CPU being able to use its cache during the GeneticSelector test. This test repeatedly operates over the same set of data, whereas the NeuralNetEvaluator test operates over the data only once. A high cache hit ratio is one of the most important factors in CPU performance (see Section 2.4.2). The GPU implementation does not employ caching (CUDA does offer some caching of constant and texture memory, see Appendix C, and in the latest version caching of general memory, see Appendix A). Preliminary tests with very low settings actually had the CPU outperforming the GPU; in that case the CPU was probably able to cache the entire dataset. The cache may also explain the differences in performance in other parts of the algorithm. The CPU implementation has lower times than the GPU version in all of the other sections of the program, even though these sections are common to both implementations. This is probably because parts of the cache are disturbed during the constant copying of data between the host and the device.
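As a minimal sketch of the constant-memory caching mentioned above (the symbol and kernel names, and the 4096-element capacity, are illustrative assumptions, not the project's kernel): data placed in __constant__ memory is served through a small on-chip cache, which is most effective when all threads in a warp read the same address, as they do when every network evaluation shares one weight value.

#include <cuda_runtime.h>

__constant__ float c_weights[4096]; // read-only on the device, served through a cache

__global__ void evalWithConstWeights(const float *dataset, float *output, int recordCnt)
{
    int rec = blockIdx.x * blockDim.x + threadIdx.x;
    if(rec >= recordCnt) return;
    // every thread in the warp reads c_weights[0] simultaneously: one cached fetch
    output[rec] = dataset[rec] * c_weights[0]; // placeholder work
}

void setWeights(const float *hostWeights, size_t count) // count must be <= 4096
{
    cudaMemcpyToSymbol(c_weights, hostWeights, count * sizeof(float));
}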
5.5 CPU Allocation
The screenshots in Figure 5-2 show the CPU usage while the application is running. Due to the computational intensity of the application, execution is allocated to a single core, and the CPU usage on that core remains at 100% throughout the execution. The 50% shown in the Processes screen represents 100% usage of one core on a dual-core machine.
Figure 5-2 -- Windows Task Manager
Chapter 6 – Summary, Conclusion, Future Work, and Evaluation
6.1 Summary
6.1.1 Background
The goal of this project was to explore using CUDA to accelerate the design of breast cancer CADx neural network classifiers. To achieve this goal, the project presents an implementation of an algorithm that performs feature selection, network architecture selection, and network training using genetic and evolutionary computing techniques as well as leave-one-out bootstrap sampling. To provide a basis of comparison, there are two implementations that calculate neural network outputs: one uses the GPU and the other uses the CPU.
6.2 Conclusion
6.2.1 Overall Conclusion
This project demonstrates a role for CUDA in computationally intensive CADx. The results show a significant speedup over a CPU-only implementation. The CUDA implementation is intrinsically parallel and will therefore also gain performance automatically with future hardware upgrades (see Section 2.3.1). The CPU implementation requires an additional multithreading layer in order to have a chance of matching the GPU's performance and of gaining from future processors. The multithreading component is not a trivial addition to the program, and there is no guarantee that it will deliver the anticipated gains (see Section 2.5.3). The sections below provide conclusions from the various phases of the project.

6.2.2 Design
One area this project tackles that is not yet commonly explored in research is integrating CUDA into an overall domain application such as CADx. Design becomes very important in this context. CUDA development requires that all memory management for the device occurs on the host; in addition, the host is responsible for orchestrating the movement of input and output data between the device and the host. Programs normally have an Array of Structures (AoS) memory layout. Many of the GPU-based neural network programs in the literature, such as Steinkraus et al. (2005) and Jang et al. (2008), maintain this layout by using the GPU to perform matrix multiplication as a parallel operation on only one network at a time. CADx neural networks are relatively small; they do not have the large numbers of hidden nodes needed to make this approach feasible, and there is no ability to scale when performing small matrix multiplications. Scale in CADx neural networks comes from calculating multiple network outputs simultaneously, and this requires a Structure of Arrays (SoA) layout (see Section 3.3.3).
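A minimal sketch of the two layouts, with illustrative field names rather than the project's classes: in the AoS layout each record's features are contiguous, while in the SoA layout each feature is a contiguous array across records, which is what allows one GPU thread per (record, network) pair to read memory in a coalesced fashion.

struct RecordAoS { float features[8]; float groundTruth; }; // AoS: one struct per record

struct DatasetSoA // SoA: one array per field
{
    float *features;    // [fieldDim][recordDim]: feature j of record i is features[j*recordDim + i]
    float *groundTruth; // [recordDim]
    int fieldDim;
    int recordDim;
};

// the AoS-to-SoA transformation performed at the host/device boundary
void toSoA(const RecordAoS *recs, int recordCnt, DatasetSoA &out)
{
    for(int j = 0; j < out.fieldDim; ++j)
        for(int i = 0; i < recordCnt; ++i)
            out.features[j * out.recordDim + i] = recs[i].features[j];
    for(int i = 0; i < recordCnt; ++i)
        out.groundTruth[i] = recs[i].groundTruth;
}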
When CUDA is part of a higher-level application, transformations between the AoS data layout in the rest of the program and the SoA data layout in the part of the program running on the GPU must occur at some point. Without a decoupled design that clearly delimits responsibilities, the demarcation points where the transformations occur become muddled. Changes in the surrounding program will cascade, requiring changes to the GPU program to adjust the data layout. Employing a decoupled design removes the need to change the GPU program when changes are made in a modular area outside of where the transformation takes place. This project manages decoupling in two ways: the interaction with CUDA (or SSE) only occurs in the appropriate NeuralNetEvaluator class, and the data layout requirements are managed by the SamplingData, TrainingSet, and TestingSet classes. This decoupling allows changes to occur throughout the program without a need to modify the CUDA program.

6.2.3 Implementation
Using either SSE or CUDA presents difficulties. SSE has specific memory alignment requirements that prevent the use of traditional C++ approaches to memory management, and its requirement for low-level instruction calls creates less readable programs. CUDA is a C API that must link with native C++. Therefore a decision to use CUDA or SSE is a decision to use a fair amount of C++. The implementation did demonstrate the general-purpose applicability of CUDA: the kernel function looks like a typical C function, and the CUDA functions to allocate, free, and copy memory look like their well-known C counterparts. A drawback to CUDA is that the GPU program runs on the graphics card, which makes debugging very arduous. It is not possible to simply print out values or step through the execution. CUDA does offer an emulation mode that allows debugging, but this mode is very limited. Patience and program simplicity are often the only programming tools available. (NVIDIA has recently released a new development tool called NSight which claims to alleviate many of the programming difficulties; however, NSight requires Windows Vista or Windows 7, and the workstation used for developing this project runs Windows XP, so this project does not review it.) In this CADx application, the CUDA kernel is not a particularly large part of the code base. This is probably quite typical; most of the effort in CUDA development involves orchestrating the interaction between the host and the device (see Section 4.2.1 and Section 4.5).

6.2.4 Testing and Results
Based on when the technologies were released and their placement in the product line, a CPU roughly comparable to the GPU in this test may have up to four hardware threads. If all the threads are in use, the CPU may match the GPU's 4x speedup. However, there are other factors to consider. Obtaining comparable CPU performance will require the addition of multithreading, which adds another level of complexity on top of SSE, and the additional threads may not provide a linear speedup. It is likely that much of the CPU's performance came from caching; CUDA's scaling behaviour appeared to be more predictable. Another potential reduction in performance comes from other programs running on the workstation. The CADx program creates near-full utilization of a hardware thread because it is continuously performing calculations. There are either one or two hardware threads per CPU core. In a single-threaded program the CPU allocates all of the work to a single hardware thread; other programs running on the workstation use the other available thread(s). If the
application is multithreaded and uses all of the available hardware threads, then either the other applications will freeze or the performance of the CADx program will decline as the CPU must perform task switching. The only way to avoid this problem is to continuously monitor CPU usage by other programs on the machine. CUDA, by contrast, intrinsically scales: the more threads, the better the performance, which is the opposite of the CPU. With CUDA there is no need to monitor available usage. Performance gains from hardware upgrades are also immediate; there is no need to modify the program or purchase a new workstation, because more powerful cards will automatically schedule more warps to run simultaneously.
6.3 Evaluation
My overall assessment of the project is that it was a success. The project provides a window into the performance considerations in the context of CADx. However, the path from beginning to end was very different from my initial expectations. I thought that .NET would be integral to my project; that turned out to be impractical. I also did not anticipate the importance of design; it turned out to be crucially important. As the code base grew during implementation, the project required constant redesigning; without it, the project would have stalled due to complexity. Decoupling along the volatile dimensions of possible algorithm implementation and execution location (host or device) was critical in maintaining stability as the project progressed.

Another characteristic of the project is that the general phases overlapped and mutually influenced each other. The last third of the design phase occurred during the first two thirds of the implementation phase, and practical problems in implementation would lead to modifications of the design. For example, the package-based approach described in Section 3.3.2 was abandoned about halfway through the implementation. The change allowed me to continue developing the program without constantly editing multiple sections. Testing, in turn, led to changes in implementation. The two largest modifications during testing were the replacement of the original evolutionary training algorithm, which did not converge, and the decision to use a binary XOR dataset. Because of the flexible design, substituting a new trainer did not have a cascading effect on the rest of the program. The decision to use the binary XOR dataset occurred during the overlap of the late stage of implementation and the early stage of testing; at that point the need for a predictable and well-understood test case was clear.

While overall I do not have any regrets about the project, there were tradeoffs. The decision to invest significant effort in the CPU implementation crowded out the exploration of some areas of CUDA optimization (see Appendix C). I believe this investment was necessary to provide a true benchmark of CUDA performance, but there were quite a few other avenues I would have liked to explore. There is not much that I would do differently, but I would have liked to have done more.
6.4 Future Work
Based on the results of this project, the decision to use CUDA in CADx implementations is a decision as to whether or not to use C++. If the decision is made to use C++, then the benefits of using CUDA over SSE are clear. However, using a managed language such as C# or Java has considerable
benefits over C++. Future work comparing the performance of a managed CADx implementation to CUDA would be highly beneficial. This is especially relevant with Microsoft's release of the Task Parallel Library (Microsoft Corporation 2010a). If this library can manage CPU threads efficiently and scale as the number of cores increases, then the benefits of using a higher-level language may be worth forgoing CUDA's better performance.

This project did not cover coordinating GPU and CPU activity. The total runtime performance test revealed that the program spent roughly the same amount of time calculating the neural network output on the GPU as performing other work on the CPU. The current implementation does not utilize this time: the program blocks while the GPU is busy. Modifying the program to allow the CPU to continue executing while the GPU performs processing may almost halve the total runtime (a sketch of this kind of overlap appears at the end of this section).

Appendix C contains a description of an attempt to optimize the CUDA kernel. Based on the results, the kernel presented in the project appears to perform quite well. However, there are still a few optimizations that may be worthwhile; for example, using the texture memory cache may yield some performance improvements.

Ultimately, the goal of this project is to accelerate processing of a domain-specific problem. The design and architecture of the system are built around the needs of CADx. To see this project used to accelerate a true CADx application, even as it is, would be very rewarding.
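The sketch below shows the kind of overlap suggested above, using a CUDA stream so that the kernel launch and the device-to-host copy return immediately and the CPU can do useful work before synchronizing; the kernel, buffer names, and grid size are illustrative assumptions, not the project's code. For the copy to be truly asynchronous the host buffer should be pinned (allocated with cudaMallocHost).

#include <cuda_runtime.h>

__global__ void evaluateKernel(const float *weights, float *output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = weights[i]; // placeholder work
}

static void doCpuSideWork()
{
    // e.g. sorting the population and performing crossover for the next generation
}

void overlappedStep(const float *d_weights, float *d_output, float *h_output, size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    evaluateKernel<<<64, 512, 0, stream>>>(d_weights, d_output); // returns immediately
    cudaMemcpyAsync(h_output, d_output, bytes, cudaMemcpyDeviceToHost, stream);
    doCpuSideWork();                 // runs while the GPU is busy
    cudaStreamSynchronize(stream);   // wait only when the results are needed
    cudaStreamDestroy(stream);
}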
Bibliography
American Cancer Society 2009, Breast Cancer Facts & Figures 2009-2010, American Cancer Society, Atlanta, GA.
American College of Radiology 2009, The American College of Radiology BI-RADS ATLAS and MQSA: Frequently Asked Questions, viewed 29 August 2010.
Barney, B 2010, Introduction to Parallel Computing, viewed 17 June 2010.
Beck, K & Cunningham, W 1989, 'A Laboratory For Teaching Object Oriented Thinking', OOPSLA '89: Conference Proceedings on Object-Oriented Programming Systems, Languages and Applications, ACM.
Benkrid, K 2008, 'High Performance Reconfigurable Computing: From Applications to Hardware', IAENG International Journal of Computer Science, vol 35:1, IJCS_35_1_04.
Bevilacqua, A, Campanini, R & Lanconelli, N 2001, 'Optimization of a Distributed Genetic Algorithm for the Detection of Microcalcifications', International Journal of Modern Physics, vol 12, no. 1, pp. 55-70.
Bilhanan, A 2004, 'High Level Synthesis Of An Image Processing Algorithm For Cancer Detection', MSc Thesis, Department of Computer Science and Engineering, University of South Florida, Florida, USA.
Boost Project 2010, Boost C++ Libraries.
Boujelben, A, Chaabani, AC, Tmar, H & Abid, M 2009, 'Feature Extraction from Contours Shape for Tumor Analyzing in Mammographic Images', Digital Image Computing: Techniques and Applications, Conference Publishing Services, Melbourne, Australia.
Campanini, R & Lanconelli, N 2006, 'Chapter 4: Genetic Algorithms in Mammography', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Chai, Z, Sun, J, Cai, R & Xu, W 2009, 'Implementing Quantum-behaved Particle Swarm Optimization Algorithm in FPGA for Embedded Real-time Applications', 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, pp. 886-890.
Che, S, Li, J, Sheaffer, JW, Skadron, K & Lach, J 2008, 'Accelerating Compute-Intensive Applications with GPUs and FPGAs', Proceedings of the 2008 Symposium on Application Specific Processors, pp. 101-107.
D'Orsi, CJ, Bassett, LW & Berg, WA 2003, Breast Imaging Reporting and Data System: ACR BI-RADS-Mammography (ed 4), American College of Radiology, Reston, VA.
Duncan, R 1990, 'A Survey of Parallel Computer Architectures', Computer, vol 23, no. 2, pp. 5-16.
Efron, B & Tibshirani, RJ 1998, An Introduction to the Bootstrap, CRC Press LLC, Boca Raton, Florida.
Feist, T 2009, Following the road from ASIC to FPGA, viewed 13 December 2009.
Flynn, MJ 1972, 'Some Computer Organizations and Their Effectiveness', IEEE Transactions on Computers, pp. 948-960.
Fogel, DB, Wasson III, EC & Boughton, EM 1995, 'Evolving Neural Networks for Detecting Breast Cancer', Cancer Letters, pp. 49-53.
Frank, A & Asuncion, A 2010, UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, viewed 13 June 2010.
Geer, D 2005, 'IEEE: Chip Makers Turn to Multicore Processors', Computer, May 2005, pp. 11-13.
Gilbert, F, Astley, S, Gillan, M, Agbaje, O, Wallis, M, James, J, Boggis, C & Duffy, S 2008, 'Single reading with computer-aided detection for screening mammography', New England Journal of Medicine, no. 359, pp. 1675-84.
Giles, M 2009, Numerically Intensive Computing In Finance -- Lecture Notes, viewed 1 May 2010.
Graham, P & Nelson, B 1996, 'Genetic algorithms in software and in hardware -- A performance analysis of workstations and custom computing machine implementations', IEEE Symposium on FPGAs for Custom Computing Machines, pp. 216-225.
Heath, M, Bowyer, K, Kopans, D, Kegelmeyer, WP, Moore, R, Chang, K & MunishKumaran, S 1998, 'Current status of the Digital Database for Screening Mammography', Digital Mammography, pp. 457-460.
Heath, M, Bowyer, K, Kopans, D, Moore, R & Kegelmeyer, WP 2001, 'The Digital Database for Screening Mammography', Proceedings of the Fifth International Workshop on Digital Mammography, pp. 212-218.
Intel Corporation 2000, Approximate Math Library for Intel Streaming SIMD Extensions Release 2.0, viewed 17 June 2010.
Intel Corporation 2009, Vector Math Library (VML) Performance and Accuracy Data, viewed 30 April 2010.
Intel Corporation 2010, Intel AVX, viewed 23 August 2010.
Jang, H, Park, A & Jung, K 2008, 'Neural Network Implementation using CUDA and OpenMP', Digital Image Computing: Techniques and Applications, pp. 155-161.
Jiang, Y, Nishikawa, R, Schmidt, R, Metz, CE, Giger, ML & Doi, K 1999, 'Improving breast cancer diagnosis with computer-aided diagnosis', Academic Radiology, vol 6, no. 1, pp. 22-33.
Jiang, W & Simon, R 2007, 'A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification', Statistics in Medicine, no. 26(29), pp. 5320-5334.
Kirk, D & Hwu, W 2008, 'Chapter 1: Introduction', in Programming Massively Parallel Processors, Draft, viewed 14 December 2009.
Kohavi, R 1995, 'A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection', Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1137-1143.
Land, W, McKee, DW, Anderson, FR, Masters, T, Lo, JY, Embrechts, M & Heine, J 2006, 'Chapter 10: Using Computational Intelligence For Computer-Aided Diagnosis Of Screen-Film Mammograms', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Larman, C 2002, Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and the Unified Process, 2nd Ed., Prentice-Hall, Inc., Upper Saddle River, NJ.
Lewis, TE & Magoulas, GD 2009, 'Strategies to Minimise the Total Run Time of Cyclic Graph Based Genetic Programming with GPUs', Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, Association for Computing Machinery, Montreal, Québec, Canada.
Lindholm, E, Nickolls, J, Oberman, S & Montrym, J 2008, 'NVIDIA Tesla: A Unified Graphics and Computing Architecture', IEEE Micro, March/April 2008, pp. 39-55.
Lo, JY, Bilska-Wolak, AO, Baker, JA, Tourassi, GD, Floyd, CE & Markey, MK 2006, 'Chapter 27: Computer-Aided Diagnosis in Breast Imaging: Where Do We Go after Detection?', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Mangasarian, OL, Street, WN & Wolberg, WH 1995, 'Breast Cancer Diagnosis and Prognosis via Linear Programming', Operations Research, vol 43, no. 4, pp. 570-577.
Marowka, A 2007, 'Parallel Computing On Any Desktop', Communications of the ACM, vol 50, no. 9, pp. 75-78.
Marowka, A 2009, 'Performance Study of the First Three Intel Multicore Processors', Scalable Computing: Practice and Experience, vol 10, no. 4, pp. 429-41.
Marungo, F 2010, 'A Bootstrap Linear Regression of Temperatures in the United States', Coursework in Computational Intelligence and Visualisation, Department of Computer Science, Birkbeck, University of London, London, UK.
Microsoft Corporation 2010a, MSDN -- Argument Passing and Naming Conventions, viewed 6 September 2010.
Microsoft Corporation 2010b, MSDN -- Double Thunking, viewed 15 August 2010.
Microsoft Corporation 2010c, MSDN -- MMX, SSE, and SSE2 Intrinsics, viewed 28 April 2010.
Microsoft Corporation 2010d, MSDN -- Performance Considerations for Interop, viewed 15 August 2010.
Microsoft Corporation 2010a, MSDN -- Task Parallel Library, viewed 6 September 2010.
Milner, JJ & Grandison, AJ 2008, 'A Fast, Streaming SIMD Extensions 2, Logistic Squashing Function', Neural Computation, pp. 2967-72.
Negnevitsky, M 2005, Artificial Intelligence: A Guide to Intelligent Systems (2nd Ed.), Pearson Education Limited, Essex, England.
NVIDIA 2010, CUDA Toolkit 3.0, viewed 13 August 2010.
NVIDIA Corporation 2008, 'Technical Brief: NVIDIA GeForce GTX 200 GPU Architectural Overview', Technical Report TB-04044-001_v01.
NVIDIA Corporation 2010a, NVIDIA CUDA C Programming Best Practices Guide Version 3.0, viewed 17 June 2010.
NVIDIA Corporation 2010b, NVIDIA CUDA Programming Guide Version 3.0, viewed 17 June 2010.
NVIDIA Corporation, CUDA and Tesla for Breast Cancer detection and treatment, viewed 13 December 2009.
Oliveira, J, Gueld, M, Araujo, A, Ott, B & Deserno, TM, Towards a Standard Reference Database for Computer-aided Mammography, viewed 28 April 2010.
Pande, V, Stanford University 2010, FAQ-NVIDIA-GPU3, viewed 31 July 2010.
Porto, VW, Fogel, DB & Fogel, LJ 1995, 'Alternative Neural Network Training Methods', IEEE Expert: Intelligent Systems and Their Applications, pp. 16-22.
Rangayyan, RM, Paranjape, RB, Desautels, JEL & Bryant, H 2006, 'Chapter 3: An Indexed Atlas of Digital Mammograms for Computer-Aided Diagnosis of Breast Cancer', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Richter, J 2003, '.NET Column: The CLR's Thread Pool', viewed 27 April 2010.
Rizzo, BD 2010, New NVIDIA GeForce GTX 480 GPU Cranks Up PC Gaming to New Heights, viewed 17 June 2010.
Sargent, D 2001, 'Comparison of artificial neural networks with other statistical approaches -- results from medical data sets', Cancer, no. 91(8), pp. 1636-1642.
Schwarzer, G, Vach, W & Schumacher, M 2000, 'On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology', Statistics in Medicine, no. 19, pp. 541-561.
Sickles, E 1991, 'Periodic Mammographic Follow-up of Probably Benign Lesions: Results in 3,184 Consecutive Cases', Radiology, 1991, pp. 463-468.
Sickles, E 1999, 'Probably Benign Breast Lesions: When Should Follow-up Be Recommended and What Is the Optimal Follow-up Protocol', Radiology, October 1999, pp. 11-14.
Sonka, M & Fitzpatrick, JM (eds.) 2009, Handbook of Medical Imaging: Medical Image Processing and Analysis, SPIE -- International Society for Optical Engineering, Bellingham, WA.
Steinkraus, D, Buck, I & Simard, PY 2005, 'Using GPUs for Machine Learning Algorithms', Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition.
Stoner, M 2009, Integrating Fast Math Libraries for the Intel Pentium 4 Processor, viewed 21 June 2010.
Street, WN, Wolberg, WH & Mangasarian, OL 1993, 'Nuclear feature extraction for breast tumor diagnosis', International Symposium on Electronic Imaging: Science and Technology, IS&T/SPIE, San Jose, CA.
Suckling, J, et al. 1994, 'The Mammographic Image Analysis Society Digital Mammogram Database', Excerpta Medica. International Congress Series 1069, pp. 375-378.
Suri, JS, Reiser, I, Chandrasekhar, R, Wu, DH, Lanconelli, N, Campanini, R, Roffilli, M, Wong, K, Chang, R, Kshirsagar, A, Guo, Y, Sun, Y, Sivaramakrishna, R, Wirth, M, Tot, T, Cao, A, Acha, B, Serrano, C, Desautels, JEL & Rangayyan, RM 2006, 'Chapter 28: The Current Status and Likely Future of Breast Imaging CAD', in Recent Advances In Breast Imaging, Mammography, and Computer-Aided Diagnosis of Breast Cancer, The Society of Photo-Optical Instrumentation Engineers, Bellingham, Washington.
Sutter, H 2005, A Fundamental Turn Toward Concurrency in Software, viewed 29 June 2010.
Sutton, MA 2009, 'Chapter 6: Image Segmentation by Fuzzy Clustering: Methods and Issues', in IN Bankman (ed.), Handbook of Medical Image Processing and Analysis, Second Edition, Elsevier Inc., London, UK.
VanderSpek, J 2008, 'The CUDA Compiler Driver', NVIDIA Corporation.
Volkov, V & Demmel, JW 2008, 'Benchmarking GPUs to Tune Dense Linear Algebra', Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, Austin, Texas, Article No. 31.
Wald, I 2004, 'Realtime Ray Tracing and Interactive Global Illumination', PhD Thesis, Computer Graphics Group, Saarland University, Saarbrücken, Germany.
Wikipedia 2010a, C++/CLI, viewed 6 September 2010.
Wikipedia 2010b, Flynn's taxonomy, viewed 19 June 2010.

GeneticSelector.cpp (continued)

static bool ChromosomeGreater(const GeneticSelector::Chromosome &x,
                              const GeneticSelector::Chromosome &y)
{ return x.Fitness > y.Fitness; }

static SampDatPtr createSmallSet(GeneticSelector::Chromosome &, SamplingData &, int trainMult);

// public members
// construction and destruction
GeneticSelector::GeneticSelector(int popSize, int gen, int elite
    , SamplingData &data, NeuralNetTrainer::Factory &fact
    , float azCutOff)
    : PopulationSize(popSize), Generations(gen), EliteTopN(elite)
    , Data(data), factory(fact), AzCutOff(azCutOff)
    , Population(new Chromosome[PopulationSize])
    , pairs(new Chromosome[PopulationSize - EliteTopN][2])
{
}

GeneticSelector::~GeneticSelector(void)
{
    delete[] Population;
    delete[] pairs;
}

void GeneticSelector::Execute()
{
    TotalTrainerEvalTime = TotalVerificationEvalTime = TotalTrainerTime = 0;
    TotalTrainerCalcFitness = TotalTrainerSortPopulatoin = TotalTrainerEvaluateNNsTime = 0;
    TotalTrainerCreateChildrenTime = TotalTrainerCreateParentsTime = 0;
    initPop();
    calcFitness(true);
    // sort in declining order of fitness
    sort(Population, &Population[PopulationSize], ChromosomeGreater);
    for(int i = 1; i < Generations; ++i)
    {
        generatePairs();
        executeCrossover();
        calcFitness(false);
        sort(Population, &Population[PopulationSize], ChromosomeGreater);
    }
}
// private members
void GeneticSelector::initPop()
{
    unsigned long mask = ~0ul;
    uniform_int<> hiddenGeneDist(1, MaxHiddenNodes);
    variate_generator<mt19937 &, uniform_int<> > hiddenGeneGen(Random, hiddenGeneDist);
    mask >>= 64 - Data.Trainingset.FieldDim;
    for(int i = 0; i < PopulationSize; ++i)
    {
        Chromosome &c = Population[i];
        c.FeatureGenes = Random(); // this will generate a 32 bit number
        c.FeatureGenes /* ... */
    }
}

void GeneticSelector::generatePairs()
{
    /* ... */ rand1) idx1 += iAdd1--;
        }
        --idx0; // ... and remove later.
        --idx1;
        // crossover population member mating with itself
        // go back
        if(idx0 == idx1)
        {
            --i;
            continue;
        }
        // set up the pairs.
        pairs[i][0] = Population[idx0];
        pairs[i][1] = Population[idx1];
    }
    delete[] cumulative;
}

static int dbg;

void GeneticSelector::executeCrossover()
{
    const int pairsLen = PopulationSize - EliteTopN;
    Chromosome *p = &Population[EliteTopN];
    for(int i = 0; i < pairsLen; ++i, ++p)
    {
        unsigned long mask = 1ul;
        unsigned long &u = p->FeatureGenes;
        u = 0ul;
        // randomly cross over each bit
        for(int j = 0; j < Data.Trainingset.FieldDim; ++j)
        {
            u |= mask & pairs[i][Random() & 1].FeatureGenes;
            mask /* ... */
        }
    }
}

void GeneticSelector::calcFitness(bool /* ... */)
{
    /* ... */ Trainingset, samp->Trainingset.FieldDim, Population[i].HiddenLayerGene);
        TotalTrainerTime -= GetTickCount();
        tnr->TrainNNs();
        TotalTrainerTime += GetTickCount();
        EvolutionaryTrainer *et = dynamic_cast<EvolutionaryTrainer *>(tnr.get());
        TotalTrainerEvalTime += et->NNEvalTime;
        TotalTrainerCalcFitness += et->CalcFitnessTime;
        TotalTrainerSortPopulatoin += et->SortPopulationTime;
        TotalTrainerEvaluateNNsTime += et->EvaluateNNsTime;
        TotalTrainerCreateChildrenTime += et->CreateChildrenTime;
        TotalTrainerCreateParentsTime += et->CreateParentsTime;
        NeuralNetTrainer::WgtsPtr wgts = tnr->GetWeights();
        // then use the top weight vector from each sample to evaluate Az over the test
        // set of the sample
        // this can be done async with the addition of callback functionality to the
        // library.
        for(int j = 0; j < sampleSize; ++j)
        {
            sampleAz[j] = calcAz(wgts[j].get(), samp->Testingset, i, j);
        }
        sort(sampleAz, &sampleAz[sampleSize]);
        chrom[i].Fitness = 100.0f*sampleAz[cutOffIdx];
    }
    delete[] sampleAz;
}
float GeneticSelector::calcAz(float *wgts, const TestingSet &testSamp, int chromIdx, int sampIdx)
{
    Chromosome &chrom = Population[chromIdx];
    const int testCnt = testSamp.RecordCnts[sampIdx];
    int totalTrue = 0, totalNeg = 0;
    OptGnd *optGnd = new OptGnd[testCnt];
    processTestNN(testSamp, wgts, chromIdx, sampIdx, optGnd, totalTrue, totalNeg);
    struct {bool operator()(OptGnd &x, OptGnd &y){return x.opt > y.opt;}} Comp;
    sort(&optGnd[0], &optGnd[testCnt], Comp);
    int tt = 0;
    float tDelta = 1.0f/float(totalTrue);
    float fDelta = 1.0f/float(totalNeg);
    float retVal = 0.0f;
    // calculate the area under the curve.
    for(int i = 0; i < testCnt; ++i)
    {
        if(optGnd[i].gnd)
            ++tt;
        else
            retVal += fDelta*tDelta*tt;
    }
    delete[] optGnd;
    return retVal;
}
//// this is designed to be a very fast calculation of the
//// test data (unselected items of the bootstrap)
//// the test data is only evaluated on the top network
//// thus it can occur on the CPU while the evolutionary
//// training is occurring over multiple generations and multiple samples
//// on the GPU.
void GeneticSelector::processTestNN(const TestingSet &testSamp, float * const wgts, int chromIdx, int sampIdx, OptGnd * const optGnd, int &totalTrue, int &totalNeg)
{
    Chromosome &c = Population[chromIdx];
    const int &hiddenNodeCnt = c.HiddenLayerGene;
    int recCnt = testSamp.RecordCnts[sampIdx];
    float *output = (float *)_aligned_malloc(sizeof(float)*testSamp.RecordDims[sampIdx], 64);
    TotalVerificationEvalTime -= GetTickCount();
    SseGlobal::EvaluateNN(testSamp.TestSets[sampIdx]
        , testSamp.RecordDims[sampIdx]
        , testSamp.FieldDim
        , 1
        , hiddenNodeCnt
        , wgts
        , output
        , recCnt);
    TotalVerificationEvalTime += GetTickCount();
    // set up optToGnd
    // calculate total trues and total falses
    float *gnd = testSamp.GroundTruth[sampIdx];
    totalTrue = 0;
    totalNeg = 0;
    for(int i = 0; i < recCnt; ++i)
    {
        if(gnd[i] > 0.5f)
        {
            ++totalTrue;
            optGnd[i].gnd = 1;
        }
        else
        {
            ++totalNeg;
            optGnd[i].gnd = 0;
        }
        optGnd[i].opt = output[i];
    }
    _aligned_free(output);
}

vector<int> GeneticSelector::Chromosome::GetInputFieldIndexes()
{
    vector<int> retVal;
    unsigned long gene = FeatureGenes;
    for(int i = 0; i < 64; ++i)
    {
        if(gene & 1ul)
            retVal.push_back(i);
        gene >>= 1;
    }
    return retVal;
}
SampDatPtr createSmallSet(GeneticSelector::Chromosome &c, SamplingData &s, int trainMult)
{
    std::vector<int> idxs = c.GetInputFieldIndexes();
    int fldCnt = idxs.size();
    int setCnt = s.Testingset.TestsetDim;
    // will use the CPU for test ANN evaluation.
    TestingSet *test = new TestingSet(setCnt, fldCnt, s.Testingset.RecordCnts, 4);
    // will use whichever evaluator provided.
    TrainingSet *train = new TrainingSet(setCnt, fldCnt, s.Trainingset.RecordCnt, trainMult);
    // training data is rectangular, testing data is jagged.
    int trainRecCnt = s.Trainingset.RecordCnt;
    int trainRecDim = s.Trainingset.RecordDim;
    for(int i = 0; i < setCnt; ++i)
    {
        int testRecCnt = s.Testingset.RecordCnts[i];
        int testRecDim = s.Testingset.RecordDims[i];
        float* trainSet = s.Trainingset.Samples[i];
        float* trainGnd = s.Trainingset.GroundTruth[i];
        float* testSet = s.Testingset.TestSets[i];
        float* testGnd = s.Testingset.GroundTruth[i];
        float *beg, *end;
        float *dest;
        for(int j = 0; j < fldCnt; ++j)
        {
            beg = &trainSet[idxs[j]*trainRecDim];
            end = &beg[trainRecCnt];
            dest = &train->Samples[i][j*train->RecordDim];
            copy(beg, end, dest);
            beg = &testSet[idxs[j]*testRecDim];
            end = &beg[testRecCnt];
            dest = &test->TestSets[i][j*test->RecordDims[i]];
            copy(beg, end, dest);
        }
        beg = &trainGnd[0];
        end = &beg[trainRecCnt];
        dest = train->GroundTruth[i];
        copy(beg, end, dest);
        beg = &testGnd[0];
        end = &beg[testRecCnt];
        dest = test->GroundTruth[i];
        copy(beg, end, dest);
    }
    return SampDatPtr(new SamplingData(*train, *test));
}
Global.h

#pragma once
#ifndef _GLOBAL_H
#define _GLOBAL_H
#include <boost/random/mersenne_twister.hpp>
namespace PROJ_MarungoF { namespace Lib {
    boost::mt19937 Random;
} }

// this is the data for the evaluation call
typedef struct
{
    // All fixed length arrays are 0-terminating thus can only contain a max of 1023 elements
    size_t WeightVectors[1024]; // [WeightSetDim][WeightVectorDim][WeightEleDim]
    size_t Output[1024];        // [WeightSetDim][WeightVectorDim][RecordDim]
    size_t Dataset[1024];       // [WeightSetDim] points to the matching Dataset
    int WeightSetDim;           // # of Weightsets, max val 1023
    int WeightVectorDim;        // Population size of evolutionary algo
    int WeightEleDim;           // == (EvalFldDim + 2) * HiddenNodeCnt + 1
    int WeightOutputOffset;     // == (# of EvalFlds + 1) * # of hidden nodes
    int HiddenNodeCnt;
} NNEvaluationData;
#endif
Global.cpp

#include "Global.h"
using namespace PROJ_MarungoF::Lib;
using namespace boost;

static int init();
static int dummy = init();

static int init()
{
    Random = mt19937(0);
    return 0;
}
NeuralNetEvaluator.h

#pragma once
#ifndef _NEURAL_NET_EVALUATOR_H
#define _NEURAL_NET_EVALUATOR_H
#include <boost/shared_ptr.hpp>
namespace PROJ_MarungoF { namespace Lib {
    class TrainingSet;

    class NeuralNetEvaluator
    {
    public:
        struct WeightData
        {
            float **WeightVectors;
            float **Output;
            int *DatasetMapping;
            int WeightSetDim;
            int WeightVectorDim;
            int WeightEleDim;
            int HiddenNodeCnt;
        };

        typedef boost::shared_ptr<NeuralNetEvaluator> Ptr;

        // factory class
        class Factory
        {
        public:
            Factory() {}
            virtual Ptr GetEvaluator() = 0;
        private:
            Factory &operator=(const Factory &);
            Factory (Factory &);
        };

        virtual int GetRecordDimMultiple() = 0;
        virtual void Evaluate(WeightData &) = 0;
        virtual void SetDataset(TrainingSet &) = 0;
        virtual void ReleaseDataset() = 0;

        NeuralNetEvaluator(void);
        virtual ~NeuralNetEvaluator(void);
    };
} }
#endif
NeuralNetEvaluator.cpp

#include "NeuralNetEvaluator.h"
using namespace PROJ_MarungoF::Lib;

NeuralNetEvaluator::NeuralNetEvaluator(void)
{
}

NeuralNetEvaluator::~NeuralNetEvaluator(void)
{
}
NeuralNetTrainer.h

#pragma once
#ifndef _NEURAL_NET_TRAINER_H
#define _NEURAL_NET_TRAINER_H
#include <boost/shared_ptr.hpp>
#include <boost/shared_array.hpp>
#include /* ... */
#include /* ... */
namespace PROJ_MarungoF { namespace Lib {
    class TrainingSet;

    class NeuralNetTrainer
    {
    public:
        // member types
        typedef boost::shared_ptr<NeuralNetTrainer> Ptr;
        typedef boost::shared_array<boost::shared_array<float> > WgtsPtr; // [SampleDim][WeightElementDim]

        // factory class
        class Factory
        {
        public:
            Factory() {}
            virtual Ptr GetTrainer(TrainingSet &, int fldCnt, int hidNodeCnt) = 0;
            virtual int GetRecordDimMultiple() = 0;
        private:
            Factory &operator=(const Factory &);
            Factory (Factory &);
        };

        virtual int GetFieldCnt() = 0;
        virtual int GetHiddenNodeCnt() = 0;
        virtual TrainingSet &GetData() = 0;
        virtual WgtsPtr GetWeights() = 0;
        virtual void TrainNNs() = 0;

        virtual ~NeuralNetTrainer(void);

    protected:
        NeuralNetTrainer(void);
    private:
        NeuralNetTrainer(const NeuralNetTrainer &);
        const NeuralNetTrainer &operator=(const NeuralNetTrainer &);
    };
} }
#endif
NeuralNetTrainer.cpp

#include "NeuralNetTrainer.h"
using namespace PROJ_MarungoF::Lib;

NeuralNetTrainer::NeuralNetTrainer(void)
{
}

NeuralNetTrainer::~NeuralNetTrainer(void)
{
}
OrigEvolutionaryTrainer.h

#pragma once
#ifndef _ORIG_EVOLUTIONARY_TRAINER_H
#define _ORIG_EVOLUTIONARY_TRAINER_H
#include "NeuralNetTrainer.h"
#include "NeuralNetEvaluator.h"
#include /* ... */
namespace PROJ_MarungoF { namespace Lib {
    class TrainingSet;

    class OrigEvolutionaryTrainer : public NeuralNetTrainer
    {
    public:
        // member types
        typedef boost::shared_array<boost::shared_array<float> > FloatArrPtr;

        class Factory : public NeuralNetTrainer::Factory
        {
        public:
            Factory(int popSize, int genCnt, NeuralNetEvaluator &eval)
                : PopSize(popSize), GenCnt(genCnt), Eval(eval){}
            const int PopSize;
            const int GenCnt;
            NeuralNetEvaluator &Eval;
            virtual Ptr GetTrainer() = 0; // {return Ptr(new OrigEvolutionaryTrainer(PopSize, GenCnt, Eval));}
        private:
            Factory(const Factory &);
            Factory &operator=(const Factory &);
        };

        // construction and destruction
        OrigEvolutionaryTrainer(int popSize, int generations, NeuralNetEvaluator &eval);
        ~OrigEvolutionaryTrainer(void);

        // member fields
        const int PopulationSize;
        const int Generations;
        FloatArrPtr Fitness; // [SampleSetDim][PopulationSize]
        FloatArrPtr Sigma;   // [SampleSetDim][PopulationSize][WeightEleDim]
        WgtsPtr Weights;     // [SampleSetDim][PopulationSize][WeightEleDim]

        // member methods
        virtual void TrainNNs(TrainingSet &data, int hidNodeCnt);
        virtual int GetRecordDimMultiple();
        virtual WgtsPtr GetWeights();

    //protected:
        // member methods
        virtual void evaluateNNs(bool calcParents);

        // member fields
        float **gndTruths;
        int wgtsLen;
        int sampleDim;
        int fldDim;
        int wgtEleDim;
        int recDim;
        int recCnt;
        int hidNodeCnt;
        int childWgtsLen;
        float **output; // [SampleSetDim][PopulationSize][RecordDim]
        int popRecDimDim;
        NeuralNetEvaluator &evaluator;

    //private:
        // member fields
        float c0, c1; // coefficients for mutation

        // member methods
        void initPopulations();
        void createChildren();
        void sortByFitness();
        void calcFitness(bool calcParents = true);

        // unused copy constructor and assignment operator
        OrigEvolutionaryTrainer(const OrigEvolutionaryTrainer &);
        OrigEvolutionaryTrainer & operator=(const OrigEvolutionaryTrainer &);
    };
} }
#endif
OrigEvolutionaryTrainer.cpp

#include "OrigEvolutionaryTrainer.h"
#include "Global.h"
#include "TrainingSet.h"
#include /* ... */

using namespace PROJ_MarungoF::Lib;
using namespace boost;
using namespace std;

mt19937 &Random(Random);
static uniform_real<float> wgtDist(-1.0f, +1.0f);
static cauchy_distribution<float> cDist;
static normal_distribution<float> nDist;

OrigEvolutionaryTrainer::OrigEvolutionaryTrainer(int popSize, int gen, NeuralNetEvaluator &eval)
    : PopulationSize(popSize), Generations(gen), evaluator(eval)
{}

OrigEvolutionaryTrainer::~OrigEvolutionaryTrainer(void)
{}
void OrigEvolutionaryTrainer::TrainNNs(TrainingSet &data, int hidNodeCnt)
{
    evaluator.SetDataset(data);
    gndTruths = (float **)data.GroundTruth;
    sampleDim = data.SampleDim;
    Weights = WgtsPtr(new shared_array<float>[sampleDim]);
    Fitness = FloatArrPtr(new shared_array<float>[sampleDim]);
    Sigma = FloatArrPtr(new shared_array<float>[sampleDim]);
    output = new float *[sampleDim];
    fldDim = data.FieldDim;
    wgtEleDim = (fldDim + 2)*hidNodeCnt + 1;
    wgtsLen = wgtEleDim*PopulationSize;
    childWgtsLen = wgtsLen >> 1;
    recDim = data.RecordDim;
    recCnt = data.RecordCnt;
    this->hidNodeCnt = hidNodeCnt;
    popRecDimDim = PopulationSize*recDim;
    int optSze = sizeof(float)*popRecDimDim;
    c0 = 1.0f/(sqrtf(2.0f*wgtEleDim));
    c1 = 1.0f/(sqrtf(2.0f*sqrtf((float)wgtEleDim)));
    // initialize values
    for(int i = 0; i < sampleDim; ++i)
    {
        Weights[i] = shared_array<float>(new float[wgtsLen]);
        Fitness[i] = shared_array<float>(new float[PopulationSize]);
        Sigma[i] = shared_array<float>(new float[wgtsLen]);
        output[i] = (float *)_aligned_malloc(optSze, 64);
    }
    initPopulations();
    calcFitness(true);
    sortByFitness();
    for(int i = 1; i < Generations; ++i)
    {
        createChildren();
        calcFitness(false);
        sortByFitness();
    }
    for(int i = 0; i < sampleDim; ++i)
    {
        _aligned_free(output[i]);
    }
    delete[] output;
    evaluator.ReleaseDataset();
}
void OrigEvolutionaryTrainer::initPopulations()
{
    // initialize weights & sigma
    // initial weights uniformly distributed between -1.0 and +1.0, sigma initially 1.0
    static variate_generator<mt19937 &, uniform_real<float> > wgtGen(Random, wgtDist);
    for(int i = 0; i < sampleDim; ++i)
    {
        for(float *wgt = Weights[i].get(), *sig = Sigma[i].get(), * const wgtsEnd = &wgt[wgtsLen];
            wgt < wgtsEnd; ++wgt, ++sig)
        {
            *wgt = wgtGen();
            *sig = 1.0f;
        }
    }
}

// This function mutates the top 50% of the previous generation's population to create the
// children -- see Project Report:
//   wi' = wi + C*sigi'
//   sigi' = sigi * exp(c0*N(0,1) + c1*Ni(0,1))
//   c0 = 1.0/sqrt(2*WeightVectorDim), c1 = 1.0/sqrt(2*sqrt(WeightVectorDim))
void OrigEvolutionaryTrainer::createChildren()
{
    static variate_generator<mt19937 &, normal_distribution<float> > nGen(Random, nDist);
    static variate_generator<mt19937 &, cauchy_distribution<float> > cGen(Random, cDist);
    for(int i = 0; i < sampleDim; ++i)
    {
        const float *pw = Weights[i].get();
        float *cw = (float *)&pw[childWgtsLen];
        const float *ps = Sigma[i].get();
        float *cs = (float *)&ps[childWgtsLen];
        const float *le = &pw[wgtEleDim];
        const float * const childWgts = cw;
        for(; pw < childWgts; le += wgtEleDim)
        {
            float N0 = nGen(), C = cGen();
            for(; pw < le; ++pw, ++cw, ++ps, ++cs)
            {
                *cs = *ps * exp(c0*N0 + c1*nGen());
                *cw = *pw + C*(*cs);
            }
        }
    }
}

void OrigEvolutionaryTrainer::calcFitness(bool calcParents)
{
    //int t0 = GetTickCount();
    evaluateNNs(calcParents);
    //int t1 = GetTickCount();
    //int deltT0 = t1 - t0;
    //cout /* ... */
    for(int i = 0; i < sampleDim; ++i)
    {
        float *fit = Fitness[i].get();
        const float * const fitEnd = &fit[PopulationSize];
        if(!calcParents) { fit = &fit[PopulationSize >> 1]; }
        float *opt = output[i];
        const float *gdTh = gndTruths[i];
        const float * const gndEnd = &gdTh[recCnt];
        const int padding = recDim - recCnt;
        while(fit < fitEnd)
        {
            *fit = (float)recCnt;
            for(float *gnd = (float *)gdTh; gnd < gndEnd; ++gnd, ++opt)
            {
                *fit -= (*gnd - *opt)*(*gnd - *opt);
            }
            *fit *= 100.0f/recCnt;
            ++fit;
            opt += padding;
        }
    }
    //t1 = GetTickCount();
    //int deltT1 = t1 - t0;
}

void OrigEvolutionaryTrainer::sortByFitness()
{
    struct IdxToFit {int idx; float fit;};
    static struct Greater
    {
        bool operator()(const IdxToFit &x, const IdxToFit &y) {return x.fit > y.fit;}
    } compare;
    IdxToFit *idxToFit = new IdxToFit[PopulationSize];
    for(int i = 0; i < sampleDim; ++i)
    {
        float * const wgts = Weights[i].get(); // old weights
        float * const fit = Fitness[i].get();
        float * const sig = Sigma[i].get();
        for(int j = 0; j < PopulationSize; ++j)
        {
            idxToFit[j].idx = j;
            idxToFit[j].fit = fit[j];
        }
        sort(idxToFit, &idxToFit[PopulationSize], compare);
        //float *curWgt = wgtsBuf;
        //float *curSig = sigBuf;
        float * const wgtBase = new float[wgtsLen];
        float * const sigBase = new float[wgtsLen];
        float *curWgt = wgtBase;
        float *curSig = sigBase;
        for(int j = 0; j < PopulationSize; ++j, curWgt += wgtEleDim, curSig += wgtEleDim)
        {
            copy(&wgts[idxToFit[j].idx*wgtEleDim], &wgts[(1 + idxToFit[j].idx)*wgtEleDim], curWgt);
            copy(&sig[idxToFit[j].idx*wgtEleDim], &sig[(1 + idxToFit[j].idx)*wgtEleDim], curSig);
            fit[j] = idxToFit[j].fit;
        }
        Weights[i] = shared_array<float>(wgtBase);
        Sigma[i] = shared_array<float>(sigBase);
        //copy(wgtsBuf, &wgtsBuf[wgtsLen], wgts);
        //copy(sigBuf, &sigBuf[wgtsLen], wgts);
    }
    delete[] idxToFit;
}

int OrigEvolutionaryTrainer::GetRecordDimMultiple()
{
    return evaluator.GetRecordDimMultiple();
}

void OrigEvolutionaryTrainer::evaluateNNs(bool calcParents)
{
    NeuralNetEvaluator::WeightData wgt;
    int nNCnt = calcParents ? PopulationSize : PopulationSize >> 1;
    int offset = (calcParents ? 0 : PopulationSize >> 1) * wgtEleDim;
    wgt.WeightSetDim = sampleDim;
    wgt.HiddenNodeCnt = hidNodeCnt;
    wgt.WeightVectorDim = nNCnt;
    wgt.WeightEleDim = (fldDim + 2) * hidNodeCnt + 1;
    wgt.WeightVectors = new float *[sampleDim];
    wgt.Output = new float *[sampleDim];
    wgt.DatasetMapping = new int[sampleDim];
    copy(output, &output[sampleDim], wgt.Output);
    for(int i = 0; i < sampleDim; ++i)
    {
        wgt.WeightVectors[i] = Weights[i].get() + offset;
        wgt.DatasetMapping[i] = i;
    }
    evaluator.Evaluate(wgt);
    delete[] wgt.WeightVectors;
    delete[] wgt.Output;
    delete[] wgt.DatasetMapping;
}

NeuralNetTrainer::WgtsPtr OrigEvolutionaryTrainer::GetWeights()
{
    return Weights;
}
SamplingData.h

#pragma once
#ifndef SAMPLING_DATA_H
#define SAMPLING_DATA_H
namespace PROJ_MarungoF { namespace Lib {
    class TrainingSet;
    class TestingSet;

    class SamplingData
    {
    public:
        TrainingSet &Trainingset;
        TestingSet  &Testingset;

        SamplingData(TrainingSet &, TestingSet &);
        ~SamplingData();
    private:
        SamplingData(const TrainingSet &);
        const SamplingData &operator=(const TrainingSet &);
    };
} }
#endif
SamplingData.cpp

#include "SamplingData.h"
#include "TrainingSet.h"
#include "TestingSet.h"
using namespace PROJ_MarungoF::Lib;

SamplingData::SamplingData(TrainingSet &trSet, TestingSet &teSet)
    : Trainingset(trSet), Testingset(teSet)
{
}

SamplingData::~SamplingData()
{
    delete &Trainingset;
    delete &Testingset;
}
SseEvaluator.h

#pragma once
#ifndef _SSE_EVALUATOR_H
#define _SSE_EVALUATOR_H
#include "NeuralNetEvaluator.h"
namespace PROJ_MarungoF { namespace Lib {
    class SseEvaluator : public NeuralNetEvaluator
    {
    public:
        class Factory : public NeuralNetEvaluator::Factory
        {
        public:
            virtual Ptr GetEvaluator();
            Factory(){}
        private:
            Factory &operator=(const Factory &);
            Factory (Factory &);
        };

        SseEvaluator(void);
        virtual ~SseEvaluator(void);
        virtual int GetRecordDimMultiple();
        virtual void SetDataset(TrainingSet &);
        virtual void ReleaseDataset();
        virtual void Evaluate(WeightData &);
    protected:
        TrainingSet *data;
    };
} }
#endif
SseEvaluator.cpp

#include "SseEvaluator.h"
#include "SseGlobal.h"
#include "TrainingSet.h"
using namespace PROJ_MarungoF::Lib;

SseEvaluator::SseEvaluator(void)
{
}

SseEvaluator::~SseEvaluator(void)
{
}

int SseEvaluator::GetRecordDimMultiple(){return 4;}

void SseEvaluator::SetDataset(PROJ_MarungoF::Lib::TrainingSet &data) {this->data = &data;}

void SseEvaluator::ReleaseDataset(){}

void SseEvaluator::Evaluate(WeightData &wgtDat)
{
    int rowDim = data->RecordDim;
    int fldDim = data->FieldDim;
    int hidNodCnt = wgtDat.HiddenNodeCnt;
    int recCnt = data->RecordCnt;
    int wgtVectDim = wgtDat.WeightVectorDim;
    for(int i = 0; i < wgtDat.WeightSetDim; ++i)
    {
        float *dat = data->Samples[wgtDat.DatasetMapping[i]];
        float *wgt = wgtDat.WeightVectors[i];
        float *out = wgtDat.Output[i];
        SseGlobal::EvaluateNN(dat
            , rowDim
            , fldDim
            , wgtVectDim
            , hidNodCnt
            , wgt
            , out
            , recCnt);
    }
}

NeuralNetEvaluator::Ptr SseEvaluator::Factory::GetEvaluator()
{
    return Ptr(new SseEvaluator());
}
SseGlobal.h

#pragma once
#ifndef _SSE_GLOBAL_H
#define _SSE_GLOBAL_H
union __m128;
namespace PROJ_MarungoF { namespace Lib {
    struct SseGlobal
    {
        // data and output must be aligned on 16 byte boundaries
        // this function evaluates the same weight on all of the networks
        void static EvaluateNN(float *data // [fieldDim][rowDim]
            , int rowDim                   // must be a multiple of 4
            , int fieldDim
            , int wgtVectDim
            , int hidNodCnt
            , float *weight                // [(fieldDim + 2)*hidNodCnt + 1]
            , float *output                // [rowDim]
            , int recCnt
            );
        void static __fastcall SquashingFunctionP4(__m128* fin);

    private:
        SseGlobal(void);
        ~SseGlobal(void);
    };
} }
#endif
SseGlobal.cpp

#include "SseGlobal.h"
#include <xmmintrin.h>   // SSE intrinsics (__m128, _mm_mul_ps, ...)
#include <emmintrin.h>   // SSE2 intrinsics

using namespace PROJ_MarungoF::Lib;

// SquashingFunctionP4 and constant declarations from
// "A Fast, Streaming SIMD Extensions 2, Logistic Squashing Function"
// (published in Neural Computation)
// J. J. Milner, [email protected]
// A. J. Grandison, [email protected]
// School of Computing and Mathematical Sciences, University of Greenwich,
// 30 Park Row, Greenwich, London SE10 9SL, UK
// doi:10.1162/neco.2008.10-06-366
__declspec(align(64)) static const float MAX[4] = { 87.0f, 87.0f, 87.0f, 87.0f };
__declspec(align(64)) static const float MIN[4] = { -87.0f, -87.0f, -87.0f, -87.0f };
__declspec(align(64)) static const float p4shiftexp[4] =
    { -(8388608.0f/0.6931471806f), -(8388608.0f/0.6931471806f),
      -(8388608.0f/0.6931471806f), -(8388608.0f/0.6931471806f) };
__declspec(align(64)) static const float p4shiftbias[4] =
    { 1065353216.0f, 1065353216.0f, 1065353216.0f, 1065353216.0f };
__declspec(align(64)) const float p4ones[4]  = { 1.0f, 1.0f, 1.0f, 1.0f };
__declspec(align(64)) const float p4zeros[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

void SseGlobal::EvaluateNN(float *dStart
    , int rowDim
    , int fieldDim
    , int wgtVectDim
    , int hidNodCnt
    , float *weight
    , float *output
    , int recCnt)
{
    __m128 w4;   // weights on the hidden layer
    __m128 wo4;  // weights to the output node
    __m128 ipt4; // input node values

    // iterate over 4 records at a time
    int inc = rowDim / 4;
    int wOutOff = (fieldDim + 1)*hidNodCnt;
    int wOff = (fieldDim + 2)*hidNodCnt + 1;
    float *weightEnd = &weight[wOff*wgtVectDim];
    __m128 *dEnd = (__m128 *)&dStart[rowDim*fieldDim];

    while(weight < weightEnd)
    {
        __m128 *opt4 = (__m128 *)output;
        float *data = dStart;
        for(int i = 0; i < recCnt; i += 4, data += 4)
        {
            float *w = weight;
            float *wo = &w[wOutOff];
            *opt4 = _mm_set1_ps(0.0f);
            // iterate over hidden nodes
            for(int j = 0; j < hidNodCnt; ++j)
            {
                ipt4 = _mm_set1_ps(0.0f);
                // iterate over inputs
                for(__m128 *d = (__m128 *)data; d < dEnd; d += inc)
                {
                    w4 = _mm_set1_ps(*w);
                    ipt4 = _mm_add_ps(ipt4, _mm_mul_ps(w4, *d));
                    ++w;
                }
                // add bias
                w4 = _mm_set1_ps(*w);
                ipt4 = _mm_add_ps(ipt4, w4);
                ++w;
                SquashingFunctionP4(&ipt4);
                wo4 = _mm_set1_ps(*wo);
                *opt4 = _mm_add_ps(*opt4, _mm_mul_ps(wo4, ipt4));
                ++wo;
            }
            // add bias
            wo4 = _mm_set1_ps(*wo);
            *opt4 = _mm_add_ps(*opt4, wo4);
            ++wo;
            SquashingFunctionP4(opt4); // this is the 4 NNs' output
            ++opt4;
        }
        output += rowDim;
        weight += wOff;
    }
}

// SquashingFunctionP4 from Milner & Grandison (see citation above)
__declspec(naked) void __fastcall SseGlobal::SquashingFunctionP4(__m128* fin)
{
    __asm
    {
        movaps   xmm1, [ecx]          ; load 4 single-precision values
        mulps    xmm1, [p4shiftexp]   ; shift y into high-order bits
        addps    xmm1, [p4shiftbias]  ; add the (pre-shifted) bias
        cvtps2dq xmm0, xmm1           ; convert 4 floats to integers
        movdqa   [ecx], xmm0          ; store 4 integers
        movaps   xmm1, [ecx]          ; reload as floats, this is e^-y
        addps    xmm1, [p4ones]       ; add one
        rcpps    xmm0, xmm1           ; reciprocal
        movaps   [ecx], xmm0          ; store 4 results
        ret
    }
}
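The assembly above implements the Milner and Grandison fast logistic function cited in the comments. It rests on an IEEE-754 bit trick: writing round(-y * 2^23 / ln 2) + 127 * 2^23 into a float's bit pattern yields approximately e^-y, because a single-precision float whose exponent field is e encodes 2^(e - 127); the reciprocal then gives the logistic 1/(1 + e^-y). The following is a minimal scalar C++ sketch (not part of the project code) restating the same computation; the +/-87 clamp is an assumption inferred from the MAX/MIN constants declared above, which do not appear in the instruction sequence as listed.

#include <cmath>    // lrintf
#include <cstdint>
#include <cstring>  // memcpy

// Scalar restatement of SquashingFunctionP4's bit trick (illustrative sketch).
static float fastLogistic(float y)
{
    // Assumed clamp (per the MAX/MIN constants): keeps the exponent
    // arithmetic below from overflowing the float's exponent field.
    if (y >  87.0f) y =  87.0f;
    if (y < -87.0f) y = -87.0f;

    // i = round(-y * 2^23 / ln 2) + 127 * 2^23; reinterpreting i as a
    // float gives roughly 2^(-y / ln 2) = e^-y.
    int32_t i = (int32_t)lrintf(y * -(8388608.0f / 0.6931471806f))
              + 1065353216;
    float expNegY;
    std::memcpy(&expNegY, &i, sizeof expNegY);

    return 1.0f / (1.0f + expNegY);   // logistic: 1 / (1 + e^-y)
}

For y = 0 the integer is exactly 1065353216, whose bit pattern is 1.0f, giving the expected output 0.5.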
TestingSet.h

#pragma once
#ifndef TEST_SET_H
#define TEST_SET_H

namespace PROJ_MarungoF { namespace Lib {

class TestingSet
{
public:
    // construction, destruction
    TestingSet(int testsetDim, int fieldDim, int recCnts[1024], int recDimMult = 32);
    virtual ~TestingSet(void);

    // elements
    // All fixed-length arrays are zero-terminated, so they hold at most 1023 entries.
    float *TestSets[1024];       // [TestsetDim][FieldDim][RecordDim]
    float *GroundTruth[1024];    // [TestsetDim][RecordDim]
    const int TestsetDim;        // # of bootstraps, max value 1023
    const int FieldDim;
    int RecordDims[1024];        // will be a multiple of RecordDimMultiple
    int RecordCnts[1024];        // the true record count
    const int RecordDimMultiple; // must be a power of two
    const int Alignment;

private:
    TestingSet(const TestingSet &);
    TestingSet & operator=(const TestingSet &);
};

} }

#endif
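The constructor in TestingSet.cpp (next listing) rounds each record count up to a multiple of RecordDimMultiple with the expression recCnts[i] & (recDimMult - 1) ? (recCnts[i] | (recDimMult - 1)) + 1 : recCnts[i], which is correct only when the multiple is a power of two — hence the constraint noted above. A standalone restatement of the idiom (not project code):

#include <cassert>

// Round n up to the next multiple of m, where m is a power of two.
// n & (m - 1) extracts the remainder; n | (m - 1) then + 1 rounds up.
static int roundUpPow2(int n, int m)
{
    assert((m & (m - 1)) == 0);   // m must be a power of two
    return (n & (m - 1)) ? (n | (m - 1)) + 1 : n;
}

// e.g. roundUpPow2(100, 32) == 128 and roundUpPow2(128, 32) == 128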
TestingSet.cpp

#include "TestingSet.h"
#include <algorithm>   // std::copy
#include <malloc.h>    // _aligned_malloc

using namespace PROJ_MarungoF::Lib;
using namespace std;

TestingSet::TestingSet(int testsetDim, int fieldDim, int recCnts[1024], int recDimMult)
    : TestsetDim(testsetDim)
    , FieldDim(fieldDim)
    , RecordDimMultiple(recDimMult)
    , Alignment(64)
{
    copy(&recCnts[0], &recCnts[1024], &RecordCnts[0]);
    TestSets[TestsetDim] = 0;
    GroundTruth[TestsetDim] = 0;
    RecordDims[TestsetDim] = 0;
    RecordCnts[TestsetDim] = 0;
    for(int i = 0; i < TestsetDim; ++i)
    {
        // round the record count up to the next multiple of recDimMult
        RecordDims[i] = recCnts[i] & (recDimMult - 1)
            ? (recCnts[i] | (recDimMult - 1)) + 1
            : recCnts[i];
        size_t tsLen  = sizeof(float)*FieldDim*RecordDims[i];
        size_t gndLen = sizeof(int)*RecordDims[i];
        TestSets[i]    = (float *)_aligned_malloc(tsLen, Alignment);
        GroundTruth[i] = (float *)_aligned_malloc(gndLen, Alignment);
        // zero the padded values
        if(RecordDims[i] != RecordCnts[i])
        {
            float *f = TestSets[i];
            for(int j = 0; j < FieldDim; ++j, f += RecordDims[i])
                for(int k = RecordCnts[i]; k < RecordDims[i]; ++k)
                    f[k] = 0.0f;
        }
    }
}

    eval->SetDataset(ts);
    eval->Evaluate(wgtDat);
    eval->ReleaseDataset();
    for(int i = 0; i < sampDim; ++i)
    {
        for(int j = 0; j < wgtVectDim; ++j)
        {
            float *out = &output[i][j*ts.RecordDim];
            for(int k = 0; k < ts.RecordCnt; ++k)
            {
                retVal = max(retVal,
                    fabsf(out[k] - XOR_OUT[(i + j) & 1][3 & (i + k)]));
            }
        }
    }
    for(int i = 0; i < sampDim; ++i)
    {
        _aligned_free(weightVects[i]);
        _aligned_free(output[i]);
    }
    delete[] weightVects;
    delete[] output;
    delete[] dsMap;
    return retVal;
}

void initTrainingSet(TrainingSet &ts)
{
    for(int i = 0; i < ts.SampleDim; ++i)
    {
        for(int j = 0; j < ts.RecordCnt; ++j)
        {
            int val = i + j;
            ts.Samples[i][j] = (float)(val & 1);
            ts.Samples[i][ts.RecordCnt + j] = (float)((val & 2) >> 1);
            ts.GroundTruth[i][j] = (float)(((val & 1) ^ ((val & 2) >> 1)) & 1);
        }
    }
}

float TestClass::TestTrainer(NeuralNetTrainer::Factory &fact)
{
    TrainingSet train(32, 2, 128, 32);
    initTrainingSet(train);
    return TestTrainer(fact, train, 2);
}

float TestClass::TestTrainer(NeuralNetTrainer::Factory &fact, TrainingSet &train, int hidNodeCnt)
{
    NeuralNetTrainer::Ptr pTnr = fact.GetTrainer(train, train.FieldDim, hidNodeCnt);
    NeuralNetTrainer &tnr = *pTnr.get();
    tnr.TrainNNs();
    NeuralNetTrainer::WgtsPtr pWgts = tnr.GetWeights();
    NeuralNetEvaluator::WeightData wgtData;
    wgtData.WeightSetDim = train.SampleDim;
    wgtData.WeightVectorDim = 1;
    wgtData.WeightEleDim = (train.FieldDim + 2)*hidNodeCnt + 1;
    wgtData.HiddenNodeCnt = hidNodeCnt;
    wgtData.DatasetMapping = new int[train.SampleDim];
    wgtData.Output = new float *[train.SampleDim];
    wgtData.WeightVectors = new float *[train.SampleDim];
    for(int i = 0; i < train.SampleDim; ++i)
    {
        wgtData.Output[i] = new float[wgtData.WeightVectorDim*train.RecordDim];
        wgtData.WeightVectors[i] = new float[wgtData.WeightEleDim*wgtData.WeightVectorDim];
        copy(pWgts[i].get(),
             &pWgts.get()[i][wgtData.WeightEleDim*wgtData.WeightVectorDim],
             wgtData.WeightVectors[i]);
        wgtData.DatasetMapping[i] = i;
    }
    NeuralNetEvaluator::Ptr pEval = CudaEvaluator::Factory(&BasicEvaluateNN).GetEvaluator();
    NeuralNetEvaluator &eval = *pEval.get();
    eval.SetDataset(train);
    eval.Evaluate(wgtData);
    eval.ReleaseDataset();
    int right = 0, total = 0;
    for(int i = 0; i < train.SampleDim; ++i)
    {
        for(int j = 0; j < train.RecordCnt; ++j)
        {
            if((wgtData.Output[i][j] < 0.5f && !train.GroundTruth[i][j])
                || (wgtData.Output[i][j] > 0.5f && train.GroundTruth[i][j]))
                ++right;
            ++total;
        }
    }
    for(int i = 0; i < train.SampleDim; ++i)
    {
        delete[] wgtData.Output[i];
        delete[] wgtData.WeightVectors[i];
    }
    delete[] wgtData.DatasetMapping;
    delete[] wgtData.Output;
    delete[] wgtData.WeightVectors;
    return (float)right/(float)total;
}

PtrGS TestClass::TestGpuGeneticSelector()
{
    // create a data array with 64 fields and 1024 records;
    // fields 1 and 3 are XORed to create the ground truth, the rest are dummies
    float data[64][1024];
    float gnd[1024];
    for(int i = 0; i < 64; ++i)
        for(int j = 0; j < 1024; ++j)
        {
            if(i == 1)
                data[i][j] = (float)(j & 1);
            else if(i == 3)
                data[i][j] = (float)((j & 2) >> 1);
            else
                data[i][j] = rand() & 1;
            gnd[j] = (j & 1) ^ ((j & 2) >> 1);
        }
    Bootstrap bs(1024, 10);
    Bootstrap::DataPtr pData = bs.CreateSamplingData((float*)data, (float *)gnd, 1024, 4, 64);
    CudaEvaluator::Factory evalFact(&BasicEvaluateNN);
    NeuralNetEvaluator::Ptr eval = evalFact.GetEvaluator();
    EvolutionaryTrainer::Factory trainFact(50, 100, *eval);
    GeneticSelector &gs = *(new GeneticSelector(50, 20, 5, *pData, trainFact, 0.1f));
    GsExecuteTime = -GetTickCount();
    gs.Execute();
    GsExecuteTime += GetTickCount();
    return PtrGS(&gs);
}

PtrGS TestClass::TestCpuGeneticSelector()
{
    // create a data array with 64 fields and 1024 records; bootstrap 10 samples;
    // fields 1 and 3 are XORed to create the ground truth, the rest are dummies
    float data[64][1024];
    float gnd[1024];
    for(int i = 0; i < 64; ++i)
        for(int j = 0; j < 1024; ++j)
        {
            if(i == 1)
                data[i][j] = (float)(j & 1);
            else if(i == 3)
                data[i][j] = (float)((j & 2) >> 1);
            else
                data[i][j] = rand() & 1;
            gnd[j] = (j & 1) ^ ((j & 2) >> 1);
        }
    Bootstrap bs(1024, 10);
    Bootstrap::DataPtr pData = bs.CreateSamplingData((float*)data, (float *)gnd, 1024, 4, 64);
    SseEvaluator::Factory evalFact;
    NeuralNetEvaluator::Ptr eval = evalFact.GetEvaluator();
    EvolutionaryTrainer::Factory trainFact(50, 100, *eval);
    GeneticSelector &gs = *(new GeneticSelector(50, 20, 5, *pData, trainFact, 0.1f));
    GsExecuteTime = -GetTickCount();
    gs.Execute();
    GsExecuteTime += GetTickCount();
    return PtrGS(&gs);
}
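TestGpuGeneticSelector and TestCpuGeneticSelector above differ only in which evaluator factory they construct; the GeneticSelector pipeline itself is back-end agnostic. A hypothetical template helper (a sketch, not code from the project; it assumes both factory types expose GetEvaluator() as the listings show, and that the timing variable fits in a long) would make the shared structure explicit:

// Sketch only: hoists the setup common to the GPU and CPU tests.
template <typename EvalFactory>
PtrGS runGeneticSelectorTest(EvalFactory &evalFact, long &elapsedMs,
                             float *data, float *gnd)
{
    Bootstrap bs(1024, 10);
    Bootstrap::DataPtr pData = bs.CreateSamplingData(data, gnd, 1024, 4, 64);
    NeuralNetEvaluator::Ptr eval = evalFact.GetEvaluator();
    EvolutionaryTrainer::Factory trainFact(50, 100, *eval);
    GeneticSelector &gs = *(new GeneticSelector(50, 20, 5, *pData, trainFact, 0.1f));
    elapsedMs = -(long)GetTickCount();   // time only the selector run
    gs.Execute();
    elapsedMs += (long)GetTickCount();
    return PtrGS(&gs);
}

Each existing test would then reduce to building its data arrays, constructing CudaEvaluator::Factory or SseEvaluator::Factory, and calling the helper.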
Testing.cpp

// Testing.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "TestClass.h"
#include "SseEvaluator.h"
#include "CudaEvaluator.h"
#include "EvolutionaryTrainer.h"
#include "CudaBasic.h"
#include "TrainingSet.h"
#include "TestingSet.h"
#include <iostream>
#include <cstdlib>

// seed rand() deterministically before main() runs
static int init() { srand(0); return 0; }
static int dummy = init();

using namespace PROJ_MarungoF::Lib;
using namespace PROJ_MarungoF::Testing;
using namespace std;

const int FUNC_CNT = 1;
const CudaEvaluator::Function FUNCS[FUNC_CNT] = { &BasicEvaluateNN };
const char FUNC_NAMES[FUNC_CNT][50] = { "BasicEvaluateNN" };

void setupWgtData(NeuralNetEvaluator::WeightData &, int);
void destroyWgtData(NeuralNetEvaluator::WeightData &);
void testGpuGs();
void testCpuGs();

NeuralNetEvaluator::Ptr trainEval;

int _tmain(int argc, _TCHAR* argv[])
{
    testGpuGs();
    cout