GPU-Accelerated Data Mining with Swarm Intelligence by Robin M. Weiss
Honors Thesis Department of Computer Science Macalester College
May 2010
Abstract

Swarm intelligence describes the ability of groups of social animals and insects to exhibit highly organized and complex problem-solving behaviors that allow the group as a whole to accomplish tasks which are beyond the capabilities of any one of the constituent individuals. This natural phenomenon is the inspiration for swarm intelligence systems, a class of algorithms that utilizes the emergent patterns of swarms to solve computational problems. Recently, there have been a number of publications regarding the application of swarm intelligence to various data mining problems, yet very few consider multi-threaded, let alone GPU-based, implementations. In this paper we adopt the General-Purpose GPU parallel computing model and show how it can be leveraged to increase the accuracy and efficiency of two types of swarm intelligence algorithms for data mining. To illustrate the efficacy of GPU computing for swarm intelligence, we present two swarm intelligence data mining algorithms implemented with CUDA for execution on a GPU device. These algorithms are: (1) AntMinerGPU, an ant colony optimization algorithm for rule-based classification, and (2) ClusterFlockGPU, a bird-flocking algorithm for data clustering. Our results indicate that the AntMinerGPU algorithm is markedly faster than the sequential algorithm on which it is based, and is able to produce classification rules which are competitive with those generated by traditional methods. Additionally, we show that ClusterFlockGPU is competitive with other swarm intelligence and traditional clustering methods, and is not affected by the dimensionality of the data being clustered, making it theoretically well-suited for high-dimensional problems.
Acknowledgements

I am extremely grateful for the past four years I have spent as an undergraduate student in the computer science department of Macalester College. The challenges I faced over the past years have allowed me to identify what truly interests me intellectually as I move forward in my academic career. I would like to begin by thanking my academic advisor, Professor Susan Fox, for her guidance along the way. Also, many thanks to Professor Elizabeth Shoop for her input and advice throughout the writing process of this paper. Next, I would like to thank my advisor from the University of Minnesota, Dr. David A. Yuen, whom I credit with introducing me to the world of high-performance and GPU-based computing. Dr. Yuen’s approach to advising offered a mixture of independence and direction which has truly allowed me to grow by leaps and bounds over the past year. My profound thanks go to Dr. Yuen for his guidance and for providing me the resources and contacts I needed to complete this work. Additionally, I would like to thank Professors Witold Dzwinel and Marcin Kurdziel of AGH University for their input during both the very first and very last phases of this project. Finally, I would like to thank Professor Michael Steinbach of the University of Minnesota for his critiques of my approach and for assisting me as I took my first steps into the world of data mining and machine learning.
“We are drowning in information but starved for knowledge.”
John Naisbitt
Contents

Abstract
Acknowledgements
1 Introduction
   1.1 Background and Problem Statement
   1.2 Organization of This Paper
2 Introduction To Swarm Intelligence
   2.1 Swarm Intelligence Overview
   2.2 Two Types of Swarm Intelligence Systems
      2.2.1 Ant Colony Optimization Algorithms
      2.2.2 Flocking Algorithms
3 Parallel and GPU Computing
   3.1 Parallel Computing Overview
   3.2 GPU Performance and Hardware Architecture
   3.3 CUDA for GPU Computing
   3.4 Swarm Intelligence and GPU Computing
4 Data Mining Tasks, Techniques, and Applications
   4.1 Classification
      4.1.1 Applications of Rule-Based Classification
      4.1.2 Traditional Methods for Classification
   4.2 Partitional Cluster Analysis
      4.2.1 Applications of Data Clustering Algorithms
      4.2.2 Traditional Partitional Data Clustering Techniques
5 The AntMinerGPU Algorithm
   5.1 Ant Colony Optimization
      5.1.1 Background and Technical Details
      5.1.2 The MAX-MIN Ant System
      5.1.3 Ant Colony Optimization and GPGPU
   5.2 The AntMinerGPU and AntMiner+ Algorithms
      5.2.1 The AntMiner+ Algorithm
         5.2.1.1 Transition Rule
         5.2.1.2 Heuristic Value
         5.2.1.3 Pheromone Updating
         5.2.1.4 Early Stopping
      5.2.2 The AntMinerGPU Algorithm
   5.3 Results and Analysis
      5.3.1 Data Sets
      5.3.2 Test System
      5.3.3 Performance Analysis
      5.3.4 Comparison to Traditional Methods
      5.3.5 Comparison to AntMiner+
   5.4 Next Steps
      5.4.1 Multi-Class Data
      5.4.2 Multiple Colonies
      5.4.3 Improved Convergence Detector
      5.4.4 Better Use of Shared Memory
      5.4.5 Very Large Data Sets
6 The ClusterFlockGPU Algorithm
   6.1 Swarm Intelligence Algorithms for Data Clustering
      6.1.1 Particle Swarm Optimization for Cluster Analysis
      6.1.2 Ant Colony Algorithms for Cluster Analysis
   6.2 Flock Algorithms
   6.3 A Flock Algorithm for Data Cluster Analysis
      6.3.1 Similarity Metrics
   6.4 The ClusterFlockGPU Algorithm
   6.5 Results and Analysis
      6.5.1 Data Sets
      6.5.2 Test System
      6.5.3 Performance Analysis
      6.5.4 Comparison to Other Methods
   6.6 Next Steps
      6.6.1 Improved Neighbor Detection
      6.6.2 Improved Cluster Extraction
7 Conclusion
   7.1 Summary
   7.2 Future Research
   7.3 Final Remarks
A CUDA Technical Details and Example
Bibliography
Chapter 1

Introduction

1.1 Background and Problem Statement
Swarm intelligence describes the phenomenon where highly organized and often beneficial global behaviors emerge from the individual actions of a group of decentralized and self-organizing agents. In the past decade, the field of swarm intelligence has become a hot topic in the areas of computer science, collective intelligence, and robotics. To date, swarm intelligence algorithms have been shown to be able to tackle a wide range of hard optimization problems (including the Traveling Salesman, Quadratic Assignment, and Network Routing Problems), and there are myriad documented applications of swarm intelligence to computational problems of all sorts [9, 11, 10]. Swarm intelligence algorithms for data mining have been shown to be competitive with traditional techniques [47, 33, 28, 26, 22, 21, 7], and some even provide useful features not found in other methods [13]. However, we find that most of these swarm intelligence data mining algorithms are relatively slow in comparison to traditional ones.
Swarm intelligence systems are characterized by groups of independent agents collectively working to solve some problem and, as such, they contain a large amount of implicit parallelism. Interestingly, a survey of the literature reveals that few authors consider parallel, let alone GPU-based, implementations to boost the performance of their algorithms. In this paper, we adopt the so-called General-Purpose GPU (GPGPU) computing model and show how it can be applied to swarm intelligence algorithms for data mining to achieve better results and considerable speedups as compared to CPU-based implementations. To this end, we present two GPU-based swarm intelligence data mining algorithms: an ant colony optimization algorithm for rule-based classification (AntMinerGPU), and a flock algorithm for partitional cluster analysis (ClusterFlockGPU). Our results indicate that the AntMinerGPU algorithm produces more accurate classifications than traditional methods on some data sets, and may offer a speedup of 20-50x over the CPU-based AntMiner+ algorithm on which the AntMinerGPU algorithm is based. We also show that the ClusterFlockGPU algorithm is competitive with other swarm intelligence clustering algorithms and the traditional k-means algorithm, and that this algorithm’s running time is unaffected by the dimensionality of the data being clustered. While the focus of this paper is on swarm intelligence algorithms for data mining, we also hope to underscore the fact that swarm intelligence algorithms of all sorts are extremely well-suited for the GPGPU computing model.
1.2 Organization of This Paper
The remainder of this paper is organized as follows. Chapter 2 explores swarm intelligence systems in general, giving a high-level overview of their operations and characteristics. Chapter 3 gives an overview of the current state of parallel computing, providing background on GPGPU computing with NVIDIA’s CUDA toolkit and discussing how this parallel computing model can be utilized to improve the performance of swarm intelligence algorithms. Chapter 4 provides background on the two data mining problems our algorithms are intended to solve: classification and clustering. Chapter 5 gives a detailed description of the generic ant colony optimization technique and presents our AntMinerGPU algorithm for rule-based classification. Chapter 6 discusses flock algorithms and presents our ClusterFlockGPU algorithm for partitional data clustering. Finally, Chapter 7 concludes this paper with a summary of concepts presented and possible directions of future research.
Chapter 2

Introduction To Swarm Intelligence

In this chapter we give a brief overview of the history of swarm intelligence systems and describe two classes of swarm intelligence algorithms: ant colony-based and flock-based. Section 2.1 offers an overview of swarm intelligence, explaining the benefits of the approach and the general features of this problem-solving technique. Section 2.2 describes two types of swarm intelligence: that found in colonies of ants, and that found in flocks of birds. These two types of swarm intelligence will be revisited when we present our AntMinerGPU and ClusterFlockGPU algorithms for data mining in Chapters 5 and 6, respectively.
2.1 Swarm Intelligence Overview
Swarm intelligence describes the ability of groups of decentralized and self-organizing agents to exhibit highly organized behaviors. These global behaviors of swarms often allow the swarm as a whole to accomplish tasks which are beyond the capabilities of any one of the constituent individuals. Following the publication of two works, “Swarm Intelligence” [17] and “Swarm Intelligence: From natural to artificial systems” [4], the area of swarm
intelligence became a hot topic in the fields of computer science, collective intelligence, and robotics. Today, the number of successful applications of swarm intelligence continues to grow. The term “swarm intelligence” was coined in 1989 by Gerardo Beni and Jing Wang in the context of cellular robotic systems [3]. Beni and Wang proposed that groups of simple robotic agents could be programmed to collaboratively solve difficult tasks and described how collectively intelligent behaviors can be exhibited in systems of non-intelligent robots. The authors showed that an artificial system so organized would present three distinct advantages:

1. Agents need not be complex since they will work together to solve problems
2. The overall system is very reliable due to high levels of redundancy
3. Problems are solved in parallel since each agent handles a portion of the problem

These three characteristics are present in nearly all swarm intelligence systems and are what makes swarm intelligence approaches to problem-solving attractive for certain problems. In addition to these general characteristics, nearly all swarm intelligence systems exhibit the following features:

Homogeneity: Every member of the swarm follows the same rules and decision-making processes.

Locality: Actions and decisions of individuals are made on the basis of purely local information and of what agents learn via (direct or indirect) communication with others.

Randomness: Swarm members introduce randomness into their decision-making processes in order to explore new solutions.
Positive Feedback: As with Darwinian evolution, “good” solutions that emerge from the actions of swarm agents are identified as having good “fitness” and are reinforced over time.

To further illustrate the characteristics of swarm intelligence, we provide an overview of two classes of swarm intelligence systems in the following section. For a more complete exploration of the patterns of swarm intelligence systems and the theoretical underpinnings thereof, the reader is referred to [17] and [4].
2.2 Two Types of Swarm Intelligence Systems
Nearly all swarm intelligence systems take their inspiration from the behaviors exhibited by social animals or insects in nature. The wide range of behavioral patterns found in nature (bird flocking, ant colonies, fish schooling, herding, etc.) has allowed for many types of swarm intelligence algorithms to be proposed, each well-suited for a different type of target problem. For example, the ant colony and flock algorithms presented in Chapters 5 and 6, respectively, differ in terms of how data is represented in the virtual environment and how the actions of swarm agents are interpreted. We give an overview of these contrasting paradigms in swarm intelligence in the following two subsections.
2.2.1 Ant Colony Optimization Algorithms
Perhaps the most thoroughly explored swarm intelligence algorithms are ant colony optimization (ACO) methods for discrete optimization. The basic operating principles of these algorithms are given in Figure 2.1.
Figure 2.1: The simple scheme of an ACO swarm.

In ACO, swarm agents make explorations of the solution space of a target problem in an attempt to locate the optimal solution. A solution is defined by the path traveled by an ant agent through solution space. After each iteration, agents evaluate their solution with respect to an objective fitness function. If an agent finds a good solution, it will communicate with others in the swarm to alert them to a potentially rich area of the solution space. By communicating the relative success or failure to find a good solution to others, agents in subsequent iterations are more or less inclined to explore a similar region of the solution space. Over time and with the introduction of random variations in the movements of agents, this repeated exploration of solution space will often lead to a convergence of the swarm as a whole to the optimal solution. As an example of how this approach can be applied to a discrete optimization problem, we describe here the general approach of an ACO method for solving the Traveling Salesman Problem (TSP). This problem, as implied by the name, can be conceived of as the problem faced by a salesman who, departing from his home town, wishes to find the shortest possible
route through a number of customer towns, visiting each exactly once and then returning home. Formally, the task presented by the TSP is to find a minimum-cost Hamiltonian cycle through a fully-connected weighted graph. To solve this problem with ACO, a population of ants is generated in the problem graph with each ant starting at a random city (or node). Then, all of the ants incrementally add cities to their tours until all cities have been visited exactly once. When all of the ants have generated a candidate tour, the ant with the shortest tour is allowed to deposit pheromone on the path it took. This increased level of pheromone will cause ants in subsequent generations to follow a similar path, but due to the probabilistic nature of the ants’ path-finding process, random variations on this “good” path will be generated (some of which will be better, and some worse). Repeating this process usually leads to the optimal path being found. Figure 2.2 shows how this process proceeds.
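To make the scheme just described concrete, here is a minimal ACO sketch for a toy TSP instance, written in plain C (the algorithms in this paper are CUDA-based; the colony size, evaporation rate, and best-ant-only pheromone deposit below are illustrative choices of our own, not parameters used in this work):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N     5          /* cities in the toy instance */
#define ANTS  16         /* ants per iteration */
#define ITERS 200
#define EVAP  0.5        /* pheromone evaporation rate (illustrative) */

double dist[N][N];       /* edge weights, filled in by the caller */
static double pher[N][N];

/* Probabilistic transition rule: choose an unvisited city with
 * probability proportional to pheromone / distance. */
static int next_city(int cur, const int *visited)
{
    double w[N], total = 0.0;
    for (int j = 0; j < N; j++) {
        w[j] = visited[j] ? 0.0 : pher[cur][j] / dist[cur][j];
        total += w[j];
    }
    double r = ((double)rand() / RAND_MAX) * total;
    for (int j = 0; j < N; j++) {
        if (w[j] == 0.0) continue;
        r -= w[j];
        if (r <= 0.0) return j;
    }
    for (int j = N - 1; j >= 0; j--)   /* guard against rounding */
        if (!visited[j]) return j;
    return cur;
}

/* One ant builds a complete tour from a random start city. */
static double build_tour(int *tour)
{
    int visited[N] = {0};
    tour[0] = rand() % N;
    visited[tour[0]] = 1;
    double len = 0.0;
    for (int i = 1; i < N; i++) {
        tour[i] = next_city(tour[i - 1], visited);
        visited[tour[i]] = 1;
        len += dist[tour[i - 1]][tour[i]];
    }
    return len + dist[tour[N - 1]][tour[0]];   /* return home */
}

/* Run the colony and return the length of the best tour found. */
double aco_tsp(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            pher[i][j] = 1.0;                  /* uniform initial pheromone */

    int tour[N], best_tour[N];
    double best = 1e18;
    for (int it = 0; it < ITERS; it++) {
        for (int a = 0; a < ANTS; a++) {       /* explorations (parallel on a GPU) */
            double len = build_tour(tour);
            if (len < best) {
                best = len;
                memcpy(best_tour, tour, sizeof tour);
            }
        }
        for (int i = 0; i < N; i++)            /* evaporation */
            for (int j = 0; j < N; j++)
                pher[i][j] *= 1.0 - EVAP;
        for (int i = 0; i < N; i++) {          /* best ant deposits pheromone */
            int u = best_tour[i], v = best_tour[(i + 1) % N];
            pher[u][v] += 1.0 / best;
            pher[v][u] += 1.0 / best;
        }
    }
    return best;
}
```

On five cities placed on a line at positions 0 through 4, the optimal cycle 0-1-2-3-4-0 has length 8, and the colony recovers it after a modest number of iterations.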
Figure 2.2: The progression of pheromone trails laid by ants while solving the Traveling Salesman Problem. Pheromone levels are indicated by edge thickness.
Figure 2.2(a) shows the initial construction graph with no pheromone on edges. After the first group of ants explore the graph, the best path is reinforced with pheromone as shown in Figure 2.2(b). This increased level of pheromone causes ants in the next iteration to explore a similar path. Figure 2.2(c) shows that a better path has been found and reinforced with pheromone. This path is then further reinforced in Figure 2.2(d), and finally taken as the optimal solution in Figure 2.2(e). In Chapter 5, we show that the task of rule-based classification can be translated into a problem of finding “good” paths through a graph. With this conception of the classification problem, we show how the ability of ant colonies to find good paths through an environment can be used to generate classification rules.
2.2.2 Flocking Algorithms
The basis for nearly all flock algorithms is derived from Craig Reynolds’s seminal work “Flocks, Herds, and Schools: A Distributed Behavioral Model” [36]. Here, Reynolds proposed that lifelike computer simulations of bird flocks could be achieved by having each member of the flock adhere to a certain set of “flocking rules.” These rules, computed independently by each flock member, keep the virtual birds in a cohesive flock (flock centering), keep them moving in the same direction (velocity matching), and prevent collisions with one another or objects in the environment (collision avoidance). Every bird calculates vectors which satisfy these rules and adds them to its current velocity to attain its velocity for the subsequent time step. The general flow of this basic flocking algorithm is given in Figure 2.3. To be clear, swarm intelligence models based on flocking behaviors interpret the actions
of agents in a different manner than in ACO-style algorithms. With flocking algorithms, swarm agents do not exist in the solution space of the target problem, but instead in their own virtual environment, and follow well-defined rules that govern their movements. In these models, agents themselves often represent data objects, and solutions to the target problem are defined by patterns in the agents’ arrangements which emerge as they interact with one another and the environment. One proposed application of this type of swarm intelligence algorithm is the navigation of groups of unmanned aerial vehicles (UAVs). The main idea is that a group of UAVs could use a flocking algorithm to coordinate their movements and maintain cohesion while avoiding collisions as they fly to their destination. We will revisit flocking algorithms in more detail in Chapter 6 when we explore a flocking algorithm that can be used to perform partitional data cluster analysis.

Figure 2.3: The simple scheme of a basic flocking algorithm: generate flock agents; apply the flock centering, velocity matching, and collision avoidance rules each time step until a termination condition is met; then interpret the positions of the agents.
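The three flocking rules described above can be sketched in a few lines of sequential, CPU-side C; the rule weights and avoidance radius below are illustrative values of our own, not Reynolds’s constants nor those used by ClusterFlockGPU:

```c
#include <assert.h>

typedef struct { double x, y, vx, vy; } Boid;

/* Illustrative rule weights (hypothetical values for this sketch). */
static const double W_CENTER = 0.01;     /* flock centering     */
static const double W_MATCH  = 0.05;     /* velocity matching   */
static const double W_AVOID  = 0.10;     /* collision avoidance */
static const double AVOID_R2 = 4.0;      /* avoidance radius, squared */

/* Advance boid i one time step: steer toward the centroid of the other
 * boids, toward their mean velocity, and away from any boid closer than
 * the avoidance radius.  A full implementation would double-buffer the
 * flock so every boid reads the same "old" state; on a GPU each boid's
 * update would run in its own thread. */
void step_boid(Boid *flock, int n, int i)
{
    double cx = 0, cy = 0, mvx = 0, mvy = 0, ax = 0, ay = 0;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        cx  += flock[j].x;   cy  += flock[j].y;
        mvx += flock[j].vx;  mvy += flock[j].vy;
        double dx = flock[i].x - flock[j].x;
        double dy = flock[i].y - flock[j].y;
        if (dx * dx + dy * dy < AVOID_R2) { ax += dx; ay += dy; }
    }
    cx /= n - 1;  cy /= n - 1;  mvx /= n - 1;  mvy /= n - 1;
    flock[i].vx += W_CENTER * (cx - flock[i].x)
                 + W_MATCH  * (mvx - flock[i].vx) + W_AVOID * ax;
    flock[i].vy += W_CENTER * (cy - flock[i].y)
                 + W_MATCH  * (mvy - flock[i].vy) + W_AVOID * ay;
    flock[i].x += flock[i].vx;
    flock[i].y += flock[i].vy;
}
```

Flock centering draws a boid toward distant flockmates, while collision avoidance pushes it away from any that come too close; the balance of the two weights determines the flock's equilibrium spacing.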
Chapter 3

Parallel and GPU Computing

The two algorithms we present in this paper, AntMinerGPU and ClusterFlockGPU, rely on the parallel architecture of GPU devices to achieve their high levels of performance. As such, we feel it is important to provide background on CPU- and GPU-based parallel computing in order to highlight why we adopt a GPU-based approach here. This chapter will give an overview of the recent advances in parallel computing devices and will introduce NVIDIA’s CUDA programming model for General-Purpose GPU (GPGPU) computing. Following this introduction, we will highlight the features of swarm intelligence algorithms that make them well-suited for GPGPU computing and show how this computing model can be used to increase overall performance. Much of the information presented in this chapter is a recapitulation of that found in the introductory sections of David Kirk and Wen-mei Hwu’s book “Programming Massively Parallel Processors: A Hands-On Approach” [19]. The reader is referred to this work for more information on the concepts presented here.
3.1 Parallel Computing Overview
For the past few decades, each new generation of conventional single-core microprocessors has steadily brought about performance increases for computer applications. These steady increases in CPU and memory speeds have allowed computer software developers to consistently create applications with better functionality, user interfaces, and overall performance. Today, computer systems built atop a single-core CPU are able to achieve performance levels on the order of a billion (giga) floating point operations per second (GFLOPS). During the era of steady improvement of CPU and memory devices, application developers relied heavily on the improvements of computing hardware to increase the speed of their applications and algorithms. Simply put, the same application run on a computer with a faster CPU and faster memory will (most likely) run faster. However, since approximately 2003, the steady increases in CPU clock speeds and transistor counts have slowed. This is due in large part to engineers’ inability to overcome the heat dissipation and power consumption issues presented by such high clock rates and transistor densities. The steady increase in computing performance predicted by Moore’s law has essentially come to an end. In the past 5 years, the market has seen nearly all microprocessor manufacturers adopt a design model where multiple processing units, or cores, are used in each chip to increase overall processing power. As manufacturers began adopting parallel architectures in their chips, two major design paradigms emerged: multi-core and many-core. Multi-core processors are characterized by a processing unit containing a relatively small number of “heavy-weight” cores. For example, the Intel Core i7 CPU contains four cores, each a fully functional processor with a wide range of capabilities and a very large instruction set. Generally, multi-core
processors are designed to maximize the performance of sequential code, but also allow for true hardware-level parallelism (albeit on a relatively small scale). On the other hand, the many-core design paradigm is focused on the optimized execution of parallel code. Many-core processors are characterized by a processing unit comprised of a large number of “light-weight” cores. For example, the NVIDIA GeForce GTX 285 GPU contains a total of 240 light-weight processing cores, each with a very limited instruction set and optimized for basic floating point operations. As shown in Figure 3.1, many-core processors, and specifically GPU devices, lead the way in peak computing performance. It should be noted that the performance levels shown in Figure 3.1 are very rarely achieved by real-world applications, but are speeds theoretically attainable by such devices. An explanation of this large divergence in peak performance is offered in Section 3.2, where the hardware architecture of modern GPU devices is discussed.
Figure 3.1: Comparison of GPU and CPU peak GFlop performance [29]
With such large amounts of raw computational power theoretically attainable with GPU devices, a growing number of computer scientists have begun porting their algorithms to GPU-based computing systems. In the past 3 years, NVIDIA, one of the largest manufacturers of GPU devices in the market, has been very vocal in promoting the GPU-based approach to parallel computing and maintains a repository of examples on its website [30]. To date, the CUDA Community Showcase contains over 1000 examples of GPU-based algorithms for a wide range of computing tasks, each reporting considerable speedups (many on the order of 100x) over sequential implementations [30]. With so many FLOPS achievable on GPU devices, the popularity of GPGPU computing continues to increase. Recently, Oak Ridge National Laboratory announced it will be constructing a new GPU-based supercomputer which is expected to be 10 times more powerful than today’s fastest supercomputer, and manufacturers such as Cray have begun including GPU devices in their next-generation systems.
3.2 GPU Performance and Hardware Architecture
The exponentially increasing performance levels of GPU devices have largely been driven by the video game industry. Interactive 3D video games demand a very high level of data throughput and an absolutely staggering number of floating-point operations per second. Consider that an SXGA (1280x1024) display contains a total of ∼1.3 million individual pixels. With a commonly desired frame rate of 30 fps, there is a worst-case scenario of having to compute >39 million pixel values every second, with each value requiring multiple floating point and memory read/write operations. With such high demands for throughput, programmers in the computer graphics community adopted thread-level parallelism as the
dominant paradigm for producing satisfactory results, and GPU manufacturers followed suit by building hardware that could realize the performance benefits of this paradigm. A very high-level overview of CPU and GPU architecture is given in Figure 3.2.
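The arithmetic behind these throughput figures is easy to verify (the function names here are ours, introduced only for illustration):

```c
#include <assert.h>

/* Pixels per frame for an SXGA (1280x1024) display, and the resulting
 * per-second pixel workload at a given frame rate. */
long sxga_pixels_per_frame(void) { return 1280L * 1024L; }

long sxga_pixels_per_second(int fps) { return sxga_pixels_per_frame() * (long)fps; }
```

One SXGA frame holds 1,310,720 pixels (∼1.3 million); at 30 fps that is 39,321,600 pixel values per second (>39 million).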
Figure 3.2: A high-level comparison of CPU and GPU hardware architecture. We show the stream processors (SPs) and streaming multiprocessors (SMs) that comprise the GPU architecture. Figure adapted from [29].
As shown in the above figure, NVIDIA’s GPU devices are organized as an array of streaming multiprocessors (SMs). These SMs are each composed of a number of stream processors (SPs) and are capable of simultaneously executing many hundreds of threads. Because the SPs are grouped into SMs, the threads executing on the SPs of a single SM are able to cooperate and share instruction cache and control logic, as well as a relatively small amount of on-chip, low-latency memory. The DRAM on GPU devices provides up to 4GB of storage capacity with very high bandwidth to the SMs (102 GB/sec for the DRAM of the Tesla C1060). It should be
noted, however, that while this DRAM has comparatively higher bandwidth than system DRAM, its latency is also greater. As such, GPU-based applications always attempt to minimize memory operations that access DRAM, instead taking care to intelligently store frequently accessed data in the on-chip “shared memory” of each SM. Achieving high levels of floating-point performance on a GPU device is possible only when an algorithm employs a massive number of concurrent threads. This way, if a thread stalls due to memory latency, another thread can be scheduled for immediate execution on that ALU while the first thread waits for its memory operation to complete. In this fashion, the very large amount of chip space dedicated to floating-point calculations is kept saturated with concurrently executing threads, and the overall effect is a very high throughput of floating-point operations. To be clear, GPU devices are designed to be floating-point calculating machines and are thus not ideal for certain types of computing tasks. Complex control and branch structures are better suited for the complex control logic and large instruction sets found in CPU devices. Also, due to the relatively high latency of GPU DRAM as compared to that of CPU memory, applications that require large amounts of memory reads and writes are also better suited for CPU devices. Realizing this, the CUDA programming model from NVIDIA is specifically designed to allow for a mixture of GPU and CPU code execution. It should be noted that swarm intelligence algorithms are characterized by the repeated application of the same simple decision-making process by multiple independent agents and thus do not require many complex control structures. Because of this feature, the limited instruction sets of GPU devices will theoretically not impact the performance of swarm intelligence algorithms. As detailed in Chapter 2, swarm intelligence systems are characterized by groups of
independent agents working asynchronously and collectively to solve a problem. It can easily be envisioned how an asynchronous system such as a swarm can be quite accurately modeled with a GPU-based algorithm by allocating one thread per swarm agent. With this implementation strategy, each agent would be allocated its own private “brain” and virtual swarms would more closely resemble the asynchronous natural swarms on which they are based. This concept is further explored in Section 3.4 as we examine some specific ways in which the GPU computing model can be leveraged to create high-performance swarm intelligence algorithms.
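The one-thread-per-agent mapping can be illustrated with a small sequential C stand-in (the agent structure and xorshift "brain" are hypothetical, introduced for illustration only); in a CUDA version the loop body would become the kernel, with the loop index replaced by a thread index so that all agents update concurrently:

```c
#include <assert.h>

typedef struct { float state; unsigned rng; } Agent;

/* Each agent carries a private xorshift RNG -- its own "brain" -- so its
 * decisions need no shared state and all updates are independent. */
static unsigned next_rand(unsigned *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Sequential stand-in for one swarm step.  In CUDA this loop body is the
 * kernel, with i = blockIdx.x * blockDim.x + threadIdx.x. */
void step_swarm(Agent *swarm, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned r = next_rand(&swarm[i].rng);
        swarm[i].state += (r & 1u) ? 1.0f : -1.0f;   /* toy random decision */
    }
}
```

Because no agent reads another agent's state during the step, the iterations are trivially independent, which is exactly the property that lets a GPU run one agent per thread.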
3.3 CUDA for GPU Computing
The idea of using GPU hardware for general-purpose computing dates back nearly two decades. In the early days, utilizing GPU devices for the execution of non-graphics-related algorithms was a very difficult task. Essentially, programmers were forced to use graphics application programming interfaces (APIs) such as OpenGL or Direct3D to gain access to the GPU chip. The main strategy for doing GPGPU computing was to find clever ways to fit some target algorithm into computer graphics abstractions compatible with the graphics API being used. For example, large matrices were translated into images; by casting a general mathematical problem into computer graphics abstractions, data could be loaded into GPU texture memory and standard computer graphics functions (mostly matrix-matrix operations) could be applied. The process of doing GPGPU computing changed dramatically with the release of NVIDIA’s CUDA toolkit in 2007. In the early 2000s, NVIDIA realized that the interest in using GPU
devices for general purpose computing was growing and wished to capitalize on this emerging market. The goal was to provide a specially extended version of some high-level programming language that would allow programmers to gain direct access to GPU devices without needing in-depth knowledge of computer graphics algorithms and techniques. The result was the CUDA toolkit, a software development kit (SDK) that allows programmers to write parallel code in a specially extended version of the C programming language for parallel execution on most modern NVIDIA GPUs. This SDK allows for a much more generic parallel programming model than was possible in earlier generation GPUs and allows programmers to use common parallel programming abstractions such as parallel threads, barrier synchronization, and atomic operations in GPU-based code. This paper assumes a familiarity with the CUDA programming model. More information regarding the technical details of CUDA as well as a simple example CUDA program can be found in Appendix A. In the next section, we describe how the parallel programming abstractions offered by CUDA can be applied to swarm intelligence algorithms. The concepts presented in Section 3.4 will be revisited in greater detail and specificity when we describe our AntMinerGPU and ClusterFlockGPU algorithms in Chapters 5 and 6, respectively.
3.4
Swarm Intelligence and GPU Computing
This project proposes that swarm intelligence systems are well suited for implementation on GPGPU devices. We now consider the main characteristics of the swarm intelligence algorithms outlined in Chapter 2 with respect to the GPU architecture and programming model outlined in the previous sections.
Ant colony optimization algorithms rely on a large number of explorations of the solution space, and on the law of large numbers, to converge the swarm to the optimal solution. Thus, if more explorations of the solution space can be carried out at the same time (in turn requiring larger swarm populations), more candidate solutions can be generated and there is a greater chance that the optimal solution will be found, and in less time. In sequential implementations, the population of a swarm generally has a direct, and often very large, impact on overall running time, which makes extremely large populations of swarm agents infeasible. We propose that the massive multi-threading capabilities of GPU devices can be used to support much larger populations of swarm agents, and in turn a much larger number of solution space explorations, without severely degrading running time. This hypothesis is supported by [2], in which the authors present a GPU implementation of the MAX-MIN Ant System and report a speedup of over 2x compared to a CPU-based implementation. To illustrate, we consider a very generalized ACO algorithm and show how it might be parallelized for GPU implementation. Listing 3.2 shows a pseudocode overview of the generalized ACO algorithm.
generate N ant agents
until (no improvement in solution quality for X iterations):
    for 1 to N:
        generate a candidate solution;
        evaluate solution quality;
        deposit amount of pheromone proportional to quality on solution;
    done
done
extract final solution

Listing 3.2: Generalized sequential ACO algorithm
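The “generate a candidate solution” step in Listing 3.2 is typically a pheromone-biased random walk: at each node, an ant chooses the next edge with probability proportional to that edge’s pheromone level (weighted by a heuristic value). A minimal Python sketch of this roulette-wheel selection follows; the function name and the `alpha`/`beta` weighting parameters are illustrative assumptions, not taken from a specific ACO variant in this paper.

```python
import random

def choose_edge(pheromone, heuristic, alpha=1.0, beta=1.0, rng=random):
    # Roulette-wheel selection: edge i is chosen with probability
    # proportional to pheromone[i]**alpha * heuristic[i]**beta.
    weights = [(p ** alpha) * (h ** beta) for p, h in zip(pheromone, heuristic)]
    total = sum(weights)
    r = rng.uniform(0.0, total)
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if r <= running:
            return i
    return len(weights) - 1  # guard against floating-point round-off
```

Each ant repeats this choice until a complete candidate solution is built, which is why the step parallelizes so naturally: the walks of different ants are independent.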
There is clearly a large amount of implicit parallelism in this algorithm since, by definition, all N ant agents must complete the same tasks each iteration. Because of this, we can easily fit the algorithm into the SPMD execution pattern of CUDA. A straightforward way to parallelize it for execution on a GPU device is to allocate one thread for each of the N ant agents, which allows us to essentially “unroll” the for-loop. In such an implementation, each ant agent generates a candidate solution, evaluates it, and deposits pheromone concurrently. This model leads to a truly asynchronous system in which each ant agent has its own thread of control, closely resembling real-world ant colonies. With this implementation strategy, the above pseudocode would be as shown in Listing 3.3. This is the implementation strategy we adopt for our AntMinerGPU algorithm, and in Chapter 5 we show the performance benefits of this type of approach.
generate N ant agents
until (no improvement in solution quality for X iterations):
    each of N concurrent threads:
        generates a candidate solution;
        evaluates solution quality;
        deposits amount of pheromone proportional to quality on solution;
    done
done
extract final solution

Listing 3.3: Generalized parallel ACO algorithm with N = number of ant threads

The benefit of the GPU computing model for flocking algorithms is equally large. To illustrate, consider the very generalized flocking algorithm given in Listing 3.4.
generate N flock members
do until done:
    for each i in N:
        calculate bird i's flock centering force
        calculate bird i's velocity matching force
        calculate bird i's collision avoidance force
        update bird i's velocity
        update bird i's position
    done
done

Listing 3.4: Generalized sequential flocking algorithm

As with the ACO algorithm given in Listing 3.2, we can use the multi-threading of a GPU device to “unroll” this algorithm’s for-loop. In this way, each flock member will update its velocity and position simultaneously. By applying the same strategy that was used for the parallel ACO algorithm, this flocking algorithm would become:
generate N flock members
do until done:
    each i in N concurrent threads:
        calculate bird i's flock centering force
        calculate bird i's velocity matching force
        calculate bird i's collision avoidance force
        update bird i's velocity
        update bird i's position
    done
done

Listing 3.5: Generalized parallel flocking algorithm

This implementation strategy is adopted for our ClusterFlockGPU algorithm presented in Chapter 6.

In addition to each agent requiring its own thread of control, there are other characteristics of swarm intelligence algorithms that also make them well suited to the GPU architecture. As noted in Section 3.2, memory latency is often the largest bottleneck in GPU-based algorithms. This can be combated by utilizing the smaller but faster on-chip shared memory. Since all agents operate solely on local data, these relatively small chunks of information can be loaded into low-latency shared memory, accelerating the implementation and mitigating the impact of the long-latency DRAM memory operations. Additionally, during the solution evaluation phase of ACO algorithms, blocks of threads can be linked together to cooperatively evaluate the generated solutions and in turn accelerate this portion of the algorithm as well.
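The three per-bird forces named in Listings 3.4 and 3.5 can be sketched in sequential Python as follows. This is an illustrative 2D sketch (the function name and `r_avoid` radius parameter are assumptions); on the GPU, one thread would compute these forces for one bird.

```python
def flock_forces(i, positions, velocities, r_avoid=1.0):
    # Per-bird force computation (2D). On the GPU this function body
    # would be executed by the thread assigned to bird i.
    others = [j for j in range(len(positions)) if j != i]
    # Flock centering: steer toward the centroid of the other birds.
    cx = sum(positions[j][0] for j in others) / len(others)
    cy = sum(positions[j][1] for j in others) / len(others)
    centering = (cx - positions[i][0], cy - positions[i][1])
    # Velocity matching: steer toward the mean velocity of the other birds.
    mvx = sum(velocities[j][0] for j in others) / len(others)
    mvy = sum(velocities[j][1] for j in others) / len(others)
    matching = (mvx - velocities[i][0], mvy - velocities[i][1])
    # Collision avoidance: steer away from birds closer than r_avoid.
    ax = ay = 0.0
    for j in others:
        dx = positions[i][0] - positions[j][0]
        dy = positions[i][1] - positions[j][1]
        if (dx * dx + dy * dy) ** 0.5 < r_avoid:
            ax += dx
            ay += dy
    return centering, matching, (ax, ay)
```

Note that each force depends only on data read from the other birds, never written, so all N birds can compute their forces concurrently before any position or velocity is updated.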
Chapter 4 Data Mining Tasks, Techniques, and Applications This chapter provides an overview of the two data mining problems that our AntMinerGPU and ClusterFlockGPU algorithms are designed to solve. We begin by describing the task of classification in Section 4.1 and then move to data cluster analysis in Section 4.2. These overviews are offered here to provide context for the algorithms that we present in Chapters 5 and 6.
4.1
Classification
The goal of classification is to accurately assign a “class” to a given data point in an automated way, according to some number of the data point’s characteristics. That is, by examining the features of a given data point, we predict the pre-defined class to which the data point actually belongs. While there are many types of classification algorithms (as will be discussed in Section 4.1.2), we focus on so-called “rule-based” classifiers. The goal of rule-based classification is to produce a set of classification rules that correctly specify the class to which a data point belongs based on its characteristics or features. Our AntMinerGPU algorithm is such a rule-based classifier, and the process it uses to generate classification rules will be examined in detail in Chapter 5. To illustrate how classification rules work, we show in Figure 4.1 a very simple data set and an associated classification rule. By plugging the values of V1 and V2 from a given data point into the classification rule, the class of the data point can be accurately predicted.
V1  V2  V3  Class
 1   0   2    1
 1   4   2    0
 1   3   0    1
 1   2   1    1
 0   1   0    0

if (V1 == 1) && (V2 >= 0) && (V2 <= ...

Figure 4.1: A simple data set and an associated classification rule
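Applying a rule of this form to a data point is simply a matter of checking each conjunct. The following Python sketch shows one possible encoding; the `(feature, operator, value)` representation, the function names, and the threshold values used in the test are hypothetical and chosen only to mirror the shape of the rule in Figure 4.1.

```python
import operator

# Hypothetical encoding of a classification rule: a conjunction of
# (feature, comparison, value) terms, as in Figure 4.1.
OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}

def rule_matches(rule, point):
    # A data point satisfies the rule only if every term holds.
    return all(OPS[op](point[feature], value) for feature, op, value in rule)

def predict(rule, point, class_if_match=1, class_otherwise=0):
    return class_if_match if rule_matches(rule, point) else class_otherwise
```

A rule set produced by a rule-based classifier is a list of such rules tried in order, with the first matching rule determining the predicted class.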
To prevent this case, we simply remove edges from the construction graph that would allow for this type of path.

The basic operation of the AntMiner+ algorithm is as follows. At the start of every iteration, ants begin at the start node and walk through the construction graph until they reach the end node. As an ant walks, it records the nodes it has visited, each representing a logical term that is added to the ant’s candidate classification rule. Once all ants in a generation reach the end node, the rule described by each ant’s tour is evaluated, and the path representing the best rule is reinforced with pheromone. In the next iteration, this pheromone trail encourages other ants to explore a similar region of the construction graph. When all ants have converged to a given path (classification rule), the rule is extracted, and the data points in the training set that are covered by the rule are removed. The process then repeats until a specified percentage of the training data points have been covered or early stopping occurs (the early stopping criteria are described in Section 5.2.1.4). Listing 5.1 presents the pseudocode for the AntMiner+ algorithm as given in [24], and Figure 5.3 gives an example classification rule.
generate construction graph
do until (min. percentage of training data remains OR early stopping)
    initialize heuristic values, pheromones, and edge probabilities
    while (not converged)
        create ants
        let ants walk from start node to end node
        evaporate pheromone from edges
        reinforce path of iteration-best ant
        clamp pheromone levels to [Tmin, Tmax] as specified by the MMAS
        kill ants
        update edge probabilities
    end
    extract rule
    remove data points from training data covered by the extracted rule
end

Listing 5.1: Pseudocode for the AntMiner+ algorithm
Figure 5.3: An example classification rule built from construction-graph terms on V1 and V2 (e.g. V1 = …, V2 >= …, V2 <= …)
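The pheromone maintenance steps of Listing 5.1 (evaporate, reinforce the iteration-best path, clamp to the MMAS bounds) can be sketched in sequential Python as follows. The function name, the `rho` evaporation rate, and the default `[t_min, t_max]` bounds are illustrative assumptions, not the parameter values used by AntMinerGPU.

```python
def mmas_update(pheromone, best_path, quality, rho=0.1, t_min=0.1, t_max=10.0):
    # Evaporate pheromone from every edge, reinforce the iteration-best
    # ant's path in proportion to its rule quality, then clamp each level
    # to [t_min, t_max] as prescribed by the MAX-MIN Ant System.
    for edge in pheromone:
        pheromone[edge] *= (1.0 - rho)
    for edge in best_path:
        pheromone[edge] += rho * quality
    for edge in pheromone:
        pheromone[edge] = min(t_max, max(t_min, pheromone[edge]))
    return pheromone
```

Clamping to `t_min` keeps every edge reachable (preventing premature convergence), while `t_max` stops any single path from completely dominating the probability distribution.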