Rapid Prototyping of High Performance Fuzzy Computing Applications using High Level GPU Programming for Maritime Operations Support
Marco Cococcioni*, Member, IEEE, Raffaele Grasso and Michel Rixen
Applied Research Department, NATO Undersea Research Centre (NURC)
Viale San Bartolomeo 400, 19126 La Spezia, Italy
*Corresponding author. E-mail: [email protected], Tel: +39-0187-527467, Fax: +39-0187-527354
Abstract—The advent of relatively cheap general purpose graphics processing units (GPUs) is having a huge impact on scientific computing. It is opening the door to high performance fuzzy computing (HPFC) for the masses, thanks to the low cost of GPUs and the possibility of having them on desktop computers at home. Even mobile HPFC seems imminent, provided the laptop is equipped with a general purpose GPU. Very recently another innovation has occurred: the availability of libraries for high level GPU programming. By using them, the programmer avoids the need for detailed knowledge of the GPU’s hardware architecture. The availability of such facilities in Matlab (today among the most widely used rapid prototyping environments) is also opening the door to the rapid prototyping of high performance applications in general, and of HPFC applications in particular. In this work we show how to speed up, with little effort, an existing Matlab software prototype which computes spatial maps for supporting decision making during maritime operations. By delegating low level optimizations to the GPU Matlab library, we not only save time, but also obtain rapidly prototyped software that is portable across different GPU hardware types.
I. INTRODUCTION
Many engineering applications are based on the development of a computer program that needs to be prototyped as fast as possible, in order to start field testing as soon as possible. Moreover, the same source code often needs to be improved in a closed loop, taking into account feedback from early validation experiments and from early users. In contrast to commercial applications, where a mature software prototype is likely to be re-engineered into a commercial product, in scientific research the prototype is rarely re-engineered into a stable, commercial application. Re-engineering a prototype from scratch is often an opportunity to optimize the code performance. Unfortunately, within research institutions such an opportunity rarely occurs. On the contrary, it is often required that even the prototype runs sufficiently fast. As an example, consider the case of a prototype which has been (rapidly) written in Matlab but whose performance on a desktop computer is not satisfactory. Suppose also that a supercomputer (made of fast interconnected computing nodes) is available: the code can be re-engineered, trying to exploit
978-1-4244-9941-0/11/$26.00 ©2011 IEEE
any parallelism available (data-parallelism, task-parallelism). A skilled programmer (or even a team of them) is needed to do this, in order to re-analyze the original problem and its sequential solution, and to re-write a parallel version of it using libraries such as MPI (Message Passing Interface). Of course this requires money and time. Also, new and unexpected bugs can be introduced at this stage due to the concurrency of the processes. Fortunately, an appealing alternative is emerging today, provided that the problem exhibits a high degree of data-parallelism. This alternative has been made possible by the advent of general purpose graphics processing units (GPUs). As opposed to what was done in the recent past (where the GPU was programmed at a lower level, sometimes in C), in this paper we put the emphasis on the benefits of high level GPU programming facilities within a rapid prototyping framework such as Matlab. This high level programming is made possible by the recent development of Matlab libraries for GPU programming (like the free GPUmat and its commercial counterpart Jacket from AccelerEyes). In particular, in this paper we will show how, using Jacket, we are able to port rapidly prototyped Matlab sources to a slightly modified Matlab prototype, which runs faster thanks to GPUs. This solution is also appealing because it is generally less expensive than buying Matlab Parallel Computing Toolbox licenses. More specifically, we show how to use Matlab GPU libraries for the rapid prototyping of high performance applications in Matlab, in order to speed up a fuzzy rule-based system (FRBS). The FRBS is the core of a decision support system recently developed for maritime operations support. We also argue how this is an interesting path towards the rapid prototyping of high performance fuzzy computing (HPFC) applications.

A. A brief history of GPUs

GPUs have been used for years as CPU co-processors, assisting them in the task of rendering complex images onto the computer screen.
Since then GPUs have evolved into highly parallel general purpose co-processors, not just graphics processors. Today’s GPUs are easily programmable devices that allow exploiting the parallelism present at the data level. A programming standard, called Open Computing
Language (OpenCL), has been defined in order to make GPU code portable among different hardware types. So far, NVIDIA has played a major role in this respect, by developing increasingly more powerful GPUs (GeForce, Tesla and Fermi), by developing the Compute Unified Device Architecture (CUDA) library and by contributing to the definition of OpenCL. The reader interested in more details on this historical evolution can refer to [1] and [2]. The latest step in this history has been the development of high level programming libraries, which spare the programmer the need for detailed knowledge of the GPU hardware. Consequently, high level GPU code is also more portable across different hardware types.

B. Related works

Despite the fact that speeding up fuzzy computing has been addressed for decades, especially in the realm of fuzzy control using hardware implementations like FPGAs (field programmable gate arrays) and VLSI (Very Large Scale Integration) circuits, to date few contributions can be found in the literature devoted to exploiting GPUs for fuzzy algorithms. GPU-based speedup of fuzzy-related processing algorithms has been presented in [3] and [4], the former containing a description of how to speed up a fuzzy ART (adaptive resonance theory) neural network and the latter illustrating how to speed up the segmentation of a fuzzy connected image. However, most of the existing work on speeding up fuzzy computing using GPUs has been developed at the University of Missouri-Columbia by Anderson, Keller and other collaborators [5]-[9]. References [5][6] only consider the speedup of fuzzy clustering, while here we are more focused on the speedup of fuzzy rule-based systems. Reference [7] discusses how to speed up an edge detection algorithm based on fuzzy rules using CUDA. The works most similar to the present one are, however, [8] and [9], where a general vision of the possibility of speeding up fuzzy computing using GPUs is conveyed.
Unlike those works, we are interested in speeding up existing Matlab code involving fuzzy rule-based computing by modifying as little of the original source code as possible. To do this, we will exploit a higher level of GPU programming than the aforementioned studies.

II. HIGH PERFORMANCE FUZZY COMPUTING
Fuzzy computing refers to a huge realm of computing techniques based on fuzzy sets and fuzzy logic. Although in this paper we restrict our attention to fuzzy rule-based computing, most of our considerations remain valid for generic fuzzy computing. Let us start from the consideration that, while high performance computing (HPC) is a well-established discipline, this is not the case for HPFC (to date the acronym is almost unknown to the Google search engine). The Wikipedia entry on HPC tells us that HPC “uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops-region are counted as HPC-computers”. We can argue that a similar definition, paralleling the previous one, can be made by
replacing the more general expression “computing” with the more specific “fuzzy computing”. And this is exactly the definition of HPFC that we will use from here onwards. Obviously, there are several hardware and software solutions for performing HPFC, just as there are many for performing HPC. For instance, one can develop parallel fuzzy code for supercomputers and computer clusters, but this is only one way to do it. Reasoning inclusively, speeding up fuzzy algorithms using hardware implementations on FPGAs [10][11] and VLSI [12] can be regarded as an alternative (if not complementary) path to HPFC. Since the advent of general purpose GPUs is affecting HPC (current supercomputers are adding one or more GPUs to each computing node), the same is happening for HPFC. Even desktop computers equipped with one or more GPU cards can thus be viewed as an interesting way towards (desktop) HPFC, affordable by the masses. Finally, we observe how the recent availability of laptops equipped with general purpose GPU cards opens the door to a generation of mobile HPFC. In the next two sections we give simple guidelines for the rapid prototyping of HPFC applications in Matlab using high level GPU programming. The implicit assumption in what follows is that fuzzy algorithms are often applied to problems which show high data-parallelism, such as fuzzy image processing (which usually works pixel-wise) or fuzzy rule-based systems in which the elaboration of one input is independent from the elaboration of another (the majority of cases). This ensures that GPU programming actually speeds up the code.

III. A PRELIMINARY STEP: WRITING FAST FUZZY ALGORITHMS IN MATLAB
A tenet among Matlab beginners, usually trained in Fortran/C/C++, says that Matlab code is slow, while Fortran/C/C++ code is fast (and of course, assembly). We only very partially agree with this statement, since it needs to be qualified with great care. What is generally true is that naïve Matlab programming usually leads to very slow code, especially when compared with naïvely programmed Fortran/C/C++ counterparts. But if you are looking for a counter-example, try writing the product between two matrices from scratch in C in the most naïve and straightforward way, i.e., using three nested for loops. When comparing its performance against that of the Matlab built-in matrix product for increasing matrix sizes, you will see that the Matlab version is far more efficient. The reason is that built-in Matlab functions are optimized to operate efficiently on vectors and matrices, making use of sophisticated numerical algorithms (usually coded in C), compiled as dynamically linked libraries (DLLs). It also helps that there is no difference in performance between calling a compiled DLL from C or from Matlab (no significant overhead is introduced by the Matlab interpreter in making the call). On the other hand, one may still observe that a naïve C implementation that uses for loops is more efficient than a naïve Matlab version also based on for loops, but this does not change the fact that Matlab
code, if appropriately written, can be comparable in performance to its C counterparts. There are many tips (and sometimes tricks) to keep in mind for writing fast Matlab code. Two of them are particularly important: pre-allocating arrays which grow inside a loop and “vectorizing” the code as much as possible. The former refers to the fact that if the memory of a growing array is not pre-allocated, each time a new element is added inside a loop Matlab wastes time de-allocating and re-allocating the array and copying the old array into the new one. The latter refers to finding a new formulation of the same code that avoids loops altogether. Put another way, instead of using for loops and operating element-wise, an alternative solution has to be found that uses no loops (and thus operates on all the elements at once). While memory pre-allocation is easy to handle (you just have to add a statement before your loop), vectorizing code is far more challenging and often becomes a matter of pure programming art. If you are interested in an instructive example of how to write vectorized code, take a look at the implementation of the fuzzy c-means clustering algorithm provided with the Matlab Fuzzy Logic Toolbox (the function to look at is stepfcm). Going a step further, we can observe how vectorization itself can be done in different ways which impact differently on both the running speed and the memory demand (see Appendix A). Identifying the bottlenecks of Matlab programs by visual inspection is far from easy and can lead to wrong conclusions. Fortunately, Matlab has included a profiler since version 5, which has been improved over the years. Profiling Matlab code helps a lot in speeding it up, often identifying counterintuitive bottlenecks where one does not expect to find them. In summary, the starting guidelines for writing fast Matlab code are:
1. pre-allocate growing arrays (and cells) inside loops;
2. vectorize the code as much as possible;
3. profile it and identify possible bottlenecks.
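As a minimal illustration of guidelines 1 and 2, consider evaluating a membership function over many inputs. The triangular membership function and all variable names below are our own, chosen for illustration; they are not taken from the DSS code:

```matlab
% Evaluating a triangular membership function over N crisp inputs.
N = 1e6;
x = rand(N, 1);                 % inputs in [0, 1]
a = 0.2; b = 0.5; c = 0.8;      % triangle vertices (illustrative)

% Naive version: element-wise loop, but with the output pre-allocated
% (guideline 1), so Matlab does not re-allocate mu at every iteration.
mu = zeros(N, 1);
for i = 1:N
    mu(i) = max(min((x(i)-a)/(b-a), (c-x(i))/(c-b)), 0);
end

% Vectorized version (guideline 2): one expression, no loop at all.
mu_vec = max(min((x-a)/(b-a), (c-x)/(c-b)), 0);
```

Guideline 3 then amounts to running `profile on`, calling both versions, and inspecting the report with `profile viewer`.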
More guidelines can be found in [13]. Writing fast fuzzy algorithms in Matlab not only requires the guidelines above, but also some specific tips related to fuzzy computing. Obviously, one has to keep in mind that when the goal is HPFC, less complex fuzzy algorithms and operators have to be preferred to more complex ones, even when the latter are more accurate (a trade-off has to be made here). Although the list of fuzzy tips is probably endless, we report some that easily come to mind:
1. use double precision only when required;
2. prefer membership functions with a rational expression to those without one;
3. prefer simple implication and defuzzification operators to more complex ones;
4. re-use previously computed quantities whenever possible, especially in evolutionary optimized FRBSs [14]-[16];
5. consider the possibility of speeding up the identification of Takagi-Sugeno systems using heuristics [14]-[17].
Tip 1 is important since many existing GPU cards still do not support double precision, or support it to a lesser extent (i.e., with fewer cores), even though in the future this may no longer be the case. In fuzzy rule-based systems, for instance, using the minimum as the AND operator makes single precision accurate enough (on the contrary, using the product requires double precision, especially when the number of inputs increases). Tip 2 suggests the use of triangular, trapezoidal, bell-shaped and generalized bell membership functions in place of the more popular Gaussian function, since the evaluation of the latter requires more computation (the exponential is normally evaluated via a Taylor expansion). If the Gaussian is to be preferred for other reasons, consider some tricks to speed up its evaluation [18].
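A sketch of the difference behind tip 2; the parameter values below are illustrative assumptions of ours (for the generalized bell, cf. the `gbellmf` function of the Matlab Fuzzy Logic Toolbox):

```matlab
% The generalized bell needs only a rational expression per point,
% whereas the Gaussian needs an exponential evaluation per point.
x = linspace(0, 10, 1e6);

% Generalized bell: mu(x) = 1 / (1 + |(x-c)/a|^(2b))
a = 2; b = 4; c = 5;            % illustrative shape parameters
mu_bell = 1 ./ (1 + abs((x - c)/a).^(2*b));

% Gaussian counterpart: mu(x) = exp(-(x-c)^2 / (2*sigma^2))
sigma = 1.5;
mu_gauss = exp(-(x - c).^2 / (2*sigma^2));
```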
IV. RAPID PROTOTYPING OF HPFC APPLICATIONS USING MATLAB AND HIGH LEVEL GPU PROGRAMMING
The goal of this work is to present a method for the rapid prototyping of HPFC applications. Rapid prototyping is a peculiarity of 4th generation languages like Matlab, which makes Matlab a perfect candidate for the rapid prototyping of HPFC applications. As a preliminary step, we have argued that generic Matlab code needs to be written carefully. In addition, further attention has to be paid to the choice of the fuzzy computing algorithm, in order to keep its computational complexity as low as possible. Once these issues have been addressed we are ready to exploit the GPU. In order to reach the goal of rapidly prototyped HPFC applications starting from existing Matlab code, we add the following desired features:
1. the GPU(s) must be programmable from within Matlab;
2. we want to change the original, often optimized, Matlab code as little as possible;
3. we do not want to spend time studying the hardware architecture of the GPU(s) in depth (and do not want to have to re-write the Matlab code in order to exploit the hardware at hand to the highest extent).
The third desired feature means that we do not want to spend time (and thus money) on low level programming of the GPU, even from within Matlab. An emerging solution addressing the three desired features above is to make use of recently developed Matlab libraries for programming the GPU, like GPUmat (freely available at www.gp-you.org) and its commercial counterpart Jacket from AccelerEyes (www.accelereyes.com). In this work we concentrate on the Jacket library, because it has been developed to provide support for desired features 1-3. Indeed, its developers aimed at squeezing as much as possible out of the GPU, while avoiding as much as possible the need for the programmer to tweak his Matlab source code to the specific hardware. This is an interesting approach to high level GPU programming that has an attractive advantage: there is no
need to re-optimize the code when the GPU hardware changes. Adapting existing Matlab code to exploit the GPU using Jacket is straightforward. Many times it is sufficient just to cast your variables to the data types defined in Jacket (gsingle and gdouble), and the same code will run on the GPU. This kind of magic is due to the fact that Jacket redefines many existing Matlab functions for the two data types. The interpreter will automatically call the Matlab version in the case of single and double arrays, and the Jacket counterpart in the case of gsingle and gdouble variables. The freeware alternative GPUmat uses a similar approach (the data types are GPUsingle and GPUdouble) and re-defines most of the operations on arrays, in addition to matrix inversion and the fast Fourier transform. However, fewer functions have been re-written so far by the GPUmat volunteers than are redefined in Jacket. Indeed, the Jacket developers aim at re-writing all Matlab functions for the GPU.
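A minimal sketch of the casting mechanism just described, assuming a machine with Jacket installed (the variable names are ours, and gathering the result back with a single() cast reflects our understanding of Jacket's conversions):

```matlab
% Ordinary Matlab array on the CPU.
x_cpu = rand(1e6, 1, 'single');

% Cast to Jacket's GPU single-precision type: same data, now on the GPU.
x_gpu = gsingle(x_cpu);

% The very same vectorized expression now runs on the GPU, because
% Jacket overloads these functions/operators for gsingle arrays.
mu = max(min((x_gpu - 0.2)/0.3, (0.8 - x_gpu)/0.3), 0);

% Gather the result back to a plain Matlab array on the CPU.
mu_cpu = single(mu);
```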
V. THE APPLICATION: REAL-TIME DECISION SUPPORT FOR MARITIME OPERATIONS
Maritime operations, like naval refueling, amphibious landing, and operations with autonomous underwater vehicles (AUVs) (deployment, surfacing, recovery, etc.), are heavily affected by meteorological and oceanographic (METOC) variables like wave height, current and wind speeds. Today’s availability of forecasts of METOC conditions (usually up to a few days ahead) can be exploited to support decisions related to maritime operations. Fuzzy rule-based systems are a powerful tool for encoding the knowledge elicited from operational personnel (like METOC officers, AUV operators, etc.). For this reason, the NATO Undersea Research Centre has been developing maritime decision support systems (DSSs) since 2005, as an extension of the first (non-fuzzy) DSS developed in 2002 and called, at that time, MIMS (METOC Impact Matrix System). The system is made of fuzzy rules like the following:
R1: If WaveHeight is High and SurfaceCurrent is High then do not deploy the AUV
R2: If WaveHeight is Low and SurfaceCurrent is Low then deploy the AUV
where WaveHeight is the significant wave height, SurfaceCurrent is the surface current speed, and Low and High are two fuzzy sets appropriately defined over the domains of the METOC variables, which depend on the maritime operation at hand (in this paper we focus on the AUV deployment operation). The reader can find all the details of the devised maritime DSSs in [19]-[24]. An open issue related to those systems is the computational complexity associated with the DSS in general, and with the fuzzy rule-based inference in particular, when run over large spatial domains and many temporal instances. In most cases the DSS can be viewed as a mapping from METOC forecasts to the recommendation to run or not to run a given operation. These areas are typically discretised using a
regular grid, and METOC forecasting models provide their forecasts on those spatial points only, with a resolution of a few km². Since we usually compute the result of the DSS on the same sampling points for which METOCs are available, we are often required to run the DSS on hundreds of thousands of spatial points. Also, the forecasts are typically provided up to 3 days ahead, every hour. This means that the DSS needs to be run 3x24=72 times over the same area. Moreover, the same procedure is repeated as soon as new METOC forecasts are available (typically 3-4 times a day). In summary, we have to run the DSS 10 million times (no matter whether they are 10 million spatial points at 1 instant in time, or 1 hundred thousand points for 100 instants in time). Things can be even worse when considering larger areas.
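Because the inference at each grid point is independent of the others, rules such as R1 and R2 can be evaluated over an entire forecast grid at once with vectorized code. The following sketch is illustrative only: the grid size, membership thresholds, ramp shape and aggregation below are assumptions of ours, not the actual DSS rules:

```matlab
% Illustrative sketch: evaluating R1 and R2 on every grid point at once.
H = 2*rand(500, 400);                % significant wave height field [m]
C = rand(500, 400);                  % surface current speed field [m/s]

% Piecewise-linear ramp used as a simple "High" membership function.
ramp = @(x, lo, hi) min(max((x - lo)/(hi - lo), 0), 1);

highH = ramp(H, 1.0, 1.5);           % degree to which WaveHeight is High
highC = ramp(C, 0.4, 0.6);           % degree to which SurfaceCurrent is High
lowH  = 1 - highH;                   % Low modeled here as the complement
lowC  = 1 - highC;

noGo = min(highH, highC);            % firing strength of R1 (min as AND)
go   = min(lowH,  lowC);             % firing strength of R2
deployMap = go ./ max(go + noGo, eps);  % normalized "deploy" recommendation map
```

One such vectorized evaluation replaces hundreds of thousands of per-point inferences, and (as in Sec. IV) the same expressions can be dispatched to the GPU by casting H and C to GPU types.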
VI. RESULTS
We have used an NVIDIA GPU (the GeForce 8800 GT), equipped with 1024 MB of RAM, 112 cores, and a 1500 MHz processor clock. We have used Matlab 7.2 and Jacket library version 1.4.1, over CUDA driver 258.96 (CUDA toolkit 3.1). After having optimized the Matlab code, we have easily ported it to the GPU as described in Sec. IV using Jacket. We have first run an experiment on a synthetic dataset and then we have applied the same code from within the DSS.

A. Results on the synthetic dataset

Figure 1 shows the runtime using the CPU (green solid line) and the GPU (red dotted line) attained over an increasing number of points (from 1 hundred to 1 million). Figure 2 shows the GPU speedup: it can be seen how the use of the GPU is only convenient for thousands of points or more, and it saturates at a speedup factor of approximately 8. When the data points are few (