Accelerating Missile Threat Simulations Using Personal Computer Graphics Cards
Sean A. Jeffers, Rusty O. Baldwin, and Barry E. Mullins
SIMULATION 2006; 82(8): 549-558. DOI: 10.1177/0037549706072047
The online version of this article can be found at: http://sim.sagepub.com/cgi/content/abstract/82/8/549



Accelerating Missile Threat Simulations Using Personal Computer Graphics Cards

Sean A. Jeffers, Rusty O. Baldwin, and Barry E. Mullins
Air Force Institute of Technology
2950 Hobson Way
Wright-Patterson AFB, OH 45433
[email protected]

The authors use inexpensive personal computer graphics cards to perform the intensive image-processing computations done in a heat-seeking missile's tracking system and thereby dramatically reduce the execution time of missile threat simulations used by military mission planners. Using an innovative processing algorithm, these calculations are accomplished up to 3.5 times faster on graphics cards than on a conventional CPU. Through a combination of this and other software optimizations, simulation time was reduced by 33%.

Keywords: GPU, graphics processors, hardware assisted simulation, GPU algorithm

1. Introduction

Pairing computer graphics cards with military mission-planning simulations might seem an odd combination, but today's powerful, state-of-the-art graphics cards contain an ideal architecture for the processing done during missile threat simulations. We harness that power to increase the performance of a joint modeling and simulation system (JMASS) [1] software model that simulates engagements between heat-seeking missiles and friendly aircraft. Simply put, JMASS simulations determine the probability of a successful missile attack and evaluate the effectiveness of tactics and countermeasures (such as chaff and flares) used by aircraft to thwart missile threats. However, these simulations are computationally intensive, often requiring 2 hours to complete a single 10-second missile engagement. Furthermore, hundreds of simulations are needed to perform a risk assessment under various environmental and aircraft maneuver conditions. In addition, missile guidance and tracking systems are able to distinguish target features at ever-increasing levels of detail, creating a corresponding demand for higher-resolution simulations. Thus, the need for faster simulations is acute.

The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. government.

We speed up these mission-critical simulations by using inexpensive graphics cards to perform the intensive image-processing computations that simulate a heat-seeking missile's tracking system. Commodity graphics accelerator cards, found in almost every personal computer on the market today, are high-performance, flexibly programmable stream processors that can be adapted to a variety of general-purpose computing tasks [2]. These devices, commonly referred to as graphics processing units (GPUs), can outperform modern CPUs in a number of computationally intensive applications [3, 4]. Graphics cards are optimized for parallel operations on large streams of data, are equipped with high-bandwidth on-board memory (recent models provide 50 GB/s or more, far exceeding a typical PC's 6.4 GB/s main-memory bandwidth), and have a dedicated fast bus to the CPU. GPU performance has increased at a rate of 2.8 times per year since 1993 (exceeding Moore's law) and may soon reach tera-FLOP levels. The GPU, therefore, is a powerful resource with the potential to provide a sizable performance boost at a modest cost.¹

1. Mainstream graphics cards range in price from about $60 to $500.

2. Background

JMASS is a software program that runs on Windows-based PCs. All computation, including image rendering, is performed in software on the CPU. The JMASS missile threat engagement model spends about 35% of its execution time simulating a missile's infrared (IR) seeker, the electro-optical system that guides a missile to its target.



The upper half of Figure 1 shows the JMASS model of the electro-optical seeker system. The seeker is located at the front of the missile and consists of an IR-transparent dome followed by a set of optics not unlike a telescope. The optics, shown in the upper-right portion of Figure 1, focus incoming light from the missile's target through a rapidly spinning, partly transparent disk called a reticle, which modulates the light before it reaches the IR detector. The reticle is designed so that the position of the target, relative to the center of the missile's field of view, can be determined from the modulated signal [5]. The missile's control system uses this signal to maneuver the missile appropriately. Thus, the input to the seeker is the IR scene (i.e., the view of the world as seen through the IR dome and optics), and its output is the reticle-modulated IR detector signal. This tracking method is known as spin scan.

The lower half of Figure 1 shows the JMASS model of the interaction between the IR scene and the spinning reticle. In the JMASS software, the reticle and IR scene images are the same size and are both stored in two-dimensional double-precision floating point arrays. Simulation time progresses in discrete intervals, called time steps, 250 times each second. During each time step, JMASS generates an updated IR scene (rendered in software by the CPU) based on the missile's position and performs O(N²) arithmetic operations, where N is the width in pixels of the reticle and scene images.² The operations include the rotation and interpolation of the reticle image and, as shown in the lower-left portion of Figure 1, multiplying the IR scene with the rotated reticle image element by element. The sum of the pixels in this filtered image represents the total IR radiance incident on the detector. This process occurs 40 times per simulation step using a reticle image incrementally rotated via a linear coordinate transformation (to simulate its spinning motion) and interpolated (to remove rotation artifacts), yielding a 10-kHz sampling frequency for the simulated IR detector signal.

The multiply-add operation on the reticle and IR scene images is essentially a dot product of two N²-element vectors. To correctly simulate the detector signal, 10,000 O(N²) image rotation, interpolation, and multiply-add operations are performed for each second of a missile engagement. Depending on the size of the images, this can require 75 million double-precision floating point calculations per time step, or 19 billion calculations per simulated second.³ This type of JMASS simulation takes about 2 hours for a 10-second missile engagement on a 2.8-GHz Pentium 4 with 2 GB of memory, using 512²-sized images.

2. Assuming square images.
3. Assumes a 256 × 256 image and bilinear interpolation. This accounts for the floating point addition, multiplication, and sin and cos operations JMASS performs per pixel to accomplish the coordinate transformation and interpolation. It does not include instructions for performing loops, lookups, or array index calculations. Interpolation requires 10 floating point operations per image pixel, rotation requires 16, and the multiply-add requires about 2.
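To make the per-sample computation concrete, the following is a minimal CPU-side sketch of one detector sample as just described: an inverse rotation with bilinear interpolation, an element-by-element multiply with the scene, and a final sum. This is an illustration of the operation, not the JMASS source; all names are ours.

```cpp
#include <cmath>
#include <vector>

// One detector sample: rotate the reticle by 'theta' (with bilinear
// interpolation to remove rotation artifacts), multiply element by
// element with the IR scene, and sum the products. Images are N x N,
// row-major.
double detectorSample(const std::vector<double>& reticle,
                      const std::vector<double>& scene,
                      int n, double theta)
{
    const double c = std::cos(theta), s = std::sin(theta);
    const double cx = (n - 1) / 2.0;          // rotate about image center
    double sum = 0.0;
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            // Inverse rotation: where does output pixel (x, y) come from?
            double sx =  c * (x - cx) + s * (y - cx) + cx;
            double sy = -s * (x - cx) + c * (y - cx) + cx;
            int x0 = (int)std::floor(sx), y0 = (int)std::floor(sy);
            if (x0 < 0 || y0 < 0 || x0 >= n - 1 || y0 >= n - 1) continue;
            double fx = sx - x0, fy = sy - y0;
            // Bilinear interpolation of the rotated reticle value
            double r = (1 - fx) * (1 - fy) * reticle[y0 * n + x0]
                     + fx * (1 - fy)       * reticle[y0 * n + x0 + 1]
                     + (1 - fx) * fy       * reticle[(y0 + 1) * n + x0]
                     + fx * fy             * reticle[(y0 + 1) * n + x0 + 1];
            // Element-by-element multiply with the scene, accumulated
            sum += r * scene[y * n + x];
        }
    }
    return sum;                               // total radiance on detector
}
```

JMASS performs this O(N²) loop 40 times per time step and 250 time steps per second, which is where the 10-kHz sample rate and the roughly 19 billion operations per simulated second come from.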

The optics calculations described above are well suited to the GPU in that they exhibit data parallelism and independence: the same operation is applied to all pixels, and computations on one pixel generally do not depend on the other pixels [6]. In addition, GPUs favor data streams that have 2-D locality (images are stored in GPU memory as 2-D arrays called textures) and computation modes that proceed sequentially through the data elements [7]. Since the JMASS computations meet these criteria and account for a significant amount of the JMASS execution time, using a GPU to accelerate them seemed a low-cost, high-payoff solution.

There are numerous examples of similar uses of a GPU. Time-domain convolution with graphics cards performs a Fast Fourier Transform (FFT) on two images, multiplies them element by element in the frequency domain, and then performs an inverse FFT on the result [2]. JMASS similarly performs an element-by-element multiplication of two matrices, which can be accomplished on a GPU by storing the matrices as textures in GPU memory and multiplying them using a pixel shader program [2]. After multiplying the rotated reticle image with the IR scene, all elements are summed to produce a single IR intensity value. Such reduction operations can be accomplished faster on a GPU than on a conventional CPU [3, 4]. More advanced uses of the GPU have been devised, spanning diverse applications such as flow simulation, solving differential equations, sorting, and even options pricing in financial markets [8].

3. GPU Image-Processing Implementation

To simulate a spinning reticle, JMASS rotates and interpolates a reticle image as many as 100,000 times during a typical simulation. A more efficient approach is to store a set of prerotated and interpolated reticle images (in either GPU or CPU memory) and look them up as needed instead of generating them repeatedly throughout the simulation. This eliminates hundreds of thousands of costly O(N²) calculations, leaving only the reticle-scene multiply-add operation to be performed repeatedly, either in software or in GPU hardware. A set of 100 incrementally rotated images, spanning a complete rotation of the reticle, is sufficient to replace the continuously variable rotations allowed in the baseline JMASS software. JMASS was therefore modified to implement this lookup-based approach, motivated by the fact that it would enable reticle images to be cached as textures in fast GPU memory and would minimize costly data transfers between the CPU and GPU.

Prior to the beginning of the simulation, the set of 100 prerotated reticle images is calculated and uploaded to GPU memory. During each time step of the simulation, an updated IR scene is uploaded to the GPU along with 40 indices selecting the reticle images used in the reticle-scene multiply-add operations. When GPU processing is complete, the 40 results are transferred from GPU to CPU memory.
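A sketch of the lookup table construction implied above, again with illustrative names of our own; rotateReticle() stands in for the rotation/interpolation step of the previous sketch, written to produce a full rotated image rather than a single sample.

```cpp
#include <vector>

// Assumed helper: same inverse-rotation + bilinear-interpolation step
// as in the previous sketch, but returning the whole rotated image.
std::vector<double> rotateReticle(const std::vector<double>& reticle,
                                  int n, double theta);

// Precompute 100 incrementally rotated reticle images once, before the
// simulation starts, instead of rotating on every detector sample.
std::vector<std::vector<double>> buildReticleTable(
    const std::vector<double>& reticle, int n)
{
    const int kRotations = 100;        // spans one full reticle revolution
    const double kPi = 3.14159265358979323846;
    std::vector<std::vector<double>> table(kRotations);
    for (int i = 0; i < kRotations; ++i)
        table[i] = rotateReticle(reticle, n, 2.0 * kPi * i / kRotations);
    return table;                      // uploaded to GPU memory as textures
}

// During the run, each of the 40 samples per time step then becomes a
// table lookup plus a multiply-add, e.g. (hypothetical):
//   int idx = (int)(spinAngle / (2.0 * kPi) * 100) % 100;
//   double sample = multiplyAdd(table[idx], scene, n);
```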



Figure 1. JMASS image processing for missile engagement simulations. (Diagram: the rotating reticle within the missile's seeker filters the input scene. The IR scene image is multiplied element by element with the rotated reticle image, 40 times per simulation step using 40 rotated reticle images and 10,000 times per simulated second; all pixels of the filtered image are then summed to give the single radiance value for the missile's IR detector.)

To support JMASS calculations, the graphics cards had to meet several requirements. First, they had to support floating point textures, since the JMASS simulation requires high dynamic range. Unfortunately, current GPUs support only the single-precision IEEE 754 format, if they support floating point at all. To use a GPU, we had to accept single precision instead of JMASS's normal double precision. A second important requirement was sufficient GPU memory to store the 100 reticle images, the input scene, and textures for intermediate results between rendering passes. The 256-MB capacity available in high-end cards is adequate, supporting 512²-pixel reticle images, the largest images currently required by JMASS users. Finally, the graphics card had to support DirectX Pixel and Vertex Shader version 2.0 [9], since the reduction operations use dependent texture addressing.⁴ DirectX graphics was chosen over OpenGL because it provides a quicker mechanism for retrieving data from the GPU [4].

4. Dependent texture addressing uses texture coordinates from one texture pixel to derive the coordinates for another.

At the time this research was conducted, only two graphics cards met these requirements: the ATI X800XT and the nVidia 6800 Ultra. We considered these cards representative of the state-of-the-art GPUs available to consumers. Their GPU clock speeds and feature sets, though not identical, were comparable. We compared the performance of both cards.
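For reference, probing a card for these requirements can be done with standard Direct3D 9 capability queries. The following sketch (ours, not from the paper's code) checks the shader-model and float-texture requirements; memory capacity would be checked separately, and error handling is trimmed for brevity.

```cpp
#include <d3d9.h>

// Query the default adapter for Shader Model 2.0 and 4-channel 32-bit
// float render-target textures, the two programmability requirements
// named above.
bool meetsRequirements()
{
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return false;

    D3DCAPS9 caps;
    d3d->GetDeviceCaps(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, &caps);

    // Pixel and Vertex Shader 2.0 (needed for dependent texture reads)
    bool shaders = caps.PixelShaderVersion  >= D3DPS_VERSION(2, 0) &&
                   caps.VertexShaderVersion >= D3DVS_VERSION(2, 0);

    // Four-channel 32-bit float textures usable as render targets
    bool floatTex = SUCCEEDED(d3d->CheckDeviceFormat(
        D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, D3DFMT_X8R8G8B8,
        D3DUSAGE_RENDERTARGET, D3DRTYPE_TEXTURE, D3DFMT_A32B32G32R32F));

    d3d->Release();
    return shaders && floatTex;
}
```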

Shader programs were written in Microsoft's High-Level Shader Language (HLSL) rather than assembly language for simplicity, while the code to control the GPU and to interface with JMASS was written in C++ to facilitate integration with JMASS, which is also written in C++. To increase efficiency, four (single-precision) floating point values are packed into each GPU texture pixel using the R, G, B, and A color channels, thereby exploiting the GPU's four-way parallelism while reducing texture sizes [10]. The software interface to the GPU is instantiated as an object with methods for uploading reticle images to GPU memory and for processing scene images.

Due to the high degree of programmability and the rich feature set offered by the graphics card and DirectX, there are many ways to implement the multiply-add operation on a GPU. Two approaches, which we call the sequential and the palette approach, are described below.

3.1 Sequential Approach

Figure 2 illustrates the sequential approach. Step 0 and step 1 store the 100 reticle images as separate textures and upload the IR scene into GPU memory, respectively, as shown in the upper-left corner of the figure. In step 2, the GPU executes a shader program on the 40 reticle-scene image pairs, requiring 40 rendering passes.


Figure 2. “Sequential” approach for processing multiply-add operations in the GPU

The output of a single pass is a single image one-fourth the size of the original, produced by multiplying one IR scene by one reticle image (element by element) and adding each group of four adjacent pixels. These 40 resultant images are rendered to a single large target texture, arranged in rows and columns like tiles. Step 3 begins with this texture: the GPU performs up to three 16:1 reduction operations, adding 16 adjacent pixels in the source texture and rendering the results to another 1/16-sized texture, until the data are reduced to an array of 40 pixels, each containing a single reticle-scene sample of the IR detector signal.

3.2 Palette Approach

The palette approach, shown in Figure 3, produces the same results as the sequential approach described above but requires only 4 (versus 43) rendering passes. Instead of storing 100 reticle images as separate textures, step 0 of the palette approach arranges the reticles by row and column, like tiles, into one large "palette" texture. Step 1 loads the scene image to the GPU. Step 2 multiplies this scene with the larger palette texture, taking advantage of a GPU addressing mode that effectively replicates the scene image across the palette texture so that the scene image to be multiplied corresponds with the reticle images contained in the palette texture. In this way, the scene is multiplied by many reticle images in a single rendering pass. This initial rendering pass results in a partially summed intermediate result image that is one-fourth the size of the reticle palette. As with the sequential approach, step 3 performs up to three 16:1 reductions to produce the 40 result pixels.
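The arithmetic of the two pass types shared by both approaches can be summarized with a CPU-side reference sketch. On the card these run as HLSL pixel shaders over float RGBA textures with four values packed per texel; here, for clarity, we assume one value per pixel and that the "four adjacent pixels" of the initial pass form a 2 × 2 block. Illustrative only, not the shader source.

```cpp
#include <vector>

// Initial pass: element-by-element multiply plus a 4:1 add, producing
// an image one-fourth the size of the inputs (2x2 blocks of products
// summed into one output pixel).
std::vector<float> multiplyAddPass(const std::vector<float>& reticle,
                                   const std::vector<float>& scene, int n)
{
    std::vector<float> out((n / 2) * (n / 2));
    for (int y = 0; y < n / 2; ++y)
        for (int x = 0; x < n / 2; ++x) {
            float s = 0.0f;
            for (int dy = 0; dy < 2; ++dy)    // sum a 2x2 block of products
                for (int dx = 0; dx < 2; ++dx) {
                    int i = (2 * y + dy) * n + (2 * x + dx);
                    s += reticle[i] * scene[i];
                }
            out[y * (n / 2) + x] = s;
        }
    return out;
}

// Reduction pass: 16:1, each output pixel is the sum of a 4x4 source
// block; applied up to three times until 40 result pixels remain.
std::vector<float> reducePass16(const std::vector<float>& src, int n)
{
    std::vector<float> out((n / 4) * (n / 4));
    for (int y = 0; y < n / 4; ++y)
        for (int x = 0; x < n / 4; ++x) {
            float s = 0.0f;
            for (int dy = 0; dy < 4; ++dy)
                for (int dx = 0; dx < 4; ++dx)
                    s += src[(4 * y + dy) * n + (4 * x + dx)];
            out[y * (n / 4) + x] = s;
        }
    return out;
}
```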

4. Experimental Goals and Objectives

We wanted to determine whether a GPU could accelerate JMASS simulations and whether it could accelerate the specific class of image-processing operations used in JMASS compared to software-based alternatives.


Figure 3. “Palette” approach for processing multiply-add operations in the GPU


In addition, since the results were likely to influence future purchasing decisions, we wished to compare the performance of the nVidia and ATI GPUs using both AGP and the (two times faster) PCI-Express graphics bus. We therefore designed a four-phase experiment. In all but the last phase, JMASS itself was not used; instead, a simple test application was written to emulate JMASS, generating IR scene images each simulation time step and feeding them to the image-processing implementation under test while recording computation results and execution times in a controlled, repeatable environment. For realism, the test program could be set to one of three schemes for updating the IR scene after each time step: static (no change), dynamic (all pixels change), and moving point source (a single moving pixel on a black background).

The first experimental phase compared the ATI and nVidia cards using both the sequential and palette approaches described above. In this phase, only the AGP platform was used, since a PCI-Express version of the nVidia card was not commercially available. The second phase pitted the fastest card and processing approach combination identified in the first phase against two software-based alternatives: a simple loop-based C++ implementation and an implementation using the cache-optimized Intel Math Kernel Library (MKL) sdot routine. The third phase compared the GPU to software-based processing using a slightly different image-processing scheme that simulates what is called a conical scan seeker. In a conical scan simulation, the IR scene is much larger than the reticle image, and the reticle is displaced from the center of the IR scene by a varying set of (x, y) offsets prior to each reticle-scene multiply operation, as illustrated in Figure 4. For these first three phases, the PCI-Express machine was a 3.6-GHz Pentium 4 with 2 GB of memory and a 1-MB cache; the AGP machine was a 3.0-GHz Pentium 4 with 2 GB of memory and a 512-KB cache. Although the differences between these machines do not permit an "apples-to-apples" comparison, it will be shown that they had little impact on the experimental outcome.

The final phase integrated GPU processing into JMASS and compared the performance of GPU-enhanced JMASS with two other JMASS versions: baseline JMASS and modified JMASS (software), which incorporates the lookup-based approach described earlier but still performs the reticle-scene multiply-add in software.

The performance metric in all experiments is execution time. The first three phases measured the runtime of 40,000 reticle-scene multiply-add operations, equivalent to the simulation processing of a 4-second missile engagement. Thirty replications of each experiment were performed. The final phase, which incorporated JMASS, used a generic shoulder-fired missile threat model for the simulation workload, a 10-second engagement, and one replication per experiment. In all phases, experiments were conducted using three image sizes: 128², 256², and 512² pixels.

Figure 4. Conical scan variation. The scene is 4× the size of the reticle image; the reticle image is displaced from the scene center and multiply-added with the portion of the scene it overlaps.
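A CPU-side sketch of the conical scan sample, with illustrative names of our own: the reticle is placed at an (ox, oy) offset within the larger scene, and only the overlapped portion contributes to the multiply-add.

```cpp
#include <vector>

// Multiply-add the rn x rn reticle against the sn x sn scene (sn > rn),
// with the reticle's origin displaced to (ox, oy) in scene coordinates.
double conicalSample(const std::vector<double>& reticle, int rn,
                     const std::vector<double>& scene, int sn,
                     int ox, int oy)
{
    double sum = 0.0;
    for (int y = 0; y < rn; ++y)
        for (int x = 0; x < rn; ++x) {
            int sx = x + ox, sy = y + oy;
            if (sx < 0 || sy < 0 || sx >= sn || sy >= sn) continue;
            sum += reticle[y * rn + x] * scene[sy * sn + sx];
        }
    return sum;
}
```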

5. Results

Phase 1: ATI versus nVidia, Sequential versus Palette. Per Table 1, the ATI card was consistently about four to seven times faster than the nVidia card. One potential reason is that the ATI card represents floats with only 24 bits rather than the standard 32. The result is faster performance, but the truncation introduces error as high as 0.016% when the result of the image multiply-add is on the order of 10¹². Although seemingly insignificant, this error was enough to adversely affect simulation outcomes when the ATI card was eventually integrated with JMASS (more on this later).

The palette approach was faster at the 128² image size on both cards, but at the 256² image size the results were mixed, with the ATI card preferring the palette approach and the nVidia card the sequential approach. GPU cache configuration is a likely reason the cards have different "sweet spots" in this way. The DirectX documentation suggests keeping textures as small as possible, at 256² or less, for best performance. It is therefore not surprising that the palette approach, which stores an array of many reticle images in a single large texture, was slower than the sequential approach on both cards at the 512² image size. In fact, the palette approach would not execute properly on the AGP platform (with either card) at the 512² image size, because either DirectX or the cards' drivers would not allow all of the large textures to be loaded into GPU memory, even though sufficient memory was available. No such limitation existed on the PCI-Express machine, and we were able to run the palette algorithm successfully with the PCI-Express ATI card at the 512² size. Nevertheless, palette was slower than sequential at the 512² image size, and since we did not have a PCI-Express version of the nVidia card to test, we do not report times for the palette approach at this image size.


Table 1. GPU execution time (seconds) to complete 40,000 reticle-scene multiply-adds, ATI versus nVidia, using palette and sequential approaches

              ATI GPU algorithm          nVidia GPU algorithm
Image Size    Palette     Sequential     Palette     Sequential
128²          0.640       0.870          2.456       4.216
256²          2.059       2.199          14.377      10.124
512²          NA          7.226          NA          37.427

NA = not available

Scene update scheme had almost no effect on execution times in any of the experimental phases, so the times shown throughout are for the static case only. Since the ATI card was the clear winner for speed, if not for computational accuracy, it was chosen as the representative GPU in the next two phases.

Phase 2: GPU versus Software Alternatives. Figure 5 shows the relative speedup provided by the ATI GPU over the C++ and cache-optimized MKL implementations on both the AGP and PCI-Express platforms. Based on the results of phase 1, a hybrid GPU routine was developed to automatically select the fastest approach (palette or sequential) for the ATI card based on image size: palette for the 128² and 256² sizes and sequential for the 512² size. In all cases, the GPU outperformed the software-based processing methods, providing between 1.4 and 3.5 times speedup. Recall, however, that this is achieved at the expense of accuracy, since the ATI card processes only 3 bytes per float, whereas the C++ and MKL methods provide full 32-bit precision. The GPU's advantage is slightly diminished on the PCI-Express platform because the software-based methods run faster on that platform's faster CPU.

It is also interesting to note that the faster PCI-Express graphics bus did not help the GPU much in this application. As the first and fourth columns of Table 2 show, GPU performance increased by only 1.1, 1.03, and 0.99 times for the 128², 256², and 512² image sizes, respectively, compared to the AGP version. GPU performance on the two platforms was almost the same despite their differing CPU speeds, demonstrating that the GPU can serve as an "equalizer," allowing slower machines to perform on par with (or even faster than) more capable machines. Note that nVidia's best times from Table 1 are beaten by even the unoptimized C++ implementation.

Phase 3: GPU versus C++ and MKL (Conical Scan Variation). The first two columns of Table 3 show that the GPU provides from 0.9 to 2.5 times speedup over the software-based processing methods, with GPU performance surpassed in only one case, by MKL at the 128² image size. The GPU had less of an advantage in these experiments because the scene images that must be transferred to the GPU each time step are four times the reticle size, quadrupling the time spent transferring data to the GPU compared to previous experiments.

Furthermore, conical scan forces us to use the sequential approach exclusively, which has been shown to be less than optimal for the smaller reticle sizes, and we can no longer efficiently pack four values per pixel if we are to maintain correct alignment between scene and reticle pixels after shifting. Even so, for the larger image sizes, the GPU was still between 1.9 and 2.5 times faster than the software-based alternatives. As in the previous phase, the PCI-Express machine's faster CPU provided about 1.3 times speedup for all methods, as seen in the third column of Table 3, but the PCI-Express bus itself provided little advantage for the GPU specifically (Table 3, fourth column). As image size increases, the impact of the bus/platform factor diminishes, and the processing method (GPU, C++, or MKL) becomes dominant (Table 3, fifth and sixth columns). This is primarily due to the GPU: it is so much faster than the other methods that it accounts for the largest deviation from mean performance at the larger image sizes. The diminishing effect of platform as image size increases indicates that the GPU is working more efficiently, spending more time processing and less time transferring data in and out. That is, the GPU is operating at higher computational intensity, a necessary condition for the GPU to be beneficial [4].

Phase 4: Baseline JMASS versus GPU-Assisted JMASS. Table 4 lists the execution times for a 10-second JMASS engagement with the four versions of JMASS. Based on these results and the known GPU processing times (as determined in the phase 1 experiments), Table 5 lists the approximate proportion of JMASS execution time spent in reticle rotation and interpolation and in the reticle-scene multiply-add, versus all other processing. Recall that baseline JMASS performs all of those operations in software 100,000 times per 10-second simulation, whereas modified JMASS (software) eliminates the rotation/interpolation by using lookups for the reticle images, leaving only the reticle-scene multiply-add to be performed in software. GPU-assisted JMASS provides a final optimization by performing the multiply-add operations in GPU hardware.

As shown in Table 5, baseline JMASS spends 35% of its runtime in optics computations. Transitioning JMASS to the lookup-based approach resulted in a 1.4 times speedup over baseline JMASS, reducing optics processing time to 11% of the total JMASS runtime.


Table 2. Execution times (seconds) of GPU and CPU-based approaches, spin scan procedure

              AGP Platform                    PCI-Express Platform
Image Size    GPU      C++      MKL           GPU      C++      MKL
128²          0.640    1.267    0.876         0.576    1.199    0.664
256²          2.059    5.234    3.776         1.980    4.787    2.645
512²          7.226    25.032   20.448        7.283    21.885   19.409

Figure 5. Relative speedup provided by GPU over MKL and C++ software alternatives

Table 3. GPU versus C++ and MKL for conical scan variation

           GPU Speedup Over    Average Speedup    Speedup of      Allocation of Variation (%)
Reticle                        of PCI Platform    PCI GPU vs.                       Interaction of
Size       C++      MKL        over AGP           AGP GPU         Platform  Method  Platform & Method
128²       1.3×     0.9×       1.3×               1.13×           53        34      13
256²       2.3×     1.9×       1.3×               1.05×           11        86      3
512²       2.5×     2.2×       1.1×               1.02×           1         99      0

Using the GPU to further optimize the reticle-scene multiply-add reduces optics processing time to less than 1% of the total JMASS execution time, providing a 1.1 times incremental speedup over modified JMASS (software) and a 1.5 times overall speedup over baseline JMASS.

Since optics processing accounts for 35% of the baseline JMASS execution time, the maximum speedup achievable by eliminating the optics processing altogether is 1.5. Eliminating the rotation/interpolation through lookup, combined with GPU processing of the reticle-scene multiply-add operation, comes very close to accomplishing this. However, the full potential of the GPU is not exploited, because the GPU is in use at most 1.7% of the time, yielding only a 1.1 times incremental speedup over modified JMASS (software).
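This ceiling is simply Amdahl's law applied to the optics fraction f = 0.35 of baseline runtime:

\[
S_{\max} \;=\; \frac{1}{1 - f} \;=\; \frac{1}{1 - 0.35} \;\approx\; 1.54,
\]

so even driving the optics cost all the way to zero can do no better than roughly 1.5 times overall.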


From Table 4, we also see that execution times for GPU-assisted JMASS were almost identical for the two graphics cards tested, with the nVidia version actually running faster in two cases, an apparent inconsistency given that the ATI card was much faster than the nVidia card in the earlier experiments using the test program. A possible cause for this disparity is that JMASS execution times vary from run to run, enough to mask the differences in GPU performance; a variation of 1% to 2% would be enough. The variance remains unknown, however, because only one replication of each JMASS experiment was performed.

A more important result is that the accuracy of the simulation suffered when using the ATI card. Recall that the ATI card truncates single-precision floats to 24 bits. The accumulated errors are substantial enough to cause a simulation to report a "miss" when it should report a "hit," essentially disqualifying the ATI card for this application.


Table 4. Results of phase 4 experiments: JMASS execution time (seconds) for a 10-second engagement

                        Modified JMASS
Image     Baseline    Multiply-Add     GPU-Assisted
Size      JMASS       in Software      ATI       nVidia
128²      579         407              360       359
256²      2141        1574             1393      1411
512²      8200        6289             5530      5525

Table 5. Incremental and absolute speedup

                                            Modified JMASS
                               Baseline
                               JMASS        Software    GPU-Assisted
Percentage of optics
processing                     35%          11%         ≤1%
Incremental speedup over
previous version               1            1.4×        1.1×
Absolute speedup over
baseline JMASS                 1            1.4×        1.5×

In contrast, the nVidia card, with its full 32-bit precision, yielded simulation results closely matching baseline JMASS. JMASS users continue to use the nVidia card and GPU-assisted JMASS to achieve faster simulations. We unfortunately did not test MKL with JMASS but acknowledge that, given our earlier experiments, doing so could yield comparable or even better performance than the GPU in some cases.
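The precision effect is easy to reproduce in miniature. Standard C++ has no 24-bit float type, but accumulating one 512² image's worth of products in 32-bit floats versus 64-bit doubles shows the same kind of drift that, at 24 bits, was enough to flip engagement outcomes. This toy example is ours, not from the JMASS code.

```cpp
#include <cstdio>

int main()
{
    const int n = 512 * 512;        // one 512x512 image's worth of products
    float  sumF = 0.0f;             // reduced-precision accumulator
    double sumD = 0.0;              // double-precision reference
    for (int i = 0; i < n; ++i) {
        double v = 1.0 + 1e-7 * i;  // synthetic pixel products near 1.0
        sumF += (float)v;           // rounding error accumulates here
        sumD += v;
    }
    std::printf("float : %.9g\ndouble: %.9g\nrelative error: %.3g\n",
                sumF, sumD, (sumD - sumF) / sumD);
    return 0;
}
```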

6. Conclusions and Discussion

We have demonstrated that GPU hardware can support JMASS simulations and can perform the reticle-scene multiply-add operation up to 3.5 times faster than unoptimized C++ and 2.8 times faster than cache-optimized MKL software-based solutions. The GPU advantage is greatest when processing larger image sizes, due to increased computational intensity. We achieved a 1.5 times speedup for JMASS by incorporating a lookup-based approach for processing reticle images that eliminates hundreds of thousands of unnecessary image transformation operations, reducing simulation execution time by 33%.

Nevertheless, despite the performance increase afforded by the graphics cards, the GPU's impact on overall JMASS performance did not reflect the speedup achievable by the GPU. This is not due to any problem with the GPU; rather, the multiply-add operation accounts for a small portion of the total JMASS execution time, so optimizing it has a correspondingly small effect (cf. Amdahl's law). The results of the first three phases of experiments indicate that the GPU could have a much greater impact, providing up to 3.5 times speedup, in applications where the multiply-add operation accounts for the bulk of the total execution time.

Since the GPU is idle most of the time, it is a ready source of computational power to speed up other portions of the JMASS simulation. IR scene generation is one example: graphics cards excel at rendering complex and dynamic 3-D scenes and so would be faster than the procedural methods currently used by JMASS to generate the scene images. Combining scene generation and multiply-add operations in the GPU would be especially efficient, because the scene would reside natively in GPU memory and would not need to be uploaded to the GPU with costly data transfers after every scene update.

We demonstrated that graphics cards can provide an impressive performance boost for a general computing application, provided the application exhibits SIMD-style parallelism and can maintain high enough computational intensity. We further showed that GPU acceleration can enable slower computers to meet or exceed the performance of faster and otherwise better-equipped machines. If GPU technology continues to improve as it has (in the 18 months since this research was conducted, GPUs have tripled the number of pixels that can be processed in parallel and doubled or quadrupled on-board memory sizes), the GPU could become the processor of choice for many applications. In the meantime, the latest graphics cards, which support floating point operations and can be flexibly programmed via rich APIs and shader programming languages, are better prepared than ever to meet the demands of scientific, engineering, and modeling and simulation applications.


7. References

[1] Defense Modeling and Simulation Office. 2005. Available from https://www.dmso.mil/public/
[2] Moreland, K., and E. Angel. 2003. The FFT on a GPU. In SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003, Eurographics Association, July, pp. 112-9.
[3] Kruger, J., and R. Westermann. 2003. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics 22 (3): 908-16.
[4] Buck, I., T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. 2004. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23 (3): 777-86.
[5] May, J., and M. E. Van Zee. 1983. Electro optical and infrared sensors. Microwave Journal, September, 121-31.
[6] Harris, M. 2005. Mapping computational concepts to GPUs. In GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation, edited by M. Pharr, 493-508. Reading, MA: Addison-Wesley.
[7] Buck, I. 2005. Taking the plunge into GPU computing. In GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation, edited by M. Pharr, 509-19. Reading, MA: Addison-Wesley.
[8] Pharr, M., ed. 2005. GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation. Reading, MA: Addison-Wesley.
[9] Gray, K. 2003. The Microsoft DirectX 9 programmable graphics pipeline. Redmond, WA: Microsoft Press.
[10] Larsen, E. S., and D. McAllister. 2001. Fast matrix multiplies using graphics hardware. In Proceedings of SC2001: The International Conference for High Performance Computing and Communications, November, Denver, CO.

Sean A. Jeffers is a communications and information officer in the United States Air Force, currently stationed at Hickam AFB, Hawaii. He received a B.S. in electrical engineering from the United States Air Force Academy and an M.S. in electrical engineering from the Air Force Institute of Technology (AFIT).

He received the National Defense Industry Association's Polk Award for the research described in this paper. He is a member of the Eta Kappa Nu electrical engineering honor society. His interests include computer science and physics.

Rusty O. Baldwin is an associate professor of computer engineering at the Air Force Institute of Technology (AFIT), Wright-Patterson AFB, Ohio. He received the BSEE degree (with honors) in 1987 from New Mexico State University and the MS degree in computer engineering in 1992 from AFIT. He received his PhD degree in electrical engineering in 1999 from Virginia Polytechnic Institute and State University. His research interests include computer communications protocols, software engineering, information warfare, and computer architecture. Dr. Baldwin is a senior member of IEEE and a member of Eta Kappa Nu.

Barry E. Mullins is an assistant professor of computer engineering in the Department of Electrical and Computer Engineering, Air Force Institute of Technology, Wright-Patterson AFB, Ohio. He received a BS in computer engineering (cum laude) from the University of Evansville in 1983, an MS in computer engineering from the Air Force Institute of Technology in 1987, and a PhD in electrical engineering from Virginia Polytechnic Institute and State University in 1997. He served 21 years in the Air Force, teaching at the U.S. Air Force Academy for 7 of those years. He is a registered Professional Engineer in Colorado and a member of Eta Kappa Nu, Tau Beta Pi, IEEE (senior member), and ASEE. Dr. Mullins has received the U.S. Air Force Academy's Outstanding Academy Educator award, as well as (twice) the Brig. Gen. R. E. Thomas award for outstanding contribution to cadet education. His research interests include computer communication networks, embedded (sensor) and wireless networking, information assurance, and reconfigurable computing systems.

