Simulation Interoperability Workshop, Paper 08S-SIW-025

Simulating Parallel Overlapping Universes in the Fifth Dimension with HyperWarpSpeed Implemented in the WarpIV Kernel

Jeffrey S. Steinman, Ph.D.
Craig N. Lammers
Maria E. Valinski
WarpIV Technologies, Inc.
5230 Carroll Canyon Road, Suite 306
San Diego, CA 92121
(858) 605-1646
[email protected], [email protected], [email protected]

Karen Roth
Air Force Research Laboratory
AFRL/RISB
525 Brooks Road
Rome, NY 13441-4505
(315) 330-7878
[email protected]

Key Words: WarpIV Kernel, HyperWarpSpeed, SPEEDES, Estimation and Prediction, Control Theory, DSAP, Common Operational Picture, UDOP, Optimization, Multicore, PDES, OSI-PDMS, OpenMSA, OSAMS

ABSTRACT: This paper describes a revolutionary information processing technology that may increase warfighter productivity by orders of magnitude for problems requiring analysis, planning, optimization, and dynamic situation assessment and prediction in live operational environments. This technology, known as HyperWarpSpeed, breaks the four-dimensional barrier of space and time by extending modeling, simulation, and analysis into the fifth dimension. HyperWarpSpeed optimizes the simulation of parallel overlapping universes using Bayesian branching, multireplication state variable data types, event splitting, and event merging techniques. Multiple excursions can be explored within a single simulation execution while aggressively sharing computations between parallel branches whenever possible. This is in stark contrast to traditional and wasteful brute-force Monte Carlo approaches requiring the execution of hundreds or thousands of simulation replications to achieve statistical validity and/or to explore all relevant potential outcomes. In addition, HyperWarpSpeed is able to achieve scalable computational speedup on emerging multicore and distributed computing architectures. The multicore revolution that is already underway requires next-generation software systems to fully embrace parallel processing to achieve scalable performance. With its rollback-based discrete-event processing algorithms, HyperWarpSpeed is able to support real-time dynamic situation assessment and prediction of complex battlefield scenarios in live operational settings. Control theory techniques, such as Kalman Filters, continually calibrate the simulation with live measurements as they are received while dynamically predicting future outcomes based on best estimates of the current state. Affected predictions are rolled back as live data is received to continually maintain best estimates of the real-time operational picture. These capabilities are implemented within the freely available open source WarpIV Kernel that is provided as a reference implementation of the Open Modeling and Simulation Architecture (OpenMSA) and Open System Architecture for Modeling and Simulation (OSAMS) that are currently being investigated by the Open Source Initiative for Parallel and Distributed Modeling and Simulation (OSI-PDMS) study group. The first operational prototype of this technology was delivered to the Air Force Research Laboratory, Rome Labs in the fall of 2007. Although preliminary tests show orders of magnitude improvement over traditional Monte Carlo methods, HyperWarpSpeed requires further research and development to discover and unleash its full potential for command and control systems such as Air Operations Centers. HyperWarpSpeed introduces a paradigm shift in modeling, simulation, and analysis. This technology, when fully mature, will change the way models are represented, experiments are designed, analysis studies are conducted, optimization problems are solved, planning is performed, and real-time dynamic situation assessment and prediction capabilities are supported. Through open source distribution and standardization, this technology will benefit a wide range of future advanced computing programs, with direct benefits to the warfighter.

1.0 Introduction

Simulations are often used to analyze the performance of battle plans for complex scenarios. Such analytic simulations are typically executed many times with different parameter settings to determine the optimal configuration of a complex system. Monte Carlo simulations that model critical decisions using probabilistic distributions and random number generation normally require large numbers of end-to-end simulation replications to determine the statistical measures of effectiveness and/or performance of complex systems. Real-time estimation and prediction decision aid tools project the future through simulation using live data to estimate the current state. Multiple-hypothesis what-if branching capabilities are necessary to rapidly explore the effectiveness of various decision options and battle plans [1, 2, 13].

Large scenarios that consume large amounts of memory can be supported in a scalable manner on distributed-memory parallel computing architectures by distributing simulated entities across processors. However, multiple-hypothesis branching simulation applications should be supported with minimal computational overhead on single-processor machines, multi-processor machines, and/or networks of such machines.

Monte Carlo simulations are used to explore the effects of: (1) different model-parameter settings, (2) different starting random number seeds, (3) different battle plans, and/or (4) specific decisions within a plan. Grid computing techniques can help speed up Monte Carlo simulation execution times by farming independent replications to processing resources across a network. However, this simple approach ignores the fact that each of the replicated simulation executions may redundantly repeat many of the same time-consuming calculations. Furthermore, final statistical outcomes are generally provided only at the completion of the simulation runs. It is very difficult to determine what the common threads of execution are, where critical decision points occur, and how each decision relates to others across all replications.

This paper describes techniques that address all of these stated issues. First, the branching techniques that manage parallel overlapping universes in the fifth dimension are described. Then, background on parallel processing with an emphasis on discrete-event simulation is presented, along with a discussion on new standards that are being promoted within the Simulation Interoperability Standards Organization (SISO). Maturing these standards is essential in bringing this next-generation technology to mainstream programs. A brief conceptual background is then given on real-time estimation and prediction, indicating how the combined concepts of control theory and optimistic simulation can be applied to support dynamic situation assessment and prediction. Then a discussion is provided on how to statistically analyze results generated by multiple branches using the Statistical Algebra Package provided within the WarpIV Kernel. The discussion then shifts to capturing results in a rollbackable and dynamically constructed futures graph that can be analyzed to determine statistical results of various decisions. Next, a comparison is made between HyperWarpSpeed and other approaches. Finally, several sets of test and performance results are provided for different parallel processing hardware architectures to verify correctness of the technical approach and to benchmark the performance of several representative real-world problems.

Some improvements over the brute force Monte Carlo approach have been explored. One example is to replicate (or fork) the simulation only at branch points [3]. This has the appeal of sharing computations up to the branch point, but it does not take full advantage of computation sharing afterwards. Furthermore, simulation forking can be cumbersome to operate and manage. Another approach clones objects within an executing simulation as decisions are made [8]. While this offers a higher degree of computation sharing, it still does not take full advantage of the potential computation sharing for complex objects with multiple concurrent event threads and patterns. Furthermore, neither approach merges branches as they converge to eliminate redundant event computations. A computationally more efficient approach to support these types of simulations is to automatically detect and reuse redundant event calculations that are shared between the multiple branches at the state attribute level. Unique calculations for a particular branch would only be performed when required. Furthermore, scalable parallel-processing techniques must be incorporated to take advantage of emerging multicore computing architectures. Rollback-based optimistic time management algorithms should be used to support the integration of live data feeds to continually calibrate the simulation with the most current information.


2.0 Parallel Overlapping Universes

Perhaps the best way to explain the HyperWarpSpeed concepts of simulating parallel overlapping universes in the fifth dimension is through several examples. The first example describes a scenario where Jim and Nancy order food at a restaurant. This example is simple enough to discuss the five-dimensional concepts without getting bogged down by low-level details. The other examples in this section describe military scenarios that might be explored in a five-dimensional universe.

Example 1: Jim and Nancy at a Restaurant

Imagine within a simulated world that two friends, Jim and Nancy, agree to meet for lunch. Jim arrives at the restaurant first and decides to order his meal. Nancy is running late, so Jim finishes his lunch and then orders dessert. Jim now has to decide whether to order a pricey ice cream sundae or an inexpensive chocolate chip cookie. At this decision point, the HyperWarpSpeed time management algorithm branches Jim into two overlapping parallel universes. In one universe, Jim enjoys his ice cream sundae, while in the other universe Jim eats his cookie. Meanwhile, cars driving by the restaurant are unaffected by what dessert Jim has ordered. Both branches (universes) of Jim share (overlap) the modeling of the automobile traffic occurring outside the restaurant.

Nancy finally arrives as Jim is eating his dessert. Independent of what Jim ordered, Nancy asks Jim if he would share his dessert with her. At this point, Nancy automatically splits into two. In one universe, Nancy enjoys the ice cream sundae with Jim. In the other universe, Nancy shares Jim's cookie. After they finish, Jim and Nancy leave the restaurant and continue on with the rest of their day. Their universes that had previously branched and split now merge because the rest of their activities do not depend on what they ate for dessert. That night, Jim decides to go to an expensive restaurant for dinner. Jim really wants to order steak, but it is expensive. The Jim that had the cookie has enough money in his wallet to order the steak dinner, but the Jim in the alternate universe who ordered the ice cream sundae only has enough money for the chicken dinner. Jim automatically splits again as he orders his meal. What caused the split is the fact that the money in Jim's wallet depended on the decisions he made earlier in the day.

While this example assumes the rest of the world is unaffected by the couple's behavior, it is entirely possible to construct a simulation in which the branches are completely independent from one another. For example, branching on whether or not a region was attacked by a nuclear weapon could affect the entire simulation in that there would be nothing in common and no possibility of overlap between the branches from that point forward.

Example 2: Time Critical Attack Mission

In this second example, imagine that an intelligence report is received identifying the current location of a high-profile terrorist leader. A time-critical decision (branch) must be made on whether to (1) dispatch an aircraft on a bombing mission to destroy the reported location, or (2) wait until further confirmation of the report is received. In either case, there is potential for civilian loss.

If the aircraft is immediately dispatched, bombs the target, and it turns out that the terrorist leader was not there, innocent civilians may be killed, resulting in strong political backlash. On the other hand, if the terrorist leader is killed, a significant blow will have been dealt to the enemy. If the aircraft is not dispatched until further confirmation is received, the terrorist leader may have time to escape. Innocent civilians may still be killed. However, it is still possible (but less likely) that the aircraft will destroy the terrorist leader before he moves to another location. In this example, branching may occur in several places: (1) the decision on whether or not to launch the aircraft, (2) whether or not the first intelligence report was accurate, (3) if the intelligence report was accurate, at what time the terrorist leader moved to another location, and (4) whether or not civilians were killed. Event splitting would occur when the bomb hit the target and potentially killed the terrorist leader. Merging would occur in cases where the bomb was dropped on the target and the outcome was the same, regardless of what time the bomb was released. Examples of such merges are (1) the terrorist leader was or was not killed independent of when the aircraft bombed the target, and/or (2) civilians were or were not killed independent of whether or not the terrorist leader was also killed.

Example 3: Military Weapon-Target Assignment

In this third example, a blue aircraft unexpectedly encounters an enemy Surface to Air Missile (SAM) site as it flies towards its target on a time-critical bombing mission. The aircraft must decide whether to (1) fire a missile at the SAM site, or (2) divert its flight path to avoid the SAM site and continue with the mission. If the aircraft fires a missile at the SAM site, it may not be able to defend itself later during the mission. On the other hand, if it diverts its flight path, the SAM site might destroy blue aircraft carrying out other missions. In addition, a diverted aircraft may not reach its intended target in time to be effective.


Branching occurs on the aircraft’s decision to attack the SAM site or to divert its flight path. Events for other aircraft in the simulation split as they encounter the SAM site. Merging occurs when any aircraft in the simulation takes out the SAM site. Merging also occurs if the target is destroyed by either of the branches taken by the aircraft.

Example 4: Urban Patrol

Soldiers are on a mission to patrol the streets of a foreign city. The goal is to search for terrorists, find weapon stashes, and/or promote stability and trust within the neighborhoods. The soldiers fan out into smaller teams as they patrol each street in an orderly manner. While on patrol, one team of soldiers is informed by a local resident that terrorists are holed up in a particular residence. The soldiers now must decide what to do. One option is to break down the door of the house with weapons firing in a surprise raid. Another option is to surround the house and then issue a warning for the terrorists to surrender. The first option might be the better tactic because it would catch the enemy off guard. However, this tactic is still dangerous, and may involve the accidental loss of innocent life if it turns out that peaceful civilians actually occupy the house. On the other hand, the second option could be more dangerous for the soldiers because the terrorists might have time to take cover, draw their weapons, signal for help, and prepare for a long drawn-out fight. In addition, innocent civilians might get caught in the crossfire. Other possibilities might include options such as the soldiers radioing to command for approval and perhaps requesting additional backup support before engaging the enemy.

In HyperWarpSpeed, each battle option would be carried out in a separate branch. Merging would potentially occur after the battle for branches where (1) the enemy was destroyed, (2) the soldiers did not survive, (3) it turned out that there were no terrorists in the home, and/or (4) innocent civilians were, or were not, killed. In any case, the combat activity would not necessarily affect the soldiers patrolling other remote areas, which means that those common computations would be shared between the HyperWarpSpeed branches.

2.1 HyperWarpSpeed

The HyperWarpSpeed branching technology was developed to (1) transparently manage large numbers of Monte Carlo replications within a single simulation execution while sharing all common computations between replications, (2) provide scalable performance on a wide variety of sequential, parallel, and distributed computing architectures, and (3) support real-time estimation and prediction [6] using live data and control theory [10] to continuously calibrate the simulation while aggressively predicting future outcomes [20]. The HyperWarpSpeed algorithm provides three significant performance benefits. First, redundant computations that would be performed by running independent Monte Carlo simulation replications are eliminated. If each Monte Carlo replication performed 90% of the same computations as the others, then combining these replications into a single execution would improve performance by an order of magnitude. Second, optimistic time management algorithms provide parallel processing capabilities on emerging multicore computer architectures. Scalable parallel processing can provide additional orders of magnitude in speedup. Third, large computational savings can be achieved by continuously integrating live data into the simulation through rollback-based optimistic event processing and mathematical control theory techniques. Instead of rerunning Monte Carlo simulation replications as new data arrives, only the portions of the simulation that are affected by the newly received data are rolled back and recomputed (or rolled forward if possible). If restarting new simulations recalculated 90% of the same computations from previous runs, another order of magnitude performance gain would be achieved.

One can think of this approach as continually updating the three-dimensional Common Operational Picture (COP) that is embedded within the HyperWarpSpeed simulation using real-time estimation techniques. Predictions are continually refined because they are based on the embedded HyperWarpSpeed COP as it evolves over time. By providing rollback and rollforward event processing capabilities, control over the fourth dimension (i.e., time) is achieved for the embedded COP. Integrating the four-dimensional COP with branching technology provides a five-dimensional COP capability that is able to manage and explore multiple futures. This new approach that expands on previous space-time concepts can be thought of as simulating parallel overlapping worlds in a five-dimensional interactive universe. Concepts involving time travel and parallel universes, previously confined to science fiction and fantasy, now play a significant role in next-generation modeling and simulation technology.

The HyperWarpSpeed capabilities described in this paper are provided to the model developer and simulation operator in a manner that is robust and easy to use. The underlying complexities are completely hidden from the model developer. Providing this extremely complex capability in a simple and transparent manner is critical for promoting its widespread use for command and control applications requiring dynamic situation assessment and prediction capabilities.
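The order-of-magnitude figures quoted above follow from a simple sharing model (a back-of-the-envelope sketch; the shared fraction f is an assumed parameter, not a measurement reported in this paper). If N replications each perform one unit of work and a fraction f of that work is common to all of them, then a combined execution costs roughly f + N(1 - f) units, giving

    speedup = N / (f + N(1 - f))  ->  1 / (1 - f)  as N grows large

so f = 0.9 yields roughly a factor of ten, consistent with the claims above.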


2.1.1 HyperWarpSpeed Time Management

Fundamental to understanding HyperWarpSpeed is the notion of replication sets. Replication numbers uniquely identify individual runs of Monte Carlo simulations with different input parameters, random number seeds, battle plans, and/or specific decisions within a plan. For example, 128 replications would consist of replications {0, 1, … 127}. Replication numbers can be grouped into sets that are implemented as bit fields in an integer array. Each bit in the array represents a particular replication number. A replication set with 128 replications would require an array of four 32-bit integers. A simple example of a replication set with 32 potential replications containing replications {0, 1, 4, 9, 15, 21, 29} would be stored as binary 0010 0000 0010 0000 1000 0010 0001 0011, where the rightmost bit represents replication 0, the next bit to the left represents replication 1, etc. High-speed bit-masking instructions automate critical bookkeeping operations for replication sets. To achieve high performance, it is extremely critical to minimize all computational overheads when manipulating bit fields on replication sets.
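The bit-field bookkeeping described above can be pictured with the following minimal C++ sketch. The class name and methods are illustrative assumptions for this discussion, not the actual WarpIV Kernel interface.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative replication set: one bit per replication number, packed
    // into 32-bit words (128 replications -> four words).
    class ReplicationSet {
     public:
      explicit ReplicationSet(int numReplications)
          : words((numReplications + 31) / 32, 0u) {}

      void Add(int rep) { words[rep / 32] |= (1u << (rep % 32)); }
      bool Contains(int rep) const { return (words[rep / 32] >> (rep % 32)) & 1u; }

      // AND with another set, e.g., the CurrentReplicationSet with a value mask.
      void IntersectWith(const ReplicationSet& other) {
        for (std::size_t i = 0; i < words.size(); ++i) words[i] &= other.words[i];
      }

      // XOR away a processed subset (valid when 'other' is contained in this set).
      void RemoveProcessed(const ReplicationSet& other) {
        for (std::size_t i = 0; i < words.size(); ++i) words[i] ^= other.words[i];
      }

     private:
      std::vector<uint32_t> words;
    };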

Each event in HyperWarpSpeed is associated with a replication set. At the start of the simulation, all initial events represent the full set of replications supported by the replication set. When Bayesian branching occurs, the replication set of the current event splits based on a specified or computed probability for each branch taken. The branching construct is extremely simple to use and looks like the familiar C++ switch/case statement. The first two arguments identify labels for the two branch points. The third argument is the probability of taking the first branch.¹

    BRANCH(1, 2, 0.7)
        LABEL(1)
            // Code that handles first branch
            break;
        LABEL(2)
            // Code that handles second branch
            break;
    END_BRANCH

¹ The current branching interface can be extended to support multiple branch labels within a single branch statement.

New multireplication primitive data types for integers, doubles, and Booleans have been developed that internally maintain different values depending on how they are modified by events with different replication sets. Through operator overloading, accessing and assigning values to multireplication data types is completely transparent to the modeler. These new data types simply behave as normal integer, double, and Boolean variables.

The internal storage structure for the primitive multireplication data types is shown in Figure 1.

Figure 1: Multireplication data types are represented as an array of values with a replication mask for each value. Bits set in the mask identify which other array elements have the same value. Current data types supported are int, double, and Boolean.

Each multireplication data type manages an array of values along with a bitfield for each value. The index of the array indicates the replication number. Each bit in the bitfield represents the set of replications that have the same value as the array element's value (see Table 1).

Table 1: An example of a replicated integer with 32 replications. The value for each replication and the replication bit-mask are stored in the array. The bit-mask identifies which replications have the same value. Since the value 0 is stored in replications {0, 9, 31}, the bit mask has three 1's stored in the corresponding bits. The rest of the bits are set to 0.


    Rep   Value   Mask (binary)
    0     0       1000 0000 0000 0000 0000 0010 0000 0001
    1     5       0000 0000 1000 0000 0100 0001 1000 0010
    2     1       0100 0000 0100 1100 0011 0000 0111 0100
    3     2       0011 0001 0011 0001 1000 0000 0000 1000
    4     1       0100 0000 0100 1100 0011 0000 0111 0100
    5     1       0100 0000 0100 1100 0011 0000 0111 0100
    6     1       0100 0000 0100 1100 0011 0000 0111 0100
    7     5       0000 0000 1000 0000 0100 0001 1000 0010
    8     5       0000 0000 1000 0000 0100 0001 1000 0010
    9     0       1000 0000 0000 0000 0000 0010 0000 0001
    10    3       0000 1100 0000 0000 0000 0100 0000 0000
    11    4       0000 0010 0000 0000 0000 1000 0000 0000
    12    1       0100 0000 0100 1100 0011 0000 0111 0100
    13    1       0100 0000 0100 1100 0011 0000 0111 0100
    14    5       0000 0000 1000 0000 0100 0001 1000 0010
    15    2       0011 0001 0011 0001 1000 0000 0000 1000
    16    2       0011 0001 0011 0001 1000 0000 0000 1000
    17    7       0000 0000 0000 0010 0000 0000 0000 0000
    18    1       0100 0000 0100 1100 0011 0000 0111 0100
    19    1       0100 0000 0100 1100 0011 0000 0111 0100
    20    2       0011 0001 0011 0001 1000 0000 0000 1000
    21    2       0011 0001 0011 0001 1000 0000 0000 1000
    22    1       0100 0000 0100 1100 0011 0000 0111 0100
    23    5       0000 0000 1000 0000 0100 0001 1000 0010
    24    2       0011 0001 0011 0001 1000 0000 0000 1000
    25    4       0000 0010 0000 0000 0000 1000 0000 0000
    26    3       0000 1100 0000 0000 0000 0100 0000 0000
    27    3       0000 1100 0000 0000 0000 0100 0000 0000
    28    2       0011 0001 0011 0001 1000 0000 0000 1000
    29    2       0011 0001 0011 0001 1000 0000 0000 1000
    30    1       0100 0000 0100 1100 0011 0000 0111 0100
    31    0       1000 0000 0000 0000 0000 0010 0000 0001

As in the earlier example, the rightmost bit of each mask represents replication 0.
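A minimal sketch of how a multireplication integer of the kind shown in Figure 1 and Table 1 might be organized is given below. The class and method names are illustrative assumptions and do not reflect the actual WarpIV Kernel implementation, which also integrates rollback support and operator overloading.

    #include <cstdint>
    #include <vector>

    // Illustrative 32-replication integer (see Figure 1 and Table 1):
    // values[r] holds the value seen by replication r, and masks[r] is a bit
    // field identifying every replication that currently shares that value.
    class MultiRepInt {
     public:
      explicit MultiRepInt(int value = 0)
          : values(32, value), masks(32, 0xFFFFFFFFu) {}  // one shared value initially

      int Get(int rep) const { return values[rep]; }
      uint32_t SharedWith(int rep) const { return masks[rep]; }

      // Assign 'value' for every replication whose bit is set in repMask.
      void Set(uint32_t repMask, int value) {
        for (int rep = 0; rep < 32; ++rep)
          if (repMask & (1u << rep)) values[rep] = value;
        RebuildMasks();
      }

     private:
      // Naive O(N^2) regrouping; this is the bookkeeping step the paper warns
      // can become a bottleneck when assignments occur frequently.
      void RebuildMasks() {
        for (int rep = 0; rep < 32; ++rep) {
          uint32_t mask = 0;
          for (int other = 0; other < 32; ++other)
            if (values[other] == values[rep]) mask |= (1u << other);
          masks[rep] = mask;
        }
      }

      std::vector<int> values;
      std::vector<uint32_t> masks;
    };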

Before an event is processed, its replication-set bitfield is copied into a globally accessible variable known as the CurrentReplicationSet. This variable stores the current set of replications to be processed by the event. The first replication in the CurrentReplicationSet, called the FirstReplication, is also stored in another globally accessible variable. Its replication value corresponds to the right-most bit that is set in the CurrentReplicationSet.

When a replicated state variable is accessed within an event, the bit-field mask stored in its FirstReplication array element is ANDed with the CurrentReplicationSet. This may reduce the collection of replications in the CurrentReplicationSet if the associated replicated values in the model's state differ across replications. When the processing of the event has completed, the CurrentReplicationSet is XORed with the remaining replication set. The result is stored back in the CurrentReplicationSet. If the resulting value is non-zero, then the event is processed again with the new reduced CurrentReplicationSet and FirstReplication value. This process continues until all of the bits stored in the CurrentReplicationSet become zero. In this manner, event splitting based on accessing state variables that have different values for different replication sets is automated. Note that this approach does not require individually examining each replicated state variable as it is accessed to determine which ones have the same value. The bitfield mask automates this very efficiently. However, when an event modifies a replicated value, a check is required to determine if the new value is shared by other replication sets, or if it constitutes a new set. For large numbers of replications, this can be computationally intensive. Performance bottlenecks can occur if the assignment operation is not fully optimized or when multireplication variable assignments occur frequently. An additional complexity arises when an event modifies state variables and then later accesses other variables, causing a reduction in the size of the event's current replication set. Values that were modified but then later eliminated in the event-splitting processing cycle must be restored to their original values before reprocessing the event with the reduced replication set. Special use of the WarpIV Kernel rollback framework restores all incorrectly modified variables back to their original values in a straightforward manner.

Event splitting occurs as an event is processed whenever it accesses state variables that take on different values for the event's replication set. Event splitting causes the event to be processed multiple times for subsets of its original replication set. In the previous illustration, the event used to model Jim ordering his dinner automatically split when he accessed the state variable representing the amount of money he had in his wallet. The HyperWarpSpeed algorithm automatically processed the event twice, once to order chicken and then again to order steak. Events can potentially be merged to minimize excessive branching and splitting. Merging the current event with other pending events scheduled for the same time and for the same entity may occur first before processing events. Merging may also occur afterwards for newly generated events that are scheduled during the processing of an event as it splits. Event merging is an optimization that combines the processing of identical events generated by different branches or splits. A flow chart is provided in Figure 2 showing the internal HyperWarpSpeed event merging and processing logic.

Figure 2: Flow chart showing the event processing sequence in HyperWarpSpeed. An attempt is made to merge identical pending events first before processing the current event. The global replication mask is then determined using the collective replication sets of all merged events. The event is then potentially processed multiple times as event splitting occurs. Finally, all newly generated event messages that were scheduled during event splitting are merged whenever possible.

To support predictive capabilities in a real-time operational setting, HyperWarpSpeed can be configured to receive live inputs that roll back and calibrate the simulated state in real time. If received data does not affect event processing, those events that were rolled back can be rolled forward instead of being reprocessed. Integration of real-time sensor data into Kalman Filters for estimating and predicting enemy behaviors was recently demonstrated to the Air Force. The WarpIV Kernel provides an advanced rollback/rollforward framework that allows events to transparently undo and potentially redo their state-changing operations.
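The splitting cycle described above can be summarized in the following C++ sketch of the control flow only. The names CurrentReplicationSet and FirstReplication follow the text; the stub types and helper calls are illustrative assumptions, not the actual WarpIV Kernel code.

    #include <cstdint>

    // Minimal stand-ins so the control flow below compiles; the real event,
    // state, and rollback machinery live inside the WarpIV Kernel.
    struct ReplicationSet {
      uint32_t bits = 0;
      bool Empty() const { return bits == 0; }
      void IntersectWith(uint32_t mask) { bits &= mask; }
      void RemoveProcessed(const ReplicationSet& other) { bits ^= other.bits; }
    };

    struct Event {
      // Processes the event for the replications in 'set'. Each access to a
      // replicated state variable intersects 'set' with that variable's mask,
      // which is what triggers event splitting.
      void Process(ReplicationSet& set) { set.IntersectWith(0xFFFFFFFFu); }
    };

    void RollBackChangesOutside(Event&, const ReplicationSet&) {
      // Placeholder: the rollback framework restores values modified for
      // replications that were split away before the event is reprocessed.
    }

    // Illustrative splitting loop: process, shrink, roll back, repeat.
    void ProcessEventWithSplitting(Event& event, ReplicationSet currentSet) {
      while (!currentSet.Empty()) {
        ReplicationSet processedSet = currentSet;  // CurrentReplicationSet
        event.Process(processedSet);               // may shrink during processing
        RollBackChangesOutside(event, processedSet);
        currentSet.RemoveProcessed(processedSet);  // XOR away the processed subset
      }
    }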

3.0 Scalable Parallel Processing

With the multicore revolution underway, tomorrow's software must be written to take advantage of parallel processing resources. Next-generation computers may actually run at reduced clock speeds to accommodate large numbers of processors within a single chip. This means that unless software applications are modified or newly developed to make use of multiple processors, future modeling and simulation programs may actually run slower than they do today. It is therefore critical for next-generation modeling and simulation technologies, such as HyperWarpSpeed, to fully embrace parallel computing methodologies. The following subsections motivate the need for parallel processing, describe how parallel processing is achieved in the WarpIV Kernel, and then discuss open source standards underway for modeling and simulation on parallel computers.

The following excerpt, taken from a recent news article, signifies the importance of developing new scalable software solutions to address the emerging multicore hardware revolution.

    … Experts predict dire consequences if the software for more complicated applications isn't brought up to speed soon. They warn that programs could suddenly stop getting faster as chips with eight or more cores make their way into PCs. The software as it's currently designed can't take advantage of that level of complexity.

    "We'd be in uncharted territory," Patterson said. "We need to get some Manhattan Projects going here – somebody could solve this problem, and whoever solves this problem could have this gigantic advantage on everybody else."

    "New Generation of Processors Presents Big Problems, Potential Payoffs for Software Industry." Jordan Robertson, Associated Press, July 22, 2007.

3.1 Next Generation Multicore Computing

Single-processor computing technology has hit the performance wall.² Computer experts universally agree that the only cost-effective way forward in improving runtime performance is through multicore chip designs along with the development of new software applications that are able to take advantage of parallel processing. While Graphics Processing Units (GPUs) have exploited multicore technologies for many years, they are very specialized in their processing capabilities and have extremely limited output bandwidth for gathering results.

General-purpose multicore computing capabilities are quickly becoming mainstream. Affordable dual quad-core computers operating at 3 GHz are common in today's marketplace. More exotic tiled-core architectures are becoming available.³ Based on Moore's law, experts predict that processing cores per chip will double every 18 to 24 months. Within a few years, affordable desktop computers will be configured with 16 to 32 processors (or more). By 2021, desktop computers with more than 1,000 cores are predicted to be commonplace.

Hardware and software experts in the field of parallel computing recognize the significance of the multicore revolution. Yet, there is still much uncertainty concerning the direction that it will actually take. Hardware vendors are actively canvassing the HPC user community for guidance on how applications might best take advantage of next generation chip designs. Balancing processor speeds with shared memory cache architectures presents the greatest challenge. Distributed shared memory with cache coherency is necessary for offering scalability on large multicore systems. To further optimize performance, some hardware vendors have even questioned the need for providing cache coherency altogether and would prefer to leave the details of synchronizing shared memory to software programmers.⁴ However, virtually all software developers are in complete agreement that simpler parallel programming paradigms are necessary to develop, debug, and maintain multicore programs.

The general consensus is that low-level parallel programming techniques are far too challenging for the average software developer. A robust abstract parallel programming framework is strongly preferred to reduce the cost of developing multicore software applications.⁵ The WarpIV Kernel provides a parallel and distributed object-oriented computing framework that automates parallel processing on a wide variety of hardware and network architectures [19]. The WarpIV Kernel was designed to support multiple programming paradigms and not just simulation.

² Three serious performance walls have limited the advancement of single CPU performance. First, the power consumed (and heat generated) by increasing the clock speed changes as the square of the clock speed. Current clock speeds are at the limit of being practical. Second, Instruction Level Parallelism (ILP) that exploits vector processing, instruction pipelining, branch prediction, etc. has virtually reached its limit. Third, the gap between processing speed and memory has widened. Because memory access is a serious bottleneck, faster processors will not necessarily produce faster results. Optimal computing performance requires a balanced chip-design architecture.

³ "Tilera Corp. (Santa Clara, Calif.) says it has started shipping its Tile64 processor, a 64-processor chip based on an architecture that could scale to hundreds and even thousands of cores, according to the company… Production pricing for the Tile64 family starts at $435 in 10,000 unit quantities." EE Times Online.

⁴ The IBM Power4 Architecture does not provide cache coherency between processors interacting through distributed shared memory. Furthermore, to improve processing efficiency, instructions might not be processed in the expected order. These issues can lead to situations where one node writes multiple variables in shared memory that are transmitted and then read by other nodes in a different order. So for example, a flag indicating that a variable is safe to read could be handled incorrectly unless the program invokes special memory and/or instruction synchronization services.

⁵ Microsoft Corp. sponsored a special by-invitation-only workshop on manycore computing on June 20-21, 2007. Participants were the "Who's Who" in the parallel computing industry. During the workshop, experts uniformly agreed that parallel programming frameworks were needed to simplify the development of parallel applications. Dr. Steinman gave a presentation on the open source standardization efforts that are underway within SISO and the OSI-PDMS Study Group.

The supercomputing community has explored various programming approaches over the last twenty years (see Figure 3). These include: (1) vector processing, (2) message-passing, (3) shared memory, (4) threads, (5) grid computing, (6) transactional memory, (7) distributed shared memory and cache coherency, (8) parallelizing compilers, (9) parallel math libraries, (10) distributed object technologies, and (11) parallel discrete-event simulation.

Figure 3: Parallel and distributed computing paradigms.

As previously stated, experts uniformly agree that a high-level programming framework is needed to streamline software development. Debugging parallelized software using low-level concurrency primitives such as thread or shared memory locking mechanisms is simply too difficult and expensive for mainstream programs. Often, bugs in parallel software applications are not repeatable, which can make fixing them almost impossible without a high degree of expertise. The WarpIV Kernel provides a high-level object-oriented programming framework for parallel applications. In addition, the WarpIV Kernel provides specialized support for parallel simulation and its new extensions for HyperWarpSpeed.

3.2 Parallel Discrete Event Simulation

The fundamental challenge of executing a simulation in parallel on multiple processors is to ensure that each object represented within the simulation processes its events in causally correct ascending time order, while also achieving maximal processing efficiency and concurrency [5]. Optimistic event processing techniques such as the Time Warp algorithm [9] are primarily used to support complex parallel simulations that place no limitations on how distributed objects interact. Optimistic simulations allow any object to interact with any other object on any other processing node at any time. This is in contrast to conservative techniques that either (1) limit which objects can interact with which other objects, (2) place constraints on how tightly in time objects on remote nodes can interact with one another, and/or (3) place first-in first-out event ordering rules on inter-object interactions. Rollback techniques allow objects on any compute node to process their events optimistically, assuming that the processing of the event will not become invalid due to the arrival of an earlier time-tagged event scheduled by a simulated object residing on another node. When such straggler messages are received, all optimistically processed events for the receiving object that were processed with time tags greater than the time tag of the straggler event are rolled back. When using incremental state saving techniques, the events are rolled back in reverse order, much like the familiar undo mechanisms that are provided by virtually all commercial software applications.

Events perform two basic operations. They arbitrarily (1) modify state variables, and (2) generate new events. Rollbacks must therefore undo those two operations. This means that the rollback infrastructure of the optimistic simulation engine must be capable of (1) undoing the state changes made by each event and (2) retracting any new events that were scheduled. Both incremental and full state saving techniques have been used in the WarpIV Kernel [17] to restore state as rollbacks occur. Full state-saving techniques that are often supported in other systems save the entire state of a modeled object before each event is processed. This can be efficient for objects having small states. However, full state saving can become highly inefficient both in terms of memory consumption and processing overheads for complex objects having large states that are fragmented across many memory segments. The WarpIV Kernel primarily uses automatic incremental state saving techniques that capture each change made to the state as they occur. The rollback infrastructure simply undoes those changes in reverse order. A fully functional rollback infrastructure must not only support the restoration of primitive state variables, but also external I/O, generic dynamic memory allocation/deallocation, and container classes such as trees or lists that store the dynamically allocated memory.

When events are scheduled for objects residing on other nodes, antimessages are sent to retract event-scheduling messages if the scheduling event is subsequently rolled back. This can cause cascading rollbacks and further antimessages if those retracted events were processed before they were retracted. Various optimistic flow control techniques have been developed in the WarpIV Kernel to keep the number of rollbacks and antimessages stable. These techniques include (1) holding back high-risk messages until it is safe (or safer) to release them and (2) limiting event-processing optimism on runaway nodes. A good rollback infrastructure only rolls back those events that are associated with each individual object, and not the entire state of the simulation on a node. Object-based (as opposed to node-based) rollback techniques significantly reduce the number of rollbacks experienced by an application as it executes in parallel.

3.3 Composability Standards for M&S

The WarpIV Kernel provides a composable technology framework for hosting complex parallel and distributed simulations. All of the low-level communications and time-concurrency mechanisms necessary to achieve scalable performance are built into the high-level event scheduling and processing abstractions. The WarpIV Kernel is a composable top-to-bottom layered architecture that conforms to the Open Modeling and Simulation Architecture (OpenMSA) [18] and the Open System Architecture for Modeling and Simulation (OSAMS) [21] standards that have been discussed in technical papers and forums over the last several years.

Model composability, in contrast to technology composability, is challenging due to many factors such as conceptual model representations, integrating models with mixed resolutions, and basic integration challenges [4]. The OpenMSA defines a bottom-up layered dependency architecture focusing on the critical technologies of parallel and distributed simulation. OSAMS, on the other hand, focuses on mixed-resolution model interoperability standards that promote the development of plug-and-play software model components. Flexible hierarchical composition techniques and abstract interfaces decouple mixed-resolution model components while promoting their full interoperability in a parallel and distributed computing environment.

A new SISO [15] study group, known as The Open Source Initiative for Parallel and Distributed Modeling and Simulation (OSI-PDMS) [16], was formed September 18, 2007 to study such composability architectures with the potential goal of forming new standards. The WarpIV Kernel provides a freely available open source reference implementation of both the OpenMSA and OSAMS architectures. HyperWarpSpeed will be included within the OpenMSA and OSAMS architecture specifications.

4.0 Estimation and Prediction

Using optimistic simulation to support real-time estimation and prediction is conceptually straightforward (see Figure 4). The simulation must always be running and predicting the future up to a specified end time or sliding time window while rolling back affected models occasionally to process real-time inputs. Global Virtual Time (GVT) is synchronized to the wall clock.

In the WarpIV Kernel, real-time inputs are handled as externally generated events. Inputs can be in the form of aiding terms that are used to describe accurately known changes to the system, or in the form of noisy measurements that require balancing the believability of the measurements with the system uncertainty specified within the model. In any case, the objects affected by the inputs are rolled back to the current real-world time. Processing these real-time inputs can be thought of as calibrating, or estimating, the real-time state of the simulation. After the inputs have been received and processed, the system must quickly roll forward and/or reprocess all events that were rolled back to continue predicting the future.

Figure 4: Estimation and Prediction. Noisy measurements are repeatedly combined with previous predictions over time to continually form the best state estimate of the system.

Kalman Filters [23] provide an excellent mechanism to form state estimates and predictions of linear control systems with noisy measurements, accurate inputs, and well understood dynamics. It is assumed that the true dynamical system can be represented as a linear matrix equation of the following form:

    X(t_{i+1}) = φ(t_{i+1}, t_i) X(t_i) + L(t_i) + U(t_i)

where:

• X(t_i) is the true state of the system at time t_i.

• φ(t_{i+1}, t_i) is the transition matrix that extrapolates the state of the system from time t_i to t_{i+1}.

• L(t_i) is the "aiding" term that describes known inputs to the system between time t_i and t_{i+1}.

• U(t_i) is the unknown inputs to the system between time t_i and t_{i+1}.

The true state of the system is never actually known, which is why it is important to perform estimations and predictions based on noisy input measurements and accurately known inputs that are collected regularly over time. Kalman Filters can have different representations depending on whether the estimation and prediction steps are performed separately or collectively within a single step. The current implementation uses the Kalman Filter representation that performs the estimation and prediction computation in a single step. The estimator/predictor form of the Kalman Filter is written as follows:


    X̂(t_{i+1}) = φ(t_{i+1}, t_i) X̂(t_i) + L(t_i) + K(t_i) [ Z(t_i) − H(t_i) X̂(t_i) ]

where:

• X̂(t_i) is the estimated/predicted state of the system at time t_i.

• Z(t_i) is a vector of measurements of the system taken at time t_i.

• H(t_i) is the measurement matrix that translates the system state to the set of measurements.

• K(t_i) is the Kalman Gain that weighs the correction to the state obtained from each measurement.

The Kalman Gains are computed from the covariance matrix that represents the expectation value of the correlated state errors. The covariance matrix takes into account estimates of the measurement noise and the system uncertainty during each update in time. These equations are shown below:

    K(t_i) = φ(t_{i+1}, t_i) P(t_i) H(t_i)^T [ R(t_i) + H(t_i) P(t_i) H(t_i)^T ]^{−1}

    P(t_{i+1}) = [ φ(t_{i+1}, t_i) − K(t_i) H(t_i) ] P(t_i) φ(t_{i+1}, t_i)^T + Q(t_i)

R(t_i) is the measurement noise vector and Q(t_i) is the system uncertainty vector at time t_i. Notice that the gains are small (i.e., approach 0 for simple systems) when R(t_i) gets large compared to Q(t_i). This represents the case when the model is extremely well known and more accurate than the measurements. Similarly, the gains become large (i.e., approach 1 for simple systems) when Q(t_i) is much larger than R(t_i). This represents the case when measurements are more accurate than the model.
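A minimal scalar C++ sketch of the single-step estimator/predictor update above is shown below, with φ, H, R, and Q reduced to scalars for clarity. It is an illustrative aid only and not the WarpIV Kernel tracking code.

    // One estimator/predictor step of a scalar Kalman filter:
    //   xHat(t_{i+1}) = phi*xHat + L + K*(z - H*xHat)
    // with the gain and covariance updated as in the equations above.
    struct ScalarKalman {
      double xHat = 0.0;  // estimated/predicted state
      double P = 1.0;     // state error covariance

      void Update(double z,    // noisy measurement Z(t_i)
                  double phi,  // state transition phi(t_{i+1}, t_i)
                  double H,    // measurement matrix (scalar)
                  double L,    // known "aiding" input L(t_i)
                  double R,    // measurement noise
                  double Q) {  // system (process) uncertainty
        const double K = phi * P * H / (R + H * P * H);  // Kalman gain
        xHat = phi * xHat + L + K * (z - H * xHat);      // estimate + predict
        P = (phi - K * H) * P * phi + Q;                 // covariance update
      }
    };

    // Usage sketch (hypothetical numbers): feed measurements as they arrive.
    //   ScalarKalman track;
    //   track.Update(/*z=*/412.7, /*phi=*/1.0, /*H=*/1.0, /*L=*/0.0, /*R=*/4.0, /*Q=*/0.5);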

Kalman Filters can be used to fuse noisy measurements from multiple sources. A good example is a tracking system that fuses detections from multiple radar and IR sensors. This capability has been demonstrated in the WarpIV Kernel by using real-time detections generated by multiple sensors to estimate aircraft motion within an optimistic simulation environment. This particular demonstration predicted future outcomes based on extrapolating real-time estimates of aircraft states. Initially, a simulation consisting of three aircraft, several ground radar systems, and a command center was executed to generate and log potentially noisy aircraft detections to a file. A second system, as shown in the sequence diagram in Figure 5, consisted of a detection server and simulation client. The detection server transmitted logged detections in real time to the simulation client, mimicking how an operational simulation might receive live measurements from multiple sensors. The simulation continually processed the detections using Kalman Filters to form estimates of the aircraft's position in real time. Predicted motion of the aircraft was based on the track state estimate, which in this case consisted of position, velocity, and acceleration in three dimensions.

Figure 5: Guiding simulated aircraft from real-time detections.

Calibrating the simulation with live data ensures that reasonable predictions are generated and eliminates the need to void a potentially invalid execution. Predicting enemy intents and complex behaviors based on past and current intelligence data [14] is a much more challenging topic that is well beyond the scope of this paper [1].

5.0 Statistical Analysis

An important component of the HyperWarpSpeed approach is to perform statistical analysis of intermediate and/or final results. This is accomplished by examining the distribution of various measures of effectiveness for the set of replications of interest (i.e., the ones that resulted from going down specific branch decisions). One of the capabilities provided by the WarpIV Kernel is its Statistical Algebra Package [22]. It supports algebraic operations (e.g., convolutions, etc.) using operator-overloading techniques and transcendental functions on variables representing statistical distributions. A new interface was developed to convert multireplication variables into statistical algebra variables. The interface allows statistical algebra distributions to be formed from multireplication variables and a replication mask, using the values specified by the mask. Statistical analysis to determine the effectiveness of various plans can be performed using statistical algebra distributions. The results may feed back into plan optimization, where plans that are not effective can be pruned. This capability may evolve into a learning environment where better decisions are made over time as various branches are explored and outcomes are determined.
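The following hedged C++ sketch shows the kind of conversion described above: collecting the values of a multireplication variable for the replications selected by a mask and summarizing them statistically. The function and type names are illustrative; the actual Statistical Algebra Package interface is not shown in this paper.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Summary statistics over the replications selected by a 32-bit mask.
    struct Distribution {
      double mean = 0.0;
      double stdDev = 0.0;
      int count = 0;
    };

    // values[r] is the value of a multireplication variable in replication r;
    // bit r of 'mask' selects whether replication r contributes.
    Distribution Summarize(const std::vector<double>& values, uint32_t mask) {
      Distribution d;
      double sum = 0.0, sumSq = 0.0;
      for (int rep = 0; rep < static_cast<int>(values.size()) && rep < 32; ++rep) {
        if (!(mask & (1u << rep))) continue;
        sum += values[rep];
        sumSq += values[rep] * values[rep];
        ++d.count;
      }
      if (d.count > 0) {
        d.mean = sum / d.count;
        d.stdDev = std::sqrt(std::max(0.0, sumSq / d.count - d.mean * d.mean));
      }
      return d;
    }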


5.1 Futures Graph Assembly

One of the most challenging aspects of HyperWarpSpeed is to consolidate the results of multiple branch excursions that were taken during the execution of the simulation. This capability has not yet been implemented in the WarpIV Kernel. Various simple approaches could be implemented for simulations with moderate numbers of branches, but they break down or are computationally sub-optimal when modeling scenarios that require many independent branch excursions.

To illustrate this problem, consider a complex battlefield engagement that eventually (and unexpectedly) unfolds into 10 independent battles. Assume that each of the 10 battles branched twice based on planned decisions chosen first by blue forces, followed by tactical responses taken by red forces. In this example, assume that a probability of 0.5 is used on each blue and red branch. If HyperWarpSpeed is executed using a replication set with 128 starting replications, each battle independently decomposes into four branches with replication sets containing 32 replications that are uncorrelated with the branches taken in the other independent battles.⁶

⁶ Note that this example is different from exploring multiple rules of engagement that could be universally applied to the 10 battles, thus introducing correlated outcomes. Multiple rules of engagement could be stored during initialization in static multireplication variables that are accessed during force-on-force engagements. This would automatically generate event splitting during event processing without requiring branching.

To continue with this simple example, a raw measure of effectiveness might be determined by globally summing the intrinsic value of all surviving blue assets plus all destroyed red assets, divided by the sum of blue and red intrinsic values for all assets defined at the start of the simulation. Such a metric would be normalized between 0 and 1, where 1 indicates complete success of the military operation and 0 indicates complete failure in meeting planned objectives for blue forces. A plan is defined as a particular set of branch decisions taken by blue forces. This example has a total of 2^10 = 1,024 plans that could be independently evaluated out of a total of 4^10 = 1,048,576 possible permutations evaluated in the simulation. Yet, the simulation only performed the equivalent of 128 replications, and perhaps only took three or four times longer to run than a single Monte Carlo replication. One might ask how HyperWarpSpeed can provide all of the information generated by 1,048,576 permutations in a single run. The key to understanding computation sharing is to realize that the 10 battles in this example were completely independent from one another. The challenge is to extract statistical measures of effectiveness for particular plan instances. This requires tracking all dependencies in a futures graph that is dynamically constructed during branching, event splitting, event merging, and whenever accessing or assigning values to multireplication state variables. Correlations between state variables for different replications and plans must be carefully analyzed to provide statistically correct results.

Traditional approaches to solve this problem require the use of Design Of Experiments (DOE) techniques to carefully explore only the most likely decision paths and then perform Monte Carlo experiments to collect statistical results. These results are at best uncertain and are produced with the hope that no important permutations were left out of the set of runs performed during experimentation. The HyperWarpSpeed approach is able to potentially explore all of the permutations within a single execution of the simulation without requiring DOE techniques. The post-processing challenge is to determine which multireplication state variables were affected by what branches when computing the overall effectiveness of each plan.

Three simple approaches that could be supported in a straightforward manner by the current implementation of HyperWarpSpeed are briefly listed below:


1. Make sure that the starting replication set is large enough so that post processing always has at least one replication remaining for every possible branch taken in the plan. The problem with this limited approach is that it can break down if too many branches are taken. Or, it can introduce unacceptable overheads by requiring excessively large replication sets. In the example above, 128 replications (i.e., 2^7) could at most support 7 decision points.

2. Only run a single set of blue decisions at a time. While it is tempting to go down this simple route, this approach does not share common computations that might occur during the execution of different decisions. Running individual decision permutations in HyperWarpSpeed would still combine multiple enemy responses along with stochastic effects into the overall execution. However, it would be better to share the computations of overlapping blue decisions wherever possible.

3. Use the rollback/rollforward framework to explore multiple decision branches. The simulation locks in a set of blue decisions, runs the simulation to the end time to determine the effectiveness of those decisions, and then dynamically rolls back to previous decisions to explore other permutations. The problem with this approach is that it still does not appear to share redundant computations between branch excursions. However, this approach is worth exploring further and may lead to promising results.

Fortunately, a viable fourth option exists that tracks branches, event splitting, event merging, and multireplication state variable dependencies in a rollbackable, dynamically constructed, time-based graph that can be queried for information. This futures graph is able to reconstruct state information at any simulated time for all branches taken. Although the basic design has been developed to solve this problem, this fourth option has not yet been implemented in HyperWarpSpeed. It will have to be carefully implemented in a future effort. Technical design highlights of the dynamically created futures graph are provided below.

5.2 Futures Graph Representation

The futures graph will be constructed in local memory during the execution of the simulation and then consolidated afterwards into a file that is generated at the end of the simulation execution. Part of the consolidation process will be to compute the statistical measures of effectiveness for each plan, taking into account both correlated and uncorrelated multireplication state variables. This may be computationally intensive for large scenarios. The WarpIV Kernel is designed to execute on multicore architectures and will utilize scalable parallel processing techniques to perform consolidation of the futures graph on multicore computers.

The futures graph will be represented as a doubly linked directed graph data structure with multiple link types. The graph data structure will maintain dependency information about HyperWarpSpeed event processing as it evolves over time. The nodes in the graph will represent (1) branching, (2) event splitting, (3) event merging, (4) event-message merging, and (5) multireplication state variable dependencies that were modified or accessed by other graph nodes. Links will connect a newly created node to the node representing the current replication state. Each node in the graph will maintain a list of forward and backward links for each type of link (see Figure 6). Forward links represent future events, while backward links capture the series of causal dependencies. Traversing these links is key to identifying the chain of dependencies that produced the current state.
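To make the node and link structure concrete, the sketch below shows one possible in-memory representation of a futures graph node. The type and member names are hypothetical illustrations of the design described above, not the WarpIV Kernel's actual classes.

```cpp
#include <vector>

// Kinds of nodes/links tracked by the futures graph (illustrative enumeration).
enum class FuturesLinkType {
    Branch,            // Bayesian branch taken by an event
    EventSplit,        // event split caused by accessing multireplication state
    EventMerge,        // branches merging back together
    EventMessageMerge, // event-message merging
    StateDependency    // multireplication state variable read/write dependency
};

constexpr int NumLinkTypes = 5;

// Hypothetical futures graph node: doubly linked, with separate forward and
// backward link lists for each link type. Forward links point toward future
// events; backward links capture the causal dependencies that produced this node.
struct FuturesNode {
    double simTime;                                   // simulated time of the event
    std::vector<FuturesNode*> forward[NumLinkTypes];  // links to future nodes
    std::vector<FuturesNode*> backward[NumLinkTypes]; // links to causal predecessors
};

// Connecting a newly created node to the node representing the current
// replication state creates a forward link on the parent and a matching
// backward link on the child.
inline void Link(FuturesNode& parent, FuturesNode& child, FuturesLinkType type) {
    parent.forward[static_cast<int>(type)].push_back(&child);
    child.backward[static_cast<int>(type)].push_back(&parent);
}
```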

Multireplication state data will be represented in a rollback-based space-time data structure that logs changes as they occur in time. The rollback framework allows state (i.e., space) to be reconstructed at any point in time. The futures graph node of the current event is stored in the space-time data structure to track dependencies whenever a multireplication state variable is accessed and/or modified. A rollbackable link to the futures graph is stored for each replication value in a multireplication state variable.

The internally generated graph will utilize the rollback framework in the WarpIV Kernel to support much faster than real-time operation as the simulation executes in parallel, or as real-time inputs are received. Persistence that is built into the WarpIV Kernel will allow the futures graph to automatically serialize into a buffer that can be written to disk. The futures graph can then be reconstructed as the file is read back to support analysis. To support distributed computing on computer systems that do not share disks, a serialized buffer of the futures graph could also be sent as a message to other systems and then reconstructed for further analysis.

A high-level programming interface to process and analyze the futures graph will be provided to simplify access to state variables that were logged during the execution of the simulation at any specified time or branch point. Blue, red, and probabilistically generated branch points can be specified. These state variables may also be represented as statistical distributions using the Statistical Algebra Package that combines uncertain red responses and battle outcomes.
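The following is a minimal sketch of the rollback-based space-time change log and per-replication graph links described earlier in this subsection. The type names and layout are assumptions for illustration and do not reflect the WarpIV Kernel's actual multireplication implementation.

```cpp
#include <vector>

struct FuturesNode;  // futures graph node (see earlier sketch)

// One logged change to a single replication value: the simulated time of the
// change, the new value, and the futures graph node of the event that made it.
struct SpaceTimeEntry {
    double simTime;
    double value;
    FuturesNode* node;  // rollbackable link back into the futures graph
};

// Hypothetical multireplication double: one time-ordered change log per
// replication value. Rolling back conceptually discards entries newer than the
// rollback time; querying reconstructs the value at any simulated time.
struct MultiRepDouble {
    std::vector<std::vector<SpaceTimeEntry>> logs;  // one log per replication

    explicit MultiRepDouble(int numReplications) : logs(numReplications) {}

    void Set(int replication, double simTime, double value, FuturesNode* node) {
        logs[replication].push_back({simTime, value, node});
    }

    // Value of a replication as of the requested simulated time (0.0 if unset).
    double ValueAt(int replication, double simTime) const {
        double result = 0.0;
        for (const SpaceTimeEntry& e : logs[replication]) {
            if (e.simTime > simTime) break;  // log is appended in time order
            result = e.value;
        }
        return result;
    }
};
```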

These concepts are illustrated below in the continued example of Jim and Nancy (see previous example 1).

When Jim decides what to order for dessert, his event branches, creating two nodes in the graph. One node represents Jim ordering ice cream and the other branch represents him ordering a cookie. These new nodes create branch-links to the root node. Nancy is unaffected by this decision until she asks Jim to share his dessert. At this point, Nancy's event splits, creating a node where she shares Jim's ice cream and another node where she shares the cookie. The new nodes create split-links to the root node. Since the dessert Nancy shares is directly dependent on what Jim ordered, dependency-links establish the correlation between Nancy sharing dessert and the dessert that Jim ordered.

When Jim and Nancy finish their dessert, pay the bill, and leave the cafe, they continue with the rest of their day unaffected by what they had for dessert. This causes them to merge back into their individual event behaviors. New nodes in the graph are created to represent the merging, with appropriate merge-links created.

When Jim later decides to go to the expensive restaurant for dinner, his meal is dependent on what he ordered for dessert during lunch. This causes his process to once again split, creating two new nodes. One node represents Jim ordering chicken and the other steak. These new nodes create split-links to the previous node representing Jim leaving the cafe. Because the amount of money in Jim's wallet depended on what he had for dessert during lunch, dependency-links are created from the node representing Jim eating chicken to Jim ordering ice cream, and from Jim ordering a cookie to Jim eating steak. The multireplication variable storing the amount of money in Jim's wallet maintains links to nodes in the futures graph indicating when it was set. This information is used to form dependency-links in the futures graph.

Figure 6: Futures graph of Jim and Nancy.

5.3 Futures Graph Analysis in Real Time

The futures graph could be separately analyzed in parallel using the Real Time Compute Engine capabilities that are also provided by the WarpIV Kernel. The futures graph can be reconstructed in shared memory in a separate WarpIV Kernel application and then processed in parallel to achieve scalable performance on multicore computers.

During the analysis of the futures graph, the likelihood of decision points will be assessed and logged. When certain events repeatedly have a low probability of effectiveness or a high probability of risk, the nodes in the graph representing these decision points may be pruned. For example, if turning left at a fork in the road repeatedly leads to a 99% failure rate, there is no need to consider this route in the future.

The probability and confidence level of a HyperWarpSpeed multireplication state variable is computed by forming statistical algebra histograms. Pruning will occur before any decision points or nodes for replanning are selected. Information about these pruned nodes will be submitted back into the HyperWarpSpeed simulation for removal from the current plan.

When accessing logged multireplication state variables to perform analysis for a particular set of blue decisions, each value in its replication set can be traced back in the graph to determine the chain of decisions leading to its final result. Values tracing back to planned decisions (i.e., branches) in the graph are counted in the statistical formation of the variable, while values tracing back to alternate decisions are rejected. Correlations between replications and decision points on the plan must be maintained to compute global measures of effectiveness involving multiple objects in the battlespace. Some values may not trace back to any of the decisions in the plan. This can happen when branches are completely merged, or when there never were any dependencies on the set of blue decision points. In these cases, such values are uncorrelated and factor differently into measures of effectiveness.
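As a concrete illustration of the trace-back classification just described, a logged replication value can be classified by walking backward links from its futures graph node and comparing the branch decisions encountered against the decisions that make up the plan. The sketch below uses hypothetical, simplified types and is not the HyperWarpSpeed analysis interface.

```cpp
#include <set>
#include <vector>

// Identifier of a blue branch decision (decision point plus the option taken).
struct Decision {
    int decisionPoint;
    int optionTaken;
    bool operator<(const Decision& o) const {
        return decisionPoint != o.decisionPoint ? decisionPoint < o.decisionPoint
                                                : optionTaken < o.optionTaken;
    }
};

// Minimal stand-in for a futures graph node: backward causal links plus an
// optional branch decision recorded at this node (hasDecision == false for
// split, merge, and state-dependency nodes).
struct FuturesNode {
    bool hasDecision = false;
    Decision decision{};
    std::vector<const FuturesNode*> backward;  // causal predecessors
};

enum class Classification { Counted, Rejected, Uncorrelated };

// Walk backward links from the node that last set a replication value and
// collect every branch decision it causally depends on.
static void TraceBack(const FuturesNode* node, std::set<Decision>& found,
                      std::set<const FuturesNode*>& visited) {
    if (!node || !visited.insert(node).second) return;
    if (node->hasDecision) found.insert(node->decision);
    for (const FuturesNode* prev : node->backward) TraceBack(prev, found, visited);
}

// Counted if every decision it depends on is in the plan, rejected if it
// depends on an alternate decision, uncorrelated if it depends on none.
Classification Classify(const FuturesNode* valueNode,
                        const std::set<Decision>& plan) {
    std::set<Decision> depends;
    std::set<const FuturesNode*> visited;
    TraceBack(valueNode, depends, visited);
    if (depends.empty()) return Classification::Uncorrelated;
    for (const Decision& d : depends)
        if (plan.count(d) == 0) return Classification::Rejected;
    return Classification::Counted;
}
```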

In addition to likelihood, the value, utility, and flexibility of each plan will be determined using raw and relative measures of effectiveness [20]. Upcoming decision points will be analyzed and ranked. Decision points unable to meet their goals due to incomplete or improbable futures will be identified for further planning. A learning component can be provided to continuously analyze the futures graph. This component will operate as a low priority process, so its computations should not impact the more critical real-time analysis. The main focus of the analysis is to (1) gather and analyze the data contained in the futures graph, and (2) determine which futures are qualitatively and quantitatively the same. The live data received from the environment could also be analyzed to provide improved estimates of the branch probabilities for red decisions, blue decisions, and other stochastic activities. This is another area where learning techniques can be employed.

Dealing with correlations is the most challenging part of futures graph analysis. For example, a multireplication state variable for one entity may store values that depend on a particular subset of blue decisions, while a multireplication state variable for another entity could depend on a different subset of blue decisions. Correctly combining multireplication state variables from different entities with different branch dependencies to form global measures of effectiveness requires further investigation.

6.0 Comparison with Other Technologies

This section briefly describes existing Course of Action (COA) and related M&S technologies along with how they are limited in capability when compared to the HyperWarpSpeed approach described in this paper.


6.1 Brute Force Monte Carlo Approach

The most common simulation strategy is to run large numbers of Monte Carlo replications to explore different courses of action. The problem with brute force Monte Carlo execution strategies is that many redundant computations are unnecessarily performed by each replication. The multi-branching and branch-merging techniques described in this paper can significantly outperform the brute force Monte Carlo approach. In addition, managing the statistical outputs and accumulating data about branch points and resultant execution paths are simplified because all of the results are provided by a single execution.

6.2 Grid Computing

Grid computing farms Monte Carlo replications out to available compute resources. A more effective use of computing resources would be to run a single multi-branching simulation execution in parallel that does not redundantly compute branches that could be shared between replications. A single laptop with dual-core processors, programmed correctly with branching and merging capabilities, could potentially outperform a large farm of computing resources managing the execution of brute force Monte Carlo runs.

6.3 Multi Replication Framework (MRF)

The Multi Replication Framework (MRF), previously developed by the authors of this paper, manages large numbers of Monte Carlo simulation replications and then potentially prunes and restarts them if new data arrives that is not consistent with their current logged simulated state [11]. This approach has some utility in supporting legacy simulations. However, it is far from optimal because simulation replications almost always diverge from the real world rapidly. Also, there is no potential for sharing computations between replications.

6.4 Simulation Cloning/Forking

The Monte Carlo approach can be extremely wasteful and cumbersome because each replication redundantly computes many of the same calculations as the other replications. One approach to help mitigate this problem is to run a simulation up to the point in time where it makes a branch decision. At that point, the simulation clones or forks itself. If this happens early and often, large numbers of clones will be generated, and the number of clones can explode rapidly. Also, cloned simulations do not merge results, and each cloned simulation could still redundantly recompute many of the same events. The branching and branch-merging techniques described in this paper offer a much more computationally efficient approach by automatically sharing redundant calculations.

6.5 Object Cloning at Branch Points

Another approach has been to clone individual objects within the simulation as branches occur. While this approach is more aggressive than cloning entire simulations, it still suffers from not reusing event computations from redundant branches. The attribute-level branching and merging approaches described in this paper provide a much more aggressive and optimal strategy for minimizing redundant computations.

6.6 Recursive Simulation

One approach used to project multiple potential outcomes is the notion of recursive simulation (i.e., simulations that use embedded simulations in their event processing) [7]. Recursive simulations do not share computations during branching and they do not merge branches. Therefore, recursive simulation is not as efficient as the HyperWarpSpeed time management algorithm described in this paper.

6.7 Integrating With Live Data

Traditional approaches collect live data and periodically run simulation executions to predict the future. Each new simulation execution may reproduce a significant amount of redundant computation from previous runs. The rollback-based strategy described in this paper continually feeds live and potentially noisy data into the simulation, using control theory techniques such as Kalman Filters to continually calibrate the state. This technique then continually maintains predictions with multiple branches using the best state estimate of the real world. One additional benefit is that uncertainties from covariance matrices can potentially be extrapolated to characterize the believability of future states. Such error propagation techniques can provide significant value to Verification, Validation, and Accreditation (VV&A) efforts [12]. Again, traditional approaches do not provide this capability.
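The live-data calibration described in Section 6.7 relies on standard Kalman filter predict/update steps. The following one-dimensional sketch shows the textbook form of that update; it is illustrative only and is not the control theory implementation used by HyperWarpSpeed or the WarpIV Kernel.

```cpp
// Scalar Kalman filter: predict the state forward, then correct it with a
// noisy live measurement. x = state estimate, p = estimate variance.
struct ScalarKalman {
    double x = 0.0;   // current state estimate
    double p = 1.0;   // estimate variance
    double q = 0.01;  // process noise variance (model uncertainty)
    double r = 0.1;   // measurement noise variance

    // Prediction step (constant-state model): uncertainty grows with time.
    void Predict() { p += q; }

    // Update step: blend the prediction with a live measurement z using the
    // Kalman gain k; the variance shrinks after incorporating the measurement.
    void Update(double z) {
        double k = p / (p + r);  // Kalman gain
        x += k * (z - x);        // corrected state estimate
        p *= (1.0 - k);          // corrected variance
    }
};
```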

7.0 Preliminary Performance Results

Three preliminary test programs are described in the following subsections. These tests collectively (1) verify the correct operation of multireplication integer, double, and Boolean data types, (2) exercise branching, event splitting, and event merging operations, (3) measure performance in comparison to the equivalent number of Monte Carlo replications, and (4) determine speedup when running on parallel processing architectures. Each test successfully verified the correctness of the multireplication state variable data types, demonstrated repeatable event processing on one or more processors, and highly stressed the branching, splitting, and merging algorithms. All of the tests demonstrated significant performance gains over the brute force Monte Carlo approach and obtained substantial parallel speedup when executing on multicore parallel computers. Test hardware configurations are described in Table 2.

Table 2: Test System Configurations.

Machine        Processor Specifications                              Operating System
Mac Pro        Two Dual-Core Xeon at 3 GHz (total of 4 processors)   Mac OS X 10.4.10
HP Superdome   550 MHz PA8600 (total of 48 processors)               HP-UX 11i
IBM p690       Power4 at 1.3 GHz (total of 32 processors)            Suse Linux Enterprise Server (SLES-10)

7.1 Random Walk

The Random Walk test modeled 10,000 entities taking random steps in three dimensions (see Figure 7). To keep the dispersion from spreading too large, the random walk test code was modified to progressively increase the probability of stepping back towards the center.

Figure 7: Results from the Random Walk test. All 128 replications start at zero. Each time step performs a branch that takes a step to the left or to the right with an equal probability. The left figure shows how the branching should occur. The right figure plots data taken from an actual run with ten steps.

Random Walk Performance: Mac Pro

The simulation was first executed on the Mac Pro using the brute force Monte Carlo approach. Each replication took 6.02 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 85.15 seconds to complete on a single processor and 26.85 seconds to complete when running in parallel on four processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 9.05. An additional speedup factor of 3.17 was achieved by executing in parallel on four processors.
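For clarity, the reported factors can be reproduced directly from the timings above: the sequential gain compares 128 serial Monte Carlo runs against one sequential HyperWarpSpeed run, and the additional speedup compares the sequential and parallel HyperWarpSpeed runs. For the Mac Pro:

    sequential gain  = (128 x 6.02 s) / 85.15 s = 770.56 / 85.15 = 9.05
    parallel speedup = 85.15 s / 26.85 s = 3.17

The factors quoted for the other machines and tests follow from the same two ratios.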

Random Walk Performance: HP Superdome

The simulation was then executed on the HP Superdome using the brute force Monte Carlo approach. Each replication took 37.63 seconds to complete. The simulation then ran using the HyperWarpSpeed algorithm configured to support the equivalent of 128 replications. HyperWarpSpeed took 560.52 seconds to complete on a single processor and 13.98 seconds to complete when running in parallel on forty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 8.59. An additional speedup factor of 40.01 was achieved by executing in parallel on forty processors.

Random Walk Performance: IBM p690

The simulation was finally executed on the IBM p690 using the brute force Monte Carlo approach. Each replication took 8.81 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 121.13 seconds to complete on a single processor and 4.75 seconds to complete when running in parallel on thirty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 9.31. An additional speedup factor of 25.50 was achieved by executing in parallel on thirty processors.

The random walk test highly stressed the algorithms because every event branched, merged, and changed local variables, while performing almost no computations. The fact that HyperWarpSpeed provided performance improvement over the brute force Monte Carlo approach is notable.


7.2 Aircraft Bombing Target

The Aircraft Bombing Target test simulated the flight of 1,000 aircraft bombing 1,000 ground targets in sequence. Each aircraft flew in a single-file formation over the same sequence of targets and dropped a bomb on the target if the target had not already been destroyed. If the dropped bomb did not destroy the target, a surface-to-air missile was then fired from the target at the aircraft with the goal of destroying the aircraft. Again, a replication set with 128 replications was used to support this test.

Figure 8: Aircraft bombing target scenario. 1,000 aircraft bomb 1,000 targets in sequence. If the bomb did not destroy the target, the target fires a surface-to-air missile back at the aircraft. If the aircraft is not destroyed, then the aircraft moves on to the next target where the sequence of events repeats.

In this scenario, branching occurred at two places (see Figure 8): (1) to determine if the bomb destroyed the target, and (2) to determine if the surface-to-air missile destroyed the aircraft. In both branching cases, a probability of 0.5 was used to determine the effect of (1) the bomb on the target, and (2) the missile on the aircraft.
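The sketch below enumerates the futures created by the two branch points in a single aircraft/target engagement, along with their probabilities. The names and structure are illustrative only; this is not the WarpIV Kernel or HyperWarpSpeed API, which shares computations between branches rather than enumerating them explicitly.

```cpp
#include <cstdio>

struct Future {
    bool targetDestroyed;
    bool aircraftDestroyed;
    double probability;
};

int main() {
    const double pBomb = 0.5;  // branch point 1: bomb destroys target
    const double pSam  = 0.5;  // branch point 2: SAM destroys aircraft
    Future futures[3];
    int n = 0;
    // Branch 1a: bomb hits, so the target is destroyed and no SAM is fired.
    futures[n++] = { true, false, pBomb };
    // Branch 1b: bomb misses, so branch point 2 is evaluated.
    futures[n++] = { false, true,  (1.0 - pBomb) * pSam };         // SAM hits
    futures[n++] = { false, false, (1.0 - pBomb) * (1.0 - pSam) };  // SAM misses
    for (int i = 0; i < n; ++i)
        std::printf("future %d: targetDestroyed=%d aircraftDestroyed=%d p=%.2f\n",
                    i, futures[i].targetDestroyed,
                    futures[i].aircraftDestroyed, futures[i].probability);
    return 0;
}
```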

Aircraft Mission Performance: Mac Pro

The simulation was first executed on the Mac Pro using the brute force Monte Carlo approach. Each replication took 10.97 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 33.92 seconds to complete on a single processor and 10.45 seconds to complete when running in parallel on four processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 41.4. An additional speedup factor of 3.25 was achieved by executing in parallel on four processors.

Aircraft Mission Performance: HP Superdome

The simulation was then executed on the HP Superdome using the brute force Monte Carlo approach. Each replication took 76.27 seconds to complete. The simulation then ran using the HyperWarpSpeed algorithm configured to support the equivalent of 128 replications. HyperWarpSpeed took 239.24 seconds to complete on a single processor and 6.18 seconds to complete when running in parallel on forty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 40.81. An additional speedup factor of 38.71 was achieved by executing in parallel on forty processors.

Aircraft Mission Performance: IBM p690

The simulation was finally executed on the IBM p690 using the brute force Monte Carlo approach. Each replication took 19.66 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 49.39 seconds to complete on a single processor and 2.04 seconds to complete when running in parallel on thirty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 50.95. An additional speedup factor of 24.21 was achieved by executing in parallel on thirty processors.

7.3 Miscellaneous Branching Stress Test

The Miscellaneous Branching Stress Test was developed to exercise the branching, event-splitting, and event-merging features of the HyperWarpSpeed time management algorithm. Events branched and later merged within objects of type A while other objects of type B queried multireplication state variables for type A objects. Spin loops burning CPU cycles were added to the models in order to mimic computational workloads representing actual event processing that might be performed by more realistic applications. Significant performance improvements for HyperWarpSpeed over brute force Monte Carlo replications were observed (see Figure 9).

128 individual Monte Carlo replications would have required the processing of 68,096,000 events. Executing the simulation using HyperWarpSpeed without branch merging enabled only required 2,253,000 events to be processed. With branch merging enabled, only 642,000 events actually had to be processed. The results of a single HyperWarpSpeed execution versus 128 independent Monte Carlo replications are statistically equivalent. On four nodes, the HyperWarpSpeed approach ran 236 times faster than 128 serial replications.

Figure 9: Performance test of the miscellaneous branching stress test executing on the four-processor Mac Pro. For these measurements, the spin loop was increased by a factor of 100. The simulation ran for 100 simulated seconds. This test highly stressed branching, event splitting, and event merging between different types of objects distributed across four processors.

Stress Test Performance: Mac Pro

The simulation was first executed on the Mac Pro for 1000 simulated seconds using the brute force Monte Carlo approach. Each replication took 29.41 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 88.47 seconds to complete on a single processor and 24.89 seconds to complete when running in parallel on four processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 42.55. An additional speedup factor of 3.55 was achieved by executing in parallel on four processors.

Stress Test Performance: HP Superdome

The simulation was then executed for 100 simulated seconds on the HP Superdome using the brute force Monte Carlo approach. Each replication took 257.40 seconds to complete. The simulation then ran using the HyperWarpSpeed algorithm configured to support the equivalent of 128 replications. HyperWarpSpeed took 493.77 seconds to complete on a single processor and 14.65 seconds to complete when running in parallel on forty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 66.73. An additional speedup factor of 33.70 was achieved by executing in parallel on forty processors.

Stress Test Performance: IBM p690

The simulation was finally executed on the IBM p690 using the brute force Monte Carlo approach. Each replication took 114.40 seconds to complete. The simulation then ran using HyperWarpSpeed configured to support the equivalent of 128 replications. HyperWarpSpeed took 222.25 seconds to complete on a single processor and 8.64 seconds to complete when running in parallel on thirty processors. The sequential performance gain of HyperWarpSpeed over 128 independent Monte Carlo runs on a single processor was a factor of 65.89. An additional speedup factor of 25.73 was achieved by executing in parallel on thirty processors.

7.4 Summary of Performance Results

Performance results for each of the three tests executing on the three parallel processing hardware systems described above are summarized in Table 3.

Table 3: Summary of HyperWarpSpeed performance for each of the three tests (Random Walk, Aircraft Bombing Mission, and Miscellaneous Branching Test) on three hardware systems (Mac Pro, HP Superdome, and IBM p690).

In all cases, the parallel performance of HyperWarpSpeed offered orders of magnitude speedup over the more traditional brute force Monte Carlo approach executed in a serial manner. While these results are preliminary, and do not necessarily indicate high performance for all possible military scenarios, the results are extremely positive and warrant further serious investigation. Obtaining similar performance on realistic military scenarios may require non-traditional conceptual design approaches for constructing models that best take advantage of branching, merging, support for real-time estimation and prediction, and emerging scalable multicore computing architectures.

8.0 Areas of Future Research

To further simplify the use of this technology and to optimize run-time performance, future research and development goals for HyperWarpSpeed have been identified in the following six general areas.


Goal #1: Extend existing programming constructs and data types to provide transparent model execution for all WarpIV Kernel time management modes.

Goal #2: Extend existing publish/subscribe data distribution mechanisms within the WarpIV Kernel to transparently support branching capabilities.

Goal #3: Research and develop efficient modeling techniques and methodologies that branch, split events, and merge within HyperWarpSpeed.

Goal #4: Develop semi-realistic models of DoD battlefield entities that will verify and validate the correct operation of the HyperWarpSpeed algorithm. These models should be extensible to eventually support more realistic behaviors and scenarios.

Goal #5: Identify potential bottlenecks, optimize algorithms, measure computational improvements over brute force Monte Carlo approaches, and demonstrate scalable performance on a wide range of parallel and distributed computing architectures.

Goal #6: Develop capabilities to create and efficiently analyze the futures graph in real time. This includes computing statistically valid measures of effectiveness, performing sensitivity analysis on key decision points, using live data from the environment to calibrate the HyperWarpSpeed simulation, and developing learning strategies to more accurately predict future outcomes.

9.0 Summary and Conclusions

HyperWarpSpeed provides a revolutionary M&S technology for hosting estimation and prediction applications on emerging multicore computing architectures. The ability to efficiently predict outcomes based on the best estimate of the current common operational picture is critical in supporting dynamic situation assessment and prediction. HyperWarpSpeed may provide orders of magnitude speedup over existing approaches, which will give warfighters a significant information advantage over adversaries in a time-critical manner when human lives are at stake.

HyperWarpSpeed synergistically combines several important techniques for supporting dynamic situation assessment and prediction. First, it combines Bayesian branching, event splitting, event merging, and multireplication primitive data types into a common execution framework. Results are captured in a futures graph that can be analyzed in parallel to optimize decision-making. Second, it utilizes the parallel processing capabilities provided by the WarpIV Kernel to obtain scalable performance on next-generation multicore computing systems. Third, it uses the WarpIV Kernel rollback framework to predict the future while calibrating its state with live data. Fourth, it applies control theory to form best estimates of the common operational picture (i.e., the COP) while noisy data is received from multiple sources. Fifth, it uses statistical algebra to form measures of effectiveness and performance from the various branch excursions taken during the course of the simulation. Finally, all of the capabilities of HyperWarpSpeed are supported within the proposed OpenMSA and OSAMS architecture standards that are being studied by the OSI-PDMS Study Group under SISO. The WarpIV Kernel provides a freely available open source reference implementation for government, commercial, and university research programs to use.

10.0 Acknowledgements

The authors of this paper would first like to thank Dr. Timothy Busch and Dawn Trevisani for their support and technical insight in real-time Air Force command and control operational systems requiring dynamic situation assessment and prediction.

The authors would also like to thank Bob Pritchard, Program Manager of the Joint Virtual Network Centric Warfare program at SPAWAR Systems Center, for making the HP Superdome and IBM p690 supercomputers available for benchmarking.

WarpIV Technologies, Inc. performed this work for the Air Force Research Laboratory, Rome Research Site in New York, under contract FA8750-06-C-0218.

11.0 Author Biographies

JEFFREY S. STEINMAN, President and CEO of WarpIV Technologies, Inc., received his Ph.D. in High Energy Physics from UCLA in 1988, where he studied the quark structure function of high-energy virtual photons at the Stanford Linear Accelerator Center. Upon completing his Ph.D. dissertation, Dr. Steinman worked at the Jet Propulsion Laboratory/CalTech in the area of parallel and distributed computing technologies on hypercube supercomputers. In 1990, Dr. Steinman developed the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) framework as the next-generation replacement for the Time Warp Operating System (TWOS). This work transitioned to industry in 1996 when Dr. Steinman joined the staff at Metron, Inc., where he formed a new High Performance Computing Division. SPEEDES eventually transitioned into several mainstream programs including BMDS SIM, JMASS, JSIMS, EADTB, and many other smaller programs and R&D efforts. In 2000, Dr. Steinman left Metron to join RAM Laboratories as Chief Scientist, where he directed all of their high performance computing R&D programs.


Dr. Steinman formed WarpIV Technologies, Inc. in 2005 to focus on the development of new technologies, composable simulation architectures, and open system standards for high performance computing. Dr. Steinman is the principal developer of the WarpIV Kernel.

CRAIG N. LAMMERS, Senior Analyst at WarpIV Technologies, Inc., graduated with high honors in 2001 from the University of Michigan, Dearborn with a B.S. in Industrial & Systems Engineering. Craig Lammers started his career in 2002 at RAM Laboratories as a Software Engineer, where he designed, developed, tested, and documented several software applications for the Air Force Research Laboratory (AFRL) and the Missile Defense Agency (MDA). Craig Lammers served as principal investigator and developer of the Detego neural network and was the lead developer of the DSAP multireplication framework developed for AFRL. He was the principal developer of the WanSim wide area network simulation tool for Missile Defense. Each of these applications heavily leveraged capabilities provided by the WarpIV Kernel. Craig Lammers joined WarpIV Technologies, Inc. in 2006, where he has focused on software engineering, development of commercial products, VV&A for biological defense, development of real-time tracking algorithms, development of multiplatform graphical analysis tools, and R&D in estimation and prediction technologies such as HyperWarpSpeed. He has authored more than 10 papers on M&S and related fields.

MARIA E. VALINSKI, Senior Software Engineer at WarpIV Technologies, Inc., graduated with honors in 1997 from East Stroudsburg University with a B.S. in Computer Science. She started her career at Hughes Aircraft in 1997 as a software engineer, where she designed, developed, tested, and documented software for the B-2 and F-117 flight simulators. In 2000, she joined RAM Laboratories as a senior software engineer supporting several M&S projects as a software engineer, technical lead, and program manager. These projects utilized the parallel and distributed computing capabilities of the WarpIV Kernel to host real-time, conservative, and optimistic simulation applications. In 2007, Maria Valinski joined the staff at WarpIV Technologies, Inc. as a Senior Software Engineer. She is currently the Lead Software Engineer for the Joint Virtual Network Centric Warfare (JVNCW) program that is developing an open-source, government-owned product for simulating communications links of battlefield mobile ad-hoc networks on massively parallel supercomputers using the WarpIV Kernel.

KAREN ROTH received her B.S. in software engineering from Rochester Institute of Technology in 2006 and is currently working on a Masters of Engineering in systems engineering from Cornell University. She is an associate computer engineer with the Air Force Research Laboratory working on the analysis of next-generation simulation capabilities for use in command and control systems.


12.0 References

1. Bell B., Santos E. Jr., and Brown S., 2002. "Making Adversary Decision Modeling Tractable with Intent Inference and Information Fusion." In proceedings of the 11th Conference on Computer Generated Forces and Behavioral Representation.
2. Busch T., 2002. "Modeling of Air Operations for Course of Action Determination." In Enabling Technologies for Simulation Science VI, Alex Sisti and Dawn Trevisani, Editors, Proceedings of SPIE Vol. 4716, pp. 35-40.
3. Chen D., et al., 2004. "Incremental HLA-Based Distributed Simulation Cloning." In proceedings of the 2004 Winter Simulation Conference, pages 386-394.
4. Davis P., and Anderson R., 2004. "Improving the Composability of Department of Defense Models and Simulations." Prepared for the Defense Modeling and Simulation Office (DMSO). RAND National Defense Research Institute.
5. Fujimoto R., 2000. "Parallel and Distributed Simulation Systems." John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012.
6. Gelb A., 1974. "Applied Optimal Estimation." MIT Press.
7. Gilmer J. B. Jr. and Sullivan F. J., 2005. "Issues in Event Analysis for Recursive Simulation." In proceedings of the 2005 Winter Simulation Conference.
8. Hybinette M. and Fujimoto R. M., 2001. "Cloning Parallel Simulations." ACM Transactions on Modeling and Computer Simulation, 11: 378-407.
9. Jefferson D., 1985. "Virtual Time." ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, pages 404-425.
10. Ogata K., 2005. "Discrete-Time Control Systems (2nd Edition)." ISBN 0130342815. Pearson Education.
11. Lammers C., et al., 2005. "Applying a Multi-Replication Framework to Support Dynamic Situation Assessment and Predictive Capabilities." Proc. SPIE Vol. 5805, pp. 257-268, Enabling Technologies for Simulation Science IX; Dawn A. Trevisani, Alex F. Sisti, Editors.
12. Pace D. K., 1998. "Verification, Validation, and Accreditation (VV&A)." In Applied Modeling and Simulation: An Integrated Approach to Development and Operations, D. J. Cloud and L. B. Rainey, editors, pp. 369-410, McGraw-Hill, New York.
13. Phister P., Busch T., and Plonisch I., 2003. "Joint Synthetic Battlespace: Cornerstone for Predictive Battlespace Awareness." In Proceedings of the Eighth International Command and Control Research and Technology Symposium.
14. Rothenberg J., 1997. "A Discussion of Data Quality for Verification, Validation, and Certification (VV&C) of Data to be Used in Modeling." Rand Project Memorandum PM709-DMSO, Rand, August 1997.
15. Simulation Interoperability Standards Organization (SISO), http://www.sisostds.org.
16. SISO, Open Source Initiative for Parallel and Distributed Modeling and Simulation (OSI-PDMS) study group. http://www.sisostds.org/index.php?tg=articles&idx=More&topics=120&article=471.
17. Steinman J., 1993. "Incremental State Saving in SPEEDES Using C++." In proceedings of the 1993 Winter Simulation Conference, pages 687-696.
18. Steinman J. and Hardy D., 2004. "Evolution of the Standard Simulation Architecture." In proceedings of the Command and Control Research Technology Symposium, paper 067.
19. Steinman J., 2005. "The WarpIV Simulation Kernel." In proceedings of the 2005 Principles of Advanced and Distributed Simulation (PADS) Workshop.
20. Steinman J. and Busch T., 2006. "Real Time Estimation and Prediction Using Optimistic Simulation and Control Theory Techniques." In proceedings of the Spring 2006 Simulation Interoperability Workshop, 06S-SIW-017.
21. Steinman J., et al., 2007. "A Proposed Open System Architecture for Modeling and Simulation." In proceedings of the Fall 2007 Simulation Interoperability Workshop, 07F-SIW-044.
22. WarpIV Technologies, Inc., 2007. "Statistical Algebra Package - Programming Guide: Version 1.5."
23. Zarchan P., and Musoff H., 2005. "Fundamentals of Kalman Filtering: A Practical Approach." AIAA.
