
OpenMP parallelization of agent-based models

Federico Massaioli (a,*), Filippo Castiglione (b), Massimo Bernaschi (b)

a CASPUR, Via dei Tizii 6/b, 00185 Roma, Italy
b Istituto Applicazioni del Calcolo (IAC) “M. Picone”, Consiglio Nazionale delle Ricerche (CNR), Viale del Policlinico 137, 00161 Roma, Italy

* Corresponding author. Email addresses: [email protected] (Federico Massaioli), [email protected] (Filippo Castiglione), [email protected] (Massimo Bernaschi).

Preprint submitted to Parallel Computing, 16 February 2005

Abstract

Agent-based models, an emerging paradigm for the simulation of complex systems, appear very suitable to parallel processing. However, during the parallelization of a simulator of financial markets, we found that some features of these codes highlight non-trivial issues of the present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce such problems and can be used as a starting point in the search for a possible solution.

Key words: OpenMP, Fine grain parallelism, Agent-based models, Financial market simulation

1 Introduction

The computational features of the codes in use in fields like fluid dynamics, materials science or meteorology are well known, and much work has been devoted to the enhancement of their performance. In other important fields the situation is not so good. In biology or finance, for example, even the rules governing the micro-behavior of entities such as cells, molecules or traders and brokerage agencies are mostly unknown, or


overwhelmed by so many details that it is impossible to write them in a compact way. Although very different in terms of scale and components, such systems share the feature that their complexity arises from the interaction of a large number of heterogeneous elements. Therefore, a possible approach for studying their behavior is represented by what are called micro-simulations [1][2] or the “agent-based” approach. This can be thought of as the natural evolution of the so-called spin models [3] or cellular automata [4]. As a matter of fact, the availability of high performance computers makes it possible to add details to the definition of the “fundamental entities” (the agents) and to increase the complexity of the rules governing the agents’ individual behavior. This means representing entities (cells, molecules, traders on the market, etc.) as collections of information or attributes instead of simple two-valued variables (like the spins in a simulation of the Ising model).

The heterogeneous information corresponding to a single entity is usually represented by a collection of integer variables, so that the numerical stability of the simulation algorithm is unconditionally guaranteed. Very often the entities follow a sort of life-path during the simulation (i.e., birth, evolution and death), so it is more appropriate to resort to flexible dynamic lists of “records” that contain all the information about each entity. Moreover, since the complex behavior of the entities is represented by state changes, every single entity can be thought of as a (small scale) Finite State Machine that processes information by interacting with other entities, or with external sources of information. A subclass of these models uses probabilistic rules for the state changes. Transition probabilities can be fixed or can change during the simulation depending, for instance, on the outcome of the interaction rules governing the entities. In summary, agent-based models are discrete mathematical representations of complex systems in which entities behave and evolve in time.

To complete the picture it is necessary to specify the environment in which the agents act. For simulations in biology the environment can be the physical space within a cell, a tissue or an organ. In this case the interactions among entities require physical proximity. In other cases, the environment may be virtual, as in the case of the financial markets, where there are networks of relationships among the agents that are loosely dependent on physical proximity.

The discussion above is not without purpose. Although the definition of the external environment might seem to be a modeling issue, it determines the difficulty of writing a parallel code for the simulation. The main reason is that an external field must be visible in read and write, so to speak, to all agents. This issue will be discussed in depth below.

In the following paragraphs we describe examples of agent-based models that we developed in the past and for which we wrote a message passing based parallel version according to the distributed memory paradigm. These examples should make it easier for the reader to understand our motivations for the development of an OpenMP based parallel version.

1.1 Simulation of the immune response to a pathogen

The first example of agent-based model simulates the immune system response to a generic pathogen such as a bacterium or a virus. Recently it has been specialized for the HIV-1 infection to conduct studies on the dynamics of the virus during the lifetime of an infected patient. The whole simulation space corresponds to a single lymph node of a vertebrate animal. The model is based on a generalized cellular automaton having the following features:

(1) the automaton is defined on a triangular 2-d lattice with periodic boundary conditions (toroidal geometry);
(2) the dynamics is probabilistic;
(3) the evolution of each site depends just on the site itself (internal dynamics);
(4) entities move from site to site (diffusion process);
(5) each time step corresponds to ∼ 8 hours of “real life”.

Other primary lymphoid organs such as the thymus and the bone marrow are modeled apart. The immune system is dynamic in nature, with a population that, due to the combination of multiple mechanisms like evolution, learning and memory, may change significantly during the simulation. The information for each cellular entity is organized in blocks of variables, one block for each cell, as is common in the agent-based paradigm. The blocks form singly linked lists, one for each class of cells, and are managed dynamically at run time (a minimal sketch of such a record is given below). The model represents the “binding sites” of cells and molecules (e.g., the lymphocyte receptors, the antigen peptides and epitopes, the immunocomplexes, etc.) as binary strings. Currently, we are able to deal with strings of more than 24 bits, which allow the representation of more than 10^7 different binding sites. The most time consuming phase of a simulation is the diffusion of the entities. At each time step there is a complete sweep of the lattice and the generation of a very large set of random numbers (the diffusion follows a random-walk dynamics). The interested reader may refer to [5] for further details.
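For illustration only, the following is a minimal sketch of how such a per-cell record and its singly linked list might be laid out in C; the field names (receptor, state, x, y, next) are hypothetical and not taken from the actual simulator:

/* Hypothetical sketch of a cellular entity record: each cell carries its
   attributes (here a binary-string receptor and an internal state) and a
   pointer to the next cell of the same class. */
typedef struct cell {
    unsigned int receptor;   /* binary string encoding the binding site (> 24 bits on typical platforms) */
    int          state;      /* current state of the (small scale) finite state machine */
    int          x, y;       /* position on the 2-d lattice */
    struct cell *next;       /* singly linked list of cells of the same class */
} cell_t;

/* Insert a newly born cell at the head of its class list and return the new head. */
static cell_t *cell_insert(cell_t *head, cell_t *c)
{
    c->next = head;
    return c;
}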

1.2 Simulation of financial markets

The second application is a simulator of financial markets. We proposed [6] an agent-based model that includes many features of other models proposed in this field [1], and aims to be a first step towards the construction of a simulator that employs more realistic market rules, strategies, and information from the real world.

The model considers three kinds of agents who trade (with different strategies) a set of N assets (stocks, commodities, etc.): fundamentalists, noisy and technical traders. Fundamentalists consider a reference (or “fundamental”) value to determine the “right” price of an asset. Noisy traders do not follow any reference value and do not look at charts: their behavior is mostly random. There is a very high number (of the order of thousands) of noisy traders, but most of them have a very limited capital at their disposal. The last group (the technical traders, also called chartists) represents those agents who take into account information about the evolution of prices (in our case a moving average over certain time horizons). The initial wealth (stocks and money) of each trader is selected according to a power-law distribution. At each time step agents decide whether to trade or not. Active agents follow different decision paths depending on their trading strategy and may take different positions with respect to each stock in the market.

The size of the agents’ lists depends on the granularity of the representation. If only major players in the market are represented (e.g., the brokers), there will be relatively few agents. In case each single “on-line trader” (i.e., people trading from home) must be represented, there will be, potentially, millions of agents at the same time. Regardless of their size, the lists must be dynamic since existing agents disappear (when they run out of money) and new agents enter the market during the simulation.

The interaction among traders is mediated by the book-of-orders. This is a pretty detailed representation of the two lists of buy-orders and sell-orders. Two agents (a seller and a buyer) interact if the orders they post in the corresponding lists have a price that matches. At each time step, matching orders are promptly satisfied (filled). Pending orders remain in the book waiting for a matching order up to an expiration time set when the order is issued. Market rules require buy-orders to be sorted in descending order (from the highest to the lowest order price) whereas sell-orders must be sorted in reverse order (from the lowest to the highest order price) for each stock. This means that every time an order enters the book it cannot simply be inserted at the beginning of the corresponding list but must be placed in the “right” position (a sketch of such an insertion is given below). Since most of the orders have a limited life-time (in the real world, up to a few days), it is necessary to remove expired (unsatisfied) orders, which means an additional periodic scanning of the lists. Note that in a real market there are hundreds of different stocks and hundreds of millions of transactions every day. A quasi-realistic simulation requires a significant amount of computing time for the management of the book-of-orders (up to 30% of the simulation time).

In this model, the environment consists of both the book of orders described above and the communication network used by the agents to exchange information.
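To illustrate the kind of bookkeeping involved, here is a minimal C sketch of a per-stock book with sorted order insertion; the structures and names (order_t, book_t, insert_buy_order) are hypothetical and much simpler than those used in the actual simulator:

#include <stddef.h>

/* Hypothetical order record: pending orders form sorted singly linked lists. */
typedef struct order {
    double        price;
    int           quantity;
    int           expiry;      /* time step at which the order expires */
    struct order *next;
} order_t;

typedef struct {
    order_t *buy;   /* sorted from highest to lowest price */
    order_t *sell;  /* sorted from lowest to highest price */
} book_t;

/* Insert a buy order keeping the list sorted in descending price order. */
static void insert_buy_order(book_t *book, order_t *o)
{
    order_t **p = &book->buy;
    while (*p != NULL && (*p)->price >= o->price)
        p = &(*p)->next;
    o->next = *p;
    *p = o;
}

A sell-order insertion would be symmetric, with the comparison reversed; expired orders would be removed by a periodic scan of both lists.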


2 Parallelization: general issues

Applications like those described in the previous sections look very suitable to parallel processing since they are computing intensive and relatively easy to split among a set of processors. At first glance, in fact, each processor can deal with a subset of the agents, exchanging information with other processors when required. However, there are non-trivial issues that need to be addressed.

The Immune System simulator was developed with message passing based parallel processing in mind from the very beginning. MPI was adopted for its rich set of Collective Communication Primitives (CCPs). The lattice representing the environment is decomposed among parallel processes, and all data are local to the processes, without replications. The problem is not “embarrassingly parallel” because of two features of the simulation. First of all, there is a diffusion phase in which cells and molecules may migrate from a lattice site towards one of its six nearest neighbors. Second, a specific antigen is represented by a global and unique record. Since in some cases new antigens may appear in a lattice site during the simulation (either by mutation or by direct injection), it is necessary to synchronize the information among the processing elements to prevent the existence of duplicated records.

The approach to the parallelization of the Stock Market simulator was very similar. Each task of a parallel run is in charge of a subset of the agents. In both applications, during the diffusion phase, agents migrate from one task to another and the communication is point-to-point. CCPs are employed to evaluate global quantities required by all tasks. However, the introduction of the book-of-orders significantly changed the situation in the stock market simulator. For each stock there is a single “book”. The issuance of a new order entails an interaction which can be either local or remote depending on the location of the agent. There are, at least, three messages for any successful non-local order:

(i) a first point-to-point message for the submission of the order;
(ii) a second point-to-point message to communicate the result of the order;
(iii) a third, broadcast message to communicate the new price for that stock (actually, this is required only if the order is fulfilled).

Load balancing can be an issue as well. It is apparent how a parallelization based on an explicit message passing mechanism in a distributed memory environment becomes unwieldy. The clear trend towards large shared memory systems led us to consider a different approach to parallelization, using pthreads or OpenMP [7].
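For concreteness, the message pattern just described (which motivated the move away from explicit message passing) could look roughly as follows in MPI; the payload layout, tags and function name are hypothetical, and in a real code every task must reach the MPI_Bcast, since it is a collective operation:

#include <mpi.h>

/* Illustrative sketch of the three messages per successful non-local order,
   seen from the task issuing the order. The task owning the book posts the
   matching MPI_Recv/MPI_Send, and all tasks take part in the broadcast. */
void remote_order(double order[3], int book_owner, MPI_Comm comm)
{
    double result[2];   /* e.g. filled flag and trade price (hypothetical layout) */
    double new_price;

    /* (i) submit the order to the task owning the book */
    MPI_Send(order, 3, MPI_DOUBLE, book_owner, 0, comm);

    /* (ii) receive the outcome of the order */
    MPI_Recv(result, 2, MPI_DOUBLE, book_owner, 1, comm, MPI_STATUS_IGNORE);

    /* (iii) new price of the stock, broadcast by the book owner
       (in the actual protocol this is needed only for fulfilled orders) */
    MPI_Bcast(&new_price, 1, MPI_DOUBLE, book_owner, comm);
}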

2.1 Case study: the financial simulator

We chose to parallelize the simulator of financial markets described in section 1.2 with OpenMP on a shared memory system. This case is very tough for distributed memory architectures. Luckily, significant simplifications arise from the lack of direct interactions among agents: the only interactions take place through the environment (i.e., the book of orders). The structure of the main program can be summarized as follows:

While (Simulation is running) {
    External information hits the market;
    While (no price changes) {
        Pick another agent;                       /* shuffled order */
        Decide what to do;                        /* trade or not, which asset and how much */
        If (Active) {
            Decide the order type;                /* market or limit order? */
            Check if the agent has enough resources to buy/sell;
            Post the order in the Book of Orders; /* possible contentions!! */
            If (order can be fulfilled) {
                Update involved agents;           /* possible contentions!! */
                Update asset price;               /* possible contentions!! */
            }
        }
    }
    Check for expiring orders in the book;        /* to contend or to serialize? */
    Substitute new agents for those who ran out of money;
    Save info, compute statistics, ...;
}

Agents must be reshuffled at every time step to change their order of selection. This is necessary to avoid any unfair bias which could arise if the order of agents’ interaction with the market were fixed. Agents running out of money must be removed from the system and replaced by new agents entering the market. The most convenient data structure to manage this behavior is the linked list, which is iterated on using pointer chasing loops and dynamically reordered as needed.

Each agent has to take some decisions. Part of these (to trade or not, which asset, how much, order type, ...) are based on pseudo-random number generation, whereas the trading price can be determined pseudo-randomly or algorithmically, according to the agent’s nature (noisy, fundamentalist, chartist). This implies that the computational cost of each agent interaction cannot be predicted. Experimental measurements on an Intel Pentium 4, 1.6 GHz CPU, for a realistic blend of noisy, fundamentalist and chartist traders, show that 96% of active agents take from 1000 to 3000 CPU clock cycles to trade. Inactive agents take next to zero CPU clock cycles. The probability of an agent being inactive is an input parameter.

As each active agent interacts with a book of orders, the corresponding data structures must be shared. The locking mechanism required in the parallel version to regulate access to the book of orders can become a serious bottleneck when the number of threads grows. In fact, the cost of lock operations grows with the number of threads. Moreover, the number of traders (agents) usually exceeds the number of assets (books of orders) by orders of magnitude. As a consequence, when the number of threads grows, the probability of contention grows as well.

The main question becomes whether to parallelize on the agents or on the assets. The former approach seems more suitable for load balancing, since in our case the number of agents is by far greater than the number of assets (as in reality). Moreover, in real markets, at a given time, most activity involves a subset of the existing assets. On the other hand, parallelizing on the assets, i.e., having each asset “query” agents for trading, could help to tame the locking problem. However, this is a specific feature of the stock market simulation: exploiting it would undermine the validity of the results when applied to generic agent-based models. Therefore, in the following, we analyze the feasibility of a parallelization on the agents.
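To make the contention issue concrete, here is a minimal sketch, not taken from the actual simulator, of how access to a per-asset book could be regulated with OpenMP locks; the structures and names (book_t, post_order) are hypothetical:

#include <omp.h>

typedef struct { double price; int quantity; } order_t;

typedef struct {
    omp_lock_t lock;   /* one lock per asset, initialized once with omp_init_lock() */
    /* ... sorted buy and sell order lists ... */
} book_t;

/* Every thread posting an order on this asset must acquire the same lock,
   so contention grows with the ratio of threads to assets. */
void post_order(book_t *book, const order_t *o)
{
    omp_set_lock(&book->lock);
    /* insert *o in the proper sorted position and try to match it */
    omp_unset_lock(&book->lock);
}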

2.2 OpenMP parallelization issues

The OpenMP standard [7] supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It has been widely adopted in the scientific computing community, and most vendors support its Application Programming Interface (API) in their compiler suites. OpenMP offers not only parallel program portability but, being based on directives, also a simple way to maintain a single code for the serial and parallel versions of an application. In C/C++ codes, explicit calls to OpenMP API functions can be excluded from serial executables using conditional compilation.

Despite the apparent similarity of OpenMP to the traditional micro-tasking directives customary on vector computers, all OpenMP implementations rely on the support for threads provided by the host operating systems. Very often, the system call overhead makes OpenMP unsuitable for fine grain parallelism. Therefore the parallel decomposition should possibly be medium or coarse grained. The computational cost of the simulation of a single agent activity (see section 2.1) is very low, and the chance of inactivity makes things even worse. As a consequence, the “natural” choice of one agent per thread is not viable, since the computation would be dominated by the parallelization overhead. It is however possible to lessen or eliminate this inefficiency by building units of work made up of a significant number of agents, whose activity is simulated sequentially by each thread in a single block. This approach should also mitigate the problems due to the locking of shared data (the book of orders).

The overhead due to the locking can be further reduced by carefully using the omp atomic directive. This primitive offers a significant reduction of contention costs, as it protects the update of a single memory location and can be optimized in the OpenMP runtime. Moreover, hardware support for atomic updates of memory locations can be exploited, if present. However, this technique does not apply to every transaction involving coherent updates of multiple variables, nor to generic agent-based models.

The most general issue, however, is that OpenMP was conceived for “classic” numerical codes, where there are arrays and loops on indexes, whereas in the simulators we have described so far the loops are on dynamic lists managed by means of pointers. A number of well known techniques are available for parallel updates and reshuffling, far fewer for the traversal of lists in parallel. Pointer chasing loops are, by definition, serial constructs.

One possible solution is to decompose the lists in (approximately) equal sized chunks, store pointers to these chunks in an array, and then use a standard omp for worksharing directive, choosing a suitable scheduling to overcome possible load unbalances (a sketch of this idiom is given at the end of this section). However, this approach is too naive. The list decomposition is not only difficult to parallelize, but completely useless for a serial execution. As such, it constitutes a significant departure from the spirit of maintaining a single version of the code. Of course, preprocessor directives could be used to this purpose, but this would be equivalent to keeping two different codes in a single source file.

For this kind of codes, Kuck and Associates Inc. (now Intel KAI Software Laboratory) proposed the introduction of the workqueueing concept in OpenMP [8], implemented for C/C++ codes with the two directives omp taskq and omp task:

#pragma omp parallel shared(list)
{
    #pragma omp taskq private(p)
    for (p = list; p != NULL; p = p->next)
        #pragma omp task
        {
            /* work on *p */
        }
}

Workqueueing is a flexible mechanism for specifying units of work that are not pre-computed at the start of the worksharing construct. The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be executed. From among all the threads that encounter a taskq pragma, one is chosen to execute it initially. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread, and then the code inside the taskq block is executed in single-threaded mode. All the other threads wait for work to be enqueued on the conceptual queue. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is conceptually enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all work enqueued on it finishes and when the end of the taskq block is reached. To preserve sequential semantics, there is an implicit barrier at the completion of the taskq. The user is responsible for ensuring either that no dependencies exist or that dependencies are appropriately synchronized, either between the task blocks, or between code in a task block and code in the taskq block outside of the task blocks. Besides enabling possible runtime optimizations, this mechanism supports the usual and convenient privatization and reduction clauses. Up to now, the OpenMP Architecture Review Board has not accepted this extension, which is available only in the KAI KAP/Pro ToolSet product.

A third way of parallelizing pointer chasing loops has been proposed in [9]:

#pragma omp parallel private(p) shared(list)
{
    for (p = list; p != NULL; p = p->next)
        #pragma omp single nowait
        {
            /* work on *p */
        }
}

This idiom deserves some additional words. The omp single is a worksharing directive that identifies a block of code to be executed by only one thread in the team. Actually, all threads invoke the directive, but execution of the associated structured block is granted only to the first thread that reaches the construct. By default, there is an implied barrier at the end of a single construct: all other threads have to “catch up” to the barrier point before they all simultaneously execute the next statements. However, the nowait clause eliminates the implied barrier, so all threads loop independently along the list, chasing from pointer to pointer through the same sequence of addresses.

In this way, at the first iteration, a single thread starts to work on the first element of the list whereas the other threads continue to the second one. Likewise, only one other thread is selected to operate on the second element, and so on, until each thread gets one unit of work. When the first thread completes its work on the first item of the list, it recovers its backlog of iterations, skipping those items of the list already assigned to other threads, until it reaches an iteration not yet assigned. Being the first to enter that omp single nowait instance, it again gets the right to operate on the corresponding list item. The same happens with all the other threads when they finish with their first assigned item, up to the end of the list.

There is an obvious overhead in this mechanism: with n threads, each one has to perform n − 1 idle iterations for every unit of useful work. On the other hand, it is a standard and portable OpenMP construct. The cost of the omp single directive grows with the number of threads involved (see [10]). However, the program flow naturally staggers thread arrivals at the omp single nowait directive, thus reducing contention on the shared data needed to implement this directive. As a consequence, the overhead could be less severe than expected at first glance.

In view of the parallelization of the financial market simulator, we decided to analyze in depth the efficiency and scalability of the idioms presented above.
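For reference, the chunk-based idiom mentioned earlier in this section could look roughly as follows; the chunk_t structure, the way the chunks[] array is filled, and the dynamic schedule are assumptions for illustration, not details of the actual code:

typedef struct agent { /* ... agent attributes ... */ struct agent *next; } agent_t;
typedef struct { agent_t *first; int count; } chunk_t;

/* chunks[] is assumed to have been filled by a serial sweep of the list,
   recording the first element and the length of every chunk. */
void process_chunks(chunk_t *chunks, int nchunks)
{
    int i;
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < nchunks; i++) {
        agent_t *p = chunks[i].first;
        int k;
        for (k = 0; k < chunks[i].count; k++, p = p->next) {
            /* work on *p */
        }
    }
}

Note that the serial sweep building chunks[] is pure overhead for a serial run, which is precisely the objection raised above against this approach.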

3 Experimental measurements

In order to assess the behavior of the different parallelization schemes sketched in the previous section, we decided to write a compact ad hoc code including the salient features of the financial market simulator. In this way we avoided the chore of parallelizing, debugging and validating the full code in three different ways.

The test code mimics the behavior of the computational core under study, namely the trading phase and in particular the interaction of the agents with the book of orders. The creation and removal of agents was dispensed with, as was their reshuffling. The number of agents in a test run is provided in the input phase. In the end the test code amounts to a loop on a linked list of agents, each iteration spending some time working on each agent and updating some global data structures. Agent activity is mimicked by means of a fixed point iteration of the cosine function:

#include <math.h>

double fpiter(int n, double x)
{
    double s = x;
    int i;
    for (i = 0; i < n; i++)
        s = cos(s);   /* fixed point iteration of the cosine */
    return s;
}
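As a minimal sketch (under stated assumptions, not the actual test code), the test loop might combine the omp single nowait idiom of section 2.2 with fpiter as follows; the agent_t structure, its field names, and the shared accumulator standing in for the global data structures are hypothetical:

typedef struct agent { int niter; double x; struct agent *next; } agent_t;

double fpiter(int n, double x);   /* defined above */

void sweep(agent_t *list, double *global)
{
    agent_t *p;
    #pragma omp parallel private(p) shared(list, global)
    {
        for (p = list; p != NULL; p = p->next) {
            #pragma omp single nowait
            {
                double r = fpiter(p->niter, p->x);   /* mimic agent activity    */
                #pragma omp atomic
                *global += r;                         /* mimic the shared update */
            }
        }
    }
}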