A Generic Distributed Architecture for Business Computations. Application to Financial Risk Analysis.

Arnaud Defrance, Stéphane Vialle, Morgann Wauquier
[email protected]
Supelec, 2 rue Edouard Belin, 57070 Metz, France

Philippe Akim, Denny Dewnarain, Razak Loutou, Ali Methari
[email protected]
Summit Systems S.A., 62 Rue Beaubourg, 75003 Paris, France
Abstract
This paper introduces a generic n-tier distributed architecture for business applications, and its basic component: a computing server distributed on a PC cluster. The design of this distributed server draws on both cluster programming and business standards: it mixes TCP client-server mechanisms, MPI programming and database accesses from cluster nodes. The main topics investigated in this paper are the design of a generic distributed TCP server based on the MPI paradigm, and the experimentation and optimization of concurrent database accesses from cluster nodes. Finally, this new distributed architecture has been integrated in the industrial Summit environment, and a Hedge application (a classical risk analysis computation) has been implemented; its performance has been measured on laboratory and industrial testbeds, validating the architecture and exhibiting interesting speedups.
1 Motivations
Typical HPC scientific applications run long computations in batch mode on large clusters or supercomputers, and use flat files to read and write data. The classical MPI paradigm is well adapted to implement these applications on clusters. But business applications are different. They use client-server mechanisms, are often 3-tier, 4-tier or n-tier architectures [2], access databases, and need to be run immediately in interactive mode. Users run numerous short or medium computations from client applications, at any time, and need to get the results quickly (on-demand computations). They need to run their computations on available servers in interactive mode. Moreover, business application architectures are evolving: new functionalities can be added easily, simply by adding new server processes to the architecture. Considering all these industrial constraints, we have designed a parallel architecture including on-demand parallel computing servers. These computing servers run on PC clusters and mix a distributed programming paradigm, client-server mechanisms and database accesses. This parallel architecture has been evaluated on financial risk analysis based on Monte-Carlo simulations [6], which leads to intensive computations under time constraints. For example, traders in market places need to run large risk computations and to get the results quickly to make their deals on time! The financial application we wanted to speed up and to size up used a distributed database, several scalable computing servers and some presentation servers, but still needed to increase the power of its computing (applicative) servers. The new solution uses a generic distributed server architecture and some standard development libraries, and has been integrated in the industrial Summit environment.

Section 2 introduces our complete generic architecture and its basic component (a distributed on-demand server), section 3 focuses on the mix of the TCP client-server paradigm and MPI distributed programming, and section 4 introduces our investigations about database accesses from cluster nodes. Then, section 5 describes the implementation of a complete financial application on our distributed architecture, and exhibits some experimental performances in an industrial environment. Finally, section 6 summarizes the achieved results and points out the next steps of the project.
2 Architecture overview

To parallelize our business applications we consider a large pool of PCs, organized in clusters (see figure 1), and a two-level distributed architecture. The high level takes into account the partitioning of the pool of PCs into clusters. These clusters can be statically defined or dynamically created and cancelled. They can have a fixed size or an on-demand size. These topics are still under analysis, and the high level of our architecture is not totally fixed yet. The low level defines the basic component of our architecture: the cluster architecture of our distributed on-demand server, and its distributed programming strategy. This strategy is TCP- and MPI-based and has to interface databases, TCP clients and distributed processes on cluster nodes (see figure 2). In the next sections this paper focuses on the design of the basic component of our distributed architecture, and introduces our first experiments on two different testbeds.

3 MPI-based TCP server

We have identified three important features for the MPI-based TCP server we want to design (see figure 2):

1. The server node has to be clearly identified (ex: "PE-03" on figure 2) and fixed by users, not by the MPI runtime mechanism.

2. The TCP server node has to be a worker node, listening for client requests and processing a part of each request.

3. The server has to remain up after processing a client request, listening for a new request and ready to process it.

3.1 TCP server identification

A usual MPI program is run by specifying just the number of processes needed and a file listing all available machines to host these processes [7]. We do not make any assumption on the numbering of the MPI processes and their mapping on the available machines: they are fixed by the MPI runtime mechanism. So, we decided the TCP server process will not be the MPI process with a specific rank, but the MPI process running on a specific node. The TCP server node is specified in a config file or on the command line (ex: mpirun -np 10 -machinefile machines.txt prog.exe -sn PE-03, where -sn specifies the server name). It is identified at execution time, during the initialization step of the MPI program. The top of figure 3 illustrates this initialization step. Each MPI process reads the name of the TCP server node, gets the name of its host node, detects if it is running on the server node, and sends this information to MPI process 0. So, MPI process 0 can identify the rank of the MPI process hosted by the server node, and broadcasts this rank to all MPI processes. Finally, each MPI process knows the rank of the server process. Moreover, this algorithm automatically chooses the MPI process with the lowest rank if several MPI processes are hosted by the server node.
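As an illustration, the following sketch shows how this identification step can be coded with MPI. It is a minimal, simplified version written for this paper, not the actual implementation; the parsing of the -sn option is assumed to have already produced serverName.

    #include <mpi.h>
    #include <string>
    #include <vector>

    // Returns the rank of the process elected as TCP server: the lowest-ranked
    // MPI process running on the node named serverName (simplified sketch,
    // error handling omitted).
    int identifyServerRank(const std::string& serverName) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each MPI process checks whether it runs on the requested server node.
        char host[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(host, &len);
        int serverNodeFlag = (serverName == std::string(host, len)) ? 1 : 0;

        // MPI process 0 gathers all the flags and picks the lowest matching rank.
        std::vector<int> flags(size, 0);
        MPI_Gather(&serverNodeFlag, 1, MPI_INT, flags.data(), 1, MPI_INT,
                   0, MPI_COMM_WORLD);
        int serverRank = -1;
        if (rank == 0)
            for (int r = 0; r < size; ++r)
                if (flags[r] == 1) { serverRank = r; break; }

        // The elected rank is broadcast so that every process knows the server.
        MPI_Bcast(&serverRank, 1, MPI_INT, 0, MPI_COMM_WORLD);
        return serverRank;
    }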
3.2 Distributed server algorithm

When the server process is identified, the first computation step starts. The server process initializes and listens on a socket, waiting for a client request, while the other MPI processes wait for a message from the server process, as illustrated in the bottom-left part of figure 3. Each time the server process receives a client request, it reads this request, may load some data from the database, and splits and load balances the work on all the MPI processes (it broadcasts or scatters a mission to each MPI process). Then each process receives its mission, usually loads data from the database and starts its work (see the bottom-right part of figure 3). These computation steps can include some inter-process communications, depending on whether we process a set of embarrassingly parallel operations or some more complex computations. At the end, the server process gathers the different results (for example with an MPI gather operation), and returns the global result to the client. Finally, the server process waits for a new client request, while the other MPI processes wait for a new message from the server process. This algorithm has been successfully implemented using MPICH and the CommonC++ socket library [1] on Windows and Linux. The next experiments introduced in this article use this client/distributed-server architecture.
Figure 1: Global architecture: an N-tier client-server architecture and a pool of PCs split into clusters. Applicative servers are distributed on PC clusters (client machines, databases, and a pool of PCs organized in clusters of MPI processes, interconnected by a Gigabit network).
Figure 2: Detail of the basic component of the architecture: a distributed and parallel TCP applicative server (a client connects through a TCP socket to the MPI process on node PE-03, which exchanges MPI broadcast and point-to-point communications with the MPI processes on nodes PE-01 to PE-xx, all accessing the database).
4 DB access experiments

Business applications use databases, so our distributed architecture needs to access databases from cluster nodes. In the next sections we identify three main kinds of database accesses and several access strategies from cluster nodes. Then we try to point out a generic and efficient strategy for each kind of access.

4.1 Three kinds of DB accesses

We can easily identify three kinds of data accesses from concurrent processes:

1. Independent data reading: each MPI process reads specific data. For example, financial computing processes load different lists of trades to evaluate.

2. Identical data reading: all MPI processes need to read the same data. For example, financial computing processes need to load the same market data before evaluating their trades.

3. Independent data writing: each MPI process writes its results in a database.

4.2 Possible DB access strategies

From an MPI parallel programming point of view, we can adopt different implementation strategies. Database accesses can be distributed on the different processes, or can be centralized (gathered) on a process that will forward/collect the data to/from the other processes, using MPI communications. When centralized on a process, database accesses and data forwarding/collection can be serialized or overlapped, depending on the MPI communications used; a sketch of the overlapped variant is given below. When database accesses are distributed on all MPI processes, their executions can be highly concurrent or can be scheduled using MPI synchronization mechanisms. All these strategies have been experimented with on the three kinds of database access identified previously, and on two testbeds using different databases.
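As an illustration of the centralized strategies, here is a rough sketch of a centralized-overlapped read, written for this paper under the assumption of a hypothetical loadDataForWorker() database helper (not part of the MySQL or Summit APIs): the central process reads the next worker's block from the database while the previous block is still being sent with a non-blocking MPI send.

    #include <mpi.h>
    #include <vector>

    // Hypothetical helper: reads from the database the block of data that
    // worker 'rank' has to process.
    std::vector<double> loadDataForWorker(int rank);

    // Centralized-overlapped strategy: the next database read proceeds while
    // the previous message is still in flight (workers are assumed to probe
    // for the message size before receiving; this part is omitted).
    void centralizedOverlappedRead(int nbProcs) {
        std::vector<std::vector<double>> blocks(nbProcs);
        std::vector<MPI_Request> requests(nbProcs, MPI_REQUEST_NULL);

        for (int w = 1; w < nbProcs; ++w) {      // process 0 keeps its own data
            blocks[w] = loadDataForWorker(w);    // database access for worker w
            MPI_Isend(blocks[w].data(), (int)blocks[w].size(), MPI_DOUBLE,
                      w, 0, MPI_COMM_WORLD, &requests[w]);
        }
        MPI_Waitall(nbProcs, requests.data(), MPI_STATUSES_IGNORE);
    }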
Figure 3: Generic algorithm of a TCP parallel server implemented on a cluster with MPI.
Initialization step (all processes): MPI_Init(); detection of the MPI process rank and of the total number of MPI processes; each process reads the name of the server node on the command line or in the config file, gets its HostName (system call), checks if it runs on the server node (HostName = ServerName?) and sets its ServerNodeFlag; MPI process 0 gathers all the ServerNodeFlags, looks for the MPI process with the lowest rank that has its ServerNodeFlag set, stores this ServerProcessId and broadcasts it to all other MPI processes; each process compares its MPI rank to the ServerProcessId and sets its ServerProcessFlag.
Computation steps (server process): initializes a TCP socket and listens on its TCP port; waits for a client TCP request; reads some data from the database; splits the client request and scatters the work to each worker process; reads some data and executes its own work; gathers the results; sends the final result to the TCP client (or to a result server!); closes the TCP socket; MPI_Finalize().
Computation steps (non-server processes): wait for a message from the server process; read some data and execute their work; send their results to the server process.
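A minimal sketch of the computation loop of figure 3 is shown below. It is a simplification written for this paper: the mission is reduced to a trade count broadcast to all processes and the partial results are combined with a reduction, whereas the real server scatters per-process missions and gathers structured results; readClientRequest(), sendResultToClient() and evaluateTrades() are hypothetical stand-ins for the CommonC++ socket code and the pricing code.

    #include <mpi.h>
    #include <algorithm>

    // Hypothetical placeholders for the TCP and pricing parts of the server.
    int    readClientRequest();                       // blocking read on the TCP socket
    void   sendResultToClient(double result);         // reply on the TCP socket
    double evaluateTrades(int firstTrade, int count); // prices a block of trades (may read the DB)

    // Simplified computation loop of figure 3, executed by every MPI process.
    void computationLoop(int serverRank) {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        while (true) {
            int nbTrades = 0;
            if (rank == serverRank)
                nbTrades = readClientRequest();        // wait for a client request

            // The server broadcasts the mission description to every process.
            MPI_Bcast(&nbTrades, 1, MPI_INT, serverRank, MPI_COMM_WORLD);

            // Naive split: contiguous blocks of trades per process.
            int chunk = (nbTrades + size - 1) / size;
            int first = rank * chunk;
            int count = std::max(0, std::min(chunk, nbTrades - first));
            double localResult = evaluateTrades(first, count);

            // The server collects the partial results and replies to the client.
            double globalResult = 0.0;
            MPI_Reduce(&localResult, &globalResult, 1, MPI_DOUBLE, MPI_SUM,
                       serverRank, MPI_COMM_WORLD);
            if (rank == serverRank)
                sendResultToClient(globalResult);
        }
    }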
4.3 MySQL database experiments

Figures 4, 6 and 8 show the times we have measured on our laboratory testbed: a 6-node Gigabit Ethernet PC cluster accessing a MySQL database across the native MySQL API. This database was hosted on one dedicated PC with SCSI disks. When reading independent data from each node (ex: loading some trades to evaluate), it is really faster when each node makes its own database access than when the requests are centralized on one node (see figure 4). The database did not slow down when all the cluster nodes emitted concurrent requests, and there was no improvement when trying to schedule and serialize the different accesses. On the opposite, when reading an identical set of data from each node (ex: loading some common market data on each processor), the most efficient way is to make only one database access on one node and to broadcast the data to all other nodes (see figure 6). An MPI broadcast operation makes it easy to implement. Finally, when writing independent results from each node in the database, it is more efficient to make concurrent write operations than to centralize the data on one node before writing in the database (see figure 8). This is similar to reading independent data, but concurrent writes in our MySQL database were a little bit more efficient when scheduled and serialized to avoid contention on the database (see the distributed-scheduled curve). MPI synchronization barriers make this scheduling easy to implement.
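The two winning strategies can be sketched as follows (a simplified illustration written for this paper; loadMarketData() and writeResults() are hypothetical stand-ins for the real MySQL calls):

    #include <mpi.h>
    #include <vector>

    // Hypothetical database helpers (stand-ins for the native MySQL API calls).
    std::vector<double> loadMarketData();                    // identical data for all nodes
    void writeResults(const std::vector<double>& results);   // independent data per node

    // Identical data reading, "centralized & broadcast" strategy:
    // a single node queries the database, then the data is broadcast.
    std::vector<double> broadcastMarketData(int rank) {
        std::vector<double> data;
        int n = 0;
        if (rank == 0) { data = loadMarketData(); n = (int)data.size(); }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        data.resize(n);
        MPI_Bcast(data.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        return data;
    }

    // Independent data writing, "distributed & scheduled" strategy:
    // every node writes its own results, but the writes are serialized with
    // barriers to avoid contention on the database server.
    void scheduledWrite(int rank, int size, const std::vector<double>& results) {
        for (int turn = 0; turn < size; ++turn) {
            if (rank == turn) writeResults(results);   // one writer at a time
            MPI_Barrier(MPI_COMM_WORLD);               // wait before the next writer
        }
    }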
Figure 4: Loading different data on a 6-node cluster from a MySQL database, with SCSI disks and across the native MySQL API (total data access time vs. total size of data read in DB, for the centralized-serialized, centralized-overlapped, distributed-concurrent and distributed-scheduled strategies).

Figure 5: Loading different data on a 3-node cluster from a Sybase database, with IDE disks and across the Summit API.

Figure 6: Loading a same set of data on a 6-node cluster from a MySQL database, with SCSI disks and across the native MySQL API (distributed & concurrent vs. centralized & broadcast).

Figure 7: Loading a same set of data on a 3-node cluster from a Sybase database, with IDE disks and across the Summit API.
4.4 Sybase database experiments

This section introduces experiments on a Sybase database, accessed across the Summit API (used by all Summit applications), and hosted on one dedicated PC connected to the cluster across a Gigabit Ethernet network. However, this PC had only IDE disks, leading to longer execution times than the previous experiments on the laboratory testbed. For each kind of database access, the distributed & concurrent strategy appeared the most efficient, or one of the most efficient (see figures 5, 7 and 9). So it has been the favorite, because it is efficient and it is very easy to implement. It seems the industrial Sybase database (accessed across the Summit API) supports any kind of concurrent accesses from a small number of nodes.

4.5 Best strategies

Experiments on industrial and laboratory testbeds have led to point out different best strategies to access a database from a cluster (see table 1). Finally, it is possible to identify the best common strategies for the two testbeds we have experimented with, considering only efficient strategies and looking for good compromises. But we prefer to consider that real business applications will use only industrial databases (like Sybase), and to focus on the distributed-concurrent strategy. However, performances and optimal strategies could change when the number of processors increases. So, our next experiments will focus on larger industrial systems.
Figure 8: Storing different data on a MySQL database from a 6-node cluster, with SCSI disks and across the native MySQL API.

Figure 9: Storing different data on a Sybase database from a 3-node cluster, with IDE disks and across the Summit API.

Table 1: Best database access strategies from cluster nodes.

                                          | Independent data reading  | Identical data reading    | Independent data writing
MySQL - native MySQL API (most
interesting strategies)                   | distributed & concurrent  | centralized & broadcast   | distributed & scheduled
Sybase - Summit API (most
interesting strategies)                   | distributed & concurrent  | distributed & concurrent  | distributed & concurrent
Best common strategies
(interesting compromises)                 | distributed & concurrent  | centralized & broadcast   | distributed & concurrent
5 The Hedge application

The Hedge computation is a classical computation in risk analysis. It calls a lot of basic pricing operations (frequently based on Monte-Carlo simulations) and considers large sets of market data. It is time consuming and a good candidate for distributed computing, like many financial risk computations ([8, 4]). The following sections introduce its implementation in the industrial Summit environment, according to our distributed TCP-server architecture.
5.1 Introduction to the Hedge

The Hedge is the computation of a set of discrete derivatives, based on finite differences. The market data are artificially disturbed, some trade values are re-computed, and the differences with their previous values measure their sensitivity to market data changes. These computations allow us to detect and reduce the risks associated with trades and with market data evolution.

The distributed algorithm of our Hedge computation is illustrated on figure 10. The server process receives a client request (a request for a Hedge computation) (1), loads from the database some information on the trades to process (2), and splits the job for the different cluster nodes, according to a load balancing strategy (3). Then the MPI server process sends to each MPI process a description of its job, using an MPI collective communication (broadcast or scatter operation). So, each MPI process receives a description of a task to process (4) and then loads data from the database: a list of trades to process (4) and a list of up-to-date market data (5). Then all cluster nodes shift the market data (6) to artificially disturb these data, and re-evaluate all their trades (7). Finally, the server process gathers these elementary results (8) (executing an MPI gather operation) and returns the global result to the client (9).
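In equation form, the sensitivity computed at steps (6) and (7) is, roughly, a first-order finite difference, written here for a single market datum m shifted by Delta m (the exact shift scheme depends on the Hedge configuration and is an assumption of this illustration):

\[
  \mathrm{sensitivity}(t, m) \;\approx\; \frac{V(t,\, m + \Delta m) - V(t,\, m)}{\Delta m}
\]

where V(t, m) denotes the value of trade t priced with market data m.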
5.2 Optimized implementation
We have implemented the distributed algorithm introduced in the previous section in the Summit environment, with different optimizations.

The first optimization has been to implement the database accesses with respect to the optimal strategies identified on an industrial database in section 4.5. Each processor needs to read the trades and the market data it has to process (steps 4 and 5 of the algorithm of figure 10). The corresponding database accesses are achieved according to the distributed-concurrent strategy: each processor accesses its data concurrently to the other processors.

The second optimization has been to call a specific function of the Summit API to read short trade descriptions on the server, in place of complete descriptions, to achieve step 2 of the algorithm (see figure 10). This specific function was available but not used in sequential programs. This optimization has decreased the serial fraction of the Hedge computation by 30%. Its impact will be great on large clusters, when the execution time of the serial fraction becomes the most important part of the total execution time, according to Amdahl's law [3].
Figure 10: Main steps of our Hedge computation.
Server process: 0 - initializations; 1 - receives a request on its TCP port; 2 - asks the DB for a list of trade descriptions; 3 - load balances the set of trades to evaluate on the different nodes, and sends a task description to each node (MPI communication); 4 - receives a task description (MPI comm.) and loads its trades from the DB; 5 - loads its data from the DB; 6 - shifts the data; 7 - evaluates its trades; 8 - gathers the results (MPI comm.); 9 - sends the final result to the client (on the TCP socket).
All other processes: 4 - each node receives a task description (MPI comm.) and loads its trades from the DB; 5 - loads its data from the DB; 6 - shifts the data; 7 - evaluates its trades; 8 - sends its results to the server node (MPI comm.); halt (if a halt message is received).
The third optimization has been to improve the load balancing mechanism (step 3 on figure 10), taking into account that the trades are heterogeneous. Now the server process sorts the trades according to their kinds or to their estimated processing times, and distributes the trades on the different processors applying a round-robin algorithm, as sketched below. To estimate the processing times of the trades, some benchmarks have to be run before the Hedge computation to measure some elementary execution times. These measures are stored in a database and read by the server process. However, trades of the same kind have similar processing times, and optimizations of the load balancing based on the kind of the trades or on their estimated execution times have led to close improvements. Finally, compared to a basic load balancing of the amount of trades, the total execution time has decreased by 35% on 3 PCs, when processing 5050 trades of 5 different kinds.
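A minimal sketch of this balancing step is given below; the Trade structure and its estimated cost field are hypothetical placeholders, whereas in the real implementation the estimated times come from benchmarks stored in the database.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical trade description used by the load balancer.
    struct Trade {
        int    id;
        double estimatedCost;   // benchmarked elementary execution time
    };

    // Sorts the trades by decreasing estimated cost and deals them out in a
    // round-robin way, so that every processor receives a similar total cost.
    std::vector<std::vector<Trade>> balanceTrades(std::vector<Trade> trades,
                                                  int nbProcs) {
        std::sort(trades.begin(), trades.end(),
                  [](const Trade& a, const Trade& b) {
                      return a.estimatedCost > b.estimatedCost;
                  });
        std::vector<std::vector<Trade>> perProc(nbProcs);
        for (std::size_t i = 0; i < trades.size(); ++i)
            perProc[i % nbProcs].push_back(trades[i]);
        return perProc;
    }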
5.3 Experimental performances
We have run optimized distributed Hedge computations on an 8-node industrial cluster, with a Sybase database and the Summit environment. We have processed a small portfolio of 5000 trades and a larger one of 10000 trades; both were heterogeneous portfolios including different kinds of trades. The curves on figure 11 show the speedup reaches approximately 5 on 8 PCs for the small portfolio, and increases up to 5.8 for the large portfolio, leading to an efficiency close to 73%.

These first real experiments are very good, considering the entire Hedge application includes database accesses and sequential parts, and processes heterogeneous data. Moreover, performances seem to increase when the problem size increases, according to Gustafson's law [5]. Our distributed architecture and algorithm are promising to process larger problems on larger clusters.
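For reference, the quoted efficiency follows directly from the measured speedup:

\[
  E(P) \;=\; \frac{S(P)}{P} \;=\; \frac{5.8}{8} \;\approx\; 0.73
\]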
Figure 11: Experimental speedup of a Hedge computation distributed on an industrial system, including a Sybase database, a Gigabit Ethernet network and the Summit environment (speedup vs. number of PCs, for the 5000-trade and 10000-trade portfolios, compared to the ideal S(P) = P).
6 Conclusion

This research has introduced a generic n-tier distributed architecture for business applications, and has validated its basic component: a distributed on-demand server mixing TCP client-server mechanisms, MPI programming and concurrent accesses to databases from cluster nodes.

We have designed and implemented a distributed TCP server based on both the TCP client-server and MPI paradigms, and we have tracked the optimal strategies to access databases from cluster nodes. Then we have distributed a Hedge computation (a classical risk analysis computation) on an industrial testbed, using our distributed TCP server architecture and optimizing the database accesses and the load balancing mechanism. We have achieved very good speedups, and an efficiency close to 73% when processing a classical 10000-trade portfolio on an 8-node industrial testbed. These performances show our distributed approach is promising.

The implementation of the basic component of our distributed architecture has been successful under Windows in the Summit industrial environment. We have used both the CommonC++ and MPICH-1 libraries, a Sybase database and the Summit proprietary toolkit. The next step will consist in running bigger computations on larger clusters, and testing the scalability of our distributed architecture.

Acknowledgment

The authors want to thank the Region Lorraine, which has supported a part of this research.

References

[1] GNU Common C++ documentation. http://gnu.mirrors.atn.ro/software/commonc++/commonc++.html.

[2] R. M. Adler. Distributed coordination models for client/server computing. Computer, 28(4), April 1995.

[3] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, pages 483-485, 1967.

[4] J.-Y. Girard and I. M. Toke. Monte Carlo valuation of multidimensional American options deployed as a grid service. 3rd EGEE conference, Athens, 18-22 April 2005.

[5] J. L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5), May 1988.

[6] John C. Hull. Options, Futures and Other Derivatives, 5th edition. Prentice Hall College Div, July 2002.

[7] P. S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.

[8] Shuji Tanaka. Joint development project applies grid computing technology to financial risk management. NLI Research report 11/21/03, 2003.