Improving the simulation of Storage Area Networks (SAN) using concurrent execution
À. Perles, X. Molero, A. Martí, V. Santonja, J.J. Serrano
Departament d’Informàtica de Sistemes i Computadors
Universitat Politècnica de València
Camí de Vera, 14. 46022 València (Spain)
e-mail:
[email protected]
KEYWORDS

Storage networks, parallel simulation, cluster computing, PVM environment, CSIM simulation language.

ABSTRACT

A Storage Area Network is a high-speed subnet that establishes a direct connection between heterogeneous storage resources and servers. Up to now, the work done in our department on the performance evaluation of these systems has been carried out using traditional simulation techniques. However, the SAN simulator designed by our researchers needed a lot of computational time to obtain statistically valid results. In this work we show how we have improved the execution time of our SAN simulator using a concurrent simulation approach. This approach basically consists of executing variable-sized independent replications of the simulation model in parallel. The results obtained encourage us to continue working on concurrent simulation development.

INTRODUCTION

Storage Area Networks (SAN) (Clark 1999) are an emerging data communications platform which interconnects servers and storage devices (such as disks, disk arrays, and tape drives) to create a pool of storage that users can access directly. This networking approach provides benefits such as computer clustering, topological flexibility, fault tolerance, high availability, and remote management (see Fig. 1).

In order to evaluate the performance of these systems, a very flexible and easy-to-use SAN simulator has been developed (Molero et al. 2000). This tool can model, among other features, both real-world I/O traces and synthetic I/O traffic, Fibre Channel and Myrinet switches, message fragmentation, faults in links and switches, virtual channels, and different routing algorithms. This simulator has been written using the CSIM (CSIM 1998) libraries for C. Our simulator accurately models a SAN focusing on its internal design, which makes the resulting simulation programs very computationally expensive.
Fig. 1. A typical SAN environment (diagram labels: servers, Storage Area Network (SAN), Local Area Network (LAN))
In order to improve the execution time of our SAN simulator, we have introduced concurrent simulation using the Clustered Simulation Experimenter (CSX) tool (Perles et al. 1999), which runs massive concurrent simulation experiments without requiring the user to locate the computational resources needed to carry them out. In this work, we present the benefits obtained when applying this tool to a CSIM-based SAN simulator. Distributed simulation (Fujimoto 1990) has not been considered because it cannot be easily applied to our model and its application normally requires extra effort from the modeller.

This work has been supported by both the CICYT TAP99-0443-C05-02 and the “UPV- Design and modeling of storage systems based on magnetic disks” projects.

THE SAN SIMULATION MODEL

We have used the CSIM language to implement the SAN simulator. CSIM consists of a library of procedures, functions and macros that give C programmers a powerful tool for developing discrete-event, process-oriented simulation models. A CSIM program models a system as a collection of CSIM processes that interact with each other by using structures internal to CSIM. The simulator, which is about 9500 lines long, has been written in ANSI C code in order to enable system portability. In fact, we have run it on both Unix and Windows systems.

The process-oriented conception of CSIM provides an easy way of structuring the simulator program. Before the simulation stage, the program first reads and tests the input parameters; next, it generates the network topology and computes the routing tables according to the specified routing algorithm. Once the load has been specified by means of trace files or synthetic generation, and the disks have been initialized, the main() function calls the main simulation process, sim(), to carry out the simulation itself. This program structure is defined by the following code:

    int main(void)
    {
        test_input_parameters();      /* read and check the input parameters */
        topology_generation();        /* build the network topology */
        calculate_routing_tables();   /* apply the selected routing algorithm */
        load_specification();         /* trace files or synthetic traffic */
        disks_initialization();
        sim();                        /* main simulation process */
        collect_statistics();
        return 0;
    }

Processes are one of the most important parts of the simulator design. The global process hierarchy is shown in Fig. 2. The sim() process is the main simulation process. It creates one server() process for each server connected to the SAN; these processes generate the I/O operations according to the load specification.

Fig. 2. CSIM process hierarchy of the SAN model: sim() creates the server() processes (both kinds are alive throughout the whole simulation); below them come message(), packet(), and the flit processes header_flit(), data_flit() and last_flit(); disk_access() is created when the message has arrived at the disk.

Each I/O operation consists of transmitting two different messages: a data message and a control message (read request or write acknowledgement). Data message processes create several data packet() processes, according to their respective length. For example, an 8192-byte data message may generate 4 packets of 2048 bytes each, or a single packet of 8192 bytes, depending on the maximum length allowed by the storage network. Control message processes create only one control packet() process, a few bytes long. Finally, a packet generates the corresponding flit processes (there are three different classes: header, data, and tail flit processes).

Throughout the whole simulation, only the sim() and server() processes are alive. A message terminates when the last flit of its last packet has left the source device (server or disk) and enters the network at the first switch in the path. In the same way, a packet process terminates when its last flit leaves the source device. As a consequence of the way message and packet processes are modeled, the only entities that the network manages are flits. Therefore, flit processes terminate when they leave the network at the destination device. When the last flit of an entire message arrives at the destination disk, it creates the disk_access() process that models the disk request. Once the access is completed, this disk_access() process creates a new message() process and then terminates. This new message models the response (data or acknowledgement, for a read or a write, respectively) from the disk to the server that initiated the I/O operation.

The program uses the batch means simulation technique to statistically test the output variables. However, obtaining accurate results is computationally expensive. Consequently, both concurrent simulation and the replication simulation technique have been considered in order to reduce the simulation response time. Fig. 3 shows a typical result obtained with the SAN simulation model. Each point in the curve represents the performance for a specified load value.
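The fragmentation rule described above for data messages (an 8192-byte message becomes four 2048-byte packets, or a single 8192-byte packet, depending on the maximum packet length) can be illustrated with a short sketch. The helper names packets_per_message() and packet_length() are hypothetical and are not taken from the simulator; the handling of a shorter final packet is likewise an assumption, since the text only gives examples where the message length is an exact multiple of the packet length.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the SAN simulator): number of packets
 * needed to carry msg_len bytes when the network allows at most
 * max_pkt_len bytes per packet. Uses ceiling division.
 * Returns 0 when max_pkt_len is 0. */
size_t packets_per_message(size_t msg_len, size_t max_pkt_len)
{
    if (max_pkt_len == 0)
        return 0;
    return (msg_len + max_pkt_len - 1) / max_pkt_len;
}

/* Hypothetical helper: length in bytes of packet number idx (0-based).
 * Assumption: every packet is full except possibly the last one.
 * Returns 0 when idx is out of range. */
size_t packet_length(size_t msg_len, size_t max_pkt_len, size_t idx)
{
    size_t n = packets_per_message(msg_len, max_pkt_len);

    if (idx >= n)
        return 0;
    if (idx + 1 < n)
        return max_pkt_len;              /* a full packet */
    return msg_len - idx * max_pkt_len;  /* the (possibly shorter) last packet */
}
```

With the values quoted in the text, packets_per_message(8192, 2048) gives 4 and packets_per_message(8192, 8192) gives 1.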
Fig. 3. Typical simulation result (x-axis: delivered traffic, bytes/cycle; y-axis: I/O mean response time, cycles)
Fig. 4. The CSX architecture (diagram of workstations joined in a PVM virtual machine and exchanging PVM messages while an experiment is running, with a machine database and an experiment database). Component legend: LM = Load Monitor; LSM = Local Simulation Monitor, attached to each instrumented simulation; MLM = Machine and Load Manager; MM = Master Manager; UIS = User Interface Server; UI = User Interface.
THE CSX TOOL

The main purpose of the CSX tool is to use the many idle, heterogeneous workstations available at university laboratories and research centers to run concurrent simulations. Simulations executed using CSX are discrete-event simulation models that are basically run using replication techniques and are monitored in order to extract statistical information about the output variables.

The CSX design allows its application to any commercial or public simulator, so users do not need to learn a new simulation language. It provides a general way of monitoring and controlling simulations, and it was initially applied to the SMPL simulation language (McDougall 1987). Monitoring the model requires only minimal changes to the original program code, and it is mainly based on injecting a special event into the simulator event queue. This allows external program control without introducing noticeable overhead.

Fig. 4 shows the design of a CSX environment in operation. CSX is designed as a distributed application running on PVM (Geist et al. 1994), which allows heterogeneous Unix and Windows computers to be incorporated into a parallel virtual machine.

In this work, this environment has been successfully applied to the CSIM simulation language. The SAN simulator program code has been modified and linked with the CSX libraries so that it can be monitored. The main modifications introduced are shown below:

    #include "csim.h"
    #include "csx_csim.h"

    int main(void)
    {
        ...
        /* Statistics trap function. */
        csx_uservar("I/O mean resp. time", tr);
        /* CSX monitor connection. */
        csx_enroll();
        /* Main simulation process. */
        sim();
        ...
    }

    void sim()
    {
        /* Schedule the CSX control event. */
        csx_first_event();
        ...
    }
Using the CSX tool, a concurrent simulation is performed by spawning N replications of the instrumented program. Each replication uses a different random stream, and the output variables of the replications are combined to obtain a confidence interval.

SIMULATION RESULTS

The execution time of serial and parallel simulation has been analyzed using the interconnection topology shown in Fig. 5. This topology presents an irregular configuration where three switch ports are used to connect to other switches. Servers and disks may be attached to the remaining ports.

To run our experiments, we have used 10 PCs, each with an AMD K6-2/350 MHz processor and 128 MB of RAM, running SuSE Linux 6.1.
When CSX receives a sample of statistical information from any replication, it computes the confidence interval for the user-selected output variables. The simulation ends when the confidence interval satisfies the user-specified end conditions. Fig. 8 shows the mean computed by CSX and its limits for a 95% confidence interval. The stopping criterion is met when the ratio of the half-width of the confidence interval to the mean falls below 5%. The simulation has been artificially continued beyond that point in order to show the evolution of the confidence interval.
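The stopping rule can be sketched as follows. This is only an illustration under stated assumptions, not the CSX implementation: the text does not say which estimator CSX uses, so the sketch assumes the usual Student t interval over the latest values reported by the independent replications. The names rep_stats, summarize() and precision_reached() are hypothetical, and the critical value t_crit must be supplied by the caller (for example, 2.262 for a 95% interval with 10 replications). The test compares squared quantities, which is equivalent to comparing the half-width with 5% of the mean but avoids a square root.

```c
#include <stddef.h>

/* Hypothetical summary of one output variable across n replications. */
typedef struct {
    double mean;         /* sample mean of the replication values */
    double var_of_mean;  /* s^2 / n, estimated variance of that mean */
} rep_stats;

/* Compute the mean and the variance of the mean for x[0..n-1].
 * Returns zeros when n < 2, because the variance is undefined then. */
rep_stats summarize(const double *x, size_t n)
{
    rep_stats st = {0.0, 0.0};
    double sum = 0.0, ss = 0.0;
    size_t i;

    if (n < 2)
        return st;
    for (i = 0; i < n; i++)
        sum += x[i];
    st.mean = sum / (double)n;
    for (i = 0; i < n; i++) {
        double d = x[i] - st.mean;
        ss += d * d;
    }
    st.var_of_mean = ss / (double)(n - 1) / (double)n;
    return st;
}

/* Return 1 when half_width / |mean| < max_rel_error, where
 * half_width = t_crit * sqrt(var_of_mean). Both sides are squared. */
int precision_reached(rep_stats st, double t_crit, double max_rel_error)
{
    double half_width_sq = t_crit * t_crit * st.var_of_mean;
    double limit = max_rel_error * st.mean;

    return st.mean != 0.0 && half_width_sq < limit * limit;
}
```

A monitor following this scheme would call summarize() each time a replication reports a new sample and stop the experiment once precision_reached(st, t_crit, 0.05) returns 1.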
The original simulation program has been executed on one of the PCs described above. This program uses the batch means analysis method, and convergence for the output variable “I/O mean response time” is achieved in 64 minutes. Fig. 6 shows the evolution of this output variable and the behavior of the CSIM statistical analysis method; note that no confidence interval is offered until sufficient batches have been collected.
Fig. 5. A SAN with six switches and irregular topology (legend: switch, disk, server, free port, bidirectional link)

Fig. 6. Original CSIM output analysis method behavior (x-axis: simulated time, cycles/10,000; y-axis: I/O mean response time, cycles; plotted: CSIM computed mean and confidence intervals)

Fig. 7. I/O mean response time for each independent replication and CSX calculated mean (x-axis: simulated time, cycles/4000; y-axis: I/O mean response time)

Fig. 8. CSX computed mean and confidence interval (x-axis: simulated time, cycles/4000; y-axis: I/O mean response time; plotted: replications mean and 95% confidence interval)

Annotations recoverable from these plots: reset of output variables; end of simulation at the desired 5% error for the confidence interval; response execution time for the slowest replication = 890 s.
Next, one independent replication has been spawned on each of the 10 available computers. The running replications send their statistical output variables to the CSX tool asynchronously, every 14 seconds on average. Fig. 7 shows the evolution of the values of the output variable collected from each replication, together with the evolution of the mean calculated by CSX.
The end-of-simulation execution time for this experiment was 14.8 minutes. Compared with the 64 minutes of the original model, our concurrent approach was about 4.3 times faster. However, a total of 10 computers was used.

Replications (computers)   End of simulation (seconds)   Speed-up
Original (batch)           3840                          -
10                         890                           4.3
7                          1016                          3.8
4                          1090                          3.5

Table 1. Speed-up and number of replications
Table 1 shows the relation between speed-up and number of computers. In this case, we have used the same number of replications as the number of available computers. This relation is not linear: for example, with a total of 4 computers and 4 independent replications, the execution response time was 18.17 minutes, that is, about 3.5 times faster than the original simulation program. This effect is due to the initial model warm-up, because the batch means method has only one transient period, whereas the replications method has a transient period for each replication. In this experiment, 6 replications offer a good trade-off between speed-up and mean coverage.

Another important factor that influences speed-up is the different execution performance of each replication. Fig. 9 shows the relation between simulated time and replication response time. For this experiment each computer has run only one replication. As can be seen, some simulations are faster than others, so the end of the simulation is determined by the slowest replication. In this experiment this effect is due to the influence of the random stream used in each replication. In a truly heterogeneous environment, the effects of computer performance, input random stream, and load variation can be addressed using dynamic load balancing techniques.
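The figures in Table 1 can be reproduced with a few lines of C. The sketch below is illustrative only: the function names are hypothetical, and the parallel efficiency (speed-up divided by the number of computers) is a standard measure added here for illustration; it is not reported in the paper. The slowest() helper reflects the observation that the concurrent experiment ends when the slowest replication ends.

```c
#include <stddef.h>

/* Speed-up of a concurrent run relative to the serial (batch means) run.
 * Returns 0 when the parallel time is not positive. */
double speedup(double serial_seconds, double parallel_seconds)
{
    if (parallel_seconds <= 0.0)
        return 0.0;
    return serial_seconds / parallel_seconds;
}

/* Parallel efficiency: speed-up obtained per computer used.
 * Not reported in the paper; added here as a standard derived measure. */
double efficiency(double serial_seconds, double parallel_seconds,
                  unsigned computers)
{
    if (computers == 0)
        return 0.0;
    return speedup(serial_seconds, parallel_seconds) / (double)computers;
}

/* Completion time of a concurrent experiment in which every computer runs
 * one replication: the time of the slowest replication.
 * Returns 0 for an empty array. */
double slowest(const double *seconds, size_t n)
{
    double worst = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        if (seconds[i] > worst)
            worst = seconds[i];
    return worst;
}
```

With the times in Table 1, speedup(3840, 890) is about 4.3 and speedup(3840, 1090) is about 3.5, while the efficiency falls from roughly 0.88 with 4 computers to roughly 0.43 with 10, which is another way of stating that the relation is not linear.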
Fig. 9. Execution time of each replication (x-axis: simulated time, cycles/4000; y-axis: execution time, seconds; annotations: 1583 seconds and 1355 seconds, marking the difference in response time between the fastest and the slowest replications; fitted line y = 2.5091x - 9.5729)

CONCLUSIONS AND FUTURE WORK

The recent advent of SANs as the new storage paradigm has motivated our department's interest in evaluating their performance. However, researchers using the designed simulator experienced long execution times, mainly due to the detailed modeling process and the simulation technique used (serial batch means). In this work we have reduced the long execution time of this complex SAN simulator by using concurrent independent replications. The developed tool that implements this parallel method works on a cluster of heterogeneous computers and can be applied to any discrete-event simulator.

As future work, we are enhancing the statistical output analysis method and applying dynamic load balancing techniques in order to improve the concurrent simulation tool.

REFERENCES

Farley M. Building storage area networks. McGraw-Hill. January, 2000.
User's guide: CSIM18 Simulation Engine (C version). Mesquite Software, Inc. 1998.
Perles A., Martí A., Serrano J.J. Clustered Simulation Experimenter: A tool for concurrent simulation on loosely coupled workstations. Proceedings of the 13th European Simulation Multiconference (ESM99). May, 1999.
McDougall M. H. Simulating Computer Systems. The MIT Press, Cambridge, Massachusetts. 1987.
Fujimoto R.M. Parallel discrete event simulation. Communications of the ACM. N. 10. October, 1990.
Geist A., Beguelin A., Dongarra J., Jiang W., Manchek R. and Sunderam V. PVM3 user's guide and reference manual. Technical Report ORNL/TM-12187, Oak Ridge National Laboratory. May, 1994.
Clark T. Designing storage area networks: a practical reference for implementing fibre channel SANs. Addison-Wesley, 1999.
Molero X., Silla F., Santonja V., Duato J. Modeling and simulation of storage area networks. To appear in Proceedings of the 8th International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunications Systems. August, 2000.

BIOGRAPHY

Àngel Perles is an assistant professor in the Department of Computer Engineering (DISCA) at the Polytechnic University of Valencia. He is a member of the Fault Tolerant Systems group in this department. His research interests include the design and development of parallel and distributed simulation tools to efficiently solve the group's simulation models.