VIPPES: A Virtual Parallel Processing System Simulation Environment*

Taisuke BOKU
Masahiro MISHIMA
Ken'ichi ITAKURA

Institute of Information Sciences and Electronics, University of Tsukuba

Abstract

In this paper, we propose a performance pre-evaluation system for designing parallel processing systems, named VIPPES (VIrtual Parallel Processor Evaluation System). When constructing a parallel processor, we need various levels of design and simulation for preliminary evaluation of the total system performance on benchmarks or actual application programs. Since a parallel processing system is built from various components, we need several simulators for this purpose. VIPPES controls several levels of simulators and retrieves the resulting data automatically. Simulation results are managed at several levels so that they can be reused in later simulations under various conditions. The user only needs to describe the program in the PVM environment and determine the specification of the processing unit and the interconnection network. VIPPES finally reports the total computation time for the given application and a detailed breakdown of the time consumption. While VIPPES can run on a single workstation, we implemented this environment on a cluster of UNIX workstations to achieve high-speed evaluation. In this paper, we describe the concept, design and implementation of VIPPES. As an example, performance evaluation results obtained by VIPPES for several virtual machines with different interconnection network topologies are also shown. Moreover, the accuracy of VIPPES as a total system simulator is demonstrated.

* Currently belongs to Fujitsu Co., Ltd.

1 Introduction

Recent progress in semiconductor technology has made it practical to implement large scale parallel processing systems consisting of a large number of PUs (Processing Units), and several commercial products have been built. Most of them are designed as distributed memory configurations in which all PUs communicate with each other by message passing via an interconnection network. The distributed shared memory paradigm has also been researched, but has not been commercialized for large scale systems.

There are various design choices at several levels when building a parallel processing system, for instance the type of processor used for the PU or the interconnection network topology. To determine a set of these choices, we need to evaluate the performance of the virtual system, and this evaluation should be performed on our actual applications. For a single processor system, a clock-level processor simulator for the target machine is enough for exact performance evaluation. For a parallel processing system, however, the task is much more complicated and difficult. Basically, such a performance pre-evaluation can be done by preparing a large scale simulator which exactly simulates the behavior of all PUs and of the messages on the interconnection network. In general, however, this is difficult on a single system due to the large number of instructions and messages. When the number of PUs in the target system grows to thousands, it becomes impossible.

One solution is parallelization of the simulator itself. There have been several studies on parallel or distributed discrete event simulation. However, these schemes introduce relatively large simulation cost and overhead into the system, and they require an actual parallel processing system to reduce the simulation overhead. Another solution is to divide the whole simulation into a large number of pieces which can easily be executed in parallel. The heaviest part of this work is the clock-level processor simulation. If we can divide it into separate pieces corresponding to the behavior of each PU, while keeping track of the effects of inter-PU communication, the simulation speed is dramatically increased on a workstation cluster or an existing parallel processor. The clock-level network simulation is relatively light-weight compared with the processor simulation, provided that the behavior of all processors is abstracted into simple time consumption rather than actual instruction execution.

We designed and implemented VIPPES (VIrtual Parallel Processor Evaluation System) for the performance pre-evaluation of parallel processing systems of various sizes based on the above concept. In this paper, we describe the concept, design and implementation of VIPPES. First, we describe the concept of VIPPES in Section 2. Next, the design and implementation of the system are described in Sections 3 and 4. After that, we show evaluation results for several virtual machines in Section 5. Finally, we conclude our work and describe future work in Section 6.

2 Concept

Our goal in implementing VIPPES is to provide an easy and cost-effective performance pre-evaluation environment for parallel processing system design. Generally, the theoretical peak performance of a parallel processing system is specified as the peak performance of a PU multiplied by the number of PUs. However, such catalogued performance cannot be achieved for real users' applications. Several factors degrade the total system performance. Each PU cannot execute instructions at maximum speed because of cache misses, TLB misses or bank conflict penalties. PUs must synchronize with each other to execute the parallel algorithm correctly. Message passing among PUs often causes a large overhead, and messages also block each other in the interconnection network.

To evaluate the exact performance of the target system, we need to use a clock-level simulation system. Even if we simplify the components of the target system to PUs and an interconnection network, there are several problems to solve for such an exact simulation:

(1) How to describe a parallel application program,

(2) How to describe and simulate the node processor architecture, and

(3) How to describe and simulate the interconnection network.

First, a parallel application program can be described in a multi-process or multi-threaded manner to represent the parallel active objects in the target problem. Recently, several message passing libraries have come into wide use, for instance PVM[2] and MPI[1]. Since these libraries are general enough to describe parallel programs based on the message passing paradigm, we can select one of them as the standard problem description method.

Second, to describe the processor architecture of a node processor of the target parallel system, a general CAD system can be utilized. For instance, VHDL[7] and similar hardware description languages are useful for the description, simulation and logic synthesis of a processor. However, their description level is too detailed for our requirement, because they must support the final logic synthesis. Here, we need a higher level hardware description scheme that is just detailed enough for clock-level simulation. Another problem in processor architecture design and simulation is how to provide a dedicated compiler for high level languages. Most benchmarks and target applications are described in a high level language such as FORTRAN or C. In some benchmarks (e.g. NAS-PB[3]), the program description method is restricted to high level languages only. Therefore, a compiler for the target processor must exist. If we select an off-the-shelf processor such as a widely used RISC processor, this problem can be avoided.

Third, we need a generalized interconnection network description and simulation tool. There are a few general purpose network simulation tools[5]. In these systems, however, the flexibility is not sufficient to describe various classes of interconnection networks. The interconnection networks in actual systems are much more complicated in network topology and routing algorithm, and the network simulator has to be able to handle them. We also require MIMD-style execution of the node processors, and their behavior should be exactly reflected in the message passing on the network.

Combining clock-level simulators for the PUs and the interconnection network, we can realize a complete simulator for the target parallel processing system. However, the running cost of such a system is very high. Moreover, whenever we change the design of any part of the target system, we must execute this high-cost simulation again. For a cost-effective simulation environment, the following two problems should be solved:

(a) The simulation part which requires much computational power should be distributed or parallelized, and

(b) Simulation results should be saved at several levels for reuse in later simulations under different conditions.

The key to solving these problems is how to divide and simulate the PU internal (intra-PU) execution and the inter-PU communication. Between inter-PU communications (message sends or receives), a PU performs internal calculation on local data. The result of an inter-PU communication may change the internal status of the PU. Generally, this is caused by a message receive, which changes the contents of local memory and thereby affects the subsequent behavior of the receiving PU. Therefore, the intra-PU processing of a PU can be simulated completely, provided that each such change of memory contents is applied in exactly the right order. If message receiving is deterministic, the order and contents of the transferred messages can be recorded as a trace file during a virtual execution of the target program. For this tracing, no clock-level simulation is required. Based on the above concept, we divide the performance evaluation into three phases: the inter-PU tracing phase, the PU simulation phase and the network simulation phase.

Inter-PU tracing: We execute the target parallel program, described with the PVM or MPI library, on any parallel or distributed processing environment. In this phase, no exact timing simulation is required; therefore a workstation cluster or an existing parallel processing system can be used, and this phase requires relatively little cost. On each message passing action, all attributes and contents of the message, such as the sender and receiver PUs, the message tag, the message length and the contents, are recorded into the trace file associated with each PU.

PU simulation: The execution on each PU is simulated by the clock-level simulator. When the PU executes an inter-PU communication, it is trapped and the data exchange is emulated according to the contents of the trace file generated in the inter-PU tracing phase. The data transferred from another PU are stored into the local memory of the target PU, and the simulation then continues. At the same time, the time (in clocks) spent between the current inter-PU communication and the previous one, which corresponds to the last intra-PU execution piece, is recorded as a profile of the PU behavior. This phase requires a large amount of computational power, but it can easily be distributed over a number of computers since the processing for each PU is isolated from the others.

Network simulation: All message passing behavior is simulated on the interconnection network simulator. Each PU alternates between the time consumption corresponding to intra-PU execution and inter-PU communication, according to the inter-PU trace file and the PU execution profile generated in the above two phases. Message conflicts in the network are exactly simulated. As a result, the total processing time for the target application is obtained.

The tracing and profiling files generated in the first and second phases are called IPTRACE (Inter-PU trace) and IPPROFILE (Intra-PU profile), respectively. The processing flow of these phases is shown in Figure 1.

[Figure 1 depicts the processing flow: Parallel Application -> Step-1: Inter-PU tracing -> Data Transfer Log (IPTRACE) -> Step-2: PU simulation -> CPU Time Profiling (IPPROFILE) -> Step-3: Network simulation -> Total Execution Time]

Figure 1. Processing flow in VIPPES

Since the second phase requires very high cost, the resulting IPPROFILE is expensive to obtain. One of the important issues in parallel processing system design is the topology of the interconnection network. When we want to evaluate and compare several virtual machines with different network topologies, we only have to execute the third phase for each candidate. The IPPROFILE can be utilized for all simulation trials, so its reusability is high.
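To make the interface between the phases concrete, the following C declarations sketch one possible layout for an IPTRACE record and an IPPROFILE entry. The field names and types are assumptions for illustration only; the paper specifies merely that the sender and receiver PUs, the message tag, length and contents are traced, and that the clock intervals between communications are profiled.

    /* Illustrative sketch only: these record layouts are assumptions,
       not the actual VIPPES file formats.                              */
    #include <stddef.h>

    /* One IPTRACE record: attributes and contents of a single message. */
    typedef struct {
        int    sender;      /* sending PU (PVM task) id   */
        int    receiver;    /* receiving PU (PVM task) id */
        int    tag;         /* PVM message tag            */
        size_t length;      /* message length in bytes    */
        /* 'length' bytes of message contents follow the header */
    } iptrace_record_t;

    /* One IPPROFILE entry: clocks consumed by the intra-PU execution
       piece between two consecutive inter-PU communications.           */
    typedef struct {
        unsigned long long clocks;  /* interval measured by the PU simulator */
    } ipprofile_entry_t;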

3 Design

In designing VIPPES, we have to determine how to realize the three elements of the system, that is, application description, PU simulation and network simulation. In this section, we describe the current design of VIPPES.

3.1 Application Description

For general and cost-effective performance evaluation, we use the message passing interprocess communication paradigm for target application programming in VIPPES. There are several candidate message passing programming environments and libraries that are widely used on many platforms. We decided to use PVM[2], because its parallel processing model is very simple and relatively easy to implement in our system, and a large number of applications have been written with this library in recent years. Another candidate is MPI[1], which supports more general and complicated programming schemes. However, our purpose is to describe target applications that can be simulated on various configurations of virtual parallel processors, and for this purpose PVM is the better choice. Note that we use the PVM interface as a model of inter-PU communication and do not limit the actual target program.

In order to record all inter-PU data transfers into the IPTRACE file, each PVM subroutine call is trapped and its attributes (i.e. communication partner, message tag, length and contents) are recorded. For this purpose, we rewrite the PVM subroutines to append this recording feature. In this phase, all messages are actually transferred, and the IPTRACE file for each PU is created as a side effect of the message passing.
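To illustrate the kind of input program VIPPES accepts, the sketch below shows a minimal PVM worker task that receives an integer from its parent task, performs a small local computation and sends the result back. This is a generic PVM example, not code taken from the paper; only standard PVM 3 calls are used.

    /* Minimal PVM worker sketch (generic example, not from the paper). */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        int mytid  = pvm_mytid();      /* enroll this task in the PVM environment */
        int parent = pvm_parent();     /* tid of the task that spawned this one   */
        int value;

        printf("worker tid %d started\n", mytid);

        pvm_recv(parent, 1);           /* blocking receive, message tag 1 */
        pvm_upkint(&value, 1, 1);      /* unpack one integer              */

        value *= 2;                    /* intra-PU computation on local data */

        pvm_initsend(PvmDataDefault);  /* prepare a new send buffer */
        pvm_pkint(&value, 1, 1);       /* pack the result           */
        pvm_send(parent, 2);           /* reply with message tag 2  */

        pvm_exit();                    /* leave the PVM environment */
        return 0;
    }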

3.2 PU Simulator

As described in Section 2, there are two issues in PU simulation: how to describe the target processor architecture and how to provide a compiler for high level languages. Usually, general hardware description languages like VHDL[7] are used not only for processor design but also for logic synthesis. Here, we need a higher level hardware description that is just detailed enough for clock-level or instruction-level simulation. For this problem, we have proposed a high level hardware description language approach, named AIDL[4] (Architecture and Implementation Design Language). In AIDL, only the basic resources and the execution scheme of the target processor have to be described, and no further detail for logic synthesis is required. For instance, a user describes the processor resources (e.g. general and special purpose registers), the pipeline stages (e.g. fetch, decode and execution) and the way they are controlled. These are enough for high level processor design and clock-level simulation, but not for logic synthesis. The description is interpreted by the AIDL simulator to perform clock-level simulation of the specified object program.

The problem of the high level language compiler is still open even with AIDL. An automatic compiler-compiler, which can generate a dedicated compiler for a specified instruction set and pipeline configuration, is required for this purpose, and there are some studies on this problem[9]. If we decide to use an off-the-shelf processor such as a widely used RISC processor, the compiler problem can be avoided. For this reason, the target PU architecture is limited to a commercial RISC processor with cache and local main memory in the current design and implementation of VIPPES.

Whenever an inter-PU message passing operation is performed by a PVM subroutine, it is trapped and a pseudo execution of the message passing is performed according to the contents of the IPTRACE file. In particular, when the PU calls the message receive function (pvm_recv()), the corresponding message contents are retrieved from IPTRACE and stored into the receive buffer located in the simulated local memory. After that, the processor can load and process the data, so the message passing is pseudo-executed. At the same time, the interval between the previous inter-PU communication and the current one is recorded into the IPPROFILE file for the PU. Finally, when all instruction codes have been executed, the IPPROFILE file contains an abstraction of the PU behavior, represented as the set of interval clocks between all inter-PU communication operations.
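As an illustration of this trapping mechanism, the C sketch below shows how a handler for a trapped receive might replay a message: it logs the elapsed clock interval to the IPPROFILE file, reads the next record from the IPTRACE file (assumed here to be a byte count followed by the message body), and copies the body into the simulated local memory at the receive buffer address. The function name, file formats and memory layout are assumptions, not the actual VIPPES implementation.

    /* Hypothetical sketch of the trap handler for a receive call inside the
       PU simulator; names and file formats are illustrative, not from the paper. */
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_MEM_SIZE (1UL << 20)               /* assumed size of simulated local memory */

    static unsigned char local_mem[LOCAL_MEM_SIZE];  /* simulated local memory of one PU  */
    static unsigned long long last_comm_clock;       /* clock of the previous inter-PU op */

    /* Called when the simulated program reaches a trapped receive.  'now' is
       the current simulated clock, 'recv_buf' the simulated address of the
       receive buffer, 'trace' and 'profile' the IPTRACE (read) and IPPROFILE
       (append) files of this PU.                                              */
    void on_trapped_receive(unsigned long long now, unsigned long recv_buf,
                            FILE *trace, FILE *profile)
    {
        unsigned long len;

        /* record the intra-PU execution piece since the last communication */
        fprintf(profile, "%llu\n", now - last_comm_clock);
        last_comm_clock = now;

        /* replay the next recorded message: byte count, then the body bytes */
        if (fread(&len, sizeof len, 1, trace) != 1 || recv_buf + len > LOCAL_MEM_SIZE) {
            fprintf(stderr, "IPTRACE replay failed\n");
            exit(1);
        }
        if (fread(&local_mem[recv_buf], 1, len, trace) != len) {
            fprintf(stderr, "truncated IPTRACE record\n");
            exit(1);
        }
        /* the simulated program can now load the received data from local_mem */
    }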

3.3 Network Simulator

In VIPPES, we need a generalized interconnection network description and simulation tool. The simulator should be general enough to handle various kinds of network topologies, and the behavior of each PU should be programmable and simulated at clock level. In most interconnection network simulators, the message transfer pattern is restricted to some typical models such as simple random traffic[5]. In VIPPES, we have to simulate the inter-PU actions based on the scenario described in the IPTRACE file, with the timing of these actions determined by the IPPROFILE file.

We have developed an interconnection network simulator generating system called INSPIRE (Interconnection Network Simulator with Programmable Interaction and Routing for performance Evaluation)[6] for this purpose. INSPIRE requires two user description inputs: an NDF (Network Description File) and a PBF (PU Behavior File). The NDF represents the network configuration, size, topology and routing algorithm in a dedicated language named NDL (Network Description Language). The PBF represents the behavior of all PUs, described as a procedural function written in the C language. All actions of a PU, such as message sending and receiving, calculation and time consumption (the CPU time corresponding to internal processing), can be freely described. Since a complete C program can be written as this function, it can represent the PU behavior in a trace-driven manner by reading a recorded trace file. Given these two input files, the INSPIRE system generates a dedicated interconnection network simulator for the specification.

NDL is designed to be flexible enough to describe various classes of interconnection networks, including most of the direct and indirect network topologies proposed so far. The PU behavior function in the PBF can be generalized for trace-driven simulation according to the contents of the IPTRACE and IPPROFILE files. Therefore, the final stage of VIPPES is performed by the dedicated network simulator automatically generated by INSPIRE. At the end of the simulation, the simulator outputs various statistics such as the total simulation cycles, the PU execution cycles and the PU cycles blocked by inter-PU communication and synchronization. By analyzing these statistics, the total performance evaluation is completed for the given IPTRACE, IPPROFILE and network configuration.
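The sketch below shows what a trace-driven PU behavior function of this kind might look like in C: it alternates time consumption and communication actions by reading the IPPROFILE and IPTRACE files of one PU. The ins_* actions and the text file formats are placeholders invented for illustration; the actual PBF interface of INSPIRE is described in [6] and is not reproduced here.

    /* Hypothetical sketch of a trace-driven PU behavior function in the spirit
       of a PBF.  The ins_* stubs stand in for simulator-provided primitives;
       a real PBF would call into the generated INSPIRE simulator instead.     */
    #include <stdio.h>

    static void ins_consume(unsigned long long clocks)        /* spend CPU time     */
    { printf("consume %llu clocks\n", clocks); }
    static void ins_send(int dst, int tag, unsigned long len) /* inject a message   */
    { printf("send to PU %d, tag %d, %lu bytes\n", dst, tag, len); }
    static void ins_recv(int src, int tag)                    /* wait for a message */
    { printf("receive from PU %d, tag %d\n", src, tag); }

    /* Behavior of one PU replayed from its profile and trace files:
       alternate intra-PU time consumption and inter-PU communication. */
    void pu_behavior(int myid, FILE *profile, FILE *trace)
    {
        unsigned long long clocks;
        int src, dst, tag;
        unsigned long length;

        /* one profiled interval per recorded communication (assumed text formats) */
        while (fscanf(profile, "%llu", &clocks) == 1) {
            ins_consume(clocks);                     /* intra-PU execution piece */
            if (fscanf(trace, "%d %d %d %lu", &src, &dst, &tag, &length) != 4)
                break;                               /* no more communications   */
            if (dst == myid)
                ins_recv(src, tag);                  /* this PU is the receiver  */
            else
                ins_send(dst, tag, length);          /* this PU is the sender    */
        }
    }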



4 Implementation

Besides the two simulation engines for the PU and the interconnection network, the following elements also have to be provided to support the VIPPES environment:

- A PVM library which provides the full set of PVM subroutines, including functions for recording all inter-PU communications into IPTRACE files as a side effect,

- A PVM library which provides subroutines for the pseudo execution of inter-PU communication on the PU simulator,

- A controller to run the interactive PU simulator automatically and record the PU execution profile into IPPROFILE files, and

- A generalized PBF to simulate the time consumption and the inter-PU communication operations on the network simulator.

In this section, we describe these elements in detail and give an overall picture of the current implementation of VIPPES.

4.1 Simulation Engines

The most difficult problem in VIPPES is how to provide a generalized PU simulator. For exact simulation, it should simulate not only clock-level instruction execution but also the rest of the PU environment, such as multiple levels of cache, main memory and I/O including the network interface. Although AIDL can provide these features by describing them in detail in the target description file, the simulation speed of the current AIDL simulator for such a heavy system is not practical. For this reason, we currently use a specially designed simulator for our most interesting processor architecture, the Hewlett-Packard PA-RISC 1.1, in VIPPES. PA-RISC is the basic architecture of the PU used in the massively parallel processor CP-PACS[8], which was implemented at the University of Tsukuba. Since a compiler for this processor architecture is provided, we can also avoid the compiler problem.

For the network simulator, on the other hand, INSPIRE is available and is used for various kinds of network simulation. Using NDL, we can represent any network topology and routing algorithm for both direct and indirect network configurations. For example, Figure 2 shows the NDL description of a binary hypercube network. For the details of NDL, see [6].

    var i in [0:n-1], j in [1:dim];   /* dim = log2(n) */
    resource{
      node pu(n);
      switch_size pu(i).i(dim), pu(i).o(dim);
    }
    connection{
      pu(i).o(j) | pu(i ^ (1

Figure 2. NDL description of a binary hypercube network

4.2 PVM library for IPTRACE

PVMT (PVM library with Tracing) is the library used to generate the IPTRACE files during the first run of the application on an existing parallel processing environment (e.g. a workstation cluster). In the PVM subroutines of PVMT, the given arguments are recorded into the IPTRACE file after the actual execution of the PVM subroutine. Generally, a PVM task (or process) joins the PVM environment by calling pvm_mytid(); with PVMT, this subroutine also opens an IPTRACE file for that task. When the pvm_recv() subroutine receives a message from another task, the received data are recorded into the IPTRACE file. When pvm_exit() is called to leave the PVM environment, the IPTRACE file is closed.
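A minimal sketch of such tracing wrappers is shown below. The paper states that the PVM subroutines themselves are rewritten; here separate pvmt_* wrapper names, the file naming scheme and the record format are assumptions used to keep the sketch self-contained, while pvm_mytid(), pvm_recv(), pvm_bufinfo() and pvm_exit() are the standard PVM 3 calls.

    /* Hypothetical sketch of PVMT-style tracing wrappers around PVM calls.
       Wrapper names, file naming and record format are assumptions.        */
    #include <stdio.h>
    #include <pvm3.h>

    static FILE *iptrace;                  /* per-task IPTRACE file */

    int pvmt_mytid(void)
    {
        int tid = pvm_mytid();             /* enroll in PVM as usual     */
        char name[64];
        sprintf(name, "iptrace.%d", tid);  /* assumed per-task file name */
        iptrace = fopen(name, "w");        /* open the IPTRACE file      */
        return tid;
    }

    int pvmt_recv(int tid, int tag)
    {
        int bufid = pvm_recv(tid, tag);    /* actual blocking receive */
        int bytes, rtag, rtid;
        pvm_bufinfo(bufid, &bytes, &rtag, &rtid);  /* attributes of the received message */
        fprintf(iptrace, "recv from=%d tag=%d bytes=%d\n", rtid, rtag, bytes);
        /* a full implementation would also dump the message contents here */
        return bufid;
    }

    int pvmt_exit(void)
    {
        if (iptrace)
            fclose(iptrace);               /* close the IPTRACE file    */
        return pvm_exit();                 /* leave the PVM environment */
    }

A complete PVMT would wrap the remaining send, pack and unpack routines in the same way and record the message contents as well, as described above.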