Implementation and Evaluation of Program Development Middleware for Cell Broadband Engine Clusters

Toshiaki Kamata1, Masahiro Yamada1, Akihiro Shitara1, Yuri Nishikawa1, Masato Yoshimi2 and Hideharu Amano1
1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kouhoku-ku, Yokohama, Kanagawa 223-8522, Japan
2 Faculty of Science and Engineering, Doshisha University, Tatara Miyakodani, Kyotanabe, Kyoto 610-0394, Japan
Email: [email protected]

Abstract— Although PC clusters with multi-core accelerators have become popular, it is still difficult to write efficient parallel programs for them because two types of programming techniques are required: multi-thread programming and inter-node programming. The former requires special techniques and training dedicated to the accelerator, while the latter requires programmers to be skilled in using communication libraries such as mpich or OpenMPI. In order to reduce such programming cost, in this report we propose program development middleware which targets a PC cluster consisting of multiple nodes with the Cell Broadband Engine (Cell/B.E.). This middleware supports inter-node and inter-core thread control, so it lets developers focus on tuning a program to elicit the computational power of each core in the Cell/B.E. processors. As a result of evaluating the middleware with two types of benchmark programs, it reduced code quantity by about 40% compared to an OpenMPI implementation, and provided approximately the same execution performance.

Keywords: Cell Broadband Engine, Virtualization, Parallel Computing
1. Introduction
PC clusters with multi-core accelerators such as the Cell Broadband Engine (Cell/B.E.), Graphics Processing Units (GPUs), ClearSpeed, and Field Programmable Gate Arrays (FPGAs) have become popular, especially for high-performance scientific computing[1][2][3][4]. In particular, a Cell/B.E. cluster consisting of multiple PlayStation3 nodes is considered a flexible and cost-effective computing environment, because the processor is MIMD-based and the unit price of a PlayStation3 is affordable despite its high computational power. Examples of PlayStation3 clusters can be found in [5] and [6].
However, writing an efficient program on a Cell/B.E. cluster is still a difficult task. In the first place, even stand-alone Cell/B.E. programming requires learning dedicated programming interfaces and tuning techniques. Concretely, (1) programmers have to write two separate programs that run on the controlling core and on the computation cores by using the libspe2 and pthread libraries, and (2) data transfers between memories and multiple cores need to be explicitly specified using DMA transfer instructions. In addition, in order to use multiple Cell/B.E. processors in a cluster environment, inter-node communication must be described to control multiple nodes using communication libraries such as OpenMPI or mpich[7]. This programming difficulty is the main reason why Cell/B.E. clusters are not popular among end-users, in spite of the Cell/B.E.'s distinctive potential in terms of flexibility and cost-performance.
In this paper, we propose middleware which mitigates such difficulty in program development for Cell/B.E. clusters. Using this middleware, a programmer can focus only on SPE programming and tuning, without writing PPE and inter-node communication program codes. The programming environment with the proposed middleware is now available on a PC cluster consisting of an Intel Xeon node and several SONY BCU-100 nodes. The evaluation results using two applications, the Monte Carlo method and matrix product, showed that the overhead of the middleware is acceptable considering its benefit in programming.
The rest of this report is organized as follows: Section 2 describes the Cell/B.E. and parallel processing using multiple Cell/B.E. processors in a cluster environment. Section 3 covers related work. Section 4 shows the design, and Section 5 the implementation, of this middleware. Section 6 shows the evaluation. Finally, we state the conclusion and future work in Section 7.
2. Cell Broadband Engine
The Cell Broadband Engine (Cell/B.E.) is a multi-core processor jointly developed by SONY, Toshiba and IBM (known as STI) as the core of the PlayStation3. Figure 1 shows the structure of the Cell/B.E.. The Cell/B.E. and other processors based on the Cell Broadband Engine Architecture (CBEA) are classified as heterogeneous processors consisting of a general-purpose 64-bit processor based on PowerPC, called the PowerPC Processor Element (PPE), and eight SIMD processors called Synergistic Processor Elements (SPEs).
The processors are connected by a ring-based interconnect called the Element Interconnect Bus (EIB). Eight SPEs run in parallel (six in the case of the Cell/B.E. used in the PlayStation3), and the total performance is 204.8 GFlops for single-precision floating-point calculations[8].
The PPE is a general-purpose processor which runs the operating system and also controls the SPEs, the main memory and other external devices. It consists of a PowerPC Processing Unit (PPU) connected to a 32KB L1 cache and a 512KB L2 cache. The PPU has VMX, a 128-bit SIMD unit based on the PowerPC instruction set architecture.
The SPE is a 128-bit SIMD processor consisting of a Synergistic Processor Unit (SPU), a Local Store (LS) and a Memory Flow Controller (MFC). It is suitable for multimedia processing such as image processing and MPEG stream encoding/decoding. A programmer can control the eight SPEs from the PPE using a high-level programming language such as C/C++ with the libspe2 library. Each 128-bit register can store four single-precision floating-point numbers. The rounding of the SPE's floating-point arithmetic does not follow the IEEE 754 standard. Each SPE has a 256KB private memory called the Local Store (LS), which is accessed at 128 bits per cycle.
The EIB has a ring structure consisting of four buses, each of which can transfer 16 bytes per cycle; thus, the total bandwidth of the EIB is 96 bytes per cycle. It is used for data transfer among the PPE, the SPEs, the Memory Interface Controller (MIC) and the Bus Interface Controller (BIC). These transactions can occur simultaneously within the same ring.

Fig. 1: The architecture of Cell/B.E.

When a common program without any special consideration runs on the Cell/B.E., only the PPE is used. In order to use the SPEs, the programmer has to write two different programs running on the PPE and the SPEs, and declare a "context" (an object to control an SPE at the software level) in the PPE source code for each SPE. Generally, threads are created according to the number of utilized SPEs, and they are controlled by the PPE. Since an SPE can only access data stored in its LS, the target data to be processed must be transferred from the main memory using DMA transfers. The DMA transfer has some limitations on memory alignment and data size; thus, the program must take care of these limitations[9].
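To make this burden concrete, the following is a minimal sketch (not taken from the paper) of the conventional PPE-side control code described above, using libspe2 and pthreads; the embedded SPE program handle spe_kernel and the fixed SPE count are illustrative assumptions.

    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t spe_kernel;   /* hypothetical embedded SPE binary */
    #define NUM_SPES 8

    static void *spe_thread(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);  /* blocks until the SPE stops */
        return NULL;
    }

    int main(void)
    {
        spe_context_ptr_t ctx[NUM_SPES];
        pthread_t thread[NUM_SPES];

        for (int i = 0; i < NUM_SPES; i++) {
            ctx[i] = spe_context_create(0, NULL);     /* one context per SPE  */
            spe_program_load(ctx[i], &spe_kernel);    /* load the SPE program */
            pthread_create(&thread[i], NULL, spe_thread, ctx[i]);
        }
        for (int i = 0; i < NUM_SPES; i++) {
            pthread_join(thread[i], NULL);
            spe_context_destroy(ctx[i]);
        }
        return 0;
    }

In a cluster, this per-node boilerplate additionally has to be combined with MPI calls for the inter-node part, which is exactly the duplication the proposed middleware aims to remove.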
3. Related Work
In this section, we introduce related work which aims at efficient use of the computational resources of multi-core processors in cluster environments.
As examples of Cell/B.E. cluster programming environments or frameworks, there is a programming framework proposed by Kunzman which provides programmers a technical guideline for efficiently load-balancing tasks among multiple Cell/B.E. nodes[10]. They suggest an automatic methodology for offloading the PPE's tasks to the SPEs, and a management technique for the tasks based on job queues. They evaluated performance by using a homogeneous PlayStation3 cluster. There is also a communication API proposed by Pakin called the Cell Messaging Layer (CML), which is implemented with MPI[11]. The performance of CML was analyzed with a homogeneous cluster of multiple IBM BladeCenter QS21 machines, a blade server equipped with two Cell/B.E. processors. Yamada proposed the Thread Virtualization Environment (TVE), a middleware which presents an image to programmers as if the SPEs in multiple Cell/B.E. processors connected by a network were integrated on a single processor, so that one PPE on a host machine can use all the SPEs[12]. By using TVE, programmers can write a parallel distributed program without using communication libraries.
As an example of multi-core cluster middleware, the Ninf project is a representative programming middleware for the efficient use of computational resources in grid environments[13]. It can offload heavy tasks to remote cluster environments with larger computational capacity. It is suitable for developing flexible and fault-tolerant grid systems whose node count may change frequently. This cannot be realized with MPI programming, which requires a deterministic assignment of the nodes that the program runs on. Also, XcalableMP, released in November 2010, can automatically generate parallel program codes applicable to cluster environments by inserting #pragma directives into non-parallelized program codes. XcalableMP provides Java-based libraries for general-purpose processors such as Intel x86, which can hide MPI communications.
Compared to the related work presented above, the middleware that we propose in this paper has the following characteristics:
• it assumes a cluster environment with an x86 server and multiple Cell/B.E. nodes with different numbers of SPEs,
• provides libraries for automatic inter-node communication tuning and intra-node control, and
• provides high flexibility with respect to modifications of the system structure.

Fig. 2: Example of PC cluster environment with multiple Cell/B.E. (a host machine with an Intel Xeon or Cell/B.E. connected over a network to SONY BCU-100 nodes with PPE + 8 SPEs and PlayStation3 nodes with PPE + 6 SPEs)

Fig. 3: Outline of Virtual SPE environment (Virtual SPEs 0 to 8N-1 on the host machine mapped over socket connections to the physical SPEs of nodes 1 to N)
In the next section, we describe the design of this middleware.
4. Design of Program Development Middleware
In this section, we explain the design of the middleware for a PC cluster environment with multiple Cell/B.E. processors. Generally, running a parallel program in such an environment requires communication program codes between the host and client machines, and program codes for a PPE to control its subordinate SPEs by using the libspe2 and pthread libraries. This forces a programmer to learn two types of programming techniques, and also results in a large code quantity. In order to mitigate such a programming burden, we propose a Virtual SPE programming environment. This mechanism allows programmers to focus on tuning SPE codes and bringing out an SPE's computing power, by freeing them from describing the communication part with socket connections and thread control functions.
4.1 Target Environment
We assume an environment such as the one shown in Figure 2:
• Multiple client Cell/B.E. nodes are connected by a network such as Ethernet.
• The connected client Cell/B.E. nodes may have different numbers of SPEs.
• A host machine (Intel x86 processor or Cell/B.E.) offloads its tasks to the client Cell/B.E. nodes.
Fig. 4: Virtualization by middleware (the remote SPEs are presented to the host machine, x86 or Cell/B.E., as if directly attached)
Here, the PlayStation3 (PS3) and the SONY BCU-100 are examples of machines with a Cell/B.E.. A PS3 provides 7 SPEs, and a BCU-100 provides 8 SPEs. Each Cell/B.E. node is connected to the host machine.
4.2 Required Features
Figure 3 shows the outline of the Virtual SPE environment. When a programmer issues an operation to a Virtual SPE on the host machine, the physical SPE of the node corresponding to that Virtual SPE executes it (a sketch of this mapping appears after the list below). Although programmers must know the number of physical SPEs in the system and declare it, the SPEs can be treated as if they were connected to the host directly, as shown in Figure 4. Thus, they are called "Virtual SPEs". The required features of this middleware are the following:
• Communication between the host and node machines.
• The middleware can send or receive arbitrarily sized data between the host machine and a specified SPE.
• Connection and data transfer between nodes without going through the host machine.
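As a rough illustration of the virtual-to-physical correspondence suggested by Figure 3, the following helper shows one way the mapping could be computed; it is a sketch under the assumption of a uniform 8 SPEs per node (true only for BCU-100 nodes), and PS3 nodes with 7 SPEs would need a per-node table instead.

    // Sketch of the virtual-to-physical mapping suggested by Figure 3,
    // assuming every node exposes 8 SPEs.
    struct PhysicalSPE {
        int node;  // index of the client node
        int spe;   // local SPE index within that node
    };

    inline PhysicalSPE map_virtual_spe(int virtual_id, int spes_per_node)
    {
        PhysicalSPE p;
        p.node = virtual_id / spes_per_node;  // Virtual SPEs 0-7 -> node 0, 8-15 -> node 1, ...
        p.spe  = virtual_id % spes_per_node;
        return p;
    }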
Table 1: List of communication functions
Function Name    Feature
API_Initialize   Connects to a server program. This function is called automatically at the beginning of a program.
API_Finalize     Closes all connections. This function is called automatically at the end of a program.
Barrier_All      Waits until all SPEs' executions are terminated.
#include "vcell_runtime.h" int main() { Virtual SPE vspe; int value; Vspe.Send(&value, sizeof(int)); Vspe.Run(); return 0; }
Generally, using the Cell/B.E. requires two program files, running on the PPE and the SPE respectively, and the code that controls SPE contexts and pthread functions accounts for most of the PPE program. In contrast, with this middleware the PPE program is eliminated, and the programmer can focus on optimizing the SPE program. In the next section, we explain the implementation of this middleware.
5. Implementation
Here, the implementation of the following mechanisms supported by the middleware is shown:
1) a server program for transferring data between the host and client machines, and between the PPE and SPEs, and
2) a "Virtual SPE" for using SPEs across the network.
We also show an example program code which uses the Virtual SPE. Table 1 shows the list of communication functions between the host and client machines. Virtual SPEs can be used simply by including an original header file. The functions API_Initialize and API_Finalize are then called automatically at the beginning and the end of the program execution.
5.1 Server program on the PPE
In general, data cannot be transferred directly between a host machine and SPEs. Thus, we implemented a server program which supports data transfer between the host and node machines. It runs on the PPE of each client machine, and receives commands from the Virtual SPEs on the host machine in order to control the corresponding physical SPEs. The server program has the following functions:
• It receives data from the host machine using a socket connection,
• manages threads according to the number of SPEs,
• sends data to a specified SPE by DMA transfer when it receives data from the host machine; the transfer is repeated when the data size exceeds 16 KB, the maximum data size that can be sent at once (a sketch of this chunking appears below), and
• sends data back to the host machine from data buffers in its subordinate SPEs.
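The third item can be pictured with the following sketch, which pushes a buffer received over the socket into the LS of one SPE in blocks of at most 16 KB using libspe2's proxy DMA commands; the LS offset, the fixed tag, and the omission of alignment handling are simplifications of ours, not details given in the paper.

    #include <libspe2.h>
    #include <stdint.h>
    #include <stddef.h>

    #define DMA_MAX 16384   /* maximum size of a single DMA transfer on Cell/B.E. */

    /* Forward 'total' bytes from node main memory into the LS of one SPE. */
    static void forward_to_spe(spe_context_ptr_t spe, unsigned int ls_offset,
                               void *buf, size_t total)
    {
        uint8_t *p = (uint8_t *)buf;
        unsigned int status;
        while (total > 0) {
            size_t chunk = (total > DMA_MAX) ? DMA_MAX : total;
            /* Proxy DMA: the SPE's MFC pulls 'chunk' bytes from main memory into its LS. */
            spe_mfcio_get(spe, ls_offset, p, chunk, 0 /* tag */, 0, 0);
            spe_mfcio_tag_status_read(spe, 1 << 0, SPE_TAG_ALL, &status);
            p         += chunk;
            ls_offset += chunk;
            total     -= chunk;
        }
    }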
Fig. 5: Example of using the Virtual SPE environment

Table 2: List of Virtual SPE functions
Function Name    Feature
Vspe.Send        Send data to the target SPE
Vspe.Recv        Receive data from the target SPE
Vspe.Run         Start SPE execution
Vspe.Wait        Wait until the end of execution
As the server program provides the above functions, a programmer can focus only on the distribution and tuning of the SPE programs. Double- or multi-buffering is a popular tuning technique: an SPE prepares multiple buffers to hold blocks of data, and initiates the DMA transfer for the data used in the next step (e.g. the next loop iteration) while it is computing. In other words, this technique hides memory latency by overlapping it with computation. In our middleware, it can be applied to an SPE program as long as the combined buffer sizes do not exceed the capacity of the LS.
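A minimal SPE-side sketch of this double-buffering pattern is shown below, written with the standard MFC intrinsics from spu_mfcio.h; the block size, element type and process() routine are illustrative assumptions rather than part of the middleware.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096                      /* floats per DMA block (16 KB) */
    static float buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process(float *data, int n);     /* hypothetical computation */

    void stream(uint64_t ea, int nblocks)
    {
        int cur = 0;
        /* prefetch the first block with tag 0 */
        mfc_get(buf[cur], ea, sizeof(buf[cur]), cur, 0, 0);

        for (int i = 0; i < nblocks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nblocks)             /* start fetching the next block */
                mfc_get(buf[next], ea + (uint64_t)(i + 1) * sizeof(buf[next]),
                        sizeof(buf[next]), next, 0, 0);

            mfc_write_tag_mask(1 << cur);    /* wait only for the current tag */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);        /* overlaps with the next DMA */
            cur = next;
        }
    }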
5.2 Virtual SPE
Here, the outline of a Virtual SPE, shown in Figure 5, is introduced. It is a C++ class object, and corresponds one-to-one to a physical SPE. Table 2 shows the functions used to control Virtual SPEs. A programmer can control data transfer to and from remote SPEs, and initiate their program execution, by using the Virtual SPE functions listed above. The details of the functions are as follows:
• Vspe.{Send, Recv} These functions support sending or receiving data between the host machine and the target SPE. The arguments of these functions are the memory address of the data, the data size, and the target SPE number. Data transfers between the client and host machines, and DMA transfers between the PPE and SPEs, are initiated automatically by the PPE server program. According to the Cell/B.E. specification, a DMA transfer can send at most 16KB of data per block. When the data size exceeds this, the PPE server
divides the data into multiple blocks and transfers them repeatedly.
• Vspe.Run This function starts the execution of the target SPE. It only issues an execution (run) command to the specified SPE, and does not wait for acknowledgment of its start or termination. If a programmer wants to synchronize with other SPEs, they need to use the "Vspe.Wait" function.
• Vspe.Wait This function is used when a programmer wishes to wait until the specified SPE's execution is done.
Programmers first declare several Virtual SPE objects according to the number of SPEs to be used, and control them by using the above functions. After the Virtual SPEs are declared, ID numbers are assigned to each node and SPE automatically by the middleware.
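Putting the functions of Table 2 together, the following hedged sketch distributes work over several Virtual SPEs and then collects the results; the class name VirtualSPE, the two-argument Send/Recv form (taken from the listing in Section 5) and the per-SPE parameter and result layout are our assumptions for illustration.

    #include "vcell_runtime.h"

    int main() {
        const int NUM_VSPE = 16;            // e.g. two BCU-100 nodes
        VirtualSPE vspe[NUM_VSPE];          // IDs are assigned by the middleware
        int    param[NUM_VSPE];
        double result[NUM_VSPE];

        for (int i = 0; i < NUM_VSPE; i++) {
            param[i] = i;                                // per-SPE work descriptor
            vspe[i].Send(&param[i], sizeof(int));        // forwarded by the PPE server
            vspe[i].Run();                               // non-blocking start
        }
        for (int i = 0; i < NUM_VSPE; i++) {
            vspe[i].Wait();                              // wait for termination
            vspe[i].Recv(&result[i], sizeof(double));    // collect partial results
        }
        return 0;                                        // API_Finalize runs implicitly
    }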
6. Evaluation
In this section, we evaluate the usefulness of the Virtual SPE environment with the following two applications:
• calculation of the circular constant π using Monte Carlo integration, and
• matrix-matrix product.
Performance is evaluated in the environment shown in Table 3. The Virtual SPE environment lets programmers use as many SPEs as are connected to the same network segment as the host machine. The total number of available SPEs is equal to the number of BCU-100 nodes × 8 plus the number of PlayStation3 nodes × 7. The host machine in this evaluation is equipped with an Intel Xeon processor, and four SONY BCU-100 servers are used as client nodes.
First, we confirm the acceleration effect of parallel processing in this environment by using the Monte Carlo integration program, which requires only a small amount of data transfer. As this is an embarrassingly parallel algorithm, the computational load of each thread or process is uniform, and N, the total count of plots, is simply divided by the number of SPEs. Thus, a linear speedup according to the number of SPEs used is expected. Figure 6 shows the relationship between execution time and the number of SPEs for the middleware and for an OpenMPI implementation. The figure shows that the middleware achieves approximately the same parallel effect as the OpenMPI implementation, and the performance would approach it further as the plot count increases.
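As a concrete picture of the per-SPE work described above, the following is a plausible SPE-side kernel for the Monte Carlo estimation of π; the paper does not show its actual kernel, so the PRNG and the function interface are assumptions.

    #include <stdint.h>

    static uint32_t lcg(uint32_t *s) {          // tiny PRNG, good enough here
        *s = (*s) * 1664525u + 1013904223u;
        return *s;
    }

    // Each SPE receives its share of the N plots and counts hits inside
    // the quarter circle.
    uint64_t count_hits(uint64_t plots, uint32_t seed)
    {
        uint64_t hits = 0;
        uint32_t s = seed;
        for (uint64_t i = 0; i < plots; i++) {
            float x = lcg(&s) / 4294967296.0f;  // uniform in [0,1)
            float y = lcg(&s) / 4294967296.0f;
            if (x * x + y * y <= 1.0f)
                hits++;
        }
        return hits;
    }

The host simply sums the per-SPE hit counts and computes π ≈ 4 × hits / N, which is why the communication volume of this benchmark stays small.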
Table 3: The environment of evaluation
            Host machine        Node machine
Hardware    Intel Xeon          SONY BCU-100
CPU         Intel Xeon 2GHz     Cell/B.E. 3.2GHz
Memory      2GB                 1GB
OS          CentOS 5.4          Yellow Dog Linux 6.0
Compiler    g++-4.1.2           {ppu, spu}-g++ 4.1.1

Fig. 6: Relation between execution time and number of SPEs (plotting count N = 10^10)

Fig. 7: Relation between execution time and number of SPEs (dimension size of matrix N = 4096)
The overhead of the middleware arises when the connections to all nodes are established, but its impact becomes smaller as the algorithm becomes more computation-bound.
Second, we evaluate performance using the matrix-matrix product benchmark. Figure 7 and Figure 8 show the relationship between execution time and the number of SPEs for two matrix dimension sizes. As shown in Figure 8, the performance of our middleware is about 80% of that of the OpenMPI implementation. Note that the performance degrades when the matrix size is large, due to an increase in communication overhead, which derives from both the socket communication and frequent DMA transfers.
Third, we evaluate the programmability of this middleware in terms of code quantity. Table 4 shows the code size of the computation portion of the program when using OpenMPI with libspe2 and when using our middleware. The table indicates that our middleware can reduce code quantity by approximately 40%.
Fig. 8: Relation between execution time and number of SPEs (dimension size of matrix N = 8192)

Table 4: Program steps of the benchmark applications
                  Virtual SPE          OpenMPI + libspe2
Monte-Carlo       150 (Node + SPE)     250 (PPE + SPE)
Matrix-Product    200 (Node + SPE)     340 (PPE + SPE)
7. Conclusion
In this paper, we proposed and evaluated the Virtual SPE environment, and compared it with a traditional OpenMPI implementation. The evaluation results reveal the following:
• For highly parallel applications such as the Monte Carlo method or the matrix product, the Virtual SPE environment achieved at best the same performance as the OpenMPI implementation.
• The middleware can reduce the program steps of these applications by about 40%.
Here, the evaluation was done using a cluster with an Intel Xeon host and SONY BCU-100 nodes connected by Ethernet. Thus, the network easily becomes a bottleneck, and the types of applications which can be executed efficiently are limited. Evaluation with various types of applications on other platforms is our future work.
The evaluation results of the Monte Carlo integration, the matrix-matrix product and the code quantity suggest that the overhead of the middleware is acceptable considering its benefit in programming. We did not evaluate node-to-node communication, which could be applied to applications that require data transfer without going through the host machine. In addition, our future work includes the following:
1) evaluation in 10 Gigabit Ethernet and InfiniBand network environments,
2) examination of alternative methods for assigning each SPE, and
3) visualization of the utilization of each SPE.
Evaluation in the network environments listed above will reduce the overhead of data transfer, so the middleware can achieve performance even closer to that of OpenMPI. In the current implementation, a Virtual SPE corresponds one-to-one to a physical SPE; an alternative is to assign Virtual SPEs depending on the distance in the physical network. We considered this mechanism in the evaluation; however, the difference between intra-node (same Cell/B.E.) and inter-node communication makes it complex. Finally, this middleware provides no method to observe the activity of each SPE; therefore, the programmer cannot know each SPE's load.
References
[1] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho, "Entering the Petaflop Era: The Architecture and Performance of Roadrunner," in SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 1–11.
[2] S. Matsuoka, "The TSUBAME Cluster Experience a Year Later, and onto Petascale TSUBAME 2.0," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, ser. Lecture Notes in Computer Science, F. Cappello, T. Herault, and J. Dongarra, Eds. Springer Berlin / Heidelberg, 2007, vol. 4757, pp. 8–9, doi: 10.1007/978-3-540-75416-9_5.
[3] K. Tsoi and W. Luk, "Axel: A Heterogeneous Cluster with FPGAs and GPUs," in Proceedings of the International Symposium on Field Programmable Gate Arrays (FPGA), 2010.
[4] M. Kistler, J. Gunnels, D. Brokenshire, and B. Benton, "Programming the Linpack benchmark for the IBM PowerXCell 8i processor," Scientific Programming, vol. 17, no. 1-2, pp. 43–57, 2009.
[5] J. Kurzak, A. Buttari, P. Luszczek, and J. Dongarra, "The PlayStation 3 for high performance scientific computing," 2008.
[6] A. Buttari, J. Dongarra, and J. Kurzak, "Limitations of the PlayStation 3 for high performance cluster computing," Tech. Rep., 2007.
[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, pp. 789–828, Sept. 1996.
[8] T. Chen, R. Raghavan, J. N. Dale, and E. Iwata, "Cell Broadband Engine Architecture and its first implementation: A performance view," IBM Journal of Research and Development, vol. 51, no. 5, pp. 559–572, 2007.
[9] IBM Corp., "Cell/B.E. Programming Handbook Version 1.1," http://www.ibm.com/developerworks/power/cell/.
[10] D. M. Kunzman and L. V. Kale, "Towards a framework for abstracting accelerators in parallel applications: Experience with Cell," in SC'09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, 2009, pp. 1–2.
[11] S. Pakin, "Receiver-initiated message passing over RDMA networks," in Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, 2008, pp. 1–2.
[12] M. Yamada, Y. Nishikawa, M. Yoshimi, and H. Amano, "A Proposal of Thread Virtualization Environment for Cell Broadband Engine," in Proceedings of Parallel and Distributed Computing and Systems (PDCS), no. 724-027, 2010.
[13] Y. Tanaka, H. Nakada, S. Sekiguchi, T. Suzumura, and S. Matsuoka, "Ninf-G: A Reference Implementation of RPC-based Programming Middleware for Grid Computing," Journal of Grid Computing, vol. 1, no. 1, pp. 41–51, 2003.