Performance Issues of Grid Computing Based on Different Architecture Cluster Computing Platforms

Hsun-Chang Chang∗, Kuan-Ching Li∗, Yaw-Ling Lin∗, Chao-Tung Yang‡, Hsiao-Hsi Wang∗, Liang-Teh Lee†

∗ Dept. of Computer Science and Information Management, Providence University, Taichung 43301, Taiwan
  Email: {hcchang,kuancli,yawlin,hhwang}@cs.pu.edu.tw
† Dept. of Computer Science and Engineering, Tatung University, Taipei 10451, Taiwan
  E-mail: [email protected]
‡ Dept. of Computer Science and Information Engineering, Tunghai University, Taichung 40744, Taiwan
  E-mail: [email protected]

Abstract— This paper discusses performance issues of cluster and grid computing platforms, and the reasons supporting the deployment of these computing infrastructures. A number of benchmark programs are executed on these systems, and the experimental results are analyzed. We show that cluster platforms are excellent alternatives for accessing supercomputing power, owing to their cost/performance ratio, scalability, and use of commodity components. In addition, we show that grid technology is viable, increasing total system performance at no additional cost.

Keywords: CPU Architecture, Cluster Computing, Grid Computing, Performance Evaluation.

I. INTRODUCTION

Driven by the need to achieve high performance and to solve large-scale computational problems, distributed network computing environments have become a cost-effective and popular way to meet such computational demands [3]. Unlike past supercomputers, a cluster or grid computing system can be used as a multi-purpose computing platform to run diverse high-performance parallel applications. Nowadays, most cluster systems are built from Intel or AMD CISC-architecture computers running a variant of an open-source operating system. A cluster environment consists of PCs or workstations interconnected by high-speed networks, e.g., Gigabit Ethernet, SCI, Myrinet, and InfiniBand.

Early this century, a new technology called the Grid was developed, enabling the coupled and coordinated use of geographically distributed resources for purposes such as large-scale computation and distributed data analysis [6]. Actually, Grid technology is not a new idea: the concept of using multiple distributed resources to work cooperatively on a single application has been around for several decades. With the development of Globus [5] and MPICH-G2 [9], restructuring and executing parallel applications in grid computing environments is no longer a challenge, since the execution of parallel applications already developed for cluster platforms becomes largely automatic. MPICH-G2 is a grid-enabled implementation of MPI; coupled with the Globus Toolkit [5], it provides efficient and transparent execution of parallel applications in extremely heterogeneous Grid computing environments. Moreover, it allows users to couple multiple machines, potentially of different architectures and topologies, and it automatically converts data in messages sent among computing nodes of different architectures (illustrated by the sketch following the list below).

We wonder which CPU architecture is most suitable for solving problems in computational biology, aerodynamics, and three-dimensional ray tracing. We would also like to know how a parallel application is influenced when its computing tasks are assigned, through grid technology, to available computing nodes of geographically distributed cluster systems. Thus, performance evaluations are carried out on different combinations of workstations with different CPU architectures at geographically distributed locations. Two grid computing platforms were built by interconnecting selected computing nodes of cluster systems at different locations across Taiwan: TIGER, standing for Taichung Infrastructure of Grid Environment and Resources, and Taiwan UniGrid. The detailed configurations of these systems are listed later in this paper. Five sets of benchmark programs were chosen and executed on the different computing platforms for our performance evaluation:

• NAS Parallel Benchmarks: a set of eight problems consisting of five kernels, which highlight specific areas of machine performance, and three pseudo-applications that simulate computational fluid dynamics [1];
• mpptest: a program, implemented at ANL (Argonne National Laboratory, USA), that measures the performance of some of the basic MPI message-passing routines in a variety of situations;
• ClustalW-MPI: a well-known parallel bioinformatics tool for aligning multiple protein or nucleotide sequences [10]; our main goal here is to evaluate the scalability and performance of clusters of workstations [4], [10], [12];






• MPI-POVRAY: a distributed POV-Ray application using MPI message passing that distributes rendering tasks among a number of computing nodes;
• Netperf: a benchmark that can be used to measure many different types of networking performance; it provides tests for both unidirectional throughput and end-to-end latency.
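To make the automatic data conversion mentioned above concrete, here is a minimal sketch (our illustration, not code taken from MPICH-G2 or from the experiments in this paper) of an ordinary MPI program exchanging typed data between two ranks. Because the message is described with an MPI datatype rather than raw bytes, a grid-enabled MPI such as MPICH-G2 can convert the data representation when sender and receiver have different architectures (for example, x86 and PowerPC); the same source also compiles unchanged against a conventional MPICH on a single cluster.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];
        double payload[4] = {1.0, 2.0, 3.0, 4.0};   /* typed data, not raw bytes */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);

        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            /* MPI_DOUBLE tells the library the element type, so a grid-enabled
               MPI can translate the wire representation between unlike nodes. */
            MPI_Send(payload, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(payload, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 on %s received %.1f %.1f %.1f %.1f\n",
                   name, payload[0], payload[1], payload[2], payload[3]);
        }

        MPI_Finalize();
        return 0;
    }

Under MPICH-G2, such a program is typically launched through its mpirun front end, with the Globus Toolkit handling authentication and co-allocation of processes on the participating clusters; the launch details are omitted here.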

II. BENCHMARK PROGRAMS AND COMPUTING PLATFORMS

Our experiments with the benchmark programs were performed on five cluster computing platforms and two grid computing platforms, where each grid platform is a combination of multiple cluster platforms. Additionally, in order to understand the influence of geographical distance on grid technology, we used the network performance benchmark and the other benchmark programs mentioned above to observe performance variations between the TIGER and Taiwan UniGrid [11] platforms. At this time, the Taiwan UniGrid Project is a grid computing environment built by interconnecting ten research and academic institutions via TANET (Taiwan Academic Network) and TWAREN (Taiwan Advanced Research & Education Network); its sites are located in different localities across Taiwan. The TIGER Project is another grid computing platform, interconnecting five academic institutions. It is also connected via TANET; the only difference compared with the previous project is that all its cluster systems are located in the same metropolitan city, so the TIGER institutions can be viewed as interconnected by a metropolitan area network.

To make the experimental tests fair, the following cluster platforms were chosen for the experiments. The first and second cluster platforms are each built from nine PCs (1 master node + 8 computing nodes), where each computing node contains one AMD Athlon 2400+ processor, 1GB DDR333 memory, and a 60GB ATA hard disk, interconnected via a Gigabit Ethernet network. Both are located at PDPC/Providence University; the first cluster belongs to the TIGER Project, while the second is part of the Taiwan UniGrid Project. The third cluster platform contains nine workstations (1 master node + 8 computing nodes), where each computing node contains one 1.6GHz Apple PowerPC G5 processor, 1GB DDR333 memory, an 80GB S-ATA hard disk, and a 600MHz frontside bus (FSB), interconnected via a Gigabit Ethernet network; this cluster is part of the TIGER Project. The fourth cluster environment is composed of four PCs, where each node contains two 1GHz Intel Pentium III processors, 1GB memory, and a SCSI hard disk, interconnected via a Fast Ethernet network; this cluster is part of the Taiwan UniGrid Project. The fifth computing environment is a single PC with four 2.4GHz Intel Xeon processors, 1GB memory, and a SCSI hard disk, connected via a Fast Ethernet network; it is another component of the TIGER Project.

The NPB benchmark programs [1] are well-known problems for testing the capabilities of high-performance computers and parallelization tools. The NPB (NAS Parallel Benchmarks) 2.4 suite is a set of eight benchmark problems consisting of five kernels, which highlight specific areas of machine performance, and three pseudo-applications that simulate computational fluid dynamics. Problem sizes and verification values are given as classes S, W, A, B, C, and D.

ClustalW [12] is the most widely used tool for globally aligning multiple protein or nucleotide sequences. It incorporates a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties, and the automatic choice of a suitable residue comparison matrix at each stage of the multiple alignment. The alignment is achieved in three steps: a series of pairwise alignments is computed using full dynamic programming, a guide tree is built, and the sequences are then progressively aligned starting from the leaves of the tree toward the root. Recently, parallel versions of ClustalW for cluster environments using the MPI interface have been reported by K.-B. Li [10] and by Ebedes and Datta [4].

The third benchmark program selected for our computing environments is mpptest [7], developed at ANL (Argonne National Laboratory, USA). Basically, it is a program that measures the performance of some of the basic MPI message-passing routines in a variety of situations.

MPI-POVRAY is an MPI-parallel implementation of POV-Ray, a computationally intensive three-dimensional rendering engine. It reads a scene description from an external file and then uses ray tracing to compute an image of the scene by simulating the way rays of light travel in the real world.

Netperf is a benchmark that can be used to measure various aspects of networking performance. It provides tests for unidirectional throughput and end-to-end latency; its primary focus is on bulk data transfer and request/response performance using either TCP or UDP.

In our experiments, three programs of the NAS Parallel Benchmarks suite were selected, owing to memory limitations and compiler issues; the problem size in these benchmarks grows from class A to class D. For ClustalW-MPI, 500 sequences were randomly selected from the genome of P. aeruginosa strain PAO1, obtained from the Pseudomonas aeruginosa Community Annotation Project [2]; the sequence lengths range from 50 to 3535 characters. Both the blocking and non-blocking options of the mpptest [7] program were selected in our experiments on the computing platforms described above; we would like to know what happens to execution time when intensive computation is, or is not, overlapped with communication among the selected computing nodes (the two communication patterns are sketched below). For the MPI-POVRAY parallel program, we chose the standard skyvase.pov and Trifid.pov scene files for our experiments; these files can be downloaded freely from [8].
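As a concrete reference for the blocking/non-blocking comparison above, the following sketch (our illustration, not mpptest's source code; the message length and the do_local_work routine are placeholders) contrasts a blocking exchange with a non-blocking exchange that overlaps local computation with the transfer. mpptest itself varies the message length and applies a much more careful timing methodology.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder for the computation we try to overlap with communication. */
    static void do_local_work(double *v, int n)
    {
        for (int i = 0; i < n; i++)
            v[i] = v[i] * 1.000001 + 0.5;
    }

    int main(int argc, char **argv)
    {
        const int N = 1 << 20;          /* message length: 1M doubles (placeholder) */
        int rank, size;
        double *sendbuf = calloc(N, sizeof(double));
        double *recvbuf = calloc(N, sizeof(double));
        double *work    = calloc(N, sizeof(double));
        MPI_Request reqs[2];
        MPI_Status  stats[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size % 2 != 0) {            /* ranks are paired: 0-1, 2-3, ... */
            if (rank == 0) fprintf(stderr, "run with an even number of processes\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        int peer = rank ^ 1;

        /* Blocking pattern: communication and computation are serialized. */
        MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, peer, 0,
                     recvbuf, N, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, &stats[0]);
        do_local_work(work, N);

        /* Non-blocking pattern: post the transfer, compute, then wait, so the
           message transfer may proceed while the CPU is busy computing. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &reqs[1]);
        do_local_work(work, N);
        MPI_Waitall(2, reqs, stats);

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }

Whether the non-blocking version is actually faster depends on whether the MPI implementation and the network interface can progress the transfer while the host CPU computes; this is exactly what the overlap curves produced by mpptest probe.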


TABLE I. SUMMARY OF NETWORK MEASUREMENTS AND SPECIFICATIONS.

Network performance of CA server      TIGER         Taiwan UniGrid
TCP throughput                        62.95 MB/s    90.84 MB/s
UDP throughput                        312 MB/s      349.59 MB/s
TCP request/response trans. rate      455.35/s      350.16/s
UDP request/response trans. rate      470.15/s      356.41/s
UDP message size                      1024 bytes    1024 bytes
TCP message size                      16384 bytes   16384 bytes

The TCP/UDP stream tests and request/response tests are executed with default parameters between the TIGER and Taiwan UniGrid platforms; we wonder whether the geographical factor plays a key role in this set of experiments. During the experiments on the grid computing platforms, the three benchmarks were executed with MPICH-G2 [9] and Globus Toolkit 3.0.2.

III. EXPERIMENTS AND RESULTS

In our experiments, we did not perform any performance tuning of the parallel benchmark programs beyond what the compiler provides. The execution time of each application was averaged over at least six executions. The IS, LU, and SP applications of the NPB (NAS Parallel Benchmarks) were selected for execution on the CISC- and RISC-architecture cluster platforms; in addition to the cluster environments, we also executed them on the TIGER grid platform. Figure 1 shows the processing time of these NPB applications under different problem sizes. It shows that the Apple PowerMac cluster is ahead of the other platforms for the LU application, while a different result is observed for the SP problem, where the CISC-based cluster and grid platforms outperform the RISC-based cluster. Furthermore, Figure 1 reveals an interesting observation: the performance of an application running across platforms appears to be bounded by the slowest one.

Figure 2 shows the execution time of ClustalW-MPI on both the RISC- and CISC-architecture cluster platforms. To keep the experiment fair, we did not pick specific sequences that would produce a well-balanced guide tree. In our observation, the speed of the network interface card did not affect this experiment, because the amount of data transmitted is far below the theoretical network bandwidth for the tested problem sizes. We also tried to execute ClustalW-MPI in the grid environment; unfortunately, it could not be compiled with MPICH-G2 (MPICH v1.2.6) on the current version of the Apple platform. The results of the MPI-POVRAY application are also shown in Figure 2; in this experiment, better results were obtained on the AMD-based cluster.

Figure 3 shows the experimental results of the mpptest program, indicating whether computation and communication could be overlapped when MPI point-to-point blocking or non-blocking send/receive communication is used. We can observe from Figure 3 that, for both blocking and non-blocking send/receive communication, the Apple-based cluster shows better performance than the AMD-based cluster and the grid platform. We varied the message size from a few bytes to several megabytes, and as the message size increases, the Apple-based cluster handles communication better than the other two platforms. Figure 3 also shows the results when computation is overlapped with blocking and non-blocking communication: for transmission of a message close to 4 MB, the Apple-based cluster needs approximately 3000 µs, while the other two platforms need around 4500 µs.

Figure 4 shows an interesting result. We ran the mpptest experiment on two clusters with identical hardware, both located at our PDPC/PU computing center; one belongs to TIGER and the other to Taiwan UniGrid. The results indicate that the throughput of the Globus [5] CA server in each grid environment plays a key role in this benchmark. Table I summarizes the TCP/UDP throughput of the CA servers of TIGER and Taiwan UniGrid.

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we have shown experimental results of well-known parallel applications on different computing platforms. The experiments indicate that CISC-based computers can form inexpensive workstation clusters. In the MPI-POVRAY experiment, we attempted to take advantage of code optimization using the Velocity Engine of the Apple PowerPC G5 processor. The AltiVec instruction set operates on multiple data elements within 128-bit wide registers; this combination of new instructions, operating in parallel on multiple elements, and wider registers can provide speedups of up to 30x on operations common in media processing. It has been applied to well-known bioinformatics applications such as HMMER and Apple/Genentech BLAST, where it does speed up the computation. However, we did not see the same improvement for MPI-POVRAY: AltiVec operates on single-precision floating-point numbers, whereas most of the floating-point code in POV-Ray uses double precision, so AltiVec cannot be used to accelerate it (a small illustration is given at the end of this section). In addition, we observed that the throughput of the CA server in a grid environment plays a key role in benchmark results.

As the next step in our research, we would like to modify the source code of ClustalW-MPI so that it works on our grid computing environment. We would also like to optimize the parallel code of ClustalW-MPI for PowerPC processors by exploiting instruction-level parallelism with Apple's AltiVec technology. Additionally, in order to improve the speedup of ClustalW-MPI, we need to split a single alignment task among multiple processors; this will require a new algorithm that reduces the communication costs between processors, improving the overall performance of the application. Furthermore, we will run more tests to understand how to obtain further speedup in a mixed-network, large-scale grid computing environment.
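To illustrate the precision limitation noted above, the following minimal sketch (our illustration, unrelated to POV-Ray's actual source) shows a single-precision multiply-add written with AltiVec C intrinsics; it assumes an AltiVec-capable compiler such as GCC with the -maltivec option. AltiVec's 128-bit registers hold four 32-bit floats, so vec_madd processes four elements per instruction, but there is no vector type for 64-bit doubles, which is why POV-Ray's double-precision arithmetic cannot be vectorized this way.

    #include <altivec.h>
    #include <stdio.h>

    int main(void)
    {
        /* Four single-precision elements per 128-bit AltiVec register. */
        vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
        vector float b = {0.5f, 0.5f, 0.5f, 0.5f};
        vector float c = {1.0f, 1.0f, 1.0f, 1.0f};

        /* Fused multiply-add on all four lanes at once: r = a*b + c. */
        vector float r = vec_madd(a, b, c);

        float out[4] __attribute__((aligned(16)));
        vec_st(r, 0, out);              /* store the vector result to memory */
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);

        /* AltiVec has no "vector double", so double-precision code, such as
           most of POV-Ray's ray-tracing arithmetic, falls back to the scalar
           FPU and gains nothing from this unit. */
        return 0;
    }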


[Fig. 1. Performance evaluation of NAS IS, LU, and SP: execution time (sec) versus number of processors for different problem sizes, on the Linux (AMD), Apple, and Grid platforms.]

[Fig. 2. Performance evaluation of ClustalW-MPI and MPI-POVRAY: pairwise-alignment and execution time versus number of processors on the Apple and Linux platforms.]

[Fig. 3. Performance evaluation of mpptest: average time (µs) versus message length (bytes) for blocking and non-blocking communication, with and without overlap, on the Linux, Apple, and Grid platforms.]

[Fig. 4. Performance evaluation of mpptest with different CA servers: average time (µs) versus message length (bytes) on the TIGER and Taiwan UniGrid platforms, blocking and non-blocking.]

REFERENCES

[1] D. Bailey, T. Harris, W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.
[2] C. K. Stover et al. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature, 406(6799):947–948, 2000.
[3] L. M. Ni and D. K. Panda. Special issue on workstation clusters and network-based computing: Guest editors' introduction. Journal of Parallel and Distributed Computing, 43:63–64, 1997.
[4] J. Ebedes and A. Datta. Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics, 20:1193–1195, 2004.
[5] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Supercomputer Applications, 11(2):115–128, 1997.

[6] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1999.
[7] W. Gropp and E. Lusk. Reproducible measurements of MPI performance characteristics. In PVM/MPI, pages 11–18, 1999.
[8] A. Haveland-Robinson. http://www.haveland.com/povbench/.
[9] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A Grid-enabled implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing, 63(5):551–563, 2003.
[10] K.-B. Li. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics, 19(12):1585–1586, 2003.
[11] Taiwan UniGrid Project. http://unigrid.nchc.org.tw/.
[12] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994.

