
CREATION OF HIGH-PERFORMANCE COMPUTATION CLUSTER AND DATABASES IN ARMENIA

H.V. Astsatryan
Institute for Informatics and Automation Problems, National Academy of Sciences of the Republic of Armenia, 1, P. Sevak, Yerevan 375014, Armenia
E-mail: [email protected]

Yu.H. Shoukourian
Institute for Informatics and Automation Problems, National Academy of Sciences of the Republic of Armenia, 1, P. Sevak, Yerevan 375014, Armenia
E-mail: [email protected]

V.G. Sahakyan
Institute for Informatics and Automation Problems, National Academy of Sciences of the Republic of Armenia, 1, P. Sevak, Yerevan 375014, Armenia
E-mail: [email protected]

Key words: Armenian cluster, cluster infrastructure, cluster computations, Myrinet, intellectual software packages, PACO '2004

The purpose of the project is to create a distributed information-processing network in Armenia in order to provide scientists and engineers in the South Caucasus region with high-performance information sources and databases. The project essentially uses the research-oriented computer network existing in Armenia, which may serve as a communication medium for the cluster, giving it access to the world research community and to other clusters via the Internet.

1. Introduction

The growth of computing capacity is observed continually; the history of computer development shows that as soon as technologies satisfy the already known goals, new tasks arise that demand new technologies and more effective resources. The automation of scientific research and experiments belongs to this kind of problem, whose distinctive feature is intensive computation. This and other fields raise the necessity of creating high-performance computing systems. Joining separate information systems is profitable from the financial point of view, and high performance can be reached by aggregating the capacity of the available computers [1]. However, the organization of such distributed systems is a difficult task, because it is necessary to provide a consistent programming environment, reliable transfer of information between the sites (coping with loss of information, network modification, growth, and overload), organization of the parallel systems in use, management, security, etc. The importance of such a regional high-performance computing centre is dictated by the following factors:
• a large number of international programs are currently running in the Republic of Armenia, such as projects in physics, chemistry and technology;
• a series of regional projects (seismology, environment monitoring, etc.) require extensive and powerful data-processing means;
• research and technological developments exist in the region for which computation, data retrieval and visualization are an integral part of the investigation.


The purpose of project A-823 (funded by the International Science and Technology Centre) is to create a distributed information-processing network in Armenia in order to provide scientists and engineers in the South Caucasus region with high-performance information sources and databases. One of the promising ways is to build a high-performance cluster of personal computers connected by high-rate transmission links. The project essentially uses the research-oriented computer network existing in Armenia (the Armenian Academic Scientific Network), which serves as a communication medium for the cluster, giving it access to the world research community and to other clusters via the Internet. Analysis of cluster construction experience has shown that, taking into account the project tasks for the near future, a cluster of 64 nodes with dual Xeon 3.06 GHz processors should be selected. RedHat Linux 9.0 is used as the operating system. The ArmCluster [2] consists of a master node and compute nodes (see fig. 1).

Fig. 1. ArmCluster Structure.

User accounts exist only on the master node, and the master node schedules and dispatches all applications to the compute nodes. This makes the cluster easy to administer, with tightly controlled security, simplified management of the entire cluster through a single node, and controlled software upgrades [3]. OSCAR [4] is a collection of best-known methods for building, programming, and using clusters, with a convenient install wizard for building a Linux cluster. OSCAR is not an industry standard but a collection of the knowledge of experienced hardware and software experts, developed to foster the use of clusters [5]. We use the following OSCAR components:
• C3, the Cluster Command and Control tool suite, a command-line management tool for clusters;
• MPICH, a portable implementation of MPI developed at Argonne National Laboratory [6];
• Maui, an advanced job scheduler for clusters and supercomputers [7];
• PBS, the Portable Batch System, which provides flexible batch queuing and workload management on clusters and other types of parallel systems [8];
• OpenSSH, a freely distributed implementation of the secure shell protocol (SSH), providing the security needed for communication over an open network;
• OpenSSL, a freely distributed toolkit implementing the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) protocols, along with a general-purpose cryptography library for secure data transfer;
• SIS, the System Installation Suite, a collection of software tools designed to automate the installation and configuration of networked workstations such as HPC clusters.
The cluster nodes are integrated by means of Myrinet high-rate and Gigabit networks: the Myrinet network is used for computations, while the Gigabit network is used for task distribution, visualization and monitoring. Myrinet technology provides high scalability and is widely used in building high-performance clusters [9-10]. A host computer is used to load the cluster with problems and to control it. The file server stores the software packages and other information. A dedicated computer is used for visualizing the cluster's operation. Parallel programming on the cluster is performed by means of the MPI interface.
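As an illustration of this programming model (a minimal sketch of our own, not code from the project's software), the following C program reports the rank and host name of each MPI process; on a cluster of this kind it would be compiled with mpicc and dispatched to the compute nodes through the batch system.

/* hello_mpi.c - minimal MPI program; compile with: mpicc hello_mpi.c -o hello_mpi
 * and launch, e.g., with: mpirun -np 4 ./hello_mpi (on a PBS/Maui-managed
 * cluster this would run inside a batch job submitted from the master node). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(host, &len);     /* name of the compute node */

    printf("process %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}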


The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human effort spent optimizing the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for benchmark suites should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks with the workload of interest) but, rather, as reference points for further evaluations. The manufacturer usually refers to peak performance when describing a system. This peak performance is arrived at by counting the number of floating-point additions and multiplications that can be completed in a period of time. In particular, the peak theoretical performance of a dual Xeon 3.06 GHz node is 12.24 GFlops. By peak theoretical performance we mean only that the manufacturer guarantees that programs will not exceed this rate, a sort of speed of light for a given computer. At one time, a programmer had to go out of his way to code a matrix routine that would not run at nearly top efficiency on any system with an optimizing compiler.

Statistics on high-performance computers are of major interest to manufacturers and potential users. They wish to know not only the number of systems installed, but also the locations of the various supercomputers within the high-performance computing community and the applications for which each system is being used. Such statistics can facilitate the establishment of collaborations and the exchange of data and software, and provide a better understanding of the high-performance computer market. Statistical lists of supercomputers are not new: every year since 1986, Hans Meuer has published system counts of the major vector computer manufacturers, based principally on those presented at the Mannheim Supercomputer Seminar. Statistics based merely on the name of the manufacturer are no longer useful, however. New statistics are required that reflect the diversification of supercomputers, the enormous performance difference between low-end and high-end models, the increasing availability of massively parallel processing (MPP) systems, and the strong increase in the computing power of the high-end models from workstation suppliers (SMP). To provide this new statistical foundation, the TOP500 list was created in 1993 to assemble and maintain a list of the 500 most powerful computer systems. The list is compiled twice a year with the help of high-performance computing experts, computational scientists, manufacturers, and the Internet community in general. In the list, computers are ranked by their performance on the HPL NxN benchmark. HPL [11] is a software package that generates and solves a random dense linear system of equations on distributed-memory computers. The package uses 64-bit floating-point arithmetic and portable routines for linear algebra operations and message passing. It also allows one of several factorization algorithms to be selected, and provides timing and an estimate of the accuracy of the solution. Consequently, it can be regarded as a portable implementation of the HPL NxN benchmark. It requires an implementation of MPI and either the BLAS or the Vector Signal Image Processing Library (VSIPL).
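To make these figures concrete, the short C program below is a back-of-envelope sketch, under the usual assumption that each Xeon retires two double-precision floating-point operations per cycle, which is what yields the quoted 12.24 GFlops for a dual 3.06 GHz node. It also applies the standard HPL operation count, 2/3 n^3 + 2 n^2 flops for a problem of size n, to relate a problem size and a sustained rate to a run time; the sample values are taken from the results reported below.

/* peak_and_hpl.c - peak performance and HPL operation-count arithmetic. */
#include <stdio.h>

int main(void)
{
    double clock_ghz = 3.06;       /* Xeon 3.06 GHz */
    double flops_per_cycle = 2.0;  /* assumed: 2 DP flops per cycle */
    int cpus_per_node = 2;         /* dual-processor node */

    double peak = clock_ghz * flops_per_cycle * cpus_per_node;
    printf("node peak: %.2f GFlops\n", peak);   /* prints 12.24 GFlops */

    /* HPL rates a run of size n finishing in t seconds as
     * (2/3 n^3 + 2 n^2) / t flops per second. For example, the
     * reported 4.023 GFlops at n = 10500 implies a run time of: */
    double n = 10500.0, gflops = 4.023;
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    printf("HPL(n=%.0f): %.3e flops, ~%.0f s at %.3f GFlops\n",
           n, flops, flops / (gflops * 1e9), gflops);
    return 0;
}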
The Xeon 3.06 GHz processor achieved its best performance, 4.023 GFlops, at a problem size of 10500×10500. Each node has been clocked at 7 GFlops. The cluster as a whole has been benchmarked at 483.6 GFlops, which makes it the fastest supercomputer in the South Caucasus region at present (see fig. 2).

In the past few years the computational power of commodity PCs has been doubling about every eighteen months. At the same time, network interconnects that provide very low latency and very high bandwidth are emerging. This trend makes it very promising to build high-performance computing environments by clustering, which combines the computational power of commodity PCs with the communication performance of high-speed network interconnects. Traditionally, simple micro-benchmarks such as latency and bandwidth tests have been used to characterize the communication performance of network interconnects. Real applications, however, usually run on top of a middleware layer such as the Message Passing Interface (MPI), so the observed performance reflects not only the capability of the network interconnect but also the quality of the MPI implementation and the design choices made by its implementers. To gain more insight into the communication capability offered by each interconnect, it is therefore desirable to conduct tests at a lower level, and to compare the performance of various computing platforms and MPI implementations a set of well-defined MPI benchmarks is needed. The Pallas MPI Benchmarks (PMB) suite [12-13] has been used. The idea of PMB is to provide a concise set of elementary MPI benchmark kernels. With one executable, all of the supported benchmarks, or a subset specified on the command line, can be run. The rules, such as time measurement, message lengths, and the selection of communicators for a particular benchmark, are program parameters. For a clear structuring of the set of benchmarks, PMB introduces classes of benchmarks: Single Transfer, Parallel Transfer, and Collective. This classification refers to different ways of interpreting results and to the structuring of the code itself; it does not influence the way PMB is used. The Single Transfer (point-to-point communication) class has been tested among the processors of a node. Point-to-point communication performance is measured between two processes. The PingPong test is the classical pattern for measuring the startup time (latency) and throughput (bandwidth) of a single message sent between two processes (see fig. 3).
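The PingPong pattern is easy to express in MPI. Below is a minimal sketch of our own, much simpler than PMB itself, which sweeps message sizes and repetition counts automatically; the constant NREPS and the buffer handling are illustrative. Rank 0 sends a message and waits for its return, so half the averaged round-trip time is the latency, and the message size divided by that time is the bandwidth.

/* pingpong.c - simplified PMB-style PingPong between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NREPS 1000   /* repetitions to average out timer noise */

int main(int argc, char **argv)
{
    int rank, nbytes = (argc > 1) ? atoi(argv[1]) : 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {            /* send, then wait for the echo */
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* receive, then echo back */
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / NREPS / 2.0;  /* one-way time, s */
    if (rank == 0)
        printf("%d bytes: latency %.2f us, bandwidth %.2f MB/s\n",
               nbytes, t * 1e6, nbytes / t / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}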

Fig. 2. HPL results on ArmCluster: performance (Gflops) versus problem size (up to 90000); the maximum plotted value is 435.1 Gflops.

Fig. 3. PingPong test with 2 processes per node (Dual Intel Xeon 3.06 GHz): latency (ms, left axis) and bandwidth (MB/s, right axis) versus message size (bytes); peak bandwidth 848.96 MB/s.

As with PingPong, PingPing measures the startup time and throughput of a single message sent between two processes, with the crucial difference that the messages are obstructed by oncoming messages: the two processes communicate with each other with their MPI_Isend calls issued simultaneously (see fig. 4; a minimal code sketch of this pattern follows the list of expected results below). The expected results of the project include:
• an infrastructure for high-performance computations, providing access to high-efficiency resources for scientific and high-technology organizations in Armenia and regional centres, based on a cluster of computers and the academic research network, and providing access to other supercomputer centres; the overall software package of the cluster will include efficient means of task planning and confidential information exchange between users;
• a system of intellectual software packages to support advances in the modeling and analysis of quantum systems, signal and image processing, the theory of radiation transfer, and the calculation of time constants for bimolecular chemical reactions;
• a system of mathematically proved methods, fast algorithms and programs for solving certain classes of problems in linear algebra, calculus, algebraic reconditibility, and the test-checkable design of built-in control circuits in the machine cluster environment;
• a software environment for the design and real-time verification of Global Automatic Control Systems based on primitives, in which the design and verification processes are integrated; the proposed methods will be tested on a specific ACS, and separate components (such as access-classification system design means for distributed users) may be used independently.
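As referenced above, here is a minimal PingPing sketch in the same simplified style as the PingPong example (NREPS and the buffer names are again illustrative, not PMB code): both ranks post MPI_Isend at the same time, so each incoming message must compete with an outgoing one.

/* pingping.c - simplified PMB-style PingPing: both ranks send at once. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NREPS 1000

int main(int argc, char **argv)
{
    int rank, size, nbytes = (argc > 1) ? atoi(argv[1]) : 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {               /* the pattern needs exactly 2 ranks */
        if (rank == 0) fprintf(stderr, "run with 2 processes\n");
        MPI_Finalize();
        return 1;
    }
    int peer = 1 - rank;
    char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        MPI_Request req;
        /* both sides start their send simultaneously ... */
        MPI_Isend(sbuf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        /* ... so every receive faces an oncoming message */
        MPI_Recv(rbuf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    double t = (MPI_Wtime() - t0) / NREPS;  /* time per message, s */
    if (rank == 0)
        printf("%d bytes: latency %.2f us, bandwidth %.2f MB/s\n",
               nbytes, t * 1e6, nbytes / t / 1e6);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}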

Fig. 4. PingPing test with 2 processes per node (Dual Intel Xeon 3.06 GHz): latency (ms, left axis) and bandwidth (MB/s, right axis) versus message size (bytes); peak bandwidth 439.71 MB/s.

References
1. Sterling T., Savarese D., Becker D., Dorband J.E., Ranawake U., Packer C. BEOWULF: A parallel workstation for scientific computation // Proceedings of the 24th International Conference on Parallel Processing, August 1995. CRC Press. Vol. I. P. 11-14.
2. Astsatryan H., Shoukourian Yu., Sahakyan V. The ArmCluster Project: Brief Introduction // Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '04), Las Vegas, Nevada, USA, June 21-24, 2004. Vol. III. P. 1291-1295.
3. Astsatryan H., Sahakyan V. The Management of Programming Environment of Distributed Systems for Programmers // Proceedings of the International Conference on Computer Science and Information Technologies, Yerevan, Republic of Armenia, 2001. P. 376-379.
4. Open Cluster Group: OSCAR Working Group. OSCAR: A packaged cluster software for High Performance Computing. http://www.OpenClusterGroup.org/OSCAR.
5. Ferri R. The OSCAR revolution // Linux Journal. June 2002. No. 98.
6. MPICH: An implementation of MPI. http://www-unix.mcs.anl.gov/mpi/mpich.
7. Portable Batch System (OpenPBS). http://www.openpbs.org.
8. MAUI Scheduler. http://supercluster.org/maui/.
9. Felderman R., DeSchon A., Cohen D., Finn G. ATOMIC: A High Speed Local Communication Architecture // Journal of High Speed Networks. 1994. Vol. 3, No. 1. P. 1-29.
10. Myrinet Link and Routing Specification. http://www.myri.com/myricom/documents.html.
11. Dongarra J. Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report). University of Tennessee Computer Science Technical Report CS-89-85, 2004.
12. Pallas GmbH: Pallas MPI Benchmarks. http://www.pallas.com/e/products/pmb.
13. Tierney B., Johnston W., Crowley B. The NetLogger Methodology for High Performance Distributed Systems Performance Analysis // Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing, 1998. P. 260-267.
