Earth Simulator Running

Tetsuya Sato¹, Shigemune Kitawaki², and Mitsuo Yokokawa³

¹ Earth Simulator Center/Japan Marine Science and Technology Center, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan, [email protected]

² Earth Simulator Center/Japan Marine Science and Technology Center, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan, [email protected]

³ Japan Atomic Energy Research Institute, 6-9-3 Higashi-Ueno, Taito-ku, Tokyo 110-0015, Japan, [email protected]

Abstract

The Earth Simulator (ES) is a distributed-memory parallel system which consists of 640 processor nodes connected by a single-stage full crossbar network. Each processor node is a shared-memory system composed of eight vector processors and a 16 GB memory system. The total peak performance and main memory capacity are 40 Tflops and 10 TB, respectively. Development was initiated in 1997, and the latest 0.15 μm CMOS LSI technology was adopted to realize a one-chip vector processor. The development was successfully completed in February 2002, and a remarkable sustained performance of 35.86 Tflops, 87.5% of the peak performance, was obtained on the Linpack benchmark.

1 Introduction

It is widely recognized that global phenomena such as global warming, which affect the social and economic activities of human beings, should be analyzed. However, present computer capabilities are insufficient to carry out high-resolution simulations of global change, and the complicated phenomena therefore cannot be pursued. The Earth Simulator project was initiated with the aim of understanding and elucidating global change as precisely as possible, and the development of an ultra-high-speed supercomputer called the Earth Simulator (ES) was started as a joint project by the National Space Development Agency of Japan (NASDA), the Japan Atomic Energy Research Institute (JAERI), and the Japan Marine Science and Technology Center (JAMSTEC). Fabrication of the ES was completed and the system was installed at the Yokohama Institute for Earth Sciences (YES/JAMSTEC) at the end of February 2002. In this paper, the achievements of the development of the ES are presented.


2 Overview of the Earth Simulator

2.1 Hardware System

The Earth Simulator is a distributed-memory parallel system which consists of 640 processor nodes connected by a 640 × 640 single-stage crossbar switch (Fig. 1). Each node is a shared-memory system composed of eight arithmetic vector processors (APs), a shared memory of 16 GB, a remote access control unit (RCU), and an I/O processor (IOP). The peak performance of each AP is 8 Gflops. The total number of processors is therefore 5,120, and the total peak performance and main memory capacity are 40 Tflops and 10 TB, respectively. A 0.15 μm CMOS technology with Cu interconnection is used for the LSIs [1, 2].
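As a back-of-the-envelope check on how the aggregate figures quoted above follow from the per-node numbers (a minimal sketch; the program name and the use of binary units for memory are illustrative choices, and the small rounding to 40 Tflops / 10 TB matches the paper's usage):

```fortran
program es_totals
  ! Aggregate figures for the configuration described above:
  ! 640 nodes x 8 APs/node x 8 Gflops/AP, and 16 GB of memory per node.
  implicit none
  integer, parameter :: nodes = 640, aps_per_node = 8
  real, parameter    :: gflops_per_ap = 8.0, gb_per_node = 16.0

  print '(a,i0)',   'total APs        : ', nodes * aps_per_node                             ! 5120
  print '(a,f6.2)', 'peak (Tflops)    : ', nodes * aps_per_node * gflops_per_ap / 1000.0    ! 40.96
  print '(a,f6.2)', 'main memory (TB) : ', nodes * gb_per_node / 1024.0                     ! 10.00
end program es_totals
```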


Figure 1: Configuration of the Earth Simulator

The AP contains a vector unit (VU), a 4-way superscalar unit (SU), and a main memory access control unit, all mounted on a one-chip LSI. The chip is about 2 cm × 2 cm and operates at a clock frequency of 500 MHz (1 GHz in parts of the chip). The VU consists of 8 sets of vector pipelines, vector registers, and mask registers. There are six types of operation pipeline: add/shift, multiply, divide, logical, mask, and load/store. The eight pipelines of the same type work together under a single vector instruction, and pipelines of different types can operate concurrently. There are 72 vector registers, each holding 256 vector elements. The SU is a superscalar processor with a 64 KB instruction cache, a 64 KB data cache, and 128 general-purpose scalar registers; branch prediction, data prefetching, and out-of-order instruction execution are employed. Both the VU and the SU support the IEEE 754 floating-point data format.

The memory system (MS) of a node is shared equally by the 8 APs and is configured from 32 main memory package units (MMUs) with 2,048 banks. A 128 Mbit high-speed DRAM with a 24 ns bank cycle time is used as the memory chip, and the memory capacity of each node is 16 GB. Each AP has 32 GB/s of memory bandwidth, giving 256 GB/s per node in total.
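To illustrate the style of computation these vector pipelines target, the fragment below is a plain, stride-1 Fortran loop of the kind a vectorizing compiler would strip-mine into 256-element chunks for the vector registers; it is purely illustrative and not taken from any ES code.

```fortran
subroutine daxpy_like(n, a, x, y)
  ! A simple vectorizable kernel: y(i) = y(i) + a*x(i).
  ! On a vector processor such as the ES AP, the compiler would strip-mine
  ! the loop into chunks of up to 256 elements (the vector register length)
  ! and drive the multiply and add pipelines concurrently.
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i

  do i = 1, n
     y(i) = y(i) + a * x(i)
  end do
end subroutine daxpy_like
```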

The RCU of each node is directly connected to the crossbar switch, with separate send and receive paths, and controls inter-node data communications. Several data-transfer modes, such as three-dimensional sub-array accesses and indirect accesses, are supported. The single-stage crossbar network (IN) consists of two kinds of units (Fig. 2): one is the inter-node crossbar control unit (XCT), which is in charge of coordinating the switching operations; the other is the inter-node crossbar switch (XSW), which provides the actual data paths. The XSW is composed of 128 separate switches, each with a 1-byte data width and operated independently. All pairs of nodes and switches are connected by electric cables. The theoretical data transfer rate between any two nodes is 12.3 GB/s × 2 (one in each direction).


Figure 2: Configuration of the IN

Two nodes are placed in a node cabinet, the size of which is 140 cm (W) × 100 cm (D) × 200 cm (H), and 320 node cabinets in total were installed in the building. Two XCTs are placed in an IN cabinet, as are two XSWs. The size of the IN cabinet is 130 cm (W) × 95 cm (D) × 200 cm (H), and there are 65 IN cabinets in total.

2.2 Earth Simulator Building

The ES was installed in a building at the Yokohama Institute for Earth Sciences, JAMSTEC, located about 40 km south of Tokyo. The building has two stories with a seismic isolation system and measures 50 m × 65 m × 17 m. The ES is protected against electromagnetic waves coming from outside by covering the building with steel plates.

2.3 Software System and Parallel Programming Model

The basic software of the ES, such as the operating system, programming tools, and operation support software, should be highly scalable and readily usable by researchers in different application fields.

Figure 3: Earth Simulator installed in the building

A hierarchical management system is therefore introduced to control the ES (Fig. 4). Every 16 nodes are grouped into a cluster, so there are 40 clusters in total. One cluster, called the "S-cluster", is dedicated to interactive processing and small-scale batch jobs; a job that fits within a single node can be processed on the S-cluster. The other clusters, called "L-clusters", are for medium-scale and large-scale batch jobs; parallel jobs spanning several nodes are executed on one or more of these clusters. Each cluster has a cluster control station (CCS), which monitors the state of the nodes and controls the power supply of the nodes belonging to the cluster. A super cluster control station (SCCS) plays the important role of integrating and coordinating all CCS operations.

The operating system running on each node of the ES is basically a UNIX-based system and provides the same execution environment as a conventional UNIX system. It also provides a parallel execution environment for the distributed-memory system of the Earth Simulator. In addition to the usual UNIX facilities, a high-speed file system and a parallel file system for large-scale scientific computations are supported. The principal style of job processing on the Earth Simulator is batch processing, and the job scheduler plays an important role in smooth operation. We have developed a flexible job scheduler which assigns batch jobs to nodes independently of the L-cluster boundaries.

The ES provides a three-level parallel processing environment: vector processing within an AP, shared-memory parallel processing within a node, and parallel processing among the distributed nodes via the IN. Automatic vectorization and automatic intra-node parallelization are supported by the compilers for programs written in conventional Fortran 90 and C. Shared-memory parallel programming is supported through microtasking and OpenMP.


Figure 4: Concept of cluster system in operation

Microtasking is a kind of multitasking first provided for Cray supercomputers, and the same function is realized for the ES. There are two ways of using microtasking: automatic parallelization by the compiler, or manual insertion of a parallelization directive before the target do loop. OpenMP is the standard shared-memory programming API.

A message-passing programming model based on the MPI-2 libraries, both within a node and between nodes, is provided as the base programming environment so that the three-level parallel processing environment can be used efficiently. However, it is recognized that developing efficient parallel application software with message-passing libraries is a very difficult and laborious task even for users well trained in programming. We believe that a higher-level data-parallel language is a key issue in advancing high-performance computing, and the HPF language is one of the candidates. HPF/ES is provided with the HPF2 approved extensions, the HPF/JA extensions, and some extensions specific to the ES; these include features for irregular-grid problems, user-controllable shadow regions, and so on [3]. We adapted the plasma simulation code IMPACT-3D to the HPF/ES compiler and obtained a performance of 12.5 Tflops on 512 nodes of the ES, which is 39% of the peak performance. This result shows that HPF/ES has high scalability and can be used to develop real simulation programs.
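Of the three levels above, the shared-memory level within a node is the simplest to illustrate compactly. The fragment below is a minimal OpenMP sketch of a node-level parallel do loop; the array size and loop body are invented for illustration and are not taken from any ES code, and on the ES the same loop could alternatively be parallelized automatically by the compiler or with microtasking directives.

```fortran
! Illustrative only: node-level shared-memory parallelization with OpenMP.
! Each of the (up to) eight APs in a node takes a block of iterations, and the
! work inside each iteration remains an ordinary vectorizable array operation.
program node_parallel_sketch
  use omp_lib              ! for omp_get_max_threads()
  implicit none
  integer, parameter :: n = 100000
  real(8) :: x(n), y(n)
  integer :: i

  x = 1.0d0
  y = 2.0d0

!$omp parallel do
  do i = 1, n
     y(i) = y(i) + 0.5d0 * x(i)   ! vectorizable inner work
  end do
!$omp end parallel do

  print *, 'threads available:', omp_get_max_threads(), '  y(1) =', y(1)
end program node_parallel_sketch
```

Each thread's share of iterations is still a stride-1 array operation, so the loop remains a good target for the vector pipelines of its AP.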

3 Performance of MPI Libraries

In this section, results of performance measurements of the MPI libraries implemented on the ES are presented, together with a description of the virtual memory space allocation [4]. The memory space of a process on the ES is divided into two kinds of spaces, called local memory (LMEM) and global memory (GMEM). Either of these memory spaces can be assigned to the buffers of MPI functions. GMEM in particular is addressed globally over the nodes and can be accessed only by the RCU. The GMEM area can be shared, with a global memory address, by all MPI processes allocated to different nodes.

The behavior of MPI communications differs according to the memory area in which the buffers to be transferred reside; it is classified into four cases.


Figure 5: Throughput of the ping-pong pattern (MPI_Send and MPI_Isend, local-to-local and global-to-global buffers) versus message size

Case 1: Data stored in the LMEM of a process A are transferred to the LMEM of another process B on the same node. In this case, GMEM is used as general shared memory: the data in the LMEM of process A are first copied into an area of the GMEM of process A, and then copied from there into the LMEM of process B.

Case 2: Data stored in the LMEM of process A are transferred to the LMEM of a process C invoked on a different node. First, the data in the LMEM of process A are copied into the GMEM of process A. Next, the data in the GMEM of process A are copied into the GMEM of process C via the crossbar switch using INA instructions. Finally, the data in the GMEM of process C are copied into the LMEM of process C.

Case 3: Data stored in the GMEM of process A are transferred to the GMEM of process B on the same node; the data are copied directly into the GMEM of process B with a single copy operation.

Case 4: Data stored in the GMEM of process A are transferred to the GMEM of process C invoked on a different node; the data are copied directly into the GMEM of process C via the crossbar switch using INA instructions.

The performance of the ping-pong pattern implemented with either MPI_Send/MPI_Irecv or MPI_Isend/MPI_Irecv is evaluated on the ES (see the sketch below).
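The measurement pattern itself can be sketched as follows: a minimal ping-pong timing loop between two ranks, assuming one rank per node. It is not the benchmark library of [4]; blocking MPI_Recv is used for brevity where the measurements above pair the sends with MPI_Irecv, the message size and repetition count are arbitrary, and whether a buffer resides in LMEM or GMEM (Cases 1 to 4 above) is determined by how the memory is allocated on the ES, which is not shown here.

```fortran
program pingpong_sketch
  ! Illustrative ping-pong timing between ranks 0 and 1 (run with 2 ranks).
  use mpi
  implicit none
  integer, parameter :: nbytes = 1024*1024        ! message size: 1 MB
  integer, parameter :: nreps  = 100
  character(len=1)   :: buf(nbytes)
  integer :: rank, ierr, i, status(MPI_STATUS_SIZE)
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  buf = 'x'
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, nreps
     if (rank == 0) then
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()

  if (rank == 0) then
     ! 2*nreps messages of nbytes bytes were moved in (t1-t0) seconds
     print '(a,f12.3)', 'throughput (MB/s): ', (2.0d0*nreps*nbytes) / (t1-t0) / 1.0d6
  end if
  call MPI_Finalize(ierr)
end program pingpong_sketch
```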


Figure 6: Throughput of the RMA functions (MPI_Get, MPI_Put, MPI_Accumulate) versus message size

Two MPI processes are invoked according to the four cases described above, and the performance of the two send functions is measured while varying the message size to be transferred. Throughputs for Cases 3 and 4 are shown in Figure 5. For intra-node communication (Case 3), the maximum throughput is 14.87 GB/s, half of the peak is achieved for messages larger than 256 KB, and the startup cost is 5.20 μsec. For inter-node communication (Case 4), the maximum throughput is 11.76 GB/s, half of the peak is achieved for messages larger than 512 KB, and the startup cost is 8.56 μsec. The gaps near message sizes of 1 KB and 128 KB in the figure are caused by changes of the communication method inside the MPI functions, which are switched according to the message length.

The performance of the ping pattern implemented with the three RMA functions MPI_Get, MPI_Put, and MPI_Accumulate (with the sum operation) is measured by invoking two MPI processes on two different nodes; a sketch of this one-sided pattern is given below. The throughputs are depicted in Figure 6. The maximum throughputs for MPI_Get, MPI_Put, and MPI_Accumulate are 11.63 GB/s, 11.63 GB/s, and 3.16 GB/s, respectively, and the startup costs are 16.52 μsec, 15.00 μsec, and 14.57 μsec. We also measured the time required for barrier synchronization among nodes; it is about 3.3 μsec, independent of the number of nodes.
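For reference, the one-sided (RMA) pattern can be sketched as below: a minimal MPI-2 example, assuming two ranks, that exposes an array as a window and writes into the remote copy with MPI_Put between two fences. It is not the benchmark code of [4]; the fence-based synchronization and the array size are illustrative choices, and MPI_Get and MPI_Accumulate are used analogously.

```fortran
program rma_sketch
  ! Illustrative one-sided transfer: rank 0 puts its array into rank 1's window.
  use mpi
  implicit none
  integer, parameter :: n = 1024
  real(8) :: winbuf(n), src(n)
  integer :: rank, win, ierr
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  winbuf  = 0.0d0
  src     = dble(rank + 1)
  winsize = 8_MPI_ADDRESS_KIND * n          ! window size in bytes
  call MPI_Win_create(winbuf, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

  call MPI_Win_fence(0, win, ierr)
  if (rank == 0) then
     disp = 0
     ! write rank 0's src array into rank 1's window
     call MPI_Put(src, n, MPI_DOUBLE_PRECISION, 1, disp, n, MPI_DOUBLE_PRECISION, win, ierr)
  end if
  call MPI_Win_fence(0, win, ierr)

  if (rank == 1) print *, 'winbuf(1) after MPI_Put:', winbuf(1)   ! expect 1.0

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program rma_sketch
```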

4 Experimental Run of Applications on the ES

Several application programs in climate research have been executed on the ES as experimental runs to investigate its performance.



Figure 7: Precipitation calculated by AFES

The first code is AFES [5], an optimized version for the ES of the atmospheric general circulation model NJR-SAGCM [6], prepared by the Earth Simulator Research and Development Center (ESRDC). NJR-SAGCM is based on the CCSR/NIES AGCM developed jointly by the Center for Climate System Research (CCSR) of the University of Tokyo and the Japanese National Institute for Environmental Studies (NIES) [7]. The original model is based on the global three-dimensional hydrostatic primitive equations. It uses a spectral transform method in the horizontal and a finite-difference method on a sigma coordinate in the vertical, and it predicts horizontal winds, temperature, surface pressure, specific humidity, cloud water, and so on at mesh points generated around the Earth. Figure 7 shows the precipitation after three days of model integration with the T1279L96 resolution, which stands for T1279 spectral truncation in the horizontal (a 3840 × 1920 grid) and 96 vertical layers. Meso-scale features appear in the figure; for example, each of the cyclones developing and migrating eastward over the mid-latitude oceans is characterized by a moist, precipitating area with a distinctive T-bone shape.

With the T1279L96 resolution on 2,560 APs (320 nodes), half of the ES system, AFES achieved 14.5 Tflops in the execution of the main time-step loop and 13.2 Tflops in the execution of a one-model-day integration. These performances correspond to 70.8% and 64.5%, respectively, of the theoretical peak performance of that half system (20.48 Tflops). The execution time of the one-model-day integration is 2,649 seconds, including the main time-step loop, the radiation process computed once per model hour, and the pre- and post-processes, while the execution time of 10 steps of the main time-step loop alone is 8.52 seconds.

The second code is the oceanic circulation model MOM3, which was developed by the Geophysical Fluid Dynamics Laboratory (GFDL) and optimized for the ES. This is the first time a global simulation with 0.1-degree resolution has been carried out. Figure 8 shows the sea surface temperature calculated by the optimized code.


Figure 8: Sea surface temperature calculated by MOM3 optimized for the ES

As is evident in Figure 8, meso-scale vortices about 50 km in diameter and the Kuroshio current are clearly simulated.

5 Summary

The development of the ES was successfully completed as planned, with a great sustained performance of 35.86 Tflops on the Linpack benchmark, which is 87.5% of the peak performance. This is the first Japanese system since the Numerical Wind Tunnel to take the top position in the TOP500 supercomputer list. The Earth Simulator has been put into operational use by the Earth Simulator Center, a branch of YES/JAMSTEC. The authors expect that the ES will bring a major improvement in simulation quality to the environmental sciences and open a new scenario for the study of global change.

Acknowledgement

The authors would like to offer their cordial condolences to the late Mr. Hajime Miyoshi, who made great contributions to the planning and design of the ES with his outstanding leadership. The authors would also like to thank all members of ESRDC and ESC for their valuable discussions and comments.

References

[1] M. Yokokawa, S. Shingu, S. Kawai, K. Tani, and H. Miyoshi, "Performance Estimation of the Earth Simulator," Towards Teracomputing, Proc. of the 8th ECMWF Workshop, pp. 34-53, World Scientific (1998).

[2] K. Yoshida and S. Shingu, "Research and Development of the Earth Simulator," Proc. of the 9th ECMWF Workshop, pp. 1-13, World Scientific (2000).


[3] High Performance Fortran Language Specification, Version 2, High Performance Fortran Forum, January 1997.

[4] H. Uehara, M. Tamura, and M. Yokokawa, "An MPI Benchmark Program Library and Its Application to the Earth Simulator," LNCS 2327, Springer (2002).

[5] S. Shingu, H. Fuchigami, M. Yamada, Y. Tsuda, M. Yoshioka, W. Ohfuchi, H. Nakamura, and M. Yokokawa, "Performance of the AFES – Atmospheric General Circulation Model for Earth Simulator –," Proc. of Parallel CFD 2002, Nara, Japan (2002).

[6] Y. Tanaka, N. Goto, M. Kakei, T. Inoue, Y. Yamagishi, M. Kanazawa, and H. Nakamura, "Parallel Computational Design of NJR Global Climate Models," High Performance Computing, LNCS 1615, pp. 281-291, Springer (1999).

[7] A. Numaguti, S. Sugata, M. Takahashi, T. Nakajima, and A. Sumi, "Study on the Climate System and Mass Transport by a Climate Model," CGER's Supercomputer Monograph Report, 3, CGER/NIES (1997).
