CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2016; 28:4144–4160
Published online 24 March 2016 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3822

SPECIAL ISSUE PAPER

A distributed load balancing algorithm for climate big data processing over a multi-core CPU cluster

Yuzhu Wang1,2,*,†, Jinrong Jiang2, Huang Ye2 and Juanxiong He3

1 Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China
2 Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
3 Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, China

*Correspondence to: Yuzhu Wang, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, 100094 Beijing, China.
†E-mail: [email protected]

SUMMARY

Load imbalance is a common and pressing problem in large-scale data-driven simulation systems and data-intensive computing. Through its coupler, the Chinese Academy of Sciences-Earth System Model (CAS-ESM) implements one-way nesting of the Institute of Atmospheric Physics of Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 (IAP AGCM4.0) and the Weather Research and Forecasting model (WRF). The METGRID (meteorological grid) and REAL program modules in the WRF are used to process meteorological data. In the CAS-ESM, the load of the METGRID module is seriously unbalanced across CPU cores. Because this load imbalance severely slows the processing of meteorological data, this study designs an optimization algorithm to solve the problem. Numerical experiments show that the optimization algorithm resolves the load imbalance of the METGRID and that, after optimization, the METGRID and REAL modules run about 7.2 times faster on 64 CPU cores than before. Meanwhile, the computation speed of the whole CAS-ESM improves by 217.53%. In addition, the results indicate that a similar speedup is achieved on different numbers of CPU cores. Copyright © 2016 John Wiley & Sons, Ltd.

Received 24 August 2015; Revised 17 February 2016; Accepted 17 February 2016

KEY WORDS:

big data; data processing; load imbalance; optimization algorithm; high performance computing

1. INTRODUCTION

The Earth System Model (ESM) is a coupled climate model employed to simulate the Earth's climate states in different periods. Many countries have developed their own ESMs. One of the most famous is the Community Earth System Model (CESM), mainly developed by the National Center for Atmospheric Research (NCAR) of the United States [1]. In China, five ESMs have been used for the Fifth Assessment Report of the United Nations Intergovernmental Panel on Climate Change (IPCC AR5) [2], which indicates that China now has solid capabilities in developing ESMs. The Chinese Academy of Sciences-Earth System Model (CAS-ESM) has been developed by the Institute of Atmospheric Physics (IAP) of the Chinese Academy of Sciences (CAS) [3, 4]. The CAS-ESM system has achieved the nesting (or coupling) of the Institute of Atmospheric Physics of Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 (IAP AGCM4.0) and the Weather Research and Forecasting model (WRF).



Because of the huge computational cost [5] of general circulation models (GCMs), GCMs are usually used to simulate the global climate at coarse spatial resolution. Global climate models with coarse resolution cannot simulate regional spatial scales well, particularly topography and eddy processes, and have some difficulty in parametrizing subgrid-scale processes [6]. Regional climate models (RCMs) with high resolution can resolve regional variations in orography and land surface characteristics more accurately [7]. Many studies have used downscaling of global model results to bridge the gap in scale between global and regional climate information [8, 9]; that is, GCMs provide initial and lateral boundary conditions to RCMs [10]. During the nesting of GCMs and RCMs, some GCMs provide offline initial and lateral boundary conditions to drive RCMs, while others provide them online. Through the coupler, the nesting of the IAP AGCM4.0 and WRF in the CAS-ESM is online.

In recent years, big data has been a focus of research in science, technology, economics and social studies [11–17]. Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Big data can be characterized by the 3Vs: the extreme Volume of data, the wide Variety of types of data and the Velocity at which the data must be processed [18, 19]. With the arrival of the big data era [20], it is quite meaningful to research the processing of massive data and data-intensive computing for climate models. In the study of Johnsen [21], the WRF with a horizontal resolution of 500 m uses a grid of size 9120 × 9216 × 48 (1.4 TBytes of input), and 86 GBytes of forecast data are written every 6 forecast hours. With the development of higher-resolution climate models, massive amounts of data need to be processed and computed. In the CAS-ESM, the METGRID (meteorological grid) data module needs to perform horizontal interpolation of massive data. To improve the data computation speed, the METGRID module is executed in parallel using the message passing interface (MPI). MPI differs from MapReduce [22] in how it improves the processing speed of massive data: the MapReduce programming model is often used to process large volumes of data in parallel by dividing a Job into a set of independent Tasks [23]. However, both approaches work by dividing a Job into many small Tasks. Therefore, the idea of big data is also embodied in the CAS-ESM.

On climate models, Zhang et al. [24] evaluated the climate simulation performance of the IAP AGCM4.0. He et al. [25] employed one-way coupling of the CAM4 (Community Atmosphere Model version 4) and WRF to simulate a cyclogenesis event over America. Michalakes et al. [26], Shainer et al. [27] and Malakar et al. [28] analyzed the WRF's computation and parallel performance on high performance computers with different architectures and presented specific recommendations for improving its performance, scalability and productivity. However, there is not yet any research on the load imbalance and parallel performance of the nesting of the IAP AGCM4.0 and WRF.
While assessing the feasibility and capability of the CAS-ESM in simulating a cyclogenesis event over the Southern Great Plains of the United States (SGP) in March 2000, it was found that the METGRID module of the WRF in the CAS-ESM suffers from a load imbalance problem. Solving the METGRID load imbalance and improving its processing speed for massive data is crucial for improving the real-time computation performance of the CAS-ESM. Therefore, the objective of this study is to analyze the cause of the load imbalance and then propose an effective load balancing algorithm. By reusing the traversal record built in the iterations of the function search_extrap, the algorithm solves the load imbalance problem. The experimental results indicate that the algorithm improves not only the computing performance of the METGRID module but also the computing speed of the whole CAS-ESM system. Beyond that, the performance of the CAS-ESM on many CPU cores is evaluated and the parallel performance of the METGRID and REAL modules is investigated.

The rest of this study is organized as follows. Section 2 introduces the integrated CAS-ESM modeling system. Section 3 describes the load imbalance problem of the METGRID module in detail. Section 4 designs and implements a parallel optimization algorithm to solve the problem described in Section 3. In Section 5, the heavy precipitation part of the cyclogenesis event is simulated by employing the CAS-ESM and the performance of the algorithm is evaluated. The last section contains a summary.


2. MODELING SYSTEM

The CAS-ESM is designed and developed based on the CESM version 1.0. The CAS-ESM is composed of six separate component models and one central coupler component, and it can be employed to conduct fundamental research on the Earth's climate states. In the CAS-ESM system, the atmosphere component model is the IAP AGCM4.0, developed by the IAP; the ocean component model is the LASG/IAP Climate System Ocean Model (LICOM) version 2.0, developed by the State Key Laboratory of Numerical Modeling for Atmospheric Sciences and Geophysical Fluid Dynamics (LASG) of the IAP; the land component model is the Common Land Model (CoLM), developed by Beijing Normal University; the sea-ice component model is the CICE version 4; the land-ice component model is the GLC; and the atmospheric chemical component model is the Global Environmental Atmospheric Transport Model (GEATM), developed by the IAP. Besides the six separate model components, the Advanced Research WRF version 3.2 is integrated into the CAS-ESM modeling system. Here, the WRF is considered a part of the IAP AGCM4.0 source code. Meanwhile, the four computation modules of the WRF (GEOGRID, METGRID, REAL and INTEGRATION) are integrated together so that the IAP AGCM4.0 can drive the WRF online. That means that the IAP AGCM4.0 provides the initial conditions, lateral boundary conditions, surface temperature and soil moisture online to the WRF. The data exchange between the IAP AGCM4.0 and WRF is achieved through the CAS-ESM Coupler version 7 (CPL7). The model structure of the CAS-ESM system is presented in Figure 1.

Figure 2 illustrates the flow chart of the time integration of the WRF in the CAS-ESM. The coupling time interval between the IAP AGCM4.0 and CPL7 is atm_cpl_dt, and the interval between the WRF and CPL7 is wrf_cpl_dt. The wrf_cpl_dt can be set to the IAP AGCM4.0 time step or to an integral multiple of the IAP AGCM4.0 and WRF time steps. When the CPL7 sends the data to the WRF at each data-exchange time step, the lateral boundary data set and other data information of the WRF are updated [25].
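To make the timing relationship concrete, the following minimal Python sketch mimics the coupling loop described above. It only illustrates the constraint that wrf_cpl_dt must be an integral multiple of both component time steps; the function and variable names are assumptions made for this sketch and do not come from the CAS-ESM Fortran code.

```python
def agcm_step(state, dt):
    # placeholder for one IAP AGCM4.0 time step
    state["t_agcm"] += dt
    return state

def wrf_step(state, dt):
    # placeholder for one WRF time step
    state["t_wrf"] += dt
    return state

def cpl7_exchange(state):
    # placeholder for CPL7 passing AGCM fields (initial/lateral boundary data,
    # surface temperature, soil moisture) to the WRF
    state["exchanges"] += 1
    return state

def run_coupled(sim_seconds, agcm_dt=200, wrf_dt=100, wrf_cpl_dt=200):
    # wrf_cpl_dt must be an integral multiple of both component time steps
    assert wrf_cpl_dt % agcm_dt == 0 and wrf_cpl_dt % wrf_dt == 0
    state = {"t_agcm": 0, "t_wrf": 0, "exchanges": 0}
    for _ in range(sim_seconds // wrf_cpl_dt):
        for _ in range(wrf_cpl_dt // agcm_dt):
            state = agcm_step(state, agcm_dt)
        for _ in range(wrf_cpl_dt // wrf_dt):
            state = wrf_step(state, wrf_dt)
        state = cpl7_exchange(state)   # data exchange through the coupler
    return state

print(run_coupled(5 * 24 * 3600))  # a 5-day run with wrf_cpl_dt = 200 s gives 2160 exchanges
```

With the time steps used in Section 5 (wrf_cpl_dt = 200 s, WRF time step 100 s, AGCM time step 200 s), a 5-day run performs 2160 coupling exchanges, which appears consistent with the 2161 METGRID calls counted in Section 3.2 (presumably 2160 exchanges plus initialization).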

3. PROBLEM DESCRIPTION AND ANALYSIS

3.1. METGRID module

The METGRID program (or computation module) is mainly used to interpolate intermediate-format meteorological data onto the computation grid. The GEOGRID (geoscientific grid) program defines the simulation domains. The METGRID.TBL file controls the interpolation method used for every meteorological field. Average_4pt, wt_average_4pt, search_extrap and other interpolation methods are often used.

Figure 1. Model structure of the CAS-ESM.


Figure 2. Flow chart of time integration of the WRF in the CAS-ESM.

Among these interpolation methods, the search_extrap method implements breadth-first search interpolation. In climate models, the computation grid consists of grid cells, each denoting a physical region of the Earth. When a parent grid cell with low resolution is divided into several child (sub-)grid cells with high resolution, land or sea information on the child grid cells may be missing or masked. For example, a vertex on a low-resolution parent grid may represent a data point with land information, while the same vertex on a high-resolution child grid might represent a data point with sea information. Therefore, the vertex can obtain the sea data information by using an interpolation method. In the METGRID module, the breadth-first search interpolation method is used to find the nearest valid (not missing or masked) vertex to an invalid point (x, y); the value of that valid vertex is then assigned to the point (x, y) [29].
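As an illustration of breadth-first search interpolation, the short Python sketch below finds the nearest valid vertex of an invalid point on a small masked grid. It is a simplified sketch of the idea only; the actual search_extrap routine is Fortran and, as Table I shows, also handles missing field values and a second pass over the search queue.

```python
from collections import deque

def nearest_valid(mask, x, y):
    """Breadth-first search outward from (x, y) for the nearest vertex whose
    mask value marks it as valid (True). Returns its coordinates, or None if
    no valid vertex exists."""
    nx, ny = len(mask), len(mask[0])
    queue, seen = deque([(x, y)]), {(x, y)}
    while queue:
        i, j = queue.popleft()
        if mask[i][j]:                      # first valid vertex dequeued is a nearest one
            return (i, j)
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < nx and 0 <= nj < ny and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return None

# A toy land/sea mask: vertex (0, 0) is invalid; BFS finds a nearest valid vertex.
mask = [[False, False, True],
        [False, True,  True],
        [True,  True,  True]]
print(nearest_valid(mask, 0, 0))            # -> (2, 0), one of the vertices at minimum distance
```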

3.2. Load imbalance of METGRID

The integrated CAS-ESM system is employed to simulate the climate event described in Section 5. A five-day simulation of the CAS-ESM, from 1 March 2000 to 6 March 2000, is carried out on 64 CPU cores. Figure 3 shows the METGRID running time of all 64 MPI ranks. According to the result, the METGRID running time is quite unbalanced. The domain decomposition of the WRF in the CAS-ESM is 8 × 8. From MPI rank (or process ID) 0 to 7, the running time of the METGRID decreases gradually; however, from MPI rank 24 to 31, the time first decreases gradually and then increases progressively. In any case, the METGRID load imbalance clearly exists. The METGRID module accounts for about 80% of the whole WRF running time, so it is critical to tackle the load imbalance problem.


Figure 3. The METGRID running time of all the 64 MPI ranks.

Through testing, it is found that the load imbalance of the METGRID is related to the function search_extrap, which is called by the METGRID many times. In the five-day numerical simulation, the METGRID module is called 2161 times. Figure 4 shows the total time spent executing search_extrap over these 2161 METGRID calls. According to Figure 4, the search_extrap running time of each process, which accounts for a significant portion of the METGRID running time, is also quite unbalanced, and its trend from MPI rank 0 to 63 is consistent with that of the METGRID running time. Therefore, the conclusion is that the load imbalance of the per-process search_extrap running time directly causes the load imbalance of the METGRID.

3.3. Function search_extrap

The function search_extrap is used to find the nearest valid neighbor of an invalid vertex, that is, a missing or masked source data point. Table I illustrates the pseudocode of search_extrap. To further investigate the cause of the load imbalance in search_extrap, the total number of times that search_extrap is invoked by the METGRID is counted. As shown in Table I, the function search_extrap includes two for-loops.
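The per-rank timings and counts used in this section require instrumenting the code. As a rough illustration of the kind of bookkeeping involved, the Python sketch below wraps a routine with a decorator that accumulates call counts and wall time, one record per MPI rank; it is an assumption-laden sketch, not the instrumentation actually added to the Fortran model.

```python
import time
from collections import defaultdict

stats = defaultdict(float)   # one such accumulator would exist per MPI rank

def profiled(fn):
    """Accumulate how often fn is called and how much wall time it uses."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        stats[fn.__name__ + "_calls"] += 1
        stats[fn.__name__ + "_seconds"] += time.perf_counter() - t0
        return result
    return wrapper

@profiled
def search_extrap_stub(xx, yy):
    # stand-in for the real routine; the real instrumentation would also
    # accumulate the number of first- and second-for-loop iterations
    return 0.0

for _ in range(1000):
    search_extrap_stub(1.0, 2.0)
print(dict(stats))   # e.g. {'search_extrap_stub_calls': 1000.0, 'search_extrap_stub_seconds': ...}
```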

Figure 4. The comparison of the running time of the METGRID and search_extrap.


Table I. Function search_extrap.

Algorithm 1 Search_extrap before the optimizing
Input:
  xx, yy, izz: coordinates of the interpolation point
  array: three-dimensional array representing the input field
  start_x, end_x: start and end values of the x-coordinate in the domain mesh
  start_y, end_y: start and end values of the y-coordinate in the domain mesh
  start_z, end_z: start and end values of the z-coordinate in the domain mesh
  msgval: a real number giving the value in the input field that is assumed to represent missing data
  maskval: a real number giving the value in the interpolation mask that is assumed to represent missing data
  mask_array: two-dimensional array representing the interpolation mask
Output:
  search_extrap: field value at the point nearest to (xx, yy) that has a non-missing value
 1: found_valid = false
 2: qdata%x = nint(xx), qdata%y = nint(yy), call q_insert(q, qdata)
 3: for q is not empty and found_valid is false do
 4:   assign the first element of the queue q to the variable qdata, then remove the first element of q
 5:   i = qdata%x, j = qdata%y
 6:   if present(mask_array) and present(maskval) do
 7:     if (array(i, j, izz) /= msgval and mask_array(i, j) /= maskval) found_valid = true
 8:   else do
 9:     if (array(i, j, izz) /= msgval) found_valid = true
10:   end if
11:   if (i-1 >= start_x or i+1 <= end_x or j-1 >= start_y or j+1 <= end_y) do
12:     qdata%x = i-1 or i+1 or i or i, qdata%y = j or j or j-1 or j+1, call q_insert(q, qdata)
13:   end if
14: end for
15: if found_valid is true do
16:   continue with the second for-loop to find the nearest point (x, y) to (xx, yy) from the queue q that has a non-missing value
17:   search_extrap = array(x, y)
18: else do
19:   search_extrap = msgval
20: end if
21: return search_extrap

Therefore, the total numbers of iterations of the first and the second for-loop are also counted, in order to find out which loop causes the load imbalance of search_extrap. The total number of times that search_extrap is invoked by the METGRID differs between processes, as illustrated in Figure 5, but no direct correlation between this count and the load imbalance is found. Figure 6 indicates that from MPI rank 0 to 63 there is a large disparity in the total number of loop iterations inside search_extrap, and that the trend of the total iteration count is consistent with the METGRID running time in Figure 3, which further supports the conclusion of Section 3.2. Meanwhile, Figure 6 shows that the first for-loop accounts for the main part of the total iteration count. Therefore, the best way to solve the load imbalance problem is to decrease the total number of first-for-loop iterations in search_extrap.

4. OPTIMIZING ALGORITHM

4.1. Algorithm principle

Although the values of the parameter array differ between processes, the values of the parameter mask_array are the same for every process. Testing shows that when the parameters mask_array and maskval are present, the judgment condition (if (mask_array(i, j) /= maskval)) in Table I has a great impact on the number of iterations of the first for-loop. Obviously, it is not necessary to execute the first for-loop, with its many iterations, every time search_extrap is called by the METGRID, because the repeated iterations take too much time. To reduce the number of iterations, the idea of the optimizing algorithm is that each process stores the traversal record of executing the first for-loop for all the vertexes before search_extrap is called for the first time. When search_extrap is called later, each process reads the traversal record instead of executing the first loop again. In this way, the total number of first-loop iterations in search_extrap can be decreased effectively, and in theory the load imbalance problem should be solved.
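A minimal Python sketch of this reuse idea follows. Instead of the eager precomputation described in Section 4.2, it uses a lazy cache (functools.lru_cache) to keep the result of the traversal for each starting vertex and mask value, so that repeated calls skip the first for-loop entirely. The grid, field and function names are illustrative assumptions, not the CAS-ESM code.

```python
from collections import deque
from functools import lru_cache

MASK = ((1, 1, 0),
        (1, 0, 0),
        (0, 0, 0))              # fixed interpolation mask: identical on every call

@lru_cache(maxsize=None)        # the cache plays the role of the stored traversal record
def nearest_valid(i0, j0, maskval):
    """Breadth-first search from (i0, j0); a vertex is valid if MASK[i][j] != maskval."""
    nx, ny = len(MASK), len(MASK[0])
    queue, seen = deque([(i0, j0)]), {(i0, j0)}
    while queue:
        i, j = queue.popleft()
        if MASK[i][j] != maskval:
            return (i, j)
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < nx and 0 <= nj < ny and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return None

def search_extrap_cached(field, xx, yy, maskval):
    ij = nearest_valid(round(xx), round(yy), maskval)   # later calls hit the cache
    return None if ij is None else field[ij[0]][ij[1]]

field = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(search_extrap_cached(field, 2.0, 2.0, 0))   # traversal executed once -> 2.0
print(search_extrap_cached(field, 2.0, 2.0, 0))   # cached record reused -> 2.0
```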


Figure 5. The total number of times that the search_extrap is called by each MPI rank.

4.2. Algorithm implementation

Although the values of the mask_array are the same for each process, the maskval may differ. However, the value of the maskval is either 0 or 1, so each process needs two traversal records. Based on the algorithm idea above, a new subroutine named search_match_mask and six common variables, namely the arrays match_mask_array0, match_mask_array1, match_mask_queue0, match_mask_queue1, match_mask_found_valid0 and match_mask_found_valid1, need to be created. The subroutine search_match_mask is used to find the nearest valid neighbor of every vertex in the grid and then store the traversal record.

Figure 6. Statistics for the number of loop iterations inside search_extrap. The total iteration count is the sum of the first-loop and second-loop iteration counts for each MPI rank.


The variable match_mask_array0 stores the coordinates of the nearest valid neighbor of every vertex when the value of the maskval is 0; likewise, match_mask_array1 stores them when the maskval is 1. The variable match_mask_queue0 stores the queue q built while searching for the nearest valid neighbor of every vertex when the maskval is 0, and match_mask_queue1 stores it when the maskval is 1; this queue can be reused by the second for-loop in the function search_extrap. The variable match_mask_found_valid0 records whether a valid neighbor of a vertex was found when the maskval is 0, and match_mask_found_valid1 records the same for a maskval of 1. The detailed implementation of the optimizing algorithm is shown in Table II. Before search_extrap is called by each process for the first time, each process calls the subroutine search_match_mask and assigns the traversal records to the arrays match_mask_array0, match_mask_array1, match_mask_queue0, match_mask_queue1, match_mask_found_valid0 and match_mask_found_valid1. In addition, the corresponding code in the function search_extrap is modified, as shown in Table III. After this optimization, the total number of loop iterations in search_extrap is reduced greatly, and the total running time of search_extrap should decrease drastically, too.
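The following Python sketch mirrors the structure of this implementation under simplifying assumptions: for both possible maskval values it builds a per-vertex record of the nearest valid neighbor and a found flag, and the optimized lookup then reads these records instead of traversing. It omits storing the search queues (match_mask_queue0/1) used by the second for-loop and is only an illustration of the scheme, not the CAS-ESM Fortran.

```python
from collections import deque

def search_match_mask(mask_array, maskval):
    """Precompute, for every vertex, the nearest valid neighbor (mask != maskval)
    and whether one was found, as in Algorithm 2 (queues omitted for brevity)."""
    nx, ny = len(mask_array), len(mask_array[0])
    match_mask_array = [[None] * ny for _ in range(nx)]
    match_mask_found_valid = [[False] * ny for _ in range(nx)]
    for im in range(nx):
        for jm in range(ny):
            queue, seen = deque([(im, jm)]), {(im, jm)}
            while queue:
                i, j = queue.popleft()
                if mask_array[i][j] != maskval:            # valid vertex found
                    match_mask_array[im][jm] = (i, j)
                    match_mask_found_valid[im][jm] = True
                    break
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= ni < nx and 0 <= nj < ny and (ni, nj) not in seen:
                        seen.add((ni, nj))
                        queue.append((ni, nj))
    return match_mask_array, match_mask_found_valid

# Each process builds its two traversal records once, before the first call.
mask_array = [[0, 0, 1],
              [0, 1, 1],
              [1, 1, 1]]
records = {k: search_match_mask(mask_array, k) for k in (0, 1)}

def search_extrap_lookup(field, xx, yy, maskval, msgval=-1e30):
    """Optimized lookup path of Algorithm 3: read the stored record instead of searching."""
    match_mask_array, match_mask_found_valid = records[maskval]
    im, jm = round(xx), round(yy)                 # round() stands in for Fortran's nint()
    if not match_mask_found_valid[im][jm]:
        return msgval
    i, j = match_mask_array[im][jm]
    # the real Algorithm 3 falls back to the original search when the field value
    # is missing; this sketch simply returns msgval in that case
    return field[i][j] if field[i][j] != msgval else msgval

field = [[-1e30, -1e30, 3.0], [-1e30, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(search_extrap_lookup(field, 0.0, 0.0, maskval=0))   # -> 7.0 (one of the nearest valid vertices)
```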

5. RESULTS AND DISCUSSION

5.1. Case description and experimental setup

In this study, the CAS-ESM system is used to simulate a climate event, a cyclone that occurred in March 2000. Xie et al. [30] also studied the same case for cloud simulations. In this case, the simulation period is from 00 universal coordinated time (UTC) 1 March 2000 to 00 UTC 6 March 2000. The time interval of data exchange between the WRF and CPL7 is 200 s. The time step of the WRF is 100 s, and that of the IAP AGCM4.0 is 200 s. For this case, the data ocean model, the prescribed sea-ice model, the active land model CLM and the atmospheric model IAP AGCM4.0 in the CAS-ESM are employed. Therefore, although the CAS-ESM is composed of six separate models, only the online nesting of the IAP AGCM4.0 and WRF is evaluated in this study.

The IAP AGCM4.0 uses the CAM3.1 physics package, while its dynamical core is developed independently by the IAP. The IAP AGCM4.0 uses a finite-difference scheme with a horizontal resolution of 1.4° latitude by 1.4° longitude and 26 levels in the vertical direction. The initial and lateral boundary conditions of the IAP AGCM4.0 are the same as those in the article of He et al. [25]. The integration domain of the WRF, which covers North America, has 401 grid points in the East-West direction and 281 grid points in the North-South direction, with the center at 40.25°N, 95.25°W. In the WRF, the grid spacing is 30 km, and 31 sigma levels with the model top at 50 hPa are used in the vertical direction. The WRF physics use the Rapid Radiative Transfer Model (RRTM) long-wave radiation scheme, the Dudhia short-wave radiation scheme, the Noah land surface scheme, the Kain–Fritsch cumulus scheme, the Lin microphysics scheme, the Yonsei University planetary boundary layer scheme and the Monin–Obukhov surface layer scheme.

The experiment platform for the 5-day simulation is the Sugon TC4600H blade cluster in the Computer Network Information Center of the Chinese Academy of Sciences, which has 270 compute nodes. Each compute node has 20 CPU cores. The CPU is the Intel Xeon E5-2680 v2 processor, running at 2.8 GHz, with a theoretical peak of 224 GFlops. In each compute node, the 20 CPU cores share 64 GB of DDR3 system memory through the QuickPath Interconnect. The Intel C/Fortran compiler version 13.1.3 is used as the basic compiler in the tests. For MPI communication routines, the Intel MPI 4.1.3 implementation, binding with the Intel compiler, is employed. The number of cores used in the following experiments is a power of two, so 16 cores are launched within one node. In the CAS-ESM, the default decomposition strategy of the IAP AGCM4.0 is to make the number Py of processes along the longitudinal circles as large as possible, whereas the WRF chooses a decomposition as close to square as possible by default.
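For illustration, the default square-like decomposition can be mimicked with a few lines of Python. This is only a sketch of the factor-selection idea (the real logic lives in the WRF's Fortran domain-decomposition code), but it reproduces the default nproc_x × nproc_y values listed in Table IV.

```python
from math import isqrt

def near_square_decomposition(nprocs):
    """Pick (nproc_x, nproc_y) with nproc_x * nproc_y == nprocs and the
    factors as close to a square as possible."""
    best = (1, nprocs)
    for nproc_x in range(1, isqrt(nprocs) + 1):
        if nprocs % nproc_x == 0:
            best = (nproc_x, nprocs // nproc_x)
    return best

for cores in (16, 32, 64, 128, 256):
    print(cores, near_square_decomposition(cores))
# 16 (4, 4), 32 (4, 8), 64 (8, 8), 128 (8, 16), 256 (16, 16) -- the defaults in Table IV
```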


Table II. Subroutine search_match_mask.

Algorithm 2 Search_match_mask
Input:
  match_mask_array_tmp0: two-dimensional array storing the coordinates of the nearest valid neighbor
  match_mask_array_tmp1: two-dimensional array storing the coordinates of the nearest valid neighbor
  match_mask_queue_tmp0: two-dimensional array storing the corresponding valid queue
  match_mask_queue_tmp1: two-dimensional array storing the corresponding valid queue
  match_mask_found_valid_tmp0: two-dimensional logical array
  match_mask_found_valid_tmp1: two-dimensional logical array
  start_x, end_x: start and end values of the x-coordinate in the domain mesh
  start_y, end_y: start and end values of the y-coordinate in the domain mesh
  mask_array: two-dimensional array representing the interpolation mask
Output:
  match_mask_array_tmp0, match_mask_array_tmp1, match_mask_queue_tmp0, match_mask_queue_tmp1, match_mask_found_valid_tmp0, match_mask_found_valid_tmp1
 1: for k = 0, 1 do
 2:   for im = start_x, start_x + 1, start_x + 2, ..., end_x do
 3:     for jm = start_y, start_y + 1, start_y + 2, ..., end_y do
 4:       if (k == 0) do
 5:         found_valid = false
 6:         qdata%x = im, qdata%y = jm, call q_insert(match_mask_queue_tmp0(im, jm), qdata)
 7:         for match_mask_queue_tmp0(im, jm) is not empty and found_valid is false do
 8:           assign the first element of the queue match_mask_queue_tmp0(im, jm) to the variable qdata, then remove the first element of the queue
 9:           i = qdata%x, j = qdata%y
10:           if (mask_array(i, j) /= k) do
11:             found_valid = true
12:             match_mask_array_tmp0(im, jm)%x = i
13:             match_mask_array_tmp0(im, jm)%y = j
14:           end if
15:           if (i-1 >= start_x or i+1 <= end_x or j-1 >= start_y or j+1 <= end_y) do
16:             qdata%x = i-1 or i+1 or i or i, qdata%y = j or j or j-1 or j+1
17:             call q_insert(match_mask_queue_tmp0(im, jm), qdata)
18:           end if
19:         end for
20:         match_mask_found_valid_tmp0(im, jm) = found_valid
21:       else if (k == 1) do
22:         change match_mask_queue_tmp0 into match_mask_queue_tmp1, change match_mask_array_tmp0 into match_mask_array_tmp1, and change match_mask_found_valid_tmp0 into match_mask_found_valid_tmp1, then do the same operation as with k == 0
23:       end if
24:     end for
25:   end for
26: end for

5.2. Simulation and verification

In the 5-day simulation of the CAS-ESM, 64 CPU cores are used first. Figure 7 illustrates the METGRID running time of all 64 MPI ranks before and after the optimization. Compared with the result before the optimizing, the METGRID running time of each MPI rank is not only reasonably load-balanced but also greatly reduced. Because the total number of times that search_extrap is invoked by the METGRID differs between MPI ranks, as illustrated in Figure 5, it is nearly impossible to achieve perfect load balance; after the optimizing, however, the algorithm has reduced the load imbalance of the METGRID to a large extent. In addition, the computing speed of the METGRID for MPI rank 0 after the optimizing is about 11 times faster than before. If a real climate case is simulated over a longer period, employing the optimizing algorithm is quite beneficial, because it helps the CAS-ESM system meet the real-time demands of simulating the global or regional climate. All in all, the experimental result indicates that the optimizing algorithm effectively solves the load imbalance of the METGRID.


Table III. Function search_extrap after the optimizing.

Algorithm 3 Search_extrap after the optimizing
Input:
  xx, yy, izz: coordinates of the interpolation point
  array: three-dimensional array representing the input field
  start_x, end_x: start and end values of the x-coordinate in the domain mesh
  start_y, end_y: start and end values of the y-coordinate in the domain mesh
  start_z, end_z: start and end values of the z-coordinate in the domain mesh
  msgval: a real number giving the value in the input field that is assumed to represent missing data
  maskval: a real number giving the value in the interpolation mask that is assumed to represent missing data
  mask_array: two-dimensional array representing the interpolation mask
Output:
  search_extrap: field value at the point nearest to (xx, yy) that has a non-missing value
 1: found_valid = false
 2: im = nint(xx), jm = nint(yy)
 3: if present(mask_array) and present(maskval) do
 4:   if (maskval == 0) do
 5:     if match_mask_found_valid0(im, jm) is true do
 6:       i = match_mask_array0(im, jm)%x
 7:       j = match_mask_array0(im, jm)%y
 8:       if (array(i, j, izz) /= msgval) do
 9:         found_valid = true
10:         call q_copy(match_mask_queue0, q_temp)
11:         execute the second loop of the original function search_extrap with q replaced by q_temp
12:         stop the function search_extrap
13:       else do
14:         execute the code of the original function search_extrap
15:       end if
16:     else do
17:       search_extrap = msgval
18:       stop the function search_extrap
19:     end if
20:   else if (maskval == 1) do
21:     change match_mask_queue0 into match_mask_queue1, change match_mask_array0 into match_mask_array1, and change match_mask_found_valid0 into match_mask_found_valid1, then do the same operation as with maskval == 0
22:   end if
23: end if
24: return search_extrap

Figure 7. The comparison of the METGRID running time for each MPI rank before and after the optimizing.


Table IV. The comparison of the total running time (s) with the default decomposition strategy.

Nodes (cores)   nproc_x × nproc_y   Item             Before optimizing   After optimizing   Improvement
1 (16)          4 × 4               METGRID + REAL   23 156.42           3699.28            525.97%
1 (16)          4 × 4               CAS-ESM          30 808.49           11 347.18          171.51%
2 (32)          4 × 8               METGRID + REAL   13 017.50           2267.86            474.00%
2 (32)          4 × 8               CAS-ESM          17 144.23           6323.61            171.11%
4 (64)          8 × 8               METGRID + REAL   8787.42             1071.27            720.28%
4 (64)          8 × 8               CAS-ESM          11 299.11           3558.46            217.53%
8 (128)         8 × 16              METGRID + REAL   5430.28             823.37             559.52%
8 (128)         8 × 16              CAS-ESM          6945.30             2416.52            187.41%
16 (256)        16 × 16             METGRID + REAL   3958.39             502.78             687.30%
16 (256)        16 × 16             CAS-ESM          5228.10             1777.26            194.17%

Because there is an MPI_Barrier operation before the WRF integration, the total running time of the METGRID and REAL modules is the same for every MPI rank. Table IV compares the total running time of the CAS-ESM, as well as that of the METGRID and REAL modules, before and after the optimizing. Here, the WRF uses the default decomposition strategy. From Table IV, the study draws the following conclusions.
1. The running time of the METGRID and REAL modules before the optimizing takes up over 70% of the total running time of the CAS-ESM, which further explains the importance of optimizing the METGRID.
2. When running the CAS-ESM on 64 CPU cores, the total computing speed of the METGRID and REAL modules after the optimizing is about 7.2 times faster than before. Moreover, the total computing speed of the CAS-ESM after the optimizing improves by 217.53%.
3. When the CAS-ESM is run on different numbers of CPU cores, the METGRID and REAL modules after the optimizing achieve a similar speedup.
4. The performance improvement peaks on 64 cores. For a fixed number of cores, MPI rank 0 has the largest total number of loop iterations in search_extrap of all the MPI ranks, as shown in Figure 6, so MPI rank 0 has a major impact on the running time of the METGRID and REAL modules. The number of times that search_extrap is invoked by the METGRID, the total number of loop iterations and the number of first-for-loop iterations in search_extrap for MPI rank 0 before the optimizing are shown in Table V. From 16 cores to 256 cores, each doubling of the core count roughly halves the number of times that search_extrap is called by MPI rank 0, but the number of first-for-loop iterations decreases only slightly from 32 cores to 64 cores. This is because MPI rank 0 on 64 cores needs more loop iterations to finish the breadth-first search interpolation for its invalid grid vertexes: when the number of cores changes, the grid subdomain processed by MPI rank 0 also changes, and the number of first-for-loop iterations of MPI rank 0 depends on its own subdomain. Similarly, the running time of the METGRID and REAL modules decreases only a little from 32 cores to 64 cores, which means that the load imbalance of the METGRID on 64 cores is worse. In this situation, the optimizing algorithm can make better use of its advantage, so the performance improvement peaks on 64 cores.

Table V. Details of the number of loop iterations inside the function search_extrap for MPI rank 0 before the optimizing. The all_search_number is the total number of times that search_extrap is called by MPI rank 0; the all_loops_number is the total number of loop iterations for MPI rank 0; the all_loop1_number is the total number of first-for-loop iterations for MPI rank 0.

Nodes (cores)   nproc_x × nproc_y   all_search_number   all_loops_number      all_loop1_number
1 (16)          4 × 4               129 928 320         222 729 333 120       208 033 246 080
2 (32)          4 × 8               67 633 920          126 629 464 320       118 593 918 720
4 (64)          8 × 8               34 801 920          85 919 166 720        81 085 812 480
8 (128)         8 × 16              19 232 640          49 067 786 880        46 351 785 600
16 (256)        16 × 16             10 160 640          29 447 228 160        27 912 695 040
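For instance, Table V shows that going from 32 to 64 cores roughly halves the number of search_extrap calls by MPI rank 0 (67 633 920 to 34 801 920), whereas the corresponding first-for-loop iteration count falls only from 118 593 918 720 to 81 085 812 480, a factor of about 1.46 rather than 2; this is the slight decrease referred to in conclusion 4 above.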


Figure 8. The precipitation (mm) at 00 UTC 5 March. (a) Before the optimizing, (b) after the optimizing. The black dot is the center of the SGP site.

In the algorithm proposed above, six common arrays (match_mask_array0, match_mask_array1, match_mask_queue0, match_mask_queue1, match_mask_found_valid0 and match_mask_found_valid1) need to be created. Whichever case the CAS-ESM system simulates, the size of each array is 128 × 261, so the six arrays need about 3 MB of memory. Today, each node of a cluster usually has 64 GB of shared memory or more, so the memory capacity generally does not affect the performance of the optimizing algorithm. The ratio of the running time of the METGRID and REAL modules before and after the optimizing can be written as

R = (T1 + T2) / (T1′ + T2′),    T1 + T2 > T1′ + T2′,

where R is the ratio, T1 is the computation time before the optimizing, T2 is the communication time before the optimizing, T1′ is the computation time after the optimizing, T2′ is the communication time after the optimizing, and T2 ≈ T2′. If the cluster provides more network bandwidth, T2 and T2′ will be smaller; if it provides less, they will be bigger. Therefore, the network bandwidth has some impact on the performance of the algorithm. However, the communication time is far less than the computation time and T1 > T1′, so the algorithm will still be effective if the bandwidth is limited.
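As a worked example with the 64-core default decomposition in Table IV, T1 + T2 = 8787.42 s and T1′ + T2′ = 1071.27 s for the METGRID and REAL modules, so R = 8787.42 / 1071.27 ≈ 8.20; the improvement reported in Table IV is (R − 1) × 100% = 720.28%, which corresponds to the "about 7.2 times faster" figure quoted in Section 5.2.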


Table VI. The total running time (s) of the METGRID and REAL modules with different decomposition strategies.

Nodes (cores)   nproc_x × nproc_y   Before optimizing   After optimizing   Improvement
1 (16)          1 × 16              11 028.37           2977.17            270.43%
1 (16)          2 × 8               16 004.03           2917.19            448.61%
1 (16)          4 × 4               23 156.42           3699.28            525.97%
1 (16)          8 × 2               27 079.23           3411.10            693.86%
1 (16)          16 × 1              25 392.38           3351.06            657.74%
2 (32)          1 × 32              7423.74             2109.14            251.98%
2 (32)          2 × 16              9426.17             1779.15            429.81%
2 (32)          4 × 8               13 017.50           2267.86            474.00%
2 (32)          8 × 4               15 675.03           1830.24            756.45%
2 (32)          16 × 2              16 568.66           1956.59            746.81%
2 (32)          32 × 1              14 095.77           2125.12            563.29%
4 (64)          1 × 64              5335.06             1390.61            283.65%
4 (64)          2 × 32              6412.97             1118.96            473.12%
4 (64)          4 × 16              7954.16             1260.51            531.03%
4 (64)          8 × 8               8787.42             1071.27            720.28%
4 (64)          16 × 4              9492.85             1087.39            772.99%
4 (64)          32 × 2              10 394.93           1306.81            695.44%
4 (64)          64 × 1              10 638.50           1486.07            615.88%
8 (128)         1 × 128             4461.55             982.19             354.25%
8 (128)         2 × 64              4897.27             919.94             432.35%
8 (128)         4 × 32              5306.85             742.95             614.29%
8 (128)         8 × 16              5430.28             823.37             559.52%
8 (128)         16 × 8              5550.10             908.57             510.86%
8 (128)         32 × 4              6252.21             736.87             748.48%
8 (128)         64 × 2              7896.93             929.62             749.48%
8 (128)         128 × 1             9450.49             1145.19            725.23%
16 (256)        1 × 256             3773.67             877.78             329.91%
16 (256)        2 × 128             4323.52             838.26             415.77%
16 (256)        4 × 64              4065.85             560.55             625.33%
16 (256)        8 × 32              3719.61             511.15             627.69%
16 (256)        16 × 16             3958.39             502.78             687.30%
16 (256)        32 × 8              3637.19             673.25             440.24%
16 (256)        64 × 4              4685.10             579.57             708.38%
16 (256)        128 × 2             6416.46             795.95             706.14%
16 (256)        256 × 1             5508.47             870.10             533.08%

To verify the METGRID code after the optimizing, the simulation results of the strong rainfall over the SGP site at 00 UTC 5 March in the cyclogenesis event are investigated. The simulated precipitation before and after the optimizing in Figures 8(a) and 8(b) is identical, and the CAS-ESM outputs before and after the optimizing are also identical. Therefore, the METGRID code after the optimizing is correct.

5.3. Different decomposition strategies

Because the METGRID is a module of the WRF, different process mapping strategies of the WRF should have an impact on the computing performance of the optimizing algorithm. In the WRF, nproc_x is the number of processes along the latitudinal circles and nproc_y is the number along the longitudinal circles. From Table VI and Table VII, the study draws the following conclusions.
1. When the WRF employs the default decomposition strategy, the CAS-ESM does not have the best computing performance before the optimizing. If nproc_x << nproc_y (rectangular decomposition), the CAS-ESM can achieve better performance, because the WRF, which takes up most of the total running time of the CAS-ESM, performs better in this case [21]. Similarly, the METGRID and REAL modules before the optimizing perform better if nproc_x << nproc_y.


Table VII. The total running time (s) of the CAS-ESM with different decomposition strategies.

Nodes (cores)   nproc_x × nproc_y   Before optimizing   After optimizing   Improvement
1 (16)          1 × 16              18 847.82           10 563.45          78.42%
1 (16)          2 × 8               23 626.07           10 502.46          124.96%
1 (16)          4 × 4               30 808.49           11 347.18          171.51%
1 (16)          8 × 2               35 141.60           11 522.22          204.99%
1 (16)          16 × 1              34 873.85           12 864.53          171.09%
2 (32)          1 × 32              11 913.04           6604.23            80.38%
2 (32)          2 × 16              13 612.97           5980.63            127.62%
2 (32)          4 × 8               17 144.23           6323.61            171.11%
2 (32)          8 × 4               19 894.99           6063.07            228.13%
2 (32)          16 × 2              21 391.59           6710.99            218.75%
2 (32)          32 × 1              19 949.78           8125.42            145.52%
4 (64)          1 × 64              8307.70             4396.26            88.97%
4 (64)          2 × 32              8966.27             3774.88            137.52%
4 (64)          4 × 16              10 335.68           3668.89            181.71%
4 (64)          8 × 8               11 299.11           3558.46            217.53%
4 (64)          16 × 4              12 240.83           3706.83            230.22%
4 (64)          32 × 2              13 526.52           4674.39            189.38%
4 (64)          64 × 1              14 955.05           5581.71            167.93%
8 (128)         1 × 128             6753.05             3292.40            105.11%
8 (128)         2 × 64              6753.58             2732.54            147.15%
8 (128)         4 × 32              6911.03             2355.80            193.36%
8 (128)         8 × 16              6945.30             2416.52            187.41%
8 (128)         16 × 8              7141.19             2518.15            183.59%
8 (128)         32 × 4              8085.50             2567.56            214.91%
8 (128)         64 × 2              10 195.05           3179.60            220.64%
8 (128)         128 × 1             13 096.06           4538.17            188.58%
16 (256)        1 × 256             5857.98             3039.96            92.70%
16 (256)        2 × 128             5897.37             2404.97            145.22%
16 (256)        4 × 64              5398.35             1888.12            185.91%
16 (256)        8 × 32              4943.03             1723.53            186.80%
16 (256)        16 × 16             5228.10             1777.26            194.17%
16 (256)        32 × 8              4921.72             1957.01            151.49%
16 (256)        64 × 4              6182.98             2115.87            192.22%
16 (256)        128 × 2             8409.78             2761.15            204.58%
16 (256)        256 × 1             8697.59             4037.78            115.41%

2. On the whole, the METGRID and REAL modules after the optimizing still perform better if the WRF employs the rectangular decomposition. However, the METGRID and REAL modules after the optimizing show a higher performance improvement rate if nproc_x > nproc_y, and the CAS-ESM after the optimizing shows the corresponding trend. This is because, when nproc_x > nproc_y, the total running time of the CAS-ESM or of the METGRID and REAL modules grows faster before the optimizing than after it.
3. When the process mapping strategy is 16 × 4 on 64 CPU cores, the total computing speed of the METGRID and REAL modules after the optimizing is about 7.73 times faster than before. Moreover, the total computing speed of the CAS-ESM system after the optimizing improves by 230.22%.
In a word, the results in Table VI and Table VII indicate that the optimizing algorithm still works if the WRF employs other process mapping strategies, and with some process mapping strategies it can even show better computing performance.

5.4. Parallel analysis

Next, the simulation results for the CAS-ESM with the default decomposition strategy on 16, 32, 64, 128 and 256 CPU cores are analyzed to test the scalability of the system. Table IV shows that the running time of the METGRID and REAL modules decreases as the number of CPU cores increases.


Figure 9. Speedup of the METGRID and REAL modules.

Figure 10. Parallel efficiency of the METGRID and REAL modules.

Compared with 16 CPU cores, the speedup of the METGRID and REAL modules after the optimizing reaches 7.36 on 256 CPU cores, with a parallel efficiency of 45.99%, as shown in Figures 9 and 10. Obviously, the optimizing algorithm improves the speedup and parallel efficiency of the METGRID and REAL modules. On the whole, the METGRID and REAL modules in the CAS-ESM have desirable parallel performance and strong scalability.
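As a concrete check using the optimized times in Table IV, the METGRID and REAL modules take 3699.28 s on 16 cores and 502.78 s on 256 cores, giving a speedup of 3699.28 / 502.78 ≈ 7.36 relative to 16 cores; dividing by the 16-fold increase in core count gives a parallel efficiency of 7.36 / 16 ≈ 45.99%.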

6. CONCLUSIONS AND FUTURE WORK

In this work, the computing speed of the METGRID module in the CAS-ESM system has been improved. The improvement comes from the idea that, before the system calls the function search_extrap for the first time, each process stores the traversal record for all the vertexes; when search_extrap is called later, every process reads the traversal record instead of executing the first for-loop again. With this method, the METGRID load imbalance problem is largely resolved. The numerical experiments show that the computing speed of the METGRID and REAL modules on 64 CPU cores after the optimizing is about 7.2 times faster than before.


The considerable improvement of the computing speed of the METGRID and REAL modules is quite meaningful for real-time climate simulation with the CAS-ESM. Besides, the nesting process of the IAP AGCM4.0 and WRF through the CPL7 in the CAS-ESM is described. Based on the parallel analysis, the METGRID and REAL modules in the CAS-ESM have desirable parallel performance and strong scalability.

In the future, the system will be evaluated with thousands of CPU cores on larger clusters, and the METGRID module and the CAS-ESM system could be optimized further to achieve higher parallel performance. Although there is still little research applying big data techniques to climate models or earth system models, such work can be attempted later using big data programming models as the big data era arrives. Graphics processing units (GPUs) are often used for large-scale or data-intensive computing applications [31–35], so developing a GPU version of the algorithm will also be considered later, in order to further improve the data processing ability of the METGRID module.

ACKNOWLEDGEMENTS

We thank Dr. He Zhang for his help with basic knowledge of the IAP AGCM4.0 and the CAS-ESM. This work is supported by the National High-Tech Research and Development Plan of China under Grant No. 2012AA01A309, the National Natural Science Foundation of China under Grant No. 11301506 and the key deployment project of the CAS under Grant No. KJZD-EW-TZ-G09. The corresponding authors are Yuzhu Wang and Jinrong Jiang.

REFERENCES

1. Vertenstein M, Craig T, Middleton A, Feddema D, Fischer C. CESM1.0.4 user's guide. Technical report, Community Earth System Model, NCAR, USA, 2011.
2. Zhou T, Zou L, Wu B, Jin C, Song F, Chen X, Zhang L. Development of earth/climate system models in China: a review from the Coupled Model Intercomparison Project perspective. Journal of Meteorological Research 2014; 28(5):762–779.
3. Sun H, Zhou G, Zeng Q. Assessments of the climate system model (CAS-ESM-C) using IAP AGCM4 as its atmospheric component. Chinese Journal of Atmospheric Sciences (in Chinese) 2012; 36(2):215–233.
4. Dong X, Su T, Wang J, Lin R. Decadal variation of the Aleutian low-Icelandic low seesaw simulated by a climate system model (CAS-ESM-C). Atmospheric and Oceanic Science Letters 2014; 7(2):110–114.
5. Taylor KE, Stouffer RJ, Meehl GA. An overview of CMIP5 and the experiment design. Bulletin of the American Meteorological Society 2012; 93(4):485–498.
6. Duffy PB, Govindasamy B, Iorio JP, Milovich J, Sperber KR, Taylor KE, Wehner MF, Thompson SL. High-resolution simulations of global climate, part 1: present climate. Climate Dynamics 2003; 21(5–6):371–390.
7. Cocke S, LaRow TE. Seasonal predictions using a regional spectral model embedded within a coupled ocean-atmosphere model. Monthly Weather Review 2000; 128(3):689–708.
8. Giorgi F, Hewitson B. Regional climate information-evaluation and projections. In Climate Change 2001: The Scientific Basis, Houghton JT et al. (eds). Cambridge University Press: Cambridge, 2001; 583–638.
9. Bukovsky MS, Karoly DJ. A regional modeling study of climate change impacts on warm-season precipitation in the Central United States. Journal of Climate 2011; 24(7):1985–2002.
10. Liang X, Pan J, Zhu J, et al. Regional climate model downscaling of the U.S. summer climate and future change. Journal of Geophysical Research 2006; 111, D10108. DOI: 10.1029/2005JD006685.
11. Guo H, Wang L, Chen F, et al. Scientific big data and digital Earth. Chinese Science Bulletin 2014; 59(35):5066–5073.
12. Kołodziej J, González-Vélez H, Wang L. Advances in data-intensive modelling and simulation. Future Generation Computer Systems 2014; 37:282–283.
13. Ma Y, Wang L, Zomaya AY, Chen D, Ranjan R. Task-tree based large-scale mosaicking for massive remote sensed imageries with dynamic DAG scheduling. IEEE Transactions on Parallel and Distributed Systems 2014; 25(8):2126–2137.
14. Sun S, Wang L, Ranjan R, Wu A. Semantic analysis and retrieval of spatial data based on the uncertain ontology model in Digital Earth. International Journal of Digital Earth 2015; 8(1):3–16.
15. Wang L, Lu K, Liu P, Ranjan R, Chen L. IK-SVD: dictionary learning for spatial big data via incremental atom update. Computing in Science and Engineering 2014; 16(4):41–52.
16. Wang L, Geng H, Liu P, Lu K, Kolodziej J, Ranjan R, Zomaya AY. Particle Swarm Optimization based dictionary learning for remote sensing big data. Knowledge-Based Systems 2015; 79:43–50.
17. Hu C, Xu Z, Liu Y, Mei L, Chen L, Luo X. Semantic link network-based model for organizing multimedia big data. IEEE Transactions on Emerging Topics in Computing 2014; 2(3):376–387.
18. Xu Z, Liu Y, Mei L, Hu C, Chen L. Semantic based representing and organizing surveillance big data using video structural description technology. Journal of Systems and Software 2015; 102:217–225.
19. Casado R, Younas M. Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience 2015; 27(8):2078–2091.


20. Xu Z, Wei X, Luo X, Liu Y, Mei L, Hu C, Chen L. Knowle: a semantic link network based system for organizing large scale online news events. Future Generation Computer Systems 2015; 43:40–50.
21. Johnsen P, Straka M, Shapiro M, Norton A, Galarneau T. Petascale WRF simulation of hurricane Sandy: deployment of NCSA's Cray XE6 Blue Waters. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2013.
22. Ding Z, Guo D, Liu X, Luo X, Chen G. A MapReduce-supported network structure for data centers. Concurrency and Computation: Practice and Experience 2012; 24(12):1271–1295.
23. Wang L, Tao J, Ranjan R, Marten H, Streit A, Chen J, Chen D. G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Generation Computer Systems 2013; 29(3):739–750.
24. Zhang H, Zhang M, Zeng Q. Sensitivity of simulated climate to two atmospheric models: interpretation of differences between dry models and moist models. Monthly Weather Review 2013; 141(5):1558–1576.
25. He J, Zhang M, Lin W, Colle B, Liu P, Vogelmann AM. The WRF nested within the CESM: simulations of a midlatitude cyclone over the Southern Great Plains. Journal of Advances in Modeling Earth Systems 2013; 5(3):611–622.
26. Michalakes J, Dudhia J, Gill D, et al. The weather research and forecast model: software architecture and performance. Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing in Meteorology, 2004; 156–158.
27. Shainer G, Liu T, Michalakes J, Liberman J, et al. Weather Research and Forecast (WRF) model performance and profiling analysis on advanced multi-core HPC clusters. The 10th LCI International Conference on High-Performance Clustered Computing, Boulder, CO, 2009.
28. Malakar P, Saxena V, George T, et al. Performance evaluation and optimization of nested high resolution weather simulations. In Euro-Par 2012 Parallel Processing. Springer: Berlin Heidelberg, 2012; 805–817.
29. Wang W, Bruyère C, Duda M, et al. ARW version 3 modeling system user's guide. Mesoscale & Microscale Meteorology Division, National Center for Atmospheric Research, 2010.
30. Xie S, Zhang M, Branson M, et al. Simulations of midlatitude frontal clouds by single-column and cloud-resolving models during the atmospheric radiation measurement March 2000 cloud intensive operational period. Journal of Geophysical Research 2005; 110, D15S03. DOI: 10.1029/2004JD005119.
31. Chen D, Li X, Wang L, Khan SU, Wang J, Zeng K, Cai C. Fast and scalable multi-way analysis of massive neural data. IEEE Transactions on Computers 2015; 64(3):707–719.
32. Chen D, Li D, Xiong M, Bao H, Li X. GPGPU-aided ensemble empirical-mode decomposition for EEG analysis during anesthesia. IEEE Transactions on Information Technology in Biomedicine 2010; 14(6):1417–1427.
33. Chen D, Wang L, Ouyang G, Li X. Massively parallel neural signal processing on a many-core platform. Computing in Science & Engineering 2011; 13(6):42–51.
34. Chen D, Wang L, Zomaya AY, Dou M, Chen J, Deng Z, Hariri S. Parallel simulation of complex evacuation scenarios with adaptive agent models. IEEE Transactions on Parallel and Distributed Systems 2015; 26(3):847–857.
35. Yang C, Xue W, Fu H, Gan L, Li L, Xu Y, Lu Y, Sun J, Yang G, Zheng W. A peta-scalable CPU-GPU algorithm for global atmospheric simulations. Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013; 1–12.
