Available online at www.sciencedirect.com

Procedia Engineering 15 (2011) 3308 – 3312

www.elsevier.com/locate/procedia

Advances in Control Engineering and Information Science

Thermal-aware Workload Distribution for Clusters

Aihua Liang a,b,c, Limin Xiao a,b, Yu Pang a,b, Yongnan Li a,b, Li Ruan a,b,*

a State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
b School of Computer Science and Engineering, Beihang University, Beijing 100191, China
c The Institute of Computer Technology, Beijing Union University, Beijing 100101, China

Abstract

With the increase in computing demand, clusters have become one of the most important computing infrastructures. Most cluster systems use central air conditioning to remove heat, so hot spots are unavoidable. Motivated by alleviating the hot spot issue and optimizing workload distribution, we propose a thermal-aware workload distribution policy. It ranks the nodes according to the thermal characteristics of the rack layout and dispatches workloads based on the ranked node queue. We analyze the power-saving trend and validate that this policy can reduce power consumption to some extent with no impact on quality of service (QoS).

© 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of CEIS 2011.

Keywords: workload distribution; thermal-aware; hot spot; cluster

1. Introduction

High Performance Computing (HPC) clusters have been widely adopted by companies and research institutions for their data processing centers because of their parallel performance, high scalability, and low acquisition cost. However, further deployment of HPC clusters is limited by their high maintenance costs in terms of the energy consumed by system hardware and air-cooling equipment [1]. About 15% of the total cost of a data center goes to operation and maintenance of the environmental control system [2]. Most HPC centers use central air conditioning (CRAC units) to remove

* Corresponding author. Tel.: +86-10-82338824. E-mail address: [email protected].

1877-7058 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.proeng.2011.08.620



heat from HPC server farms. Due to the preference for high-performance machines, the concentration of infrastructure, uneven workload distribution and other factors, hot spots are unavoidable. Fig. 1 shows a typical cooling infrastructure with a raised-floor plenum.

Fig. 1 Cooling infrastructure of a cluster system

Hot spots not only require higher-capacity CRAC units, they also increase the chance of hardware failure. Hot spots are a ubiquitous problem in air-cooled data centers, and they drive the environmental control system to work much harder to ensure that no server is fed hot air (i.e., air at a temperature greater than the target threshold). This is a major concern because high temperatures can seriously affect the reliability and lifetime of the deployed components.

Scientists and technicians are currently showing special interest in all types of solutions and ideas to minimize energy use in clusters. A well-known energy management technique is DVFS (Dynamic Voltage and Frequency Scaling), which reduces system energy consumption by decreasing the CPU supply voltage and the clock frequency (CPU speed) simultaneously. Alternative strategies to reduce power consumption are based on dynamic power management (DPM), which switches idle nodes on and off according to the needs of the users' applications. The main control objective in data center thermal management is to keep the temperature of all data processing equipment below a certain threshold while maximizing the energy efficiency of the system.

Some related works [3-5] considered the placement of computational workload to alleviate local hot spots and provided failure mitigation. Several algorithms have been proposed to guide the placement of resources according to the external environment. Bash et al. [2] considered the varying ability of the air conditioning units to cool different places in the room. Patel et al. [3] showed potential methods and benefits of apportioning cooling resources based on the dynamic needs of the data center. Vasic et al. [6] derived a thermodynamic model of a data center and proposed a temperature control strategy that combines air flow control and thermal-aware scheduling. At the server level, there is a large body of related work on temperature-aware disk scheduling policies and on controlling processor temperature [7, 8].

If workload could be distributed to flatten spikes in temperature, HVAC (Heating, Ventilating and Air Conditioning) units could run at much lower than capacity and reduce the overall power consumption of cluster systems. Therefore, we present a thermal-aware workload distribution policy that ranks the nodes according to the CRAC layout and dispatches workloads based on the ranked nodes. The proposed policy reduces power consumption compared with the Maxload policy. The rest of the paper is organized as follows.


Section 2 describes the thermal-aware workload distribution policy. Section 3 presents the simulation results. Section 4 concludes the paper with comments on the results and a discussion of future work.

2. Thermal-aware workload distribution

In general, the management system of clusters for high-performance scientific computing consists of two main components: the resource management subsystem and the job management subsystem. The resource management subsystem provides services for administrators to configure, manage and monitor the system resources of the cluster. The job management subsystem schedules the submitted jobs and dispatches them to suitable computing nodes according to some scheduling algorithm. Fig. 2 shows the structure of the job management system. The job scheduler is Maui [9] and the resource management system is Torque [10].

2.1. Ranking nodes

A location might appear to be a good place because it is currently cold, but it may be difficult to cool, for example in a corner of the room far removed from an air conditioning unit. Such nodes are called "hot spots". Our strategy is to alleviate hot spots in the cluster so that all CRAC systems can run at much lower capacity and conserve power for the same computational throughput. The nodes of the cluster system can be represented by a set of servers as in Equation (1).

N = {N_i, 1 ≤ i ≤ n}    (1)

In a classical HPC layout with a number of air-cooled server racks, the racks are typically arranged in rows such that they form hot aisles and cold aisles. Therefore, we first rank the nodes of the system according to these layout characteristics.

N = {N_1, N_2, ..., N_i, N_{i+1}, ..., N_n}, 1 ≤ i ≤ n    (2)

As Equation (2) shows, the first i nodes are the cold nodes, which come from the cold aisles. The remaining n-i nodes are hot nodes from the hot aisles. Therefore, each node has a priority level according to its location in the node pool. When a node is allocated to a job, it is called a working node. Since all nodes are ranked, the cold nodes of the idle pool have priority to become working nodes. When a node finishes its job, it becomes an idle node and returns to the idle node pool. Therefore, when the system workload is light, the cold nodes may be allocated many times. A minimal code sketch of this allocation cycle is given below.
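To make the ranking and allocation mechanism concrete, the following Python sketch models the ranked idle and working node pools. It is only an illustration under assumed node names and a fixed cold/hot split; the policy described in this paper is actually implemented inside the Maui scheduler, not as standalone Python.

```python
import heapq

class ThermalAwarePool:
    """Sketch of the ranked idle/working node pools from Section 2.1.

    Illustrative model only, not the paper's Maui/Torque implementation;
    node names and the cold/hot split are assumptions.
    """

    def __init__(self, cold_nodes, hot_nodes):
        # Rank follows Equation (2): cold-aisle nodes N_1..N_i come first,
        # hot-aisle nodes N_{i+1}..N_n after them.
        ranked = list(cold_nodes) + list(hot_nodes)
        self.rank = {name: r for r, name in enumerate(ranked)}
        self.idle = [(r, name) for name, r in self.rank.items()]
        heapq.heapify(self.idle)      # idle pool ordered by thermal rank
        self.working = set()          # currently allocated (working) nodes

    def allocate(self, n_nodes):
        """Take the n coldest idle nodes for a job."""
        if n_nodes > len(self.idle):
            return None               # not enough idle nodes; the job must wait
        chosen = [heapq.heappop(self.idle)[1] for _ in range(n_nodes)]
        self.working.update(chosen)
        return chosen

    def release(self, nodes):
        """Return finished nodes to the idle pool."""
        for name in nodes:
            self.working.discard(name)
            heapq.heappush(self.idle, (self.rank[name], name))


# Example with hypothetical node names: cold nodes are handed out first,
# and under light load they are reused again and again.
pool = ThermalAwarePool(cold_nodes=["c01", "c02"], hot_nodes=["h01", "h02"])
job_nodes = pool.allocate(2)          # -> ["c01", "c02"]
pool.release(job_nodes)               # cold nodes go back to the idle pool
```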

Fig. 2 Structure of the job management system

Fig. 3 Schematic diagram of node allocation


2.2. Software architecture

The software architecture we adopted is based on open-source software (Torque with the Maui scheduler). Torque coordinates the actions of all components in the system by maintaining a database of resources, submitted jobs and running jobs. Maui retrieves job and node information from Torque and allocates nodes to jobs according to its policy. The proposed workload distribution policy is implemented as an integral component of the Maui scheduler. The main steps are as follows:
• Step 1: Node ranking. The nodes of the cluster are sorted according to the method described in the previous section. At system startup, all nodes are sorted according to the CRAC layout, and the nodes in cold aisles receive high priority. To avoid local overheating caused by some nodes being used frequently, the nodes in the cold node pool and the hot node pool are both reordered at a fixed interval according to the currently monitored temperatures. The time interval can be set by the administrator and adjusted with system usage.
• Step 2: Node allocation. Users can specify the number of nodes and processors per node when a job is submitted. The scheduler allocates the corresponding nodes from the idle node pool according to node priority.
• Step 3: Updating the node pools. When a job finishes, the used nodes return to the idle node pool. The working node pool and idle node pool are updated correspondingly.

3. Simulation measurement

The working-node threshold indicates the number of non-hot-spot nodes in the system. Fig. 4 shows the analysis result. The power saving increases with the number of working nodes while that number is below the threshold. However, once the number of working nodes exceeds the threshold, the power saving decreases, because the extra power caused by using hot nodes begins to accumulate. The threshold can be set according to the specific system; therefore, the power saving is related to the CRAC layout of the system (a simple illustrative sketch of this trend is given below).

In the simulation, the workload trace we used is Atlas from the Parallel Workloads Archive [11] of Lawrence Livermore National Laboratory. We compare the proposed policy with the node allocation policy provided by Maui, Maxload, which does not take power consumption into account. Fig. 5 shows the power consumption over a day; the Y-axis represents power in kW (kilowatts) and the X-axis represents the hour of the day. The node utilization rate is about 51%. In this case, power consumption can be reduced by more than two percent. Since the proposed policy does not delay jobs, it has no impact on the quality of service (QoS).
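The piecewise trend discussed for Fig. 4 can be illustrated with a small, hypothetical model: as long as the number of working nodes stays at or below the threshold of cool-efficient (cold) nodes, every additional job adds to the saving; beyond the threshold, hot nodes must be used and their extra cooling cost erodes the saving. The per-node constants below are made-up parameters, not values from the paper.

```python
def relative_power_saving(working_nodes, threshold,
                          save_per_cold=1.0, penalty_per_hot=1.5):
    """Illustrative (not the paper's) model of the Fig. 4 trend.

    save_per_cold: assumed saving from running a job on a cold-aisle node.
    penalty_per_hot: assumed extra cooling cost of using a hot-aisle node.
    """
    cold_used = min(working_nodes, threshold)     # jobs that fit on cold nodes
    hot_used = max(working_nodes - threshold, 0)  # overflow onto hot nodes
    return cold_used * save_per_cold - hot_used * penalty_per_hot

# With a hypothetical threshold of 8 cold nodes, the saving peaks at 8 working
# nodes and then declines as hot nodes are drawn into service.
print([relative_power_saving(n, threshold=8) for n in range(0, 13)])
```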

Fig. 4 Analysis schematic diagram

Fig. 5 Power consumption over a day for the proposed policy and Maxload (Y-axis: power in kW; X-axis: hour of the day)


4. Conclusion and future work

In this paper, a thermal-aware workload distribution policy has been proposed. Focusing on the hot spot issue in clusters, we set node priorities according to cooling efficiency, and workloads are dispatched based on the ranked node pool. The power saving is related to the CRAC layout, and the proposed policy can yield power savings with no impact on QoS. Because the workload of a cluster system fluctuates with time, DPM can be adopted according to the workload arrival pattern. In the future, we will analyze the workload arrival model and integrate thermal-aware workload distribution with dynamic power management.

Acknowledgements

This study is supported by the National Core Electronic Devices, High-end General-purpose Chips and Fundamental Software Project under Grant No. 2010ZX01036-001, the National Natural Science Foundation of China under Grant No. 61003015, the Doctoral Fund of the Ministry of Education of China under Grant No. 20101102110018, the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2009ZX-01, and the Fundamental Research Funds for the Central Universities under Grant No. YWF-10-02-058.

References

[1] Dolz M, Fernández J, Mayo R, et al. Energy Saving Cluster Roll: Power saving system for clusters. Architecture of Computing Systems, 2010: 162-173.
[2] Bash C, Forman G. Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center. 2007 USENIX Annual Technical Conference, USENIX Association, 2007: 29.
[3] Patel CD, Sharma RK, Bash CE, Beitelmal A. Thermal considerations in cooling large scale high compute density data centers. Eighth Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems, May 2002.
[4] Moore J, Chase J, Ranganathan P, Sharma R. Making scheduling "cool": Temperature-aware workload placement in data centers. USENIX Annual Technical Conference, 2004.
[5] Patel CD, Sharma RK, et al. Energy aware grid: Global workload placement based on energy efficiency. Proceedings of the ASME International Mechanical Engineering Congress and R&D Expo, Nov 15-20, 2003.
[6] Vasic N, Scherer T, Schott W. Thermal-aware workload scheduling for energy efficient data centers. Proceedings of the 7th International Conference on Autonomic Computing, ACM, 2010: 169-174.
[7] Gurumurthi S, Sivasubramaniam A, Natarajan VK. Disk drive roadmap from the thermal perspective: A case for dynamic thermal management. SIGARCH Computer Architecture News, 2005.
[8] Murali S, Mutapcic A, et al. Temperature-aware processor frequency assignment for MPSoCs using convex optimization. Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, 2007.
[9] Maui Cluster Scheduler, http://www.clusterresources.com/products/maui-cluster-scheduler.php
[10] Torque Resource Manager, http://www.clusterresources.com/products/torque-resource-manager.php
[11] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/, 2008.
[12] Tang Q, Gupta S, Stanzione D, et al. Thermal-aware task scheduling to minimize energy usage of blade server based datacenters. 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing, 2006: 195-202.
[13] Natarajan V, Deshpande A, Solanki S, et al. Thermal and power challenges in high performance computing systems. Japanese Journal of Applied Physics, 2009, 48.
