Improving System and Software Deployment on a Large-Scale Cloud Data Center

Yu-Sheng Wu, Tong-Ying Juang*, Yue-Shan Chang
Department of Computer Science and Information Engineering, National Taipei University, Taiwan
{ysc, juang}@gm.ntpu.edu.tw

Wei-Jen Wang, Jun-Ting Lu
Department of Computer Science and Information Engineering, National Central University, Taiwan
[email protected]
Abstract— In the cloud computing paradigm, public cloud services provide end-users with on-demand computational resources in a pay-as-you-go manner. Those computing resources are usually organized and located at one place, namely a cloud data center. Since a cloud data center involves managing a large number of computing nodes, it requires a monitoring mechanism that can observe the important events of each node as well as classify the information, so that the system can use the mechanism to achieve automatic computing. In addition to the management issue, a cloud data center requires a fast cloud software deployment mechanism that facilitates the establishment and management of various cloud services atop the cloud data center. In this study, we investigate three deployment architectures to improve cloud OS deployment. We describe how to design these architectures and use simulation tools to observe their performance, with simulation parameters drawn from real observations. We also propose a mechanism based on Puppet to manage software installation and version control.

Keywords—Cloud Computing, Large-Scale Data Center, Deployment, Cloud OS

I. INTRODUCTION

Cloud Computing [1][2][3][4] is not a new technology but rather a new computing paradigm. The core concept of the paradigm is that end-users use the internet to access all kinds of services on demand, while the service providers deliver a high quality of service to the end-users through the internet. Based on the NIST cloud definition [12], cloud services can be divided into three types: "Infrastructure as a Service" (IaaS), "Platform as a Service" (PaaS), and "Software as a Service" (SaaS). To run these cloud services and to handle various users' requests, a powerful collection of resources, both software and hardware, is needed to meet the computing demands of the cloud services and the users. A cloud data center, sometimes referred to simply as a cloud, can host those services by provisioning infrastructure-level resources on demand. A data center should work in a dynamically scalable way, since it needs to handle any workload surge of a cloud service. This implies that a data center usually consists of a large number of computing resources; therefore, the scale of resources becomes a challenge for resource management.

A cloud operating system (cloud OS) is a distributed system that facilitates resource management in a data center. For example, the ITRI Cloud OS v1.0, launched in April 2011, was designed to convert an entire data center into a platform similar to AWS [5]; it adopts virtualization technology to manage computational resources in a more flexible way. That is, a cloud OS differs from a traditional operating system in its use of virtualization and in the scale of computational resources.

The cloud OS abstracts the operating procedures at each physical machine and integrates the entire data center as one computer. In a cloud OS, the virtual machine (VM) is the smallest unit of management. Therefore, a cloud OS should be able to perform VM creation, termination, scheduling, migration, and placement on each physical machine. To achieve this goal efficiently, a cloud OS relies on many internal services, such as DHCP, which are installed on particular physical machines that do not host any end-users' VMs and applications. As a result, the administrators of the cloud platform should create a blueprint of software installation before the cloud OS is deployed on the physical machines. The challenge is that a typical cloud platform consists of a large number of physical machines, which makes internal service deployment hard and slow. In addition, updating the internal services is difficult, since the administrators have to manage a large number of physical machines. This study focuses on the problem of cloud OS deployment as well as the problem of software (internal service) version management. Based on our experience with cloud software deployment (the PRM module of the ITRI Cloud OS version 1.0), we propose several cloud OS deployment architectures and evaluate their performance. We also implement an extended component based on Puppet [9] that facilitates software version control and management.
The rest of the paper is organized as follows. Section II introduces the PRM module, a sub-component of the ITRI Cloud OS responsible for cloud software deployment. Section III describes the proposed fast deployment strategies for a large cloud data center and the design of the software management system. Section IV presents the simulation results of this study. Section V concludes this paper and points out our future research directions.
II. THE PRM MODULE FOR CLOUD SOFTWARE DEPLOYMENT
System Role

The Physical Resource Management (PRM) module is a sub-component of the ITRI Cloud OS version 1.0, developed by the Cloud Computing Center for Mobile Applications, Industrial Technology Research Institute, Taiwan. It is responsible for cloud software/service installation, update, and version control. To better explain how the PRM works, we abstract four types of system roles that appear in a cloud operating system: Service Node 1 (SN1), Service Node k (SNk), Computing Node (CN), and Data Node (DN), as described in the following.

i. Service Node 1 (SN1)
Service Node 1 (SN1) is controlled by the system administrator directly. It receives the requests from the administrator and deploys the local OSes and services to the right physical machines. For this reason, SN1 must be started up before any other node is deployed; this property makes SN1 unique compared with the other types of service nodes. SN1 reads a configuration file, specified by the administrator, to assign the required software packages to each physical machine. Each physical machine gets its commands and software packages from SN1, so SN1 is also responsible for building the networking environment for communication.
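The paper does not show the format of this configuration file. The following Python sketch illustrates the kind of role-to-package mapping SN1 might read; all role names, package names, and addresses are hypothetical.

    # Hypothetical sketch of a deployment plan that SN1 could read.
    # The real PRM configuration format is not specified in the paper.

    DEPLOYMENT_PLAN = {
        # role -> software packages to install after the base OS
        "SN1": ["dhcp-server", "tftp-server", "http-server", "prm-service"],
        "SNk": ["internal-service"],         # varies per service node
        "CN":  ["hypervisor", "prm-agent"],  # computing nodes host end-users' VMs
        "DN":  ["storage-service"],          # data nodes mostly need the base OS
    }

    MACHINE_ROLES = {
        "10.0.0.11": "SN1",
        "10.0.0.21": "CN",
        "10.0.0.22": "CN",
        "10.0.0.31": "DN",
    }

    def packages_for(ip):
        # Return the package list SN1 would push to the machine at `ip`.
        return DEPLOYMENT_PLAN[MACHINE_ROLES[ip]]

    for ip in MACHINE_ROLES:
        print(ip, "->", packages_for(ip))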
ii. Service Node k (SNk)
Service Node k (SNk) represents a physical machine that hosts an internal service of the cloud operating system. Since we have already defined the role of SN1, we do not consider SN1 an instance of SNk. Each SNk provides a different service, so the required software packages for installation may differ from node to node. By the design of the cloud operating system, the requests from non-service nodes are distributed to these service nodes.

iii. Computing Node (CN)
A Computing Node (CN) provides its computing resources to the end-users. A CN installs not only the base OS but also the hypervisor that manages the local virtual machines at the node, so it requires a longer installation time than the other node types. In a typical cloud operating system, the number of CNs is the largest among all types of nodes; in most cases, they are the bottleneck for cloud OS installation.

iv. Data Node (DN)
A Data Node (DN) provides storage resources to the end-users as well as to the cloud operating system. A DN only needs to install a base local OS, and its installation time is short compared with the other types of nodes.
The PRM Approach for Software Deployment

Fig. 1 shows how the PRM works for cloud OS deployment. Much like configuring the BIOS of a traditional machine, the administrators can install services and software components and set the boot-up program for physical and virtual resources according to the configuration, which is specified by the administrators.

Fig. 1. The PRM deployment architecture

The PRM uses the PXE [6] protocol to coordinate the installation procedure at each physical machine. As Fig. 2 shows, a physical machine first sends a request to the PRM service and asks for an IP address through PXE. Along with the IP address, it receives the TFTP server information and then requests its boot-up program from the TFTP service.

Fig. 2. The PRM booting flow
At the last step, the client (physical machine) gets the source programs of the local operating system from the HTTP service. To activate local OS installation, the PRM sends an executable script to each physical machine. As Fig. 3 shows, the PRM sends the action scripts to the PRM agents at the physical machines, and the agents perform the actions specified in the script files. When the whole deployment is done, each PRM agent sends a message back, notifying the PRM service whether the installation succeeded.

Fig. 3. The PRM deployment model
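The paper does not include the agent's implementation; the Python sketch below only restates the reported flow: fetch the action script, execute it, and report the outcome. The HTTP transport, URL layout, and report format are our assumptions.

    # A sketch of the agent-side loop described above. The PRM address,
    # URL layout, and report format are assumptions, not the actual PRM API.

    import subprocess
    import urllib.request

    PRM_URL = "http://10.0.0.11"  # hypothetical PRM service address

    def fetch_action_script(node_id):
        # Download the action script that the PRM assigned to this node.
        with urllib.request.urlopen(f"{PRM_URL}/scripts/{node_id}") as resp:
            return resp.read().decode()

    def run_and_report(node_id):
        script = fetch_action_script(node_id)
        # Perform the installation actions specified in the script file.
        result = subprocess.run(["/bin/sh", "-c", script])
        status = "success" if result.returncode == 0 else "failure"
        # Notify the PRM service whether the installation succeeded.
        urllib.request.urlopen(f"{PRM_URL}/report/{node_id}?status={status}")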
Software Version Control

As TABLE I shows, we compare the PRM module with two other deployment and software version control tools, Crowbar [7] and Puppet [8][9]. Like the PRM, Crowbar supports both software deployment and base OS deployment. Puppet, on the other hand, does not support direct base OS deployment; instead, it provides better support for software version control, such as software update and upgrade. Puppet differs from the other two in its event-triggered model: based on this model, the administrator can specify when, and under what condition, a physical machine should start to do something. This feature relies on Puppet's monitoring mechanism, in which the administrator can specify which items should be monitored.

TABLE I. COMPARISON OF DEPLOYMENT TOOLS (V: supported; X: not supported)

Feature                        PRM                     Dell Crowbar                  Puppet
Network Booting                V                       V                             V
OS Installation                V                       V                             X
Monitoring                     V                       V                             V
System Configuration           Script-based commands   1. Script-based commands      1. C&S (client/server) model
                                                       2. Chef client and server     2. User-defined events and items
                                                                                     3. Event-triggered model
User-Defined Monitoring Items  X                       X                             V
Automatic Version Control      X                       X                             V
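To make the event-triggered model concrete, the following Python fragment sketches the idea: compare the installed package versions against a desired state and act only on divergence. This illustrates the concept only; it is not Puppet's actual implementation.

    # Event-triggered version control in miniature: the action fires only
    # when the observed state diverges from the desired state.

    desired_state = {"nova": "2012.2", "glance": "2012.2"}  # from a master
    installed     = {"nova": "2012.1", "glance": "2012.2"}  # local facts

    def upgrade(package, version):
        print(f"upgrading {package} to {version}")
        installed[package] = version

    def check_once():
        for pkg, want in desired_state.items():
            if installed.get(pkg) != want:  # the triggering condition
                upgrade(pkg, want)          # the triggered action

    check_once()  # an agent would repeat this periodically or on events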
III. PROPOSED DEPLOYMENT ARCHITECTURES

Although the PRM module can install the cloud OS on a cloud data center, the efficiency of software installation is still a concern. In order to improve the efficiency of cloud OS deployment over a large-scale data center, we propose three possible architectures: the ESNS architecture, the MSNS architecture, and a hybrid approach combining the ESNS architecture and the MSNS architecture.

ESNS (External Source Node System)

The ESNS (Fig. 4) utilizes a new node, SN0, which is a dedicated and pre-configured server node, to initialize a base OS installation for every node; each node installs the base OS and is then woken up. SN0 acts purely as an installation service and does not belong to the data center after the cloud OS has been deployed. That is, when all the software deployment procedures finish, the external node SN0 can be removed from the data center, so no resource is wasted after the deployment terminates.

Fig. 4. The ESNS architecture

The deployment flow of the ESNS is shown in Fig. 5. The administrator sends the deployment requests, along with the system configuration plan, to SN0. SN0 then builds the communication environment and sends the base OS images to every other node (physical machine). After all nodes receive the OS images, they can install their OSes concurrently. With this architecture, we save the time of deploying the base OS through SN1. Note that all services, except those on SN0 and SN1, are still deployed from SN1; in this architecture, SN1 becomes a software update and management service (or a version control service).

Fig. 5. The ESNS deployment flow
MSNS (Multiple Service Nodes System)

The problem of the ESNS architecture is the downloading bottleneck at SN0 and SN1: when the number of physical machines becomes large, the deployment time is expected to grow because every node asks for data from SN0 and SN1. In the MSNS, we focus on improving the PRM deployment architecture with a location-aware deployment approach. The MSNS architecture, shown in Fig. 6, creates many replicas of SN1, and each replica is assigned to a network switch. Every replica of SN1 serves only the neighbors connected to the same switch. The deployment flow of the MSNS is shown in Fig. 7. The administrator sends the deployment requests to the SN1 nodes, and every SN1 begins to send the software packages to the other nodes. According to the role setting, each node builds a communication channel to its corresponding SN1, and that SN1 sends the required packages to the node. In this architecture, all the SN1 nodes must be initialized and running before the other nodes can be deployed, so the system spends extra time waiting for every SN1 to come up.
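The location-aware rule can be summarized in a few lines of Python; the topology and node names below are illustrative:

    # Location-aware assignment: a node downloads from the SN1 replica that
    # sits on its own switch. Topology and names are illustrative.

    topology = {
        "switch-1": {"sn1": "sn1-a", "nodes": ["cn-01", "cn-02", "dn-01"]},
        "switch-2": {"sn1": "sn1-b", "nodes": ["cn-03", "cn-04", "dn-02"]},
    }

    def replica_for(node):
        # Find the switch hosting `node` and return that switch's SN1 replica.
        for switch in topology.values():
            if node in switch["nodes"]:
                return switch["sn1"]
        raise KeyError(node)

    assert replica_for("cn-03") == "sn1-b"  # cn-03 never downloads across switches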
Fig. 6. The MSNS architecture

Fig. 7. The MSNS deployment flow
Combining the ESNS and the MSNS

Because the ESNS and the MSNS have different advantages and disadvantages, we combine them to achieve better deployment performance. Compared to the approach used by the original PRM module, the ESNS improves the performance of base OS deployment at each physical machine, while the MSNS improves the performance of software package deployment. Fig. 8 shows the proposed deployment architecture, and Fig. 9 shows the deployment flow of the proposed hybrid approach.
Fig. 8. Hybrid system architecture
In the beginning of the hybrid architecture, we adopt the ESNS architecture to initialize each physical machine: every node installs the base OS, and the SN1 nodes are deployed in parallel. When every SN1 is ready to work, the other nodes start to get their software packages and perform local installation. At this stage, the MSNS architecture makes each node get its software packages within a switch. This strategy should reduce the chance of packet loss as well as the transmission time.
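The two-phase flow can be summarized with the short Python walk-through below; every name is hypothetical, and each print statement stands in for the real deployment action.

    # An illustrative walk-through of the hybrid flow, not actual PRM code.

    SWITCHES = {
        "switch-1": {"sn1": "sn1-a", "nodes": ["cn-01", "dn-01"]},
        "switch-2": {"sn1": "sn1-b", "nodes": ["cn-02", "dn-02"]},
    }

    def deploy_hybrid():
        # Phase 1 (ESNS): SN0 pushes the base OS to every node in parallel,
        # including the SN1 replicas; SN0 leaves the data center afterwards.
        everyone = [s["sn1"] for s in SWITCHES.values()] + [
            n for s in SWITCHES.values() for n in s["nodes"]
        ]
        print("SN0 installs base OS on:", everyone)

        # Phase 2 (MSNS): once every SN1 replica is up, each node fetches
        # its packages from the replica on its own switch.
        for switch in SWITCHES.values():
            for node in switch["nodes"]:
                print(f"{node} fetches packages from {switch['sn1']}")

    deploy_hybrid()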
Fig. 9. The hybrid deployment flow

Fig. 10. Software management model
Software Deployment

As shown in Fig. 10, we propose to use the PRM to install the base OS on every physical machine. If the cloud data center is large, we can use the proposed hybrid deployment architecture for the base OS deployment and the software package deployment. Because the deployed software packages need version control as the data center keeps working, we propose an event-triggered mechanism to install, update, or upgrade software packages automatically. To facilitate the administrators' management tasks, a console provides a global view of the status of the software components on the physical machines.

The proposed software management system is implemented on top of Puppet. Once each physical machine has the base OS installed, the software management system starts to deploy the necessary services specified in a configuration file. As Fig. 11 shows, we use one node as the master to manage the whole data center. The master, namely the Puppet master, is responsible for maintaining a global view of the software packages. Moreover, we treat the master as a typical client; that is, some additional services and software packages are deployed on it just like on the client nodes. In the example of Fig. 11, the system installs essential services such as the components of OpenStack [10] (e.g., Nova, MySQL, and Glance). The system also installs the necessary networking services such as DHCP, HTTPD, and SOAP.
Fig. 11. An example of the software management architecture
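The paper does not detail the console; as one possible reading, the sketch below derives a global drift view from per-node agent reports. Package names, versions, and data shapes are illustrative.

    # Derive which nodes have drifted from the desired software versions.
    # The data shapes and values here are illustrative examples.

    desired = {"nova": "2012.2", "mysql": "5.5", "glance": "2012.2"}

    reports = {   # versions reported by each node's agent
        "cn-01": {"nova": "2012.2", "mysql": "5.5"},
        "cn-02": {"nova": "2012.1", "mysql": "5.5"},
        "sn-02": {"glance": "2012.2", "mysql": "5.1"},
    }

    def drift(reports, desired):
        # Map each node to the packages whose version differs from the plan.
        out = {}
        for node, pkgs in reports.items():
            stale = {p: v for p, v in pkgs.items()
                     if desired.get(p) not in (None, v)}
            if stale:
                out[node] = stale
        return out

    print(drift(reports, desired))
    # {'cn-02': {'nova': '2012.1'}, 'sn-02': {'mysql': '5.1'}}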
IV. SIMULATION RESULTS

This section presents the performance evaluation results of the proposed deployment architectures. Since a cloud data center is usually large, it is hard to obtain hundreds of physical machines to evaluate the performance of the proposed deployment architectures. As a result, we chose a simulation approach to evaluate the scalability of the proposed architectures. To make the simulation results more convincing, we created a prototype of a deployment architecture on a cluster of 13 physical machines, measured its performance values as well as the simulation parameters, and used those parameters in our simulations.

The Network Simulator - ns-2

Network Simulator - Version 2 (ns-2) [11] is a discrete-event-driven, object-oriented simulator. Sometimes the networking environment is too big for developers to find enough resources to implement and evaluate their ideas. A simulator like ns-2 is a good choice in this situation, since it can simulate networking incidents, for instance, packet loss, network congestion, and disconnection.
Evaluation with Three Architectures

We simulated the proposed architectures with different numbers of physical machines, ranging from 12 to 528. Since the physical machines are connected by switches, we have to make reasonable assumptions about how the physical machines are connected. In our simulations, we assume that each switch connects at most 24 nodes, including the switches and the physical machines. The most important services of the cloud OS are connected to the same switch. As a result, the system takes the form of a hierarchical balanced tree.
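Under this assumption, the depth of the switch tree grows logarithmically with the number of machines. A small Python sketch of our reading of the topology construction:

    # Balanced-tree topology: every switch has at most 24 children
    # (switches or physical machines). This construction is our reading
    # of the assumption stated above, not the paper's simulation script.

    FANOUT = 24

    def build_tree(n_machines):
        # Group machines under leaf switches, then group switches upward
        # until a single root remains; return (levels, total switches).
        levels, count, total_switches = 0, n_machines, 0
        while count > 1:
            count = -(-count // FANOUT)   # ceiling division: switches needed
            total_switches += count
            levels += 1
        return levels, total_switches

    for n in (12, 96, 528):
        print(n, build_tree(n))   # e.g., 528 machines -> 2 levels, 23 switches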
Fig. 12. The deployment time for different numbers of physical machines

Fig. 12 shows the results of our simulations for the original PRM deployment architecture and the three proposed deployment architectures described in Section III. All three proposed architectures improve on the performance of the PRM deployment. Compared with the PRM architecture, the ESNS architecture only parallelizes the OS installation; when the system is small, its improvement rate is the best among all the architectures, but when the system scales to a huge size, the improvement becomes trivial. The MSNS architecture performs significantly better than the ESNS architecture. However, combining the ESNS and the MSNS achieves the best deployment performance.
V. CONCLUSIONS AND FUTURE WORK
We have evaluated three deployment architectures for deploying base OSes and software packages to a data center. Among the three architectures, we found that we can use a dedicated and pre-configured server node to initialize a base OS installation for every node, where the base OS is installed and the node is then woken up. We can then create many replicas of the deployment service, each assigned to a network switch; every replica serves only the neighbors connected to the same switch, and the replicas are nullified after the deployment. The simulation results showed that this approach is more scalable and efficient. We also proposed an event-driven model for software management and update atop Puppet, and we have implemented the mechanism of automatic software version control on a prototype of a data center. In the future, we hope to emulate the OS deployment model for a proof of concept and to make software dependency checking easier.

ACKNOWLEDGMENTS

This work was partially supported by the National Science Council, Taiwan, under Grant No. 101-2221-E-305-010.

REFERENCES

[1] M. Armbrust et al., "A view of cloud computing," Communications of the ACM, vol. 53, pp. 50-58, Apr. 2010.
[2] I. Foster, Y. Zhao, I. Raicu, and S. Lu, "Cloud Computing and Grid Computing 360-Degree Compared," pp. 1-10, Nov. 2008.
[3] M. Armbrust et al., "Above the clouds: A Berkeley view of cloud computing," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, 2009.
[4] B. Furht and A. Escalante, Handbook of Cloud Computing. Springer, 2010.
[5] "Amazon Web Services." Available: http://aws.amazon.com/
[6] "Preboot Execution Environment (PXE) Specification - Version 2.1," Intel Corporation, Sept. 20, 1999, Section 2.2.3, "Proxy DHCP." Retrieved Oct. 2, 2012.
[7] "Welcome to Apache Hadoop." Available: http://hadoop.apache.org/
[8] C. Iversen and T. D. Nielsen, "Automatic system administration: Metaconfig - a data center automation system," Master's thesis, University of Copenhagen, 2010.
[9] "Puppet." Available: http://puppetlabs.com/
[10] "OpenStack Install and Deploy Manual." Available: http://docs.openstack.org/trunk/openstack-compute/install/apt/content/
[11] "The Network Simulator - ns-2." Available: http://www.isi.edu/nsnam/ns/
[12] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," Special Publication 800-145, National Institute of Standards and Technology, USA, Sept. 2011.