Multi-objective Optimization for Data Placement Strategy in Cloud Computing

Lizheng Guo1,2, Zongyao He1, Shuguang Zhao2, Na Zhang2, Junhao Wang2, and Changyun Jiang2

1 Department of Computer Science and Engineering, Henan University of Urban Construction, Pingdingshan 467036, China
2 College of Information Science and Technology, Donghua University, Shanghai 201620, China
[email protected],
[email protected],
[email protected]
Abstract. In cloud computing, both data processing and data transfer are charged for by the service provider. It is therefore important to reduce cost and improve performance for the cloud consumer. At present, existing optimization algorithms focus on only one aspect, such as reducing data movement, processing time, transfer time, processing cost or transfer cost. This paper builds a model for multi-objective data placement and uses a particle swarm optimization algorithm to optimize both time and cost in cloud computing. The model applies a processors interaction graph to map the data of the tasks onto the data centers. Simulation results show that the proposed method is more effective in both time and cost.

Keywords: Cloud Computing, Particle Swarm Optimization, Multi-Objective Optimization, Data Placement.
1 Introduction
As cloud computing appears, the long-held dream of utility computing becomes reality. This can be seen from the National Institute of Standards and Technology's definition: cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [1]. Cloud computing is associated with a new paradigm for the provision of computing infrastructure, which shifts the location of this infrastructure to the network in order to reduce the costs associated with the management of hardware and software resources [2]. The main feature of cloud computing is the combination of vast software and commodity hardware to provide a powerful computing paradigm. Apart from providing powerful computing capacity, cloud computing is more economical than a general data center. A Berkeley report [3] estimates that, by statistically multiplexing resources at very large scale, cloud computing achieves factors
of 5 to 7 decrease in the cost of electricity, network bandwidth, operations, software, and hardware at these very large scales. Cloud computing systems have many advantages, but there are also many obstacles to overcome. The same report [3] lists the top ten obstacles, of which the first is availability of the service and the fourth is the data transfer bottleneck. As the data centers of a cloud are distributed all over the world, service availability and data transfer are challenging problems. At present, many companies and academic institutions located all over the world have immense amounts of data to be processed or calculated. Since cloud computing charges by the time the service is used, they face a challenging problem: how to place the data so that cost and time are minimized.

Because the data in the cloud is very large in much scientific research, [4] proposes a matrix-based k-means clustering strategy for data placement in scientific cloud workflows. The strategy contains two algorithms: one groups the existing datasets into k data centers during the workflow build-time stage, and the other dynamically clusters newly generated datasets to the most appropriate data centers, based on dependencies, during the running stage. The main purpose is to reduce data movement among the data centers. In addition, cloud service providers charge fees for computation and storage. [5] formulates a non-linear programming model to minimize the data retrieval and execution cost of data-intensive workflows in the cloud; the model retrieves data from cloud storage resources such that the amount of data transferred is inversely proportional to the communication cost. [6] presents a particle swarm optimization based heuristic for scheduling applications onto cloud resources that takes into account both computation cost and data transmission cost. [7] explores an architecture for cloud brokering and multi-cloud VM management, and describes algorithms for optimized placement of applications in multi-cloud environments. Their placement model incorporates price and performance, as well as constraints in terms of hardware configuration, load balancing, etc. An evaluation against commercial clouds demonstrates that, compared to single-cloud deployment, their multi-cloud placement algorithms improve performance, lower costs, or both. Although their objective is to maximize performance and minimize cost, they only optimize the processing cost. [8] proposes an efficient cloud storage system built from inexpensive commodity computer nodes organized into a PC cluster acting as a data center. Data objects are distributed and replicated in a cluster of commodity nodes located in the cloud, and a data placement algorithm that provides highly available and reliable storage is proposed; the algorithm applies a binary tree to search storage nodes. Other placement strategies aim to minimize the executing and communication time [9, 10, 11] in grid or cloud computing.

Taking into account the limited resources of data centers, the limited communication bandwidth between them, and the fact that customers are charged by usage time or by data amount, in this paper we place the data to optimize multiple objectives: not only minimizing the executing and communication time, but also minimizing the cost of executing and transferring.
Data placement has been found to be NP-complete. Heuristic algorithms have been used to solve this kind of problem, such as Genetic Algorithms (GA)
[12-13]. A. Salman has shown that the performance of the Particle Swarm Optimization (PSO) algorithm is better than that of GA in distributed systems [14]: not only is the solution quality of PSO better than GA's in most of the test cases, but PSO also runs faster. So in this paper we optimize the data placement using a PSO algorithm. Simulation results show that our algorithm not only minimizes the total executing and transferring time, but also minimizes the total executing and transferring cost.

The rest of this paper is organized as follows. Section 2 presents the data placement problem and the model of the optimization algorithm. Section 3 introduces the details of the experimental settings and the analysis of the experimental results. Section 4 concludes the paper.
2 Multi-objective Optimization Model and Algorithm
2.1 Problem of Data Placement
In data centers, on the one hand, there are many physical machines with different computing and storage capacities; on the other hand, several virtual machines can be deployed on each machine. Moreover, the bandwidth between different data centers is limited, and the usage fee depends on the capacity of the processor and the amount of data transferred. Thus, data placement is a challenging problem in cloud computing. The data placement problem can be described as assigning all the data of the tasks to the data centers in a cloud computing environment so that the total cost and time of processing and communication are minimized. We can view data placement as a mapping of all the data of the tasks onto a Processors Interaction Graph (PIG) G(V, E). The set V = {1, 2, ..., m} represents the m data centers and the set E represents the interactions between these data centers. Each vertex represents a data center and carries a weight that denotes the performance of the data center; each vertex is assigned one or more tasks. If two data centers communicate, there is an edge between them, whose weight describes the bandwidth between the two data centers. To understand this clearly, consider the example of the data placement graph in Fig. 1: 5 tasks are to be assigned to 3 data centers; the processing capacity of data center one (DC1) is 500; task 1 (T1), which has data 1 (D1), and task 3 (T3), which has data 3 (D3), are assigned to DC1; T15 means that task 1 produces data to be processed by task 5.
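As an illustration only, the Fig. 1 example can be sketched in Python as a small PIG. Apart from DC1's capacity of 500 and the placement of T1 and T3 on DC1, every concrete number below (the other capacities, the bandwidths, the placement of T2, T4, T5 and the T15 data amount) is an assumption of ours, not taken from the paper:

```python
# Illustrative sketch of the Fig. 1 example as a Processors Interaction
# Graph (PIG). Vertex weights are data-center capacities (MIPS); edge
# weights are inter-center bandwidths.
capacity = {"DC1": 500, "DC2": 1000, "DC3": 2000}          # DC2, DC3 assumed
bandwidth = {("DC1", "DC2"): 100, ("DC1", "DC3"): 200,     # assumed values
             ("DC2", "DC3"): 400}
assignment = {"T1": "DC1", "T3": "DC1",                    # as in Fig. 1
              "T2": "DC2", "T4": "DC3", "T5": "DC3"}       # assumed
data_transfer = {("T1", "T5"): 300}                        # T15, amount assumed

for (src, dst), amount in data_transfer.items():
    k, l = assignment[src], assignment[dst]
    if k != l:  # data crossing centers incurs transfer time
        bw = bandwidth.get((k, l)) or bandwidth.get((l, k))
        print(f"{src}->{dst}: {amount}/{bw} = {amount / bw:.2f} time units")
```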
2.2 Modeling
The data centers in cloud computing are heterogeneous, and their computers have different processing abilities. The processing cost of a data center depends on the capacity of the processor and on the pricing standard of the provider or location. The cost of transferring data is determined by the amount of data; the transfer time depends on the data centers to which the tasks are assigned. Our target is to minimize the time and cost of communication and processing by mapping all the tasks onto the processors in the data centers.
Fig. 1. Example of data placement
In order to build the mathematical model of the data placement (task assignment) problem, we define: T_i, i = 1, 2, ..., n, as the n independent tasks, where the data amount DP_i of task i is measured in Million Instructions (MI); DC_k, k = 1, 2, ..., m, as the m data centers, whose performance is measured in Million Instructions Per Second (MIPS); B_kl, k, l = 1, 2, ..., m, as the bandwidth between data centers k and l, where m is the number of data centers; x_ik = 1 if task i is assigned to data center k, and x_ik = 0 otherwise; likewise, x_jl = 1 if and only if task j is assigned to data center l, and x_jl = 0 otherwise; n as the number of tasks; and DT_ij as the amount of data exchanged between task i, the generator of the transferred data, and task j, its consumer. Equations (1), (2) and (3) below respectively represent the executing time, the transfer time and the total time.

Amazon EC2 [15] provides three charging methods: On-Demand, Reserved and Spot. In this paper we choose the On-Demand method as our pricing standard. On-Demand instances let customers pay for compute capacity by the hour with no long-term commitments; this frees them from the costs and complexities of planning, purchasing, and maintaining hardware, and transforms what are commonly large fixed costs into much smaller variable costs. The processing prices are listed in Table 1. Regional data transfer is charged at $0.01 per GB for all data transferred between instances in different Availability Zones in the same region. There is no data transfer charge between Amazon EC2 and other Amazon Web Services within the same region (e.g., between Amazon EC2 US West and Amazon S3 in US West). Data transferred between AWS services in different regions is charged as Internet data transfer on both sides of the transfer. In light of this charging standard, we define Pout_k as the price of data transfer out of DC_k and Pin_l as the price of data transfer into DC_l. Equation (4) is the cost of data processing, where P_k is the standard On-Demand processing price; equation (5) is the cost of data transfer; and equation (6) is the total cost, the sum of the transfer and processing costs.
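For concreteness, a small worked example: the task size, capacity and data amount below are assumed for illustration, while the prices are the US East Large Linux/UNIX rate from Table 1 and the $0.01 per GB regional transfer rate quoted above.

$\frac{DP_i}{DC_k} = \frac{7200\ \text{MI}}{2000\ \text{MIPS}} = 3.6\ \text{s} = 0.001\ \text{h}, \qquad 0.001\ \text{h} \times \$0.320/\text{h} = \$0.00032$

$DT_{ij}\,(P_{out_k} + P_{in_l}) = 2\ \text{GB} \times (\$0.01 + \$0.01) = \$0.04$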
Supposing that the processing time and cost are known for task i executing on data center k, and that the transfer time and cost are known for transferring data from data center k to data center l, the model is given by equations (1)–(9) below. Constraint (7) indicates that each task must be assigned to exactly one data center. Constraint (8) describes the condition that two tasks which exchange data are assigned to different data centers. Constraint (9) guarantees that x_ik and x_jl are binary decision variables.

Table 1. Standard on-demand instances: Large

Region                  Linux/UNIX Usage    Windows Usage
US East Virginia        $0.320 per Hour     $0.460 per Hour
EU (Ireland)            $0.360 per Hour     $0.460 per Hour
Asia Pacific (Tokyo)    $0.368 per Hour     $0.460 per Hour

$T_p = \sum_{i=1}^{n} \sum_{k=1}^{m} x_{ik} \times \frac{DP_i}{DC_k}$   (1)

$T_t = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{m} \sum_{l \neq k} x_{ik} \times x_{jl} \times \frac{DT_{ij}}{B_{kl}}$   (2)

$T = T_p + T_t$   (3)

$C_p = \sum_{i=1}^{n} \sum_{k=1}^{m} x_{ik} \times \frac{DP_i}{DC_k} \times P_k$   (4)

$C_t = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{m} \sum_{l \neq k} x_{ik} \times x_{jl} \times \left( DT_{ij} P_{out_k} + DT_{ij} P_{in_l} \right)$   (5)

$C = C_p + C_t$   (6)

Subject to

$\sum_{k=1}^{m} x_{ik} = 1, \quad i = 1, 2, \dots, n$   (7)

$x_{ik} \times x_{jl} = 1, \quad i, j = 1, 2, \dots, n, \ k, l = 1, 2, \dots, m, \ l \neq k \text{ and } DT_{ij} > 0$   (8)

$x_{ik}, x_{jl} \in \{0, 1\}, \quad i, j = 1, 2, \dots, n \text{ and } k, l = 1, 2, \dots, m$   (9)
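As a minimal sketch (our own illustration, not the paper's code; names mirror the symbols above), equations (1)–(6) can be evaluated for a candidate placement as follows:

```python
def evaluate(assign, DP, DC, DT, B, P, Pout, Pin):
    """Evaluate a placement against equations (1)-(6).
    assign[i] = k maps task i to data center k; DP[i] is the task size (MI);
    DC[k] is capacity (MIPS); DT[(i, j)] is data exchanged from task i to j;
    B[(k, l)] is bandwidth; P[k], Pout[k], Pin[l] are the prices."""
    Tp = sum(DP[i] / DC[k] for i, k in enumerate(assign))            # eq. (1)
    Cp = sum(DP[i] / DC[k] * P[k] for i, k in enumerate(assign))     # eq. (4)
    Tt = Ct = 0.0
    for (i, j), d in DT.items():
        k, l = assign[i], assign[j]
        if k != l:                        # only inter-center transfers count
            Tt += d / B[(k, l)]                                      # eq. (2)
            Ct += d * Pout[k] + d * Pin[l]                           # eq. (5)
    return Tp + Tt, Cp + Ct                                          # eqs. (3), (6)
```

The two return values are the total time T and total cost C.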
2.3 PSO Algorithm
We use the PSO algorithm to optimize the cost and time of processing and transferring. Since the aims of the optimization are both cost and time, we take the combination of cost and time as the fitness function. A detailed description of the PSO algorithm is given in our previous work [11].

$CT = T + C$   (10)
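The following Python sketch shows the shape of such a PSO loop (an illustration using the parameter values of Section 3.1; the continuous-position-with-rounding decoding is one common discretization for assignment problems, not necessarily the paper's exact encoding):

```python
import random

def pso_place(n_tasks, n_centers, fitness, iters=100,
              swarm=30, c1=1.49445, c2=1.49445, w=0.729):
    """Minimize fitness(assign), where assign maps each task to a center.
    Positions are continuous and rounded to center indices when evaluated."""
    decode = lambda x: [min(n_centers - 1, max(0, round(v))) for v in x]
    pos = [[random.uniform(0, n_centers - 1) for _ in range(n_tasks)]
           for _ in range(swarm)]
    vel = [[0.0] * n_tasks for _ in range(swarm)]
    pbest = [p[:] for p in pos]                     # personal best positions
    pbest_fit = [fitness(decode(p)) for p in pos]
    g = min(range(swarm), key=lambda s: pbest_fit[s])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]    # global best
    for _ in range(iters):
        for s in range(swarm):
            for t in range(n_tasks):
                r1, r2 = random.random(), random.random()
                vel[s][t] = (w * vel[s][t]
                             + c1 * r1 * (pbest[s][t] - pos[s][t])
                             + c2 * r2 * (gbest[t] - pos[s][t]))
                pos[s][t] += vel[s][t]
            fit = fitness(decode(pos[s]))
            if fit < pbest_fit[s]:
                pbest[s], pbest_fit[s] = pos[s][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[s][:], fit
    return decode(gbest), gbest_fit
```

Here `fitness` would compute CT = T + C of equation (10), for example by summing the two values returned by an evaluation routine such as the `evaluate` sketch in Section 2.2.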
3 Experiment Settings and Result Analysis
3.1 Experiment Setup
In order to evaluate our algorithm reasonably, we generate the test data at random. The data amount of each task is measured in Million Instructions (MI) and restricted between 1000 and 10000; the performance of each data center is measured in MIPS and drawn from the set P: {500, 1000, 2000, 4000}, which corresponds to the small, medium, large and extra large models of the Amazon EC2 Standard On-Demand Instances; the bandwidth varies from 100 to 1000. In the following, all experiments are run on an AMD Phenom(tm) II X4 B95 3.0 GHz with 2 GB RAM under Microsoft Windows XP, and all algorithms are implemented in MATLAB R2009b. The parameters of the PSO are as follows: the size of the swarm is 30, the self-recognition coefficient c1 is 1.49445, the social coefficient c2 is 1.49445 and the inertia weight w is 0.729 [16].
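A sketch of this random instance generation (our own illustration of the settings above; the function name is ours, and prices are omitted for brevity):

```python
import random

P = [500, 1000, 2000, 4000]  # data-center capacities (MIPS), per Section 3.1

def make_instance(n_tasks, n_centers):
    """Generate one random test instance following the settings above."""
    DP = [random.randint(1000, 10000) for _ in range(n_tasks)]  # task sizes (MI)
    DC = [random.choice(P) for _ in range(n_centers)]           # capacities
    B = {(k, l): random.randint(100, 1000)                      # bandwidths
         for k in range(n_centers) for l in range(n_centers) if k != l}
    return DP, DC, B
```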
3.2 Performance Metrics
In order to test the efficiency of the model and the algorithm, we use several metrics to evaluate performance. One metric is the cost of processing and transferring; another is the time of processing and transferring. These metrics are closely related; therefore, we also use a combined metric that captures both the cost and the time of processing and transferring.
3.3 Simulation Result and Analysis
Performance analysis
Fig. 2. The cost and time of 35 tasks on 3 data centers
As PSO is a stochastic algorithm, the same problem may produce different results. In order to obtain more reliable results, we report the average of ten runs for every test instance. We test a series of data sets with different numbers of tasks on the processing centers, which are US East Virginia, EU Ireland and Asia Pacific Tokyo. The processing and transfer prices of the centers follow the On-Demand method; for the processing price, we chose the Linux/UNIX usage. The simulation results are listed in Table 2. From the experimental results, we can conclude that our optimization algorithm not only improves the cost, but also reduces the time. From Fig. 2, we can see that the cost, the time and the total cost converge rapidly. As our main aim is to optimize the cost and time of processing and transferring, the total cost converges stably, while the time and the cost fluctuate slightly.

Table 2. Standard on-demand instances: simulation result

Task    Processor    Cost (No)    Cost (O)    Time (No)    Time (O)    Cost+Time (No)    Cost+Time (O)
10      3            215          213         61           52          269               262
15      3            898          889         403          317         1270              1193
20      3            1800         1794        613          584         2399              2360
25      3            3003         2989        728          614         3701              3585
30      3            4415         4403        1097         1032        5480              5410
35      3            5457         5441        1299         1240        6724              6664
Total                15788        15729       4201         3839        19843             19474
4 Conclusion
In summary, we build a model for multi-objective data placement and use a PSO algorithm to optimize the data placement problem. Our optimization objectives are not a single target, but include processing time, transferring time, processing cost and transferring cost. Simulation results demonstrate that our method decreases not only the cost of processing and transferring, but also the time of processing and transferring. This means that our algorithm both increases efficiency and decreases cost in cloud computing.

Acknowledgment. This work is supported by key programs of science and technology research of He'nan Education Committee (No. 12A520006).
References

1. NIST Definition of Cloud Computing v15, http://csrc.nist.gov/groups/SNS/cloudcomputing/cloud-def-v15.doc
2. Hayes, B.: Cloud Computing. Communications of the ACM 51(7), 9–11 (2008)
3. Armbrust, M., et al.: Above the Clouds: A Berkeley View of Cloud Computing. Technical Report, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
4. Yuan, D., Yang, Y., Liu, X.: A data placement strategy in scientific cloud workflows. Future Generation Computer Systems, 1200–1214 (2010)
5. Pandey, S., Barker, A., Gupta, K.K., Buyya, R.: Minimizing Execution Costs when Using Globally Distributed Cloud Services. In: 2010 24th IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 222–229 (2010)
6. Pandey, S., Wu, L., Guru, S.M., Buyya, R.: A Particle Swarm Optimization-Based Heuristic for Scheduling Workflow Applications in Cloud Computing Environments. In: 2010 24th IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 400–407. IEEE (2010)
7. Tordsson, J., Montero, R.S., Moreno-Vozmediano, R., Llorente, I.M.: Cloud brokering mechanisms for optimized placement of virtual machines across multiple providers. Future Generation Computer Systems 28(2), 358–367 (2012)
8. Myint, J.: A data placement algorithm with binary weighted tree on PC cluster-based cloud storage system. In: 2011 International Conference on Cloud and Service Computing (CSC), December 12-14 (2011)
9. Zhang, L., Chen, Y.H., Sun, R.Y., Jing, S., Yang, B.: A Task Scheduling Algorithm Based on PSO for Grid Computing. International Journal of Computational Intelligence Research, 37–43 (2008)
10. Yin, P.Y., Yu, S.S., Wang, P.P., Wang, Y.T.: A hybrid particle swarm optimization algorithm for optimal task assignment in distributed systems. Computer Standards & Interfaces 28, 441–450 (2006)
11. Guo, L.Z., Zhao, S.G., Shen, S.G., Jiang, C.Y.: Task Scheduling Optimization in Cloud Computing Based on Heuristic Algorithm. Journal of Networks 7(3), 547–553 (2012)
12. Chang, C.K., Jiang, H., Di, Y., Zhu, Y., Ge, D.: Time-line based model for software project scheduling with genetic algorithms. Information and Software Technology, 1142–1154 (2008)
13. Gharooni-fard, G., Moein-darbari, F., Deldari, H., Morvaridi, A.: In: ICCS 2010, Procedia Computer Science, vol. 1(1), pp. 1445–1454 (May 2010)
14. Salman, A.: Particle swarm optimization for task assignment problem. Microprocessors and Microsystems 26(8), 363–371 (2002)
15. Amazon EC2 Pricing, http://aws.amazon.com/ec2/pricing/ (visited: November 4, 2012)
16. Shi, Y., Eberhart, R.C.: Empirical study of particle swarm optimization. In: Proc. IEEE Congr. Evol. Comput., pp. 1945–1950 (1999)