Choosing Cost-Effective Configuration in Cloud Storage

Wei-Tek Tsai, Guanqiu Qi, Yinong Chen
School of Computing, Informatics, and Decision Systems Engineering
Arizona State University, Tempe, AZ, U.S.
{wtsai, guanqiuq, yinong}@asu.edu

Abstract—Cloud storage provides virtually unlimited storage space for customers. Customers can combine data storage from different types of cloud storage according to their own requirements, but end customers often get stuck when choosing a desirable configuration from the different types of cloud storage. How can the minimum amount of money be spent while using cloud storage efficiently? How can the relationship between expenditure and performance be balanced? This paper proposes a cost-effective optimal configuration model, based on the repeated game model, that provides optimal configuration solutions to customers. Data mining techniques are used for provisioning in cloud storage: classification helps users find related data, and trend analysis helps users predict future trends in data storage. A simulated experiment is discussed and verifies the correctness of the proposed model.

I. INTRODUCTION

Cloud computing plays an important role today [18], [20], [15], [19], [4]. Facing an undeveloped territory with huge potential profits, almost all major IT companies have offered their own clouds in recent years, including Amazon EC2, Google GAE, Windows Azure, Salesforce.com, and Oracle Cloud. In addition to public clouds, VMware, Eucalyptus, Citrix, and many other vendors offer private clouds to meet different requirements, where cloud software can be deployed locally. Cloud service pricing [16] differs from traditional software pricing and is charged on a pay-as-you-go basis. Due to the ever-increasing amount of data, how to store data becomes a pressing issue. Almost all companies, schools, hospitals, and governments face the same problem. How should data be stored? How should it be maintained? How many technicians need to be employed? Many questions need to be answered, and high costs in both human resources and money are spent on data storage. Traditional data storage uses a hierarchical storage method to manage data: caches, memory, and disks store different data, and the system manages how to move data between high-cost and low-cost storage media. The price decreases from high-speed storage devices to slower devices; for instance, high-speed storage devices, such as hard disk drives, are more expensive than slower devices, such as optical discs and magnetic tape drives. Unlike traditional data storage, cloud storage is based on a networked online storage model. The data is stored on virtual machines hosted by third parties, and all data storage workloads are handed over to these third parties. The hosting companies operate large data centers in many aspects, including data storage, storage maintenance, and other services.

Users just need to buy or lease storage capacity for their storage needs from the third parties. The data center provides virtual storage pools to users according to their requirements, and customers can store files or data objects by themselves; the difficult technical details and implementation are left to the cloud storage providers. Customers gain several benefits from cloud data storage. Unlike traditional data storage methods, where users need to install storage devices in their own data centers, they no longer need to buy physical storage devices and only pay for virtual cloud storage capacity, and the service fee of cloud data storage is much less than buying physical devices. At the same time, the heavy workloads of data storage maintenance, including backup, data replication, and leasing/buying more storage capacity, are handled by the service providers. In one word, cloud storage saves a large amount of money, time, and human resources for customers. Cloud storage will therefore become a popular trend, and more and more customers will choose it as their first data storage choice. But in a profit-driven market, service providers pursue maximized profits. Different companies provide their own cloud storage with different pricing standards; in short, high-, intermediate-, and low-speed storage have high, intermediate, and low prices respectively. These service providers use their own statistical tools to mine the most frequently used data and recommend that customers store such data in high-speed storage, and the cloud storage system may dynamically move data into storage of different speeds. For example, Oracle Cloud [12], [13] provides its own solutions for storage cost and performance optimization. It classifies stored data into four types: active, less active, historical, and archive. The four types are stored in a high-performance storage tier, a low-cost storage tier, an online archive storage tier, and an offline archive storage tier respectively, and the usage cost decreases from the high-performance tier to the offline tier. The Oracle cloud storage management system assigns a storage tier to each piece of data based on statistics of data usage, which means that the data placement can change dynamically based on usage frequency. The existing methods are designed mainly based on requirements from service providers.

Users may have their own requirements on cloud storage, or may suspect that service providers use their methods mainly to increase their own profits. Sensible customers therefore want to decide on the types of cloud storage themselves. The relationship between storage type and price (in US dollars), simulated based on real data from Windows Azure and Amazon EC2, is shown in Table I. Since we cannot obtain reference data for very large storage usage, its price is shown as "contact for price" and is not discussed in this paper. The prices of the different storage types are listed, but customers still face the following questions. Which service provider should be chosen? Which type of cloud storage should be picked? How should the relationship between costs and cloud storage speeds be balanced? Most customers do not have computer science backgrounds or cloud storage usage experience. They do not know many details about the cloud service, such as computation ability, latency, and compatibility. Although service providers guide customers on how to choose, configure, and use cloud storage, it is still difficult for users to pick the most appropriate services. In most cases, users spend a lot of money on high-speed data storage services, yet the resulting efficiency may be even lower than that of low-speed services. Configuring data storage services that meet usage requirements within the most economical budget is difficult for most users. From another angle, service providers always put the most frequently used data in high-speed storage. This does not mean that data used only once a month cannot be important enough to be stored in high-speed storage. For instance, customers pay their bills to an Internet service provider; the payment service, information, and data may be accessed only once a month in most cases. Although the payment service is not frequently used, it is very important. Suppose the Internet service provider stores these data in low-speed storage to save money, and customers have to wait more than five minutes for the transaction to process when paying their bills; most customers are likely to change their Internet service provider because of the poor payment service. This paper proposes a cost-effective way, using a game theory method, to help users pick the most appropriate cloud storage configuration. Cost constraints are considered in the configuration model to reach the optimal performance. The performance is measured by efficiency, which is the product of usage time and costs; for the configuration model, a lower value is better. Data mining techniques are used to predict future requests so that the configuration can be changed dynamically to obtain the best configuration solution. The model can also change the configuration automatically according to the users' wishes. Sometimes new services may affect the existing configuration model; this model can adjust to such changes in a timely manner. A small experiment simulates the cloud environment to analyze and obtain the best configuration using the proposed model. The following questions will be addressed: Does the high-speed data storage achieve the highest efficiency? What is the expected relationship between the speed of data storage and the costs? Which part of the old data should be moved to the low-speed data storage to save money?

In summary, this paper makes the following contributions:
• It develops an intelligent configuration model for cloud storage. To the best of our knowledge, this is the first framework that uses heuristic rules to guide the cloud storage configuration process.
• It proposes a heuristic model to capture the relationship between the speed of cloud storage and the costs. A cost-effective model is proposed to tune the configuration, and a simulated experiment is used to demonstrate the framework.
• It utilizes data mining techniques in the proposed model to help obtain the best configuration. Classification methods identify correlated data, trend analysis predicts future usage, and so on. The results are used to guide the configuration process, and provisioning is supported in cloud storage configuration.

II. CLOUD STORAGE CONFIGURATION MODEL

This paper mainly focuses on a cloud storage configuration model that addresses performance and cost issues together. In traditional computing storage, all data are stored on physical storage devices. The purchase of physical devices is a one-time expense; whatever the performance of the purchased devices, users try to maximize their effectiveness as much as possible. In most cases, users cannot change the performance of physical devices; if they are not satisfied with the performance of these devices, the only option is to purchase new, higher-performance devices. Different from traditional computing storage, cloud storage provides a new leasing model. Users and companies do not need to buy their own data storage devices or maintain the stored data by themselves. Cloud storage helps them reduce the costs of purchasing new devices and of hiring IT engineers to maintain the data. Cloud storage service providers offer different types of data storage, and users can choose among them according to their requirements, which mainly involve storage performance and rent. For the most frequently used data, customers can rent high-speed data storage; for rarely used and backup data, customers can choose low-speed data storage. In practice, the usage of different data may change over time. If users could move data among different types of data storage following its usage, costs would be saved. Unfortunately, cloud storage providers currently do not give users the authority to change data storage types at any time; users can only change storage types at the beginning of a rent period (one month as the unit), and the service providers manage the stored data using their own solutions based on data usage. This paper gives users more flexibility: it assumes that users can move their own data, while the service providers can also manage the stored data, and when conflicts happen, users must defer to the service providers. Under this assumption, the movements of the stored data are bidirectional: data can be moved from slow-speed to high-speed storage (that is, from cheap to expensive storage), and the opposite movements are also allowed.

TABLE I
PRICING FOR CLOUD DATA STORAGE

Storage Volume             High Speed Storage Price   Intermediate Speed Storage Price   Low Speed Storage Price
1-50 TB/month              $0.202 /GB                 $0.155 /GB                         $0.125 /GB
51-500 TB/month            $0.176 /GB                 $0.132 /GB                         $0.112 /GB
501-1000 TB/month          $0.148 /GB                 $0.115 /GB                         $0.103 /GB
1001 TB - 5 PB/month       $0.133 /GB                 $0.102 /GB                         $0.085 /GB
Greater than 5 PB/month    Contact for price          Contact for price                  Contact for price

Storage providers offer a reactive approach: the stored data is managed automatically by the system based on usage statistics. Users can take a proactive approach, since they know more about their stored data and when it should be moved to meet increasing requests or to decrease storage costs. At the same time, this paper also considers the moving cost, which is currently not counted in the usage costs. In fact, storage providers do not want users to move their own data frequently: it not only takes time to move the data but also increases the difficulty of managing it, so data movement is charged a data movement fee. The situation is similar to airline ticket models, in which airline companies provide two types of tickets, non-refundable and refundable. A non-refundable ticket cannot be returned and any change of schedule is charged; a refundable ticket is more flexible, as customers can get a full refund after returning the purchased ticket and schedule changes are permitted without additional fees, but refundable tickets can cost more than twice as much as non-refundable ones. As we know, in most cases change has high costs in human resources, money, and time. Providers do not want users to change the existing arrangements; if something must be changed, providers need to obtain more benefit, otherwise the change will not be allowed. Similar behavior can be expected in cloud storage. The system architecture is shown in Figure 1. Different types of data can be stored in different types of cloud storage. According to their requirements, users can change the cloud storage configuration in a timely manner following the provisioning instructions. The stored data can be moved among different types of cloud storage to decrease the cost and increase the efficiency under a certain budget.

A. Configuration Model and Parameters

This paper discusses the cloud storage configuration model in the proposed scenario shown in Table I. In this setting, there are three types of cloud storage: high, intermediate, and low speed. As time goes by, customers always have these three options to choose from. The choosing process is shown in Figure 2. In this process, customers interact not just once but many times, and different strategies at different stages may cause different results; good behavior can be rewarded, and bad behavior can be punished. In the repeated game [3], [8], a stage game, which is played many times, is the component game.

Figure 1. System Architecture (data; high, intermediate, and low speed data storage; data storage configuration model; provisioning; efficiency; cost; users).

Figure 2. Repetitive Choices (high, intermediate, and low speed options at Stage 1 through Stage T).

In this paper, it is supposed that only the same three options are available at each stage, so the cloud storage configuration can be reduced to a repeated game. Here a stage game G and the number of its repetitions T define this repeated game. The stage game G is a game in strategic form: G = ⟨S_i, π_i; i = 1, ..., N⟩, where S_i is customer i's set of strategies and π_i is his payoff function, which depends on (s_1, s_2, ..., s_N). Some formal definitions are given below:

π_i^t: player i's payoff at the t-th stage
δ: probability that the players will play the same game again
δ^t: probability that the t-th stage will get played
δ^t π_i^t: expected payoff of the t-th stage of the game

From the above definitions, the total expected payoff is the sum of the payoffs in the different stages:

Total expected payoff = π_i^0 + δπ_i^1 + δ^2 π_i^2 + ... + δ^t π_i^t + ...    (1)

Discount factor: payoffs may also decrease over time by a factor β, so

Total expected payoff = π_i^0 + π_i^1 (δβ) + π_i^2 (δβ)^2 + ...    (2)

In the supposed scenario there are three cloud storage speeds, with the following relationships: High speed storage = 3 × Low speed storage, and Intermediate speed storage = (3/2) × Low speed storage. It is supposed that the starting total storage is M and that the storage volume increases by α% per month. Customers use the three types of cloud storage to store different data: M_h, M_i, and M_l represent the storage volume in high, intermediate, and low speed storage respectively, so the starting total storage M = M_h + M_i + M_l. Here a and b denote the percentages of the total storage assigned to M_h and M_i, so M_h = aM, M_i = bM, and M_l = (1 − a − b)M.
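As an illustration of equations (1) and (2), the minimal Python sketch below accumulates a player's discounted expected payoff over the played stages; the per-stage payoffs, δ, and β are hypothetical values chosen only for this example.

```python
def total_expected_payoff(stage_payoffs, delta, beta=1.0):
    """Discounted total expected payoff of the repeated game.

    stage_payoffs: per-stage payoffs pi_i^0, pi_i^1, ...
    delta: probability that the game is played again, as in Eq. (1)
    beta: extra discount on payoffs, as in Eq. (2); beta = 1 reduces Eq. (2) to Eq. (1)
    """
    return sum(payoff * (delta * beta) ** t for t, payoff in enumerate(stage_payoffs))

# Hypothetical example: three stages with payoff 10 each, delta = 0.9, beta = 0.95.
print(total_expected_payoff([10, 10, 10], delta=0.9, beta=0.95))
```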

B. Formula and Constraints

To measure the performance of the cloud storage, efficiency is defined as the ratio of the total storage volume M to the speed of cloud storage S when executing task t:

Eff(t) = M(t) / S    (3)

The price of the cloud storage service is another important factor. Here Budget represents the available budget for cloud storage, which is mainly spent on the storage itself. The total cloud storage service cost is C = C_Mh + C_Mi + C_Ml, where C_Mh, C_Mi, and C_Ml stand for the cost of high, intermediate, and low speed storage respectively. As shown in Table I, the prices are closely related to the speed of cloud storage. The parameters and relationships involved in cloud storage are complex; this paper mainly discusses the relationship between performance and costs. After comprehensive consideration, optimizing the cloud storage performance and utility costs can be reduced to a programming problem with two main constraints:

Maximize: Eff(t)    subject to: Cost(C) ≤ Budget;
Minimize: Cost(C)   subject to: Eff(t) ≥ Eff(P_t0).    (4)

For customers, efficiency and cost always conflict with each other. How can high efficiency be obtained at little cost? This question always puzzles customers, and in fact it is very difficult, or almost impossible, to obtain the highest efficiency with the least money. Balancing the relative importance of efficiency and cost therefore becomes important and meaningful. Although a general configuration formula cannot be obtained, under a fixed budget an optimal solution can be found that maximizes the efficiency and minimizes the cost. Since Eff(t) ∝ Cost(c), more efficient cloud storage costs more money. Here a performance formula P is proposed to measure the relationship between Eff(t) and Cost(c) and to guide the configuration process. Some users want to increase efficiency and satisfaction; from a professional aspect, cold-start avoidance also needs to be considered in data management. Service providers manage data based on its usage, which means that when data is used for the first time, in most cases it is not stored in the high-speed data storage. Service providers may not know which data will be used in the future, but users do, and they can move such data into the high-speed data storage in advance. Another case is data that is used only once a month, once a year, or during a short period of the year. The system follows its strategies to manage these data, and it may take a long time before they are moved from high-speed to low-speed storage. If users have the authority to control their own data, they can move these data to the low-speed storage right after usage, which helps decrease costs. The data movement cost is therefore worth discussing: data movements among the different speed types of storage address this problem, so the moving cost is also considered in the performance formula P. The proposed performance metric P measures the relationship between cost and efficiency:

P = K * Eff(t) * Cost(c) + MCost(mc)    (5)

P, Eff(t), Cost(c), and MCost(mc) stand for performance, efficiency, cost, and data movement cost respectively. K is a coefficient that adjusts the weight between efficiency and cost to make the relationship more reasonable. A highly efficient service corresponds to a high cost, so the value of a highly efficient service is higher than that of others, and the main goal of this formula is to find the optimal solution, i.e., the largest performance value under a fixed budget. Besides pricing standards and cloud storage speeds, other factors may affect the relationship between efficiency and cost, so the configuration may follow a linear or a nonlinear model under different conditions. The data movement cost involves the speed type of the data storage, the size of the data, and the movement time; for instance, the per-unit cost of moving data from low-speed to high-speed storage is higher than that of moving data from low-speed to intermediate-speed storage. The movement cost depends on actual needs; from the cost perspective, users may not want to move their data at all, and sometimes MCost(mc) equals 0.
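To make equation (5) concrete, the sketch below evaluates the performance metric P for one candidate configuration and checks the budget constraint of equation (4); the numeric values of K, the efficiency, the costs, and the movement cost are hypothetical placeholders.

```python
def performance(K, eff, cost, move_cost=0.0):
    """Performance metric of Eq. (5): P = K * Eff(t) * Cost(c) + MCost(mc)."""
    return K * eff * cost + move_cost

def within_budget(total_cost, budget):
    """Budget constraint of Eq. (4): Cost(C) <= Budget."""
    return total_cost <= budget

# Hypothetical configuration: efficiency 2.5, monthly cost of $1800, movement cost of $40.
K = 0.01
print(performance(K, eff=2.5, cost=1800.0, move_cost=40.0))
print(within_budget(total_cost=1800.0 + 40.0, budget=2000.0))
```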

C. Strategies in Configuration Model

In the proposed performance model, each customer i has three different strategies (high speed, intermediate speed, and low speed) to decide among. The strategies are shown below:

• S_h = 0 ⇒ decide not to use high speed storage; S_h = 1 ⇒ decide to use high speed storage
• S_i = 0 ⇒ decide not to use intermediate speed storage; S_i = 1 ⇒ decide to use intermediate speed storage
• S_l = 0 ⇒ decide not to use low speed storage; S_l = 1 ⇒ decide to use low speed storage
N: number of decision times in the game; S = {0, 1}^N ⇒ joint strategy space. From these strategies it is easy to obtain the strategy forms of the three decisions, shown in Table II and Table III.

TABLE II
STRATEGY 1 (S_l = 0)

          S_i = 0          S_i = 1
S_h = 0   (0, 0, 0)        (0, P_Si, 0)
S_h = 1   (P_Sh, 0, 0)     (P_Sh, P_Si, 0)

TABLE III
STRATEGY 2 (S_l = 1)

          S_i = 0          S_i = 1
S_h = 0   (0, 0, P_Sl)     (0, P_Si, P_Sl)
S_h = 1   (P_Sh, 0, P_Sl)  (P_Sh, P_Si, P_Sl)

This paper only shows the different payoffs (the performance value P) under different conditions; it is a generic strategy model, and the optimal strategy may differ under different conditions. The general approach to finding the optimal solution is to look for a dominant strategy [6] first; if there is no dominant strategy, a mixed strategy [5] is sought. Here, strategy P'_Sh (P'_Si or P'_Sl) strongly dominates all other strategies of player m if the payoff to P'_Sh (P'_Si or P'_Sl) is strictly greater than the payoff to any other strategy. In other words,
• π_m(P'_Sh, P_Si, P_Sl) > π_m(P_Sh, P_Si, P_Sl);
• π_m(P_Sh, P'_Si, P_Sl) > π_m(P_Sh, P_Si, P_Sl);
• π_m(P'_Sh, P_Si, P'_Sl) > π_m(P_Sh, P_Si, P_Sl);
for all P_Sh, P_Si, and P_Sl. Here one player has eight pure strategies, P^1, P^2, ..., P^8. A mixed strategy for this player is a probability distribution over his pure strategies; it is a probability vector (pro^1, pro^2, ..., pro^8), with pro^k ≥ 0, k = 1, 2, ..., 8, and Σ_{k=1}^{8} pro^k = 1. In this game player m could play a mixed strategy (pro^1, pro^2, ..., pro^8), and the expected payoff to player m is given by

pro^1 π_m(P_Sh, P_Si, P_Sl) + pro^2 π_m(P_Sh, 0, P_Sl) + ... + pro^8 π_m(0, 0, P_Sl).

Whether a dominant strategy or a mixed strategy is found, it gives customers good recommendations for the cloud storage configuration.
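A minimal sketch of this selection logic is shown below: it first checks for a dominant pure strategy in the sense stated above (a strategy whose payoff is strictly greater than that of every other strategy) and otherwise evaluates the expected payoff of a given mixed-strategy probability vector. The eight payoff values and the probability vector are hypothetical.

```python
# Hypothetical payoffs pi_m for the eight pure strategies (S_h, S_i, S_l) in {0,1}^3.
payoffs = {
    (0, 0, 0): 0.0, (0, 0, 1): 3.1, (0, 1, 0): 4.2, (0, 1, 1): 5.0,
    (1, 0, 0): 4.8, (1, 0, 1): 5.5, (1, 1, 0): 6.0, (1, 1, 1): 5.7,
}

def dominant_strategy(payoffs):
    """Return a pure strategy whose payoff strictly exceeds all others, if any."""
    best = max(payoffs, key=payoffs.get)
    others = [v for s, v in payoffs.items() if s != best]
    return best if all(payoffs[best] > v for v in others) else None

def expected_payoff(payoffs, probabilities):
    """Expected payoff of a mixed strategy: sum_k pro^k * pi_m(strategy_k)."""
    return sum(p * payoffs[s] for s, p in probabilities.items())

print(dominant_strategy(payoffs))
# Hypothetical mixed strategy over two of the pure strategies.
print(expected_payoff(payoffs, {(1, 1, 0): 0.6, (0, 1, 1): 0.4}))
```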

III. PROVISIONING AND CLASSIFICATION METHODS

Provisioning methods are used to predict the future usage of cloud storage, and classification is used to find correlated data for preprocessing data movement in the proposed system architecture. The current techniques used by major service providers mainly focus on space locality and time locality. Space locality concerns the data in use and its surrounding data: when the data in use is moved into the high-speed data storage, its surrounding data is moved with it. Time locality concerns the recent usage of data.

Data that has been used frequently in the recent past is moved into the high-speed data storage. Both methods try to decrease the response time for requests and increase user satisfaction. Here we add logical locality to the data management techniques to improve on the current ones. Logical locality differs from space locality in that it concerns the logical connections among different data. For instance, student and department information are not stored in the same table; these tables may be stored in different units of the database, far away from each other. But in the university domain, students are connected with departments: one student belongs to one department (or more than one), and one department has many students. When the data in use is moved into the high-speed data storage, its logically connected data is also moved with it. So the new technique considers the data in use, its linked data, its surroundings, and its usage. The following two methods provide useful suggestions to users for optimizing cloud configurations based on the proposed configuration model. This paper mainly uses trend analysis on the historical data to extract the underlying trend pattern; the obtained pattern can then be used to predict future usage. Classification mainly analyzes the relationship between stored data and frequently used data and decides whether the stored data is strongly related to the frequently used data; if so, these data are moved from the low-speed storage to the high-speed storage, which increases the efficiency of cloud storage.

First, we discuss trend analysis. Trend analysis [25], [21] is a special case of regression analysis in which the dependent variable is the variable to be forecasted and the independent variable is time. While a moving-average model limits the forecast to one period in the future, trend analysis can make forecasts more than one period into the future. Least squares is used in trend analysis to fit a trend line to a set of time series data and then project the line into the future for a forecast. The historical data needs to be preprocessed to meet the following requirements before analysis:
• A large enough sample of data is needed; the larger the sample, the more correct and precise the prediction will be.
• The reason for the existence of extreme observations and outliers needs to be investigated. If the reason is random variability, the data should be excluded from the sample; otherwise the data will interfere with the analysis results.
• The accuracy of the numerator, denominator, and the indicator of interest needs to be considered during trend analysis, as they involve the data of each time period and affect the correctness of the prediction.
The trend line can be expressed as F = a + bt, where F is the forecast value, t is the time value, a stands for the y-intercept, and b is the slope of the line. The value of a can be obtained from (EndValue − StartValue)/StartValue, where StartValue and EndValue stand for two values in a certain time period.
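A minimal sketch of fitting the trend line F = a + bt to historical monthly storage usage with ordinary least squares (anticipating the least-squares discussion below); the usage numbers are hypothetical.

```python
def fit_trend_line(usage):
    """Least-squares fit of F = a + b*t to a series of historical values."""
    n = len(usage)
    t = list(range(n))
    t_mean = sum(t) / n
    u_mean = sum(usage) / n
    num = sum((ti - t_mean) * (ui - u_mean) for ti, ui in zip(t, usage))
    den = sum((ti - t_mean) ** 2 for ti in t)
    b = num / den
    a = u_mean - b * t_mean
    return a, b

# Hypothetical monthly storage usage in TB; forecast the next month (t = 6).
usage = [100, 112, 126, 140, 158, 176]
a, b = fit_trend_line(usage)
print(a + b * 6)
```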

After obtaining the draft trend line, least squares is used to refine it. The method of least squares is a standard approach that minimizes the sum of the squares of the errors made in solving every single equation. Here the least squares method determines the values of a and b so that the resulting line is the best-fit line through the set of historical data; after a and b have been determined, the equation can be used to forecast future values. For the linear least squares problem, the regression model can be expressed as f(x_i, β) = Σ_{j=1}^{m} β_j ϕ_j(x_i), where the ϕ_j are functions of x_i and the β_j are the coefficients. With X_ij = ∂f(x_i, β)/∂β_j = ϕ_j(x_i), the estimate is β̂ = (X^T X)^{−1} X^T y. For non-linear least squares, the values are obtained by successive approximation, β_j^{k+1} = β_j^k + Δβ_j, where k is the iteration number and the vector of increments Δβ_j is the shift vector. The model can be linearized by a first-order Taylor series expansion about β^k: f(x_i, β) = f^k(x_i, β) + Σ_j (∂f(x_i, β)/∂β_j)(β_j − β_j^k) = f^k(x_i, β) + Σ_j J_ij Δβ_j.

Second, we discuss the classification techniques, which help us find previously unseen relationships among data. There are many classification techniques, including decision trees, neural networks, support vector machines, and so on. Here we mainly use the k-nearest neighbor algorithm [24], [23] to find the correlated data. Before computing, k-NN requires an integer k, a set of labeled examples, and a metric to measure "closeness". k-NN computes a weighted Euclidean distance from the stored data to the frequently used data, d_w(x_u, x) = sqrt(Σ_{k=1}^{D} (w[k](x_u[k] − x[k]))^2), where d_w(x_u, x) refers to the distance between the two objects x_u and x. Similarity metric and distance metric are often used to refer to any measure of affinity between two objects. The following four criteria characterize the relationship between x_u and x:
1) d_w(x_u, x) ≥ 0; non-negativity
2) d_w(x_u, x) = 0 only if x_u = x; identity
3) d_w(x_u, x) = d_w(x, x_u); symmetry
4) d_w(x_u, y) ≤ d_w(x_u, x) + d_w(x, y); triangle inequality
After that, a heuristically optimal k is chosen based on the RMSE obtained through cross-validation, and one distance value is chosen as a threshold. Data whose weighted Euclidean distance to the frequently used data is less than the threshold is treated as potentially useful and is moved to the high-speed storage.
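The short sketch below illustrates this selection step: it computes the weighted Euclidean distance d_w and flags stored items whose distance to a frequently used item falls below a threshold as candidates for promotion to high-speed storage. The feature vectors, weights, and threshold are hypothetical.

```python
import math

def weighted_distance(xu, x, w):
    """Weighted Euclidean distance d_w(xu, x) = sqrt(sum_k (w[k]*(xu[k]-x[k]))^2)."""
    return math.sqrt(sum((wk * (a - b)) ** 2 for wk, a, b in zip(w, xu, x)))

def candidates_for_promotion(stored, hot_item, w, threshold):
    """Return stored items closer to the frequently used item than the threshold."""
    return [item for item in stored if weighted_distance(item, hot_item, w) < threshold]

# Hypothetical 3-dimensional usage features for stored data and one frequently used item.
stored = [(0.1, 0.2, 0.9), (0.8, 0.7, 0.1), (0.75, 0.68, 0.2)]
hot_item = (0.8, 0.7, 0.15)
print(candidates_for_promotion(stored, hot_item, w=(1.0, 1.0, 0.5), threshold=0.2))
```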

IV. CASE STUDY

In this section, one simulated experiment shows the process of obtaining the optimal configuration solution. It uses the example in Table I, with three different types of data storage: high speed, intermediate speed, and low speed. At the beginning of the example, it is supposed that the data is split among the different storage types in a fixed ratio. M, M_h, M_i, and M_l represent the total data volume and the high, intermediate, and low speed data volumes, with M = M_h + M_i + M_l. The parameters a, b, and c represent the percentage of data in each storage type: M_h = aM, M_i = bM, M_l = cM, and c = 1 − a − b. These percentages can change dynamically following different on-demand requirements and the instructions of provisioning and classification. The pure pricing formulas for the three different cloud storage types are given below:

Cost_Mh = 206.9 M_h,               M_h ⊆ [1, 50 TB],
          180.2 M_h + 1331.2 a,    M_h ⊆ [51, 500 TB],
          151.6 M_h + 15667.2 a,   M_h ⊆ [501, 1000 TB],
          136.2 M_h + 31027.2 a,   M_h ⊆ [1001 TB, 1000 PB].    (6)

Cost_Mi = 158.7 M_i,               M_i ⊆ [1, 50 TB],
          135.2 M_i + 1177.6 b,    M_i ⊆ [51, 500 TB],
          117.8 M_i + 9881.6 b,    M_i ⊆ [501, 1000 TB],
          104.4 M_i + 23193.6 b,   M_i ⊆ [1001 TB, 1000 PB].    (7)

Cost_Ml = 128 M_l,                 M_l ⊆ [1, 50 TB],
          114.7 M_l + 665.6 c,     M_l ⊆ [51, 500 TB],
          105.5 M_l + 5428.1 c,    M_l ⊆ [501, 1000 TB],
          87 M_l + 23251.1 c,      M_l ⊆ [1001 TB, 1000 PB].    (8)

Each type of data storage therefore has its own performance (P). Time (T) is the storing time in the data storage, Frequency (F) is the number of reading/writing operations in the data storage, and K is the coefficient that balances the relationships among cost, time, and frequency:
• P_h = K * Cost_Mh * Time_h * Frequency_h
• P_i = K * Cost_Mi * Time_i * Frequency_i
• P_l = K * Cost_Ml * Time_l * Frequency_l
Different data storage has different speed (S), denoted S_h, S_i, and S_l. r is the monthly growth ratio of the data size, and F_h, F_i, and F_l represent the average reading/writing times for high, intermediate, and low speed data storage. After n time units, the data volume becomes M_n = M(1 + r)^n, with M_nh = aM_n, M_ni = bM_n, and M_nl = cM_n. The performance equations are then as follows:

P_Mnh = K * 206.9 M_nh^2 F_h / S_h,                 M_nh ⊆ [1, 50 TB],
        K (180.2 M_nh + 1331.2 a) M_nh F_h / S_h,   M_nh ⊆ [51, 500 TB],
        K (151.6 M_nh + 15667.2 a) M_nh F_h / S_h,  M_nh ⊆ [501, 1000 TB],
        K (136.2 M_nh + 31027.2 a) M_nh F_h / S_h,  M_nh ⊆ [1001 TB, 1000 PB].    (9)

P_Mni = K * 158.7 M_ni^2 F_i / S_i,                 M_ni ⊆ [1, 50 TB],
        K (135.2 M_ni + 1177.6 b) M_ni F_i / S_i,   M_ni ⊆ [51, 500 TB],
        K (117.8 M_ni + 9881.6 b) M_ni F_i / S_i,   M_ni ⊆ [501, 1000 TB],
        K (104.4 M_ni + 23193.6 b) M_ni F_i / S_i,  M_ni ⊆ [1001 TB, 1000 PB].    (10)

P_Mnl = K * 128 M_nl^2 F_l / S_l,                   M_nl ⊆ [1, 50 TB],
        K (114.7 M_nl + 665.6 c) M_nl F_l / S_l,    M_nl ⊆ [51, 500 TB],
        K (105.5 M_nl + 5428.1 c) M_nl F_l / S_l,   M_nl ⊆ [501, 1000 TB],
        K (87 M_nl + 23251.1 c) M_nl F_l / S_l,     M_nl ⊆ [1001 TB, 1000 PB].    (11)

Here it is supposed that a = 0.2, b = 0.5, c = 0.3, r = 0.3, and F_h = 100 F_i = 5000 F_l. For the speed relationships among the different data storage types, S_h = 3 S_l and S_i = (3/2) S_l. The resulting performance values are then placed in Table II and Table III, giving four groups of table sets for the different data sizes.
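A minimal sketch of evaluating the piecewise pricing of equations (6)-(8) is given below; the tier boundaries and coefficients follow equation (6), and the helper is written for the high-speed tier only as an example (the other two tiers would use their own coefficients from equations (7) and (8)).

```python
def cost_high_speed(m_h, a):
    """Piecewise high-speed storage cost of Eq. (6); m_h is the volume in TB."""
    if m_h <= 50:
        return 206.9 * m_h
    if m_h <= 500:
        return 180.2 * m_h + 1331.2 * a
    if m_h <= 1000:
        return 151.6 * m_h + 15667.2 * a
    return 136.2 * m_h + 31027.2 * a

# Example with the case-study parameters a = 0.2 and M = 1024 TB, so M_h = a * M.
M, a = 1024.0, 0.2
print(cost_high_speed(a * M, a))
```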

Figure 3. The Simulation Results (ratios a, b, and c of high, intermediate, and low speed data storage, plotted in blue, red, and green for rounds n = 2 to 6).

The budget is the main constraint for data storage: Cost_Mh + Cost_Mi + Cost_Ml ≤ budget. At the same time, the system tries to find the combination with the greatest accumulated performance (AP), AP_max = P_Mnh + P_Mni + P_Mnl. We first try to find the dominant strategies under the different conditions; unfortunately, in most cases there is no dominant strategy, so a mixed strategy composed of the set of pure strategies is sought instead. Here only the mixed strategy solution for n = 2, M = 1024 TB, and budget = 210000 is shown. The optimal solution is 0.24(0, P_Sni, 0) + 0.07(0, 0, P_Snl) + 0.08(P_Snh, P_Sni, 0) + 0.05(P_Snh, 0, P_Snl) + 0.11(0, P_Sni, P_Snl) + 0.07(P_Snh, P_Sni, P_Snl). The total cost is 207362, which is less than but almost equal to the budget, so the system obtains the greatest performance under the budget. As usage evolves, the configuration also changes among the three storage types. Since there are many variables and the change of one variable affects the configuration model, it is very difficult to derive a single formula to guide the configuration process. Hard disk drive (HDD) prices have been decreasing constantly for years, but cloud storage costs are composed of HDD, maintenance, and other costs, and the cost of human resources fluctuates with the state of the economy; in general, the trend of total costs is upward. The speed at which data grows is higher than the speed at which HDD prices decrease, so cloud storage costs increase as the data grows over time. The system tries to use the minimum cost to maintain the same performance as at the beginning. The experiment is simulated from n = 2 to n = 6, and the results are shown in Figure 3: a, b, and c stand for the ratios of high, intermediate, and low speed data storage, shown in blue, red, and green respectively. After five rounds of computation, the ratios a and b decrease from 0.2 and 0.5 to less than 0.1 and around 0.1, while the ratio c increases from 0.3 to more than 0.8. The data size increases after each round, and the ratio of the three types of data storage keeps changing. The results show that more and more data is moved to the low-speed data storage as the data grows. In the real world, most data is rarely used, so for saving purposes there is no need to store it in high-speed data storage. The proposed method helps users manage the data automatically, including configuring data storage models and moving data.
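The search described in this case study can be sketched as a simple enumeration over the allocation ratios (a, b, c): each candidate split is checked against the budget and its accumulated performance is computed, and the best feasible split is kept. The cost and performance callables, the grid step, and the budget below are placeholders for illustration, not the paper's actual implementation.

```python
def best_allocation(total_cost, accumulated_perf, budget, step=0.05):
    """Enumerate splits (a, b, c) with a + b + c = 1 and keep the feasible split
    with the greatest accumulated performance AP under the budget constraint."""
    best, best_ap = None, float("-inf")
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            a, b = i * step, j * step
            c = 1.0 - a - b
            if total_cost(a, b, c) > budget:
                continue  # violates Cost_Mh + Cost_Mi + Cost_Ml <= budget
            ap = accumulated_perf(a, b, c)
            if ap > best_ap:
                best, best_ap = (a, b, c), ap
    return best, best_ap

# Placeholder cost and performance models, used only to make the sketch runnable.
total_cost = lambda a, b, c: 300000 * a + 180000 * b + 120000 * c
accumulated_perf = lambda a, b, c: 5.0 * a + 2.0 * b + 1.0 * c
print(best_allocation(total_cost, accumulated_perf, budget=210000))
```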

V. RELATED WORK

With the development and spread of cloud computing, an enormous number of applications and data are moving to the cloud, and cloud storage is a new and popular part of cloud computing. By comparison, traditional data storage systems are sold as licenses [9], [2] and storage devices are sold as common merchandise, so the pricing standard of cloud data storage is a new topic [22]. The cost-benefit of cloud computing has been analyzed, involving specific resource requirements, the monetary costs of creating and deploying applications, and cost-effectiveness based on the Amazon Elastic Compute Cloud [7]. Storage I/O performance has been measured in virtualized setup experiments on Amazon EC2: performance bottlenecks in a virtual setup can be identified, workload combinations can be understood, and the effect of resource configurations on I/O performance can be used to manage the infrastructure efficiently [14]. A new performance measuring method for IaaS takes into account the type of service running in a virtual machine, so price and performance can be compared across different providers and platforms [1]. Because computing resources are leased on demand in cloud computing, an economic model has been used to analyze the buy-or-lease decision for storage; a probabilistic approach evaluates future prices and the replacement needs for disk arrays against deterministic prices and fixed replacement rates [11]. Video-on-demand (VoD) service demand varies over the course of a day, and the load size changes all the time; one model helps migrate VoD services into a hybrid cloud-assisted deployment and considers the relationship between cloud price and server bandwidth in saving cost, where the cloud storage size and the cloud content update strategy play the key roles in improving user experience [10]. A cost-effective intelligent configuration model [17] has been used to guide the choice of the most efficient cloud services under a certain budget; the model considers the relationships among workloads, the number of services, and the price of services, and the system adjusts the configuration of different services based on real-time service usage.

VI. CONCLUSION

This paper proposed a cost-effective optimal configuration model for cloud storage, which uses game theory and data mining methods to provide optimal configuration solutions that help end customers make better choices. The model balances the efficiency and cost of cloud storage and tries to find the best trade-off between the two factors; in particular, cost is used as a constraint to assist the configuration. Trend analysis from data mining is used to explore future trends and assist provisioning in cloud storage, while classification helps customers find potentially useful data. The simulated experiment explains the computing process and verifies the correctness of the proposed model. In the future, we will further investigate different variants of the configuration models and provide more intelligent support in cloud storage configuration models.

REFERENCES

[1] Alexander Lenk, Michael Menzel, Johannes Lipsky, Stefan Tai, and Philipp Offermann. What are you paying for? Performance benchmarking for infrastructure-as-a-service offerings. In Proceedings of IEEE International Conference on Cloud Computing, 2011.
[2] Barry Boehm, Chris Abts, and Sunita Chulani. Software development cost estimation approaches - a survey. Annals of Software Engineering, 2000.
[3] Chunlin Chen, Daoyi Dong, Qiong Shi, and Yu Dong. A quantum reinforcement learning method for repeated game theory. In Proceedings of International Conference on Computational Intelligence and Security, volume 1, pages 68–72, Nov. 2006.
[4] Yinong Chen and Wei-Tek Tsai. Service-Oriented Computing and Web Software Integration, 3rd Edition. Kendall Hunt Publishing, 2011.
[5] P.-A. Chiappori, S. Levitt, and T. Groseclose. Testing mixed-strategy equilibria when players are heterogeneous: The case of penalty kicks in soccer. American Economic Review, pages 1138–1151, September 2002.
[6] Eileen Chou, Margaret McConnell, Rosemarie Nagel, and Charles R. Plott. The control of game form recognition in experiments: Understanding dominant strategy failures in a simple two person guessing game. Working papers, Division of the Humanities and Social Sciences, California Institute of Technology, 2007.
[7] Derrick Kondo, Bahman Javadi, Paul Malecot, Franck Cappello, and David P. Anderson. Cost-benefit analysis of cloud computing versus desktop grids. In Proceedings of International Parallel and Distributed Processing Symposium, 2009.
[8] Youbei Huang and Mingming Wang. Credit rating system in C2C e-commerce: Verification and improvement of existent systems with game theory. In Proceedings of 2009 International Conference on Management of e-Commerce and e-Government (ICMECG'09), pages 36–39, Sept. 2009.
[9] K. Molokken-Ostvold, M. Jorgensen, S. S. Tanilkan, H. Gallis, A. C. Lien, and S. E. Hove. A survey on software estimation in the Norwegian industry. In Proceedings of 10th International Symposium on Software Metrics, 2004.
[10] Haitao Li, Lili Zhong, Jiangchuan Liu, Bo Li, and Ke Xu. Cost-effective partial migration of VoD services to content clouds. In Proceedings of IEEE CLOUD, pages 203–210, 2011.
[11] Loretta Mastroeni and Maurizio Naldi. Storage buy-or-lease decisions in cloud computing under price uncertainty. In Proceedings of 7th Conference on Next Generation Internet, pages 1–8, 2011.
[12] Oracle. Cloud-stored offsite database backups, 2010.
[13] Oracle. Oracle cloud computing, 2010.
[14] Sankaran Sivathanu, Ling Liu, Mei Yiduo, and Xing Pu. Storage management in virtualized cloud environment. In Proceedings of IEEE 3rd International Conference on Cloud Computing, 2010.
[15] Wei-Tek Tsai, Qian Huang, Jay Elston, and Yinong Chen. Service-oriented user interface modeling and composition. In Proceedings of the 2008 IEEE International Conference on e-Business Engineering, pages 21–28, 2008.
[16] Wei-Tek Tsai and Guanqiu Qi. DICB: Dynamic intelligent customizable benign pricing strategy for cloud computing. In Proceedings of IEEE Fifth International Conference on Cloud Computing, pages 654–661, 2012.
[17] Wei-Tek Tsai, Guanqiu Qi, and Yinong Chen. A cost-effective intelligent configuration model in cloud computing. In Proceedings of the 32nd International Conference on Distributed Computing Systems Workshops, pages 400–408, 2012.
[18] Wei-Tek Tsai and Qihong Shao. Role-based access-control using reference ontology in clouds. Technical report, Arizona State University, June 2010.
[19] Wei-Tek Tsai, Xin Sun, Qihong Shao, and Guanqiu Qi. Two-tier multi-tenancy scaling and load balancing. In Proceedings of IEEE International Conference on e-Business Engineering, pages 484–489, 2010.
[20] Wei-Tek Tsai, Bingnan Xiao, Yinong Chen, and Raymond A. Paul. Consumer-centric service-oriented architecture: A new approach. In Proceedings of IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems and International Workshop on Collaborative Computing, Integration, and Assurance, pages 175–180, 2006.
[21] Jing-Doo Wang. A novel approach to compute pattern history for trend analysis. In Proceedings of Eighth International Conference on Fuzzy Systems and Knowledge Discovery, pages 1746–1750, 2011.
[22] Ray Wang. Shape your apps strategy to reflect new SaaS licensing and pricing trends, 2009.
[23] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, pages 207–244, 2009.
[24] Lijuan Zhou, Linshuang Wang, Xuebin Ge, and Qian Shi. A clustering-based KNN improved algorithm CLKNN for text classification. In Proceedings of the 2nd International Asia Conference on Informatics in Control, Automation and Robotics, volume 3, pages 212–215, 2010.
[25] Quanyin Zhu, Hong Zhou, Yunyang Yan, Jin Qian, and Pei Zhou. Commodities price dynamic trend analysis based on web mining. In Proceedings of International Conference on Multimedia Information Networking and Security, pages 524–527, 2011.