Distributed inter-domain SLA negotiation using Reinforcement Learning

Tristan Groléat
Télécom Bretagne
Email: [email protected]

Hélia Pouyllau
Alcatel-Lucent Bell Labs France
Email: [email protected]
Abstract—Applications requiring network Quality of Service (QoS) (e.g. telepresence, cloud computing, etc.) are becoming mainstream. To support their deployment, network operators must automatically negotiate end-to-end QoS contracts (a.k.a. Service Level Agreements, SLAs) and configure their networks accordingly. Other crucial needs must be considered: QoS should provide incentives to network operators, and confidentiality on topologies, resource states and committed SLAs must be respected. To meet these requirements, we propose two distributed learning algorithms that allow network operators to negotiate end-to-end SLAs and optimize revenues over several demands while treating requests in real-time: one algorithm minimizes the cooperation between providers while the other requires exchanging more information. Experimental results show that the second algorithm better satisfies customers and providers, at the cost of worse runtime performance.

This work has been partially supported by the ETICS project, funded by the European Commission through the 7th ICT Framework Programme (grant agreement no. FP7-248567, contract number INFSO-ICT-248567).
I. INTRODUCTION

The current Internet relies on "best effort": network service providers (which we refer to as "domains" in the remainder of this paper, although they are managed by independent business actors) allow customers to use resources but do not guarantee any Quality of Service (QoS). As more and more applications requiring inter-domain QoS guarantees (e.g. VPNs, telepresence, etc.) become mainstream, current techniques for providing QoS (static configurations or dedicated networks) will be unsuitable, unless dedicated networks support such services, which is not desirable from a business perspective. Service Level Agreements (SLAs) specify possible QoS guarantees inside a domain (e.g. delay ≤ 20ms). To support a service across several networks, different domains' SLAs have to be combined to form an end-to-end chain of SLAs. Several factors make the negotiation of SLA chains difficult: i) domains do not wish to disclose their SLAs and remaining bandwidth to all other domains; ii) customer requests and responses are unpredictable and concurrent over the bandwidth resources of each network. Many previous works tackled the end-to-end SLA negotiation problem ([1], [2]) by optimizing the selection of a SLA chain for each request. Such approaches overlook two issues: future, more profitable requests might need the allocated resources, and the customer might refuse the providers' offer. Our goal is more ambitious: we aim at optimizing long-term
revenues of domains while dynamically processing customers' requests. This can be achieved using Reinforcement Learning (RL) techniques, which allow a domain to learn the reactions of its environment (customer responses, responses of other domains). To meet domains' requirements on confidentiality, we adopted a distributed framework with limited inter-provider cooperation. Section II introduces the end-to-end SLA negotiation context and section III presents the distributed framework. Within this framework, we propose two RL algorithms with different cooperation levels, which are described in sec. IV and whose simulation results are presented in sec. V.

II. END-TO-END SLA NEGOTIATION

As mentioned in [3], future Internet architectures must be able to support novel high-performance services such as remote surgery, VPNs and cloud computing. Some of them are already deployed, but network resources are either dedicated to them or statically configured. Both solutions are clearly not scalable in the presence of a large number of such services.

A. Requirements for inter-domain QoS delivery

The challenge is to upgrade current network and management architectures towards on-demand QoS provisioning. The attempts at extending the Internet routing protocol (Border Gateway Protocol, BGP) with QoS attributes (see e.g. [4]) failed to convince the Internet Engineering Task Force (IETF): values of QoS attributes are hard to keep up-to-date at the scale of the Internet, and operators' need for confidentiality on network topology and status is not met. Meanwhile, within a network domain, various technical solutions exist to support QoS-guaranteed services (e.g. GMPLS, DiffServ). However, to provide QoS across networks, operators tend to rely on over-dimensioning, at the detriment of resource optimization, or on manual configuration of their equipment. Among the possible benefits of on-demand provisioning, the reduction of Operational Expenditures and the optimization of long-term revenues are two crucial arguments. Technical and economic constraints of inter-domain QoS delivery should also be considered in order to support investments both in network and application infrastructures. Technical constraints are related to the independence of network management systems, the heterogeneity of systems, etc. A strong economic requirement is the operators' thirst for confidentiality and the claim for fair revenue sharing among applications
and network providers. For that purpose, we assume that the negotiation of SLAs occurs in a group of domains which trust each other, thus forming a federation. Federations of networks allow operators to cooperate while keeping confidentiality and independence in their pricing policies. While we assume that such an organization could be mandatory for end-to-end QoS delivery [3], the authors of [5] even contemplate the existence of alliances to support best-effort Internet traffic.

B. Inter-domain SLA negotiation

To meet both technical and economic requirements, we propose an approach in two phases: i) in a first phase, network resources, represented in an abstract manner as QoS parts (e.g. delay ≤ 20ms, bandwidth ≥ 1 Gbit/s, packet-loss ≤ 1%, etc.) of potential contracts, called Service Level Agreements (SLAs), are bilaterally negotiated among network operators; and ii) each selected domain enforces its SLA with the technology of its choice. A SLA contains various attributes, including a price, a duration, etc., and is available to a subset of neighbor domains. In the negotiation phase, we assume that only QoS parts and prices are considered when selecting a SLA. Another confidentiality requirement is that a SLA must remain private between its contractors.
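To fix ideas on the combination performed in the first phase, the following sketch (ours; the parameter set and all names are illustrative, not the paper's model) applies the usual composition rules: delay adds up along a chain, bandwidth is a bottleneck minimum, and packet-loss combines multiplicatively through the probability of successful delivery.

import java.util.List;

// Illustrative sketch of how QoS parts of SLAs could combine along a chain.
final class QosPart {
    final double maxDelayMs;       // additive along a chain
    final double minBandwidthGbps; // bottleneck: minimum along a chain
    final double maxPacketLoss;    // composes multiplicatively

    QosPart(double maxDelayMs, double minBandwidthGbps, double maxPacketLoss) {
        this.maxDelayMs = maxDelayMs;
        this.minBandwidthGbps = minBandwidthGbps;
        this.maxPacketLoss = maxPacketLoss;
    }

    /** Combine the QoS parts of a chain of SLAs into an end-to-end QoS part. */
    static QosPart combine(List<QosPart> chain) {
        double delay = 0.0, bandwidth = Double.POSITIVE_INFINITY, okProb = 1.0;
        for (QosPart q : chain) {
            delay += q.maxDelayMs;                                // delay is additive
            bandwidth = Math.min(bandwidth, q.minBandwidthGbps);  // bottleneck
            okProb *= (1.0 - q.maxPacketLoss);                    // loss via availability
        }
        return new QosPart(delay, bandwidth, 1.0 - okProb);
    }

    /** Does this end-to-end QoS part satisfy a customer's thresholds? */
    boolean satisfies(QosPart request) {
        return maxDelayMs <= request.maxDelayMs
            && minBandwidthGbps >= request.minBandwidthGbps
            && maxPacketLoss <= request.maxPacketLoss;
    }
}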
Fig. 1. Example domains network.
Figure 1 illustrates an example of an inter-domain network with one customer (domain c), its target (domain t) and three intermediate domains. Each abstract internal link is represented with the SLAs it can support. In the negotiation phase, one must pick a path, either 1−2−t or 1−3−t, and then select in each domain of the chosen path which SLA is the most "relevant" to use. The selection of a SLA is not trivial and raises various issues: how to make an offer the customer is likely to accept, how to deal with the network operators' different interests, and how to avoid over-provisioning the request? We aim at designing a solution answering these three questions: first, we want a solution which takes the customer utility into account, i.e. which is as close as possible to the customer's expectations on price and QoS; second, we would like each domain to optimize its revenues for the long term, i.e. taking into account the further requests to come rather than only the immediate one; finally, we choose a distributed resolution of the problem so that each network operator keeps critical data confidential and works independently of the others.
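For intuition, a naive per-request selection over this example (akin to the instantaneous optimization discussed in the related work below) could be sketched as follows; the Sla record, the single delay parameter, and all identifiers are our simplifications.

import java.util.List;

// Hypothetical per-request selection over the two candidate paths of fig. 1:
// enumerate one SLA per domain on each path, keep the cheapest feasible chain.
final class ChainSelection {
    record Sla(double delayMs, double price) {}

    // End-to-end combination for a single (additive) delay constraint.
    static double chainDelay(List<Sla> chain) {
        return chain.stream().mapToDouble(Sla::delayMs).sum();
    }

    static double chainPrice(List<Sla> chain) {
        return chain.stream().mapToDouble(Sla::price).sum();
    }

    /** Cheapest chain among candidates meeting the end-to-end delay bound. */
    static List<Sla> cheapestFeasible(List<List<Sla>> candidates, double maxDelayMs) {
        List<Sla> best = null;
        for (List<Sla> chain : candidates) {
            if (chainDelay(chain) <= maxDelayMs
                    && (best == null || chainPrice(chain) < chainPrice(best))) {
                best = chain;
            }
        }
        return best; // null if no feasible chain exists
    }
}

Such a selection is exactly what the paper argues against: it ignores future, more profitable requests and the possibility that the customer refuses the offer.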
C. Related work

Per-request optimization. The problem of selecting a SLA chain can be modeled as the problem of optimizing a function (e.g. a profit function) under several QoS constraints. Such a problem is known to be NP-hard when the constraints are multiple and have different mathematical properties (e.g. delay is additive, packet-loss is multiplicative, etc.). Previous works on the topic ([1], [6], [2]) proposed solutions via an instantaneous revenue optimization for each request. These algorithms are based on meta-heuristics (genetic programming [1], ant colony optimization [2]) or on deterministic algorithms [6]. The authors of [7] formalized the problem as a k-Multi-Constrained Optimal Path (k-MCOP) problem, using a graph representation. Their solution runs a Dijkstra algorithm several times (as many times as there are QoS constraints) to provide sub-optimal paths. Extending these works to treat several requests simultaneously would lead to very complex and ill-adapted solutions. Furthermore, it would impose queuing requests until they are treated. This might be adequate for specific contexts, but with respect to the business and real-time context we consider, we opt for Reinforcement Learning (RL) techniques.

Reinforcement Learning. Reinforcement Learning [8] uses Markov Decision Processes (MDPs) as models. A MDP is specified as: a set of states; a set of actions representing the decisions of the agent; a conditional transition probability function that gives the probability of going from one state to another by choosing an action, and which must verify the Markov property; and an immediate reward function representing the gain for a given action in a given state. Several RL algorithms exist which make it possible to learn the best policy (i.e. how to map an action to each state) in order to maximize long-term benefits, weighted by a discount factor. In the SLA negotiation, the global profit obtained by a provider relies on the other providers' willingness to accept (which depends on their network utilizations) and on the customer's willingness to accept an offer. The customer utility can be based on price but also on other information the system will reveal to him (e.g. the final level of QoS, the name of sub-contractors, the network providers in the chain, etc.). However, from the provider's point of view, these utilities, which define the transition probability, are unknown. Hence, among RL algorithms, we focus on Q-learning [9], further detailed in sec. IV-A, which learns in a model-free context (the transition probability is not known a priori).

Distributed learning. Because of confidentiality requirements, we opt for a distributed system. Many distributed learning solutions have been proposed in various contexts: load balancing for search in peer-to-peer networks [10], network routing [11], team coordination [12], etc. All these proposals face the same question: how to define the state of a learning agent when the global knowledge is split among several agents? This question is crucial: if the state (or action) space is too large, agents cannot acquire experience and the convergence of the learning algorithm is very slow; conversely, if
it is too small, agents cannot find an optimal action because states are too coarse, so they converge fast but to a sub-optimal solution. This led us to consider different cooperation levels between the domains, defining what the domains should communicate of their state, and to which other domains. In [13], the authors focused on the dynamic negotiation of transit prices between network domains. To stay in line with BGP, only one route, the cheapest, is selected for each destination. The proposed distributed algorithm applies the principles of Linear Reward Inaction (LRI) to converge to a sub-game perfect equilibrium, which allows the domains with the lowest costs to share high revenues. The LRI algorithm is applicable to "simple" strategies, while the SLA negotiation problem involves more variables.

III. DISTRIBUTED SLA NEGOTIATION

This section describes the operations required on SLAs to achieve a distributed end-to-end negotiation, and the associated distributed framework. A domain receives a request from one of its neighbors (the customer), treats it and forwards it to another neighbor able to reach the target domain.

A. SLA combination

As suggested in sec. II-B, QoS parts of SLAs are sets of thresholds over QoS parameters. To build up an end-to-end QoS contract meeting end-to-end QoS requirements, the SLAs of a sequence of domains have to be combined using these thresholds w.r.t. their mathematical properties (e.g. the delay is additive, the availability is multiplicative, etc.).

SLA definition. A domain is denoted i, where i = 1, . . . , N and N is the number of considered domains. A domain i defines its own set of SLAs, E_i, where each SLA, denoted e_i^j ∈ E_i, j = 1, . . . , |E_i|, is a vector of K QoS parameters. A QoS parameter value is denoted w_k, where k = 1, . . . , K. A price function p_i(·) associates a price with each SLA.

For the NC algorithm, the MDP of a domain i is defined as follows:
• the state of domain i is built at a request arrival d_i and is described by a set of tuples (next domain, SLA, internal capacity), denoted s_{d_i}^τ = {(n, e_i^j, cap_i(e_i^j)), n ∈ neigh(i, t), e_i^j ∈ E_{d_i}^n s.t. cap_i(e_i^j) > 0}, where the SLAs are the ones respecting the QoS demanded by the request;
• an action a^τ ∈ A is the choice of a SLA e_i^* and of a neighbor domain n for the request d_i at epoch τ;
• the immediate reward function coincides with the revenue derived from the assignment of the SLA of the chosen action at epoch τ.

The internal capacity of the SLAs is included in the states so that the domain is able to know which choice will have the biggest impact on its long-term revenue (future availability of the SLAs). SLAs with a capacity grade of 0 are excluded from the action spaces. Each domain deploying the NC algorithm uses this MDP and the Q-learning algorithm within the frame of the distributed process described in sec. III-C. More precisely, the Q-learning algorithm is executed in the "yellow" part of the sequence diagram depicted by fig. 2.
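To make the learning step concrete, here is a minimal tabular Q-learning core such as a domain could run for this MDP; states and actions are encoded as opaque keys, and everything below is our sketch rather than the authors' implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal sketch of the tabular Q-learning core of a domain in the NC algorithm.
final class DomainQLearner {
    private final Map<String, Double> q = new HashMap<>(); // Q(s,a) table
    private final Random rng = new Random();
    private final double gamma;  // discount factor (1/2 in the experiments of sec. V)
    private final double epsilon; // exploration rate, decayed elsewhere

    DomainQLearner(double gamma, double epsilon) {
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    private double qValue(String s, String a) {
        return q.getOrDefault(s + "|" + a, 0.0);
    }

    /** epsilon-greedy choice of an action, i.e. a (next domain, SLA) couple. */
    String chooseAction(String s, List<String> actions) {
        if (actions.isEmpty()) throw new IllegalStateException("no admissible SLA");
        if (rng.nextDouble() < epsilon)
            return actions.get(rng.nextInt(actions.size()));
        String best = actions.get(0);
        for (String a : actions)
            if (qValue(s, a) > qValue(s, best)) best = a;
        return best;
    }

    /** Standard Q-learning update once the reward (SLA revenue, or 0) is known. */
    void update(String s, String a, double reward,
                String nextS, List<String> nextActions, double alpha) {
        double maxNext = 0.0;
        for (String a2 : nextActions)
            maxNext = Math.max(maxNext, qValue(nextS, a2));
        double old = qValue(s, a);
        q.put(s + "|" + a, old + alpha * (reward + gamma * maxNext - old));
    }
}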
D. Recursive Cooperation (RC) algorithm

In this algorithm, domains communicate with their direct neighbors, sending an aggregated capacity representing how they and their next domains are able to satisfy a certain request type (i.e. requests having the same target and QoS requirements). This aggregated capacity, denoted κ_i, is computed using the fuzzy capacity cap_i(e_i^j) defined in sec. III-D. When a domain i receives a request d_i, it identifies the possible next domains n ∈ neigh(i, t) and SLAs e_i^j ∈ E_{d_i}^n. Then, it computes the request to forward, d_n = d_i ⊖ e_i^j, and asks the chosen next domain n for its aggregated capacity κ_n(d_n), which is defined recursively by eq. (2). This capacity is the one of a sequence of domains for a given SLA. Finally, the domain updates its own aggregated capacity associated with this SLA using eq. (2) and (3). The aggregated capacity (3) is thus updated at the reception of each request of the same type. Note that, by definition, κ_i ∈ [0..G−1].

\kappa_i(e_i^j, d_i, n) = \min\big( cap_i(e_i^j),\ \kappa_n(d_i \ominus e_i^j) \big)   (2)

\kappa_i(d_i) = \max_{n \in neigh(i,t),\ e_i^j \in E_{d_i}^n} \kappa_i(e_i^j, d_i, n)   (3)
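A minimal sketch of how eq. (2) and (3) could be computed follows, assuming a neighbor-query stub and leaving the request-combination operator ⊖ abstract; all names are ours.

// Sketch of the aggregated-capacity update of eq. (2) and (3). The neighbor
// query and the forwarded request d (-) e are stubbed with hypothetical types.
final class AggregatedCapacity {
    static final int G = 3; // capacity grades lie in [0..G-1]

    interface Neighbor {
        /** Returns kappa_n(d_n), or the default G - 1 if the value timed out. */
        int aggregatedCapacity(String forwardedRequest);
    }

    /** Eq. (2): kappa_i(e, d, n) = min(cap_i(e), kappa_n(d (-) e)). */
    static int kappaForSla(int internalCap, Neighbor n, String forwardedRequest) {
        return Math.min(internalCap, n.aggregatedCapacity(forwardedRequest));
    }

    /** Eq. (3): kappa_i(d) = max over admissible (n, e) couples of eq. (2). */
    static int kappa(int[] internalCaps, Neighbor[] nexts, String[] forwarded) {
        int best = 0;
        for (int k = 0; k < internalCaps.length; k++)
            best = Math.max(best, kappaForSla(internalCaps[k], nexts[k], forwarded[k]));
        return best;
    }
}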
This categorization of requests avoids storing too many capacity values and thus prevents flooding. A timeout is set so that capacity values computed too long ago are replaced with a default value (the upper bound of [0..G−1]). The MDP used to apply the Q-learning algorithm to this RC process is close to the one of the NC algorithm; only the state definition differs:
• the state of domain i is built at a request arrival d_i and is described by a set of tuples (next domain, SLA, aggregated capacity), denoted s_{d_i}^τ = {(n, e_i^j, κ_i), n ∈ neigh(i, t), e_i^j ∈ E_{d_i}^n s.t. cap_i(e_i^j) > 0}.
E. Convergence and Complexity

The convergence of the Q-learning algorithm depends on the sizes of the state and action sets: the larger they are, the slower the convergence is. To simplify notations, neigh(i) denotes the set of all direct neighbors of domain i.

Theorem 1. The size of the set of actions is bounded by |E_i| × (|neigh(i)| − 1) for both algorithms.

Theorem 2. The size of the set of states is bounded by G^{|E_i| × (|neigh(i)| − 1)} for the NC algorithm and by (G + 1)^{|E_i| × (|neigh(i)| − 1)} for the RC algorithm.

Proof: An action is the choice of a couple (next domain, SLA). Hence, in the worst case, where all SLAs satisfy the request and all neighbors allow reaching the target domain, there are at most (|neigh(i)| − 1) × |E_i| such couples (the −1 comes from the fact that the domain sending the request should be ignored), which proves Theorem 1. A state is a subset of the same possible couples, associated with either the internal capacity of a SLA in the case of the NC algorithm or the aggregated capacity in the case of the RC algorithm. To be integrated in a state, a SLA must respect the QoS demand received at epoch τ and have a positive internal capacity. So, in the worst case, all SLAs satisfy all requests and lead to all neighbor domains. In the definition of a state for the NC algorithm, a condition is that the internal capacity is greater than 0, while in the RC algorithm, the same definition does not forbid a κ value of 0. Hence, there are G + 1 choices for each couple for the RC algorithm, and G for the NC algorithm, which proves Theorem 2.
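To make these bounds concrete, here is a small worked instance (the numbers are ours, for illustration only):

% Worked instance of Theorems 1-2 with illustrative numbers:
% |E_i| = 4 SLAs, |neigh(i)| = 3 neighbors, G = 3 capacity grades.
\begin{align*}
  |A|               &\le |E_i| \times (|\mathrm{neigh}(i)| - 1) = 4 \times 2 = 8,\\
  |S_{\mathrm{NC}}| &\le G^{|E_i| \times (|\mathrm{neigh}(i)| - 1)} = 3^{8} = 6561,\\
  |S_{\mathrm{RC}}| &\le (G+1)^{|E_i| \times (|\mathrm{neigh}(i)| - 1)} = 4^{8} = 65536.
\end{align*}

Even such a tiny instance makes clear that it is the state space, not the action space, that can hamper convergence.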
So the convergence rates of both algorithms are close, thanks to the aggregated capacity definition used in the RC algorithm. One can observe that the number of states can become large depending on the number of SLAs and on the inter-domain topology. Hence, in the case of a large number of SLAs per domain and a full-meshed network, the convergence rate of the algorithm could be impacted. We leave for future work the inquiry into methods to overcome these issues.

In terms of exchanged messages, the NC algorithm requires only four messages (request, offer, answer, acknowledgment) between neighboring domains for each request. The RC algorithm requires one more type of message, asking for the capacities of each request that a domain could possibly send: a domain asks for at most |E_i| capacities, one per SLA that could be chosen.

V. EXPERIMENTS

We implemented a simulation environment in Java and ran experiments on a machine with a 3.5 GHz CPU and 3.5 GB of RAM. We compared four algorithms: a random algorithm, labeled "random" on fig. 4, where domains randomly choose a next domain and a SLA satisfying the request; an optimization algorithm, labeled "Min OPT", selecting the cheapest SLA; and the NC and RC Q-learning algorithms described in sec. IV. Optimization over batches of requests has not been studied due to its poor performance [16]. We observed: i) the performance of the algorithms in terms of quality of response, i.e. the customer satisfaction rate (number of accepted offers vs. the number of sent requests) and the gain of a domain (price of the sold SLAs, or 0 if the offer is refused); and ii) the runtime performance according to the number of SLAs.
Fig. 4. Simulation results: (a) domain gains, standard conditions; (b) customer satisfaction, standard conditions; (c) request processing time, standard conditions; (d) domain gains, customer focused on prices; (e) domain gains, varying capacity; (f) request processing time, varying SLA number.
A. Settings

The simulated network is depicted in fig. 1. A source domain, labeled "customer", sends a request to domain 1 at each simulation step, with domain t as target. Two domain paths are feasible: one goes through domains 1, 2 and t, the second through domains 1, 3 and t. Domain 1 is the observed domain. This simulated network is close to a realistic situation, where requests go through 5 carriers in the worst case. However, the number of alternative paths could be higher. We leave for future work the study of larger topologies and of the impact of carriers' interconnections on the algorithms.

Capacity and Acceptance. The network load is simulated by internal links on each domain. A link is associated with a maximal available bandwidth: when a SLA is committed, the remaining bandwidth is recomputed. The internal capacity mentioned in sec. III-D is bounded by G = 3 and set as follows: 0 when no bandwidth remains, 1 when the remaining bandwidth is at most 10% of the maximal one, and 2 when it is more than 10% of the maximal one. Hence, an internal capacity of 1 is a warning that the SLA will soon be unavailable. The acceptance of an offer, as defined by eq. (1), is used with the various parameter values described in sec. V-C. The maximal price the customer is ready to pay (p̃(d_i) in eq. (1)) is set using the minimum summed price of the SLA chains that satisfy the QoS, so there is always a feasible chain.

SLAs and requests. QoS values of SLAs are generated randomly following a uniform probability law, except for the guaranteed bandwidth, for which levels are defined. So, each QoS value is available with all levels of bandwidth, thus ensuring a higher number of feasible chains. The price function of domains uses the distance of each QoS parameter value to the minimum and maximum of this parameter. QoS parts of requests are generated in the same manner. Requests are distributed over these requirements with a normal law.

Q-learning parameters. Both algorithms use the same parameter values. The learning rate is updated following the recommendations of [15]: α_τ = (τ + 1)^{−0.85}. The discount factor is set to γ = 1/2, meaning that present and future rewards have an equal importance. The value of ε decreases uniformly from 1 to 0.01 and then remains constant, so that the learning is first very fast, and then slow. To ensure the validity of experimental data, the same set of 250 requests is sent multiple times with the same arrival law, forming a "run". A full simulation consists of 60 runs, each run having the same requests and with a re-initialization of link bandwidths between runs. Each full simulation has been run 30 times in the same conditions, and results have been averaged to be more reliable.
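Read literally, the capacity grading and the learning schedules above could be coded as follows. This is our sketch of the stated settings, with hypothetical names; the ε decay horizon is left as a parameter since the paper does not give it.

// Sketch of the simulation settings of sec. V-A, as we read them.
final class Settings {
    static final int G = 3; // capacity grades: 0, 1, 2

    /** Internal capacity grade from the remaining bandwidth of a link. */
    static int capacityGrade(double remaining, double maximal) {
        if (remaining <= 0.0) return 0;            // no bandwidth left
        if (remaining <= 0.10 * maximal) return 1; // warning: soon unavailable
        return 2;                                  // more than 10% remains
    }

    /** Learning rate alpha_tau = (tau + 1)^{-0.85}, following [15]. */
    static double learningRate(long tau) {
        return Math.pow(tau + 1, -0.85);
    }

    /** Exploration rate decayed uniformly from 1 to 0.01, then kept constant.
        The decay horizon (decaySteps) is our assumption, not given in the paper. */
    static double epsilon(long tau, long decaySteps) {
        if (tau >= decaySteps) return 0.01;
        return 1.0 - (1.0 - 0.01) * ((double) tau / decaySteps);
    }
}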
B. Standard conditions

A first set of experiments was conducted with 100 SLAs per domain, a customer interested mainly in price (β = 0.8), and an internal capacity covering the bandwidth demanded by all requests. As illustrated by fig. 4(a) and 4(b), the satisfaction of the customer and the domain's gains are tightly related. Both Q-learning algorithms behave in the same way and stabilize after 10 runs (2500 requests). The RC algorithm exhibits better results than the NC algorithm. The random and Min OPT algorithms are less efficient than both Q-learning algorithms (ca. 10% of satisfied requests). As confirmed by fig. 4(c), the NC algorithm shows better runtime performance than the RC algorithm: completely processing a request requires about 1.4 ms for the NC algorithm and 4.3 ms for the RC algorithm. The random and Min OPT algorithms are obviously very fast (ca. 0.1 ms).
C. Influence of the environment

Customer utility. Depending on the customer utility, domains should apply different strategies: if a better QoS matters more to the customer than a cheap one, domains will get higher revenues by proposing a better QoS, and vice versa. Fig. 4(d) illustrates that when only the price matters (β = 1), the convergence is slower (30 runs, 7500 requests) than in standard conditions. Nevertheless, both algorithms converge to high revenues. Note that the random and Min OPT algorithms perform badly because they do not anticipate bandwidth utilization. Changing the customer utility or the domains' bandwidth capacities statically does not interfere with the learning process; it only affects the convergence speed. But making the capacity of a domain vary during a simulation affects the results.

Capacity variations. A set of experiments was conducted with domain 3's bandwidth alternating, at each run, between the total bandwidth of all the requests and 15% of this same bandwidth. The results, shown in fig. 4(e), reveal that domain gains vary smoothly accordingly. With the RC algorithm, domain 1 is aware of the load of domain 3, while with the NC algorithm, domain 1 cannot adapt to the varying load of domain 3. Fig. 4(e) exhibits that the gain performance of the RC algorithm is less impacted by bandwidth variations. Such variations are quite common in a realistic environment (e.g. because of requests made by other domains).

D. Scalability

The number of SLAs per domain is an important component of the RC and NC convergence properties. In fig. 4(f), the number of SLAs per domain varies from 100 to 1000. One can observe that the runtime of the RC algorithm is higher and increases faster than the runtime of the NC algorithm. At more than 900 SLAs per domain, the RC algorithm needs more than a second to answer. In realistic cases, the number of SLAs could be higher. However, SLAs and requests are often categorized by type of application, thus reducing the space to explore, which was not the case in our experiments. We leave scalability analysis in large and fully interconnected topologies for future work.

VI. CONCLUSION

Democratizing QoS-demanding applications over the Internet requires operators to upgrade their network management systems with, among others, real-time and automated inter-domain SLA negotiation algorithms. The adapted Q-learning algorithms proposed in this paper meet operators' requirements on confidentiality and long-term revenue optimization, while being able to adapt to a complex environment without an explicit model of it. Both algorithms present different properties: the NC algorithm does not require any communication of the domains' states, while the RC algorithm requires domains to communicate aggregated capacities to their direct neighbors.
Simulation results exhibit that the NC algorithm is also advantageous compared to per-request optimization and is much faster than the RC algorithm. But if capacities change, the RC algorithm adapts faster. Experimental results also confirmed that both algorithms guarantee higher revenues than local optimization or random choices. The choice between these two algorithms depends on the speed of change of the domains' capacities and on the number of SLAs each domain needs to support. Further studies also have to be conducted in large inter-carrier topologies, since the number of interconnections plays a role in the algorithms' complexity. These scalability issues may be tackled in future work by learning which updated requests should be computed, instead of computing one for each candidate SLA. Defining and using continuous spaces in the RC algorithm could also help decrease the impact of a large number of SLAs.

REFERENCES

[1] M. P. Howarth et al., "End-to-end quality of service provisioning through inter-provider traffic engineering," Computer Communications, 2006.
[2] H. Pouyllau and N. Djarallah, "Distributed ant algorithm for inter-carrier service composition," in Euro-NGI Conference on Next Generation Internet Networks. IEEE Press, 2009.
[3] N. Le Sauze, A. Chiosi, R. Douville, H. Pouyllau, H. Lonsethagen, P. Fantini, C. Palasciano, A. Cimmino, M. A. C. Rodriguez, O. Dugeon, D. Kofman, X. Gadefait, P. Cuer, N. Ciulli, G. Carrozzo, A. Soppera, B. Briscoe, F. Bornstaedt, M. Andreou, G. Stamoulis, C. Courcoubetis, P. Reichl, I. Gojmerac, J. L. Rougier, S. Vaton, D. Barth, and A. Orda, "ETICS: QoS-enabled interconnection for future internet services," in Future Network and Mobile Summit, 2010.
[4] O. Bonaventure, "Using BGP to distribute flexible QoS information," Internet draft draft-bonaventure-bgp-qos-00.txt, 2001.
[5] X. Hu, P. Zhu, K. Cai, and Z. Gong, "AS alliance in inter-domain routing," in International Conference on Advanced Information Networking and Applications, March 2008.
[6] R. Douville, J.-L. Le Roux, J. Rougier, and S. Secci, "A service plane over the PCE architecture for automatic multi-domain connection-oriented services," IEEE Communications Magazine, 2008.
[7] J. Xiao and R. Boutaba, "QoS-aware service composition and adaptation in autonomic communication," IEEE Journal on Selected Areas in Communications, vol. 23, pp. 2344–2360, 2005.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, 1998.
[9] C. J. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, 1992, pp. 279–292.
[10] S. M. Thampi and C. S. K., "Q-learning based collaborative load balancing using distributed search for unstructured P2P networks," in IEEE Local Computer Networks, 2008.
[11] M. Littman and J. Boyan, "A distributed reinforcement learning scheme for network routing," in Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications. IEEE Communications Society, 1993, pp. 45–51.
[12] J. Huang, B. Yang, and D.-Y. Liu, "A distributed Q-learning algorithm for multi-agent team coordination," in International Conference on Machine Learning and Cybernetics, 2005.
[13] D. Barth, J. Cohen, L. Echabbi, and C. Hamlaoui, "Transit prices negotiation: Combined repeated game and distributed algorithmic approach," in First EuroFGI International Conference, 2007, pp. 266–269.
[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[15] E. Even-Dar and Y. Mansour, "Learning rates for Q-learning," Journal of Machine Learning Research, vol. 5, 2003.
[16] H. Pouyllau and G. Carofiglio, "Inter-carrier SLA negotiation using Q-learning," Telecommunication Systems, Special Issue on Socio-economic Issues of Next Generation Networks, 2010.