A Multi-Agent Q-Learning-based Framework for Achieving Fairness in HTTP Adaptive Streaming

Stefano Petrangeli∗, Maxim Claeys∗, Steven Latré†, Jeroen Famaey∗, Filip De Turck∗
∗ Department of Information Technology (INTEC), Ghent University - iMinds, Gaston Crommenlaan 8 (Bus 201), 9050 Ghent, Belgium, email: [email protected]
† Department of Mathematics and Computer Science, University of Antwerp - iMinds, Middelheimlaan 1, 2020 Antwerp, Belgium

Abstract—HTTP Adaptive Streaming (HAS) is quickly becoming the de facto standard for Over-The-Top video streaming. In HAS, each video is temporally segmented and stored in different quality levels. Quality selection heuristics, deployed at the video player, allow dynamically requesting the most appropriate quality level based on the current network conditions. Today's heuristics are deterministic and static, and thus unable to perform well under highly dynamic network conditions. Moreover, in a multi-client scenario, issues concerning fairness among clients arise, meaning that different clients negatively influence each other as they compete for the same bandwidth. In this article, we propose a Reinforcement Learning-based quality selection algorithm able to achieve fairness in a multi-client setting. A key element of this approach is a coordination proxy in charge of facilitating the coordination among clients. The strength of this approach is three-fold. First, the algorithm is able to learn and adapt its policy depending on network conditions, unlike current HAS heuristics. Second, fairness is achieved without explicit communication among agents and thus no significant overhead is introduced into the network. Third, no modifications to the standard HAS architecture are required. By evaluating this novel approach through simulations, under mutable network conditions and in several multi-client scenarios, we show that the proposed approach can improve system fairness by up to 60% compared to current HAS heuristics.
I. INTRODUCTION
Nowadays, multimedia applications are responsible for an important portion of the traffic exchanged over the Internet. One of the most relevant applications is video streaming. In particular, HTTP Adaptive Streaming (HAS) techniques have gained a lot of popularity because of their flexibility, and can be considered the de facto standard for Over-The-Top video streaming. Microsoft's Smooth Streaming, Apple's HTTP Live Streaming and Adobe's HTTP Dynamic Streaming are examples of proprietary HAS implementations. In a HAS architecture, video content is stored on a server as segments of fixed duration, encoded at different quality levels. Each client can request each segment at the most appropriate quality level on the basis of the locally perceived bandwidth. In this way, video playback dynamically adapts to the available resources, resulting in smoother video streaming. The main disadvantage of current HAS solutions is that the heuristics used by clients to select the appropriate quality level are fixed and static. This entails that they can fail to adapt under highly dynamic network conditions, resulting in playback freezes or frequent quality switches that negatively affect the user-perceived video quality, the so-called Quality of Experience (QoE).
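To make the limitation concrete, the following minimal Python sketch (our own illustration, not the algorithm of any specific player) shows the kind of fixed rule a deterministic HAS heuristic applies: always pick the highest bitrate that fits the current bandwidth estimate, regardless of what other clients are doing. The bitrate ladder (the one used later in Table II) and the safety margin are assumptions for illustration only.

```python
# Minimal sketch of a fixed, deterministic HAS quality-selection rule
# (our own illustration, not any specific player; the safety margin is an assumption).
BITRATES_KBPS = [300, 427, 608, 806, 1233, 1636, 2436]  # quality levels 1..7

def select_quality(estimated_bandwidth_kbps: float, safety_margin: float = 0.9) -> int:
    """Return the 1-based quality level: the highest bitrate that fits the
    (margin-scaled) bandwidth estimate, regardless of competing clients."""
    budget = estimated_bandwidth_kbps * safety_margin
    level = 1
    for i, rate in enumerate(BITRATES_KBPS, start=1):
        if rate <= budget:
            level = i
    return level

print(select_quality(1500))  # -> 5, always the same answer for the same estimate
```

Such a rule cannot adapt its own behavior when its decisions turn out to hurt other clients sharing the same bottleneck.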
In order to overcome these issues, we propose to embed a Reinforcement Learning (RL) agent [13] into HAS clients, in charge of dynamically selecting the best quality level on the basis of its past experience. As shown in a single-client scenario [14], this approach is able to outperform current HAS heuristics, achieving a QoE gain of up to 10% even under dynamic network conditions. In a real scenario, multiple clients simultaneously request content from the HAS server. Often, clients have to share a single medium and issues concerning fairness among them arise, meaning that the presence of a client has a negative impact on the performance of the others. In particular, fairness can be defined both from an application-aware point of view, as the deviation among clients' achieved QoE, and from an application-agnostic point of view, as the deviation among clients' achieved bit rates. Moreover, a fundamental aspect to consider when dealing with learning in a multi-agent setting is that the learning process of an agent influences that of the other agents. For example, when a client selects the i-th quality level, it uses a portion of the shared bandwidth. This decision can have an impact on the performance of the other clients and thus also on their learning process. This mutual interaction can lead to unstable behavior (e.g., the learning process never converges) or unfair behavior (e.g., some agents control all the resources to the detriment of the others). Traditional HAS clients also exhibit fairness issues. The main drawback here is that HAS heuristics are static and uncoordinated. This entails that they are neither aware of the presence of other clients nor able to adapt their behavior to deal with it. Classical TCP rate adaptation algorithms are not effective in this case, since the quality selection heuristics partly take over their role by deciding on the rate to download. In this paper, we investigate the aforementioned problems arising in a multi-client setting. In particular, we present a multi-agent Q-Learning-based HAS client able to achieve smooth video playback, while coordinating with the other clients in order to improve the fairness of the entire system. This goal is reached with the aid of a coordination proxy, in charge of collecting measurements on the behavior of the entire agent set. This information is then used by the clients to refine their learning process and develop a fair behavior. The main contributions of this paper are three-fold. First, we present a Q-Learning-based HAS client able to learn the best action to perform depending on network conditions, in order to provide smoother video streaming than current deterministic HAS heuristics and improve system fairness. Second, we design a multi-agent framework to help agents coordinate their behavior, which requires
neither explicit agent-to-agent communication nor a centralized decision process. Consequently, the quality level selection can still be performed locally and independently by each client, without any modification to the general HAS principle. Third, detailed simulation results are presented to characterize the gain of the proposed multi-agent HAS client compared to the proprietary Microsoft IIS Smooth Streaming algorithm and a single-agent Q-Learning-based HAS client [14]. The remainder of this article is structured as follows. Section II reports related work on HAS optimization and multi-agent algorithms. Section III formally introduces the fairness problem in a multi-client scenario, while Section IV presents a short overview of the single-agent Q-Learning client. Next, Section V illustrates the proposed multi-agent Q-Learning HAS client, both from an architectural and an algorithmic point of view. In Section VI, we evaluate our HAS client through simulation and show its effectiveness compared to current HAS heuristics. Section VII concludes the paper.
II. RELATED WORK
A. HAS Optimization

Akhshabi et al. provide a good overview of the performance and drawbacks of current HAS heuristics [1]. The authors point out that current HAS heuristics are effective in non-demanding network conditions, but fail when rapid changes occur, leading to drops in the playback buffer or unnecessary quality reductions. Moreover, these solutions carry out the quality level decision in a very conservative way. Furthermore, it is shown that two clients sharing the same bottleneck do not develop a fair behavior. Akhshabi et al. investigate the main factors influencing fairness in HAS [2]. They report that the mutual synchronization among clients is a relevant cause of unfairness. In particular, unfair behavior emerges when clients request video segments at different times, since this leads to wrong bandwidth estimations. Several HAS clients have been proposed to deal with the aforementioned problems. Jarnikov et al. propose a quality decision algorithm based on a Markov Decision Process, which requires offline training [3]. Liu et al. describe a step-wise client, but do not consider the playback video buffer level in the decision process [4]. De Cicco et al. use a centralized quality decision process exploiting control theory techniques [5]. They have also studied a scenario where two clients share the same bottleneck and shown how their approach can result in a fair behavior, at least from the network point of view. Villa et al. investigate how to improve fairness by randomizing the time interval at which clients request a new segment [6]. A similar approach is also used by Jiang et al. [7]. They study different design aspects that may lead to fairness improvements, including the quality selection interval, a stateful quality level selection and a bandwidth estimation using a harmonic mean instead of an arithmetic one. In general, the works available in the literature share some of the following drawbacks. First, the proposed algorithms are fixed and static. This entails that they are not able to modify their behavior online, taking into account their actual performance. Second, fairness is not explicitly considered in the client design, but is merely a by-product of it. Third, they do not evaluate their outcomes from a QoE point of view. This can lead to a quality selection process that optimizes network resource utilization but not the quality perceived by the user. In this
work, we incorporate the experience acquired by the client into the quality selection algorithm, using an RL approach. Moreover, we explicitly design our client to improve system fairness and exploit a QoE model to evaluate the obtained results directly at the user level.
B. Multi-Agent Algorithms

As far as multi-agent systems are concerned, a good overview of the different approaches and challenges is provided by Vidal [8]. Multi-agent algorithms can be subdivided into two categories: centralized and distributed. In the centralized case, agents communicate with a central entity in charge of deciding the best action set. The work presented by Bredin et al. belongs to this category [9]. They propose a market-based system for resource allocation, where each agent bids for computational priority at a central server, which ultimately decides on the resource allocation. This type of solution is not applicable in the HAS case, since each client has to decide autonomously which quality level to request. In a distributed approach, agents can communicate directly with each other. The nature of this communication has a big impact on system performance and on the feasibility of the algorithm in a real scenario. Dowling et al. propose a model-based collaborative RL approach, where agents exchange their rewards to optimize routing in ad-hoc networks [10]. In classical fixed or mobile networks, agents may not be able to communicate directly with each other due to the overhead introduced into the network. Moreover, Schaerf et al. show how naive agent-to-agent communication may even reduce system efficiency [11]. Crites et al. study the problem of elevator group control, using distributed Q-Learning agents fed with a common global reward [12]. This approach only allows influencing global system performance and not that of each single agent separately. In a HAS setting, we are also interested in optimizing the local performance of the clients. Consequently, we use a reward composed of both a local and a global term. Theoretical investigation of multi-agent RL algorithms mainly concentrates on stochastic games, exploiting game theory techniques. An example is the aforementioned work by Bredin et al., where the central server allocates resources by computing a Nash equilibrium point for the system. These approaches require strong assumptions (e.g., perfect knowledge of the environment) to work properly, or a huge amount of agent-to-agent communication. The multi-agent HAS client presented in this work requires neither explicit communication among agents nor any a priori assumptions.
III. FAIRNESS PROBLEM STATEMENT
In this section, we formally define the multi-agent fairness problem addressed in this paper. The problem we want to solve is to reach the highest possible video quality at the clients while keeping the deviation among them as low as possible. The formal problem characterization is given in Definition 1:

Definition 1. Multi-Agent Optimization Problem

$$
\begin{aligned}
\underset{q=(q_1,\dots,q_N)}{\text{maximize}} \quad & J(q) = \xi \times QualityIndex(q) + (1-\xi) \times FairIndex(q), \quad \text{with } \xi \in [0,1] \\
\text{subject to} \quad & 1 \leq q_i(k) \leq q_{max} \quad \forall i = 1 \dots N,\ \forall k = 1 \dots K \\
& DT_i^k(q, Bandwidth) \leq BL_i^k \quad \forall i = 1 \dots N,\ \forall k = 1 \dots K
\end{aligned}
$$
with N being the number of clients, K the number of segments the video content is composed of, q_i(k) the quality level requested by the i-th client for the k-th segment, q_i the vector containing all the quality levels requested by client i and q_max the highest available quality level. DT_i^k represents the download time of the k-th segment, while BL_i^k denotes the video player buffer filling level of client i when the k-th segment download starts. Bandwidth is the vector containing the bandwidth pattern. The objective function J(q) is the linear combination of two terms. The first one, QualityIndex(q), measures the overall video streaming quality at the client side. The second term, FairIndex(q), represents the fairness of the system. The final formulation of QualityIndex(q) and FairIndex(q) depends on the actual interpretation given to the video quality at the client. From an application-aware point of view, video quality is explicitly associated with the user-perceived video quality, or QoE. This way, we explicitly focus on achieving fairness from a QoE point of view: clients have to reach a similar perceived video quality. In this case, QualityIndex(q) can be characterized as the average of the clients' QoE values, while fairness can be expressed as the standard deviation from this average. The model used to compute the QoE is explained in Section VI. On the other hand, if the main focus is on network resource optimization, video quality can be associated with the bit rate achieved by the client or, equivalently, with the average quality level requested. With this application-agnostic formulation, we are interested in fairness from a network point of view: the clients' goal is to request the same average quality level, i.e., to equally share the available bandwidth. In light of the above, QualityIndex(q) and FairIndex(q) can be computed as the average and standard deviation of the clients' average requested quality level, respectively. It is worth noting that both the application-aware and the application-agnostic interpretation are valid and can be used depending on the focus given to the multi-agent optimization problem. In the design of our client, we focus on the application-aware interpretation, since it is directly correlated with the user-perceived quality of the video streaming. Nevertheless, the proposed framework can easily be modified to deal with the application-agnostic interpretation of the multi-agent optimization problem. In light of the above, it is clear why QualityIndex(q) and FairIndex(q) have to be optimized together. If we only considered the maximization of the fairness index, agents could obtain similar but unacceptable video qualities. Instead, our goal is also to reach the highest possible video quality at the clients. Depending on the application and scenario, ξ can be tuned to benefit one of the two terms. The second constraint of the optimization problem is intended to avoid freezes in the video playback. The download time of the next segment (DT_i^k) has to be lower than the video player buffer filling level when the download starts (BL_i^k). In this way, the video player buffer will never be empty and freezes are avoided. It is worth noting that the download time is not only a function of the quality level requested by client i, but also of the quality levels downloaded simultaneously by the other clients and of the available bandwidth.
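As a concrete reading of Definition 1 under the application-aware interpretation, the sketch below (our own illustration) computes QualityIndex(q) as the average of the clients' QoE values and derives FairIndex(q) from their standard deviation; using the negated standard deviation so that J(q) grows as the deviation shrinks is our assumption, since the text only states that the index reflects the deviation.

```python
# Sketch of the objective J(q) under the application-aware interpretation.
# Mapping the standard deviation to a maximizable FairIndex (here, its negation)
# is our assumption; the text only states that the index reflects the deviation.
from statistics import mean, pstdev
from typing import Sequence

def quality_index(qoe_per_client: Sequence[float]) -> float:
    """QualityIndex(q): average QoE (MOS) over all clients."""
    return mean(qoe_per_client)

def fair_index(qoe_per_client: Sequence[float]) -> float:
    """FairIndex(q): negated standard deviation, larger (closer to 0) means fairer."""
    return -pstdev(qoe_per_client)

def objective(qoe_per_client: Sequence[float], xi: float = 0.5) -> float:
    """J(q) = xi * QualityIndex(q) + (1 - xi) * FairIndex(q), with xi in [0, 1]."""
    return xi * quality_index(qoe_per_client) + (1 - xi) * fair_index(qoe_per_client)

print(objective([3.8, 3.7, 3.9]))  # high average quality and low deviation
print(objective([4.5, 2.0, 3.0]))  # a larger spread lowers J despite one good client
```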
IV. SINGLE-AGENT Q-LEARNING ALGORITHM
In this section, we provide an overview of the single-agent Q-Learning client proposed by Claeys et al. [14], used as the basis for our multi-agent algorithm presented in Section V. In particular, the local reward experienced by client i when the k-th segment is downloaded is:

$$
r_i(k) = -\,|q_{max} - q_i(k)| - |q_i(k) - q_i(k-1)| - |b_{max} - b_i(k)| \tag{1}
$$
with q_i(k-1) being the quality level requested at the (k-1)-th step, b_i(k) the video player buffer filling level when the download is completed and b_max the buffer saturation level. Moreover, when b_i(k) is equal to zero, i.e., when a video freeze occurs, r_i(k) is set to -100. The first two terms drive the agent to request the highest possible quality level, while keeping quality switches limited. In fact, these two factors have a big impact on the perceived quality. The last term is used to avoid freezes in the video playout, which also have a big impact on the final QoE. The action the RL agent can take at each decision step is to select one of the available quality levels. Consequently, each agent has NL possible actions to perform, with NL being the number of available quality levels. The goal of the learning agent is to select the best quality level depending on two parameters, which compose the agent state space. The first one is the locally perceived bandwidth, while the second is the buffer filling level b_i(k). These two terms are of great importance when deciding on the quality level to download. For example, if the perceived bandwidth is low and the playout buffer is almost empty, the agent will learn to request a low-quality segment, in order to respect the bandwidth limitation and avoid a video freeze. Since both the perceived bandwidth and the playout buffer are continuous quantities, they are discretized into NL + 1 and b_max/T_seg intervals, respectively, where T_seg represents the segment duration in seconds.
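The following sketch (our own rendering; function and variable names are illustrative) summarizes the single-agent formulation: the reward of Eq. 1 with the -100 freeze penalty described above, and the discretization of the two continuous state elements into NL + 1 bandwidth bins and b_max/T_seg buffer bins.

```python
# Sketch of the single-agent reward (Eq. 1) and the state discretization;
# names and binning details are illustrative.
def local_reward(q_k: int, q_prev: int, buffer_k: float,
                 q_max: int, b_max: float) -> float:
    """r_i(k): penalize low quality, quality switches and a low buffer;
    an empty buffer (video freeze) yields a fixed penalty of -100."""
    if buffer_k == 0:
        return -100.0
    return -abs(q_max - q_k) - abs(q_k - q_prev) - abs(b_max - buffer_k)

def discretize_state(bandwidth_bps: float, buffer_sec: float,
                     bw_max_bps: float, n_levels: int,
                     b_max_sec: float, t_seg_sec: float) -> tuple:
    """Map the continuous (bandwidth, buffer) pair to (NL + 1, b_max/T_seg) bins."""
    bw_bin = min(int(bandwidth_bps / bw_max_bps * (n_levels + 1)), n_levels)
    buf_bin = min(int(buffer_sec / t_seg_sec), int(b_max_sec / t_seg_sec) - 1)
    return bw_bin, buf_bin

print(local_reward(q_k=5, q_prev=6, buffer_k=8.0, q_max=7, b_max=10.0))  # -5.0
print(discretize_state(1.2e6, 6.0, 2.5e6, 7, 10.0, 2.0))                 # (3, 3)
```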
V. MULTI-AGENT Q-LEARNING ALGORITHM
In this section, we discuss the proposed multi-agent Q-Learning HAS client. First, we give an architectural overview. Next, we present the multi-agent quality selection algorithm.

A. Architectural Overview

A key element of our multi-agent Q-Learning algorithm is an intermediate node, called the coordination proxy, in charge of helping clients achieve fairness. It monitors the performance of the system and returns this information to the clients, which use it to learn a fair policy. In light of the above, the actual position of the coordination proxy has to be carefully decided depending on the network scenario. In a real setting, multiple HAS clients may belong to different networks. Depending on the position of the bottlenecks, two types of network scenario can be identified.
[Fig. 1. Logical work-flow of the proposed solution: (1) the client issues the next segment request; (2) the coordination proxy estimates the reward and forwards the request to the HAS server; (3) the proxy computes the global signal while the server returns the next segment; (4) the global signal is broadcast to the clients; (5) the client runs its learning process.]
When all clients share a common bottleneck, the best option is to use a single coordination proxy in charge of the entire agent set. For example, the coordination proxy can be embedded into the HAS server. In a second and more complex scenario, multiple bottlenecks may be present simultaneously. In this case, a hierarchy of coordination proxies is needed, exchanging information to coordinate the agents' behavior both locally and globally. In this paper we focus on the first network scenario, while we propose to investigate the more complex one in future work. The logical work-flow of our multi-agent algorithm is shown in Fig. 1. First, each client i requests the next segment to download, at a certain quality, from the HAS server. Based on this information, the coordination proxy estimates the reward the agents will experience in the future. This data is then aggregated into a global signal, representing the status of the entire set of agents. This global signal is then returned to the agents, which use this information to refine their learning process. In particular, the global signal informs an agent about the difference between its performance and that of the entire system. In this way, the agents can learn how to modify their behavior to achieve similar performance, i.e., fairness. In light of the above, the reward estimation and the global signal computation have to be simple enough to avoid overloading the coordination proxy and to maintain scalability. The main advantage of this hybrid approach is two-fold. First, no communication is needed among clients and consequently no significant overhead is introduced. Second, the HAS architecture is not altered. The coordination proxy only has to collect and aggregate the agents' rewards and is not involved in any decision process. Furthermore, the global signal broadcast can be performed by reusing the existing communication channels between the HAS server and the clients.
B. Algorithmic Details

As shown in Fig. 1, our multi-agent algorithm is subdivided into four steps, which are detailed below. We assume here that client i has executed the (k-1)-th step and is waiting for the execution of the k-th step.

1) Reward Estimation: The coordination proxy can compute an estimate of the reward r_i(k) the client will experience at the k-th step, exploiting the information sent when requesting a new segment at a certain quality. In particular, for each client i, the coordination proxy can compute the following:

$$
r_i^f(k) = -\,|q_{max} - q_i(k)| - |q_i(k) - q_i(k-1)| \tag{2}
$$
r_i^f(k) represents an estimate of the local reward the i-th agent will experience at the next, k-th step. We refer to it as an estimate since the buffer filling level term is missing, as can be seen by comparing r_i^f(k) with the reward r_i(k) shown in Eq. 1. The video player buffer filling level b_i(k) is not accessible to the coordination proxy, as it is computed by the client only when the new segment is received, i.e., when the k-th step is actually executed.

2) Global Signal Computation and Broadcast: After having computed the reward estimate for each client, the coordination proxy aggregates them into a global signal. This value represents the global status of the entire system and helps the agents achieve fairness. The global signal formulation has been chosen equal to that of the QualityIndex(q) introduced in Section III, i.e., an average:

$$
gs(k) = \frac{1}{N} \sum_{i=1}^{N} r_i^f(k) \tag{3}
$$
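The proxy-side bookkeeping of these two steps can be sketched as follows (our own illustration; the class and method names are assumptions). On every segment request the proxy updates the requesting client's estimate r_i^f(k) according to Eq. 2 and recomputes gs(k) as the average of the latest estimates (Eq. 3), which can then be returned to the client, e.g. in an HTTP header.

```python
# Sketch of the coordination proxy: per-request reward estimate (Eq. 2) and
# continuously updated global signal (Eq. 3). Class and attribute names are
# illustrative assumptions.
class CoordinationProxy:
    def __init__(self, q_max: int):
        self.q_max = q_max
        self.last_quality = {}      # client id -> previously requested level
        self.reward_estimate = {}   # client id -> latest r_i^f(k)

    def on_segment_request(self, client_id: str, q_k: int) -> float:
        """Update r_i^f(k) for the requesting client and return the new gs(k),
        which would be attached to the HTTP response (e.g. as a header field)."""
        q_prev = self.last_quality.get(client_id, q_k)
        r_f = -abs(self.q_max - q_k) - abs(q_k - q_prev)   # Eq. 2
        self.last_quality[client_id] = q_k
        self.reward_estimate[client_id] = r_f
        # Eq. 3: average over the most recent estimate of every known client.
        return sum(self.reward_estimate.values()) / len(self.reward_estimate)

proxy = CoordinationProxy(q_max=7)
proxy.on_segment_request("client-1", 5)
print(proxy.on_segment_request("client-2", 7))  # gs(k) = (-2 + 0) / 2 = -1.0
```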
Considering that the clients are not synchronized, i.e., they request segments at different moments, gs(k) has to be continuously updated by the coordination proxy whenever a client requests a new segment. The global signal can then be added as an HTTP header field and returned to the agents together with the next segment to play.

3) Learning Process at the Client: In a HAS multi-client scenario, there is a secondary goal with respect to a single-client one. Besides reaching the best possible video streaming at the client side, fairness among clients has to be obtained. In order to achieve these objectives, we modify the reward function shown in Eq. 1 by adding a Homo Egualis-like reward term [15], which follows the theory of inequity aversion. This theory states that agents are willing to reduce their own reward in order to increase that of under-performing agents. The general formulation is given in Eq. 4:

$$
r_i^{he} = r_i - \alpha \sum_{r_i > r_j} \frac{r_i - r_j}{N-1} - \beta \sum_{r_j > r_i} \frac{r_j - r_i}{N-1} \tag{4}
$$

The total reward r_i^{he} is composed of a first term r_i, the local reward an agent experiences while interacting with its environment. The other two terms take into account the performance of the other agents. Each agent experiences a punishment when others have a higher reward as well as when they have a lower reward. r_i^{he} reaches its maximum when r_i = r_j for each j, i.e., when the agents show a fair behavior. The Homo Egualis reward shown in Eq. 4 is not directly applicable to the HAS case, since it requires direct reward communication among the agents. For this reason, we use the global signal gs(k), which has been designed to represent the overall performance of the system. The Homo Egualis reward consequently becomes:

$$
r_i^{he}(k) = r_i(k) - \alpha \max(r_i^f(k) - gs(k), 0) - \beta \max(gs(k) - r_i^f(k), 0) \tag{5}
$$
with r_i(k) being the local reward reported in Eq. 1. In accordance with the computation of the global signal (see Eq. 2 and 3), the punishment terms in Eq. 5 are computed using only the quality and quality-switching components of the reward.
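At the client side, the modified reward of Eq. 5 can be computed as in the sketch below (our own rendering; the function signature is an assumption, and α = β = 1.5 anticipates the setting discussed later in this section).

```python
# Sketch of the Homo Egualis-style client reward (Eq. 5); names are illustrative
# and alpha = beta = 1.5 follows the setting discussed later in this section.
def homo_egualis_reward(r_local: float, r_f: float, gs: float,
                        alpha: float = 1.5, beta: float = 1.5) -> float:
    """r_i^he(k): local reward (Eq. 1) minus a punishment proportional to how far
    the proxy's estimate r_i^f(k) lies above or below the global signal gs(k)."""
    over = max(r_f - gs, 0.0)    # the agent is doing better than the group
    under = max(gs - r_f, 0.0)   # the agent is doing worse than the group
    return r_local - alpha * over - beta * under

# An agent slightly above the group average is pushed back towards gs(k):
print(homo_egualis_reward(r_local=-3.0, r_f=-1.0, gs=-2.0))  # -3.0 - 1.5*1.0 = -4.5
```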
TABLE I. AGENT STATE SPACE

| State Element    | Range                              | Elements      |
|------------------|------------------------------------|---------------|
| Buffer Filling   | [0; b_max] sec                     | b_max / T_seg |
| Bandwidth        | [0; BW_max] bps                    | NL + 1        |
| gs(k)            | [2 × (1 − q_max); 0]               | 3             |
| gs(k) − r_i^f(k) | [2 × (1 − q_max); 2 × (q_max − 1)] | 3             |
It can be noted that in this formulation the coordination proxy acts as a macro-agent representing the behavior of the entire system. The reward reaches its maximum when r_i^f(k) = gs(k), i.e., when the behavior of the agent matches that of the macro-agent. When the reward is far from the global signal, the punishment terms operate to modify the agent's policy. This way, the agents' rewards converge to a similar value, i.e., similar performance is achieved and, consequently, fairness. We fix the values of α and β in Eq. 5 to 1.5 to give more weight to the punishment terms and higher priority to the fairness goal in the learning process of the agents. In order to reinforce the learning process, we also add an element to the agent state, in addition to the perceived bandwidth and the video player buffer filling level (see Section IV). In particular, we explore two possible configurations: (i) using the global signal gs(k) or (ii) using the difference between the global signal gs(k) and the reward estimate r_i^f(k). This way, the agent can also consider the overall system behavior when requesting a new segment. The two state space configurations give the agent slightly different knowledge. In the first case, the agent directly considers the entire system behavior. For example, if gs(k) is close to zero, i.e., the overall system performance is good, the agent will learn to select a quality level that maintains this condition. In the second case, the absolute value of gs(k) is not relevant, because the agent considers its own deviation from it. We discretize gs(k) into three intervals, to represent the conditions where the agent set is performing badly (gs(k) ≈ 2 × (1 − q_max)), normally or well (gs(k) ≈ 0). The value gs(k) − r_i^f(k) has also been discretized into three intervals. In this case, the three intervals represent the situations where the agent behavior is in line with that of the entire set (gs(k) − r_i^f(k) ≈ 0), is under-performing (gs(k) − r_i^f(k) > 0) or is over-performing (gs(k) − r_i^f(k) < 0). The complete state space is given in Table I, with T_seg the segment length in seconds, BW_max the maximum bandwidth in bps and NL the number of available quality levels. From now on, we will refer to the two possible state space configurations presented above as the absolute global signal state space configuration (third row of Table I) and the relative global signal state space configuration (fourth row of Table I), respectively. The RL algorithm embedded into the clients is the well-known Q-Learning [13].
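The extra state element can be discretized as sketched below (our own illustration of the three-interval quantization in Table I; splitting each range into equal thirds is an assumption, since the text only specifies the ranges, the number of intervals and their qualitative meaning).

```python
# Sketch of the three-interval discretization of the extra state element (Table I).
# Splitting each range into equal thirds is our assumption.
def bin_absolute_gs(gs: float, q_max: int) -> int:
    """Absolute configuration: gs(k) in [2*(1 - q_max), 0]
    -> 0 (set performing badly), 1 (normal) or 2 (well)."""
    low = 2 * (1 - q_max)
    third = abs(low) / 3
    if gs <= low + third:
        return 0
    return 1 if gs <= low + 2 * third else 2

def bin_relative_gs(gs_minus_rf: float, q_max: int) -> int:
    """Relative configuration: gs(k) - r_i^f(k) in [2*(1 - q_max), 2*(q_max - 1)]
    -> 0 (over-performing), 1 (in line with the set) or 2 (under-performing)."""
    span = 2 * (q_max - 1)
    third = 2 * span / 3
    if gs_minus_rf <= -span + third:
        return 0
    return 1 if gs_minus_rf <= -span + 2 * third else 2

print(bin_absolute_gs(-1.0, q_max=7))  # 2: the agent set is performing well
print(bin_relative_gs(0.5, q_max=7))   # 1: the agent is in line with the set
```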
VI. PERFORMANCE EVALUATION
A. Experimental Setup

An NS-3-based simulation framework [16], [17] has been used to evaluate our multi-agent HAS client. The simulated network topology is shown in Fig. 2; the capacity C_L depends on the number of clients N and is equal to 2.5×N Mbps. The streamed video trace is Big Buck Bunny, composed of 299 segments, each 2 seconds long and encoded at 7 different quality levels (see Table II for details).
[Fig. 2. Simulated network topology: HAS clients #1 to #N, link A (100 Mbit/s) carrying cross traffic, a link of capacity C_L Mbit/s, and the HAS server with the embedded coordination proxy.]

TABLE II. VIDEO TRACE QUALITY LEVELS

| Quality Level | Bit Rate  |
|---------------|-----------|
| 1             | 300 Kbps  |
| 2             | 427 Kbps  |
| 3             | 608 Kbps  |
| 4             | 806 Kbps  |
| 5             | 1233 Kbps |
| 6             | 1636 Kbps |
| 7             | 2436 Kbps |
The buffer saturation level for each client is equal to 5 segments, or 10 seconds. In order to give the RL algorithm enough time to learn, we simulate 800 episodes of the video trace. During each episode, the same variable bandwidth pattern on link A is used, varying every 250 ms and scaled with respect to the number of clients. The bandwidth model is obtained using a cross traffic generator, introducing traffic ranging from 0 Kbps to 2380×N Kbps into the network. In this way, we obtain an available bandwidth ranging from 120×N Kbps to 2500×N Kbps. As far as the coordination proxy is concerned, it has been embedded into the HAS server. The network topology shown in Fig. 2 represents the situation where many clients share a common bottleneck. In this case, a single coordination proxy is needed and its functions can easily be carried out by the HAS server.

B. QoE Model

As stated in the previous sections, an important aspect to consider when evaluating the performance of a video streaming client is the final quality perceived by the user enjoying the service, the so-called QoE. Consequently, we need to define a model that correlates client performance with user-perceived quality. We use a metric in the same range as the Mean Opinion Score (MOS), computed as in Eq. 6 [14], [18]:
$$
QoE_i(t, t+T) = MOS_i(t, t+T) = 0.81 \times \bar{q}_i(t, t+T) - 0.96 \times \hat{q}_i(t, t+T) + 0.17 - 4.95 \times F_i(t, t+T) \tag{6}
$$

The QoE experienced by client i over the time window [t, t+T] is a linear combination of the average quality level requested, q̄_i(t, t+T), its standard deviation q̂_i(t, t+T) and F_i(t, t+T), which models the influence of freezes and is computed as follows:

$$
F_i(t, t+T) = \frac{7}{8} \times \left( \frac{\ln\left(f_i^{freq}(t, t+T)\right)}{6} + 1 \right) + \frac{1}{8} \times \frac{\min\left(f_i^{avg}(t, t+T), 15\right)}{15} \tag{7}
$$
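A direct transcription of Eq. 6-7 is sketched below (our own code; f_freq and f_avg are the freeze frequency and average freeze duration defined just after the equation, and both the handling of a freeze-free window and the clamping of the logarithmic term at zero are our assumptions, since ln(0) is undefined).

```python
# Sketch of the QoE/MOS model of Eq. 6-7. The freeze-free case and the clamping
# of the logarithmic term at zero are our assumptions (ln(0) is undefined).
from math import log
from statistics import mean, pstdev
from typing import Sequence

def freeze_impact(f_freq: float, f_avg: float) -> float:
    """F_i(t, t+T) from the freeze frequency and the average freeze duration (s)."""
    if f_freq <= 0:
        return 0.0                               # assumption: no freezes, no impact
    freq_term = max(log(f_freq) / 6 + 1, 0.0)    # clamping at zero is an assumption
    return (7 / 8) * freq_term + (1 / 8) * min(f_avg, 15) / 15

def mos(quality_levels: Sequence[int], f_freq: float, f_avg: float) -> float:
    """Eq. 6: linear combination of the mean requested quality level, its standard
    deviation and the freeze impact."""
    q_bar, q_hat = mean(quality_levels), pstdev(quality_levels)
    return 0.81 * q_bar - 0.96 * q_hat + 0.17 - 4.95 * freeze_impact(f_freq, f_avg)

print(round(mos([5, 5, 6, 6, 5], f_freq=0.0, f_avg=0.0), 2))   # no freezes
print(round(mos([5, 5, 6, 6, 5], f_freq=0.02, f_avg=4.0), 2))  # freezes lower the score
```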
f_i^{freq}(t, t+T) and f_i^{avg}(t, t+T) are the freeze frequency and the average freeze duration, respectively. All the coefficients reported in Eq. 6-7 have been tuned considering the works by Claeys et al. [14] and De Vriendt et al. [18].

C. Optimal Parameters Configuration

The Q-Learning approach, a widely employed Reinforcement Learning technique [13], has been used in the HAS client rate adaptation algorithm. The parameters characterizing it are the discount factor, which weighs the relevance of future rewards in the learning process, and the learning rate, which weighs the relevance of newly acquired experience with respect to past experience. Additionally, the exploration policy, which selects the action to take at each decision step, has to be carefully chosen in order to balance the well-known trade-off in Reinforcement Learning between exploration and exploitation [13]. In order to properly tune our multi-agent HAS client and select the best configuration, an exhaustive evaluation of the parameter space has been performed. In particular, we selected two exploration policies, Softmax [13] and VDBE-Softmax [19], which are well-established exploration policies for Reinforcement Learning algorithms and allow a good balance between exploration and exploitation. An epsilon-greedy policy was also evaluated, but its results are omitted because of its poor performance. We also investigated the influence of the discount factor γ of the Q-Learning algorithm, the inverse temperature τ of the Softmax and VDBE-Softmax policies and the σ value of the VDBE-Softmax policy. These parameters are of interest, since they influence the behavior of an RL agent. The importance of the discount factor γ has been pointed out at the beginning of this section. The inverse temperature τ influences the action selection process: a low value entails that all actions have a similar probability of being selected. σ, the inverse sensitivity of the VDBE-Softmax policy, controls agent exploration: low values facilitate exploration, while high values cause the agent to select a greedy action more often. We consider five different γ values (0.05, 0.1, 0.15, 0.2, 0.25), three different τ values (0.2, 0.5, 0.7) and three σ values (1, 50, 100). Preliminary simulations showed that 0.1 is the best choice for the learning rate of the Q-Learning algorithm; for this reason, it was kept fixed in all simulations. We repeated the exhaustive evaluation of the parameter space for the two possible state space configurations reported in Section V-B and for scenarios with 4, 7 and 10 clients, leading to 360 different configurations overall. The outcome of this analysis is shown in Fig. 3-5. The main goal of this investigation is to find a parameter combination for our multi-agent client that performs well even if conditions change (e.g., the number of clients). In this section, the performance evaluation is conducted considering the application-aware interpretation of
the multi-agent optimization problem presented in Section III.

Fig. 3 and 4 investigate the influence of the discount factor γ on the performance of the multi-agent client, for both state space configurations, in a scenario with 10 clients streaming video. The x-axis reports the discount factor γ, while the y-axis reports the MOS. In the top graph, each point represents the average MOS of the entire agent set, computed during the last iteration, i.e., over the last 10 minutes of the video trace. In the bottom graph, each point represents the MOS standard deviation of the entire agent set, computed during the last iteration. We selected the four out of sixty policies achieving the highest average MOS with the lowest MOS standard deviation. From now on, we will refer to the Softmax and VDBE-Softmax policies with the abbreviations SMAX and VDBE, respectively. For the absolute global signal state space configuration (Fig. 3), the VDBE policy with τ = 0.5, σ = 1 and γ = 0.2 leads to the best result overall, with an average MOS of 3.75 and a standard deviation of 0.17. Moreover, this same policy appears to be robust when the discount factor changes, leading to good results also with γ = 0.05 and 0.1. Another eligible configuration is VDBE τ = 0.7, σ = 1, γ = 0.15, which results in the second best outcome with an average MOS of 3.69 and a standard deviation of 0.19. The same analysis, for the relative global signal state space configuration, is shown in Fig. 4. The best overall result is reached with VDBE τ = 0.7, σ = 1 and γ = 0.1, resulting in an average MOS of 3.71 and a standard deviation of 0.14. The SMAX τ = 0.7 policy shows a low sensitivity to changes of the discount factor, for both the average MOS and its standard deviation, and reaches its best outcome for γ = 0.2.

Tables III and IV summarize the outcome of the exhaustive parameter evaluation. We repeated the same analysis shown in Fig. 3-4 for the 4 and 7 clients scenarios, obtaining a total of eight eligible configurations. In light of the above, two preliminary conclusions can be drawn. First, the VDBE policy is in general the best choice. This is because VDBE tends to a greedy selection policy when the learning process converges. Second, higher values of τ and lower values of σ are preferable. In the first case, high τ values cause the agent to select the actions with the highest expected reward. To balance this aspect, low σ values allow more exploration during the learning phase.

A fundamental characteristic of the multi-agent client is that it should perform well independently of the number of clients in the system. In order to select the best configuration from this point of view, we evaluate the performance of every eligible combination reported in Tables III-IV for 4, 7 and 10 clients. We then select the two best performing configurations across all client counts, for each state space configuration. Fig. 5 shows the influence of the number of clients on the selected configurations. All configurations perform similarly when considering the average MOS, while a bigger variability can be noticed for the MOS standard deviation. In light of the above, the VDBE τ = 0.5, σ = 1, γ = 0.2, absolute global signal state space configuration (i.e., third row of Table I) has finally been chosen. As can be seen from Fig. 5, it guarantees the best results from the average MOS point of view and a very low standard deviation for 4, 7 and 10 clients.
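As a reminder of how the inverse temperature τ steers the action selection discussed above, the sketch below implements a generic Softmax selection over Q-values (a textbook formulation, not the exact implementation used in our evaluation; the VDBE-Softmax variant additionally adapts a state-dependent exploration probability from observed value differences, which is omitted here).

```python
# Generic Softmax action selection with inverse temperature tau (illustrative;
# VDBE-Softmax additionally adapts a per-state exploration probability, omitted here).
import math
import random
from typing import Sequence

def softmax_action(q_values: Sequence[float], tau: float) -> int:
    """Pick an action with probability proportional to exp(tau * Q(s, a));
    larger tau concentrates the choice on the highest-valued actions."""
    scores = [tau * q for q in q_values]
    m = max(scores)                                # stabilize the exponentials
    weights = [math.exp(s - m) for s in scores]
    return random.choices(range(len(q_values)), weights=weights, k=1)[0]

random.seed(0)
q = [-4.0, -2.5, -1.0, -3.0]
print([softmax_action(q, tau=0.2) for _ in range(5)])  # low tau: spread across actions
print([softmax_action(q, tau=5.0) for _ in range(5)])  # high tau: almost always index 2
```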
[Fig. 3. Influence of the discount factor γ on average MOS (top) and its standard deviation (bottom) for the absolute global signal state space configuration and 10 clients streaming video.]

[Fig. 4. Influence of the discount factor γ on average MOS (top) and its standard deviation (bottom) for the relative global signal state space configuration and 10 clients streaming video.]

[Fig. 5. Influence of the number of clients on average MOS (top) and its standard deviation (bottom).]

TABLE III. RESULTS OF THE PARAMETER SPACE EVALUATION. ABSOLUTE GLOBAL SIGNAL STATE SPACE CONFIGURATION

| Clients Number | Eligible Combinations                                                        |
|----------------|------------------------------------------------------------------------------|
| 4              | Softmax τ = 0.7, γ = 0.25; VDBE-Softmax τ = 0.5, σ = 1, γ = 0.25             |
| 7              | VDBE-Softmax τ = 0.5, σ = 1, γ = 0.2; VDBE-Softmax τ = 0.7, σ = 1, γ = 0.15  |
| 10             | VDBE-Softmax τ = 0.5, σ = 1, γ = 0.2; VDBE-Softmax τ = 0.7, σ = 1, γ = 0.15  |
D. Gain Achieved by the Algorithm

In this section, we investigate the performance of the proposed multi-agent HAS client, in comparison with both the single-agent HAS client studied by Claeys et al. [14] and a traditional HAS client, Microsoft IIS Smooth Streaming (MSS)¹. In particular, we show the results for the 7 and 10 clients scenarios, both from a QoE and a network point of view.

¹ Original source code available from: https://slextensions.svn.codeplex.com/svn/trunk/SLExtensions/AdaptiveStreaming
An exhaustive evaluation of the parameter space has been carried out for the single-agent HAS client, and a VDBE policy with τ = 0.7, σ = 1, γ = 0.1 has been selected. For the multi-agent client, we consider the parameter configuration resulting from the analysis presented above. Also in this case, all the metrics are computed considering the last of the 800 iterations. Fig. 6 shows the results obtained when analysing the clients' performance according to the application-aware interpretation of the multi-agent optimization problem introduced in Section III. Each bar represents the average MOS of the entire agent set, together with its standard deviation. The MSS client presents a very high standard deviation, both for the 7 and the 10 clients scenario. This entails that there is a big difference among the video qualities perceived by the different clients, i.e., unfairness. Remarkable improvements can be noticed when using an RL approach. The single-agent RL client is able to considerably reduce the MOS standard deviation, by 80% and 20% in the 7 and 10 clients case, respectively. This is a very good result, considering that, in this case, there are no coordination mechanisms. Nevertheless, the lack of coordination affects the average MOS, which is similar to that reached by the MSS client. The multi-agent RL client is instead able to improve the average MOS by 11% for 7 clients and by 20% for 10 clients with respect to MSS. Moreover, very good fairness is obtained for the 7 clients scenario. For the 10 clients case, the standard deviation is 48% lower than with the single-agent solution and 60% lower than with MSS.
TABLE IV. RESULTS OF THE PARAMETER SPACE EVALUATION. RELATIVE GLOBAL SIGNAL STATE SPACE CONFIGURATION

| Clients Number | Eligible Combinations                                                          |
|----------------|--------------------------------------------------------------------------------|
| 4              | VDBE-Softmax τ = 0.5, σ = 1, γ = 0.25; VDBE-Softmax τ = 0.2, σ = 50, γ = 0.15  |
| 7              | VDBE-Softmax τ = 0.5, σ = 1, γ = 0.25; VDBE-Softmax τ = 0.5, σ = 50, γ = 0.25  |
| 10             | Softmax τ = 0.7, γ = 0.2; VDBE-Softmax τ = 0.7, σ = 1, γ = 0.1                 |
[Fig. 6. Comparison between the different clients, from a QoE perspective. The proposed multi-agent client outperforms both the MSS client and the single-agent Q-Learning-based one.]

In Fig. 7, the network-level analysis is depicted. In this case, we evaluate the clients' performance considering the application-agnostic interpretation of the optimization problem in Section III. The graph reports the average and standard deviation of the clients' average requested quality level. Also in this case, the MSS deviation is very high: this means the agents do not share the network resources fairly. The multi-agent client is able to considerably reduce the deviation of the average quality level requested by the clients, with respect to both MSS and the single-agent client. The situation arising in the 7 clients scenario is of interest. In this case, the single-agent client performance is close to that of the multi-agent one. If we recall the results shown in Fig. 6 for the 7 clients scenario, we see that there is a bigger difference between the average MOS of the two clients. This entails that in this case, the exploited resources being equal, the multi-agent client results in a better overall perceived video quality, i.e., it is more efficient.

[Fig. 7. Comparison between the different clients, from a network perspective. The proposed multi-agent client is able to improve fairness and increase the average requested quality level, both in the 7 and 10 clients scenarios.]
VII. CONCLUSIONS

In this paper, we presented a multi-agent Q-Learning-based HAS client, able to learn and dynamically adapt its behavior depending on network conditions, in order to obtain a high QoE at the client. Moreover, this client is able to coordinate with other clients in order to achieve fairness, both from the QoE and the network point of view. This was necessary, as both traditional and earlier proposed RL-based approaches introduced non-negligible differences in obtained quality among clients. Fairness is achieved by means of an intermediate node, called the coordination proxy, in charge of collecting information on the overall performance of the system. This information is then provided to the clients, which use it to refine their learning process. Numerical simulations using NS-3 have validated the effectiveness of the proposed approach. In particular, we have compared our multi-agent HAS client with the Microsoft IIS Smooth Streaming client and with a single-agent Q-Learning-based HAS client. In the evaluated bandwidth scenario, we were able to show that our multi-agent HAS client resulted in a better video quality and in a remarkable improvement of fairness, up to 60% and 48% in the 10 clients case, compared to MSS and the Q-Learning-based client, respectively.

ACKNOWLEDGMENT

The research was performed partially within the iMinds MISTRAL project (under grant agreement no. 10838). This work was partly funded by Flamingo, a Network of Excellence project (ICT-318488) supported by the European Commission under its Seventh Framework Programme. Maxim Claeys is funded by a grant of the Agency for Innovation by Science and Technology in Flanders (IWT).

REFERENCES

[1] S. Akhshabi, S. Narayanaswamy, A. C. Begen and C. Dovrolis, "An experimental evaluation of rate-adaptive video players over HTTP," Signal Processing: Image Communication, vol. 27, no. 4, pp. 271-287, April 2012.
[2] S. Akhshabi, L. Anantakrishnan, A. C. Begen and C. Dovrolis, "What happens when HTTP adaptive streaming players compete for bandwidth?," Proceedings of the 22nd International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV '12), 2012.
[3] D. Jarnikov and T. Ozcelebi, "Client intelligence for adaptive streaming solutions," Signal Processing: Image Communication, vol. 26, no. 7, pp. 378-389, August 2011.
[4] C. Liu, I. Bouazizi and M. Gabbouj, "Rate adaptation for adaptive HTTP streaming," Proceedings of the Second Annual ACM Conference on Multimedia Systems (MMSys '11), 2011.
[5] L. De Cicco, S. Mascolo and V. Palmisano, "Feedback control for adaptive live video streaming," Proceedings of the Second Annual ACM Conference on Multimedia Systems (MMSys '11), 2011.
[6] B. J. Villa and P. H. Heegaard, "Improving perceived fairness and QoE for adaptive video streams," Eighth International Conference on Networking and Services, 2012.
[7] J. Jiang, V. Sekar and H. Zhang, "Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE," Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies (CoNEXT '12), 2012.
[8] J. M. Vidal, "Fundamentals of Multiagent Systems," online: http://multiagent.com/p/fundamentals-of-multiagent-systems.html, last accessed September 2013.
[9] J. Bredin, R. T. Maheswaran, C. Imer, D. Kotz and D. Rus, "A game-theoretic formulation of multi-agent resource allocation," Proceedings of the Fourth International Conference on Autonomous Agents, 2000.
[10] J. Dowling, E. Curran, R. Cunningham and V. Cahill, "Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, pp. 360-372, 2005.
[11] A. Schaerf, Y. Shoham and M. Tennenholtz, "Adaptive load balancing: a study in multi-agent learning," Journal of Artificial Intelligence Research, pp. 475-500, May 1995.
[12] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Machine Learning, pp. 235-262, 1998.
[13] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction," The MIT Press, March 1998.
[14] M. Claeys, S. Latré, J. Famaey, T. Wu, W. Van Leekwijck and F. De Turck, "Design of a Q-learning-based client quality selection algorithm for HTTP adaptive video streaming," Proceedings of the Adaptive and Learning Agents Workshop, part of AAMAS 2013, May 2013.
[15] S. de Jong, K. Tuyls and K. Verbeeck, "Artificial agents learning human fairness," Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), pp. 863-870, May 2008.
[16] ns-3, "The Network Simulator ns-3," online: http://www.nsnam.org, last accessed September 2013.
[17] N. Bouten, J. Famaey, S. Latré, R. Huysegems, B. De Vleeschauwer, W. Van Leekwijck and F. De Turck, "QoE optimization through in-network quality adaptation for HTTP Adaptive Streaming," Eighth International Conference on Network and Service Management (CNSM), 2012.
[18] J. De Vriendt, D. De Vleeschauwer and D. Robinson, "Model for estimating QoE of video delivered using HTTP adaptive streaming," 1st IFIP/IEEE Workshop on QoE Centric Management (IM 2013), May 2013.
[19] J. Bach and S. Edelkamp, "Value-difference based exploration: adaptive control between Epsilon-greedy and Softmax," KI 2011: Advances in Artificial Intelligence, pp. 335-346, 2011.