Factor Selection for Reinforcement Learning in HTTP Adaptive Streaming

Tingyao Wu, Werner Van Leekwijck*

Alcatel Lucent - Bell Labs, Copernicuslaan 50, B-2018 Antwerp, Belgium
{tingyao.wu,werner.van_leekwijck}@alcatel-lucent.com

* This research was partially funded by the iMinds MISTRAL project (under grant agreement no. 10838).

Abstract. At present, HTTP Adaptive Streaming (HAS) is developing into a key technology for video delivery over the Internet. In this delivery strategy, the client proactively and adaptively requests a quality version of chunked video segments based on its playback buffer, the perceived network bandwidth and other relevant factors. In this paper, we discuss the use of reinforcement learning (RL) to learn the optimal request strategy at the HAS client by progressively maximizing a pre-defined Quality of Experience (QoE)-related reward function. Under the framework of RL, we investigate the most influential factors for the request strategy, using a forward variable selection algorithm. The performance of the RL-based HAS client is evaluated by a Video-on-Demand (VOD) simulation system. Results show that, given the QoE-related reward function, the RL-based HAS client is able to optimize the quantitative QoE. Compared with a conventional HAS system, the RL-based HAS client is more robust and flexible under versatile network conditions.

Keywords: Reinforcement Learning, HTTP Adaptive Streaming, Machine Learning, Variable Selection

1 Introduction

In recent years, video delivery over the traditional best-effort Internet has attracted a lot of attention. Among the enabling technologies, HTTP adaptive streaming (HAS) has become a key one. Several instances of this technology, like IIS Smooth Streaming by Microsoft [1], HTTP Live Streaming (HLS) by Apple [2] and HTTP Dynamic Streaming by Adobe [3], are in the market today. In HAS, video content is encoded in different qualities (bit-rates) and chunked into independent segments, typically 2-10 seconds long. The encoded segments are hosted on an HTTP Web server, together with a playlist (or manifest) file describing the quality levels and available segments. A client first retrieves the playlist, and then requests the segments with different qualities from the Web server in a linear fashion, downloading them using plain HTTP progressive download. Because the segments are carefully encoded without any gaps or overlaps between them, they can be played back as a seamless video. The key feature of HTTP adaptive streaming is that the client is responsible for deciding, for each consecutive segment, which quality to download. Typically a rate determination algorithm, also called a heuristic, is responsible for selecting the highest sustainable quality while adapting to the changing environment. The decision is made by taking into account one or more observed factors to adaptively select a bit-rate. These factors may include the perceived bandwidth (based on the download durations of previous segments), the playback buffer at the client together with its dynamics, the user screen resolution, the CPU load, etc. Usually, when the playback buffer is low or the perceived bandwidth is limited, the heuristic tends to request low-quality segments in order to avoid freezes; when the playback buffer is high or the bandwidth is sufficient, the heuristic attempts to increase the quality level; and when the bandwidth is stable, the heuristic maintains a sustainable quality level. More details of adaptive streaming can be found in [1], [2] and [3].
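
For illustration, the sketch below shows a conventional threshold-style heuristic of this kind; the quality ladder is the one used later in this paper, while the thresholds and safety margin are hypothetical and do not correspond to any specific commercial client.

# Hypothetical threshold heuristic: choose a bit-rate from the buffer level and
# the estimated bandwidth. The thresholds are illustrative only.
BITRATES_KBPS = [300, 427, 608, 866, 1233, 1636, 2436]

def select_bitrate(buffer_s: float, est_bandwidth_kbps: float,
                   low_buffer_s: float = 8.0, safety: float = 0.8) -> int:
    """Return the bit-rate (kbps) to request for the next segment."""
    if buffer_s < low_buffer_s:
        # Buffer is running low: fall back to the lowest quality to avoid a freeze.
        return BITRATES_KBPS[0]
    # Otherwise request the highest quality sustainable under a safety margin.
    budget = safety * est_bandwidth_kbps
    sustainable = [r for r in BITRATES_KBPS if r <= budget]
    return sustainable[-1] if sustainable else BITRATES_KBPS[0]

print(select_bitrate(buffer_s=20.0, est_bandwidth_kbps=1500.0))  # -> 866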

A client-side bit-rate determination algorithm should be able to balance multiple variables, attempting to obtain the optimal QoE under the given circumstances. However, recent research has shown that it is not an easy task for current video streaming heuristics to pick a suitable video stream rate. For instance, in [4] it is pointed out that existing streaming heuristics can be both too conservative (not fully making use of the available bandwidth) and too aggressive (not fully considering a fast-decreasing buffer filling level); in [5] it is observed that the competition between two or more adaptive streaming players can lead to issues of stability, unfairness and potential bandwidth under-utilization.

Several studies have aimed to optimize the bit-rate selection algorithm. In [6], a rate adaptation algorithm for HAS was proposed to detect bandwidth changes using a smoothed HTTP throughput measure based on the segment fetch time. In [7], the authors presented a Quality Adaptation Controller (QAC), which uses feedback control to drive stream switching for adaptive live streaming applications. In [8], we proposed a Q-learning based client quality selection algorithm for HTTP adaptive video streaming to dynamically learn the optimal behavior corresponding to the current network environment. Compared to other adaptive bit-rate algorithms, reinforcement learning is by nature able to learn from punishments or rewards by trying different actions in a given environment state, and thus moves towards the maximum of its defined reward function. Some promising results have been reported in [8]. Huang et al. [9], following their work in [4], argued that observing and controlling only the playback buffer, without having to estimate network capacity, is already sufficient to avoid unnecessary rebuffering and to achieve an average video rate equal to the available capacity in steady state, under the assumption that the available bandwidth never goes below the lowest encoded bit-rate. However, based on our observations, most freezes occur exactly when the bandwidth is extremely limited due to the burstiness of TCP. To deal with such circumstances, an integration of different information sources seems indispensable to prevent the client from rebuffering.

In this paper, we extend our study on the use of reinforcement learning by incrementally integrating factors relevant to bit-rate selection, using a forward variable selection strategy; we attempt to quantitatively identify which factors are the most important and influential for the bit-rate selection algorithm, and how they are combined. To simplify our study, we do not consider external factors like screen resolution or CPU load, as these factors are not easy to control in a simulation setting; instead, we focus on 5 parameters that matter to the HAS player itself, namely, the playback buffer, the previous playback buffer, the instant perceived bandwidth, the average perceived bandwidth in a sliding window, and the previously requested bit-rate. Our selection result shows that the playback buffer, the average bandwidth and the previous playback buffer are the three most valuable ones, in descending order. We simulate the evolution of the learning process, compare it with a commercial HAS client, namely
Microsoft IIS Smooth Streaming, and demonstrate the robustness and flexibility of the RL-based HAS client.

The rest of the paper is organized as follows. In section 2, we describe the adaptation and implementation of the RL algorithm for the HAS client. The forward variable selection strategy for identifying the most influential parameters is presented in section 3. Section 4 presents the experiments, including the experiment design, the forward variable selection for the RL environment, and the comparison with a traditional HAS client. The conclusions and future work are given in section 5.

2 RL-based HAS client

In the sense of machine learning, reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize a given cumulative reward [10]. Reinforcement learning has been widely used in control theory, game theory, etc. The motivation for using RL in the HAS client is that if we can define a QoE-oriented reward function for RL, then in a multi-variable controlled environment, an RL agent should be able to incrementally maximize the reward during its trial-and-error procedure, thus enhancing the quantitative QoE. An RL model consists of a set of environment states S, a set of actions A, a reward function R and a state-action paired Q-function Q(S, A).
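
As a rough illustration (not the authors' implementation), these four components can be held in a small tabular structure; the state encoding below is a placeholder for the discretized factors chosen later in the paper, and the action set anticipates the bit-rate ladder of the simulation in section 4.

# Minimal tabular RL skeleton for the HAS client: states S are tuples of
# discretized factors, actions A are the available bit-rates, and Q(s, a)
# is kept in a dictionary. The reward R is defined in section 2.3.
from collections import defaultdict
from typing import Dict, Tuple

State = Tuple[int, ...]          # e.g. (buffer_bin, avg_bw_bin, prev_buffer_bin)
ACTIONS_KBPS = [300, 427, 608, 866, 1233, 1636, 2436]

class HASAgent:
    def __init__(self) -> None:
        # Q-table mapping (state, action) to a learned value.
        self.q: Dict[Tuple[State, int], float] = defaultdict(float)

    def q_values(self, state: State) -> Dict[int, float]:
        return {a: self.q[(state, a)] for a in ACTIONS_KBPS}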

2.1 Environment variables

The environment that a HAS client encounters and interacts with may include factors like the playback buffer, the perceived bandwidth, the quality levels of previous segments, the speed of buffer filling/consuming, etc. These state factors must be carefully chosen: integrating too many non-discriminative variables into the environment states may reduce the interpretability for the agent and may make the search task tedious, while selecting too few variables may not be descriptive enough, preventing the agent from reaching higher rewards. Therefore, a variable selection strategy (see section 3) is adopted to deliberately select the necessary factors for the HAS environment. An environment state is then represented by a combination of discretized states of the chosen factors.

2.2 Action set

The action set A in RL indicates which actions an agent can take in a certain state. In the RL-based HAS client, this corresponds to all possible bit-rates that a client may request for a segment. The probability of choosing action a (a ∈ A) in state s (s ∈ S) is described by the state-action paired function Q(s, a).

2.3 Reward function

A reward function R in RL defines the reward that an agent receives when it takes a certain action and jumps to another state. To obtain a reliable delivery strategy, the reward function in the RL-based HAS client should be directly related to QoE. Certainly, the duration of the freeze f_i between playing segment i and segment i + 1 and the quality q_i of segment i are two indispensable factors for the QoE when requesting segment i. Moreover, [11][12] report that alternately degrading and upgrading the quality level in a short time can also result in a degraded QoE, implying that frequent oscillations should be punished and a stable quality sequence is desired. Meanwhile, [13] suggests that a linear combination of these QoE-relevant parameters is already good enough to model the mean opinion score (MOS) of viewers. As a result, the instant reward function is modeled as:

r_i = -A * f_i - B * q_i - C * o_i.    (1)

The freeze f_i can be detected by checking the playback buffer b_i and the time duration e_{i+1} for retrieving segment i + 1:

f_i = 0 if b_i - e_{i+1} >= 0,    f_i = e_{i+1} - b_i if b_i - e_{i+1} < 0.    (2)

The second term q_i of the reward function is quality related. Intuitively, when the highest encoded bit-rate a_max is chosen, no punishment is given; otherwise, lower qualities are penalized proportionally in terms of their encoded bit-rates:

q_i = (a_max - a_i) / a_max.    (3)

The last term o_i in the reward function concerns quality oscillation, and is calculated based on the following rules. Suppose there is a quality switch at the i-th segment,
a_i ≠ a_{i-1}, implying that a potential quality oscillation occurs. We then look for the closest quality switch point within the last M segments. If no switch is found, or the closest switch is in the same direction as the switch at the i-th segment, then no oscillation is detected. Otherwise, quality decreases and increases co-exist within M consecutive segments, and the oscillation o_i is calculated as the average of these two quality changes. This calculation implies that a quality decrease/increase pair spread over a period of more than M + 1 segments is not seen as an oscillation.

The positive weights A, B and C in Eq. 1 represent to what extent these factors influence the QoE. As there is no decisive conclusion about how the freeze, quality and oscillation impact the QoE of viewers, and how important they are relative to each other, we arbitrarily set the weights to A = 100, B = 10 and C = 10 in our preliminary experiments, assuming that a much higher punishment should be given to the freeze [14].
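
A minimal sketch of this reward computation (Eqs. 1-3 plus the oscillation rule above) is given below; the helper names are ours, and the exact indexing and averaging inside the oscillation check are one possible reading of the verbal description.

# Sketch of the QoE-related reward r_i = -A*f_i - B*q_i - C*o_i (Eq. 1), with the
# freeze term of Eq. 2, the quality term of Eq. 3 and the oscillation rule from
# the text. Weights and constants follow the paper's experimental setup.
A, B, C = 100.0, 10.0, 10.0          # weights of freeze, quality and oscillation
A_MAX = 2436.0                       # highest encoded bit-rate (kbps)
M = 5                                # oscillation look-back window (segments)

def freeze(buffer_s: float, fetch_time_s: float) -> float:
    """Eq. 2: freeze duration when the segment arrives after the buffer drains."""
    return max(0.0, fetch_time_s - buffer_s)

def quality_penalty(bitrate_kbps: float) -> float:
    """Eq. 3: proportional penalty for requesting less than the highest quality."""
    return (A_MAX - bitrate_kbps) / A_MAX

def oscillation(qualities: list) -> float:
    """Oscillation term o_i for the latest segment, given the bit-rates requested so far."""
    if len(qualities) < 2 or qualities[-1] == qualities[-2]:
        return 0.0                                 # no switch at segment i
    cur = qualities[-1] - qualities[-2]
    # Look for the closest previous switch within the last M segments.
    for j in range(len(qualities) - 2, max(len(qualities) - 2 - M, 0), -1):
        prev = qualities[j] - qualities[j - 1]
        if prev != 0:
            if prev * cur < 0:                     # opposite directions: oscillation
                return (abs(prev) + abs(cur)) / 2.0
            return 0.0                             # same direction: no oscillation
    return 0.0                                     # no switch within the window

def reward(buffer_s, fetch_time_s, bitrate_kbps, qualities) -> float:
    return (-A * freeze(buffer_s, fetch_time_s)
            - B * quality_penalty(bitrate_kbps)
            - C * oscillation(qualities))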

2.4 State and action-paired Q-value

Suppose that at time t_i, the RL-based HAS client agent reaches environment state s_i, and probabilistically requests quality level a_{i+1} (a_{i+1} ∈ A) for segment i + 1. After having received the segment at time t_{i+1}, the agent transits to environment state s_{i+1}. Between t_i and t_{i+1}, it is possible that, due to buffer under-run, the client experiences a picture freeze with duration f_{i+1}. Together with the quality and the quality oscillation for requesting the a_{i+1}-th version of segment i + 1, a reward r_{i+1} is calculated based on Eq. 1. The learned action-value function Q(s_i, a_i) is then updated using the one-step Q-learning method [10]:

Q(s_i, a_i) ← Q(s_i, a_i) + α [ r_{i+1} + γ max_a Q(s_{i+1}, a) - Q(s_i, a_i) ],    (4)

where (s, a) is the state-action pair, and α ∈ [0, 1] and γ ∈ [0, 1] are the learning rate and the discount parameter respectively. The learning rate α determines to what extent the newly acquired information overrides the old information, while the discount factor γ is a measure of the importance of future rewards. This online update runs for multiple episodes until convergence, and the probability of choosing a_i in state s_i is calculated as Pr(a_i | s_i) = Q(s_i, a_i) / Σ_j Q(s_i, a_j).
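
A small sketch of this update and of the action-selection probability as stated above; the learning rate and discount values are illustrative.

# One-step Q-learning update of Eq. 4 and the normalized action probability
# given in the text. Hyper-parameter values here are examples, not the paper's.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                       # learning rate and discount factor
ACTIONS = [300, 427, 608, 866, 1233, 1636, 2436]
Q = defaultdict(float)                        # Q[(state, action)]

def q_update(s, a, r_next, s_next):
    """Eq. 4: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r_next + GAMMA * best_next - Q[(s, a)])

def action_probabilities(s):
    """Pr(a|s) = Q(s,a) / sum_j Q(s,a_j), as stated in the text; before learning
    starts (all-zero Q) we fall back to a uniform distribution."""
    total = sum(Q[(s, a)] for a in ACTIONS)
    if total == 0:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    return {a: Q[(s, a)] / total for a in ACTIONS}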

3 Forward selection for environment variables

The environment states, containing multiple variables, represent the variable space that an agent interacts with. Those variables must be carefully selected to sufficiently represent the learning environment, as not all candidate variables contribute to describing the environment states. For instance, [9] claims that using only the playback buffer is already sufficient to avoid freezes, assuming that the perceived bandwidth is never less than the lowest encoded bit-rate a_1. But we observed that in our simulation (and in many cases) this assumption usually does not hold. In this sense, we conjecture that relying only on the playback buffer is insufficient; some other factors may help the client to make a better selection decision. To this end, we use a forward selection procedure to select discriminative variables among some potential factors. Forward selection is a data-driven procedure which adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion, and the most significant of these variables is added, so long as its p-value is below a pre-determined level. The potential influential factors that we test, composing the variable set P, are listed in Table 1. As shown, b_i and pb_i are the playback buffers when receiving segment i and segment i - 1. w_i and w̄_i are the instant perceived bandwidth and the average of the recently perceived bandwidths in a sliding window of size M, respectively. The last potential variable d_i indicates for how many segments
the current quality change direction has been maintained, implying the quality switches.

Table 1. Potential environment variable set P

Variable                                Representation                        Unit
Play buffer                             b_i                                   sec
Previous play buffer                    pb_i = b_{i-1}                        sec
Instant BW                              w_i                                   kbps
Ave. BW                                 w̄_i = (Σ_{j=0}^{M-1} w_{i-j}) / M     kbps
Number of segments in the same trend    d_i                                   level

We begin with a model including the variable that is the most significant in the initial analysis, and continue adding variables until none of the remaining variables is "significant". The significance is tested on the average reward obtained with different variable sets. Given a set of selected variables S (S ⊂ P) in the RL-based HAS client, the average reward for variable set S over all segments is R̄^S = (Σ_{i=1}^{N} r_i) / N, where N denotes the number of segments that the client requests. A variable v from the remaining variable set S^C (S ∪ S^C = P) is then incorporated into the existing variable set S only if the following conditions are met:

R̄^{S+v} > R̄^{S+j}  for all j ∈ S^C, j ≠ v,
R̄^{S+v} > R̄^S,
H0: Pr(R̄^{S+v} > R̄^S) = 0.5 is rejected.    (5)

The above procedure is repeated until no variable is significant. For the hypothesis test in Eq. 5, a non-parametric two-sided sign test is performed, as the distributions of the tested variables are unknown. If the null hypothesis is rejected at the significance level 0.05, then the two variable sets are considered to be statistically different.
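
A schematic of this forward-selection loop with the paired sign test is sketched below; run_episodes is a hypothetical hook that would run the RL-based client with a given variable set and return one average reward per episode, and the empty starting set is a simplification of the initial three-way comparison described in section 4.2.

# Sketch of the forward variable selection of Eq. 5 with a two-sided sign test.
from math import comb
from typing import Callable, Dict, FrozenSet, List

def sign_test_p(x: List[float], y: List[float]) -> float:
    """Two-sided sign test on paired samples (ties are discarded)."""
    wins = sum(1 for a, b in zip(x, y) if a > b)
    losses = sum(1 for a, b in zip(x, y) if a < b)
    n, k = wins + losses, min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def forward_select(candidates: FrozenSet[str],
                   run_episodes: Callable[[FrozenSet[str]], List[float]],
                   alpha: float = 0.05) -> FrozenSet[str]:
    selected: FrozenSet[str] = frozenset()
    rewards: Dict[FrozenSet[str], List[float]] = {selected: run_episodes(selected)}
    while candidates - selected:
        remaining = candidates - selected
        # Evaluate each candidate added to the current set (rewards paired per episode).
        for v in remaining:
            rewards[selected | {v}] = run_episodes(selected | {v})
        best = max(remaining, key=lambda v: sum(rewards[selected | {v}]))
        base, trial = rewards[selected], rewards[selected | {best}]
        improves = sum(trial) / len(trial) > sum(base) / len(base)
        if improves and sign_test_p(trial, base) < alpha:
            selected = selected | {best}           # accept the most significant variable
        else:
            break                                  # no remaining variable is significant
    return selected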

4 Experiments

In this section, we first describe the network topology of our simulation and the design of the experiment. Then the forward variable selection method is applied to select influential factors for the RL environment. The performance of the RL-based HAS client is compared with that of a standard HAS client [1] under the same randomly generated bandwidth conditions, from which we demonstrate how reinforcement learning gradually optimizes the request sequence as the number of trials grows. Finally, we show that the RL-based HAS client is superior in terms of obtained rewards under multiple weighting parameter setups.

4.1 Simulation setup

The simulation design of a Video-on-Demand (VOD) system is shown in Fig. 1. At the server side, a video clip, Big Buck Bunny, is hosted and available for retrieval. This video trace, about 10 minutes long, consists of 299 segments (N = 299), each with a fixed length of 2 seconds. Each video segment is encoded in 7 quality levels: 300 kbps, 427 kbps, 608 kbps, 866 kbps, 1233 kbps, 1636 kbps and 2436 kbps. This leads to the action set A = {300, 427, 608, 866, 1233, 1636, 2436} kbps, and a_max = 2436 kbps. At the client side, either a standard Microsoft IIS Smooth Streaming client [1] or the RL-based adaptive streaming client runs. The client either immediately requests the next segment once the previous segment has been fully received (in the buffering state), or issues a request every 2 seconds (in the steady state) [1].
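
A simplified model of this request timing is sketched below; the playback and pacing logic is schematic, and the buffer threshold at which the client switches to the steady state is an assumption, not the exact IIS Smooth Streaming value.

# Simplified request timing: in the buffering state the next request is issued as
# soon as a segment finishes downloading; in the steady state requests are paced
# to one per 2-second segment. The steady-state threshold is an assumed value.
SEGMENT_S = 2.0
STEADY_THRESHOLD_S = 24.0

def simulate(download_time, select_bitrate, n_segments=299):
    """download_time(i, bitrate) -> seconds needed to fetch segment i."""
    t, buffer_s, total_freeze = 0.0, 0.0, 0.0
    for i in range(n_segments):
        fetch = download_time(i, select_bitrate(buffer_s))
        t += fetch
        if fetch > buffer_s:                         # buffer under-run: picture freeze
            total_freeze += fetch - buffer_s
        buffer_s = max(0.0, buffer_s - fetch) + SEGMENT_S
        if buffer_s >= STEADY_THRESHOLD_S:           # steady state: wait before next request
            idle = max(0.0, SEGMENT_S - fetch)
            t += idle
            buffer_s = max(0.0, buffer_s - idle)
    return t, total_freeze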

[Fig. 1. Simulation design: a HAS server hosting the 10-minute, 299-segment video clip; a cross-traffic manager implemented on a Click Modular Router limiting the 3000 kbps bottleneck; and a HAS client running either Microsoft IIS Smooth Streaming or the RL-based algorithm.]

Besides a typical HAS server-client topology, a cross-traffic manager, implemented on the Click Modular Router [15], is also present. The cross-traffic manager is used to limit the bandwidth towards the server as a bottleneck, and it also generates tunable random cross traffic by continuously sending packets to the server. The randomly generated cross traffic is determined by two random values, namely the sending rate of the packets and the corresponding duration. As a result, the bandwidth between the client and the server is the difference between the bottleneck bandwidth and the cross traffic. Considering the encoded bit-rates, the bottleneck bandwidth is set to 3000 kbps (denoted as BW_max). To allow the perceived bandwidth at the client to cover all bit-rates, the bandwidth generated by the cross traffic is in the range of 0 kbps to 2700 kbps. Note that, because of the burstiness of TCP connections, the client can still encounter very low bandwidth for short periods.
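
The resulting bandwidth pattern can be mimicked with a small generator like the one below; this is only a sketch of the description above (the actual cross traffic is produced by the Click-based traffic manager), and the duration range of each random burst is an assumption.

# Sketch of the random cross traffic: a sequence of (rate, duration) pairs, with
# the available bandwidth at the client being the bottleneck capacity minus the
# cross-traffic rate. The rate range follows the text; the duration range is assumed.
import random

BOTTLENECK_KBPS = 3000.0
CROSS_MAX_KBPS = 2700.0

def bandwidth_schedule(total_s: float, max_burst_s: float = 30.0):
    """Yield (available_bandwidth_kbps, duration_s) pairs covering total_s seconds."""
    elapsed = 0.0
    while elapsed < total_s:
        rate = random.uniform(0.0, CROSS_MAX_KBPS)       # cross-traffic sending rate
        duration = min(random.uniform(1.0, max_burst_s), total_s - elapsed)
        yield BOTTLENECK_KBPS - rate, duration
        elapsed += duration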

4.2 Forward selection of environment variables

Parameter discretization Given that the maximum buffer length at the client is 32 seconds (the default maximum buffer length of the reference Microsoft IIS system), b_i and pb_i are discretized into 16 non-overlapping states, with one state spanning each 2-second interval. Setting M = 5, w̄_i is calculated as the average of the previous 5 perceived bandwidths. w_i and w̄_i are then discretized into 7 states, whose lower and upper boundaries are aligned with the encoded bit-rates: [0, a_2], [a_2, a_3], ..., [a_7, BW_max]. Note that the upper boundary of the first state is a_2, as any bandwidth below a_2 (427 kbps) corresponds to the lowest quality (300 kbps). d_i is in the range between 0 and 5: when the direction of an instant quality switch differs from the maintained direction, this variable is reset to 0; when two consecutive segments have the same quality or the same change direction, this variable increases by 1, up to 5.

Experiment design for variable selection A series of packet-sending rates and corresponding durations are randomly generated and fed into the cross-traffic manager. The total duration of the cross traffic is 179,400 seconds (corresponding to 300 episodes). For each variable selection step, the RL-based clients running with different environment variable sets perceive almost identical bandwidth for the same episode; as a result, the average rewards for the same episode can be paired and compared. The RL-based HAS client always starts with an equal probability distribution over all actions. The 100 average rewards between episode 201 and episode 300 for the different environment variable sets are used to check whether the conditions in Eq. 5 are satisfied, as we observe that after 200 episodes the performance of the RL is stable.

1st variable The playback buffer b and the perceived bandwidth, including w and w̄, are probably the most directly influential factors for deciding the desirable quality level. Consequently, we start our variable selection by determining which one of the three should be selected first: each of the three is treated as an independent variable set. The average reward for each episode e is denoted as R̄^b(e), R̄^{w̄}(e) and R̄^w(e) respectively. The histograms of R̄^{w̄}(e) - R̄^b(e) and R̄^w(e) - R̄^b(e) (200 < e ≤ 300) are shown in Fig. 2, together with the p-values of the two-sided sign test. The number of episodes between 201 and 300 for which R̄^{w̄} - R̄^b < 0 (in red in the figure) is 72 out of 100, with a two-sided sign test p-value of 0.000017, rejecting the null hypothesis H0: Pr(R̄^{w̄} > R̄^b) = 0.5. The bottom graph, for R̄^w - R̄^b, can be explained in the same way. We therefore conclude that the rewards achieved by w and w̄ individually are significantly lower than those achieved by b, and the most important variable to be selected is the playback buffer b: S = {b}.
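
As a concrete illustration of the state representation used in these experiments, the sketch below discretizes the factors with the bin edges described above; the helper names are ours, and the final tuple corresponds to the variable set selected at the end of this section.

# Discretization of the environment variables: the playback buffer into 16
# two-second bins (32 s maximum), the bandwidths into 7 bins whose edges are the
# encoded bit-rates a_2..a_7 plus the 3000 kbps bottleneck, and d into 0..5.
import bisect

BITRATES = [300, 427, 608, 866, 1233, 1636, 2436]    # a_1 .. a_7 (kbps)
BW_EDGES = BITRATES[1:] + [3000]                     # [a_2, ..., a_7, BW_max]

def buffer_state(buffer_s: float) -> int:
    """16 non-overlapping 2-second states for a 32-second maximum buffer."""
    return min(15, int(buffer_s // 2))

def bandwidth_state(bw_kbps: float) -> int:
    """7 states: [0, a_2], [a_2, a_3], ..., [a_7, BW_max]."""
    return min(6, bisect.bisect_left(BW_EDGES, bw_kbps))

def environment_state(buffer_s, prev_buffer_s, avg_bw_kbps) -> tuple:
    """State tuple for the finally selected variable set S = {b, w_bar, pb}."""
    return (buffer_state(buffer_s), bandwidth_state(avg_bw_kbps),
            buffer_state(prev_buffer_s))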

Fig. 2. The 1st variable selection.

2nd variable The second variable is chosen from the factors w and w̄. To do so, the two variables are added to the existing set S separately and their performance is tested. The histograms of the difference between their average rewards and the benchmark, R̄^b, are shown in Fig. 3. From the upper graph we can see that incorporating w̄ into the environment variables (R̄^{b+w̄}) significantly increases the average reward (the number of episodes with a positive difference is 69), while adding the instant bandwidth w does not provide additional benefits (shown in the bottom graph). As a consequence, in this step w̄ is added to the variable set: S = {b, w̄}.

3rd variable The three remaining variables pb, w and d in S^C are tentatively incorporated into the existing environment variable set S, and their performance is compared with that of the set without any of them, as shown in Fig. 4. Not surprisingly, adding pb provides helpful information for the agent to learn its tangible environment, while the other two variables do not improve the rewards. Actually,


Fig. 3. The 2nd variable selection.

the combination of pb and b describes how quickly the play buffer fills or drains. If the buffer increases or decreases too fast, the agent knows that the true bandwidth does not match the previously requested quality, and thus it may adjust the requested quality for the next segment. The variable set S is then empirically set to S = {b, w̄, pb}.

Fig. 4. The 3rd variable selection.

No 4th variable Fig. 5 compares the performance of the environment variable set with and without the two remaining variables w (upper) and d (bottom). It clearly shows that neither of these two variables helps to increase the reward of the RL. So finally, no further variable is selected and the environment variable set is fixed to S = {b, w̄, pb}.

Discussion The results of our selection of prominent variables can be explained in two respects. First, we confirm that the playback buffer is the most influential variable,


Fig. 5. The 4th variable selection.

as shown in [9]. Second, we also see that the average bandwidth and the speed of filling/consuming of the buffer play important roles, helping the RL learner to obtain higher rewards by more clearly describing the environment that the learner is in. We also notice that w̄ is preferred over w in the selection. This can probably be explained by the fact that the HAS client should not be too sensitive to the instant perceived bandwidth, as the playback buffer acts as a container to keep the quality as stable as possible; the quality does not necessarily have to react instantly to a sudden bandwidth fluctuation. In effect, w̄, acting as a low-pass filter, smooths the recently perceived bandwidths, giving the RL learner a clearer perception of the tendency of the bandwidth change. Besides, it is somewhat surprising that the tendency monitor d does not contribute to the description of the RL environment. We conjecture that, as the penalty for quality oscillation is already given in the reward function, knowing how long the current quality tendency has been maintained does not provide more information.

4.3 Comparison with Microsoft IIS Smooth Streaming

Requested quality In section 4.2, we obtained the optimal environment representation S = {b, w̄, pb}. To verify the performance of the RL-based HAS client, the cross-traffic manager re-generates another series of cross traffic for 500 episodes, such that the RL-based client and Microsoft IIS Smooth Streaming run under the same bandwidth conditions for the same episode and can be compared. Fig. 6 shows the requested qualities (blue lines) and the corresponding perceived instant bandwidths (the green dashed line, corresponding to the right vertical axis) for episodes 1, 10, 20 and 50 respectively. The horizontal axis is the segment index, ranging from 1 to 299. The graphs also show the normalized picture freezes (the red vertical lines; the longer the line, the longer the freeze). As can be seen, the requested qualities are very random and irregular in the beginning (episode "0001"), but this randomness, together with the oscillation, keeps diminishing as more episodes are learned. At episode 50, the RL-based client has already learned to follow the fluctuation of the bandwidth, which demonstrates the validity of our approach. Figs. 7 and 8 show the quality levels requested by the RL-based client (the upper graph) and the Microsoft IIS Smooth Streaming client (the bottom graph) in episodes 490 and 491 respectively. Visually comparing with IIS Smooth Streaming, both figures demonstrate that the RL-based HAS client can adapt, without being overly sensitive, to the rapidly fluctuating bandwidth, achieving a quite stable request


Fig. 6. The evolution of reinforcement learning. Selected episodes: 1, 10, 20 and 50

list. Meanwhile, the RL-based client is also well protected from freezes, especially when the bandwidth is extremely low.

Various weight settings in the reward function Figs. 7 and 8 show the superior performance achieved by the RL-based HAS client, with the weights of picture freeze, quality and oscillation in the reward function being A = 100, B = 10 and C = 10 respectively. These QoE-related parameters are intuitively chosen: the reason that the weight of the freeze is 10 times larger than the other two is that a picture freeze normally leads to a much worse QoE. However, it remains an open question how these factors should be combined to model the QoE perceived by a human. While the mapping from QoE to the reward function is an interesting but out-of-scope topic, our target is to show that as long as the "optimal" QoE-oriented reward function is defined, the RL-based HAS client can automatically learn the "optimal" request strategy.

To do so, we run the RL-based HAS client multiple times with different weight sets for the reward function, still assuming that the factors are linearly combined. As a reference, the reward obtained by the RL learner with a given parameter set is compared with the reward of the standard Microsoft IIS Smooth Streaming client, pretending that the standard client were using the same parameter set for the reward. To this end, we fix the weight A for the freeze at 100, and alternately change the weights of quality B and oscillation C, ranging from 2 to 22. The average rewards over episodes 400-499 with the different parameter sets for both types of client are shown in Table 2. Note that for IIS, the requested quality sequence of an episode is independent of the concerned parameters A, B and C; thus its "achieved reward" is quite regular, like an arithmetic sequence. Nevertheless, it can be seen that within this grid, the average rewards of the RL-based HAS client are uniformly higher than those obtained by Microsoft IIS Smooth Streaming. This implies that, given a reward representation, the RL-based HAS client can learn to maximize the reward incrementally; if the perceived QoE can be represented as a reward function, then by maximizing the reward, reinforcement learning should be able to optimize the QoE progressively.
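
A sketch of this grid evaluation is shown below; average_reward is a hypothetical evaluation hook that would run the given client over episodes 400-499 with the specified weights and return the mean reward.

# Sketch of the weight-grid comparison of Table 2: fix A = 100, vary the quality
# weight B and the oscillation weight C over {2, 6, 10, 14, 18, 22}, and compare
# the average rewards of the RL-based client and the IIS client.
def sweep(average_reward):
    A = 100
    grid = range(2, 23, 4)                    # 2, 6, 10, 14, 18, 22
    table = {}
    for B in grid:
        for C in grid:
            table[(B, C)] = (average_reward("RL", A, B, C),
                             average_reward("IIS", A, B, C))
    return table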


Fig. 7. Requested quality for the RL-based HAS client and Microsoft IIS Smooth Streaming. Episode: 490

Fig. 8. Requested quality for the RL-based HAS client and Microsoft IIS Smooth Streaming. Episode: 491

Table 2. Average rewards (RL / IIS) with multiple parameter sets. Episodes: 400-499, A = 100.

quality cost B \ oscillation cost C        2            6           10           14           18           22
 2                                    -4.9/-6.3    -5.2/-6.5    -5.3/-6.8    -5.5/-7.1    -5.6/-7.4    -5.7/-7.6
 6                                   -13.7/-16.3  -14.3/-16.5  -14.7/-16.8  -14.9/-17.1  -15.2/-17.4  -15.5/-17.6
10                                   -22.5/-26.3  -23.2/-26.5  -23.5/-26.8  -23.9/-27.1  -24.5/-27.4  -24.5/-27.6
14                                   -31.4/-36.3  -32.1/-36.5  -32.4/-36.8  -32.8/-37.0  -33.1/-37.2  -33.6/-37.6
18                                   -40.2/-46.3  -41.0/-46.5  -41.3/-46.8  -41.7/-47.1  -42.2/-47.3  -42.6/-47.6
22                                   -48.8/-56.1  -49.8/-56.5  -50.1/-56.8  -50.7/-57.1  -51.0/-57.3  -51.4/-57.6

5 Conclusions and future work

In this paper, reinforcement learning was employed in the HAS client to demonstrate its robustness and adaptability under fast-changing network conditions. Specifically, we identified the most influential factors for representing the RL environment, using a forward variable selection strategy. We then compared the requests issued by the RL-based client and a standard HAS client, and showed that RL can be an alternative method for the bit-rate selection algorithm. We conclude that as long as the designed reward function, as an objective function, maps the true QoE, reinforcement learning should be able to learn the optimal request strategy through its trial-and-error procedure. In the future, we will investigate the feasibility of other potential factors for the RL environment, such as recently requested quality bit-rates and the first/second order derivatives of the play buffer, without excessively expanding the state space. Some parameters, like the frequency of freezes, could also be incorporated into the reward function. One ongoing study is the cooperation of multiple RL agents for quality fairness in the framework of HTTP Adaptive Streaming.

References

1. Microsoft, “Smooth streaming,” http://www.iis.net/downloads/microsoft/smooth-streaming, 2008. [Online; accessed July 2013].
2. R. Pantos and W. May, “HTTP live streaming overview,” http://tools.ietf.org/html/draft-pantos-http-live-streaming-10, 2012. [Online; accessed July 2013].



3. Adobe, “HTTP dynamic streaming: Flexible delivery of on-demand and live video streaming,” http://www.adobe.com/products/hds-dynamic-streaming.html, 2010. [Online; accessed July 2013].
4. T.-Y. Huang, N. Handigol, B. Heller, N. McKeown, and R. Johari, “Confused, timid, and unstable: Picking a video streaming rate is hard,” in ACM Internet Measurement Conference, November 2012, pp. 225–238.
5. S. Akhshabi, L. Anantakrishnan, C. Dovrolis, and A. Begen, “What happens when HTTP adaptive streaming players compete for bandwidth?” in ACM NOSSDAV, June 2012, pp. 9–14.
6. C. Liu, I. Bouazizi, and M. Gabbouj, “Rate adaptation for adaptive HTTP streaming,” in ACM MMSys, 2011, pp. 169–174.
7. L. De Cicco, S. Mascolo, and V. Palmisano, “Feedback control for adaptive live video streaming,” in ACM MMSys, Feb. 2011, pp. 145–156.
8. M. Claeys, S. Latré, J. Famaey, T. Wu, W. Van Leekwijck, and F. De Turck, “Design of a Q-learning-based client quality selection algorithm for HTTP adaptive video streaming,” in Proc. Conference on Autonomous Agents and Multiagent Systems, May 2013, pp. 30–37.
9. T.-Y. Huang, R. Johari, and N. McKeown, “Downton Abbey without the hiccups: Buffer-based rate adaptation for HTTP video streaming,” in ACM FhMN, Aug. 2013, pp. 9–14.
10. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
11. R. Mok, E. Chan, and R. Chang, “Measuring the quality of experience of HTTP video streaming,” in Proc. IFIP/IEEE International Symposium on Integrated Network Management (IM), May 2011, pp. 485–492.
12. A. Balachandran, V. Sekar, A. Akella, S. Seshan, I. Stoica, and H. Zhang, “Developing a predictive model of quality of experience for internet video,” in ACM SIGCOMM, Aug. 2013, pp. 339–350.
13. J. De Vriendt, D. De Vleeschauwer, and D. Robinson, “Model for estimating QoE of video delivered using HTTP adaptive streaming,” in Proc. IFIP/IEEE Workshop on QoE Centric Management, May 2013, pp. 1288–1293.
14. T. Hossfeld, S. Egger, R. Schatz, M. Fiedler, K. Masuch, and C. Lorentzen, “Initial delay vs. interruptions: Between the devil and the deep blue sea,” in Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on, July 2012, pp. 1–6.
15. “Click modular router,” http://read.cs.ucla.edu/click/click, 2010. [Online; accessed July 2013].
