
2011 14th International IEEE Conference on Intelligent Transportation Systems, Washington, DC, USA, October 5-7, 2011

Motorway Ramp-Metering Control with Queuing Consideration using Q-Learning

Mohsen Davarynejad, Andreas Hegyi, Jos Vrancken and Jan van den Berg

M. Davarynejad and J. Vrancken are with the Faculty of Technology, Policy and Management, Systems Engineering Group, Delft University of Technology, the Netherlands, {m.davarynejad; j.l.m.vrancken}@tudelft.nl. A. Hegyi is with the Faculty of Civil Engineering and Geosciences, Department of Transport & Planning, Delft University of Technology, the Netherlands, [email protected]. J. van den Berg is with the Faculty of Technology, Policy and Management, Section of Information and Communication Technologies, Delft University of Technology, the Netherlands, [email protected].

Abstract— Standard reinforcement learning algorithms have proven to be effective tools for letting an agent learn from the experiences generated by its interaction with an environment. Reinforcement learning algorithms are of interest, among other reasons, because they require no explicit model of the environment beforehand and learning happens through trial and error. This property makes them suitable for real control problems like traffic control, especially when a local ramp-metering controller has to take the performance of the wider network into account and respect limitations such as the maximum permissible queue length. Here, a local ramp-metering control problem with queuing consideration is taken up, and the performance of the standard Q-learning algorithm as well as a newly proposed multi-criterion reinforcement learning algorithm is investigated. The experimental analysis confirms that the proposed multi-criterion control approach decreases the state-space size and increases the learning speed of the controller while improving the quality of the solution.

I. INTRODUCTION

Road traffic is among the most flexible and most important, but also among the most expensive [25], most polluting and, by far, the most dangerous means of transport [6]. Traffic control is widely used to improve traffic efficiency and safety and to reduce environmental damage. It consists of influencing the flow of traffic by means of visual signals, the oldest and most common of which are the traffic signals at intersections and on-ramps. On-ramp metering is an effective and widely used measure to increase motorway operation efficiency by controlling the traffic allowed to enter the motorway. When traffic is dense, by limiting the inflow into the motorway, ramp-metering can prevent a traffic breakdown such that the density remains below the critical value, thus avoiding congestion [11].

Model Predictive Control (MPC) [17] is a technique that is frequently used for ramp-metering control [4]. Because MPC is model-based, the predictive controller risks misbehaving in the presence of model mismatch. Moreover, the non-linear optimizations that are part of MPC approaches may fail to find the global optimal solution [9]. Other approaches are worth noting here, such as the one proposed in [20], which is also based on control theory; control-theoretic approaches of this kind are computationally demanding, which limits their applicability to small networks, and they still need an accurate model of the network. To avoid these problems and the computational complexity of MPC, in this paper reinforcement learning (RL) algorithms are used for the ramp-metering control problem, since techniques of this kind do not require a model of the process and reduce reliance on knowledge of the system to be controlled. They are also less computationally expensive. In addition, RL algorithms continuously gather information over different traffic patterns and adapt their control policy on-line, making them suitable for complex traffic network control problems with many non-recurrent patterns. The motivation of this work is to present an RL-based ramp-metering controller that increases the network throughput while keeping the on-ramp queue below the storage capacity limit, thus preventing long queues from interfering with surface street traffic.

In RL, while interacting with the environment and through trial and error, the agent estimates the expected utility of state-action pairs. It is a semi-supervised learning approach, aiming at developing algorithms for solving sequential decision problems (SDPs), by which agents learn to achieve goals from their interactions with the environment. The agent perceives the state of the environment, takes an action and receives a scalar signal providing evaluative information on the quality of the action; the signal does not provide any instructive information on the best behavior in that state. At each time step, the signal (also known as the reinforcement signal) can be positive, negative or zero. The goal of the agent is to maximize the expected cumulative discounted reward by finding an optimal action selection. The agent starts with almost random actions, but by seeking a balance between exploration and exploitation, gradually finds actions that lead to high values of the reward function [14], [24]. Knowing the value of each state, which is the expected long-term reward that can be earned when starting from that state (the value function), the agent can choose the best action to take. There is an enormous number of approaches aiming at developing value-function-based solutions to RL; temporal difference is the most common search principle in the space of value functions [23]. RL algorithms are becoming increasingly popular in traffic control problems [1]–[3], [5], [21].


TABLE I
LINK EQUATIONS

Flow-density-speed equation: $q_{m,i}(k) = \rho_{m,i}(k)\, v_{m,i}(k)\, n_m$

Conservation of vehicles: $\rho_{m,i}(k+1) = \rho_{m,i}(k) + \frac{T}{l_{m,i}\, n_m}\left[ q_{m,i-1}(k) - q_{m,i}(k) \right]$

Speed dynamics: $v_{m,i}(k+1) = v_{m,i}(k) + \frac{T}{\tau_m}\left[ V[\rho_{m,i}(k)] - v_{m,i}(k) \right] + \frac{T}{l_{m,i}}\, v_{m,i}(k)\left[ v_{m,i-1}(k) - v_{m,i}(k) \right] - \frac{\vartheta_m T}{\tau_m l_{m,i}} \frac{\rho_{m,i+1}(k) - \rho_{m,i}(k)}{\rho_{m,i}(k) + \kappa_m}$

Fundamental diagram: $V[\rho_{m,i}(k)] = v_{\mathrm{free},m} \exp\left[ -\frac{1}{a_m}\left( \frac{\rho_{m,i}(k)}{\rho_{\mathrm{crit},m}} \right)^{a_m} \right]$

Queueing model at the origin: $w_o(k+1) = w_o(k) + T\left[ d_o(k) - q_o(k) \right]$

Ramp outflow equation: $q_o(k) = \min\left[ d_o(k) + \frac{w_o(k)}{T},\; Q_o\, r_o(k),\; Q_o \frac{\rho_{\max,m} - \rho_{m,1}(k)}{\rho_{\max,m} - \rho_{\mathrm{crit},m}} \right]$

TABLE II
METANET NODE EQUATIONS

Total traffic flow entering node n: $Q_n(k) = \sum_{\mu \in I_n} q_{\mu,N_\mu}(k)$

Traffic flow that leaves node n via link m: $q_{m,0}(k) = \beta_n^m(k)\, Q_n(k)$

Virtual downstream density: $\rho_{m,N_m+1}(k) = \frac{\sum_{\mu \in O_n} \rho_{\mu,1}^2(k)}{\sum_{\mu \in O_n} \rho_{\mu,1}(k)}$

Virtual upstream speed: $v_{m,0}(k) = \frac{\sum_{\mu \in I_n} v_{\mu,N_\mu}(k)\, q_{\mu,N_\mu}(k)}{\sum_{\mu \in I_n} q_{\mu,N_\mu}(k)}$

At the local level, for example, a ramp-metering controller based on Q-learning is presented in [13]. In [12], reinforcement learning, combined with CMAC tile coding for function approximation, is used for a combined ramp-metering and variable message sign control problem. While it is crucial to control the queue length on on-ramps, this is normally neglected in learning-based control approaches, mainly because controlling the queue length adds extra dimensions to the state space, making the problem difficult to handle with standard Q-learning algorithms. This paper presents a multi-criterion ramp-metering control structure that introduces a simple concept to keep the state-space size limited, which makes learning faster even for the simple Q-learning algorithm.

The paper is organized as follows. Section II presents the traffic flow model used to design a motorway ramp-metering controller with a queue-length regulator. Section III introduces the Markov decision process (MDP) that formalizes the interaction between an agent and its environment. Section IV presents temporal difference (TD) methods, a common strategy for finding the optimal control policy. From the class of TD approaches, Q-learning is chosen for the ramp-metering control problem described in Section V. Test results of the controller on a macroscopic traffic simulator are presented in Section VI. Finally, conclusions are drawn in Section VII.

II. MOTORWAY TRAFFIC FLOW MODEL

Instead of interacting with a real system, trials are made by interaction with a simulation model, because of its several advantages, including cheaper and faster learning. Depending on the application area, traffic flow models are classified as deterministic vs. stochastic, continuous vs. discrete, and microscopic vs. macroscopic. METANET [18] is a deterministic, discrete-time, spatially discretized, macroscopic second-order traffic flow model that offers a good trade-off between efficiency and accuracy. Details on METANET can be found in [18]; here we use an extended version of METANET [10].

In METANET, a node is defined as a point where there is a geometric change in the motorway, such as a bifurcation, on/off-ramp, lane gain/lane drop, etc. A road connecting two nodes with uniform characteristics is called a link. METANET discretizes a link into equal segments and describes, through non-linear difference equations, how the traffic variables (e.g. density and flow) of each segment evolve. The most fundamental equations of traffic dynamics in a segment of a link are given in Table I, and the node equations are presented in Table II.

It is worth mentioning that instead of the ramp outflow equation presented in Table I, we prefer a slightly different formulation of ramp-metering that can be found in [19] and [16], stating that the ramp flow at simulation step k, q_o(k), equals the ramp-metering rate r_o(k) times the minimum of the available traffic at step k, d_o(k) + w_o(k)/T (demand plus queue), the capacity of the on-ramp, Q_o, and the maximal flow that can enter the motorway because of the mainstream conditions, Q_o (ρ_max,m − ρ_m,1(k)) / (ρ_max,m − ρ_crit,m):

$q_o(k) = r_o(k)\, \min\left[ d_o(k) + \frac{w_o(k)}{T},\; Q_o,\; Q_o \frac{\rho_{\max,m} - \rho_{m,1}(k)}{\rho_{\max,m} - \rho_{\mathrm{crit},m}} \right] \qquad (1)$

The advantage of this second formulation is that it assumes the ramp-metering to be the limiting factor on the ramp flow (and not the on-ramp capacity or the traffic conditions on the motorway), which may be an advantage when interpreting plots of the ramp-metering signal.
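To make the link dynamics concrete, the following is a minimal numerical sketch of one simulation step of the Table I equations for a single link with an on-ramp at its first segment. The parameter values, the helper names and the placement of the on-ramp are illustrative assumptions, not the calibrated configuration of [10].

```python
import numpy as np

# Illustrative METANET parameters (placeholders, not the calibrated values of [10]).
T = 10.0 / 3600.0    # simulation time step [h]
L_SEG = 1.0          # segment length [km]
N_LANES = 2          # number of lanes
TAU = 18.0 / 3600.0  # relaxation time [h]
KAPPA = 40.0         # anticipation constant [veh/km/lane]
THETA = 60.0         # anticipation coefficient [km^2/h]
V_FREE = 102.0       # free-flow speed [km/h]
RHO_CRIT = 33.5      # critical density [veh/km/lane]
RHO_MAX = 180.0      # jam density [veh/km/lane]
A_M = 1.87           # exponent of the fundamental diagram
Q_CAP = 2000.0       # on-ramp capacity [veh/h]


def fd_speed(rho):
    """Fundamental diagram V[rho] of Table I."""
    return V_FREE * np.exp(-(1.0 / A_M) * (rho / RHO_CRIT) ** A_M)


def metanet_step(rho, v, w, d, r, q_in, v_in, rho_down):
    """One step of the Table I equations for one link (arrays over segments).

    rho, v: densities and speeds per segment; w: origin queue; d: ramp demand;
    r: metering rate; q_in, v_in: mainstream inflow and speed; rho_down:
    density just downstream of the link.
    """
    q = rho * v * N_LANES                                    # flow-density-speed
    # Ramp outflow in the formulation of (1): rate times the minimum term.
    q_ramp = r * min(d + w / T, Q_CAP,
                     Q_CAP * (RHO_MAX - rho[0]) / (RHO_MAX - RHO_CRIT))
    w_new = max(w + T * (d - q_ramp), 0.0)                   # origin queue update
    q_up = np.concatenate(([q_in + q_ramp], q[:-1]))         # upstream flow per segment
    v_up = np.concatenate(([v_in], v[:-1]))
    rho_dn = np.concatenate((rho[1:], [rho_down]))
    rho_new = rho + T / (L_SEG * N_LANES) * (q_up - q)       # conservation of vehicles
    v_new = (v + T / TAU * (fd_speed(rho) - v)               # relaxation term
             + T / L_SEG * v * (v_up - v)                    # convection term
             - THETA * T / (TAU * L_SEG)
             * (rho_dn - rho) / (rho + KAPPA))               # anticipation term
    return np.maximum(rho_new, 0.0), np.maximum(v_new, 0.0), w_new, q_ramp
```

A complete model of the case-study network would chain such link updates through the node equations of Table II; the sketch above only illustrates the structure of one update.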


III. MARKOV DECISION PROCESSES

A policy defines the way a learning agent behaves by mapping a perceived state of the environment to the (feasible) actions to be taken in that state. A policy alone is sufficient to determine the agent's behavior. The control policy we attempt to find is a reactive policy, meaning that it defines a mapping from states to actions without storing the state-action pairs of previous time steps or using future information. This requires that the system can be described by a state-transition mapping for a discrete state x and an action (or input) u in discrete time as follows:

$x_{k+1} \sim p(x_{k+1} \mid x_k, u_k) \qquad (2)$

with p a probability distribution function over the state-action space. In this case, the system is said to be a Markov decision process (MDP) and the probability distribution function p is said to be the Markov model of the system. An MDP describes a process by states, actions, transitions, and associated rewards; it essentially captures the stochastic transitions between states under the influence of actions. In domains with complex state transitions, MDPs focus on capturing this uncertainty. In the MDP context, the problem is to determine a control policy π : X → U (decision rule), which assigns an action to each state, such that the sequence of decisions (or controls) minimizes a cost functional. Finding an optimal control policy for an MDP is equivalent to finding the optimal sequence of state-action pairs from any given initial state to a certain goal state, which is a sequential decision problem (SDP). When the state transitions are stochastic, as in (2), it is a stochastic combinatorial optimization problem.

Learning takes place through the agent-environment interaction. The environment is described by an MDP model, which is a 5-tuple (X, U, P, γ, R):
• X: a set of discrete states of the system, with x ∈ X.
• U: a set of discrete actions, with u ∈ U.
• P^u(x, y): the transition probability from state x to state y when taking action u, where $\sum_{y'} P^u(x, y') = 1$ and $P^u(x, y') \geq 0$.
• γ: the discount factor, 0 ≤ γ < 1, which determines the present value of future rewards.
• R(x, π(x)) ∈ ℝ: a scalar reward function, where π(x) is the policy in state x.

An MDP is a process in which the effects of actions depend only on the current state. The goal of the agent is to find the optimal policy π*, where a policy π is a map from states to actions and Π is the set of all possible policies. The value of following a policy π ∈ Π from state x is the expected cumulative discounted reward, which can be formally written as:

$V^{\pi}(x) = E\left[ \sum_{k=0}^{\infty} \gamma^{k} R(x_k, \pi(x_k)) \,\middle|\, x_0 = x \right] \qquad (3)$
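As an illustration of definition (3), the following sketch estimates V^π(x0) for a small, made-up three-state MDP by averaging discounted returns over sampled rollouts. The transition probabilities, rewards and policy below are arbitrary and serve only to make the notation concrete; they are not taken from the traffic problem.

```python
import numpy as np

# Toy MDP, for illustration only: P[u, x, y] is the transition probability
# from state x to state y under action u; R[u, x] is the scalar reward.
P = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.7, 0.2],
               [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0, 2.0],
              [0.5, 0.5, 3.0]])
policy = np.array([0, 1, 1])   # pi(x): the action chosen in each state
gamma = 0.95                   # discount factor


def value_estimate(x0, n_rollouts=2000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(x0) as defined in (3), truncated at `horizon`."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        x, ret, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            u = policy[x]
            ret += disc * R[u, x]          # accumulate discounted reward
            disc *= gamma
            x = rng.choice(3, p=P[u, x])   # sample the next state
        total += ret
    return total / n_rollouts


print(value_estimate(0))   # estimated value of following pi from state 0
```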

Given the optimal state-value function V*(x), defined as $V^*(x) = \max_{\pi} V^{\pi}(x)$, one can easily find the optimal policy π*. Finding the value function of a policy is among the convenient ways of finding an optimal policy. One way to find π* is thus through the optimal value function, which is presented and discussed in the next section.

IV. TEMPORAL DIFFERENCE METHODS FOR RL PROBLEMS

Temporal difference (TD) methods are a class of incremental learning procedures [23] designed to learn the value function in either an on-policy or an off-policy manner. Among them, a table-lookup representation of the value function is widely used. For example, the so-called Q-learning algorithm stores the return obtained by taking action u in state x and then acting greedily with respect to the current Q-values, with the aim of approximating, on-line, Q*, the solution of Bellman's optimality equation. The Q-values Q(x_k, u_k) store the state-action values, which represent the expected value of being in state x_k and taking action u_k. The Q-learning update rule adjusts the values of state-action pairs according to:

$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha\left[ R(x_k, u_k) + \gamma \max_{u_{k+1}} Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$

where α is the learning rate and γ is the discount factor. In this algorithm, the value-function parameters are estimated using temporal difference learning, meaning that the temporal difference between two successive evaluations of the value function is used to update the current value. SARSA, another TD learning algorithm, operates very similarly to Q-learning; but unlike Q-learning, it uses the policy being followed to predict future actions (on-policy), while Q-learning assumes a greedy choice of future actions, i.e. its predictions about future actions are independent of the policy it follows (off-policy). It has been shown that over time Q-learning and SARSA produce the same result [15], although their patterns of learning are different.
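The tabular updates above can be written compactly as follows. This is a minimal sketch using integer indices for the discretized states and actions, with the learning rate and discount factor of Table III as defaults; the class and method names are ours.

```python
import numpy as np


class TabularTD:
    """Minimal table-lookup TD learner with Q-learning and SARSA updates."""

    def __init__(self, n_states, n_actions, alpha=0.2, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))   # Q(x, u) table
        self.alpha = alpha                          # learning rate
        self.gamma = gamma                          # discount factor

    def q_learning_update(self, x, u, reward, x_next):
        # Off-policy: bootstrap on the greedy value of the next state.
        target = reward + self.gamma * np.max(self.Q[x_next])
        self.Q[x, u] += self.alpha * (target - self.Q[x, u])

    def sarsa_update(self, x, u, reward, x_next, u_next):
        # On-policy: bootstrap on the action actually selected next.
        target = reward + self.gamma * self.Q[x_next, u_next]
        self.Q[x, u] += self.alpha * (target - self.Q[x, u])

    def greedy_action(self, x):
        return int(np.argmax(self.Q[x]))
```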


In many cases a model is used to ease the structural credit assignment problem¹. These methods are known as indirect or simulation-based learning and can increase the speed of learning. Indirect learning is plausible if a fairly accurate model of the process is available. Here, direct Q-learning is used to control the density of a road network, meaning that no internal model is created during the control process.

In reinforcement learning systems, exploration is often implemented by perturbing the exploitation operation, i.e. randomly preventing the agent from acting optimally. The ε-greedy policy is extensively used and is an extension of the greedy policy. In the greedy policy, the agent selects the action u ∈ U with the highest estimated value, u_k = argmax_{u_k} Q(x_k, u_k). In ε-greedy action selection, the agent is additionally allowed a certain degree of random exploration.

¹The structural credit assignment problem requires an appropriate internal state-space representation of the process.

Fig. 1. A simple network with an on-ramp [10]. The objective of the controller is to maximize the network's throughput by ramp-metering while keeping the on-ramp queue length below a predefined permissible queue length.

$u_k = \begin{cases} u \sim \mathcal{U}(U) & \text{if } \mathcal{U}(0,1) \leq \varepsilon \\ \arg\max_{u} Q(x_k, u) & \text{otherwise} \end{cases}, \qquad 0 < \varepsilon < 1,\ \varepsilon \in \mathbb{R} \qquad (4)$

Here U(0, 1) is a random number drawn from the uniform distribution between 0 and 1. The higher ε is, the higher the exploration. While a high exploration rate gives the agent the chance to react rapidly to environment changes, it prevents the agent from consistently following its optimal choices.
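A minimal sketch of the ε-greedy selection rule (4) over one row of the Q-table; the function name and the random-number interface are our own choices.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng=None):
    """Return an action index for one state, following the rule in (4)."""
    rng = rng or np.random.default_rng()
    if rng.random() <= epsilon:            # explore: uniform over all actions
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))           # exploit: greedy action
```

With ε = 0.2, as in Table III, roughly one in five actions is chosen at random.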

Fig. 2. Demand profile over a time horizon of about 2.5 h.

V. Q-LEARNING BASED DENSITY CONTROL AGENT DESCRIPTION

A ramp-metering control problem can be formulated as a Markov decision process (MDP) [26] and solved using reinforcement learning algorithms. There are three basic components in RL, namely the state space, the action space and the reward function. These elements are defined as follows.


A. States

Five possible system state components are presented here. Not all of them are of interest in every simulation, depending on the objective of the controller. The first is the density downstream of the diversion point, ρ, and the second is the on-ramp queue length, w_o. The third and fourth are the on-ramp demand, d_o, and the one-step prediction of the on-ramp traffic demand, d_o+1, an important quantity for proactive control actions [8]. The last is the current metering rate, r_o. The density is normalized with respect to the critical density, ρ_crit, and the range (ρ_l, ρ_u] is discretized into n_ρ equispaced grid points. The queue length is normalized with respect to the maximum permissible queue length, w_max, and is discretized into n_wo elements. d_o is first normalized with respect to the maximum capacity, Q_o, and then discretized into n_do states. d_o+1 has only 3 states, defined relative to d_o: it can be equal to d_o, or one step up/down from the current d_o. The metering rate, r_o, is discretized into n_ro equal parts ranging from r_l to r_u.

When the downstream flow is lower than the flow capacity, the agent can select an appropriate action to increase the flow. If the queue length reaches the maximum queue storage capacity, then, as the state of the network changes, the agent increases the ramp-metering rate.
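A minimal sketch of one possible discretization of these state components into a single table index, using the bounds and bin counts of Section VI. The helper names and the exact binning of the queue thresholds are our own assumptions.

```python
import numpy as np

# Bounds and bin counts taken from Section VI; names are ours.
RHO_L, RHO_U, N_RHO = 0.2, 0.4, 11    # normalized density range and bins
R_L, R_U, N_R = 0.0, 1.0, 11          # metering-rate range and bins
W_THRESH = [0.25, 0.65, 0.85, 0.95, 1.0, 1.05, 1.25, 2.0]  # queue thresholds (8 bins)


def uniform_bin(value, low, high, n_bins):
    """Index of an equispaced bin on (low, high], clipped to the valid range."""
    idx = int(np.floor((value - low) / (high - low) * n_bins))
    return int(np.clip(idx, 0, n_bins - 1))


def queue_bin(w_norm):
    """Index of the first threshold not exceeded by the normalized queue."""
    return min(int(np.searchsorted(W_THRESH, w_norm)), len(W_THRESH) - 1)


def state_index(rho_norm, r_o, w_norm):
    """Flatten (density, rate, queue) bins into one index for the Q-table."""
    i = uniform_bin(rho_norm, RHO_L, RHO_U, N_RHO)
    i = i * N_R + uniform_bin(r_o, R_L, R_U, N_R)
    return i * len(W_THRESH) + queue_bin(w_norm)

# Example: with only these three components, the table has 11 * 11 * 8 = 968 rows.
```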

B. Actions

In order to optimize the traffic situation, the agent selects a suitable action according to its current state. Three actions are considered, Δr_o ∈ {r−, r0, r+}; in this study r− = −0.1, r0 = 0 and r+ = +0.1. The resulting ramp-metering rate takes a finite number of distinct values in [r_l, r_u] and is updated as follows:

$r_o(k+1) = \begin{cases} r_l & \text{if } r_o(k) + \Delta r_o < r_l \\ r_o(k) + \Delta r_o & \text{if } r_l \leq r_o(k) + \Delta r_o \leq r_u \\ r_u & \text{if } r_o(k) + \Delta r_o > r_u \end{cases} \qquad (5)$
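A small sketch of the saturated rate update (5); the function name and the example values are illustrative.

```python
ACTIONS = (-0.1, 0.0, +0.1)   # the three actions r-, r0, r+


def apply_action(r_current, delta_r, r_low=0.0, r_up=1.0):
    """Update the metering rate according to (5), clipping to [r_low, r_up]."""
    return min(max(r_current + delta_r, r_low), r_up)


# Example: increasing the rate near the upper bound saturates at r_up.
print(apply_action(0.95, ACTIONS[2]))   # -> 1.0
```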

C. Reward function

The reward can be either positive or negative, in accordance with the outcome, i.e. whether a benefit or a penalty is accrued. Since the usual control goal is outflow maximization, the reward is a function of the outflow for agents controlling the outflow directly, and a function of the queue length for agents controlling the queue length.


Fig. 3. (a) Applied ramp-metering control, (b) Resulting flow throughput, (c) Density at the first segment of L2, (d) Queue length at O2, (e) Obtained reward values.

VI. NUMERICAL RESULTS

The case-study network consists of a mainline and an on-ramp (see Figure 1). There are two origins (O1, the main origin, and O2), two motorway links (L1 and L2), and one destination (D1). Link L1 is 4 km long and consists of four segments of 1 km each. Link L2 has two segments of 1 km each and ends in D1 with unrestricted outflow. The simulation is programmed and carried out in MATLAB. Full details of the studied motorway configuration can be found in [10].

The problem is that of determining optimal ramp-metering control over a time horizon of 2.5 h. The traffic demands of the network are shown in Figure 2. The ramp-metering is considered to be active only if the flow downstream of the diversion point is higher than f_T = 0.8. The network characteristics remain the same for all simulation cases. The simulation results for the ramp-metering control problem when no constraint is imposed on the ramp queue length are presented in Section VI-A, and the simulation results when a maximum permissible queue length is imposed are presented in Section VI-B.

A. Ramp-metering without limitation on queue length

Here, we assume that there is no restriction on the queue length at O2, meaning that the states associated with d_o and d_o+1 as well as w_o introduced in Section V-A are not of interest and are therefore left out. The normalized density between the lower bound ρ_l = 0.2 and the upper bound ρ_u = 0.4 is discretized into n_ρ = 11 equispaced grid points. r_l and r_u are set to 0 and 1 respectively, with n_r = 11. Table III displays the parameter values used in Q-learning.

Fig. 4. Ramp-metering with limitation on queue length using standard Q-learning control, (a) Applied ramp-metering control, (b) Resulting flow throughput, (c) Density at the first segment of L2, (d) Queue length at O2, (e) Obtained reward values.

The reward function is as follows:

$R = q(k) \qquad (6)$

Figure 3 presents the simulation results after 1000 iterations. The traffic flow throughput is maintained at capacity during the periods of high demand. The reward the agent obtained during one single run is presented in Figure 3(e). The robustness of the obtained solutions against simple system failures is investigated in [7].

TABLE III
Q-LEARNING PARAMETER VALUES. α IS THE LEARNING RATE, γ IS THE DISCOUNT FACTOR AND ε IS THE PARAMETER OF ε-GREEDY ACTION SELECTION.

Parameter:  α     γ      ε
Value:      0.2   0.95   0.2

B. Ramp-metering with limitation on queue length

To control the on-ramp queue length, two approaches are proposed here. The first one is the standard Q-learning algorithm. Since the convergence of standard Q-learning is slow, at least in this experiment, another controller is designed in Section VI-B.2 with faster convergence and better performance.

1) Ramp-metering with limitation on queue length using standard Q-learning control: Since queue length control requires more details about the status of the network, in addition to the density downstream of the diversion point and the current metering rate, the queue length is an important feature of the network and will be considered as a new dimension of the state space.

The queue length, w_o, is also normalized with respect to the maximum permissible queue length, w_max = 100, and is discretized into w_o ∈ {.25, .65, .85, .95, 1, 1.05, 1.25, 2.0}. Moreover, the current on-ramp flow demand d_o is considered as an additional state, first normalized and then discretized into n_do = 11 states. Overall, the number of states equals 10648 = ‖ρ‖ · ‖r_o‖ · ‖w_o‖ · ‖d_o‖ = n_ρ · n_r · n_wo · n_do = 11 · 11 · 8 · 11. The reward function is slightly modified as follows when considering queue length control:

$R = \begin{cases} q(k) & \text{if } w_o \leq 1 \\ q(k)\,(2 - w_o) & \text{if } w_o > 1 \end{cases} \qquad (7)$

The simulation results for the case with queue length control are presented in Figure 4.

2) Ramp-metering with queue storage limitation using multi-agent Q-learning control: In this approach, two separate controllers are designed: one controls the ramp-metering to ensure that the motorway operates efficiently and vehicles move at free-flow speed, the other regulates the queue length. The reasons for designing an extra controller are the desire to reduce algorithmic complexity, to obtain higher robustness against unpredictable demands, and to lower the computational requirements by shrinking the state-action space.

The state space of the first controller is the same as the one described in Section VI-A, a very simple controller with only 121 = ‖ρ‖ · ‖r_o‖ = n_ρ · n_r = 11 · 11 states. The extra controller observes the current queue length, w_o, the current on-ramp flow demand, d_o, as well as the predicted on-ramp flow demand of the next time step, d_o+1. The queue length is normalized and discretized into w_o ∈ {.25, .65, .85, .95, 1, 1.05, 1.25, 2.0}. The current on-ramp flow demand is first normalized and then discretized into n_do = 11 states. For every flow demand, assuming smooth changes in the demand profile, the predicted flow demand d_o+1 can be either equal to the current demand or one step up/down from it. The current metering rate, the last state of this controller, takes quantized values between 0 and 1 and can take 11 distinct values. Considering the maximum permissible queue length, w_max, the controller decides whether to increase or decrease the metering rate so that the maximum permissible queue length is maintained. The total number of discretized states is 2904 (= ‖w_o‖ · ‖d_o‖ · ‖d_o+1‖ · ‖r_o‖ = 8 · 11 · 3 · 11).

The reward function for the queue length control agent is defined as follows:

$R = \begin{cases} \dfrac{1}{\ln\|1 - w_o\|} & \text{if } 0.01 \leq w_o \leq 1.99 \\ -100 & \text{otherwise} \end{cases} \qquad (8)$

Note that $\frac{1}{\ln\|1 - w_o\|}$ tends to zero as w_o approaches 1, which is the maximum reward the agent can get in each trial. To push the actions toward smooth changes in the action space, and as a result smooth changes in the resulting queue length, the reward is set to +1 once 0.975 ≤ w_o ≤ 1.025.
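A minimal sketch of the queue-length reward (8), including the +1 band described above; the function name is ours and w_norm denotes the queue length normalized by w_max.

```python
import math


def queue_reward(w_norm):
    """Queue-length reward following (8), with the +1 band around the target."""
    if 0.975 <= w_norm <= 1.025:
        return 1.0                                # bonus band around w_o = 1
    if 0.01 <= w_norm <= 1.99:
        return 1.0 / math.log(abs(1.0 - w_norm))  # negative, approaches 0 near 1
    return -100.0                                 # heavy penalty outside the band


# Example: the closer the normalized queue is to 1, the higher the reward.
print(queue_reward(0.5), queue_reward(0.95), queue_reward(1.0), queue_reward(2.5))
```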

Fig. 5. For the queue length controller, the reward is a function of the queue length. The closer the queue length is to the desired value, the higher the instant reward the agent receives.

Fig. 6. Queue length control using a second Q-learning controller, (a) Applied ramp-metering control, (b) Resulting flow throughput, (c) Density at the first segment of L2, (d) Queue length at O2, (e) Obtained reward values.

The designed reward function is presented in Figure 5, and the resulting queue length controller after 3000 episodes is presented in Figure 6. The objective of the queue length regulator is to accumulate vehicles so as to reach the maximum permissible queue length. For both heavy and light mainline traffic conditions, this is in conflict with the objective of the mainline metering controller. In the former case, the mainline controller allows on-ramp vehicles to enter at a slow rate (to maintain the density on the mainline around the critical density), resulting in quick queue growth beyond the limit. In the latter case (light mainline traffic), because of the same desire, on-ramp vehicles are allowed to enter the mainline as fast as possible, which results in a short queue.
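As discussed next, this conflict is resolved by applying the more permissive of the two proposed rates; a minimal sketch of such an arbitration rule (the function and variable names are ours):

```python
def combined_rate(rate_mainline_agent, rate_queue_agent):
    """Apply the more permissive (larger) of the two proposed metering rates."""
    return max(rate_mainline_agent, rate_queue_agent)


# Example: the queue agent asks for a higher rate to limit queue growth,
# so its rate is the one applied to the on-ramp.
print(combined_rate(0.4, 0.7))   # -> 0.7
```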


The conflicting objectives of the two controllers suggest choosing the maximum (the most permissive) of the two rates as the applied ramp-metering rate [22]. Of course, after receiving the reward, only the Q-table of the corresponding agent is updated.

VII. CONCLUSION AND FUTURE WORK

An interesting setting of the ramp-metering control problem is to impose input constraints. In practice, when traffic flow exceeds capacity, queuing is inevitable, and in many situations the ramp-metering is constrained by the length of the queue on the on-ramp. In this paper, a Q-learning based density controller is presented that handles the on-ramp control problem when the on-ramp queue length is bounded by a certain upper limit in order to prevent spill-back to a surface street intersection. The approach presented here gives a good performance in terms of keeping the flow close to capacity while simplifying the problem using multi-agent Q-learning control. Theoretically, it is possible to find a controller superior to the one we achieved here by using a more informative reward function; an interesting research direction would be to explore how the formulation of the reward function influences the results. Apart from the standard, well-known SARSA, this paper does not benchmark the proposed algorithm against other methods; a proper comparison to existing techniques is part of our plan for future research. Moreover, in our next experiments we intend to study the impact of different action-selection policies in Q-learning, such as the Boltzmann distribution, simulated annealing and probability matching.

ACKNOWLEDGMENT

This research received funding from the European Community's Seventh Framework Programme within the "Control for Coordination of Distributed Systems" project (Con4Coord - FP7/2007-2013 under grant agreement no. INFSO-ICT-223844) and the Next Generation Infrastructures Research Program of Delft University of Technology.

REFERENCES

[1] B. Abdulhai, R. Pringle, and G.J. Karakoulas. Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering, 129:278, 2003.
[2] A.L.C. Bazzan. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3):342–375, 2009.
[3] A.L.C. Bazzan, D. de Oliveira, and B.C. da Silva. Learning in groups of traffic signals. Engineering Applications of Artificial Intelligence, 23(4):560–568, 2010.
[4] T. Bellemans, B. De Schutter, and B. De Moor. Anticipative model predictive control for ramp metering in freeway networks. In Proceedings of the 2003 American Control Conference, pages 4077–4082, 2003.
[5] M.C. Choy, R.L. Cheu, D. Srinivasan, and F. Logi. Real-time coordinated signal control using agents with online reinforcement learning. In Proceedings of the 80th Annual Meeting of the Transportation Research Board, 2003.
[6] European Commission. European transport policy for 2010: time to decide. Office for Official Publications of the European Communities, Luxembourg, 2010.

[7] M. Davarynejad, A. Hegyi, J. Vrancken, and Y. Wang. Freeway traffic control using Q-learning. In Proceedings of the 10th TRAIL Congress, World Trade Center, Rotterdam, The Netherlands, 2010.
[8] M. Davarynejad, Y. Wang, J. Vrancken, and J. van den Berg. Multi-phase time series models for motorway flow forecasting. In International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011.
[9] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 39(2):517–529, 2009.
[10] A. Hegyi. Model Predictive Control for Integrating Traffic Control Measures. PhD thesis, Delft University of Technology, Delft Center for Systems and Control, 2004.
[11] A. Hegyi, B. De Schutter, and H. Hellendoorn. Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C, 13(3):185–209, 2005.
[12] C. Jacob and B. Abdulhai. Integrated traffic corridor control using machine learning. In International Conference on Systems, Man and Cybernetics, volume 4, pages 3460–3465. IEEE, 2006.
[13] X. Ji and Z. He. An optimal control method for expressways entering ramps metering based on Q-learning. In Second International Conference on Intelligent Computation Technology and Automation, volume 1, pages 739–741. IEEE, 2009.
[14] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[15] J.O. Kephart and D.M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.
[16] A. Kotsialos, M. Papageorgiou, and F. Middelham. Optimal coordinated ramp metering with advanced motorway optimal control. In Proceedings of the 80th Annual Meeting of the Transportation Research Board, 2001.
[17] J.M. Maciejowski. Predictive Control with Constraints. Prentice Hall, Harlow, England, 2002.
[18] A. Messmer and M. Papageorgiou. METANET: A macroscopic simulation program for motorway networks. Traffic Engineering and Control, 31(8), 1990.
[19] A. Messner and M. Papageorgiou. METANET: A macroscopic simulation program for motorway networks. Traffic Engineering & Control, 31(8-9):466–470, 1990.
[20] M. Papageorgiou. An integrated control approach for traffic corridors. Transportation Research Part C: Emerging Technologies, 3(1):19–30, 1995.
[21] L.A. Prashanth and S. Bhatnagar. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, (99):1–10, 2011.
[22] E. Smaragdis and M. Papageorgiou. A series of new local ramp metering strategies. Transportation Research Record, pages 89–117, 2004.
[23] R.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[24] R.S. Sutton and A.G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[25] J. Vrancken, J.H. van Schuppen, M. dos Santos Soares, and F. Ottenhof. A hierarchical model and implementation architecture for road traffic control. In International Conference on Systems, Man and Cybernetics, pages 3540–3544. IEEE, 2009.
[26] K. Wen, S. Qu, and Y. Zhang. A machine learning method for dynamic traffic control and guidance on freeway networks. In International Asia Conference on Informatics in Control, Automation and Robotics, pages 67–71. IEEE, 2009.

