Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework
Mohamed A. Khamis, Walid Gomaa
Department of Computer Science and Engineering
Egypt-Japan University of Science and Technology (E-JUST)
Khamis, M.A., Gomaa, W., 2014. Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Eng. Appl. Artif. Intell., http://dx.doi.org/10.1016/j.engappai.2014.01.007
RL-Based Traffic Signal Controllers
o A class of AI traffic signal controllers based on a trial-and-error process
o The agent learns a policy that optimizes the cumulative reward gained over time
o RL is based on a sequential online decision-making process
o Junction-based state-space representation (Thorpe et al., 1996)
  - An estimator for every junction state
  - This representation quickly leads to a very large state space
o Vehicle-based state-space representation (Wiering et al., early 2000s)
  - An estimator for every vehicle state (position): Q(s, red), Q(s, green)
  - The number of states grows linearly in the number of lanes and vehicle positions
  - Scales well to large networks (a minimal storage sketch follows below)
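The vehicle-based representation can be pictured as a lookup table with one entry per (vehicle position, signal color) pair. Below is a minimal Python sketch assuming a simple dictionary-backed table; the class name `QTable` and its `gain` helper are illustrative, not the GLD implementation.

```python
# Minimal sketch of a vehicle-based Q-value store (illustrative, not the GLD code).
# A state s is a vehicle position (lane, cell); each state holds Q(s, red) and Q(s, green).
from collections import defaultdict

class QTable:
    def __init__(self):
        # default Q-value of 0.0 for unseen (state, signal) pairs
        self.q = defaultdict(float)

    def get(self, state, signal):
        return self.q[(state, signal)]

    def set(self, state, signal, value):
        self.q[(state, signal)] = value

    def gain(self, state):
        # gain of giving this vehicle a green light: Q(s, red) - Q(s, green)
        return self.get(state, "red") - self.get(state, "green")

# Example: a vehicle waiting in lane 3, cell 0 (closest to the junction); values are illustrative
q = QTable()
q.set(("lane3", 0), "red", 4.2)    # e.g., expected cumulative waiting time if the light stays red
q.set(("lane3", 0), "green", 1.1)  # e.g., expected cumulative waiting time if the light turns green
print(q.gain(("lane3", 0)))        # ~3.1: benefit of turning this lane green
```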
Multi-Objective Traffic Signal Controllers
o Existing approaches use neuro-fuzzy or multi-objective genetic algorithms
  - Neural networks and genetic algorithms require many computations
  - Their parameters are difficult to determine
  - E.g., Lertworawanich et al., 2011
o We are the first to apply multi-objective RL for traffic control based on consolidated immediate reward functions
o Houli et al., 2010, also use multi-objective RL for traffic control
  - However, they activate only one objective at a time, according to the traffic demand
MAS Traffic Signal Control
o Traffic signal control: finding the optimal traffic signal configuration
  - Red/green consistent configurations
o Multi-Agent System (MAS) modeling (Wiering et al., early 2000s)
  - Vehicle: passive agent (communicates with the junction)
  - Junction: active agent (the traffic signal controller)
o The controller at each junction sums up the gains Q(s, red) − Q(s, green) of all vehicles waiting at the junction and chooses the traffic signal configuration (a consistent, non-conflicting set of green lights at the junction) with the maximum cumulative gain, as sketched below.
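A minimal sketch of this decision rule, reusing the hypothetical `QTable.gain` helper from the earlier sketch; the configuration and vehicle data structures are illustrative assumptions.

```python
# Sketch of the junction controller decision: sum the gains Q(s, red) - Q(s, green)
# of all waiting vehicles per consistent configuration, then pick the maximum.
def choose_configuration(configurations, waiting_vehicles, q_table):
    """configurations: dict mapping config id -> set of lanes that get green.
    waiting_vehicles: list of (lane, state) pairs for vehicles waiting at the junction."""
    best_config, best_gain = None, float("-inf")
    for config_id, green_lanes in configurations.items():
        # cumulative gain of vehicles that would receive a green light under this configuration
        total_gain = sum(q_table.gain(state)
                         for lane, state in waiting_vehicles
                         if lane in green_lanes)
        if total_gain > best_gain:
            best_config, best_gain = config_id, total_gain
    return best_config
```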
Multi-Objective RL Traffic Signal Control
o Two possible options for implementing the multi-objective Q-function:
  - Separate Q-values for each objective
  - Consolidated immediate reward functions (more suitable for vehicle-based modeling); a reward sketch follows below
o Some objectives dominate according to the road conditions, e.g., congestion in specific lanes
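A minimal sketch of a consolidated immediate reward, assuming it is formed as a weighted sum of per-objective terms; the specific objective terms and weights below are illustrative placeholders, not the paper's exact reward design.

```python
# Sketch of a consolidated immediate reward: one scalar combining several objectives.
# Objective terms and weights are illustrative placeholders.
def consolidated_reward(vehicle, weights):
    terms = {
        "waiting":    -1.0 if vehicle["stationary"] else 0.0,      # trip / waiting time objective
        "stops":      -1.0 if vehicle["just_stopped"] else 0.0,    # fuel consumption / number of stops
        "flickering": -1.0 if vehicle["signal_flipped"] else 0.0,  # penalize rapid signal changes
    }
    return sum(weights[k] * terms[k] for k in terms)

# Example usage with illustrative weights
r = consolidated_reward(
    {"stationary": True, "just_stopped": False, "signal_flipped": False},
    {"waiting": 0.5, "stops": 0.3, "flickering": 0.2},
)
```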
GLD Traffic Signal Simulation Model
o Widely used open-source simulator, developed by Wiering et al. in the early 2000s
o Used to develop and experiment with various traffic signal controllers
o Provides various performance indices
o Supports editing/creating traffic networks and scheduling traffic demands
o Drawbacks:
  - Discrete-time, discrete-space model of traffic dynamics
  - Oversimplifications in modeling the driving behavior
  - Some simplifications in computing the statistics
Contributions to the GLD Simulator
o Varying distributions of traffic demand
  - Allow for variability and non-stationarity
o Applying the Intelligent Driver Model (Treiber et al., 2000); a sketch follows below
  - Acceleration/deceleration model
  - Continuous-time/continuous-space model
o Sign oscillation problem (a Zeno phenomenon)
  - Results from the infinitesimally slow acceleration of the rear vehicles when the traffic signal has just turned green
  - Solved by giving those stationary vehicles a penalty smaller than the one applied when the traffic signal is red
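For reference, a minimal sketch of the IDM acceleration of Treiber et al. (2000); the parameter values are typical textbook defaults, not necessarily those used in the modified simulator.

```python
import math

def idm_acceleration(v, gap, delta_v,
                     v0=15.0,    # desired speed (m/s)
                     T=1.5,      # safe time headway (s)
                     a_max=1.0,  # maximum acceleration (m/s^2)
                     b=2.0,      # comfortable deceleration (m/s^2)
                     s0=2.0,     # minimum gap to the leader (m)
                     delta=4.0): # acceleration exponent
    """IDM: a = a_max * [1 - (v/v0)^delta - (s*/gap)^2],
    with desired gap s* = s0 + v*T + v*delta_v / (2*sqrt(a_max*b))."""
    s_star = s0 + v * T + (v * delta_v) / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# Example: vehicle at 10 m/s, 20 m behind its leader, closing at 2 m/s
print(idm_acceleration(v=10.0, gap=20.0, delta_v=2.0))
```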
Handling Non-Stationarity
o A Bayesian approach is used to estimate the underlying MDP parameters
  - The current estimate becomes the prior for the next time step
o Let P be a random variable representing an estimator of some unknown parameter, either:
  1. Pr(a | s): the posterior probability of taking action a given state s, or
  2. Pr(s' | s, a): the transition probability of reaching next state s' given (s, a)
o E.g., fix some state s; then Pr(a | s) has one parameter P for Pr{a = RED}
  - The corresponding observations form a sequence of Bernoulli random variables defined at the time indices where state s is occupied by a vehicle (see the sketch below)
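A minimal sketch of this sequential Bayesian update, assuming a conjugate Beta prior on the Bernoulli parameter P (the Beta family is an illustrative assumption); each time step's posterior becomes the prior for the next step.

```python
# Sketch of sequential Bayesian estimation of a Bernoulli parameter
# (e.g., P = Pr{a = RED | s}), assuming a conjugate Beta prior.
class BetaBernoulliEstimator:
    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(alpha, beta) prior; Beta(1, 1) is the uniform prior
        self.alpha, self.beta = alpha, beta

    def update(self, observation):
        """observation: True if the event occurred (e.g., the signal was RED)."""
        # the current posterior becomes the prior for the next time step
        if observation:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self):
        # posterior mean estimate of the Bernoulli parameter
        return self.alpha / (self.alpha + self.beta)

est = BetaBernoulliEstimator()
for obs in [True, True, False, True]:  # observed only at times when state s is occupied
    est.update(obs)
print(est.mean())  # posterior mean estimate of Pr{a = RED | s}
```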
Handling Transient Periods
o Hybrid exploration technique based on both ε-exploration and softmax exploration
o Softmax exploration better responds to transient periods (e.g., due to congestion at rush hours)
  - The traffic signal decision is chosen proportionally to the gain values: Pr(configuration i) = exp(g_i) / Σ_j exp(g_j)
  - g_i is the cumulative gain of the vehicles in the lanes of traffic signal configuration i
o Cooperation is used to check whether some junction is in a transient state that is most likely to propagate soon to a neighboring junction
  - During such periods, it is more appropriate to use softmax exploration (a sketch follows below)
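A minimal sketch of the hybrid exploration idea: softmax over the configuration gains during detected transient periods, ε-greedy otherwise. The `in_transient` flag and the temperature `tau` are illustrative assumptions.

```python
import math
import random

def choose_config(gains, in_transient, epsilon=0.1, tau=1.0):
    """gains: dict config_id -> cumulative gain g_i of the vehicles in that configuration's lanes."""
    ids = list(gains)
    if in_transient:
        # softmax exploration: Pr(i) = exp(g_i / tau) / sum_j exp(g_j / tau)
        m = max(gains[i] for i in ids)  # subtract the max for numerical stability
        weights = [math.exp((gains[i] - m) / tau) for i in ids]
        return random.choices(ids, weights=weights, k=1)[0]
    # epsilon-greedy exploration during stationary periods
    if random.random() < epsilon:
        return random.choice(ids)
    return max(ids, key=lambda i: gains[i])
```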
Experimentations
o Congestion: examined by the average trip time
  - The mean trip time is ≈ 8 times lower than with TC-1
o TC-1 is the single-objective controller (Wiering, 2000) based on frequentist probability estimation using ε-exploration
Experimentations
o Fuel consumption: examined by the average number of trip stops
o Green wave: examined by the average number of trip absolute stops
o The number of vehicle stops is ≈ 22 times lower when using the multi-objective controller
Conclusions
o Proposed framework: multi-agent reinforcement learning for traffic signal control
o Contributions to the GLD traffic simulator
  - E.g., applying a continuous time-space acceleration model
o Handling traffic network non-stationarity
  - Bayesian approach to estimate the unknown parameters
o Adaptive cooperative hybrid exploration
o Multi-objective RL for traffic signal control
  - Significantly outperforms the underlying single-objective controller under congested periods and adverse weather conditions
Questions and updated GLD source
o E-mail: [email protected]; [email protected]