Minimum Time Search for Lost Targets using Cross Entropy Optimization
Pablo Lanillos, Eva Besada-Portas, Gonzalo Pajares and José J. Ruz
Abstract— This paper formulates and proposes a discrete solution for the problem of finding a lost target under uncertainty in minimum time (Minimum Time Search). Given a search region where some information about the target is known but uncertain (i.e. its location and dynamics), and a searching agent with constrained dynamics, we provide two decision making algorithms that optimize the agent actions to find the target in minimum time. The problem is posed as a discrete optimization: the actions and the sensor are discrete, and the target probabilistic model is described over a graph, where each vertex contains the target location probability and each edge defines a possible agent action. We revisit the mathematical model of the optimal search problem and propose a novel approach that includes time in the decision making by reinterpreting the search as a maximum discounted time reward problem. The optimal decision plan for the agent is obtained by solving this non-convex discrete problem with the cross entropy method. Statistical simulations show how the target is found in minimum time.
I. INTRODUCTION
The search for an object with uncertain location has been widely studied since "Theory of Optimal Search" was published in 1975 [1]. That mathematical analysis collected the research derived from the search applications of World War II. It elegantly addressed two complementary objectives of the search: detecting the target (1) with the smallest cost and (2) in minimum time. The approach has one main drawback: it assumes that the space is infinitely divisible (i.e. no searcher dynamics are involved). A few years later, a doctor in naval engineering published a mathematical model that adds constraints to the searcher path [2]. He formulated the search as a Partially Observable Markov Decision Process (POMDP), laying the groundwork for branch and bound algorithms and POMDP solutions [3], [4]. The algorithmic complexity has been shown to be NP-complete or NP-hard, depending on the problem formulation [5]. Later, with the rise of swarm robotics and unmanned vehicles, and thanks to lower costs and miniaturization [6], a new research impulse appeared in which many agents search instead of one [7], [8], [9], [10]. In a multi-agent setup, the complexity has been proved to rise to NEXP-complete [11], but by simplifying the communications and modeling the information as a common layer, the problem has been partially solved with gradient-based approaches [12]. Despite these advances, the problem remains only partially solved under some hard assumptions. Indeed, the literature
has left aside one of the two objectives defined in the original optimal search [1]: minimizing the time of detection. In [5] the search problem is formulated in two similar but not equal ways: maximizing the probability of detection and minimizing the time of detection. Only the first one has been widely tackled [3], [13], [14], [15], [12], [10], because of the intractability of the second [16], [13], [5]. As the problem is still unsolved even for one agent, in this paper we focus on a configuration with one agent and one target location distribution, keeping a formulation that favors possible extensions to multi-agent schemes in the future. In this work we study in depth the problem of Minimum Time Search (MTS) by revisiting optimal search theory and by providing a new mathematical framework that permits us to include time as a function, instead of directly minimizing the expected time (i.e. mean time). The optimal search problem is a decision making process that involves two dynamic participants: the searcher (agent), which owns a sensor, and the object that is searched (target). In our case the target movement does not depend on the agent (i.e. we do not have an evading target). This search problem can be modeled in a continuous way [12], [10], [16] or in discrete form [3], [7], depending on the representation of the search region, the sensor type, and the possible agent actions. In works like [12], for instance, the optimization assumes piecewise linear actions and continuous differentiable sensors. This paper fills the gap of optimal search in a discrete approach, where the sensor as well as the agent actions are discrete and their functions are non-differentiable. Our methodology starts by identifying what we know about the target. We know a priori that it is inside a delimited region and that it is moving. Our knowledge or belief of the target location is modeled as a probability density function, which indicates the probability of locating the target at any point of the region. Representing the region in discrete form (i.e. dividing it into cells), the belief becomes a matrix or vector that describes the probability of locating the target in each cell. This belief is important to decide the best agent actions because it contains the information about the places with high probability of finding the target, which are the locations that the agent should observe. We use these observations to update the belief using Bayesian inference [17]. As the agent is driven by actions, our solution is a sequential set of actions that tries to perform the best observations in the least time. We propose two approaches: 1) Minimizing the expected time of detection. If we assume that the agent observations will be no
P. Lanillos, E. Besada-Portas and J.J. Ruz are with the Computer Architecture and Automation Department (DACYA), Complutense University, Madrid, Spain
[email protected] G. Pajares is with the Software Engineering and Artificial Intelligence Department, Complutense University, Madrid, Spain
detection, we can estimate the mean time that the agent needs to detect the target. Thus, we look for the actions with minimum expected time. The problem is that, for that purpose, we would have to compute a sum of infinitely many terms (i.e. number of actions), but we can still use the approach for a finite number of actions under a few assumptions. 2) Using a function that implicitly models the time. One of the drawbacks of the first approach is that there is no possibility of deciding when it is important to find the target, while with this new approach we can specify the importance of locating it at each instant. An application example is sea rescue operations in warm waters, where during the first thirty minutes the agent could explore the region more, but afterwards, due to health risks, the survivors should be found as soon as possible. To design this approach we transform the MTS into a maximum discounted time reward problem. For that purpose, we identify the agent rewards and develop a utility function whose kernel is a function that weights those rewards. We look for the actions that maximize that utility function. Both approaches are non-convex optimization problems with high complexity (i.e. NP-hard). To find the solutions (i.e. the best agent actions) we adapt the Cross Entropy Optimization method (CEO) [18] because it is able to deal with any type of sensor model, it provides good results in discrete optimization [19], and, due to its iterative learning nature, it offers a trade-off between optimality and computation time. This method is based on minimizing the discrepancy between two distributions, which in our problem are the optimal distribution of actions and the best distribution of actions found by the algorithm. In short, our formulation and solving approach let us propose a new discrete receding horizon decision making algorithm, which provides the best N actions of an agent to locate a moving target with uncertain location in the minimum possible time. The paper is organized as follows: section II defines the world representation and the agent and target models; section III describes the Minimum Time Search problem as an expected time minimization; section IV proposes the new time modeling approach that includes time functions in the optimization process; section V solves the non-convex optimization using CEO; and finally, the statistical results in section VI show how our approach minimizes the time to find the target.
II. MODELING THE PROBLEM
In a search problem, we have a sensing agent capable of maneuvering freely and gathering information about the target's existence in a mission-defined workspace. In this section, we state the problem using a discrete approach of the world, the agent dynamics and the sensor model, and we describe the target location belief by means of a probability distribution. The discrete approach is taken to add the possibility of using discrete sensors and actions, and to permit high level decisions (i.e. cardinal directions). Besides, we focus on the individual agent description and we assume that the target is not evading from the searcher. Also, when we talk about the target we mean either one target or many targets that follow the same probability location distribution and dynamics. The generalization for targets with different dynamics is out of the scope of this paper. The objective of the algorithm is to compute the best action plan at instant k, which implies calculating the estimated observations of the region. Each action u^k makes the agent transit to another state or position over the search space Ω. As we assume that the sensor position is the same as the agent position, we have the action plan v^k = {u^k, ..., u^{k+N-1}} that moves the agent/sensor along the states {s^{k+1}, ..., s^{k+N}}, taking a supposed observation z^k at each state s^k. We have to distinguish between what is really happening while the agent is making observations and the action planning, where the observations are predicted. In this paper we deal mostly with the decision making, where, given a state of the world (i.e. the model), we have to predict the best actions. Figure 1 illustrates the MTS problem: there is a delimited search region Ω that is discretized into cells that are 8-connected (using cardinal directions) except for the region frontiers; both the agent and the target are contained in Ω; the agent, starting at s^k, moves over the grid to make observations and detect the target; and depending on the actions (u^k) the agent will observe different cells. The solution is the sequence of actions v^{k*} = {u^k, ..., u^{k+N-1}} over the grid that makes the agent locate the target in minimum time.

Fig. 1. Scenario for the search problem: an 8-connected grid of w_x x w_y cells, the cardinal-direction actions {u^k, ..., u^{k+N-1}} (N, NE, E, SE, S, SW, W, NW) and the visited agent states {s^k, ..., s^{k+N}}. The solution is a sequence of actions that drives the agent to detect the target in minimum time. The search region is a set of cells that contains the agent and the target. The agent starts in s^k with a priori information about the target (i.e. the probability of the target being at each cell), which is updated with the observations made at {s^k, ..., s^{k+N}}. The agent actions {u^k, ..., u^{k+N-1}} are the cardinal directions that match the cell adjacency.
A. The world
Let us define the delimited mission search space as Ω ⊂ R², where the agents and the targets are contained. In our approach the world is discretized into a two dimensional grid with w_x x w_y cells. This grid is described mathematically as a graph G = (V, E) by assigning each cell to a node, and
defining the edges of the graph as the adjacency among cells.
B. Agent motion model
We consider an agent with discrete action dynamics that are restricted by the edges of the graph G induced by the two dimensional grid. As each cell is accessible from its 8 surrounding cells, the possible action values at each instant (u^k) are the eight cardinal directions: N, NE, E, SE, S, SW, W and NW. The agent has an action vector composed of N discrete actions v^k = {u^k, ..., u^{k+N-1}}, which transits the sensing agent state from s^k into {s^{k+1}, ..., s^{k+N}} as time increases. In other words, s^k is the position of the agent in the grid at instant k and v^k is the action plan computed for the next N steps.
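As a concrete illustration of this grid world and motion model, the following Python sketch builds the cardinal-direction transitions on a w_x x w_y grid. The helper names (ACTIONS, in_bounds, step) and the boundary handling (staying in place when a move would leave the region) are our own illustrative assumptions, not part of the paper.

# Sketch of the grid world (Sec. II-A) and the agent motion model (Sec. II-B).
ACTIONS = {  # cardinal directions mapped to (dx, dy) displacements
    "N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
    "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1),
}

def in_bounds(cell, wx, wy):
    x, y = cell
    return 0 <= x < wx and 0 <= y < wy

def step(cell, action, wx, wy):
    # Apply a cardinal action u^k to the agent state s^k.
    # Returns the new cell, or the same cell if the move leaves the region.
    dx, dy = ACTIONS[action]
    nxt = (cell[0] + dx, cell[1] + dy)
    return nxt if in_bounds(nxt, wx, wy) else cell

# Example: an action plan {E, NE, N} applied from the lower-left cell of an 8x8 grid.
s = (0, 0)
for u in ["E", "NE", "N"]:
    s = step(s, u, wx=8, wy=8)
print(s)  # (2, 2)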
C. Sensor model
The discrete sensor model or observation likelihood used by the agent is defined as

P(z^k = D | \tau^k, s^k) = \begin{cases} 1 & \text{if } \tau^k = s^k \\ 0 & \text{otherwise} \end{cases}   (1)

This sensor model represents the likelihood of producing a target detection event D from the observation z^k at time step k, if the sensor is at state s^k and the target is placed at \tau^k. This model implies that, if the agent is at the same location as the target \tau^k, the object is discovered. To simplify the expressions, hereafter we will use D^k to represent the detection event (z^k = D) and \bar{D}^k to represent the no detection one (z^k = \bar{D}).

D. Target model
The dynamics and location (\tau^k) of the target are uncertain, so we define our knowledge about the target probabilistically. We assume that the target motion model is Markovian, P(\tau^k | \tau^{k-1}), which in our discrete space approach describes the probability of the target going from one cell to another. We define the target location in the probabilistic framework as the belief b_\tau^k : V \to [0,1]^{w_x \times w_y}, that is, the probability of the target being in a location inside the region. This belief is represented as a Probability Mass Function (PMF), and therefore the total probability of the belief is \sum_{\tau^k \in V} b_\tau^k = 1. We refer to the whole region belief at instant k as b_\tau^k and to the probability of locating the target in a single cell c \in V as b_{\tau=c}^k. Note that V is the discretization of the region Ω. The Bayesian approach permits us to start from a prior target belief b_\tau^0 and update the information after every successive sensor observation. We can also predict the target position according to the probabilistic motion model. This method, called Recursive Bayesian Estimation (RBE) [20], is an algorithm that iterates two steps: update and prediction. Using the Kalman filter notation we distinguish two beliefs:
1) b_\tau^{k|k} = P(\tau^k | z^{1:k}, s^{1:k}) is the probability location distribution of the target at instant k with all the observations from the beginning up to z^k (z^{1:k}). That is, the belief after the update step.
2) b_\tau^{k|k-1} = P(\tau^k | z^{1:k-1}, s^{1:k}) is the probability location distribution of the target at instant k with the previous observations (z^{1:k-1}). That is, the belief after the prediction step.
We refer to the belief b_\tau^k as the probability distribution after the prediction and the update step. Thus, b_\tau^k = b_\tau^{k|k}. The observation instant superscript can also be omitted when there are no observations, because then b_\tau^{k|k} = b_\tau^{k|k-1}. The two steps of the RBE are:
a) Prediction step: This step predicts the location of the target at instant k given the probability distribution of the target at k-1, according to the target probabilistic motion model:

b_\tau^{k|k-1} = \sum_{\tau^{k-1} \in V} P(\tau^k | \tau^{k-1})\, b_\tau^{k-1|k-1}   (2)

b) Update step: The following shows the general form of the recursive Bayesian target belief estimation for a series of sensor observation events. We start with a prior target belief up to time step k, P(\tau^k | z^{1:k-1}, s^{1:k}), conditioned on all previous sensor observations z^{1:k-1} taken at sensor states s^{1:k-1}. At time step k, the new sensor observation z^k at sensor state s^k is \bar{D}. The posterior target belief b_\tau^{k|k} can be expressed using Bayes rule:

b_\tau^{k|k} = \frac{1}{\eta} P(\bar{D}^k | \tau^k, s^k)\, b_\tau^{k|k-1}   (3)

where \eta is the normalization constant that forces \sum_{\tau^k} P(\bar{D}^k | \tau^k, s^k) P(\tau^k | z^{1:k-1}, s^{1:k}) to equal 1. Note that the update step assumes that the target is not observed for the action planning purpose, because when the target is detected the decision making algorithm stops.
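To make the prediction-update cycle of equations (2) and (3) concrete, the sketch below implements one RBE iteration for the ideal sensor of equation (1), with the belief stored as a flat probability vector. The function names and the 4-cell example are our own illustrative assumptions; a real implementation would use the actual transition matrix P(\tau^k|\tau^{k-1}) of the scenario.

import numpy as np

def predict(belief, P_motion):
    # Prediction step, eq. (2): P_motion[i, j] = P(target in cell i | it was in cell j).
    return P_motion @ belief

def update_no_detection(belief, sensed_cell):
    # Update step, eq. (3), for a "no detection" observation.
    # With the ideal sensor (1), P(no detection | tau, s) is 0 at the sensed cell
    # and 1 elsewhere, so that cell's mass is removed and the rest renormalized by eta.
    posterior = belief.copy()
    posterior[sensed_cell] = 0.0
    eta = posterior.sum()
    return posterior / eta if eta > 0 else posterior

# Example: 4-cell world, uniform prior, static target, observation at cell 0.
b = np.full(4, 0.25)
b = predict(b, np.eye(4))       # b^{k|k-1}
b = update_no_detection(b, 0)   # b^{k|k}: [0, 1/3, 1/3, 1/3]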
III. MINIMUM TIME SEARCH
The MTS problem consists in finding an "object" (our target), which is placed somewhere in the "space", in the minimum possible time. Underneath, it is about making the optimal decision using the information we have and taking the future into account. The MTS is in essence an optimal search as defined in [1], where we can find a dual objective definition: minimize the effort spent or the cost of looking in a determined place, and minimize the time that we employ to find the "object". If we want to reduce the task time, one approach is to minimize the expected time to find the object. Defining the time to find the target as the random variable T, we can compute the distribution function P(T \le k) for any instant k \ge 0. The Expected Time (ET) or the expected value of T is [21], [1]:

E\{T\} = \sum_{k=0}^{\infty} \left(1 - P(T \le k)\right)   (4)
We have an agent that makes decisions using its knowledge about the world and making observations. Thus, the actions not only modify the agent state but also the world belief, making every past action-observation pair affect the future decisions. This means that we are dealing with a sequential problem. The MTS is a decision making problem, because in order to minimize the time we have to choose the best sequence of actions v^k. For the discrete approach, the decision plan of horizon N is:

v^1 = \{u^1, \cdots, u^N\}   (5)

The probability of detecting the target in a time window 1:k is the probability of the union of the detection events, P(\bigcup_{j=1}^{k} D^j | s^0, u^{1:k}). We can compute the truncated ET of an action plan as the mean time \mu(v) (i.e. the ET for a given v), using the probability of detecting the target. Thus, the ET function is:

\mu(v^1) = \sum_{k=1}^{N} \left(1 - P\Big(\bigcup_{j=1}^{k} D^j \,\Big|\, s^0, u^{1:k}\Big)\right)   (6)

Note that writing \{s^0, u^{1:k}\} is the same as \{s^0, \cdots, s^k\}, because the actions move the agent through the states. When N = \infty we are computing the expected time. Therefore, the solution to the MTS is given by the following equation:

v^{k*} = \arg\min_{v^k} \mu(v^k)   (7)

If \mu(v) is differentiable and \lim_{k \to \infty} P(\bigcup_{j=1}^{k} D^j | s^0, u^{1:k}) = 1, we can use the following equations to compute the ET (see [5], [13] for further details):

\mu(v) = \sum_{k=1}^{\infty} k\, P(D^k | \bar{D}^{1:k-1}, s^{1:k-1}, u^k)   (8)

P(D^k | \bar{D}^{1:k-1}, s^{1:k-1}, u^k) = P\Big(\bigcup_{j=1}^{k} D^j \,\Big|\, s^0, u^{1:k}\Big) - P\Big(\bigcup_{j=1}^{k-1} D^j \,\Big|\, s^0, u^{1:k-1}\Big)   (9)

Note that P(\bigcup_{j=1}^{k} D^j | s^0, u^{1:k}) is the cumulative probability of detection until instant k and P(D^k | \bar{D}^{1:k-1}, s^{1:k-1}, u^k) is the probability of detecting the target at instant k, given the previous no-detection observations. Computing all the actions up to infinity is intractable, and if N < \infty we cannot guarantee that the limit will converge to 1, so (9) fails. However, (6) still applies, as shown in [22]. In this case \mu(v) cannot be considered the expected time of detecting the target, but for a receding horizon optimization (6) can be used because it conserves the proportions of the minimum between action plans of the same length N. The proportions are conserved even for a dynamic target only if the normalization 1/\eta is not applied. This is a consequence of the PMF structure of the problem, which allows the gradient of (6) to yield useful policies.
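As an illustration of how (6) can be evaluated for a candidate plan under the ideal sensor (1), the following sketch propagates an unnormalized belief (consistent with the remark above about not applying 1/\eta) and accumulates the detection probability swept at each visited cell. The helper step_fn, which maps a cell index and an action to the next cell index, and the remaining names are our own assumptions.

import numpy as np

def truncated_expected_time(b0, P_motion, s0, plan, step_fn):
    # Evaluate eq. (6): mu(v) = sum_{k=1..N} (1 - P(union_{j<=k} D^j | s^0, u^{1:k})).
    belief = b0.copy()      # unnormalized belief, so detection mass can be read off
    s = s0
    p_detected = 0.0        # cumulative probability of detection up to step k
    mu = 0.0
    for u in plan:
        belief = P_motion @ belief   # prediction step, eq. (2)
        s = step_fn(s, u)            # agent/sensor moves to the next cell
        p_detected += belief[s]      # ideal sensor: detection mass at the sensed cell
        belief[s] = 0.0              # no-detection update without normalization
        mu += 1.0 - p_detected
    return mu

The plan selected by (7) is then the candidate with the smallest value of this function among those proposed by the optimizer.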
IV. MTS AS A MAXIMUM DISCOUNTED TIME REWARD PROBLEM
Apart from minimizing the expected time, we also want to include the possibility of modeling the time. For that purpose, we first use the probability of detection as the utility function ([13]), converting the optimization problem into a maximization of the accumulated probability of detection. Secondly, we identify the probability of detection as a reward, and finally we include the time function. This converts the MTS into a maximum Discounted Time Reward problem (max-DTR) [23].

A. Accumulated Probability of Detection
The probability of detection until instant k+N is:

J(s^k, v^k, b_\tau^k) = \sum_{j=1}^{N} \sum_{\tau^{k+j} \in V} P(D^{k+j} | \tau^{k+j}, s^{k+j})\, b_\tau^{k+j|k+j-1}   (10)

Using the sensor model described by (1) and the discrete model of the search region described in II-A, we can simplify (10) because the internal sum is just the probability of detecting the target at the sensor position. That is, only the probability of detection at the cell s^{k+j} is added to the sum and the other values are zero. Thus, the probability of detection function becomes:

J(s^k, v^k, b_\tau^k) = \sum_{j=1}^{N} b_{\tau=s^{k+j}}^{k+j|k+j-1}   (11)

where b_{\tau=s^{k+j}}^{k+j|k+j-1} is the belief of the target being in cell s^{k+j} before the update step. This equation has the drawback of not taking the time into account, so it does not optimize the ET in the horizon N. In fact, two action plans with the same cumulative probability of detection (J) do not have to find the target at the same time. Ideally, we are looking for the uniformly optimal plan v^{k*} where the following condition is satisfied (see [1]):

\mu(v^{k*}) \le \mu(v^k)   (12)

which is equivalent to

J^*(s^k, v^k, b_\tau^k) \ge J(s^k, v^k, b_\tau^k), \quad 0 \le k \le N   (13)

This means that the probability of detection is maximal at each instant. The time modeling approach proposed next does not assure a uniformly optimal plan, but it guarantees an optimal plan constrained by the time function. We identify the utility function (i.e. the probability of detecting the target) as the reward: the higher the chance of detection, the higher the reward. Then, the time should be modeled in a way that the reward obtained by the agent decreases with time.

B. Discounted Time Reward
In order to model the time we use an exponential function f that reduces the possible rewards (i.e. the probability of detecting the target) as time passes. The discounted time function chosen is ([23]):

f(k) = \lambda^k, \quad 0 < \lambda \le 1   (14)
The tuning parameter λ permits us to decide indirectly how fast we want to find the target or, in other words, how important the actions that the agent will take in the future are. The cost function (11) is combined with the discounted time
function to develop the final discounted time reward function:

J_\lambda(s^k, v^k, b_\tau^k) = \sum_{j=1}^{N} \lambda^{j}\, b_{\tau=s^{k+j}}^{k+j|k+j-1}   (15)
Therefore, we formulate the MTS problem using the max-DTR approach as:

v^{k*} = \arg\max_{v^k} J_\lambda(s^k, v^k, b_\tau^k)   (16)
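A sketch of how the max-DTR utility (15) can be evaluated for a candidate plan is shown below; it reuses the same unnormalized-belief bookkeeping as the expected-time sketch and assumes the discount is applied per prediction step (lambda to the power j). The names, including step_fn, are illustrative.

import numpy as np

def discounted_time_reward(b0, P_motion, s0, plan, step_fn, lam=0.8):
    # Evaluate eq. (15): the predicted detection probability collected at each
    # visited cell s^{k+j} is weighted by the discount f(j) = lam**j of eq. (14).
    belief = b0.copy()
    s = s0
    reward = 0.0
    for j, u in enumerate(plan, start=1):
        belief = P_motion @ belief        # prediction step, eq. (2)
        s = step_fn(s, u)
        reward += (lam ** j) * belief[s]  # discounted reward at s^{k+j}
        belief[s] = 0.0                   # assume no detection for planning
    return reward

The max-DTR plan of (16) is then the candidate plan with the largest value of this function among those generated by the optimizer.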
V. SOLVING THE DISCRETE MTS: CROSS ENTROPY OPTIMIZATION METHOD (CEO)
X =
[ 0 0 0
  0 0 0
  1 0 0
  0 0 0
  0 1 0
  0 0 0
  0 0 1
  0 0 0 ]

Fig. 2. Action-time representation used to code a sample solution of the problem. On the left, the binary matrix X indicates which action u is taken at each time step k (rows follow the action order N, NE, E, SE, S, SW, W, NW; the columns correspond to the action sequence E, S, W). On the right of the original figure, this matrix is translated into a path over the graph when the agent starts in cell one.
As a black box, our algorithm works as follows: given a starting point s^k and a prior target location belief b_\tau^k, the algorithm returns the best N sequential actions v^{k*} = u^{k:k+N-1}. In practice, (7) and (16) are non-convex discrete optimization problems. Because their functions can be non-differentiable, we cannot use a gradient-based approach as in [12]. Instead, to solve this problem formulation of the MTS, we propose the cross entropy optimization approach [18], due to its good performance in solving discrete problems. The idea behind CEO is to learn a probability distribution that lets us sample the optimal action plan (v^* = u^{k:k+N-1}). CEO learns this probability by iterating two steps. In the first, it samples solutions from the probability distribution obtained so far and selects the set with the best ones. "Best" is defined according to the problem optimization criterion: the maximum discounted time reward (15), minimum expected time (6) or maximum detection (11). In the second, it obtains the parameters of the distribution from the samples by minimizing the cross entropy between the obtained distribution and the optimal one, which lets us compute, using importance sampling, the percentage of the best ones. To solve the MTS using CEO we have to transform the action solution v^k into a probability distribution q̂ (i.e. the probability of taking action u at instant k). The hat on q̂ indicates that CEO estimates its value. We also have to design the samples as action solution instances. A sample X_{u,k} is a binary matrix (8xN) representing which action u is taken at instant k. X is built following this equation:

X_{u,k} = \begin{cases} 1 & \text{if action } u \text{ is taken at time } k \\ 0 & \text{otherwise} \end{cases}   (17)

Note that X_{u,k} is the action-time representation of the u^{1:k} sequence of actions. An example for a 2x2 grid world and the action sequence E, S, W is shown in Figure 2. The agent starts at cell 1 and then takes the actions (East, South and West), arriving at cell 3. Finally, we have to identify the utility function J that will be used to evaluate the samples, for instance (15). Every iteration j, CEO computes an estimation of the utility function used (Ĵ_j), and when it converges it is the optimal utility.
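The following sketch builds the action-time matrix of equation (17) for the E, S, W example of Figure 2; the row ordering (N, NE, E, SE, S, SW, W, NW) and the helper names are our own illustrative choices.

import numpy as np

ACTION_INDEX = {"N": 0, "NE": 1, "E": 2, "SE": 3, "S": 4, "SW": 5, "W": 6, "NW": 7}

def encode_plan(plan):
    # Binary action-time matrix X of eq. (17): X[u, k] = 1 iff action u is taken at step k.
    X = np.zeros((8, len(plan)), dtype=int)
    for k, u in enumerate(plan):
        X[ACTION_INDEX[u], k] = 1
    return X

print(encode_plan(["E", "S", "W"]))  # one 1 per column, in the E, S and W rows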
A. Algorithm
The CEO algorithm that solves the MTS has the following steps:
1) Initialize the probability distribution of the actions q̂_0. The initial probability of taking an action u at time k is 1/|A|, where |A| is the number of possible actions (e.g. the 8 cardinal directions).
2) Generate a set of action-time binary matrices {X_1^j, ..., X_N^j} by sampling their values according to the q̂_j distribution. We generate the samples using multinomial sampling [24].
3) Update Ĵ_j using the best samples of {X_1^j, ..., X_N^j}, considering them as rare events with low probability (ρ). We look for samples that maximize the utility function J ((6), (15) or (11)) by using the probability of finding a sample with J(s^k, v^k, b_\tau^k) ≥ Ĵ. Sampling a solution with optimal cost function is a rare event, so we know that the probability of a sample with cost equal to the optimal cost J^* is really low. It is not possible to compute J^* in closed form, but it is possible to estimate it assuming that it belongs to the samples with the highest costs. So Ĵ is computed using the probability of finding a sample over a specified value, taking into account that it is a rare event. In practice we take the percentile of samples ρN with the highest utility J.
4) Learn q̂_{j+1} as the function that best fits the best sample set. Once Ĵ_j has been computed and the new elite samples have been selected, it is possible to learn a new q̂_{j+1} by fitting its variables to the probabilities defined by the elite samples. In order to reduce the fluctuations of the estimated probability distributions in different iterations and to avoid local maxima, we also smooth q̂_{j+1} with q̂_j using a smoothing parameter α ([19]).
5) If the stop condition is satisfied, then q^* = q̂_{j+1}; else j = j+1 and the algorithm returns to step 2. When Ĵ_j reaches a fixed value, the algorithm can stop because there is little chance of finding a better solution, due to the relation J(s^k, X, b_\tau^k) ≥ Ĵ_j. For implementation purposes, two consecutive Ĵ_j are considered equal when |Ĵ_j − Ĵ_{j−1}| < ε. If N_I is the number of iterations during which Ĵ_j is allowed to change, the stop condition becomes |Ĵ_j − Ĵ_{j−N_I}| < ε.
6) Extract the sequence of actions: v^* = arg max q^*.
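The steps above can be condensed into the following Python sketch, where the per-step action probabilities q̂ are stored as an N x 8 matrix, the utility J is passed in as a callable (any of (6), (11) or (15)), and the sample size, elite fraction ρ, smoothing α and stopping rule are illustrative assumptions rather than the paper's exact settings.

import numpy as np

def ceo_plan(J, N, n_samples=500, rho=0.01, alpha=0.6, n_iter=50):
    # Cross Entropy Optimization over length-N sequences of action indices (0..7).
    q = np.full((N, 8), 1.0 / 8)                 # step 1: uniform initialization
    rng = np.random.default_rng(0)
    n_elite = max(1, int(np.ceil(rho * n_samples)))
    for _ in range(n_iter):
        # step 2: sample candidate plans, one multinomial draw per time step
        plans = np.stack([rng.choice(8, size=n_samples, p=q[k]) for k in range(N)], axis=1)
        # step 3: keep the elite fraction with the highest utility (estimates J_hat)
        scores = np.array([J(p) for p in plans])
        elite = plans[np.argsort(scores)[-n_elite:]]
        # step 4: refit q to the elite samples and smooth with the previous estimate
        q_new = np.array([np.bincount(elite[:, k], minlength=8) / n_elite for k in range(N)])
        q = alpha * q_new + (1 - alpha) * q
        # step 5 (omitted here): stop early once J_hat stays within epsilon for N_I iterations
    return q.argmax(axis=1)                      # step 6: most probable action per step

For example, assuming the helper functions sketched in the previous sections, a call such as ceo_plan(lambda p: discounted_time_reward(b0, P_motion, s0, p, step_fn), N=20) would play the role of the max-DTR planner of (16).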
VI. RESULTS
In this section we compare the performance of the two proposed approaches against others and show their applicability to a real world search problem.

A. Performance Statistical Analysis
We compare the performance of our two approaches (ET (7) and max-DTR (16)), the Detection approach ((10), which is the best utility function found in the literature, improved here by including the possibility of target movements inside the horizon plan), the greedy strategy (i.e. select the adjacent cell with maximum probability) and the random walk. We evaluate the algorithms by statistically analyzing the probability of detecting the target at different instants, taking into account the probabilistic nature of the problem (there is uncertainty in the target location and dynamics) and also the stochastic nature of CEO (the solutions provided are non-deterministic). In order to obtain the data for this analysis we use two different scenarios (static and dynamic), compute N_a = 30 action plans using each of the algorithms for both scenarios, and run N_s = 10000 target search simulations for each scenario and precomputed action plan (i.e. we run N_s x N_a simulations for each scenario and algorithm). Additionally,
• In the two scenarios presented in this paper, the search area is a square of dimensions w_x = 8, w_y = 8 and the starting location of the agent s^0 is fixed in the lower-left cell. The initial target location belief b_\tau^0 of the static scenario is generated by propagating for a while a two-peak belief with a random transition matrix. The b_\tau^0 for the dynamic case is built as a zero vector of size w_x w_y where position 48 is set to 1 (b_{\tau=48}^0 = 1), meaning that the target starts on the opposite side of the square region. The target dynamics model P(\tau^k|\tau^{k-1}) is the transition matrix that spreads the belief probabilities in the dynamic case. In our dynamic scenario, the target and the agent velocities are equal (i.e., at every time step the agent and the target move from one cell to another), forcing the prediction step of the Bayesian filter to be computed at every agent move. The target dynamics P(\tau^k|\tau^{k-1}) is generated by extracting an 8x8 region of the matlab wind database [25] and building a normalized transition matrix from it. We have chosen this dynamics due to its non-uniformity, because our approaches improve the performance more with asymmetric transition matrices, where two paths have different rewards. With this setup, the two scenarios only differ in the target dynamics P(\tau^k|\tau^{k-1}) and the prior location distribution b_\tau^0.
• The CEO parameters used to compute each of the N_a = 30 action plans for the scenarios and all the approaches are ρ = 0.01 and α = 0.6. Besides, for the max-DTR approach the λ parameter is fixed to 0.8. In all the approaches the decision making horizon is N = 20.
• In each of the N_s = 10000 simulations run for each of the N_a = 30 action plans obtained by each approach for each scenario, the target starts in any cell c ∈ V of the search region Ω with non-zero probability in the initial belief b_\tau^0. In the static case, it remains in that position, while in the dynamic scenario, it follows the selected target dynamics P(\tau^k|\tau^{k-1}). In both cases, for each simulation we store the time t_s that the agent spends to detect the target (calculated as the number of cells that the agent covers until it detects the target) and use it to compute, for each action plan and approach, the probability of detecting the target over time, P(T ≤ k). This probability is a cumulative distribution function and is computed as follows:

P(T \le k) = \frac{1}{N_s} \sum_{s=1}^{N_s} I(t_s \le k)
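Assuming the expression above is the usual empirical estimate (an average of indicator functions over the stored detection times t_s), it can be computed as in the following sketch; the function name and the example times are ours.

import numpy as np

def detection_cdf(detection_times, horizon):
    # Empirical P(T <= k): fraction of the Ns runs whose target was found by step k.
    t = np.asarray(detection_times)
    return np.array([(t <= k).mean() for k in range(1, horizon + 1)])

print(detection_cdf([2, 3, 3, 5, 8], horizon=6))  # [0.  0.2 0.6 0.6 0.8 0.8]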