On the Behaviour of Scalarization Methods for the Engagement of a Wet Clutch
Tim Brys, Kristof Van Moffaert, Kevin Van Vaerenbergh and Ann Nowé
AI Lab, VUB Brussels, Belgium
Abstract—Many industrial problems are inherently multi-objective, and require special attention to find different trade-off solutions. Typical multi-objective approaches calculate a scalarization of the different objectives and subsequently optimize the problem using a single-objective optimization method. Several scalarization techniques are known in the literature, and each has its own advantages and drawbacks. In this paper, we explore several of these scalarization techniques in the context of an industrial application, namely the engagement of a wet clutch using reinforcement learning. We analyse the approximate Pareto front obtainable by each technique, and discuss the causes of the differences observed. Finally, we show how a simple search algorithm can help explore the parameter space of the scalarization techniques, to efficiently identify possible trade-off solutions.
I. INTRODUCTION

Multi-objective reinforcement learning problems are often solved by combining the different objectives into a single scalar signal that is optimized using standard single-objective techniques. What is not always realized is that several scalarization techniques exist, each with different properties that make it more or less suitable for the problem at hand. Commonly, the only scalarization technique considered is linear scalarization, which calculates a weighted sum of the signals. Recently, a reinforcement learning approach was proposed for the engagement of a wet clutch, and was validated both in simulation and on a real clutch, showing great improvements over the state of the art [1]. This problem involves minimizing both engagement time and engagement velocity. The authors successfully employed a specific scalarization technique to optimize both objectives; their choice of scalarization method and parameterization, however, was not well motivated. In this paper, we look deeper into the multi-objective component of this problem, and evaluate different scalarization techniques, examining how they modulate the reward landscape to shift the optima to different trade-offs. In the remainder of this paper, we first describe the necessary background, i.e. the wet clutch problem in Section II-A, the reinforcement learning solution method originally used in Section II-B, and some background on multi-objective optimization in Section II-C. In Section III, we discuss different scalarization functions and analyze the results of their application to the wet clutch environment.

The first two authors contributed equally to this paper.
Finally, we propose a simple weight space exploration strategy that can efficiently explore the scalarization parameter space to identify well-spread Pareto front policies in Section IV.

II. BACKGROUND

A. Wet Clutch Engagement

Fig. 1: Schematic representation of a wet clutch (from [2]).
The problem considered in this paper is the efficient engagement of a wet clutch, i.e. a clutch in which the friction plates are immersed in oil in order to smooth the transmission and increase the lifetime of the plates. Wet clutches are typically used in heavy-duty transmission systems, such as those found in off-road vehicles and tractors. Fig. 1 offers a schematic representation of its operation. Two sets of friction plates are connected to the input and the output shaft, respectively, such that, when they are pressed together, the torque generated by the motor, connected to the input shaft via a set of gears, is transmitted to the output shaft, which is connected to the wheels of the vehicle. The two sets of plates are free to translate axially, and a return spring keeps the clutch open: to close the clutch, the plates are pressed together by a hydraulic piston, which can be pushed against the plates by increasing the pressure of the oil in the clutch chamber. The engagement has to be fast, but also smooth, in order to avoid disturbing the operator and to reduce wear on the friction plates. In other words, there is a trade-off between the two objectives: on the one hand, a very slow piston movement results in a smooth engagement, but one that takes a long time; on the other hand, a fast piston engages in a short time, but also causes a jerky engagement, with a large torque loss and possibly damage to the setup. The main issue with this setup is that there is no sensor providing information about the actual position of the piston, which can only be detected when the piston actually touches
the plates. Given the available signals (oil pressure and temperature), there is no reference trajectory that can be tracked using classical control methods. In this sense, this setup is a good example of a limited-feedback system (see also [3]). For this reason, wet clutches are commonly controlled in open loop, applying a signal which is either defined ad hoc or learned iteratively [4]. A parametric form of such a signal, adopted in [5] for a genetic-algorithm-based optimization approach, is displayed in Fig. 2. First, a large current pulse is sent to the valve, in order to generate a high pressure level in the oil chamber of the clutch, which allows the piston to overcome the resistance of the preloaded return spring and accelerate towards the friction plates. After this pulse, a lower constant current is sent out, in order to decelerate the piston as it approaches the friction plates: before reaching this low value, the current briefly drops to an even lower value, creating a small dent in the signal, which acts as a "brake", limiting the speed of the piston. Finally, the current signal grows following a ramp with a low slope, closing the clutch smoothly. The signal is parametrized as follows: the first parameter (a) controls the duration of the initial peak, whose amplitude is fixed at the maximum level, while the last (d) controls the low current level just before the engagement begins. The remaining parameters (b, c) control the shape of the "braking" dent, while the final slope during the engagement phase is fixed.
Fig. 2: Parametric input signal from [5], with four parameters (a, b, c, d). In our implementation, all parameter ranges are mapped to the unit interval [0, 1].
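To make the parametrization concrete, the following sketch generates such a piecewise current profile. Only the roles of a, b, c and d are specified above; the segment timings, levels and scalings below are our own illustrative assumptions, not the actual signal generator of [5].

```python
import numpy as np

def current_profile(a, b, c, d, t_end=2.0, dt=1e-3, i_max=1.0, ramp_slope=0.05):
    """Sketch of the four-parameter current signal of Fig. 2.
    a: duration of the initial maximum-amplitude pulse
    b, c: timing and depth of the "braking" dent
    d: low current level just before engagement
    All parameters lie in [0, 1]; timings/scalings are assumptions."""
    t = np.arange(0.0, t_end, dt)
    t_pulse = 0.1 + 0.4 * a               # assumed mapping of a to pulse length
    t_dent_on = t_pulse + 0.1 + 0.3 * b   # assumed onset of the braking dent
    t_dent_off = t_dent_on + 0.1
    t_ramp = 1.5                          # fixed start of the final closing ramp
    i = np.full_like(t, 0.5 * d)          # low level (d) by default
    i[(t >= t_dent_on) & (t < t_dent_off)] = 0.5 * d * (1.0 - c)  # braking dent
    i[t < t_pulse] = i_max                # initial pulse at maximum amplitude
    i[t >= t_ramp] = 0.5 * d + ramp_slope * (t[t >= t_ramp] - t_ramp)  # slow ramp
    return t, np.clip(i, 0.0, i_max)
```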
There is currently no reliable model of the whole engagement: an approximate model¹ is available for the filling phase, up to the moment the piston touches the plates, but not for the subsequent slip phase, which would be needed to simulate and optimize the resulting torque loss. However, torque loss has been observed to depend on the speed of the piston upon engagement, and we can therefore minimize piston velocity in simulation instead. Note that in the original work [1], the simulation was used to train a reinforcement learning policy, which was subsequently validated on a real clutch and shown to perform well. In this work, however, we do not move beyond simulation.

¹ Developed in Simulink by Julian Stoev, Gregory Pinte (FMTC), Bert Stallaert and Bruno Depraetere (PMA, Katholieke Universiteit Leuven).
B. Reinforcement Learning

Reinforcement learning (RL) problems [6] are a class of machine learning problems in which an agent must learn to interact with an unknown environment through trial and error. At a given timestep t, the agent may execute one of a set of actions a ∈ A, possibly causing the environment to change its state s ∈ S and to generate a (scalar) reward r ∈ R. The agent's behavior is represented by its policy, which maps states to actions. The aim of an RL algorithm is to optimize the policy so as to maximize the reward accumulated by the agent. In direct policy search, the space of policies is searched directly: for example, in Policy Gradient (PG) methods [7], the policy is represented as a parametric probability distribution over the action space, conditioned on the current state of the environment. Epochs are subdivided into discrete time steps: at every step, an action is randomly drawn from the distribution, conditioned on the current state, and executed on the environment, which updates its state accordingly. After an epoch has been completed, the parameters of the policy are updated, following a Monte Carlo estimate of the expected cumulative (discounted) reward. A major disadvantage of PG methods is that drawing a random action at every timestep may result in noisy control signals, as well as noisy gradient estimates; moreover, the policy is required to be differentiable with respect to its parameters. To overcome these issues, PG with Parameter Exploration (PGPE) was introduced [8], [9]. In this method, the random sampling and policy evaluation steps are, in a sense, "inverted": the policy is a parametric function, not necessarily differentiable, and can therefore be an arbitrary parametric controller; the parameter values to be used are sampled at the beginning of each epoch from a Gaussian distribution, whose parameters are in turn updated at the end of the epoch, again following a Monte Carlo estimate of the gradient of the expected return. In other words, rather than searching the parametric policy space directly, PGPE performs a search in a "meta-parameter" space, whose points correspond to probability distributions over the (parametric) policy space. PGPE was shown to work well for the engagement of a wet clutch [1], and we build on that work. For further details on PGPE, we refer the reader to [9], [3].
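A minimal sketch of such a parameter-exploring search loop is given below; the update rule is a simplified variant of the PGPE estimator of [9], and the function names, population size and learning rates are illustrative assumptions, not the implementation used in [1].

```python
import numpy as np

def pgpe_sketch(evaluate, dim, epochs=200, pop=10, lr_mu=0.2, lr_sigma=0.1, seed=0):
    """Maintain a Gaussian over controller parameters, sample one parameter
    vector per rollout, and shift the Gaussian towards better-scoring samples."""
    rng = np.random.default_rng(seed)
    mu = np.full(dim, 0.5)      # mean of the search distribution
    sigma = np.full(dim, 0.2)   # per-dimension standard deviation
    for _ in range(epochs):
        thetas = rng.normal(mu, sigma, size=(pop, dim))   # sample controllers
        returns = np.array([evaluate(th) for th in thetas])
        adv = returns - returns.mean()                    # baseline subtraction
        if adv.std() > 1e-12:
            adv /= adv.std()
        diff = thetas - mu
        # Monte Carlo gradient estimates for mean and standard deviation
        mu += lr_mu * (adv[:, None] * diff).mean(axis=0)
        sigma += lr_sigma * (adv[:, None] * (diff ** 2 - sigma ** 2) / sigma).mean(axis=0)
        sigma = np.clip(sigma, 1e-3, None)
    return mu  # most likely controller parameters found

# evaluate(theta) would run one simulated engagement with controller
# parameters theta (here: the signal parameters a, b, c, d in [0, 1])
# and return the scalarized reward of that epoch.
```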
C. Multi-Objective Reinforcement Learning

In multi-objective optimization, the objective space consists of two or more dimensions. The reward signal in reinforcement learning is therefore extended from a single scalar to a vector of rewards, i.e. $\mathbf{r}(s_i, a_i) = (r_1(s_i, a_i), \ldots, r_m(s_i, a_i))$, where $m$ is the number of objectives. Since the environment now has multiple objectives, conflicts can arise when trying to optimize them simultaneously. In such a case, trade-offs between the objectives have to be made, resulting in a set of incomparable policies.

Fig. 3: Velocity and time reward landscapes for actions a and c. Blue to red indicates worst to best values. The optima are indicated by black dots. The area approximately above c = 67 returns very bad values, as these actions result in a broken clutch.
The set of non-dominated, optimal policies for each objective or combination of objectives is referred to as the Pareto front. A solution $x_1$ is said to strictly dominate another solution $x_2$, denoted $x_2 \prec x_1$, if $x_1$ is at least as good as $x_2$ on every objective and strictly better on at least one. If $x_1$ improves on $x_2$ for some objective and $x_2$ also improves on $x_1$ for one or more objectives, the two solutions are said to be incomparable.
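The dominance relation can be made concrete with a small helper (our own illustration, written for objectives that are to be maximized):

```python
def dominates(x1, x2):
    """True if x1 strictly dominates x2: at least as good on every
    objective and strictly better on at least one."""
    return (all(a >= b for a, b in zip(x1, x2))
            and any(a > b for a, b in zip(x1, x2)))

def incomparable(x1, x2):
    """True if neither solution dominates the other."""
    return not dominates(x1, x2) and not dominates(x2, x1)

def pareto_front(solutions):
    """The non-dominated subset of a list of reward vectors."""
    return [x for x in solutions
            if not any(dominates(y, x) for y in solutions if y is not x)]
```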
III. REWARD LANDSCAPES AND SCALARIZATION APPROACHES
Multi-objectivity in reinforcement learning is often handled using scalarization functions [10], [11] to reduce the dimensionality of the underlying multi-objective environment to a single dimension, such that classic single-objective optimization techniques can be used. In the following sections, we first present the objectives' reward landscapes, and subsequently three of these scalarization functions and how they combine the reward landscapes under different weight settings. Furthermore, we show the approximate Pareto front policies obtainable by a sweep through the weight space.

A. Reward Landscapes

As we have seen, the wet clutch engagement problem is a multi-objective problem: the agent needs to minimize both the piston velocity $v$ at engagement and the engagement time $t$ itself. Low velocities are preferred to ensure a smooth engagement (abrupt engagements may break the clutch). On the other hand, a fast engagement is desirable, as it ensures faster response times. Since PGPE is a maximization method, following [1], the reward signal for piston velocity is transformed into $r_v = \frac{k}{v + k}$, where $v$ is the piston velocity at engagement and $k = 10^{-3}$ is a constant. For the time objective, time is negated to turn it into a maximization problem: $r_t = -t$. In Figure 3, we plot the reward functions and optima for the velocity and time objectives independently, over the whole ranges of actions a and c (0-100%), as these are the most important parameters; b = 0 and d = c stay fixed. We can clearly observe the conflict between the objectives, as the global optima for velocity and time do not overlap. On the other hand, they are not completely conflicting either, given that some of their near-optimal regions overlap.
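For concreteness, a sketch of these two reward transformations follows, assuming the reconstruction $r_v = k/(v+k)$ given above; the simulator outputs v and t are placeholders. These helpers are reused in the scalarization sketches later in this section.

```python
K = 1e-3  # constant k from the velocity reward

def velocity_reward(v):
    """Maps piston velocity at engagement to (0, 1]; slower is better."""
    return K / (v + K)

def time_reward(t):
    """Negated engagement time, so that faster engagements score higher."""
    return -t
```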
Fig. 4: Obtained solutions for a range of different weights (color key) with linear scalarization. Both objectives need to be minimized.
B. Linear Scalarization

The linear scalarization function computes a weighted sum of the reward signals $r_1, \ldots, r_m$ for $m$ objectives: $\sum_{o=1}^{m} w_o r_o$. The weights $w_o$, with $\sum_{o=1}^{m} w_o = 1$, define a preference relation over the objectives, and allow the system designer to steer the optimization process to specific areas of the policy space. Note that using linear scalarization, only policies in convex parts of the Pareto front can be made optimal [12]. In that same paper, Das and Dennis show that the weights can usually not be set a priori with any confidence, since a uniform sampling of the weight space often does not provide a uniform sampling of the Pareto front, i.e. setting weights intuitively is hard. In Fig. 4, we show the performance of the policies learned by the PGPE algorithm for a range of weight settings. The velocity weight $w_v$ is varied from 0.0001 to 0.9999 in steps of 0.01 ($w_t = 1 - w_v$). We do not include the extremes 0 and 1, since these always result in failure: no engagement when only minimizing engagement velocity, and a broken clutch when only minimizing engagement time.
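In code, the weighted sum is a one-liner; a minimal sketch (the objective ordering in the usage comment is our own convention):

```python
def linear_scalarization(rewards, weights):
    """Weighted sum of the objective rewards; the weights are assumed
    to be non-negative and to sum to 1."""
    return sum(w * r for w, r in zip(weights, rewards))

# e.g. linear_scalarization([velocity_reward(v), time_reward(t)],
#                           weights=[w_v, 1.0 - w_v])
```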
Fig. 5: Obtained solutions for a range of different weights (color key) with Chebyshev scalarization. Both objectives need to be minimized.
The algorithm is trained and tested with each weight setting 10 times, and each point in the figure shows the result of a single run. The results are color-coded, with blue indicating a high weight for time and red a high weight for velocity. The results show that there is a clear mapping between weights and policies (despite the stochasticity of the optimization process), as points with the same color, i.e. with similar weights, cluster strongly. Note the tail on the right, where weights with a large emphasis on time yield relatively small improvements in time at the cost of a larger deterioration of the velocity objective.

C. Chebyshev Scalarization

Non-linear $L_p$ metrics [13] are an alternative to a linear combination of objectives. Specifically, $L_p$ metrics measure the distance between a point in the multi-objective space and a utopian point $z^*$. We measure this distance using the reward $r_o$ of each objective $o$ of the multi-objective solution $x$, i.e. $\min_{x \in \mathbb{R}^n} L_p(x) = \left( \sum_{o=1}^{m} w_o |r_o - z_o^*|^p \right)^{1/p}$, where $1 \le p \le \infty$. In the case of $p = \infty$, the metric is also called the weighted $L_\infty$ or Chebyshev metric, and takes the form $\min_{x \in \mathbb{R}^n} L_\infty(x) = \max_{o=1 \ldots m} w_o |r_o - z_o^*|$. Intuitively, the Chebyshev scalarization identifies the objective that is worst off in the current solution and optimizes based on that objective. In this setting, we set the utopian point $(z_t^*, z_v^*)$ to $(0, 1)$. The important advantage of this technique, being non-linear, is that in addition to policies from convex parts of the Pareto front, it can also find policies in non-convex parts. In Fig. 5, we plot the results of a sweep through the weight space, now with Chebyshev scalarization. The results are quite similar to those obtained with linear scalarization, except that Chebyshev scalarization appears unable to approximate the actual Pareto front as well as linear scalarization does (see the 'missing' part at the bottom left (Pareto dominating) compared to Figure 4). This is probably because the Pareto front is convex, in which case a non-convex technique may not perform very well.
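A minimal sketch of this scalarization, written for a maximizing learner such as PGPE (the negation of the distance is our own convention, not prescribed by [13]); rewards, weights and the utopian point are ordered as (time, velocity):

```python
def chebyshev_scalarization(rewards, weights, utopia=(0.0, 1.0)):
    """Negated weighted Chebyshev (L-infinity) distance to the utopian
    point, so that larger values are better for a maximizing learner."""
    distance = max(w * abs(r - z)
                   for w, r, z in zip(weights, rewards, utopia))
    return -distance

# e.g. chebyshev_scalarization([time_reward(t), velocity_reward(v)],
#                              weights=[0.5, 0.5])
```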
Fig. 6: Obtained solutions for a range of different weights (color key) with time discounting scalarization. Both objectives need to be minimized.
D. Time Discounting Scalarization

The last scalarization function we consider is inspired by the way time is typically handled in combination with a reward signal in reinforcement learning, namely by discounting the reward: the further in the future a reward is expected to occur, the more it is discounted, and thus the lower and less desirable it is. The scalarization function is of the form $\gamma^t r_v$, where $\gamma \in [0, 1]$ is a constant discount factor and is the weight to control here. This is the scalarization function used in the original work on wet clutch engagement [1], with $\gamma = 0.9$. The multi-objective policies obtained for a range of $\gamma$ settings (the same range as before) are depicted in Fig. 6. Two features differentiate this plot from the linear scalarization plot. First, the tail on the right is no longer present. Second, for very low $\gamma$, PGPE consistently converges to suboptimal policies. The causes of these phenomena are discussed in the next section. Note that again, we obtain similar policies with similar weights, demonstrating the consistency of this scalarization approach with respect to the weight settings in this problem.
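As a minimal sketch (not the exact code of [1]), the scalarized reward can be computed as follows, reusing the velocity_reward helper from the earlier reward sketch:

```python
def time_discounting_scalarization(v, t, gamma=0.9):
    """Discount the velocity reward by the engagement time: gamma**t * r_v.
    Time only modulates the velocity landscape; it is not an objective
    with its own weight, as with linear or Chebyshev scalarization."""
    return (gamma ** t) * velocity_reward(v)
```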
Fig. 7: Reward landscapes for actions a and c, with different scalarization functions. Top: linear scalarization. Center: Chebyshev scalarization. Bottom: time discounting scalarization. From left to right, $w_t$ or $\gamma$ is 0.1, 0.5, and 0.9, respectively. Red indicates the best values. The optima are indicated by black dots. Note that while for time discounting scalarization the optimal trade-offs stay in the same region, for linear and Chebyshev scalarization they move to more extreme trade-offs.

E. Comparison

In this section, we discuss how the different scalarization approaches combine the two single-objective reward landscapes, and explain the differences observed above. Recall Figure 3, where we plotted the reward landscapes of velocity and time separately. In Figure 7, we plot the scalarized landscapes for the three scalarization approaches (top to bottom), with three different weight settings (left to right). The most notable difference between linear and Chebyshev scalarization on the one hand, and time discounting on the other, is the shape of the landscapes at the extreme weight settings, and consequently the shape of the in-between landscapes for other weight settings. With the former two, we recover the two original single-objective landscapes at the extremes ($w_t = 0$ and $w_t = 1$). With time discounting scalarization, on the other hand, we get the velocity objective at one extreme ($\gamma = 1$) and a global plateau of reward 0 at the other extreme ($\gamma = 0$), since $0^t r_v = 0$. This explains why, in Figure 6, for very
low γ, we find suboptimal solutions, as the scalarized reward signal provides little to no gradient information and PGPE performs a random walk. Furthermore, this explains why we cannot emphasize the time objective as much as with linear and Chebyshev scalarization, since the time objective is only used to modulate the velocity landscape, slightly shifting its optima. The time landscape is not an integral part of the scalarized landscape, as it is with linear and Chebyshev scalarization. This is a crucial difference that should be kept in mind when considering this scalarization technique, since it limits the maximum possible emphasis on the time component. In Figure 8, we plot the approximate Pareto fronts obtained by the different techniques (the means of 10 runs are plotted, to account for the stochasticity of the learning process). This plot confirms our previous findings. The approximate Pareto front found using time discounting scalarization is very much focused on velocity, while with linear and Chebyshev scalarization we achieve a greater spread and can put much more emphasis on time, shaving off another 0.15 s compared to time discounting. In the next section, we show how we can quickly identify a variety of trade-off policies and their associated weights, using a simple systematic weight space search technique.
Fig. 8: Pareto dominating solutions obtained by a sweep through the weight space for three scalarization techniques. Every point is the mean of 10 runs for a given weight setting. Both objectives need to be minimized.
IV. EFFICIENT WEIGHT SPACE SEARCH

If we are limited to a given number of weight evaluations, we can ensure maximal coverage of the weight space by always evaluating the weight setting that is furthest from the previously evaluated weights.
Fig. 9: Hypervolume of policies found during the Divide & Conquer search for three scalarization techniques. Average of 10 runs.
In a bi-objective problem, this means first evaluating the extreme weights [1, 0] and [0, 1], and subsequently [0.5, 0.5], [0.25, 0.75], and [0.75, 0.25], etc., each time halving the remaining intervals. At any time, we have covered the weight space as uniformly as possible. In Fig. 9, we plot the hypervolumes obtained by such a search with each of the scalarization approaches. The hypervolume metric calculates the volume between a reference point and the Pareto front obtained so far [14]. It measures both the spread of the estimated Pareto front and how well it approximates the actual (unknown) Pareto front. Again, we note the superiority of linear scalarization on the wet clutch problem, achieving both the most Pareto dominating points and the largest spread of these points. Please note that a uniform sampling of the weight space does not always yield a uniform sampling of the Pareto front [12], and therefore directed search techniques may be required instead of this naive uniform search.
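A sketch of this interval-halving sweep and of a simple two-objective hypervolume computation follows (both our own illustrative code, with both objectives minimized as in Fig. 8):

```python
def halving_weight_sequence(n):
    """First n velocity weights of the halving sweep: the extremes first,
    then midpoints of the remaining intervals (w_t = 1 - w_v for each)."""
    weights, queue = [0.0, 1.0], [(0.0, 1.0)]
    while len(weights) < n:
        lo, hi = queue.pop(0)            # oldest, hence widest, interval
        mid = (lo + hi) / 2.0
        weights.append(mid)
        queue.extend([(lo, mid), (mid, hi)])
    return weights[:n]

# halving_weight_sequence(7) -> [0.0, 1.0, 0.5, 0.25, 0.75, 0.125, 0.375]

def hypervolume_2d(points, ref):
    """Area dominated by a set of 2-D solutions with respect to a reference
    point `ref`, both objectives to be minimized (cf. [14])."""
    pts = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    front, best_y = [], float("inf")
    for x, y in pts:                     # keep only the non-dominated points
        if y < best_y:
            front.append((x, y))
            best_y = y
    area, prev_x = 0.0, ref[0]
    for x, y in reversed(front):         # sweep from the largest first objective
        area += (prev_x - x) * (ref[1] - y)
        prev_x = x
    return area
```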
V. CONCLUSION

In this paper, we analysed different multi-objective aspects of an industrial application, the engagement of a wet clutch. We discussed different scalarization techniques and showed their effect on the reward landscape, as well as the behaviour they induce in the optimization algorithm used to optimize the policy controlling the wet clutch. We considered linear, Chebyshev, and time discounting scalarization. While the former two allow for real trade-offs between the time and velocity objectives, the time discounting scalarization only allows the time objective to be emphasised slightly, with most of the focus remaining on the non-time objective. This study illustrates the different aspects to consider in an industrial multi-objective problem, including the analysis of different multi-objective techniques, landscape analysis of the reward signals, and the use of different multi-objective indicators, i.e. Pareto dominance and hypervolume, to measure performance.

ACKNOWLEDGMENT

Tim Brys is funded by a Ph.D. grant of the Research Foundation Flanders (FWO). Kristof Van Moffaert is supported by the IWT-SBO project PERPETUAL (grant nr. 110041). This work has been carried out within the framework of the LeCoPro project (grant nr. 80032) of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen).
REFERENCES

[1] K. Van Vaerenbergh, A. Rodriguez, M. Gagliolo, P. Vrancx, A. Nowé, J. Stoev, S. Goossens, G. Pinte, and W. Symens, "Improving wet clutch engagement with reinforcement learning," in The 2012 International Joint Conference on Neural Networks (IJCNN). IEEE, 2012, pp. 1–8.
[2] B. Depraetere, G. Pinte, and J. Swevers, "Iterative optimization of the filling phase of wet clutches," in The 11th International Workshop on Advanced Motion Control (AMC 2010). IEEE, 2010, pp. 94–99.
[3] M. Gagliolo, K. Van Vaerenbergh, A. Rodriguez, A. Nowé, S. Goossens, G. Pinte, and W. Symens, "Policy search reinforcement learning for automatic wet clutch engagement," in 15th International Conference on System Theory, Control and Computing (ICSTCC 2011). IEEE, 2011, pp. 250–255.
[4] G. Pinte, J. Stoev, W. Symens, A. Dutta, Y. Zhong, B. Wyns, R. D. Keyser, B. Depraetere, J. Swevers, M. Gagliolo, and A. Nowé, "Learning strategies for wet clutch control," in 15th International Conference on System Theory, Control and Computing (ICSTCC 2011). IEEE, 2011, pp. 467–474.
[5] Y. Zhong, B. Wyns, R. D. Keyser, G. Pinte, and J. Stoev, "Implementation of genetic based learning classifier system on wet clutch system," in 30th Benelux Meeting on Systems and Control, Book of Abstracts. Gent, Belgium: Universiteit Gent, 2011, p. 124.
[6] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[7] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.
[8] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber, "Policy gradients with parameter-based exploration for control," in ICANN '08: Proceedings of the 18th International Conference on Artificial Neural Networks, Part I. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 387–396.
[9] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber, "Parameter-exploring policy gradients," Neural Networks, vol. 23, no. 4, pp. 551–559, 2010.
[10] P. Vamplew, J. Yearwood, R. Dazeley, and A. Berry, "On the limitations of scalarisation for multi-objective reinforcement learning of Pareto fronts," in Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, ser. AI '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 372–378.
[11] K. Van Moffaert, M. M. Drugan, and A. Nowé, "Scalarized multi-objective reinforcement learning: Novel design techniques," IEEE SSCI, vol. 13, pp. 94–103, 2012.
[12] I. Das and J. Dennis, "A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems," Structural Optimization, vol. 14, no. 1, pp. 63–69, 1997.
[13] T. Voß, N. Beume, G. Rudolph, and C. Igel, "Scalarization versus indicator-based selection in multi-objective CMA evolution strategies," in IEEE Congress on Evolutionary Computation. IEEE, 2008, pp. 3036–3043.
[14] K. Van Moffaert, M. M. Drugan, and A. Nowé, "Hypervolume-based multi-objective reinforcement learning," in Evolutionary Multi-Criterion Optimization. Springer, 2013, pp. 352–366.