End-to-end learning of a 2D driving task using visuomotor control
Kole Harvey, Independent Scholar
Aug 2018

1. Introduction

In this study we employ a neural network to learn a visuomotor association task in a 2D car driving simulation. The purpose of this study is to investigate the applicability of the visuomotor association approach to familiar AI tasks, in this case a scenario in which a 2D car travels around a track and adjusts its heading and velocity according to the visual input (for example, turning on corners and accelerating on straights). The network employed is based on the visuomotor dynamic neural network model introduced in [1] and expanded on in [2]. The network in the present study consists of two pathways, visual and proprioceptive, each consisting of three layers, which produce low-level predictions at each timestep. These predictions are used to calculate the error, which is then fed back using backpropagation through time (BPTT) to train the network. The two pathways are described below. We demonstrate the network's ability to learn to drive around corners both on tracks similar to and different from the training data.

For the proprioceptive pathway, we use a continuous time recurrent neural network (CTRNN) [3] with a different timescale at each layer [4]. The continuous time aspect refers to the ability of the network to hold onto information in a leaky-integrator fashion and gradually forget it over time, unlike the discrete kind of RNN which explicitly represents one timestep prior with context units. The tau parameter of each network layer dictates how slow the forgetting process is, or in other words how long the layer holds onto information. Each higher layer of the hierarchy has a larger tau value, such that the top layers follow information on a longer timescale than the lower layers. This allows for the self-organization of motor primitives on the lower layers and of trajectories which sequence primitives together on the higher levels.
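The leaky-integrator update described above can be sketched as follows. This is a minimal illustration of the standard CTRNN update from [3, 4], not the paper's exact implementation; the weight shapes and tanh activation are assumptions.

```python
import numpy as np

def ctrnn_step(u_prev, x, w_in, w_rec, tau):
    """One leaky-integrator update of a CTRNN layer.

    u_prev : previous internal (pre-activation) state
    x      : current input vector
    tau    : time constant; larger tau -> slower forgetting
    """
    h_prev = np.tanh(u_prev)  # previous output of the layer
    u = (1.0 - 1.0 / tau) * u_prev + (1.0 / tau) * (w_in @ x + w_rec @ h_prev)
    return u, np.tanh(u)

# A layer with a larger tau changes its state more slowly:
rng = np.random.default_rng(0)
u0 = np.zeros(4)
w_in, w_rec = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
x = rng.normal(size=3)
u_fast, _ = ctrnn_step(u0, x, w_in, w_rec, tau=1.0)
u_slow, _ = ctrnn_step(u0, x, w_in, w_rec, tau=3.0)
```

Starting from a zero state, the update is scaled by 1/tau, so the tau=3 layer moves a third as far as the tau=1 layer on the same input.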
For the visual pathway, we employ a multiple spatio-temporal scales recurrent neural network (MSTRNN) [5] which not only zooms out on a temporal level as the hierarchy is ascended, but also on a spatial one. In other words, through the use of convolutions on each layer, the receptive field of the neurons becomes more spread out over the visual field and over time the higher the layer is in the network. The MSTRNN essentially employs the leaky-integration properties of the CTRNN with the filtering of a regular convolutional neural network.
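A minimal sketch of such an update, combining convolutional filtering with the same leaky integration (the kernel sizes and the form of the recurrent connection here are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

def mstrnn_step(u_prev, x, conv_fwd, conv_rec, tau):
    """One MSTRNN update: convolve the bottom-up input, then leaky-integrate.

    u_prev   : (1, C, H, W) previous internal state feature maps
    x        : (1, C_in, H, W) input maps from the layer below
    conv_fwd : 3x3 kernel filtering the input
    conv_rec : 1x1 kernel for the recurrent connection
    """
    h_prev = torch.tanh(u_prev)
    drive = F.conv2d(x, conv_fwd, padding=1) + F.conv2d(h_prev, conv_rec)
    u = (1 - 1 / tau) * u_prev + (1 / tau) * drive  # leaky integration
    return u, torch.tanh(u)

torch.manual_seed(0)
u0 = torch.zeros(1, 4, 10, 10)
x = torch.randn(1, 1, 10, 10)
conv_fwd = torch.randn(4, 1, 3, 3)
conv_rec = torch.randn(4, 4, 1, 1)
u1, h1 = mstrnn_step(u0, x, conv_fwd, conv_rec, tau=2.0)
```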

2. The Network Model

Here we describe how we constructed the dual visual/proprioceptive pathway and allowed information to flow between layers and pathways on each timestep. The network model was coded in PyTorch using Python 3.6 on Ubuntu 18 with a CUDA-enabled GPU. The code for the present model and experiments can be found in the repository listed in the Code section.

The network's forward pass consists of generating a sequence of predictions based on a sequence of input data of length SEQ_LEN. Essentially, given inputs from t to t+SEQ_LEN-1, we wish to predict the inputs t+1 to t+SEQ_LEN. We can do this in an open-loop fashion, setting the real input and predicting the next timestep on each pass through the network, or in a closed-loop fashion, using the network's own prediction as its input and generating another prediction based on it. Here we use a parameter p which defines the probability of reusing predictions as input. During training we gradually raise the p value so that the network must make progressively longer-term predictions about the input.
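The open/closed-loop scheme with reuse probability p can be sketched as follows (`model.step` and the `Dummy` model are hypothetical stand-ins for a single-timestep forward pass):

```python
import random

def run_sequence(model, inputs, p, rng=random):
    """Generate one prediction per timestep, reusing each prediction as
    the next input with probability p (closed loop), otherwise feeding
    the real input (open loop)."""
    preds = []
    x = inputs[0]  # the first real input is always used
    for t in range(len(inputs)):
        pred = model.step(x)  # predict the input at t+1
        preds.append(pred)
        if t + 1 < len(inputs):
            x = pred if rng.random() < p else inputs[t + 1]
    return preds

class Dummy:
    def step(self, x):
        return x + 1  # toy "prediction"

inputs = [0, 10, 20, 30]
open_preds = run_sequence(Dummy(), inputs, p=0.0)    # pure open loop
closed_preds = run_sequence(Dummy(), inputs, p=1.0)  # pure closed loop
```

With p=0 every prediction is grounded in a real input; with p=1 each prediction builds on the previous one, forcing longer-term prediction.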

Note that during training, closed-loop mode only extends to the end of a single sequence; at the start of each sequence the real input is always used. In test mode, however, the proprioceptive prediction is used to update the actual agent (the car simulator) itself, and so is kept between sequences, while the visual prediction is replaced with the true visual input (taken from the simulator) at the start of each sequence. We discuss this further in Experiments.

Figure 1. The network model. Each pathway takes in input at time t and produces a t+1 prediction. Each layer in the hierarchy can receive input from lower, higher, or lateral layers. This input is then transformed into hidden state activation which is held onto and decays over time via leaky integration. MSTRNN: multiple spatio-temporal scales recurrent neural network (visual layer). CTRNN: continuous time recurrent neural network (proprioceptive layer). PREDICT: predictive layer. bu: bottom-up input. td: top-down input. rc: recurrent input.

In Figure 1 we can see that each layer has bottom-up, top-down, and lateral inputs. The initial bottom-up inputs to the layer 1 nodes V1 and P1 are set to the data before running through all nodes in the network. The output of each layer is then held in a buffer and processed on the next timestep by the destination layers, with the exception of the PREDICT layers, which process current-timestep inputs and immediately produce a prediction for the next timestep. Since each non-PREDICT node processes inputs from the previous timestep, the order in which nodes are processed within the current timestep does not matter, as long as the prediction nodes V0 and P0 are handled last. In other words, PREDICT layers access their input buffers on the same timestep that they are written to, whereas all other layers only access input buffers one timestep after they have been written to.

At the end of each sequence, the network returns the list of outputs (predictions) to the main loop, and these predictions are compared to the real data to calculate the mean squared error loss. We then enact the backward pass of BPTT on the network based on this deviation. This ends a single epoch of training, and the process repeats on the next sequence of input data.
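The buffer discipline just described can be sketched as follows. The layer objects and buffer keys here are toy stand-ins, not the actual implementation: non-PREDICT layers read last timestep's buffers (so their order is irrelevant), while PREDICT layers read the freshly written buffers and must run last.

```python
def network_timestep(layers, predict_layers, buffers):
    """One timestep of the buffered forward pass."""
    # Non-PREDICT layers all read the OLD buffers, in any order.
    new_buffers = {name: layer.forward(buffers) for name, layer in layers.items()}
    # PREDICT layers read the buffers written THIS timestep, so they run last.
    predictions = {name: layer.forward(new_buffers)
                   for name, layer in predict_layers.items()}
    return new_buffers, predictions

class SumNode:
    """Toy stand-in for a layer: sums the buffers it is connected to."""
    def __init__(self, sources):
        self.sources = sources
    def forward(self, bufs):
        return sum(bufs[s] for s in self.sources)

layers = {"V1": SumNode(["data"]), "P1": SumNode(["data", "V1"])}
predict = {"P0": SumNode(["P1"])}
buffers = {"data": 1.0, "V1": 2.0, "P1": 0.0}
new_buffers, preds = network_timestep(layers, predict, buffers)
```

Here P1 reads V1's value from the previous timestep (2.0), while P0 reads P1's freshly computed value.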

CTRNN layers have three sets of weights to learn, covering the bottom-up, lateral, and top-down inputs:

- wIU: a connection from the concatenated top-down/bottom-up input to the hidden state
- wUU: a recurrent connection from the hidden state at the previous timestep to the current hidden state
- wLU: a connection from the lateral input (flattened MSTRNN output) to the hidden state

We send the output (the hidden state passed through the activation function) directly to all lower/higher/lateral nodes, and so there is no weight on the output connection.

MSTRNN layers, on the other hand, have four sets of weights to learn:

- conv_fwd: a convolutional connection from the input to the hidden state
- wUU: a recurrent connection from the hidden state at the previous timestep to the current hidden state
- make2D: a connection from the lateral CTRNN input to the hidden state
- conv_tr: a transverse convolutional connection from the hidden state to the lower layer

PREDICT layers can also be fully connected to the output, in which case they have a weight called final_fc to learn. We employ this in P0 to produce the proprioceptive prediction from the output of P1. We do not employ this in the visual pathway, however, because the transverse convolution in V1 already outputs the correct prediction format (V0 is essentially a dummy layer which copies input to output). Training is run on the network until the error converges, and the saved checkpoint can then be reloaded to execute the tests described later.
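Putting the three CTRNN weight sets together, a hedged PyTorch sketch follows. The weight names match the text, but the layer sizes, tanh activation, and single-tensor lateral input are illustrative assumptions; the actual implementation lives in the repository listed under Code.

```python
import torch
import torch.nn as nn

class CTRNNLayer(nn.Module):
    """Sketch of a proprioceptive layer with the three named weight sets."""
    def __init__(self, io_size, lateral_size, hidden_size, tau):
        super().__init__()
        self.wIU = nn.Linear(io_size, hidden_size)       # concatenated td/bu input
        self.wUU = nn.Linear(hidden_size, hidden_size)   # recurrent connection
        self.wLU = nn.Linear(lateral_size, hidden_size)  # lateral (flattened MSTRNN)
        self.tau = tau

    def forward(self, u_prev, io_input, lateral_input):
        h_prev = torch.tanh(u_prev)
        drive = self.wIU(io_input) + self.wUU(h_prev) + self.wLU(lateral_input)
        u = (1 - 1 / self.tau) * u_prev + (1 / self.tau) * drive
        return u, torch.tanh(u)  # the output itself carries no extra weight

layer = CTRNNLayer(io_size=8, lateral_size=16, hidden_size=100, tau=2.0)
u0 = torch.zeros(1, 100)
u1, h1 = layer(u0, torch.randn(1, 8), torch.randn(1, 16))
```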

3. Method

For each of the experiments in this study, we used a 2D simulation of a car driving on a track. During training, the simulation was used to generate visual data, consisting of 10x10-pixel grayscale images, and proprioceptive data, consisting of a vector of heading and velocity at each timestep normalized to lie within 0 and 1. The training data was generated by first creating a track and an ideal trajectory through that track (a list of positions, headings, and velocities). This list of ideal checkpoints is then used as targets, and in the actual training data we dynamically generate sequences of length SEQ_LEN, each starting out with some degree of randomness in the starting position and velocity. The car is then guided towards the target position and velocity throughout the sequence, and as it reaches each checkpoint, the next checkpoint in the ideal trajectory is set as the next target. Further, only one image is generated for each sequence (at the start of the sequence). We do not store the absolute position and only use it in the data generation process to guide the car around the track.

Thus the learning task is as follows: given the initial image, heading, and velocity, determine the next sequence of heading and velocity changes (so as to stay on the track, or to get back to the track if the car has run off of it). By running through the track, introducing deviations into the car's position, and then guiding it back to the checkpoint, we are in effect teaching it at each point how to deal with certain situations of misalignment. By forcing the network to predict the next trajectory without visual feedback, we are essentially creating a forward model that does not rely on immediate visual/proprioceptive feedback but rather uses its own predictions to form adequate corrective movements.

For each of the following experiments, the following settings were used. Further details regarding model parameters can be found in [1].
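Two of the data-generation details above can be illustrated with short helpers. The normalization ranges and the checkpoint-reach threshold are assumptions, since the exact values are not stated here.

```python
import math

def normalize_proprio(heading, velocity, v_max):
    """Map heading and velocity into [0, 1]. We assume heading lies in
    [-pi, pi) and velocity in [0, v_max]."""
    return (heading + math.pi) / (2 * math.pi), velocity / v_max

def advance_checkpoint(pos, checkpoints, idx, reach_dist=1.0):
    """Advance to the next ideal-trajectory checkpoint once the car is
    within reach_dist of the current target (threshold is assumed)."""
    tx, ty = checkpoints[idx]
    if math.hypot(pos[0] - tx, pos[1] - ty) < reach_dist:
        idx = min(idx + 1, len(checkpoints) - 1)
    return idx

h, v = normalize_proprio(0.0, 5.0, v_max=10.0)  # -> (0.5, 0.5)
```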

Sequence Length    6
V1 feature size    20
V2 feature size    18
V3 feature size    16
Visual Filter      7
P1 units           100
P2 units           100
P3 units           100
V1 tau             1
V2 tau             2
V3 tau             3
P1 tau             1
P2 tau             2
P3 tau             3

Table 1. Training parameters for the network in the current study.

As shown in Figure 2, we train the network until convergence, in this case at around 4000 epochs.

Figure 2. Training error over 4000 epochs. Error is determined on each epoch by taking the mean square error between the sequence of predictions and the training data for both pathways.

4. Experiments

In the car simulator, the main variables dealt with are the corner angle, which dictates how acute each corner is, and the straight length, which dictates how long the straights of the track's midsection are. The network as described above was trained on data taken only from tracks with corner angles of π/70 and π/140 and a straight length of 40. We restricted the training data to these values in order to test the generalization of the network, as described below.

We test the network by linking it to the simulation directly, allowing it to influence the real heading and velocity of the car with its proprioceptive predictions at each timestep, but only receiving a new visual image back from the simulator every SEQ_LEN steps. The test is considered successful if the car can keep mostly on track or recover from short periods off the track, can somewhat match the acceleration profile of the training data (i.e. speed up on straights and slow down on corners), and can finish at least one lap around the track. In the test images below we show the trajectory of the car, with velocity from slow to fast represented by green to yellow respectively, superimposed on the track (colored blue). The following set of tests was undertaken on the trained network.
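The test procedure can be sketched as a loop over simulator steps. The `model` and `sim` interfaces here are hypothetical stand-ins; the point is that the proprioceptive prediction drives the simulator on every step, while a fresh image arrives only every SEQ_LEN steps.

```python
def run_test(model, sim, seq_len, n_steps):
    """Closed-loop test harness sketch."""
    proprio = sim.get_heading_velocity()
    image = None
    for t in range(n_steps):
        if t % seq_len == 0:
            image = sim.render()  # one true visual input per sequence
        image_pred, proprio = model.step(image, proprio)
        sim.apply(proprio)        # the prediction updates the real car
        image = image_pred        # reuse the visual prediction in between

class DummySim:
    def __init__(self):
        self.renders, self.applied = 0, []
    def render(self):
        self.renders += 1
        return 0.0
    def get_heading_velocity(self):
        return (0.5, 0.5)
    def apply(self, proprio):
        self.applied.append(proprio)

class DummyModel:
    def step(self, image, proprio):
        return image, proprio  # identity "predictions" for the demo

sim = DummySim()
run_test(DummyModel(), sim, seq_len=6, n_steps=12)
```

Over 12 steps with seq_len=6 the simulator is rendered only twice, while the car's state is updated on all 12 steps.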

1) First we tested whether the network could handle the same kind of corner and straight styles as in the training data. Here we see that the network has no problem on these tracks.

Figure 3. (Left) Track with corner angle of π/70 and straight length of 40. (Right) Track with corner angle of π/140 and straight length of 40. Original track is blue. Green line represents regular velocity, yellow line represents high velocity.

2) Next we tested the ability of the network to generalize beyond the initial training data, with differing corner angles and straight length 40.

Figure 4. (Left) Corner angle of π/50 and straight length of 40. (Right) Corner angle of π/100 and straight length of 40.

We see that the network is able to generalize to various corners in the range of the two extremes of π/70 and π/140 that were in the training data.

3) Here we test various straight lengths for each of the tested corner angles.

Figure 5. Tracks with π/50 corner angle and various straight lengths. (From left to right) tracks with straight lengths of 10, 20, and 60 respectively.

Figure 6. Tracks with π/70 corner angle and various straight lengths. (From left to right) tracks with straight lengths of 10, 20, and 60 respectively.

Figure 7. Tracks with π/100 corner angle and various straight lengths. (From left to right) tracks with straight lengths of 10, 20, and 60 respectively.

Figure 8. Tracks with π/140 corner angle and various straight lengths. (From left to right) tracks with straight lengths of 10, 20, and 60 respectively.

Here we see that the network has no trouble with varying lengths of straights, as it has generalized the ability to stay on a straight track at high speed, accelerating at the beginning and decelerating at the end when turning into a corner.

4) Here we further test the generalization of the trained network by widening the corners: each time a rotation of heading is made, either 3 or 10 steps are taken, in comparison to the one step taken per rotation in the initial training data. We see that the network can still deal with the somewhat wider corners when 3 steps are taken, but fails to fully complete the track when widening all the way to 10 steps per heading rotation.

Figure 9. Tracks with extra steps taken per heading rotation - 3 (left) or 10 (right) steps.

5) Finally, we construct a track with corners going in the opposite direction to test the ability of the network to turn right (it has only been trained to turn left). The network has no problem generalizing to this kind of rotation, which appears to be a benefit of its generalization across differing corner rotation amounts.

Figure 10. Track with opposite corners (turning right).

5. Analysis of Network Activation

Now that we have demonstrated the ability of the trained network, here we review the internal network activity during testing to analyze what representations allow for the learned skills to be enacted.

Figure 11. Network activation during test (with new visual data provided from the simulator at the beginning of every 6-timestep sequence). (From top to bottom) V0 predictions of visual data for t=0,10,...,100. P0 predictions of proprioceptive data for the same time range. Principal components of the network activation of layers P1, P2, P3, also for the same time range.

From the visual predictions in V0 we see accurate predictions of the upcoming track. In particular, the car starts off heading right on a straight at t=0, turns to head vertically up by t=30, turns further to head left by t=60, and continues along this straight until t=100. From the proprioceptive predictions in P0 we see how the heading vector gradually shifts from pointing right to pointing left, with the speed vector (green) increasing after the car enters the straight at t=60.

Now that we have a general idea of what transpires during the recorded interval, we can analyze the network activation at the higher layers. Each layer was condensed into its two principal dimensions using PCA. We are most interested in the predictions of the P pathway. Since the sequence length was 6 and we ran for 100 timesteps, we should expect to see around 15 sequences in total. As P1 has a time constant of tau=1, it readily reflects the periodicity of the sequences, with each sequence forming a trajectory which spreads out from a central point. Each of these trajectories can be thought of as a motor primitive. This primitive has a circular motion in PC space, but as time goes on the primitive is interpolated from the primitive of heading due right to eventually heading due left, with the transition happening clearly between t=20 and t=60 (orange component). P1 then repeats the motor primitive of 'go left', as seen by the periodic repetition of this primitive from t=60 onwards. Further, the blue component evidently tracks the velocity vector, as it slows down into the corner from t=10 to t=60 before picking up again on the straight. Here again we see a periodicity as P1 repeats the motor primitive of 'go at x velocity'.

P3, which has a time constant of tau=3, has a more discrete nature, due to its tracking of slower-timescale dynamics. Essentially we only see one notable component here, in blue. This component evidently reflects whether the car is going left or right, but its absolute magnitude also seems to track the velocity of the car. Interestingly, then, it has seemingly combined heading and velocity into a single component which is somewhat discretized in terms of left and right (its negative/positive sign), while its analog magnitude appears to track the absolute velocity. P2, as expected, lies in the middle of these two extremes: it has taken on a similar discretization of left and right as P3, but also still tracks the particulars of trajectory change as the car turns the corner. It too appears to track velocity in terms of the absolute magnitude of activation.
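The dimensionality reduction used for these plots can be reproduced with a plain PCA projection over the recorded hidden states (a generic sketch via SVD, not the paper's analysis code; the matrix sizes are illustrative).

```python
import numpy as np

def top_pcs(activations, k=2):
    """Project a (timesteps x units) activation matrix onto its first k
    principal components."""
    centered = activations - activations.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(1)
acts = rng.normal(size=(100, 100))  # e.g. P1 hidden states over 100 timesteps
pcs = top_pcs(acts, k=2)
```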

Figure 12. Network activation during test. (From top to bottom) Principal components of the network activation of layers V1, V2, V3, for the same time range as Figure 11.

As for the visual pathway, in V1 we see a gradual transition of the components as the visual information goes from a straight line to a curved one and back to a straight line. The periodicity of the primitive follows the same pattern as in the P pathway. In V3 we see a more discretized version of this, similar to what we saw in P3. And likewise in V2 we see an intermediate stage between the fast and slow dynamics. This is not surprising, since the visual and proprioceptive pathways are laterally connected at each layer. Thus we see clear evidence of multimodal visuomotor representations (visuomotor primitives) which become more discretized in nature as one ascends the hierarchy.

6. Discussion

In the present study we have developed a multimodal network with visual and proprioceptive pathways based on the visuomotor dynamic neural network model. We trained this network to complete a simple 2D driving task and demonstrated its ability to learn visuomotor primitives at varying timescales. Further, we demonstrated the ability of the network to generalize to unseen data (traversing tracks of shapes not included in the training data).

An interesting result of these experiments was the increasingly discretized nature of primitives as the hierarchy was ascended. We predict that adding more layers to the network at slower timescales and increasing the sequence length would lead to even more discrete 'chunks', representing particular kinds of corners and so on. However, due to hardware limitations we were not able to investigate this in the present study. The discrete 'intermittent control' which keeps the car aligned to the track is compatible with the automated driving discussion in [6] based on human data.

It is our view that the present study makes an interesting case for the application of multimodal recurrent neural networks to the learning of tasks traditionally associated with symbolic AI or hand-coded programming routines. We believe that such end-to-end learning of visuomotor skills is key to developing open-ended learners [7], and the self-organized emergence of discrete 'symbol-like' representations at higher levels of the network bolsters our confidence that a ground-up 'sub-symbolic' approach is sufficient to develop a cognitively acting agent which can predict the consequences of its actions and plan in the complex high-dimensional space of the real world. Future research will seek to expand upon these ideas with more proactive agents that are capable of learning behaviors with more foresight than the purely reactive behaviors outlined here. A key to unlocking this will certainly be the ability of agents to predict the future when equipped with a model such as that outlined in this study.

7. Code

All code for the present study can be found at https://github.com/khrv/vmdnn_driving.

References

[1] J. Hwang, M. Jung, J. Kim, and J. Tani, "A deep learning approach for seamless integration of cognitive skills for humanoid robots," in 2016 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2016, pp. 59-65.
[2] J. Hwang, J. Kim, A. Ahmadi, M. Choi, and J. Tani, "Predictive coding-based deep dynamic neural network for visuomotor learning," in 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Lisbon, Portugal, 2017.
[3] R. D. Beer, "On the dynamics of small continuous-time recurrent neural networks," Adaptive Behavior, vol. 3, no. 4, pp. 469-509, 1995.
[4] Y. Yamashita and J. Tani, "Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment," PLoS Computational Biology, vol. 4, p. e1000220, 2008.
[5] M. Jung, J. Hwang, and J. Tani, "Self-organization of spatio-temporal hierarchy via learning of dynamic visual image patterns on action sequences," PLoS ONE, 2015.
[6] G. Markkula, E. Boer, R. Romano, and N. Merat, "Sustained sensorimotor control as intermittent decisions about prediction errors: computational framework and application to ground vehicle steering," Biological Cybernetics, vol. 112, no. 3, pp. 181-207, 2018. https://doi.org/10.1007/s00422-017-0743-9
[7] K. Harvey, "An open-ended approach to Piagetian development of adaptive behavior," OALib, vol. 5, no. 3, pp. 1-33, 2018. doi: 10.4236/oalib.1104434
