Embedded Neural Network for Swarm Learning of Physical Robots

Pitoyo Hartono and Sachiko Kakita

Department of Media Architecture, Future University-Hakodate, Kamedanakanocho 116-2, Hakodate, Japan
[email protected] http://www.fun.ac.jp/~hartono/hartono.html/
Abstract. In this study we ran real-time learning of multiple physical autonomous robots situated in a real, dynamic environment. Each robot has an onboard microcontroller on which a simple neural network is embedded. The neural network was designed in consideration of the limited power and computational resources that are characteristic of simple robots. In the experiments, several autonomous robots were placed in one environment, and each was given a specific task expressed as an evaluation function for the robot's neural network. The learning processes of the robots were started simultaneously from randomized initial conditions. The presence of several robots consequently formed a dynamic environment, in which the action of one robot affected the learning processes of the others. We demonstrated the efficiency of the embedded learning mechanism with respect to different environmental factors.
1 Introduction
In the past decade, several multi-robot systems have been proposed. This relatively young field encompasses a number of interesting research topics [1]. The fact that so far we have not succeeded in building highly autonomous singular robots capable of reliably and continuously operating in dynamic environments is one of the factors that motivates the study of multi-robot systems. The emergence of swarm intelligence in nature, such as in ant or bee colonies, is one of the main inspirations for these studies [2]. In [3], a decentralized control theory for multiple cooperative robotic systems was developed. A group of robots, each with a simple predefined behavior, capable of cooperating in a self-organized manner was proposed in [4]. The study in [5] implemented social reinforcement learning to encourage the emergence of social behaviors in a multi-robot system. Most of the previous studies agree that a bottom-up, behavior-based mechanism is a more realistic approach than centralized, top-down control for running multi-robot systems that must achieve sophisticated group behaviors applicable to real-world tasks. It is also becoming clear that the development of learning mechanisms by which a single robot within a multi-robot system can obtain intelligent strategies supporting the group behavior is one of the core topics.
In this research we developed a simple neural network that can be embedded on the onboard processor of a simple physical autonomous robot with limited power, computational resources and sensor precision. The efficiency of the learning mechanism is demonstrated through a collective learning scheme in which several robots are placed in the same environment. Initially the robots did not have any movement strategy, hence they moved in a random manner. A learning target was then given to each robot in the form of an evaluation function for the embedded neural network. There is no particular limitation on the learning task, so it is possible for all of the robots to have different, individualistic goals. In our research, however, we set a common primitive goal for each robot: the acquisition of an obstacle-avoidance strategy while random-walking in the presence of one another. Although simple, we consider that an effective obstacle-avoidance strategy is strongly correlated with the self-preservation capability of autonomous robots situated in dynamic environments. Because each robot must execute its learning process in the presence of the other robots, the learning environment becomes highly dynamic, and the learning process of one robot is affected by the behaviors of the others. So far, this kind of collective learning has rarely been executed on physical multi-robot systems, at least without partial help from simulators. One of our near-future interests is to study the collective behavior of the multi-robot system at various stages of the learning processes of the individual robots. This will lead to a clearer understanding of the relation between the individual behaviors and the collective behavior of the multi-robot system. We believe this research contributes an experimental platform not only for further multi-robot system studies but also for understanding the emergence of swarm intelligence in nature.
2 Robot's Structure and Behavior
For this study we built several robots with two different structures, shown in Fig. 1. The robot on the right side of the figure is based on a commercially available toy. The size and weight of each robot are approximately 230 x 180 x 130 [mm] and 0.5 [kg] (without batteries), respectively. Each robot is equipped with four proximity sensors, which measure the distance to obstacles, and four micro-switches to detect collisions. Each robot has two DC motors, one microcontroller (H8/3052) on which a simple neural network is implemented, and two EEPROMs to store the internal state of the neural network. Using ten 1.2 [V] nickel-metal hydride batteries, a robot can operate for up to 3.5 hours. The internal structure, common to all the robots, is shown in Fig. 2.

Fig. 1. Simple Robots (type 1 and type 2; labeled parts: microcontroller, micro-switches, proximity sensors)
Fig. 2. Internal Structure
3 Learning Algorithm
For the controller of each robot, a simple two-layered neural network is fully embedded onboard. The input layer receives the current values from the sensors, the delayed values of past sensor measurements, the past actions and the associated evaluation values. Each neuron in the output layer is associated with a primitive behavior of the robot. In this study we designate four primitive behaviors: forward, backward, turn-left-forward, and turn-right-forward. The execution time of each behavior is the same. The last two behaviors are designated to prevent the robot from rotating in place. It should be noted that stop is not one of the primitive behaviors, hence the robot is forced to move continuously. In this research, there is no distinction between a learning phase and a running phase. During its lifetime, a robot continuously utilizes the embedded neural network to generate movements while also updating the network based on feedback from the environment. Consequently, the robot continuously adapts to gradual changes in the environment.
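To make this input/output arrangement concrete, the following C sketch illustrates how such a network might be laid out on the microcontroller. The exact history length and the encoding of past actions are not specified in the text, so the sizes and names below (NUM_INPUTS, network_t, and so on) are illustrative assumptions only, not the paper's actual implementation.

```c
/* Illustrative layout of the embedded network; sizes and names are assumptions. */
#define NUM_SENSORS   4   /* proximity sensors                                  */
#define NUM_BEHAVIORS 4   /* forward, backward, turn-left-fwd, turn-right-fwd   */

/* Input layer: current sensor readings, one step of delayed readings,
 * the previously executed action (one-hot) and its evaluation value.           */
#define NUM_INPUTS (NUM_SENSORS + NUM_SENSORS + NUM_BEHAVIORS + 1)

typedef enum {
    FORWARD = 0,
    BACKWARD,
    TURN_LEFT_FORWARD,
    TURN_RIGHT_FORWARD
} behavior_t;

typedef struct {
    float x[NUM_INPUTS];                 /* normalized inputs, each in [-1, 1] */
    float w[NUM_INPUTS][NUM_BEHAVIORS];  /* connection weights w_ij            */
    float o[NUM_BEHAVIORS];              /* output neuron values O_j           */
} network_t;
```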
Fig. 3. Neural Network
For the neural network of a robot, the value of the j-th output neuron at time t, $O_j(t)$, can be shown as follows.

$$O_j(t) = f\Big(\sum_{i=1}^{N} w_{ij}(t)\,x_i(t)\Big) \qquad (1)$$
$x_i(t)$ and $w_{ij}(t)$ are the value of the i-th input neuron and the connection weight between the i-th input neuron and the j-th output neuron at time t, respectively, while N is the number of input neurons. All inputs are normalized so that $x \in [-1, 1]$. The activation function is defined as follows.

$$f(y_j) = \begin{cases} y_j & \text{if } \max_j y_j(t) < 1 \\[4pt] \dfrac{y_j}{\max_j y_j(t)} & \text{otherwise} \end{cases} \qquad j \in \{1, 2, 3, 4\} \qquad (2)$$
The normalization in Eq. 2 is needed so that the past behaviors that are fed back to the input layer have the same scale as the other inputs. After processing a given input vector, the values of all of the output neurons are calculated, and the output neuron that generates the largest value is designated as the winner neuron, as follows.

$$win(t) = \arg\max_j \{O_j(t)\} \qquad (3)$$
The robot then executes the primitive behavior associated with the winner neuron:

$$a(t) = A(win(t)) \qquad (4)$$

$a(t)$ denotes the primitive behavior executed at time t, while A is a function that maps an output neuron to its associated primitive behavior.
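A minimal C sketch of Eqs. 1-4 follows, continuing the illustrative structures introduced above: the weighted sums are computed first, the joint rescaling of Eq. 2 is applied only when the largest sum reaches 1, and the index of the largest output is mapped directly to a primitive behavior.

```c
/* Forward pass and winner-take-all selection (Eqs. 1-4); a sketch only. */
behavior_t forward_pass(network_t *net)
{
    int i, j, win = 0;
    float y[NUM_BEHAVIORS];
    float ymax = 0.0f;

    for (j = 0; j < NUM_BEHAVIORS; j++) {        /* Eq. 1: y_j = sum_i w_ij(t) x_i(t) */
        y[j] = 0.0f;
        for (i = 0; i < NUM_INPUTS; i++)
            y[j] += net->w[i][j] * net->x[i];
        if (j == 0 || y[j] > ymax)
            ymax = y[j];
    }

    for (j = 0; j < NUM_BEHAVIORS; j++) {        /* Eq. 2: rescale when max_j y_j >= 1 */
        net->o[j] = (ymax < 1.0f) ? y[j] : y[j] / ymax;
        if (net->o[j] > net->o[win])
            win = j;                             /* Eq. 3: winner neuron               */
    }
    return (behavior_t)win;                      /* Eq. 4: a(t) = A(win(t))            */
}
```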
Fig. 4. Flowchart of Behavior of Individual Robot
After the robot executes the winning primitive behavior, it calculates the cumulative value of the distance measurements of all the proximity sensors, as follows.

$$D(t+1) = \sum_{i=1}^{K} p_i(t+1) \qquad (5)$$

In Eq. 5, $D(t+1)$ is the cumulative distance value at time t+1, $p_i(t)$ is the value of the i-th proximity sensor at time t, and K is the number of proximity sensors. The evaluation of the robot's movement at time t, $E(t)$, is defined as follows.

$$E(t) = D(t+1) - D(t) \qquad (6)$$
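Eqs. 5-6 amount to two small helper routines. The sketch below assumes the proximity readings taken before and after the behavior are available as arrays of length NUM_SENSORS (a name assumed in the earlier sketch).

```c
/* Evaluation (Eqs. 5-6): cumulative proximity distances before and after the
 * executed behavior; positive E(t) means the robot moved away from obstacles. */
static float cumulative_distance(const float p[NUM_SENSORS])
{
    float d = 0.0f;
    int i;
    for (i = 0; i < NUM_SENSORS; i++)
        d += p[i];                               /* Eq. 5: D = sum_i p_i */
    return d;
}

static float evaluate(const float p_before[NUM_SENSORS],
                      const float p_after[NUM_SENSORS])
{
    return cumulative_distance(p_after) - cumulative_distance(p_before);  /* Eq. 6 */
}
```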
The evaluation function is simply the difference between the cumulative distances before and after the robot executed its primitive behavior. A positive E(t) indicates that the robot moved away from obstacles in the environment as a result of its most recently executed behavior. In this case the winning neuron is rewarded by correcting the connection weights so that its output increases for the same input. At the same time the losing neurons are punished by correcting the connection weights so that their values decrease for the same input. Hence, in this case the learning process works in a positive feedback manner for the winning neuron and in a negative feedback manner for the other neurons.
A negative E(t) indicates that the robot executed a wrong behavior, because it moved towards obstacles. In this case, the winning neuron is punished by correcting the connection weights so that its value decreases. At the same time, the robot has to be encouraged to execute another behavior, hence the connection weights are corrected so that the values of the non-winning neurons increase. In this case, the learning process works in a negative feedback manner for the winning neuron and in a positive feedback manner for the others. If the primitive behavior brought neither a significant improvement nor a significant deterioration, no correction is made to the internal state of the neural network. The outline of the robot's learning process is shown in Fig. 4, and the weight correction algorithm is formulated as follows.

$$w_{ij}(t) = w_{ij}(t-1) + \alpha(E(t))\,\varphi(E(t), x_i(t), j) \qquad (7)$$

$$\alpha(E(t)) = \begin{cases} \eta & \text{if } |E(t)| > T \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

T and η in Eq. 8 are empirically decided positive values. The direction of the correction is decided by the function φ(), which is defined in the table below.
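That table is not reproduced in the text above, so the following C sketch infers the sign logic of φ() from the positive/negative feedback description and from Eqs. 9-10; it should be read as one plausible realization of Eqs. 7-8 under that assumption, not as the paper's exact rule. The values of η and T quoted in the comments are those reported in Sect. 4.

```c
#include <math.h>   /* for fabsf() */

/* Weight correction (Eqs. 7-8). The sign logic inside the loop is an
 * assumption inferred from the text and from Eqs. 9-10.                   */
void update_weights(network_t *net, behavior_t win, float E,
                    float eta /* e.g. 0.1 or 0.001 */, float T /* e.g. 10 */)
{
    int i, j;
    float sign;

    if (fabsf(E) <= T)                 /* Eq. 8: alpha = 0, change not significant */
        return;

    for (j = 0; j < NUM_BEHAVIORS; j++) {
        if (E > 0.0f)                  /* improvement: reward winner, punish others */
            sign = (j == (int)win) ? +1.0f : -1.0f;
        else                           /* deterioration: punish winner, reward others */
            sign = (j == (int)win) ? -1.0f : +1.0f;

        for (i = 0; i < NUM_INPUTS; i++)
            net->w[i][j] += eta * sign * net->x[i];   /* Eq. 7 */
    }
}
```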
The correction rule in Eq. 7 guarantees that, when the behavior executed at time t improved the evaluation of the robot, then for the same input vector X,

$$O_j^{n+1}(X) > O_j^{n}(X), \quad j = win; \qquad O_j^{n+1}(X) < O_j^{n}(X), \quad j \neq win \qquad (9)$$

while, when the behavior deteriorated the evaluation,

$$O_j^{n+1}(X) < O_j^{n}(X), \quad j = win; \qquad O_j^{n+1}(X) > O_j^{n}(X), \quad j \neq win \qquad (10)$$
$O_j^{n}(X)$ denotes the output of the j-th output neuron in reaction to input vector X given for the n-th time. Equation 9 implies that the neural network reinforces a good behavior while simultaneously suppressing the other alternatives, while Eq. 10 shows that the neural network suppresses a behavior proven to be harmful and explores alternative ones. While the implemented learning mechanism has properties similar to conventional Reinforcement Learning, it operates in a continuous value domain and requires significantly fewer calculation resources (especially memory). These characteristics allow the learning method to be fully implemented on an onboard processor, supporting the real-time learning of a robot situated in a dynamic, real-world environment.
4 Experiments
In this research we ran experiments in two types of environments. The first environment is shown on the left side of Fig. 5. In this experiment, three robots were arbitrarily placed along a straight line surrounded by four walls, and the primitive behaviors were limited to forward and backward. The connection weights of the robots' neural networks were randomly initialized. The result of the first experiment is shown in Fig. 6. The left graph shows the average cumulative distance measured by the proximity sensors. From this graph it is obvious that the cumulative distance increases as the learning process progresses. The ordinate of this graph is in the units of the proximity sensors' distance measurement, while the abscissa is the learning step. The total learning time for 500 learning steps is approximately 5 minutes. The right graph shows the average number of collisions of the three robots. This graph indicates that the number of collisions decreases as the learning process progresses. From these two graphs, it is clear that, through the proposed learning process, the embedded neural networks gradually obtained a strategy to control the robots in achieving the given task (obstacle avoidance) in a dynamic environment. As a consequence of the learning process, we observed that the three robots were able to space themselves evenly in this environment while random-walking, thus forming a simple formation.

In the second experiment, three robots were randomly placed inside the environment shown on the right side of Fig. 5. In this experiment, all of the primitive behaviors were activated. Figure 7 shows the result for this environment. The two graphs also show the efficiency of the proposed learning method for real-time learning. We ran 10000 learning steps, which are equivalent to approximately 75 minutes. Although in this experiment we could not observe any formation generation, the three robots were able to achieve the task of obstacle avoidance while random-walking in a two-dimensional environment. The graphs shown are averages over 5 iterations of this experiment. Some snapshots of the multi-robot learning are shown in Fig. 8. The learning parameter η in Eq. 8 was set to 0.1 for the first experiment and 0.001 for the second experiment; T was set to 10 in both.

Fig. 5. Environments: a) environment 1, b) environment 2
Fig. 6. Experimental Result in Environment 1
Fig. 7. Experimental Result in Environment 2
Fig. 8. Multi-Robot Learning
5 Conclusions
In this research we proposed a simple learning algorithm that can be fully implemented on the onboard processor of low-cost autonomous robots with very limited calculation and power resources. Although the proposed learning mechanism is very basic and shares properties with Hebbian Learning and Reinforcement Learning, it is sufficiently powerful to be used for strategy-acquisition real-time learning in dynamic environments. We ran several experiments with two types of simple but dynamic environments, in which three robots had to learn an obstacle-avoidance strategy while random-walking in the presence of each other, based only on local information and local evaluation. To show the structure-independent property of the learning mechanism, it was also implemented on two types of robots with different structures, which gave similar results. Although it is not reported in this paper, the scalability of the learning algorithm was also tested in simulated swarm learning with 10-100 robots, which likewise gave good results. Although very simple, we believe that the proposed learning mechanism, including the internal hardware configuration, can contribute to setting a low-cost standard experimental platform not only for the field of swarm robotics but also for helping to understand the emergence of swarm intelligence in nature.
References

1. Parker, L.E.: Current State of the Art in Distributed Autonomous Mobile Robotics. In: Distributed Autonomous Robotic Systems 4, pp. 3-12. Springer, Heidelberg (2000)
2. Bonabeau, E., Theraulaz, G.: Swarm Smarts. Scientific American, 82-90 (March 2000)
3. Feddema, J., Lewis, C., Schoenwald, D.: Decentralized Control of Cooperative Robotic Vehicles: Theory and Application. IEEE Trans. on Robotics and Automation 18(5), 852-864 (2002)
4. Parker, C., Zhang, H., Kube, R.: Blind Bulldozing: Multiple Robot Nest Construction. In: 2003 Int. Conf. on Intelligent Robots and Systems, pp. 2010-2015 (2003)
5. Mataric, M.: Learning Social Behavior. Robotics and Autonomous Systems 20, 191-204 (1997)