Adaptation to criticality through organizational invariance in embodied agents

Miguel Aguilera 1,2,3,* and Manuel G. Bedia 1,2

1 Dept. of Computer Science, University of Zaragoza, Zaragoza, Spain
2 Aragón Institute of Engineering Research, Zaragoza, Spain
3 Institute for Cross-Disciplinary Physics and Complex Systems, Palma, Spain
* [email protected]

arXiv:1712.05284v1 [nlin.AO] 13 Dec 2017
ABSTRACT
Many biological and cognitive systems do not operate deep within one or another regime of activity. Instead, they are poised at critical points located at transitions of their parameter space. The pervasiveness of criticality suggests that there may be general principles inducing this behaviour, yet there is no well-founded theory for understanding how criticality is found at a wide range of levels and contexts. In this paper we present a general adaptive mechanism that maintains an internal organizational structure in order to drive a system towards critical points while it interacts with different environments. We implement the mechanism in artificial embodied agents controlled by a neural network maintaining a correlation structure randomly sampled from an Ising model at critical temperature. Agents are evaluated in two classical reinforcement learning scenarios: the Mountain Car and the Acrobot double pendulum. In both cases the neural controller reaches a point of criticality, which coincides with a transition point between two regimes of the agent's behaviour. These results suggest that adaptation to criticality could be used as a general adaptive mechanism in some circumstances, providing an alternative explanation for the pervasive presence of criticality in biological and cognitive systems.
Introduction
Generally, the behaviour of biological and cognitive systems is not steadily poised at one phase or another. Instead, living beings operate around points of critical activity at the boundary separating regimes with different dynamics: the transition between an ordered and a disordered phase. Criticality refers to a distinctive set of properties found at this transition. Some of these properties include power-law divergences of quantities described by critical exponents, fractal behaviour, and maximal sensitivity to external perturbations1,2. The surprising fact is that, unlike inanimate matter, where critical transitions from order to disorder take place by fine tuning of the parameters of the system1, living systems are ubiquitously poised near critical points according to mounting empirical evidence3. Signatures of criticality have been detected in neural cultures4, immune receptor proteins5, the network of genes controlling morphogenesis in fly embryos6 and bacterial clustering7. Indicators of critical behaviour have also been observed in the brain8 and cognitive behavioural patterns9.

These results suggest that general theoretical principles might underlie biological self-organization. And yet, there is no well-founded theory for understanding how living systems operate near critical points in a wide range of contexts, compelling us to ask what type of mechanisms are driving biological systems, at a dauntingly diverse span of levels of organization, to operate near critical points of activity. This question is largely unresolved, and a general mathematical framework for understanding how living systems are driven to criticality is still lacking. The issue has been popularized as the 'adaptation to the edge of chaos'10, and different solutions have been tested through modelling approaches during the last couple of decades.

Many models addressing criticality in living systems assume that biological systems are self-tuned, either by learning or evolution, to regions of the parameter space displaying optimal fitness, and that these optimal points are often placed near critical points due to the functional advantages of critical behaviour11,12. Other models have in turn focused on the idea of self-organized criticality (SOC), in which criticality emerges spontaneously from simple local interactions, without fine tuning of the parameters of the model. Typically, SOC models exploit clever local rules (e.g. in cellular automata) that produce critical behaviour in a specific topology and context of the system13–15. However, many SOC models are highly idealized and do not attempt to capture the basic interactions of living systems, often failing to provide general explanations of the ubiquitous emergence of criticality16. Although both adaptive self-tuning and SOC models present interesting insights, they are generally applicable in a relatively narrow set of contexts or under highly idealized conditions.

Here, we explore an alternative approach to examine how a system can display critical activity in a wide variety of situations using simple and general mechanisms. Instead of thinking about criticality as a by-product of adaptation to complex environments or a spontaneous property of certain systems, we explore the idea that adaptive systems might exploit critical surfaces using cheap learning mechanisms. That is, we inquire
into the possibility that some biological and cognitive systems are poised at criticality not as a consequence of adaptation to external circumstances, but because they are equipped with adaptive systems which are tuned to maintain an internal equilibrium at a critical point while they interact with their environment. Systems at critical points display a wide range of dynamic scales of activity and maximal sensitivity to external fluctuations, presenting optimal responses when facing complex heterogeneous environments12. Thus, cheap mechanisms maintaining the parameters of the system at regions of criticality could drastically reduce the cost of searching large parametric spaces for finding fit solutions, or even find interesting solutions in an unsupervised manner.

The paper is structured as follows. First, we propose a novel method for driving a system to critical points by maintaining certain relational invariants. In our case, these invariants are extracted from the correlation structure of an Ising lattice at a critical point, where correlations scale with distance according to a well-known power-law function. Second, we propose a simple learning rule maintaining this correlation invariance, and hypothesize that it could be used for driving systems in different contexts to operate near critical points. We test the model using two classical examples of learning and control: the Mountain Car and a double pendulum. The main result is that in both testing environments the general rule proposed here is able to drive self-tuned agents with no free parameters towards critical points of operation. At the same time, the agents themselves are poised at points of behavioural transition, where they are able to exploit a wide range of dynamic possibilities available in their environment, suggesting a link between an internal search for critical points and the exploration of external behavioural points of transition. Finally, we discuss the possible applications of this synthetic approach as a contribution towards understanding deeper principles governing biological and cognitive systems.
Model
Inspired by the ideas described, we present a novel and simple mechanism for driving a neural controller near criticality while interacting with different environments. Unlike most SOC models, instead of specifying certain mechanistic rules to generate criticality, we drive a system towards criticality through the conservation of certain invariants in the relations between the components of the system (e.g. correlations between components), imposing certain patterns in the system's organization. This focus on the system's organization instead of the mechanistic properties of its components is supported by the existence of well-known universality classes that provide a unified expression for families of systems operating under criticality17. Systems belonging to the same universality class, even if defined by very different material parameters or physical properties, have the same critical exponents characterizing diverging observables, which are defined by the symmetries of the system. For example, in all Ising models on 2D lattices (square, triangular, hexagonal and so forth) spin-spin correlations follow the asymptotic form c(r) ∝ 1/r^η, where η = 1/4 (ref. 2). This surprising property provides a perspective on criticality in terms of universal relations, suggesting that we could model criticality using simple and non-specific mechanisms independently of the individual parameters of the system.

Following this intuition, we propose a model that preserves a certain structure in the correlations of the system. There is some experimental evidence showing that, given an Ising model near a critical point, one could build a family of models by learning correlations drawn at random from the original model, which will be poised near a similar critical point18. Based on this, we propose to reproduce and support criticality by the invariant maintenance of an organizational structure of correlations following a 1/r^η law. Instead of restricting a model to a particular set of mechanisms or a given topology, the model preserves a general organizational distribution which could be easily implemented by very different structures, driving a system to a regime of criticality. We define the model as a neural network with the form of an Ising model or Boltzmann machine19 of N units following a maximum entropy distribution:

P(s) = (1/Z) exp[ β ( ∑_i h_i s_i + ∑_{i<j} J_ij s_i s_j ) ]    (1)
where the distribution follows an exponential family P(s) = (1/Z) e^{−βE(s)}, with Z a normalization constant. The energy E(s) of each state of size N is defined in terms of the biases h_i and couplings J_ij between pairs of units, with β = 1/(k_B T), k_B being Boltzmann's constant and T the temperature of the system. Units s_i can take discrete values of +1 or −1, while couplings and biases can take continuous values. Without loss of generality we can set an operating temperature such that β = 1. To simulate the network, we use Glauber dynamics, by which the sign of a unit is inverted with probability:

P_i(s) = [ 1 + e^{β ΔE_i(s)} ]^{−1}    (2)
where ΔE_i(s) = 2(h_i s_i + ∑_j J_ij s_i s_j) is the energy difference between the inverted and original state. Throughout the paper, we simulate the network by updating its state sequentially, applying Glauber dynamics to all units in the network in a random order at each simulation step.
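As an illustration, a minimal Python/NumPy sketch of one Glauber sweep could look as follows; the function and variable names are ours and do not correspond to the released code, and J is assumed to be symmetric with zero diagonal.

```python
import numpy as np

def glauber_sweep(s, h, J, beta=1.0, rng=np.random):
    """One sweep of Glauber dynamics: every unit is visited once in random
    order and flipped with probability [1 + exp(beta * dE)]^-1 (Equation 2).
    s: vector of +/-1 states; h: biases; J: symmetric couplings, zero diagonal."""
    for i in rng.permutation(len(s)):
        # Energy difference between the flipped and current state:
        # dE_i = 2 (h_i s_i + sum_j J_ij s_i s_j)
        dE = 2.0 * s[i] * (h[i] + np.dot(J[i], s))
        if rng.random() < 1.0 / (1.0 + np.exp(beta * dE)):
            s[i] = -s[i]
    return s
```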
Figure 1. (A) Distribution of correlation values used for learning, extracted from a 20×20 lattice Ising model at critical temperature. (B) Divergence of the heat capacity in 10 models learning random correlations sampled from Figure 1.A. Maximum and minimum values are shown by the grey area. (C) Distribution of the values of the coupling matrix J_ij of the 64 units of an 8×8 lattice Ising model with periodic boundaries. (D) Distribution of the values of the coupling matrix J_ij of a 64-unit Ising model after learning correlations sampled from Figure 1.A (the order of the nodes in the coupling matrix has been arranged using hierarchical clustering). Values of J_ij correspond to the values in the colour bar.
Learning rule
In order to induce criticality in its behaviour, we apply a learning rule designed for maintaining the internal structure of correlations of a two-dimensional Ising lattice model. In a nutshell, the learning rule operates as follows: 1) the distribution of correlations of a finite 2D Ising model is calculated; 2) the new model is defined by assigning to each neuron reference correlation values for each of its synapses, randomly sampled from the previous distribution; 3) during learning, each neuron sorts its synapses by correlation strength and adjusts these correlations to the assigned reference values using an inverse Ising learning rule.

First, since the size of the models employed here will be far from the thermodynamic limit, instead of directly using the asymptotic form c(r) ∝ 1/r^η, we approximate it by computing the correlation structure of a finite model operating at a known critical temperature. One of the few specific cases where the Ising model presents an exact solution is a model with zero fields and 2D lattice connectivity, in which a critical point appears at βJ = log(1 + √2)/2 (ref. 20). Exploiting this result, we build a 20×20 2D lattice Ising model at critical temperature with periodic boundary conditions to generate the reference correlations used by the learning rule. We simulate the model using Glauber dynamics, generating 10^6 samples after initializing the model by running 10^5 updates. From this simulation, we obtain the distribution of correlations in the system P(c_ij), where c_ij = ⟨s_i s_j⟩, shown in Figure 1.A. Since the fields h_i of all units are zero, the means m_i = ⟨s_i⟩ of all units are also zero.

Second, once the distribution of correlations has been obtained, we generate new models by adjusting their correlations to match the distribution P(c_ij). To do so using only local information, we assign each neuron i a set of correlation values c*_{i,k}, k = 1...N, drawn at random from P(c_ij). At each step, we compute the actual correlations between a neuron i and its neighbours j as c^m_{ij}, and generate reference values for learning by sorting the c*_{i,k} to match the order of the c^m_{ij}. We denote the sorted reference values as c*_{i,j}. All the values of m_i are set equal to zero. The reason for sorting the correlation values is to give more flexibility to the rule, since we are only interested in maintaining a c(r) ∝ 1/r^η relation, independently of which connection holds each value.

For the third step, the problem is that it is not trivial to find which combination of h_i and J_ij generates a specific combination of m_i and c_ij. This is known as the 'inverse Ising problem', originally formulated for Boltzmann machines19, which can be
solved by using a simple gradient descent rule:

h_i ← h_i + μ (m*_i − m^m_i)
J_ij ← J_ij + μ (c*_ij − c^m_ij)    (3)
where μ is a constant learning rate, m*_i and c*_ij are the reference means and correlations of the learning algorithm, and m^m_i and c^m_ij are the means and correlations of the model for the current values of h_i and J_ij. Generally, computing each learning step is computationally costly, since it requires summing over all possible states of s, although approximate methods such as Monte Carlo sampling are generally used to speed up learning. Similarly, we compute the approximate correlations by simulating our networks using Glauber dynamics for a number of steps.

Testing adaptation to criticality in isolated networks
As a demonstration, we apply the learning rule to different networks, assigning them objective means and correlations drawn at random from the distribution found for the 20×20 lattice Ising model. For each network, we apply Equation 3 for learning m_i and c_ij, computing the actual m^m_i and c^m_ij with Glauber dynamics. Since precision of learning is not important (the objective is to capture the overall distribution), we do not wait for convergence of the algorithm and simply update the learning rule 1000 times. We use a learning rate μ = 0.01 and compute 1000N samples for each learning step, N being the size of the system. Using this method, we apply the learning rule to 10 different networks for sizes N = 4, 8, 16, 32, 64. For simplicity, instead of the critical temperature of the lattice Ising model, we set an arbitrary inverse temperature of β = 1. Note that the choice of operating temperature is irrelevant, since it only implies a rescaling of the parameters, which the algorithm will compensate.

For each network, we test whether it is near a critical point by computing its heat capacity C(β) = β ∂H/∂β = β²(⟨E²(s)⟩ − ⟨E(s)⟩²), where E(s) = −∑_i h_i s_i − ∑_{i<j} J_ij s_i s_j is the energy of the Ising model and H = −∑_s P(s) log P(s) is the entropy of the system. The divergence of the heat capacity is a sufficient indicator to detect continuous phase transitions associated with critical points. We simulate each network for 10^5 steps for different values of β, and we find that all 10 networks diverge at the operating temperature β = 1 (Figure 1.B), showing heat capacity values similar to those of the original lattice Ising model with periodic boundaries. Nevertheless, although the distribution of correlations is similar to the lattice Ising model, the structure of the network is radically changed. Instead of the original ordered structure of a uniform lattice (Figure 1.C), we now have a disordered distribution of couplings J_ij (Figure 1.D), including both positive and negative values. Also, each execution of the learning algorithm yields a completely different arrangement of coupling values J_ij.
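As an illustrative sketch (with our own naming, not the repository's API), one learning step combining the per-neuron sorting of reference correlations with the gradient rule of Equation 3 could be written as:

```python
import numpy as np

def learning_step(h, J, m_model, c_model, c_ref, mu=0.01):
    """One update of Equation 3 with the sorting trick described in the text.
    m_model, c_model: means and pairwise correlations measured by Glauber sampling.
    c_ref: array of shape (N, N-1); row i holds neuron i's reference correlations,
    drawn once from the critical-lattice distribution P(c_ij). Reference means are zero."""
    N = len(h)
    h = h + mu * (0.0 - m_model)                      # m*_i = 0 for all units
    for i in range(N):
        neighbours = [j for j in range(N) if j != i]
        measured = c_model[i, neighbours]
        ranks = np.argsort(np.argsort(measured))      # rank of each measured correlation
        sorted_ref = np.sort(c_ref[i])[ranks]         # reference values re-ordered to match
        J[i, neighbours] = J[i, neighbours] + mu * (sorted_ref - measured)
    J = (J + J.T) / 2.0                               # keep the couplings symmetric
    return h, J
```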
In the following section, we use this learning rule to drive the neural controller of an embodied agent towards a critical point. In order to do so, we need to take into account the environment during learning. If we consider two interconnected Ising models (one being the neural controller and the other being the environment), Equation 3 holds perfectly if we only apply it to the values of i and j corresponding to units of the neural controller. In our case, we do not use an Ising model as an environment; instead, we use two classical examples from reinforcement learning. Therefore, our learning rule will be valid as long as the statistics of the environment can be approximated by an Ising model with an arbitrary number of units. Luckily, Ising models in the form of Boltzmann machines are universal approximators21, and the stationary distribution of any arbitrary environment can be approximated by an equivalent Ising model.

Embodied model: Mountain Car and Acrobot
In order to evaluate the behaviour of the proposed learning rule, we test it in two embodied situations using the OpenAI Gym toolkit22. We define a neural network consisting of an Ising model defined as in Equation 1, describing a network of N = 6 + N_h units, with N_h hidden units, 2 motor units and 4 sensor units. Motor units define the actions performed by the agents. In sensor units, the magnetic field of the unit is not a fixed parameter but is updated with the value of an external input, h_i = I_i; that is, sensor units are updated directly from the state of the environment. Sensor units and motor units are only connected to hidden neurons, while hidden neurons are connected to all other neurons (Figure 2.A). All connections are assigned an objective c_ij value (selected at random from the distribution P(c_ij) shown in Figure 1.A), and all units except sensor units are assigned an objective mean value m_i = 0. During learning, the agent applies the rule in Equation 3 to adjust its means and correlations to the assigned values. At each simulation step, the units of the Ising model are updated in a sequential random order using Glauber dynamics (Equation 2).

The first embodiment of the network is the Mountain Car environment23. This environment is a classical testbed in reinforcement learning depicting an under-powered car that must drive up a steep hill (Figure 2.B). Since gravity is stronger than the car's engine, the vehicle must learn to gain potential energy by driving up one hill before it is able to make it to the goal at the top of the opposite hill (see Methods). The neural network receives the speed of the system as an input and controls the force of the car's engine as an output. The second embodiment is a double pendulum or 'Acrobot'24, which has to coordinate the movements of two connected links to lift its weight (Figure 2.C, see Methods). The neural network receives the speed of the first pendulum and controls the torque applied on the joint between the two pendulums.
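A hedged sketch of the sensorimotor coupling for the Mountain Car is given below, assuming the classic OpenAI Gym API (env.reset() returning the observation and env.step() returning a 4-tuple); the discretisation thresholds and the mapping from the two motor units to the three actions are illustrative assumptions, not the exact scheme of the released code.

```python
import numpy as np
import gym

def encode_velocity(v, v_max=0.045, n_bits=4):
    """Discretise the car's velocity into n_bits and map bits {0,1} to {-1,+1}."""
    level = int((v + v_max) / (2 * v_max) * (2 ** n_bits - 1))
    level = min(max(level, 0), 2 ** n_bits - 1)
    return np.array([1 if (level >> k) & 1 else -1 for k in range(n_bits)])

def decode_action(motor_units):
    """Map the two +/-1 motor units to Gym's actions 0 (left), 1 (none), 2 (right)."""
    return int(np.sum(motor_units == 1))

env = gym.make('MountainCar-v0')
obs = env.reset()                                    # obs = [position, velocity]
for t in range(1000):
    sensors = encode_velocity(obs[1])
    # ... inject sensors as fields h_i = I_i and run one Glauber sweep of the controller ...
    motor_units = np.random.choice([-1, 1], size=2)  # placeholder for the motor units' state
    obs, reward, done, info = env.step(decode_action(motor_units))
```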
Figure 2. (A) Structure of the embodied neural controller for a model with N = 12 units and N_h = 6 hidden units. (B) Mountain Car environment: an under-powered car that must drive up a steep hill by balancing itself to gain momentum. (C) Acrobot environment: an agent has to balance a double pendulum to reach the high part of the environment.

For sizes N_h = 1, 2, 4, 8, 16, 32, 64, we train 10 agents in both environments, applying the learning rule from Equation 3 with μ = 0.01. Note that during learning the agents have no explicit goal other than adjusting the correlations of the system to a random sample extracted from the probability distribution in Figure 1.A. In Supplementary Videos S1 and S2 we can observe an example of the behaviour of agents with N = 64 after training. In Figure S2.C-D we can see the distribution of correlations of an agent with N_h = 64 (the result is similar for all agents and sizes), and Figure S2.E-F shows the error between the distribution in Figures S2.C-D and Figure 1 for this agent. In these figures we can observe how agents after learning display a correlation structure consistent with the infinite correlation length (c(r) ∝ 1/r^η distribution) of the lattice Ising model at criticality.
Figure 3. Ranked probability distribution function of the neural network for 10 different agents and different sizes in (A) the Mountain Car and (B) the Acrobot embodiments. The real distribution is compared with a distribution following Zipf’s law, (i.e. P(s) = 1/rank, dash-dotted line). We observe a good agreement between the model and Zipf’s law, suggesting critical scaling.
Results
In this section, we analyse the behaviour of the neural controllers and the behavioural patterns of the agents with respect to the possibilities of their parameter space. As we observe below, the 10 agents display quite similar behaviour in each environment, despite the fact that each one has learned different values of c_ij and J_ij. In order to compare the agents with other behavioural possibilities, we explore the parameter space by changing the parameter β of the agents. Modifying the value of β is equivalent to a global rescaling of the parameters of the agent, transforming h_i ← β·h_i and J_ij ← β·J_ij, thus exploring the parameter space along one specific direction.

Signatures of criticality in the neural controller
First, in order to test whether the agents are being driven towards a critical point, we analyse signatures of critical behaviour in the neural controller of the agent. A classical signature of criticality is the presence of power-law distributions in the statistical descriptions of the states of a system. Unlike an isolated Ising model, our neural network is connected to an environment, and the probability distribution of the Ising neural controller is no longer described by Equation 1. Thus, we estimate the probability distribution P(s) of the Ising model by counting the occurrence of each state of the units s. We simulate the system at β = 1 for 10^8 simulation steps. As in training, the agent's position and state are reset every 5·10^4 simulation steps. We observe that all agents approximately follow Zipf's law for the Mountain Car
Figure 4. (A, B) Entropy of the agent's neural controller for different sizes up to N = 22 (N_h = 16), 10 different agents and 101 values of β. (C, D) Heat capacity of the agent's neurons, computed as C(β) = β ∂H(β)/∂β, where H(β) is estimated using B-splines. Maximum and minimum values are shown by the grey area. The figures suggest that in both the Mountain Car (left) and Acrobot (right) embodiments the model presents a divergence of its heat capacity as the number of neurons N increases, indicating that the neural controller of the system is near a continuous phase transition.
Figure 5. (A, B) Entropy of the sensor units for different sizes up to N = 70 (N_h = 64), 10 different agents and 101 values of β. (C, D) Heat capacity of the agent's neurons, computed as C(β) = β ∂H(β)/∂β, where H(β) is estimated using B-splines. Maximum and minimum values are shown by the grey area. The figures suggest that in both the Mountain Car (left) and Acrobot (right) embodiments the system presents a divergence of its heat capacity as the number of neurons N increases, indicating that the neural controller of the system is near a second-order phase transition.
(Figure 3.A) and Acrobot embodiments (Figure 3.B), with error bars in a very narrow range. The power-law distribution of neural activation patterns suggests that the neural controller of the agents is operating near a critical point. Nevertheless, the sole occurrence of a power law is generally insufficient to assess the presence of criticality and may arise naturally in some non-equilibrium conditions. Thus, further evidence is necessary to test whether the neural controller is at a critical point.

A sufficient indicator of criticality is a divergence of the heat capacity of the system (such as the one shown in Figure 1.B), indicating that the system is maximally sensitive to parametric changes. Unfortunately, when the neural controller is embodied in the Mountain Car and Acrobot environments, we can no longer access a formal description of the probability distribution of the agent-environment system, and thus we cannot directly compute quantities such as the entropy or heat capacity from the energy of the system. Nevertheless, we can still directly compute the entropy H(x) = −∑_x P(x) log P(x) of any variable of the system by estimating its probability distribution P(x) through simulations. In order to compute the entropy and heat capacity of different variables, we simulate the agent's behaviour for 101 values of β, logarithmically distributed in the interval [10^−1, 10^1]. We run the 10 agents for each embodiment during 10^6 simulation steps, resetting the agent's position and state every 5·10^4 simulation steps.

First, we compute the entropy of the probability distribution of the hidden neurons in the controller. Due to computational constraints (the number of states of the probability distribution increases as 2^{N_h}), we only compute the entropy in agents with up to N_h = 16 hidden units. Displaying the entropy of the neurons for different values of β, we observe that the agents are near an order-disorder transition (Figure 4.A,B). Larger sizes make the transition sharper and closer to the operating temperature β = 1. From the entropy of a variable, its heat capacity can be computed as C(β) = β ∂H(β)/∂β. From the computed 101 values of entropy, H(β) is estimated by fitting a curve using cubic B-splines25, as indicated in the Methods section. In Figure 4.C,D we observe how the system displays a divergence of the heat capacity of the neural network similar to that of the Ising model, suggesting that the robot's neural controller is at a critical point.

In order to confirm a divergence in the observed transition, we repeat the analysis for the state of the 4 sensor units of the network. In Figure 5 we show the entropy and heat capacity of the sensor units for sizes up to N_h = 64 hidden neurons, where we observe a picture similar to that in Figure 4, suggesting that the heat capacity diverges and a second-order transition takes place in the neural controller of the agent. These results suggest that the agent's neural controller is operating near a point of criticality resembling a continuous phase transition, self-organizing to present maximal sensitivity to changes in its parameter space.
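As a sketch of this estimation step (illustrative only, assuming visited states are collected as tuples of ±1 values), the empirical entropy and the ranked distribution used for the Zipf comparison can be computed by simple state counting:

```python
import numpy as np
from collections import Counter

def empirical_distribution(state_samples):
    """Estimate P(s) by counting occurrences of each visited state."""
    counts = Counter(tuple(s) for s in state_samples)
    total = sum(counts.values())
    return np.array([c / total for c in counts.values()])

def empirical_entropy(state_samples):
    """H = -sum_s P(s) log P(s), with P(s) estimated by state counting."""
    p = empirical_distribution(state_samples)
    return -np.sum(p * np.log(p))

def ranked_distribution(state_samples):
    """Ranked probabilities, to be compared against Zipf's law P ~ 1/rank (Figure 3)."""
    return np.sort(empirical_distribution(state_samples))[::-1]
```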
Figure 6. Transition in the behavioural regime of the agents with Nh = 64. We show the behaviour of two individual agents with different values of β for the Mountain Car (A, B, C) and Acrobot (D, E, F) embodiments. We observe that β = 1 is a transition point between two modes of behaviour in both agents.
Figure 7. (A, B) Mean height ⟨y⟩ of the agents for different sizes, 10 different agents and 101 values of β. (C, D) Susceptibility of the agent's behaviour, computed as χ_y(β) = β ∂⟨y⟩/∂β, where ⟨y⟩ is estimated using B-splines. Maximum and minimum values are shown by the grey area. The figures suggest that in both the Mountain Car (left) and Acrobot (right) embodiments the behaviour of the agent presents a sharp transition around β = 1.
Behavioural transitions in the parameter space
What does it imply for the agent to adapt to be poised near a critical point? It should be remarked here that agents are given no explicit goal; they only adapt their behavioural patterns to maintain a distribution of correlations randomly sampled from the distribution shown in Figure 1.A. To explore this issue, we examine the different behavioural modes of the agent while exploring its parameter space by changing the value of β. The behaviour of the Mountain Car can be described just by its horizontal position x and speed v at different moments of time. Likewise, the horizontal and vertical positions of the tip of the Acrobot's links are a good description of its behaviour.

In Figure 6.A-C we can observe the behaviour of a Mountain Car agent with N_h = 64 for β = {0.75, 1, 1.2}, respectively. We observe that for values of β lower than the operating temperature the agents are not able to reach the top of the mountain. On the other hand, when β is higher, the agents present more 'rigid' trajectories going from one mountain peak to the other. At β = 1 the agent is able to reach the top of the mountain (note that the peaks of the mountain are located at x = −π/2 and x = π/6) while displaying larger behavioural diversity. Similarly, in Figure 6.D-F, we observe that the Acrobot agent with N_h = 64 at β = 1 displays a diverse range of behaviours, being able to reach the top of the plane, while when β is lowered or increased it drifts to other, less diverse behavioural modes. Although only one agent is represented for each environment, the results of Figure 6 are similar for all agents and sizes. To get a more general picture across different agents and sizes, we can analyse the behavioural transitions in relevant variables of the agent-environment systems.

Furthermore, we analyse whether the behaviour of the agent, and not only its neural controller, is near a critical point. To inspect this, we calculate the mean height ⟨y⟩ of the agents in our simulations (Figure 7.A,B), and compute the susceptibility of this height as χ_y(β) = β ∂⟨y⟩/∂β (Figure 7.C,D). In this case, the susceptibility appears to increase monotonically as the size is doubled, even if these increases are not as uniform as in Figures 4 and 5. Although further tests of criticality could validate whether criticality is also found in the whole agent-environment system, the figures suggest that the continuous phase transition of the neural controller corresponds with a sharp transition in the agent's behaviour.
Discussion
Recapitulating the main ideas presented so far, we have tested how, by taking a set of correlations chosen at random from a distribution generated by a lattice Ising model at a critical point, we can construct a new model that is also near a critical point in its parameter space. Moreover, if an embodied agent maintains these correlations using a simple learning rule while interacting with its environment – as a sort of organizational homeostasis – the agent's neural controller is driven to a critical point, which coincides with behavioural transitions in the agent-environment parameter space. This suggests the possibility that criticality could be diagnosed and induced directly from the maintenance of a given distribution of correlations rather than from modelling
a precise mechanistic structure. Also, our results show that criticality could be generated by quite simple mechanisms relying only on local information, maintaining specific correlations around given values. Here we have implemented the mechanism as a simple Boltzmann learning process, but other rules could have the same effect, such as the combination of Hebbian and anti-Hebbian tendencies in specific ratios.

In our model, we only require the system to maintain a distribution of relations between the components of the system. This connects with systemic approaches to biology interested not in specific or intrinsic components of biological systems but in the networks of relations and processes26–28. It is also in line with notions of relational invariance in Piaget's approach to functional invariants in cognitive development29 or Maturana and Varela's ideas of autopoietic machines, defined as homeostatic systems that maintain their own organization constant as a network of relations between components30.

Assuming a similar systemic perspective, we have derived learning rules for a system that drives itself towards critical points by maintaining an invariant structure of correlations roughly defined by a power-law scaling 1/r^η. Our approach assumes a different point of view on self-organized criticality, in which the distribution of correlations is not the consequence of criticality in a specific topology, but the cause driving an indeterminate topology to a critical point. The question now is whether imposing connections derived from a 1/r^η function is a strong assumption or implies particularly exigent circumstances. We do not think so, since power-law functions can be naturally generated by simple rules of preferential attachment favouring 'rich-get-richer' cumulative inequalities31, or directly as a natural consequence of certain geometries of space (e.g. gravitation laws32).

Our model only assumes that a system is going to adapt in order to preserve an internal network of relations. It emphasizes the maintenance of organizational structures capable of reproducing the behaviour of living systems without relying on internal models of the external source of sensory input. This contrasts with other approaches which have focused on understanding criticality as a strategy to effectively represent a complex and variable external world12, for example studying criticality in predictive coding or deep learning architectures dealing with complex inputs33,34. In those cases, an internalist view is assumed, where the neural controller represents structures of an external world, whose complexity may be the cause of criticality being present in the neural controller. Instead, our approach is agnostic in terms of the inputs or the external world of an organism, and deals only with how an agent rearranges its internal structures when facing different environments. The agents presented here are not specifically designed for a particular problem. In simple terms, our agents generate (while preserving the same internal neural organization) a wide variability and richness of behaviours (avoiding both disorder and explosive, indiscriminate propagation) that permits them to explore the space of parameters and eventually achieve solutions that they were not designed to find. The empirical evidence of the experiments shown here supports this idea.
A parallel could be established with the concept of play, which can be understood as a 'rule-breaking' activity that transgresses the constraints of a stable, self-equilibrating regime of behaviours and has no concrete goals35. A model such as the one presented here could be used for exploring life-like autonomous behaviour without the need for explicit internal representations, goals, or rule-based behaviour. Conceptual models of critical activity based on the maintenance of a system's relational invariants could help in the development of a synthetic route towards the exploration of adaptive and embodied criticality.
Methods
Mountain Car. This environment consists of a car with mass m moving along a one-dimensional environment. The agent moves its position along a horizontal axis x, limited to the interval [−1.5π, 0.5π]. Each horizontal position represents a point in an environment with two mountains, whose height is defined as y = 0.55 + 0.45 sin(3x). The velocity along the horizontal axis is updated each time step as v(t + 1) = v(t) + 0.001a − 0.0025 cos(3x), where a is the action of the motor, which can be −1, 0 or 1, impelling the car with a force F = ma. The inputs I_i fed to the sensor units are defined as an array of 4 units, which are assigned the instantaneous velocity of the car discretized into an array of 4 bits. Each input I_i is assigned a value of 1 if its corresponding bit is active and −1 otherwise.
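A minimal sketch of this update is given below, assuming (as in the standard Gym implementation) that the position is advanced by the velocity and clipped to the allowed interval, and that the ±0.045 velocity cap from the 'Task difficulty' paragraph is applied after each update:

```python
import numpy as np

def mountain_car_step(x, v, a, v_max=0.045):
    """One step of the Mountain Car dynamics described above; a is -1, 0 or +1."""
    v = np.clip(v + 0.001 * a - 0.0025 * np.cos(3 * x), -v_max, v_max)
    x = np.clip(x + v, -1.5 * np.pi, 0.5 * np.pi)
    y = 0.55 + 0.45 * np.sin(3 * x)     # height of the landscape at the new position
    return x, v, y
```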
Acrobot. The Acrobot is a two-link planar robot composed of two pendulums joined at their tips, with a motor applying a torque in the clockwise or counterclockwise direction on the joint between the two links. The position of the system is defined by
the angles of both pendulums θ_1 and θ_2, whose behaviour is defined by the following system of differential equations:

θ̈_1 = −(d_2 θ̈_2 + φ_1)/d_1
θ̈_2 = ( m_2 l_{c2}² + I_2 − d_2²/d_1 )^{−1} ( τ + (d_2/d_1) φ_1 − φ_2 )
d_1 = m_1 l_{c1}² + m_2 (l_1² + l_{c2}² + 2 l_1 l_{c2} cos(θ_2)) + I_1 + I_2
d_2 = m_2 (l_{c2}² + l_1 l_{c2} cos(θ_2)) + I_2
φ_1 = −m_2 l_1 l_{c2} θ̇_2² sin(θ_2) − 2 m_2 l_1 l_{c2} θ̇_2 θ̇_1 sin(θ_2) + (m_1 l_{c1} + m_2 l_1) g cos(θ_1 − π/2) + φ_2
φ_2 = m_2 l_{c2} g cos(θ_1 + θ_2 − π/2)    (4)

where τ is the torque applied to the system, which can be −1, 0 or 1; m_1 = m_2 = m is the mass of the links; l_1 = l_2 = 1 is the length of the links; l_{c1} = l_{c2} = 0.5 are the lengths to the centre of mass of the links; I_1 = I_2 = 1 are the moments of inertia of the links; and g = 9.8 is the gravity. The variables d_1 and d_2 are the total moments of inertia of each link, and φ_1 and φ_2 are linked to the potential energy of the system. Similarly to the Mountain Car, the inputs fed to the sensor units in the Acrobot embodiment are defined as an array of 4 sensor units, encoding the angular speed of the first link with binary values (encoding active and inactive bits as +1 and −1).

Task difficulty. In order to make the tasks challenging, we set the maximum velocity allowed for the Mountain Car to ±0.045 (typically it is set to 0.07) and the mass of the Acrobot links to m = 1.75 (typically m = 1). These parameters are designed to make it difficult for agents controlled by neural networks with random parameters to solve the task (reaching the top of the environment), yielding success rates of 6.1% for the Mountain Car and 3.1% for the Acrobot. Success rates were evaluated by simulating 1000 neural controllers with random parameters (sampled from a uniform distribution in the range [−2, 2]). The Mountain Car was simulated for 1000 simulation steps starting from a random position in the valley in [−0.6, −0.4], and was considered successful when it reached the maximum position at least once. The Acrobot was simulated for 5000 simulation steps from the bottom position (angles and angular speeds in [−0.1, 0.1]) and was considered successful if it reached a vertical position higher than 1.8.

Training. During training, agents are initialized at random starting positions at the bottom of their environments (x ∈ [−0.6, −0.4] and v = 0 for the Mountain Car, and θ_1, θ_2, θ̇_1, θ̇_2 ∈ [−0.1, 0.1] for the Acrobot). The state of the neural network is randomized and the initial parameters h_i and J_ij are set to zero. Agents are simulated for 1000 trials of 5000 steps, computing each trial the values of m^m_i and c^m_ij and applying Equation 3 at the end of the trial. The agent's position and state are reset every 5·10^4 simulation steps.
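Schematically, and using the hypothetical helpers sketched earlier (run_trial and learning_step are placeholders, not the repository's functions), the training procedure amounts to:

```python
import numpy as np

def train_agent(h, J, c_ref, run_trial, learning_step,
                n_trials=1000, steps_per_trial=5000, mu=0.01):
    """Schematic training loop: each trial resets the environment, simulates the
    coupled agent-environment system, accumulates the empirical means m^m_i and
    correlations c^m_ij of the controller, and applies Equation 3 once per trial."""
    for trial in range(n_trials):
        states = run_trial(h, J, steps_per_trial)        # (steps, N) array of +/-1 states
        m_model = states.mean(axis=0)                    # empirical means
        c_model = states.T @ states / len(states)        # empirical pairwise correlations
        h, J = learning_step(h, J, m_model, c_model, c_ref, mu=mu)
    return h, J
```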
Code availability. The source code implementing the learning rule in the different examples is freely available at https://github.com/MiguelAguilera/Adaptation-to-criticality-through-organizational-invariance.

Curve interpolation. The heat capacity of hidden and sensor units was computed by interpolating entropy curves using cubic B-splines25, using the scipy function splrep with a smoothing coefficient of 1. Different coefficients were tested with similar results.
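A minimal sketch of this computation with SciPy (fitting in β rather than log β is our own assumption):

```python
import numpy as np
from scipy.interpolate import splrep, splev

def heat_capacity(betas, entropies, smoothing=1.0):
    """Fit H(beta) with a smoothing cubic B-spline and return C(beta) = beta * dH/dbeta."""
    tck = splrep(betas, entropies, k=3, s=smoothing)
    return betas * splev(betas, tck, der=1)
```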
References
1. Salinas, S. R. A. Phase Transitions and Critical Phenomena: Classical Theories. In Introduction to Statistical Physics, Graduate Texts in Contemporary Physics, 235–256 (Springer New York, 2001). DOI 10.1007/978-1-4757-3508-6_12.
2. Salinas, S. R. A. Scaling Theories and the Renormalization Group. In Introduction to Statistical Physics, 277–304 (Springer New York, 2001).
3. Mora, T. & Bialek, W. Are biological systems poised at criticality? J. Stat. Phys. 144, 268–302 (2011).
4. Schneidman, E., Berry, M. J., Segev, R. & Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440, 1007–12 (2006). DOI 10.1038/nature04701.
5. Mora, T., Walczak, A. M., Bialek, W. & Callan, C. G. Maximum entropy models for antibody diversity. Proc. Natl. Acad. Sci. 107, 5405–5410 (2010). DOI 10.1073/pnas.1001705107.
6. Krotov, D., Dubuis, J. O., Gregor, T. & Bialek, W. Morphogenesis at criticality. Proc. Natl. Acad. Sci. 111, 3683–3688 (2014). DOI 10.1073/pnas.1324186111.
7. Chen, X., Dong, X., Be'er, A., Swinney, H. L. & Zhang, H. P. Scale-Invariant Correlations in Dynamic Bacterial Clusters. Phys. Rev. Lett. 108, 148101 (2012). DOI 10.1103/PhysRevLett.108.148101.
8. Chialvo, D. R. Critical Brain Dynamics at Large Scale. In Plenz, D. & Niebur, E. (eds.) Criticality in Neural Systems, 43–66 (Wiley-VCH Verlag GmbH & Co., 2014).
9. Van Orden, G., Hollis, G. & Wallot, S. The blue-collar brain. Fractal Physiol. 3, 207 (2012). DOI 10.3389/fphys.2012.00207.
10. Kauffman, S. A. The Origins of Order: Self-Organization and Selection in Evolution (Oxford University Press, USA, 1993).
11. Beggs, J. M. & Plenz, D. Neuronal Avalanches in Neocortical Circuits. J. Neurosci. 23, 11167–11177 (2003).
12. Hidalgo, J. et al. Information-based fitness and the emergence of criticality in living systems. Proc. Natl. Acad. Sci. 111, 10095–10100 (2014). DOI 10.1073/pnas.1319166111.
13. Bak, P., Tang, C. & Wiesenfeld, K. Self-organized criticality. Phys. Rev. Lett. 59, 381–384 (1987). DOI 10.1103/PhysRevLett.59.381.
14. Bak, P. & Sneppen, K. Punctuated equilibrium and criticality in a simple model of evolution. Phys. Rev. Lett. 71, 4083–4086 (1993). DOI 10.1103/PhysRevLett.71.4083.
15. Jensen, H. J. Self-Organized Criticality: Emergent Complex Behavior in Physical and Biological Systems (Cambridge Univ. Press, Cambridge, 1998).
16. Watkins, N. W., Pruessner, G., Chapman, S. C., Crosby, N. B. & Jensen, H. J. 25 Years of Self-organized Criticality: Concepts and Controversies. Space Sci. Rev. 198, 3–44 (2016). DOI 10.1007/s11214-015-0155-x.
17. Kadanoff, L. P. More is the same; mean field theory and phase transitions. J. Stat. Phys. 137, 777–797 (2009).
18. Tkacik, G., Schneidman, E., Berry II, M. J. & Bialek, W. Spin glass models for a network of real neurons. arXiv:0912.5409 (2009).
19. Ackley, D. H., Hinton, G. E. & Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169 (1985).
20. Onsager, L. Crystal Statistics. Phys. Rev. 65, 117–149 (1944). DOI 10.1103/PhysRev.65.117.
21. Montúfar, G. F. Universal Approximation Depth and Errors of Narrow Belief Networks with Discrete Units. Neural Comput. 26, 1386–1407 (2014).
22. Brockman, G. et al. OpenAI Gym. arXiv:1606.01540 (2016).
23. Moore, A. W. Efficient Memory-Based Learning for Robot Control (Univ. of Cambridge, 1990).
24. Sutton, R. S. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In Advances in NIPS 8, 1038–1044 (MIT Press, 1996).
25. Dierckx, P. Curve and Surface Fitting with Splines (Oxford University Press, Inc., New York, NY, USA, 1993).
26. Bernard, C. & Greene, H. C. An Introduction to the Study of Experimental Medicine (Courier Corporation, 1957).
27. Rosen, R. Some relational cell models: The metabolism-repair systems. In Foundations of Mathematical Biology, 217–253 (Academic Press, 1972). DOI 10.1016/B978-0-12-597202-4.50011-6.
28. Ashby, W. R. Principles of the self-organizing system. Princ. Self-Organization 255–278 (1962).
29. Piaget, J. Biology and Knowledge: An Essay on the Relations between Organic Regulations and Cognitive Processes (University of Chicago Press, 1974).
30. Maturana, H. R. & Varela, F. J. Autopoiesis and Cognition: The Realization of the Living (D. Reidel Publishing Company, 1980), 1st edn.
31. Greenwood, M. & Yule, G. U. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Royal Statistical Society 83, 255–279 (1920).
32. Barrow, J. D. New Dimensions. In The Constants of Nature, 203–205 (Random House, New York, 2002).
33. Friston, K., Breakspear, M. & Deco, G. Perception and self-organized instability. Front. Comput. Neurosci. 6 (2012). DOI 10.3389/fncom.2012.00044.
34. Lin, H. W. & Tegmark, M. Critical Behavior from Deep Dynamics. arXiv:1606.06737 (2016).
35. Di Paolo, E. A., Rohde, M. & De Jaegher, H. Horizons for the enactive mind: Values, social interaction, and play. Enaction 33–87 (2010).
Acknowledgements
Research was supported in part by the Spanish National Programme for Fostering Excellence in Scientific and Technical Research project PSI2014-62092-EXP and by the project TIN2016-80347-R funded by the Spanish Ministry of Economy and Competitiveness.
Author contributions statement
M.A. conceived and conducted the experiments. M.A. and M.B. analysed the results and wrote the manuscript.
Competing interests
The authors declare no competing financial interests.