Computational models as statistical tools

Daniel Durstewitz, Georgia Koppe¹ and Hazem Toutounji¹

Traditionally, models in statistics are relatively simple 'general purpose' quantitative inference tools, while models in computational neuroscience aim more at mechanistically explaining specific observations. Research on methods for inferring behavioral and neural models from data, however, has shown that a lot could be gained by merging these approaches, augmenting computational models with distributional assumptions. This enables estimation of parameters of such models in a principled way, comes with confidence regions that quantify uncertainty in estimates, and allows for quantitative assessment of prediction quality of computational models and tests of specific hypotheses about underlying mechanisms. Thus, unlike in conventional statistics, inferences about the latent dynamical mechanisms that generated the observed data can be drawn. Future directions and challenges of this approach are discussed.

Address: Department of Theoretical Neuroscience, Bernstein Center for Computational Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim of Heidelberg University, Mannheim, Germany

Corresponding author: Durstewitz, Daniel ([email protected])

¹ Equal contribution.

Introduction

In traditional statistics, models are general-purpose devices in the sense that they could be applied to a large class of experimental situations, originating in various fields and disciplines, where inference about a set of observed data is sought. A General Linear Model (GLM), for instance, relies on assumptions about the distribution of the data (or error terms), and the functional form of the relationship between predictors and outcomes (linearity), but otherwise makes no claims about the specific processes or mechanisms that underlie the data at hand. Parameters in the model (like the ‘beta weights’ in the GLM) obtain their meaning only within the specific experimental context investigated. Statistical models are usually simple (often linear), with relatively few or strongly constrained (penalized) parameters, to render the inference process well-defined and tractable.

Models in computational neuroscience, on the other hand, are traditionally tools for gaining insight into the possible processes and mechanisms that underlie experimental observations. They are put forward to advance an explanation for a pattern of experimental results, not necessarily at a quantitative, but, at least in the past, often at a rather qualitative level (but see [1,2]). For instance, a classical observation in prefrontal cortex neurophysiology is that single cells recorded in vivo appear to hop from a low-firing into a high-firing rate state during the delay period of a working memory task, when a specific item has to be retained in short-term memory to guide subsequent responding [3]. A ‘classical’ account for this observation is that the underlying network is a multi-stable dynamical system where the single neuron ‘hopping’ is a consequence of the network switching between different stimulus-selective attractor states (e.g. [4]). Although these models are often loosely adapted to capture key aspects (or moments) of the data, like the mean spiking rate and its coefficient of variation, their parameters are not estimated in a principled or systematic manner to capture the full data distribution (although fitting by least squares, without explicitly specifying probability distributions, is sometimes used, e.g. [1,5]). They serve to provide an explanation for a key observation, not necessarily to explain all variation in a specific data set. Computational models are often complicated, highly nonlinear and with a large number of parameters.

Both approaches are obviously justified in their own right, and both (statistics in particular) are anchored in their own long-standing research traditions. Here we will argue that a lot could be gained by merging them (see also [6]).
It is emphasized that this is not, per se, a new idea: statistical estimation of computational process models has a longer history in various fields of the life sciences, like ecology (e.g. [7]) or biochemistry [8], and, somewhat more recently, also in some areas of the neuro- and behavioral sciences (see below). In neuroscience, however, it is not yet a widespread idea, and it is still associated with many open issues.

Current Opinion in Behavioral Sciences 2016, 11:93–99

Integrating computational models into a statistical framework

As with comparatively simple statistical models, computational models can be augmented with probability assumptions that allow for principled inference by maximum likelihood or Bayesian approaches. Some of these may follow naturally from the type of data, as for instance if the model produces as its output binary behavioral choices (e.g., correct vs. incorrect) or spike counts, which follow a Bernoulli process and may be captured by a binomial or a Poisson distribution. In other cases, the Gaussian distribution might be a reasonable choice. A more challenging aspect with computational models, often referred to as generative models in this context [9], is that these commonly comprise latent (hidden) states not directly instantiated through observed data, often embedded within nonlinear functional relationships. To make the discussion more concrete, consider process (generative) models of the general form (Figure 1, [6,10,11]):

$$p(y_t \mid z_t) = g_\theta(h(z_t)), \qquad z_t = f_\lambda(z_{t-1}, u_t, \epsilon_t), \qquad (1)$$
where $Y = \{y_t\}$ is a (vector) time series of observed outputs (like behavioral responses) with probability distribution $g_\theta$, which depends solely on the underlying unobserved states $z_t$ at time $t$ and on parameters $\theta$. The link function $h^{-1}$ connects the hidden process to natural parameters of the distribution $g$ (usually its mean). The hidden states $z_t$ themselves form a dynamical system (which may be given through difference equations, as here, or through differential equations; see footnote 2), where $f_\lambda$ is the (potentially nonlinear) transition function, parameterized through $\lambda$ and affected by known (fixed) inputs $u_t$ (e.g. experimental conditions or stimuli), and a stochastic factor $\epsilon_t$, known as process noise (Figure 1). Process noise may either reflect unknowns in the specific form, parameter space, inputs, or other factors, of the underlying dynamical system, or it may capture known biological noise sources (such as probabilistic synaptic release; [12]) that might come with a computational purpose (e.g., escaping from local optima [13] or sampling from probability distributions for inference [14]). The goal of statistical estimation would be to obtain, in the general case, both estimates for the unknown parameters $\{\theta, \lambda\}$ and the posterior distribution over the unobserved latent states $Z = \{z_t\}$ given the observed data series $\{y_t\}$ and regressors $\{u_t\}$. A simple statistical example is factor analysis, where the latent states (factors) $Z$ give rise to the observations $Y$ through a linear-Gaussian model.

Footnote 2: In continuous time, the latent process is described by a stochastic differential equation, $\dot{z}(t) = f_\lambda(z(t), u(t), \epsilon(t))$, where $\epsilon(t)$ specifies the process noise.

Embedding computational models in such a framework has a number of profound advantages (see also [6]):

1) Model parameters are not hand-tuned or guessed in some arbitrary or rough fashion, but obtained through a principled optimization approach using a well-defined criterion function (e.g., the probability or density of the data given the parameters, as in maximum likelihood).
2) The estimation comes with confidence intervals (or credibility intervals in a Bayesian approach) by virtue of the probability assumptions connected with the model. Thus we gain a quantitative sense of how much confidence we can put into the estimated model parameters.
3) We can directly test hypotheses about (the relevance of) specific model parameters and the computational processes associated with them, e.g. through likelihood ratio statistics [15]. Approaches like hierarchical Bayesian (mixed-effects) modeling additionally give us insight into the structure of parameter space itself and the form of the (prior) distribution of parameters across individuals [16].
4) Vice versa, through the tight formal link to the experimental observations, the models can more directly inform the experimental design such as to optimally de-correlate or disentangle specific model parameters.
5) We can also compare quite different computational models with respect to how well they can account for the observed data using principled means like likelihood-based information criteria [17] (e.g., the Akaike or the Bayesian Information Criterion), sampling- or cross-validation-based estimates of prediction error [18,19], or Bayesian model comparison [20,21,22] (see [23] in this issue, and [19,20,24] for a more in-depth review of model comparison). In these approaches, finding the model which minimizes out-of-sample prediction error may be seen as the ultimate target ([19]; see also below).
6) Procedures to obtain estimates of prediction error will also give us a specific idea about how much we might be over-fitting the data.
7) Perhaps most excitingly, the fact that computational models, unlike a purely statistical approach, include process assumptions about underlying latent states implies we have a means to look beyond the ‘data surface’, to gain insight into the mechanisms that may have produced the observed data, rather than just establishing the statistical significance of a pattern observed in the data.
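To make the generative form of Eq. (1) concrete, the following minimal sketch simulates one hypothetical instance of it: a scalar latent state with a nonlinear (tanh) transition, a step input, Gaussian process noise, and Poisson spike counts whose rate is obtained by exponentiating the latent state (i.e., a log link). The transition function, all parameter values, and the function names are illustrative assumptions, not taken from any of the cited models.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw a Poisson count with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate(inputs, a=0.9, b=0.5, noise_sd=0.1, baseline=0.5, seed=0):
    """Simulate z_t = f(z_{t-1}, u_t, eps_t) and y_t ~ Poisson(exp(baseline + z_t)).

    The tanh transition and all default values are hypothetical choices.
    """
    rng = random.Random(seed)
    z, states, counts = 0.0, [], []
    for u in inputs:
        eps = rng.gauss(0.0, noise_sd)        # process noise eps_t
        z = a * math.tanh(z) + b * u + eps    # nonlinear transition f_lambda
        rate = math.exp(baseline + z)         # h maps the latent state to a Poisson rate
        states.append(z)
        counts.append(sample_poisson(rate, rng))
    return states, counts

# Known input u_t: a step that is switched on for time steps 20..39.
inputs = [1.0 if 20 <= t < 40 else 0.0 for t in range(60)]
states, counts = simulate(inputs)
print("mean count while input is on: ", sum(counts[20:40]) / 20)
print("mean count while input is off:", (sum(counts[:20]) + sum(counts[40:])) / 40)
```

In an estimation setting only `inputs` and `counts` would be observed; `states` and the parameters `a`, `b`, `noise_sd`, and `baseline` are what statistical inference would try to recover.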
We may thus be able to infer, in a maximum likelihood or Bayesian sense, the dynamical process underlying the observed data. It would seem, then, that we can only win by placing computational models into a statistical context. But, of course, there are also caveats and limitations. To obtain the likelihood $p(Y \mid \theta, \lambda)$, we need to integrate the joint probability $p(Y, Z \mid \theta, \lambda)$ across the usually very high-dimensional hidden state path $Z = \{z_t\}$, using efficient algorithms like Expectation-Maximization [25] and the Kalman filter-smoother recursions [26,27] (see [28,29] for alternative approaches). These approaches were originally developed for linear models (i.e., with $h$ and $f$ being linear functions) and Gaussian assumptions on both the outputs ($g$) and the process noise ($\epsilon$), a statistical framework often referred to as ‘state space models’ [10]. However, linear models are very restricted in the types of dynamics they can produce, exhibiting either fixed point behavior or simple harmonic oscillations which are highly sensitive to noise (e.g. [6]). With nonlinear (dynamical) transition equations and/or non-Gaussian observations, on the other hand, only approximate analytical or comparatively time-consuming, sampling-based numerical solutions are usually feasible (e.g. [17,30]). The complexity of computational models, both in terms of their state space dimensionality and their numbers of parameters, thus becomes much more of a burden than in conventional computational modeling. Problems with parameter identifiability frequently ensue [31,32], which may require additional measures like including penalty terms and constraints in the optimization process [33]. Hence, currently, considerable resources in terms of computing time and expertise for setting up, running, and evaluating these models are often required.

Figure 1. Graphical representation of state-space (generative) models. (a) Latent variable (state-space, generative) model for sequential data. Open white circles refer to the generating latent process and parameters, open gray circles to observables, and the black node to known inputs. (b) In this state space example, the input $\{u_t\}$ is a sequence of either rewarding (cherry) or aversive (lime) stimuli, the latent variable $\{z_t\}$ represents stimulus values that are learned over time, and the observed variable $\{y_t\}$ corresponds to neural spike trains that encode the latent values.
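For the linear-Gaussian special case, the integration over the hidden state path can be carried out exactly. The sketch below assumes a hypothetical scalar state space model, $z_t = a z_{t-1} + \epsilon_t$ with $y_t = z_t + \eta_t$, and uses the Kalman filter's one-step prediction errors to accumulate $\log p(Y \mid a, q, r)$. The maximum likelihood estimate of $a$ is then found by a crude grid search, with the noise variances treated as known for simplicity; all names and values are illustrative.

```python
import math
import random

def simulate_lgssm(a, q, r, n, seed=0):
    """Simulate z_t = a*z_{t-1} + eps_t, y_t = z_t + eta_t with variances q and r."""
    rng = random.Random(seed)
    z, ys = 0.0, []                      # known initial state z_0 = 0
    for _ in range(n):
        z = a * z + rng.gauss(0.0, math.sqrt(q))
        ys.append(z + rng.gauss(0.0, math.sqrt(r)))
    return ys

def kalman_loglik(ys, a, q, r, m0=0.0, p0=0.0):
    """Log-likelihood log p(Y | a, q, r) with the latent path integrated out."""
    m, p, loglik = m0, p0, 0.0
    for y in ys:
        m_pred = a * m                   # predicted state mean
        p_pred = a * a * p + q           # predicted state variance
        s = p_pred + r                   # variance of the one-step prediction of y_t
        resid = y - m_pred
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + resid * resid / s)
        gain = p_pred / s                # Kalman gain
        m = m_pred + gain * resid        # filtered state mean
        p = (1.0 - gain) * p_pred        # filtered state variance
    return loglik

ys = simulate_lgssm(a=0.8, q=0.5, r=0.5, n=500)
grid = [i / 20 for i in range(-19, 20)]  # candidate values of a in (-1, 1)
a_hat = max(grid, key=lambda a: kalman_loglik(ys, a, 0.5, 0.5))
print("maximum likelihood estimate of a on the grid:", a_hat)
```

A grid search is used here only to keep the example short; in practice the parameters would typically be estimated jointly, for example by Expectation-Maximization or gradient-based optimization.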

Behavioral computational-statistical models

For behavioral computational models, statistical estimation has received growing interest, especially within the past decade, due to rapid advancements and increasing availability of model estimation and selection techniques [20,21,22,34]. We focus here on examples from arguably the two most influential classes of models: reinforcement and belief learning models on the one hand, and sequential sampling models for decision making on the other.

Reinforcement learning (RL) models learn values for state-action pairs from repeated experience based on reward prediction errors, that is, the differences between expected and actually received rewards [35]. Based on these learnt (iteratively updated) (state, action)-values they, in general, select among two or more available behavioral options according to a probabilistic (e.g. Boltzmann-type) choice function. The choice function therefore constitutes a natural link ($h$ in Eq. (1)) from the RL process to the probability parameters of a bi- or multinomial output distribution ($g$ in Eq. (1)). The (state, action)-values may be seen as the underlying latent variables $\{z_t\}$ in Eq. (1), which, in the simplest case, follow a linear deterministic updating process (i.e., no process noise in Eq. (1)). Belief-learning models are similar (in fact they may encompass RL models as a special case [36,37]), but assume in addition that learning also occurs for non-chosen actions and fictive rewards. They have been applied mainly to study learning in the context of social decision making and game theory, i.e. when personal values and actions depend on the observed actions and inferred values of others. Since the latent process is usually deterministic in these models (but see [38–40]), parameter recovery through maximization of a closed-form likelihood under the assumption of a Markov Decision Process is relatively straightforward. The benefit that this type of approach has brought to neuroscience is the ability to infer, and formally and quantitatively characterize, a variety of underlying psychological processes from observed behavioral responses, including computational parameters that control the rate of learning [41], the exploration-exploitation tradeoff [42], reward sensitivity [43], or memory [44], to name but a few. In combination with model selection techniques [20,21,22], they provide the means to explicitly test and disentangle hypotheses on specific alterations of these processes among subject groups in mechanistic terms, e.g. with respect to mental disorders [43–45], or to study the neural implementation of different learning algorithms (e.g. [46–49]).

The drift diffusion model (DDM) [50,51], as an instance of sequential sampling models, is another example of a highly successful cognitive process model which has been employed as a ‘statistical tool’ to elucidate basic cognitive processes underlying (2-choice) decision making under temporal constraints [52–54]. DDMs, usually formulated in continuous time, perform a noisy integration of relative evidence through a latent state variable $z(t)$ (see Eq. (1) and footnote 2), driven by a constant drift term (which embodies the relative strength of evidence) and a (usually Gaussian/Wiener) diffusion noise process $\epsilon(t)$. A binary choice is emitted once $z(t)$ crosses one of two decision boundaries. The drift rate, a time related to all non-decision (like perceptual or motor) processes, and an a priori bias, all modeled as random variables, along with the parameter setting the distance between decision boundaries, determine the pattern of choices and trial-by-trial reaction times $\{y_t\}$. DDMs, unlike conventional statistical approaches, thereby take into account trial-to-trial variations and the full (typically non-Gaussian) reaction time distributions for correct vs. incorrect choices, to infer underlying information processing components. As discussed in the previous section, the process noise in the evidence accumulation complicates statistical estimation of these models, but a variety of solutions exist [55–57], together with publicly available code [16,57,58].
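The evidence accumulation process of a DDM can be sketched by a simple Euler-Maruyama simulation. In this simplified, hypothetical version the drift rate, non-decision time, and starting point are held fixed across trials (rather than being treated as random variables), and the two boundaries sit symmetrically at plus and minus `boundary`, so that their distance is twice that value. Function and parameter names are assumptions made for illustration.

```python
import math
import random

def simulate_ddm_trial(drift, boundary, non_decision, rng,
                       noise_sd=1.0, dt=0.002, max_time=10.0):
    """Simulate one trial by Euler-Maruyama integration.

    Returns (choice, reaction_time), with choice 1 for the upper boundary and
    0 for the lower boundary, or (None, None) if no boundary is reached.
    """
    z, t = 0.0, 0.0                      # unbiased start midway between the boundaries
    step_sd = noise_sd * math.sqrt(dt)   # a Wiener increment over dt has variance dt
    while t < max_time:
        z += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
        if z >= boundary:
            return 1, t + non_decision
        if z <= -boundary:
            return 0, t + non_decision
    return None, None

rng = random.Random(1)
trials = [simulate_ddm_trial(1.0, 1.0, 0.3, rng) for _ in range(200)]
finished = [(c, rt) for c, rt in trials if c is not None]
p_upper = sum(c for c, _ in finished) / len(finished)
mean_rt = sum(rt for _, rt in finished) / len(finished)
print(f"P(upper) = {p_upper:.2f}, mean RT = {mean_rt:.2f} s")
```

Simulation of this kind only generates choices and reaction times; estimating the parameters from observed data additionally requires the first-passage time densities or one of the approximate methods referred to above.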

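For the reinforcement learning models described at the start of this section, the closed-form likelihood of a deterministic latent process can be sketched as follows. The example assumes a hypothetical two-armed bandit task, a simple delta-rule value update with learning rate `alpha`, and a Boltzmann choice function with inverse temperature `beta`; choices are first simulated and the two parameters are then recovered by maximum likelihood on a coarse grid. All names and values are illustrative.

```python
import math
import random

def choice_prob(q, beta):
    """Boltzmann (softmax) probability of choosing action 1 over action 0."""
    return 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))

def simulate_agent(alpha, beta, reward_probs, n_trials, seed=0):
    """Generate choices and rewards from a delta-rule learner on a two-armed bandit."""
    rng = random.Random(seed)
    q, choices, rewards = [0.0, 0.0], [], []
    for _ in range(n_trials):
        c = 1 if rng.random() < choice_prob(q, beta) else 0
        r = 1.0 if rng.random() < reward_probs[c] else 0.0
        q[c] += alpha * (r - q[c])       # update driven by the reward prediction error
        choices.append(c)
        rewards.append(r)
    return choices, rewards

def neg_log_lik(alpha, beta, choices, rewards):
    """Negative Bernoulli log-likelihood of the observed choices."""
    q, nll = [0.0, 0.0], 0.0
    for c, r in zip(choices, rewards):
        p1 = choice_prob(q, beta)
        nll -= math.log(p1 if c == 1 else 1.0 - p1)
        q[c] += alpha * (r - q[c])       # latent values evolve deterministically
    return nll

choices, rewards = simulate_agent(alpha=0.3, beta=3.0,
                                  reward_probs=(0.2, 0.8), n_trials=300)
grid = [(a / 20, 0.5 * b) for a in range(1, 20) for b in range(1, 21)]
alpha_hat, beta_hat = min(grid, key=lambda p: neg_log_lik(p[0], p[1], choices, rewards))
print("estimated alpha, beta:", alpha_hat, beta_hat)
```

Because the latent values are a deterministic function of the parameters and the observed history, no integration over hidden states is needed here, which is what makes this class of models comparatively easy to fit.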
Neural computational-statistical models

For neural systems, broadly, models have been formulated at two levels: either 1) neural recordings in the form of spike trains or neuroimaging data are used to estimate an abstract (network-level) representation of the underlying latent dynamics [17,59,60], connectivity or biophysical parameters [61], or for decoding stimulus features [62,63]; or 2) biophysically more detailed spiking single neuron models such as integrate-and-fire-like [64–67] or Hodgkin-Huxley-like [30,68] models are estimated from spike train or membrane potential recordings. Spike train observations represent point process or count data (if binned), such that a Poisson distribution for $g$ in Eq. (1) is a natural assumption, often coupled to a spike intensity or rate produced by the latent dynamics through a log-link function $h^{-1}$, while for the latent dynamics itself, additive Gaussian noise $\epsilon \sim N(0, \Sigma)$ is commonly assumed [17,29,59,62,63]. One aim with these models may be, for instance, to find a low-dimensional, latent recurrent neural network dynamic representation of high-dimensional spike train observations [59]. Another example is biophysically-anchored latent models of human MEG data to elucidate properties of synaptic transmission, capture pharmacological manipulations and predict behavioral responses [60]. With continuously-valued observations, like field potentials (EEG) or membrane potential recordings, Gaussian assumptions for the output distribution $g$ (Eq. (1)) may be more appropriate. On this basis, biophysically more detailed models have been used, for instance, to systematically infer ion channel and synaptic parameters of single neurons [30,68,69].

However, the large number of parameters associated with biophysically detailed neuron models makes it difficult to extend this approach to biophysical networks beyond a handful of neurons [70] (sometimes the hurdles here may be more on the computational side [18], however, as for biophysical models the available data are usually also more numerous and precise, in terms of spatio-temporal resolution and noise levels, than, e.g., for behavioral models). An alternative approach for making a link to the network level therefore is to start from stochastic differential equations for biophysically realistic, yet still simplified, single neuron models (e.g. [5,67]), and to translate these into Fokker-Planck equations for describing the mean-field dynamics of populations of such neurons [71]. Fokker-Planck equations are partial differential equations which, put in this context, describe the evolution of the membrane potential’s probability density (and potentially that of other single neuron variables like adaptation currents) [64,66,72,73], and thus can be used to probabilistically characterize, in the mean-field sense, the latent state evolution, i.e. $\partial p(z(t))/\partial t$, of an entire population derived from biophysical single neuron models. The underlying state distribution may then be converted into a spike density or rate [64,73] that provides a link to series of observed spike times or counts.
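As a worked illustration of this step, consider the simplest case of a one-dimensional latent variable driven by additive Gaussian white noise of constant amplitude $\sigma$ (an assumption made here for brevity; the cited population models are more elaborate). The textbook Fokker-Planck equation for the density $p(z,t)$ then reads as below, and for an integrate-and-fire-type neuron with an absorbing spike threshold, denoted here by the hypothetical symbol $z_{\mathrm{th}}$, the population rate $\nu(t)$ is given by the probability flux through that threshold.

```latex
% Assumed single-neuron dynamics (one-dimensional, additive white noise):
dz(t) = f_\lambda\bigl(z(t)\bigr)\,dt + \sigma\,dW(t)

% Fokker-Planck equation for the probability density p(z,t):
\frac{\partial p(z,t)}{\partial t}
  = -\frac{\partial}{\partial z}\Bigl[ f_\lambda(z)\,p(z,t) \Bigr]
    + \frac{\sigma^{2}}{2}\,\frac{\partial^{2} p(z,t)}{\partial z^{2}}

% Population rate as probability flux through an absorbing threshold,
% where p(z_th, t) = 0:
\nu(t) = -\frac{\sigma^{2}}{2}\,
         \left.\frac{\partial p(z,t)}{\partial z}\right|_{z = z_{\mathrm{th}}}
```

The first term on the right-hand side transports probability mass along the deterministic flow $f_\lambda$, while the second term spreads it out through diffusion.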

Future directions

There are several areas in this field that need further attention. First, we still need to find efficient ways of dealing with larger-scale models comprising very many parameters and high-dimensional state spaces. One possibility is hierarchical, stepwise approaches. For instance, single neuron parameters of cells in a biophysical network model may first be estimated from in vitro electrophysiological recordings and then fixed [5], and similarly for the properties (conductances, time constants, etc.) of synaptic currents. At the next, network level, statistics derived from in vivo electrophysiological measurements and anatomical studies may then be used to infer connectivity parameters of the model (cf. [2]). Often we may have to combine data from qualitatively quite different sources (e.g., anatomical and physiological; [1]) to sufficiently constrain the model. More generally, joint estimation of data at different levels, specifically neural and behavioral [74], generated by the same underlying model, may be a powerful way forward for linking different scales. Sometimes it may be possible to lump parameters in a physiologically reasonable way (as with simplified spiking neuron models, e.g. [5,75,76]), or parameter distributions may be defined which are governed by a much smaller set of (meta-)parameters that one attempts to estimate. For instance, we may not need to estimate the strength of each synaptic connection in a network model, but just parameters of their distribution.

Another issue is the criteria for model selection and comparison (see also [23]). Good out-of-sample prediction in the statistical sense, as e.g. estimated by cross-validation, may not be enough to guarantee we are dealing with the right model mechanism [18]. This is, partly (see also discussion in [23]), because it only probes prediction within the same data domain, i.e. with respect to new observations drawn from the same statistical distribution that also underlies model estimation. We may demand, however, that a good computational model should also predict new observations within regions of data space that were not assessed through the initial experiments, call it ‘out-of-domain’ rather than just ‘out-of-sample’ predictions (e.g. [5,18]). For instance, we may want a model inferred solely from physiological data to also make good behavioral predictions, or a model inferred from one cognitive task to also predict behavior on a different cognitive task. These are still very much underexplored theoretical issues that, we think, need to be addressed to take this whole approach to the next level.

Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). New York: Springer-Verlag; 2006.
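The distinction between out-of-sample and out-of-domain prediction can be illustrated with a deliberately simple toy example. In this sketch, data are generated by a hypothetical saturating (tanh) mechanism, a straight line is fitted by least squares to inputs from a narrow range, and prediction error is then evaluated on new data from the same range and on data from a range never seen during fitting. The data-generating function, ranges, and names are all assumptions chosen for illustration.

```python
import math
import random

def make_data(lo, hi, n, rng, noise_sd=0.05):
    """Sample inputs uniformly in [lo, hi] and pass them through a saturating mechanism."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [math.tanh(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(params, xs, ys):
    """Mean squared prediction error of the fitted line."""
    slope, intercept = params
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(-0.5, 0.5, 100, rng)
params = fit_line(train_x, train_y)

in_domain_error = mse(params, *make_data(-0.5, 0.5, 100, rng))   # same input range
out_domain_error = mse(params, *make_data(2.0, 3.0, 100, rng))   # unseen input range
print("out-of-sample, in-domain MSE:", in_domain_error)
print("out-of-domain MSE:           ", out_domain_error)
```

The linear model predicts held-out data from the training range well, yet fails once the inputs leave that range, because it does not capture the saturating mechanism that generated the data.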


