LETTER

Communicated by Robert Kass

Nonparametric Modeling of Neural Point Processes via Stochastic Gradient Boosting Regression

Wilson Truccolo
[email protected]

John P. Donoghue∗
[email protected]

Neuroscience Department, Brown University, Providence, RI 02912, U.S.A.

Statistical nonparametric modeling tools that enable the discovery and approximation of functional forms (e.g., tuning functions) relating neural spiking activity to relevant covariates are desirable in neuroscience. In this article, we show how stochastic gradient boosting regression can be successfully extended to the modeling of spiking activity data while preserving their point process nature, thus providing a robust nonparametric modeling tool. We formulate stochastic gradient boosting in terms of approximating the conditional intensity function of a point process in discrete time and use the standard likelihood of the process to derive the loss function for the approximation problem. To illustrate the approach, we apply the algorithm to the modeling of primary motor and parietal spiking activity as a function of spiking history and kinematics during a two-dimensional reaching task. Model selection, goodness of fit via the time-rescaling theorem, model interpretation via partial dependence plots, ranking of covariates according to their relative importance, and prediction of peri-event time histograms are illustrated and discussed. Additionally, we use the tenfold cross-validated log likelihood of the modeled neural processes (67 cells) to compare the performance of gradient boosting regression to two alternative approaches: standard generalized linear models (GLMs) and Bayesian P-splines with Markov chain Monte Carlo (MCMC) sampling. In our data set, gradient boosting outperformed both Bayesian P-splines (in approximately 90% of the cells) and GLMs (100%). Because of its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf nonparametric tool for initial analyses of large neural data sets (e.g., more than 50 cells; more than 10⁵ samples per cell) with corresponding multidimensional covariate spaces (e.g., more than four covariates). In the cases where a

∗ J. P. Donoghue is a cofounder and shareholder in Cyberkinetics, Inc., a neurotechnology company that is developing neural prosthetic devices.

Neural Computation 19, 672–705 (2007)


functional form might be amenable to a more compact representation, gradient boosting might also lead to the discovery of simpler, parametric models.

1 Introduction

Modeling neural spiking activity as a function of covariates, such as spiking history, neural ensemble activity, stimuli, and behavior, is a developing field in neuroscience, with many challenges and problems yet to be overcome. Recent work has provided a solid formulation for the parametric modeling of spiking activity, while preserving its point process nature, based on maximum likelihood analysis (Brillinger, 1988; Chornoboy, Schramm, & Karr, 1988; Brown, Barbieri, Eden, & Frank, 2003; Paninski, 2004; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Okatan, Wilson, & Brown, 2005). However, there still remains a need for new statistical tools that enable the discovery and approximation of functional forms relating spiking activity and covariates, rather than simply imposing a priori parametric forms. Previous attempts to address this challenge have relied on cardinal cubic spline modeling (Frank, Eden, Solo, Wilson, & Brown, 2002), Zernike polynomials (Barbieri et al., 2002), smoothing splines in the context of generalized additive models (Kass & Ventura, 2001), and Bayesian free-knot spline smoothing (DiMatteo, Genovese, & Kass, 2001; Kaufman, Ventura, & Kass, 2005). While these attempts have successfully generated new insights in their applications, there is still a need for new approaches that combine high fitting performance with computational efficiency in analyses of large neural data sets.
For example, motor neurophysiology studies using multiple multielectrode arrays now permit the simultaneous recording of hundreds of cells, with often more than 10⁵ samples per cell, and at the same time generate measurements of behaviorally relevant high-dimensional covariates, including arm kinematics in joint space, joint torques, and electromyograms (EMGs) (Hatsopoulos, Joshi, & O'Leary, 2004; Truccolo, Vargas, Philip, & Donoghue, 2005). Off-the-shelf, computationally efficient modeling algorithms that require a minimum amount of fine-tuning, while at the same time maintaining good fit, are required for initial analyses of these data sets. Here, we approach this challenge from a different perspective. Recent developments in the statistical and machine learning communities have provided powerful new tools for nonparametric modeling (Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998). In the particular case of interest here, this progress originated from the insight gained by relating the nonparametric regression problem to numerical function approximation via gradient descent methods (Breiman, 1997; Friedman, 1999). This insight has led to the development of stochastic gradient boosting regression (Friedman, 2001, 2002). Rather than optimizing in parameter space, as performed in parametric modeling, the approximation problem is formulated


in terms of optimization in function space. Intuitively, stochastic gradient boosting can be seen as a regularized fitting of the residuals in a gradual, iterative fashion. This approach has provided both a new theoretical perspective for understanding nonparametric modeling algorithms such as ensembles of experts and other boosting algorithms,¹ and a general off-the-shelf regression tool (Hastie et al., 2001). Here, we show how stochastic gradient boosting can be successfully extended to the statistical modeling of neural point processes. We introduce the neural point process modeling problem in the next section. Succinctly, a point process joint probability can be completely specified by its conditional intensity function. The modeling problem can thus be expressed in terms of approximating the conditional intensity of the process as a function of the neuron's own spiking history and other covariates of interest. We then formulate stochastic gradient boosting in the context of conditional intensity function approximation. Because stochastic gradient boosting allows arbitrary (differentiable) loss functions, our application uses the standard likelihood function for general point processes to derive the loss function for the conditional intensity optimization problem. To illustrate the theory, we apply the algorithm to spike train data recorded from primary motor (M1) and parietal (5d) cortices in a behaving monkey performing a standard center-out reaching task. Goodness-of-fit assessment of the function approximation is done by applying the time-rescaling theorem (Brown et al., 2001). Additional tools for model interpretation and selection are also discussed and illustrated. Finally, we perform an extensive

¹ Most current investigations of boosting approaches originated with the AdaBoost algorithm proposed by Freund and Schapire (1995) for classification problems.
The relations between AdaBoost and additive logistic regression were first established by Friedman, Hastie, and Tibshirani (2000). Relations between boosting, maximum likelihood estimation, and Kullback-Leibler divergence were explored by Lebanon and Lafferty (2002) in the context of exponential family models. Breiman (1997, 1999), Friedman (1999, 2001), and Mason, Baxter, Bartlett, and Frean (1999) were perhaps the first to formulate boosting in terms of greedy function approximation via gradient descent and to extend it to several types of loss functions. Consistency analyses of boosting algorithms have been presented by Lugosi and Vayatis (2004) and Zhang (2004). Of particular interest here is the result in Rosset, Zhu, and Hastie (2004) showing that boosting trees approximately (and in some cases exactly) minimize the loss criterion with an L1-norm constraint on the estimated parameter vector. Relations between boosting, L1-constrained regression methods (e.g., the lasso), and sparse representations have also been discussed in Friedman, Hastie, Rosset, Tibshirani, and Zhu (2004) and in Hastie et al. (2001). Additionally, an interpretation of boosting integrating its formulation in terms of both gradient function optimization and margin maximization (Schapire, Freund, Bartlett, & Lee, 1998) has been provided by Rosset et al. (2004). In comparison to other nonparametric approaches, stochastic gradient boosting has been shown (Friedman, 2001) to outperform multivariate adaptive regression splines (MARS; Friedman, 1991). In our exposition and approach to modeling neural point processes (section 3), we have followed Friedman's statistical perspective.


comparison between stochastic gradient boosting regression and two alternative approaches: standard log-linear models in the parameters (GLMs) and Bayesian P-splines fitted via Markov chain Monte Carlo (MCMC) sampling. We use the tenfold cross-validated log likelihoods of the neural point processes as the criterion for the performance comparison on a 67-cell data set.

2 Neural Point Processes: Stochastic Formulation

A neural point process can be completely characterized by its conditional intensity function (CIF; Daley & Vere-Jones, 2003), defined as

λ(t | x(t)) = lim_{Δ→0} P[N(t + Δ) − N(t) = 1 | x(t)] / Δ,   (2.1)

where P[· | ·] is a conditional probability and N(t) denotes the number of spikes counted in the time interval (0, t] for t ∈ (0, T]. The term x(t) represents covariates of interest: the spiking history up to (but not including) time t and stimulus or behavioral covariates at varied time lags with respect to the spiking activity. The CIF is a strictly positive function that gives a history-dependent generalization of the rate function of a Poisson process. From equation 2.1 and for small Δ, λ(t | x(t))Δ gives approximately the neuron's spiking probability in the time interval (t, t + Δ].

A discrete-time representation of the point process is often preferable. To obtain this representation, we choose a large integer K and partition the observation interval (0, T] into K subintervals {(t_{k−1}, t_k]}_{k=1}^{K}, each of length Δ = T K^{−1}. We choose K large enough that there is at most one spike per subinterval. Let N_k = N(t_k), N_{1:k} = N_{0:t_k}, and x_k = x(t_k). Because K is large, the differences ΔN_k = N_k − N_{k−1} form a binary time series of zero and one counts, which corresponds to the usual representation of spike trains. For any discrete-time neural point process, the joint probability density of the spike train, that is, the joint probability density of the J spike times 0 < u_1 < u_2 < · · · < u_J ≤ T in (0, T] conditioned on history and covariates, can be expressed as (Daley & Vere-Jones, 2003; Truccolo, Eden et al., 2005)

P(ΔN_{1:K} | x_{1:K}) = exp{ Σ_{k=1}^{K} ΔN_k log[λ(t_k | x_k)Δ] − Σ_{k=1}^{K} λ(t_k | x_k)Δ } + o(Δ^J).   (2.2)
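In this discrete-time form, the log of equation 2.2 is just two sums over bins, which makes the likelihood cheap to evaluate. A minimal sketch in plain Python (the toy spike train and constant intensity are illustrative only; function names are ours):

```python
import math

def point_process_loglik(dN, lam, delta):
    """Discrete-time point-process log likelihood (log of equation 2.2):
    sum_k dN_k * log(lambda_k * delta) - sum_k lambda_k * delta."""
    ll = 0.0
    for n, l in zip(dN, lam):
        ll += n * math.log(l * delta) - l * delta
    return ll

# Toy example: five 1 ms bins, constant intensity of 20 spikes/s, one spike.
dN = [0, 0, 1, 0, 0]
lam = [20.0] * 5
print(point_process_loglik(dN, lam, 0.001))  # log(0.02) - 0.1
```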


The likelihood for the point process is readily available from equation 2.2.² This probability distribution belongs to the exponential family, with the natural parameter given by

θ_k = log[λ(t_k | x_k)Δ] = F(x_k).   (2.3)

The problem we address here is the approximation of F(x) above. We will present a new approach to this problem, which consists of treating this approximation as a regularized optimization in function space and using stochastic gradient boosting regression to solve it.

3 Optimization in Function Space and Stochastic Gradient Boosting

Consider the random variables (y, x). In general, y and x = [x_1, . . . , x_p] can be any continuous, discrete, or mixed set of variables. In the specific case of a discrete-time point process, y is a binary sequence. The target function to be approximated is F*(x), which relates the covariate vector x to y while minimizing the expected value of some arbitrary (differentiable) loss function L(y, F(x)) over the joint distribution of all (y, x) variables, that is,

F* = arg min_F E_{y,x}[L(y, F(x))] = arg min_F E_x[ E_y(L(y, F(x))) | x ].   (3.1)

This is an optimization problem in function space. In other words, we consider F(x) evaluated at each point x to be a "parameter" and seek to minimize

E_y[L(y, F(x)) | x]   (3.2)

at each individual x directly with respect to F(x), observing that pointwise minimization of equation 3.2 is equivalent to minimization of equation 3.1. Numerical optimization methods are often required to solve this type of problem. Commonly they involve an iterative procedure where the approximation to F*(x) takes the form

F*(x) ≈ F̂(x) = Σ_{m=0}^{M} f_m(x),   (3.3)

2 In section 7, our data consist of multiple realizations or trials of the neural point process. The extension of equation 2.2 to reflect multiple realizations is straightforward. However, for notational simplicity, we leave the existence of multiple realizations implicit.


where f_0(x) is an initial guess and {f_m(x)}_{m=1}^{M} are successive increments (steps or boosts), each dependent on the preceding steps. In the particular case of function approximation via steepest-descent gradient, the mth step has a (steepest-descent) direction defined by the negative gradient vector g_m(x) of the loss function with respect to F(x) and a magnitude ρ_m obtained via line search. Given finite data samples {y_k, x_k}_{k=1}^{K}, smoothing is required to improve generalization of the approximation over nonsampled covariate values. This is achieved by introducing a parametric model at each of the successive steps, such that the approximation in equation 3.3 becomes

F̂(x) = F̂_0(x) + Σ_{m=1}^{M} β_m h(x; a_m),   (3.4)

where h(x) is a regressor or base learner, β_m and a_m = {a_1, a_2, . . .} are parameters, and F̂_0(x) is a constant. A "greedy stagewise" approach for solving equation 3.1 under the assumptions in equations 3.3 and 3.4 results (see section A.1 in the appendix) in searching, at each iteration step, for the member of the parameterized class h(x; a_m) that produces the vector h_m = {h(x_k; a_m)}_{k=1}^{K} ∈ ℝ^K most parallel to the negative gradient vector g_m = {g_{mk}(x_k)}_{k=1}^{K} ∈ ℝ^K. For most choices of base learners, the most parallel vector can be obtained by solving the least-squares problem

a_m = arg min_{a,β} Σ_{k=1}^{K} [g_{mk}(x_k) − βh(x_k; a)]².   (3.5)

This gives the direction for the steepest-descent step. If needed, other minimization criteria can be used in place of equation 3.5. The magnitude of the step is obtained by the line search

ρ_m = arg min_ρ Σ_{k=1}^{K} L(y_k, F̂_{m−1}(x_k) + ρh(x_k; a_m)).   (3.6)

Therefore, the update rule for the approximation at each step becomes

F̂_m(x) = F̂_{m−1}(x) + ρ_m h(x; a_m).   (3.7)

Fitting the regressor to the negative gradient effectively works as a gradient smoothing. For the case of the squared loss function L(y, F(x)) = ½[y − F(x)]², the negative gradient is simply the usual residual y − F(x), and the procedure corresponds to residual fitting.
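The residual-fitting interpretation is easy to verify: differentiating ½[y − F(x)]² with respect to F(x) and negating gives back y − F(x). A two-line check in plain Python (function names are ours):

```python
def neg_gradient_squared_loss(y, F):
    # L = 0.5 * (y - F)^2  =>  -dL/dF = y - F, the ordinary residual
    return [yk - Fk for yk, Fk in zip(y, F)]

y = [1.0, 0.0, 2.0]
F = [0.5, 0.5, 0.5]
print(neg_gradient_squared_loss(y, F))  # [0.5, -0.5, 1.5]
```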


[Figure 1 about here: validation loss versus number of trees for α = 0.1, 0.01, and 0.001.]

Figure 1: Loss function dependence on the number of iterations (trees) and on the shrinkage parameter. The negative of the log likelihood for the point process (cell A) is computed on validation data for three different choices of the shrinkage value α and as a function of the number of trees in the approximation. The dotted line shows the value of the loss function computed on the training data for α = 0.001.

3.1 Regularization. In order to improve generalization and avoid overfitting, the approximation can be constrained both by the choice of the number of steps or iterations and by weighting the contribution of each base learner. A shrinkage strategy (Copas, 1983) is adopted for the weighting; that is, the update rule at each iteration becomes

F̂_m(x) = F̂_{m−1}(x) + α ρ_m h(x; a_m),  0 < α ≤ 1.   (3.8)

The actual value for α may depend on the data (see Figure 1), but usually small values around α = 0.001 are chosen. With a fixed shrinkage parameter, an estimate of the optimal number of iterations is obtained by cross-validation on a test data set, where we assess whether additional iteration steps significantly improve the minimization of the loss function. As we mentioned in section 1, consistency and convergence analyses have recently been provided by Lugosi and Vayatis (2004) and Zhang (2004). In addition, Rosset et al. (2004) have shown that boosting trees regression approximately (and in some cases exactly) minimizes its loss criterion with an L1-norm constraint on the estimated parameter vector. Some initial


discussion relating boosting, L1-constrained regression methods, and sparse representations has appeared in Friedman et al. (2004) and Hastie et al. (2001).

3.2 Variance Reduction via Stochastic Subsampling. Bootstrap subsampling has been shown to reduce the variance of iterative approximations such as those used in bagging procedures (Breiman, 1996, 1999). In a similar fashion, stochastic subsampling can be adopted (Friedman, 2002) to reduce the variance of gradient boosting estimates. Specifically, at each iteration, a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used to fit the base learner and compute the model update for the current iteration. Previous analysis suggests that a good subsample size is about half the size of the total available training data (Friedman, 2002).

3.3 Stochastic Gradient Boosting: General Algorithm. Under the general formulation given above, the stochastic gradient boosting algorithm can be expressed as:

1. Set the initial value F̂_0(x) = arg min_µ Σ_{k=1}^{K} L(y_k, µ).
2. For m = 1 to M do:
3. Select a random subsample of size K̂, {y_k̂, x_k̂}_{k̂=1}^{K̂}, without replacement.
4. Compute the components of the negative gradient vector from the subsample: g_{mk̂} = −[∂L(y_k̂, F(x_k̂)) / ∂F(x_k̂)]_{F(x)=F̂_{m−1}(x)}, for k̂ = 1, . . . , K̂.
5. Fit h(x; a) to the negative gradient by solving a_m = arg min_{a,β} Σ_{k̂=1}^{K̂} [g_{mk̂} − βh(x_k̂; a)]².
6. Compute the line search ρ_m = arg min_ρ Σ_{k̂=1}^{K̂} L(y_k̂, F̂_{m−1}(x_k̂) + ρh(x_k̂; a_m)).
7. Update F̂_m(x) = F̂_{m−1}(x) + αρ_m h(x; a_m).

3.4 Choosing the Base Learner: Regression Trees and Interaction Order in the Approximation. Thus far, we have not defined the type of parameterized model h(x; a). In principle, we could boost GLMs, spline models, or many other models. However, regression trees are commonly


chosen based on two motivations. First, approximation using an ensemble of regression trees is a simple yet powerful approach. In this case, each base learner h(x; a) is an R-terminal-node regression tree. At each mth iteration, a regression tree partitions the space of all joint values of the covariates x into R disjoint sets or regions {S_rm}_{r=1}^{R} and predicts a separate constant value in each of these regions, such that

h(x; a_m) = Σ_{r=1}^{R} c_rm 1{x ∈ S_rm},   (3.9)

where c_rm is the terminal node mean. Therefore, the parameter a_m specifies the splitting variables, the splitting locations that define the disjoint regions {S_rm}_{r=1}^{R}, and the terminal node means of the tree at the mth iteration. To fit a regression tree at each iteration, we need to decide on the splitting variables and split points, and also on what topology (shape) the tree should have. We follow the CART algorithm (Breiman, Friedman, Olshen, & Stone, 1984). In this case, the partition of the space is restricted to recursive binary partitions (i.e., at each node, we partition the current variable into two disjoint spaces: left and right of the splitting point). Binary splits are preferred over multiway splits because the latter tend to fragment the data too quickly, leaving insufficient data at the next level down. Additionally, multiway splits can be achieved in gradient boosting trees by a series of binary splits, each occurring in one of the iteration steps (Hastie et al., 2001). Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible, so a greedy algorithm is adopted. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

S_1(j, s) = {x | x_j ≤ s}  and  S_2(j, s) = {x | x_j > s}.   (3.10)

Then we seek the splitting variable j and split point s that solve

min_{j,s} [ min_{c_1} Σ_{x_k ∈ S_1(j,s)} (y_k − c_1)² + min_{c_2} Σ_{x_k ∈ S_2(j,s)} (y_k − c_2)² ].   (3.11)

For any choice of j and s, the inner minimization is solved by

ĉ_1 = ave(y_k | x_k ∈ S_1(j, s))  and  ĉ_2 = ave(y_k | x_k ∈ S_2(j, s)).   (3.12)

The computation of the terminal node mean depends on the distribution adopted for the data (see equation 4.4). For each splitting variable, the determination of the split point s can be done very quickly, and hence by scanning through all of the inputs, determination of the best pair (j, s)


is feasible. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions; this process is then repeated on all of the resulting regions. The second motivation for choosing regression trees is that, by controlling the number of terminal nodes, it is also possible to control the interaction order in the approximation. Usually, in order for F̂(x) to approximate the target F*(x) well, it is important to consider how close the interaction order of F̂(x) is required to be to that of F*(x). Interaction order can be thought of as follows. Suppose that the target function can be expressed in terms of the expansion

F*(x) = Σ_i F_i(x_i) + Σ_{ij} F_{ij}(x_i, x_j) + Σ_{ijk} F_{ijk}(x_i, x_j, x_k) + · · · .   (3.13)

A tree with splits on only one of the covariates falls in the first sum of equation 3.13, a tree with splits on two covariates falls in the second sum, and so on. The highest interaction order is limited by the number p of covariates, and, given the additive nature of the boosting procedure, a tree with R terminal nodes produces a function approximation with interaction order of at most min(R − 1, p). In practice, for a large number of covariates, a good approximation can often be obtained with a much lower order.³ Depending on the case, we can select the interaction order by testing on a single test data set or by an n-fold cross-validation procedure, or we can fix the order a priori to capture, for example, only second-order interactions among the covariates.

4 Loss Function for Point Process Modeling

In order to apply stochastic gradient boosting to neural point processes, we define the loss function at each iteration step to be the negative of the log likelihood of the point process. For a given CIF model in log scale, log[λ(t_k | x_k)Δ] = F̂(x_k), the loss function becomes

L_m({ΔN_k, F̂_{m−1}(x_k)}_{k=1}^{K}) = −Σ_{k=1}^{K} {ΔN_k F̂_{m−1}(x_k) − exp[F̂_{m−1}(x_k)]},   (4.1)

3 See Friedman (2001) for a discussion of the selection of the interaction order. Note also that because we are modeling the natural parameter of the point process distribution, log[λΔ], a model containing only the first term in equation 3.13 will already lead to multiplicative effects among the covariates.


where again K is the total number of subintervals or number of samples. The kth component of the negative gradient at each iteration is then

g_mk = −∂L_m(ΔN_k, F̂_{m−1}(x_k)) / ∂F̂_{m−1}(x_k) = ΔN_k − exp[F̂_{m−1}(x_k)].   (4.2)

The initial value is given by

F̂_0(x) = log[(1/K) Σ_{k=1}^{K} ΔN_k],   (4.3)

and the terminal node estimates in the regression tree are given by

c_rm = log{ Σ_{k=1}^{K} 1{x_k ∈ S_rm} ΔN_k / Σ_{k=1}^{K} 1{x_k ∈ S_rm} exp[F̂_{m−1}(x_k)] }.   (4.4)
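Putting sections 3.3 and 4 together, the whole procedure fits in a few dozen lines. The sketch below is a plain-Python illustration under our own simplifying assumptions, not the authors' implementation (which was in C): the base learners are stumps on a single covariate (R = 2, rather than the R = 9 trees used later), the toy data are deterministic and made up, and, as one simplification, the shrunken terminal-node constants of equation 4.4 are evaluated on the full training set so that each damped update cannot raise the training loss. The negative gradient (equation 4.2) and least-squares split search (equation 3.5) are computed on a random half subsample, as in section 3.2.

```python
import math
import random

def fit_stump(xs, g):
    """Least-squares split point for a one-split stump fit to gradient
    values g (equation 3.5 restricted to a single covariate)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_s = None, None
    for cut in range(1, len(xs)):
        left = [g[i] for i in order[:cut]]
        right = [g[i] for i in order[cut:]]
        # sum of squared errors around each side's mean
        sse = (sum(v * v for v in left) - sum(left) ** 2 / len(left)
               + sum(v * v for v in right) - sum(right) ** 2 / len(right))
        if best_sse is None or sse < best_sse:
            best_sse = sse
            best_s = 0.5 * (xs[order[cut - 1]] + xs[order[cut]])
    return best_s

def loss(dN, F):
    """Negative point-process log likelihood (equation 4.1)."""
    return -sum(n * f - math.exp(f) for n, f in zip(dN, F))

def boost(x, dN, M=50, alpha=0.1, seed=0):
    rng = random.Random(seed)
    K = len(x)
    F = [math.log(sum(dN) / K)] * K                 # initial value, equation 4.3
    for _ in range(M):
        sub = rng.sample(range(K), K // 2)          # stochastic subsampling
        g = [dN[i] - math.exp(F[i]) for i in sub]   # negative gradient, equation 4.2
        s = fit_stump([x[i] for i in sub], g)
        # shrunken terminal-node updates in the spirit of equation 4.4,
        # here evaluated on the full training data (our simplification)
        for side in ((lambda v: v <= s), (lambda v: v > s)):
            idx = [i for i in range(K) if side(x[i])]
            num = sum(dN[i] for i in idx)
            den = sum(math.exp(F[i]) for i in idx)
            if num > 0:
                c = alpha * math.log(num / den)
                for i in idx:
                    F[i] += c
    return F

# Deterministic toy "cell": one covariate on [-0.5, 0.5); the cell fires in
# every tenth bin when x > 0 and never when x <= 0.
x = [i / 400.0 - 0.5 for i in range(400)]
dN = [1 if (xi > 0 and i % 10 == 0) else 0 for i, xi in enumerate(x)]
F0 = [math.log(sum(dN) / len(dN))] * len(dN)
F = boost(x, dN)
print(loss(dN, F) < loss(dN, F0))  # training loss decreases from the constant fit
```

Because each terminal-node constant is the exact minimizer of the region's convex loss, scaled by α ∈ (0, 1], every update in this sketch is guaranteed not to increase the training loss.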

5 Goodness-of-Fit Analysis via the Time-Rescaling Theorem

Given spike train and covariate time series, we can approximate the CIF, λ̂(t | x(t))Δ = exp{F̂(x(t))}, using the stochastic gradient boosting approach just described. Assessing the goodness of fit of the approximation can be implemented by applying the time-rescaling theorem for point processes as follows (Brown et al., 2001; Papangelou, 1972; Girsanov, 1960). We start by computing the rescaled times z_j from the estimated conditional intensity and from the spike times as

z_j = 1 − exp( −∫_{u_j}^{u_{j+1}} λ̂(t | x(t)) dt ),   (5.1)

for j = 1, . . . , J − 1 spike times. The time-rescaling theorem states that the z_j s will be independent, uniformly distributed random variables on the interval [0, 1) if and only if the estimated CIF approximates well the true conditional intensity function of the process. Because the transformation in equation 5.1 is one-to-one, any statistical assessment of the agreement between the z_j s and a uniform distribution directly evaluates how well the original model agrees with the spike train data. Additional tests to check the independence of the rescaled times can be formulated (Truccolo, Eden et al., 2005) but are not employed here. A Kolmogorov-Smirnov (K-S) test can be constructed by ordering the z_j s from smallest to largest and then plotting the values of the cumulative distribution function of the uniform density, defined as b_j = (j − 1/2)/J for j = 1, . . . , J, against the ordered z_j s. We term these plots K-S plots. If the


model is correct, then the points should lie on a 45-degree line. Confidence bounds for the degree of agreement between a model and the data may be constructed using the distribution of the K-S statistic (Johnson & Kotz, 1970). For moderate to large sample sizes, the 95% confidence bounds are well approximated (Johnson & Kotz, 1970) by b_j ± 1.36 × J^{−1/2}.

6 Model Interpretation: Relative Importance and Partial Dependence Plots

6.1 Relative Importance of Individual Covariates. Given an approximated CIF, we would like to assess the relative importance or contribution of each covariate. This assessment is particularly important in more exploratory studies, where the set of covariates is large and little a priori knowledge is available about their individual relevance. A measure of relative importance should reflect the relative influence of a variable x_l on the variation of F̂(x). Assessing this influence via expected partial derivatives is not possible given the piecewise-constant approximations provided by regression trees. Instead, we use the relative importance measure of Breiman et al. (1984). We start by computing it for a single tree T_m = h(x, a_m) as

Î²_lm(T_m) = Σ_{r=1}^{R−1} î²_r 1(v(r) = l),   (6.1)

where the summation is over the R − 1 nonterminal nodes r of the R-terminal-node tree. The term v(r) gives the splitting variable associated with node r in the tree. The term î²_r is the corresponding empirical improvement in squared error achieved by splitting, at the point given by the splitting variable, a region S_r into left and right subregions (S_r−, S_r+); it is given by the weighted squared difference between the constants fit to the left and right subregions. The final relative importance measure with respect to the approximation F̂(x), including the entire collection of regression trees, is obtained by averaging equation 6.1 over all of the trees:

Î²_l = (1/M) Σ_{m=1}^{M} Î²_lm(T_m).   (6.2)
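For stumps (R = 2, a single nonterminal node per tree), the inner sum in equation 6.1 has one term, and equation 6.2 reduces to averaging each tree's squared-error improvement over the M trees, attributed to its splitting variable. A toy sketch in plain Python (the tree summaries and improvement values are made up for illustration):

```python
# Each fitted stump is summarized by its splitting variable v(r) and the
# empirical squared-error improvement achieved at its single split.
trees = [
    {"split_var": 0, "improvement_sq": 4.0},
    {"split_var": 1, "improvement_sq": 1.0},
    {"split_var": 0, "improvement_sq": 2.0},
]

def relative_importance(trees, n_covariates):
    """Equation 6.2: average the per-tree contributions (equation 6.1)."""
    M = len(trees)
    importance = [0.0] * n_covariates
    for tree in trees:
        importance[tree["split_var"]] += tree["improvement_sq"] / M
    return importance

print(relative_importance(trees, 2))  # variable 0 dominates
```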

6.2 Partial Dependence Plots. Beyond assessing the relative importance of individual covariates, we would like to gain insight into the functional form relating response variable and covariates. That might be done by graphically plotting the partial dependence of the approximation on a


small subset of the covariates z_l = {z_1, . . . , z_l} ⊂ {x_1, . . . , x_p}. This partial dependence can be formulated in terms of the expected value of F̂(x) with respect to the probability of the complement subset z_\l (Friedman, 2001; see also section A.2 of the appendix). An empirical approximation to this partial dependence is straightforward to obtain in the case of regression trees based on single-variable splits. In this case, the partial dependence of F̂(x) on a specified target variable subset z_l is evaluated given only the tree, without reference to the data. For a specified set of values for the variables z_l, a weighted traversal of the tree is performed. At the root of the tree, a weight of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset z_l, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of z_\l, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node. Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the weighted average of the F̂(x) values over the visited terminal nodes provides an empirical estimate of the expected value of F̂(x) with respect to the complement subset. For a collection of M regression trees, the results from the individual trees are simply averaged.
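For stumps, the weighted traversal collapses to a single decision: if the stump splits on the target variable, follow the indicated branch; otherwise average the two leaf values with the training fractions as weights. A toy sketch in plain Python (the stump encoding and all numbers are ours, for illustration only):

```python
# A stump is (split_var, split_point, left_value, right_value, frac_left),
# where frac_left is the fraction of training points that went left.
def stump_partial_dependence(stump, target_var, z):
    """Weighted traversal of a single stump: the partial dependence on
    target_var at value z, with the other variables averaged out."""
    j, s, left_value, right_value, frac_left = stump
    if j == target_var:
        return left_value if z <= s else right_value
    return frac_left * left_value + (1.0 - frac_left) * right_value

def ensemble_partial_dependence(stumps, target_var, z):
    # for a collection of trees, the single-tree results are simply averaged
    return sum(stump_partial_dependence(t, target_var, z)
               for t in stumps) / len(stumps)

stumps = [(0, 0.5, -1.0, 1.0, 0.5),    # splits on variable 0
          (1, 0.0, 2.0, 4.0, 0.25)]    # splits on variable 1
print(ensemble_partial_dependence(stumps, 0, 0.2))  # (-1.0 + 3.5) / 2 = 1.25
```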

7 Applications: Modeling Spiking Activity as a Function of Spiking History and Kinematics

To illustrate this approach to neural point process modeling, in this section we apply stochastic gradient boosting to spiking activity recorded from two cells, one in M1 and the other in 5d, of a behaving monkey. These two cells are henceforth referred to as cells A and B, respectively. They were chosen based on their rich modulation by intended and executed kinematics. Details of the basic recording hardware and surgical protocols are available elsewhere (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004). Following task training, two Bionic Technologies LLC (BTL, Salt Lake City, UT) 100-electrode silicon arrays were implanted in M1 and 5d areas related to sensorimotor control of the arm. One monkey (M. mulatta) was operantly conditioned to perform a standard center-out radial reaching task with an instructed delay. The monkey learned to use a manipulandum in a horizontal plane. The manipulandum controlled a cursor on a monitor oriented vertically in front of her. Each trial consisted of moving the cursor to one of eight radially displaced targets on the screen. The task started with the monkey moving the manipulandum to a center holding position. Soon after, an instructed delay cue specifying the target for the reach was presented for 1.5 sec. Finally, a go cue, appearing randomly in the time interval 1 to 2 sec after the disappearance of the instructed delay cue, signaled the monkey to reach for the remembered target. The hand position signal in 2D


was digitized and resampled to 1 kHz. Hand velocities and accelerations were obtained via standard Savitzky-Golay derivatives.

7.1 Covariate List, Data Set, Boosting Meta-Parameters, and Computational Issues. The following covariates were selected. To incorporate simple spiking history effects, we selected the time elapsed since the last spike. This covariate will be denoted by s_tk. To introduce kinematics effects, we chose position, velocity, and acceleration in 2D. These kinematics covariates are denoted by q(t_k + τ_q), q̇(t_k + τ_q̇), and q̈(t_k + τ_q̈), respectively. When referring to specific components of a kinematics vector, we use subscripts, for example, q = [q_1, q_2]. The investigation of kinematics effects at multiple time lags for each covariate, although straightforward, is beyond the scope of this article. For this reason, we chose to present the simpler case of a single time lag for each covariate. The choice of single optimal time lags was based on mutual information analyses (Paninski et al., 2004). For cell A, the time delays for position, velocity, and acceleration were set to −0.15, 0.25, and 0.25 sec, respectively. For cell B, these delays were set to zero. In order to account for the planning of movement direction during the instructed delay period, we included in the model the covariate target direction, denoted by φ and henceforth referred to as the intended movement direction in a given trial. This is not to be confused with the actual movement direction derived from the velocity vector. Our covariate vector thus consisted of x_tk = [s_tk, φ, q_{tk+τq}, q̇_{tk+τq̇}, q̈_{tk+τq̈}], with p = 8 covariates. We analyzed 268 trials, distributed almost uniformly across the eight directions. The recorded neural point process was time-discretized by taking Δ = 0.001 sec. Given the known refractory period after spiking, this choice ensured that no more than a single event would be observed in each subinterval of the partition.
In each trial, only the data in the time interval [−0.4, 0.5 sec], with respect to the movement onset at time zero, were used. Typical hand movements lasted approximately 0.4 sec, with bell-shaped velocity profiles peaking around 0.2 sec after movement onset. After time discretization, the number of samples, including data from all of the trials, was K = 241,200. The indexes k for the resulting samples, or tuples $\{y_k, x_k\}_{k=1}^{K}$, were reordered by a single random permutation. This was done to ensure that the training and validation data sets contained samples originating from everywhere in the experimental session, not just, for example, its beginning or end. The meta-parameters for the boosting algorithm were set as follows. Two-thirds of the data set were set aside for model fitting, and the remaining data were used for validation tests. For the stochastic subsampling at each iteration, half of the model-fitting data set was randomly subsampled without replacement. The number of terminal nodes for the regression trees was set to R = 9, that is, p + 1. The shrinkage coefficient was set to α = 0.001. This choice was based on the analysis of the loss function computed on


W. Truccolo and J. Donoghue

the validation data set for different shrinkage parameters and different numbers of regression trees. In the example shown for cell A in Figure 1, with weak shrinkage (e.g., α = 0.1), large generalization errors start to occur after just 50 to 100 trees, resulting in poor approximations. Much better performance was obtained for α = 0.001, with the improvement in the loss function remaining roughly constant after approximately 11,000 trees. Similar results were obtained for cell B and other cells in the recorded ensemble (not shown). The gradient boosting algorithm was implemented in C. It took about 3 hours of computational time on a 2 GHz dual-processor (2 GB RAM) computer to fit a model for a single cell. We used a computer cluster to run the cross-validation analyses presented in the last part of this section and in the next section.

7.2 Approximated Conditional Intensity Function and Goodness-of-Fit Analysis. To provide a glimpse of the type of CIF approximations generated by an ensemble of regression trees, Figure 2 shows an example for cell A. In this case, two main features are observed. First, after each spike, the CIF captures a refractory period (the CIF approximates zero) and a quick rebound excitation. Second, around 200 ms before movement onset, the cell begins to be modulated by the kinematics. This effect peaks about 100 ms prior to movement onset. (See also Figures 4 and 5 for details on the contributions of spiking history and kinematics to the approximation.) With the exception of sharp changes reflecting refractory periods after spiking, the temporal evolution of the estimated CIF is otherwise relatively smooth. Once the CIFs of the two cells were approximated by the ensemble of regression trees, a goodness-of-fit assessment by time rescaling was done on the validation data set. Figure 3 shows that, overall, only minor deviations are observed, for quantiles close to 0.5 (cell A) and to 0.25 (cell B).
These deviations could have arisen from a lack of fit of the approximation itself or from the absence in the chosen covariate set of other variables relevant to the modeled spiking activity.

7.3 Relative Importance and Partial Dependence Plots. Table 1 shows the estimated relative importance of each of the covariates. Different covariate rankings were observed for the two cells. For cell A, the most important covariates were spiking history and velocity. For cell B, velocity and position in the vertical axis, together with spiking history, were ranked highest. Partial dependence plots are shown in Figures 4 and 5 to display the marginal effects of subsets of the covariates. Each partial dependence plot shows the contribution of a subset in terms of the approximated function $\hat{F}$, that is, the CIF in log scale. The partial dependence plots do not include the constant term $\hat{F}_0(x)$ of the approximation. The estimated effects of spiking history, in terms of time elapsed since the last spike, are shown in Figure 4A. An increase in the partial dependence, implying also an increase in spiking

Nonparametric Modeling of Neural Point Processes


Figure 2: Example of approximated conditional intensity function. The resulting approximated CIF via stochastic gradient boosting with approximately 11,000 trees (top plot) is shown together with the related spike train data for that time segment (bottom plot). The top plot also shows the magnitudes (arbitrary units) of the position (dotted), velocity (solid), and acceleration (dashed) vectors. This example was taken from cell A, during a reach to a target to the right (0 degrees). Movement onset is at time zero.


Figure 3: K-S goodness-of-fit plots via time rescaling. The K-S plots for cells A and B are shown in plots A and B, respectively. Goodness-of-fit assessment was performed on validation data.
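The construction behind these K-S plots can be sketched as follows (a minimal discrete-time Python sketch; the function name and the simulated sanity check are illustrative, not the authors' code):

```python
# Sketch of goodness-of-fit via the time-rescaling theorem, in a
# discrete-time approximation (illustrative names).
import numpy as np

def ks_time_rescaling(n, lam, delta=0.001):
    """n: binary spike indicators per bin; lam: estimated CIF (Hz) per bin.
    Rescaled interspike intervals tau_j are Exp(1) under a correct model,
    so z_j = 1 - exp(-tau_j) should be Uniform(0, 1)."""
    cum = np.cumsum(lam * delta)             # integrated intensity
    spikes = np.flatnonzero(n)
    taus = np.diff(cum[spikes])              # rescaled ISIs
    z = np.sort(1.0 - np.exp(-taus))
    j = np.arange(1, len(z) + 1)
    model_cdf = (j - 0.5) / len(z)           # model quantiles for the K-S plot
    return z, float(np.max(np.abs(z - model_cdf)))

# Sanity check: a constant 50 Hz model for Bernoulli(0.05) spikes in 1 ms
# bins gives a small K-S statistic (the residual deviation reflects the
# discreteness of the 1 ms bins).
rng = np.random.default_rng(1)
n = (rng.random(200_000) < 0.05).astype(int)
lam = np.full(200_000, 50.0)
z, d = ks_time_rescaling(n, lam)
```

Plotting the sorted z values against the model quantiles, as in Figure 3, yields the K-S plot; deviations from the diagonal indicate lack of fit.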


[Figure 4 appears here: partial dependence plot for spiking history, showing the partial dependence F(s) on the time s (in sec) elapsed since the last spike.]

Figure 5: Partial dependence plots for kinematics. The partial dependences for position (q 1 and q 2 , in cm), velocity (q˙ 1 and q˙ 2 , in cm/s), and acceleration (q¨ 1 and q¨ 2 , in cm/s2 ) at the specified time lags (see text) for cell A are shown in plots A, B, and C, respectively.
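Partial dependence surfaces such as these can be computed generically (Friedman, 2001) by averaging the fitted function over the data with the covariates of interest clamped to grid values. A minimal Python sketch, with `F_hat` an arbitrary stand-in for the fitted ensemble and all names illustrative:

```python
# Generic partial dependence computation (after Friedman, 2001).
import numpy as np

def partial_dependence(F_hat, X, j, grid):
    """Marginal effect of covariate j on F_hat, on the supplied grid."""
    pd = np.empty(len(grid))
    for i, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, j] = v                  # clamp covariate j at grid value v
        pd[i] = F_hat(Xv).mean()      # average over the empirical distribution
    return pd

# Toy check: for an additive F, the partial dependence recovers the j-th
# component up to an additive constant.
F_hat = lambda X: 2.0 * X[:, 0] + np.sin(X[:, 1])
X = np.random.default_rng(2).normal(size=(500, 2))
grid = np.array([-1.0, 0.0, 1.0])
pd0 = partial_dependence(F_hat, X, 0, grid)
```

For the 2D surfaces in Figure 5, the same computation is run over a grid of pairs of covariates (e.g., the two velocity components clamped jointly).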

The partial dependencies on kinematics for cell A are summarized in Figure 5. The strongest partial contributions are given by the velocity covariate. For this cell, the velocity tuning narrowed at higher speeds. A dependence of preferred direction on speed, with a shift of about 50 degrees, was also observed. The partial dependence on position, though small, showed a nonmonotonic relation with saturation. For acceleration, the functional form was close to sigmoidal.

7.4 Prediction of Peri-Movement Time Histograms Based on Covariate State. In neurophysiology, the analysis of the average neuronal response to stimuli or behavioral events commonly plays an important role in identifying covariates relevant to a neuron's spiking activity. In our example, this average response corresponds to what is usually referred to as the peri-movement time histogram (PMTH). Here we illustrate how to do a simple assessment of the predictive power of the chosen covariates in terms of their ability to predict the time course of this average response. This can


be done by comparing the predicted PMTH, derived from the single-trial approximated CIFs, to the empirical PMTH. The predicted PMTH is therefore a function of the covariates, while the empirical PMTH is simply a time-locked average response that does not explicitly model the covariates' effects. This comparison between predicted and empirical PMTHs (assuming the empirical PMTH provides a good estimate of the true PMTH) can complement the goodness-of-fit assessment based on the time-rescaling theorem. Commonly, the time-rescaling-based assessment will show some lack of fit without either specifying a source for the fitting problem or pointing to what aspects the model captures well despite the lack of fit. The comparison between predicted and empirically estimated PMTHs can then be used as an additional diagnostic tool to check whether the modeled covariate effects capture at least the temporal features observed in the peri-event time histograms. The empirical PMTH was obtained by first time-locking the single-trial binary sequences $\{N_k\}$ to the time of hand movement onset and then averaging them across trials. This average was then smoothed by a gaussian (SD = 0.05 sec) kernel and finally expressed as spiking frequency in Hz.4 To obtain the predicted PMTH, we first ran stochastic gradient boosting on a model-fitting data set containing all of the trials except the trial whose CIF was being estimated. This corresponds to a leave-one-trial-out cross-validation procedure. Then, given the fitted model and observed covariates, we estimated the CIF for the selected trial. This was done for all of the trials. The predicted PMTH was finally obtained by time-locking the estimated single-trial CIFs to the onset of movement and then averaging them across trials. Figure 6 shows both the predicted and empirical PMTHs. The empirical PMTHs show strong pre- and post-movement effects for cell A, but only post-movement effects for cell B.
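The two PMTH estimates described above can be sketched as follows (illustrative Python; array shapes and names are assumptions, and the smoothing choices follow the description in the text):

```python
# Sketch of empirical and predicted PMTH estimates (illustrative names).
import numpy as np

def empirical_pmth(trials, delta=0.001, sd=0.05):
    """trials: (n_trials, n_bins) binary array, time-locked to movement
    onset. Trial-averaged rate (Hz) smoothed by a gaussian kernel."""
    rate = trials.mean(axis=0) / delta
    half = int(4 * sd / delta)
    t = np.arange(-half, half + 1) * delta
    kern = np.exp(-0.5 * (t / sd) ** 2)
    kern /= kern.sum()
    return np.convolve(rate, kern, mode="same")

def predicted_pmth(cif_trials):
    """cif_trials: (n_trials, n_bins) single-trial CIF estimates (Hz),
    time-locked to movement onset; the prediction is their average."""
    return np.asarray(cif_trials).mean(axis=0)

# Toy check with a constant 50 Hz cell (p = 0.05 per 1 ms bin):
trials = (np.random.default_rng(3).random((200, 900)) < 0.05).astype(int)
emp = empirical_pmth(trials)
```

In the leave-one-trial-out scheme, `cif_trials` would hold the CIF estimated for each held-out trial from a model fitted to all remaining trials.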
Overall, predicted and empirical PMTHs agree well for both cells. For these two cells, we found that the selected kinematics and spiking history covariates alone could account for most of the features in the time course of activation. Similar results were obtained for the other cells in the recorded ensemble (approximately 60 cells) in the two cortical areas (not shown). These results, together with the goodness-of-fit assessment by time rescaling, suggest that the CIF approximations have explained most of the cells' spiking behavior and that the model captured well the dependence of the time course of the cells' activity on movement parameters. The small disagreements between predicted and empirical PMTHs at the extremes of the time interval [−0.4, 0.5] could have arisen from nonmodeled motor preparation and reward effects

4 We could have obtained a better estimate of the "empirical" PMTH, with corresponding confidence intervals, by fitting Bayesian splines to the histogram as a function of time. However, we found this to be unnecessary for these illustrative examples.



Figure 6: Prediction of the PMTH based on the single-trial approximated conditional intensity functions. The predicted (black) and empirical (gray) PMTHs for cells A and B are shown in plots A and B, respectively. The eight plots show the PMTHs conditioned on the eight radial directions of the executed reaches.

or from estimation problems resulting from a lower signal-to-noise ratio at these extremes.

8 Comparison to Other Approaches: GLMs and Bayesian P-Splines

We compare stochastic gradient boosting regression to two alternative approaches: a generalized linear model and Bayesian penalized splines. The GLM framework applied to neural point processes has been described in detail in Truccolo, Eden et al. (2005). Briefly, in the GLM approach, the conditional intensity function was modeled as $\lambda = \exp\{\mu + \alpha_1 x_1 + \cdots + \alpha_p x_p\}$. The Bayesian P-splines approach is described below.
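The GLM benchmark and the discrete-time point-process log likelihood used as the comparison criterion can be sketched as follows (a minimal Python sketch; names are illustrative, not the authors' fitting code):

```python
# Sketch of the log-linear GLM conditional intensity
# lambda = exp{mu + alpha_1 x_1 + ... + alpha_p x_p} and the
# discrete-time point-process log likelihood.
import numpy as np

def glm_cif(X, mu, alpha):
    """Log-linear conditional intensity (Hz) for covariate matrix X."""
    return np.exp(mu + X @ alpha)

def point_process_loglik(n, lam, delta=0.001):
    """sum_k [N_k log(lambda_k * delta) - lambda_k * delta], the
    discrete-time point-process log likelihood (Poisson approximation
    for small delta; cf. Truccolo, Eden et al., 2005)."""
    p = lam * delta
    return float(np.sum(n * np.log(p) - p))
```

The same log likelihood, evaluated on held-out data, is the cross-validation criterion used in section 8.2.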


8.1 Bayesian P-Splines. Bayesian P-splines are a state-of-the-art nonparametric regression tool. We chose Bayesian P-splines (Brezger & Lang, 2006) instead of a Bayesian free-knot spline approach using reversible-jump MCMC (DiMatteo et al., 2001) because of their faster computation. Consider the following model for the conditional intensity function,

$$\log \lambda_i = f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + v_i^T \gamma, \tag{8.1}$$

where the $f_j(x_j)$, $j = 1, \ldots, p$, are smooth functions and $v_i^T \gamma$ represents the strictly linear part of the predictor. In the Eilers and Marx (1996) approach, it is assumed that the unknown functions can be approximated by $M_j = k_j + l$ B-splines of degree $l$ (usually cubic) with equally spaced knots,

$$\zeta_{j0} = x_{j,\min} < \zeta_{j1} < \cdots < \zeta_{j,k_j-1} < \zeta_{j,k_j} = x_{j,\max}, \tag{8.2}$$

over the domain of $x_j$. Denoting the $m$th basis function by $B_{jm}$, we obtain

$$f_j(x_j) = \sum_{m=1}^{M_j} \beta_{jm} B_{jm}(x_j). \tag{8.3}$$

By defining the $n \times M_j$ design matrices $X_j$ with the elements in row $i$ and column $m$ given by $X_j(i, m) = B_{jm}(x_{ij})$, we can rewrite the predictor in matrix notation as

$$\log \lambda = X_1 \beta_1 + X_2 \beta_2 + \cdots + X_p \beta_p + V\gamma. \tag{8.4}$$

Here $\beta_j = [\beta_{j1}, \ldots, \beta_{jM_j}]^T$, $j = 1, \ldots, p$, are the vectors of unknown regression coefficients. The matrix $V$ is the usual design matrix for linear effects. To overcome the well-known difficulties of regression splines, Eilers and Marx (1996) suggested using a relatively large number of knots (usually between 20 and 40) to ensure enough flexibility, and introducing a roughness penalty on adjacent regression coefficients to regularize the problem and avoid overfitting. Theoretically, because roughness is controlled by the penalty term, once a minimum number of knots is reached, the fit given by a P-spline should be almost independent of the knot number and location. In their frequentist approach, Eilers and Marx proposed penalties based on squared $r$th-order differences. Here we choose second-order differences, in analogy to second-order derivatives in smoothing splines. In the Bayesian P-splines approach, the second-order differences are replaced with their stochastic analogs, that is, second-order random walks defined by

$$\beta_{jm} = 2\beta_{j,m-1} - \beta_{j,m-2} + u_{jm} \tag{8.5}$$


with gaussian errors $u_{jm} \sim N(0, \sigma_j^2)$ and diffuse priors $\beta_{j1}, \beta_{j2} \propto \text{const}$ for the initial values.5 While first-order random walks would penalize abrupt jumps $\beta_{jm} - \beta_{j,m-1}$ between successive parameters, second-order random walks penalize deviations from the linear trend $2\beta_{j,m-1} - \beta_{j,m-2}$. The amount of smoothing is controlled by the parameter $\sigma_j^2$, which corresponds to the inverse smoothing parameter in the traditional smoothing spline approach. By defining an additional hyperprior for the variance parameters, the amount of smoothness can be estimated simultaneously with the regression coefficients. We assign the conjugate prior for $\sigma_j^2$, an inverse gamma prior with hyperparameters $a_j$ and $b_j$, that is, $\sigma_j^2 \sim IG(a_j, b_j)$. Based on Brezger and Lang's (2006) experience from extensive simulation studies, we use $a_j = b_j = 0.001$. Bayesian inference is based on the posterior of the model, which is given by

$$p(\beta_1, \ldots, \beta_p, \sigma_1^2, \ldots, \sigma_p^2, \gamma \mid y, X) \propto L(y, X, \beta_1, \ldots, \beta_p, \sigma_1^2, \ldots, \sigma_p^2, \gamma) \times \prod_{j=1}^{p} (\sigma_j^2)^{-rk(K_j)/2} \exp\left(-\frac{1}{2\sigma_j^2} \beta_j^T K_j \beta_j\right) \times \prod_{j=1}^{p} (\sigma_j^2)^{-a_j-1} \exp\left(-\frac{b_j}{\sigma_j^2}\right), \tag{8.6}$$

where $L(\cdot)$ is the likelihood. The last two terms on the right are the priors for the spline coefficients and for the variances of the random walks, respectively; $rk(\cdot)$ denotes the matrix rank, and $K_j$ is a penalty matrix that depends on the prior assumptions about the smoothness of $f_j$ and the type of covariate. This penalty matrix is of the form $K_j = D^T D$, where $D$ is, in our case, a second-order difference matrix. Spline parameters and smoothing terms were estimated by MCMC sampling. Good mixing properties of the Markov chain are obtained by a sampling scheme that approximates the full conditionals of the regression parameters via IWLS (iteratively weighted least squares) and uses these approximations as proposals in a Metropolis-Hastings algorithm, as proposed by Gamerman (1997).
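The second-order difference penalty $K_j = D^T D$ entering equation 8.6 can be sketched as follows (illustrative Python; the basis size M is arbitrary):

```python
# Sketch of the P-spline penalty construction: for a second-order random
# walk, K = D^T D with D the second-order difference matrix, so that
# beta^T K beta penalizes deviations from a linear trend.
import numpy as np

def second_order_difference_matrix(M):
    """(M-2) x M matrix D with stencil (1, -2, 1), so that (D beta)_m =
    beta_{m+2} - 2 beta_{m+1} + beta_m."""
    D = np.zeros((M - 2, M))
    for m in range(M - 2):
        D[m, m:m + 3] = [1.0, -2.0, 1.0]
    return D

M = 10
D = second_order_difference_matrix(M)
K = D.T @ D                                     # penalty matrix, rank M - 2
beta_linear = 0.5 + 0.3 * np.arange(M)          # exactly linear coefficients
penalty = float(beta_linear @ K @ beta_linear)  # ~0: linear trends unpenalized
```

The rank deficiency of K (rank M − 2) is why the first two coefficients receive diffuse priors.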
We set the number of MCMC iterations to 20,000, with the first 5,000 iterations discarded as burn-in. To reduce correlations between the MCMC samples, the chain was thinned by keeping only every tenth sample. We visually inspected the resulting Markov chains to verify convergence. Cubic spline bases with 30 uniformly spaced knots were used. In the computation of the Bayesian P-splines, we used the

5 In cases where the estimated function is highly oscillatory, the assumption of a global variance parameter may be relaxed by using $u_{jm} \sim N(0, \sigma_j^2/\delta_{jm})$, where both $\sigma_j^2$ and $\delta_{jm}$ are hyperparameters to be estimated.


publicly available software BayesX developed by Brezger, Kneib, and Lang (2005). Besides its good MCMC mixing properties, especially designed for Bayesian P-splines, this software has been shown to be 60 to 280 times faster than the other well-known alternative for Bayesian inference, the software WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003).

8.2 Cross-Validation Results. We used a 10-fold cross-validation scheme to compare the different modeling approaches. Exactly the same 10 test data sets were used for each modeling approach, and each approach used the same eight covariates described in section 7. We modeled all 67 cells available in the data set. As a comparison criterion, we adopted the average cross-validated log likelihood of the modeled neural process. Because in Bayesian splines the introduction of interaction terms is usually restricted in practice to interaction surfaces (second-order interactions, modeled by tensor products of B-splines, for example), and because their inclusion would dramatically increase the computational time for these comparisons, we opted not to include higher-order interaction terms in any of the modeling approaches. Therefore, the gradient boosting regression models included trees with splits on a single covariate only. Additionally, to prevent gradient boosting from having an unfair advantage in cross-validation performance, we did not estimate the optimal number of trees from the loss function computed on a small test data subset. Instead, we set the shrinkage coefficient to a small value, α = 0.001, and fixed the number of trees at a high number, 40,000, for all of the modeled cells. Hyperparameters for the Bayesian splines and additional parameters for the MCMC sampling were specified as described. For all of the cells, we obtained high acceptance rates in the Markov chains.
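The cross-validated log likelihood criterion can be sketched as follows (illustrative Python; `fit` and `loglik` are generic stand-ins for any of the modeling approaches, shown here with a trivial constant-rate model independent of the covariates):

```python
# Sketch of the 10-fold cross-validated log likelihood criterion
# (illustrative names; not the authors' code).
import numpy as np

def cv_loglik(y, X, fit, loglik, n_folds=10, seed=0):
    """Average held-out log likelihood; reusing the same seed yields
    identical folds for every modeling approach being compared."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(y[train_idx], X[train_idx])
        scores.append(loglik(model, y[test_idx], X[test_idx]))
    return float(np.mean(scores))

def fit_constant_rate(y, X):
    return y.mean()                   # MLE of the per-bin spike probability

def loglik_constant_rate(p, y, X):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.random.default_rng(4).integers(0, 2, size=1000)
X = np.zeros((1000, 1))
score = cv_loglik(y, X, fit_constant_rate, loglik_constant_rate)
```

Swapping in the boosting, GLM, or P-spline fitting routines for `fit` and `loglik` yields the paired scores whose differences are histogrammed in Figure 7.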
For comparison purposes, we also computed the cross-validated log likelihood for a model that consisted of only a constant mean Poisson process, independent of any of the covariates. We refer to this model as the noise model. Additionally, we used the differences among the cross-validated log likelihoods ($\Delta$ log likelihood) from the different models. More specifically, the log likelihoods from the GLM, Bayesian P-splines, or noise models were subtracted from the corresponding log likelihoods of the gradient boosting regression models. In this way, a positive $\Delta$ log likelihood means a better performance for the gradient boosting models on the test data sets. Figure 7 summarizes the comparison results over the 67 cells. Overall, the gradient boosting regression models performed much better for all of the 67 cells when compared to the noise and GLM models, as expected, and performed better in about 90% of the cells when compared to Bayesian P-splines. A preliminary analysis contrasting the types of estimated covariate effects on cell A based on gradient boosting regression and Bayesian P-splines is given in Figure 8. We focused on the shapes of the estimated effects rather than on their absolute magnitudes and chose the velocity covariate because, for most of the cells, it is ranked as the most important covariate. Compared


Figure 7: Comparison between the different modeling approaches using cross-validated log likelihood. (Top) Histogram of the differences between the average 10-fold cross-validated log likelihoods of the gradient boosting regression and Bayesian P-splines models. (Middle) Histogram of the differences between the gradient boosting regression and GLM models. (Bottom) Histogram of the differences between the gradient boosting regression and noise models (a Poisson process with constant spiking rate, independent of the covariates). Total N is 67 modeled cells. Positive differences ($\Delta$ log likelihoods) mean that the gradient boosting regression models performed better on the test data than the comparison modeling approach did.

to the gradient boosting estimate (see Figure 8A), the Bayesian P-spline estimate seems to oversmooth (see Figure 8B), missing many of the structures captured by the gradient boosting regression tree model. The gradient boosting estimate clearly partitions the work space into four regions (movements up and down, left and right), with fine structure especially visible in two of these quadrants. However, this structure could simply be an artifact of the piecewise-constant nature of the regression tree approximation. Assuming for the moment that this fine structure truly exists, it could be that a different spline approach, not dependent on MCMC sampling, would better capture it. To address this issue, we estimated a P-spline model using the recently developed MGCV algorithm in R (Wood, 2004). Although this algorithm does not provide the full Bayesian inference setup of Bayesian P-splines, it still allows for the automatic estimation of smoothness parameters. Figure 8C shows the resulting estimate obtained


Figure 8: Comparison of estimated dependence on velocity from Bayesian P-splines, gradient boosting regression, and other spline-based methods. In all plots, the estimated marginal effects of the velocity covariate for cell A are shown as 2D surfaces. (A) Estimated dependence obtained from stochastic gradient boosting. (B) From Bayesian P-splines. (C) Penalized cubic splines (MGCV). (D) Top view of the estimated dependence in A. (E) Top view of the estimated dependence in C. (F) Penalized cubic splines with shrinkage (top view; estimated using MGCV). The velocity range covers the velocities actually observed in the fitted data. Velocities are given in cm/s. See the text for details.


by fitting the model with all covariates and using penalized cubic splines (basis dimension set to 32). Interestingly, the estimated effect for velocity via MGCV is overall closer to the gradient boosting model than the Bayesian P-spline model is. However, it pays the high price of significant oscillations (overshoots and undershoots) in certain regions, which probably makes the smoother Bayesian P-spline approximation preferable. More important, these oscillations appear exactly in regions where there are sizable jumps in the piecewise-constant approximation (see Figure 8E). Figure 8F shows another example, now using MGCV with penalized cubic splines with shrinkage. Instead of minimizing the oscillation artifacts, it actually increases them. Similar results were obtained with B-spline bases. (We could not compute the model using thin-plate splines in MGCV because of memory allocation problems.) Therefore, we conclude that the fine structure estimated by the gradient boosting regression cannot be easily explained as a pure artifact of the piecewise-constant approximation. Instead, this fine structure is likely to reflect features of the true underlying relation between spiking and velocity.

9 Discussion

We have demonstrated how stochastic gradient boosting regression can be successfully extended to the modeling of neural point processes, thus providing a robust nonparametric modeling approach for neural spiking data that preserves their point process nature. Furthermore, our choice of loss function keeps this nonparametric modeling approach close in spirit to the (regularized) maximum likelihood principle. Analysis, on validation data, of the loss function's dependence on the number of iterations (trees), together with the application of the time-rescaling theorem, provided the approach with both model selection and a rigorous goodness-of-fit assessment.
Partial dependence plots and analysis of averaged CIFs were shown to enhance the model's interpretability, which is important in neurophysiological studies of the relation between spiking activity and observed covariates. Our cross-validation analyses showed that gradient boosting regression can perform comparably or superior to a state-of-the-art Bayesian regression tool. We speculate that one possible reason for this better performance with respect to Bayesian splines might be the choice of base learners. The preliminary analysis provided in Figure 8 seems to argue in this direction. Additionally, as already pointed out by Friedman (2001), methods that use smooth functions tend to be adversely affected by outliers and wide-tailed distributions. The piecewise-constant approximation in regression trees, on the other hand, is more robust to these adverse effects. However, although the issue of smooth versus piecewise-constant approximation seems to shed some light on why gradient boosting performed better, there are many other, equally important features in


gradient boosting that could also have contributed to this better performance. These features are shrinkage, stochastic subsampling, and especially the iterative regularized fitting of residuals. This makes properly addressing why gradient boosting performed better a nontrivial problem. A thorough analysis of this problem is beyond the scope of this article. From a theoretical point of view, we believe that Bayesian splines may provide comparable or higher performance if one spends enough time fine-tuning the prior distributions and the parameters of the MCMC sampling, and selecting a smaller subset of covariates. While this fine tuning and selection is feasible for smaller selected data subsets, it is precisely what one would like to avoid in initial analyses of large data sets. It is in that sense that we see gradient boosting regression as an off-the-shelf tool. Given its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf modeling tool for the analysis of the large neural data sets currently generated by multiple-array recording technology (e.g., more than 50 cells, number of samples K > 10^5) and corresponding measurements of multidimensional (dimension > 4) covariate spaces. Additionally, while with gradient boosting regression trees we can easily attempt to capture arbitrary high-order interactions among the covariates, spline-based methods are in practice currently limited to second-order interaction effects, usually modeled with tensor products of two univariate spline bases. The introduction of second-order interaction effects by tensor product splines in generalized additive models (GAMs) can in practice be infeasible even in the case of low-dimensional covariate spaces. That is because of memory storage problems resulting from the requirement in current algorithms that projections of the covariate data on these tensor product bases be explicitly stored in matrices.
These matrices can be prohibitively large given the typical size of neural point process data sets (K > 10^5). The use of Bayesian P-splines would alleviate the memory storage problem. Currently, however, these methods are not a practical choice because of their computational cost and the difficulty of designing MCMC samplers with good mixing properties in high-dimensional parameter spaces once interaction effects are added. Alternative methods based on kernel regression algorithms (e.g., kernel logistic regression and gaussian process regression) currently face similar difficulties. Most available algorithms require the explicit representation of kernel matrices as 2D arrays and are therefore limited to a few hundred thousand samples. Finally, another advantage of stochastic gradient boosting regression is that it can easily handle mixed metric and discrete (categorical) covariates. Mixed covariates present difficulties for GAM algorithms based exclusively on spline bases. Confidence intervals for the approximated functions and partial covariate effects can in principle be obtained by bootstrapping gradient boosting regression models. However, the introduction of bootstrapping would make gradient boosting regression less attractive in terms of computational


efficiency. We recommend the following approach. If more detailed statistical inference based on estimated confidence intervals is required after an initial modeling analysis performed with stochastic gradient boosting regression, one can focus on a selected smaller subset of modeled cells and perhaps lower-dimensional covariate sets. Bootstrapping of gradient boosting models could then be used. Alternatively, one could apply Bayesian splines to this selected smaller data subset in order to assess parameter uncertainty. To address these issues, we are also considering a Bayesian formulation of boosting regression trees (work in preparation). In neuroscience, modeling the relations between neural spiking and stimulus-behavioral covariates has been intrinsically associated with the problem of neural population decoding. In this regard, neural point process decoding based on models fitted through stochastic gradient boosting can be easily implemented using sequential Monte Carlo particle filters (Doucet, de Freitas, & Gordon, 2001; Gao, Black, Bienenstock, Shoham, & Donoghue, 2001; Brockwell, Rojas, & Kass, 2004). The spiking probability in the time interval $((k-1)\Delta, k\Delta]$ given the estimated states, necessary to compute the importance sampling weights for the particles, is simply obtained as $P(N_k \mid x_k) = \exp\{N_k \hat{F}(x_k) - \exp[\hat{F}(x_k)]\}$. The application of gradient boosting models to other neural decoding approaches that require the computation of partial derivatives of the CIF with respect to covariates (Eden, Frank, Barbieri, Solo, & Brown, 2004) will be problematic because of the piecewise-constant nature of the regression tree approximation. We can think of several directions for improving stochastic gradient boosting regression for neural point processes. The algorithm is generic regarding the types of base learners or regressors that can be used at each iteration step.
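The importance-weight computation quoted above can be sketched as follows (minimal Python; the $\hat{F}$ values below are stand-ins for the boosted model's output, not a fitted model):

```python
# Sketch of the particle-filter importance weight
# P(N_k | x_k) = exp{N_k * F_hat(x_k) - exp[F_hat(x_k)]}.
import numpy as np

def spike_prob(n_k, F_k):
    """Poisson-approximation probability of observing n_k spikes in the
    interval ((k-1)*Delta, k*Delta], with F_hat = log(lambda * Delta)."""
    return np.exp(n_k * F_k - np.exp(F_k))

# Each particle's weight is multiplied by this probability evaluated at
# the particle's covariate state, then the weights are renormalized:
F_vals = np.log(np.array([0.05, 0.10, 0.20]))   # stand-in F_hat values
weights = spike_prob(1, F_vals)
weights /= weights.sum()                         # normalized importance weights
```
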
We plan, for example, to investigate the use of penalized low-dimensional B-splines as base learners. Boosted splines could be used in cases where smooth approximations are preferable, and this choice might also result in fewer boosting iterations. With respect to our goal of identifying the features of the functional form relating spiking to covariates, and despite the contribution of partial dependence plots in this regard, model interpretability remains a topic for further development. As mentioned before, one advantage of an ensemble of regression trees is that the modeling of interaction effects among the covariates can be easily implemented. Estimated second-order interaction effects can be easily visualized in 2D plots. However, an analysis of higher-order interaction effects and their contribution to spiking activity is beyond the assessment provided by the partial dependence plots. We are also considering more efficient algorithmic implementations of boosting regression in terms of least angle regression (Efron, Hastie, Johnstone, & Tibshirani, 2004). A comparison to other related nonparametric modeling algorithms, such as Random Forests (Breiman, 2001), is under way.


Our applications to M1 and 5d data in this article were intended as illustrative examples of the algorithm. Refined models will likely require larger sets of covariates covering more extensive spiking history effects, as well as the introduction of kinematics covariates at multiple time lags. In addition, movement covariates in coordinate frames other than the Cartesian coordinates adopted here, and covariates involving dynamic forces, are other important topics for investigation. A detailed study of the relations between spiking and movement covariates in these two areas is the topic of a future report. We are also currently applying stochastic gradient boosting to the study of cortico-cortical interactions between primary motor (M1) and parietal (5d) cortices during sensorimotor control of reaching (Truccolo, Vargas et al., 2005).

Appendix

For completeness, we describe in more detail two aspects of stochastic gradient boosting. We follow the exposition in Friedman (2001).

A.1 Gradient Smoothing via "Greedy-Stagewise" Approximation. Given the representation

\[
F^*(\mathbf{x}) \approx \hat{F}(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x}), \tag{A.1}
\]

for the function solving the optimization problem in equation 3.1, and under sufficient regularity conditions such that we can interchange differentiation and integration, each $f_m(\mathbf{x})$ can be implemented as the $m$th step in a steepest-descent gradient optimization,

\[
f_m(\mathbf{x}) = \rho_m g_m(\mathbf{x}) = -\rho_m \, E_y\!\left[ \frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \,\middle|\, \mathbf{x} \right]_{F(\mathbf{x}) = \hat{F}_{m-1}(\mathbf{x})}, \tag{A.2}
\]

where $g_m$ is the negative gradient determining the (steepest-descent) direction of the step, $\rho_m$ determines the magnitude of the step, and $\hat{F}_{m-1}(\mathbf{x}) = \sum_{j=0}^{m-1} f_j(\mathbf{x})$. Formulated in this simple pointwise optimization manner, this nonparametric approach would result in poor generalization when applied to finite data sets $\{y_k, \mathbf{x}_k\}_{k=1}^{K}$. The approximation would lack generalization when estimating $F^*(\mathbf{x})$ at $\mathbf{x}$-values other than the ones provided by the sample data set. This shortcoming can be dealt with by imposing smoothness on the approximation. Gradient smoothing can be achieved by adopting a parameterized form of $f_m(\mathbf{x})$ and performing parameter optimization at each of the iteration steps. A common parameterization is to express each $f_m(\mathbf{x})$ in terms of a regressor or base learner $h(\mathbf{x}; \mathbf{a}_m)$, such that $\hat{F}(\mathbf{x}) = \sum_{m=1}^{M} \beta_m h(\mathbf{x}; \mathbf{a}_m)$. The optimization problem then becomes

\[
\{\hat{\beta}_m, \hat{\mathbf{a}}_m\}_{m=1}^{M} = \arg\min_{\{\beta_m, \mathbf{a}_m\}_{1}^{M}} \sum_{k=1}^{K} L\!\left( y_k, \sum_{m=1}^{M} \beta_m h(\mathbf{x}_k; \mathbf{a}_m) \right). \tag{A.3}
\]

Instead of solving the above at once (i.e., for all $m = 1, \ldots, M$ simultaneously), a "greedy-stagewise" approach is motivated as follows. First, proceeding in an iterative fashion as before, the problem to be solved at each step is

\[
(\hat{\beta}_m, \hat{\mathbf{a}}_m) = \arg\min_{\beta, \mathbf{a}} \sum_{k=1}^{K} L\!\left( y_k, \hat{F}_{m-1}(\mathbf{x}_k) + \beta h(\mathbf{x}_k; \mathbf{a}) \right), \tag{A.4}
\]

with $\hat{F}_{m-1}(\mathbf{x}) = \sum_{j=0}^{m-1} \beta_j h(\mathbf{x}; \mathbf{a}_j)$. Second, the problem can be simplified by noticing that it suffices to find the member of the parameterized class $h(\mathbf{x}; \mathbf{a}_m)$ that produces the vector $\mathbf{h}_m = \{h(\mathbf{x}_k; \mathbf{a}_m)\}_{k=1}^{K}$ most parallel to the negative gradient vector $\mathbf{g}_m = \{g_{mk}(\mathbf{x}_k)\}_{k=1}^{K} \in \mathbb{R}^K$, where the $K$ components of the negative gradient are given by

\[
g_{mk}(\mathbf{x}_k) = -\left[ \frac{\partial L(y_k, F(\mathbf{x}_k))}{\partial F(\mathbf{x}_k)} \right]_{F(\mathbf{x}) = \hat{F}_{m-1}(\mathbf{x})}, \quad k = 1, \ldots, K. \tag{A.5}
\]

Note that the $\mathbf{h}_m$ most parallel to $\mathbf{g}_m$ is also the one most correlated with $g_m(\mathbf{x})$ over the data distribution. For most types of base learners, the most parallel vector can be obtained by solving the least-squares problem

\[
\mathbf{a}_m = \arg\min_{\mathbf{a}, \beta} \sum_{k=1}^{K} \left[ g_{mk}(\mathbf{x}_k) - \beta h(\mathbf{x}_k; \mathbf{a}) \right]^2. \tag{A.6}
\]

Other minimization criteria can be used in place of equation A.6 if needed. In summary, by proceeding in a greedy-stagewise manner, the hard gradient smoothing problem in equation A.3 has been replaced by the much simpler problem of least-squares fitting $h(\mathbf{x}; \mathbf{a})$ to the negative gradient, followed by only a single parameter optimization for the magnitude of the gradient step $\rho_m$ in

\[
\rho_m = \arg\min_{\rho} \sum_{k=1}^{K} L\!\left( y_k, \hat{F}_{m-1}(\mathbf{x}_k) + \rho h(\mathbf{x}_k; \mathbf{a}_m) \right). \tag{A.7}
\]


A.2 Partial Dependence Plots. Given a data set $\{y_k, \mathbf{x}_k\}_{k=1}^{K}$, with $\mathbf{x} = [x_1, x_2, \ldots, x_p]'$, let $\mathbf{z}_l$ be a chosen target subset, of size $l$, of the covariate variables $\mathbf{x}$,

\[
\mathbf{z}_l = \{z_1, \ldots, z_l\} \subset \{x_1, \ldots, x_p\}, \tag{A.8}
\]

and let $\mathbf{z}_{\setminus l}$ be the complement set, such that

\[
\mathbf{z}_{\setminus l} \cup \mathbf{z}_l = \mathbf{x}. \tag{A.9}
\]

If one conditions on specific values for the variables in $\mathbf{z}_{\setminus l}$, then $\hat{F}(\mathbf{x})$ can be considered a function of only the variables in the chosen subset $\mathbf{z}_l$:

\[
\hat{F}_{\mathbf{z}_{\setminus l}}(\mathbf{z}_l) = \hat{F}(\mathbf{z}_l \mid \mathbf{z}_{\setminus l}). \tag{A.10}
\]

In general, the functional form of $\hat{F}_{\mathbf{z}_{\setminus l}}(\mathbf{z}_l)$ will depend on the particular values chosen for $\mathbf{z}_{\setminus l}$. If this dependence is not too strong, then the average function

\[
\bar{F}_l(\mathbf{z}_l) = E_{\mathbf{z}_{\setminus l}}\!\left[ \hat{F}(\mathbf{x}) \right] = \int \hat{F}(\mathbf{z}_l, \mathbf{z}_{\setminus l}) \, p_{\setminus l}(\mathbf{z}_{\setminus l}) \, d\mathbf{z}_{\setminus l} \tag{A.11}
\]

can represent a useful summary of the partial dependence of $\hat{F}(\mathbf{x})$ on the chosen variable subset $\mathbf{z}_l$. Here $p_{\setminus l}(\mathbf{z}_{\setminus l})$ is the marginal probability density of $\mathbf{z}_{\setminus l}$,

\[
p_{\setminus l}(\mathbf{z}_{\setminus l}) = \int p(\mathbf{x}) \, d\mathbf{z}_l, \tag{A.12}
\]

where $p(\mathbf{x})$ is the joint density of all the covariates $\mathbf{x}$. An empirical estimate of this quantity can be obtained from the data. In the case where the dependence of $\hat{F}(\mathbf{x})$ on the chosen covariate subset is additive or multiplicative, the form of $\hat{F}_{\mathbf{z}_{\setminus l}}(\mathbf{z}_l)$ does not depend on the joint values of the complement covariates $\mathbf{z}_{\setminus l}$, and equation A.11 provides a complete description of the variation of $\hat{F}(\mathbf{x})$ with the chosen covariate subset. The approximation of the partial dependence measure based on the reasoning above is simplified in the case of regression trees by the procedure given in section 6.2.
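In practice, the average in equation A.11 is typically approximated by an empirical mean over the training sample: the target covariates are clamped to each grid value while the complement covariates keep their observed joint values. A minimal sketch (the function name and signature are ours; `F_hat` stands for any fitted approximation $\hat{F}$):

```python
import numpy as np

def partial_dependence(F_hat, X, target_cols, grid):
    # Empirical analogue of equation A.11: for each grid value z_l, clamp the
    # target covariates to z_l, leave the complement covariates at their
    # observed joint values, and average the fitted model over the sample.
    pd_values = []
    for z in grid:
        Xz = X.copy()
        Xz[:, target_cols] = z          # condition on z_l
        pd_values.append(F_hat(Xz).mean())  # average over z_{\l}
    return np.array(pd_values)
```

When the dependence on the target subset is additive, this empirical average recovers the exact partial dependence up to an additive constant, as noted above; for regression-tree ensembles the same quantity can be computed more efficiently by the weighted tree traversal referred to in section 6.2.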


Acknowledgments

W. T. thanks Emery Brown and Uri Eden for discussions on point process theory, Ji Zhu for discussions on boosting algorithms, and Carlos Vargas for the data recordings. This work was supported by ONR NRD-339, NIH NS-25074, and DARPA MDA 972-00-1-0026.

References

Barbieri, R., Frank, L. M., Quirk, M. C., Solo, V., Wilson, M. A., & Brown, E. N. (2002). Construction and analysis of non-Gaussian spatial models of neural spiking activity. Neurocomputing, 44–46, 309–314.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1997). Arcing the edge (Tech. Rep. No. 486). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1517.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth.
Brezger, A., Kneib, T., & Lang, S. (2005). BayesX: Analyzing Bayesian structured additive regression models. Journal of Statistical Software, 14(11), 1–22.
Brezger, A., & Lang, S. (2006). Generalized structured additive regression based on Bayesian P-splines. Computational Statistics and Data Analysis, 50, 967–991.
Brillinger, D. R. (1988). Maximum likelihood analysis of spike trains of interacting nerve cells. Biological Cybernetics, 59, 189–200.
Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91(2), 1899–1907.
Brown, E. N., Barbieri, R., Eden, U. T., & Frank, L. M. (2003). Likelihood methods for neural data analysis. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach. Boca Raton, FL: CRC Press.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2001). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14, 325–346.
Chen, D., & Fetz, E. E. (2005). Characteristic membrane potential trajectories in primate sensorimotor cortex neurons recorded in vivo. Journal of Neurophysiology, 94, 2713–2725.
Chornoboy, E. S., Schramm, L. P., & Karr, A. F. (1988). Maximum likelihood identification of neuronal point process systems. Biological Cybernetics, 59, 265–275.
Copas, J. B. (1983). Regression, prediction, and shrinkage (with discussion). Journal of the Royal Statistical Society, B, 45, 311–354.
Daley, D., & Vere-Jones, D. (2003). An introduction to the theory of point processes. New York: Springer-Verlag.
DiMatteo, I., Genovese, C. R., & Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika, 88, 1055–1071.


Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. New York: Springer-Verlag.
Eden, U. T., Frank, L. M., Barbieri, R., Solo, V., & Brown, E. N. (2004). Dynamic analyses of neural encoding by point process adaptive filtering. Neural Computation, 16(5), 971–998.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–102.
Frank, L. M., Eden, U. T., Solo, V., Wilson, M. A., & Brown, E. N. (2002). Contrasting patterns of receptive field plasticity in the hippocampus and the entorhinal cortex: An adaptive filtering approach. Journal of Neuroscience, 22, 3817–3830.
Freund, Y., & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23–27).
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19(1), 1–82.
Friedman, J. H. (1999). Stochastic gradient boosting (Tech. Rep.). Palo Alto, CA: Stanford University, Statistics Department.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). Discussion. Annals of Statistics, 32(1), 102–107.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Annals of Statistics, 28(2), 337–374.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing, 7, 57–68.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. P. (2001). Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 213–220). Cambridge, MA: MIT Press.
Girsanov, I. V. (1960). On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Theory of Probability and Its Applications, 5(3), 285–301.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning. New York: Springer-Verlag.
Hatsopoulos, N., Joshi, J., & O'Leary, J. G. (2004). Decoding continuous and discrete motor behaviors using motor and premotor cortical ensembles. Journal of Neurophysiology, 92, 1165–1174.
Johnson, N. L., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions. New York: Wiley.
Kass, R. E., & Ventura, V. (2001). A spike-train probability model. Neural Computation, 13, 1713–1720.


Kaufman, C. G., Ventura, V., & Kass, R. E. (2005). Spline-based non-parametric regression for periodic functions and its application to directional tuning of neurons. Statistics in Medicine, 24(14), 2255–2265.
Lebanon, G., & Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 447–454). Cambridge, MA: MIT Press.
Lugosi, G., & Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, 32(1), 30–55.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 512–518). Cambridge, MA: MIT Press.
Okatan, M., Wilson, M. A., & Brown, E. N. (2005). Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation, 17(9), 1927–1961.
Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network, 15(4), 243–262.
Paninski, L., Fellows, M. R., Hatsopoulos, N. G., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor neurons for hand position and velocity. Journal of Neurophysiology, 91, 515–532.
Papangelou, F. (1972). Integrability of expected increments of point processes and a related change of scale. Transactions of the American Mathematical Society, 165, 483–506.
Rosset, S., Zhu, J., & Hastie, T. (2004). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5, 941–973.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651–1686.
Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS user manual (Version 1.4). Cambridge: Medical Research Council Biostatistics Unit.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2), 1074–1089.
Truccolo, W., Vargas, C., Philip, B., & Donoghue, J. P. (2005). M1-5d statistical interdependencies via dual multi-electrode array recordings. Society for Neuroscience, abstract 981.13, Washington, DC. Available online at http://sfn.scholarone.com/itin2005/index.html.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–686.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85.

Received August 16, 2005; accepted July 11, 2006.
