Adaptive Sparse Logistic Regression with Application to Neuronal Plasticity Analysis

Alireza Sheikhattar (ECE Department, University of Maryland; Email: [email protected])
Jonathan B. Fritz (Institute for Systems Research (ISR), University of Maryland; Email: [email protected])
Shihab A. Shamma (ECE Department & ISR, University of Maryland; Email: [email protected])
Behtash Babadi (ECE Department & ISR, University of Maryland; Email: [email protected])
Abstract—We consider the problem of estimating the time-varying parameters of a sparse logistic regression model in an online setting. We introduce two adaptive filters based on proximal gradient algorithms for recursive estimation of the model parameters by maximizing an ℓ1-regularized version of the observation log-likelihood, as well as an efficient online procedure for computing statistical confidence intervals around the estimates. We evaluate the performance of the proposed algorithms through simulation studies as well as application to real spiking data from the ferret's primary auditory cortex during a series of auditory tasks.
I. INTRODUCTION

Analyses of spiking activity recorded from sensory neurons have revealed three main features: first, neuronal activity is stochastic and exhibits significant variability across trials; second, the spiking statistics often undergo rapid changes, referred to as neuronal plasticity, in order to adapt to changing stimulus salience and behavioral context; and third, the tuning characteristics of sensory neurons to the stimuli exhibit a degree of sparsity. Despite significant progress in neuronal data analysis in recent years, a unified framework that simultaneously captures the stochasticity, dynamicity, and sparsity of neuronal data is lacking. We close this gap by integrating techniques from point process theory, adaptive filtering, and compressed sensing. To this end, we consider the problem of estimating time-varying stimulus modulation coefficients (e.g., receptive fields) from a sequence of binary observations in an online fashion [6]. We model the spiking activity by a conditional Bernoulli point process, where the conditional intensity is a logistic function of the stimulus and its time lags. We introduce a novel objective function by incorporating the forgetting factor mechanism of RLS-type algorithms into the ℓ1-regularized maximum likelihood estimation of the point process parameters. We develop two adaptive filters for recursive maximization of this objective function based on proximal gradient techniques.

Statistical testing and assigning measures of uncertainty to the parameter estimates of neuronal models is of great importance in the analysis of neural data. Although exact and asymptotic inference for ML estimators is well-established in classical statistics, it is highly challenging to compute measures of uncertainty for modern statistical problems with regularized ML estimators. Recently, a number of remarkable results in high-dimensional statistics [3]–[5] overcame this challenge by proposing methods to construct nearly optimal confidence regions for ℓ1-regularized ML estimates of linear models as well as generalized linear models (GLMs). The rationale behind these approaches is to characterize the bias introduced by ℓ1-regularization through a careful inspection of the KKT conditions. This procedure is referred to as de-sparsifying in [3] and de-biasing in [4]. In this paper, we adapt the de-sparsifying procedure of [3], based on nodewise regression, to the RLS-type forgetting factor-based log-likelihood, and obtain a recursive filter for nodewise regression based on the SPARLS algorithm [1].

Finally, we present simulation results as well as an application to multi-unit spiking data from the ferret primary auditory cortex (A1), recorded during passive stimulus presentation and during performance of a click rate discrimination task [12], in order to characterize the spectrotemporal receptive field (STRF) plasticity of A1 neurons. Application of our algorithm to these data provides new insights into the time course of attention-driven STRF plasticity, with over three orders of magnitude increase in temporal resolution, from minutes to centiseconds, while capturing the underlying sparsity in a robust fashion.

The paper is organized as follows. We present the preliminaries and notational conventions in Section II and introduce the adaptive filters in Section III, followed by the recursive construction of the confidence intervals in Section IV. Finally, simulation studies and application to real data are presented in Section V, followed by concluding remarks in Section VI.

II. PRELIMINARIES AND NOTATIONS
We consider a sequence of observations in the form of binary spike trains, obtained by discretizing continuous-time electrophysiology recordings using bins of length ∆. The binary spiking sequence is denoted by {n_i}_{i=1}^{K_s}. Such a binary-valued sequence can be modeled as conditionally independent Bernoulli trials with success/spiking probability given by p(n_i = 1) := λ_i∆, where λ_i is the discretized conditional intensity function (CIF) of the underlying point process [6]. Let the input stimulus sequence (covariate sequence) be denoted by {x_i}_{i=−M+1}^{K_s}. We adopt a GLM model with a logistic link given by:

\[
\lambda_i \Delta = \operatorname{logit}^{-1}\!\big(\mathbf{x}_i' \boldsymbol{\omega}_i\big), \tag{1}
\]
for i = 1, …, K_s, where x_i := [1, x_i, x_{i−1}, …, x_{i−M+2}]' is the covariate vector at the ith bin, ω_i := [μ_i, θ_i']' is the parameter vector characterizing the neuronal model, composed of a scalar parameter μ_i determining the baseline firing rate and a sparse modulation vector θ_i of length M − 1 modeling the effect of the stimulus on the spiking rate at the ith bin, and logit^{−1}(·) := exp(·)/(1 + exp(·)).

We consider a window-based dynamic parametric model, where the parameters change in a piecewise constant fashion over windows of length W bins. We partition the entire spiking data and covariates D := {n_i, x_i}_{i=1}^{K_s} into K := ⌈K_s/W⌉ windows of length W bins each, denoted by D_k := {n_k, X_k} for k = 1, 2, …, K, where the vector n_k := [n_{(k−1)W+1}, n_{(k−1)W+2}, …, n_{kW}]' and the matrix X_k := [x_{(k−1)W+1}, x_{(k−1)W+2}, …, x_{kW}]' are the spiking data and covariates within window k, respectively. Similarly, the spiking probability vector at window k is defined as λ_k∆ := [λ_{(k−1)W+1}∆, λ_{(k−1)W+2}∆, …, λ_{kW}∆]'. In order to explicitly show the dependence of the spiking rate on the parameter vector ω, we define λ_k(ω)∆ := logit^{−1}(X_k ω), with the notational convention that logit^{−1}(·) acts on a vector in an element-wise fashion. Our goal is to recursively solve an ℓ1-regularized maximum likelihood problem at window k, given by:

\[
\widehat{\boldsymbol{\omega}}_k = \underset{\boldsymbol{\omega}_k}{\operatorname{argmax}} \; P_\beta L(\boldsymbol{\omega}_k) - \gamma \|\boldsymbol{\omega}_k\|_1, \tag{2}
\]
where L(ω) := log p(n | X, ω) denotes the generic data log-likelihood function over a window, and the operator P_β f(ω) is defined for a function f : {0, 1}^W × R^{W×M} × R^M → R as the RLS-weighted empirical expectation with forgetting factor β as follows:

\[
P_\beta f(\mathbf{n}, \mathbf{X}, \boldsymbol{\omega}) := \sum_{i=1}^{k} \beta^{\,k-i} f(\mathbf{n}_i, \mathbf{X}_i, \boldsymbol{\omega}). \tag{3}
\]
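To make the model and the objective concrete, the following minimal Python sketch evaluates the logistic CIF of Eq. (1) and the forgetting-factor-weighted log-likelihood P_β L of Eq. (3) for a stream of windows. The function names and the direct (non-recursive) evaluation are illustrative assumptions on our part, not part of the recursive algorithms developed below.

```python
import numpy as np

def logistic_cif(X, omega):
    """Element-wise logit^{-1}(X @ omega): per-bin spiking probabilities lambda * Delta."""
    return 1.0 / (1.0 + np.exp(-X @ omega))

def weighted_log_likelihood(windows, omega, beta):
    """Direct evaluation of P_beta L(omega) in Eq. (3): a beta-weighted sum of
    per-window Bernoulli log-likelihoods (the most recent window gets weight 1)."""
    k = len(windows)
    total = 0.0
    for i, (n_i, X_i) in enumerate(windows, start=1):
        eta = X_i @ omega                          # linear predictor x' omega per bin
        # Bernoulli log-likelihood: n * eta - log(1 + exp(eta)), summed over the window
        ll = n_i @ eta - np.sum(np.logaddexp(0.0, eta))
        total += beta ** (k - i) * ll
    return total
```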
Using the definitions above, we have:

\[
P_\beta L(\boldsymbol{\omega}_k) = \sum_{i=1}^{k} \beta^{\,k-i} \sum_{j=1}^{W} \Big[ n_{i,j}\, \mathbf{x}_{i,j}' \boldsymbol{\omega}_k - \log\big(1 + \exp(\mathbf{x}_{i,j}' \boldsymbol{\omega}_k)\big) \Big],
\]

where n_{i,j} := n_{(i−1)W+j} and x_{i,j} := x_{(i−1)W+j} correspond to the jth bin of the ith window.

III. PROPOSED RECURSIVE ALGORITHMS: ℓ1-PPF0 AND ℓ1-PPF1

In a recent work [6], we developed two adaptive filtering algorithms to recursively solve the ℓ1-regularized maximum likelihood estimation problem of Eq. (2) using proximal gradient techniques. We define the point process innovation at window k by ε_k(ω) := n_k − λ_k(ω)∆, evaluated at ω. We also denote by Λ_k(ω) := diag(λ_k(ω)∆ ⊙ (1_W − λ_k(ω)∆)) the diagonal matrix whose diagonal elements correspond to the derivative of the logistic function, where the operator ⊙ denotes component-wise multiplication, and 1_W is the all-one vector of length W.

Algorithms 1 and 2 summarize the two adaptive filter recursions at window k, which we refer to as the ℓ1-regularized Point Process Filter of the Zeroth Order (ℓ1-PPF0) and of the First Order (ℓ1-PPF1), respectively. The order refers to the number of Taylor series expansion coefficients retained for recursive computation of the cost function gradient vector (see [6] for details). The computational complexities of the ℓ1-PPF0 and ℓ1-PPF1 algorithms are linear and quadratic in M (the length of the parameter vector) per iteration, respectively. Both algorithms recursively update the estimates upon the arrival of the new input data D_k := {n_k, X_k}. Recursive update rules for the gradient vector g_k are given in line 3 of Algorithm 1 and lines 3–5 of Algorithm 2. The main proximal shrinkage step is given in line 4 of Algorithm 1 and line 6 of Algorithm 2, where S_{γα}[·] denotes the soft thresholding operator at level γα, defined element-wise as (S_{γα}[x])_i := sgn(x_i)(|x_i| − γα)_+, with (x)_+ := max(x, 0); γ is the regularization parameter tuning the trade-off between model fit and sparsity, and α is a step-size parameter. At each window k, we use a warm-start initialization given by ω̂_k^{(0)} = ω̂_{k−1}^{(L)}.

Algorithm 1: ℓ1-PPF0
Inputs: D_k, g_{k−1}, ω̂_k^{(0)}, and L.
1: for ℓ = 0, …, L − 1 do
2:   λ_k∆ = logit^{−1}(X_k ω̂_k^{(ℓ)})
3:   g_k = β g_{k−1} + X_k' ε_k
4:   ω̂_k^{(ℓ+1)} = S_{γα}[ω̂_k^{(ℓ)} + α g_k]
5: end for
Output: ω̂_k := ω̂_k^{(L)}.

Algorithm 2: ℓ1-PPF1
Inputs: D_k, u_{k−1}, B_{k−1}, ω̂_k^{(0)}, and L.
1: for ℓ = 0, …, L − 1 do
2:   λ_k∆ = logit^{−1}(X_k ω̂_k^{(ℓ)})
3:   u_k = β u_{k−1} + X_k'(ε_k + Λ_k X_k ω̂_k^{(ℓ)})
4:   B_k = β B_{k−1} + X_k' Λ_k X_k
5:   g_k = u_k − B_k ω̂_k^{(ℓ)}
6:   ω̂_k^{(ℓ+1)} = S_{γα}[ω̂_k^{(ℓ)} + α g_k]
7: end for
Output: ω̂_k := ω̂_k^{(L)}.
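As an illustration, a minimal Python sketch of the ℓ1-PPF0 recursion follows. It assumes dense NumPy arrays; the function name and state-passing convention are ours, not from [6].

```python
import numpy as np

def soft_threshold(x, t):
    """Element-wise soft thresholding S_t[x] = sgn(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_ppf0_update(n_k, X_k, g_prev, omega, beta, gamma, alpha, L=1):
    """One window of the l1-PPF0 recursion (Algorithm 1).

    n_k:    (W,) binary spike counts in window k
    X_k:    (W, M) covariate matrix in window k
    g_prev: (M,) gradient state from window k-1
    omega:  (M,) warm-start estimate (previous window's output)
    """
    g_k = g_prev
    for _ in range(L):
        lam = 1.0 / (1.0 + np.exp(-X_k @ omega))   # logit^{-1}(X_k omega) = lambda_k * Delta
        eps = n_k - lam                            # point process innovation epsilon_k
        g_k = beta * g_prev + X_k.T @ eps          # recursive gradient update (line 3)
        omega = soft_threshold(omega + alpha * g_k, gamma * alpha)  # proximal step (line 4)
    return omega, g_k
```

The returned pair (omega, g_k) is carried over as the warm start and gradient state for window k + 1, mirroring the warm-start initialization described above.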
IV. RECURSIVE CHARACTERIZATION OF STATISTICAL CONFIDENCE INTERVALS

We denote by g_k(ω_k) and H_k(ω_k) the gradient vector and the Hessian of P_β L(ω_k), respectively, which can be expressed as:

\[
\mathbf{g}_k(\boldsymbol{\omega}_k) := P_\beta \nabla L(\boldsymbol{\omega}_k) = \sum_{i=1}^{k} \beta^{\,k-i} \mathbf{X}_i' \boldsymbol{\varepsilon}_i(\boldsymbol{\omega}_k), \tag{4}
\]
\[
\mathbf{H}_k(\boldsymbol{\omega}_k) := P_\beta \nabla^2 L(\boldsymbol{\omega}_k) = -\sum_{i=1}^{k} \beta^{\,k-i} \mathbf{X}_i' \boldsymbol{\Lambda}_i(\boldsymbol{\omega}_k) \mathbf{X}_i. \tag{5}
\]

The sparse estimator ω̂_k, as the maximizer of (2), satisfies the following KKT conditions:

\[
\mathbf{g}_k(\widehat{\boldsymbol{\omega}}_k) - \gamma \widehat{\mathbf{s}}_k = \mathbf{0}, \qquad \|\widehat{\mathbf{s}}_k\|_\infty \le 1, \tag{6}
\]

where ŝ_k ∈ ∂‖ω̂_k‖_1 is the subgradient vector from the subdifferential of the ℓ1 norm, with components (ŝ_k)_m = sgn((ω̂_k)_m) if (ω̂_k)_m ≠ 0 and |(ŝ_k)_m| ≤ 1 otherwise, for m = 1, 2, …, M. Replacing the log-likelihood P_β L(ω_k) in (2) by its quadratic approximation around the true parameter vector ω_k^0, and inverting the corresponding KKT conditions, the de-sparsified estimator ŵ_k is given by [3]:

\[
\widehat{\mathbf{w}}_k := \widehat{\boldsymbol{\omega}}_k - \widehat{\boldsymbol{\Theta}}_k\, \mathbf{g}_k(\widehat{\boldsymbol{\omega}}_k), \tag{7}
\]
where the matrix Θ̂_k is an approximate inverse of the Hessian, estimated by performing a nodewise regression on the matrix H_k(ω̂_k) [3]. Each row (Θ̂_k)_m, m = 1, 2, …, M, is constructed separately based on the solution to the following optimization problem:

\[
\widehat{\boldsymbol{\psi}}_m := \underset{\boldsymbol{\psi} \in \mathbb{R}^{M-1}}{\operatorname{argmin}} \; \boldsymbol{\psi}' (\mathbf{H}_k)_{\setminus m, \setminus m}\, \boldsymbol{\psi} - 2 (\mathbf{H}_k)_{m, \setminus m}\, \boldsymbol{\psi} + 2 \gamma_m \|\boldsymbol{\psi}\|_1, \tag{8}
\]

where (H_k)_{m,\m} denotes the mth row of H_k excluding the mth diagonal element, and (H_k)_{\m,\m} is the submatrix obtained by eliminating the mth row and column. Note that we have suppressed the dependence of H_k(ω̂_k) on ω̂_k for notational simplicity. Subsequently, the mth row of Θ̂_k is constructed as:

\[
(\widehat{\boldsymbol{\Theta}}_k)_m = \frac{1}{\widehat{\tau}_m^2}\, \widehat{\mathbf{C}}_m, \tag{9}
\]

where Ĉ_m is a vector of length M defined as
\[
(\widehat{\mathbf{C}}_m)_m = 1, \qquad (\widehat{\mathbf{C}}_m)_{\setminus m} = -\widehat{\boldsymbol{\psi}}_m, \tag{10}
\]

and τ̂_m² is a scaling factor computed as

\[
\widehat{\tau}_m^2 = (\mathbf{H}_k)_{m,m} - (\mathbf{H}_k)_{m,\setminus m}\, \widehat{\boldsymbol{\psi}}_m, \tag{11}
\]
with (H_k)_{m,m} denoting the mth diagonal element of H_k. Following Theorem 3.1 in [3], the unbiased estimator ŵ_k converges weakly to a Gaussian random vector as K → ∞ and β → 1, under mild conditions. The asymptotic pointwise confidence region for the mth component (ω_k^0)_m is then given by:

\[
\mathrm{CR}_{k,m} := \Big[ (\widehat{\mathbf{w}}_k)_m \pm \Phi^{-1}(1 - \eta/2)\, \widehat{\sigma}_{k,m} \Big], \tag{12}
\]

where CR_{k,m} denotes the confidence interval for the mth component of ω̂_k at window k at a significance level η, and σ̂_{k,m}² is given by:

\[
\widehat{\sigma}_{k,m}^2 := \big(\operatorname{cov}(\widehat{\mathbf{w}}_k)\big)_{m,m} = (\widehat{\boldsymbol{\Theta}}_k)_m\, \mathbf{G}_k(\widehat{\boldsymbol{\omega}}_k)\, (\widehat{\boldsymbol{\Theta}}_k)_m', \tag{13}
\]
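For concreteness, the following Python sketch applies Eqs. (7), (12), and (13) to form the de-sparsified estimate and a pointwise confidence interval for one component, given Θ̂_k, g_k, and G_k from the recursions described below. The function name and the use of SciPy's normal quantile are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def confidence_interval(m, omega_hat, Theta_hat, g_k, G_k, eta=0.05):
    """Pointwise CI for component m via the de-sparsified estimator.

    Eq. (7):  w_hat_m = omega_hat_m - (Theta_hat)_m @ g_k
    Eq. (13): sigma^2_{k,m} = (Theta_hat)_m @ G_k @ (Theta_hat)_m'
    Eq. (12): CI = w_hat_m +/- Phi^{-1}(1 - eta/2) * sigma_{k,m}
    """
    w_m = omega_hat[m] - Theta_hat[m] @ g_k             # de-sparsified component (Eq. 7)
    sigma = np.sqrt(Theta_hat[m] @ G_k @ Theta_hat[m])  # asymptotic std. dev. (Eq. 13)
    z = norm.ppf(1.0 - eta / 2.0)                       # e.g., ~1.96 for eta = 0.05
    return w_m - z * sigma, w_m + z * sigma             # Eq. (12)
```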
where the matrix G_k(ω_k), in our exponentially weighted empirical averaging setting, takes the form:

\[
\mathbf{G}_k(\boldsymbol{\omega}_k) := P_{\beta^2} \big[\nabla L \nabla L'\big](\boldsymbol{\omega}_k) = \sum_{i=1}^{k} \beta^{\,2(k-i)} \mathbf{X}_i' \boldsymbol{\varepsilon}_i(\boldsymbol{\omega}_k)\, \boldsymbol{\varepsilon}_i(\boldsymbol{\omega}_k)' \mathbf{X}_i. \tag{14}
\]

The fact that g_k, H_k, and G_k all depend on the estimator ω̂_k at window k, along with the non-linearity of the logistic link function, makes a recursive update rule challenging. We overcome this challenge by using a Taylor series expansion of the logistic function, similar to that used in the development of our proposed adaptive filters [6]. The recursive update rule for the gradient vector g_k is already available from the adaptive filter recursions of ℓ1-PPF0 (line 3) and ℓ1-PPF1 (lines 3–5). We use the zeroth order expansion for both matrices H_k(ω̂_k) and G_k(ω̂_k) to obtain low-complexity recursive rank-W update rules at window k as follows:

\[
\mathbf{H}_k = \beta \mathbf{H}_{k-1} - \mathbf{X}_k' \boldsymbol{\Lambda}_k \mathbf{X}_k, \tag{15}
\]
\[
\mathbf{G}_k = \beta^2 \mathbf{G}_{k-1} + \mathbf{X}_k' \boldsymbol{\varepsilon}_k \boldsymbol{\varepsilon}_k' \mathbf{X}_k, \tag{16}
\]

where Λ_k and ε_k are evaluated at the latest estimate ω̂_k. It can readily be seen that the zeroth order approximation to the Hessian equals −B_k, whose recursive update rule is given in line 4 of Algorithm 2.

Using the recursive update rules (15) and (16), we can use the SPARLS algorithm to recursively update Θ̂_k. The SPARLS algorithm [1] was introduced for solving ℓ1-regularized least squares problems, and belongs to the class of iterative shrinkage algorithms [8]–[10]. The lth iteration of the SPARLS algorithm for solving (8) takes the following form:

\[
\widehat{\boldsymbol{\psi}}_m^{(l+1)} = S_{\gamma_m \alpha_m}\Big[ \widehat{\boldsymbol{\psi}}_m^{(l)} + \alpha_m \big( (\mathbf{H}_k)_{m,\setminus m} - (\mathbf{H}_k)_{\setminus m, \setminus m}\, \widehat{\boldsymbol{\psi}}_m^{(l)} \big)' \Big], \tag{17}
\]

where l = 0, 1, …, L − 1 is the iteration index, and γ_m and α_m are the corresponding regularization and step-size parameters for the mth row, respectively. Both parameters (γ_m, α_m) can be chosen to be of the same order of magnitude as their counterparts (γ, α) in the main filtering algorithm. The remaining steps of the nodewise regression can be carried out in a straightforward fashion by using ψ̂_m^{(L)} from Eq. (17) as an approximation to ψ̂_m in Eq. (8).

Algorithm 3: Adaptive Component-wise Confidence Regions for the sparse estimator ω̂_k
Inputs: D_k, g_k, H_k, ω̂_k, G_{k−1}, A_k, ψ̂_k^{(0)}, η, and L.
1: G_k = β² G_{k−1} + X_k' ε_k ε_k' X_k
2: for m ∈ A_k do
3:   for l = 0, …, L − 1 do
4:     ψ̂_m^{(l+1)} = S_{γ_m α_m}[ψ̂_m^{(l)} + α_m((H_k)_{m,\m} − (H_k)_{\m,\m} ψ̂_m^{(l)})]
5:   end for
6:   τ̂_m² = (H_k)_{m,m} − (H_k)_{m,\m} ψ̂_m^{(L)}
7:   (Θ̂_k)_m = (1/τ̂_m²) Ĉ_m
8:   σ̂_{k,m}² := (Θ̂_k)_m G_k (Θ̂_k)_m'
9:   (ŵ_k)_m = (ω̂_k)_m − (Θ̂_k)_m g_k
10:  CR_{k,m} := [(ŵ_k)_m ± Φ^{−1}(1 − η/2) σ̂_{k,m}]
11: end for

Algorithm 3 summarizes our proposed procedure for adaptive computation of statistical confidence regions for the sparse estimate ω̂_k at window k, with inputs D_k, the gradient vector g_k, the Hessian matrix H_k, the sparse estimate ω̂_k from the adaptive filter, G_{k−1}, and an index set A_k. The set A_k ⊆ {1, …, M} contains the indices of the components for which confidence regions are to be computed at window k. The recursive nodewise regression procedure for constructing the mth row of Θ̂_k is given in lines 2–7. Finally, the pointwise confidence region for ω_{k,m}^0 is given in line 10.
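A minimal Python sketch of the inner loop of Algorithm 3 (Eq. (17) followed by Eqs. (9)–(11)) is given below. It operates on a dense H_k and treats the per-row parameters γ_m and α_m as given scalars; the function and variable names are of our choosing.

```python
import numpy as np

def nodewise_theta_row(H, m, gamma_m, alpha_m, L=10, psi0=None):
    """Construct row m of Theta_hat by iterative shrinkage (Eq. 17), then Eqs. (9)-(11).

    H: (M, M) Hessian estimate at window k (here: -B_k from Algorithm 2).
    """
    M = H.shape[0]
    idx = np.delete(np.arange(M), m)          # all indices except m
    H_mm = H[m, m]                            # (H_k)_{m,m}
    H_m_rest = H[m, idx]                      # (H_k)_{m,\m}, row without the diagonal
    H_rest = H[np.ix_(idx, idx)]              # (H_k)_{\m,\m}, submatrix
    psi = np.zeros(M - 1) if psi0 is None else psi0
    for _ in range(L):                        # SPARLS-style shrinkage iterations (Eq. 17)
        step = psi + alpha_m * (H_m_rest - H_rest @ psi)
        psi = np.sign(step) * np.maximum(np.abs(step) - gamma_m * alpha_m, 0.0)
    tau2 = H_mm - H_m_rest @ psi              # scaling factor (Eq. 11)
    C = np.ones(M)
    C[idx] = -psi                             # C_m with (C_m)_m = 1 (Eq. 10)
    return C / tau2                           # (Theta_hat)_m (Eq. 9)
```

In the adaptive setting, psi0 would be warm-started from the previous window's solution, in the same spirit as the warm starts used by ℓ1-PPF0 and ℓ1-PPF1.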
V. APPLICATIONS
In this section, we apply the proposed recursive filters to spike trains from simulated data as well as real data recordings from the ferret's A1 during attentive behavior.

A. Application to Simulated Data

We consider an observation period of length T = 300 s, discretized by a time bin of length ∆ = 1 ms, resulting in K_s = 300000 bins. We consider a window length of W = 1 bin, so that K = K_s. The binary spike train {n_k}_{k=1}^K is generated as a realization of conditionally independent Bernoulli trials with success probabilities λ_k∆. The spiking activity is generated using a time-varying GLM model with logistic link and covariates {x_k}_{k=−M+1}^K drawn i.i.d. from a Gaussian distribution N(0, σ²). We select the baseline parameter of the GLM as μ = −2.51 and the stimulus variance as σ² = 0.01, which sets the average spiking rate to λ_avg∆ = 0.13 ≪ 1, small enough to ensure that the Bernoulli approximation is valid for the underlying point process. All simulations are run with L = 1 iteration per window. We compute the confidence regions at a significance level of 5% (η = 0.05), for which Φ^{−1}(0.975) ≃ 1.96.

For the parameter vector ω_k = (μ_k, θ_k), we select a sparse modulation vector θ_k of length M = 100 with an initial support Supp(θ_0) = {1, 10, 20} of size s = 3 and respective initial values (θ_k)_{1,10,20} = {+10, −5, +5}. All components remain constant for the first 100 s, followed by a linear decline of the most significant component (θ_k)_1 to 0 over the next 100 s, with no change occurring in the final 100 s segment of the experiment. The parameter evolution is shown by the colored dashed traces in Figure 1.

Figure 1 shows the in-support components of the estimated parameter vectors and their corresponding confidence intervals for the ℓ1-PPF1 (Fig. 1-A) and ℓ1-PPF0 (Fig. 1-B) algorithms. The hyper-parameters β, α, and γ used for both algorithms are included in Figure 1 and were obtained by two-fold even-odd cross-validation. The colored hulls around the estimated values show the 95% confidence regions. A comparison of Figs. 1-A and 1-B reveals that the first-order ℓ1-PPF1 algorithm provides smoother estimates with smaller confidence intervals, at the cost of higher computational complexity. Additionally, the true values of (θ_k)_{m=1}^M (colored circles), the estimated components (θ̂_k)_{m=1}^M (black dots), and the confidence intervals CR_{k,m} (black bars) for the ℓ1-PPF1 algorithm are shown in Figs. 2-A, 2-B, and 2-C for three selected time bins k = {100, 150, 200 s}/∆, respectively.

Fig. 1. Sparse estimates and their confidence regions for the A) ℓ1-PPF1 and B) ℓ1-PPF0 algorithms. The true components (θ_k)_{1,10,20} and their estimates are shown by the dashed and solid traces, respectively. The colored hulls around the estimated parameters show the 95% confidence regions.

Fig. 2. Confidence intervals for the estimated sparse modulation vector at three time snapshots, A) t1 = 100 s, B) t2 = 150 s, and C) t3 = 200 s, for the ℓ1-PPF1 algorithm. Each panel shows the true components (colored circles), the estimated values (black dots), and the 95% confidence intervals (black vertical bars).
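The simulated spike train above can be reproduced with a few lines of Python. The sketch below generates the Bernoulli observations from the time-varying logistic GLM; the ramp schedule for (θ_k)_1 follows the description above, and the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
T, dt, M = 300.0, 1e-3, 100
K = int(T / dt)                                # 300000 bins (W = 1 bin per window)
mu, sigma2 = -2.51, 0.01

theta = np.zeros(M - 1)
theta[[0, 9, 19]] = [10.0, -5.0, 5.0]          # support {1, 10, 20} (1-indexed)
buf = rng.normal(0.0, np.sqrt(sigma2), M - 1)  # [x_i, x_{i-1}, ..., x_{i-M+2}]

n = np.zeros(K, dtype=int)
for k in range(K):
    t = k * dt
    # (theta_k)_1: constant at 10 until 100 s, linear decline to 0 by 200 s, then 0
    theta[0] = 10.0 * np.clip((200.0 - t) / 100.0, 0.0, 1.0) if t > 100.0 else 10.0
    lam_dt = 1.0 / (1.0 + np.exp(-(mu + theta @ buf)))   # logistic CIF (Eq. 1)
    n[k] = rng.random() < lam_dt                         # Bernoulli spike draw
    buf = np.roll(buf, 1)
    buf[0] = rng.normal(0.0, np.sqrt(sigma2))            # next stimulus sample
```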
B. Application to Real Data

The functional properties of many auditory neurons can be expressed in terms of the spectrotemporal receptive field (STRF), a two-dimensional kernel in the time-frequency plane relating the neural response to the spectrogram of the acoustic stimulus. Studies in auditory processing have revealed that the receptive field properties of auditory neurons exhibit plasticity, changing in response to the behavioral demands of the underlying task. This phenomenon is referred to as task-related receptive field plasticity [12].

In order to obtain a precise characterization of STRF plasticity, we apply the proposed adaptive filter ℓ1-PPF1 to multi-unit spiking data recorded from the ferret's primary auditory cortex (A1) during attentive behavior under a click-rate discrimination task (see [12] for the details of the experiment). The total duration of the experiment, T = 1017 s, is binned by ∆ = 1 ms and segmented into windows of length W = 5 bins. We chose a forgetting factor of β = 0.9998 and a regularization parameter of γ = 20 by two-fold even-odd cross-validation, and a step size of α = 4(1 − β)/(MW σ̄²), where σ̄² is the average variance of the spectrogram components [6].

Figure 3, top row, shows five snapshots of the STRF estimates associated with five selected points in time, highlighting the task-related plasticity. The first and second pairs of STRF snapshots correspond to the first and second pre-active and post-active conditions, respectively, and the last one shows the long-term passive condition. The bottom row shows the time course of three selected STRF components, labeled A, B, and C in the left-most panel of the top row. The low-latency excitatory component (A, blue trace) shows a significant increase right after the active tasks, while the high-latency excitatory component (B, red trace) and the inhibitory component (C, green trace) stay relatively stable throughout the experiment. This result is consistent with the reported latency-shortening effect of the STRF in temporal auditory tasks [12], while improving the temporal resolution of the estimates from minutes [12] to centiseconds.
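As an illustration of how the covariate vectors are formed in this setting, the sketch below builds a lagged-spectrogram design matrix whose rows play the role of x_i in Eq. (1), so that the estimated modulation vector θ reshapes into an STRF over frequency and time lag. The 50 ms lag depth and the helper name are our assumptions, not specified parameters of the experiment.

```python
import numpy as np

def strf_design_matrix(spec, n_lags):
    """Rows are covariate vectors [1, flattened spectrogram lags] for each time bin.

    spec: (K, F) stimulus spectrogram, one frame per time bin.
    Returns X of shape (K, 1 + F * n_lags); a fitted theta of length F * n_lags
    reshapes to (n_lags, F), i.e., an STRF over time lag and frequency.
    """
    K, F = spec.shape
    padded = np.vstack([np.zeros((n_lags - 1, F)), spec])  # zero history before t = 0
    X = np.ones((K, 1 + F * n_lags))
    for i in range(K):
        # frames i, i-1, ..., i-n_lags+1 (most recent first), flattened
        X[i, 1:] = padded[i : i + n_lags][::-1].ravel()
    return X

# With 1 ms bins, n_lags = 50 covers lags up to 50 ms, consistent with the lag
# axis shown in Fig. 3 (our reading of the figure, not a stated parameter).
```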
Fig. 3. Adaptive estimation of spectrotemporal receptive fields from the spiking activity of the ferret's A1 under a click-rate discrimination task. Top row: snapshots of the estimated STRF at five selected time points during the experiment. Bottom row: the time course of three selected STRF components (A, B, and C in the left-most panel of the top row).

VI. CONCLUSION

In this paper, we considered recursive estimation of the time-varying parameter vectors of a logistic regression model for binary time series driven by continuous input. We constructed an objective function that integrates the tracking features of RLS-type algorithms with the sparsifying features of ℓ1-minimization. We developed two adaptive filters, with linear and quadratic complexity requirements respectively, for recursive maximization of the objective function in an online setting. Moreover, we characterized the statistical confidence regions for our estimates and devised a recursive procedure to compute them efficiently. Application of our filters to real data from the ferret primary auditory cortex provided a high-resolution characterization of the time course of spectrotemporal receptive field plasticity, with a three orders of magnitude increase in temporal resolution.
REFERENCES

[1] B. Babadi, N. Kalouptsidis, and V. Tarokh, "SPARLS: The sparse RLS algorithm," IEEE Transactions on Signal Processing, vol. 58, no. 8, pp. 4013–4025, 2010.
[2] L. Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.
[3] S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure, "On asymptotically optimal confidence regions and tests for high-dimensional models," The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014.
[4] A. Javanmard and A. Montanari, "Confidence intervals and hypothesis testing for high-dimensional regression," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2869–2909, 2014.
[5] C.-H. Zhang and S. S. Zhang, "Confidence intervals for low dimensional parameters in high dimensional linear models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 1, pp. 217–242, 2014.
[6] A. Sheikhattar, J. B. Fritz, S. A. Shamma, and B. Babadi, "Recursive sparse point process regression with application to spectrotemporal receptive field plasticity analysis," arXiv preprint arXiv:1507.04727, 2015.
[7] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, pp. 1436–1462, 2006.
[8] M. A. Figueiredo and R. D. Nowak, "An EM algorithm for wavelet-based image restoration," IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 906–916, 2003.
[9] A. M. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
[10] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
[11] L. Meier, S. van de Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
[12] J. Fritz, M. Elhilali, and S. Shamma, "Active listening: task-dependent plasticity of spectrotemporal receptive fields in primary auditory cortex," Hearing Research, vol. 206, no. 1, pp. 159–176, 2005.