30th Annual International IEEE EMBS Conference Vancouver, British Columbia, Canada, August 20-24, 2008
Neuronal Tuning in a Brain-Machine Interface during Reinforcement Learning

Babak Mahmoudi, Student Member, Jack DiGiovanna, Student Member, Jose C. Principe, Fellow, and Justin C. Sanchez, Member, IEEE

Abstract— In this research, we use neural tuning to quantify the neural representation of a prosthetic arm's actions in a new BMI framework based on reinforcement learning (RLBMI). We observed that, through closed-loop brain control, the neural representation changed to encode the robot actions that maximize rewards. This is an interesting result because, in our paradigm, robot actions are directly controlled by a Computer Agent (CA) whose reward states are compatible with the user's rewards. Through co-adaptation, neural modulation is used to establish the value of robot actions for achieving reward.
I. INTRODUCTION
Brain-machine interface (BMI) technologies provide an alternative means of communication and control that bypasses the natural sensory and motor physiologic pathways. We have recently introduced a new framework for BMIs (RLBMI) based on reinforcement learning, which continuously adapts with the user during brain control [1, 2]. In this paradigm, the user is rewarded for generating neural activations that produce behaviors which lead to task completion. Specifically, in our RLBMI, neural modulation is used to estimate a value function that helps choose actions in a grid world that maximize reward returns. We are interested in quantifying the changes in M1 relative to the changes in prosthetic control in the user's workspace environment [3]. Traditionally, changes in functional organization and neural representation have been quantified using the formalism of directional tuning [4, 5]. Directional tuning has been used in the past to provide two quantification metrics for motor tasks: tuning direction specifies what neurons are most correlated with (e.g. hand velocity), and tuning depth specifies how strong that correlation is. We seek to extend this theory to BMIs in a closed-loop brain control experiment where the user and the decoding algorithm are co-adapting through experience. We hypothesize that through skilled BMI use there will be a modification of the neural representation of the user of the BMI. The tuning will be used as a metric to quantify the changes in the neural representation and how much change has occurred.

This work was supported in part by the U.S. National Science Foundation under Grant #CNS-0540304, the Children's Miracle Network, and the UF Alumni Association Fellowship. B. Mahmoudi and J. DiGiovanna are with the Department of Biomedical Engineering, University of Florida, 106 BME Building, Gainesville, FL 32611 USA (e-mail: {babakm, jfd134}@ufl.edu). J. C. Principe is with the Departments of Electrical and Computer Engineering and Biomedical Engineering, NEB 451, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]). J. C. Sanchez is with the Department of Pediatrics, Division of Neurology, University of Florida, P.O. Box 100296, JHMHC, Gainesville, FL 32610 USA (e-mail: [email protected]).
Quantifying the neural representation in a BMI through directional tuning adds a new perspective compared to traditional behavioral motor learning experiments [6, 7]. In both situations, the user must learn how to control an appendage (natural or artificial). However, in a BMI the normal physiologic pathways are replaced by artificial actuators (e.g. robot, CA) which complete the task. Since the actuators are artificial, we have direct knowledge of all BMI control signals throughout learning. However, it remains unknown in the RLBMI how the normal neuronal activation is remapped to the decoding system variables and the new environmental workspace. Since the decoding evolves through experience, we have the opportunity to find causal relationships. Specifically, we address the following questions in the RLBMI context: Does neural tuning direction change as the user learns to control the robot? Does neuronal tuning depth change throughout learning?

II. METHODS

A. RLBMI Experiment

We briefly summarize the experimental paradigm, neural data acquisition, and the computational framework of the RLBMI [1, 2]. To investigate the RLBMI, we designed an operant conditioning paradigm in which three male Sprague-Dawley rats were trained to control a robotic arm and maneuver it to press levers in the robot workspace (see Fig. 1 for an overview). The RLBMI decodes robot actions from the user's neuronal modulations. These modulations were recorded bilaterally with 16 microwire electrodes chronically implanted in each hemisphere of primary motor cortex (MI), for a total of 32 electrodes. Single-neuron action potentials were detected and sorted using standard techniques [8], and 29 single units were discriminated for the rat in this study. Surgical details are given in [9] and signal acquisition and processing details in [1]. Modulations consist of the firing rate of each discriminated neuron, estimated in non-overlapping 100 ms bins. We examine tuning in the brain-control portion of this experiment; a time line of this control is given in Fig. 2. Briefly, the rats were required to control the robotic arm (see Fig. 1) using only neuronal modulations from the single units we identified. A target side was randomly selected, and the trial was labeled as left or right accordingly. Every 100 ms, the CA must select which robot control action to take given the
rat's modulations (CA learning is shown in Fig. 3). If the CA selects a sequence of actions that maneuvers the arm proximal [1] to the target lever and the arm crosses a threshold around the target, the trial is considered a success. After successful trials, the rat receives a water reward and the robot resets to the initial position. Once the rat and CA achieve at least 60% success on both left and right targets, the task difficulty is increased. By changing the threshold, the robotic arm must be maneuvered over a longer distance from the starting position, hence the increased level of difficulty. Specifically, we force the rat and CA to move the robot closer to the target before earning a reward. By increasing task difficulty, we can study both rat and CA learning.
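As a concrete illustration of how the 100 ms neural states described above can be formed, the following is a minimal binning sketch. It assumes spike timestamps in seconds for each sorted unit; the function and variable names, the session length, and the stacking into a state matrix are illustrative rather than taken from the paper.

```python
import numpy as np

def bin_firing_rates(spike_times, session_length_s, bin_size_s=0.1):
    """Firing rate of one sorted unit in non-overlapping 100 ms bins (spikes/s)."""
    edges = np.arange(0.0, session_length_s + bin_size_s, bin_size_s)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts / bin_size_s

# Hypothetical usage: stack the 29 discriminated units into a (bins x units)
# matrix; each row is the neural state the CA reads every 100 ms.
# states = np.column_stack([bin_firing_rates(st, 3600.0) for st in unit_spike_times])
```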
The RLBMI is an interesting architecture because both the user and the CA learn to solve the reaching task. The CA and user are encouraged to cooperatively learn (also known as co-adaptation [1, 10]) by assigning a common reward to both of them. The user's only way to control the robot is to modulate the discriminated neurons; the user may not realize immediately that they have this control. The CA can only control the robot by learning a mapping between the user's current neural modulation and robot actions that maximizes its own reward. After we increase the task difficulty, neither the user nor the CA initially achieves maximal rewards [1]. Instead, they must co-adapt to find a successful strategy [1]. Here the adaptations of the user and the CA symbiotically help in solving the task. To understand why this occurs, we present a brief overview of exactly how the CA learns. The CA is trained with RL, which is a learning algorithm for decision making in goal-based tasks. RL differs from other paradigms in that it learns through interaction with the environment rather than from a specific training signal [11].
Fig. 1. Experiment setup.
Fig. 2. Trial timing.

We modeled the CA's robot control problem as a Markov Decision Process (MDP), characterized by neural modulations as states s and discrete robot movements as actions a (see Fig. 3). Each action taken in a particular state changes the state of the environment with a certain probability; this transition probability is given in (1). Additionally, the CA expects a reward r when taking an action in a given state; this expected reward is expressed in (2). These transition and reward functions are unknown; therefore we used RL to learn approximations of (1) and (2) from observations. Once the CA has a good estimate of (2), it can choose the actions that maximize reward. Specifically, we used Q(λ) learning [11] to approximate the value derived from (2), and implemented this learning with an MLP neural network that maps state-action pairs to their expected value [1].

P_{ss'}^{a} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}   (1)

R_{s}^{a} = E\{ r_{t+1} \mid s_t = s, a_t = a \}   (2)

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]   (3)

The network is trained through (3), where Q is the state-action value function; the CA's control ability is therefore a function of the MLP training. For the RLBMI, the reward distribution in the robot workspace is defined as 1 at the targets and -0.01 everywhere else [1].
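To make the decoding loop concrete, below is a minimal sketch of one brain-control trial using Watkins's Q(λ) with a linear value-function approximator standing in for the paper's MLP. The environment object env, the hyperparameters, the trial length, and the epsilon-greedy exploration scheme are illustrative assumptions; the 27 actions correspond to Table I and the +1 / -0.01 rewards follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_UNITS, N_ACTIONS = 29, 27                    # 29 sorted units; 27 robot actions (Table I)
ALPHA, GAMMA, LAM, EPS = 0.05, 0.9, 0.7, 0.1   # illustrative hyperparameters

# Q(s, a) = w[a] . s : one weight vector per action (linear stand-in for the MLP).
w = np.zeros((N_ACTIONS, N_UNITS))
e = np.zeros_like(w)                           # eligibility traces

def q_values(state):
    return w @ state                           # Q(s, a) for every action

def run_trial(env, max_steps=200):
    """One brain-control trial: every 100 ms bin, pick an action and update Q."""
    e[:] = 0.0
    state = env.reset()                        # firing-rate vector for the first bin
    for _ in range(max_steps):
        q = q_values(state)
        greedy = int(np.argmax(q))
        action = greedy if rng.random() > EPS else int(rng.integers(N_ACTIONS))

        next_state, reward, done = env.step(action)  # +1 at the target lever, -0.01 otherwise

        e[action] += state                     # accumulate trace for the chosen (s, a)
        target = reward + (0.0 if done else GAMMA * np.max(q_values(next_state)))
        delta = target - q[action]
        w[:] = w + ALPHA * delta * e

        # Watkins's Q(lambda): decay traces, cut them after an exploratory action.
        e[:] = GAMMA * LAM * e if action == greedy else 0.0

        if done:
            break
        state = next_state
```

With the paper's MLP, the weight matrix w would be replaced by the network weights and the trace e by the eligibility of those weights (the gradient of Q with respect to them).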
Fig. 3. Co-adaptive RLBMI.

B. Neural Tuning in RLBMI

We build upon the classical formulation for computing the tuning direction, which measures neuronal firing rates given a particular kinematic variable. While co-adaptation occurs throughout brain control of the prosthetic arm, we assume that changes in the tuning function vary smoothly within a session. This was assessed by observing the weight tracks of the value function and the rate of reward returns for each session. For the data reported here, no abrupt changes were observed, and we use the timescale of the entire session to compute the tuning. Neural tuning was computed for the robot control actions that the CA had taken at each time step. Tuning curves were constructed for each action by taking the mean instantaneous firing rate over all instances of the CA taking that action. This is similar to spike-triggered averaging, but the trigger for averaging is now the applied control action.
TABLE I: RLBMI ACTIONS

L   Left        RB  Right-Back    BLU  Back-Left-Up
R   Right       LU  Left-Up       BRU  Back-Right-Up
F   Forward     RU  Right-Up      FRD  Fwd-Right-Down
B   Back        LD  Left-Down     FRU  Fwd-Right-Up
U   Up          RD  Right-Down    BRD  Back-Right-Down
D   Down        BD  Back-Down     FLD  Fwd-Left-Down
LF  Left-Fwd    BU  Back-Up       FLU  Fwd-Left-Up
RF  Right-Fwd   FD  Fwd-Down      BLD  Back-Left-Down
LB  Left-Back   FU  Forward-Up    St   Stay
Actions were ordered from left to right on the x-axis of the action tuning curves: actions on the left side of the plots had a left component, those on the right side had a right component, and actions in the middle had no lateral component. Action tuning labels are presented in Table I.
We have used tuning depth and direction as scalar descriptors of the tuning curves. However, unlike classic methods for computing the tuning depth [12], we could not normalize the tuning depth by the standard deviation of the firing rate. In the RLBMI, the number of actions varies between sessions; therefore, sessions with few actions would have a heavily biased estimate of the standard deviation, and this bias could distort trends in the tuning curves. To avoid this problem, the tuning depth of each neuron was computed by taking the difference between the maximum and minimum of the tuning curve and normalizing it by the area under the tuning curve (the sum of mean firing rates).
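The action tuning computation described above can be summarized in a short sketch: for each of the 27 actions, average the unit's instantaneous firing rate over every bin in which the CA took that action, then normalize the curve's range by its area. Function and variable names are illustrative.

```python
import numpy as np

def action_tuning(firing_rate, actions, n_actions=27):
    """Tuning curve, preferred action, and tuning depth of one unit.

    firing_rate: per-bin firing rate of the unit (one value per 100 ms bin).
    actions:     the action index the CA selected in each bin.
    """
    firing_rate = np.asarray(firing_rate, dtype=float)
    actions = np.asarray(actions)

    # Mean rate over every bin in which a given action was taken
    # (action-triggered averaging); NaN if the action never occurred.
    curve = np.array([firing_rate[actions == a].mean() if np.any(actions == a) else np.nan
                      for a in range(n_actions)])

    taken = curve[~np.isnan(curve)]
    # Depth = (max - min) of the curve, normalized by the area under the curve
    # (sum of mean rates) rather than by the firing-rate standard deviation.
    depth = (taken.max() - taken.min()) / taken.sum()
    preferred = int(np.nanargmax(curve))
    return curve, preferred, depth
```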
III. RESULTS

While co-adapting with the CA, each rat achieved control that was significantly better than chance for all task complexities. Chance performance was calculated using five sets of 10,000 simulated brain-control trials with random action selection. RLBMI average performance (over difficulties and targets) was 68%, 74%, and 73% for rats 1, 2, and 3, respectively (average chance performance was 14.5%) [1]. The results here are based on three representative neurons of rat 3, which completed the most difficult level of the task in 3 sessions. On average, each session took about 1 hour and comprised 77 left and 79 right trials. This rat achieved the highest brain-control performance among the subjects at this difficulty level [1]. Table II summarizes the action tuning depths corresponding to the action tuning curves and the overall performance of the animal for left and right trials. For a given session and neuron, the most tuned action and its corresponding tuning depth value were computed. All sessions were initialized with the trained CA model from the previous session. Neural tunings with respect to CA actions were computed separately for successful left and right trials. The RLBMI allows for exploratory actions [1]; however, these actions were independent of neural modulations and were excluded from the analysis.
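A sketch of the kind of chance-level estimate described above is given below: five sets of 10,000 simulated trials in which every action is drawn uniformly at random. The simulate_trial callable is a hypothetical stand-in for the trial simulator, whose workspace geometry and trial length are not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)

def chance_performance(simulate_trial, n_sets=5, trials_per_set=10_000):
    """Estimate chance success: random action selection, no neural input.

    simulate_trial: callable(policy) -> bool, True if the randomly driven
    robot crosses the reward threshold (hypothetical stand-in).
    """
    random_policy = lambda state: int(rng.integers(27))  # uniform over the 27 actions
    rates = [np.mean([simulate_trial(random_policy) for _ in range(trials_per_set)])
             for _ in range(n_sets)]
    return float(np.mean(rates)), rates
```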
TABLE II: ACTION TUNING DEPTHS

Trials   Session/Performance   N03/Action    N19/Action    N23/Action
Left     1/56%                 0.0959  L     0.2015  FLU   0.2373  LF
Left     2/61%                 0.0467  R     0.1589  FRU   0.1920  FRU
Left     3/90%                 0.0458  L     0.4410  FRU   0.4649  FRU
Right    1/75%                 0.1002  R     0.2430  FRU   0.1904  FLU
Right    2/42%                 0.0386  R     0.1423  FLU   0.2479  FRU
Right    3/78%                 0.1918  FRU   0.5643  FRU   0.1075  FRU
Fig. 4. Action tuning curves of 3 neurons for (a) left and (b) right trials.

Figs. 4a and 4b show the action tuning curves of 3 neurons for successful left and right trials, respectively. The actions in these figures were selected exclusively by the CA based on their estimated value at each time step throughout the session. Although we recorded 29 neurons from this rat, we present only two representative neuron types. The first is an example of a fast switching neuron (Neuron 3), which changes its representation for left or right trials. The second is an example of slow switching neurons (Neurons 19 and 23), which change their tuning representation over sessions. This behavior shows redundancy in the population representation. Specifically, using the results in Table II, we see that Neuron 3 is most deeply tuned to an action with a left component for left trials and to an action with a right component for right trials, for all trials except those of session 2 (left). This neuron is characterized by a relatively shallow overall tuning depth. In contrast, we see from the table that Neurons 19 and 23 both changed their tuning direction from actions with left components to actions with right components between sessions (FLU to FRU). These neurons are characterized by much deeper tuning. The slow switching neurons also
showed behavior that was tuned to the same actions for both left and right trials. For example, FRU is selected for both left and right trials. Over these three sessions, we can see that the number of actions selected by the CA decreased (left: 6 to 4; right: 4 to 2) to a subset of effective actions for accomplishing the task. These modeling changes are also accompanied by increases in tuning depth, as shown above. This implies that both the animal and the CA have converged to a stationary point on the joint performance surface.

IV. CONCLUSION

In this work, we determined that co-adaptation produces two advantages compared to traditional BMI. First, as experience is gained between the user and the CA, the neural response of the user is shaped and the set of actions the CA uses is refined such that sequences of optimal actions are used most frequently [1]. This leads to solving the task more quickly and efficiently. Second, the tuning for these actions becomes deeper, indicating that the user's internal representation of these actions is strengthened. Both of these observations indicate that co-adaptation is useful for tailoring BMI control over long-term use. CA actions were an intuitive variable to use for computing the neural tuning; however, neural tuning could be computed with respect to many variables in the RLBMI. Based on this preliminary study, we can classify neurons into two main groups. The first group consists of neurons that are not deeply tuned to their preferred actions but are able to switch between preferred actions according to the cue (i.e., target side). The other group consists of neurons that are deeply tuned to only one action irrespective of the side. These neurons did not switch their preference based on the cue within a session, but they can switch their preferred actions over different sessions. In other words, neurons in the RLBMI adapt their modulation on different time scales, and the rate of their adaptation is inversely proportional to the depth of their tuning. From the tuning curves of different neurons, we observed that several neurons had very similar tuning curves (e.g. Neurons 19 and 23). This similarity implies that neurons with similar tuning curves might belong to the same populations. However, as the level of task difficulty increased, populations of neurons changed their tuning to solve the task. During learning, these populations did not have the same members over different sessions. This may imply that a population of neurons contributes to a certain action in the RLBMI, but these populations are not fixed during learning. This hypothesis raises a fundamental question: do neurons change their preference based upon their membership in a certain population, or are populations formed simply as collections of neurons with similar preferences? In this study, we investigated tuning through a BMI system to show the causation between the modulation of certain neurons (input space) and output kinematics (actions in robot space). However, when interpreting Fig. 4 it is
important to recognize that only specific temporal sequences of actions will complete this task. For example, from the tuning curves and Table II we can see that Neuron 23 in the last session is deeply tuned to a right-side action (action 22) for left trials. This is because, in this session, the strategy to reach the left target was to first maneuver towards the right (action 22) and then maneuver left (action 1). That strategy explains how Neuron 23 could be deeply tuned to a right action for left trials. A potential confound not addressed in this study is the time embedding of the neural states (a 3-tap Gamma structure at the input of the CA's value function estimator) [1]. Future studies will address the temporal structure in the tuning curves by considering the history of firing, rather than looking at instantaneous firing.

REFERENCES
[1] J. DiGiovanna, B. Mahmoudi, J. Fortes, J. C. Principe, and J. C. Sanchez, "Co-adaptive brain-machine interface via reinforcement learning," IEEE Trans. Biomed. Eng., 2008.
[2] J. DiGiovanna, B. Mahmoudi, J. Mitzelfelt, J. C. Sanchez, and J. C. Principe, "Brain-machine interface control via reinforcement learning," in IEEE EMBS Conf. Neural Engineering, 2007.
[3] E. M. Schmidt, "Single neuron recording from motor cortex as a possible source of signals for control of external devices," Ann. Biomed. Eng., vol. 8, pp. 339-349, 1980.
[4] A. Georgopoulos, J. Kalaska, R. Caminiti, and J. Massey, "On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex," J. Neurosci., vol. 2, pp. 1527-1537, 1982.
[5] A. P. Georgopoulos, R. E. Kettner, and A. B. Schwartz, "Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population," J. Neurosci., vol. 8, 1988.
[6] P. Bays and D. M. Wolpert, "Computational principles of sensorimotor control that minimise uncertainty and variability," J. Physiol., 2006, in press.
[7] K. Kording and D. M. Wolpert, "Bayesian decision theory in sensorimotor control," Trends Cogn. Sci., vol. 10, 2006.
[8] M. S. Lewicki, "A review of methods for spike sorting: the detection and classification of neural action potentials," Network: Computation in Neural Systems, vol. 9, 1998.
[9] J. C. Sanchez, N. Alba, T. Nishida, C. Batich, and P. R. Carney, "Structural modifications in chronic microwire electrodes for cortical neuroprosthetics: a case study," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, pp. 217-221, 2006.
[10] S. I. H. Tillery, D. M. Taylor, and A. B. Schwartz, "Training in cortical control of neuroprosthetic devices improves signal extraction from small neuronal ensembles," Rev. Neurosci., vol. 14, 2003.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] J. M. Carmena et al., "Learning to control a brain-machine interface for reaching and grasping by primates," PLoS Biology, vol. 1, 2003.