6th Annual International IEEE EMBS Conference on Neural Engineering San Diego, California, 6 - 8 November, 2013

Towards Efficient, Personalized Anesthesia using Continuous Reinforcement Learning for Propofol Infusion Control

Cristobal Lowery1 and A. Aldo Faisal1,2,3, Member, IEEE

Brain & Behaviour Lab – 1Department of Computing & 2Department of Bioengineering, Imperial College London, South Kensington Campus, London SW7 2AZ, UK; 3MRC Clinical Sciences Centre, Hammersmith Hospital Campus, W12 0NN London, UK. a.faisal at imperial.ac.uk

Abstract – We demonstrate the use of reinforcement learning algorithms for efficient and personalized control of patients' depth of general anesthesia during surgical procedures – an important aspect for Neurotechnology. We used the continuous actor-critic learning automaton technique, which was trained and tested in silico using published patient data, physiological simulation and the bispectral index (BIS) of patient EEG. Our two-stage technique first learns a generic, effective control strategy based on average patient data (factory stage) and can then fine-tune itself to individual patients (personalization stage). The results showed that the reinforcement learner, as compared to a bang-bang controller, reduced the dose of the anesthetic agent administered by 9.4% and kept the patient closer to the target state, as measured by RMSE (4.90 compared to 8.47). It also kept the BIS error within a narrow, clinically acceptable range 93.9% of the time. Moreover, the policy was trained using only 50 simulated operations. Being able to learn a control strategy this quickly indicates that the reinforcement learner could also adapt regularly to a patient's changing responses throughout a live operation and facilitate the task of anesthesiologists by prompting them with recommended actions.

I. INTRODUCTION

In the operating theatre, it is important to accurately control the hypnotic state of a patient while under general anesthesia (depth of anesthesia). Giving too high a dose of an anesthetic agent may have negative side effects, such as longer recovery times [1], but too low a dose can bring the patient into a state of awareness, which can cause physical pain as well as psychological distress [2]. Two techniques are currently used to control the infusion rate in the field of general anesthesia. The first consists of the anesthesiologist manually adapting the infusion rate of anesthetic into the bloodstream based on experience and observation of the patient's response. The second, known as target-controlled infusion (TCI), allows the practitioner to specify an ideal concentration of the anesthetic agent in a compartment of the body (brain). This is achieved using pharmacokinetic (PK) models that enable the computation of an infusion rate for a computer-controlled drug delivery pump [3]. TCI operates in open-loop control, thus lacking feedback for response tuning, and consequently cannot account for differences in PK in individual patients (e.g. those with high body fat ratios). Therefore, we investigate closed-loop control through physiological feedback. The depth of anesthesia of a patient can be effectively measured using the bispectral index (BIS) [4]. BIS is calculated from EEG measurements of brain activity, which are converted into a unit-free value from 0 to 100, where 100 represents normal electrical activity (fully awake) and values of 40-60 represent the accepted range for depth of anesthesia during surgery [4].


BIS is a suitable feedback control signal, with an update rate of 0.2 Hz and stability across subjects [5]. Several studies have looked into how algorithms can be used in general anesthesia to provide more efficient control of infusion rates than manual adaptation or TCI. Some have suggested that closed-loop control algorithms [6], [7] perform better than manual control, as they keep the hypnotic state in a tighter regime [6] and decrease the amount of anesthetic administered [8]. We consider algorithmic techniques that level out differences in clinical experience, and aim at a control solution that prompts the practitioner with a recommended action and outcome predictions, but leaves the ultimate decision with the clinician. A challenge in controlling neurophysiological systems is that they themselves operate with variability and time delays [9], [10], [11]. A recent study proposed a first reinforcement learning technique for anesthetic control, which, in its specific setup, yielded better results than PID control [5]. This improved performance was explained by the fact that PID control is designed for linear, time-invariant problems, while anesthesia is a stochastic, non-linear, and time-dependent problem, and as such is better suited to an adaptive algorithm that naturally accounts for variability, i.e. reinforcement learning. Their reinforcement learning algorithm discretizes the state and action spaces, making the system sensitive to the choice of discretization levels and ranges, and making its generalization capability subject to the curse of dimensionality. Moreover, their system is trained in a single stage, using one-size-fits-all, factory-supplied settings. Therefore, we explore here two main advances, based on our previous experience in closed-loop drug delivery [12]. First, we use continuous state and action spaces to control the infusion rate of the anesthetic Propofol. We propose to use a reinforcement learning technique known as the continuous actor-critic learning automaton (CACLA), which allows the state and action spaces to be kept in continuous form and replaces the Q-function with an actor and a critic [13]. Second, we use two stages of training in the pre-operative stage to achieve personalization to patients. In the first stage a general control policy is learnt, and in the second stage a patient-specific control policy. The advantage of first learning a general control strategy is that it only has to be learnt once, and can then be used to speed up learning of a patient-specific policy.


II. METHOD

A. Modeling patient drug-anesthesia dynamics

In order to model the expected change in BIS readings of a patient in response to Propofol infusion, we used a two-stage calculation. The first stage was a PK model used to calculate the plasma concentration at a given time based on previous Propofol infusion. Propofol concentrations are generally modeled using a mammillary three-compartment model [14], [15], composed of one compartment representing the plasma concentration and two peripheral compartments representing the effect of the body absorbing some of the Propofol and releasing it back into the veins. Propofol can flow between the compartments so that the concentration equilibrates over time. To calculate the plasma concentration, we had to specify the three compartment volumes and the rate constants governing Propofol elimination and transfer between them [3]. These parameters were patient-specific and were approximated using the PK model proposed by Schnider [16], which is based on the patient's gender, age, weight and height. This technique is widely used and has been validated in human subjects [17]. The second stage was a pharmacodynamic (PD) model that used the plasma concentration to find the effect-site concentration (brain) and the expected BIS reading. We modeled the PD by introducing a new compartment representing the effect site, connecting it to the central compartment of the PK model, and specifying the rate constant between the two compartments as 0.17 min-1 [17]. The effect-site concentration was used to calculate a BIS value via a three-layer function approximator (artificial neural network) as described in [5].

B. Reinforcement learning framework

We assumed that the anesthetic control problem was a Markov decision process, and chose to implement the CACLA technique (Fig. 1) [18]. This technique is composed of a value function and a policy function. V(st) represents the value function for a given state, s, and time, t, and estimates the expected return. P(st) represents the policy function at a given state and time, and gives the action that is expected to maximize the return. We modeled both the value function and the policy function by linear weighted regression over Gaussian basis functions. To update the weights of the two functions, we used (2) and (3), which we derived using gradient descent on a squared error function. In these equations, Wk(t) is the weight of the kth Gaussian basis function at iteration t, and Φk(st) is the output of the kth Gaussian basis function for input st. The value function was updated at each iteration using (2), where δt represents the temporal difference (TD) error and α represents the learning rate. The TD error is defined in (1), where γ represents the discount rate and rt+1 represents the reward received at time t+1. The policy function was only updated when the TD error was positive, so as to reinforce actions that increase the expected return. This was done using (3), where the action taken, at, consists of the action recommended by the policy function with an added Gaussian exploration term.

δt = rt+1 + γ V(st+1) - V(st)                         (1)
Wk(t+1) = Wk(t) + α δt Φk(st)                         (2)
Wk(t+1) = Wk(t) + α (at - P(st)) Φk(st)               (3)
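To make the actor-critic updates concrete, the following Python sketch implements equations (1)-(3) with Gaussian radial basis features over the two-dimensional state (BIS error, BIS gradient) described below. The 0-20 mg/min action range, the discount rate of 0.7, the learning rate of 0.05 and the exploration noise (standard deviation 5) follow values given later in the text; the basis-function centres and widths, and all function names, are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

# Illustrative Gaussian RBF features over the 2-D state (BIS error, BIS gradient).
rng = np.random.default_rng(0)
centres = np.array([[e, g] for e in np.linspace(-30, 30, 7)
                           for g in np.linspace(-2, 2, 5)])
width = np.array([10.0, 0.7])              # per-dimension RBF width (assumed)

def phi(s):
    """Gaussian basis outputs Phi_k(s) for state s = (BIS error, BIS gradient)."""
    d = (centres - np.asarray(s)) / width
    return np.exp(-0.5 * np.sum(d * d, axis=1))

w_v = np.zeros(len(centres))               # critic (value-function) weights
w_p = np.zeros(len(centres))               # actor (policy) weights

gamma, alpha = 0.7, 0.05                   # discount rate and learning rate (from text)
sigma_explore = 5.0                        # std of Gaussian exploration (from text)

def cacla_step(s, s_next, r, a_taken):
    """One CACLA update: the TD error trains the critic; the actor is pulled
    towards the action actually taken only when the TD error is positive."""
    global w_v, w_p
    f, f_next = phi(s), phi(s_next)
    delta = r + gamma * np.dot(w_v, f_next) - np.dot(w_v, f)    # eq. (1)
    w_v += alpha * delta * f                                     # eq. (2)
    if delta > 0:
        w_p += alpha * (a_taken - np.dot(w_p, f)) * f            # eq. (3)
    return delta

def select_action(s):
    """Policy output plus Gaussian exploration, clipped to the 0-20 mg/min range."""
    a = np.dot(w_p, phi(s)) + rng.normal(0.0, sigma_explore)
    return float(np.clip(a, 0.0, 20.0))
```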

Figure 1. Patient connected to machine with illustration of reinforcement learning algorithm (CACLA)
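For context, the two-stage patient model of Section II-A can be sketched as a small forward simulation. Only the effect-site rate constant (0.17 min-1) and the 5-second step are taken from the text; the compartment rate constants, central volume and the Hill-curve parameters standing in for the paper's neural-network BIS mapping are placeholder values, not the Schnider parameters of any particular patient.

```python
# Three-compartment PK model plus effect-site PD stage (cf. Section II-A).
V1 = 4.27                                 # central (plasma) volume in litres, placeholder
k10, k12, k21 = 0.443, 0.302, 0.055       # elimination / transfer rates (1/min), placeholders
k13, k31 = 0.196, 0.0035
ke0 = 0.17                                # plasma-to-effect-site rate, 1/min (from the text)

def step_patient(state, infusion_mg_min, dt=1/12):
    """Advance compartment amounts (A1, A2, A3, in mg) and effect-site
    concentration Ce by dt minutes (default 5 s) using forward Euler."""
    a1, a2, a3, ce = state
    da1 = infusion_mg_min - (k10 + k12 + k13) * a1 + k21 * a2 + k31 * a3
    da2 = k12 * a1 - k21 * a2
    da3 = k13 * a1 - k31 * a3
    c1 = a1 / V1                          # plasma concentration, mg/l (= ug/ml)
    dce = ke0 * (c1 - ce)                 # effect site lags behind plasma
    return (a1 + dt * da1, a2 + dt * da2, a3 + dt * da3, ce + dt * dce)

def bis_from_effect_site(ce, e0=100.0, emax=100.0, ec50=3.4, hill=3.0):
    """Generic Hill sigmoid standing in for the trained BIS network."""
    return e0 - emax * ce**hill / (ec50**hill + ce**hill)
```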

The state space we used for both the value function and the policy function was two-dimensional. The first dimension was the BIS error, found by subtracting the desired BIS level from the BIS reading of the simulated patient. The second dimension was the gradient of the BIS reading with respect to time, found using the modeled patient system dynamics. The action space was the Propofol infusion rate, which was given a continuous range of values between 0 and 20 mg/min. The reward function was formulated so as to minimize the squared BIS error and the dosage of Propofol: rt = -BISerror² - 0.02 × InfusionRate.

C. Training the personalization of drug delivery control

We trained the reinforcement learner by simulating virtual operations, which lasted for 4 hours and in which we allowed the learner to change its policy every 30 seconds. For each operation, the patient's state was initialized by assigning Propofol concentrations, C, to the three compartments of the PK model, using uniform distributions (where U(a,b) is a uniform distribution with lower bound a and upper bound b): C1 = U(0,50), C2 = U(0,15), C3 = U(0,2).

We introduced three elements in order to replicate BIS reading variability. The first was a noise term that varied at each time interval and followed a Gaussian distribution with mean 0 and standard deviation 1. The second was a constant value shift specific to each operation, assigned from a uniform distribution U(-10,10). The third represented surgical stimulus, such as incision or the use of retractors. The occurrence of the stimulus was modeled using a Poisson process with an average of 6 events per hour (based on the frequency used in a study by Struys [19]). Each stimulus event was modeled using U(1,3) to give its length in minutes, and U(1,20) (as used by Moore) to give a constant by which the BIS value is increased. As well as modeling the BIS reading errors, we specified that the desired BIS value for each operation varied uniformly in the range 40-60 [4], [20].

This pre-operative training phase for the reinforcement learner consisted of two episodes. The first learnt a general control strategy, and the second learnt a control policy that was specific to the patient's theoretical parameters. The reinforcement learner only needs to learn the general control strategy once, which provides the default setting for the second pre-operative stage of learning.
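As a sketch of the training-time simulation described above, the reward and the three BIS-variability terms could be generated as follows. The distributions are taken from the text; the dose-penalty coefficient is the one given in the reward formula above, and the function names and the 30-second sampling grid are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(bis_error, infusion_rate):
    """Reward as given above: penalize squared BIS error and the Propofol dose."""
    return -bis_error**2 - 0.02 * infusion_rate

def sample_operation_noise(duration_min=240, dt_min=0.5):
    """Per-operation BIS artefacts used during training: i.i.d. Gaussian noise
    (sd 1), a constant per-operation shift U(-10,10), and Poisson surgical
    stimuli (6 per hour) lasting U(1,3) min with height U(1,20)."""
    n = int(duration_min / dt_min)
    noise = rng.normal(0.0, 1.0, n)             # per-sample sensor noise
    shift = rng.uniform(-10.0, 10.0)            # operation-specific offset
    stimulus = np.zeros(n)
    t = 0.0
    while True:
        t += rng.exponential(60.0 / 6.0)        # inter-event time, mean 10 min
        if t >= duration_min:
            break
        length = rng.uniform(1.0, 3.0)
        height = rng.uniform(1.0, 20.0)
        i0, i1 = int(t / dt_min), int((t + length) / dt_min)
        stimulus[i0:min(i1, n)] += height       # BIS pushed up during stimulus
    return noise + shift + stimulus             # added to the model's BIS output
```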

1415

Therefore, for each patient, only the second, patient-specific strategy needs to be learnt, making the process faster.

In order to learn the first, general control strategy, we carried out 35 virtual operations on a default simulated patient (male, 60 years old, 90 kg, and 175 cm) that followed the parameters specified in Schnider's PK model [16]. In the first 10 operations, the value function was learnt but the policy function was not. As a result, the infusion rate consisted only of a noise term, which followed a Gaussian distribution with mean 0 and standard deviation 5. In the next 10 operations, the reinforcement learner started taking actions as recommended by the policy function, with the same noise term added. Here, the discount rate was 0.7 and the learning rate was set to 0.05. The final stage of learning performed 15 more operations using the same settings, with the exception of a reduced learning rate of 0.02.

The second learning episode adapted the first, general control policy to a patient-specific one. We did this by training the reinforcement learner for 15 virtual operations on simulated patients that followed the theoretical values corresponding to the actual age, gender, weight and height of the real patients, as specified in Schnider's PK model.

Once the pre-operative control policies were learnt, we ran them on simulated real patients to measure their performance. Here the setup was very similar to that of the virtual operations used to create the pre-operative policies. One difference was that during the simulated real operations, the policy function could adapt its action every 5 seconds. This shorter time period was used to reflect the time frame in which BIS readings are received. The second difference was the method used to simulate the patients. To measure the performance of the control strategy effectively, it was necessary to simulate the patients as accurately as possible. However, there is significant variability between the behavior of real patients during an operation and that predicted by Schnider's PK model. As a result, in order to model the patients accurately, we used the data on nine patients taken from the research by Doufas et al. [17]. This research used information from real operations to estimate the actual parameters of the patients, which are needed to model their individual system dynamics. To summarize, at the pre-operative learning stage we used theoretical patients based on Schnider's PK model, and to then simulate the reinforcement learner's behavior on real patients we used the data by Doufas et al. [17].
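Putting the schedule together, a rough outline of the two pre-operative training stages might look like the following. The numbers of operations, exploration noise and learning rates follow the text; the agent interface and the run_operation() callable are hypothetical placeholders for the simulation loop sketched earlier, not the authors' code.

```python
def pretrain(agent, default_patient, patient_model, run_operation):
    """Two-stage pre-operative training (Section II-C).
    run_operation(agent, patient, ...) is assumed to simulate one 4-hour
    virtual operation and apply the CACLA updates at each decision step."""
    # Stage 1: general policy on the default Schnider patient (35 operations).
    for op in range(35):
        if op < 10:
            agent.update_policy = False        # critic only, actions are pure noise
            agent.alpha = 0.05
        elif op < 20:
            agent.update_policy = True         # actor and critic, same noise term
            agent.alpha = 0.05
        else:
            agent.alpha = 0.02                 # reduced learning rate for final 15 ops
        run_operation(agent, default_patient, sigma_explore=5.0,
                      decision_interval_s=30, duration_h=4)

    # Stage 2: personalization on the patient's theoretical Schnider model
    # (15 operations), starting from the general policy learnt above.
    for _ in range(15):
        run_operation(agent, patient_model, sigma_explore=5.0,
                      decision_interval_s=30, duration_h=4)
    return agent
```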

III. RESULTS

The policy learnt by our reinforcement learner is highly correlated with the BIS error. For very negative BIS error values, the infusion rate it suggests is 0, and above a certain threshold the suggested infusion rate increases with increasing BIS error. The results of testing our reinforcement learner in silico on nine simulated patients are positive in terms of three measures: the stability of the patient's hypnotic state, the speed of learning, and the RMSE. We present the RMSE as the mean of the individual RMSEs of the nine simulated patients; these are calculated using the patient's BIS error readings during the last 3 of the 4 hours of the operation.

When assessing the stability of the patients' hypnotic states (Fig. 2), we see that during the first 5 minutes of the operation the BIS error often falls sharply below 0. This is due to initializing the patient into the state of general anesthesia by injecting a large amount of Propofol. This initial high dose wears off after around 15 minutes, at which stage the reinforcement learner successfully stabilizes the patient's state. The stability of the hypnotic state is also indicated by the amount of time that the absolute BIS error is kept below 10 [19]. Our reinforcement learner achieves this 93.9% of the time (using data from the last 3 hours of the operation).

We benchmarked our reinforcement learner against a naive bang-bang controller, which followed a basic clinical guideline: if the BIS error was greater than 10, it set the infusion rate of Propofol to 20 mg/min, and it maintained this infusion rate until the BIS error fell below -10, at which point it stopped infusion. We find that our patient-specific policy outperforms our general policy, which in turn outperforms the bang-bang controller, in terms of both RMSE and the dose of Propofol administered. The patient-specific policy, as compared to the bang-bang controller, reduces the RMSE from 8.47±0.43 to 4.90±0.20 and the dose of Propofol by 9.4% (Fig. 3). In terms of computational cost, the patient-specific policy was learnt in only 50 virtual operations.
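The bang-bang baseline described above is simple enough to state directly; the following sketch reproduces its on/off rule with the ±10 thresholds and the 20 mg/min maximum rate given in the text (the class name and interface are ours).

```python
class BangBangController:
    """Naive bang-bang baseline: infusion switches fully on when the BIS error
    exceeds +10 and fully off once it falls below -10, with hysteresis between."""

    def __init__(self, on_threshold=10.0, off_threshold=-10.0,
                 max_rate_mg_min=20.0):
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.max_rate = max_rate_mg_min
        self.infusing = False

    def act(self, bis_error):
        """Return the Propofol infusion rate (mg/min) for the current BIS error."""
        if bis_error > self.on_threshold:
            self.infusing = True
        elif bis_error < self.off_threshold:
            self.infusing = False
        return self.max_rate if self.infusing else 0.0
```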

Figure 2. (A) Surgical stimulus applied to the operations in (B) and (C). (B) BIS error of a sample patient using the patient-specific policy (red line) and bang-bang control (blue line). (C) Mean ± standard deviation (solid black line ± red shaded area) of the BIS error readings of 9 patients with identical surgical stimulus, each dosed with their respective final patient-specific policy. Blue lines mark the range of clinically acceptable BIS error.

