Process Progress Estimation and Phase Detection

Xinyu Li, Yanyi Zhang, Jianyu Zhang, Yueyang Chen, Shuhong Chen, Yue Gu, Moliang Zhou and Ivan Marsic
{xl264, yz593, jz549, yc788, sc1624, yg202, mz330, marsic}@rutgers.edu
Richard A. Farneth and Randall S. Burd {rfarneth, rburd}@childrensnational.org
March 1, 2017

Abstract

Process modeling and understanding are fundamental for advanced human-computer interfaces and automation systems. Recent research has focused on activity recognition, but little work has addressed process progress detection from sensor data. We introduce a real-time, sensor-based system for modeling, recognizing, and estimating the completeness of a process. We implemented a multimodal CNN-LSTM structure to extract spatio-temporal features from different sensory datatypes, and a novel deep regression structure for overall completeness estimation. By combining the completeness estimation with a Gaussian mixture model, our system predicts the process phase from the estimated completeness. We also introduce the rectified hyperbolic tangent (rtanh) activation function and a conditional loss to help the training process. Using the completeness estimate and performance speed calculations, we also implemented an online estimator of the remaining time. We tested this system using data obtained from a medical process (trauma resuscitation) and sport events (swim competitions). Our system outperformed existing implementations for phase prediction during trauma resuscitation, and achieved over 80% process phase detection accuracy with less than 9% completeness estimation error and a time-remaining estimation error of less than 18% of the process duration on both datasets.
1 Introduction
Sensor data at a specific time instance defines a fixed scenario, such as a photo of a person holding a carrot while cooking. A sequence of data collected over time defines an activity, for example, a video clip of a person chopping a carrot while cooking. A sequence of activities defines a process; for instance, chopping vegetables, applying sauce, and stirring together define the process of making a salad. Several successful systems have been introduced for image classification [1] and activity recognition [13]. We take the next step in this type of analysis by proposing a system for real-time process progress estimation. Online progress estimation has many real-world applications in automated human-computer interaction systems. For example, online detection of sports phases can be used to guide the camera on overhead drones for video broadcasting. Online medical process progress information, such as the completeness of a resuscitation and the time remaining until its end (time-remaining), can help medical providers organize medical resources and schedule treatment timing. We designed our system to model and estimate process progress from sensor data in three aspects:
1. The process phase, which indicates the current stage of the process. We analyzed a trauma resuscitation dataset and an Olympic swimming dataset, each with six phases (Table 1).
2. The process completeness, which indicates the percent completion of the whole process.
3. The time-remaining, which represents the estimated time left for the process to complete.
Table 1: The phases and associated definitions for the trauma resuscitation dataset (left) and the Olympic swimming dataset (right).

Resuscitation Phases   | Starting When
-----------------------|-----------------------------------------------
Pre-Arrival (PA)       | First trauma team member enters room
Patient Arrival (A)    | Patient first appears in camera view
Primary (P)            | First "primary survey activity" starts
Secondary (S)          | First "secondary survey activity" starts
Post-Secondary (PS)    | First "post-secondary survey activity" starts
Patient Leave (PL)     | Patient leaves the room

Swimming Phases        | Definition
-----------------------|-----------------------------------------------
Pre-Competition (PC)   | Statement of participants
Introduction (I)       | Athletes enter the field with introduction
Preparation (P)        | Athletes prepare for swimming
Competition (C)        | Athletes compete
Replay (RE)            | Replay of the winner
Result (R)             | Statement of results
Previous research tried to recognize medical process phases using Hidden Markov Models (HMMs) or decision trees applied to sensor data [3, 5, 15], but faced several problems. Approaches based on domain knowledge and specific instruments only worked for particular applications, such as a system specialized for laparoscopic appendectomy [14]. Sensor data may be hard to obtain in medical settings, and surgical instruments may be expensive and hard to deploy. Wearable sensors may interfere with work in many applications. Manually crafted features do not generalize well, and each sensor type may require its own "best features" [13]. Most importantly, these existing systems have treated the problem as multiclass classification, partitioning the process into several independent phases rather than considering the process as a continuous procedure. These approaches do not capture the association between process phase and percent completion, allowing the system to make phase predictions that do not follow the logical order [12].

To address these issues and challenges, we introduce a multimodal deep regression structure that first extracts spatio-temporal features from different sensor types and then performs regression for completeness estimation and phase prediction. Our proposed system processes data in three steps. First, it extracts and learns the spatio-temporal associations in the input data using a multimodal deep learning structure. Second, regression is used for percent-completeness estimation. Unlike regression models that use a polynomial function or a combination of linear functions to generate the regression value, we used a deep neural network with a single neuron in the output layer to generate the regression result. We introduced the rectified hyperbolic tangent (rtanh) activation function to ensure that the regression output stays within the bounds of process completeness values. The phase prediction is made from the regression results using a Gaussian Mixture Model (GMM) based decision-making approach. We also introduced a conditional loss to train the system based on the errors made by both the regression and the phase detection. The third and final step is the estimation of the time-remaining, using the estimated overall completeness as an indicator.

We introduced and tested our model with two datasets: a medical dataset recorded during 35 actual trauma resuscitations at Children's National Medical Center (CNMC) using installed sensors (including a depth camera and a microphone array) [12], and an Olympic swimming dataset of 60 YouTube videos of Olympic swimming competitions in different swimming styles. Our process progress estimation system achieved an average 87% online process phase detection accuracy for the resuscitation dataset, outperforming existing systems applied in the same setting, and 80% accuracy for the Olympic swimming dataset. The system had less than 9% completeness estimation error with an average 6.5-minute (14% of total duration) time-remaining estimation error for the trauma resuscitation dataset, and less than 5% completeness estimation error with an average 2.2-minute (18% of total duration) time-remaining estimation error for the Olympic swimming dataset. Our system outperformed the previously proposed systems on the trauma resuscitation dataset [21, 22]. Our resuscitation process phase detection system has been deployed in a trauma room and is generating data that can be used in future applications.

The contributions of this paper are:
1. A novel deep regression-based approach for process progress estimation and phase detection using commercially available sensors, as well as a new rectified hyperbolic tangent (rtanh) activation function.
2. A GMM-based phase detection approach that uses the completeness regression results, as well as a conditional loss function for model tuning using both regression and classification error.
3. The trauma resuscitation dataset and the Olympic swimming dataset, which can be used in future research, as well as the details of the deployment of our system in an actual medical setting (which can serve as a reference for other applications).

The rest of the paper is organized as follows: Section 2 introduces the problem; Section 3 presents our proposed method; Section 4 describes experimental results and performance comparison with other systems; Section 5 discusses the results; and Section 6 concludes the paper.
2 Problem Description and Related Work
Modeling the progress of a process can be considered from three aspects: the completeness, the process phase, and the remaining time. Few previous studies have addressed process progress estimation, and several challenges remain for process modeling and estimation from sensor data [2, 12]. Classification models, which can only generate discrete predictions, are suitable for activity recognition [7] but not for process completeness estimation, which is a continuous variable. Without an overall completeness estimate, a system might generate implausible phase predictions (such as the patient leaving the trauma room before arriving). The duration of process performance depends on many factors and can vary significantly. For example, trauma resuscitation duration may depend on injury type, severity, preexisting conditions, and the availability of different team roles. The pace of performance may also change during the same event, so remaining-time estimates may need to be dynamically updated. Finally, real-world applications require online process progress estimation from sensor data. Sensor data collected in real-world settings, however, is noisy and cannot simply be filtered or modeled because of the diversity of application scenarios.

Previously published research has focused only on process phase detection. Early work approached process modeling using key activities estimated from the output of surgical instruments [19]. Directly using these signals provided "clean" data compared with mobile sensor data, but also required expensive and hard-to-deploy equipment. Endoscope video recordings were used for workflow analysis with an HMM [3], but this approach does not generalize easily to other surgical processes. For economic reasons and ease of deployment, mobile sensors such as passive RFID and depth sensors have been used for trauma resuscitation phase detection [2, 12]. These sensors produce data that is noisier than data collected by medical equipment. Deep Convolutional Neural Networks (CNNs) have been implemented for surgical phase detection with endoscope images [15, 22]. Our previous research used multimodal deep learning with low-resolution depth images and microphone-array audio for trauma resuscitation phase estimation [12], which is both privacy-preserving and easy to deploy. That system, however, used a purely CNN-based structure, which can only learn the spatial features in the input depth frames and ignores the temporal associations between depth frames over time. More recent work has used temporal associations for better process phase and activity modeling [10, 11]. Using both spatial and temporal features, we propose a deep online regression model capable of simultaneously predicting the process phase, the completeness, and the time-remaining.
3 System Structure

3.1 Overall Structure
Our model consists of three steps (Figure 1).

Feature extraction: Because we used different sensors for data collection in different applications, the first step is data representation and feature extraction. Instead of several types of manually crafted features, we applied a CNN and an LSTM to learn the spatio-temporal features from the data [16]. We used a multimodal structure to fuse the features extracted from the different sensory data types.
Completeness and phase estimation: We implemented a deep-learning-based regression model that directly produces a single value as the regression output. We introduced the rtanh activation function for the output neuron and a conditional loss function to train the regression model. We further used a GMM as probabilistic inference to perform phase detection and to generate the conditional loss for model training.

Estimation of time-remaining: Our system dynamically updates the time-remaining estimate based on the process execution speed, which we define as the rate at which one percent of the process is accomplished.

Figure 1: The overall system structure used for process modeling and progress estimation.
3.2 Feature Extraction
As in activity recognition [7, 13], the estimation of process progress relies on both temporal and spatial features. The learnable filters of a CNN are commonly used to extract spatial features [23], and long short-term memory (LSTM) networks are often used to model the temporal dependencies of sequential data [6]. Different sensor types require different data representations and CNN-LSTM structures. Our trauma resuscitation dataset contained low-resolution depth images and audio from a Kinect [12]. Other sensor types can easily be added to our system as additional multimodal branches. We chose the Kinect depth sensor for our medical application because it is privacy-preserving (it does not capture facial details) and does not require active user participation to support the sensing. For the relatively texture-less depth images, we used an AlexNet [9] (Figure 2, top). For audio data, we extracted MFSC feature maps [1] for every second and fed them into an AlexNet [9]. We implemented the CNN-LSTM structure for spatio-temporal feature extraction and used a multimodal structure to fuse features from the different sensor datatypes. The features extracted from the different sensors were combined in a fusion layer, based on a previous implementation [12] (Figure 2, top). We also implemented a smaller neural network (using smaller input images) that runs on a mini PC (Intel NUC) mounted on a side wall in the trauma room, using the same configuration as previously used [13]. This implementation will eventually allow us to provide contemporaneous information about work progress and process phase.

The Olympic swimming dataset was collected from YouTube videos recorded with different types of cameras (including cellphone cameras and professional cameras). It contained videos of different resolutions and audio recorded by cameras at different distances. We first down-sampled the RGB frames to one frame per second and resized them to 256×256 px. We used a VGG Net structure with 11 convolutional layers for image feature extraction, because RGB images contain more textural detail than depth images. The weights of a previously trained VGG Net were used directly for initialization [17]. As for the trauma resuscitation dataset, an AlexNet was used to extract features from the MFSC feature maps (Figure 2, bottom). A minimal sketch of this kind of multimodal extractor is shown below.
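The following Keras sketch illustrates the general shape of the trauma-branch multimodal CNN-LSTM extractor. It is a sketch under assumptions, not the exact published network: the input shapes (64×64 depth frames, 40×40 MFSC maps), the fusion width, and the layer sizes are illustrative.

```python
# Minimal sketch of a multimodal CNN-LSTM feature extractor (Keras 2.x).
# Shapes and layer sizes are illustrative, not the exact published network.
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten, Dense,
                          TimeDistributed, LSTM, concatenate)
from keras.models import Model

def cnn_branch(frames):
    """AlexNet-style CNN applied to every time step via TimeDistributed."""
    x = TimeDistributed(Conv2D(64, (5, 5), activation='relu', padding='same'))(frames)
    x = TimeDistributed(MaxPooling2D((2, 2)))(x)
    x = TimeDistributed(Conv2D(128, (3, 3), activation='relu', padding='same'))(x)
    x = TimeDistributed(MaxPooling2D((2, 2)))(x)
    x = TimeDistributed(Conv2D(256, (3, 3), activation='relu', padding='same'))(x)
    x = TimeDistributed(MaxPooling2D((2, 2)))(x)
    return TimeDistributed(Flatten())(x)

# One depth-video branch and one audio (MFSC) branch, both sampled at 1 Hz.
depth_in = Input(shape=(None, 64, 64, 1), name='depth_frames')
audio_in = Input(shape=(None, 40, 40, 1), name='mfsc_maps')   # assumed MFSC map size

depth_feat = LSTM(1024, return_sequences=True)(cnn_branch(depth_in))
audio_feat = LSTM(512, return_sequences=True)(cnn_branch(audio_in))

# Fusion layer combining the spatio-temporal features of the two modalities.
fused = concatenate([depth_feat, audio_feat])
features = TimeDistributed(Dense(1024, activation='relu'))(fused)

extractor = Model(inputs=[depth_in, audio_in], outputs=features)
```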
3.3 Regression

3.3.1 Regression and Completeness Estimation
Unlike image classification or activity recognition, which use the data captured at a given moment, process completeness is continuous and cannot be treated as a discrete classification problem. For this reason, we propose a regression model that uses the extracted spatio-temporal features for process completeness estimation (Figure 3, completeness estimation part).
Figure 2: Feature extraction framework for the trauma resuscitation dataset (top) and the Olympic swimming dataset (bottom).
Figure 3: The system diagram for completeness estimation and phase prediction. Loss_c represents the loss from the completeness regression error and Loss_p represents the loss from the phase prediction error.

Given an appropriate output activation function, a deep neural network extracts features from the input data and performs regression by assigning an overall completeness value (between zero and one). The regression can be expressed as:

$$\hat{y} = f\Big(\sum_{i=1}^{N} (w_i x_i + b_i)\Big) \qquad (1)$$

where $w_i$ is the weight connecting the $i$-th neuron in the last fully connected layer (with $N$ neurons in total) to the output neuron, and $x_i$ is the output value of the $i$-th neuron in the last fully connected layer (Figure 3, fully connected layer 2). $b_i$ is the bias term and $f(\cdot)$ is the activation function of the output neuron. To model process progress, the activation function must have the following properties: (1) the output should be zero if the process has not started (e.g., the pre-arrival phase of a trauma resuscitation); (2) the regression value changes from zero to one as the process proceeds; (3) the output should equal one once the process has been completed (e.g., after the patient leaves the trauma room). We modified the hyperbolic tangent function for this purpose and named it the rectified hyperbolic tangent (rtanh):

$$\mathrm{rtanh}(x) = \max(0, \tanh(x)) = \max\left(0, \frac{e^{2x} - 1}{e^{2x} + 1}\right) \qquad (2)$$
This activation function outputs zero for negative inputs and increasing positive values up to one for positive inputs. This value range makes it suitable for returning a progress completeness percentage. The sigmoid function has the same value range, but it produces positive values over the entire real line (it never reaches exactly zero), making it hard to train.
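A minimal sketch of rtanh (Equation 2) as a custom Keras activation, attached to a single-neuron regression output as in Equation 1; the layer name is our own.

```python
# Minimal sketch of the rtanh activation (Eq. 2) for a single-neuron
# regression output (Eq. 1), expressed with standard Keras backend ops.
from keras import backend as K
from keras.layers import Dense

def rtanh(x):
    # max(0, tanh(x)): zero before the process starts, saturating at one.
    return K.relu(K.tanh(x))

# Output layer producing the completeness estimate in [0, 1].
completeness_out = Dense(1, activation=rtanh, name='completeness')
```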
Like a classifier, the regression model can be trained using backpropagation. Completeness must be labeled differently for different applications. For example, the completeness of a trauma resuscitation is zero before the patient enters the room and remains one after the patient leaves. Given the time at which a sample was collected relative to the process's beginning and end, we can label it with the completeness range its time instance falls in. We divided the data into 20 five-percent segments and labeled each segment with its associated completeness between 0% and 100%, which forms a stepwise function with 5% increments (a labeling sketch follows below). The output of the rtanh neuron can be used directly as the regression result (Figure 3, decision-making part). To smooth the regression, we applied a Gaussian smoothing filter after the rtanh neuron. The loss from the completeness estimation error, $L_c$, can be expressed as:

$$L_c(\theta = \{w, b\}, D) = \frac{\sum_{i=0}^{|D|} \mathrm{abs}\left(R(\theta = \{w, b\}, D_i) - p_i\right)}{|D|} \qquad (3)$$

where $\theta = \{w, b\}$ denotes the model parameter set, $D$ the input dataset, $R(\theta = \{w, b\}, D_i)$ the regression output of the model on the data at time instance $i$, and $p_i$ the regression label. We used the mean absolute error as the loss, the $\mathrm{abs}(\cdot)$ term in the loss function. Although error measures such as the mean squared error could be used, squaring an already-small regression error would shrink the loss further and slow the training.
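A minimal sketch of the time-based, 5%-quantized labeling described above; the 1 Hz frame handling and function name are our own.

```python
# Minimal sketch of the stepwise completeness labels: each case is split into
# 20 segments of 5%, so every frame gets a label quantized to 5% increments.
import numpy as np

def completeness_labels(n_frames):
    """Label each 1 Hz frame of a case with its 5%-quantized completeness."""
    fraction = np.arange(n_frames) / float(n_frames)  # elapsed fraction per frame
    return np.floor(fraction * 20) / 20.0             # stepwise, 5% increments

print(completeness_labels(10))   # [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
```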
3.3.2 Phase Detection and Conditional Loss
The regression model can intuitively estimate completeness but cannot predict the phase, because the regression only produces continuous values. A separate classification model using the same extracted features could be used for phase detection [12], but detecting the phase independently of the completeness ignores the association between the two (e.g., the phase distribution along the completeness axis), making the deep neural network less computationally efficient and less accurate. Training the Gaussian mixture model only requires the phase distribution over the overall completeness, which is obtainable from the ground truth. We can therefore pre-train the GMM on the ground truth and establish the probabilistic association between the overall completeness and the process phase. We set the number of centroids equal to the number of phases, because each Gaussian distribution models the distribution of one phase. The phase prediction is calculated as:

$$\hat{p} = \operatorname*{argmax}_i \left\{ \log(w) - \frac{1}{2}\log(\det(\Sigma)) - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \qquad (4)$$

where $\hat{p}$ is the predicted phase, $w$ ($N \times 1$) is the weight of each Gaussian kernel, $\mu$ ($N \times 1$) is the vector of means of all Gaussian kernels, $\Sigma$ ($N \times N$) denotes the covariance matrix, and $N$ is the number of phases. $\operatorname{argmax}_i\{\cdot\}$ denotes the function finding the index $i$ with the largest likelihood among all $N$ indices. A minimal sketch of this decision rule follows.
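The sketch below fits one 1-D Gaussian per phase on ground-truth (completeness, phase) pairs and applies the argmax rule of Equation 4; the class and method names are ours, and the per-phase fitting is an assumption about how the pre-training could be done.

```python
# Minimal sketch of the GMM-based phase decision (Eq. 4): one 1-D Gaussian
# per phase, pre-fit on ground-truth (completeness, phase) pairs.
import numpy as np
from scipy.stats import norm

class GMMPhaseDecoder:
    def fit(self, completeness, phases):
        """Estimate the weight, mean, and variance of each phase's kernel."""
        self.labels = np.unique(phases)
        self.w, self.mu, self.sigma = [], [], []
        for ph in self.labels:
            c = completeness[phases == ph]
            self.w.append(len(c) / float(len(completeness)))
            self.mu.append(c.mean())
            self.sigma.append(c.std() + 1e-6)        # avoid zero variance
        return self

    def predict(self, x):
        """Return the phase with the largest weighted log-likelihood at x."""
        ll = [np.log(w) + norm.logpdf(x, m, s)
              for w, m, s in zip(self.w, self.mu, self.sigma)]
        return self.labels[int(np.argmax(ll))]
```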
With Equation 4 and the regression result, we can directly use the GMM for phase prediction (Figure 4). We applied the GMM to only four of the six phases in the trauma resuscitation dataset, excluding pre-arrival and patient-leave: the duration of the pre-arrival phase depends on the transport time from the injury scene to the hospital, which our dataset does not contain, and the patient-leave phase takes only one second in the coding. The GMM-based phase prediction relies only on the regression result, so a phase prediction error by itself would not affect the backpropagation used to tune the regression. To let the regression model be trained on both the completeness estimation error and the phase prediction error, we introduced an additional training loss from the phase prediction error. We assume that a correct regression value allows the GMM to make the correct phase prediction. Based on this assumption, we designed the system to generate no additional loss if the GMM-based phase detection result is correct; otherwise, we take the distance from the regression result to the mean of the actual current phase's distribution as the additional loss from the phase prediction error. In summary, the loss from the phase, Loss_p, is conditional on the phase detection result and can be expressed as:

$$\mathrm{Loss}_p = \begin{cases} 0 & \hat{p} = p \\ |\mathrm{rtanh}(D_p) - \mu_p| & \hat{p} \neq p \end{cases} \qquad (5)$$
where $\hat{p}$ denotes the GMM-predicted phase and $p$ is the actual current phase, $D_p$ is the input data for phase $p$, $\mathrm{rtanh}(D_p)$ denotes the regression model's output, and $\mu_p$ is the mean of the distribution of the actual current phase $p$.

Figure 4: The probability of each phase plotted against the process duration for the trauma resuscitation dataset (left) and the Olympic swimming dataset (right).

By combining the loss from the regression error with the loss from the classification error, the regression model can be tuned to make completeness estimates and phase predictions that follow a logical order. The loss used for backpropagation is defined as:

$$\mathrm{Loss} = \mathrm{Loss}_c + \mathrm{Loss}_p \qquad (6)$$
Training the system with the proposed loss can be understood intuitively as tuning a regression model to fit both the overall completeness ground truth and the phase ground truth. A minimal sketch of this conditional loss follows.
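Because the GMM check runs outside the network, the sketch below assumes the per-sample targets are packed as [completeness label, µ_p, phase-error flag]; this packing is our own convention, not the published implementation.

```python
# Minimal sketch of the conditional loss (Eqs. 3, 5, 6) as a custom Keras loss.
# Assumed packing: y_true = [completeness label p_i, phase mean mu_p, wrong flag],
# where "wrong" is 1 if the GMM phase prediction was incorrect, else 0.
from keras import backend as K

def conditional_loss(y_true, y_pred):
    label, mu_p, wrong = y_true[:, 0], y_true[:, 1], y_true[:, 2]
    loss_c = K.abs(y_pred[:, 0] - label)          # Eq. 3: mean absolute error
    loss_p = wrong * K.abs(y_pred[:, 0] - mu_p)   # Eq. 5: only when phase is wrong
    return K.mean(loss_c + loss_p)                # Eq. 6
```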
3.4 Time Left Estimation
The speed of a process's performance varies, so our time-remaining estimator requires a dynamically updating strategy. We estimate the time remaining by calculating how long the process has taken to advance one percent. If $p$ denotes the current completeness (as a fraction) and $t$ the time elapsed since the start, the time-remaining can be estimated as:

$$\mathrm{time\ remaining} = (t/p) \times (1 - p) \qquad (7)$$

where $t/p$ denotes the average time taken per unit of completed progress and $(1 - p)$ denotes the fraction of the process remaining, as sketched below.
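A minimal sketch of Equation 7; the function name and the handling of a not-yet-started process are ours.

```python
# Minimal sketch of the time-remaining estimator (Eq. 7).
def time_remaining(elapsed_s, completeness):
    """elapsed_s: seconds since the process started; completeness in (0, 1]."""
    if completeness <= 0:
        return float('inf')               # process has not started yet
    pace = elapsed_s / completeness       # average seconds per unit of progress
    return pace * (1.0 - completeness)    # Eq. 7: (t / p) * (1 - p)
```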
3.5 System Implementation
We implemented our model with the Keras framework and a TensorFlow backend. Because the proposed rtanh activation and loss function do not exist in Keras, we developed them for this application. As proposed in previous research [9], we used the rectified linear unit (ReLU) as the CNN activation function. For the trauma resuscitation dataset, we initialized and trained the model on one GTX 1080 GPU. For the YouTube Olympic swimming dataset, which has a larger input data format and a larger network structure, we used dual GTX 1080 GPUs for training. We used the trained VGG Net weights [17] directly to initialize the network for RGB frame processing. An Adam optimizer [8] was used with an initial learning rate of 0.01 and a decay of 10^-8. We configured the system to stop automatically if performance did not change for three epochs, using the Keras callback function. Due to the large model size, overfitting was an issue, which we addressed with dropout [18]. We also partitioned the training and testing sets by whole cases instead of segmenting each case (to minimize similarities between the training and testing sets), as suggested in [12]. Unlike image classification, where each training image contains a target, the appearance of a process phase may differ significantly from case to case (e.g., treating patients with different injuries or performing different swimming styles). Feeding all cases into the classifier simultaneously caused slow convergence. For this reason,
we initially fed only two cases into the classifier for training. When the system reached a specified (manually defined) loss value, we added one more case of data into the classifier. Using this approach, the model learned specific scenarios rapidly and was later able to learn to discriminate similar classes in other cases (see the training-schedule sketch below). During training, we added a weight α in front of Loss_c and a weight β in front of Loss_p in Equation 6:

$$\mathrm{Loss} = \alpha\, \mathrm{Loss}_c + \beta\, \mathrm{Loss}_p \qquad (8)$$
Training the system with a larger α causes it to prioritize overall completeness regression over minimizing phase prediction errors; a larger β does the opposite.
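A minimal sketch of the incremental training schedule described above, assuming a compiled `model` (using the weighted loss of Equation 8) and a list `cases` of per-case (inputs, targets) pairs; the threshold value is illustrative.

```python
# Minimal sketch of the incremental training schedule: start with two cases
# and unlock one more each time the loss drops below a manually chosen value.
# `model` (compiled with the weighted loss of Eq. 8) and `cases` are assumed.
LOSS_THRESHOLD = 0.05                      # illustrative, manually defined

active, pending = list(cases[:2]), list(cases[2:])
while True:
    for x, y in active:                    # one pass over the active cases
        loss = model.train_on_batch(x, y)
    if loss < LOSS_THRESHOLD and pending:
        active.append(pending.pop(0))      # add one more case of data
    elif loss < LOSS_THRESHOLD and not pending:
        break                              # all cases learned to the threshold
```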
4 Experimental Results

4.1 Data Collection
Trauma Resuscitation Dataset: The trauma resuscitation dataset was collected in a trauma room at the Children's National Medical Center in Washington, D.C. Use of these data for research purposes was approved by the IRB. One hundred and fifty trauma resuscitations of seriously injured patients were manually coded for ground truth. The data were collected with a Kinect depth sensor mounted on the side wall of the trauma room [12, 13] (Figure 5, left). Of the 150 trauma resuscitations, 50 had synchronized depth data, and 35 of those 50 also had synchronized audio data. We used the 150 coded cases to generate the GMM phase distributions, and the 35 cases with both depth and audio data for model training and testing. Given the different patient conditions and times of day, the durations of the resuscitation phases varied (Figure 5, right).

Olympic Swimming Dataset: The Olympic swimming dataset includes 60 videos from 2004 to 2016, manually downloaded from YouTube. The videos were recorded with different devices and at different angles. We manually coded the six phases of the swimming competitions, and manually trimmed some of the downloaded videos to ensure that all clips contained only swimming-related content. Because some videos did not record the pre-competition phase, some cases do not contain all six phases.
Figure 5: The trauma room with our Kinect installed (left) and the boxplot of phase duration as a completeness percentage for the trauma resuscitation dataset and the Olympic swimming dataset (right).
4.2 Evaluation of Process Progress Estimation
We evaluated the proposed system on the trauma resuscitation dataset and the Olympic swimming dataset for estimating the overall completeness, the phase, and the time-remaining. We first calculated the system's overall completeness estimation error by feeding 20% of the cases (one by one) into the trained network and measuring the mean absolute error. The system achieved an average 9.3% overall completeness error for the trauma resuscitations and a 5.2% error for the Olympic swimming dataset. We then plotted the mean absolute error (MAE) of the completeness estimation against the normalized process duration of the testing cases (Figure 6). A large MAE indicates that the system has difficulty distinguishing similar scenarios, and a large variance indicates that the system does not generalize well across testing cases. We noticed that larger estimation errors always came with a larger variance, indicating that the system does not generalize well for some similar phases in our testing cases. Training the system with more data could mitigate this generalization issue.
Figure 6: Mean (solid line) and variance (shaded region) of the completeness estimation error plotted against the normalized duration.
Figure 7: Aggregate error over the extent of each phase.

We further derived the phase prediction error (Figure 7) from the completeness estimation error (Figure 6) and the GMM-based probabilistic distribution of each phase, which gives a more informative picture of our model. Our evaluation showed that the system performed well in the pre-arrival and patient-leave phases of the trauma resuscitation dataset, as well as in the pre-competition and result phases of the swimming dataset. The good performance during these starting and ending phases is probably due to the proposed rtanh function's ability to hold the regression output at zero or one at the beginning or end. The high error rate in the post-secondary phase may be related to the high similarity between the secondary and post-secondary phases. The competition and replay phases of the Olympic swimming dataset may have a higher error rate for a similar reason: the replay phase is essentially a slow-motion replay of the competition, and at a 1 fps sampling rate the two phases can barely be distinguished, so the LSTM may confuse the replay with the actual competition. These results suggest that system performance could be further improved by (1) adding other sensors to provide more input features and better distinguish phases with similar sensor data, and (2) using a higher frame rate in data pre-processing to provide more temporal-association detail.

Because almost no related research on progress estimation exists, few baselines are available for comparing the performance of our system. In a building construction application [4], the processes are labeled based on the availability of different construction components and are generally invariantly sequential. Process completeness estimation for activity sequences is more challenging, because the activities may be performed in different orders and with different durations. For process phase prediction, we plotted the confusion matrices of the phase prediction results (Figure 8). The system achieved an average 87.03% accuracy for trauma resuscitation phase prediction and an average 80.06% accuracy for the Olympic swimming data.
Confusion matrix for the trauma resuscitation dataset (rows: actual phase, columns: predicted phase, in %):

      PA    A     P     S     PS    PL
PA   95.8  4.2   0     0     0     0
A    24.1 43.2  32.6   0     0     0
P     0   12.8  70.9  16.3   0     0
S     0    0    36.4  63.6   0     0
PS    0    0     0    41.3  45.1   5.0
PL    0    0     0     0     6.9  93.1

Confusion matrix for the Olympic swimming dataset (in %):

      PC    I     P     C     RE    R
PC   76.2 23.8   0     0     0     0
I    42.1 29.8  28.1   0     0     0
P     0    0   100.0   0     0     0
C     0    0    16.4  21.2  62.3   0
RE    0    0     0     0    90.8   9.2
R     0    0     0     0    10.3  89.7
Figure 8: The confusion matrices for process phase prediction.

Our analysis of the confusion matrices showed that the proposed system makes reasonable phase predictions (Figure 8). The zero values in most of the upper and lower triangles of the matrices indicate that the system only makes adjacent-phase prediction errors. This finding shows that the system makes phase predictions in a fixed order and does not jump between non-adjacent phases. By further comparing the confusion matrices of the two datasets, we found that:
• The system accurately predicted the starting and ending phases, due to rtanh's ability to hold the output at zero and one for the associated starting and ending phases.
• The phase detection performance is tied to the regression performance. For example, the system predicted the post-secondary phase poorly compared with other phases in both the regression and the phase detection (Figure 7). This coupling follows from using the regression output as the input of the GMM phase detector.
• The system achieved similar performance on both datasets, despite the differences in camera mobility. The trauma resuscitation dataset was recorded with a fixed camera, while the Olympic swimming dataset was captured by moving cameras. A fixed camera always captures the activity scene and provides continuous activity-sequence information, while a moving camera might capture some discontinuous, unrelated data. Our approach achieved similar performance under both scenarios, showing its potential for analyzing temporal activity processes regardless of the video input's changing viewpoints. The system achieves similar performance with both fixed and moving cameras because of the CNN-LSTM structure's ability to learn and extract very different temporal features for decision making.

Ground truth phase coding was performed manually. To compare our system's phase prediction performance with that of domain experts, 10 additional cases were coded. To better simulate the approach used by our system, we started the video clips from the beginning of the trauma resuscitation and stopped at a random time instance before the patient left the trauma room. This way, the domain experts could not infer the overall completeness from the video's progress bar. Our results (Table 2) showed that humans are slightly more accurate, but significantly slower (30 times slower than the machine), at coding phases than our system. Coders used specific indicators for process phase detection, such as certain medical activities or indicative object usage. Recognizing such activities requires both experience and domain knowledge, which may be hard for a person to obtain from only depth videos and audio. It was difficult for the domain experts to estimate the overall completeness and almost impossible for them to estimate the remaining time.

Table 2: Comparison of expert assessment of completeness and phase estimation with machine-based estimation.

System  | Avg. Accuracy | Avg. Precision | Avg. Recall | Avg. F1 Score | Avg. Time (per frame)
Expert  | 0.95          | 0.85           | 0.91        | 0.87          | 30 s
Machine | 0.87          | 0.72           | 0.69        | 0.67          | 1 s
We also calculated the system's time-remaining estimation error: 6.5 minutes (14% of total duration) for the trauma resuscitation dataset and 2.2 minutes (18% of total duration) for the Olympic swimming dataset. As with the completeness error, we evaluated the time-remaining errors for the different phases (Figure 9). We found that the time-remaining estimation error was not associated with the completeness estimation error, but with the completeness itself. A small time-remaining estimation error at an earlier stage may lead to a large error in later phases, because the estimated process speed is multiplied by the percentage of the process left. Comparing the time-remaining estimation error of the trauma resuscitation dataset with that of the Olympic swimming dataset, we found that the error was also associated with process length (Figure 5, right): longer processes had greater time-remaining estimation errors.
Figure 9: Estimated time error for the different datasets.
4.3 Comparison of Phase Prediction to Previous Work
We further compared the proposed system with our previous work. Since we are the first to propose completeness and time-remaining estimation, we compare only the process phase detection performance with previous research. We used the same trauma resuscitation dataset used in our previous study [12], the EndoTube dataset [22], which has seven phases and was previously used to evaluate EndoNet (with the same training and testing split), and the TUM LapChole dataset [20], which has been used in previous surgical process phase analysis [22].

The proposed model achieved significantly higher performance than our previous model [12] (Figure 10). A major drawback of the previous work was its reliance on spatial information and its disregard of temporal associations [12]. The previous model [12] could make incorrect predictions that defy common sense, e.g., predicting the primary phase before patient arrival. To address this issue, we previously proposed a constraint softmax layer. Because the constraint softmax requires information across a window centered at the current time instance, decision making must wait for future data; with this limitation, the previous system can only be used for post-event analysis. The system in this paper instead relies on the incremental nature of the regression curve to make logically ordered predictions (i.e., patient arrival will happen before the primary survey). For this reason, the proposed system can work online.

We further compared the phase prediction performance (Figure 10). The results show that the proposed system outperformed the existing systems (Table 3). The proposed system maintained the same F-score for pre-arrival and patient-leave, and significantly improved the scores for the other phases. We next analyzed the EndoTube dataset, which was part of the endoscopic video challenge in 2015 and was recently used in EndoNet [22], and the TUM LapChole dataset used in the M2CAI workflow recognition challenge at MICCAI 2016 [20]. Since the EndoTube and TUM LapChole datasets contain repeated workflow (with repeatedly performed surgical phases), we manually removed the repeated workflow instances during the training phase. Our results confirm that our CNN-LSTM and regression structure better models the temporal associations in the data, resulting in better performance. The experiments on both datasets again demonstrated the importance of temporal associations in process phase detection, as our system again outperformed the previous purely CNN-based system [22].

We also provide a detailed comparison of our system with other previous phase detection implementations (Table 3). Because not all of the previous work was evaluated on the same dataset, the evaluation scores may not reflect each system's performance in its own setting.
Figure 10: The comparison of the previous model (classification with constraint softmax) and the proposed model.

Table 3: Comparison of phase detection systems.

System | Gen* | Input Data | Model | Acc. | Prec. | Rec. | F-S

Trauma Resuscitation Dataset
Multimodal CNN [12] | Yes | Depth and audio | Multimodal CNN | 0.62 | 0.60 | 0.51 | 0.50
Multimodal CNN [12] | Yes | Depth and audio | Multimodal CNN + constraint softmax | 0.80 | 0.66 | 0.64 | 0.63
Our model | Yes | Depth and audio | CNN-LSTM Regression | 0.87 | 0.72 | 0.69 | 0.67

Endovis Dataset
Endonet [22] | Yes | Endoscope video | CNN | 0.66 | 0.65 | 0.64 | n/a
Our model | Yes | Endoscope video | CNN-LSTM Regression | 0.66 | 0.63 | 0.69 | 0.66

TUM LapChole Dataset
CNN based approach [20] | Yes | Laparoscopic video | AlexNet + time window | n/a | 0.66 | 0.74 | n/a
Our model | Yes | Laparoscopic video | CNN-LSTM Regression | 0.83 | 0.55 | 0.61 | 0.51

Olympic Swimming Dataset
Our model | Yes | RGB video and audio | CNN-LSTM Regression | 0.80 | 0.37 | 0.51 | 0.34

Other Previous Research
Phase detection from low-level activities [5] | No | Activity log | Decision Tree | n/a | 0.75 | 0.74 | 0.74
Surgical phases detection [3] | No | Instrument signal | HMM | 0.83 | n/a | n/a | n/a
Phase recognition using mobile sensors [2] | No | Wearable sensor data | Decision Tree | 0.77 | n/a | n/a | n/a
Surgical workflow modeling [14] | No | Instrument signal | HMM | 0.84 | 0.85 | 0.84 | n/a
Food Preparation Activities [21] | No | RGB video, depth and accelerometer data | Random Forest | n/a | 0.65 | 0.67 | n/a

*Generalizability: the ability to learn features automatically from data.
Our system achieved adequate performance based on the F-measure. The best-performing system [5] can only predict high-level phases from manually generated low-level activity logs, which is a significant limitation for real-world applications: automatically generated activity logs derived from sensor data may contain errors that significantly influence system performance. Moreover, our system generalizes well across all types of trauma resuscitations. Previous resuscitation-based research was trained on datasets with specific patient injuries, but our system was trained on cases with a variety of patient injuries, ages, and conditions. In terms of data collection sensors, our system uses only a cheap, easily deployed, commercially available depth camera along with a microphone array. Other early systems required human input, specific medical equipment, or obtrusive wearable sensors.
5 Discussion

5.1 Application: Online Trauma Resuscitation Progress Detection
Instead of testing the proposed system on recorded data only, we deployed it in one of the trauma rooms at CNMC. As shown in Figure 5 (left), we mounted the entire system (a Kinect depth sensor and an Intel NUC mini PC) on the side wall of the trauma room. Because the mini PC uses a low-power CPU with no dedicated GPU, we had to shrink the input data size to ensure it could run the feedforward network online. Considering that real-time MFSC feature extraction is computationally expensive, we used only depth video for resuscitation progress estimation. Because the mini PC has very limited runtime computational resources, we further resized all online-captured depth images down to 64×64 px before feeding them into the network. Using depth only and down-sampling does impact system performance (Figure 11). Fortunately, this can be addressed by using more powerful computers than a mini PC in the future, and in our experiments the performance drop was not very significant.
Figure 11: Comparison of the phase prediction performance using small depth images only versus depth images plus MFSC.

As mentioned earlier, we trained the model with Keras on a TensorFlow backend. Since a C# environment is required for Kinect-based depth image capture, we established a local TCP connection between the C# data capture system and the feedforward network running in a Python process (a minimal sketch of this bridge follows). We trained the model with depth images captured during 50 cases and tested the system on 10 actual trauma resuscitations. The system achieved an 8% overall progress estimation error and 86% phase detection accuracy (Figure 11). The accuracy is slightly lower than that of the system using larger depth images with MFSC data as input, but the performance is still adequate for many applications.
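A minimal sketch of the Python side of this local TCP bridge; the port number, the 16-bit frame encoding, and the preloaded `model` are assumptions, not details from the deployment.

```python
# Minimal sketch of the Python end of the local TCP bridge: receive one 64x64
# depth frame per second from the C# capture process and run the network.
# Port number and 16-bit frame encoding are assumptions; `model` is preloaded.
import socket
import numpy as np

FRAME_BYTES = 64 * 64 * 2                       # one 16-bit depth frame

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 9000))                   # local port shared with C#
srv.listen(1)
conn, _ = srv.accept()

while True:
    buf = b''
    while len(buf) < FRAME_BYTES:               # read exactly one frame
        chunk = conn.recv(FRAME_BYTES - len(buf))
        if not chunk:
            raise ConnectionError('capture process closed the stream')
        buf += chunk
    frame = np.frombuffer(buf, dtype=np.uint16).astype('float32')
    frame = frame.reshape(1, 1, 64, 64, 1)      # (batch, time, h, w, channel)
    completeness = model.predict(frame)[0, 0]   # feed the feedforward network
```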
5.2 Limitations
The proposed system works well on several datasets recorded in real-world scenarios. However, the completeness labels we used were based only on relative time. As argued earlier, the duration of a complex process can be affected by many factors (e.g., the availability of medical resources affects the waiting time before the post-secondary phase), and long wait times do not always indicate a change in progress. Instead of labeling the data with time-based completeness, it might be reasonable to use activity combinations as indicators. For example, if the left eye and right eye have both been checked, then 5% of the trauma resuscitation has been finished. Labeling detailed activities with associated completeness values would require more manual coding work. Moreover, for complex datasets such as medical activity sequences, it is hard to determine an activity combination's contribution to overall completeness given different patient conditions. More work and collaboration with the medical team are necessary to design a phase recognition model that determines process progress based on both time and the activities performed.

In addition, the current system assumes the workflow follows a certain order, and the training process therefore prefers input data that follows that order. The system may not work well if the process is not a fully linear workflow, as in the EndoTube and TUM LapChole datasets, because the GMM-based approach assumes a single normal distribution per phase, which cannot describe a repeated
phase. Training a model able to handle processes with different phase orders will be our future work.
5.3 Extension: Smart Human-Computer Interaction
Online progress estimation is not the end goal, since the prediction by itself has limited value. Progress detection information, however, is valuable for building smarter, high-level human-computer interaction systems that improve work efficiency, and the proposed system has great potential in many real-world applications. One possible extension is an online trauma resuscitation checklist system. Currently, a checklist is used in the trauma room to assist medical teams with resuscitation management and to ensure that no required tasks are omitted. The challenge is that trauma resuscitations are fast-paced and the doctors might forget to check certain items during the resuscitation. The online phase detection system can be used to remind the medical team to check certain items at the end of each process phase. A future online activity recognition system would provide more contextual information, making the checklist digital and smarter.
6 Conclusion
We introduced a system capable of estimating the progress of complex processes. We provide two real-world datasets collected with different commercially available sensors, and used several published datasets to evaluate the system. The system outperformed previous research and can be extended to many real-world applications, providing fundamental information for advanced human-computer interaction. We described our deployment of the system in an actual trauma room for online resuscitation progress estimation and a potential extension of the system. We hope this paper delivers the following benefits to the community:
1. A regression-based system structure for process completeness estimation.
2. The GMM-based approach and conditional loss that can be used with the regression model for classification tasks.
3. A detailed system deployment description that can serve as a guideline for extending the system to other fields.
4. The trauma resuscitation dataset and the Olympic swimming dataset, which will be published with ground-truth labels for future study.
7 Acknowledgement
This research was supported by the National Institutes of Health under Award Number R01LM011834. The authors would like to thank Colin Lea for providing the EndoTube dataset.
References

[1] Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., and Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 10 (2014), 1533–1545.

[2] Bardram, J. E., Doryab, A., Jensen, R. M., Lange, P. M., Nielsen, K. L., and Petersen, S. T. Phase recognition during surgical procedures using embedded and body-worn sensors. In Pervasive Computing and Communications (PerCom), 2011 IEEE International Conference on (2011), IEEE, pp. 45–53.

[3] Blum, T., Padoy, N., Feußner, H., and Navab, N. Modeling and online recognition of surgical phases using hidden Markov models. In International Conference on Medical Image Computing and Computer-Assisted Intervention (2008), Springer, pp. 627–635.

[4] Dimitrov, A., and Golparvar-Fard, M. Vision-based material recognition for automated monitoring of construction progress and generating building information modeling from unordered site image collections. Advanced Engineering Informatics 28, 1 (2014), 37–49.

[5] Forestier, G., Riffaud, L., and Jannin, P. Automatic phase prediction from low-level surgical activities. International Journal of Computer Assisted Radiology and Surgery 10, 6 (2015), 833–841.

[6] Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015).

[7] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1725–1732.

[8] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[9] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

[10] Kuehne, H., Gall, J., and Serre, T. An end-to-end generative framework for video segmentation and recognition. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on (2016), IEEE, pp. 1–8.

[11] Lea, C., Vidal, R., and Hager, G. D. Learning convolutional action primitives for fine-grained action recognition. In Robotics and Automation (ICRA), 2016 IEEE International Conference on (2016), IEEE, pp. 1642–1649.

[12] Li, X., Zhang, Y., Li, M., Chen, S., Austin, F. R., Marsic, I., and Burd, R. S. Online process phase detection using multimodal deep learning. In Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), IEEE Annual (2016), IEEE, pp. 1–7.

[13] Li, X., Zhang, Y., Marsic, I., Sarcevic, A., and Burd, R. S. Deep learning for RFID-based activity recognition. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems (2016), ACM, pp. 164–175.

[14] Padoy, N., Blum, T., Ahmadi, S.-A., Feussner, H., Berger, M.-O., and Navab, N. Statistical modeling and recognition of surgical workflow. Medical Image Analysis 16, 3 (2012), 632–641.

[15] Primus, M. J., Schoeffmann, K., and Böszörmenyi, L. Temporal segmentation of laparoscopic videos into surgical phases. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (2016), IEEE, pp. 1–6.

[16] Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (2015), IEEE, pp. 4580–4584.

[17] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[18] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.

[19] Stauder, R., Okur, A., Peter, L., Schneider, A., Kranzfelder, M., Feussner, H., and Navab, N. Random forests for phase detection in surgical workflow analysis. In International Conference on Information Processing in Computer-Assisted Interventions (2014), Springer, pp. 148–157.

[20] Stauder, R., Ostler, D., Kranzfelder, M., Koller, S., Feußner, H., and Navab, N. The TUM LapChole dataset for the M2CAI 2016 workflow challenge. arXiv preprint arXiv:1610.09278 (2016).

[21] Stein, S., and McKenna, S. J. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (2013), ACM, pp. 729–738.

[22] Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., and Padoy, N. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36, 1 (2017), 86–97.

[23] Zeiler, M. D., and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (2014), Springer, pp. 818–833.