Using Physiological Signals to Detect Natural Interactive Behavior

Yasser Mohammad · Toyoaki Nishida
Abstract Many researchers in the Human Robot Interaction (HRI) and Embodied Conversational Agents (ECA) domains try to build robots and agents that exhibit human-like behavior in real-world close encounter situations. One major requirement for comparing such robots and agents is an objective quantitative metric for measuring naturalness in various kinds of interactions. Some researchers have already suggested techniques for measuring stress level, awareness, etc., using physiological signals like Galvanic Skin Response (GSR) and Blood Volume Pulse (BVP). One problem of available techniques is that they have only been tested in extreme situations and, according to the analysis provided in this paper, cannot distinguish the response of human subjects in natural interaction situations. Another problem of the available techniques is that most of them require calibration and sometimes ad-hoc adjustment for every subject. This paper explores the usefulness of various kinds of physiological signals and statistics in distinguishing natural and unnatural partner behavior in a close encounter situation. The paper also explores the usefulness of these statistics in various time slots of the interaction. Based on this analysis, a regressor was designed to measure naturalness in close encounter situations; it was evaluated using human-human and human-robot interactions and shown to achieve a statistically significant distinction between natural and unnatural situations.

Keywords RSST · Psychophysiology · HRI · data mining

Y. Mohammad
Kyoto University, Graduate School of Informatics
Tel.: +81-75-753-5867
Fax: +81-75-753-4961
E-mail: [email protected]

T. Nishida
Kyoto University, Graduate School of Informatics
1 Introduction

Personal robots that co-exist with untrained human users and partners are expected to increase dramatically in the near future. The acceptance of these robots in human society depends on how the untrained human partners with whom they are to interact perceive them. For this reason it is important to have subjective evaluations of new interactive robot designs. Nevertheless, subjective evaluations are not completely satisfactory for comparing interactive robots for two reasons. Firstly, even though subjective evaluations through questionnaires can cover, to some extent, the response of the subjects to the robot, it is very hard to compare the results of such subjective evaluations coming from different research groups due to the difficulty of controlling environmental conditions across experiments. Secondly, the results of questionnaires are known to be cognition mediated, and disagreement between reported and measured behavior is evident from many psychological studies [1]. In a domain with strong novelty effects and sometimes high cognitive load, such as interacting with robots, these limitations of questionnaire-based subjective evaluations become even stronger.

To avoid these problems, objective metrics are needed to compare different interactive robot designs. Three main types of metrics can be used: psychological metrics based on psychological tests, behavioral metrics based on comparing the external behavior of the subjects to some predefined standard, and physiological metrics based on measuring the internal state of the subjects using their physiological responses. We focus on physiological metrics in this paper; psychological and behavioral metrics are to be included in future research. There are many possible dimensions of comparison between interactive robots: their appearance, their behavior, the partner's response to them, etc. One obvious measure
of interactive robots' behavior is how human-like they are in terms of behavior and appearance, but it is not always the case that human-like behavior and appearance are considered more natural or desirable by the partner [2]. As the goal of our metric is to compare interactive robots as perceived by their human partners, we focus on the partner's response to the robot.

Definition 1: Natural behavior of an agent is defined in this paper as the behavior that minimizes negative emotions (e.g. stress, cognitive load, frustration, and anxiety) of the partner during the interaction while maximizing his/her positive emotions (e.g. engagement).

A robot or agent that has more natural behavior according to this definition is more desirable than a robot that has less natural behavior in most normal situations. Using this definition it is possible to measure the naturalness of an agent's or robot's behavior by measuring the physiological response of its human partner to this behavior. In this paper, we utilize physiological signals that are known from past research to correlate with positive and negative emotions in order to recognize the emotional state and, in turn, the naturalness of the robot's/agent's behavior as defined in Definition 1. Even though many physiological signals (e.g. Galvanic Skin Response (GSR), Blood Volume Pulse (BVP), Respiration Rate (RR), Skin Temperature (ST), etc.) were shown to correlate with different aspects of human internal states (e.g. stress, cognitive load, anxiety, frustration, engagement, etc.) [3], [4], [5], [6], [7], [8], in most cases the analysis was done under extreme controlled conditions in which differences in the internal state are intense enough to be captured by simple statistics of the physiological signal under processing. For example, [4] and [5] used a 3D computer game environment, while [3] used a complex multimodal user interface with 12 different tasks of varying complexity. It is not clear that the correlations found under these extreme conditions can be reliably found in natural interaction situations. One contribution of this paper is to show that they cannot, and that more complex signal processing and machine learning techniques are needed to uncover the differences in physiological response in natural interaction situations.

A second problem of available methods is that they put very little emphasis on the effect of the interaction context on the measured physiological signals, because most of them are used to measure the response to an inanimate object (e.g. a computer interface). For example, human-human interaction research shows that normal interactions between humans go through different phases [9], including opening and closing phases. It is expected that the physiological response of every partner to the behavior of other partners will depend on the interaction phase in which this behavior takes place.

A third problem of physiological sensing of internal state is that in most cases different individuals have different
responses to the same stimuli, and these individual differences can be much higher than the differences that depend on the stimulus itself. For example, [8] has shown that only 74% of the tested subjects have a statistically significant correlation between GSR level and arousal, despite the wide usage of this signal to measure arousal [6].
From the perspective of the agent executing the behavior, it is not meaningful to speak about natural artifact (robot) behavior. Nevertheless, in this work we use the word natural with a different meaning. We use it from the perspective of the agent perceiving the behavior, and this agent is always a human in our case. For example, if the artifact (robot) has two cameras in an arrangement similar to human eyes, its human partner may find it unsettling if these cameras (perceived as eyes) are fixed on some object in the scene that is not related to the current interaction. This is the kind of unnaturalness we speak about in this work (as depicted in Definition 1). Because naturalness can also be seen as describing the acting agent's behavior (without any relation to a perceiving agent), the word may not be perfect. Nevertheless, all other options have some ambiguity in them. For example, the word agreeable may be interpreted as meaning according to rules or protocol, which is stricter than what we mean. The word desirable may be interpreted as meaning useful, which is again different from what we mean. The expression human-compatible may be interpreted to mean human-like, which is not the case in our situation as the form factor of the robot may cause human-like behavior to be unsettling (unnatural in our terminology). We actually used the expressions agreeable and human-compatible, as well as natural, in subjective questionnaires before [10], and from our experience the word natural corresponded most closely to our intended meaning in the answers of the subjects.
In this paper we cast the problem of measuring the naturalness of an agent's behavior as a problem of measuring the human's physiological response to that agent, and we cast this latter problem as a regression problem in which a set of input features (calculated from the physiological signals) are fed to the system and the result is a real number ranging from zero to one measuring how natural the behavior was from the viewpoint of the subject. The result is a novel metric for behavior naturalness (as defined earlier in this section) that utilizes signal processing, data mining and machine learning techniques to alleviate the three aforementioned limitations of available physiological metrics. The proposed metric was able to distinguish natural and unnatural listening behavior in an explanation scenario. The main contribution of this paper to applied intelligence is the utilization of a data mining technique for developing an accurate measure of interaction naturalness using physiological signal processing.
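To make this framing concrete, the following minimal Python sketch shows how such a naturalness regressor could be assembled from off-the-shelf components. The feature matrix, the labels, and the choice of a linear support vector regressor are illustrative assumptions only, not the model developed later in this paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: one row of physiological features per session and a
# naturalness label in [0, 1] per session (1 = fully natural). The shapes
# mirror the 66 sessions and 224 features described later in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(66, 224))
y = rng.uniform(0.0, 1.0, size=66)

# Generic regressor: standardize the features, fit a linear support vector
# regressor, and clip its predictions to the [0, 1] naturalness range.
model = make_pipeline(StandardScaler(), SVR(kernel="linear"))
model.fit(X, y)
naturalness = np.clip(model.predict(X[:5]), 0.0, 1.0)
print(naturalness)
```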
2 Experimental Scenario

In this paper we are interested in evaluating the subject's response to the behavior of his/her partner during a natural close encounter situation. There can be many varieties of such situations, and in this work we focus on the explanation scenario in which the subject explains the assembly/disassembly of some machine or device to a listener using verbal and nonverbal modalities. The listener can either be another human subject or a robot. This explanation scenario was selected because of its importance for HRI applications (e.g. learning by demonstration, knowledge media robots [2], companion robots, etc.). To reduce the number of participants needed to get robust results we used only dyadic interactions. To make the interaction context as normal as possible only noninvasive sensors were used, and the instructor was familiarized with the task before the experiment. The time of the sessions was not limited by the experimenter to reduce the effect of stress due to time constraints on the results.

Fig. 1 Comparison of GSR signals of the instructor before and during session: (a) instructor's GSR signal before session; (b) instructor's GSR signal during session.
3 Physiological Signals

Some researchers in the Human Computer Interaction (HCI) and HRI communities have studied various physiological measures of the internal state of human subjects. In [4], evidence was found that physiological data correlate with task performance data in a video game: with a decrease of the task performance level, the normalized galvanic skin response (GSR) increases. In addition, physiological data were mirrored in subjective reports assessing stress level.
3.1 Skin Conductance

There are specific sweat glands that are used to measure skin conductance called the eccrine sweat glands. Located in the palms of the hands and the soles of the feet, these sweat glands respond to psychological stimulation rather than simply to temperature changes in the body. As sweat rises in a particular gland, the resistance of that gland decreases even though the sweat may not reach the surface of the skin. Skin conductance measures the electrodermal activity of the autonomic nervous system. Skin conductance is correlated with affective arousal [6], cognitive load [3], frustration [5], and engagement [7]. In [8], 74% of the subjects exhibited this correlation. Skin conductance was measured using an EDA-1 sensor connected to the Polymate physiological sensing device. Two channels related to skin conductance were measured in this experiment: Galvanic Skin Response (GSR) and Skin Conductance Level (SCL).
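The paper does not detail how the SCL and GSR channels are separated from the raw sensor output. A common approach, sketched below under the assumption of a 100 Hz skin-conductance recording, is to take a very slow low-pass component as the tonic SCL and the remaining fast fluctuations as the phasic GSR; the cutoff frequency used here is an illustrative choice.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_eda(eda, fs=100.0, cutoff_hz=0.05):
    """Split a raw skin-conductance signal into tonic (SCL) and phasic (GSR) parts.

    A zero-phase low-pass filter with a very low cutoff keeps the slowly
    drifting conductance level; subtracting it leaves the faster phasic
    responses superimposed on that level.
    """
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")
    scl = filtfilt(b, a, eda)   # tonic skin conductance level (SCL)
    gsr = eda - scl             # phasic galvanic skin response (GSR)
    return scl, gsr

# Example with synthetic data: a slow drift plus a breathing-like oscillation.
t = np.arange(0, 60, 1 / 100.0)
eda = 5.0 + 0.01 * t + 0.2 * np.sin(2 * np.pi * 0.5 * t)
scl, gsr = split_eda(eda)
```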
Fig. 2 Comparison of BVP signals of the instructor before and during session: (a) instructor's BVP signal before session; (b) instructor's BVP signal during session.
Fig. 1 shows the GSR signal of the instructor before and during a typical explanation session. The signal contains outliers in both cases and, as the figure shows, there is a clear difference in the shape of the signal between the two conditions.
3.2 Blood Volume Pulse

Blood Volume Pulse (BVP) measures the cardiac activity of the autonomic nervous system using a sensor attached to one of the body extremities. The heart rate deducible from BVP has been used to differentiate between positive and negative emotions [11]. Heart rate variability can also be deduced from BVP and is used extensively in the human factors literature as an indication of mental effort and stress in high stress environments [12]. As BVP sensors are very sensitive to motion, the sensor was attached to the ear of the instructor (who moves his/her hands during the explanation) and to the finger of the listener (who moves much less during the experiment). Fig. 2 shows a typical BVP signal before and during an explanation session. The figure clearly shows the effect of motion in distorting the signal. The apparent distortion can safely be attributed to motion because the signal before the session was collected just before the session start, while the signal during the session was collected just after the subject started moving at the beginning of the session. The stress level is not expected to rise much in the short time period between the two samples (a few seconds). Also, the effect of stress on BVP should appear as a change in rate, not a distortion of all the periods.
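The exact derivation of HR and HRV used in the paper is not spelled out here. The sketch below shows one common way to obtain both from a BVP trace by detecting pulse peaks and summarizing the inter-beat intervals; using SDNN as the HRV index is an assumption on our part.

```python
import numpy as np
from scipy.signal import find_peaks

def hr_hrv_from_bvp(bvp, fs=100.0):
    """Estimate mean heart rate (bpm) and a simple HRV index (SDNN, seconds) from BVP.

    Pulse peaks are detected in the BVP waveform; the inter-beat intervals
    (IBIs) give the instantaneous heart rate and its variability.
    """
    # Require peaks at least 0.4 s apart (i.e. below 150 bpm) to reject noise.
    peaks, _ = find_peaks(bvp, distance=int(0.4 * fs))
    ibi = np.diff(peaks) / fs          # inter-beat intervals in seconds
    hr = 60.0 / ibi                    # instantaneous heart rate in bpm
    return float(np.mean(hr)), float(np.std(ibi))

# Synthetic example: a noisy 1.2 Hz (72 bpm) pulse-like wave sampled at 100 Hz.
t = np.arange(0, 30, 1 / 100.0)
bvp = np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.random.randn(t.size)
mean_hr, hrv = hr_hrv_from_bvp(bvp)
```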
Fig. 3 Comparison of respiration signals of the instructor before and during session: (a) instructor's respiration signal before session; (b) instructor's respiration signal during session.
3.3 Respiration

Respiration is most accurately measured by gas exchange in the lungs, but the sensors required to measure it this accurately prevent moving and talking. Instead, a stretch sensor can measure the chest cavity expansion associated with breathing. This kind of sensor is noninvasive and was used in this experiment. Respiration is believed to be too slow to reflect real-time changes in the internal state of humans [4], but in [1] we showed that, using appropriate processing, it can be a reliable physiological measure of naturalness. Fig. 3 shows typical respiration signals before and during an explanation session. In this case, the effect of motion is less than in the case of BVP (see Fig. 2). In summary, we used three physiological sensors (skin conductance, BVP, and respiration) that are believed, based on related research, to have high correlations with the dimensions of response to natural behavior as defined in this paper (Definition 1). The sampling rate used was 100Hz.
4 Experimental Setup and Data Processing

4.1 Data Collection

The data collected in this experiment comprises three types of sessions.
Fig. 4 A frame from the experiment showing the instructor and the listener
These are human-human interaction sessions using natural nonverbal interaction, human-human interaction sessions using unnatural nonverbal behavior, and human-robot interaction sessions using the simplest possible interaction protocol. This availability of positive and negative examples of behavior is a unique property of the collected data: it allows robot designers to utilize learning algorithms that require negative examples for their training, and it also provides a ground truth for testing different interaction evaluation metrics. To make the sessions as comparable as possible, only two explanation objects were used as the task of the explanation, and each was used in exactly one half of the interactions. The two explanation objects were selected to have roughly the same size and complexity (see Fig. 6). The listener in this experiment used a cold mask to cover his/her face (Fig. 4).
Fig. 5 A frame from the experiment showing the instructor with the robot

Fig. 6 The explanation objects used in the experiment
Covering the face was needed to reduce the effect of the listener's facial expressions on the instructor, as most current humanoids, with few exceptions, have far less expressive faces than humans. Normal cold masks were used to cover the listener's face to reduce their effect on the instructor's behavior (Fig. 4). Moreover, such face masks are common in Japan in winter to prevent the spread of flu, and school children are usually instructed to wear them (the experiment was conducted in winter). Because an instructor's response to robots can be different from his/her response to human listeners, we added an HRI session in which the listener is a humanoid robot. The robot used in this experiment is a Robovie II humanoid robot (see Fig. 5), and the only action it performs during the interaction is moving its eyes randomly. This can be considered the simplest possible interactive robot, as its movement is not at all related to the behavior of the instructor, and it thus provides a baseline for evaluating other, more interactive robots. We did not use a completely motionless robot as our baseline interactive robot because the instructor might think that it is broken or shut down. The Robovie II was selected because it is not so similar to humans as to risk an uncanny valley effect, yet still similar enough to elicit an anthropomorphic response. The participants (44 subjects) were randomly assigned either the listener or the instructor role. The instructor interacts in three sessions with two different human listeners and one humanoid robot (Robovie II). The instructor always explains the same assembly/disassembly task after being familiarized with it before the sessions. To reduce the effect of novelty, the instructor sees the robot and is familiarized with it before the interaction [13]. The listener listens to two different instructors explaining two different assembly/disassembly tasks. In one of these two sessions (s)he plays the natural listener role, in which (s)he tries to listen as carefully as possible and use normal nonverbal interaction behavior. In the other session (s)he plays the unnatural listener role, in which (s)he tries to appear distracted and to use an abnormal nonverbal interaction protocol. The listener is free to decide what is natural and what is unnatural. To eliminate ordering effects, the experiment was scheduled so that most possible orders of listening and instructing sessions are available in the collected data.
Table 1 shows the procedure of the participants playing the instructor role. The instructor is first informed about the sensors that will be attached to him/her and about the experiment procedure (except the fact that one of the listeners with whom (s)he will interact will try not to listen normally). The participant then signs a consent form. After sensor attachment and filling in the pre-questionnaire, the experimenter (the same experimenter in all sessions, who was the first author of this paper) teaches the instructor how to assemble/disassemble one of the explanation objects shown in Fig. 6. Once this is completed the participant (called the instructor from here on) is allowed to do the task by himself/herself as many times as needed until (s)he is confident that (s)he can explain this assembly/disassembly task. The instructor is also briefly introduced to the robot. Once this preparation is complete the instructor carries out three sessions in which (s)he explains the assembly/disassembly of the explanation object (s)he just learned to two different human listeners and one robot listener. After every session the instructor fills in a questionnaire and is allowed to relax for a while (at least 5 minutes) to allow his/her physiological signals to return to their steady-state levels. After finishing all the sessions, the instructor fills in a post-questionnaire. The three sessions of every instructor are one session with the robot, one session with a human listener playing the natural listener role, and one session with a human listener playing the unnatural listener role. The procedure for the listener role is shorter than the procedure for the instructor role (see Table 2). The listener is first informed about the sensors that will be attached to him/her and about the experiment procedure. The participant then signs a consent form. Once this preparation is complete the listener carries out two sessions in which (s)he listens to an explanation about the assembly/disassembly of the two explanation objects in Fig. 6. After every session the listener fills in a questionnaire and is allowed to relax for a while (at least 5 minutes) to allow his/her physiological signals to return to their steady-state levels. After finishing all the sessions, the listener fills in a post-questionnaire. The procedure outlined above ensures the following:
Table 1 Procedure for the Instructor Role

Num.  Step
1     Orientation
2     Sensor Attachment
3     Pre-Questionnaire
4     Task Explanation
5     Task and Robot Familiarization
6     Session 1
7     Post-Session Questionnaire 1
8     Session 2
9     Post-Session Questionnaire 2
10    Session 3
11    Post-Session Questionnaire 3
12    Sensor Removal
13    Post-Questionnaire

Recorded step durations (Min Time, Max Time): (0:14:40, 0:45:25), (0:05:13, 0:12:31), (0:05:05, 0:51:15), (0:03:51, 0:08:29), (0:05:36, 0:39:51), (0:05:00, 0:08:58), (0:05:03, 0:13:42)
Fig. 7 Data processing steps applied to the physiological signals collected during the 66 sessions. Each line shows the dimensionality of the data transferred between processing blocks
Table 2 Procedure for the Listener Role

Num.  Step
1     Orientation
2     Sensor Attachment
3     Pre-Questionnaire
4     Session 1
5     Post-Session Questionnaire 1
6     Session 2
7     Post-Session Questionnaire 2
8     Sensor Removal
9     Post-Questionnaire

Recorded step durations (Min Time, Max Time): (0:10:50, 0:21:15), (0:03:51, 0:12:31), (0:05:14, 0:22:25), (0:05:13, 0:11:05), (0:04:14, 0:09:21)
1. Every instructor interacts with a natural listener, an unnatural listener, and the robot. The order of these sessions is shuffled to resist ordering effects. The natural and unnatural listening sessions are with different listeners.
2. Every instructor always explains the same explanation object, to reduce the cognitive load needed to remember the explanation.
3. Every listener interacts with two different instructors explaining two different explanation objects. This was needed to resist familiarization effects. The order of roles and explanation objects is shuffled to reduce ordering effects.
4. Every listener interacts in both the natural and unnatural listening roles, to reduce the effects of personal differences without increasing the number of listeners needed.
4.2 Preliminary Analysis

Table 3 Difference in means of GSR, HRV and Respiration Rate between different conditions in the experiment. For every pair of conditions the t and p values are shown.

Conditions               Mean GSR            Mean HRV            Mean RR
Natural vs. Unnatural    t=1.833, p=0.074    t=1.25, p=0.192     t=0.228, p=0.192
Natural vs. Robot        t=1.391, p=0.171    t=0.532, p=0.597    t=0.770, p=0.445
Unnatural vs. Robot      t=0.952, p=0.347    t=0.849, p=0.401    t=0.577, p=0.567

Table 3 shows the results of t-test analysis of the mean GSR (usually used to measure stress level, cognitive load, and arousal), mean HRV (used to measure mental effort and stress), and respiration rate in the three conditions of this experiment. As is clear from the table, there is no statistically significant difference between the subject's response to natural, unnatural, and robot behaviors that can be deduced from these features. This negative result shows that traditional stress and cognitive load physiological measures cannot be used directly to determine the subject's response to the partner's behavior in natural close encounter settings like the
explanation situation used in this experiment. These results agree with what was reported in [13] for other interaction contexts. In the rest of this paper we provide a solution to this problem by fusing features of multiple signals at appropriate times of the interaction using either a Decision Tree classifier [14] or a Support Vector Machine (SVM) [15].
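For reference, the pairwise comparisons reported in Table 3 can be reproduced on any per-session feature with a standard two-sample t-test. The sketch below uses randomly generated placeholder values (22 sessions per condition, matching the number of instructors, is an assumption); it illustrates only the test, not the actual data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder per-session means of one feature (e.g. mean GSR) per condition.
rng = np.random.default_rng(1)
natural = rng.normal(0.0, 1.0, 22)
unnatural = rng.normal(0.2, 1.0, 22)
robot = rng.normal(0.1, 1.0, 22)

# Independent two-sample t-tests for each pair of conditions, as in Table 3.
pairs = {
    "Natural vs. Unnatural": (natural, unnatural),
    "Natural vs. Robot": (natural, robot),
    "Unnatural vs. Robot": (unnatural, robot),
}
for name, (a, b) in pairs.items():
    t, p = ttest_ind(a, b)
    print(f"{name}: t={t:.3f}, p={p:.3f}")
```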
4.3 Data Processing

The goals of this work were three. Firstly, to assess the usefulness of various physiological signals and features extracted from these signals in evaluating naturalness of behavior as defined in this work (Definition 1). Secondly, to support or reject the hypothesis that different time slots of the interaction may not be of the same importance in this evaluation. Thirdly, to use the most important signals and time slots to derive an objective metric of naturalness. Fig. 7 shows the processing steps applied to the signals collected during this experiment. As physiological sensors are usually very sensitive, outlier removal is a must when dealing with them. To remove outliers we used Rosner's many-outliers test [16] applied to the logarithm of the input data (as the original data was not normally distributed). After the outliers were removed, the signals were smoothed using a second-order Savitzky-Golay filter. The following signals were then calculated from the smoothed sensor data: Heart Rate (HR), Heart Rate Variability (HRV), and the raw pulse data (P) were calculated from the BVP data. Skin Conductance Level (SCL) and Galvanic Skin Response (GSR) were calculated from the skin-conductance sensor data. Respiration Rate (RR), Respiration Rate Variability (RRV), and the raw respiration pattern (R) were calculated from the respiration sensor. This leads to a total of eight physiological signals to be processed. To find the effective interaction time slots, 32 time series were calculated from the eight physiological signals of every session by using the first two minutes, the middle four minutes, the last three minutes, and the total interaction time.
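The sketch below illustrates this preprocessing chain for a single signal. Rosner's many-outliers (generalized ESD) test is not reimplemented here; a plain z-score screen on the log-transformed data stands in for it, which is a simplification of the paper's procedure, followed by the second-order Savitzky-Golay smoothing.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(signal, window=101, z_thresh=3.5):
    """Outlier rejection (simplified) and smoothing of one physiological signal.

    The paper applies Rosner's many-outliers test to the logarithm of the data;
    here a plain z-score screen in the log domain stands in for it. Flagged
    samples are replaced by linear interpolation before Savitzky-Golay smoothing.
    """
    x = np.log(np.clip(signal, 1e-9, None))            # work in the log domain
    z = (x - np.mean(x)) / (np.std(x) + 1e-12)
    good = np.abs(z) < z_thresh
    idx = np.arange(signal.size)
    cleaned = np.interp(idx, idx[good], signal[good])  # fill rejected samples
    return savgol_filter(cleaned, window_length=window, polyorder=2)

# Example: a 100 Hz GSR-like trace with two artificial spikes.
fs = 100
raw = 5.0 + 0.3 * np.sin(2 * np.pi * 0.05 * np.arange(10 * fs) / fs)
raw[200] += 20.0
raw[500] += 20.0
smoothed = preprocess(raw)
```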
This two-minute boundary was decided based on our previous research on the adaptation of human behavior to nonhumanoid robots [17], which found that the adaptation started after roughly two minutes of the interaction time. For every one of these 32 time series the following features were extracted: mean (MEA), median (MED), standard deviation (STD), minimum (MIN) and maximum (MAX), leading to 160 features for every session. These are the statistics usually extracted from physiological data when measuring internal state [3], [6], [18]. The preliminary analysis presented in the preceding section showed that under normal interaction conditions, the mean of the physiological signals did not provide much information about the internal state of the subject. These results were shown to hold to the same extent for the minimum and maximum of all of the measured signals. The standard deviation, on the other hand, showed a slight improvement, as it was lower in the natural condition, but the difference was not statistically significant. Based on this observation, and the fact that negative emotions cause irregularities in the respiration pattern [19], we decided to estimate the change in the underlying generating process of every physiological signal at every point of time and to use this as an extra source of information about the change in the internal state of the subject. To measure changes in the underlying dynamics of every signal, the Robust Singular Spectrum Transform (RSST) [20] was applied to every time series, and the number of local maxima per second (RSST LMD) and the maximum number of local maxima per minute (RSST LMR) were calculated. This led to 64 more features for every session, totaling 224 features per session.
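A sketch of the per-slot feature extraction is given below. It assumes a 100 Hz signal, a change-point response of the same length already computed by some RSST implementation, and anchors the middle four minutes symmetrically around the session midpoint; all three are illustrative assumptions.

```python
import numpy as np
from scipy.signal import argrelmax

FS = 100  # sampling rate in Hz

def slot_features(series, rsst_response):
    """Five summary statistics plus the two RSST-based counts for one time slot."""
    feats = {
        "MEA": np.mean(series), "MED": np.median(series), "STD": np.std(series),
        "MIN": np.min(series), "MAX": np.max(series),
    }
    peaks = argrelmax(rsst_response)[0]                   # local maxima of the RSST response
    feats["RSST_LMD"] = peaks.size / (series.size / FS)   # local maxima per second
    # Maximum number of local maxima falling inside any one-minute window.
    minute_of_peak = peaks // (60 * FS)
    feats["RSST_LMR"] = int(np.bincount(minute_of_peak).max()) if peaks.size else 0
    return feats

def time_slots(series):
    """First two minutes, middle four minutes, last three minutes, whole session."""
    two, three = 2 * 60 * FS, 3 * 60 * FS
    mid = series.size // 2
    return [series[:two], series[mid - two:mid + two], series[-three:], series]

# Example with a synthetic 10-minute signal and a fake RSST response.
sig = np.random.rand(10 * 60 * FS)
feats = [slot_features(s, np.random.rand(s.size)) for s in time_slots(sig)]
```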
4.4 Robust Singular Spectrum Transform

The RSST transform is a data mining algorithm for unsupervised change-point detection in one-dimensional time series [20]. Given a time series x(t) of length T, the RSST transform of this time series is another time series x_rsst(t) of the same length T, where 0 ≤ x_rsst(t) ≤ 1 for t = 1 : T. A higher response means a higher probability that the underlying generation process of the time series is changing. This transform has been shown to provide an effective measure of change when applied to synthetic data as well as respiration data in [20]. RSST is based on the well-known Singular
Spectrum Transform (SST) algorithm, but it provides better noise rejection and increased specificity compared with the traditional SST. To test the accuracy of RSST with real-world data, we analyzed the respiration responses of the 22 instructors.
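RSST itself is specified in [20] and is not reproduced here. The following sketch implements only the plain Singular Spectrum Transform that RSST builds on, with small illustrative window sizes, so its output is noisier and less specific than the RSST responses discussed above.

```python
import numpy as np

def sst(x, w=50, n_cols=25, rank=3):
    """Basic Singular Spectrum Transform change scores for a 1-D time series.

    At each time t, Hankel matrices of `n_cols` lagged windows of length `w`
    are built from the past and the future of t; the score measures how poorly
    the dominant future direction is explained by the top `rank` past directions.
    """
    def hankel(seg):
        return np.column_stack([seg[i:i + w] for i in range(n_cols)])

    span = w + n_cols - 1
    scores = np.zeros(len(x))
    for t in range(span, len(x) - span):
        past = hankel(x[t - span:t])
        future = hankel(x[t:t + span])
        U = np.linalg.svd(past, full_matrices=False)[0][:, :rank]
        u_f = np.linalg.svd(future, full_matrices=False)[0][:, 0]
        proj = U.T @ u_f
        scores[t] = 1.0 - float(proj @ proj)   # 0 = no change, 1 = maximal change
    return scores

# Example: a signal whose frequency changes halfway through.
t = np.arange(0, 20, 0.01)
x = np.where(t < 10, np.sin(2 * np.pi * 0.3 * t), np.sin(2 * np.pi * 1.0 * t))
change_score = sst(x)
```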
Fig. 8 Change points detected in the Respiration of the Instructor in the two conditions of the experiment
RSST and SST were then applied to the respiration signal of the instructor. Fig. 8.a shows the original signal from the respiration sensor and the SST and RSST responses during a natural listening session. Fig. 8.b shows the same time series for an unnatural listening session of the same instructor. From the original signal and the SST response, it is very hard to devise a classifier that can distinguish these two kinds of sessions. On the other hand, the number of peaks in the RSST response is much higher in the unnatural session compared with the natural session.
4.5 Normalization Previous research in HCI suggests that individual differences between subjects can partially be eliminated by normalizing the statistics driven from physiological signals during the experiment by the same statistics from the same signals be-
8 b fore the interaction using:SN = S−S Sb , where SN is the statistic/feature after normalization, S is the statistic/feature before normalization and Sb is the statistic/feature in the five minutes immediately preceding the session. By normalizing over the base statistic before every session rather than the base values obtained before the whole experiment as was done in [5], effects of earlier sessions do not leak to the features calculated at later sessions.
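In code, this normalization is a one-liner; the example values below are made up for illustration.

```python
def normalize_feature(s, s_base):
    """Baseline normalization S_N = (S - S_b) / S_b, where S_b is the value of
    the same statistic in the five minutes immediately preceding the session."""
    return (s - s_base) / s_base

# Example: a session mean GSR of 6.3 against a pre-session baseline of 5.0.
print(normalize_feature(6.3, 5.0))   # 0.26
```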
5 Feature Selection and Classification
Fig. 9 Effectiveness of various time slots in distinguishing various listener conditions.
5.1 Statistical Feature Ranking

In [21], we used a simple feature selection algorithm based on Principal Component Analysis (PCA) for feature ranking. One disadvantage of this technique for feature ranking is that it does not utilize the session type (natural, unnatural, robot) when deciding the importance of features. This means that the resulting feature ranking will favor features that contribute more to the variance in the data but may not lead to good features for classification. In this section we propose another technique to rank features that directly utilizes their contribution to the statistical difference between the different conditions in the data set. To quantify the ability of each period, signal and statistic to distinguish between the different listener conditions, we applied ANOVA analysis followed by a multiple comparison procedure (using the Scheffe method) to every one of the 224 features and calculated an effectiveness score for each of them. The effectiveness score for every feature was calculated as

effectiveness = 0, if conf_min^(c1,c2) < 0 < conf_max^(c1,c2)
effectiveness = 1/log(p), otherwise

where p is the p-value of the ANOVA test, and conf_min^(c1,c2) and conf_max^(c1,c2) are the limits of the 95% confidence interval of the difference in the mean between conditions c1 and c2 returned by the Scheffe multiple comparison test. The logarithm of the p-value was used rather than its raw value to prevent a single highly significant feature from overshadowing all other features. Fig. 9 shows the normalized aggregated effectiveness of various time slots in distinguishing the three conditions. The figure shows that the least effective time slot in distinguishing the natural and unnatural conditions is the first two minutes of the interaction. The effectiveness in distinguishing these two conditions is at its maximum in the last two slots of the interaction. Fig. 10 shows the normalized aggregated effectiveness of various statistics in distinguishing the three conditions.
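A sketch of the scoring procedure is shown below. The ANOVA p-value comes from scipy; the Scheffe confidence interval is not available there, so a plain two-sample t interval is substituted for it, which makes this only an approximation of the paper's exact procedure. The group data are placeholders.

```python
import numpy as np
from scipy import stats

def effectiveness(groups, pair, alpha=0.05):
    """Effectiveness of one feature for separating two of the three conditions.

    ANOVA over all groups gives the p-value; the pairwise 95% confidence
    interval of the mean difference is approximated by a plain two-sample t
    interval (the paper uses the Scheffe multiple-comparison interval instead).
    The score is 0 if the interval straddles zero and 1/log(p) otherwise.
    """
    p = stats.f_oneway(*groups.values()).pvalue
    x, y = groups[pair[0]], groups[pair[1]]
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    half = stats.t.ppf(1 - alpha / 2, len(x) + len(y) - 2) * se
    lo, hi = diff - half, diff + half
    return 0.0 if lo < 0.0 < hi else 1.0 / np.log(p)

# Placeholder data: one feature value per session for each condition.
rng = np.random.default_rng(2)
groups = {"natural": rng.normal(0.0, 1.0, 22),
          "unnatural": rng.normal(1.0, 1.0, 22),
          "robot": rng.normal(0.5, 1.0, 22)}
score = effectiveness(groups, ("natural", "unnatural"))
```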
Fig. 10 Effectiveness of various statistics in distinguishing various listener conditions.
Fig. 11 Effectiveness of various signals in distinguishing various listener conditions.
The figure shows that the most effective features in distinguishing natural and unnatural sessions were the two RSST-based statistics. The same is true for distinguishing natural and robot sessions. For distinguishing unnatural and robot sessions, no feature was effective by itself. RSST-based statistics have the extra advantage of needing no calibration per subject. The superiority of RSST-based statistics over the other, simpler statistics may be because the behavior of the listener does not cause as much difference in the internal state of the instructor as the conditions under which support for the other statistics was established (e.g. collaborative play [4]). Fig. 11 shows the normalized aggregated effectiveness of various signals in distinguishing the three conditions.
Fig. 12 Accuracy of tested classifiers in distinguishing the three conditions of the experiment using 5-fold cross-validation over five different trials.
Pulse and respiration-based signals were the most effective in distinguishing the three conditions. Skin conductance seems to have a weaker association with distinguishing natural behavior. This supports our claim that natural real-world interactions may require different kinds of processing to obtain robust physiological metrics compared with the extreme conditions used in most research on physiological measurement of internal emotional state. In general, the natural-unnatural pair of conditions was the most distinguishable, followed by the natural-robot pair and then the unnatural-robot pair.
5.2 Classification of Different Conditions

Based on the analysis given in the previous section, a classifier was designed to distinguish the three conditions of the experiment using features calculated at different time slots. Two classifiers were compared: an SVM and a decision tree classifier. To assess the effectiveness of each of these classifiers we used five different trials of 5-fold cross-validation with permuted sessions in every trial. The data was divided into five sets (two containing five sessions of every condition and three containing four sessions of every condition). Each classifier was then trained using three of these sets and tested using the remaining two sets. This procedure was done five times with different test sets each time, and the accuracy of the classifiers was averaged over all runs. To quantify the accuracy of the classifiers, the accuracy in distinguishing each pair of conditions, rather than the accuracy in detecting each condition, was used, as it gives more information about the relative distinguishability of the different conditions. We used the top ten features from the feature ranking presented in the previous section. Most of these features were RSST features applied to the middle four minutes of the interaction time. To assess the quality of the classifier, only its ability to distinguish the natural and unnatural conditions is important, as we cannot assume a priori that the robot's behavior was either natural or unnatural.
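The sketch below mirrors this evaluation protocol with scikit-learn. The feature matrix and labels are random placeholders, and standard stratified 5-fold splits stand in for the paper's three-training/two-testing partitioning of the five session sets.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: the ten selected features per session,
# with labels 0 = natural listener session and 1 = unnatural listener session.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (22, 10)), rng.normal(0.7, 1.0, (22, 10))])
y = np.array([0] * 22 + [1] * 22)

classifiers = {
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "decision tree": DecisionTreeClassifier(max_depth=3),
}

# Five trials of 5-fold cross-validation with reshuffled folds in every trial.
for name, clf in classifiers.items():
    accs = []
    for trial in range(5):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=trial)
        accs.extend(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))
    print(f"{name}: mean accuracy {np.mean(accs):.2f} (std {np.std(accs):.2f})")
```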
Four types of kernels were tried for the SVM classifier (linear, quadratic, polynomial of order 3, and radial). The linear kernel gave the highest accuracy and was used in the remainder of the analysis. Fig. 12 shows the performance of the linear-kernel SVM classifier. The average accuracy in distinguishing the natural and unnatural conditions was 79% with a standard deviation of 10.71%, and the maximum accuracy for this condition was 100%. Fig. 12 also shows the performance of the tree classifier. The average accuracy in distinguishing the natural and unnatural conditions was 81% with a standard deviation of 10.2%, and the maximum accuracy for this condition was again 100%. The ability of the two classifiers to distinguish the unnatural and robot conditions was also very similar. Nevertheless, the tree classifier outperformed the SVM in distinguishing the robot and natural conditions, with 87.5% average accuracy compared to 69% in the case of the SVM. The difference was found to be statistically significant with a p-value