Audio-Visual Emotion-Aware Cloud Gaming Framework

M. Shamim Hossain, Senior Member, IEEE, Ghulam Muhammad, Member, IEEE, Biao Song, Member, IEEE, Mohammad Mehedi Hassan, Member, IEEE, A. Alelaiwi, Member, IEEE, and Atif Alamri, Member, IEEE

M. Shamim Hossain and A. Alelaiwi are with the Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia (e-mail: [email protected]). G. Muhammad is with the Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. B. Song, M. M. Hassan, and A. Alamri are with the Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. Manuscript received October 19, 2014; revised January 17, 2015; accepted May 24. The authors extend their appreciation to the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia for funding this work through research group project no. RGP-1436-023.
Abstract—The promising potential and emerging applications of cloud gaming have drawn increasing interest from academia, industry, and the general public. However, providing a high-quality gaming experience in a cloud gaming framework is challenging because of the tradeoff between resource consumption and player emotion, which is affected by the game screen. We tackle this problem by leveraging emotion-aware screen effects in the cloud gaming framework and combining them with remote display technology. The first stage in the framework is the learning or training stage, which establishes a relationship between screen features and emotions using Gaussian mixture model (GMM) based classifiers. In the operating stage, a linear programming (LP) model provides appropriate screen changes based on the real-time user emotion obtained in the first stage. Our experiments demonstrate the effectiveness of the proposed framework. The results show that it can provide a high-quality gaming experience while generating an acceptable workload for the cloud server in terms of resource consumption.

Index Terms—Emotion recognition, cloud gaming, gaming experience, remote display.
I. INTRODUCTION

Cloud gaming is currently gaining in popularity because of its immense benefits with regard to a ubiquitous user gaming experience. The fundamental idea behind cloud gaming is to render a game remotely in the cloud and stream the encoded game screen back to users over the network. Player interactions, commands, and responses on game screens at the client are recorded and delivered to the remote gaming server. A player's device contains a view component that operates as a remote display for the games and services that run on remote servers in the cloud. With such cloud-based games, it is a challenging task to provide a high-quality gaming experience to the user. It is difficult for existing cloud gaming techniques [1]-[7] to provide an enhanced gaming experience [8]-[11] because of their high consumption of resources such as bandwidth
and CPU, and the low response time they must maintain, which is affected by encoding delay. For instance, OnLive [4] requires a bandwidth of more than 5 Mbps, while StreamMyGame [6] requires 6-8 Mbps. Increased response time degrades users' gaming experiences and impacts player emotions [3], [11], depending on the game genre. Moreover, because current cloud games pay little attention to players' emotional responses, it is a challenge to keep players motivated during game play; emotion has an important relationship with player satisfaction and engagement or disengagement, beyond the basic emotions of excitement, competitiveness, and the urge for violence [12].

Emotion can play an important role in cloud gaming by either engaging or disengaging players. Recently, a number of researchers and gaming companies have started incorporating emotion into some video games [13]-[17] and a small number of cloud games [18]. Researchers at Stanford [13] now consider player emotion in their emotion-sensing video game controllers to adapt gameplay and make it more engaging. In addition, thatgamecompany [14] implements emotion as a method to stimulate positive emotional changes in video games. Conversely, researchers [15] treat emotion as an engine for games based on a computational emotion model. Because of the inherent nature of cloud gaming (e.g., its resource consumption and susceptibility to delay), it is a challenge to directly incorporate such techniques [13]-[18].

To this end, this paper proposes a novel emotion-aware cloud gaming framework to provide quality gaming experiences. The proposed framework uses the potential of emotion-aware screen effects in cloud gaming and combines them with remote display technology. This technology allows a player to use heterogeneous devices and easily access powerful remote servers and services over the network. Traditional remote display technologies [19]-[21] do not have the ability to provide a better cloud gaming user experience. Hence, we use recent remote display technology [22] along with an emotion recognition engine to satisfy the growing demand for an emotional user experience.

The emotion recognition engine identifies a user's emotion for further processing. We adopted a multimodal approach to detect emotion, where the modalities consist of audio and image sequences from video. A video camera mounted on the device in front of the user constantly records the user. The video signal is demultiplexed to produce audio and image frame signals. For the audio signal, MPEG-7 low-level audio descriptors are used as features, while for the image sequence, local binary patterns (LBPs) from the region of interest of the
face image are extracted as features. Two separate classifiers in the form of Gaussian mixture models (GMMs) output log-likelihood scores from the audio and image features, respectively. These two scores are fused to produce a final score for each type of emotion, and the emotion with the maximum score is output as the recognized emotion. The score is passed as input to the screen effect system, which acts on the recognized emotion. The screens are then updated so that a user's negative emotion is changed to a positive one.

In summary, this paper makes the following contributions:
• To the best of our knowledge, this work is the first investigation that combines remote display technology and emotion technology in the cloud gaming domain. Unlike existing solutions [15], [16], [18], which solely detect user emotion, we provide an effective solution that addresses a user's negative emotion in real time.
• We propose a cloud gaming framework that uses screen features to affect a user's emotion while maintaining a high-quality gaming experience. The first stage is the learning or training stage, where a relationship is established between five screen features and emotions using GMM-based classifiers. In the operating stage, a linear programming (LP) model is then applied to provide appropriate screen changes, based on the real-time user emotions obtained from the emotion recognition engine and the information learned in the first stage.

Objective and subjective evaluations were conducted to verify the suitability and effectiveness of the proposed framework. The simulation results demonstrate that the proposed framework is capable of providing a high-quality, real-time gaming experience. During the subjective evaluation, most of the participants expressed high levels of satisfaction and attention, which indicates that they were happy with the perceived emotion-based dynamic screen effects during gameplay.

The rest of this paper is organized as follows. Section II reviews related work. Section III presents a detailed description of the proposed framework, along with descriptions of the emotion recognition, remote display, and screen effect engines. Section IV reports the results of performance comparisons through objective and subjective evaluations. We conclude our work in Section V.

II. RELATED WORK

Cloud gaming allows users to access and play their desired games ubiquitously. However, because of its high bandwidth requirements and sensitivity to delay, it is difficult to provide a good cloud gaming user experience. Despite this, many studies [24]-[27] have concentrated on the quantitative aspects of cloud gaming (e.g., delay, response time, bandwidth, CPU consumption, and audio-visual quality), while little attention has been given to a user's psychological status [12]. This lack of consideration makes it more challenging to provide a high-quality gaming experience in terms of a player's emotional state. Moreover, cloud gaming enables users to play games remotely through some form of remote display technology, which may also affect the user experience.
To better understand the potential of the cloud gaming platform with respect to emotion, it is necessary to study how this platform affects users' experience and influences gamers. Chen et al. [24] attempted to measure the response delay of OnLive and StreamMyGame. Jarschel et al. [25] evaluated user-perceived Quality of Experience (QoE) in cloud gaming based on user context in addition to delay or loss. Chen et al. [10] presented a user study on the impact of user satisfaction in terms of graphics, smoothness, and control. A cloud gaming user experience depends on a variety of subjective and objective considerations. In this context, Quax et al. [26] evaluated user experience in cloud gaming by considering network quality in terms of delay and its impact on emotion-related factors such as enjoyment and frustration. Bowen Research [17] conducted a survey on users' emotional experience according to game genre and discovered that 52% of players ranked first-person shooter (FPS) games as being emotionally powerful. It can hence be inferred that emotion is essential to a user's experience and enjoyment of cloud gaming.

Emotion can play an important role in cloud gaming by promoting satisfaction and engagement during game play. There are some existing studies [13]-[17] on video games in the affective computing domain, which considers emotion. However, there is little research [18] that discusses emotion in cloud gaming. Currently, nViso technology [28] allows for emotion recognition in real time using three-dimensional (3D) facial imaging. This technology enables customers to analyze the emotion caused by product use and ultimately adopt new policies in decision making. IBM SmartCloud [28] enhances the nViso technology to measure the emotion of a user from a live webcam stream, pre-recorded still images, or video over the Internet. Recently, thatgamecompany [14] considered creating video games that provoke emotional responses from players. Sabourin and Lester [16] discussed the role of emotion (e.g., positive and negative) in a 3D game platform and showed how game features promote positive affect and engagement.

With regard to cloud games, Lee et al. [18] proposed a model that identifies a relationship between a game's real-time strictness in terms of latency and visual feedback, where a player's negative emotion is quantified with respect to latency using a facial electromyography (fEMG) approach. Our approach is distinct from this in multiple ways. First, their work uses the rate of players' inputs and game screen dynamics to measure a game's real-time strictness, whereas we measure a user's real-time emotion. Second, their work only provides an estimation model; our work provides both an estimation model and corresponding solutions. Third, they use fEMG, which is invasive in nature and therefore uncomfortable for players; our approach is completely non-invasive and does not need any instrumental body contact. Moreover, our approach promotes emotional responses during game play using dynamic screen effects through remote display technology. The approach proposed in this paper may enhance the gaming experience and could ultimately result in increased revenue for the gaming industry.

In essence, to the best of our knowledge, there is no existing
framework that can recognize user emotion over the cloud and provide instant user feedback to alleviate negative emotions. The framework proposed in this paper addresses this issue. In this framework, a user plays a video game while a video camera continually captures the user's voice, speech, and facial video, which are sent over the cloud to an emotion recognition server. When a gaming situation elicits an angry, sad, or bored emotional response, the user may no longer wish to continue. To overcome or reduce this negative emotion, the screen is adjusted according to the emotion detected by the emotion recognition server.

To fulfill players' increasing expectations of immersive experiences, we have designed a new cloud-based game architecture that leverages emotion-aware screen effects in the cloud gaming framework. The proposed framework is one of only a few attempts to combine the potential of current remote display technology and emotion technology in the cloud gaming domain. Traditional remote display technology [19], [20], [21] and remote desktop sharing technology [29] cannot meet the strict requirements of cloud gaming. For instance, the Remote Display Protocol in Amazon's EC2 cloud service [21] consumes significant bandwidth to deliver high-motion screen updates. However, current remote display technology has great potential to create a customized cloud gaming experience. Therefore, in the remote display component, we use two encoding methods, H.264 [46] and M-JPEG [47], along with very recent remote display technology.

III. PROPOSED ARCHITECTURE AND SYSTEM MODELING

First, we present a user scenario in Section III-A. In Section III-B, we then provide a brief description of the high-level system architecture of the proposed emotion-aware framework. Finally, we describe the remote display, emotion recognition, and emotion-changing screen components in Sections III-C, III-D, and III-E, respectively.

A. Scenario

Prior to describing the proposed framework, we present a scenario to show how emotion can affect user experience. For example, a game that contains a significant amount of violence may cause negative emotions such as anger or fury. If the original screens of such games are continuously delivered to a player, the negative emotions may last for the entire gaming period. As the screen outputs in a cloud gaming environment are always delivered through a remote display engine, they can first be modified to address those negative emotions and then sent to the user. Assuming that the user is feeling angry, the red component in the new screen updates can be reduced to a level that alleviates this feeling. Similarly, other screen modifications can benefit users who feel sadness or fear while playing exciting games. Meanwhile, the changes to screen updates must be controlled in a reasonable way to guarantee the quality of gaming; changing a screen to a completely different one would certainly annoy the user. Note that we do not aim to totally remove negative emotions; instead, we intend to reduce extremely high negative emotions, which psychologically affect a player and may adversely affect the gameplay.

B. System Architecture

The high-level system architecture of the proposed emotion-aware cloud gaming framework is shown in Figure 1. The purpose of this framework is to provide a new generation of cloud gaming that uses remote display technology and emotion technology to give players increasingly immersive experiences. As in leading cloud-based video games, game screens (audio/video) are rendered, captured, and compressed in private cloud servers and then streamed back to the desktop or client, where they are decoded and displayed on the user or player device. At the same time, users' responses or interactions are sent back to the cloud server in real time for possible display updates.
Fig. 1: High-level architecture of the proposed emotion-aware cloud gaming framework.

More elaborately, player interactions, commands, and responses on game screens are recorded and sent to an emotion recognition engine in the cloud gaming server. Upon detection of a user's emotion, emotion functions (from the emotion recognition engine) pass a message to the remote display component and request a corresponding change of screen effect as well as display updates. The main components of the proposed framework are described below.
• The overall management of the proposed framework is governed by the Cloud Manager. It has a number of web services that manage the game infrastructure as a whole in the cloud. For example, it manages a player's profile, client device information, and game information. It performs user authentication and registration, initiates and controls game sessions, and generates and tracks performance reports, as well as managing the communication services.
• The Resource Allocation Manager assigns the various virtual machine (VM) resources needed for running the game sessions and web services. A VM can be regarded as a server aggregation that provides the gaming services, including emotion detection, screen effect control, game state updates, rendering and streaming, and audio-visual playback. Each VM has the following major components: remote display, emotion recognition, and screen effect.
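The paper does not specify the wire format of the request that the emotion recognition engine passes to the remote display component; the sketch below is only one plausible shape for such a message, written to make the component interaction concrete. The class name `ScreenEffectRequest` and every field name are illustrative assumptions, not part of the authors' implementation.

```python
from dataclasses import dataclass, field
import json
import time


@dataclass
class ScreenEffectRequest:
    """Hypothetical message sent from the emotion recognition engine
    to the remote display component when a negative emotion is detected."""
    session_id: str          # game session the request applies to
    emotion: str             # recognized emotion, e.g. "anger" or "sadness"
    score: float             # fused classification score, cf. (10)
    scaling: dict = field(default_factory=dict)   # e.g. {"red": 0.8, "blue": 1.1, "green": 1.0}
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(self.__dict__)


# Example: ask the remote display component to damp redness for an angry player.
request = ScreenEffectRequest(session_id="vm-17/session-42",
                              emotion="anger",
                              score=0.83,
                              scaling={"red": 0.8, "blue": 1.1, "green": 1.0})
print(request.to_json())
```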
Fig. 2: Remote display for cloud gaming.

C. Remote Display Component

This technology allows the user to use heterogeneous devices and easily access powerful remote servers and services over a network. As shown in Figure 2, the remote display component provides access to a desktop environment on a remote server, consisting of an operating system and the games installed on it. Through this remote display component, a user can enjoy a convenient gaming experience on any device, even one with low computing power, and is freed from the need to frequently upgrade computers and games. On the user side, the viewing component displays the received screen updates and sends the user's audio and facial inputs to the cloud server. The server component receives the user inputs, recognizes the emotion, encodes the response, and transmits the gaming output. A remote display protocol transfers display updates and user events between the endpoints.

Traditional applications of remote displays for cloud gaming do not modify the screen data, except when performing encoding and streaming to reduce bandwidth consumption. However, recent remote display technology [22] has great potential to create a customized cloud gaming experience based on a user's emotions. To this end, we use two encoding methods, H.264 and MJPEG, along with recent remote display technology. H.264 is the most popular encoding method and is used by many existing commercial cloud gaming platforms, such as OnLive [4] and StreamMyGame (SMG) [6]. The MJPEG-based encoder has the following advantages: i) minimum latency in image processing, ii) flexibility in splicing and resizing, and iii) good resilience against packet loss; however, it consumes more network bandwidth than H.264 encoding.

D. Emotion Recognition Engine

In the proposed framework, emotion is recognized using two modalities: audio and video. The multimedia system is located in front of the user and captures video. A client-side demultiplexer separates the audio data from the image frames, and five image frames are selected by uniform sampling. The audio data and image frames are the two input modalities to the engine located in the VMs; they are sent to the cloud gaming server through the network. Features are extracted from both the audio data and the image frames, so the engine uses two sets of feature vectors, one for audio and the other for image frames. Two GMM-based classifiers classify the emotion separately, and the classification scores are then combined using the Bayes sum rule. An overall block diagram of the proposed emotion recognition system is shown in Figure 3, where filled and non-filled arrows correspond to the training and testing phases, respectively.
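The client-side capture code is not given in the paper; the following is a minimal sketch, assuming the captured clip is already available as an array of frames plus an audio track, of how five uniformly sampled frames could be selected and bundled with the audio for transmission to the emotion recognition engine. The function names and payload fields are illustrative.

```python
import numpy as np


def sample_frames(frames: np.ndarray, num_samples: int = 5) -> np.ndarray:
    """Uniformly sample `num_samples` frames from a clip of shape (T, H, W, 3)."""
    total = frames.shape[0]
    # Evenly spaced indices across the clip.
    idx = np.linspace(0, total - 1, num_samples).round().astype(int)
    return frames[idx]


def build_payload(frames: np.ndarray, audio: np.ndarray, sample_rate: int) -> dict:
    """Bundle the sampled frames and the audio signal for upload to the cloud VM."""
    return {
        "frames": sample_frames(frames),      # five uniformly sampled image frames
        "audio": audio.astype(np.float32),    # mono audio samples
        "sample_rate": sample_rate,
    }


# Example with synthetic data: a 2-second clip at 25 fps and 16 kHz audio.
clip = np.zeros((50, 480, 640, 3), dtype=np.uint8)
audio = np.zeros(32000, dtype=np.float32)
payload = build_payload(clip, audio, sample_rate=16000)
print(payload["frames"].shape)  # (5, 480, 640, 3)
```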
Fig. 3: A generic block diagram of the proposed emotion recognition system.

1) Audio-based Emotion Recognition: Emotion recognition from speech has been a topic of research for the last fifteen years or so. Many types of features and classifiers have been tried to achieve emotion recognition performance close to human perception of emotion. The most investigated features are spectral/cepstral and prosodic features [31], which were originally developed for speech or speaker recognition applications. It has been shown that a combination of spectral and prosodic features can enhance emotion recognition performance [31]. Several pitch-related features, such as pitch frequency, glottal air velocity, and number of harmonics, are commonly used in emotion recognition. Figure 4 shows the spectrograms, formants, and pitch frequencies of two utterances, one in a happy mood and the other in a sad mood. From the figures, we see that the pitch frequencies in a sad mood are flatter over time than those in a happy mood.

The proper choice of features and classifiers is important for a real-time, high-performance emotion recognition system. This section investigates the feasibility of MPEG-7 low-level audio descriptors (features) in the field of emotion recognition. The main reason for using these descriptors is that they have been found to be more efficient than traditional speech features, such as Mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), in many applications such as speaker recognition, environment recognition [32], and musical instrument classification.

a) MPEG-7 low-level audio descriptors: MPEG-7 features were originally developed for multimedia indexing, which covers both video and audio [33]. The MPEG-7 Part 4 audio features can be applied to all forms of audio content. The MPEG-7 low-level audio descriptors consist of 45 features, some of which are vectors and some scalars.

b) Feature Selection on MPEG-7 Audio Features: The dimension of the MPEG-7 audio features is prohibitively large for a real-time application. Not all the features in MPEG-7 are independent, nor are they all relevant to emotion recognition. Therefore, we apply a simple but effective feature selection technique, namely the Fisher discrimination ratio (F-ratio), on the
MPEG-7 features. For a two-class problem, the F-ratio is defined in (1):

Fratio(i) = \frac{(\mu_{1,i} - \mu_{2,i})^2}{\sigma_{1,i}^2 + \sigma_{2,i}^2}     (1)

where \mu_{1,i}, \mu_{2,i}, \sigma_{1,i}^2, and \sigma_{2,i}^2 are the averages and variances of the ith feature for class 1 and class 2, respectively. A high F-ratio for a feature indicates a high discrimination capability between the two classes. For a multiclass problem, the F-ratio (simply denoted F) is calculated using (2):

F(i) = \frac{\mu^2}{\sigma^2}     (2)

where \mu and \sigma^2 are the pair-wise averages and variances of the F-ratios for feature i. Based on experiments with the training data, we select the five features whose F value is ≥ 1.0. The five selected features are described as follows:

(i) Audio spectrum centroid (ASC): the center of gravity of the log-scaled power spectrum of a frame fr, calculated using (3), where p_i is the power at frequency f_i:

ASC(fr) = \sum_i \log_2(f_i / 1000)\, p_i \Big/ \sum_i p_i     (3)

(ii) The harmonic ratio H of audio harmonicity (AH): loosely defined as the proportion of harmonic components within the power spectrum. H is calculated in two steps. First, the normalized cross-correlation r(fr, k) of frame fr with lag k is computed:

r(fr, k) = \frac{\sum_{j=m}^{m+n-1} s(j)\, s(j-k)}{\sqrt{\sum_{j=m}^{m+n-1} s(j)^2 \times \sum_{j=m}^{m+n-1} s(j-k)^2}}     (4)

where s is the speech signal, n is the number of samples in the analysis window, and m = fr × n. The harmonic ratio H(fr) is then chosen as the maximum of r(fr, k) in each frame fr:

H(fr) = \max_{k=1,\dots,n-1} r(fr, k)     (5)

(iii) Audio spectrum spread (ASS): the second moment of the log-scaled power spectrum of frame fr, calculated using (6):

ASS(fr) = \sqrt{\frac{\sum_i (\log_2(f_i / 1000) - ASC)^2\, p_i}{\sum_i p_i}}     (6)

(iv) Instantaneous harmonic spectrum spread (IHSS): the amplitude-weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous harmonic spectral centroid (IHSC) using (7):

IHSS(fr) = \frac{1}{IHSC(fr)} \sqrt{\frac{\sum_{hr=1}^{N} A^2(fr, hr)\, [f(fr, hr) - IHSC(fr)]^2}{\sum_{hr=1}^{N} A^2(fr, hr)}}     (7)

where A(fr, hr) is the amplitude of harmonic peak hr of frame fr, f(fr, hr) is the frequency of harmonic peak hr of frame fr, and N is the number of harmonics taken into consideration.

(v) Audio fundamental frequency (AFF): calculated from the normalized cross-correlation in (4).

Fig. 4: Speech waveform and spectrogram (formants and pitch frequencies overlaid) of (a) a happy-mood utterance and (b) a sad-mood utterance.
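The MPEG-7 reference software is not used here; purely as an illustration, the numpy sketch below computes frame-level quantities in the spirit of (1)-(6): the two-class F-ratio used for feature selection, the log-frequency spectrum centroid and spread of (3) and (6), and a within-frame approximation of the harmonic ratio of (4)-(5). The frame length, sampling rate, and lag restriction are assumptions made for the sketch.

```python
import numpy as np


def f_ratio(x1: np.ndarray, x2: np.ndarray) -> float:
    """Two-class Fisher discrimination ratio of one feature, as in (1)."""
    return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())


def spectrum_centroid_spread(frame: np.ndarray, sr: int):
    """Log-frequency spectrum centroid and spread of one frame, as in (3) and (6)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum p_i
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    valid = freqs > 0                                     # avoid log2(0)
    logf = np.log2(freqs[valid] / 1000.0)
    p = spectrum[valid]
    asc = np.sum(logf * p) / np.sum(p)
    ass = np.sqrt(np.sum((logf - asc) ** 2 * p) / np.sum(p))
    return asc, ass


def harmonic_ratio(frame: np.ndarray) -> float:
    """Maximum normalized cross-correlation over lags, as in (4)-(5).
    Lags are restricted to half the frame for a stable within-frame estimate."""
    n = len(frame)
    best = 0.0
    for k in range(1, n // 2):
        a, b = frame[k:], frame[:n - k]
        denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
        if denom > 0:
            best = max(best, np.sum(a * b) / denom)
    return best


# Example: a 30 ms frame of a 200 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t)
print(spectrum_centroid_spread(frame, sr), round(harmonic_ratio(frame), 3))
print(round(f_ratio(np.random.randn(200) + 1.0, np.random.randn(200)), 2))
```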
2) Image Frames Based Emotion Recognition: There are several previous works on face-based emotion recognition. When a sequence of image frames is used, the recognition system detects the face in the first frame and tracks its movement in subsequent frames. In the proposed emotion recognition system, we applied Dollar's method to the image
frames [34]. First, to find interest points, we calculate a response function by applying two linearly separable filters in the spatial and temporal domains. The response function is:

R = (I * g * h_{ev})^2 + (I * g * h_{od})^2, \qquad g(x, y; \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}     (8)

where I(x, y; t) is the input image frame sequence, g(x, y; \sigma) is the 2D Gaussian smoothing kernel, and h_{ev} and h_{od} are a quadrature pair of 1D Gabor filters [34]. The Gaussian kernel and the Gabor filters are applied in the spatial and temporal domains, respectively. h_{ev} and h_{od} are expressed as:

h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}, \qquad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}     (9)
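As a minimal numpy/scipy rendering of (8)-(9), the sketch below computes the response volume R for a grayscale clip, with the spatial Gaussian applied per frame and the temporal Gabor pair applied along time; the default parameter values are those reported in the next paragraph, and the filter half-length is an assumption. It is an illustration, not the authors' implementation of Dollar's detector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def gabor_pair(tau: float, omega: float, half_len: int = 8):
    """Quadrature pair of 1D temporal Gabor filters h_ev, h_od from (9)."""
    t = np.arange(-half_len, half_len + 1, dtype=float)
    envelope = np.exp(-t ** 2 / tau ** 2)
    return -np.cos(2 * np.pi * t * omega) * envelope, -np.sin(2 * np.pi * t * omega) * envelope


def response_function(frames: np.ndarray, sigma: float = 2.1,
                      tau: float = 1.8, omega: float = np.pi / 4) -> np.ndarray:
    """Spatio-temporal response R of (8) for a clip of shape (T, H, W)."""
    # Spatial Gaussian smoothing of every frame (the g(x, y; sigma) term).
    smoothed = gaussian_filter(frames.astype(float), sigma=(0, sigma, sigma))
    h_ev, h_od = gabor_pair(tau, omega)
    # Temporal 1D convolutions along the first (time) axis.
    even = np.apply_along_axis(lambda v: np.convolve(v, h_ev, mode="same"), 0, smoothed)
    odd = np.apply_along_axis(lambda v: np.convolve(v, h_od, mode="same"), 0, smoothed)
    return even ** 2 + odd ** 2


# Interest points are local maxima of R; here we only report the strongest one.
clip = np.random.rand(20, 64, 64)
R = response_function(clip)
print(np.unravel_index(np.argmax(R), R.shape))
```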
The strongest responses (maxima of the response function) correspond to the interest points. In the experiments, different combinations of σ and τ were tried while keeping ω = π/4. The highest accuracy was obtained with σ = 2.1 and τ = 1.8, so we kept these values for all subsequent experiments. Around each interest point, a cuboid is extracted whose size is twice the scale of the interest point, and an LBP histogram is calculated from each cuboid [35]. The LBP is a simple and computationally inexpensive texture descriptor. The local structure around a center pixel is encoded by thresholding the grayscale values of the eight neighboring pixels in a 3 × 3 neighborhood against the center pixel value: the center pixel value is subtracted from each of its eight neighbors, and a negative result is encoded as 0, otherwise as 1. The eight binary values are concatenated either clockwise or counterclockwise to form an 8-bit binary number, which is converted to a decimal value and used as the label of the center pixel. The dimension of the feature vector is reduced by applying principal component analysis (PCA) so as to capture 95% of the total variance; on average, after feature reduction, the number of features is 48.
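For illustration, the following numpy sketch labels every interior pixel of a grayscale patch with the basic 3 × 3 LBP code described above and builds the 256-bin histogram that serves as the cuboid descriptor; it is a plain restatement of the textbook operator, not the authors' implementation, and the patch size in the example is arbitrary.

```python
import numpy as np


def lbp_labels(patch: np.ndarray) -> np.ndarray:
    """Basic 8-neighbor LBP code for every interior pixel of a 2D grayscale patch."""
    center = patch[1:-1, 1:-1]
    # Eight neighbor offsets, ordered clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = patch[1 + dy:patch.shape[0] - 1 + dy, 1 + dx:patch.shape[1] - 1 + dx]
        # Neighbor >= center encodes 1, otherwise 0; bits are packed clockwise.
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes


def lbp_histogram(patch: np.ndarray) -> np.ndarray:
    """Normalized 256-bin histogram of LBP labels, used as the cuboid feature."""
    hist = np.bincount(lbp_labels(patch).ravel(), minlength=256).astype(float)
    return hist / hist.sum()


# Example on a random 16 x 16 grayscale patch.
patch = np.random.randint(0, 256, size=(16, 16), dtype=np.int32)
print(lbp_histogram(patch).shape)  # (256,)
```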
3) GMM-based Classifier: In the proposed emotion recognition system, we use two GMM-based classifiers, one for audio and the other for image frames. Each emotion is modeled with the corresponding audio or image features described above using Gaussian mixtures, where the number of mixtures is varied from 1 to 32 in powers of two. The classifiers estimate the best hypothesis using a typical time-synchronous beam search. A GMM is a stochastic model and is thereby more robust than classifiers based on non-stochastic models; as this work deals with multi-class emotions and data transfer over the network, we choose GMM-based classifiers for their robustness against the loss of some packets. We use the Bayes sum rule (BSR) to combine the two classifiers' scores for the final classification. The BSR weighs the scores from the two classifiers such that a higher probability estimate from either classifier exerts a stronger bias. The final score, S, is calculated using (10):
S = \arg\max_C \left[ (1 - K)\, P(Y_C) + \sum_{i=1}^{K} P(Y_C \mid Z_i) \right]     (10)
where K = 2 (two classifiers), P(Y_C) is the prior probability of a given class C, P(Y_C | Z_i) is the posterior probability, and Z_i is a function of the output score of class C in the ith classifier. Based on this score, we can decide which emotion the user currently has.
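As a concrete illustration of the two GMM classifiers and the sum-rule fusion of (10), the sketch below trains one scikit-learn GaussianMixture per emotion and per modality and fuses the per-class scores. The use of scikit-learn, the equal class priors, and the conversion of log-likelihoods to posteriors via a softmax are assumptions made for this sketch, not details given in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["anger", "disgust", "fear", "happy", "sad"]


def train_gmms(features_by_emotion: dict, n_components: int = 16) -> dict:
    """One GMM per emotion for a single modality (audio or image features)."""
    return {emo: GaussianMixture(n_components=n_components, covariance_type="diag",
                                 random_state=0).fit(X)
            for emo, X in features_by_emotion.items()}


def class_posteriors(gmms: dict, x: np.ndarray) -> np.ndarray:
    """Per-emotion posteriors for one sample x, assuming equal priors."""
    loglik = np.array([gmms[emo].score(x.reshape(1, -1)) for emo in EMOTIONS])
    loglik -= loglik.max()                       # numerical stability
    post = np.exp(loglik)
    return post / post.sum()


def fuse_sum_rule(posteriors: list, prior: float = 1.0 / len(EMOTIONS)) -> str:
    """Bayes sum rule of (10) with K classifiers and equal class priors."""
    K = len(posteriors)
    scores = (1 - K) * prior + np.sum(posteriors, axis=0)
    return EMOTIONS[int(np.argmax(scores))]


# Synthetic example: 40-dim audio features and 48-dim image features per emotion.
rng = np.random.default_rng(0)
audio_gmms = train_gmms({e: rng.normal(i, 1.0, size=(200, 40)) for i, e in enumerate(EMOTIONS)})
image_gmms = train_gmms({e: rng.normal(i, 1.0, size=(200, 48)) for i, e in enumerate(EMOTIONS)})
sample_audio, sample_image = rng.normal(3, 1.0, 40), rng.normal(3, 1.0, 48)
decision = fuse_sum_rule([class_posteriors(audio_gmms, sample_audio),
                          class_posteriors(image_gmms, sample_image)])
print(decision)  # expected to match the class whose synthetic mean is 3, i.e. "happy"
```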
E. Screen Effect

An overall block diagram of the proposed emotion-based screen effect system is shown in Figure 5, where small and large arrows correspond to the training and operating phases, respectively. In the training phase, testing screens that do not create particular emotions are combined with randomly generated screen effects: the overall redness, blueness, and greenness of the original screens are randomly scaled up or down, within a range of −30% to +30%, to create test screens. After a user views these screens, the user's responses are recorded and sent to our emotion detection engine. If emotional changes are detected, a relationship between the corresponding screen features and the detected emotion is established using GMM-based classifiers in the modeling block. The method of training is similar to that presented in the previous section; the only difference is that the audio and video features are replaced with screen features. In the operating phase, the screen effect engine obtains the real-time user emotion from the emotion recognition engine. When a negative emotion is detected, the screen effect engine determines the appropriate screen effect to add to the real-time cloud gaming screen updates using an LP model. Finally, the screen with the added effect is delivered to the user to reduce the negative emotion.
Fig. 5: Emotion changing system using added screen effect.

1) Creating Screen Effect: Given a screen update P of size m × n in RGB color space, each pixel p_{ij} ∈ P contains a red color element r_{ij}, a green color element g_{ij}, and a blue color element b_{ij}, where each color element ranges from 0 to 255. In this work, we investigate five screen features extracted from P that have commonly been related to emotions. They are defined as follows:
• Overall redness R. The overall redness of screen update P is defined in normalized form as:

R_P = \frac{1}{255 \times m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}     (11)

• Overall blueness B. The overall blueness of screen update P is defined in normalized form as:

B_P = \frac{1}{255 \times m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} b_{ij}     (12)

• Overall greenness G. The overall greenness of screen update P is defined in normalized form as:

G_P = \frac{1}{255 \times m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} g_{ij}     (13)

• Brightness T. The overall brightness of screen update P is defined in normalized form as:

T_P = \frac{1}{255 \times 3 \times m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} (r_{ij} + b_{ij} + g_{ij})     (14)

• Contrast C. The contrast of screen update P is modified from the root mean square (RMS) contrast so that linear programming can be adopted:

C_P = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} |I_{ij} - I_P|     (15)

where the intensity of p_{ij} is denoted I_{ij} = \frac{1}{255}(0.2126\, r_{ij} + 0.7152\, g_{ij} + 0.0722\, b_{ij}) according to the ITU-R BT.709 standard [45], with normalization, and the average intensity of screen update P is I_P = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} I_{ij}.

To create a linear programming model, we define the following variables:
• x_r: linear scaling of redness. For any pixel p_{ij} ∈ P, the scaled redness is r'_{ij} = r_{ij} × x_r.
• x_b: linear scaling of blueness. For any pixel p_{ij} ∈ P, the scaled blueness is b'_{ij} = b_{ij} × x_b.
• x_g: linear scaling of greenness. For any pixel p_{ij} ∈ P, the scaled greenness is g'_{ij} = g_{ij} × x_g.

Using the above parameters and variables, a linear programming model for optimal color scaling can be stated as follows:

Min:  W_r |R_E - R_P x_r| + W_b |B_E - B_P x_b| + W_g |G_E - G_P x_g| + W_t |T_E - \frac{1}{3}(R_P x_r + B_P x_b + G_P x_g)| + W_c |C_E - \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} |I'_{ij} - I'_P||

s.t.:
S × L_r ≤ x_r ≤ S × H_r
S × L_b ≤ x_b ≤ S × H_b
S × L_g ≤ x_g ≤ S × H_g
S × L_t ≤ (R_P x_r + B_P x_b + G_P x_g) / T_P ≤ S × H_t
S × L_c ≤ \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} |I'_{ij} - I'_P| / C_P ≤ S × H_c     (16)
where L_r, L_b, L_g, L_t, L_c are the lowest scaling values and H_r, H_b, H_g, H_t, H_c are the highest scaling values obtained from the learning phase; they prevent the unpleasant feeling caused by an obvious inconsistency between the original and modified screens. These scaling values are multiplied by the maximum log-likelihood score obtained from (10). Moreover, R_E, B_E, G_E, T_E, and C_E represent the redness, blueness, greenness, brightness, and contrast information learned from an effective emotion-changing screen, respectively. The weight values W_r, W_b, W_g, W_t, and W_c (ranging from 0 to 1) are also obtained from the learning phase and represent the effectiveness of each screen feature; for example, W_r is large when overall redness can significantly change the current emotion. I'_{ij} and I'_P are the intensity values after applying the screen changes. In a shooting game, the emotion can be mixed for certain scenes; it is therefore necessary to take into account the likelihood score of the recognized class (emotion), rather than only the recognized class, for relative scaling.

The above linear programming model can be solved using the simplex method after the following reformulation:
1) Given a screen update P, \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} |I'_{ij} - I'_P| can be represented as a linear combination of x_r, x_b, and x_g.
2) To remove the calculation of absolute values, we introduce two new variables u and v to replace each of the existing absolute-value terms. For instance, |R_E - R_P x_r| is denoted u_r + v_r and R_E - R_P x_r is denoted u_r - v_r, where u_r = \frac{1}{2}\left(|R_E - R_P x_r| + (R_E - R_P x_r)\right), v_r = \frac{1}{2}\left(|R_E - R_P x_r| - (R_E - R_P x_r)\right), and u_r, v_r ≥ 0.

Finally, the emotion-changing screen effect is determined by solving the above linear programming problem.
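As an illustration only, the sketch below sets up a simplified version of (11)-(16) with scipy.optimize.linprog: it keeps the redness, blueness, greenness, and brightness terms of the objective (the contrast term and the brightness/contrast ratio constraints are omitted for brevity), applies the u/v substitution described above, and assumes example values for the learned targets, weights, and scaling bounds. It is a sketch of the technique, not the authors' solver.

```python
import numpy as np
from scipy.optimize import linprog


def screen_features(img: np.ndarray):
    """Normalized redness, blueness, greenness, and brightness of an (m, n, 3) RGB
    frame, following (11)-(14)."""
    r, g, b = img[..., 0].mean() / 255.0, img[..., 1].mean() / 255.0, img[..., 2].mean() / 255.0
    return r, b, g, (r + b + g) / 3.0


def solve_color_scaling(img, targets, weights, S, lo, hi):
    """Simplified LP of (16): variables are [x_r, x_b, x_g, u_r, v_r, u_b, v_b, u_g, v_g, u_t, v_t]."""
    R_P, B_P, G_P, _ = screen_features(img)
    R_E, B_E, G_E, T_E = targets
    W_r, W_b, W_g, W_t = weights

    # Objective: sum of weighted (u + v) pairs, each equal to an absolute deviation at the optimum.
    c = np.array([0, 0, 0, W_r, W_r, W_b, W_b, W_g, W_g, W_t, W_t], dtype=float)
    # Equalities enforce u - v = target - scaled_feature for each absolute term.
    A_eq = np.array([
        [R_P, 0, 0, 1, -1, 0, 0, 0, 0, 0, 0],
        [0, B_P, 0, 0, 0, 1, -1, 0, 0, 0, 0],
        [0, 0, G_P, 0, 0, 0, 0, 1, -1, 0, 0],
        [R_P / 3, B_P / 3, G_P / 3, 0, 0, 0, 0, 0, 0, 1, -1],
    ])
    b_eq = np.array([R_E, B_E, G_E, T_E])
    bounds = [(S * lo[ch], S * hi[ch]) for ch in range(3)] + [(0, None)] * 8

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:3]  # optimal x_r, x_b, x_g


# Example: a reddish frame and illustrative learned parameters for an angry player.
frame = np.zeros((480, 640, 3)); frame[..., 0] = 200; frame[..., 1] = 80; frame[..., 2] = 60
x_r, x_b, x_g = solve_color_scaling(
    frame,
    targets=(0.45, 0.35, 0.35, 0.38),   # assumed R_E, B_E, G_E, T_E from training screens
    weights=(0.9, 0.4, 0.4, 0.5),       # assumed W_r, W_b, W_g, W_t
    S=1.0, lo=(0.7, 0.8, 0.8), hi=(1.3, 1.2, 1.2))
print(x_r, x_b, x_g)
```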
IV. RESULTS AND DISCUSSION

A. Performance comparison concerning the learning phase

This section presents the performance comparisons related to the learning or training phase, in which the system learns to classify emotion so that, in the operating stage, an LP model can be applied to provide appropriate screen changes based on these emotions.

1) Methodology and Data: To evaluate the proposed emotion recognition system, we use the eNTERFACE'05 data set, which contains 42 subjects performing six different emotions: anger, fear, disgust, surprise, happiness, and sadness [36]. The data was recorded with a mini digital video camera under constant lighting conditions. There are five video sequences for each subject and emotion category. In our experiments, we excluded the "surprise" emotion and tested a total of five emotions. For the evaluation, we ran the experiments ten times. In each trial, three video sequences of every subject were chosen to train the system, while the remaining sequences were used for testing. The combinations of training sequences differ across trials; we used ten such combinations, because the five sequences can be grouped in C(5,3) = 10 possible ways. The training and testing sets were always mutually exclusive. Feature selection was performed with the training set only; in all ten training combinations, the five features mentioned above were selected (for the audio data) because their F values were higher than one. The ten accuracies were averaged to obtain the final accuracy.
2) Using audio features: Figure 6 shows the recognition accuracy for the five types of emotion obtained by the proposed system with the five selected MPEG-7 audio features and different numbers of Gaussian mixtures. The figure shows that the best performance is obtained with 16 mixtures: the system achieves 90.5%, 84.3%, 85.2%, 87.5%, and 85.75% accuracy for the anger, disgust, fear, happiness, and sadness emotions, respectively. Table I shows the confusion matrix of the different types of emotion recognition using the audio data, where rows and columns correspond to input and output, respectively.

Fig. 6: Recognition accuracy of the proposed system with five MPEG-7 audio features using different numbers of Gaussian mixtures.

The performance of the proposed system is compared with systems using MFCC and LPCC features. Specifically, 12-dimensional MFCC, 12 MFCC plus their first-order derivatives, and 12 LPCC plus their first-order derivatives are considered. A system with all 45 MPEG-7 features is also tested. In all cases, GMM-based classifiers with 16 mixtures are used. The performance comparison of the features is given in Figure 7. The proposed system with the five selected MPEG-7 features outperforms MFCC (12), MFCC (24), and LPCC (24). Although the system with all the MPEG-7 audio features performs slightly better than the proposed system, its number of features is prohibitively large.

Fig. 7: Performance comparison of different types of audio features.

TABLE I: CONFUSION MATRIX OF ACCURACIES (%) OF THE FIVE EMOTION CLASSES USING AUDIO DATA.
            Anger   Disgust   Fear   Happy   Sad
Anger        90.5     8.7      8.3    5      6.25
Disgust       5      84.3      5.5    4.5    6.5
Fear          3.5     4       85.2    3      1.5
Happy         0       0        0     87.5    0
Sad           1       3        1      0     85.75

3) Using image features: Figure 8 shows the recognition accuracy for the five types of emotion obtained by the proposed system with image features. The system achieves 91%, 86.3%, 86.7%, 89.2%, and 87.1% accuracy for the anger, disgust, fear, happiness, and sadness emotions, respectively, using 16 Gaussian mixtures. Table II shows the corresponding confusion matrix of the different types of emotion recognition using the image data, where rows and columns correspond to input and output, respectively. We compare the performance of the proposed system with another system that uses principal component analysis (PCA) on each cuboid [37]. Figure 9 compares the accuracy of the two systems: the proposed system with LBP features outperforms the system with PCA features in all five emotion categories.

Fig. 8: Recognition accuracy of the proposed system with LBP features and different numbers of Gaussian mixtures.

TABLE II: CONFUSION MATRIX OF ACCURACIES (%) OF THE FIVE EMOTION CLASSES USING IMAGE DATA.
            Anger   Disgust   Fear   Happy   Sad
Anger        91       7.5      7      4      5.5
Disgust       6.5    86.3      5.3    3.8    6.4
Fear          2.5     6.2     86.7    3      1
Happy         0       0        0     89.2    0
Sad           0       0        1      0     87.1
4) Using both audio and image features: The scores of the two classifiers are fused using (10) to obtain the final decision. The accuracy of the proposed system with the two modalities (audio and image sequence) is given in Figure 10; the number of Gaussian mixtures is 16. The figure also shows the performance of the systems described in [37] and [38]. For a fair comparison, the systems in [37] and [38] are evaluated with the same data as the proposed system. In [38], MFCC and PCA are applied to obtain audio features, and 2D and 3D facial features are extracted as visual features; recognition is then performed using a triple-stream dynamic Bayesian network. The figure shows that the proposed system outperforms the other systems.
Fig. 9: Performance comparison of the two systems using image features.

Fig. 10: Performance comparison of the systems using both audio and image modalities.
B. Objective evaluation

The cloud gaming user experience was evaluated using subjective and objective tests. In this section, we test the performance of our proposed emotion-changing remote display technique using an objective evaluation. Table III lists the hardware and software environment in which we conducted the simulations.

TABLE III: SIMULATION SETUP.
                 Client                                Server
Language         C                                     C
OS               Windows XP SP3                        Windows Server 2010
Specific Tool    None                                  CUDA
Hardware         Computer: Intel Core 2 Duo CPU,       HP Z820 Workstation: Intel Xeon E5-2620 CPU @ 2.00 GHz,
                 1.99 GB memory                        32.00 GB DDR3 ECC RAM, NVIDIA Quadro K4000 graphics card
Network          100 Mbps Ethernet, TCP/IP

We ran the famous first-person shooter game Counter-Strike on the server at a resolution of 640 × 480. Two encoding methods were applied in our simulation: for the H.264-based encoding we used NVENC, published by NVIDIA [39], as the encoding tool, and for the MJPEG-based encoding we used our previous encoding technique from [22]. Subfigures (a) and (c) of Figures 11 to 14 present the results of using the existing methods, while subfigures (b) and (d) show the resulting performance of adding the emotion component.
Figure 11 shows the real-time CPU usage of the cloud server during a six-minute gaming period. Emotion functions along with the remote display techniques were applied in the cases shown in Figures 11(b) and (d), while the other two cases, shown in Figures 11(a) and (c), only ran a remote display component on the server. As the results show, the H.264 encoder consumes more CPU resources than the MJPEG encoder: around 25%-35% of the CPU was consumed by the H.264 encoder and 15%-30% by the MJPEG encoder. The emotion component slightly increased the CPU usage when the user presented no negative emotion. When a user's negative emotion was detected, the emotion functions passed a message to the remote display component and requested a corresponding change of screen effect; dealing with the negative emotion required 5%-10% more CPU capacity to generate the corresponding screen effects. Such operations created the CPU-usage peaks visible in a few instances in Figures 11(b) and (d).

Fig. 11: CPU consumption during gaming (a) H.264 encoder without emotion component. (b) H.264 encoder with emotion component. (c) MJPEG encoder without emotion component. (d) MJPEG encoder with emotion component.

Figure 12 shows the encoding latency of the cloud server during the six-minute gaming period. Studies have shown that players of different types of games can tolerate various degrees of response time. According to [23], the response delay consists of three major parts: a) encoding delay (ED) (similar to the processing delay in [23]), b) playout delay (OD), and c) network delay (ND); details of these delays can be found in [23]. In our simulation, we examined the ED of our approach. The emotion component does not have to act in a strictly real-time manner, so it does not impact OD or ND; thus, OD and ND are outside the scope of our research. Emotion functions, along with the two remote display techniques, were applied in the cases in Figures 12(b) and (d), while the other two cases, shown in Figures 12(a) and (c), ran only a remote display component on the server. The simulation results indicate that
the H.264 encoder has a longer ED than the MJPEG encoder: around 21-28 ms is needed for the H.264 encoder to encode one frame, and 20-22 ms for the MJPEG encoder to generate one frame. Running the emotion component adds about 1 ms of ED when the user presents no negative emotion; when a user's negative emotion is detected, the screen effect generator adds 5-8 ms of extra ED. These results can be observed in Figures 12(b) and (d). Nevertheless, our proposed solution can meet the real-time encoding requirement of 24 frames/s, as the ED of each frame is less than 40 ms.

Fig. 12: Encoding delay during gaming (a) H.264 encoder without emotion component. (b) H.264 encoder with emotion component. (c) MJPEG encoder without emotion component. (d) MJPEG encoder with emotion component.

Emotion functions along with the remote display technique were applied in the cases shown in Figures 13(b) and (d), while the other two cases in Figures 13(a) and (c) only have a remote display component running on the server. The figures show that the H.264 encoder consumes less network bandwidth than the MJPEG encoder. In the traditional cloud gaming scenario shown in Figures 13(a) and (c), the cloud server receives only a few data packets from the client side, containing the mouse and keyboard operations. With the emotion component, both audio and video/image data are required by the cloud server to detect user emotion; therefore, more bandwidth is consumed to transmit them. In our simulation, the emotion detection engine requires 5 frames/s of video/image data. As shown in Figure 13(b), transmitting a user's face images as H.264 video takes 10,000 bytes/s; for MJPEG encoding, Figure 13(d) indicates that 50,000 bytes/s of bandwidth is consumed in receiving the user's data.

Fig. 13: Bandwidth consumption during gaming (a) H.264 encoder without emotion component. (b) H.264 encoder with emotion component. (c) MJPEG encoder without emotion component. (d) MJPEG encoder with emotion component.

Figure 14 presents the peak signal-to-noise ratio (PSNR) of the encoding results during the six-minute gaming period. Emotion functions along with the remote display technique were applied in the cases shown in Figures 14(b) and (d), while the other two cases in Figures 14(a) and (c) only ran the remote display component on the server. During our simulation, we used the default settings of NVENC for H.264 and Q = 50 for the MJPEG encoder. The PSNR is calculated using the mean square error MSE between the image I before encoding and the final decoded image K:

MSE = \frac{1}{m \times n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i, j) - K(i, j)]^2     (17)

Then the PSNR (in dB) is defined as:

PSNR = 20 \times \log_{10}(MAX_I) - 10 \times \log_{10}(MSE)     (18)

where MAX_I is the maximum possible pixel value of the image. We used the frame data with the added screen effect as the original image I to measure the PSNR for Figures 14(b) and 14(d).

Fig. 14: PSNR during gaming (a) H.264 encoder without emotion component. (b) H.264 encoder with emotion component. (c) MJPEG encoder without emotion component. (d) MJPEG encoder with emotion component.
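A minimal sketch of the PSNR computation in (17)-(18), assuming 8-bit frames (MAX_I = 255); it is the standard formula, shown here only to make the measurement reproducible.

```python
import numpy as np


def psnr(original: np.ndarray, decoded: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR in dB between the pre-encoding frame and the decoded frame, per (17)-(18)."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 20 * np.log10(max_value) - 10 * np.log10(mse)


# Example: a frame and a noisy "decoded" copy of it.
frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
noisy = np.clip(frame + np.random.normal(0, 3, frame.shape), 0, 255).astype(np.uint8)
print(round(psnr(frame, noisy), 2))
```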
Because the PSNR is mainly determined by the encoding technique rather than by the images, no significant difference is observed in Figure 14: while playing a game with emotion control, the user receives gaming video of a quality similar to that of the game alone. Figure 15 presents actual images captured from the server and client. Figure 15(a) is the original server-side gaming screen, and Figure 15(b) is the encoded image captured at the client side. The same frame with an added screen effect is presented in Figure 15(c): to reduce a negative emotion (a high level of anger in this example), the overall redness is significantly reduced and the overall blueness is increased in a visible way.

Fig. 15: Actual cloud gaming screens (a) Original screen on server side. (b) Encoded screen on client side. (c) Screen with effect to deal with negative emotion.
C. Subjective evaluation

To study the feasibility of the proposed emotion-aware cloud gaming framework, a subjective evaluation was performed with the FPS game Counter-Strike. The initial experiments were conducted by inviting 35 undergraduate and graduate students of King Saud University to participate. The participants played the game over a private cloud platform in a laboratory environment with a free play time of 10 minutes. While they played, the participants' emotional responses were recorded using microphones and digital video cameras, and the necessary screen effects were provided after processing over the cloud. After the gameplay, the participants were asked to fill out surveys related to their emotional feedback (e.g., motivation, engagement, or disengagement) for the game. Finally, 30 sets of responses were validated for data analysis.

To evaluate the users' motivation and satisfaction, the Attention, Relevance, Confidence, and Satisfaction (ARCS) model developed by Keller [43] was adopted. The model was used to evaluate the players' motivational stimuli in instructional games. It has four perceptual components: Attention (A), consisting of seven items; Relevance (R), comprising two items; Confidence (C), comprising three items; and Satisfaction (S), which also has three items. The scores of these items were calculated on a 9-point symmetrical Likert scale [44] to specify the level of motivation and engagement during gameplay. Table IV presents the motivational items and their respective reported ratings. The overall reliability of the scale, as a standardized Cronbach's alpha, was 0.92 (n = 30 on 15 items), which indicates a very reliable scale; the highest mean was 7.2633 and the lowest was 5.24.
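The paper reports only the resulting standardized Cronbach's alpha (0.92). Purely as an illustration of how such a value is obtained, the sketch below computes the standardized alpha from a synthetic 30 x 15 matrix of Likert responses; the data is random, so the result will not match the reported 0.92.

```python
import numpy as np


def standardized_cronbach_alpha(responses: np.ndarray) -> float:
    """Standardized Cronbach's alpha for a (participants x items) response matrix:
    alpha = k * r_bar / (1 + (k - 1) * r_bar), with r_bar the mean inter-item correlation."""
    k = responses.shape[1]
    corr = np.corrcoef(responses, rowvar=False)        # k x k inter-item correlations
    r_bar = corr[~np.eye(k, dtype=bool)].mean()
    return k * r_bar / (1 + (k - 1) * r_bar)


# Synthetic example: 30 participants rating 15 items on a 9-point scale.
rng = np.random.default_rng(1)
base = rng.integers(4, 9, size=(30, 1))                # a common "attitude" per participant
items = np.clip(base + rng.integers(-1, 2, size=(30, 15)), 1, 9)
print(round(standardized_cronbach_alpha(items.astype(float)), 2))
```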
TABLE IV: REPORTED LEVELS BY ARCS ITEMS.

ARCS motivational items | Rating
Attention
1. Game loading times (e.g., delay) | 6.49
2. Video image and sound quality | 5.89
3. Perceived screen effect update | 6.78
4. Jerkiness of game animation caused by additional consumption of resources such as bandwidth and CPU | 6.51
5. The game is stressful and makes players sad or angry | 6.61
6. Game control stability | 6.76
7. Game environment condition | 6.66
Relevance
1. Game content relevancy to real-life scenarios | 5.28
2. Familiarity of dynamic emotion-aware screen update | 5.98
Confidence
1. Negative emotion such as violence or anger may affect confidence | 5.31
2. Game difficulty level | 5.14
3. The game had too much information to pick out and remember | 5.27
Satisfaction
1. Emotional feeling (e.g., happy and engaged) after game play | 7.35
2. Benefit of the screen display for game motivation | 7.45
3. Impact of dynamic screen effects on current emotion, such as the extent of enjoyment | 6.98
In the FPS game scenario, Attention corresponds to the players' responses to the instructional stimuli generated by the gameplay, in this case the different screen effects driven by emotional feedback. Relevance determines whether the content or
emotion-aware screen effects relate to real-life scenarios. Confidence measures the players' performance at different difficulty levels of gameplay and how it influences them. Satisfaction measures the players' experience of gameplay, their excitement and happiness, and the influence of the screen effects in motivating them to play more.

TABLE V: DATA SUMMARY OF ARCS COMPONENTS.

Parameters | Attention | Relevance | Confidence | Satisfaction
N          | 7         | 2         | 3          | 3
Σx         | 45.8      | 11.26     | 15.72      | 21.79
Mean       | 6.542     | 5.63      | 5.24       | 7.2633
Σx²        | 300.214   | 63.6388   | 82.3886    | 158.3945
Variance   | 0.0919    | 0.245     | 0.0079     | 0.0632
Std. Dev.  | 0.3031    | 0.495     | 0.0889     | 0.2515
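The summaries in Table V follow directly from the item ratings in Table IV; the following Python sketch reproduces them (small discrepancies are due to rounding of the reported ratings).

# Minimal sketch of how the per-component summaries in Table V follow from the
# item ratings in Table IV (N, sum, mean, sum of squares, sample variance, std).
import numpy as np

items = {
    "Attention":    [6.49, 5.89, 6.78, 6.51, 6.61, 6.76, 6.66],
    "Relevance":    [5.28, 5.98],
    "Confidence":   [5.31, 5.14, 5.27],
    "Satisfaction": [7.35, 7.45, 6.98],
}

for name, ratings in items.items():
    x = np.asarray(ratings)
    print(f"{name}: N={x.size}, sum={x.sum():.2f}, mean={x.mean():.4f}, "
          f"sum_sq={np.square(x).sum():.4f}, var={x.var(ddof=1):.4f}, std={x.std(ddof=1):.4f}")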
The survey results presented in Table IV show the average score, based on the 30 participants' responses, for each component and its items. A summary of the ARCS components is given in Table V. A one-way ANOVA test (F-test with alpha = 0.05) on these data shows significant differences between the subscales (see Table VI). For this alpha and these degrees of freedom, the critical value of F is 2.68; the observed F was larger than this critical value (F = 29.17 > 2.68). Therefore, we conclude that there are significant differences between the subscales.

TABLE VI: ONE-WAY ANOVA TEST.
Source  | SS     | df  | MS     | F-Significance (F > 2.68)
Between | 7.4652 | 3   | 2.4884 | 29.17
Within  | 0.9384 | 116 | 0.0853 |
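The one-way ANOVA reported in Table VI can be reproduced with standard tools; the following Python sketch runs scipy.stats.f_oneway on four groups of 30 per-participant component scores, which yields the same degrees of freedom (3, 116). The scores below are random placeholders centered on the component means, not the collected survey data.

# Minimal sketch of the one-way ANOVA in Table VI: four groups (A, R, C, S)
# of 30 per-participant component scores each, giving df = (3, 116).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
means = {"Attention": 6.542, "Relevance": 5.63, "Confidence": 5.24, "Satisfaction": 7.263}
# Placeholder scores drawn around the Table V means; the real data would give F = 29.17.
groups = [rng.normal(loc=m, scale=0.3, size=30) for m in means.values()]

f_stat, p_value = f_oneway(*groups)
print(f"F(3, 116) = {f_stat:.2f}, p = {p_value:.4g}")   # compare against F_crit = 2.68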
Tukey's post-hoc Honestly Significant Difference (HSD) test was used to identify how the four subscales differ with regard to the mean differences of the ARCS components. Based on the mean scores of all the ARCS components in Table V, the Tukey HSD test was performed with a critical value of 0.71 at alpha = 0.05. The results are presented in Table VII, where both the Attention and Satisfaction components are significantly different from all other components. As shown in Table V, the Satisfaction component has the highest mean score (7.263), implying that the dynamic screen effects had some impact on users' emotional state during gameplay. The effects slowly reduced negative emotions and motivated participants to continue the game. The Attention component had the second highest mean score (6.542) among the ARCS components, because of the poorer image quality at the client side compared with the original server-side image quality. Moreover, because the game was streamed from the cloud server through the remote display technology, it was at times unresponsive to user inputs owing to the wireless network conditions. For Relevance, the participants reported a moderate score (5.63) because the game was not closely related to their real-life scenarios; however, the intent of the game is personal enjoyment and excitement rather than realism. Additionally,
it was not easy for players to become acquainted with the game environment. With regard to Confidence, the score was quite low (5.24) because of the players' unfamiliarity with this game genre. In addition, the game contained some violence that may have affected the players and lowered their confidence; however, their confidence increased slowly over time because of the various emotional screen effects. In summary, the Tukey HSD test reveals that the participants reported high levels of Satisfaction and Attention, which indicates that they were happy to perceive the emotion-based dynamic screen effects during gameplay.

TABLE VII: TUKEY HSD TEST.

Pair | Mean Difference (MD) | Significance (MD > 0.71)
A-R  | 0.912 | Yes
A-C  | 1.302 | Yes
S-A  | 0.720 | Yes
R-C  | 0.39  | No
S-R  | 1.633 | Yes
S-C  | 2.023 | Yes
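The pairwise comparison in Table VII amounts to testing the absolute differences between the component means in Table V against the reported HSD critical value of 0.71; the following Python sketch reproduces the table (up to rounding and the ordering of the pair labels).

# Minimal sketch of the pairwise comparison in Table VII: absolute differences
# between the ARCS component means (Table V) tested against the reported
# Tukey HSD critical value of 0.71 (alpha = 0.05).
from itertools import combinations

means = {"A": 6.542, "R": 5.63, "C": 5.24, "S": 7.263}
HSD = 0.71

for a, b in combinations(means, 2):
    md = abs(means[a] - means[b])
    print(f"{a}-{b}: MD = {md:.3f}, significant = {md > HSD}")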
V. CONCLUSION

In this paper, we described a framework that uses audio-visual emotion recognition to modify the gaming screen with emotion-aware effects and thereby create an enhanced gaming experience. This study also demonstrates how dynamic screen effects delivered through remote display technology can shape emotional responses during gameplay and help keep players motivated and engaged. The suitability and effectiveness of the proposed framework were assessed by objective and subjective evaluations. We showed that our approach imposes a tolerable amount of extra workload on the cloud server. Even for interactive game genres such as first-person shooter (FPS) or action role-playing game (ARPG) titles, the server-side performance shows that our approach can still provide a high-quality, strictly real-time gaming experience when network conditions are good. For future research, we plan to create emotional benchmark data for various game genres. In addition, more screen effects will be considered, not only to address negative emotions but also to enhance positive ones. Providing individual emotional control services for cloud gamers is another goal that we would like to achieve in future work. Moreover, it would be interesting to include and run more game genres in our study and observe the effects.

REFERENCES

[1] R. Shea, J. Liu, E. C.-H. Ngai, and Y. Cui, "Cloud gaming: architecture and performance," IEEE Network, vol. 27, no. 4, pp. 16-21, July-August 2013.
[2] S.-P. Chuah, C. Yuen, and N.-M. Cheung, "Cloud gaming: a green solution to massive multiplayer online games," IEEE Wireless Commun., vol. 21, no. 4, pp. 78-87, August 2014.
[3] D. Wu, Z. Xue, and J. He, "iCloudAccess: Cost-effective streaming of video games from the cloud with low latency," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 8, pp. 1405-1416, Aug. 2014.
[4] OnLive. [Online]. Available: http://www.onlive.com/. Accessed Sep 14, 2014.
[5] Gaikai. [Online]. Available: http://www.gaikai.com/. Accessed Sep 14, 2014.
[6] StreamMyGame. [Online]. Available: http://www.Streammygame.com. Accessed Sep 14, 2014.
[7] K.-T. Chen, C.-Y. Huang, and C.-H. Hsu, "Cloud gaming onward: research opportunities and outlook," in Proc. IEEE ICME 2014, Chengdu, China, pp. 1-4, 14-18 July 2014.
[8] D. Mishra, M. E. Zarki, A. Erbad, C. Hsu, and N. Venkatasubramanian, "Clouds + games: A multifaceted approach," IEEE Internet Comput., vol. 18, no. 3, pp. 20-27, May-June 2014.
[9] W. Cai, C. Zhou, V. C. M. Leung, and M. Chen, "A cognitive platform for mobile cloud gaming," in Proc. IEEE CloudCom 2013, Bristol, UK, pp. 72-79, 2-5 Dec. 2013.
[10] K.-T. Chen, Y.-C. Chang, H.-J. Hsu, D.-Y. Chen, C.-Y. Huang, and C.-H. Hsu, "On the quality of service of cloud gaming systems," IEEE Trans. Multimedia, vol. 16, no. 2, pp. 480-495, February 2014.
[11] C.-Y. Huang, C.-H. Hsu, D.-Y. Chen, and K.-T. Chen, "Quantifying user satisfaction in mobile cloud games," in Proc. ACM MoVid 2014, Singapore, pp. 4:1-4:6, March 2014.
[12] M. Chen, Y. Zhang, L. Hu, and S. Mao, "EMC: Emotion-aware mobile cloud computing in 5G," IEEE Network, Mar. 2015.
[13] ACM TechNews, "Stanford engineers design video game controller that can sense players' emotions," Communications of the ACM. [Online]. Available: http://cacm.acm.org/news/173744-stanfordengineers-design-video-game-controller-that-can-sense-playersemotions/fulltext#comments. Accessed Jan 14, 2015.
[14] thatgamecompany. [Online]. Available: http://thatgamecompany.com/. Accessed Jan 14, 2015.
[15] A. Popescu, J. Broekens, and M. V. Someren, "GAMYGDALA: An emotion engine for games," IEEE Trans. Affective Comput., vol. 5, no. 1, pp. 32-44, Jan.-Mar. 2014.
[16] J. L. Sabourin and J. C. Lester, "Affect and engagement in game-based learning environments," IEEE Trans. Affective Comput., vol. 5, no. 1, pp. 45-56, Jan.-Mar. 2014.
[17] "Can videogames make you cry?" Bowen Research / Game Informer. [Online]. Available: http://www.bowenresearch.com/studies.php?id=3. Accessed Jan 14, 2015.
[18] Y.-T. Lee, K.-T. Chen, H.-I. Su, and C.-L. Lei, "Are all games equally cloud-gaming-friendly? An electromyographic approach," in Proc. ACM/IEEE NetGames 2012, Venice, Italy, Nov. 2012.
[19] Microsoft remote desktop protocol: basic connectivity and graphics remoting specification. [Online]. Available: http://msdn2.microsoft.com/enus/library/cc240445.aspx. Accessed Sep 14, 2014.
[20] UltraVNC. [Online]. Available: http://www.uvnc.com/. Accessed Sep 14, 2014.
[21] EC2 RDP. [Online]. Available: http://superuser.com/questions/855222/the-remote-desktopscreen-is-always-a-grey-background. Accessed Jan 15, 2015.
[22] B. Song, W. Tang, T.-D. Nguyen, M. Hassan, and E. Huh, "An optimized hybrid remote display protocol using GPU-assisted M-JPEG encoding and novel high-motion detection algorithm," J. Supercomputing, vol. 66, no. 3, pp. 1729-1748, 2013.
[23] C.-Y. Huang, K.-T. Chen, D.-Y. Chen, H.-J. Hsu, and C.-H. Hsu, "GamingAnywhere: The first open source cloud gaming system," ACM Trans. Multimedia Comput. Commun. Appl., vol. 10, no. 1s, Jan. 2014.
[24] K. Chen, Y. Chang, P. Tseng, C. Huang, and C. Lei, "Measuring the latency of cloud gaming systems," in Proc. ACM MM 2011, Scottsdale, AZ, USA, pp. 1269-1272, Nov. 28 - Dec. 1, 2011.
[25] M. Jarschel, D. Schlosser, S. Scheuring, and T. Hoßfeld, "An evaluation of QoE in cloud gaming based on subjective tests," in Proc. IEEE IMIS 2011, Seoul, Korea, pp. 330-335, Jul. 2011.
[26] P. Quax, A. Beznosyk, W. Vanmontfor, R. Marx, and W. Lamotte, "An evaluation of the impact of game genre on user experience in cloud gaming," in Proc. IGIC 2013, Vancouver, BC, pp. 216-221, 23-25 Sept. 2013.
[27] W. Cai, V. C. M. Leung, and M. Chen, "Next generation mobile cloud gaming," in Proc. IEEE SOSE 2013, San Francisco Bay, USA, pp. 551-560, 25-28 March 2013.
[28] nViso. [Online]. Available: http://www.nviso.ch/. Accessed Sep 14, 2014.
[29] Y.-C. Chang, P.-H. Tseng, K.-T. Chen, and C. L. Lei, "Understanding the performance of thin-client gaming," in Proc. IEEE CQR 2011, North Naples, Florida, pp. 1-6, 10-12 May 2011.
[30] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine - belief network architecture," in Proc. IEEE ICASSP 2004, Montreal, Quebec, pp. I-577-580, 17-21 May 2004.
[31] Y. Zhou, Y. Sun, J. Zhang, and Y. Yan, "Speech emotion recognition using both spectral and prosodic features," in Proc. ICIECS 2009, Wuhan, China, pp. 1-4, 19-20 December 2009.
[32] G. Muhammad and K. Alghathbar, "Environment recognition for digital audio forensics using MPEG-7 and mel cepstral features," J. Electrical Engineering, vol. 62, no. 4, pp. 199-205, 2011.
[33] Information Technology - Multimedia Content Description Interface - Part 4: Audio, ISO/IEC CD 15938-4, 2001.
[34] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. IEEE VS-PETS 2005, Beijing, China, pp. 65-72, 15-16 Oct. 2005.
[35] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, December 2006.
[36] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 audio-visual emotion database," in Proc. ICDEW 2006, Atlanta, GA, p. 8, April 3-8, 2006.
[37] R. Munaf, S. A. R. Abu-Bakar, and M. Mokji, "Human emotion recognition from videos using spatio-temporal and audio features," Visual Computer, vol. 29, no. 12, pp. 1269-1275, 2013.
[38] D. Jiang, Y. Cui, X. Zhang, P. Fan, I. Gonzalez, and H. Sahli, "Audio visual emotion recognition based on triple-stream dynamic Bayesian network models," in S. D'Mello et al. (Eds.): ACII 2011, Part I, LNCS 6974, pp. 609-618, 2011.
[39] NVIDIA Video Codec SDK, January 2014. [Online]. Available: https://developer.nvidia.com/nvidia-video-codec-sdk. Accessed Sep 14, 2014.
[40] M. Claypool and K. Claypool, "Latency and player actions in online games," Communications of the ACM, vol. 49, pp. 40-45, November 2006.
[41] T. Henderson, "The effects of relative delay in networked games," Ph.D. thesis, Department of Computer Science, University of London, February 2003.
[42] S. Zander, I. Leeder, and G. Armitage, "Achieving fairness in multiplayer network games through automated latency balancing," in Proc. ACM SIGCHI ACE 2005, Valencia, Spain, pp. 117-124, June 2005.
[43] J. M. Keller, "Motivational design of instruction," in C. M. Reigeluth (Ed.), Instructional Design Theories and Models: An Overview of Their Current Status, Hillsdale, NJ: Lawrence Erlbaum, pp. 383-434, 1983.
[44] G. Fred, J. Jeroen, and J. Jos, "Measurement of cognitive load in instructional research," Perceptual and Motor Skills, vol. 79, pp. 419-430, 1994.
[45] ITU-R Recommendation BT.709. [Online]. Available: http://www.itu.int/rec/R-REC-BT.709-5-200204-I/en. Accessed Jan 14, 2015.
[46] H.264. [Online]. Available: http://www.divx.com/en/software/technologies/h264. Accessed Jan 14, 2015.
[47] Motion JPEG. [Online]. Available: http://en.wikipedia.org/wiki/Motion JPEG. Accessed Jan 14, 2015.
M. Shamim Hossain (SM'09) is an Associate Professor at King Saud University, Riyadh, KSA. Dr. Shamim Hossain received his Ph.D. in Electrical and Computer Engineering from the University of Ottawa, Canada. His research interests include serious games, cloud and multimedia for healthcare, resource provisioning for big data processing on media clouds, and biologically inspired approaches for multimedia and software systems. He has authored and co-authored around 100 publications, including refereed IEEE/ACM/Springer/Elsevier journal articles, conference papers, books, and book chapters. He has served as a member of the organizing and technical committees of several international conferences and workshops, and as co-chair, general chair, workshop chair, publication chair, and TPC member for over 12 IEEE and ACM conferences and workshops. Currently, he serves as a co-chair of the 5th IEEE ICME Workshop on Multimedia Services and Tools for E-health (MUST-EH 2015). He is on the editorial board of the International Journal of Multimedia Tools and Applications. Previously, he served as a guest editor of IEEE Transactions on Information Technology in Biomedicine and Springer Multimedia Tools and Applications (MTAP). Currently, he serves as a lead guest editor of Elsevier Future Generation Computer Systems, Computers & Electrical Engineering, Springer Cluster Computing, and the International Journal of Distributed Sensor Networks. Dr. Shamim is a member of ACM and ACM SIGMM.
Ghulam Muhammad (M'10) is an Associate Professor in the Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, KSA. Dr. Ghulam received his Ph.D. in Electrical and Computer Engineering from Toyohashi University of Technology, Japan, in 2006. His research interests include serious games, cloud and multimedia for healthcare, resource provisioning for big data processing on media clouds, biologically inspired approaches for multimedia and software systems, and image and speech processing. He has authored and co-authored more than 100 publications, including refereed IEEE/ACM/Springer/Elsevier journal articles, conference papers, books, and book chapters.
Atif Alamri (M'09) is an Associate Professor in the Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. His research interests include multimedia-assisted health systems, ambient intelligence, and service-oriented architecture. Mr. Alamri was a Guest Associate Editor of the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, a Co-chair of the first IEEE International Workshop on Multimedia Services and Technologies for E-health, and a Technical Program Co-chair of the 10th IEEE International Symposium on Haptic Audio Visual Environments and Games, and he serves as a Program Committee Member of many conferences in multimedia, virtual environments, and medical applications.
Biao Song (M'12) received his Ph.D. degree in Computer Engineering from Kyung Hee University, South Korea, in 2012. He is currently an Assistant Professor in the College of Computer and Information Sciences, King Saud University, Kingdom of Saudi Arabia. His current research interests are cloud computing, remote display technologies, and dynamic VM resource allocation. He has more than 35 publications in refereed IEEE/ACM/Springer journals.
Mohammad Mehedi Hassan (M'13) is an Assistant Professor in the Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia. He received his Ph.D. degree in Computer Engineering from Kyung Hee University, South Korea, in February 2011. He has more than 80 publications in refereed IEEE/ACM/Springer journals, such as IEEE Wireless Communications Magazine, IEEE Network Magazine, and IEEE Transactions on Computers, and in reputed conferences. He has served as chair and Technical Program Committee member for numerous international conferences and workshops, including IEEE HPCC, ACM BodyNets, IEEE ICME, IEEE ScalCom, ACM Multimedia, ICA3PP, IEEE ICC, TPMC, and IDCS. His research interests include cloud collaboration, multimedia cloud, sensor-cloud, Internet of Things, mobile cloud, thin clients, grid computing, IPTV, virtual networks, sensor networks, and publish/subscribe systems.
Abdulhameed AlElaiwi (M'12) is a faculty member of the Software Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia. He received his Ph.D. degree in Software Engineering from Florida Tech in February 2002. He has authored and co-authored more than 30 publications, including refereed IEEE/ACM/Springer journal articles, conference papers, books, and book chapters. His research interests include software testing, cloud collaboration, multimedia cloud, sensor-cloud, mobile cloud, and e-learning systems.