Globecom 2014 - Symposium on Selected Areas in Communications: GC14 SAC Social Networks
EMVideo: User Collaboration Framework for Generating Multimedia Contents in Social Networks
Suk Kyu Lee, Seungho Yoo, Eugene Kim, Hyunsoon Kim, Kangho Kim, and Hwangnam Kim
School of Electrical Engineering, Korea University
{sklee25, pen0423, ekim57, gustns2010, mybalam2k, hnkim}@korea.ac.kr
Abstract—With the rapid proliferation of mobile devices, it is common to see users employing them for multimedia interaction. Users not only view contents but also use mobile devices as a medium for expressing their thoughts in a new form of storytelling. However, the way contents are created on mobile devices is not novel; it simply follows the traditional media production approach. In this paper, we present EMVideo, a novel collaboration framework for creating multimedia contents through mobile devices. We believe that the proposed framework can serve as an innovative method for digital storytelling and as a novel social networking application.
Index Terms—Social networking application, Mobile film, User collaboration framework
I. INTRODUCTION
In the age of mobile devices, it is common to see users interacting with and generating multimedia contents on their mobile devices. Users often create their own videos and distribute their stories through video-sharing social networking services such as YouTube or Facebook. Furthermore, such contents are no longer limited to sharing; they are also generated for education and entertainment, and there are even film festivals dedicated solely to contents created with mobile devices [1]–[3].
Mobile devices with novel technical features are continuously being introduced into the market, for instance, mobile devices with a 3D camera [4], [5]. The 3D camera has not drawn much attention yet, due to the lack of clear application plans and the deficiency of 3D contents. However, the 3D camera can be exploited as a tool for creating 2D multimedia contents by leveraging the mobile device's computing power and communication ability. By utilizing a mobile device with a 3D camera to build multimedia contents, a contents creator can employ the distance information between a subject¹ and the camera. Additionally, with the mobile device's communication ability, the device can be connected to the social network. This eventually endows the created multimedia contents with a unique characteristic. With such an emerging technology for creating contents, the application scope of mobile devices with a 3D camera can be broadened, and the multimedia contents they produce will be distinguishable.
¹ In this paper, the term subject refers to the person, animal, or object that is being filmed or photographed.
In addition to adopting an emerging technology into a mobile device, a filming method dedicated to the mobile device should be enhanced from its current state. One approach is to adopt the automation of filming the contents,
which has not yet been embraced by the media industry [6]: the final product comes from the hands of editors and producers, irrespective of users' tastes. To provide a novel filming method for mobile devices, we can instead rely on computing power to edit the video automatically. From there, we can introduce the idea of social networking by providing a multimedia contents generation framework based on users' collaborative actions.
In this paper, we propose Everyone Makes a Video (EMVideo), a collaboration framework among mobile devices for creating multimedia contents within the social network. Users collaborate with each other as a team to create their desired contents. EMVideo is composed of multiple Observers, which act as filming units, and a Director, which manages the Observers and generates video contents in an automatic fashion. Each Observer is equipped with a Computerized Multimedia Contents Alternator (CoMCA) to assist the Director.
The major contributions of the proposed framework are as follows. Firstly, we propose a unique way of creating multimedia contents with multiple mobile devices. Secondly, EMVideo can be employed as an instrument for innovative digital storytelling: instead of relying on an executive producer to create video contents, users themselves create the contents (their story) together with other users. Lastly, we thoroughly analyze the usage of each type of CoMCA and propose a way to create video contents under various filming conditions. In summary, this paper makes the following major contributions:
• Proposing a novel social networking application for creating multimedia contents with mobile devices;
• Identifying how a mixture of multi-modal information can be employed as a medium for generating multimedia contents under various filming conditions;
• Implementing a dedicated test-bed for EMVideo and performing a thorough empirical evaluation with different experimental configurations.
The remainder of the paper is organized as follows. We briefly introduce past and current research in the related field in Section II. We present an overview of EMVideo in Section III. In Section IV, we explain the evaluation of our proposed framework and analyze the results. Finally, we conclude the paper in Section V.
II. PRELIMINARY
We will briefly go over prior research and applications on automatically creating video contents from multiple cameras.
Fig. 1. Overview of EMVideo framework (Observers, a Director with its Contents Generator, and the selected video frames).
Fig. 2. Architecture of the proposed framework: (a) Observer (3D camera, microphone, EMVideo CoMCA, CoMCA Data Handler, Video Data Transmitter, NIC); (b) Director (CoMCA Data Handler, Video Data Receiver, NIC, EMVideo Contents Generator).
One related work is Virtual Director [7], which utilized sound information to localize a user in a filming space consisting of multiple cameras; note that this work proposed a video frame selection mechanism based on the sound information. Another related work on automatic tools for capturing the camera view is Active Capture [6], [8], whose authors proposed a system that automatically edits a video based on the videos recorded from multiple cameras. In contrast to this previous work, the primary goal of this paper is to present a novel user collaboration framework for mobile devices, which consists of a filming unit, a management unit, and a computerized video frame selection.
III. DESIGN AND IMPLEMENTATION
In this section, we describe the blueprint of the proposed framework. EMVideo is composed of multiple Observers and a Director, as shown in Fig. 1. An Observer is a filming unit, namely a laptop² equipped with a Kinect, a microphone, and a software module named the Computerized Multimedia Contents Alternator (CoMCA). CoMCA is designed to assist the Director in selecting the video frame to append to the multimedia content, based on the proximity between a subject and the Observers. For this paper, we designed three types of CoMCA: CoMCA-D, CoMCA-DV, and CoMCA-DS; the details of CoMCA are explained later in this section. Lastly, the Director contains a Contents Generator, a unit that generates the multimedia contents by collecting all of the information from the Observers. Consequently, based on users' collaborative actions, the Observers transmit CoMCA data to the Director, and the Director analyzes the CoMCA data and automatically selects a video frame to add to the content.
A. Overview of EMVideo
In this section, we explain the architecture of the Director and the Observer in the EMVideo framework in detail. Figs. 2(a) and 2(b) illustrate the architecture of the Observer and the Director of the proposed framework, respectively. In Fig. 2(a), EMVideo CoMCA is a software algorithm for determining the proximity between the subject and an Observer. It is crucial that all Observers work with the same type of CoMCA for consistency. However, during the filming session, users can dynamically employ a different type of CoMCA as long as the same type of CoMCA is selected for all participating Observers and the Director.
² We utilized a laptop with a 3D camera as an Observer, since 3D data from a smartphone with a 3D camera was not accessible for computation.
Once an Observer acquires CoMCA data, the CoMCA Data Handler transmits it to the Director. Afterwards, the CoMCA Data Handler receives the Director's request for a video frame. When an Observer receives such a request from the Director, the Video Data Transmitter transmits the video frame, along with its image sequence number, to the Director. As shown in Fig. 2(b), the Director is designed to manage the Observers and create multimedia contents based on the Observers' CoMCA data. It is equipped with a CoMCA Data Handler to receive every Observer's CoMCA data and to request the video frame from a chosen Observer. Once the Observer sends the requested video frame, the Video Data Receiver receives it and transfers it to the Contents Generator, which appends it to the current video.
We designed our framework so that the Observers and the Director perform their own tasks independently, in order to improve the overall performance of EMVideo. If an Observer's role were only to transmit raw video data, the Director could not handle the tremendous amount of video data from multiple Observers in time. This would impose heavy traffic on the network, and the Director would be busy merely receiving the raw video data. Moreover, if the Director had to compute the CoMCA data of all Observers, this could lead to a heavy computational burden: concurrently receiving and processing the data at one station would slow down the entire framework. Thus, concentrating all abilities on the Director's side could result in performance degradation. To prevent this effect, we designed EMVideo so that each Observer computes its own CoMCA data and transmits its video frame based on the Director's decision.
B. EMVideo CoMCA
In this section, we describe the Computerized Multimedia Contents Alternator (CoMCA), which is the core component of the EMVideo framework. Before explaining each type of CoMCA in detail, we first describe the design philosophy.
Design philosophy of EMVideo CoMCA: In 1963, Hall modeled a structure of personal dynamic space with four distances: intimate (less than 18 inches), personal (between 18 inches and 4 feet), social (between 4 and 12 feet), and public (more than 12 feet) [9]. According to Hall's model, how close we are within the personal dynamic space indicates the intimacy of communication. This is not limited to communication within the physical space; it also applies to contents generation. For example, when we watch multimedia contents on a display, the performers are the ones trying to communicate with the viewers to tell their story. When the performers appear closer to the screen, viewers feel more intimacy and tend to enjoy the video content more [10], [11]. Based on this idea, we utilized proximity as a
medium for selecting the video frame, so that viewers feel more intimate with the subject in videos generated by EMVideo. Thus, all CoMCA types primarily focus on determining the proximity between a subject and an Observer. The purpose of EMVideo is not limited to digital storytelling; rather, it builds on users' collaborative actions, guided by proximity information, for generating multimedia contents. This approach provides an innovative way for users to communicate with their peers as a novel social networking application.
In EMVideo, users can dynamically select a different type of CoMCA to create their desired contents. Note that we did not combine depth, visible, and sound information into a single type of CoMCA, for two reasons. Firstly, a given filming condition does not require all of the information to accurately estimate the proximity. The visible information is not robust to illumination changes, so it is not appropriate for determining the proximity when the lighting changes often, whereas the depth information is robust to such changes. Conversely, it is not always true that the depth information suffices for determining the proximity: when a subject is very close to an Observer, determining the distance based only on depth is inaccurate due to the limitations of the depth sensor. In that case, the visible information should be used to determine the proximity more accurately. Consequently, we use different information for each filming condition. Secondly, there is the question of what type of multimedia contents the user wants to create: a live broadcast or a recorded broadcast. The importance of sound varies with the type of broadcast. For example, the origin of sound is critical for a live broadcast: hosts and guests in a show tend to speak loudly and clearly for the viewers, and the camera view changes based on who is speaking. In contrast, for contents such as a soap opera, the actors have different voice qualities and some may not speak loudly enough for a microphone to detect accurately; thus, sound and video are recorded separately, and an additional process for refining the actors' sound is conducted. Hence, the importance of sound depends on the type of broadcast. Although we could design a single type of CoMCA that adopts all combinations, for these reasons we do not want to restrict EMVideo to only a few filming conditions and broadcast types by creating a single type of CoMCA. Each kind of raw information has its own strengths and limitations, and we believe that utilizing this information appropriately for the proper filming condition increases the overall applicability of EMVideo.
CoMCA-D: As the first type of CoMCA, we designed CoMCA-D, which is based on the depth information. CoMCA-D is primarily designed to be robust to any light changes and may be used when filming a music concert, a play, or a musical. As the first step, we acquire the CoMCA data, $\gamma$, by defining a set of collected background depth images, $D = \{D_1, D_2, \ldots, D_N\}$, where $N$ is the number of background depth images acquired before a subject enters the filming space. In order to acquire accurate background depth information, we obtain 30 ($= N$) background depth images
prior to running the framework. After we acquire $D$, we compute the mean background image

$D_b = \frac{1}{N} \sum_{i=1}^{N} D_i$,  (1)

where $D_b$ denotes the mean depth background image over the set of collected depth background images $D$. After the acquisition of $D_b$, we compute the foreground depth mask $D_f$ as $D_f = D_c - D_b$, where $D_c$ is the current depth image while a subject is in the filming space. However, due to an issue with the Kinect's depth sensor, the obtained $D_f$ becomes zero if the subject is very close to the camera. To improve the accuracy, we discard the zero values of $D_f$, which are unnecessary and inaccurate compared to the previous images (whose values are not 0); we explain this effect later in this paper. As the next step, if $D_f$ is equal to zero, we perform a linear interpolation among the previous two images and the current image:

$D_f = D_{f(k-2)} + (D_{f(k-1)} - D_{f(k-2)}) \dfrac{t_k - t_{k-2}}{t_{k-1} - t_{k-2}}$,  (2)
where $k$, $D_{f(k-1)}$, $D_{f(k-2)}$, and $t$ denote the index of the foreground depth image, the previous foreground depth image, the second previous foreground depth image, and the time, respectively. Once we compute $D_f$ based on Eq. (2), we compute the raw CoMCA data, $\gamma_{raw}$:

$\gamma_{raw} = \sum_{(i,j) \in S} D_f(i, j), \quad S = \{(i, j) \mid D_f(i, j) > 0\}$,  (3)
where $\gamma_{raw}$ represents the sum of the foreground depth mask's pixels $D_f(i, j)$, with $i$ and $j$ denoting the x and y coordinates of the image. If a pixel $D_f(i, j)$ is greater than 0, it differs from the corresponding pixel of the background image. Afterwards, having summed the depth values within the foreground region, we compute $\gamma$ as

$\gamma = \dfrac{\gamma_{raw}}{\zeta}$,  (4)

where $\zeta$ represents the number of pixels within the foreground region.
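For illustration, the CoMCA-D computation of Eqs. (1)–(4) can be sketched as follows in Python/NumPy. This is a minimal sketch under our own assumptions: depth frames arrive as 2D NumPy arrays (e.g., from an OpenNI wrapper), and the function and variable names are hypothetical rather than part of the EMVideo implementation.

```python
import numpy as np

def mean_background(background_frames):
    """Eq. (1): mean of the N background depth images captured before the subject enters."""
    return np.mean(np.stack(background_frames, axis=0), axis=0)

def foreground_mask(d_current, d_background, prev1, prev2, t_k, t_k1, t_k2):
    """Foreground depth mask Df = Dc - Db; if the sensor collapses to zero because the
    subject is too close, follow Eq. (2) and extrapolate linearly from the two previous
    foreground masks (prev1 = Df(k-1), prev2 = Df(k-2))."""
    d_f = d_current.astype(np.float64) - d_background
    if not np.any(d_f > 0):
        d_f = prev2 + (prev1 - prev2) * (t_k - t_k2) / (t_k1 - t_k2)
    return d_f

def comca_d(d_f):
    """Eqs. (3)-(4): average foreground depth as the CoMCA-D value gamma."""
    mask = d_f > 0                      # S = {(i, j) | Df(i, j) > 0}
    zeta = np.count_nonzero(mask)
    if zeta == 0:
        return 0.0
    gamma_raw = d_f[mask].sum()         # Eq. (3)
    return gamma_raw / zeta             # Eq. (4)
```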
CoMCA-DV: As the second type of CoMCA, CoMCA-DV employs both visible and depth information to enhance the accuracy when the light condition is consistent. Within the filming space, the depth information is not always the right solution for determining the proximity, since there is a limitation in acquiring distance information when a subject is very close to the camera. CoMCA-DV is therefore designed for filming conditions such as a soap opera, where accurate subject detection and a close view of the subject are crucial. Firstly, we compute $D_f$ with respect to the visible-spectrum image, as we do in CoMCA-D. Once we obtain $D_f$ in terms of the visible image, we compute the foreground mask, $F$, with the following equation:

$F(i, j) = \begin{cases} 0, & D_f(i, j) < \psi \\ 255, & \text{otherwise} \end{cases}$  (5)

The above equation marks a pixel as 0 if the selected pixel of $D_f(i, j)$ is less than $\psi$, a threshold value for determining the foreground mask, and marks it as 255 otherwise to indicate the existence of a foreground pixel in the foreground mask $F$. We set $\psi$ to a static value of fifty for classifying the background and foreground regions; any value below $\psi$ is considered background. The threshold of fifty was derived from an empirical study: if we set $\psi$ above fifty, many of the pixels within the foreground mask are eliminated, but if we set $\psi$ below fifty, we cannot obtain an accurate foreground mask due to the addition of unnecessary background information. Once we compute $F$, we proceed with the following equation:

$\gamma_{raw}(i, j) = \begin{cases} D_c(i, j), & F(i, j) = 255 \\ 0, & F(i, j) = 0 \end{cases}$  (6)

where $D_c$ is the current depth image, which is associated with the visible image, and $\gamma_{raw}$ indicates the depth data selected with respect to the current visible image. When $F(i, j)$ equals 255, i.e., a foreground mask pixel, CoMCA records the original depth value in $\gamma_{raw}(i, j)$. Theoretically, the 3D camera is calibrated so that the visible and depth images are registered with respect to the x and y axes [12]. Thus, based on the foreground area detected in the visible image, we place the corresponding depth information. After the mapping is done, we compute the CoMCA data $\gamma$ based on Eq. (4).
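A minimal sketch of this masking and mapping step (Eqs. (5)–(6)) in the same illustrative Python/NumPy style; `df_visible` stands for the visible-image foreground difference and `dc` for the registered current depth image, both hypothetical names of ours rather than identifiers from the EMVideo code.

```python
import numpy as np

PSI = 50  # empirically chosen threshold from the text

def comca_dv(df_visible, dc, psi=PSI):
    """CoMCA-DV sketch: foreground mask from the visible-image difference (Eq. 5),
    registered depth values copied inside the mask (Eq. 6), then averaged as in Eq. (4)."""
    mask = np.where(df_visible < psi, 0, 255).astype(np.uint8)        # Eq. (5)
    gamma_raw_img = np.where(mask == 255, dc, 0).astype(np.float64)   # Eq. (6)
    zeta = np.count_nonzero(mask == 255)
    if zeta == 0:
        return 0.0
    return gamma_raw_img.sum() / zeta                                 # Eq. (4)
```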
CoMCA-DS: CoMCA-DS is designed for video contents based on a live broadcast, such as a talk show or a news program. The sound is collected from the microphone of an Observer. While collecting the sound, we found that the accumulated sound value fluctuated too much, so we select the maximum value over a time interval (11 frames) to extract refined sound information. Furthermore, when we used the processed sound information directly from each Observer, the decibel levels of the Observers were unequal, since each Observer was equipped with a different sound card; thus, we needed to normalize the sound information of each Observer to calibrate its decibel level. Based on this empirical study, CoMCA-DS starts from the following equation:

$\alpha = \max_{i \in S} \{\beta[i]\}$,  (7)

where $\beta$ is the raw sound data, $\alpha$ is the maximum sound value collected from $\beta$, $S$ denotes the set of indices of the collected frames, and $i$ is the index of a sound sample collected from the microphone. Specifically, $S = \{n-10, n-9, \ldots, n-2, n-1, n\}$, where $n$ is the current time index of the sound collected from the microphone. Based on the above equation, we determine the proximity between a subject and an Observer. Additionally, we integrate both depth and sound information into one metric to enhance the accuracy:

$\gamma = \gamma_{raw} \cdot w_d + \alpha \cdot w_s$,  (8)

where $\gamma_{raw}$ denotes the CoMCA-D data after Eq. (4), $w_d$ is the weight (ratio) for the depth data, $\alpha$ is the sound data obtained from Eq. (7), $w_s$ is the weight (ratio) for the sound data, and $\gamma$ represents the CoMCA data. We set static values of 70% for $w_d$ and 30% for $w_s$ based on our empirical study. We use static values for these ratios because the depth value is the principal data for deciding which Observer is closer to the subject, while the sound information is used as a supportive metric. The details of this supportive metric are explained later in this paper.
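As an illustration of Eqs. (7)–(8), the following sketch keeps a sliding window of sound samples and blends the windowed maximum with the CoMCA-D value. The 11-frame window and the 70/30 weights follow the paper; the class name and the scalar `sound_scale` (standing in for the per-Observer decibel normalization, whose exact form the paper does not specify) are our assumptions.

```python
from collections import deque

WINDOW = 11    # frames over which the maximum sound value is taken (Eq. 7)
W_DEPTH = 0.7  # weight for the depth-based value
W_SOUND = 0.3  # weight for the sound-based value

class ComcaDS:
    def __init__(self, sound_scale=1.0):
        # sound_scale stands in for the per-Observer decibel normalization
        self.sound_scale = sound_scale
        self.samples = deque(maxlen=WINDOW)

    def update(self, gamma_d, raw_sound):
        """Blend the CoMCA-D value (Eq. 4) with the windowed sound maximum (Eqs. 7-8)."""
        self.samples.append(raw_sound * self.sound_scale)
        alpha = max(self.samples)                    # Eq. (7)
        return gamma_d * W_DEPTH + alpha * W_SOUND   # Eq. (8)
```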
Fig. 3. Experimental configurations: (a) 1st configuration, (b) 2nd configuration, (c) 3rd configuration (each showing the placement of Observers 1–3).
C. EMVideo Contents Generator
The Contents Generator resides within the Director. Eq. (9) represents the CoMCA data that the Observers transmit to the Director:

$D(t) = [\gamma_1(t), \gamma_2(t), \gamma_3(t), \ldots, \gamma_l(t)]$,  (9)

where $D$ denotes the set of all Observers' CoMCA data and $l$ signifies the number of Observers connected to the Director. Once all of the CoMCA data are collected, the Director selects the Observer whose CoMCA data has the maximum value:

$\gamma_{max}(t) = \max\{D(t)\}$,  (10)
where $\gamma_{max}(t)$ is the currently selected Observer's CoMCA data at time $t$. However, frequent switching between Observers' video frames may disturb the viewers, so we added a mechanism to prevent this effect:

$\gamma_s(t) = \begin{cases} \gamma_s(t_c), & t \le t_c + \tau \\ \gamma_{max}(t), & t > t_c + \tau \end{cases}$  (11)

where $t_c$ is the time at which the currently selected frame started being displayed, $t$ is the current time, and $\gamma_s(t)$ represents the CoMCA data corresponding to the video frame to be appended to the video content at time $t$. The hold time $\tau$ can be any value, letting the user choose the number of video frames to hold in order to prevent frequent video frame exchanges. If $t$ is less than or equal to $t_c + \tau$, the Director requests the video frame from the previously chosen Observer, which remains selected regardless of the current CoMCA data. Once $t$ exceeds $t_c + \tau$, the currently maximal CoMCA data $\gamma_{max}(t)$ is selected and the Director requests a video frame from the corresponding Observer.
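The selection rule of Eqs. (9)–(11) amounts to choosing the Observer with the maximum CoMCA value while holding the current choice for at least τ. The following snippet is only an illustrative sketch, not the EMVideo implementation; names such as `Director.select` and the `comca_data` dictionary are hypothetical.

```python
import time

class Director:
    def __init__(self, tau):
        self.tau = tau            # hold time that limits how often the view switches (Eq. 11)
        self.selected = None      # id of the currently selected Observer
        self.t_c = -float('inf')  # time when the current frame started being displayed

    def select(self, comca_data, now=None):
        """comca_data: dict mapping Observer id -> latest CoMCA value, i.e. D(t) of Eq. (9)."""
        now = time.time() if now is None else now
        if self.selected is None or now > self.t_c + self.tau:
            # Eqs. (10)-(11): switch to the Observer with the maximum CoMCA value
            self.selected = max(comca_data, key=comca_data.get)
            self.t_c = now
        return self.selected      # the Director then requests a frame from this Observer

# e.g., Director(tau=2.0).select({'observer1': 310.0, 'observer2': 520.0}) -> 'observer2'
```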
IV. PERFORMANCE EVALUATION
In this section, we describe how we evaluated the proposed framework with different ways of configuring and deploying Observers in a filming condition.
A. Experimental configuration
We considered two main aspects when designing the experimental configurations for deploying Observers: (i) the set-up of the Observers for the filming condition and (ii) the way the Observers view the subject. We believe these considerations are important, since the configuration of the cameras may vary with how the users want to express their story to the viewers.
The first configuration is depicted in Fig. 3(a). This experimental route was designed for a filming condition where multiple Observers share the view of the subject and the
subject is not close to the Observers. Moreover, this configuration was devised for video contents assumed to be generated for a concert, where the deployment of the cameras is oriented to capture various aspects of the performer (subject). Fig. 3(b) illustrates the second experimental configuration. It was designed for a filming condition suited to filming a soap opera, where a close view of the actor is crucial. The goal of this configuration was to verify whether CoMCA can determine the proximity between the subject and an Observer when they are close to each other. Lastly, Fig. 3(c) presents the third experimental configuration. Its goal was to set up a condition that resembles a newsroom; it was designed to test the effectiveness of EMVideo when the Observers' set-up and the person in sight are almost always the same. In summary, the Observer set-ups of the experimental configurations were designed for different purposes: we wanted to show that EMVideo is not limited to a certain filming condition and that each type of CoMCA has its own effects. Furthermore, we evaluated EMVideo with only one subject. Instead of placing one subject in front of each Observer, we let the subject move from one Observer to another to evaluate whether the appropriate video frame is selected for the video content.
B. Experimental setup
We constructed an empirical test-bed composed of three Observers and one Director. Each Observer is a laptop equipped with a Kinect and a microphone, and a desktop machine is used as the Director. We developed EMVideo on Ubuntu 12.04 with C++, OpenCV, and the OpenNI library, and we used the PortAudio API to acquire the sound data. Regarding the hardware specification, each Observer is equipped with an Intel i5-450M (2.4 GHz) and an Nvidia GeForce 310M, although the laptops were made by different manufacturers. The Director is equipped with an Intel i5-760 (2.8 GHz) and an Nvidia GeForce GTX 550 Ti.
C. Experimental results
CoMCA-D: As the first experiment, we evaluated CoMCA-D. Fig. 4(a) presents the empirical result for the configuration specified in Fig. 3(a). In the figure, the y-axis depicts the CoMCA data and the x-axis represents the sequence of video frames from the Observers. We can see that as the subject moves toward Observer 2, Observer 1 tends to have a higher depth value until the subject reaches Observer 2. The reason is as follows. When the subject was located at the mid-point between Observers 1 and 2, Observer 1's video frames were selected due to the Observers' field of view. As the subject approached closer to Observer 2, video frames from Observer 2 were edited into the video. Once the subject moved toward Observer 3, the proposed framework was able to select the appropriate video frames. Fig. 4(c) presents the result corresponding to the configuration in Fig. 3(b). We could verify that the result was not accurate due to the near-sight issue of the 3D camera: Observer 3 was accurate only when EMVideo started, Observer 1 never had a chance to provide a video frame, and Observer 2, which was located at the center of the experimental route (that the
Fig. 4. Empirical results of CoMCA-D and CoMCA-DV (CoMCA data versus frame number for Observers 1–3): (a) CoMCA-D, 1st configuration; (b) CoMCA-DV, 1st configuration; (c) CoMCA-D, 2nd configuration; (d) CoMCA-DV, 2nd configuration; (e) CoMCA-D, 3rd configuration; (f) CoMCA-DV, 3rd configuration.
subject took), had a higher probability of being selected by the Director than the other Observers. From this experiment, we learned that the Kinect has an issue capturing the subject when the subject is very close to the Observer: once the subject approaches the vicinity of an Observer, the Observer decides that there is no subject in sight and acquires zero for all depth values. To mitigate this behavior, we applied the linear interpolation among the previous two images in CoMCA-D, as in Eq. (2). However, there was another issue related to the subject's movement speed: if the subject moved slowly in the designated configuration, the linear interpolation improved the accuracy, but if the subject moved fast, the linear interpolation was not effective. Lastly, we conducted an experiment with the configuration in Fig. 3(c). In Fig. 4(e), we can see that CoMCA-D acquired accurate depth information as the subject moved from Observer 1 to Observer 3.
From all three experiments, we verified that CoMCA-D worked effectively with two experimental configurations, namely those in Figs. 3(a) and 3(c): CoMCA-D was effective when the Observers were located far from the subject. However, when the cameras were set up in the vicinity of the subject, it was still not accurate. One noticeable point is that solely employing depth information does not suffice for all kinds of contents generation; each approach has strengths and limitations, and users need to be aware of the effects of using a different type of CoMCA. This does not necessarily imply that we should always use external information other than depth. There are cases for which the depth information is sufficient to perceive proximity: (i) situations such as a music concert, where the Observers are placed far from each other and from the performer (subject); and (ii) when the light
intensity changes often, it is difficult to rely on the visible information to obtain the figure of the subject.
CoMCA-DV: In this experiment, we conducted an empirical study to evaluate the performance of CoMCA-DV. Fig. 4(b) presents the result corresponding to the experimental configuration in Fig. 3(a); the y-axis depicts the CoMCA data and the x-axis represents the sequence of video frames from the Observers. Even though the subject was detected by two or more Observers in some areas, we observed that the Observers could detect the subject in the filming space accurately enough to select the proper video frame. For the second experimental configuration, we see results different from those in Fig. 4(c). In Fig. 4(d), when the subject moved away from Observer 3, Observer 3's video frames were selected more frequently than the other Observers'. As the subject moved straight toward the center, Observer 2's video frames were selected. Finally, as the subject reached the end point, video frames from Observer 1 were chosen more frequently than those of the other Observers. In short, CoMCA-DV was able to detect the subject even when the subject was very close to an Observer. Lastly, Fig. 4(f) depicts the result of the third experimental configuration; the proper Observer's video frame was selected as the subject moved from Observer 1 to Observer 3. In conclusion, we could verify the effectiveness of CoMCA-DV: it outperforms CoMCA-D in determining the subject in any filming space without light variation, and it has the strong advantage of determining the proximity between the subject and the camera in any camera set-up. However, we learned that CoMCA-DV has a limitation when the light condition changes.
CoMCA-DS: Prior to evaluating CoMCA-DS, we conducted an empirical study using only the sound information. Fig. 5(a) depicts the result corresponding to the configuration in Fig. 3(a); the left y-axis depicts the computed depth value, the right y-axis denotes the sound amplitude, and the x-axis represents the sequence of video frames from Observer 3. From Fig. 5(a), we observed that the sound information alone was not sufficient to obtain correct distance information between the subject and the microphone. Therefore, instead of relying solely on the sound information, we combined both depth and sound information into one aggregate value, as specified in Eq. (8), to accurately determine the proximity between the subject and the Observer. Based on this empirical study with the sound, we can observe the effectiveness of CoMCA-DS in Fig. 5(b), where the y-axis depicts the CoMCA data and the x-axis represents the sequence of video frames from the Observers. The combination of depth and sound information reduced the error in determining proximity by better discriminating the integrated values coming from different Observers. Furthermore, as the subject generated more sound, we could verify that the appropriate Observer's video frame was selected for the multimedia content. In summary, for a live broadcast, understanding the origin of the sound is important: based on the variation of the sound heard from the viewers' perspective, it can help to select the proper video frame more effectively. Together with depth information to perceive the proximity between the subject and an Observer, knowing the origin of the sound can extend the
Fig. 5. Empirical results of CoMCA-DS: (a) Observer 3 (depth value and sound amplitude versus frame number); (b) all Observers (CoMCA data versus frame number).
application scope of EMVideo.
V. CONCLUSION
In this paper, we proposed EMVideo, a user collaboration framework for generating multimedia contents as a novel social networking application. The proposed framework generates a video by automatically selecting video frames from multiple Observers. We conducted various empirical analyses to evaluate the proposed framework; based on the results, we identified the advantages and disadvantages of each CoMCA and provided insights for each type. For future work, we will develop new types of CoMCA to diversify the range of multimedia contents that can be created. Furthermore, we would like to conduct a user survey during the process of creating a video content to analyze whether EMVideo helps to improve teamwork and whether users enjoy interacting with our proposed framework.
VI. ACKNOWLEDGMENT
This work was supported by the ICT R&D program of MSIP/IITP [14-000-04-001, Development of 5G Mobile Communication Technologies for Hyper-connected Smart Services].
REFERENCES
[1] A. Herrington, "Using a smartphone to create digital teaching episodes as resources in adult education," Faculty of Education - Papers, p. 78, 2009.
[2] KT, "4th olleh International Smartphone Film Festival," 2014. [Online]. Available: http://www.ollehfilmfestival.com/eng/main.php
[3] L.A. Times, "Movie shot on iPhone 4 by Park Chanwook to hit South Korean theaters," January 2011. [Online]. Available: http://latimesblogs.latimes.com/technology/2011/01/videoiphone-movie-to-hit-s-korea-theatres.html
[4] LG Electronics, "LG Electronics Optimus 3D." [Online]. Available: http://www.lgmobile.co.kr/event/optimus3d/page.jsp
[5] Google Inc., "Project Tango." [Online]. Available: https://www.google.com/atap/projecttango/
[6] M. Davis, "Active capture: Integrating human-computer interaction and computer vision/audition to automate media capture," in Proc. 2003 International Conference on Multimedia and Expo (ICME'03), vol. 2, IEEE, 2003, pp. II-185.
[7] E. Machnicki and L. A. Rowe, "Virtual director: Automating a webcast," in Electronic Imaging 2002, International Society for Optics and Photonics, 2001, pp. 208-225.
[8] A. Ramírez and M. Davis, "Active capture and folk computing," in Proc. ICME 2004, 2004.
[9] E. Hall, The Hidden Dimension: Man's Use of Space in Public and Private. Doubleday, New York, USA, 1966.
[10] G. Burrows, "Living gallery: Investigating dynamic display of artwork through proximity detection," 2005.
[11] N. Roussel, H. Evans, and H. Hansen, "Proximity as an interface for video communication," IEEE Multimedia, vol. 11, no. 3, pp. 12-16, 2004.
[12] H. Liu, M. Philipose, and M.-T. Sun, "Automatic objects segmentation with RGB-D cameras," Journal of Visual Communication and Image Representation, 2013.