Tangible and Body-Related Interaction Techniques for a Singing Voice Synthesis Installation

Jochen Feitsch, Marco Strobel, Stefan Meyer
University of Applied Sciences Düsseldorf
{jochen.feitsch, marco.strobel, stefan.meyer}@fh-duesseldorf.de

Christian Geiger
University of Applied Sciences Düsseldorf
Josef-Gockeln Str. 9, 40474 Düsseldorf, Germany
[email protected]

ABSTRACT

This paper presents an interactive media installation that aims to provide users with the experience of singing like an opera singer from the 19th century. We designed a set of tangible and body-related interaction and feedback techniques and developed a singing voice synthesizer system that is controlled by the user's mouth shapes and gestures. This musical interface allows users to perform an aria without actually singing. We adapted techniques from 3D body tracking, face recognition, singing voice synthesis, 3D rendering and tangible interaction and integrated them into an interactive musical interface.

Author Keywords

tangible musical interfaces, 3D character performance, singing voice synthesis, interactive media installation

ACM Classification Keywords

H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

General Terms

Human Factors; Design.

INTRODUCTION

Singing is a complex process usually described with a source-filter model. Acoustic output in humans and many nonhuman species is produced by movements of the vocal folds in the larynx, driven by lung pressure and muscle tension. While this process can be regarded as the source of singing, the propagation of sound through the oral and nasal cavities can be viewed as time-varying filtering. At the lip and nose openings the volume velocity flow is converted into radiated pressure waves, creating the singing voice [4]. Although singing has had an important impact on human evolution and cultural development, there is a strong tendency towards the consumption of externally generated music, and few people are able to really experience the positive feeling of their own successful singing voice production.

This is the motivation of the project presented in this paper: to provide users with the joy of singing. We do not only want to focus on the active user's experience but also entertain passive users who watch the system in action. We designed and implemented a media installation that aims to give active users a believable experience of being a singer performing an aria. We identified body gestures and mouth shapes as input, and synthesized singing voice and vibrational feedback as output, as essential parameters of an enjoyable installation. This includes the design and implementation of dedicated tangible and full-body interaction techniques in order to create a system that allows users to behave like a singer and get corresponding visual, acoustic and haptic feedback. Passive users who watch the performance should be entertained by selected requisites, visualization techniques and suitable content elements. This objective is a challenge, as it requires the orchestration of a complex network of various interconnected technologies into a successful media installation. A general challenge for this project and similar singing voice synthesis approaches is the real-time detection of the vowels and consonants produced by the user's mouth. A possible solution allowing fast calibration and efficient real-time operation is to restrict the recognition to easily trackable vowels. In this project we aim at providing users with an enjoyable experience of feeling like an opera singer performing an aria (consisting only of vowels). We assume that performing mouth gestures and body postures similar to a singer, while at the same time hearing and feeling the music as direct real-time feedback, provides a believable user experience.

In the remainder of this paper we briefly mention related projects that have influenced our work. Next, we provide an overview of the complete installation and proceed with a description of the 3D avatar creation process and the applied user tracking, including facial tracking and full body tracking. In this paper we focus on tangible interaction techniques, the audio-visual effects we applied and the accessories we have designed. The following section is dedicated to a brief explanation of the sound synthesis process and includes the data mapping and melody creation techniques we applied. We conclude with a short review of user feedback and open issues.

RELATED WORK

The synthesis of singing is an ambitious area of research with a long tradition, and the human singing voice is one of the most complex instruments to synthesize. Unfortunately, singing voice synthesis (SVS) has received much less scientific attention than speech synthesis. A good overview of singing voice synthesis is provided by X. Rodet in [15]. He examined the state of the art in SVS and pointed out relevant improvements in research and development. The paper describes different SVS architectures based on synthesizers and on the modification of existing voice recordings. Relevant techniques like waveform synthesizers, unit selection and modification techniques, performance rules and database construction are briefly described.

Systems that use a human voice as input are called singing-to-singing synthesizers. One such system is VocaListener2, which synthesizes a singing voice by taking a human voice as input, analyzing its pitch, dynamics, timbre shifts and phonemes, and creating a singing voice based on the timbre changes of the user's singing voice. In a recent publication the authors presented VocaWatcher, which generates realistic facial expressions of a human singer and controls a humanoid robot's face [10]. This is one of the first systems we identified that focuses on imitating a real singer based on both acoustic and visual expressions. Recently, Cheng and Huang published an advanced approach that combines real-time mouth tracking and 3D reconstruction of the mouth movement [3]. HandySinger provides hand puppets as an expressive and easy-to-use interface to synthesize different expressional strengths of a singing voice [16]. Experiments confirmed that it is very easy to gesture with a hand-puppet interface. Cano et al. described a system that allows morphing between the user's singing voice and the voice of a professional singer [2]. The system is used within a karaoke performance setting.

Several projects have studied the synthesis of sound with mouth shapes or gestures. De Silva et al. presented a face-tracking mouth controller [5]. The application example focused on a bioacoustic model of an avian syrinx that is simulated by a computer and controlled with this interface. The image-based recognition tracks the user's nostrils and mouth shape and maps this to a syrinx model to generate sounds. At NIME 2003, Lyons et al. presented a vision-based mouth interface that used facial action to control musical sound. A head-worn camera tracks mouth height, width and aspect ratio. These parameters are processed and used to control guitar effects, a keyboard and loop sequences. Closely related to our approach is the "Artificial Singing" project, a device that controls a singing synthesizer with mouth movements tracked by a web camera [12]. The user's mouth is tracked, and mouth parameters like width, height, opening, rounding, etc. are mapped to a synthesizer's parameters, e.g. pitch, loudness, vibrato, etc.

Few commercial SVS software products exist so far. Yamaha offers Vocaloid (http://www.vocaloid.com), a singing voice synthesizer that takes lyrics and melody as input and creates a singing voice. The software has had a large impact in Japan, and a number of songs have been produced with it. Vocaloid consists of an authoring tool and a set of voices with different attributes (range, timbre, characteristics).

A well-known early electronic musical instrument that can be controlled without physical contact is the Theremin. In its original construction it consists of two metal antennas that sense the distance of the performer's hands to the instrument, whereby one hand controls the frequency of a generated sound and the other hand manipulates the amplitude. Several instruments and installations were inspired by the Theremin, and digital versions exist as well. An evaluation of a set of 3D interfaces for a virtual Theremin can be found in [9]. We applied the idea of controlling pitch and volume with the user's hands in this project. Many computer-based musical interfaces apply RGB-D cameras like the Kinect or PrimeSense for controlling sound synthesis [13], and we also chose these devices for our purposes. While most projects focus on either body tracking or facial tracking, we combine both tracking methods. Users have the opportunity to control a variety of parameters in real time to deliver a better overall singing performance.

In [8] we described the first working prototype of this project and focused on the musical foundations of the singing voice synthesis. The presented prototype allowed the user to control a custom-created 3D avatar, but offered only a limited singing voice synthesis without any tangible interaction techniques and without haptic feedback for the performing user. It also featured only three vowels and used a proprietary face tracking approach with a significant error rate. In the work presented in this paper we designed and implemented a set of tangible and multimodal interaction techniques. We equipped a real gramophone with sensors to select the digital background music based on shellac discs and to start and stop the simulation by operating the gramophone. An advanced face tracking approach detects up to six vowels using an additional face or hand gesture. A number of body-related gestures modify the singing voice by altering pitch and volume. Vibrational and acoustic feedback support an engaging user experience of "singing without singing". We also provide visual feedback evoking an opera singer performance from the 1920s by using dedicated accessories and graphical effects.

SYSTEM OVERVIEW

The installation illustrated in figure 1 consists of a 3x3 video wall with 46" monitors, either the motion tracking system MyoMotion (www.noraxon.com) or a Microsoft Kinect for full-body skeleton tracking, and a PrimeSense Carmine for facial tracking. The user wears a tailcoat, white gloves and a top hat to enhance the impression of being a 1920s opera singer. If operated with the MyoMotion, the user wears five to nine portable sensors on top of this attire. The top hat includes bone conduction headphones to place the generated tenor's voice "inside" the user's head without blocking

Figure 1. System overview

surrounding sounds. This is important for hearing the background music of the sung track. Additionally, a bib is used as an insert to the tailcoat instead of a shirt to keep the preparation process as short as possible. The bib is equipped with several exciters and vibration modules to induce the vibration of the singing voice into the user's thorax. Using bone conduction and vibrational feedback, the illusion of singing is supported. The digital orchestral background music played during the performance is selected via an old gramophone and shellac discs. To connect the gramophone acoustically to the computer, an exciter is attached to it, using the gramophone as a resonating body for audio playback. In an initial step of the performance, the user can choose between several songs by placing a shellac disc (gramophone record) on the gramophone. RFID markers are attached to one side of the shellac discs, and the gramophone is equipped with a corresponding RFID reader that determines the chosen track. Furthermore, the gramophone's pickup arm and one of the gloves have simple contact switches. Putting the pickup arm on the shellac disc closes the switch and triggers the performance mode of the installation via an Arduino nano that is connected to the computer system over Bluetooth. The switch on the glove is closed by pressing thumb and index finger together. This is used to switch between two sets of vowels that cannot be distinguished by mouth shape alone (see subsection "Mapping").
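To make this hardware flow concrete, the following is a minimal sketch of how the host computer could consume the Arduino nano's Bluetooth serial events. The message format, the serial device path and the handler functions are our own illustrative assumptions; the paper does not specify the actual protocol.

```python
# Hypothetical host-side reader for the gramophone's Arduino nano (Bluetooth serial).
# Message names, the device path and the handlers are assumptions for illustration.
import serial  # pyserial

def select_track(tag_id: str):
    print("RFID tag detected, selecting track:", tag_id)

def start_performance_mode():
    print("Pickup arm lowered: starting performance mode")

def set_vowel_set(alternate: bool):
    print("Glove switch:", "alternate" if alternate else "base", "vowel set")

ser = serial.Serial("/dev/rfcomm0", 9600, timeout=1)
while True:
    line = ser.readline().decode("ascii", errors="ignore").strip()
    if line.startswith("RFID:"):          # shellac disc placed on the turntable
        select_track(line.split(":", 1)[1])
    elif line == "ARM:DOWN":              # pickup arm contact switch closed
        start_performance_mode()
    elif line.startswith("GLOVE:"):       # thumb/index contact switch
        set_vowel_set(alternate=line.endswith("CLOSED"))
```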

The processing is done on two network-connected computers. One operates the hardware, the facial tracking system and the synthesizer modules. The other runs the main application with the skeleton tracking system and the rendering. If the Kinect sensor is used for skeleton tracking, it is placed close to the video wall at a distance of two to three meters from the user, facing the user. The facial sensor is fixed hanging in the air in front of and above the user, looking at the user at an angle of about 20 degrees. This way, interference between the two depth sensors is minimized because each sensor covers only the relevant parts of the user. The computer running the main application processes all tracking data, uses it to animate a virtual opera singer in 3D controlled by the user and sends the relevant data to the synthesizer module via OSC (Open Sound Control). In a preprocessing step, the character's head can be designed and customized to look like the user or a modified version of her (user's choice). Initially, the user steps in front of the main computer, which renders the content on the video wall. She can now take a picture of herself to generate a 3D representation of her head that is used by the rendering application. Next, the user may choose to change several parameters to modify her facial features (see figure 3). After that she puts on the tailcoat and top hat, starts the performance mode by putting the gramophone's pickup arm on top of the selected record and steps onto the stage.
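As a rough illustration of the data flow between the two machines, the sketch below sends per-frame control values over OSC using the python-osc package. The OSC addresses, host, port and value ranges are assumptions; the installation's actual message layout is not documented in the paper.

```python
# Minimal OSC sender sketch (python-osc). Addresses, host and port are illustrative
# assumptions, not the installation's actual message layout.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("192.168.0.2", 9000)  # machine running the synthesizer modules

def send_frame(pitch_height: float, arm_stretch: float, vowel_probs: list):
    client.send_message("/singer/pitch", pitch_height)   # normalized arm height, approx. -1..1
    client.send_message("/singer/volume", arm_stretch)   # normalized arm stretch, 0..1
    client.send_message("/singer/vowels", vowel_probs)   # one probability per vowel

send_frame(0.4, 0.9, [0.05, 0.85, 0.05, 0.05])
```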

Figure 2. System illustration

By moving her arms and shaping vowels with her mouth, she can not only control the movements of the virtual tenor but also make him sing. The user's goal is to reproduce the original singing of the selected track as accurately as possible. With an increasing positive rating, the 3D avatar's face morphs from the user's own face into the face of a famous opera singer.

AVATARS AND THEIR CUSTOMIZATION

At the beginning of the performance the rendered 3D character should represent the acting user. To provide the avatar with a custom face, each user needs to create a digital clone using a face generation tool. We realized this by creating an interface to the FaceGen SDK (www.facegen.com) for the Unity3D engine, providing an easy-to-use interface for quick set-up and calibration. For this process the application starts in avatar customization mode. The user takes a two-dimensional photograph of her face using an HD camera. In a semi-automated calibration process the user places eleven face feature markers on the photograph to identify the facial structure. Once all markers are set, the face generation software calculates a 3D mesh representation of the user's face with appropriate texture and shape. This head is then loaded into our system and put on a predefined model of a tenor's body. After this process the user may further change her avatar's look by altering a number of parameters. The options include age (looking older or younger), gender (more male or female) or mixing in ethnic features (e.g. Afro-American, East Asian, South Asian or European). A caricature model can also be generated.

USER TRACKING

After the avatar generation step the user can initiate the performance mode, in which the user's face and body are tracked. As mentioned in the overview, several hardware combinations can be used depending on the preferred setup. For facial tracking either a PrimeSense Carmine or the Microsoft Kinect can be used without any change to the system, yet the PrimeSense Carmine 1.09 is the better choice due to its superior near-field capabilities. For full-body skeleton tracking either an additional Kinect-like sensor or the Noraxon MyoMotion system can be used. MyoMotion is a portable, wireless and expandable motion tracking system that provides three-dimensional orientation information from two to 36 IMU (Inertial Measurement Unit) sensors. Our initial setup uses five sensors to track the upper body: one for each upper arm, one for each forearm and one at the back to root the coordinate system to the user. This variant is significantly more accurate for joint rotations, yet lacks the capability of providing spatial positions, which might be needed for future interactions. In addition, the sensors need to be attached to the user's body.
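The role of the back sensor as the root of the coordinate system can be made concrete with a small sketch: each limb sensor's world orientation is expressed relative to the root sensor before it drives an avatar bone. This is our own illustration using SciPy, not the Noraxon API, and the quaternion values are arbitrary examples.

```python
# Sketch: expressing IMU orientations relative to the back ("root") sensor so the
# avatar follows the user regardless of facing direction. This is not the Noraxon
# MyoMotion API; quaternions are given scalar-last (x, y, z, w).
from scipy.spatial.transform import Rotation as R

def relative_rotation(root_quat, sensor_quat):
    """Rotation of a limb sensor expressed in the user's own (root sensor) frame."""
    root = R.from_quat(root_quat)
    sensor = R.from_quat(sensor_quat)
    return root.inv() * sensor

rel = relative_rotation([0, 0, 0, 1], [0, 0.3826834, 0, 0.9238795])  # ~45 deg about Y
print(rel.as_euler("xyz", degrees=True))
```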

Using an additional Kinect-like depth sensor for body tracking instead of the MyoMotion system causes interference problems with the sensor used for face tracking, as both use the same infrared pattern to capture depth data. One method to minimize interference is to shield the fields of view of both sensors from each other as far as possible by hanging the facial tracker from above, aiming only at the user's face, while the full-body tracker is placed further away, capturing the whole body. In addition we applied the "Shake'n'Sense" [1] approach, which introduces artificial motion blur through vibration and eliminates crosstalk. For this, a small offset-weight motor is attached to the depth sensor, allowing the interfering IR patterns to be separated.

Facial tracking and calibration

The facial motion tracking system used is an adaptation of faceshift (www.faceshift.com), a software originally designed for markerless facial motion capture and character face animation. Previous prototypes used a custom 2D camera face tracking approach via the Microsoft Kinect SDK 1.5+, but it was found to be too inaccurate and provided too little information to detect and analyze mouth shapes for singing synthesis. The current face tracking approach provides sufficiently accurate data based on a user profile that has to be created for each user in advance. This calibration process takes about five to ten minutes, including capturing the user's expressions and the profile calculation time (1.5 to 2 minutes). In this preproduction step a set of default facial expressions is captured to create a 3D model of the user's facial structure (see figure 4).

Figure 4. Face tracking training

Figure 3. Face scanning and customization process

After faceshift is calibrated, we apply an additional calibration step to further increase the recognition accuracy. The system feeds a neural network with data taken from the user calibration process. The neural network is used to map the tracking data provided by faceshift to the vowel the user is currently forming. For this, the user holds each desired mouth shape for several seconds while the data is recorded. This final calibration step can take a varying amount of time depending on how much data is supplied to the neural network. Tests and experience show that a few seconds per vowel are sufficient for high-accuracy recognition. There are four mouth shapes that have to be trained: "A", "E", "O" and closed mouth. The second vowel set is triggered by a hand gesture using the same mouth shapes (see figure 5).

Figure 5. Facial animation shots for vowel set

After the calibration step is completed the system proceeds to tracking mode. Tracking data includes head pose information, arm gestures, body position, facial blend shapes (also called coefficients), eye gaze and additionally specified virtual markers. The coefficients are fed to the trained neural network, and the probability values of each vowel are sent directly to the audio synthesizer via the OSC protocol (Open Sound Control), where they are processed for sound generation (see section "Sound Synthesis"). Furthermore, head pose, blend shape and eye gaze data are used to animate the avatar's facial features in real time. The head pose is used to rotate the neck bone and the eye gaze to rotate the corresponding eye, while the blend shapes change the look of the avatar's face using the morph capabilities of the FaceGen SDK. To make this work correctly with the customized avatar, the opera singer model was adjusted to work well with both the facial tracking data and the integrated fitting and morphing system. This was done by adjusting the basic face shape of the user avatar, creating blend shapes that mimic the blend shapes used by faceshift, and finally converting this into a model base that can easily be used by the face tracking. To ease calibration and trouble-shooting we also developed a small Android app that functions as a control panel for the face tracker. It allows sending network commands to start or stop tracking, to calibrate the neutral position and angle, and to debug each system module independently.
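The vowel recognition step can be sketched as follows: blend-shape coefficient vectors recorded during the per-vowel calibration train a small classifier whose per-class probabilities are then forwarded to the synthesizer. scikit-learn and the 51-dimensional coefficient vector are stand-in assumptions; the paper does not name the neural network implementation or the exact feature dimensionality.

```python
# Sketch: training a small neural network that maps faceshift blend-shape coefficient
# vectors to vowel probabilities ("A", "E", "O", closed). scikit-learn is an assumed
# stand-in; the coefficient dimensionality (51) is a placeholder.
import numpy as np
from sklearn.neural_network import MLPClassifier

VOWELS = ["A", "E", "O", "closed"]
rng = np.random.default_rng(0)

# Placeholder for the calibration recordings: a few seconds of coefficient frames
# captured while the user holds each mouth shape (here: 200 frames per shape).
X_train = rng.random((4 * 200, 51))
y_train = np.repeat(np.arange(4), 200)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# At runtime every tracked frame yields one coefficient vector; the resulting class
# probabilities are what gets forwarded to the synthesizer over OSC.
frame = rng.random((1, 51))
probs = clf.predict_proba(frame)[0]
print(dict(zip(VOWELS, probs.round(2))))
```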

User-specific character morphing

During the installation the user's performance modifies the visual representation of the rendered 3D scene in two ways. The system detects how well the user "sings" the aria based on the selected song, and a musical/gestural score is rendered on the screen. Additional feedback is provided by face morphing: the better the user performs according to the given score, the more the tenor morphs towards the appearance of a famous singer. Several famous singer characters can be chosen using additional accessories equipped with RFID tags. By default the user's character morphs into a 3D representation of Enrico Caruso. If the user wears an RFID-tagged white scarf, the morphing target is switched to a virtual Luciano Pavarotti. This is realized by mapping both models to the same base structure and interpolating between the mesh data of both models.

Figure 6. Morphing between user's head and Caruso
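Since both head models share the same base structure, the morphing can be sketched as a per-vertex linear blend, with the blend weight driven by the performance rating. This is a simplified illustration of the idea; the actual system performs the morph through the FaceGen model base.

```python
# Sketch: linear blend between two topologically identical head meshes
# (user head and idol head mapped to the same base structure).
import numpy as np

def morph(user_vertices: np.ndarray, idol_vertices: np.ndarray, t: float) -> np.ndarray:
    """t = 0 -> user's own head, t = 1 -> idol (e.g. Caruso); vertex arrays must match."""
    assert user_vertices.shape == idol_vertices.shape
    t = float(np.clip(t, 0.0, 1.0))
    return (1.0 - t) * user_vertices + t * idol_vertices

user = np.zeros((4, 3))              # toy 4-vertex "mesh"
idol = np.ones((4, 3))
print(morph(user, idol, 0.25))       # a quarter of the way towards the idol
```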

Full body tracking

The full body skeleton tracking uses the joint data provided by the body tracker and maps it onto the avatar, either with full body tracking enabled or with upper body tracking only. Smoothing is applied to the joint data for a better animation look. In addition, tracking data from the hand, elbow and shoulder joints is sent to the singing voice synthesizer to control the volume and pitch parameters. The arm stretch is calculated as dist(shoulder, hand) / (dist(shoulder, elbow) + dist(elbow, hand)), giving a normalized value. This is used to control the volume, which reaches its maximum when the arm is fully stretched out, i.e. when the value equals 1. The difference between the hand's and the shoulder's height (y-axis) is divided by the maximum arm length to obtain a normalized range of approximately -1 to 1, with -1 at the lowest possible point, 1 at the highest point and 0 at a neutral position (shoulder height). This value determines the pitch. These calculations are done separately for each arm, and the larger of the two values is used for system control.
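A small sketch of these arm-based control values, assuming 3D joint positions from the skeleton tracker (the coordinate values below are arbitrary examples):

```python
# Sketch of the arm-based control values (assumed input: 3D joint positions).
import numpy as np

def control_values(shoulder, elbow, hand):
    shoulder, elbow, hand = map(np.asarray, (shoulder, elbow, hand))
    arm_length = np.linalg.norm(elbow - shoulder) + np.linalg.norm(hand - elbow)
    stretch = np.linalg.norm(hand - shoulder) / arm_length   # ~1.0 when fully extended
    height = (hand[1] - shoulder[1]) / arm_length            # ~-1 (down) .. 1 (up)
    return stretch, height                                   # volume and pitch controls

# Both arms are evaluated and the larger value is used for control.
left = control_values([0, 1.4, 0], [0.25, 1.4, 0], [0.55, 1.4, 0])
right = control_values([0.35, 1.4, 0], [0.5, 1.2, 0], [0.55, 1.0, 0])
volume, pitch = max(left[0], right[0]), max(left[1], right[1])
print(round(volume, 2), round(pitch, 2))
```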

TANGIBLE INTERACTION AND ILLUSTRATION

The newly designed interaction techniques include the use of a gramophone, suitable clothes and appropriate acoustic, visual and haptic feedback for the performing user:

• We realized the sound control using an old gramophone with shellac discs (fig. 7). Currently we provide three suitable arias for selection (Ave Maria, Nessun Dorma and La donna è mobile). The user can select his favorite background music by placing the corresponding disc on the gramophone. The disc, with an attached RFID tag, is detected by a reader integrated in the gramophone. Placing the pickup arm above the shellac disc is detected by a simple contact switch at its tip. Turning the turntable with the handle is prototyped with a custom rotational sensor that detects the disc's movement and starts the tracking mode. The gramophone is also used as an audio playback device, supporting the 1920s sound of the singing voice.

Figure 7. Real gramophone to control the installation

• Real singing results in hearing one's own voice inside head and body, and in vibrational feedback caused by lung pressure and the movement of the vocal folds. This is absent in our installation because the user does not really sing. To provide a similar experience we developed a prototype vibrational vest by equipping a bib with vibrational actuators, exciters and a sound device that triggers the actuators based on the generated sounds (see fig. 8). The bib can easily be attached to the user and worn under the tailcoat. The prototypes depicted in fig. 8 are only initial versions but work as intended.

• The user can select the idol character she wants to morph into during the performance. We use a small set of RFID-equipped accessories that determine the idol character. By default the user morphs into a virtual Enrico Caruso; a scarf, for example, morphs the user into a virtual Pavarotti. Additional items and characters are still under development.

• The user wears a tailcoat, a white bib as a substitute for a shirt, a top hat and a set of white gloves. We also use a microphone with a 1920s look as an accessory, which also serves as a stand for the PrimeSense face tracking sensor.

• We select vowel sets using an additional interaction technique because it is difficult to detect all vowels based on facial depth tracking alone. Three additional vowels (Ä [æ:], I [i:], U [u:]) are synthesized if the system detects the mouth shape of "A" [a:], "E" [e:] or "O" [o:] while the user performs an additional gesture. We currently offer two alternative techniques, raising the eyebrows or using the simple glove switch. This is not a realistic singing behavior but works well after a short training period (a minimal mapping sketch follows this list).

• To further increase the 1920s performance experience we applied a shader that emulates the visual presentation of an old silent movie. This shader turns down the color level of the presentation, adds grain and random dirt spots and has a flickering movie-projector effect with varying light levels. It also has the added benefit of putting more emphasis on the user by lowering the contrast of the background. Similar to the aforementioned face morphing, the shader is dynamically applied based on the user's performance, e.g. the rendering starts in color mode and gradually changes to a 1920s film rendering.
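The vowel-set selection mentioned in the list above reduces to a small lookup, sketched here; the function and table names are illustrative.

```python
# Sketch: selecting between the two vowel sets. The detected base mouth shape
# ("A", "E", "O") is remapped when the additional gesture (eyebrow raise or
# glove switch) is active; targets follow the paper's description.
from typing import Optional

BASE_SET = {"A": "a:", "E": "e:", "O": "o:"}
ALT_SET  = {"A": "æ:", "E": "i:", "O": "u:"}   # Ä, I, U

def target_vowel(mouth_shape: str, gesture_active: bool) -> Optional[str]:
    """Returns the target vowel, or None for a closed mouth (no sound)."""
    if mouth_shape == "closed":
        return None
    table = ALT_SET if gesture_active else BASE_SET
    return table[mouth_shape]

print(target_vowel("E", gesture_active=False))   # 'e:'
print(target_vowel("E", gesture_active=True))    # 'i:'
```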



Figure 8. Vibrational vest and bone conduction speaker

SOUND SYNTHESIS

The basic method used for synthesizing the singing voice is formant synthesis. In short, formants are the acoustic resonances of the human vocal tract and are required to distinguish between vowels. A summary of basic speech synthesis techniques can be found in [11]. For a more in-depth understanding of the term "formant", please refer to [6] and [8]. The vowel formant frequencies used in this work are averaged values obtained from [14].

Synthesis

The audio processing and the necessary value calculations were programmed in Max/MSP in conjunction with JavaScript. The target pitch of the tenor's voice, determined by the user's arm positioning (see section "Mapping"), gives the base frequency upon which multiple sine waves are added to generate the fundamental signal. These sine waves are at integer multiples of the target pitch; for example, if the target pitch is 110 Hz, the added sine waves are at 220 Hz, 330 Hz, etc. In this synthesis, the upper limit was set to 12 kHz, as higher frequencies were identified as irrelevant or disturbing for the synthesis of the human voice. The fundamental signal is then modulated by several signals that are also based on the fundamental pitch but are slightly randomized to create a less artificial, non-predictable and somewhat "unclean" impression of the result. This modulation also adds a vibrato effect. The characteristic formants of the target vowel are shaped by three band-pass filters. While two formants are sufficient for the perception of a vowel, we found a third formant useful for achieving a more human sound. The synthesis additionally uses a fixed fourth formant at 2.7 kHz. Formants around 3 kHz are called the "singer's formant" and are only visible in the frequency spectrum of trained singers. This formant is essential for singers to be heard without further amplification when performing with an orchestra. After formant filtering, several additional filters are used to achieve a 19th-century gramophone-like sound. In a final step, vowel-dependent amplitude variations are reduced by normalizing and compressing the signal, and optionally a reverb effect can be added to simulate the acoustics of an opera house. A more detailed description is presented in [8].
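To illustrate the synthesis chain (harmonic source up to 12 kHz, then formant band-pass filtering including the fixed 2.7 kHz singer's formant), here is a rough offline sketch in Python with NumPy/SciPy. The installation itself runs this in Max/MSP in real time, and the formant frequencies below are approximate illustrative values for a male "a", not the exact table the authors derived from [14].

```python
# Offline sketch of the formant-synthesis idea; the real system runs in Max/MSP.
import numpy as np
from scipy.signal import butter, lfilter

SR = 44100

def harmonic_source(f0, duration, limit=12000):
    """Sum of harmonics of f0 up to the 12 kHz limit mentioned in the paper."""
    t = np.arange(int(SR * duration)) / SR
    signal = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, int(limit // f0) + 1))
    return signal / np.max(np.abs(signal))

def formant_filter(signal, formants, bandwidth=120):
    """Apply one band-pass per formant (in parallel) and mix the results."""
    out = np.zeros_like(signal)
    for f in formants:
        low = (f - bandwidth / 2) / (SR / 2)
        high = (f + bandwidth / 2) / (SR / 2)
        b, a = butter(2, [low, high], btype="bandpass")
        out += lfilter(b, a, signal)
    return out / np.max(np.abs(out))

source = harmonic_source(f0=110.0, duration=1.0)
voiced_a = formant_filter(source, formants=[700, 1100, 2500, 2700])  # 2700 Hz: "singer's formant"
```

The parallel band-pass arrangement is one common way to realize formant filtering; the modulation, vibrato, compression and reverb stages described above are omitted here for brevity.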

Mapping

The user controls the singing voice with his arms. The volume is determined by the user's arm stretch and is mapped directly to the system's gain control. The arm's height is used to determine the target pitch. Two different ways of controlling the tenor's pitch are offered: method one uses fixed pitch steps spread over the range of possible arm positions; method two uses the change of position to trigger a change in pitch.

For the first method, the received arm height is quantized into 25 steps, representing MIDI pitch 41 (ca. 87 Hz) to MIDI pitch 65 (ca. 349 Hz). The actually singable pitches are limited by the song's current key (which has to be predefined for the whole song), so these 25 steps are mapped to the nearest valid pitch. The program triggers events in accordance with the song's beat to change the currently used mapping table. These mapping tables are always based on the keynote "C" with a variety of scales: C major, C minor, C diminished and C augmented, optionally with an added seventh, major seventh, sixth or diminished fifth. In addition, the system distinguishes whether the song's measure is on beat 1 or 4, or on beat 2 or 3. In the first case, only pitch values from the current key's chord can be played; in the second case the whole scale can be selected. These limitations help the user to create a melody that always sounds more or less suitable to the currently chosen song's background music. To realize keynotes other than "C", the whole table is transposed. If one of the MIDI pitches exceeds 65, it is wrapped around to the beginning of the table (starting at MIDI pitch 41) and the table is re-sorted.

The other method of controlling pitch uses the change of arm height over time. This method was created to make the system easier and more fun to use for casual users. The simplest implementation uses only three commands: "no change", "change upwards" and "change downwards" (this classification is actually already used to filter out small value fluctuations in method one). If the melody's next note is above the current pitch, the user has to move his arms upwards; if the next note is below, he has to move his arms downwards. The system then automatically chooses the right note to play the correct melody. If the user makes a wrong move, the system either stays at the currently sung pitch (no change) or uses the above-mentioned table to determine the next higher or next lower pitch. This mode can be varied to make playing a bit harder again by using a finer gradation of commands, for example: "no change", "big change upwards", "small change upwards", "big change downwards", "small change downwards". Naturally, this could be extended to seven, nine or even more distinct levels, making the selection of the correct pitch harder but providing finer control for improvisation.
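A compact sketch of pitch method one (quantize the arm height to one of 25 MIDI steps, then snap to the current mapping table); the C-major tables below are illustrative only, since the real system transposes them to the song's key and switches tables on the beat.

```python
# Sketch of pitch method one: quantize the normalized arm height into 25 steps
# (MIDI 41..65) and snap to the nearest pitch allowed by the current table.
C_MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}   # pitch classes
C_MAJOR_CHORD = {0, 4, 7}

def arm_height_to_midi(height: float) -> int:
    """height in [-1, 1] -> one of 25 MIDI steps between 41 and 65."""
    step = round((height + 1.0) / 2.0 * 24)          # 0..24
    return 41 + max(0, min(24, step))

def snap_to_table(midi: int, allowed_pitch_classes: set) -> int:
    """Return the nearest MIDI note whose pitch class is in the table."""
    return min((m for m in range(41, 66) if m % 12 in allowed_pitch_classes),
               key=lambda m: abs(m - midi))

def target_pitch(height: float, beat: int) -> int:
    table = C_MAJOR_CHORD if beat in (1, 4) else C_MAJOR_SCALE
    return snap_to_table(arm_height_to_midi(height), table)

print(target_pitch(0.0, beat=1))   # mid arm height on beat 1 or 4 -> nearest chord tone
print(target_pitch(0.0, beat=2))   # on beat 2 or 3 the whole scale is available
```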

USER EXPERIENCE AND CONCLUSION

We successfully presented parts of the system at scientific conferences. At "Mensch und Computer", a national HCI conference in Germany [7], our installation won the best demonstration award. At ACE 2013 in the Netherlands we received a bronze award in the best demo competition [8]. Although we did not present the advanced interaction techniques described in this paper, users of our installation were impressed by the experience they could gain from it. We assume that the results presented in this paper will further enhance this. The described prototype is working with all components described in this paper, but current limitations concerning the long set-up and calibration time per user and the fault tolerance of some system parts need to be resolved before we conduct larger user tests. Selected tests with the new techniques have been positive, and visitors to our lab had fun using or watching the system. We are currently adjusting and fine-tuning the system to provide a more robust and better user experience. Moreover, some items like the glove and the vest have to be improved to provide a robust demonstration for a large audience.

Figure 9. Demonstration at ACE 2013, see also [8]

REFERENCES

1. D. Butler, S. Izadi, O. Hilliges, D. Molyneaux, S. Hodges, and D. Kim. Shake'n'Sense: Reducing interference for overlapping structured light depth cameras. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, pages 1933-1936, New York, NY, USA, 2012. ACM.
2. P. Cano, A. Loscos, J. Bonada, M. D. Boer, and X. Serra. Voice morphing system for impersonating in karaoke applications. In Proceedings of the ICMC, 2000.
3. J. Cheng and P. Huang. Real-time mouth tracking and 3D reconstruction. In Image and Signal Processing (CISP), 2010 3rd International Congress on, volume 4, pages 1524-1528, 2010.
4. N. D'Alessandro, C. d'Alessandro, S. Le Beux, and B. Doval. Real-time CALM synthesizer: New approaches in hands-controlled voice synthesis. In NIME, pages 266-271. IRCAM, 2006.
5. G. C. de Silva, T. Smyth, and M. J. Lyons. A novel face-tracking mouth controller and its application to interacting with bioacoustic models. In Y. Nagashima and M. J. Lyons, editors, NIME, pages 169-172. Shizuoka University of Art and Culture, 2004.
6. G. Fant. Acoustic Theory of Speech Production. Mouton De Gruyter, 1960.
7. J. Feitsch, M. Strobel, and C. Geiger. Caruso - Singen wie ein Tenor. In Mensch & Computer Workshopband, pages 531-534, 2013.
8. J. Feitsch, M. Strobel, and C. Geiger. Singing like a tenor without a real voice. In Advances in Computer Entertainment, pages 258-269. Springer, 2013.
9. C. Geiger, H. Reckter, D. Paschke, F. Schulz, and C. Poepel. Towards participatory design and evaluation of Theremin-based musical interfaces. In Proc. of the Int. Conference on New Interfaces for Musical Expression, pages 303-306, 2008.
10. M. Goto, T. Nakano, S. Kajita, Y. Matsusaka, S. Nakaoka, and K. Yokoi. VocaListener and VocaWatcher: Imitating a human singer by using signal processing. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 5393-5396, 2012.
11. R. Linggard. Electronic Synthesis of Speech. Cambridge University Press, 1985.
12. M. J. Lyons, M. Haehnel, and N. Tetsutani. Designing, playing, and performing with a vision-based mouth interface. In Proceedings of the 2003 Conference on New Interfaces for Musical Expression, NIME '03, pages 116-121, Singapore, 2003. National University of Singapore.
13. G. Odowichuk, S. Trail, P. Driessen, W. Nie, and W. Page. Sensor fusion: Towards a fully expressive 3D music control interface. In Communications, Computers and Signal Processing (PacRim), 2011 IEEE Pacific Rim Conference on, pages 836-841, 2011.
14. G. E. Peterson and H. L. Barney. Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2):175-184, 1952.
15. X. Rodet. Synthesis and processing of the singing voice. In Proc. 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), 2002.
16. T. Yonezawa, N. Suzuki, K. Mase, and K. Kogure. HandySinger: Expressive singing voice morphing using personified hand-puppet interface. In Proc. of the Int. Conference on New Interfaces for Musical Expression, NIME '05, pages 121-126, Singapore, 2005.
