
19th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

Development of Selectable Viewpoint and Listening Point System for Musical Performance

PACS: 43.60.Dh

Kenta Niwa¹, Takanori Nishino², Kazuya Takeda¹
¹ Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan; [email protected], [email protected]
² Center for Information Media Studies, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Japan; [email protected]

Abstract

This paper describes the development of a new audiovisual display system with which users can view images and listen to sounds from arbitrary positions. In this system, the ray-space method is used for the selectable viewpoint, and a signal separation method based on frequency-domain independent component analysis is used for the selectable listening point. For the selectable listening point, the individual instrument parts are extracted from signals in which several instruments are mixed, and users can listen from any position by synthesizing signals generated by convolving the separated signals with head-related transfer functions. We use actual musical performances instead of signals reproduced from loudspeakers; in this case, separation performance is influenced by room reverberation, the directivity of the musical instruments, and so on. Moreover, evaluating separation performance is difficult because reference signals cannot be obtained. We therefore propose a microphone pair selection-type method and a novel evaluation measure of separation performance. We collected images and sound signals in a real environment and evaluated the separation performance. The results show that the proposed methods are effective for actual musical performances.

1. Introduction

In broadcasting and movies, we can only view images and listen to sounds from fixed points. If the viewpoint and listening point could be freely selected, the user's degree of freedom would be greatly improved. Generating images from an arbitrary viewpoint has been attempted in broadcasting and movies [1, 2], but generating the sound of an arbitrary listening point has not. This paper describes the development of a selectable listening point system using blind source separation and head-related transfer functions (HRTFs). If a musical ensemble can be separated into its individual instruments, each instrumental signal can be relocated by using HRTFs. We propose a new use for blind source separation: generating arbitrary listening point audio (Fig. 1).

Blind signal separation based on frequency-domain independent component analysis is used in several applications, including speech enhancement and recognition. Many experiments have been conducted with speech signals, but few studies on musical signals have used actual instruments. Previous experiments on separation performance were often conducted with loudspeakers; the advantages were that the directivity of the sound source did not fluctuate and reference signals were available. However, the players of musical instruments sway with the tune, and for some instruments the directivity of the sound source changes with the musical scale. In such cases, separation performance cannot be expected to resemble the loudspeaker case. Moreover, we previously examined the source separation performance of earlier methods [3, 4] using two loudspeakers with different directivities [5]. As a result, the source separation performance was influenced by the directivity of the sound generator in a real environment.

Fig. 1: Selectable listening point system. (Panels: recording the musical instrument performances; extracting the individual performances with frequency-domain ICA, using short-time DFTs X1(f,t), X2(f,t) and a separation filter W; relocating the sound images by synthesis after convolving the separated signals with HRTFs.)

The degradation was caused by the length of the early decay time, and the source separation performance for loudspeakers was not equal to that for actual sound sources. It is therefore important to examine separation performance using actual sound sources. In addition, no evaluation measure of source separation performance has been established for this case, because reference signals cannot be obtained. We propose a microphone pair selection-type method for actual sound sources and an evaluation measure of separation performance. Images and sounds for free viewpoint and listening point content were recorded in a real environment and evaluated with our methods.

2. Microphone Pair Selection Method

We propose a microphone pair selection-type method that is robust for actual sound sources. The observed signals are obtained by a microphone array composed of many microphones at various intervals, so that the microphones outnumber the sound sources. Our method is based on frequency-domain independent component analysis; a permutation-solving method based on direction-of-arrival (DOA) estimation and a projection back method for solving the scaling problem are applied to every microphone set, and the microphone set that gives the best performance is selected.

Figure 2 shows the proposed algorithm. First, since the microphones outnumber the sound sources, sets are formed that each contain as many microphones as there are sound sources. A separation filter W is estimated for every microphone set. Since two sound sources were used in our experiments, we call our scheme the microphone pair selection method. If a microphone set can be selected that gives accurate DOA results, the separation is robust against the directivity of the sound sources and the environment. DOA is estimated with each microphone set while satisfying the spatial sampling theorem for every frequency band. The permutation problem is then solved by a method based on DOA [6] that is empirically effective. After solving the permutation problem, the projection back method [7] shown in Equation (1) is applied; this operation includes the calculation of an inverse matrix.
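As an illustration of the DOA-based permutation alignment for a two-microphone set, the following Python sketch estimates, for each frequency bin, the arrival directions implied by an estimated 2×2 separation filter via the mixing matrix A = W⁻¹, and reorders the output channels so that the source ordering is consistent across bins. The arccos-based formula, the function names (`doa_from_mixing_matrix`, `align_permutations`), and the microphone spacing `d` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

C = 340.0  # speed of sound [m/s], assumed value

def doa_from_mixing_matrix(W_f, freq, d, c=C):
    """Estimate source directions for one frequency bin from a 2x2
    separation filter W_f, by interpreting each column of A = W^-1 as a
    plane-wave steering vector over a microphone spacing d.
    Returns angles in degrees measured from the array axis (illustrative)."""
    A = np.linalg.inv(W_f)               # estimated mixing matrix
    doas = []
    for i in range(A.shape[1]):          # one column per source
        phase = np.angle(A[1, i] / A[0, i])
        # spatial sampling theorem: unambiguous only if d < c / (2 * freq)
        arg = np.clip(phase * c / (2.0 * np.pi * freq * d), -1.0, 1.0)
        doas.append(np.degrees(np.arccos(arg)))
    return np.array(doas)

def align_permutations(W, freqs, d):
    """Reorder the rows of W[k] in every frequency bin so that the sources
    are sorted by their estimated DOA (a simple stand-in for the DOA-based
    permutation solving of [6])."""
    aligned = W.copy()
    doa_all = []
    for k, f in enumerate(freqs):
        if f <= 0.0 or d >= C / (2.0 * f):
            doa_all.append(None)         # bin violates the sampling theorem
            continue
        doas = doa_from_mixing_matrix(W[k], f, d)
        order = np.argsort(doas)
        aligned[k] = W[k][order, :]
        doa_all.append(doas[order])
    return aligned, doa_all
```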

Fig. 2: Algorithm for obtaining the separation filter with the microphone pair selection-type method. (Conventional method: learn a separation filter from more signals than sound sources, solve the permutation problem by DOA estimation and the scaling problem by projection back, and output the separation filter W. Proposed method: learn a separation filter for each set of as many signals as sound sources, solve permutation and scaling in the same way, then select the microphone pair whose DOA estimate is accurate and whose condition number is smaller than half of the candidates, and output its separation filter W.)

$W_{\mathrm{PB}} = \mathrm{diag}\{W^{-1}\}\, W$ .  (Equation 1)

If the condition number in Equation (2) is large, the separation filter given by the projection back method is unstable:

$\mathrm{cond}(W_{\mathrm{PB}}) = \kappa_{\max} / \kappa_{\min}$ ,  (Equation 2)

where $\kappa_{\max}$ and $\kappa_{\min}$ are the maximum and minimum singular values of $W_{\mathrm{PB}}$. Finally, a suitable microphone set is selected by two criteria: 1) the Euclidean distance between the set's DOA result and the average over all sets is small; 2) the condition number is smaller than that of half of the candidates.
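A minimal sketch of the per-bin projection back and the pair selection step, assuming the per-pair separation filters, DOA estimates, and condition numbers are already available; the function names and the exact way the two criteria are combined (a median threshold for "half of the candidates") are our assumptions.

```python
import numpy as np

def projection_back(W_f):
    """Equation (1): W_PB = diag{W^-1} W for one frequency bin."""
    return np.diag(np.diag(np.linalg.inv(W_f))) @ W_f

def condition_number(W_f):
    """Equation (2): ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(W_f, compute_uv=False)
    return s[0] / s[-1]

def select_microphone_pair(doa_per_pair, cond_per_pair):
    """Select the pair whose DOA estimate is closest to the average over all
    pairs (criterion 1), restricted to pairs whose condition number lies in
    the smaller half of the candidates (criterion 2)."""
    doa = np.asarray(doa_per_pair, dtype=float)     # shape: (pairs, sources)
    cond = np.asarray(cond_per_pair, dtype=float)   # shape: (pairs,)
    mean_doa = doa.mean(axis=0)
    dist = np.linalg.norm(doa - mean_doa, axis=1)   # Euclidean distance per pair
    ok = cond <= np.median(cond)                    # "smaller than half of the candidates"
    candidates = np.where(ok)[0]
    if candidates.size == 0:                        # fall back if no pair qualifies
        candidates = np.arange(len(cond))
    return candidates[np.argmin(dist[candidates])]
```

For example, with three candidate pairs, `select_microphone_pair([[40, 120], [42, 118], [70, 95]], [3.1, 2.4, 19.0])` returns `1`, the pair whose DOA estimate is closest to the consensus among the well-conditioned candidates.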

3. Measurement

3.1 Experiments for Contents of the Selectable Viewpoint and Listening Point System

A multi-dimensional multipoint measurement system [8], capable of collecting more than 100 image and sound points in a synchronized manner, was used for collecting the musical performances (Fig. 3). One node has one camera and two or four microphones; in our measurements, 16 nodes were used. Table 1 shows the specifications of one node, and Figure 4 shows the recording conditions. The background noise level was 37.1 dB(A), and the room temperature was 21.9 °C. To collect the images and the sound simultaneously, a microphone array consisting of seven microphones (SONY ECM-77B) was set up on the floor in front of the players, who were surrounded by the cameras; the distance between the players was determined by the radius of the camera array and the maximum angle of view of the cameras, because both players had to be captured by all cameras. To obtain better separation performance and better sound localization, the distance between the players should be as large as possible. Table 2 shows the recording programs, each performed by two wind instruments.

Fig. 3: Multi-dimensional multipoint measurement system [8]. (Nodes #001 to #NNN are connected to a system control PC via a LAN hub, with synchronization control over RS-232C; each node outputs sound and image signals.)

Table 1. Specifications of the multi-dimensional multipoint measurement system
  Synchronization             < 1 μs
  Frame rate                  29.4118 fps
  Analog input channels       4 ch (max)
  A/D conversion              16 bit
  Sampling frequency          48 kHz

Table 2. Recording programs
  Program                          Location   Instrument
  Down in the valley (American)    src1       Clarinet
                                   src2       Flute
  The Entertainer (Joplin)         src1       Bassoon
                                   src2       Clarinet

Fig. 4: Arrangement of equipment. (The two players, src1 and src2, are surrounded by the 16 cameras; the seven-microphone array, with element spacings of 1, 2, 4, 8, and 16 cm, is placed on the floor in front of the players.)

4. Experiments

4.1 Evaluation Method of Signal Separation Performance for Actual Sound Sources

In previous studies, examining separation performance was easy because reference signals existed for evaluation. However, for a musical performance played by multiple instruments, the individual signals cannot be collected as reference signals. Our evaluation method assumes that a separated signal includes components of the other sources; if this degree of mixing can be calculated, the separation performance can be evaluated. We therefore propose a method for evaluating pairwise separation performance without reference signals. Signal x̂ is obtained by adding the separated signal under evaluation, y_A, and the other separated signal, y_B. The evaluated signal y_ev is defined as

$y_{\mathrm{ev}} = y_A + 10^{-\alpha/20}\, y_B$ ,  (Equation 3)

where α is a weight coefficient. As α increases, y_ev approaches the separated signal y_A. The degree to which y_B is included in y_A can therefore be calculated by measuring the dissimilarity between x̂ and y_ev as α is increased. Dissimilarity is evaluated by the cepstrum distance. If a small cepstrum distance is obtained even for a large weight coefficient, the separation performance is low; conversely, for high separation performance the cepstrum distance should grow large as the weight coefficient increases. The cepstrum distance is given by

$\mathrm{CD} = \frac{1}{M}\sum_{m=1}^{M}\sum_{k=0}^{D-1}\left[\, c_{y_{\mathrm{ev}},m}(k) - c'_{\hat{x},m}(k)\,\right]^2$ ,  (Equation 4)

where $c_{y_{\mathrm{ev}},m}(k)$ is the k-th cepstrum coefficient of the evaluated signal y_ev in the m-th frame, $c'_{\hat{x},m}(k)$ is the k-th cepstrum coefficient of signal x̂ in the m-th frame, and M is the number of frames. The frame length was decided from the reverberation time of impulse responses measured with a dodecahedron loudspeaker. Since the early decay time (computed from the decay from 0 to –10 dB) was 249 ms and the reverberation time (computed from the decay from –5 to –35 dB) was 423 ms, the frame length D was set to 32,768 points for the 48 kHz sampled data.
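A minimal Python sketch of this evaluation, assuming single-channel separated signals sampled at 48 kHz; the real-cepstrum computation (inverse DFT of the log magnitude spectrum) and the framing details are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

FRAME_LEN = 32768  # D: frame length in samples at 48 kHz, from the measured reverberation

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    return np.fft.irfft(np.log(np.abs(spectrum) + eps), n=len(frame))

def cepstrum_distance(y_a, y_b, alpha, frame_len=FRAME_LEN):
    """Equations (3) and (4): compare x_hat = y_a + y_b against
    y_ev = y_a + 10^(-alpha/20) * y_b, frame by frame."""
    n = min(len(y_a), len(y_b))
    x_hat = y_a[:n] + y_b[:n]
    y_ev = y_a[:n] + 10.0 ** (-alpha / 20.0) * y_b[:n]
    n_frames = n // frame_len
    cd = 0.0
    for m in range(n_frames):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        c_ev = real_cepstrum(y_ev[sl])
        c_x = real_cepstrum(x_hat[sl])
        cd += np.sum((c_ev - c_x) ** 2)
    return cd / max(n_frames, 1)

# Example: sweep the weight coefficient as in Fig. 5
# for alpha in (10, 20, 30, 40, 50):
#     print(alpha, cepstrum_distance(separated_a, separated_b, alpha))
```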

Fig. 5: Results of separation performance. (Cepstrum distance versus weight coefficient α, from 10 to 50; two panels, flute/clarinet and clarinet/bassoon, each comparing the conventional and proposed methods.)

4.2 Results

Figure 5 shows the separation performance results for each music program. The cepstrum distance grew as the weight coefficient increased and saturated at large weight coefficients. The curve obtained by the proposed method exceeded the curve obtained by the conventional method for the flute and the bassoon; however, the conventional method scored a higher cepstrum distance for the clarinet. Since the directivity of the clarinet is sharper than that of the flute and the bassoon, its direct sound was easily observed by the microphone array; this indicates that the proposed method is effective for sources whose direct sound is hard to observe and for nearly omnidirectional sources. Because the proposed scheme is a relative evaluation method, it is difficult to relate differences in cepstrum distance across different directivities directly to separation performance. Therefore, since experiments are not easy to compare in this way, the evaluation results must also be examined with subjective tests.

5. Development of the Selectable Listening Point System

The selectable listening point system is realized by convolving the separated signals with HRTFs for arbitrary positions and synthesizing the results, generating a sense of movement by rearranging each sound image. The HRTF, which is an acoustic transfer function between a sound source and the entrance of the ear canal, is often used in spatial audio systems. We used an HRTF database [9] that was measured with a head and torso simulator (B&K 4128) in a soundproof chamber. HRTFs for directions not covered by this database were obtained by a linear interpolation method in the time domain [10]. By reproducing the sound signals with these HRTFs, users can freely experience a sense of movement. Demonstrations are available at our web site: http://www.sp.m.is.nagoya-u.ac.jp/~niwa/.
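A minimal sketch of this rendering step, assuming the separated source signals and left/right head-related impulse responses (HRIRs) are available as NumPy arrays; the database layout, the interpolation weighting, and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def interpolate_hrir(h_lo, h_hi, az_lo, az_hi, az):
    """Time-domain linear interpolation between two measured HRIRs that
    bracket the requested azimuth (in the spirit of [10])."""
    w = (az - az_lo) / float(az_hi - az_lo)
    return (1.0 - w) * h_lo + w * h_hi

def render_listening_point(sources, hrirs):
    """Binaural synthesis: convolve each separated source with the left/right
    HRIR for its direction relative to the chosen listening point, then sum.

    sources : list of 1-D arrays (separated instrument signals)
    hrirs   : list of (h_left, h_right) pairs, one per source
    """
    length = max(len(s) + len(h[0]) - 1 for s, h in zip(sources, hrirs))
    left = np.zeros(length)
    right = np.zeros(length)
    for s, (h_l, h_r) in zip(sources, hrirs):
        yl = fftconvolve(s, h_l)
        yr = fftconvolve(s, h_r)
        left[:len(yl)] += yl
        right[:len(yr)] += yr
    return np.stack([left, right], axis=1)   # (samples, 2) stereo signal
```

Moving the listening point then amounts to recomputing each source's direction, selecting or interpolating the corresponding HRIRs, and re-rendering.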

6. Conclusion

We investigated the development of a selectable viewpoint and listening point system and proposed a microphone pair selection method and an evaluation measure of separation performance that requires no reference signals. The proposed method was more robust than the conventional method against the directivity of the sound sources. We conducted synchronized recordings of the images and sounds of musical performances and built system demonstrations through which users can view images and listen to sounds at arbitrary positions. Demonstrations of free viewpoint and listening point content can be downloaded from our web site. Future work includes investigating the utility of the proposed evaluation measure by comparison with subjective tests, developing a real-time selectable viewpoint and listening point system, and producing other kinds of content, such as sports.

Acknowledgments The authors thank Norishige Fukushima of Nagoya University for his help with the experiments and the generation of arbitrary viewpoint images.

References:
[1] T. Fujii and M. Tanimoto, "Free-viewpoint TV system based on the ray-space representation," SPIE ITCom, vol. 4864-22, pp. 175-189, Aug. 2002.
[2] N. Fukushima et al., "Free viewpoint image generation using multi-pass dynamic programming," Proc. SPIE Stereoscopic Displays and Virtual Reality Systems XIV, vol. 6490, pp. 460-470, Feb. 2007.
[3] H. Sawada et al., "A robust and precise method for solving the permutation problem of frequency-domain blind signal separation on speech signals," Proc. International Symposium on ICA, pp. 505-510, 2003.
[4] T. Takatani et al., "High-fidelity blind separation of acoustic signals using SIMO-model-based independent component analysis," IEICE, vol. E87-A, no. 8, pp. 2063-2072, 2004.
[5] K. Niwa et al., "Blind signal separation of musical signals applied to selectable-listening-point audio reconstruction," 4th Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, 3pSP20, 2006.
[6] S. Kurita et al., "Evaluation of blind signal separation method using directivity pattern under reverberant conditions," ICASSP, pp. 3140-3143, 2000.
[7] N. Murata et al., "An on-line algorithm for blind source separation on speech signals," NOLTA, pp. 923-926, 1998.
[8] T. Fujii et al., "Multipoint measuring system for video and sound – 100-camera and microphone system," IEEE 2006 International Conference on Multimedia & Expo (ICME 2006), 2006.
[9] http://www.sp.m.is.nagoya-u.ac.jp/HRTF/.
[10] T. Nishino et al., "Interpolation of the head related transfer function on the horizontal plane," J. Acoust. Soc. Jpn., vol. 55, no. 2, pp. 91-99, Feb. 1999 (in Japanese).
