Development of virtual sound imaging system using ... - IEEE Xplore

IEEE Transactions on Consumer Electronics, Vol. 50, No. 3, AUGUST 2004. Contributed Paper. Manuscript received May 18, 2004. 0098 3063/04/$20.00 ...

IEEE Transactions on Consumer Electronics, Vol. 50, No. 3, AUGUST 2004

Development of Virtual Sound Imaging System using Triple Elevated Speakers

Jun Yang, Senior Member, IEEE, Woon-Seng Gan, Senior Member, IEEE, and See-Ee Tan

Abstract: An enhanced virtual sound imaging system using triple elevated-speaker projection is proposed. Based on a robustness analysis, an efficient auditory image correction scheme is developed to improve system performance. This is achieved by a specially designed inverse filter together with HRTF filtering. The sound imaging algorithm is implemented on a TMS320C6201 EVM DSP board, and subjective testing is carried out to demonstrate the virtual sound effects.¹

Index Terms: Virtual sound imaging, elevated-speaker projection, inverse filter, subjective testing.

I. INTRODUCTION

In home entertainment and computing systems, the number of loudspeakers has been increasing with the demand for better sound effects. Dolby Digital supports up to 5.1 channels and is used in Digital Versatile Disc (DVD) and digital television (DTV) systems. In computing environments, users are no longer satisfied with traditional stereo speakers, and the four-point surround system is gaining popularity. However, the use of multiple audio channels poses a problem for loudspeaker placement. If the loudspeakers are not correctly placed, the sound image will not be faithfully reproduced, because the audio system can be occluded by furniture and decorations. This situation is common when no dedicated listening room is available, as in a home environment. Even if an ideal loudspeaker position can be found, it might not be acceptable if it spoils the appearance of the room. Moreover, loudspeakers placed at ear level make the wiring difficult or costly to conceal. A possible solution is to mount the loudspeakers near the ceiling. This arrangement is less prone to occlusion by furniture and allows the wiring to be concealed in the ceiling or cornices at little cost, but it results in a mismatch between the picture content and the sound scene because the speakers are elevated. To compensate for this distortion, an elevated-speaker projection technique has been proposed to reproduce the signal as if it were played back from ear-level loudspeakers [1]-[2]. One image correction filter and two biquad filters are used for virtual sound imaging at lower and higher frequencies, respectively. In particular, the use of only the spectral notch N1 greatly simplifies the implementation of elevation control.

1 The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, 639798 (e-mail: [email protected]).


For practical implementation, the elevated sound imaging system should be robust, i.e., tolerant to small perturbations such as variation in individuals’ head-related transfer functions (HRTFs), deviation from the nominal listening position, and head movement. Hence, it is important to develop an efficient auditory image correction scheme to improve sound reproduction performance. The preprocessing technique adopted to steer down the sound images of elevated speakers is, in essence, crosstalk cancellation. Several methods have been proposed to address crosstalk cancellation so that appropriate binaural signals are delivered to the listener’s ears [3]-[8]. It was recently shown that a robust solution to this virtual sound system can be obtained by placing the loudspeakers so that the acoustic transmission path, or transfer function (TF), matrix is well-conditioned [9]-[10]. However, the optimal speaker position is frequency dependent. For a given speaker angle, the system is stable only within limited frequency bands. It is well known that this undesirable property causes an excessive amplitude boost at certain frequencies, which may damage the amplifier and the loudspeakers. Moreover, crosstalk cancellation is inherently nonrobust for the two-channel system at low frequencies. Recently, Takeuchi et al. [11] employed a two-channel multiway sound control system to improve robustness. This system requires multiple speakers with different spans for different frequency bands. As an extension of our previous work [10], we propose an alternative solution using triple speakers [12]. The additional speaker makes the crosstalk cancellation filter more robust than that of the two-speaker system. Furthermore, a modified inverse filtering technique is explored to produce a wider robust bandwidth with accurate sound reproduction.
In this paper, based on the robustness analysis of the three-channel configuration in the elevated-speaker system, a novel auditory image correction scheme is proposed. The virtual sound algorithm is implemented on a TMS320C6201 EVM DSP board, although the TMS320C54X can also be used [13]. Finally, subjective testing is conducted to verify the perceptual quality.

II. AUDITORY IMAGE CORRECTION SCHEME

A. Robustness Analysis

Consider multichannel sound reproduction in general: the K recorded signals are transmitted via N loudspeakers to M target points (ears). Let V be the vector of N input signals to the loudspeakers, D the vector of M program signals, and A the M × N transfer function matrix. In the free-field case, the sound field radiated from a monopole source can be written as $p = (S/r)\exp[j(\omega t - kr)]$, where $k = 2\pi/\lambda$ is the wave number, $\lambda$ is the wavelength, $r$ is the distance from the source to the field point, and $\omega$ is the angular frequency; the time dependence $\exp(j\omega t)$ is henceforth omitted for brevity. $S = j\omega\rho_0 q/4\pi$ is defined as the strength term, where $\rho_0$ is the density of the medium and $q$ is the source strength. The TF $a_{mn}$ between the $n$th monopole source (speaker) and the $m$th target point (ear) is then given by $a_{mn} = e^{-jkr_{mn}}/r_{mn}$.

Fig. 1. 3-speaker system with inverse filter. (Diagram: the crosstalk cancellation filter H takes the inputs D1 and D2 and produces V1, V2 and V3 for Speakers 1 to 3; the paths r11, r12, r13, r21, r22 and r23 lead to the two ears, and θ is the side-speaker angle.)

Fig. 1 illustrates the virtual sound imaging system with triple speakers ($M = 2$ and $N = 3$). A symmetrical configuration is assumed to simplify the filter design: two speakers are located one on each side of the listener, and the third is placed directly in front of the listener. The symmetrical sound paths are defined as $r_{11} = r_{23}$, $r_{21} = r_{13}$, and $r_{12} = r_{22}$. Therefore, the TF matrix A can be written as

$$A = \begin{bmatrix} e^{-jkr_{11}}/r_{11} & e^{-jkr_{12}}/r_{12} & e^{-jkr_{13}}/r_{13} \\ e^{-jkr_{21}}/r_{21} & e^{-jkr_{22}}/r_{22} & e^{-jkr_{23}}/r_{23} \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 \\ a_3 & a_2 & a_1 \end{bmatrix}. \qquad (1)$$

The crosstalk cancellation filter H can then be calculated to minimize the cost function $J = e^H e$, where $e = D - A V$ is the error between the desired and reproduced signals and $V = H D$. It is well known that the pseudoinverse solution is

$$H = A^H (A A^H)^{-1}, \qquad (2)$$

where the superscript $H$ denotes the Hermitian transpose of a matrix. The condition number $\kappa$ of the matrix H can be used as a robustness measure for the crosstalk canceller, defined as [10]

$$\kappa = \sigma_{\max} / \sigma_{\min}, \qquad (3)$$

where $\sigma_{\min}$ and $\sigma_{\max}$ represent the smallest and largest singular values of H, respectively, which are the square roots of the respective nonzero eigenvalues of $H^H H$. The value of $\kappa$ is always greater than or equal to 1. A large condition number characterizes an ill-conditioned problem, which means the reproduced signals are highly sensitive to small errors or changes in the system. From (2), the decomposition of the 2 × 2 Hermitian matrix $H^H H$ is

$$H^H H = Q \begin{bmatrix} 1/(b_1 + b_2) & 0 \\ 0 & 1/(b_1 - b_2) \end{bmatrix} Q^T, \qquad (4)$$

where $b_1 = |a_1|^2 + |a_2|^2 + |a_3|^2$, $b_2 = |a_2|^2 + 2|a_1 a_3|\cos\phi$, $\phi = k\Delta$ is the phase difference, $\Delta = r_{21} - r_{11}$ denotes the inter-aural path difference, and Q is an orthogonal matrix. The ill-conditioning is reflected clearly in (4). The entries $1/(b_1 \pm b_2)$ are the eigenvalues, denoted $\varepsilon_{1,2}$, and the singular values become $\sigma_{1,2} = \sqrt{\varepsilon_{1,2}}$. When the magnitude of $b_1 + b_2$ or $b_1 - b_2$ is small at certain frequencies, the result is a large condition number and hence a large gain in the inverse filter; e.g., the condition number $\kappa$ becomes infinite as $b_1 - b_2 = 0$, i.e., $|a_1|^2 + |a_3|^2 - 2|a_1 a_3|\cos\phi = 0$. This makes the implementation of crosstalk cancellation unfeasible in practice. The physical explanation for this singularity is that, at certain frequencies determined by the geometry, the path length difference between a loudspeaker and the two ears becomes close or equal to an integer multiple of the wavelength of the input signal. On the other hand, the crosstalk canceller is most robust when $\cos\phi = -1/2$ ($\kappa = 1$). Here, the magnitudes of $a_1$, $a_2$ and $a_3$ are assumed to be almost equal for simplicity. Obviously, the performance is improved compared to the two-channel system, which corresponds to the case of $a_2 = 0$ in (1). The two-channel system is most robust when $\cos\phi = 0$ ($\kappa = 1$) and nonrobust when $\cos\phi = \pm 1$ ($\kappa \to \infty$). The simulation results for the condition number are given in Section C.

B. New Crosstalk Canceller Design

The crosstalk canceller H described in (2) is obtained in a least-squares manner. To further enlarge the robust bandwidth, a modified inverse filter technique is explored by relaxing the requirement of minimum input power to the speakers, and two inverse solutions are derived under different constraints. Since the reproduced signal depends on the physical system A and the input $V = H D$ to the speakers, minimum error can still be ensured for any adjustment of $A^H$ in (2), without changing the physical system A. For example, if a new matrix $C^H$ is used to replace $A^H$ in the inverse filter form, i.e., $H' = C^H (A C^H)^{-1}$, the error is $e = [I - A C^H (A C^H)^{-1}] D = 0$, provided $A C^H$ is nonsingular. This implies that there are many solutions that ensure accurate reproduction at the target points. Assuming the unknown matrix C has the same form as A, we have

$$A C^H = \begin{bmatrix} a_1 c_1^* + a_2 c_2^* + a_3 c_3^* & a_1 c_3^* + a_2 c_2^* + a_3 c_1^* \\ a_1 c_3^* + a_2 c_2^* + a_3 c_1^* & a_1 c_1^* + a_2 c_2^* + a_3 c_3^* \end{bmatrix}. \qquad (5)$$

The system is most robust when the condition number $\kappa(A C^H) = 1$, which means

$$a_1 c_3^* + a_2 c_2^* + a_3 c_1^* = 0. \qquad (6)$$
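Before a particular C is chosen, it may help to see numerically how the pseudoinverse of Section A behaves. The following is a minimal sketch, not the authors' simulation: it evaluates the condition number implied by (3) and (4) as a function of $\cos\phi$, assuming unit magnitudes $|a_i| = 1$ (the text only assumes they are almost equal). The function names `kappa`, `kappa_three` and `kappa_two` are hypothetical.

```python
import math

def kappa(b1, b2):
    """Condition number sqrt(eps_max / eps_min), with eps = 1 / (b1 +/- b2) as in (4)."""
    hi = max(b1 + b2, b1 - b2)
    lo = min(b1 + b2, b1 - b2)
    # A vanishing b1 - b2 (or b1 + b2) is the singular case: infinite condition number.
    return math.inf if lo <= 1e-12 else math.sqrt(hi / lo)

def kappa_three(cos_phi):
    """Triple-speaker pseudoinverse, assuming |a1| = |a2| = |a3| = 1."""
    # b1 = |a1|^2 + |a2|^2 + |a3|^2,  b2 = |a2|^2 + 2 |a1 a3| cos(phi)
    return kappa(3.0, 1.0 + 2.0 * cos_phi)

def kappa_two(cos_phi):
    """Two-speaker case, i.e. a2 = 0 in (1), assuming |a1| = |a3| = 1."""
    return kappa(2.0, 2.0 * cos_phi)

if __name__ == "__main__":
    for c in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(f"cos(phi)={c:+.1f}  two-speaker={kappa_two(c):.3f}  "
              f"three-speaker={kappa_three(c):.3f}")
```

Under these assumptions the three-speaker value stays finite at $\cos\phi = -1$, where the two-speaker system is singular, and both become singular at $\cos\phi = 1$.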

For a simple filter design, set $c_1 = -c_3$. It is then straightforward to let $c_2 = (a_1 - a_3)^*$ or $c_2 = 1/a_2^*$. In either case, the same modified inverse matrix is obtained:

$$H_1 = C^H (A C^H)^{-1} = \frac{1}{2} \begin{bmatrix} 1/(a_1 - a_3) & -1/(a_1 - a_3) \\ 1/a_2 & 1/a_2 \\ -1/(a_1 - a_3) & 1/(a_1 - a_3) \end{bmatrix}. \qquad (7)$$

With the magnitudes $|a_i|$ normalized to unity, the two eigenvalues of $H_1^H H_1$ are $1/2$ and $1/[2(1 - \cos\phi)]$, respectively, which provide a wider robust region than the pseudoinverse filter described in Section A.

An alternative solution to (6) can be obtained by setting $c_i = a_i + \delta a_i$, $i = 1, 2, 3$. This assumption has the advantage of requiring little effort in deriving a robust system, and the resulting solution is likely to be near the least norm. Then (6) can be rewritten as

$$a_1 \delta a_3^* + a_3 \delta a_1^* + a_2 \delta a_2^* = -|a_2|^2 - (a_1 a_3^* + a_1^* a_3). \qquad (8)$$

Similarly, let $a_2 \delta a_2^* = -(a_1 a_3^* + a_1^* a_3)$; from (8) we then have $a_1 \delta a_3^* + a_3 \delta a_1^* = -|a_2|^2$. Assuming $\delta a_1 = -\delta a_3$, the modified inverse takes the form

$$H_2 = \frac{1}{4 - 2\cos\phi} \begin{bmatrix} (2 - a_1^* a_3)/(a_1 - a_3) & (a_1 a_3^* - 2)/(a_1 - a_3) \\ (1 - 2\cos\phi)/a_2 & (1 - 2\cos\phi)/a_2 \\ (a_1 a_3^* - 2)/(a_1 - a_3) & (2 - a_1^* a_3)/(a_1 - a_3) \end{bmatrix}. \qquad (9)$$

The two eigenvalues of $H_2^H H_2$ are $\varepsilon_1 = 1/[2(1 - \cos\phi)]$ and $\varepsilon_2 = (4\cos^2\phi - 3\cos\phi + 2)/[2(2 - \cos\phi)^2]$, respectively. The filter in (9) is more robust than the crosstalk canceller $H_1$ at low frequencies, as shown in Section C, which is desirable given the difficulty of crosstalk cancellation at low frequencies [3].

C. Simulation and Discussion

For the model shown in Fig. 1, all the loudspeakers are placed at a distance of 1 m from the center of the head, and a standard head radius of 0.0875 m is assumed. The condition numbers for the two-speaker and triple-speaker systems using the inverse, pseudoinverse and modified inverse filters can be calculated from (2), (3), (7) and (9), respectively. The plotted condition number is limited to a maximum of five, and the simulation results for different azimuth angles and frequencies below 6 kHz are plotted in Fig. 2(a)-(d).

Fig. 2. Condition number plot for (a) two-speaker configuration; and triple-speaker configuration with (b) pseudoinverse $H$; (c) modified inverse $H_1$; and (d) modified inverse $H_2$.

It is observed that the triple-speaker configuration is more robust than the two-speaker system. There is a significantly wider robust margin (white region) in all the triple-speaker systems. It is the extra degree of freedom that improves the system performance. Fig. 2 shows a clear trend: for wider speaker placement, the system is robust over a narrow range of frequencies, whereas the robust region grows as the speaker angle decreases. However, the system robustness degrades at low frequencies, because the difference between the direct and cross paths is very small. An ill-conditioned problem is encountered, and it is particularly severe when the two side loudspeakers are placed close together. From Figs. 2(b), (c) and (d), it can also be seen that the modified inverse filters are more robust than the pseudoinverse filter. The proposed crosstalk cancellers $H_1$ and $H_2$ further enlarge the robust region compared to the pseudoinverse filter $H$. The modified inverse $H_1$ provides a wider robust region than $H_2$, but the filter $H_2$ extends the robust zone to low frequencies. As shown in Fig. 2(d), for a side-speaker placement of 15°, good robustness is achieved from 300 Hz up to 6 kHz. This is especially attractive because crosstalk cancellation at low frequencies is fundamentally difficult. In general, considering the robustness at low and high frequencies, a practical solution would be to use a modified inverse filter with a side-speaker angle of around 15°.
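The closed forms in (7) and (9) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the simulation behind Fig. 2: it uses phase-only transfer functions with $|a_i| = 1$, parameterized directly by the phase difference $\phi$ (with an arbitrary centre-path phase `psi`), instead of the 1 m / 0.0875 m geometry. All helper names are hypothetical.

```python
import cmath
import math

def matmul(X, Y):
    """Multiply two matrices stored as nested lists of complex numbers."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def herm(X):
    """Hermitian (conjugate) transpose of a nested-list matrix."""
    return [[X[i][j].conjugate() for i in range(len(X))]
            for j in range(len(X[0]))]

def eig2(G):
    """Eigenvalues, in ascending order, of a 2 x 2 Hermitian matrix."""
    half_trace = (G[0][0] + G[1][1]).real / 2.0
    det = (G[0][0] * G[1][1] - G[0][1] * G[1][0]).real
    s = math.sqrt(max(half_trace ** 2 - det, 0.0))
    return half_trace - s, half_trace + s

def cancellers(phi, psi=0.3):
    """Return A of (1), H1 of (7) and H2 of (9) for unit-magnitude a_i (assumed)."""
    a1 = complex(1.0)
    a2 = cmath.exp(-1j * psi)   # centre path, arbitrary phase (assumption)
    a3 = cmath.exp(-1j * phi)   # so that a1 * conj(a3) = exp(j * phi)
    A = [[a1, a2, a3], [a3, a2, a1]]
    d = a1 - a3
    H1 = [[0.5 / d, -0.5 / d], [0.5 / a2, 0.5 / a2], [-0.5 / d, 0.5 / d]]
    c = math.cos(phi)
    g = 1.0 / (4.0 - 2.0 * c)
    u = (2.0 - a1.conjugate() * a3) / d
    v = (a1 * a3.conjugate() - 2.0) / d
    w = (1.0 - 2.0 * c) / a2
    H2 = [[g * u, g * v], [g * w, g * w], [g * v, g * u]]
    return A, H1, H2

def kappa(H):
    """Condition number of H from the eigenvalues of H^H H, as in (3)."""
    lo, hi = eig2(matmul(herm(H), H))
    return math.sqrt(hi / lo)

if __name__ == "__main__":
    for phi in (0.2, 1.0, 2.0):
        A, H1, H2 = cancellers(phi)
        print(f"phi={phi:.1f}  kappa(H1)={kappa(H1):.3f}  kappa(H2)={kappa(H2):.3f}")
```

Both filters satisfy $A H = I$ in this model, so the choice between them affects only robustness. For small $\phi$ (low frequency) the sketch gives a smaller condition number for $H_2$ than for $H_1$, in line with the discussion above.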

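Fig. 3. Elevated speaker system. (Diagram: the elevated loudspeakers XLe, XCe and XRe reach the ear signals YL and YR through the paths HLLe, HLRe, HCLe, HCRe, HRLe and HRRe; the phantom ear-level loudspeakers XL and XR reach them through HLL, HLR, HRL and HRR.)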


III. SYSTEM IMPLEMENTATION

A. Description of Elevated Sound Imaging System

Fig. 3 shows the virtual sound imaging system using elevated-speaker projection. The triple elevated speakers are employed to create sound images from two virtual phantom loudspeakers at ear level. The elevated speaker system can be described in the frequency domain as

$$Y_e = H_e X_e, \qquad (10)$$

where $X_e$ is a column vector representing the signal output to the elevated speakers, and $Y_e$ is a column vector representing the signal at the eardrums of the listener contributed by the elevated speakers. $H_e$ is an M × N matrix of transmission paths from the elevated speakers ($N = 3$) to the listener's eardrums ($M = 2$), which is modelled by the respective HRTFs $H_{LLe}$, $H_{LRe}$, $H_{CLe}$, $H_{CRe}$, $H_{RLe}$ and $H_{RRe}$. Similarly, for the ear-level system, we have

$$Y = H_v X, \qquad (11)$$

where the column vectors $X$ and $Y$ denote the signal output to the ear-level speakers and the signal at the eardrums of the listener contributed by the ear-level speakers, respectively. $H_v$ is an M × K matrix of transmission paths from the virtual speakers ($K = 2$) to the listener's eardrums ($M = 2$), which is modelled at 0° elevation (horizontal plane). In order to create the phantom ear-level sources, the signal at the listener's eardrums produced by the physical elevated loudspeakers should be identical to that produced when the loudspeakers are placed at ear level, i.e., $Y_e = Y$ or $H_e X_e = H_v X$. Thus, the correction image filter is designed as

$$H_k = H_e^{+} H_v, \qquad (12)$$

where $H_e^{+}$ is the N × M pseudoinverse matrix taking the form of (2). Also, (7) or (9) can be employed to improve the virtual imaging. $H_k$ is an N × K matrix, where $N = 3$ for the triple-speaker system and K is determined by the number of reproduced virtual sources; e.g., if a central virtual sound is to be created in Fig. 3, $K = 3$. The signal input $X_e$ to the elevated speakers should be preprocessed according to (12).

B. Efficient Filter Structure

In a two-channel system, good robustness can be achieved when the loudspeakers are placed close together. However, this comes at the expense of low-frequency performance. At low frequencies, it is difficult to generate a high interaural intensity difference (IID), because this requires a large boost of the low-frequency components, which is impractical in many situations owing to limitations such as the linear operating range of the transducer and the dynamic range of the digital-to-analogue converter. Often, closely placed speakers need to emulate virtual speakers that are placed wider apart. The use of a three-channel system enables sound reproduction at low frequencies, and the symmetric speaker configuration results in a simple but efficient filtering structure. This structure is commonly known as the “shuffler” structure [4]. It reduces the number of filtering operations required for a group of filters.

In the case of creating triple virtual sound at ear level, i.e., $N = 3$ and $K = 3$, the normal inverse filter is deduced from (12) as

$$H_k = \begin{bmatrix} H_{11k} & H_{12k} & H_{13k} \\ H_{21k} & H_{22k} & H_{23k} \\ H_{31k} & H_{32k} & H_{33k} \end{bmatrix}, \qquad (13)$$

which can be simplified to

$$H_k = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} H_A & H_{12k} & 0 \\ H_{21k} & H_{22k} & 0 \\ 0 & 0 & H_B \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \end{bmatrix}, \qquad (14)$$

where $H_A = (H_{11k} + H_{31k})/2$ and $H_B = (H_{11k} - H_{31k})/2$. The two filter structures are shown in Fig. 4. It can be seen that the 9 filters are reduced to 5 filters, a saving of about 44% in computational load.

Fig. 4. (a) 3 × 3 normal filter structure, (b) 3 × 3 shuffler structure.

Similarly, for the proposed modified inverse filter design, a further simplification is possible and hence a higher efficiency can be achieved, as shown in Fig. 5. The number of filters is reduced from 7 to 3, a saving of more than 50% in computational load.

Fig. 5. (a) 3 × 3 modified filter structure, (b) 3 × 3 shuffler structure.

In Fig. 5, the structure is derived from (12) and (7) or (9), and the correction image filter takes the form

$$H_k = \begin{bmatrix} H_{11k} & 0 & -H_{11k} \\ H_{21k} & H_{22k} & H_{21k} \\ -H_{11k} & 0 & H_{11k} \end{bmatrix}, \qquad (15)$$

which can be simplified to

$$H_k = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} H_{21k} & H_{22k} & 0 \\ 0 & 0 & H_{11k} \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \end{bmatrix}. \qquad (16)$$


Here, the $H_{11k}$, $H_{21k}$ and $H_{22k}$ in (15) and (16) are different from those given in (13). In the case of creating two virtual sound sources at ear level in Fig. 4, i.e., $N = 3$, $K = 2$ and $X_2 = 0$, the filtering structures from (14) and (16) are further reduced to

$$H_k = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} H_A & 0 \\ H_{12k} & 0 \\ 0 & H_B \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad (17)$$

and

$$H_k = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} H_{21k} & 0 \\ 0 & H_{11k} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \qquad (18)$$

Hence, the corresponding correction image filters can be

Fig. 6. (a) 3 × 2 normal filter structure, (b) 3 × 2 shuffler structure.

designed as shown in Figs. 6 and 7. The number of filters is reduced from 6 to 3 for the normal inverse in (17), or to 2 for the modified inverse in (18), a saving of 50% or more in filtering.

Fig. 7. (a) 3 × 2 modified filter structure, (b) 3 × 2 shuffler structure.

IV. SUBJECTIVE EVALUATION

The subjective evaluation ascertains the pros and cons of the different speaker settings. Four criteria were used to assess the system performance. First, the “sound resemblance” is tested to see whether the sound steered down from the elevated speakers is identical to that from physical ear-level loudspeakers in the intended setting. Secondly, the quality of the sound is evaluated in the “quality degradation” test. This differs from “sound resemblance”, which emphasizes whether the sounds are identical between the intended settings. In the “quality degradation” experiment, the sound quality can be enhanced to be better than that intended. However, the “quality degradation” is dependent on the user preference. The “elevation perception” test gauges how far the sound image can be steered down, and the “azimuth perception” test studies how accurately the virtual sound image can be created on the horizontal plane for different virtual loudspeaker angles.

In the experiment, the Creative Cambridge SoundWorks® PCWORKS™ loudspeakers were mounted near the ceiling at an elevation of 60°. These loudspeakers were driven by the Rane MA6S multi-channel amplifier, which supports up to 6 channels of sources. In the triple-speaker system, only the two side speakers were varied, from closely spaced (15°) to moderately spaced (45°) and widely spaced (80°), while the centre speaker was unchanged throughout the test. The stimulus, a train of eight 250 ms Gaussian noise bursts with 300 ms of silence between the bursts, was played from a PC equipped with a LAYLA2-BIY multi-track digital audio system. The outputs from the LAYLA2-BIY were transmitted to the Rane MA6S amplifier. The auditory image correction algorithms were implemented on a Texas Instruments TMS320C6201 EVM DSP board. The filters were designed as described in the previous section. The subjects were asked to keep their heads still during the test, but no special measure was implemented to enforce this. This is also a more realistic listening situation. The various image correction methods were carried out for different elevated-speaker configurations as follows:

1) Filter structure using (2) and (17):
a) angle $\theta_e = \pm 15^\circ$ (experiment no. 1);
b) angle $\theta_e = \pm 45^\circ$ (experiment no. 2);
c) angle $\theta_e = \pm 80^\circ$ (experiment no. 3).

2) Filter structure using (7) and (18):
a) angle $\theta_e = \pm 15^\circ$ (experiment no. 4);
b) angle $\theta_e = \pm 45^\circ$ (experiment no. 5);
c) angle $\theta_e = \pm 80^\circ$ (experiment no. 6).

3) Filter structure using (9) and (18):
a) angle $\theta_e = \pm 15^\circ$ (experiment no. 7);
b) angle $\theta_e = \pm 45^\circ$ (experiment no. 8);
c) angle $\theta_e = \pm 80^\circ$ (experiment no. 9).

Here $\theta_e$ is the azimuth angle of the physical elevated loudspeakers, and $\theta$ is the azimuth angle of the target virtual ear-level source.

A total of six subjects (CK, KS, YT, KK, FH and ST), all volunteers, were involved in these tests. The subjects were adults from 26 to 32 years of age. The virtual source angles that were reproduced were not made known to the subjects, and the sequence of reproduction of these virtual angles was randomised during the test. The test results are shown in Figs. 8-11. Figs. 8 to 10 are box-and-whisker plots superimposed with the average of each experiment plotted as circles (“o”). The box plot has horizontal lines at the lower quartile, median, and

upper quartile values. The whiskers are vertical lines extending from each end of the box to show the extent of the rest of the data that lies within 1.5 times the interquartile range from the top or bottom of the box. An outlier is a value that lies more than 1.5 times the interquartile range away from the box and is represented by the symbol “+”. Therefore, an outlier is always beyond the whisker. If there is no data outside the whisker, there is a dot at the bottom whisker. If a whisker falls inside the box, it is not drawn.

Fig. 8 shows the “sound resemblance” result. During the test for “sound resemblance”, the subject was asked to compare the reproduction of the elevated physical loudspeakers with that of the same loudspeakers at ear level. The sound through the ear-level physical loudspeakers was played back without any processing, whereas the sound through the elevated speakers was played back after processing by the methods described in the previous section. A value of 1 denotes a very poor match between the targeted sound and the reproduced one, while a 5 is given for an exact match or where no difference can be perceived. In the process, the listener could select between different mode settings, meant for subjects with different head sizes, to obtain the best result for all the virtual speaker placements. This setting was recorded and remained the same for each subject in the other tests. From Fig. 8, it can be seen that the system using crosstalk canceller $H_2$ (experiments no. 7 to 9) performs slightly poorer than those using $H_1$ (experiments no. 4 to 6) or the pseudoinverse filter (experiments no. 1 to 3). This might be due to the variation of the weights in the frequency domain, which gives rise to higher error when the user is not in the intended listening position.

Fig. 8. “Sound resemblance” result.

The “sound degradation” result is shown in Fig. 9. For the test of quality degradation, a processed sound was played back through the elevated loudspeakers, and a reference was played back through physical ear-level loudspeakers without any processing. The subjects were asked to judge the quality by comparing the processed signal for the elevated speakers to the original from the ear-level speakers. A value of 1 in this case is given for very little degradation in quality, while a 5 means a badly degraded sound. It is noted that the method based on the modified inverse $H_2$ is also slightly worse than the pseudoinverse method and that using the $H_1$ filter. This is possibly due to the same reasons as in the “sound resemblance” test. Hence, when the listener is not in the intended listening position, a certain degree of degradation can be perceived.

Fig. 9. “Sound degradation” result.

For “elevation perception”, the sound was played through the elevated loudspeakers only. The subjects were asked to give the perceived elevation in degrees, with the horizontal plane as 0°. The perceived elevation results are plotted in Fig. 10. The average values follow a sawtooth-like pattern. This implies that the system with loudspeakers placed wider apart performs worse, and the elevation is judged higher. When the two side loudspeakers are placed close together, the elevation perception improves and the result obtained is closer to the intended horizontal level. Moreover, the auditory image correction schemes using filters $H_1$ and $H_2$ are comparable in performance, and they are also slightly better than that using the pseudoinverse filter.

Fig. 10. Perceived elevation angle result.

For “azimuth perception”, the procedure was the same as for “elevation perception”, except that the subjects were asked to give the perceived azimuth angle instead of the elevation angle. The azimuth perception result is shown in Fig. 11. In this figure, the size of the circle “o” is proportional to the number of subjects having that particular perception. The wider virtual source angles tend to be perceived towards the centre. However, when the actual physical loudspeakers were placed near the virtual source, azimuth localization improves. This is possibly due to the dynamic cue caused by slight head movement of the subject.

Fig. 11. Perceived azimuth angle result.
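The box-and-whisker convention used for Figs. 8 to 10 can be sketched as follows. This is only an illustration: the quartile rule (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the paper does not state how quartiles were computed, and the sample values are hypothetical rather than measured data. The function name `box_stats` is also hypothetical.

```python
import statistics

def box_stats(values):
    """Summarize one experiment: quartiles, whisker ends, outliers and mean."""
    data = sorted(values)
    # Quartile definition is an assumption; other conventions give slightly
    # different values for small samples.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach the most extreme data points within 1.5 x IQR of the box.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Values beyond the fences are the points drawn as "+".
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        # The mean corresponds to the circle "o" superimposed on each box.
        "mean": statistics.mean(data),
    }

if __name__ == "__main__":
    # Hypothetical perceived elevation angles in degrees, not data from the paper.
    print(box_stats([0, 5, 10, 10, 15, 15, 20, 25, 60]))
```

With only six subjects per experiment, a single unusual judgement can easily fall outside the fences, which is why the plots mark outliers separately from the whiskers.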

V. CONCLUSION

A triple elevated-speaker system is proposed for virtual sound imaging to improve performance over the two-channel case. Based on a robustness analysis, a preprocessing technique is described, and two modified inverse filters are derived to further enlarge the robust region. This paper focuses on the implementation using a TMS320C62X DSP, although other embedded DSP-based platforms such as the TMS320C54X can also be used. The auditory image correction scheme has been developed with an efficient filter design for a low computational cost, and its superiority is verified by the simulation results and the subjective tests.

REFERENCES

[1] S. E. Tan, J. Yang, Y. H. Liu, and W. S. Gan, “Elevated speakers image correction using 3-D audio processing,” 109th Convention of the Audio Eng. Soc., AES preprint 5204, Los Angeles, US, Sept. 2000.
[2] W. S. Gan, S. E. Tan, M. H. Er, and Y. K. Chong, “Elevated speaker projection for digital home entertainment system,” IEEE Trans. Consumer Electron., vol. 47, no. 3, pp. 631-637, Aug. 2001.
[3] W. G. Gardner, 3D audio using loudspeakers, Kluwer Academic Pub: Boston, 1998, pp. 44-78.
[4] J. L. Bauck and D. H. Cooper, “Generalized transaural stereo and applications,” J. Audio Eng. Soc., vol. 44, no. 9, pp. 683-685, Sept. 1996.
[5] S. M. Kuo and G. H. Canfield, “Dual-channel audio equalization and cross-talk cancellation for 3-D sound reproduction,” IEEE Trans. Consumer Electron., vol. 43, no. 4, pp. 1189-1196, Nov. 1997.
[6] O. Kirkeby, P. A. Nelson, H. Hamada, and F. Orduna-Bustamante, “Fast deconvolution of multichannel systems using regularization,” IEEE Trans. Speech Audio Processing, vol. 6, no. 2, pp. 189-194, Mar. 1998.
[7] S. Kawano, M. Taira, M. Matsudaira, and Y. Abe, “Development of the virtual sound algorithm,” IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 1189-1193, Aug. 1998.
[8] D. B. Ward, “Joint least squares optimization for robust acoustic crosstalk cancellation,” IEEE Trans. Speech Audio Processing, vol. 8, no. 2, pp. 211-215, Feb. 2000.
[9] D. B. Ward and G. W. Elko, “Effect of loudspeaker position on the robustness of acoustic crosstalk cancellation,” IEEE Signal Processing Letter, vol. 6, no. 5, pp. 106-108, May 1999.
[10] J. Yang and W. S. Gan, “Speaker placement for robust 3D audio system,” IEE Electr. Letter, vol. 36, no. 7, pp. 683-686, Mar. 2000.
[11] T. Takeuchi and P. A. Nelson, “Optimal source distribution for binaural synthesis over loudspeakers,” Acoustics Research Letters Online, vol. 2, no. 1, pp. 7-12, Jan. 2001.
[12] J. Yang, W. S. Gan, and S. E. Tan, “Improved sound separation using three loudspeakers,” Acoustics Research Letters Online, vol. 4, no. 2, pp. 47-52, Feb. 2003.
[13] N. Iwanaga, W. Kobayashi, K. Furuya, T. Onoye, and I. Shirakawa, “Embedded implementation of acoustic field enhancement for stereo sound sources,” IEEE Trans. Consumer Electron., vol. 49, no. 3, pp. 737-741, Aug. 2003.

BIOGRAPHIES

Jun Yang (M’99-SM’04) received his B.Eng. and M.Eng. degrees from the Harbin Engineering University, Harbin, China, and the Ph.D. degree in acoustics from Nanjing University, Nanjing, China, in 1990, 1993 and 1996, respectively. From 1996 to 1998, he was a postdoctoral fellow at the Institute of Acoustics, Chinese Academy of Sciences, Beijing, China. From October 1998 to April 1999, he worked at Hong Kong Polytechnic University as a visiting scholar. From Jan 1997 to May 1999, he was an associate professor at the Institute of Acoustics, Chinese Academy of Sciences, Beijing, China. From July 1999 to Sep 2001, he was a research fellow in the School of Electrical and Electronic Engineering (EEE), Nanyang Technological University (NTU), Singapore. From Sep 2001 to July 2003, he was a teaching fellow in the School of EEE, NTU. He is now an assistant professor in the School of EEE, NTU. His main research interests are 3-D audio systems, acoustic signal processing, active control of sound and vibration, smart structures and nonlinear acoustics.

Woon-Seng Gan (S’90-M’93-SM’00) received his B.Eng. (1st Class Honors) and Ph.D. degrees, both in electrical and electronic engineering, from the University of Strathclyde, Glasgow, UK, in 1989 and 1993, respectively. He joined the School of Electrical and Electronic Engineering (EEE), Nanyang Technological University (NTU), Singapore, as a lecturer in 1993 and became a senior lecturer in 1998. He was also with the Centre for Signal Processing as the Deputy Director until June 2001. Currently, he is an associate professor in the School of EEE, NTU. His research interests include adaptive signal processing, psychoacoustical signal processing, and real-time digital signal processing. He has published more than 80 international refereed journal and conference papers.

See-Ee Tan received his B.Eng. (Hons) and M.Eng. degrees, both in electrical and electronic engineering, from Nanyang Technological University, Singapore, in 1999 and 2002, respectively. Currently, he is an engineer with DSO National Laboratories, Singapore. His research interests include digital signal processing, communication, 3D audio systems, psychoacoustics and nonlinear acoustics.
