DIGITAL AUDIO FORENSICS USING BACKGROUND NOISE

Sohaib Ikram, Hafiz Malik

Information Systems, Security, and Forensics Lab.
Electrical & Computer Engineering Department
University of Michigan - Dearborn, 4901 Evergreen Road, Dearborn, MI 48128.
Email: {sikram,hafiz}@umich.edu

ABSTRACT
This paper presents a new audio forensics method based on the background noise in audio signals. Traditional speech enhancement algorithms improve the quality of speech signals; however, the noise they remove retains traces of the speech signal, known as the leakage signal. Although this speech leakage has low SNR, it can be perceived easily by listening to the estimated noise signal, so the estimate cannot be used for audio forensics applications. For reliable audio authentication, a better noise estimation method is desirable. To achieve this goal, a two-step framework is proposed to estimate the background noise with minimal speech leakage. A correlation-based similarity measure is then applied to determine the integrity of the speech signal. The proposed method has been evaluated on speech signals recorded in various environments. The results show that it performs better than existing speech enhancement algorithms, with a significant improvement in SNR.

Keywords: Spectral subtraction, audio forensics, noise estimation, geometric approach, harmonic analysis

1. INTRODUCTION

With the advent of the modern digital era, it is becoming very important to verify the authenticity and integrity of digital media. Digital media is available in various formats (e.g., audio, image, text, and video), and a comprehensive analysis is required to verify the authenticity and integrity of digital information. The motivation behind this work is to develop a reliable audio forensic analysis. To achieve this goal, the background noise in the recorded audio signal is used for authentication.

Speech signal processing has been an active area of research over the past few decades. Over this period, a large number of speech enhancement algorithms have been proposed to improve the quality of speech. Existing speech enhancement algorithms can be used for background noise estimation. However, the estimated noise signal contains residual speech or speech harmonics, known as the leakage signal. Although the power of this leakage signal is very low, the signal is perceivable. A careful inspection of the noise spectrogram in Figure 2 (right) reveals this fact. When this noise signal is listened to carefully, the speech can be perceived partly or sometimes completely. This means that the estimated noise signal is not speech free.

Speech leakage in the estimated background noise would degrade the performance of various speech processing systems, including speech recognition, speaker identification, and audio forensics analysis. More importantly, in the case of audio forensics analysis, the presence of speech in the background noise estimate would bias the analysis output, which therefore could not be used in a court of law for litigation purposes. It is thus desirable to minimize speech leakage in the background noise estimate. To achieve this objective, a new framework is proposed to estimate background noise with minimal speech leakage, and hence improved speech enhancement.

In this paper, a novel noise estimation method is proposed for improved speech enhancement. The method is a two-step approach. In the first step, the initial background noise estimate is obtained using spectral subtraction based on the geometric approach (GA) described in [1]. In the second step, a harmonic analysis is performed on the initial estimate to remove speech leakage. Because the speech signal power in the estimated background noise is very low, a removal method that performs well in low-SNR conditions is needed. To this end, the multi-band spectral subtraction method [2] is used to remove speech leakage from the initial estimate of background noise. The proposed method has been evaluated on speech signals recorded in various environments, and the results show that it performs better than the existing methods. Its performance is also compared with existing methods based on [3] and [4].

The proposed noise estimation method is used to verify the authenticity of speech signals. For this purpose, background noise patterns are estimated for speech signals recorded in various environments, and a correlation analysis is performed to inspect the mutual dependence of these estimated noise patterns. The results show that the background noise estimates for two different environments are independent. Moreover, doctored (tampered) speech signals are prepared, and the same concept is applied to verify the authenticity of these signals.
978-1-4244-7493-6/10/$26.00 ©2010 IEEE
ICME 2010
2.1. Initial Noise Estimation

The first stage shown in Figure 1 is a speech enhancement block that produces the initial noise estimate. A spectral subtraction based speech enhancement framework is used for this purpose. Spectral subtraction is computationally simple: it subtracts the spectrum of the noise from the spectrum of the input speech signal. The noise spectrum is computed from the input signal during the time slots in which voice activity is absent, and the noise removal algorithm updates the noise spectrum over each such time slot. Spectral subtraction has two main limitations: over-subtraction, which can remove part of the speech signal from the input signal, and under-subtraction, which fails to remove the interfering noise. Simple spectral subtraction approaches also face the following issues [1]: (i) it is difficult to deal with negative values of the noise spectrum, and (ii) the cross-correlation between the noise and the speech signal may not be zero. To overcome these limitations, a deterministic approach known as the geometric approach (GA) to spectral subtraction has been proposed in [1], where its details can be found. The proposed scheme uses spectral subtraction based on the GA to obtain the initial noise estimate from the input speech signal.
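To make the basic idea concrete, the following is a minimal sketch of plain power spectral subtraction with a spectral floor. It is not the geometric approach of [1]; it only illustrates how a noise spectrum averaged over frames without voice activity is subtracted from each frame. The function names, frame length, hop size, floor value, and the synthetic test signal are all assumptions made for illustration.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping Hann-windowed frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hanning(frame_len)

def spectral_subtraction(y, noise_frames, frame_len=256, hop=128, floor=0.01):
    """Plain power spectral subtraction (illustrative, not the GA of [1]).

    noise_frames: boolean mask, True for frames assumed to contain no speech.
    Returns (clean_mag, noise_mag): per-frame magnitude estimate of the
    enhanced signal and the averaged noise magnitude spectrum.
    """
    Y = np.fft.rfft(frame_signal(y, frame_len, hop), axis=1)
    power = np.abs(Y) ** 2
    # Noise power spectrum averaged over the frames without voice activity.
    noise_power = power[noise_frames].mean(axis=0)
    # Subtract, then floor the result so no bin becomes negative.
    clean_power = np.maximum(power - noise_power, floor * power)
    return np.sqrt(clean_power), np.sqrt(noise_power)

# Synthetic example: white noise throughout, a 1 kHz tone in the second half.
rng = np.random.default_rng(0)
fs, n = 8000, 8000
t = np.arange(n) / fs
tone = np.where(t >= 0.5, np.sin(2 * np.pi * 1000 * t), 0.0)
y = 0.1 * rng.standard_normal(n) + tone

n_frames = 1 + (n - 256) // 128
# Frames that end before the tone starts are treated as noise-only.
noise_frames = np.arange(n_frames) * 128 + 256 <= n // 2
clean_mag, noise_mag = spectral_subtraction(y, noise_frames)
```

The floor is one common way to handle the negative values that direct subtraction produces; the choice of floor trades residual noise against distortion.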
Fig. 1. Block Diagram of Proposed Method.

The rest of this paper is organized as follows. Section 2 outlines the proposed method. The implementation details are given in Section 3. Experimental results are presented in Section 4. Finally, Section 5 presents the conclusions and future work.
2. METHODOLOGY

The proposed method uses a two-step approach. In the first step, the input speech signal is processed to obtain the initial noise estimate, using spectral subtraction based on the geometric approach presented in [1]. The resulting noise estimate is a mixture of background noise and speech leakage (see Figure 2 (right)). In the second step, this estimate is processed further with multi-band spectral subtraction to remove the speech leakage. Figure 1 shows the block diagram of the proposed noise removal system for better noise estimation and improved speech enhancement. The input signal $y[n]$ is a passively received speech signal, which is the sum of the speech signal $s[n]$ and the noise signal $\eta[n]$:

\[ y[n] = s[n] + \eta[n] \tag{1} \]

This signal is applied to the first stage of the proposed speech enhancement algorithm. Spectral subtraction based on the geometric approach [1] is used to obtain the initial noise estimate. The output of this stage is an estimate of the noise signal, $\eta_h[n]$, which can be expressed as

\[ \eta_h[n] = \eta_s[n] + s_h[n] \tag{2} \]

where $\eta_s[n]$ is the noise and $s_h[n]$ is the speech signal present in the noise estimate, i.e., the speech leakage. Our objective is to remove these speech harmonics from the noise to obtain a better noise estimate. To accomplish this, a harmonic analysis based on multi-band spectral subtraction is carried out on the estimated noise signal $\eta_h[n]$. The resulting noise estimate $\hat{\eta}[n]$ is subtracted from the speech signal $y[n]$ to obtain the enhanced speech signal $\hat{y}[n]$. The algorithms used in this paper are briefly discussed in Sections 2.1 and 2.2.

2.2. Harmonic Analysis

The noise estimate $\eta_h[n]$ at the output of the first stage shown in Figure 1 is not free of speech leakage, and this leakage still contains intelligible speech content (see Figure 2 (right)). Since our objective is to use background noise to verify the authenticity of speech signals for digital audio forensics, the presence of speech harmonics can bias the results and is therefore undesirable. For a better noise estimate, it is pertinent to remove these harmonics. The proposed scheme exploits the harmonic structure of speech in the higher frequency range to remove the speech signal from $\eta_h[n]$. However, it is very challenging to remove the speech content completely from the estimated noise, because the spectrum of the noise signal is not flat (see Figure 2 (right)). The objective of the harmonic analysis stage is therefore to minimize speech leakage in the background noise. Various filtering techniques can be used for speech removal, including Wiener based filtering [4], [5], minimum mean squared error (MMSE) based filtering [3], and multi-band spectral subtraction. Multi-band spectral subtraction is used here.
2.2.1. Multi-band Spectral Subtraction

To remove the speech leakage signal from the estimated noise, the multi-band spectral subtraction approach described in [2] is used. This approach is a modification of the spectral subtraction approach proposed by Boll in [6], and its superior performance is the main motivation for using it for speech removal. Since we want to minimize (ideally nullify) the speech leakage signal, it is important to estimate the noise with minimal speech leakage. The final noise estimate $\hat{\eta}[n]$ (the output of the second processing stage) is then subtracted from the input audio signal to obtain the enhanced audio signal, i.e.,

\[ \hat{y}[n] = y[n] - \hat{\eta}[n]. \]
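As a rough illustration of the multi-band idea, the sketch below splits one frame's power spectrum into bands and subtracts an interference spectrum from each band with its own over-subtraction factor, which grows as the band SNR falls. It is a simplified sketch, not the method of [2] as used in this paper: the per-band frequency weights are omitted, the breakpoints and constants in `oversubtraction_factor` are assumed illustrative values, and the paper does not state how the speech leakage spectrum passed in as `interference_power` is obtained.

```python
import numpy as np

def oversubtraction_factor(band_snr_db):
    """Piecewise-linear factor: subtract more aggressively at low band SNR.

    The breakpoints (-5 dB, 20 dB) and constants are assumed values.
    """
    if band_snr_db < -5.0:
        return 4.75
    if band_snr_db > 20.0:
        return 1.0
    return 4.0 - (3.0 / 20.0) * band_snr_db

def multiband_subtract(power, interference_power, n_bands=4, floor=0.002):
    """Per-band spectral subtraction on one frame's power spectrum."""
    power = np.asarray(power, dtype=float)
    interference_power = np.asarray(interference_power, dtype=float)
    out = np.empty_like(power)
    # Equal-width, non-overlapping bands over the frequency bins.
    edges = np.linspace(0, len(power), n_bands + 1).astype(int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        p, d = power[lo:hi], interference_power[lo:hi]
        snr_db = 10.0 * np.log10(max(p.sum(), 1e-12) / max(d.sum(), 1e-12))
        alpha = oversubtraction_factor(snr_db)
        # Floor each bin at a small fraction of its input power.
        out[lo:hi] = np.maximum(p - alpha * d, floor * p)
    return out

# Two bands: strong interference in the lower band, weak in the upper band.
power = np.ones(8)
interference = np.array([0.5] * 4 + [0.001] * 4)
result = multiband_subtract(power, interference, n_bands=2)
```

In this toy example the lower band is driven to the floor, while the upper band is left almost unchanged because its band SNR is high.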
Table 1. Results of the Proposed Method

Speech Environment   Stage 1 SNR (dB)   Stage 2 SNR (dB)   Improvement (dB)
Outside               0.14               5.55               5.41
Small Office         12.98              23.89              10.91
Stairs                5.36              23.75              18.39
Office                8.28              13.02               4.74
Room                 10.79              19.34               8.55
3. EXPERIMENTAL SETUP

This section provides the implementation details, the dataset used, the experimental settings, and the performance evaluation of the proposed framework.
Table 2. Results of Modification 1

Speech Environment   Stage 1 SNR (dB)   Stage 2 SNR (dB)   Improvement (dB)
Outside               0.14               3.31               3.17
Small Office         12.98              21.31               8.33
Stairs                5.36              22.95              17.59
Office                8.28              11.48               3.20
Room                 10.79              17.48               6.69
3.1. Implementation

The proposed method has been implemented in MATLAB to evaluate its enhancement performance on various kinds of speech signals. Each recorded speech signal has been processed individually by the proposed method, and the noise has been estimated for each signal. To detect the presence of harmonics in the noise, a soft-decision voice activity detector based on the approach suggested in [7] has been used.
3.2. Data Set

The speech signals used to test the performance of the proposed method have been recorded in five different environments: (i) Office, (ii) Small Office, (iii) Room, (iv) Stairs, and (v) Outside. The same speech signal has been recorded in each environment using the same recording device, so that the noise introduced by the device is the same in all recordings and can be ignored.

3.3. Experimental Settings

Each signal in the dataset has been processed by the proposed method, and the noise pattern has been estimated in each case. To assess the performance of the proposed method, the following two modifications are considered:

1. Modification 1: replace multi-band spectral subtraction based filtering with the MMSE based filtering approach suggested in [3].

2. Modification 2: replace multi-band spectral subtraction based filtering with the a priori SNR estimation and Wiener filtering suggested in [4] and [5], respectively.

4. EXPERIMENTAL RESULTS

Each recorded signal has been tested individually using the proposed method discussed in Section 2. For comparison, each signal has also been tested with the two modifications described in Section 3.3. For one such recorded signal, spectrograms at different stages of Figure 1 are shown in Figures 2 and 3. Only a small portion of each spectrogram is shown. It is clear from Figure 2 (right) that the estimated noise at the output of the first stage contains harmonics of the speech signal; some of these harmonics are marked with ovals in the figure. To remove these harmonics, the signal has been processed further by the second stage. Figure 3 (left) shows the noise estimated by the proposed method, i.e., the GA followed by multi-band spectral subtraction. Figures 3 (center) and 3 (right) show spectrograms of the noise estimated using the two modifications described in Section 3.3. It can be observed from Figure 3 that the power of the harmonics has been reduced significantly, resulting in a better noise estimate. It can also be observed that the power of the harmonics is lower in Figure 3 (left) than in Figures 3 (center) and 3 (right). This indicates that the proposed method yields a better noise estimate than the two modifications, which are based on MMSE and Wiener filtering.

The performance of the proposed method is evaluated using SNR as a quality measure. To this end, the SNR has been measured at the output of both stages for every speech signal. The results are given in Table 1, which shows a significant improvement from the first stage to the second. In addition, SNR values have been calculated for the modifications described in Section 3.3, and the corresponding results are given in Tables 2 and 3, respectively. It can be observed from Tables 1–3 that the results of the proposed method are better than the corresponding results for the two modifications in every environment. The same observation is evident from the spectrograms in Figures 2 and 3.
Fig. 2. Spectrogram of Input Speech Signal $y[n]$ (left), Spectrogram of $\eta_h[n]$ Estimated by GA (right)

Fig. 3. Spectrograms of Estimated Noise $\hat{\eta}[n]$ by: the Proposed Approach (left), using Modification 1 (center), and using Modification 2 (right)
Table 3. Results of Modification 2

Speech Environment   Stage 1 SNR (dB)   Stage 2 SNR (dB)   Improvement (dB)
Outside               0.14               1.13               0.99
Small Office         12.98              19.49               6.51
Stairs                5.36              21.12              15.76
Office                8.28               9.95               1.67
Room                 10.79              15.64               4.85
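The Improvement column in Tables 1–3 is the difference between the Stage 2 and Stage 1 SNR values. The short sketch below recomputes that column from the tabulated values and compares the three methods per environment. The `snr_db` helper uses the standard power-ratio definition of SNR, which is an assumption here since the paper does not state how the SNR was measured; the dictionary layout and function names are likewise illustrative.

```python
import math

def snr_db(signal_power, noise_power):
    """Standard SNR definition in dB (assumed; not specified in the paper)."""
    return 10.0 * math.log10(signal_power / noise_power)

# (Stage 1 SNR, Stage 2 SNR) in dB, copied from Tables 1-3.
results = {
    "proposed": {
        "Outside": (0.14, 5.55), "Small Office": (12.98, 23.89),
        "Stairs": (5.36, 23.75), "Office": (8.28, 13.02), "Room": (10.79, 19.34),
    },
    "modification_1": {
        "Outside": (0.14, 3.31), "Small Office": (12.98, 21.31),
        "Stairs": (5.36, 22.95), "Office": (8.28, 11.48), "Room": (10.79, 17.48),
    },
    "modification_2": {
        "Outside": (0.14, 1.13), "Small Office": (12.98, 19.49),
        "Stairs": (5.36, 21.12), "Office": (8.28, 9.95), "Room": (10.79, 15.64),
    },
}

def improvements(table):
    """Stage 2 minus Stage 1 SNR for each environment, rounded to 2 decimals."""
    return {env: round(s2 - s1, 2) for env, (s1, s2) in table.items()}

gains = {name: improvements(table) for name, table in results.items()}

# True if the proposed method has the largest gain in every environment.
proposed_is_best = all(
    gains["proposed"][env] > gains["modification_1"][env] > gains["modification_2"][env]
    for env in gains["proposed"]
)
```

With the tabulated values, `proposed_is_best` evaluates to `True`, which matches the ordering of the three methods described in the text.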
4.1. Application to Audio Forensics

The estimated background noise of a speech signal can be used to verify its authenticity. This is an application of background noise to digital audio forensics. As stated earlier, speech signals have been recorded in five different environments. To test the effectiveness of the proposed method for audio forensics applications, the background noise has been estimated for each signal, and the estimated noise signals have been analyzed for their uniqueness. We expect the estimated background noise for different environments to be mutually independent. To validate this notion, a correlation analysis has been performed on the estimated background noise for all recording environments. The results of this analysis are given in Table 4. It is clear from the table that the correlation coefficients for two different environments have very small values, whereas the correlation coefficient of each environment with itself takes the maximum value, i.e., 1. These results agree with the expectation that the estimated background noise of different environments is uncorrelated.

Table 4. Correlation Analysis Results

Speech Environment   Office   Small Office   Room    Stairs   Outside
Office               1.000    0.018          0.037   0.020    0.048
Small Office         0.018    1.000          0.019   0.019    0.025
Room                 0.037    0.019          1.000   0.021    0.082
Stairs               0.020    0.019          0.021   1.000    0.034
Outside              0.048    0.025          0.082   0.034    1.000

To test the performance of the proposed scheme in more practical speech authenticity scenarios, sixteen (16) tampered or doctored speech samples have been generated from the existing dataset. The purpose is to estimate the background noise of each sample using the proposed method, and then to perform a correlation analysis to verify whether the sample has been doctored. The doctored samples and their recording environments are listed in Table 5. The background noise has been estimated for each sample using the proposed method, and a correlation analysis has been performed. The results of this analysis are given in Table 6. It can be observed from Table 6 that there is a relatively higher correlation coefficient between the test sample and the contributing audio signals (except for Sample 6, which consists of Small Office and Room recordings), indicating the dependence of the test sample on the underlying speech signals.

Table 5. Doctored Samples & their Recording Environments

Doctored Sample   Recording Environment
Sample 1          Office, Small Office
Sample 2          Office, Room
Sample 3          Office, Stairs
Sample 4          Office, Outside
Sample 5          Small Office, Office
Sample 6          Small Office, Room
Sample 7          Small Office, Stairs
Sample 8          Small Office, Outside
Sample 9          Room, Office
Sample 10         Room, Small Office
Sample 11         Room, Stairs
Sample 12         Room, Outside
Sample 13         Stairs, Office
Sample 14         Stairs, Small Office
Sample 15         Stairs, Room
Sample 16         Stairs, Outside

Table 6. Correlation Analysis of Doctored Samples

Doctored Sample   Office   Small Office   Room    Stairs   Outside
Sample 1          0.516    0.092          0.025   0.019    0.025
Sample 2          0.519    0.016          0.123   0.020    0.033
Sample 3          0.175    0.022          0.026   0.257    0.035
Sample 4          0.081    0.024          0.081   0.032    0.604
Sample 5          0.256    0.615          0.020   0.019    0.032
Sample 6          0.022    0.050          0.065   0.016    0.018
Sample 7          0.019    0.226          0.019   0.520    0.036
Sample 8          0.041    0.225          0.078   0.030    0.568
Sample 9          0.272    0.017          0.643   0.018    0.081
Sample 10         0.027    0.440          0.480   0.019    0.067
Sample 11         0.020    0.022          0.230   0.530    0.036
Sample 12         0.044    0.024          0.230   0.030    0.565
Sample 13         0.108    0.022          0.021   0.697    0.038
Sample 14         0.021    0.241          0.020   0.655    0.035
Sample 15         0.022    0.022          0.179   0.669    0.039
Sample 16         0.036    0.018          0.059   0.496    0.408
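The correlation analysis of Section 4.1 can be illustrated with a small synthetic sketch. It assumes that the noise estimates are equal-length, time-aligned signals and that the similarity measure is the absolute Pearson correlation coefficient; the paper does not spell out either detail. Independent white noise sequences stand in for the per-environment noise estimates, and a hypothetical doctored sample is built by splicing two of them.

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation coefficient of two equal-length signals."""
    return abs(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n = 100_000
# Synthetic stand-ins for the estimated background noise of each environment.
noise = {env: rng.standard_normal(n) for env in ("Office", "Stairs", "Outside")}

# Hypothetical doctored sample: first half from Office, second half from Stairs.
half = n // 2
doctored = np.concatenate([noise["Office"][:half], noise["Stairs"][half:]])

# Correlate the doctored sample's noise with each environment's noise.
scores = {env: abs_corr(doctored, sig) for env, sig in noise.items()}

# Environments whose score clearly exceeds the cross-environment level.
# The 0.2 threshold is an arbitrary illustrative choice.
contributors = sorted(env for env, r in scores.items() if r > 0.2)
```

Under these assumptions the spliced signal correlates at roughly 0.5 with each of its two source environments and near zero with the third, so more than one elevated score points to a spliced recording. Real noise estimates are neither white nor perfectly aligned, so the values in Table 6 vary more than in this idealized case.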
5. CONCLUSION

In this paper, a novel audio forensics analysis method using background noise has been presented. The paper highlights a limitation of existing speech enhancement algorithms, namely the presence of a speech leakage signal in the estimated noise. A noise estimate that contains speech leakage cannot be used for audio forensics applications. To overcome this problem and improve the noise estimate, a novel two-step method has been proposed. In the first step, spectral subtraction based on the geometric approach is used to obtain the initial noise estimate. The second step exploits the higher harmonic structure of the speech signal to remove speech from the initial noise estimate, using multi-band spectral subtraction for the harmonic analysis. The results show that the proposed method performs better than the existing speech enhancement algorithms. For applications to digital audio forensics, it has been shown that the estimated background noise can be used to determine the integrity of a test audio clip. Simulation results also show that the background noise can be used to detect tampered speech signals. Currently, we are working on extending the background noise based audio authentication method to determine the location of tampering in the test audio signal. We are also evaluating the performance of the proposed framework using publicly available speech datasets such as SQAM (speech quality assessment material).

6. REFERENCES

[1] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Communication, vol. 50, pp. 453–466, 2008.

[2] S. D. Kamath and P. C. Loizou, “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, ICASSP-02, vol. 4, pp. 323–327, Apr. 2002.

[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.

[4] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, ICASSP-96, vol. 2, pp. 629–632, May 1996.

[5] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.

[6] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.

[7] S. Gazor and W. Zhang, “A soft voice activity detector based on a Laplacian-Gaussian model,” IEEE Trans. on Speech and Audio Processing, vol. 11, no. 5, pp. 498–505, Sep. 2003.