Australian Journal of Forensic Sciences
ISSN: 0045-0618 (Print) 1834-562X (Online) Journal homepage: http://www.tandfonline.com/loi/tajf20
A novel audio forensic data-set for digital multimedia forensics
Muhammad Khurram Khan, Mohammed Zakariah, Hafiz Malik & Kim-Kwang Raymond Choo
To cite this article: Muhammad Khurram Khan, Mohammed Zakariah, Hafiz Malik & Kim-Kwang Raymond Choo (2017): A novel audio forensic data-set for digital multimedia forensics, Australian Journal of Forensic Sciences, DOI: 10.1080/00450618.2017.1296186
To link to this article: http://dx.doi.org/10.1080/00450618.2017.1296186
Published online: 24 Mar 2017.
A novel audio forensic data-set for digital multimedia forensics
Muhammad Khurram Khan(a), Mohammed Zakariah(b), Hafiz Malik(c) and Kim-Kwang Raymond Choo(d)
(a) Center of Excellence in Information Assurance (CoEIA), King Saud University, Riyadh, Saudi Arabia; (b) College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia; (c) Department of Electrical and Computer Engineering, University of Michigan-Dearborn, Dearborn, MI, USA; (d) School of Information Technology & Mathematical Sciences, University of South Australia, Adelaide, Australia
ABSTRACT
Today, digital multimedia (audio, video, images) is a common evidential source in litigation and criminal justice proceedings, and, not surprisingly, multimedia forensics is an active research area. One particular challenge faced by multimedia forensic researchers is the lack of a comprehensive and publicly available data-set for evaluating existing and new algorithms. This paper presents a digital audio forensic data-set, designed to facilitate evaluation of audio forensic algorithms (e.g. microphone identification, acoustic environment identification and splice detection). This paper also briefly describes the data-collection settings, microphones, speakers, languages and notations used. Existing tamper-detection techniques rely on artefacts due to recording devices, codec and/or the acoustic environment in the audio in question. Experimental results show that the selected approaches achieved promising results.
ARTICLE HISTORY
Received 29 May 2016; Accepted 6 February 2017
KEYWORDS
Digital forensics; digital authentication; audio forensics; audio forensic data-set; speech recognition; multimodality
I. Introduction
Designing and evaluating authentication for digital forensic data, protocols and multimedia is crucial in many application domains, including the military, defence and healthcare, partly because of constant advances in modern-day technologies. Audio forensics refers to the evaluation and analysis of audio recordings, commonly for verifying the integrity and authenticity of evidence in a court of law. Audio forensic algorithms help to establish the authenticity of audio evidence; enhance audio recordings to improve speech transparency and the audibility of low-level sounds; and document and interpret audio evidence, for example by identifying speakers, clarifying dialogue and reconstructing crime or accident scenes 1. Thus, it is important to ensure the forensic soundness of the algorithm or technique used, and one way to evaluate forensic soundness is to benchmark the algorithm using widely used forensic audio databases 2.
CONTACT Muhammad Khurram Khan [email protected]
© 2017 Australian Academy of Forensic Sciences
Enhancing audio files involves expertise in 'cleaning' or 'reducing' unwanted noise without 'damaging' the original recording. For multimedia, including audio, to be admissible as evidence in a court of law, its authenticity must be verified. The amount of reverberation and background noise variance in an audio recording is estimated in 3, while a novel method for acoustic environment identification is proposed in 4. In 5, ten different environment sounds are detected using MPEG-7 descriptors together with conventional MFCC features. A study of the automatic identification of eight telephone handsets and eight microphones is presented in 6. In 7, the authors proposed an audio recorder identification method for digital forensics. Authenticity verification is a complex and challenging task, particularly in the absence of metadata 8. One of the challenges faced in multimedia forensics is the lack of a publicly available forensic data-set for performance evaluation and benchmarking of existing as well as new multimedia forensic algorithms. In the early 1960s, the Federal Bureau of Investigation started recruiting experts in audio forensics to improve the speech intelligibility, enhancement and authentication of recorded files 9. Investigations of the high-profile assassinations of President John F. Kennedy in Dallas in 1963 10 and of presidential candidate Sen. Robert Kennedy in Los Angeles in 1968 involved acoustic evidence. Other applications of audio forensic investigation include analysing conversations and background sounds from cockpit voice recorders following an aircraft incident 11, and verifying the authenticity of, and enhancing, recordings of terrorists 12. Other research groups 13–16 have also tried to address issues related to the evaluation of authenticity and integrity verification in digital audio forensics, albeit on limited data-sets.
In such forensic investigations, researchers generally use speaker recognition data-sets such as TIMIT 17 for performance evaluation, even though TIMIT was not developed for such an application. Researchers have used audio data-sets to identify the microphone 18–22 or the environment in which the audio recording was made 22–25. However, the lack of diversity, for instance in the number of microphones used, the types of acoustic environment in which recordings were made and the number of speakers, limits the applicability of these data-sets in a forensic context. One of the contributions of this paper is to fill this gap. Specifically, we comprehensively study a wide range of microphone distortions, and identify robust features useful in microphone forensic investigations, namely: mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive (PLP) coefficients and Gabor filter bank coefficients (proposed by Schadler et al. 26). The purpose of this study is to create a digital audio forensic database covering different languages, different microphones and different environments. To the best of our knowledge, this is the first publicly available data-set designed for evaluating and benchmarking audio forensic algorithms. The rest of the manuscript is organized as follows. Section II briefly describes the proposed digital audio forensic data-set. Sections III and IV discuss the data-acquisition steps and the results, respectively. Section V concludes the paper.
II. Proposed data-set
In its current form, the data-set can be used for performance evaluation of audio forensics methods, including microphone identification, acoustic environment identification, codec identification and double compression detection. In the following, we provide a brief overview of the settings used for data collection, the naming convention, the types of microphones used and the acoustic environments where recordings were made.
A. Contents
The proposed data-set contains 660 audio files recorded in four different languages (Arabic, Bahasa Indonesia, Chinese and English). The duration of each file is approximately three minutes, with the first minute of each file being silence. We recorded 264 files in Arabic by two speakers, 132 files by a non-native English speaker, 132 files by a Chinese speaker and 132 files by an Indonesian speaker. These files were recorded using 22 different microphones (see Section II.B) in six different acoustic environments, namely: a soundproof (quiet) room, a classroom, a laboratory, staircases, a parking area and a garden. Seventy-two (12 × 6) sessions were recorded using the microphones in each acoustic environment. For each session, a person read a predefined text while sitting approximately 30 cm from the microphone. Each recording was manually aligned to remove starting and ending silence regions. The collected data-set (hereafter referred to as the Digital Multimedia Forensics Data-set, DMFDS) is available for educational and research purposes only at 27.
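The stated composition can be sanity-checked with a small arithmetic sketch (the per-language counts below simply restate the figures given above):

```python
# Sanity check of the DMFDS composition described above: 264 Arabic files
# (two speakers) plus 132 files each for English, Chinese and Indonesian
# should total 660 recordings.
per_language = {"Arabic": 264, "English": 132, "Chinese": 132, "Indonesian": 132}
total = sum(per_language.values())
print(total)  # 660, matching the stated data-set size

# 12 sessions per environment across 6 environments gives the
# 72 recording sessions mentioned in the text.
sessions = 12 * 6
print(sessions)  # 72
```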
B. Data-collection settings
A Zoom R16 recorder was used to record the audio files. In the following, we describe the hardware and software settings and parameters used during the data-collection phase. A conceptual block diagram of the data collection, covering purchasing the equipment, labelling the microphones, choosing the recording settings, organizing the recorded original RAW files into folders, generating tampered files, repeating for all experiments and conducting the experiments, is shown in Figure 1.

Figure 1. Conceptual block diagram for data-set collection.

Hardware: for the recording sessions, multitrack recording was performed through the Zoom R16 recorder/interface/controller 28. The Zoom R16, a multitrack recorder, has the following capabilities: (1) data storage on 32 GB SDHC cards, (2) handling of various input sources including guitars, microphones and line-level equipment, and (3) recording using two built-in microphones. It also has comprehensive built-in mixer features and a USB port for data transfer.

Table 1. Microphones used.
Microphone make     Model       No. of microphones
Shure               SM-58       3
Electro-Voice       RE-20       2
Sennheiser          MD-421      3
AKG                 C 451       2
AKG                 C 3000 B    2
Neumann             KM184       2
Coles               4038        2
t.bone              MB88U       6
Total                           22

Sensitivity knob: the Zoom R16 has sensitivity knobs to adjust the input sensitivity (to avoid microphone saturation), the monitoring level and the output, and to adjust the recording level. A gain option is also available for each input line. Parameters: the following parameters are available in the Zoom R16 mixer: Sense (0–10), Attack (Compressor, Rack Comp), Tone (0–10), Level (2–100), Threshold (0–50), Ratio (1–10), Release (1–10). For every data-collection session, the Zoom R16 was set to its defaults: gain was set to OFF and sensitivity to 50%.
C. Microphones used
The digital audio recordings were collected using 22 microphones of eight different types/models. Table 1 shows the microphones used for data collection; note that there were at least two identical microphones of each model. We used these microphones to collect speech recordings at six different locations: (1) a quiet (soundproof) room, (2) a computer laboratory, (3) a classroom, (4) stairs, (5) a parking area and (6) a garden. Figure 2 shows the microphone set-up in each acoustic environment. The quiet room is a soundproof room almost free from noise, whereas all the remaining acoustic environments had background noise from CPUs, air conditioners, people walking, vehicle engines, birds, wind and other sources. The frequency response of each microphone was investigated in these six acoustic environments.
D. Naming convention
The following rules are used to assign a unique name to each file in the data-set:
• The file name is 20 characters long and consists of six sections separated by underscores.
• The first section, consisting of three characters, indicates the microphone make.
• The second section, consisting of four characters, indicates the microphone model, as in Table 2.
• The third section, consisting of two characters, identifies a specific microphone within a set of microphones of the same make and model, since we have more than one microphone of the same make and model.
Figure 2. Audio recording set-ups in the six selected acoustic environments (a)–(f).
• The fourth section, consisting of two characters, indicates the acoustic environment, where:
  • Soundproof room = 01
  • Classroom = 02
  • Lab = 03
  • Stairs = 04
  • Parking = 05
  • Garden = 06
• The fifth section, consisting of two characters, indicates the language, where:
  • Arabic = 01
  • English = 02
  • Chinese = 03
  • Indonesian = 04
• The sixth section, consisting of two characters, indicates the speaker.
For instance, the file name SEN_0421_02_01_02_03 indicates that speaker number 03 was speaking English in the quiet room, recorded with the second microphone of model Sennheiser MD-421. Table 2 shows the naming convention for each microphone.
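The naming convention above can be decoded mechanically. The following is an illustrative sketch (the `parse_dmfds_name` helper and its field names are our own, not part of the data-set):

```python
# Hypothetical parser for the 20-character DMFDS naming convention
# (make_model_mic_env_lang_speaker); the field codes follow the
# lists given in this section.
ENVIRONMENTS = {"01": "soundproof room", "02": "classroom", "03": "lab",
                "04": "stairs", "05": "parking", "06": "garden"}
LANGUAGES = {"01": "Arabic", "02": "English", "03": "Chinese", "04": "Indonesian"}

def parse_dmfds_name(name: str) -> dict:
    """Split a DMFDS file name into its six underscore-separated fields."""
    make, model, mic, env, lang, speaker = name.split("_")
    return {
        "make": make,                 # 3 characters, e.g. SEN
        "model": model,               # 4 characters, e.g. 0421
        "microphone_unit": int(mic),  # which identical unit was used
        "environment": ENVIRONMENTS[env],
        "language": LANGUAGES[lang],
        "speaker": int(speaker),
    }

info = parse_dmfds_name("SEN_0421_02_01_02_03")
print(info["environment"], info["language"], info["speaker"])
# soundproof room English 3
```

This reproduces the worked example in the text: speaker 03 speaking English in the quiet room with the second Sennheiser MD-421.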
Figure 3. Complete flow of the paper.
Figure 4. Verification flow.
Table 2. Microphone naming convention.
Microphone make and model    Naming convention
Shure SM-58                  SHU_0058
Electro-Voice RE-20          ELE_0020
Sennheiser MD-421            SEN_0421
AKG C 451                    AKG_0451
AKG C 3000 B                 AKG_3000
Neumann KM184                NEU_0184
Coles 4038                   COL_4038
t.bone MB88U                 TBO_0088
Table 3. Audio file specification.
File format      .wav
Encoding         PCM
Codec ID         1
Bitrate          705.6 kbps
Sampling rate    44.1 k samples/s
Bit depth        16-bit
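The bitrate in Table 3 follows directly from the other parameters, assuming single-channel recording (a mono channel count is our inference, not stated in the table):

```python
# The 705.6 kbps figure in Table 3 is consistent with single-channel
# 16-bit PCM at 44.1 kHz: bitrate = sampling_rate * bit_depth * channels.
sampling_rate = 44_100   # samples per second
bit_depth = 16           # bits per sample
channels = 1             # mono recording assumed

bitrate_kbps = sampling_rate * bit_depth * channels / 1000
print(bitrate_kbps)  # 705.6
```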
Table 4. Microphone features and specifications 29.
Microphone                                       Units   Naming convention
Shure SM-58 (Mic1)                               3       SHU_0058
Electro-Voice RE-20 (Mic2)                       2       ELE_0020
Coles 4038 Ribbon Microphone (Mic3)              2       COL_4038
Sennheiser MD421II Dynamic Cardioid (Mic4)       3       SEN_0421
AKG C 451 B Condenser Microphone (Mic5)          2       AKG_0451
Table 5. Description of tampered data-set generation 29.
No.   File name            Short name   Source file    Destination file   Start time   End time
1     T_AKG_0451_m1_m2     T_AKG        AKG_0451_m2    AKG_0451_m1        0:01:10      0:01:17
2     T_ELE_0020_m1_m2     T_ELE        ELE_0020_m2    ELE_0020_m1        0:01:30      0:01:47
3     T_SEN_0421_m3_m1     T_SEN_B      SEN_0421_m1    SEN_0421_m3        0:01:20      0:01:36
4     T_SHU_0058_m2_m3     T_SHU_B      SHU_0058_m3    SHU_0058_m2        0:01:10      0:01:28
5     T_COL_4038_m2_m1     T_COL        COL_4038_m1    COL_4038_m2        0:02:10      0:02:29
6     T_SEN_0421_m1_m2     T_SEN_A      SEN_0421_m2    SEN_0421_m1        0:02:40      0:02:57
7     T_SHU_0058_m1_m3     T_SHU_A      SHU_0058_m3    SHU_0058_m1        0:02:30      0:02:45
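The segment-replacement step summarized in Table 5 can be sketched as follows. This is an illustrative reconstruction, not the authors' actual tooling: the `splice` helper and the toy sample lists are hypothetical, and real tampered files would be built from the .wav recordings themselves.

```python
# Illustrative sketch of tampered-file generation: overwrite the
# destination samples between the start and end times with the
# corresponding region from a recording made by an identical microphone.
def splice(dest: list, src: list, start_s: float, end_s: float, sr: int) -> list:
    """Replace dest's [start_s, end_s) region (in seconds) with src's."""
    start, end = int(start_s * sr), int(end_s * sr)
    if end > min(len(dest), len(src)):
        raise ValueError("splice region exceeds recording length")
    return dest[:start] + src[start:end] + dest[end:]

# Toy example at sr = 10 Hz: replace seconds 0.2-0.5 of 'a' with 'b'.
sr = 10
a = [0.0] * 10
b = [1.0] * 10
tampered = splice(a, b, 0.2, 0.5, sr)
print(tampered)  # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```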
III. Data-acquisition flowchart
Figure 3 shows a flowchart describing the data-acquisition process. The audio forensic data-set consists of audio recordings made in six different acoustic environments using 22 different microphones. The purpose of this data collection is to provide the digital audio forensic research community with a data-set that can be used for benchmarking and performance evaluation. In the following sections, the performance of our existing audio forensics analysis methods 4,28,29, based on acoustic environment identification, microphone identification and forgery detection, is evaluated on the DMFDS 27.
IV. Verification of the data-set
In this section, the effectiveness of three existing audio forensic methods 4,29,30 is evaluated on the DMFDS. To this end, the following four audio forensics scenarios are considered for performance evaluation (Figure 4):
A. tampering detection 29
B. microphone classification 29,35,36
C. trace error detection within the same microphone model 29
D. splice detection and localization using the acoustic environment signature 30,37.
Table 6. Detection performance (accuracy, %) for the K-NN classifier 29.
Description          T_AKG   T_COL   T_ELE   T_SEN_A   T_SEN_B   T_SHU_A   T_SHU_B
Gabor: 23f_2970d     87.73   96.21   92.38   92.39     94.59     84.30     85.78
PLP: 13f_2980d       51.48   60.75   47.19   33.49     31.86     27.64     26.58
MFCC: 13f_2980d      50.13   60.55   47.81   33.86     32.38     28.20     27.46
Figure 5. Accuracy of the K-NN classifier for tamper identification 29.
Figure 6. Average accuracy of Gabor+PCA, MFCC and PLP based on environments (left) and models (right).
A. Tampering detection
The collected data-set is used for detecting tampering with an audio file. Table 3 gives the audio file specifications. Fourteen audio recordings were used to generate tampered audio files. This study investigates the inter-class dispersion of microphone artefacts. Recordings from five different microphone models (Shure SM-58, Electro-Voice RE-20, Coles 4038 Ribbon, Sennheiser MD421II Dynamic Cardioid and AKG C 451 B Condenser) are considered. It is important to highlight that each selected microphone class has at least two identical microphones of the same make and model. The details of the microphones considered in the study are shown in Table 4. The tampered audio data-set is generated by replacing an audio segment in the target audio with an audio recording made with an identical microphone during the same recording session. Details of the tampered data-set generation are given in Table 5. The data-set of original recordings is used to learn the underlying model of each microphone. As discussed in 28, Gabor filter coefficients, mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) coefficients are used to capture traces of microphone artefacts. The extracted feature dimensions are: reduced-Gabor, 23 features 31; MFCC, 13 features 32; PLP, 13 features 33. For each microphone and each feature vector, a separate K-NN (k-nearest-neighbour) classifier is trained to learn the underlying model. Trained classifiers are then tested on the respective tampered data-sets. The performance (accuracy, %) of the tamper-detection method proposed in 28 for the three feature vectors is shown in Figure 5 and Table 6. As shown there, the 23-feature reduced-Gabor filter bank produced the highest accuracy, followed by MFCC and PLP. From Table 6, the highest rate using the reduced-Gabor feature is 96.21%, for the tampered audio of the COL_4038 model.

Figure 7. Forgery detection and localization results using Zhao et al.'s 4,28,30 method – relatively large splicing.

Discussion: the aim of this preliminary study is to determine whether microphone fingerprints can be used for forgery detection, and to investigate the intra-class variability of microphone fingerprints. The classifier is trained on data collected
Figure 8. Forgery detection and localization results using Zhao et al.'s method 4,28,30 – small splicing, part a.

Table 7. Best recognition rates of the AKG_0451 model.
Features desc.          Cabin    Class    Lab      Stairs   Parking   Garden
Gabor+PCA: 23f_2970d    83.27%   65.86%   69.46%   85.29%   85.79%    94.98%
MFCC: 13f_2980d         54.06%   56.31%   63.56%   49.63%   57.62%    61.75%
PLP: 13f_2980d          53.99%   60.03%   71.81%   50.23%   62.38%    66.31%
for each microphone. The effectiveness of all three feature groups, that is, Gabor filter coefficients, MFCC and PLP, is also compared. Features extracted from the original recordings are used for training, and features extracted from the tampered data-set are used for testing. Results presented here are based on 10-fold cross-validation. The classifier recognition rates for tampered audio files are shown in Figure 5 and Table 6. It can be observed that a microphone fingerprint based on reduced-Gabor features is a strong discriminator that can identify a microphone even against an identical model. The accuracy of reduced-Gabor for the other models is also promising, with detection rates of at least 84%. In contrast, MFCC and PLP show poor and non-robust results: according to Table 6, they achieve maximum detection rates of only around 61%, and both features give very low accuracies (less than 34% detection rates for models SEN_0421 and SHU_0058). These results indicate that Gabor filter coefficients are capable of capturing microphone artefacts and are potential candidates for feature extraction.

Figure 9. Forgery detection and localization results using Zhao et al.'s method 4,28,30 – small splicing, part b.

Table 8. Best recognition rates of the COL model.
Features desc.          Cabin    Class    Lab      Stairs   Parking   Garden
Gabor+PCA: 23f_2970d    96.63%   84.51%   64.95%   88.72%   93.80%    99.36%
MFCC: 13f_2980d         57.62%   56.31%   54.19%   52.82%   61.85%    73.09%
PLP: 13f_2980d          60.67%   59.03%   56.78%   54.97%   67.08%    77.15%

Table 9. Best recognition rates of the ELE model.
Features desc.          Cabin    Class    Lab      Stairs   Parking   Garden
Gabor+PCA: 23f_2970d    97.47%   77.27%   57.00%   86.13%   78.96%    92.96%
MFCC: 13f_2980d         56.88%   47.85%   54.33%   48.76%   52.99%    52.21%
PLP: 13f_2980d          56.98%   47.11%   55.27%   50.23%   53.56%    52.15%
Table 10. Best recognition rates of the SEN model.
Features desc.          Cabin    Class    Lab      Stairs   Parking   Garden
Gabor+PCA: 23f_2970d    88.93%   59.51%   40.36%   78.59%   65.86%    94.86%
MFCC: 13f_2980d         35.77%   33.74%   41.05%   29.33%   35.84%    35.17%
PLP: 13f_2980d          34.47%   34.70%   44.59%   28.50%   36.13%    36.98%

Table 11. Best recognition rates of the SHU model.
Features desc.          Cabin    Class    Lab      Stairs   Parking   Garden
Gabor+PCA: 23f_2970d    84.38%   56.34%   41.59%   65.99%   60.63%    84.29%
MFCC: 13f_2980d         37.36%   32.55%   33.11%   29.62%   29.44%    25.95%
PLP: 13f_2980d          37.02%   30.76%   32.19%   28.34%   29.66%    24.45%
B. Microphone classification
The goal of this experiment is to evaluate the reliability of the collected audio forensic data-set for microphone forensics of identical models. The main objective is to discover robust features for microphone identification. Two feature-extraction methods widely used for microphone identification are considered, i.e. MFCCs 32 and PLP 33. In addition, the Gabor filter bank feature-extraction method proposed by Schadler et al. 31 is included in the evaluation. The Gabor filter bank generates a large number of features (311); hence, dimensionality reduction using PCA is applied to reduce the feature count. These three feature-extraction methods are applied to the collected database to produce a fingerprint for each microphone. The extracted features are used to train a K-NN classifier, which is then used for microphone classification. Results presented here are based on 10-fold cross-validation. The numbers of features for each fingerprint are as follows: Gabor+PCA, 23 features 31; MFCC, 13 features 32; PLP, 13 features 33. Tables 7 to 11 present the final recognition results for the different microphone models, and Figure 6 depicts the average accuracy for each feature in terms of environments and microphone models. Judging by these results, it can be concluded that the Gabor+PCA method outperforms PLP and MFCC. From Table 7, the highest accuracy using the Gabor+PCA method is 94.98%, in the garden environment. This means that the fingerprint of the AKG_0451 model based on the Gabor+PCA method can discriminate between two identical AKG_0451 microphones with a high detection rate. Meanwhile, the lowest accuracy for AKG_0451 using the Gabor+PCA method is 69.46%, in the laboratory, which is about 2% less than with PLP.
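The classification step described above can be illustrated with a minimal k-nearest-neighbour sketch. The `knn_predict` helper and the toy two-dimensional points are hypothetical; the actual experiments use 23-dimensional Gabor+PCA (or 13-dimensional MFCC/PLP) fingerprints and 10-fold cross-validation.

```python
import math

# Minimal k-NN sketch of the microphone-classification step.
# Real experiments use Gabor+PCA/MFCC/PLP feature vectors; the toy
# 2-D points below only illustrate the mechanics.
def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs. Returns the
    majority label among the k nearest neighbours (Euclidean)."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Two 'identical microphones' with slightly different toy fingerprints.
train = [((0.10, 0.20), "mic_a"), ((0.15, 0.22), "mic_a"), ((0.12, 0.18), "mic_a"),
         ((0.90, 0.80), "mic_b"), ((0.85, 0.82), "mic_b"), ((0.88, 0.79), "mic_b")]

print(knn_predict(train, (0.13, 0.20)))  # mic_a
print(knn_predict(train, (0.87, 0.81)))  # mic_b
```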
PLP and MFCC produce quite similar recognition rates, with an average difference of about 4%. The Gabor+PCA method yields a recognition rate of 99.36% for the COL model in the garden, as shown in Table 8; this is the best result attained using the Gabor+PCA method. For this model, Gabor+PCA is robust in all six environments, with an accuracy no lower than 65%. Meanwhile, the MFCC and PLP methods are about 10% more accurate for the COL model than for the AKG_0451 model. In Table 9, Gabor+PCA again shows a better rate than PLP and MFCC; its best recognition rate is 97.47%, in the quiet room. However, the accuracy of the Gabor+PCA, MFCC and PLP methods decreases for the ELE model: Gabor+PCA drops to 57%, even though it remains the best method for this model, while the accuracy of MFCC and PLP similarly decreases to 47%. According to Table 10, for the SEN model PLP is again better than Gabor+PCA in the laboratory environment, with accuracies of 44.59% and 40.36%, respectively. However, Gabor+PCA still outperforms the other features in five of the six environments. Gabor+PCA achieves its best result for the SEN model in the garden (94.86%) and its worst in the laboratory (40.36%), the lowest among the models. Similarly, MFCC and PLP show low accuracy (only about 35% for both, on average). For the last model, the SHU, Gabor+PCA again achieves the highest result in all environments. As Table 11 shows, Gabor+PCA has its highest accuracy in the quiet room and its lowest in the laboratory, at 84.38% and 41.59%, respectively. According to Figure 6, the SHU model shows the lowest accuracy on average compared with the four other models; the average accuracies for Gabor+PCA, MFCC and PLP in the SHU model are 65.54%, 31.34% and 30.40%, respectively. Figure 6 also shows that Gabor+PCA has the highest average accuracy in all six environments, achieving 93.29% in the garden and 90.14% in the quiet room. Meanwhile, the MFCC fingerprint is less robust than the others, although the differences between MFCC and PLP are insignificant. In addition, Gabor+PCA is robust in terms of average accuracy across microphone models, achieving the highest average accuracy for all models (88%).

Table 12. Statistical analysis of silence recording in the quiet room for Shure SM-58 (Mic1a and Mic1b).
Metrics                       Sa.Mic1a     Sa.Mic1b     |Sa.Mic1a − Sa.Mic1b|
Sigma                         0.15204      −0.12433 (sic) 0.027710
Mu                            −0.014822    −0.010417    0.004405
Peak (crest) factor Q (dB)    16.3198      18.0784      1.758600
Dynamic range D (dB)          31.5957      34.6479      3.052200
Autocorrelation time (s)      54.9212      54.9202      0.001000
Average difference                                      0.968783

Table 13. Statistical analysis of speech recording in the quiet room for Shure SM-58 (Mic1a and Mic1b).
Metrics                       Sa.Mic1a     Sa.Mic1b     |Sa.Mic1a − Sa.Mic1b|
Sigma                         0.10204      0.10667      0.00463000

Table 14. Statistical analysis of silence recording in the computer lab for Shure SM-58 (Mic1a and Mic1b).
Metrics                       Sa.Mic1a     Sa.Mic1b     |Sa.Mic1a − Sa.Mic1b|
Sigma                         0.21418      0.20988      0.004300
Mu                            −0.0088191   −0.0085453   0.0002738
Peak (crest) factor Q (dB)    13.3772      13.5534      0.176200
Dynamic range D (dB)          36.1236      36.3909      0.267300
Autocorrelation time (s)      31.2285      45.2657      14.037200
Average difference                                      2.89705476
Table 15. Statistical analysis of speech recording in the computer lab for Shure SM-58 (Mic1a and Mic1b).
Metrics                       Sa.Mic1a      Sa.Mic1b      |Sa.Mic1a − Sa.Mic1b|
Sigma                         0.06243       0.066088      0.00365800
Mu                            −0.00012704   −0.00012876   0.00000172
Peak (crest) factor Q (dB)    24.0922       23.5976       0.49460000
Dynamic range D (dB)          72.93         72.8096       0.12040000
Autocorrelation time (s)      0.071315      0.07127       0.00004500
Average difference                                        0.12374094
C. Microphone trace error detection with the same models
The aim of this experiment is to investigate digital traces within the same microphone model. To this end, we used recordings made with two identical microphones of the same model in the same acoustic environment. For the rest of the paper, the term 'identical microphones' means microphones of the same make and model. The microphone model selected for this study is the Shure SM-58, for which two identical microphones are used. It would be of interest to study digital traces from more examples of identical models; at this stage, however, we assume that two microphones are enough to explore the differences between identical units. The collected audio content is analysed for time-domain statistical features such as the mean, standard deviation, peak factor Q, dynamic range and auto-correlation 34. First, the audio signal is divided into silence and speech segments; silence regions are extracted manually. In this experiment, the comparison is between identical microphones using the same recording types and the same environments. For example, a silence recording made in the quiet room with one Shure SM-58 microphone is compared with a silence recording made in the quiet room with the other, identical Shure SM-58 microphone. Technique: five time-domain statistical attributes (the mean, standard deviation, peak factor Q, dynamic range and auto-correlation) are calculated and compared for the identical microphones. The experimental results are presented as follows:
• microphone forensics in quiet room recordings
• microphone forensics in computer lab recordings
Referring to the statistical analysis in Table 12, the largest difference between Mic1a and Mic1b is in the dynamic range. Mic1b produces a higher dynamic range than Mic1a, meaning that Mic1b generates more noise in the silence recording; the reason is probably a manufacturing defect in Mic1b.
The peak (crest) factor Q of Mic1b is also higher than that of Mic1a. Finally, both identical microphones produce a similar maximum auto-correlation value, close to 55 s. Experimental results for the speech recording show more stability, with higher similarity than the silence recording discussed above: the statistical analysis in Table 13 shows that the average difference between the two identical microphones is much smaller than for the silence recording in Table 12. In the laboratory recordings, Mic1b again shows anomalous auto-correlation in the silence recording, as presented in Table 14, and the difference in auto-correlation values is higher than in the quiet room (Table 12). However, the other metrics (sigma, mu, peak crest factor, dynamic range) do not differ much from Mic1a, as shown in Table 14.
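The five time-domain statistics used in Tables 12–15 can be sketched as below. The definitions are common conventions and may differ in detail from the authors' exact formulas (in particular the dB references for Q and D, and the autocorrelation-time metric, which is omitted here for brevity, are our assumptions):

```python
import math

# Sketch of the time-domain statistics compared for identical microphones.
# Crest factor Q is taken as peak/RMS in dB and dynamic range D as
# peak/smallest-nonzero-magnitude in dB; both are assumed conventions.
def time_domain_stats(x):
    n = len(x)
    mu = sum(x) / n                                        # mean
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # std deviation
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    crest_db = 20 * math.log10(peak / rms)                 # peak (crest) factor Q
    floor = min(abs(v) for v in x if v != 0)
    dr_db = 20 * math.log10(peak / floor)                  # dynamic range D
    return {"mu": mu, "sigma": sigma, "crest_db": crest_db, "dr_db": dr_db}

# Toy signal: a quiet hum with one louder sample.
x = [0.01, -0.01, 0.01, -0.01, 0.5]
stats = time_domain_stats(x)
print(round(stats["dr_db"], 1))  # 20*log10(0.5/0.01), about 34 dB
```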
Looking at the speech recordings, we found that both identical microphones produce almost the same values. From Table 15, it can be concluded that in this speech recording both identical microphones produce quite similar signals. Comparing this result with the earlier speech recording in the quiet room (Table 13), both show a very similar pattern: speech recordings in the quiet room and in the computer laboratory have similar statistical values.
D. Splice detection and localization using the acoustic environment signature
One common form of tampering in digital audio is splicing, where sections of one audio recording are inserted into another. This experimental study investigates the effectiveness of the acoustic environment signature for splice detection and localization. Recently, Zhao et al. 28,30 proposed an audio splice detection method based on acoustic environment cues; this experiment evaluates the effectiveness of their method on the proposed data-set. To this end, the magnitudes of the acoustic channel impulse response and the ambient noise are used to model the intrinsic acoustic environment signature, and to detect splicing and identify its location. The motivation for combining the acoustic channel impulse response and ambient noise is that acoustic reverberation can be used for acoustic environment identification, that is, to determine where a recording was made. In our recent work 3,4,23,25,28,30,35, we showed that acoustic reverberation and ambient noise can be used for acoustic environment identification; a limitation of those methods is that they cannot identify the splicing location. To address this limitation, the magnitude of the channel impulse response is used for audio splicing detection and localization. An advantage of this approach is that it makes minimal assumptions about the recording, and the method 30 is robust to lossy-compression attacks. Here, we exploit artefacts introduced at the time of recording as an intrinsic signature and use it to authenticate the integrity of an audio recording, considering the acoustic channel impulse response and the ambient noise jointly. To this end, each input audio recording is divided into overlapping frames.
For each frame, the magnitude of the channel impulse response and the ambient noise are jointly estimated using spectrum classification techniques. The similarity between the signatures estimated from the query frame and the reference frame is then computed and used to decide whether the query frame is spliced: a frame is detected and localized as spliced if its similarity score with the reference frame falls below a threshold. A refinement step is applied to further reduce detection and localization errors.

Figures 7–9 show the experimental results for audio recordings made with t.bone microphones. The title of each sub-figure is the audio name, and the points marked with red stars represent the ground truth. It can be observed from the figures that our method detects spliced frames in most cases (e.g. Figure 7(a–e), Figure 8(a,b,c,e) and Figure 9(a,b,d,f)). The test also produced some false negatives, as shown in Figure 8(d,f) and Figure 9(e); these could be attributed to the small size of the forgery locations in the test audio. Figure 8(d,f) and Figure 9(e) show that only a few frames were modified in the tampered audio, in which case it is difficult to obtain a reliable signature estimate, indicating that the method is less successful on tampered audio with small insertions. Extensive experimentation also showed that the larger the insertion in the tampered audio, the easier it is to detect. Overall, the proposed algorithm achieved a detection performance of 90% on the developed database, and in most cases it successfully detected the forgery locations with very high confidence.
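The similarity test and refinement step can be sketched as follows. This is a minimal illustration, not the published implementation: the normalised-correlation similarity, the threshold value and the `min_run` refinement parameter are all assumptions made for the example; Zhao et al. 28,30 give the actual signature model and decision rule.

```python
import numpy as np

def detect_spliced_frames(signatures, ref_idx=0, threshold=0.9):
    """Flag frames whose signature correlates poorly with a reference frame.

    signatures : (n_frames, d) array of per-frame environment signatures
    ref_idx    : index of a frame trusted to come from the original recording
    threshold  : hypothetical similarity cut-off (normalised correlation)
    Returns a boolean array, True where a frame is flagged as spliced.
    """
    z = signatures - signatures.mean(axis=1, keepdims=True)
    z /= signatures.std(axis=1, keepdims=True) + 1e-12
    ref = z[ref_idx]
    sim = z @ ref / ref.size          # normalised correlation per frame
    flagged = sim < threshold
    flagged[ref_idx] = False          # the reference frame is trusted
    return flagged

def refine(flags, min_run=3):
    """Refinement step: discard isolated detections shorter than min_run frames."""
    out = flags.copy()
    i = 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1
            if j - i < min_run:       # run too short to be a credible splice
                out[i:j] = False
            i = j
        else:
            i += 1
    return out
```

The refinement mirrors the observation above: a splice spanning only one or two frames yields an unreliable signature estimate, so very short runs of flagged frames are treated as noise.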
V. Conclusion and future work

In this paper, we have presented a new digital audio forensic data-set that can be used for benchmarking existing as well as new audio forensics methods. The performance of three existing audio forensics algorithms 3,28–30 was evaluated on this benchmarking data-set, and the effectiveness of three feature vectors, i.e. Gabor filter coefficients, MFCC coefficients and PLP coefficients, was evaluated for microphone identification and forgery detection. Both inter- and intra-class variability of microphone artefacts was also examined. In future work, we plan to extend the data-set with recordings made using mobile phones. The data-set is publicly available and can be used for future benchmarking and evaluation of audio forensic tools and techniques.
Acknowledgements

The authors express their thanks to F. Kurniawan, S. Khalil, M. Qamhan and M. Al-Hammadi for data collection and assistance.
Disclosure statement

No potential conflict of interest was reported by the authors.
Funding

This project was funded by the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia, award number 12-INF2634-02.
ORCID

Muhammad Khurram Khan http://orcid.org/0000-0001-6636-0533
Mohammed Zakariah http://orcid.org/0000-0002-2488-2605
Hafiz Malik http://orcid.org/0000-0001-6006-3888
Kim-Kwang Raymond Choo http://orcid.org/0000-0001-9208-5336
References

1. Garfinkel SL. Digital forensics research: the next 10 years. Digital Invest. 2010;7:S64–S73.
2. Manchester P. Found sound: an introduction to forensic audio. Sound on Sound. 2010;750:90–95.
3. Malik H. Acoustic environment identification and its applications to audio forensics. IEEE Trans Inf Forensics Secur. 2013;8:1827–1837.
4. Zhao H, Malik H. Audio recording location identification using acoustic environment signature. IEEE Trans Inf Forensics Secur. 2013;8:1746–1759.
5. Muhammad G, Alotaibi Y, Alsulaiman M, Huda MN. Environment recognition using selected MPEG-7 audio features and mel-frequency cepstral coefficients. In: Digital Telecommunications (ICDT), 2010 fifth international conference; 2010; Athens/Glyfada, Greece. p. 11–16.
6. Garcia-Romero D, Espy-Wilson CY. Automatic acquisition device identification from speech recordings. In: Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE international conference; 2010; Dallas, TX. p. 1806–1809.
7. Moon C-B, Kim H, Kim BM. Audio recorder identification using reduced noise features. In: Jeong YS, Park YH, Hsu CH, Park J, editors. Ubiquitous information technologies and applications. Berlin: Springer; 2014. p. 35–42.
8. Wilson JS. MySpace, your space, or our space? New frontiers in electronic evidence. Or L Rev. 2007;86:1201.
9. Koenig BE. Authentication of forensic audio recordings. J Audio Eng Soc. 1990;38:3–33.
10. Ramsey N, Alvarez L, Chernoff H, Dicke R, Elkind J, Feggeler J, Garwin R, Horowitz P, Johnson A, Phinney R. Report of the committee on ballistic acoustics. Commission on Physical Sciences, Mathematics, and Resources. Washington, DC: National Research Council, National Academy Press; 1982.
11. Sachs J. Graphing the voice of terror. Popular Sci. 2003 [cited 2009 Aug 7]. Available from: http://www.popsci.com/scitech/article/2003-02/graphing-voice-terror.
12. Byrne G. Flight 427: anatomy of an air disaster. New York: Springer Science & Business Media; 2002.
13. Gärtner D, Cuccovillo L, Mann S, Aichroth P. A multi-codec audio dataset for codec analysis and tampering detection. In: Audio Engineering Society conference: 54th international conference: audio forensics; 2014; London, UK.
14. Korycki R. Authenticity examination of compressed audio recordings using detection of multiple compression and encoders' identification. Forensic Sci Int. 2014;238:33–46.
15. Liu Q, Sung AH, Qiao M. Detection of double MP3 compression. Cogn Comput. 2010;2:291–296.
16. Yang R, Qu Z, Huang J. Detecting digital audio forgeries by checking frame offsets. In: Proceedings of the 10th ACM workshop on multimedia and security; 2008; Oxford. p. 21–26.
17. Zue V, Seneff S, Glass J. Speech database development at MIT: TIMIT and beyond. Speech Commun. 1990;9:351–356.
18. Carson CP, Ingrisano DR-S, Eggleston KD. The effect of noise on computer-aided measures of voice: a comparison of CSpeechSP and the Multi-Dimensional Voice Program software using the CSL 4300B module and Multi-Speech for Windows. J Voice. 2003;17:12–20.
19. Smits I, Ceuppens P, De Bodt MS. A comparative study of acoustic voice measurements by means of Dr. Speech and Computerized Speech Lab. J Voice. 2005;19:187–196.
20. Buchholz R, Kraetzer C, Dittmann J. Microphone classification using Fourier coefficients. Inf Hiding. 2009:235–246.
21. Ortega-Garcia J, Cruz-Llanas S, Gonzalez-Rodriguez J. Quantitative influence of speech variability factors for automatic speaker verification in forensic tasks. In: International Conference on Spoken Language Processing (ICSLP); 1998; Sydney.
22. Ikram S, Malik H. Digital audio forensics using background noise. In: Multimedia and Expo (ICME), 2010 IEEE international conference; 2010; Singapore. p. 106–110.
23. Malik H, Zhao H. Recording environment identification using acoustic reverberation. In: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE international conference; 2012; Japan. p. 1833–1836.
24. Pan X, Zhang X, Lyu S. Detecting splicing in digital audios using local noise level estimation. In: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE international conference; 2012; Japan. p. 1841–1844.
25. Zhao H, Malik H. Audio forensics using acoustic environment traces. In: Statistical Signal Processing Workshop (SSP), 2012 IEEE; 2012; Ann Arbor, MI. p. 373–376.
26. Schädler MR, Kollmeier B. Separable spectro-temporal Gabor filter bank features: reducing the complexity of robust features for automatic speech recognition. J Acoust Soc Am. 2015;137:2047–2059.
27. http://www.cybertechnos.com/datasets/
28. Zhao H, Chen Y, Wang R, Malik H. Audio splicing detection and localization using environmental signature. Multimedia Tools Appl. 2014:1–31. doi:10.1007/s11042-016-3758-7.
29. Fajri K, Khalil MS, Malik H. Robust tampered detection method for digital audio using Gabor filterbank. In: Proceedings of the International Conference on Image Processing, Production and Computer Science (ICIPCS); 2015; Thailand. p. 75–82.
30. Zhao H, Chen Y, Wang R, Malik H. Audio source authentication and splicing detection using acoustic environmental signature. In: Proceedings of the 2nd ACM workshop on information hiding and multimedia security; 2014; Austria. p. 159–164.
31. Schädler MR, Meyer BT, Kollmeier B. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J Acoust Soc Am. 2012;131:4134–4151.
32. Logan B. Mel frequency cepstral coefficients for music modeling. In: International Symposium on Music Information Retrieval (ISMIR); 2000; Bloomington, IN.
33. Hermansky H. Perceptual linear predictive (PLP) analysis of speech. J Acoust Soc Am. 1990;87:1738–1752.
34. Sui Q, Lau APT, Lu C. Fast and robust blind chromatic dispersion estimation using auto-correlation of signal power waveform for digital coherent systems. J Lightwave Technol. 2013;31:306–312.
35. Malik H, Miller J. Microphone identification using higher-order statistics. In: 46th AES Conference on Audio Forensics, paper no. 5-2; 2012; Denver, CO.
36. Malik H. Secure speaker verification system against replay attack. In: 46th AES Conference on Audio Forensics, paper no. 5-5; 2012; Denver, CO.
37. Malik H, Mahmood H. Acoustic environment identification using unsupervised learning. Secur Inform. 2014;3:11.