Accent Classification Using Deep Belief Network
A project report submitted in partial fulfillment of the requirements for the degree of
Bachelor of Engineering by Rishabh G. Upadhyay (Roll No. 7170)
Under the guidance of Jagruti Save
DEPARTMENT OF INFORMATION TECHNOLOGY Fr. Conceicao Rodrigues College of Engineering, Bandra (W), Mumbai 400050 University of Mumbai April 22, 2017
This work is dedicated to my family. I am very thankful for their motivation and support.
Internal Approval Sheet CERTIFICATE This is to certify that the project entitled "Accent Classification Using Deep Belief Network" is a bona fide work of Rishabh G. Upadhyay (7170), submitted to the University of Mumbai in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Information Technology.
Jagruti K. Save (Supervisor/Guide)
Jagruti K. Save (Head of Department)
Dr. Srija Unnikrishnan (Principal)
Approval Sheet Project Report Approval This project report entitled "Accent Classification Using Deep Belief Network" by Rishabh G. Upadhyay is approved for the degree of Bachelor of Engineering.
Examiners 1.————————————– 2.————————————–
Date: Place:
Declaration I declare that this written submission represents our ideas in our own words and where others’ ideas or words have been included, we have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.
Rishabh G. Upadhyay
Date: April 22, 2017
(Roll No. 7170) (sign)
Abstract

As speech recognition and intelligent systems become more prevalent in society, we need to account for the variety of accents in spoken language. Accent classification technologies directly influence the performance of speech recognition, and an important step is identifying the type of accent given a sample of speech. For this thesis, we worked to create an effective classifier for foreign-accented English speech in order to determine the origin of the speaker. A new dataset was developed, consisting of 5 speakers from each of 6 countries. The dataset was compiled using the place of birth and the current location of each speaker. The accents belong to 6 countries: China, India, France, Germany, Turkey, and Spain. We trained deep belief networks on this dataset and compared them with other state-of-the-art methods. The contribution of Mel-frequency cepstral coefficients to accent discrimination is also analysed.
Acknowledgments

I have great pleasure in presenting the report on "Accent Classification Using Deep Belief Network". I take this opportunity to express my sincere thanks to my guide, Prof. Jagruti Ketan Save, C.R.C.E., Bandra (W), Mumbai, for providing the technical guidelines and the suggestions regarding the direction of this work. I enjoyed discussing the progress of the work during my visits to the department.

I thank Prof. Jagruti Ketan Save, Head of the Information Technology Dept., the Principal, and the management of C.R.C.E., Mumbai for their encouragement and for providing the necessary infrastructure for pursuing the project.

I also thank all the non-teaching staff for their valuable support in completing the project.
Rishabh G. Upadhyay(Roll No. 7170) Date: April 22, 2017
Contents

Abstract
List of Figures
List of Tables
Glossary
1 Introduction
  1.1 Thesis outline
2 Literature Review
3 Problem Statement
4 Data
  4.1 Data Collection
  4.2 Data-set Description
5 Project Description
  5.1 Feature Extraction
    5.1.1 Mel-scale Frequency Cepstral Coefficient
  5.2 Classification
    5.2.1 Supervised Classification
    5.2.2 Deep belief networks
    5.2.3 Restricted Boltzmann Machine
6 Experimental Results
  6.1 Experimental Setup
  6.2 Feature Extraction and Configuration
  6.3 Computational setup
  6.4 Experiments
  6.5 Classifier Comparison
7 Conclusion and Future works
  7.1 Conclusion
  7.2 Future works
References

List of Figures

5.1 Proposed method
5.2 Steps for MFCC
5.3 Structure of DBN
5.4 Multiple input RBM
5.5 Multiple Hidden layers RBM
5.6 Complete RBM
6.1 Training Error for 2 Accents Classification
6.2 Training Error for 4 Accents Classification
6.3 Training Error for 6 Accents Classification

List of Tables

4.1 Description of data-set
Chapter 1 Introduction

Accented speech poses a major obstacle for speech recognition algorithms. A spoken language varies considerably in terms of its regional dialects and accents. Accurate recognition of dialect or accent prior to automatic speech and language recognition may help improve recognition accuracy through speaker and language model adaptation. Being able to accurately classify speech accents would enable automatic recognition of the origin and heritage of a speaker. This would allow for robust accent-specific speech recognition systems and is especially desirable for languages with multiple distinct dialects. Existing Automatic Speech Recognition (ASR) systems for English have been trained on a wide array of American-accented speech, so technology like Apple's Siri and Google Now performs quite well at understanding native speakers of American English. Yet there is much less training data for ASR systems to learn from foreign-accented speakers. With the goal of making these systems more robust, one way to recognize speech (i.e. convert speech to text) from a non-standard accent of English is to first classify which accent it is. It has also been found that Siri, Apple's personal assistant, is unable to recognize words spoken with different accents [13], [14], [15]. Furthermore, in the context of immigration screening, it may be helpful to verify semi-automatically whether an applicant's accent corresponds to accents spoken in the region he claims to be from. When an individual learns to speak a second language, there is a tendency to replace some syllables in the second language with more prominent syllables from his native language. Thus, accented speech can be seen as the result of a language being filtered by a second language, and the analysis of accented speech may uncover hidden resemblances among different languages. There is a clear need for accurate, automatic characterization of spoken dialects and accents. Accent identification also has various other applications such as automated customer assistance routing. In addition, analyzing
speech data of multiple accents can potentially hint at common linguistic origins. Such analysis is a common approach in speaker and language identification and calls for feature extraction techniques such as spectrograms, MFCCs, and LPC. I have used MFCC features for this thesis.
1.1 Thesis outline

This thesis is organized as follows: Chapter 2 gives a detailed literature survey of the techniques that have been researched previously, from the perspective of the classifiers used. Chapter 3 gives insight into the problem statement. Chapter 4 gives an overview of the dataset used. Chapter 5 describes the feature extraction process using Mel-frequency cepstral coefficients (MFCC) and explains the classification technique used. Chapter 6 presents the experimental results, and Chapter 7 gives the conclusion and future work.
Chapter 2 Literature Review

In this chapter, some related work on accent classification is discussed. The biggest inspiration for the starting methodology of this thesis was the classifier by Choueiter [1]. They used the Foreign Accented English dataset from the Center for Spoken Language Understanding to train Gaussian Mixture Models (GMMs) on 23 different accents. As a baseline, their models achieved a best performance of 22% correct in a 23-way classification task. Typical dialect and accent recognizers use either acoustic or phonotactic modeling. In the former approach, acoustic features such as shifted delta cepstra (SDC) are used with adaptation [2]. Neidert, Chen, and Lee [3] experimented with 2 accents, German and Mandarin, and extended the classification experiment to 12 accents. The resulting accuracy for 2 accents was 57.12% using a Support Vector Machine (SVM) with Linear Predictive Coding (LPC) as the feature extraction method. On 12 accents, they obtained an accuracy of 13.32%. Watanaprakornkul, Eksombatchai, and Chien [4] classified 3 foreign accents, Cantonese, Hindi, and Russian, using SVMs and Gaussian Mixture Models (GMM). The features were extracted using Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) approaches. They obtained an accuracy of 41.18% using SVM and 37.5% using GMM. Out of the 2000 windows of each speech utterance obtained after feature extraction, 10 windows were randomly picked for the experiments. Tang [5] classified accents using Hidden Markov Models (HMM), the Directed Acyclic Graph SVM (DAGSVM), and Support Vector Machines (SVM). They found that the DAGSVM performed similarly to the HMM but better than the SVM, whereas the SVM was effective in classifying different accents. Novich [6] did a study on accent classification using neural networks in which they
extracted the formant features from the vowels of the speech utterances. Hamid Behravan [12] studied accent classification using i-vectors, where speaker and channel effects were modeled separately using eigenvoice (speaker subspace) and eigenchannel (channel subspace) models. They also compared it with a traditional Gaussian mixture model - universal background model (GMM-UBM) recognizer. Chen, Huang, Chang, and Wang [7] used a Gaussian mixture model to classify accented speech and speaker gender. Using MFCC as their feature set, they investigated the relationship between the number of utterances in the test data and the accent identification error. The study displays very impressive results, which encourages us to think that non-prosodic feature sets can be promising for accent classification. This was the baseline for selecting MFCC features rather than LPC and SDC. In this chapter, I surveyed the literature on methods dealing with accent classification. Past research has identified the important features for training a computer to classify accents. The survey shows that many statistical classifiers have been trained to achieve accurate classification of accents. Among them, it was found that the SVM was efficient in identifying different accents. However, deep learning has not been used in past research. In this thesis I place the emphasis on classification using deep learning and compare it with the SVM.
Chapter 3 Problem Statement

A spoken language varies considerably in terms of its regional dialects and accents. Dialect refers to linguistic variations of a language, while accent refers to different ways of pronouncing a language within a community [11]. Accent classification can therefore help in obtaining information about a speaker's community. In this thesis, I work on accent classification and also on obtaining information about the person, in terms of the country of living and whether that country affects the accent. Such a system can have many applications, such as security, immigration, and refugee camps.
Chapter 4 Data
4.1 Data Collection

This thesis emphasizes identifying a person's region from their accent and determining how much the accent has been affected by that region. For example, the accent of someone born in India who has been living in the USA for several years will be affected by the American accent. This system will therefore help in identifying whether, and how much, the region of living affects the accent. Several datasets are available for accents, such as Foreign Accented English (FAE) and the George Mason University (GMU) Speech Accent Archive. But all these datasets are defined on the basis of the accent of a particular country and do not help in finding a particular region. To overcome this problem of finding a person's region, I built a new dataset using online sources such as YouTube and VideoLectures, and then used social media such as LinkedIn and Facebook to find the place of living and the travel history of each speaker.
4.2 Data-set Description

Using VideoLectures and YouTube, I compiled my own dataset for this thesis. The description of the dataset is given below in Table 4.1.
Country    No. of speakers    Duration per speaker
China      5                  1:10:00
India      5                  1:01:00
France     5                  1:20:00
Germany    5                  0:59:25
Spain      5                  1:30:00
Turkey     5                  0:55:49

Table 4.1: Description of data-set
We initially assigned a random 80% of files per language for training and the remaining 20% for testing. This train-test split was kept constant across all experiments in order to evaluate the quality of the different types of classification. A separate holdout set would have been useful for tuning parameters, but I believed the existing data to be too small to split into three groups.
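A minimal sketch of such a fixed per-language split is given below, assuming the files are organized as a mapping from country to a list of audio file paths (an assumed structure, not the actual code used); pinning random_state keeps the split constant across experiments:

```python
from sklearn.model_selection import train_test_split

def split_files(files_by_country, seed=42):
    """80/20 split per country, held constant across experiments via the seed."""
    train, test = [], []
    for country, files in files_by_country.items():
        tr, te = train_test_split(files, test_size=0.2, random_state=seed)
        train += [(path, country) for path in tr]
        test += [(path, country) for path in te]
    return train, test
```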
Chapter 5 Project Description

In this chapter, the project description and the proposed method are discussed.
Figure 5.1: Proposed method.
The proposed method is shown in Figure 5.1. It consists of data collection, which was discussed in Chapter 4, and feature extraction and the deep belief network, which are discussed in this chapter.
5.1 Feature Extraction

In machine learning, pattern recognition, and speech processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations. When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g. the same measurement in both feet and meters, or the repetitiveness of images presented as pixels), it can be transformed into a reduced set of features (also called a feature vector); determining a subset of the initial features is called feature selection. Speech is an acoustic signal that carries the information of the ideas formed in the speaker's mind.
Widely used speech features for auditory modeling are the cepstral coefficients obtained through Linear Predictive Coding (LPC). Another well-known speech feature extraction method is based on Mel-frequency Cepstral Coefficients (MFCC). For speech and speaker recognition, the most commonly used acoustic features are mel-scale frequency cepstral coefficients (MFCC). MFCC takes human perceptual sensitivity with respect to frequencies into consideration, and is therefore well suited for speech and speaker recognition.

5.1.1 Mel-scale Frequency Cepstral Coefficient
The most prevalent and dominant method used to extract spectral features is calculating Mel-Frequency Cepstral Coefficients (MFCC) [16]. MFCCs are one of the most popular feature extraction techniques used in speech recognition; they operate in the frequency domain using the Mel scale, which is based on the scale of the human ear. MFCCs, being frequency-domain features, are much more accurate than time-domain features [8], [9]. Mel-Frequency Cepstral Coefficients are a representation of the real cepstrum of a windowed short-time signal derived from the Fast Fourier Transform (FFT) of that signal. The difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behaviour of the auditory system. Additionally, these coefficients are robust and reliable under variations of speakers and recording conditions. MFCC is an audio feature extraction technique which extracts parameters from speech similar to the ones used by humans for hearing speech, while at the same time de-emphasizing all other information. The speech signal is first divided into time frames consisting of an arbitrary number of samples. In most systems, overlapping of the frames is used to smooth the transition from frame to frame. After the windowing, the Fast Fourier Transform (FFT) is calculated for each frame to extract the frequency components of the signal in the time domain. The FFT is used to speed up the processing. The logarithmic Mel-scaled filter bank is applied to the Fourier-transformed frame. This scale is approximately linear up to 1 kHz, and logarithmic at greater frequencies. The relation between the frequency of speech and the Mel scale can be established as:

mel(f) = 1125 * ln(1 + f/700)
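A quick numeric check of the mel-scale formula above (in its natural-logarithm form); the sample frequencies are illustrative only:

```python
import numpy as np

def hz_to_mel(f_hz):
    # mel(f) = 1125 * ln(1 + f/700)
    return 1125.0 * np.log(1.0 + f_hz / 700.0)

print(hz_to_mel(500.0))   # ~607 mel: near-linear region below 1 kHz
print(hz_to_mel(8000.0))  # ~2835 mel: compressed, logarithmic region
```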
MFCCs use a Mel-scale filter bank in which the higher-frequency filters have greater bandwidth than the lower-frequency filters, but their temporal resolutions are the same.
Figure 5.2: Steps for MFCC.
The last step is to calculate the Discrete Cosine Transform (DCT) of the outputs from the filter bank. The DCT ranks the coefficients according to significance, whereby the 0th coefficient is excluded since it is unreliable [16]. The overall procedure of MFCC extraction is shown in Figure 5.2. For each speech frame, a set of MFCCs is computed. This set of coefficients is called an acoustic vector; it represents the phonetically important characteristics of speech and is very useful for further analysis and processing in speech recognition.
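As a sketch of this pipeline, the snippet below extracts 13 MFCCs per frame with librosa, which internally performs the windowing, FFT, mel filter bank, log, and DCT steps of Figure 5.2. The parameter values mirror those reported in Chapter 6 (13 coefficients, a 512-point FFT, 40 filters) but are illustrative rather than the exact experimental configuration:

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13, n_fft=512, n_mels=40):
    y, sr = librosa.load(wav_path, sr=None)  # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, n_mels=n_mels)
    return mfcc.T  # one 13-dimensional acoustic vector per frame
```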
5.2 Classification

After extracting meaningful features from the speech utterances, the next step is to classify the accents. Classification can be either supervised or unsupervised. Supervised classification involves learning from labeled training data, whereas unsupervised classification refers to the problem of identifying structure in unlabeled data. Previously, in the area of accent classification, several classification techniques such as Hidden Markov Models (HMM), Support Vector Machines (SVM), Gaussian Mixture Models (GMM), Artificial Neural Networks (ANN), k-Nearest Neighbors (KNN), decision trees, and Naive Bayes algorithms have been used. We chose to experiment with supervised classification using Deep Belief Networks (DBN), since DBNs have shown the best results in music emotion classification. This section provides an overview of supervised classification followed by an explanation of the Deep Belief Network used in the experiments.
5.2.1 Supervised Classification

Supervised classification is an essential tool for extracting quantitative information from remotely sensed image data [10]. Using this method, the analyst has sufficient known pixels available to generate representative parameters for each class of interest. This step is called training. Once trained, the classifier is used to attach labels to the data according to the trained parameters.

5.2.2 Deep belief networks
Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple nonlinear transformations. It aims at learning distributed representations of data effectively through several levels, with the highest level representing the abstract form of the data. Research in this area attempts to define what makes better representations and how to create models to learn these representations. The field of deep learning has attracted a lot of attention in the recent past, and its techniques have been applied in many fields including natural language processing, machine learning, information retrieval, computer vision, and artificial intelligence. It is still gaining popularity because of the interesting challenges it poses. The architecture used in this thesis work is the Deep Belief Network (DBN) [18]. Geoffrey Hinton showed that RBMs can be stacked and trained in a greedy manner to form a so-called DBN [18]. Deep belief nets are probabilistic generative models composed of multiple layers of stochastic, latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. The top two layers have undirected, symmetric connections between them and form an associative memory. The lower layers receive top-down, directed connections from the layer above. The states of the units in the lowest layer represent a data vector. The principle of greedy layer-wise unsupervised training can be applied to a DBN with RBMs as the building blocks for each layer. Consider a DBN with two hidden layers; it can then be trained in the following way, as shown in Figure 5.3 [19] (a code sketch follows the list):

1. Train the first layer as an RBM that models the raw input x = h(0) as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist.
Figure 5.3: Structure of DBN.
3. Train the second layer as an RBM, taking the transformed data as training examples (for the visible layer of that RBM).
4. Iterate over steps 2 and 3 for the desired number of layers, each time propagating upward either samples or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN objective or a supervised training criterion, tuning hyperparameters such as dropout, learning rate, and momentum.
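The sketch below illustrates this greedy layer-wise procedure using scikit-learn's BernoulliRBM as the building block; the layer sizes and training settings are illustrative assumptions, and the input features are assumed to be scaled to [0, 1] as BernoulliRBM expects. Fine-tuning (step 5) would follow with backpropagation and is not shown:

```python
from sklearn.neural_network import BernoulliRBM

def pretrain_dbn(X, layer_sizes=(1000, 1000), epochs=10, lr=0.01):
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, n_iter=epochs,
                           learning_rate=lr)
        rbm.fit(data)               # steps 1 and 3: train this layer as an RBM
        data = rbm.transform(data)  # steps 2 and 4: propagate mean activations up
        rbms.append(rbm)
    return rbms
```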
5.2.3 Restricted Boltzmann Machine
Figure 5.4: Multiple input RBM.
Invented by Geoff Hinton, a Restricted Boltzmann machine is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning
and topic modeling [17]. RBMs are shallow, two-layer neural nets that constitute the building blocks of deep belief networks. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer [19]. No two nodes within the same layer are connected: this absence of intra-layer communication is the "restriction" in a restricted Boltzmann machine. Each node is a locus of computation that processes input, and begins by making stochastic decisions about whether to transmit that input or not. Each visible node takes a low-level feature from an item in the dataset to be learned. Each hidden node receives the inputs multiplied by their respective weights. The sum of those products is added to a bias (which forces at least some activations to happen), and the result is passed through the activation function, producing one output for each hidden node, as shown in Figure 5.4. If these two layers were part of a deeper neural network, the outputs of hidden layer no. 1 would be passed as inputs to hidden layer no. 2, and from there through as many hidden layers as desired until they reach a final classifying layer, as illustrated in Figure 5.5.
Figure 5.5: Multiple Hidden layers RBM.
In the reconstruction phase, the activations of hidden layer no. 1 become the input in a backward pass. They are multiplied by the same weights, one per inter-node edge, just as x was weight-adjusted on the forward pass. The sum of those products is added to a visible-layer bias at each visible node, and the output of those operations is a reconstruction, i.e. an approximation of the original input. Because the weights of the RBM are randomly initialized, the difference between the reconstructions and the original input is often large. Reconstruction error can be thought of as the difference between the reconstructed values and the input values; this error is then backpropagated against the RBM's weights, again and again, in an iterative learning process until an error minimum is reached.
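A minimal numpy sketch of the forward pass and reconstruction described above; the layer sizes, random input, and sigmoid activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 13, 100                     # e.g. 13 MFCCs per frame
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # randomly initialized weights
b_h = np.zeros(n_hidden)                          # hidden-layer bias
b_v = np.zeros(n_visible)                         # visible-layer bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(n_visible)        # one input vector (illustrative)
h = sigmoid(x @ W + b_h)         # forward pass: weighted sums plus bias
r = sigmoid(h @ W.T + b_v)       # backward pass: the reconstruction
error = np.mean((x - r) ** 2)    # reconstruction error to be minimized
```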
Figure 5.6: Complete RBM.
The complete structure of an RBM is shown in Figure 5.6. This chapter thus summarizes the important and relevant models in deep learning and gives a clear picture of the architecture used in this thesis work to build an accent classifier.
Chapter 6 Experimental Results

This chapter reports the results of the foreign accent identification experiments during the stages of system development. The system was run off-line.
6.1 Experimental Setup

The experimental evaluation is conducted on the above-mentioned dataset using a cross-validation technique: out of the five speakers of a specific regional accent, four are selected for training and the remaining one is used for testing. For the classification task, a well-known and widely used machine learning algorithm was used, namely the Deep Belief Network.
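A sketch of this leave-one-speaker-out protocol, assuming the dataset is organized as a mapping from accent to its five speaker IDs (an assumed structure):

```python
def loso_folds(speakers_by_accent):
    """For each accent, hold out one of the five speakers for testing
    and keep the other four for training."""
    for accent, speakers in speakers_by_accent.items():
        for held_out in speakers:
            train = [s for s in speakers if s != held_out]
            yield accent, train, held_out
```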
6.2 Feature Extraction and Configuration

The detailed description of the system is as follows: the MFCC algorithm is used to extract features, which are then fed to the deep belief network (DBN). For the implementation of the MFCC algorithm the following parameters are considered: a frame size of 42 ms with a 20 ms window, a 512-point FFT, 40 filter-bank channels, and 13 cepstral coefficients. The total number of features for each signal is therefore 13, as a 13th-order MFCC is used. The total number of target classes is 6, one for each regional accent.
6.3 Computational setup

Training DBNs of the sizes used in this report is quite computationally expensive. An epoch of fine-tuning with backpropagation took around 2 minutes. The discriminative gradient computation for hybrid training was substantially more expensive. These time estimates were obtained on an Intel i7 (5th generation) 2.60 GHz processor with 8 GB of RAM.
The times were measured for a model with 2 hidden layers and 1000 nodes per layer.
6.4 Experiments

For all experiments, we fixed the model parameters. All DBNs were fine-tuned for 100 epochs. The learning rate started at 0.088. At the end of each epoch, if the substitution error on the development set had increased, the weights were returned to their values at the beginning of the epoch and the learning rate was multiplied by 0.9. This continued until the learning rate fell below 0.001. During both pre-training and fine-tuning, a small weight cost of 0.0002 was used and the learning was accelerated by using a momentum of 0.9 (except for the first epoch of fine-tuning, which did not use momentum).
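The sketch below outlines this schedule; model, train_epoch, and eval_dev are hypothetical stand-ins for the actual network, one epoch of fine-tuning, and the development-set substitution error, respectively (including the assumed get_weights/set_weights interface):

```python
def fine_tune(model, train_epoch, eval_dev, lr=0.088, floor=0.001, decay=0.9):
    """Revert the epoch and decay the learning rate whenever the dev-set
    error increases; stop once the rate falls below the floor."""
    best_err = eval_dev(model)
    while lr >= floor:
        snapshot = model.get_weights()   # weights at the start of the epoch
        train_epoch(model, lr, momentum=0.9, weight_cost=0.0002)
        err = eval_dev(model)
        if err > best_err:
            model.set_weights(snapshot)  # undo the epoch...
            lr *= decay                  # ...and reduce the learning rate
        else:
            best_err = err
    return model
```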
6.5 Classifier Comparison

Classifiers are trained in 3 settings: 2 accents, 4 accents, and 6 accents. For 2 accents, the 2 most unrelated accents are chosen, i.e. Chinese and French. The training error for all three classification settings is shown below.
Figure 6.1: Training Error for 2 Accents Classification.
Training accuracy for 2 Accents classification is 91.5%.
Figure 6.2: Training Error for 4 Accents Classification.
Figure 6.3: Training Error for 6 Accents Classification.
Training accuracy for 4 and 6 Accents Classification are 78.98% and 71.5% respectively.
Testing is done on 1 speaker out of the 5 speakers. The results of the testing are shown in the table below.

Classifiers            2 accents    4 accents    6 accents
DBN                    90.20%       77.20%       71.90%
Random Forest          67.68%       43.59%       33.01%
K-NN                   72.01%       49.34%       40.16%
Naive Bayes            57.66%       31.88%       22.69%
Logistic Regression    50.01%       27.01%       19.64%
SGD Classifier         49.51%       25.04%       20.64%
Perceptron             52.22%       24.94%       18.64%
SVM                    66.58%       41.74%       32.64%

Accuracy Comparison. For all the classifiers shown above, the best results obtained using various parameter settings are reported.
Chapter 7 Conclusion and Future works

This chapter presents the conclusion and future work.
7.1 Conclusion

A spoken language varies considerably in terms of its regional dialects and accents. Dialect refers to linguistic variations of a language, while accent refers to different ways of pronouncing a language within a community. By classifying the accent we can obtain information about the community of the speaker, which can be useful in immigration processes. Accent classification can help in identifying the country of origin as well as the current country of residence of the speaker. Furthermore, MFCC features have shown good results in the field of speech processing, as they mimic the frequency response of the human ear. We used MFCC features to train DBNs without transcriptions, and achieved the best results compared with the other classifiers. We trained the models in 3 settings, i.e. 2 accents, 4 accents, and 6 accents. In all three settings, the DBN was the best compared with the other state-of-the-art methods and classifiers. Upon analyzing the classifier with MFCC features extracted from the mentioned dataset, the classification task achieved an accuracy of 90% for the 2-accent dataset and 71% for the 6-accent dataset.
7.2 Future works

For future research, since Deep Belief Networks gave the best results compared with the other methods, more experiments should be carried out on larger datasets. The female and male speakers should also be in equal proportion, or else the imbalance will create problems in classification. Finally, the dataset should contain more accents from different countries to make the system more robust.
References

[1] G. Choueiter, G. Zweig, and P. Nguyen, "An empirical study of automatic accent classification," in Acoustics, Speech and Signal Processing, pp. 4265-4268, IEEE, 2008.
[2] P. Torres-Carrasquillo, T. Gleason, and D. Reynolds, "Dialect identification using Gaussian Mixture Models," Proc. of ODYSSEY - The Speaker and Language Recognition Workshop, pp. 297-300, 2004.
[3] J. Neidert, P. Chen, and J. Lee, "Foreign accent classification," CS 229, 2011.
[4] P. Watanaprakornkul, C. Eksombatchai, and P. Chien, "Accent classification," CS 229, 2011.
[5] H. Tang and A. A. Ghorbani, "Accent classification using Support Vector Machine and Hidden Markov Model," Proc. of the 16th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2003, Halifax, Canada, June 11-13, 2003.
[6] S. Novich and A. Trevino, "Introduction to accent classification with neural networks," Rice University.
[7] T. Chen, C. Huang, C. Chang, and J. Wang, "On the use of Gaussian mixture model for speaker variability analysis," presented at the Int. Conf. SLP, Denver, USA, 2002.
[8] L. Xie and Z.-Q. Liu, "Comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation," Proc. of the International Conference on Machine Learning and Cybernetics, Dalian, China, 2015.
[9] A. Tan Kok Leon, "A music identification system based on audio content similarity," Bachelor of Engineering thesis, School of Information Technology and Electrical Engineering, The University of Queensland, 2003.
[10] J. A. Richards, "Remote sensing digital image analysis: an introduction," Springer (second edition), 1993.
[11] J. Nerbonne, "Linguistic variation and computation," in Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, 2003.
[12] H. Behravan, V. Hautamaki, and T. Kinnunen, "Foreign accent detection from spoken Finnish using i-vectors," 14th Annual Conference of the International Speech Communication Association, 2013.
[13] S. Machanavajhala, "Accent classification: learning a distance metric over phonetic strings," Master's thesis, The University of Utah, 2013.
[14] "Watch: Siri has trouble recognizing Scottish accents," Available: http://techland.time.com/2011/10/27/watch-siri-has-trouble-recognizing-scottish-accents/, 2011.
[15] "Apple's Siri vs. Japanese-accented English," Available: http://boingboing.net/2012/11/20/apples-siri-vs-japanese-acc.html, 2012.
[16] M. Sahidullah and G. Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, 2011.
[17] G. Hinton, "Deep belief networks," 2009.
[18] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, 2006.
[19] G. E. Hinton, "A practical guide to training Restricted Boltzmann Machines," 2010.