Modelling of Deterministic, Fuzzy and Probabilistic Dynamical Systems
Rohitash Chandra
A Thesis in the Field of Computing Science for the Degree of Master of Science in Computing Science
The University of Fiji August, 2007
Supervisor: Prof. Christian Omlin.
Abstract Recurrent neural networks and hidden Markov models have been popular tools for sequence recognition problems such as automatic speech recognition. This work investigates the combination of recurrent neural networks and hidden Markov models into a hybrid architecture. This combination is feasible due to the similarity of the architectural dynamics of the two systems. Initial experiments trained recurrent neural networks to behave like finite-state automata using genetic algorithms in order to demonstrate that their structure is sufficiently rich to represent dynamical systems. The equivalence between these trained recurrent neural networks and the corresponding automata is shown by extracting knowledge from the networks. The results show that hybrid recurrent neural networks can learn and represent dynamical systems such as finite-state automata. Finally, the proposed hybrid architecture is applied to automatic speech phoneme recognition. The results show that hybrid recurrent neural networks achieve some degree of success, under certain conditions, when applied to difficult real-world problems.
Author's Biographical Sketch The author is from the Fiji islands. He was born in Nausori and attended Saraswati Primary and Saraswati College for his primary and secondary school education. The author graduated with a Bachelor of Science Degree from the University of the South Pacific in April 2006. He joined the University of Fiji as a tutor in computing science in early 2007. His research work has been published in numerous international conference proceedings in the field of artificial intelligence. Apart from his contribution to the field of computing science, the author has published poetry in a number of international literary journals. He released two books of poetry, in 2006 and 2007: “Barefoot on Soft River Sand” and “A Hot Pot of Roasted Poems”. He is also the editor and publisher of The Blue Fog Journal.
To Roni.
Acknowledgments My sincere gratitude to Prof. Christian Omlin for his guidance and support during the course of the research done in this thesis. I also thank Prof. Akshay Kumar for his supervision. I am grateful to The University of Fiji for giving me the opportunity to pursue this research. Special thanks to Prof. Dharmendra Sharma and Prof. Rajesh Chandra for their support and guidance. Love and thanks to Mom.
Table of Contents
Table of Contents .............................................................. vii
List of Figures ................................................................ xiii
Chapter 1 Introduction ......................................................... 1
1.1 Motivation ................................................................. 1
1.2 Premises ................................................................... 2
1.2.1 Recurrent Neural Networks ................................................ 2
1.2.2 Hidden Markov Models ..................................................... 3
1.2.3 Finite-State Automata and Knowledge Representation ....................... 3
1.2.4 Hybrid Systems ........................................................... 4
1.2.5 Speech Recognition ....................................................... 5
1.3 Research Hypothesis ........................................................ 6
1.3.1 Learning Finite-State Automata ........................................... 6
1.3.2 Extraction of Finite Automaton ........................................... 7
1.3.3 Evolutionary Training of Recurrent Neural Networks ....................... 7
1.3.4 Hybrid Systems of Recurrent Networks and Hidden Markov Models ............ 8
1.4 Technical Goals ............................................................ 8
1.5 Research Methodology ....................................................... 9
1.5.1 Gradient Descent and Evolutionary Training ............................... 10
1.5.2 Extraction of Fuzzy Finite Automaton ..................................... 11
1.5.3 The Hybrid Recurrent Neural Networks Architecture ........................ 12
1.5.4 Application to Speech Phoneme Recognition ................................ 12
1.6 Accomplishments ............................................................ 13
1.7 Thesis Overview ............................................................ 14
Chapter 2 Neural Network Fundamentals .......................................... 15
2.1 Introduction ............................................................... 15
2.2 Processing Elements ........................................................ 15
2.3 Topologies ................................................................. 17
2.3.2 Recurrent Neural Networks ................................................ 19
2.4 Learning Algorithm ......................................................... 22
2.5 Computational Capabilities ................................................. 24
2.6 Summary .................................................................... 25
Chapter 3 Hybrid Systems ....................................................... 27
3.1 Introduction ............................................................... 27
3.2 Symbolic Connectionist Learning ............................................ 28
3.2.1 General Paradigm ......................................................... 28
3.2.2 The Significance and Insertion of Prior Knowledge ........................ 29
3.2.3 Knowledge Extraction ..................................................... 30
3.2.4 Knowledge Refinement ..................................................... 31
3.3 Neural Expert Systems ...................................................... 31
3.4 Neuro-Fuzzy Systems ........................................................ 33
3.5 Evolutionary Neural Networks ............................................... 35
3.5.1 Evolutionary Neural Learning ............................................. 35
3.5.2 Evolutionary Neural Topologies ........................................... 35
3.6 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models .......... 36
3.6.1 Motivation ............................................................... 36
3.6.2 Significance of Hybrid Recurrent Neural Networks ......................... 37
3.6.3 The Derivation for Hybrid Recurrent Neural Networks ...................... 37
3.6.4 Training Hybrid Recurrent Neural Networks ................................ 39
3.7 Summary .................................................................... 41
Chapter 4 Automata Theory ...................................................... 42
4.1 Introduction ............................................................... 42
4.2 Formal Languages ........................................................... 42
4.3 Finite-State Automata ...................................................... 43
4.3.1 Deterministic Finite-State Automata ...................................... 43
4.3.2 Fuzzy Finite-State Automata .............................................. 44
4.4 Finite State Machines: Hidden Markov Models ................................ 46
4.5 Summary .................................................................... 49
Chapter 5 Recurrent Neural Networks ............................................ 51
5.1 Introduction ............................................................... 51
5.2 Architectures .............................................................. 52
5.2.1 First-Order Recurrent Neural Networks .................................... 52
5.2.2 Second-Order Recurrent Neural Networks ................................... 53
5.2.3 Locally Recurrent Neural Networks ........................................ 54
5.2.4 NARX Recurrent Networks .................................................. 56
5.2.5 Long Short Term Memory ................................................... 57
5.3 Learning ................................................................... 59
5.3.1 Introduction ............................................................. 59
5.3.2 Backpropagation Through Time ............................................. 59
5.3.3 Real Time Recurrent Learning ............................................. 64
5.3.4 Evolutionary Neural Learning ............................................. 64
5.3.4.1 Genetic Algorithms ..................................................... 64
5.3.4.2 Training Neural Networks with Genetic Algorithms ....................... 68
5.4 Recurrent Neural Networks as Models of Computation ......................... 70
5.5 Knowledge Extraction from Recurrent Neural Networks ........................ 71
5.5.1 Introduction ............................................................. 71
5.5.2 Knowledge Extraction Using Machine Learning .............................. 71
5.6 Applications of Recurrent Neural Networks .................................. 75
5.6.1 Recurrent Neural Networks for Speech Recognition ......................... 75
5.6.2 Recurrent Neural Networks for Control .................................... 76
5.6.3 Molecular Biology ........................................................ 76
5.6.4 Signature Verification ................................................... 77
5.7 Summary .................................................................... 78
Chapter 6 Training and Extraction of Finite State Automata ..................... 79
6.1 Introduction ............................................................... 79
6.2 Gradient Descent Training of Recurrent Neural Networks ..................... 79
6.2.1 Training on Deterministic Finite-State Automata .......................... 80
6.2.2 Training on Fuzzy Finite Automaton ....................................... 82
6.3 Extraction of Finite Automaton from Trained Recurrent Neural Networks ...... 84
6.3.1 Extraction of Deterministic Finite Automaton ............................. 84
6.3.2 Extraction of Fuzzy Finite Automaton ..................................... 85
6.4 Evolutionary Training of Hybrid Recurrent Neural Networks .................. 87
6.4.1 Introduction ............................................................. 87
6.4.2 Learning Deterministic Finite Automaton .................................. 89
6.5 Extraction of Deterministic Finite Automaton ............................... 91
6.6 Summary .................................................................... 93
Chapter 7 Real World Application: Speech Recognition ........................... 95
7.1 Introduction ............................................................... 95
7.2 Speech Recognition Systems ................................................. 95
7.3 Feature Extraction from Speech Sequences ................................... 97
7.3.1 Introduction ............................................................. 97
7.3.2 Mel Frequency Cepstral Coefficients ...................................... 98
7.3.3 The TIMIT Database ....................................................... 103
7.3.4 Empirical Results ........................................................ 104
7.4 An Application to Speech Phoneme Classification ............................ 105
7.4.1 Empirical Results and Discussion ......................................... 106
7.5 Summary .................................................................... 110
Chapter 8 Conclusion and Directions for Future Research ........................ 111
8.1 Accomplishments and Open Problems .......................................... 111
8.1.1 Open Problems ............................................................ 112
8.2 Derivation of Gradient Descent Learning of the Hybrid Architecture ......... 113
8.3 Applications to Other Real World Time Series ............................... 113
References ..................................................................... 115
Appendix 1 Articles Published in Refereed Conference Proceedings ............... 127
List of Figures
Figure 1 Output of a single neuron ............................................. 17
Figure 2 Neural network topologies ............................................. 18
Figure 3 Architectures of recurrent neural networks ............................ 20
Figure 4 The framework for combining symbolic and neural learning .............. 29
Figure 5 Neural expert system .................................................. 33
Figure 6 An example of a neuro-fuzzy system .................................... 34
Figure 7 Hybrid recurrent neural networks ...................................... 40
Figure 8 Deterministic finite-state automata ................................... 44
Figure 9 Fuzzy finite-state automaton with weighted state transitions .......... 45
Figure 10 Equivalent deterministic acceptor .................................... 45
Figure 11 A first-order discrete Markov model .................................. 46
Figure 12 A first-order hidden Markov model .................................... 47
Figure 13 First-order recurrent neural networks ................................ 52
Figure 14 Second-order recurrent neural networks ............................... 54
Figure 15 Locally recurrent networks ........................................... 55
Figure 16 NARX network architecture ............................................ 56
Figure 17 Example of a three layer LSTM topology ............................... 58
Figure 18 Unfolding a recurrent neural network in time ......................... 62
Figure 19 Crossover and mutation operators for genetic algorithms .............. 67
Figure 20 Crossover operator for evolutionary neural learning .................. 69
Figure 21 Knowledge extraction through machine learning ........................ 72
Figure 22 Prefix tree .......................................................... 73
Figure 23 DFA induction ........................................................ 74
Figure 24 The 7 state deterministic finite automaton ........................... 80
Figure 25 The 7 state fuzzy finite automaton ................................... 83
Figure 26 Hybrid recurrent neural networks ..................................... 88
Figure 27 Windowing ............................................................ 99
Figure 28 The process of MFCC feature extraction ............................... 100
Figure 29 Hybrid recurrent neural networks ..................................... 106
Chapter 1 Introduction
1.1 Motivation
Many real world applications of machine learning deal with data sequences which involve time series modelling and prediction. Machine learning tools such as neural networks and hidden Markov models have been popular paradigms for modelling these time-varying signals. They have been successfully applied to real world problems such as speech, signature and gesture recognition [1, 2, 3, 4, 5, 6]. Hybrid systems combine the strengths of different intelligent paradigms; examples include neural expert systems, evolutionary neural learning, symbolic connectionist learning, and neuro-fuzzy systems [7, 8, 9, 10]. Recurrent neural networks provide good generalization but are difficult to train. Hidden Markov models, on the other hand, are easy to train, but their generalization performance does not compare favourably to that of recurrent neural networks for classification problems. The combination of these major machine learning paradigms into a hybrid architecture may provide a better system with improved generalization and training performance, which may make a significant contribution to sequence recognition in general. Hybrid systems have shown many advantages and have therefore motivated the development of a new hybrid recurrent neural network architecture inspired by hidden Markov models.
1.2 Premises Recurrent neural networks can represent dynamical systems such as finite-state automata very well. They have been applied to a wide range of real world problems with dynamical characteristics, including speech, signature and gesture recognition [1, 2, 3]. Hidden Markov models, on the other hand, have been popular tools for modelling speech sequences [6]. Finite-state automata have been a useful paradigm for studying recurrent neural networks and their dynamical characteristics. Hybrid systems combine useful features of at least two paradigms; examples in machine learning include evolutionary neural learning, neural expert systems, neuro-fuzzy systems, and symbolic connectionist learning.
1.2.1 Recurrent Neural Networks Neural networks are loosely modelled on the brain. They learn by training from past experience and can demonstrate good generalization performance when presented with data not included in the training process. Neural networks can be divided into two classes: feedforward and recurrent. Feedforward networks are used in applications where the data does not contain time-variant information, while recurrent neural networks model time series sequences and possess dynamical characteristics. Recurrent neural networks contain feedback connections; they have the ability to retain information from past states for the computation of future state outputs. It has been shown that recurrent neural networks can model non-linear dynamical systems. They have been successfully applied to a wide range of applications including speech, gesture and signature recognition [1, 2, 3]. One limitation of neural networks is the difficulty of training them using gradient descent learning; a network may get trapped in a local minimum, resulting in poor training and generalization performance.
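The feedback described above can be illustrated with a minimal first-order state update: the new state is a non-linear function of the previous state and the current input. This is an illustrative sketch only, not the networks used in the thesis; dimensions, weight values and the `tanh` activation are assumptions.

```python
import numpy as np

def rnn_step(state, x, W, V, b):
    """One step of a first-order recurrent network: the new state
    depends on both the previous state and the current input."""
    return np.tanh(W @ state + V @ x + b)

# Toy dimensions (assumed): 3 state neurons, 2 inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))   # recurrent (state-to-state) weights
V = rng.normal(size=(3, 2))   # input-to-state weights
b = np.zeros(3)               # biases

state = np.zeros(3)           # initial state
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    state = rnn_step(state, x, W, V, b)   # state carries past information
```

Because `state` is fed back at every step, the final state depends on the whole input sequence, which is what allows such networks to model time-varying signals.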
1.2.2 Hidden Markov Models Hidden Markov models have been popular tools for automatic speech recognition [6]. In a regular Markov model, the state is directly visible to the observer; therefore, the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible; however, the variables influenced by the states are visible. Each state has a probability distribution over the possible output tokens, so the sequence of tokens generated by a hidden Markov model gives some information about the sequence of states. In a first-order hidden Markov model, the state at time t+1 depends only on the state at time t, regardless of earlier states [12]. This first-order assumption is generally inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have been very successful for certain types of speech recognition [13].
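The standard forward procedure makes these ideas concrete. The sketch below is a toy two-state model with made-up parameters (the matrices `A`, `B` and the initial distribution `pi` are illustrative assumptions, not taken from the thesis):

```python
import numpy as np

# A[i, j] = P(state j at t+1 | state i at t)  -- transition probabilities
# B[j, k] = P(token k | state j)              -- emission probabilities
# pi      = initial state distribution
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def forward(observations):
    """Probability of an observation sequence via the forward procedure."""
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) * A[i, j]] * B[j, o]
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

p = forward([0, 1, 0])   # probability of seeing tokens 0, 1, 0
```

Note that only the tokens are observed; the state sequence that generated them remains hidden, which is exactly the property described above.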
1.2.3 Finite-State Automata and Knowledge Representation Finite-state automata represent dynamical behaviour and are a useful framework for studying recurrent neural networks since no feature extraction is necessary. A deterministic finite automaton is a finite automaton in which exactly one transition to a next state exists for each pair of state and input symbol. A deterministic finite automaton reads a string of input symbols and performs a state transition for each symbol. When the last input symbol has been read, the automaton either accepts or rejects the string depending on whether the final state is an accepting state. A fuzzy finite automaton is a finite automaton in which, for each pair of state and input symbol, there exists a set of possible successor states. Symbolic or expert knowledge can be inserted into neural networks prior to training for better training and generalization performance. It has been shown that deterministic finite-state automata can be directly encoded into recurrent neural networks
prior to training [10, 11]. Initially, neural networks were viewed as black boxes because they could not explain the knowledge learnt during training. The extraction of rules from a trained network shows how it arrived at a particular solution. For recurrent neural networks, knowledge can be extracted in the form of finite-state automata, which gives insight into how knowledge is represented in the network and reveals the dynamical features of that representation. Popular methods of knowledge extraction include extraction by clustering and induction through machine learning. In extraction through clustering, the algorithm examines the internal structure of the trained network and forms clusters which represent automaton states [15]; this approach depends on the network topology and architecture. In knowledge extraction through machine learning, a set of input strings is presented to a trained network and the network's output for each string is recorded [14]. The Trakhtenbrot-Barzdin algorithm can then induce a deterministic finite-state automaton from such a set of strings with input-output labels [16].
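The mechanics of a deterministic finite automaton can be sketched in a few lines. The automaton below (an even-number-of-1s acceptor over the alphabet {0, 1}) is a standard textbook example chosen for illustration, not one of the automata studied in the thesis:

```python
# Transition table: exactly one successor state for each (state, symbol) pair,
# which is what makes the automaton deterministic.
transitions = {('even', '0'): 'even', ('even', '1'): 'odd',
               ('odd',  '0'): 'odd',  ('odd',  '1'): 'even'}

def accepts(string, start='even', accepting=frozenset({'even'})):
    """Read the string symbol by symbol, performing one transition per
    symbol; accept iff the final state is an accepting state."""
    state = start
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting

# accepts("0110") -> True (two 1s); accepts("1") -> False (one 1)
```

The input-output pairs produced by `accepts` are exactly the kind of labelled strings from which an induction algorithm such as Trakhtenbrot-Barzdin can reconstruct the automaton.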
1.2.4 Hybrid Systems In the past, intelligent system paradigms have been combined to form powerful hybrid systems; a brief summary of popular hybrid systems follows. Neural expert systems combine the learning ability of neural networks with the reasoning of expert systems [17]; they are thus able to learn from past experience and also to explain how they arrived at a particular solution. In hybrid symbolic connectionist learning, expert knowledge is inserted into neural networks prior to training for better generalization and training performance [18]; the paradigm also involves knowledge extraction, which explains what was learnt during training. Neuro-fuzzy systems combine the human-style reasoning of fuzzy logic with neural network learning [19]. Evolutionary neural learning combines the optimization technique of genetic algorithms with neural networks to learn the weights of the network from past experience [20]. Compared to gradient descent training, evolutionary neural learning tends to drive the network out of local minima, resulting in better generalization performance.
1.2.5 Speech Recognition A speech sequence contains a huge amount of irrelevant information. Feature extraction reduces the speech signal to salient features which are then used for modelling. Recurrent neural networks and hidden Markov models have been successfully applied to modelling speech sequences [1, 6]; they have been used to recognize both words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modelling complicated sequences and have shown better recognition accuracy on low-quality, noisy data than hidden Markov models. However, hidden Markov models have been shown to perform better on large vocabularies. Despite more than forty years of extensive research, speech recognition systems still fall well short of human recognition performance in environments with background noise.
1.3 Research Hypothesis Modelling real world time-varying sequences such as speech, signature and gesture is difficult. These sequences have dynamical characteristics which can be modelled by recurrent neural networks and hidden Markov models. In Section 1.2, the limitations of both these systems were discussed. Hybrid systems aim at combining the strengths of different paradigms while, at the same time, alleviating their respective weaknesses. The combination of recurrent neural networks and hidden Markov models may yield a powerful architecture that overcomes the individual limitations of these systems. Finite-state automata represent dynamical behaviour and are useful models for studying recurrent neural networks. Recurrent neural networks can learn and represent finite-state automata in their internal states. These issues are addressed through the following hypotheses:
1.3.1 Learning Finite-State Automata In Section 1.2.1, it was discussed that recurrent neural networks can represent dynamical systems. Finite-state automata represent dynamical behaviour and are a useful framework for studying recurrent neural networks since no feature extraction is necessary. Recurrent neural networks can represent deterministic finite automata in their internal structure upon training from sample strings which represent such automata. The hypothesis is that recurrent neural networks can also learn and represent fuzzy finite automata, despite the fact that computation in recurrent neural networks is deterministic.
1.3.2 Extraction of Finite Automaton In Section 1.2.3, it was discussed how recurrent neural networks can model finite automata such as deterministic and fuzzy finite-state automata. To show their knowledge representation, knowledge must be extracted from trained recurrent networks to reveal what they have learnt in the training process. The hypothesis is that fuzzy finite-state automata can be induced from the input-output mappings obtained by presenting data to trained recurrent networks. The Trakhtenbrot-Barzdin algorithm can be generalized to induce fuzzy finite automata from a set of strings with given output labels.
1.3.3 Evolutionary Training of Recurrent Neural Networks Evolutionary optimization techniques such as genetic algorithms can be used to train neural networks. In gradient descent learning, neural networks can easily get trapped in local minima, resulting in poor training and generalization performance. The hypothesis is that the initial weights in evolutionary neural learning play a major role in the training process. Experiments on the range of initial weight values are carried out to determine which ranges are good for training, and to show that, through evolutionary training, recurrent neural networks can learn and represent finite-state automata.
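The kind of training loop this hypothesis refers to can be sketched as a minimal genetic algorithm over real-valued weight vectors. Everything here is illustrative: the fitness function is a stand-in (in the thesis it would be the network's accuracy on strings from the target automaton), and the population size, crossover and mutation schemes are assumptions, not the thesis's actual settings.

```python
import random

def fitness(weights):
    """Placeholder fitness (assumed): a real system would evaluate the
    recurrent network's accuracy on the training strings instead."""
    return -sum(w * w for w in weights)  # toy objective: prefer small weights

def evolve(pop_size=20, n_weights=5, init_range=2.0, generations=50):
    """Minimal genetic-algorithm loop. `init_range` sets the initial
    weight interval, the quantity the hypothesis above focuses on."""
    pop = [[random.uniform(-init_range, init_range) for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]                 # selection: keep best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_weights)      # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(n_weights)] += random.gauss(0, 0.1)  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Because the search maintains a population rather than following a single gradient, it is less prone to getting stuck in a local minimum, which is the advantage claimed for evolutionary training above.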
1.3.4 Hybrid Systems of Recurrent Networks and Hidden Markov Models Recurrent neural networks and hidden Markov models were discussed in Section 1.2. Recurrent neural networks show good generalization performance but are difficult to train; they are capable of modelling long-term dependencies. Hidden Markov models assume that the computation of the current state depends only on the previous state, which is unrealistic for real world applications such as modelling speech sequences involving long-term dependencies. Both systems have strengths and weaknesses, and their strengths can be combined into a hybrid architecture which may provide better training and generalization performance. The equation for the forward procedure in hidden Markov models resembles the recurrence in recurrent neural networks, which is the motivation for developing the hybrid architecture. The main hypothesis is that a hybrid architecture of recurrent neural networks and hidden Markov models can learn and represent dynamical systems such as finite-state automata. The hybrid architecture is also applied to the real world problem of speech phoneme recognition.
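The resemblance mentioned above can be made explicit. Using standard notation (assumed here, not quoted from the thesis), the two recurrences are:

```latex
% Forward procedure of a hidden Markov model, with transition
% probabilities a_{ij} and emission probabilities b_j(o):
\alpha_{t+1}(j) = \Bigl[\,\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Bigr]\, b_j(o_{t+1})

% State update of a first-order recurrent neural network, with
% recurrent weights w_{ji}, input weights v_{jk} and activation f:
S_j(t) = f\Bigl(\sum_{i} w_{ji}\, S_i(t-1) + \sum_{k} v_{jk}\, x_k(t)\Bigr)
```

In both cases the next value of unit (or state) j is a weighted sum over the previous values of all units, modulated by the current observation; this structural similarity is what makes the combination of the two paradigms feasible.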
1.4 Technical Goals In order to validate the above research hypotheses, the following technical objectives need to be fulfilled: 1. The development of a gradient descent learning algorithm for training recurrent neural networks to learn a set of strings with output labels representing finite automata such as deterministic and fuzzy finite automata.
2. The implementation of the Trakhtenbrot-Barzdin algorithm to induce deterministic finite automata through machine learning from a set of strings with corresponding output labels obtained from the trained network. 3. The generalization of the Trakhtenbrot-Barzdin algorithm to induce fuzzy finite automata from strings with fuzzy output labels obtained from recurrent neural networks trained on fuzzy automata. 4. The development of the hybrid learning architecture of recurrent neural networks and hidden Markov models, trained on finite-state automata by evolutionary techniques, to show that it can learn and represent dynamical systems. 5. The exploration of the usefulness of the hybrid architecture of recurrent neural networks and hidden Markov models on the real-world application of speech phoneme recognition.
1.5 Research Methodology The research work proceeds with the development of recurrent neural networks which employ gradient descent for learning deterministic and fuzzy finite-state automata. Similarly, evolutionary neural learning is applied to train recurrent neural networks to learn finite-state automata represented by strings making up the training and test sets. The investigation continues by extracting knowledge from trained recurrent neural networks. This is done by using machine learning methods to induce the corresponding finite automata from networks trained on deterministic and fuzzy finite-state automata, thereby showing the knowledge representation in trained recurrent networks. Once the theoretical foundations of recurrent neural networks have been investigated, the hybrid recurrent neural network architecture inspired by hidden Markov
models is programmed using the C++ programming environment. Furthermore, it will be shown that the proposed hybrid architecture can learn and represent dynamical systems. Finally, to explore its usefulness, the hybrid architecture is applied to the real world application of speech phoneme recognition.
1.5.1 Gradient Descent and Evolutionary Training Backpropagation through time is a gradient descent learning algorithm used for training first-order recurrent neural networks [21]. The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates, which represent the knowledge learnt by the neural network. In its gradient descent search for a solution, the network moves through a weight space of errors and thus may easily become trapped in a local minimum. This may prove costly in terms of network training and generalization performance. Evolutionary neural learning applies the optimization techniques of genetic algorithms to learning in neural networks given a set of training data. The goal of learning is to find a set of weights which best describes the training data set and generalizes well when presented with a testing data set. The main strength of evolutionary neural learning is its ability to search globally for a solution in hypothesis space, whereas methods such as gradient descent only explore a local neighbourhood; evolutionary learning is therefore likely to result in a better solution than those found through gradient descent. Recurrent neural networks are systems that model dynamical processes. Finite-state automata, which define formal languages, have the characteristics of dynamical systems; using finite-state automata, it can be shown that recurrent neural networks can learn and represent dynamical systems. Recurrent neural networks can be trained with strings whose labels are assigned by finite automata such as deterministic and fuzzy finite-state
automata. The training data set is generated by presenting strings of length 1 to 10 to the corresponding finite automaton, which assigns an output label to each string. Similarly, the testing data set is generated for strings of length 1 to 15.
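The data-generation step can be sketched as follows. The deterministic finite automaton used here (accepting binary strings with an even number of 1s) is an illustrative stand-in, not an automaton from the thesis:

```python
from itertools import product

def generate_dataset(delta, start, accepting, min_len, max_len):
    """Label every binary string of length min_len..max_len with the
    automaton's accept/reject decision (1 = accepted, 0 = rejected)."""
    data = []
    for length in range(min_len, max_len + 1):
        for symbols in product("01", repeat=length):
            state = start
            for s in symbols:              # run the string through the DFA
                state = delta[(state, s)]
            data.append(("".join(symbols), 1 if state in accepting else 0))
    return data

# Example DFA: accept strings containing an even number of 1s.
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}

train = generate_dataset(delta, "even", {"even"}, 1, 10)   # training set
test = generate_dataset(delta, "even", {"even"}, 1, 15)    # testing set
```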
In time-varying sequences, longer patterns represent long-term dependencies. Gradient descent has difficulty learning long-term dependencies as the error gradient vanishes with increasing duration of the dependencies [22]. Incremental data learning addresses this problem by learning through working sets of the complete training set containing patterns of increasing lengths. In this way, short-term dependencies are learnt first, which then helps the network to learn longer-term dependencies. A working set contains patterns in increasing order of length. The network trains on each working set for a number of training epochs until it converges. The training is terminated when the network performs satisfactorily on the entire training set or has iterated through all working sets of the training set.
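The incremental construction of working sets can be sketched as below; `train_epoch` is a hypothetical placeholder for one pass of the actual learning algorithm, and the convergence tolerance is an illustrative value:

```python
def build_working_sets(patterns):
    """Group training patterns (string, label) into cumulative working
    sets of increasing string length: the k-th working set contains all
    patterns up to the k-th shortest length present in the data."""
    lengths = sorted({len(s) for s, _ in patterns})
    working_sets = []
    for max_len in lengths:
        working_sets.append([(s, y) for s, y in patterns if len(s) <= max_len])
    return working_sets

def incremental_train(patterns, train_epoch, max_epochs=50, tol=0.01):
    """Train on each working set in turn; short-term dependencies are
    learnt first, which helps the network learn longer ones."""
    for ws in build_working_sets(patterns):
        for _ in range(max_epochs):
            error = train_epoch(ws)     # one pass of the learning algorithm
            if error < tol:             # converged on this working set
                break

# Example: three working sets from patterns of lengths 1, 2 and 3.
data = [("0", 0), ("1", 1), ("01", 1), ("10", 1), ("011", 0)]
sets = build_working_sets(data)
incremental_train(data, lambda ws: 0.0)   # dummy trainer converges at once
```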
1.5.2 Extraction of Fuzzy Finite Automata Knowledge extraction is a useful paradigm for showing the knowledge representation in recurrent neural networks. There are two main approaches for knowledge extraction from trained recurrent neural networks: (1) inducing finite automata by clustering the activation values of hidden state neurons, and (2) applying machine learning methods to induce an automaton from the observation of input-output mappings of the recurrent neural network. Machine learning methods for inducing finite automata will be used to extract knowledge from trained networks. This method can be applied to any network architecture since the algorithm views the network as a black box and only uses the input-output mapping made by the trained network. The Trakhtenbrot-Barzdin algorithm induces a minimal deterministic finite automaton from a set of strings with output labels. The algorithm can readily be used for the extraction of deterministic finite automata; however, a few changes need to be made so that it can also induce fuzzy finite automata. This will be done by
representing fuzzy membership in the output labels of strings for training and extraction. Therefore, the modified Trakhtenbrot-Barzdin algorithm will be able to induce a minimal fuzzy finite automaton given the training set of strings with fuzzy output labels obtained from the trained network.
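A much-simplified pedagogical sketch of this black-box extraction idea is given below. It queries the trained network (replaced here by a hypothetical oracle function) on all strings up to a chosen breadth, merges strings whose labelled suffix behaviour is identical, and reads off an automaton. The real Trakhtenbrot-Barzdin algorithm is more refined than this illustration, and the parity oracle is only a stand-in for a trained network:

```python
from itertools import product

def all_strings(alphabet, max_len):
    """All strings over the alphabet of length 0..max_len."""
    out = [""]
    for length in range(1, max_len + 1):
        out += ["".join(p) for p in product(alphabet, repeat=length)]
    return out

def extract_automaton(label, alphabet, breadth, depth):
    """Induce an automaton from input-output queries to `label`.
    Candidate states are strings up to `breadth`; two strings are merged
    if the labels of all their suffix extensions up to `depth` agree."""
    suffixes = all_strings(alphabet, depth)
    def signature(prefix):
        return tuple(label(prefix + s) for s in suffixes)
    states = {}                    # signature -> state id
    trans = {}                     # (state id, symbol) -> state id
    for prefix in all_strings(alphabet, breadth):
        sid = states.setdefault(signature(prefix), len(states))
        if len(prefix) < breadth:
            for a in alphabet:
                trans[(sid, a)] = states.setdefault(
                    signature(prefix + a), len(states))
    start = states[signature("")]
    # A state accepts iff the empty suffix is labelled 1.
    accepting = {sid for sig, sid in states.items() if sig[0] == 1}
    return start, trans, accepting

# Oracle standing in for a trained network: even number of 1s.
parity = lambda s: 1 if s.count("1") % 2 == 0 else 0
start, trans, accepting = extract_automaton(parity, "01", breadth=4, depth=2)
```

On the parity oracle this sketch recovers the expected two-state automaton.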
1.5.3 The Hybrid Recurrent Neural Network Architecture Recurrent neural networks and hidden Markov models are systems which can model dynamical processes such as sequences in speech. The motivation behind combining recurrent neural networks with hidden Markov models has been discussed in Section 1.1. The equation for the forward procedure in hidden Markov models has a structural resemblance to the recurrence in recurrent neural networks, making the combination feasible. The hybrid architecture will be developed and trained to mimic the behaviour of finite automata to show that it can learn and represent dynamical systems. It has also been discussed in Section 1.5.1 that evolutionary neural learning can be used to train neural networks. Evolutionary neural learning will be used to train the hybrid architecture to learn deterministic and fuzzy finite automata. In this way, it will be shown that the proposed hybrid system can learn and represent dynamical systems, which is a prerequisite for its application to real-world problems.
1.5.4 Application to Speech Phoneme Recognition In the previous section, the methodology used to construct hybrid recurrent neural networks inspired by hidden Markov models has been discussed. It has also been discussed how they can learn and represent dynamical systems, which makes them useful for modelling real-world applications. Speech recognition is a difficult learning problem. As a speech sequence contains a large amount of information that is irrelevant for the recognition task, salient features must be extracted from these sequences in order to make them suitable for modelling. Mel frequency cepstral coefficients (MFCC) are useful features as they have characteristics similar to the human auditory system. The human ear
performs a similar data reduction before presenting information to the brain for further processing. MFCC feature extraction techniques will be used to extract features from phonemes in a speech database. The extracted features will be used to train the proposed hybrid system to show their contribution to the application of speech phoneme recognition.
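A minimal MFCC-style pipeline is sketched below to make the feature-extraction steps concrete: pre-emphasis, framing, windowing, power spectrum, mel filterbank, logarithm, and discrete cosine transform. The parameter values and the synthetic test signal are purely illustrative; the thesis's actual extraction would use a tuned implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                     # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_mfcc=12, frame_len=400, hop=160,
         n_fft=512, n_filters=26):
    """Return an (n_frames, n_mfcc) array of cepstral coefficients."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    fb = mel_filterbank(n_filters, n_fft, sr)
    # DCT-II basis applied to the log filterbank energies.
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc),
                                  np.arange(n_filters) + 0.5) / n_filters)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        feats.append(dct @ np.log(fb @ power + 1e-10))
    return np.array(feats)

# Illustrative usage: a 0.1 s synthetic tone at a 16 kHz sampling rate.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
features = mfcc(np.sin(2 * np.pi * 440 * t), sr)
```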
1.6 Accomplishments
The main contributions of this thesis can be summarized as follows: 1. The generalization of the Trakhtenbrot-Barzdin algorithm to induce fuzzy finite automata by means of machine learning on information obtained from the input-output mappings of the trained networks. This extraction method works irrespective of the recurrent neural network topology and architecture since the algorithm treats the network as a black box and only works on the generalization made by the network upon the presentation of input strings. 2. The development of the hybrid architecture of recurrent neural networks and hidden Markov models. The hybrid architecture learns and represents dynamical systems such as deterministic finite automata, which makes it useful for modelling dynamical systems such as speech sequences. 3. The application of the hybrid system of recurrent neural networks and hidden Markov models to the real-world problem of speech phoneme recognition. This also signifies that the hybrid architecture can be used to model other applications involving time-varying sequences.
This thesis explains knowledge learning and representation in recurrent neural networks through finite automata. The knowledge representation of the proposed hybrid system is further investigated through knowledge extraction by means of machine learning, making the system suitable for real-world applications involving time-varying sequences.
1.7 Thesis Overview The remainder of this thesis is organized as follows. In Chapter 2, the discussion is on the fundamentals of neural networks, focusing on their architectures and computational capabilities. Chapter 3 discusses hybrid systems which combine the strengths of intelligent system paradigms such as neural networks, expert systems and fuzzy logic, including the combination of recurrent neural networks with hidden Markov models. In Chapter 4, the discussion is on automata theory, focusing on finite automata such as deterministic and fuzzy finite-state automata, and on probabilistic systems such as hidden Markov models. Chapter 5 focuses on recurrent neural networks, building on the fundamentals of artificial neural networks; here, the discussion is on recurrent neural network architectures, learning algorithms, knowledge representation and extraction, and their application to real-world problems. Chapter 6 shows and discusses the results of the experiments on training recurrent neural networks on finite automata. Knowledge is extracted from the trained networks, which shows that hybrid systems of recurrent neural networks and hidden Markov models can learn and represent dynamical systems. Chapter 7 describes the extraction of Mel frequency cepstral coefficient features from phonemes read from the speech database, which are used to train the hybrid system of hidden Markov models and recurrent neural networks. Chapter 8 concludes the thesis with a discussion of the findings and directions for future research.
Chapter 2 Neural Network Fundamentals
2.1 Introduction
Artificial neural networks are loosely modelled after biological neural systems. They learn by training on seen data and make generalizations on unseen data. They have been applied to real-world problems such as speech recognition, bio-conservation, gesture recognition and medical diagnostics. Neural networks learn using an algorithm which modifies the interconnection weights as directed by a learning objective for a particular application. A neuron is a single processing unit which computes the weighted sum of its inputs. The output of the network relies on the cooperation of the individual neurons, and the knowledge learnt is distributed over the trained network's weights. Neural networks are classified into feedforward and recurrent neural networks. Feedforward networks are used in applications where the data does not contain time-variant information, while recurrent neural networks model time-series sequences and possess dynamical characteristics. Neural networks are capable of performing tasks that include pattern classification, function approximation, prediction or forecasting, clustering or categorization, time-series prediction, optimization, and control.
2.2 Processing Elements
The basic processing unit of neural networks is the artificial neuron. The standard model of the neuron comprises a set of input connections, a linear combiner and a transfer function. The weight $w_{ij}$ connects the input signal $x_j$ to neuron $i$. The linear combiner computes the weighted sum of the input signals and subtracts a bias or
threshold term. For $N$ input connections, the total net input activation value $y_i$ of neuron $i$ is given by:

$$y_i = \sum_{j=1}^{N} w_{ij} x_j - \theta_i \qquad (2.1)$$

where $\theta_i$ represents the bias of neuron $i$. The transfer function $f(y_i)$ computes the output $o_i$ of the unit. The following basic types of activation functions are widely used:
a) Threshold function: The activation function is described as a discrete step function:
$$o_i = \begin{cases} 1 & \text{if } y_i \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad (2.2)$$

for some threshold value $\tau$. The piecewise linear function is one variation of this function and has the form

$$o_i = \begin{cases} 0 & \text{if } y_i \le -\tau \\ y_i & \text{if } -\tau < y_i \le \tau \\ 1 & \text{if } y_i > \tau \end{cases} \qquad (2.3)$$
b) Sigmoid function: This is a continuously bounded differentiable function given by Equation 2.4:

$$o_i = \frac{1}{1 + e^{-y_i}} \qquad (2.4)$$
Many variations of these types of functions exist. For example, the sigmoid function can be replaced by the hyperbolic function tanh ( yi ) or a polynomial approximation such as the piece-wise linear function in the simplest case. The output interval [0, 1] is also easily scaled to [-1, 1] by linear transformation.
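The activation functions above translate directly into code; the default threshold values below are illustrative choices only:

```python
import math

def threshold(y, tau=0.0):
    """Discrete step function (Equation 2.2)."""
    return 1.0 if y >= tau else 0.0

def piecewise_linear(y, tau=1.0):
    """Piecewise linear variation, following Equation 2.3 literally."""
    if y <= -tau:
        return 0.0
    if y <= tau:
        return y
    return 1.0

def sigmoid(y):
    """Continuously differentiable sigmoid (Equation 2.4)."""
    return 1.0 / (1.0 + math.exp(-y))

def tanh_scaled(y):
    """Hyperbolic tangent output in [-1, 1], rescaled to [0, 1]."""
    return (math.tanh(y) + 1.0) / 2.0
```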
Figure 1 shows the signal flow and the output computation for a single neuron: the weighted input signals $x_1, \ldots, x_N$ are summed at the summation node to give the net input $y_i$, the bias enters as a weight on the constant input $x_0 = -1$, and the activation function $f(\cdot)$ computes the output from the net input.

[Figure 1: Output of a single neuron]
The inputs $x_j$ and the weights $w_{ij}$ with $j = 0, 1, \ldots, N$ can be represented as the vectors $\mathbf{x} = [x_0, x_1, \ldots, x_N]$ and $\mathbf{w}_i = [w_{i0}, w_{i1}, \ldots, w_{iN}]$, respectively. We can then write the activation value of the $i$-th neuron as $y_i = \mathbf{w}_i^{t}\mathbf{x}$ and thus its output as $o_i = f(\mathbf{w}_i^{t}\mathbf{x})$.
2.3 Topologies
Artificial neural networks are typically arranged into an input layer, an output layer and a number of hidden layers. There are no hidden layers in single-layer networks, while multilayer networks have at least one hidden layer. Artificial neural networks can be classified into two different topologies depending on the flow of signals in the network: feedforward networks and feedback or recurrent neural networks. Figure 2 shows examples of (a) feedforward networks, which contain only open-loop interconnections, and (b) recurrent neural networks, which contain one or more closed feedback paths in addition to open-loop paths.
Feedforward networks are used in applications where the data does not contain time-variant information, while recurrent neural networks model time-series sequences and possess dynamical characteristics.
[Figure 2: Neural network topologies. (a) Feedforward network; (b) recurrent neural network.]
In the past, extensive research has been done to improve the training performance of neural networks, which is significant for their generalization. Symbolic or expert knowledge can be inserted into neural networks prior to training for better training and generalization performance. It has been shown that deterministic finite-state automata can be directly encoded into recurrent neural networks prior to training [11]. Until recently, neural networks were viewed as black boxes as they could not explain the knowledge learnt in the training process. The extraction of rules from neural networks shows how they arrived at a particular solution after training. The extraction of finite-state automata from trained recurrent neural networks shows that they have the characteristics needed for modelling dynamical systems.
2.3.1 Feedforward Networks
Feedforward networks have been applied to bio-conservation [23], molecular biology [24] and many other pattern classification problems. Feedforward networks contain an input layer, one or more hidden layers and an output layer. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a transfer function of their weighted sum of inputs. Figure 2 (a) shows the topology of a feedforward network. The dynamics of a feedforward network are described by Equation 2.5:

$$S_j^l = g_j\!\left( \sum_{i=1}^{m} S_i^{\,l-1} w_{ji}^{l} - \theta_j^{l} \right) \qquad (2.5)$$

where $S_j^l$ is the output of neuron $j$ in layer $l$, $S_i^{l-1}$ is the output of neuron $i$ in layer $l-1$ (containing $m$ neurons), $w_{ji}^{l}$ is the weight associated with the connection to neuron $j$, $\theta_j^l$ is the internal threshold/bias of the neuron, and $g_j$ is the discriminant (activation) function.
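Equation 2.5 translates directly into a layer-by-layer forward pass; the layer sizes and weights in the toy network below are arbitrary illustrations:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer_forward(inputs, weights, biases):
    """Compute S_j = g(sum_i S_i * w_ji - theta_j) for one layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) - b)
            for row, b in zip(weights, biases)]

def feedforward(x, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

# Toy 2-2-1 network with hand-picked (illustrative) parameters.
net = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer
       ([[1.0, 1.0]], [0.5])]                     # output layer
out = feedforward([1.0, 0.0], net)
```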
The error backpropagation network and self-organizing networks are commonly used instances of feedforward networks [25, 26]. The backpropagation network has neurons with a sigmoidal transfer function. A self-organizing network consists of neurons in a rectangular planar arrangement employing a competitive learning rule.
2.3.2 Recurrent Neural Networks Recurrent neural networks have been applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Recurrent neural networks are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer. Each layer contains one or more neurons which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs.
[Figure 3: Architectures of recurrent neural networks. (a) Elman network; (b) Jordan network; (c) Robinson and Fallside network; (d) Williams and Zipser network.]
Recurrent neural networks maintain information about their past states for the computation of future states and outputs. The feedback signals in recurrent neural networks may be transmitted through a time delay. In contrast to feedforward networks, recurrent neural networks are dynamical systems whose next state and output depend on the present network state and input; they are particularly useful for modelling dynamical systems. Popular architectures of recurrent neural networks include first-order recurrent networks [27], second-order recurrent networks [28], NARX networks [29] and LSTM recurrent networks [30]. First-order recurrent neural networks use context units to store the outputs of the state neurons from the previous time steps. Figure 3 shows this in detail, where dashed lines show feedback connections. The Elman architecture uses a context layer which makes a copy of the hidden layer outputs of the previous time step, as shown in Figure 3(a). This architecture can be expanded to include
additional hidden layers [31]. The Jordan network is the earliest network specifically designed for temporal sequence modelling and gained popularity by highlighting the idea of context units. In the Jordan architecture [32], the context layer receives feedback from the output as well as the context layer activations of the previous time steps, as shown in Figure 3(b). The context units are also referred to as state units. The hidden and output layer activations $h_j(t)$ and $y_k(t)$, respectively, are calculated at time $t$ from the weighted sum of the inputs $x_i(t)$ and the context units as follows:

$$h_j(t) = f\!\left( \sum_{i=1}^{n} w_{ji} x_i(t) + \sum_{i=1}^{m} w_{ji} c_i(t) \right) \qquad (2.6)$$

$$y_k(t) = g\!\left( \sum_{j=1}^{p} w_{kj} h_j(t) \right) \qquad (2.7)$$

where $n$, $m$ and $p$ are the number of input, context and hidden units, respectively. The context layer units are updated at each time step by:

$$c_i(t+1) = \alpha\, c_i(t) + y_i(t) \qquad (2.8)$$
where $\alpha$ is the decay rate of the context units. The Robinson and Fallside architecture is a single-layer network with a context layer of fully recurrent connections, as shown in Figure 3(c). The model proposed by Williams and Zipser instead consists of a single layer of fully connected units from the output to the context layer, as shown in Figure 3(d) [33]. The outputs $y_j(t)$ and the context unit values at the following time step $c_i(t+1)$ are computed by:

$$y_j(t) = f\!\left( \sum_{i=1}^{n} w_{ji} x_i(t) + \sum_{i=1}^{m} w_{ji} c_i(t) \right) \qquad (2.9)$$

$$c_i(t+1) = g\!\left( \alpha \sum_{j} w_{ij} y_j(t) \right) \qquad (2.10)$$
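A first-order recurrent step of the kind described in this section can be sketched as follows, using an Elman-style update in which the context layer copies the previous hidden activations; the layer sizes and weight values are arbitrary illustrations:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step: hidden from inputs + context, output from hidden,
    then the context becomes a copy of the hidden activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_in[j], x)) +
                      sum(v * ci for v, ci in zip(w_ctx[j], context)))
              for j in range(len(w_in))]
    output = [sigmoid(sum(w * hj for w, hj in zip(w_out[k], hidden)))
              for k in range(len(w_out))]
    return output, hidden            # hidden is the next context

# Toy network: 1 input, 2 hidden/context units, 1 output (weights arbitrary).
w_in = [[0.8], [-0.4]]
w_ctx = [[0.1, 0.2], [0.3, -0.1]]
w_out = [[1.0, -1.0]]
context = [0.0, 0.0]
outputs = []
for x in ([1.0], [0.0], [1.0]):      # a short input sequence
    y, context = elman_step(x, context, w_in, w_ctx, w_out)
    outputs.append(y[0])
```

Because the context carries the previous hidden state forward, the output at each step depends on the whole input history, not just the current input.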
2.4 Learning Algorithm
The goal of learning in neural networks is to approximate the desired output of the network for a given set of inputs. This is done by adjusting the weights in the network according to a learning rule until a certain criterion is met. This criterion is usually expressed in terms of the network output error $\varepsilon \ge 0$. In supervised learning, the training data consists of input vectors and corresponding outputs. In this case, the task of the learning algorithm is to adjust the weights in the network so that the architecture can predict the correct output for each input vector. The network must also be able to predict the classification of input vectors which were not present in the training set. In unsupervised learning, there is no prior output in the training data for the corresponding input vector; unsupervised learning is usually used for data compression. In its purest form, gradient descent is a technique for function optimization. Gradient descent search techniques have been popular methods for training neural networks. In the training process, the objective is to adjust the connection weights associated with pairs of processing units connecting one layer to another. The objective function measures the difference between the desired output for given inputs and the actual network output on the same inputs. The delta learning rule is a gradient descent learning rule for updating the weights of the network. The delta rule is derived by minimizing the error in the output of
the network through gradient descent. The error $E$ of a network with outputs indexed by $j$ can be measured as:

$$E = \frac{1}{2} \sum_{j} \left( t_j - y_j \right)^2 \qquad (2.11)$$

where $t_j$ is the target output and $y_j$ is the actual output. The idea is to move through the weight space of the neuron in proportion to the gradient of the error function with respect to each weight. The derivative $\partial E / \partial w_{ji}$ for the $i$-th weight can be simplified and written as:

$$\frac{\partial E}{\partial w_{ji}} = -(t_j - y_j)\, g'(h_j)\, x_i \qquad (2.12)$$

where $g(x)$ is the neuron's activation function. In gradient descent, the change in each weight is proportional to the negative of this gradient. Choosing a proportionality constant $\alpha$ allows us to eliminate the minus sign and move the weight in the negative direction of the gradient to minimize the error; hence we arrive at the update rule:

$$\Delta w_{ji} = \alpha\, (t_j - y_j)\, g'(h_j)\, x_i \qquad (2.13)$$
where $\alpha$ is a small constant called the learning rate. A useful characteristic of the delta rule is that it can be generalized to multilayer networks, where it forms the basis of the widely used backpropagation learning algorithm [25]. Backpropagation employs gradient descent learning and is the most popular algorithm used for training neural networks. One limitation of training neural networks using gradient descent learning is that they can become trapped in local minima, resulting in poor training and generalization performance. Evolutionary optimization methods such as genetic algorithms are also used for neural network training; they do not face the problems of gradient descent learning [9].
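The delta rule of Equation 2.13 can be applied to a single sigmoid neuron; the toy OR data set, learning rate and epoch count below are illustrative choices:

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def train_delta(data, n_inputs, lr=0.5, epochs=3000):
    """Gradient descent with the delta rule: dw_ji = lr*(t-y)*g'(h)*x_i.
    A constant input of -1 folds the bias into the weight vector."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, t in data:
            x = list(x) + [-1.0]                   # bias input
            h = sum(wi * xi for wi, xi in zip(w, x))
            y = sigmoid(h)
            gprime = y * (1.0 - y)                 # sigmoid derivative g'(h)
            for i in range(len(w)):
                w[i] += lr * (t - y) * gprime * x[i]
    return w

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, list(x) + [-1.0])))

# Learn the (linearly separable) OR function.
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_delta(or_data, n_inputs=2)
```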
2.5 Computational Capabilities
The computational capabilities of neural networks are determined by their topology and the number of parameters. Neural networks can represent Boolean, continuous and arbitrary functions. Every Boolean mapping from {0,1}^n to {0,1} can be represented exactly by a network with one hidden layer and threshold units [34]. All bounded continuous functions can be approximated with arbitrarily small error by a network with one hidden layer using units with a continuously differentiable activation function [35]. The number of hidden units required depends on the function to be approximated. Any function can be approximated to arbitrary accuracy by a network with two hidden layers using sigmoidal functions [36]. The learning complexity is the rate at which the network converges to a solution. In the worst case, learning a mapping which represents the relationships in the data may require exponential time irrespective of the training algorithm. It has been shown that the learning problem in neural networks is NP-complete, even for approximate learning [37]. If an appropriate architecture and learning algorithm are selected, then learning in polynomial time is possible for certain mappings [38]. The generalization ability of neural networks is an important measure of their performance as it indicates the accuracy of the trained network when presented with data not present in the training set. A poor choice of network architecture will result in poor generalization even with optimal weight selection. The organization of neurons in the hidden layer may affect generalization; too many neurons may result in overfitting or overtraining, while too few neurons will result in underfitting. A network that is not sufficiently complex can fail to detect the signal in a complicated data set, leading to underfitting. A network that is too complex may fit the noise, not just the signal, leading to overfitting. Overfitting can easily lead to predictions
that are far beyond the range of the training data, which affects the network's generalization performance. Generalization performance in the case of overfitting may be improved by increasing the number of instances in the training set. Another technique is to use weight decay during training; weight decay in gradient descent learning fractionally decreases the weights during each iteration [39]. A successful method is to provide a validation set in addition to the training data. In this method, the training algorithm monitors the generalization error with respect to the validation set and terminates the training before this error increases.
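The validation-based early-stopping strategy can be sketched generically; `train_epoch` and `validation_error` are hypothetical callbacks standing in for the actual training algorithm, and the simulated error curve is purely illustrative:

```python
def train_with_early_stopping(train_epoch, validation_error,
                              max_epochs=100, patience=3):
    """Stop training when the validation error has not improved for
    `patience` consecutive epochs; return the epoch of the best error."""
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:      # validation error is rising: stop
                break
    return best_epoch, best_error

# Simulated validation curve: falls, then rises (overfitting sets in).
curve = iter([0.9, 0.6, 0.4, 0.3, 0.35, 0.4, 0.5, 0.6])
result = train_with_early_stopping(lambda: None, lambda: next(curve))
```

The `patience` window prevents the loop from stopping at the first small fluctuation in the validation error.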
2.6 Summary
The fundamental concepts of artificial neural networks have been discussed in this chapter. Feedforward networks are applied in problems where time variant information is not present whereas recurrent neural networks are used for modelling dynamical processes. The goal of learning in neural networks is to approximate the output with a given set of inputs. Backpropagation employs gradient descent and is the most commonly used algorithm for training neural networks. The learning complexity and generalization are the two major factors which measure the performance of neural networks.
Chapter 3 Hybrid Systems
3.1 Introduction
Hybrid systems combine the strengths of at least two intelligent system paradigms. Examples of hybrid systems include evolutionary neural learning, neural expert systems, neuro-fuzzy systems, and symbolic connectionist learning. Neural expert systems combine the learning in neural networks with the reasoning in expert systems; they are therefore able to learn from past experience and also have the ability to explain to their user how they arrived at a particular solution. In hybrid symbolic connectionist learning, expert knowledge is inserted into neural networks prior to training for better generalization and training performance. Neuro-fuzzy systems, on the other hand, combine the human-style reasoning of fuzzy logic with neural network learning. Evolutionary neural learning combines the optimization technique of genetic algorithms with neural networks to learn the weights of the network from past experience. This chapter explains the proposed hybrid system, known as hybrid recurrent neural networks, which combines the strengths of recurrent neural networks with hidden Markov models for modelling time-varying sequences.
3.2 Symbolic Connectionist Learning
3.2.1 General Paradigm The general paradigm of symbolic connectionist learning includes the combination of symbolic knowledge with neural networks for better training and generalization performance [40, 41]. The traditional connectionist approach of using neural networks involves initializing a neural network with small random values and training it using an optimization method such as gradient descent or genetic algorithms on known data to perform a certain task. After successful training, the network can take advantage of its generalization capability to perform tasks such as classification and recognition when presented with data. During the entire process, the knowledge remains hidden in the network's adaptable connections, hence the name 'connectionist representation'. The connectionist representation is shown in the lower part of Figure 4. The paradigm can be enriched with symbolic knowledge by initializing a network with prior knowledge, i.e. the initial domain theory, prior to training. A translation of information from a symbolic into a connectionist representation is required. This is done by programming a subset of weights in the network prior to training instead of choosing small random values. The programmed weights define a starting point in weight space for the search for a solution during training. Examples of this approach include pre-structuring a network with Boolean concepts and imposing rotation invariance in neural networks for image recognition [18, 41, 42]. Figure 4 shows the use of neural networks in knowledge refinement, which consists of: (i) insertion of prior knowledge, known as the initial domain theory, into a neural network, (ii) refinement of knowledge through training the network on examples, and (iii) extraction of the learnt knowledge from the trained network in symbolic form, known as the refined domain theory.
Figure 4 The framework for combining symbolic and neural learning
Once the network has been trained, knowledge can be extracted in symbolic form, i.e. the refined domain theory. The extracted knowledge may approximate the network's true knowledge; in some cases, the extracted knowledge may even outperform the trained network.
3.2.2 The Significance and Insertion of Prior Knowledge The fidelity of the mapping of the prior knowledge is very important since the network may not take advantage of poorly encoded knowledge; poorly encoded knowledge may even hinder the learning process. Well-encoded prior knowledge may provide the network with beneficial features: (1) the learning process may converge to a solution faster, meaning better training performance, (2) networks trained with prior knowledge may provide better generalization when compared to networks trained without prior knowledge, and (3) the rules in the prior knowledge may help to generate additional training data not present in the original data set.
Prior knowledge, usually represented as explicit rules in symbolic form, is encoded in neural networks by programming some weights prior to training [18]. In feedforward neural networks, prior knowledge is encoded in propositional logic expression form by programming a subset of weights. Prior knowledge also determines the topology of the network, i.e. the number of neurons and hidden layers appropriate for encoding the knowledge. The paradigm has been successfully applied to real-world problems including bio-conservation [23] and molecular biology [24]. Prior or expert knowledge helps the network achieve better generalization and training performance when compared with network architectures without prior knowledge encoding. For recurrent neural networks, finite-state automata are the basis for knowledge insertion. It has been shown that deterministic finite-state automata can be encoded in discrete-time second-order recurrent neural networks by directly programming a small subset of the available weights [11]. For first-order recurrent neural networks, a method for encoding finite-state automata has also been proposed and demonstrated [43].
3.2.3 Knowledge Extraction In the past, neural networks were considered 'black boxes' as they could not explain the knowledge acquired in their weights during the training process. Research on the topic has resulted in a number of algorithms for knowledge extraction in symbolic form [14, 15]. In feedforward networks, knowledge is usually extracted in the form of Boolean and fuzzy 'if-then' clauses [44, 45]. For recurrent networks, finite-state automata have been the main paradigm for temporal symbolic knowledge extraction [46, 47]. The goal of knowledge extraction is to find the knowledge stored in the network's weights in symbolic form. One main concern is the fidelity of the extraction process, i.e. how accurately the extracted knowledge corresponds to the knowledge stored in the network. Extraction algorithms can be divided into two classes: decompositional methods extract rules from the internal network structure by examining the weights and nodes of the network, e.g. extraction through clustering [15]; pedagogical methods view the trained neural network as a black box and use machine learning methods to extract rules from the input-output mapping of the network, e.g. the TB algorithm [14, 48]. These methods will be further discussed in the chapters to follow.
3.2.4 Knowledge Refinement Knowledge refinement or revision is the main goal of learning in a hybrid system in which neural learning is combined with knowledge extraction to produce a more accurate set of rules within a given domain [49]. The initial domain knowledge, which may contain information inconsistent with the available training data, is encoded in a network. The network is then trained on the available data set over several training runs, depending on how close the initial symbolic knowledge is to the final solution. The refined or revised rules, i.e. in the case of poor prior knowledge, can then be extracted from the trained network in symbolic form. The benefits of knowledge refinement include: (1) better training performance, (2) improved generalization performance, and (3) better understanding of the internal representation of the trained network.
3.3 Neural Expert Systems
Expert systems and neural networks share common goals as they both imitate human intelligence and eventually create an intelligent machine. In rule based expert systems, knowledge is represented by if-then rules collected by interviewing human experts. Once the rules are stored in the knowledge base, they cannot be modified by the
expert system, as expert systems cannot learn from experience and adapt to a new environment. Neural networks, in contrast, learn from past experience: knowledge is acquired during the training process when a training set of data is presented to the network. Neural networks learn without human intervention but are unable to explain how a solution is reached. In expert systems, however, knowledge can be divided into individual rules and the user can see which piece of knowledge was applied; in neural networks, one cannot understand the knowledge represented in the weight connections. The neural expert system is a hybrid system which combines the strengths of neural networks and expert systems [50]. Neural expert systems use a trained neural network in place of the knowledge base, which makes them capable of generalizing when presented with noisy data. The rule extraction unit examines the trained network and extracts rules which explain to the user how the neural network arrived at a particular solution. Figure 5 shows the basic structure of a neural expert system. Neural expert systems have the limitations of Boolean logic; they cannot represent continuous input variables, as this would lead to an infinite increase in the number of rules. One way to overcome this limitation is to use fuzzy logic, which is discussed in the section to follow.
[Figure 5 components: training data and new data feed a neural knowledge base; a rule extraction unit produces if-then rules for the inference engine, explanation facilities and user interface, which serve the user.]
Figure 5 Neural expert system
3.4 Neuro-Fuzzy Systems
Neuro-fuzzy systems combine the human-like reasoning style of fuzzy systems through the use of fuzzy sets with the learning in neural networks. They have been applied to the behavioural representation of computer generated forces and pattern recognition problems [51]. Knowledge in the form of fuzzy rules is used to program weights in the network architecture of fixed dimensions and trained with a data set. The training is usually done with the standard backpropagation algorithm used for training neural networks. The knowledge encoded is refined during training.
The structure of a neuro-fuzzy system is similar to that of a multi-layer neural network: it has an input layer, an output layer, and a few hidden layers which represent membership functions and fuzzy rules. Figure 6 shows an example of a neuro-fuzzy system. The crisp inputs are transmitted to Layer 2, which represents the fuzzy sets used in the antecedents of the fuzzy rules. Each neuron in Layer 3 corresponds to a single fuzzy rule and receives inputs from the fuzzification neurons representing fuzzy sets in the rule antecedents. The output of a neuro-fuzzy system is crisp; defuzzification combines the strengths of the output membership functions into a single fuzzy set from which a crisp value is computed.
[Figure 6 layers: Layer 1, crisp inputs; Layer 2, input membership functions; Layer 3, fuzzy rules; Layer 4, output membership functions; Layer 5, defuzzification.]
Figure 6 An example of a neuro-fuzzy system
A neuro-fuzzy system can be interpreted as a system of fuzzy rules before, during and after training. The extraction of fuzzy rules from neuro-fuzzy systems is done by inspecting connection weights which fall above a threshold [52].
3.5 Evolutionary Neural Networks
3.5.1 Evolutionary Neural Learning Evolutionary computation methods such as genetic algorithms have been applied to training neural networks since gradient descent learning in neural networks cannot guarantee an optimal solution [53]. It is often observed that a gradient descent search gets trapped in a local minimum, resulting in poor training and generalization performance. Genetic algorithms can help the network escape from local minima; this ability has made them popular choices for training neural networks. Genetic algorithms rely on the crossover operator, a reproduction heuristic which forms offspring by recombining representational components of two members of the population without regard to their content. This approach assumes that components of all parent representations may be freely exchanged without altering the search process. Usually, the crossover and mutation operators are adapted when genetic algorithms are used for training neural networks, in order to handle the real-valued weights of the network. The paradigm will be discussed in detail in upcoming chapters.
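The following sketch shows real-coded genetic training of the kind described above, with arithmetic crossover and Gaussian mutation acting directly on real-valued weights. To keep it tiny, the "network" is a single sigmoid neuron learning the AND function; the population size, mutation strength and truncation selection scheme are illustrative choices, not the thesis's exact settings.

```python
import math
import random

random.seed(0)

# Toy task: a single sigmoid neuron learning AND over binary inputs.
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def forward(w, x):
    net = w[0] * x[0] + w[1] * x[1] + w[2]   # w[2] is the bias
    return 1.0 / (1.0 + math.exp(-net))

def error(w):
    return sum((forward(w, x) - t) ** 2 for x, t in DATA)

def crossover(p1, p2):
    a = random.random()                      # arithmetic crossover
    return [a * g1 + (1 - a) * g2 for g1, g2 in zip(p1, p2)]

def mutate(w, sigma=0.3):
    return [g + random.gauss(0, sigma) for g in w]

# Chromosomes hold the weights as small real numbers.
pop = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(30)]
for gen in range(200):
    pop.sort(key=error)
    parents = pop[:10]                       # truncation selection (elitist)
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]

best = min(pop, key=error)
print(round(error(best), 3))
```

Keeping the parents in the next generation (elitism) ensures the best error never increases between generations.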
3.5.2 Evolutionary Neural Topologies The topology of a neural network, i.e. the number of neurons in each layer, is a major factor affecting the training and generalization of the network. Selecting the right topology has been a major problem as trial-and-error methods require tremendous effort and are time consuming. Genetic algorithms have also been applied to selecting optimal topologies for neural networks [9]. The basic idea of evolving a suitable network architecture is to conduct a genetic search in a population of different network topologies. A square connectivity matrix can be used for encoding the topology into the chromosomes of the genetic algorithm. In each iteration, each chromosome is decoded into a network topology, which is trained on a data set for a number of epochs, and a sum of squared errors is obtained. The genetic algorithm uses the sum of squared errors to search for the best network topology. The process iterates until the algorithm finds a network topology with the least squared error.
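A sketch of the connectivity-matrix encoding follows: an N × N binary matrix is flattened into a chromosome and evolved with one-point crossover and bit-flip mutation. The evaluate() function is a stand-in; in the scheme described above it would decode the chromosome, train the resulting network for a few epochs and return the sum of squared errors. Here it simply prefers sparser topologies so the example runs on its own.

```python
import random

random.seed(1)

N = 4   # units; the chromosome is the flattened N x N connectivity matrix

def decode(chrom):
    # Row i of the matrix lists the outgoing connections of unit i.
    return [chrom[i * N:(i + 1) * N] for i in range(N)]

def evaluate(chrom):
    # Hypothetical fitness: prefer sparser topologies. In the real
    # scheme this would train the decoded network and return its error.
    return sum(chrom)

pop = [[random.randint(0, 1) for _ in range(N * N)] for _ in range(20)]
for gen in range(50):
    pop.sort(key=evaluate)
    survivors = pop[:10]
    children = []
    for _ in range(10):
        p1, p2 = random.sample(survivors, 2)
        cut = random.randrange(1, N * N)     # one-point crossover
        child = p1[:cut] + p2[cut:]
        i = random.randrange(N * N)          # bit-flip mutation
        child[i] ^= 1
        children.append(child)
    pop = survivors + children

best = min(pop, key=evaluate)
print(decode(best))
```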
3.6 Hybrid Recurrent Neural Networks Inspired by Hidden Markov Models
3.6.1 Motivation Recurrent neural networks (RNNs) have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. They have the ability to provide good generalization performance on unseen data but are difficult to train. Hidden Markov models (HMMs) have also been applied to difficult real-world problems involving time-varying patterns; for instance, they have been very popular in speech recognition [6]. Hidden Markov models are easier to train, i.e. they learn faster than recurrent neural networks, but their generalization performance may not match that of recurrent neural networks. The structural similarity between hidden Markov models and recurrent neural networks is the basis for mapping hidden Markov models into recurrent neural networks: the recurrence equation of the recurrent neural network resembles the equation of the forward algorithm in hidden Markov models [54, 55]. The combination of the two paradigms into a hybrid system may provide better generalization and training performance, which would be a useful contribution to the fields of machine learning and pattern recognition. The new architecture is named hybrid recurrent neural networks in the discussion that follows.
3.6.2 Significance of Hybrid Recurrent Neural Networks We have stated earlier that the structural similarities of hidden Markov models and recurrent neural networks form the basis for combining the two paradigms into the hybrid architecture. Why is this a good idea? Most often, first-order hidden Markov models are used in practice, in which successor states depend only on the previous state. This assumption is unrealistic for many real-world applications of hidden Markov models. It has been shown that recurrent neural networks can learn higher-order dependencies from training data [56]. Furthermore, the number of states in a hidden Markov model must be fixed beforehand for a particular application, yet the appropriate number of states varies between applications. The theory of recurrent neural networks and hidden Markov models suggests that the combination of the two paradigms may provide better generalization and training performance. The proposed hybrid recurrent neural network architecture may also be capable of learning higher-order dependencies, and one does not need to fix the number of states as in the case of hidden Markov models.
3.6.3 The Derivation for Hybrid Recurrent Neural Networks The structural similarities of the two paradigms are examined here in order to design the hybrid recurrent neural networks architecture. Consider the equation of the forward procedure for the calculation of the probability of the observation O given the model λ , thus P (O | λ ) in hidden Markov models is given by:
α_j(t) = ( Σ_{i=1}^{N} α_i(t−1) a_ij ) b_j(O_t),    1 ≤ j ≤ N        (3.1)
where N is the number of hidden states in the HMM, a_ij is the probability of making a transition from state i to j, and b_j(O_t) is the Gaussian distribution for the observation at time t. The calculation in Equation 3.1 is inherently recurrent and bears resemblance to
the recursion of recurrent neural networks as shown in Equation 3.2.

x_j(t) = f( Σ_{i=1}^{N} x_i(t−1) w_ij ),    1 ≤ j ≤ N        (3.2)

where f(.) is a non-linearity such as the sigmoid, N is the number of hidden neurons and w_ij are the weights connecting the neurons with each other and with the input nodes. The dynamics of a first-order recurrent neural network is given by:

S_i(t) = g( Σ_{k=1}^{K} V_ik S_k(t−1) + Σ_{j=1}^{J} W_ij I_j(t−1) )        (3.3)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons, respectively, and V_ik and W_ij represent their corresponding weights. g(.) is a sigmoidal discriminant function. Equation 3.3 is combined with Equation 3.1 to form the hybrid architecture. The subscript j in b_j(O_t), which denotes the state at time t in hidden Markov models, is replaced in order to incorporate this feature into recurrent neural networks. Hence, the dynamics of the hybrid recurrent neural network architecture shown in Figure 7 is given by:
S_i(t) = Σ_{k=1}^{K} V_ik S_k(t−1) + ( Σ_{j=1}^{J} W_ij I_j(t−1) ) · b_{t−1}(O)        (3.4)
where b_{t−1}(O) is the Gaussian distribution. Note that the subscript of b_{t−1}(O), i.e. time t−1, in Equation 3.4 differs from the subscript of the Gaussian distribution in Equation 3.1. The dynamics of hidden Markov models and recurrent networks differ in this respect; however, we can adjust the parameter for time t as shown in Equation 3.4 in order to map hidden Markov models into recurrent neural networks. For a single input, the univariate Gaussian distribution is given by:
b_t(O) = ( 1 / (√(2π) σ) ) exp( −(O − μ)² / (2σ²) )        (3.5)
where O is the observation at time t, μ is the mean and σ² is the variance. For multiple inputs to the hybrid recurrent neural networks, the multivariate Gaussian for d dimensions is given by:
b_t(O) = ( 1 / ( (2π)^{d/2} |Σ|^{1/2} ) ) exp( −(1/2) (O − μ)^t Σ^{−1} (O − μ) )        (3.6)
where O is a d-component column vector, μ is a d-component mean vector, Σ is a d-by-d covariance matrix, and |Σ| and Σ^{−1} are its determinant and inverse, respectively. Figure 7 shows how the Gaussian distribution of the hidden Markov model is mapped into the hybrid recurrent neural network. The output of the Gaussian function depends solely on two parameters, the mean and the variance; these parameters summarize the sequence of input data presented to the hybrid architecture, which may be DFA strings or data from a real-world time series such as speech.
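Equation 3.5 can be checked directly in code; the function below is a literal transcription of the univariate Gaussian density.

```python
import math

# Equation 3.5: univariate Gaussian emission density for observation O,
# with mean mu and variance sigma2.
def b_t(O, mu, sigma2):
    return (1.0 / math.sqrt(2.0 * math.pi * sigma2)) * \
           math.exp(-0.5 * (O - mu) ** 2 / sigma2)

print(b_t(0.0, 0.0, 1.0))   # standard normal density at 0 ~ 0.3989
```

The multivariate case of Equation 3.6 follows the same pattern, with the covariance matrix and its determinant and inverse in place of σ².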
3.6.4 Training Hybrid Recurrent Neural Networks Hybrid recurrent neural networks can be trained using gradient descent learning methods and genetic algorithms. In the case of training on strings of certain lengths representing a finite automaton, the univariate Gaussian for one-dimensional input shown in Equation 3.5 is used. For real-world applications involving multiple dimensions, the multivariate Gaussian function of Equation 3.6 is used instead. The dashed lines in Figure 7 indicate that the architecture can represent more neurons in the hidden and input layers if required. The output of the Gaussian is further multiplied
with the output of the neurons in the hidden layer. Note that one Gaussian distribution will be used irrespective of the number of neurons in hidden and input layer.
Figure 7 Hybrid recurrent neural networks
In gradient descent training, the general backpropagation-through-time learning algorithm can be applied, keeping in mind that additional parameters, i.e. the mean and variance of the Gaussian distribution, must also be trained. Gradient descent learning, as discussed earlier, has its drawbacks: the network can become trapped in a local minimum, resulting in poor training and generalization performance. With evolutionary training methods such as genetic algorithms, the trainable parameters, i.e. the weights, biases, mean and variance of the Gaussian distributions, are encoded in the chromosomes as small real values. The genetic algorithm then finds the optimal values of the trainable parameters for the hybrid recurrent neural network given a set of training data.
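One time step of Equation 3.4 can be sketched as follows. All weights, the observation and the Gaussian parameters below are illustrative values rather than trained ones, and the update is implemented as printed, i.e. with a single Gaussian shared by the whole hidden layer and no squashing function.

```python
import math

def gaussian(O, mu, sigma2):
    # Equation 3.5: the emission density of the previous observation.
    return (1.0 / math.sqrt(2.0 * math.pi * sigma2)) * \
           math.exp(-0.5 * (O - mu) ** 2 / sigma2)

def hybrid_step(S_prev, I_prev, O_prev, V, W, mu, sigma2):
    # Equation 3.4: the input contribution of every hidden neuron is
    # modulated by one shared Gaussian b_{t-1}(O).
    K, J = len(S_prev), len(I_prev)
    b = gaussian(O_prev, mu, sigma2)
    return [sum(V[i][k] * S_prev[k] for k in range(K)) +
            sum(W[i][j] * I_prev[j] for j in range(J)) * b
            for i in range(K)]

S = hybrid_step(S_prev=[0.1, 0.2], I_prev=[1.0], O_prev=0.5,
                V=[[0.3, -0.1], [0.2, 0.4]], W=[[0.5], [-0.2]],
                mu=0.0, sigma2=1.0)
print(S)
```

In genetic training, every number in V, W and the pair (mu, sigma2) would sit in the chromosome and be evolved together.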
3.7 Summary
In this chapter we have discussed how the strengths of different intelligent system paradigms can be combined into hybrid systems. The combination of neural networks with symbolic knowledge in symbolic connectionist learning has been discussed. The strengths of neural networks and expert systems can be combined into a hybrid system which can explain to the user how it arrived at a particular solution and also learns from past experience; upon presentation of new data, such a system can update its existing rules through learning in order to keep up to date with a changing environment. Neuro-fuzzy systems combine the human-like reasoning of fuzzy logic with the learning of neural networks; these systems have contributed to a wide range of applications including pattern recognition problems. Genetic algorithms can be used to train neural networks, thereby alleviating the problem of local minima that frequently arises in gradient descent learning; they have also been applied to the optimal selection of neural network topologies. Finally, the proposed hybrid system has been discussed in detail together with its possible training methods.
Chapter 4 Automata Theory
4.1 Introduction
Finite-state automata represent dynamical behaviour and are useful frameworks for studying recurrent neural networks as no feature extraction is necessary. A deterministic finite automaton is a finite automaton in which, for each pair of state and input symbol, there is exactly one transition to a next state. The output of a deterministic finite automaton is either an accepting or a rejecting state. A fuzzy finite automaton is a finite-state automaton in which, for each pair of state and input symbol, there is a set of possible successor states. Hidden Markov models are finite-state machines and have been successfully applied to modelling speech sequences. A detailed discussion of hidden Markov models is given in this chapter.
4.2 Formal Languages
Formal languages are useful for studying recurrent neural networks for a number of reasons: (1) there is no need for feature extraction; (2) they allow us to study the knowledge representation in recurrent neural networks; and (3) the dynamics of many real-world processes can be represented as finite-state processes at some level of abstraction [57]. Finite-state automata encoding, induction and extraction have been studied extensively [27, 28, 43, 46, 47, 48].
An alphabet Σ is a finite set of symbols. A formal language is a set of strings of symbols over some alphabet. Simple alphabets, e.g. Σ = {0, 1}, are typically considered in the study of formal languages since results can easily be extended to larger alphabets. The set of all strings of odd parity, i.e. strings containing an odd number of 1s, L = {1, 01, 10, 001, 010, 100, 111, …}, is an example of a simple language. The symbol ε denotes the null string; since it contains no 1s, it is not a member of L. The language contains an infinite number of strings.
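Membership in such a language is trivial to test in code; the function below accepts exactly the strings over {0, 1} containing an odd number of 1s.

```python
# Membership test for the example language: strings over {0, 1} with an
# odd number of 1s. The null string (here "") contains no 1s and is
# therefore rejected.
def odd_parity(s):
    return s.count("1") % 2 == 1

print([w for w in ["", "1", "01", "001", "011", "101", "111"]
       if odd_parity(w)])   # ['1', '01', '001', '111']
```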
4.3 Finite-State Automata
A finite-state automaton is a device that can be in one of a finite number of states. Under certain conditions, it can switch to another state; this is called a transition. When the automaton starts processing input, it starts in one of its initial states. Some automata have another important subset of states: the final (or accepting) states. After processing an input sequence, an automaton is said to accept or reject its input according to whether the last state reached is an accepting state. Finite-state automata are used as test beds for training recurrent neural networks: the strings used for training do not need to undergo any feature extraction, and they are used to show that recurrent neural networks can represent dynamical systems.
4.3.1 Deterministic Finite-State Automata A language is a set of strings over a finite alphabet Σ = {σ_1, σ_2, …, σ_{|Σ|}}. The length of a string ω will be denoted |ω|. A deterministic finite-state automaton (DFA) is defined as a 5-tuple M = (Q, Σ, δ, q_1, F), where Q is a finite set of states, Σ is the input alphabet, δ is the next-state function δ: Q × Σ → Q which defines which state q' = δ(q, σ) is reached by the automaton after reading symbol σ in state q, q_1 ∈ Q is the initial state of the automaton (before reading any string), and F ⊆ Q is the set of accepting states of the automaton. The language L(M) accepted by the automaton contains all the strings that
bring the automaton to an accepting state. The languages accepted by DFAs are called regular languages. Figure 8 shows the DFA which will be used for training the hybrid recurrent network architecture inspired by hidden Markov models. Double circles in the figure denote accepting states, while rejecting states are shown as single circles; state 1 is the automaton's start state. The training and testing sets are obtained by presenting strings to this automaton, which gives an output, i.e. rejecting or accepting, depending on the state reached after the last symbol of the string. For example, the string of length 7, 0100101, ends in state 5, which is an accepting state, so the output is 1.
Figure 8 Deterministic finite-state automaton
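A DFA such as the one in Figure 8 can be simulated with a transition table. The table below encodes the two-state odd-parity automaton as a stand-in, since Figure 8's transitions are given only graphically; the simulation loop itself is generic.

```python
# A generic DFA simulator. DELTA maps (state, symbol) to the next state;
# this particular table is the two-state odd-parity automaton, used here
# as an illustrative stand-in for the thesis's Figure 8 machine.
DELTA = {(1, "0"): 1, (1, "1"): 2,
         (2, "0"): 2, (2, "1"): 1}
START, ACCEPTING = 1, {2}

def dfa_output(string):
    state = START
    for symbol in string:
        state = DELTA[(state, symbol)]
    return 1 if state in ACCEPTING else 0   # 1 = accepting, 0 = rejecting

print(dfa_output("0100101"))   # three 1s, so this parity DFA accepts: 1
```

Labelling each training string with dfa_output(string) yields exactly the 0/1 targets described in the text.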
4.3.2 Fuzzy Finite-State Automata A fuzzy finite-state automaton M is a 6-tuple M = (Σ, Q, R, Z, δ, ω), where Σ and Q are the input alphabet and the set of finite states, respectively, R ∈ Q is the automaton's fuzzy start state, Z is a finite output alphabet, δ: Σ × Q × [0,1] → Q is the fuzzy transition map, and ω: Q → Z is the output map. Consider a restricted type of fuzzy automaton whose initial state is not fuzzy and whose ω is a function from F to Z, where F is a non-fuzzy set of states, called final states. Any fuzzy automaton as described above is equivalent to such a restricted fuzzy automaton [58]. Notice that an FFA reduces to a conventional DFA by restricting the transition weights to 1. Figures 9 and 10 show an example of an FFA and its corresponding deterministic acceptor, respectively.
Figure 9 Fuzzy finite-state automaton with weighted state transitions
State 1 in Figure 9 is the automaton's start state; accepting states are drawn with double circles. Only paths that can lead to an accepting state are shown (transitions to the garbage state are not shown explicitly). A transition from state q_j to q_i on input symbol a_k with weight θ is represented as a directed arc from q_j to q_i labelled a_k / θ.
Figure 10 Equivalent deterministic acceptor
The diagram above is the deterministic acceptor corresponding to the FFA of Figure 9; it computes the membership function of strings. The accepting states are labelled with a degree of membership. Notice that all transitions in the DFA have weight 1.
4.4 Finite State Machines: Hidden Markov Models
Hidden Markov models have been popular in applications to speech recognition [6]. In a regular Markov model, the state is directly visible to the observer; therefore, the state transition probabilities are the only parameters. A hidden Markov model (HMM) describes a process which goes through a finite number of non-observable states whilst generating a signal that is either discrete or continuous in nature. In a hidden Markov model, the state is not directly visible; however, variables influenced by the states are visible. Each state has a probability distribution over the possible output tokens; the sequence of tokens generated by a hidden Markov model therefore gives some information about the sequence of states. In a first-order hidden Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times [12].
Figure 11 A first-order discrete Markov model
Figure 11 shows an example of a Markov model. The discrete states ω_i in a basic Markov model are represented by nodes, and the transition probabilities a_ij are represented by links. In a first-order discrete Markov model, at any step t the full system is in a particular state ω(t). The state at step t+1 is a random function that depends solely on the state at step t and the transition probabilities. The term "hidden" refers to the fact that the process's state transition sequence is hidden from the observer; the process reveals itself only through the generated observable signal. A HMM is parameterized by a matrix of transition probabilities between states and by output probability distributions for the observed signal frames given the internal process state. It is assumed that at every time step t the system is in state ω(t) and emits some visible symbol v(t). A particular sequence of visible states is defined as V^T = {v(1), v(2), …, v(T)}; for example, V^6 = {v_5, v_1, v_1, v_5, v_2, v_3}. In any state ω(t), there is a probability of emitting a particular visible state v(t): the probability P(v_k(t) | ω_j(t)) = b_jk. There is access only to the visible states, while the states ω_j are unobservable.
The model is called a hidden Markov model. Figure 12 shows three hidden units in a HMM; the transitions between them are shown in solid lines, and the visible states and their emission probabilities are shown with dashed lines. This model shows that all transitions are possible; in other HMMs, some candidate transitions are not allowed.
Figure 12 A first-order hidden Markov model
There are three central issues in hidden Markov models: 1. The evaluation problem: given a complete HMM with transition probabilities a_ij and emission probabilities b_jk, determine the probability that a particular sequence of visible states V^T was generated by that model, using the Forward algorithm.
2. The decoding problem: determine the most likely sequence of hidden states ω^T that the system went through in generating the observed signal, using the Viterbi algorithm. 3. The learning problem: a set of re-estimation formulas for iteratively updating the HMM parameters given an observation sequence as training data. These formulas strive to maximize the probability of the sequence being generated by the model; the algorithm is known as the Baum-Welch or Forward-Backward procedure. The Forward algorithm will be discussed in detail for the evaluation problem. The probability P(V^T) that the model produces a sequence V^T of visible states can be calculated by the Forward algorithm given in Algorithm 1. One limitation of hidden Markov models in speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t−1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; nevertheless, hidden Markov models have performed well for certain types of speech recognition [13].
_____________________________________________
Forward Algorithm
initialize t ← 0, a_ij, b_jk, visible sequence V^T, α_j(0)
for t ← t + 1
    α_j(t) = b_jk v(t) Σ_{i=1}^{c} α_i(t−1) a_ij
until t = T
return P(V^T) ← α_0(T) for the final state
_____________________________________________
Algorithm 1: The Forward Algorithm. The notation b_jk v(t) denotes the emission probability b_jk selected by the visible state emitted at time t. α_0 denotes the probability of the associated sequence ending in the known final state.
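Algorithm 1 translates directly into code. The two-state model below (transition matrix A, emission matrix B, initial probabilities PI) is a made-up example; the final probability is obtained here by summing the last α values over all states, the standard termination when no single final state is designated.

```python
# Transition probabilities a_ij, emission probabilities b_jk and the
# initial alpha values of a made-up two-state HMM over symbols {0, 1}.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],     # state 0 mostly emits symbol 0
     [0.2, 0.8]]     # state 1 mostly emits symbol 1
PI = [0.5, 0.5]

def forward(observations):
    # alpha_j(1): initial probability times emission of the first symbol.
    alpha = [PI[j] * B[j][observations[0]] for j in range(len(PI))]
    for v in observations[1:]:
        # alpha_j(t) = b_j(v) * sum_i alpha_i(t-1) * a_ij
        alpha = [B[j][v] * sum(alpha[i] * A[i][j]
                               for i in range(len(alpha)))
                 for j in range(len(alpha))]
    return sum(alpha)    # P(V^T | model)

print(forward([0, 1, 1]))   # ~ 0.0921
```

Each step of the loop is exactly the recurrence of Equation 3.1, which is what makes the mapping to a recurrent network plausible.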
4.5 Summary
In this chapter, we have discussed finite automata, which have been the basis for studying knowledge representation in recurrent neural networks. Finite-state automata can be characterized as deterministic, fuzzy or probabilistic automata. A hidden Markov model describes a process which goes through a finite number of states. Fuzzy and probabilistic automata are similar in that each state may have a number of successor states. In probabilistic automata, the transition probabilities from a state for a given input must add to 1. This constraint does not apply to fuzzy automata, i.e. there is no notion of a probabilistic selection of a successor state; instead, a fuzzy automaton may be in several states simultaneously with different fuzzy measures. Hidden Markov models have been successfully applied to speech recognition. The Forward algorithm discussed here is the basis for the development of the hybrid architecture of recurrent neural networks and hidden Markov models.
Chapter 5 Recurrent Neural Networks
5.1 Introduction
Recurrent neural networks are well suited for modelling time-dependent processes due to their dynamical abilities. This feature has made them successful in applications to speech recognition, time series prediction, language learning and control. Recurrent neural networks are loosely inspired by the brain. Various architectures of recurrent neural networks have been applied to a wide range of problems, with differing strengths and weaknesses in knowledge representation and learning capability. Gradient descent is the most widely used method for recurrent network training, with variations in the form of backpropagation-through-time and real-time recurrent learning. Evolutionary training methods such as genetic algorithms have also shown successful results, as they do not face the problem of getting trapped in local minima during training, in contrast to error backpropagation. Symbolic knowledge in the form of finite automata can be encoded in recurrent neural networks. Knowledge extraction from recurrent neural networks aims at finding the underlying models of the learnt knowledge in the form of finite-state machines.
5.2 Architectures
5.2.1 First-Order Recurrent Neural Networks
The first-order recurrent neural network uses a context layer to store the outputs of the state neurons from previous time steps. The context layer is used in the computation of the present state as it contains information about the previous states. Such networks have been shown to learn and represent deterministic finite automata and fuzzy finite automata [48, 59]. The architectures and dynamics of the first-order context layer networks due to Elman, Jordan, and Williams and Zipser have been discussed in Section 2.3.2.
Figure 13 First-order recurrent neural networks
Figure 13 shows the connection from the context to hidden layer which provides the recurrence in the first-order recurrent neural network architecture. Note that the number of neurons in the hidden layer is equal to the number in the context layer. The neurons propagate information from one layer to another by computing a non-linear function of their weighted sum of inputs.
The dynamics of the hidden state neuron activations in first-order context layer networks is given by Equation 5.1.

S_i(t) = g( Σ_{k=1}^{K} V_ik S_k(t−1) + Σ_{j=1}^{J} W_ij I_j(t−1) )        (5.1)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons, respectively, and V_ik and W_ij represent their corresponding weights. g(.) is a sigmoidal discriminant function. Knowledge extraction through clustering and machine learning has been successfully applied [14, 15, 48]. First-order recurrent neural networks have been applied to a wide range of real-world problems including speech recognition [1], financial prediction [2] and gesture recognition [3].
5.2.2 Second-Order Recurrent Neural Networks Second-order recurrent neural networks are more suited to modelling finite-state behaviour than first-order context layer networks [60]. However, it has been shown that first-order recurrent networks generalize better than second-order networks with the same number of neurons, since second-order networks have more weight connections [61]. The second-order recurrent network was developed by Giles et al., and it has been shown that deterministic finite automata can be directly encoded in it [62]. It employs product units of external input neurons x_k(t) and recurrent state units s_j(t), for all combinations s_j(t) × x_k(t), as input to the context units. The activation of the context units is computed by Equation 5.2.
s_i(t+1) = g( Σ_{j=1}^{n} Σ_{k=1}^{m} w_ijk s_j(t) x_k(t) )        (5.2)

for n context units and m input units.
Figure 14 Second-order recurrent neural networks
Figure 14 shows the second-order recurrent network architecture with three state units s j (t ) and two input units xk (t ) . The output si (t + 1) computes a sigmoidal function of
weighted product units.
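The product-unit update of Equation 5.2 can be written compactly with a tensor contraction. This is a minimal sketch; the helper name `second_order_step` and the weight tensor layout are assumptions for illustration.

```python
import numpy as np

def second_order_step(s, x, w):
    """Second-order state update (Equation 5.2):
    s_i(t+1) = g( sum_j sum_k w_ijk * s_j(t) * x_k(t) ).

    s: (n,) state units, x: (m,) inputs, w: (n, n, m) second-order weights.
    """
    # einsum forms the product units s_j(t) x_k(t) and the weighted sum at once
    net = np.einsum('ijk,j,k->i', w, s, x)
    return 1.0 / (1.0 + np.exp(-net))      # sigmoid g(.)
```

With zero weights the net input is zero, so every state unit returns the sigmoid midpoint 0.5, which makes the sketch easy to sanity-check.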
5.2.3 Locally Recurrent Neural Networks
Locally recurrent neural networks are a class of networks that contain recurrent connections only within individual neurons [63]. They can be trained with gradient descent learning algorithms. Locally recurrent networks consist of an input layer and a layer of dynamic neurons with self-recurrent connections, whose outputs are the inputs of a standard layered feedforward network. The dynamics of locally recurrent neural networks is given as follows:
c_i(t) = f( w_{ii}^c c_i(t−1) + Σ_j w_{ij}^x x_j(t) )   (5.3)

y_i(t) = g( Σ_j w_{ij}^y c_j(t) )   (5.4)
where x_j(t), c_i(t) and y_i(t) represent the input, internal and output activations, respectively, at time t. The self-connections are given by the weights w_{ii}^c, and f and g are nonlinear activation functions.
Figure 15 Locally recurrent networks
Note the self connection shown in the hidden layer of locally recurrent networks in Figure 15. The outputs of recurrent state neurons are also fed as inputs to a layered network with one hidden layer capable of representing arbitrary mappings. Locally recurrent networks have been applied to simulate the behaviour of the Chua’s circuit which can be considered a paradigm for studying chaos [64]. They can identify the underlying link among the state variables of the Chua’s circuit.
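Equations 5.3 and 5.4 above can be sketched as one forward step. The layer sizes, the zero-weight test values and the helper name `local_step` are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def local_step(c_prev, x, w_self, Wx, Wy):
    """One step of a locally recurrent network (Equations 5.3-5.4).

    c_prev: (H,) previous internal activations c_i(t-1)
    x:      (J,) current input x_j(t)
    w_self: (H,) self-recurrent weights w_ii^c (one per dynamic neuron)
    Wx:     (H, J) input weights, Wy: (O, H) output weights
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c = sigmoid(w_self * c_prev + Wx @ x)   # Eq. 5.3: recurrence only within each neuron
    y = sigmoid(Wy @ c)                     # Eq. 5.4: static feedforward output layer
    return c, y
```

The elementwise product `w_self * c_prev` is what makes the recurrence local: each dynamic neuron feeds back only to itself, in contrast to the full state-to-state matrix of a first-order network.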
5.2.4 NARX Recurrent Networks
NARX recurrent neural networks are inspired by nonlinear autoregressive models with exogenous inputs [65]. They have shown better training and generalization results than other recurrent network architectures [66]. They compute their current output from past inputs and past outputs, as shown in Figure 16. The architecture has two tapped delay lines, one for inputs and one for outputs, with all taps fully connected to the hidden layer. Their lengths are referred to as the input order and output order, respectively. The hidden state neurons thus receive as input a window of past network inputs and outputs.
Figure 16 NARX network architecture
The output y(t) for the linear case with input order T_x and output order T_y is given by:

y(t) = Σ_{τ=1}^{T_x} α_τ x(t−τ) + Σ_{τ=1}^{T_y} β_τ y(t−τ)   (5.6)
where x(t) is the input signal and α_τ and β_τ are constants. For nonlinear dynamics, Equation 5.6 can be generalized to a nonlinear function f as described by Equation 5.7.
y(t) = f( x(t−T_x), …, x(t−1), x(t), y(t−T_y), …, y(t−1) )   (5.7)
NARX networks have been applied to finite automaton identification and signal processing [67]. They can retain information up to two or three times longer than conventional recurrent neural network architectures and hence can alleviate the problem of long-term dependencies [68].
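The linear NARX model of Equation 5.6 can be sketched with two tap-delay lines. This is a toy sketch under assumed parameters; the function name `narx_linear` and the coefficient values are illustrative only.

```python
from collections import deque

def narx_linear(x_seq, alpha, beta):
    """Linear NARX model (Equation 5.6):
    y(t) = sum_{tau=1..Tx} alpha[tau-1] * x(t-tau)
         + sum_{tau=1..Ty} beta[tau-1]  * y(t-tau).
    Two tap-delay lines hold the last Tx inputs and Ty outputs.
    """
    Tx, Ty = len(alpha), len(beta)
    x_taps = deque([0.0] * Tx, maxlen=Tx)   # most recent past input first
    y_taps = deque([0.0] * Ty, maxlen=Ty)   # most recent past output first
    out = []
    for x in x_seq:
        y = sum(a * xv for a, xv in zip(alpha, x_taps)) + \
            sum(b * yv for b, yv in zip(beta, y_taps))
        x_taps.appendleft(x)                # shift the delay lines
        y_taps.appendleft(y)
        out.append(y)
    return out

# Unit impulse through y(t) = x(t-1) + 0.5 y(t-1): the feedback tap
# makes the response persist, decaying geometrically
print(narx_linear([1.0, 0.0, 0.0], alpha=[1.0], beta=[0.5]))  # [0.0, 1.0, 0.5]
```

The output feedback is what lets the model retain information longer than a purely feedforward tapped delay line on the inputs alone.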
5.2.5 Long Short-Term Memory
Recurrent neural networks can represent non-linear dynamical systems; however, learning long-term dependencies can be difficult [68]. Gradient descent learning can only reliably propagate error information over a limited number of time steps: in the process of error backpropagation, the error gradient approaches zero as the number of time steps becomes large. Long Short-Term Memory (LSTM) networks have been proposed to overcome this problem of long-term dependencies [69]. They have been applied to grammatical induction and speech recognition problems [70, 71], and more recently to bioinformatics problems with great success. They are composed of memory cells and gate units. Each memory cell is built around a central linear unit with a fixed self-connection. The gate units open and close access to the constant error carousel. Figure 17 shows an example of an LSTM network where the input units are fully connected to a hidden layer consisting of a memory block with a single cell. The recurrence is limited to the hidden layer. The gate units, cells and outputs are biased.
Figure 17 Example of a three-layer LSTM topology
LSTM networks can be trained using multi-grid random search, time-weighted pseudo-Newton methods, discrete error backpropagation, and expectation maximization [72, 73]. LSTM solves complex long time-lag tasks that had not been solved by previous recurrent network algorithms. It also works with local, distributed, real-valued, and noisy pattern representations.
5.3 Learning
5.3.1 Introduction
The principal goal of a learning algorithm is to adjust the connection weights in the network given a set of input-output mappings in the training data. The adjustment of the weights enables the network to generalize when presented with data not seen during the training process. Recurrent neural networks can be trained with the principle of the delta learning rule as discussed in detail in Section 2.4. The general idea behind the delta learning rule is to use gradient descent to search the hypothesis space of weight vectors and find the weights that best fit the training examples. Gradient descent provides the basis of the backpropagation algorithm, which trains networks of many interconnected units through error backpropagation. This section discusses backpropagation-through-time, real-time recurrent learning and evolutionary techniques for training recurrent neural networks.
5.3.2 Backpropagation-Through-Time
Backpropagation is the most widely applied learning algorithm for both feedforward and recurrent neural networks. It learns the weights of a multilayer network with a fixed set of units and interconnections. Backpropagation employs gradient descent to minimize the squared error between the network's output values and the desired values for those outputs. The learning problem faced by backpropagation is to search the large hypothesis space defined by the weight values of all the units in the network. Error is propagated from the output layer back to the hidden layers, from which the weights are updated. A prototype of the standard backpropagation algorithm is shown in Algorithm 2.
__________________________________________________________________
Backpropagation(Training-Examples, α)
Each training example is a pair of the form (x, d), where x is the vector of network input values and d is the vector of desired output values. α is the learning rate, usually a real number between 0 and 1.
1. Create a network with input, hidden and output units
2. Initialize all network weights to small random numbers
3. Until the termination condition is met, Do
   • For each (x, d) in the training examples, Do
      1. Propagate the input forward through the network:
         • Input the instance x to the network and compute the output S_j^L of every unit j in the network.
      2. Propagate the errors backward through the network:
         • Calculate the error δ_j^L for each layer L in every output neuron j.
         • Update each network weight w_ji^L for each layer L.
__________________________________________________________________
Algorithm 2: The backpropagation algorithm used for training feedforward networks. The training equations, which also apply for training recurrent neural networks unfolded in time, are shown in the discussion to follow. Algorithm 2 begins by constructing a network with the desired number of input, hidden and output neurons. It then initializes all the connecting weights to small random real values. Once the network architecture is fixed, the main loop of the algorithm repeatedly iterates over the training examples. For each training example, the algorithm calculates the network output, which is used to compute the gradient with respect to the output error. The gradient is used to compute the weight updates for each layer in the network. The gradient descent step is iterated until the network performs acceptably well, meaning the network has learnt the training examples.
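The forward and backward passes of Algorithm 2 can be sketched for a two-layer sigmoid network. This is an illustrative toy sketch, not the thesis implementation: the layer sizes, learning rate and single training example are assumptions, and the equation references point at the training equations derived later in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, alpha=0.5):
    """One gradient-descent update for a two-layer sigmoid network,
    following Algorithm 2 (Equations 5.9-5.12).
    x: input vector, d: desired output vector; W1, W2: weight matrices.
    Returns the updated weights and the squared error before the update.
    """
    # Forward pass
    h = sigmoid(W1 @ x)                            # hidden activations
    y = sigmoid(W2 @ h)                            # output activations S_j^L
    err = 0.5 * np.sum((d - y) ** 2)               # squared error (Eq. 5.9)
    # Backward pass
    delta_out = (d - y) * y * (1 - y)              # output-layer gradient (Eq. 5.11)
    delta_hid = h * (1 - h) * (W2.T @ delta_out)   # hidden-layer gradient (Eq. 5.12)
    # Weight updates (Eq. 5.10)
    W2 = W2 + alpha * np.outer(delta_out, h)
    W1 = W1 + alpha * np.outer(delta_hid, x)
    return W1, W2, err

# Repeated updates on one toy example: the error shrinks over iterations
rng = np.random.default_rng(1)
W1, W2 = rng.normal(scale=0.5, size=(4, 2)), rng.normal(scale=0.5, size=(1, 4))
x, d = np.array([1.0, 0.0]), np.array([1.0])
errs = []
for _ in range(200):
    W1, W2, e = backprop_step(x, d, W1, W2)
    errs.append(e)
```

Training a recurrent network with BPTT applies the same update rules after unfolding the network in time, so each time step plays the role of a layer L.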
Backpropagation is used for training feedforward networks, while backpropagation-through-time (BPTT) is employed for training recurrent neural networks [74]. BPTT is an extension of the backpropagation algorithm. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network; this is done by adding a layer for each time step. When unfolded in time, the network has the same behaviour as the recurrent neural network for a finite number of time steps. Figure 18 shows (a) a context-layer recurrent neural network unfolded into (b) a deep multilayer feedforward network. Note that in this example the unfolding is shown for three time steps; the number of time steps varies according to the training data. The time complexity of backpropagation-through-time grows with the depth d of error backpropagation, i.e. the number of unfolded time steps. Given below are the training equations for the network unfolded in time, where time t becomes the layer L, corresponding to Algorithm 2. For each training example d, every weight w_ji is updated by adding Δw_ji to it:
Δw_ji = −α ∂E_d / ∂w_ji   (5.8)

where E_d is the error on training example d, summed over all m output units in the network:

E_d = (1/2) Σ_{j=1}^{m} (d_j − S_j^L)²   (5.9)
Figure 18 Unfolding a recurrent neural network in time
Here d_j is the desired output for neuron j in the output layer containing m neurons, and S_j^L is the network output of neuron j in the output layer L. After computing the derivative, the weight update is done by:

Δw_ji^L = α δ_j^L S_i^{L−1}   (5.10)
where α is the learning rate constant. The learning rate determines how fast the weights are updated in the direction of the gradient. The error gradient δ_j^L for neuron j in the output layer is given by:

δ_j^L = (d_j − S_j^L) S_j^L (1 − S_j^L)   (5.11)
The error gradient for the hidden layers is given by:

δ_j^L = S_j^L (1 − S_j^L) Σ_{k=1}^{m} δ_k^{L+1} w_kj^{L+1}   (5.12)
There are many variations of the backpropagation algorithm. The most common variation is to alter the weight update by making the update on the nth iteration depend partially on the update that occurred during the (n−1)th iteration, as follows:

Δw_ji(n) = α δ_j x_ji + η Δw_ji(n−1)   (5.13)

Here, Δw_ji(n) is the weight update performed during the nth iteration through the main loop of the algorithm, and 0 ≤ η < 1 is a constant called the momentum. The momentum term provides a heuristic which gives the search a definite direction in cases where the direction of the local gradient changes rapidly. It has been shown that backpropagation implements a gradient descent search through the space of possible network weights by reducing the error E between the training examples' desired values and the network outputs. As the network's error surface may contain several local minima, gradient descent may become trapped in any of these. Thus, backpropagation is only guaranteed to converge to some local minimum of E and not necessarily the global minimum error. Heuristics to improve the performance of backpropagation include adding a momentum term and training multiple networks on the same data with different small random initializations [75]. Backpropagation-through-time has been successfully applied to a wide range of problems [1, 2, 3, 14, 48]. The gradient computed by BPTT is identical to that of real-time recurrent learning, which is discussed in the section to follow.
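The momentum update of Equation 5.13 can be sketched on its own. The parameter values and the constant gradient sequence below are illustrative assumptions chosen to make the accumulation visible.

```python
def momentum_update(grad_terms, alpha=0.2, eta=0.9):
    """Weight-update sequence with momentum (Equation 5.13):
    dw(n) = alpha * delta_j * x_ji + eta * dw(n-1).

    grad_terms: list of (delta_j * x_ji) products, one per iteration n.
    Returns the list of weight updates dw(n).
    """
    dw_prev = 0.0
    updates = []
    for g in grad_terms:
        dw = alpha * g + eta * dw_prev     # current gradient plus momentum carry-over
        updates.append(dw)
        dw_prev = dw
    return updates

# With a constant gradient the updates accumulate in one direction
# (0.2, then 0.2 + 0.9*0.2 = 0.38, then 0.2 + 0.9*0.38 = 0.542, ...),
# which smooths the search when the local gradient changes rapidly.
updates = momentum_update([1.0, 1.0, 1.0])
```

Setting η = 0 recovers the plain delta-rule update, which makes the role of the momentum term easy to isolate.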
5.3.3 Real-Time Recurrent Learning
Backpropagation-through-time uses the backward propagation of error information to compute the error gradient used in the weight update. An alternative approach is to propagate the gradient information forward. Real-time recurrent learning is a real-time learning algorithm which updates the weights at the end of each sample string presentation with a gradient descent weight update rule. The algorithm computes the derivatives of the states and outputs with respect to all weights as the network processes the sequence during the forward step [76]. No unfolding in time is performed or necessary for real-time recurrent learning.
5.3.4 Evolutionary Neural Learning
5.3.4.1 Genetic Algorithms
Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modelling evolutionary systems [75, 77]. The problem faced by genetic algorithms is to search a space of candidate hypotheses and find the best hypothesis. The fitness is a numerical measure of how well a hypothesis optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the fittest chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. After applying the genetic operators, the new offspring, known as child
chromosomes, are added to the new generation. A prototype of the algorithm is given in Algorithm 3. The fitness function defines the criterion for choosing chromosomes from the population, so that a particular chromosome may be ranked against all the other chromosomes. The fitness function quantifies the fitness of all chromosomes with regard to the application problem. For example, if the task is to train a neural network using a genetic algorithm to learn some training data, the fitness function will be based on the squared error of the neural network; we will discuss this example in the sections to follow. In each iteration, a proportion of the existing population is selected to breed a new generation. Different selection methods based on fitness have been proposed. In the genetic algorithm shown in Algorithm 3, the probability that a chromosome is selected is given by the ratio of its fitness to the total fitness of the members of the population. This method is called fitness-proportionate selection or roulette-wheel selection. In tournament selection, a 'tournament' is run among a few chromosomes chosen at random from the population, from which winners are selected according to some predefined probability p. In rank selection, the chromosomes in the current population are sorted according to their fitness; the probability that a chromosome will be selected is proportional to its rank rather than its fitness.
___________________________________________________________________
Genetic-Algorithm(Fitness, Fitness-threshold, p, r, m)
Fitness: A function that assigns an evaluation score, given a hypothesis.
Fitness-threshold: A threshold which specifies the termination criterion.
p: The number of hypotheses (chromosomes) to be included in the population.
r: The fraction of the population to be replaced by crossover at each step.
m: The mutation rate.
1. Initialize population: P ← Generate p hypotheses at random
2. Evaluate the fitness of individuals: For each h in P, compute Fitness(h)
3. While [max Fitness(h)] < Fitness-threshold do
   Create a new generation, Ps:
   • Select: Probabilistically select (1 − r)·p members of P to add to Ps. The probability Pr(h_i) of selecting hypothesis h_i from P is given by:

     Pr(h_i) = Fitness(h_i) / Σ_{j=1}^{p} Fitness(h_j)

   • Crossover: Probabilistically select (r·p)/2 pairs of hypotheses from P, according to Pr(h_i) given above. For each pair (h1, h2), produce two offspring by applying the Crossover operator. Add all offspring to Ps.
   • Mutate: Choose m percent of the members of Ps with uniform probability. For each, invert one randomly selected bit in its representation.
   • Update: P ← Ps
   • Evaluate: For each h in P, compute Fitness(h)
4. Return the hypothesis from P that has the highest fitness.
_____________________________________________________________________________________
Algorithm 3: The genetic algorithm. In each iteration, the successor population Ps is created by probabilistically selecting current hypotheses according to their fitness and by applying genetic operators such as crossover and mutation to create new hypotheses. The crossover operator is applied to pairs of the most fit hypotheses, the parent chromosomes. The process is iterated until sufficiently fit hypotheses are discovered.
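A compact runnable version of Algorithm 3 is sketched below. The parameter values, the single-point crossover choice and the OneMax fitness in the demonstration are illustrative assumptions, not the thesis configuration.

```python
import random

def genetic_algorithm(fitness, length=20, p=30, r=0.6, m=0.05,
                      threshold=None, generations=100, seed=0):
    """Minimal bit-string genetic algorithm following Algorithm 3.
    fitness: maps a bit list to a score; r: crossover fraction; m: mutation rate.
    """
    rng = random.Random(seed)
    P = [[rng.randint(0, 1) for _ in range(length)] for _ in range(p)]

    def select(pop):
        # Fitness-proportionate (roulette-wheel) selection
        total = sum(fitness(h) for h in pop)
        pick, acc = rng.uniform(0, total), 0.0
        for h in pop:
            acc += fitness(h)
            if acc >= pick:
                return h
        return pop[-1]

    for _ in range(generations):
        best = max(P, key=fitness)
        if threshold is not None and fitness(best) >= threshold:
            break
        Ps = [select(P)[:] for _ in range(p - int(r * p))]    # survivors
        for _ in range(int(r * p) // 2):                      # crossover pairs
            h1, h2 = select(P), select(P)
            cut = rng.randrange(1, length)                    # single-point crossover
            Ps += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        for h in Ps:                                          # mutation: flip one bit
            if rng.random() < m:
                i = rng.randrange(length)
                h[i] = 1 - h[i]
        P = Ps
    return max(P, key=fitness)

# OneMax toy problem: the fitness of a chromosome is its number of 1-bits,
# so the algorithm should evolve chromosomes dominated by 1s
best = genetic_algorithm(fitness=sum, threshold=20)
```

The `select` helper implements exactly the roulette-wheel probability Pr(h_i) from Algorithm 3; swapping it for tournament or rank selection changes only that one function.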
The chromosomes in genetic algorithms are represented by bit strings so that they can be easily manipulated by genetic operators such as crossover and mutation. The crossover operator produces two new offspring from two parent chromosomes by copying selected bits from each parent. The bit at position i in the child chromosome is copied from the bit in position i in one of the two parents.
Figure 19 Crossover and mutation operators for genetic algorithms
The crossover mask determines the choice of which parent contributes the bit at position i. The common crossover operators with the mutation operator are shown in Figure 19. The bits underlined in the parent chromosome are combined to form the first offspring while the remaining bits from the two parent chromosomes form the second offspring.
In single-point crossover, a crossover point in the parent chromosomes is selected and all data beyond that point is swapped to form two offspring. In two-point crossover, two points are selected in the parent chromosomes and everything between the two points is swapped to form the offspring. Uniform crossover combines bits sampled uniformly from the two parents. The mutation operator performs small random changes to bit strings by flipping a single bit at random. Mutation is usually performed after the crossover operator, according to the mutation rate, which is usually a small value, typically 0.1 or 0.01. Genetic algorithms have been applied to a number of optimization problems such as the travelling salesman problem [78], finding optimal topologies for neural networks [9] and neural network learning [53, 55].
5.3.4.2 Training Neural Networks with Genetic Algorithms
Evolutionary training methods such as genetic algorithms have been popular for training neural networks as an alternative to gradient descent learning [53, 55]. The backpropagation algorithm may converge to sub-optimal weights from which it cannot escape, and it has been observed that genetic algorithms can overcome this problem of local minima. In gradient descent search for the optimal solution, it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. Genetic algorithms are also used to optimize the topology of neural networks [9, 79]. Evolutionary neural learning has been successfully applied to many real-world problems such as breast cancer diagnosis [80]. In the neural network training process, genetic algorithms are used to optimize the weights, which represent the knowledge learnt in the training process. In order to use genetic algorithms for training neural networks, the problem must be represented as chromosomes: real-numbered weight values, rather than binary values, must be encoded in the chromosome. This is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents, as shown in Figure 20. The mutation operator adds a small random real number between −1 and 1 to a randomly selected gene in the chromosome.
Figure 20 Crossover operator for evolutionary neural learning
In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights of the neural network, i.e. the set which minimizes the error function. The fitness function must reflect the performance of the neural network; thus, the fitness function is the reciprocal of the sum of squared errors of the network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. The smaller the sum of squared errors, the better the chromosome of represented weights. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network.
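The encoding and fitness evaluation described above can be sketched as follows. This is a toy sketch: the two-layer network, the helper names (`decode`, `fitness`, `crossover`, `mutate`) and the small guard constant in the fitness are all illustrative assumptions.

```python
import random
import numpy as np

def decode(chromosome, shapes):
    """Assign the flat real-valued chromosome to the network weight matrices."""
    mats, i = [], 0
    for (r, c) in shapes:
        mats.append(np.array(chromosome[i:i + r * c]).reshape(r, c))
        i += r * c
    return mats

def fitness(chromosome, X, D, shapes):
    """Reciprocal of the sum of squared errors of the decoded two-layer network."""
    W1, W2 = decode(chromosome, shapes)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    Y = sigmoid(W2 @ sigmoid(W1 @ X.T))          # forward pass over all examples
    sse = np.sum((D.T - Y) ** 2)
    return 1.0 / (1e-8 + sse)                    # small constant guards div-by-zero

def crossover(p1, p2, rng):
    """Gene-wise crossover for real-valued chromosomes: each gene of the
    single child is copied from one of the two parents at random."""
    return [rng.choice([a, b]) for a, b in zip(p1, p2)]

def mutate(ch, rng):
    """Add a small random real in [-1, 1] to one randomly chosen gene."""
    ch = ch[:]
    ch[rng.randrange(len(ch))] += rng.uniform(-1.0, 1.0)
    return ch
```

Plugging these operators and this fitness into a genetic algorithm gives an evolutionary trainer: the population of chromosomes is evolved until a decoded weight set makes the network's sum of squared errors acceptably small.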
5.4 Recurrent Neural Networks as Models of Computation
Recurrent neural networks are appropriate tools for modelling time-varying systems, for example in speech recognition, physical dynamical systems, and financial prediction. However, these applications are not well suited for studying the networks' fundamental issues, such as training algorithms and knowledge representation. These applications come with specific characteristics; for example, speech recognition requires feature extraction, which may hinder the investigation of the networks' fundamental issues, and different applications require different feature extraction techniques. Models such as finite-state automata and their corresponding languages can be viewed as a general paradigm of temporal, symbolic languages. No feature extraction is necessary for recurrent neural networks to learn these languages. The knowledge acquired by recurrent neural networks through learning corresponds well with the dynamics of finite-state automata. The representation of an automaton is a prerequisite for learning its corresponding language, i.e. if the architecture cannot represent a particular automaton then it will not be able to learn it either. It has been shown that certain automata can be mapped directly into recurrent networks [81, 82, 83].
5.5 Knowledge Extraction from Recurrent Neural Network
5.5.1 Introduction Knowledge extraction from RNNs is aimed at finding the underlying models of the learned knowledge typically in the form of finite state machines. Recurrent neural networks have been trained on deterministic finite-state automata (DFAs) and knowledge extraction methods have been applied to explore the knowledge representation in the weights of the trained network. Recently, machine learning methods have been applied to the extraction of symbolic knowledge from recurrent neural networks. Neural networks were once considered as black boxes, i.e. it was difficult to understand the knowledge representation in the weights as part of the information processing in the network. Knowledge extraction is the process of finding the meaning of the internal weight representation of the network. There are two main approaches for knowledge extraction from trained recurrent neural networks: (1) inducing finite automata by clustering the activation values of hidden state neurons [15] and (2) the application of machine learning methods to induce automaton from the observation of input-output mappings of the recurrent neural network [14, 48].
5.5.2 Knowledge Extraction Using Machine Learning
In this method of knowledge extraction, the network is first trained on the training data set. After successful training and testing, the network is presented with another data set which contains only input samples. The generalization made by the network is then recorded for each corresponding input sample in this data set. In this way, a data set with the input-output mappings made by the trained network is obtained. The generalization made by the output of the network is an indirect measure of the knowledge acquired by the network in the training process. Finally, a machine learning method such as the Trakhtenbrot-Barzdin algorithm is applied to the input-output mappings to induce a finite automaton. Machine learning methods for knowledge extraction have been successfully applied to the extraction of deterministic finite-state automata from trained recurrent networks [14, 48]. Figure 21 shows the process in detail.
Figure 21 Knowledge extraction through machine learning
The Trakhtenbrot-Barzdin algorithm [16] extracts minimal DFAs from a data set of input strings and their respective output states. The algorithm is guaranteed to induce a DFA in polynomial time. A restricting premise of the algorithm is that the labels of all strings up to a certain length L must be known if the desired DFA is to be extracted correctly and unambiguously; these strings can be represented in a so-called prefix tree. This prefix tree is collapsed into a smaller graph by merging all pairs of nodes that represent compatible mappings from suffixes to labels. Figure 22 shows an example of a prefix tree of depth 3, where accepting states are shown by shaded circles.
Figure 22 Prefix tree
The algorithm visits all nodes of the prefix tree in breadth-first order; all pairs (i, j) of nodes are evaluated for compatibility by comparing the subtrees rooted at i and j. The subtrees are compatible if the labels of the nodes in corresponding positions in the respective trees are identical. If all corresponding labels are the same, then the edge from i's parent to i is changed to point at j instead. Nodes which become inaccessible are discarded. The result is the smallest automaton consistent with the data set. The algorithm is summarised as follows:

Algorithm 4: Let T be a complete prefix tree with n nodes 1, …, n
for i = 1 to n do
    for j = 1 to i − 1 do
        if subtree(i) ≡ subtree(j)
            parent(j) ← parent(i)
Figure 23 DFA induction
Figure 23 shows an example of DFA induction. The algorithm begins by (a) comparing the labels of node 2 with node 1. Since all labels in the corresponding subtrees are the same, node 2 is merged with node 1. In (b), the subtrees of node 3 and node 6 are compared and merged as shown. Finally, in (c), the subtrees of node 7 are compared with the subtrees of node 1 and merged. The final induced DFA is shown in (d). Using Algorithm 4, a complete prefix tree for strings of up to length L = 1 is generated. For each successive string length, an automaton is extracted from the corresponding prefix tree of increasing depth using the Trakhtenbrot-Barzdin algorithm. The algorithm checks whether the extracted DFA is consistent with the entire training data set, i.e. the input-output labels of the entire training data set must be explained by the extracted DFA. If the DFA is consistent, the algorithm terminates. The extraction process is summarised in Algorithm 5:
Algorithm 5: Let M be the universal FFA that rejects (or accepts) all strings, and let L = 0.
Repeat
    L ← L + 1
    Generate all strings of length up to L
    Extract M_L using Trakhtenbrot-Barzdin(L)
Until M_L is consistent with the test set
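The two building blocks of this extraction process, constructing a labelled prefix tree and testing two subtrees for the merge condition of Algorithm 4, can be sketched as follows. This is an illustrative sketch only: the dictionary-based tree representation and the helper names `build_prefix_tree` and `subtrees_compatible` are assumptions, and the full merge loop and consistency check are omitted.

```python
def build_prefix_tree(samples):
    """Prefix tree as a dict: node id -> [label, {symbol: child id}].
    samples maps every string up to the tree depth to its 0/1 label,
    as the Trakhtenbrot-Barzdin premise requires."""
    tree = {0: [samples[''], {}]}
    nxt = 1
    for s, label in sorted(samples.items()):
        node = 0
        for ch in s:
            if ch not in tree[node][1]:
                tree[node][1][ch] = nxt
                tree[nxt] = [None, {}]
                nxt += 1
            node = tree[node][1][ch]
        tree[node][0] = label
    return tree

def subtrees_compatible(tree, i, j):
    """Merge test of Algorithm 4: nodes in corresponding positions of the
    subtrees rooted at i and j must carry identical labels."""
    li, lj = tree[i][0], tree[j][0]
    if li is not None and lj is not None and li != lj:
        return False
    for sym, child in tree[i][1].items():
        other = tree[j][1].get(sym)
        if other is not None and not subtrees_compatible(tree, child, other):
            return False
    return True

# Depth-1 tree over the alphabet {0, 1}: the empty string and '0' are
# rejected (label 0) while '1' is accepted (label 1)
tree = build_prefix_tree({'': 0, '0': 0, '1': 1})
ok = subtrees_compatible(tree, 0, tree[0][1]['0'])   # both carry label 0: mergeable
bad = subtrees_compatible(tree, 0, tree[0][1]['1'])  # labels 0 vs 1: not mergeable
```

When two subtrees pass this test, Algorithm 4 redirects the parent edge so that the pair collapses into one state, which is how the prefix tree shrinks to the smallest consistent automaton.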
5.6 Applications of Recurrent Neural Networks
5.6.1 Recurrent Neural Networks for Speech Recognition
Speech recognition systems are composed of two major components: (1) a feature extraction component which extracts features from the speech signal, and (2) a machine learning component which builds a model on the extracted features. A speech sequence contains a huge amount of irrelevant information, so feature extraction is necessary in order to model it. In feature extraction, useful information is extracted from the speech sequence, which is then modelled using recurrent neural networks. Recurrent neural networks have been successfully applied to modelling speech sequences for large-vocabulary speech recognition problems [1, 84], where they have been used to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Extensive research on speech recognition has been carried out for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise. Applications of speech recognition include voice command systems in home devices, speech input for interaction with computers as an alternative to mouse and keyboard, and speech command interaction with robots.
5.6.2 Recurrent Neural Networks for Control Recurrent neural networks have been successful in control applications [85, 86]. They have been applied to plant control [87]. Plant control sequences contain a region of rapid changes with a region of slow changes in the data. A multilayer training procedure is used to counter this problem where the process combines features of batch update and scrambling by training the networks on multiple streams of data. Comparative studies show that recurrent controllers make fewer errors than feedforward architectures for data generated by known plant functions [88]. Neural networks have been tested as controllers in the galvanizing process of sheets and water treatment applications, predictive head tracking for virtual reality systems, wind turbine power regulation, and electric load forecasting [87, 88].
5.6.3 Molecular Biology
The prediction of the three-dimensional structure of proteins is an important problem in molecular biology. The tertiary structure determines the function of a protein. Direct determination of the tertiary structure using methods such as X-ray crystallography is expensive and time consuming. The local or secondary structure of a protein, predicted from its string of amino acids, provides a good approximation to the three-dimensional structure. The Chou-Fasman algorithm is the standard algorithm for this problem and achieves a prediction accuracy of up to 58%. Recurrent neural networks have been initialized with Chou-Fasman domain knowledge and trained using sliding windows of amino acid sequences [89, 90]. A knowledge-based neuro-computing paradigm using feedforward networks has also been applied to the problem, showing small but statistically significant improvements in prediction accuracy [91].
5.6.4 Signature Verification
A signature verification system is based upon the similarity between signatures. The system is composed of two major components: (1) a signature pre-processing component which captures the timing information and positioning of the pen point while the signature is made, and (2) a machine learning component for modelling the extracted features. Recurrent neural networks have been successfully applied to modelling signatures [92]. A training set of features extracted from signatures, which may contain both positive and negative samples, is used for training. Upon successful training, the network is presented with new signatures of people who were included in training; the network then has to predict whether a given signature belongs to a particular person. Signature verification systems can be used in banks, airports and security systems.
5.7 Summary
This chapter discussed the architectures of recurrent neural networks, which include first-order recurrent networks, second-order recurrent networks, locally recurrent networks, NARX networks, and LSTM. One limitation of neural networks is the difficulty of training with gradient descent learning, where the network may become trapped in a local minimum, resulting in poor training and generalization performance. It has also been discussed how evolutionary neural learning is used to overcome this problem. The extraction of finite-state automata from trained recurrent neural networks reveals the dynamical features of their knowledge representation. Recurrent neural networks have been successfully applied to a wide range of applications including speech recognition, signature verification, control, and molecular biology [1, 2, 3, 84, 85, 86, 89, 90, 92].
Chapter 6 Training and Extraction of Finite State Automata: Experimentation and Results
6.1 Introduction
In the previous chapters, finite automata and their dynamical characteristics have been discussed, along with the details of how recurrent neural networks and hidden Markov models can be combined to build the hybrid architecture. This chapter shows how finite automata can be used for training recurrent neural networks, making them suitable for modelling dynamical systems. First-order recurrent neural networks will be trained on deterministic and fuzzy finite-state automata. Then knowledge extraction by machine learning methods will be applied to show their knowledge acquisition and representation.
Finally, it will be shown how the proposed hybrid recurrent neural network architecture, inspired by hidden Markov models, can learn and represent finite automata, making it suitable for modelling dynamical systems. The contribution of hybrid recurrent networks to the real-world application of speech recognition will be discussed in the chapter to follow.
6.2 Gradient Descent Training of Recurrent Neural Networks
In Section 1.5.1, it was discussed that gradient descent has difficulties in learning long-term dependencies: the error propagated during learning vanishes with increasing duration of the dependencies. Here, this problem is addressed by using incremental data learning on finite automaton strings with gradient descent training. Incremental data learning uses working sets obtained from the complete training set, composed of strings of increasing length. The network trains on each working set for a number of training epochs until it converges to a solution. The training is terminated when the network performs satisfactorily on the entire training set or has iterated through all working sets. The training data set is composed of strings of increasing lengths up to length L, representing some finite automaton in their output labels.
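The incremental learning schedule described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the network object (here a toy stand-in that simply memorizes samples) and its method names `train_epoch` and `accuracy` are hypothetical.

```python
class StubNet:
    """Toy stand-in for the recurrent network: memorizes samples it has seen."""
    def __init__(self):
        self.known = set()

    def train_epoch(self, working):
        self.known.update(working)          # pretend gradient-descent pass

    def accuracy(self, data):
        return sum(s in self.known for s in data) / len(data)


def incremental_train(net, train_set, initial=30, step=20, max_epochs=300):
    """Grow the working set until the network explains the entire training set.

    train_set is assumed sorted by string length, shortest first.
    """
    size = initial
    while True:
        working = train_set[:size]
        for _ in range(max_epochs):         # one training cycle
            net.train_epoch(working)
            if net.accuracy(train_set) == 1.0:
                return net                  # entire set explained
        if size >= len(train_set):
            return net                      # all samples included
        size += step                        # grow working set
```

With the parameters used in the thesis, `initial=30`, `step=20` and `max_epochs=300` reproduce the schedule described above.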
6.2.1 Training on Deterministic Finite-State Automata
The training data set is generated by presenting strings of lengths 1 to 10 to the deterministic finite automaton shown in Figure 24. The automaton labels the output of each string depending on the state in which the final symbol of the string leaves it. Similarly, the testing data set is generated for string lengths 1 to 15. The training data set consists of 2047 string samples and the testing data set of 65535 string samples. Figure 24 shows the 7 state deterministic finite automaton, where double circles denote accepting states and single circles denote rejecting states; state 1 is the automaton's start state. The training and testing sets are obtained by presenting strings to this automaton, which gives an output, i.e. rejecting or accepting, depending on the state reached after the last symbol of the string. For example, the string of length 7, 0100101, ends in state 5; it is an accepting state, therefore the output is 1.
Figure 24 The 7 state deterministic finite automaton
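Generating a labelled data set from an automaton can be sketched as follows. The 4-state transition table below is a hypothetical stand-in, since the full transition table of the 7-state automaton is given only graphically in Figure 24.

```python
from itertools import product

# Hypothetical 4-state DFA over the alphabet {0, 1}; the thesis uses the
# 7-state automaton of Figure 24, whose transitions are not listed here.
TRANS = {(0, '0'): 1, (0, '1'): 2, (1, '0'): 0, (1, '1'): 3,
         (2, '0'): 3, (2, '1'): 0, (3, '0'): 2, (3, '1'): 1}
ACCEPT = {0, 3}          # accepting states (drawn as double circles)


def label(string, start=0):
    """Run the DFA on a string; output 1 if it ends in an accepting state."""
    state = start
    for symbol in string:
        state = TRANS[(state, symbol)]
    return 1 if state in ACCEPT else 0


def dataset(max_len):
    """All binary strings of lengths 1..max_len with their DFA labels."""
    return [(''.join(bits), label(bits))
            for length in range(1, max_len + 1)
            for bits in product('01', repeat=length)]
```

Note that lengths 1 to 10 give 2^1 + ... + 2^10 = 2046 strings; the thesis count of 2047 presumably also includes the empty string.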
The first-order recurrent neural network architecture discussed in Section 5.2.1 was used for training. The network topology was as follows: one neuron in the input layer representing the string input and one neuron in the output layer representing the string output. The learning rate was 0.2. Trial experiments were done using 5, 10, 15 and 20 neurons in the hidden layer. The initial training cycle consisted of 30 samples in the working set, ordered by increasing string length. Each training cycle had a bound of 300 epochs. After training for 300 epochs in a cycle, the working set was increased by 20 samples, and training continued with the weights obtained in the previous cycle. Training terminated once a training cycle explained the entire training set, or continued until all samples were included in the working set. The network was then presented with the testing set and its performance determined by its generalization. Table 6.1 shows the results obtained from the experiments. They show that recurrent neural networks can learn deterministic finite automata by means of gradient descent learning, with excellent training and generalization performance. The number of training epochs in the last cycle varies with the number of neurons in the hidden layer; these counts are directly comparable here because the number of training cycles was the same in all experiments.
No. of Hidden Neurons   No. of Training Cycles   No. of Training Epochs in Last Cycle   Training Performance   Generalization Performance
5                       1                        120                                    100%                   100%
10                      1                        83                                     100%                   100%
15                      1                        89                                     100%                   100%
20                      1                        113                                    100%                   100%
Table 6.1: Learning Deterministic Finite Automaton. The table shows the training and generalization performance of recurrent neural networks with different numbers of neurons in the hidden layer for learning deterministic finite automaton. The training time is given as the number of cycles and training epochs, respectively. A 100% generalization performance reveals that the network can learn and represent the particular DFA.
6.2.2 Training on Fuzzy Finite Automaton
A training approach similar to that of Section 6.2.1 was used for training recurrent neural networks on fuzzy finite automata. The training and testing data sets were generated by presenting strings to the fuzzy finite automaton shown in Figure 25, where the accepting states are labelled with a degree of fuzzy membership. For recurrent network training, each fuzzy output value was assigned a distinct neuron in the output layer; therefore, there are four output neurons for this example.
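The transformation of fuzzy outputs to distinct output neurons can be sketched as a one-hot encoding over the distinct membership values. The four membership values below are hypothetical; the actual values are defined by the automaton in Figure 25.

```python
# Hypothetical distinct fuzzy membership values; each one becomes one
# output neuron, so a string labelled 0.7 activates only "its" neuron.
MEMBERSHIPS = [0.0, 0.3, 0.7, 1.0]


def encode(mu):
    """One-hot target vector over the distinct membership values."""
    return [1.0 if m == mu else 0.0 for m in MEMBERSHIPS]


def decode(outputs):
    """Winner-take-all: map network outputs back to a membership value."""
    return MEMBERSHIPS[max(range(len(outputs)), key=outputs.__getitem__)]
```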
Figure 25 The 7 state fuzzy finite automaton
No. of Hidden Neurons   No. of Training Cycles   No. of Training Epochs   Training Performance   Generalization Performance
5                       41                       300                      0%                     0%
10                      3                        2                        100%                   100%
15                      11                       1                        100%                   100%
20                      8                        2                        100%                   100%
Table 6.2: Learning Fuzzy Finite Automaton. The table shows the training and generalization performance of recurrent neural networks with different numbers of neurons in the hidden layer for learning fuzzy finite automaton. The training time is given by the number of cycles and training epochs, respectively. A 100% generalization performance reveals that the network can learn and represent the particular FFA.
The results show that the learning and representation of fuzzy finite automata is directly affected by the topology, i.e. the number of neurons in the hidden layer of the recurrent neural network. The network was unable to learn and represent the corresponding fuzzy finite automaton with 5 neurons in the hidden layer.
6.3 Extraction of Finite Automaton from Trained Recurrent Neural Networks
In the previous sections, two recurrent neural networks were trained, one on a deterministic and the other on a fuzzy finite automaton. The importance of knowledge extraction from neural networks has been discussed in Section 5.5: knowledge extraction is the means for understanding the knowledge representation in recurrent neural networks. Knowledge extraction through machine induction is applied in this section, where the Trakhtenbrot-Barzdin algorithm is used to induce deterministic finite automata. To induce fuzzy finite automata, the Trakhtenbrot-Barzdin algorithm has to be generalised so that it can induce them from strings with fuzzy membership in their output labels.
6.3.1 Extraction of Deterministic Finite Automaton
In Section 6.2.1, recurrent neural networks were trained to learn a deterministic finite automaton. Knowledge will be extracted from these trained networks to explore the knowledge represented in the network weights after the training process. The trained network was presented with strings of lengths 1-10 and made a prediction for each string in the data set; hence, a data set reflecting the knowledge represented in the network is obtained. This data set of labelled strings is then presented to the Trakhtenbrot-Barzdin algorithm for DFA induction. The prefix tree was generated from each input string and its corresponding output value. The DFA extraction algorithm shown in Algorithm 4 of Section 5.5.2 is used for induction; for each string length L, we recorded the extracted DFA's string classification performance on the training set. The recorded classification performance of the DFAs extracted with increasing string length L based on Algorithm 5 is shown in Table 6.3.
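The construction of the prefix tree from the labelled strings can be sketched as follows. This shows only the tree-building phase; the state-merging phase of the Trakhtenbrot-Barzdin algorithm is omitted here.

```python
def build_prefix_tree(samples):
    """Build a prefix-tree acceptor from (string, label) pairs.

    Nodes are integers; edges[node][symbol] gives the child node, and
    labels[node] holds the output of the string ending there (None for
    nodes whose label is not determined by the sample set).
    """
    edges, labels = [{}], [None]         # node 0 is the root (empty string)
    for string, label in samples:
        node = 0
        for symbol in string:
            if symbol not in edges[node]:
                edges.append({})         # create a fresh child node
                labels.append(None)
                edges[node][symbol] = len(edges) - 1
            node = edges[node][symbol]
        labels[node] = label             # label the final node of the string
    return edges, labels
```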
String Length   Percentage Correctly Consistent with Testing Set
2               0%
3               0%
4               90.02%
5               100%
6               100%
7               100%
8               100%
9               100%
10              100%
Table 6.3: Deterministic Finite Automaton Induction. The table shows the percentage of strings correctly consistent with the testing set with corresponding string lengths. In this way, deterministic finite automaton is extracted from trained recurrent neural networks showing that they can represent dynamical systems.
Note that the DFAs extracted from lengths L=2 and L=3 show 0% accuracy, as a complete DFA could not be constructed from strings this short. The string classification accuracy jumps to 90.02% for L=4 and remains at 100% for all prefix trees with depth larger than L=4. The deterministic finite automaton extracted from the trained recurrent neural network is identical to the automaton used for training in Figure 24.
6.3.2 Extraction of Fuzzy Finite Automaton
After the successful training of recurrent neural networks on fuzzy finite automata in Section 6.2.2, knowledge extraction is applied to identify the knowledge represented in the weights of these trained networks. The prediction made by the network upon the presentation of input strings is recorded in order to obtain the representation of the network's knowledge in the form of input-output string mappings. Then the output values of each string were transformed into fuzzy values, which were represented by four output neurons.
The prefix tree was built from each input string and its corresponding fuzzy string membership value. The DFA induction algorithm was generalised for FFA induction. For each string length L, we recorded the induced FFA's fuzzy string classification performance on the training set. The recorded classification performance of the FFAs extracted with increasing string length L is shown in Table 6.4. The table shows the percentage of data consistent with the entire training set for the corresponding string lengths. Note that the FFAs extracted from lengths L=2,…,4 show 0% accuracy. The fuzzy string classification accuracy jumps to 93.3% for L=5 and remains at 100% for all prefix trees with depth larger than L=5.

String Length   Percentage Correctly Consistent with Testing Set
2               0%
3               0%
4               0%
5               93.3%
6               100%
7               100%
8               100%
9               100%
10              100%
Table 6.4: Fuzzy Finite Automaton Induction. The table shows the percentage of strings correctly consistent with the testing set with corresponding string lengths. In this way, fuzzy finite automaton is extracted from trained recurrent neural networks showing that they can represent dynamical systems.
Experiments were also done for scenarios in which the training set consisted of only 50%, 30% and 10% of all strings up to length 10, i.e. the training data itself no longer embodied all the knowledge about the fuzzy membership values necessary to induce FFAs. In this case, the trained network had to rely on its generalization capability in order to assign the fuzzy membership values missing in the prefix tree. The experiments show that, even when "holes" are present in the training set, it is possible to extract the ideal deterministic finite-state acceptor by using the neural network to supply the missing fuzzy membership values.
6.4 Evolutionary Training of Hybrid Recurrent Neural Networks
6.4.1 Introduction
The proposed hybrid recurrent network architecture has been discussed in Section 3.5. The hybrid recurrent network architecture is inspired by hidden Markov models. Figure 26 shows the hybrid recurrent architecture for training finite automata. The hybrid architecture has one neuron in the input layer for string input and is governed by Equation 6.1.
S_i(t) = \left[ \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right] b_{t-1}(O) \qquad (6.1)
where b_{t-1}(O) is the Gaussian distribution. For a single input, the univariate Gaussian distribution is given by:

b_t(O) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(O-\mu)^2}{2\sigma^2} \right] \qquad (3.5)
where O_t is the observation at time t, \mu is the mean and \sigma^2 is the variance. Figure 26 shows how the HMM is mapped into the RNN by tying the output of the HMM, P(O|\lambda), together by means of a trainable weight leading to an output layer. The output of the Gaussian function depends solely on its two parameters, the mean and the variance. These parameters observe the sequence of input data in the hybrid architecture, which may be DFA strings or data from any real-world
time series, for example speech sequences. The dashed lines indicate that the architecture can accommodate more neurons in the hidden and input layers if required. The output of the Gaussian is further multiplied with the output of each neuron in the hidden layer. Note that one Gaussian distribution is used for every time step, irrespective of the number of neurons in the hidden and input layers. In Section 5.3.4, it was discussed how evolutionary optimization methods such as genetic algorithms can be used to train neural networks. Genetic algorithms will be used to train the proposed hybrid recurrent network architecture. There is no need for incremental data learning when using genetic algorithms, since the problems associated with learning long-term dependencies using gradient descent do not arise. The hybrid recurrent networks will be trained to learn deterministic and fuzzy finite automata using the whole training set, rather than the working sets of incremental learning. Trial experiments revealed that a population size of 40, crossover probability of 0.7 and mutation probability of 0.1 give good genetic training performance; hence, these values are used in all the experiments.
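A single time step of the hybrid architecture (Equation 6.1) can be sketched as below. The sigmoid transfer function applied to the weighted sum is an assumption, as the thesis does not restate the transfer function here; the sketch follows the description that the Gaussian output multiplies the output of each hidden neuron.

```python
import math


def gaussian(o, mu, sigma2):
    """Univariate Gaussian b_t(O); the normalizing constant is kept here,
    though the thesis drops it in the hybrid implementation."""
    return math.exp(-(o - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hybrid_step(V, W, s_prev, i_prev, o_prev, mu, sigma2):
    """One step of Equation 6.1: recurrent drive V and input drive W,
    with each hidden unit's output modulated by b_{t-1}(O)."""
    b = gaussian(o_prev, mu, sigma2)
    s_new = []
    for i in range(len(V)):
        net = (sum(V[i][k] * s_prev[k] for k in range(len(s_prev))) +
               sum(W[i][j] * i_prev[j] for j in range(len(i_prev))))
        s_new.append(sigmoid(net) * b)   # Gaussian multiplies each hidden output
    return s_new
```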
Figure 26 Hybrid recurrent neural networks. The figure shows the input layer, the hidden layer with recurrent (Z^{-1}) context connections, the Gaussian distribution b_t(O) whose output multiplies the output of each hidden neuron, and the output layer.
The maximum training time was 50 generations. For learning a particular finite automaton, two experiments were done with two different weight initializations. In the implementation of the hybrid recurrent architecture, the constant 1/(\sqrt{2\pi}\sigma) in front of the univariate Gaussian distribution was discarded in order to obtain uniform results.
6.4.2 Learning Deterministic Finite Automaton
The training and testing data sets were generated from the deterministic finite automaton shown in Figure 24. All the parameters of the hybrid architecture were trained, i.e. the weights connecting the input to the hidden layer, the hidden to the output layer, and the context to the hidden layer. The bias weights, and the mean and variance parameters of the univariate Gaussian function, were also trained. The network topology used for this experiment was as follows: one neuron in the input layer for string input and one neuron in the output layer. Experiments were done with different numbers of neurons in the hidden layer. There were three major experiments with different bounds on the weight initialization prior to training. The results for each experiment are shown in Tables 6.5, 6.6 and 6.7, respectively. Experiment 2 shows 100% training and generalization performance when initialized with weight values of -7 to 7 prior to training, while experiments 1 and 3 show poor results. The generalization performance is based upon the presentation of unknown strings to the trained network. This demonstrates that the proposed hybrid recurrent network architecture can learn and represent dynamical systems such as deterministic finite automata.
No. of Hidden Neurons   Weight Initialization Range   No. of Generations   Training Performance   Generalization Performance
5                       -3 to 3                       max                  0%                     0%
10                      -3 to 3                       max                  0%                     0%
15                      -3 to 3                       max                  0%                     0%
20                      -3 to 3                       max                  0%                     0%
Table 6.5: Learning Deterministic Finite Automaton: Experiment 1. The table shows the training and generalization performance of hybrid recurrent neural networks with different numbers of neurons in the hidden layer for learning deterministic finite automaton. The training time is given by the number of generations it takes for the hybrid architecture to learn. 'max' denotes the maximum training time of 50 generations. The weight initialization of the hybrid architecture for training is in the range of -3 to 3. A 0% generalization performance reveals that the network cannot learn and represent the particular DFA.
No. of Hidden Neurons   Weight Initialization Range   No. of Generations   Training Performance   Generalization Performance
5                       -7 to 7                       8                    100%                   100%
10                      -7 to 7                       26                   100%                   100%
15                      -7 to 7                       40                   100%                   100%
20                      -7 to 7                       9                    100%                   100%

Table 6.6: Learning Deterministic Finite Automaton: Experiment 2. The weight initialization of the hybrid architecture for training is in the range of -7 to 7. A 100% generalization performance reveals that the network can learn and represent the particular DFA.
No. of Hidden Neurons   Weight Initialization Range   No. of Generations   Training Performance   Generalization Performance
5                       -15 to 15                     max                  0%                     0%
10                      -15 to 15                     3                    100%                   100%
15                      -15 to 15                     max                  0%                     0%
20                      -15 to 15                     max                  0%                     0%

Table 6.7: Learning Deterministic Finite Automaton: Experiment 3. The weight initialization of the hybrid architecture for training is in the range of -15 to 15. 'max' denotes the maximum training time of 50 generations. A 100% generalization performance reveals that the network can learn and represent the particular DFA.
6.5 Extraction of Deterministic Finite Automaton
The proposed hybrid recurrent network architecture was successfully trained in the previous section, revealing that it can learn and represent dynamical systems such as deterministic finite automata. To explore the hybrid recurrent neural network's knowledge representation, knowledge must be extracted. The DFA extraction algorithm is applied to string samples which represent the knowledge in the hybrid recurrent neural network; for each string length L, the extracted DFA's string classification performance on the training set is recorded. The classification performance of the DFAs extracted with increasing string length L is shown in Table 6.8.
String Length   Percentage Correctly Consistent with Testing Set
2               0%
3               0%
4               90.02%
5               100%
6               100%
7               100%
8               100%
9               100%
10              100%
Table 6.8: Deterministic Finite Automaton Induction. The table shows the percentage of strings correctly consistent with the testing set with corresponding string lengths. In this way, deterministic finite automaton is extracted from trained hybrid recurrent neural networks showing that they can represent dynamical systems and therefore can be used for modelling real world application problems.
The DFAs extracted from lengths L=2 and L=3 show 0% accuracy. The string classification accuracy jumps to 90.02% for L=4 and remains at 100% for all prefix trees with depth larger than L=4. These results are identical to the extraction of DFAs from recurrent neural networks, which suggests that the hybrid recurrent network architecture inspired by hidden Markov models has a knowledge representation similar to that of recurrent neural networks.
6.6 Summary
The experiments show that finite automata can be represented by recurrent neural networks. Knowledge extraction was done to explore the knowledge represented by the trained networks. The proposed hybrid recurrent network architecture has been implemented, and it has been shown that it can learn and represent dynamical systems such as deterministic finite automata. After successful training, knowledge extraction was applied to extract deterministic finite automata from the hybrid recurrent networks. The results in general show that finite automata can be represented by recurrent neural networks and by the proposed hybrid recurrent neural networks.
Chapter 7 Real World Application: Speech Recognition
7.1 Introduction
The overall speech recognition system deals with feature extraction from speech data and pattern classification, which involves recurrent neural networks or hidden Markov models. The proposed hybrid recurrent neural networks are applied to the real-world application of speech recognition. The first section of this chapter deals with feature extraction from speech data; the extracted features are then used for the classification of speech data by hybrid recurrent neural networks. The main concern is speech recognition at the level of phoneme classification. In this chapter, hybrid recurrent neural networks are trained to classify three different pairs of phonemes.
7.2 Speech Recognition Systems
The main goal of a speech recognition system is to recognize speech utterances spoken by any speaker, i.e. man, woman or child, in any environment. Applications of speech recognition include voice command systems in household devices, speech input devices for interacting with computers other than the mouse and keyboard, and speech command systems. In order for speech recognition systems to be successfully applied to these applications, the system should perform similarly to the human auditory system, which is able to recognize speech spoken by any gender, age group and accent in any environment. It should be noted that the environment plays a major role in the system performance; a speech recognition system will have difficulty performing well in
environments with background noise. Research in the field of speech recognition has been done for more than forty years; however, scientists have been unable to implement a speaker-independent, large-vocabulary automatic speech recognition system. Speech recognition systems can currently perform well in noise-free environments, but are unable to show satisfactory results in environments with background noise.

Speech recognition systems are composed of two major components: (1) a feature extraction component, which extracts features from a speech database, and (2) a machine learning component, also referred to as the pattern recognition component, which learns by training on a data set of extracted features. The speech database is usually composed of hand-labelled speech sentences from speakers in different dialects and regions. The labelling distinguishes the phonemes and words within the corresponding sentence. Pattern recognition methods such as hidden Markov models and recurrent neural networks are used for modelling the features extracted from these databases. Feature extraction is necessary as speech sequences contain a huge amount of data, much of it irrelevant for recognition. In feature extraction, special techniques are used to extract the important information while the irrelevant information is discarded.

Speech recognition systems use different levels of recognition: (1) phoneme level recognition, where phonemes are recognized in order to recognize words; (2) word level recognition, where individual words are modelled by recognition systems; and (3) sentence recognition, using the grammar and semantic information of the language. Usually these methods are combined, i.e. phonemes are recognized to build words, which build sentences by using the grammar and semantic information of the language. The performance of a speech recognition system can be measured in terms of accuracy and speed.
Recurrent neural networks and hidden Markov models have been successfully applied to speech recognition [1, 6, 84]. Recurrent neural networks are capable of modelling complicated recognition tasks and have shown better recognition accuracy on low-quality, noisy data compared to hidden Markov models. However, hidden Markov models have been shown to perform better when it comes to large
vocabularies. One limitation of hidden Markov models in the application to speech recognition is the assumption that the probability of being in a state at time t depends only on the previous state, i.e. the state at time t-1. This assumption is inappropriate for speech signals, where dependencies often extend through several states; however, hidden Markov models have performed extremely well for certain types of speech recognition. In this chapter, we discuss phoneme level speech recognition in detail: feature extraction at the phoneme level and the use of hybrid recurrent neural networks for classification.
7.3 Feature Extraction from Speech Sequences
7.3.1 Introduction
A digitized speech signal contains a large amount of irrelevant information and requires plenty of storage space, which makes it difficult to model directly. In order to simplify the processing of the signal, useful features must be extracted. In large-vocabulary speech recognition, sub-word models are used, of which phonemes are the most promising, since there are about 60 phonemes in the English language compared to the huge number of words. The basic idea behind phoneme level recognition is to break individual words into phonemes and model them using machine learning techniques such as recurrent neural networks and hidden Markov models. This section discusses Mel Frequency Cepstral Coefficients (MFCC), one of the most popular methods for feature extraction from speech sequences [93]. The choice of the feature extraction method affects the performance of the speech recognition system: a poor choice would allow valuable information in the speech sequence to be lost, which can directly result in poor training and classification. Mel Frequency Cepstral Coefficients have shown successful feature extraction results in recognition with recurrent neural networks [94].
7.3.2 Mel Frequency Cepstral Coefficients
Mel frequency cepstral coefficients (MFCCs) are derived from the cepstral representation of speech data. The Mel scale approximates physiological characteristics of the human ear. The process of creating a spectrogram involves the Fourier transform, a function that maps a periodic signal in the time domain into a collection of amplitudes or intensities, called a spectrum, in the frequency domain. The process of extracting MFCC features from a speech phoneme signal is described as follows. The first step is to read a phoneme signal from the speech database. Once a phoneme is read, it is windowed, or separated into equal-size blocks. Figure 21 shows how the phoneme 'sh' is read from a sentence in the speech database. The frames of speech are extracted from the speech signal (phoneme) by applying a Hamming window of width 512 samples (32 ms) to the speech signal every 256 samples (16 ms). A windowing function such as the Hamming window is used because the frame is later discrete Fourier transformed: the discrete Fourier transformation (DFT) requires the frame to be short-time stationary and without discontinuities at the ends, and the Hamming window fulfils this requirement.
Figure 27 Windowing
The window is therefore used to force the signal to zero near the ends of the frame. The Hamming window W[n] is defined in Equation 7.1:

W[n] = \begin{cases} 0.54 - 0.46\cos(2\pi n / N) & 0 \le n < N \\ 0 & \text{otherwise} \end{cases} \qquad (7.1)
where N is equal to 512 for this application. The Fourier transformation changes the signal from the time domain to the frequency domain, as given by Equation 7.2.
X_a[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2j\pi nk/N}, \quad 0 \le k < N \qquad (7.2)
where x[n] is the original sample, X_a[k] is the output of the discrete Fourier transformation, and j is the imaginary unit. Figure 28 shows the entire process of obtaining a vector of MFCC features from a frame of windowed signal. The frame of speech signal obtained by the windowing shown in Figure 27 is presented to the discrete Fourier transformation to change the signal from the time domain to the frequency domain. The DFT-based spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, the log energy is computed at the output of each filter, and the discrete cosine transformation of the Mel amplitudes is performed, from which the vector of MFCC features is obtained.
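The framing and windowing step described above can be sketched as follows, using the stated parameters (frame width 512 samples, hop 256 samples) and the Hamming window of Equation 7.1.

```python
import math


def hamming(N):
    """Hamming window of Equation 7.1: 0.54 - 0.46*cos(2*pi*n/N), 0 <= n < N."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / N) for n in range(N)]


def frames(signal, width=512, hop=256):
    """Split the signal into overlapping Hamming-windowed frames
    (width 512 samples = 32 ms and hop 256 samples = 16 ms at 16 kHz)."""
    w = hamming(width)
    return [[signal[start + n] * w[n] for n in range(width)]
            for start in range(0, len(signal) - width + 1, hop)]
```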
Figure 28 The process of MFCC feature extraction: frame of speech signal of size 512 obtained from the phoneme → discrete Fourier transformation → DFT-based spectrum → Mel scale triangular filters → log → discrete cosine transformation → vector of MFCC features.
The j in Equation 7.2 represents an imaginary number, so the algorithm cannot be implemented straightforwardly; some simplification is needed. A complex number z can be expressed as z = x + jy, where j = \sqrt{-1} and x and y are real numbers (the real and imaginary parts). Using Euler's relation, for a real number \phi:
e^{j\phi} = \cos\phi + j\sin\phi \qquad (7.3)

Therefore, Equation 7.2 can be rewritten as:

X_a[k] = \sum_{n=0}^{N-1} x[n]\,(\cos\phi + j\sin\phi), \quad 0 \le k < N \qquad (7.4)
where \phi = -2\pi nk/N. The original frame of speech signal and the resulting discrete Fourier transform are shown in Figure 7.2. Furthermore, the spectrum is passed through M filters to obtain acoustic vectors. The filter bank with M filters (m = 1, 2, \ldots, M) is defined, where filter m is the triangular filter given by:

H_m[k] = \begin{cases} 0 & k < f[m-1] \\ \dfrac{k - f[m-1]}{f[m] - f[m-1]} & f[m-1] \le k \le f[m] \\ \dfrac{f[m+1] - k}{f[m+1] - f[m]} & f[m] \le k \le f[m+1] \\ 0 & k > f[m+1] \end{cases} \qquad (7.5)
where f[m] are the Mel cepstrum filter frequencies described in the equations to follow, k is the position in the spectrum, and H_m[k] is the filter bank with M triangular filters, which satisfies \sum_{m=1}^{M} H_m[k] = 1. Such filters compute the average spectrum around each centre frequency, with increasing bandwidths, as shown in Figure 7.3.
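The expansion of Equation 7.4 can be implemented directly with real arithmetic, as sketched below. This is a naive O(N^2) transform for illustration, not the FFT used in practice.

```python
import math


def dft(x):
    """Equation 7.4: DFT via Euler's relation, using only real arithmetic.
    Returns a list of (real, imaginary) pairs X_a[k] for 0 <= k < N."""
    N = len(x)
    out = []
    for k in range(N):
        # phi = -2*pi*n*k/N for each sample index n
        re = sum(x[n] * math.cos(-2.0 * math.pi * n * k / N) for n in range(N))
        im = sum(x[n] * math.sin(-2.0 * math.pi * n * k / N) for n in range(N))
        out.append((re, im))
    return out
```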
Figure 7.3: Triangular filters used in the computation of the Mel-cepstrum. The discrete Fourier transformation of a windowed signal is mapped onto the Mel scale using triangular overlapping windows.

Next, the Mel scale, which reflects physiological characteristics of the human auditory system, is discussed. Let f_l and f_h be the lowest and highest frequencies of the filter bank in Hz, F_s the sampling frequency in Hz, M the number of filters, and N the size of the DFT. The boundary points f[m] are uniformly spaced on the Mel scale:

f[m] = \left(\frac{N}{F_s}\right) B^{-1}\!\left( B(f_l) + m\,\frac{B(f_h) - B(f_l)}{M+1} \right) \qquad (7.6)

where the values of f_l, f_h and F_s are 500 Hz, 7500 Hz and 16000 Hz, respectively, and M is equal to 26 for this application. These values vary for different applications. The Mel scale B in Equation 7.6 is given by:

B(f) = 1125 \ln(1 + f/700) \qquad (7.7)
The inverse B^{-1} is given by:

B^{-1}(b) = 700\left( \exp(b/1125) - 1 \right) \qquad (7.8)
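Equations 7.6-7.8 can be combined to compute the filter boundary points, as sketched below with the stated parameter values (f_l = 500 Hz, f_h = 7500 Hz, F_s = 16000 Hz, M = 26, N = 512).

```python
import math


def mel(f):
    """Equation 7.7: frequency in Hz to Mel scale."""
    return 1125.0 * math.log(1.0 + f / 700.0)


def mel_inv(b):
    """Equation 7.8: Mel scale back to frequency in Hz."""
    return 700.0 * (math.exp(b / 1125.0) - 1.0)


def boundary_points(fl=500.0, fh=7500.0, Fs=16000.0, M=26, N=512):
    """Equation 7.6: the M+2 boundary points f[0..M+1], uniformly spaced
    on the Mel scale, expressed as (fractional) DFT bin positions."""
    lo, hi = mel(fl), mel(fh)
    return [(N / Fs) * mel_inv(lo + m * (hi - lo) / (M + 1))
            for m in range(M + 2)]
```

Each triangular filter H_m of Equation 7.5 then rises from bin f[m-1] to f[m] and falls back to zero at f[m+1].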
Then the log energy at the output of each filter is computed as:

S[m] = \ln\left( \sum_{k=0}^{N-1} |X_a[k]|^2\, H_m[k] \right), \quad 0 < m \le M