
Research on Unstructured Text Data Mining and Fault Classification Based on RNN-LSTM with Malfunction Inspection Report

Daqian Wei 1, Bo Wang 1,*, Gang Lin 1, Dichen Liu 1, Zhaoyang Dong 2, Hesen Liu 3 and Yilu Liu 3

1 School of Electrical Engineering, Wuhan University, Wuhan 430072, China; [email protected] (D.W.); [email protected] (G.L.); [email protected] (D.L.)
2 School of Electrical Engineering and Telecommunications, University of NSW, Sydney 2052, Australia; [email protected]
3 Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN 37996, USA; [email protected] (H.L.); [email protected] (Y.L.)
* Correspondence: [email protected]; Tel.: +86-159-7297-6215; Fax: +86-27-6877-2299

Academic Editor: Shuhui Li
Received: 23 December 2016; Accepted: 8 March 2017; Published: 21 March 2017
Energies 2017, 10, 406; doi:10.3390/en10030406

Abstract: This paper addresses the condition-based maintenance (CBM) of power transformers, the analysis of which relies on two basic data groups: structured (e.g., numeric and categorical) and unstructured (e.g., natural language text narratives), the latter accounting for 80% of the required data. Unstructured data in the form of malfunction inspection reports, as recorded during operation and maintenance of the power grid, constitutes an abundant but untapped source of power system insights. This paper proposes a method for malfunction inspection report processing by deep learning, which combines text data mining–oriented recurrent neural networks (RNN) with long short-term memory (LSTM). The effectiveness of the RNN-LSTM network for modeling inspection data is established with a straightforward training strategy in which we replicate targets at each sequence step. The corresponding fault labels are given in the datasets, so that the accuracy of fault classification can be calculated by comparing the output samples with the original data labels. Experimental results show how the key variables may be configured to achieve optimal results. The accuracy of the fault recognition demonstrates that the proposed method provides a more effective way for grid inspection personnel to deal with unstructured data.

Keywords: deep learning; recurrent neural network (RNN); natural language processing (NLP); long short-term memory (LSTM); unstructured data; malfunction inspection report

1. Introduction

With the higher requirements of the economy and safety of the power grid, online condition-based maintenance (CBM) of power transformers without power outages is an inevitable trend for equipment maintenance [1,2]. Transformer CBM analysis relies on two basic data groups: structured (e.g., numeric and categorical) and unstructured (e.g., natural language text narratives). Using structured data analysis, researchers have proposed a variety of transformer fault diagnosis algorithms such as the Bayesian method [3–5], the evidence reasoning method [6], the grey target theory method [7], the support vector machine (SVM) method [8–10], the artificial neural network method [11,12], the extension theory method [13], etc. These algorithms have achieved good results in engineering practice. However, for the more unstructured data found in practice [14–17] (i.e., the malfunction inspection report), traditional manual annotation and classification of massive original documents is not only time-consuming, but also unable to achieve the desired results, so processing has not kept pace with the information needs of the grid. Therefore, compared with structured data processing, it is more valuable for grid inspection personnel to be able to effectively identify the massive unstructured text in the malfunction inspection report.

Deep learning has achieved great success in speech recognition, natural language processing (NLP), machine vision, multimedia, and other fields in recent years. The most famous example is the face recognition test set Labeled Faces in the Wild [18], in which the final recognition rate of a non-deep learning algorithm was 96.33% [19], whereas deep learning could reach 99.47% [20]. Deep learning also consistently exhibits an advantage in NLP. After Mikolov et al. [21] presented language modeling using recurrent neural networks (RNNs) in 2010, they proposed two novel model architectures (Continuous Bag-of-Words and Skip-gram) for computing continuous vector representations of words from very large data sets in 2013 [22]. However, that model is not designed to capture fine-grained sentence structure. When using the back-propagation algorithm to learn the model parameters, the RNN needs to be unfolded into a parameter-sharing multi-layer feed-forward neural network whose number of layers corresponds to the length of the historical information. Many layers not only make training very slow; the most critical problems are the vanishing and exploding gradients [23–25]. Long short-term memory (LSTM) networks were developed in [26] to address the difficulty of capturing long-term memory in RNNs. They have been successfully applied to speech recognition, achieving state-of-the-art performance [27,28]. In text analysis, the LSTM-RNN treats a sentence as a sequence of words with internal structure, i.e., word dependencies. Tai et al. [29] introduced the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences and sentiment classification. Li et al. [30] explored an important step toward this generation task: training an LSTM auto-encoder to preserve and reconstruct multi-sentence paragraphs. They introduced an LSTM model that hierarchically builds a paragraph embedding from those of sentences and words, and then decodes this embedding to reconstruct the original paragraph. Chanen [31] showed how to use ensembles of word2vec (word to vector) models to automatically find semantically similar terms within safety report corpora, and how to use a combination of human expertise and these ensemble models to identify sets of similar terms with greater recall than either method alone. In Chanen's paper, an unsupervised method was shown for comparing several word2vec models trained on the same data in order to estimate reasonable ranges of vector sizes for individual word2vec models; this method is based on measuring inter-model agreement on common word2vec similar terms [31]. Palangi [32] developed a model that addresses sentence embedding, a hot topic in current NLP research, using an RNN with LSTM cells. In the above papers, the RNN-LSTM is continuously improved and developed as deep learning is applied to NLP.
At present, RNN-LSTM has many applications in NLP, but it has not yet been applied to the unstructured data of the power grid. Unstructured data accounts for about 80% of the data held by power grid enterprises and contains important information about the operation and management of the grid. For the malfunction inspection report, an unstructured data processing method is urgently needed to extract the useful information. The primary objective of this paper is to provide insight on how to apply the principles of deep learning via NLP to unstructured data analysis in grids based on RNN-LSTM.

The remaining parts of the paper are organized as follows. In Sections 2 and 3, the text data mining–oriented RNN and the LSTM model are described. In Section 4, the malfunction inspection report analysis method based on RNN-LSTM is proposed. Experimental results are provided to demonstrate the proposed method in Section 5. Conclusions are drawn in Section 6.


2. Text Data Mining–Oriented Recurrent Neural Network

2.1. Text Model Representation

For the vector form of voice and text in the mathematical model, each word represents a vector, and each dimension of the vector represents a single word. If the word appears in the text, it is set to 1; otherwise, it is set to 0. The number of vectors is equal to the dimension of the vocabulary words.

d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})    (1)
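As a minimal illustration of Equation (1), the following sketch builds such 0/1 document vectors; the vocabulary and example documents are made up for illustration and are not taken from the inspection reports.

```python
# Minimal sketch of the document vector in Equation (1): dimension k of
# d_j is 1 if vocabulary word k appears in document j, and 0 otherwise.
# The vocabulary and documents are illustrative placeholders.
def build_vocab(documents):
    vocab = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(vocab)}

def doc_vector(doc, vocab):
    vec = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

docs = ["transformer winding overheating", "breaker tripped during storm"]
vocab = build_vocab(docs)
print(doc_vector(docs[0], vocab))
```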

2.2. Recurrent Neural Networks

In the past, voice and text processing was usually a combination of a neural network and a hidden Markov model. Taking advantage of algorithms and computer hardware, the acoustic model established through deep forward-propagation networks has made considerable progress in recent years. Like sound, text processing is an internal dynamic process, and an RNN can be used as one of its candidate models. Dynamic means that the currently processed text vector is associated with the context of the content; it cannot be analyzed independently of the current sample, but should be combined, through memory units, with the text information before and after it for a comprehensive analysis of the semantic information. This approach provides a larger data state space and a richer model dynamic performance.

In a neural network, each neuron is a processing unit which takes the output of its connected nodes as input. Before the output is issued, each neuron first applies a nonlinear activation function. It is precisely because of this activation function that neural networks have the ability to model nonlinear relationships. However, the general neural model cannot clearly simulate time relationships: all data points are assumed to be fixed-length vectors, and when there is a strong correlation among the input vectors, the processing effect of the model is greatly reduced. Therefore, we introduce the recurrent neural network, which is given the ability to model explicit time: information flows not only into the output, but also into the hidden layer of the next time step, through feedback connections from hidden layer to hidden layer across time points.

The traditional neural network has no cyclic process in its middle layer. For a specified input x_0, x_1, x_2, ..., x_t, after processing by the neurons there will be corresponding outputs h_0, h_1, h_2, ..., h_t. In each training step, no information needs to be transferred between the neurons. The difference between RNNs and traditional neural networks is that in every training step of an RNN, neurons need to transfer some information: in the current step, a neuron needs to use the state of the previous neuron, similar to a recursive function. The basic structure of the RNN is shown in Figure 1, and its expansion is shown in Figure 2, where A is the hidden layer, x_i is the input vector, and h_i is the output of the hidden layer. As can be seen from Figure 2, the output of each hidden layer is fed as an input vector to the next hidden layer, so the output at one step can influence the following paragraph of the unstructured text. The algorithm of the model is analyzed in Appendix A.
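The recurrence can be made concrete with a small sketch of a plain RNN step over a sequence x_0, ..., x_t; the dimensions, random weights, and placeholder inputs below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Plain RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
# The hidden state is both the step output and the input carried to the
# next time step, as in Figure 2. All sizes and inputs are illustrative.
rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 8, 16, 5

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(steps):
    x_t = rng.normal(size=input_dim)           # placeholder word vector
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # recurrent update
print(h.shape)
```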

Figure 1. Recurrent neural network (RNN) basic structure.


Figure 2. Recurrent neural network expansion.

3. Long Short-Term Memory Model

Although the RNN performs the transformation from the sentence to a vector in a principled manner, it is generally difficult to learn long-term dependencies within the sequence due to the vanishing gradients problem. The RNN has two limitations: first, text analysis is in fact associated with the surrounding context, while the RNN only uses the previous text, not the following text; second, over long time spans, the RNN has more difficulty learning temporal correlations. A bidirectional LSTM (BLSTM) network can be used for the first problem, while the LSTM model can be used for the second. The RNN repeating module is shown in Figure 3, which contains only one neuron. The LSTM model is an improvement of the traditional RNN model: based on the RNN model, a cellular control mechanism is added to solve the long-term dependence problem of the RNN and the gradient explosion problem caused by long sequences. The model enables the RNN to memorize long-term information by designing a special cell structure. In addition, through three kinds of "gate" structures, the forget gate layer, the input gate layer, and the output gate layer, it can selectively add and remove information passing through the cell structure while controlling the information flow through the cell. These three "gates" act on the cell to form the hidden layer of the LSTM, also known as the block. The LSTM repeating module is shown in Figure 4, which contains four neurons.

Figure 3. The repeating module in a standard RNN contains a single layer.

Figure 4. The repeating module in a long short-term memory (LSTM) contains four interacting layers.


3.1. Core Neuron

The LSTM is used to control the transmission of information, and this control is usually expressed by a sigmoid function. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through; they are composed of a sigmoid neural net layer and a pointwise multiplication operation. The state of the LSTM core neuron is shown in Figure 5.

Figure 5. Neuron state transfer.

3.2. Forget Gate Layer

The role of the forget gate layer is to determine which of the upper layer's input information may be discarded, and it is used to control how much of the historical information stored in the hidden layer nodes at the last moment is kept. The forget gate computes a value between 0 and 1 according to the state of the hidden layer at the previous time and the input of the current time node, and acts on the state of the cell at the previous time to determine what information needs to be retained and discarded. The value "1" represents "completely keep", while "0" represents "completely get rid of the information". The output of the hidden layer cell (historical information) can thus be selectively processed by the forget gate.

The forget gate layer is shown in Figure 6; with the previous hidden state h_{t−1} and the current input x_t:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (2)

Figure 6. Forget gate layer.


3.3. Input Gate Layer

The input gate layer is used to control the input to the cell state of the hidden layer. It filters the incoming information through a number of operations to determine what needs to be retained to update the current cell state. First, the input gate layer is established through a sigmoid function to determine which information should be updated. The output of the input gate layer is a value between 0 and 1 from the sigmoid output, and it acts on the input information to determine whether to update the corresponding value of the cell state, where 1 indicates that the information is allowed to pass and the corresponding value needs to be updated, and 0 indicates that it may not be allowed to pass and the corresponding value does not need to be updated. It can be seen that the input gate layer can remove some unnecessary information. Then a tanh layer is established to produce the candidate state vector of the neuron, and the two jointly calculate the updated value. The input gate layer is shown in Figure 7.

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (3)

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)    (4)

Figure 7. Input gate layer.

3.4. Update the State of Neurons

It is time to update the old cell state, C_{t−1}, into the new cell state C_t. The previous steps already decided what to do, and now we just need to actually do it. We multiply the old state by f_t, forgetting the information we decided to forget earlier; then we add i_t ∗ C̃_t, the new candidate values scaled by how much we decided to update each state value. In the case of a language model, this is where we would actually drop the information about the old subject's gender and add the new information, as decided in the previous steps. The neuron state update process is shown in Figure 8.

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t    (5)

Figure 8. Neuron status update.


3.5. Output Gate Layer

The output gate layer is used to control the output of the current hidden layer node, and to determine whether to output to the next hidden layer or to the output layer. Through this output control, we can determine which information needs to be output. The value of its state is "0" or "1": the value "1" represents a need to output, and "0" represents that no output is required. After the output control acts on the current cell state, the final output value can be found.

The output of the neurons is determined by (6) and (7), and the output gate layer is shown in Figure 9.

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)    (6)

h_t = o_t ∗ tanh(C_t)    (7)

Figure 9. Output gate layer.
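Putting Equations (2)–(7) together, a single LSTM step can be sketched as follows; the weight shapes and random initial values are illustrative assumptions, and the concatenation [h_{t−1}, x_t] follows the notation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step following Equations (2)-(7)."""
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)          # (2) forget gate
    i_t = sigmoid(W_i @ z + b_i)          # (3) input gate
    C_tilde = np.tanh(W_C @ z + b_C)      # (4) candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde    # (5) cell state update
    o_t = sigmoid(W_o @ z + b_o)          # (6) output gate
    h_t = o_t * np.tanh(C_t)              # (7) hidden-layer output
    return h_t, C_t

# Illustrative sizes and random weights only.
rng = np.random.default_rng(1)
input_dim, hidden_dim = 8, 16
params = []
for _ in range(4):                        # W_f/b_f, W_i/b_i, W_C/b_C, W_o/b_o
    params += [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
               np.zeros(hidden_dim)]
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, C = lstm_step(rng.normal(size=input_dim), h, C, *params)
print(h.shape, C.shape)
```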

4. Malfunction Inspection Report Analysis Method Based on RNN-LSTM

To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meanings as close as possible, and to make sentences of different meanings as far apart as possible. This is challenging in practice, since it is hard to collect a large amount of manually labeled data that gives the semantic similarity signal between different sentences. Nevertheless, a widely used commercial web search engine is able to log massive amounts of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side). We try to explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective. This objective of making sentences with similar meanings as close as possible is similar to machine translation tasks, where two sentences belong to two different languages with similar meanings and we want to make their semantic representations as close as possible.

In this paper, the algorithm, parameter settings, and experimental environment are as follows:

Input: training samples and test samples (including various types of reports and their labels). Number of truncated words: maxlen = 200. Minimum number of words: min_count = 5. Model parameters: Dropout = 0.5, Dense = 1. Activation function types: Sofamax, Relu, tanh, and sigmoid. Number of LSTM units: 32, 64, 128, 256, 512. Output: classification results and fault recognition accuracy of the test samples. Initialization: set all parameters of the model to small random numbers.

Experimental operating environment: operating system: Windows 10; RAM: 32 GB; CPU: Xeon E5, 8-core; graphics: NVIDIA 980M; program computing environment: Theano and Keras.

The process of the training method for RNN-LSTM is presented in Appendix B.
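As a hedged sketch of how the configuration listed above could be written against the Keras API that the paper names as its computing environment (the vocabulary size, embedding dimension, and optimizer below are assumptions not given in the paper):

```python
# Sketch of the RNN-LSTM classifier configuration listed above, using the
# Keras API mentioned in the paper. vocab_size, embedding_dim, and the
# optimizer are assumptions; maxlen = 200, Dropout = 0.5, Dense = 1, and
# the sigmoid output follow the parameter list in this section.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

vocab_size, embedding_dim, maxlen = 20000, 128, 200   # vocab/embedding assumed

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(LSTM(128))                        # 32/64/128/256/512 units are compared
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Report texts would be converted to padded integer sequences of length
# maxlen (e.g., with keras.preprocessing) before calling model.fit(...).
```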


5. Experimental Verification Based on Malfunction Inspection Report

The full learning formula for all the model parameters was elaborated on in the previous section, and the training method was described in Section 4. In this section, we take the power grid malfunction inspection report of a regional power grid in the China Southern Power Grid as the object of analysis. Through RNN-LSTM processing, we can use machine learning to classify and analyze the unstructured data in different situations.

5.1. Database

The description of the corpus is as follows. The malfunction inspection report records come from the grid personnel's inspection of the power grid equipment, lines, and protection devices during daily maintenance. The accumulation of fault statements constitutes the main body of the report. The information in the malfunction inspection report is mainly composed of six main information bodies, "DeviceInfo", "TripInfo", "FaultInfo", "DigitalStatus", "DigitalEvent", and "SettingValue", plus other common information. The TripInfo information body can contain multiple optional FaultInfo information bodies. The FaultInfo information body indicates the current and voltage of the action, and it can clearly reflect and display the fault condition and the operation process through the report. The content source of DeviceInfo information can be a fixed value or a configuration file. The information of FaultInfo, DigitalStatus, DigitalEvent, and SettingValue can differ according to the type of protection or the manufacturer. FaultInfo can be used as the auxiliary information of a single action message or as a fault parameter of the whole action group. The contents of each message are as follows:

(1) DeviceInfo: Description information portion of the recording apparatus.
(2) TripInfo: Partially records protection action events during the failure process.
(3) FaultInfo: Records the fault current, fault voltage, fault phase, fault distance, and other information in the process of recording the fault.
(4) DigitalStatus: Records the status of the device's self-test signals before the fault.
(5) DigitalEvent: Records the change of events such as the self-test signal during the process of fault protection; all the switches are sorted according to the action time, and the action time and return time are recorded at the same time.
(6) SettingValue: Records the actual value of the device setting at the fault time.

According to the dynamic fault records of the power system, we divide all the faults into the following five categories and give the corresponding label after each record: mechanical fault; electrical fault; secondary equipment fault; fault caused by external environment; fault caused by human factors. We have selected the malfunction inspection reports of the China Southern Power Grid from nearly 10 years as the data set of this paper. In the data set used, the specific types of faults and causes of faults, and their percentages, are shown in Table 1. A single sample's data size ranges between 21 kB and 523 kB. Training samples and test samples were randomly selected to ensure the versatility of the model test.

In the semantic analysis, we also analyze the data sets used. In this paper, nine categories were selected to cover the semantic relations among most pairs of entities, with no overlap between them. However, there are some very similar relationships that can cause difficulties in recognition tasks, such as Entity-Origin (EO), Entity-Destination (ED), and Content-Container (CC), which often appear in one sample at the same time; similarly for Component-Whole (CW) and Member-Collection (MC). The nine types of relationships and examples are as follows:

(1) Cause-Effect (CE): Those cancers were caused by radiation exposures.
(2) Instrument-Agency (IA): Phone operator.
(3) Product-Producer (PP): A factory manufactures suits.
(4) Content-Container (CC): A bottle full of honey was weighted.
(5) Entity-Origin (EO): Letters from foreign countries.
(6) Entity-Destination (ED): The boy went to bed.
(7) Component-Whole (CW): My apartment has a large kitchen.
(8) Member-Collection (MC): There are many trees in the forest.
(9) Message-Topic (MT): The lecture was about semantics.

Table 1. Different fault type statistics in the dataset.

Fault Category | Fault Reason | Amount | Percentage | Category Subtotal
Mechanical fault | Turbine cooler fault | 2736 | 25.6% | 29.2%
Mechanical fault | Switch mechanism oil leakage | 312 | 2.9% |
Mechanical fault | Monitor computer fault | 39 | 0.3% |
Mechanical fault | Low air pressure of equipment | 44 | 0.4% |
Electrical fault | Circuit breakers, disconnectors fault | 3375 | 31.3% | 43.5%
Electrical fault | Capacitor fault | 502 | 4.7% |
Electrical fault | Unbalanced voltage and current | 429 | 4.0% |
Electrical fault | Arrester fault | 87 | 0.8% |
Electrical fault | Voltage and current transformer fault | 126 | 1.2% |
Electrical fault | Battery fault | 55 | 0.6% |
Electrical fault | Insulators blew | 97 | 0.9% |
Secondary equipment fault | Electromagnetic locking fault | 55 | 0.6% | 4.7%
Secondary equipment fault | Remote control fault | 445 | 4.1% |
Fault caused by human factors | Wire facilities or referrals stolen | 432 | 4.0% | 4.0%
Fault caused by external environment | Lines and trees are too close | 1991 | 18.6% | 18.6%
Test sample | | 3217 | 30% |
Training sample | | 7508 | 70% |
Total | | 10725 | 100% |

The specific distribution of the number of samples in each category is shown in Table 2.

Table 2. Statistical distribution of relationship categories in samples.

Relation | Sample Quantity | Proportion of Sample
Cause-Effect (CE) | 1331 | 12.4%
Instrument-Agency (IA) | 1253 | 11.7%
Product-Producer (PP) | 1137 | 10.6%
Content-Container (CC) | 974 | 9.1%
Entity-Origin (EO) | 948 | 8.8%
Entity-Destination (ED) | 923 | 8.6%
Component-Whole (CW) | 895 | 8.4%
Member-Collection (MC) | 732 | 6.8%
Message-Topic (MT) | 660 | 6.2%
Other | 1872 | 17.4%
Total | 10725 | 100%
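As a rough illustration of the random training/test selection described above (roughly a 70/30 split per Table 1), the sketch below uses scikit-learn, which is an assumption; the example reports and labels are placeholders, not records from the data set.

```python
# Illustrative random split of labeled report texts into training and
# test sets. scikit-learn and the placeholder reports/labels are
# assumptions, not the paper's tooling or data.
from sklearn.model_selection import train_test_split

reports = ["switch mechanism oil leakage found during patrol",
           "remote control failure reported by the dispatcher",
           "lines too close to trees after the storm",
           "turbine cooler fault during routine inspection"]
labels = ["mechanical fault", "secondary equipment fault",
          "fault caused by external environment", "mechanical fault"]

x_train, x_test, y_train, y_test = train_test_split(
    reports, labels, test_size=0.3, random_state=42)
print(len(x_train), len(x_test))
```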

5.2. Result and Analysis

Based on the RNN-LSTM training with massive single-fault samples, the test set is imported for fault-type accuracy testing. In this paper, three variables were selected. By comparing the fault recognition results under these three different variables against the fault reports, we obtained the fault recognition accuracy. With the other two variables fixed, the test samples are verified using different numbers of traversals. The three variables are: number of LSTM units, type of activation unit, and batch size. Batch size, the amount of data processed in each batch, is a training setting particular to deep learning; proper adjustment of the batch size can not only reduce the number of weight adjustments and thus prevent over-fitting, but also speed up training.

5.2.1. Fault Recognition Accuracy and Number of LSTM Units

This experiment keeps the activation unit and batch size constant, while the number of LSTM units gradually increases, and increases the number of traversals for each number of LSTM units. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the activation unit is sigmoid; the batch size is 20. The relationship between the accuracy rate and the number of LSTM units is shown in Table 3, and its trend is shown in Figure 10.

Table 3. Accuracy of fault recognition under different LSTM units.

Epoch (Traversal Times) | 32 Units | 64 Units | 128 Units | 256 Units | 512 Units
1 | 0.36045 | 0.38296 | 0.38847 | 0.41947 | 0.34154
5 | 0.41832 | 0.43441 | 0.46547 | 0.47669 | 0.38457
10 | 0.42052 | 0.48952 | 0.48952 | 0.50712 | 0.38457
15 | 0.47633 | 0.49903 | 0.50058 | 0.53964 | 0.39541
20 | 0.47633 | 0.50567 | 0.50856 | 0.58585 | 0.37854
30 | 0.49817 | 0.53811 | 0.53585 | 0.58684 | 0.36845
50 | 0.52576 | 0.53811 | 0.54273 | 0.61058 | 0.39574

Figure 10. Accuracy of fault recognition under different LSTM units.

As can be observed from Table 3 and Figure 10, when the number of LSTM units remains constant, the fault recognition accuracy increases with the number of traversals. When the number of traversals is the same, the greater the number of LSTM units, the better the performance; however, a significant decline in the accuracy rate emerges when the number of LSTM units reaches 512. The reason for this decrease is that, as the required data volume increases, if more than 512 LSTM units are needed, the parameters need to be adjusted and optimized.

To further analyze the data, the receiver operating characteristic (ROC) curve is added to the results. Due to the different performances of different numbers of LSTM units, we repeated the experiments under different epoch conditions and chose three settings worth analyzing: 64, 128, and 256. The area under the curve (AUC) reflects the ability of the recognition algorithm to correctly distinguish two types of targets; the larger the AUC is, the better the performance of the algorithm. False negative (FN), false positive (FP), true negative (TN), and true positive (TP) are important parameters in the ROC curve. Specificity is defined as the true negative rate (TNR), and sensitivity is defined as the true positive rate (TPR). In the following experiment, the threshold was set to 0.5; the test was positive if the accuracy of the fault recognition under a given number of LSTM units was higher than the threshold value. As can be seen from Table 4 and Figure 11, in the proposed algorithm, the performance tends to be better within a certain interval as the number of LSTM units increases.
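ROC/AUC figures of this kind can be reproduced in outline as follows; scikit-learn is an assumed tool (the paper does not name one), and the scores and labels are synthetic placeholders rather than the paper's data.

```python
# Sketch of the ROC/AUC analysis described above. y_true is the binary
# ground truth and y_score the classifier's output score; both are
# synthetic placeholders. scikit-learn is an assumed tool.
import numpy as np
from sklearn.metrics import roc_curve, auc, confusion_matrix

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=200), 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

# FN/FP/TN/TP and the sensitivity (TPR) / specificity (TNR) at a 0.5 threshold.
tn, fp, fn, tp = confusion_matrix(y_true, (y_score >= 0.5).astype(int)).ravel()
print("TPR:", tp / (tp + fn), "TNR:", tn / (tn + fp))
```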

Table 4. Area under the curve (AUC) analysis of receiver operating characteristic (ROC) curves under different LSTM unit numbers.

Number of LSTM Units | AUC | Standard Error | Lower Bound (95%) | Upper Bound (95%)
64 | 0.6994 | 0.0785 | 0.5456 | 0.8533
128 | 0.7724 | 0.0666 | 0.6419 | 0.9030
256 | 0.8319 | 0.0588 | 0.7165 | 0.9473


Figure 11. (a) Comparison of ROC curves under different LSTM unit numbers. The percentage of FN/FP/TN/TP under (b) 64 LSTM units; (c) 128 LSTM units; and (d) 256 LSTM units. FN/FP/TN/TP: false negative/false positive/true negative/true positive.

5.2.2. Fault Recognition Accuracy and Activation Unit Type

This experiment keeps the number of LSTM units and the batch size constant, while selecting four different activation units and increasing the number of traversals for each activation unit. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the number of LSTM units is 128; the batch size is 20. The relationship between the accuracy rate and the different activation units is shown in Table 5, and its trend is shown in Figure 12.


Figure 12. Accuracy of fault recognition under different activation units.

As can be observed from Table 5 and Figure 12, under the same activation unit condition, the fault recognition accuracy increases with the number of traversals. With the same number of traversals, the Sofamax and sigmoid activation units obtain better accuracy, followed by Relu. It can be seen that the greater the number of traversals, the closer the Relu and sigmoid performances become, whereas the results obtained using tanh show no obvious change. Thus, in the choice of activation function, Sofamax and sigmoid are more suitable for text processing.

Table 5. Accuracy of fault recognition under different activation units.

Epoch (Traversal Times) | Sofamax | Relu | tanh | Sigmoid
1 | 0.42797 | 0.40868 | 0.37974 | 0.41947
5 | 0.48042 | 0.45259 | 0.39684 | 0.47669
10 | 0.52669 | 0.50478 | 0.35741 | 0.50712
20 | 0.59572 | 0.58163 | 0.40587 | 0.58585

We also selected theconditions. above fourInactivation functions for ROC analysis experiments under different epoch the following experiment, threshold wasby setrepeated to 0.5. The test was We also selected the above four activation functions for ROC analysis by repeated experiments under different epoch conditions. In the following experiment, threshold was set to 0.5. test positive if the accuracy of the fault recognition under different activation units was higherThe than thewas under different epoch conditions. In the following experiment, threshold was set to 0.5. The test was positive if the accuracy of the fault recognition under different activation units was higher than threshold value. As can be seen from Table 6 and Figures 13 and 14, Sofamax and sigmoid activation the positive ifthe theAs accuracy the from fault the recognition under different activation units was than the threshold value. can beof seen Table 6 and Figures 13 and 14, Sofamax andhigher sigmoid activation performed best. It also confirms above conclusions. threshold value. As can be seen from Table 6 and Figures 13 and 14, Sofamax and sigmoid activation performed the best. It also confirms the above conclusions. performed the best. It also confirms theofabove conclusions. Table 6. AUC analysis ROC curves under different activation units.

Table 6. AUC analysis of ROC curves under different activation units.

Activation Unit    AUC       Standard Error    Lower Bound (95%)    Upper Bound (95%)
Softmax            0.8889    0.0477            0.7953               0.9824
ReLU               0.7743    0.0692            0.6386               0.9099
tanh               0.6267    0.0817            0.4665               0.7868
sigmoid            0.7916    0.0667            0.6607               0.9225
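ROC curves and AUC values of this kind can be computed, for example, with scikit-learn; the sketch below is illustrative only, with placeholder labels and scores standing in for the repeated-trial outcomes rather than reproducing the paper's exact positive/negative construction.

```python
# Illustrative sketch, not the authors' code: scikit-learn is an assumed tooling choice,
# and y_true / y_score are placeholders for the per-trial labels and classifier scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])                  # placeholder trial outcomes
y_score = np.array([0.62, 0.41, 0.55, 0.71, 0.48, 0.66,
                    0.39, 0.52, 0.58, 0.69])                        # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)                  # points of the ROC curve
print("AUC =", auc(fpr, tpr))                                      # area under the curve
```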

Figure 13. Comparison of ROC curves under different activation units.



Figure 14. The percentage of FN/FP/TN/TP under the activation unit of (a) Softmax; (b) ReLU; (c) tanh; and (d) sigmoid.
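The FN/FP/TN/TP percentages of the kind plotted in Figure 14 can be obtained from a confusion matrix at the 0.5 threshold; the minimal sketch below assumes scikit-learn and placeholder trial outcomes.

```python
# Illustrative sketch only: a minimal way to obtain FN/FP/TN/TP percentages,
# assuming scikit-learn and placeholder binary trial outcomes.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])                  # placeholder labels
y_pred = (np.array([0.62, 0.41, 0.55, 0.71, 0.48, 0.66,
                    0.39, 0.52, 0.58, 0.69]) >= 0.5).astype(int)   # the paper's 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp
print({"TN": tn / total, "FP": fp / total, "FN": fn / total, "TP": tp / total})
```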

5.2.3. Accuracy of Fault Recognition and Batch Size

This experiment keeps the number of LSTM units and the activation unit constant, gradually increases the single-batch processing data size, and raises the number of traversals under the same batch size condition. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the number of LSTM units is 128; and the activation unit is sigmoid. The relationship between the accuracy rate and the different batch sizes is shown in Table 7, and its trend is shown in Figure 15.

Table 7. Accuracy of fault recognition under different batch sizes.

Epoch (Traversal Times)    Batch Size: 10    Batch Size: 20    Batch Size: 50
1                          0.31189           0.47369           0.31947
5                          0.39451           0.49687           0.37669
10                         0.50321           0.51476           0.40712
15                         0.50147           0.55684           0.43964
20                         0.53697           0.59548           0.48585
50                         0.55876           0.64587           0.51058
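A sweep of this kind could be scripted as sketched below, reusing the hypothetical build_classifier helper from the earlier sketch and random placeholder data in place of the 10,000 training and 3000 test inspection reports.

```python
# Sketch only, under the same assumptions as the earlier build_classifier example;
# the arrays below are toy stand-ins for the encoded inspection reports and fault labels.
import numpy as np

x_train = np.random.randint(0, 5000, size=(1000, 100))     # toy encoded training reports
y_train = np.eye(5)[np.random.randint(0, 5, size=1000)]    # toy one-hot fault labels
x_test = np.random.randint(0, 5000, size=(300, 100))
y_test = np.eye(5)[np.random.randint(0, 5, size=300)]

checkpoints = [1, 5, 10, 15, 20, 50]
results = {}
for batch_size in (10, 20, 50):
    model = build_classifier("sigmoid")                    # sigmoid output, 128 LSTM units
    trained = 0
    for target in checkpoints:
        model.fit(x_train, y_train, epochs=target - trained,
                  batch_size=batch_size, verbose=0)        # continue training, do not restart
        trained = target
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        results[(batch_size, target)] = acc
print(results)                                             # entries of a table like Table 7
```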


Figure 15. Accuracy of fault recognition under different batch sizes.


As can be observed from Table 7 and Figure 15, under the same batch size conditions, the fault recognition accuracy increases with the number of traversals. With the same number of traversals, a batch size of 20 yields higher accuracy than the other two sizes. When the batch size was 10, the accuracy rate increased with the number of traversals, but the lack of continuous improvement indicates an under-fitting state. When the batch size was 50, the accuracy rate decreased significantly compared with the previous two sizes, as too much data in each batch caused an over-fitting phenomenon.

We also selected the above three batch sizes for ROC analysis by repeated experiments under different epoch conditions. In the following experiment, the threshold was set to 0.48. The test was positive if the accuracy of the fault recognition under different batch sizes was higher than the threshold value. As can be seen from Table 8 and Figure 16, a batch size of 20 performed best. However, when the batch size was 50, the overall ROC curve tended to be smoother. It should be noted that, for different datasets showing different characteristics, the batch size should not have a fixed range of selection. As the inspection report requires a certain word length to express the corresponding characteristics, the best value for the batch size is 20.

Table 8. AUC analysis of ROC curves under different batch sizes.

Batch Size    AUC       Standard Error    Lower Bound (95%)    Upper Bound (95%)
10            0.6996    0.0773            0.5481               0.8512
20            0.9107    0.0386            0.8349               0.9865
50            0.8814    0.0477            0.7878               0.9751
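The paper does not state how the standard errors and 95% bounds in Tables 6 and 8 were obtained; one common choice is the Hanley–McNeil approximation sketched below, in which the trial counts are assumed values.

```python
# Sketch of one common way (Hanley & McNeil, 1982) to attach a standard error and
# 95% bounds to an AUC estimate; this is an assumption for illustration, since the
# paper does not state which method it used.
import math

def auc_confidence_interval(auc_value, n_pos, n_neg, z=1.96):
    """Approximate SE and 95% CI for an AUC estimated from n_pos positive
    and n_neg negative trials."""
    q1 = auc_value / (2 - auc_value)
    q2 = 2 * auc_value ** 2 / (1 + auc_value)
    se = math.sqrt((auc_value * (1 - auc_value)
                    + (n_pos - 1) * (q1 - auc_value ** 2)
                    + (n_neg - 1) * (q2 - auc_value ** 2)) / (n_pos * n_neg))
    return se, max(0.0, auc_value - z * se), min(1.0, auc_value + z * se)

# Example with the batch-size-20 AUC from Table 8 and assumed trial counts.
print(auc_confidence_interval(0.9107, n_pos=30, n_neg=30))
```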


Figure 16. (a) Comparison of ROC curves under different batch sizes. The percentage of FN/FP/TN/TP under batch size (b) 10; (c) 20; and (d) 50.

In this paper, three key parameters were selected as experimental variables, and the experimental results show that the unstructured data processing method based on RNN-LSTM proposed in this paper reaches the current research level in machine learning when applied to malfunction inspection reports. This means that the method can provide a more effective way for grid inspectors to deal with unstructured text.

6. Conclusions

How to efficiently handle large numbers of unstructured text data such as malfunction inspection reports is a long-standing problem faced by operation engineers. This paper proposes a deep learning method for malfunction inspection report processing using the text data mining–oriented RNN-LSTM. An effective training strategy for an RNN-LSTM network for modeling inspection data is presented. From the obtained results and an effectiveness analysis, it was demonstrated that RNN-LSTM, especially with target replication, can successfully classify diagnoses of labeled malfunction inspection reports given unstructured data. The experiments over different variables show how the key variables should be configured to achieve optimal results, including the selection of the maximum LSTM unit number, the processing capacity of the activation unit, and the prevention of over-fitting. In this paper, we used fault labels without timestamps, but we are obtaining timestamped diagnoses, which will enable us to train models to perform early fault diagnosis by predicting future conditions on a larger inspection data set.

Acknowledgments: The authors are grateful for the projects supported by the National Natural Science Foundation of China No. 51477121 and the China Southern Power Grid No. GZ2014-2-0049.
Author Contributions: Daqian Wei, Gang Lin and Bo Wang designed the main parts of the study, including the RNN-LSTM modeling, the implementation of the algorithms, the neural network training process and the experiments. Daqian Wei, Gang Lin and Hesen Liu mainly contributed to the writing of the paper. Yilu Liu was responsible for guidance, a number of key suggestions, and manuscript editing. Zhaoyang Dong and Dichen Liu were also responsible for some constructive suggestions.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. The Detailed Description of Recurrent Neural Network

Appendix A.1. Forward Propagation

$u = W_{xh} \times x$ (A1)

$h_t = \tanh(W_{hh} \times h_{t-1} + u) = \tanh(z_t + u)$ (A2)

$u' = W_{hy} \times h$ (A3)

$y = u'$ (A4)

where $u$ is the input of the hidden layer, $h$ is the output of the hidden layer, $u'$ is the input of the output layer, and $y$ is the output of the output layer; because the output layer is linear, $y = u'$. The specific calculation of each neuron is as follows:

$u_i = \sum_{j=1}^{V} W_{xh}(i,j) \times x_j$ (A5)

$z_i^t = \sum_{j=1}^{H} W_{hh}(i,j) \times h_j^{t-1}$ (A6)

$h_i = \tanh\left(u_i + z_i^t\right)$ (A7)

$u_i' = \sum_{j=1}^{H} W_{hy}(i,j) \times h_j$ (A8)

$y_i = u_i'$ (A9)
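For readers who prefer code to notation, the following minimal NumPy sketch mirrors Equations (A1)–(A9); the dimensions and random weights are placeholders rather than values from the paper.

```python
# Minimal NumPy sketch of the plain-RNN forward pass in Equations (A1)-(A9).
# Dimensions and weights are illustrative placeholders.
import numpy as np

V, H, O = 8, 4, 3                      # input, hidden, output sizes (assumed)
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(H, V))         # input-to-hidden weights
W_hh = rng.normal(size=(H, H))         # hidden-to-hidden (recurrent) weights
W_hy = rng.normal(size=(O, H))         # hidden-to-output weights

def rnn_step(x, h_prev):
    u = W_xh @ x                       # (A1) / (A5)
    z = W_hh @ h_prev                  # (A6)
    h = np.tanh(u + z)                 # (A2) / (A7)
    y = W_hy @ h                       # (A3)-(A4) / (A8)-(A9): linear output layer
    return h, y

h = np.zeros(H)
for x in rng.normal(size=(5, V)):      # a toy sequence of 5 input vectors
    h, y = rnn_step(x, h)
print(y)
```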

Appendix A.2. Loss Function

There are two kinds of loss function: the quadratic error function and the cross-entropy error function. Assume that $t$ is the true value of the training sample, $y$ is the output of the neural network, and a training sample is $(x, t)$.

(1) Quadratic error function:

$E(t, y) = \frac{1}{2}(t - y)^2$ (A10)

(2) Cross-entropy error function:

$E(t, y) = -\left[t \ln(y) + (1 - t)\ln(1 - y)\right]$ (A11)

When the output layer of the neural network does not use an activation function, the quadratic error function should be used, which gives relatively fast gradient-based parameter estimation. If the output layer uses the sigmoid activation function, the cross-entropy error function is used to estimate the parameters: whenever the sigmoid activation function is selected, the cross-entropy error function cancels the sigmoid derivative during differentiation, which speeds up the gradient descent.

(3) Partial derivative of the error with respect to the output of the output layer

(a) Quadratic error function:

$\frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\frac{1}{2}(t - y)^2 = y - t$ (A12)

(b) Cross-entropy error function:

$\frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\left(-\left[t \ln(y) + (1 - t)\ln(1 - y)\right]\right) = -\frac{t}{y} + \frac{1-t}{1-y} = \frac{-t(1-y) + y(1-t)}{y(1-y)} = \frac{y-t}{y(1-y)}$ (A13)

(4) Partial derivative of the error with respect to the input of the output layer

(a) Linear output layer

If the output layer is without an activation function, the different cost functions are solved separately: $\frac{\partial E}{\partial u'} = \frac{\partial E}{\partial y}$.

(b) Sigmoid output layer

Calculate the derivative of the error with respect to the input of the output layer, where $z$ denotes the input of the output layer, $z = u'$, $y = \mathrm{sigmoid}(z)$. In the neural network, the error of the output layer is defined as the derivative of the loss function with respect to its input, denoted by $\delta^L$. Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, whose partial derivative is $\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))$.

$\delta^L = \frac{\partial E}{\partial z} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial z} = \frac{y-t}{y(1-y)}\,\sigma(z)(1 - \sigma(z)) = \frac{y-t}{y(1-y)}\,y(1-y) = y - t$ (A14)
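The two loss functions and the resulting output-layer errors can be checked numerically, as in the short sketch below; the scalar sample values are placeholders.

```python
# Illustrative NumPy sketch of the losses in (A10)-(A11) and the output-layer errors
# in (A12)-(A14); the scalar target t and pre-activation z are placeholders.
import numpy as np

def quadratic_loss(t, y):
    return 0.5 * (t - y) ** 2                            # (A10)

def cross_entropy_loss(t, y):
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))    # (A11)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t, z = 1.0, 0.3
y = sigmoid(z)

dE_dy_quadratic = y - t                                  # (A12)
dE_dy_xent = -t / y + (1 - t) / (1 - y)                  # (A13) = (y - t) / (y * (1 - y))
delta_L = dE_dy_xent * y * (1 - y)                       # (A14): sigmoid + cross-entropy -> y - t
print(delta_L, y - t)                                    # the two values coincide
```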

Appendix A.3. Back Propagation

(1) Partial derivative of the error with respect to the output of the output layer:

$\frac{\partial E}{\partial y_i} = \frac{\partial}{\partial y_i}\frac{1}{2}(t_i - y_i)^2 = y_i - t_i$ (A15)

Matrix obtained:

$\frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\frac{1}{2}(t - y)^2 = y - t$ (A16)

(2) Partial derivative of the error with respect to the input of the output layer. If the output layer does not use an activation function, then $u_i' = y_i$:

$\frac{\partial E}{\partial u_i'} = \frac{\partial E}{\partial y_i} = y_i - t_i$ (A17)

Matrix obtained:

$\frac{\partial E}{\partial u'} = \frac{\partial E}{\partial y} = y - t$ (A18)

(3) Partial derivative of the error with respect to $W_{hy}$:

$\frac{\partial E}{\partial W_{hy}(i,j)} = \frac{\partial E}{\partial u_i'} \cdot \frac{\partial u_i'}{\partial W_{hy}(i,j)} = (y_i - t_i) \ast \frac{\partial}{\partial W_{hy}(i,j)}\left(\sum_{j=1}^{H} W_{hy}(i,j) \ast h_j\right) = (y_i - t_i) \ast h_j$ (A19)

Matrix obtained:

$\frac{\partial E}{\partial W_{hy}} = \frac{\partial E}{\partial y} \ast H^{T}$ (A20)

(4) Partial derivative of the error with respect to the hidden layer output.

Since $h$ is affected by the following two Equations (A21) and (A22), when calculating the partial derivative of the loss function with respect to the hidden layer output, it is necessary to compute the partial derivatives of $h$ with respect to both formulas.

$h_i^{t+1} = \tanh\left(u_i^t + z_i^t\right), \quad z_i^t = \sum_{j=1}^{H} W_{hh}(i,j) \ast h_j^{t-1}, \quad u_i = \sum_{j=1}^{V} W_{xh}(i,j) \ast x_j$ (A21)

$u_i' = \sum_{j=1}^{H} W_{hy}(i,j) \ast h_j$ (A22)

The matrix form of Equation (A21) is $u^{t+1} = W_{hh} \ast h^t + W_{xh} \ast x$, and $dh_{next}$ is defined as the partial derivative of the error with respect to the hidden layer input at the next moment.

$\frac{\partial E}{\partial h_i} = \sum_{k=1}^{V} \frac{\partial E}{\partial u_k'} \cdot \frac{\partial u_k'}{\partial h_i} = \sum_{k=1}^{V} \frac{\partial E}{\partial y_k} \cdot \frac{\partial u_k'}{\partial h_i} = \sum_{k=1}^{V} (y_k - t_k) \ast \frac{\partial u_k'}{\partial h_i} = \sum_{k=1}^{V} (y_k - t_k) \ast W_{hy}(k,i)$ (A23)

$dh_{next} = \frac{\partial E}{\partial h_i^t} = \sum_{j=1}^{H} \frac{\partial E}{\partial u_j^{t+1}} \cdot \frac{\partial u_j^{t+1}}{\partial h_i^t} = \sum_{j=1}^{H} \frac{\partial E}{\partial u_j^{t+1}} \cdot \frac{\partial}{\partial h_i^t}\left(\sum_{i=1}^{H} W_{hh}(j,i) \ast h_i^t + \sum_{i=1}^{V} W_{xh}(j,i) \ast x_i\right) = \sum_{j=1}^{H} \frac{\partial E}{\partial u_j^{t+1}}\left(W_{hh}(j,i) + 0\right) = \sum_{j=1}^{H} \frac{\partial E}{\partial u_j^{t+1}} W_{hh}(j,i)$ (A24)

Matrix obtained:

$dh_{next} = W_{hh}^{T}\,\frac{\partial E}{\partial u}$ (A25)

$\frac{\partial E}{\partial h} = W_{hy}^{T} \ast \frac{\partial E}{\partial u'} + dh_{next}$ (A26)

$\frac{\partial E}{\partial h} = W_{hy}^{T} \ast \frac{\partial E}{\partial y} + dh_{next}$ (A27)

(5) Partial derivative of the loss function with respect to the hidden layer input:

$\frac{\partial E}{\partial u_i} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial h_i}{\partial u_i} = \frac{\partial E}{\partial h_i} \cdot \frac{\partial}{\partial u_i}\left(\tanh(u_i)\right) = \frac{\partial E}{\partial h_i}\left(1 - \tanh(u_i)^2\right)$ (A28)

Matrix obtained (element-wise product):

$\frac{\partial E}{\partial u} = \frac{\partial E}{\partial h}\left(1 - h \odot h\right)$ (A29)

(6) Partial derivative of the error with respect to $W_{xh}$:

$\frac{\partial E}{\partial W_{xh}(i,j)} = \frac{\partial E}{\partial u_i} \cdot \frac{\partial u_i}{\partial W_{xh}(i,j)} = \frac{\partial E}{\partial u_i} \cdot \frac{\partial}{\partial W_{xh}(i,j)}\left(\sum_{j=1}^{V} W_{xh}(i,j) \ast x_j\right) = \frac{\partial E}{\partial u_i}\, x_j$ (A30)

As known $u_i = \sum_{j=1}^{V} W_{xh}(i,j) \ast x_j$, matrix obtained:

$\frac{\partial E}{\partial W_{xh}} = \frac{\partial E}{\partial u}\, x^{T}$ (A31)

(7) Partial derivative of the error with respect to $W_{hh}(i,j)$:

$\frac{\partial E}{\partial W_{hh}(i,j)} = \frac{\partial E}{\partial u_i^t} \cdot \frac{\partial u_i^t}{\partial W_{hh}(i,j)} = \frac{\partial E}{\partial u_i^t} \cdot \frac{\partial}{\partial W_{hh}(i,j)}\left(\sum_{i=1}^{H} W_{hh}(j,i) \ast h_i^{t-1} + \sum_{i=1}^{V} W_{xh}(j,i) \ast x_i\right) = \frac{\partial E}{\partial u_i^t}\left(\frac{\partial}{\partial W_{hh}(i,j)}\sum_{i=1}^{H} W_{hh}(j,i) \ast h_i^{t-1} + 0\right) = \frac{\partial E}{\partial u_i^t}\, h_i^{t-1}$ (A32)

where $u_i^t = \sum_{i=1}^{H} W_{hh}(j,i) \ast h_i^{t-1} + \sum_{i=1}^{V} W_{xh}(j,i) \ast x_i$, matrix obtained:

$\frac{\partial E}{\partial W_{hh}} = \frac{\partial E}{\partial u} \ast \left(h^{t-1}\right)^{T}$ (A33)

By calculating the partial derivatives of all the parameter matrices with respect to the error, the parameters can be updated according to gradient descent using these partial derivatives.

$\frac{\partial E}{\partial W_{hy}} = \frac{\partial E}{\partial y} \ast H^{T}$ (A34)

$\frac{\partial E}{\partial W_{xh}} = \frac{\partial E}{\partial u}\, x^{T}$ (A35)

$\frac{\partial E}{\partial W_{hh}} = \frac{\partial E}{\partial u} \ast \left(h^{t-1}\right)^{T}$ (A36)

$\frac{\partial E}{\partial u} = \frac{\partial E}{\partial h}\left(1 - h \odot h\right)$ (A37)

$\frac{\partial E}{\partial h} = W_{hy}^{T} \ast \frac{\partial E}{\partial y} + dh_{next}$ (A38)

$dh_{next} = W_{hh}^{T}\,\frac{\partial E}{\partial u}$ (A39)
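Continuing the NumPy forward-pass sketch after Equation (A9), the following function is an illustrative single-step implementation of the gradients summarized in Equations (A34)–(A39) under a quadratic loss; it assumes the W_xh, W_hh and W_hy arrays defined in that earlier sketch.

```python
# Sketch of the gradients in (A34)-(A39) for a single time step of the plain RNN
# sketched earlier; W_xh, W_hh, W_hy come from that forward-pass example.
import numpy as np

def rnn_backward_step(x, h_prev, h, y, t, dh_next):
    """Gradients for one step with the quadratic loss; dh_next is the error
    flowing back from the next time step (zero at the last step)."""
    dE_dy = y - t                          # (A16)/(A18): linear output layer
    dW_hy = np.outer(dE_dy, h)             # (A34)
    dE_dh = W_hy.T @ dE_dy + dh_next       # (A38)
    dE_du = dE_dh * (1.0 - h * h)          # (A37), element-wise because h = tanh(.)
    dW_xh = np.outer(dE_du, x)             # (A35)
    dW_hh = np.outer(dE_du, h_prev)        # (A36)
    dh_prev = W_hh.T @ dE_du               # (A39): becomes dh_next of the previous step
    return dW_xh, dW_hh, dW_hy, dh_prev

# Usage, reusing x, h_prev, h, y from the forward-pass sketch and a placeholder target:
# dW_xh, dW_hh, dW_hy, dh_prev = rnn_backward_step(x, h_prev, h, y, t, np.zeros_like(h))
# W_xh -= 0.01 * dW_xh; W_hh -= 0.01 * dW_hh; W_hy -= 0.01 * dW_hy   # gradient descent
```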

Appendix B. The Process of the Training Method for RNN-LSTM

The forward pass process of the training method is shown in Table A1.

Table A1. Forward pass process.

Process 1: Forward Pass
Input units: $y =$ current external input; roll over: activation $\hat{y}^m = y^m$; cell state $\hat{s}_{c_j^v} = s_{c_j^v}$;
Loop over memory blocks, indexed $j$:
  Step 1a: input gates (1): $net_{in_j} = \sum_m w_{in_j m}\,\hat{y}^m + \sum_{v=1}^{S_j} w_{in_j c_j^v}\,\hat{s}_{c_j^v}$;  $y^{in_j} = f_{in_j}\left(net_{in_j}\right)$;
  Step 1b: forget gate (2): $net_{\varphi_j} = \sum_m w_{\varphi_j m}\,\hat{y}^m + \sum_{v=1}^{S_j} w_{\varphi_j c_j^v}\,\hat{s}_{c_j^v}$;  $y^{\varphi_j} = f_{\varphi_j}\left(net_{\varphi_j}\right)$;
  Step 1c: the cell states (3): loop over the $S_j$ cells in block $j$, indexed $v$: $\{net_{c_j^v} = \sum_m w_{c_j^v m}\,\hat{y}^m$;  $s_{c_j^v} = y^{\varphi_j}\,\hat{s}_{c_j^v} + y^{in_j}\, g\left(net_{c_j^v}\right)\}$;
  Step 2: output gate activation (4): $net_{out_j} = \sum_m w_{out_j m}\,\hat{y}^m + \sum_{v=1}^{S_j} w_{out_j c_j^v}\, s_{c_j^v}$;  $y^{out_j} = f_{out_j}\left(net_{out_j}\right)$;
  Cell outputs (5): loop over the $S_j$ cells in block $j$, indexed $v$: $\{y^{c_j^v} = y^{out_j}\, s_{c_j^v}\}$;
End loop over memory blocks.
Output units (6): $net_k = \sum_m w_{km}\, y^m$;  $y^k = f_k\left(net_k\right)$;
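A minimal NumPy rendering of the forward pass in Table A1 is sketched below for a single memory block with one cell; the choice of the logistic function for the gates, tanh for g(·), and the random weights are assumptions for illustration.

```python
# Illustrative NumPy sketch of the Table A1 forward pass for one memory block with one
# cell and peephole connections; sizes and random weights are placeholders.
import numpy as np

rng = np.random.default_rng(1)
M = 6                                              # size of the input/recurrent vector y_hat
w_in, w_phi, w_out, w_c = (rng.normal(size=M) for _ in range(4))
p_in, p_phi, p_out = rng.normal(size=3)            # peephole weights from the cell state

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def lstm_forward_step(y_hat, s_prev):
    net_in = w_in @ y_hat + p_in * s_prev          # Step 1a: input gate (peephole, old state)
    y_in = sigmoid(net_in)
    net_phi = w_phi @ y_hat + p_phi * s_prev       # Step 1b: forget gate
    y_phi = sigmoid(net_phi)
    net_c = w_c @ y_hat                            # Step 1c: cell input
    s = y_phi * s_prev + y_in * np.tanh(net_c)     # g(.) taken as tanh here (an assumption)
    net_out = w_out @ y_hat + p_out * s            # Step 2: output gate (peephole, new state)
    y_out = sigmoid(net_out)
    y_c = y_out * s                                # cell output, identity squashing as in Table A1
    return y_c, s

y_c, s = lstm_forward_step(rng.normal(size=M), 0.0)
print(y_c, s)
```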


The partial derivatives process of the training method is shown in Table A2.

Table A2. Partial derivatives process.

Process 2: Partial Derivatives
Loop over memory blocks, indexed $j$:
  Loop over the $S_j$ cells in block $j$, indexed $v$:
    Cells, $dS_{cm}^{jv} := \partial s_{c_j^v}/\partial w_{c_j^v m}$:
      $dS_{cm}^{jv} = dS_{cm}^{jv}\, y^{\varphi_j} + g'\left(net_{c_j^v}\right) y^{in_j}\, \hat{y}^m$;
    Input gates, $dS_{in,m}^{jv} := \partial s_{c_j^v}/\partial w_{in_j m}$, $dS_{in,c_j^{v'}}^{jv} := \partial s_{c_j^v}/\partial w_{in_j c_j^{v'}}$:
      $dS_{in,m}^{jv} = dS_{in,m}^{jv}\, y^{\varphi_j} + g\left(net_{c_j^v}\right) f'_{in_j}\left(net_{in_j}\right) \hat{y}^m$;
      Loop over peephole connections from all cells, indexed $v'$:
        $\{dS_{in,c_j^{v'}}^{jv} = dS_{in,c_j^{v'}}^{jv}\, y^{\varphi_j} + g\left(net_{c_j^v}\right) f'_{in_j}\left(net_{in_j}\right) \hat{s}_{c_j^{v'}}\}$;
    Forget gates, $dS_{\varphi m}^{jv} := \partial s_{c_j^v}/\partial w_{\varphi_j m}$, $dS_{\varphi c_j^{v'}}^{jv} := \partial s_{c_j^v}/\partial w_{\varphi_j c_j^{v'}}$:
      $dS_{\varphi m}^{jv} = dS_{\varphi m}^{jv}\, y^{\varphi_j} + \hat{s}_{c_j^v}\, f'_{\varphi_j}\left(net_{\varphi_j}\right) \hat{y}^m$;
      Loop over peephole connections from all cells, indexed $v'$:
        $\{dS_{\varphi c_j^{v'}}^{jv} = dS_{\varphi c_j^{v'}}^{jv}\, y^{\varphi_j} + \hat{s}_{c_j^v}\, f'_{\varphi_j}\left(net_{\varphi_j}\right) \hat{s}_{c_j^{v'}}\}$;
End loops over cells and memory blocks.

The backward pass process of the training method is shown in Table A3.

Table A3. Backward pass process.

Process 3: Backward Pass (If Error Injected)
Errors and $\delta$s. Injected error: $e_k = t^k - y^k$; $\delta$ of output units: $\delta_k = f'_k\left(net_k\right) e_k$;
Loop over memory blocks, indexed $j$:
  $\delta$ of output gates: $\delta_{out_j} = f'_{out_j}\left(net_{out_j}\right)\left(\sum_{v=1}^{S_j} s_{c_j^v} \sum_k w_{k c_j^v}\, \delta_k\right)$;
  Internal state error: loop over the $S_j$ cells in block $j$, indexed $v$: $\{e_{s_{c_j^v}} = y^{out_j} \sum_k w_{k c_j^v}\, \delta_k\}$;
End loop over memory blocks.

The weight updates process of the training method is shown in Table A4.

Table A4. Weight updates process.

Process 4: Weight Updates
Output units: $\Delta w_{km} = \alpha\, \delta_k\, y^m$;
Loop over memory blocks, indexed $j$:
  Output gates: $\Delta w_{out,m} = \alpha\, \delta_{out}\, \hat{y}^m$;  $\Delta w_{out,c_j^v} = \alpha\, \delta_{out}\, s_{c_j^v}$;
  Input gates: $\Delta w_{in,m} = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}\, dS_{in,m}^{jv}$;
    Loop over peephole connections from all cells, indexed $v'$: $\{\Delta w_{in,c_j^{v'}} = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}\, dS_{in,c_j^{v'}}^{jv}\}$;
  Forget gates: $\Delta w_{\varphi m} = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}\, dS_{\varphi m}^{jv}$;
    Loop over peephole connections from all cells, indexed $v'$: $\{\Delta w_{\varphi c_j^{v'}} = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}\, dS_{\varphi c_j^{v'}}^{jv}\}$;
  Cells: loop over the $S_j$ cells in block $j$, indexed $v$: $\{\Delta w_{c_j^v m} = \alpha\, e_{s_{c_j^v}}\, dS_{cm}^{jv}\}$;
End loop over memory blocks.
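To make the notation of Tables A3 and A4 concrete, the sketch below applies the injected error, the deltas and the learning-rate-scaled updates to the output unit and output gate of a one-cell block; all numeric values are placeholders, and a sigmoid output unit is assumed.

```python
# Illustrative, self-contained sketch of the Table A3/A4 updates for the output units and
# the output gate of a one-cell memory block; all values are placeholders, and the output
# unit is taken to be sigmoid so that f'(net) = y(1 - y).
import numpy as np

alpha = 0.1                                    # learning rate
y_hat = np.array([0.2, -0.1, 0.4])             # block input activations
s = 0.35                                       # current cell state
y_out = 0.6                                    # output-gate activation
w_kc = 0.8                                     # weight from the cell output to output unit k
w_km = np.array([0.1, 0.3, -0.2])              # output-unit weights from source units m
y_m = np.array([0.5, 0.1, 0.9])                # activations of those source units
t_k, y_k = 1.0, 0.7                            # target and output of output unit k

# Table A3: injected error and deltas
e_k = t_k - y_k
delta_k = y_k * (1.0 - y_k) * e_k              # f'_k(net_k) * e_k with a sigmoid output unit
delta_out = y_out * (1.0 - y_out) * (s * w_kc * delta_k)

# Table A4: learning-rate-scaled weight updates
w_km += alpha * delta_k * y_m                  # output units
dw_out_m = alpha * delta_out * y_hat           # output gate, from block inputs
dw_out_c = alpha * delta_out * s               # output gate, peephole from the cell state
print(w_km, dw_out_m, dw_out_c)
```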


References

1. Besnard, F.; Bertling, L. An approach for condition-based maintenance optimization applied to wind turbine blades. IEEE Trans. Sustain. Energy 2010, 1, 77–83. [CrossRef]
2. Ahmad, R.; Kamaruddin, S. An overview of time-based and condition-based maintenance in industrial application. Comput. Ind. Eng. 2012, 63, 135–149. [CrossRef]
3. Carita, A.J.Q.; Leita, L.C.; Junior, A.P.P.M.; Godoy, R.B.; Sauer, L. Bayesian networks applied to failure diagnosis in power transformer. IEEE Lat. Am. Trans. 2013, 11, 1075–1082.
4. Su, H.; Li, Q. A hybrid deterministic model based on rough set and fuzzy set and Bayesian optimal classifier. In Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC), Beijing, China, 30 August–1 September 2006; Volume 2, pp. 175–178.
5. Zheng, G.; Yongli, Z. Research of transformer fault diagnosis based on Bayesian network classifiers. In Proceedings of the International Conference Computer Design and Applications (ICCDA), Qinhuangdao, China, 25–27 June 2010; Volume 3, pp. 382–385.
6. Tang, W.H.; Spurgeon, K.; Wu, Q.H.; Richardson, Z.J. An evidential reasoning approach to transformer condition assessments. IEEE Trans. Power Deliv. 2004, 19, 1696–1703. [CrossRef]
7. Li, J.; Chen, X.; Wu, C. Power transformer state assessment based on grey target theory. In Proceedings of the International Conference Measuring Technology and Mechatronics Automation, Zhangjiajie, China, 11–12 April 2009; Volume 2, pp. 664–667.
8. Ma, H.; Ekanayake, C.; Saha, T.K. Power transformer fault diagnosis under measurement originated uncertainties. IEEE Trans. Dielectr. Electr. Insul. 2012, 19, 1982–1990. [CrossRef]
9. Shintemirov, A.; Tang, W.; Wu, Q.H. Power transformer fault classification based on dissolved gas analysis by implementing bootstrap and genetic programming. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2009, 39, 69–79. [CrossRef]
10. Koley, C.; Purkait, P.; Chakravorti, S. Wavelet-aided SVM tool for impulse fault identification in transformers. IEEE Trans. Power Deliv. 2006, 21, 1283–1290. [CrossRef]
11. Miranda, V.; Castro, A.R.G.; Lima, S. Diagnosing faults in power transformers with autoassociative neural networks and mean shift. IEEE Trans. Power Deliv. 2012, 27, 1350–1357. [CrossRef]
12. Wu, H.R.; Li, X.H.; Wu, D.N. RMP neural network based dissolved gas analyzer for fault diagnostic of oil-filled electrical equipment. IEEE Trans. Dielectr. Electr. Insul. 2011, 18, 495–498. [CrossRef]
13. Wang, M.H. A novel extension method for transformer fault diagnosis. IEEE Trans. Power Deliv. 2003, 18, 164–169. [CrossRef]
14. Reshmy, A.K.; Paulraj, D. An efficient unstructured big data analysis method for enhancing performance using machine learning algorithm. In Proceedings of the International Conference on Circuit, Power and Computing Technologies (ICCPCT), Nagercoil, India, 19–20 March 2015; pp. 1–7.
15. Yuanhua, T.; Chaolin, Z.; Yici, M. Semantic presentation and fusion framework of unstructured data in smart cites. In Proceedings of the 10th IEEE Conference on Industrial Electronics and Applications (ICIEA 2015), Auckland, New Zealand, 5–17 June 2015; Volume 10, pp. 897–901.
16. Goth, G. Digging deeper into text mining: Academics and agencies look toward unstructured data. IEEE Internet Comput. 2012, 16, 7–9. [CrossRef]
17. Alifa, N.P.; Saiful, A.; Wikan, D.S. Public facilities recommendation system based on structured and unstructured data extraction from multi-channel data sources. In Proceedings of the International Conference on Data and Software Engineering (ICoDSE), Yogyakarta, Indonesia, 25–26 November 2015; pp. 185–190.
18. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Proceedings of the Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, Marseille, France, 12–18 October 2008.
19. Chen, D.; Cao, X.; Wen, F.; Sun, J. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3025–3032.
20. Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2892–2900.

21. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010; pp. 1045–1048.
22. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. 2013. Available online: arxiv.org/abs/1301.3781 (accessed on 7 September 2013).
23. Hochreiter, S. Untersuchungen zu Dynamischen Neuronalen Netzen. Master's Thesis, Technische Universität München, Munich, Germany, 1991.
24. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks; Kremer, S.C., Kolen, J.F., Eds.; IEEE Press: Piscataway, NJ, USA, 2001.
25. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [CrossRef] [PubMed]
26. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
27. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013.
28. Sak, H.; Senior, A.; Beaufays, F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore, 14–18 September 2014.
29. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. 2015. Available online: arxiv.org/abs/1503.00075 (accessed on 30 May 2015).
30. Li, J.; Luong, M.T.; Jurafsky, D. A Hierarchical Neural Autoencoder for Paragraphs and Documents. 2015. Available online: arxiv.org/abs/1506.01057 (accessed on 6 June 2015).
31. Chanen, A. Deep learning for extracting word-level meaning from safety report narratives. In Proceedings of the Integrated Communications Navigation and Surveillance (ICNS), Herndon, VA, USA, 19–21 April 2016; pp. 5D2-1–5D2-15.
32. Palangi, H.; Deng, L.; Shen, Y. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 694–707. [CrossRef]

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).