Neural conditional random fields Trinh-Minh-Tri Do IDIAP Martigny, Switzerland [email protected]

Thierry Artieres LIP6 - UPMC Paris, France [email protected]

Abstract We propose a non-linear graphical model for structured prediction. It combines the power of deep networks to extract high level features with the graphical framework of Markov networks, yielding a powerful and scalable model that we apply to signal labeling tasks.



This paper considers the structured prediction task, where one wants to build a system that predicts a structured output from a (structured) input. This is a key problem in many application fields such as bioinformatics, part-of-speech tagging, information extraction, and signal (e.g. speech) labelling and recognition. We focus here on signal and sequence labeling tasks for signals such as speech and handwriting. For decades, Hidden Markov Models (HMMs) have been the most popular approach for dealing with sequential data, e.g. for segmentation and classification, although they rely on strong independence assumptions and are learned using Maximum Likelihood Estimation, which is a non-discriminant criterion. This latter point comes from the fact that HMMs are generative models that define a joint probability distribution over the sequence of observations X and the associated label sequence Y. Discriminant systems are usually more powerful than generative models and focus more directly on error rate minimization. Many studies have focused on developing discriminant training for HMMs, e.g. Minimum Classification Error (MCE) (Juang & Katagiri, 1992), Perceptron learning (Collins, 2002), Maximum Mutual Information (MMI) (Woodland & Povey, January 2002) or, more recently, large margin approaches (Sha & Saul, 2007; Do & Artières, 2009). A more direct approach is to design a discriminative graphical model that models the conditional distribution P(Y|X) instead of the joint probability, as generative models do (McCallum et al., 2000; Lafferty, 2001). Conditional random fields (CRFs) are a typical example of this approach. Maximum Margin Markov networks (M3N) (Taskar et al., 2004) go further by focusing on the discriminant function (which is defined as the log of potential functions in a Markov network) and extend the SVM learning algorithm to structured prediction.
While using a completely different learning algorithm, M3N is based on the same graphical modeling as CRFs and can be viewed as an instance of a CRF. Based on log-linear potentials, CRFs have been widely used for sequential data such as natural language or biological sequences (Altun et al., 2003; Sato & Sakakibara, 2005). However, CRFs with log-linear potentials only reach modest performance compared to non-linear models exploiting kernels (Taskar et al., 2004). Although it is possible to use kernels in CRFs (Lafferty et al., 2004), the resulting dense optimal solution generally makes them inefficient in practice, and kernel machines are in any case well known to be less scalable. Besides, in recent years deep neural architectures have been proposed as a relevant solution for extracting high level features from data (Hinton et al., 2006; Bengio et al., 2007). Such models have been successfully applied first to images (Hinton et al., 2006), then to motion capture data (Taylor et al., 2007), text data, etc. In these fields, deep architectures have shown a great capacity to discover and extract relevant features as input to linear discriminant systems.

This work introduces neural conditional random fields, which are a marriage between conditional random fields and (deep) neural networks (NNs). The idea is to rely on deep NNs for learning relevant high level features which may then be used as inputs to a linear CRF. Going further, we propose a global architecture, which we call NeuroCRF, that can be trained globally with a discriminant criterion. Of course, using a deep NN as feature extractor turns learning into a non-convex optimization problem, which prevents relying on efficient convex optimization algorithms with nice guarantees such as the absence of local optima and easier theoretical analysis (e.g. of convergence rate). However, in recent years a number of researchers have pointed out that convexity at any price is not always a good idea: one has to look for an optimal trade-off between modeling flexibility and optimization ease (LeCun et al., 1998; Collobert et al., 2006; Bengio & Lecun, 2007). We first introduce the graphical framework of Markov networks and CRFs for structured prediction in Section 2. Then we present Neural CRFs in Section 3 and their training in Section 4, and report experimental results on optical character recognition and automatic speech recognition.



We start with a brief introduction to Markov networks and structured prediction with CRFs. An important family of graphical models, called Markov networks, relies on a Markov hypothesis. Markov networks are defined with an undirected graph $G = (V, E)$ where each component $Y_i$ of $Y$ is associated with a vertex $v_i \in V$. The Markov hypothesis states that for any $U \subset Y \setminus \{Y_i, Y_j\}$, $Y_i$ is independent of $Y_j$ conditionally on $U$ if and only if every path from $v_i$ to $v_j$ goes through at least one node in $U$. Also, if there is no path between $v_i$ and $v_j$ then $Y_i$ and $Y_j$ are independent. A graphical model may be parameterized using the Hammersley-Clifford theorem (Hammersley & Clifford, 1971), which states that any distribution over $Y$ with conditional dependencies encoded by graph $G$ can be factorized according to the cliques¹ in $G$ as $P(Y) \propto \prod_{c \in C} \psi_c(Y_c)$, where $C$ is the set of cliques in graph $G$, $Y_c$ represents the set of nodes (i.e. variables) in clique $c$, and $\psi_c(Y_c)$ are (positive) potential functions. Structured output prediction aims at building a model that accurately predicts a structured output $y \in \mathcal{Y}$ for any input $x \in \mathcal{X}$. Let us consider the output $Y$ we want to predict as a (set of) random variables whose components are linked by conditional dependencies encoded by a graph $G$ with cliques $c \in C$. Also, let $X$ denote the random variable corresponding to the input (i.e. observation). Then, given $x$, inference consists in finding the output that maximizes the conditional probability² $p(y|x)$. Relying on (Hammersley & Clifford, 1971), CRFs define such a conditional probability through global normalization:

$$p(y|x) = \frac{1}{Z(x)} \prod_{c \in C} \psi_c(x, y_c) \qquad (1)$$



where $Z(x) = \sum_{y \in \mathcal{Y}} \prod_{c \in C} \psi_c(x, y_c)$ is a global normalization factor. A common choice of potential function is the exponential function of an energy, as in Boltzmann machines:

$$\psi_c(x, y_c) = e^{-E_c(x, y_c, w)} \qquad (2)$$


To ease learning, which then reduces to a convex optimization problem, the standard setting is to use energy functions that are linear in the parameter vector $w_c$ and in a joint feature vector $\Phi_c(x, y_c)$, i.e. $E_c(x, y_c, w) = -\langle w_c, \Phi_c(x, y_c) \rangle$. This leads to a log-linear model.
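To make the globally normalized distribution of Eq. (1) with exponential potentials concrete, the following minimal Python sketch computes p(y|x) by explicit enumeration. It assumes a tiny chain-structured graph with local and transition cliques; the function names (`total_energy`, `crf_probabilities`) and the toy energy values are illustrative assumptions, not taken from the paper, and enumeration is only feasible for very small output spaces.

```python
import itertools
import math

def total_energy(y, e_loc, e_tra):
    """Sum of clique energies for a labelling y of a chain (assumed structure)."""
    energy = sum(e_loc[t][y[t]] for t in range(len(y)))
    energy += sum(e_tra[y[t - 1]][y[t]] for t in range(1, len(y)))
    return energy

def crf_probabilities(e_loc, e_tra):
    """Return {y: p(y|x)} by explicit enumeration (exponential cost)."""
    n_labels = len(e_tra)
    length = len(e_loc)
    labellings = list(itertools.product(range(n_labels), repeat=length))
    # The product of potentials psi_c = exp(-E_c) equals exp(-sum of energies).
    scores = [math.exp(-total_energy(y, e_loc, e_tra)) for y in labellings]
    z = sum(scores)  # global normalization factor Z(x)
    return {y: s / z for y, s in zip(labellings, scores)}

# Toy energies (hypothetical values): 3 positions, 2 labels.
e_loc = [[0.0, 1.0], [0.5, 0.2], [1.5, 0.1]]
e_tra = [[0.0, 0.8], [0.8, 0.0]]
probs = crf_probabilities(e_loc, e_tra)
most_probable = max(probs, key=probs.get)
```

The most probable labelling is also the one with the lowest total energy, since the normalization factor is shared by all labellings.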





To overcome the intrinsic limitation of linear energy functions for modeling complex inputs, we propose to learn a NN that extracts relevant features and outputs the energy scores $E_c$ (in Eq. (2)) used as inputs to a CRF. Hence the NN takes an observation as input and outputs several energy outputs parameterized by $w$. The NN has a number of output units equal to the number of cliques times the number of possible realizations of $Y_c$. Energy outputs are computed for each clique $c$ and for each realization of $Y_c$. For instance, if a clique $c$ has two random variables and $Y_c \in L \times L$, then there are $|L|^2$ NN outputs dedicated to clique $c$. The conditional distribution $P(Y|X)$ is then completely defined by the NN's parameters $w$. Figure 1 illustrates an example of a NeuroCRF with a tree structure. There are 3 cliques of size 2 and 4 cliques of size 1 (we focus on cliques on Y nodes and do not count the observation node, since we focus on the distribution of the Y's conditioned on X). The number of energy outputs (i.e. the number of the NN's output units) is then $3|L|^2 + 4|L|$. As may be seen, the output units of the NN (i.e. energy outputs) are grouped by clique (i.e. one group per clique).

¹ A clique is a set of nodes $c \subset V$ that forms a fully connected subgraph.
² We use the notation $p(y|x) = p(Y = y|X = x)$.
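As a quick check of the output count for the tree-structured example, the short sketch below sums one output unit per clique and per joint realization of its labels. The helper name `num_energy_outputs` and the label-set size of 26 are assumptions chosen for illustration.

```python
def num_energy_outputs(clique_sizes, n_labels):
    """One output unit per clique and per joint realization of its labels."""
    return sum(n_labels ** size for size in clique_sizes)

# Tree-structured example: 3 cliques of size 2 and 4 cliques of size 1.
tree_cliques = [2, 2, 2, 1, 1, 1, 1]
n_labels = 26  # hypothetical label-set size
n_outputs = num_energy_outputs(tree_cliques, n_labels)
print(n_outputs)  # equals 3 * 26**2 + 4 * 26
```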







[Figure 1: diagram whose layers are labelled input layer, high-level features, and neural network.]

Figure 1: Example of a tree-structured NeuroCRF (left) and of a chain-structured NeuroCRF (right).

Inference in NeuroCRFs consists in finding the output $\hat{y}$ that best matches input $x$, i.e. the one with lowest energy:

$$\hat{y} = \arg\max_y p(y|x, w) = \arg\min_y \sum_{c \in C} E_c(x, y_c, w) \qquad (3)$$



To do this, one feeds the NN with $x$ and propagates forward to compute all energy outputs $E_c(x, y_c, w)$. Then one uses dynamic programming to find the $\hat{y}$ with lowest energy.

NN architecture. We use feed-forward neural architectures for building NeuroCRFs, building on the work of (Hinton et al., 2006), where deep NNs have been shown to extract relevant high level features in their hidden layers. One can expect the NN to capture higher and higher level information as the number of hidden layers increases. Various architectures may be used. One can use a different NN for each energy function, at the risk of overfitting. Instead, one may share the hidden layers of the neural nets that compute the energy outputs of a single clique (Figure 2-left). Or one may choose to share hidden layers for computing all energy outputs, whatever the clique (Figure 2-right).
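The fully shared variant can be sketched as a plain forward pass: sigmoid hidden layers shared by all cliques, followed by linear output units that are then split into one group per clique. This is only an illustrative sketch; the layer sizes, the random initialization, the helper names (`dense`, `energy_outputs`, `random_layer`) and the choice of one clique of size 1 plus one clique of size 2 are assumptions, not the authors' implementation.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def energy_outputs(x, hidden_layers, output_layer):
    """Propagate x through shared sigmoid hidden layers, then linear outputs."""
    h = x
    for weights, biases in hidden_layers:
        h = [sigmoid(a) for a in dense(h, weights, biases)]
    out_w, out_b = output_layer
    return dense(h, out_w, out_b)  # one energy per (clique, label realization)

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
n_labels = 3
hidden = [random_layer(4, 5, rng), random_layer(5, 5, rng)]
# Outputs: |L| energies for a size-1 clique, |L|^2 for a size-2 clique.
output = random_layer(5, n_labels + n_labels ** 2, rng)
energies = energy_outputs([0.2, -1.0, 0.5, 0.0], hidden, output)

# Group the output units by clique.
e_single = energies[:n_labels]
e_pair = [energies[n_labels + a * n_labels: n_labels + (a + 1) * n_labels]
          for a in range(n_labels)]
```

Using a separate network per energy function would amount to calling `energy_outputs` with distinct `hidden` lists, at the cost of many more parameters.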

[Figure 2: diagram whose layers are labelled input layer and high-level features; the top layer is marked as the CRF-part.]

Figure 2: NN architecture with non shared weights (left) or shared weights (right).

Interestingly, if we consider a neural net with linear output units (i.e. output units with a linear activation function), then a NeuroCRF can be viewed as a standard linear CRF exploiting the high-level representation computed by the neural net (i.e. the set of high level features computed on its last hidden layer). The top part of a NeuroCRF will be called its CRF-part and the remaining part will be called its deep-part (see Figure 2-right).

3.2 Linear Chain NeuroCRFs for sequence labelling

While the NeuroCRF framework we propose is quite general, we focused in our experiments on linear chain NeuroCRFs based on a first-order Markov chain structure (Figure 1-right). This instance allows investigating the potential power of NeuroCRFs on standard sequence labelling tasks. There are two kinds of cliques:

• local cliques $(x, y_t)$ at each position $t$, whose potential functions are denoted $\psi_t(x, y_t)$ and whose corresponding energy functions are denoted $E_{loc}$;
• transition cliques $(x, y_{t-1}, y_t)$ between two successive positions $t-1$ and $t$, whose potential functions are denoted $\psi_{t-1,t}(x, y_{t-1}, y_t)$ and whose energy functions are denoted $E_{tra}$.

In such models it is usual to consider that energy functions are shared between similar cliques at different times (i.e. positions in the graph) (Lafferty, 2001)⁴. The energy functions then take an additional argument specifying the position in the graph, which is time $t$:

$$\psi_t(x, y_t) = e^{-E_{loc}(x, t, y_t, w)} \quad \text{and} \quad \psi_{t-1,t}(x, y_{t-1}, y_t) = e^{-E_{tra}(x, t, y_{t-1}, y_t, w)}$$


The additional argument $t$ allows considering only a part of the input $x$, whose size may vary and therefore cannot be handled by a NN with a fixed-size input. Time is used for building the input to the NN for computing $E_{loc}(x, t, y_t)$: it may consist of $x_t$, the $t$-th element of $x$, only (see Figure 1-right), or it may include a richer temporal context such as $(x_{t-1}, x_t, x_{t+1})$. Finally, the conditional probability of output $y$ given input $x$ is defined as

$$p(y|x, w) = \frac{1}{Z(x)} \exp\Big(-\Big[\sum_{t \geq 1} E_{loc}(x, t, y_t, w) + \sum_{t > 1} E_{tra}(x, t, y_{t-1}, y_t, w)\Big]\Big)$$

with $Z(x)$ being the normalization factor. One can design a NN with only $|L| + |L|^2$ outputs to compute all energy outputs.
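For a linear chain, the lowest-energy labelling can be found with the classical Viterbi recursion. The sketch below is one possible implementation under simplifying assumptions: the energies are already computed and stored in plain lists, and the transition energies do not depend on the position (`e_tra[a][b]` is the energy of moving from label `a` to label `b`). The function name and toy values are hypothetical.

```python
def viterbi_min_energy(e_loc, e_tra):
    """Return (labelling, energy) minimizing the total chain energy."""
    n_labels = len(e_tra)
    # best[b]: lowest energy of any partial labelling ending with label b.
    best = list(e_loc[0])
    back = []
    for t in range(1, len(e_loc)):
        new_best, pointers = [], []
        for b in range(n_labels):
            # Best predecessor label for label b at position t.
            a = min(range(n_labels), key=lambda a: best[a] + e_tra[a][b])
            new_best.append(best[a] + e_tra[a][b] + e_loc[t][b])
            pointers.append(a)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final label.
    last = min(range(n_labels), key=lambda b: best[b])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Toy energies (hypothetical values): 3 positions, 2 labels.
e_loc = [[0.0, 1.0], [0.5, 0.2], [1.5, 0.1]]
e_tra = [[0.0, 0.8], [0.8, 0.0]]
path, energy = viterbi_min_energy(e_loc, e_tra)
print(path)  # [0, 1, 1] for these values
```

The cost is linear in the sequence length and quadratic in the number of labels, in contrast with the exponential cost of enumerating all labellings.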



Let $(x^1, y^1), \ldots, (x^n, y^n) \in \mathcal{X} \times \mathcal{Y}$ be a training set of $n$ input-output pairs. We seek parameters $w$ such that $\forall i: y^i = \arg\max_{y \in \mathcal{Y}} p(y|x^i, w)$. This translates into a general optimization problem:

$$\min_w \; \lambda \Omega(w) + R(w)$$

where $R(w) = \frac{1}{n} \sum_i R_i(w)$ is a data-fitting measure (e.g. empirical risk) and $\Omega(w)$ is a regularization term, with $\lambda$ being a regularization factor that controls the trade-off between a good fit on training data and good generalization. A common choice for $\Omega(w)$ is L2 regularization. We now discuss two criteria for training NeuroCRFs and explain how to optimize them.

4.1


There are many discriminative criteria for training CRFs (and more generally log-linear models); all of them can be used for learning NeuroCRFs as well.

Probabilistic criterion. In the original work of (Lafferty, 2001), estimation of the CRF parameters $w$ was done by maximizing the conditional likelihood (CML), which amounts to minimizing:

$$R_i^{CML}(w) = -\log p(y^i|x^i, w) = \sum_c E_c(x^i, y^i_c, w) + \log \sum_{y \in \mathcal{Y}} \exp\Big[-\sum_c E_c(x^i, y_c, w)\Big]$$
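For a linear chain, the log-normalizer in the conditional likelihood criterion does not require enumerating all labellings: it can be computed with a forward recursion in the log domain. The sketch below assumes position-independent transition energies stored in plain lists; the function names and toy values are illustrative assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(e_loc, e_tra):
    """log Z(x) for a linear chain, by a forward recursion on negated energies."""
    n_labels = len(e_tra)
    alpha = [-e for e in e_loc[0]]
    for t in range(1, len(e_loc)):
        alpha = [log_sum_exp([alpha[a] - e_tra[a][b] for a in range(n_labels)])
                 - e_loc[t][b]
                 for b in range(n_labels)]
    return log_sum_exp(alpha)

def cml_loss(y, e_loc, e_tra):
    """-log p(y|x): energy of the reference labelling plus log Z(x)."""
    energy = sum(e_loc[t][y[t]] for t in range(len(y)))
    energy += sum(e_tra[y[t - 1]][y[t]] for t in range(1, len(y)))
    return energy + log_partition(e_loc, e_tra)

# Toy energies (hypothetical values): 3 positions, 2 labels.
e_loc = [[0.0, 1.0], [0.5, 0.2], [1.5, 0.1]]
e_tra = [[0.0, 0.8], [0.8, 0.0]]
loss = cml_loss([0, 1, 1], e_loc, e_tra)
```

Exponentiating the negated loss of every possible labelling and summing gives 1, which is a convenient sanity check on the recursion.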


Large margin criterion. Large margin methods focus more directly on giving the highest discriminant score to the correct output. In NeuroCRFs, the discriminant function is the negated sum of the energy functions over cliques (see Eq. (3)): $F(x, y, w) = -\sum_{c \in C} E_c(x, y_c, w)$. Large margin training for structured output (Taskar et al., 2004) aims at finding $w$ so that:

$$F(x^i, y^i, w) \geq F(x^i, y, w) + \Delta(y^i, y) \quad \forall y \in \mathcal{Y}$$

where $\Delta(y^i, y)$ allows taking into account differences between labellings (e.g. the Hamming distance between $y^i$ and $y$). We assume a decomposable loss (such as the Hamming distance) such that $\Delta(y^i, y) = \sum_c \delta(y^i_c, y_c)$, so that it can be factorized along the graph structure and integrated in the dynamic programming pass needed to compute $\arg\max_{y \in \mathcal{Y}} p(y|x)$. The elementary loss function of NeuroCRFs is then (writing $\Delta E_c(x^i, y_c, y^i_c, w) = -E_c(x^i, y_c, w) + E_c(x^i, y^i_c, w)$):

$$R_i^{LM}(w) = \max_{y \in \mathcal{Y}} \big[F(x^i, y, w) - F(x^i, y^i, w) + \Delta(y^i, y)\big] = \max_{y \in \mathcal{Y}} \sum_c \big[\Delta E_c(x^i, y_c, y^i_c, w) + \delta(y^i_c, y_c)\big]$$

⁴ These authors consider two sets of parameters, one for local cliques and one for transition cliques.
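With a Hamming loss, the maximization in the large margin criterion can be folded into the usual dynamic programming pass by lowering the local energy of every wrong label by one (loss-augmented decoding). The sketch below illustrates this for a linear chain with position-independent transition energies; the function names and toy values are assumptions made for illustration.

```python
def chain_energy(y, e_loc, e_tra):
    """Total energy of labelling y on a linear chain."""
    energy = sum(e_loc[t][y[t]] for t in range(len(y)))
    return energy + sum(e_tra[y[t - 1]][y[t]] for t in range(1, len(y)))

def viterbi_min_energy(e_loc, e_tra):
    """Dynamic programming: labelling with the lowest total energy."""
    n_labels = len(e_tra)
    best, back = list(e_loc[0]), []
    for t in range(1, len(e_loc)):
        pointers = [min(range(n_labels),
                        key=lambda a, b=b: best[a] + e_tra[a][b])
                    for b in range(n_labels)]
        best = [best[pointers[b]] + e_tra[pointers[b]][b] + e_loc[t][b]
                for b in range(n_labels)]
        back.append(pointers)
    path = [min(range(n_labels), key=lambda b: best[b])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], min(best)

def large_margin_loss(y_ref, e_loc, e_tra):
    """max_y [F(y) - F(y_ref) + Hamming(y_ref, y)] via loss-augmented decoding."""
    # Subtracting 1 from the local energy of every wrong label adds the
    # Hamming loss to the score F = -energy of each competing labelling.
    augmented = [[e - (0.0 if label == y_ref[t] else 1.0)
                  for label, e in enumerate(row)]
                 for t, row in enumerate(e_loc)]
    y_hat, augmented_energy = viterbi_min_energy(augmented, e_tra)
    return chain_energy(y_ref, e_loc, e_tra) - augmented_energy, y_hat

# Toy energies (hypothetical values): 3 positions, 2 labels.
e_loc = [[0.0, 1.0], [0.5, 0.2], [1.5, 0.1]]
e_tra = [[0.0, 0.8], [0.8, 0.0]]
y_ref = [0, 1, 1]
loss, y_hat = large_margin_loss(y_ref, e_loc, e_tra)
```

The loss is never negative, because the reference labelling itself is a candidate and contributes a value of zero; it can be positive even when the reference is the lowest-energy labelling, if some competitor lies within the required margin.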





Due to non-convexity, initialization is a crucial step for NN learning, especially in the case of deep architectures (see (Erhan et al., 2009) for an analysis). Fortunately, unsupervised pretraining algorithms for deep architectures have recently been proposed to tackle this problem, with notable success (Hinton et al., 2006). We detail NN initialization first, then we discuss fine tuning the NeuroCRF.

4.2.1


Initialization of the hidden layers of the NeuroCRF is done incrementally, as has been popularized in recent years for learning deep architectures. In our implementation the deep-part of the NeuroCRF is initialized layer by layer in an unsupervised manner using restricted Boltzmann machines (RBMs), as proposed by Hinton and colleagues (Hinton et al., 2006). Depending on the task, inputs may be real valued or binary valued; this may be handled by slightly different RBMs. We considered both cases in our experiments, while coding (hidden) layers always consist of binary units. Once a cascade of successive RBMs has been trained one at a time, one transforms these cascaded RBMs into a feed-forward NN which implements the deep-part of the NeuroCRF (without output layer). More details about RBMs and how to learn them may be found in (Hinton et al., 2006). Once the deep-part is initialized, the NN is used to compute a high-level representation (i.e. the vector of activations of the last hidden layer) of input samples. The CRF-part may then be initialized by training (in a supervised way) a linear CRF on this high-level coding of the input samples. As we said, such a linear CRF is actually an output layer which is stacked over the deep-part. The union of the weights of the deep-part and of the CRF-part constitutes an initial solution $w_0$, which is next fine tuned using supervised learning.

4.2.2


Fine tuning aims at learning the NeuroCRF parameters globally, starting from an initial and reasonable solution. None of the criteria discussed earlier (Subsection 4.1) is convex, since we naturally consider NNs with non-linear (sigmoid) activation functions in hidden layers. However, provided one can compute an initial and reasonable solution and the gradient of the criterion with respect to the NN weights, one can use any gradient-based optimization method, such as stochastic gradient or a bundle method, to learn the model and reach a (possibly local) minimum. We now show how to compute the gradient with respect to the NN weights. As long as $R_i(w)$ is continuous and there is an efficient method for computing $\frac{\partial R_i(w)}{\partial E_c(x, y_c, w)}$ (this is actually the case for all criteria discussed in the previous section), the (sub)gradient of $R(w)$ with respect to $w$ can be computed with a standard back-propagation procedure. Let $E^i$ be the set of energy outputs corresponding to input $x^i$. Using the chain rule for every $\frac{\partial R_i(w)}{\partial w}$:

$$\frac{\partial R(w)}{\partial w} = \frac{1}{n} \sum_i \frac{\partial R_i(w)}{\partial w} = \frac{1}{n} \sum_i \frac{\partial R_i(w)}{\partial E^i} \frac{\partial E^i}{\partial w} \qquad (9)$$

where $\frac{\partial E^i}{\partial w}$ is the Jacobian matrix of the NN outputs (for input $x^i$) with respect to the weights $w$. Then, by setting $\frac{\partial R_i(w)}{\partial E^i}$ as the back-propagated errors of the NN output units, we can backpropagate and get $\frac{\partial R_i(w)}{\partial w}$ (using the chain rule over hidden layers).
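For the conditional likelihood criterion, the error signal fed to each energy output has a simple closed form: the indicator that the reference labelling uses that clique realization, minus the model's marginal probability of that realization. The sketch below computes these signals for a small linear chain by explicit enumeration, which keeps the code short; a practical implementation would presumably use a forward-backward pass instead. Function names and toy values are assumptions.

```python
import itertools
import math

def chain_energy(y, e_loc, e_tra):
    """Total energy of labelling y on a linear chain."""
    energy = sum(e_loc[t][y[t]] for t in range(len(y)))
    return energy + sum(e_tra[y[t - 1]][y[t]] for t in range(1, len(y)))

def cml_grad_wrt_energies(y_ref, e_loc, e_tra):
    """Derivatives of -log p(y_ref|x) with respect to each energy output."""
    n_labels, length = len(e_tra), len(e_loc)
    labellings = list(itertools.product(range(n_labels), repeat=length))
    weights = [math.exp(-chain_energy(y, e_loc, e_tra)) for y in labellings]
    z = sum(weights)
    g_loc = [[0.0] * n_labels for _ in range(length)]
    g_tra = [[0.0] * n_labels for _ in range(n_labels)]
    # Indicator term: clique realizations used by the reference labelling.
    for t in range(length):
        g_loc[t][y_ref[t]] += 1.0
        if t > 0:
            g_tra[y_ref[t - 1]][y_ref[t]] += 1.0
    # Expectation term: subtract the marginal probability of each realization.
    for y, w in zip(labellings, weights):
        p = w / z
        for t in range(length):
            g_loc[t][y[t]] -= p
            if t > 0:
                g_tra[y[t - 1]][y[t]] -= p
    return g_loc, g_tra

# Toy energies (hypothetical values): 3 positions, 2 labels.
e_loc = [[0.0, 1.0], [0.5, 0.2], [1.5, 0.1]]
e_tra = [[0.0, 0.8], [0.8, 0.0]]
y_ref = [0, 1, 1]
g_loc, g_tra = cml_grad_wrt_energies(y_ref, e_loc, e_tra)
```

These values play the role of the output-layer errors; back-propagating them through the network yields the gradient with respect to the weights. Because transition energies are shared across positions in this sketch, their derivatives accumulate over all positions.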


We performed experiments on two sequence labeling tasks with two well-known datasets. We first investigate the behaviour of NeuroCRFs in a first series of experiments on Optical Character Recognition with the dataset of (Kassel, 1995). Then we report comparative results of NeuroCRFs and state of the art methods on the more complex task of automatic speech recognition on the TIMIT dataset (Lamel et al., 1986). In both cases we replicated the experimental settings of previous works in order to get a fair comparison, building on the settings of (Taskar et al., 2004) for the OCR dataset, and using the standard partitioning of the data and standard preprocessing for TIMIT. We used linear chain NeuroCRFs for both tasks.



The OCR dataset consists of 6876 words, which correspond to roughly 50K characters (Kassel, 1995; Taskar et al., 2004). OCR data are sequences of isolated characters (each represented as a binary vector of dimension 128) belonging to 26 classes. The dataset is divided into 10 folds for cross validation. We investigated two settings: a large training set, obtained by training on 9 folds and testing on 1 fold (the large setting), and a small one, obtained by training on 1 fold and testing on 9 folds (the small setting). We learned NeuroCRFs with one or two hidden layers. Transition energy outputs have only one connection, to a bias unit, meaning that we do not use any input information for building transition energies. When initializing the deep-part, RBMs are learned with 50 iterations through the training set, using 1-step Contrastive Divergence. Figure 3 shows the influence of the NN architecture on NeuroCRF performance. It reports results obtained in the small setting with NeuroCRFs with one or two hidden layers of varying size. As can be seen, increasing the size of the hidden layers improves performance for both one and two hidden layer NeuroCRFs. Also, two hidden layer architectures systematically outperform single hidden layer architectures. Note that, whatever the number of hidden layers, performance reaches a plateau when increasing the hidden layer size. However, the plateau is lower and reached faster for the two hidden layer architecture. These results suggest that increasing both the size and the number of hidden layers significantly improves performance.

[Figure 3 plot: error rate (vertical axis, ticks from 0.08 to 0.14) as a function of the number of hidden units per layer (50 to 500), with one curve for 1 hidden layer and one for 2 hidden layers.]
Figure 3: Influence of NN architecture on OCR dataset (small training set).

We compared state of the art methods (M3N and linear CRF) with two variants of NeuroCRFs, one trained with conditional maximum likelihood (CML), the other trained with a large margin criterion (LM). NeuroCRFs have 2 hidden layers of 200 units each. Table 1 reports cross validation error rates for the small setting and the large setting. The performance of the initial solutions (i.e. before fine tuning NeuroCRFs) is given in brackets. It may be seen that NeuroCRFs significantly outperform all other methods, including M3N with a non-linear kernel (whose results are not reported for the large training set because of scalability issues). Also, the performance of NeuroCRFs before fine tuning shows that initialization by RBMs and CRFs indeed produces a good starting point, but that fine tuning is essential for obtaining optimal performance. Finally, both NeuroCRF training criteria give similar results, with a slight advantage for the conditional likelihood criterion over the large margin criterion. Surprisingly, we observed that the large margin criterion required more iterations than the conditional likelihood criterion. In the following, we only consider NeuroCRFs trained with the conditional likelihood criterion.

Table 1: Comparative error rates of NeuroCRFs and state of the art methods on the OCR dataset with either a small or a large training set. The performance of NeuroCRFs before fine tuning is indicated in brackets. Results of SVM cubic and M3N cubic come from (Taskar et al., 2004).

Method          small              large
CRF linear      0.2162             0.1420
M3N linear      0.2113             0.1346
SVM cubic       0.192              not available
M3N cubic       0.127              not available
NeuroCRF CML    0.1080 (0.1224)    0.0444 (0.0697)
NeuroCRF LM     0.1102 (0.1221)    0.0456 (0.0736)



We performed ASR experiments on the TIMIT dataset (Lamel et al., 1986) with the standard train-test partitioning. Signals were preprocessed as in (Sha & Saul, 2007), except that we do not use whitening by PCA. The 39-dimensional MFCCs are only normalized to have zero mean and unit variance. There are roughly 1.1 million frames in the training set, and 120K and 57K frames in the development and test sets respectively. We used 2-layer NeuroCRFs for ASR, trained with the conditional likelihood criterion. Observations are real valued vectors of MFCC coefficients, while RBMs originally use binary logistic units for both visible and hidden variables. To learn the first layer, we used an extension of RBMs proposed by (Taylor et al., 2007) for dealing with continuous variables. This Gaussian-Binary RBM was trained for 100 passes through the training data of 1.1M frames, using one-step Contrastive Divergence. The second RBM is binary and, since it converges faster, we use only 10 learning iterations through the training data for the second layer. The remainder of the initialization is as in Section 4.2.1.

Table 2: Comparative phone recognition error rates on the TIMIT dataset for discriminant and non-discriminant HMM systems and for two hidden layer NeuroCRFs (of 500 or 1000 hidden units each).

CDHMM           ML      CML     MCE     PT      LM
1 Gaussian      40.1    36.4    35.2    35.6    31.2
2 Gaussians     36.5    34.6    33.2    34.5    30.8
4 Gaussians     34.7    32.8    31.2    32.4    29.8
8 Gaussians     32.7    31.5    31.9    30.9    28.2

NeuroCRF (CML)
500x500         29.6
1000x1000       29.1

Table 2 reports phone error rates for single-state CDHMMs and NeuroCRFs of increasing complexity (number of Gaussians in CDHMMs or number of hidden units in NeuroCRFs). We compared NeuroCRFs with non-discriminant CDHMMs (i.e. Maximum Likelihood) and with state of the art approaches for learning CDHMMs with a discriminant criterion: Maximum Conditional Likelihood (CML) (Woodland & Povey, January 2002), Minimum Classification Error (MCE) (Juang & Katagiri, 1992), Large Margin (LM) (Sha & Saul, 2007) and Perceptron learning (PT) (Cheng et al., 2009) (note that all results come from a compilation in (Sha & Saul, 2007) and from (Cheng et al., 2009)). Note that better results may be achieved by using multiple states per phone HMM or by increasing the Gaussian mixture size, as traditionally done in ASR, or by using more accurate modeling (Deng et al., 2006). Yet, since we implemented "simple" single-state NeuroCRFs, we only compare to similar single-state methods. These comparative results are thus not exhaustive but still allow drawing some interesting conclusions about our approach. First, increasing the hidden layer size improves the NeuroCRF error rate. For lack of time we could not check whether it still improves with larger hidden layers, but one can reasonably expect that even better results may be reached by using larger hidden layers and/or adding hidden layers. Second, NeuroCRFs outperform all other discriminant and non-discriminant methods, except the Large Margin training of (Sha & Saul, 2007), when using up to 8 Gaussian distributions per state. While this may not look like an impressive result at first glance, we claim it is very promising. Indeed, all other systems in Table 2 rely on the learning of a preliminary CDHMM system, which is then used as initialization and/or for regularization. Hence all these systems integrate prior information from decades of research on how to learn and tune a non-discriminant CDHMM for speech. In contrast, NeuroCRFs are trained completely from scratch, with an unsupervised initialization and a supervised fine tuning; they require no prior information.



We presented a model combining CRFs and deep neural networks, aiming at taking advantage of both the feature extraction ability of deep networks and the discriminant power of CRFs. We detailed the learning strategy and reported experimental results on two sequence classification tasks. Results on OCR data show significant improvement over state of the art methods and demonstrate the relevance of the combination. Moreover, our systems outperform most state of the art discriminant systems on speech recognition while using absolutely no prior, in contrast to all other systems, which rely on an initial solution obtained with a non-discriminant criterion.

