Segmentation Conditional Random Fields (SCRFs): A New Approach for Protein Fold Recognition

Yan Liu1, Jaime Carbonell1, Peter Weigele2, and Vanathi Gopalakrishnan3

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. {yanliu, jgc}@cs.cmu.edu
2 Biology Department, Massachusetts Institute of Technology, Cambridge, MA, USA. [email protected]
3 Center for Biomedical Informatics, University of Pittsburgh, PA, USA. [email protected]
Abstract. Protein fold recognition is an important step towards understanding protein three-dimensional structures and their functions. A conditional graphical model, i.e. segmentation conditional random fields (SCRFs), is proposed to solve the problem. In contrast to traditional graphical models such as the hidden Markov model (HMM), SCRFs follow a discriminative approach, offering the flexibility to include overlapping or long-range interaction features over the whole sequence, as well as globally optimal solutions for the parameters. Moreover, the segmentation setting in SCRFs makes the graphical structure intuitively similar to the protein 3-D structure and, more importantly, provides a framework to model long-range interactions directly. Our model is applied to predict the parallel β-helix fold, an important fold in bacterial infection of plants and binding of antigens. Cross-family validation shows that SCRFs not only score all known β-helices higher than non-β-helices in the Protein Data Bank, but also demonstrate more success in locating each rung in the known β-helix proteins than BetaWrap, a state-of-the-art algorithm for predicting the β-helix fold, and HMMER, a general motif detection algorithm based on HMMs. Applying our prediction model to the Uniprot database, we hypothesize previously unknown β-helices.
1 Introduction
It is believed that protein structures reveal important information about protein functions. One key step towards modeling a tertiary structure is to identify how secondary structures, as building blocks, arrange themselves in space, i.e. the super-secondary structures or protein folds. There has been significant work on predicting some well-defined types of structural motifs or functional units, such as αα- and ββ-hairpins [1–4]. The task of protein fold recognition is the following: given a protein sequence and a particular fold or super-secondary structure, predict whether the protein contains the structural fold and, if so, locate its exact position in the sequence.
The traditional approach to protein fold prediction is to search the database using PSI-BLAST [5] or to match against an HMM profile built by HMMER [4] or SAM [3] from sequences with the same fold. These approaches work well for short motifs with strong sequence similarity. However, many important motifs or folds lack clear sequence similarity and involve long-range interactions, such as folds in the β class [6]. These cases necessitate a more powerful model that can capture the structural characteristics of the protein fold. Interestingly, the protein fold recognition task parallels an emerging trend in the machine learning community, namely the structured prediction problem: predicting the labels of each node in a graph given observations with particular structures, for example webpage classification using the hyperlink graph or object recognition using grids of image pixels. Conditional graphical models have proven to be among the most effective tools for this kind of problem [7, 8]. In fact, several graphical models have been applied to protein structure prediction. One of the early approaches applied simple hidden Markov models (HMMs) to protein secondary structure prediction and protein motif detection [3, 4, 9]; Delcher et al. introduced probabilistic causal networks for protein secondary structure modeling [10]. Recently, Liu et al. applied conditional random fields (CRFs), a discriminative graphical model based on undirected graphs, to protein secondary structure prediction [11]; Chu et al. extended the segmental semi-Markov model (SSMM) under a Bayesian framework for protein secondary structures [12].

The bottleneck for protein fold prediction is the long-range interactions, which could be either two β-strands with hydrogen bonds in a parallel β-sheet or helix pairs in coupled helical motifs. Generative models, such as HMMs or SSMMs, assume a particular generating process, which makes it difficult to consider overlapping features and long-range interactions. Discriminative graphical models, such as CRFs, assume a single residue as an observation; thus they fail to capture features over a whole secondary structure element or the interactions between elements adjacent in 3-D, which may be distant in the primary sequence. To solve this problem, we propose segmentation conditional random fields (SCRFs), which retain all the advantages of the original CRFs while handling observations of variable length.
2 Conditional Random Fields (CRFs)
Simple graphical chain models, such as hidden Markov models (HMMs), have been applied to various problems. As "generative" models, HMMs assume that the data are generated by a particular model and compute the joint distribution of the observation sequence x and state sequence y, i.e. P(x, y). However, generative models may perform poorly when their assumptions are inappropriate. In contrast, discriminative models, such as neural networks and support vector machines (SVMs), estimate the decision boundary directly without computing the underlying data distribution and thus often achieve better performance.
Recently, several discriminative graphical models have been proposed by the machine learning community, such as Maximum Entropy Markov Models (MEMMs) [13] and Conditional Random Fields (CRFs) [14]. Among these models, CRFs, proposed by Lafferty et al., are very effective in many applications, including information extraction, image processing and so on [8, 7]. CRFs are "undirected" graphical models (also known as random fields, as opposed to directed graphical models such as HMMs) that compute the conditional likelihood P(y|x) directly. By the Hammersley-Clifford theorem [15], the conditional probability P(y|x) is proportional to the product of the potential functions over all the cliques in the graph,

\[ P(y|x) = \frac{1}{Z_0} \prod_{c \in C(y,x)} \Phi_c(y_c, x_c), \]
where \(\Phi_c(y_c, x_c)\) is the potential function over the clique c, and \(Z_0\) is the normalization factor over all possible assignments of y (see [16] for more detail). For a chain structure, CRFs define the conditional probability as

\[ P(y|x) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \Big), \qquad (1) \]
where \(f_k\) is an arbitrary feature function over x, N is the number of observations, and K is the number of features. The model parameters \(\lambda_k\) are learned by maximizing the conditional likelihood of the training data.

CRFs define the clique potential as an exponential function, which results in a series of nice properties. First, the conditional likelihood function is convex, so finding the global optimum is guaranteed [14]. Second, the feature functions can be arbitrary, including overlapping features and long-range interactions. Finally, CRFs still admit efficient algorithms, such as forward-backward and Viterbi, as long as the graph structures are sequences or trees.

Similar to HMMs, we can define the forward-backward probabilities for CRFs. For a chain structure, the "forward value" \(\alpha_i(y)\) is defined as the probability of being in state y at time i given the observations up to i. The recursive step is:

\[ \alpha_{i+1}(y) = \sum_{y'} \alpha_i(y') \exp\Big( \sum_k \lambda_k f_k(y', y, x, i+1) \Big). \]
Similarly, \(\beta_i(y')\) is the probability of starting from state y' at time i given the observation sequence after time i. The recursive step is:

\[ \beta_i(y') = \sum_{y} \exp\Big( \sum_k \lambda_k f_k(y', y, x, i+1) \Big) \beta_{i+1}(y). \]
The forward-backward and Viterbi algorithms can be derived accordingly [17].
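To make these recursions concrete, here is a minimal NumPy sketch of the forward-backward computation for a linear-chain CRF. It assumes the feature functions have been collapsed into a precomputed score tensor scores[i, a, b] = Σ_k λ_k f_k(y_i = a, y_{i+1} = b, x, i+1); this encoding and the function name are illustrative, not from the paper, and the recursions run in log space for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(scores):
    """Forward-backward for a linear-chain CRF.

    scores[i, a, b] = sum_k lambda_k * f_k(y_i=a, y_{i+1}=b, x, i+1)
    for i = 0..N-2 (shape: (N-1, S, S)).
    Returns log alpha, log beta, and the log normalizer log Z0.
    """
    n_trans, n_states, _ = scores.shape
    log_alpha = np.zeros((n_trans + 1, n_states))
    log_beta = np.zeros((n_trans + 1, n_states))

    # Forward: alpha_{i+1}(y) = sum_{y'} alpha_i(y') * exp(score(y', y))
    for i in range(n_trans):
        log_alpha[i + 1] = logsumexp(log_alpha[i][:, None] + scores[i], axis=0)

    # Backward: beta_i(y') = sum_y exp(score(y', y)) * beta_{i+1}(y)
    for i in range(n_trans - 1, -1, -1):
        log_beta[i] = logsumexp(scores[i] + log_beta[i + 1][None, :], axis=1)

    log_z0 = logsumexp(log_alpha[-1])  # normalization factor Z0
    return log_alpha, log_beta, log_z0
```

The posterior marginal of state y at position i then follows as exp(log_alpha[i, y] + log_beta[i, y] − log_z0), and replacing the logsumexp in the forward pass with a max yields the Viterbi recursion.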
3 Segmentation Conditional Random Fields (SCRFs)
Protein folds are frequent arrangement patterns of several secondary structure elements: some elements are quite conserved or prefer a specific length, while others might form hydrogen bonds with each other, such as two β-strands in a parallel β-sheet. To model the protein fold better, it is natural to treat each secondary structure element as one observation (or node), with edges between elements indicating their interactions in 3-D. Then, given a protein sequence, we can search for the best segmentation defined by the graph and determine whether the protein has the fold.

Fig. 1. Graph structure of the β-α-β motif. (A) 3-D structure. (B) Protein structure graph: nodes: green = β-strand, yellow = α-helix, cyan = coil, white = non-β-α-β (I-node); edges: E1 = {black edges} and E2 = {red edges}.

3.1 Protein Structural Graph
Before covering the algorithm in detail, we first introduce a special kind of graph, called a protein structural graph. Given a protein fold, a structural graph is defined as G = ⟨V, E1, E2⟩, where V = U ∪ {I}, U is the set of nodes corresponding to the secondary structure elements within the fold, and I is the node representing the elements outside the fold. E1 is the set of edges between neighboring elements in the primary sequence, and E2 is the set of edges indicating potential long-range interactions between elements in the tertiary structure. Figure 1 shows an example of the structural graph for the β-α-β motif. Notice that there is a clear distinction between edges in E1 and those in E2 in terms of probabilistic semantics: similar to HMMs, the E1 edges indicate transitions of states between adjacent nodes, whereas the E2 edges model the long-range interactions that are unique to the structural graph.

In practice, one protein fold might correspond to several reasonable structural graphs given different semantics for one node. There is always a tradeoff among graph complexity, fidelity of the model, and the real computational cost. Therefore a good graph is the most expressive one that captures the properties of the protein fold while retaining as much simplicity as possible. There are several ways to simplify the graph: for example, we can combine multiple nodes with similar properties into one, or remove those E2 edges that are less important or less interesting to us. We give a concrete example for the β-helix fold in Section 4.
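As a concrete illustration of this definition, the sketch below encodes a structural graph G = ⟨V, E1, E2⟩ for the β-α-β motif of Figure 1. The node names and their secondary-structure roles are an assumed reading of the figure, not a specification from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class StructuralGraph:
    """Protein structural graph G = <V, E1, E2>.

    nodes: the elements U of the fold plus the special I-node for
           everything outside the fold.
    e1:    edges between sequence-adjacent elements (state transitions).
    e2:    edges marking long-range interactions in tertiary structure.
    """
    nodes: list = field(default_factory=list)
    e1: list = field(default_factory=list)
    e2: list = field(default_factory=list)

# Assumed encoding of the beta-alpha-beta motif: two beta-strands
# (S2, S4) joined through an alpha-helix (S3), flanking coils, and
# one E2 edge for the hydrogen-bonded strand pair in the beta-sheet.
bab_graph = StructuralGraph(
    nodes=["I", "S1", "S2", "S3", "S4", "S5", "S6"],
    e1=[("I", "S1"), ("S1", "S2"), ("S2", "S3"), ("S3", "S4"),
        ("S4", "S5"), ("S5", "S6"), ("S6", "I")],
    e2=[("S2", "S4")],
)
```

A graph like this is all the model needs as prior knowledge of the fold: the SCRF machinery of the next subsection searches over segmentations of the sequence that are consistent with it.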
3.2 Segmentation Conditional Random Fields
Since a protein fold is a regular arrangement of its secondary structure elements, the general topology is often known a priori, and we can easily define a structural graph with deterministic transitions between adjacent nodes. Therefore it is not necessary to consider the effect of E1 edges in the model explicitly. In the following discussion, we focus on this simplified but common case. Consider the graph G' = ⟨V, E2⟩. Given a protein sequence x = x_1 x_2 ... x_N, a possible segmentation of the sequence is S = (S_1, S_2, ..., S_M), where M is the number of segments and S_i = ⟨p_i, q_i, y_i⟩ with a starting position p_i, an end position q_i, and a segment label y_i. The conditional probability of a segmentation S given the observation x can be computed as follows:

\[ P(S|x) = \frac{1}{Z_0} \prod_{c \in G'(S,x)} \exp\Big( \sum_k \lambda_k f_k(x_c, S_c) \Big), \]

where \(Z_0\) is a normalization factor. If each subgraph of G' is a chain or a tree (an isolated node can also be seen as a chain), then we have

\[ P(S|x) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, S_i, S'_{i-1}) \Big), \qquad (2) \]

where \(S'_{i-1}\) is the direct forward neighbor of \(S_i\) in graph G'.
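For intuition, the sketch below scores one candidate segmentation under equation (2): each segment S_i contributes Σ_k λ_k f_k(x, S_i, S'_{i-1}), with the neighbor taken as the sequence predecessor (the chain case of G'). The segment encoding and the example feature are placeholders, not the features used in the paper.

```python
def segmentation_log_score(x, segments, feature_fns, lambdas):
    """Unnormalized log-score of a segmentation S = (S_1, ..., S_M),
    where each segment is a tuple (p, q, y): start, end, state label.
    Equation (2) gives P(S|x) = exp(score) / Z0."""
    total = 0.0
    for i, seg in enumerate(segments):
        prev = segments[i - 1] if i > 0 else None  # direct forward neighbor
        total += sum(lam * f(x, seg, prev)
                     for lam, f in zip(lambdas, feature_fns))
    return total

# A toy feature: 1 if a strand segment has a plausible length, else 0.
def f_strand_length(x, seg, prev):
    p, q, y = seg
    return 1.0 if y == "strand" and 3 <= q - p + 1 <= 10 else 0.0
```

Computing Z_0 itself requires summing over all possible segmentations, which is exactly what the forward recursion at the end of this section makes tractable.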
We estimate the parameters \(\lambda_k\) by maximizing the conditional log-likelihood of the training data:

\[ L_\Lambda = \sum_{i=1}^{M} \sum_{k=1}^{K} \lambda_k f_k(x, s_i, s'_{i-1}) - \log Z_0 - \sum_k \frac{\lambda_k^2}{2\sigma^2}, \]

where the last term is a Gaussian prior over the parameters, a smoothing term that copes with the sparsity problem in the training data. To perform the optimization, we seek the zero of the first derivative, i.e.

\[ \frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i=1}^{M} \Big( f_k(x, s_i, s'_{i-1}) - E_{P(S|x)}[f_k(x, S_i, S'_{i-1})] \Big) - \frac{\lambda_k}{\sigma^2}, \qquad (3) \]

where \(E_{P(S|x)}[f_k(x, S_i, S'_{i-1})]\) is the expectation of the feature \(f_k(x, S_i, S'_{i-1})\) over all possible segmentations of x. The convexity property guarantees that the root corresponds to the optimal solution. However, since there is no closed-form solution to (3), finding the optimum is not straightforward. Recent work on iterative search algorithms for CRFs suggests that L-BFGS converges much faster than other commonly used methods, such as iterative scaling or conjugate gradient [17], which is also confirmed in our experiments.
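In practice the optimization can be delegated to an off-the-shelf L-BFGS routine. The sketch below assumes hypothetical helpers log_likelihood and gradient implementing L_Λ and equation (3), with the feature expectations computed via the forward recursions described next; only scipy's minimize is a real API here.

```python
import numpy as np
from scipy.optimize import minimize

def train_scrf(train_data, feature_fns, sigma2=1.0):
    """Fit the weights lambda by maximizing the penalized conditional
    log-likelihood L_Lambda (we minimize its negation)."""
    k = len(feature_fns)
    result = minimize(
        fun=lambda lam: -log_likelihood(lam, train_data, feature_fns, sigma2),
        jac=lambda lam: -gradient(lam, train_data, feature_fns, sigma2),
        x0=np.zeros(k),
        method="L-BFGS-B",
    )
    return result.x
```

Because L_Λ is concave, any starting point x0 reaches the same global optimum.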
Similar to CRFs, we still have an efficient inference algorithm as long as each subgraph of G' is a chain and the nodes joined by E2 edges cover a fixed number of residues. We redefine the forward probability \(\alpha(r, y_r)\) as the conditional probability that a segment of state \(y_r\) ends at position r, given the observations \(x_{l+1} \ldots x_r\) and a segment of state \(y_l\) ending at position l. The recursive step can be written as:

\[ \alpha(r, y_r) = \sum_{p,\, p',\, q'} \alpha(q', y') \, \alpha'(p-1, \overleftarrow{y}) \exp\Big( \sum_k \lambda_k f_k(x, s, s') \Big), \]

where \(s = \langle p, r, y_r \rangle\) is the segment ending at r and \(s' = \langle p', q', y' \rangle\) is its direct forward neighbor in G'.

[Figure: illustration of the computation of the forward probabilities α(r, S3) and α(r, S4).]
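For the simplest case, where G' has no E2 edges, this recursion reduces to a standard semi-Markov forward pass; the sketch below implements that special case, with segment lengths bounded by max_len. seg_score is an assumed callback returning Σ_k λ_k f_k for one segment and its predecessor's state; none of these names come from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def segment_forward(x, states, seg_score, max_len):
    """log_alpha[r][y]: log of the total unnormalized score of all
    segmentations of x[0:r] whose last segment has state y.
    seg_score(x, p, q, y, y_prev) scores segment (p, q, y) given the
    state y_prev of its direct forward neighbor (None at the start)."""
    n = len(x)
    log_alpha = [dict() for _ in range(n + 1)]
    log_alpha[0][None] = 0.0  # empty prefix
    for r in range(1, n + 1):
        for y in states:
            terms = [prev_score + seg_score(x, p, r - 1, y, y_prev)
                     for p in range(max(0, r - max_len), r)  # segment x[p:r]
                     for y_prev, prev_score in log_alpha[p].items()]
            log_alpha[r][y] = logsumexp(terms) if terms else -np.inf
    return logsumexp(list(log_alpha[n].values()))  # log Z0
```

Each position considers O(max_len · |states|²) combinations, so the sum over the exponentially many segmentations is computed in polynomial time; handling E2 edges additionally conditions the recursion on the interacting segment, which is why those segments must cover a fixed number of residues.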