
A Probabilistic Constraint-based Approach to Protein Structure Prediction

Neng-Fa Zhou, Amotz Bar-Noy
Department of Computer and Information Science, CUNY Brooklyn College

Ronald A. Eckhardt
Department of Biology, CUNY Brooklyn College

1 Background and Objectives

Protein folding and protein structure prediction. Protein folding is the process by which a protein assumes its functional shape or conformation. It is believed that the dynamic folding of a protein into its native conformation is determined by the protein's amino acid sequence. Because known protein structures are valuable in tasks such as rational drug design, protein structure prediction is a highly active field of research in bioinformatics [2, 8]. In recent decades, methods have been developed along two main lines [2]. The first, called homology modeling, assembles protein structures from structural fragments of similar sequences available in sources such as the PDB [3]. The second, called ab initio methods, predicts the structure from the sequence alone, without relying on fold-level similarity between the modeled sequence and any known structure. The folding of any particular protein is an extremely complex process, and simulating the folding of even a small protein remains an insurmountable challenge for state-of-the-art computers. Many ab initio methods therefore adopt reduced models [16], such as lattice models, which lower the complexity by reducing the number of degrees of freedom. Reduced models cannot be expected to generate consistently accurate predictions. Nevertheless, their low-resolution results can be used to narrow the possible conformations from an exponentially large number to a number small enough for more computationally expensive methods to be applied.
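To make the phrase "exponentially large number" concrete, the following is a minimal sketch that counts the conformations of a chain on a 2D square lattice, one simple kind of reduced model. The choice of lattice and the name count_self_avoiding_walks are assumptions made for illustration; they are not taken from any of the cited systems.

```python
def count_self_avoiding_walks(n_steps):
    """Count walks of n_steps unit moves on the 2D square lattice
    that never visit the same lattice point twice."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, n_steps)


if __name__ == "__main__":
    # A chain of N residues corresponds to a walk of N - 1 steps.
    for n in range(1, 9):
        print(n, count_self_avoiding_walks(n))
```

Even in this toy setting the count grows by a roughly constant factor with every added residue, so exhaustive enumeration quickly becomes impractical and some way of pruning or guiding the search is needed.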

The constraint-based approach to protein structure prediction. Constraint programming [7, 27] is a declarative programming paradigm for describing and solving combinatorial optimization problems. Constraint languages allow not only high-level modeling but also efficient solving of problems. Recent implementations are amenable to large-scale problems thanks to constraint-solving algorithms such as propagation algorithms for finite-domain constraints [19], sophisticated heuristics [12], and advanced compilation techniques [6, 31]. Constraint programming has found its way into many application areas, such as design, scheduling, configuration, bioinformatics, and graphical user interface design [4, 10, 13, 11, 28, 30]. Several constraint programming systems are now available. The PI has designed and implemented an event-handling language called AR (Action Rules) and has developed a state-of-the-art finite-domain constraint solver in AR, available with B-Prolog [31] (footnote 1). Benchmarking shows that the solver in B-Prolog is 5 times as fast as the one in SICStus Prolog and 6 times as fast as the one in ECLiPSe (footnote 2).

Footnote 1: www.probp.com.
Footnote 2: See www.probp.com/benchmark_clpfd.htm for benchmark results.

Several attempts have been made to apply constraint programming to protein structure prediction [1, 18, 21]. All of these systems adopt lattice models for proteins and use simplified energy functions in the optimization. There are different ways to map lattice models for proteins into CSP (Constraint Satisfaction Problem) models. A straightforward way is to treat amino acids, or residues, as variables and lattice positions as the domains of the variables. The only constraints enforced in this model are neighborhood constraints, which ensure that every pair of residues adjacent in the sequence is placed in lattice positions that are next to each other. Auxiliary constraints, such as symmetry-breaking constraints, are used to enhance performance. The constraint-based approach has produced some exciting results. For example, Backofen and Will report in [1] that their program outperforms all other approaches for lattice HP models and is able to find optimal structures and prove their optimality for proteins of up to 200 residues. In [22], a goal is set to handle proteins of lengths up to 500.
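The CSP mapping described above can be sketched in a few lines of Python. This is an illustrative toy, not the encoding used in [1], [18], or [21]. It assumes a 2D square lattice, the common HP energy function that scores -1 for each pair of H residues that touch on the lattice without being adjacent in the sequence, and an explicit all-different check so that no two residues share a position. The names satisfies_constraints, hp_energy, and best_fold are hypothetical.

```python
from itertools import combinations

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def adjacent(p, q):
    """True if lattice points p and q are one unit apart."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1


def satisfies_constraints(positions):
    """Neighborhood constraints plus the assumed all-different check."""
    all_different = len(set(positions)) == len(positions)
    chain_connected = all(adjacent(p, q) for p, q in zip(positions, positions[1:]))
    return all_different and chain_connected


def hp_energy(sequence, positions):
    """Assumed HP energy: -1 per H-H lattice contact between
    residues that are not consecutive in the sequence."""
    energy = 0
    for i, j in combinations(range(len(sequence)), 2):
        if (j - i > 1 and sequence[i] == "H" and sequence[j] == "H"
                and adjacent(positions[i], positions[j])):
            energy -= 1
    return energy


def best_fold(sequence):
    """Exhaustive search over lattice placements; only feasible for short chains."""
    if not sequence:
        return 0, []
    best_energy, best_positions = None, None

    def place(positions):
        nonlocal best_energy, best_positions
        if len(positions) == len(sequence):
            energy = hp_energy(sequence, positions)
            if best_energy is None or energy < best_energy:
                best_energy, best_positions = energy, list(positions)
            return
        x, y = positions[-1]
        for dx, dy in MOVES:
            candidate = (x + dx, y + dy)
            if candidate not in positions:  # keep positions distinct
                positions.append(candidate)
                place(positions)
                positions.pop()

    place([(0, 0)])
    return best_energy, best_positions


if __name__ == "__main__":
    energy, fold = best_fold("HHPPHH")
    print(energy, fold, satisfies_constraints(fold))
```

The search here simply enumerates every placement. A constraint solver would instead propagate the neighborhood constraints to shrink the domains before and during search, which is what makes chains of realistic length tractable.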

Logic-based probabilistic learning. The past few years have witnessed tremendous interest in logic-based probabilistic learning, as evidenced by an increasing number of formalisms and systems (e.g., BLP [15], CLP(BN) [9], PRM [17], PRISM [25], and SLP [20]). Logic-based probabilistic learning is a multidisciplinary research area that integrates relational or logic formalisms, probabilistic reasoning mechanisms, and machine learning and data mining principles. It has found its way into many application areas, including bioinformatics, diagnosis and troubleshooting, stochastic language processing, information retrieval, linkage analysis and discovery, and robot control. The PI has been involved in the development of PRISM (footnote 3), an extension of B-Prolog that integrates logic programming, probabilistic reasoning, and EM learning [25, 26, 32]. PRISM allows for the description of independent probabilistic choices and their consequences in general logic programs. PRISM also supports parameter learning: for a given set of possibly incomplete observed data, PRISM can estimate the probability distributions that best explain the data. PRISM is suitable for applications such as learning the parameters of stochastic grammars, training stochastic models for gene sequence analysis, user modeling, and obtaining probabilistic information for tuning system performance. PRISM offers greater flexibility than specific statistical tools such as Hidden Markov Models (HMMs) [24], Probabilistic Context-Free Grammars (PCFGs) [29], and discrete Bayesian networks [5, 23]. Thanks to an efficient tabling system [33] and other optimization techniques, the latest version of PRISM is able to handle large volumes of data. For example, a natural language application written in PRISM is able to train probabilistic models using corpora of tens of thousands of sentences.

There is increasing interest in using machine learning techniques to learn heuristics for search. For CSPs, the heuristics used to order variables and values can have a dramatic effect on performance. Various strategies have been proposed (e.g., the first-fail principle for ordering variables [19] and the min-conflicts strategy for ordering values [14]). Nevertheless, for many problems users still have to experiment with different heuristics and tune them manually, a process that can be tedious and painful.
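As an illustration of where such heuristics plug into a solver, here is a minimal backtracking search with forward checking in which the variable-ordering and value-ordering rules are passed in as functions. It is a sketch under stated assumptions rather than the solver in B-Prolog: the names solve, first_fail, and queens_consistent are hypothetical, and the 8-queens puzzle is used only as a small stand-in problem.

```python
def solve(domains, consistent, select_var, order_values=sorted):
    """Backtracking search with forward checking.

    domains:      dict mapping each variable to a set of candidate values
    consistent:   function (var1, val1, var2, val2) -> bool
    select_var:   variable-ordering heuristic
    order_values: value-ordering heuristic
    """
    def search(assignment, current):
        if len(assignment) == len(current):
            return dict(assignment)
        unassigned = [v for v in current if v not in assignment]
        var = select_var(unassigned, current)
        for val in order_values(current[var]):
            pruned = {}
            wiped_out = False
            for other in unassigned:
                if other == var:
                    continue
                keep = {w for w in current[other] if consistent(var, val, other, w)}
                if not keep:  # a domain became empty: this value cannot work
                    wiped_out = True
                    break
                pruned[other] = keep
            if wiped_out:
                continue
            next_domains = {**current, **pruned, var: {val}}
            assignment[var] = val
            result = search(assignment, next_domains)
            if result is not None:
                return result
            del assignment[var]
        return None

    return search({}, domains)


def first_fail(unassigned, domains):
    """First-fail principle: pick the variable with the smallest domain."""
    return min(unassigned, key=lambda v: len(domains[v]))


def queens_consistent(row1, col1, row2, col2):
    """Two queens in different rows must not share a column or a diagonal."""
    return col1 != col2 and abs(col1 - col2) != abs(row1 - row2)


if __name__ == "__main__":
    n = 8
    start = {row: set(range(n)) for row in range(n)}
    print(solve(start, queens_consistent, first_fail))
```

Swapping first_fail or sorted for another function changes the search order without touching the solver itself, which is the hook a learned heuristic would use.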

Footnote 3: Available from http://mi.cs.titech.ac.jp/prism/.

Objectives. We propose a probabilistic constraint-based approach to protein structure prediction. Our approach is based on the reasonable assumption that two similar proteins will have similar structures. In this project we will develop probabilistic models and train them using proteins in the PDB with known structures. These probabilistic models will then be used to predict structures for other proteins. The number of proteins with known three-dimensional conformations in the PDB is on the order of 16,000 and is increasing rapidly, while the number of conformation classes has remained about 500 for some time and is not expected to grow beyond 1000 (footnote 4). It is expected that a trained probabilistic model can guide the search, leading to a better lower bound and thus a dramatic reduction of the search space. This project also aims to improve the constraint-based methods used in existing systems. Concretely, we propose the following: (1) tailor our solver to the protein structure prediction problem; (2) design and implement new global constraints and propagation algorithms for the problem; and (3) investigate the effectiveness of other constraint-solving techniques, such as symmetry-breaking and dual-modeling techniques, for this problem.
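One simple way a trained model could guide the search is by ordering the candidate values tried at each step. The proposal does not specify the probabilistic model, so the sketch below is purely an assumption: it encodes a lattice fold as a string of moves (R, L, U, D), estimates first-order transition probabilities with add-one smoothing, and ranks candidate moves by those probabilities. The training strings are made-up toy inputs, not PDB data, and the names train_move_model and order_moves are hypothetical.

```python
from collections import Counter, defaultdict

MOVES = "RLUD"  # assumed encoding: absolute unit moves on a 2D square lattice


def train_move_model(training_folds, pseudocount=1.0):
    """Estimate P(next move | previous move) from move strings,
    using additive smoothing so that unseen transitions keep some mass."""
    counts = defaultdict(Counter)
    for fold in training_folds:
        for prev, nxt in zip(fold, fold[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in MOVES:
        total = sum(counts[prev].values()) + pseudocount * len(MOVES)
        model[prev] = {m: (counts[prev][m] + pseudocount) / total for m in MOVES}
    return model


def order_moves(model, previous_move, candidates):
    """Value-ordering heuristic: try the most probable move first."""
    return sorted(candidates, key=lambda m: model[previous_move][m], reverse=True)


if __name__ == "__main__":
    toy_folds = ["RRUL", "RUUL", "RRUU"]  # illustrative toy data only
    model = train_move_model(toy_folds)
    print(order_moves(model, "R", ["R", "U", "D"]))
```

If a good structure is found early, its energy gives a tight bound that lets a branch-and-bound search discard much of the remaining space. A model trained in PRISM could play the same role as the simple frequency table used here.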

Merits and impacts. The main contribution of the project will be a high-performance specialized solver for protein structure prediction that takes advantage of probability distributions learned from existing proteins with known structures. The system will be based on our efficient constraint logic programming system B-Prolog and the probabilistic learning system PRISM, and will incorporate the following research results:

• Probabilistic models: A naive probabilistic model would facilitate learning but could hardly produce results useful for guiding search. On the other hand, a sophisticated model entails exponential learning time. Through experiments, we will develop a probabilistic model that balances learning complexity against the quality of results.
• Special-purpose constraints: Special-purpose constraints, including routing, symmetry-breaking, and dual-modeling constraints, will be identified for the problem and implemented using the AR language available in B-Prolog.

This is a multidisciplinary research project that will involve both computer scientists and biologists. The research results will be disseminated through several avenues, and the resulting system will be made available to the public.
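The effect of the symmetry-breaking constraints mentioned above can be illustrated on a 2D square lattice, where every conformation has rotated and mirrored copies with the same energy. The sketch below is an assumption-laden toy, not the constraints to be implemented in AR: it fixes the first move to point east (removing rotations) and requires the first move off the x-axis to point north (removing the mirror image). The name count_conformations is hypothetical.

```python
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def count_conformations(n_steps, break_symmetry=False):
    """Count self-avoiding walks of n_steps moves, optionally keeping
    only one representative of each rotation/reflection class."""
    def extend(pos, visited, steps_done, has_left_axis):
        if steps_done == n_steps:
            return 1
        total = 0
        for name, (dx, dy) in MOVES.items():
            if break_symmetry:
                # Rotations: force the first move to point east.
                if steps_done == 0 and name != "E":
                    continue
                # Reflection: the first move off the x-axis must point north.
                if not has_left_axis and name == "S":
                    continue
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt in visited:
                continue
            visited.add(nxt)
            total += extend(nxt, visited, steps_done + 1,
                            has_left_axis or name in "NS")
            visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)}, 0, False)


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, count_conformations(n), count_conformations(n, True))
```

For walks of four steps the two rules cut the search from 100 conformations to 13, close to the eightfold reduction expected from the symmetries of the square. The reduction is not exactly eightfold because the straight chain is its own mirror image.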

Footnote 4: http://scop.mrc-lmb.cam.ac.uk/scop/.

References

[1] Rolf Backofen and Sebastian Will. A constraint-based approach to fast and exact structure prediction in three-dimensional protein models. Constraints, An International Journal, to appear.
[2] David Baker and Andrej Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[3] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000.
[4] A. Borning, R. Lin, and K. Marriott. Constraint-based document layout for the web. ACM Multimedia Systems Journal, 2000.
[5] Enrique Castillo, Jose Manuel Gutierrez, and Ali S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1st edition, 1997.
[6] Philippe Codognet and Daniel Diaz. Compiling constraints in clp(FD). Journal of Logic Programming, 27(3):185–226, 1996.
[7] Jacques Cohen. Constraint logic programming. Communications of the ACM, 33(7), 1990.
[8] Jacques Cohen. Bioinformatics - an introduction for computer scientists. ACM Comput. Surv., 36(2):122–158, 2004.
[9] Vitor Santos Costa, David Page, Maleeha Qazi, and James Cussens. CLP(BN): Constraint logic programming for probabilistic knowledge. In Proceedings of 2003 Conference on Uncertainty in Artificial Intelligence (UAI-03). Morgan Kaufmann Publishers, 2003.
[10] I. Cruz, K. Marriott, and P. van Hentenryck. Special issue on constraints, graphics, and visualization. Constraints, An International Journal, 3(1), 1998.
[11] D. Gilbert, R. Backofen, and R.H.C. Yap (eds.). Special issue on bioinformatics. Constraints, An International Journal, 6(2-3), 2001.
[12] Rina Dechter. Constraint Processing. Morgan Kaufmann Publishers, 2003.
[13] Mehmet Dincbas, Helmut Simonis, and Pascal van Hentenryck. Solving large combinatorial problems in logic programming. Journal of Logic Programming, 8:75–93, 1990. Special Issue: Logic Programming Applications.
[14] Daniel Frost and Rina Dechter. Look-ahead value ordering for constraint satisfaction problems. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI'95, pages 572–578, 1995.
[15] Kristian Kersting and Luc De Raedt. Bayesian logic programs. In J. Cussens and A. Frisch, editors, Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, pages 138–155, 2000.
[16] Andrzej Kolinski and Jeffrey Skolnick. Reduced models of proteins and their applications. Polymer, 45:511–524, 2003.
[17] D. Koller. Probabilistic relational models. In S. Džeroski and P. Flach, editors, Proceedings of the 9th International Workshop on Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 3–13. Springer-Verlag, 1999. Invited paper.
[18] Ludwig Krippahl and Pedro Barahona. Applying constraint programming to protein structure determination. In Constraint Programming, pages 289–302, 1999.
[19] V. Kumar. Algorithms for constraint satisfaction problems: A survey. AI Magazine, 13:32–44, 1992.
[20] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in Inductive Logic Programming, pages 254–264. IOS Press, 1996.
[21] Alessandro Dal Palù, Agostino Dovier, and Federico Fogolari. Constraint logic programming approach to protein structure prediction. BMC Bioinformatics, 5:186, 2004.
[22] Alessandro Dal Palù, Agostino Dovier, and Enrico Pontelli. Heuristics, optimizations, and parallelism for protein structure prediction in CLP(FD). In PPDP, pages 230–241, 2005.
[23] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, 1987.
[24] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.
[25] Taisuke Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research, pages 391–454, 2001.
[26] Taisuke Sato, Yoshitaka Kameya, and Neng-Fa Zhou. Generative modeling with failure in PRISM. In IJCAI, pages 847–852, 2005.
[27] P. van Hentenryck and V. Saraswat (eds.). Strategic directions in constraint programming. ACM Computing Surveys, 28(4):701–728, 1996.
[28] M.G. Wallace. Practical applications of constraint programming. Constraints Journal, 1(1), 1996.
[29] C. S. Wetherell. Probabilistic languages: A review and some open questions. ACM Computing Surveys, 12(4):361–379, 1980.
[30] Neng-Fa Zhou. CGLIB - a constraint-based graphics library. Software Practice and Experience, 33(13):1199–1216, 2003.
[31] Neng-Fa Zhou. Programming finite-domain constraint propagators in action rules. To appear in Theory and Practice of Logic Programming (TPLP), 2006.
[32] Neng-Fa Zhou, Taisuke Sato, and Koiti Hasida. Toward a high-performance system for symbolic and statistical modeling. In IJCAI Workshop on Learning Statistical Models from Relational Data, pages 153–159, 2003.
[33] Neng-Fa Zhou, Taisuke Sato, and Yi-Dong Shen. Linear tabling strategies and optimizations. Theory and Practice of Logic Programming (TPLP), submitted; preliminary results appear in ACM PPDP'03 and ACM PPDP'04.
