Multiscale Approaches to Protein Modeling
Andrzej Kolinski Editor
Editor Andrzej Kolinski Department of Chemistry University of Warsaw ul. Pasteura 1 02-093 Warszawa Poland
ISBN 978-1-4419-6888-3 e-ISBN 978-1-4419-6889-0 DOI 10.1007/978-1-4419-6889-0
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010934732
© Springer Science+Business Media, LLC 2011
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Thanks to enormous progress in the sequencing of genomic data, we now know millions of protein sequences. At the same time, the number of experimentally solved protein structures is much smaller, ca. 60,000, because of the high cost of structure determination. Thus, theoretical in silico prediction of protein structures and dynamics is essential for understanding the molecular basis of drug action and of metabolic and signaling pathways in living cells, and for designing new technologies in the life and material sciences. Unfortunately, a “brute force” approach remains impractical. Folding of a typical protein (in vivo or in vitro) takes milliseconds to minutes, while state-of-the-art all-atom molecular mechanics simulations of protein systems can cover only nanoseconds to microseconds. This is the reason for the enormous progress in the development of various multiscale modeling techniques applied to protein structure prediction, modeling of protein dynamics and folding pathways, in silico protein engineering, model-aided interpretation of experimental data, modeling of macromolecular assemblies, and theoretical studies of protein thermodynamics. Coarse-graining of the proteins’ conformational space is a common feature of all these approaches, although the details and the underlying physical models span a very broad spectrum. This book contains comprehensive reviews of the most advanced multiscale modeling methods in protein structure prediction, computational studies of protein dynamics, folding mechanisms, and macromolecular interactions. The presented approaches span a wide range of levels of coarse-grained representation, various sampling techniques, and a variety of applications to biomedical and biophysical problems.
It was our intention to provide a collection of comprehensive reviews that could be used as a reference book by those who are just beginning their adventure with biomacromolecular modeling, and also as a valuable source of more detailed information by those who are already experts in this field and in related areas of computational biology or biophysics. Proteins are linear copolymers composed of amino acids. Important ideas of polymer physics inspired the field of protein modeling. Chapter 1 explains some basic concepts of polymer conformational statistics and dynamics of chain molecules in the context of simple lattice models. This chapter demonstrates how
these ideas could be employed in protein modeling. Chapter 2 describes the application of a lattice-based protein model to the very challenging problem of protein docking. Chapter 3 provides a comprehensive overview of various coarse-grained protein-like and protein models. This chapter describes (among other approaches) probably the most rigorous system of physics-based reduced modeling of proteins. Coarse-grained, multiscale protein modeling requires specific designs of interaction schemes. Chapters 4–6 provide in-depth overviews of force fields of various levels for reduced representations of protein conformational space, including knowledge-based statistical potentials. Chapters 7 and 8 (but also, in part, Chapters 3–5 and 12) describe a variety of applications of reduced models in the study of protein dynamics, folding pathways, molecular mechanisms of mechanical unfolding, and protein interactions. Chapter 9 gives an overview of the most effective sampling strategies in a reduced, although unrestricted, conformational space. Chapters 10 and 11 present a very efficient philosophy of conformational search, in which the target structures are assembled from fragments excised from already known protein structures. These strategies have proven to be very effective in large-scale, automated in silico structure prediction. Chapter 12 describes a multiscale method, based on a high-resolution lattice model, for modeling protein folding pathways. Chapters 13 and 14 discuss the most important ideas and techniques of comparative modeling, the most effective and the most popular method for theoretical prediction of protein structures. These chapters also provide reviews of model-quality assessment methods. The contributing authors are internationally recognized experts. Some of them (Bujnicki and Zhang) are leaders in the field of protein structure prediction, as assessed by the recent (CASP6–CASP8) community-wide experiments in blind structure prediction.
Others also developed very successful methods for protein structure prediction (Scheraga, Liwo, Feig, and Kihara). Several of the authors of this book developed very efficient coarse-grained interaction schemes for protein models based on either an evolutionary knowledge approach (Jernigan and Scheraga have built the theoretical foundations of this class of approaches, but others also contributed significantly: Feig and Micheletti) or a physics-based approach (Scheraga, Liwo, Feig, and Irback). Among the authors are also the world's leading experts in comparative modeling (Bujnicki, Zhang, Tramontano, and Kihara) and automated structure prediction (Zhang and Bujnicki); the structure prediction server created by Zhang is the best to date. The book also presents state-of-the-art methods for evaluating the quality of theoretical protein models (Tramontano and Kihara). Recently, significant progress has been achieved in multiscale modeling of protein dynamics and folding mechanisms. The authors of the chapters dealing with this class of problems are also world-class leaders (Scheraga, Liwo, Irback, Feig, Cieplak, Jernigan, and Micheletti). Conformational search strategies are crucial in protein modeling. Developers of the most efficient computational techniques and strategies are also among the authors (Hansmann, Scheraga, and others).
Warsaw, Poland
Andrzej Kolinski
Contents
1 Lattice Polymers and Protein Models
Andrzej Kolinski
2 Multiscale Protein and Peptide Docking
Mateusz Kurcinski, Michał Jamroz, and Andrzej Kolinski
3 Coarse-Grained Models of Proteins: Theory and Applications
Cezary Czaplewski, Adam Liwo, Mariusz Makowski, Stanisław Ołdziej, and Harold A. Scheraga
4 Conformational Sampling in Structure Prediction and Refinement with Atomistic and Coarse-Grained Models
Michael Feig, Srinivasa M. Gopal, Kanagasabai Vadivel, and Andrew Stumpff-Kane
5 Effective All-Atom Potentials for Proteins
Anders Irbäck and Sandipan Mohanty
6 Statistical Contact Potentials in Protein Coarse-Grained Modeling: From Pair to Multi-body Potentials
Sumudu P. Leelananda, Yaping Feng, Pawel Gniewek, Andrzej Kloczkowski, and Robert L. Jernigan
7 Bridging the Atomic and Coarse-Grained Descriptions of Collective Motions in Proteins
Vincenzo Carnevale, Cristian Micheletti, Francesco Pontiggia, and Raffaello Potestio
8 Structure-Based Models of Biomolecules: Stretching of Proteins, Dynamics of Knots, Hydrodynamic Effects, and Indentation of Virus Capsids
Marek Cieplak and Joanna I. Sułkowska
9 Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms
Ulrich H. E. Hansmann
10 Protein Structure Prediction: From Recognition of Matches with Known Structures to Recombination of Fragments
Michal J. Gajda, Marcin Pawlowski, and Janusz M. Bujnicki
11 Genome-Wide Protein Structure Prediction
Srayanta Mukherjee, Andras Szilagyi, Ambrish Roy, and Yang Zhang
12 Multiscale Approach to Protein Folding Dynamics
Sebastian Kmiecik, Michał Jamroz, and Andrzej Kolinski
13 Error Estimation of Template-Based Protein Structure Models
Daisuke Kihara, Yifeng David Yang, and Hao Chen
14 Evaluation of Protein Structure Prediction Methods: Issues and Strategies
Anna Tramontano and Domenico Cozzetto
Index
Contributors
Janusz M. Bujnicki, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland; Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
Vincenzo Carnevale, Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA
Hao Chen, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA
Marek Cieplak, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland
Domenico Cozzetto, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy
Cezary Czaplewski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Michael Feig, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA; Department of Chemistry, Michigan State University, East Lansing, MI, USA
Yaping Feng, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Michal J. Gajda, European Molecular Biology Laboratories, Hamburg Outstation, Hamburg, Germany
Pawel Gniewek, L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA; Laboratory of Theory of Biopolymers, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Srinivasa M. Gopal, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Ulrich H. E. Hansmann, Department of Physics, Michigan Technological University, Houghton, MI, USA
Anders Irbäck, Computational Biology & Biological Physics, Department of Theoretical Physics, Lund University, Lund, Sweden
Michał Jamroz, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Robert L. Jernigan, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Daisuke Kihara, Department of Biological Sciences, College of Science; Department of Computer Science, College of Science; Markey Center for Structural Biology, Purdue University, West Lafayette, IN, USA
Andrzej Kloczkowski, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Sebastian Kmiecik, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Andrzej Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Mateusz Kurcinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
Sumudu P. Leelananda, Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, USA; L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA, USA
Adam Liwo, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Mariusz Makowski, Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Cristian Micheletti, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy; Democritos CNR-IOM and Italian Institute of Technology (SISSA Unit), Trieste, Italy
Sandipan Mohanty, Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, Germany
Srayanta Mukherjee, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA
Stanisław Ołdziej, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA; Laboratory of Biopolymer Structure, Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, Gdańsk, Poland
Marcin Pawlowski, Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, Poland
Francesco Pontiggia, Department of Biochemistry, Brandeis University, Waltham, MA, USA
Raffaello Potestio, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy
Ambrish Roy, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA
Harold A. Scheraga, Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA
Andrew Stumpff-Kane, Department of Biochemistry and Molecular Biology, Michigan State University, Michigan, USA
Joanna I. Sułkowska, Institute of Physics, Polish Academy of Sciences, Warsaw, Poland; CTBP, University of California, Gilman Drive 9500, La Jolla, San Diego, CA, USA
Andras Szilagyi, Center for Bioinformatics, University of Kansas, Lawrence, KS, USA; Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
Anna Tramontano, Department of Biochemical Sciences, “Sapienza” University of Rome, Rome, Italy; Istituto Pasteur – Fondazione Cenci Bolognetti, “Sapienza” University of Rome, Rome, Italy
Kanagasabai Vadivel, Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
Yifeng David Yang, Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN, USA
Yang Zhang, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Center for Bioinformatics, University of Kansas, Lawrence, KS, USA

Chapter 1
Lattice Polymers and Protein Models Andrzej Kolinski
Abstract The size of the conformational space of chain polymers is enormous. Much has been learned about polymer structure, thermodynamics, and dynamics from theoretical considerations and numerical studies of simple lattice models. Self-avoiding random walks on a lattice provide a good approximation for the excluded volume effect and the nature of the coil–globule transition. Semiflexible polymers on a lattice exhibit a two-state collapse transition that captures some essential features of the all-or-none folding transition of small globular proteins. More complex lattice polymers, decorated with some structural details, provide a very powerful means for studying protein dynamics and thermodynamics and for protein structure prediction.
1.1 Reduced Models of Chain Molecules

Torsional rotations around the main-chain backbone bonds alone make the conformational space of chain molecules enormous (Flory 1969). For a chain containing N single bonds, the number of conformations is of the order of $q^N$, where q is approximately equal to the number of distinct low-energy regions of the rotational potential. For a polyethylene chain, q would be 3. Obviously, when N is hundreds or many thousands, a detailed conformational analysis becomes impractical. Detailed all-atom computer simulations are also impractical, unless only very local conformational changes need to be examined. Thus, in order to make the problem tractable, simplified models have often been designed and studied (Milik et al. 1990; Kolinski and Skolnick 1996), by statistical analysis and/or by computer simulation. As will become apparent later, statistical analysis by itself is of rather limited utility and in typical cases requires quite drastic simplifications. Usually, it is difficult to estimate a priori the effect of such simplifications on the final results.
A. Kolinski, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_1, © Springer Science+Business Media, LLC 2011
Let us consider two extremely simple models of polymers, one for idealized conformational statistics and the second for the first level of approximation of chain dynamics. These models can be solved rigorously by simple analytical considerations (Flory 1969). The first is the freely jointed chain (sometimes also called “the random flight model”). The freely jointed chain (see Fig. 1.1) consists of n segments of equal length l. Mutual orientations of the segments are completely uncorrelated. It is well known that for a sufficiently large number of segments, the mean-square end-to-end distance of such a chain scales with the number of segments as $\langle R^2 \rangle = l^2 n$. This result closely resembles the central formula of the theory of Brownian motion, where the mean-square displacement of a particle is proportional to time. It is also easy to show that the mean-square radius of gyration $\langle S^2 \rangle$ (a quantity that is easier to measure experimentally than $\langle R^2 \rangle$) is related to $\langle R^2 \rangle$ as $\langle S^2 \rangle = \langle R^2 \rangle / 6$. The distributions of the end-to-end distance and of the segment density are Gaussian. Such an ideal polymer random coil is frequently called the Gaussian chain, although the freely jointed chain is not uniquely Gaussian, since other types of chains can also follow Gaussian statistics.

Fig. 1.1 An example of the freely jointed chain
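The two scaling relations quoted above for the freely jointed chain can be checked numerically. The following is only a minimal sketch, not part of the original chapter: it draws independent random chains and averages the squared end-to-end distance and the squared radius of gyration. The function names, the sample size, and the seed are illustrative assumptions.

```python
import math
import random


def random_unit_vector(rng):
    """Return a direction drawn uniformly from the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def freely_jointed_chain(n, l, rng):
    """Bead positions of a chain of n uncorrelated segments of length l."""
    pos = [(0.0, 0.0, 0.0)]
    for _ in range(n):
        ux, uy, uz = random_unit_vector(rng)
        x, y, z = pos[-1]
        pos.append((x + l * ux, y + l * uy, z + l * uz))
    return pos


def end_to_end_sq(pos):
    """Squared distance between the first and the last bead."""
    return sum((a - b) ** 2 for a, b in zip(pos[-1], pos[0]))


def gyration_sq(pos):
    """Squared radius of gyration (mean squared distance from the centroid)."""
    m = len(pos)
    center = [sum(p[k] for p in pos) / m for k in range(3)]
    return sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in pos) / m


def chain_averages(n, l=1.0, samples=2000, seed=0):
    """Ensemble averages <R^2> and <S^2> over independent random chains."""
    rng = random.Random(seed)
    r2 = s2 = 0.0
    for _ in range(samples):
        pos = freely_jointed_chain(n, l, rng)
        r2 += end_to_end_sq(pos)
        s2 += gyration_sq(pos)
    return r2 / samples, s2 / samples
```

For example, `chain_averages(100)` should return a first value close to 100 (that is, l²n with l = 1) and a second value close to one sixth of the first, within the statistical noise of the finite sample.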
The simplifications of the physical properties of real polymers assumed in the freely jointed chain model are essentially of two types. First, the correlations between the chain segments, especially between those that are close to one another along the chain contour, are an important property of polymers and strongly depend on their chemical structure. As long as these correlations extend only over a distance that is small in comparison with the chain length, it is relatively straightforward to generalize the model by introducing various approximations of the local chain stiffness, related to the sometimes complex profiles of the rotational potential energy. Short-range correlations (over short distances along the chain contour) do not change the general picture. For all such ideal models $\langle R^2 \rangle = C l^2 n$, and the value of the prefactor C depends on the shape of the rotational potential and the temperature. Approximations of the second type are much more significant and much more difficult to deal with. Namely, all ideal chains neglect the effective interactions between the chain segments that are far away from one another along the chain
but close to each other in space. On the most trivial level, the fact that two segments cannot occupy the same element of space must be taken into account. A rigorous analytical treatment of such “real” chains is not possible, although approximate theories exist (de Gennes 1979). Probably the most famous is Flory’s mean-field theory (Flory 1953). The theory assumes that a balance between intramolecular interactions and those with the solvent defines the average coil size. A quasi-chemical approximation is employed and an average Gaussian density of segments is assumed. The resulting formula describes the chain dimensions as a function of temperature:

$\alpha^5 - \alpha^3 = \mathrm{const} \cdot (1 - \Theta/T)\, n^{1/2}$   (1.1)

where α is the so-called expansion factor, defined as

$\alpha^2 = \langle R^2 \rangle / \langle R^2 \rangle_0$   (1.2)

with $\langle R^2 \rangle_0$ denoting the ideal chain dimensions. Note that for T = Θ the chain dimensions become identical with the dimensions of the ideal chain. Thus, the idea behind Flory’s “theta” (Θ) temperature closely resembles the Boyle temperature for real gases. At temperatures below Θ, the chain undergoes a transition to a dense globular state, and this transition is somewhat similar to the gas–liquid transition of small-molecule systems. However, the transition for flexible polymers is continuous and has most of the features of a second-order phase transition (Kolinski et al. 1987b). At high temperatures (see Eq. (1.1)) $\langle R^2 \rangle \sim n^{6/5}$, and the average chain dimensions are much larger than for an equivalent ideal chain. Interestingly, despite a rather poor estimation of chain entropy and internal energy, Flory’s theory gives quite an accurate estimate of the free energy and conformational properties of chain molecules. Such a cancellation of errors is quite typical of mean-field-type theories.

Ideal chain statistics provides a zero-order picture of the protein denatured state, while Flory’s theory is a zero-order approximation for the folding (or collapse) transition. The approximation is quite crude for several reasons. First, protein chains are relatively stiff polymers, and the limit of infinitely long chains is hardly satisfied even for large proteins (Creighton 1993). Second, proteins are heteropolymers with highly specific patterns of intramolecular interactions (Branden and Tooze 1991). Even in the random coil state, there is a significant extent of residual structure. Thus, the mean-field theory is hardly applicable. We will address these issues later in more detail.

Somewhat analogous to ideal chain statistics, models for ideal chain dynamics were designed. Probably the best known of these is the Rouse model (Rouse 1953), shown in a schematic fashion in Fig. 1.2.
It assumes that a flexible polymer chain can be represented as a chain of points joined by harmonic springs of equal strength. This model is analytically solvable. The results are quite interesting. For short times, when the average displacements of chain segments must be small in comparison with the coil size, a single segment moves according to

$(\Delta r)^2 \sim t^{1/2}$ for $l^2 < (\Delta r)^2$   (1.3)

while at longer times the “regular” diffusion is recovered and the mean-square displacement of a segment follows the mean-square displacement of the center of mass of the coil, $(\Delta r)^2 \sim t$, with the diffusion coefficient proportional to $n^{-1}$ and the longest relaxation time proportional to $n^2$. It is easy to see that the Rouse model neglects several basic aspects of the physics: it ignores chain volume and the resulting topological restrictions (the “phantom” chain approximation), does not take into account the non-uniform flexibility of copolymers, and does not account for hydrodynamic interactions (although some extensions of the Rouse model can do this in a highly approximate way). These approximations are more serious for proteins than for flexible long polymers.

Fig. 1.2 Schematic drawing of beads-and-springs Rouse chain
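The quadratic growth of the longest relaxation time with chain length can be illustrated with a short calculation. The sketch below is an illustration under stated assumptions rather than the chapter's own derivation: it treats the overdamped bead-spring equations, in which the bead positions relax through the connectivity matrix of the chain, and uses the standard closed-form eigenvalues of that matrix. The friction coefficient `zeta`, the spring constant `k`, and all function names are hypothetical.

```python
import math


def rouse_matrix(n):
    """Connectivity matrix of n beads joined by n - 1 identical springs (free ends)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        a[i][i] += 1.0
        a[i + 1][i + 1] += 1.0
        a[i][i + 1] -= 1.0
        a[i + 1][i] -= 1.0
    return a


def rouse_eigenvalue(p, n):
    """Closed-form eigenvalue of mode p (p = 0 is the center-of-mass mode)."""
    return 4.0 * math.sin(p * math.pi / (2.0 * n)) ** 2


def rouse_mode(p, n):
    """Closed-form eigenvector of mode p."""
    return [math.cos(p * math.pi * (i + 0.5) / n) for i in range(n)]


def residual(p, n):
    """Largest component of A v - lambda v; near zero if the closed forms hold."""
    a = rouse_matrix(n)
    v = rouse_mode(p, n)
    lam = rouse_eigenvalue(p, n)
    return max(
        abs(sum(a[i][j] * v[j] for j in range(n)) - lam * v[i]) for i in range(n)
    )


def longest_relaxation_time(n, zeta=1.0, k=1.0):
    """Decay time of the slowest internal mode, zeta / (k * lambda_1)."""
    return zeta / (k * rouse_eigenvalue(1, n))
```

Doubling the chain length from 100 to 200 beads should increase `longest_relaxation_time` by a factor very close to 4, which is the n² dependence quoted in the text, and `residual(1, n)` gives a direct numerical check of the closed-form mode against the explicit matrix.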
1.2 Simple Lattice Polymers

Lattice models of simple liquids, except for providing a clear explanation of the entropy of mixing for an ideal solution, are not very useful. The opposite is the case for polymers. In polymers, complex correlations can extend to distances many times larger than the size of a single monomer. Thus, the local details are of less importance. The two ideal models described in the previous section have close lattice analogs. The freely jointed chain can be represented on a regular lattice, and the asymptotic properties remain unchanged (de Gennes 1979). Since on a lattice the allowed angles between consecutive segments belong to a discrete set, the only differences are seen at short distances along the chain. A Flory-type real chain can be modeled on a lattice in a straightforward and efficient fashion. The simplest possibility is a chain with excluded volume (double occupancy of lattice sites is prohibited) and attractive interactions between non-bonded nearest neighbors. The idea is explained in Fig. 1.3. Such a model enables detailed Monte Carlo studies of polymer collapse transitions for various patterns of interactions and for various topologies of the model polymers (branched chains, macrocycles, etc.). Interestingly, the critical exponent for an athermal linear chain with excluded volume (the limiting case of a high-temperature system) estimated from numerous computational experiments is close (however, clearly not identical) to the 6/5 obtained from Flory’s theory. It has also been proven that the collapse transition for long flexible polymers is continuous and that the observed physics does not depend on the particular type of lattice used.

Fig. 1.3 Square lattice polymers. An ideal chain (left) and a real (with excluded volume) chain (right)

Usually, sampling of the conformational space of a lattice polymer is carried out with the use of various Monte Carlo techniques (Binder et al. 2004; Smith and Lisal 2002; Pakula 2004). A Monte Carlo procedure can be employed to build a large number of completely independent random conformations. Such a statistical ensemble can then be used for the statistical analysis of the conformational and thermodynamic properties of the model. Alternatively, the ensemble can be constructed in a long iterative process of conformational transformations of a single chain or of a collection of chains. For technical reasons, the second possibility is recommended for studies of multichain systems, where growing all chains in parallel without introducing a statistical bias would be a difficult task (Binder et al. 2004; Smith and Lisal 2002; Frenkel and Smit 2001). Computer simulations of single long flexible polymers have provided some important insights into the nature of the coil–globule collapse transition. It has been shown that with increasing chain length the average size of the polymer coil changes faster with temperature and that the collapse transition becomes sharper, although it is always continuous (Kolinski et al. 1987b). Critical exponents describing the chain dimensions under various conditions of solvent quality and temperature have been determined from extensive Monte Carlo simulations for polymers with various main-chain topologies: simple linear, branched (especially star-branched) (Hsu et al. 2004; Sikorski and Romiszowski 1996; Sikorski 1993), and ring polymers. Computer modeling of polymers has also stimulated the development of new computational techniques (Grest et al. 1996; Freire 1999; Likos 2006).
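The effect of excluded volume on an athermal lattice chain can be seen already for very short chains. The sketch below is an illustration only: instead of the Monte Carlo sampling discussed above, it uses exact enumeration of all self-avoiding walks on the square lattice, which is feasible only for short chains. It does not attempt to estimate a critical exponent, and the function name is an assumption.

```python
def enumerate_saws(n):
    """Return (count, mean R^2) over all n-step self-avoiding walks on the
    square lattice that start at the origin."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    count = 0
    sum_r2 = 0

    def extend(x, y, steps_left, visited):
        nonlocal count, sum_r2
        if steps_left == 0:
            count += 1
            sum_r2 += x * x + y * y
            return
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in visited:  # excluded volume: no double occupancy
                visited.add(nxt)
                extend(nxt[0], nxt[1], steps_left - 1, visited)
                visited.remove(nxt)

    extend(0, 0, n, {(0, 0)})
    return count, sum_r2 / count
```

For an ideal (phantom) walk of n unit steps the mean-square end-to-end distance equals n, so comparing the second return value with n shows directly how much the self-avoiding chain is swollen.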
The ideal dynamics of the Rouse chain (see the previous section) also has a lattice analog. Imagine a simple lattice chain that is ideal (lacking excluded volume) and is consequently also a “phantom” chain: fragments of the chain can cross one another during random motion. Stochastic dynamics can be simulated as a long sequence of local conformational transitions (involving a few bonds) at randomly selected positions along the chain (see Fig. 1.4). It was shown long ago by Verdier and Stockmayer (1962) that the long-time dynamics of such a model is equivalent to Rouse dynamics. Lattice chains can easily be modeled as “real” chains, having excluded volume and topological constraints on their motion. This opens the possibility of computational studies of various complex dynamic phenomena, including the mechanism of polymer collapse (protein folding), diffusion in a restricted space, and diffusion in dense solutions. It has been shown that the excluded volume of chain molecules (and long-range interactions in general) leads to somewhat stronger dependencies of the diffusion coefficient and the longest relaxation time on the chain length, when compared to the corresponding relations for ideal chains in an infinitely dilute solution. Computer simulations were especially helpful in understanding the mechanism of polymer diffusion in gels, concentrated solutions, and polymer melts. A system of many mutually entangled polymers is probably one of the most complex (if not the most complex) examples of classical multibody problems. It has been shown that the famous “reptation” theory of de Gennes (1979) describes very well the motion of flexible polymers in a gel. The term “reptation” relates to the snake-like motion of a polymer chain through the net of obstacles imposed by the crosslinked gel. At the same time, many computer simulations demonstrated that the situation in solutions and melts is more complex (Kolinski et al. 1987a, c, and d) and that the mechanism of diffusion cannot be described in the framework of simple “reptation” theory (Skolnick and Kolinski 1990; Sikorski et al. 1994; Di Cecca and Freire 2002).
Fig. 1.4 Verdier–Stockmayer dynamics of a short simple cubic lattice chain showing typical lattice moves: (a) the corner flip, (b) three-bond permutation, (c) the crankshaft move, and (d) the chain-end move
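The local-move dynamics of a phantom chain on the simple cubic lattice can be sketched in a few lines. This is a simplified illustration, not the original Verdier–Stockmayer algorithm: it implements only a chain-end move and a two-bond flip that swaps the bonds meeting at an interior bead (the corner flip when those bonds are perpendicular), omits the three-bond and crankshaft moves, and accepts every move because the ideal chain has no excluded volume and no energy. All names and parameter values are assumptions.

```python
import random

UNIT_STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def local_move(chain, rng):
    """Apply one local move at a randomly chosen bead (phantom chain)."""
    i = rng.randrange(len(chain))
    if i == 0:
        # Chain-end move: re-draw the first bond direction at random.
        chain[0] = add(chain[1], rng.choice(UNIT_STEPS))
    elif i == len(chain) - 1:
        chain[-1] = add(chain[-2], rng.choice(UNIT_STEPS))
    else:
        # Two-bond flip: swap the two bonds that meet at bead i.
        # For collinear bonds this leaves the bead where it was.
        chain[i] = sub(add(chain[i - 1], chain[i + 1]), chain[i])


def simulate(n_beads=20, sweeps=200, seed=1):
    """Run sweeps * n_beads attempted moves starting from a straight rod."""
    rng = random.Random(seed)
    chain = [(i, 0, 0) for i in range(n_beads)]
    for _ in range(sweeps * n_beads):
        local_move(chain, rng)
    return chain


def bonds_ok(chain):
    """Check that every bond is still a unit lattice vector."""
    return all(sum(d * d for d in sub(b, a)) == 1 for a, b in zip(chain, chain[1:]))
```

A “real” chain variant would add an occupancy check before accepting a move; that test is deliberately left out here to keep the phantom-chain case minimal.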
1.3 Simple Lattice Polymers with Protein-Like Features The collapse transition of a long flexible polymer chain is continuous. But, relatively short natural polypeptide chains undergo reversible pseudo-first-order cooperative transitions from a random coil denatured state to a structurally organized dense globular state (Anfinsen 1973). Since the single protein–solvent system is rather small, it is probably better to describe this transition as all-or-none. This way one can avoid an implicit reference to the thermodynamic limit. At the same time, the term all-ornone refers to a negligible population of the folding intermediates at the transition temperature. It has been pointed out that polypeptides are relatively stiff polymers. Perhaps, the chain stiffness itself can induce a more cooperative collapse transition. To check this hypothesis, extensive simulations were done almost 25 years ago (Kolinski et al. 1986a). The model chains studied were relatively short and consisted of 50–400 segments restricted to the diamond lattice (see Fig. 1.5). The diamond lattice has been chosen because of its tetrahedral valence angle and the qualitative similarity of its stretched (trans-conformation) segments to the β-strands in globular proteins. Short-range and long-range interactions were modeled in a simple way. Local stiffness was controlled by a potential energy preference for the expanded trans-conformation with respect to the two gauche conformations. Attractive longrange interactions were accounted for with a simple contact potential for the nearest non-bonded neighbors on the lattice. It is probably the simplest possible model of a semiflexible homopolymer in a thermodynamically poor solvent. The degree of stiffness can be controlled by changing the ratio of the stiffness parameter to the contact energy parameter. Monte Carlo simulations revealed interesting behavior for such a simple system. 
For moderately stiff polymers, the collapse transition was continuous, qualitatively similar to the collapse of a chain with unrestricted flexibility. However, at some critical ratio of the stiffness parameter to the segment attraction parameter, the collapse transition became highly cooperative, with all-or-none thermodynamic characteristics. At the transition temperature, the semiflexible polymers exhibited metastable states, which are characteristic of first-order phase transitions.
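The semiflexible diamond-lattice model described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the original simulation code: the names `eps_gauche` and `eps_contact`, the choice of the trans state as the zero of the local energy, and the two example chains are all hypothetical. A three-bond fragment is counted as trans when its first and third bonds are parallel, and a contact is any pair of non-bonded segments separated by one lattice bond length.

```python
# Bond vectors leaving one diamond sublattice; the other sublattice uses their negatives.
EVEN = {(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)}
ODD = {tuple(-c for c in v) for v in EVEN}


def positions(bonds, start=(0, 0, 0)):
    """Return the lattice sites visited by a chain given as a list of bond vectors."""
    sites = [start]
    for b in bonds:
        sites.append(tuple(p + c for p, c in zip(sites[-1], b)))
    return sites


def is_valid(bonds):
    """Check sublattice alternation, absence of bond reversals, and self-avoidance."""
    for i, b in enumerate(bonds):
        if b not in (EVEN if i % 2 == 0 else ODD):
            return False
        if i > 0 and b == tuple(-c for c in bonds[i - 1]):
            return False
    sites = positions(bonds)
    return len(set(sites)) == len(sites)


def count_gauche(bonds):
    """A three-bond fragment is trans when bond i+2 is parallel to bond i."""
    return sum(1 for i in range(len(bonds) - 2) if bonds[i + 2] != bonds[i])


def count_contacts(bonds):
    """Count non-bonded pairs of segments that are nearest neighbors (squared distance 3)."""
    sites = positions(bonds)
    n = 0
    for i in range(len(sites)):
        for j in range(i + 2, len(sites)):
            if sum((a - b) ** 2 for a, b in zip(sites[i], sites[j])) == 3:
                n += 1
    return n


def chain_energy(bonds, eps_gauche, eps_contact):
    """Hypothetical energy: stiffness penalty per gauche state plus contact energy."""
    return eps_gauche * count_gauche(bonds) + eps_contact * count_contacts(bonds)


# Two illustrative (made-up) five-bond chains: an all-trans strand and a folded loop.
extended = [(1, 1, 1), (1, 1, -1), (1, 1, 1), (1, 1, -1), (1, 1, 1)]
ring_like = [(1, 1, 1), (-1, 1, 1), (-1, 1, -1), (-1, -1, -1), (1, -1, -1)]
```

With a positive `eps_gauche` and a negative `eps_contact`, the ratio of the two plays the role of the stiffness-to-attraction ratio discussed in the text: the folded loop gains one contact at the cost of three gauche states.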
Fig. 1.5 A short fragment of a model chain restricted to the diamond lattice. Semiflexible chains have a preference for the expanded trans-type conformations. The long-range interactions are modeled by a contact potential
8
A. Kolinski
Fig. 1.6 Dimensions of short polymeric chains as a function of temperature. The solid line corresponds to the case of flexible chain and the dashed line describes behavior of semiflexible polymer. Tf indicates the collapse (or folding) temperature
This is illustrated in Fig. 1.6. Two types of structures coexisted at the transition temperature. The swollen random coil state exhibited a low density of contacts between the chain segments and relatively low average lengths of the fully expanded segments. Upon collapse, the length of the expanded strands increased abruptly, accompanied by an abrupt increase in the number of polymer–polymer contacts. In this way the entropy of the random coil compensated for the low potential energy of the globule. For relatively short chains, the globule had the structure of a bundle of parallel strands. There was a hypothesis that the collapse transition itself induces formation of secondary structures in proteins. This is true, but only when there is an interplay between the short-range conformational stiffness and the long-range interactions. Such interplay seems to be a fundamental feature of proteins and the major factor responsible for the folding cooperativity. For highly flexible polymers, the local ordering in the globular state was undetectable. Here, however, a word of caution is in order. The results described were obtained for a model of a homopolymer. Strong, specific sequence patterns of the long-range interactions may actually lead to some ordering of the globular structure. Specific interactions of side chains can also augment folding cooperativity (Pande et al. 1996). The model of a semiflexible polymer has one more striking feature. With increasing chain length (somewhere between 200 and 400 segments, depending on the degree of the local stiffness), the globular structure divides into domains of bundles having different orientations of their axes (Rutkowska and Kolinski 2007). This again resembles globular proteins, where longer polypeptide chains fold into two or more separate domains.
Obviously, in the limit of a very long chain, where the persistence length becomes small with respect to the chain length of the model homopolymer, a continuous collapse transition should be recovered. The homopolymeric model of protein collapse has, however, an important non-protein-like feature. The globular structure, although highly ordered, is not unique. The average length of the extended
strands and the distribution of their sizes may differ quite a bit between particular simulations. The topology of connections between the strands is not unique either. It is interesting that the cooperative collapse transition of a semiflexible polymer could actually be predicted in the framework of a mean-field-type theory, as was demonstrated some time ago by Post and Zimm (1979). Nevertheless, the picture emerging from the computer simulations is much deeper, and it is exact within the limits of the model simplifications and the small statistical errors of the Monte Carlo simulations. Models of homopolymeric semiflexible chains provide the zero-order approximation of the physics of the globular protein collapse transition, where the interplay between secondary structure preferences (here the local stiffness) and the long-range interactions leads to the characteristic cooperative behavior (Kolinski et al. 1996).
1.4 Minimal Protein-Like Models
The homopolymer model of a semiflexible chain, described in the previous section, has several important protein-like features, except the uniqueness of the globular structure. Real protein chains are heteropolymers, with amino acid units that differ in the strength and the physical nature of their short-range and long-range interactions. In the simplest possible approximation, there are two types of amino acids with respect to their hydrophobicity: polar (P), which tend to be exposed to the solvent on the protein surface, and hydrophobic (H), which tend to be buried inside the globule (Lau and Dill 1989). From the point of view of the chain flexibility, one may distinguish three main classes (Skolnick et al. 1989): amino acids (or short sequences of amino acids) that tend to adopt extended (e) conformations, residues that tend to build helical structures (h), and, usually more flexible, residues that prefer coil or turn-type local structures (c). Assuming a β-barrel-type target structure, it is natural to limit these possibilities to two cases: e and c. Thus, in a very crude approximation, the β-type proteins are built from four types of amino acids: He, Hc, Pe, and Pc. By using the diamond lattice approximation described previously, it is relatively easy to design a number of sequences based on these four types of residues that undergo the all-or-none transition to a unique three-dimensional structure, although the level of the folding cooperativity is rather low. Even a quite complex Greek-key topology (seen in many real proteins) of the globule could be designed and obtained, with high reproducibility, in computer simulations (Kolinski et al. 1986b; Skolnick et al. 1988). This, however, requires somewhat more complex patterns of the chain hydrophobicity and flexibility (Skolnick et al. 1989). Simple helical motifs could also be designed and folded in silico employing these simple rules.
Of course, instead of e-type residues, the h-type had to be used, for which the right-handed three-bond turns have a preferential energy in the short-range interactions (Grosberg and Khokhlov 1994; Kolinski et al. 1996). Shortly after the studies outlined above, a different approach to designing a minimal protein-like model was proposed by Chan and Dill and pursued by many
others (Dill et al. 1995; Dinner et al. 1994, 1996). In its classic form, the model assumed the simple cubic lattice representation of the chain conformation and two types of residues: polar (P) and hydrophobic (H). In many applications, the target structure had the form of a 3×3×3 cube consisting of 27 model amino acids. It has been shown that it is possible to design sequences that have a single minimum of the chain conformational energy, which is consistent with the cube. This result was exact, since it was feasible to enumerate all possible compact conformations of such short chains (Dill 1999). The model stresses the hydrophobic collapse as the main feature of protein folding and was used in studies of protein-folding thermodynamics, stability, and folding pathways. Many modifications of this type of model have been proposed and studied in great detail (Sun et al. 1995; Kolinski and Skolnick 1996; Chen et al. 2004; Abkevich et al. 1994, 1996; Sali et al. 1994; Li et al. 1996; Micheletti et al. 1998). The simple polymer approach to protein folding has proven to be very productive: through these models the most general features of protein folding have become better understood. In spite of many successes, the cubic lattice hydrophobic-polar (HP) models (and closely related models) have some intrinsic shortcomings that are difficult to ignore. First of all, the notion of secondary structure, so important in real proteins, is quite unclear in these models. Second, the cube geometry has a peculiar pattern of exposed and buried residues. The 27-mer cube has eight highly exposed corner residues, and obviously, unlike in real proteins, all of them are far apart from each other in space. There is only one completely buried residue, in the center of the cube. In contrast, a typical protein domain has about half of its residues buried inside. These shortcomings can be remedied to some extent by a different definition of interactions.
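The burial pattern of the 27-mer cube and the HP contact energy can be illustrated with a short sketch. It assumes the common convention that only H-H contacts between lattice neighbors that are not adjacent along the chain contribute, each with the same energy `eps`; the function names and the value of `eps` are hypothetical, and specific published variants of the model may define the interactions differently.

```python
from collections import Counter
from itertools import product

# The six nearest-neighbor steps of the simple cubic lattice.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def neighbor_counts(side=3):
    """Histogram of the number of occupied lattice neighbors for sites of a filled cube."""
    sites = set(product(range(side), repeat=3))
    counts = Counter()
    for s in sites:
        n = sum(1 for d in STEPS if tuple(a + b for a, b in zip(s, d)) in sites)
        counts[n] += 1
    return counts


def hp_energy(path, sequence, eps=-1.0):
    """Sum eps over H-H pairs that are lattice neighbors but not bonded along the chain."""
    index = {site: i for i, site in enumerate(path)}
    energy = 0.0
    for i, site in enumerate(path):
        if sequence[i] != "H":
            continue
        for d in STEPS:
            j = index.get(tuple(a + b for a, b in zip(site, d)))
            # j > i + 1 counts each pair once and skips chain neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy += eps
    return energy


cube = neighbor_counts(3)

# A made-up four-residue chain folded into a square, used only to exercise hp_energy.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
```

For the 3×3×3 cube, `cube` groups the 27 sites by the number of occupied neighbors: the eight corners have three, the single central site has six, and the remaining edge and face sites have four and five.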
Recent studies by Kaya and Chan (2002) have shown that in order to reproduce true two-state all-or-none folding transitions, more than two types of amino acids need to be included in the model sequences. Their findings also indicated the high significance of the interplay between short-range and long-range interactions. This is in qualitative agreement with the results from the diamond lattice models described in the previous section. Recently, the problem of the minimal model of protein structure and protein folding has been revisited (Pokarowski et al. 2003, 2005). Chains restricted to the face-centered cubic (fcc) lattice were used to represent protein conformational space. This lattice has a higher coordination number, z = 12, and allows for more flexibility than do other simple lattices. Thus, the lattice anisotropy effects are perhaps less severe. Moreover, this lattice allows for the crude, although not trivial, representation of all basic protein-folding motifs: β-type, α-type, and mixed α/β motifs. Representatives of all these motifs have been designed and studied in computational experiments. It was assumed a priori that a minimal model might require three types of potentials to mimic the complex network of molecular interactions in real proteins. In agreement with previous findings, the short-range (sequence-dependent local conformational stiffness) and the long-range contact interactions (with two types of residues, polar and hydrophobic) were implemented. Additionally, the effect of the main-chain hydrogen bonds has been taken into account in the form of a directional component in the pairwise potentials. In the
simplest form in the β-type models, it seems to be enough to make the polar-polar (PP) interactions orientation dependent. Indeed, in globular proteins the contacting polar side groups on the surface of a globule are almost always approximately parallel. Without the explicit model of the side groups, their hypothetical orientation can be determined easily from the mutual orientation of the two-segment fragments of the interacting nearest neighbors. A more general definition of the ersatz of the hydrogen bonds could be designed, which has the same meaning and the same effect on the system behavior. Let us focus on the example of the relatively complex Greek-key motif of a two-sheet, six-stranded β-barrel. In order to perform detailed analysis of the system thermodynamics, the Replica Exchange Monte Carlo (REMC) sampling method (Hukushima and Nemoto 1996), using a carefully designed set of local conformational modifications, has been combined with the histogram analysis of the density of states (Ferrenberg and Swendsen 1989). A large number of simulations have been performed spanning a wide range of the relative strength (various scaling factors) of the short-range, pairwise hydrophobic, and orientation-dependent polar group interactions. The low-energy structures, including the putative native-like structure and a set of partly folded structures, some of them near-native, were extracted from the REMC pseudo-trajectories. For each one of these structures, its potential energy could be calculated as a function of the scaling factors for particular types of interactions. The assumption that the native structure potential energy has to be the lowest leads to a set of inequalities, with the interaction scaling parameters as the free variables. Solution of these inequalities determines a set of “good” parameters for the model. An important result of such analysis is that all energy parameters must have non-zero contributions. 
Otherwise, the native structure would not have the minimum energy. Within the set of allowed interaction parameters, there are many possibilities. This parameter space has been explored in additional long simulations, and the level of folding cooperativity has been estimated for every set of scaling parameters from a relatively dense grid within the allowed subspace. The highest, essentially purely two-state, cooperativity has been observed for the system with relatively strong pairwise interactions (hydrophobic and polar, orientation dependent) and a moderate short-range conformational stiffness. The same has been observed for other structural motifs studied. Thus, it has been proven that the proposed model is a minimal one – the three types of interactions are necessary for the protein-like uniqueness of the native structure and the highly cooperative all-or-none folding transition. Interestingly, all designed motifs exhibited some degeneracy of the native structure. For instance, for the Greek-key barrel model, 20 structures have exactly the same topology and exactly the same patterns of interactions for all components of the model force field. The only differences were geometrical details, including mirror-image structures. This is probably physical (except for the mirror-image structures) – there are fluctuations in the native structure of real proteins, and there are known examples where mobile parts contribute to the entropic stabilization of the native state. The simulations for various motifs have shown that a higher degeneracy of the native state leads to a higher cooperativity of the folding transition. Due to the higher entropy of the globular state, its free energy is lower and consequently the free energy gap between the
globular state and the manifold of random structures is larger. This leads to a very clear two-state behavior of the model system (Pokarowski et al. 2003, 2005). The minimal protein-like models capture the most general physics of globular protein folding. Nevertheless, they are only generic models, which are of quite limited use for addressing the more detailed problems of specific folds and protein interactions. Possibly, the fcc model described above, with a larger alphabet for the amino acid sequences, can be used for the crude modeling of specific proteins, although the expected accuracy would be rather low; only the overall topology might be reproduced correctly.
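The selection of "good" scaling factors from a set of inequalities, as described in this section, can be sketched as follows. The sketch assumes that the total energy is a weighted sum of three unscaled components (short-range, hydrophobic, and orientation-dependent polar), so that each competing structure yields one linear inequality in the weights. All numbers below are invented toy values chosen only to make the example run; they are not taken from the cited simulations, and the coarse grid scan stands in for whatever procedure the original authors used.

```python
from itertools import product


def energy(components, weights):
    """Total energy as a weighted sum of unscaled energy components."""
    return sum(c * w for c, w in zip(components, weights))


def native_is_lowest(native, decoys, weights):
    """True when the native structure is strictly lower in energy than every decoy."""
    e_native = energy(native, weights)
    return all(e_native < energy(d, weights) for d in decoys)


def scan_weights(native, decoys, grid):
    """Return all weight triplets on the grid that satisfy every inequality."""
    return [w for w in product(grid, repeat=len(native))
            if native_is_lowest(native, decoys, w)]


# Toy (invented) unscaled components: (short-range, hydrophobic, polar-orientational).
native = (-10, -20, -8)
decoys = [
    (-5, -20, -8),    # differs from the native only in the short-range term
    (-10, -15, -8),   # differs only in the hydrophobic term
    (-10, -20, -4),   # differs only in the polar-orientational term
    (-12, -18, -6),   # better local stiffness energy, worse pairwise terms
]

good = scan_weights(native, decoys, grid=(0.0, 0.5, 1.0, 2.0))
```

In this toy data set, every accepted triplet in `good` has three non-zero weights, because setting any one weight to zero makes the native structure degenerate with one of the decoys. The last decoy additionally requires the stiffness weight to be smaller than the sum of the two pairwise weights.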
1.5 High-Coordination Lattice Protein Models
For many reasons, lattice approaches to polymer and biopolymer modeling are very appealing. Conformational transitions can be rapidly calculated in the discrete space of the lattice. The energy landscape is smoother; many local energy barriers are eliminated due to the simplification of the interaction schemes. Moreover, energy calculations are much faster due to the discrete set of allowed distances and angles. This is especially true for proteins, where the interactions are actually rather complex. On the other hand, a higher geometrical accuracy is needed for the study of specific proteins and for protein structure predictions. For this reason, several lattice models of intermediate-to-high resolution were developed in the past, and some of them have proven to be rather effective tools for protein molecular modeling (Kolinski and Skolnick 1996; Kolinski 2004; Kolinski et al. 1995, 1996). In several studies of proteins and protein-like models, the three-dimensional “chess-knight” representation of the alpha-carbon trace was used (Kolinski et al. 1996). The chess-knight lattice is built upon a set of vectors of the type [2,1,0]. There are 24 such vectors: 6 permutations of the coordinates times 4 combinations of the signs. Due to the restrictions on the values of the planar angles of the alpha-carbon trace in real proteins, the number of possible orientations of a Cα–Cα virtual bond should be smaller and dependent on the orientation of the preceding and following bonds. The chess-knight representation of protein structures is more realistic than chains on simple lattices (compare Fig. 1.7 with Fig. 1.5). However, this model is still an intermediate between the protein-like models and models applicable to real proteins – the effects of the lattice anisotropy remain large.
For instance, the geometrical fidelity of a projection of short β-strands onto the chess-knight lattice depends on the orientation of the projected fragment with respect to the principal axes of the Cartesian coordinate system. Lattice effects are particularly harmful for the simulated dynamics of lattice systems – the relaxation processes could be significantly distorted. To overcome this problem, a modification of the 210 representation was proposed. The set of Cα-trace vectors has been expanded by adding vectors of the type [1,1,1] and [2,1,1]. The total number of allowed orientations is thereby increased to 56. Vectors of the type [2,0,0] were excluded for a technical reason relating to the convenience of coding the excluded volume using additional vertices of the underlying simple cubic lattice. The model becomes significantly more flexible, and the
Fig. 1.7 A short fragment of the “chess-knight” chain with side groups restricted to the lattice
effect of the lattice anisotropy decreases. In spite of rather non-physical fluctuations of the bond length, the overall accuracy and precision of the protein representation improves a lot. As a result, the model produces a plausible picture of polypeptide chain dynamics and enables the de novo prediction of simple low-resolution protein structures (Godzik et al. 1993). Obviously, the structure prediction algorithm requires a properly designed force field based on statistical potentials derived from known protein structures. Basic principles for the design of the interaction schemes for reduced models will be outlined later for a different lattice representation. Fluctuating bond models proved to be a milestone in lattice modeling of protein structures. Interestingly, in parallel the fluctuating bond concept was extensively employed in studies of generic polymeric systems (Carmesin and Kremer 1988). Several specific representations were developed. It has been proven that the dynamics of the fluctuating bond lattice models reproduces well the Rouse dynamics of the continuous space models. Flexibility and computational efficiency of the fluctuating bond models enabled the detailed study of the thermodynamics and dynamics of long polymers, including the extremely complex dynamics of multichain systems. These findings are important for protein modeling. They provide a justification for applications of the flexible lattice models in studies not only of protein structures but also of protein dynamics and protein-folding mechanisms (Kolinski and Skolnick 2004). Proteins are complex heteropolymers with 20 different side chains that are attached to the main-chain backbone, which is rather generic (with the exception of the proline residues), as in synthetic homopolymers or simple copolymers. Thus a satisfactory model of the main chain is just a starting point for an acceptable protein model. 
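The vector counts quoted above for the chess-knight lattice (24) and for its extended, fluctuating bond variant (56) can be checked by direct enumeration. The helper name `signed_permutations` is hypothetical; the sketch simply generates every distinct permutation and sign combination of a base vector.

```python
from itertools import permutations, product


def signed_permutations(base):
    """All distinct vectors obtained by permuting coordinates and flipping signs."""
    vectors = set()
    for perm in set(permutations(base)):
        for signs in product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return vectors


# Chess-knight lattice: vectors of the type [2,1,0].
knight = signed_permutations((2, 1, 0))

# Extended set: [2,1,0] plus vectors of the type [1,1,1] and [2,1,1].
extended = knight | signed_permutations((1, 1, 1)) | signed_permutations((2, 1, 1))

# Squared bond lengths present in each set, in lattice units.
knight_lengths = sorted({sum(c * c for c in v) for v in knight})
extended_lengths = sorted({sum(c * c for c in v) for v in extended})
```

The single squared length in `knight_lengths` against the three values in `extended_lengths` shows where the bond-length fluctuations of the extended model come from.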
Let us consider a more exact fluctuating bond model than the one described above (Kolinski and Skolnick 1994). In this model the main-chain backbone is also
reduced to the alpha-carbon trace. The number of backbone vectors is equal to 90. These vectors belong to the following set: {[3,1,1], ... [3,1,0], ... [3,0,0], ... [2,2,1], ...}. The amplitude of the bond fluctuations in this model is relatively small, and the lattice anisotropy effects are essentially negligible. The excluded volume of the main chain can be modeled in a convenient way. It is enough to associate with every lattice position of the alpha carbon the 18 closest points of the underlying simple cubic lattice. These 18 lattice vertices (plus the central one) are excluded to other alpha-carbon units, which are also clusters of 19 lattice points. Such lattice coding of excluded volume simplifies the simulation process immensely – main-chain overlaps can be detected at small computational cost. The main-chain discrete geometry provides a convenient reference frame for the definition of the side-chain positions. For each amino acid, a database of known protein structures can be scanned and a database of the observed side-chain rotamers created, assuming a certain level of resolution of the model. During the simulations every update of the main-chain conformation has to be associated with an update of the side-chain positions. This can be done efficiently with the help of “prefabricated” sets of allowed side-chain coordinates. The side chains can be modeled as single or multiple interaction centers. They can be restricted to the underlying lattice or can be off-lattice, although still defined in the lattice-bound reference frame. In the published applications of this type of fluctuating bond model, a single off-lattice sphere for rotamers was used. The model enabled reproducible de novo folding of several small proteins, with an accuracy of 2–4 Å with respect to their crystallographic structures after the best superposition with the computational models (Kolinski and Skolnick 1994).
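The lattice coding of excluded volume described above can be sketched as a set intersection test. This is a minimal illustration, assuming that the 18 closest points are the 6 neighbors of the type [1,0,0] and the 12 of the type [1,1,0]; the names `SHELL`, `cluster`, and `overlaps` are hypothetical, and the special treatment of units that are close along the chain, which the text does not specify, is left out.

```python
from itertools import product

# The 18 closest simple cubic lattice points around a center:
# 6 of the type [1,0,0] and 12 of the type [1,1,0].
SHELL = [d for d in product((-1, 0, 1), repeat=3)
         if 1 <= sum(c * c for c in d) <= 2]


def cluster(center):
    """The 19 lattice points (center plus 18 neighbors) occupied by one alpha-carbon unit."""
    points = {tuple(center)}
    points.update(tuple(a + b for a, b in zip(center, d)) for d in SHELL)
    return points


def overlaps(center_a, center_b):
    """Two units clash when their 19-point clusters share at least one lattice point."""
    return not cluster(center_a).isdisjoint(cluster(center_b))


origin = (0, 0, 0)
```

In a simulation one would typically keep a single occupancy table for the whole chain and test only the points of the moved unit, which is what makes the check cheap. With this coding, two units separated by [3,0,0] do not clash, whereas a separation of [2,0,0] does.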
The CABS (Cα–Cβ–side group) lattice-based model employs a high-resolution discretization of the polypeptide conformational space (Kolinski 2004). As in previously described models, the framework of a polypeptide chain representation is the alpha-carbon trace (see Fig. 1.8 for explanation of the CABS reduced
Fig. 1.8 Schematic drawing of a short fragment of the CABS model
representation). The alpha carbons are located on the vertices of a simple cubic lattice with the mesh size equal to 0.61 Å. The virtual bonds connecting the alpha carbons belong to the set of 800 vectors of the type v = [i, j, k]. The integer coordinates i, j, k are all possible triplets for which 29 ≤ |v|² ≤ 49. This set of vectors reproduces the Cα–Cα distance, equal to 3.78 Å, with fluctuations in the range of ±10%. Protein structures can be approximated with an accuracy of about 0.35 Å cRMSD (coordinate Root-Mean-Square Deviation) after the best superposition of the model Cα-trace with the corresponding coordinates of the experimental structure. An example is given in Fig. 1.9. The side groups are modeled by two centers of interactions: beta carbons and the centers of the remaining part of the side chain (where applicable). These are not restricted to the lattice, and their positions with respect to the backbone are derived from the statistics of known protein structures. The two most probable rotamers are defined for each residue. The excluded volume is modeled by a set of hard spheres centered on the alpha and beta carbons and in the middle of the Cα–Cα virtual bonds. The side groups are treated as soft spheres.
Fig. 1.9 High-resolution lattice representation of the alpha-carbon trace of a small globular protein (domain B of protein G) – the CABS model. The average accuracy with respect to the crystallographic coordinates is about 0.35 Å. The most probable coordinates of the side-chain united atoms are calculated based on the Cα-trace geometry and the statistics of high-resolution protein structures
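The set of CABS virtual-bond vectors follows directly from the condition on |v|² given above, so it can be enumerated as a quick check. The sketch below uses only the mesh size and the bounds stated in the text; the function and variable names are hypothetical.

```python
from itertools import product
from math import isqrt, sqrt

MESH = 0.61  # lattice spacing in angstroms, as stated in the text


def bond_vectors(lo=29, hi=49):
    """All integer vectors [i, j, k] with lo <= i*i + j*j + k*k <= hi."""
    r = isqrt(hi)
    return [v for v in product(range(-r, r + 1), repeat=3)
            if lo <= sum(c * c for c in v) <= hi]


vectors = bond_vectors()

# Bond lengths in angstroms spanned by the allowed vectors.
lengths = [MESH * sqrt(sum(c * c for c in v)) for v in vectors]
shortest, longest = min(lengths), max(lengths)
```

The enumeration returns 800 vectors, in agreement with the text, and the corresponding virtual-bond lengths run from about 3.28 Å to 4.27 Å.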
The force field of the CABS model consists of several components mimicking the real physical interactions in proteins (Kolinski 2004). The generic short-range conformational biases simulate characteristic protein-like chain stiffness. A set of sequence-specific potentials simulates the short-range conformational propensities. Directional potentials of interactions between alpha carbons and between the centers of Cα–Cα virtual bonds simulate the structure-ordering effect of the main-chain hydrogen bonds. Pairwise interactions between the side-group united atoms are “context-dependent,” i.e., they depend on the mutual orientation of the interacting side groups and on the conformations of the corresponding two-bond segments of the main chain. In an implicit way, the pairwise interactions take approximate account of the average effects of the surrounding solvent. This force field is knowledge-based – the statistical potentials of mean force are derived from structural regularities seen in known high-resolution protein structures (Skolnick
et al. 1997a; Kolinski 2004). It is perhaps worth noting that the CABS modeling tool performs no worse (and is computationally more effective) than a similar continuous-space reduced model (Boniecki et al. 2003). This shows that the high-resolution lattice approximations are free of lattice artifacts and, due to their computational efficiency, well suited for large-scale applications. Some practical applications of the CABS model are described in the next section. All the high-coordination lattice models described so far have focused on the design of a convenient representation of the main chain of polypeptides, which is subsequently “decorated” with the side chains, using the main-chain backbone as a reference frame. The SICHO (SIde CHain Only) model is based on a completely different concept (Kolinski et al. 2001). In this approach, the explicit lattice approximation uses the fluctuating bond framework for modeling the virtual chain connecting the positions of the centers of mass of polypeptide side groups. In contrast to the CABS model, in the SICHO model the positions of the main-chain atoms are defined in the reference frame of the pseudo-chain connecting the side groups. The SICHO concept is based on the fact that the packing of the side chains is probably the most sequence-specific property of globular proteins. In principle, the side-chain-based models should be computationally faster than the main-chain-based ones.
1.6 Protein Folding and Structure Prediction with Lattice Models
High-resolution SICHO and CABS models (and their clones) have been used in a variety of applications. These include ab initio structure prediction (Ortiz et al. 1999; Skolnick et al. 2001), study of protein dynamics, folding pathways and thermodynamics, prediction of protein structure from sparse experimental data and distant homology, and comparative structure modeling (Kolinski et al. 1999; Kolinski and Skolnick 2004; Pierri et al. 2008; Skolnick et al. 1997b, 1998, 2003). Ab initio folding with the lattice models is practical only for relatively small (say up to 150 residues) and topologically not too complex proteins. With the increasing size of a query protein the success ratio, as well as the accuracy of the produced structures, decreases. Small proteins (50–75 residues) can be folded to a resolution range approaching 2 Å, while for a 150-residue structure the accuracy would be nearer 3–6 Å, depending on fold complexity. It should be pointed out that the bottleneck of the accuracy for reduced models is not their reduced representation but rather the deficiencies of their force fields. The force fields of the reduced models are being continually updated, and in the future this should be the main factor leading to improvements in the algorithms' performance. There is a suggestive result partially confirming this. The CABS model has been used many times for comparative modeling, where in addition to the force field, the folding process has been guided by a set of spatial restraints extracted from structures of homologous (or structurally analogous) proteins. In these circumstances the resulting models could be even as good as the best crystallographic structures with a resolution of
1.0–1.5 Å, at least for the main-chain atoms (Kolinski and Bujnicki 2005). Every 2 years, the Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP) is organized in order to assess the current status of in silico protein structure prediction. The idea of the experiment is simple. Experimentalists provide sequences of amino acids of a number of proteins for which the structures are expected to be solved in the next few months. During this time theoreticians try to predict these structures and deposit them in the CASP databank. A variety of modeling techniques are used by researchers from around the world. Afterward, when the experimental structures become available, a group of experts assesses the quality of the predictions. Interestingly, the high-resolution lattice-based modeling methods systematically perform very well and are among the best methods for protein structure prediction. An example of the CASP6 prediction using the CABS lattice modeling is given in Fig. 1.10. The accuracy of this particular prediction is about 3.0 Å after the best superposition of the predicted structure onto the experimental one (Kolinski and Bujnicki 2005).
Fig. 1.10 Side-by-side view of the crystallographic structure and the predicted structure (the CABS model) for one of the CASP6 targets (target # T0223)
Finally, it should be noted that the reduced structures of SICHO or CABS models can be used as a meaningful starting point for the all-atom reconstruction and structure refinement (Feig et al. 2000; Kolinski and Bujnicki 2005). Such procedures are now pursued in several laboratories, opening the possibility for multi-scale modeling of large biomolecular systems.
References Abkevich VI, Gutin AM, Shakhnovich EI (1994) Free energy landscape for protein folding kinetics: Intermediates, traps, and multiple pathways in theory and lattice model simulations. J Chem Phys 101:6052–6062 Abkevich VI, Gutin AM, Shakhnovich EI (1996) Improved design of stable and fast-folding model proteins. Fold Des 1:221–230 Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
18
A. Kolinski
Binder K, Müller M, Baschnagel J (2004) Polymers models on the lattice. In: Kotelyanskii MJ, Theodorou DN (eds) Simulation methods for polymers, M. Dekker, New York, NY Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A (2003) Protein fragment reconstruction using various modeling techniques. J Comp Aid Mol Des 17:725–738 Branden C, Tooze J (1991) Introduction to protein structure. Garland, New York, NY Carmesin I, Kremer K (1988) The bond fluctuation method: a new effective algorithm for the dynamics of polymers in all spatial dimensions. Macromolecules 21:2819–2823 Chen H, Zhou X, Chih YL, Chan GK (2004) Kinetic analysis of protein folding lattice models. Mod Phys Lett B 18:163–172 Creighton TE (1993) Proteins: structures and molecular properties. W. H. Freeman, New York, NY De Gennes PG (1979) Scaling concepts in polymer physics, 1st edn. Cornell University Press, New York, NY Di Cecca A, Freire JJ (2002) Monte Carlo simulation of star polymer systems with the bond fluctuation model. Macromolecules 35:2851–2858 Dill KA (1999) Polymer principles and protein folding. Prot Sci 8:1166–1180 Dill KA, Bromberg S, Yue K, Fiebig KM, Yee DP, Thomas PD, Chan HS (1995) Principles of protein folding – a perspective from simple exact models. Prot Sci 4:561–602 Dinner A, Sali A, Karplus M, Shakhnovich E (1994) Phase diagram of a model protein derived by exhaustive enumeration of the conformations. J Chem Phys 101:1444–1451 Dinner AR, Sali A, Karplus M (1996) The folding mechanism of larger model proteins: role of native structure. Proc Natl Acad Sci USA 93:8356–8361 Feig M, Rotkiewicz P, Kolinski A, Skolnick J, Brooks CL 3rd (2000) Accurate reconstruction of all-atom protein representations from side-chain-based low-resolution models. Proteins 41: 86–97 Ferrenberg AM, Swendsen RH (1989) Optimized Monte Carlo data analysis. Phys Rev Lett 63:1195–1198 Flory PJ (1953) Principles of polymer chemistry. 
Cornell University Press, New York, NY Flory PJ (1969) Statistical mechanics of chain molecules. Wiley, New York, NY Freire J (1999) Conformational properties of branched polymers: theory and simulations. Branched polymers II. Advances in polymer science, vol 143/1999. Springer, Berlin, pp 35–112 Frenkel D, Smit B (2001) Understanding molecular simulation. From algorithms to applications. Computational science series, vol 1, 2nd edn. Academic, New York, NY Godzik A, Kolinski A, Skolnick J (1993) De novo and inverse folding predictions of protein structure and dynamics. J Comp Aid Mol Des 7:397–438 Grest GS, Fetters LJ, Huang JS, Richter D (1996) Star polymers: experiment, theory, and simulation. Adv Chem Phys 104:67–163 Grosberg AY, Khokhlov AR (1994) Statistical physics of macromolecules. American Institutes of Physics Press, New York, NY Hsu HP, Nadler W, Grassberger P (2004) Scaling of star polymers with 1–80 arms. Macromolecules 37:4658–4663 Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and application to Spin Glass Simulations. J Phys Soc Jpn 65:1604–1608 Kaya H, Chan HS (2002) Origins of chevron rollovers in non-two-state protein folding kinetics. Phys Rev Lett 90:258104 Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371 Kolinski A, Skolnick J (2004) Reduced models of proteins and their applications. Polymer 45: 511–524 Kolinski A, Betancourt MR, Kihara D, Rotkiewicz P, Skolnick J (2001) Generalized comparative modeling (GENECOMP): a combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement. Proteins 44:133–149 Kolinski A, Bujnicki JM (2005) Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins Suppl 7(61):84–90
Kolinski A, Galazka W, Skolnick J (1996) On the origin of the cooperativity of protein folding: implications from model simulations. Proteins 26:271–287 Kolinski A, Milik M, Rycombel J, Skolnick J (1995) A reduced model of short range interactions in polypeptide chains. J Chem Phys 103:4312–4323 Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J (1999) A method for the improvement of threading-based protein models. Proteins 37:592–610 Kolinski A, Skolnick J (1996) Lattice models of protein folding, dynamics and thermodynamics. Molecular biology intelligence unit. Chapman & Hall, New York, NY Kolinski A, Skolnick J (1994) Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18:338–252 Kolinski A, Skolnick J, Yaris R (1986a) The collapse transition of semiflexible polymers. A Monte Carlo simulation of a model system. J Chem Phys 85:3585–3597 Kolinski A, Skolnick J, Yaris R (1986b) Monte Carlo simulations on an equilibrium globular protein folding model. Proc Natl Acad Sci USA 83:7267–7271 Kolinski A, Skolnick J, Yaris R (1987a) Does reptation describe the dynamics of entangled, finite length polymer systems? A model simulation. J Chem Phys 86:1567–1585 Kolinski A, Skolnick J, Yaris R (1987b) Dynamic Monte Carlo study of the conformational properties of long flexible polymers. Macromolecules 20:438–440 Kolinski A, Skolnick J, Yaris R (1987c) Monte Carlo studies on the long time dynamic properties of dense cubic lattice multichain systems. I. The homopolymeric melt. J Chem Phys 86: 7164–7173 Kolinski A, Skolnick J, Yaris R (1987d) Monte Carlo studies on the long time dynamic properties of dense cubic lattice multichain systems. II. Probe polymer in a matrix of different degrees of polymerization. J Chem Phys 86:7174–7180 Lau KF, Dill KA (1989) A lattice statistical mechanics model of the conformational and sequence spaces of proteins. 
Macromolecules 22:3986–3997 Li H, Helling R, Tang C, Wingreen N (1996) Emergence of preferred structures in a simple model of protein folding. Science 273:666–669 Likos CN (2006) Soft matter with soft particles. Soft matter 2:478–498 Micheletti C, Seno F, Maritan A, Banavar JR (1998) Protein design in a lattice model of hydrophobic and polar amino acids. Phys Rev Lett 80:2237–2240 Milik M, Kolinski A, Skolnick J (1990) Monte Carlo dynamics of a dense system of chain molecules constrained to lie near an interface. A simplified membrane model. J Chem Phys 93:4440–4446 Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J (1999) Ab initio folding of proteins using restraints derived from evolutionary information. Proteins 37:177–185 Pakula T (2004) Simulations of completely occupied lattice. In: Kotelyanskii MJ, Theodorou DN (eds) Simulation methods for polymers. M. Dekker, New York, NY Pande VS, Grosberg AY, Tanaka T, Rokhsar DS (1996) Pathways for protein folding: is a new view needed? Curr Opin Struct Biol 8:68–79 Pierri CL, De Grassi A, Turi A (2008) Lattices for ab initio protein structure prediction. Proteins 73:351–361 Pokarowski P, Droste K, Kolinski A (2005) A minimal protein-like lattice model: an alpha-helix motif. J Chem Phys 122:214915 Pokarowski P, Kolinski A, Skolnick J (2003) A minimal physically realistic protein-like lattice model: designing an energy landscape that ensures all-or-none folding to a unique native state. Biophys J 84:1518–1526 Post CB, Zimm BH (1979) Internal condensation of a single DNA molecule. Biopolymers 18:1487–1501 Rouse PE (1953) A theory of the linear viscoelastic properties of dilute solutions of coiling polymers. J Chem Phys 21:1272–1280 Rutkowska A, Kolinski A (2007) Why do proteins divide into domains? Insights from lattice model simulations. Biomacromolecules 8:3519–3524
Sali A, Shakhnovich E, Karplus M (1994) How does a protein fold? Nature 369:248–251 Sikorski A (1993) Monte Carlo study of the dynamics of star-branched polymers. Macromol Theory Simul 2:309–318 Sikorski A, Kolinski A, Skolnick J (1994) Dynamics of star branched polymers in a matrix of linear chains: a Monte Carlo study. Macromol Theory Simul 3:715–729 Sikorski A, Romiszowski P (1996) Motion of star-branched vs. linear polymer: A Monte Carlo study. J Chem Phys 104:8703–8712 Skolnick J, Jaroszewski L, Kolinski A, Godzik A (1997a) Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Prot Sci 6:676–688 Skolnick J, Kolinski A (1990) Dynamics of dense polymer systems: computer simulations and analytic theories. In: Advances in chemical physics, vol 78. Wiley, New York, NY Skolnick J, Kolinski A, Kihara D, Betancourt M, Rotkiewicz P, Boniecki M (2001) Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins Suppl 5:149–156 Skolnick J, Kolinski A, Ortiz AR (1997b) MONSSTER: a method for folding globular proteins with a small number of distance restraints. J Mol Biol 265:217–241 Skolnick J, Kolinski A, Yaris R (1988) Monte Carlo simulations of the folding of beta-barrel globular proteins. Proc Natl Acad Sci USA 85:5057–5061 Skolnick J, Kolinski A, Yaris R (1989) Dynamic Monte Carlo study of the folding of a six-stranded Greek key globular protein. Proc Natl Acad Sci USA 86:1229–1233 Skolnick J, Zhang Y, Arakaki AK, Kolinski A, Boniecki M, Szilágyi A, Kihara D (2003) TOUCHSTONE: a unified approach to protein structure prediction. Proteins 53:469–479 Smith WR, Lisal M (2002) Direct Monte Carlo simulation methods for nonreacting and reacting systems at fixed total internal energy or enthalpy. Phys Rev E 66:011104 Sun S, Brem R, Chan HS, Dill KA (1995) Designing amino acid sequences to fold with good hydrophobic cores. 
Protein Eng 8:1205–1213 Verdier PH, Stockmayer WH (1962) Monte Carlo calculations on the dynamics of polymers in dilute solution. J Chem Phys 36:227–235
Chapter 2
Multiscale Protein and Peptide Docking
Mateusz Kurcinski, Michał Jamroz, and Andrzej Kolinski
Abstract The number of functional protein complexes in a cell is larger by an order of magnitude than the number of proteins. Experimentally determined three-dimensional structures exist for only a very small fraction of these complexes. Thus, methods for the theoretical prediction of the structures of protein assemblies are extremely important for molecular biology. Association of two (or more) proteins always induces conformational changes of the individual components. In many cases, these induced changes are relatively small and involve mostly the side chains at the association interface. In such cases, rigid-body docking of two (or more) structures is quite successful. Quite frequently, however, the docking-induced conformational changes are significant. In such cases, prediction of the resulting structures is extremely challenging. Cases in which experimental structures of some components do not exist are even more difficult. In this chapter, we briefly overview the existing in silico docking methods and describe a multiscale strategy for unrestricted flexible docking of proteins and peptides.
A. Kolinski (✉), Faculty of Chemistry, University of Warsaw, Warsaw, Poland, e-mail: [email protected]
A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_2, © Springer Science+Business Media, LLC 2011
2.1 Introduction
In eukaryotic cells, an average protein can participate in several protein–protein (or protein–nucleic acid) complexes. The number of such complexes is larger by an order of magnitude than the number of proteins. Since the number of experimentally solved protein structures (about 60,000) is a small fraction of all proteins, the fraction of structurally annotated protein complexes is very small. Thus, the theoretical, in silico prediction of molecular structures of multimeric protein assemblies is one of the most important tasks of bioinformatics and computational biology (Wodak and Janin 1978; Valencia and Pazos 2002; Salwinski and Eisenberg 2003; Aloy and Russell 2004). There are relatively dependable computational methods for so-called rigid docking. These methods are applicable provided that the structures of the individual components are known and the conformational changes of these components induced upon docking are small. For proteins of known structure, the latter requirement is approximately fulfilled quite frequently (Ritchie 2008). The problem then reduces to generating a large number of possible poses according to the shape complementarity of the components and scoring the binding poses by the interaction patterns at the interfaces. The scoring task is by no means trivial, since at least some of the side chains at the interface certainly change their conformations with respect to those seen in the monomeric state or in different complexes. At the moment, knowledge-based statistical potentials, either atom-wise or united-atom-wise, seem to be the most productive in scoring protein–protein interactions.
The ultimate goal of protein docking could be described as follows: given just a set of sequences, find the structure (or structures) of the possible assemblies. In general, this may appear infeasible, at least at present, but the situation may not be so hopeless. First, with the constant progress in protein structure prediction, mainly via improvements in comparative modeling, it is now possible to predict monomeric structures for at least half of protein sequences, at least with moderate resolution. Second, provided that bioinformatics methods are developed for identifying structure fragments that may change upon docking, it should be possible to design methods for semi-flexible docking that account for the allowed conformational changes of fragments of the components’ structures.
A step toward such a docking methodology is described in this chapter. The method employs multiscale modeling of proteins and peptides and is based on the CABS modeling software. CABS is a high-resolution, coarse-grained protein modeling tool (the acronym stands for the united atoms representing a residue in a polypeptide chain: CA, the alpha carbon of the main chain; CB, the beta carbon; and S, the center of the side group). The CABS representation is based on a united-atom description of protein structure, in which a single residue is represented by several (three or four, depending on the size of the side chain) united atoms. The conformational space of CABS polypeptide chains is sampled by means of very efficient Monte Carlo schemes. Details of the CABS design are described in the first chapter of this book and in previous publications (Kolinski 2004). The spatial resolution of CABS allows for quite precise reconstruction of atomic details. This reconstruction process (Gront et al. 2007) is very fast for main-chain atoms and accurate to within a few tenths of an angstrom. The reconstruction of side groups is less accurate and depends on the achieved accuracy of the Cα-trace fold. Below, we briefly overview the techniques for rigid docking and for docking with highly limited flexibility of some structural elements, and then we outline the more flexible (and fully flexible) docking based on a multiscale approach in which the CABS-based structure assembly is the key step of the procedure for building a molecular complex.
2.2 Rigid Docking Procedures
Suppose we know the three-dimensional structures of two proteins that form a dimer, but we do not know how the two proteins are posed in the complex or which residues form the protein–protein interface. Finding the structure of the resulting complex is not a trivial task. Classical rigid docking consists of two or three fundamental steps (Vajda and Kozakov 2009). The first is the generation of a large number of binary structures. The second is scoring these structures according to shape complementarity and interactions at the interface. Finally, one may refine the best structures by adjusting the conformations of the side chains at the interface. Finding plausible poses in rigid docking requires a very effective search algorithm. Fast Fourier Transform makes it possible to reduce the six-dimensional problem to a one-dimensional problem. A number of algorithms have been developed for this purpose (Katchalski-Katzir et al. 1992; Vakser and Aflalo 1994; Vakser 1995; Mandell et al. 2001; Del Carpio-Muñoz et al. 2002; Chen et al. 2003; Carter et al. 2005; Kozakov et al. 2006; Sternberg et al. 1998). Alternatively, various geometric hashing procedures could be used (Fischer et al. 1995). The resulting poses, usually several thousand of them, need to be scored in order to produce a small number of plausible structures. Scoring functions span a wide range, from simple shape complementarity (Chen et al. 2003), through knowledge-based statistical potentials (Kozakov et al. 2006; Tobi and Bahar 2006; Zhang et al. 2005; Cerutti et al. 2005) and physics-based force fields (Koehl 2006; Sheinerman et al. 2000; Jiang et al. 2002), to data-driven docking supported by available biochemical information (Res and Lichtarge 2005; Res et al. 2005; Lichtarge et al. 1996; Dominguez et al. 2003; Nilges 1995; Anand et al. 2003; van Dijk et al. 2005).
It has also been noted that rigid docking could be achieved in a different way. Instead of computing the assembly structure ab initio, it is sometimes more effective to predict the binding interfaces of the interacting proteins and then perform docking in a limited conformational space. In some sense, this is yet another variant of data-driven docking (Jones and Thornton 1997; Burgoyne and Jackson 2006). The efficiency of various approaches to protein docking is systematically evaluated within the framework of the community-wide experiment Critical Assessment of PRedicted Interactions (CAPRI) (Carter et al. 2005; Janin et al. 2003).
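The FFT-based pose search mentioned in this section can be illustrated with a minimal sketch of the translational scan for one fixed relative orientation, in the spirit of grid-correlation methods. The grid size, the shell/core weights (+1 and -15), and the names `fft_translation_scan`, `receptor`, and `ligand` are assumptions made for this illustration only; they are not taken from any particular docking program, and a real search would repeat the scan over many rotations of the ligand.

```python
import numpy as np

def fft_translation_scan(receptor_grid, ligand_grid):
    """Score all cyclic translations of the ligand grid at once.

    scores[t] = sum_x receptor[x] * ligand[x - t], evaluated for every
    shift t through the correlation theorem instead of an explicit loop.
    """
    rec_ft = np.fft.fftn(receptor_grid)
    lig_ft = np.fft.fftn(ligand_grid)
    return np.real(np.fft.ifftn(rec_ft * np.conj(lig_ft)))

# Toy receptor (assumed weights): a one-cell surface shell scored +1
# around a core that penalises overlap with -15.
n = 8
receptor = np.zeros((n, n, n))
receptor[1:7, 1:7, 1:7] = 1.0
receptor[2:6, 2:6, 2:6] = -15.0

# Toy ligand: a 2x2x2 cube of occupied cells placed at the grid origin.
ligand = np.zeros((n, n, n))
ligand[0:2, 0:2, 0:2] = 1.0

scores = fft_translation_scan(receptor, ligand)
# One of the best translations (ties are possible on this symmetric toy).
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```

The scan costs on the order of N log N operations for N grid cells, whereas the explicit double loop over shifts and cells costs N squared; this is the source of the speed-up of FFT-based rigid docking.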
2.3 Flexible Docking
Usually, although not always, protein association induces conformational changes of the components (Bonvin 2006; Camacho and Vajda 2001; May and Zacharias 2005). In many cases, these conformational changes are essentially limited to the interface side chains (Andrusier et al. 2008). The RosettaDock algorithm is well suited to deal with such cases (Gray et al. 2003; Daily et al. 2005; Wang et al. 2005). The procedure starts from rigid docking and then the side chains are optimized using
either a rotamer library or free-space side-chain optimization. Such an approach has proven to be very successful in blind predictions within CAPRI (Wang et al. 2007; Schueler-Furman et al. 2005). Recently, Rosetta modeling technology has been applied to fully flexible docking, or rather “folding and docking” (Das et al. 2009), of small homo-oligomeric protein assemblies. The method utilizes available experimental data: nuclear magnetic resonance (NMR) chemical shifts and residual dipolar couplings (RDCs). A somewhat similar strategy for semi-flexible docking is adopted in the ATTRACT algorithm (the name comes from the attractive interactions of the interface residues) (Zacharias 2003; May and Zacharias 2007). The method employs a coarse-grained representation of side chains, and the docking procedure consists of two steps: rigid docking and optimization of the resulting poses, allowing for flexibility of the side chains of interface residues. The ATTRACT algorithm was also used in docking simulations allowing for backbone flexibility of the loop regions. It has been demonstrated that even limited flexibility improves the docking results for most of the tested cases. Small conformational changes induced by docking could be accommodated to some extent by reduced representations and/or coarse-grained potentials describing the interface residues. This is probably one of the reasons for the surprisingly good performance of docking procedures based on just shape complementarity and smoothed details of the surface. Recently it has been shown that a smoothed low-resolution representation of the surface residues leads to more consistent shape complementarity (Zhang et al. 2009). Another way to increase the recognition specificity of the interfaces is the use of multibody knowledge-based potentials. For instance, four-body statistical pseudo-potentials have proven to be useful in protein–peptide docking, allowing for full flexibility of the peptide moieties (Aita et al. 2010). Before their application in protein docking, four-body potentials had proven to be very effective in scoring protein decoys (Krishnamoorthy and Tropsha 2003; Feng et al. 2010).
In summary, while small, local conformational changes accompanying protein docking are relatively well handled by a variety of docking algorithms, large deformations of components are more difficult to predict. RosettaDock is one of the few exceptions, where de novo prediction of structures that are new compared to the known structures of the components is sometimes feasible. The multiscale approach described below, based on the CABS modeling software and bootstrapped with all-atom molecular mechanics, is a step toward fully flexible docking of proteins and peptides.
2.4 Multiscale Flexible Docking with CABS
CABS (the acronym stands for the united atoms representing a residue in a polypeptide chain: CA, the alpha carbon of the main chain; CB, the beta carbon; and S, the center of the side group) is a coarse-grained protein model. The Cα trace in CABS is restricted to a high-resolution cubic lattice with the lattice spacing set to 0.61 Å. Cα−Cα distances in CABS are allowed to fluctuate around 3.8 Å. An additional pseudo-atom is located in the center of
the Cα−Cα bond and supports a model of main-chain hydrogen bonds. The accuracy of a projection of a high-resolution protein structure onto the lattice is about half of the lattice spacing. The coordinates of beta carbons and side chains are not restricted to the lattice and are defined in the reference frame given by the Cα trace. Due to the lattice representation, computations of local conformational transitions of the model chains are extremely fast and, in most cases, are reduced to straightforward shuffling of integer numbers. Similarly, most of the interactions could be computed via simple references to large hashing tables. Such computations with CABS are about two orders of magnitude faster than would be possible for an otherwise equivalent continuous-space model (Boniecki et al. 2003). It should be pointed out that, due to the fine grid of the lattice representation, the model does not exhibit any lattice artifacts. The actual accuracy of the molecular models generated by CABS is lower than the resolution resulting from the lattice representation. The results of free modeling, when successful, are accurate within a low-resolution range of 2.5–5 Å. Comparative models are more accurate, and their accuracy depends on the quality of the templates from which the distance restraints between Cα atoms are extracted. The best models have an accuracy of about 1 Å. CABS allows for easy implementation of various restraints, not only Cα−Cα distances from templates but also restraints from sparse experimental data, such as chemical shifts, residual dipolar couplings, side chain–side chain contacts from mutagenesis, etc. This opens a convenient framework for the treatment of docking flexibility at various levels. The geometric fidelity of the CABS representation is sufficient for reasonably accurate all-atom reconstruction (Gront et al. 2007).
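The statement about the lattice projection accuracy can be checked with a small numerical sketch. It snaps arbitrary points to the nearest node of a cubic lattice with 0.61 Å spacing and measures the root-mean-square displacement. For independent rounding along each axis, the expected value is a·sqrt(3/12) = a/2, which is consistent with "about half of the lattice spacing". This is an assumption-laden illustration: the actual CABS projection must also keep consecutive Cα−Cα vectors within the allowed set, which is ignored here, and all function names are hypothetical.

```python
import math
import random

LATTICE_SPACING = 0.61  # Å, the value quoted in the text

def project_to_lattice(coords, spacing=LATTICE_SPACING):
    """Snap each (x, y, z) point to the nearest lattice node (integer indices)."""
    return [tuple(round(c / spacing) for c in point) for point in coords]

def lattice_to_cartesian(nodes, spacing=LATTICE_SPACING):
    """Convert integer lattice indices back to Cartesian coordinates in Å."""
    return [tuple(i * spacing for i in node) for node in nodes]

def rms_displacement(a, b):
    """Root-mean-square distance between corresponding points of a and b."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

# Random test points stand in for Calpha positions (assumption: no chain
# connectivity is modelled in this sketch).
rng = random.Random(0)
points = [tuple(rng.uniform(-20.0, 20.0) for _ in range(3)) for _ in range(20000)]
nodes = project_to_lattice(points)
rms = rms_displacement(points, lattice_to_cartesian(nodes))
```

Because the chain is stored as integer triples, a local move amounts to replacing a few integers, which is the "shuffling of integer numbers" referred to above.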
A good measure of the geometric fidelity of the CABS representation is an experiment in which a structure is projected onto the CABS lattice and its atomic details are subsequently reconstructed. The reconstruction consists of two stages: the first is a very fast rebuilding of the main chain and beta carbons, executed by the Backbone Building from Quadrilaterals (BBQ) program; the second could be side-chain fitting via the side-chain replacement with rotamer libraries (SCWRL) program by Dunbrack (Canutescu et al. 2003). The accuracy of such a reconstruction cycle is within a few tenths of an angstrom for the main-chain atoms and within 1.5–2 Å for the side-chain atoms, depending on the structure. The all-atom structures could be generated at any stage of the docking and scored by force fields other than CABS. Details of the CABS knowledge-based force field (Kolinski 2004) and a description of combinations of CABS with all-atom molecular mechanics (Kmiecik and Kolinski 2007, 2008; Kmiecik et al. 2007) could be found in earlier publications. Sampling protocols of CABS employ various Monte Carlo-based algorithms. When the folding mechanisms are of interest, simple simulated annealing or isothermal Monte Carlo dynamics could be appropriate. Since CABS conformational updating employs various local rearrangements controlled by a pseudo-random mechanism, the trajectories from such simulations represent solutions of a certain Master Equation of motion and thereby provide a coarse-grained picture of the system dynamics. In this respect, CABS differs from most other reduced models for structure assembly, such as Rosetta (Rohl et al. 2004). There are, however, reduced-space models enabling similar studies. The continuous models (like united
residues (UNRES); Ołdziej et al. 2005) allow for Molecular Dynamics simulations and, similarly to CABS, for Monte Carlo dynamics. When structure prediction alone is the goal, various multicopy MC algorithms are more effective than simulated annealing. Most docking experiments with CABS combine Replica Exchange Monte Carlo (REMC, also known as parallel tempering) (Hukushima and Nemoto 1996) with simulated annealing. During a typical simulation, a large number of replicas (50) are subject to slow annealing of the entire stack.
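The combination of replica exchange with annealing of the whole temperature stack can be sketched on a toy problem. The code below is not the CABS sampler: the one-dimensional energy function, the geometric temperature ladder, the cooling factor, and the default of 8 replicas (the text quotes 50) are all assumptions chosen to keep the example small. It only shows the three ingredients: local Metropolis moves in each replica, swaps between neighbouring temperatures, and slow cooling of the entire ladder.

```python
import math
import random

def energy(x):
    """Toy rugged 1-D landscape standing in for a coarse-grained force field."""
    return 0.05 * x * x + 1.0 - math.cos(3.0 * x)

def metropolis_accept(delta, rng):
    """Accept with probability min(1, exp(-delta))."""
    return delta <= 0.0 or rng.random() < math.exp(-delta)

def remc_with_annealing(energy, n_replicas=8, t_min=0.5, t_max=5.0,
                        n_cycles=200, sweeps_per_cycle=20, cooling=0.995, seed=1):
    rng = random.Random(seed)
    # Geometric temperature ladder, coldest replica first.
    temps = [t_min * (t_max / t_min) ** (i / (n_replicas - 1))
             for i in range(n_replicas)]
    states = [rng.uniform(-10.0, 10.0) for _ in range(n_replicas)]
    best = min(states, key=energy)
    trajectory = []  # pseudo-trajectory read from the lowest-temperature replica
    for _ in range(n_cycles):
        # 1. Local Monte Carlo moves within every replica.
        for i, temp in enumerate(temps):
            for _ in range(sweeps_per_cycle):
                trial = states[i] + rng.gauss(0.0, 0.5)
                if metropolis_accept((energy(trial) - energy(states[i])) / temp, rng):
                    states[i] = trial
        # 2. Attempt swaps between neighbouring temperatures.
        for i in range(n_replicas - 1):
            delta = ((1.0 / temps[i] - 1.0 / temps[i + 1])
                     * (energy(states[i + 1]) - energy(states[i])))
            if metropolis_accept(delta, rng):
                states[i], states[i + 1] = states[i + 1], states[i]
        trajectory.append(states[0])
        best = min([best] + states, key=energy)
        # 3. Anneal the entire stack of temperatures.
        temps = [temp * cooling for temp in temps]
    return best, trajectory

best_x, trajectory = remc_with_annealing(energy)
```

The swap criterion accepts an exchange with probability min(1, exp[(β_i − β_j)(E_i − E_j)]), so a low-energy state found at a high temperature migrates toward the cold end of the ladder, from which the pseudo-trajectory is recorded.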
2.4.1 Treatment of Flexibility
CABS facilitates various levels of docking flexibility. Schematically, different instances of docking could be outlined as follows:
A. Semi-flexible docking of two or more proteins of known structures.
B. Docking of a fully flexible (unrestrained folding) protein or peptide on a semi-flexible scaffold of other proteins.
C. De novo assembly (fully flexible, free folding) of a protein (peptide) complex.
Obviously, the success rate and accuracy of the resulting structures decrease from A to C. The unrestricted de novo assembly of a protein complex (C) is now feasible only for relatively small and structurally not too complex proteins. Various symmetry-derived restraints (as for homo-dimers, trimers, etc.) could be easily implemented (as was done before for RosettaDock), increasing the docking accuracy significantly. In semi-flexible docking (A), intra-protein restraints are read from the unbound structures and the corresponding distances are allowed to fluctuate around their unbound values. The poses could be generated by the CABS REMC in an unrestricted fashion, or the initial poses (structures placed at the starting replicas) could be obtained from various fast docking algorithms based on shape complementarity. To speed up assembly, the centers of gravity of the assembled molecules are subject to a weak generic attractive force acting at distances larger than the plausible estimated distance within the complex. In such procedures, the trivial flexibility related to different conformations of the interface side chains is approximately accounted for at the stage of the all-atom reconstruction. The flexibility of the interacting proteins does not need to be treated in a uniform fashion. It is easy to restrain just parts of each molecule, allowing the remaining portions to adjust freely during the docking.
Prediction of flexible fragments could be achieved in various ways, including structural comparison of different complexes of the proteins of interest, normal mode or Gaussian network analysis of these proteins, etc. The relative strength of the restraints could also be included in the input data for the docking. Docking of a fully flexible small protein (or a peptide) to a semi-flexible scaffold of another protein (a receptor) is almost always successful, without assuming a priori anything about the pose (except the penalty for large distances between the molecules) or the internal conformation of the free molecule.
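The restraint scheme described in this section can be sketched as follows. The functional forms are assumptions made for illustration (a flat-bottom harmonic penalty for intra-protein distances and a linear penalty on the distance between the centers of gravity); the text does not specify the forms used in CABS, and the names `build_restraints`, `restraint_energy`, and `com_attraction`, as well as the cutoff, tolerance, and strength values, are hypothetical.

```python
import math

def build_restraints(reference, rigid, cutoff=8.0, min_separation=2):
    """Collect (i, j, d0) for residue pairs that are both flagged rigid and
    lie within `cutoff` Å of each other in the unbound reference structure."""
    restraints = []
    for i in range(len(reference)):
        for j in range(i + min_separation, len(reference)):
            if rigid[i] and rigid[j]:
                d0 = math.dist(reference[i], reference[j])
                if d0 <= cutoff:
                    restraints.append((i, j, d0))
    return restraints

def restraint_energy(coords, restraints, tolerance=1.0, strength=1.0):
    """Flat-bottom penalty: distances may fluctuate by `tolerance` Å for free."""
    total = 0.0
    for i, j, d0 in restraints:
        excess = abs(math.dist(coords[i], coords[j]) - d0) - tolerance
        if excess > 0.0:
            total += strength * excess ** 2
    return total

def centroid(coords):
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

def com_attraction(coords_a, coords_b, d_contact, strength=0.01):
    """Weak pull that acts only when the two centers of gravity are farther
    apart than the plausible distance within the complex."""
    gap = math.dist(centroid(coords_a), centroid(coords_b)) - d_contact
    return strength * gap if gap > 0.0 else 0.0

# Six Calpha atoms on a line, 3.8 Å apart; the last two residues are flexible.
reference = [(3.8 * i, 0.0, 0.0) for i in range(6)]
rigid = [True, True, True, True, False, False]
restraints = build_restraints(reference, rigid)
```

Setting the `rigid` flags per residue, or passing a different `strength` for different parts of the chain, corresponds to restraining only parts of each molecule and to supplying the relative restraint strength with the input data.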
In principle, the interactions between the interface amino acids do not need to be the same as the intra-protein interactions. The interactions between the side chains in the CABS model are described by statistical pseudo-potentials derived from regularities seen in known structures. These potentials are context dependent (accounting for the mutual orientations of the side chains and the local conformations of the main chain). Thus, the potentials account in an implicit way for complex multi-body packing effects. The averaged solvent effect is also encoded in these potentials. Interestingly, potentials derived separately for the interfaces in known complexes do not differ significantly from the potentials derived for monomeric proteins. In the example docking simulations described in this chapter, generic CABS potentials were used.
2.4.2 Example of Peptide Docking to Receptor Protein
Small peptides frequently act as coactivators for larger proteins. Below we describe a typical example of such a docking experiment (Kurcinski and Kolinski 2007). The receptor protein is the vitamin D receptor (or rather the receptor part of the entire protein). The receptor is treated in a semi-flexible fashion. A large number of Cα−Cα intra-molecular distances are extracted from the crystallographic structure of the protein. Additionally, the secondary structure defined according to the Define Secondary Structure of Proteins (DSSP) assignment is a part of the simulation input. The assigned secondary structure provides a bias toward the proper short-range geometry and favors the hydrogen-bonding patterns consistent with this secondary structure. The simulation set-up for the receptor is schematically depicted on the left-hand side of the flow chart given in Fig. 2.1. During the simulation, the receptor structure oscillates around its native structure. The initial set of 50 replicas for the REMC simulations is generated by replication of the receptor structure with peptide chains randomly placed near the protein surface. The internal conformation of the peptide and its location with respect to the receptor are both selected in a random fashion. The set-up for the peptide is illustrated in the top right-hand part of the flow chart. The starting replica with the superimposed receptor structure is shown in the top central panel of the flow chart. The main part of the docking simulations is executed by the CABS algorithm. CABS produces a large number of conformations, stored in a pseudo-trajectory read from the lowest-temperature replica. Typically, the CABS output contains some thousands of structures. A single run generates millions of states and requires several hours on a single Linux computing unit.
The structures stored in the pseudo-trajectory are subject to a clustering procedure (hierarchical clustering or K-means clustering). In the example illustrated here, there is only one well-defined cluster of solutions, containing the majority of the structures. The remaining structures are scattered in an apparently random fashion. The main cluster is very dense, with nicely superimposed receptor structures. Only the end residues and some loop residues deviate a little (0.5–1.0 Å) from the mean structure, which is almost identical to the crystallographic structure. The cloud of the peptide structures is also very well defined (bottom, right-hand panel of Fig. 2.1), with
Fig. 2.1 Flow chart of multiscale hierarchical peptide–protein docking. See the text for details
the root-mean-square dispersion below 1 Å. The centroid structure from the main cluster provides a scaffold for the all-atom reconstruction (left-hand panels) of the complex. The reconstructed structure is optimized with an all-atom force field and refined by Molecular Dynamics. The final structures obtained in such test docking are of crystallographic resolution. In several tests of peptide docking to various receptors, the proper pose has always been found within the main cluster of solutions. For longer peptides (25–30 amino acids), the resulting coordinates of the end residues of the peptide were usually of lower accuracy (2–3 Å), although the interface contact maps were always predicted (or rather postdicted) with high accuracy.
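The selection of a representative structure from the pseudo-trajectory can be sketched as follows. This is a minimal illustration using single-linkage hierarchical clustering cut at a fixed RMSD threshold, followed by picking the medoid of the largest cluster as its "centroid structure". It assumes that all frames are already superimposed on the receptor, so that a plain coordinate RMSD is meaningful; the cutoff value and the function names are hypothetical, and production work would normally rely on an established clustering library.

```python
import math
from collections import defaultdict

def rmsd(a, b):
    """Coordinate RMSD without superposition (frames assumed pre-aligned)."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def single_linkage_clusters(frames, cutoff):
    """Single-linkage clustering cut at `cutoff`: two frames share a cluster
    if they are connected by a chain of pairs closer than the cutoff."""
    parent = list(range(len(frames)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            if rmsd(frames[i], frames[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(frames)):
        clusters[find(i)].append(i)
    return sorted(clusters.values(), key=len, reverse=True)

def medoid(frames, members):
    """Index of the member with the smallest summed RMSD to the others."""
    return min(members, key=lambda i: sum(rmsd(frames[i], frames[j]) for j in members))

# Toy pseudo-trajectory: five near-identical peptide poses and two outliers.
base = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifts = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 25.0]
frames = [[(x + s, y, z) for x, y, z in base] for s in shifts]

clusters = single_linkage_clusters(frames, cutoff=0.5)
representative = medoid(frames, clusters[0])
```

In this toy set the five closely spaced poses form the largest cluster and the two displaced poses remain singletons, mirroring the situation of one dense cluster plus scattered structures described above.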
2.4.3 Protein–Protein Docking
The methodology described in the previous sections could be used for protein–protein docking where one or two (or more) proteins could be treated in a fully
flexible fashion, without assuming anything about their structures within the complex. Good results of fully flexible, unrestricted docking could be expected only for relatively small and structurally not too complex proteins, of the size of the Rop homo-dimer, which consists of two antiparallel long-helical hairpins, or the crambin pseudo-dimer. For larger proteins, properly folded complex structures are not always obtained. At the present stage of the CABS methodology, semi-flexible docking simulations are more dependable; in these, at least parts of the modeled structures are controlled by weaker or stronger intra-molecular restraints derived from unbound structures or from different complexes of the proteins of interest. Two examples of semi-flexible docking results are illustrated in Figs. 2.2 and 2.3. In both cases, the structures of the larger proteins in the complex were strongly restrained to their unbound native structures. Docking simulations modified these structures very little (deviations in the range of 0.3–0.9 Å). This is very close to the structural differences seen between the individual proteins in the complexes and their unbound structures. The structures of the larger proteins are shown in gray. The second components of the complexes had higher flexibility: the restraints were much weaker, allowing for large fluctuations in the range of 5–10 Å. In both cases, near-native structures were found in the largest clusters. The resulting poses are qualitatively correct, although the details of the internal structures of these proteins have several errors. For easy comparison, Figs. 2.2 and 2.3 show both experimental (green) and calculated (red) structures. The drawings were made assuming the best superimposition of the larger proteins in the complex. The resulting superimposition of the second proteins illustrates the sum of the errors of the pose and the errors of the internal coordinates.
The sum of these errors (the coordinate root-mean-square deviation of the smaller protein after the best superimposition of the larger protein) is 2.6 Å in the first case and 3.8 Å in the second. Thus, qualitatively correct poses have been predicted (nothing has been assumed about the mutual orientation
Fig. 2.2 Structure obtained from docking procedure. The “receptor protein” (PDB code 1ppn) shown in gray, “ligand” protein (PDB code 2oct) shown in green in the crystallographic structure (PDB code 1stf), and in red for the final model. See the text for details
Fig. 2.3 Structure obtained from docking procedure. The “receptor protein” (PDB code 2hnt) shown in gray, “ligand” protein (PDB code 5hir) shown in green in the crystallographic structure (PDB code 4htc), and in red for the final model. See the text for details
of the components), although the structural details have been distorted, especially in the second case. The initial 50 replicas for the REMC simulations were generated by the FTDock program.
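The error measure used in this section, the RMSD of the smaller protein after the best superimposition of the larger one, can be sketched with the standard Kabsch algorithm. The coordinates below are synthetic and only demonstrate the calculation; the function names are hypothetical, and the arrays are assumed to hold matching atoms (for example Cα atoms) in the same order in the model and in the reference complex.

```python
import numpy as np

def kabsch(mobile, target):
    """Rotation R and translation t minimising |(mobile @ R.T + t) - target|."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    covariance = (mobile - mobile_center).T @ (target - target_center)
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = target_center - rotation @ mobile_center
    return rotation, translation

def ligand_rmsd_after_receptor_fit(model_rec, model_lig, ref_rec, ref_lig):
    """Superimpose the model on the reference using receptor atoms only,
    then report the RMSD of the ligand atoms."""
    rotation, translation = kabsch(model_rec, ref_rec)
    moved = model_lig @ rotation.T + translation
    return float(np.sqrt(np.mean(np.sum((moved - ref_lig) ** 2, axis=1))))

# Synthetic reference complex (random points stand in for real coordinates).
rng = np.random.default_rng(0)
ref_receptor = rng.normal(size=(10, 3)) * 10.0
ref_ligand = rng.normal(size=(5, 3)) * 5.0 + np.array([20.0, 0.0, 0.0])

# "Model": the ligand is misplaced by 3 Å, then the whole complex is
# rotated and translated arbitrarily.
angle = np.deg2rad(40.0)
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
shift = np.array([5.0, -3.0, 2.0])
model_receptor = ref_receptor @ rot.T + shift
model_ligand = (ref_ligand + np.array([3.0, 0.0, 0.0])) @ rot.T + shift

ligand_rmsd = ligand_rmsd_after_receptor_fit(
    model_receptor, model_ligand, ref_receptor, ref_ligand)
```

Because only the receptor is used for the fit, the reported value combines the error of the ligand pose with the error of its internal coordinates, as noted in the text.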
2.5 Perspectives
The problem of in silico flexible docking, especially in cases where the docking-induced conformational changes are large, is far from being solved. Nevertheless, there are numerous encouraging small steps toward a partial solution of this problem. Multiscale procedures, in which flexible docking is performed using various coarse-grained protein models and the resulting poses are then refined by more detailed molecular mechanics, seem to be very promising. Here, we described combinations of the CABS reduced-space modeling methodology with all-atom refinements applied to flexible and semi-flexible protein–protein and protein–peptide docking. The method is now mature enough for large-scale predictions of protein interactomes. In large-scale applications, it may be necessary to introduce a pre-screening phase employing fast docking procedures based on shape complementarity. At present, the described method is limited to proteins and peptides. An extension to nucleic acids will require development of their coarse-grained representation, consistent with the CABS representation of proteins. The treatment of small ligands within such multiscale docking procedures also requires significant extensions of the knowledge-based force fields, and this is still an open problem.
Another possible application of the outlined method concerns the assembly mechanisms of protein complexes. The CABS sampling techniques and force field enable meaningful simulations of folding pathways (see Chapter 12). Extension of the method to multimeric assemblies is straightforward, although it will require larger computing resources because of the higher complexity of the problem. Finally, we would like to note that coarse-grained protein models are potentially very interesting in the context of yet another class of docking problems, namely, fitting molecular structures into low-resolution experimental data from cryo-electron microscopy (EM) or similar techniques (Lindert et al. 2009; Orzechowski and Tama 2008; Jolley et al. 2008). This problem, however, is beyond the scope of this chapter.
References
Aita T, Nishigaki K, Husimi Y (2010) Toward the fast blind docking of a peptide to a target protein by using a four-body statistical pseudo-potential. Comput Biol Chem 34:53–62
Aloy P, Russell RB (2004) Ten thousand interactions for the molecular biologist. Nat Biotechnol 22:1317–1321
Anand GS, Law D, Mandell JG, Snead AN, Tsigelny I, Taylor SS, Ten Eyck LF, Komives EA (2003) Identification of the protein kinase A regulatory RIalpha-catalytic subunit interface by amide H/2H exchange and protein docking. Proc Natl Acad Sci USA 100:13264–13269
Andrusier N, Mashiach E, Nussinov R, Wolfson HJ (2008) Principles of flexible protein–protein docking. Proteins 73:271–289
Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A (2003) Protein fragment reconstruction using various modeling techniques. J Comput Aid Mol Des 17:725–738
Bonvin AM (2006) Flexible protein–protein docking. Curr Opin Struct Biol 16:194–200
Burgoyne NJ, Jackson RM (2006) Predicting protein interaction sites: binding hot-spots in protein–protein and protein–ligand interfaces. Bioinformatics 22:1335–1342
Camacho CJ, Vajda S (2001) Protein docking along smooth association pathways. Proc Natl Acad Sci USA 98:10636–10641
Canutescu AA, Shelenkov AA, Dunbrack RL (2003) A graph-theory algorithm for rapid protein side-chain prediction. Prot Sci 12:2001–2014
Carter P, Lesk VI, Islam SA, Sternberg MJ (2005) Protein–protein docking using 3D-dock in rounds 3, 4, and 5 of CAPRI. Proteins 60:281–288
Cerutti DS, Ten Eyck LF, McCammon JA (2005) Rapid estimation of solvation energy for simulations of protein–protein association. J Chem Theory Comput 1:143–152
Chen R, Li L, Weng Z (2003) ZDOCK: an initial-stage protein-docking algorithm. Proteins 52:80–87
Daily MD, Masica D, Sivasubramanian A, Somarouthu S, Gray JJ (2005) CAPRI rounds 3–5 reveal promising successes and future challenges for RosettaDock. Proteins 60:181–186
Das R, André I, Shen Y, Wu Y, Lemak A, Bansal S, Arrowsmith CH, Szyperski T, Baker D (2009) Simultaneous prediction of protein folding and docking at high resolution. Proc Natl Acad Sci USA 106:18978–18983
Del Carpio-Muñoz CA, Ichiishi E, Yoshimori A, Yoshikawa T (2002) MIAX: a new paradigm for modeling biomacromolecular interactions and complex formation in condensed phases. Proteins 48:696–732
Dominguez C, Boelens R, Bonvin AM (2003) HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125:1731–1737
Feng Y, Kloczkowski A, Jernigan RL (2010) Potentials ‘R’ Us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinformatics 11:92
Fischer D, Lin SL, Wolfson HJ, Nussinov R (1995) A geometry-based suite of molecular docking processes. J Mol Biol 248:459–477
Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D (2003) Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299
Gront D, Kmiecik S, Kolinski A (2007) Backbone building from quadrilaterals: a fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. J Comput Chem 28:1593–1597
Hukushima K, Nemoto K (1996) Exchange Monte Carlo method and application to spin glass simulations. J Phys Soc Jpn 65:1604–1608
Janin J, Henrick K, Moult J, Eyck LT, Sternberg MJ, Vajda S, Vakser I, Wodak SJ (2003) CAPRI: a Critical Assessment of PRedicted Interactions. Proteins 52:2–9
Jiang L, Gao Y, Mao F, Liu Z, Lai L (2002) Potential of mean force for protein–protein interaction studies. Proteins 46:190–196
Jolley CC, Wells SA, Fromme P, Thorpe MF (2008) Fitting low-resolution cryo-EM maps of proteins using constrained geometric simulations. Biophys J 94:1613–1621
Jones S, Thornton JM (1997) Prediction of protein–protein interaction sites using patch analysis. J Mol Biol 272:133–143
Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci USA 89:2195–2199
Kmiecik S, Gront D, Kolinski A (2007) Towards the high-resolution protein structure prediction. Fast refinement of reduced models with all-atom force field. BMC Struct Biol 7:43
Kmiecik S, Kolinski A (2007) Characterization of protein-folding pathways by reduced-space modeling. Proc Natl Acad Sci USA 104:12330–12335
Kmiecik S, Kolinski A (2008) Folding pathway of the b1 domain of protein G explored by multiscale modeling. Biophys J 94:726–736
Koehl P (2006) Electrostatics calculations: latest methodological advances. Curr Opin Struct Biol 16:142–151
Kolinski A (2004) Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 51:349–371
Kozakov D, Brenke R, Comeau SR, Vajda S (2006) PIPER: an FFT-based protein docking program with pairwise potentials. Proteins 65:392–406
Krishnamoorthy B, Tropsha A (2003) Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics 19:1540–1548
Kurcinski M, Kolinski A (2007) Steps towards flexible docking: modeling of three-dimensional structures of the nuclear receptors bound with peptide ligands mimicking co-activators’ sequences. J Steroid Biochem 103:357–360
Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342–358
Lindert S, Staritzbichler R, Wötzel N, Karakaş M, Stewart PL, Meiler J (2009) EM-fold: de novo folding of alpha-helical proteins guided by intermediate-resolution electron microscopy density maps. Structure 17:990–1003
Mandell JG, Roberts VA, Pique ME, Kotlovyi V, Mitchell JC, Nelson E, Tsigelny I, Ten Eyck LF (2001) Protein docking using continuum electrostatics and geometric fit. Protein Eng 14:105–113
May A, Zacharias M (2005) Accounting for global protein deformability during protein–protein and protein–ligand docking. Biochim Biophys Acta 1754:225–231
May A, Zacharias M (2007) Protein–protein docking in CAPRI using ATTRACT to account for global and local flexibility. Proteins 69:774–780
Nilges M (1995) Calculation of protein structures with ambiguous distance restraints. Automated assignment of ambiguous NOE crosspeaks and disulphide connectivities. J Mol Biol 245:645–660
Orzechowski M, Tama F (2008) Flexible fitting of high-resolution X-ray structures into cryoelectron microscopy maps using biased molecular dynamics simulations. Biophys J 95:5692–5705
Ołdziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kaźmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc Natl Acad Sci USA 102:7547–7552
Res I, Lichtarge O (2005) Character and evolution of protein–protein interfaces. Phys Biol 2:S36–S43
Res I, Mihalek I, Lichtarge O (2005) An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics 21:2496–2501
Ritchie DW (2008) Recent progress and future directions in protein–protein docking. Curr Protein Pept Sci 9:1–15
Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein structure prediction using Rosetta. Methods Enzymol 383:66–93
Salwinski L, Eisenberg D (2003) Computational methods of analysis of protein–protein interactions. Curr Opin Struct Biol 13:377–382
Schueler-Furman O, Wang C, Baker D (2005) Progress in protein–protein docking: atomic resolution predictions in the CAPRI experiment using RosettaDock with an improved treatment of side-chain flexibility. Proteins 60:187–194
Sheinerman FB, Norel R, Honig B (2000) Electrostatic aspects of protein–protein interactions. Curr Opin Struct Biol 10:153–159
Sternberg MJ, Gabb HA, Jackson RM (1998) Predictive docking of protein–protein and protein–DNA complexes. Curr Opin Struct Biol 8:250–256
Tobi D, Bahar I (2006) Optimal design of protein docking potentials: efficiency and limitations. Proteins 62:970–981
Vajda S, Kozakov D (2009) Convergence and combination of methods in protein–protein docking. Curr Opin Struct Biol 19:164–170
Vakser IA, Aflalo C (1994) Hydrophobic docking: a proposed enhancement to molecular recognition techniques. Proteins 20:320–329
Vakser IA (1995) Protein docking for low-resolution structures. Protein Eng 8:371–377
Valencia A, Pazos F (2002) Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 12:368–373
van Dijk AD, de Vries SJ, Dominguez C, Chen H, Zhou H, Bonvin AM (2005) Data-driven docking: HADDOCK’s adventures in CAPRI. Proteins 60:232–238
Wang C, Schueler-Furman O, Andre I, London N, Fleishman SJ, Bradley P, Qian B, Baker D (2007) RosettaDock in CAPRI rounds 6–12. Proteins 69:758–763
Wang C, Schueler-Furman O, Baker D (2005) Improved side-chain modeling for protein–protein docking. Prot Sci 14:1328–1339
Wodak SJ, Janin J (1978) Computer analysis of protein–protein interaction. J Mol Biol 124:323–342
Zacharias M (2005) ATTRACT: protein–protein docking in CAPRI using a reduced protein model. Proteins 60:252–256
Zacharias M (2003) Protein–protein docking with a reduced protein model accounting for side-chain flexibility. Prot Sci 12:1271–1282
Zhang C, Liu S, Zhu Q, Zhou Y (2005) A knowledge-based energy function for protein–ligand, protein–protein, and protein–DNA complexes. J Med Chem 48:2325–2335
Zhang Q, Sanner M, Olson AJ (2009) Shape complementarity of protein–protein complexes at multiple resolutions. Proteins 75:453–467
Chapter 3
Coarse-Grained Models of Proteins: Theory and Applications
Cezary Czaplewski, Adam Liwo, Mariusz Makowski, Stanisław Ołdziej, and Harold A. Scheraga
Abstract In this chapter, reduced (coarse-grained) protein models are discussed. Emphasis is given to those models which can be used in simulating the structure, thermodynamics, and dynamics of real proteins and are, at the same time, transferable. The coarse-grained force fields are introduced in a physics-based way as potentials of mean force of polypeptide chains in reduced representations, in which the secondary degrees of freedom have been averaged out. Based on this general formula, three categories of coarse-grained potentials are introduced: (i) statistical potentials derived from structural databases, (ii) potentials obtained by factorization of the parent potential of mean force, which enables us to split the system into smaller subsystems and derive each effective energy contribution independently, and (iii) potentials obtained by the force-matching method. Optimization of the potential function to achieve foldability is discussed. Applications of coarse-grained potentials to predict protein structures and simulate long-time protein dynamics are presented. We conclude that while, with the aid of massively parallel computers, coarse graining enables us to reach millisecond simulation timescales of real-size proteins, and case studies indicate that the results of these simulations are realistic, much work remains to be done to improve the force fields.
3.1 Introduction
There are two aspects to the protein-folding problem. These are the determination of the folding pathways and the resulting native structure. Both experimental and theoretical methods are used to solve this problem. This chapter is concerned only
A. Liwo (✉) Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland; Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, USA e-mail: [email protected]
This chapter is dedicated to the memory of Urszula Kozłowska, our long-time colleague and coworker. It is unfortunate that her early passing prevented her from participating in writing this chapter.
A. Kolinski (ed.), Multiscale Approaches to Protein Modeling, DOI 10.1007/978-1-4419-6889-0_3, © Springer Science+Business Media, LLC 2011
C. Czaplewski et al.
with the theoretical approach, given the amino acid sequence of the protein. The theoretical approach is based on the thermodynamic hypothesis enunciated by Anfinsen (1973), according to which the native protein adopts the conformation in which the protein plus the surrounding solvent is a system whose free energy is at the global minimum. There are two basic ingredients of the theoretical approach: formulation of an appropriate potential energy function with which to compute the interaction between every pair of atoms in the polypeptide chain, and development of an algorithm to identify its global minimum. Originally, the focus was on locating the global minimum of the potential energy, based on a large menu of procedures to search conformational space (Scheraga 1988, 1996; Scheraga et al. 2004), but entropic effects were later introduced (Liwo et al. 2007) to locate the global minimum of the free energy. The initial applications made use of an all-atom potential energy function but, with the computer resources available at that time, the largest structure that could be simulated was the 46-residue protein A, with one of the procedures in the aforementioned menu, namely Electrostatically Driven Monte Carlo (EDMC) (Vila et al. 2003). In order to make further progress, a hierarchical procedure was adopted (Liwo et al. 1993b; Scheraga et al. 2004), in which the initial search of conformational space was carried out with a coarse-grained United-Residue (UNRES) model of the polypeptide chain to locate the region of the global minimum of the UNRES potential energy. This was followed by conversion of the UNRES model to an all-atom one (Kaźmierkiewicz et al. 2002, 2003) and subsequent optimization of the all-atom model with the EDMC procedure.
Coarse-grained representations of proteins have long been of interest in theoretical simulations of protein structure and dynamics (Koliński and Skolnick 2004; Tozzini 2005; Colombo and Micheletti 2006; Clementi 2008).
The primary reason for this is that they involve much less computational effort than all-atom or united-atom representations of the polypeptide chain; this speeds up simulations of dynamics, folding pathways, and thermodynamics by four orders of magnitude compared to all-atom simulations with explicit solvent (Liwo et al. 2005) and, in turn, enables simulation of biomolecular processes at the millisecond timescale. Another application of coarse-grained models is prediction of protein structure from amino acid sequence, which is becoming increasingly important because of the growing gap between the number of known protein sequences and structures; this gap is not likely to diminish in the foreseeable future even with the improvement of experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryomicroscopy.
A number of comprehensive reviews of coarse-grained models applied to biomolecular and soft-material systems (Ayton et al. 2007a; Pincus et al. 2008) and specifically to proteins (Koliński and Skolnick 2004; Tozzini 2005; Colombo and Micheletti 2006; Gront et al. 2009) have been published recently. A book on coarse-grained models, edited by Voth (2008) with contributions from over 30 research groups, has also been published recently; it offers a comprehensive survey of the state of the art of predominantly physics-based coarse-grained models.
Protein Coarse-Grained Models
3.2 History of Coarse-Grained Protein Models
The history of coarse-grained models of proteins began with the pioneering work of Levitt and Warshel (1975), continued by Levitt (1976). These investigators used one or two centers to represent a side chain and three centers (Cα and two pseudo-atoms for the peptide group) to represent the backbone. The virtual-bond geometry was derived by averaging protein crystal data, and a large part of the potential energy function was obtained by Boltzmann-averaging the all-atom energy of model systems, while the parameters of hydrophobic/hydrophilic interaction potentials were based on amino acid solubility and partition coefficient data. The search procedure consisted of a series of local minimizations, performed in angular variables, with a pushing potential that prevented the system from returning to an already-found energy minimum, thereby allowing a larger-scale search of conformational space. The method was tried on bovine pancreatic trypsin inhibitor (BPTI) and found to produce protein-like structures when simulated folding was started from an extended chain; however, the native structure of the test protein was not reached. Although the Levitt and Warshel work was not developed further, it was the first attempt to construct a physics-based reduced model of polypeptide chains and laid solid foundations for the development of later physics-based models (Pincus and Scheraga 1977; Gerber 1992; Liwo et al. 1993b, 2008a; Wallqvist and Ullner 1994; Maupetit et al. 2007; Voth 2008; Chebaro et al. 2009), some of which (Liwo et al. 1993b, 2008a; Derreumaux 1997, 1999; Derreumaux and Mousseau 2007; Maupetit et al. 2007; Chebaro et al. 2009) were implemented successfully in energy-based protein structure prediction (Ołdziej et al. 2005) and ab initio protein-folding simulations (Derreumaux 1999; Liwo et al. 2005; Voth 2008). Potentials of this class are discussed in detail in Section 3.5.3.
At nearly the same time, Tanaka and Scheraga (1976) introduced the first knowledge-based (statistical) protein energy function. These investigators determined a residue–residue interaction matrix from the database of protein structures known at that time by applying the Boltzmann principle to the frequency of contacts between pairs of residues of given types. Two residues were considered to be in contact when the distance between their Cα atoms was less than 7 Å (Tanaka and Scheraga 1976). Miyazawa and Jernigan (1985) developed a residue–residue contact potential by using a more refined method, in which the random-flight-chain component was removed from the effective contact energies and the quasi-chemical approximation (Fowler and Guggenheim 1949) was used instead of the simple Boltzmann inversion of contact frequency. Later they revised this potential (Miyazawa and Jernigan 1996) by using a larger database of protein structures and more refined approximations (Miyazawa and Jernigan 1999). Burgess and Scheraga (1975) introduced a residue–residue five-state model to examine the conformational states of BPTI. Wang and Wang (1999) proposed reducing the 20×20 contact-potential matrix to one that distinguishes only five residue types. Structure-based contact potentials were implemented in simulations of protein packing (Gregoret and Cohen 1990), in on-lattice folding simulations (Covell 1992; Pincus et al. 2008), and in fold recognition (Maiorov and Crippen 1992).
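The Boltzmann inversion of contact frequencies described above can be illustrated with a minimal Python sketch. It is not the procedure of Tanaka and Scheraga (1976): the reference state (random pairing of residue types in proportion to how often each type takes part in contacts), the pseudocount, the minimum sequence separation of three residues, and all function names are assumptions made for illustration. Only the 7 Å Cα–Cα contact criterion comes from the text.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

CONTACT_CUTOFF = 7.0  # Angstrom; C-alpha contact criterion quoted in the text
KT = 1.0              # energies are reported in units of kT (assumption)


def count_contacts(structures, cutoff=CONTACT_CUTOFF, min_separation=3):
    """Count contacts per unordered pair of residue types.

    Each structure is a list of (residue_type, (x, y, z)) tuples holding
    C-alpha coordinates. Residues closer than `min_separation` along the
    chain are skipped (an assumption; chain neighbors are always close).
    """
    pair_counts = Counter()
    for chain in structures:
        for i in range(len(chain)):
            for j in range(i + min_separation, len(chain)):
                (type_i, xyz_i), (type_j, xyz_j) = chain[i], chain[j]
                if math.dist(xyz_i, xyz_j) < cutoff:
                    pair_counts[tuple(sorted((type_i, type_j)))] += 1
    return pair_counts


def contact_energies(pair_counts, pseudocount=1.0):
    """Boltzmann inversion: e_ij = -kT * ln(N_observed / N_expected).

    N_expected assumes that residue types pair at random, weighted by
    how often each type participates in any contact (assumed reference).
    """
    total = sum(pair_counts.values())
    ends = Counter()  # number of contact "ends" contributed by each type
    for (a, b), n in pair_counts.items():
        ends[a] += n
        ends[b] += n
    energies = {}
    for a, b in combinations_with_replacement(sorted(ends), 2):
        frac_a = ends[a] / (2 * total)
        frac_b = ends[b] / (2 * total)
        expected = total * frac_a * frac_b * (1 if a == b else 2)
        observed = pair_counts.get((a, b), 0)
        ratio = (observed + pseudocount) / (expected + pseudocount)
        energies[(a, b)] = -KT * math.log(ratio)
    return energies
```

Pairs that are in contact more often than the reference state predicts receive negative (favorable) energies, and under-represented pairs receive positive ones. The more refined treatments mentioned in the text, such as the quasi-chemical approximation, differ mainly in how the reference state is defined.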
Work on continuous knowledge-based potentials for off-lattice simulations was initiated by Kuntz et al. (1976), who constructed a minimal model with Cβ atoms as interaction sites. The pseudo-energy consisted of a series of quadratic penalty terms that accounted for violations of the allowed distances between Cβ atoms close in sequence, favored short distances between hydrophobic or oppositely charged residues, and penalized short distances between polar or like-charged residues as well as between polar or charged and hydrophobic residues. Additionally, native disulfide bond topology and the topology of cysteine residues in Fe–S clusters were enforced, if applicable. The residue–residue interaction penalty terms were scaled by interaction-matrix elements, which were assigned in an arbitrary manner according to the number of carbon atoms and the presence of charged or polar groups. When applied to rubredoxin and BPTI, the approach produced low-resolution structures with native-like topology.
Crippen and coworkers (Obatake and Crippen 1981; Crippen and Viswanadhan 1984) applied the Boltzmann principle to determine continuous residue–residue potentials from protein crystal structures and fitted them to a combination of a Lennard–Jones-like and a Gaussian functional form (Obatake and Crippen 1981) or to a Lennard–Jones-like functional form alone (Crippen and Viswanadhan 1984). However, with these potentials, near-native structures were obtained as stable local minima only when local energy minimization was started from the experimental structure of a protein. Yčas and coworkers (Yčas et al. 1978; Goel and Yčas 1979) and Wako and Scheraga (1982a,b) developed approaches in which distances between Cα atoms were restrained to average values determined from the Protein Data Bank (PDB) (Berman et al. 2000), depending on residue types and separation in sequence; therefore, the corresponding target functions could be considered knowledge-based potentials.
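A pseudo-energy built from quadratic distance penalties, of the general kind used in the restraint-based approaches above, might look like the following sketch. The flat-bottomed form, the force constant `k`, and the function names are assumptions chosen for illustration; the actual terms and parameters of Kuntz et al. (1976) or of the later Cα-restraint methods are not given in this chapter.

```python
def distance_penalty(d, lower, upper, k=1.0):
    """Quadratic penalty for a distance outside the allowed [lower, upper].

    Returns zero inside the allowed range (flat-bottomed form, assumed).
    """
    if d < lower:
        return k * (lower - d) ** 2
    if d > upper:
        return k * (d - upper) ** 2
    return 0.0


def pseudo_energy(distances, bounds, k=1.0):
    """Sum the penalties over all restrained site pairs.

    `distances` maps a pair (i, j) to its current distance;
    `bounds` maps the same pair to its (lower, upper) allowed range.
    """
    return sum(
        distance_penalty(distances[pair], lower, upper, k)
        for pair, (lower, upper) in bounds.items()
    )
```

Restraining a distance to an average value, as in the Cα-based target functions, corresponds to setting `lower` equal to `upper`. Scaling `k` per residue pair would play the role of the interaction-matrix elements mentioned in the text.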
These approaches had some success reproducing the native-like structure of BPTI, lysozyme, and staphylococcal nuclease. Another successful knowledge-based potential was developed by Sun (1993) who was able to locate the lowest-energy structures of melittin, apamin, and avian pancreatic polypeptide inhibitor by using a genetic algorithm with the experimental radius of gyration of the protein under study as a restraint. Following the above-mentioned works, knowledge-based potentials applicable to fold recognition were developed (Jones and Thornton 1993). By applying the Boltzmann principle and using protein crystal data, Sippl and coworkers (Hendlich et al. 1990; Sippl 1990a,b, 1993; Casari and Sippl 1992) developed continuous potentials of residue–residue interactions dependent on inter-residue distance, residue types, and residue separation in sequence. These potentials were able to recognize the folds of proteins and protein fragments. This approach was later continued by a number of investigators (Reva et al. 1997; Samudrala and Moult 1998) and also applied to protein complexes (Jiang et al. 2002) and protein–ligand complexes (Mitchell et al. 1999). More complex pseudo-energy functions for sequence threading, which include explicit local-interaction and solvation terms (Bryant and Lawrence 1993; Jones and Thornton 1993; Godzik et al. 1993; Miller et al. 1996), with parameters
optimized by using a set of training proteins (Meller and Elber 2001), were developed later. Buchete et al. (2003, 2004) developed distance-dependent and orientation-dependent statistical potentials for protein-fold recognition; they demonstrated that introducing orientation dependence greatly improves the capability of fold recognition. The orientation dependence was introduced with spherical harmonics. Statistical orientation-dependent side-chain–side-chain interaction potentials, in which each side chain is represented by an ellipsoid, were constructed by Liwo et al. (1997a) and Mukherjee et al. (2005). The statistical potentials based on the Boltzmann principle are discussed in Section 3.5.2.
A different approach to deriving knowledge-based potentials was developed by Wolynes and coworkers (Sasai and Wolynes 1990; Friedrichs et al. 2001; Goldstein et al. 1992a,b; Hardin et al. 2002; Eastwood et al. 2002, 2003). Instead of using the Boltzmann principle to determine the potentials, these investigators developed energy functions termed associative memory Hamiltonians, which linked the parameters of the potentials to the structures of a number of proteins from the database, with the strength of coupling depending on the homology between the sequence under study and a sequence from the database. This approach can be considered a sophisticated form of comparative modeling, in which sequence homology is incorporated into an energy function. Later versions of the method (Hardin et al. 2002; Eastwood et al. 2002, 2003) contain hydrogen-bonding and long-range side-chain contact energy terms. Initially, a polypeptide chain was represented as a sequence of Cα atoms; in later versions, the Cα atoms, backbone oxygen atoms, and side-chain centers were defined as interaction sites. The approach had some success in protein structure prediction (Hardin et al. 2002; Prentiss et al. 2006, 2008) and in Brownian dynamics simulations of protein folding (Wolynes 2005).
Until the early 1990s, both physics-based and knowledge-based coarse-grained potentials were constructed as sums of individual terms. It has, however, become clear that the simplifications inherent in coarse graining make the resulting energy function too inaccurate to fold proteins from an arbitrary starting conformation unless the energy-term parameters are adjusted. Work on this problem was initiated by Crippen and coworkers (Crippen and Viswanadhan 1987; Crippen and Snow 1990; Seetharamulu and Crippen 1991), who defined potential-function optimization as a linear programming problem in which the difference between the energy of the native structure and that of the lowest-energy non-native structure of a training protein was maximized. They used the energy-embedding method as a global optimization algorithm to search for low-energy structures. With this method, and with avian pancreatic polypeptide (APP) (Crippen and Snow 1990) or APP and crambin (Seetharamulu and Crippen 1991) as training proteins, the optimized potential energy function was able to locate the native-like structures of apamin and melittin as the lowest in energy. However, later work in which simulated annealing was implemented as a search method (Snow 1992) found non-native structures lower in energy, which shows the critical role of the quality of sampling in potential-function optimization. The physics-based justification of the methodology initiated by Crippen and coworkers was provided
by Wolynes and coworkers (Bryngelson and Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a,b) and Shakhnovich and coworkers (Sali et al. 1994a,b). Force-field optimization is discussed in Section 3.5.5.
The first successful application of coarse-grained potentials in ab initio protein folding was made by Koliński and Skolnick, who developed a high-resolution lattice model of proteins and a statistical potential which included side chain–side chain, local, hydrogen-bonding, and multibody terms accounting for the cooperative formation of backbone hydrogen bonds and side-chain contact patterns (Koliński and Skolnick 1992, 1994a; Koliński et al. 1993). A residue was represented by the Cα atom, which served as a center of hydrogen-bonding interactions, and a side-chain center. Using a Monte Carlo dynamics algorithm, these investigators performed successful folding simulations of protein A, repressor of primer (ROP) dimer, crambin (Koliński and Skolnick 1994b), and leucine zipper (Vieth et al. 1994), as well as of model α-helical (Rey and Skolnick 1993; Olszewski et al. 1996; Sikorski et al. 1998) and β-sheet proteins (Koliński et al. 1995), along with other folding simulations (Koliński et al. 1996, 2003; Koliński and Skolnick 1997). Later versions of the Koliński–Skolnick model are the side chain-only (SICHO) model (Koliński and Skolnick 1998), in which only the side-chain centroids are interaction sites, and the CABS (Cα, Cβ, side chain) model (Koliński 2004). These models served as a basis of the MONSSTER (Skolnick et al. 1997b) and TOUCHSTONE (Skolnick et al. 2003) approaches to protein structure prediction, which combine multiple sequence alignment, secondary-structure prediction, threading, and coarse-grained simulations. These approaches have been very successful in community-wide experiments on the Critical Assessment of Techniques for Protein Structure Prediction (CASP4–CASP8) (Skolnick et al. 2001, 2003; Zhou et al.
2007a) and have also been applied to structure determination from NMR data (Lee et al. 2006; Latek et al. 2007). The CABS model has recently been applied to study protein-folding pathways (Kmiecik et al. 2006; Kmiecik and Koliński 2007, 2008).
Following the concept of averaging out the less important degrees of freedom, the United-Residue (UNRES) force field has been developed by Liwo, Scheraga, and coworkers (Liwo et al. 1993a,b, 1997a,b, 1998, 2001, 2004b, 2007, 2008a,b; Ołdziej et al. 2003; Czaplewski et al. 2004b; Kozłowska et al. 2007; Chinchio et al. 2007; Makowski et al. 2007a,b,c, 2008; Rojas et al. 2007; Kozłowska et al. 2010a,b). In this model, the interaction sites are side-chain centers and peptide groups placed halfway between consecutive α-carbon atoms, while the Cα atoms assist only in the definition of the geometry and peptide-group orientation. The early version of UNRES (Liwo et al. 1993a,b) included only pairwise interactions between sites and virtual-bond-torsional terms. Backbone hydrogen-bonding interactions were described by mean-field-based analytical formulas obtained by Boltzmann-averaging the energy of peptide-group dipoles, with parameters derived by averaging the all-atom ECEPP/2 (Momany et al. 1975) energy; they correctly reproduced the directionality of backbone-group hydrogen bonding while retaining only one interaction site per peptide group. The interactions between the side chains were described by Lennard–Jones-like potentials with radii taken
from Levitt and Chothia (1976) and well depths computed from the Miyazawa–Jernigan (1985) interaction energies. Later (Liwo et al. 1997a), the side-chain interaction potentials were revised to include anisotropy by means of the Gay–Berne model and reparameterized using a database of 197 high-resolution non-homologous protein structures; the local-interaction parameters were also determined from the PDB (Liwo et al. 1997b), and the weights of the energy terms were determined by using a Z-score optimization approach (Liwo et al. 1997b). Subsequently (Liwo et al. 1998, 2001), the force field was defined rigorously as a restricted free-energy (RFE) function of a united peptide chain in which the secondary degrees of freedom were averaged out; this definition can also be used to derive any coarse-grained force field. Using this definition, Liwo et al. (1998, 2001) derived multibody terms, which are necessary for regular secondary structure to form spontaneously in united-residue simulations (Koliński et al. 1993). The multibody terms, as well as the other terms of the force field, were gradually parameterized using ab initio energy surfaces of model peptide systems (Ołdziej et al. 2003; Liwo et al. 2004b; Kozłowska et al. 2007, 2010a,b), so that all knowledge-based local-interaction terms have now been replaced with physics-based terms, and the knowledge-based side-chain interaction potentials are presently being replaced with potentials of mean force determined from all-atom simulations of models of pairs of side chains in water (Makowski et al. 2007a,b,c, 2008). A hierarchical method was developed for force-field optimization (Liwo et al. 2002, 2007; Ołdziej et al. 2004), which extends the original concept of Wolynes and coworkers (Bryngelson and Wolynes 1987; Hardin et al. 2002; Goldstein et al. 1992a) and Shakhnovich and coworkers (Sali et al. 1994a,b) to energy-ranking partially native states.
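Two of the criteria used in the force-field optimization work surveyed in this section, the energy gap between the native and the lowest-energy non-native structure and the Z-score of the native energy, can be made concrete with a short Python sketch. It only evaluates the two objectives for a given set of energy-term weights and does not solve the optimization problem itself. The linear form of the energy, the representation of each structure as a tuple of energy-term values, and all names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def energy(weights, terms):
    """Total energy as a weighted sum of energy-term values (assumed form)."""
    return sum(w * t for w, t in zip(weights, terms))


def energy_gap(weights, native, decoys):
    """Lowest non-native energy minus native energy; larger is better."""
    lowest_decoy = min(energy(weights, d) for d in decoys)
    return lowest_decoy - energy(weights, native)


def z_score(weights, native, decoys):
    """Native energy in units of the decoy energy spread; lower is better.

    Needs at least two decoys with different energies, otherwise the
    standard deviation is zero.
    """
    decoy_energies = [energy(weights, d) for d in decoys]
    return (energy(weights, native) - mean(decoy_energies)) / pstdev(decoy_energies)


if __name__ == "__main__":
    # Hypothetical term values, e.g. (hydrophobic contacts, polar contacts).
    native = (10.0, 2.0)
    decoys = [(6.0, 5.0), (8.0, 1.0), (4.0, 4.0)]
    for weights in [(-1.0, -1.0), (-1.0, 0.5), (1.0, 1.0)]:
        gap = energy_gap(weights, native, decoys)
        z = z_score(weights, native, decoys)
        print(weights, round(gap, 3), round(z, 3))
```

Because the energy is linear in the weights, each requirement that a non-native structure lie above the native one is a linear inequality in the weights, which is what allows the gap maximization to be posed as a linear programming problem. As the text points out, either score is only as reliable as the set of non-native structures against which it is computed.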
The UNRES force field was initially used for energy-based protein structure prediction formulated as a global minimum search and had considerable success in CASP experiments (Liwo et al. 1999; Pillardy et al. 2001a; Ołdziej et al. 2005). Recently (Liwo et al. 2005; Khalili et al. 2005a,b), a molecular dynamics (MD) algorithm was implemented in UNRES which extended the scope of the force field to study protein-folding pathways (Khalili et al. 2006), thermodynamic properties (Nanias et al. 2006; Liwo et al. 2007), and also to reformulate physics-based protein structure prediction as a search for the most probable conformational ensemble at temperatures below the folding temperature (Liwo et al. 2007). UNRES was also extended to simulate multichain proteins (Saunders and Scheraga 2003a,b; Rojas et al. 2007) and dynamic formation and breaking of disulfide bonds during protein folding and unfolding (Czaplewski et al. 2004b; Chinchio et al. 2007). A semi-coarse-grained model, with all-atom backbone and united-residue side chains, was developed by Derreumaux and coworkers (Derreumaux 1997, 1999; Wei and Derreumaux 2002). Side-chain interactions were represented by a 4–8 potential with well depths computed from the Miyazawa–Jernigan (1985) interaction energies and backbone long-range interactions were focused to reproduce hydrogen bonding. Later (Maupetit et al. 2007), the force field was enhanced in hydrogen-bonding correlation terms, which was motivated by the presence of multibody contributions to the free energy of desolvation of groups of hydrogen bonds.
42
C. Czaplewski et al.
The force field has been applied to Monte Carlo (Derreumaux 1997, 1999; Wei and Derreumaux 2002) and molecular dynamics (Derreumaux and Mousseau 2007; St-Pierre et al. 2008) folding simulations of peptides and small proteins; recent applications include prediction of protein structure (Maupetit et al. 2007) and a study of protein aggregation (Wei et al. 2007; Mousseau and Derreumaux 2008). The knowledge-based force field of Takada and coworkers (Takada 2001; Chikenji et al. 2001; Fujitsuka et al. 2004) is similar in spirit. More recently (Ayton et al. 2007b; Zhou et al. 2007b; Noid et al. 2008; Thorpe et al. 2008; Wang et al. 2009), a new general approach to coarse-graining has been developed by Voth and coworkers. This approach is based on matching the forces computed in the coarse-grained model of a given system to the mean forces computed by all-atom MD simulations for the same system and is discussed in more detail in Section 3.5.4. Preliminary applications to a model α-helical and a model β-hairpin peptide resulted in energy landscapes funneled to the respective native structures. The general-purpose coarse-graining scheme and force field of Monticelli et al. (2008), termed MARTINI, is similar in spirit; it is less rigorously derived but incorporates experimental thermodynamic and structural data in its parameterization. When applied to proteins, it needs the native secondary structure to work. At the time that the physics-based model of Levitt and Warshel (1975) appeared, Gō and coworkers (Taketomi et al. 1975; Ueda et al. 1978; Cieplak and Sulkowska, Chapter 8 of this book) laid the foundations of the now-popular structure-based models, usually termed Gō-like models (Das et al. 2005; Schug et al. 2008; Hills and Brooks 2009). These models overemphasize the interactions in the native state (in rigorous Gō-like models only native contacts result in attractive interactions, all other interactions being repulsive). These models exhibit minimal frustration of the energy landscape.
Use of Gō-like models is based on the assumption that the topology of the native state determines the major features of protein-folding pathways and kinetics. Other structure-based models contain potential energy terms biasing toward the native secondary structure (Brown et al. 2003; Brown and Head-Gordon 2004; Eskow et al. 2004). The elastic network coarse-grained models, developed relatively late (Bahar et al. 1997; Hinsen 1998; Atilgan et al. 2001; Tobi and Bahar 2005; Ahmed and Gohlke 2006; Chu and Voth 2007; Moritsugu and Smith 2008), can be considered another class of structure-based models. In these models, a polypeptide chain is treated as a network of Cα atoms connected by springs of equilibrium length corresponding to the distance in the experimental structure of the protein studied and force constants depending on distance in the experimental structure; in more advanced applications (Chu and Voth 2007), a double-well potential is imposed on each spring. In the simplest model (Bahar et al. 1997), the force constant is zero if the distance exceeds 7 Å and 1 otherwise (thus the spring network is described by the Kirchhoff adjacency matrix); in more sophisticated approaches, the force constants are Gaussians in distance (Hinsen 1998) or are calculated based on molecular dynamics simulations (Chu and Voth 2007; Moritsugu and Smith 2008). The elastic network models are used to study low-frequency motions of proteins, including thermal fluctuations (Bahar et al. 1997; Atilgan et al. 2001), domain motions (Hinsen 1998),
conformational changes upon folding and unfolding (Chu and Voth 2007; Moritsugu and Smith 2008), and protein–protein binding (Tobi and Bahar 2005). Finally, coarse-grained protein-like models that capture only general features of protein structures, such as the existence of a single native state, are at the other extreme. The oldest are the HP lattice models developed by Dill and coworkers (Chan and Dill 1989, 1990, 1991, 1994; Dill et al. 1995) in which only two types of beads, hydrophobic (H) and polar (P), are present with three contact interaction energy values and then the NPH lattice models developed by Hao and Scheraga (1994) with three types of beads: neutral (N), polar (P), and hydrophobic (H). These models proved invaluable in determining the origin of the general features of protein structures, such as compactness and formation of a hydrophobic core and a hydrophilic exterior, and cooperativity in folding. Protein-like lattice models with more complex interaction patterns were used by Shakhnovich and coworkers (Sali et al. 1994a; Shakhnovich 1997) and Thirumalai and coworkers (Camacho and Thirumalai 1996; Klimov and Thirumalai 1996a,b, 1998) to study foldability criteria.
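As a minimal illustration of the HP lattice models mentioned above, the following Python sketch evaluates the contact energy of a self-avoiding chain on a square lattice. The contact energies used here (−1 for an H–H contact and 0 otherwise), the two-dimensional lattice, and the function name `hp_contact_energy` are assumptions made for this example; they represent a common textbook choice and are not the parameter set of the cited papers.

```python
# Contact energies for a two-letter HP model.
# ASSUMPTION: only H-H contacts are favorable (-1); this is a common
# textbook choice, not the parameter set of the cited papers.
CONTACT_ENERGY = {("H", "H"): -1.0, ("H", "P"): 0.0,
                  ("P", "H"): 0.0, ("P", "P"): 0.0}


def hp_contact_energy(sequence, coords):
    """Energy of a chain on a square lattice; coords are integer (x, y) tuples."""
    if len(sequence) != len(coords):
        raise ValueError("one lattice node is required per bead")
    if len(set(coords)) != len(coords):
        raise ValueError("chain is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive beads must occupy neighboring nodes")
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip beads bonded along the chain
            (x1, y1), (x2, y2) = coords[i], coords[j]
            if abs(x1 - x2) + abs(y1 - y2) == 1:  # non-bonded lattice contact
                energy += CONTACT_ENERGY[(sequence[i], sequence[j])]
    return energy
```

For the four-bead sequence HPPH folded into a square, the two terminal H beads are in contact, whereas the extended chain has no contacts at all; even this toy model therefore favors compact conformations that bring hydrophobic beads together.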
3.3 Choice of Conformational Space Representation
The geometry of united-residue polypeptide chains is represented in continuous or discretized space. For continuous space, the Cartesian coordinates of the interacting sites or the virtual-bond vectors are usually the variables of choice, although curvilinear (angular) coordinates were implemented in the early model of Levitt and Warshel (Levitt and Warshel 1975; Levitt 1976), in the UNRES model (Liwo et al. 1993b, 1997a,b, 1999), by Hoffman and Knapp (1996) (who considered collective coordinates composed of the ϕ and ψ torsional angles of several adjacent peptide groups), and by He and Scheraga (1998). Angular coordinates are used in energy minimization or Monte Carlo search rather than in molecular dynamics simulations; one of a few counter-examples is the work of He and Scheraga (1998), who carried out Brownian dynamics of model polypeptides using virtual-bond angles and virtual-bond dihedral angles as variables. Use of curvilinear coordinates enables us to reduce the number of degrees of freedom by treating the virtual-bond lengths and, sometimes, the virtual-bond angles as fixed, which reduces the cost of minimization and Monte Carlo algorithms. However, use of curvilinear coordinates in molecular dynamics is less convenient because it introduces a non-diagonal inertia tensor which depends on conformation (so that a system of linear equations must be solved in every MD step to compute accelerations from forces, which requires $N^3$ operations, N being the number of variables) and also results in problems with singularities when mapping the curvilinear to Cartesian coordinates. Regardless of the representation (Cartesian or curvilinear), the obvious advantage of a continuous-space representation is the possibility of applying the algorithms of conformational-space exploration that require an energy gradient (local energy minimization with gradient minimizers and molecular dynamics and its variations).
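The mapping from curvilinear to Cartesian coordinates discussed above can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited models: it assumes a fixed virtual-bond length of 3.8 Å (the typical Cα–Cα distance for trans peptide groups), places each new site from the three preceding ones, and uses hypothetical function names (`place_site`, `chain_from_internal`). The error raised for collinear sites is an instance of the singularity mentioned in the text: the virtual-bond dihedral angle is undefined when a virtual-bond angle equals 180°.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    if norm < 1e-12:
        # Three collinear sites: the dihedral angle is undefined (singular mapping).
        raise ValueError("collinear sites; virtual-bond dihedral is undefined")
    return (a[0] / norm, a[1] / norm, a[2] / norm)


def place_site(a, b, c, bond, theta, gamma):
    """Return d with |cd| = bond, angle(b, c, d) = theta, dihedral(a, b, c, d) = gamma."""
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))  # unit normal to the a-b-c plane
    m = _cross(n, bc)                  # (bc, m, n) is a right-handed local frame
    dx = -bond * math.cos(theta)
    dy = bond * math.sin(theta) * math.cos(gamma)
    dz = bond * math.sin(theta) * math.sin(gamma)
    return tuple(c[k] + dx * bc[k] + dy * m[k] + dz * n[k] for k in range(3))


def chain_from_internal(thetas, gammas, bond=3.8):
    """Build site coordinates from virtual-bond angles and dihedrals (radians).

    ASSUMPTION: all virtual-bond lengths are fixed at `bond` (in angstroms).
    len(thetas) = n - 2 and len(gammas) = n - 3 for a chain of n sites.
    """
    if len(thetas) < 1 or len(gammas) != len(thetas) - 1:
        raise ValueError("need n-2 angles and n-3 dihedrals")
    t0 = thetas[0]
    chain = [(0.0, 0.0, 0.0),
             (bond, 0.0, 0.0),
             (bond - bond * math.cos(t0), bond * math.sin(t0), 0.0)]
    for theta, gamma in zip(thetas[1:], gammas):
        chain.append(place_site(chain[-3], chain[-2], chain[-1], bond, theta, gamma))
    return chain
```

With all dihedral angles set to 180°, the sketch produces a planar zigzag chain, which provides a simple consistency check. A chain of n sites is described here by 2n − 5 angular variables instead of 3n Cartesian coordinates, which is the reduction in the number of degrees of freedom referred to above.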
Discrete representations of conformational space are identified, in the first place, with lattice models (Koliński and Skolnick 2004; Chapters 1 and 12), in which the interaction sites are always located on lattice nodes. Depending on the lattice resolution, the lattice models are divided into low-resolution lattice models, in which the sites connected by site–site virtual bonds are located on neighboring lattice nodes; intermediate-resolution lattices (e.g., the chess-knight lattice), in which the sites connected by a virtual bond are located on second-neighboring lattice nodes; and high-resolution lattices, in which the nodes are spaced by 1.45 Å (cubic lattice) or even by 0.61 Å. Low-resolution lattices can be used only to study protein-like polymers while, with high-coordination lattices, the accuracy of chain representation can be as high as 0.35 Å, which is comparable to the inaccuracy inherent in coarse-grained force fields. Various types of lattice models have been discussed extensively in the excellent review by Koliński and Skolnick (2004). The advantage of lattice models over continuous-space models is the possibility of pre-computing and storing the contributions to energy corresponding to chain fragments at certain conformations, which saves CPU time. However, while this advantage was substantial a decade ago, when processors were slow compared to present ones and memory access was relatively cheap, now memory access is the bottleneck [the so-called memory wall (Flynn 1999)]. For example, accessing elements of a 10,000,000-element array 100,000,000 times at random takes 11 CPU seconds with an Intel Q9400 2.66 GHz processor, a cost that has to be taken into consideration. Moreover, conformational search algorithms that require an energy gradient (local minimization using gradient minimizers, molecular dynamics, and related techniques) are not possible with the lattice approach.
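The idea of pre-computing energy contributions on a lattice can be illustrated with the following sketch, in which a pair energy is tabulated once for every lattice displacement vector within a cutoff, so that evaluating the energy of a conformation reduces to table lookups. The 1.45 Å spacing is the cubic-lattice value quoted above; the square-well pair energy, the 7 Å cutoff, and all function names are hypothetical and serve only to make the example self-contained.

```python
import math
from itertools import product

LATTICE_SPACING = 1.45  # angstroms; cubic-lattice spacing quoted in the text
CUTOFF = 7.0            # angstroms; hypothetical interaction cutoff


def toy_pair_energy(r):
    """Hypothetical square-well pair energy (not taken from the chapter)."""
    if r < 4.0:
        return 5.0   # repulsive core
    if r <= CUTOFF:
        return -1.0  # attractive well
    return 0.0


def build_table(spacing=LATTICE_SPACING, cutoff=CUTOFF):
    """Tabulate the pair energy for every lattice displacement within the cutoff."""
    reach = int(cutoff // spacing)
    table = {}
    for dx, dy, dz in product(range(-reach, reach + 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue
        r = spacing * math.sqrt(dx * dx + dy * dy + dz * dz)
        if r <= cutoff:
            table[(dx, dy, dz)] = toy_pair_energy(r)
    return table


def lattice_energy(sites, table):
    """Sum pre-computed pair energies; sites are integer lattice coordinates."""
    energy = 0.0
    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            d = tuple(sites[j][k] - sites[i][k] for k in range(3))
            energy += table.get(d, 0.0)  # pairs beyond the cutoff contribute 0
    return energy
```

In this sketch no distance or energy function is evaluated during the conformational search itself. The lookups do, however, touch memory in an essentially random order, which is the access pattern whose cost is discussed above.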
Another discretization of the conformational space is accomplished by restricting it to decoys derived from structural databases or fragments derived from the sequences homologous to a target sequence. The first is the so-called threading or fold recognition approach (Bryant and Lawrence 1993; Godzik et al. 1993; Jones and Thornton 1993; Miller et al. 1996; Meller and Elber 2001; Buchete et al. 2003, 2004), while the second one is the fragment approach applied recently by Baker and coworkers (Simons et al. 1997; Rohl et al. 2004). These representations are applied mainly in protein structure prediction, although the fragment approach was also applied in protein folding by Monte Carlo dynamics (Fujitsuka and Takada 2004).
3.4 Interaction Schemes
Due to the great diversity of coarse-grained models and force fields, it is very difficult, if not impossible, to provide a unique general formula that covers all of them. If we restrict the discussion to force fields that refer to physical interactions, a general formula for the effective energy, U, could be expressed by Eq. (3.1):
$$U = \sum_{i} u_i^{\mathrm{local}} + \sum_{i}\sum_{j} u_{ij} + \sum_{ijkl\ldots} u_{ijkl\ldots} \qquad (3.1)$$
where $u_i^{\mathrm{local}}$ denotes a local-interaction term dependent on a single site or a number of adjacent sites, $u_{ij}$ denotes the effective interaction energy between sites i and j, and $u_{ijkl\ldots}$ denotes a multibody interaction which extends over sites i, j, k, l, .... The terms of the first two sums resemble those of all-atom force fields, although their physical origin is usually different. The local-interaction terms consist not only of those describing the energetics of virtual-bond stretching, virtual-bond angle bending, and virtual-bond torsional terms but also include more complex terms such as those describing the rotameric states of united side chains and terms that depend on a number of consecutive virtual-bond dihedral angles. Gerber (1992) introduced peptide-geometry restoring terms, which penalize the deviations of virtual-bond length and virtual-bond angles from boundary values resulting from the ideal valence geometry of polypeptide chains; these terms are also local-interaction terms. The $u_{ij}$ terms corresponding to interactions between united side chains usually encode the totality of such interactions, including the effect of the surrounding solvent. The side-chain–side-chain interaction terms depend only on the distance (in most force fields) or also on side-chain orientation. Usually backbone hydrogen bonding is accounted for separately through interactions between backbone sites located on Cα atoms, between consecutive Cα atoms, or a number of sites, each representing a peptide group. Unless the peptide groups are represented at atomistic or nearly atomistic detail, the backbone hydrogen-bonding terms include dependence on peptide-group orientation. The presence of multibody terms is the main difference between the interaction scheme of all-atom force fields and those of coarse-grained force fields.
Although the multibody terms are also present in some specific and very accurate all-atom force fields, their presence there is not required for a force field to work reasonably well. Conversely, coarse-grained force fields without multibody terms have very restricted application and cannot be used for ab initio folding unless specific biases toward elements of the native structure are introduced. The reason for this becomes clear in Section 3.5.3. One type of multibody term, which is also present in an all-atom force field with implicit solvent representation, is the solvation term computed from the solvent-accessible surface area or related quantities; another one is the centrosymmetric potential implemented by Koliński, Skolnick, and coworkers (Koliński et al. 1993; Koliński and Skolnick 1994a).
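The structure of Eq. (3.1) can be sketched in code as a sum of three kinds of user-supplied terms. The following is only a schematic illustration: the function `effective_energy`, the one-dimensional toy sites, and the three toy energy terms are all hypothetical, and the sketch assumes that each pair of sites is counted once in the second sum.

```python
from itertools import combinations


def effective_energy(sites, u_local, u_pair, multibody_terms=()):
    """Schematic evaluation of U = sum(u_local) + sum(u_pair) + sum(multibody)."""
    local = sum(u_local(i, sites) for i in range(len(sites)))
    # ASSUMPTION: each unordered pair (i, j) is counted once.
    pair = sum(u_pair(sites[i], sites[j])
               for i, j in combinations(range(len(sites)), 2))
    multibody = sum(term(sites) for term in multibody_terms)
    return local + pair + multibody


# Hypothetical toy terms acting on one-dimensional site positions.
def toy_local(i, sites):
    """Harmonic virtual-bond stretching between site i and site i + 1."""
    if i + 1 >= len(sites):
        return 0.0
    return (sites[i + 1] - sites[i] - 1.0) ** 2


def toy_pair(a, b):
    """Square-well contact energy."""
    return -1.0 if abs(a - b) <= 1.0 else 0.0


def toy_three_body(sites):
    """Cooperative bonus when the whole chain is compact."""
    return -0.5 if max(sites) - min(sites) <= 2.0 else 0.0


toy_total = effective_energy([0.0, 1.0, 2.0], toy_local, toy_pair,
                             multibody_terms=(toy_three_body,))
```

The multibody contribution in this toy example cannot be written as a sum over pairs, because it depends on the positions of all sites simultaneously; this is the feature that distinguishes the last sum of Eq. (3.1) from the first two.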
3.5 Derivation of Coarse-Grained Force Fields
In this section, three leading methods for developing coarse-grained potentials [statistical potentials (Boltzmann principle), factorization of potentials of mean force, and the force-matching method] are discussed. These three approaches result in potentials that can be used to simulate the structure and dynamics of real proteins and are, in theory, transferable. We omit from this discussion (i) the arbitrary potentials designed to simulate protein-like systems in order to study general properties of protein folding and dynamics (Chan and Dill 1989, 1990, 1991, 1994; Dill et al.
1995; Camacho and Thirumalai 1996; Cieplak et al. 2002; Chapter 8), (ii) the elastic network potentials (Bahar et al. 1997; Hinsen 1998; Atilgan et al. 2001; Tobi and Bahar 2005; Ahmed and Gohlke 2006; Chu and Voth 2007; Moritsugu and Smith 2008), and (iii) the structure-based potentials (Das et al. 2005; Schug et al. 2008; Hills and Brooks 2009; Brown et al. 2003; Brown and Head-Gordon 2004; Eskow et al. 2004; Chapter 8). The reader is referred to the literature for information regarding these three other categories.
3.5.1 Basic Formulations
The coarse-grained energy function can generally be defined as an average of the energy of the corresponding all-atom system, the average being computed over the degrees of freedom that are not present in the coarse-grained representation (the secondary degrees of freedom). An illustration of the correspondence between a united-residue chain and the parent all-atom chain is presented in Fig. 3.1.
Fig. 3.1 Illustration of the correspondence between the all-atom polypeptide chain in water (a) and its coarse-grained (UNRES) representation (b). The side chains in part (b) are represented by ellipsoids of revolution and the peptide groups are represented by small spheres in the middle between consecutive α-carbon atoms. The solvent is implicit in the UNRES model. Reproduced with permission from figure 1 of Czaplewski et al. (2009)
The most physical definition corresponds to the restricted free energy (RFE) or potential of mean force (PMF) obtained by computing the part of the configurational integral of a system corresponding to integrating over the secondary degrees of freedom. If $X = (x_1, x_2, \ldots, x_M)$ and $Y = (y_1, y_2, \ldots, y_m)$ denote the coarse-grained and secondary degrees of freedom (orthogonal to X), respectively, M and m being the dimensions of the space spanned by these coarse-grained and secondary variables, respectively, and $E(X;Y)$ denotes the all-atom energy function, RFE can be expressed by Eq. (3.2) (Liwo et al. 1998, 2001; Izvekov and Voth 2005a,b; Ayton et al. 2007a).
$$F(X) = -RT \ln\left\{ \int_{\Omega_Y} \exp\left[-\frac{E(X;Y)}{RT}\right] dV_Y \right\} + C \qquad (3.2)$$
where $\Omega_Y$ is the space spanned by Y, R is the universal gas constant, T is the absolute temperature, and C is an additive constant. The choice of C varies from approach to approach; for example, Liwo et al. (1998, 2001) chose $C = RT \ln V_Y$, which makes F(X) a restricted excess free energy, while Voth and coworkers (Izvekov and Voth 2005a,b; Ayton et al. 2007a,b) chose $C = -RT \ln (Z_N/z_n)$, with $Z_N$ and $z_n$ denoting the configuration integrals computed over the coarse-grained and all-atom conformations, respectively. Equation (3.2) provides the best physical connection between the coarse-grained and the corresponding all-atom system, because $\exp[-F(X)/RT]$ is proportional to the probability of a coarse-grained conformation defined by X. Consequently, the ensemble averages computed over F(X) are theoretically equal to those computed over the parent all-atom energy function E(X;Y). Another point is that it is clear from Eq. (3.2) that the effective coarse-grained energy function depends on temperature. This point has recently been addressed by Liwo et al. (2007). Earlier coarse-grained models (Levitt and Warshel 1975; Levitt 1976; Pincus and Scheraga 1977; Liwo et al. 1993b) adopted the Boltzmann-averaged energy as the effective energy function. However, the average energy does not have a direct connection to the probability of a coarse-grained conformation, and it is not straightforward to compute ensemble averages using this quantity. Equation (3.2) defines a multidimensional potential of mean force, the computation or determination of which by direct integration, simulation, or from experimental data for the entire protein is infeasible. In the next three sections, the determination of this integral by making the necessary simplifications is addressed. It should be noted that Sections 3.5.2, 3.5.3, and 3.5.4 do not imply different physical origins; rather, they discuss different approaches to deriving the coarse-grained potentials based on Eq. (3.2).
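The difference between the restricted free energy of Eq. (3.2) and the Boltzmann-averaged energy can be made concrete with a toy system that has one coarse-grained variable x and one secondary variable y. In the sketch below, the energy function $E(x; y) = x^2 + \frac{1}{2}(1 + x^2)y^2$, the choice C = 0, the integration range, and all function names are assumptions introduced for illustration; energies are in kcal/mol and the integral over y is evaluated with the trapezoidal rule.

```python
import math

R = 1.987e-3  # universal gas constant in kcal/(mol K)


def toy_energy(x, y):
    """Hypothetical all-atom energy; the stiffness in y grows with x."""
    return x * x + 0.5 * (1.0 + x * x) * y * y


def _boltzmann_sums(x, temperature, energy, y_max, n_points):
    """Trapezoidal estimates of int exp(-E/RT) dy and int E exp(-E/RT) dy."""
    rt = R * temperature
    h = 2.0 * y_max / (n_points - 1)
    z = 0.0
    ez = 0.0
    for k in range(n_points):
        y = -y_max + k * h
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        e = energy(x, y)
        b = weight * math.exp(-e / rt)
        z += b
        ez += e * b
    return z * h, ez * h


def restricted_free_energy(x, temperature, energy=toy_energy,
                           y_max=20.0, n_points=20001, constant=0.0):
    """F(x) = -RT ln int exp(-E(x; y)/RT) dy + C, cf. Eq. (3.2)."""
    z, _ = _boltzmann_sums(x, temperature, energy, y_max, n_points)
    return -R * temperature * math.log(z) + constant


def boltzmann_averaged_energy(x, temperature, energy=toy_energy,
                              y_max=20.0, n_points=20001):
    """<E>(x): the effective energy adopted in earlier coarse-grained models."""
    z, ez = _boltzmann_sums(x, temperature, energy, y_max, n_points)
    return ez / z
```

For this toy energy the integral is Gaussian, so that $F(x) = x^2 + \frac{RT}{2}\ln\frac{1 + x^2}{2\pi RT}$ (with C = 0), whereas the averaged energy is $x^2 + RT/2$. The restricted free energy therefore contains an entropic, x-dependent contribution that changes with temperature, while the averaged energy differs from $x^2$ only by a constant; this is a simple instance of the temperature dependence of the effective energy function noted above.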
3.5.2 Statistical Potentials (Boltzmann Principle)
Knowledge-based potentials, also known as statistical potentials, are commonly used to predict protein structures as well as to simulate protein-folding pathways. The basic purpose of this approach (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Sippl 1990a,b, 1993; Covell 1992; Casari and Sippl 1992; Maiorov and Crippen 1992) is to construct an effective energy function [the prototype of which is given by Eq. (3.2)] based on the distributions of inter-residue distances, virtual-bond lengths, bond angles, dihedral angles, geometric parameters characteristic of short-sequence fragments, etc., derived from structures deposited in the Protein Data Bank (PDB) (Berman et al. 2000). The basic equation used in deriving statistical potentials is given by Eq. (3.3)
$$W(x; c; s) = -RT \ln \frac{N^{\mathrm{obs}}(x; c; s)}{N^{\mathrm{ref}}(x; c; s)} \qquad (3.3)$$
where W(x; c; s) is the estimated potential of mean force of a fragment with geometry expressed by the vector x, composition (the kinds of residues involved) expressed by the vector c, and sequence and/or secondary-structure context expressed by the vector s, R is the universal gas constant, T is the absolute temperature, $N^{\mathrm{obs}}(x; c; s)$ is the number of counts of fragments of a given composition and sequence context and geometry close to x observed in the database, and $N^{\mathrm{ref}}(x; c; s)$ is the reference number of counts (in the absence of any interactions except those imposed by chain-connectivity and excluded-volume constraints). In simple residue–residue contact potentials (Tanaka and Scheraga 1976; Miyazawa and Jernigan 1985; Covell 1992; Maiorov and Crippen 1992; Rooman et al. 1992; Zhou and Zhou 2004), c consists of the kinds of the first and the second residues involved, x is 1 if the distance between the two selected sites (Cα atoms, Cβ atoms, or mass centers) of the residues is less than $r_{\mathrm{cut}}$ (usually equal to 7 Å) and 0 otherwise, and the sequence context is ignored. In more refined pair potentials, residues are split into more centers of interaction, and different potentials are developed for the local and non-local interactions or the order of the two residues in the chain; this implies sequence context (Sippl 1990a,b, 1993; Koliński and Skolnick 1992; Koliński et al. 1993; Godzik et al. 1993). It is clear from Eq. (3.3) that the statistical potentials depend on the database from which the $N^{\mathrm{obs}}(x; c; s)$ values were derived. For example, it was shown that statistical potentials derived from the structures of all-α proteins are significantly different from those obtained from all-β proteins (Furuichi and Koehl 1998); a similar situation is observed when single-chain and multichain proteins are used for development of potentials (Moont et al. 1999; Lu et al. 2003).
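The use of Eq. (3.3) for a distance-dependent pair potential can be sketched as follows. The function converts binned observed and reference counts into a potential of mean force, bin by bin. The function name, the optional pseudocount, and the convention of returning `None` for empty bins are assumptions made for this example; the sketch also assumes that both histograms have already been put on the same normalization, and it does not address how the reference counts are obtained.

```python
import math

R = 1.987e-3  # universal gas constant in kcal/(mol K)


def statistical_potential(n_obs, n_ref, temperature=300.0, pseudocount=0.0):
    """Return W = -RT ln(N_obs / N_ref) for each bin, cf. Eq. (3.3).

    ASSUMPTION: n_obs and n_ref are equally binned and equally normalized.
    Bins in which either count is zero are returned as None (no estimate).
    """
    if len(n_obs) != len(n_ref):
        raise ValueError("observed and reference histograms must match")
    rt = R * temperature
    potential = []
    for obs, ref in zip(n_obs, n_ref):
        obs = obs + pseudocount
        ref = ref + pseudocount
        if obs <= 0 or ref <= 0:
            potential.append(None)
        else:
            potential.append(-rt * math.log(obs / ref))
    return potential
```

Bins in which a geometry is observed more often than in the reference state receive a negative (favorable) energy, and bins in which it is observed less often receive a positive one. Sparsely populated bins make the logarithm noisy, which is one practical aspect of the dependence on the database noted above.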
Following the work of Miyazawa and Jernigan (1985), non-homologous protein structures with different types of folds are selected for a database. The complete statistical energy function is a sum of terms determined using Eq. (3.3). It should be noted that each of the terms can be identified with the potential of mean force of a protein fragment, and each of them is evaluated independently. Consequently, Eq. (3.3) does not provide a rigorous connection to the PMF of the polypeptide chain under consideration [defined by Eq. (3.2)], because this PMF is not a sum of fragment PMFs, each determined in the context of the whole structures of database proteins. Moreover, there is a possibility that the same contributions will be included in different terms. The statistical potentials have been discussed in detail in the review by Shen and Sali (2006). They can be classified according to the characteristics a, b, c, and d:
(a) Protein structure representation: The representations differ by the number, location (e.g., on Cα atoms, Cβ atoms, centers of masses of selected fragments, etc.), and types of sites (point masses, rods, or rigid bodies), and choice of the representation of the conformational space (continuous, lattice, decoys, and fragments).
(b) Interaction scheme: A number of knowledge-based force fields include only residue–residue interactions that depend only on the distance (Miyazawa and Jernigan 1985; Sippl 1990a,b, 1993; Casari and Sippl 1992; Skolnick et al. 1997a; Samudrala and Moult 1998; Rojnuckarin and Subramaniam 1999; Zhang et al. 2004; Chen and Shakhnovich 2005) or on distance and orientation (Buchete et al. 2003, 2004; Miyazawa and Jernigan 2005). Some of these simple potentials are enhanced by including a solvation term dependent on solvent-accessible area (Sippl 1993; Melo and Feytmans 1998). Potentials based on residue–residue interactions are good only for threading but perform poorly in ab initio folding. Potentials containing local-only interactions in the form of virtual-bond dihedral angle terms (Rooman et al. 1991, 1992; Zhou and Zhou 2004; Betancourt 2008) or those containing only hydrogen-bonding interactions (Kortemme et al. 2003) are also known. However, the majority of working potentials contain side-chain–side-chain, hydrogen-bonding, local, and multibody interactions (Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Sun 1993; Koliński 2004).
(c) Representation of interactions: The interactions can be represented either in a tabular form (i.e., a PMF value for each different geometry; for contact potentials this is just a single number) (Sippl 1990a,b, 1993; Casari and Sippl 1992; Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Koliński 2004) or by functional forms which are determined by fitting to the PMF values (Obatake and Crippen 1981; Crippen and Viswanadhan 1984, 1987; Sun 1993; Liwo et al. 1997a,b; Buchete et al. 2003, 2004).
(d) Reference state definition: Generally the reference state is defined as one in which no specific interactions occur within the fragment under consideration, and the interactions result only from chain-connectivity and excluded-volume constraints.
However, the level of sophistication of choosing the reference state varies significantly among the statistical potentials (Sippl 1990a,b, 1993; Casari and Sippl 1992; Jernigan and Bahar 1996; Moult 1997; Skolnick et al. 1997; Samudrala and Moult 1998; Lu and Skolnick 2001; Buchete et al. 2004; Zhou et al. 2006).
As an example of the derivation of statistical potentials, the normalized distribution function $N^{\mathrm{obs}}(r)$, the reference distribution function $N^{\mathrm{ref}}(r)$ (r being the distance between the side-chain centers), and their ratio (the correlation function) for the Leu-Leu pair determined from the PDB by Liwo et al. (1997a) are shown in Fig. 3.2a, while the respective PMF together with the fit to a Lennard–Jones (6–12) functional form is shown in Fig. 3.2b.
The statistical potentials were the subject of extensive criticism, especially related to the fact that they are derived from equilibrium structures and that their relationship to potentials of mean force is not clear (Thomas and Dill 1996; Ben Naim 1997). The correctness of the reference state definition and the quality of statistical data derived from the databases were also challenged (Thomas and Dill 1996; Ben Naim 1997; Tobi et al. 2000; Summa et al. 2005). Implementers and developers of statistical potentials recognized most of the criticisms related to their methodology and
Fig. 3.2 Upper graph: sample pair-distribution and pair-correlation functions for the Leu-Leu pair averaged over consecutive 0.5-Å shells used to determine the statistical side-chain–side-chain interaction potentials for this residue pair by Liwo et al. (1997a). (a) Radial normalized pair-correlation function [$N^{\mathrm{obs}}/N^{\mathrm{ref}}$ in Eq. (3.3)]; (b) the reference normalized number of pair counts [$N^{\mathrm{ref}}$ in Eq. (3.3)]; (c) the total normalized number of pair counts [$N^{\mathrm{obs}}$ in Eq. (3.3)] determined from the PDB. All curves were normalized to the maximum value of 1.0. Lower graph: the potential of mean force computed from the correlation function [W of Eq. (3.3); dashed line] and the Lennard–Jones fit (solid line). It should be noted that the Lennard–Jones fit does not reproduce the desolvation maximum. The upper graph has been reproduced with permission from figure 3 of Liwo et al. (1997a) and the lower graph was constructed based on the data from Liwo et al. (1997a)
raised further important points. One of the most important extensions of the statistical potentials was the observation that the pairwise energies used there could be insufficient to describe internal protein energy correctly and that, therefore, the introduction of a multibody expansion in the potential became necessary (Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Godzik et al. 1993; Vendruscolo and Domany 1998; Vendruscolo et al. 1999; Chapter 6). On the other hand, this extension led to greater problems with the quality of the statistical data derived from the PDB (Vendruscolo et al. 1999). Despite the conceptual and technical problems, statistical potentials enjoyed a significant degree of success in many applications related to the biophysics of proteins. The primary applications of the statistical potentials were prediction of the three-dimensional structure of proteins from amino acid sequences (Koliński and Skolnick 1992, 1994a,b; Koliński et al. 1993; Panchenko et al. 2000; Skolnick et al. 2000; Tobi and Elber 2000; Tobi et al. 2000; Vendruscolo et al. 2000; Koliński 2004), protein thermodynamic stability (Gilis and Rooman 1996, 1997), and the structure and stability of protein–protein and protein–ligand complexes (Gohlke and Klebe 2001). However, the statistical potentials based on the Boltzmann principle [Eq. (3.3)] are currently no longer developed or used as extensively as 5–7 years ago (Lazaridis and Karplus 2000; Meller and Elber 2002; Russ and Ranganathan 2002; Buchete et al. 2003, 2004). The reason for this is that current protein structure prediction is based mostly on the template recognition methodology, rather than on a statistical energy function (Kryshtafovych and Fidelis 2009).
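Returning to the representation of a tabulated PMF by a functional form, as in the Lennard–Jones fit of Fig. 3.2, the following sketch fits $W(r) = 4\varepsilon[(\sigma/r)^{12} - (\sigma/r)^{6}]$ to tabulated values. It is not the fitting procedure of Liwo et al. (1997a); it is a minimal example which assumes an unweighted linear least-squares fit, exploiting the fact that the 6–12 form is linear in the coefficients of $r^{-12}$ and $r^{-6}$. The function name and the rescaling of distances by their mean are choices made for this example.

```python
def fit_lennard_jones(distances, pmf_values):
    """Least-squares fit of W(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6).

    Writes W = a*u - b*v with u = (r0/r)**12 and v = (r0/r)**6, where r0 is
    the mean distance (used only to keep the sums well scaled), and solves
    the 2x2 normal equations for a and b. Returns (eps, sigma).
    """
    if len(distances) != len(pmf_values) or len(distances) < 2:
        raise ValueError("need at least two (r, W) points")
    r0 = sum(distances) / len(distances)
    suu = suv = svv = suw = svw = 0.0
    for r, w in zip(distances, pmf_values):
        v = (r0 / r) ** 6
        u = v * v
        suu += u * u
        suv += u * v
        svv += v * v
        suw += u * w
        svw += v * w
    det = suv * suv - suu * svv
    if abs(det) < 1e-300:
        raise ValueError("distances do not determine both coefficients")
    a = (suv * svw - suw * svv) / det
    b = (suu * svw - suw * suv) / det
    if a <= 0.0 or b <= 0.0:
        raise ValueError("data are not described by a 6-12 form")
    eps = b * b / (4.0 * a)
    sigma = r0 * (a / b) ** (1.0 / 6.0)
    return eps, sigma
```

Because the 6–12 form has a single minimum and approaches zero monotonically at large distances, a fit of this kind cannot reproduce features such as the desolvation maximum mentioned in the caption of Fig. 3.2, whatever fitting procedure is used.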
3.5.3 Factor Expansion of the PMF
The exact RFE of Eq. (3.2) can be evaluated only numerically, and this task requires, at best, as much effort as all-atom simulations with explicit solvent. Therefore, a way around, to which many investigators resort, is to compose the total PMF [Eq. (3.3)] from contributions corresponding to fragments of a system obtained by all-atom simulations of, e.g., models of pairs of interacting side chains in water or as statistical potentials. However, the connection of such composite potentials to the parent PMF defined by Eq. (3.2) is unclear. Liwo et al. (1998, 2001) developed a formal expansion of the PMF, which is based on Kubo’s (1962) cluster-cumulant approach. First, the total all-atom energy of a system composed of n coarse-grained sites is partitioned into the component energies, $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N$, $N = n(n+1)/2$, as given by Eq. (3.4).
$$E(X;Y) = \sum_{k=1}^{n} \sum_{i \in I_k} E_{ik}(X; y_k) + \sum_{k=1}^{n} \sum_{l=1}^{k-1} \sum_{i \in I_k} \sum_{j \in I_l} E_{ik;jl}(X; y_k; y_l) = \sum_{i=1}^{N} \varepsilon_i(X; z_i) \qquad (3.4)$$
where the sets $\{I_1, I_2, \ldots, I_n\}$ contain the indices of all atoms assigned to interaction site 1, 2, ..., n, respectively, and $z_i = y_i$ or $z_i = (y_k, y_l)$ depending on whether
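$\varepsilon_i$ is an intrasite or intersite energy. Each component energy is either a sum of all interatomic interactions within a given extended site or between two extended sites. By inserting Eq. (3.4) into Eq. (3.2) and splitting the RFE into cluster-cumulant functions, $\langle \varepsilon_{i_1} \varepsilon_{i_2} \cdots \varepsilon_{i_k} \rangle_f$, containing increasing numbers of component energies, Eq. (3.2) becomes Eq. (3.5).
$$F(X) = \sum_{i} \langle \varepsilon_i \rangle_f + \sum_{i<j} \langle \varepsilon_i \varepsilon_j \rangle_f + \sum_{i<j<k} \langle \varepsilon_i \varepsilon_j \varepsilon_k \rangle_f + \cdots + \langle \varepsilon_1 \varepsilon_2 \cdots \varepsilon_N \rangle_f \qquad (3.5)$$
[...] are now calculated by computing averages over the sampled conformations:
$$\langle O \rangle = \frac{1}{M} \sum_{k=1}^{M} O_k \qquad (9.10)$$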
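where $O_k$ is the value measured for the quantity O in the configuration k, and M the number of measurements. This average approximates the ensemble average
$$\langle O \rangle = \frac{\int dx_i \, dv_i \, O(x_i) \, e^{-E(x_i, v_i)/k_B T}}{\int dx_i \, dv_i \, e^{-E(x_i, v_i)/k_B T}} . \qquad (9.11)$$
As $E = E_{\mathrm{pot}}(x_i) + E_{\mathrm{kin}}(v_i)$ and $E_{\mathrm{kin}} = \frac{1}{2} \sum_i m_i v_i^2$, it is possible to integrate out the velocities, and
$$\langle O \rangle = \frac{\int dx_i \, O(x_i) \, e^{-E_{\mathrm{pot}}(x_i)/k_B T}}{\int dx_i \, e^{-E_{\mathrm{pot}}(x_i)/k_B T}} . \qquad (9.12)$$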
As a consequence, for the generation of configurations by way of the Metropolis algorithm (Eq. (9.9)) one needs to calculate only the difference in potential energy, $\Delta E_{\mathrm{pot}}$. For this reason, we will mostly write simply E when only the potential energy $E_{\mathrm{pot}}$ is relevant. Note also that Monte Carlo does not require the calculation of derivatives, which reduces the numerical workload. As the configurations are drawn randomly in Monte Carlo, it is not possible to follow the trajectory of a protein, and therefore Monte Carlo – unlike molecular dynamics – is not suitable for probing the kinetics of folding. On the other hand, Monte Carlo allows one to sample the configurational space much faster by utilizing artificial but fast move sets. These are often necessary because in the canonical ensemble the crossing of an energy barrier of height $\Delta E$ is suppressed by a factor $\propto \exp(-\Delta E/k_B T)$. This is the reason for the multiple-minima problem and the resulting slowing down of protein simulations discussed in the introduction.
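The estimate of Eq. (9.10) can be illustrated with a minimal Metropolis simulation of a single coordinate. The sketch below assumes the standard Metropolis acceptance rule (a move is accepted with probability $\min[1, \exp(-\Delta E/k_B T)]$), a symmetric uniform trial move, and a hypothetical function name and parameter set; only energy differences enter the algorithm, and no derivatives are needed.

```python
import math
import random


def metropolis_average(energy, observable, x0, kT, n_steps,
                       step_size=1.5, burn_in=1000, seed=0):
    """Estimate <O> as a plain average over Metropolis-sampled configurations.

    ASSUMPTIONS: one coordinate, uniform trial moves in [-step_size, step_size],
    and the first `burn_in` configurations are discarded.
    """
    rng = random.Random(seed)
    x = x0
    e = energy(x)
    total = 0.0
    count = 0
    for step in range(n_steps):
        x_trial = x + rng.uniform(-step_size, step_size)
        e_trial = energy(x_trial)
        delta = e_trial - e  # only the energy difference is required
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x, e = x_trial, e_trial  # accept the move
        if step >= burn_in:
            total += observable(x)  # a rejected move recounts the old state
            count += 1
    return total / count


def harmonic(x):
    """Toy potential energy, E_pot = x**2 / 2."""
    return 0.5 * x * x


mean_square = metropolis_average(harmonic, lambda x: x * x, x0=0.0,
                                 kT=1.0, n_steps=200000)
```

For the harmonic toy potential, the exact canonical average is $\langle x^2 \rangle = k_B T$, so the estimate should lie close to 1 for the parameters above. If the potential is replaced by a double well with a barrier much higher than $k_B T$, the same small trial moves rarely cross the barrier, which is a one-dimensional picture of the multiple-minima problem.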
9
Sampling Protein Energy Landscapes – The Quest for Efficient Algorithms
9.2.3 Optimization Techniques

Most proteins are thermodynamically stable at room temperature (Anfinsen 1973). This implies that the biologically active configuration is the global minimum in free energy at T ≈ 270–300 K. For many proteins, this state is unique up to oscillations around a fixed structure. For this reason, one can identify the global minimum in free energy with that in potential energy, reducing the prediction of protein structures to a global optimization problem. While deterministic methods (for instance, the αBB algorithm (Androulakis et al. 1997)) have many conceptual advantages, stochastic algorithms are often faster and easier to implement. Take as an example simulated annealing (Kirkpatrick et al. 1983), which is inspired by the crystal growth process and realized by gradually decreasing the temperature in a Monte Carlo or molecular dynamics program. While only a logarithmic annealing schedule will ensure that the simulation finds the global minimum (Geman and Geman 1984), limitations in available computer resources require faster annealing schedules where success is no longer guaranteed. Still, because of its simplicity, simulated annealing is often the first choice in protein optimization problems.

Genetic algorithms (Holland 1975) and Monte Carlo minimization (Li and Scheraga 1987) are two other commonly used stochastic optimization techniques. Like simulated annealing, they try to avoid entrapment in local minima and continue to search for further solutions. This is a general characteristic of successful optimization techniques. For instance, in tabu search (Cvijovic and Klinowski 1995) the system is guided away from previously explored areas. This can result in slow convergence, as the method does not distinguish between important and unimportant regions of the landscape. A somewhat opposite approach (Besold et al. 1999; Wenzel and Hamacher 1999) aims at transforming the original energy landscape into a funnel landscape, where convergence toward the global minimum is fast. However, many landscape-deformation methods are hampered either by the required fine tuning or a priori information, or by difficulties with connecting back to the original landscape. Often, minima on the deformed surface are displaced or merged. The latter problem is avoided in energy landscape paving (ELP) (Hansmann and Wille 2002), which merges ideas from tabu search with energy landscape deformation. In ELP, low-temperature Monte Carlo simulations utilize an effective energy:
\[
w(\tilde{E}) = e^{-\tilde{E}/k_B T} \quad \text{with} \quad \tilde{E} = E + f(H(q, t)).
\tag{9.13}
\]
Here, T is a (low) temperature and f(H(q, t)) a function of the histogram H(q, t) in a pre-chosen “order parameter” or “reaction coordinate” q. The weight of a local minimum state decreases with the time the system stays in that state, i.e., ELP deforms the energy landscape locally until the local minimum is no longer favored and the system explores higher energies. It will then either fall into a new local minimum or walk through this high-energy region until the corresponding histogram entries all have similar frequencies and the system again has a bias toward low
U.H.E. Hansmann
energies. Since the weight factor is time dependent, ELP violates detailed balance. Hence, the method cannot be used to calculate thermodynamic averages. Note, however, that for f(H(q, t)) = f(H(q)) detailed balance is fulfilled, and ELP reduces to the generalized-ensemble methods (Hansmann and Okamoto 1998) discussed in the following section. We have evaluated the efficiency of ELP in simulations of the 20-residue trp-cage protein, whose structure we could “predict” within a root-mean-square deviation (rmsd) of 1 Å (Schug et al. 2005). Energy landscape paving also allows zero-temperature simulations (Schug et al. 2005). For T → 0 only moves with $\Delta\tilde{E} \le 0$ are accepted. If one chooses $\tilde{E} = E + cH(E, t)$, the acceptance criterion is given by
\[
\Delta E + c\,\Delta H(q, t) \le 0 \;\Leftrightarrow\; c\,\Delta H(q, t) \le -\Delta E
\tag{9.14}
\]
where E is the “physical” energy. Hence, energy landscape paving can overcome any energy barrier even at T = 0. The waiting time for such a move is proportional to the height of the barrier that needs to be crossed. The factor c sets the timescale, and in this sense the T = 0 form of ELP is parameter-free.
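The zero-temperature acceptance rule can be sketched on a toy landscape. The Python example below is only an illustration under stated assumptions: the tilted double-well energy on 21 lattice sites, the choice of the site index as order parameter q, the value of c, and all names are hypothetical and not taken from the chapter. A trial move is accepted when the change in effective energy, ΔE + cΔH, is not positive, and the histogram entry of the current state grows by one at every step.

```python
import random

def energy(i):
    """Toy landscape on sites 0..20: local minimum at site 5, global at 15."""
    x = (i - 10) / 5.0
    return x**4 - 2.0 * x**2 - 0.5 * x

def elp_zero_temperature(n_sites, start, c, n_steps, rng):
    """T = 0 energy landscape paving with the site index as order parameter."""
    hist = [0] * n_sites            # H(q, t): visits to each site so far
    pos = best = start
    for _ in range(n_steps):
        trial = pos + rng.choice((-1, 1))
        if 0 <= trial < n_sites:
            # change in effective energy: Delta E + c * Delta H
            d_eff = (energy(trial) - energy(pos)) + c * (hist[trial] - hist[pos])
            if d_eff <= 0.0:        # T -> 0: only non-increasing moves pass
                pos = trial
        hist[pos] += 1              # paving: the current state becomes less favorable
        if energy(pos) < energy(best):
            best = pos
    return best, hist

start = 5
best, hist = elp_zero_temperature(21, start, 0.05, 20_000, random.Random(0))
print(f"start: site {start}, E = {energy(start):.3f}")
print(f"best:  site {best}, E = {energy(best):.3f}")
```

Because the histogram term keeps rising while the walker sits in the starting well, the barrier toward the deeper well is eventually crossed without any thermal fluctuations.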
9.3 Advanced Simulation Techniques

Determining the structure of proteins through global optimization assumes the existence of a cost function whose global minimum describes the native structure. In most cases this is an energy that describes the physical interactions within a protein and between the protein and the surrounding environment, typically water. Since neither the available force fields nor the inclusion of solvation effects are perfect, it is not certain that the folded structure (as determined by X-ray or NMR experiments) corresponds to the global minimum conformation. Hence, the accuracy of the force fields sets a limit on any global optimization approach to structure prediction of proteins. Global optimization techniques are also not suitable for investigations of the folding mechanism, the change in shape when interacting with other molecules, or the appearance of mis-folded structures. As with structure prediction, it is necessary to go beyond global optimization techniques and to measure thermodynamic quantities, i.e., to sample a set of configurations from a canonical ensemble and take an average of the chosen quantity over this ensemble. In principle, this is possible with molecular dynamics and Monte Carlo simulations. However, as argued earlier in this review, this requires strategies that lead to a faster sampling of low-energy configurations.
9.3.1 Unfolding Simulations

The poor sampling of protein configurations at physiologically relevant temperatures results from their rough energy landscape, where the crossing of barriers of height $\Delta E$ is suppressed by $e^{-\Delta E/k_B T}$. Hence, by increasing the temperature T it becomes easier for a protein to cross energy barriers. This can be used to induce the thermal unfolding of a protein. Such unfolding simulations at high temperature are sometimes interpreted as reversed-in-time folding (Daggett and Fersht 2003; Daggett 2002). This approach has been used in the past with some success (Daggett and Fersht 2003; Daggett 2002), but it is not clear whether it is justified in protein simulations in general. We have recently demonstrated that the C-fragment of Top 7, named by us CFr, folds by a non-trivial pathway that involves caching of an N-terminal segment in an adjunct helix. Only when all other parts of the protein are folded and in place does the N-terminal segment unfold and re-fold into a strand that completes the final structure in a three-stranded sheet. We found that this folding mechanism cannot be inferred from unfolding simulations at high temperatures. In fact, the interpretation of unfolding data in Mohanty and Hansmann (2008) as folding in reversed time would miss the caching mechanism that governs folding of this protein. Likely, such an interpretation is restricted to simple two-state folders and associated with a nucleation mechanism, as observed, for instance, for CI2 (Daggett and Fersht 2003; Daggett 2002).
9.3.2 Advanced Updates

A possible strategy to increase sampling of relevant protein configurations is to use improved updates. Within the context of molecular dynamics these are techniques that guide the simulation and/or allow for larger time steps in the integrator. In the context of Monte Carlo these are usually collective moves that lead to a larger change in configurations. Examples are the re-bridging scheme (Gō and Scheraga 1970; Wu and Deem 1999) and the biased Gaussian step method (Favrin et al. 2001). In hybrid Monte Carlo (Duane et al. 1987; Brass et al. 1993) a short molecular dynamics run is used as a collective move to provide a trial configuration, which is then accepted or rejected according to the Metropolis criterion. This allows one to follow a trajectory over a long time with a large step size, because the Metropolis step corrects for the discretization errors in the molecular dynamics run. A general problem with all improved updates is that they depend strongly on the chosen model and are often not known a priori. A collective move that avoids this pitfall has been recently proposed by Berg (Berg 2003) under the name Rugged Metropolis (RM). The idea is to bias a Monte Carlo simulation by using information from a simulation at a higher temperature. Assume a range of temperatures
\[
T_1 > T_2 > \ldots > T_r > \ldots > T_{f-1} > T_f .
\tag{9.15}
\]
The simulation at the highest temperature, T1 , is performed with the usual Metropolis algorithm and the results are used to construct an estimator of the probability density function
$\bar{\rho}(x_1, \ldots, x_n; T_1)$ that biases the simulation at $T_2$. In turn, this simulation provides a bias for the one at $T_3$, and the procedure is continued iteratively down to $T_f$. For this purpose, Berg assumes the approximation
\[
\bar{\rho}(x_1, \ldots, x_n; T_r) = \prod_{i=1}^{n} \bar{\rho}^{1}_{i}(x_i; T_r),
\tag{9.16}
\]
where $\bar{\rho}^{1}_{i}(x_i; T_r)$ are estimators of reduced one-variable probability densities
\[
\rho^{1}_{i}(x_i; T) = \int \prod_{j \ne i} dx_j \, \rho(x_1, \ldots, x_n; T) .
\tag{9.17}
\]
Recursively, the estimated probability density function $\bar{\rho}(x_1, \ldots, x_n; T_{r-1})$ is generated as an approximation of $\rho(x_1, \ldots, x_n; T_r)$. The acceptance step in the (biased) Metropolis procedure at temperature $T_r$ is now given by
\[
P_{RM} = \min\left[ 1, \frac{\exp(-\beta E') \, \bar{\rho}(x_1, \ldots, x_n; T_{r-1})}{\exp(-\beta E) \, \bar{\rho}(x'_1, \ldots, x'_n; T_{r-1})} \right],
\tag{9.18}
\]
where primed quantities refer to the trial configuration.
Rugged Metropolis has been tested successfully for simulations of small peptides. However, as with other improved updates, the gain in efficiency is by itself not enough to make folding simulations of protein domains (usually consisting of 50–200 residues) feasible. On the other hand, improved updates are very useful when combined with the other techniques that we describe in the following subsections.
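A two-temperature version of the Rugged Metropolis idea in Eqs. (9.16)–(9.18) can be sketched as follows. This is a minimal illustration, not Berg's implementation: the toy energy (independent double wells in a box), the histogram-based estimators, the two temperatures, and all names are assumptions of this example. One coordinate is updated at a time, so all factors of the product in Eq. (9.16) except one cancel in the acceptance ratio.

```python
import math
import random

N_DIM, LO, HI, N_BINS = 3, -2.0, 2.0, 20
WIDTH = (HI - LO) / N_BINS

def energy(x):
    """Toy energy (assumption): an independent double well in each coordinate."""
    return sum((xi * xi - 1.0) ** 2 for xi in x)

def bin_of(xi):
    return min(int((xi - LO) / WIDTH), N_BINS - 1)

def one_variable_densities(kT, n_steps, rng):
    """Plain Metropolis at the higher temperature; returns histogram estimators."""
    x = [0.0] * N_DIM
    e = energy(x)
    counts = [[1] * N_BINS for _ in range(N_DIM)]   # start at 1: no empty bins
    for _ in range(n_steps):
        i = rng.randrange(N_DIM)
        old, new = x[i], x[i] + rng.uniform(-0.5, 0.5)
        if LO <= new < HI:
            x[i] = new
            e_new = energy(x)
            if e_new <= e or rng.random() < math.exp(-(e_new - e) / kT):
                e = e_new
            else:
                x[i] = old
        for j in range(N_DIM):
            counts[j][bin_of(x[j])] += 1
    return [[c / (sum(row) * WIDTH) for c in row] for row in counts]

def rugged_metropolis(dens, kT, n_steps, rng):
    """Biased Metropolis at the lower temperature, trial values drawn from dens."""
    x = [1.0] * N_DIM
    e, accepted, sum_e = energy(x), 0, 0.0
    bins = list(range(N_BINS))
    for _ in range(n_steps):
        i = rng.randrange(N_DIM)
        b_new = rng.choices(bins, weights=dens[i])[0]
        old, x[i] = x[i], LO + (b_new + rng.random()) * WIDTH
        e_new = energy(x)
        # log of the ratio in Eq. (9.18), reduced to the updated coordinate
        log_ratio = -(e_new - e) / kT + math.log(dens[i][bin_of(old)] / dens[i][b_new])
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            e, accepted = e_new, accepted + 1
        else:
            x[i] = old
        sum_e += e
    return sum_e / n_steps, accepted / n_steps

def exact_mean_energy(kT, n_grid=4000):
    """Reference value by midpoint quadrature (possible only for this toy model)."""
    h = (HI - LO) / n_grid
    num = den = 0.0
    for k in range(n_grid):
        x = LO + (k + 0.5) * h
        e1 = (x * x - 1.0) ** 2
        num += e1 * math.exp(-e1 / kT)
        den += math.exp(-e1 / kT)
    return N_DIM * num / den

rng = random.Random(3)
dens = one_variable_densities(kT=2.0, n_steps=20_000, rng=rng)
mean_e, acc = rugged_metropolis(dens, kT=0.3, n_steps=60_000, rng=rng)
exact_e = exact_mean_energy(0.3)
print(f"<E> = {mean_e:.3f} (quadrature: {exact_e:.3f}), acceptance = {acc:.2f}")
```

Since the bias enters both the proposal and the acceptance ratio, the low-temperature run still samples the canonical distribution; the estimators only affect the efficiency.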
9.3.3 Generalized-Ensemble Techniques

A very successful approach for improving the sampling of low-energy protein configurations is the generalized-ensemble approach. Its underlying idea is not to sample the canonical ensemble directly but an artificial ensemble tailored to enable an efficient search for local minima while at the same time avoiding entrapment. These generalized ensembles are defined in such a way that re-weighting techniques allow one to connect back to the canonical (i.e., physical) ensemble and to calculate thermodynamic averages at temperatures of interest (Hansmann 2003). A great number of such ensembles have been proposed, and while not all of them can be discussed in this review, we can classify them in principle according to whether they are generated by a random walk through order parameter space (for instance, energy),
control parameter space (temperature), or through model space (i.e., different energy functions).

9.3.3.1 Random Walks in Order Parameter Space

In generalized ensembles that are defined by random walks in order parameter space, one requires that a Monte Carlo or molecular dynamics simulation leads to a broad distribution of a pre-chosen physical quantity. This allows one to sample both low- and high-energy states with sufficient probability. For simplicity we will consider only ensembles that lead to flat distributions in one variable. Extensions to higher dimensional generalized ensembles are straightforward (Kumar et al. 1996). Probably the earliest realization of this idea is umbrella sampling (Torrie and Valleau 1977), but multicanonical sampling (Berg and Neuhaus 1991) is now more common. The first application of these techniques to protein simulations can be found in Hansmann and Okamoto (1993), where a Monte Carlo technique was used. Later, it was also adapted to molecular dynamics (Hansmann et al. 1996). The idea is to assign configurations with energy E a weight $w_{mu}(E)$ such that the distribution of energies
\[
P_{mu}(E) \propto n(E) \, w_{mu}(E) = \text{const},
\tag{9.19}
\]
where n(E) is the spectral density. Since all energies appear with equal probability, a free random walk in energy space is enforced: the simulation can overcome any energy barrier and will not get trapped in one of the many local minima. Since a large range of energies is sampled, it is now possible to obtain a canonical distribution for a wide range of temperatures by re-weighting techniques (Ferrenberg and Swendsen 1988):
\[
P_B(T, E) \propto P_{mu}(E) \, w_{mu}^{-1}(E) \, e^{-\beta E} .
\tag{9.20}
\]
This allows one to calculate the expectation value of any physical quantity O at temperature T by
\[
\langle O \rangle_T = \frac{\int dE \, O(E) \, P_B(T, E)}{\int dE \, P_B(T, E)} .
\tag{9.21}
\]
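The re-weighting step of Eqs. (9.20) and (9.21) can be illustrated with a small Python sketch. The toy system (N independent two-level units, so that n(E) is a binomial coefficient and the canonical mean energy is known exactly), the idealized multicanonical run that simply draws energies uniformly, and all names are assumptions of this example. Each sampled configuration receives the factor $w_{mu}^{-1}(E)\,e^{-\beta E}$, evaluated in logarithmic form for numerical stability.

```python
import math
import random

N = 20        # number of two-level units (toy system, an assumption)
BETA = 1.0    # inverse temperature at which the canonical average is wanted

# Exact density of states n(E) = C(N, E) and multicanonical weights w_mu = 1/n(E).
log_n = [math.log(math.comb(N, e)) for e in range(N + 1)]
log_w_mu = [-v for v in log_n]

# Idealized multicanonical run: P_mu(E) = const, so E is uniformly distributed.
rng = random.Random(1)
energies = [rng.randrange(N + 1) for _ in range(20_000)]

def reweighted_average(observable, energies, log_w, beta):
    """Eqs. (9.20)-(9.21) as a sum over sampled configurations."""
    log_factors = [-log_w[e] - beta * e for e in energies]   # ln(w^-1 e^(-beta E))
    shift = max(log_factors)                                  # avoids overflow
    factors = [math.exp(v - shift) for v in log_factors]
    numerator = sum(observable(e) * f for e, f in zip(energies, factors))
    return numerator / sum(factors)

mean_e = reweighted_average(lambda e: e, energies, log_w_mu, BETA)
exact = N / (1.0 + math.exp(BETA))   # exact canonical <E> for this toy system
print(f"re-weighted <E> = {mean_e:.3f}, exact = {exact:.3f}")
```

The same set of sampled energies can be re-weighted to other values of β by changing only the last argument, which is the practical advantage of a flat energy distribution.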
The price for the resulting improved sampling is that (unlike in the canonical ensemble) the weights $w_{mu}(E) \propto n^{-1}(E)$ are not known a priori (in fact, knowledge of the exact weights is equivalent to obtaining the density of states n(E), i.e., solving the system), and one needs estimates of them for a numerical simulation. Calculation of the weights is usually done by an iterative procedure (Berg 2004; Hansmann and Okamoto 1993, 1994). Another efficient recursion is the so-called Wang–Landau sampling (Wang and Landau 2001), where one performs updates with estimators n(E) of the density of states:
\[
p(E_1 \to E_2) = \min\left( \frac{n(E_1)}{n(E_2)}, 1 \right) .
\tag{9.22}
\]
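A compact Python sketch of the Wang–Landau recursion is given here for orientation; it combines the acceptance rule of Eq. (9.22) with the estimator update and the refinement of the modification factor given in Eqs. (9.23) and (9.24). The toy system (ten two-level units with E equal to the number of excited units), the 80% histogram-flatness test used to decide when the energy range counts as covered, the step limits, and all names are assumptions of this example. The estimator is stored as ln n(E) to avoid overflow.

```python
import math
import random

def wang_landau(n_units, ln_f_final, rng, check_every=2000,
                flatness=0.8, max_steps_per_stage=200_000):
    """Estimate ln n(E) for n_units two-level units, E = number of excited units."""
    state = [0] * n_units
    e = 0
    ln_g = [0.0] * (n_units + 1)    # ln n(E); initially n(E) = 1
    ln_f = 1.0                      # f_0 = e^1
    while ln_f > ln_f_final:
        hist = [0] * (n_units + 1)
        for step in range(1, max_steps_per_stage + 1):
            k = rng.randrange(n_units)
            e_new = e + (1 - 2 * state[k])       # flipping unit k changes E by +-1
            d = ln_g[e] - ln_g[e_new]            # ln of n(E1)/n(E2), Eq. (9.22)
            if d >= 0.0 or rng.random() < math.exp(d):
                state[k] ^= 1
                e = e_new
            ln_g[e] += ln_f                      # n(E) -> n(E) * f, Eq. (9.23)
            hist[e] += 1
            # assumed coverage test: every level visited at least 80% of the mean
            if step % check_every == 0 and min(hist) >= flatness * sum(hist) / len(hist):
                break
        ln_f *= 0.5                              # f -> sqrt(f), Eq. (9.24)
    return ln_g

n_units = 10
ln_g = wang_landau(n_units, 1e-3, random.Random(2))
exact_ln_g = [math.log(math.comb(n_units, e)) for e in range(n_units + 1)]
# Only ratios of n(E) are determined, so compare relative to E = 0.
max_err = max(abs((ln_g[e] - ln_g[0]) - exact_ln_g[e]) for e in range(n_units + 1))
print(f"largest deviation of ln n(E) from the exact value: {max_err:.3f}")
```

For this toy system the exact density of states is a binomial coefficient, which makes it easy to judge how far the recursion has converged for a given final value of ln f.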
Each time an energy level is visited, the estimator is updated according to
\[
n(E) \to n(E) \, f
\tag{9.23}
\]
where, initially, $n(E) = 1$ and $f = f_0 = e^1$. Once the desired energy range is covered, the factor f is refined,
\[
f_1 = \sqrt{f_0}, \quad f_{n+1} = \sqrt{f_n},
\tag{9.24}
\]
until f is sufficiently close to 1 (i.e., ln f has reached some small value).

In multicanonical simulations the computational effort increases with the number of residues N like $\approx N^4$ (when measured in Metropolis updates) (Hansmann and Okamoto 1999b). In general, the computational effort in simulations increases with $\approx X^2$, where X is the variable in which one wants a flat distribution. This is because generalized-ensemble simulations realize, by construction of the ensemble, a 1D random walk in the chosen quantity X. In the multicanonical algorithm the reaction coordinate X is the potential energy, X = E. Since $E \propto N^2$, the above scaling relation for the computational effort, $\approx N^4$, is recovered. Hence, multicanonical sampling is not always the optimal generalized-ensemble algorithm in protein simulations. A better scaling of the computer time with the size of the molecule may be obtained by choosing a more appropriate reaction coordinate for the ensemble than the energy. This is the motivation behind the various other realizations of the generalized-ensemble approach that exist. All aim at sampling a broad range of energies. In this way the simulation will overcome energy barriers and allow escape from local minima. For instance, in Hansmann and Okamoto (1999a) it was proposed that configurations are updated according to a special choice of the Tsallis generalized mechanics formalism (Curado and Tsallis 1994) (the Tsallis parameter q is chosen as $q = 1 + 1/n_F$):
\[
w(E) = \left( 1 + \frac{\beta (E - E_0)}{n_F} \right)^{-n_F} .
\tag{9.25}
\]
Here E0 is an estimator for the ground-state energy and nF is the number of degrees of freedom of the system. The weight reduces in the low-energy region to the canonical Boltzmann weight exp(−βE). This is because E − E0 → 0 for β → 0 leading to β(E − E0 )/nF