VOLUME FOUR HUNDRED AND SIXTY-SEVEN

METHODS IN ENZYMOLOGY
Computer Methods, Part B

Editors-in-Chief
JOHN N. ABELSON AND MELVIN I. SIMON
Division of Biology, California Institute of Technology, Pasadena, California, USA

Founding Editors
SIDNEY P. COLOWICK AND NATHAN O. KAPLAN

EDITED BY
MICHAEL L. JOHNSON
University of Virginia Health Sciences Center, Department of Pharmacology, Charlottesville, Virginia, USA

LUDWIG BRAND
Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
Academic Press is an imprint of Elsevier
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
32 Jamestown Road, London NW1 7BY, UK

First edition 2009

Copyright © 2009, Elsevier Inc. All Rights Reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: permissions@elsevier.com. Alternatively, you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions and selecting "Obtaining permission to use Elsevier material."

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

For information on all Academic Press publications visit our website at elsevierdirect.com

ISBN: 978-0-12-375023-5
ISSN: 0076-6879

Printed and bound in the United States of America
09 10 11 12    10 9 8 7 6 5 4 3 2 1
CONTENTS

Contributors  xiii
Preface  xix
Volumes in Series  xxi

1. Correlation Analysis: A Tool for Comparing Relaxation-Type Models to Experimental Data  1
Maurizio Tomaiuolo, Joel Tabak, and Richard Bertram
   1. Introduction  2
   2. Scatter Plots and Correlation Analysis  3
   3. Example 1: Relaxation Oscillations  4
   4. Example 2: Square Wave Bursting  13
   5. Example 3: Elliptic Bursting  15
   6. Example 4: Using Correlation Analysis on Experimental Data  18
   7. Summary  19
   Acknowledgment  20
   References  20

2. Trait Variability of Cancer Cells Quantified by High-Content Automated Microscopy of Single Cells  23
Vito Quaranta, Darren R. Tyson, Shawn P. Garbett, Brandy Weidow, Mark P. Harris, and Walter Georgescu
   1. Introduction  24
   2. Background  25
   3. Experimental and Computational Workflow  26
   4. Application to Traits Relevant to Cancer Progression  34
   5. Conclusions  54
   Acknowledgments  54
   References  54

3. Matrix Factorization for Recovery of Biological Processes from Microarray Data  59
Andrew V. Kossenkov and Michael F. Ochs
   1. Introduction  59
   2. Overview of Methods  63
   3. Application to the Rosetta Compendium  68
   4. Results of Analyses  70
   5. Discussion  74
   References  75

4. Modeling and Simulation of the Immune System as a Self-Regulating Network  79
Peter S. Kim, Doron Levy, and Peter P. Lee
   1. Introduction  80
   2. Mathematical Modeling of the Immune Network  84
   3. Two Examples of Models to Understand T Cell Regulation  92
   4. How to Implement Mathematical Models in Computer Simulations  100
   5. Concluding Remarks  105
   Acknowledgments  106
   References  107

5. Entropy Demystified: The "Thermo"-dynamics of Stochastically Fluctuating Systems  111
Hong Qian
   1. Introduction  112
   2. Energy  113
   3. Entropy and "Thermo"-dynamics of Markov Processes  117
   4. A Three-State Two-Cycle Motor Protein  122
   5. Phosphorylation–Dephosphorylation Cycle Kinetics  125
   6. Summary and Challenges  131
   References  132

6. Effect of Kinetics on Sedimentation Velocity Profiles and the Role of Intermediates  135
John J. Correia, P. Holland Alday, Peter Sherwood, and Walter F. Stafford
   1. Introduction  136
   2. Methods  138
   3. ABCD Systems  141
   4. Monomer–Tetramer Model  151
   5. Summary  158
   Acknowledgments  159
   References  159

7. Algebraic Models of Biochemical Networks  163
Reinhard Laubenbacher and Abdul Salam Jarrah
   1. Introduction  164
   2. Computational Systems Biology  165
   3. Network Inference  176
   4. Reverse-Engineering of Discrete Models: An Example  181
   5. Discussion  190
   References  193

8. High-Throughput Computing in the Sciences  197
Mark Morgan and Andrew Grimshaw
   1. What is an HTC Application?  199
   2. HTC Technologies  200
   3. High-Throughput Computing Examples  204
   4. Advanced Topics  218
   5. Summary  226
   References  226

9. Large Scale Transcriptome Data Integration Across Multiple Tissues to Decipher Stem Cell Signatures  229
Ghislain Bidaut and Christian J. Stoeckert
   1. Introduction  230
   2. Systems and Data Sources  231
   3. Data Integration  236
   4. Artificial Neural Network Training and Validation  238
   5. Future Development and Enhancement Plans  243
   Acknowledgments  244
   References  244

10. DynaFit—A Software Package for Enzymology  247
Petr Kuzmič
   1. Introduction  248
   2. Equilibrium Binding Studies  250
   3. Initial Rates of Enzyme Reactions  255
   4. Time Course of Enzyme Reactions  260
   5. General Methods and Algorithms  262
   6. Concluding Remarks  275
   Acknowledgments  276
   References  276

11. Discrete Dynamic Modeling of Cellular Signaling Networks  281
Réka Albert and Rui-Sheng Wang
   1. Introduction  282
   2. Cellular Signaling Networks  284
   3. Boolean Dynamic Modeling  286
   4. Variants of Boolean Network Models  297
   5. Application Examples  301
   6. Conclusion and Discussion  303
   Acknowledgments  303
   References  303

12. The Basic Concepts of Molecular Modeling  307
Akansha Saxena, Diana Wong, Karthikeyan Diraviyam, and David Sept
   1. Introduction  308
   2. Homology Modeling  308
   3. Molecular Dynamics  317
   4. Molecular Docking  324
   References  330

13. Deterministic and Stochastic Models of Genetic Regulatory Networks  335
Ilya Shmulevich and John D. Aitchison
   1. Introduction  336
   2. Boolean Networks  337
   3. Differential Equation Models  343
   4. Probabilistic Boolean Networks  347
   5. Stochastic Differential Equation Models  351
   References  353

14. Bayesian Probability Approach to ADHD Appraisal  357
Raina Robeva and Jennifer Kim Penberthy
   1. Introduction  358
   2. Bayesian Probability Algorithm  362
   3. The Value of Bayesian Probability Approach as a Meta-Analysis Tool  369
   4. Discussion and Future Directions  373
   Acknowledgment  377
   References  378

15. Simple Stochastic Simulation  381
Maria J. Schilstra and Stephen R. Martin
   1. Introduction  382
   2. Understanding Reaction Dynamics  385
   3. Graphical Notation  386
   4. Reactions  389
   5. Reaction Kinetics  389
   6. Transition Firing Rules  393
   7. Summary  406
   8. Notes  407
   References  409

16. Monte Carlo Simulation in Establishing Analytical Quality Requirements for Clinical Laboratory Tests: Meeting Clinical Needs  411
James C. Boyd and David E. Bruns
   1. Introduction  412
   2. Modeling Approach  414
   3. Methods for Simulation Study  416
   4. Results  417
   5. Discussion  429
   References  431

17. Nonlinear Dynamical Analysis and Optimization for Biological/Biomedical Systems  435
Amos Ben-Zvi and Jong Min Lee
   1. Introduction  436
   2. Hypothalamic–Pituitary–Adrenal Axis System  437
   3. Development of a Clinically Relevant Performance-Assessment Tool  441
   4. Dynamic Programming  452
   5. Computation of Optimal Treatments for HPA Axis System  455
   6. Conclusions  458
   Acknowledgments  458
   References  458

18. Modeling of Growth Factor-Receptor Systems: From Molecular-Level Protein Interaction Networks to Whole-Body Compartment Models  461
Florence T. H. Wu, Marianne O. Stefanini, Feilim Mac Gabhann, and Aleksander S. Popel
   1. Background  462
   2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments  466
   3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions  474
   4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue  482
   5. Multitissue Compartmental Models: Simulation of Whole Body  485
   6. Conclusions  493
   Acknowledgments  494
   References  494

19. The Least-Squares Analysis of Data from Binding and Enzyme Kinetics Studies: Weights, Bias, and Confidence Intervals in Usual and Unusual Situations  499
Joel Tellinghuisen
   1. Introduction  500
   2. Least Squares Review  503
   3. Statistics of Reciprocals  506
   4. Weights When y is a True Dependent Variable  511
   5. Unusual Weighting: When x is the Dependent Variable  521
   6. Assessing Data Uncertainty: Variance Function Estimation  524
   7. Conclusion  526
   References  527

20. Nonparametric Entropy Estimation Using Kernel Densities  531
Douglas E. Lake
   1. Introduction  532
   2. Motivating Application: Classifying Cardiac Rhythms  533
   3. Renyi Entropy and the Friedman–Tukey Index  535
   4. Kernel Density Estimation  536
   5. Mean-Integrated Square Error  538
   6. Estimating the FT Index  540
   7. Connection Between Template Matches and Kernel Densities  544
   8. Summary and Future Work  545
   Acknowledgments  545
   References  546

21. Pancreatic Network Control of Glucagon Secretion and Counterregulation  547
Leon S. Farhy and Anthony L. McCall
   1. Introduction  548
   2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes  550
   3. Interdisciplinary Approach to Investigating the Defects in the GCR  551
   4. Initial Qualitative Analysis of the GCR Control Axis  553
   5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats  556
   6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin Deficient Pancreas  560
   7. Advantages and Limitations of the Interdisciplinary Approach  571
   8. Conclusions  575
   Acknowledgment  575
   References  575

22. Enzyme Kinetics and Computational Modeling for Systems Biology  583
Pedro Mendes, Hanan Messiha, Naglis Malys, and Stefan Hoops
   1. Introduction  584
   2. Computational Modeling and Enzyme Kinetics  586
   3. Yeast Triosephosphate Isomerase (EC 5.3.1.1)  588
   4. Initial Rate Analysis  590
   5. Progress Curve Analysis  594
   6. Concluding Remarks  598
   Acknowledgments  598
   References  598

23. Fitting Enzyme Kinetic Data with KinTek Global Kinetic Explorer  601
Kenneth A. Johnson
   1. Background  602
   2. Challenges of Fitting by Simulation  603
   3. Methods  605
   4. Progress Curve Kinetics  610
   5. Fitting Full Progress Curves  613
   6. Slow Onset Inhibition Kinetics  620
   7. Summary  624
   Acknowledgments  625
   References  625

Author Index  627
Subject Index  637
CONTRIBUTORS

John D. Aitchison
Institute for Systems Biology, Seattle, Washington, USA

Réka Albert
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

P. Holland Alday
Department of Biochemistry, University of Mississippi Medical Center, Jackson, Mississippi, USA

Amos Ben-Zvi
Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada

Richard Bertram
Department of Mathematics and Programs in Neuroscience and Molecular Biophysics, Florida State University, Tallahassee, Florida, USA

Ghislain Bidaut
Inserm, UMR891, CRCM, Integrative Bioinformatics, and Institut Paoli-Calmettes; Univ Méditerranée, Marseille, France

James C. Boyd
Department of Pathology, University of Virginia Health System, Charlottesville, Virginia, USA

David E. Bruns
Department of Pathology, University of Virginia Health System, Charlottesville, Virginia, USA

John J. Correia
Department of Biochemistry, University of Mississippi Medical Center, Jackson, Mississippi, USA

Karthikeyan Diraviyam
Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Leon S. Farhy
Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Feilim Mac Gabhann
Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, USA

Shawn P. Garbett
Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Walter Georgescu
Vanderbilt Integrative Cancer Biology Center, and Department of Biomedical Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Andrew Grimshaw
Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA

Mark P. Harris
Department of Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Stefan Hoops
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA

Abdul Salam Jarrah
Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, USA

Kenneth A. Johnson
Department of Chemistry and Biochemistry, Institute for Cell and Molecular Biology, University of Texas, Austin, Texas, USA

Peter S. Kim
Department of Mathematics, University of Utah, Salt Lake City, Utah, USA

Andrew V. Kossenkov
The Wistar Institute, Philadelphia, Pennsylvania, USA

Petr Kuzmič
BioKin Ltd., Watertown, Massachusetts, USA

Douglas E. Lake
Departments of Internal Medicine (Cardiovascular Division) and Statistics, University of Virginia, Charlottesville, Virginia, USA

Reinhard Laubenbacher
Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, Virginia, USA

Jong Min Lee
Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada

Peter P. Lee
Division of Hematology, Department of Medicine, Stanford University, Stanford, California, USA

Doron Levy
Department of Mathematics and Center for Scientific Computation and Mathematical Modeling (CSCAMM), University of Maryland, College Park, Maryland, USA

Naglis Malys
Manchester Centre for Integrative Systems Biology, and Faculty of Life Sciences, The University of Manchester, Manchester, United Kingdom

Stephen R. Martin
Division of Physical Biochemistry, MRC National Institute for Medical Research, London, United Kingdom

Anthony L. McCall
Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Pedro Mendes
Manchester Centre for Integrative Systems Biology, and School of Computer Science, The University of Manchester, Manchester, United Kingdom; Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA

Hanan Messiha
Manchester Centre for Integrative Systems Biology, and School of Chemistry, The University of Manchester, Manchester, United Kingdom

Mark Morgan
Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA

Michael F. Ochs
The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, Maryland, USA

Jennifer Kim Penberthy
Department of Psychiatry and Neurobehavioral Sciences, University of Virginia Health System, Charlottesville, Virginia, USA

Aleksander S. Popel
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

Hong Qian
Department of Applied Mathematics, University of Washington, Seattle, Washington, USA

Vito Quaranta
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Raina Robeva
Department of Mathematical Sciences, Sweet Briar College, Sweet Briar, Virginia, USA

Akansha Saxena
Biomedical Engineering, Washington University, St Louis, Missouri, USA

Maria J. Schilstra
Biological and Neural Computation Group, Science and Technology Research Institute, University of Hertfordshire, Hatfield, United Kingdom

David Sept
Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Peter Sherwood
Boston Biomedical Research Institute, Watertown, Massachusetts, USA

Ilya Shmulevich
Institute for Systems Biology, Seattle, Washington, USA

Walter F. Stafford
Boston Biomedical Research Institute, Watertown, Massachusetts, USA

Marianne O. Stefanini
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA

Christian J. Stoeckert
Center for Bioinformatics, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA

Joel Tabak
Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA

Joel Tellinghuisen
Department of Chemistry, Vanderbilt University, Nashville, Tennessee, USA

Maurizio Tomaiuolo
Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA

Darren R. Tyson
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Rui-Sheng Wang
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

Brandy Weidow
Department of Cancer Biology, and Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA

Diana Wong
Biomedical Engineering, Washington University, St Louis, Missouri, USA

Florence T. H. Wu
Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
PREFACE

A general perception exists that the only applications of computers and computer methods in biological and biomedical research are basic statistical analysis and the searching of DNA sequence databases. While these are important applications, they only scratch the surface of the current and potential applications of computers and computer methods in biomedical research. The chapters within this volume include a wide variety of applications that extend well beyond this limited perception.

The use of computers and computational methods has become ubiquitous in biological and biomedical research, driven by numerous factors. One primary reason is the emphasis placed on computers and computational methods within the National Institutes of Health (NIH) Roadmap. Another is the increased level of mathematical and computational sophistication among researchers, particularly among junior scientists, students, journal reviewers, and NIH Study Section members. A third is the rapid advance of computer hardware and software, which has made these methods far more accessible to the rank-and-file research community.

The training of the majority of senior M.D.s and Ph.D.s in clinical or basic disciplines at academic research and medical centers commonly does not include advanced coursework in mathematics, numerical analysis, statistics, or computer science. The chapters within this volume have been written to be accessible to this target audience.

MICHAEL L. JOHNSON
LUDWIG BRAND
METHODS IN ENZYMOLOGY
VOLUME I. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME II. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME III. Preparation and Assay of Substrates
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME IV. Special Techniques for the Enzymologist
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME V. Preparation and Assay of Enzymes
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VI. Preparation and Assay of Enzymes (Continued), Preparation and Assay of Substrates, Special Techniques
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VII. Cumulative Subject Index
Edited by SIDNEY P. COLOWICK AND NATHAN O. KAPLAN
VOLUME VIII. Complex Carbohydrates
Edited by ELIZABETH F. NEUFELD AND VICTOR GINSBURG
VOLUME IX. Carbohydrate Metabolism
Edited by WILLIS A. WOOD
VOLUME X. Oxidation and Phosphorylation
Edited by RONALD W. ESTABROOK AND MAYNARD E. PULLMAN
VOLUME XI. Enzyme Structure
Edited by C. H. W. HIRS
VOLUME XII. Nucleic Acids (Parts A and B)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XIII. Citric Acid Cycle
Edited by J. M. LOWENSTEIN
VOLUME XIV. Lipids
Edited by J. M. LOWENSTEIN
VOLUME XV. Steroids and Terpenoids
Edited by RAYMOND B. CLAYTON
VOLUME XVI. Fast Reactions
Edited by KENNETH KUSTIN
VOLUME XVII. Metabolism of Amino Acids and Amines (Parts A and B)
Edited by HERBERT TABOR AND CELIA WHITE TABOR
VOLUME XVIII. Vitamins and Coenzymes (Parts A, B, and C)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME XIX. Proteolytic Enzymes
Edited by GERTRUDE E. PERLMANN AND LASZLO LORAND
VOLUME XX. Nucleic Acids and Protein Synthesis (Part C)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME XXI. Nucleic Acids (Part D)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XXII. Enzyme Purification and Related Techniques
Edited by WILLIAM B. JAKOBY
VOLUME XXIII. Photosynthesis (Part A)
Edited by ANTHONY SAN PIETRO
VOLUME XXIV. Photosynthesis and Nitrogen Fixation (Part B)
Edited by ANTHONY SAN PIETRO
VOLUME XXV. Enzyme Structure (Part B)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVI. Enzyme Structure (Part C)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVII. Enzyme Structure (Part D)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XXVIII. Complex Carbohydrates (Part B)
Edited by VICTOR GINSBURG
VOLUME XXIX. Nucleic Acids and Protein Synthesis (Part E)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME XXX. Nucleic Acids and Protein Synthesis (Part F)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME XXXI. Biomembranes (Part A)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME XXXII. Biomembranes (Part B)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME XXXIII. Cumulative Subject Index Volumes I-XXX
Edited by MARTHA G. DENNIS AND EDWARD A. DENNIS
VOLUME XXXIV. Affinity Techniques (Enzyme Purification: Part B)
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XXXV. Lipids (Part B)
Edited by JOHN M. LOWENSTEIN
VOLUME XXXVI. Hormone Action (Part A: Steroid Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XXXVII. Hormone Action (Part B: Peptide Hormones)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XXXVIII. Hormone Action (Part C: Cyclic Nucleotides)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY
VOLUME XXXIX. Hormone Action (Part D: Isolated Cells, Tissues, and Organ Systems)
Edited by JOEL G. HARDMAN AND BERT W. O'MALLEY
VOLUME XL. Hormone Action (Part E: Nuclear Structure and Function)
Edited by BERT W. O'MALLEY AND JOEL G. HARDMAN
VOLUME XLI. Carbohydrate Metabolism (Part B)
Edited by W. A. WOOD
VOLUME XLII. Carbohydrate Metabolism (Part C)
Edited by W. A. WOOD
VOLUME XLIII. Antibiotics
Edited by JOHN H. HASH
VOLUME XLIV. Immobilized Enzymes
Edited by KLAUS MOSBACH
VOLUME XLV. Proteolytic Enzymes (Part B)
Edited by LASZLO LORAND
VOLUME XLVI. Affinity Labeling
Edited by WILLIAM B. JAKOBY AND MEIR WILCHEK
VOLUME XLVII. Enzyme Structure (Part E)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XLVIII. Enzyme Structure (Part F)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME XLIX. Enzyme Structure (Part G)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME L. Complex Carbohydrates (Part C)
Edited by VICTOR GINSBURG
VOLUME LI. Purine and Pyrimidine Nucleotide Metabolism
Edited by PATRICIA A. HOFFEE AND MARY ELLEN JONES
VOLUME LII. Biomembranes (Part C: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LIII. Biomembranes (Part D: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LIV. Biomembranes (Part E: Biological Oxidations)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LV. Biomembranes (Part F: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LVI. Biomembranes (Part G: Bioenergetics)
Edited by SIDNEY FLEISCHER AND LESTER PACKER
VOLUME LVII. Bioluminescence and Chemiluminescence
Edited by MARLENE A. DELUCA
VOLUME LVIII. Cell Culture
Edited by WILLIAM B. JAKOBY AND IRA PASTAN
VOLUME LIX. Nucleic Acids and Protein Synthesis (Part G)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME LX. Nucleic Acids and Protein Synthesis (Part H)
Edited by KIVIE MOLDAVE AND LAWRENCE GROSSMAN
VOLUME 61. Enzyme Structure (Part H)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 62. Vitamins and Coenzymes (Part D)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 63. Enzyme Kinetics and Mechanism (Part A: Initial Rate and Inhibitor Methods)
Edited by DANIEL L. PURICH
VOLUME 64. Enzyme Kinetics and Mechanism (Part B: Isotopic Probes and Complex Enzyme Systems)
Edited by DANIEL L. PURICH
VOLUME 65. Nucleic Acids (Part I)
Edited by LAWRENCE GROSSMAN AND KIVIE MOLDAVE
VOLUME 66. Vitamins and Coenzymes (Part E)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 67. Vitamins and Coenzymes (Part F)
Edited by DONALD B. MCCORMICK AND LEMUEL D. WRIGHT
VOLUME 68. Recombinant DNA
Edited by RAY WU
VOLUME 69. Photosynthesis and Nitrogen Fixation (Part C)
Edited by ANTHONY SAN PIETRO
VOLUME 70. Immunochemical Techniques (Part A)
Edited by HELEN VAN VUNAKIS AND JOHN J. LANGONE
VOLUME 71. Lipids (Part C)
Edited by JOHN M. LOWENSTEIN
VOLUME 72. Lipids (Part D)
Edited by JOHN M. LOWENSTEIN
VOLUME 73. Immunochemical Techniques (Part B)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 74. Immunochemical Techniques (Part C)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 75. Cumulative Subject Index Volumes XXXI, XXXII, XXXIV–LX
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS
VOLUME 76. Hemoglobins
Edited by ERALDO ANTONINI, LUIGI ROSSI-BERNARDI, AND EMILIA CHIANCONE
VOLUME 77. Detoxication and Drug Metabolism
Edited by WILLIAM B. JAKOBY
VOLUME 78. Interferons (Part A)
Edited by SIDNEY PESTKA
VOLUME 79. Interferons (Part B)
Edited by SIDNEY PESTKA
VOLUME 80. Proteolytic Enzymes (Part C)
Edited by LASZLO LORAND
VOLUME 81. Biomembranes (Part H: Visual Pigments and Purple Membranes, I)
Edited by LESTER PACKER
VOLUME 82. Structural and Contractile Proteins (Part A: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM AND DIXIE W. FREDERIKSEN
VOLUME 83. Complex Carbohydrates (Part D)
Edited by VICTOR GINSBURG
VOLUME 84. Immunochemical Techniques (Part D: Selected Immunoassays)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 85. Structural and Contractile Proteins (Part B: The Contractile Apparatus and the Cytoskeleton)
Edited by DIXIE W. FREDERIKSEN AND LEON W. CUNNINGHAM
VOLUME 86. Prostaglandins and Arachidonate Metabolites
Edited by WILLIAM E. M. LANDS AND WILLIAM L. SMITH
VOLUME 87. Enzyme Kinetics and Mechanism (Part C: Intermediates, Stereochemistry, and Rate Studies)
Edited by DANIEL L. PURICH
VOLUME 88. Biomembranes (Part I: Visual Pigments and Purple Membranes, II)
Edited by LESTER PACKER
VOLUME 89. Carbohydrate Metabolism (Part D)
Edited by WILLIS A. WOOD
VOLUME 90. Carbohydrate Metabolism (Part E)
Edited by WILLIS A. WOOD
VOLUME 91. Enzyme Structure (Part I)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 92. Immunochemical Techniques (Part E: Monoclonal Antibodies and General Immunoassay Methods)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 93. Immunochemical Techniques (Part F: Conventional Antibodies, Fc Receptors, and Cytotoxicity)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 94. Polyamines
Edited by HERBERT TABOR AND CELIA WHITE TABOR
VOLUME 95. Cumulative Subject Index Volumes 61–74, 76–80
Edited by EDWARD A. DENNIS AND MARTHA G. DENNIS
VOLUME 96. Biomembranes [Part J: Membrane Biogenesis: Assembly and Targeting (General Methods; Eukaryotes)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 97. Biomembranes [Part K: Membrane Biogenesis: Assembly and Targeting (Prokaryotes, Mitochondria, and Chloroplasts)]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 98. Biomembranes (Part L: Membrane Biogenesis: Processing and Recycling)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 99. Hormone Action (Part F: Protein Kinases)
Edited by JACKIE D. CORBIN AND JOEL G. HARDMAN
VOLUME 100. Recombinant DNA (Part B)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE
VOLUME 101. Recombinant DNA (Part C)
Edited by RAY WU, LAWRENCE GROSSMAN, AND KIVIE MOLDAVE
VOLUME 102. Hormone Action (Part G: Calmodulin and Calcium-Binding Proteins)
Edited by ANTHONY R. MEANS AND BERT W. O'MALLEY
VOLUME 103. Hormone Action (Part H: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 104. Enzyme Purification and Related Techniques (Part C)
Edited by WILLIAM B. JAKOBY
VOLUME 105. Oxygen Radicals in Biological Systems
Edited by LESTER PACKER
VOLUME 106. Posttranslational Modifications (Part A)
Edited by FINN WOLD AND KIVIE MOLDAVE
VOLUME 107. Posttranslational Modifications (Part B)
Edited by FINN WOLD AND KIVIE MOLDAVE
VOLUME 108. Immunochemical Techniques (Part G: Separation and Characterization of Lymphoid Cells)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS
VOLUME 109. Hormone Action (Part I: Peptide Hormones)
Edited by LUTZ BIRNBAUMER AND BERT W. O'MALLEY
VOLUME 110. Steroids and Isoprenoids (Part A)
Edited by JOHN H. LAW AND HANS C. RILLING
VOLUME 111. Steroids and Isoprenoids (Part B)
Edited by JOHN H. LAW AND HANS C. RILLING
VOLUME 112. Drug and Enzyme Targeting (Part A)
Edited by KENNETH J. WIDDER AND RALPH GREEN
VOLUME 113. Glutamate, Glutamine, Glutathione, and Related Compounds
Edited by ALTON MEISTER
VOLUME 114. Diffraction Methods for Biological Macromolecules (Part A)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF
VOLUME 115. Diffraction Methods for Biological Macromolecules (Part B)
Edited by HAROLD W. WYCKOFF, C. H. W. HIRS, AND SERGE N. TIMASHEFF
VOLUME 116. Immunochemical Techniques (Part H: Effectors and Mediators of Lymphoid Cell Functions)
Edited by GIOVANNI DI SABATO, JOHN J. LANGONE, AND HELEN VAN VUNAKIS
VOLUME 117. Enzyme Structure (Part J)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 118. Plant Molecular Biology
Edited by ARTHUR WEISSBACH AND HERBERT WEISSBACH
VOLUME 119. Interferons (Part C)
Edited by SIDNEY PESTKA
VOLUME 120. Cumulative Subject Index Volumes 81–94, 96–101
VOLUME 121. Immunochemical Techniques (Part I: Hybridoma Technology and Monoclonal Antibodies)
Edited by JOHN J. LANGONE AND HELEN VAN VUNAKIS
VOLUME 122. Vitamins and Coenzymes (Part G)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 123. Vitamins and Coenzymes (Part H)
Edited by FRANK CHYTIL AND DONALD B. MCCORMICK
VOLUME 124. Hormone Action (Part J: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 125. Biomembranes (Part M: Transport in Bacteria, Mitochondria, and Chloroplasts: General Approaches and Transport Systems)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 126. Biomembranes (Part N: Transport in Bacteria, Mitochondria, and Chloroplasts: Protonmotive Force)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 127. Biomembranes (Part O: Protons and Water: Structure and Translocation)
Edited by LESTER PACKER
VOLUME 128. Plasma Lipoproteins (Part A: Preparation, Structure, and Molecular Biology)
Edited by JERE P. SEGREST AND JOHN J. ALBERS
VOLUME 129. Plasma Lipoproteins (Part B: Characterization, Cell Biology, and Metabolism)
Edited by JOHN J. ALBERS AND JERE P. SEGREST
VOLUME 130. Enzyme Structure (Part K)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 131. Enzyme Structure (Part L)
Edited by C. H. W. HIRS AND SERGE N. TIMASHEFF
VOLUME 132. Immunochemical Techniques (Part J: Phagocytosis and Cell-Mediated Cytotoxicity)
Edited by GIOVANNI DI SABATO AND JOHANNES EVERSE
VOLUME 133. Bioluminescence and Chemiluminescence (Part B)
Edited by MARLENE DELUCA AND WILLIAM D. MCELROY
VOLUME 134. Structural and Contractile Proteins (Part C: The Contractile Apparatus and the Cytoskeleton)
Edited by RICHARD B. VALLEE
VOLUME 135. Immobilized Enzymes and Cells (Part B)
Edited by KLAUS MOSBACH
VOLUME 136. Immobilized Enzymes and Cells (Part C)
Edited by KLAUS MOSBACH
VOLUME 137. Immobilized Enzymes and Cells (Part D)
Edited by KLAUS MOSBACH
VOLUME 138. Complex Carbohydrates (Part E)
Edited by VICTOR GINSBURG
VOLUME 139. Cellular Regulators (Part A: Calcium- and Calmodulin-Binding Proteins)
Edited by ANTHONY R. MEANS AND P. MICHAEL CONN
VOLUME 140. Cumulative Subject Index Volumes 102–119, 121–134
VOLUME 141. Cellular Regulators (Part B: Calcium and Lipids)
Edited by P. MICHAEL CONN AND ANTHONY R. MEANS
VOLUME 142. Metabolism of Aromatic Amino Acids and Amines
Edited by SEYMOUR KAUFMAN
VOLUME 143. Sulfur and Sulfur Amino Acids
Edited by WILLIAM B. JAKOBY AND OWEN GRIFFITH
VOLUME 144. Structural and Contractile Proteins (Part D: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM
VOLUME 145. Structural and Contractile Proteins (Part E: Extracellular Matrix)
Edited by LEON W. CUNNINGHAM
VOLUME 146. Peptide Growth Factors (Part A)
Edited by DAVID BARNES AND DAVID A. SIRBASKU
VOLUME 147. Peptide Growth Factors (Part B)
Edited by DAVID BARNES AND DAVID A. SIRBASKU
VOLUME 148. Plant Cell Membranes
Edited by LESTER PACKER AND ROLAND DOUCE
VOLUME 149. Drug and Enzyme Targeting (Part B)
Edited by RALPH GREEN AND KENNETH J. WIDDER
VOLUME 150. Immunochemical Techniques (Part K: In Vitro Models of B and T Cell Functions and Lymphoid Cell Receptors)
Edited by GIOVANNI DI SABATO
VOLUME 151. Molecular Genetics of Mammalian Cells
Edited by MICHAEL M. GOTTESMAN
VOLUME 152. Guide to Molecular Cloning Techniques
Edited by SHELBY L. BERGER AND ALAN R. KIMMEL
VOLUME 153. Recombinant DNA (Part D)
Edited by RAY WU AND LAWRENCE GROSSMAN
VOLUME 154. Recombinant DNA (Part E)
Edited by RAY WU AND LAWRENCE GROSSMAN
VOLUME 155. Recombinant DNA (Part F)
Edited by RAY WU
VOLUME 156. Biomembranes (Part P: ATP-Driven Pumps and Related Transport: The Na, K-Pump)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 157. Biomembranes (Part Q: ATP-Driven Pumps and Related Transport: Calcium, Proton, and Potassium Pumps)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 158. Metalloproteins (Part A)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 159. Initiation and Termination of Cyclic Nucleotide Action
Edited by JACKIE D. CORBIN AND ROGER A. JOHNSON
VOLUME 160. Biomass (Part A: Cellulose and Hemicellulose)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG
VOLUME 161. Biomass (Part B: Lignin, Pectin, and Chitin)
Edited by WILLIS A. WOOD AND SCOTT T. KELLOGG
VOLUME 162. Immunochemical Techniques (Part L: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO
VOLUME 163. Immunochemical Techniques (Part M: Chemotaxis and Inflammation)
Edited by GIOVANNI DI SABATO
VOLUME 164. Ribosomes
Edited by HARRY F. NOLLER, JR., AND KIVIE MOLDAVE
VOLUME 165. Microbial Toxins: Tools for Enzymology
Edited by SIDNEY HARSHMAN
VOLUME 166. Branched-Chain Amino Acids
Edited by ROBERT HARRIS AND JOHN R. SOKATCH
VOLUME 167. Cyanobacteria
Edited by LESTER PACKER AND ALEXANDER N. GLAZER
VOLUME 168. Hormone Action (Part K: Neuroendocrine Peptides)
Edited by P. MICHAEL CONN
VOLUME 169. Platelets: Receptors, Adhesion, Secretion (Part A)
Edited by JACEK HAWIGER
VOLUME 170. Nucleosomes
Edited by PAUL M. WASSARMAN AND ROGER D. KORNBERG
VOLUME 171. Biomembranes (Part R: Transport Theory: Cells and Model Membranes)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 172. Biomembranes (Part S: Transport: Membrane Isolation and Characterization)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 173. Biomembranes [Part T: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 174. Biomembranes [Part U: Cellular and Subcellular Transport: Eukaryotic (Nonepithelial) Cells]
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 175. Cumulative Subject Index Volumes 135–139, 141–167
VOLUME 176. Nuclear Magnetic Resonance (Part A: Spectral Techniques and Dynamics)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES
VOLUME 177. Nuclear Magnetic Resonance (Part B: Structure and Mechanism)
Edited by NORMAN J. OPPENHEIMER AND THOMAS L. JAMES
VOLUME 178. Antibodies, Antigens, and Molecular Mimicry
Edited by JOHN J. LANGONE
VOLUME 179. Complex Carbohydrates (Part F)
Edited by VICTOR GINSBURG
VOLUME 180. RNA Processing (Part A: General Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON
VOLUME 181. RNA Processing (Part B: Specific Methods)
Edited by JAMES E. DAHLBERG AND JOHN N. ABELSON
VOLUME 182. Guide to Protein Purification
Edited by MURRAY P. DEUTSCHER
VOLUME 183. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences
Edited by RUSSELL F. DOOLITTLE
VOLUME 184. Avidin-Biotin Technology
Edited by MEIR WILCHEK AND EDWARD A. BAYER
VOLUME 185. Gene Expression Technology
Edited by DAVID V. GOEDDEL
VOLUME 186. Oxygen Radicals in Biological Systems (Part B: Oxygen Radicals and Antioxidants)
Edited by LESTER PACKER AND ALEXANDER N. GLAZER
VOLUME 187. Arachidonate Related Lipid Mediators
Edited by ROBERT C. MURPHY AND FRANK A. FITZPATRICK
VOLUME 188. Hydrocarbons and Methylotrophy
Edited by MARY E. LIDSTROM
VOLUME 189. Retinoids (Part A: Molecular and Metabolic Aspects)
Edited by LESTER PACKER
VOLUME 190. Retinoids (Part B: Cell Differentiation and Clinical Applications)
Edited by LESTER PACKER
VOLUME 191. Biomembranes (Part V: Cellular and Subcellular Transport: Epithelial Cells)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 192. Biomembranes (Part W: Cellular and Subcellular Transport: Epithelial Cells)
Edited by SIDNEY FLEISCHER AND BECCA FLEISCHER
VOLUME 193. Mass Spectrometry
Edited by JAMES A. MCCLOSKEY
VOLUME 194. Guide to Yeast Genetics and Molecular Biology
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 195. Adenylyl Cyclase, G Proteins, and Guanylyl Cyclase
Edited by ROGER A. JOHNSON AND JACKIE D. CORBIN
VOLUME 196. Molecular Motors and the Cytoskeleton
Edited by RICHARD B. VALLEE
VOLUME 197. Phospholipases
Edited by EDWARD A. DENNIS
VOLUME 198. Peptide Growth Factors (Part C)
Edited by DAVID BARNES, J. P. MATHER, AND GORDON H. SATO
VOLUME 199. Cumulative Subject Index Volumes 168–174, 176–194
VOLUME 200. Protein Phosphorylation (Part A: Protein Kinases: Assays, Purification, Antibodies, Functional Analysis, Cloning, and Expression)
Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON
VOLUME 201. Protein Phosphorylation (Part B: Analysis of Protein Phosphorylation, Protein Kinase Inhibitors, and Protein Phosphatases)
Edited by TONY HUNTER AND BARTHOLOMEW M. SEFTON
VOLUME 202. Molecular Design and Modeling: Concepts and Applications (Part A: Proteins, Peptides, and Enzymes)
Edited by JOHN J. LANGONE
VOLUME 203. Molecular Design and Modeling: Concepts and Applications (Part B: Antibodies and Antigens, Nucleic Acids, Polysaccharides, and Drugs)
Edited by JOHN J. LANGONE
VOLUME 204. Bacterial Genetic Systems
Edited by JEFFREY H. MILLER
VOLUME 205. Metallobiochemistry (Part B: Metallothionein and Related Molecules)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 206. Cytochrome P450
Edited by MICHAEL R. WATERMAN AND ERIC F. JOHNSON
VOLUME 207. Ion Channels
Edited by BERNARDO RUDY AND LINDA E. IVERSON
VOLUME 208. Protein–DNA Interactions
Edited by ROBERT T. SAUER
VOLUME 209. Phospholipid Biosynthesis
Edited by EDWARD A. DENNIS AND DENNIS E. VANCE
VOLUME 210. Numerical Computer Methods
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 211. DNA Structures (Part A: Synthesis and Physical Analysis of DNA)
Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG
VOLUME 212. DNA Structures (Part B: Chemical and Electrophoretic Analysis of DNA)
Edited by DAVID M. J. LILLEY AND JAMES E. DAHLBERG
VOLUME 213. Carotenoids (Part A: Chemistry, Separation, Quantitation, and Antioxidation)
Edited by LESTER PACKER
VOLUME 214. Carotenoids (Part B: Metabolism, Genetics, and Biosynthesis)
Edited by LESTER PACKER
VOLUME 215. Platelets: Receptors, Adhesion, Secretion (Part B)
Edited by JACEK J. HAWIGER
VOLUME 216. Recombinant DNA (Part G)
Edited by RAY WU
VOLUME 217. Recombinant DNA (Part H)
Edited by RAY WU
VOLUME 218. Recombinant DNA (Part I)
Edited by RAY WU
VOLUME 219. Reconstitution of Intracellular Transport
Edited by JAMES E. ROTHMAN
VOLUME 220. Membrane Fusion Techniques (Part A)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 221. Membrane Fusion Techniques (Part B)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 222. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part A: Mammalian Blood Coagulation Factors and Inhibitors)
Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 223. Proteolytic Enzymes in Coagulation, Fibrinolysis, and Complement Activation (Part B: Complement Activation, Fibrinolysis, and Nonmammalian Blood Coagulation Factors)
Edited by LASZLO LORAND AND KENNETH G. MANN
VOLUME 224. Molecular Evolution: Producing the Biochemical Data
Edited by ELIZABETH ANNE ZIMMER, THOMAS J. WHITE, REBECCA L. CANN, AND ALLAN C. WILSON
VOLUME 225. Guide to Techniques in Mouse Development
Edited by PAUL M. WASSARMAN AND MELVIN L. DEPAMPHILIS
VOLUME 226. Metallobiochemistry (Part C: Spectroscopic and Physical Methods for Probing Metal Ion Environments in Metalloenzymes and Metalloproteins)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 227. Metallobiochemistry (Part D: Physical and Spectroscopic Methods for Probing Metal Ion Environments in Metalloproteins)
Edited by JAMES F. RIORDAN AND BERT L. VALLEE
VOLUME 228. Aqueous Two-Phase Systems
Edited by HARRY WALTER AND GÖTE JOHANSSON
VOLUME 229. Cumulative Subject Index Volumes 195–198, 200–227
VOLUME 230. Guide to Techniques in Glycobiology
Edited by WILLIAM J. LENNARZ AND GERALD W. HART
VOLUME 231. Hemoglobins (Part B: Biochemical and Analytical Methods)
Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW
VOLUME 232. Hemoglobins (Part C: Biophysical Methods)
Edited by JOHANNES EVERSE, KIM D. VANDEGRIFF, AND ROBERT M. WINSLOW
VOLUME 233. Oxygen Radicals in Biological Systems (Part C)
Edited by LESTER PACKER
VOLUME 234. Oxygen Radicals in Biological Systems (Part D)
Edited by LESTER PACKER
VOLUME 235. Bacterial Pathogenesis (Part A: Identification and Regulation of Virulence Factors)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 236. Bacterial Pathogenesis (Part B: Integration of Pathogenic Bacteria with Host Cells)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 237. Heterotrimeric G Proteins
Edited by RAVI IYENGAR
VOLUME 238. Heterotrimeric G-Protein Effectors
Edited by RAVI IYENGAR
VOLUME 239. Nuclear Magnetic Resonance (Part C)
Edited by THOMAS L. JAMES AND NORMAN J. OPPENHEIMER
VOLUME 240. Numerical Computer Methods (Part B)
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 241. Retroviral Proteases
Edited by LAWRENCE C. KUO AND JULES A. SHAFER
VOLUME 242. Neoglycoconjugates (Part A)
Edited by Y. C. LEE AND REIKO T. LEE
VOLUME 243. Inorganic Microbial Sulfur Metabolism
Edited by HARRY D. PECK, JR., AND JEAN LEGALL
VOLUME 244. Proteolytic Enzymes: Serine and Cysteine Peptidases
Edited by ALAN J. BARRETT
VOLUME 245. Extracellular Matrix Components
Edited by E. RUOSLAHTI AND E. ENGVALL
VOLUME 246. Biochemical Spectroscopy
Edited by KENNETH SAUER
VOLUME 247. Neoglycoconjugates (Part B: Biomedical Applications)
Edited by Y. C. LEE AND REIKO T. LEE
VOLUME 248. Proteolytic Enzymes: Aspartic and Metallo Peptidases
Edited by ALAN J. BARRETT
VOLUME 249. Enzyme Kinetics and Mechanism (Part D: Developments in Enzyme Dynamics)
Edited by DANIEL L. PURICH
VOLUME 250. Lipid Modifications of Proteins
Edited by PATRICK J. CASEY AND JANICE E. BUSS
VOLUME 251. Biothiols (Part A: Monothiols and Dithiols, Protein Thiols, and Thiyl Radicals)
Edited by LESTER PACKER
VOLUME 252. Biothiols (Part B: Glutathione and Thioredoxin; Thiols in Signal Transduction and Gene Regulation)
Edited by LESTER PACKER
VOLUME 253. Adhesion of Microbial Pathogens
Edited by RON J. DOYLE AND ITZHAK OFEK
VOLUME 254. Oncogene Techniques
Edited by PETER K. VOGT AND INDER M. VERMA
VOLUME 255. Small GTPases and Their Regulators (Part A: Ras Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 256. Small GTPases and Their Regulators (Part B: Rho Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 257. Small GTPases and Their Regulators (Part C: Proteins Involved in Transport)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 258. Redox-Active Amino Acids in Biology
Edited by JUDITH P. KLINMAN
VOLUME 259. Energetics of Biological Macromolecules
Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS
VOLUME 260. Mitochondrial Biogenesis and Genetics (Part A)
Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN
VOLUME 261. Nuclear Magnetic Resonance and Nucleic Acids
Edited by THOMAS L. JAMES
VOLUME 262. DNA Replication
Edited by JUDITH L. CAMPBELL
VOLUME 263. Plasma Lipoproteins (Part C: Quantitation)
Edited by WILLIAM A. BRADLEY, SANDRA H. GIANTURCO, AND JERE P. SEGREST
VOLUME 264. Mitochondrial Biogenesis and Genetics (Part B)
Edited by GIUSEPPE M. ATTARDI AND ANNE CHOMYN
VOLUME 265. Cumulative Subject Index Volumes 228, 230–262
VOLUME 266. Computer Methods for Macromolecular Sequence Analysis
Edited by RUSSELL F. DOOLITTLE
VOLUME 267. Combinatorial Chemistry
Edited by JOHN N. ABELSON
VOLUME 268. Nitric Oxide (Part A: Sources and Detection of NO; NO Synthase)
Edited by LESTER PACKER
VOLUME 269. Nitric Oxide (Part B: Physiological and Pathological Processes)
Edited by LESTER PACKER
VOLUME 270. High Resolution Separation and Analysis of Biological Macromolecules (Part A: Fundamentals)
Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK
VOLUME 271. High Resolution Separation and Analysis of Biological Macromolecules (Part B: Applications)
Edited by BARRY L. KARGER AND WILLIAM S. HANCOCK
VOLUME 272. Cytochrome P450 (Part B)
Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN
VOLUME 273. RNA Polymerase and Associated Factors (Part A)
Edited by SANKAR ADHYA
VOLUME 274. RNA Polymerase and Associated Factors (Part B)
Edited by SANKAR ADHYA
VOLUME 275. Viral Polymerases and Related Proteins
Edited by LAWRENCE C. KUO, DAVID B. OLSEN, AND STEVEN S. CARROLL
VOLUME 276. Macromolecular Crystallography (Part A)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 277. Macromolecular Crystallography (Part B)
Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET
VOLUME 278. Fluorescence Spectroscopy
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 279. Vitamins and Coenzymes (Part I)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 280. Vitamins and Coenzymes (Part J)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 281. Vitamins and Coenzymes (Part K)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 282. Vitamins and Coenzymes (Part L)
Edited by DONALD B. MCCORMICK, JOHN W. SUTTIE, AND CONRAD WAGNER
VOLUME 283. Cell Cycle Control
Edited by WILLIAM G. DUNPHY
VOLUME 284. Lipases (Part A: Biotechnology)
Edited by BYRON RUBIN AND EDWARD A. DENNIS
VOLUME 285. Cumulative Subject Index Volumes 263, 264, 266–284, 286–289
VOLUME 286. Lipases (Part B: Enzyme Characterization and Utilization)
Edited by BYRON RUBIN AND EDWARD A. DENNIS
VOLUME 287. Chemokines
Edited by RICHARD HORUK
VOLUME 288. Chemokine Receptors
Edited by RICHARD HORUK
VOLUME 289. Solid Phase Peptide Synthesis
Edited by GREGG B. FIELDS
VOLUME 290. Molecular Chaperones
Edited by GEORGE H. LORIMER AND THOMAS BALDWIN
VOLUME 291. Caged Compounds
Edited by GERARD MARRIOTT
VOLUME 292. ABC Transporters: Biochemical, Cellular, and Molecular Aspects
Edited by SURESH V. AMBUDKAR AND MICHAEL M. GOTTESMAN
VOLUME 293. Ion Channels (Part B)
Edited by P. MICHAEL CONN
VOLUME 294. Ion Channels (Part C)
Edited by P. MICHAEL CONN
VOLUME 295. Energetics of Biological Macromolecules (Part B)
Edited by GARY K. ACKERS AND MICHAEL L. JOHNSON
VOLUME 296. Neurotransmitter Transporters
Edited by SUSAN G. AMARA
VOLUME 297. Photosynthesis: Molecular Biology of Energy Capture
Edited by LEE MCINTOSH
VOLUME 298. Molecular Motors and the Cytoskeleton (Part B)
Edited by RICHARD B. VALLEE
VOLUME 299. Oxidants and Antioxidants (Part A)
Edited by LESTER PACKER
VOLUME 300. Oxidants and Antioxidants (Part B)
Edited by LESTER PACKER
VOLUME 301. Nitric Oxide: Biological and Antioxidant Activities (Part C)
Edited by LESTER PACKER
VOLUME 302. Green Fluorescent Protein
Edited by P. MICHAEL CONN
VOLUME 303. cDNA Preparation and Display
Edited by SHERMAN M. WEISSMAN
VOLUME 304. Chromatin
Edited by PAUL M. WASSARMAN AND ALAN P. WOLFFE
VOLUME 305. Bioluminescence and Chemiluminescence (Part C)
Edited by THOMAS O. BALDWIN AND MIRIAM M. ZIEGLER
VOLUME 306. Expression of Recombinant Genes in Eukaryotic Systems
Edited by JOSEPH C. GLORIOSO AND MARTIN C. SCHMIDT
VOLUME 307. Confocal Microscopy
Edited by P. MICHAEL CONN
VOLUME 308. Enzyme Kinetics and Mechanism (Part E: Energetics of Enzyme Catalysis)
Edited by DANIEL L. PURICH AND VERN L. SCHRAMM
VOLUME 309. Amyloid, Prions, and Other Protein Aggregates
Edited by RONALD WETZEL
VOLUME 310. Biofilms
Edited by RON J. DOYLE
VOLUME 311. Sphingolipid Metabolism and Cell Signaling (Part A)
Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN
VOLUME 312. Sphingolipid Metabolism and Cell Signaling (Part B)
Edited by ALFRED H. MERRILL, JR., AND YUSUF A. HANNUN
VOLUME 313. Antisense Technology (Part A: General Methods, Methods of Delivery, and RNA Studies)
Edited by M. IAN PHILLIPS
VOLUME 314. Antisense Technology (Part B: Applications)
Edited by M. IAN PHILLIPS
VOLUME 315. Vertebrate Phototransduction and the Visual Cycle (Part A)
Edited by KRZYSZTOF PALCZEWSKI
VOLUME 316. Vertebrate Phototransduction and the Visual Cycle (Part B)
Edited by KRZYSZTOF PALCZEWSKI
VOLUME 317. RNA–Ligand Interactions (Part A: Structural Biology Methods)
Edited by DANIEL W. CELANDER AND JOHN N. ABELSON
VOLUME 318. RNA–Ligand Interactions (Part B: Molecular Biology Methods)
Edited by DANIEL W. CELANDER AND JOHN N. ABELSON
VOLUME 319. Singlet Oxygen, UV-A, and Ozone
Edited by LESTER PACKER AND HELMUT SIES
VOLUME 320. Cumulative Subject Index Volumes 290–319
VOLUME 321. Numerical Computer Methods (Part C)
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 322. Apoptosis
Edited by JOHN C. REED
VOLUME 323. Energetics of Biological Macromolecules (Part C)
Edited by MICHAEL L. JOHNSON AND GARY K. ACKERS
VOLUME 324. Branched-Chain Amino Acids (Part B)
Edited by ROBERT A. HARRIS AND JOHN R. SOKATCH
VOLUME 325. Regulators and Effectors of Small GTPases (Part D: Rho Family)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 326. Applications of Chimeric Genes and Hybrid Proteins (Part A: Gene Expression and Protein Purification)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 327. Applications of Chimeric Genes and Hybrid Proteins (Part B: Cell Biology and Physiology)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 328. Applications of Chimeric Genes and Hybrid Proteins (Part C: Protein–Protein Interactions and Genomics)
Edited by JEREMY THORNER, SCOTT D. EMR, AND JOHN N. ABELSON
VOLUME 329. Regulators and Effectors of Small GTPases (Part E: GTPases Involved in Vesicular Traffic)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 330. Hyperthermophilic Enzymes (Part A)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 331. Hyperthermophilic Enzymes (Part B)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 332. Regulators and Effectors of Small GTPases (Part F: Ras Family I)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 333. Regulators and Effectors of Small GTPases (Part G: Ras Family II)
Edited by W. E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 334. Hyperthermophilic Enzymes (Part C)
Edited by MICHAEL W. W. ADAMS AND ROBERT M. KELLY
VOLUME 335. Flavonoids and Other Polyphenols
Edited by LESTER PACKER
VOLUME 336. Microbial Growth in Biofilms (Part A: Developmental and Molecular Biological Aspects)
Edited by RON J. DOYLE
VOLUME 337. Microbial Growth in Biofilms (Part B: Special Environments and Physicochemical Aspects)
Edited by RON J. DOYLE
VOLUME 338. Nuclear Magnetic Resonance of Biological Macromolecules (Part A)
Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ
VOLUME 339. Nuclear Magnetic Resonance of Biological Macromolecules (Part B)
Edited by THOMAS L. JAMES, VOLKER DÖTSCH, AND ULI SCHMITZ
VOLUME 340. Drug–Nucleic Acid Interactions
Edited by JONATHAN B. CHAIRES AND MICHAEL J. WARING
VOLUME 341. Ribonucleases (Part A)
Edited by ALLEN W. NICHOLSON
VOLUME 342. Ribonucleases (Part B)
Edited by ALLEN W. NICHOLSON
VOLUME 343. G Protein Pathways (Part A: Receptors)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 344. G Protein Pathways (Part B: G Proteins and Their Regulators)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 345. G Protein Pathways (Part C: Effector Mechanisms)
Edited by RAVI IYENGAR AND JOHN D. HILDEBRANDT
VOLUME 346. Gene Therapy Methods
Edited by M. IAN PHILLIPS
VOLUME 347. Protein Sensors and Reactive Oxygen Species (Part A: Selenoproteins and Thioredoxin)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 348. Protein Sensors and Reactive Oxygen Species (Part B: Thiol Enzymes and Proteins)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 349. Superoxide Dismutase
Edited by LESTER PACKER
VOLUME 350. Guide to Yeast Genetics and Molecular and Cell Biology (Part B)
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 351. Guide to Yeast Genetics and Molecular and Cell Biology (Part C)
Edited by CHRISTINE GUTHRIE AND GERALD R. FINK
VOLUME 352. Redox Cell Biology and Genetics (Part A)
Edited by CHANDAN K. SEN AND LESTER PACKER
VOLUME 353. Redox Cell Biology and Genetics (Part B)
Edited by CHANDAN K. SEN AND LESTER PACKER
VOLUME 354. Enzyme Kinetics and Mechanism (Part F: Detection and Characterization of Enzyme Reaction Intermediates)
Edited by DANIEL L. PURICH
VOLUME 355. Cumulative Subject Index Volumes 321–354
VOLUME 356. Laser Capture Microscopy and Microdissection
Edited by P. MICHAEL CONN
VOLUME 357. Cytochrome P450 (Part C)
Edited by ERIC F. JOHNSON AND MICHAEL R. WATERMAN
VOLUME 358. Bacterial Pathogenesis (Part C: Identification, Regulation, and Function of Virulence Factors)
Edited by VIRGINIA L. CLARK AND PATRIK M. BAVOIL
VOLUME 359. Nitric Oxide (Part D)
Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 360. Biophotonics (Part A)
Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 361. Biophotonics (Part B)
Edited by GERARD MARRIOTT AND IAN PARKER
VOLUME 362. Recognition of Carbohydrates in Biological Systems (Part A) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 363. Recognition of Carbohydrates in Biological Systems (Part B) Edited by YUAN C. LEE AND REIKO T. LEE VOLUME 364. Nuclear Receptors Edited by DAVID W. RUSSELL AND DAVID J. MANGELSDORF VOLUME 365. Differentiation of Embryonic Stem Cells Edited by PAUL M. WASSAUMAN AND GORDON M. KELLER VOLUME 366. Protein Phosphatases Edited by SUSANNE KLUMPP AND JOSEF KRIEGLSTEIN VOLUME 367. Liposomes (Part A) Edited by NEJAT DU¨ZGU¨NES, VOLUME 368. Macromolecular Crystallography (Part C) Edited by CHARLES W. CARTER, JR., AND ROBERT M. SWEET VOLUME 369. Combinational Chemistry (Part B) Edited by GUILLERMO A. MORALES AND BARRY A. BUNIN VOLUME 370. RNA Polymerases and Associated Factors (Part C) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 371. RNA Polymerases and Associated Factors (Part D) Edited by SANKAR L. ADHYA AND SUSAN GARGES VOLUME 372. Liposomes (Part B) Edited by NEJAT DU¨ZGU¨NES, VOLUME 373. Liposomes (Part C) Edited by NEJAT DU¨ZGU¨NES, VOLUME 374. Macromolecular Crystallography (Part D) Edited by CHARLES W. CARTER, JR., AND ROBERT W. SWEET VOLUME 375. Chromatin and Chromatin Remodeling Enzymes (Part A) Edited by C. DAVID ALLIS AND CARL WU VOLUME 376. Chromatin and Chromatin Remodeling Enzymes (Part B) Edited by C. DAVID ALLIS AND CARL WU VOLUME 377. Chromatin and Chromatin Remodeling Enzymes (Part C) Edited by C. DAVID ALLIS AND CARL WU VOLUME 378. Quinones and Quinone Enzymes (Part A) Edited by HELMUT SIES AND LESTER PACKER VOLUME 379. Energetics of Biological Macromolecules (Part D) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS VOLUME 380. Energetics of Biological Macromolecules (Part E) Edited by JO M. HOLT, MICHAEL L. JOHNSON, AND GARY K. ACKERS
VOLUME 381. Oxygen Sensing
Edited by CHANDAN K. SEN AND GREGG L. SEMENZA
VOLUME 382. Quinones and Quinone Enzymes (Part B)
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 383. Numerical Computer Methods (Part D)
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 384. Numerical Computer Methods (Part E)
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 385. Imaging in Biological Research (Part A)
Edited by P. MICHAEL CONN
VOLUME 386. Imaging in Biological Research (Part B)
Edited by P. MICHAEL CONN
VOLUME 387. Liposomes (Part D)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 388. Protein Engineering
Edited by DAN E. ROBERTSON AND JOSEPH P. NOEL
VOLUME 389. Regulators of G-Protein Signaling (Part A)
Edited by DAVID P. SIDEROVSKI
VOLUME 390. Regulators of G-Protein Signaling (Part B)
Edited by DAVID P. SIDEROVSKI
VOLUME 391. Liposomes (Part E)
Edited by NEJAT DÜZGÜNEŞ
VOLUME 392. RNA Interference
Edited by DAVID R. ENGELKE AND JOHN J. ROSSI
VOLUME 393. Circadian Rhythms
Edited by MICHAEL W. YOUNG
VOLUME 394. Nuclear Magnetic Resonance of Biological Macromolecules (Part C)
Edited by THOMAS L. JAMES
VOLUME 395. Producing the Biochemical Data (Part B)
Edited by ELIZABETH A. ZIMMER AND ERIC H. ROALSON
VOLUME 396. Nitric Oxide (Part E)
Edited by LESTER PACKER AND ENRIQUE CADENAS
VOLUME 397. Environmental Microbiology
Edited by JARED R. LEADBETTER
VOLUME 398. Ubiquitin and Protein Degradation (Part A)
Edited by RAYMOND J. DESHAIES
VOLUME 399. Ubiquitin and Protein Degradation (Part B)
Edited by RAYMOND J. DESHAIES
VOLUME 400. Phase II Conjugation Enzymes and Transport Systems
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 401. Glutathione Transferases and Gamma Glutamyl Transpeptidases
Edited by HELMUT SIES AND LESTER PACKER
VOLUME 402. Biological Mass Spectrometry
Edited by A. L. BURLINGAME
VOLUME 403. GTPases Regulating Membrane Targeting and Fusion
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 404. GTPases Regulating Membrane Dynamics
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 405. Mass Spectrometry: Modified Proteins and Glycoconjugates
Edited by A. L. BURLINGAME
VOLUME 406. Regulators and Effectors of Small GTPases: Rho Family
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 407. Regulators and Effectors of Small GTPases: Ras Family
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 408. DNA Repair (Part A)
Edited by JUDITH L. CAMPBELL AND PAUL MODRICH
VOLUME 409. DNA Repair (Part B)
Edited by JUDITH L. CAMPBELL AND PAUL MODRICH
VOLUME 410. DNA Microarrays (Part A: Array Platforms and Web-Bench Protocols)
Edited by ALAN KIMMEL AND BRIAN OLIVER
VOLUME 411. DNA Microarrays (Part B: Databases and Statistics)
Edited by ALAN KIMMEL AND BRIAN OLIVER
VOLUME 412. Amyloid, Prions, and Other Protein Aggregates (Part B)
Edited by INDU KHETERPAL AND RONALD WETZEL
VOLUME 413. Amyloid, Prions, and Other Protein Aggregates (Part C)
Edited by INDU KHETERPAL AND RONALD WETZEL
VOLUME 414. Measuring Biological Responses with Automated Microscopy
Edited by JAMES INGLESE
VOLUME 415. Glycobiology
Edited by MINORU FUKUDA
VOLUME 416. Glycomics
Edited by MINORU FUKUDA
VOLUME 417. Functional Glycomics
Edited by MINORU FUKUDA
VOLUME 418. Embryonic Stem Cells
Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 419. Adult Stem Cells
Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 420. Stem Cell Tools and Other Experimental Protocols
Edited by IRINA KLIMANSKAYA AND ROBERT LANZA
VOLUME 421. Advanced Bacterial Genetics: Use of Transposons and Phage for Genomic Engineering
Edited by KELLY T. HUGHES
VOLUME 422. Two-Component Signaling Systems, Part A
Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE
VOLUME 423. Two-Component Signaling Systems, Part B
Edited by MELVIN I. SIMON, BRIAN R. CRANE, AND ALEXANDRINE CRANE
VOLUME 424. RNA Editing
Edited by JONATHA M. GOTT
VOLUME 425. RNA Modification
Edited by JONATHA M. GOTT
VOLUME 426. Integrins
Edited by DAVID CHERESH
VOLUME 427. MicroRNA Methods
Edited by JOHN J. ROSSI
VOLUME 428. Osmosensing and Osmosignaling
Edited by HELMUT SIES AND DIETER HÄUSSINGER
VOLUME 429. Translation Initiation: Extract Systems and Molecular Genetics
Edited by JON LORSCH
VOLUME 430. Translation Initiation: Reconstituted Systems and Biophysical Methods
Edited by JON LORSCH
VOLUME 431. Translation Initiation: Cell Biology, High-Throughput and Chemical-Based Approaches
Edited by JON LORSCH
VOLUME 432. Lipidomics and Bioactive Lipids: Mass-Spectrometry–Based Lipid Analysis
Edited by H. ALEX BROWN
VOLUME 433. Lipidomics and Bioactive Lipids: Specialized Analytical Methods and Lipids in Disease
Edited by H. ALEX BROWN
VOLUME 434. Lipidomics and Bioactive Lipids: Lipids and Cell Signaling
Edited by H. ALEX BROWN
VOLUME 435. Oxygen Biology and Hypoxia
Edited by HELMUT SIES AND BERNHARD BRÜNE
VOLUME 436. Globins and Other Nitric Oxide-Reactive Proteins (Part A)
Edited by ROBERT K. POOLE
VOLUME 437. Globins and Other Nitric Oxide-Reactive Proteins (Part B)
Edited by ROBERT K. POOLE
VOLUME 438. Small GTPases in Disease (Part A)
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 439. Small GTPases in Disease (Part B)
Edited by WILLIAM E. BALCH, CHANNING J. DER, AND ALAN HALL
VOLUME 440. Nitric Oxide, Part F: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling
Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 441. Nitric Oxide, Part G: Oxidative and Nitrosative Stress in Redox Regulation of Cell Signaling
Edited by ENRIQUE CADENAS AND LESTER PACKER
VOLUME 442. Programmed Cell Death, General Principles for Studying Cell Death (Part A)
Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI
VOLUME 443. Angiogenesis: In Vitro Systems
Edited by DAVID A. CHERESH
VOLUME 444. Angiogenesis: In Vivo Systems (Part A)
Edited by DAVID A. CHERESH
VOLUME 445. Angiogenesis: In Vivo Systems (Part B)
Edited by DAVID A. CHERESH
VOLUME 446. Programmed Cell Death, The Biology and Therapeutic Implications of Cell Death (Part B)
Edited by ROYA KHOSRAVI-FAR, ZAHRA ZAKERI, RICHARD A. LOCKSHIN, AND MAURO PIACENTINI
VOLUME 447. RNA Turnover in Bacteria, Archaea and Organelles
Edited by LYNNE E. MAQUAT AND CECILIA M. ARRAIANO
VOLUME 448. RNA Turnover in Eukaryotes: Nucleases, Pathways and Analysis of mRNA Decay
Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN
VOLUME 449. RNA Turnover in Eukaryotes: Analysis of Specialized and Quality Control RNA Decay Pathways
Edited by LYNNE E. MAQUAT AND MEGERDITCH KILEDJIAN
VOLUME 450. Fluorescence Spectroscopy
Edited by LUDWIG BRAND AND MICHAEL L. JOHNSON
VOLUME 451. Autophagy: Lower Eukaryotes and Non-Mammalian Systems (Part A)
Edited by DANIEL J. KLIONSKY
VOLUME 452. Autophagy in Mammalian Systems (Part B)
Edited by DANIEL J. KLIONSKY
VOLUME 453. Autophagy in Disease and Clinical Applications (Part C)
Edited by DANIEL J. KLIONSKY
VOLUME 454. Computer Methods (Part A)
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
VOLUME 455. Biothermodynamics (Part A)
Edited by MICHAEL L. JOHNSON, JO M. HOLT, AND GARY K. ACKERS (RETIRED)
VOLUME 456. Mitochondrial Function, Part A: Mitochondrial Electron Transport Complexes and Reactive Oxygen Species
Edited by WILLIAM S. ALLISON AND IMMO E. SCHEFFLER
VOLUME 457. Mitochondrial Function, Part B: Mitochondrial Protein Kinases, Protein Phosphatases and Mitochondrial Diseases
Edited by WILLIAM S. ALLISON AND ANNE N. MURPHY
VOLUME 458. Complex Enzymes in Microbial Natural Product Biosynthesis, Part A: Overview Articles and Peptides
Edited by DAVID A. HOPWOOD
VOLUME 459. Complex Enzymes in Microbial Natural Product Biosynthesis, Part B: Polyketides, Aminocoumarins and Carbohydrates
Edited by DAVID A. HOPWOOD
VOLUME 460. Chemokines, Part A
Edited by TRACY M. HANDEL AND DAMON J. HAMEL
VOLUME 461. Chemokines, Part B
Edited by TRACY M. HANDEL AND DAMON J. HAMEL
VOLUME 462. Non-Natural Amino Acids
Edited by TOM W. MUIR AND JOHN N. ABELSON
VOLUME 463. Guide to Protein Purification, 2nd Edition
Edited by RICHARD R. BURGESS AND MURRAY P. DEUTSCHER
VOLUME 464. Liposomes, Part F
Edited by NEJAT DÜZGÜNEŞ
VOLUME 465. Liposomes, Part G
Edited by NEJAT DÜZGÜNEŞ
VOLUME 466. Biothermodynamics, Part B
Edited by MICHAEL L. JOHNSON, GARY K. ACKERS, AND JO M. HOLT
VOLUME 467. Computer Methods, Part B
Edited by MICHAEL L. JOHNSON AND LUDWIG BRAND
CHAPTER ONE
Correlation Analysis: A Tool for Comparing Relaxation-Type Models to Experimental Data

Maurizio Tomaiuolo,* Joel Tabak,* and Richard Bertram†

Contents
1. Introduction
2. Scatter Plots and Correlation Analysis
3. Example 1: Relaxation Oscillations
4. Example 2: Square Wave Bursting
5. Example 3: Elliptic Bursting
6. Example 4: Using Correlation Analysis on Experimental Data
7. Summary
Acknowledgment
References
Abstract

We describe a new technique for comparing mathematical models to the biological systems they describe. This technique is appropriate for systems that produce relaxation oscillations or bursting oscillations, and takes advantage of noise that is inherent to all biological systems. Both types of oscillations are composed of active phases followed by silent phases, repeating periodically. The presence of noise adds variability to the durations of the different phases. The central idea of the technique is that the active phase duration may be correlated with the previous silent phase duration, the next silent phase duration, or both, and the resulting correlation pattern provides information about the dynamic structure of the system. Correlation patterns can easily be determined by making scatter plots and applying correlation analysis to the cluster of data points. This can be done both with experimental data and with model simulation data. If the model correlation pattern is in general agreement with the experimental data, then this adds support for the validity of the model.
* Department of Biological Science and Program in Neuroscience, Florida State University, Tallahassee, Florida, USA
† Department of Mathematics and Programs in Neuroscience and Molecular Biophysics, Florida State University, Tallahassee, Florida, USA
Methods in Enzymology, Volume 467
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67001-4
© 2009 Elsevier Inc. All rights reserved.
Otherwise, the model must be corrected. While this tool is only one test of many required to validate a mathematical model, it is easy to implement and is noninvasive.
1. Introduction

Multivariable systems in which one or more of the variables change slowly compared with the others have the potential to produce relaxation oscillations. These oscillations are characterized by a "silent state" in which the fast variables are at a low value, and an "active state" in which the fast variables are at a high or stimulated value. The fast variables jump back and forth between these states as the slow variables slowly increase and decrease. The fast variable time course thus resembles a square wave, while the slow variable time course has a saw-tooth pattern. The van der Pol oscillator is a classic example of such a system (van der Pol and van der Mark, 1928). Several important biological and biochemical systems have the features of relaxation oscillators, including cardiac and neuronal action potentials (Bertram and Sherman, 2005; van der Pol and van der Mark, 1928), population bursts in neuronal networks (Tabak et al., 2001), the cell cycle (Tyson, 1991), glycolytic oscillations (Goldbeter and Lefever, 1972), and the Belousov–Zhabotinskii chemical reaction (see Murray, 1989, for discussion).

Bursting oscillations are a generalization of relaxation oscillations, where the active state is itself oscillatory (Bertram and Sherman, 2005; Rinzel and Ermentrout, 1998). Thus, bursting consists of fast oscillations clustered into slower episodes. These oscillations are common in nerve cells (see Coombes and Bressloff, 2005, for many examples) and hormone-secreting endocrine cells (Bertram and Sherman, 2005; Dean and Mathews, 1970; Li et al., 1997; Tsaneva-Atanasova et al., 2007; Van Goor et al., 2001).

Analysis techniques for models of relaxation-type oscillations are well developed. For pure relaxation oscillations a phase-plane analysis is typically used (Strogatz, 1994). For bursting oscillations, a geometric singular perturbation analysis, often called fast/slow analysis, is the standard analytical tool (Bertram et al., 1995; Rinzel, 1987; Rinzel and Ermentrout, 1998). From these analyses one can understand features such as threshold behaviors, the effects of perturbations, the conversion of the system from an oscillatory to a stationary state or vice versa, the slowdown of the fast oscillations near the end of the active state that is often observed during bursting, or the subthreshold oscillations that are sometimes observed during the silent phase of a burst. Thus, the analysis is useful for understanding the dynamic behaviors observed experimentally.

While most of the analysis described above assumes that the system is deterministic, in reality all of the biological and biochemical systems on
which the models are based contain noise. The noise could be due to intrinsic factors such as a small number of substrate molecules or ion channels of a certain type. It could also be due to extrinsic factors such as stochastic synaptic input to a neuron, stochastic activation of G-proteincoupled receptors by extracellular ligands, or measurement error. Whatever the origin, noise can make it more difficult to detect some subtle features of the oscillation. This makes it harder to know how well the mathematical model reproduces the behavior of the system under investigation, since key model predictions may depend on the detection of these subtle features in the experimental record (Bertram et al., 1995). In this chapter, we describe a tool based on statistical correlation analysis that can be used to compare the behavior of a mathematical model against experimental data and thus help to determine the validity of the model. This method is designed for relaxation-type models and makes use of intrinsic noise in the system. Subtle features such as spike frequency slowdown or subthreshold oscillations are not utilized. Instead, we look at correlation patterns between the durations of active and silent phases in the experimental data, and in simulation data generated by stochastic implementations of the corresponding model. We demonstrate the use of the tool through four examples. First, we show how it can be (and has been) used to make a powerful (and testable) prediction that can distinguish the type of slow negative feedback underlying a relaxation oscillation. Second, we demonstrate how the tool can be used to study bursting oscillations, focusing on the ‘‘square wave’’ class of bursters. The third example focuses on ‘‘elliptic bursters,’’ and demonstrates that the correlation pattern can distinguish one type of bursting from another. Finally, we apply the correlation analysis to a model of the pituitary lactotroph, a cell in the pituitary gland that secretes the hormone prolactin. We contrast the correlation patterns obtained with this model with experimental electrical data from a pituitary lactotroph cell line.
2. Scatter Plots and Correlation Analysis

In a deterministic system, the duration of each active phase of a relaxation-type oscillator is the same, and the duration of each silent phase is the same. However, when the system exhibits random fluctuations, or noise, durations will vary since the noise can perturb the system prematurely from one state to another. We measure the duration of each silent phase and each active phase (see Appendix for the algorithm used for bursting oscillations), and then make a scatter plot of the active phase durations versus the previous silent phase durations. A separate scatter plot is made of the active phase durations versus the following silent phase durations. We then use
these scatter plots to look for correlations between the active phase and silent phase durations. This can be done using simulation data from a model, or using actual data for the corresponding experimental system. As we demonstrate in the examples below, one expects certain correlation patterns to exist, based on the dynamic structure of the model. The validity of the model is supported (but not established) if the expected correlation patterns match those from the experimental data. If there is no match, then it is likely that the model should be modified or the parameters adjusted. This approach can be used for relaxation oscillations or bursting oscillations, and is most useful when there are enough experimental data to establish statistical confidence in the correlation patterns.
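The chapter does not prescribe an implementation, but the scatter-plot and correlation step takes only a few lines of code. The following is a minimal sketch in Python (the function name and the SciPy-based Pearson test are our choices, not the authors'), assuming the phase durations have already been extracted as described in the Appendix.

```python
# Correlation pattern for relaxation or bursting oscillations.
# apd holds the N active phase durations; spd holds the N+1 silent phase
# durations that bracket them, so spd[i] precedes apd[i] and spd[i+1] follows it.
import numpy as np
from scipy import stats

def correlation_pattern(apd, spd):
    apd = np.asarray(apd, dtype=float)
    spd = np.asarray(spd, dtype=float)              # len(spd) == len(apd) + 1
    r_prev, p_prev = stats.pearsonr(spd[:-1], apd)  # APD vs. previous SPD
    r_next, p_next = stats.pearsonr(spd[1:], apd)   # APD vs. next SPD
    return {"prev": (r_prev, p_prev), "next": (r_next, p_next)}
```

Which of the two correlations is strong, weak, or absent constitutes the correlation pattern that the examples below interpret.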
3. Example 1: Relaxation Oscillations

We consider a system whose activity a is controlled by a fast positive feedback and a slow negative feedback process. This forms the basis for many biological oscillators (Ermentrout and Chow, 2002; Friesen and Block, 1984; Tsai et al., 2008). The activity varies according to:

$$\tau_a \frac{da}{dt} = -a + a_\infty(wa - \theta_0) + i. \qquad (1.1)$$

This equation means that a tends to reach a steady state determined by the steady-state input/output function (or activation function) $a_\infty$, with a time constant $\tau_a$ (which is set to 1, so time is relative to $\tau_a$). The function $a_\infty$ is a sigmoid function of its input (Fig. 1.1), which is proportional to the system's output, a. This injection of the activity back into the system's input represents positive feedback, and the gain of the feedback loop is set by the parameter w. The other parameter, $\theta_0$, represents the half-activation input: if the input is below $\theta_0$, the output (activity) will be low, and if the input is above $\theta_0$, then the output will be high. Finally, the term i provides for a possible external input to the system, such as a brief resetting perturbation.

Figure 1.1 System with (fast) positive feedback and two types of (slow) negative feedback. The positive feedback loop is shown in black; the negative feedback loop is in gray. (A) System with divisive feedback, which decreases the gain of the positive feedback loop by a factor s (upper panel). The effect of this feedback is a decrease of the slope of the system steady-state activation. (B) System with subtractive feedback, which decreases the effective input by θ (upper panel). The effect of this feedback is a shift of the steady-state activation function of the system to the right. In both cases, the steady-state output function is given by $a_\infty(x) = 1/(1 + \exp(-x/k_a))$.

This activity equation can describe, for example, the mean firing rate within a network of neurons connected by excitatory synapses (Tabak et al., 2000, 2006; Wilson and Cowan, 1972). In this mean field framework, the network steady-state input/output function $a_\infty$ depends on the input/output properties of the single cells, the degree of heterogeneity in the network, as well as the synaptic dynamics. The parameter w represents the degree of connectivity in the network while $\theta_0$ sets the amount of excitation that neurons need to receive to become activated.

Such a system will always evolve to a steady state (defined by da/dt = 0). For some parameter values, the system may have two stable steady states, one at a high- and one at a low-activity level. This is a direct consequence of the
positive feedback. To create relaxation oscillations, we add a slow negative feedback process. This will allow the system to switch repetitively between the high and low steady states. We consider two types of slow negative feedback. The first type is divisive feedback. This feedback reduces the amount of positive feedback and is implemented using a slow variable, s, according to:

$$\tau_a \frac{da}{dt} = -a + a_\infty(wsa - \theta_0) + i, \qquad (1.2)$$

$$\tau_s \frac{ds}{dt} = -s + s_\infty(a), \qquad (1.3)$$

where $s_\infty$ is a decreasing function of a, so that s decreases during high activity episodes and recovers when the activity is low. Figure 1.1A illustrates that such divisive feedback decreases the slope of the input/output relationship of the system. In a mean field neuronal network model, for example, synaptic depression would be implemented as divisive feedback (Shpiro et al., 2007; Tabak et al., 2006).
The second type is subtractive feedback. In this case, the half-activation point of the system is shifted by a slow variable, θ, according to:

$$\tau_a \frac{da}{dt} = -a + a_\infty(wa - \theta_0 - \theta) + i, \qquad (1.4)$$

$$\tau_\theta \frac{d\theta}{dt} = -\theta + \theta_\infty(a), \qquad (1.5)$$

where $\theta_\infty$ is an increasing function of a, so that θ increases during high activity and decreases during low activity. Figure 1.1B shows how subtractive feedback shifts the activation function to the right, so more input is necessary to achieve a given output. In a mean field neuronal model, adaptation of cell excitability by outward ionic currents would be implemented as subtractive feedback (Shpiro et al., 2007; Tabak et al., 2006).

Both the models defined by Eqs. (1.2) and (1.3) (s-model) and by Eqs. (1.4) and (1.5) (θ-model) generate relaxation oscillations. We first examine the oscillations generated by the s-model (Fig. 1.2A). The upper panel shows the time courses of a and s. Activity oscillates between active (high a) and silent (low a) phases. During the silent phase (1), s increases, increasing the level of positive feedback, until activity jumps to a high level (2). This starts the active phase (3), during which s decreases, decreasing the level of positive feedback. When s is low enough, there is not enough positive feedback to sustain the high activity, a falls to the low level (4), and the cycle repeats.

Figure 1.2 The s-model and the θ-model produce relaxation oscillations with similar properties. (A) The s-model. Upper panel, oscillatory time courses of s and a. A brief stimulation ("stim," arrow) before full recovery triggers an active phase of shorter duration than the unstimulated ones. Lower panel, the oscillations are represented by a trajectory in the (a, s) plane that slowly tracks the lower and upper branches of the a-nullcline (S-shaped curve) and quickly jumps between the branches at the transition points, or knees, of the S-curve. LK, low knee; HK, high knee; stim, stimulation that provokes a premature active phase. (B) The θ-model. Upper panel, time courses of θ and a. Lower panel, phase-plane trajectory superimposed on the Z-shaped a-nullcline.

We can gain qualitative understanding of this cyclic activity by using a "phase-plane" representation. Instead of plotting time courses, we plot a(t) versus s(t) in the (a, s) plane (Fig. 1.2A, lower panel). First, we use the fact that s is much slower than a, and, for each value of s, now treated as a parameter, plot the steady states of Eq. (1.2) (the points for which da/dt = 0). We obtain an S-shaped curve, called the a-nullcline. For some values of s there are three possible steady states, one low (stable), one high (stable), and one intermediate that is unstable. Thus, within that range of s values the system is bistable, as mentioned above, with the middle state acting as a threshold: at any given time, if a is below this threshold it will fall to the low steady state; if it is above threshold it will rise to the high steady state.

We now allow s to vary and plot the state of the system, represented by a trajectory in the (a, s) plane. Assume a is low initially, so we start on the lower branch of the S-curve. In this case s increases slowly according to Eq. (1.3) and the trajectory follows the lower branch (1). This continues until s passes the value where the low and middle steady states meet (the low "knee" of the a-nullcline, LK), so there is no steady state of Eq. (1.2) other than the upper steady state. Thus, the system jumps to the high activity state (2). Once in the high activity state, s decreases and the system slowly tracks the high branch of the S-curve, moving to the left; this is the active phase (3). Eventually, the trajectory passes the high knee of the
S-curve (HK) where the upper steady state meets the middle steady state. Activity then falls abruptly to the low level (4) and the cycle repeats. In the phase plane, the effect of a brief stimulation (arrow in Fig. 1.2A) is apparent: if the stimulus (i) is large enough to bring a above the middle branch of the S-curve (i.e., the threshold), a will immediately jump up to the high state. The resulting premature active phase will be shorter than an unstimulated active phase because it starts at a lower value of s, so less time will be needed to reach the HK. Note that an active phase can also be prematurely terminated by a stimulation that brings a below the threshold. In that case, the following silent phase will also be correspondingly shorter (not shown).
Figure 1.2B shows that the θ-model also generates relaxation oscillations and responds to brief perturbations in a very similar way. The only visible difference is that the a-nullcline is a Z-shaped curve instead of S-shaped. This is because θ increases with high activity and decreases during periods of low activity, in an opposite fashion to s. In both models, the oscillations of the slow variable allow the system to switch between the active and silent phases. The system tracks the stable branches of the a-nullcline until it reaches a knee, where it transitions from one branch to the other.

In many cases, only the activity variable a, but not the feedback variables s or θ, would be readily measurable in experiments. The two models generate oscillations in a with the same properties, so how can one tell whether experimentally observed relaxation oscillations are controlled by a divisive or a subtractive feedback mechanism? In the following, we show that noise that is intrinsic to the biological system has different effects on the two models, so one would only need to record spontaneous oscillations to distinguish between the two possible models.

We include noise by replacing i in Eq. (1.4) with mξ, where m is the magnitude of the noise and ξ is a normally distributed random variable. Results presented in the following are not overly sensitive to the way noise is added to the activity. The most important assumption is that noise perturbs the system's activity, not the slow feedback process. The simulations with noise produce episodic activity as shown in Fig. 1.2, but with variable durations of the active and silent phases. The activity time course is shown in Fig. 1.3A for the s-model with noise. We expect that noise induces early (or delayed) transitions between the silent and active phases, leading to shortened (or lengthened) active and silent phases. To evaluate these effects, we plot active phase duration as a function of the preceding silent phase (Fig. 1.3B) or following silent phase (Fig. 1.3C). We observe a strong positive correlation between the length of the active phase and the preceding—but not following—silent phase. This correlation pattern (Fig. 1.3B and C) is the signature of relaxation oscillations that rely on slow divisive feedback.

Figure 1.3 Activity patterns generated by the two types of relaxation oscillators. (A) Time courses of a and s generated by the s-model with noise. There is visibly more variability of the s value at the on transition than at the off transition (mean values at the transitions are indicated by the horizontal dashed lines). A strong correlation between active phase and preceding (B), but not following (C), silent phase duration corresponds to a wide distribution of s at the on transition (D) and a narrow distribution of s at the off transition (E). (F) Time courses of a and θ generated by the θ-model with noise. There is a weak correlation between active phase duration and both the preceding (G) and following (H) silent phase duration. They correspond to equal amounts of variability in the distributions of θ at the on (I) and off (J) transitions. APD, active phase duration; SPD, silent phase duration.

The cause for this pattern can be deduced from Fig. 1.3A. Transitions from silent to active phases can occur at very different levels of the slow variable s, but the transitions from active to silent phases seem to occur around the same value of s, with very little change from period to period. This implies that a shorter silent phase corresponds to a lower value of s at active phase onset, and thus to a correspondingly shorter active phase duration (cf. Fig. 1.2A). The correlation shown in Fig. 1.3B therefore illustrates that all the variability in active phase duration is due to variability in the preceding silent phase duration. On the other hand, regardless of active phase duration, the following silent phase starts at the same s value as all other silent phases and therefore is not influenced by active phase duration. Thus, there is no correlation between active phase and following silent phase duration
(Fig. 1.3C), that is, the variability in active phase duration does not cause any of the variability in the following silent phase duration. To illustrate this discrepancy between the ‘‘on’’ (silent to active) and the ‘‘off’’ (active to silent) transitions, we plot histograms of the value of s at the transitions. Figure 1.3D shows the wide distribution of s values at the ‘‘on’’
transition, while Fig. 1.3E reveals a very narrow distribution of s values at the off transition. Thus, the correlation pattern shown in Fig. 1.3B and C is due to the large variations of s at the on transition relative to the off transition. Note that if the variability of s at the off transition was greater, then the correlation between active and preceding silent phase duration would be reduced, since there would be some variability in active phase duration not caused by variability in the length of the silent phase. Also, with less variability at the on transition, the small amount of variability at the off transition would have a larger impact and there would be a tendency for a longer active phase to be followed by a longer silent phase. If the variability of s values at both the on and off transitions was equal, we would observe a weak (but significant) correlation between active phase duration and both preceding and following silent phase durations.

This situation occurs with the θ-model. Figure 1.3F shows time courses generated by the θ-model with noise. For the same amount of noise as used in the s-model, there is less variability in the length of active and silent phases. More importantly, the variability of θ is similar at the on and off transitions (Fig. 1.3I and J). This results in a weak but significant correlation between the duration of the active phase and both preceding (Fig. 1.3G) and following (Fig. 1.3H) silent phases. Thus, if the correlation pattern is similar to Fig. 1.3B and C then divisive feedback is likely involved, while if the correlation pattern is similar to Fig. 1.3G and H then subtractive feedback is involved.

We now give a qualitative explanation for the differences in the amount of variability of the slow negative feedback variables at the on and off transitions, since these differences cause the differences in the correlation patterns. It is possible to predict the correlation patterns using survival analysis of particles in a two-well potential (Lim and Rinzel, submitted). Here, we give an equivalent but more intuitive explanation based on geometrical arguments. Again, we use the difference of time scales between the fast activity and the slow negative feedback processes, and we begin with the s-model (divisive feedback). There are two contributing factors to the observed correlation pattern. The first concerns the shape of the a-nullcline in the phase plane. Figure 1.4A shows that a perturbation which transiently changes the activity level can induce a phase transition if it brings the activity across threshold (the middle branch of the S-curve). Because the nullcline is much flatter on the bottom than on the top, it is easier to induce an on transition (at the sharp low knee (LK) of the S-curve) than an off transition (at the round HK). Thus, positive perturbations will be able to induce on transitions for a much larger range of s values than the range of values for which negative perturbations of the same amplitude can induce off transitions. This effect, however, contributes only a small fraction of the variability induced by noise, because noise does not just act to create a series of quick perturbations in the activity level. The effects of noise are integrated over
time since noise is included in the activity equation (Eq. 1.4). Thus, if we see noise as a rapidly varying external input to the system, we also realize that it affects the a-nullcline, perturbing it. To quantify this contribution of noise, we first consider how a change $\Delta i$ to a constant input to the system affects the a-nullcline. As illustrated in Fig. 1.4B, $\Delta i$ shifts the LK horizontally to a greater extent than the HK. In fact, we have shown that for a small change in external input, the ratio of the resulting changes in the position of the LK and HK, $\Delta s_{lk}/\Delta s_{hk}$, is proportional to the ratio of the activity at the HK and LK, $a_{hk}/a_{lk}$ (Tabak et al., 2009). Since activity is close to 0 at the LK, this ratio can be high, around 17.4 for the parameters used here.
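Written out as a formula, the relation just quoted from Tabak et al. (2009) is

$$\frac{\Delta s_{\mathrm{lk}}}{\Delta s_{\mathrm{hk}}} \propto \frac{a_{\mathrm{hk}}}{a_{\mathrm{lk}}},$$

so a low knee sitting at near-zero activity ($a_{\mathrm{lk}} \approx 0$) makes the ratio large; for the parameters used here it is about 17.4.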
The prediction from this analysis with constant input is that noise will "shake" the a-nullcline during the simulation, moving the knees horizontally. Figure 1.4C shows the resulting time course of both LK and HK positions. LK varies much more than HK, as predicted, and the ratio of their standard deviations is close to 17.4. This panel also shows the time course of s (magnified from Fig. 1.3A), which increases during the silent phase and decreases during the active phase. When LK moves downward, s can cross over (upward arrow) and produce an on transition at an unusually low value of s. On the other hand, when LK remains high it can delay a transition (downward arrow). Thus, the large variations in the position of the LK create the variability of s at the on transition. On the other hand, there is little variability of HK and therefore little variability of s at the off transition. Therefore, the wide and narrow distributions of LK (Fig. 1.4D) and HK (Fig. 1.4E) explain the wide and narrow distributions of s at the on (Fig. 1.3D) and off transitions (Fig. 1.3E).

These differences between LK and HK are absent in the θ-model. First, Fig. 1.4F shows that for subtractive negative feedback, the knees of the a-nullcline are symmetrical and therefore it is equally easy for a perturbation to induce an on or off transition (compare with Fig. 1.4A). Second, input variation affects both knees' positions similarly (Fig. 1.4G). Thus, noise creates equal variations in the LK and HK (Fig. 1.4H), and the variability of θ is similar at the on and off transitions. The distributions of HK and LK (Fig. 1.4I and J) are comparable to the distributions of θ at the on and off transitions (Fig. 1.3I and J).

Figure 1.4 Qualitative explanation for the differences in the variability at the on and off transitions. (A) The a-nullcline and superimposed trajectory of the relaxation oscillation for the s-model. Vertical arrows show a perturbation of amplitude 0.2 that can induce a premature transition. The on transition can occur further away from the knee than the off transition. (B) The effect of a change in external input, Δi = 0.01, on the a-nullcline. (C) Time courses of the low knee (LK), high knee (HK), and s (gray, magnified from Fig. 1.3A). The upward arrow indicates a premature on transition due to noise moving the LK to a small s value. The downward arrow indicates a late on transition due to the LK maintaining its position. (D) Wide distribution of LK and (E) narrow distribution of HK obtained during the simulation. Compare with Fig. 1.3D and E. (F) Symmetrical a-nullcline and superimposed oscillation trajectory for the θ-model, with vertical arrows depicting perturbations that can induce a transition. (G) Changes in input affect the LK and HK similarly. (H) Time courses of LK, HK, and θ (gray, magnified from Fig. 1.3F). (I) Distribution of LK and (J) distribution of HK; compare with Fig. 1.3I and J.

In these examples, we used a smooth sigmoid function for $a_\infty$, the steady-state activation of the system (Fig. 1.1). If instead $a_\infty$ had an abrupt onset and smooth saturation, then the Z-shaped nullcline of Fig. 1.4F would have a sharper LK, and the same pattern of correlation as the s-model could be observed. On the other hand, if the activation function was steep at higher a and smooth at lower a, then the opposite correlation pattern could be observed (i.e., correlation between the length of the active and following—but not preceding—silent phase). Thus, the correlation pattern obtained from the θ-model depends on the exact shape of the activation function. In contrast, for the s-model we obtain the correlation pattern shown in Fig. 1.3B and C regardless of the shape of $a_\infty$ because the large noise-induced variations of the LK are dominant. One exception would be when $\theta_0$ is very large, in which case the deterministic system would exhibit a stable equilibrium rather than a relaxation oscillation. That is, the oscillation is driven entirely by the noise. In this case, there should be no correlations between the active and either the preceding or the following silent phases. With this exception, the correlation pattern produced by the s-model is very robust to parameter changes.
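For readers who wish to experiment, the sketch below integrates the noisy s-model (Eqs. (1.2) and (1.3)) with a simple Euler scheme. It is a sketch only: the parameter values and the logistic form assumed for $s_\infty$ are illustrative placeholders, not the published set, and may need tuning to land in the oscillatory regime; the parameters actually used for Fig. 1.3 accompany the authors' programs.

```python
# Euler integration of the noisy s-model, Eqs. (1.2)-(1.3).
# All numerical values below are illustrative placeholders.
import numpy as np

def a_inf(x, k_a=0.02):
    # Increasing sigmoid a_inf(x) = 1/(1 + exp(-x/k_a)), as in Fig. 1.1.
    return 1.0 / (1.0 + np.exp(-x / k_a))

def s_inf(a, k_s=0.05, theta_s=0.3):
    # Assumed decreasing sigmoid: s recovers at low activity, depresses at high.
    return 1.0 / (1.0 + np.exp((a - theta_s) / k_s))

def simulate_s_model(w=1.0, theta_0=0.1, tau_a=1.0, tau_s=200.0,
                     m=0.05, dt=0.05, t_end=4000.0, seed=0):
    rng = np.random.default_rng(seed)
    n = int(t_end / dt)
    a, s = np.empty(n), np.empty(n)
    a[0], s[0] = 0.05, 0.8
    for k in range(n - 1):
        # Noise enters the activity equation only (i -> m*xi), as in the text.
        # An Euler-Maruyama scheme would scale xi by sqrt(dt); here the noisy
        # input is held constant over one step, so its effective strength
        # depends on dt.
        xi = rng.standard_normal()
        a[k + 1] = a[k] + dt * (-a[k] + a_inf(w * s[k] * a[k] - theta_0) + m * xi) / tau_a
        s[k + 1] = s[k] + dt * (-s[k] + s_inf(a[k])) / tau_s
    return np.arange(n) * dt, a, s
```

Thresholding a (e.g., at 0.5) converts the simulated trace into on/off transition times, whose durations can be passed to the correlation_pattern sketch of Section 2.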
In many systems, the active phase is not steady but oscillatory—this defines bursting. The slow negative feedback variable controls the transitions between active and silent phases of bursting as described above for the relaxation oscillations. However, the fast oscillations that occur during the active phase of bursting can greatly increase the sensitivity of the off transition to noise. This, in turn, can change the correlation pattern. In the following examples, we present several cases of bursting oscillations in excitable cells that exhibit different correlation patterns.
4. Example 2: Square Wave Bursting

Square wave bursting has been described in a number of cell types (Butera et al., 1999; Chay and Keizer, 1983; Cornelisse et al., 2001) and belongs to the class of integrator-like neurons (Izhikevich, 2001). It has two primary characteristics. One is that the spikes often ride on a depolarized plateau, as in Fig. 1.5A. However, this is not always the case, since the spikes may undershoot the plateau (Bertram et al., 1995). The second characteristic is that the time between spikes progressively increases during the active phase of a burst. To investigate the correlation pattern of square wave bursting, we use a simplified version of a biophysically derived pancreatic β-cell model (Sherman and Rinzel, 1992). Equations for this and other bursting models used herein can be found in the primary references. Parameter values used were those described therein. Additionally, equations, parameter values, and computer programs for all models are available at http://www.math.fsu.edu/bertram/software/neuron.

For the bursting models discussed, the primary observable variable is the membrane potential or voltage (V), which evolves in time according to

$$C \frac{dV}{dt} = -\sum_i I_i + I_{noise}. \qquad (1.6)$$

The ionic currents, $I_i$, vary from model to model, as do the number and identity of other variables. Random noise is introduced through the term $I_{noise} = m\xi$, where ξ is a normally distributed random variable and m is the noise magnitude. In addition to this voltage equation, there are equations for current activation and inactivation variables. One of these variables changes slowly compared with V, and for each model is similar to θ discussed earlier, providing subtractive negative feedback.

Figure 1.5A shows the voltage time course of the model with added noise of magnitude 1 pA, with the corresponding slow variable, s, superimposed. This is a slow negative feedback variable that activates an inhibitory current. When s is sufficiently large the voltage cannot reach the spike
threshold, so spiking stops and the cell enters a silent phase. In the absence of spiking the inhibitory s variable declines, eventually reaching a level that is low enough to allow spiking to resume.

Figure 1.5 (A) Voltage trace and slow variable of a noisy square wave burster (Sherman and Rinzel, 1992). To facilitate superposition, the slow variable (s) time course has been rescaled. (B) Scatter plot obtained by plotting the duration of each active phase against the duration of the preceding silent phase. In this case, no correlation is observed (r = 0.12, p = 0.15). (C) The plot of active phase duration versus duration of the next silent phase shows a positive correlation (r = 0.72, p < 10⁻²⁰). Thus, on average, a short (long) active phase will be followed by a short (long) silent phase. (D) Distribution of the slow variable at the beginning of an active phase. (E) Distribution of the slow variable at the end of an active phase. The width of the slow variable distribution is greater at the active phase termination than at the active phase onset. That is, active phase termination is more sensitive to noise than active phase initiation.
Scatter plots of the active phase duration versus the previous and the next silent phase durations are constructed as described in the previous section. The scatter plot of the active phase versus the following silent phase (Fig. 1.5C) shows a positive correlation, indicating that short (long) active phases lead to short (long) silent phases. In panel B, however, there is no correlation between the durations of the active phase and the previous silent phase. That is, the length of the silent phase does not provide information regarding the duration of the next active phase of bursting.

To explain the correlation pattern, we plot the distributions of the values of the slow variable at the beginning and the end of an active phase. Variation of this slow variable is responsible for starting and stopping the spiking during a burst. For square wave bursting the width of the slow variable distribution is greater at the active phase termination (Fig. 1.5E) than at the active phase onset (Fig. 1.5D). The reason for this is that the spiking slows down near the end of the active phase, and the voltage spends most of its time near the spike threshold (i.e., the trajectory is approaching a homoclinic orbit), and so is sensitive to small perturbations. Thus, the termination of the burst is more subject to noise than its initiation.

This example illustrates that correlation analysis of the active and silent phase durations, when applied to a model of square wave bursting, produces a pattern with positive correlation for active versus next silent phase duration, but no correlation for active versus previous silent phase duration. This result holds for the other three models of square wave bursting tested, and the rationale for this is that all models of square wave bursting have similar dynamic structures. This correlation analysis technique can be used on actual voltage data from a nerve or endocrine cell, for example, to determine if the cell is a square wave burster. It would only require that active and silent phase durations be determined and plotted as in Fig. 1.5B and C. If the patterns match those of a square wave burster, then this tells the modeler a great deal about the form that the model should take. That is, it greatly limits the range of possible models that describe the cell's electrical behavior.
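The transition distributions of Fig. 1.5D and E are straightforward to extract from any simulated trajectory. A minimal sketch follows (the variable names are hypothetical, and the onset and termination times are assumed to come from the Appendix algorithm).

```python
# Slow variable values at active phase onset and termination (cf. Fig. 1.5D and E).
import numpy as np

def slow_var_at_transitions(t, s, onset_times, offset_times):
    s_on = np.interp(onset_times, t, s)    # s at each burst onset
    s_off = np.interp(offset_times, t, s)  # s at each burst termination
    return s_on, s_off

# For square wave bursting we expect np.std(s_off) > np.std(s_on): the
# termination, not the onset, absorbs most of the noise-driven variability,
# which is what produces the pattern of Fig. 1.5B and C.
```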
5. Example 3: Elliptic Bursting

Next, we consider elliptic bursting, which is observed in several types of neurons (Del Negro et al., 1998; Destexhe et al., 1993; Llinas et al., 1991) and belongs to the class of resonator-like neurons (Izhikevich, 2000; Llinas, 1988). Elliptic bursting is characterized by large spikes that do not ride on a depolarized plateau, and small subthreshold oscillations that are present immediately before and after the active phase of a burst (Fig. 1.6A). The large spikes alone, however, do not uniquely identify this burst type, since
square wave bursters can have large spikes (Bertram et al., 1995), as can type two, or parabolic, bursters (Rinzel and Lee, 1987) and other burst types. Moreover, the small subthreshold oscillations are largely obscured when noise is added to the system. That is, the subthreshold oscillations present in the deterministic system may not readily be distinguished from those introduced by the random noise. We use the reduced version of the
Hodgkin–Huxley giant squid axon model (Rinzel, 1985), with an added slow outward current, to analyze the correlation patterns for elliptic bursting. Figure 1.6A shows the voltage time course of the model with noise magnitude of 0.2 pA, with the corresponding slow variable superimposed.

Figure 1.6 (A) Voltage trace and slow variable of an elliptic burster with noise (magnitude 0.2 pA). (B) Scatter plot of active phase duration against the duration of the previous silent phase. A correlation exists (r = 0.53, p < 10⁻¹⁰). (C) There is a weak correlation between the durations of active phases and the following silent phases (r = 0.22, p = 0.006). (D) The width of the slow variable distribution at episode onset is greater than the distribution at active phase termination (E); thus, active phase termination is less sensitive to noise than active phase initiation.
The scatter plot of the active phase versus the duration of the previous silent phase (Fig. 1.6B) shows a positive correlation, indicating that short (long) active phase durations are preceded by short (long) silent phase durations. In contrast, there is only a weak correlation between the active phase duration and the duration of the next silent phase (Fig. 1.6C). Therefore, the duration of the previous silent phase predicts the active phase duration, but the active phase duration does not accurately predict the next silent phase duration.

As in the previous sections, we plot the distributions of the slow variable at the onset and at the termination of an active phase. The slow variable in elliptic bursting exhibits a wider distribution at burst onset (Fig. 1.6D) than at burst termination (Fig. 1.6E). The reason for the wide onset distribution is that the subthreshold oscillations bring the voltage near the spike threshold, and once this threshold is crossed an active phase is initiated. Thus, the active phase initiation is very sensitive to noise, as has been described previously for this type of bursting (Kuske and Baer, 2002; Su et al., 2004). During the active phase, only a precise voltage perturbation at the right time can lead to spike termination (Rowat, 2007). Thus, active phase termination is relatively insensitive to the effects of noise.

This example demonstrates that application of correlation analysis can distinguish model elliptic bursting from model square wave bursting. The analysis could also be applied experimentally, taking advantage of the noise that is inherent in the system. The outcome of the analysis could help with the choice of model used to describe the biological system.
6. Example 4: Using Correlation Analysis on Experimental Data

In this example, we illustrate how correlation analysis can be used as a test for the validity of a model by applying it to both the model and the experimental system. The model describes fast bursting electrical activity in prolactin-secreting pituitary lactotrophs (Tabak et al., 2007). The experimental preparation is the GH4 pituitary lactotroph cell line. Like primary lactotrophs, cells from this lactotroph cell line often exhibit fast bursting electrical oscillations. A sample trace of GH4 bursting activity is shown in Fig. 1.7A. Correlation analysis was applied to a voltage trace approximately 5 min long, consisting of 150 bursts. The scatter plots show that there is no correlation between active phase duration and the previous silent phase (Fig. 1.7B), but a strong positive correlation between active phase duration and the next silent phase duration (Fig. 1.7C). We next compare these scatter plots with those from computer simulations of the model with added noise of magnitude 4 pA. The bursting
produced by the model is neither square wave nor elliptic (Tabak et al., 2007), but instead is of the type referred to as pseudo-plateau (Stern et al., 2008). The model scatter plots show that, as with the experimental data, there is no correlation between active and previous silent phase durations (Fig. 1.7D), and a strong positive correlation between the active and the next silent phase durations (Fig. 1.7E). Thus, the correlation analysis provides some support for the validity of the mechanism for bursting in the mathematical model.

Figure 1.7 Correlation analysis applied to experimental data, and compared with a corresponding model. (A) Sample voltage trace of GH4 cell bursting. (B) Scatter plot obtained using GH4 cell data showing the active phase duration versus previous silent phase duration (r = 0.10, p = 0.21). (C) Scatter plot of active phase duration versus next silent phase duration (r = 0.68, p < 10⁻²⁰). (D)–(E) Scatter plots obtained from computer simulations of a model of the pituitary lactotroph (Tabak et al., 2007) with noise added (4 pA magnitude); (D) r = 0.07, p = 0.43, and (E) r = 0.67, p < 10⁻¹⁵.
7. Summary

We have demonstrated that correlation analysis can be a useful tool for comparing mathematical models with experimental data as a first check for the validity of the model. This type of analysis is appropriate for systems that produce relaxation oscillations or bursting oscillations. While it does not validate the model, it is a first test that is simple to apply to both the model and the biological system. Furthermore, it is noninvasive: all that is required is that one measure the activity of the biological system and make scatter plots of active and silent phase durations. Because the correlation analysis is a statistical test, confidence in the results increases with the number of data points. In this case, the data points are bursts or relaxation oscillations. For our example with the GH4 lactotroph cell line, 5 min of continuous recording was sufficient to give reliable results. While our examples focused on neural oscillations, the method is equally applicable to other types of biological systems that generate relaxation-type oscillations.
Appendix: Algorithm for the Determination of Phase Durations During Bursting

Here, we describe the method used to determine silent and active phase durations for a noisy burst time course, where V is the observable. Upon visual inspection, we first set a threshold, $V_S$, such that if $V > V_S$ a spike is recorded. Denote the times at which two spikes occur by $t_i$ and $t_j$; the two spikes are considered to lie within a single burst if $|t_i - t_j| < d$, where d is a positive parameter chosen by examination of interspike intervals. Conversely, if $|t_i - t_j| > d$ then the two spikes are not considered part of the same burst. In a similar way, we obtain the duration of each silent phase by computing the difference between the last spike of a burst and the first spike of the following burst. We then create three vectors of equal size. One vector, $\vec{b}$, contains all the active phase durations in chronological order (i.e., $[b_1, b_2, \ldots, b_N]$). The other two vectors contain the silent phase durations.
The preceding silent phase vector is $\vec{s}_{prec} = [s_1, s_2, \ldots, s_N]$, while the following silent phase vector is $\vec{s}_{next} = [s_2, s_3, \ldots, s_{N+1}]$. We then plot the elements of $\vec{b}$ versus those of $\vec{s}_{prec}$ or versus $\vec{s}_{next}$ to make scatter plots. Computer codes for the computation of active and silent phase durations can be downloaded from http://www.math.fsu.edu/bertram/software/neuron. In the case of experimental data, the data may have to be detrended if any slow trends in active and silent phase durations are present.
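A direct transcription of this algorithm into Python might look as follows. It is a sketch only (the authors' downloadable programs are the reference implementation) and assumes the record contains spikes from at least two bursts; detrending, if needed, is not shown.

```python
# Burst detection per the algorithm above: spikes are upward crossings of V_S,
# and spikes separated by less than d belong to the same burst.
import numpy as np

def phase_durations(t, V, V_S, d):
    above = V > V_S
    spike_times = t[1:][above[1:] & ~above[:-1]]    # upward threshold crossings

    gaps = np.diff(spike_times)
    breaks = np.where(gaps > d)[0]                  # gap > d starts a new burst
    starts = np.concatenate(([0], breaks + 1))      # first spike of each burst
    ends = np.concatenate((breaks, [len(spike_times) - 1]))  # last spike of each burst

    b = spike_times[ends] - spike_times[starts]             # active phase durations
    s = spike_times[starts[1:]] - spike_times[ends[:-1]]    # silent phases between bursts
    # For the interior bursts b[1:-1], s[:-1] plays the role of s_prec and
    # s[1:] that of s_next in the notation above.
    return b, s
```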
ACKNOWLEDGMENT

This work was supported by NIH grant DA-19356.
REFERENCES

Bertram, R., and Sherman, A. (2005). Negative calcium feedback: The road from Chay-Keizer. In "Bursting: The Genesis of Rhythm in the Nervous System," (S. Coombes and P. C. Bressloff, eds.), World Scientific, Singapore.
Bertram, R., Butte, M., Kiemel, T., and Sherman, A. (1995). Topological and phenomenological classification of bursting oscillations. Bull. Math. Biol. 57, 413–439.
Butera, R. J., Rinzel, J., and Smith, J. C. (1999). Models of respiratory rhythm generation in the pre-Bötzinger complex I. Bursting pacemaker neurons. J. Neurophysiol. 82, 382–397.
Chay, T. R., and Keizer, J. (1983). Minimal model for membrane oscillations in the pancreatic β-cell. Biophys. J. 42, 181–190.
Coombes, S., and Bressloff, P. C. (2005). Bursting: The Genesis of Rhythm in the Nervous System. World Scientific Publishing Co., Singapore.
Cornelisse, L. N., Scheenen, W. J. J. M., Koopman, W. J. H., Roubos, E. W., and Gielen, S. C. A. M. (2001). Minimal model for intracellular calcium oscillations and electrical bursting in melanotrope cells of Xenopus laevis. Neural Comput. 13, 113–137.
Dean, P. M., and Mathews, E. K. (1970). Glucose-induced electrical activity in pancreatic islet cells. J. Physiol. 210, 255–264.
Del Negro, C. A., Hsiao, C.-F., Chandler, S. H., and Garfinkel, A. (1998). Evidence for a novel bursting mechanism in rodent trigeminal neurons. Biophys. J. 75, 174–182.
Destexhe, A., Babloyantz, A., and Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay neurons. Biophys. J. 65, 1538–1552.
Ermentrout, G. B., and Chow, C. C. (2002). Modeling neural oscillations. Physiol. Behav. 77, 629–633.
Friesen, W. O., and Block, G. D. (1984). What is a biological oscillator? Am. J. Physiol. 246, R847–R853.
Goldbeter, A., and Lefever, R. (1972). Dissipative structures for an allosteric model; application to glycolytic oscillations. Biophys. J. 12, 1302–1315.
Izhikevich, E. M. (2000). Neural excitability, spiking and bursting. Int. J. Bifur. Chaos 10, 1171–1266.
Izhikevich, E. M. (2001). Resonate-and-fire neurons. Neural Netw. 14, 883–894.
Kuske, R., and Baer, S. M. (2002). Asymptotic analysis of noise sensitivity in a neuronal burster. Bull. Math. Biol. 64, 447–481.
Li, Y.-X., Stojilkovic, S. S., Keizer, J., and Rinzel, J. (1997). Sensing and refilling calcium stores in an excitable cell. Biophys. J. 72, 1080–1091.
Correlation Analysis of Oscillations
21
Lim, S., and Rinzel, J. Noise-induced transitions in slow wave neuronal dynamics. J. Comput. Neurosci. (in press). Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science 242, 1654–1664. Llinas, R. R., Grace, T., and Yarom, Y. (1991). In vitro neurons in mammalian cortical layer 4 exhibit intrinsic oscillatory activity in the 10- to 50-Hz frequency range. Proc. Natl. Acad. Sci. USA 88, 897–901. Murray, J. D. (1989). Mathematical Biology. Springer-Verlag, Berlin. Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Fed. Proc. 44, 2944–2946. Rinzel, J. (1987). A formal classification of bursting mechanisms in excitable systems. In ‘‘Mathematical Topics in Population Biology, Morphogenesis and Neurosciences,’’ (E. Teramoto and M. Yamaguti, eds.), Vol. 71. Springer-Verlag, Berlin. Rinzel, J., and Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In ‘‘Methods in Neuronal Modeling: From Ions to Networks,’’ (C. Koch and I. Segev, eds.), pp. 251–291. MIT Press, Cambridge. Rinzel, J., and Lee, Y. S. (1987). Dissection of a model for neuronal parabolic bursting. J. Math. Biol. 25, 653–675. Rowat, P. (2007). Interspike interval statistics in the stochastic Hodgkin–Huxley model: Coexistence of gamma frequency bursts and highly irregular firing. Neural Comput. 19, 1215–1250. Sherman, A., and Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. USA 89, 2471–2474. Shpiro, A., Curtu, R., Rinzel, J., and Rubin, N. (2007). Dynamical characteristics common to neuronal competition models. J. Neurophysiol. 97, 462–473. Stern, J. V., Osinga, H. M., LeBeau, A., and Sherman, A. (2008). Resetting behavior in a model of bursting in secretory pituitary cells: Distinguishing plateaus from pseudoplateaus. Bull. Math. Biol. 70, 68–88. Strogatz, S. H. (1994). Nonlinear dynamics and chaos. Addison-Wesley, Reading, MA. Su, J., Rubin, J., and Terman, D. (2004). Effects of noise on elliptic bursters. Nonlinearity 17, 1–25. Tabak, J., Rinzel, J., and O’Donovan, M. J. (2001). The role of activity-dependent network depression in the expression and self-regulation of spontaneous activity in the developing spinal cord. J. Neurosci. 21, 8966–8978. Tabak, J., O’Donavan, M. J., and Rinzel, J. (2006). Differential control of active and silent phases in relaxation models of neuronal rhythms. J. Comput. Neurosci. 21, 307–328. Tabak, J., Toporikova, N., Freeman, M. E., and Bertram, R. (2007). Low dose of dopamine may stimulate prolactin secretion by increasing fast potassium currents. J. Comput. Neurosci. 22, 211–222. Tabak, J., Senn, W., O’Donovan, M. J., and Rinzel, J. (2000). Modeling of spontaneous activity in developing spinal cord using activity-dependent depression in an excitatory network. J. Neurosci. 20, 3041–3056. Tabak, J., Mascagni, M., and Bertram, R. (2009). Mechanism for the universal pattern of activity in developing neuronal networks, submitted. Tsai, T. Y.-C., Choi, Y. S., Ma, W., Pomerening, J. R., Tang, C., and Ferrell, J. E. Jr. (2008). Robust, tunable biological oscillations from interlinked positive and negative feedback loops. Science 321, 126–129. Tsaneva-Atanasova, K., Sherman, A., Van Goor, F., and Stojilkovic, S. S. (2007). Mechanism of spontaneous and receptor-controlled electrical activity in pituitary somatotrophs: Experiments and theory. J. Neurophysiol. 98, 131–144. Tyson, J. J. (1991). 
Modeling the cell division cycle: cdc2 and cyclin interactions. Proc. Natl. Acad. Sci. USA 88, 7328–7332.
22
Maurizio Tomaiuolo et al.
van der Pol, B., and van der Mark, J. (1928). The heartbeat considered as a relaxation oscillation, and an electrical model of the heart. Phil. Mag. 6, 763–775. Van Goor, F., Zivadinovic, D., Martinez-Fuentes, A. J., and Stojilkovic, S. S. (2001). Dependence of pituitary hormone secretion on the pattern of spontaneous voltagegated calcium influx. J. Biol. Chem. 276, 33840–33846. Wilson, H. R., and Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1–24.
CHAPTER TWO
Trait Variability of Cancer Cells Quantified by High-Content Automated Microscopy of Single Cells

Vito Quaranta,*,† Darren R. Tyson,*,† Shawn P. Garbett,† Brandy Weidow,*,† Mark P. Harris,* and Walter Georgescu†,‡

* Department of Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
† Vanderbilt Integrative Cancer Biology Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA
‡ Department of Biomedical Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA
Contents
1. Introduction
2. Background
3. Experimental and Computational Workflow
   3.1. Time-lapse image acquisition
   3.2. Data management
   3.3. Image processing
   3.4. Cellular parameter extraction
   3.5. Statistical analysis
   3.6. Data categories
4. Application to Traits Relevant to Cancer Progression
   4.1. Cell motility
   4.2. Cell proliferation
5. Conclusions
Acknowledgments
References
Abstract

Mapping quantitative cell traits (QCT) to underlying molecular defects is a central challenge in cancer research because heterogeneity at all biological scales, from genes to cells to populations, is recognized as the main driver of cancer progression and treatment resistance. A major roadblock to a multiscale framework linking cell to signaling to genetic cancer heterogeneity is the dearth of large-scale, single-cell data on QCT, such as proliferation, death sensitivity, motility, metabolism, and other hallmarks of cancer.
High-volume single-cell data can be used to represent cell-to-cell genetic and nongenetic QCT variability in cancer cell populations as averages, distributions, and statistical subpopulations. By matching the abundance of available data on cancer genetic and molecular variability, QCT data should enable quantitative mapping of phenotype to genotype in cancer. This challenge is being met by high-content automated microscopy (HCAM), based on the convergence of several technologies, including computerized microscopy, image processing, computation, and heterogeneity science. In this chapter, we describe an HCAM workflow that can be set up in a medium-sized interdisciplinary laboratory, and its application to produce high-throughput QCT data for cancer cell motility and proliferation. This type of data is ideally suited to populate cell-scale computational and mathematical models of cancer progression for quantitatively and predictively evaluating cancer drug discovery and treatment.
1. Introduction

Cancer cells both across and within individual patients are heterogeneous with respect to genetic (and epigenetic) makeup (Heng et al., 2009). Furthermore, it is increasingly appreciated that even within genetically identical clonal cell populations, individual cells may differ from each other in phenotypic traits (Brock et al., 2009; Stockholm et al., 2007). Genetic and nongenetic heterogeneity remain a formidable challenge for cancer treatment, especially in the case of molecularly targeted drugs. Large-scale genetic and epigenetic analyses of cancer variability have begun to extract common patterns at the molecular scale within this morass of heterogeneity. More recently, powerful cell phenotype analytical methods are coming online, mostly due to the convergence of image processing, computer-driven automation, and high-throughput microscopes (Dove, 2003; Evans and Matsudaira, 2007; Perlman et al., 2004; Starkuviene and Pepperkok, 2007). These methods, termed for convenience high-content automated microscopy (HCAM), enable large-scale analyses of cancer cell phenotype variability that will eventually match the scope of genetic variability analyses. In this chapter, we describe our implementation of HCAM methods to quantify cell traits such as proliferation and motility, and their variability within a cell population such as a cancer cell line. To be clear, we refer to a quantitative cell trait (QCT) as a cell-scale functional property (e.g., proliferation, motility, metabolism, death sensitivity) that displays cell-to-cell variability in a cell population, with respect to some quantitative metric. It is highly desirable that a QCT be defined in numeric terms for machine compatibility, since it is virtually impossible to intuitively deal with, follow in time, or predict consequences of QCT combinations, for example, in cancer progression or drug response.
In these early days, these metrics are not firmly established and undoubtedly at some point will have to be agreed upon, particularly for comparing data from different sources in an automated fashion. In this chapter, our primary goal is to describe methods that define QCT heterogeneity quantitatively, regardless of the source of heterogeneity (e.g., genetic, epigenetic, nongenetic). However, we recognize that, once metrics are established, investigating the source of QCT variability becomes a tantalizing priority. Such investigations may span from identification of the molecular mechanisms responsible for generating or dampening QCT variability, to mathematical or statistical modeling to infer the most likely type of heterogeneity source (e.g., genetic or nongenetic). The consequences are very practical: a genetic source would be expected to produce permanent inheritance of the heterogeneity, whereas a nongenetic source would produce only temporary inheritance.
2. Background

Heterogeneity is a central feature of cancer that occurs at all biological scales, from genes to cells to populations. For decades, it has been suspected to be largely responsible for cancer progression and resistance to treatment, spawning intense study, especially at the genetic and molecular level. For example, panels of cancer cell lines have been subjected to genomic, gene expression, or protein array analyses, evidencing a large number of genetic mutations and signaling network alterations associated with malignant transformation. While these studies have been enormously informative and have taken our understanding of cancer to incredible depth, they have suffered from at least two limitations: (i) high-throughput genetic and biochemical studies are generally impractical at the single-cell level and are mostly based on average measurements of a test cell population; and (ii) genotype-to-phenotype mapping in a single cell, that is, linking genetic or molecular changes with phenotypic trait output (like motility or proliferation), remains challenging. These limitations are especially frustrating in the context of cancer progression, which is paced by the appearance of cell clones abnormal with respect to "hallmark" traits, such that cancer may be referred to as a disease of outliers. Methods to map QCT to underlying molecular defects would effectively produce a multiscale bridge from cancer genetics to cancer cell biology. In this multiscale framework, treatment and drug discovery could be approached with predictive methods. Analysis at the single-cell level of QCT variability, regardless of source, is commanding increasing attention due to convergent advances in several disciplines.
As a whole, the science of heterogeneity has reached maturity in many fields, such as face recognition, machine learning, and signal processing, producing theory that is applicable to cancer cell biology, as well as a wealth of mathematical and computational tools. Computer-driven microscopes are rapidly being refined and promise to deliver for adherent cells the spectacular advances flow cytometry has produced for cells in suspension. Image processing software and automation have the potential to create automated workflows that capture and analyze the behavior of thousands of single cells under tens of conditions in relatively short experimental times. This ensemble of technology is commonly referred to as HCAM. Its application to cancer cell biology promises to revolutionize our understanding of cell-to-cell variability with respect to phenotypic traits, referred to above as QCT. It is perhaps worth noting that QCT studies are gaining traction in fields other than cancer. An emerging view is that phenotypic trait variability is inherent to living systems, in large part as an inevitable consequence of biological "noise" at several steps of intracellular molecular operation (e.g., gene transcription, mRNA translation, protein folding). Furthermore, local microenvironmental conditions may extinguish or amplify this noise and, to an extent, even nongenetic variability is inherited from mother to daughter cells on a temporary basis. Normal cells apply considerable resources to constrain or dampen both genetic and nongenetic variability of their traits, particularly as they become functional components of differentiated tissues. In this sense, variability is a negative factor with respect to homeostasis. However, trait variability may provide options to cells for responding to microenvironmental changes, for example, by pushing operation of a trait to the extremes of its range in order to survive under extreme microenvironmental stress, and perhaps for evolving new strategies. In summary, variability of a phenotypic cell trait can be considered a measure of cell adaptability, as well as of the evolvability of the underlying biochemical circuitry. From this broader perspective, an intriguing view is that during cancer progression the adaptability of cancer cells to microenvironmental changes is ever-increasing. QCT analysis is a key step toward breaking down adaptability into numerical parameters that can be evaluated spatially and temporally by higher scale computational modeling.
3. Experimental and Computational Workflow

Advances in experimental and microscopic technologies have made it possible to gather high-quality, high-content images of cells and cellular components at an ever-increasing rate. Development of such state-of-the-art equipment and tools allows investigators to gather spatially resolved (i.e., in x-, y-, and z-axes, covering many fields of view, using multiple wavelengths) and time-resolved (i.e., at rapid intervals, over several days) quantifiers that describe various cell traits (i.e., motility, proliferation) in vitro.
Such methodology can also be used to explore the heterogeneity of traits of individual cells within cell populations. HCAM is rapidly becoming the most efficient methodology for measuring phenotypic traits of cancer cell populations at the single-cell level. In this section, we describe a streamlined workflow for acquiring and processing HCAM data that we have established for our own group (Fig. 2.1). This process can be divided into the key steps highlighted in Fig. 2.1. For some of these steps, recent comprehensive reviews have appeared and the reader is referred to those after a brief discussion; other steps are dealt with in detail. In brief, the workflow is enabled by the development of an informatics pipeline for images and image-associated metadata. Image features are derived using a suite of existing and newly developed image analysis and computer software tools.
Figure 2.1 Workflow for computer-assisted analysis of quantitative cell traits (QCT). For simplification, the general workflow associated with measuring QCT is broken into the following steps: (i) time-lapse image acquisition (e.g., assay setup, microscopy); (ii) data management (OME, ACCRE); (iii) image processing (e.g., cell identification, segmentation, tracking); (iv) cellular parameter extraction (e.g., cell speed, doubling time); (v) statistical analysis ((non)parametric tests); and (vi) data categories (i.e., averages, distributions, and "statistical subpopulations").
Ultimately, the goal of this workflow is to establish an efficient pipeline to store, disseminate, and analyze single-cell data, streamlining the use of data categories, for example, as input into mathematical/computational models (Fig. 2.1). For simplification, the workflow has been broken down into the following steps (all of which are expanded upon in the following sections):

1. Time-lapse image acquisition
2. Data management
3. Image processing (segmentation and tracking)
4. Cellular parameter extraction
5. Statistical analyses
6. Data categories (average, distribution, statistical subpopulations)
3.1. Time-lapse image acquisition

Measurement of spatially and time-resolved phenotypic traits involves large-scale data acquisition. In order to examine traits efficiently in many cell lines, in many relevant conditions (e.g., hypoxia, drug treatment), and over time, inclusion of a high-throughput methodology such as HCAM is vital. We have primarily utilized a temperature- and CO2-controlled, automated, spinning-disk confocal microscope, the BD Pathway 855 (BD Biosciences, Rockville, MD), for single-cell phenotypic studies, although many other systems exist that provide similar functions. The Bioimager is capable of imaging an entire 96-well microplate in a single channel and focal plane in 10 min, but also has the flexibility to accommodate many other sample formats. Imaging can also be performed repeatedly with multiple images per well, multiple focal planes (z-sections), and multiple fluorescent channels (using two light sources and various filters for a variety of fluorophores), making the setup ideal for high-content time-lapse studies. In addition, the machine has a capacity for automated liquid handling, allowing precise control of the duration and volume of compound treatments. Lastly, this machine is versatile, with adaptable hardware that is directly integrated into our data management system (described below in detail). Ultimately, increasing the efficiency of data acquisition via such methodology makes it possible to maximize the amount (and potentially the value) of quantitative data extracted from image-based single-cell studies. There are several trade-offs that require careful consideration during single-cell studies. First, the speed at which images can be acquired limits the number of wells, the surface area covered, and the number of channels or z-sections that can be imaged prior to returning to the starting position for the next sequential round of image acquisition. For instance, the frequency of image acquisition is of critical importance when examining motile cells.
In order to automate the identification and tracking of individual cells over time (repeated images), the distance a cell has moved between frames must be kept below a minimal threshold that is dependent on the density of cells imaged. It is computationally more challenging to identify a cell between two sequential images as the number of cells and the distance a cell moves increase. A list of imaging trade-offs is shown in Table 2.1.

Another important consideration is photobleaching and phototoxicity. This is generally not a problem for phase-contrast imaging, but can be a substantial limitation for fluorescence imaging. Nipkow spinning-disk confocal imaging is particularly well suited to reducing phototoxicity and photobleaching and has become the method of choice for live-cell imaging (Gräf et al., 2005). However, a limitation of imaging through spinning disks is that z-axis resolution is reduced compared to that of laser-scanning confocal imaging. Regardless of whether spinning disks are used, the potential effects of imaging on cellular phenotypes must be considered.

Table 2.1 Imaging trade-offs for dynamic high-content automated microscopy

Frequency of image acquisition
- Automatic tracking algorithms become more error-prone as cell speed or the time between successive frames increases.
- Increasing imaging frequency facilitates automatic tracking but increases total light exposure, which increases the risk of phototoxicity and photobleaching.

Duration of light exposure
- Increasing exposure time increases the signal-to-noise ratio, but also increases total light exposure, which increases the risk of phototoxicity and photobleaching.
- The minimum exposure time that provides a sufficient signal-to-noise ratio over the entire experiment should be employed.
- Camera binning may be used to increase signal at the cost of some spatial resolution.

Area to be imaged
- Directly determines the maximum number of cells to be imaged.
- Decreasing objective magnification (e.g., from 20× to 10×) increases area but reduces resolution.
- Digital stitching of adjacent frames (montaging) can be used to increase the imaged area at the cost of time, file size, and increased light exposure at overlapping frame borders.

Number of channels and z-sections
- Increasing the number of channels and z-sections increases light exposure and the time required per well.

Number of conditions, cell types, and technical replicates
- Limited by the frequency of imaging required in each well to address the biological question, and by the time required to image each well.
- Increasing technical replicates allows a sufficient number of cells to be imaged if low-density culture is required initially.

Duration of experiment
- Automatic tracking algorithms become more error-prone as cell density increases, which occurs exponentially under optimal culture conditions.
- Longer experiment times may affect maintenance of microenvironmental conditions (e.g., depletion of nutrients, medium evaporation).
3.2. Data management

Individual HCAM experiments generate large datasets, commonly exceeding 50 GB in size. Therefore, data management, including storage, retrieval, backup, and processing, is facilitated by incorporation of data into the open microscopy environment (OME; http://www.openmicroscopy.org; Swedlow et al., 2003). This open-source software has been designed specifically to address the challenges of HCAM data and provides a standardized management platform by developing software and data format standards for the storage and manipulation of biological microscopy data (Goldberg et al., 2005; Swedlow et al., 2009). OME has previously been used in a number of biological studies to examine many aspects of cellular behavior (Dikovskaya et al., 2007; Porter et al., 2007). An OME remote objects (OMERO) server is established, which provides access to image data (the binary pixel data) and metadata (i.e., associated information about instrument settings, configurations, and annotations). Access to data is enabled through client applications that run on a user's computer. These include lightweight web-based interfaces, which can be accessed from any computer with a standard web browser; Java-based client applications, which provide more functionality than the web interface but must be installed separately on each client computer; and a full cross-platform API, which provides data accessibility from third-party applications like ImageJ and VisBio. In addition, incorporation of data into OME provides MATLAB bindings to facilitate sophisticated image processing and analysis directly through the OMERO server.
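As an illustration of programmatic access, the sketch below assumes the OMERO Python bindings (omero.gateway) are installed; the host name, credentials, and image ID are placeholders to be replaced with the user's own values.

```python
from omero.gateway import BlitzGateway  # OMERO Python bindings (assumed installed)

# Placeholder server, credentials, and image ID -- substitute your own.
conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
conn.connect()

image = conn.getObject("Image", 123)      # fetch image metadata by ID
pixels = image.getPrimaryPixels()
plane = pixels.getPlane(0, 0, 0)          # (z, c, t) -> 2D numpy array of pixel data
print(image.getName(), plane.shape)

conn.close()
```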
3.3. Image processing

Formatted images classified into datasets are then processed using various image analysis tools/algorithms (e.g., cell tracking, segmentation). We use a combination of existing tools, some freely available from open sources, such as MetaMorph™ (Molecular Devices, Sunnyvale, CA), ImageJ (http://rsbweb.nih.gov/ij/; Rasband, 1997–2006), CellProfiler (http://www.cellprofiler.org; Carpenter et al., 2006), and OpenLab (Improvision, Waltham, MA), and others custom-developed in-house, together with the Vanderbilt Advanced Computing Center for Research & Education (ACCRE) cluster for rapid processing of individual wells in parallel.
MATLAB and Unix shell scripts, which are designed to run in a high-throughput mode, facilitate this effort. We will present three specific processing modules that were custom-designed for processing of cell motility (Section 4.1) and proliferation (Section 4.2).
3.4. Cellular parameter extraction

The output from the image processing pipeline is a set of cell parameters and images for visual inspection. Information from each tracked cell can be extracted from raw or processed images, or from aggregate data, for further analysis. Typical parameters obtained from each image include cell perimeter, mean pixel intensity, and measures of shape such as eccentricity and solidity. Other measurements first require the identification of individual cells across multiple frames. These QCT include parameters of cell motility (e.g., speed, direction), intermitotic times (IMT), and progeny trees. Once all data are extracted from images, they are saved as a set of CSV files for statistical analysis and a set of images for visual inspection (i.e., quality control).
3.5. Statistical analysis

A variety of analytical and statistical tools are applied to further analyze single-cell data, using a few statistical/mathematical packages, including R (http://www.r-project.org/; a free software environment), SPSS (SPSS, Inc., Chicago, IL), and Mathematica (Wolfram Research, Inc., Champaign, IL). Averages and distributions of data are tested using a combination of parametric and nonparametric statistical tests, as needed. First, normality of the data is tested using various statistical tests (e.g., D'Agostino's K-squared, Shapiro–Wilk W), depending upon sample size, prior to all further analyses. Given a dataset that fits normality, parametric statistics (e.g., Student's t-test, ANOVA) can be applied to detect significant relationships, and presentation of averages, standard deviation (SD), and standard error (SE) is sufficient to describe the population. However, given nonnormality, somewhat more involved nonparametric tests must be employed to accurately capture population dynamics (e.g., Wilcoxon signed-rank, Kolmogorov–Smirnov tests). Of note, failure to verify assumptions about the data (particularly in the study of population heterogeneity) can lead to unfortunate misinterpretations and wrongful conclusions. Statistical subpopulation analysis also employs a number of other classic and adapted methods, which are described at length in Section 3.6.3.
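As a minimal illustration of this decision logic (using Python/SciPy here, although our own analyses used R, SPSS, and Mathematica), one might script:

```python
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Pick a parametric or nonparametric two-sample test based on normality."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Shapiro-Wilk W test on each group; P > alpha is consistent with normality.
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        name, result = "Student's t-test", stats.ttest_ind(a, b)
    else:
        name, result = "Kolmogorov-Smirnov", stats.ks_2samp(a, b)
    return name, result.pvalue

# Example with synthetic, skewed single-cell speeds (illustrative values only):
rng = np.random.default_rng(0)
print(compare_groups(rng.lognormal(0.0, 0.5, 200), rng.lognormal(0.2, 0.5, 200)))
```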
3.6. Data categories

Raw or processed data can be presented in various ways, of which we discuss three: (i) averages, (ii) distributions, or (iii) "statistical subpopulations" (i.e., a variability distribution discretized by statistical techniques such as clustering). Each of these categories can then be incorporated into corresponding mathematical models.

3.6.1. Averages
We have previously incorporated average data for various phenotypic traits, for a panel of genetically related breast cancer cells, into the hybrid discrete-continuum (HDC) mathematical model for parameterization (Anderson et al., 2009). However, with the realization of the heterogeneity of cell populations, presentation of a single value (average) is often inadequate for an accurate description. Although SD or SE can sometimes be used to effectively describe the variability of normal (Gaussian) populations, skewed nonnormal datasets rich with outliers and possible subpopulations should not be described by these means. Instead, analysis of population probability distributions and subpopulations via various approaches is preferable in these instances, as described below.

3.6.2. Distributions
Obtaining single-cell measures using HCAM, in combination with rigorous statistical treatment, allows examination and analysis of large populations of cells (N > 1000) in a fairly efficient manner. Using this type of data acquisition for phenotypic traits, in lieu of population-based metrics, allows presentation of a probability distribution, which describes both the range of possible values that a random variable can attain and the probability that the value lies within any subset of that range. This category of measurement is particularly useful for representing the spread or variability (i.e., heterogeneity) of a cell population by depicting the nuances of its data (e.g., nonnormality, skewness, kurtosis, outliers), which are lost in a simple presentation of averages. By providing such data for parameterization of mathematical and computational models where appropriate, one can model the heterogeneity of populations more realistically (in line with experimentation), which may ultimately lead to important insights otherwise overlooked. A specific example of applying these techniques to single-cell motility data is detailed below in Section 4.1.

3.6.3. Statistical subpopulations
Using other statistical approaches, raw data for various phenotypic cell traits can also be processed to reveal "subpopulations" present within the greater population being examined. In order to quantify intra-cell-line variability, we discretize the continuous distribution measurements described above into "functional subpopulations," as previously described (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008).
The advantage of identifying discrete subpopulations is that they can be compared across cell lines to identify common trends in response to perturbations of interest. Specifically, methods can be employed to estimate trait subpopulations using model-fit criteria, such as the Bayesian information criterion (BIC) or gap statistics (Fraley and Raftery, 2002). BIC is an approximation of the integrated likelihood, according to Eq. (2.1):

$$2\log p(D \mid M_k) \approx 2\log p(D \mid \hat{\theta}_k, M_k) - v_k \log(n) = \mathrm{BIC}_k \qquad (2.1)$$

where v_k is the number of independent parameters to be estimated in model M_k, θ̂_k is the parameter estimate, and n is the number of points in the dataset D. This approximation has been shown to be a consistent estimator of density, even when dealing with nonparametric (Roeder and Wasserman, 1995) or noisy data. Expectation maximization (EM) is also used in statistics for finding maximum likelihood estimates (MLE) of parameters with a known number of clusters (Eliason, 1993). Using this method, model-based hierarchical agglomerative clustering is used to compute an approximate maximum for the classification likelihood, following Eq. (2.2), as previously described (Fraley and Raftery, 2002):

$$L_{cl}(\theta_1, \ldots, \theta_G;\ \ell_1, \ldots, \ell_n \mid y) = \prod_{i=1}^{n} f_{\ell_i}(y_i \mid \theta_{\ell_i}) \qquad (2.2)$$
Here, ℓ_i labels the unique classification of each observation, and θ_g is the parameter estimate for each cluster. By combining hierarchical agglomerative clustering with both EM and BIC, a robust strategy is developed. A brief outline of this algorithm (a sketch of the model-selection loop is given below) is as follows: (1) choose a maximum number of clusters; (2) perform hierarchical agglomerative clustering to estimate a classification of the data under each model, up to the selected maximum number; (3) compute the EM to determine parameters under each model; and (4) use BIC to select the most likely model of the data. Additional statistical techniques (i.e., principal components analysis (PCA) or Gaussian mixture models (GMM)) can then be applied as needed to reduce the dimension of a dataset and to find clusters of cells or subpopulations. Ultimately, these subpopulations are represented as probabilistic mixtures of stereotypes (i.e., phenotypes). As presented previously (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008), we can summarize the percentages of states within a cancer population as a "subpopulation trait profile," a simple probability vector whose entries sum to one. This analysis allows us to approximate the subpopulations (i.e., heterogeneity) that exist within a cell line population with respect to a specific cell trait.
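A sketch of this model-selection loop is shown below using scikit-learn's Gaussian mixture implementation; note that this substitutes the library's default k-means-style initialization for the model-based hierarchical agglomerative clustering described above, so it illustrates the EM and BIC steps only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm_by_bic(X, max_clusters=10, seed=0):
    """Fit GMMs with 1..max_clusters components via EM; keep the BIC-optimal fit."""
    X = np.asarray(X, float).reshape(len(X), -1)
    fits = [GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X)
            for k in range(1, max_clusters + 1)]
    best = min(fits, key=lambda g: g.bic(X))  # scikit-learn's BIC: lower is better
    labels = best.predict(X)
    # "Subpopulation trait profile": fraction of cells per cluster (sums to one).
    profile = np.bincount(labels, minlength=best.n_components) / len(labels)
    return best.n_components, profile

# Example: one trait measured on a simulated two-subpopulation mixture of cells.
rng = np.random.default_rng(1)
trait = np.concatenate([rng.normal(1.0, 0.2, 700), rng.normal(2.5, 0.4, 300)])
print(best_gmm_by_bic(trait))
```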
Further, we can also use this approach to examine whether specific microenvironmental perturbations (e.g., hypoxia, drug treatment) influence or induce apparent patterns of heterogeneity in cancer cell populations. This particular approach can be invaluable for explaining shifts of cell populations or, more interestingly, cell subpopulations, which is quickly becoming a major field of study in cancer research. These analyses have all been used previously, in different combinations, for teasing apart cell subtypes based on various parameters (Loo et al., 2007; Perlman et al., 2004; Slack et al., 2008). In summary, this is just one approach for numerically describing the heterogeneity of cell populations, particularly highlighting outliers, based on any number of relevant traits of interest. Much as flow cytometry separates a cell population in suspension into various subpopulations based on certain assignments (e.g., fluorescent marker, cell size), the coupling of HCAM and rigorous statistical tests can provide a means for separating or grouping live cells dynamically over time in image-based studies. A specific example of applying these techniques to single-cell proliferation data is detailed below in Section 4.2.
4. Application to Traits Relevant to Cancer Progression

As described above, various experimental and computational tools are being developed to investigate a number of applications relevant to the study of QCT in cancer progression. In this chapter, we have focused on the measurement and analysis of two specific phenotypic traits (QCT) of single cells, motility and proliferation, both of which are hallmarks of cancer (Hanahan and Weinberg, 2000). It is well established that both traits are aberrantly regulated during disease progression, and it is probable that intervention strategies targeting these processes may be useful in clinical treatment. The following sections briefly expand upon the clinical importance of each trait, our chosen methodology for the various analyses of each, and the implications of conducting such studies.
4.1. Cell motility

Cell motility plays an essential role in many biological systems, but precise quantitative knowledge of the biophysical processes involved in cell migration is limited. It is well established that migration of both epithelial and transformed cancer cells is a complex and dynamic process, which involves changes in cell size, shape, and overall movement (Friedl and Wolf, 2003). Therefore, one can characterize cell motility by quantifying several metrics. This provides opportunities to improve the predictive accuracy of computational and mathematical models by incorporating more numerical parameters. Herein, we present a method for assessing single-cell motility, combining experimental, statistical, and computational tools, and apply it to the analysis of the dynamics of "unbiased" single-cell migration in vitro (i.e., undirected, or without addition of a chemoattractant).
This pipeline for analysis was designed with the intention of examining large numbers of heterogeneous cancer cell populations (i.e., cell lines in vitro). The method improves upon classic methods for studying migration (e.g., the Boyden chamber) because it captures the single-cell dynamics underlying the heterogeneity of cancer cell populations.

4.1.1. Single-cell motility analysis: Image acquisition and validation

We established protocols for both manual (Harris et al., 2008) and automated (custom-written algorithms) cell tracking of single-cell motility. Manual cell tracking is standard practice (Harris et al., 2008) and is not covered in this chapter. Although facilitated by several software packages and image analysis tools (Section 3.3), it is laborious and time-consuming (Harris et al., 2008) and limits throughput. In the context of HCAM and high-throughput studies, it still has a critical function in validating automated analyses. Automated high-throughput cell tracking (thousands of cells) presents significant challenges, discussed in the following. Due to the low signal-to-noise ratio (low contrast between cell and background), automated cell tracking of digital bright-field or phase-contrast microscopic images is often impractical and error-prone. Fluorescence-based imaging has far superior signal-to-noise ratios, and the resultant images significantly simplify the process of automated tracking. Therefore, for high-throughput studies in our laboratory (using the BD Pathway 855), cells are labeled with a nuclear protein (histone H2B) conjugated to monomeric red fluorescent protein (H2BmRFP; Addgene plasmid 18982) to enable identification of the nuclei of individual cells. This protein has been used by many groups for imaging purposes, and to date no significant alterations in cellular function due to its expression have been described. The most efficient method for obtaining pooled populations of cells with stable expression of a transgene is retroviral-mediated transduction, although any method that produces similar results may be employed. Cells should be flow-sorted to minimize the number of nonexpressing cells within the populations. Numerous protocols exist for these procedures and will not be covered here further. Once a stable cell line is established in which the fluorescent protein is expressed, various parameters must be compared to the parental strain to ensure no obvious clonal selection has occurred. HCAM assays can then be carried out as follows: (1) Cells are seeded into 96-well microplates (~2000 cells per well), allowed to adhere for 1 h in the temperature-controlled (37 °C), CO2-controlled chamber of the BD Pathway 855 machine, and washed to remove nonadherent cells from wells prior to tracking.
(2) Fluorescent images are then automatically obtained at predetermined intervals for a given period of time (5 min intervals, for 4 h), controlled by BD Attovision software. Based on the information presented in Table 2.1, imaging parameters are set to enable the automatic tracking of as many cell types and conditions as possible. For the BD Pathway 855, the optimized settings for automated tracking of H2B-labeled MCF10A cells are listed in Table 2.2. Using these image settings, 240 TIFF images (1.3 MB each) per well per experiment are generated (approximately 35,000 images comprising 50 GB of storage space per experiment). These images are exported and stored using the various data management strategies previously described in Section 3.2 (OME, ACCRE).

4.1.2. Single-cell motility: Image processing

We are developing custom-written algorithms for automated assessment of single-cell motility. These tools are designed to integrate with a number of programs/applications, including MATLAB, CellProfiler, and ImageJ. They also interface with OME and cluster computing (e.g., ACCRE at Vanderbilt). A motility software module we designed (named WG1) imports raw images, thresholds them to obtain binary images, segments the binary blobs into objects (i.e., cells, nuclei), calculates centroid values, and assembles them into a matrix that is sent to an external tracking algorithm (for bright-field images, an external tensor voting algorithm can also be used to infer missing edges prior to segmentation). Tracks obtained from the external algorithm are then saved and can be used for processing by other modules (described in other sections). Optionally, WG1 can also be used to overlay the detected single-cell outlines and tracks on the original cell images and save the new images to disk. The resulting image stacks can be visually inspected for quality control.
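WG1 itself is an in-house module, but its core threshold/segment/centroid step can be sketched with the freely available scikit-image library (our substitution; the file path, threshold choice, and minimum-area filter are illustrative assumptions):

```python
import numpy as np
from skimage import io, filters, measure

def nuclei_centroids(image_path, min_area=30):
    """Threshold one fluorescence frame and return nuclear centroids for tracking."""
    img = io.imread(image_path)                 # single H2B-mRFP frame (assumed)
    binary = img > filters.threshold_otsu(img)  # global Otsu threshold
    labels = measure.label(binary)              # segment binary blobs into objects
    # Keep objects above a minimum area to suppress noise; return (row, col) centroids.
    return np.array([p.centroid
                     for p in measure.regionprops(labels) if p.area >= min_area])

# One centroid matrix per frame would then be passed to an external tracking
# algorithm, as described for WG1.
```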
Table 2.2 BD Pathway 855 settings for H2BmRFP1 fluorescence imaging

- Approximate time interval between images of the same well: 5–6 min
- 0.25 s exposure, 2 × 2 camera binning
- 20× objective, 1 × 2 montaging
- Single-channel illumination (555/28 nm excitation, 600/30 nm emission) in a single focal plane through the spinning disk
- 36–40 wells (e.g., 6 technical replicates, 2 microenvironments, 3 cell types), back-and-forth well scanning, no delays (after the last well, immediately return to the first)
- 48–96 h total imaging duration
4.1.3. Single-cell motility: Cellular parameter extraction

Once individual cells (nuclei) have been identified and tracked by either a manual or computer-driven method, a number of both classic and novel motility-related parameters for each cell and/or population can be extracted (Table 2.3). Some metrics apply at the population level (P), some at the single-cell level (S), and others at both levels. Some of these parameters are described in the following sections.

4.1.3.1. Classic single-cell and population metrics

Speed (S, P): Cell speed is thought to correlate with cancer invasion (Wells, 2006). There are several previous investigations of single-cell speed (undirected) or velocity (directed) for cancer cell lines in various microenvironments (Anderson et al., 2009; Hofmann-Wellenhof et al., 1995; Jiao et al., 2008).
Table 2.3 List of cell motility measurements

Speed (S, P): Describes average single-cell or population-based movement according to x, y (and z) coordinates from cell tracking (μm/min).
Persistence time (P): Combination of persistence in direction and speed (min).
Motion fraction (P): Percentage of motile cells across a time-lapse movie (image stack) within a population (%).
Turn-angle distribution (S, P): Tracking x, y (and z) coordinates are used to calculate cell trajectories.
Surface area (S, P): Measurement of cell size (in pixels) (Harris et al., 2008).
Speed fluctuation (S): 95% confidence interval of the standard deviation of speed for a single cell over time (image stack).
Step-length (S, P): The distance a cell moves between pauses divided by the number of total steps (μm).
Instantaneous motion fraction (P): Percentage of cells motile at any given time (image) within a total population (%).
Dynamic expansion and contraction cell activity (S, P): Measurement that represents the overall change in cell area and motion over time (Harris et al., 2008).
Single-cell speed obtained from time-lapse image stacks is automatically calculated using the x, y (and, in three-dimensional studies, z) coordinates obtained from tracking centroids (calculated from cell nuclei outlines) using MATLAB algorithms. We have previously examined cells for time periods ranging from just a few minutes to 24 h, at various time resolutions (30 s to 10 min intervals). It should be noted that experiments should be optimized (e.g., cell type, matrix, surface), as this can contribute to metric accuracy. We have examined single-cell speeds using frequency histograms and scatter plots overlaid with box-and-whisker plots containing statistics (Fig. 2.2A), particularly to highlight the heterogeneity of a dataset and other trends in the data (e.g., skewness, kurtosis).

Persistence time (P): Persistence time (min) is one of the most common measures of cell motility (Dunn and Brown, 1987). This measure assumes cell motion is a persistent random walk (PRW), and combines persistence in direction and speed in its calculation. The PRW model can be derived from the Langevin equation (Eq. (2.3)):

$$m\mathbf{a} = m\frac{d\mathbf{v}}{dt} = \underbrace{F(\mathbf{x})}_{\text{force}} - \underbrace{\beta\mathbf{v}}_{\text{drag}} + \underbrace{\boldsymbol{\eta}(t)}_{\text{noise}} \qquad (2.3)$$

This is a stochastic differential equation describing Brownian motion in a potential, resulting in the Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein, 1930), where m is the mass of the particle, v is the velocity vector, x is the position, t is time, β is the coefficient of friction, and η represents noise of mean zero. An expectation of the model is described by the Fürth equation (Eq. (2.4)) (Fürth, 1920):

$$\langle d^2 \rangle = n_d \langle S^2 \rangle P t \left[ 1 - \frac{P}{t}\left(1 - e^{-t/P}\right) \right] \qquad (2.4)$$

This equation describes the expected mean-squared displacement over time, where d represents displacement, n_d is the number of dimensions, S is speed, P is persistence time, and t is time. Motion is initially ballistic (directed), transitioning in time to super-diffusive, and finally to diffusive. The persistence time is the descriptive parameter of the break point in this transition (Codling et al., 2008). Thus, to accurately calculate persistence time, one must observe cells for a long enough time interval for them to transition to the diffusive regime (roughly 3 h for a 10 min persistence time). We have previously calculated persistence times by both the traditional Dunn method (Dunn and Brown, 1987) and the updated Kipper method (Kipper et al., 2007), which reduces the standard error of the fit by approximately 50%, and which is shown in Eq. (2.5), where x is an estimate of the normalized mean-squared displacement:
Figure 2.2 Classic motility-based metrics. (A) Single-cell measurements of speed (μm/min) can be effectively presented in frequency histograms (left), whereby the raw average speed calculated for each cell over time (here 4 h, at 5 min intervals) is represented in columns divided by bins (gray). P values represent whether data are distributed normally (P > 0.05) or nonnormally (P ≤ 0.05) according to a Shapiro–Wilk test (the black curve indicates the theoretical "normal" fit for each data range shown). Alternatively, data can be presented in scatter plots (right; representing individual cells) overlaid with box-and-whiskers (representing statistics of the population). Both of these graphical methods are particularly useful for presentation of datasets that are skewed and rich with variability or outliers (i.e., heterogeneity). Here, we show MCF10A, AT1, and CA1d cell lines in normal culture conditions. (B) Persistence time (min) represents the combination of a cell population's persistence in both direction and speed. Plots include analysis of persistence time according to the Kipper method, whereby a cell population's breaking point between the ballistic and diffusive regimes is quantified (Pt shown for each). Here again, MCF10A, AT1, and CA1d cell lines in normal culture conditions are shown. (C) Motile cell fraction captures the percentage of cells moving out of an entire tracked population of cells. Again, MCF10A, AT1, and CA1d cell lines in normal culture conditions are presented.
$$\langle x \rangle(t) = P t \left[ 1 - \frac{P}{t}\left(1 - e^{-t/P}\right) \right] \qquad (2.5)$$
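In practice, P can be estimated by fitting the expected mean-squared displacement to measured data; the sketch below (with synthetic data and our own variable names) uses SciPy's curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def furth_msd(t, S2, P, nd=2):
    """Furth expectation of mean-squared displacement for a PRW (Eq. (2.4))."""
    return nd * S2 * P * t * (1.0 - (P / t) * (1.0 - np.exp(-t / P)))

# Lag times (min) and synthetic MSD values (um^2) standing in for measured data.
t = np.arange(5.0, 185.0, 5.0)
rng = np.random.default_rng(2)
msd = furth_msd(t, S2=1.2, P=7.4) * (1 + 0.05 * rng.standard_normal(t.size))

# Fit <S^2> and P (nd is held fixed at 2 dimensions by the initial guess p0).
(S2_hat, P_hat), _ = curve_fit(furth_msd, t, msd, p0=(1.0, 5.0))
print(f"estimated <S^2> = {S2_hat:.2f} um^2/min^2, P = {P_hat:.2f} min")
```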
Kipper also provides a full treatment of systematic errors in the measurement of persistence time. Examples of graphs of mean-squared displacement versus time are shown for three cell lines in Fig. 2.2B. We have yet to determine a steadfast trend for persistence time in the cell lines we have examined (data not published), as no obvious correlations have emerged (possibly due to the heterogeneity of the populations); however, we have determined that this metric can shift dramatically upon changing the cells' microenvironment, which is consistent with previous literature (Kim et al., 2008).

Motion fraction (P): The motile cell fraction is the percentage of motile cells within a given population, as previously described (Kim et al., 2008). In a number of previous studies, we have determined that for many cell line populations, the majority of cells are nonmotile throughout an entire assay. Interestingly, it seems that, as a trend, a small subpopulation of cells is highly motile, up to an order of magnitude greater in measurement (Fig. 2.2C).

Turn-angle distribution (S, P): This metric has classically been applied to the analysis of bacterial motility (Berg and Brown, 1972; Duffy and Ford, 1997). Recently, we analyzed turn-angle distributions of epithelial and cancer cell lines (Potdar et al., 2009). Individual cell trajectories are tracked and turn-angle values taken from each. This method is subject to systematic measurement error unless appropriate sampling intervals and high-resolution images are selected. Consider a model system where speed is chosen from an exponential distribution and turn-angle is chosen from a Von Mises (circular-normal) distribution (Eq. (2.6)),¹ where r and θ are polar coordinates, λ and κ are shape parameters, and I_0 is the modified Bessel function of the first kind:

$$f(r, \theta \mid \kappa, \lambda) = \underbrace{\frac{e^{\kappa \cos\theta}}{2\pi I_0(\kappa)}}_{\text{Von Mises}} \; \underbrace{\lambda^2 e^{-\lambda r}}_{\text{Exponential}} \qquad (2.6)$$
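Turn angles themselves are computed from the tracked centroid coordinates; a minimal NumPy sketch (the tracking-error cutoff is illustrative) is:

```python
import numpy as np

def turn_angles(xy, min_step=1.0):
    """Turn angles (radians) from one cell's tracked centroid coordinates.

    xy       : (n_frames, 2) array of x, y positions
    min_step : displacements below this tracking-error cutoff are discarded
    """
    steps = np.diff(np.asarray(xy, float), axis=0)
    steps = steps[np.hypot(steps[:, 0], steps[:, 1]) >= min_step]
    headings = np.arctan2(steps[:, 1], steps[:, 0])
    dtheta = np.diff(headings)
    # Wrap heading changes into [-pi, pi).
    return (dtheta + np.pi) % (2 * np.pi) - np.pi

# Pooling turn angles over many cells and histogramming them (e.g., 37 bins, as
# in Fig. 2.3C) yields the empirical turn-angle distribution.
```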
Figure 2.3A shows the resulting distribution (λ = κ = 1), where the peak is the location of a cell, the positive x-axis represents a turn angle equal to 0, and the grid represents the observable pixels. λ is a function of the mean cell speed, the observation interval, and the pixel width, and is the principal factor set by the experimental configuration. Aliasing occurs in the measurement because each x, y pair on the grid represents the observable angles (see Fig. 2.3B for an example of measurement error for Brownian motion).
¹ The extra λ normalization term is due to the polar form of the Jacobian.
Figure 2.3 Turn-angle (distribution) analysis. Turn-angle represents the trajectory of single cells during a time-lapse movie. (A) Von Mises/exponential polar distribution (λ = κ = 1), where the peak is the location of a cell, the height represents the probability of the cell's location in the next observation frame, the x-axis represents the turn angle, and the grid represents the observable pixels. (B) An example of measurement error calculated for pure Brownian motion. The dotted line is the flat turn-angle distribution and the solid line is the measured distribution. (C) The resulting error from the Von Mises/exponential model with λ = 0.5 and 37 bins observed. The difference between observed and actual is the shaded region between the two curves. (D) Effects of total measurement error by λ on the x-axis and bin size by the three curves (20°, 10°, 5°). Note that this does not include the potential loss for a sample interval around or above the persistence time. Total measurement error is quantified using the equation presented in the text.
Increasing pixel resolution reduces this error. The best sampling interval is a trade-off between being too short, whereby a cell does not move far along the grid (increasing aliasing), and being too long, greater than the persistence time, whereby the cell's observable motion is diffusive. Figure 2.3C shows the resulting error from the Von Mises/exponential model with λ = 0.5 and 37 bins. This is computed by integrating the density in each pixel and sum-binning the density of the measurable angle of each coordinate.
This is a correctable error, and the observed bins can be corrected by this ratio. Total measurement error (TME) is quantified by Eq. (2.7), where θ_m is the measured angle, θ_a is the true angle, and n is the number of bins:

$$\mathrm{TME} = \sqrt{\frac{\sum (\theta_m - \theta_a)^2}{n}} \qquad (2.7)$$

Figure 2.3D is a graph showing the effect of TME by λ on the x-axis and bin size by the three curves. Note that this does not include the potential loss for a sample interval around or above the persistence time. Further, it is important to note that quantifying these metrics based on cell centroids, as opposed to by pixel, improves the accuracy of the data significantly.

Surface area (S, P): Surface area is commonly used in image processing (Alexopoulos et al., 2002; Carpenter et al., 2006), often as an indicator of differentiation, apoptosis, and other biological processes (Mukherjee et al., 2004; Ray and Acton, 2005). This metric simply quantifies overall cell size (in pixels). In previous studies (Harris et al., 2008), we designed custom-written MATLAB algorithms to obtain single-cell surface area measurements of cancer cells. As with single-cell speed, this metric can be represented at both the individual and population (average) levels. Overall cell size can be assessed from bright-field or fluorescence images, and subcellular compartments (e.g., nuclei, mitochondria) can also be measured given appropriate use of markers.

4.1.3.2. Novel single-cell and population metrics

One of the main assumptions of the PRW model is that cells are always in motion. However, we have determined that cells do not necessarily meet this criterion, and instead typically pause frequently as they migrate. In order to refine the model to incorporate this idea, we have developed a few novel metrics, each described below in detail, that quantitate this phenomenon in various ways.

Speed fluctuation (S): Individual cells do not typically maintain constant speed during the course of a time-lapse movie. Instead, their activity is often composed of frames of fast movement, slower movement, and no movement. We have implemented a metric to capture this behavior, termed speed fluctuation (Fig. 2.4A). For non-Gaussian datasets, this metric is calculated using bootstrapping to obtain the range of the 95% confidence interval (CI) of the SD of cell speed for each individual cell in a population. In summary, a number of our previous studies have determined that single-cell speed over time is largely variable and that cells within a population exhibit large amounts of heterogeneity in terms of fluctuation (unpublished data).
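A sketch of the bootstrap computation for a single cell (the number of resamples is our choice, not a prescribed value) is:

```python
import numpy as np

def speed_fluctuation(speeds, n_boot=2000, seed=0):
    """Bootstrap 95% CI of the SD of a single cell's frame-to-frame speeds."""
    rng = np.random.default_rng(seed)
    speeds = np.asarray(speeds, float)
    boot_sds = np.array([
        rng.choice(speeds, size=speeds.size, replace=True).std(ddof=1)
        for _ in range(n_boot)
    ])
    return np.percentile(boot_sds, [2.5, 97.5])

# Example: a cell alternating fast and slow frames yields a wide interval.
speeds = np.r_[np.full(24, 0.2), np.full(24, 1.8)]  # um/min, illustrative
print(speed_fluctuation(speeds))
```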
Speed fluctuation
B
Step identification
MCF10A
2.0 1.0
Single-cell steps
Speed (mm/min)
4.0
3.0
Dynamic expansion and contraction of cell activity (DECCA)
CA1d
4.0
Speed (mm/min)
D
Threshold for error= 1 mm
3.0 2.0 1.0 0.0
0.0 4 8 12 16 20 24 28 32 36 40 44 48
4 8 12 16 20 24 28 32 36 40 44 48
Frame
Frame
Phase contrast
A
2.0 90 1.0
N = 50
N = 50
N = 50
80
Differential
100
20 40 60 80 100120140160 180200 220 20 40 60 80 100 120 140 160 180 200 220
20 40 60 80 100 120 140 160 180 200 220
20 40 60 80 100120140160 180200 220
20 40 60 80 100120140160 180200 220
Frame
CA1d 4.0 3.0 2.0
70 60
DECCA
4 8 12 16 20 24 28 32 36 40 44 48
Percent motile cells
Speed (mm/min)
Threshold for movement = 1 pixel
3.0
0.0
Speed (mm/min)
Instantaneous motion fraction
C
20 40 60 80 100 120 140 160 180 200 220 20 40 60 80 100 120140 160 180200 220
AT1 4.0
20 40 60 80 100 120 140 160 180 200 220
50 40 30 20
1.0
0 0 0 0 0 0 0 0 0 0 0 20 40 60 80 100120140160180200220
10
0.0
20 40 60 80 100 120 140 160 180 200 220
2500 2000 1500 1000 500 0 −500 −1000 −1500 −2000 −2500
20 40 60 80 100120140160180200220
Time
0 4 8 12 16 20 24 28 32 36 40 44 48
Frame
MCF10A
AT1
CA1d
Cell line
Figure 2.4 Novel motility-based metrics. (A) Plots show speed fluctuation of randomly chosen single-cells (here, MCF10A, AT, and CA1d in normal culture conditions). As cells do not typically maintain constant speed across time, this metric is an effective way to capture fluctuations in speed. For nonnormal datasets, this metric is calculated by bootstrapping to obtain the range of 95% confidence intervals of the standard deviation of a population. (B) Plot shows the steps taken by an randomly chosen individual CA1d cell (cell steps are represented by red dashes). Cell steplength is the sum of the displacement in a step. Cell step-lengths can also be analyzed at the population-level to obtain the best-fit distribution. (C) Instantaneous motion fraction represents the percentage of cells moving (threshold for movement >1 mm) at any given time during a timelapse movie (here, 4 h, with 5 min intervals). (D) Dynamic expansion and contraction of cell activity (DECCA) values can be calculated for single cells by thresholding phase-contrast images to generate differential images that capture different types of cell movement (expansion vs. contraction) using a heat-scale (red/yellow, positive change; blue, negative change; green, no change), which are further converted to DECCA-specific images that are used for direct quantification of this metric, as previously described (Harris et al., 2008).
Further, we have also found that distinct cell lines exhibit contrasting trends in fluctuation (some remaining fairly constant, others fluctuating dramatically) and that introduction of various microenvironmental conditions can cause dampening or increases in fluctuation for cells (unpublished data). For normally distributed data, presentation of SD or the interquartile range can convey a similar metric.

Step-length (S, P): To accurately add cell pausing into migration models, it is necessary to experimentally determine the distance a cell travels between consecutive pauses. Step-length, flight length, and flight time are three metrics that are used in ecology to study foraging behavior of birds, bees, and mammals (Gautestad and Mysterud, 2005; Viswanathan et al., 1999). The term step-length has also been used to describe the movement of molecular motors on polymers (Wallin et al., 2007). All three terms are used to quantitate distance or time between pauses in motion, but to our knowledge, this metric has not been used previously to quantify the motion of epithelial cells. To obtain step-length, we measured the overall distance traveled between cell pauses in a time-lapse movie using x, y coordinates obtained from cell tracking (a pause being defined by two consecutive frames at the same coordinate) and discarded all step-lengths below our tracking error threshold (lengths < 1 μm). Sample step-lengths are shown in Fig. 2.4B. Interestingly, we observe that, just as single-cell speed fluctuates across time and within a population, cell step-length is also highly variable both within and across cell lines.

Instantaneous motion fraction (IMF; P): Persistence and diffusion coefficients are often used to describe cellular motion. However, both of these representations make a number of assumptions about cellular behavior. In particular, they assume all cells are in motion at all times. The IMF was developed to test this assumption, and to provide an additional metric to monitor differences in migration characteristics between cell lines and in various conditions. It measures the percentage of motile cells (those moving more than 1 pixel, our measurement error threshold) within a given population at any given time (frame) of a time-lapse movie. In contrast to the motile cell fraction metric, which shows the percentage of cells that are "successful" in their migration, this metric represents the fraction of cells "attempting" to move. Figure 2.4C shows an example of applying this metric to the MCF10A, AT1, and CA1d cell lines in normal tissue culture conditions; quite clearly these cell lines exhibit heterogeneous expression of motility at any given moment (at 5 min intervals).

Dynamic expansion and contraction of cell activity (DECCA) (S, P): Kymography is one method used to gain insight into the specific mechanisms of cell movement by studying morphological changes in shape and size (Bryce et al., 2005; Cai et al., 2007). However, kymography is applied to relatively small sample sizes (due to the highly magnified images required) over relatively short periods of time (Bear et al., 2002; Cai et al., 2007). We have developed a novel metric, termed DECCA, which represents the overall change in cell area and motion over time (Harris et al., 2008).
We previously developed this metric to quantify the difference between a completely nonmotile cell (velocity = 0) and a nonmotile cell (also with a velocity = 0, and of the exact same size) that ruffles its lamellipodia, a classic behavior of cancer cells during migration. Figure 2.4D includes a sample of how this metric captures dynamic behavior, adapted from our previous work (Harris et al., 2008). Time-lapse microscopy images of cell motility can be used to extract all or some of the metrics described above, which can subsequently be used to generate computational simulations (Windrose plots) that combine the various parameters into a single visual depiction of motility. Sample simulations for each of the cell lines presented above in normal tissue culture conditions can be viewed at http://vicbc.vanderbilt.edu/itumor/cell.

4.1.3.3. Statistical subpopulations of motile cells
Each motility metric demonstrates heterogeneity in a cell population and can be used to investigate relevant differences between normal and cancer cells. However, the reason for using many motility metrics is that each metric by itself is insufficient for defining statistical subpopulations. Defining statistical subpopulations facilitates examining relationships between distinct QCT (e.g., defining how proliferation subpopulations relate to motility subpopulations within a cell population). In the case of motility, the cluster analysis methods of BIC and EM (as described in Section 3.5, and applied to proliferation QCT in Section 4.2.4) are applicable, as long as multiple parameters are combined.
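Before moving on, a sketch of the step-length extraction described above may be useful. The Python function below splits a centroid track into steps at pauses (taken literally, as in the text, to be two consecutive frames at the same coordinate) and discards steps below the 1 μm tracking-error threshold; the function name and array layout are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def step_lengths(track, error_um=1.0):
    """Split a centroid track (n x 2 array of x, y positions, one row per
    frame) into step-lengths separated by pauses.  A pause is two
    consecutive frames at the same coordinate; steps shorter than the
    tracking-error threshold are discarded."""
    track = np.asarray(track, dtype=float)
    disp = np.linalg.norm(np.diff(track, axis=0), axis=1)  # per-frame displacement
    steps, current = [], 0.0
    for d in disp:
        if d == 0.0:                 # pause: close out the current step
            if current > error_um:
                steps.append(current)
            current = 0.0
        else:
            current += d             # accumulate displacement within a step
    if current > error_um:           # flush the final, possibly unfinished step
        steps.append(current)
    return steps
```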
4.2. Cell proliferation
Typical studies of proliferation in cultured cell lines involve counting cells (either directly or indirectly) in a population over time. These results are usually presented as a population doubling time (DT) calculated from the number of cells identified at various intervals or as a percentage of the population in each phase of the cell cycle (G1, S, or G2/M) at a given point in time (usually using flow cytometry). These population-level assays are generally limited by the fact that, as endpoint assays, they require large numbers of samples to provide accurate information. This limitation is alleviated by continual monitoring/sampling of cells within a population over time. Nonadherent cells can be sampled with relative ease without disrupting their normal culture. However, for adherent epithelial cell lines, this requires microscopic visualization. Time-lapse transmitted-light microscopy has been used for decades for continual visualization of cells over many days. However, due to the low signal-to-noise ratio (low contrast) between cells and background, previously described in the motility application above (Section 4.1.1), automated cell counting of digital light microscopic images remains a challenge. Therefore, we have moved to fluorescence-based imaging to facilitate automated tracking.
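The DT calculation used throughout this section (linear regression of the natural log of cell counts against time, with DT = ln 2/slope, as in Fig. 2.5A) can be sketched as follows; the function name and the synthetic counts are illustrative assumptions only.

```python
import numpy as np

def population_doubling_time(hours, counts):
    """Population DT: fit ln(count) vs. time by linear regression;
    DT = ln(2) / slope."""
    slope, _intercept = np.polyfit(np.asarray(hours, float),
                                   np.log(np.asarray(counts, float)), 1)
    return np.log(2.0) / slope

# Example: counts every 4 h over 48 h for an ideal exponentially growing culture
hours = np.arange(0, 52, 4)
counts = 500 * 2 ** (hours / 18.0)               # built with an 18-h doubling time
print(population_doubling_time(hours, counts))   # ~18.0
```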
4.2.1. Validation of H2BmRFP-labeled cells
As for motility studies, we utilize flow-sorted cells with stable expression of histone H2BmRFP for proliferation studies. Prior to examination of cells at the single-cell level, it is important to ensure no obvious clonal selection has occurred during the generation of the modified cells. To do this, the resultant population must be compared to the parental cell line. This procedure is easily accomplished using HCAM and comparing to other population-level assays (manual counting being the gold standard). By imaging the cells every 1–4 h and using automatic segmentation algorithms to quantify cell numbers, population doubling times can be calculated by simple linear regression of the natural log of the number of cells in each image. An example of the verification of the similarity of H2B-labeled cells with parental cells is shown in Fig. 2.5A.

4.2.2. Single-cell proliferation rates: Image acquisition
Once the population-level proliferation rates have been validated for a particular fluorescent protein-labeled cell line, further investigation of proliferation metrics at the single-cell level can proceed. Based on the information presented in Table 2.1, imaging parameters are set to enable the automatic tracking of as many cell types and conditions as possible. The optimized settings for automatic tracking of H2B-labeled MCF10A cells with the BD Pathway 855 imager are listed in Table 2.2. Using these imaging settings, approximately 240 TIFF images (1.3 MB each) per well per day are generated: approximately 35,000 images comprising 50 GB of storage space per 96-h experiment.

4.2.3. Single-cell proliferation rates: Image processing and parameter extraction
The automated analysis of HCAM-generated images can be used to determine IMT (time between mitotic events) of individual cells within a cell population if image acquisition is sufficiently frequent to allow for automatic tracking of cells over time (6–12 frames/h). In addition, the tracking algorithm described for motility has been modified to include the ability to detect mitotic events and associate resultant progeny with their parental cell. The first software module is the same as used for motility (WG1). The output of this module is a set of MATLAB label matrices and a list of cell centroids at each time step, which can be used for processing by two other modules for obtaining additional proliferation metrics. The second module (WG2) uses the track ID and shape parameters from the label matrices to extract parameters. To determine cell division events, this algorithm identifies tracked cell IDs that were not present in a previous frame of a time-lapse movie. True mitotic events must be separated from cells entering the frame, from cells that were moving too fast and were lost
[Figure 2.5 appears here. Panels: (A) ln(# cells) vs. time (h) for manual counting (DT = 18.3 h; y = 0.0378x + 10.045, R² = 0.98041) and automated counting (DT = 17.7 h; y = 0.039x + 5.0109, R² = 0.99465); (B) density distributions of IMT (h) and GR (h⁻¹) for all generations (N = 3864) and generation 1 only (N = 2243).]
Figure 2.5 Representative graphs of proliferation data. (A) Validation of cell lines for HCAM studies. Population doubling times of AT1 cells or AT1 cells modified to stably express H2BmRFP were determined by manual counting (AT1, left) or automated cell counting (AT1-H2BmRFP, right). The population DT is calculated by dividing the natural log of 2 by the slope of the curve fit by linear regression and is indicated within each graph. (B) Distributions of single-cell IMT and GR. IMT and GR of individual AT1-H2BmRFP cells cultured under standard conditions were determined using time-lapse HCAM as described in the text. The distribution of IMT has a long rightward tail (left). When the data are transformed to GR, the resultant distribution demonstrates a more normal shape (middle). When only a single generation is plotted, the bias toward larger GR (shorter IMT) is reduced, thereby increasing the relative abundance of the smaller GR (longer IMT) (right).
by the tracking toolbox, and from cells whose fluorescence intensity fluctuates above and below the foreground intensity threshold, causing them to disappear from some frames and suddenly reappear in others. We use the collapse in size of the cell nuclei and the proximity of the nuclei in anaphase as markers of a true mitotic event. Filters in the algorithm reject new cells that are too far from other cells or have too great an area as possible mitotic events. An additional filter checks the size of possible parents and compares it with the size of the presumptive daughter cells. If the size ratio of the parent area to the areas of possible daughter cells is too small, the event is rejected. Finally, if the sizes of the two possible daughter nuclei are too dissimilar, the event is also rejected. After the mitotic events are detected, new IDs are assigned to the daughter cells and each cell receives a parent ID. Cells that have entered the frame and cells that were present at the beginning of the movie receive a parent ID equal to zero. In the last module (WG3), proliferation information, as well as centroid position and shape parameters (e.g., area, eccentricity), are saved to a set of comma-separated text files. In addition, images are generated with the detected nuclei boundaries (or cytoplasm in bright-field images) color-coded based on generation number and cell ID, overlaid onto the original image, and saved as JPEG files to facilitate manual validation of the automatic segmentation and tracking.

4.2.4. Single-cell proliferation rates: Statistical analyses
4.2.4.1. Single-cell IMT and generation rate
Single-cell IMTs define the duration of each individual cell lifetime or cell cycle. The generation rate (GR) is calculated as ln(2)/IMT and is used instead of IMT, since its distribution has been shown to be normal (Gaussian) in several noncancerous cell lines. However, the distribution of GR of all cells in a population is overrepresented by the faster-dividing cells, which generates leptokurtic (tall and narrow) distributions (Sisken and Morasca, 1965). To reduce this bias, only a single generation is analyzed. An example of the distribution of IMT and GR from multiple generations or a single generation is demonstrated in Fig. 2.5B. It is important to compare the single-cell GR with population-level metrics (i.e., population DT), since population-level data are comprised of the single-cell metrics. For example, under conditions where the population proliferation rate is nonlinear, calculation of a population DT is inappropriate, as it changes over time (Fig. 2.6A, Condition 2), whereas the population DT is calculated as 16.91 h under normal culture conditions (Fig. 2.6A, Condition 1), corresponding to a GR of 0.041 h⁻¹ (the slope of the line). The population-level proliferation curve in Condition 2 suggests an increasing IMT of the cells over time. Linear regression of data plotted with cell birth time on the x-axis and single-cell IMT on the y-axis provides a tool to examine whether the IMT is time dependent. The horizontal line in
Fig. 2.6C, Condition 1, indicates no correlation between birth time and IMT, whereas there is a clear positive correlation of IMT with birth time in Condition 2, indicating that cell cycle times are increasing over the course of the experiment. This type of analysis is not limited to birth time and, therefore, provides a useful general approach for detecting parameter interdependencies.

4.2.4.2. Progeny tree (clonal subpopulation) generation rates
The image processing algorithms described above in Section 4.2.3 provide a method to link each individual cell's data to its parent and progeny to generate a family (progeny) tree of dependent data. Each progeny tree represents a clonal population with unknown relatedness to other progeny trees, such that progeny trees may be related to varying degrees or unrelated.
[Figure 2.6 appears here. Panels: (A) population doubling time, ln(total cell #) vs. time (h), for AT1 (0/0) (DT = 16.91 h; y = 0.041x + 6.721, R² = 0.99393) and AT1 (S/S); (B) individual-cell GR (h⁻¹) densities, generation 1 only, for AT1 (S/S) (n = 2243) and AT1 (0/0) (n = 528); (C) IMT vs. birth time distributions, generation 1 only, for both conditions; (D) progeny tree GR densities for AT1 (S/S) (n = 1350) and AT1 (0/0) (n = 1317).]
Figure 2.6 Graphical representation of proliferation metrics. AT1 cells were cultured in standard culture conditions (Condition 1, left column) or under growth factor-restricted conditions (Condition 2, right column) and subjected to time-lapse HCAM. (A) Population DT was determined as described above, using a larger number of cells and more frequent image acquisition (every 1 h). Proliferation in Condition 1 demonstrates the typical exponential (log-linear) division rate, whereas proliferation in Condition 2 is clearly not log-linear. (B) BIC analysis of the distribution plots of individual-cell GR from generation 1 in Condition 1 indicates the presence of two subpopulations with mean values of 0.025 and 0.05 h⁻¹ (indicated by vertical dashed lines). The estimated density of a mixed Gaussian using the EM method (described in Section 3.6) is indicated by the curve overlaying the histogram. In Condition 2, BIC analysis indicates two subpopulations with different densities and mean values than in Condition 1. (C) To examine whether the IMT of cells is similar throughout the experiment, the IMT of cells are plotted according to their birth time during the experiment. The nearly horizontal linear regression indicates that the IMT of cells in Condition 1 is not increasing significantly over the course of the experiment, whereas the IMT of cells in Condition 2 is increasing. (D) The density histograms of progeny tree GR comprise a single population (by BIC analysis) in Condition 1 but are clearly distinguished into two subpopulations in Condition 2.
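The BIC/EM subpopulation calls referenced in this caption can be reproduced in outline with a Gaussian mixture fit. The sketch below uses scikit-learn rather than the mclust-style implementation cited in the chapter, and it assumes scikit-learn's sign convention, in which lower BIC values indicate the better model; the function name and the five-component cap are choices made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_mixture(gr_values, max_components=5, seed=0):
    """Fit Gaussian mixtures with 1..max_components components by EM and
    keep the model whose BIC is best, mirroring the subpopulation calls
    of Fig. 2.6B and D."""
    X = np.asarray(gr_values, dtype=float).reshape(-1, 1)
    fits = [GaussianMixture(n_components=k, random_state=seed).fit(X)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda m: m.bic(X))  # lowest BIC wins in sklearn
```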
One metric that can be obtained using data pulled from entire progeny trees is a maximum likelihood estimate of GR for each tree, given by Eq. (2.8):

$$GR = \frac{B_t - D_t}{S_t} \qquad (2.8)$$
where $B_t$ and $D_t$ are the number of mitotic events and the number of deaths, respectively, and $S_t$ is the total lifetime of the population. $S_t$ is obtained by summing the lifespan of each cell within a progeny tree (Keiding and Lauritzen, 1978). In the absence of detectable death, the equation reduces to Eq. (2.9):

$$GR = \frac{B_t}{S_t} \qquad (2.9)$$
Since the estimate of GR for the progeny tree is based on the population lifetime ($S_t$) and the number of mitotic events ($B_t$) occurring within a progeny tree, these values can be calculated even for progeny trees containing a single mitotic event (one parent and two offspring). In addition, $S_t$ is calculated using all cells in each tree, regardless of whether they leave the frame or persist to the end of the experiment. Thus, deriving GR from progeny trees provides a system with which to compare the proliferation rates of clonal subpopulations within the context of a potentially heterogeneous population, without requiring individual clones to be isolated. This analysis therefore introduces the potential for high-throughput comparison of multiple genetically stable clonal populations and should be able to detect preexisting or frequently occurring stable genetic alterations that alter the proliferative capacity of the cells within the population as a whole. A representative plot of progeny tree GR and its relationship to the other metrics is shown in Fig. 2.6.

4.2.4.3. Analysis of sibling pairs
Another proliferation metric that can be used to detect variability within each cell line is the similarity of IMT or GR between sibling pairs (or other members within a progeny tree). Since each sibling pair is presumably genetically identical, differences between them can be considered nongenetic. Metrics of this similarity or difference between siblings are obtained either by determining the correlation between sibling GR (Fig. 2.7A and B) or by plotting the difference between the IMT of sibling pairs (Fig. 2.7C). Although not yet applied to our datasets, a very promising approach to quantify the variance of proliferation metrics within cell lines is the bifurcating autoregressive model (Staude et al., 1997). The model accounts for cells progressing through a standard cell cycle and can be used to quantify heterogeneity in the population using bifurcating data structures such as progeny trees. The model provides quantitative values of mean and variance
[Figure 2.7 appears here. Panels: (A) sibling-pair GR scatter plots, sib_1 GR vs. sib_2 GR (h⁻¹), with r = 0.74, p = 1.6e−10 (Condition 1) and r = 0.561, p = 3.4e−08 (Condition 2); (B) residual-error density plots for both conditions; (C) cumulative fraction of differences between sibling IMT vs. time (h, log scale) for AT1 in Conditions 1 and 2.]
Figure 2.7 Sibling pair analysis. (A) Scatter plots of sibling pairs demonstrate significant correlation, as indicated by the high correlation coefficients (r) and low P-values. (B) Residual plots similarly demonstrate the stronger correlation in Condition 1. (C) The differences between sibling pair IMT can also be represented using cumulative density distributions with a log scale on the x-axis (time).
in the population and can quantify the variance of metrics between related members of a progeny tree (e.g., mothers and daughters, or sibling pairs).

4.2.5. Other proliferation-related metrics
Other standard assays of DNA synthesis (e.g., bromodeoxyuridine (BrdU) incorporation) and DNA content (e.g., incorporation of fluorescent DNA-binding dyes such as 4′,6-diamidino-2-phenylindole (DAPI) or Hoechst 33342) can easily be incorporated into the HCAM experiments. These assays can be performed in situ to produce results similar to those obtainable using flow cytometry. However, a live-cell, fluorescent, ubiquitination-based cell cycle indicator, the "Fucci" system (Sakaue-Sawano et al., 2008), now makes it possible to track the cell cycle of individual cells over time. The Fucci system uses two fluorescent protein-conjugated protein fragments that are rapidly degraded upon ubiquitylation, with different fluorescent properties for each phase (G1/S and G2/M) of the cell cycle (Sakaue-Sawano et al., 2008). Data generated by these approaches can easily be integrated with the other proliferation metrics to provide a more complete picture of the cell cycle times of individual cells in the population over time. A list of proliferation metrics, such as IMT, is shown in Table 2.4.

4.2.6. Quality control
For verification of automated tracking results, random wells (fields of view) are selected for manual verification. The manually derived results of these fields are subjected to the same analysis, and the results are compared with the automated results to determine the error rate of the automated process (e.g., histograms of the mitotic times are compared with the two-sample Kolmogorov–Smirnov test for significant differences; see the sketch following Table 2.4).

Table 2.4 Proliferation metrics obtainable from H2B-labeled cells
Time-based features: population DT/GR; single-cell IMT/GR; differences between sibling IMT; clonal population GR (progeny trees); mitotic events per unit time; G1/S–G2/M conversion rate; DNA synthesis rate.
Morphologic features: nuclear size; nuclear shape; nuclear area; nuclei per cell; bi- or multipolar mitotic events; distance between nuclei centroids.
Other features: nuclei per frame; cell death; DNA content; % in cell cycle phase (G1/S/G2).
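As referenced in Section 4.2.6, a hedged sketch of the quality-control comparison step, assuming manually and automatically derived IMT lists for the same fields and a conventional 0.05 threshold (both assumptions for the example):

```python
import numpy as np
from scipy.stats import ks_2samp

def qc_tracking(manual_imt, automated_imt, alpha=0.05):
    """Compare manually and automatically derived IMT distributions with
    the two-sample Kolmogorov-Smirnov test; a small p-value flags a
    systematic disagreement between the two pipelines."""
    stat, p = ks_2samp(np.asarray(manual_imt), np.asarray(automated_imt))
    return {"D": stat, "p": p, "pass": p >= alpha}
```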
5. Conclusions
From this chapter, it is hopefully evident that QCT studies by HCAM can address fundamental questions in cancer, including: (i) defining the relation between the progression of cancer cell aggressiveness and QCT variability in a tumor; (ii) determining whether the range of QCT variability tracks with tumor response to drugs and drug combinations; and (iii) relating QCT variability to the rise of cancer resistance to treatment. It is also expected that these quantitative analyses will have a profound impact on computational and mathematical modeling of cancer progression and treatment, by complementing the plethora of molecular data with an abundance of much-needed cellular data.
ACKNOWLEDGMENTS
We thank Dr. Jerome Jourquin for incorporating motility movies into http://vicbc.vanderbilt.edu/itumor/cell. Support for this work was provided by NCI grant U54CA113007.
REFERENCES

Alexopoulos, L. G., Erickson, G. R., and Guilak, F. (2002). A method for quantifying cell size from differential interference contrast images: Validation and application to osmotically stressed chondrocytes. J. Microsc. 205(Pt 2), 125–135.
Anderson, A. R. A., Hassanein, M., Branch, K. M., Lu, J., Lobdell, N. A., Maier, J., Basanta, D., Weidow, B., Reynolds, A. B., Quaranta, V., Estrada, L., and Weaver, A. M. (2009). Microenvironmental independence associated with tumor progression. Cancer Res. (in press).
Bear, J. E., Svitkina, T. M., Krause, M., Schafer, D. A., Loureiro, J. J., Strasser, G. A., Maly, I. V., Chaga, O. Y., Cooper, J. A., Borisy, G. G., and Gertler, F. B. (2002). Antagonism between Ena/VASP proteins and actin filament capping regulates fibroblast motility. Cell 109(4), 509–521.
Berg, H. C., and Brown, D. A. (1972). Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature 239, 500–504.
Brock, A., Chang, H., and Huang, S. (2009). Non-genetic heterogeneity: A mutation-independent driving force for the somatic evolution of tumours. Nat. Rev. Genet. 10(5), 336–342.
Bryce, N. S., Clark, E. S., Leysath, J. L., Currie, J. D., Webb, D. J., and Weaver, A. M. (2005). Cortactin promotes cell motility by enhancing lamellipodial persistence. Curr. Biol. 15(14), 1276–1285.
Cai, L., Marshall, T. W., Uetrecht, A. C., Schafer, D. A., and Bear, J. E. (2007). Coronin 1B coordinates Arp2/3 complex and cofilin activities at the leading edge. Cell 128(5), 915–929.
Carpenter, A. E., Jones, T. R., Lamprecht, M. R., Clarke, C., Kang, I. H., Friman, O., Guertin, D. A., Chang, J. H., Lindquist, R. A., Moffat, J., Golland, P., and
Sabatini, D. M. (2006). CellProfiler: Image analysis software for identifying and quantifying cellular phenotypes. Genome Biol. 7(10), R100.
Codling, E. A., Plank, M. J., and Benhamou, S. (2008). Random walk models in biology. J. R. Soc. Interface 5(25), 813–834.
Dikovskaya, D., Schiffmann, D., Newton, I. P., Oakley, A., Kroboth, K., Sansom, O., Jamieson, T. J., Meniel, V., Clarke, A., and Näthke, I. S. (2007). Loss of APC induces polyploidy as a result of a combination of defects in mitosis and apoptosis. J. Cell Biol. 176(2), 183–195.
Dove, A. (2003). Screening for content: The evolution of high throughput. Nat. Biotechnol. 21, 859–864.
Duffy, K. J., and Ford, R. M. (1997). Turn angle and run time distributions characterize swimming behavior for Pseudomonas putida. J. Bacteriol. 179(4), 1428–1430.
Dunn, G. A., and Brown, A. F. (1987). A unified approach to analyzing cell motility. J. Cell Sci. Suppl. 8, 81–102.
Eliason, S. R. (1993). Maximum Likelihood Estimation: Logic and Practice, Vol. 96. SAGE Publications, Thousand Oaks, CA.
Evans, J. G., and Matsudaira, P. (2007). Linking microscopy and high content screening in large-scale biomedical research. Methods Mol. Biol. 356, 33–38.
Fraley, C., and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631.
Friedl, P., and Wolf, K. (2003). Tumour-cell invasion and migration: Diversity and escape mechanisms. Nat. Rev. 3, 362–374.
Fürth, R. (1920). Die Brownsche Bewegung bei Berücksichtigung einer Persistenz der Bewegungsrichtung. Mit Anwendungen auf die Bewegung lebender Infusorien. Z. Physik 2, 244–256.
Gautestad, A. O., and Mysterud, I. (2005). Intrinsic scaling complexity in animal dispersion and abundance. Am. Nat. 165(1), 44–55.
Goldberg, I. G., Allan, C., Burel, J. M., Creager, D., Falconi, A., Hochheiser, H., Johnston, J., Mellen, J., Sorger, P. K., and Swedlow, J. R. (2005). The Open Microscopy Environment (OME) data model and XML file: Open tools for informatics and quantitative analysis in biological imaging. Genome Biol. 6, R47.
Gräf, R., Rietdorf, J., and Zimmermann, T. (2005). Live cell spinning disk microscopy. Adv. Biochem. Eng. Biotechnol. 95, 57–75.
Hanahan, D., and Weinberg, R. A. (2000). The hallmarks of cancer. Cell 100(1), 57–70.
Harris, M. P., Kim, E., Weidow, B., Wikswo, J. P., and Quaranta, V. (2008). Migration of isogenic cell lines quantified by dynamic multivariate analysis of single-cell motility. Cell Adh. Migr. 2(2), 127–136.
Heng, H. H., Bremer, S. W., Stevens, J. B., Ye, K. J., Liu, G., and Ye, C. J. (2009). Genetic and epigenetic heterogeneity in cancer: A genome-centric perspective. J. Cell. Physiol. 220(3), 538–547.
Hofmann-Wellenhof, R., Fink-Puches, R., Smolle, J., Helige, C., Tritthart, H. A., and Kerl, H. (1995). Correlation of melanoma cell motility and invasion in vitro. Melanoma Res. 5(5), 311–319.
Jiao, X., Katiyar, S., Liu, M., Mueller, S. C., Lisanti, M. P., Li, A., Pestell, T. G., Wu, K., Ju, X., Li, Z., Wagner, E. F., Takeya, T., Wang, C., and Pestell, R. G. (2008). Disruption of c-Jun reduces cellular migration and invasion through inhibition of c-Src and hyperactivation of ROCK II kinase. Mol. Biol. Cell 19(4), 1378–1390.
Keiding, N., and Lauritzen, S. L. (1978). Marginal maximal likelihood estimates and estimation of the offspring mean in a branching process. Scand. J. Stat. 5, 106–110.
Kim, H. D., Guo, T. W., Wu, A. P., Wells, A., Gertler, F. B., and Lauffenburger, D. A. (2008).
Epidermal growth factor-induced enhancement of glioblastoma cell migration in
3D arises from an intrinsic increase in speed but an extrinsic matrix- and proteolysis-dependent increase in persistence. Mol. Biol. Cell 19, 4249–4259.
Kipper, M. J., Kleinman, H. K., and Wang, F. W. (2007). New method for modeling connective-tissue cell migration: Improved accuracy on motility parameters. Biophys. J. 93(5), 1797–1808.
Loo, L. H., Wu, L. F., and Altschuler, S. J. (2007). Image-based multivariate profiling of drug responses from single cells. Nat. Methods 4(5), 445–453.
Mukherjee, D. P., Ray, N., and Acton, S. T. (2004). Level set analysis for leukocyte detection and tracking. IEEE Trans. Image Process. 13(4), 562–572.
Perlman, Z. E., Slack, M. D., Feng, Y., Mitchison, T. J., Wu, L. F., and Altschuler, S. J. (2004). Multidimensional drug profiling by automated microscopy. Science 306(5699), 1194–1198.
Porter, I. M., McClelland, S. E., Khoudoli, G. A., Hunter, C. J., Andersen, J. S., McAinsh, A. D., Blow, J. J., and Swedlow, J. R. (2007). Bod1, a novel kinetochore protein required for chromosome biorientation. J. Cell Biol. 179(2), 187–197.
Potdar, A. A., Lu, J., Jeon, J., Weaver, A. M., and Cummings, P. T. (2009). Bimodal analysis of mammary epithelial cell migration in two dimensions. Ann. Biomed. Eng. 37(1), 230–245.
Rasband, W. S. (1997–2006). ImageJ. U.S. National Institutes of Health, Bethesda, MD, USA. http://rsbweb.nih.gov/ij/.
Ray, N., and Acton, S. T. (2005). Data acceptance for automated leukocyte tracking through segmentation of spatiotemporal images. IEEE Trans. Biomed. Eng. 52(10), 1702–1712.
Roeder, K., and Wasserman, L. (1995). Practical Bayesian density estimation using mixtures of normals. J. Am. Stat. Assoc. 92.
Sakaue-Sawano, A., Kurokawa, H., Morimura, T., Hanyu, A., Hama, H., Osawa, H., Kashiwagi, S., Fukami, K., Miyata, T., Miyoshi, H., Imamura, T., Ogawa, M., et al. (2008). Visualizing spatiotemporal dynamics of multicellular cell-cycle progression. Cell 132(3), 487–498.
Sisken, J. E., and Morasca, L. (1965). Intrapopulation kinetics of the mitotic cycle. J. Cell Biol. 25, 179–189.
Slack, M. D., Martinez, E. D., Wu, L. F., and Altschuler, S. J. (2008). Characterizing heterogeneous cellular responses to perturbations. Proc. Natl. Acad. Sci. USA 105(49), 19306–19311.
Starkuviene, V., and Pepperkok, R. (2007). The potential of high-content high-throughput microscopy in drug discovery. Br. J. Pharmacol. 152, 62–71.
Staude, R. G., Huggins, R. M., Zhang, J., Axelrod, D. E., and Kimmel, M. (1997). Estimating clonal heterogeneity and interexperiment variability with the bifurcating autoregressive model for cell lineage data. Math. Biosci. 143, 103–121.
Stockholm, D., Benchaouir, R., Picot, J., Rameau, P., Neildez, T. M. A., Landini, G., Laplace-Builhe, C., and Paldi, A. (2007). The origin of phenotypic heterogeneity in a clonal cell population in vitro. PLoS ONE 2(4), e394.
Swedlow, J. R., Goldberg, I., Brauner, E., and Sorger, P. K. (2003). Informatics and quantitative analysis in biological imaging. Science 300, 100–102.
Swedlow, J. R., Goldberg, I. G., and Eliceiri, K. W. (2009). Bioimage informatics for experimental biology. Annu. Rev. Biophys. 38, 327–346.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the theory of the Brownian motion. Phys. Rev. 36, 823–841.
Viswanathan, G. M., Buldyrev, S. V., Havlin, S., da Luz, M. G., Raposo, E. P., and Stanley, H. E. (1999). Optimizing the success of random searches. Nature 401(6756), 911–914.
Wallin, A. E., Salmi, A., and Tuma, R. (2007). Step length measurement: Theory and simulation for tethered bead constant-force single molecule assay. Biophys. J. 93(3), 795–805.
Wells, A. (2006). Cell Motility in Cancer Invasion and Metastasis. In "Cancer Metastasis: Biology and Treatment" Series. Springer.
C H A P T E R
T H R E E
Matrix Factorization for Recovery of Biological Processes from Microarray Data

Andrew V. Kossenkov* and Michael F. Ochs†

* The Wistar Institute, Philadelphia, Pennsylvania, USA
† The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, Maryland, USA

Methods in Enzymology, Volume 467. ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67003-8. © 2009 Elsevier Inc. All rights reserved.

Contents
1. Introduction 59
2. Overview of Methods 63
2.1. Clustering techniques 63
2.2. Traditional statistical approaches 64
2.3. Matrix factorization techniques 65
2.4. Extensions to nonnegative matrix factorization 68
3. Application to the Rosetta Compendium 68
4. Results of Analyses 70
5. Discussion 74
References 75
Abstract We explore a number of matrix factorization methods in terms of their ability to identify signatures of biological processes in a large gene expression study. We focus on the ability of these methods to find signatures in terms of gene ontology enhancement and on the interpretation of these signatures in the samples. Two Bayesian approaches, Bayesian Decomposition (BD) and Bayesian Factor Regression Modeling (BFRM), perform best. Differences in the strength of the signatures between the samples suggest that BD will be most useful for systems modeling and BFRM for biomarker discovery.
1. Introduction
Microarray technology introduced a new complexity into biological studies through the simultaneous measurement of thousands of variables, replacing a technique (the Northern blot) that typically measured at most tens
of variables. Traditional analysis focused on measurements with minimal statistical complexity, but direct application of such tests (e.g., the t-test) to microarrays resulted in massive numbers of ‘‘significant’’ differentially regulated genes, when reality suggested far fewer. There were a number of reasons for the failure of these tests, including the small number of replicates leading to chance detection when tens of thousands of variables were measured (Tusher et al., 2001), the unmodeled covariance arising from coordinated expression (Kerr et al., 2002), and non-gene-specific error models (Hughes et al., 2000). While a number of statistical issues have now been successfully addressed (Allison et al., 2006), two aspects of the biology of gene expression raise difficulties for many analyses. The issues can be noted in a simple model of signaling in the yeast S. cerevisiae. In Fig. 3.1, the three overlapping MAPK pathways are shown. The pathways share a number of upstream regulatory components (e.g., Ste11), and regulate sets of genes divided here into five groups (A–E), with a few of the many known targets shown. The Fus3 mating response MAPK protein activates the Ste12 transcription factor, leading to expression of groups A and B. The Kss1 filamentation response MAPK protein activates the Ste12–Tec1 regulatory complex, leading to expression of groups B, C, and D. The Hog1 high-osmolarity response MAPK protein activates the Sko1 transcription factor, leading to expression of groups D and E.
[Figure 3.1 appears here: a wiring diagram of the coupled MAPK cascades, Ste11 → Ste7 → Fus3 and Kss1, and Ste11 → Pbs2 → Hog1, with Fus3 signaling through Dig1/2 and Ste12, Kss1 through Ste12–Tec1, and Hog1 through Sko1, to the target gene groups A (Far1, Pho81, Afr1), B (Pcl2, Dig1, Ste2), C (Cln1, Pcl1, Bud8), D (Hal1), and E (Gre2, Ahp1, Sfa1).]
Figure 3.1 The tightly coupled MAPK pathways in S. cerevisiae. Activation of the pathways leads to transcriptional responses, which produce overlapping sets of transcripts that would be measured in a gene expression experiment. This multiple regulation, which is ubiquitous in eukaryotic biology, motivates the use of matrix factorization methods in high-throughput biological data analysis.
The standard methods used in microarray analysis will look for genes that are differentially expressed between two states. If we imagine those two states as mating activation and filamentation activation, we identify genes associated with each process, but we do not identify all genes associated with either process. Alternatively, clustering in an experiment where each process is independently active will lead to identification of five clusters (one for each group A–E) even though only three processes are active. Naturally, the complexity is substantially greater as there is no true isolation of a single biological process, as any system with only a single process active would be dead, and any measurement is convolved with measurements of ongoing biological behavior required for survival, homeostasis, or growth. These processes use many of the same genes, due to borrowing of gene function that has occurred throughout evolution. [Note: for S. cerevisiae, plain text Ste12 indicates the protein, while italic text ste12 indicates the gene.] Essentially, this example shows the two underlying biological principles that need to be addressed in many analyses of high-throughput data: multiple regulation of genes due to gene reuse in different biological processes and nonorthogonality of biological process activity arising from the natural simultaneity of biological behaviors. Mathematically, we can state the problem as a matrix factorization problem:

$$D_{ij} = \sum_{k=1}^{P} A_{ik} P_{kj} + \varepsilon_{ij} \qquad (3.1)$$
where D is the data matrix comprising measurements on N genes (or other entities) indexed by i across M conditions indexed by j, P is the pattern matrix for P patterns indexed by k, A is the amplitude or weighting matrix that determines how much of each gene's behavior can be attributed to each pattern, and ε is the error matrix. P is essentially a collection of basis vectors for the factorization into P dimensions, and as such it is often useful to normalize the rows of P to sum to 1. This makes the A matrix similar to loading or score matrices, such as in principal component analysis (PCA). It is useful to note here that the nonindependence of biological processes is equivalent to nonorthogonality of the rows of P, indicating the factorization is ideally into a basis space that reflects underlying biological behaviors but is not orthonormal. We introduced Bayesian Decomposition (BD), a Markov chain Monte Carlo algorithm, to address these fundamental biological issues in microarray studies (Moloshok et al., 2002), extending our original work in spectroscopy (Ochs et al., 1999). Kim and Tidor introduced nonnegative matrix factorization (NMF), created by Lee and Seung (1999), into microarray analysis (Brunet et al., 2004; Kim and Tidor, 2003), for the same reason. Subsequently, it was realized that sparseness aids in identifying
biologically meaningful processes, and sparse NMF was introduced (Gao and Church, 2005). Fortuitously, due to its original use in spectroscopy, sparseness was already a feature of BD through its atomic prior (Sibisi and Skilling, 1997). More recently, Carvalho and colleagues introduced Bayesian factor regression modeling (BFRM), an additional Markov chain Monte Carlo method, for microarray data analysis (Carvalho et al., 2008). Targeted methods that directly model multiple sources of biological information have been introduced as well. Liao and Roychowdhury introduced network component analysis (NCA), which relied on information about the binding of transcriptional regulators to help isolate the signatures of biological processes (Liao et al., 2003). The use of information on transcriptional regulation can also aid in sparseness, as shown by its inclusion in BD as prior information (Kossenkov et al., 2007). These methods have been developed and applied primarily to microarray data, as it was the first high-throughput biological data that included dynamic behavior, in contrast to sequence data. Microarrays were developed independently by a number of groups in the 1990s (Lockhart et al., 1996; Schena et al., 1995), and their use is now widespread. A number of technical issues plagued early arrays, and error rates were high. The development of normalization and other preprocessing procedures improved data reproducibility and robustness (Bolstad et al., 2003; Cheng and Wong, 2001; Irizarry et al., 2003), leading to studies that demonstrated the ability to produce meaningful datasets from arrays run in different laboratories at different times (English and Butte, 2007). Data can be accessed, though not always with useful metadata, in the GEO and ArrayExpress repositories (Edgar et al., 2002; Parkinson et al., 2005). However, the methods discussed here are also suitable for other high-throughput data where the fundamental assumptions of multiple overlapping sets within the data and nonorthogonality of these sets across the samples holds. In the near future, these data are likely to include large-scale proteomics measurements and metabolite measurements. We have previously undertaken a study of some of these methods to determine their ability to solve Eq. (3.1) using simulations of the cell cycle (Kossenkov and Ochs, 2009). This study did not address the recovery of biologically meaningful patterns from real data, where numerous unknowns exist. Most of these relate to the fundamental issue that separates biological studies from those in physics and chemistry—in biology we are unable to isolate variables of interest away from other unknowns, as to do so is to kill the organism under study. Instead, we must perform studies in a background of incomplete knowledge of the activities a cell is undertaking and incomplete knowledge of the entities (e.g., genes, proteins) associated with these processes. In addition, sampling is difficult and therefore tends to be limited (i.e., large N, small P), and the data remain prone to substantial variance, perhaps due to true biological variation instead of technical issues.
We have undertaken a new analysis of the Rosetta compendium, a dataset of quadruplicate measurements of 300 yeast gene knockouts and chemical treatments (Hughes et al., 2000), to determine how well various matrix factorization methods recover signatures of biological processes. The Rosetta study included 63 control replicates of wild-type yeast grown in rich media, allowing a gene-specific error model. One interesting result to emerge from this work is that roughly 10% of yeast genes appear to be under limited transcriptional regulation, so that their transcript levels vary by orders of magnitude without a corresponding variation in protein levels or phenotype. This has obvious implications for studies where whole genome transcript levels are measured on limited numbers of replicates. Using known biological behaviors that are affected by specific gene knockouts, we compared a number of methods from clustering through the matrix factorization methods discussed above to determine how well such methods recover biological information from microarray measurements. We first give a brief description of each method, then we present the dataset and results of our analyses.
2. Overview of Methods

2.1. Clustering techniques
To provide a baseline for comparison, we applied two widely used clustering techniques to the dataset, as well as an approach where genes were assigned to groups at random. Hierarchical clustering (HC) was introduced for microarray work by Eisen et al. (1998), and because of easy-to-use software and its lead as the first technique, it has seen significant use and is available in desktop tools (Saeed et al., 2006). HC, as performed by most users, is done in an agglomerative fashion, using a metric to determine intergene and intercluster distances. Metrics used in microarray studies include Pearson correlation, which captures the shape of changes across the samples, and Euclidean distance, which captures the magnitude of changes. HC creates a tree of distances (a dendrogram) and groups the genes based on the nodes of this tree. As such, different numbers of clusters can be created by cutting at different levels on the tree; however, each specific set of clusters is the most parsimonious for that level and that metric. K-means (or K-medians) clustering has also been widely used in microarray studies, and it relies on an initial random assignment of genes to P clusters. Genes are then moved between clusters based on gene-cluster distances in an iterative fashion. The same metrics are typically used as in HC, and since the number of clusters is defined a priori, there is no necessity of choosing a tree level as in HC. However, a tree can be created after clustering is complete if desired.
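As a concrete illustration of this clustering baseline, a Python sketch using SciPy with the Pearson correlation metric and average linkage (the settings used for the analyses in Section 3) is given below; the function name and the flat-cut criterion are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_genes(D, n_clusters):
    """Agglomerative clustering of genes (rows of D) using a Pearson
    correlation metric (distance = 1 - r) and average linkage, then cut
    the dendrogram into a flat set of n_clusters groups."""
    Z = linkage(D, method="average", metric="correlation")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Illustrative use on random data standing in for an expression matrix:
labels = cluster_genes(np.random.rand(100, 20), n_clusters=5)
```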
2.2. Traditional statistical approaches
The factorization implied by Eq. (3.1) can be accomplished in a number of ways. One of the most widely used is singular value decomposition (SVD) or its relative, PCA. These methods create M new basis vectors from the data in D, and these new basis vectors are orthonormal. SVD is an analytic procedure that decomposes D into the product of a left singular matrix U, a diagonal matrix of ordered values S referred to as the singular values, and a right singular matrix V^T, that is,

$$D = U S V^{T} \qquad (3.2)$$
Alter and colleagues introduced SVD to microarray studies, and defined the rows of V^T as eigengenes, and the columns of U as eigenarrays (Alter et al., 2000). The eigengenes are similar to the concept of patterns for Eq. (3.1). PCA performs a similar decomposition; however, the analysis proceeds from the covariance matrix, so that the principal components (PCs) follow the variance in the data. The first PC is aligned with the axis of maximum variance in the M-dimensional space of the data matrix, with each additional PC chosen to be orthogonal to the previous PCs and in the direction that maximizes variance among all orthogonal directions. This creates a new orthonormal basis space in which the PCs represent directions of maximum variance. The singular values are now referred to as scores, and the value of the score provides the amount of variance explained by the corresponding PC. In most applications of PCA and SVD to microarray data, the matrices are truncated so that only the strongest eigengenes or PCs are retained. This is a form of dimensionality reduction, which, in the case of PCA, retains the maximum amount of variance across the data at each possible dimension. The orthogonality conditions of SVD and PCA were realized to be overly constraining for microarray data. Lin and colleagues and Liebermeister independently introduced independent component analysis (ICA) to microarray analysis to address this issue (Liebermeister, 2002; Lin et al., 2002). As with typical applications of PCA, ICA projects the data onto a lower dimensional space. In linear ICA, the goal is to solve Eq. (3.1) by finding P, such that

$$Y = WD \qquad (3.3)$$
through the identification of the unmixing matrix, W. The unmixing matrix is designed to make the rows of Y, and therefore P, as statistically independent as possible. A number of measures of independence can be used, such as maximizing negentropy or nongaussianity (Hyvärinen et al., 2001). Because ICA is not strictly constrained like PCA or SVD, it is possible to obtain multiple solutions for Y from the same data. As such, sometimes multiple applications must be performed and a rule applied to pick the best Y (Frigyesi et al., 2006).
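The decompositions of Eqs. (3.2) and (3.3) can be sketched with NumPy and scikit-learn as below. The random matrix stands in for the expression data D, the choice of 10 components is arbitrary, and FastICA is only one of several ICA estimators, so this is an illustration under those assumptions rather than the exact procedure used in the chapter.

```python
import numpy as np
from sklearn.decomposition import FastICA

D = np.random.rand(764, 228)    # stand-in for the filtered expression matrix

# SVD (Eq. 3.2): rows of Vt are the "eigengenes", columns of U the eigenarrays
U, s, Vt = np.linalg.svd(D, full_matrices=False)
eigengenes = Vt[:10]            # truncate to the 10 strongest components

# ICA (Eq. 3.3): estimate statistically independent patterns from the data
ica = FastICA(n_components=10, random_state=0, max_iter=1000)
sources = ica.fit_transform(D)  # per-gene weights on each component
patterns = ica.components_      # rows play the role of the patterns Y
```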
2.3. Matrix factorization techniques
The desire to escape both the exclusivity of gene assignment to a single cluster occurring in clustering and the independence criteria of statistical methods such as PCA led to the introduction of two techniques from other fields that addressed these issues. Naturally, these methods require constraints, as Eq. (3.1) is degenerate, allowing an infinite number of equally good solutions in the absence of a constraint, such as the one provided by an orthonormal basis in PCA. The methods are distinguished by their forms of constraint and by the search algorithm for finding an optimal solution to Eq. (3.1) within these constraints. All these methods also rely on dimensionality reduction, so that the number of elements in the matrices A and P is less than the number in D. BD applies a positivity constraint within an atomic prior to limit the possible A and P matrices. The atomic prior relies on implementation of an additional domain, an atomic domain modeling an infinite one-dimensional space upon which atoms are placed, and mappings between it and the A and P matrices. This provides great flexibility, as the mappings, in the form of convolution functions, can distribute an atom to a complex distribution encoding additional prior knowledge (e.g., the form of a response curve, a coordinated change in multiple genes). The atomic domain comprises a positive additive distribution (Sibisi and Skilling, 1997), and an Occam's Razor argument (i.e., parsimony) penalizes excessive structure through the prior distribution on the atoms. The resulting posterior distribution that combines this prior with the likelihood determined from the fit to the data is sampled by a Markov chain Monte Carlo Gibbs sampler (Geman and Geman, 1984). This approach allows patterns to be constrained in multiple ways, permitting the rows of P to be nonorthogonal, while still identifying unique solutions. Even with unique directions defined by the rows of P, there is still flexibility in the equation that allows amplitude in rows of P to be transferred to columns in A without changing D. As such, the rows of P are normalized to sum to 1. For the work presented here, a simple convolution function that maps each atom to a single matrix element is used, as this only enforces positivity on A and P, similar to NMF. The posterior distribution sampled by BD is generated from the prior and the likelihood through Bayes' equation:

$$p(A, P \mid D) = \frac{p(D \mid A, P)\, p(A, P)}{p(D)} \qquad (3.4)$$
where p(A, P | D) is the posterior distribution, p(D | A, P) is the likelihood, p(A, P) is the prior, and p(D) is the marginal likelihood of the data, which is also known as the evidence. The likelihood is the probability distribution associated with a χ² distribution, and BD therefore uses the estimates of error during modeling, which can be very powerful given the large
variation in uncertainty across different genes in a microarray experiment. This also permits seamless treatment of missing values, as they can be estimated at a typical value (background level) with a large uncertainty, thus not affecting the likelihood. The evidence is not used by BD, as Gibbs sampling requires only relative estimates of the posterior distribution; however, it has been proposed that it can be used for model selection, which in this case would be determining the correct number of dimensions, P, in Eq. (3.1) (Skilling, 2006). Presently, BD requires a choice of P.

NMF applies positivity and dimensionality reduction to find the patterns of P, each of which is defined as a positive linear combination of rows of D. Each row of D is therefore a linear combination of patterns, with the weight given by the corresponding element in A. As with BD, the choice of P must be made before applying the algorithm. In an NMF simulation, random matrices A and P are initialized according to some scheme, such as from a uniform distribution. The two matrices are then iteratively updated, with the model M = AP, by

$$P_{am} \leftarrow P_{am} \frac{\sum_i A_{ia} D_{im}/M_{im}}{\sum_i A_{ia}}, \qquad A_{da} \leftarrow A_{da} \frac{\sum_j (D_{dj}/M_{dj}) P_{aj}}{\sum_j P_{aj}} \qquad (3.5)$$

which guarantees reaching a local maximum in the likelihood. The updating rules climb a gradient in likelihood, which does lead to the problem of becoming trapped in a local maximum in the probability space. In general, application of NMF therefore is done multiple times from different initial random points, and the best fit to the data is used. The fits obtained from repeated runs on complex microarray data can vary significantly in some cases, due to the complex probability structure that appears typical for biological data. MCMC techniques tend to be more resistant to this problem, as they are designed specifically to escape local maxima, although they are prone to miss sharp local maxima in relatively flat spaces; however, this has not yet appeared to be a problem in biological data. The absence of constraints beyond positivity in NMF does lead to a tendency for the recovery of signal-invariant metagenes that carry little or no information, and the failure to include error estimates can lead to genes with large variance being overweighted during fitting. These issues have been addressed in the extensions to NMF discussed below.

NCA uses information on the binding of transcriptional regulators to DNA and dimensionality reduction to reduce the possible A and P matrices. The concept is to create a two-layer network with one layer populated by transcriptional regulators and the other by the genes they regulate, with edges connecting regulators to target genes. NCA addresses the degeneracy of Eq. (3.1) through
$$D = A X X^{-1} P + \varepsilon \qquad (3.6)$$

where AX includes all possible A matrices and X⁻¹P all possible P matrices. By demanding that X be diagonal, A and P are uniquely determined up to a scaling factor (i.e., the rows of P require normalization just as in BD). The diagonality of X requires that the transcriptional regulators be independent. The solution of Eq. (3.6) is found by minimizing

$$\lVert D - AP \rVert^{2} \qquad (3.7)$$
which is equivalent to maximizing the likelihood under an assumption of uniform Gaussian errors. For the application of NCA, the relative strength of the transcription of a gene by a regulator must be determined. This is done by measuring the binding affinity of a transcription factor to the promoter for a gene. Since each gene can be regulated by multiple regulators, the expression of a gene in a given condition must be estimated as a combination of the regulation from different factors. A log-linear model is used, so each additional binding of a regulator leads to a multiplicative increase in expression. However, it is not clear that the affinity of binding of a transcription factor is the dominant issue in determining transcript abundance, especially in eukaryotes.

BFRM is a Markov chain Monte Carlo technique that solves

$$D_{ij} = \mu_i + \sum_{k=1}^{r} \beta_{ik} h_{kj} + \sum_{p=1}^{P} A_{ip} P_{pj} + \varepsilon_{ij} \qquad (3.8)$$
where A can be viewed as factor loadings for latent factors P (Carvalho et al., 2008). The h matrix provides a series of known covariates in the data, which are then treated using linear regression with coefficients β. The mean vector, μ, provides a gene-specific term that adjusts all genes to the same level, while the ε matrix provides for noise, treated as normally distributed with a pattern-specific variance. The latent factors here are then those that remain after accounting for covariates. This model has also been extended by inclusion in D of response variables as additional columns. This extends the second summation in Eq. (3.8) to P + Q, where Q is the number of latent factors tied to response variables. In both cases, the model also aims for sparse solutions, equivalent to the Occam's razor approach of BD. BFRM also attempts to address the issue of the number of patterns or latent factors. This is done through an evolutionary stochastic search. Essentially, the algorithm attempts to change P to P + 1 by thresholding the probability of inclusion of a new factor. The model is refit with the additional factor, and the factor is retained if it improves the model by some criterion. In actuality, the algorithm can suggest multiple additional latent factors at each step and choose to keep multiple factors. Evolution ceases
when no additional factors are accepted. The BFRM software allows turning off of the evolution, which we have done here to allow direct comparison with other methods at the same P.

2.4. Extensions to nonnegative matrix factorization
NMF has become widely used in a number of fields, including analysis of high-throughput biological data. Unlike BD and BFRM, there is no inherent sparseness criterion applied in NMF. This is not surprising, as the original application to imaging argues against sparseness (Lee and Seung, 1999), since images tend to have continuous elements. Sparseness is added to NMF in sparse NMF (sNMF), which penalizes solutions based on the number of nonzero components in A and P (Gao and Church, 2005). A similar approach is presented in nonsmooth NMF (nsNMF), which creates a sparse representation of the patterns by introducing a smoothness matrix into the factorization (Carmona-Saez et al., 2006). Addressing the lack of error modeling, least-squares NMF (lsNMF) converts Eq. (3.5) to a normalized form, adjusting the D_ij and M_ij terms by the specific uncertainty estimates at each matrix element (Wang et al., 2006). It also introduces stochastic updating of the matrix elements, in an attempt to limit the problem of trapping in local maxima.
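An uncertainty-weighted variant of the multiplicative updates of Eq. (3.5), in the spirit of lsNMF, can be sketched as follows. This assumes D is nonnegative and that a per-element uncertainty matrix sigma is available; the function name, iteration count, and damping constant eps are choices made for the example, not the published algorithm.

```python
import numpy as np

def lsnmf(D, sigma, k, n_iter=1000, seed=0, eps=1e-9):
    """Uncertainty-weighted multiplicative NMF updates: each element of D
    is weighted by 1/sigma^2 in the least-squares objective, so noisy
    genes cannot dominate the fit."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    W = 1.0 / np.square(sigma)                 # per-element weights
    A = rng.random((n, k))
    P = rng.random((k, m))
    for _ in range(n_iter):
        M = A @ P
        A *= ((W * D) @ P.T) / ((W * M) @ P.T + eps)
        M = A @ P
        P *= (A.T @ (W * D)) / (A.T @ (W * M) + eps)
    scale = P.sum(axis=1, keepdims=True)       # sum-1 normalize the patterns,
    return A * scale.T, P / scale              # rescaling A so A @ P is unchanged
```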
2.4. Extensions to nonnegative matrix factorization NMF has become widely used in a number of fields, including analysis of high-throughput biological data. Unlike BD and BFRM, there is no inherent sparseness criterion applied in NMF. This is not surprising, as the original application to imaging argues against sparseness (Lee and Seung, 1999), since images tend to have continuous elements. Sparseness is added to NMF in sparse NMF (sNMF), which penalizes solutions based on the number of nonzero components in A and P (Gao and Church, 2005). A similar approach is presented in nonsmooth NMF (nsNMF), which created a sparse representation of the patterns by introducing a smoothness matrix into the factorization (Carmona-Saez et al., 2006). Addressing the lack of error modeling, least-squares NMF (lsNMF) converts Eq. (3.5) to a normalized form, adjusting the Dij and Mij terms by the specific uncertainty estimates at each matrix element (Wang et al., 2006). It also introduces stochastic updating the matrix elements, in an attempt to limit the problem of trapping in local maxima.
3. Application to the Rosetta Compendium

The sample dataset for this study is generated from experiments on the yeast S. cerevisiae, which has been studied in depth for a number of biological processes, including the eukaryotic cell cycle, transcriptional and translational control, cell wall construction, mating, filamentous growth, and response to high osmolarity. There is substantial existing biological data on gene function, providing a large set of annotations for analysis (Guldener et al., 2005; Mewes et al., 2004). In addition, there is a rich resource, the Saccharomyces Genome Database, maintained by the community, that includes sequence and expression data, protein structure, pathway information, and functional annotations (Christie et al., 2004). The Rosetta compendium provides a large set of measurements of expression in S. cerevisiae, including 300 deletion mutants or chemical treatments targeted at disrupting specific biological functions (Hughes et al., 2000). The 300 experimental conditions were each probed by microarray four times, with dye flips (technical replicates) of two biological replicates. Control experiments involved 63 wild-type yeast cultures grown in rich medium and then analyzed by microarrays. The gene-specific variation seen in these "identical" cultures was combined with variance measured from
quadruplicate measurements of each mutant or chemical treatment to produce a gene-specific error model. This error model provided the estimate of the uncertainty for those algorithms utilizing such an estimate. The data were downloaded from Rosetta Inpharmatics and filtered to remove experiments where fewer than two genes underwent threefold changes and to remove genes that did not change by threefold across the remaining experiments. The resulting dataset comprised 764 genes and 228 experiments with associated error estimates. All algorithms were applied to the data using default settings, with a Pearson correlation metric and average linkage used for clustering procedures and a maximum iterations parameter of 1000 for NMF, sNMF, lsNMF, ICA, and NCA. Patterns for clusters in clustering methods were calculated as the average of the sum-1 normalized expression profiles of the genes in a cluster. BD and lsNMF were run using the PattTools Java interface (available from the authors). NMF and sNMF were run using the same code base, with sNMF sparseness set to 0.8 (Hoyer, 2004). BFRM was run using version 2 of the BFRM software (Carvalho et al., 2008), and BD and BFRM both sampled 5000 points from the posterior distribution using default settings on hyperparameters. Clustering methods (HC, KMC) naturally assigned a gene to a single cluster. For methods that provided uncertainty estimates for values in the A matrix (BD, lsNMF), we used a threshold of 3σ to decide whether a gene belonged to a pattern. Note that this permitted a gene to be assigned to multiple patterns, each of which explained part of the overall expression at a significant level. An additional conversion step was done for methods that provide continuous values for elements in the A matrix without uncertainty measurements (NMF, sNMF, ICA, NCA, BFRM). For these methods we assigned a gene to a group when the absolute value of the corresponding element in matrix A was above the average of the absolute values for that gene, as in Kossenkov et al. (2007). The original publication applied biclustering to the data and reported on a number of clusters tied to specific biological processes at varying levels of significance (Hughes et al., 2000). Clusters were found for mitochondrial function, cell wall construction, protein synthesis, ergosterol biosynthesis, mating, MAPK signaling, rnr1/HU genes, histone deacetylase, isw genes, vacuolar APase/iron regulation, sir genes, and the tup1/ssn6 global repressor. We converted these strong signatures to Munich Information Center for Protein Sequences (MIPS) categories from the Comprehensive Yeast Genome Database (Guldener et al., 2005; Mewes et al., 2004). These categories are detailed in Table 3.1. We added MIPS class 38, transposable elements, to the list to look for methods that could distinguish the mating response from the filamentation response (Bidaut et al., 2006). We searched for signatures of these processes in the results of the analyses.
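In our notation (not the authors'), the two gene-to-pattern assignment rules just described can be written compactly. Here A_ik is the amplitude of gene i in pattern k, σ_ik its uncertainty when available, and P the number of patterns:

$$\text{BD, lsNMF:}\quad i \in k \iff A_{ik} > 3\,\sigma_{ik}; \qquad \text{NMF, sNMF, ICA, NCA, BFRM:}\quad i \in k \iff |A_{ik}| > \frac{1}{P}\sum_{k'=1}^{P} |A_{ik'}|.$$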
Table 3.1 The mapping of originally reported processes identified by two-dimensional clustering to MIPS categories, together with the number of proteins in each category in MIPS and in the analyzed dataset

Original report          MIPS number      MIPS name                              Proteins   In data
Mitochondrial function   02.45            Energy conversion and regeneration     44         4
Cell wall                42.01            Biogenesis of cell wall                214        33
Protein synthesis        12               Protein synthesis                      480        16
Protein synthesis        12.01.01         Ribosomal proteins                     246        9
Mating                   41.01.01         Mating                                 69         15
MAPK activation          30.01.05.01.03   MAPKKK cascade                         27         5
Histone deacetylase      10.01.09.05      DNA conformation modification          187        5
–                        38               Transposable elements                  120        14

Transposable elements have been added to track the difference between mating and filamentation, as filamentation requires transposable element activation.
To keep the analysis simple and less biased, we looked only for these specific processes. However, it is important to remember that real biological interpretation often relies on identification of coordinated changes in sets of related biological processes (e.g., mating, meiosis, cell fate). For all techniques, we focused on 15 patterns or clusters, as we have previously identified this as providing the best estimate of the dimensionality (Bidaut et al., 2006). Analysis was performed using ClutrFree, which calculates enrichment and hypergeometric test values for all patterns for each MIPS term (Bidaut and Ochs, 2004).
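For reference, and as the standard hypergeometric tail rather than anything specific to ClutrFree, the p-value for observing k or more of the n genes assigned to a pattern falling into a MIPS category of size K, out of N genes total, is

$$p = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}.$$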
4. Results of Analyses

Although the fundamental goal of the nonclustering methods is the optimal solution to Eq. (3.1), albeit potentially with covariates as in Eq. (3.8), the methods differ substantially in their treatment of the data. BD, as applied here, and the NMF methods require positivity in A and P, while NCA, ICA, PCA, and BFRM allow negative values. The A matrix is still easily interpreted in terms of enrichment of the gene ontology terms
from Table 3.1; however, the P matrix can vary greatly in its information content. Therefore, we focused on recovery of the signatures identified in the original study in terms of a hypergeometric p-value when the genes were assigned to patterns as described above. In addition, we focused on the P matrix for the strong mating pattern recovered with good p-values by all methods to determine what type of information can be recovered. Table 3.2 provides p-values determined using the gene ontology categories in Table 3.1. The table does not include any NMF methods, as these produced no p-values under 0.50. This may reflect the known problem that NMF tends to spread signal over many elements; in this case even the sparse method failed to isolate a signature, though this may reflect a conservative sparseness parameter. We identified all patterns with an uncorrected p-value under 0.05 or, if no p-values reached the 0.05 threshold, the strongest p-value present under 0.5. In addition, we show for each method in how many of the eight terms it found at least one significant pattern. The Bayesian Markov chain Monte Carlo methods performed best in this regard, and it appears that the other matrix factorization methods captured fewer of these signatures than the clustering methods, although it is the case that these specific signatures were chosen based on their inclusion in the original paper, which relied on clustering. The original paper reported on deletion mutants that were associated with these patterns; however, it is difficult to use this information with matrix factorization methods. For instance, in BD, while the pattern associated with protein synthesis does include all mutants mentioned in the paper, it also includes all other deletion mutants. This was taken to indicate that protein synthesis is vital to all growing yeast (Bidaut et al., 2006), a true though not overly useful insight. On the other hand, BFRM shows two patterns associated with this term: one has only the bub2 deletion mutant (indicated by bub2Δ) and the other only ste4Δ with any strength in the pattern matrix. This may reflect the strong sparseness that BFRM enforces on the data, indicating that, in terms of differences between deletion mutants, these two are the most significant for protein synthesis. No other matrix factorization method had a significant p-value for this term. The mating term was deemed significant by all methods. Mating and filamentation are strongly coupled in yeast, with the main difference in transcriptional response to pathway activation being the use of the Tec1 cofactor. Tec1 is the driver of transposon activity, so we expect the filamentation signature to include the "transposable elements" category, even though it may include the mating category due to sharing of genes between these two processes, as indicated in Fig. 3.1. We use the "transposable elements" term to choose a mating pattern and a filamentation pattern for BD and NCA, where two patterns appear associated with mating. For BD, we assign pattern E to mating and pattern D to filamentation. Looking at the associated rows of the P matrix for deletion mutants
Table 3.2 Hypergeometric p-values for enrichment in gene ontology terms for different methods

MIPS name                          BD         BFRM               NCA       ICA       PCA       HC        KMC
Energy generation (ATP synthase)   A 0.029    A 0.17             A 0.39    A 0.19    A 0.47    A 0.28    A 0.16
Biogenesis of cell wall            B 0.015    B 0.050            B 0.14    A 0.083   B 0.18    B 0.069   A 0.15
Protein synthesis                  A 0.0076   C 0.016, D 0.021   B 0.37    B 0.12    C 0.084   B 0.009   C 0.04
Ribosomal proteins / Mating        C 0.017    A 0.016    D 0.0001    E …
    echo "USAGE:  $0 <min-angle> <max-angle> <angle-incr> \
        <min-length> <max-length> <length-incr>"
    exit 1
fi

# Set variables for easier readability
MINANGLE=$1; MAXANGLE=$2; ANGLEINCR=$3
MINLEN=$4; MAXLEN=$5; LENINCR=$6

# Loop through the angles requested.  We have to use the
# bc program here to do the loop because BASH cannot handle
# floating point numbers natively.
ANGLE=$MINANGLE
while [ `echo "scale=1; $ANGLE <= $MAXANGLE" | bc` -eq 1 ]
do
    # Inside the angle loop, loop through the wing lengths too.
    LENGTH=$MINLEN
    while [ $LENGTH -le $MAXLEN ]
    do
        # We create a file name that reflects angle/length
        OUTPUT=winglift-$ANGLE-$LENGTH.dat
        if [ ! -e $OUTPUT ]
        then
            qsub -v "WINGANGLE=$ANGLE, WINGLENGTH=$LENGTH, \
                OUTPUTFILE=$OUTPUT" submission-script.pbs
        fi
        LENGTH=$(( $LENGTH + $LENINCR ))
    done
    ANGLE=`echo "scale=1; $ANGLE + $ANGLEINCR" | bc`
done

Figure 8.6 Airflow over wing control script.

#!/bin/bash

# Check the arguments
if [ $# -ne 4 ]
then
    echo "USAGE:  $0 <angle-file> <min-length> <max-length> <length-incr>"
    exit 1
fi

# Set variables for easier readability
ANGLEFILE=$1
MINLEN=$2; MAXLEN=$3; LENINCR=$4

# Loop through the angles requested.
for ANGLE in `cat $ANGLEFILE`
do
    # Inside the angle loop, we are going to loop through the
    # wing lengths as well.  We assume that length is given as
    # an integral number of inches
    LENGTH=$MINLEN
    while [ $LENGTH -le $MAXLEN ]
    do
        # We create a file name that reflects angle/length
        OUTPUT=winglift-$ANGLE-$LENGTH.dat
        if [ ! -e $OUTPUT ]
        then
            echo "Submitting job for $OUTPUT"
            qsub -v "WINGANGLE=$ANGLE, WINGLENGTH=$LENGTH, \
                OUTPUTFILE=$OUTPUT" submission-script.pbs
        fi

        LENGTH=$(( $LENGTH + $LENINCR ))
    done
done
Figure 8.7 Airflow over wing control script redux.
#!/bin/bash
#PBS -q largeQueue
throwDarts ${SEED} > dart-results.${NUMBER}
Figure 8.8 Monte Carlo submission script.
distinguishing characteristic between the various results (though, if we wanted, we could store the seed number given). Rather, we are merely using the file name as a convenient means of determining whether or not the output was generated for a given sequential run. Also, in this example, the throwDarts program does not generate an output file. Instead, it prints results to the standard output stream. This is not an uncommon occurrence, and while it can be selectively used or avoided when the user has control over the source code for the sequential binary, oftentimes the binary is a piece of legacy code which cannot, for various reasons, be modified (Fig. 8.9).
#!/bin/bash

# Check the arguments
if [ $# -ne 1 ]
then
    echo "USAGE:  $0 <num-iterations>"
    exit 1
fi

NUMITERS=$1

# Loop through the iterations
while [ $NUMITERS -gt 0 ]
do
    # Use BASH's built-in RANDOM variable to generate a seed
    SEED=$RANDOM

    # If the result hasn't yet been generated, submit a job
    # to create it.
    RESULTFILE=dart-results.$NUMITERS
    if [ ! -e $RESULTFILE ]
    then
        qsub -v "SEED=$SEED, NUMBER=$NUMITERS" \
            submission-script.pbs
    fi
    NUMITERS=$(( $NUMITERS - 1 ))
done
Figure 8.9 Monte Carlo control script.
3.4. Problem decomposition

So far, in this chapter, we have ignored the issue of problem decomposition. Sometimes, the decomposition is either obvious or determined by the sequential program that we are using. Often, however, the user can choose how he or she wants to decompose the larger problem into a collection of smaller ones. Doing so can have a dramatic impact on the total time it takes to execute a high-throughput application as well as on the overall probability that the application will finish successfully. In the advanced section of this chapter, we examine the latter of these concerns, but for now we are going to see how problem decomposition can affect the overall runtime of an HTC application. Consider the dart-throwing Monte Carlo example we looked at in the previous section. A naïve implementation might have written the throwDarts sequential program such that it threw exactly one dart instead of 1,000,000. In this way, the output files would have been numbered by dart rather than by millions of darts. However, this scheme lacks scalability and efficiency. To get a reasonably good estimate for π, we have to throw lots of darts. Assume that we needed to throw as many as 1 billion darts. If we generated a PBS job for each dart, we would have submitted 1 billion jobs into the PBS queue, and it would in turn have created 1 billion result files. From a scale point of view, neither PBS nor the file system into which the results will land is capable of handling that many items. From an efficiency perspective, running a sequential job for each dart-throw would take too long. While it might take a long time to throw 1 billion darts, it takes an incredibly short amount of time to throw one dart. On the other hand, submitting a job to PBS, having PBS run that job, and then getting the results back is a relatively hefty operation requiring multiple seconds at best to complete. We simply cannot afford to spend seconds setting up and submitting jobs that only need a few microseconds to run. Instead, we batched together a large number of dart-throws into a single run of the throwDarts program so that the time that it takes to generate, submit, and run the job through PBS is amortized by the relatively long execution time of the throwDarts program. As a general rule of thumb, you want the execution time of a single unit of your HTC application to be on the order of 100 times (or more) greater than the amount of time it takes PBS to process the job. In this way, the cost of using PBS has little impact on the overall performance of your application as a whole. At the same time, you want the number of individual submissions to the PBS queue to be large enough to get some benefit. Simply submitting a single throwDarts program to the queue that throws all 1 billion darts alone produces no benefit over simply running the program on your desktop.7
7 Unless of course there is some benefit to running on a machine that the PBS batch system has access to that you do not. It is not uncommon for a user to submit a single job to a batch or grid system when that user simply cannot run the program on any other machine to which he or she has access. However, as this chapter is about HTC applications, we ignore such possibilities.
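The rule of thumb above can be written out, purely as an illustration in our own symbols: if a single task (one dart, one frame) takes time $t_{\mathrm{task}}$ and the batch system needs $t_{\mathrm{sched}}$ to process a job, then batching $B$ tasks per job keeps scheduling overhead negligible when

$$B \cdot t_{\mathrm{task}} \;\gtrsim\; 100 \cdot t_{\mathrm{sched}},$$

so for $N$ total tasks one would submit roughly $J = \lceil N/B \rceil$ jobs, subject to the slot-count considerations discussed next.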
The choice of how many jobs to break the HTC application into depends on a number of factors, including how many resources the queue has access to, how many users are trying to submit jobs to the queue at any given time, and how many slots on the queue a single user is allowed to consume at any one time.8 As a general rule of thumb, if your sequential application runs in a relatively consistent amount of time regardless of the input data, then coordinating the total number of job submissions with the number of slots available to you makes sense (e.g., if you have 10 slots available to you, having somewhere around 10 jobs might make sense, assuming that the runtime of each job is reasonable). However, if the runtime of your sequential program is highly variable depending on input, or the resources on which you are running have a high chance of failure, it makes more sense to decompose the problem into many more pieces. As jobs run through the queue, longer running jobs will consume a slot for a correspondingly large period of time while short running jobs finish early and vacate their respective slots, leaving them available for other jobs to run. This is an optimization mechanism known as load balancing, whereby longer running jobs have a decreased effect on the overall runtime length of the batch because many smaller runtime jobs have an opportunity to execute one after another at the same time. This principle is similar to how having multiple lanes of traffic improves the overall efficiency of cars moving along the road as opposed to having a single line of traffic whose speed is ultimately determined by the slowest driver. With the darts program, the method of decomposition, if not the number, was obvious; throwing many darts is identical in concept to throwing a single dart. However, this is not always the case. Sometimes, a sequential program does not naturally decompose into obvious pieces. The example first mentioned in this chapter (in which a scene from a movie was rendered using a batch of sequential jobs, each of which rendered a single frame from the movie) might not in fact be the best decomposition of the problem. If the time required to render a single frame is relatively small, then we would want to render many frames in a single job. Similarly, if the time required to render a single frame took a large amount of time, we would probably want to decompose the problem such that only portions of a frame were rendered by any given job. In both cases, the sequential program (and in fact the output files generated by the hypothetical decomposition) are not necessarily available. After all, the program generates pictures representing an entire frame, not pieces of a picture, or snippets of
8 To prevent one user's jobs from preventing another user's jobs from running, batch system administrators will often limit how many slots or nodes a user can simultaneously hold at any given time. Furthermore, the administrator will sometimes additionally limit how many jobs a user can have in the queue at any given time, regardless of how many are running.
a movie. To make these decompositions work, some amount of programming on the user's part is required. For the multiframe example, the answer is to submit jobs that are themselves scripts, each script executing the render program multiple times and generating multiple images. These images can then be tarred or zipped by the script and returned as the single result representing the snippet of the movie rendered (a sketch of such a wrapper is shown below). In the case of rendering pieces of a frame, the answer is not as simple. Unless the render program had the option to render a piece of a frame and generate a partial image file, the user would have to come up with a way of modifying the render program to perform these partial operations. He or she would also need a way of representing partial frames and later gluing them together.
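A minimal sketch of such a multiframe wrapper, assuming the same hypothetical render-frame program used in the other figures and hypothetical FIRSTFRAME/LASTFRAME variables passed in via qsub -v, might look like the following. It renders a range of frames and bundles them into one tar file as the job's single result:

#!/bin/bash
#PBS -q largeQueue

cd /home/jdoe/movie

# Render every frame in the range this job was given via
# qsub -v "FIRSTFRAME=..., LASTFRAME=...".
FILES=""
FRAME=$FIRSTFRAME
while [ $FRAME -le $LASTFRAME ]
do
    render-frame scene-1-frame-$FRAME.input scene-1-frame-$FRAME.tiff
    FILES="$FILES scene-1-frame-$FRAME.tiff"
    FRAME=$(( $FRAME + 1 ))
done

# Bundle the images into a single result for this movie snippet.
tar cf frames-$FIRSTFRAME-$LASTFRAME.tar $FILES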
3.5. Iterative refinement

The last example that we will look at in this chapter with respect to typical HTC use cases is that of iterative refinement. So far, all of the examples we have examined assume that we want to run a large number of sequential jobs to generate a large number of resultant outputs. These outputs would then, presumably, be combined together at the end to produce a single result. However, sometimes the parameter space is too large and the nature of the result space too unknown for the researcher to provide a sufficient set of boundary conditions for his or her application. Maybe he or she wants to see what wing angle to the nearest 10th of a degree and nearest 16th of an inch provides the best lift, but it takes too long to run all possible combinations. Using iterative refinement, the researcher first submits a portion of an HTC job examining a relatively broad spectrum of the target parameter space. As the results from those jobs arrive, he or she can analyze them to determine which ranges of the broad parameter study show promise and can thus narrow the space and submit new jobs as a refinement of his or her parameter-sweep study. The first study might analyze wing angles in increments of 5° and lengths in increments of 6 in. Based on the results of that study, new parameter spaces defined in terms of single degree and single inch increments are then launched for areas of interest to the researcher. Iterative refinement need not involve refinement of the parameter space. Sometimes, the refinement takes place instead in the sequential program or algorithm. A researcher comparing a protein sequence against a database of other sequences might first run an HTC application using a "quick and dirty" algorithm to determine which database sequences show promise. Then, based on these results, interesting database sequences could be compared against the test sample using a much slower but more accurate comparison algorithm. Regardless of the reason for the iterative refinement, the methods for controlling them are largely the same and, generally speaking, consist of building on what we have already seen in this chapter. Most of the differences lie in how to analyze the results of one run to generate the inputs for
the next run. Usually, a user will submit the first run using a control script like the ones we have described earlier, wait for that HTC run to fully complete, analyze the results, select parameters for the next run, and then submit the new run using either the same or a different control script (a sketch of this pattern is given below). However, sometimes it makes more sense to have a single control script analyzing the results of one run as they are returned from the batch system and then deciding on the fly whether or not to generate and submit a new run based on those results. Doing so, however, is a much more complicated task involving error checking with the queue (to make sure that the job was not lost or failed), making sure that the full results are available (i.e., that the job is completely finished and not simply in the process of generating results), and (potentially) simultaneously managing runs from multiple different refinements.
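Purely as an illustration — the sweep.sh name (standing in for the Figure 8.6 control script), the assumption that each winglift-*.dat file holds a single lift value, and the use of whole-degree angles are all our own simplifications — a coarse-to-fine driver in the spirit described above could look like this:

#!/bin/bash

# Phase 1: coarse sweep (5-degree angles, 6-inch lengths),
# using the six-argument control script from Figure 8.6.
./sweep.sh 0 45 5 24 120 6

# Wait for all coarse results to appear on the shared file system.
# (A robust script would also check the queue for lost jobs.)
EXPECTED=$(( (45 / 5 + 1) * ((120 - 24) / 6 + 1) ))
while [ `ls winglift-*.dat 2>/dev/null | wc -l` -lt $EXPECTED ]
do
    sleep 60
done

# Phase 2: pick the angle of the file with the largest lift value.
BEST=`grep . winglift-*.dat | sort -t: -k2 -g | tail -1 | cut -d- -f2`

# Phase 3: refine around that angle with 1-degree, 1-inch steps.
./sweep.sh $(( $BEST - 2 )) $(( $BEST + 2 )) 1 24 120 1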
4. Advanced Topics

So far, we have covered the relatively straightforward aspects of using systems to submit and control HTC jobs. In this next section, we will take a deeper look at some of the more advanced topics having to do with HTC applications, including restricting resources for scheduling purposes, checkpointing results, and staging data to and from local file systems. This is by no means an exhaustive list but rather an introduction to a few topics of interest and importance.
4.1. Resource restrictions

It used to be the case that organizations would set up a number of queues on batch systems, each one representing a certain job type for which a particular set of resources was intended. For example, an IT department might create one queue for long-running jobs and another one for short jobs; one queue might be intended for Linux machines and the other for Solaris. Increasingly, however, it is becoming more common for an IT department to have only one queue and to rely instead on the user submitting jobs to the queue with certain restrictions indicated. For example, a user can indicate in a PBS batch script how many processors he or she wants per node, how much memory, how long the job will take to run, or even what kind of operating system is preferred. Generally speaking, in order to get the most out of your batch system, you need to describe the appropriate amount of information for your IT department's resources. The following PBS submission script shows an example in which the user has requested that his or her job be put on a machine with two CPUs per node, that it will take 10 GB of memory when executing, and that it needs to be a Linux machine of some type (Fig. 8.10).
#!/bin/bash

#PBS -q largeQueue
#PBS -l ncpus=2:mem=10GB:arch=linux
#PBS -o /home/jdoe/movie/stdout.txt
#PBS -e /home/jdoe/movie/stderr.txt

echo $HOSTNAME
cd /home/jdoe/movie
render-frame scene-1-frame-1.input scene-1-frame-1.tiff
Figure 8.10 Example resource restrictions submission script.
4.2. Checkpointing

Probably the most important advanced topic—and one that is frequently overlooked when it comes to HTC—is that of checkpointing. While it would be wonderful if all jobs took 15 min to run to completion, the truth is that there are many applications that run for days, weeks, or even months. Unfortunately, it is unrealistic to assume that an application can run uninterrupted for long periods of time. Perhaps the program leaks memory, or perhaps the user is sharing the machine with another program that is leaking some operating system resource, thus making the machine unstable. Labs sometimes lose power for long periods of time, causing machines to fail while jobs are running. For that matter, it is often the case that the batch system itself is configured to kill jobs that take too long to execute.9 In the end, regardless of the cause, the end result is the same: the loss of all in-memory data and progress made on your long-running job. When you start talking about HTC applications, the odds of a long-running program failing to complete increase. By utilizing lots of machines at the same time, you inadvertently increase the chances that one of the machines on which your job is running is going to fail before it finishes. There are essentially two ways to deal with the problem of long-running jobs. One is simply to shorten your job so that it does not take as long but instead requires more runs to complete. For example, maybe each instance
9 Configuring a batch queuing system to limit jobs to a certain duration is often a bone of contention between users and administrators but is generally necessary to ensure fairness amongst all of the cluster's users.
of your program rendered 10 frames from a movie scene. Instead of running 1000 jobs, each rendering 10 frames of the movie, you could perhaps submit 10,000 jobs where each job rendered only one frame. The other solution is to employ something called checkpointing. Checkpointing is the act of periodically recording data about the progress of your program so that if the program should fail for whatever reason, you can simply restart the program from the last known checkpoint and continue from there. Unfortunately, checkpointing is an activity that many researchers ignore because it requires them to implement extra code that would not otherwise be necessary in a perfect world where nothing failed. Also, while a few projects have tried to make the process of checkpointing easier or automatic for applications, the truth is that none of these is perfect and the likelihood is that you will not have access to such a system. Further, it is not generally possible to describe a solution that works for all applications. Each application is different, and the nature of your application's checkpointing needs depends on how your program is structured. Furthermore, if you do not have access to the original program's source code, you may not have the ability to checkpoint at all. Checkpointing in HTC applications often involves storing intermediate state information about your running application in a shared directory (recall that most batch systems use a shared file system to ease the transfer of applications and data between resources behind the batch system). Your management script then needs to be able to detect when an application has failed and restart that program using the stored checkpoint. Imagine an application with the command line given in Fig. 8.11. Each run of this application takes an input file as a parameter describing the data to be analyzed. It also takes two additional parameters describing, respectively, the name of an output file to generate when the program is complete and the name of a checkpoint file to generate periodically as intermediate results become available.10 Finally, the application takes an optional parameter instructing the program to restart from an intermediate checkpoint file already available from a previous run. Given this application, we now revisit a job management script that we saw earlier in this chapter and modify it to work with our new application.

analyze-data <input-file> <output-file> <checkpoint-file> \
    [--restart <checkpoint-file>]
Figure 8.11 Example checkpointing command-line.
10 Implicit, in this example, is the assumption that the application binary removes checkpoint files as new checkpoints become available or as the program finishes successfully. If this is not the case, the control script needs to differentiate between checkpoint files that are still in use and those that are no longer needed.
If you compare this script with the first job management script given in this chapter, you can see that they are very similar to one another (Fig. 8.12). The only difference is that this script checks for the existence of a checkpoint file before submitting the job to the batch system. If the checkpoint file exists, then we use a different PBS submission template file (one that presumably uses the --restart version of the command). In this way, whenever we run the script, we will submit one job to the queue for each input file that does not yet have a corresponding output file, with the restart option given whenever an appropriately named checkpoint file exists. As with the previous case, it is important to understand the difference
#!/bin/bash

# We make a directory to keep the submission scripts in just to
# keep our working directory from getting cluttered.
mkdir -p scripts

# Iterate over all the files in the input directory.
for INPUTPATH in input/*
do
    # For each file, determine what its name is (without the path)
    # as well as the name of the desired output file and a
    # checkpoint file.
    INPUTFILE=`basename $INPUTPATH`
    OUTPUTFILE=`echo $INPUTFILE | sed -e "s/input/output/g"`
    CPFILE=`echo $INPUTFILE | sed -e "s/input/checkpoint/g"`

    # If the output file does not exist, create and submit
    # a PBS job.
    if [ ! -e output/$OUTPUTFILE ]
    then
        # Before submitting a job, we first check to see if there
        # is an intermediate checkpoint to restart from
        if [ -e checkpoints/$CPFILE ]
        then
            echo "Re-submitting job for input/$INPUTFILE"
            TEMPLATE=resubmission-template.pbs
        else
            echo "Submitting job for input/$INPUTFILE"
            TEMPLATE=submission-template.pbs
        fi

        qsub -v "INPUT=$INPUTFILE, CHECKPOINT=$CPFILE, \
            OUTPUT=$OUTPUTFILE" $TEMPLATE
    fi
done
Figure 8.12 Checkpointing example control script.
between a job that failed before because of some transient problem and one that fails consistently because of bad inputs or bad data. If your program crashes every time it tries to work with the data file given, no amount of checkpointing and restarting the program will fix that issue.
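One way to make that distinction concrete — this is our own illustrative addition, not part of the original figure, and the failures/ directory and the limit of three attempts are arbitrary choices — is to keep a per-input resubmission counter alongside the checkpoints and give up after a few attempts:

# Hypothetical guard to place around the qsub in Figure 8.12.
mkdir -p failures
COUNTFILE=failures/$INPUTFILE.count
COUNT=`cat $COUNTFILE 2>/dev/null || echo 0`
if [ $COUNT -ge 3 ]
then
    echo "Skipping input/$INPUTFILE: submitted $COUNT times; check the data"
else
    echo $(( $COUNT + 1 )) > $COUNTFILE
    qsub -v "INPUT=$INPUTFILE, CHECKPOINT=$CPFILE, \
        OUTPUT=$OUTPUTFILE" $TEMPLATE
fi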
4.3. File staging

File staging is another advanced topic that is sometimes useful (and, in fact, is sometimes required) for HTC applications. File staging is the act of copying a file in from a source to the compute node where the computation is taking place, or equivalently, copying some data file out to a target location from a compute node. There are many different ways to copy this data, including downloading it from the Web, copying it using ftp/sftp or rcp/scp, or even mailing a result file to an email address. In some cases, you may have no choice but to copy the data for an HTC job. Despite the fact that most batch systems (PBS included) tend to rely on shared file systems being available, sometimes the data that you need is not available on those file systems and sometimes it might be too large. Maybe
you have 1000 input files of 100 MB each and only 1 GB of disk space available (i.e., you have room to store a couple of input files at a time, but not enough to store all of them on the shared file systems). Performance is the other main reason why people will sometimes stage files in and out. When staging a file for performance, what you are essentially doing is paying an upfront cost for copying the file in from a slow storage system (such as NFS) to a faster one (such as the local disk) so that you can use the faster storage medium for repeated reads later. Sometimes, these repeated reads happen during the lifespan of a single program (e.g., the program may need to read a given file over and over again during its execution rather than read it once and store the information internally in memory). Other times, the file is reused multiple times as many different instances of a program are run for a given HTC application. Recall that it is not generally the case that if you have 1000 or 10,000 jobs to run, you will automatically have access to an equivalent number of resources. Generally, a batch system will run a few of your programs at a time and queue the rest until a resource becomes available. In this case, if you have a file that does not change between runs (what is often called a constant file), that file can be copied to local disk once and then repeatedly reused as other copies of the program are run. The following example illustrates a PBS submission script for a movie CG-rendering program that takes not only an input frame to render and the output image to which to render it, but also a texture input indicating a database of scene textures to use for the frame. Since this texture database can be reused for other frames that may later get rendered on this node, we try to copy it to local disk space once and reuse the local copy from there on out.11 File staging as it relates to creating local copies for performance reasons requires that you be aware of how the local disk space is cleaned up and when (Fig. 8.13). If you are sharing the local disk space with other users and the compute setup does not somehow automatically clean up local disk space, then computing etiquette would suggest that you have a way of cleaning up the local copies when your HTC run is complete (a simple sketch follows below). Conversely, if the nodes in the cluster have a mechanism in place for automatically cleaning up local disk space (e.g., every time they reboot), that event must also be anticipated.
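As a minimal illustration of that etiquette — the script below and the /local-disk path are assumptions carried over from the staging example in Fig. 8.13 — a final cleanup job, submitted once per node or run by hand after the HTC run completes, might simply remove the constant file:

#!/bin/bash
#PBS -q largeQueue

# Remove the locally staged constant file now that the HTC run
# is complete; -f keeps this quiet if the copy was never made.
rm -f /local-disk/jdoe/scene-textures.dat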
4.4. Grid systems

For the most part, the information given in this chapter is independent (except in the specific syntax) of the system providing the cluster management. Whether you are talking about PBS, SGE, Condor, or a grid such as Globus (http://www.globus.org/toolkit/) or Genesis II (http://www.cs.

11 Note that /local-disk and /shared-disk are used only as exemplars in this example. Every organization has its own setup for its respective compute clusters, and as such each user will need to determine the geography of his or her compute environment.
#!/bin/bash

#PBS -q largeQueue

if [ ! -e /local-disk/jdoe/scene-textures.dat ]
then
    cp /shared-disk/jdoe/scene-textures.dat \
        /local-disk/jdoe/scene-textures.dat
fi

render-frame scene-1-frame-1.dat \
    --textures /local-disk/jdoe/scene-textures.dat \
    scene-1-frame-1.tiff
Figure 8.13 Submission script for file staging example.
virginia.edu/vcgr/wiki/index.php/The_Genesis_II_Project; Morgan, 2007), generally speaking, there will be a way to execute jobs using a qsub-like mechanism, there will be a way of querying information about running or completed jobs, and there will be a way of killing or cleaning up jobs. However, there are a few differences between traditional batch systems and grids that are worth pointing out. Grid systems, like batch queuing systems, give users the ability to start, monitor, and manage jobs on remote back-end resources. They differ from batch systems in the flexibility that they offer users, both in terms of types and numbers of resources and in the availability of tools for job management and control. Batch systems usually restrict users to clusters of similarly configured machines (generally, though not always, of the same operating system and make). They also typically back-end to resources under a single administrative domain, inevitably limiting the number of resources available for use. Grids, on the other hand, are designed to support greatly varying resource types from numerous administrative domains. It is not at all uncommon for a grid system to include resources from multiple universities, companies, or national labs, ranging in type from large supercomputers or workstations running variations of UNIX to small desktop computers running Mac OS X or Windows to clusters of computers sitting in racks in a machine room somewhere. In fact, a grid system will often contain among its resources other batch queuing systems.
While many batch systems can front-end for heterogeneous compute nodes (i.e., compute nodes of differing architectures and operating systems), this is not generally put to use in most organizations. Usually, a given queue will submit jobs to only one type of compute node (sometimes identical in every regard, sometimes differing in insignificant ways such as hard drive size or clock speed). Grids, however, by their very nature tend to be quite diverse, supporting large numbers and types of resources ranging from Windows to Linux, desktop to rack-mount, and fast to slow. Sometimes the machines in grids will have policies in place to prohibit execution when someone is logged in to the machine, and sometimes they will not. This diversity means that when you submit a job to a grid, you will often need to specify the resource constraints applicable to your job, such as what operating system it needs and how much memory it requires. Given that grids support heterogeneous sets of machines, these machines are highly unlikely to support a shared file system (which, you will recall, was an outright assumption for most batch systems). Some grids do support shared namespaces and file systems through the use of grid-specific device drivers such as Genesis II's FUSE file system for Linux or its G-ICING Installable File System for Windows, but this is by no means guaranteed. Given this restriction, HTC applications running on a grid will often have no choice but to stage data in and out. Another difference between grids and batch systems is that grids often support machines in wildly differing administrative domains and situations. When an HTC job is running on a cluster of machines in a controlled environment such as a power-conditioned machine room, a user can be reasonably confident that his or her application could run for hours or even a day or more without interruption. However, when you start including machines in public computer labs at a university, or even those sitting in students' dorm rooms, the chances of the machine getting powered off or rebooted skyrocket. For this reason, when working with grids you will often need to be even more vigilant about picking appropriate job lengths and checkpointing. Finally, in a grid system the chances of your application being installed on any given machine, or installed with the correct plug-ins, modules, or libraries that you need, become vanishingly small. For this reason, grids often include some sort of mechanism for application configuration and deployment. While it may seem that using a grid instead of a compute cluster only complicates an already complex problem, it is important to realize that the benefits of grids can often outweigh these drawbacks. Grids are usually many orders of magnitude larger than clusters in terms of numbers of resources. They tend to be undersubscribed in terms of usage as compared to compute clusters, which are frequently oversubscribed. Also, they provide many other features and functions that clusters simply cannot, such as data
sharing and collaboration, fault-tolerance, Quality of Service (QoS) guarantees, etc. For many people, these benefits make the added complications worthwhile.
5. Summary

In this chapter, we have provided a brief introduction to HTC techniques as they relate to the sciences. We have tried to describe some of the more common patterns in the hope that the examples are both illustrative and potentially useful to users. However, no single example can ever be a one-size-fits-all solution. Every application has its own nuances and requirements, and each solution will by necessity tend to be unique to that application. We have shown that a good working knowledge of scripting can be invaluable to an HTC user and that familiarity with basic tools such as grep, sed, and awk tremendously enhances the ways in which a user can manage and control his or her jobs. Finally, we have tried to provide enough of an introduction to more advanced HTC topics such as staging and checkpointing to give the reader an idea of other areas of computation that he or she can explore if they seem relevant or important to his or her application space. HTC has been and remains one of the more effective means of parallelization available to the researcher. Having a good understanding of these techniques and mechanisms will aid you as you produce not only future applications but also the data that you will one day analyze with those applications. While it is an unfortunate fact of life that you sometimes must work with existing software over which you have little or no control, a working understanding of HTC techniques will, with a little bit of upfront planning, help you simplify your HTC control and submission scripts.
REFERENCES

Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. (2000). Parallel Programming in OpenMP. Morgan Kaufmann. ISBN 1-55860-671-8.
Gropp, W., Lusk, E., and Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Scientific and Engineering Computation Series. MIT Press, Cambridge, MA. ISBN 0-262-57104-8.
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit. Addison-Wesley, Reading, MA. ISBN 0-201-63337-X.
Leach, P. J., and Naik, D. C. (1997). A Common Internet File System (CIFS/1.0) Protocol. http://tools.ietf.org/html/draft-leach-cifs-v1-spec-01.txt. 19 December.
Morgan, M. M. (2007). Genesis II: Motivation, Architecture, and Experiences Using Emerging Web and OGF Standards. Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, May 2007. ISBN 0-7695-2833-3.
Sun Microsystems, Inc. (1989). NFS: Network File System Protocol Specification. IETF RFC 1094. March.
Thain, D., Tannenbaum, T., and Livny, M. (2005). Distributed Computing in Practice: The Condor Experience. Concurrency Comput. Pract. Exp. 17(2–4), 323–356. doi:10.1002/cpe.938.
CHAPTER NINE

Large Scale Transcriptome Data Integration Across Multiple Tissues to Decipher Stem Cell Signatures

Ghislain Bidaut*,†,‡ and Christian J. Stoeckert Jr.§

Contents
1. Introduction
2. Systems and Data Sources
   2.1. Computing environment
   2.2. Data source
   2.3. Normalization
   2.4. Databases
   2.5. Stem cells generalized hierarchy
3. Data Integration
   3.1. Integrating data to a final compendium indexed by common gene identifier
   3.2. Vector projection
   3.3. Variation filtering
4. Artificial Neural Network Training and Validation
   4.1. Leave-one-out validation—generation of 31 ANN models
   4.2. Minimal error data set
   4.3. Independence testing
   4.4. Applying the whole algorithm
   4.5. Results interpretation
5. Future Development and Enhancement Plans
Acknowledgments
References

* Inserm, UMR891, CRCM, Integrative Bioinformatics, Marseille, France
† Institut Paoli-Calmettes, Marseille, France
‡ Univ Méditerranée, Marseille, France
§ Center for Bioinformatics, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
Methods in Enzymology, Volume 467 © 2009 Elsevier Inc.
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67009-9. All rights reserved.
Abstract

A wide variety of stem cells has been reported to exist and renew several adult tissues, raising the question of the existence of a stemness signature—that is, a common molecular program of differentiation. To detect such a signature, we applied a data integration algorithm to several DNA microarray datasets generated by the Stem Cell Genome Anatomy Project (SCGAP) Consortium on several mouse and human tissues, to generate a cross-organism compendium that we submitted to a single-layer artificial neural network (ANN) trained to attribute differentiation labels—from totipotent stem cells to differentiated ones (five labels in total were used). The inherent architecture of the system allowed studying the biology behind stem cell differentiation stages, and the ANN isolated a 63-gene stemness signature. This chapter presents technological details on DNA microarray integration, ANN training through leave-one-out cross-validation, and independent testing on uncharacterized adult tissues by automated detection of differentiation capabilities in human prostate and mouse stomach progenitors. All scripts of the Stem Cell Analysis and characterization by Neural Networks (SCANN) project are available on the SourceForge Web site: http://scann.sourceforge.net
1. Introduction

In recent years, hundreds of data sets have been deposited in public repositories such as the Gene Expression Omnibus (Barrett et al., 2007) and ArrayExpress (Parkinson et al., 2009), including experiments on cancer studies, developmental biology, and others, in multiple organisms. These databases have the potential to shed light on unresolved biological questions, such as the determination of a common cell signature among the different stem and progenitor cell types reported to exist (a stemness signature). Several projects have provided researchers the ability to perform global queries on such databases, such as the SPELL Web interface (meta-analysis in Saccharomyces cerevisiae; see Hibbs et al., 2007) or the GeneSapiens system (Homo sapiens; see Kilpinen et al., 2008), which queries an integrated database generated from public transcriptome repositories. These systems are highly nonspecific, as they are based on techniques to integrate hundreds of datasets. On a smaller scale, several groups have developed data integration algorithms for cancer tumor classification on multiple datasets. The goal of these is to improve classification robustness when applying a classifier trained on a given dataset to an independent dataset. In Chuang et al. (2007), the authors pursued a protein-network-based classification by superimposing a breast cancer gene expression dataset on a large-scale protein network to detect subnetworks (subregions of the full interactome)
whose expression is highly correlated with distant metastasis. Results showed a significant increase in classifier accuracy when applied to independent data. Numerous rank-based methods have also been proposed, such as the work of Xu et al. (2005), where top-scoring pairs of genes are identified to form a marker list for prostate cancer. The last class of methods is based on combining inter-study data into a final dataset by specific data transformation methods, including data renormalization (Shen et al., 2004). We are proposing the extension of a method previously applied (Scearce et al., 2002)—the vector projection—to integrate multiple microarray datasets to answer the question of stemness existence—that is, to discover a shared transcriptional signature between multiple tissues. A classifier is trained on an integrated stem cell dataset (compendium) generated from a set of individual stem cell DNA microarray datasets, each of these experiments being consistently labeled on a generalized stem cell hierarchy. After extracting the signature on a training set, we applied it to characterize two tissues, mouse stomach progenitors and human prostate progenitors, to identify the type of stem or progenitor cell represented. In Sohal et al. (2008), the authors integrated data across multiple platforms to study hematopoietic stem and progenitor cells. In Gerrits et al. (2008), the authors integrated DNA microarray profiles and genetic linkage data from two genetically distinct mouse strains to find a stem cell signature in hematopoietic stem cells. This chapter describes the study from an experimental point of view, describing data sources, scripts, and algorithms that constitute the Stem Cell Analysis by Neural Network (SCANN) system; detailed biological results are available from a previous study (Bidaut and Stoeckert, 2009).
2. Systems and Data Sources

2.1. Computing environment

To manipulate such data structures, we assume proficiency in programming, preferably in languages such as Perl, Python, or Java, which allow for quick development of data file manipulations and mathematical transformations. The Perl language is used by the authors throughout the chapter for all scripting tasks. In addition, R and Bioconductor1 (Gentleman et al., 2004) must be installed in order to normalize the Affymetrix® DNA arrays. These languages can be run on most environments, but a Unix-like system (Solaris™, CentOS™, or Ubuntu Linux™) is preferably used on a server-class system (a multicore server with 8 GB+ RAM and 10 GB+ of disk space is recommended, especially if the algorithms presented here are to be applied to larger datasets). The software described in this chapter has been made available on SourceForge as the SCANN package, version 1.0. The complete archive can be retrieved at http://scann.sourceforge.net

1 Available from http://www.bioconductor.org.
2.2. Data source

Data sources are multiple and heterogeneous. We first trained the system with data generated by the Stem Cell Genome Anatomy Project (SCGAP) Consortium.2 The consortium generated data in multiple tissues, each member being in charge of a particular stem cell type. Most data are available for download from individual laboratories, with links accessible from the main SCGAP Web site. Other data are accessible through the respective authors' Web sites. Although multiple data types were generated, including immunohistochemistry experiments for localized gene expression measurements, only the DNA array data were kept in our integration study. Table 9.1 summarizes data sources, types, platforms, and locations on the Web.
2.3. Normalization

Normalization was done with the Bioconductor package, an R-based bioinformatics library that allows for analysis of most biological data types, including DNA/protein sequences, microarray data, and others. We are using the package affy, which allows for normalization of multiple Affymetrix chip types. Detailed documentation is available from the affy vignette (command vignette("affy") at the R prompt). The following procedure was followed on an example dataset measured on the Affymetrix® HG-U133A platform:

$ mkdir data_tmp
$ tar xf Data.tar -C data_tmp
$ cd data_tmp
Note that if the archive contains data from multiple platforms (for instance, MG-U430 A and B), the data must be uncompressed into separate directories. The following commands are then applied from the R environment:

> library(affy)
> RawData = ReadAffy()
The show() command gives details on loaded data.
2 http://www.scgap.org.
Table 9.1 Summary of data sources, platforms, and location

Author/lab                     Tissue                        Platform                       Source
Ochsner et al. (2007)          Mouse embryonic liver         Affymetrix® MG-U430A           http://liver-hsc.scgap.org/data.html
Rowe et al. (unpublished)      Mouse bone                    Affymetrix® MG-U74Av2, B, C    http://skeletalbiology.uchc.edu/30_ResearchProgram/304_gap/index.htm
Ivanova et al. (2002)          Human fetal liver (HSCs)      Affymetrix® HG-U95Av2, B, C    http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002)          Mouse fetal liver (HSCs)      Affymetrix® MG-U74Av2, B, C    http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002)          Mouse adult bone marrow       Affymetrix® MG-U74Av2, B, C    http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002)          Mouse embryonic stem cells    Affymetrix® MG-U74Av2, B, C    http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (2002)          Mouse neural stem cells       Affymetrix® MG-U74Av2, B, C    http://www.cbil.upenn.edu/SCGAP/related_data.html
Ivanova et al. (unpublished)   Human cord blood (HSCs)       Affymetrix® HG-U133A, B        http://www.cbil.upenn.edu/SCGAP/data.html
Ivanova et al. (unpublished)   Human adult bone marrow       Affymetrix® HG-U133A, B        http://www.cbil.upenn.edu/SCGAP/data.html
Ivanova et al. (unpublished)   Mouse adult bone marrow       Affymetrix® MG-U430A, B        http://www.cbil.upenn.edu/SCGAP/data.html
Oudes et al. (2006)            Human prostate progenitors    Affymetrix® HG-U133A, B        Available from the authors
Mills et al. (2002)            Mouse stomach progenitors     Affymetrix® Mu11K A, B         Available from the authors
Total: five distinct groups    12 distinct tissues           Six distinct platforms

Datasets printed in italic in the original (human prostate progenitors and mouse stomach progenitors) are test datasets characterized independently by the system.
> show(RawData)
AffyBatch object
size of arrays = 712x712 features (63 kb)
cdf = HG-U133A (22283 affyids)
number of samples = 20
number of genes = 22283
annotation = hgu133a
notes =
The data are then normalized with the expresso function (affy package) with the following options, for loess normalization:

> expr = expresso(RawData, bgcorrect.method = "mas",
      normalize.method = "loess", pmcorrect.method = "mas",
      summary.method = "medianpolish")
After normalization, the data object expr is exported to disk:

> write.exprs(expr, file = "NormalizedData.txt")
> q()
After normalizing all data, we have in hand several individual datasets grouped by tissues. Please note that data normalization was done on 2007 versions of Linux/Perl and R/Bioconductor and results may vary slightly from what we have obtained at this time.
2.4. Databases

To integrate data on a common framework (i.e., a common identification system), several databases must be downloaded and used.

1. The DFCI Resourcerer database: The Resourcerer database is a compendium of continuously maintained annotation files for standard DNA microarray platforms (Tsai et al., 2001). Platform files for our datasets are available through FTP download.3 These files are used to generate probeID–geneID correspondence tables.
2. NCBI Homologene database (Wheeler et al., 2008): This database4 is an in silico generated database of homologs across fully sequenced genomes. Homologs are represented as lists of NCBI Gene IDs/Symbols indexed by taxon IDs. Each homolog is indexed by a unique homologene ID linking several gene IDs from different taxons. This database is used to build a geneID–homologene ID correspondence table.
3. NCBI Gene_info database: This database5 is a flat file version of the NCBI Entrez Gene database (Wheeler et al., 2008). It provides information on genes—Gene Ontology, symbol, and synonyms. We use this file to build a geneID-to-symbol conversion table to allow mapping of gene IDs from heterogeneous species to our final integrated compendium (see Section 3.2).

3 ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/Resourcerer
4 ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data
5 ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
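Collecting these files can be scripted; the sketch below simply wraps the FTP URLs from the footnotes with wget. The db directory name and the HG_U133A.zip file name are our own hypothetical choices (the Resourcerer link is a directory of per-platform annotation files, so the appropriate file name must be appended):

#!/bin/bash

# Fetch the annotation databases used to build the
# ID correspondence tables (URLs from the footnotes above).
mkdir -p db && cd db

# Hypothetical platform file name under the Resourcerer directory.
wget ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/Resourcerer/HG_U133A.zip

wget ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data
wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
gunzip gene_info.gz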
2.5. Stem cells generalized hierarchy

To train a classifier on heterogeneous datasets, a coordinated and homogeneous description that can be applied to all these data was created with all SCGAP consortium members. These descriptions give the state of differentiation of a given tissue, upon which we wish to perform training and classification. To include prior knowledge in the system, we used the well-established stem cell hierarchy from the hematopoiesis system and generalized it to all tissues in our compendium by including other stem cell types (totipotent stem cells). The full hierarchy is shown in Fig. 9.1A and includes all of the following types:
A. Totipotent stem cells: capable of self-renewal and able to generate all cell types
B. Multipotent stem cells: capable of self-renewal and able to generate most cell types
C. Progenitor cells: capable of generating several cell types
D. Lineage-committed progenitor (LCP) cells: capable of generating a single or a restricted number of cell types
E. Differentiated cells: cells displaying the final phenotype
These cell categories were used to label our data consistently for tissue training and classification (Table 9.2). Some tissues do not contain all the categories—for instance, mouse bone is represented only in categories C, D, and E—and the classifier has to cope with such missing categories.
Figure 9.1 (A) The generalized stem cells hierarchy. Arrows denote the hierarchical order of cell differentiation as well as self-renewal capability of totipotent and multipotent stem cells. (B) The stem cell model vectors are represented (highlighted is the multipotent stem cells model vector that peaks for the multipotent category).
Table 9.2 Stem cell tissues and their category coverage

Tissue                                  Stem cell categories
Mouse embryonic liver                   C, E
Mouse bone                              C, D, E
Human fetal liver (HSCs)                B, D, E
Mouse fetal liver (HSCs)                B, D, E
Mouse adult bone marrow (HSCs)          B, C, D, E
Mouse embryonic stem cells (ESCs)       A
Mouse neural stem cells (NSCs)          B
Human cord blood (HSCs)                 B, D
Human adult bone marrow (HSCs)          B, D
Mouse adult bone marrow (HSCs)          B, C
Human prostate progenitors              X, E
Mouse stomach progenitors               X, E
12 distinct tissues                     Five distinct categories

The two uncharacterized tissues (printed in italic in the original) are denoted X.
3. Data Integration

3.1. Integrating data to a final compendium indexed by common gene identifier

The databases described in Section 2.4 were parsed and hash tables were generated in order to integrate all the datasets into a final compendium using a common identifier. The following steps were followed. Once normalized, platform-specific probe IDs were collapsed to common gene profiles by expression averaging, leaving a set of gene expression profiles indexed by gene IDs. We simultaneously collapsed multiple microarrays when necessary (for instance, mouse bone tissue samples are profiled on Affymetrix U74Av2, Bv2, and Cv2 chips). Then, these sets of expression profiles were aligned separately for each organism, and two separate tables were generated for mouse and human. Finally, the data from the two organisms were merged through the homologene ID-to-gene ID table. Homologene IDs were kept as long as at least one corresponding gene ID was present in either mouse or human. This resulted in a final data matrix of 18,720 homologs and 82 tissue samples (see file all_expression_Princeton_UConn_Baylor_WashU_ISB_separated_samples.txt, available from the SCANN Web site) describing gene expression across developmental stages for different tissues and organisms. Multiple IDs (mouse gene ID, human gene ID) were kept within the file for data verification purposes.
The following scripts from the SCANN distribution were used for the analysis:

match_resourcerer.pl: converts probe IDs to gene IDs
collapse_chips.pl: combines the several probe-level profiles mapping to the same gene ID into a single profile identified by that gene ID, and simultaneously combines several chips into a single data file
align_human_mouse.pl: aligns gene profiles from mouse and human to a single profile indexed by homologene ID

Note that missing values were propagated in the final file as "NaN."
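For readers who prefer R to Perl, a minimal sketch of the probe-to-gene collapsing step performed by collapse_chips.pl might look as follows (the object names expr and probe2gene are illustrative, not SCANN's):

collapse_to_genes <- function(expr, probe2gene) {
  ## expr: matrix of normalized intensities, rows = probe IDs
  ## probe2gene: named vector mapping probe IDs to NCBI gene IDs
  gene_ids <- probe2gene[rownames(expr)]
  keep <- !is.na(gene_ids)                       # drop unannotated probes
  sums <- rowsum(expr[keep, , drop = FALSE], group = gene_ids[keep])
  counts <- as.vector(table(gene_ids[keep])[rownames(sums)])
  sums / counts                                  # average profiles per gene ID
}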
3.2. Vector projection

Samples were selected from every tissue to form groups of progenitor–differentiated cells. We populated the hierarchy as much as possible, depending on the available samples (Table 9.2). To quickly capture gene expression variation across categories, we projected the gene expression profiles across categories for a given tissue onto a set of five vectors modeling the gene expression profiles we wished to detect. The vectors are detailed in Fig. 9.1B; the arrow shows the example of the multipotent stem cell model vector (B), which peaks over category B and has a lower expression over the other categories. Each vector has been designed to extract genes with a higher expression over a given category. The projection itself is a dot product:

p = ⟨gene_profile, model_vector⟩    (9.1)

The two vectors gene_profile and model_vector are normalized to 1.0 before projection to remove variation effects inherent to the nature of the data. Many genes are characterized by missing expression values, and several tissues are not profiled on all five categories. To cope with missing values, we devised a strategy where the dot product is computed without the missing points; to compensate, the vectors minus the missing categories are renormalized. A projection matrix of 18 K genes, 12 tissues, and 5 projection values per tissue was obtained. This is the dataset we submit to the neural network for training and classification.
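A minimal R sketch of this projection with missing-value handling is shown below (the vector contents are illustrative; the five positions correspond to categories A–E):

project_profile <- function(gene_profile, model_vector) {
  ok <- !is.na(gene_profile)                 # skip missing categories
  if (!any(ok)) return(NA)
  g <- gene_profile[ok] / sqrt(sum(gene_profile[ok]^2))   # renormalize
  m <- model_vector[ok] / sqrt(sum(model_vector[ok]^2))   # renormalize
  sum(g * m)                                 # Eq. (9.1): p = <g, m>
}

## e.g., a profile missing category C, projected on the "B" model vector:
project_profile(c(0.1, 0.9, NA, 0.2, 0.1), c(0.2, 1, 0.2, 0.2, 0.2))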
3.3. Variation filtering

The final dataset was variation filtered: we kept only the genes characterized by a projection value of at least th_stage over at least n_stage tissues for a given developmental stage. Table 9.3 recapitulates these settings. Thresholds were parameterized separately for each stage to compensate for the overlap variation between stem cell stages. (There is a
Table 9.3 Parameter settings for the vector projection threshold (th_stage) over a minimum set of tissues (n_stage)

Stem cell stage                        th_stage    n_stage
A: Totipotent stem cells               0.62        20
B: Multipotent stem cells              0.64        20
C: Progenitor cells                    0.8         20
D: Lineage-committed progenitors       0.8         18
E: Differentiated cells                0.94        3
higher gene expression overlap for early progenitors than for differentiated cells, which are more heterogeneous.) This leads to a final input dataset of 3939 genes.
4. Artificial Neural Network Training and Validation

To extract the common molecular program from the set of tissues, we trained a multiclass, single-layer artificial neural network (ANN) on the final combined dataset. The ANN, presented in Fig. 9.2, is an extension of the single-layer associative memory of Greer and Khan (2007). It is built around five neurons corresponding to the five stem cell developmental stages A, B, C, D, and E—each of them characterized by a set of n weights, n being the input data size. The output of each neuron is defined as the following dot product:

y_j = Σ_{i=0..N} w_ij x_i    (9.2)

where x is the input data (the expression profile projections defined in Section 3.2), w is the set of weights for the current neuron, and y is the neuron activity. The ANN is trained with a subset of the data—the training set—comprising the projected data for hematopoietic stem cells, mouse neuronal stem cells, mouse embryonic stem cells, and mouse bone progenitors, representing a total of 31 tissues. Human prostate and mouse stomach progenitor data (comprising nine samples in total) are left out for independent testing.
4.1. Leave-one-out validation—generation of 31 ANN models

The training is organized through several cross-validation steps performed on a reduced set of genes, to find the set of genes minimizing the training error. At each cross-validation step, one tissue is left out (leave-one-out cross-validation, LOO) and the network is trained for 200 epochs, with the 30 remaining tissues presented in a random order at each epoch.
Figure 9.2 The full analysis procedure. From the top: basic normalization and annotation steps, data integration, model vector projections, and variation filtering. After this step, the 3939-gene dataset is subdivided into training and testing subsets. The training dataset goes into a double loop of leave-one-out validation by the artificial neural network (ANN) and size reduction (only the significant genes are kept, according to a schedule defined in Section 4.2—not represented here). Finally, 31 ANN models are kept for a minimal error rate obtained with 63 genes, and combined for testing on the independent dataset by majority voting.
At each epoch, the tissues are presented and the weights are updated in proportion to the difference between the obtained output and the desired output:

Δw_j = α(n) [y_j − y_dj] x    (9.3)

where w_j is the jth weight vector, α(n) is a monotonically decreasing function of n (the number of epochs), y_j is the obtained output, y_dj is the desired output, and x is the input vector. α(n) is defined as follows:

α(n) = 1 / (n/30 + 1)    (9.4)
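A compact R sketch of one training epoch, directly transcribing Eqs. (9.2)–(9.4), is given below. The names W (a 5 x n weight matrix), X (the tissue x gene projection matrix), and Yd (the desired outputs, one row per tissue) are illustrative; the sign convention applies the correction as W ← W − Δw, which drives each output toward its desired value.

train_epoch <- function(W, X, Yd, n) {
  alpha <- 1 / (n / 30 + 1)                   # Eq. (9.4)
  for (t in sample(nrow(X))) {                # tissues in random order
    x <- X[t, ]
    y <- as.vector(W %*% x)                   # Eq. (9.2): y_j = sum_i w_ij x_i
    W <- W - alpha * outer(y - Yd[t, ], x)    # Eq. (9.3), applied as W <- W - dW
  }
  W
}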
After 200 epochs, the LOO cross-validation yields 31 ANN models. For classification purposes, these 31 models are combined through majority vote, for instance, for the characterization of unknown tissues/samples (script classifytissue.pl; see also Section 4.3). We plotted the quadratic error during the ANN training (Fig. 9.3A) and testing (Fig. 9.3B) phases to ensure that no signs of overtraining were present. Overtraining typically shows up as an increasing squared-error curve, indicating that the network has overspecialized on some training samples and cannot generalize to others. This is a necessary test to ensure both the quality of the training procedure and the absence of biases in the training dataset.
4.2. Minimal error data set

During training, we successively reduced the training input size. The first training set comprised the 3939 genes retained after variation filtering (Section 3.3). Subsequently, the set was reduced by keeping the m most significant weights for each network (weights were sorted in decreasing order of absolute value, and the top m genes were kept). We plotted the systematic error for m taken from this list of values (Fig. 9.4A): m = [400, 300, 200, 175, 150, 125, 100, 75, 65, 60, 50, 45, 40, 35, 30, 25, 23, 22, 20, 18, 16, 14, 12, 10, 8, 7, 6, 5]. Minimal error was obtained for m = 16, corresponding to a set of 63 genes, as shown in Fig. 9.4A. Five errors were reported on individual ANNs (majority voting set this error to 0).
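The reduction step itself amounts to ranking the columns of the weight matrix; a possible R sketch (with an averaged 5 x n weight matrix W whose columns are named by gene) follows:

top_m_genes <- function(W, m = 16) {
  per_neuron <- apply(abs(W), 1, function(w)
    colnames(W)[order(w, decreasing = TRUE)[1:m]])
  sort(unique(as.vector(per_neuron)))  # union over the five neurons
}
## For m = 16, overlaps between neurons reduce the 5 * 16 = 80
## selections to a smaller union, e.g., the 63-gene set reported above.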
4.3. Independence testing

The 31 ANN models obtained for m = 16 (63 genes) are combined in a majority vote for testing and classification on independent data. We used these for classification of human prostate progenitor and mouse stomach epithelium samples. These tissues potentially contain adult stem cell progenitors, but with unknown differentiation capabilities. To classify them on the generalized stem cell hierarchy shown in Fig. 9.1A, we presented them to the network and obtained the following results:
Figure 9.3 Training error for the training (A) and testing (B) sets during the leave-one-out procedure. The traces are strictly decreasing and show no signs of overtraining.
Mouse stomach progenitors: classified as category C (progenitors)
Human prostate progenitors: classified as category B (multipotent stem cells)
4.4. Applying the whole algorithm

The whole data analysis pipeline (variation filtering, vector projection, ANN training with leave-one-out testing) is implemented in the classtbynnshuffl.pl script from the SCANN package. Here is the command-line syntax for 200 epochs and independence testing on mouse stomach and human prostate tissues:
Figure 9.4 Systematic errors. In (A), the graph shows the sum of errors across all 31 ANN models for a progressively reduced input dataset. The minimum is obtained for 63 genes—please note that the error is plotted for individual ANN models and that it decreases to 0 when majority voting is taken into account. In (B), the quadratic error rate is plotted for the 31 individual ANN models, for an input set of 63 genes, corresponding to the 31 held-out tissues. ANN models committing errors are indicated by arrows.

$ /home/Bidaut/bin/scann/classtbynnshuffl.pl \
    -i /data/stem-cell-project-data/final_dataset/all_expression_Princeton_UConn_Baylor_WashU_ISB_separated_samples.txt \
    -h /home/bidaut/data/annotations/list_gene_id_human.txt \
    -m /home/bidaut/data/annotations/list_gene_id_mouse.txt \
    -g /home/bidaut/data/annotations/gene2go \
    -mp 1 -t 'mouse gut 1' -t 'mouse gut 2' -t 'human prostate 1' \
    -t 'human prostate 2' -t 'human prostate 3' -t 'human prostate 4' \
    -t 'human prostate 5' -t 'human prostate 6' -t 'human prostate 7' \
    -nepoch 200

list_gene_id_human.txt and list_gene_id_mouse.txt are
tab-delimited files containing NCBI geneIDs, gene symbols, and textual description for human and mouse, respectively. These are generated from the NCBI gene_info file (Section 2.4).
4.5. Results interpretation

The proposed neural network architecture (a single layer) allows detailed exploration of two important aspects of the model—something usually not possible with classification systems that operate as "black boxes" and allow no insight into the classification process or into eventual biases in the training data:
- Ranking the weights in increasing order for each of the five stages allows extraction of the genes reported by the classifier to be of utmost importance in the biology of differentiation at that particular stem cell stage.
- Ranking genes on the vector y obtained by applying an input vector x (corresponding to a given tissue to characterize) to the neural network (y = x·w) allows isolation of the gene profiles critical for the proper association of this tissue.

Although a hidden layer might lower the classification error, the complexity of the data might drive the network to overtraining, artificially fitting the weights to the training data without being able to generalize to unknown tissues. Also, this type of multilayer perceptron does not allow for weight interpretation, and the ability to correlate lists of markers with stem cell developmental stages would be lost.

The list of 63 genes linked with every stem cell stage is available from the supporting Web site and is detailed in Bidaut and Stoeckert (2009). Briefly, we found genes involved in development (Hopx), genes involved in cancer (Letmd1), and some stem cell markers (CD109 is a cell surface antigen found on a subset of hematopoietic stem cells, FIAT is a transcriptional regulator of osteoblastic functions, and Sfrp4 is a Wnt pathway inhibitor that plays a central role in cell fate decisions).
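The second kind of ranking can be sketched in a few lines of R; given the weight matrix W and an input vector x for a tissue to characterize (illustrative names, as above), the per-gene contributions to the winning neuron's activity are simply the products w_ij * x_i:

explain_classification <- function(W, x, top = 10) {
  y <- as.vector(W %*% x)                  # neuron activities, y = W x
  j <- which.max(y)                        # winning developmental stage
  contrib <- W[j, ] * x                    # per-gene contribution to y_j
  head(sort(contrib, decreasing = TRUE), top)
}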
5. Future Development and Enhancement Plans

Our analysis identified genes not previously linked to stem cell differentiation or cancer—these genes are regulated by stem cell genes, which are downstream receptors of development pathways. Although these genes are
good discriminators, they are poor descriptors of the biology linked to differentiation. A possible improvement would be to sort out these genes and keep only the upstream regulators of development/differentiation for every stem cell differentiation stage, by interactome–transcriptome integration (Chuang et al., 2007). We are also planning to improve our technological base through (i) implementation and parallelization of the algorithm on a Linux Beowulf cluster, and (ii) direct use of stem cell data stored in public repositories, to extend our data compendium. Also, a recently published package could allow us to perform annotation and data integration under R (Kuhn et al., 2008). Improvement in classification is envisioned through boosting, and implementation on a public server is planned.
ACKNOWLEDGMENTS
Ghislain Bidaut is funded by the Institut National de la Santé et de la Recherche Médicale, the Fondation pour la Recherche Médicale, and the Institut National du Cancer (Grant 08/3D1616/Inserm-03-01/NG-NC). This work was initially funded by grant U01 DK63481 to Chris Stoeckert. Thanks to Wahiba Gherraby for reading the manuscript and to all members of the SCGAP consortium for sharing data and insights for this project.
REFERENCES

Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: Mining tens of millions of expression profiles—Database and tools update. Nucleic Acids Res. 35(Database issue), D760–D765.
Bidaut, G., and Stoeckert, C. J., Jr. (2009). Characterization of unknown adult stem cell samples by large scale data integration and artificial neural networks. Pac. Symp. Biocomput. 356–367.
Chuang, H. Y., Lee, E., Liu, Y. T., Lee, D., and Ideker, T. (2007). Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5(10), R80.
Gerrits, A., Dykstra, B., Otten, M., Bystrykh, L., and de Haan, G. (2008). Combining transcriptional profiling and genetic linkage analysis to uncover gene networks operating in hematopoietic stem cells and their progeny. Immunogenetics 60(8), 411–422.
Greer, B., and Khan, J. (2007). Online analysis of microarray data using artificial neural networks. Methods Mol. Biol. 377, 61–74.
Hibbs, M. A., Hess, D. C., Myers, C. L., Huttenhower, C., Li, K., and Troyanskaya, O. G. (2007). Exploring the functional landscape of gene expression: Directed search of large microarray compendia. Bioinformatics 23(20), 2692–2699.
Ivanova, N. B., Dimos, J. T., Schaniel, C., Hackney, J. A., Moore, K. A., and Lemischka, I. R. (2002). A stem cell molecular signature. Science 298(5593), 601–604.
Kilpinen, S., Autio, R., Ojala, K., Iljin, K., Bucher, E., Sara, H., Pisto, T., Saarela, M., Skotheim, R. I., Björkman, M., Mpindi, J. P., Haapa-Paananen, S., et al. (2008). Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues. Genome Biol. 9(9), R139.
Kuhn, A., Luthi-Carter, R., and Delorenzi, M. (2008). Cross-species and cross-platform gene expression studies with the Bioconductor-compliant R package 'annotationTools'. BMC Bioinformatics 9, 26.
Mills, J. C., Andersson, N., Hong, C. V., Stappenbeck, T. S., and Gordon, J. I. (2002). Molecular characterization of mouse gastric epithelial progenitor cells. Proc. Natl. Acad. Sci. USA 99(23), 14819–14824.
Ochsner, S. A., Strick-Marchand, H., Qiu, Q., Venable, S., Dean, A., Wilde, M., Weiss, M. C., and Darlington, G. J. (2007). Transcriptional profiling of bipotential embryonic liver cells to identify liver progenitor cell surface markers. Stem Cells 25(10), 2476–2487.
Oudes, A. J., Campbell, D. S., Sorensen, C. M., Walashek, L. S., True, L. D., and Liu, A. Y. (2006). Transcriptomes of human prostate cells. BMC Genomics 7, 92.
Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., Holloway, E., Lukk, M., et al. (2009). ArrayExpress update—From an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37(Database issue), D868–D872.
Scearce, L. M., Brestelli, J. E., McWeeney, S. K., Lee, C. S., Mazzarelli, J., Pinney, D. F., Pizarro, A., Stoeckert, C. J., Jr., Clifton, S. W., Permutt, M. A., Brown, J., Melton, D. A., et al. (2002). Functional genomics of the endocrine pancreas: The pancreas clone set and PancChip, new resources for diabetes research. Diabetes 51(7), 1997–2004.
Shen, R., Ghosh, D., and Chinnaiyan, A. M. (2004). Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics 5(1), 94.
Sohal, D., Yeatts, A., Ye, K., Pellagatti, A., Zhou, L., Pahanish, P., Mo, Y., Bhagat, T., Mariadason, J., Boultwood, J., Melnick, A., Greally, J., et al. (2008). Meta-analysis of microarray studies reveals a novel hematopoietic progenitor cell signature and demonstrates feasibility of inter-platform data integration. PLoS ONE 3(8), e2965.
Tsai, J., Sultana, R., Lee, Y., Pertea, G., Karamycheva, S., Antonescu, V., Cho, J., Parvizi, B., Cheung, F., and Quackenbush, J. (2001). Resourcerer: A database for annotating and linking microarray resources within and across species. Genome Biol. 2(11), SOFTWARE0002.
Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., Dicuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L. Y., et al. (2008). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36(Database issue), D13–D21.
Xu, L., Tan, A. C., Naiman, D. Q., Geman, D., and Winslow, R. L. (2005). Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics 21(20), 3905–3911.
CHAPTER TEN

DynaFit—A Software Package for Enzymology

Petr Kuzmič

Contents
1. Introduction
2. Equilibrium Binding Studies
   2.1. Experiments involving intensive physical quantities
   2.2. Independent binding sites and statistical factors
3. Initial Rates of Enzyme Reactions
   3.1. Thermodynamic cycles in initial rate models
4. Time Course of Enzyme Reactions
   4.1. Invariant concentrations of reactants
5. General Methods and Algorithms
   5.1. Initial estimates of model parameters
   5.2. Uncertainty of model parameters
   5.3. Model-discrimination analysis
6. Concluding Remarks
   6.1. Model discrimination analysis
   6.2. Optimal design of experiments
Acknowledgments
References
Abstract

Since its original publication, the DynaFit software package [Kuzmič, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260–273] has been used in more than 500 published studies. Most applications have been in biochemistry, especially in enzyme kinetics. This paper describes a number of recently added features and capabilities, in the hope that the tool will continue to be useful to the enzymological community. Fully functional DynaFit continues to be freely available to all academic researchers from http://www.biokin.com.
BioKin Ltd., Watertown, Massachusetts, USA

Methods in Enzymology, Volume 467; ISSN 0076-6879; DOI: 10.1016/S0076-6879(09)67010-5. © 2009 Elsevier Inc. All rights reserved.
1. Introduction

DynaFit (Kuzmič, 1996) is a software package for the statistical analysis of experimental data that arise in biochemistry (e.g., enzyme kinetics; Leskovar et al., 2008), biophysics (protein folding; Bosco et al., 2009), organic chemistry (organic reaction mechanisms; Storme et al., 2009), physical chemistry (guest–host complexation equilibria; Gasa et al., 2009), food chemistry (fermentation dynamics; Van Boekel, 2000), chemical engineering (bio-reactor design; Von Weymarn et al., 2002), environmental science (bio-sensors for heavy metals; Le Clainche and Vita, 2006), and related areas. The common features of these diverse systems are that (a) the underlying theoretical model is based on the mass action law (Guldberg and Waage, 1879); (b) the model can be formulated in terms of stoichiometric equations; and (c) the experimentally observable quantity is a linear function of concentrations or, more generally, populations of reactive species.

The main use of DynaFit is in establishing the detailed molecular mechanisms of the physical, chemical, or biological processes under investigation. Once the molecular mechanism has been identified, DynaFit can be used for routine quantitative determination of either microscopic rate constants or thermodynamic equilibrium constants that characterize individual reaction steps.

DynaFit can be used for the statistical analysis of three different classes of experiments: (1) the progress of chemical or biochemical reactions over time; (2) the initial rates of enzyme reactions, under either the rapid-equilibrium or the steady-state approximation (Segel, 1975); and (3) equilibrium ligand-binding studies.

Regardless of the type of experiment, the main benefit of using the DynaFit package is that it allows the investigator to specify the fitting model in biochemical notation (e.g., E + S <==> E.S --> E + P) instead of mathematical notation (e.g., v = kcat[E]0[S]0/([S]0 + Km)). For example, to fit a set of initial rates of an enzyme reaction to a steady-state kinetic model for the "Bi Bi Random" mechanism (Segel, 1975, p. 647) (Scheme 10.1), the investigator can specify the following text in the DynaFit input file:

[data]
   data = rates
   approximation = steady-state
[mechanism]
   E + A <==> E.A : k1 k2
   E.A + B <==> E.A.B : k3 k4
   E.A.B <==> E.B + A : k5 k6
   E.B <==> E + B : k7 k8
   E.A.B --> E + P + Q : k9
[constants]
   k8 = (k1 k3 k5 k7) / (k2 k4 k6)
...

Scheme 10.1 [The "Bi Bi Random" mechanism: E + A <==> E•A (k1, k2); E•A + B <==> E•A•B (k3, k4); E•A•B <==> E•B + A (k5, k6); E•B <==> E + B (k7, k8); E•A•B --> E + P + Q (k9)]
The program will internally derive the initial rate law corresponding to this steady-state reaction mechanism (or any arbitrary mechanism), and perform the least-squares fit of the experimental data. This allows the investigator to focus exclusively on the biochemistry, rather than on the mathematics. Using exactly equivalent notation, one can analyze equilibrium binding data, such as those arising in competitive ligand displacement assays, or time-course data from continuous assays.

Importantly, the DynaFit algorithm does not make any assumptions regarding the relative concentrations of reactants. Specifically, it is no longer necessary to assume that the enzyme concentration is negligibly small compared to the concentrations of reactants (substrates and products) and modifiers (inhibitors and activators). This feature is especially valuable for the kinetic analysis of "slow, tight" enzyme inhibitors (Morrison and Walsh, 1988; Szedlacsek and Duggleby, 1995; Williams and Morrison, 1979).

Since its original publication (Kuzmič, 1996), DynaFit has been utilized in more than 500 journal articles. In the intervening time, many new features have been added. The main purpose of this report is to give a brief sampling of several newly added capabilities, which might be of interest specifically to the enzymological community. The survey of DynaFit updates is by no means comprehensive; the full program documentation is available online (http://www.biokin.com/dynafit).

This article has been divided into four parts. The first three parts touch on the three main types of experiments: (1) equilibrium ligand binding studies; (2) initial rates of enzyme reactions; and (3) the time course of enzyme reactions. The fourth and last part contains a brief overview of selected data-analytical approaches, which are common to all three major experiment types.
2. Equilibrium Binding Studies

DynaFit can be used to fit, or to simulate, equilibrium binding data. The main purpose is to determine the number of distinct noncovalent molecular complexes, the stoichiometry of these complexes in terms of component molecular species, and the requisite equilibrium constants. The most recent version of the software includes features and capabilities that go beyond the original publication (Kuzmič, 1996). For example, DynaFit can now be used to analyze equilibrium binding data involving—at least in principle—an unlimited number of simultaneously varied components. A practically useful four-component mixture might include (1) a protein kinase; (2) a Eu-labeled antibody (a FRET donor) raised against the kinase; (3) a kinase inhibitor, whose dissociation constant is being measured; and (4) a fluorogenic FRET-acceptor molecule competing with the inhibitor for binding. Investigations are currently ongoing into the optimal design of such multicomponent equilibrium binding studies.
2.1. Experiments involving intensive physical quantities

DynaFit can analyze equilibrium binding experiments involving intensive physical quantities. Unlike their counterparts, the extensive physical quantities, intensive quantities do not depend on the total amount of material present in the system. Instead, intensive quantities are proportional to mole fractions of chemical or biochemical substances. A prime example of an intensive physical quantity is the NMR chemical shift (assuming that fast-exchange conditions apply, where the chemical shift is a weighted average of the chemical shifts of all microscopic states of the given nucleus).

We have recently used this technique to investigate the guest–host complexation mechanism in a system involving three different ionic species of a guest molecule (paraquat, acting as the "ligand") binding to a crown-ether molecule (acting as the "receptor"), with either 1:1 or 1:2 stoichiometry (Gasa et al., 2009). This guest–host system involved four components forming up to nine noncovalent molecular complexes, and a correspondingly large number of microscopic equilibrium constants. DynaFit has also been used in the NMR context to determine the binding affinity between the RIZ1 tumor suppressor protein and a model peptide representing histone H3 (Briknarová et al., 2008). The following illustrative example involves the use of DynaFit for highly precise determination of a protein–ligand equilibrium binding constant.
2.1.1. NMR study of protein–protein interactions
Figure 10.1 (unpublished data courtesy of K. Briknarová and J. Bouchard, University of Montana) displays the changes in NMR chemical shifts for six different protons and six different nitrogen nuclei in the PR domain of the transcription factor PRDM5 (Deng and Huang, 2004), depending on the concentration of a model peptide ligand. The NMR chemical shift data for all 12 nuclei were analyzed in the global mode (Beechem, 1992). The main purpose of this experiment was to determine the strength of the binding interaction. It was assumed that the binding occurs with the simplest 1:1 stoichiometry. A DynaFit code fragment corresponding to Scheme 10.2 is shown as follows:

[mechanism]
   R + L <==> R.L : Kd1 dissociation
[responses]
   intensive
[data]
   plot titration
...
Figure 10.1 NMR chemical shift titration of the PRDM5 protein (total concentration varied between 0.125 and 0.1172 mM) with a model peptide ligand. Left: 1H chemical shifts of six selected protons. Right: 15N chemical shifts of six selected nitrogen nuclei. The chemical shifts for all 12 nuclei were fit globally (Beechem, 1992) to the binding model shown in Scheme 10.2.

Scheme 10.2 [R + L <==> R•L, with dissociation equilibrium constant Kd1]

Note the use of the keyword intensive in the [responses] section of the script, which means that the observed physical quantity
(chemical shift) is proportional not to the quantity of the various molecular species present in the sample, but rather to the corresponding mole fractions. Also note the keyword titration, which is used to produce a simple Cartesian plot—with the ligand concentration [L] formally acting as the only independent variable—even though the experiment was performed by gradual addition of ligand to the same initial protein sample. This means that both the protein (titrand) and the model peptide (titrant) concentrations were changing with each added aliquot. It is very important to recognize that, in this case, the experimental data points are not statistically independent, as is implicitly assumed by the theory of nonlinear least-squares regression (Johnson, 1992, 1994; Johnson and Frasier, 1985). However, the practice of incrementally adding to the same base solution of the titrand is firmly established in protein–protein and protein–ligand NMR titration studies.

The best-fit value of the dissociation equilibrium constant determined from the data shown in Fig. 10.1 was Kd1 = (0.087 ± 0.007) [0.073 ... 0.108] mM. The values in square brackets are approximate confidence intervals determined by the profile-t method of Bates and Watts (Brooks et al., 1994). Please note that, unlike the formal standard error shown in parentheses, the confidence interval is not symmetrical about the best-fit value. Using the global fit method (Beechem, 1992), the strength of the protein–ligand binding interaction was determined for a number of different nuclei, and the results were highly consistent; the coefficient of variation for the equilibrium constant was approximately 10% regardless of which chemical shift was monitored.
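For readers who wish to reproduce this kind of fit outside DynaFit, the underlying intensive-response model for 1:1 binding can be written in a few lines of R. The limiting shifts d_free and d_bound, and the fixed protein concentration used in the example call, are illustrative assumptions (in the actual experiment the protein concentration decreased slightly with each aliquot):

shift_1to1 <- function(R_tot, L_tot, Kd, d_free = 0, d_bound = 1) {
  b <- R_tot + L_tot + Kd
  RL <- (b - sqrt(b^2 - 4 * R_tot * L_tot)) / 2   # exact [RL] (quadratic root)
  x_bound <- RL / R_tot                           # mole fraction of bound protein
  (1 - x_bound) * d_free + x_bound * d_bound      # fast-exchange average shift
}

## e.g., predicted shifts across a ligand series at 0.125 mM protein:
shift_1to1(R_tot = 0.125, L_tot = c(0, 0.25, 0.5, 1, 2), Kd = 0.087)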
2.2. Independent binding sites and statistical factors

The most recent version of DynaFit (Kuzmič, 1996) allows the investigator to properly define the relationship between (a) intrinsic rate constants or equilibrium constants and (b) macroscopic rate constants or equilibrium constants. This distinction is necessary in the analysis of multiple identical binding sites. As the simplest possible example, consider the binding of L, a ligand molecule, to R, a receptor molecule that contains two identical and independent binding sites (Scheme 10.3).

Scheme 10.3 [R + L <==> R•L (association rate 2 ka, dissociation rate kd); R•L + L <==> R•L2 (association rate ka, dissociation rate 2 kd)]
In Scheme 10.3, ka and kd are intrinsic rate constants. The statistical factors ("2") shown in Scheme 10.3 express the fact that there are two identical pathways for L to associate with R, but only one way for L to associate with RL. Similarly, RL2 can yield RL in two equivalent ways, whereas RL can dissociate into R + L in only one way. Thus, if we define the first and second dissociation equilibrium constants as K1 = [RL]eq[L]eq/[RL2]eq and K2 = [R]eq[L]eq/[RL]eq, then for independent equivalent sites we must have K1 = 4K2. In the DynaFit notation, the difference between independent and interacting binding sites can be expressed by using the following syntax:

[task]
   data = equilibria
   model = interacting ?
[mechanism]
   R + L <==> R.L : K1 dissociation
   R.L + L <==> R.L.L : K2 dissociation
[constants]
   K1 = ...
   K2 = ...
...
[task]
   data = equilibria
   model = independent ?
[mechanism]
   R + L <==> R.L : K1 dissociation
   R.L + L <==> R.L.L : K2 dissociation
[constants]
   K1 = 4 * K2
...
   E.dextran.S : k3
   E.dextran.S ---> E.dextran + P : k4
[concentrations]
   E = 0.18 ! ; invariant
   S = 11700 ! ; invariant
   dextran = 0.00002
Figure 10.4 SPR sensorgram of the enzyme-catalyzed extension of a dextran surface. Transglucosidase alternansucrase at various concentrations was coinjected with sucrose (11.7 mM) over the surface of the SPR chip. Curves A–E: enzyme concentration [E]0 = 0.018, 0.022, 0.03, 0.044, and 0.09 μM, respectively.
The surface catalysis phenomena involved, for example, in starch biosynthesis and in cellulose degradation are still relatively poorly understood. The significance of the on-chip enzyme kinetics experiment is that it can potentially shed light on biologically relevant heterogeneous phase processes. At this preliminary phase of the investigation, the best-fit values of microscopic rate constants (not shown) were obtained separately for each recorded progress curve. The goal of the ongoing research is to produce a global (Beechem, 1992) mathematical model for the on-chip kinetics.
5. General Methods and Algorithms

This section briefly summarizes selected features and capabilities added to the DynaFit software package since its original publication (Kuzmič, 1996). These general algorithms are applicable to all types of experimental data (progress curves, initial rates, and complex equilibria) being analyzed. This selection of added features is not exhaustive, but it emphasizes some of the most difficult tasks in the analysis of biochemical data:
- How do we know where to start (the initial estimate problem)?
- How do we know whether the best-fit parameters are good enough (the confidence interval problem)?
- How do we know which fitting model to choose among several alternatives (the model discrimination problem)?
5.1. Initial estimates of model parameters

One of the most difficult tasks of a data analyst performing nonlinear least-squares regression is to come up with initial estimates of model parameters that are sufficiently close to the true values. If the initial estimate of rate or equilibrium constants is not sufficiently accurate, the data-fitting algorithm might converge to a local minimum, or not converge at all. This is the nature of the Levenberg–Marquardt algorithm (Marquardt, 1963; Reich, 1992), which is the main least-squares minimization algorithm used by DynaFit. The updated DynaFit software offers two different methods to avoid local minima on the least-squares hypersurface, that is, to avoid incorrect "best-fit" values of rate constants and other model parameters. The first method relies on a brute-force systematic parameter scan, and the second method uses ideas from evolutionary computing.

5.1.1. Systematic parameter scan
To increase the probability that a true global minimum is found for all rate and equilibrium constants, DynaFit allows the investigator to specify a set of alternate initial estimates. The software then generates all possible combinations of starting values and performs the corresponding number of independent least-squares regressions. The results are ranked by the residual sum of squares. For example, let us assume that the postulated mechanism includes four adjustable rate constants, k1–k4, and that we wish to examine four different starting values (spaced by a factor of 10) for each of them. The requisite DynaFit code would read as follows:

[constants]
   k1 = { 0.01, 0.1, 1, 10} ?
   k2 = {0.001, 0.01, 0.1, 1} ?
   k3 = {0.001, 0.01, 0.1, 1} ?
   k4 = {0.001, 1, 1000, 1000000} ?
In this case, the program would perform 4^4 = 256 separate least-squares minimizations, starting from 256 different combinations of initial estimates. In extreme cases, the execution time required for such systematic parameter scans might reach many minutes or even hours with currently available computing technology. However, for critically important data analyses, avoiding local minima and therefore incorrect mechanistic conclusions should be worth the wait.
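Outside DynaFit, the same brute-force scan is easy to emulate. In the R sketch below, fit_model is a hypothetical helper that runs one least-squares fit from a given starting point and returns its residual sum of squares in the $ssq component:

starts <- expand.grid(k1 = c(0.01, 0.1, 1, 10),
                      k2 = c(0.001, 0.01, 0.1, 1),
                      k3 = c(0.001, 0.01, 0.1, 1),
                      k4 = c(0.001, 1, 1000, 1e6))   # 4^4 = 256 rows
fits <- lapply(seq_len(nrow(starts)),
               function(i) fit_model(unlist(starts[i, ])))
best <- fits[[which.min(sapply(fits, `[[`, "ssq"))]]  # rank by sum of squares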
5.1.2. Global minimization by differential evolution
As an alternate solution to the problem of local minima in least-squares regression analysis, DynaFit now uses the differential evolution (DE) algorithm (Price et al., 2005). DE belongs to the family of stochastic evolutionary strategy (ES) algorithms, which attempt to find a global sum-of-squares minimum by using ideas from evolutionary biology. The essential feature of any ES data-fitting algorithm is that it starts from a large number of simultaneous, randomly chosen initial estimates for all adjustable model parameters. The algorithm then evolves this population of "organisms," by allowing only the sufficiently fit population members to "sexually reproduce." In this case, by fitness we mean the sum of squares associated with each particular combination of rate constants and other model parameters (the genotype). By sexual reproduction, we mean that selected population members have their genome (i.e., model parameters) carried over into the next generation by using Nature's usual tricks—chromosomal crossover accompanied by random mutations.

There are many variations on the ES computational scheme, and also a growing number of variants of the DE algorithm itself. The interested reader is encouraged to examine several recently published books and monographs (Chakraborty, 2008; Feoktistov, 2008; Onwubolu and Davendra, 2009; Price et al., 2005) for details. Typically, the number of population members does not change through the evolutionary process, meaning that if we start with 1000 different initial estimates for each rate constant, we also have 1000 different estimates at the end, after a large number of generations have reproduced. Importantly, while we might start with a population of 1000 estimates spanning 12 or 18 orders of magnitude for each rate constant, the hope is that we end with 1000 estimates all of which are close to the best possible value.

The performance of the DE algorithm (Price et al., 2005), as implemented in DynaFit, is illustrated by using an example involving irreversible inhibition kinetics of the HIV protease. This particular test problem was first presented in the original DynaFit publication (Kuzmič, 1996), and was subsequently reused by Mendes and Kell (1998) to test the performance of the popular software package Gepasi. The simulation software package COPASI (Hoops et al., 2006), a direct descendant of Gepasi, is also being profiled in this volume.

Figure 10.5 displays fluorescence changes during a fluorogenic assay (Kuzmič et al., 1996; Peranteau et al., 1995) of the HIV protease. The nominal enzyme concentration was 4 nM in each of the five kinetic experiments; the nominal substrate concentration was 25 μM; the inhibitor concentrations (curves from top to bottom; Fig. 10.5) were 0, 1.5, 3, and 4 nM (two experiments).
Figure 10.5 Least-squares fit of progress curves from HIV protease in the presence of an irreversible inhibitor. The best-fit results were obtained by using the differential evolution algorithm (Price et al., 2005).
As is discussed elsewhere (Kuzmič, 1996), each initial enzyme and substrate concentration was treated as an adjustable parameter. The vertical offset on the signal axis was also treated as an adjustable parameter for each experiment separately. The mechanistic model is shown in Scheme 10.8, where M is the monomer subunit of the HIV protease. The numbering of rate constants in Scheme 10.8 was chosen to match a previous report (Mendes and Kell, 1998).

Scheme 10.8 [M + M <==> E (k11, k12); E + S <==> E•S (k21, k22); E•S --> E + P (k3); E + P <==> E•P (k41, k42); E + I <==> E•I (k51, k52); E•I --> E–I (k6)]

The dimensions used throughout the analysis (see also the final results in Table 10.2) were μM for all concentrations, μM⁻¹ s⁻¹ for all second-order rate constants, and s⁻¹ for all first-order rate constants. The rate constants k11 = 0.1, k12 = 0.0001, and k21 = k41 = k51 = 100 were treated as fixed parameters in the model, whereas the rate constants k22, k3, k42, k52, and k6
were treated as adjustable parameters. To match the Gepasi test (Mendes and Kell, 1998) using the same example problem, each rate constant was constrained to remain less than 10⁵ in absolute value. In the course of the DE optimization, rate constants were allowed to span 12 orders of magnitude (between 10⁻⁷ and 10⁵). Each adjustable concentration was allowed to vary within ±50% of its nominal value. An excerpt from the requisite DynaFit script input file is shown as follows:

[task]
   data = progress
   task = fit
   algorithm = differential-evolution
[mechanism]
   M + M <==> E : k11 k12
   E + S <==> ES : k21 k22
   ES ---> E + P : k3
   E + P <==> EP : k41 k42
   E + I <==> EI : k51 k52
   EI --> EJ : k6
[constants]
   k11 = 0.1
   k12 = 0.0001
   k21 = 100
   k22 = 300 ? (0.0000001 .. 100000)
   k3 = 10 ? (0.0000001 .. 100000)
   k41 = 100
   k42 = 500 ? (0.0000001 .. 100000)
   k51 = 100
   k52 = 0.1 ? (0.0000001 .. 100000)
   k6 = 0.1 ? (0.0000001 .. 100000)
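For orientation, a bare-bones DE loop (the classic DE/rand/1/bin scheme) is sketched below in R. This is only a generic illustration of the algorithm class described in Section 5.1.2—DynaFit's actual implementation (population sizing, logarithmic parameter scaling, convergence tests) is more elaborate—and ssq stands for a hypothetical sum-of-squares function:

de_minimize <- function(ssq, lower, upper, NP = 50, FW = 0.8, CR = 0.9,
                        generations = 500) {
  d <- length(lower)
  pop <- matrix(runif(NP * d, lower, upper), NP, d, byrow = TRUE)
  fit <- apply(pop, 1, ssq)
  for (g in seq_len(generations)) {
    for (i in seq_len(NP)) {
      idx <- sample(setdiff(seq_len(NP), i), 3)     # three distinct mates
      mutant <- pop[idx[1], ] + FW * (pop[idx[2], ] - pop[idx[3], ])
      mutant <- pmin(pmax(mutant, lower), upper)    # keep within bounds
      cross <- runif(d) < CR
      cross[sample(d, 1)] <- TRUE                   # at least one crossover point
      trial <- ifelse(cross, mutant, pop[i, ])
      f <- ssq(trial)
      if (f <= fit[i]) { pop[i, ] <- trial; fit[i] <- f }  # selection by fitness
    }
  }
  pop[which.min(fit), ]                             # best genotype found
}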
Table 10.2 Least-squares fit of HIV protease inhibition data shown in Fig. 10.5: Comparison of the simulated annealing (SA) algorithm as implemented in Gepasi (Mendes and Kell, 1998) and COPASI (Hoops et al., 2006) with the differential evolution (DE) algorithm as implemented in DynaFit (Kuzmič, 1996)
Parameter        SA (Mendes and Kell, 1998)   SA (this work)^a   DE            SA/DE
k22              201.1                        273.1              23.67         11.54
k3               7.352                        6.517              3.922         1.66
k42              1171                         1989               128.2         15.51
k52              13,140                       11,120             0.00008562    130,000,000
k6               30,000                       4453               0.0004599     9,700,000
[S]1             24.79                        24.74              24.65         1.00
[S]2             23.43                        23.46              23.37         1.00
[S]3             26.79                        26.99              26.99         1.00
[S]4             32.10                        20.92              14.39         1.45
[S]5             26.81                        17.59              16.04         1.10
[E]1             0.004389                     0.005029           0.007484      0.67
[E]2             0.004537                     0.004965           0.006568      0.76
[E]3             0.005470                     0.005796           0.007116      0.81
[E]4             0.004175                     0.004238           0.004221      1.00
[E]5             0.003971                     0.003980           0.003396      1.17
D1               0.00801                      0.00712            0.00508       1.40
D2               0.00391                      0.00490            0.00289       1.69
D3               0.00896                      0.01395            0.01354       1.03
D4               0.01600                      0.01192            0.00337       3.54
D5               0.00379                      0.00005            0.00777       0.01
Iterations       630,000                      1,025,242          –^b           –^b
Sum of squares   0.0211024                    0.0201911          0.0194526     1.04
Run time (h)     –^c                          16.5^d,e           1.1^e         15

a Software Gepasi (Mendes and Kell, 1998) ver. 3.30.
b Iteration counts in SA and DE are not compatible.
c Running time not given in the original publication.
d Interrupted.
e Intel® Core™2 Duo T7400 microprocessor (2.16 GHz, 667 MHz bus, 4 MB cache).
DynaFit automatically chooses the population size, based on the number of adjustable model parameters and on the range of values they are allowed to span. In this case, the DE algorithm started with 259 separate estimates for each of the 15 adjustable model parameters (five rate constants, five locally adjusted substrate and enzyme concentrations, and five offsets on the signal axis). A representative histogram of the distribution of one of the 15 adjustable model parameters (the rate constant k52) is shown in the upper panel of Fig. 10.6.
Figure 10.6 The initial and final distribution of the rate constant k52 in the differential evolution (Price et al., 2005) fit of HIV protease inhibition data shown in Fig. 10.5. The population contained 259 members.
Note that the 259 initial estimates of the rate constant k52 span 12 orders of magnitude. The initial random distribution of parameter values is uniform (as opposed to Gaussian or similarly bell-shaped) on the logarithmic scale. The swarm of 259 "organisms," each carrying a unique combination of 15 adjustable model parameters (the genotype), was allowed to evolve using Darwinian evolutionary principles (selection by fitness; chromosomal crossover during the "mating" of population members; random genetic mutations). After 793 generations, each of the 15 model parameters converged to a relatively narrow range of values, as shown in the bottom panel of Fig. 10.6 for the rate constant k52. The simulated best-fit model is shown as smooth curves in Fig. 10.5. The best-fit values of the adjustable model parameters are shown in Table 10.2, where Di is the offset on the signal axis for the individual data sets. The simulated annealing (SA) algorithm (Corana et al., 1987; Kirkpatrick et al., 1983) was chosen for comparison with DE, because it appears to be the best-performing global optimization method currently reported in the biochemical literature (Mendes and Kell, 1998).
The results listed in Table 10.2 show that the DE algorithm found a combination of model parameters that leads to a significantly lower sum of squares (i.e., a better fit) compared to the SA algorithm. Some model parameters, such as the adjustable substrate concentrations, were very close to identical in both data-fitting methods. Other model parameters, such as the rate constants k52 and k6 that properly characterize the inhibitor, differed by 6–8 orders of magnitude. The SA algorithm had to be terminated manually after approximately 17 h of continued execution, and more than one million iterations. The DE algorithm terminated automatically after 66 min, when the defined convergence criteria were satisfied. We can conclude that, in the specific case of the HIV protease irreversible kinetics, the DE global optimization algorithm clearly performs significantly better than the SA algorithm. However, this does not mean that the best-fit DE parameter values listed in Table 10.2 are any closer to the true values, when compared with the SA parameters. In fact, it appears that neither set of parameter values should be regarded with much confidence (see Section 5.2). Probably the only conclusion we can safely make is that much more research is needed into the relative merits of global optimization algorithms such as DE and SA—specifically, as they are applied to the analysis of biochemical kinetic data.
5.2. Uncertainty of model parameters

Most biochemists are likely to see the uncertainty of kinetic model parameters expressed only as formal standard errors. Formal standard errors are the plus-or-minus values standing next to the best-fit values of nonlinear parameters, as reported by all popular software packages for nonlinear least-squares regression, including DynaFit. However, it should be strongly emphasized that formal standard errors can (and usually do) grossly underestimate the statistical uncertainty. For a rigorous theoretical treatment of statistical inference regions for nonlinear parameters, see Bates and Watts (1988).

Johnson et al. (2009) recently stated that DynaFit (Kuzmič, 1996) users are provided only with the "standard errors [...] without additional aids to evaluate the extent to which the fitted parameters are actually constrained by the data." This statement is factually false, and needs to be corrected for the record. Since version 2.23, released in January 1997 and extensively documented in the freely distributed user manual, DynaFit has always implemented the profile-t search method of Bates and Watts (Bates and Watts, 1988; Brooks et al., 1994; Watts, 1994) to compute approximate inference regions of nonlinear model parameters. The most recent update to DynaFit adds an additional aid to evaluate the extent to which the fitted parameters are constrained by the data. This is a particular modification of the well-established Monte-Carlo method (Straume and Johnson, 1992).
5.2.1. Monte-Carlo confidence intervals
The Monte-Carlo method (Straume and Johnson, 1992) for the determination of confidence intervals is based on the following idea. After an initial least-squares fit using the usual procedure, the best-fit values of the nonlinear parameters are used to simulate many (typically, at least 1000) artificial data sets. The idealized theoretical model curves (e.g., the smooth curves in Fig. 10.5) are always the same, but the superimposed pseudo-random noise is different every time. The 1000 slightly different sets of pseudo-experimental data are again subjected to nonlinear least-squares regression. In the end, the 1000 different sets of best-fit values of the model parameters are tallied up to construct a histogram of each parameter's distribution. The range of values spanned by each histogram is the Monte-Carlo confidence interval for the given model parameter.

"Shuffle" and "shift" Monte-Carlo methods
A crucially important part of the above Monte-Carlo procedure is the simulation of the pseudo-random noise to be superimposed on the idealized data. How should we choose the statistical distribution from which the pseudo-random noise is drawn? Usually, it is assumed that the pseudo-random experimental noise has a normal (Gaussian) distribution (Straume and Johnson, 1992), and that the individual data points are statistically independent or uncorrelated. If so, the standard deviation of this Gaussian distribution (the half-width of the requisite bell curve) can be taken as the standard error of fit from the first-pass regression analysis of the original data.

However, we have recently demonstrated (Kuzmič et al., 2009) that experimental data points recorded in at least one particular enzyme assay are not statistically independent. Instead, we see a strong neighborhood correlation among adjacent data points—spanning up to six nearest neighbors. To reflect this possible serial correlation among nearby data points, DynaFit (Kuzmič, 1996) now allows two variants of the Monte-Carlo method, which could be called the "shift" Monte-Carlo and "shuffle" Monte-Carlo algorithms. In both cases, instead of generating presumably Gaussian errors to be superimposed on the idealized data, we merely rearrange the order of the actual residuals generated by the first-pass least-squares fit. In the shuffle variant, the residuals are reused in truly randomized order. In the shift variant of the Monte-Carlo algorithm, the order of the residuals is preserved, but the starting position changes.

For example, let us assume that a particular reaction progress curve (such as one of those shown in Fig. 10.5) contains 300 experimental data points. After the first-pass least-squares fit, we could simulate up to 300 synthetic progress curves by superimposing the ordered sequence of residuals. In one such simulated curve, the first synthetic data point would be assigned
residual No. 17, the second data point residual No. 18, and so on. At the end of the ordered sequence of residuals, we wrap around to the beginning (i.e., data point No. 300 − 17 = 283 will receive residual No. 1). In another simulated curve, the first data point would be generated from residual No. 213, the second data point from residual No. 214, and so on.

The practical usefulness of the shift and shuffle variants of the Monte-Carlo method (Straume and Johnson, 1992) is that they avoid having to make assumptions about the statistical distribution (Gaussian, Lorentzian, etc.) of the random noise that is inevitably present in the experimental data. Interestingly, the original conception of the Monte-Carlo method (Dwass, 1957; Nichols and Holmes, 2001) was, in fact, based on permuting existing population members, rather than making distributional assumptions.

Two-dimensional histograms
The "shift" Monte-Carlo confidence intervals for rate constants k22, k3, and k42 from the least-squares fit of the HIV protease inhibition data are shown in Fig. 10.7. The best-fit values of each model parameter are marked with a filled triangle. The rate constant k3 is characterized by a relatively narrow confidence interval (spanning from approximately 3 to 9 s⁻¹). In contrast, the Monte-Carlo confidence intervals for the rate constants k22 and k42 not only are much wider (approximately 4 orders of magnitude for k42) but also are clearly bimodal. The appearance of such a double-hump histogram for any parameter is a strong indication that (a) the model is probably severely over-parameterized, and (b) the data could very likely be fit to at least two alternate mechanisms.

In order to better diagnose possible statistical coupling between pairs of rate constants, beyond what conventional Monte-Carlo histograms can provide, DynaFit now produces two-dimensional histograms such as those shown in Fig. 10.8. The thin solid path enclosing each histogram in Fig. 10.8 is the convex hull—the shortest path entirely enclosing a set of points in a plane. The approximate area occupied by the convex hull is a useful empirical measure of parameter redundancy. If any two rate constants were truly statistically independent, the corresponding two-dimensional Monte-Carlo histogram would resemble a circular area with the highest population density in the center. We can see in Fig. 10.8 that the rate constants k22 and k42 are clearly correlated, as is indicated by the elongated crescent shape of the two-dimensional histogram.

In summary, with regard to assessing the statistical uncertainty of nonlinear model parameters, DynaFit (Kuzmič, 1996) has always allowed the investigator to perform a full search in parameter space, using the profile-t method (Bates and Watts, 1988; Brooks et al., 1994; Watts, 1994). As a result of such detailed analysis, the investigator often must face the unpleasant fact that the confidence regions for rate constants, equilibrium constants, or derived kinetic parameters (e.g., Michaelis constants) not only are much
Figure 10.7 Monte-Carlo confidence intervals for model parameters: Distribution histograms for rate constants k22, k3, and k42 from the least-squares fit of HIV protease inhibition data shown in Fig. 10.5.
In summary, with regard to assessing the statistical uncertainty of nonlinear model parameters, DynaFit (Kuzmič, 1996) has always allowed the investigator to perform a full search in parameter space, using the profile-t method (Bates and Watts, 1988; Brooks et al., 1994; Watts, 1994). As a result of such detailed analysis, the investigator often must face the unpleasant fact that the confidence regions for rate constants, equilibrium constants, or derived kinetic parameters (e.g., Michaelis constants) not only are much larger than the formal standard errors would suggest, but perhaps also larger than would appear "publishable." However, it must be strongly emphasized that the formal standard errors for nonlinear parameters reported by DynaFit should never be given much credence. The program reports them mostly for compatibility with other software packages typically used by biochemists. In order to obtain a more realistic interpretation of the experimental data, DynaFit users are encouraged to go beyond formal standard errors, and to utilize both the previously available profile-t method (Brooks et al., 1994) and now also the modified Monte-Carlo method (Straume and Johnson, 1992).
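The residual-resampling step at the heart of the shift and shuffle variants is simple enough to sketch in a few lines of code. The following Python fragment is only an illustration of the idea, not DynaFit's actual implementation; the fitted curve, residuals, and downstream refitting routine are assumed to come from a first-pass least-squares fit.

import numpy as np

rng = np.random.default_rng(0)

def synthetic_datasets(y_fit, residuals, n_sets, variant="shift"):
    # y_fit: idealized best-fit curve evaluated at the data points
    # residuals: observed minus fitted values from the first-pass fit
    n = len(residuals)
    for _ in range(n_sets):
        if variant == "shift":
            start = rng.integers(n)           # random starting position
            r = np.roll(residuals, -start)    # order preserved, wrapped around
        else:                                 # "shuffle"
            r = rng.permutation(residuals)    # truly randomized order
        yield y_fit + r                       # one synthetic progress curve

# Each synthetic curve would then be refit, and the resulting best-fit
# parameters tallied into histograms to obtain confidence intervals.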
Figure 10.8 Monte-Carlo confidence intervals for model parameters: Two-dimensional correlation histograms for rate constants k22 versus k3 (left; uncorrelated) and k22 versus k42 (right; strong correlation) from the least-squares fit of HIV protease inhibition data shown in Fig. 10.5.
5.3. Model-discrimination analysis
The problem of selecting the most plausible theoretical model among several candidates (e.g., deciding whether a given enzyme inhibitor is competitive, noncompetitive, or mixed-type) represents one of the most challenging tasks facing the data analyst. Myung and Pitt (2004) and Myung et al. (2009) reviewed recent developments in earlier volumes of this series. This section contains only a very brief summary of the model-discrimination features available in DynaFit (Kuzmič, 1996). The reader is referred to the full program documentation available online (http://www.biokin.com/dynafit/).

DynaFit (Kuzmič, 1996) currently offers two distinct methods for statistical model discrimination. First, for nested fitting models, the updated version of DynaFit continues to offer the F-statistic method previously discussed by Mannervik (1981, 1982) and many others. Second, for any group of alternate models, whether nested or nonnested, DynaFit uses the second-order Akaike information criterion, AICc (Burnham and Anderson, 2002), to perform model discrimination. Briefly, the AICc criterion is defined by Eq. (10.1), where S is the residual sum of squares, nP is the number of adjustable model parameters, and nD is the number of experimental data points. For each candidate model in a collection of alternate models, DynaFit computes ΔAICc as the difference between the AICc for the particular model and the AICc for the best model (the one with the lowest value of AICc). Thus, the best model is by definition assigned ΔAICc = 0. The Akaike weight, wi, for the ith model in a collection of m alternatives, is defined by Eq. (10.2):
AICc = log S + 2 nP + 2 nP (nP + 1) / (nD − nP − 1)        (10.1)

wi = exp(−ΔAICc(i)/2) / Σj=1..m exp(−ΔAICc(j)/2)        (10.2)
Burnham and Anderson (2002) formulated a series of empirical rules for interpreting the observed ΔAICc values for each alternate fitting model, stating that ΔAICc > 10 might be considered sufficiently strong evidence against the given model. Practical experience with the Burnham and Anderson rule suggests that it is applicable only when the number of experimental data points is a reasonably small multiple of the number of adjustable model parameters (e.g., nD < 20 nP). In some cases, the number of data points is very much larger. For example, in certain continuous assays or stopped-flow measurements, it is not unusual to collect thousands of experimental data points in order to determine two or three kinetic constants. In such cases, the ΔAICc > 10 rule has been found unreliable. In general, a candidate model should probably be rejected only if its Akaike weight, wi, is smaller than approximately 0.001.

The DynaFit notation needed to compare a series of alternate models, and to select the most plausible model if a selection is possible, is illustrated in the following input file fragment. Please note the use of question marks after each (arbitrarily chosen) model name. This notation instructs DynaFit to evaluate the plausibility of the given model, in comparison with other models that are marked identically.

[task]
   model = Competitive ?
[mechanism]
   E + S <==> E.S : Ks dissoc
   E.S ---> E + P : kcat
   E + I <==> E.I : Ki dissoc
...
[task]
   model = Uncompetitive ?
[mechanism]
   E + S <==> E.S : Ks dissoc
   E.S ---> E + P : kcat
   E.S + I <==> E.S.I : Kis dissoc
...
[task]
   model = Mixed-type noncompetitive ?
[mechanism]
   E + S <==> E.S : Ks dissoc
   E.S ---> E + P : kcat
   E + I <==> E.I : Ki dissoc
   E.S + I <==> E.S.I : Kis dissoc
...
[task]
   model = Partial mixed-type ?
[mechanism]
   E + S <==> E.S : Ks dissoc
   E.S ---> E + P : kcat
   E + I <==> E.I : Ki dissoc
   E.S + I <==> E.S.I : Kis dissoc
   E.S.I ---> E.I + P : kcat'
...
When DynaFit is presented with a series of alternate models in this way, it will fit the available experimental data to each postulated model in turn. After the last model in the series is fit to the data, the program presents to the user a summary table listing the values of ΔAICc. The AICc-based model-discrimination feature available in DynaFit has been utilized in a number of reports (Błachut-Okrasinska et al., 2007; Collom et al., 2008; Gasa et al., 2009; Jamakhandi et al., 2007; Kuzmič et al., 2006).
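To make Eqs. (10.1) and (10.2) concrete, the short Python sketch below computes AICc values and Akaike weights for a set of candidate models. It is an illustration of the formulas only, not DynaFit code, and the residual sums of squares used in the example are hypothetical.

import math

def aicc(S, nP, nD):
    # Eq. (10.1), with log S as defined in the text
    return math.log(S) + 2 * nP + 2 * nP * (nP + 1) / (nD - nP - 1)

def akaike_weights(scores):
    # Eq. (10.2): weights computed from differences to the lowest AICc
    best = min(scores)
    terms = [math.exp(-0.5 * (a - best)) for a in scores]
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical results: model name -> (residual sum of squares, nP)
fits = {"competitive": (0.52, 3), "uncompetitive": (0.61, 3),
        "mixed-type noncompetitive": (0.50, 4)}
nD = 120
scores = [aicc(S, nP, nD) for S, nP in fits.values()]
for name, w in zip(fits, akaike_weights(scores)):
    print(name, round(w, 3))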
6. Concluding Remarks
DynaFit (Kuzmič, 1996) has proved quite useful in a number of projects, as is evidenced by the number of journal publications that cite the program. It is hoped that the software will continue to enable innovative research. This section offers a few closing comments on DynaFit enhancements currently in development.
6.1. Model-discrimination analysis
The AIC criterion is based solely on the number of optimized parameters and the corresponding sum of squares. The degree of uncertainty associated with each particular set of model parameters is completely ignored. However, if two candidate models with exactly identical numbers of adjustable parameters hypothetically produced exactly identical sums of squares, but one of these models was associated with significantly narrower confidence regions, then that model should be preferred (Myung and Pitt, 2004). The minimum description length (MDL), also known as the stochastic complexity (SC) measure (Myung and Pitt, 2004), would clearly be a more appropriate
model-discrimination criterion. Unfortunately, for technical reasons, the MDL criterion is extremely difficult to compute (Myung et al., 2009). Investigations into at least an approximate computation of the MDL/SC criterion are currently ongoing.
6.2. Optimal design of experiments
Most biochemists, probably like most experimentalists, prefer to do the experiment first, then proceed to data analysis, and finally to publication. However, to paraphrase the eminent statistician G. E. P. Box (Box et al., 1978), no amount of the most ingenious data analysis can salvage a poorly designed experiment. When examining the extant enzymological literature, one often wonders exactly how the concentrations were chosen. Why was an exponential series (1, 2, 4, 8, 16) used for substrate concentrations, instead of a linear series (3, 6, 9, 12, 15) (Kuzmič et al., 2006)? Was it by design, or was it because "that's how we always did it"? Similar choices profoundly affect how much, if anything, can be learned from any given experiment. A well-established statistical theory of optimal experiment design (Atkinson and Donev, 1992; Fedorov, 1972) has been used by biochemical researchers in the past (Duggleby, 1981; Endrényi, 1981; Franco et al., 1986). At the present time, DynaFit is being modified to implement these ideas, and to deploy them for computer-assisted rational design of experiments.
ACKNOWLEDGMENTS
Klára Briknarová and Jill Bouchard (University of Montana) are gratefully acknowledged for sharing their as yet unpublished NMR titration data. Jan Antosiewicz (Warsaw University) provided stimulating discussions and procured the PNP inhibition data for testing the statistical-factors feature in DynaFit; the raw experimental data were made available by Beata Wielgus-Kutrowska, Agnieszka Bzowska, and Katarzyna Breer (Warsaw University). Stephen Bornemann and his colleagues (John Innes Center, Norwich) graciously invited me to peek into the mysteries of their unique SPR on-chip kinetic system, and inspired the development of the invariant-concentration algorithm. Liz Hedstrom (Brandeis University) made helpful comments and suggestions. I am grateful to Andrei Ruckenstein (formerly of the BioMaPS Institute for Quantitative Biology, Rutgers University; currently at Boston University) for illuminating discussions regarding thermodynamic boxes in biochemical mechanisms. Sarah McCord (Massachusetts College of Pharmacy and Health Sciences) provided expert assistance in editing this manuscript.
REFERENCES

Atkinson, A., and Donev, A. (1992). Optimum Experimental Designs. Oxford University Press, Oxford.
Bates, D. M., and Watts, D. G. (1988). Nonlinear Regression Analysis and its Applications. Wiley, New York.
Beechem, J. M. (1992). Global analysis of biochemical and biophysical data. Methods Enzymol. 210, 37–54.
Benkovic, S. J., Fierke, C. A., and Naylor, A. M. (1988). Insights into enzyme function from studies on mutants of dihydrofolate reductase. Science 239, 1105–1110.
Błachut-Okrasinska, E., Bojarska, E., Stepiński, J., and Antosiewicz, J. (2007). Kinetics of binding the mRNA cap analogues to the translation initiation factor eIF4E under second-order reaction conditions. Biophys. Chem. 129, 289–297.
Bosco, G., Baxa, M., and Sosnick, T. (2009). Metal binding kinetics of bi-histidine sites used in ψ analysis: Evidence of high-energy protein folding intermediates. Biochemistry 48, 2950–2959.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley, New York.
Briknarová, K., Zhou, X., Satterthwait, A., Hoyt, D., Ely, K., and Huang, S. (2008). Structural studies of the SET domain from RIZ1 tumor suppressor. Biochem. Biophys. Res. Commun. 366, 807–813.
Brooks, I., Watts, D., Soneson, K., and Hensley, P. (1994). Determining confidence intervals for parameters derived from analysis of equilibrium analytical ultracentrifugation data. Methods Enzymol. 240, 459–478.
Burnham, K. P., and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York.
Bzowska, A. (2002). Calf spleen purine nucleoside phosphorylase: Complex kinetic mechanism, hydrolysis of 7-methylguanosine, and oligomeric state in solution. Biochim. Biophys. Acta 1596, 293–317.
Bzowska, A., Koellner, G., Wielgus-Kutrowska, B., Stroh, A., Raszewski, G., Holý, A., Steiner, T., and Frank, J. (2004). Crystal structure of calf spleen purine nucleoside phosphorylase with two full trimers in the asymmetric unit: Important implications for the mechanism of catalysis. J. Mol. Biol. 342, 1015–1032.
Chakraborty, U. K. (2008). Advances in Differential Evolution. Springer-Verlag, New York.
Clé, C., Gunning, A. P., Syson, K., Bowater, L., Field, R. A., and Bornemann, S. (2008). Detection of transglucosidase-catalyzed polysaccharide synthesis on a surface in real-time using surface plasmon resonance spectroscopy. J. Am. Chem. Soc. 130, 15234–15235.
Clé, C., Martin, C., Field, R. A., Kuzmič, P., and Bornemann, S. (2010). Detection of enzyme-catalyzed polysaccharide synthesis on surfaces. Biocatal. Biotransform., in press.
Collom, S. L., Laddusaw, R. M., Burch, A. M., Kuzmič, P., and Miller, G. P. (2008). CYP2E1 substrate inhibition: Mechanistic interpretation through an effector site for monocyclic compounds. J. Biol. Chem. 283, 3487–3496.
Corana, A., Marchesi, M., Martini, C., and Ridella, S. (1987). Minimizing multimodal functions of continuous variables with the "simulated annealing" algorithm. ACM Trans. Math. Softw. 13, 262–280.
Deng, Q., and Huang, S. (2004). PRDM5 is silenced in human cancers and has growth suppressive activities. Oncogene 17, 4903–4910.
Digits, J. A., and Hedstrom, L. (1999). Kinetic mechanism of Tritrichomonas foetus inosine 5′-monophosphate dehydrogenase. Biochemistry 38, 2295–2306.
Duggleby, R. (1981). Experimental designs for the distribution free analysis of enzyme kinetic data. In "Kinetic Data Analysis" (L. Endrényi, ed.), pp. 169–181. Plenum Press, New York.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Ann. Math. Stat. 28, 181–187.
Endrényi, L. (1981). Design of experiments for estimating enzyme and pharmacokinetic parameters. In "Kinetic Data Analysis" (L. Endrényi, ed.), pp. 137–169. Plenum Press, New York.
Fedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York.
Feoktistov, V. (2008). Differential Evolution: In Search of Solutions. Springer-Verlag, New York.
Fierke, C. A., Johnson, K. A., and Benkovic, S. J. (1987). Construction and evaluation of the kinetic scheme associated with dihydrofolate reductase from Escherichia coli. Biochemistry 26, 4085–4092.
Franco, R., Gavalda, M. T., and Canela, E. I. (1986). A computer program for enzyme kinetics that combines model discrimination, parameter refinement and sequential experimental design. Biochem. J. 238, 855–862.
Gasa, T., Spruell, J., Dichtel, W., Sørensen, T., Philp, D., Stoddart, J., and Kuzmič, P. (2009). Complexation between methyl viologen (paraquat) bis(hexafluorophosphate) and dibenzo[24]crown-8 revisited. Chem. Eur. J. 15, 106–116.
Gilbert, H. F. (1999). Basic Concepts in Biochemistry. McGraw-Hill, New York.
Guldberg, C. M., and Waage, P. (1879). Über die chemische Affinität. J. Prakt. Chem. 127, 69–114.
Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., and Kummer, U. (2006). COPASI—A COmplex PAthway SImulator. Bioinformatics 22, 3067–3074.
Jamakhandi, A. P., Kuzmič, P., Sanders, D. E., and Miller, G. P. (2007). Global analysis of protein–protein interactions reveals multiple cytochrome P450 2E1–reductase complexes. Biochemistry 46, 10192–10201.
Johnson, M. L. (1992). Why, when, and how biochemists should use least squares. Anal. Biochem. 206, 215–225.
Johnson, M. L. (1994). Use of least-squares techniques in biochemistry. Methods Enzymol. 240, 1–22.
Johnson, M. L., and Frasier, S. G. (1985). Nonlinear least-squares analysis. Methods Enzymol. 117, 301–342.
Johnson, K. A., Simpson, Z. B., and Blom, T. (2009). Global Kinetic Explorer: A new computer program for dynamic simulation and fitting of kinetic data. Anal. Biochem. 387, 20–29.
King, E. L., and Altman, C. (1956). A schematic method of deriving the rate laws for enzyme-catalyzed reactions. J. Phys. Chem. 60, 1375–1378.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220, 671–680.
Kuzmič, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260–273.
Kuzmič, P. (2006). A generalized numerical approach to rapid-equilibrium enzyme kinetics: Application to 17β-HSD. Mol. Cell. Endocrinol. 248, 172–181.
Kuzmič, P. (2009a). A generalized numerical approach to steady-state enzyme kinetics: Applications to protein kinase inhibition. Biochim. Biophys. Acta—Prot. Proteom., in press, doi:10.1016/j.bbapap.2009.07.028.
Kuzmič, P. (2009b). Application of the Van Slyke–Cullen irreversible mechanism in the analysis of enzymatic progress curves. Anal. Biochem. 394, 287–289.
Kuzmič, P., Peranteau, A. G., García-Echeverría, C., and Rich, D. H. (1996). Mechanical effects on the kinetics of the HIV proteinase deactivations. Biochem. Biophys. Res. Commun. 221, 313–317.
Kuzmič, P., Cregar, L., Millis, S. Z., and Goldman, M. (2006). Mixed-type noncompetitive inhibition of anthrax lethal factor protease by aminoglycosides. FEBS J. 273, 3054–3062.
Kuzmič, P., Lorenz, T., and Reinstein, J. (2009). Analysis of residuals from enzyme kinetic and protein folding experiments in the presence of correlated experimental noise. Anal. Biochem. 395, 1–7.
Le Clainche, L., and Vita, C. (2006). Selective binding of uranyl cation by a novel calmodulin peptide. Environ. Chem. Lett. 4, 45–49.
Leskovar, A., Wegele, H., Werbeck, N., Buchner, J., and Reinstein, J. (2008). The ATPase cycle of the mitochondrial Hsp90 analog Trap1. J. Biol. Chem. 283, 11677–11688.
Mannervik, B. (1981). Design and analysis of kinetic experiments for discrimination between rival models. In "Kinetic Data Analysis" (L. Endrényi, ed.), pp. 235–270. Plenum Press, New York.
Mannervik, B. (1982). Regression analysis, experimental error, and statistical criteria in the design and analysis of experiments for discrimination between rival kinetic models. Methods Enzymol. 87, 370–390.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11, 431–441.
Mendes, P., and Kell, D. (1998). Non-linear optimization of biochemical pathways: Applications to metabolic engineering and parameter estimation. Bioinformatics 14, 869–883.
Morrison, J. F., and Walsh, C. T. (1988). The behavior and significance of slow-binding enzyme inhibitors. Adv. Enzymol. Relat. Areas Mol. Biol. 61, 201–301.
Myung, J. I., and Pitt, M. A. (2004). Model comparison methods. Methods Enzymol. 383, 351–366.
Myung, J. I., Tang, Y., and Pitt, M. A. (2009). Evaluation and comparison of computational models. Methods Enzymol. 454, 287–304.
Nichols, T. E., and Holmes, A. P. (2001). Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum. Brain Mapp. 15, 1–25.
Niedzwiecka, A., Stepiński, J., Antosiewicz, J., Darzynkiewicz, E., and Stolarski, R. (2007). Biophysical approach to studies of Cap–eIF4E interaction by synthetic Cap analogs. Methods Enzymol. 430, 209–245.
Onwubolu, G. C., and Davendra, D. (2009). Differential Evolution: A Handbook for Global Permutation-Based Combinatorial Optimization. Springer-Verlag, New York.
Penheiter, A. R., Bajzer, Ž., Filoteo, A. G., Thorogate, R., Török, K., and Caride, A. J. (2003). A model for the activation of plasma membrane calcium pump isoform 4b by calmodulin. Biochemistry 42, 12115–12124.
Peranteau, A. G., Kuzmič, P., Angell, Y., García-Echeverría, C., and Rich, D. H. (1995). Increase in fluorescence upon the hydrolysis of tyrosine peptides: Application to proteinase assays. Anal. Biochem. 227, 242–245.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes in C. Cambridge University Press, Cambridge.
Price, K. V., Storn, R. M., and Lampinen, J. A. (2005). Differential Evolution: A Practical Approach to Global Optimization. Springer-Verlag, New York.
Reich, J. G. (1992). Curve Fitting and Modelling for Scientists and Engineers. McGraw-Hill, New York.
Schlippe, Y. V. G., Riera, T. V., Seyedsayamdost, M. R., and Hedstrom, L. (2004). Substitution of the conserved Arg-Tyr dyad selectively disrupts the hydrolysis phase of the IMP dehydrogenase reaction. Biochemistry 43, 4511–4521.
Segel, I. H. (1975). Enzyme Kinetics. Wiley, New York.
Slyke, D. D. V., and Cullen, G. E. (1914). The mode of action of urease and of enzymes in general. J. Biol. Chem. 19, 141–180.
Storme, T., Deroussent, A., Mercier, L., Prost, E., Re, M., Munier, F., Martens, T., Bourget, P., Vassal, G., Royer, J., and Paci, A. (2009). New ifosfamide analogs designed for lower associated neurotoxicity and nephrotoxicity with modified alkylating kinetics leading to enhanced in vitro anticancer activity. J. Pharmacol. Exp. Ther. 328, 598–609.
Straume, M., and Johnson, M. L. (1992). Monte Carlo method for determining complete confidence probability distributions of estimated model parameters. Methods Enzymol. 210, 117–129.
Szedlacsek, S., and Duggleby, R. G. (1995). Kinetics of slow and tight-binding inhibitors. Methods Enzymol. 249, 144–180.
Van Boekel, M. (2000). Kinetic modelling in food science: A case study on chlorophyll degradation in olives. J. Sci. Food Agric. 80, 3–9.
Von Weymarn, N., Kiviharju, K., and Leisola, M. (2002). High-level production of D-mannitol with membrane cell-recycle bioreactor. J. Ind. Microbiol. Biotechnol. 29, 44–49.
Watts, D. G. (1994). Parameter estimates from nonlinear models. Methods Enzymol. 240, 23–36.
Wielgus-Kutrowska, B., and Bzowska, A. (2006). Probing the mechanism of purine nucleoside phosphorylase by steady-state kinetic studies and ligand binding characterization determined by fluorimetric titrations. Biochim. Biophys. Acta 1764, 887–902.
Wielgus-Kutrowska, B., Bzowska, A., Tebbe, J., Koellner, G., and Shugar, D. (2002). Purine nucleoside phosphorylase from Cellulomonas sp.: Physicochemical properties and binding of substrates determined by ligand-dependent enhancement of enzyme intrinsic fluorescence, and by protective effects of ligands on thermal inactivation of the enzyme. Biochim. Biophys. Acta 1597, 320–334.
Wielgus-Kutrowska, B., Antosiewicz, J., Dlugosz, M., Holý, A., and Bzowska, A. (2007). Towards the mechanism of trimeric purine nucleoside phosphorylases: Stopped-flow studies of binding of the multisubstrate analogue inhibitor 2-amino-9-[2-(phosphonomethoxy)ethyl]-6-sulfanylpurine. Biophys. Chem. 125, 260–268.
Williams, J. W., and Morrison, J. F. (1979). The kinetics of reversible tight-binding inhibition. Methods Enzymol. 63, 437–467.
Williams, C. R., Snyder, A. K., Kuzmič, P., O'Donnell, M., and Bloom, L. B. (2004). Mechanism of loading the Escherichia coli DNA polymerase III sliding clamp. I. Two distinct activities for individual ATP sites in the γ complex. J. Biol. Chem. 279, 4376–4385.
CHAPTER ELEVEN

Discrete Dynamic Modeling of Cellular Signaling Networks

Réka Albert and Rui-Sheng Wang
Department of Physics, Pennsylvania State University, University Park, Pennsylvania, USA

Contents
1. Introduction
2. Cellular Signaling Networks
3. Boolean Dynamic Modeling
   3.1. Constructing the network backbone
   3.2. Determining transfer functions
   3.3. Selecting models for state transitions
   3.4. Analyzing steady states of the system
   3.5. Testing the robustness of the dynamic model
   3.6. Making biological implications and predictions
4. Variants of Boolean Network Models
   4.1. Threshold Boolean networks
   4.2. Piecewise linear systems
   4.3. From Boolean switches to dose–response curves
5. Application Examples
   5.1. Abscisic acid-induced stomatal closure
   5.2. T-LGL survival signaling network
6. Conclusion and Discussion
Acknowledgments
References
Abstract
Understanding signal transduction in cellular systems is a central issue in systems biology. Numerous experiments from different laboratories generate an abundance of individual components and causal interactions mediating environmental and developmental signals. However, for many signal transduction systems there is insufficient information on the overall structure and the molecular mechanisms involved in the signaling network. Moreover, lack of kinetic and temporal information makes it difficult to construct quantitative models of signal transduction pathways. Discrete dynamic modeling, combined with network analysis, provides an effective way to integrate fragmentary
knowledge of regulatory interactions into a predictive mathematical model which is able to describe the time evolution of the system without the requirement for kinetic parameters. This chapter introduces the fundamental concepts of discrete dynamic modeling, particularly focusing on Boolean dynamic models. We describe this method step-by-step in the context of cellular signaling networks. Several variants of Boolean dynamic models including threshold Boolean networks and piecewise linear systems are also covered, followed by two examples of successful application of discrete dynamic modeling in cell biology.
1. Introduction
With the increasing availability of high-throughput techniques, it is nowadays possible to collect large datasets on the abundance and activity of biological components such as genes, proteins, RNAs, and metabolites. The diverse interactions between these components coordinate cellular systems and are responsible for cellular functions. Networks of interaction and regulation can be discerned throughout the process by which a coding sequence of DNA is converted into active proteins (Barabasi and Oltvai, 2004). At the genomic/transcriptomic level, transcription factors can activate or inhibit the expression of genes into mRNAs and regulate the activity of genes, contributing to a transcriptional (gene) regulatory network (Buck and Lieb, 2004; Lee et al., 2002). At the proteomic level, proteins participate in diverse posttranslational modifications of other proteins or form protein complexes with other proteins to exert additional functional roles. Such associations between proteins are achieved by protein–protein interactions (Figeys et al., 2001; Walhout and Vidal, 2001). Biochemical reactions in the cellular metabolism can likewise be integrated into metabolic networks (Hatzimanikatis et al., 2004; Reed et al., 2003). A variety of interactions can integrate into signaling networks. For example, signals from the exterior of a cell are first transferred to its interior by a cascade of protein–protein interactions of signaling molecules (Albert, 2005; Li et al., 2006). Then, a combination of biochemical reactions and transcriptional regulation triggers the expression of genes to respond to the signals. Instead of focusing on individual components, an important issue in systems biology is to study how cellular systems facilitate diverse functions through such interacting components.

Cellular systems are by no means static. Instead, most cellular responses are transient and biological components interact dynamically with each other. Therefore, to understand the mechanisms by which interacting components achieve dynamic behaviors, topological analysis of cellular networks is insufficient. Dynamic modeling serves as a standard tool for system-level elucidation of the dynamics of cellular processes. It can link
fundamental physicochemical principles, prior knowledge about regulatory pathways, and experimental data of various types to create a powerful predictive model (Aldridge et al., 2006). Such a model is able to decipher how diverse interactions account for phenotypic traits and to make novel predictions that lead to further experimental explorations (Li et al., 2006; Thakar et al., 2007; Zhang et al., 2008). Quantitative dynamic models based on differential equations have been widely used to model various biological systems (Aldridge et al., 2006). However, this modeling approach requires many kinetic parameters which are generally unknown or only incompletely known. This aspect poses an obstacle for quantitative modeling of large-scale systems, especially when temporal data are insufficient for parameter estimation (Conzelmann and Gilles, 2008). At the same time, much knowledge of individual components and causal interactions in a biological process can be inferred from the experimental literature as qualitative data. Therefore, qualitative techniques such as Boolean networks and Petri nets have been used for modeling signal transduction networks (Chaouiya, 2007; Gilbert et al., 2006; Sackmann et al., 2006). Such discrete dynamic modeling approaches, combined with network analysis tools (Kachalo et al., 2008), are able to integrate fragmentary knowledge of regulatory interactions into a predictive and informative model in systems where kinetic parameters are not sufficiently known to allow a continuous model (Bornholdt, 2008).

Discrete dynamic modeling, particularly Boolean dynamic modeling, has been successfully applied in modeling many gene regulatory networks and signaling networks, especially in those systems where the organization of the network is more important than the kinetic details of the individual interactions. The Drosophila segment polarity regulatory network was shown to be a robust developmental module that can function despite variations in kinetic parameters (von Dassow et al., 2000). Albert and Othmer (2003) developed a Boolean dynamic model which accurately predicts the gene expression outcomes of this developmental module. The cell cycle control of the budding yeast Saccharomyces cerevisiae is a widely studied robust biological process in the cell. Threshold Boolean network models accurately reproduce the yeast cell cycle dynamics and predict the critical events of the cell cycle (Davidich and Bornholdt, 2008b; Li et al., 2004). Generalized logical models have been used for modeling the segmentation in Drosophila embryos (Sanchez and Thieffry, 2001, 2003). Discrete dynamic modeling has been equally successfully applied in systems as different as plants and mammals, for example, in Arabidopsis flower morphogenesis (Espinosa-Soto et al., 2004; Mendoza et al., 1999), root hair development (Mendoza and Alvarez-Buylla, 2000), abscisic acid (ABA)-induced stomatal closure (Li et al., 2006), the human cholesterol regulatory pathway (Kervizic and Corcos, 2008), mammalian immune response to bacteria (Thakar et al., 2007, 2009), T cell signaling networks (Kaufman et al., 1999; Saez-Rodriguez et al., 2007), as well as the T-LGL leukemia survival signaling (Zhang et al., 2008).
In this chapter, we introduce the fundamental concepts of Boolean dynamic modeling, in the context of signaling networks. We also illustrate several network analysis tools and Boolean network simulation tools. Finally, we discuss variants of Boolean models including threshold Boolean models and piecewise linear systems and give two examples of successful application of discrete dynamic modeling in cell biology.
2. Cellular Signaling Networks
Living cells constantly receive various external stimuli and developmental signals and convert them into intracellular responses. Such processes are collectively known as signal transduction, which involves a collection of interacting chemicals and molecules such as enzymes, proteins, and second messengers (Gomperts et al., 2003). Signal transduction is an important part of cell communication that governs basic cellular activities, coordinates cell actions, and maintains the equilibrium between the cell and its surroundings. Many cellular decisions such as proliferation, differentiation, and apoptosis are achieved by signal transduction. Therefore, understanding cell signaling is essential for studying the underlying mechanisms of cellular systems.

Figure 11.1 gives an abstract view of a signal transduction process, which demonstrates that signal transduction has three main steps. First, signal transduction processes are activated by extracellular signaling molecules binding to cell-surface receptors. Then the signals are transferred inside the cell and trigger a sequence of biochemical reactions, such as the activation of signal adaptors and the phosphorylation of enzymes, which occur in the cytoplasm. The signals are amplified and passed to the nucleus, and further to genes, through a series of biochemical reactions that link submembrane events. Finally, cells respond to the signals by changing cellular function through the expression of genes. At every step of the signal transduction process, feedback is possible and important. Signal transduction pathways interact and crosstalk with one another to form signal transduction networks (also called signaling networks) (Gomperts et al., 2003).

Signaling networks can be represented as directed graphs where the orientation of the edges reflects the direction of signal propagation. In a signaling network, there exist one or more clear starting nodes representing the binding of the initial signal(s) to receptor(s) and one or more output nodes representing the cellular responses to the signal(s). Besides these nodes, there are a number of intermediate nodes consisting of second messengers, enzymes, kinases, proteins, genes, ions, metabolites, and/or other compounds involved in transferring the signals. The edges in a signaling network represent diverse interactions between signaling components such as protein binding, complex formation, transcription
Figure 11.1 Scheme of a hypothetical signal transduction process involving diverse interactions of cellular components.
regulation, phosphorylation of a protein, and enzymatic catalysis. As shown in Fig. 11.1, signal propagation follows the paths from the starting node(s) via a succession of intermediate components to the final output node(s).

Signaling networks are usually very complex in their organization and too complicated to be analyzed by the human mind alone. Analysis of signaling networks requires an iterative combination of experimental and theoretical approaches, including developing appropriate models and generating quantitative data. Traditional work in biology has focused on studying individual parts of cell signaling pathways. Mathematical modeling of cellular networks from a system-level perspective helps us understand the underlying structure of cell signaling networks and how changes in these networks may affect the transmission of information (Cho and Wolkenhauer, 2003). Modeling cellular signaling networks is often challenging. Unlike other biological networks, cellular signaling networks involve the interactions of components from different levels such as the transcriptome, metabolome, and proteome. In most
cases, the dynamic activity of signaling molecules in cellular systems is not experimentally accessible. Even when temporal data are available at the transcriptional or proteomic level, they may not reflect the real activity of signaling molecules because of the existence of various posttranscriptional regulations and posttranslational modifications whose effects are still difficult to quantitatively assess by current biological technologies (Foth et al., 2008). Despite these obstacles, by using discrete dynamic models and network analysis methods, cellular signaling networks can be assembled and qualitatively modeled in a predictive manner and these models provide hypotheses for the underlying mechanisms, as we shall describe in the following section.
3. Boolean Dynamic Modeling
Dynamic models describe the behavior of a system over time (Ellner and Guckenheimer, 2006). In dynamic models, each node has a state (or status) that varies over time due to interactions with other nodes. For a continuous dynamic system, the states of nodes are described by quantitative variables and the changes in the nodes' states are usually modeled by a set of differential equations, in which the time variable runs over a continuous interval. Although the concentration of components in cellular signaling networks is continuously quantitative, various conditions can lead to a saturation regime or a regime of low concentrations, which enables a binary or qualitative simplification of component states (Bornholdt, 2008). In discrete dynamic models, the state of a node is qualitative and the time variable is often also discrete. Discrete dynamic models include Boolean networks (Kauffman, 1969), finite dynamical systems (Jarrah and Laubenbacher, 2007), difference equations (May, 1976), Petri nets (Chaouiya, 2007), etc. The most popular discrete dynamic model applied in modeling biological networks is Boolean networks, which constitute the focus of this chapter.

Boolean networks are a representation of a dynamic system introduced to model gene regulatory networks (Kauffman, 1969; Thomas, 1973). A Boolean network model can be represented by a directed graph G = (V, E) with a set of nodes V = {v1, v2, . . ., vn} denoting the elements of the network and a list of Boolean transfer functions F = {F1, F2, . . ., Fn} implicitly defining the edges E between the nodes. Network node vi stands for a gene, a protein, or a stimulus with an associated expression level Xi, representing the concentration of a gene (protein) or the amount of the stimulus present in the cell. This level is approximated by two qualitative states, where Xi = 1 represents the fact that node vi is expressed or ON (a high concentration) and Xi = 0 means that it is not expressed or OFF (a baseline/subthreshold concentration). F is a set of logical functions, one assigned to each node. It represents the regulatory rules between network
components and determines the evolution of the system from the current state to the next state. The future state of each node is determined by the current states of other nodes through its Boolean transfer function:

Xi* = Fi(X1, X2, . . ., Xn)

where the * denotes a future state and i = 1, 2, . . ., n. The Boolean transfer functions can be expressed using the logic operators "not," "and," and "or." Since the state space of a Boolean network is finite, the system will eventually reach a stationary state (also called a fixed point) or a set of recurring states. These stationary or recurring states are collectively referred to as dynamic attractors. Boolean networks provide a straightforward formalism to describe the dynamics of biological networks without the involvement of kinetic details and thus are suited for modeling large-scale networks. They can be used to analyze the qualitative behavior of a system, such as qualitative gene expression patterns, or the stability, or lack thereof, of a response to a signal.

We will illustrate the concepts of Boolean dynamic modeling on a simple signaling network given in Fig. 11.2. In this network, I is the input node representing the signal and the output node O stands for the ultimate cellular response. There are six intermediate nodes, A, B, . . ., F, denoting proteins, metabolites, or other signaling components. The interactions between the nodes are represented by directed edges. Formulating a Boolean dynamic model for signaling networks entails three main steps: constructing the network, determining the Boolean transfer functions, and selecting update modes for state transitions. In the following, we will discuss these steps one by one.
Transfer functions (Fig. 11.2B):
A* = I
B* = not I
C* = not I and D
D* = (A or not B) and E
E* = A or not D
F* = not C and D
O* = (D and E) or F
Figure 11.2 An example of a simple signaling network and its Boolean network model. (A) In this network example, node I is the input and node O is the output. Nodes A, B, . . ., F are intermediate nodes. Positive interactions are represented by directed edges with sharp arrows; negative interactions are represented by directed edges with blunt arrows. (B) The Boolean transfer functions for each signaling component (node) in (A). The state of the nodes is indicated by the node labels and * denotes the state of the node at a future time instant.
A recently developed software package for modeling biological systems using Boolean network models is BooleanNet, available at http://www.code.google.com/p/booleannet/ (Albert et al., 2008). The input to this software is a set of Boolean rules in a text file. Users can select among several state transition modes, and between purely Boolean and continuous-Boolean hybrid modeling. This software requires minimal programming expertise and can be run via a web interface or as a Python library to be used through an application programming interface. All the illustrative computations in this chapter on the example in Fig. 11.2 are done with BooleanNet.
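As an illustration, the network of Fig. 11.2 can be written in BooleanNet's rule syntax, which closely mirrors the transfer functions of Fig. 11.2B. The sketch below assumes the boolean2 module described by Albert et al. (2008); exact function names, initialization syntax, and mode options may differ between versions of the package.

import boolean2   # the BooleanNet library (Albert et al., 2008)

rules = """
I = True
A = B = C = D = E = F = O = False

A* = I
B* = not I
C* = not I and D
D* = (A or not B) and E
E* = A or not D
F* = not C and D
O* = (D and E) or F
"""

model = boolean2.Model(text=rules, mode="sync")   # "async" for random-order updates
model.initialize()
model.iterate(steps=10)
for state in model.states:
    print(int(state.I), int(state.A), int(state.D), int(state.O))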
3.1. Constructing the network backbone
In modeling signaling networks by a Boolean dynamic model, the first step is to synthesize the network by reading the relevant literature concerning the signaling network to be modeled. The published literature provides a valuable source of information about individual signaling components and the cause–effect relationships between different components. Although such information is not always explicit and complete, many regulatory relationships can be inferred from experimental observations. Experimental information about the involvement of a component in a signaling network is of several types. For example, experiments showing that the activity, subcellular location, or concentration of a protein changes after giving the input signal or perturbing a known component of the signaling network indicate that this protein might be a component of the signaling network of interest. Different responses to a stimulus after mutating or overexpressing a gene provide genetic evidence for the involvement of the product of the gene in the signal transduction process. Enzymatic activity, protein–protein interactions (Walhout and Vidal, 2001), and transcription factor–gene interactions (Buck and Lieb, 2004) provide biochemical evidence of direct relationships between two components. Chemical or exogenous treatments of a component provide pharmacological evidence, which implies indirect relationships between two components. Both biochemical and pharmacological evidence can be represented as component-to-component relationships such as A promotes B (denoted by A → B) or A inhibits B (denoted by A —| B), which correspond to directed arcs from A to B in a graph representing a signaling network. The arcs can be classified as inhibitory (negative) or activating (positive). In many situations, genetic evidence from multiple experiments leads to double causal inferences like "C promotes the process through which A promotes B" (Albert et al., 2007). In some cases, these inferences can be broken down into two separate component-to-component relationships. For example, if the interaction between A and B is direct and C is not a catalyst of the A–B interaction, C can be assumed to activate A. Many experimental observations are indirect causal relationships, or even the double causal relationships mentioned above, that do not lend themselves to
easy representation. Fortunately, software applications exist that make network synthesis easier. Based on the ideas and method framework of Albert et al. (2007), the software NET-SYNTHESIS for synthesizing signaling networks from collected literature observations was developed; it is available at http://www.cs.uic.edu/dasgupta/network-synthesis (Kachalo et al., 2008). The main idea of the network synthesis method is to find a most parsimonious network that incorporates all known components and processes and is consistent with all reachability relationships between known components (Albert et al., 2007). The input to NET-SYNTHESIS is a list of positive or negative relationships among biological components, and its output is a simplified network diagram and a file with the edges of the inferred signaling network (Kachalo et al., 2008).
3.2. Determining transfer functions
The constructed network is a static backbone of the signal transduction process. In order to understand the dynamic behavior of the system and beyond, the next step is to determine the dependence relationships among the node states, which eventually define the functional state of the signal transduction system. The state change of each node in a Boolean network is described by a Boolean transfer function, which is determined by the knowledge of the nodes directly upstream of this node and the sign (inhibition or activation) of the edges between the upstream nodes (regulators) and the target node. The state of a node having a single activator and no inhibitors, represented as A → B in the network and obtained from the observation that a high concentration of A activates B, simply follows the state of the activator with a time delay. For example, in Fig. 11.2, the transfer function for the state of node A has the following form:

A* = I

where for simplicity the state of the nodes is indicated by the node labels and * denotes the state of node A at a future time instant. The rule indicates that the next state of node A equals the current state of node I. Nodes having a single inhibitor and no activators, represented by A —| B in the network, indicating that the activation of the target node B requires a low concentration or inactivity of the inhibitor A, have the state opposite to the state of the inhibitor with a time delay. For example, the transfer function for the state of node B in Fig. 11.2 has the following form:

B* = not I

where "not" denotes logical negation, such that not ON = OFF and not OFF = ON. In most cases, the activation of a component requires
multiple regulators. The "and" operator can be used to denote conditional regulation, that is, that the coexpression of two (or more) regulators is absolutely necessary to activate the target node. In the example of Fig. 11.2, the component C is regulated by both I and D. We assume that the absence of I and the presence of D are required for the activation of node C, so the transfer function for the state of node C has the following form:

C* = not I and D

Sometimes a component is regulated by multiple pathways, but any of them can activate the component independently. Such relationships between multiple pathways can be represented by the "or" operator, representing independent activation. In the example of Fig. 11.2, we assume that the regulation of node E by A and D is independent, and either the presence of A or the absence of D is sufficient to activate E, thus leading to the transfer function:

E* = A or not D

When a component is regulated by more than two nodes, the transfer function can be a complicated combination of "and," "or," and "not" operations, depending on the relationships between multiple pathways, exemplified by the transfer functions of the output node O and the intermediate node D in Fig. 11.2. Each transfer function in a Boolean network model can also be represented by a logical truth table, listing the state of a node resulting from each combination of states of its regulators (see the worked example at the end of this section).

The transfer functions are usually determined according to prior knowledge of components or pathways. For example, if the proteins A and B bind with each other to form a complex C, both A and B must be present in order for the node C to be active, and this can be described by C* = A and B. Similarly, if a component C is expressed only in A and B double mutant organisms, and is not expressed in B mutants or A mutants, the expression of C requires the inhibition of both A and B, so we can write C* = not A and not B. The published biological literature is an important source for capturing such dependencies between signaling components. If the structure of the Boolean network or a Boolean transfer function is not fully known, several variants of the Boolean network can be created, and comparing their dynamic sequences or output responses with observations of the real system can guide the completion of the Boolean dynamic model (Bornholdt, 2008). For those signaling pathways that are poorly investigated, a complementary approach is to learn the topology of Boolean networks and the Boolean functions from available high-throughput temporal data, which has been a key machine learning problem (Lähdesmäki et al., 2003).
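As a concrete illustration of the truth-table representation mentioned above, the transfer function C* = not I and D from Fig. 11.2B corresponds to the following table:

I  D  |  C*
0  0  |  0
0  1  |  1
1  0  |  0
1  1  |  0

The target node C is activated only when its inhibitor I is absent and its activator D is present.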
3.3. Selecting models for state transitions
Given the topological structure and transfer functions of a Boolean network, the transition from one state of the signal transduction system to the next can be implemented in multiple ways, which have a considerable effect on the dynamics of the system. Synchronous models use the simplest update method, in which the states of all nodes are updated simultaneously (in one step) according to the last state of the system:

Xi(t + 1) = Fi(X1(t), X2(t), . . ., Xn(t))

This update method implicitly assumes that the time scales of all biological processes in which the signaling components are involved are similar. The benefit of synchronous models is that the intermediate dynamic state sequence of the system is deterministic, thus an initial condition leads to the same attractor in different replicate simulations. For example, let us assume that the initial condition of the signaling network in Fig. 11.2 is I0 = C0 = 1 (ON), A0 = B0 = D0 = E0 = F0 = O0 = 0 (OFF). The state of the system at the first step is calculated by plugging the node states of the initial condition into the transfer functions in Fig. 11.2B, leading to I1 = 1 (since nothing is affecting I), A1 = 1, B1 = 0, C1 = 0 (since not I0 = 0), D1 = 0 (since E0 = 0), E1 = 1 (since not D0 = 1), F1 = 0, and O1 = 0. Representing the state of the system as an array in the order I-A-B-C-D-E-F-O, after one step X(t = 1) = [1 1 0 0 0 1 0 0] using a synchronous model. In the second step, the state of the system becomes X(t = 2) = [1 1 0 0 1 1 0 0] (node D turns ON because E1 = 1), and in the third step X(t = 3) = [1 1 0 0 1 1 1 1], a state which remains unchanged even after further updates. Thus, this state is a fixed-point dynamical attractor.

Biological processes in cellular systems are complicated, and most often the time scales of these processes are different and can vary widely from fractions of seconds to hours. For example, protein phosphorylation and other posttranslational mechanisms are much faster than protein synthesis or transcriptional regulation. Synchronous models assume the existence of a perfect synchronization among the states of signaling components and cannot properly account for the different time scales over which diverse biological processes take place in a cellular system. Thus, the update method of the system state needs to be extended to account for different time scales. In asynchronous models, the nodes are updated in a nonsynchronous order, depending on the timing information, or lack thereof, of individual biological events. In a random asynchronous model, the next updating times for each component may be randomly chosen at each time instant (Chaves et al., 2005). In a more frequently used asynchronous model (Chaves et al., 2005), the update order is selected randomly from all possible permutations of the nodes:

Xi(t + 1) = Fi(X1(t1), X2(t2), . . ., Xn(tn))
where ti ∈ {t, t + 1}, i = 1, 2, . . ., n denotes the most recent time step at which node vi was updated, which depends on the position of node vi in the update order of all nodes. This update method guarantees that each node is updated exactly once during each unit time interval. For example, in Fig. 11.2, let us set the same initial condition as before, I0 = C0 = 1 (ON), A0 = B0 = D0 = E0 = F0 = O0 = 0 (OFF), and use an asynchronous update with the update order I–A–E–D–B–C–F–O. The next state of the system is calculated by plugging the most recent node states into the transfer functions in Fig. 11.2B, leading to I1 = 1 (since nothing is affecting I), A1 = 1, E1 = 1 (since A1 = 1 and also not D0 = 1), D1 = 1 (since A1 = E1 = 1), B1 = 0, C1 = 0 (since not I1 = 0), F1 = 1, and O1 = 1. Thus, for this update order X(t = 1) = [1 1 0 0 1 1 1 1], a state that was obtained only after three synchronous updates. If, however, the nodes in Fig. 11.2 were updated in the order I–A–B–C–D–E–F–O, the same state would have been obtained as in a synchronous update. In asynchronous models involving stochasticity, the same initial condition can lead to different states, and by extension, to different attractors, due to the random choices involved in the state transitions. The uncertainty in these update methods can reflect population-level differences or stochasticity in signal transduction processes. If prior knowledge about the time scales of some components is available, the asynchronous update can be augmented by restricting the update order of these components. For example, if it is known that in a signal transduction process component A is always activated before component B, the permutations of the update order can be restricted to those in which A precedes B.

There also exist deterministic asynchronous models in which each node vi is associated with an intrinsic time unit γi and is updated at multiples of that unit, that is, at the time instants ti^k = kγi (Chaves et al., 2006). At any given time t, the node vi whose next update instant is closest to t, that is, for which ti^k = min over all j, l of {tj^l > t}, is updated in the following way:

Xi(ti^k) = Fi(X1(τ1), X2(τ2), . . ., Xn(τn))

where τj = max over l of {tj^l < ti^k} is the most recent instant at which node vj was updated. While in the random-order asynchronous model presented earlier each node is updated once in every time step (also called a round of update), in this intrinsic-time-unit asynchronous model nodes with longer time units will have fewer updates than nodes with shorter time units. This update mode is more intuitive and reasonable if the time units for biological events such as translation, transcription, and phosphorylation involved in a signal transduction process can be estimated from biological knowledge; otherwise the time units for each node can be sampled randomly from an interval.
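The synchronous and random-order asynchronous schemes above are easy to prototype directly. The following Python sketch re-implements the Fig. 11.2 rules and both update schemes; it is an illustrative re-implementation for this chapter's toy network, not BooleanNet code.

import random

NODES = ["I", "A", "B", "C", "D", "E", "F", "O"]
RULES = {  # transfer functions from Fig. 11.2B; the input I holds its state
    "I": lambda s: s["I"],
    "A": lambda s: s["I"],
    "B": lambda s: not s["I"],
    "C": lambda s: (not s["I"]) and s["D"],
    "D": lambda s: (s["A"] or not s["B"]) and s["E"],
    "E": lambda s: s["A"] or not s["D"],
    "F": lambda s: (not s["C"]) and s["D"],
    "O": lambda s: (s["D"] and s["E"]) or s["F"],
}

def sync_step(state):
    # synchronous update: every node reads the previous state
    return {n: RULES[n](state) for n in NODES}

def async_round(state):
    # random-order asynchronous round: every node updated exactly once,
    # each update seeing the most recent values of the other nodes
    order = random.sample(NODES, len(NODES))
    s = dict(state)
    for n in order:
        s[n] = RULES[n](s)
    return s

state = {n: bool(v) for n, v in zip(NODES, [1, 0, 0, 1, 0, 0, 0, 0])}
for t in range(1, 5):
    state = sync_step(state)
    print(t, [int(state[n]) for n in NODES])
# prints [1 1 0 0 0 1 0 0], [1 1 0 0 1 1 0 0], [1 1 0 0 1 1 1 1], ... (fixed point)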
3.4. Analyzing steady states of the system
For a Boolean network model with n nodes, there are 2^n possible initial conditions, from which the system will eventually converge to a limited set of attractors. In synchronous models and deterministic asynchronous models, these attractors are fixed points (steady states) or k-cycles (in which k states are repeated regularly). In stochastic asynchronous models, the attractors are fixed points or so-called loose attractors, sets of states that are repeated irregularly (Harvey and Bossomaier, 1997). Asynchronous models have the same fixed points as synchronous models, since in a fixed point (steady state) the order of update of the nodes is irrelevant. For example, the state X = [1 1 0 0 1 1 1 1] is a fixed point of the network in Fig. 11.2, and all states that include the ON state of the input node I will ultimately lead to this fixed point, irrespective of the (a)synchronicity of the model. The only other attractor possible for this network is X = [0 0 1 0 0 1 0 0], and all states that include the OFF state of the input node I will ultimately lead to this fixed point. Note that in both fixed points the state of the output node is the same as the state of the input node.

In modeling signaling networks, the initial condition can usually be set according to prior expert knowledge. For example, the input node in a signaling network represents signals or stimuli, so the initial condition should have the input node ON. Only after the stimuli are present and transferred to intracellular components can the cell generate the final response, so in the initial condition the output node can be set as OFF. The initial states of other intermediate components can be similarly set according to prior knowledge. If no sufficient information is available for setting a realistic initial condition, one can sample over a large number of random initial conditions and examine the fraction of realizations of a given state of a node (e.g., ON) at a certain time step. This fraction for the output node represents the probability that the system attains the response and reflects a dynamic behavior that is weakly dependent on the details of the initial conditions (Li et al., 2006).

For the output node O in Fig. 11.2, let us set the input node I as ON and the output node O as OFF, randomly sample the initial states of all other nodes, and use BooleanNet with a random-order asynchronous model (Albert et al., 2008). The fraction of O = ON in 50 replicate simulations as a function of time steps is shown in Fig. 11.3. We can see that the fraction of O = ON stabilizes at 1 after three time steps, indicating that in all simulations with I = ON, O = OFF as initial conditions, the output stabilizes at ON, irrespective of the initial states of the other nodes. Repeating the analysis with I = OFF, O = OFF leads to the fraction of O = ON stabilizing at 0, indicating that in all simulations the output stabilizes at OFF. By setting different initial conditions, differential modes of input/output behavior can be identified. Attractors represent combinations of the activation
[Figure 11.3: The fraction of O = ON in 50 asynchronous simulations of the Boolean network given in Fig. 11.2 as a function of time steps (rounds of update); curves shown for the initial conditions I = ON, O = OFF and I = OFF, O = OFF.]
Attractors represent combinations of the activation states of components that trigger the cell responses in signaling networks or specify the phenotypic behaviors in gene regulatory networks. In modeling signaling networks, the state of the output node is of more interest than those of intermediate nodes, so observing the long-term behavior of the output node is most relevant. In modeling gene regulatory networks, usually the attractors and the dynamic sequence of the whole system correspond to known biological events such as certain phases of the cell cycle (Li et al., 2004) or apoptosis and cell differentiation (Huang et al., 2009), and thus identifying attractors is most relevant. The fixed points of a Boolean network model can be determined analytically by finding all possible solutions X of the equations

$$X_i = F_i(X_1, X_2, \ldots, X_n),$$

meaning that the next state of each node equals its current state. For example, the fixed points of the network in Fig. 11.2 can be determined by solving the set of equations that results when taking away the stars from the left-hand side of the transfer functions in Fig. 11.2B. Expressing the state of each node as a function of the state of the node I, and simplifying the resulting expressions, we find A = D = F = O = I, B = not I, C = 0, E = 1, which yields the two solutions we found earlier. The set of initial conditions that leads the system to a specific attractor is referred to as its basin of attraction, which can be determined by doing repeated simulations from each initial condition. While asynchronous models have the same fixed points as synchronous models, the basins of attraction of these fixed points are generally different, and the basins of attraction of different fixed points can overlap, since the final state(s) reachable from a specific initial condition depend on the update mode.
BooleanNet provides function modules to detect steady states or cyclic attractors. For example, in Fig. 11.2, if the transfer function for E is changed to E* = A and not D, the components D and E form a negative feedback loop. In a synchronous model the initial condition [1 0 0 1 0 0 0 0] leads, after two steps, to a cyclic attractor with period 4: [1 1 0 0 1 1 0 0] → [1 1 0 0 1 0 1 1] → [1 1 0 0 0 0 1 1] → [1 1 0 0 0 1 0 1] → [1 1 0 0 1 1 0 0]. There are other initial conditions that lead to the same cyclic attractor in this small network.
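The analytical fixed-point condition can also be checked by brute force for small networks. The sketch below, which reuses the hypothetical `rules` dictionary from the earlier sketch, simply enumerates all 2^n states and keeps those that the transfer functions map to themselves.

```python
from itertools import product

# Brute-force fixed-point search: a state X is a fixed point when every
# transfer function maps it to itself, so the update order is irrelevant.
# Feasible only for modest n (2**n candidate states). Reuses the
# hypothetical `rules` dictionary from the sketch above.
nodes = list(rules)

def is_fixed_point(state):
    return all(rules[v](state) == state[v] for v in nodes)

fixed_points = [dict(zip(nodes, bits))
                for bits in product([False, True], repeat=len(nodes))
                if is_fixed_point(dict(zip(nodes, bits)))]
print(fixed_points)
```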
3.5. Testing the robustness of the dynamic model

With all preparation done, an important step is to assess whether the constructed dynamic model is able to reproduce known dynamic behaviors or cellular responses, and whether the model is robust to changes in interactions or Boolean transfer functions. Comparing the model's intermediate dynamic sequence and output responses with experimentally observed dynamic events can suggest whether further changes to the model are needed. For example, assume that a biological initial condition is expected to lead to the ON state of the output node after some time. If this biologically plausible initial condition always leads to an OFF state of the output node in the constructed dynamic model, regardless of the update orders of the nodes, it indicates that one or more Boolean transfer functions may be wrong (e.g., using "and" instead of "or" or vice versa) or incomplete (e.g., some important components are not included). After several rounds of comparison of the model with experimental observations, a dynamic model consistent with all important prior knowledge is obtained. However, it is not a good model if small perturbations lead to drastically different results, because this suggests that the model cannot reflect the adaptability of the system under diverse circumstances. A robust model should be able to maintain the original output response under most small perturbations. Systematic assessment of the robustness of the model can be done in multiple ways, for example, by interchanging "or" and "and" rules, switching an inhibitory edge to an activating edge or vice versa, rewiring a pair of edges, or adding or deleting an edge. The fraction of cases, over a certain number of perturbations, in which the output is altered reflects the robustness of the model; a sketch of a single such check is given below.
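As a minimal illustration of such a robustness check (again on the hypothetical toy rules, not on a published model), the sketch below swaps an "and" for an "or" in a single transfer function and asks whether the long-run output changes:

```python
# One robustness check: swap "and" for "or" in a single hypothetical
# transfer function and ask whether the long-run output changes. A full
# analysis repeats this over many perturbations (rule swaps, edge sign
# flips, rewiring) and reports the fraction that alter the output.
perturbed = dict(rules)
perturbed["B"] = lambda s: s["A"] or not s["C"]   # "and" -> "or"

def settle(rule_set, state, rounds, rng):
    """Run several asynchronous rounds and return the output node's state."""
    order = list(rule_set)
    for _ in range(rounds):
        rng.shuffle(order)
        for node in order:
            state[node] = rule_set[node](state)
    return state["B"]          # treat B as the output node of the toy model

rng = random.Random(1)
init = {"I": True, "A": False, "B": False, "C": False}
print("output altered:", settle(rules, dict(init), 20, rng)
      != settle(perturbed, dict(init), 20, rng))
```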
3.6. Making biological implications and predictions Discrete dynamic modeling allows us to integrate fragmentary knowledge about a system into a logical representation, which further helps us understand the system at a global level. A great advantage of discrete dynamic modeling is its ability to predict the outcomes of system perturbations and
direct future wet-bench experiments. There are several efficient ways to analyze the effects of system perturbations. For example, knockout mutants can be simulated by keeping the state of the corresponding components OFF. Overexpression or constitutive activation of certain components can be simulated by keeping their state ON. Chemical or exogenous treatments can be simulated by activating or inhibiting certain nodes. By studying the effects of such system perturbations, we can assess the importance of certain components, predict the phenotypic traits of system perturbations, and gain other valuable insights into the underlying mechanisms of the signal transduction system. For example, in Fig. 11.2, we randomly sample the initial conditions in which I = ON and O = OFF and examine the effects of perturbations of the model by knocking out and overexpressing each node. The fractions of O = ON in the wild type (no perturbation) and perturbed models are shown in Fig. 11.4. We find that knockout of A leads to a reduced fraction of O = ON, suggesting hyposensitivity to the signal in A mutants. Blocking B, C, or F leads to a response very similar to that in the wild type. Knockouts of D or E completely eliminate the response (i.e., the fraction of O = ON is zero). This insensitivity to the signal in D or E mutants suggests that the components D and E are essential for the output response O. In addition, overexpression of A, B, or C has no effect on the response to the signal I. However, overexpression of D, E, or F makes the model reach the ON state of the output node O faster, indicating hypersensitivity to the signal in these perturbed models.
[Figure 11.4: The fraction of O = ON in 50 asynchronous simulations of the Boolean network given in Fig. 11.2 as a function of time steps, in the wild type and in perturbed models (knockout of A, knockout of B, knockout of D, and overexpression of D).]
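In simulation terms, these perturbations amount to clamping a node's state after every round of update. A minimal sketch, again on the hypothetical toy rules used earlier rather than the network of Fig. 11.2:

```python
# Knockout and overexpression amount to clamping a node: after every round
# of update its state is forced back to OFF (knockout) or ON
# (overexpression). Sketched below on the same hypothetical toy rules.
def simulate_with_clamp(rule_set, init, clamp, rounds, rng):
    state = dict(init)
    state.update(clamp)                  # impose the perturbation up front
    order = list(rule_set)
    for _ in range(rounds):
        rng.shuffle(order)
        for node in order:
            state[node] = rule_set[node](state)
        state.update(clamp)              # re-impose it after each round
    return state

rng = random.Random(2)
init = {"I": True, "A": False, "B": False, "C": False}
print("wild type      :", simulate_with_clamp(rules, init, {}, 10, rng))
print("A knocked out  :", simulate_with_clamp(rules, init, {"A": False}, 10, rng))
print("A overexpressed:", simulate_with_clamp(rules, init, {"A": True}, 10, rng))
```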
4. Variants of Boolean Network Models

The nodes in a Boolean network have only two states, ON and OFF, which sometimes are not sufficient to characterize the activity or concentration level of signaling components. Logical models with more than two states have been used to model several biological systems, such as root hair development (Mendoza and Alvarez-Buylla, 2000), segmentation in Drosophila embryos (Sanchez and Thieffry, 2003), and Arabidopsis floral morphogenesis (Espinosa-Soto et al., 2004). Such models are still qualitative but allow more levels of activity or concentration for some specific signaling components. Generally, in these models the transfer functions are given in the form of truth tables and the nodes are updated synchronously. In addition, threshold Boolean networks (Kürten, 1988), the simplest Boolean network models, have been successfully used as well (Davidich and Bornholdt, 2008b; Li et al., 2004). A hybrid of Boolean transfer functions and differential equations, the piecewise linear differential equation framework developed by Glass (1975), has also been fruitfully applied due to its attractive combination of continuous time, quantitative information, and few kinetic parameters (Chaves et al., 2006; De Jong et al., 2004; Thakar et al., 2009).
4.1. Threshold Boolean networks

In threshold networks (Kürten, 1988), each node takes binary values of 0 or 1. Instead of using the logic operators "and," "or," and "not," the transfer functions of threshold networks use + or − to represent activation and inhibition. Jij = +1 denotes that node vj activates node vi, Jij = −1 means that node vj inhibits node vi, and Jij = 0 denotes that there is no regulatory signal from node vj to node vi. The dynamics of the model is determined by the following transfer functions:

$$X_i(t+1) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} J_{ij} X_j(t) + \theta_i > 0 \\ 0 & \text{if } \sum_{j=1}^{n} J_{ij} X_j(t) + \theta_i \le 0 \end{cases}$$

which are simple sum rules for each node. Here θi is a threshold parameter that controls how many signals are needed for the activation of node vi. Though mostly updated synchronously, threshold Boolean networks can also be updated in an asynchronous mode. A slightly modified synchronous threshold Boolean network was successfully applied to model the yeast cell cycle control network (Davidich and Bornholdt, 2008b; Li et al., 2004). In these two studies, a large percentage of initial states of the systems were found to lead to a specific fixed point attractor which exactly corresponds to the G1 phase of the cell cycle.
The dynamic sequence leading to this attractor corresponds to the biological pathway encoding the cell cycle events, which shows the power and generality of Boolean dynamic modeling. The dynamic models were demonstrated to have very high robustness, which suggests evolutionary constraints on the variability of these systems (Davidich and Bornholdt, 2008b; Li et al., 2004). A minimal sketch of a threshold-network update rule is given below.
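```python
import numpy as np

# A minimal synchronous threshold-network update following the sum rule
# above: node i turns ON when sum_j J[i, j] * X[j] + theta[i] > 0. The
# 3-node interaction matrix and thresholds are illustrative only, not
# taken from the yeast cell-cycle model.
J = np.array([[ 0,  1, -1],     # node 0: activated by 1, inhibited by 2
              [ 1,  0,  0],     # node 1: activated by 0
              [-1,  1,  0]])    # node 2: inhibited by 0, activated by 1
theta = np.zeros(3)

def threshold_step(x):
    return (J @ x + theta > 0).astype(int)

x = np.array([1, 0, 0])
for t in range(6):
    x = threshold_step(x)
    print(t + 1, x)
```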
4.2. Piecewise linear systems

Boolean dynamic models focus on the topological structure of a network and simplify its dynamics, which enables efficient analysis of large networks. However, Boolean models ignore intermediate states of expression and kinetic details, and may miss dynamic behaviors. Continuous models such as ordinary differential equations provide a more detailed description of a system, but involve many kinetic parameters which are largely unknown. Leon Glass introduced a variant of Boolean network models called piecewise linear differential equations which provides a bridge between discrete and continuous modeling approaches (Glass, 1975). In this model, each node of the network is represented by both a continuous variable $\hat{X}_i$ denoting the concentration of the component vi and a discrete variable Xi denoting its activity. The continuous variables are determined by ordinary differential equations:

$$\frac{d\hat{X}_i}{dt} = k_i F_i(X_1, X_2, \ldots, X_n) - d_i \hat{X}_i$$

which states that the rate of change of the concentration of component vi is a combination of synthesis (governed by the Boolean transfer function $F_i$) and free degradation (the second term on the right side). Here $k_i$ is the synthesis rate constant and $d_i$ is the degradation rate constant. At time instant t, the discrete variable Xi is defined as a step function of its continuous concentration:

$$X_i(t) = \begin{cases} 0 & \hat{X}_i(t) \le \theta_i \dfrac{k_i}{d_i} \\ 1 & \hat{X}_i(t) > \theta_i \dfrac{k_i}{d_i} \end{cases}$$

where $\theta_i \in (0, 1)$ is a threshold for the component vi and represents the fraction of the maximal concentration necessary for vi to become active. Note that each fixed point of a Boolean network yields a steady state of its piecewise linear system. The dynamic trajectory of a piecewise linear system from a given initial condition can be obtained by solving the ordinary differential equations between the time points at which a discrete variable changes its
value. Piecewise linear models have been developed to model the Drosophila segmentation network (Chaves et al., 2006), the pathogen–immune interaction network (Thakar et al., 2009), and the Escherichia coli carbon starvation response network (Ropers et al., 2006). BooleanNet has a module for implementing such piecewise linear systems (Albert et al., 2008).
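A piecewise linear system of this kind can be simulated with a simple forward Euler scheme, updating the discrete states from the thresholds at each step. The two-node example below (node 1 synthesized constitutively, node 2 activated by node 1) is a minimal illustration with unit rate constants, not a published model.

```python
import numpy as np

# Forward Euler integration of a two-node piecewise linear system:
# d(xc_i)/dt = k_i * F_i(X) - d_i * xc_i, with the discrete state X_i = 1
# once the concentration xc_i exceeds theta_i * k_i / d_i. All parameter
# values are illustrative.
k = np.array([1.0, 1.0])        # synthesis rate constants
d = np.array([1.0, 1.0])        # degradation rate constants
theta = np.array([0.5, 0.5])    # activation thresholds (fractions)

def F(X):
    return np.array([1, X[0]])  # Boolean transfer functions of the circuit

xc = np.zeros(2)                # continuous concentrations
dt = 0.01
for step in range(1000):        # integrate from t = 0 to t = 10
    X = (xc > theta * k / d).astype(int)
    xc += dt * (k * F(X) - d * xc)
print("concentrations:", xc.round(3),
      "discrete state:", (xc > theta * k / d).astype(int))
```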
4.3. From Boolean switches to dose–response curves

In Boolean dynamic modeling, the state of each node is like a Boolean switch, either ON or OFF, depending on the states of its regulators. If step functions are adopted as regulatory functions in ordinary differential equations, the continuous dose–response curve of a dynamic variable can become a Boolean-like switch. Thus, as shown by Davidich and Bornholdt (2008a), a Boolean network model can be formulated as a specific coarse-grained limit of a more detailed differential equation model. Let X(t) denote the mRNA level of a gene at time t, and Y(t) denote the active concentration of the gene's transcriptional activator at time t. Assume Y(t) increases linearly from 0 to 1 over the time span t ∈ [0, 10]. Define a discrete step function over [0, 1] as the regulatory function for X(t):

$$B(Y) = \begin{cases} 0 & Y \le 0.5 \\ 1 & Y > 0.5 \end{cases}$$

shown as a solid line in Fig. 11.5A. In the Boolean model, the relation between X and Y can be described as X = B(Y), where for convenient comparison we assume that the state of Y is transferred to X immediately. In the piecewise linear system, the relation between X and Y can be described as

$$\frac{dX}{dt} = B(Y) - X$$

where we assume unit synthesis and degradation rate constants for convenience. The states of X according to the Boolean model and the piecewise linear model can be seen in Fig. 11.5B, denoted as "Boolean" and "PieceWL," respectively. We can see that the curve of X(t) in the piecewise linear system is just the continuous version of that in the Boolean model. Now, instead of using a Boolean switch B(Y), we assume that the regulatory function is the widely used Hill function, shown as dashed lines in Fig. 11.5A (denoted as "F(Y), n = 3, 6, 10"):

$$F(Y) = \frac{Y^n}{Y^n + K^n}$$
[Figure 11.5: Illustration of step functions and the behavior of the corresponding dynamic systems. (A) A Boolean function B(Y) and Hill functions F(Y) with different parameters (n = 3, 6, 10) used as transfer functions, plotted over t from 0 to 10. (B) The dynamic behavior X(t) of the Boolean model, the piecewise linear system (PieceWL), and the ordinary differential equations with Hill functions (ODE, n = 3, 6, 10).]
where we set K = 0.5. In the ordinary differential equation system, the relation between X and Y can be described as

$$\frac{dX}{dt} = F(Y) - X.$$

Again, we set the rate constants to 1. The state of X according to the ordinary differential equation can be seen in Fig. 11.5B, denoted as "ODE, n = 3, 6, 10." We can see that, as n → ∞, when Y > K, dX/dt > 0, therefore the mRNA synthesis of the gene is dominant and
X(t) is increasing; when Y < K, dX/dt ≤ 0, the mRNA degradation of the gene is dominant and X(t) stays at zero. The simplification of this phenomenon is just a Boolean switch: if Y is ON, the transcription factor activates the gene and X is ON; if Y is OFF, the mRNA degrades and X is OFF. Thus, the continuous dose–response curve becomes a Boolean switch. Note that any step-like function, such as the Hill function, can generate this correspondence. Here, there is only one regulator activating the gene. When multiple components regulate the gene, the number of parameters in the ordinary differential equation increases, but the piecewise linear system still has two parameters per node and the Boolean network has no kinetic parameters. This advantage of Boolean network models and piecewise linear models makes them suitable for modeling large signaling networks.
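The limit described above is easy to verify numerically. The sketch below is a minimal illustration with forward Euler integration, using the same linear ramp of Y and K = 0.5; as n grows, the Hill-driven trajectories approach the piecewise linear one.

```python
# Numerical version of the comparison above: Y ramps linearly from 0 to 1
# over t in [0, 10]; X obeys dX/dt = f(Y) - X with unit rate constants,
# where f is either the Boolean switch B(Y) ("PieceWL") or a Hill function
# with K = 0.5 and increasing steepness n.
K, dt = 0.5, 0.01

def B(y):
    return 1.0 if y > 0.5 else 0.0

def hill(y, n):
    return y**n / (y**n + K**n)

for label, f in [("PieceWL ", B),
                 ("ODE n=3 ", lambda y: hill(y, 3)),
                 ("ODE n=10", lambda y: hill(y, 10))]:
    x = 0.0
    for step in range(1000):            # t from 0 to 10
        y = (step * dt) / 10.0          # linear ramp of the activator
        x += dt * (f(y) - x)
    print(label, "X(10) =", round(x, 3))
```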
5. Application Examples Boolean networks have been successfully applied in modeling many biological processes. Here, we describe two application examples, based on our research on Boolean dynamic modeling of signaling networks.
5.1. Abscisic acid-induced stomatal closure

Plants take up carbon dioxide for photosynthesis and lose water by transpiration through pores called stomata. Guard cells, specialized cells that flank the stomata and determine their size, have developed into a favorite model system for understanding plant signal transduction. For example, under drought stress conditions, plants synthesize the phytohormone abscisic acid (ABA), which triggers cellular responses in guard cells, resulting in stomatal closure to reduce plant water loss. ABA-induced stomatal closure has been studied by many different labs, but the information about this signal transduction process had been quite fragmentary. In Li et al. (2006), an ABA-induced stomatal closure signaling network with over 40 components was assembled by an extensive curation of the experimental literature. In this network, the input node is the ABA signal and the output node is the response of guard cells to the ABA signal, that is, the closure of the stomata. The intermediate nodes include important proteins such as the G protein α subunit GPA1, the G protein β subunit AGB1, and the protein kinase OST1, second messengers such as cytosolic Ca2+ and phosphatidic acid, and ion flows. Integrating a large number of experimental observations, an asynchronous Boolean dynamic model was developed to simulate the ABA signaling process. The node Closure has a fixed point for each state of the input node, corresponding to Closure = OFF for ABA = OFF and Closure = ON for ABA = ON. Randomly selected initial conditions
were extensively sampled, and the fraction of "Closure = ON" in all simulations was used as the output of the model, representing the percentage of stomata in a population that have closed due to ABA signaling (Li et al., 2006). Simulating the knockout of signaling components and comparing the percentage of closed stomata with that in the wild type indicates that the assembled network is robust against a significant fraction of perturbations; it also identifies essential components such as membrane depolarizability, anion efflux, and actin cytoskeleton reorganization, whose disruption leads to insensitivity to ABA. In addition, the dynamic model is able to classify nodes by determining whether their disruption would lead to hyposensitivity, hypersensitivity, or insensitivity to ABA. Several of these predictions have been validated by wet-bench experiments, demonstrating the power of discrete dynamic models (Li et al., 2006).
5.2. T-LGL survival signaling network

T cell large granular lymphocyte (T-LGL) leukemia represents a class of lymphoproliferative diseases characterized by an abnormal clonal proliferation of cytotoxic T cells. Unlike normal cytotoxic T lymphocytes (CTL), which are eliminated by activation-induced cell death, leukemic T-LGL cells are not sensitive to Fas-induced apoptosis (which is crucial to normal activation-induced cell death) and thus remain long-term competent. In Zhang et al. (2008), a T-LGL survival signaling network with over 50 components was created by using NET-SYNTHESIS (Kachalo et al., 2008) to integrate signaling relationships collected from databases and the literature. The main input node in this network is "Stimuli," representing antigen stimulation, and the main output node is "Apoptosis," summarizing the biological effect of normal activation-induced cell death. This network describes how stimuli like chronic virus infection activate the T cell receptor and a subsequent signaling cascade and induce the depletion of reactive CTL through activation-induced cell death. Certain nodes in this network are activated only in leukemic T-LGL and affect normal activation-induced cell death. Based on the assembled signaling network, a predictive Boolean dynamic model was constructed. Simulating the overexpression of proteins indicates that all known signaling abnormalities in leukemic T-LGL can be reproduced by keeping only two proteins, IL-15 and PDGF, constitutively expressed (ON). The study also identified key mediators of the disease, such as NF-κB, SPHK1, and S1P, which stabilize into an ON or OFF state in T-LGL leukemia and whose state reversal leads to effective cell death. Several predictions of the model were validated by corresponding wet-bench experiments and provide important insights for possible treatments of this disease (Zhang et al., 2008). This example again demonstrates that discrete dynamic modeling can help to generate important testable hypotheses without requiring kinetic details and quantitative
information. Such a global view of this complicated biological process would not have been possible without network assembly and discrete dynamic modeling, since patient samples are scarce for this disease.
6. Conclusion and Discussion

This chapter introduced discrete dynamic modeling and network analysis approaches. Network-based discrete dynamic modeling allows the logical organization of disparate information from biological experiments into a coherent framework. Based on these models, predictive and testable hypotheses can be obtained; for example, simulating knockout or overexpression of some components allows us to predict phenotypic responses and find new targets or intervention points. It is a powerful tool to help us understand the system-level behavior of cellular signaling pathways, and it can save much work that would otherwise have to be done in vivo and in vitro. Importantly, discrete dynamic modeling is conceptually simple and fits biologists' intuitive thinking without requiring sophisticated quantitative knowledge. It is worth noting that there is no permanent model of a biological system. The efficacy and accuracy of dynamic models depend heavily on the current knowledge used as input to the model. While guiding experimental designs and helping to generate testable hypotheses, models may become outdated as more biological observations accumulate. At that point, the models need to be modified and refined. Such interplay between theoretical modeling and biological experimentation plays an essential role in the advancement of systems biology.
ACKNOWLEDGMENTS

This work and the original research reported here were partially supported by NSF grants MCB-0618402 and CCF-0643529 (CAREER), NIH grant R01 GM083113-01, and USDA grant 2006-35100-17254.
REFERENCES

Albert, R. (2005). Scale-free networks in cell biology. J. Cell Sci. 118, 4947–4957.
Albert, R., and Othmer, H. G. (2003). The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. Theor. Biol. 223, 1–18.
Albert, R., DasGupta, B., Dondi, R., Kachalo, S., Sontag, E., Zelikovsky, A., and Westbrooks, K. (2007). A novel method for signal transduction network inference from indirect experimental evidence. J. Comput. Biol. 14, 927–949.
Albert, I., Thakar, J., Li, S., Zhang, R., and Albert, R. (2008). Boolean network simulations for life scientists. Source Code Biol. Med. 3, 16.
Aldridge, B. B., Burke, J. M., Lauffenburger, D. A., and Sorger, P. K. (2006). Physicochemical modelling of cell signalling pathways. Nat. Cell Biol. 8, 1195–1203.
Barabasi, A. L., and Oltvai, Z. N. (2004). Network biology: Understanding the cell's functional organization. Nat. Rev. Genet. 5, 101–113.
Bornholdt, S. (2008). Boolean network models of cellular regulation: Prospects and limitations. J. R. Soc. Interface 5(Suppl. 1), S85–S94.
Buck, M. J., and Lieb, J. D. (2004). ChIP-chip: Considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83, 349–360.
Chaouiya, C. (2007). Petri net modelling of biological networks. Brief Bioinform. 8, 210–219.
Chaves, M., Albert, R., and Sontag, E. D. (2005). Robustness and fragility of Boolean models for genetic regulatory networks. J. Theor. Biol. 235, 431–449.
Chaves, M., Sontag, E. D., and Albert, R. (2006). Methods of robustness analysis for Boolean models of gene control networks. Syst. Biol. (Stevenage) 153, 154–167.
Cho, K. H., and Wolkenhauer, O. (2003). Analysis and modelling of signal transduction pathways in systems biology. Biochem. Soc. Trans. 31, 1503–1509.
Conzelmann, H., and Gilles, E. D. (2008). Dynamic pathway modeling of signal transduction networks: A domain-oriented approach. Methods Mol. Biol. 484, 559–578.
Davidich, M., and Bornholdt, S. (2008a). The transition from differential equations to Boolean networks: A case study in simplifying a regulatory network model. J. Theor. Biol. 255, 269–277.
Davidich, M. I., and Bornholdt, S. (2008b). Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3, e1672.
De Jong, H., Gouzé, J. L., Hernandez, C., Page, M., Sari, T., and Geiselmann, J. (2004). Qualitative simulation of genetic regulatory networks using piecewise-linear models. Bull. Math. Biol. 66, 301–340.
Ellner, S. P., and Guckenheimer, J. (2006). Dynamic Models in Biology. Princeton University Press, Princeton, NJ.
Espinosa-Soto, C., Padilla-Longoria, P., and Alvarez-Buylla, E. R. (2004). A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16, 2923–2939.
Figeys, D., McBroom, L. D., and Moran, M. F. (2001). Mass spectrometry for the study of protein–protein interactions. Methods 24, 230–239.
Foth, B. J., Zhang, N., Mok, S., Preiser, P. R., and Bozdech, Z. (2008). Quantitative protein expression profiling reveals extensive post-transcriptional regulation and posttranslational modifications in schizont-stage malaria parasites. Genome Biol. 9, R177.
Gilbert, D., Fuss, H., Gu, X., Orton, R., Robinson, S., Vyshemirsky, V., Kurth, M. J., Downes, C. S., and Dubitzky, W. (2006). Computational methodologies for modelling, analysis and simulation of signalling networks. Brief Bioinform. 7, 339–353.
Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J. Theor. Biol. 54, 85–107.
Gomperts, B. D., Kramer, I. M., and Tatham, P. E. R. (2003). Signal Transduction. Academic Press, San Diego, California.
Harvey, I., and Bossomaier, T. (1997). Time out of joint: Attractors in asynchronous random Boolean networks. In "Proceedings of the Fourth European Conference on Artificial Life (ECAL97)," (P. Husbands and I. Harvey, eds.), pp. 67–75. MIT Press, Cambridge, MA.
Hatzimanikatis, V., Li, C., Ionita, J. A., and Broadbelt, L. J. (2004). Metabolic networks: Enzyme function and metabolite structure. Curr. Opin. Struct. Biol. 14, 300–306.
Huang, A. C., Hu, L., Kauffman, S. A., Zhang, W., and Shmulevich, I. (2009). Using cell fate attractors to uncover transcriptional regulation of HL60 neutrophil differentiation. BMC Syst. Biol. 3, 20.
Jarrah, A. S., and Laubenbacher, R. (2007). Finite Dynamical Systems: A Mathematical Framework for Computer Simulation. In "Mathematical Modeling, Simulation, Visualization and e-Learning," (D. Konaté, ed.), pp. 343–358. Springer, Berlin Heidelberg.
Kachalo, S., Zhang, R., Sontag, E., Albert, R., and DasGupta, B. (2008). NET-SYNTHESIS: A software for synthesis, inference and simplification of signal transduction networks. Bioinformatics 24, 293–295.
Kauffman, S. A. (1969). Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22, 437–467.
Kaufman, M., Andris, F., and Leo, O. (1999). A logical analysis of T cell activation and anergy. Proc. Natl. Acad. Sci. USA 96, 3894–3899.
Kervizic, G., and Corcos, L. (2008). Dynamical modeling of the cholesterol regulatory pathway with Boolean networks. BMC Syst. Biol. 2, 99.
Kürten, K. E. (1988). Correspondence between neural threshold networks and Kauffman Boolean cellular automata. J. Phys. A 21, L615–L619.
Lähdesmäki, H., Shmulevich, I., and Yli-Harja, O. (2003). On learning gene regulatory networks under the Boolean network model. Machine Learn. 52, 147–167.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804.
Li, F., Long, T., Lu, Y., Ouyang, Q., and Tang, C. (2004). The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 101, 4781–4786.
Li, S., Assmann, S. M., and Albert, R. (2006). Predicting essential components of signal transduction networks: A dynamic model of guard cell abscisic acid signaling. PLoS Biol. 4, e312.
May, R. M. (1976). Simple mathematical models with very complicated dynamics. Nature 261, 459–467.
Mendoza, L., and Alvarez-Buylla, E. R. (2000). Genetic regulation of root hair development in Arabidopsis thaliana: A network model. J. Theor. Biol. 204, 311–326.
Mendoza, L., Thieffry, D., and Alvarez-Buylla, E. R. (1999). Genetic control of flower morphogenesis in Arabidopsis thaliana: A logical analysis. Bioinformatics 15, 593–606.
Reed, J. L., Vo, T. D., Schilling, C. H., and Palsson, B. O. (2003). An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biol. 4, R54.
Ropers, D., de Jong, H., Page, M., Schneider, D., and Geiselmann, J. (2006). Qualitative simulation of the carbon starvation response in Escherichia coli. Biosystems 84, 124–152.
Sackmann, A., Heiner, M., and Koch, I. (2006). Application of Petri net based analysis techniques to signal transduction pathways. BMC Bioinform. 7, 482.
Saez-Rodriguez, J., Simeoni, L., Lindquist, J. A., Hemenway, R., Bommhardt, U., Arndt, B., Haus, U. U., Weismantel, R., Gilles, E. D., Klamt, S., and Schraven, B. (2007). A logical model provides insights into T cell receptor signaling. PLoS Comput. Biol. 3, e163.
Sanchez, L., and Thieffry, D. (2001). A logical analysis of the Drosophila gap-gene system. J. Theor. Biol. 211, 115–141.
Sanchez, L., and Thieffry, D. (2003). Segmenting the fly embryo: A logical analysis of the pair-rule cross-regulatory module. J. Theor. Biol. 224, 517–537.
Thakar, J., Pilione, M., Kirimanjeswara, G., Harvill, E. T., and Albert, R. (2007). Modeling systems-level regulation of host immune responses. PLoS Comput. Biol. 3, e109.
Thakar, J., Saadatpour-Moghaddam, A., Harvill, E. T., and Albert, R. (2009). Constraint-based network model of pathogen–immune system interactions. J. R. Soc. Interface 6, 599–612.
Thomas, R. (1973). Boolean formalization of genetic control circuits. J. Theor. Biol. 42, 563–585.
von Dassow, G., Meir, E., Munro, E. M., and Odell, G. M. (2000). The segment polarity network is a robust developmental module. Nature 406, 188–192.
Walhout, A. J., and Vidal, M. (2001). High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods 24, 297–306.
Zhang, R., Shah, M. V., Yang, J., Nyland, S. B., Liu, X., Yun, J. K., Albert, R., and Loughran, T. P. Jr. (2008). Network model of survival signaling in large granular lymphocyte leukemia. Proc. Natl. Acad. Sci. USA 105, 16308–16313.
CHAPTER TWELVE

The Basic Concepts of Molecular Modeling

Akansha Saxena,* Diana Wong,* Karthikeyan Diraviyam,† and David Sept†

* Biomedical Engineering, Washington University, St. Louis, Missouri, USA
† Biomedical Engineering and Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA

Methods in Enzymology, Volume 467; ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67012-9. © 2009 Elsevier Inc. All rights reserved.

Contents
1. Introduction 308
2. Homology Modeling 308
2.1. Sequence analysis 309
2.2. Secondary structure prediction 313
2.3. Tertiary structure prediction 313
2.4. Structure validation 316
2.5. Conclusions 317
3. Molecular Dynamics 317
3.1. Molecular mechanics 318
3.2. Setting up and running simulations 320
3.3. Simulation analysis 321
4. Molecular Docking 324
4.1. Basic components 324
4.2. Choosing the correct tool 325
4.3. Preparing the molecules 326
4.4. Iterative docking and analysis 328
4.5. Post analysis 329
4.6. Virtual screening 329
4.7. Conclusions 330
References 330
Abstract

Molecular modeling techniques have made significant advances in recent years and are becoming essential components of many chemical, physical and biological studies. Here we present three widely used techniques used in the simulation of biomolecular systems: structural and homology modeling, molecular dynamics and molecular docking. For each of these topics we present a
brief discussion of the underlying scientific basis of the technique, some simple examples of how the method is commonly applied, and some discussion of the limitations and caveats of which the user should be aware. References for further reading as well as an extensive list of software resources are provided.
1. Introduction

Molecular modeling techniques have made significant advances in recent years and are becoming essential components of many chemical, physical, and biological studies. There are two primary reasons for this evolution: first, there has been an explosive growth in the available structural data for proteins, not only from X-ray crystallography, but also from NMR and electron microscopy studies. Second, accompanying this growth in the area of structural biology have come significant advances in both computational techniques and hardware. As computers continue to increase in speed and capability, we are able to tackle larger and more complex systems. The purpose of this chapter is to outline the basic methodology behind three commonly used techniques. The first section features a discussion of biomolecular structure and homology modeling techniques; we then discuss molecular dynamics (MD) and sampling of protein conformational space; and finally, we cover molecular docking applications. In each of these sections, we will present the basic approach, discuss some of the details, caveats, and limitations, and provide additional references for the reader who requires more information.
2. Homology Modeling

As each genome is sequenced, we are faced with the daunting task of digging for useful information in this growing ocean of letters. At the time of writing this article, GenBank (Benson et al., 2009) reports nearly 100 million sequences and the Protein Data Bank contains almost 55,000 protein structures (Berman et al., 2000). Given this amount of information, it is next to impossible to manually sort through, order, and correlate all of the data. Thanks to major improvements in computing power and algorithms, we can easily handle such large-scale data and derive useful information. Computational modeling has become an essential tool in guiding and enabling rational decisions in hypothesis-driven biological research. In parallel, the wide availability of Web-based applications has turned many computational tools into online servers, which are for the most part user friendly and have thus made these methods more accessible to researchers. Indeed, most of the tools and protocols that
are discussed in this chapter can be accessed and utilized using just a modern laptop with an Internet connection. Since the purpose of this chapter is to give readers a flavor of the different computational methods that involve protein modeling, we are going to skip any discussion of genomics and dive directly into proteomics. There are numerous tools, each with its own protocol to predict a desired property, and it is beyond the scope of this chapter to discuss each tool in detail. For an exhaustive list of other tools, readers are referred to sites such as Expasy (Gasteiger et al., 2003), NCBI, EMBL-EBI, Biology WorkBench (Subramaniam, 1998), and references (Kretsinger et al., 2004; Madhusudhan et al., 2005) (see Table 12.1).
2.1. Sequence analysis

It is possible to get general information about a protein's function by identifying certain motifs or domains in its sequence. For example, one can calculate the hydrophobicity of the amino acid at each position, thereby creating a hydropathy plot for a given sequence. Using such information, one can get an idea of whether a sequence segment of a protein is likely to be in the protein interior, on the protein surface, or perhaps a transmembrane segment. One such tool, TMHMM (Krogh et al., 2001), predicts transmembrane segments by applying a Hidden Markov Model (HMM) (for more details, see, e.g., Punta et al., 2007). If we are interested in predicting possible posttranslational modification sites, DNA-binding motifs, or signaling sites, one can scan the query sequence against protein databases such as Prosite (Hulo et al., 2008), Prints (Attwood et al., 2003), or InterPro (Hunter et al., 2009). PredictProtein (Rost et al., 2004) and SCRATCH (Cheng et al., 2005) are online servers that submit the query sequence to several prediction tools, each of which analyzes the sequence for unique features. Since all these tools are based on statistics, it is always recommended to try multiple software packages and to look for consensus and inconsistencies among the individual predictions. Many meta-servers like InterProScan (Zdobnov and Apweiler, 2001) or PredictProtein (Rost et al., 2004) enable the user to simultaneously submit the query sequence to multiple online tools from one central Web site. Web sites like EMBL-EBI, Expasy, and NCBI, as mentioned above, maintain lists of tools available for various forms of analysis. At the end of the day, one has to keep in mind the limitations of these tools and that the accuracy of any tool is never 100%. As mentioned above, all these prediction algorithms are based on statistical analysis of available data across multiple organisms. If a tool lacks specificity for the organism of interest, the confidence in its predictions may be more questionable. One should always cross-check the predictions against the available knowledge of the system and see if they are compatible with the system being studied. A sketch of one such analysis, a sliding-window hydropathy profile, is given below.
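```python
# Sliding-window hydropathy profile using the Kyte-Doolittle scale;
# sustained positive stretches over a ~19-residue window are classic
# candidates for transmembrane segments. The sequence below is a toy
# example, not a real protein.
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
      'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
      'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
      'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}

def hydropathy_profile(seq, window=19):
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

seq = "MKTLLILAVLLAVILLISGSA" + "DEQNKRHS" * 4      # toy sequence
profile = hydropathy_profile(seq)
print("peak window-averaged hydropathy:", round(max(profile), 2))
```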
Table 12.1 Molecular modeling software and resources (Name: URL)

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/
Protein Data Bank: http://www.rcsb.org
Expasy: http://ca.expasy.org/
NCBI: http://www.ncbi.nlm.nih.gov
EMBL-EBI: http://www.ebi.ac.uk/
InterPro: http://www.ebi.ac.uk/interpro/
Swiss-Prot: http://ca.expasy.org/sprot/
UniProt: http://www.uniprot.org/
SMART: http://smart.embl-heidelberg.de/
Pfam: http://pfam.sanger.ac.uk/
PROSITE: http://ca.expasy.org/prosite/
PRINTS: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Biology Workbench: http://workbench.sdsc.edu/
PredictProtein: http://www.predictprotein.org/
SCRATCH: http://www.igb.uci.edu/tools/scratch/
InterProScan: http://www.ebi.ac.uk/Tools/InterProScan/
BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
FASTA: http://www.ebi.ac.uk/Tools/fasta/index.html
PSI-BLAST: http://www.ebi.ac.uk/Tools/psiblast/
PHI-BLAST: http://www.ebi.ac.uk/Tools/blastpgp/
HMMER: http://hmmer.janelia.org/
ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
Jpred3: http://www.compbio.dundee.ac.uk/www-jpred/
PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
PHD: http://www.predictprotein.org/
SSPro: http://scratch.proteomics.ics.uci.edu/
FUGUE: http://tardis.nibio.go.jp/fugue/
MODELLER: http://salilab.org/modeller/
SWISS-MODEL: http://swissmodel.expasy.org//SWISS-MODEL.html
3D-JIGSAW: http://bmm.cancerresearchuk.org/3djigsaw/
PLOP: http://www.jacobsonlab.org/plop_manual/plop_overview.htm
123D+: http://123d.ncifcrf.gov/123D+html
pGenTHREADER: http://bioinf.cs.ucl.ac.uk/psipred/
3D-PSSM: http://www.sbg.bio.ic.ac.uk/3dpssm/index2.html
Rosetta: http://robetta.bakerlab.org/
I-TASSER: http://zhang.bioinformatics.ku.edu/I-TASSER/
PROCHECK: http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
WHAT IF: http://swift.cmbi.kun.nl/whatif/
ProSA: https://prosa.services.came.sbg.ac.at/prosa.php
Verify3D: http://nihserver.mbi.ucla.edu/Verify_3D/
Jalview: http://www.jalview.org/
T-Coffee: http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
Chimera: http://www.cgl.ucsf.edu/chimera/
AMBER: http://ambermd.org/
GROMACS: http://www.gromacs.org/
TINKER: http://dasher.wustl.edu/tinker/
NAMD: http://www.ks.uiuc.edu/Research/namd/
CHARMM: http://www.charmm.org/
DESMOND/IMPACT: http://www.schrodinger.com/
AutoDock: http://autodock.scripps.edu/
ClusPro: http://cluspro.bu.edu/login.php
DOCK: http://dock.compbio.ucsf.edu/
FITTED: http://www.fitted.ca/
FlexX: http://www.biosolveit.de/flexx/
FRED: http://www.eyesopen.com/products/applications/fred.html
ICM: http://www.molsoft.com/docking.html
GLIDE: http://www.schrodinger.com/ProductDescription.php?mID=6&sID=6
GOLD: http://www.ccdc.cam.ac.uk/products/life_sciences/gold/
HADDOCK: http://www.nmr.chem.uu.nl/haddock/
HEX: http://www.loria.fr/ritchied/hex/
PatchDock, SymmDock, FireDock: http://bioinfo3d.cs.tau.ac.il/
RosettaDock: http://rosettadock.graylab.jhu.edu/
Surflex: http://www.optive.com/
A more common and direct approach toward prediction of protein function from sequence is through searching the sequence databases such as NCBI or Swiss-Prot (Bairoch et al., 2004), using tools such as BLAST (Altschul et al., 1990; McGinnis and Madden, 2004) or FASTA (Pearson, 1990), for closely
related sequences to your query. The best-case scenario is when a query picks out sequences (hits) from the database that are both functionally well characterized and share a high sequence identity. Based on high sequence similarity to the hit, it can generally be assumed that the query protein will have a similar fold, and hence might belong to the same structural family as a given hit and, depending on the extent of sequence identity, possibly have a similar function. This generalization of the sequence–structure–function relationship has its roots in the process of divergent evolution. In divergent evolution, two proteins that share sequence similarity diverged from a common ancestor, and since structure diverges more slowly than sequence, they should also have a similar fold. In general, two sequences sharing more than 40% sequence identity share a similar structural fold (Davidson, 2008). Yet this simple sequence–structure–function relationship is not always true (Roessler et al., 2008). A famous example is myoglobin and hemoglobin, proteins that are similar in structure and function yet share only 20% sequence similarity. On the same note, convergent evolution can result in proteins with very different structural folds but similar function (e.g., subtilisin). These exceptions to the standard rule are to be borne in mind when analyzing two sequences and inferring function or fold on the basis of sequence similarity. In a less than ideal scenario, a query fails to generate any significant hits, or the generated hits are of low sequence similarity. The favorable case for structure prediction is when the query generates hits of high sequence identity (>40%) and one of those hits also has a representative X-ray or NMR structure. In this case, opting for homology modeling can produce reliable models that are within 2 Å RMSD of the experimental structure. For instances of low sequence identity, for example, where a multiple sequence alignment approach was required to produce reliable hits, homology modeling based on a single structure cannot be relied upon. However, current homology modeling protocols can produce good structural models even for these low sequence similarity cases by using multiple templates to model different regions. For the most difficult cases of low sequence similarity, alternate model building methods such as threading and ab initio modeling can be used to produce structural models of the query.

2.3.1. Homology modeling
The aim of homology modeling is to model or predict the structural coordinates of a query protein based on the known structure of a sequence homolog (generally referred to as the template). Since the model produced can only be as good as the template, it is imperative that rigorous analysis is done before choosing the template. This is achieved by searching the query sequence against various databases and using different search protocols, as discussed above in the sequence analysis section. In cases where multiple templates are available, it is usually best to use the template with the highest resolution and fewest gaps in sequence/structure. One should also be aware of the environmental conditions under which the template structure was solved. After selecting an appropriate template, the next step involves determining the best alignment between the template and query. The default alignment produced by sequence search tools may not be optimal, and the use of profiles/patterns derived from multiple sequence alignments can again enhance the quality of the alignment for less similar sequences.
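Database searches of this kind can also be scripted. The sketch below shows a remote BLAST query with Biopython's qblast interface, assuming Biopython is installed, a network connection is available, and a hypothetical query sequence; NCBI usage policies limit the rate of such automated queries.

```python
# A remote BLAST search scripted with Biopython (assumes Biopython is
# installed and a network connection is available). The query sequence
# here is hypothetical.
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
result_handle = NCBIWWW.qblast("blastp", "nr", query)
record = NCBIXML.read(result_handle)

for alignment in record.alignments[:5]:          # report the top hits
    hsp = alignment.hsps[0]
    identity = 100.0 * hsp.identities / hsp.align_length
    print(alignment.title[:60], "E =", hsp.expect, "id = %.0f%%" % identity)
```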
As with choosing the template, threading can also help in improving the sequence alignment, since threading adds structural information to the sequence-based alignment. Programs such as FUGUE (Shi et al., 2001) and SALIGN (implemented in MODELLER (Marti-Renom et al., 2000))
use an intermediate protocol where the structural information is used in a profile-based alignment. Once an optimal alignment is constructed, this information is fed into model building software. There are online model building servers such as SWISS-MODEL (Schwede et al., 2003) and 3D-JIGSAW (Bates et al., 2001), as well as stand-alone software such as MODELLER, that can be used. In MODELLER, the model is built based on spatial restraints resulting from the query–template alignment. It can also perform de novo loop prediction and use multiple templates to construct a model. The final model is generated using conjugate gradient minimization and MD simulation (simulated annealing). In analyzing the final model, one has to keep in mind that in addition to errors from an incorrect alignment or a nonideal template, there are also errors inherent to the model building method that can result in backbone distortions even in conserved regions, as well as side-chain packing or conformation errors. In such instances, protein optimization programs such as PLOP (Jacobson et al., 2002), which can perform multiple cycles of side-chain optimization and minimization, can be used to improve the model.

2.3.2. Threading
Threading is typically used in cases where there is no significant homology predicted (<20%) between your query and any sequence in the database, but it can also be employed to improve the alignment of low homology query–template sequences. Briefly, the query sequence is threaded through all available folds in the structural database, a score for each fold is calculated using some suitable scoring scheme, and the fold that gives the best score is assumed to be the fold of the query sequence. For more details on the methods used for threading the sequence and the scoring algorithms, the reader is referred to Mount (2001). Popular online servers for threading include 123D (Alexandrov et al., 1996), pGenTHREADER (Lobley et al., 2009), and 3D-PSSM (Kelley et al., 2000). Since this method is most often used when the sequence similarity is very low, interpreting details beyond the fold of the protein, like side-chain interactions, is not reliable. In this case, protein optimization software can again be useful in improving the structure.

2.3.3. Ab initio modeling
Ab initio modeling uses a combination of statistical analysis and physics-based energy functions to predict the native fold of a given sequence. Ab initio modeling is preferred for predicting the structure of a sequence when no suitable template is found, or if it is known that the query adopts a different fold than the predicted template in spite of the sequence similarity. The various ab initio algorithms use statistical information, secondary structure prediction, and fragment assembly for fold prediction.
Also common to all algorithms is a simplified representation of the protein to keep the prediction problem tractable. For more details on the different ab initio prediction methods, the reader is referred to Hardin et al. (2002). Software currently used for ab initio structure prediction includes Rosetta (Rohl et al., 2004) and I-TASSER (Zhang, 2008). Rosetta is one of the more widely used packages. It uses a fragment-based assembly protocol to predict the full structure, where the fragments form a reduced representation of the protein. A key assumption in Rosetta is that the distribution of structures sampled by a particular sequence fragment is reasonably represented by the distributions of conformations adopted by that fragment and closely related fragments in the protein structure database. Fragment libraries are then built based on the protein structure database. The reduced conformational search space of the query is searched using a Monte Carlo algorithm with an energy function that is defined as the Bayesian probability of sequence/structure matches (Simons et al., 1997). This produces compact structures that have optimized local and nonlocal interactions. Based on the observation that the folding of small proteins predominantly follows a single exponential process, the conformational search is achieved by running short simulations; in general, 1000 short simulations are performed independently (Shortle et al., 1998). The resulting structures are clustered, and generally the central model of the largest cluster is chosen as the best predicted structure for the query sequence. There are important limitations of this fragment-based ab initio prediction: the conformational search space of fragments is predetermined, and long-range interactions within the protein are not included in the structure prediction of fragments.
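The Metropolis Monte Carlo step at the heart of such searches is simple to state. The toy sketch below uses a one-dimensional, purely illustrative energy landscape (not Rosetta's actual energy function): it proposes random moves, always accepts downhill ones, and accepts uphill ones with probability exp(-dE/T), combined here with a simple annealing schedule.

```python
import math
import random

# A toy Metropolis Monte Carlo / simulated annealing search of the kind
# used inside fragment-assembly protocols. The one-dimensional "energy
# landscape" is purely illustrative.
def energy(x):
    return (x - 2.0) ** 2 + math.sin(5.0 * x)

rng = random.Random(0)
x, T = 0.0, 1.0
for step in range(5000):
    x_trial = x + rng.uniform(-0.2, 0.2)        # random trial move
    dE = energy(x_trial) - energy(x)
    if dE < 0 or rng.random() < math.exp(-dE / T):
        x = x_trial                             # accept the move
    T = max(0.01, T * 0.999)                    # slow cooling schedule
print("best x found:", round(x, 3), "energy:", round(energy(x), 3))
```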
2.4. Structure validation

It is critically important that the predicted models, produced using any of the methods discussed above, be further examined to increase confidence in the structural model. Tools such as PROCHECK (Laskowski et al., 1993) and WHAT IF (Vriend, 1990) can check multiple structural variables in the model against expected reference values. ProSA (Wiederstein and Sippl, 2007) uses knowledge-based potentials of mean force to evaluate model accuracy. It also gives a Z-score for the model that indicates overall model quality; an anomalous Z-score indicates a problem with the structural model. Verify3D (Luthy et al., 1992) is another tool that utilizes the location and environmental profile of each residue in relation to a set of reference structures to predict the quality of the structure. It should be noted that performing the validation analysis on both the predicted model and the template and comparing the results can be useful, since the model can only be as good as the template.
2.5. Conclusions

As the number of experimentally solved structures and sequences deposited into their respective databases increases, the prediction accuracy of sequence analysis and structure prediction tools will continue to advance. The increased availability of cheap and fast computational resources, as well as improvements in prediction algorithms, will enable more exhaustive and accurate predictions in less time. In this regard, the Critical Assessment of techniques for protein Structure Prediction (CASP), a large-scale experiment that evaluates the current status of prediction algorithms, is an excellent resource for evaluating the current state of the field. At the end of this biennial meeting, the performance-based ranking results are published on the official Web site (http://predictioncenter.org/), and this resource is perhaps the best source of up-to-date information on current methods and algorithms.
3. Molecular Dynamics

We have seen how we can obtain a 3D structure of a protein by predicting the positions of the nuclear centers of all atoms that make up the molecule. This structure is helpful in understanding the relative positions of the atoms, but it is only a static picture and does not tell us how proteins move and function. More than four decades ago, scientists demonstrated that if we are able to calculate the relative positions of the protein atoms at small intervals of time, we can then predict the behavior of the atoms over a longer time scale. The principal tool for such calculations is the method of MD simulations, first introduced in the late 1950s by Alder and Wainwright (1959, 1960). In the 1960s, Rahman carried out the first realistic MD simulations with liquid argon and liquid water (Rahman, 1964; Rahman and Stillinger, 1974), and later, in 1976, the first protein simulation was performed by McCammon et al. (1976, 1977). Since then, researchers have used MD to investigate the structural, dynamic, and thermodynamic properties of biological molecules, including the characterization of protein folding kinetics and pathways, protein structure refinement, and protein–protein interactions. The goal of this section is to give an overview of the basic methodology required for studying protein dynamics using MD simulations. We will first cover the basics of MD and the basic considerations that go into a simulation. We will then provide a step-by-step protocol on how to prepare a system for MD, and finally, we will discuss analysis tools that can be used on the resulting trajectory to study the dynamics and flexibility of the protein.
3.1. Molecular mechanics

The basis of the MD technique is little more than the integration of Newton's equation of motion, F = ma, to calculate the positions of the atoms over time. The behavior of the atoms in such a calculation can be likened to the movement of billiard balls (Leach, 2001). Of course, billiard balls move in straight lines until they collide with each other and the collisions change their direction, but in our case, the atoms experience a varying force between collisions. To account for the effects of these changing forces and to get more realistic dynamics, the equation of motion is integrated over short time steps (typically 1–2 fs) such that the forces can be regarded as constant. The integration yields a series of conformations in time that reveal the realistic movement of the atoms and the protein as a whole. Newton's equation is a second-order differential equation (the acceleration a is the second derivative of position), which means that we are required to provide the positions of each atom, from a crystal or model structure, and their individual velocities, calculated from a thermal distribution. The force that acts on each atom, F, is determined from a molecular mechanics potential by taking the negative gradient (i.e., F = −∇U). A discussion of molecular mechanics potentials could alone fill multiple chapters; however, it is sufficient to say that these parameter sets are based on first-principles physics but are parameterized empirically through detailed comparisons with many experimental measurements. All molecular mechanics potentials deal with two primary classes of interactions: bonded interactions and nonbonded interactions. The bonded interactions are composed of a bond stretching term for two covalently bonded atoms (see Fig. 12.1), an angle-bending term for three consecutively bonded atoms, and a torsional term for four consecutive atoms.

[Figure 12.1: Representation of the bonded (bonds, angles, torsions) and nonbonded (electrostatics and van der Waals) interactions used in molecular mechanics force fields.]
Since the integration time step is small and the perturbations about the equilibrium point are typically very small in proteins, these potentials are typically modeled as harmonic springs (E = (1/2)kx²). The use of such effective potentials makes calculations much easier and faster. The nonbonded interactions capture longer range interactions within the protein. The major nonbonded forces include the electrostatic interactions, based on Coulomb's law, and van der Waals interactions, usually based on a Lennard–Jones potential. The following equation sums up these interactions, with the first three terms representing the bonded interactions and the last two representing the nonbonded interactions:

$$U = \sum_{\text{bonds}} k_b (b - b_0)^2 + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} A\left(1 - \cos(n\phi)\right) + \sum_{\text{charges}} \frac{q_i q_j}{\epsilon r_{ij}} + \sum_{\text{atoms}} \left( \frac{C_{12}}{r_{ij}^{12}} - \frac{C_6}{r_{ij}^{6}} \right)$$
The most commonly used force fields for MD include AMBER, OPLS-AA, CHARMM, AMOEBA, and GROMOS (Brooks et al., 2009; Duan et al., 2003; Hess et al., 2008; Ren and Ponder, 2003). A nice review of the usage of different types of force fields and their development can be found in Ponder and Case (2003), and Cheatham and Young (2000) discuss force fields that can be used for nucleic acids. An important component of MD simulations is the definition of a suitable environment for the protein. For cytoplasmic proteins, this means immersing the protein in water with suitable concentrations of salt or other ions, and maintaining the proper temperature and pressure. There are several water models available in the literature, including SPC, TIP3P, TIP4P, POL3, and AMOEBA (Berendsen et al., 1987; Caldwell and Kollman, 1995; Jorgensen and Madura, 1985; Jorgensen et al., 1983; Mahoney and Jorgensen, 2000). SPC and TIP3P are the simplest models, representing the water molecule with three interaction sites. The more sophisticated models include dummy atoms at the lone-pair positions, which improve the dipole and quadrupole moments of the water molecule. POL3, SPC/E, and AMOEBA contain additional terms to account for polarizability. Membrane proteins are treated differently, since they have to be embedded in a lipid membrane before the whole system is placed in a water bath. Parameters for several lipid membranes are available, such as DPPC, DMPC, POPC, DLPE, DOPC, and DOPE (de Vries et al., 2005; Heller et al., 1993; Tieleman and Berendsen, 1996; Tieleman et al., 1997).
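Before turning to full simulation packages, it is worth seeing how small the core of an MD engine really is. The following is a minimal Python/numpy sketch of our own (not taken from any of the packages above) that integrates Newton's equation with the velocity Verlet scheme for a pair of Lennard-Jones particles in reduced units; all parameter values are illustrative only.

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces, F = -grad U (reduced units)."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = rij @ rij
            sr6 = (sigma * sigma / r2) ** 3
            # F_ij = 24*eps*(2*(s/r)^12 - (s/r)^6) / r^2 * r_ij
            f = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2 * rij
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet(pos, vel, mass, dt, n_steps):
    """Integrate F = ma over many short, constant-force time steps."""
    f = lj_forces(pos)
    traj = [pos.copy()]
    for _ in range(n_steps):
        vel += 0.5 * dt * f / mass[:, None]   # half-kick
        pos += dt * vel                        # drift
        f = lj_forces(pos)                     # forces at the new positions
        vel += 0.5 * dt * f / mass[:, None]   # second half-kick
        traj.append(pos.copy())
    return np.array(traj)

# Two particles started slightly apart; dt plays the role of the 1-2 fs step
pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
vel = np.zeros_like(pos)
traj = velocity_verlet(pos, vel, np.ones(2), dt=0.005, n_steps=1000)
```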
3.2. Setting up and running simulations

Many MD packages are available at no cost or for modest academic fees. Popular packages include AMBER, CHARMM, GROMACS, NAMD, and TINKER (see Table 12.1). Although there are small differences in implementation depending on the software package being used, the simulation procedure can be divided into four basic steps: preparation, minimization, heating/equilibration, and production.

3.2.1. Preparation
The first task of the simulation is to obtain a structure of the protein of interest. This is typically a PDB file that contains a list of all atoms in a protein and their 3D coordinates. The protein structure needs to be complete (all atoms), and missing atoms such as hydrogens can be added by using the preparatory tools of the MD software. It is also important to check the protonation states of the histidines, the existence of disulfide bonds, and any potential posttranslational modifications. Next, the protein is immersed in a water bath using any of the water models mentioned above. If it is a membrane protein, then it is first embedded in a lipid bilayer and then the whole system is submerged in water. Salt ions can be added to more closely mimic physiological conditions, and additional ions are typically added to neutralize the overall charge of the system.

3.2.2. Minimization
The starting structure almost assuredly has small atomic clashes, strained bonds and angles, or other potential problems. We need to resolve these issues before starting the simulation, and we do so by minimizing the potential energy of the system, effectively moving all bonds, angles, and so on toward their equilibrium values. The quality of the minimization routines varies among MD packages, but as long as major atomic clashes are removed, we should be able to proceed to the next step.

3.2.3. Equilibration
Since we are normally trying to connect simulation results with wet-lab experiments, we need to match the experimental conditions as closely as possible. The minimized protein structure can be viewed as being at 0 K, but we need to heat the system to a "normal" temperature of perhaps 300–310 K. Since the MD protocol is an equilibrium method, we need to perturb the system slowly, usually heating it in 50 K steps for short periods of time (20–50 ps) until we reach the desired simulation temperature. Once we reach the production temperature, we need to allow the system to equilibrate to again remove any artifacts. The time required for equilibration is a point of debate in the simulation community and, depending on the
system, equilibration times may range from 100 ps to 50 ns, or more. When in doubt, more equilibration is certainly the safest route to follow.

3.2.4. Production
Now the system is ready for the production run. Depending on the available computer resources, the simulation can be distributed across parallel processors to increase speed. Just as with the equilibration step, the simulation time will depend on the size of the system and the ultimate goal of the simulation. In Section 3.3, we will discuss some of the analysis tools that can be used to evaluate your simulation results.
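The packages differ in their scripting interfaces, so the sketch below is only one concrete way of expressing the four steps, here using the Python API of OpenMM; the input file, force-field files, temperatures, and run lengths are placeholder assumptions to be adapted to the system at hand.

```python
from openmm.app import (PDBFile, Modeller, ForceField, Simulation,
                        DCDReporter, PME, HBonds)
from openmm import LangevinMiddleIntegrator
from openmm.unit import nanometer, molar, kelvin, picosecond, picoseconds

# Preparation: read the structure, add hydrogens, solvate, and add ions
pdb = PDBFile('protein.pdb')                       # placeholder file name
ff = ForceField('amber14-all.xml', 'amber14/tip3p.xml')
model = Modeller(pdb.topology, pdb.positions)
model.addHydrogens(ff)
model.addSolvent(ff, padding=1.0*nanometer, ionicStrength=0.15*molar)

system = ff.createSystem(model.topology, nonbondedMethod=PME,
                         nonbondedCutoff=1.0*nanometer, constraints=HBonds)
integrator = LangevinMiddleIntegrator(50*kelvin, 1/picosecond,
                                      0.002*picoseconds)   # 2 fs time step
sim = Simulation(model.topology, system, integrator)
sim.context.setPositions(model.positions)

# Minimization: relieve clashes and strained geometry
sim.minimizeEnergy()

# Heating/equilibration: raise the thermostat in 50 K steps, 20 ps each
for temp in (50, 100, 150, 200, 250, 300):
    integrator.setTemperature(temp*kelvin)
    sim.step(10000)

# Production: run at 300 K, saving a snapshot every 10 ps
sim.reporters.append(DCDReporter('traj.dcd', 5000))
sim.step(5_000_000)                                # 10 ns of dynamics
```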
3.3. Simulation analysis

3.3.1. Equilibration measures
As the simulation progresses, the protein evolves from the minimized state, attains a state of equilibrium, and then begins to fluctuate around this point. One measure used to study the evolution of the dynamics of the system is the root mean square deviation, or RMSD, of the protein relative to the starting structure. The RMSD is defined as the mass-weighted average deviation of all atoms from their starting positions. The formula used for this analysis is
$$\mathrm{RMSD}(t_1, t_2) = \left[\frac{1}{M}\sum_{i=1}^{N} m_i \left\lVert \mathbf{r}_i(t_1) - \mathbf{r}_i(t_2)\right\rVert^2\right]^{1/2}$$
where M is the total mass, m_i is the mass of atom i, and r_i(t) is the position of atom i at time t. The calculation can be performed using any set of atoms; however, the backbone atoms or just the alpha carbons are the most common choices. A typical RMSD plot is shown in Fig. 12.2, in this case starting from the equilibration phase of the simulation for a protein of 140 amino acids. During the initial equilibration phase, the protein fluctuates significantly, but after about 50 ns it settles down at a steady value and could be considered to have reached equilibrium. One issue with the RMSD is that it depends on the reference state (in the case of Fig. 12.2, the structure at the start of the production phase). For this reason, it can be a nonideal measure of equilibrium, and many researchers use methods such as principal component analysis (discussed later).
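Written out in code, the RMSD calculation is only a few lines. The numpy sketch below assumes that every frame has already been least-squares superimposed on the reference structure (the optimal rotation/translation fit is omitted for brevity).

```python
import numpy as np

def rmsd(frame, reference, masses):
    """Mass-weighted RMSD between two (N, 3) coordinate arrays that have
    already been superimposed; masses is an (N,) array."""
    diff2 = np.sum((frame - reference) ** 2, axis=1)
    return np.sqrt(np.sum(masses * diff2) / np.sum(masses))

# RMSD profile of a whole trajectory (n_frames, n_atoms, 3) versus frame 0,
# typically computed on the backbone or alpha-carbon atoms only
def rmsd_profile(traj, masses):
    return np.array([rmsd(frame, traj[0], masses) for frame in traj])
```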
Figure 12.2 RMSD profile of a molecular dynamics trajectory using the initial structure as a reference. The inset shows the changes during the first 5 ns of the trajectory.

3.3.2. RMSD fluctuations
RMSD fluctuation, commonly known as RMSF, is a tool that quantifies the dynamics of the polypeptide backbone by finding the extent of movement of each residue around its mean position throughout the length of the simulation. The formula used for this analysis is
$$\mathrm{RMSF}(\mathbf{r}_i) = \left[\frac{1}{n}\sum_{t=1}^{n} \left\lVert \mathbf{r}_i(t) - \hat{\mathbf{r}}_i\right\rVert^2\right]^{1/2}$$
where, just as before, r_i(t) is the position of atom i at time t and r̂_i denotes the average position of atom i. To do the analysis, the backbone or alpha-carbon atoms are typically selected. The calculation yields large RMSF values for parts of the protein that are highly flexible, while portions that are constrained result in lower values. A comparison of these values between wild-type and mutant simulations can give insight into the effects of mutation or ligand binding. An example of an RMSF plot, with the calculations performed on the Cα atoms, can be seen in Fig. 12.3. As seen from the corresponding protein structure, large RMSF values result for the loop regions of the protein, but other regions (such as helix C) also show large-scale motion.
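The corresponding numpy sketch is below; as with the RMSD, the frames are assumed to be superimposed, and the calculation is usually restricted to the backbone or Cα atoms.

```python
import numpy as np

def rmsf(traj):
    """Per-atom RMSF for a superimposed (n_frames, n_atoms, 3) trajectory:
    the RMS excursion of each atom about its mean position r_hat_i."""
    mean_pos = traj.mean(axis=0)                    # average structure
    dev2 = np.sum((traj - mean_pos) ** 2, axis=2)   # squared deviations
    return np.sqrt(dev2.mean(axis=0))               # time average, then sqrt
```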
Figure 12.3 RMSF plot for an MD trajectory. The peak RMSF values are labeled and are seen to correspond with the most flexible regions of the protein.

3.3.3. Principal component analysis
Principal component analysis, or PCA, is a type of eigenvalue analysis in which the complicated dynamics of a system are decomposed into simpler, orthogonal degrees of freedom. PCA is similar to the RMSF calculation discussed above, except that the full cross-correlation matrix of all atom pairs is calculated. The eigenmodes of this matrix are determined, and in this way the high-frequency, small-amplitude fluctuations (small eigenvalues) can be
filtered out of the dynamics trajectory, and the larger/slower motions (large eigenvalues) can be extracted. Figure 12.4 shows a projection of a dynamics trajectory on the first two principal modes. The trajectory starts in the right
top corner of the space spanned by these two modes, migrates to the lower left quadrant over the first 100 ns, and then remains in this region for the remaining 100 ns.

Figure 12.4 Projection of an MD trajectory on the space spanned by the first two principal component vectors.

This is the same MD trajectory used in creating the RMSD plot shown in Fig. 12.2, but it now suggests that the system does not reach equilibrium for almost 100 ns, not the 50 ns suggested by the RMSD analysis. This underlines the challenges and considerations that one faces in performing these types of simulations and emphasizes how carefully the results of an MD simulation should be analyzed.
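A bare-bones PCA can be written directly from this description; the sketch below builds the covariance matrix of the (superimposed) coordinates, diagonalizes it, and returns the projection onto the largest modes, i.e., the kind of data plotted in Fig. 12.4. For a large protein one would normally restrict the analysis to the Cα atoms.

```python
import numpy as np

def pca_projection(traj, n_modes=2):
    """Project an (n_frames, n_atoms, 3) superimposed trajectory onto its
    first n_modes principal components."""
    n_frames = traj.shape[0]
    X = traj.reshape(n_frames, -1)           # flatten to (n_frames, 3N)
    X = X - X.mean(axis=0)                   # remove the mean structure
    cov = X.T @ X / (n_frames - 1)           # 3N x 3N covariance matrix
    evals, evecs = np.linalg.eigh(cov)       # eigenvalues, ascending order
    order = np.argsort(evals)[::-1]          # largest (slowest) modes first
    modes = evecs[:, order[:n_modes]]
    return X @ modes                         # (n_frames, n_modes) projection
```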
MD simulations are now being used successfully in a wide variety of chemical, physical, and biological systems. As the field progresses, other simulation techniques such as Brownian dynamics, Monte Carlo simulations, and a host of multiscale and coarse-graining methods are emerging. These techniques have advantages and disadvantages when compared to MD, and one needs to select the correct tool for the problem at hand.

4. Molecular Docking

As difficult as it is to obtain structures for every protein we could possibly want, obtaining the cocrystal structure of two bound proteins is often a greater challenge. Since solving such bound structures may not always be experimentally possible, there has been significant effort in the simulation community to predict them. The previous sections covered protocols to obtain structures, whether from X-ray crystallography or through ab initio prediction, and how to sample the dynamics of these structures. This section will show how to take these results and use them to predict bound complexes with drugs, ligands, or other proteins. As discussed before, the confidence in binding models obtained from docking methods is only as good as the experimental information we have a priori. The more information that is available (mutagenesis data, sequence or structure conservation, etc.), the more reliable the docking simulations will be.
4.1. Basic components

While an end-user of the various tools available for docking does not have to understand the minutiae of all the algorithms under the hood, it is still important to appreciate some details so as to know which software is suitable for which situations. Every docking program has two essential components: a search algorithm and an energy scoring function (Leach et al., 2006). The details and interdependence of these two components vary greatly among the different pieces of software, and some of these details are discussed below.
The issue of search space deals with sampling the different possible orientations, or poses, in which the macromolecule and ligand can bind. In rigid docking, where the internal coordinates of the macromolecule and ligand are held static, there are six relative degrees of freedom for two molecules: three translational and three rotational. This can lead to hundreds of thousands or millions of possibilities, depending on the size of the molecules, and the problem is further compounded in the case of flexible docking. Once bonds are allowed to rotate and side-chain or backbone conformations are explored, the size of the search space increases exponentially. There are many different search algorithms, ranging from brute-force conformational searches to more effective and efficient stochastic algorithms. In general, better sampling of the different possible poses will lead to a higher probability of finding the correct binding structure. With so many poses generated by the search step, we need a system to rank them according to their likelihood of being the binding answer(s), whether by energetics, binding affinity, or some other metric. Scoring functions (Jain, 2006) used to evaluate these structures must not only be accurate in calculating the energy of a pose, but also efficient enough to rank a large number of structures in a timely manner. The binding score or energy resulting from various scoring functions can be based on first principles (like molecular mechanics force fields), empirical data (functions fitted to experimental data), semiempirical approaches (a combination of the two), or knowledge-based approaches (statistics and heuristics). Some programs show good performance, although they may be optimized for the selected benchmarks and may only make good predictions for the systems for which they were parameterized (Huang et al., 2006). Since predicting absolute binding energies is so difficult, it is often more desirable and effective to predict the correct relative affinities of a group of compounds. In reality, the entire procedure to generate a single docked complex needs to be repeated tens, hundreds, or thousands of times. To analyze this large ensemble of predictions, many protocols use cluster analysis, looking for structures that are repeatedly predicted as a measure of confidence. Of course, any result needs to be compared to known experimental data as a sanity check. Additional analysis is recommended, as well as further experimental validation in an iterative cycle to improve any docked model.
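As a toy illustration of the interplay between these two components, the sketch below pairs a brute-force rigid-body search over the six degrees of freedom with a deliberately crude scoring function (a screened Coulomb term plus a steric penalty). It is purely conceptual; production programs sample and score far more cleverly.

```python
import numpy as np

def random_rotation():
    """A random proper rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    return q * np.sign(np.linalg.det(q))      # force det(R) = +1

def score(rec_xyz, rec_q, lig_xyz, lig_q, eps=4.0):
    """Crude score: screened Coulomb term plus a soft steric penalty for
    contacts closer than 3 angstroms (lower is better)."""
    d = np.linalg.norm(rec_xyz[:, None] - lig_xyz[None, :], axis=2)
    d = np.maximum(d, 1e-6)                   # guard against overlap
    coulomb = np.sum(rec_q[:, None] * lig_q[None, :] / (eps * d))
    clash = np.sum(np.maximum(0.0, 3.0 - d) ** 2)
    return coulomb + 10.0 * clash

def rigid_search(rec_xyz, rec_q, lig_xyz, lig_q, n_poses=100_000):
    """Sample the six rigid-body degrees of freedom at random and keep
    the best-scoring pose."""
    best_score, best_pose = np.inf, None
    center = lig_xyz.mean(axis=0)
    for _ in range(n_poses):
        R = random_rotation()                  # three rotational DOF
        t = np.random.uniform(-20.0, 20.0, 3)  # three translational DOF
        pose = (lig_xyz - center) @ R.T + center + t
        s = score(rec_xyz, rec_q, pose, lig_q)
        if s < best_score:
            best_score, best_pose = s, pose
    return best_score, best_pose
```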
4.2. Choosing the correct tool

The first step is to select the appropriate docking software for the system of interest. Some programs are parameterized for specific kinds of protein structures or particular ligands (such as only small molecules), while others are more widely applicable. At the same time, some programs are free for academic use, while others charge a nominal or substantial fee, even for academic use. Unfortunately, it is not possible in this limited space to
provide details on each software package, and the user will need to investigate each package on an individual basis. Table 12.1 contains a list of software and Web addresses, but some of the more widely used packages are AutoDock (Goodsell and Olson, 1990), FlexX (Rarey et al., 1995), GLIDE (Friesner et al., 2004; Halgren et al., 2004), GOLD (Jones et al., 1995), HADDOCK (Dominguez et al., 2003), and RosettaDock (Schueler-Furman et al., 2005a,b). Although every software package will claim to have certain advantages, the best method of assessing their quality is head-to-head comparison, ideally completed by an independent third party. For small-molecule docking, several such comparisons have been published in recent years (Cross et al., 2009; Cummings et al., 2005; Leach et al., 2006; Sousa et al., 2006). In the case of protein–protein docking (Bonvin, 2006; Leach et al., 2006; Ritchie, 2008; Vajda and Kozakov, 2009), the best resource for evaluating the most current docking methods is the set of results gathered from the biannual critical assessment of predicted interactions (CAPRI) (Lensink et al., 2007; Mendez et al., 2005). Like the CASP competition for structure prediction, CAPRI evaluates blind predictions of protein–protein interactions. The latest round (with results published in 2007) included an additional component for assessing scoring functions. Additionally, there are evaluation tests (Schulz-Gasch and Stahl, 2003; Tiwari et al., 2009; Warren et al., 2006) with decoy benchmarks (Huang et al., 2006; Irwin, 2008) or evaluative reviews (Moitessier et al., 2008) that are released whenever a new tool has been developed.
4.3. Preparing the molecules

The ligand of choice impacts how the search step should be carried out. In general, docking protocols can be divided into protein–small molecule docking and protein–protein docking. Each category is covered in this section, where the smaller molecule (e.g., a drug) is defined as the ligand and the larger protein as the macromolecule.

4.3.1. Macromolecule
Regardless of the ligand, the macromolecule is usually dealt with in the same way. The sheer size of a protein and its potential degrees of freedom usually mean that exhaustive sampling is not possible. Many programs have some limited sampling capability, especially for the side chains through the use of rotamer libraries, but they tend not to sample backbone conformations adequately. To facilitate whichever program the reader uses, it can be very beneficial to sample an ensemble of structures from an MD or other type of simulation. Alternatively, if NMR structure data exist, the ensemble of models can be used in independent docking runs.
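One practical way to generate such an ensemble is to pull evenly spaced snapshots from an MD trajectory; the short sketch below uses the MDAnalysis Python library, with placeholder file names and an arbitrary stride.

```python
import MDAnalysis as mda

u = mda.Universe('system.psf', 'traj.dcd')      # placeholder topology/trajectory
protein = u.select_atoms('protein')
for i, ts in enumerate(u.trajectory[::500]):    # every 500th frame
    protein.write(f'receptor_{i:03d}.pdb')      # one receptor snapshot per file
```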
In essence, a series of snapshots will give the docking search algorithms different starting structures from which to sample, and this may aid in a more holistic representation of the conformational space. Once a series of macromolecule structures is chosen, they need to be prepared for docking. The details are specific to the program being used; however, there are several general considerations that need to be kept in mind. Just as with MD simulations, these include the protonation state of the protein and particular residues, the proper treatment of any nonstandard amino acids and posttranslational modifications, and the inclusion of any required ligands, nucleotides, ions, and so on. If the sites of these modifications are known or thought to be close to the binding site, they may be critical for success; if they are more distal from the binding site, they can often be ignored. Less a formatting issue and more a technical one, the charge state of the macromolecule is very important and needs to be considered thoroughly, as electrostatics is often the driving force of many intermolecular interactions. Experimental information, such as pH or salt dependency, can help in deciding what the charge state of particular groups should be. Some programs require an active hand in making this determination. There may be other steps for preparing the macromolecule, such as defining flexible regions and rotatable bonds, which are important to consider.

4.3.2. Small molecule ligands
Drugs and small peptides tend to have more limited degrees of freedom and can therefore be treated in a more systematic manner. In short, the fewer the rotatable bonds, the easier it is to sample completely. Depending on the software, docking packages that allow the user to define fixed or rotatable bonds are usually sufficient, although caution is still advised to make sure that enough conformations are used in docking and that they are not sterically hindered. Just as for the macromolecule, the electrostatics of the ligand is very important in driving interactions, and if it is not correctly represented, the results can be drastically affected.

4.3.3. Protein ligands
Peptides with secondary structure and proteins being docked as ligands require a slightly different treatment than drugs or small molecules. While certain parts may be locked into helices or sheets and thus have somewhat restricted motions, there may be unstructured loops or more dynamic regions. Just as we saw before, it is not possible to fully capture these degrees of freedom, and these protein ligands are treated in the same fashion as the macromolecule. Again, if there is an NMR structure, those models can be used, or the ligand can be subjected to simulation studies. Ultimately, we would ideally generate an ensemble of structures for the macromolecule and ligand and perform docking for every combination of the two. As will be
covered in the virtual screening section, some have found that combining methods to converge on a docked model may be especially necessary for protein–protein docking (Vajda and Kozakov, 2009).
4.4. Iterative docking and analysis

With the ligand and macromolecule prepared, we can begin the process of generating docked models (see Fig. 12.5). To find the best model, an iterative method is typically the most successful approach. A general, first-pass docking may help to find a region on the macromolecule where the ligand is most likely to interact. This is a blind docking run, meaning that the macromolecule and ligand are allowed to pair randomly, with no bias toward any region. To save time in this step, it is usually best to allow limited or no flexibility. The second step is to cluster the results from the blind docking, grouping the ligand poses based on location and examining their energetic, or ranking, scores. Ideally, this will identify one particular area on the macromolecule, but if there are several locations, a highly ranked representative structure of each cluster (or binding site) can be used for finer docking runs. As always, the use of experimental data here is crucial in determining probable sites as well as in corroborating the selection of the best model.
Figure 12.5 The standard docking protocol. Ensembles of the macromolecule and ligand (generated via NMR, MD, MC, etc.) enter a general docking step (randomized); the results are clustered and analyzed (filtered) and then passed to refinement docking (with imposed constraints), yielding the docked model. Experimental data (e.g., biochemical or mutagenesis data) inform each stage.
Using these filtered results, we can perform refinement docking, where the ligand is restricted to a specified region based on the blind docking results. How this restriction is imposed depends on the software being used, but most programs possess this capability. There are often more fine-grained tuning options for more exact exploration, including side-chain sampling or repacking, that allow for flexible docking (Bonvin, 2006). The user should make use of these capabilities as appropriate.
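The clustering step itself can be as simple as a greedy grouping of poses by pairwise RMSD, taken in rank order so that each cluster is represented by its best-scoring member. The sketch below is a generic version of this idea, not the algorithm of any particular package.

```python
import numpy as np

def cluster_poses(poses, scores, cutoff=2.0):
    """Greedy RMSD clustering. poses: (n_poses, n_atoms, 3) ligand
    coordinates in the receptor frame; scores: (n_poses,), lower is better.
    Each pose joins the first cluster whose seed lies within `cutoff`
    angstroms RMSD; otherwise it seeds a new cluster."""
    order = np.argsort(scores)           # best-scoring poses first
    seeds, members = [], []
    for idx in order:
        for c, seed in enumerate(seeds):
            rmsd = np.sqrt(np.mean(np.sum((poses[idx] - poses[seed]) ** 2,
                                          axis=1)))
            if rmsd < cutoff:
                members[c].append(idx)
                break
        else:
            seeds.append(idx)            # a new, lower-ranked binding mode
            members.append([idx])
    return seeds, members                # cluster 0 holds the best pose
```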
4.5. Post analysis

If an iterative docking methodology is used, analysis needs to take place intermittently to maximize docking success. Clustering coupled with score ranking is the most basic analysis for finding potentially good poses. Particularly in the case of protein–protein docking, the use of experimental data, such as mutagenesis data, may be required. Such data can be used to filter out false positives and improve the overall results. Also, although the search and scoring steps of most docking protocols are highly intertwined, one can easily rescore a set of poses using one or more different scoring functions. Some programs have multiple and easily manipulated scoring functions. Although it usually requires extra effort, rescoring helps enrich the results and converge on a consensus result with which other methods agree.
4.6. Virtual screening

So far, the methods described here assume that the ligand to be docked is already identified. In the case of drug discovery, the small molecule of interest that binds a given target may be exactly what we are trying to determine. In such cases, a whole gamut of small-molecule databases (such as the Available Chemicals Directory, ChemACX, the Maybridge Database, ZINC, and the NCI Diversity Set (Voigt et al., 2001)) can be docked against the target protein for screening purposes. This kind of virtual screening can help narrow down potential inhibitor candidates before vast resources are devoted to testing them at the bench. High-throughput virtual screening is understandably popular with pharmaceutical companies, since it not only saves money on experimental testing but can also help guide lead discovery. With 1000–100,000 compounds in each database, high-throughput methods are required to accomplish such screening in a timely manner. Usually, this simply means using multiple computer processors with a reasonably fast docking tool and performing docking against a database of molecules. It is often advantageous to use a combination of several docking programs, for both the search and scoring algorithms, in order to find a consensus subset of the best-docking small molecules (Vajda and Kozakov, 2009). While the details of this protocol are out of the scope of this chapter, we list
some literature to further elucidate virtual screening (Cross et al., 2009; Irwin, 2008; Jain, 2004; Kitchen et al., 2004; Kontoyianni et al., 2008; Shoichet, 2004; Zoete et al., 2009).
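Structurally, such a screen is just a loop over the library that retains the best-ranked compounds. In the sketch below, the `dock` callable and the `cid` identifiers are hypothetical stand-ins for whatever docking engine and compound database are actually used.

```python
import heapq

def screen_library(receptor, library, dock, top_n=100):
    """Rank a compound library against one receptor. `library` yields
    (cid, molecule) pairs; `dock` returns (score, pose), lower is better."""
    kept = []                                  # heap of the top_n candidates
    for cid, molecule in library:
        score, pose = dock(receptor, molecule)
        heapq.heappush(kept, (-score, cid, pose))
        if len(kept) > top_n:
            heapq.heappop(kept)                # discard the current worst
    return sorted((-s, cid, pose) for s, cid, pose in kept)
```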
4.7. Conclusions

There is one caveat that users of docking software must continually remind themselves of: analyze the docked models with a skeptical eye. It is very easy to accept poses that fit a mechanistic model that we want to prove and thus be biased toward what we wish to see. On the other hand, the models that we create are just that: models. It is still fair to include some heuristic filtering, provided it is supported with good reasoning. In the current state of available methods, there is a general acknowledgment that the accurate representation of electrostatics still needs significant improvement. A typical molecular representation boils complex electrostatic surfaces down to simple point charges on single atoms. While this speeds up calculations in a first-pass docking, more exact electrostatics, namely higher order moments and the effects of polarization, may be needed to improve the capabilities of all programs (Illingworth et al., 2008).
REFERENCES

Alder, B. J., and Wainwright, T. E. (1959). Studies in molecular dynamics. I. General method. J. Chem. Phys. 31(2), 459–466.
Alder, B. J., and Wainwright, T. E. (1960). Studies in molecular dynamics. II. Behavior of a small number of elastic spheres. J. Chem. Phys. 33(5), 1439–1451.
Alexandrov, N. N., Nussinov, R., and Zimmer, R. M. (1996). Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. In "Pacific Symposium on Biocomputing '96," (L. Hunter and T. Klein, eds.). World Scientific Publishing Co., Singapore.
Altschul, S. F., et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410.
Altschul, S. F., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402.
Attwood, T. K., et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31(1), 400–402.
Bairoch, A., et al. (2004). Swiss-Prot: Juggling between evolution and stability. Brief Bioinform. 5(1), 39–55.
Bates, P. A. (2001). Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins (Suppl. 5), 39–46.
Benson, D. A., et al. (2009). GenBank. Nucleic Acids Res. 37(Database issue), D26–D31.
Berendsen, H. J. C., Grigera, J. R., and Straatsma, T. P. (1987). The missing term in effective pair potentials. J. Phys. Chem. 91(24), 6269–6271. doi:10.1021/j100308a038.
Berman, H. M., et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28(1), 235–242.
Bonvin, A. M. (2006). Flexible protein–protein docking. Curr. Opin. Struct. Biol. 16(2), 194–200.
Brooks, B. R., et al. (2009). CHARMM: The biomolecular simulation program. J. Comput. Chem. 30(10), 1545–1614.
Bryson, K., et al. (2005). Protein structure prediction servers at University College London. Nucleic Acids Res. 33, W36–W38. Web server issue.
Caldwell, J. W., and Kollman, P. A. (1995). Structure and properties of neat liquids using nonadditive molecular dynamics: Water, methanol, and N-methylacetamide. J. Phys. Chem. 99(16), 6208–6219. doi:10.1021/j100016a067.
Cheatham, T. E., and Young, M. A. (2000). Molecular dynamics simulation of nucleic acids: Successes, limitations, and promise. Biopolymers 56(4), 232–256.
Cheng, J., et al. (2005). SCRATCH: A protein structure and structural feature prediction server. Nucleic Acids Res. 33, W72–W76. Web server issue.
Cole, C., Barber, J. D., and Barton, G. J. (2008). The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 36, W197–W201. Web server issue.
Cross, J. B., et al. (2009). Comparison of several molecular docking programs: Pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 49(6), 1455–1474.
Cummings, M. D., et al. (2005). Comparison of automated docking programs as virtual screening tools. J. Med. Chem. 48(4), 962–976.
Davidson, A. R. (2008). A folding space odyssey. Proc. Natl. Acad. Sci. USA 105(8), 2759–2760.
de Vries, A. H., et al. (2005). Molecular dynamics simulations of phospholipid bilayers: Influence of artificial periodicity, system size, and simulation time. J. Phys. Chem. B 109(23), 11643–11652. doi:10.1021/jp0507952.
Diraviyam, K., et al. (2003). Computer modeling of the membrane interaction of FYVE domains. J. Mol. Biol. 328(3), 721–736.
Dominguez, C., Boelens, R., and Bonvin, A. M. (2003). HADDOCK: A protein–protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125(7), 1731–1737.
Duan, Y., et al. (2003). A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24(16), 1999–2012.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14(9), 755–763.
Finn, R. D., et al. (2008). The Pfam protein families database. Nucleic Acids Res. 36(Database issue), D281–D288.
Friesner, R. A., et al. (2004). Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47(7), 1739–1749.
Gasteiger, E., et al. (2003). ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31(13), 3784–3788.
Goodsell, D. S., and Olson, A. J. (1990). Automated docking of substrates to proteins by simulated annealing. Proteins 8(3), 195–202.
Halgren, T. A., et al. (2004). Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 47(7), 1750–1759.
Hardin, C., Pogorelov, T. V., and Luthey-Schulten, Z. (2002). Ab initio protein structure prediction. Curr. Opin. Struct. Biol. 12(2), 176–181.
Heller, H., Schaefer, M., and Schulten, K. (1993). Molecular dynamics simulation of a bilayer of 200 lipids in the gel and in the liquid crystal phase. J. Phys. Chem. 97(31), 8343–8360. doi:10.1021/j100133a034.
Hess, B., et al. (2008). GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation. J. Chem. Theory Comput. 4(3), 435–447. doi:10.1021/ct700301q.
Huang, N., Shoichet, B. K., and Irwin, J. J. (2006). Benchmarking sets for molecular docking. J. Med. Chem. 49(23), 6789–6801.
Hulo, N., et al. (2008). The 20 years of PROSITE. Nucleic Acids Res. 36(Database issue), D245–D249.
Hunter, S. (2009). InterPro: The integrative protein signature database. Nucleic Acids Res. 37(Database issue), D211–D215.
Illingworth, C. J., et al. (2008). Assessing the role of polarization in docking. J. Phys. Chem. A 112(47), 12157–12163.
Irwin, J. J. (2008). Community benchmarks for virtual screening. J. Comput. Aided Mol. Des. 22(3–4), 193–199.
Jacobson, M. P., et al. (2002). On the role of the crystal environment in determining protein side-chain conformations. J. Mol. Biol. 320(3), 597–608.
Jain, A. N. (2004). Virtual screening in lead discovery and optimization. Curr. Opin. Drug Discov. Devel. 7(4), 396–403.
Jain, A. N. (2006). Scoring functions for protein–ligand docking. Curr. Protein Pept. Sci. 7(5), 407–420.
Jones, G., Willett, P., and Glen, R. C. (1995). A genetic algorithm for flexible molecular overlay and pharmacophore elucidation. J. Comput. Aided Mol. Des. 9(6), 532–549.
Jorgensen, W. L., and Madura, J. D. (1985). Temperature and size dependence for Monte Carlo simulations of TIP4P water. Mol. Phys. 56(6), 1381–1392.
Jorgensen, W. L., et al. (1983). Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79(2), 926–935.
Kelley, L. A., MacCallum, R. M., and Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299(2), 499–520.
Kitchen, D. B., et al. (2004). Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discov. 3(11), 935–949.
Kontoyianni, M., et al. (2008). Theoretical and practical considerations in virtual screening: A beaten field? Curr. Med. Chem. 15(2), 107–116.
Kretsinger, R. H., Ison, R. E., and Hovmoller, S. (2004). Prediction of protein structure. Methods Enzymol. 383, 1–27.
Krogh, A., et al. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305(3), 567–580.
Larkin, M. A., et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23(21), 2947–2948.
Laskowski, R. A., MacArthur, M. W., Moss, D. S., and Thornton, J. M. (1993). PROCHECK: A program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Leach, A. (2001). Molecular Modelling: Principles and Applications. 2nd edn. Prentice Hall, Harlow, England.
Leach, A. R., Shoichet, B. K., and Peishoff, C. E. (2006). Prediction of protein–ligand interactions. Docking and scoring: Successes and gaps. J. Med. Chem. 49(20), 5851–5855.
Lensink, M. F., Mendez, R., and Wodak, S. J. (2007). Docking and scoring protein complexes: CAPRI 3rd Edition. Proteins 69(4), 704–718.
Letunic, I., Doerks, T., and Bork, P. (2009). SMART 6: Recent updates and new developments. Nucleic Acids Res. 37(Database issue), D229–D232.
Lobley, A., Sadowski, M. I., and Jones, D. T. (2009). pGenTHREADER and pDomTHREADER: New methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14), 1761–1767.
Luthy, R., Bowie, J. U., and Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature 356(6364), 83–85.
Madhusudhan, M. S., et al. (2005). Comparative protein structure modeling. In "The Proteomics Protocols Handbook," (J. M. Walker, ed.), pp. 831–860. Humana Press, Totowa, New Jersey.
Mahoney, M. W., and Jorgensen, W. L. (2000). A five-site model for liquid water and the reproduction of the density anomaly by rigid, nonpolarizable potential functions. J. Chem. Phys. 112(20), 8910.
Marti-Renom, M. A., et al. (2000). Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325.
McCammon, J. A., et al. (1976). The hinge-bending mode in lysozyme. Nature 262(5566), 325–326.
McCammon, J. A., Gelin, B. R., and Karplus, M. (1977). Dynamics of folded proteins. Nature 267(5612), 585–590.
McGinnis, S., and Madden, T. L. (2004). BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, W20–W25. Web server issue.
Mendez, R., et al. (2005). Assessment of CAPRI predictions in rounds 3–5 shows progress in docking procedures. Proteins 60(2), 150–169.
Moitessier, N., et al. (2008). Towards the development of universal, fast and highly accurate docking/scoring methods: A long way to go. Br. J. Pharmacol. 153(Suppl. 1), S7–S26.
Mount, W. D. (2001). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.
Notredame, C., Higgins, D. G., and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217.
Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63–98.
Pei, J. (2008). Multiple protein sequence alignment. Curr. Opin. Struct. Biol. 18(3), 382–386.
Pettersen, E. F., et al. (2004). UCSF Chimera—A visualization system for exploratory research and analysis. J. Comput. Chem. 25(13), 1605–1612.
Ponder, J. W., and Case, D. A. (2003). Force fields for protein simulations. Adv. Protein Chem. 66, 27–85.
Punta, M., et al. (2007). Membrane protein prediction methods. Methods 41(4), 460–474.
Rahman, A. (1964). Correlations in the motion of atoms in liquid argon. Phys. Rev. 136(2A), A405.
Rahman, A., and Stillinger, F. H. (1974). Propagation of sound in water. A molecular-dynamics study. Phys. Rev. A 10(1), 368.
Rarey, M., Kramer, B., and Lengauer, T. (1995). Time-efficient docking of flexible ligands into active sites of proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 300–308.
Ren, P., and Ponder, J. W. (2003). Polarizable atomic multipole water model for molecular mechanics simulation. J. Phys. Chem. B 107(24), 5933–5947. doi:10.1021/jp027815+.
Ritchie, D. W. (2008). Recent progress and future directions in protein–protein docking. Curr. Protein Pept. Sci. 9(1), 1–15.
Roessler, C. G., et al. (2008). Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds. Proc. Natl. Acad. Sci. USA 105(7), 2343–2348.
Rohl, C. A., et al. (2004). Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93.
Rost, B., and Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232(2), 584–599.
Rost, B., Yachdav, G., and Liu, J. (2004). The PredictProtein server. Nucleic Acids Res. 32, W321–W326. Web server issue.
Sali, A., et al. (1993). Three-dimensional models of four mouse mast cell chymases. Identification of proteoglycan binding regions and protease-specific antigenic epitopes. J. Biol. Chem. 268(12), 9023–9034.
Schueler-Furman, O., Wang, C., and Baker, D. (2005a). Progress in protein–protein docking: Atomic resolution predictions in the CAPRI experiment using RosettaDock with an improved treatment of side-chain flexibility. Proteins 60(2), 187–194.
Schueler-Furman, O., et al. (2005b). Progress in modeling of protein structures and interactions. Science 310(5748), 638–642.
Schulz-Gasch, T., and Stahl, M. (2003). Binding site characteristics in structure-based virtual screening: Evaluation of current docking tools. J. Mol. Model. 9(1), 47–57.
Schwede, T., et al. (2003). SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 31(13), 3381–3385.
Shi, J., Blundell, T. L., and Mizuguchi, K. (2001). FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310(1), 243–257.
Shoichet, B. K. (2004). Virtual screening of chemical libraries. Nature 432(7019), 862–865.
Shortle, D., Simons, K. T., and Baker, D. (1998). Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA 95(19), 11158–11162.
Simons, K. T., et al. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268(1), 209–225.
Sousa, S. F., Fernandes, P. A., and Ramos, M. J. (2006). Protein–ligand docking: Current status and future challenges. Proteins 65(1), 15–26.
Subramaniam, S. (1998). The biology workbench—A seamless database and analysis environment for the biologist. Proteins 32(1), 1–2.
Tieleman, D. P., and Berendsen, H. J. C. (1996). Molecular dynamics simulations of a fully hydrated dipalmitoylphosphatidylcholine bilayer with different macroscopic boundary conditions and parameters. J. Chem. Phys. 105(11), 4871.
Tieleman, D. P., Marrink, S. J., and Berendsen, H. J. C. (1997). A computer perspective of membranes: Molecular dynamics studies of lipid bilayer systems. Biochim. Biophys. Acta 1331(3), 235.
Tiwari, R., et al. (2009). Carborane clusters in computational drug design: A comparative docking evaluation using AutoDock, FlexX, Glide, and Surflex. J. Chem. Inf. Model. 49(6), 1581–1589.
Vajda, S., and Kozakov, D. (2009). Convergence and combination of methods in protein–protein docking. Curr. Opin. Struct. Biol. 19(2), 164–170.
Voigt, J. H., et al. (2001). Comparison of the NCI open database with seven large chemical structural databases. J. Chem. Inf. Comput. Sci. 41(3), 702–712.
Vriend, G. (1990). WHAT IF: A molecular modeling and drug design program. J. Mol. Graph. 8(1), 52–56 (see also p. 29).
Warren, G. L., et al. (2006). A critical assessment of docking programs and scoring functions. J. Med. Chem. 49(20), 5912–5931.
Waterhouse, A. M., et al. (2009). Jalview Version 2—A multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9), 1189–1191.
Wiederstein, M., and Sippl, M. J. (2007). ProSA-web: Interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 35, W407–W410. Web server issue.
Zdobnov, E. M., and Apweiler, R. (2001). InterProScan—An integration platform for the signature-recognition methods in InterPro. Bioinformatics 17(9), 847–848.
Zhang, Y. (2008). I-TASSER server for protein 3D structure prediction. BMC Bioinform. 9, 40.
Zhang, Z., et al. (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26(17), 3986–3990.
Zoete, V., Grosdidier, A., and Michielin, O. (2009). Docking, virtual high throughput screening and in silico fragment-based drug design. J. Cell Mol. Med. 13(2), 238–248.
CHAPTER THIRTEEN

Deterministic and Stochastic Models of Genetic Regulatory Networks

Ilya Shmulevich and John D. Aitchison

Contents
1. Introduction
2. Boolean Networks
   2.1. Attractors as cell types and cellular functional states
3. Differential Equation Models
   3.1. Accurate description of cellular growth and division and prediction of mutant phenotypes
4. Probabilistic Boolean Networks
   4.1. Steady-state analysis and stability under stochastic fluctuations
5. Stochastic Differential Equation Models
   5.1. The influence of noise on system behavior
References
Abstract
Traditionally, molecular biology research has tended to reduce biological pathways to composite units studied as isolated parts of the cellular system. With the advent of high-throughput methodologies that can capture thousands of data points, and powerful computational approaches, the reality of studying cellular processes at a systems level is upon us. As these approaches yield massive datasets, systems-level analyses have drawn upon other fields such as engineering and mathematics, adapting computational and statistical approaches to decipher relationships between molecules. Guided by high-quality datasets and analyses, one can begin the process of predictive modeling. The findings from such approaches are often surprising and beyond normal intuition. We discuss four classes of dynamical systems used to model genetic regulatory networks. The discussion is divided into continuous and discrete models, as well as deterministic and stochastic model classes. For each combination of these categories, a model is presented and discussed in the context of the yeast cell cycle, illustrating how different types of questions can be addressed by different model classes.

Institute for Systems Biology, Seattle, Washington, USA
1. Introduction

Modern molecular biology technologies and the proliferation of Web-based resources containing information on various aspects of biomolecular networks in living cells have made it possible to mathematically model dynamical systems of molecular interactions that control various cellular functions and processes. Such models can then be used to predict the behavior of the system in response to different perturbations or stimuli and, ultimately, for developing rational control strategies intended to drive the cellular system toward a desired state or away from an undesired state that may be associated with disease. To this end, various dynamical models have been studied, most commonly in the context of genetic regulatory networks, for a variety of biological systems. Although there are a number of natural ways to categorize and classify dynamical models of genetic networks, this chapter presents a model class with an accompanying example in each combination of the deterministic versus stochastic and continuous versus discrete model categories. The example used in each of the model classes is the yeast cell cycle, as this system has been extensively studied from a variety of different perspectives and with different model classes. It is not the intention of this chapter to go into an in-depth investigation of the cell cycle, but rather to use it as a running example to illustrate the kinds of questions that can be addressed by the different model classes considered.

A deterministic model of a genetic regulatory network may involve a number of different mechanisms that capture the collective behavior of the elements constituting the network. The models can differ in numerous ways, such as in the nature of the physical elements that are represented in the model (i.e., genes, proteins, and other factors); the resolution or scale at which the behavior of the network elements is captured (e.g., are genes discretized, such as being either on or off, or do they take on continuous values?); and how the network elements interact (e.g., interactions can either be present or absent, or they may have a quantitative nature). The common aspect of deterministic models is the inherent lack of randomness or stochasticity in the model. This chapter presents Boolean networks and systems of differential equations as examples of discrete and continuous deterministic models of genetic networks, respectively.

Stochastic models of genetic regulatory networks differ from their deterministic counterparts by incorporating randomness or uncertainty. Most deterministic models can be generalized such that one associates probabilities with particular components or aspects of the model. Thus, stochastic models can also be categorized into discrete and continuous categories. The stochastic or probabilistic components in such models can
either be associated with the model structure, so that the interactions or rules of interaction are described by probability distributions, or arise from the incorporation of noise terms that capture intrinsic biological stochasticity or measurement uncertainty. Probabilistic Boolean networks (PBNs) and stochastic differential equations are presented as examples of discrete and continuous stochastic models of genetic networks, respectively.
2. Boolean Networks

Boolean networks are a class of discrete dynamical systems that can be characterized by the interactions over a set of Boolean variables. Random Boolean networks (RBNs), which are ensembles of random network structures, were first introduced by Kauffman (1969a,b) as a simple model class for studying dynamical properties of gene regulatory networks at a time when the structure of such networks was largely unknown. The idea behind such an approach is to define an ensemble of Boolean networks such that it fulfills certain known features of biological networks and then study random instances of these networks to learn more about the general properties of such networks (Kauffman, 1974, 1993, 2004). Boolean network modeling of genetic networks was further developed by Thomas (1973) and others. The ensemble approach has been extraordinarily successful in shedding light on fundamental principles of complex living systems at all scales of organization, including adaptability and evolvability, robustness, coordination of complex behaviors, storage of information, and the relationships between the structure of such complex systems and their dynamical behavior. The reader is referred to several excellent review articles that cover the ensemble properties of Boolean networks (Aldana et al., 2002; Drossel, 2007). However, our focus here is on Boolean network models that can be used to capture the behavior of a specific gene regulatory network.

Consider a directed graph where the vertices represent genes and the directed edges represent the actions of genes, or rather their products, on other genes. For example, directed edges from genes A and B into gene C indicate that A and B jointly act on C. The specific mechanism of action is not represented in the graph structure itself, so an additional representation is necessary. One of the simplest representational frameworks assumes that genes are binary-valued entities, meaning that they can be in one of two possible states of activity (e.g., ON or OFF) at any given point in time, and that they act on each other by means of rules represented by Boolean functions. For example, gene C may be determined by the output of a Boolean function whose inputs are A and B. The underlying directed graph merely represents the input–output relationships. We now present this idea more formally.
A Boolean network is defined by a set of nodes (genes) {x1, . . ., xn} and a list of Boolean functions {f1, f2, . . ., fn}. Each gene xi ∈ {0, 1} (i = 1, . . ., n) is a binary variable whose value at time t + 1 is completely determined by the values of genes xj1, xj2, . . ., xjki at time t by means of a Boolean function fi: {0, 1}^ki → {0, 1}. That is, there are ki regulatory genes assigned to gene xi that determine the "wiring" of that gene. Thus, one can write

$$x_i(t+1) = f_i\bigl(x_{j_1}(t), x_{j_2}(t), \ldots, x_{j_{k_i}}(t)\bigr) \qquad (13.1)$$
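In code, a synchronous update according to Eq. (13.1) is a single pass over all genes. The minimal Python sketch below makes this explicit; representing the functions as callables and the wiring as index tuples is an illustrative choice, not taken from the chapter.

```python
def step(state, functions, wiring):
    """One synchronous update, Eq. (13.1): gene i reads its k_i regulators
    at time t and applies its Boolean function f_i to obtain x_i(t+1)."""
    return tuple(f(*(state[j] for j in regs))
                 for f, regs in zip(functions, wiring))

# Illustrative three-gene network (0-indexed): x0 <- x1 AND NOT x2,
# x1 <- x0 OR x2, x2 <- NOT x2 (autoregulation)
functions = (lambda a, b: a & (1 - b),
             lambda a, b: a | b,
             lambda a: 1 - a)
wiring = ((1, 2), (0, 2), (2,))
state = (1, 0, 1)
print(step(state, functions, wiring))   # state at time t+1 -> (0, 1, 0)
```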
In an RBN, the functions fi are selected randomly, as are the genes that are used as their inputs. This is the basis of the ensemble approach mentioned above. Each xi represents the state (expression) of gene i, where xi = 1 represents the fact that gene i is expressed and xi = 0 means it is not expressed. Such a seemingly crude simplification of gene expression has ample justification in the experimental literature (Bornholdt, 2008). Indeed, consider the fact that many organisms exhibit an amazing determinism of gene activity under specific experimental contexts or conditions, such as Escherichia coli under temperature change (Richmond et al., 1999). The determinism is apparent despite the prevalent molecular stochasticity and the experimental noise inherent to measurement technologies such as microarrays. Furthermore, accurate mathematical models of gene regulation that capture kinetic-level details of molecular reactions frequently operate with expressed molecular concentrations spanning several orders of magnitude, either in a saturation regime or in a regime of insignificantly small concentrations, with rapid switch-like transitions between such regimes (Davidich and Bornholdt, 2008a). Further, even higher organisms, which are necessarily more complex in terms of genetic regulation and heterogeneity, exhibit remarkable consistency when gene expression is quantized into two levels; for example, different subtypes of human tumors can be reliably discriminated in the binary domain (Shmulevich and Zhang, 2002).

In a Boolean network, a given gene transforms its inputs (regulatory factors that bind to it) into an output, which is the state or expression of the gene itself at the next time point. All genes are assumed to update synchronously in accordance with the functions assigned to them, and this process is then repeated. It is clear that the dynamics of a synchronous Boolean network are completely determined by Eq. (13.1). The artificial synchrony simplifies computation while preserving the qualitative, generic properties of global network dynamics. Synchronous updating has been applied in most analytical studies so far, as it is the only scheme that yields deterministic state transitions. Although the introduction of asynchronous updating, which typically involves a random update schedule, renders the system stochastic, asynchronous updating is not per se biologically more realistic and has to be motivated carefully in every case so as not to fall victim to artifacts (Chaves et al., 2005). Additionally, recent research indicates that some
molecular control networks are so robustly designed that timing is not a critical factor (Braunewell and Bornholdt, 2006), that time ordering in the emergence of cell-fate patterns is not an artifact of synchronous updating in the Boolean model (Alvarez-Buylla et al., 2008), and that simplified synchronous models are able to reliably reproduce the sequence of states in biological systems. Nonetheless, PBNs, presented in Section 4, are able to model asynchronous updating as well as other stochastic generalizations of Boolean networks.

Let us start with a simple example to illustrate the dynamics of Boolean networks and present the key idea of attractors. Consider a Boolean network consisting of five genes {x1, . . ., x5} with the corresponding Boolean functions given by the truth tables shown in Table 13.1. Note that x4(t + 1) = f4(x4(t)) is a function of only one variable and is an example of autoregulation. The maximum connectivity (i.e., the maximal number of regulators) K = max_i k_i is equal to 3 in this case. The dynamics of this Boolean network are shown in Fig. 13.1. Since there are five genes, there are 2^5 = 32 possible states that the network can be in. Each state is represented by a circle, and the arrows between states show the transitions of the network according to the functions in Table 13.1. It is easy to see that, because of the inherent deterministic directionality in Boolean networks as well as the finite number of possible states, certain states will be revisited infinitely often if, depending on the initial starting state, the network happens to transition into them. Such states are called attractors, and the states that lead into them, including the attractors themselves, comprise their basins of attraction.

Table 13.1 Truth tables of the functions in a Boolean network with five genes
(xj1, xj2, xj3)    f1    f2    f3    f4    f5
0 0 0              0     0     0     0     0
0 0 1              1     1     1     1     0
0 1 0              1     1     1     –     0
0 1 1              1     0     0     –     0
1 0 0              0     0     1     –     0
1 0 1              1     1     1     –     0
1 1 0              1     1     0     –     0
1 1 1              1     1     1     –     1

j1                 5     3     3     4     5
j2                 2     5     1     –     4
j3                 4     4     5     –     1

The indices j1, j2, and j3 indicate the input connections for each of the functions; f4 has a single input (j1 = 4), so only the first two rows of its truth table apply.
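Because the state space is finite, the attractors of this network can be found by exhaustively iterating Eq. (13.1) from every one of the 2^5 = 32 states. The Python sketch below encodes Table 13.1 directly; running it recovers, among others, the fixed point (00000) and the length-2 attractor discussed below.

```python
from itertools import product

# Table 13.1: input connections (1-indexed genes) and truth-table outputs,
# listed for input combinations 000, 001, ..., 111 (f4 has one input).
WIRING = {1: (5, 2, 4), 2: (3, 5, 4), 3: (3, 1, 5), 4: (4,), 5: (5, 4, 1)}
TABLES = {1: (0, 1, 1, 1, 0, 1, 1, 1),
          2: (0, 1, 1, 0, 0, 1, 1, 1),
          3: (0, 1, 1, 0, 1, 1, 0, 1),
          4: (0, 1),
          5: (0, 0, 0, 0, 0, 0, 0, 1)}

def step(state):
    """Synchronous update of all five genes according to Eq. (13.1)."""
    out = []
    for i in (1, 2, 3, 4, 5):
        bits = [state[j - 1] for j in WIRING[i]]
        row = int("".join(map(str, bits)), 2)   # row index in truth table
        out.append(TABLES[i][row])
    return tuple(out)

def attractors():
    """Follow every one of the 32 states until a state repeats."""
    found = set()
    for start in product((0, 1), repeat=5):
        seen, s = [], start
        while s not in seen:
            seen.append(s)
            s = step(s)
        cycle = seen[seen.index(s):]            # the attractor reached
        found.add(frozenset(cycle))
    return found

for a in attractors():
    print(sorted(a))    # includes (0,0,0,0,0) and the 2-cycle 11010 <-> 11110
```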
Figure 13.1 The state-transition diagram for the Boolean network defined in Table 13.1 (Shmulevich et al., 2002c).
For example, in Fig. 13.1, the state (00000) is an attractor and, together with the seven other (transient) states that eventually lead into it, comprises its basin of attraction. The attractors represent the fixed points of the dynamical system, thus capturing the system's long-term behavior. The attractors are always cyclical and may consist of more than one state. Starting from any state on an attractor, the number of transitions necessary for the system to return to that state is called the cycle length. For example, the attractor (00000) has cycle length 1, while the states (11010) and (11110) comprise an attractor of length 2.

Real genetic regulatory networks are highly stable in the presence of perturbations, since the cell must be able to maintain homeostasis in metabolism, or its developmental program, in the face of such external perturbations and a variety of stimuli. Within the Boolean network formalism, this means that when a minimal number of genes transiently change value (say, by means of some external stimulus), the system typically transitions into states that reside in the same basin of attraction, and the network eventually "flows" back to the same attractor. Generally speaking, large basins of attraction correspond to higher stability. Such stability of networks in living organisms allows cells to maintain their functional state within their environment. Although epigenetic, heritable changes in cell determination have been well established in developmental biology, it is now becoming evident that the same type of mechanisms may also be at work in carcinogenesis and that gene expression patterns can be inherited without the need for mutational changes in DNA (MacLeod, 1996). In the Boolean network framework, this can be explained by so-called hysteresis; that is, a change in the system's state caused by a stimulus that does not revert when the stimulus is withdrawn (Huang, 1999). Thus, if the change of some particular gene does in fact cause a transition to a different attractor, the network will often remain in the new attractor even if that gene is switched off. Thus, the
structure of the state space of a Boolean network, in which every state in a basin of attraction is associated with the corresponding attractor to which the system will ultimately flow, represents a type of associative memory.
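For a network this small, the state space can be enumerated exhaustively. The following minimal Python sketch (variable names are illustrative) transcribes the wiring and truth tables of Table 13.1, iterates every one of the 32 states until a state repeats, and collects the resulting attractor cycles:

```python
from itertools import product

# Wiring (j1, j2, j3) and truth-table output columns transcribed from Table 13.1
inputs = {1: (5, 2, 4), 2: (3, 5, 4), 3: (3, 1, 5), 4: (4,), 5: (5, 4, 1)}
truth = {1: [0, 1, 1, 1, 0, 1, 1, 1],
         2: [0, 1, 1, 0, 0, 1, 1, 1],
         3: [0, 1, 1, 0, 1, 1, 0, 1],
         4: [0, 1],
         5: [0, 0, 0, 0, 0, 0, 0, 1]}

def step(state):
    """Synchronous update: every gene reads its inputs from the current state."""
    nxt = []
    for i in range(1, 6):
        idx = 0
        for j in inputs[i]:                  # row index of fi's truth table, j1 = MSB
            idx = (idx << 1) | state[j - 1]
        nxt.append(truth[i][idx])
    return tuple(nxt)

attractors = set()
for s in product((0, 1), repeat=5):          # all 32 initial states
    seen = []
    while s not in seen:                     # iterate until a state repeats
        seen.append(s)
        s = step(s)
    attractors.add(frozenset(seen[seen.index(s):]))  # repeating segment = attractor

for cycle in attractors:
    print(sorted("".join(map(str, st)) for st in cycle))
```

Running this recovers, among others, the fixed point (00000) and the two-state cycle {(11010), (11110)} discussed above.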
2.1. Attractors as cell types and cellular functional states

Real gene regulatory networks exhibit spontaneous emergence of ordered collective behavior of gene activity, captured by the attractors. Indeed, recent findings provide experimental evidence for the existence of attractors in real regulatory networks (Chang et al., 2008; Huang and Ingber, 2000; Huang et al., 2005). At the same time, many studies have shown (e.g., Wolf and Eeckman, 1998) that dynamical system behavior and stability of equilibria can be largely determined from regulatory element organization. This suggests that there must exist certain generic features of regulatory networks that are responsible for their inherent robustness and stability. Since in multicellular organisms the cellular "fate" is determined by which genes and proteins are expressed, the attractors in Boolean networks should correspond to cell types, an idea originally due to Kauffman (2004). This interpretation is quite reasonable if cell types are characterized by stable, recurrent patterns of gene expression (Jacob and Monod, 1961). Another interpretation of attractors in Boolean networks is that they correspond to cellular states, such as proliferation (cell cycle), apoptosis (programmed cell death), and differentiation (execution of tissue-specific tasks) (Huang, 1999). Such an interpretation can provide new insights into cellular homeostasis and cancer progression, the latter being characterized by an imbalance between these cellular states. For instance, the occurrence of a structural mutation can reduce the probability of the network entering the apoptosis attractor(s), making the cells less likely to undergo apoptosis and more likely to exhibit uncontrolled growth. Similarly, an enlargement of the basin of attraction of the proliferation attractor would hyperstabilize it, resulting in hyperproliferation, typical of tumorigenesis. Such an interpretation need not be at odds with the interpretation that attractors represent cell types. To the contrary, these views are complementary, since for a given cell type, different cellular functional states must exist and be determined by the collective behavior of gene activity. Thus, one cell type can comprise several "neighboring" attractors, each corresponding to a different cellular functional state. Biological networks can often be modeled as logical circuits from well-known local interaction data in a straightforward way. This is clearly one of the advantages of the Boolean network approach. Though logical models may sometimes appear obvious and simplistic compared to detailed kinetic models of biomolecular reactions, they may help to understand the key dynamic properties of a regulatory process. Further, a Boolean network model can be formulated as a coarse-grained limit of the more detailed differential
equation model for a system (Davidich and Bornholdt, 2008a), discussed in Section 3. They may also lead the experimentalist to ask new questions and to test them first in silico. Let us consider a Boolean network model of the cell cycle control network in the budding yeast Saccharomyces cerevisiae proposed in Li et al. (2004). The core regulatory network, involving activations and inhibitions among cyclins, transcription factors, and checkpoints such as cell size, consists of 11 binary variables. The Boolean functions, Eq. (13.1), assigned to each variable are chosen from the subclass of threshold Boolean functions (Muroga, 1971), which sum up their inputs with weights; if the sum exceeds a threshold, the output of the function is equal to 1, and otherwise it is equal to 0. This is equivalent to a perceptron and represents a hyperplane that cuts the Boolean hypercube into two halves, zeros on one side and ones on the other. The model, shown in Fig. 1 in Li et al. (2004), also has self-degradation loops such that nodes that are not negatively regulated by others are degraded at the next time point. The dynamics of the model are described by

\[
x_i(t+1) =
\begin{cases}
1, & \text{if } \sum_{j=1}^{n} a_{ij} x_j(t) > 0, \\
0, & \text{if } \sum_{j=1}^{n} a_{ij} x_j(t) < 0, \\
x_i(t), & \text{if } \sum_{j=1}^{n} a_{ij} x_j(t) = 0,
\end{cases}
\tag{13.2}
\]

and the weights were all set to +1 or −1, depending on activation or inhibition, respectively (Li et al., 2004). Since there are 11 nodes in the network, there are 2^11 = 2048 states in total, and all the state transitions can be computed directly through Eq. (13.2). One of the seven attractors is the most stable and attracts approximately 86% of all states. This stable (fixed-point) attractor, in which the molecules Cdh1 and Sic1 are equal to 1 and all others (Cln3, MBF, SBF, Cln1/2, Swi5, Cdc20, Clb5/6, Clb1/2, Mcm1) are equal to 0, represents the biological G1 stationary state (one of the four phases of the cell cycle, in which the cell grows and can commit to division), guaranteeing cellular stability in this state. It is further demonstrated in Li et al. (2004) that the dynamic state trajectories starting from each of the states in the basin of attraction of the G1 stationary state converge rapidly onto an attracting state trajectory that is highly stable, ensuring that, starting from any point in the cell cycle process, the system does not deviate from this trajectory. It is also shown, by comparison with random networks, that such a highly stable attractor is unlikely to arise by chance (Li et al., 2004). Additionally, the results were fairly insensitive to the magnitudes of the weights, justifying setting them all to ±1. Other similar studies have been carried out with the cell cycle of the fission yeast Schizosaccharomyces pombe (Davidich and Bornholdt, 2008b)
and the mammalian cell cycle (Fauré et al., 2006). More recently, a new, more accurate Boolean network model, which can incorporate time delays, has been proposed for the budding yeast cell cycle (Irons, 2009).
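As an illustration of the threshold rule in Eq. (13.2), the sketch below applies one synchronous update and tallies basin sizes by exhaustive enumeration. The 3-node wiring is a toy example invented here for illustration; the actual 11-node matrix of Li et al. (2004) is not reproduced:

```python
import numpy as np
from itertools import product

def threshold_step(x, A):
    """Eq. (13.2): on if the weighted input sum is positive, off if negative,
    hold the current value if the sum is exactly zero."""
    T = A @ x
    return np.where(T > 0, 1, np.where(T < 0, 0, x))

def basin_sizes(A):
    """Tally how many of the 2^n states flow into each attractor cycle."""
    n = A.shape[0]
    tally = {}
    for bits in product((0, 1), repeat=n):
        seen, x = [], bits
        while x not in seen:
            seen.append(x)
            x = tuple(threshold_step(np.array(x), A))
        cycle = frozenset(seen[seen.index(x):])
        tally[cycle] = tally.get(cycle, 0) + 1
    return tally

# Toy 3-node wiring (a_ij = weight of gene j acting on gene i), invented for
# illustration; run with the 11-node matrix of Li et al. (2004), this procedure
# recovers the G1 fixed point attracting roughly 86% of the 2048 states.
A = np.array([[0, 0, -1],
              [1, 0,  0],
              [0, 1,  0]])
for cycle, size in basin_sizes(A).items():
    print(sorted(cycle), size)
```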
3. Differential Equation Models

A model of a genetic network based on a system of differential equations expresses the rates of change of an element, such as a gene product, in terms of the levels of other elements of the network and possibly external inputs. In general, a nonlinear time-dependent differential equation has the form

\[
\dot{x} = f(x, u, t),
\tag{13.3}
\]

where x is a state vector denoting the values of the physical variables in the system, ẋ = dx/dt is the elementwise derivative of x, u is a vector of external inputs, and t is time. If the functional dependency specified by f does not depend explicitly on time, the system is said to be time-invariant. If f is linear and time-invariant, then it can be expressed as

\[
\dot{x} = Ax + Bu,
\tag{13.4}
\]
where A and B are constant matrices (Weaver et al., 1999). When ẋ = 0, the variables no longer change with time and thus define the steady state of the system, which is analogous to a fixed-point attractor in a Boolean network. Consider the simple case of a gene product x (a scalar) whose rate of synthesis is proportional, with kinetic constant k1, to the abundance of another protein a that is sufficiently abundant that its overall concentration is not significantly changed by the reaction. However, x is also subject to degradation, the rate of which is proportional, with constant k2, to the concentration of x itself. This situation can be expressed as

\[
\dot{x} = k_1 a - k_2 x, \qquad a, x > 0.
\tag{13.5}
\]

Let us analyze the behavior of this simple system. If initially x = 0, then the decay term is also 0 and ẋ = k1a. However, as x is produced, the decay term k2x will also increase, thereby decreasing the rate ẋ toward 0 and stabilizing x at some steady-state value x*. It is easy to determine this value, since setting ẋ = 0 and solving for x yields

\[
x^{*} = \frac{k_1 a}{k_2}.
\tag{13.6}
\]
Figure 13.2 The behavior of the solution to ẋ = k1a − k2x, x(0) = 0, where k1 = 2, k2 = 1, and a = 1. As can be seen, the gene product x, shown with a solid curve, tends toward its steady-state value given in Eq. (13.6). The time derivative ẋ, which starts at the initial value k1a and tends toward 0, is shown with a dashed curve.
This behavior is shown in Fig. 13.2, where x starts off at x = 0 and approaches the value in Eq. (13.6). The exact form of the kinetics is

\[
x(t) = \frac{k_1 a}{k_2}\left(1 - e^{-k_2 t}\right).
\tag{13.7}
\]

Similarly, the derivative ẋ, also shown in Fig. 13.2, starts off at the initial value of k1a and thereafter tends toward zero. Now suppose that a is suddenly removed after the steady-state value x* is reached. Since a = 0, we have ẋ = −k2x, and since the initial condition is x = k1a/k2, we have ẋ = −k1a initially. The solution of this equation is

\[
x(t) = \frac{k_1 a}{k_2}\, e^{-k_2 t},
\tag{13.8}
\]

and it can be seen that x will eventually approach zero.
This example describes a linear relationship between a and ẋ. However, most gene interactions are highly nonlinear. When the regulator is below some critical value, it has very little effect on the regulated gene; when it is above the critical value, it has virtually full effect, which cannot be significantly amplified by increased concentrations of the regulator. This nonlinear behavior is typically described by sigmoid functions, which can be either monotonically increasing or decreasing. A common form is the family of so-called Hill functions, given by

\[
F^{+}(x, \theta) = \frac{x^{n}}{\theta^{n} + x^{n}}, \qquad
F^{-}(x, \theta) = \frac{\theta^{n}}{\theta^{n} + x^{n}} = 1 - F^{+}(x, \theta).
\tag{13.9}
\]
The function F+(x, 1) is illustrated in Fig. 13.3 for n = 1, 2, 5, 10, 20, 50, and 100. It can be seen that it approaches an ideal step function with increasing n, thus approximating a Boolean switch. In fact, the parameter θ essentially plays the role of the threshold value. Glass (1975) used step functions in place of sigmoidal functions in differential equation models, resulting in so-called piecewise linear differential equations. Glass and Kauffman (1973) also showed that many systems exhibit the same qualitative behavior for a wide range of sigmoidal steepnesses, parameterized by n.

Figure 13.3 The function F+(x, θ) for θ = 1 and n = 1, 2, 5, 10, 20, 50, and 100. As n gets large, F+(x, θ) approaches an ideal step function and thus acts as a Boolean switch.
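A few lines of code make the switch-like behavior of Eq. (13.9) concrete; the sample points 0.9 and 1.1 on either side of the threshold θ = 1 are arbitrary choices:

```python
def hill_plus(x, theta=1.0, n=2):
    """The increasing Hill function F+ of Eq. (13.9); F- is just 1 - F+."""
    return x**n / (theta**n + x**n)

for n in (1, 2, 5, 10, 20, 50, 100):
    print(n, round(hill_plus(0.9, n=n), 4), round(hill_plus(1.1, n=n), 4))
# As n grows, values just below theta fall toward 0 and values just above
# rise toward 1: the sigmoid approaches a Boolean switch.
```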
Given that gene regulation is nonlinear, differential equation models can incorporate the Hill functions into their synthesis and decay terms. There are many available computer tools for simulating and analyzing such dynamical systems using a variety of methods and algorithms (Lambert, 1991), including DBsolve (Goryanin et al., 1999), GEPASI (Mendes, 1993), and Dizzy (Ramsey et al., 2005). Additionally, there are toolboxes available for MATLAB® that can be used for modeling, simulating, and analyzing biological systems with ordinary differential equations (Schmidt and Jirstrand, 2006). MathWorks' SimBiology® toolbox (http://www.mathworks.com/products/simbiology) also provides a graphical user interface for constructing models and entering reactions, parameters, and kinetic laws, which can be simulated deterministically or stochastically. A useful review of nonlinear ordinary differential equation modeling of the cell cycle is available in Sible and Tyson (2007).
3.1. Accurate description of cellular growth and division and prediction of mutant phenotypes

Let us return to the regulatory network controlling the cell cycle in budding yeast. If the goal of the modeling is to predict detailed quantitative phenomena, such as cell cycle duration in parent and daughter cells, the lengths of the different phases of the cell cycle, or ratios between certain regulatory proteins, then logical models such as Boolean networks are not appropriate, and systems of ordinary differential equations with detailed kinetic parameters must be used. Chen et al. (2004) constructed such a detailed model of the cell cycle regulatory network containing 36 equations with 148 constants, in addition to algebraic equations (available in Table 1 of that paper, with Table 2 containing parameter values). The model incorporates protein concentrations, cell mass, DNA mass, and the states of the emerging bud and of the mitotic spindle. After manual fitting of some of the parameters, the dynamics generated by the model were able to accurately describe the growth and division of wild-type cells. Remarkably, the model also conformed to the phenotypes of more than 100 mutant strains, in terms of experimentally observed properties such as size at bud emergence or at onset of DNA synthesis, viability, or growth rate, relative to these properties in the wild type. It should be pointed out that parameter estimation for such a model must be approached with care. First, the objective function, for example, the mean-squared error between the model predictions and the experimental data, may have multiple local optima in the parameter space. Thus, an apparently good model fit may nonetheless rest on an unrealistic set of parameters that will ultimately fail to generalize. For example, as was found in Chen et al. (2004), changing parameters to "rescue" a model with respect to a mutant (i.e., make it agree with experimental observations) often has unintended and unanticipated effects on other mutants. Second, model selection must be carefully considered, since a model that is overly complex, meaning that it has many degrees of freedom, is likely to "overfit" the data and thereby sacrifice predictive accuracy. In other words, the model may appear to predict very well when tested against data on which it was trained,
but when tested against data under new conditions, the model will predict very poorly. There are powerful tools, such as minimum description length, and indeed, entire frameworks based on algorithmic information theory and Bayesian inference, devoted to these fundamental issues (Rissanen, 2007).
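A common pattern that addresses the local-optima caveat is to repeat a least-squares fit from several random starting points and keep the best result. The sketch below does this for the two-parameter model of Eq. (13.5) on synthetic data; this toy problem is well behaved, but the multistart pattern carries over to larger models:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

# Synthetic "data" generated from Eq. (13.7) with measurement noise
t = np.linspace(0.0, 5.0, 50)
a, true_k1, true_k2 = 1.0, 2.0, 1.0
data = true_k1 * a / true_k2 * (1 - np.exp(-true_k2 * t)) \
       + rng.normal(0.0, 0.05, t.size)

def residuals(theta):
    k1, k2 = theta
    return k1 * a / k2 * (1 - np.exp(-k2 * t)) - data

# Multistart: fit from several random initial guesses, keep the lowest cost
fits = [least_squares(residuals, x0=rng.uniform(0.1, 5.0, 2)) for _ in range(10)]
best = min(fits, key=lambda r: r.cost)
print(best.x)   # should land near (2.0, 1.0)
```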
4. Probabilistic Boolean Networks

PBNs are probabilistic, or stochastic, generalizations of Boolean networks. Essentially, the deterministic dynamics are replaced by probabilistic dynamics, which can be framed within the mature and well-established theory of Markov chains, for which many analytical and numerical tools have been developed. Recall that Markov chains are stochastic processes having the property that future states depend only on the present state, and not on the past states. The transitions from one state to another (possibly itself) are specified by state transition probabilities. Boolean networks are special cases of PBNs in which the state transition probabilities are either 1 or 0, depending on whether Eq. (13.1) is satisfied for all i = 1, . . ., n. The probabilistic nature of this model class affords flexibility and power in terms of making inferences from data, which necessarily contain uncertainty, as well as in terms of understanding the dynamical behavior of biological networks, particularly in relation to their structure. Once the state transition probabilities for the Markov chain corresponding to a PBN are determined, it becomes possible to study the steady-state (long-run) behavior of the stochastic system. This long-run behavior is analogous to attractors in Boolean networks or fixed points in systems of differential equations. Kim et al. (2002) investigated the Markov chain corresponding to a small network based on microarray data observations of human melanoma samples. The steady-state behavior (distribution) of the constructed Markov chain was then compared to the initial observations. If the Markov chain is ergodic, meaning that it is possible to reach any state from any other state after an arbitrary number of steps, then the steady-state probability of a state corresponds to the fraction of time that the system will spend in that particular state in the long run. The remarkable finding was that only a small number of all possible states had significant steady-state probabilities, and most of those states with high probability were observed in the data. Furthermore, it was found that more than 85% of those states with high steady-state probability that were not observed in the data were very close to the observed data in terms of Hamming distance, which is equal to the number of genes that "disagree" in their binary values. Based on the transition rules inferred from the data, the model produced localized stability, meaning that the system tended to flow back to the states with high steady-state probability mass if placed in
their vicinity. Thus, the stochastic dynamics of the Markov chain were able to mimic biological regulation. It should be noted that Markov chains are commonly used to model gene expression dynamics using so-called dynamic Bayesian networks (Murphy and Mian, 1999; Yu et al., 2004; Zou and Conzen, 2005). Indeed, PBNs and dynamic Bayesian networks are able to represent the same joint probability distribution over their common variables (i.e., genes) (Lähdesmäki et al., 2006). Except in very restricted circumstances, gene expression data refute the determinism inherent to the Boolean network model, there typically being a number of possible successor states to any given state. Consequently, if one continues to assume that, given the state at time t, the state at time t + 1 is independent of earlier states, then, as stated above, the network dynamics are described by a Markov chain whose state transition matrix reflects the observed stochasticity. In terms of gene regulation, this stochasticity can be interpreted to mean that several regulator gene sets are associated with each gene, and at any time point one of these "predictor" sets, along with a corresponding Boolean function, is randomly chosen to provide the value of the gene as a function of the values within the chosen predictor set. It is this reasoning that motivated the original definition of a PBN, in which the definition of a Boolean network was adapted in such a way that, for each gene, at each time point, a Boolean function (and predictor gene set) is randomly chosen to determine the network transition (Shmulevich et al., 2002a,c). Rather than simply randomly assigning Boolean functions at each time point, one can take the perspective that the data come from distinct sources, each representing a "context" of the cell. From this perspective, the data derive from a family of deterministic networks and, in principle, the data could be separated into samples according to the contexts from which they have been derived. Given the context, the overall network would function as a Boolean network, its transition matrix reflecting determinism (i.e., each row contains a single 1, in the column that corresponds to the successor state, and the rest are 0s). If defined in this manner, a PBN is a collection of Boolean networks in which a constituent network governs gene activity for a random period of time before another randomly chosen constituent network takes over, possibly in response to some random event, such as an external stimulus or the action of a (latent) regulator that is outside the scope of the network. Since the latter is not part of the model, network switching is random. This model defines a "context-sensitive" PBN (Brun et al., 2005; Shmulevich et al., 2002c). The probabilistic nature of the constituent choice reflects the fact that the system is open, not closed, the idea being that changes between the constituent networks result from the genes responding to latent variables external to the model network. We now formally define PBNs. Although we retain the terminology "Boolean" in the definition, this does not refer to the binary quantization assumed in standard Boolean networks, but rather to the logical character of
the gene predictor functions. In the case of PBNs, quantization is assumed to be finite, but not necessarily binary; however, we restrict ourselves to the binary domain here for simplicity. Formally, a PBN consists of a sequence V = {x_i}, i = 1, . . ., n, of n nodes, where x_i ∈ {0, 1}, and a sequence {f_l}, l = 1, . . ., m, of vector-valued functions defining constituent networks. In the framework of gene regulation, each element x_i represents the expression value of a gene. Each vector-valued function f_l = (f_l^(1), f_l^(2), . . ., f_l^(n)) determines a constituent network, or context, of the PBN. The function f_l^(i) : {0, 1}^n → {0, 1} is the predictor of gene i whenever network l is selected. At each updating epoch, a decision is made whether to switch the constituent network. This decision depends on a binary random variable ξ: if ξ = 0, then the current context is maintained; if ξ = 1, then a constituent network is randomly selected from among all constituent networks according to the selection probability distribution {c_l}, l = 1, . . ., m,

\[
\sum_{l=1}^{m} c_l = 1.
\tag{13.10}
\]
The switching probability q = P(ξ = 1) is a system parameter. If the current network is maintained, then the PBN behaves like a fixed network and synchronously updates the values of all the genes according to the current context. Note that, even if ξ = 1, a different constituent network is not necessarily selected, because the "new" network is selected from among all contexts; in other words, the decision to switch is not equivalent to the decision to change the current network. If a switch is called for (ξ = 1), then, after selecting the predictor function f_l, the values of the genes are updated accordingly, that is, according to the network determined by f_l. If q < 1, the PBN is said to be context-sensitive; if q = 1, the PBN is said to be instantaneously random, which corresponds to the original definition in Shmulevich et al. (2002a). Whereas a network switch corresponds to a change in a latent variable causing a structural change in the functions governing the network, a random perturbation corresponds to a transient value change that leaves the network wiring unchanged, as in the case of activation or inactivation owing to external stimuli such as stress conditions, small-molecule inhibitors, etc. In a PBN with perturbation, there is a small probability p that a gene may change its value at each epoch. Perturbation is characterized by a random perturbation vector γ = (γ_1, γ_2, . . ., γ_n), γ_i ∈ {0, 1}, with P(γ_i = 1) = p, the perturbation probability; γ_i is also known as a Bernoulli(p) random variable. If x(t) is the current state of the network and γ(t + 1) = 0, then the next state of the network is given by x(t + 1) = f_l(x(t)), as in Eq. (13.1); otherwise, x(t + 1) = x(t) ⊕ γ(t + 1), where ⊕ is componentwise exclusive OR. The probability of no perturbation, in which
case the next state is determined according to the current network function f_l, is (1 − p)^n, and the probability of a perturbation is 1 − (1 − p)^n. The perturbation model captures the realistic situation in which the activity of a gene undergoes a random alteration (Shmulevich et al., 2002b). As with Boolean networks, attractors play a major role in the study of PBNs. By definition, the attractor cycles of a PBN consist of the attractor cycles of the constituent networks, and their basins are likewise defined. Whereas in a Boolean network two attractor cycles cannot intersect, attractor cycles from different contexts can intersect in a PBN. The presentation of the state transition probabilities of the Markov chain corresponding to the (context-sensitive) PBN is beyond the scope of this chapter, and the reader is referred to Brun et al. (2005). Suffice it to say that from the state transition matrix of the Markov chain, which is guaranteed to be ergodic under a gene perturbation model as described above, even for very small p, one can compute the steady-state distribution. A Markov chain is said to possess a steady-state distribution if there exists a probability distribution π = (π_1, π_2, . . ., π_M) such that for all states i, j ∈ {1, 2, . . ., M},

\[
\lim_{r \to \infty} P_{ij}^{r} = \pi_j,
\tag{13.11}
\]
where P_ij^r is the r-step transition probability between states i and j. If a steady-state distribution exists, then, regardless of the initial state, the probability of the Markov chain being in state i in the long run can be estimated by sampling the observed states in a simulation (by simply counting the percentage of time the chain spends in that state). Such an approach was used to analyze the joint steady-state probabilities of several key molecules (NF-κB, Tie-2, and TGFB3) in a 15-gene network derived from human glioma gene expression data (Shmulevich et al., 2003).
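The definition above translates directly into a simulation step. The following sketch (function names and the example contexts are illustrative, not from a published implementation) performs one updating epoch of a context-sensitive PBN with perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pbn_step(x, contexts, l, q, p, c):
    """One updating epoch of a context-sensitive PBN with perturbation.
    contexts: constituent-network update functions; l: current context index;
    q = P(xi = 1): switching probability; p: per-gene perturbation probability;
    c: selection distribution over contexts. Returns (new state, context index)."""
    if rng.random() < q:                          # xi = 1: reselect a context
        l = rng.choice(len(contexts), p=c)        # may re-pick the current one
    gamma = (rng.random(x.size) < p).astype(int)  # Bernoulli(p) perturbation vector
    if gamma.any():
        return x ^ gamma, l                       # x(t+1) = x(t) XOR gamma(t+1)
    return contexts[l](x), l                      # otherwise apply current network

# Two hypothetical 2-gene contexts, purely for illustration
f0 = lambda x: np.array([x[1], x[0]])
f1 = lambda x: np.array([1 - x[0], x[1]])
x, l = np.array([0, 1]), 0
for _ in range(10):
    x, l = pbn_step(x, [f0, f1], l, q=0.3, p=0.01, c=[0.5, 0.5])
print(x, l)
```

Long-run state frequencies collected from such a simulation estimate the steady-state distribution of Eq. (13.11).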
4.1. Steady-state analysis and stability under stochastic fluctuations

The Boolean network model of the cell cycle, discussed in Section 2.1, was generalized in Zhang et al. (2006) such that the network dynamics are described by a Markov chain with transition probabilities

\[
P\big(x_i(t+1) = 1 \mid \mathbf{x}(t)\big) = \frac{e^{2\beta T}}{e^{2\beta T} + 1},
\quad \text{if } T = \sum_{j=1}^{n} a_{ij} x_j(t) \neq 0,
\tag{13.12}
\]

and

\[
P\big(x_i(t+1) = x_i(t) \mid \mathbf{x}(t)\big) = \frac{1}{1 + e^{-\alpha}},
\quad \text{if } T = \sum_{j=1}^{n} a_{ij} x_j(t) = 0.
\tag{13.13}
\]
The term T is the weighted input sum appearing in Eq. (13.2). Note that this is essentially a way of introducing noise and thereby making the Markov chain ergodic, so that a steady-state distribution exists. The positive number β plays the role of a temperature parameter, characterizing the strength of the noise introduced into the system dynamics. The parameter α characterizes the stochasticity when the input to a node is zero and determines the probability for a protein to maintain its state when there is no input to it. It should be noted that as α, β → ∞, the stochastic model converges to the deterministic Boolean network model in Li et al. (2004). The state transition probabilities allow the computation of the steady-state distribution in Eq. (13.11). In addition, the so-called net probability flux π_i P_ij − π_j P_ji from state i to state j can be determined, where P_ij is the state transition probability. The steady-state probability of the stationary G1 phase of the cell cycle was studied relative to the noise level determined by β. It was found that this state is indeed the most probable state of the system and that its probability decreases with increasing noise strength, as expected, since random perturbations tend to move the system away from the attractor (Zhang et al., 2006). Interestingly, a type of phase transition was found whereby, at a critical value of the parameter β, the steady-state probability of the stationary G1 state virtually vanishes and the system becomes dominated by noise, unable to carry out coordinated behavior. Nonetheless, this critical temperature is quite high, and the system is able to tolerate approximately 10% of its rules misbehaving, implying that the cell cycle network is robust against stochastic fluctuations (Zhang et al., 2006). Additionally, the probability flux from states other than those on the cell cycle trajectory from the excited G1 state converges onto this trajectory, implying homeostatic stability.
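A sketch of this noisy update rule follows; the logistic form used for Eq. (13.12) is algebraically identical to e^{2βT}/(e^{2βT} + 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_threshold_step(x, A, alpha, beta):
    """One stochastic update per Eqs. (13.12)-(13.13); as alpha and beta grow,
    the rule converges to the deterministic update of Eq. (13.2)."""
    T = A @ x
    p_active = 1.0 / (1.0 + np.exp(-2.0 * beta * T))  # = e^{2bT} / (e^{2bT} + 1)
    p_keep = 1.0 / (1.0 + np.exp(-alpha))             # hold probability when T = 0
    p_on = np.where(T != 0, p_active,
                    np.where(x == 1, p_keep, 1.0 - p_keep))
    return (rng.random(x.size) < p_on).astype(int)
```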
5. Stochastic Differential Equation Models

The stochastic generalization of Boolean networks, leading to Markovian dynamics, is intended to capture uncertainty in the data, whether due to measurement noise or biological variability, intrinsic or extrinsic, the latter being caused by latent variables external to the model. On the other hand, if the intention of the modeling is to capture quantitative molecular or physical details, as in the systems of ordinary differential equations discussed in Section 3, then stochastic fluctuations on the molecular level can be incorporated explicitly into the model using stochastic differential equations. For example, as most regulatory molecules are produced at very low intracellular concentrations, the resulting reaction rates exhibit large variability. Such intrinsic molecular noise has been found to be important for many biological functions and processes (Ozbudak et al., 2002; Raser and O'Shea, 2005).
There exist powerful stochastic simulation methods for accurately simulating the dynamics of a system of chemically reacting molecules that reflect the discrete and stochastic nature of such systems on a cellular scale. A recent review of such methods is available in Cao and Samuels (2009). However, there are undoubtedly other intrinsic and extrinsic contributions to variability in gene and protein expression, for example, due to spatial heterogeneity or fluctuations in cellular components (Swain et al., 2002). Stochastic differential equations allow for a very general incorporation of stochasticity into a model without the need to assume specific knowledge about the nature of such stochasticity. Manninen et al. (2006) developed several approaches to incorporate stochasticity into deterministic differential equation models, obtaining so-called Itô stochastic differential equations, and applied them to modeling the neuronal protein kinase C signal transduction pathway. A comparative analysis showed that such approaches are preferable to stochastic simulation algorithm methods, as the latter are considerably slower, by several orders of magnitude, when simulating systems with a large number of chemical species (Manninen et al., 2006). The stochastic differential equation framework additionally allows the incorporation of stochasticity into the reaction rates, rate constants, and concentrations. The basic model can be written as a Langevin equation with multiplicative noise (Rao et al., 2002), so that for a single species x_i,

\[
\dot{x}_i = f_i(\mathbf{x}, u, t) + g(x_i)\,\xi_i(t),
\tag{13.14}
\]

where f_i(x, u, t) is the deterministic model and ξ_i(t) is zero-mean, unit-variance Gaussian white noise. The function g(x_i) represents the contribution of the fluctuations and is commonly assumed to be proportional to the square root of the concentration, that is, g(x_i) ∝ √x_i. The solution of such stochastic differential equations can be obtained by numerical integration using standard techniques.
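Equation (13.14) is typically integrated with the Euler-Maruyama scheme, the stochastic analogue of forward Euler. A minimal sketch, assuming g(x) = √x and clamping concentrations at zero, is:

```python
import numpy as np

rng = np.random.default_rng(0)

def euler_maruyama(f, x0, t_end, dt):
    """Integrate dx = f(x) dt + sqrt(x) dW; concentrations are clamped at 0."""
    x = [x0]
    for _ in range(int(t_end / dt)):
        dW = rng.normal(0.0, np.sqrt(dt))            # Wiener increment
        diffusion = np.sqrt(max(x[-1], 0.0)) * dW    # g(x) = sqrt(x)
        x.append(max(x[-1] + f(x[-1]) * dt + diffusion, 0.0))
    return np.array(x)

# The synthesis-degradation model of Eq. (13.5) with intrinsic noise
k1, k2, a = 2.0, 1.0, 1.0
path = euler_maruyama(lambda x: k1 * a - k2 * x, x0=0.0, t_end=5.0, dt=0.001)
print(path[-1])   # fluctuates around the deterministic steady state k1*a/k2 = 2
```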
5.1. The influence of noise on system behavior

Let us turn to the cell cycle control network of the fission yeast S. pombe, for which a system of ordinary differential equations was proposed (Novak et al., 2001), consisting of eight deterministic differential equations and three algebraic equations. We mention in passing that a Boolean network model for this network is available in Davidich and Bornholdt (2008b). The differential equation model in Novak et al. (2001) was found to be in good agreement with wild-type cells as well as with several mutants. Steuer (2004) converted this model to a system of stochastic differential equations and compared the simulations with experimental data. It was found that the cycle time and division size distributions within a cell population were predicted well by the model; for example, the model predicted a negative
correlation between cycle time and mass at birth, meaning that cells that are large at birth have shorter cycle times, which ensures homeostasis in successive generations (Steuer, 2004). The stochastic model also accounted for a characteristic ratio of the coefficients of variation for the cycle time and division length. The stochastic differential equation model was also applied to study a certain double mutant (wee1 cdc25Δ) that exhibits quantized cycle lengths. A deterministic model of the mutants can be obtained by removing the corresponding parameters from the system of differential equations. However, the simulation of the deterministic differential equation model of the double mutant results in periodically alternating long and short cycle times, which are determined exclusively by cell mass at birth, meaning that small cells have long cycles and produce large daughters, while large cells have short cycles and give rise to small daughters. The simulation of the stochastic differential equation model produces very different results: cell mass at birth no longer determines the length of the next cycle, and the (nonintuitive) characteristic clusters (i.e., "quantization") in a plot of cycle time versus mass at birth are in good agreement with experimental observations (Steuer, 2004). Additionally, in the stochastic model, the oscillation between long and short cycles disappears, which is consistent with experimental observations. Thus, the inclusion of stochastic fluctuations in the model was able to account for several features not accounted for by the deterministic model. The fact that noise is able to qualitatively alter macroscopic system behavior suggests that stochastic fluctuations play a key role in modulating cellular regulation. Stochastic differential equation models provide a powerful framework for gaining an understanding of these phenomena.
REFERENCES

Aldana, M., Coppersmith, S., and Kadanoff, L. P. (2002). Boolean dynamics with random couplings. In "Perspectives and Problems in Nonlinear Science" (E. Kaplan, J. E. Marsden, and K. R. Sreenivasan, eds.), pp. 23-89. Springer, New York.
Alvarez-Buylla, E. R., Chaos, A., Aldana, M., Benítez, M., Cortes-Poza, Y., Espinosa-Soto, C., Hartasánchez, D. A., Lotto, R. B., Malkin, D., Escalera Santos, G. J., and Padilla-Longoria, P. (2008). Floral morphogenesis: Stochastic explorations of a gene network epigenetic landscape. PLoS ONE 3(11), e3626.
Bornholdt, S. (2008). Boolean network models of cellular regulation: Prospects and limitations. J. R. Soc. Interface 5(Suppl. 1), S85-S94.
Braunewell, S., and Bornholdt, S. (2006). Superstability of the yeast cell-cycle dynamics: Ensuring causality in the presence of biochemical stochasticity. J. Theor. Biol. 245(4), 638-643.
Brun, M., Dougherty, E. R., and Shmulevich, I. (2005). Steady-state probabilities for attractors in probabilistic Boolean networks. Signal Process. 85(4), 1993-2013.
Cao, Y., and Samuels, D. C. (2009). Discrete stochastic simulation methods for chemically reacting systems. Methods Enzymol. 454, 115-140.
Chang, H. H., Hemberg, M., Barahona, M., Ingber, D. E., and Huang, S. (2008). Transcriptome-wide noise controls lineage choice in mammalian progenitor cells. Nature 453(7194), 544-547.
Chaves, M., Albert, R., and Sontag, E. D. (2005). Robustness and fragility of Boolean models for genetic regulatory networks. J. Theor. Biol. 235, 431-449.
Chen, K. C., Calzone, L., Csikasz-Nagy, A., Cross, F. R., Novak, B., and Tyson, J. J. (2004). Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell 15, 3841-3862.
Davidich, M., and Bornholdt, S. (2008a). The transition from differential equations to Boolean networks: A case study in simplifying a regulatory network model. J. Theor. Biol. 255(3), 269-277.
Davidich, M. I., and Bornholdt, S. (2008b). Boolean network model predicts cell cycle sequence of fission yeast. PLoS ONE 3(2), e1672.
Drossel, B. (2007). Random Boolean networks. In "Annual Review of Nonlinear Dynamics and Complexity, Vol. 1" (H. G. Schuster, ed.). Wiley.
Fauré, A., Naldi, A., Chaouiya, C., and Thieffry, D. (2006). Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 22(14), e124-e131.
Glass, L. (1975). Classification of biological networks by their qualitative dynamics. J. Theor. Biol. 54, 85-107.
Glass, L., and Kauffman, S. A. (1973). The logical analysis of continuous, nonlinear biochemical control networks. J. Theor. Biol. 39, 103-129.
Goryanin, I., Hodgman, T. C., and Selkov, E. (1999). Mathematical simulation and analysis of cellular metabolism and regulation. Bioinformatics 15(9), 749-758.
Huang, S. (1999). Gene expression profiling, genetic networks, and cellular states: An integrating concept for tumorigenesis and drug discovery. J. Mol. Med. 77(6), 469-480.
Huang, S., and Ingber, D. E. (2000). Shape-dependent control of cell growth, differentiation, and apoptosis: Switching between attractors in cell regulatory networks. Exp. Cell Res. 261(1), 91-103.
Huang, S., Eichler, G., Bar-Yam, Y., and Ingber, D. E. (2005). Cell fates as high-dimensional attractor states of a complex gene regulatory network. Phys. Rev. Lett. 94(12), 128701-128704.
Irons, D. J. (2009). Logical analysis of the budding yeast cell cycle. J. Theor. Biol. 257(4), 543-559.
Jacob, F., and Monod, J. (1961). On the regulation of gene activity. Cold Spring Harb. Symp. Quant. Biol. 26, 193-211.
Kauffman, S. A. (1969a). Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22, 437-467.
Kauffman, S. A. (1969b). Homeostasis and differentiation in random genetic control networks. Nature 224, 177-178.
Kauffman, S. A. (1974). The large scale structure and dynamics of genetic control circuits: An ensemble approach. J. Theor. Biol. 44, 167-190.
Kauffman, S. A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, New York.
Kauffman, S. (2004). A proposal for using the ensemble approach to understand genetic regulatory networks. J. Theor. Biol. 230(4), 581-590.
Kim, S., Li, H., Dougherty, E. R., Cao, N., Chen, Y., Bittner, M. L., and Suh, E. B. (2002). Can Markov chain models mimic biological regulation? J. Biol. Syst. 10(4), 431-445.
Lähdesmäki, H., Hautaniemi, S., Shmulevich, I., and Yli-Harja, O. (2006). Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks. Signal Process. 86(4), 814-834.
Lambert, J. D. (1991). Numerical Methods for Ordinary Differential Equations. Wiley, Chichester.
Li, F., Long, T., Lu, Y., Ouyang, Q., and Tang, C. (2004). The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA 101(14), 4781-4786.
MacLeod, M. C. (1996). A possible role in chemical carcinogenesis for epigenetic, heritable changes in gene expression. Mol. Carcinog. 15(4), 241-250.
Manninen, T., Linne, M. L., and Ruohonen, K. (2006). Developing Itô stochastic differential equation models for neuronal signal transduction pathways. Comput. Biol. Chem. 30(4), 280-291.
Mendes, P. (1993). GEPASI: A software package for modelling the dynamics, steady states and control of biochemical and other systems. Comput. Appl. Biosci. 9(5), 563-571.
Muroga, S. (1971). Threshold Logic and its Applications. Wiley-Interscience.
Murphy, K., and Mian, S. (1999). Modelling Gene Expression Data using Dynamic Bayesian Networks. Technical Report, University of California, Berkeley.
Novak, B., Pataki, Z., Ciliberto, A., and Tyson, J. J. (2001). Mathematical model of the cell division cycle of fission yeast. Chaos 11(1), 277-286.
Ozbudak, E. M., Thattai, M., Kurtser, I., Grossman, A. D., and van Oudenaarden, A. (2002). Regulation of noise in the expression of a single gene. Nat. Genet. 31(1), 69-73.
Ramsey, S., Orrell, D., and Bolouri, H. (2005). Dizzy: Stochastic simulations of large-scale genetic regulatory networks. J. Bioinform. Comput. Biol. 3(2), 1-21.
Rao, C. V., Wolf, D. M., and Arkin, A. P. (2002). Control, exploitation and tolerance of intracellular noise. Nature 420(6912), 231-237.
Raser, J. M., and O'Shea, E. K. (2005). Noise in gene expression: Origins, consequences, and control. Science 309(5743), 2010-2013.
Richmond, C. S., Glasner, J. D., Mau, R., Jin, H., and Blattner, F. R. (1999). Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 27, 3821-3835.
Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer.
Schmidt, H., and Jirstrand, M. (2006). Systems Biology Toolbox for MATLAB: A computational platform for research in systems biology. Bioinformatics 22(4), 514-515.
Shmulevich, I., and Zhang, W. (2002). Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18(4), 555-565.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002a). Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261-274.
Shmulevich, I., Dougherty, E. R., and Zhang, W. (2002b). Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 18(10), 1319-1331.
Shmulevich, I., Dougherty, E. R., and Zhang, W. (2002c). From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proc. IEEE 90(11), 1778-1792.
Shmulevich, I., Gluhovsky, I., Hashimoto, R., Dougherty, E. R., and Zhang, W. (2003). Steady-state analysis of probabilistic Boolean networks. Comp. Funct. Genom. 4(6), 601-608.
Sible, J. C., and Tyson, J. J. (2007). Mathematical modeling as a tool for investigating cell cycle control networks. Methods 41(2), 238-247.
SimBiology 3.0 Toolbox. http://www.mathworks.com/products/simbiology.
Steuer, R. (2004). Effects of stochasticity in models of the cell cycle: From quantized cycle times to noise-induced oscillations. J. Theor. Biol. 228(3), 293-301.
Swain, P. S., Elowitz, M. B., and Siggia, E. D. (2002). Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 99(20), 12795-12800.
Thomas, R. (1973). Boolean formalization of genetic control circuits. J. Theor. Biol. 42, 563-585.
Weaver, D. C., Workman, C. T., and Stormo, G. D. (1999). Modeling regulatory networks with weight matrices. Pac. Symp. Biocomput. 4, 112-123.
Wolf, D. M., and Eeckman, F. H. (1998). On the relationship between genomic regulatory element organization and gene regulatory dynamics. J. Theor. Biol. 195(2), 167-186.
Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., and Jarvis, E. D. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18), 3594-3603.
Zhang, Y., Qian, M., Ouyang, Q., Deng, M., Li, F., and Tang, C. (2006). Stochastic model of yeast cell-cycle network. Physica D 219(1), 35-39.
Zou, M., and Conzen, S. D. (2005). A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71-79.
CHAPTER FOURTEEN

Bayesian Probability Approach to ADHD Appraisal

Raina Robeva* and Jennifer Kim Penberthy†

* Department of Mathematical Sciences, Sweet Briar College, Sweet Briar, Virginia, USA
† Department of Psychiatry and Neurobehavioral Sciences, University of Virginia Health System, Charlottesville, Virginia, USA

Contents

1. Introduction
   1.1. Prevalence
   1.2. Etiology of ADHD
   1.3. Summary of problem
   1.4. Comprehensive psychophysiological assessment
2. Bayesian Probability Algorithm
   2.1. Methods
   2.2. Results
3. The Value of Bayesian Probability Approach as a Meta-Analysis Tool
   3.1. Methods
   3.2. Results
4. Discussion and Future Directions
Acknowledgment
References
Abstract

Accurate diagnosis of attentional disorders such as attention-deficit hyperactivity disorder (ADHD) is imperative because there are multiple negative psychosocial sequelae related to undiagnosed and untreated ADHD. Early and accurate detection can lead to effective intervention and prevention of negative sequelae. Unfortunately, diagnosing ADHD presents a challenge to traditional assessment paradigms because there is no single test that definitively establishes its presence. Even though ADHD is a physiologically based disorder with a multifactorial etiology, the diagnosis has traditionally been based on a subjective history of symptoms. In this chapter we outline a stochastic method that utilizes Bayesian inference for quantifying and assessing ADHD. It can be used to combine a variety of psychometric tests and physiological markers into a single
standardized instrument that, on each step, refines the probability of ADHD for each individual based on the information provided by the individual assessments. The method is illustrated with data from a small study of six female college students with ADHD and six matched controls, in which the method achieves correct classification for all participants, while none of the individual assessments alone was capable of achieving perfect classification. Further, we provide a framework for applying this Bayesian method to perform meta-analysis of data obtained from disparate studies using disparate tests for ADHD, based on calibration of the data into a unified probability scale. We use this method to combine data from five studies that examine the diagnostic abilities of different behavioral rating scales and EEG assessments of ADHD, enrolling a total of 56 ADHD and 55 control subjects of different age groups and gender.
1. Introduction

Like most psychiatric disorders, the diagnosis of attention-deficit hyperactivity disorder (ADHD) relies on subjective criteria. Unlike a neurological condition such as stroke, in which examination and neuroimaging provide clear, objective criteria for diagnosis, ADHD lacks the "hard evidence" that aids in evaluation and treatment. The difficulty in clinical diagnosis is reflected in the frequent shifts in the diagnostic criteria for ADHD. For example, various versions of the Diagnostic and Statistical Manual of Mental Disorders (DSM), which is used by clinicians to diagnose ADHD, have all presented different conceptualizations of the disorder. Current DSM-IV diagnostic criteria for ADHD include a persistent pattern of inattention and/or hyperactivity-impulsivity that is more frequent and severe than is typically observed in individuals at a comparable level of development. Evidence of six of nine inattentive behaviors and/or six of nine hyperactive-impulsive behaviors must have been present before age 7, and must clearly interfere with social, academic, and/or occupational functioning. The most current criteria, from the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV; American Psychiatric Association, 1994), distinguish three subtypes of ADHD. One subtype is termed ADHD, predominantly inattentive type, and is often referred to in the literature as ADD, or attention-deficit disorder, signifying that a majority of the hyperactive or impulsive symptom criteria are absent. Another subtype is termed ADHD, predominantly hyperactive-impulsive type, which specifies persons who demonstrate a majority of symptoms of hyperactivity and impulsivity, but not inattention. A third subtype is ADHD, combined type, the name used when a person meets diagnostic criteria for both ADHD, inattentive type, and ADHD, hyperactive-impulsive type. In other
words, someone diagnosed with ADHD, combined type, displays a majority of symptoms of both inattention and hyperactivity-impulsivity. The diagnosis of any of the three forms of ADHD must still be made exclusively by history, for no laboratory or psychological test or battery is available that provides sufficient sensitivity and specificity. Consequently, the diagnosis of ADHD is highly dependent on a retrospective report of a patient's past behavior and subjective judgments about the degree of relative impairment. Due to the subjective nature of assessment, precision in diagnosis has been elusive. Further complicating the matter, there are a large number and variety of procedures that purportedly assess ADHD. Among these are clinical interviews, rating scales, psychological and neuropsychological tests, observational assessment techniques, and medical procedures such as MRI and EEG, each with their own variations, and many rating scales with parent, teacher, and self-report versions. However, no one procedure maps perfectly onto all of the DSM-IV criteria for the diagnosis of ADHD. To arrive at an ADHD diagnosis, a combination of assessment procedures (a multimethod approach) is necessary (Anastopoulous and Shelton, 2001; DuPaul et al., 1992).
1.1. Prevalence

It is difficult to tell whether the prevalence of ADHD per se has risen, but it is clear that the number of children identified with the disorder who obtain treatment has risen over the past decade. The US Centers for Disease Control estimates that approximately 4.6 million (8.4%) American children aged 6-17 years have at some point in their lives received a diagnosis of ADHD. Of these children, nearly 59% are reported to be taking a prescription medication (Pastor and Reuben, 2008). Rates of stimulant use have been growing fast in both the USA and Europe (Habel et al., 2005; Safer et al., 1996; Zito et al., 2000). Indeed, in the last 10 years, Germany has seen a 47-fold increase (Schwabe and Paffrath, 2006). But per capita stimulant consumption remains greater in the USA than in all of Europe. This increased identification and treatment seeking is due in part to greater media interest, heightened consumer awareness, and the availability of effective treatments. Within the USA, ADHD prevalence rates vary substantially across and within states. Reasons for variation in prevalence rates include changing diagnostic criteria over time, the frequent use of referred samples to estimate rates, variations in ascertainment in different settings, and the lack of a comprehensive, reliable, and cost-effective diagnostic assessment. Practitioners of all types vary greatly in the degree to which they use DSM-IV criteria to diagnose ADHD. Practice surveys among primary care pediatricians and family physicians reveal wide variations in patterns of using diagnostic criteria and methods. Statistics suggest that only one out of every three people who have an attention disorder gets help; therefore, two out of three people who have an attention disorder never receive a diagnosis
or treatment (Monastra et al., 1999). Part of the dilemma is that the diagnosis of ADHD must still be made exclusively by history. Therefore, a significant problem is the lack of a systematic, reliable, comprehensive, and affordable assessment for ADHD (American Academy of Pediatrics, 2000). This problem is made more urgent by the fact that early recognition and management of this condition can redirect the educational and psychosocial development of most children with ADHD, thereby having a significant impact on the well-being of a child accurately diagnosed with ADHD (Hinshaw, 1994; Klein and Mannuzza, 1991; Reiff et al., 1993; Weiss et al., 1985). According to the NIH Consensus Statement (National Institutes of Health Consensus Development Conference Statement, 2000), the diagnosis of ADHD can be made reliably using well-tested diagnostic interview methods. However, as of yet, there is no independent valid test for ADHD. Although research has suggested a central nervous system basis for ADHD, further research is necessary to firmly establish ADHD as a brain disorder. The Consensus Conference concluded that, after years of clinical research and experience with ADHD, knowledge about the cause or causes of ADHD remains largely speculative. Despite the plethora of research emerging in the last few years, there remains no biologically based method of diagnosis for ADHD. Indeed, the literature calls for studies that will illuminate the etiology of ADHD (Castellanos, 1997).
1.2. Etiology of ADHD

In spite of these well-documented problems, the etiology of ADHD remains methodologically difficult to study and has yielded inconsistent results (Barkley, 1990). One possibility is that, because of changing and inconsistent use of diagnostic classifications, very few researchers have screened for or tested diagnostically identical ADHD samples. Most investigators accept that ADHD exists as a distinct clinical syndrome and suggest a multifactorial etiology that includes neurobiology as an important factor. Zametkin and Rapoport (1987) identified 11 separate neuroanatomical hypotheses that have been proposed for the etiology of ADHD. A majority of studies have concluded that either delayed maturation or defects in cortical activation play large roles in the pathophysiology of ADHD. For example, studies of cerebral blood flow measured by single-photon emission computed tomography have demonstrated decreased metabolic activity in suspected attentional areas of the brain (Heilman et al., 1991) and indicated lower arousal in the mesial frontal areas (Lou et al., 1989). Recent anatomical studies have reported reduced bilateral regional brain volumes in specific and multiple subareas of the frontal cortex, which govern premotor and higher-level cognitive function (Mostofsky et al., 2002; Sowell et al., 2003). In addition, there appears to be reduced volume in the anterior temporal cortices, accompanied by bilateral increases in gray matter in the posterior temporal and inferior
parietal cortices (Sowell et al., 2003). These studies highlight the heterogeneity of the disorder and the neuropsychological constructs used to define the weaknesses associated with it (i.e., executive function, working memory). These, as well as additional neurophysiological findings, have been interpreted as evidence of delayed maturation and cortical hypoarousal in regions of the prefrontal and frontal cortex. Unfortunately, while neuroanatomical findings support the notion that ADHD is a distinct clinical syndrome and add to our understanding of its etiology, neuroimaging techniques are too expensive for general use, are restricted to a few centers, and lack clear specificity and sensitivity in the diagnosis of ADHD. One technique suggested by a National Institute of Mental Health committee as a possible method to identify functional measures of child and adolescent psychopathology (Jensen et al., 1993) is quantitative EEG. Compared to methods of functional neuroimaging (such as positron emission tomography or single-photon emission computed tomography), quantitative EEG is easier to perform, less expensive, does not involve radioactive tracers, and is noninvasive (Kuperman et al., 1990).
1.3. Summary of problem

Diagnosing ADHD presents a challenge to traditional assessment paradigms because there is no single assessment tool or medical test that definitively establishes its presence. Because ADHD is considered to be a physiologically based disorder with a multifactorial etiology that includes neurobiology as an important factor, the recommended diagnostic procedure for ADHD relies on a multimethod assessment (American Academy of Pediatrics, 2000; Anastopoulous and Shelton, 2001; DuPaul et al., 1992; National Institutes of Health Consensus Development Conference Statement, 2000). Ideally, this assessment should consist of the following individual components: (a) behavior rating scales, (b) behavioral observations, (c) parent and teacher interviews, (d) neuropsychological assessment, (e) academic screening, and (f) EEG/brain imaging measures (Anastopoulous and Shelton, 2001; Barkley, 2002). The integrated results presumably converge to provide a composite judgment, considered to be a "best estimate" of the diagnosis and to be more accurate than any single source or assessment alone. However, the results from multiple measures and methods often do not clearly converge on a diagnosis, but rather provide contradictory information (Anastopoulous and Shelton, 2001; Barkley, 2002). Thus, most often, the "best estimate" diagnosis is the subjective opinion of a clinician whose clinical judgment is influenced and limited by factors such as experience, resources, and prejudices. A reliable and comprehensive assessment that is founded on and determined by objective and standardized methods would provide a welcome advantage in diagnosing ADHD, as well as in assessing the
effectiveness of treatments for ADHD. What is needed is not only a comprehensive, multimethod assessment approach, but also a strategy for reliably and consistently combining the results from these assessments in a way that provides a standardized computation of an accurate probability for diagnosis. An ideal strategy for achieving such results is a Bayesian probability approach to combining disparate assessments. As we will discuss in the following sections, this Bayesian approach can be utilized not only to combine disparate results related to diagnosing the same disorder, but also to combine different but related studies examining the same question, in a meta-analysis format.
1.4. Comprehensive psychophysiological assessment

We propose and utilize a multimethod procedure that assesses symptoms of ADHD in various domains. Specifically, we employ: (a) standardized psychological questionnaires and ratings from the subjects, their caregivers, and teachers, to determine reported difficulty with cognitive transitions and dysregulation of behavior and attention in the form of ADHD symptoms; (b) prospective behavioral data collected using standardized tests to assess actual impairment on continuous performance tasks and additional tasks associated with poor self-regulation of attention and behavior, as well as standardized ratings from the subjects, parents, and blind raters of the subjects’ behavior and performance; (c) physiological assessments in the form of EEGs to assess inconsistency of cognitive transition across multiple time dimensions; and (d) a comprehensive, yet flexible, assessment model combining data from multiple sources to address the complete DSM-IV criteria for ADHD, which is also reactive to treatment effects. This assessment procedure is designed to incorporate various tests and markers for ADHD, none of which alone could claim perfect sensitivity and specificity in diagnosing ADHD. In addition, an important feature of this sequential assessment is that it is test-order-invariant, and can accommodate missing data. The formal framework of the combined assessment employs a Bayesian algorithm that allows for linking of disparate ADHD assessment instruments, within a single study or across multiple studies, into one unified and objective stochastic assessment.
2. Bayesian Probability Algorithm

2.1. Methods

We begin with a general description of the Bayesian algorithm, followed by the method for standardizing the scores of different tests and assessments. We then provide a brief description of the studies and data that we use to illustrate these methods.
2.1.1. Standardizing the scores for different tests

The algorithm is based on the idea that on every assessment subjects earn a certain test score, where the magnitude of the score depends on whether the subject has ADHD, as well as on the severity of the disorder. Therefore, a subject with a certain condition (ADHD) is expected to yield a higher score (or a lower score, depending on the direction of the test), compared to a subject without that condition. However, the relationship between the condition and the test score is not always exact—it may happen that a subject without ADHD receives a score indicating ADHD or vice versa. Thus, this relationship is probabilistic and is best quantified as a conditional probability of earning a certain score, given a preexisting condition, which is a value between 0 and 1. The exact conversion of the test scores to probabilities for ADHD depends on the assessment’s range of scores, the direction of the test’s scale (i.e., whether lower or higher scores are associated with ADHD), and the cutoff values that separate the scores indicating ADHD from those indicating non-ADHD. The probability of earning a certain score on a test depends on the subjects’ condition, ADHD or non-ADHD, and the score’s calibration translates into conditional probabilities for earning a given test score x, given a condition of ADHD or non-ADHD. Assume that a test for ADHD can generate a range of values from 0 to M, with scores greater than a certain value C, 0 < C < M, indicating ADHD. In this case the mapping of a test score x on this test, 0 < x < M, to a probability for earning this score in case of ADHD, could be computed as

    P(x) = P(x|ADHD) = (x/M)^a    (14.1)

where the value of the exponent a is obtained from the condition that the cutoff value C is mapped to a probability of 0.5. That is, a is determined from the condition P(C) = (C/M)^a = 0.5, leading to a = ln(0.5)/ln(C/M). For tests where scores lower than the cutoff value C indicate ADHD, the standardized probability is computed as

    P(x) = P(x|ADHD) = 1 − (x/M)^a,  with a = ln(0.5)/ln(C/M) as before,    (14.2)
mapping a score of 0 to a probability for ADHD equal to 1 and the maximal score M to a probability for ADHD equal to 0. Figure 14.1 depicts these two cases for different values of M and C. An alternative approach for standardizing test scores can be found in Robeva et al. (2004), where calibration of the probabilities is done by piecewise linear functions (with cutoff values being mapped to a probability of 0.5 as well). A brief discussion of the similarities and differences between these standardizations in the context of a more general approach can be found in Section 4.
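To make the calibration concrete, the mapping of Eqs. (14.1) and (14.2) fits in a few lines of code. The sketch below is our illustration rather than part of the original method; the function name and argument layout are our own, but the arithmetic is exactly that of the two equations, with the cutoff C mapping to a probability of 0.5.

    from math import log

    def standardize(x, M, C, higher_indicates_adhd=True):
        """Map a raw score x (0 <= x <= M, cutoff C) to P(x | ADHD).

        Implements Eq. (14.1) when scores above C indicate ADHD, and
        Eq. (14.2) when scores below C indicate ADHD.
        """
        a = log(0.5) / log(C / M)          # exponent fixed by P(C) = 0.5
        p = (x / M) ** a
        return p if higher_indicates_adhd else 1.0 - p

For the calibration of Fig. 14.1A (M = 36, C = 12), standardize(12, 36, 12) returns 0.5 and the exponent is a = ln(0.5)/ln(12/36) ≈ 0.63093; for a WURS-style scale (M = 100, C = 30) the exponent is ln(0.5)/ln(0.3) ≈ 0.57572, matching Eq. (14.5) below.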
Figure 14.1 Examples of test score standardization according to Eqs. (14.1) and (14.2); P(x) is plotted against the test score. (A) Calibration for a test with M = 36, C = 12, and scores >12 indicating ADHD. In this case, a = ln(0.5)/ln(C/M) = 0.63093 and the function depicted in the figure is P(x) = (x/36)^0.63093. (B) Calibration for a test with M = 100, C = 40, and scores <40 on this scale indicating ADHD. In this case, a = ln(0.5)/ln(C/M) = 0.75647 and the function depicted in the figure is P(x) = 1 − (x/100)^0.75647.

The WURS test is a 61-item retrospective self-report scale with adequate reliability and validity (Ward et al., 1993). Individuals rate the severity of ADHD symptoms experienced during childhood using a 5-point Likert scale. The score from the WURS (short form) ranges from 0 to 100, with scores >30 indicating ADHD. For adults, WURS has been shown to be a valid retrospective screening and dimensional measure of childhood ADHD symptoms (Stein et al., 1999, 2000), to replicate and correlate with the Conners Abbreviated Parent and Teacher Questionnaire and demonstrate internal consistency reliability (Fossati et al., 2001), and to exhibit good construct validity (Weyandt et al., 1995). The EEG-CI is an EEG-based measure of ADHD (Cox et al., 1998; Kovatchev et al., 2001; Merkel et al., 2000). The CI ranges from 0% to 100%; a CI below 40% is indicative of ADHD. Following Eqs. (14.1) and (14.2), the standardization of the WURS and EEG-CI scores into conditional probabilities is done using the following functions: WURS scale (M = 100, 0 ≤ x ≤ 100, C = 30, scores >30 indicate ADHD):
    P(x) = P(x|ADHD) = (x/100)^0.57572    (14.5)

EEG CI scale (M = 100, 0 ≤ x ≤ 100, C = 40, scores <40 indicate ADHD):

    P(x) = P(x|ADHD) = 1 − (x/100)^0.75647    (14.6)

A probability >0.5 classifies the subject as ADHD, while a probability ≤0.5 classifies the subject as non-ADHD. The ADHD-SI score ranges from 0 to 36, with scores >12 indicating ADHD (Cox et al., 1998). The AD/HD Rating Scale-IV is similar to the ADHD-SI, both scales being developed independently and concurrently at different laboratories. This rating scale has demonstrated adequate reliability and validity (DuPaul et al., 1998). The scale items reflect the DSM-IV criteria and respondents are asked
to indicate the frequency of each symptom on a 4-point Likert scale. The Home and School Versions of the scale both consist of two subscales: Inattention (nine items) and Hyperactivity–Impulsivity (nine items). The manual provides information regarding the factor analysis procedures used to develop the scales, as well as information regarding the standardization, normative data, reliability, validity, and clinical interpretation of the scales. The score ranges from 0 to 100, with scores >93 indicating ADHD (DuPaul et al., 1998). Following Eqs. (14.1) and (14.2), the standardization of ADHD-SI and AD/HD RS-IV scores into conditional probabilities is done using the following functions: ADHD-SI scale (M = 36, 0 ≤ x ≤ 36, C = 12, scores >12 indicate ADHD, Fig. 14.1A):

    P(x) = P(x|ADHD) = (x/36)^0.63093    (14.7)

AD/HD Rating Scale-IV (M = 100, 0 ≤ x ≤ 100, C = 93, scores >93 indicate ADHD):

    P(x) = P(x|ADHD) = (x/100)^9.55134    (14.8)

After the data from these five studies are standardized using Eqs. (14.5)–(14.8), we can use the Bayesian algorithm described above to perform meta-analysis of the data. In this case, the algorithm is employed just as in the case of a single study, except that this time we can use scores from all measures for ADHD used in the studies. In our illustration, we will use the ADHD-SI, AD/HD RS-IV, WURS, and the EEG CI. As before, when transitioning between steps i − 1 and i, for i = 1, 2, ..., k, where k is the total number of tests used for the analysis, the probability P^i_ADHD, representing the probability for ADHD based on the first i tests, is computed from Eq. (14.4). Since, however, not all tests have been administered to all subjects in the combined dataset, the following modification is needed to allow for inclusion of studies for which scores from test i are not recorded:

    P^i_ADHD = (P_1,i · P^(i−1)_ADHD) / (P_1,i · P^(i−1)_ADHD + P_2,i · (1 − P^(i−1)_ADHD))   if a score for test i is recorded
    P^i_ADHD = P^(i−1)_ADHD                                                                   if a score for test i is not recorded    (14.9)

3.1.3. Statistical analyses

T-tests were used to compare the probabilities for ADHD estimated by the combined tests, across the ADHD versus non-ADHD groups. Three-way ANOVA was used to elucidate the effect of age and gender on ADHD/non-ADHD classification. Two-way ANOVA was used to elucidate the effect of study on ADHD/non-ADHD classification.
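A minimal sketch of the sequential update of Eq. (14.9) may help to fix ideas. The code below is our illustration, not the authors’ implementation: we take P_1,i to be the standardized probability P(x|ADHD) from Eqs. (14.5)–(14.8) and, purely for this sketch, assume that P_2,i = 1 − P_1,i plays the role of P(x|non-ADHD); the starting value of 0.5 corresponds to the uninformative probability reported for subjects with no recorded scores.

    def sequential_probability(standardized_scores, prior=0.5):
        """Apply Eq. (14.9) across a sequence of tests.

        standardized_scores : list of P(x | ADHD) values, one per test,
                              with None for tests that were not administered.
        """
        p = prior
        for p1 in standardized_scores:
            if p1 is None:
                continue                 # probability unchanged (Eq. 14.9)
            p2 = 1.0 - p1                # assumed P(x | non-ADHD) for this sketch
            p = (p1 * p) / (p1 * p + p2 * (1.0 - p))
        return p

    # A girl >16 with no ADHD-SI or RS-IV scores stays at 0.5 until the
    # WURS and CI results arrive:
    print(sequential_probability([None, None, 0.7, 0.8]))   # about 0.90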
3.2. Results

To further illustrate the Bayesian algorithm described in Section 2.1.2, Table 14.3 presents the sequential probabilities for ADHD based on the sequence of tests ADHD-SI, AD/HD RS-IV, WURS, and the EEG CI. Notice that since the results of all different tests are standardized, it is possible to combine ADHD/control scores across different studies and groups. Table 14.3 shows the process of increasing separation of ADHD and control subjects along the steps of the Bayesian algorithm. With more tests, the difference in the mean classification probabilities for ADHD and controls increases within each gender and age group, achieving better separation at the end of the procedure (Table 14.3, column P^4_ADHD). Notice also that, according to Eq. (14.9), the probability for ADHD remains unchanged when a certain test is not available for a particular group. For the group of girls who are >16 years of age, the probability for ADHD remains at 0.5 until WURS is taken into consideration, since the ADHD-SI and AD/HD RS-IV were not administered for this age group and no test scores were available. Table 14.4 exemplifies how the effects of age and gender, as well as the effect of the different studies combined for the meta-analysis, on the classification into ADHD/non-ADHD groups weaken with the incorporation of more tests. The interactions of age/gender or study with the ADHD/non-ADHD classification diminish progressively, from being highly significant at the first test to becoming negligible after the fourth test. This indicates that the accumulation of data from multiple tests gradually eliminates the effects of confounding factors such as gender and age, and any between-study differences. This justifies the validity of meta-analysis using scores from multiple tests.
4. Discussion and Future Directions

Accurately and reliably diagnosing ADHD presents challenges because there is currently no single assessment tool or medical test that can definitively diagnose it. What do exist, however, are numerous assessments and tests of varying design. Some are checklists of symptoms that caregivers such as parents and teachers complete, some are behavioral or neuropsychological tests that the child completes in an office, some are written evaluations of behavior or symptoms based on the clinician’s observations, and some are based on physiological data such as MRI or EEG readings. Each assessment has its own scoring system, criteria, and format for administration, and unfortunately, none of these individual assessment tools has been shown to be 100% accurate in diagnosing ADHD. This is to be expected, however, since ADHD is considered to be a physiologically based disorder with a
Table 14.3 Mean probabilities for ADHD by age and gender groups for the sequential Bayesian assessment

Group      |            | Step 1: SI          | Step 2: AD/HD RS-IV | Step 3: WURS        | Step 4: CI
           |            | P^1_ADHD            | P^2_ADHD            | P^3_ADHD            | P^4_ADHD
Boys ≤16   | ADHD       | 0.5860              | 0.6441              | 0.6441              | 0.7199
           | Control    | 0.3824              | 0.0615              | 0.0615              | 0.0961
           | t-value, p | t = 4.87, p = 0.000 | t = 7.90, p = 0.000 | t = 7.90, p = 0.000 | t = 9.62, p = 0.000
Boys >16   | ADHD       | 0.7188              | 0.7188              | 0.7188              | 0.9357
           | Control    | 0.1924              | 0.1924              | 0.1924              | 0.2310
           | t-value, p | t = 5.12, p = 0.002 | t = 5.12, p = 0.002 | t = 5.12, p = 0.002 | t = 4.55, p = 0.003
Girls ≤16  | ADHD       | 0.7707              | 0.7707              | 0.7707              | 0.8704
           | Control    | 0.3208              | 0.3208              | 0.3208              | 0.3676
           | t-value, p | t = 7.90, p = 0.000 | t = 7.90, p = 0.000 | t = 7.90, p = 0.000 | t = 7.10, p = 0.000
Girls >16  | ADHD       | 0.500               | 0.500               | 0.5883              | 0.8022
           | Control    | 0.500               | 0.500               | 0.2605              | 0.1002
           | t-value, p | –                   | –                   | t = 4.38, p = 0.002 | t = 10.09, p = 0.000
Table 14.4 Significance of age–gender and study effects on ADHD/control classifications

                                       | Three-way interactions: | Two-way interactions:
                                       | ADHD–age–gender         | ADHD–study
Sequential tests                       | F       | p             | F       | p
ADHD-SI                                | 22.879  | 0.000         | 54.595  | 0.000
ADHD-SI + AD/HD RS-IV                  | 4.281   | 0.041         | 6.322   | 0.000
ADHD-SI + AD/HD RS-IV + WURS           | 0.115   | 0.735         | 1.059   | 0.381
ADHD-SI + AD/HD RS-IV + WURS + EEG-CI  | 0.221   | 0.639         | 0.730   | 0.574
multifactorial etiology that includes neurobiology as an important factor, and would not be easily classified by only one assessment tool. In fact, the reliability of the ADHD diagnosis based on one method or test alone is quite low, and lower still when chance agreement is considered. For example, previous research has found 78% agreement between a structured interview and a diagnosis of ADHD (Welner et al., 1987) and 70–80% accuracy (with considerable variation depending on age range) of laboratory measures of attention in correctly predicting an ADHD diagnosis (Fischer et al., 1995). Thus, as reported by Angold et al. (1999), this results in problems with both overdiagnosis and underdiagnosis of ADHD. What is needed is a methodology for combining disparate assessments and tests in order not only to provide a more accurate diagnosis of the individual but also to enable the combination of multiple studies of ADHD assessments, thus increasing the sample size and providing more power, generalizability, and possibilities for cross-sectional comparisons. Other researchers are now utilizing the Bayesian approach to diagnose ADHD and producing interesting preliminary results. For example, Foreman et al. (2009) recently reported using a Bayesian approach to model the various parameters of the Development and Well-Being Assessment (DAWBA) in order to justify its utilization in primary care settings in the UK. They determined that using the DAWBA in primary care settings may improve access to accurate diagnosis of ADHD (Foreman et al., 2009). Similarly, researchers continue to look for novel methods to classify and predict diagnoses of ADHD in the fields of imaging and genetics that will more closely link assessment data with underlying neurobiological markers (Castellanos and Tannock, 2002).
In our own research utilizing the Bayesian approach described above, we have successfully combined different individual assessments to produce a more reliable and accurate individual diagnosis of ADHD. To be used for combining the results of disparate tests, the Bayesian approach requires a method for standardization of the test results. In general, this standardization is a mapping of test scores into probabilities that is required to follow some basic principles: (1) it should preserve the direction of the original scale (i.e., if scores higher than the threshold are indicative of ADHD, the probabilities for ADHD should increase with increasing test score, while they should decrease if test scores lower than the cutoff are indicative of ADHD); (2) it should be a monotone function; and (3) it should map the value recommended as a cutoff for each test to 0.5, and the minimal and maximal score values to 0 or 1. Clearly, these basic requirements can be achieved by multiple mappings, including the power functions used in this chapter. We already mentioned that the piecewise linear mapping defined in Robeva et al. (2004) was used in our earlier work, generating similar qualitative results (Penberthy et al., 2005; Robeva et al., 2004). However, we adopted the use of power functions in this chapter, since this eliminates the problem of generating probabilities with different increase/decrease rates on the two sides of the cutoff value. Regardless, claiming any one specific mapping to be preferable or superior to any other would be speculation at this point. We are currently working on finding a more comprehensive answer to this question that includes the use of mappings capable of accommodating the ‘‘gray zone’’ ranges of some tests, as well as translating these ranges into respective ‘‘gray zones’’ on the standardized scales. In each of the five studies described earlier, we standardized and combined disparate assessments, including behavioral assessments of ADHD, such as the ADHD-SI, AD/HD Rating Scale-IV, and WURS, and an EEG assessment of ADHD—the CI. These assessments were joined, within each study, using a Bayesian algorithm, resulting in a combined probability for ADHD for each subject. In general, this combined probability presents a better assessment of ADHD than each of the separate tests it includes. Such a procedure is especially useful in situations such as diagnosing ADHD, when there is no single conclusive assessment, but rather a number of imperfect tests that marginally address the outcome of interest, and where researchers may have multiple related tests performed on a single subject, which they wish to combine into a more comprehensive assessment of this subject. Equally important, once the data output from each individual study is standardized, the Bayesian approach allows data to be combined across different studies, thus producing a method for meta-analysis. In addition to significantly increasing the sample size, this approach allows the data to be examined in subgroups divided by age and/or gender, diagnostic group, etc. In our example, in all five studies, the subjects were classified into groups of ADHD versus non-ADHD. However, different studies focused on different
age and gender groups. The standardization of the data allowed cross-sectional analyses, which were not possible with the original data. For example, we found that within each age and gender group the Bayesian algorithm increases the separation between the ADHD and control groups with the incorporation of more test scores. We also found that accumulation of more tests diminishes the effect of age and gender on the ADHD/non-ADHD classification. Based on our review and research, we propose that a viable alternative to a single definitive measure of ADHD is a combination of measures, equipped with a method for refining the results from one test with the results from another and yielding a compounded assessment that works better than each of its separate components. This concept of combining test outcomes with intuitive knowledge and expert opinion is well developed mathematically. Bayesian methods provide a way to combine probabilistic reasoning with experimental data, as well as convenient tools for creating sequential evaluation procedures that refine the outcome assessment with every subsequent step. These procedures are especially useful in situations where there is no single conclusive assessment, but rather a number of imperfect tests that marginally address the outcome of interest. Various individual ADHD assessment tools, including rating scales and physiological assessments, have not proven to be as accurate in diagnosing ADHD as a comprehensive, standardized, objective, yet flexible and adaptive, assessment package for ADHD that can incorporate multifactorial assessments. The proposed assessment does not aim at replacing any established practices for screening and diagnosing ADHD, but instead at demonstrating that the outcomes of related studies can be combined in a manner that allows meta-analysis of different types of data which may not be collected in the same manner in each study, and which can include physiological data as well as symptom reports. It should be emphasized that the application of the proposed meta-analysis tool is not limited to the specific tests used in the discussed studies—the meta-analysis is capable of accommodating a variety of other tests. Specifically, almost any individual test or assessment could be employed within this model, assuming that the output of such a test can be standardized into probabilities for the specific disorder or disease. As such, this meta-analysis procedure may provide a much-needed tool for combining related studies with similar or disparate tests and assessments in a number of research areas, which may otherwise have small, less generalizable studies of limited power.
ACKNOWLEDGMENT

The authors thank Boris Kovatchev from the University of Virginia for a consultation regarding the statistical analyses.
REFERENCES

American Academy of Pediatrics (2000). Clinical practice guideline: Diagnosis and evaluation of the child with attention-deficit/hyperactivity disorder. Pediatrics 105, 1158–1170.
American Psychiatric Association (1994). Diagnostic and Statistical Manual of Mental Disorders. 4th edn. American Psychiatric Association, Washington, DC.
Anastopoulous, A. D., and Shelton, T. L. (2001). Assessing Attention-Deficit/Hyperactivity Disorder. Kluwer Academic/Plenum Publishers, New York.
Angold, A., Costello, E. J., Farmer, E., Burns, B., and Erkanli, A. (1999). Impaired but undiagnosed. J. Am. Acad. Child Adolesc. Psychiatry 38, 129–137.
Barkley, R. A. (1990). A critique of current diagnostic criteria for attention deficit hyperactivity disorder: Clinical and research implications. J. Dev. Behav. Pediatr. 11, 343–352.
Barkley, R. A. (2002). Attention Deficit Hyperactivity Disorder: A Handbook for Diagnosis and Treatment. Guilford Press, New York.
Brown, T. E. (1996). Brown Attention Deficit Disorder Scales: Manual. The Psychological Corporation, San Antonio, TX.
Carlin, B. P., and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. 2nd edn. Chapman & Hall/CRC, Washington, DC.
Castellanos, F. X. (1997). Toward a pathophysiology of attention-deficit/hyperactivity disorder. Clin. Pediatr. (Phila) 36, 381–393.
Castellanos, F. X., and Tannock, R. (2002). Neuroscience of attention-deficit/hyperactivity disorder: The search for endophenotypes. Nat. Rev. Neurosci. 3, 617–628.
Cox, D. J., Kovatchev, B. P., Morris, J. B., Phillips, C., Hill, R., and Merkel, L. (1998). Electroencephalographic and psychometric differences between boys with and without Attention-Deficit/Hyperactivity Disorder (ADHD): A pilot study. Appl. Psychophysiol. Biofeedback 23, 179–188.
Cox, D. J., Merkel, R. L., Kovatchev, B., and Seward, R. (2000). Effect of stimulant medication on driving performance of young adults with attention-deficit hyperactivity disorder: A preliminary double-blind placebo controlled trial. J. Nerv. Ment. Disord. 188, 230–234.
DuPaul, G. J., Anastopoulos, A. D., Shelton, T. L., Guevremond, D. C., and Metevia, L. (1992). Multi-method assessment of attention deficit hyperactivity disorder: The diagnostic utility of clinic-based tests. J. Clin. Child Psychol. 21, 394–402.
DuPaul, G. J., Power, T. J., Anastopoulos, A. D., and Reid, R. (1998). ADHD Rating Scale—IV: Checklists, Norms, and Clinical Interpretation. Guilford Press, New York.
Fischer, M., Newby, R. F., and Gordon, M. (1995). Who are the false negatives on Continuous Performance Tests? J. Clin. Child Psychol. 24, 427–433.
Foreman, D., Morton, S., and Ford, T. (2009). Exploring the clinical utility of the Development And Well-Being Assessment (DAWBA) in the detection of hyperkinetic disorders and associated diagnoses in clinical practice. J. Child Psychol. Psychiatry 50, 460–470.
Fossati, A., Di Ceglic, A., Acquarini, E., Donati, D., Donini, M., Novella, L., and Maffei, C. (2001). The retrospective assessment of childhood attention-deficit hyperactivity disorder in adults: Reliability and validity of the Italian version of the Wender Utah Rating Scale. Compr. Psychiat. 42, 326–336.
Habel, L. A., Schaefer, C. A., Levine, P., Bhat, A. K., and Elliott, G. (2005). Treatment with stimulants among youths in a large California health plan. J. Child Adolesc. Psychopharmacol. 15, 62–67.
Heilman, D., Voeller, K., and Nadeau, S. (1991). A possible pathophysiologic substrate of attention deficit hyperactivity disorder. J. Child Neurol. 6(Suppl.), S74–S79.
Hinshaw, S. P. (1994). Attention Deficits and Hyperactivity in Children. Sage, Thousand Oaks, CA.
Jensen, P. S., Koretz, D., Locke, B. Z., Schneider, S., Radke-Yarrow, M., Richters, J. E., and Rumsey, J. M. (1993). Child and adolescent psychopathology research: Problems and prospects for the 1990s. J. Abnorm. Child Psychol. 21, 551–581.
Kalbfleisch, M. L. (2001). Electroencephalographic (EEG) differences between boys with average and high aptitude with and without attention deficit hyperactivity disorder (ADHD) during task transitions. Dissertation Abstr. Int. Sect. B Sci. Eng. 62(1-B), 96.
Klein, R. G., and Mannuzza, S. (1991). Long-term outcome of hyperactive children: A review. J. Am. Acad. Child Adolesc. Psychiatry 30, 383–387.
Kovatchev, B. P., Cox, D. J., Hill, R., Reeve, R., Robeva, R. S., and Loboschefski, T. (2001). A psychophysiological marker of Attention Deficit/Hyperactivity Disorder: Defining the EEG consistency index. Appl. Psychophysiol. Biofeedback 26, 127–139.
Kuperman, S., Gaffney, G. R., Hamdan-Allen, G., Preston, D. F., and Venkatesh, L. (1990). Neuroimaging in child and adolescent psychiatry. J. Am. Acad. Child Adolesc. Psychiatry 29, 159–172.
Lou, H., Henriksen, L., Bruhn, P., Borner, H., and Nielsen, J. (1989). Striatal dysfunction in attention deficit and hyperkinetic disorder. Arch. Neurol. 46, 48–52.
Merkel, R. L., Cox, D. J., Kovatchev, B. P., Morris, J., Seward, R., Hill, R., and Reeve, R. (2000). The EEG consistency index as a measure of Attention Deficit/Hyperactivity Disorder and responsiveness to medication: A double blind placebo controlled pilot study. Appl. Psychophysiol. Biofeedback 25, 133–142.
Monastra, V. J., Lubar, J. F., Linden, M., VanDeusen, P., Green, G., Wing, W., Phillips, A., and Fenger, T. N. (1999). Assessing attention deficit hyperactivity disorder via quantitative electroencephalography: An initial validation study. Neuropsychology 13, 424–433.
Mostofsky, S. H., Cooper, K. L., Kates, W. R., Denckla, M. B., and Kaufmann, W. E. (2002). Smaller prefrontal and premotor volumes in boys with attention-deficit/hyperactivity disorder. Biol. Psychiatry 52, 785–794.
National Institutes of Health Consensus Development Conference Statement (2000). Diagnosis and treatment of attention-deficit/hyperactivity disorder (ADHD). J. Am. Acad. Child Adolesc. Psychiatry 39, 182–193.
Pastor, P. N., and Reuben, C. A. (2008). Diagnosed attention deficit hyperactivity disorder and learning disability: United States, 2004–2005. Vital Health Statistics, Vol. 10. National Center for Health Statistics.
Penberthy, J. K., Cox, D., Breton, M., Robeva, R., Kalbfleisch, M. L., Loboschefski, T., and Kovatchev, B. (2005). Calibration of ADHD assessments across studies: A meta-analysis tool. Appl. Psychophysiol. Biofeedback 30(1), 31–51.
Reiff, M. I., Banez, G. A., and Culbert, T. P. (1993). Children who have attentional disorders: Diagnosis and evaluation. Pediatr. Rev. 14, 455–465.
Robeva, R., Penberthy, J. K., Loboschefski, T., Cox, D., and Kovatchev, B. (2004). Sequential psycho-physiological assessment of ADHD: A pilot study of Bayesian probability approach illustrated by appraisal of ADHD in female college students. Appl. Psychophysiol. Biofeedback 29(1), 1–18.
Safer, D. J., Zito, J. M., and Fine, E. M. (1996). Increased methylphenidate usage for attention deficit disorder in the 1990s. Pediatrics 98, 1084–1088.
Schwabe, U., and Paffrath, D. (2006). Arzneiverordnungs-Report 2006. Springer, Berlin.
Sowell, E. R., Thompson, P. M., Welcome, S. F., Henkenius, A. L., and Toga, A. W. (2003). Cortical abnormalities in children and adolescents with attention-deficit hyperactivity disorder. Lancet 362, 1699–1707.
Stein, M. A., Fischer, M., and Szumowski, E. (1999). Evaluation of adults for ADHD. J. Am. Acad. Child Adolesc. Psychiatry 38, 940–941.
Stein, M. A., Fischer, M., and Szumowski, E. (2000). Evaluation of adults for ADHD: Erratum. J. Am. Acad. Child Adolesc. Psychiatry 39, 674.
Ward, M. F., Wender, P. H., and Reimherr, F. W. (1993). The Wender Utah Rating Scale: An aid in the retrospective diagnosis of childhood attention-deficit-hyperactivity disorder. Am. J. Psychiatry 150, 885–890.
Weiss, G., Hechtman, L., Milroy, T., and Perlman, T. (1985). Psychiatric status of hyperactives as adults: A controlled prospective 15-year follow-up of 63 hyperactive children. J. Am. Acad. Child Adolesc. Psychiatry 24, 211–220.
Welner, Z., Reich, W., Herjanic, B., and Jung, K. G. (1987). Reliability, validity, and parent–child agreement studies of the Diagnostic Interview for Children and Adolescents (DICA). J. Am. Acad. Child Adolesc. Psychiatry 26(5), 649–653.
Weyandt, L. L., Linterman, I., and Rice, J. A. (1995). Reported prevalence of attentional difficulties in a general sample of college students. J. Psychopathol. Behav. Assess. 17, 293–304.
Zametkin, A. J., and Rapoport, J. L. (1987). Neurobiology of attention deficit disorder with hyperactivity: Where have we come in 50 years? J. Am. Acad. Child Adolesc. Psychiatry 26, 676–686.
Zito, J. M., Safer, D. J., dosReis, S., Gardner, J. F., Boles, M., and Lynch, F. (2000). Trends in the prescribing of psychotropic medications to preschoolers. JAMA 283, 1025–1030.
C H A P T E R
F I F T E E N
Simple Stochastic Simulation

Maria J. Schilstra* and Stephen R. Martin†

* Biological and Neural Computation Group, Science and Technology Research Institute, University of Hertfordshire, Hatfield, United Kingdom
† Division of Physical Biochemistry, MRC National Institute for Medical Research, London, United Kingdom

Contents
1. Introduction
2. Understanding Reaction Dynamics
3. Graphical Notation
4. Reactions
5. Reaction Kinetics
   5.1. Second-order reactions
   5.2. First-order reactions
   5.3. Pseudo-first-order reactions
   5.4. Aside
6. Transition Firing Rules
   6.1. Ground rules
   6.2. First-order reactions
   6.3. Multiple options
   6.4. Pseudo-first-order and second-order reactions
7. Summary
8. Notes
References

Abstract

Stochastic simulations may be used to describe changes with time of a reaction system in a way that explicitly accounts for the fact that molecules show a significant degree of randomness in their dynamic behavior. The stochastic approach is almost invariably used when small numbers of molecules or molecular assemblies are involved because this randomness leads to significant deviations from the predictions of the conventional deterministic (or continuous) approach to the simulation of biochemical kinetics. Advances in computational methods over the three decades that have elapsed since the publication of Daniel Gillespie’s seminal paper in 1977 (J. Phys. Chem. 81, 2340–2361) have allowed researchers to produce highly
sophisticated models of complex biological systems. However, these models are frequently highly specific to the particular application, and their description often involves mathematical treatments inaccessible to the nonspecialist. For anyone completely new to the field, applying such techniques in their own work might seem at first sight to be a rather intimidating prospect. However, the fundamental principles underlying the approach are in essence rather simple, and the aim of this article is to provide an entry point to the field for a newcomer. It focuses mainly on these general principles, both kinetic and computational, which tend to be not particularly well covered in the specialist literature, and shows that interesting information may even be obtained using very simple operations in a conventional spreadsheet.
1. Introduction

Over the past two decades, a number of important single molecule techniques have emerged and been applied to research in biology. These techniques have given researchers the ability to obtain information about single molecules—or molecular assemblies—that could not be obtained from measurements on large ensembles of molecules. Single molecule methods have facilitated the study of the movement of molecular motors along actin or microtubules; the characterization of RNA/DNA-based motors (polymerases, topoisomerases, and helicases); the dynamic growth/shortening behavior of individual microtubules; the motion of individual ribosomes as they translate single messenger RNA hairpins; the movement of single biomolecules in a membrane or viruses on a cell surface; and the behavior of single proteins or ligands both within living cells and interacting with cell surface receptors. The data generated from these experiments contain kinetic, statistical, and spatial information and allow one to understand how the molecules behave, either individually or—by averaging many data sets—as an ensemble. Quantitative interpretation of these observations requires the construction of dynamic models, in which all components that are thought to be essential for the functioning of the system under study act and interact in a manner that might resemble the original. These models may then be used to simulate the dynamics of the real-world system. If ‘‘test runs’’ under conditions that mimic the experimental ones give the expected results, one may challenge the model by applying different conditions, and check whether its predictions are borne out by the behavior of the real system. One may also make a quantitative comparison between the observed and simulated data, and thereby refine the model parameters to achieve improved agreement with the experimental observations. The traditional approach to modeling dynamic biochemical systems is based on the law of mass-action, and relies for its validity on three major
assumptions. These are that the reaction volume is spatially homogeneous (so that any small subvolume is indistinguishable from any other), that it is well stirred (so that the chance of any two molecules colliding is the same throughout the volume), and that it contains a large number of molecules. Because interactions between molecules are random processes, it is impossible to predict the exact time at which a reaction involving one or two molecules or particles will occur. In systems with a large number of interacting molecules, this random behavior is averaged out, and the overall state of the system in terms of the concentrations of the components becomes totally predictable. It is this property that enables the traditional deterministic simulation approach to be employed. The starting point for such simulations is the set of coupled ordinary differential equations (ODEs) which describe the time-dependence of the concentrations of the different chemical species involved. When the parameters and initial concentrations have been defined, a numerical integrator is used to calculate the concentrations as a function of time. Deterministic simulations will always produce the same ‘‘predetermined’’ result for any given set of starting conditions because they model reactions as continuous fluxes of material and implicitly include the assumption that one is dealing with a very large number of molecules. If only small numbers of molecules are involved, then the stochastic fluctuations are no longer averaged out, and the evolution of the system—in terms of the position, concentration, or number of species—from any given set of starting conditions will never be exactly the same. These stochastic fluctuations are clearly of enormous importance in single molecule studies, where properties of the system may often be inferred from the distributions of experimentally observed values. Stochastic fluctuations are equally prominent in cellular processes, which generally occur in very small volumes, often involve relatively small numbers of molecules, and are neither spatially homogeneous nor well stirred. The deterministic approach cannot be used in such situations, and these systems must be modeled using a stochastic formulation of their chemical kinetics. Stochastic formulations include the random aspects of the system, as well as its deterministic characteristics that yield a predictable average behavior. ‘‘Monte Carlo’’ computer simulations, which use random number generators to incorporate randomness, but take account of the probabilities that particular events will occur, may then be used to generate time courses that single molecules or particles may follow. These probabilities are calculated from exactly the same physical properties of the system—rate constants, etc.—as those used in the deterministic approach. The exponential rise in citations over the last decade (Fig. 15.1) of the paper that presents the algorithm for performing such Monte Carlo simulations, now known as Gillespie’s Direct Method (Gillespie, 1977), parallels the revolution in the development of single molecule techniques (Cornish and Ha, 2007). Further analysis of the citation data suggests that the citations mostly originate from papers whose subject areas are classified as (biological)
Figure 15.1 Number of citations per year for Gillespie (1977), based on a search of the Science Citation Index—Expanded and the Conference Proceedings Citation Index—Science (Thomson Reuters, Web of Science). The paper has been cited 1286 times in the period from 1977 to 2008. The solid line shows that the increase has been exponential from 1996 (17 citations) to 2008 (275 citations).
physics, mathematics, or computer science. Perhaps not surprisingly, the papers that contain an analysis, explanation, justification, or refinement of the technique are often laden with equations and notions from statistical physics. As a result, the impression is created that producing stochastic formulations of biochemical reaction systems is the preserve of mathematicians and physicists, and requires one to be almost superhumanly conversant with probability theory and statistics. Although this may be the case for those who wish to analyze and base generally applicable theory on such formulations, we contend that setting up stochastic models for the sole purpose of simulation is not only straightforward, but also engaging, and above all, enlightening. Our view is that anyone who is able to calculate the residual amount of 32P in a stock of radioactive ATP a week after receiving it can also set up and perform a stochastic simulation of a biochemical reaction system. Moreover, we believe that hands-on experience with quantitative model building and stochastic simulation may significantly deepen one’s insight into the basic principles of chemical kinetics. The aim of this article is to provide an entry point to the field for a newcomer. It focuses mainly on explaining the origin and meaning of the fundamental equations of Gillespie’s original¹ exact algorithm (Gillespie, 1976), now known as the First Reaction Method. We intentionally focus on the modeling of very simple systems, because our primary purpose here is to show that stochastic simulations are both easy to understand and relatively easy to implement. Although the ability to write some basic computer code is a considerable advantage, it is, in fact, the case that simple applications can quite often be implemented in a spreadsheet program. Many situations one might wish to simulate are, of course, much more complex than those described here, but most of the fundamental principles we discuss remain exactly the same.

¹ Gillespie’s (1976) paper is usually credited as the first publication of this method, but a very similar algorithm, applied to the simulation of Ising spin systems, was published a year earlier (Bortz et al., 1975). To our knowledge, the method was independently (re-)discovered at least twice in the period before 1995, when the Gillespie paper was relatively unknown (Bayley et al., 1990; Kleutsch and Frehland, 1991).
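As a taste of how little machinery the First Reaction Method actually needs, the sketch below simulates the reversible isomerization X → Y, Y → X using nothing beyond exponentially distributed waiting times. It is our own minimal illustration, not code from this chapter: the rate constants and the dictionary-based bookkeeping are arbitrary choices made for the example, and the reasoning behind the waiting times and firing rules is developed in Sections 5 and 6.

    import random

    def first_reaction_step(state, reactions, t):
        """One step of the First Reaction Method: draw a tentative waiting
        time for every reaction with nonzero propensity, then execute the
        reaction with the earliest firing time."""
        candidates = []
        for name, (propensity, fire) in reactions.items():
            a = propensity(state)
            if a > 0:
                candidates.append((random.expovariate(a), name, fire))
        if not candidates:
            return None                    # nothing is enabled any more
        tau, name, fire = min(candidates)
        fire(state)
        return t + tau

    # X -> Y with k = 1.0 s^-1, and Y -> X with k = 0.5 s^-1; both are
    # first order, so each propensity is k times the number of reactant items.
    state = {"X": 100, "Y": 0}
    reactions = {
        "RX": (lambda s: 1.0 * s["X"],
               lambda s: s.update(X=s["X"] - 1, Y=s["Y"] + 1)),
        "RY": (lambda s: 0.5 * s["Y"],
               lambda s: s.update(X=s["X"] + 1, Y=s["Y"] - 1)),
    }
    t = 0.0
    for _ in range(1000):
        t = first_reaction_step(state, reactions, t)
    print(t, state)    # item numbers fluctuate around X:Y = 1:2 at equilibrium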
2. Understanding Reaction Dynamics

Even seemingly very complex transformations in cell biology can usually be broken down into a series of elementary physical and chemical reactions. A reaction is a process in which one or more ‘‘species’’ change into other species. Species that are consumed in a reaction are the reactants; those that are formed are the reaction products. Members of a particular species—atoms, ions, molecules, or molecular assemblies—have the same properties as one another, but are different from the members of other species. More generally, species embody state, and reactions state transitions. A state is characterized by a finite lifetime: finite meaning nonzero—that is, measurable—and the lifetime of a state is the average time that particles spend in that state. For example, the radionuclide 32P has a half-life of 14.3 days, from which one may calculate that its atoms have an average life of approximately 20 days before decaying to 32S. Many individual atoms will, of course, exist as 32P for significantly shorter periods, and many will survive for much longer. In contrast, the transition of a single 32P atom to the 32S state is effectively instantaneous. A process that occurs instantaneously is called an event. Appreciation of the notions of state and state transition, lifetime, and event is the key to understanding how the dynamics of chemical, and therefore biochemical, reactions are modeled. Each state (and thus each species) is, by definition, restricted to some kind of container, and movement from one container into another involves a state transition. A container may be a cellular compartment, such as the cytoplasm or the nucleus, but more commonly is simply the vessel in which an in vitro reaction is occurring. Containers are usually three-dimensional, and have volume, but containers with fewer spatial dimensions are also possible. Two-dimensional containers have an area, rather than a volume, whereas one-dimensional ones have a length. Two-dimensional containers are used, for example, in models of diffusion of proteins in membranes, and one-dimensional ones in models of the motion of particles along a track. To avoid confusion we will only consider three-dimensional containers, but
most concepts and expressions are, with the appropriate substitutions, equally applicable to two- and even one-dimensional containers. In the following, we will give species names that begin with capital letters (e.g., X, Y), and indicate specific instances of a species (molecules, assemblies, etc. in a specific state) with lower case letters (x, y). The number of instances or items of a particular species in a compartment is written as n subscripted with the species name (nX, nY). Square brackets around the species name indicate concentration ([X], [Y]), the number of items expressed in moles per unit of volume:

    [X] = nX/(NA · V)    (15.1)

Here, V is the volume of the container, and NA is Avogadro’s Number, 6.023 × 10²³ items per mole. In biochemistry, concentration is generally expressed in molar (abbreviated M), which is the same as moles per liter (mol/L). Unless specified otherwise, reactions will be indicated as RXY, where the subscript identifies the reactants.²

² Note that in the formulation used in this paper reversible reactions are always specified as two separate reactions, in which the reactants of one reaction are the products of the other.
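Because Eq. (15.1) is used constantly in what follows, it may help to see it as one line of code. This is our own illustrative helper; the function name and the example volume are arbitrary choices:

    AVOGADRO = 6.023e23          # items per mole, as in Eq. (15.1)

    def concentration(n_items, volume_litres):
        """[X] = nX/(NA * V); returns the concentration in mol/L (M)."""
        return n_items / (AVOGADRO * volume_litres)

    # 1000 molecules in a bacterium-sized volume of 1 fL (1e-15 L) is
    # already a micromolar-range concentration:
    print(concentration(1000, 1e-15))    # about 1.66e-6 M, i.e., ~1.7 uM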
3. Graphical Notation

Throughout this article, we will use Petri-nets to depict species and reactions. The Petri-net format was invented³ to describe discrete distributed systems, in which objects are transformed into other objects in multiple processes that occur simultaneously and interact with each other. Petri-nets are used for many different purposes, from the design of complex computer software to the management of manufacturing systems. They are, however, also eminently suitable for depicting biochemical reaction systems at the level of interactions between molecules. The reason we prefer to illustrate our argument with Petri-nets rather than equivalent, often more compact, chemical reaction schemes is the evocative way in which the Petri-net notation and terminology emphasize the discrete nature of biochemical reactions, and draw attention to network structure and its associated dependencies.

³ The Petri-net notation is named after its inventor, C. A. Petri (http://www.scholarpedia.org/article/Petri_net). Petri-nets are sometimes called place-transition or P/T graphs.

A Petri-net is a directed bipartite graph; a graph being a mathematical concept that may be visualized as a set of symbols, such as circles or boxes, connected by lines or arrows. The symbols are called ‘‘nodes’’ or ‘‘vertices’’; the lines are ‘‘arcs’’ or ‘‘edges.’’ Each arc connects two nodes. Arcs may have arrowheads, and point from one node to the other, which makes the
graph directed. ‘‘Bi-partite’’ indicates that there are two kinds of nodes, and that nodes of the first type can only have connections to nodes of the other type. In the following we shall use the classic, somewhat idiosyncratic Petri-net terminology. The two types of nodes are called place and transition nodes, or more simply, places and transitions. Places represent states (species), and transitions represent processes that involve a state transition (reactions). Places are usually represented by circles or ellipses, transitions by rectangles, squares, and sometimes simply by straight line segments. Arcs that point from place to transition nodes are called input arcs; arcs that point away from a transition toward a place are output arcs. Likewise, places that are connected to transitions via input arcs are called input places; those that are connected via output arcs are the output places of the transition. Input and output places therefore represent reactants and reaction products. Place nodes may contain tokens, which are often shown as small circles or dots inside the node. Tokens count the items (molecules, assemblies, etc.) that are in the state represented by the place. Although each token represents one item, the tokens in a given place are indistinguishable, as they are not associated with any specific item. In the simple Petri-net example in Fig. 15.2A, the single reactant in reaction (transition) RX is species (place) X, and the product is species Y. There are three instances (tokens) of species X in the compartment or container (which is not specifically indicated), and none of Y. Reaction RXZ in Fig. 15.2B has two reactants, X and Z, and one product (Y). Arcs have weights. The default arc weight is 1, but other numbers are allowed, provided they are nonnegative integers. Default weights are not normally indicated in the graph, but other weights are. In Fig. 15.2C and D, the input arcs of RX, RX1, and RX2, and the output arc of RY, have higher weights (namely 2), whereas all other arcs have a default weight of 1. It may, by now, be apparent that the Petri-net notation closely corresponds to the conventional chemical reaction notation: the example Petri-nets depict the reactions X → Y (A), X + Z → Y (B), 2X → Y (C), and the two reactions 2X → Y and 2X → Z (D), with the input and output arc weights corresponding to the reactant and product stoichiometry in the reactions. Transitions make the Petri-net dynamic, as they can fire. When a transition fires, tokens are removed from its input place or places, and new tokens are deposited in its output place(s). The number of tokens removed and deposited is equal to the weight of the connecting arcs. Thus, upon firing of RX in Fig. 15.2C, two tokens will be removed from X, and one deposited in Y. Likewise, when RY fires (Fig. 15.2C) one token will be removed from Y, and two deposited in X. It is important to realize that tokens do not move from place to place: they are simply counters that have no identity of their own. A transition can fire if, and only if, there is a sufficient number of tokens in each of its input places, that is, if the number of tokens in the input places is
Figure 15.2 Simple Petri-net examples and explanation of the symbols: places (species, states) drawn as circles, transitions (reactions) drawn as rectangles, arcs with weights (stoichiometry), and tokens (species instances, items). These Petri-nets can be used to represent the chemical reactions X → Y (A), X + Z → Y (B), 2X → Y (C), and the two reactions 2X → Y and 2X → Z (D).
equal to or greater than the weights of the associated input arcs. Therefore, RX in Fig. 15.2A is enabled, because there are three tokens in X. Transition RXZ in B, however, cannot fire, because even though there are enough tokens in X, there are none in Z. Similarly, in Fig. 15.2C only RX is enabled to fire; RY, which needs at least one token in its input place Y, is disabled. After the firing of RX and the accompanying removal of two tokens, there will be one token left in X, not enough for RX to fire again. However, one token will have appeared in Y, thereby enabling transition RY. When RY fires after having become enabled, two tokens will be put back in X and one removed from Y, so that firing can go on ad infinitum. In contrast, RX (Fig. 15.2A) can fire three times in succession, but no more. Apart from the rule that a transition can only fire if all of its input places contain a sufficient number of tokens, the basic Petri-net definition specifies no further rules for transition firing. In Fig. 15.2D, both transitions RX1 and RX2 can fire. However, they cannot fire simultaneously—because this would require the presence of four tokens in X—and firing of RX1 will disable RX2 and vice versa. To avoid conflict, different firing rules have been
established for different Petri-net applications. For Petri-nets that represent chemical reaction systems, the logical option is to apply the rules that underlie chemical reaction kinetics. These rules are based on the principle that, although it is impossible to predict exactly when an individual species item will undergo a chemical reaction, the likelihood that it will do so within a given time interval can be computed. According to these rules, the reaction that happens first (and thereby possibly prevents other reactions that could also have occurred) will be chosen on the basis of a weighted lottery. To understand where these rules come from, a good understanding of chemical reaction kinetics is crucial. In the following, we will give a summary of the basic principles.
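Before moving on to the kinetic rules, note that the token game itself can be captured in a few lines. The sketch below is our own illustration of the structural rules of this section (enabling and firing only; the kinetic firing rules developed in Sections 5 and 6 are not yet included), using the net of Fig. 15.2A:

    # Marking and transitions for the Petri-net of Fig. 15.2A: X -> Y,
    # with three tokens initially in place X.
    marking = {"X": 3, "Y": 0}
    transitions = {
        "RX": {"in": {"X": 1}, "out": {"Y": 1}},   # all arc weights are 1
    }

    def enabled(t, marking):
        """A transition is enabled if every input place holds at least as
        many tokens as the weight of its input arc."""
        return all(marking[p] >= w for p, w in t["in"].items())

    def fire(t, marking):
        """Remove tokens along the input arcs and deposit tokens along the
        output arcs; tokens are counters, nothing actually moves."""
        assert enabled(t, marking)
        for p, w in t["in"].items():
            marking[p] -= w
        for p, w in t["out"].items():
            marking[p] = marking.get(p, 0) + w

    # RX can fire three times in succession, but no more:
    while enabled(transitions["RX"], marking):
        fire(transitions["RX"], marking)
    print(marking)                                  # {'X': 0, 'Y': 3}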
4. Reactions

Most chemical reactions have either one or two reactants. Reactions with a single reactant appear to happen spontaneously, whereas reactions with two reactants require a collision between the two participants. Reactions that require a three-body collision do exist, but are uncommon, simply because a three-body collision is a rare event. Radionuclide decay, such as the reaction 32P → 32S, is an example of a one-reactant, or unimolecular, reaction. Examples from biochemistry include conformational changes (or isomerizations) and dissociation of protein–ligand complexes. The formation of a complex between two species, such as a protein and a ligand, is an example of the very common two-reactant, or bimolecular, reaction. Although many important complexes in biochemistry contain multiple components, these will almost invariably be assembled through a series of individual reactions with just two reactants. Likewise, many complex reaction schemes—such as the enzyme catalyzed conversion of one or more substrates into one or more reaction products—consist of a series of unimolecular and bimolecular events.
5. Reaction Kinetics

To understand reaction dynamics, it is convenient to consider what happens to a single ‘‘central’’ molecule that takes part in a reaction. We will first consider the case in which the central molecule takes part in a second-order bimolecular reaction, and then discuss what happens in first-order unimolecular reactions. Whereas the terms unimolecular and bimolecular indicate the number of reactants, the designations first-order and second-order refer to the sum of the powers to which the concentration terms in the rate equation are raised, as we will clarify below.
5.1. Second-order reactions

If the central molecule is a reactant of type X in the bimolecular reaction X + Z → Y, as in Fig. 15.2B, it needs to collide with (‘‘find’’) its reaction partner, Z, before the reaction can take place. The frequency with which type Z molecules collide with the single central molecule, x, is proportional to the concentration of Z: if there are twice as many molecules of the reaction partner in the compartment in which the reaction takes place, the collision frequency will be double. Conversely, if a given number of Z molecules become distributed over a volume that is twice as large, the collision frequency will be halved. The collision frequency also depends on factors such as the temperature and the viscosity of the medium in which the molecules move. Moreover, in few reactions, if any, does every collision lead to a reaction. In biochemical reactions, which often involve large molecules, the reactants must at least be in a favorable orientation with respect to each other, so that only a fraction of the total number of collisions between reactants results in an actual reaction. We will assume, for now, that the environmental factors that affect the collision frequency and the fraction of successful reactions are constant. In this case, the frequency of ‘‘successful’’ encounters (that is, the number of successful encounters within a given period) between the central molecule and its reaction partner would also be proportional to the concentration of the reaction partner—if, of course, the central molecule were not consumed by the reaction. If there are, say, 1000 items of the same species as the central molecule, the probability that any one of those will undergo a successful collision and react with a reaction partner is 1000 times greater.⁴ Therefore, the rate at which the reaction proceeds, that is the rate at which the reactant molecules are consumed, is proportional to the number of items of one of the species taking part, and to the concentration of the other:

    −dnX/dt = −dnZ/dt = k · nX · [Z] = k · nX · nZ/(NA · V)    (15.2)

Here, −dnX/dt and −dnZ/dt are the rates at which X and Z items disappear from the medium (a plus sign would indicate appearance). The rate is measured in number of items per unit of time. If time is measured in seconds, and volume in liters, the dimensions of the expressions on both sides of the equation are s⁻¹ (as the numbers nX and nZ are dimensionless), and the dimension of the proportionality factor k is M⁻¹ s⁻¹. The proportionality factor is called the rate constant,⁵ and equations that express the rate at which the quantity of one particular species changes are rate equations.

⁴ Provided the molecules do not hinder each other. Significant hindrance necessitates the introduction of a correction factor, but this is beyond the scope of the present discussion.
⁵ Following the recommendations in the IUPAC Compendium of Chemical Terminology, we use k as the symbol for rate constants.

The reaction rate may also be expressed in terms of concentration by dividing both sides of the equation by NA · V:

    −d[X]/dt = −d[Z]/dt = k · [X] · [Z]    (15.3)
Note that this operation alters the dimensions of the whole expression to M/s, but leaves the value and dimensions of the rate constant unchanged. Reactions with rate equations of the form of Eq. (15.3) are called second-order reactions, with k as a second-order rate constant, because the total power to which the concentrations of all reactant species are raised is 2 (i.e., 1 + 1). Reactions in which two molecules of the same species react, for example, in the dimerization reaction RX in Fig. 15.2C, are also second-order reactions. In this case, two items disappear following a successful collision. Since a molecule cannot react with itself, each molecule has only nX − 1 other molecules with which it can react, and Eq. (15.2) then becomes:

    −dnX/dt = 2k · nX · (nX − 1)/(NA · V)    (15.4)
If there are many molecules, (nX1) nX, and the reaction rate may be expressed as:
dnX ¼ 2k½X2 dt
ð15:5Þ
Of course it is also possible to incorporate V or NA V in the value of k, as k0 ¼ k/V or k0 ¼ k/(NA V ). In these cases, the dimensions of k0 are mol 1s 1 or molecule 1s 1, and its value depends on the volume of the compartment in which the reaction takes place. It is very important, however, to realize that a second-order rate constant expressed as mol 1s 1 hides a spatial dimension, and that this concealment may lead to severe confusion. It is advisable to always report the values of second-order rate constants in M 1s 1, not least because their values may then be compared with those of other second-order reactions. Reactions that require a collision between two molecules are limited by the velocity at which the individual molecules move around. The average velocity of the particles may be estimated from their diffusion coefficients, which, in turn, depend on the temperature and viscosity of the medium. As a rule of thumb, the maximum value for a second-order rate constant at 25 C in water is of the order of 109 M 1s 1(Atkins, 1994). Reported or observed second-order rate constants with values that are significantly greater than 109 M 1s 1 should therefore be treated with suspicion.
392
Maria J. Schilstra and Stephen R. Martin
5.2. First-order reactions As collisions are events, it is easy to picture bimolecular reactions as collections of events that happen after the reactants have spent some time moving around in the medium. Unimolecular reactions, in contrast, do not generally or obviously require collisions: they appear to happen spontaneously. Nonetheless, state transitions that involve a single particle as a reactant, such as the reaction X ! Y in Fig. 15.2A, are often also best described as events that happen spontaneously after the particle has spent some (so far indeterminate) amount of time in the reactant state, here X. For example, out of a population of 1000 32P atoms, approximately 500 would have decayed to 32S after 14.3 days, but this does not, of course, mean that each of these 500 atoms took 14.3 days to make the full transition. Many unimolecular biochemical transitions also appear to happen instantaneously. Just as in the bimolecular case, the probability that any single molecule or particle in a population will make the transition from state X to state to state Y is directly proportional to the number of particles present and the rate at which items of X disappear (dnX/dt) is therefore given by:
dnX ¼ knX dt
ð15:6Þ
Dividing both sides by NA V gives: d½X ¼ k½X ð15:7Þ dt The proportionality constant k is, in this case, a first-order rate constant, measured in s 1, and the unimolecular reaction on which the proportionality is based is of the first order, because it involves the concentration of one species, raised to the power of one. Unlike second-order rate constants, first-order ones do not incorporate a volume factor, either explicitly or implicitly. First-order reaction rates are independent of the volume of the compartment in which the reactions take place. Like second-order rate constants, first-order ones also have upper limits to their values. The fastest reactions in biochemistry are those that involve the transfer of energy, electrons, or protons between well-positioned donors and acceptors, such as the chlorophylls and cytochromes in photosynthetic reaction centers. The values of the first-order rate constants for these very fast reactions are of the order of 1013 down to 107 s 1. Most other biochemical processes are likely to have significantly smaller rate constants.
5.3. Pseudo-first-order reactions Suppose that, in a reaction that yields Eq. (15.2) or (15.3), the concentration of Z changes hardly or not at all when the reaction takes place (i.e., d[Z]/ dt 0 or d[Z]/dt ¼ 0). This would happen if [Z] were so much larger than
393
Simple Stochastic Simulation
[X] that even when all items of X had reacted the number of Z items would remain almost unaffected. In this case, the factor [Z] may be incorporated into a new rate constant, k0 ¼ k [Z], giving:
d½Z dnX ¼ knX ½Z ¼ k 0 nX ; ¼0) dt dt
d½X 0 ¼ k ½X dt
ð15:8Þ
Here, the shapes of dnX/dt and d[X]/dt are the same as those in Eqs. (15.6) and (15.7), and k0 is called a pseudo-first-order rate constant. Like real first-order rate constants, k0 is measured in s 1 (M 1s 1 M ), as it incorporates the dimensions (molar) of [Z].
5.4. Aside Rate constants are a measure of the reactivity of the participant(s) in a reaction, and depend on external factors such as temperature, pH, and ionic strength. Second-order rate constants also depend on the viscosity of the medium. If one’s aim is to predict concentration changes that occur in a reaction system, whether by evaluating ODEs or by carrying out stochastic simulations, knowledge of the values of the rate constants is sufficient—but essential. It is not necessary to specifically consider any of the factors that are amalgamated in the rate constants, or to consider the situation at the level of collisions. Similarly, observation of concentration changes in a reaction system only yields information on the values of the rate constants. However, the values of the rate constants may provide important, but indirect, clues about the molecular properties of the reactants, or about the environment in which they were obtained. In any case, the effect of changing environmental conditions (if any) on the rate constants must be assessed and taken into account.
6. Transition Firing Rules In the above, we have established the following: 1. Both first- and second-order reactions involve instantaneous state transitions. 2. Individual items (molecules, complexes, assemblies, etc.) that act as reactants in a reaction spend a certain amount of time in this ‘‘reactant state’’ before they undergo the state transition. 3. The frequency at which a state transition occurs is proportional to the number of items of the single reactant in the case of a first-order reaction, and to the number of items of one reactant and the concentration of the other in the case of a second-order reaction. The proportionality constants are called the rate constants for the reactions.
394
Maria J. Schilstra and Stephen R. Martin
Although it is not possible to predict exactly when any particular item (or items) will react, it is possible to use the rate constants to compute the probability that it will do so within a given period of time.
6.1. Ground rules Consider Eq. (15.6), which says that the difference dnX between the number of X molecules, nX, present at the beginning, t0, and the end, t1, of a very short period, dt, is proportional to nX (dt ¼ t1t0 and dnX ¼ nX,t¼1nX,t¼0, where nX,t¼0 is nX at time t0, and nX,t¼1 is nX at time t1). Equation (15.6) is a relatively simple differential equation that has an analytical solution: an equation that expresses how many X molecules (nX) are still left in the reaction volume at any time into the reaction.6 The equation is: nX ¼ n0X ekt
ð15:9Þ
Here, n0X is the number of X molecules at the beginning of the reaction, k is the first-order rate constant for the reaction, and t is the time into the reaction. By dividing both sides by nX, we know what fraction of the initial amount is still present at time t (namely nX =n0X ¼ ekt ), and which fraction has already reacted (1 nX =n0X ¼ 1 ekt ). Therefore, if there are 106 molecules of X at the start of the reaction, and k is 10 s 1, after 1 ms, (1e0.00110) 100% ¼ 0.995%, or approximately 9950 molecules will have reacted and disappeared from the compartment. After 10 and 50 ms, the percentages are 9.52 (9.52 104 molecules) and 39.3 (or 3.93 105 molecules). Note that, although the absolute quantities will be different for different values of n0X , the percentages will always the same. Thus, if there are (1 106 3.93 105) ¼ 6.07 105 molecules (60.7%) left after 50 ms, there will be 60.7% of 6.07 105, or 3.68 105 molecules left after another 50 ms. Therefore, if we know k and nX at any point in time, we can predict how many will be left after a given time interval. Equation (15.9) is exactly valid only when the number of items is effectively infinite. It also expresses the average that would be obtained for many observations (‘‘experiments’’) on a finite numbers of molecules. Focusing now on a single X molecule, x, we can predict the odds that it will have disappeared after, for example, 50 ms. As we have seen above, about 39.3% of all X molecules that were present at the beginning will have disappeared by the end of the interval, and the chance that x is among those is therefore also 39.3%. If x is indeed still present at the end of the interval, 6
Readers who are unfamiliar with differential equations may just accept that Eq. (9) follows from Eq. (6) for nX ¼ n0X at t ¼ 0.
395
Simple Stochastic Simulation
the probability that it will react within the next 50 ms is, again, 39.3%. This may be expressed as an equation: F ¼ 1 ekDt
ð15:10Þ
Here F is the probability that an item has undergone the state transition in the time interval Dt ¼ tt0, and k is the first-order rate constant for the reaction. Equation (15.10) may be used, inter alia, to compute the half-life, t½, of a species for which k is known: this is the time at which 50% of the molecules or other items have reacted (i.e., F ¼ 0.5). The time for which this is true is t½ ¼ ln(2)/k. In the example in which k ¼ 10 s 1, t½ is 69 ms. It can be shown7 that the average life span, or lifetime,t, of the species is equal to 1/k, and that the standard deviation on the average life span is equal to the value of t itself. This means that at Dt ¼ t, some 63.2% of the amount present at t0 has disappeared, and 36.8% is still left, as e-t/k ¼ 1/e 0.368. We will use this knowledge, especially Eq. (15.10), in the formulation of the firing rules for the transitions in the Petri-nets that represent biochemical reactions.
6.2. First-order reactions Imagine an object that can undergo multiple sequential, irreversible firstorder state transitions, X1 ! X2, X2 ! X3, etc., until it reaches a final state Y, as illustrated in the Petri-net in Fig. 15.3. This could be a simple model of a protein that undergoes a number of conformational changes, an electron that hops from center to center in an electron transfer chain, or a molecular motor that moves along a track. In this example, we will make the rate constants for all reactions the same: kX1 ¼ kX2 ¼ . . . ¼ k ¼ 10 s 1. Initially, place X1 contains a token, indicating that there is one item in state X1. Upon firing of transition RX1, this token will disappear, and a new one will be deposited in X2 (the weight of each arc is the default, 1). The questions one might pose are how long does it take, on average, for a molecule to undergo the full transformation from X1 to Y, and what is the spread in the arrival times? One seemingly logical approach would be to divide the total estimated reaction time into small steps and calculate the probability of the event occurring within a time step. For example, according to Eq. (15.10), there is a 50% chance that the first transition will fire within 69 ms. If we were to divide the total reaction time into 69 ms segments we could use a simple coin flip to decide whether or not the transition had fired during the interval between t ¼ 0 and t ¼ 69 ms. If it did not, we could look at the next 69 ms interval, and try again. However, if it had happened, we would not, of 7
Because the proof for these statements is quite involved, it is omitted here.
396
Maria J. Schilstra and Stephen R. Martin
System X1 state
X2
X3
X4
Y
0 1 2 3 4 RX1
RX2
RX3
RX4
Figure 15.3 Petri-net model of five sequential state transitions undergone by a single molecule. X1 to Y represent the five states that the molecule can assume; the five system states (top to bottom) indicate the development of the system after the consecutive firings of transitions RX1 to RX4. The model is used to establish the average dwell time of the token in each place, and to determine the spread in the arrival time the token in place Y.
course, know exactly when it had happened. Furthermore, it would be possible that not only RX1, but also RX2, or even more transitions might also have fired within the 69 ms. To avoid these uncertainties, the size of the interval must be reduced. As we have seen above, in the interval between 0 and 1 ms, there is about 0.995% chance that RX1 fires. The chance that RX2 also fires within that interval is therefore 0.995% of 0.995%, which is only 0.0099%, or about 1 in 10,000. We can use a computational random number generator8 to draw a number from a very large set of uniformly distributed numbers between 0 and 1. If the draw yields a number greater than 0.00995, the situation is unchanged after 1 ms, and we draw another random number to determine whether the event will happen in the interval between 1 and 2 ms, and so on. If we draw a number between 0 and 0.00995 for a particular interval, we know that the RX1 has fired. However, there is still a worry that, although unlikely, RX2 may also already have fired within that period. Because of that, we cannot be sure what the situation is after 1 ms: does X2 or X3, or maybe even one of the places further down the chain, contain the token? To resolve this, we could decide that both RX1 and RX2 fire in the interval under consideration if the draw yields a number r that is smaller than 9.9 10 5, that RX3 also fires when r < 9.9 10 7, and so on. Alternatively, we could divide time into smaller and smaller intervals, so that the times at which events happen are identified 8
See, for example, http://en.wikipedia.org/wiki/Random_number_generator or http://www.random.org/
397
Simple Stochastic Simulation
with greater precision, and the probability that two or more events occurring in the same interval becomes vanishingly small. Unfortunately, this would require generating more random numbers, and, for the very small time intervals that would be considered ‘‘safe,’’ slow down the computation of the firing times to a snail’s pace. Fortunately, there is a simple solution to this problem. Instead of calculating the probability that an event occurs within a fixed time period we can use a random number to calculate directly the time at which the event occurs. We know that the probability F that the transition will fire within the period between now and Dt increases monotonically from 0 to 1. We now divide the vertical axis in a plot of F against Dt into 10 segments, segment 1 from 0 to 0.1, segment 2 from 0.1 to 0.2, and so on, so that each segment represents an equal part, 10%, of the total probability. Rearranging Eq. (15.10) and taking logarithms of both sides gives Dt ¼ ln(1F )/k, which allows us to compute the time slots that correspond to each segment, as shown in Fig. 15.4. These time slots are unequal in size, and the last time slot is infinitely long. We then draw a random number to choose one of the B
1.0
1.0 0.8
0.6
0.6 F
0.8
F
A
0.4
0.4
0.2
0.2
0.0 0.0
0.1
0.2 0.3 Time (s)
0.4
0.5
0.0 0.0
0.1
0.2 0.3 Time (s)
0.4
0.5
Figure 15.4 Two ways to decide on the timing of an event in a stochastic simulation. The solid black line indicates the monotonically increasing probability F that the event has occurred in the interval from time 0 to time t. In (A), the time axis is sampled in equal steps. The corresponding probability that an event occurs in a particular time slot is different for each slot (F(n)F(n-1) ¼ (1ekt(n))(1ekt(n-1)), where F(n) is the function value at the end t(n) of the nth time slot). The correspondence is indicated by the solid gray lines. Because of the different probabilities, it is necessary to draw a random number r (from a uniformly distributed set) for each time slot to decide whether an event has taken place in that slot (r < 1ekt(n)). In (B), the F-axis is divided into equally sized segments representing equal probabilities. Each segment corresponds to a particular time slot (t(n)t(n1) ¼ (ln(1F(n 1))ln(F(n)))/k whose size increases with increasing F, as indicated by the solid gray lines. In this case, it is only necessary to draw a single random number to decide on the time slot in which the event takes place. In both cases, the uncertainty in the event timing (the width of the time slot) may be reduced by increasing the sampling rate. This will reduce the efficiency of the first method (A), but not that of the second (B).
398
Maria J. Schilstra and Stephen R. Martin
vertical divisions (all are equally probable), and decide that the transition will fire in the corresponding time slot. Of course, these time slots, particularly the ones corresponding to the higher segments, are quite long, and there is an undesirable uncertainty in the timing of the event. By dividing the vertical axis into smaller segments, we can narrow down the corresponding time slots (apart from the last one, which will always be infinitely large), so that if we divide it in an infinitely large number of segments, the time slots will become infinitely short. If we then randomly draw a number, r, from an infinitely large,9 uniformly distributed set, we can pinpoint the precise time t at which the event occurs by evaluating Dt ¼ tt0 ¼ ln(1r)/k. Now suppose there are nX tokens in X1, instead of one. In that case, the transition will be firing at a rate of J ¼ nX k, where the transition firing rate J, may also be called the reaction flux, propensity, or hazard. The time interval up to the first transition firing is computed in exactly the same way as illustrated above, with J substituted for k: Dt ¼
lnð1 rÞ J
ð15:11Þ
Note that r may be 0 (in which case the event happens at the same time as the previous one, but they do occur in an ordered fashion), but may not be 1. With this knowledge, we now return to the situation in which there is just one token in X1, as in system state 0 in Fig. 15.3. Suppose evaluation of Eq. (15.11) (with JX1 ¼ 1 k) with random number r1 yields a firing time t1 ¼ Dt1 þ t0 for RX1. Coincident with this firing event at t1, the overall state of the system changes from the starting state 0 to state 1, in which RX1 is disabled as X1 has lost its token, whereas X2 now contains a token and RX2 is enabled (Fig. 15.3). We then carry out the same steps for the now enabled RX2, drawing a random number and computing Dt2 and firing time t2 using time t1 as the starting point, removing the token from RX2’s input place and placing one in its output place at t2. After repeating this process for RX3 and RX4, the endpoint of this simulation is reached, in which Y contains a token, and all transitions are disabled, so that the system cannot develop any further. We now know how long it has taken the molecule on this occasion to undergo the full transformation from X1 to Y (or the 9
In practice, the amount of random numbers that are generated by random number generators on digital computers is finite, and limited by the number of bits, nB, that are used to express each generated number. If nB ¼ 16 bits (2 bytes), as is sometimes the case, only 65,536 different numbers can be expressed, which means that the vertical axis will be divided in segments of size 1.5 10 5. This means that the uncertainty in the times corresponding to the lowest, middle, and one but highest segments (01.5 105, 0.50.500015, and 0.9999690.999985) are 1.5 105 k, 3.0 105 k, and 11.0 k! Although most modern random number generators have much greater precision, it is worth keeping in mind that there is always some uncertainty associated with firing times that are computed on the basis of the numbers that they generate.
399
Simple Stochastic Simulation
molecular motor to move from position X1 to Y, etc.), and how long each individual step has taken, and we can plot the ‘‘time trajectories’’ of each place. A typical series of trajectories is plotted in Fig. 15.5A. By starting anew, and repeating the whole process many times, we can generate histograms of the arrival times, and of the ‘‘dwell times’’ (life spans) A
B
Dwell time (X2) Arrival time X1 X2 X3 X4 Y 0.0
0.2
0.4 0.6 Time (s)
0.8
t 0.000
0.6023
0.092
0.092
RX1
Event
0.7506
0.139
0.231
RX2
0.5640
0.083
0.314
RX3
0.8482
0.188
0.503
RX4
4.0
200
0.2
E
0.4 0.6 Time (s)
0.8
0.8 1.2 Time (s)
400
0.4
0
0.0
0.2
0.0 1.6
0.4 0.6 Time (s)
0.8
0. 0 1.0
800
0.8
400
0.4
0 0.0
0.4
0.8 1.2 Time (s)
0.0 1.6
Cumulative density
0.4
Probability density
2.0 1.0
100
0.8
F
Arrival time in Y
200
0 0.0
0. 0 1.0
Accumulated number
0 0.0
800
Cumulative density
8.0
Probability density
Dwell time in X2
Accumulated number
D 400
Number
Δt
1.0
C
Number
Random
Figure 15.5 (A) Typical time trajectories for places X1 to Y in the simple sequential model of Fig. 15.3. Each line represents the state of the place indicated on the left as a function of time (low, no token; high, token present). (B) Computation of the transition firing times using a spreadsheet. Random numbers were drawn using the spreadsheet’s random number generator; time intervals Dt were computed from Eq. (15.11) (k ¼ 10 s 1); time t is the accumulation of the Dt values. (C, E) Distribution of the token dwell times (life spans) in X2 and arrival times in Y, based on 1000 trajectories. (D, F) Accumulated data from (C) and (E). Dashed lines indicate the theoretical probability density and cumulative density distributions, obtained using the spreadsheet’s exponential distribution function (C, noncumulative, Eq. (15.12); D, cumulative, Eq. (15.10), with k ¼ 10), and gamma distribution function (E, noncumulative; F, cumulative, with b ¼ 10 and a ¼ 4).
400
Maria J. Schilstra and Stephen R. Martin
of the tokens in each individual place. Dwell times are obtained by subtracting the firing time of the transition that deposited the token in the place from the firing time of the one that removed it, or, in the case of X1, simply by recording the firing time of RX1. The dwell time of the token in Y is infinite, as there is no transition to remove it. The arrival time of the token in Y is obtained by recording the firing time of RX4. The data in these histograms (Fig. 15.5C and E) may be accumulated (‘‘integrated’’) into new histograms by adding the number in each time slot to the sum of the numbers in all previous time slots. These cumulative histograms show the number of observed dwell or arrival times falling inside a particular time slot or in earlier ones. When normalized to 1 (or 100%), these plots therefore indicate the probability that an event has happened at or before the upper time limit of the slot. In other words, they express the same thing as Eq. (15.10), a cumulative distribution of probabilities. The dwell times of tokens in X2 are determined by the occurrence of a single event (firing of RX2), and the cumulative distribution of these dwell times is, therefore, described by Eq. (15.10). Figure 15.5D contains a plot of the cumulative density (the values of F ) obtained by applying either Eq. (15.10) with k ¼ 10, or the equivalent cumulative exponential distribution function provided in statistical function packages of spreadsheet programs or other computational tools. As the histograms in Fig. 15.5D and F are the integrated versions of those in C and E, C and E are the derivatives of D and F. Since the normalized data in Fig. 15.5D are described with Eq. (15.10), it follows that the data in Fig. 15.5C are described by its derivative, Eq. (15.12): f ¼
dF ¼ keJDt dt
ð15:12Þ
Equation (15.12) is a so-called exponential probability density function, and Eq. (15.10) is the cumulative distribution function for this probability density function. In Fig. 15.5C, the values of f obtained by applying Eq. (15.12) (or equivalent noncumulative exponential distribution function in statistical functions packages) are compared with the histogram data. Unlike the dwell time distribution in Fig. 15.5C, which was derived from a process whose timing is determined by a single event, the arrival time distribution in Fig. 15.5E and F is nonexponential. It can be shown that arrival time distributions in irreversible sequential multistep processes, in which the transition probabilities are the same in each step, are described by the gamma distribution function,10 which is expressed in terms of the ‘‘shape parameter’’a, the ‘‘rate parameter’’ b, and the time interval Dt. In this case, a and b are equal to the number of transitions and the rate constant 10
The gamma distribution function is f ¼ (baDta1ebDt)/G(a), where G(a) ¼ (a 1)! if a is a positive integer.
401
Simple Stochastic Simulation
k, respectively. The mean of gamma distributed values is equal to a/b: the number of steps times the average time taken for each step (which is the lifetime,qtffiffiffiffiffiffiffiffiffi ¼ 1/k), and the standard deviation (the square root of the ffi
variance) is a=b2 . Thus, the theoretical values for the mean and standard deviation on the arrival times in this system are 0.4 and 0.2; we obtained values of 0.401 and 0.205 after recording 1000 trajectories.
6.3. Multiple options The Petri-net in Fig. 15.6 is similar to that in Fig. 15.3, but in this case the reactions in which the item goes from state to state are reversible. Reversible reactions are modeled using two separate transitions, and the places that are on either side provide input for one and output for the other. In the system states with one token in X1 or Y (0 and 4), only one transition is enabled (RX1 or RY); in all other cases, there are two. Firing of either one will yield different new system states (and enabled transitions). Suppose all rate constants kf for the forward reactions (kf, for RX1, RX2f, . . ., RX4f ) are 10 s 1, as in the previous example, and those for the reactions in the reverse direction (kr, for RX2r, RX3r, . . ., RY) are four times slower at 2.5 s 1, and the system is in state 0 when we start looking at it. We can use the method described above to draw a random number and compute the
System state 0
RX1
RX2f
RX3f
RX4f
RX2r
RX3r
RX4r
RY
1
4 X1
X2
X3
X4
Y
Figure 15.6 Petri-net model of a system in which a single molecule or other item undergoes four sequential reversible state transitions. The model is used to estimate the average time it takes until a token appears in Y, the so-called mean first-passage time.
402
Maria J. Schilstra and Stephen R. Martin
time for the first transition firing. After the transition has fired, the system is in state one, and both RX2f and RX2r are enabled. However, RX1f fires four times as fast as RX2r, so if they would fire independently (i.e., if firing of one would not affect the odds of the other firing), about 80 out of 100 firings would originate from RX1f, and 20 from RX2r. It may be understood intuitively that, if a random number is drawn for both transitions, there is a 20% possibility that the value of Dt associated with the slower reaction, RX2r, is the smaller of the two. We can, therefore, decide which transition will fire simply by choosing the one that would fire first. This is illustrated in Fig. 15.7A. If this happens to be RX2f, a token will appear in X3 and enable RX3f and RX3r at the time of firing; if it is RX2r, one will appear in X1,
A
B 1.0
X1
X2
X3
X4
Y
F
0.8 0.6 0.4 0.2 0.0 0.0
0.2
0.4 0.6 Time (s)
0.8
1.0
0.0
0.2
0.4 0.6 Time (s)
0.8
1.0
C
Number
First passage of Y
100
0 0.0
2.0
1.0
0.4
0.8 Time (s)
1.2
Probability density
200
0.0 1.6
Figure 15.7 (A) Cumulative density functions (Eq. (15.10)) for k ¼ 10 (solid gray line) and k ¼ 2.5 (gray dashed line) and comparison of Dt values computed from two sets of random numbers (Eq. (15.11); one random number for each enabled transition), one set for which the smallest Dt value is obtained with the faster transition (solid black lines), and one in which the slower transition ‘‘wins’’ (dashed black lines; smallest Dt values indicated by circles). (B) Typical token trajectories in the Petri-net of Fig. 15.6; the lowest position in each trajectory indicates a token in X1, second lowest, token in X2, highest, token in Y, etc. (C) Distribution of the time intervals between the start and the first appearance of a token in Y, based on 1000 trajectories. The mean firstpassage time obtained from these data is 0.51 0.29, and the dashed line is a gamma distribution constructed on the basis of these values (a ¼ 3.12, b ¼ 6.15 s 1). All computations were again carried out in a spreadsheet program.
Simple Stochastic Simulation
403
reenabling RX1, and in both cases the token will disappear from X2, and both RX2f and RX2r will be disabled. In contrast to the model in Fig. 15.3, this model will always have enabled transitions, so that the system remains dynamic. Figure 15.7B shows four typical trajectories over the first second into the reaction. This model may be used to estimate mean first-passage times, the average amount of time that passes before a particular state (here Y) is first reached from a starting state (here X1) in a sequence of reversible reactions. Figure 15.7C shows the distribution of first-passage times. The mean m and standard deviation s obtained from the first-passage times in 1000 trajectories were used to compute the values of a and b (b ¼ m/s2; a ¼ m b) to construct the gamma distribution function that is shown in the figure.11 Now consider the Petri-net in Fig. 15.3 again. Rather than collecting data from 1000 trajectories as we have done above, we may also start with 1000 tokens in place X1, and record the distribution of tokens over all five places as time progresses. Transition RX1 will now fire 1000 times, until all its tokens have disappeared. Equation (15.11) may again be used to compute the interval to the first transition firing. As J is now a thousand times greater, the interval is of course likely to be significantly smaller than an interval computed for a single token. After RX1 has fired once, the number of tokens in X1 is 999 and that in X2 is 1. Both RX1 and RX2 are now enabled, with the firing propensity of RX1 slightly smaller (999 k) than it was before the event, and that of RX2 (1 k) significantly smaller than that of RX1, but now finite. Again, we determine which transition will fire first by drawing a random number for both enabled transitions, and compute a value for Dt for each based on their firing propensity J. After one of the enabled transitions has fired, and the tokens have been redistributed accordingly, we may repeat this process until 1000 tokens have arrived Y, and all transitions are disabled. Figure 15.8 shows the token redistribution and transition firing count over time in this system. Note that the dwell times in X1 and arrival times in Y are distributed in the same way in the 1- and 1000-token systems. However, trajectories of single items, such as the ones in Figs. 15.5A and 15.7B can only be obtained from simulations in which each place contains a maximum of one token. As tokens are indistinguishable, a particular molecule cannot be associated with a particular token if there is more than one token in one place.
11
As in this case, the value of a is noninteger, the expression for RG(a) in the gamma distribution function is substituted by a more complex, continuous expression G(a) ¼ 0 1xa1e xdx. Many spreadsheet programs supply functions for evaluating the gamma distribution equation, given Dt, a, and b, in its noncumulative as well as its cumulative form.
404
Maria J. Schilstra and Stephen R. Martin
B X1 X2 X3
400
X4 0 0.0
0.4
0.8 1.2 Time (s)
1.6
0.8
800 RX4
0.4
400
0 0.0
0.4
0.8 1.2 Time (s)
0.0 1.6
Cumulative density
800
RX1
Y
Number of firings
Number of tokens
A
Figure 15.8 (A) Token redistribution over all five places in the Petri-net model in Fig. 15.3 as a function of time. The first-order rate constant for all transitions was 10 s 1, and at the start of the simulation there were 1000 tokens in X1, and none in the other places. (B) Number of times transitions RX1 to RX4 have fired as a function of time (unlabeled curves are the data of RX2 and RX3, left to right). Filled gray circles indicate the firing times observed in the stochastic simulation; solid black lines are an exponential cumulative density function for Eq. (15.10) with k ¼ 10 s 1 for RX1 and cumulative gamma distribution functions with b ¼ 10 s 1 and a ¼ 2, 3, and 4 for RX2, RX3, and RX4, respectively. Note that curve X1 (in A) describes the distribution of dwell-times in X1; that the data in Y (A) and RX4 (B) are equal (as the number of tokens in Y registers the number of times RX4 has fired), and that the derivatives of the cumulative distribution functions for RX1, RX2, RX3, and RX4 describe the token arrival time distribution for X1, X3, X4, and Y.
6.4. Pseudo-first-order and second-order reactions The Petri-net in Fig. 15.9 is similar to that in Fig. 15.6, but one of its transitions, RX3f, represents a second-order reaction. The compound Z is consumed in a reaction with X3, and a product, A, is released in the reverse (first-order) reaction represented by RX4r. If there are a very large number of tokens in Z, such that change in the number of tokens over the full time simulation is negligible, we may assume that its concentration is constant. In that case, the reaction is pseudo-first-order (see Eq. (15.8)), and treated in the same way as real first-order reactions. Equation (15.11) is used to compute the interval to the next transition firing, with J equal to k0 X3 if there is a single token, and to nX3 k0 X3 if there are nX3 tokens in X3. The pseudo-first-order rate constant k0 X3 is equal to kX3 Ztot, where kX3 is the second-order rate constant, and Ztot the (constant) total concentration of Z. In this case, it is not necessary to keep track of the actual number of tokens in Z, and a constant value for k0 X3 may be used throughout the simulation. However, if the number of Z particles is relatively small and changes significantly over the course of the simulation, the number of tokens nZ in Z must also be taken into account in the expression for J. Nonetheless, because neither nX3 nor nZ changes between events, J ¼ kX3 nX3 [Z] ¼ kX3 nX3 nZ/(NAV) may be used to compute Dt for RX3.
405
Simple Stochastic Simulation
A Z
X1
RX1
RX2f
RX3f
RX4f
RX2r
RX3r
RX4r
RY
X2
X3
X4
Y
A
B Y X4 X3 X2 X1 Y X4 X3 X2 X1 Y X4 X3 X2 X1
400 1
200 0 400
2
200
nA 3
0 400 200
nZ 0
20
40
60
80
0 100
Time (s)
Figure 15.9 (A) Petri-net model of a system similar to that in Fig. 15.6, but in which RX3f is a second-order reaction in which species Z reacts with X3, and a reaction product (A) is released in reaction RX4r. The rate constants for the first-order reactions (i.e., all reactions except RX3f ) are all 10 s 1 and the volume of the container in which the reaction takes place is 8 10 18 L (8 fl, the volume of a cube with 0.2 mm sides which is the size of a small bacterium). All simulations were started with a token in X1; the figure depicts possible states before and after depletion of the tokens in Z. (B) Trajectories of the token position in X1 to Y (gray lines), the number of tokens in Z (nZ, large black plusses, only shown in panel 3), and in A (nA, smaller black plusses, all panels). Initial value of nZ and values for the second-order rate constant kX3f were
406
Maria J. Schilstra and Stephen R. Martin
7. Summary In summary, the procedure to set up and perform a stochastic simulation of a dynamic biochemical reaction network includes the following steps. 1. Set up the model structure. The model must describe how particular types of molecules or molecular assemblies (‘‘places’’) are transformed by chemical or physical reactions (‘‘transitions’’) into other types. Reactions remove instances (‘‘tokens’’: molecules, molecular assemblies) of their reactant(s) and produce product instances. Reactions perform individual transformations (or ‘‘fire’’) at a particular rate; with each firing being an ‘‘event’’ that occurs instantaneously. The number of instances removed and produced upon a single firing event is specified in the reaction stoichiometry. 2. Associate each reaction with an equation that can be used to evaluate the firing rate, J, under any set of conditions. In the modeling of biochemical reactions, it is reasonable to use the laws of Mass Action: First-order reactions: J ¼ knX Second-order reactions between particles of a different type: J ¼ knX nY =NA V Second-order reactions between particles of the same type (dimerization): J ¼ knX ðnX 1Þ=NA V Here, nX and nY are the number of instances of type X and Y present in the reaction vessel or compartment volume V (the number of tokens in places X and Y), and k is the first- or second-order rate constant for the reaction. Other expressions for J are allowed, but J must be constant between events. If it is not, the central Eq. (15.11) is no longer valid. 3. Decide how many instances of each type there are at t0, the beginning of the simulation, and set an end time, tend, for the simulation. Set the simulated time t to t0. 4. For each reaction Ri, compute its value Ji, randomly draw a number ri between 0 and 1 (0 ri < 1) from a large, uniformly distributed set, and use these values to calculate the putative time, ti at which it will fire next: ti ¼ t þ
lnð1 ri Þ Ji
100,000 (20 mM) and 500 M 1s 1 (panels 1 and 2) or 400 (80 mM) and 5 107 M 1s 1 (panel 3). In the simulation shown in panel 1, a pseudo-first-order approximation was used (with nZ constant). In those in panels 2 and 3, the actual value of nZ after each event was used in the computation of J and Dt. Comparison of panels 1 and 2 shows that the results are very similar if nz changes relatively little (about 0.2%).
Simple Stochastic Simulation
407
5. Decide which reaction has produced tmin, the smallest value of ti. If tmin < tend, the earliest event will occur within the maximum simulation time. In this case, set the simulated time t to tmin, and let the transition that produced tmin proceed by removing the specified number of instances from its reactant(s) and adding new ones to its product(s). This token redistribution changes the overall state of the system. 6. To continue the simulation, the procedure is repeated from point 4.
8. Notes 1. Improving efficiency. Owing to the special properties of exponential equations (Eqs. (15.9) and (15.10)), computing new firing times for all reactions after an event is justified, but not entirely necessary if the firing propensity of some reactions is unchanged by the event. Gibson and Bruck (2000) have described a method in which the dependencies in the reaction network (expressing which reactions affect the firing propensity of which other reactions) are identified, and taken into account in the reevaluation of the system. They coined the names ‘‘First Reaction Method’’ for the method in which all transitions are evaluated, and ‘‘Next Reaction Method’’ for their variant. In another important variant of the First Reaction Method, known as Gillespie’s Direct Method, Gillespie (1977), just two random numbers are drawn per evaluation round. Here, the sum of all propensities is used to compute the firing time, and the reaction that fires at that time is selected through a lottery in which each reaction’s chance of being chosen is proportional to its firing propensity. Dependent on the characteristics of the modeled network, either variant may improve the efficiency of the simulation. However, as both incur some computational overhead, the First Reaction Method, which is easiest to implement, is often equally efficient in small systems. 2. Modeling more complex systems. Events included in a stochastic simulation do not have to be just chemical reactions. Any process that can be associated with a probability can be included as an event in the simulation, and stochastic simulations really come into their own systems with specific localized spatial characteristics. For example, a molecular motor moving along an actin or microtubule track will periodically and randomly encounter an obstruction, or may ‘‘jump’’ from one track to another. Likewise, molecules moving in a cellular environment will frequently encounter obstacles. Stochastic simulations may also be easily adapted to model the behavior of a single geometrically complex structure such as, for example, the end of a microtubule (Martin et al., 1993). Note that the graphical Petri-net notation may be helpful in the model design state; however, its use is not essential. Although its computational
408
Maria J. Schilstra and Stephen R. Martin
implementation may be compact, the drawings quickly become unwieldy as the systems get larger. Moreover, some systems lend themselves better to description in the form of Petri-nets than others. Systems such as those mentioned above are more easily and more efficiently expressed in purpose-built code, outside of the straightjacket of the Petri-net formulation or the equivalent chemical reaction notation. 3. Modeling and simulation in practice. Simple stochastic simulations, such as the ones presented in Figs. 15.5 and 15.7, are easily performed in a conventional spreadsheet. Obviously, this requires an excellent understanding of the stochastic approach, and we recommend newcomers to try implementing these examples first. For more complex systems, the ability to write computer code offers a considerable advantage. This code may equally well be written in interpreted ‘‘scripting’’ languages such as Python, Perl, or VBA as in compiled ones such as Java, Fortran or C/Cþþ, or in numerical computing environments such as MATLAB, Mathematica, Octave, or R. Because of its simplicity, implementation of the First Reaction algorithm as outlined in Section 11 is an ideal goal for those who would like to familiarize themselves with numerical problem solving. The ability to write code gives a programmer the ultimate control over the model and its simulation, input, output, and presentation. In addition, there is nowadays a raft of software tools (e.g., see the list of SBML-supporting packages on http://smbl.org) that will allow users to enter a set of (bio)chemical reactions, a set of parameters, and an initial state, and perform a stochastic simulation. These tools usually implement a First Reaction Method variant, sometimes complemented with accelerated approximate techniques such as Tau-leaping (Gillespie, 2001), or methods based on the Langevin or Fokker-Planck equations. Because the expressions for the reaction firing propensity J can, in combination with the reaction stoichiometry, be used directly to construct the ODEs for the system, some tools also offer facilities for deterministic simulation. These tools allow the user to quickly set up a model, perform a simulation, and obtain the results in a convenient format. Few, if any of these tools are suitable for modeling spatial inhomogeneity or structures with geometric complexity. Regrettably, however, such tools contribute little to their users’ understanding of the principles that lie beneath the stochastic approach. 4. Further reading. The first section in molecular biology and biochemistry textbooks is often dedicated to the kinetics and thermodynamics of biochemical reaction systems (e.g., see http://en.wikibooks.org/wiki/ Biochemistry). More extensive information on this subject may be found in Atkins and de Paula (2006). Cornish-Bowden (1999) explains basic concepts from mathematics, including exponents and logarithms, differential and integral calculus, and statistics, aiming at students of biochemistry. Wilkinson (2006) provides an extensive formal introduction to stochastic modeling in Systems Biology. The book ‘‘Systems
Simple Stochastic Simulation
409
Modelling in Cellular Biology’’ (Szallasi et al., 2006) contains some excellent chapters on stochastic modeling and related numerical simulation methods, notably (Gillespie and Petzold, 2006; Kruse and Elf, 2006; Paulsson and Elf, 2006). Useful reviews discussing recent developments are found in the work of Gillespie (2007) and Pahle (2009).
REFERENCES Atkins, P. W. (1994). Physical Chemistry. Oxford University press, Oxford, UK. Atkins, P. W., and de Paula, J. (2006). Physical Chemistry for the Life Sciences. W. H. Freeman and Company, New York, NY. Bayley, P. M., Schilstra, M. J., and Martin, S. R. (1990). Microtubule dynamic instability: A numerical simulation of experimental microtubule properties using the lateral cap model. J. Cell Sci. 95, 33–48. Bortz, A. B., Kalos, M. H., and Lebowitz, J. L. (1975). A new algorithm for Monte Carlo simulation of Ising spin systems. J. Comput. Phys. 17, 10–18. Cornish, P. V., and Ha, T. (2007). A survey of single-molecule techniques in chemical biology. ACS Chem. Biol. 2, 53. Cornish-Bowden, A. (1999). Basic mathematics for biochemists. Oxford University Press, New York, NY. Gibson, M. A., and Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. A 104, 1876–1889. Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comp. Phys. 22, 403–434. Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361. Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. J. Comp. Phys. 115, 1716–1733. Gillespie, D. T. (2007). Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem. 58, 35–55. Gillespie, D. T., and Petzold, L. R. (2006). Numerical simulation for biochemical kinetics. In ‘‘System Modeling in Cellular Biology,’’ (Z. Szallasi J. Stelling and V. Periwal, eds.), pp. 331–353. MIT Press, Cambridge, MA. Kleutsch, B., and Frehland, E. (1991). Monte-Carlo-simulations of voltage fluctuations in biological membranes in the case of small numbers of transport units. Eur. Biophys. J. 19, 203–211. Kruse, K., and Elf, J. (2006). Kinetics in spatially extended system. In ‘‘System Modeling in Cellular Biology,’’ (Z. Szallasi J. Stelling and V. Periwal, eds.), pp. 177–198. MIT Press, Cambridge, MA. Martin, S. R., Schilstra, M. J., and Bayley, P. M. (1993). Dynamic instability of microtubules: Monte-Carlo simulation and application to different types of microtubule lattice. Biophys. J. 65, 578–596. Pahle, J. (2009). Biochemical simulations: Stochastic, approximate stochastic and hybrid approaches. Brief. Bioinform. 10, 53–64. Paulsson, J., and Elf, J. (2006). Stochastic modelling of intracellular kinetics. In ‘‘System Modeling in Cellular Biology,’’ (Z. Szallasi J. Stelling and V. Periwal, eds.), pp. 149–175. MIT Press, Cambridge, MA. Szallasi, Z., Stelling, J., and Periwal, V. (eds.) (2006). In ‘‘System modeling in cellular biology’’ MIT Press, Cambridge, MA. Wilkinson, D. J. (2006). Stochastic Modelling for Systems Biology. Chapman & Hall/CRC, London, UK.
C H A P T E R
S I X T E E N
Monte Carlo Simulation in Establishing Analytical Quality Requirements for Clinical Laboratory Tests: Meeting Clinical Needs James C. Boyd and David E. Bruns Contents 412 414 414 415 416 417 417 427 429 431
1. Introduction 2. Modeling Approach 2.1. Simulation of assay imprecision and inaccuracy 2.2. Modeling physiologic response to changing conditions 3. Methods for Simulation Study 4. Results 4.1. Yale regimen 4.2. University of Washington regimen 5. Discussion References
Abstract Introduction. Patient outcomes, such as morbidity and mortality, depend on accurate laboratory test results. Computer simulation of the effects of test performance parameters on outcome measures may represent a valuable approach to defining the quality of assay performance that is needed to provide optimal outcomes. Methods. We carried out computer simulations of patients on intensive insulin treatment to determine the effects of glucose meter imprecision and bias on (1) the frequencies of glucose concentrations >160 mg/dL; (2) the frequencies of hypoglycemia (160 mg/dL increased with negative assay bias and assay imprecision; and frequencies of hypoglycemia increased with positive assay bias and assay imprecision. Nevertheless, each regimen displayed unique sensitivity to variations in meter imprecision and bias. Conclusions. Errors in glucose measurement exert important regimendependent effects on glucose control in intensive IV insulin administration. The results of this proof-of-principle study suggest that simulation of the clinical effects of measurement error is an attractive approach for assessment of assay performance requirements.
1. Introduction Quantitative laboratory measurements play an increasingly important role in medicine. Well-known examples include (a) quantitative assays for cardiac troponins for diagnosing acute coronary syndromes (heart attacks) (Morrow et al., 2007) and (b) measurements of LDL cholesterol to guide decisions on use of statin drugs (Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults, 2001). Errors in these assays are recognized to lead to misdiagnoses and to inappropriate treatment or lack of appropriate treatment. A growing problem is to define how accurate laboratory measurements must be. Various approaches have been used to define quality specifications (or analytical goals) for clinical assays. A hierarchy of these approaches has been proposed (Fraser and Petersen, 1999; Petersen, 1999). At the low end of the hierarchy, performance of an assay can be compared with the ‘‘state of the art’’ or with performance criteria set by regulatory agencies or with the opinions of practicing clinicians or patients. A step higher, biology can be used as a guide to analytical quality by considering the average inherent biological variation within people; for substances whose concentration in plasma varies dramatically from day to day, there is less pressure to have assays that are highly precise because the analytical variation constitutes a small portion of the total variation. However, none of these approaches directly examines the relationship between quality of test performance and clinical outcomes. The collected opinions of physicians are likely to be
Simulation Studies for Analytical Quality Goals
413
anecdotal and reflect wide variation of opinion, whereas criteria based on biological variation or analytical state of the art, or the criteria of regulatory agencies may have no relation to clinical outcomes. Few studies have examined analytical quality requirements based on the highest criterion, patient outcomes. Clinical trials to determine the effects of analytical error on patient outcomes (such as mortality) are extremely difficult to devise and expensive (Price et al., 2006). Unlike trials of new drugs, the costs of such studies are large in relation to the potential for profit. For ethical reasons, it may be impossible to conduct prospective randomized clinical trials in which patients are randomized to different groups defined by use of high- or low-quality analytical testing methods. The common lack of standardization of methods used in different studies usually undermines efforts to draw useful general conclusions on this question based on systematic reviews of published and unpublished clinical studies. By contrast to these approaches, computer simulation studies allow a systematic examination of many levels of assay performance. There are common clinical situations in which patient outcomes are almost certainly connected with the analytical performance of a laboratory test. These situations represent ideal models for using simulation studies. One such situation occurs when a laboratory test result is used to guide the administration of a drug: a measured concentration of drug or of a drug target determines the dose of drug. Errors of measurement lead to selection of an inappropriate dose of drug. A common example is the use of measured concentrations of glucose to guide the administration of insulin. In this situation, higher glucose concentrations are a signal that a higher dose of insulin is needed, whereas a low glucose concentration is a signal to decrease or omit the next dose of insulin. Several years ago, we carried out simulation modeling of the use of home glucose meters by patients to adjust the patients’ insulin doses (Boyd and Bruns, 2001). In clinical practice, the insulin dose has been determined from a table that relates the measured glucose concentration and the dose of insulin to give. We examined the relationships between errors in glucose measurements and resulting errors in selection of the insulin dose that is appropriate for the true glucose concentration. The simulation model addressed glucose meters with specified bias (average error) and imprecision (variability of repeated measurements of a sample, expressed as coefficient of variation (CV)). We found that to select the intended insulin dosage 95% of the time required that both the bias and the CV of the glucose meter be 160 mg/dL); (2) the frequency of plasma glucose concentrations in the hypoglycemic range (defined as > > > : ; 0:113 0:399 þ 0:514i 0:399 0:514i 0:177 82 3 2 3 2 3 2 39 0:052 0:360 1:57i 0:360 þ 1:57i 0:682 > > > > > > > : ; 0:111 0:249 0:057i 0:249 þ 0:057i 0:184 82 3 2 3 2 3 2 39 3:23 4:53i 3:23 þ 4:53i 0:119 0 > > > > 0 0 > > > : ; 0:493 0:31055i 0:493 þ 0:31055i 0:108 0
443
Nonlinear Dynamical Analysis and Optimization
Note that the order of the eigenvectors and eigenvalues in their respective sets was chosen so that ith eigenvector in each set corresponds to the ith eigenvalue for the same rest point with i 2 f1; 2; 3g (e.g., the third entry in Vh is the eigenvector for the third entry in lh). Two of the points, xl and xh, corresponding to low and high cortisol concentration, respectively, are stable in the sense that both Df ðxl ; pnom ; 0; 0Þ and Df ðxh ; pnom ; 0; 0Þ have only eigenvalues whose real parts are negative. The steady-state point, xm corresponding to an intermediate cortisol concentration is unstable in the sense that the linearization Df ðxm ; pnom ; 0; 0Þ has one eigenvalue with a positive real part and three eigenvalues with negative real parts. Using the stable manifold theorem in (Chicone, 1999), it is possible to show that the point xm lies on an invariant manifold of codimension one. Furthermore, this manifold separates the basin of attraction of points xl and xh. Theorem 17.1 (Chicone, 1999) Suppose that S : Rk ! Rk and U : Rl ! Rl are linear transformations such that all eigenvalues of S have real part less than a, all eigenvalues of U have real part greater than b, and a < b. If F 2 C 1 ðRk Rl ; Rk Þ and G 2 C 1 ðRk Rl ; Rl Þ are such that Fð0; 0Þ ¼ 0, DFð0; 0Þ ¼ 0, Gð0; 0Þ ¼ 0, DGð0; 0Þ ¼ 0 , and such that kFk1 and kGk1 are sufficiently small, then there is a unique function a 2 C 1 ðRk ; Rl Þ, with the following properties: að0Þ ¼ 0;
Dað0Þ ¼ 0;
e 2 sup Rk kDaðeÞk < 1
whose graph, namely the set W ð0; 0Þ ¼ fðz; yÞ 2 Rk Rl : y ¼ aðzÞg is an invariant manifold for the system of differential equations given by Eq. (17.1) z_ ¼ Sz þ Fðx; yÞ;
y_ ¼ Uy þ Gðz; yÞ
ð17:3Þ
To see how Theorem 17.1 is useful for analysis of the HPA axis system let x ¼ x xm . Furthermore, consider the Taylor series expansion of the mapping f from system (17.2) with respect to x, with u ¼ d ¼ 0. f ¼ Df ðxm Þ x þ Bð xÞ Furthermore, let 2 0:0196 6 0:00593 P¼6 4 0:330 0:00528
0:00505 0:0960 0:00702 0:0108
0:488 0:0510 0:161 0:00276
be a linear transformation such that the matrix
ð17:4Þ 3 0:0172 0:00184 7 7 0:328 5 0:0773
ð17:5Þ
444
Amos Ben-Zvi and Jong Min Lee
2
0:123 6 0 L ¼ P 1 Df ðxm ÞP ¼ 6 4 0 0
0 9:90 0 0
0 0 1:00 0:660
3 0 0 7 7 0:660 5 1:00
ð17:6Þ
is in real-Jordan form. In the coordinates x~ ¼ P 1 x, and using the Taylor series expansion shown in Eq. (17.4), system (17.2) can be written as x~_ ¼ L~ x þ BðP~ xÞ
ð17:7Þ
With respect to Theorem 17.1, let $k = 3$, $l = 1$, $y = \tilde{x}_1$, and $z = [\tilde{x}_2, \tilde{x}_3, \tilde{x}_4]^T$, where $\tilde{x}_i$ denotes the $i$th element of $\tilde{x}$. Furthermore, let

$$S = \begin{bmatrix} -9.90 & 0 & 0 \\ 0 & -1.00 & 0.660 \\ 0 & -0.660 & -1.00 \end{bmatrix}$$

and $U = [0.123]$. Note that all eigenvalues of $S$ are negative. Let the numbers $a$ and $b$ in Theorem 17.1 be $0 - \epsilon$ and $0 + \epsilon$, respectively, with $\epsilon > 0$ being an arbitrarily small number. Define the functions $F$ and $G$ by the relation

$$B(P\tilde{x}) = B(z, y) = \begin{bmatrix} G(z, y) \\ F(z, y) \end{bmatrix} \tag{17.8}$$

Note that $G(z, y)$ and $F(z, y)$ can be written as a sum of polynomials in $z$ and $y$ of order two or higher. As a result, $G(0, 0) = 0$, $F(0, 0) = 0$, $DG(0, 0) = 0$, and $DF(0, 0) = 0$, and all conditions of Theorem 17.1 are met. The conclusion of Theorem 17.1 implies that there exists an invariant manifold defined by the constraint $y = \alpha(z)$. However, computation of the function $\alpha: \mathbb{R}^3 \to \mathbb{R}$ is computationally intensive. As a result, an approximation to $\alpha$ will be computed. This approximation is based on the linearization of Eq. (17.3). In particular, if $y(0) = 0$, then about the steady-state point $(y, z) = (0, 0)$ the differential equation becomes

$$\dot{z} = Sz \tag{17.9}$$
and it is exponentially stable. As a result, the set of points belonging to the invariant manifold described by Theorem 17.1 can be approximated locally about $(y, z) = (0, 0)$ by the set $Z = \{z \in \mathbb{R}^3 : \|z\| < \epsilon_z\}$, where $\epsilon_z > 0$ is a small positive number. Computationally, the set $Z$ may be mapped into the $x$ coordinates by choosing a value $z \in Z$ and computing

$$x = x_m + \bar{x} = x_m + P\tilde{x} = x_m + P\,[0, z^T]^T \tag{17.10}$$
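For readers implementing this analysis, the following Python sketch (ours, not the authors') illustrates the two numerical steps just described: classifying a rest point by the eigenvalues of its linearization, and sampling the local approximation of the separating manifold via Eq. (17.10). The rest point x_m, the Jacobian Df(x_m), and the transformation P are placeholders to be supplied by the model; the values shown are purely hypothetical.

```python
import numpy as np

def classify_rest_point(Df):
    """Count eigenvalues of the linearization Df with positive real part;
    0 means the rest point is locally stable, 1 matches x_m above."""
    eigvals = np.linalg.eigvals(Df)
    return eigvals, int(np.sum(eigvals.real > 0.0))

def sample_manifold_points(x_m, P, eps_z=0.05, n=1000, seed=0):
    """Sample the first-order manifold approximation of Eq. (17.10):
    x = x_m + P [0, z^T]^T, with z drawn from a small ball in R^3
    (the stable directions; the unstable coordinate y = x~_1 is 0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 3))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    z *= eps_z * rng.random((n, 1)) ** (1.0 / 3.0)  # uniform in the ball
    x_tilde = np.hstack([np.zeros((n, 1)), z])      # [y, z1, z2, z3]
    return x_m + x_tilde @ P.T                      # rows are x points

# Hypothetical placeholders for illustration only:
x_m = np.array([0.65, 0.05, 0.35, 0.05])
P = np.eye(4)
points = sample_manifold_points(x_m, P)
```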
The surface generated by mapping the set $Z$ into the $x$ coordinates can be projected into three dimensions using the projection mapping $pr: (x_1, x_2, x_3, x_4) \mapsto (x_1, x_2, x_3)$. The projection is shown in Fig. 17.3A. The surface shown in Fig. 17.3A is a first-order approximation of the invariant manifold separating the basins of attraction of the points $x_h$ and $x_l$. This idea is illustrated in Fig. 17.3, where the system dynamics are integrated forward in time for $t \in [0, 5]$ from different points on the surface. As can be seen from Fig. 17.3, the system trajectories approach $x_m$ for short times. For longer times ($t > 500$), however, the system trajectories are driven away from $x_m$ and toward either $x_h$ or $x_l$. This is shown in Fig. 17.4, which shows system trajectories for $t \in [0, 500]$. In order for a treatment to be successful, it must drive the HPA axis system from $x_l$ to $x_h$. To do this, the trajectory of the system must penetrate the boundary approximated by the surface shown in Fig. 17.3A. Generally, the prescription and application of medicine is done in a two-step iterative process: first, the condition of the patient is assessed (observation); then a treatment is administered (i.e., control is applied). Finally, the condition of the patient is reassessed and further treatment is chosen as necessary. Typically, a patient will take their medication over a time span of several days or weeks. As a result, most medical treatment is, in effect, an "open-loop" process in which measurements are taken infrequently, and typically only to verify that the open-loop treatment has worked. As a result, it is not generally possible to monitor the condition of a patient to determine when the boundary illustrated in Fig. 17.3A has been crossed. Rather, one must rely on a sufficiently robust open-loop treatment so that the boundary is likely to be crossed even if there is significant system-model mismatch or there are exogenous disturbances.
3.2. Evaluation of treatment options

The approximate boundary computed in the previous section corresponds to a nominal and disturbance-free model. If there were no modeling errors and no unmodeled exogenous disturbances, then an optimal treatment employing the minimum amount of therapeutic intervention needed to penetrate the surface could be computed. For example, the optimal dosage for a drug could be chosen as the minimal dosage necessary to drive the HPA axis across the boundary. This idea is illustrated in Fig. 17.5, where the system trajectories under constant dosages corresponding to input values of $u(t) = 0.5$, $u(t) = 0.75$, $u(t) = 0.8$, and $u(t) = 2$ for $t \in [0, 15]$ are shown. The point along each trajectory where treatment stops is shown as a black square along the trajectory path. As shown in Fig. 17.5, the input corresponding to $u(t) = 0.75$ is insufficient to drive the HPA axis system across the boundary; that is, the black square located along the trajectory corresponding to $u(t) = 0.75$ is above the boundary plane.
Figure 17.3 Numerical integration of system (17.2) with initial conditions in $Z$ for $0 \le t \le 5$. (A) The projection of the set $Z$ onto three dimensions in the $x$ coordinates. (B) Scenario 2.
Figure 17.4 Long-time numerical integration of system (17.2) with initial conditions in $Z$ for $0 \le t \le 500$.
The just-sufficient input ($u(t) = 0.8$) does force the system across the boundary and therefore drives the system to the new equilibrium. Finally, the high-dosage trajectory ($u(t) = 2$) also forces the system to cross the boundary and drives the system to the new equilibrium. The disadvantage of using a high dosage is that, as shown in Fig. 17.5, the state trajectories do not stay close to the equilibrium conditions. As a result, if this treatment were applied, it would cause severe deviations in hormone concentrations. These deviations can cause serious side effects, including high blood pressure, an activated immune system, weight gain, and irritability. The advantage of using a high dosage is that, as shown in Fig. 17.5, it drives the HPA axis well past the boundary plane. As a result, the high-dosage treatment is likely to be effective even if the computed boundary surface is wrong due to modeling inaccuracies. This idea is illustrated in Fig. 17.6, where the responses of the system to inputs of $u = 0.8$ and $u = 2$ are compared for two different values of the parameter $k_{ad}$, one corresponding to the nominal model ($k_{ad} = 10.0$) and one corresponding to a small modeling error ($k_{ad} = 9.75$). As shown in Fig. 17.6, while an input of $u(t) = 0.8$ is effective under nominal conditions, it is likely to be ineffective even for small errors in the nominal model. An ideal control strategy would, therefore, not only penetrate the boundary but also drive the process some distance past it. The ideal control strategy (i.e., treatment) would balance three competing objectives.
Figure 17.5 Constant dosage trajectories for $t \in [0, 15]$: (A) $u(t) = -0.5$; (B) $u(t) = -0.75$; (C) $u(t) = -0.80$; (D) $u(t) = -2.0$. Axes: $x_1$ (CRH concentration), $x_2$ (ACTH concentration), $x_3$ (free GR concentration). The square on the trajectory path is the point where treatment stops. The star and circle are the hypocortisolic and healthy rest points.
First, it should drive the HPA axis well past the nominal boundary and therefore ensure that the system will stabilize about the healthy equilibrium. Second, it would prevent large deviations in hormone levels. Finally, an ideal control strategy would minimize the amount of input action required to achieve the first two objectives. Such a control strategy would be robust in the face of modeling errors, would cause minimal side effects, and would minimize the cost of treatment.
3.3. Development of an appropriate optimal control objective

A general dynamic optimization problem commonly found in the optimal control literature is as follows:

$$\min_{u_0, \ldots, u_{t_f - 1}} \; \sum_{k=0}^{t_f - 1} \phi(x_k, u_k) + \bar{\phi}(x_{t_f}) \tag{17.11}$$

with the state transition rule given by Eq. (17.1) for a given initial state $x_0$ and piecewise constant inputs $u(t) = u_k$ and $d(t) = d_k$ for $kh \le t < (k+1)h$.
Figure 17.6 Constant dosage trajectories for $u(t) = 0.8$ (dashed) and $u(t) = 2$ (solid) with $t \in [0, 15]$ for two different values of the parameter $k_{ad}$; the equilibria of the nominal system and of the system with $k_{ad} = 9.75$ are both marked. The black square along the trajectory path is the point where treatment stops.
$h$ is the sampling time and $x_k$ represents the value of $x$ at the $k$th sample time (i.e., $x(t)$ at $t = hk$). $\phi$ is the single-stage cost function and $\bar{\phi}$ is the terminal state cost function at time $t_f$. In MPC, a typical form of the objective function is

$$\phi(x_k, u_k) = (x_k - r_k)^T Q (x_k - r_k) + (u_k - u_{k-1})^T R (u_k - u_{k-1})$$
$$\bar{\phi}(x_{t_f}) = (x_{t_f} - r_{t_f})^T Q_{t_f} (x_{t_f} - r_{t_f}) \tag{17.12}$$

where $r_t$ is a reference point at time $t$, and $Q$, $R$, and $Q_{t_f}$ are weighting matrices with proper dimensions that make the stagewise cost scalar. MPC solves Eq. (17.11) with the stagewise cost of Eq. (17.12) in the context of finding an open-loop input trajectory offline for a fixed finite-time process. Given a feedback measurement of $x$ at each time, it can also solve the problem online in a receding-horizon fashion (Morari and Lee, 1999). The single-stage cost in Eq. (17.12) cannot explicitly incorporate all of the control objectives because choosing a reference trajectory is overly restrictive. Specifically, the goal of constraining the states is to reduce the side effects of treatment; the time-dependence implied by choosing a specific trajectory $r_t$ is not necessary. This idea is illustrated in Fig. 17.7.
Figure 17.7 NLMPC-generated trajectories for $t \in [0, 15]$: (A) NLMPC trajectory with $Q = 1/100$, $R = 0$; the end of treatment coincides with the healthy equilibrium point. (B) NLMPC trajectory with $Q = 0$, $R = 1/100$. The square on the trajectory path is the point where treatment stops. The star and circle are the hypocortisolic and healthy rest points.
In Fig. 17.7A and B, the trajectories generated by the NLMPC solution are shown for $R = 0$ and $Q = 0$, respectively. As can be seen from Fig. 17.7, the NLMPC solution that seeks to minimize the input effort (i.e., $Q = 0$) also provides a trajectory with the least state deviation. It can also be seen that choosing $Q = 0$ results in a much more robust trajectory (the end of treatment coincides with the healthy equilibrium). These qualitative aspects of the trajectories in Fig. 17.7 are not obvious from the choice of $Q$ and $R$. As a result, one is
forced to seek an alternative approach for the development of an appropriate objective function.

A clinically relevant and straightforward objective function for computing treatment is suggested here. A stagewise cost function is designed to yield a control law that maintains the controlled trajectories in a user-defined tube containing a line connecting an initial state of sickness to the corresponding healthy state, as shown in Fig. 17.8. A quantitative definition of the stagewise cost is proposed as

$$\phi(x_k, u_k) = \begin{cases} 0 & d < m \text{ and } x_{k+1} \in X_s \\ 50l & d < m \text{ and } x_{k+1} \notin X_s \\ 500d & d \ge m \text{ and } x_{k+1} \in X_s \\ 50l + 500d & d \ge m \text{ and } x_{k+1} \notin X_s \end{cases} \tag{17.13}$$

where $d$ is the distance between the state point at time $k+1$ and the straight line connecting the sick initial state and the corresponding healthy state, and $m$ is a user-specified radius of the tube. $X_s$ is the set of states from which no further control action ($u = 0$) is necessary for the system to settle in a healthy state; this set can be found by integration of Eq. (17.2) with $u$ and the disturbance $d$ set to zero. The scaling factors of the stagewise cost and the radius $m$ are specified by the user according to the relative importance of $l$ and $d$ as well as the orders of magnitude of the distances. Since the logic included in the single-stage cost requires integer variables, the resulting MPC optimization problem becomes a multihorizon mixed-integer nonlinear program (MINLP).
Figure 17.8 Design of clinically relevant objective function for HPA axis system (labels: $x_{sick}$, $x_{k+1}$, $x_{healthy}$, tube radius $m$, distances $d$ and $l$).
Though the objective function is straightforward to formulate, the optimization problem is very difficult to solve because the number of optimization variables grows with the horizon length, in addition to the inherent complexity of MINLP.
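To make the logic of Eq. (17.13) concrete, the following Python sketch (our illustration, not code from the chapter) evaluates the stagewise cost. The membership test for X_s is assumed to be supplied, for example by integrating Eq. (17.2) with zero input and disturbance; our reading of l as the distance remaining to the healthy state along the line is an assumption based on Fig. 17.8.

```python
import numpy as np

def stagewise_cost(x_next, x_sick, x_healthy, in_Xs, m=0.05):
    """Piecewise stagewise cost of Eq. (17.13).

    x_next    : state at time k+1
    x_sick    : sick initial state; x_healthy: healthy target state
    in_Xs     : callable returning True if a state is in the set X_s
    m         : user-specified tube radius
    """
    v = x_healthy - x_sick
    v_hat = v / np.linalg.norm(v)
    w = x_next - x_sick
    proj = (w @ v_hat) * v_hat
    d = np.linalg.norm(w - proj)                     # distance to the line
    l = np.linalg.norm(x_healthy - (x_sick + proj))  # assumed meaning of l
    if d < m:
        return 0.0 if in_Xs(x_next) else 50.0 * l
    return 500.0 * d if in_Xs(x_next) else 50.0 * l + 500.0 * d
```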
4. Dynamic Programming

Dynamic programming (DP) offers an alternative approach to solving multistage optimal control problems (Bellman, 1957). The approach involves stagewise calculation of so-called "cost-to-go" values for all states. The cost-to-go of a state is the sum of all costs that one can expect to incur under a given policy starting from that state, and hence expresses the quality of a state in terms of achievable future performance. Though MPC has been the most popular advanced control technique in the process industry, owing to its ability to handle large multivariable systems with constraints, DP has several advantages over MPC in solving biological/biomedical dynamic optimization problems. First, DP provides flexibility for choosing a complex stagewise cost because DP reduces a multistage optimization problem to a single-stage one by encoding long-term performance in a cost-to-go function. The single-stage optimization problem is solved by calculating the optimal action that minimizes the sum of the current stagewise cost and the cost-to-go of the successor state. Another advantage of DP is its ability to take uncertainty into account in the optimal control calculation, whereas conventional MPC ignores the uncertainty and feedback at future time points and solves a deterministic open-loop optimal control problem.
4.1. DP for deterministic systems

For a deterministic system, where the successor state can be exactly evaluated given the current state and input values, the entire future sequence of states and actions is determined by a fixed starting state and a deterministic policy. The cost-to-go function, $J$, under a policy $\mu$ is the sum of stagewise costs up to the end of the horizon:

$$J_k^{\mu}(x_k) = \sum_{i=k}^{t_f - 1} \phi(x_i, u_i) + \bar{\phi}(x_{t_f}) \tag{17.14}$$
where $u_i = \mu(x_i)$. The optimal cost-to-go function, $J^*$, is the cost-to-go function under an optimal policy $\mu^*$ and is unique:

$$J^* = \min_{\mu \in \Pi} J^{\mu} = J^{\mu^*} \tag{17.15}$$
where $\Pi$ is the set of all possible deterministic policies that map $x_i$ to $u_i$. For the finite-horizon problem of Eq. (17.14), the optimal cost-to-go function should satisfy the following Bellman optimality equation:

$$J_k^*(x_k) = \min_{u_k} \{\phi(x_k, u_k) + J_{k+1}^*(x_{k+1})\} \tag{17.16}$$
To solve the above optimality equation, sequential calculation of $J_k^*$ for all state points is performed in a backward manner, starting from the terminal stage with $J_{t_f}^* = \bar{\phi}(x_{t_f})$. With the optimal cost-to-go function for the $k+1$ stage, $J_{k+1}^*(x_{k+1})$, calculated offline, the following single-stage problem, which is equivalent to the $t_f$-stage problem defined earlier, is solved to compute an optimal control action for any given state $x$ at time $k$:

$$u_k^* = \arg\min_{u_k} \{\phi(x_k, u_k) + J_{k+1}^*(x_{k+1})\} \tag{17.17}$$
The infinite-horizon formulation, in which $t_f$ is set to infinity, was shown to be advantageous in the context of system stability and feasibility of optimal solutions for systems without a termination time (Rawlings and Muske, 1993). A typical objective function of infinite-horizon problems is given as

$$\min_{u_0, u_1, \ldots} \; \sum_{i=0}^{\infty} \gamma^i \phi(x_i, u_i) \tag{17.18}$$
where $\gamma \in (0, 1)$ is a discount factor. It should be noted that $\phi$ is not limited to a certain type of norm, and $\gamma$ is used to prevent the total cost from diverging to infinity. For the infinite-horizon problem, by letting $k \to \infty$, the following Bellman equation is obtained:

$$J_{\infty}^*(x_k) = \min_{u_k} \{\phi(x_k, u_k) + \gamma J_{\infty}^*(x_{k+1})\} \tag{17.19}$$
The above Bellman equation is solved offline for all possible states, and a single cost-to-go function is obtained regardless of the time point. There are two conventional approaches for computing the cost-to-go function offline: value iteration and policy iteration. In this chapter, value iteration will be used for its simplicity. In value iteration, one starts with an initial guess, usually zero, for the cost-to-go of each state and iterates on the Bellman equation until convergence. This is equivalent to calculating the cost-to-go value for each state by assuming an action that minimizes the sum of the current stage cost and the cost-to-go of the next state according to the current estimate; hence, each update assumes that the calculated action is optimal. The algorithm involves the following steps.
1. Discretize the continuous state space into a finite number of state points, $x_i \in X_i$ ($i = 1, \ldots, N$).
2. Initialize $J^0(x_i)$ as 0 for all $x_i$.
3. In the $j$th iteration, obtain the $(j+1)$th estimate of the cost-to-go for each state $x_i$:

$$J^{j+1}(x_i) = \min_{u_i} \{\phi(x_i, u_i) + \gamma J^j(\hat{x})\} \tag{17.20}$$

where $\hat{x}$ is the successor state of $x_i$, obtained by integrating Eq. (17.1) with the constant inputs $u_i$ and $d_i$. $\hat{x}$ may not be found in $X_i$, and thus estimation of the cost-to-go for $\hat{x}$ is necessary. It was shown that instance-based local averaging schemes, such as the $k$-nearest-neighbor method, provide stable offline learning for this estimation (Lee et al., 2006).
4. Repeat step 3 until $\|J^{j+1}(x_i) - J^j(x_i)\|_{\infty} < [(1 - \gamma)/2\gamma]\,\epsilon$, where $\epsilon$ is a user-defined threshold value.
5. Once the convergence criterion is met, use $J^{j+1}$ as $J^*$ for computing the optimal control action for any given state; a minimal code sketch of these steps follows.
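The sketch below is our illustration, not code from the chapter; the model integrator step(x, u) and the stage cost phi are assumed to be supplied. It implements steps 1-5 with the inverse-distance-weighted k-nearest-neighbor interpolation of Eqs. (17.23) and (17.24), introduced later in this chapter.

```python
import numpy as np

def knn_estimate(x_hat, X, J, k=4):
    """Inverse-distance-weighted k-nearest-neighbor estimate of J(x_hat)
    from tabulated values J on the state grid X (Eqs. (17.23)-(17.24))."""
    dists = np.linalg.norm(X - x_hat, axis=1)
    idx = np.argsort(dists)[:k]
    if dists[idx[0]] == 0.0:            # exact grid hit
        return J[idx[0]]
    w = 1.0 / dists[idx]
    return np.sum(w * J[idx]) / np.sum(w)

def value_iteration(X, U, step, phi, gamma=0.98, eps=0.1, max_iter=500):
    """Value iteration (Eq. (17.20)) over the discretized states X (N x n)
    and inputs U, with the stopping rule of step 4."""
    N = X.shape[0]
    J = np.zeros(N)                     # step 2: initialize J^0 = 0
    tol = (1.0 - gamma) / (2.0 * gamma) * eps
    for _ in range(max_iter):
        J_new = np.empty(N)
        for i, x in enumerate(X):       # step 3: Bellman backup
            J_new[i] = min(
                phi(x, u) + gamma * knn_estimate(step(x, u), X, J)
                for u in U
            )
        if np.max(np.abs(J_new - J)) < tol:   # step 4: convergence test
            return J_new                      # step 5: use as J*
        J = J_new
    return J
```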
4.2. Minimizing worst-case cost under uncertainty using DP

In practical situations, the HPA axis dynamics of each patient are likely to deviate from the nominal parameter values, and a sequence of optimal treatments based on the nominal model may not work well. Another advantage of DP is that uncertainty can be taken into account explicitly. There are two possible approaches within DP to handle uncertain systems. The first is the stochastic DP approach, where the expectation of the cost-to-go function is minimized given the joint probability distribution function (PDF) of the uncertain parameter vector. The other approach is minimizing the worst-case (maximum) cost-to-go when the parameters are known only within certain bounds. The resulting solution of the min-max optimization is conservative, but it corresponds to the best strategy available in the absence of a PDF. In this chapter, we show the latter approach because it is easier to set bounds on parameters than to find a PDF, which requires many data sets. In the worst-case formulation, the following discounted infinite-horizon worst-case cost is minimized:

$$\max_{p_0, \ldots, p_{\infty}} \; \sum_{k=0}^{\infty} \gamma^k \phi(x_k, u_k) \tag{17.21}$$
The corresponding dynamic program is formulated as

$$J^*(x_k) = \min_{u_k \in U} \max_{p_k \in P} \{\phi(x_k, u_k) + \gamma J^*(x_{k+1})\} \tag{17.22}$$
Obtaining the converged cost-to-go function offline proceeds as in the deterministic case, except that the maximization over parameters is solved for each input before the minimization step.
5. Computation of Optimal Treatments for HPA Axis System

5.1. Deterministic optimization

The first step is to learn the optimal cost-to-go function offline. To store the cost-to-go values in tabular form with interpolation by $k$-nearest neighbors, the state and action spaces are discretized in an equally spaced fashion. The state space is discretized such that each state variable has ten points between the two steady states; as a result, the total number of discrete state points is 10,000. The input is discretized into 25 points between $-2$ and 2. This is also a clinically relevant consideration, because adjusting a drug dosage chosen from a finite set of discrete values is easier to implement than choosing from an infinite number of continuous values. For the 10,000 points, initial cost-to-go values were set to zero and the value iteration algorithm was implemented to obtain converged cost-to-go values. In estimating cost-to-go values of points that are not found in the set of discretized points, the following $k$-nearest-neighbor scheme with $k = 4$ was employed:

$$J(\hat{x}) = \sum_{x_i \in N_k(\hat{x})} w_i J(x_i) \tag{17.23}$$
where

$$w_i = \frac{1/d_i}{\sum_i 1/d_i} \tag{17.24}$$
$d_i$ is the Euclidean distance between $\hat{x}$ and its $i$th nearest point ($i = 1, 2, \ldots, k$). The cost-to-go estimates are not sensitive to the number of neighboring points because the weighting factor is inversely proportional to the distance. With a discount factor of 0.98 and a convergence tolerance $\epsilon$ of 0.1, the offline learning step converges after 15 iterations. With the converged cost-to-go function, the following single-stage optimization is solved to find the optimal control action at each time step:

$$u_k^* = \arg\min_{u_k \in U} \{\phi(x_k, u_k) + \gamma \tilde{J}^*(x_{k+1})\} \tag{17.25}$$
where $\tilde{J}^*$ is the converged cost-to-go function with the four-nearest-neighbor approximator.
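Online, Eq. (17.25) is then a simple enumeration over the 25 discretized inputs; a sketch (ours) reusing the knn_estimate and step helpers assumed in the previous snippet:

```python
import numpy as np

def optimal_action(x, X, U, J_conv, step, phi, gamma=0.98):
    """Single-stage greedy optimization of Eq. (17.25) against the
    converged, kNN-interpolated cost-to-go table J_conv."""
    best_u, best_cost = None, np.inf
    for u in U:
        cost = phi(x, u) + gamma * knn_estimate(step(x, u), X, J_conv)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u
```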
Figure 17.9 State trajectory under the treatment policy derived from deterministic DP. Axes: $x_1$ (CRH concentration), $x_2$ (ACTH concentration), $x_3$ (free GR concentration).
Figure 17.9 shows the controlled trajectory of the nominal system computed from the learned cost-to-go function. The HPA axis system penetrates the boundary successfully under the computed policy. Note, however, that the end of treatment is very close to the boundary surface that separates the healthy and unhealthy equilibria. That is, the nominal treatment is just barely sufficient to achieve the desired outcome (driving the system to the healthy equilibrium) assuming a perfect model and no external disturbances. This solution is not robust to errors in model formulation or parameter estimation and is therefore unlikely to achieve the desired outcome under real-world conditions. To obtain a more realistic treatment, a modified DP approach can be used to generate a treatment that will be effective under a variety of conditions. This type of approach is typically called ‘‘worst-case’’ optimization because it seeks to find a treatment that is effective even for the most difficult-to-treat scenario.
5.2. Worst-case optimization

In this chapter, the worst-case optimization is computed to deal with a situation where the parameter values in the process model are not known exactly. The HPA axis model contains seven parameters. While none of the parameters may be precisely known, a more typical situation arises when some subset of the model parameters is unknown. In this case, it will be assumed that two of the parameters, $k_{ad}$ and $k_{rd}$, are not known exactly but,
rather, are known to vary within [9.5, 10.5] and [0.85, 0.95], respectively. The approach presented in this chapter may be extended to a larger number of unknown parameters. The DP "worst-case" objective is to find robust control actions that can drive the system to a healthy state, without exact knowledge of the system's parameters, by solving the min-max problem. Since the system is nonlinear, it is computationally exorbitant to find the true worst-case scenario by exhaustive search over all possible combinations of the parameters. Hence, we use a sample-based approach to approximate the worst-case cost function. Before the offline learning, fifty points were sampled from the 2-D parameter space. In solving Eq. (17.22), the 50 randomly generated points (drawn from a uniform distribution) are searched over for each control action to find the worst case first. It should be noted that the parameter values chosen for this analysis were constrained to values that allow for multiple equilibria. Once the converged cost-to-go function was obtained, the following optimization problem is solved to find an optimal control action at each time:
ð17:26Þ
uk 2U pk 2P50
x3 (Free GR concentration)
where $P_{50}$ is the set of sampled parameters. Figure 17.10 shows the case where the true system parameters are $(k_{ad}, k_{rd}) = (9.7545, 0.9094)$. Without knowledge of the true parameter values, the system could be driven to a healthy steady-state point by the suggested approach.
Figure 17.10 State trajectory under the worst-case solution via DP. Axes: $x_1$ (CRH concentration), $x_2$ (ACTH concentration), $x_3$ (free GR concentration).
It is noted that the worst-case optimization gives a very conservative treatment: the highest dosage is sustained for 7 days, compared with 5 days in the nominal case. However, this treatment was computed through DP, without knowledge of the true parameters, by minimizing the worst-case performance criterion. Without such an optimization tool, it is very difficult to provide dosage guidelines when the exact model parameters of each patient are not available. Hence, the proposed procedure can provide a single treatment course that is "likely," in an appropriate sense, to treat the vast majority of individuals with minimal side effects.
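For completeness, a sketch (ours) of the online min-max step of Eq. (17.26), under the same assumptions as the snippets above; step now also takes a parameter vector p, and P50 holds the 50 sampled parameter pairs.

```python
import numpy as np

def worst_case_action(x, X, U, P50, J_conv, step, phi, gamma=0.98):
    """Min-max action of Eq. (17.26): for each candidate input, evaluate
    the worst cost over the sampled parameter vectors in P50, then pick
    the input with the smallest worst-case cost."""
    best_u, best_cost = None, np.inf
    for u in U:
        worst = max(
            phi(x, u) + gamma * knn_estimate(step(x, u, p), X, J_conv)
            for p in P50
        )
        if worst < best_cost:
            best_u, best_cost = u, worst
    return best_u
```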
6. Conclusions

This chapter discussed two algorithms for dynamic optimization of biological/biomedical systems. Dynamic programming has distinct advantages over the model predictive control approach: flexibility in choosing the objective function and the ability to take uncertainty into account. Steady-state and stability analyses of nonlinear systems with multiple steady states were also discussed using the example of the HPA axis system. Despite its generality, a potential issue with DP is that its offline computational requirement increases with the number of state variables, a problem referred to as the "curse of dimensionality." In addition, it requires full measurement of the state variables. The recent emergence of reinforcement learning and approximate dynamic programming offers a possibility of averting the curse of dimensionality and the necessity of full state measurement.
ACKNOWLEDGMENTS

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada under a Discovery Grant.
REFERENCES

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Ben-Zvi, A., et al. (2009). Model-based therapeutic correction of hypothalamic-pituitary-adrenal axis dysfunction. PLoS Comput. Biol. 5, e1000273.
Chicone, C. (1999). Ordinary Differential Equations with Applications. Springer-Verlag, New York.
Cleare, A., et al. (2001). Hypothalamo-pituitary-adrenal axis dysfunction in chronic fatigue syndrome. J. Clin. Endocrinol. Metab. 86, 3545–3554.
Crofford, L., et al. (2004). Basal circadian and pulsatile ACTH and cortisol secretion in patients with fibromyalgia and/or chronic fatigue syndrome. Brain Behav. Immun. 18, 314–325.
Deutsch, A., et al. (2007). Mathematical Modeling of Biological Systems: Cellular Biophysics, Regulatory Networks, Development, Biomedicine, and Data Analysis. Birkhäuser Boston, Boston.
Giorgio, A., et al. (2005). 24-hour pituitary and adrenal hormone profiles in chronic fatigue syndrome. Psychosom. Med. 67, 433–440.
Guerlain, S., et al. (2002). The MPC elucidator: A case study in the design for human-automation interaction. IEEE Trans. Syst. Man Cybernet. Part A: Syst. Hum. 32, 25–40.
Gupta, S., et al. (2007). Inclusion of the glucocorticoid receptor in a hypothalamic pituitary adrenal axis model reveals bistability. Theor. Biol. Med. Model. 4.
Jacobson, L. (2005). Hypothalamic-pituitary-adrenocortical axis regulation. Endocrinol. Metab. Clin. North Am. 34, 271–292.
Klimas, N., et al. (1990). Immunologic abnormalities in chronic fatigue syndrome. J. Clin. Microbiol. 28, 1403–1410.
Klipp, E., et al. (2006). Integrative model of the response of yeast to osmotic shock. Nat. Biotechnol. 23, 975–982.
Lee, J. M., et al. (2006). Choice of approximator and design of penalty function for an approximate dynamic programming based control approach. J. Process Control 16, 135–156.
Morari, M., and Lee, J. H. (1999). Model predictive control: Past, present and future. Comput. Chem. Eng. 23, 667–682.
Rawlings, J. B., and Muske, K. R. (1993). The stability of constrained receding horizon control. IEEE Trans. Automat. Contr. 38, 1512–1516.
Rizzi, M., et al. (1997). In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. Mathematical model. Biotechnol. Bioeng. 55, 592–608.
Schürmeyer, T. H., et al. (1996). Effect of cyproheptadine on episodic ACTH and cortisol secretion. Eur. J. Clin. Invest. 26, 397–403.
Teusink, B., et al. (2000). Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur. J. Biochem. 267, 5313–5329.
CHAPTER EIGHTEEN

Modeling of Growth Factor-Receptor Systems: From Molecular-Level Protein Interaction Networks to Whole-Body Compartment Models

Florence T. H. Wu,* Marianne O. Stefanini,* Feilim Mac Gabhann,† and Aleksander S. Popel*

* Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
† Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, USA

Methods in Enzymology, Volume 467, ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67018-X. © 2009 Published by Elsevier Inc.

Contents
1. Background
   1.1. Biology of growth factor systems
   1.2. Computational models of the VEGF system
2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments
   2.1. Mathematical framework for biomolecular interaction networks
   2.2. Case study: Mechanism of PlGF synergy—Shifting VEGF to VEGFR2 versus PlGF–VEGFR1 signaling
   2.3. Case study: Mechanism of NRP1–VEGFR2 coupling via VEGF165—Effect on VEGF isoform-specific receptor binding
3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions
   3.1. Mathematical framework for tissue architecture, blood flow, and tissue oxygenation
   3.2. Case study: Proangiogenic VEGF gene therapy for muscle ischemia
   3.3. Case study: Proangiogenic VEGF cell-based therapy for muscle ischemia
   3.4. Case study: Proangiogenic exercise therapy for muscle ischemia
4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue
   4.1. Mathematical framework for tissue porosity and available volume fractions
   4.2. Case study: Pharmacodynamic mechanism and tumor microenvironment affect efficacy of anti-NRP1 therapy in cancer
5. Multitissue Compartmental Models: Simulation of Whole Body
   5.1. Mathematical framework of intertissue transport
   5.2. Case study: Pharmacokinetics of anti-VEGF therapy in cancer
   5.3. Case study: Mechanism of sVEGFR1 as a ligand trap
6. Conclusions
Acknowledgments
References
Abstract

Most physiological processes are subject to molecular regulation by growth factors, which are secreted proteins that activate chemical signal transduction pathways through binding of specific cell-surface receptors. One particular growth factor system involved in the in vivo regulation of blood vessel growth is called the vascular endothelial growth factor (VEGF) system. Computational and numerical techniques are well suited to handle the molecular complexity (the number of binding partners involved, including ligands, receptors, and inert binding sites) and multiscale nature (intratissue vs. intertissue transport and local vs. systemic effects within an organism) involved in modeling growth factor system interactions and effects. This chapter introduces a variety of in silico models that seek to recapitulate different aspects of VEGF system biology at various spatial and temporal scales: molecular-level kinetic models focus on VEGF ligand–receptor interactions at and near the endothelial cell surface; mesoscale single-tissue 3D models can simulate the effects of multicellular tissue architecture on the spatial variation in VEGF ligand production and receptor activation; compartmental modeling allows efficient prediction of average interstitial VEGF concentrations and cell-surface VEGF signaling intensities across multiple large tissue volumes, permitting the investigation of whole-body intertissue transport (e.g., vascular permeability and lymphatic drainage). The given examples will demonstrate the utility of computational models in aiding both basic science and clinical research on VEGF systems biology.
1. Background

1.1. Biology of growth factor systems

1.1.1. Growth factor systems in angiogenesis

At the molecular and cellular levels, growth factors are extracellularly secreted polypeptides which, upon binding to specific cell-surface target receptors, trigger intracellular signal transduction pathways that regulate cell
proliferation, differentiation, and survival (Lodish et al., 2004). At the tissue and organ levels, growth factors are responsible for orchestrating many physiological processes in complex multicellular organisms. Of utmost importance in human physiology and pathology is the process known as angiogenesis—the growth of new capillaries or microvessels from preexisting blood vasculature—which critically supports organogenesis during embryonic development (Haigh, 2008); physiological growth and repair in adult tissues (such as in wound healing (Bao et al., 2009), muscular adaptation to exercise (Brown and Hudlicka, 2003), or endometrial regeneration (Girling and Rogers, 2005)); as well as the malignant growth of tumor tissues (Kerbel, 2008).

Sprouting angiogenesis is a well-coordinated and complex cascade of molecular, cellular, and tissue-level events (Qutub et al., 2009): First, tissue ischemia is converted into a chemical cue for angiogenesis, as hypoxic cells (e.g., cancer cells in tumor tissue or myocytes in ischemic muscle) transcribe and secrete growth factors in response to hypoxia-inducible factor 1 (HIF1) activation. The growth factor ligands then diffuse throughout the extracellular fluid, where some become sequestered to matrix proteoglycans and others bind cell-surface receptors on the capillary endothelium. Cell-surface receptor-bound ligands initiate vessel sprouting by turning quiescent endothelial cells into migratory tip cells. Extracellular matrix-bound and freely diffusing ligands form chemotactic gradients that guide the migration of tip cell filopodia in the capillary sprout. In Sections 1.1.2 and 1.1.3, we further explore the multiscale nature and molecular complexity of angiogenic growth factor interactions.

1.1.2. Systems biology of VEGF: Interaction networks and molecular cross talk

Many growth factor systems are involved in angiogenic regulation, including the vascular endothelial growth factor (VEGF) system of at least five ligands (VEGF-A, PlGF, VEGF-B, VEGF-C, VEGF-D) and three receptors (VEGFR1, VEGFR2, VEGFR3) (Mac Gabhann and Popel, 2008; Roy et al., 2006); the fibroblast growth factor (FGF) system of at least 18 ligands (FGF1 to FGF10 and FGF16 to FGF23) and 4 receptors (FGFR1 to FGFR4) (Beenken and Mohammadi, 2009); the angiopoietin (Ang) system of at least four ligands (ANG1 to ANG4) and two receptors (TIE1 and TIE2) (Augustin et al., 2009); the platelet-derived growth factor (PDGF) system of at least four ligands (PDGF-A to PDGF-D) and two receptors (PDGFR-α and PDGFR-β) (Andrae et al., 2008); and the insulin-like growth factor (IGF) system of at least two ligands (IGF1 and IGF2) and two receptors (IGF1R and IGF2R) (Mazitschek and Giannis, 2004; Pollak, 2008). There are organizational similarities between these growth factor systems—all of the above receptors except for IGF2R are transmembrane receptor tyrosine kinases (RTKs) that are activated by ligand-induced dimerization and transphosphorylation of tyrosine residues (Gschwind et al., 2004);
coreceptors (e.g., neuropilin-1 (NRP1) for VEGFRs and syndecan-4 for FGFRs) and endothelial integrins (e.g., αvβ5 for VEGFRs and αvβ3 for FGFRs) often modulate receptor signaling (Simons, 2004); and heparan sulfate proteoglycans (HSPGs) are involved in the extracellular matrix sequestration of ligands in the VEGF, PDGF, and FGF systems (Andrae et al., 2008; Beenken and Mohammadi, 2009; Roy et al., 2006). Intracellularly, details are also emerging on the convergent and integrative cross talk between and within growth factor systems in angiogenic signaling. Among the overlap in their transcriptional profiles downstream of RTK activation, all aforementioned growth factor systems can activate the canonical Ras-MAPK signaling pathway (Simons, 2004). VEGF and FGF2 were observed to induce the expression of each other (Simons, 2004). Within the VEGF system, the existence of heterodimeric ligands (e.g., VEGF-A/PlGF and VEGF-A/VEGF-B) and heterodimeric receptors (e.g., VEGFR1/VEGFR2) is expected to introduce new signal transduction pathways in addition to those downstream of classic homodimeric ligand–receptor activation (Cao, 2009; Mac Gabhann and Popel, 2008). While VEGFR1 is mainly a negative regulator of angiogenesis, its possible proangiogenic roles are suspected to involve the intermolecular transphosphorylation of VEGFR2 by PlGF-activated VEGFR1 (Autiero et al., 2003). In this chapter, we will introduce computational frameworks that are well suited for the quantitative modeling of highly complex molecular interaction networks such as that of the VEGF system in angiogenesis. While our examples focus on the VEGF system, the mathematical frameworks are generally adaptable to any of the organizationally similar growth factor systems introduced above, with the potential of further integration between the VEGF, FGF, PDGF, IGF, and Ang-Tie system modules themselves.

1.1.3. Multiscale biology of VEGF: Transport and signaling range

In understanding the biology of angiogenic growth factors, it is of equal importance to identify the key molecular players and to distinguish where in the body the molecular interactions take place. The spatial range of activity can vary between growth factors (or between isoforms of the same growth factor) depending on their propensity for intratissue and intertissue transport. The intratissue transport of a secreted growth factor—that is, its diffusive and convective transport within the extracellular matrix (ECM)—is dependent on the growth factor's molecular size, the pore sizes of the ECM, and its chemical affinity for ECM proteoglycans. For instance, the heparin-binding affinity of VEGF, which determines the extent of its sequestration by ECM heparan sulfate proteoglycans, is traditionally thought to be encoded in the VEGF-A gene on exons 6 and 7 (Harper and Bates, 2008). Hence, the proangiogenic VEGF121 and antiangiogenic VEGF121b splice isoforms, which skip exons 6 and 7, are mostly freely diffusible once secreted into the
extracellular fluid, while the higher molecular-weight isoforms VEGF145(b), VEGF148, VEGF165(b), VEGF183(b), VEGF189(b), and VEGF206 have progressively higher heparin-binding affinity due to their inclusion of increasingly greater portions of exons 6 and 7 (Harper and Bates, 2008). Yet these higher molecular-weight isoforms in their matrix-bound state can be subjected to proteolytic cleavage by plasmin or matrix metalloproteinases (MMPs), which releases active fragments of 110–113 amino acids in length with angiogenic properties similar to those of VEGF121 (Ferrara and Davis-Smyth, 1997; Lee et al., 2005; Qutub et al., 2009). On a larger scale, the intertissue transport of a growth factor is first affected by its rate of entry into or exit from the blood or lymphatic vasculature, that is, its permeability through the blood capillary endothelium or its lymphatic drainage rate from interstitial spaces. Once in the bloodstream or lymphatic fluid, the intertissue transport of a growth factor may be further facilitated by specific carrier proteins or circulating cells. For VEGF, its potential carriers in blood include soluble forms of its normal receptors (sVEGFR1, sVEGFR2, sNRP1) (Ebos et al., 2004; Gagnon et al., 2000; Sela et al., 2008), plasma fibronectin (Wijelath et al., 2006), as well as platelets (Verheul et al., 2007). All together, these transport properties influence the distance over which a growth factor can signal. Autocrine signaling occurs when growth factors act upon the same cells that produced them; juxtacrine growth factors act upon adjacent cells after secretion; paracrine growth factors diffuse through the extracellular fluid and target cells within the same tissue but of a different cell type than that which secreted them; whereas growth factors that are transported through the bloodstream to distant target tissues act in an endocrine manner (Lauffenburger and Linderman, 1993; Lodish et al., 2004). Angiogenic VEGF signaling occurs predominantly in a paracrine fashion: epithelial cells in fenestrated organs (e.g., glomerular podocytes), mesenchymal cells (e.g., skeletal myocytes in ischemic muscles), vascular stromal cells (e.g., pericytes and smooth muscle cells), and hypoxic tumor cells are all known to secrete VEGF, which then diffuses to and activates the VEGF receptors on nearby endothelial cell surfaces (Kerbel, 2008; Maharaj and D'Amore, 2007). However, there is also evidence of autocrine VEGF signaling loops in VEGF-producing endothelial cells (Martin et al., 2009). Furthermore, autocrine VEGF signaling involving intracellular VEGF receptors ("intracrine signaling") has been documented in breast carcinoma cells (Lee et al., 2007a,b) and hematopoietic stem cells (Gerber et al., 2002), although in these contexts VEGF functions as a cell survival signal rather than a proangiogenic factor. While no specific endocrine functions of VEGF have been formally established in normal physiology, aberrantly high circulating levels of VEGF may have deleterious systemic effects. In clinical trials administering VEGF via intravascular infusion to stimulate therapeutic angiogenesis for ischemic muscle diseases, unintended side effects of the high systemic VEGF concentrations, such as hypotension and macular edema, have been transiently
and sporadically observed (Collinson and Donnelly, 2004). The VEGF concentrations in the plasma of cancer patients are also known to be severalfold higher than healthy baseline levels, although it is uncertain whether the tumors are themselves the source of the elevated circulating VEGF, or whether conversely the elevated circulating VEGF triggered the malignant growth of tumors (Kut et al., 2007; Stefanini et al., 2008). Therefore, a complete understanding of VEGF biology—including its pathogenic role in cancer and its therapeutic potential in ischemic diseases—necessitates an appreciation of the dynamic distribution of VEGF in the human body (Kut et al., 2007). The computational models presented in this chapter are complementary, capturing the biology of VEGF interactions at different length scales: the molecular-level kinetic models can predict the local intensity of VEGF·VEGFR complex formation on endothelial cell surfaces as a marker of angiogenic activation or vessel sprouting initiation; the mesoscale single-tissue 3D models involving multiple cell types can predict the spatial gradients of VEGF in the extracellular space and simulate paracrine signaling distributions that guide capillary sprout migration; and the multitissue compartmental models can be used to simulate whole-body VEGF distributions and to investigate the possibility of endocrine VEGF effects.
1.2. Computational models of the VEGF system

Figure 18.1 summarizes the multiscale nature and complexity of the molecular interactions involved in the systems biology of the VEGF ligand–receptor system. In the following sections, we introduce models for investigating emergent behavior at progressively higher spatial scales: from molecular (subcellular) level models, to mesoscale (intratissue) models, to whole-body (intertissue) models. The chosen examples also illustrate the versatility of computational modeling for investigating basic science questions (e.g., simulating the molecular mechanisms underlying the PlGF–VEGF synergy and NRP1–VEGFR2 synergy) and assisting in the design of translational medicine (e.g., comparing the therapeutic efficacy of cell-based versus protein delivery of VEGF; optimizing the dose of anti-VEGF therapy).
2. Molecular-Level Kinetics Models: Simulation of In Vitro Experiments

In our first two examples, models were developed to investigate specific molecular functions and interactions of key players in the VEGF ligand–receptor system—PlGF and NRP1—by recapitulating in vitro experiments. The spatial scope of these models focused on molecular behaviors near the endothelial cell surface, including extracellular ligand diffusion and cell-surface ligand–receptor binding.
Figure 18.1 Multiscale systems biology of the VEGF ligand–receptor system. (1) Hypoxia, such as that in growing tumor tissues (top panel), triggers the expression and extracellular secretion of VEGF ligand proteins, for example, VEGF-A (isoforms VEGF121 and VEGF165) and PlGF. At the cellular level, VEGF ligands diffuse toward nearby capillary surfaces, binding endothelial cell-surface receptors (VEGFR1, VEGFR2) and coreceptors (NRP1) in various configurations to activate (2) proangiogenic and (3) antiangiogenic downstream signaling. (4) At the tissue level, VEGF ligands with heparin-binding domains can be sequestered at heparan sulfate proteoglycan sites in the extracellular matrix (ECM), forming chemotactic gradients that guide capillary sprout migration. (5) Soluble VEGFR1 (sVEGFR1) potentially modulates angiogenic signaling via ligand trapping or dominant-negative heterodimerization with transmembrane VEGFR monomers. (6) Humanized anti-VEGF antibodies (e.g., Avastin®), through their capacity to sequester specific VEGF ligands, are being investigated as antiangiogenic agents in cancer treatment. At the whole-organism level, macromolecules such as VEGF ligands and their soluble receptors may have systemic
2.1. Mathematical framework for biomolecular interaction networks Mathematical theory and formulations for kinetic modeling of cell-surface ligand–receptor binding and cell-surface receptor/ligand trafficking have been presented in classical texts (Lauffenburger and Linderman, 1993). The standard description for the binding kinetics of ligand L to receptor R to form complex C involves characterization of the complex association and dissociation rate constants kon and koff: dC ð18:1Þ ¼ kon RL koff C dt Endocytotic internalization of free receptors and complexes are generally characterized by first-order rate constants: RþL ÐC $
dR dC ð18:2Þ ¼ kint;R R; ¼ kint;C C dt dt Free receptor insertion rates are typically introduced through zero-order source terms; in the following models, they are chosen to maintain a steady total population (free and bound) of receptors in the absence of added ligand.
2.2. Case study: Mechanism of PlGF synergy—Shifting VEGF to VEGFR2 versus PlGF–VEGFR1 signaling Our first case study sought to decipher the molecular mechanisms behind PlGF’s observed ability to augment the angiogenic response to VEGF-A in in vitro assays for endothelial cell survival, proliferation and migration. Details and full references can be found in Mac Gabhann and Popel (2004). These two members of the VEGF family have different receptorbinding properties: VEGF-A (hereinafter referred to as simply ‘‘VEGF’’) binds with both VEGFR1 and VEGFR2; while PlGF only binds VEGFR1. Two proposed mechanisms for the PlGF–VEGF synergy were: (a) ‘‘ligand shifting’’, where PlGF displaces VEGF from VEGFR1, effectively freeing effects, as they enter the blood circulatory system (middle panel) through intertissue transport processes including (7) transcapillary vascular permeability and (8) lymphatic drainage of the interstitial fluid. (9) Other VEGF carriers in the blood include soluble fibronectin and platelets. (10) Similarly in skeletal muscle (bottom panel), VEGF ligand expression is upregulated in hypoxic myocytes. However in peripheral arterial disease, the angiogenic response is insufficient to alleviate muscle ischemia. Proangiogenic therapies under investigation include VEGF-A delivery through cell, gene, and protein therapy. Adjuvant therapeutic targets include (11) PlGF (thought to work synergistically through VEGFR1 signaling or ligand shifting) and (12) NRP1 (via presentation of VEGF to VEGFR2 or reducing antiangiogenic VEGFVEGFR1 complexes).
more VEGF to bind the more proangiogenic VEGFR2; and (b) PlGF–VEGFR1 signaling, where PlGF activation of VEGFR1 may transduce qualitatively different (proangiogenic) signals than those from VEGF activation (generally inhibitory of angiogenic signaling). An in silico model was thus constructed to quantify these mechanistic contributions to the VEGF–PlGF synergy.

The in silico model formulation mimicked in vitro assay geometry and conditions as illustrated in Fig. 18.2A. At the bottom of a cell culture well, a confluent layer of endothelial cells expressing receptors VEGFR1 and VEGFR2 on the surface ($z = 0$) was exposed to the fluid media. Into the cell culture media (from $z = 0$ to $z = h$), ligands were administered at time zero—either VEGF alone ("PlGF−" case) or VEGF and PlGF ("PlGF+" case)—to assess the synergistic effects of PlGF. Mathematically, each molecular species was represented by either a volumetric concentration ($V$ for VEGF, $P$ for PlGF) or a cell-surface concentration ($R1$ for VEGFR1, $R2$ for VEGFR2, $VR1$ for the VEGF·VEGFR1 complex, $PR1$ for the PlGF·VEGFR1 complex, $VR2$ for the VEGF·VEGFR2 complex), using the continuum approach. The initial value problem was essentially a single spatial dimension problem ($z$-direction), as molecular concentrations are assumed uniform in the plane parallel to the endothelial cell surface. Coupled diffusion and reaction equations (Eqs. (18.3) and (18.4), respectively) were used to describe the time evolution of extracellular ligand transport and cell-surface molecular binding interactions. $D$ represents ligand diffusivity (cm²/s); $s_R$ is the insertion rate of receptor $R$ (mol/cm²/s); $k_{int}$ is the receptor or complex internalization rate (s⁻¹); $k_{on}$ and $k_{off}$ are the rate constants of complex association (M⁻¹ s⁻¹) and dissociation (s⁻¹).

$$\frac{\partial}{\partial t}\begin{bmatrix} V \\ P \end{bmatrix} = \frac{\partial^2}{\partial z^2}\begin{bmatrix} D_V V \\ D_P P \end{bmatrix} \tag{18.3}$$

$$\begin{aligned} \frac{\partial R1}{\partial t} &= s_{R1} - k_{int,R1} R1 - (k_{on,V,R1}\, V \cdot R1 - k_{off,V,R1}\, VR1) - (k_{on,P,R1}\, P \cdot R1 - k_{off,P,R1}\, PR1) \\ \frac{\partial R2}{\partial t} &= s_{R2} - k_{int,R2} R2 - (k_{on,V,R2}\, V \cdot R2 - k_{off,V,R2}\, VR2) \\ \frac{\partial VR1}{\partial t} &= -k_{int,VR1} VR1 + (k_{on,V,R1}\, V \cdot R1 - k_{off,V,R1}\, VR1) \\ \frac{\partial PR1}{\partial t} &= -k_{int,PR1} PR1 + (k_{on,P,R1}\, P \cdot R1 - k_{off,P,R1}\, PR1) \\ \frac{\partial VR2}{\partial t} &= -k_{int,VR2} VR2 + (k_{on,V,R2}\, V \cdot R2 - k_{off,V,R2}\, VR2) \end{aligned} \tag{18.4}$$
Figure 18.2 Molecular-level kinetics models. Modeling of molecular interactions underlying PlGF–VEGF synergy: experimental setup (A) and sample results (B), based on data previously published in Mac Gabhann and Popel (2004). Modeling of NRP1's role in the differential binding of VEGF isoforms to VEGFR2: experimental setup (C) and sample results (D), based on data previously published in Mac Gabhann and Popel (2005). (Panel B annotations: PlGF has a small effect on VEGFR2 signaling but a large impact on VEGF-VEGFR1 and PlGF-VEGFR1 signaling. Panel D annotations: inhibition of VEGFR2 signaling by blocking binding to NRP1, ~8%; by blocking coupling to NRP1, ~45%.)
Boundary conditions were given by Eq. (18.5), where $q_V$ is the endothelial secretion rate of VEGF (mol/cm²/s):

$$\begin{aligned} D_V \left.\frac{\partial V}{\partial z}\right|_{z=0} &= -q_V + (k_{on,V,R1}\, V \cdot R1 - k_{off,V,R1}\, VR1) + (k_{on,V,R2}\, V \cdot R2 - k_{off,V,R2}\, VR2) \\ D_P \left.\frac{\partial P}{\partial z}\right|_{z=0} &= (k_{on,P,R1}\, P \cdot R1 - k_{off,P,R1}\, PR1) \\ \left.\frac{\partial V}{\partial z}\right|_{z=h} &= \left.\frac{\partial P}{\partial z}\right|_{z=h} = 0 \end{aligned} \tag{18.5}$$
The initial conditions describing exogenous ligand administration and total receptor densities were specified by Eq. (18.6):

$$V(t=0) = V_0; \quad P(t=0) = P_0; \quad R1(t=0) = R1_0; \quad R2(t=0) = R2_0; \quad VR1(t=0) = PR1(t=0) = VR2(t=0) = 0 \tag{18.6}$$

The predicted final concentration of cell-surface VEGF·VEGFR2 complexes served as a surrogate marker for the achieved angiogenic response. Representative values used for the model parameters were based on the experimental literature, as detailed in Mac Gabhann and Popel (2004). Numerical solution of the coupled nonlinear differential equations (Eqs. (18.3)–(18.6)) was achieved using an iterative implicit finite-difference scheme. Sample results are shown in Fig. 18.2B. Crucially, the simulations predicted that PlGF addition increased VEGF·VEGFR2 complex formation by at most 5% (the peak percentage change between the PlGF+ and PlGF− cases); that is, the "ligand shifting" effect is expected to be minimal. On the other hand, the transient increase in total ligated VEGFR1 complexes was as high as 43%, during which the magnitude of the increase in PlGF·VEGFR1 complexes exceeded that of the decrease in VEGF·VEGFR1 complexes. This would suggest that "VEGFR1 signaling" plays the more prominent role in the observed PlGF–VEGF synergy: PlGF critically alters the VEGFR1 signaling profile in both the absolute quantity of signaling VEGFR1 complexes and the signaling quality of those complexes (elevated proangiogenic PlGF·VEGFR1 signaling and reduced modulatory VEGF·VEGFR1 signaling). Experimental support for these computational predictions has been found (Autiero et al., 2003), as intermolecular cross talk was reported to occur downstream of PlGF–VEGFR1 binding, leading to the transphosphorylation of VEGFR2 and amplification of proangiogenic VEGF·VEGFR2 signaling.
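As an illustration of how such a simulation can be set up, the Python sketch below integrates a stripped-down version of the model (one ligand, one receptor) with an explicit finite-difference scheme rather than the authors' iterative implicit scheme; all parameter values are arbitrary placeholders, not those of Mac Gabhann and Popel (2004).

```python
import numpy as np

# Arbitrary illustrative parameters (not the values used in the chapter)
D = 1.0e-7                 # ligand diffusivity
h, Nz = 0.1, 101           # media depth and number of grid nodes
dz = h / (Nz - 1)
dt = 0.4 * dz**2 / D       # explicit diffusion stability limit
kon, koff = 1.0e5, 1.0e-3  # association / dissociation rate constants
kint, sR, qV = 1.0e-4, 0.0, 0.0   # internalization, insertion, secretion

V = np.full(Nz, 1.0e-9)    # ligand concentration profile V(z, t=0) = V0
R, C = 1.0e-11, 0.0        # surface receptor and complex densities

for _ in range(20000):
    surf = kon * V[0] * R - koff * C          # net binding flux at z = 0
    lap = np.zeros_like(V)
    lap[1:-1] = (V[2:] - 2.0 * V[1:-1] + V[:-2]) / dz**2
    # Ghost-node enforcement of D dV/dz|_{z=0} = -qV + surf (cf. Eq. (18.5))
    V_ghost = V[1] - 2.0 * dz * (-qV + surf) / D
    lap[0] = (V[1] - 2.0 * V[0] + V_ghost) / dz**2
    lap[-1] = 2.0 * (V[-2] - V[-1]) / dz**2   # no-flux boundary at z = h
    V = V + dt * D * lap
    # Surface species kinetics, cf. Eqs. (18.1), (18.2), and (18.4)
    R += dt * (sR - kint * R - surf)
    C += dt * (surf - kint * C)
```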
2.3. Case study: Mechanism of NRP1–VEGFR2 coupling via VEGF165—Effect on VEGF isoform-specific receptor binding

Our second case study investigated another molecular player in the systems biology of VEGF, neuropilin-1 (NRP1). As illustrated in Fig. 18.2C, NRP1-binding affinity is conferred to the higher molecular-weight isoforms of VEGF, such as VEGF165, predominantly through transcription of exon 7; the VEGF121 isoform, which lacks exon 7, is generally considered to have negligible affinity for NRP1. Because the NRP1-binding domains of VEGF165 do not overlap with its VEGFR-binding domains, VEGF165 can act as a bridge in the formation of a ternary complex: VEGF165·VEGFR2·NRP1. A reduced interaction network model involving VEGF121, VEGF165, VEGFR2, and NRP1 (Fig. 18.2C) was thus constructed to quantify the role of VEGF165-bridged VEGFR2–NRP1 coupling in generating VEGF isoform-specific angiogenic responses. Details and full references can be found in Mac Gabhann and Popel (2005).

Geometrically, the experimental setup again involved a confluent layer of endothelial cells on the bottom of a cell culture well, as in Fig. 18.2A. As before, the initial value problem in one spatial dimension was formulated as a system of coupled diffusion and reaction equations (Eqs. (18.7) and (18.8)–(18.13), respectively). A new parameter, $k_c$, represents the coupling rate between VEGFR2 and NRP1 via VEGF165:

$$\frac{\partial V_{121}}{\partial t} = D_V \nabla^2 V_{121}; \qquad \frac{\partial V_{165}}{\partial t} = D_V \nabla^2 V_{165} \tag{18.7}$$
$$\begin{aligned} \frac{\partial R2}{\partial t} = {} & s_{R2} - k_{int,R2} R2 - (k_{on,VR2}\, V_{165} \cdot R2 - k_{off,VR2}\, V_{165}R2) \\ & - (k_{on,VR2}\, V_{121} \cdot R2 - k_{off,VR2}\, V_{121}R2) - (k_{c,VN1,R2}\, V_{165}N1 \cdot R2 - k_{off,VR2}\, V_{165}R2N1) \end{aligned} \tag{18.8}$$

$$\frac{\partial N1}{\partial t} = s_{N1} - k_{int,N1} N1 - (k_{on,VN1}\, V_{165} \cdot N1 - k_{off,VN1}\, V_{165}N1) - (k_{c,VR2,N1}\, V_{165}R2 \cdot N1 - k_{off,VN1}\, V_{165}R2N1) \tag{18.9}$$

$$\frac{\partial V_{121}R2}{\partial t} = -k_{int,VR2}\, V_{121}R2 + (k_{on,VR2}\, V_{121} \cdot R2 - k_{off,VR2}\, V_{121}R2) \tag{18.10}$$

$$\frac{\partial V_{165}R2}{\partial t} = -k_{int,VR2}\, V_{165}R2 + (k_{on,VR2}\, V_{165} \cdot R2 - k_{off,VR2}\, V_{165}R2) - (k_{c,VR2,N1}\, V_{165}R2 \cdot N1 - k_{off,VN1}\, V_{165}R2N1) \tag{18.11}$$

$$\frac{\partial V_{165}N1}{\partial t} = -k_{int,VN1}\, V_{165}N1 + (k_{on,VN1}\, V_{165} \cdot N1 - k_{off,VN1}\, V_{165}N1) - (k_{c,VN1,R2}\, V_{165}N1 \cdot R2 - k_{off,VR2}\, V_{165}R2N1) \tag{18.12}$$

$$\frac{\partial V_{165}R2N1}{\partial t} = -k_{int,VR2N1}\, V_{165}R2N1 + (k_{c,VN1,R2}\, V_{165}N1 \cdot R2 - k_{off,VR2}\, V_{165}R2N1) + (k_{c,VR2,N1}\, V_{165}R2 \cdot N1 - k_{off,VN1}\, V_{165}R2N1) \tag{18.13}$$

Boundary conditions were given by Eq. (18.14):

$$\begin{aligned} D_V \left.\frac{\partial V_{121}}{\partial z}\right|_{z=0} &= -q_V + (k_{on,VR2}\, V_{121} \cdot R2 - k_{off,VR2}\, V_{121}R2) \\ D_V \left.\frac{\partial V_{165}}{\partial z}\right|_{z=0} &= -q_V + (k_{on,VR2}\, V_{165} \cdot R2 - k_{off,VR2}\, V_{165}R2) + (k_{on,VN1}\, V_{165} \cdot N1 - k_{off,VN1}\, V_{165}N1) \\ \left.\frac{\partial V_{121}}{\partial z}\right|_{z=h} &= \left.\frac{\partial V_{165}}{\partial z}\right|_{z=h} = 0 \end{aligned} \tag{18.14}$$
The predicted final VEGF·VEGFR2 concentration again served as a marker for the strength of proangiogenic signal transduction. The experimental and theoretical derivations of the model parameter values are detailed in Mac Gabhann and Popel (2005). Numerical solution of the coupled nonlinear differential equations (Eqs. (18.7)–(18.14)) was achieved using an iterative implicit finite-difference scheme. Sample results are shown in Fig. 18.2D. The in silico modeling of stepwise reaction kinetics (Fig. 18.2C) allowed the prediction of differential antiangiogenic efficacies from therapeutic interference with two distinct aspects of NRP1 function: VEGF binding and VEGFR2 coupling. Simulated blockade of VEGF165 binding to NRP1 (blocking reactions "Rx1" and "Rx2" in Fig. 18.2C) resulted in the convergence of the VEGF165 response to that of VEGF121 in terms of VEGFR2 activation; however, simulated blockade of
NRP1–VEGFR2 coupling (blocking reactions "Rx1" and "Rx3" in Fig. 18.2C) converted NRP1 into a VEGF165 sink (through the intact reaction "Rx2" in Fig. 18.2C), further reducing the VEGF165 response to below that of VEGF121 (Fig. 18.2D).
3. Mesoscale Single-Tissue 3D Models: Simulation of In Vivo Tissue Regions

The study of VEGF binding to receptors on cells in vitro, and the validation of the VEGF kinetic interaction network between multiple ligands and multiple receptors, leads us to ask: how does this network behave in vivo? In Sections 4 and 5, we will discuss the transport of VEGF between tissues and around the body, but here we focus first on the behavior of VEGF in a local volume of tissue. This multicellular milieu requires significant additions to our model in order to accurately simulate the local transport of VEGF, including diffusion of VEGF ligands over significant distances, extracellular matrix sequestration, and variable production rates of VEGF throughout the tissue. We place all of these in an anatomically based 2D or 3D multicellular tissue geometry. The models can predict the creation of interstitial VEGF gradients due to the nonuniform nature of the tissue anatomy. This is of particular interest because VEGF is believed to be a chemotactic guiding agent for blood vessels, but also because local variability in VEGF concentration can lead to local variation in VEGF receptor ligation and signaling, allowing for focal activation of endothelial cells. The model framework can be adapted to most tissues; here we present a case with parameters specifically selected to represent a skeletal muscle experiencing ischemia (specifically, the rat extensor digitorum longus, or EDL, in a rodent model of hindlimb artery ligation), and we describe how to computationally test several therapeutic interventions, including gene therapy and exercise.
3.1. Mathematical framework for tissue architecture, blood flow, and tissue oxygenation

3.1.1. 2D and 3D tissue geometry based on microanatomy
A cross-section (for 2D) or a volume of tissue (for 3D) is reconstructed from histological and other microanatomical information (Fig. 18.3A–C). The major relevant features of the tissue are the blood vessels, the parenchymal cells (from here on, we will assume these are skeletal myocytes, i.e., long multinucleated cells), and the interstitial space between them. From a computational modeling point of view, the tissue comprises volumes and surfaces, defined as those portions of the tissue where molecules can move in all directions (volumes) and those portions where the movement of molecules is restricted to a plane (e.g., receptors inserted in the cell membrane can move only laterally). There are three major volumes of the tissue for our purposes: the vascular space (i.e., inside the blood vessels, determined by the density of blood vessels and their diameters), the intracellular space (whether inside parenchymal cells or endothelial cells), and the interstitial space between cells, which is itself divided into three volumes (each of which is not contiguous), based on the density of the fibrous matrix present: the extracellular matrix, and the basement membrane regions surrounding the endothelial cells and the myocytes.
Figure 18.3 Mesoscale modeling. (A) Schematic of generated microvascular network of capillaries, arterioles and venules, consistent with histological and other measurements of rat skeletal muscle. (B, C) Cross-section of muscle (indicated by box in A), showing the capillaries (red) and the muscle fibers or myocytes (brown; ‘‘SM’’). Detail in (C) of myocytes, endothelial cells, and the extracellular matrix (ECM) and basement membranes (gray; ‘‘MBM’’ and ‘‘EBM’’) between. For three-dimensional simulations, the full volume of tissue is used; for two-dimensional simulations the indicated cross-section is used. (D) Schematic illustrating how the tissue microanatomy (top row) impacts on the calculation of blood flow, oxygen distribution, VEGF distribution and VEGFR binding. (E) Local VEGF gradients within ligated EDL following treatment. The maximum within the tissue and the average across the tissue are reported. (F) VEGFR2 activation on vessels of ligated EDL following treatment. (A–D) Adapted from Mac Gabhann et al. (2007a). (E, F) Based on results previously published in Mac Gabhann et al. (2007a).
There are two major surfaces; again, these are not contiguous. The first comprises the combined cell surfaces of the skeletal myocytes, which are assumed to be cylindrical (diameter 37.5 μm, consistent with rat histology) and arranged in a regular hexagonal grid formation, accounting for almost 80% of the total tissue cross-sectional area. VEGF is secreted from the myocytes' surfaces. The second is the surface of the endothelial cells that make up the blood vessels, specifically the abluminal surface (the luminal surface faces the blood stream, and we neglect it for now). Again, the blood vessels are assumed to be cylindrical, and although most (but not all) are parallel to the muscle fibers, they do not occupy every possible position between fibers, but instead have
a stochastic, nonuniform arrangement (based on experimentally measured capillary-to-fiber ratios, capillary-to-fiber distances, and histology), occupying 2.5% of the total tissue volume (leaving 18% as interstitial space). On this endothelial surface, VEGF receptors are expressed. Thus, VEGF must diffuse from the myocyte surface where it is secreted, through basement membranes and extracellular matrix, to the endothelial surface where it ligates its cognate receptors. To model tissues at the mesoscale, we use the above microanatomical information as an input to a set of integrated models of blood flow, oxygen transport, and VEGF transport (Fig. 18.3D).

3.1.2. Volumes: Blood flow
Blood flow and hematocrit calculations are based on the two-phase continuum model of Pries and Secomb (2005) and reduce to a system of nonlinear algebraic equations (two per vessel) that are solved iteratively (Ji et al., 2006). The Fahraeus–Lindqvist effect and the nonuniform hematocrit distribution at vascular bifurcations are included in the blood flow model. Higher blood flow rates are used for exercising conditions, to represent the increased perfusion (and enhanced oxygen delivery) of exercising muscles. In addition, exercise-trained rats have higher average capillary blood velocity.

3.1.3. Volumes: Diffusion and consumption of oxygen
Oxygen transport in the tissue is detailed in Ji et al. (2006) and Goldman and Popel (2000). Oxygen arrives in the tissue via the blood vessel network, and the partial pressure of oxygen in the vessels, $P_b$, is described by

$$v_b\left(\alpha_b + H_D\,C_{bind}^{RBC}\frac{\partial S_{Hb}^{RBC}}{\partial P_b}\right)\frac{\partial P_b}{\partial x} + \frac{2}{R}J_{wall} = 0 \tag{18.15}$$

where $v_b$ is the mean blood velocity; $\alpha_b$ is the oxygen solubility in blood; $H_D$ is the discharge hematocrit; $C_{bind}^{RBC}$ and $S_{Hb}^{RBC}$ are the oxygen binding capacity and oxygen saturation of the red blood cells; $x$ is the longitudinal position in the vessel; $R$ is the vessel radius; and $J_{wall}$ is the oxygen flux across the vessel wall (i.e., into the tissue). Oxygen diffuses across the endothelial cells and freely throughout the tissue (both interstitial and intracellular). Within the cells, it is consumed metabolically, and it also binds reversibly to myoglobin (Mb). The local partial pressure of oxygen in the tissue, $P$, is described by

$$\left(1 + \frac{C_{bind}^{Mb}}{\alpha}\frac{\partial S_{Mb}}{\partial P}\right)\frac{\partial P}{\partial t} = D_{O_2}\nabla^2 P + \frac{D_{Mb}\,C_{bind}^{Mb}}{\alpha}\,\nabla\cdot\left(\frac{\partial S_{Mb}}{\partial P}\nabla P\right) - \frac{M(P)}{\alpha} \tag{18.16}$$
where $D_{O_2}$ and $D_{Mb}$ are the diffusivities of oxygen and myoglobin; $\alpha$ is the oxygen solubility in tissue; $C_{bind}^{Mb}$ and $S_{Mb}$ are the oxygen binding capacity and oxygen saturation of myoglobin; and $M(P)$ represents the Michaelis–Menten kinetic consumption of oxygen.
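As an illustration of the terms entering Eqs. (18.15) and (18.16), the following helper functions assume a Hill form for the saturation curve $S(P)$ and the Michaelis–Menten form for $M(P)$; the Hill form is a standard choice in this literature (cf. Goldman and Popel, 2000), and all default parameter values below are generic placeholders, not the chapter's fitted values.

```python
def hill_saturation(P, P50=37.0, n=2.7):
    """Fractional O2 saturation S(P), P in mmHg (placeholder P50, n)."""
    return P**n / (P**n + P50**n)

def o2_consumption(P, M0=5.0e-5, P_crit=1.0):
    """Michaelis-Menten O2 consumption M(P), as assumed in Eq. (18.16)."""
    return M0 * P / (P + P_crit)

def dSdP(P, h=1e-4):
    """dS/dP, needed for the capacitance factors in Eqs. (18.15)-(18.16),
    approximated here by a central difference."""
    return (hill_saturation(P + h) - hill_saturation(P - h)) / (2.0 * h)
```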
3.1.4. Volumes: Diffusion of VEGF and sequestration by ECM
The VEGF ligands VEGF120 (the rodent form of VEGF121) and VEGF164 (the rodent form of VEGF165) can both diffuse through the interstitium following secretion; however, the longer isoform also binds to glycoproteins in the extracellular matrix, becoming reversibly sequestered. The equations are thus identical to those of Section 2, with the addition of binding and unbinding terms:

$$\frac{\partial C_i}{\partial t} = D_i \nabla^2 C_i - \sum_{j}^{\text{if } i+j \text{ bind}} (k_{on,i,j}\,C_i\,C_j - k_{off,ij}\,C_{ij}) \tag{18.17}$$

where $C_i$ and $C_j$ are the concentrations of two interstitial molecules, $i$ and $j$. In the rat EDL model, $C_i = [V_{120}]$ or $[V_{164}]$; $C_j = [\text{GAG}]$. The concentration of proteins in the thin endothelial or myocyte basement membranes is given by an equation of the form:

$$\frac{\partial C_i}{\partial t} = \frac{1}{d_{BM}}\left(s_i - J_{out,i}\right) + \sum_{m}^{\text{if } i+m \text{ bind}} (k_{off,im}\,R_{im} - k_{on,i,m}\,C_i\,R_m) + \sum_{j}^{\text{if } i+j \text{ bind}} (k_{off,ij}\,C_{ij} - k_{on,i,j}\,C_i\,C_j) \tag{18.18}$$

where $s_i$ is the secretion rate from the cell (typically from myocytes); $R_m$ and $R_{im}$ are the concentrations of receptor $m$ and of the $i$–$m$ complex on the cell surface (typically on endothelial cells); $J_{out,i}$ is the Fickian diffusive flux of VEGF from the BM to the ECM; and $d_{BM}$ is the basement membrane thickness.
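The following sketch shows how one explicit time step of Eq. (18.17) might be coded for VEGF164 and free matrix (GAG) sites in one dimension. It is illustrative only: the production models cited above use an implicit scheme, and all arguments here are placeholders.

```python
import numpy as np

def step_vegf164(V, C_VG, G, D, dx, dt, kon, koff):
    """One explicit step of Eq. (18.17).
    V: free VEGF164 field; G: free GAG sites; C_VG: sequestered complex."""
    lap = np.zeros_like(V)
    lap[1:-1] = (V[2:] - 2.0*V[1:-1] + V[:-2]) / dx**2   # 1D Laplacian
    seq = kon * V * G - koff * C_VG                      # net sequestration
    return V + dt*(D*lap - seq), C_VG + dt*seq, G - dt*seq
```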
3.1.5. Surfaces: Receptor–ligand interactions
The ligand–receptor interactions that take place are precisely those that were outlined in Section 2, and that will be used in Sections 4 and 5: VEGF120 and VEGF164 bind to VEGFR1 and VEGFR2, while only the longer isoform binds neuropilin-1 and the extracellular matrix. The general form of the receptor and receptor-complex equations is therefore:
$$\frac{\partial R_m}{\partial t} = (s_m - k_{int,m}\,R_m) + \sum_{i}^{\text{if } i+m \text{ bind}} (k_{off,im}\,R_{im} - k_{on,i,m}\,C_i\,R_m) + \sum_{n}^{\text{if } m+n \text{ bind}} (k_{dissoc,mn}\,R_{mn} - k_{couple,m,n}\,R_m\,R_n) \tag{18.19}$$
where $s_m$ and $k_{int,m}$ are the membrane insertion rate and internalization rate of receptor $m$; $k_{couple,m,n}$ and $k_{dissoc,mn}$ are the kinetic rates of binding and unbinding of two surface receptors $m$ and $n$ to each other. Note in particular that the ligand concentration ($C_i$) in each case is the concentration in the basement membrane region closest to the receptor. Thus, the receptor occupancy varies from cell to cell across the capillary network. Examples of specific individual equations can be found in Section 2.

3.1.6. Surfaces: VEGF production/secretion rates
The production and secretion of VEGF has been observed to be inducible by hypoxia (Forsythe et al., 1996; Qutub and Popel, 2006). Here, we use an empirical relationship (Mac Gabhann et al., 2007b) for the increase in the baseline secretion rate of VEGF ($S_0$), based on the observed upregulation of VEGF mRNA and protein during hypoxia in cells and tissues (Jiang et al., 1996; Tang et al., 2004):

$$S = S_0\left[1 + 5\left(\frac{20 - P_{O_2}}{19}\right)^{a}\right],\quad \text{with } S = S_0 \text{ for } P_{O_2} \geq 20\ \text{mmHg and } S = 6S_0 \text{ for } P_{O_2} \leq 1\ \text{mmHg} \tag{18.20}$$
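A direct transcription of the piecewise rule in Eq. (18.20) as reconstructed here; the exponent a is a model parameter, and the function is continuous at both breakpoints:

```python
def vegf_secretion(PO2, S0, a=1.0):
    """Hypoxia-dependent VEGF secretion rate, Eq. (18.20). PO2 in mmHg."""
    if PO2 >= 20.0:          # normoxia: baseline secretion
        return S0
    if PO2 <= 1.0:           # severe hypoxia: saturated 6-fold upregulation
        return 6.0 * S0
    return S0 * (1.0 + 5.0 * ((20.0 - PO2) / 19.0)**a)
```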
3.1.7. What is not included in these models?
Intracellular VEGF is not included in these simulations; that includes both postinternalization VEGF and presecretion VEGF. In addition, we neglect the intravasation of VEGF into the bloodstream, whether by endothelial cell secretion or through paracellular routes (e.g., permeability). Lymphatic transport of VEGF is also neglected. These additional transport routes could be accommodated in the above model structure with the addition of new surfaces or terms. Although endothelial VEGF production and parenchymal VEGFR expression have been observed in recent years (Bogaert et al., 2009; Lee et al., 2007a,b), these are not included as part of these simulations; there is no technical obstacle to doing so.
3.1.8. Relationship to single-compartment models
It is important to note that the spatial averages of the VEGF concentrations at the endothelial cell surface and of VEGFR activation in the mesoscale models match well with the values in single-compartment models (Section 4) that do not include diffusion or VEGF gradients. Thus, it may be possible to calculate the average receptor activation using less computationally intensive compartment models, and to use the mesoscale models to estimate the spatial gradients.
3.2. Case study: Proangiogenic VEGF gene therapy for muscle ischemia

To improve the perfusion and healing of ischemic muscle tissue with an impaired angiogenic response, several therapies have been suggested, typically involving the delivery of VEGF (one or more isoforms) to the muscle. The first of these, gene therapy, increases VEGF secretion by adding additional VEGF-encoding genes to the cells that are transfected. By transfecting multiple copies, or by judicious choice of VEGF promoters and enhancers in the new construct, significant increases in VEGF secretion can be obtained. We have modeled both uniform upregulation of VEGF (increasing VEGF secretion at every myocyte surface point in the model) and stochastic upregulation, in which each cell has a randomly increased VEGF production within a certain range (using the myonuclear density, we know the size of the myocyte surface that is under the control of each nucleus; thus, we can assign a random number to each region that stays constant throughout the simulation; see the sketch below) (Mac Gabhann et al., 2007a). These increases in VEGF production result in increased VEGFR2 activation; however, the VEGF gradients are not significantly increased (Fig. 18.3E and F); in this case blood vessels might be induced to sprout, but they have no directional cues. Further simulations restricting the VEGF transfection to a specific region of the muscle demonstrate increased VEGFR2 activation coupled with very high VEGF gradients toward the transfected tissue, but only in a narrow region between transfected and nontransfected tissue (Mac Gabhann et al., 2007a). This suggests that VEGF gene delivery needs to be effectively localized, with a high degree of spatial accuracy, to allow the gradients of VEGF to bring the new vessels to the affected volume.
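A minimal sketch of the stochastic-upregulation rule just described: one random secretion multiplier per myonuclear domain, drawn once and then held fixed. The fold range and domain count are placeholders, not values from Mac Gabhann et al. (2007a).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the assignment is reproducible

def assign_secretion(n_domains, S0, fold_range=(1.0, 10.0)):
    """One constant multiplier per myonuclear surface domain."""
    folds = rng.uniform(*fold_range, size=n_domains)
    return S0 * folds

domain_rates = assign_secretion(n_domains=500, S0=1.0)
```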
3.3. Case study: Proangiogenic VEGF cell-based therapy for muscle ischemia

Another route to bringing more VEGF to the tissue, and one which may allow for more spatial specificity, is the delivery of VEGF-overexpressing cells, for example, myoblasts that will effectively integrate into the existing
muscle and produce excess VEGF locally. To simulate this, we select specific myocytes in the model to overexpress VEGF, and distribute these distantly or close together (Mac Gabhann et al., 2006, 2007a). That is, since the secretion rate of VEGF can have a different value at every spatial location on the myocyte surfaces in our model, we can upregulate VEGF in a specific subset of these cells. For this therapy, we observe in the simulations both increased VEGFR2 binding and increased VEGF gradients (Fig. 18.3E and F), but only within approximately one to two myocyte diameters of the new VEGF-overexpressing cells (Mac Gabhann et al., 2006, 2007a). In addition, cells close together synergize, while distant ones do not. In this way, we can see that a small number of cells, or cells distributed too broadly, would have a low probability of attracting perfusion from a neighboring region; however, a large mass of cells, at the right location, could serve as a local chemoattractant. The results described in Sections 3.2 and 3.3, for therapies reliant on VEGF upregulation alone, mirror the outcome of the several clinical trials of VEGF isoforms in humans for coronary artery disease (CAD) or peripheral artery disease (PAD); these trials have not had the success that was expected of them. Instead, the standard of care for PAD continues to be exercise, and it is this therapy that we consider next.
3.4. Case study: Proangiogenic exercise therapy for muscle ischemia

Exercise training in rats has been shown not only to restore the ability of hypoxic, ischemic tissue to upregulate VEGF following injury, but also to increase the expression levels of the VEGF receptors (Lloyd et al., 2003). Thus, we used our model to simulate the exercise-dependent upregulation of both the ligands and the receptors, using experimentally measured increases (Ji et al., 2007; Mac Gabhann et al., 2007a,b). In this case, we increase the secretion rate of the VEGF isoforms from each point on the myocyte surface during exercise; in addition, we increase the insertion rate of the VEGF receptors at every point on the endothelial cell surface at all times (as a result of exercise training). The results of these simulations are quite different from those before: first, during exercise, both the VEGFR2 activation and the VEGF gradients are increased, not just locally but across the upregulated tissue (Ji et al., 2007; Mac Gabhann et al., 2007a); second, during rest periods, while VEGF upregulation ceases and the occupancy of VEGFR2 returns to lower levels, the high VEGF gradients are maintained (Fig. 18.3E and F). This suggests that the activation step for attracting new blood vessels may occur during a smaller window of time, while the guidance of the new vessel to its destination can take place continuously.
This observation that our current best strategy for PAD, exercise, increases both ligand expression and receptor activation, leaves us with the possibility of developing combined ligand–receptor therapy (especially for those who cannot exercise).
4. Single-Tissue Compartmental Models: Simulation of In Vivo Tissue

From Section 3, we saw that the investigative use of VEGF models for therapeutic applications requires larger spatial-scale modeling of in vivo growth factor transport and effects at the tissue level. However, the full spatial resolution afforded by 3D modeling becomes computationally intensive when the volume of interest grows from tissue regions to whole tissues and organs. Here we demonstrate the strategy of compartmental modeling—where tissue fluid volumes (e.g., the interstitial fluid volume) are approximated as well-mixed compartments of uniform protein concentrations, based on the assumption that diffusion occurs on faster timescales than that of molecular binding kinetics (Damkohler number ≪ 1).

5.1.1. Microvascular permeability

$$\begin{aligned}\frac{d[\text{protein}]_{interstitium}}{dt} &= -\frac{k_p^{TB}\,S_{TB}}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}} + \frac{k_p^{BT}\,S_{TB}}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{blood}}{K_{AV,blood}}\\ \frac{d[\text{protein}]_{blood}}{dt} &= -\frac{k_p^{BT}\,S_{TB}}{\text{Blood Volume}}\,\frac{[\text{protein}]_{blood}}{K_{AV,blood}} + \frac{k_p^{TB}\,S_{TB}}{\text{Blood Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}}\end{aligned} \tag{18.24}$$

where $k_p^{TB}$ represents the microvascular permeability rate (cm/s) for the protein of interest going from the tissue (T) to the blood (B) compartment, acting across $S_{TB}$ (cm²), the total endothelial surface area (i.e., the tissue–blood interface). Microvascular permeability is modeled as a passive bidirectional transport process, represented in Fig. 18.5 by a double arrow, with identical intravasation and extravasation rates, $k_p^{TB} = k_p^{BT}$. The in vivo value of $k_p$ for a given protein is rarely found in the literature. Instead, we extrapolate from calibration curves (Garlick and Renkin, 1970) correlating permeability rates with protein size (the Stokes–Einstein radius of the molecule). For instance, the calculated Stokes–Einstein radius ($a_e$) of the 45 kDa globular VEGF protein is 30.2 Å according to $a_e = 0.483 \times (\text{molecular weight in Da})^{0.386}$ Å, as given in Venturoli and Rippe (2005). Our extrapolation methods yield a baseline permeability rate of $4.3 \times 10^{-8}$ cm/s for VEGF (Stefanini et al., 2008). The microvascular permeability for VEGF in tumor tissue (for use in modeling a breast tumor as the tissue of interest in Section 5.2) can be estimated from corresponding values for similar-sized molecules; for example, ovalbumin (45 kDa; $a_e$ = 30.8 Å) was measured to permeate at about $5.77 \times 10^{-7}$ cm/s (Yuan et al., 1995). Details can be found in Stefanini et al. (2008).
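The quoted size estimate is easy to reproduce; the following sketch implements the Venturoli and Rippe (2005) correlation exactly as written above. The 150 kDa call is simply an IgG-sized (bevacizumab-sized) comparison point, not a value quoted in the text.

```python
def stokes_einstein_radius(mw_da):
    """Stokes-Einstein radius in angstroms, a_e = 0.483 * MW^0.386 (MW in Da)."""
    return 0.483 * mw_da**0.386

print(stokes_einstein_radius(45e3))    # ~30.2 A for the 45 kDa VEGF protein
print(stokes_einstein_radius(150e3))   # an IgG-sized antibody, for comparison
```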
5.1.2. Lymphatic drainage
Lymphatic drainage is a major route by which interstitial proteins are transported to the blood, because size-dependent transendothelial permeability restricts their intravasation into blood capillaries. In contrast, there is no macromolecular impedance in the filling of the initial lymphatics; hence, the protein concentrations drained through the lymphatics are assumed to be continuous with the interstitial concentrations at the lymphatic entrance. Mathematically, we describe unidirectional lymphatic drainage as:

$$\frac{d[\text{protein}]_{interstitium}}{dt} = -\frac{k_L}{\text{Tissue Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}}; \qquad \frac{d[\text{protein}]_{blood}}{dt} = +\frac{k_L}{\text{Blood Volume}}\,\frac{[\text{protein}]_{interstitium}}{K_{AV,tissue}} \tag{18.25}$$
where $k_L$ is the lymphatic drainage rate (cm³/s). Detailed derivation and parameterization can be found in Wu et al. (2009b).
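Combining Eqs. (18.24) and (18.25) for a single free (nonbinding) protein gives a two-ODE system; the sketch below integrates it with SciPy. The volumes, available-volume fractions, surface area, and lymph rate are illustrative placeholders; only the VEGF permeability value is taken from the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

U_T, U_B = 3.0e4, 5.0e3       # tissue and blood fluid volumes (cm^3), placeholders
KAV_T, KAV_B = 0.2, 1.0       # available volume fractions, placeholders
S_TB = 1.0e6                  # endothelial surface area (cm^2), placeholder
kp = 4.3e-8                   # VEGF permeability, cm/s (value cited above)
kL = 2.0e-3                   # lymphatic drainage rate, cm^3/s, placeholder

def rhs(t, y):
    c_T, c_B = y              # interstitial and plasma concentrations
    perm = kp * S_TB * (c_T / KAV_T - c_B / KAV_B)   # net tissue -> blood flux
    lymph = kL * c_T / KAV_T                          # one-way lymph drainage
    return [-(perm + lymph) / U_T, (perm + lymph) / U_B]

sol = solve_ivp(rhs, (0.0, 7 * 86400.0), [1.0, 0.0], rtol=1e-8)
print("plasma concentration after 1 week:", sol.y[1, -1])
```

Note that the right-hand side conserves total protein, $U_T c_T + U_B c_B$, since no clearance term is included in this sketch.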
5.2. Case study: Pharmacokinetics of anti-VEGF therapy in cancer

The pharmacokinetics of anti-VEGF therapy can be studied via whole-body compartmental modeling based on the detailed biochemical reactions between VEGF ligands and their receptors described above. Specifically, we can simulate and study the VEGF isoform specificity of anti-VEGF agents, ligand–agent binding configurations (e.g., whether such binding is mono- or multimeric), agent biodistribution (whether the anti-VEGF agent is confined to one or several compartments), as well as various therapeutic regimen designs (varying the dosage, the frequency of administration, and the site of injection). This example models bevacizumab, a humanized monoclonal antibody to VEGF, the characteristics and properties of which have been reported (150 kDa; $K_d$ of 1.8 nM in Presta et al. (1997); half-life = 21 days in Gordon et al. (2001)). The diseased tissue of interest is a tumor. In the absence of the anti-VEGF agent, a total of 40 ordinary differential equations (ODEs) describes the compartmental system (19 ODEs for each tissue; 2 ODEs for the blood compartment). When the anti-VEGF agent is added and confined to the blood compartment (a nonextravasating agent), three more equations are added to the model (blood compartment), representing the chemical interactions of the anti-VEGF with VEGF121 and VEGF165, as well as the free anti-VEGF:

$$\begin{aligned}\frac{d[A]_{blood}}{dt} = {}& q_A - c_A\,K_{AV,blood,A}\,[A]_{blood}\\ &+ k_{off,V{\cdot}A,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,VA}}\,[V_{165}{\cdot}A]_{blood} - k_{on,V{\cdot}A,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,V}}\,[V_{165}]_{blood}\,[A]_{blood}\\ &+ k_{off,V{\cdot}A,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,VA}}\,[V_{121}{\cdot}A]_{blood} - k_{on,V{\cdot}A,blood}\frac{K_{AV,blood,A}}{K_{AV,blood,V}}\,[V_{121}]_{blood}\,[A]_{blood}\end{aligned} \tag{18.26}$$
where $K_{AV,blood,i}$ is the available volume fraction of molecule $i$ ($A$ = anti-VEGF, $V$ = VEGF, $VA$ = the VEGF/anti-VEGF complex). Note that $U_{AV} = K_{AV}\,U$. The first term, $q_A$, represents the injection of anti-VEGF drug into the patient's bloodstream. The second term represents the clearance of the anti-VEGF agent by the organs (e.g., the kidneys or liver). The equations related to the complexed form of the anti-VEGF drug are of the form:

$$\frac{d[V_{165}{\cdot}A]_{blood}}{dt} = -c_{VA}\,[V_{165}{\cdot}A]_{blood} - k_{off,V{\cdot}A,blood}\,[V_{165}{\cdot}A]_{blood} + k_{on,V{\cdot}A,blood}\frac{(K_{AV,blood,VA})^2}{K_{AV,blood,V}\,K_{AV,blood,A}}\,[V_{165}]_{blood}\,[A]_{blood} \tag{18.27}$$
Figure 18.6 illustrates the transient dynamics resulting from the intravenous injection of the VEGF antibody; in these simulations, a breast tumor of 2 cm diameter is considered. In the single-dose treatment (10 mg/kg), intravenous injection leads to a rapid decrease in the free VEGF concentration (Fig. 18.6A). That level returns to baseline after about 3–4 weeks. For daily smaller doses, or "metronomic" treatment (1 mg/kg for 10 days), a new lower pseudo-steady state for the plasma VEGF level emerges for the duration of treatment (Fig. 18.6B). Following treatment cessation (10 days), the plasma VEGF level returns to its pretreatment value after about 3 weeks. The metronomic injection also delays the peak of maximum formation of the VEGF–anti-VEGF complex as compared to a single-dose treatment (Fig. 18.6C and D). Equations (18.26) and (18.27) describe an intravenous injection of anti-VEGF antibodies that would be confined to the plasma. If the anti-VEGF agent is injected intravenously and can extravasate, terms of the form of Eq. (18.24) are added to Eqs. (18.26) and (18.27), and additional equations corresponding to the free and bound antibody concentrations are needed for each tissue compartment in the system of ODEs. Interestingly, the addition of extravasation to the model drastically changes the form of the response. Transiently after injection, the free VEGF level in plasma drops drastically (data not shown). This is due to the binding of the antibody to the free VEGF present in the plasma. Following this drop, there is a several-fold increase of the free VEGF concentration in plasma. This apparent "rebound" effect is due to the amount of drug delivered and its extravasation. Briefly, while some antibodies bind to the free VEGF in plasma, another portion extravasates and binds to the free VEGF present in the available interstitial fluid volume of the tissues (healthy tissue and breast tumor). Although some of the formed complexes subsequently dissociate within the same tissue, significant quantities are brought into the bloodstream (via microvascular permeability and lymphatic drainage) where the complex dissociates,
Figure 18.6 Compartmental model of whole-body anti-VEGF pharmacokinetics. Comparison between single-dose (A, C, E) and metronomic (B, D, F) intravenously delivered anti-VEGF treatment (without extravasation of the anti-VEGF molecule). Each dose (intravenous infusion) takes place over 90 min. (A, C, E) Single dose of 10 mg/kg; (B, D, F) 1 mg/kg daily for 10 days. Rows show free VEGF (A, B), the VEGF–anti-VEGF complex (C, D), and free anti-VEGF (E, F) in healthy tissue, blood, and tumor over 35 days.
leading to more free VEGF in the blood compartment. Such a counterintuitive increase in serum VEGF following intravenous administration of anti-VEGF agents has been observed in experiments (Gordon et al., 2001; Segerstrom et al., 2006; Willett et al., 2005) and our model is, to our knowledge, the first to explain this phenomenon by an intrinsic mechanism of intertissue transport.
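As a schematic of how the two regimens of Fig. 18.6 can be encoded, the following reduces the model to a one-compartment plasma balance for the free antibody alone (no VEGF binding and no extravasation, so no rebound appears here); dose rates are in arbitrary units, and only the 21-day half-life comes from the text.

```python
import numpy as np

c_A = np.log(2) / (21 * 86400)    # first-order clearance from 21-day half-life
dt, T = 60.0, 35 * 86400          # 1-min steps over a 35-day window

def simulate(doses):
    """doses: list of (start_time_s, infusion_rate); each infusion lasts 90 min."""
    A, out = 0.0, []
    for k in range(int(T / dt)):
        t = k * dt
        q = sum(rate for (t0, rate) in doses if t0 <= t < t0 + 90 * 60)
        A += dt * (q - c_A * A)   # forward-Euler update of plasma antibody
        out.append(A)
    return np.array(out)

single = simulate([(0.0, 1.0)])                               # one large dose
metronomic = simulate([(d * 86400, 0.1) for d in range(10)])  # 10 daily 1/10 doses
```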
This example illustrates how computational models can provide useful insights that are not easily accessible by in vitro or in vivo experiments. This model can also be extended to examine the effects of drug treatment via alternate routes of anti-VEGF administration (e.g., intramuscular injection).
5.3. Case study: Mechanism of sVEGFR1 as a ligand trap

In this last example of a multitissue compartmental model, we investigated the molecular mechanisms by which sVEGFR1, a truncated soluble variant of the endothelial cell-surface VEGFR1, inhibits VEGF signaling. The two prevailing postulated mechanisms are: direct VEGF ligand sequestration, reducing the VEGF available for VEGFR activation (Fig. 18.7A, middle); and heterodimerization with cell-surface VEGFR monomers, rendering the receptor dimer nonfunctional, as trans-phosphorylation of the paired intracellular domains of full-length VEGFRs is necessary for activating signal transduction (Fig. 18.7A, bottom). The model, as described in detail in Wu et al. (2009a), simulated the first mechanism, to assess the antiangiogenic potential of sVEGFR1's ligand-trapping capacity alone. sVEGFR1 in its monomeric (110 kDa) and dimeric (220 kDa) forms is about two to five times larger than VEGF; thus, free sVEGFR1 and the sVEGFR1·VEGF complex have lower vascular permeability rates than VEGF, while sharing the same lymphatic drainage rates as VEGF. Full mathematical equations describing these transport properties, along with the sVEGFR1–VEGF binding and sVEGFR1–NRP1 coupling interactions, can be found in Wu et al. (2009a). Sample equations for the concentrations of free sVEGFR1 and sVEGFR1·VEGF165 in tissue $j$ are given here:

$$\begin{aligned}\frac{d[V_{165}{\cdot}sR1]_j}{dt} = {}& -\frac{k_{L,j}}{U_j}\frac{[V_{165}{\cdot}sR1]_j}{K_{AV,j}} + \frac{\gamma_j S_{jB}}{U_j}\left(k_{p,V{\cdot}sR1}^{B\to j}\frac{[V_{165}{\cdot}sR1]_B}{K_{AV,B}} - k_{p,V{\cdot}sR1}^{j\to B}\frac{[V_{165}{\cdot}sR1]_j}{K_{AV,j}}\right)\\ &+ k_{on,V{\cdot}sR1,j}[V_{165}]_j[sR1]_j - k_{off,V{\cdot}sR1,j}[V_{165}{\cdot}sR1]_j\end{aligned} \tag{18.28}$$

$$\begin{aligned}\frac{d[sR1]_j}{dt} = {}& q_{sR1,j} - \frac{k_{L,j}}{U_j}\frac{[sR1]_j}{K_{AV,j}} + \frac{\gamma_j S_{jB}}{U_j}\left(k_{p,sR1}^{B\to j}\frac{[sR1]_B}{K_{AV,B}} - k_{p,sR1}^{j\to B}\frac{[sR1]_j}{K_{AV,j}}\right)\\ &- k_{on,sR1{\cdot}M,j}[sR1]_j[M_{EBM}]_j + k_{off,sR1{\cdot}M,j}[sR1{\cdot}M_{EBM}]_j\\ &- k_{on,sR1{\cdot}M,j}[sR1]_j[M_{ECM}]_j + k_{off,sR1{\cdot}M,j}[sR1{\cdot}M_{ECM}]_j\\ &- k_{on,sR1{\cdot}M,j}[sR1]_j[M_{PBM}]_j + k_{off,sR1{\cdot}M,j}[sR1{\cdot}M_{PBM}]_j\\ &- k_{on,sR1{\cdot}N1,j}[sR1]_j[N1]_j + k_{off,sR1{\cdot}N1,j}[sR1{\cdot}N1]_j\\ &- k_{on,V{\cdot}sR1,j}[V_{121}]_j[sR1]_j + k_{off,V{\cdot}sR1,j}[V_{121}{\cdot}sR1]_j\\ &- k_{on,V{\cdot}sR1,j}[V_{165}]_j[sR1]_j + k_{off,V{\cdot}sR1,j}[V_{165}{\cdot}sR1]_j\end{aligned} \tag{18.29}$$
Figure 18.7 Compartmental model of whole-body sVEGFR1 transport. (A) sVEGFR1 has been postulated to have antagonistic effects on VEGF signaling complex formation (top row) through competitive binding of VEGF ligands (middle row) and dominant-negative heterodimerization with endothelial cell-surface VEGF receptors (bottom row). (B) Sample simulations of the ligand-trapping effects of intravascularly administered exogenous sVEGFR1, based on results previously published in Wu et al. (2009a); panels show free VEGF in the normal interstitium, plasma, and calf interstitium versus time, for a control case and for cases in which the permeability and/or lymphatic drainage of free sVEGFR1 and the sVEGFR1·VEGF complex are set to zero.
In this model, calf muscle tissue was chosen as the "tissue of interest" compartment (Fig. 18.5), in order to investigate the effects of endogenous sVEGFR1 produced by the calf muscle on local (calf compartment) and global (normal compartment) VEGF signaling complex formation. Such predictions were expected to provide insight into whether pathological upregulation of sVEGFR1 expression in the calf may contribute to the dampened VEGF response observed in ischemic calf muscles in peripheral arterial disease. Therapeutic intravascular (IV) delivery of exogenous sVEGFR1 was also simulated, to assess its efficacy in lowering systemic levels of VEGF (Wu et al., 2009a). While, intuitively, intravascular sVEGFR1·VEGF complex formation following simulated IV infusion of sVEGFR1 would be expected to lead to a sustained reduction of plasma free VEGF, in fact the sample results in Fig. 18.7B show that the permeability rates and lymphatic drainage rates of free sVEGFR1 and sVEGFR1·VEGF are predicted to critically determine whether this reduction takes place or not. In other words, these simulations suggest that prior to the clinical translation of administering exogenous sVEGFR1 to lower systemic VEGF levels in antiangiogenic therapy, extensive experimental research is needed to exclude the computationally predicted possibilities in which sVEGFR1 delivery counterintuitively elevates plasma free VEGF. As in Section 5.2, this computational study demonstrates that the intertissue transport properties of proteins significantly affect their whole-body pharmacokinetic effects.
6. Conclusions

In this chapter, we summarized several computational models for investigating different spatial and temporal aspects of VEGF systems biology. In Section 2, we described molecular network models that simulated in vitro endothelial cell-surface interaction experiments to investigate the roles of the PlGF ligand and NRP1 coreceptor within the VEGF system. In Section 3, we presented mesoscale models for investigating the effects of in vivo tissue architecture on VEGF ligand and receptor interactions, and for predicting the intramuscular response and relative therapeutic efficacy of various modalities of proangiogenic therapies (gene vs. cell vs. exercise therapy) for ischemic muscle diseases. In Sections 4 and 5, molecular-detailed compartmental modeling was introduced as a method for efficient prediction of average molecular concentrations within tissue subcompartments (e.g., interstitial or plasma VEGF concentrations; intramuscular cell-surface density of signaling complexes) and investigation of intercompartment transport processes (e.g., microvascular permeability and lymphatic drainage). We described several compartmental model studies
predicting the in vivo effects of intratissue trafficking and whole-body pharmacokinetics on the angiogenic response to treatments using NRP1, anti-VEGF agents, and sVEGFR1 as therapeutic targets/agents. While the VEGF system models presented in this chapter were limited to the representation of two ligand isoforms (VEGF121 and VEGF165) and three receptors (VEGFR1, VEGFR2, NRP1), these model frameworks can be readily extended to include other VEGF ligand isoforms and receptors (e.g., VEGFR3, NRP2). Furthermore, similar computational modeling techniques are applicable, and have contributed, to the study of other growth factor systems, including those of FGF (Filion and Popel, 2004, 2005; Forsten et al., 2000) and EGF (Wiley et al., 2003).
ACKNOWLEDGMENTS This work was supported by NIH grants R01 HL079653, R33 HL0877351, and R01 CA138264.
REFERENCES

Andrae, J., Gallini, R., and Betsholtz, C. (2008). Role of platelet-derived growth factors in physiology and medicine. Genes Dev. 22, 1276–1312.
Augustin, H. G., Koh, G. Y., Thurston, G., and Alitalo, K. (2009). Control of vascular morphogenesis and homeostasis through the angiopoietin-tie system. Nat. Rev. Mol. Cell. Biol. 10, 165–177.
Autiero, M., Waltenberger, J., Communi, D., Kranz, A., Moons, L., Lambrechts, D., Kroll, J., Plaisance, S., De Mol, M., Bono, F., Kliche, S., Fellbrich, G., et al. (2003). Role of PlGF in the intra- and intermolecular cross talk between the VEGF receptors Flt1 and Flk1. Nat. Med. 9, 936–943.
Bao, P., Kodra, A., Tomic-Canic, M., Golinko, M. S., Ehrlich, H. P., and Brem, H. (2009). The role of vascular endothelial growth factor in wound healing. J. Surg. Res. 153, 347–358.
Beenken, A., and Mohammadi, M. (2009). The FGF family: Biology, pathophysiology and therapy. Nat. Rev. Drug Discov. 8, 235–253.
Bogaert, E., Van Damme, P., Poesen, K., Dhondt, J., Hersmus, N., Kiraly, D., Scheveneels, W., Robberecht, W., and Van Den Bosch, L. (2009). VEGF protects motor neurons against excitotoxicity by upregulation of GluR2. Neurobiol. Aging [Epub ahead of print] PubMed ID: 19185395.
Brown, M. D., and Hudlicka, O. (2003). Modulation of physiological angiogenesis in skeletal muscle by mechanical forces: Involvement of VEGF and metalloproteinases. Angiogenesis 6, 1–14.
Cao, Y. (2009). Positive and negative modulation of angiogenesis by VEGFR1 ligands. Sci. Signal. 2, re1.
Collinson, D. J., and Donnelly, R. (2004). Therapeutic angiogenesis in peripheral arterial disease: Can biotechnology produce an effective collateral circulation? Eur. J. Vasc. Endovasc. Surg. 28, 9–23.
Ebos, J. M., Bocci, G., Man, S., Thorpe, P. E., Hicklin, D. J., Zhou, D., Jia, X., and Kerbel, R. S. (2004). A naturally occurring soluble form of vascular endothelial growth factor receptor 2 detected in mouse and human plasma. Mol. Cancer Res. 2, 315–326.
Feng, D., Nagy, J. A., Dvorak, H. F., and Dvorak, A. M. (2002). Ultrastructural studies define soluble macromolecular, particulate, and cellular transendothelial cell pathways in venules, lymphatic vessels, and tumor-associated microvessels in man and animals. Microsc. Res. Tech. 57, 289–326.
Ferrara, N., and Davis-Smyth, T. (1997). The biology of vascular endothelial growth factor. Endocr. Rev. 18, 4–25.
Filion, R. J., and Popel, A. S. (2004). A reaction-diffusion model of basic fibroblast growth factor interactions with cell surface receptors. Ann. Biomed. Eng. 32, 645–663.
Filion, R. J., and Popel, A. S. (2005). Intracoronary administration of FGF-2: A computational model of myocardial deposition and retention. Am. J. Physiol. Heart Circ. Physiol. 288, 263–279.
Forsten, K. E., Fannon, M., and Nugent, M. A. (2000). Potential mechanisms for the regulation of growth factor binding by heparin. J. Theor. Biol. 205, 215–230.
Forsythe, J. A., Jiang, B. H., Iyer, N. V., Agani, F., Leung, S. W., Koos, R. D., and Semenza, G. L. (1996). Activation of vascular endothelial growth factor gene transcription by hypoxia-inducible factor 1. Mol. Cell. Biol. 16, 4604–4613.
Fu, B. M., and Shen, S. (2003). Structural mechanisms of acute VEGF effect on microvessel permeability. Am. J. Physiol. Heart Circ. Physiol. 284, 2124–2135.
Gagnon, M. L., Bielenberg, D. R., Gechtman, Z., Miao, H. Q., Takashima, S., Soker, S., and Klagsbrun, M. (2000). Identification of a natural soluble neuropilin-1 that binds vascular endothelial growth factor: In vivo expression and antitumor activity. Proc. Natl. Acad. Sci. USA 97, 2573–2578.
Garlick, D. G., and Renkin, E. M. (1970). Transport of large molecules from plasma to interstitial fluid and lymph in dogs. Am. J. Physiol. 219, 1595–1605.
Gerber, H. P., Malik, A. K., Solar, G. P., Sherman, D., Liang, X. H., Meng, G., Hong, K., Marsters, J. C., and Ferrara, N. (2002). VEGF regulates haematopoietic stem cell survival by an internal autocrine loop mechanism. Nature 417, 954–958.
Girling, J. E., and Rogers, P. A. (2005). Recent advances in endometrial angiogenesis research. Angiogenesis 8, 89–99.
Goldman, D., and Popel, A. S. (2000). A computational study of the effect of capillary network anastomoses and tortuosity on oxygen transport. J. Theor. Biol. 206, 181–194.
Gordon, M. S., Margolin, K., Talpaz, M., Sledge, G. W. Jr., Holmgren, E., Benjamin, R., Stalter, S., Shak, S., and Adelman, D. (2001). Phase I safety and pharmacokinetic study of recombinant human anti-vascular endothelial growth factor in patients with advanced cancer. J. Clin. Oncol. 19, 843–850.
Gschwind, A., Fischer, O. M., and Ullrich, A. (2004). The discovery of receptor tyrosine kinases: Targets for cancer therapy. Nat. Rev. Cancer 4, 361–370.
Haigh, J. J. (2008). Role of VEGF in organogenesis. Organogenesis 4, 247–256.
Harper, S. J., and Bates, D. O. (2008). VEGF-A splicing: The key to anti-angiogenic therapeutics? Nat. Rev. Cancer 8, 880–887.
Ji, J. W., Tsoukias, N. M., Goldman, D., and Popel, A. S. (2006). A computational model of oxygen transport in skeletal muscle for sprouting and splitting modes of angiogenesis. J. Theor. Biol. 241, 94–108.
Ji, J. W., Mac Gabhann, F., and Popel, A. S. (2007). Skeletal muscle VEGF gradients in peripheral arterial disease: Simulations of rest and exercise. Am. J. Physiol. Heart Circ. Physiol. 293, H3740–H3749.
Jiang, B. H., Semenza, G. L., Bauer, C., and Marti, H. H. (1996). Hypoxia-inducible factor 1 levels vary exponentially over a physiologically relevant range of O2 tension. Am. J. Physiol. 271, C1172–C1180.
Kerbel, R. S. (2008). Tumor angiogenesis. N. Engl. J. Med. 358, 2039–2049.
Kut, C., Mac Gabhann, F., and Popel, A. S. (2007). Where is VEGF in the body? A meta-analysis of VEGF distribution in cancer. Br. J. Cancer 97, 978–985.
Lauffenburger, D. A., and Linderman, J. L. (1993). Receptors: Models for Binding, Trafficking, and Signaling. Oxford University Press, New York.
Lee, S., Jilani, S. M., Nikolova, G. V., Carpizo, D., and Iruela-Arispe, M. L. (2005). Processing of VEGF-A by matrix metalloproteinases regulates bioavailability and vascular patterning in tumors. J. Cell. Biol. 169, 681–691.
Lee, S., Chen, T. T., Barber, C. L., Jordan, M. C., Murdock, J., Desai, S., Ferrara, N., Nagy, A., Roos, K. P., and Iruela-Arispe, M. L. (2007a). Autocrine VEGF signaling is required for vascular homeostasis. Cell 130, 691–703.
Lee, T., Seng, S., Sekine, M., Hinton, C., Fu, Y., Avraham, H. K., and Avraham, S. (2007b). Vascular endothelial growth factor mediates intracrine survival in human breast carcinoma cells through internally expressed VEGFR1/FLT1. PLoS Med. 4.
Lloyd, P. G., Prior, B. M., Yang, H. T., and Terjung, R. L. (2003). Angiogenic growth factor expression in rat skeletal muscle in response to exercise training. Am. J. Physiol. Heart Circ. Physiol. 284, H1668–H1678.
Lodish, H., Berk, A., Matsudaira, P., Kaiser, C. A., Krieger, M., Scott, M. P., Zipursky, S. L., and Darnell, J. (2004). Molecular Cell Biology. W.H. Freeman & Co., New York.
Mac Gabhann, F., and Popel, A. S. (2004). Model of competitive binding of vascular endothelial growth factor and placental growth factor to VEGF receptors on endothelial cells. Am. J. Physiol. Heart Circ. Physiol. 286, H153–H164.
Mac Gabhann, F., and Popel, A. S. (2005). Differential binding of VEGF isoforms to VEGF receptor 2 in the presence of neuropilin-1: A computational model. Am. J. Physiol. Heart Circ. Physiol. 288, H2851–H2860.
Mac Gabhann, F., and Popel, A. S. (2006). Targeting neuropilin-1 to inhibit VEGF signaling in cancer: Comparison of therapeutic approaches. PLoS Comput. Biol. 2, e180.
Mac Gabhann, F., and Popel, A. S. (2008). Systems biology of vascular endothelial growth factors. Microcirculation 15, 715–738.
Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2006). Computational model of vascular endothelial growth factor spatial distribution in muscle and pro-angiogenic cell therapy. PLoS Comput. Biol. 2, e127.
Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2007a). Multi-scale computational models of pro-angiogenic treatments in peripheral arterial disease. Ann. Biomed. Eng. 35, 982–994.
Mac Gabhann, F., Ji, J. W., and Popel, A. S. (2007b). VEGF gradients, receptor activation, and sprout guidance in resting and exercising skeletal muscle. J. Appl. Physiol. 102, 722–734.
Maharaj, A. S., and D'Amore, P. A. (2007). Roles for VEGF in the adult. Microvasc. Res. 74, 100–113.
Martin, D., Galisteo, R., and Gutkind, J. S. (2009). CXCL8/IL8 stimulates vascular endothelial growth factor (VEGF) expression and the autocrine activation of VEGFR2 in endothelial cells by activating NFkappaB through the CBM (Carma3/Bcl10/Malt1) complex. J. Biol. Chem. 284, 6038–6042.
Mazitschek, R., and Giannis, A. (2004). Inhibitors of angiogenesis and cancer-related receptor tyrosine kinases. Curr. Opin. Chem. Biol. 8, 432–441.
Pollak, M. (2008). Insulin and insulin-like growth factor signalling in neoplasia. Nat. Rev. Cancer 8, 915–928.
Presta, L. G., Chen, H., O'Connor, S. J., Chisholm, V., Meng, Y. G., Krummen, L., Winkler, M., and Ferrara, N. (1997). Humanization of an anti-vascular endothelial growth factor monoclonal antibody for the therapy of solid tumors and other disorders. Cancer Res. 57, 4593–4599.
Pries, A. R., and Secomb, T. W. (2005). Microvascular blood viscosity in vivo and the endothelial surface layer. Am. J. Physiol. Heart Circ. Physiol. 289, H2657–H2664.
Qutub, A. A., and Popel, A. S. (2006). A computational model of intracellular oxygen sensing by hypoxia-inducible factor HIF1 alpha. J. Cell. Sci. 119, 3467–3480.
Qutub, A. A., Mac Gabhann, F., Karagiannis, E. D., Vempati, P., and Popel, A. S. (2009). Multiscale models of angiogenesis. IEEE Eng. Med. Biol. Mag. 28, 14–31.
Roy, H., Bhardwaj, S., and Yla-Herttuala, S. (2006). Biology of vascular endothelial growth factors. FEBS Lett. 580, 2879–2887.
Segerstrom, L., Fuchs, D., Backman, U., Holmquist, K., Christofferson, R., and Azarbayjani, F. (2006). The anti-VEGF antibody bevacizumab potently reduces the growth rate of high-risk neuroblastoma xenografts. Pediatr. Res. 60, 576–581.
Sela, S., Itin, A., Natanson-Yaron, S., Greenfield, C., Goldman-Wohl, D., Yagel, S., and Keshet, E. (2008). A novel human-specific soluble vascular endothelial growth factor receptor 1: Cell-type-specific splicing and implications to vascular endothelial growth factor homeostasis and preeclampsia. Circ. Res. 102, 1566–1574.
Simons, M. (2004). Integrative signaling in angiogenesis. Mol. Cell. Biochem. 264, 99–102.
Stefanini, M. O., Wu, F. T. H., Mac Gabhann, F., and Popel, A. S. (2008). A compartment model of VEGF distribution in blood, healthy and diseased tissues. BMC Syst. Biol. 2, 77.
Tang, K., Breen, E. C., Wagner, H., Brutsaert, T. D., Gassmann, M., and Wagner, P. D. (2004). HIF and VEGF relationships in response to hypoxia and sciatic nerve stimulation in rat gastrocnemius. Respir. Physiol. Neurobiol. 144, 71–80.
Truskey, G. A., Yuan, F., and Katz, D. F. (2004). Porosity, tortuosity, and available volume fraction. In: Transport Phenomena in Biological Systems. Pearson Prentice Hall, NJ, pp. 389–398.
Venturoli, D., and Rippe, B. (2005). Ficoll and dextran vs. globular proteins as probes for testing glomerular permselectivity: Effects of molecular size, shape, charge, and deformability. Am. J. Physiol. Renal Physiol. 288, 605–613.
Verheul, H. M., Lolkema, M. P., Qian, D. Z., Hilkes, Y. H., Liapi, E., Akkerman, J. W., Pili, R., and Voest, E. E. (2007). Platelets take up the monoclonal antibody bevacizumab. Clin. Cancer Res. 13, 5341–5347.
Wijelath, E. S., Rahman, S., Namekata, M., Murray, J., Nishimura, T., Mostafavi-Pour, Z., Patel, Y., Suda, Y., Humphries, M. J., and Sobel, M. (2006). Heparin-II domain of fibronectin is a vascular endothelial growth factor-binding domain: Enhancement of VEGF biological activity by a singular growth factor/matrix protein synergism. Circ. Res. 99, 853–860.
Wiley, H. S., Shvartsman, S. Y., and Lauffenburger, D. A. (2003). Computational modeling of the EGF-receptor system: A paradigm for systems biology. Trends Cell. Biol. 13, 43–50.
Willett, C. G., Boucher, Y., Duda, D. G., di Tomaso, E., Munn, L. L., Tong, R. T., Kozin, S. V., Petit, L., Jain, R. K., Chung, D. C., Sahani, D. V., Kalva, S. P., et al. (2005). Surrogate markers for antiangiogenic therapy and dose-limiting toxicities for bevacizumab with radiation and chemotherapy: Continued experience of a phase I trial in rectal cancer patients. J. Clin. Oncol. 23, 8136–8139.
Wu, F. T., Stefanini, M. O., Mac Gabhann, F., Kontos, C. D., Annex, B. H., and Popel, A. S. (2009a). A computational kinetic model of VEGF trapping by soluble VEGF receptor-1: Effects of transendothelial and lymphatic macromolecular transport. Physiol. Genomics 38, 29–41.
Wu, F. T., Stefanini, M. O., Mac Gabhann, F., and Popel, A. S. (2009b). A compartment model of VEGF distribution in humans in the presence of soluble VEGF receptor-1 acting as a ligand trap. PLoS ONE 4, e5108.
Yuan, F., Dellian, M., Fukumura, D., Leunig, M., Berk, D. A., Torchilin, V. P., and Jain, R. K. (1995). Vascular permeability in a human tumor xenograft: Molecular size dependence and cutoff size. Cancer Res. 55, 3752–3756.
CHAPTER NINETEEN

The Least-Squares Analysis of Data from Binding and Enzyme Kinetics Studies: Weights, Bias, and Confidence Intervals in Usual and Unusual Situations

Joel Tellinghuisen

Contents
1. Introduction 500
2. Least Squares Review 503
   2.1. Standard linear and nonlinear least squares 503
   2.2. Multiple uncertain variables: Deming's treatment 505
   2.3. Uncertainty in functions of uncertain quantities: Error propagation 505
3. Statistics of Reciprocals 506
   3.1. A simple Monte Carlo experiment 506
   3.2. Implications—The 10% rule of thumb 509
   3.3. Application to binding and kinetics data 510
4. Weights When y is a True Dependent Variable 511
   4.1. Constant σy 511
   4.2. Illustrations for perfectly fitting data 512
   4.3. Real data example 515
   4.4. Monte Carlo simulations 517
5. Unusual Weighting: When x is the Dependent Variable 521
   5.1. Effective variance treatment 521
   5.2. Checking the results with exactly fitting data 522
   5.3. The unique answer 524
6. Assessing Data Uncertainty: Variance Function Estimation 524
7. Conclusion 526
References 527
Department of Chemistry, Vanderbilt University, Nashville, Tennessee, USA

Methods in Enzymology, Volume 467
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67019-1
© 2009 Elsevier Inc. All rights reserved.
Abstract

The rectangular hyperbola, y = abx/(1 + bx), is widely used as a fit model in the analysis of data from studies of binding, sorption, enzyme kinetics, and fluorescence quenching. The choice of this form or of its linearized versions—the double-reciprocal, y-reciprocal, or x-reciprocal—in unweighted least squares implies different assumptions about the error structure of the data. The rules of error propagation are reviewed and used to derive weighting expressions for application in weighted least squares, in the usual case where y is correctly considered the dependent variable, and in the less common situations where x is the true dependent variable, in violation of one of the fundamental premises of most least-squares methods. The latter case is handled through an effective variance treatment and through a least-squares method that treats any or all of the variables as uncertain. The weighting expressions for the linearized versions of the fit model are verified by computing the parameter standard errors for exactly fitting data. Consistent weightings yield identical standard errors in this exercise, as is demonstrated with a common data analysis program. The statistical properties of linear and nonlinear estimators of the parameters are examined with reference to the properties of reciprocals of normal variates. Monte Carlo simulations confirm that the least-squares methods yield negligible bias and trustworthy confidence limits for the parameters as long as their percent standard errors are less than 10%. Correct weights being the key to optimal analysis in all cases, methods for estimating variance functions by least-squares analysis of replicate data are reviewed briefly.
1. Introduction

One of the simplest and most frequently encountered nonlinear relations in physical data analysis is the rectangular hyperbola, expressible in a number of ways, including

$$y = \frac{abx}{1 + bx} \tag{19.1}$$

This mathematical form occurs in the analysis of data obtained in studies of enzyme kinetics (Askelof et al., 1976; Cleland, 1967; Cornish-Bowden and Eisenthal, 1974; Dowd and Riggs, 1965; Mannervik, 1982; Ritchie and Prvan, 1996; Wilkinson, 1961), binding and complexation (Bowser and Chen, 1998; Feldman, 1972; Johnson, 1985; Munson and Rodbard, 1980), sorption (Barrow, 1978; Bolster, 2008; Kinniburgh, 1986), and fluorescence quenching (Eftink and Ghiron, 1981; Laws and Contino, 1992). From the earliest encounter with this relation, workers recognized that it could be rewritten to facilitate analysis with straight-line graphical plots (Langmuir, 1918), and the various linearized forms earned naming status for their proposers (Connors, 1987). Thus, the version of Michaelis–Menten enzyme kinetics expressed as

$$\frac{1}{y} = \frac{1}{abx} + \frac{1}{a} = CX + A \tag{19.2}$$

became the Lineweaver–Burk equation, while equivalent forms became the Benesi–Hildebrand equation of complexation and the Stern–Volmer relation of fluorescence quenching. Multiplying through by x got Hanes and Woolf their version of enzyme kinetics,

$$\frac{x}{y} = \frac{1}{ab} + \frac{x}{a} = C + Ax \tag{19.3}$$

while expressions with y on both sides of the equation, like

$$\frac{y}{x} = ab - by \tag{19.4}$$

earned Scatchard and Eadie and Hofstee their places on the linearization marquee. Following Connors (1987) and others, I will call these the double-reciprocal, the y-reciprocal, and the x-reciprocal forms, respectively; and I will use upper-case letters to denote reciprocals, as done above. Almost from the time these linearizations were proposed, it was clear that their use in quantitative analysis by the method of least squares required attention to weighting (Lineweaver et al., 1934). This problem has attracted much attention, as is clear from the titles of most of the references already cited. Yet, as was lamented by Kinniburgh (1986) over two decades ago, "Although much well-founded criticism of the various linearized forms of the Langmuir isotherm has appeared in the environmental chemistry literature, the lessons to be learned seem to go largely unheeded." Praising the virtues of nonlinear least squares (NLLS), he continued, "The benefit of the NLLS approach, when properly weighted, is subtle and not to be seen in statistics such as R²; rather the benefit is in the assurance that the best parameter estimates have been obtained." Kinniburgh showed that many of the claims in favor of or against the various linearized forms of the Langmuir isotherm were incorrectly based on tacit assumptions about the data error structures embodied in the use of unweighted LS (ULS) with these forms. Writing a decade later, Ritchie and Prvan (1996) noted, "By the time least squares linear regression methods had become readily available on calculators and computers, the need for appropriate weighting had been largely forgotten." They also pointed out that it was not fair to Lineweaver and Burk to use their name for ULS analysis with Eq. (19.2), because LB actually used weighted LS (WLS), working in collaboration with the statistician Deming (Lineweaver et al., 1934). The main conclusions of the works by Kinniburgh and by Ritchie and Prvan, and of many of the other writers already cited, might be summarized: in the quest for optimal LS
analysis of rectangularly hyperbolic data, the weighting of the data is far more important than the choice of fit relation. Sadly, their comments about the lessons going largely unheeded remain current, as works purporting to test the various fitting representations, but doing so incorrectly for well-known reasons, continue to be published. Kinniburgh (1986) also addressed a special problem in most sorption work and in many studies of binding and complexation: The quantity normally measured—the equilibrium concentration [L] of ligand or sorbate—is also the independent variable of the common fit models. This violates one of the fundamental assumptions behind most LS methods, that the independent variable be error-free. One way of handling this problem, recognized long ago (Barrow, 1978; Feldman, 1972; Meinert and McHugh, 1968; Munson and Rodbard, 1980) but still not widely used (Bolster, 2008), is to express the measured quantity [L] in terms of the total sorbate or ligand concentration L_t, which is normally much more precisely determined and hence more suitable as the independent variable in the fit model. Recently, we (Tellinghuisen and Bolster, 2009a) have examined these dependencies in detail and have used an effective variance treatment (Barker and Diana, 1974; Clutton-Brock, 1967; Orear, 1982) to provide weighting expressions that are statistically valid for all of the common forms of Eq. (19.1), for uncertainty in both [L] and L_t. In this work, we also noted that there is one NLLS algorithm that yields identical results for all ways of expressing the relation among the model parameters and the variables, any number of which can be considered uncertain. This is an algorithm based on Deming's (1964) treatment and implemented in iterative form as early as 1972 (Britt and Luecke, 1973; Jefferys, 1980; Lybanon, 1984; Powell and Macdonald, 1972). Since this algorithm gives results that are independent of the choice of fit relation, it becomes clear that the only user inputs that can affect the parameter values for a given set of ([L], L_t) values are the data weights, or equivalently the assessed uncertainties in [L] and L_t. One important topic still rarely addressed is how to obtain the weights for the data. In linear LS (LLS), it is rigorously true that minimum-variance (hence most precise) estimates of the model parameters are obtained if and only if the data are assigned weights inversely proportional to their variances, $w_i \propto \sigma_i^{-2}$. It is not possible to make such an assertion for NLLS (Di Cera, 1992), in part because many nonlinear estimators do not even have finite variance (Shukla, 1972). Nonetheless, the parameter estimates from NLLS are generally reasonable, and Monte Carlo studies have shown that the precision estimates can be reliable in establishing confidence limits (Tellinghuisen, 2000a). Further, there appears to be no general prescription for achieving the narrowest confidence limits that works better than the same one as for LLS, $w_i \propto \sigma_i^{-2}$. I make that assumption here. In subsequent sections, I first briefly review the fundamental LS relations relevant to the present work. I then address the question: How reliable is NLLS in estimating the constants a and b in Eqs. (19.1)–(19.4) and their
confidence limits, in the normal situation where y is the single error-prone (dependent) variable? The answer to this question is aided by consideration of the statistics of reciprocals of normal variates. I test these predictions through Monte Carlo simulations for selected conditions. I then address the situation where x is the directly measured quantity and y is obtained from x through a simple computation. The application of effective variance methods to this problem requires care, because y is fully correlated with x. I close with a discussion of variance function (VF) analysis for obtaining the data weights.

Why the just-expressed concern with reciprocals? In estimating the constant K (= b) that characterizes the kinetics or binding, we have the choice of estimating this or its inverse (e.g., the dissociation constant K_d). If one of these is a normal variate, the other does not even have finite variance; and when it has relatively large nominal uncertainty, the reciprocal variate can be significantly biased, with very asymmetric confidence limits (Tellinghuisen, 2000a,b). It is easier to avoid such complications if we know which—K or K_d—is the more nearly normal variate. On the other hand, if they are sufficiently precise, the question becomes irrelevant; so predicting their precision becomes equally important. And that in turn requires information about the data error.

Frequently one encounters statements like "If the data are precise and the fit model is correct, it doesn't matter which of Eqs. (19.1)–(19.4) you use, because all will return nearly identical estimates of the parameters." But it is not just the parameters that are of interest in an LS analysis: at least as important are their uncertainties (Johnson and Faunt, 1992), since after all, one can express any result as 1 ± 1 (times a power of 10). And fitting precise data to these equations without attention to weighting will surely return different parameter uncertainties. Below I illustrate how one can obtain reliable parameter error estimates by fitting exact data with assumed data error, using the a priori covariance matrix V_prior. In fact, it is not necessary to actually work with this matrix, because many data analysis programs either have an option permitting this choice or make it by default. The KaleidaGraph (Synergy Software) program is in the latter category (Tellinghuisen, 2000c). I have found it both valuable and instructive in predicting parameter precisions from exactly fitting data, including demonstrating that proper weighting yields identical nominal parameter standard errors (SEs) for all of Eqs. (19.1)–(19.4). I will demonstrate its use in this manner below.
2. Least Squares Review

2.1. Standard linear and nonlinear least squares

The theory and procedures of linear and nonlinear least-squares fitting methods have been covered in several of the already cited works (Connors, 1987; Johnson and Faunt, 1992; Tellinghuisen, 2000a) and are
readily available elsewhere (Bevington, 1969; Press et al., 1986), including in two earlier contributions from me in this series (Tellinghuisen, 2004, 2009a). Here, I emphasize a few important points, using the same notation as in those two works.
Minimum-variance estimation of the adjustable parameters requires that the data be weighted inversely as their variances:

w_i ∝ 1/σ_i².    (19.5)
As already noted, this is rigorously true for LLS, and I assume it for NLLS.
The variances for the estimated parameters are the diagonal elements of the variance–covariance matrix, of which we distinguish two versions, V_prior and V_post, both proportional to

A⁻¹ = (Xᵀ W X)⁻¹,    (19.6)
where the design matrix X (also called the Jacobian matrix) is as given in the earlier works and the weight matrix W is diagonal, with elements W_ii = w_i. If the data variances are known a priori, use of w_i = 1/σ_i² yields V_prior = A⁻¹. V_prior is exact for LLS and exact in the limit of small data error for NLLS. If the data are normally distributed, the estimated parameters will be normally distributed for LLS and normal in the small-data-error limit for NLLS. In LLS, with x representing the independent variable and y the dependent, V_prior depends only on the x-structure and the error structure of the data; in NLLS it may depend also on the values of the y_i and the fit parameters.

The LS solution minimizes S = Σ w_i δ_i², where δ_i is the fit residual in the dependent variable y. If the weights are taken as w_i = 1/σ_i², S follows the χ² distribution, which has expectation value ν and variance 2ν, where ν is the number of statistical degrees of freedom. Equivalently, S/ν follows the reduced χ² distribution, with mean 1 and variance 2/ν. These properties require data with normally distributed error and a fit to a true model. If the data variances are not known absolutely, parameter variance estimates are obtained from V_post = s_y² A⁻¹, where s_y² is the estimated variance for data of unit weight, calculated from the fit residuals using
s_y² = S/ν.    (19.7)

In LLS, the parameter variances are now estimated quantities having the statistical properties of χ². In NLLS, the variability inherent in V_prior makes V_post-based estimates even more variable.
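For readers who prefer to script such computations, the following minimal sketch (Python with NumPy, an illustrative choice; the chapter's own calculations use KaleidaGraph) evaluates V_prior per Eq. (19.6) for a weighted straight-line fit. The x values and σ_y are those of the 5-point model used later in this chapter.

    import numpy as np

    # A priori covariance, Eq. (19.6): V_prior = (X^T W X)^-1, W = diag(1/sigma_i^2).
    x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
    sigma_y = 0.08 * np.ones_like(x)            # assumed known data errors

    X = np.column_stack([np.ones_like(x), x])   # design matrix for y = p0 + p1*x
    W = np.diag(1.0 / sigma_y**2)

    V_prior = np.linalg.inv(X.T @ W @ X)        # exact for LLS with known weights
    print(np.sqrt(np.diag(V_prior)))            # a priori parameter SEs

Note that no y values enter: for LLS, V_prior depends only on the x-structure and the data error, as stated above.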
The consequences of violating the prescription of Eq. (19.5) include the obvious one—that the estimated parameters will not be of best precision—and a less recognized but arguably more important one: V-based parameter standard-error estimates are unreliable.
2.2. Multiple uncertain variables: Deming's treatment

All of the foregoing applies to most LS methods, in which a single variable is taken to be uncertain. Deming's (1964; original version 1938) treatment makes no such distinction between independent (hence error-free) and dependent variables. In his approach, the minimization target is

S = Σ (w_xi δ_xi² + w_yi δ_yi² + ...),    (19.8)

where the sum runs over all uncertain variables, with each residual being the difference between measured and adjusted value; for example, δ_xi = x_adj,i − x_i. The iterative implementations of Deming's approach facilitate convergence on the minimum S. If the weights are again taken as the inverse variances in x, y, ..., the resulting V_prior is again exact in the small-error limit. From the definition of S in Eq. (19.8), it is clear that the results must be independent of the manner in which the fit model is expressed. This includes situations where x is the single uncertain variable. By contrast, properly weighted fits to different versions of a given response function, like Eqs. (19.1)–(19.4), yield statistically equivalent but not numerically identical results.

The very form of Eq. (19.8) directs the user's attention to the weights, which must all be correct within a common scale factor to achieve the minimum-variance results. Thus, with just x and y uncertain, the w_xi and w_yi must correctly reflect the relative uncertainties of all x_i and y_i. This is not done in, for example, methods that minimize "perpendicular" distances from the measured points to the calculated curve (Schulthess and Dey, 1996; Valsami et al., 2000). In such methods, results change with scale changes in the axes, requiring "axis conversion factors," which are ill-defined and which fail to recognize that even the perpendicular distances should be shorter for very precise points than for imprecise ones. It follows that such methods are not invariant with respect to changes in the representation of the fit relationship.
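A minimal sketch of this approach, assuming the hyperbolic model of Eq. (19.1) and using SciPy's least_squares as a generic minimizer (an illustration of the idea, not the Deming/Lybanon code itself): the adjusted abscissas are carried as extra fit parameters, so the summed squares of the weighted residuals reproduce the S of Eq. (19.8). The data and σ values are illustrative.

    import numpy as np
    from scipy.optimize import least_squares

    x_obs = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
    y_obs = np.array([0.551, 1.555, 1.597, 1.759, 1.889])
    sx, sy = 0.04, 0.08                     # assumed uncertainties in x and y

    def residuals(p):
        a, b = p[:2]
        x_adj = p[2:]                       # adjusted x values, also optimized
        dx = (x_adj - x_obs) / sx           # weighted x residuals
        dy = (a*b*x_adj/(1 + b*x_adj) - y_obs) / sy   # weighted y residuals
        return np.concatenate([dx, dy])     # S = sum of squares of these

    p0 = np.concatenate([[2.0, 1.0], x_obs])
    fit = least_squares(residuals, p0)
    print(fit.x[:2], 2*fit.cost)            # a, b, and the S of Eq. (19.8)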
2.3. Uncertainty in functions of uncertain quantities: Error propagation

To calculate the uncertainty σ_f in some function f of uncertain quantities, we use error propagation. Taking the uncertain quantities as elements of the vector b,
σ_f² = gᵀ V g,    (19.9)
in which the jth element of the vector g is ∂f/∂b_j. This expression is rigorously correct for functions f that are linear in variables b_j that are themselves normal variates (Tellinghuisen, 2001). If the b_j are independent, the covariance matrix V will not have off-diagonal elements, and one has the more familiar expression,

σ_f² = Σ_j (∂f/∂b_j)² σ_bj².    (19.10)

The simplest application of Eqs. (19.9) and (19.10) is to functions of a single uncertain variable, giving

σ_f = |df/db| σ_b.    (19.11)
On the other hand, if the b_j are the parameters from a least-squares fit, they are usually correlated, requiring the use of Eq. (19.9).

The foregoing has direct application to the linearized versions, Eqs. (19.2)–(19.4), of the Langmuir relation. Thus, with minor provisos on the data and proper weighting, one can use LLS to obtain statistically reliable estimates of A and C from Eq. (19.2) or (19.3), and hence satisfactory results for a (= 1/A), b (= A/C), and σ_a [from Eq. (19.11), σ_a/a = σ_A/A]. However, A and C are typically correlated, so Eq. (19.9) is needed to obtain σ_b. All three "linear" forms are in fact nonlinear in a and b. Accordingly, when analyzed with NLLS, they all yield directly the desired estimates of a and b and their SEs. [Equation (19.4), with y on both sides, has special problems that will be addressed with the effective variance method below.]
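To make Eq. (19.9) concrete, the sketch below propagates the A and C of the double-reciprocal fit of Fig. 19.4 (below) into b = A/C. The covariance of A and C is not printed in the figure, so the value used here is back-inferred from the SEs quoted in the text (0.2045 with correlation, 0.1767 without) and serves only to illustrate the mechanics.

    import numpy as np

    A, sA = 0.5, 0.02065
    C, sC = 0.5, 0.08590
    cov_AC = -0.00132                # illustrative, inferred value (see text)

    V = np.array([[sA**2, cov_AC],
                  [cov_AC, sC**2]])
    g = np.array([1/C, -A/C**2])     # gradient of b = A/C with respect to (A, C)

    sb = np.sqrt(g @ V @ g)          # Eq. (19.9)
    print(sb)                        # ~0.204; dropping cov_AC gives ~0.177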
3. Statistics of Reciprocals

3.1. A simple Monte Carlo experiment

Here I will use the KaleidaGraph (KG) program to illustrate some properties of random variates and their sums, and then to examine the statistics of reciprocals of normal variates. I strongly believe that serious readers not already acquainted with this or a similar program (Origin, Igor, SigmaPlot) designed for scientific data analysis and presentation will find the time invested to become so acquainted well spent. However, the present exercise can also be done with Excel (de Levie, 2008). The main points will parallel results illustrated in Table 19.1 and Fig. 19.2 of my "Bias and inconsistency" paper (Tellinghuisen, 2000b) and in my instructional work and its online supplement (Tellinghuisen, 2000c).
Table 19.1 Monte Carlo statistics of a = 1 and its reciprocal A, from 10⁵ values with normally distributed random error of specified σ_a (a)

σ_a    ā         s_a (b)    Ā         s_A (b)
0.05   0.99991   0.050135   1.00262   0.050621
0.10   0.99982   0.100270   1.01054   0.104363
0.20   0.99964   0.200539   1.04642   0.24286
0.30   0.99946   0.300809   1.13154   2.4916
0.40   0.99928   0.401078   1.05163   71.732
0.40   0.99933   0.399943   1.39158   34.877
0.40   1.00202   0.399634   1.43629   23.782

(a) The same random normal variates were used for the first five σ_a values, to illustrate the effects of scaling.
(b) Obtained from sampling statistics, as [⟨a²⟩ − ⟨a⟩²]^(1/2), and analogously for A.
For good sampling statistics, it is desirable to generate very large data sets. To a very good approximation, histogrammed (binned) data follow Poisson statistics, meaning the variance equals the bin count. Thus, bin counts of 10⁴ have 1% error (σ ≈ 10²); smaller bin counts have smaller absolute but larger relative error, and vice versa for larger bin counts. I start by expanding the number of rows in the KG data sheet to 10⁵. Then, executing from the "Formula Entry" box the statement

c0 = ran() + ran()    (19.12a)
generates a sum of two random, uniform (0 < x < 1) variates in the first column of the data sheet. It is instructive to use the "Bin Data" command at this point to gain visual appreciation: the result is a triangular distribution (Fig. 19.1), peaking at 1. Adding another uniform random variate—c0 = c0 + ran()—generates a distribution that is piecewise quadratic with mean 1.5. Revise to

c0 = c0 + ran() + ran() + ran()    (19.12b)
and execute the "Run" command three times. The resulting sum of 12 random numbers will have a mean close to 6.00 and standard deviation 1.00; and through the beauty of the Central Limit Theorem, the distribution will be very close to the Gaussian, or normal, distribution,

P_G(μ, σ; x) = C_G exp[−(x − μ)²/(2σ²)],    (19.13)

where μ is the mean, σ² the variance, and C_G a normalizing constant. Subtract 6 [c0 = c0 − 6] to produce a column of 10⁵ random normal deviates of mean 0. [While easy, this is not the best method for generating random normal deviates; see Press et al. (1986).]
Figure 19.1 Histogram of results from summing two uniform random deviates, each defined over the range 0 < x < 1.
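For readers working outside KG, the same recipe takes a few lines in, for example, Python (an illustrative translation, not part of the original exercise):

    import numpy as np

    rng = np.random.default_rng(1)
    # Sum 12 uniform deviates and subtract 6: approximately standard normal.
    c0 = rng.random((100_000, 12)).sum(axis=1) - 6.0
    print(c0.mean(), c0.std())       # close to 0 and 1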
Now we use this column to add varying amounts of Gaussian error to a constant. For example, c1 = 1 + 0.1*c0 produces in the second column a random variate of mean 1 with normal error of σ = 0.1. Clearly, the statistics of this column, or of any other produced in the same manner, are fully predictable from the statistics of the entries in c0. But the statistics and distributions for the reciprocals of these quantities—for example, c1 = 1/(1 + 0.1*c0), c2 = 1/(1 + 0.2*c0), etc.—are another matter. With increasing σ, we observe progressively increasing positive bias in their means and their standard deviations, and eventually instability in both (Table 19.1). The reason for the instability is that the distribution of reciprocals has Lorentzian tails, which means it has infinite variance. This violates a prime requirement for sampling under the Central Limit Theorem, meaning sampling cannot be relied upon to yield convergent estimates of the mean and standard deviation. Results for σ_a = 0.4 are illustrated in Fig. 19.2. Qualitatively, the instability arises from the significant probability of drawing a value near 0 in the initial normal distribution. As long as this probability is small, there is only a modest systematic bias in the mean of A (considered as an estimator of a) and in s_A (which should be considered an asymptotic estimator, since formally the variance is infinite). Thus, the bias in Ā is only 1% for σ_a = 0.10 [relative standard error (RSE) σ_a/a the same], rising to 5% for 20% RSE. By Eq. (19.11), s_A should equal σ_a (same RSE); but it is 1% larger for 5% RSE (σ_a = 0.05), with the excess rising sharply thereafter, to 4% at 10% RSE, 21% at 20% RSE, and >700% for 30% RSE.
Figure 19.2 Histograms of 10⁵ values of the normal variate a (μ = 1, σ = 0.4) and its inverse A. The smooth curves are fits to Eq. (19.13) for a and to the derived distribution for A given in Eq. (12) of Tellinghuisen (2000b).
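The statistics of Table 19.1 are easily reproduced in any environment; here is a minimal Python sketch (illustrative, with my own seed and sample size):

    import numpy as np

    rng = np.random.default_rng(7)
    z = rng.standard_normal(100_000)
    for sigma in (0.05, 0.10, 0.20, 0.30, 0.40):
        a = 1.0 + sigma * z          # normal variate of mean 1
        A = 1.0 / a                  # heavy-tailed; variance formally infinite
        print(sigma, a.mean(), a.std(), A.mean(), A.std())
    # For sigma >~ 0.3 the mean and SD of A become unstable from run to run.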
3.2. Implications—The 10% rule of thumb

I have examined with Monte Carlo (MC) simulations a number of nonlinear fit models in recent years, and this sort of reciprocal behavior is the most pathological I have seen. It has led me to state a "10% rule of thumb" for when to trust the V-based parameter SEs from NLLS (Tellinghuisen, 2000a): if the parameter's estimated RSE is less than 10%, then this SE is valid within 10% for assessing the confidence limits of the parameter, for which the bias is insignificant. Note from Table 19.1 that at 10% RSE (σ_a = 0.10), the bias in A is less than 10% of its SE, which is only 5% in excess of its predicted value (0.10). Thus, the 10% rule would appear to be conservative. However, as already noted, in NLLS there is variability inherent in V_prior for "real" data, which augments that from reciprocal behavior. Of course, there are many cases where the nonlinear estimator behaves much better than this. On the other hand, most workers still rely upon the V_post-based estimates for their parameter SEs, and these can never be any more reliable than their inherent uncertainty from the statistical properties of χ². This means relative uncertainty (2/ν)^(1/2) in the estimated variances and, by Eq. (19.11), (2ν)^(−1/2) in the estimated SEs. Many binding and kinetics studies employ eight or fewer points in a dataset. This translates into ≈30% uncertainty in the V_post-based parameter SEs—an amount that will usually dwarf that from the NLLS method itself.

Concerns about reciprocal behavior apply to the data also, which are inverted in the use of Eqs. (19.2) and (19.3) to analyze binding and kinetics data. Although we seldom collect enough data to confirm that they
are normal, it seems more reasonable to assume that the raw data are normal than that their reciprocals are; and the data certainly have finite variance, whereas their reciprocals may not, thus violating one of the requirements for LS. Many instruments average numerous digital conversions of an analog signal, and by the Central Limit Theorem, these (with certain instrumental limitations) should approach normality, just as we observed above in the MC experiment. Counting instruments follow Poisson statistics, and for large numbers of counts, the Poisson distribution approaches the Gaussian. So it does seem more reasonable to attribute normality to the raw data than to their inverses. As before, the proper warning is to avoid inverting data with large relative uncertainty, and the 10% rule is again a good guideline. Although the inverted data remain biased estimators of the original quantities and thus yield biased and even inconsistent estimates of the LS parameters (Tellinghuisen, 2000b), the magnitudes of the biases will typically remain insignificant compared with the parameter SEs, if the data are properly weighted.
3.3. Application to binding and kinetics data

Consider analysis of data using the double-reciprocal linearization, Eq. (19.2). Neglecting the nonnormality and bias of the data from inversion, LLS yields A and C estimates that are normally distributed. With proper weighting these will be minimum-variance estimates, and we can obtain a and b from a = 1/A and b = A/C. If C is precise to better than 10%, we expect b to be well characterized and near-normal in this analysis (Tellinghuisen, 2000b). Similarly, if A is precise, a will be well characterized. Then we also expect B = 1/b (from C/A) to be close to normal, even if C is imprecise. Neither a nor b emerges naturally from these considerations as a normal variate, so we cannot tell which of b and B will be more normal. Note that if we employ NLLS with Eq. (19.2) and fit to a and b, we will obtain values identical to those from fitting to A and C, provided the data are weighted the same; and of course we obtain estimates of σ_a and σ_b directly in the nonlinear fit. We will see below that if σ_y is constant, the transformation to 1/y imposes strongly y-dependent weighting in fits to Eq. (19.2). This weighting itself can be a source of enhanced bias and imprecision for noisy data.

Alternatively, consider the nonlinear fit to the variation of Eq. (19.1),

y = x/(C + Ax),    (19.14)

which requires the same weighting as fitting to Eq. (19.1)—unweighted for constant σ_y. This was originally proposed as a better way to obtain results from the nonlinear fit (Ratkowsky, 1986), a convergence concern that is a nonexistent problem with today's computational methods. Thus, we would normally prefer the fit to Eq. (19.1), which yields directly the SEs in a and b. However,
the previous considerations about the properties of the six estimators remain valid, and we expect both A and C from an analysis with Eq. (19.14) to be nearly normal variates.
4. Weights When y is a True Dependent Variable

4.1. Constant σ_y

In most experimental approaches to studying MM kinetics, complexation, or quenching, x is a controlled variable and y is measured, making the usual identification of independent and dependent variables appropriate (Connors, 1987). For some of these methods, it is also reasonable to take σ_y as constant, especially if measurements are taken over a small dynamic range. In this case, the direct NLLS fit to Eq. (19.1) is the straightforward approach, doable with ULS; if σ_y is thought to be known, taking w_i = 1/σ_y² permits use of V_prior and subsequent use of the χ² test as a check of the fit's reasonableness (Zeng et al., 2008a).

Use of Eq. (19.11) yields the well-known weighting expressions for fits to the double-reciprocal and y-reciprocal linearizations. Letting z stand for the transformed dependent variable, σ_z = |dz/dy| σ_y, hence

σ_1/y = σ_y/y².    (19.15a)

By assumption x is error-free, so Eq. (19.11) applies for Eq. (19.3) too, giving

σ_x/y = σ_y x/y².    (19.15b)
Both results yield weights that vary strongly over a typical dataset, and the question arises: what values of y should be used to obtain numerical values? With precise data this is of little concern, but for noisy data, Monte Carlo tests have indicated that it is better to use the calculated than the observed values. This makes the computation iterative, since the calculated values are not known at the outset. It is noteworthy that consistently weighted fits to Eqs. (19.2) and (19.3) yield identical results. With naive use of ULS with these forms, workers in some fields have come to prefer Eq. (19.3) over Eq. (19.2). This is perhaps because the factor of x neutralizes some of the y² dependence in the denominator, making the weighting error from ULS less significant with Eq. (19.3).

The use of y as both the dependent and the independent variable in Eq. (19.4) puts this linearized version in violation of the basic LS assumptions. However, it can be treated with an effective variance (EV) approach to obtain weights that yield consistent results for the parameter SEs. The idea behind the EV approach is to project the uncertainty in the
"independent" variable into an equivalent uncertainty in the dependent variable. Again error propagation is used, yielding σ_eff = |dy/dx| σ_x; and if the two contributions are independent, the variances add, giving σ_y,tot² = σ_eff² + σ_y². Here, however, the two contributions are not independent but fully correlated, because they involve the same variable. Thus, an error e_y in y produces a direct error e_y/x in the dependent variable y/x, and an indirect error of magnitude (df/dy)e_y = −be_y through its effect on the fit function, f = ab − by. The result of the perfect correlation is to make the two contributions additive in σ, giving

σ_y,tot = σ_y (x⁻¹ + b).    (19.16)
This result is not readily available in the literature. It is given incorrectly in Eq. (3.25) of Connors (1987), who treats the two contributions as independent, hence adding them in quadrature. Bowser and Chen (1999) give it correctly but cite Connors and give no explanation. In using Eqs. (19.15a) and (19.16), one must of course use values of x and y for each of the i = 1, ..., n data points. Accordingly, these equations are already in a form to accommodate any variability in σ_y, by just using σ_yi values for the individual data points.
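In outline, the iterative weighting just described might be scripted as follows (a Python sketch, with SciPy's curve_fit standing in for the KG "General" fit; the data are the synthetic set of Fig. 19.8 below, and σ_y is assumed known):

    import numpy as np
    from scipy.optimize import curve_fit

    x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
    y = np.array([0.551, 1.555, 1.597, 1.759, 1.889])
    sigma_y = 0.08

    X, Y = 1/x, 1/y                          # transformed variables of Eq. (19.2)
    lin = lambda X, a, b: X/(a*b) + 1/a      # 1/y = 1/(a*b*x) + 1/a

    p = np.array([2.0, 1.0])
    for _ in range(5):                       # a few cycles suffice
        y_calc = p[0]*p[1]*x/(1 + p[1]*x)    # calculated y from current parameters
        s_Y = sigma_y / y_calc**2            # Eq. (19.15a), with calculated y
        p, V = curve_fit(lin, X, Y, p0=p, sigma=s_Y, absolute_sigma=True)
    print(p, np.sqrt(np.diag(V)))            # a, b and their Vprior-based SEs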
4.2. Illustrations for perfectly fitting data

Figures 19.3 and 19.4 illustrate results obtained from exactly fitting data for a 5-point model having an approximately arithmetic structure, with x having a dynamic range of 20 (0.5–10), y a range of 5, and constant σ_y. Note that the parameter SEs all agree to the fourth significant figure, where the discrepancies result from imprecision in the numerical differentiation used by the KG program. Also, the χ² values ("Chisq") are all ≈0, as expected for exactly fitting data; thus the V_post-based SEs would also all be 0, which is of course meaningless for the present exercise. Figure 19.4 includes results for fitting the double-reciprocal data to the second form of Eq. (19.2). It is easy to verify that σ_a/a = σ_A/A, as predicted by Eq. (19.11) for A = 1/a. On the other hand, application of Eq. (19.10) to b = A/C yields

σ_b/b = [(σ_A/A)² + (σ_C/C)²]^(1/2) = 0.1767,    (19.17)
which is about 14% smaller than the correct value, the difference being due to the correlation between A and C, which is omitted from Eqs. (19.10) and (19.17). To achieve the results illustrated in Figs. 19.3 and 19.4 using the KG program, one first plots the data by selecting independent ("x") and dependent ("y") variables. When the "General" fit menu is then opened under "Curve Fit," the user selects a fit name (or adds it if necessary) and then clicks in the "Define..." box to enter the fit relation. Clicking the "Weight Data"
[Figure 19.3 fit-results boxes: y = a*b*x/(1 + b*x): a = 2.000 ± 0.08261, b = 1.0000 ± 0.2044, Chisq = 7.28e−13; y = x/a + 1/(a*b): a = 2.000 ± 0.08266, b = 1.000 ± 0.2045, Chisq = 0.000.]
Figure 19.3 Five-point model having an approximately arithmetic structure (x = 0.5, 3, 5, 7.5, 10), with a = 2, b = 1, and σ_y = 0.08, displayed and fitted in accord with Eq. (19.1) (left ordinate scale) and Eq. (19.3) (right). The KaleidaGraph "General" NLLS routine is used to obtain the results presented in the fit results boxes, where "Error" is the V_prior-based standard error. The entries at the top of each box show the fit model, in which "x" is the default independent variable and "y" the dependent; the user enters only the part to the right of the "=" sign in the fit definition box.
[Figure 19.4 fit-results boxes: y = X/(b*a) + 1/a: a = 2.000 ± 0.08266, b = 1.0000 ± 0.2045, Chisq = 6.486e−13; y = a*b − b*x: a = 2.000 ± 0.08262, b = 1.0000 ± 0.2044, Chisq = 8.931e−13; y = C*X + A: A = 0.5000 ± 0.02065, C = 0.5000 ± 0.08590, Chisq = 6.486e−13.]
Figure 19.4 Same data displayed and analyzed in accord with the double-reciprocal [Eq. (19.2), axes top and left] and x-reciprocal [Eq. (19.4), axes bottom and right] linearizations.
box ensures that the user will later be prompted to designate a column of σ_y values for weights upon selecting a dependent variable to fit. This manner of providing for the computation of weights makes it easy to verify that the
parameter SEs scale with σ_y. Thus, for example, increasing σ_y to 0.2 in the present exercise will increase the SEs by a factor of 0.2/0.08 = 2.5. With such a change, a remains fairly precise, with RSE ≈0.1; but the RSE in C increases from 17% to 43%, so we can expect b (= A/C) to exhibit strongly non-Gaussian behavior, while B should be close to normal.

Before examining these distributions, let us consider several other aspects of the present computations with exact data. First, there is no particular role for the distribution of the values on the independent axis, because with proper weighting, the results from Fig. 19.4, where the data are strongly bunched on the "x" axes, are identical to those from Fig. 19.3, where they are evenly distributed. We can take such considerations further by asking: what difference is made by using a geometric distribution for x? What is the effect of changing b, or the data error structure, for the same x-structure? And what if the fit is redefined in the form more commonly used to analyze kinetics data, with B instead of b? Results answering these questions are displayed in Figs. 19.5–19.7. From the middle results in Fig. 19.5, we see that changing the x-structure decreases the precision in a (σ_a goes from 0.083 to 0.091) but increases that for b (σ_b goes from 0.20 to 0.16). Changes of this magnitude are not likely to be practically important, so we can conclude that for a specified range of x and constant σ_y, geometric and arithmetic data structures are roughly equivalent. Results in all these figures show that a is better defined when there are more points near the large-y asymptote, which occurs for large b. On the other hand, b is determined with best relative precision in the midrange of values sampled here. These trends are not significantly altered when the error structure is changed from constant σ_y to a constant coefficient of variation, σ_y = 8% of y, in part because
[Figure 19.5 fit-results: for b = 0.1, a = 2.000 ± 0.6429, b = 0.1000 ± 0.05429; for b = 1, a = 2.000 ± 0.09050, b = 1.000 ± 0.1595; for b = 10, a = 2.000 ± 0.05712, b = 10.000 ± 3.529.]
Figure 19.5 Results for the same model with geometric x structure (0.5, 1, 2, 5, 10), obtained as a function of b. Note the logarithmic axis scale for x.
[Figure 19.6 fit-results: for b = 0.1, a = 2.000 ± 0.3376, b = 0.1000 ± 0.02136; for b = 1, a = 2.000 ± 0.1387, b = 1.000 ± 0.1691; for b = 10, a = 2.000 ± 0.1101, b = 10.000 ± 6.252.]
Figure 19.6 Results for the same model as in Fig. 19.5, but with the error structure changed from σ_y = 0.08 to σ_y = 0.08y.
[Figure 19.7 fit-results, fit model y = a*x/(B + x): for b = 0.1, a = 2.000 ± 0.6433, B = 10.000 ± 5.433; for b = 1, a = 2.000 ± 0.09052, B = 1.000 ± 0.1595; for b = 10, a = 2.000 ± 0.05712, B = 0.1000 ± 0.03528.]
Figure 19.7 Same data as in Fig. 19.5, but with the fit model redefined in terms of B = 1/b.
the sampling range of the y values is not large enough to give strong heteroscedasticity except for b = 0.1. Finally, Fig. 19.7 confirms expectations that σ_B/B = σ_b/b when the fit model includes B instead of b.
4.3. Real data example

Next let us use some of the normal random deviates produced in the earlier exercise on reciprocals to produce a "real" dataset and see how it responds to analysis with the different models. I take σ_y = 0.08 to scale the normal deviates, but assume for the analyses only that σ_y = constant.
This corresponds to the common situation where the uncertainty is thought to be independent of y but its magnitude is unknown. Figure 19.8 shows the data and the results of their analysis using Eqs. (19.1) and (19.3), as in Fig. 19.3. Consider first the unweighted fit to Eq. (19.1). The results returned by KG for ULS are V_post-based, so the parameter variances already include the prefactor s_y² from Eq. (19.7). Using the output "Chisq" value, we can estimate s_y = 0.073 [= (0.01586/3)^(1/2)]. This is somewhat smaller than the value adopted for the simulation. Had we assumed that σ_y was known to be 0.08 and used these values in a weighted fit, we would have obtained V_prior-based parameter SEs larger by the factor 0.08/0.073, and χ² = 2.5 [= 3 × (0.073/0.08)²].

For analysis using the y-reciprocal form of Eq. (19.3), we must weight the data using Eq. (19.15b); but we do not know σ_y, so we take it to be 1.0, thus using σ_x/y = x/y². For y in this expression, we use the calculated value from the fitted function, which makes the weighting iterative. The calculations converge in several cycles, yielding the results in the second fit box in Fig. 19.8. However, KG always uses V_prior in weighted fits, which means that it treats our weights as absolute. To obtain correct results for the parameter SEs we must now include the scale factor of Eq. (19.7), which means multiplying each SE by the factor (Chisq/3)^(1/2). We can obtain the same results by just rescaling our data σ values by the same factor, yielding the results shown in the third results box. The results from the two different fits are now close but not identical. Note also that rescaling the data σ values has raised χ² to its expected value of 3 (= n − p = 5 − 2) in the third box.
[Figure 19.8 fit-results boxes: y = a*b*x/(1 + b*x) (unweighted): a = 2.097 ± 0.08893, b = 0.7738 ± 0.1458, Chisq = 0.01586; y = x/a + 1/a/b (weighted, unscaled σ): a = 2.098 ± 1.225, b = 0.7668 ± 1.974, Chisq = 0.01551; same fit after rescaling the data σ values: a = 2.098 ± 0.08810, b = 0.7668 ± 0.1419, Chisq = 3.001.]
Figure 19.8 Synthetic data having the x structure of Fig. 19.3 and σ_y = 0.08, giving the following synthetic y values: 0.551, 1.555, 1.597, 1.759, and 1.889. The curves show results from the unweighted fit to Eq. (19.1) and the weighted fit to Eq. (19.3) with iterative adjustment of the weights. The middle box shows WLS results prior to scaling the data σ values; final parameter SEs can be obtained by multiplying the indicated errors for a and b by (χ²/ν)^(1/2) = (0.01551/3)^(1/2) = 0.0719.
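The unweighted analysis just described is easy to mimic outside KG; here is a sketch using SciPy (for an unweighted fit, curve_fit returns V_post-based SEs, paralleling the KG ULS output; the values indicated in the comments are approximate):

    import numpy as np
    from scipy.optimize import curve_fit

    x = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
    y = np.array([0.551, 1.555, 1.597, 1.759, 1.889])

    f = lambda x, a, b: a*b*x/(1 + b*x)      # Eq. (19.1)
    p, V = curve_fit(f, x, y, p0=[2, 1])
    S = np.sum((y - f(x, *p))**2)            # unweighted "Chisq" (~0.0159)
    s_y = np.sqrt(S / (len(x) - 2))          # Eq. (19.7): ~0.073
    print(p, s_y, np.sqrt(np.diag(V)))       # SEs here already include s_y^2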
The use of consistent data weighting returns identical results for any data set analyzed with Eqs. (19.2) and (19.3). We can confirm this by converting our final σ_x/y values to σ_1/y by dividing by x_i [Eqs. (19.15a) and (19.15b)] and then using these in fitting the data to Eq. (19.2). Results are shown in Fig. 19.9 (first box), which also includes results for fitting the same data to the linear double-reciprocal form (second box) and to the x-reciprocal form. In all cases, the σ values in the columns used to compute the weights have been scaled to ensure that χ² = ν = 3, which is equivalent to using Eq. (19.7) to obtain V_post. From the three different analyses, the a values now fall in the range 2.10–2.13 and their SEs in 0.088–0.100, while b = 0.72–0.77 and σ_b = 0.14–0.15. Thus, for this dataset a is about 1σ larger than its true value, while b is almost 2σ smaller. It is interesting to note that the SEs for A and C (middle box) are not identical to those in Fig. 19.4. This is because the weights have been defined in terms of the fitted y values in Fig. 19.9. Use of the original weights (based on the exact y_i) would have resulted in SEs identical to those in Fig. 19.4, and in slightly different A and C values no longer compatible with those in the first box.
4.4. Monte Carlo simulations

I turn now to the question of how well data analyzed with these relations comport with the predictions based on exactly fitting data. All of the results reported here were obtained for the model first introduced in Fig. 19.3, under variation of the data error σ_y and changes in b. Table 19.2 presents
[Figure 19.9 fit-results boxes: y = X/(b*a) + 1/a: a = 2.098 ± 0.08810, b = 0.7668 ± 0.1419, Chisq = 3.001; y = C*X + A: A = 0.4766 ± 0.02000, C = 0.6215 ± 0.09324, Chisq = 3.001; y = a*b − b*x: a = 2.130 ± 0.09999, b = 0.7155 ± 0.1445, Chisq = 3.001.]
Figure 19.9 Analyses of the same synthetic data with Eqs. (19.2) (solid points and line) and (19.4). All parameter SEs are a posteriori, i.e., based on V_post.
Table 19.2 Monte Carlo statistics (as % biases) of a, b, C, and their reciprocals, from 10⁵ datasets, for the model illustrated in Fig. 19.3 (a = 2, b = 1) with constant σ_y of varying magnitude (a,b)

σ_y     A      s_A    C     s_C    c      s_c    B      s_B    b      s_b
0.004   0.00   0.0    0.00  0.0    0.00   0.0    0.00   0.0    —      —
0.008   0.00   0.0    0.02  0.0    0.01   0.0    0.03   0.0    —      —
0.080   0.15   1.1    1.42  2.4    1.57   3.0    2.32   4.6    —      —
0.200   0.91   6.0    8.87  14.4   11.16  28.2   16.11  36.7   —      —
0.010   0.00   0.0    —     —      —      —      0.03   0.0    0.04   0.0
0.050   0.06   0.0    —     —      —      —      0.89   1.5    0.75   1.3
0.100   0.23   1.3    —     —      —      —      3.62   6.9    3.14   5.6
0.200   0.94   5.7    —     —      —      —      16.38  38.0   14.46  36.0

(a) First four lines from fits using Eq. (19.14); others using Eq. (19.1). Where entries are missing, they were not evaluated.
(b) Exact parameter standard errors for this model: σ_A = 0.2582σ_y; σ_a = 1.0327σ_y; σ_C = 1.0737σ_y; σ_c = 4.2948σ_y; σ_b = σ_B = 2.5550σ_y. Thus the predicted RSEs equal 10% when σ_y = 0.19 (A and a), 0.039 (B and b), and 0.047 (C and c).
summary statistics in the form of % bias in the parameters and their SEs, the references being the true values of the parameters and the predictions of their SEs from exactly fitting data. The results bear out expectations of negligible bias in all quantities for sufficiently small data error and show progressively increasing bias with increasing σ_y. The parameter A (= 1/a) is nearly normal for all simulations summarized here, as illustrated in Fig. 19.10. (For this reason, a is not included in the table.) A and a are also the most precise of the three base quantities summarized here, so that σ_A/A reaches 0.1 only for the highest data error included (σ_y = 0.2). Deviations from normality were comparable for B and C and their reciprocals, as illustrated for C and c in Fig. 19.11. Still, for the smallest data error included in Table 19.2, both are approximately normal at the level of precision obtained from 10⁵ data sets.

Figure 19.12 shows that these properties change with changes in b. When b = 10, a is closer to normal than its reciprocal, though both are reasonably close, because their RSE is only 1.5%. B is also nearly normal, but its reciprocal is significantly nonnormal; their RSEs are 22%. These properties would follow from the predictions above if C were nearly normal (since B = C/A remains normal if C is normal and A is precise). An MC check confirmed this. When b is reduced to 0.1, A is the most nearly normal
10,000
8000
Count
y = a*exp(−.5*(x-c)^2/b^2)
0.08 0.20
6000
a
Value 9928.2
b c
1.0043 −0.0051031
Chisq R
29.081 1
Error 38.549 0.0022661 0.0031798 NA NA
4000
2000
0 −4
−3
−2
−1
0 δA/σA
1
2
3
4
Figure 19.10 Histogrammed results for A from 10⁵ simulated datasets for the model of Fig. 19.3 with varying data error (σ_y = 0.008, 0.08, 0.20), analyzed using Eq. (19.14). The displayed fit results are obtained by fitting the binned data for σ_y = 0.008 to a Gaussian [Eq. (19.13)], with weighting based on the Poisson treatment of bin counts (variance = count). The Chisq value is reasonable for the 32 data points fitted here, but not for the other two datasets (Chisq = 138 and 747), showing that these are not Gaussian at this precision level.
Figure 19.11 Histogrammed results for C and c for the same model (σ_y = 0.004, 0.08, 0.20). The curves are fitted Gaussians for σ_y = 0.004 and yield χ² values that are only marginally consistent with normality—46 (left) and 38.
Figure 19.12 Histogrammed results for a, b, A, and B, for σ_y = 0.05, a = 2, and b = 10 (left) and 0.1 (right). The fitted curves are for the most normal dataset in each case and confirm normality at this precision for a when b = 10 (χ² = 23), but not for A when b = 0.1 (χ² = 240).
parameter (21% RSE) and b is closer to normal than B. This result would be expected if A were normal and C precise (b = A/C); that is not quite the case here, although the predicted RSE in C for these conditions (16%) is much less than the 37% for b and B.

Space does not permit extending these MC computations to the linearized forms of Eq. (19.1), but previous work has shown that these behave similarly, with, however, additional sources of bias from the inverted data and from the occurrence of y-dependent weights (Tellinghuisen, 2000a,b).
These problems make the transformed relations poorer statistically, but not drastically so, except when data having large relative error are inverted, as already discussed.

In summary, the 10% rule of thumb is conservative in predicting the range of validity of the V-based parameter SEs, in that there are situations where the parameters can be roughly normal for RSEs significantly exceeding 10%. In predicting conditions where near normality may hold for RSEs exceeding 10%, the considerations of LLS fitting to A and C are useful but not infallible. And recall that if V_post is used to estimate the SEs, the relative uncertainty in this estimator, (2ν)^(−1/2) = 0.41 for the present example, would swamp the uncertainty inherent in the NLLS method itself [and would require the use of the t-distribution to assess confidence limits (Tellinghuisen, 2000a)].
5. Unusual Weighting: When x is the Dependent Variable

5.1. Effective variance treatment

If x is a measured rather than controlled quantity, it is uncertain, in violation of an important LS assumption for the independent variable. For example, in sorption work one measures the equilibrium concentration x of sorbate, and the sorbed amount y is computed from the initial concentration x₀ using an equation of the form y = V(x₀ − x) (Bolster, 2008). Often x₀ and V are both much more precisely determined than x, making the errors in y and x perfectly correlated. This situation can arise also in binding studies where free ligand is not in great excess, making it improper to set [L] ≈ L_t, the prepared total ligand concentration, taken as precisely known. Notably, dialysis and related methods fall in this category. In such cases, it is not correct to compute weights as w_i ∝ 1/σ_yi², because the uncertainty in x is manifested in y in two ways: the direct contribution from y = V(x₀ − x) and an indirect contribution from the effect of changes in x on the fit function. Bolster and I have recently treated this case with the effective variance (EV) method and have derived weighting formulas for the situation just described (Tellinghuisen and Bolster, 2009a). Here, I will rederive the weighting formulas for fitting with Eqs. (19.1)–(19.4) and will illustrate how exactly fitting data can be used to verify the results. Interested readers are referred to the full paper for numerical examples and results of MC simulations.

As was noted already in connection with Eq. (19.16), the idea behind the EV method is to project the uncertainty in x onto the y-axis. Again, the only source of error in y is here presumed to be that in x, and this makes the two contributions fully correlated, requiring that the σs be added, with attention to signs, to give σ_tot² = (σ_eff ± σ_dir)². The two contributions arise as follows for analysis using Eq. (19.1): let a point on the true curve be
subject to an error e_x in x. This produces a direct error e_y = −Ve_x, leading to σ_dir = Vσ_x. There is also an effective or indirect error (dy/dx)e_x, through the displacement of the fit function to (x + e_x). The two contributions add in the same direction, leading to a total

σ_tot,1 = σ_x [V + ab/(1 + bx)²],    (19.18)

from which the weights can be computed as usual, as w = 1/σ_tot,1². [The same result can be obtained more directly by first rewriting the equation as Vx₀ = Vx + abx/(1 + bx).] Error propagation again suffices to obtain corresponding expressions for fits to Eqs. (19.2) and (19.3). Thus, since we have already fully projected the effects of σ_x into σ_tot,1, we can use Eqs. (19.15a) and (19.15b) to obtain for Eq. (19.2)
V ð1 þ bxÞ2 þ ab ; ðabxÞ2
ð19:19Þ
and we can obtain the expression needed for fitting with Eq. (19.3) from Eq. (19.19) by noting that σ_tot,3 = x σ_tot,2. (These expressions can also be derived "from scratch" for each form.) Equation (19.4) is complicated by the use of the pseudo-independent variable y, requiring considerations like those already discussed in connection with Eq. (19.16). We obtain

σ_tot,4 = [V + b(a − y)] σ_x/x + bVσ_x.    (19.20)

Similar considerations can be used to add contributions from x₀, if it is considered uncertain. All results for both are collected in Table 19.3, where I include also another version of Eq. (19.1) that has been used relatively little, but which is especially appropriate when x₀ is error-free and x is uncertain. This is the equation obtained by solving the quadratic expression

x² + x(a/V − x₀ + 1/b) − x₀/b = 0    (19.21)

for x, which I refer to as the direct equation, since it properly treats x as the dependent and x₀ as the independent variable.
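Expressed as code, the EV expressions of Eqs. (19.18)–(19.20) are one-liners; a Python sketch (the helper names are mine):

    import numpy as np

    def s_tot_langmuir(x, a, b, V, sx):      # Eq. (19.18), for fits to Eq. (19.1)
        return sx * (V + a*b/(1 + b*x)**2)

    def s_tot_dblrecip(x, a, b, V, sx):      # Eq. (19.19), for fits to Eq. (19.2)
        return sx * (V*(1 + b*x)**2 + a*b) / (a*b*x)**2

    def s_tot_yrecip(x, a, b, V, sx):        # x times Eq. (19.19), for Eq. (19.3)
        return x * s_tot_dblrecip(x, a, b, V, sx)

    def s_tot_xrecip(x, y, a, b, V, sx):     # Eq. (19.20), for Eq. (19.4)
        return sx*(V + b*(a - y))/x + sx*b*V

    # Weights are w = 1/s_tot**2, updated iteratively as (a, b) are refined.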
5.2. Checking the results with exactly fitting data

The signs of the two terms that add to give σ_tot can be a source of confusion. Fortunately, it is easy to check the results with exactly fitting data, since we have already seen that consistent weighting yields identical parameter SEs for all forms of Eq. (19.1) with exactly fitting data. We will do this with the model of Figs. 19.3 and 19.4, but now with error in x instead of y. I take V = 3, from which one could compute x₀ values consistent with y = V(x₀ − x). However,
Table 19.3 Summary of effective-variance-based weighting expressions for the least-squares analysis of data following the equation y = V(x₀ − x) = abx(1 + bx)⁻¹ (a)

Relation (Equation)        σ_tot,x                             σ_tot,x₀
Direct (19.21)             σ_x                                 σ_x0 (x + 1/b)/(2x + a/V − x₀ + 1/b)
Langmuir (19.1)            σ_x [V + ab/(1 + bx)²]              σ_x0 V
Double reciprocal (19.2)   σ_x [V(1 + bx)² + ab]/(abx)²        σ_x0 V(1 + bx)²/(abx)²
y-Reciprocal (19.3)        σ_x x [V(1 + bx)² + ab]/(abx)²      σ_x0 x V(1 + bx)²/(abx)²
x-Reciprocal (19.4)        σ_x [V + b(a − y)]/x + σ_x bV       σ_x0 [V + b(a − y)]/x + σ_x0 bV

(a) V is presumed to be a known constant of negligible uncertainty. Weights are w = 1/σ_tot², where σ_tot² = σ_tot,x² + σ_tot,x0², and quantities are evaluated for each point using the relevant x_i and x_0,i values. Errors in x and x₀ are assumed to be independent.
[Figure 19.13 fit-results boxes: y = a*b*x/(1 + b*x): a = 2.000 ± 0.1369, b = 1.0000 ± 0.3689, Chisq = 2.996e−13; y = x/a + 1/a/b: a = 2.000 ± 0.1370, b = 1.000 ± 0.3690, Chisq = 0.000; y = x/(b*a) + 1/a: a = 2.000 ± 0.1370, b = 1.0000 ± 0.3690, Chisq = 2.754e−13; y = a*b − b*x: a = 2.000 ± 0.1369, b = 1.0000 ± 0.3688, Chisq = 3.693e−13.]
Figure 19.13 Results from LS analyses via Eqs. (19.1)–(19.4) of exactly fitting data for the model of Fig. 19.3, with constant error in x, σ_x = 0.04. Weights were obtained using the EV expressions of Eqs. (19.18)–(19.20) and Table 19.3, with the constant V taken as 3.0.
this is not necessary except for the direct fit to Eq. (19.21), because the other models express y as a function of x. I also take σ_x = 0.04 for this illustration. Results for fits to Eqs. (19.1)–(19.4) (Fig. 19.13) confirm that Eqs. (19.18)–(19.20) provide consistent weightings for error in x. To verify that these weightings are correct for the physical model, we conducted MC computations in which we added random error to x and propagated it into y through the latter's definition. We also checked that the solutions to Eq. (19.21), where x is properly the dependent variable, yielded identical parameter SEs. In application to real data, the EV weighting expressions require iterative adjustment, since they depend on the values of the fit parameters. Again, such iterations are easy to perform through repetitive fits with KG and similar programs, and they converge rapidly.
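Here is a sketch of such a check in Python (illustrative; the seed, the names, and the use of curve_fit are my choices): perturb x, rebuild y from its definition, and iterate the EV weights of Eq. (19.18).

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(5)
    a, b, V, sx = 2.0, 1.0, 3.0, 0.04
    x_true = np.array([0.5, 3.0, 5.0, 7.5, 10.0])
    y_true = a*b*x_true/(1 + b*x_true)
    x0 = x_true + y_true/V                   # consistent with y = V*(x0 - x)

    x = x_true + sx*rng.standard_normal(5)   # measured, error-prone x
    y = V*(x0 - x)                           # y inherits the error in x

    f = lambda x, a, b: a*b*x/(1 + b*x)
    p = np.array([2.0, 1.0])
    for _ in range(5):                       # iterate the EV weights
        s_tot = sx*(V + p[0]*p[1]/(1 + p[1]*x)**2)   # Eq. (19.18)
        p, Vm = curve_fit(f, x, y, p0=p, sigma=s_tot, absolute_sigma=True)
    print(p, np.sqrt(np.diag(Vm)))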
5.3. The unique answer

As I have noted, the algorithm of Deming (1964) yields the same values for the parameters and their estimated SEs for all correct forms expressing the relation among the variables and parameters. This follows from the definition of the minimization target in Eq. (19.8), and Bolster and I have demonstrated its performance in applications to both synthetic and real data (Tellinghuisen and Bolster, 2009a,b). In all of these examples, the application of the EV treatment to Eq. (19.1) yielded results very close to those obtained with the Deming algorithm, so I feel confident in advising users to just use Eq. (19.1) with the EV weights of Eq. (19.18) and Table 19.3 [or their analogs if isotherms or binding relations other than Eq. (19.1) are involved].
6. Assessing Data Uncertainty: Variance Function Estimation

From the foregoing, it is clear that a correct analysis of data by LS fitting requires knowledge of the data uncertainty. An obvious approach is to repeat the experiments many times and collect sampling statistics. Lest the reader despair of having to run dozens of day-long experiments, it is useful to know that neglect of weighting may not be a serious problem when the range of weights is moderate—say, less than a factor of 10 over the data set (Tellinghuisen, 2007). By contrast, the weights needed for transformed relations like Eq. (19.2) cannot be neglected, as these can easily span a range of 100 or greater, from their y⁴ dependence. As an alternative to tedious repetition of every experiment, the data error can often be gleaned
from archival data collected over time in many different experiments done with similar equipment and techniques. There is great value in knowing the data error, even if it is approximately constant, because this knowledge permits use of the χ² test to judge the suitability of a fit model (Tellinghuisen and Bolster, 2009b; Zeng et al., 2008a). While the χ² test may not be adequate to guarantee the suitability of a model (Straume and Johnson, 1992), it does work well to eliminate many inadequate models. Also, data heteroscedasticity can usually be characterized through VFs that contain only two or three parameters, which can be estimated adequately from as few as ~20 data points (Tellinghuisen, 2008a, 2009b). The estimation is generally done by LS, methods for which I briefly review here for the analysis of replicate data. Details, examples, and discussion of VF estimation from residuals can be found in the cited references.

Under the assumption that the data errors depend in some simple, smooth way on the experimental parameters, we seek to obtain that relation, var(x_i, y_i), through LS fitting of sampling estimates s_i² obtained from replicate measurements. Such estimates follow a scaled χ² distribution, which means that they have error proportional to their magnitude, namely σ(s²) = (2/ν)^(1/2) s². If we obtain the estimate s_i² from m measurements, first averaged to obtain a mean, then ν = m − 1. Accordingly, the estimates should be weighted as w_i ∝ s⁻⁴. The large relative uncertainty in s² (e.g., ~100% for m = 3, ~50% for m = 9) means that such w_i can be quite uncertain. The remedy for this is to use the estimated VF itself to compute the weights, w_i = [var(x_i, y_i)]⁻². This renders the fit iterative, since the VF is not known at the outset. However, like the similar iterative weightings already discussed, these computations typically converge adequately in a few cycles.

An alternative approach is to fit ln(s_i²). From error propagation, if z = ln(y) and y has uncertainty σ_y, then σ_z = σ_y/y. If σ_y is proportional to y, σ_y = cy, then σ_z = c. That is the case here, with σ(ln s²) = (2/ν)^(1/2). This approach has the advantage that the weights are independent of any fitting, so no iteration is required, but the disadvantage that the resulting estimates of the VF are biased negatively for small m (Tellinghuisen, 2008a, 2009b). Still, if the VF is not itself a primary target of the study, log fitting should give negligible loss of precision in the fitted response function.

Why not just assign the weights as w_i = 1/s_i²? This approach works well when large numbers of replicates (~10 or more) are used, but it has long been known that such weighting can actually be worse than ULS for small m (Jacquez and Norusis, 1973; Tellinghuisen, 2007). For illustration, suppose data of constant σ are sampled with m = 3 replicates at a number of (x, y) points. The large (~100%) uncertainty in the s_i² values ensures that many of these estimates will be much too small or too large, and by the principle of Eq. (19.5), the resulting WLS fit cannot be the minimum-variance one, which in this case would be the ULS fit.
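Here is a sketch of the iterative VF fit on synthetic replicate data (all specifics, including the VF form c0 + c1·y², the seed, and the error model, are illustrative assumptions):

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(3)
    y_true = np.linspace(0.5, 5.0, 20)
    sd_true = np.sqrt(4e-4 + 9e-4*y_true**2)       # assumed true error function
    m = 4                                          # replicates per point
    reps = y_true + sd_true*rng.standard_normal((m, y_true.size))
    y_bar = reps.mean(axis=0)
    s2 = reps.var(axis=0, ddof=1)                  # sampling variances, nu = m - 1

    vf = lambda y, c0, c1: c0 + c1*y**2            # trial VF: var(y) = c0 + c1*y^2
    p = np.array([1e-3, 1e-3])
    for _ in range(5):                             # weights from the VF itself
        sig = np.abs(vf(y_bar, *p)) + 1e-12        # sigma(s_i^2) proportional to VF
        p, _ = curve_fit(vf, y_bar, s2, p0=p, sigma=sig)
    print(p)                                       # compare with (4e-4, 9e-4)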
One approach for deriving weighting functions that definitely should not be used is the trial-and-error analysis of the data with various weighting formulas, using "quality coefficients" to assess the results. Although this method has seen increasing use recently in some areas of biochemical and medical research, I have shown that it is fundamentally flawed, because the nature of the test makes it self-fulfilling (Tellinghuisen, 2008b).

What functions can be used as VFs? A number of studies have indicated that measurement or instrumental variances generally contain terms that are constant, proportional to y, and proportional to y² (Ingle and Crouch, 1972; Rodbard and Frazier, 1975; Thompson, 1988; Zeng et al., 2008b), though often only two of these three are justified statistically. The measurement variance can be assessed straightforwardly through repeated measurements of samples that span the desired range of y. On the other hand, the method variance is needed to assign realistic weights, and it is harder to assess, as it requires repetition of entire experiments. The functional dependence of the method variance on y is also not easy to predict. Statisticians have preferred power functions and exponentials (Davidian and Carroll, 1987). Bolster and I found that the exponential form was needed in a recent sorption study (Tellinghuisen and Bolster, 2009b).
7. Conclusion

Numerous studies conducted over the last half century have emphasized the importance of proper weighting in the least-squares analysis of rectangularly hyperbolic data. Here, I have attempted to substantiate and simplify those findings by showing that consistent weights for Eq. (19.1) and its linearizations are easily derived using the rules of error propagation. Extending the oft-seen statement that precise data yield identical values of the LS parameters in all fit representations, I have emphasized that consistently weighted, precise data also yield identical parameter standard errors. Further, I have addressed the unusual situation of many sorption and binding studies, where the "independent" variable x of the fit models is the measured, uncertain quantity. Several treatments of this problem—effective variance, reexpressing the relation with x as the dependent variable, and use of the Deming/Lybanon algorithm—all yield satisfactory results.

Consistent weighting is unfortunately not the same as correct weighting. For the latter, there is a simple but widely neglected rule for obtaining minimum-variance estimates, rigorously true in LLS and anecdotally valid in NLLS: w_i ∝ 1/σ_i². Better awareness of this rule should serve to direct analysts' attention toward determining their data error structure, instead of seeking "magic weighting formulas" through trial-and-error experimentation with
different weighting expressions—an invalid approach that has unfortunately gained traction in some fields in recent years. Knowledge of the data VF is useful for more than correct weighting of the data from the most recent experiment: it also facilitates the design of better experiments. This is an important topic only touched on here but addressed in many other works, including those by Connors (1987) and Bowser and Chen (1998, 1999). The same methods that I have used here to confirm weighting expressions with exactly fitting data can also be used to explore other ranges of the parameters and variables, in order to achieve better results. But such efforts are of limited value without reliable information about the data error structure.

Finally, I have used Monte Carlo computations to show that the V-based parameter error estimates from nonlinear LS analysis of rectangularly hyperbolic data are trustworthy for establishing confidence limits unless the RSEs are large. The 10% rule of thumb is a useful and normally conservative guideline in this context: the V-based estimates should be reliable within 10% if the RSE is less than 10%. This guideline has emerged from examination of the statistical properties of reciprocals of normal variates—the most pathological behavior typically observed for NLLS estimators.
REFERENCES

Askelof, P., Korsfeldt, M., and Mannervik, B. (1976). Error structure of enzyme kinetic experiments: Implications for weighting in regression analysis of experimental data. Eur. J. Biochem. 69, 61–67.
Barker, D. R., and Diana, L. M. (1974). Simple method for fitting data when both variables have uncertainty. Am. J. Phys. 42, 224–227.
Barrow, N. J. (1978). The description of phosphate adsorption curves. J. Soil Sci. 29, 447–462.
Bevington, P. R. (1969). Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, New York.
Bolster, C. H. (2008). Revisiting a statistical shortcoming when fitting the Langmuir model to sorption data. J. Environ. Qual. 37, 1986–1992.
Bowser, M. T., and Chen, D. D. Y. (1998). Monte Carlo simulation of error propagation in the determination of binding constants from rectangular hyperbolae. 1. Ligand concentration range and binding constant. J. Phys. Chem. A 102, 8063–8071.
Bowser, M. T., and Chen, D. D. Y. (1999). Monte Carlo simulation of error propagation in the determination of binding constants from rectangular hyperbolae. 2. Effect of the maximum-response range. J. Phys. Chem. A 103, 197–202.
Britt, H. I., and Luecke, R. H. (1973). The estimation of parameters in nonlinear, implicit models. Technometrics 15, 233–247.
Cleland, W. W. (1967). The statistical analysis of enzyme kinetic data. Adv. Enzymol. 29, 1–32.
Clutton-Brock, M. (1967). Likelihood distributions for estimating functions when both variables are subject to error. Technometrics 9, 261–269.
Connors, K. A. (1987). Binding Constants: The Measurement of Molecular Complex Stability. Wiley, New York.
Cornish-Bowden, A., and Eisenthal, R. (1974). Statistical considerations in the estimation of enzyme kinetic parameters by the direct linear plot and other methods. Biochem. J. 139, 721–730.
Davidian, M., and Carroll, R. J. (1987). Variance function estimation. J. Am. Stat. Assoc. 82, 1079–1091.
de Levie, R. (2008). Advanced Excel for Scientific Data Analysis. Oxford University Press, New York.
Deming, W. E. (1964). Statistical Adjustment of Data. Dover, New York.
Di Cera, E. (1992). Use of weighting functions in data fitting. Methods Enzymol. 210, 68–87.
Dowd, J. E., and Riggs, D. S. (1965). A comparison of estimates of Michaelis–Menten kinetic constants from various linear transformations. J. Biol. Chem. 240, 863–869.
Eftink, M. R., and Ghiron, C. A. (1981). Fluorescence quenching studies with proteins. Anal. Biochem. 114, 199–227.
Feldman, H. A. (1972). Mathematical theory of complex ligand-binding systems at equilibrium: Some methods for parameter fitting. Anal. Biochem. 48, 317–338.
Ingle, J. D. Jr., and Crouch, S. R. (1972). Evaluation of precision of quantitative molecular absorption spectrometric measurements. Anal. Chem. 44, 1375–1386.
Jacquez, J. A., and Norusis, M. (1973). Sampling experiments on the estimation of parameters in heteroscedastic linear regression. Biometrics 29, 771–779.
Jefferys, W. H. (1980). On the method of least squares. Astron. J. 85, 177–181.
Johnson, M. L. (1985). The analysis of ligand-binding data with experimental uncertainties in independent variables. Anal. Biochem. 148, 471–478.
Johnson, M. L., and Faunt, L. M. (1992). Parameter estimation by least-squares methods. Methods Enzymol. 210, 1–37.
Kinniburgh, D. G. (1986). General purpose adsorption isotherms. Environ. Sci. Technol. 20, 895–904.
Langmuir, I. (1918). The adsorption of gases on plane surfaces of glass, mica, and platinum. J. Am. Chem. Soc. 40, 1361–1403.
Laws, W. R., and Contino, P. B. (1992). Fluorescence quenching studies: Analysis of nonlinear Stern-Volmer data. Methods Enzymol. 210, 448–463.
Lineweaver, H., Burk, D., and Deming, W. E. (1934). The dissociation constant of nitrogen-nitrogenase in azotobacter. J. Am. Chem. Soc. 56, 225–230.
Lybanon, M. (1984). A better least-squares method when both variables have uncertainties. Am. J. Phys. 52, 22–26.
Mannervik, B. (1982). Regression analysis, experimental error, and statistical criteria in the design and analysis of experiments for discriminating between rival kinetic models. Methods Enzymol. 87, 370–390.
Meinert, C. L., and McHugh, R. B. (1968). The biometry of an isotope displacement immunologic microassay. Math. Biosci. 2, 319–338.
Munson, P. J., and Rodbard, D. (1980). LIGAND: A versatile computerized approach for characterization of ligand-binding systems. Anal. Biochem. 107, 220–239.
Orear, J. (1982). Least squares when both variables have uncertainties. Am. J. Phys. 50, 912–916.
Powell, D. R., and Macdonald, J. R. (1972). A rapidly convergent iterative method for the solution of the generalized nonlinear least squares problem. Computer J. 15, 148–155.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical Recipes. Cambridge University Press, Cambridge, UK.
Ratkowsky, D. A. (1986). A suitable parameterization of the Michaelis–Menten enzyme reaction. Biochem. J. 240, 357–360.
Ritchie, R. J., and Prvan, T. (1996). A simulation study on designing experiments to measure the Km of Michaelis–Menten kinetics curves. J. Theor. Biol. 178, 239–254.
Rodbard, D., and Frazier, G. R. (1975). Statistical analysis of radioligand assay data. Methods Enzymol. 37, 3–22.
Schulthess, C. P., and Dey, D. K. (1996). Estimation of Langmuir constants using linear and nonlinear least squares regression analysis. Soil Sci. Soc. Am. J. 60, 433–442.
Shukla, G. K. (1972). Problem of calibration. Technometrics 14, 547–553.
Straume, M., and Johnson, M. L. (1992). Analysis of residuals: Criteria for determining goodness of fit. Methods Enzymol. 210, 87–105.
Tellinghuisen, J. (2000a). A Monte Carlo study of precision, bias, inconsistency, and non-Gaussian distributions in nonlinear least squares. J. Phys. Chem. A 104, 2834–2844.
Tellinghuisen, J. (2000b). Bias and inconsistency in linear regression. J. Phys. Chem. A 104, 11829–11835.
Tellinghuisen, J. (2000c). Nonlinear least-squares using microcomputer data analysis programs: KaleidaGraph in the physical chemistry teaching laboratory. J. Chem. Educ. 77, 1233–1239.
Tellinghuisen, J. (2001). Statistical error propagation. J. Phys. Chem. A 105, 3917–3921.
Tellinghuisen, J. (2004). Statistical error in isothermal titration calorimetry. Methods Enzymol. 383, 245–282.
Tellinghuisen, J. (2007). Weighted least squares in calibration: What difference does it make? Analyst 132, 536–543.
Tellinghuisen, J. (2008a). Least squares with non-normal data: Estimating experimental variance functions. Analyst 133, 161–166.
Tellinghuisen, J. (2008b). Weighted least squares in calibration: The problem with using "quality coefficients" to select weighting formulas. J. Chromatogr. B 872, 162–166.
Tellinghuisen, J. (2009a). Least squares in calibration: Weights, nonlinearity, and other nuisances. Methods Enzymol. 454, 259–285.
Tellinghuisen, J. (2009b). Variance function estimation by replicate analysis and generalized least squares: A Monte Carlo comparison. Chemometr. Intell. Lab. Syst. doi: 10.1016/j.chemolab.2009.09.001.
Tellinghuisen, J., and Bolster, C. H. (2009a). Weighting formulas for the least-squares analysis of binding phenomena data. J. Phys. Chem. B 113, 6151–6157.
Tellinghuisen, J., and Bolster, C. H. (2009b). Least-squares analysis of high-replication phosphorus sorption data with weighting from variance function estimation. Environ. Sci. Technol., unpublished work.
Thompson, M. (1988). Variation of precision with concentration in an analytical system. Analyst 113, 1579–1587.
Valsami, G., Iliadis, A., and Macheras, P. (2000). Non-linear regression analysis with errors in both variables: Estimation of co-operative binding parameters. Biopharm. Drug Dispos. 21, 7–14.
Wilkinson, G. N. (1961). Statistical estimations in enzyme kinetics. Biochem. J. 80, 324–332.
Zeng, Q. C., Zhang, E., and Tellinghuisen, J. (2008a). Univariate calibration by reversed regression of heteroscedastic data: A case study. Analyst 133, 1649–1655.
Zeng, Q. C., Zhang, E., Dong, H., and Tellinghuisen, J. (2008b). Weighted least squares in calibration: Estimating data variance functions in high-performance liquid chromatography. J. Chromatogr. A 1206, 147–152.
CHAPTER TWENTY
Nonparametric Entropy Estimation Using Kernel Densities

Douglas E. Lake

Contents
1. Introduction
2. Motivating Application: Classifying Cardiac Rhythms
3. Renyi Entropy and the Friedman–Tukey Index
4. Kernel Density Estimation
5. Mean-Integrated Square Error
6. Estimating the FT Index
7. Connection Between Template Matches and Kernel Densities
8. Summary and Future Work
Acknowledgments
References
Abstract

The entropy of experimental data from the biological and medical sciences provides additional information over summary statistics. Calculating entropy involves estimates of probability density functions, which can be effectively accomplished using kernel density methods. Kernel density estimation has been widely studied and a univariate implementation is readily available in MATLAB. The traditional definition of Shannon entropy is part of a larger family of statistics, called Renyi entropy, which are useful in applications that require a measure of the Gaussianity of data. Of particular note is the quadratic entropy, which is related to the Friedman–Tukey (FT) index, a widely used measure in the statistical community. One application where quadratic entropy is very useful is the detection of abnormal cardiac rhythms, such as atrial fibrillation (AF). Asymptotic and exact small-sample results for optimal bandwidth and kernel selection to estimate the FT index are presented and lead to improved methods for entropy estimation.
Departments of Internal Medicine (Cardiovascular Division) and Statistics, University of Virginia, Charlottesville, Virginia, USA

Methods in Enzymology, Volume 467
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67020-8
© 2009 Published by Elsevier Inc.
1. Introduction

Results from experiments are often simply reported with the summary statistics of sample mean and standard deviation. These statistics give qualitative information about the location and scale of the data, but do not answer questions about the shape of the distribution that may be important. Do the data have a bell-shaped Gaussian (normal) distribution? Do the data have more than one mode? Higher order sample statistics such as the skewness and kurtosis are common calculations that provide additional information and begin to answer these questions, but more work is usually needed. One way to fully understand these properties is to look at the distribution of the data by simply constructing a histogram. Mathematically, a histogram is an estimate of the underlying probability density function (PDF) of a random variable X representing the quantity being calculated. Better PDF estimates can be obtained using a method called kernel density estimation, which is readily available in many software packages, for example, the MATLAB function KSDENSITY, and has been widely studied (Scott, 1992). Despite this, the method remains underutilized in practice, and the goal of this chapter is to introduce its use for the analysis of experimental data from the biological and medical sciences.
While visualizing and pondering an estimate of the distribution is feasible on a case-by-case basis, many applications require determining information about many distributions in an automated fashion by calculating a number, called a functional of the PDF. One widely used functional of the PDF is the entropy, or more precisely the Shannon entropy, originally developed as part of communication theory (Shannon, 1997). Shannon entropy is a member of the Renyi entropy family (discussed below) and is an example of a measure of Gaussianity, which can indicate whether a PDF is bell shaped or perhaps has multiple modes (Jones and Sibson, 1987; Lake, 2006). Beirlant et al. (1997) provide an excellent overview of nonparametric methods to estimate entropy, and some of the terminology used there is repeated here. However, the statistical properties studied there are asymptotic in nature, and many of these results have limited use in practice.
Another important member of the Renyi entropy family is quadratic entropy. Quadratic entropy does not share some of the optimal theoretical properties of Shannon entropy, but has advantages that will make it the focus of this chapter. One advantage is that many of the statistical properties of estimates of a quantity related to quadratic entropy, called the Friedman–Tukey (FT) index, can be expressed with exact closed-form expressions. Another advantage is that the properties of quadratic entropy are generally better for small sample sizes where, for example, Shannon entropy estimates can have large bias.
2. Motivating Application: Classifying Cardiac Rhythms

The classification of cardiac rhythms using series of interbeat or RR intervals is an important clinical problem that has proved to benefit from entropy analysis (Costa et al., 2002; Lake, 2006; Lake et al., 2002). Several common clinical scenarios call for identification of cardiac rhythm in ambulatory outpatients. Atrial fibrillation (AF) is an increasingly common disorder of cardiac rhythm in which the atria depolarize at exceedingly fast rates; it is often paroxysmal (occurs suddenly) in nature. There is an increased risk of stroke with AF, along with risks associated with treatments, so decisions about its therapy are best informed by knowledge of the frequency, duration, and severity of the arrhythmia. In detection and classification, AF can often be confused with normal sinus rhythm with ectopic (premature) beats or with other arrhythmias such as bigeminy or trigeminy. Figure 20.1 shows examples of a series of 100 RR intervals for AF and trigeminy rhythms.
Figure 20.1 Examples of n = 100 consecutive beats from two abnormal heart rhythms, atrial fibrillation (AF) and trigeminy.
Figure 20.2 Kernel density estimates for the n = 100 standardized observations from the two examples from Fig. 20.1 using the default settings from the MATLAB function KSDENSITY. The corresponding entropy estimate for the bell-shaped AF density is much higher than for the multimodal trigeminy density function.
The series of AF can perhaps best be described as looking like "white" noise, while the trigeminy rhythm has three distinct levels (or modes) of heart rate. The differences between these rhythms are clear by looking at the distribution of the RR intervals. Figure 20.2 shows the kernel density estimates of the two series after they have been standardized (zero mean and unit variance). The results use the default MATLAB settings of KSDENSITY. The trigeminy series has three readily apparent peaks in its density estimate, representing each of the heart rate modes, while the AF rhythm has more of a bell-shaped or normal distribution. Entropy estimates associated with the densities (given below) are much higher for the AF (Q = 1.24) than for the trigeminy (Q = 0.502).
While the differences in the two rhythms are obvious with n = 100 points, sometimes decisions for therapy need to be made with much smaller sample sizes. An important example of this is for patients with severe heart disease who have implantable cardioverter-defibrillator (ICD) devices to reduce the incidence of sudden cardiac death. The therapy in this case is an electric shock, and accurate decisions need to be made on records on the order of n = 16 beats. Finding good entropy estimates on small data sets can be challenging and requires some of the mathematical detail presented here.
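As a concrete illustration of this kind of analysis, the following MATLAB sketch builds two simulated stand-ins for the series of Fig. 20.1 (a unimodal "AF-like" series and a trimodal "trigeminy-like" one), standardizes them, and compares their kernel density estimates with KSDENSITY, as in Fig. 20.2. All numerical settings here are hypothetical choices of ours, not values from the study.

```matlab
% Simulated stand-ins for the RR series of Fig. 20.1 (illustrative only).
n     = 100;
af    = 700 + 120*randn(n, 1);                   % white, bell-shaped "AF-like" series
modes = [550; 750; 950];                         % three heart-rate levels (ms), hypothetical
tri   = modes(randi(3, n, 1)) + 20*randn(n, 1);  % "trigeminy-like" series
zaf   = (af  - mean(af))/std(af);                % standardize: zero mean, unit variance
ztri  = (tri - mean(tri))/std(tri);
[fa, xa] = ksdensity(zaf);                       % default settings, as in Fig. 20.2
[ft, xt] = ksdensity(ztri);
plot(xa, fa, xt, ft);
legend('AF-like', 'trigeminy-like');
xlabel('Standardized values'); ylabel('Probability density');
```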
3. Renyi Entropy and the Friedman–Tukey Index

The precise mathematical definitions of the entropy measures to be discussed will now be presented. The entropy (Shannon entropy) of a continuous random variable X with density f is

H(X) = E[-\log f(X)] = -\int_{-\infty}^{\infty} \log(f(x))\, f(x)\, dx    (20.1)
where E is the expectation and log is the natural logarithm. The quadratic entropy is defined to be

Q(X) = -\log(E[f(X)]) = -\log \int_{-\infty}^{\infty} f^2(x)\, dx    (20.2)
which is similar to entropy with the expectation and logarithm operations reversed. Both measures are special cases of Renyi entropy (or q-entropy), defined to be

R_q(X) = \frac{1}{1-q} \log(E[f(X)^{q-1}]) = \frac{1}{1-q} \log \int_{-\infty}^{\infty} f^q(x)\, dx    (20.3)

where for q = 1 the limit can be obtained using calculus (l'Hospital's rule). In particular, Shannon entropy corresponds to q = 1, that is, H(X) = R_1(X), and quadratic entropy corresponds to q = 2, that is, Q(X) = R_2(X).
All of the above entropies involve finding the expectation of a function of X, which is defined to be an integral involving the PDF f(x). For quadratic entropy this is just the integral of "f-squared" or f^2(x). This quantity also arises in the analysis of kernel densities, and the following specific notation will be used:

I(f) = \int_{-\infty}^{\infty} f^2(x)\, dx    (20.4)
This quantity is also named the FT index of a random variable, and the notation FT(X) = I(f) will also be used. A good place to start in finding accurate entropy estimates is to investigate the statistical properties of estimates of FT(X).
Entropy is an example of a measure of Gaussianity that has received much attention recently in a variety of applications, including the analysis of heart rate (Lake, 2006). These measures are used in independent component analysis (ICA), where the alternative terminology measure of non-Gaussianity is used (Hyvarinen et al., 2001). One example application of ICA is the separation of signals from multiple speakers, which is informally called the cocktail party problem. Measures of Gaussianity are also used for exploratory projection pursuit (EPP), which searches for interesting low-dimensional
projections of high-dimensional data (Jones and Sibson, 1987). Here, interesting means non-Gaussian and is measured by what is called a projection index. Non-Gaussian projections can be used as features for multivariate discrimination and for data visualization (e.g., XGobi software) (Ripley, 1996). The Friedman–Tukey index was originally developed as a projection index and is commonly used for this purpose.
The competition and debate over the use of order 2 (q = 2) versus order 1 (q = 1) entropies has taken place in a variety of applications and appears in many different forms. For the development of optimal decision trees using the CART method, the discrete form of the FT index, called the Gini index (order 2), is an alternative measure of impurity to the discrete form of entropy (order 1) (Breiman et al., 1984). For goodness-of-fit tests there has been a long history and debate over using the log-likelihood ratio test (order 1) versus chi-squared tests (order 2) (Read and Cressie, 1988). Finally, the original motivation for this work involves two popular entropy measures for time series data, approximate entropy (order 1) and sample entropy (order 2). These measures will be discussed in more detail below.
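Before turning to estimation, the definitions above can be checked numerically. The short MATLAB sketch below (our code) evaluates Eqs. (20.1), (20.2), and (20.4) by quadrature for a standard normal density, for which the closed-form values are H = log√(2πe) ≈ 1.419, I(f) = 1/(2√π) ≈ 0.282, and Q = log(2√π) ≈ 1.266.

```matlab
% Numerical check of the entropy definitions for a standard normal PDF.
x  = linspace(-8, 8, 1e5);
f  = normpdf(x);
H  = -trapz(x, f.*log(f));    % Shannon entropy, Eq. (20.1); expect ~1.4189
If = trapz(x, f.^2);          % FT index I(f), Eq. (20.4); expect ~0.2821
Q  = -log(If);                % quadratic entropy, Eq. (20.2); expect ~1.2655
fprintf('H = %.4f, I(f) = %.4f, Q = %.4f\n', H, If, Q);
```

Note that the AF series of Section 2 had Q = 1.24, close to this Gaussian value, consistent with its nearly bell-shaped density.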
4. Kernel Density Estimation

Kernel density estimation has been widely studied for the past 50 years, and an excellent resource for its many interesting mathematical details can be found in Scott (1992). The basic setup is a random sample of independent and identically distributed (iid) data X_1, X_2, ..., X_n coming from a random variable X with PDF f. To estimate f, another function K(u), called the kernel, is associated with each observation. All the kernels to be considered here will also be a PDF associated with a random variable (which will also be called K). All kernels considered here will have mean equal to 0 and be symmetric, so that K(u) = K(-u). The variance of the kernel density function will be denoted by σ_K², which, for purposes of simplifying the comparison of kernels, will be assumed to be 1.
The kernel function is scaled by an important parameter h called the bandwidth, and the notation K_h(u) = K(u/h)/h is used. The bandwidth can be interpreted as a scale parameter of the kernel, and the random variable K_h has standard deviation h. The bandwidth is analogous to the bin size for histograms. Once a kernel function and bandwidth have been specified, the estimated density function at each point x is calculated by

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)    (20.5)
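Eq. (20.5) is only a few lines of MATLAB; the sketch below (our code, with an arbitrary sample and bandwidth) implements it with a Gaussian kernel and checks it against KSDENSITY ('width' is the historical name of the bandwidth argument; newer releases also accept 'Bandwidth').

```matlab
% Direct implementation of the kernel density estimate, Eq. (20.5).
X    = randn(16, 1);                 % sample
h    = 0.6;                          % bandwidth (see Section 5)
xg   = linspace(-4, 4, 201);         % evaluation grid
fhat = zeros(size(xg));
for i = 1:numel(X)
    fhat = fhat + normpdf((xg - X(i))/h)/h;   % K_h(x - X_i), Gaussian K
end
fhat = fhat/numel(X);
fks  = ksdensity(X, xg, 'kernel', 'normal', 'width', h);
max(abs(fhat - fks))                 % should be at rounding-error level
```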
Common kernels include the Gaussian, uniform, and triangle kernels. Another kernel that has many optimal asymptotic properties is called the Epanechnikov kernel, which is parabolic in shape. These kernels, scaled to have unit variance, are all shown in Fig. 20.3 and are all available options with the KSDENSITY function.
To illustrate the effect of the shape of the kernel and bandwidth selection on PDF estimates, a small set of n = 16 observations was simulated from a standard normal distribution. This is a reasonable model of 16 standardized points from a short episode of AF. Histograms and kernel density estimates (using the default options) along with the standard normal PDF are displayed in Fig. 20.4. The optimal bandwidth of h = 0.587 used in the kernel density estimate comes from a formula, to be discussed below, that involves both the sample size n and an estimate of the standard deviation of the data. This simple example clearly shows the benefit of using a smooth estimate of the PDF versus the usual "choppy" step-function histogram estimate.
With this same data set, Fig. 20.5 displays the effect of kernel and bandwidth selection on the PDF estimates. For all kernels, bandwidths that are too small provide too much resolution for each of the 16 points, and bandwidths that are too large smear out the data, essentially removing all the characteristics of the distribution. The optimal bandwidth for each kernel is approximately h = 0.6, and the estimate provides a reasonably good tradeoff between the two extreme cases.
Figure 20.3 Shapes of four common kernels used in kernel density estimation all normalized to have unit variance.
Figure 20.4 A comparison of the kernel density method with the more common histogram (both using default MATLAB settings) for n = 16 random observations from a standard normal distribution. The kernel density method clearly provides a far better estimate in this case. The 16 observations were 0.39, 0.14, 2.33, 1.36, 1.81, 1.11, 0.142, 1.11, 0.56, 0.48, 0.68, 0.28, 1.33, 0.72, 0.66, 0.20.
Note that the uniform kernel produces a discontinuous step-function estimate similar to the histogram, and that the Epanechnikov and Gaussian results are very similar.
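The bandwidth sensitivity seen in Fig. 20.5 is easy to reproduce; a sketch (our code) that smooths one sample at several bandwidths:

```matlab
% Bandwidth sensitivity, in the spirit of Fig. 20.5 (Gaussian kernel).
X  = randn(16, 1);
xg = linspace(-3, 3, 201);
hs = [0.1 0.6 1 2];                  % too small, near-optimal, too large
for k = 1:numel(hs)
    subplot(2, 2, k);
    plot(xg, ksdensity(X, xg, 'width', hs(k)));
    title(sprintf('h = %g', hs(k)));
end
```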
5. Mean-Integrated Square Error

The bandwidth for each of the kernels used by KSDENSITY is selected using a formula that asymptotically minimizes a goodness-of-fit criterion called the mean-integrated squared error (MISE) for the Gaussian kernel. The MISE is defined to be

MISE = E\left[\int_{-\infty}^{\infty} (\hat{f}(x) - f(x))^2\, dx\right]    (20.6)
where f is the true PDF and \hat{f} is the kernel density estimate. For the simplest case, where both the true density and the kernel function are standard Gaussian, the MISE can be calculated exactly (after some work) to be
Figure 20.5 Effect of the bandwidth parameter h and the kernel function K on the density estimate for the example data from Fig. 20.4. For all kernels, a bandwidth of around h = 0.6 gives an optimal estimate.
MISE(h) = \frac{1}{2\sqrt{\pi}} \left(1 - 2\sqrt{\frac{2}{2+h^2}} + \frac{1}{\sqrt{1+h^2}} + \frac{1}{nh} - \frac{1}{n\sqrt{1+h^2}}\right)    (20.7)
where h is the bandwidth. This complicated expression can be well approximated by a Taylor's series expansion, giving the asymptotic MISE (AMISE)

AMISE(h) = \frac{3}{32\sqrt{\pi}} h^4 + \frac{1}{2\sqrt{\pi}\, nh}    (20.8)
whose minimal value can be found using calculus. This results in an optimal bandwidth of

h^{*} = (4/3)^{1/5}\, n^{-1/5} = 1.0592\, n^{-1/5}    (20.9)
and this is precisely the formula used by KSDENSITY. A subtle, but nontrivial, point is that the above quantity assumes that the data are standardized to have unit variance. This can only be achieved exactly if σ is known, which is not likely in practice, so an estimate is needed. While the sample standard deviation (usually denoted by s) seems a
reasonable choice, an estimate that is robust in the presence of outliers is generally preferable. In fact, the KSDENSITY function estimates σ as the median of the absolute deviations from the median of the data, divided by an appropriate constant (the expected value of this calculation for standard normal data). Another such robust estimate would be a normalized interquartile range.
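Combining Eq. (20.9) with this robust scale estimate gives, under our reading of the text, the default Gaussian-kernel bandwidth that KSDENSITY computes; in the sketch below, 0.6745 = Φ⁻¹(3/4) is the standard constant that makes the MAD consistent for the standard deviation of normal data.

```matlab
% Eq. (20.9) with a robust scale estimate (normalized MAD).
X     = randn(16, 1);
n     = numel(X);
sigma = median(abs(X - median(X)))/0.6745;  % robust estimate of the SD
h     = 1.0592*sigma*n^(-1/5)               % MISE-optimal bandwidth, Eq. (20.9)
```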
6. Estimating the FT Index

The quantity FT(X) is a functional I(f) of the PDF f and can be estimated in a couple of ways (Beirlant et al., 1997; Rao, 1983). One approach is called the plug-in estimate and simply involves inserting any estimate of the density into the formula

\hat{I} = \hat{I}(f) = I(\hat{f}) = \int_{-\infty}^{\infty} \hat{f}^2(x)\, dx    (20.10)
and evaluating the integral numerically (e.g., Simpson's method). A second method is called the resubstitution estimate, given by

\hat{I} = \hat{I}(f) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(X_i)    (20.11)
which requires estimating the density only at the observed values from the sample. Unless otherwise stated, all the estimates of I(f) will refer to the resubstitution estimate above.
The kernel density estimate at one of the sample points X_i is more biased when the term involving X_i is included in the sum. A less biased estimate is

\hat{f}(X_i) = \frac{1}{n-1} \sum_{j \neq i} K_h(X_i - X_j)    (20.12)
which involves only n - 1 terms. This is analogous to not including self-matches in the SampEn algorithm, which has proved to be less biased than the ApEn algorithm, which includes self-matches (Lake et al., 2002; Richman, 2004; Richman and Moorman, 2000). Both these algorithms are discussed below. The biased estimate includes all n terms:

\hat{f}_b(X_i) = \frac{1}{n} \sum_{j=1}^{n} K_h(X_i - X_j) = \frac{K(0)}{nh} + \frac{n-1}{n} \hat{f}(X_i)    (20.13)

which can be substantially biased if nh is not large. Combining Eqs. (20.11) and (20.12), the estimate of the FT index becomes a double summation over all possible pairs of sample points:
\hat{I} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} K_h(X_i - X_j) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} K_h(X_i - X_j)    (20.14)

where the last expression follows from the assumed symmetry of the kernel density function. The form of this estimate is, relatively speaking, much simpler than a corresponding estimate of the Shannon entropy. For example, many of its statistical properties can be found exactly in closed form and optimized.
The selection of the bandwidth h becomes a tradeoff between the increased bias at large values and the increased variance at small values. Traditionally, these two properties are combined in the mean-squared error (MSE) of an estimator:

MSE = E[(I(f) - \hat{I}(f))^2] = (I(f) - E[\hat{I}(f)])^2 + V[\hat{I}(f)]    (20.15)
which is the sum of the variance V and the bias squared. Note that while there are some similarities between the expressions for MISE and MSE, they are different criteria for evaluating the estimators. The optimal asymptotic bandwidths to minimize the MSE were investigated by Pawlak, though some of the formulas there are incorrect (Pawlak, 1987). In order to do this, expressions for the asymptotic mean-squared error (AMSE) are needed. Expanding the double summation in Eq. (20.14), and after some work, the exact MSE can be calculated as follows:

MSE(h) = (I(f) - E_1)^2 + \frac{2}{n(n-1)} \left(E_2 + 2(n-2)E_{11} - (2n-3)E_1^2\right)    (20.16)

where the three expectations are

E_1 = E[K_h(X_1 - X_2)]
E_2 = E[K_h^2(X_1 - X_2)]
E_{11} = E[K_h(X_1 - X_2)\, K_h(X_1 - X_3)]    (20.17)
Using Taylor series expansions of these kernel density expectations (as is done for the AMISE in Scott (1992)) results in an expression for the AMSE in terms of the bandwidth h:

AMSE(h) = \frac{4}{n} \mathrm{Var}(f(X)) + \frac{2}{n^2 h} I(K) I(f) + \frac{1}{4} I^2(f')\, h^4    (20.18)

where f' is the derivative of the PDF. This expression asymptotically approximates the MSE to within a constant times h^4, that is, MSE = AMSE + O(h^4). The bandwidth that minimizes this quantity can again be found using calculus to be
h^{*} = (2 I(K) I(f))^{1/5}\, I(f')^{-2/5}\, n^{-2/5}    (20.19)
which for Gaussian data becomes h* = 2^{3/5} n^{-2/5} = 1.516 n^{-2/5}. For moderate sizes of n, this bandwidth is smaller than that in Eq. (20.9). This suggests that increased entropy estimation accuracy can be achieved using smaller bandwidths than those optimized for MISE. The optimal bandwidth in Eq. (20.19) gives the following minimal AMSE:

AMSE^{*} = \frac{4}{n} \mathrm{Var}(f(X)) + \frac{5}{4} (2 I(K) I(f))^{4/5}\, I(f')^{2/5}\, n^{-8/5}    (20.20)

The MSE of these estimators goes to zero for large n, and they are therefore consistent. A similar expression using the biased estimate of the PDF with self-matches can be determined, and its second term is of the form of a constant times n^{-4/3}. This is asymptotically larger (not as good an estimator) than the unbiased version without self-matches.
As before, the exact MSE for the special case of standard Gaussian data and kernel can be found for all n in closed form, with

I(f) = \frac{1}{2\sqrt{\pi}}
E_1 = \frac{1}{2\sqrt{\pi}} (1 + h^2/2)^{-1/2}
E_2 = \frac{1}{4\pi h} (1 + h^2/4)^{-1/2}
E_{11} = \frac{1}{2\pi\sqrt{3}} (1 + h^2)^{-1/2} (1 + h^2/3)^{-1/2}    (20.21)
Since bandwidths are usually selected using asymptotic results, a natural question is to what extent the error is reduced by using the exact formulas. Figure 20.6 shows the exact and asymptotic approximation formulas for n = 16. In this case, the optimal bandwidth using the exact MSE is h* = 0.674, versus h* = 0.5 using the AMSE. The corresponding minimum MSE is approximately 10% lower than that obtained with the asymptotic approximation.
The expression in Eq. (20.20) depends on the kernel through the quantity I(K), and more generally σ_K I(K), with smaller values giving better results. The optimal MISE depends on this quantity in the same manner, and it is used to compare the efficiency of estimators. Table 20.1 shows these results for commonly used kernels. It can be shown that the Epanechnikov parabolic kernel is asymptotically the most efficient among all possible estimators. It is also interesting to note that the normal kernel is more efficient than the uniform, but not as efficient as the triangle kernel.
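The exact-versus-asymptotic comparison of Fig. 20.6 follows directly from Eqs. (20.16) and (20.21); the MATLAB sketch below (our code, using the formulas as reconstructed above) scans the exact MSE over a bandwidth grid for n = 16 and should locate the minimum near h* = 0.674.

```matlab
% Exact MSE of the FT-index estimate, standard Gaussian data and kernel,
% Eqs. (20.16) and (20.21), evaluated for n = 16.
n   = 16;
If  = 1/(2*sqrt(pi));                               % I(f)
h   = linspace(0.05, 1, 1000);
E1  = (1/(2*sqrt(pi)))./sqrt(1 + h.^2/2);
E2  = (1./(4*pi*h))./sqrt(1 + h.^2/4);
E11 = (1/(2*pi*sqrt(3)))./sqrt((1 + h.^2).*(1 + h.^2/3));
MSE = (If - E1).^2 + (2/(n*(n-1)))*(E2 + 2*(n-2)*E11 - (2*n-3)*E1.^2);
[MSEmin, k] = min(MSE);
fprintf('h* = %.3f, minimum MSE = %.2e\n', h(k), MSEmin);
```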
Figure 20.6 The exact MSE and the asymptotic AMSE for n = 16. The optimal exact value is h* = 0.674 versus the optimal asymptotic value of h* = 0.5, resulting in approximately a 10% reduction in the MSE.

Table 20.1 Efficiency of commonly used kernels

Kernel          σ_K I(K)    Efficiency
Uniform         0.2887      1.0758
Triangle        0.2722      1.0143
Epanechnikov    0.2683      1.0000
Normal          0.2821      1.0513
A final observation should be made about the plug-in estimate in comparison to the resubstitution estimate results presented here. At first glance, the plug-in estimate looks superior to the resubstitution estimate because it uses estimates of the density at all points. However, it can be shown that the plug-in estimate is equivalent to the resubstitution estimate with a new kernel K2, equal to a convolution of the original kernel K with itself, and using self-matches. This corresponds to a new random variable K2, equal to the sum of two independent random variables each with the same distribution as K. So if K is Gaussian, K2 is also Gaussian and the two methods are equivalent (with different bandwidths) (Erdogmus et al., 2004). However,
the equivalent resubstitution estimate includes self-matches, which introduces extra bias and argues against using the plug-in method.
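To close this section, here is a compact MATLAB sketch of the resubstitution FT-index estimate of Eq. (20.14), with a Gaussian kernel and self-matches excluded; the function name and interface are ours (save as ftindex.m).

```matlab
function [Ihat, Q] = ftindex(X, h)
% Resubstitution estimate of the FT index, Eq. (20.14), with a Gaussian
% kernel; Q = -log(Ihat) is the corresponding quadratic entropy estimate.
n = numel(X);
D = bsxfun(@minus, X(:), X(:).');        % all pairwise differences X_i - X_j
K = normpdf(D/h)/h;                      % K_h(X_i - X_j)
Ihat = (sum(K(:)) - n*normpdf(0)/h)/(n*(n - 1));  % drop the i = j self-matches
Q = -log(Ihat);
end
```

For standardized Gaussian data, for example ftindex(randn(1000,1), 1.516*1000^(-2/5)) with the bandwidth of Eq. (20.19), the output should be near I(f) = 0.282 and Q = 1.266.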
7. Connection Between Template Matches and Kernel Densities

Our example showing the analysis of heart rhythms by simply looking at the distribution of the RR intervals does not tell the full story about the physical process. The interactions between successive observations provide additional information that is not available in a kernel density estimate, which does not depend on the order of the observations. In fact, the most prominent feature of AF is not that its overall distribution is unimodal compared to multimodal, but that observations are "white" in that they are unpredictable and appear to occur randomly with little apparent dependence on previous observations. The concepts of entropy to measure properties like predictability or order do extend to time series data in the form of what is termed entropy rate (Cover and Thomas, 1991). These properties have been widely studied with the two popular and related measures of approximate entropy and sample entropy. The fundamental calculation of both of these methods involves the counting of template matches, which is basically part of a multivariate kernel density estimate using a uniform kernel. Within the framework of Renyi entropy, these two measures correspond to orders q = 1 and q = 2, respectively. We now briefly describe these methods to show their relation to the results on kernel density estimation presented here.
For a time series x_1, x_2, ..., x_N, let x_m(i) denote the m points x_i, x_{i+1}, ..., x_{i+m-1}, which we call a template and which can be considered a vector of length m. An instance where all the components of the vector x_m(j) are within a distance r of those of x_m(i) is called a template match. The quantity r is essentially the bandwidth of a uniform kernel. Let B_i denote the number of template matches with x_m(i) and A_i denote the number of template matches with x_{m+1}(i). The quantity p_i = A_i/B_i is an estimate of the conditional probability that the point x_{j+m} is within r of x_{i+m}, given that x_m(j) matches x_m(i).
Pincus introduced the statistic approximate entropy as a measure of regularity (Pincus, 1991). Denoted by ApEn, it can be calculated by

ApEn(m, r, N) = -\frac{1}{N-m} \sum_{i=1}^{N-m} \log\left(\frac{A_i}{B_i}\right)    (20.22)

and is the negative average natural logarithm of this conditional probability. Self-matches are included in the original ApEn algorithm to avoid the
p_i = 0/0 indeterminate form, but this convention leads to noticeable bias, especially for smaller N and larger m. A related but more robust statistic called sample entropy, or SampEn, was introduced by Richman and Moorman, designed to reduce this bias by not including self-matches (Richman and Moorman, 2000). SampEn is calculated by

SampEn(m, r, N) = -\log\left(\sum_{i=1}^{N-m} A_i \Big/ \sum_{i=1}^{N-m} B_i\right)    (20.23)
which is just the negative logarithm of an estimate of the conditional probability of a match of length m + 1 given a match of length m. As with quadratic entropy, SampEn has the added advantage that its statistical properties are more accessible than those of ApEn. The optimal bandwidths presented here provide a formal setting for evaluating and selecting the tolerance r. In particular, the matching part of these algorithms for templates of length 1 is proportional to results using the uniform kernel in Fig. 20.3 with r = 3^{1/2} h = 1.732 h.
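For reference, a direct MATLAB sketch of the SampEn calculation of Eq. (20.23), counting template matches with the maximum-norm distance and excluding self-matches (our illustration, not the authors' published code; save as sampen.m):

```matlab
function se = sampen(x, m, r)
% Sample entropy, Eq. (20.23): -log(sum(A_i)/sum(B_i)), where B counts
% template matches of length m and A those that extend to length m + 1.
N = numel(x); A = 0; B = 0;
for i = 1:N-m
    for j = i+1:N-m                       % j > i: each pair counted once
        if max(abs(x(i:i+m-1) - x(j:j+m-1))) <= r
            B = B + 1;                    % length-m match
            if abs(x(i+m) - x(j+m)) <= r
                A = A + 1;                % extends to a length-(m+1) match
            end
        end
    end
end
se = -log(A/B);
end
```

A typical call is sampen(z, 2, 0.2) for a standardized series z; per the correspondence noted above, the tolerance r plays the role of the bandwidth of a uniform kernel.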
8. Summary and Future Work

Entropies are functionals of PDFs and can be effectively estimated using kernel density methods. The optimal bandwidths for the FT index, which is part of the calculation of quadratic entropy, have been presented. The optimal bandwidth tends to be smaller than that traditionally obtained to minimize the MISE and used by MATLAB. Estimating entropy for small samples can benefit from exact results, which are in closed form for the Gaussian signal and kernel case and can be evaluated numerically in other instances. Exact results for Gaussian mixtures are straightforward, but messy.
Future work includes extending these results to arbitrary entropies (q ≠ 2) and in particular Shannon entropy (q = 1). This is not a trivial undertaking because the estimates are not simply an average and involve nonlinear functions, for example, the logarithm. The one-dimensional results (d = 1) also need to be extended to higher dimensions (d > 1). An important application of these results is estimating entropy for time-series data, which not only involves d > 1 but also has dependent data. These advances would be directly applicable to finding the optimal tolerance r for the template matching step in the calculation of SampEn.
ACKNOWLEDGMENTS

This work was supported by grant 0855399E from the American Heart Association, MidAtlantic Research Consortium. Yan Liu and Sida Peng provided support in checking the
mathematical details of some of the formulas presented here as well as investigating future directions in the area of nonparametric entropy estimation using kernel densities and other methods. My continued collaboration with Randall Moorman, MD in the mathematical analysis of heart rate, including detecting atrial fibrillation in short records, provided ample clinical motivation for the development of the methods presented here.
REFERENCES

Beirlant, J., Dudewicz, E. J., Gyorfi, L., and van der Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 6(1), 17–39.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole, Monterey, CA.
Costa, M., Goldberger, A. L., and Peng, C. K. (2002). Multiscale entropy analysis of complex physiologic time series. Phys. Rev. Lett. 89, 068102.
Cover, T. M., and Thomas, J. A. (1991). Elements of Information Theory. John Wiley and Sons, New York.
Erdogmus, D., Hild, K., Principe, J., Lazaro, M., and Santamaria, I. (2004). Adaptive blind deconvolution of linear channels using Renyi's entropy with Parzen window estimation. IEEE Trans. Signal Process. 52(6), 1489–1498.
Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley and Sons, New York.
Jones, M. C., and Sibson, R. (1987). What is projection pursuit? J. R. Stat. Soc. A 150, 1–36.
Lake, D. E. (2006). Renyi entropy measures of heart rate Gaussianity. IEEE Trans. Biomed. Eng. 53(1), 21–27.
Lake, D. E., Richman, J. S., Griffin, M. P., and Moorman, J. R. (2002). Sample entropy analysis of neonatal heart rate variability. Am. J. Physiol. 283, R789–R797.
Pawlak, M. (1987). Contribution to the discussion of the paper "What is projection pursuit?" by Jones, M. C., and Sibson, R. J. R. Stat. Soc. A 150, 1–36 (the contribution, pp. 31–32).
Pincus, S. M. (1991). Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. 88, 2297–2301.
Rao, B. L. S. P. (1983). Nonparametric Functional Estimation. Academic Press, London.
Read, T., and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Richman, J. S. (2004). Sample entropy statistics. Ph.D. dissertation, University of Alabama at Birmingham.
Richman, J. S., and Moorman, J. R. (2000). Physiological time series analysis using approximate entropy and sample entropy. Am. J. Physiol. 278, 2039–2049.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley, New York.
Shannon, C. E. (1997). The mathematical theory of communication (Reprinted). M D Comput. 14, 306–317.
CHAPTER TWENTY-ONE
Pancreatic Network Control of Glucagon Secretion and Counterregulation

Leon S. Farhy and Anthony L. McCall

Contents
1. Introduction
2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes
3. Interdisciplinary Approach to Investigating the Defects in the GCR
4. Initial Qualitative Analysis of the GCR Control Axis
   4.1. β-Cell inhibition of α-cells
   4.2. δ-Cell inhibition of α-cells
   4.3. α-Cell stimulation of δ-cells
   4.4. Glucose stimulation of β- and δ-cells
   4.5. Glucose inhibition of α-cells
5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats
6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin Deficient Pancreas
   6.1. Dynamic network approximation of the MCN
   6.2. Determination of the model parameters
   6.3. In silico experiments
   6.4. Validation of the MCN
   6.5. In silico experiments with simulated complete insulin deficiency
   6.6. Defective GCR response to hypoglycemia with the absence of a switch-off signal in the insulin deficient model
   6.7. GCR response to switch-off signals in insulin deficiency
   6.8. Reduction of the GCR response by high glucose conditions during the switch-off or by failure to terminate the intrapancreatic signal
   6.9. Simulated transition from a normal physiology to an insulinopenic state
7. Advantages and Limitations of the Interdisciplinary Approach
8. Conclusions
Acknowledgement
References

Department of Medicine, Center for Biomathematical Technology, University of Virginia, Charlottesville, Virginia, USA

Methods in Enzymology, Volume 467
ISSN 0076-6879, DOI: 10.1016/S0076-6879(09)67021-X
© 2009 Published by Elsevier Inc.
Abstract

Glucagon counterregulation (GCR) is a key protection against hypoglycemia compromised in insulinopenic diabetes by an unknown mechanism. In this work, we present an interdisciplinary approach to the analysis of the GCR control mechanisms. Our results indicate that a pancreatic network which unifies a few explicit interactions between the major islet peptides and blood glucose (BG) can replicate the normal GCR axis and explain its impairment in diabetes. A key and novel component of this network is an α-cell auto-feedback, which drives glucagon pulsatility and mediates triggering of pulsatile GCR by hypoglycemia via a switch-off of the β-cell suppression of the α-cells. We have performed simulations based on our models of the endocrine pancreas which explain the in vivo GCR response to hypoglycemia of the normal pancreas and the enhancement of defective pulsatile GCR in β-cell deficiency by switch-off of intrapancreatic α-cell suppressing signals. The models also predicted that reduced insulin secretion decreases and delays the GCR. In conclusion, based on experimental data we have developed and validated a model of the normal GCR control mechanisms and their dysregulation in insulin deficient diabetes. One advantage of this construct is that all model components are clinically measurable, thereby permitting its transfer, validation, and application to the study of the GCR abnormalities of the human endocrine pancreas in vivo.
1. Introduction

Blood glucose (BG) homeostasis is maintained by a complex, ensemble control system characterized by a highly coordinated interplay between and among various hormone and metabolite signals. One of its key components, the endocrine pancreas, responds dynamically to changes in BG, nutrients, neural, and other signals by releasing insulin and glucagon in a pulsatile manner to regulate glucose production and metabolism. Abnormalities in the secretion and interaction of these two hormones mark the progression of many diseases, including diabetes, but also the metabolic syndrome, the polycystic ovary syndrome, and others. Diminished or complete loss of endogenous insulin secretion in diabetes is closely associated with failure
of the pancreas to respond with glucagon secretion properly not only to hyper- but also to hypoglycemia. The latter is not caused by loss of glucagon secreting α-cells, but instead is due to defects in glucagon counterregulation (GCR) signaling, through an unknown mechanism, which is generally recognized as a major barrier to safe treatment of diabetes (Cryer and Gerich, 1983; Gerich, 1988), since unopposed hypoglycemia can cause coma, seizures, or even death (Cryer, 1999, 2002, 2003).
Our recent experimental (Farhy et al., 2008) and mathematical modeling (Farhy and McCall, 2009; Farhy et al., 2008) results show that a novel understanding of the defects in the GCR control mechanisms can be gained if these are viewed as abnormalities of the network of intrapancreatic interactions that control glucagon secretion, rather than as defects in an isolated molecular interaction or pathway. In particular, we have demonstrated that in a β-cell-deficient rat model the GCR control mechanisms can be approximated by a simple feedback network (construct) of dose–response interactions between BG and the islet peptides. Within the framework of this construct, the defects of the GCR response to hypoglycemia can be explained by loss of rapid switch-off of β-cell signaling during hypoglycemia to trigger an immediate GCR response. These results support the "switch-off" hypothesis, which posits that α-cell activation during hypoglycemia requires both the availability and rapid decline of intraislet insulin (Banarer et al., 2002). They also extend this hypothesis by refocusing from the lack of endogenous insulin signaling to the α-cells as a sole mechanistic explanation and instead focusing on possible abnormalities in the general way by which the β-cells regulate the α-cells. In addition, the experimental and theoretical modeling data collected so far indicate that the GCR control network must have two key features: a (direct or indirect) feedback of glucagon secreting α-cells on themselves (auto-feedback) and a (direct or indirect) negative regulation of glucagon by BG. In our published model these two properties are mediated by δ-cell somatostatin, and we have shown that such connectivity adequately explains our [and others' (Zhou et al., 2004)] experimental data (Farhy and McCall, 2009; Farhy et al., 2008).
The construct we proposed recently (Farhy and McCall, 2009; Farhy et al., 2008) is suitable for the study and analysis of rodent physiology, but the explicit involvement of somatostatin limits its applicability to clinical studies, since in the human, pancreatic somatostatin cannot be reliably measured; therefore, the ability of the model to describe adequately the human physiology and its potential differences from rodent physiology cannot be verified. In the current work, we review our existing models and show that a control network in which somatostatin is not explicitly involved (but incorporated implicitly) can also adequately approximate the GCR control mechanisms. We confirm that the (new) construct can substitute for the older, more complex construct by verifying that it explains the same
experimental observations already shown to be reconstructed by the older network (Farhy and McCall, 2009; Farhy et al., 2008). We also demonstrate that the newer network can explain the regulation of the normal pancreas by BG and the gradual reduction in the GCR response to hypoglycemia during the transition from a normal to an insulin deficient state. As a result, the model of GCR regulation provides a more precise description of the components that are most critical for the system. This model can be applied to study the abnormalities in glucagon secretion and counterregulation and to identify hypothetical ways to repair these, not only in the rodent but also in the human.
2. Mechanisms of Glucagon Counterregulation (GCR) Dysregulation in Diabetes

Studies of tight BG control in types 1 and 2 diabetes to prevent chronic hyperglycemia-related complications have found a threefold excess of severe hypoglycemia (The Action to Control Cardiovascular Risk in Diabetes Study Group, 2008; The Diabetes Control and Complications Trial Research Group, 1993; The UK Prospective Diabetes Study Group, 1998). Hypoglycemia impairs quality of life and risks coma, seizures, accidents, brain injury, and death. Severe hypoglycemia is usually due to overtreatment against a background of delayed and deficient hormonal counterregulation. In health, GCR curbs dangerously low BG nadirs and stimulates quick recovery from hypoglycemia (Cryer and Gerich, 1983; Gerich, 1988). However, in type 1 (Fukuda et al., 1988; Gerich et al., 1973; Hoffman et al., 1994) and type 2 diabetes (Segel et al., 2002), the GCR is impaired by uncertain mechanisms, and if it is accompanied by a loss of epinephrine counterregulation it leads to severe hypoglycemia and thus presents a major barrier to safe treatment of diabetes (Cryer, 1999, 2002). Understanding the mechanisms that mediate GCR, its dysregulation, and how it can be repaired is therefore a major challenge in the struggle for safer treatment of diabetes.
Despite more than 30 years of research, the mechanism by which hypoglycemia stimulates GCR and how it is impaired in diabetes have yet to be elucidated (Gromada et al., 2007). First described by Gerich et al. (1973), defective GCR is common after about 10 years of T1DM. The loss of GCR appears to be more rapid with a very young age of onset and may occur within a few years after onset of T1DM. Although unproven, the appearance of defective GCR seems to parallel insulin secretory loss in these patients. The defect appears to be stimulus specific, since α-cells retain their ability to secrete glucagon in response to other stimuli, such as arginine (Gerich et al., 1973).
Three mechanisms have been proposed as a potential source for impairment of GCR. Those that account for the stimulus specificity of the
defect include impaired BG-sensing in α-cells (Gerich et al., 1973) and/or autonomic dysfunction (Hirsch and Shamoon, 1987; Taborsky et al., 1998). The "switch-off" hypothesis envisions that α-cell activation by hypoglycemia requires both the availability and rapid decline of intraislet insulin and attributes the defect in the GCR in insulin deficiency to loss of an (insulin) "switch-off" signal from the β-cells (Banarer et al., 2002). These theories are not mutually exclusive, but they all could be challenged. For example, α-cells do not express GLUT2 transporters (Heimberg et al., 1996), and it is unclear whether the α-cell GLUT1 transporters can account for the rapid α-cell response to variations in BG (Heimberg et al., 1995). In addition, proglucagon mRNA levels are not altered by BG (Dumonteil et al., 2000), and it is debatable whether BG variations in the physiological range can affect the α-cells (Pipeleers et al., 1985). The switch-off hypothesis can also be disputed, since in the α-cell-specific insulin receptor knockout mice the GCR response to hypoglycemia is preserved (Kawamori et al., 2009). Finally, the hypothesis for autonomic control contradicts evidence that blockade of epinephrine and acetylcholine actions did not reduce the GCR in humans (Hilsted et al., 1991), and that the denervated human pancreas still releases glucagon in response to hypoglycemia (Diem et al., 1990).
Recent in vivo experiments by Zhou et al. support the "switch-off" hypothesis. They have shown that, in STZ-treated rats, GCR is impaired, but can be restored if their deficiency in intraislet insulin is reestablished and decreased (switched off) during hypoglycemia (Zhou et al., 2004). Additional in vitro and in vivo evidence to support the switch-off hypothesis has been reported (Hope et al., 2004; Zhou et al., 2007a). Whether insulin is the trigger of GCR in the studies by Zhou et al. (2004, 2007a) has been challenged by results from the same group, in which zinc ions, not the insulin molecule itself, provided the switch-off signal to initiate glucagon secretion during hypoglycemia (Zhou et al., 2007b).
In view of the above background, the mechanisms that control the secretion of glucagon and their dysregulation in diabetes are not well understood. This lack of understanding prevents restoring GCR to normal in patients with diabetes and the development of treatments to effectively repair defective GCR to allow for a safer control of hyperglycemia. No such treatment currently exists.
3. Interdisciplinary Approach to Investigating the Defects in the GCR

The network underlying the GCR response to hypoglycemia includes hundreds of components from numerous pathways and targets in various pools and compartments. It would therefore be unfeasible to collect and
relate experimental data pertaining to all components of this network. Nevertheless, understanding the glucagon secretion control network is vital for furthering knowledge concerning the control of GCR, its compromise in diabetes, and developing treatment strategies. To address this problem, we have taken a minimal model approach in which the system is simplified by clustering all known and unknown factors into a small number of explicit components. Initially, these components were chosen with the goal to test whether recognized physiological relationships can explain key experimental findings. In our case, the first reports describing the in vivo enhancement of GCR by switch-off of insulin (Zhou et al., 2004) prompted us to propose a parsimonious model of the complex GCR control mechanisms including relationships between the α- and δ-cells, BG, and switch-off signals (below). According to these initial efforts (Farhy et al., 2008), the postulated network explains the switch-off phenomenon by interpreting the GCR as a rebound. It further predicts that: (i) in β-cell deficiency, multiple α-cell suppressing signals should enhance GCR if they are terminated during hypoglycemia, and (ii) the switch-off-triggered GCR must be pulsatile. The model-based predictions motivated a series of in vivo experiments, which showed that indeed, in STZ-treated male Wistar rats, intrapancreatic infusion of insulin and somatostatin followed by their switch-off during hypoglycemia enhances the pulsatile GCR response (Farhy et al., 2008). These experimental results confirmed that the proposed network is a good candidate for a model of the GCR control axis.
In addition to confirming the initial model predictions, our experiments also suggested some new features of the GCR control network, including indications that different α-cell suppressing switch-off signals not only can enhance GCR in β-cell deficiency but also that they do so via different mechanisms. For example, the results suggest a higher response to insulin switch-off and a more substantial suppression of glucagon by somatostatin (Farhy et al., 2008). To show that these observations are consistent with our network model, we had to extend it to reflect the assumption that the α-cell activity can be regulated differently by different α-cell suppressing signals. We showed that this assumption can explain the difference in the GCR-enhancing action of two α-cell-suppressing signals (Farhy and McCall, 2009). The simulations suggest strategies to use α-cell inhibitors to manipulate the network and repair defective GCR. However, they also indicate that not all α-cell inhibitors may be suitable for that purpose, and the infusion rate of those that are should be carefully selected. In this regard, a clinically verified and tested model of the GCR control axis can greatly enhance our ability to precisely and credibly simulate changes resulting from certain interventions and ultimately will assist us in defining the best strategy to manipulate the system in vivo in humans. However, the explicit involvement of somatostatin and the δ-cells in our initial network and model limits the potential for clinical applications, as pancreatic somatostatin cannot be
reliably measured in the human in vivo, and the ability of the model to describe the human glucagon axis cannot be verified. To address this limitation we have recently reduced our initial network into a Minimal Control Network (MCN) of the GCR control axis in which somatostatin and the δ-cells are no longer explicitly involved, but their effects are implicitly incorporated in the model. Our analysis (presented below) shows that the new MCN is an excellent model of the GCR axis and can substitute for the older, more complex structure. Thereby, we have developed a model that can be verified clinically and used to assist the analysis of the GCR axis in vivo in humans. Importantly, the new model is not limited to β-cell deficiency and hypoglycemia only. In fact, it describes the transition from a normal to a β-cell deficient state and can explain the failure of suppression of basal glucagon secretion in response to an increase in BG observed in this state. If it is confirmed experimentally that the MCN can successfully describe both the normal and β-cell deficient pancreas, future studies may focus on the defects of the pancreatic network not only in type 1 but also in type 2 diabetes, or more generally, in any pathophysiological condition that is accompanied by metabolic abnormalities of the endocrine pancreas.
4. Initial Qualitative Analysis of the GCR Control Axis

To understand the mechanisms of GCR and their dysregulation, the pancreatic peptides have been extensively studied, and much evidence suggests that a complex network of interacting pathways modulates glucagon secretion and GCR. Some of the well-documented relationships between different islet cell signals are summarized in the following subsections.
4.1. β-Cell inhibition of α-cells

Pancreatic perfusions with antibodies to insulin, somatostatin, and glucagon have suggested that the blood within the islets flows from β- to α- to δ-cells in dogs, rats, and humans (Samols and Stagner, 1988, 1990; Stagner et al., 1988, 1989). It was then proposed that insulin regulates glucagon, which in turn regulates somatostatin. Various β-cell signals provide an inhibitory stimulus to the α-cells and suppress glucagon. These include cosecreted insulin, zinc, GABA, and amylin (Gedulin et al., 1997; Gromada et al., 2007; Ishihara et al., 2003; Ito et al., 1995; Maruyama et al., 1984; Rorsman and Hellman, 1988; Rorsman et al., 1989; Samols and Stagner, 1988; Wendt et al., 2004; Xu et al., 2006). In particular, β-cells store and secrete GABA, which can diffuse to neighboring cells and bind to GABAA receptors, localized within the islets only on α-cells (Rorsman and Hellman, 1988; Wendt et al., 2004). Insulin can directly
suppress glucagon by binding to its own receptors (Kawamori et al., 2009) or to IGF-1 receptors on the a-cells (Van Schravendijk et al., 1987). Insulin also translocates and activates GABAA receptors on the a-cells, which leads to membrane hyperpolarization and, ultimately, suppresses glucagon. Hence, insulin may directly inhibit the a-cells and indirectly potentiate the effects of GABA (Xu et al., 2006). Infusion of amylin in rats inhibits arginine-stimulated glucagon (Gedulin et al., 1997) but not the GCR to hypoglycemia (Silvestre et al., 2001); similar results were found with the synthetic amylin analog pramlintide (Heise et al., 2004), even though in some studies hypoglycemia was increased, but it is unclear if this is a GCR effect or is related to a failure to reduce meal insulin adequately (McCall et al., 2006). Finally, a negative effect of zinc on glucagon has been proposed (Ishihara et al., 2003), including a role in the control of GCR (Zhou et al., 2007b). The role of zinc is unclear, as zinc ions do not suppress glucagon in the mouse (Ravier and Rutter, 2005).
4.2. d-Cell inhibition of a-cells

Exogenous somatostatin inhibits insulin and glucagon; however, the role of the endogenous hormone is controversial (Brunicardi et al., 2001, 2003; Cejvan et al., 2003; Gopel et al., 2000a; Klaff and Taborsky, 1987; Kleinman et al., 1994; Ludvigsen et al., 2004; Portela-Gomes et al., 2000; Schuit et al., 1989; Strowski et al., 2000; Sumida et al., 1994; Tirone et al., 2003). The concept that the d-cells are downstream of the b- and a-cells favors the perception that in vivo, intraislet somatostatin cannot directly suppress the a- or b-cell through the islet microcirculation (Samols and Stagner, 1988, 1990; Stagner et al., 1988, 1989). On the other hand, the pancreatic a- and b-cells express at least one of the somatostatin receptors (SSTR1-5) (Ludvigsen et al., 2004; Portela-Gomes et al., 2000; Strowski et al., 2000), and recent in vitro studies involving somatostatin immunoneutralization (Brunicardi et al., 2001) or the application of selective antagonists to different somatostatin receptors suggest that d-cell somatostatin inhibits the release of glucagon (Cejvan et al., 2003; Strowski et al., 2000). In addition, d-cells are in close proximity to a-cells in rat and human islets, and d-cell processes were observed to extend into a-cell clusters in rat islets (Kleinman et al., 1994, 1995). Therefore, somatostatin may act via existing common gap junctions or by diffusion through the islet interstitium.
4.3. a-Cell stimulation of d-cells

The ability of endogenous glucagon to stimulate d-cell somatostatin is supported by a study in which administration of glucagon antibodies in the perfused human pancreas resulted in inhibition of somatostatin release (Brunicardi et al., 2001). Earlier immunoneutralization perfusions of the
rat or dog pancreas also showed that glucagon stimulates somatostatin (Stagner et al., 1988, 1989). The glucagon receptor was found to colocalize with 11% of immunoreactive somatostatin cells (Kieffer et al., 1996), suggesting that the a-cells may directly regulate some of the d-cells. Exogenous glucagon also stimulates somatostatin (Brunicardi et al., 2001; Epstein et al., 1980; Kleinman et al., 1995; Utsumi et al., 1979). Finally, glutamate, which is cosecreted with glucagon under low-glucose conditions, stimulates somatostatin release from diencephalic neurons in primary culture (Tapia-Arancibia and Astier, 1988), and a similar relationship could exist in the islets of the pancreas.
4.4. Glucose stimulation of b- and d-cells

It is well established that hyperglycemia directly stimulates the b-cells, which react instantaneously to changes in BG (Ashcroft et al., 1994; Bell et al., 1996; Dunne et al., 1994; Schuit et al., 2001). Additionally, it has been proposed that d-cells have a glucose-sensing mechanism similar to that of the b-cells (Fujitani et al., 1996; Gopel et al., 2000a) and, consequently, that somatostatin release is increased in response to glucose stimulation (Efendic et al., 1978; Hermansen et al., 1979), possibly via a Ca2+-dependent mechanism (Hermansen et al., 1979).
4.5. Glucose inhibition of a-cells

Hyperglycemia has been proposed to inhibit glucagon, even though hypoglycemia alone appears insufficient to stimulate a high amplitude GCR (Gopel et al., 2000b; Heimberg et al., 1995, 1996; Reaven et al., 1987; Rorsman and Hellman, 1988; Schuit et al., 1997; Unger, 1985).

In addition to the above, mostly consensus, findings, which show that the a-cell activity is controlled by multiple intervening pathways, there is other indirect evidence suggesting that the dynamic relationships between the islet signals are important for the regulation of glucagon secretion and GCR. For example, the concept is supported by the pulsatility of the pancreatic hormones (Genter et al., 1998; Grapengiesser et al., 2006; Grimmichova et al., 2008), which implies feedback control (Farhy, 2004), and by results suggesting that: insulin and somatostatin pulses are in phase (Jaspan et al., 1986; Matthews et al., 1987); pulses of insulin and glucagon recur with a phase shift (Grapengiesser et al., 2006); pulses of somatostatin and glucagon appear in an antisynchronous fashion (Grapengiesser et al., 2006); and insulin pulses entrain a- and d-cell oscillations (Salehi et al., 2007). A pancreatic network consistent with these findings is shown in Fig. 21.1. It summarizes the (mostly consensus) interactions between BG and the b-, a-, and d-cells: somatostatin (or, more generally, the d-cells) is stimulated by glucagon (a-cells) and BG; glucagon (a-cells) is inhibited by the d-cells (by somatostatin) and by b-cell signals; and BG stimulates the b-cells. This network could easily explain the GCR response to hypoglycemia.
Figure 21.1 Schematic presentation of a network model of the GCR control mechanisms in STZ-treated rats.
Indeed, hypoglycemia would decrease both b- and d-cell activity, which would entail increased release of glucagon from the a-cells after the suppression by the neighboring b- and d-cells is removed. However, it is not apparent whether this network can explain the defect in GCR observed in b-cell deficiency or the above-mentioned restoration of defective GCR by a switch-off. This dampens the appeal of the network as a simple unifying hypothesis for the regulation of GCR and for the compromise of this regulation in diabetes. The difficulty in intuitively reconstructing the properties of the network emerges from the surprisingly complex behavior of this system due to the a-d-cell feedback loop. Shortly after the first reports describing the in vivo repair of GCR by intrapancreatic infusion and switch-off of insulin (Zhou et al., 2004), we applied mathematical modeling to analyze and reconstruct the GCR control network. These considerations demonstrated that the network in Fig. 21.1 can explain the switch-off effect (Farhy and McCall, 2009; Farhy et al., 2008). We have also presented experimental evidence to support these model predictions (Farhy et al., 2008). These efforts are described in the following section.
5. Mathematical Models of the GCR Control Mechanisms in STZ-Treated Rats

We have developed and validated (Farhy and McCall, 2009; Farhy et al., 2008) a mathematical model of the GCR control mechanisms in the b-cell deficient rat pancreas which explains two key experimental observations: (a) in STZ-treated rats, the rebound GCR triggered by a switch-off signal (a signal that is intrapancreatically infused and terminated during hypoglycemia) is pulsatile; and (b) the switch-off of either
somatostatin or insulin enhances the pulsatile GCR. The basis of this mathematical model is the network outlined in Fig. 21.1, which summarizes the major interactive mechanisms of glucagon secretion in b-cell deficiency through selected consensus interactions between plasma glucose, a-cell suppressing switch-off signals, a-cells, and d-cells. We note that the b-cells were part of the network proposed in Farhy et al. (2008), but not part of the corresponding mathematical model, which was designed to approximate the insulin deficient pancreas. In addition to explaining glucagon pulsatility during hypoglycemia and the switch-off responses mentioned above, this construct predicts each of the following experimental findings in diabetic STZ-treated rats:

(i) Glucagon pulsatility during hypoglycemia after a switch-off, with pulses recurring every 15-20 min, as suggested by the results of the pulsatility deconvolution analysis we have previously performed (Farhy et al., 2008);
(ii) A pronounced (almost fourfold increase over baseline) pulsatile glucagon response following a switch-off of either insulin or somatostatin during hypoglycemia (Farhy et al., 2008);
(iii) Restriction of the GCR enhancement by insulin switch-off under high BG conditions (Zhou et al., 2004);
(iv) Lack of a GCR response to hypoglycemia when there is no switch-off signal (Farhy et al., 2008);
(v) Suppression of GCR when insulin is infused into the pancreas but not switched off during hypoglycemia (Zhou et al., 2004);
(vi) A more than 30% higher GCR response to insulin vs somatostatin switch-off (Farhy et al., 2008);
(vii) Better glucagon suppression by somatostatin than by insulin before a switch-off (Farhy et al., 2008).

We note that in our prior study (Farhy et al., 2008) the comparisons between insulin and somatostatin switch-off in (vi) and (vii) were not significant. However, the difference in (vii) was close to significant at p = 0.07. Therefore, one of the goals of the latter study (Farhy and McCall, 2009) was to test in silico whether the differences of (vi) a higher GCR response to insulin switch-off and (vii) better glucagon suppression by somatostatin switch-off were likely and could be predicted by the model of the insulin-deficient pancreas (Fig. 21.1). To demonstrate the above predictions, we used dynamic network modeling and formalized the network shown in Fig. 21.1 as a system of nonlinear ordinary differential equations that approximate the glucagon and somatostatin concentration rates of change under the control of switch-off signals and BG. We were then able to adjust the model parameters to reconstruct the experimental findings listed in (i)-(vii), which validates the model based on the network shown in Fig. 21.1.
The model equations are:

\[ GL' = -k_{GL}\,GL + r_{basal}\,\frac{1}{1+I_1(t)} + r_{GL}\,\frac{1}{1+\left[SS(t-D_{SS})/t_{SS}\right]^{n_{SS}}}\cdot\frac{1}{1+I_2(t)} \tag{21.1} \]

\[ SS' = -k_{SS}\,SS + r_{SS}\,\frac{\left[GL(t-D_{GL})/t_{GL}\right]^{n_{GL}}}{1+\left[GL(t-D_{GL})/t_{GL}\right]^{n_{GL}}} + b_{SS}\,\frac{\left[BG(t)/t_{BG}\right]^{n_{BG}}}{1+\left[BG(t)/t_{BG}\right]^{n_{BG}}} \tag{21.2} \]
Here, GL(t), SS(t), BG(t), I1(t), and I2(t) denote the concentrations of glucagon, somatostatin, blood glucose, and the exogenous switch-off signal(s) [acting on the pulsatile and/or the basal glucagon secretion], respectively; the derivative is with respect to time t. The meaning of the remaining parameters is explained in the following section. We note that the presence of two terms, I1(t) and I2(t), representing the switch-off signal in Eq. (21.1) reflects the assumption that different switch-off signals may have a different impact on glucagon secretion and may suppress differently the basal and/or the d-cell-regulated a-cell release. We have used the above model (Farhy and McCall, 2009) to show that the glucagon control axis postulated in Fig. 21.1 is consistent with the experimental findings (i)-(vii) above, and we showed that insulin and somatostatin differently affect the basal and the system-regulated a-cell activity. After the model was validated, we used it to predict the outcome of different switch-off strategies and to explore their potential to improve GCR in b-cell deficiency (Fig. 21.2; Farhy and McCall, 2009). The figure summarizes results from in silico experiments tracking the dynamics of glucagon from time t = 0 h (start) to t = 4 h (end). In some simulations, intrapancreatic infusion of insulin or somatostatin started at t = 0.5 h and was either continued to the end or switched off at t = 2.5 h. When hypoglycemia was simulated, BG = 110 mg/dL from t = 0 h to t = 2 h, the glucose decline started at t = 2 h, BG = 60 mg/dL at t = 2.5 h (the switch-off point), and at the end of the simulations (t = 4 h) BG = 43 mg/dL. At the top of the bar graph (a), we show baseline results without switch-off signals. The black bar illustrates the glucagon level before t = 2 h, which is the time when BG = 110 mg/dL and glucagon would be maximally suppressed if a switch-off signal were present. The white and the gray bars illustrate the maximal glucagon response in the 1 h interval from t = 2.5 h to t = 3.5 h without (white) and with (gray) a hypoglycemic stimulus. This interval corresponds to the 1 h interval after a switch-off in all other simulations. The black and white bars are the same, since glucagon levels remain unchanged if there is no hypoglycemia.
Figure 21.2 Summary of the model-predicted GCR responses to different switch-off signals with or without simulated hypoglycemia (see text for more detail). SO, switch-off; no SO, the signal was not switched off; SS, somatostatin; INS, insulin. The bars show glucagon suppressed by the intrapancreatic signal(s), the GCR response to switch-off without hypoglycemia, and the GCR response to switch-off with hypoglycemia. Modified from Farhy and McCall (2009).
Each subsequent set of three bars indicates these effects with a single switch-off [(b) and (c)], a combined switch-off (d), no switch-off of a single signal [(e) and (f)], a mixture of switch-off and no switch-off for the two signals [(g) and (h)], and no switch-off for the combination of the two signals (i). Thus, the bar graph gives the following glucagon concentrations: glucagon suppressed by the intrapancreatic signal (black bars: the glucagon concentration immediately before the onset of the BG decline at t = 2 h; at that time glucagon is maximally suppressed by the intrapancreatic infusion and not affected by the decline in glucose); the GCR response to a switch-off if hypoglycemia was not induced (white bars: the maximal glucagon concentrations achieved within a 1 h interval after the switch-off); and the GCR response if hypoglycemia was induced (gray bars: the maximal glucagon concentrations achieved within a 1 h interval after the switch-off). The graph also includes the maximal fold increase in glucagon in response to a switch-off during hypoglycemia relative to the glucagon levels before the onset of the BG decline. Thus, we concluded that the impact of an a-cell inhibitor on the GCR depends on the nature of the signal and the mode of its delivery. These comparisons between strategies of manipulating the network to enhance the GCR by a switch-off revealed a good potential of a combined switch-off to
amplify the benefits provided by each of the individual signals (Farhy and McCall, 2009) and even a potential to explore scenarios in which the a-cell suppressing signal is not terminated.
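Although the published analysis was carried out with the authors' own tools, the structure of Eqs. (21.1) and (21.2) is straightforward to reproduce. The Python sketch below integrates the two delay differential equations with a fixed-step Euler scheme and history buffers for the delayed terms; all parameter values, the BG profile, and the switch-off windows are illustrative placeholders chosen by us, not the fitted values of Farhy et al. (2008).

    import numpy as np

    # Illustrative integration of Eqs. (21.1)-(21.2). Parameter values are
    # placeholders, not the fitted values from Farhy et al. (2008).
    dt = 0.001                       # integration step [h]
    T = 5.0                          # total simulated time [h]
    n = int(T / dt)
    t = np.linspace(0.0, T, n)

    k_GL, k_SS = 22.0, 20.0          # elimination rates [1/h]
    r_basal, r_GL = 2000.0, 40000.0  # basal and SS-gated glucagon release rates
    r_SS, b_SS = 500.0, 500.0        # GL-driven and BG-driven SS release rates
    t_SS, n_SS = 20.0, 4.0           # ID50 and slope of SS -> GL inhibition
    t_GL, n_GL = 85.0, 5.0           # ED50 and slope of GL -> SS stimulation
    t_BG, n_BG = 50.0, 5.0           # ED50 and slope of BG -> SS stimulation
    D_SS, D_GL = 0.10, 0.12          # signaling delays [h]

    # Inputs: BG declines from 110 to 60 mg/dL between t = 2 and 2.5 h;
    # the switch-off signals act on both release terms from 0.5 to 2.5 h.
    BG = np.where(t < 2.0, 110.0, np.maximum(60.0, 110.0 - 100.0 * (t - 2.0)))
    I1 = np.where((t >= 0.5) & (t < 2.5), 1.0, 0.0)   # acts on basal release
    I2 = np.where((t >= 0.5) & (t < 2.5), 1.0, 0.0)   # acts on SS-gated release

    GL = np.full(n, 50.0)            # glucagon trajectory [pg/mL]
    SS = np.full(n, 10.0)            # somatostatin trajectory
    lag_SS, lag_GL = int(D_SS / dt), int(D_GL / dt)

    for i in range(1, n):
        SS_del = SS[max(i - 1 - lag_SS, 0)]           # SS(t - D_SS)
        GL_del = GL[max(i - 1 - lag_GL, 0)]           # GL(t - D_GL)
        dGL = (-k_GL * GL[i - 1]
               + r_basal / (1.0 + I1[i - 1])
               + r_GL / (1.0 + (SS_del / t_SS) ** n_SS) / (1.0 + I2[i - 1]))
        h_GL = (GL_del / t_GL) ** n_GL                # Hill term for GL -> SS
        h_BG = (BG[i - 1] / t_BG) ** n_BG             # Hill term for BG -> SS
        dSS = (-k_SS * SS[i - 1]
               + r_SS * h_GL / (1.0 + h_GL)
               + b_SS * h_BG / (1.0 + h_BG))
        GL[i] = GL[i - 1] + dt * dGL
        SS[i] = SS[i - 1] + dt * dSS

Because both suppression factors are released at t = 2.5 h, the glucagon release terms recover and GL rebounds; reproducing the published figures quantitatively would require the original fitted parameters.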
6. Approximation of the Normal Endocrine Pancreas by a Minimal Control Network (MCN) and Analysis of the GCR Abnormalities in the Insulin Deficient Pancreas

The explicit involvement of somatostatin in the model described above limits its potential clinical application, as pancreatic somatostatin cannot be reliably measured in the human in vivo and the ability of the model to describe the human glucagon axis cannot be verified. It is, however, possible to simplify the network so that somatostatin is no longer explicitly involved but is incorporated implicitly. In the original model shown in Fig. 21.1, somatostatin appears in two compound pathways: the "a-cell -> d-cell -> a-cell" feedback loop and the "BG -> d-cell -> a-cell" pathway. By virtue of its interactions in the "a-cell -> d-cell -> a-cell" pathway, the a-cells effectively control their own activity, and therefore this pathway can be replaced by a delayed "a-cell -> a-cell" auto-feedback loop. Such regulation is also consistent with reports that glucagon may directly suppress its own release (Kawai and Unger, 1982), possibly by binding to glucagon receptors located on a subpopulation of the a-cells (Kieffer et al., 1996) or by other autocrine mechanisms. Through the "BG -> d-cell -> a-cell" pathway, blood glucose downregulates the release of glucagon, with the action mediated by somatostatin. Therefore, this pathway can be simplified and substituted by a direct BG -> a-cell interaction. The outcome of the described network reduction is a new Minimal Control Network (MCN) of the GCR control mechanisms in which somatostatin and the d-cells are no longer explicitly involved (Fig. 21.3). As originally proposed in our prior work (Farhy et al., 2008), the b-cells of the normal pancreas are now part of the MCN (and of the mathematical model). This feature also extends the physiological relevance of the model. The b-cells are assumed to be stimulated by hyperglycemia and to suppress the activity of the a-cells. The latter action is based on extensive data showing that the b-cells (co)release a variety of signals, including insulin, GABA, zinc, and amylin, all of which are known to suppress a-cell activity (Gedulin et al., 1997; Ishihara et al., 2003; Ito et al., 1995; Reaven et al., 1987; Rorsman and Hellman, 1988; Samols and Stagner, 1988; Van Schravendijk et al., 1987; Wendt et al., 2004; Xu et al., 2006). In addition, it has been reported that the pulses of insulin and glucagon recur with a phase shift (Grapengiesser et al., 2006), which is consistent with the postulated negative regulation of the a-cells by the b-cells.
Figure 21.3 A Minimal Control Network (MCN) of the interactions between BG and the a- and b-cells postulated to regulate the GCR in the normal pancreas. In this network the d-cells are not represented explicitly.
An extensive background justifying all postulated MCN relationships was presented in Section 4.
6.1. Dynamic network approximation of the MCN

Similar to the analysis of the old network, dynamic network modeling methods are used to study the properties of the MCN shown in Fig. 21.3. In particular, two differential equations approximate the glucagon and insulin concentration rates of change:

\[ GL' = -k_{GL}\,GL + r_{GL,basal}\,\frac{t_{INS}}{t_{INS}+INS} + r_{GL}\,\frac{1}{1+(BG/t_{BG})^{n_{BG}}}\cdot\frac{1}{1+\left[GL(t-D_{GL})/t_{GL}\right]^{n_{GL}}}\cdot\frac{t_{INS}}{t_{INS}+INS} \tag{21.3} \]

\[ INS' = -k_{INS}\,INS + r_{INS}\,\frac{(BG/t_{BG,2})^{n_{BG,2}}}{1+(BG/t_{BG,2})^{n_{BG,2}}} + r_{INS,basal}\cdot\mathrm{Pulse} \tag{21.4} \]
Here, GL(t), BG(t), and INS(t) denote the time-dependent concentrations of glucagon, blood glucose, and insulin (or of the exogenous switch-off signal in the b-cell-deficient model), respectively; the derivative is the rate of change with respect to time t. The term Pulse in Eq. (21.4) denotes a pulse generator specific to the b-cells, superimposed to guarantee the physiological relevance of the simulations. The meaning of the parameters is defined as follows: kGL and kINS are the rates of elimination of glucagon and insulin, respectively; rGL is the BG- and auto-feedback-regulated rate of release of glucagon;
rGL,basal is the basal rate of glucagon release; rINS is the BG-regulated rate of release of insulin; rINS,basal is the basal rate of insulin release; tINS is the half-maximal inhibitory dose (ID50) for the negative action of insulin on glucagon; tBG and tBG,2 are the half-maximal doses (ID50/ED50) for BG; tGL is the half-maximal inhibitory dose (ID50) for glucagon; nBG, nBG,2, and nGL are Hill coefficients describing the slopes of the corresponding dose-response interactions; and DGL is the delay in the auto-feedback.
6.2. Determination of the model parameters

The half-life (t1/2) of glucagon was assumed to be 2 min to match the results of our pulsatility analysis (Farhy et al., 2008) and other published data. Therefore, we fixed the parameter kGL = 22 h⁻¹. The half-life of insulin was assumed to be 3 min, as suggested in the literature (Grimmichova et al., 2008). Therefore, to approximate insulin's t1/2, we fixed the parameter kINS = 14 h⁻¹. The remaining parameters used in the simulations were determined functionally, and some of the concentrations presented below are in arbitrary units (specifically, those related to insulin). These units, however, can easily be rescaled to match real concentrations. The delay in the auto-feedback, DGL = 7.2 min, was functionally determined, together with the potencies tBG = 50 mg/dL and tGL = 6 pg/mL and the sensitivities nBG = 5 and nGL = 5 in the auto-feedback control function, to guarantee that glucagon pulses during GCR recur at intervals of 15-20 min, corresponding to the number of pulses after a switch-off point detected in the pulsatility analysis (Farhy et al., 2008). The parameters rINS = 80,000 and rINS,basal = 270, together with the amplitude of the pulses of the pulse generator and the parameters tBG,2 = 400 mg/dL and nBG,2 = 3, were functionally determined to guarantee that BG is capable of stimulating a more than ninefold increase in insulin over baseline in response to a glucose bolus. The ID50, tINS = 20, was functionally determined based on the insulin concentrations to guarantee that insulin withdrawal during hypoglycemia can trigger GCR. The glucagon release rate (rGL = 42,750 pg/mL/h) and basal secretion rate (rGL,basal = 2,128 pg/mL/h) were functionally determined so that a strong hypoglycemic stimulus can trigger a more than 10-fold increase in glucagon from the normal pancreas. The parameters of the pulse generator, Pulse, were chosen to generate every 6 min a square wave of height = 10 lasting 36 s, based on published reports of insulin pulses recurring every 4-12 min (Pørksen, 2002). We note that insulin pulsatility was modeled to mimic the variation of insulin in the portal vein rather than in the circulation. This explains the deep nadirs between the pulses evident in the simulations. The parameter values of the model are summarized in Table 21.1.
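As a consistency check (our arithmetic; the text does not show the conversion), these rate constants follow from the usual first-order relation between elimination rate and half-life:

\[ k = \frac{\ln 2}{t_{1/2}}, \qquad k_{GL} \approx \frac{0.693}{2/60\ \mathrm{h}} \approx 21\ \mathrm{h}^{-1}, \qquad k_{INS} \approx \frac{0.693}{3/60\ \mathrm{h}} \approx 14\ \mathrm{h}^{-1}, \]

in agreement, after rounding, with the fixed values of 22 h⁻¹ and 14 h⁻¹.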
Table 21.1 Summary of core interactive constants in the auto-feedback MCN

Glucagon: elimination rate kGL = 22 h⁻¹; release rates rGL = 42,570 pg/mL/h and rGL,basal = 2,128 pg/mL/h; ED50/ID50 tGL = 85 pg/mL; slope nGL = 5; delay DGL = 7.2 min.
BG: ED50/ID50 tBG = 50 mg/dL and tBG,2 = 400 mg/dL; slopes nBG = 5 and nBG,2 = 3.
Insulin: elimination rate kINS = 14 h⁻¹; release rates rINS = 80,000 and rINS,basal = 270; ID50 tINS = 20.
Pulse: periodic function; a square wave of height = 10 over a period of 36 s recurring every 6 min.
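The constants in Table 21.1 translate directly into code. The sketch below (ours, not the authors' Berkeley-Madonna implementation) evaluates the right-hand sides of Eqs. (21.3) and (21.4); the caller is expected to supply GL(t - DGL) from a stored trajectory, and tINS, rINS, and rINS,basal are in the arbitrary insulin units used in the text.

    # Constants from Table 21.1 (delay converted to hours; insulin-related
    # quantities are in the arbitrary units used in the text).
    P = dict(k_GL=22.0, k_INS=14.0,
             r_GL=42570.0, r_GL_basal=2128.0,
             t_GL=85.0, n_GL=5.0, D_GL=7.2 / 60.0,
             t_BG=50.0, n_BG=5.0,
             t_BG2=400.0, n_BG2=3.0,
             r_INS=80000.0, r_INS_basal=270.0, t_INS=20.0)

    def pulse(t_h, height=10.0, period_min=6.0, width_s=36.0):
        """b-cell pulse generator: a square wave of the stated height for
        36 s out of every 6 min (our literal reading of the description)."""
        return height if (t_h * 60.0) % period_min < width_s / 60.0 else 0.0

    def mcn_rhs(t_h, GL, INS, GL_delayed, BG):
        """Right-hand sides of Eqs. (21.3) and (21.4)."""
        ins_block = P['t_INS'] / (P['t_INS'] + INS)                   # insulin inhibition
        bg_block = 1.0 / (1.0 + (BG / P['t_BG']) ** P['n_BG'])        # BG inhibition
        auto = 1.0 / (1.0 + (GL_delayed / P['t_GL']) ** P['n_GL'])    # delayed auto-feedback
        dGL = (-P['k_GL'] * GL
               + P['r_GL_basal'] * ins_block
               + P['r_GL'] * bg_block * auto * ins_block)
        hill = (BG / P['t_BG2']) ** P['n_BG2']
        dINS = (-P['k_INS'] * INS
                + P['r_INS'] * hill / (1.0 + hill)
                + P['r_INS_basal'] * pulse(t_h))
        return dGL, dINS

Coupled to any fixed-step integrator with a buffer for the 7.2 min delay, this should reproduce the qualitative behavior described in the simulations below.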
6.3. In silico experiments

The simulations were performed as follows:

Simulation of the glucose input to the system. We performed two different simulations to mimic hypoglycemia: (a) a BG decline from 110 to 60 mg/dL in 1 h, and (b) a stepwise (1 h steps) decline in BG from 110 to 60 (as in (a)), then to 45, and then to 42 mg/dL. The stepwise decline into hypoglycemia is intended to investigate a possible distinction between the model responses to 60 mg/dL (a) and to a stronger hypoglycemic stimulus (b); it also mimics a commonly employed human experimental condition (the staircase hypoglycemic clamp). To generate glucose profiles that satisfy (a) and (b), we used the equation BG' = -3 BG + 3 step + 330, where the function step changes from 110 to 60, 45, and 42 mg/dL at 1-h steps. We then used the solution of this equation in Eqs. (21.3) and (21.4). Similarly, an increase of glucose was simulated by using the above equation with a step function that increases the BG levels from 110 to 240 mg/dL to mimic acute hyperglycemia.

Transition from a normal to an insulin deficient state. The simulation was performed by gradually reducing to zero the amplitude of the pulses generated by the pulse generator, Pulse.

Simulation of intrapancreatic infusion of different a-cell suppressing signals. These simulations were performed in the insulin deficient model. Equation (21.4) is replaced by an equation which describes the dynamics of the infused signal:
\[ SO' = -k_{SO}\,SO + \mathrm{Infusion}(t) \]

Here, SO represents the concentration of the switch-off signal (an a-cell suppressing signal that is infused and then abruptly terminated). The function Infusion describes the rate of its intrapancreatic infusion (equal to Height if the signal is infused, or to 0 otherwise), and kSO is its (functional) rate of elimination. The terms (1 + m1·SO) and (1 + m2·SO) are then used in Eq. (21.3) to divide the parameters rGL and rGL,basal, respectively, to simulate the suppression of a-cell activity by the signal. Differences between the parameters m1 and m2 model unequal action of the infused signal on the basal and the BG/auto-feedback-regulated glucagon secretion. In particular, to simulate an insulin switch-off we used the parameters kSO = 3, Height = 55, m1 = 0.08, and m2 = 0.5; to simulate a somatostatin switch-off we used kSO = 3.5, Height = 10, m1 = 1, and m2 = 1.4. The parameters were functionally determined to explain our experimental observations (below) and the possible differences in the response to the two types of switch-offs (Farhy et al., 2008). In particular, the action of exogenous insulin on the BG/auto-feedback-regulated and the basal glucagon secretion is distributed in a 1:6.3 ratio. Similar to our previous work (Farhy and McCall, 2009), exogenous insulin suppresses the basal more
than the pulsatile glucagon release; for somatostatin, the suppressive effect is more uniform, in a 1:1.4 ratio.
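In code, the switch-off experiment amounts to one extra state variable and two modified release terms. The sketch below uses the insulin switch-off constants quoted above (kSO = 3, Height = 55, m1 = 0.08, m2 = 0.5); the infusion window is an illustrative choice of ours.

    def infusion(t_h, start=0.5, stop=2.0, height=55.0):
        """Intrapancreatic infusion rate: Height while infusing, 0 after the
        switch-off (window chosen for illustration)."""
        return height if start <= t_h < stop else 0.0

    def so_rhs(t_h, SO, k_SO=3.0):
        """SO' = -k_SO * SO + Infusion(t)."""
        return -k_SO * SO + infusion(t_h)

    def suppressed_release(r_GL, r_GL_basal, SO, m1=0.08, m2=0.5):
        """Divide the two glucagon release terms of Eq. (21.3) by
        (1 + m1*SO) and (1 + m2*SO), respectively."""
        return r_GL / (1.0 + m1 * SO), r_GL_basal / (1.0 + m2 * SO)

For a somatostatin switch-off, the constants become kSO = 3.5, Height = 10, m1 = 1, and m2 = 1.4, per the text.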
6.4. Validation of the MCN

To validate the new network we performed an in silico study in three steps:
1. Demonstrate that the new MCN (Fig. 21.3) is compatible with the mechanism of GCR and the response to switch-off signals in insulin deficiency. We have already shown that our original network, which includes somatostatin as an explicit node, is consistent with key experimental data. To confirm that the new MCN can substitute for the older, more complex construct, we tested the hypothesis that it can approximate the same key experimental observations [(i) through (vii) listed at the beginning of Section 5] already shown to be predicted by the old network (Fig. 21.1).

2. Show that the mechanisms underlying the dysregulation of GCR in insulin deficiency can be explained by the MCN. To this end we demonstrated that the BG-regulated MCN can explain (i) a high GCR response if the b-cells are intact and provide a potent switch-off signal to the a-cells; and (ii) a reduction of GCR following a simulated gradual decrease in insulin secretion to mimic the transition from normal physiology to an insulinopenic state.

3. Verify that the proposed MCN approximates the basic properties of the normal endocrine pancreas. Even though our primary goal is to explain the GCR control mechanisms and their dysregulation, we have demonstrated that the postulated MCN can explain the increase in insulin secretion and the decrease in glucagon release in response to BG stimulation.

The goal of this in silico study is to validate the MCN by demonstrating that the parameters of the mathematical model (Eqs. (21.3) and (21.4)) that approximates the MCN (Fig. 21.3) can be determined in a way that the output of the model predicts certain general features of the in vivo system. Therefore, the simulated profiles are expected to reproduce the overall behavior of the system rather than to match exactly the experimentally observed individual hormone dynamics. To integrate the equations we used a Runge-Kutta 4 algorithm and its specific implementation within the software package Berkeley-Madonna.
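For completeness, a generic classical fourth-order Runge-Kutta step for a system y' = f(t, y) is sketched below (Berkeley-Madonna's internal implementation is not described in the text); for the delayed term, GL(t - DGL) is read from the stored trajectory and held within each step.

    def rk4_step(f, t, y, h):
        """One classical RK4 step for y' = f(t, y); y may be a float
        or a NumPy array."""
        k1 = f(t, y)
        k2 = f(t + h / 2.0, y + h / 2.0 * k1)
        k3 = f(t + h / 2.0, y + h / 2.0 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)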
6.5. In silico experiments with simulated complete insulin deficiency

We demonstrate that the proposed MCN model, which has changed significantly since it was initially introduced (Farhy and McCall, 2009; Farhy et al., 2008), is consistent with the experimental observations in STZ-treated rats reported by us and others (Farhy et al., 2008; Zhou et al., 2004).
6.6. Defective GCR response to hypoglycemia in the absence of a switch-off signal in the insulin deficient model

The plot in Fig. 21.4 (bottom left panel) shows the predicted lack of glucagon response to hypoglycemia if a switch-off signal is missing, a key observation reported in our experimental study (Farhy et al., 2008) and elsewhere (Zhou et al., 2004, 2007a,b). The system responds with only about a 30% increase in the pulse amplitude of glucagon in the 45 min interval after BG reaches 60 mg/dL, which agrees with our experimental observations (Fig. 21.4, top panels) and shows that the model satisfies condition (iv) from Section 5 (no GCR response to hypoglycemia without a switch-off signal).
Figure 21.4 The mean observed (top) and model-predicted (bottom) glucagon response to hypoglycemia and saline switch-off or no switch-off (left), insulin switch-off (middle), and somatostatin switch-off (right). The shaded area marks the period monitored in our experimental study. The simulations were performed with complete insulin deficiency.
6.7. GCR response to switch-off signals in insulin deficiency

The model response to a 1.5 h intrapancreatic infusion of insulin or somatostatin switched off at hypoglycemia (BG = 60 mg/dL) is shown in the bottom middle and right panels of Fig. 21.4. The infusion was initiated at time t = 0.5 h (arbitrary time units) and switched off at t = 2 h. A simulated gradual BG decline started at t = 1 h, with BG = 60 mg/dL at the switch-off point. The model response illustrates a pulsatile rebound glucagon secretion after the switch-off, reaching almost a fourfold increase in glucagon in the 45 min period after the switch-off as compared to the pre-switch-off levels, which is similar to the experimental observations (Fig. 21.4, top middle and right panels). Therefore, the model satisfies conditions (i) (the pulsatility timing) and (ii) (the pulsatility amplitude increase) from Section 5 with regard to insulin and somatostatin switch-off. In addition, the bottom middle and right panels of Fig. 21.4 demonstrate that the model satisfies conditions (vi) (a more than 30% higher GCR response to insulin vs somatostatin switch-off) and (vii) (better glucagon suppression by somatostatin before a switch-off compared with suppression by insulin) from Section 5. Of interest, the prediction that an insulin switch-off signal suppresses the basal more potently than the pulsatile glucagon release is similar to the prediction of the previous model (Farhy and McCall, 2009), and it is necessary to explain the difference between the insulin switch-off and somatostatin switch-off (Fig. 21.4, middle vs right bottom panels). Note that the experimental data shown in the top panels of Fig. 21.4 were collected during our previous experimental study (Farhy et al., 2008). The pulsatility of glucagon is not apparent in the plots presented in Fig. 21.4 since they reflect averaged experimental data (n = 7 in the saline group and n = 6 in the insulin and somatostatin switch-off groups). In Farhy et al. (2008), glucagon pulsatility was confirmed on the individual profiles of glucagon measured in the circulation by deconvolution analysis, and the current simulations, which approximate the dynamics of glucagon in the portal circulation, agree well with these results.
6.8. Reduction of the GCR response by high glucose conditions during the switch-off or by failure to terminate the intrapancreatic signal

For comparison, Fig. 21.5 depicts the GCR response if an insulin signal was infused and switched off but hypoglycemia was not present (top panel), or if intrapancreatic insulin was infused but not switched off during hypoglycemia (bottom panel). In the first simulation glucagon increases by only 60 pg/mL relative to the concentration at the switch-off point, and in the second simulation the GCR response is reduced approximately twofold compared to the response depicted in Fig. 21.4 (bottom middle panel).
Figure 21.5 Model-predicted minimal absolute glucagon response to insulin switch-off if the intrapancreatic signal (black bar) is terminated during euglycemia (top panel: glucagon increases minimally, by only 60 pg/mL over the concentration at the switch-off point) and to intrapancreatic insulin infusion if the signal is not switched off (bottom panel: glucagon increases by only 85 pg/mL over the concentration at the time when BG = 60 mg/dL, and only about twofold relative to baseline); these values are by contrast increased more than 3.5-fold when the switch-off occurs (see Fig. 21.4, bottom middle panel). All of these simulations were performed with complete insulin deficiency.
This result agrees with the observations reported in Zhou et al. (2004), which demonstrate a lack of significant increase in glucagon in this 1 h interval if insulin is not switched off. In an additional analysis (results not shown), we increased the simulated rate of infusion of the insulin switch-off signal fourfold, by increasing the parameter Height from 55 to 220 (see Section 6.3), and used a stronger hypoglycemic stimulus (40 mg/dL). The model responded with an increase in glucagon after the switch-off, reaching concentrations above 800 pg/mL in the 1 h interval after the switch-off point. When the same signal was not terminated in this simulation, the response was restricted to a rise to only 180 pg/mL. This outcome reproduces more closely the observations in Zhou et al. (2004). Thus, the model satisfies conditions (iii) (restriction of the response to an insulin switch-off by high BG conditions) and (v) (absence of a pronounced GCR when no insulin switch-off is performed), as detailed in Section 5.
6.9. Simulated transition from a normal physiology to an insulinopenic state

One set of simulations was performed to evaluate the model-generated glucagon response to a stepwise BG decline into hypoglycemia with a normal and with an insulin-deficient pancreas. The response of the normal model shown in Fig. 21.6 (top panel) illustrates a pronounced glucagon response to hypoglycemia (about a fourfold increase when BG = 60 mg/dL and about a 14-fold increase over baseline when BG approaches 42.5 mg/dL). Of interest, the model predicts that when BG starts to fall, the high-frequency glucagon pulsatility entrained during the basal period by the insulin pulses will be replaced by low-frequency oscillations maintained by the a-cell auto-feedback. The model also predicts that a complete absence of BG-stimulated and basal insulin release will result in the following abnormalities in glucagon secretion and in the response to hypoglycemia (Fig. 21.6, bottom panel):
- A significant reduction in the fold glucagon response to hypoglycemia relative to baseline (only about a 1.3-fold increase when BG = 60 mg/dL and only about a threefold increase when BG approaches 42 mg/dL).
- A reduction in the absolute glucagon response to hypoglycemia (a 15% lower response when BG = 60 mg/dL and a 42% lower response when BG approaches 42.5 mg/dL).
- A delay in the GCR response (BG remains below 60 mg/dL for more than 1 h without any sizable change in glucagon).
- A 2.5-fold increase in basal glucagon.
- Disappearance of the insulin-driven high-frequency glucagon pulsatility.

Figure 21.6 Model-derived glucagon response to hypoglycemia (stepwise BG decline) in normal physiology with intact insulin release (top) and the predicted decrease and delay in GCR following a simulated removal of insulin secretion to mimic a transition from a normal to an insulin deficient state (bottom).

A comparison between the model response to hypoglycemia when BG remains at 60 mg/dL (Fig. 21.4, lower left panel) and when it falls further to about 42.5 mg/dL in the staircase hypoglycemic clamp (Fig. 21.6, bottom panel) reveals the interesting model prediction that a sufficiently strong hypoglycemic stimulus may still evoke some delayed glucagon release. However, additional analysis (results not shown) disclosed that if the basal glucagon release (model parameter rGL,basal) is 15-20% higher, this response
will be completely suppressed. Therefore, the model predicts that GCR abnormalities may be due both to the lack of an appropriate switch-off signal and to a significant basal hyperglucagonemia. The same simulations were also performed under the assumption that BG declines only to 60 mg/dL and remains at that level, similar to the experiments depicted in the lower panels of Fig. 21.4 (results not shown). We found that the glucagon pulses released by the normal pancreas were about 47% lower, which stresses the importance of the strength of the hypoglycemic stimulus for the magnitude of the GCR response. Under conditions of complete absence of insulin, the weaker hypoglycemic stimulus evokes practically no response (this outcome has already been shown in Fig. 21.4, lower left panel), and the concentration of glucagon was 57% lower than the response stimulated by the stepwise decline (Fig. 21.6, bottom panel). A second set of simulations was designed to test the hypothesis that the MCN model can correctly predict the typical increase in insulin secretion and decrease in glucagon following an increase in BG. We also monitored how these two system responses change during a transition from normal physiology to an insulinopenic state. To this end, an increase in BG was simulated (see Section 6.3) with an elevation of the BG concentration from 110 to about 240 mg/dL in 1 h, followed by a return to normal over the next 1.5 h. The model-predicted response of the normal pancreas is shown in the top panel of Fig. 21.7. In this simulation the BG-driven release of insulin increased almost ninefold, which caused a significant suppression of glucagon release. The bottom plot in Fig. 21.7 illustrates the effect on the system response of a 100% reduction in BG-stimulated insulin release. As expected, insulin deficiency results in an increase of glucagon and a limited ability of hyperglycemia to suppress glucagon (Meier et al., 2006).
7. Advantages and Limitations of the Interdisciplinary Approach

A key conclusion of our model-based simulations is that some of the observed system behavior (such as the system response to a switch-off) emerges from the interplay between multiple components. Models like the networks in Figs. 21.1 and 21.3 are certainly not uncommon in endocrine research and typically exemplify regulatory hypotheses. Traditionally, such models are studied using methods that probe individual components or interactions in isolation from the rest of the system. This approach has been taken in the majority of the published studies that investigate GCR regulation (see Section 4). The limitation of this approach is that the temporal relationships between the system components and the relative contribution of each interaction to the overall system behavior cannot be properly assessed.
Figure 21.7 Simulated progressive decline of the ability of glucose to suppress glucagon resulting from a gradual transition (same as in Fig. 21.6) from a normal physiology (top) to an insulinopenic state (bottom).
Therefore, especially when the model contains feedbacks, the component-by-component approach cannot answer the question of whether the model explains the system control mechanisms. The main reason for this limitation is that some key specifics of the system behavior, such as its capability to oscillate and to respond with a rebound to a switch-off, both require and result from the time-varying interactions of several components. If these are studied in isolation, little information will be gained about the dynamic behavior of this network-like mechanism. Numerous reports have documented that the glucagon control axis is indeed a complex network-like structure, and it therefore lends itself to an analysis of its complex dynamic behavior. This highlights both the significance and the necessity of the mathematical methods that we propose to use to analyze the experimental data. Differential equations-based modeling is perhaps the only way to estimate the dynamic interplay of the pancreatic hormones and their importance for GCR control.
Mathematical models have not previously been applied to study the GCR control mechanisms, but they have been used to explore other aspects of the control of BG homeostasis (Guyton et al., 1978; Insel et al., 1975; Steele et al., 1974; Yamasaki et al., 1984). For example, the minimal model of Bergman and colleagues, proposed in 1979 for estimating insulin sensitivity (Bergman et al., 1979), received considerable attention and further development (Bergman et al., 1987; Breda et al., 2001; Cobelli et al., 1986, 1990; Mari, 1997; Quon et al., 1994; Toffolo et al., 1995, 2001). We have previously used modeling methods to successfully estimate and predict the onset of counterregulation in T1DM patients (Kovatchev et al., 1999, 2000), as well as to study other complex endocrine axes (Farhy, 2004; Farhy and Veldhuis 2003, 2004, 2005; Farhy et al., 2001, 2002, 2007). However, despite the proven utility of this methodology, our recent efforts were the first to apply a combination of network modeling and in vivo studies to dissect the GCR control axis (Farhy and McCall, 2009; Farhy et al., 2008).

The selected few MCN components cannot exhaustively recreate all signals that control the GCR. Indeed, in the normal pancreas, glucagon may control its own secretion via a/b-cell interactions. For example, human b-cells express glucagon receptors (Huypens et al., 2000; Kieffer et al., 1996), and exogenous glucagon stimulates insulin via glucagon and GLP-1 receptors (Huypens et al., 2000). One immunoneutralization study suggests that endogenous glucagon stimulates insulin (Brunicardi et al., 2001), while other results imply that a-cell glutamate may bind to receptors on b-cells to stimulate insulin and GABA (Bertrand et al., 1992; Inagaki et al., 1995; Uehara et al., 2004). It has recently been reported that in human islets, a-cell glutamate serves as a positive autocrine signal for glucagon release by acting on ionotropic glutamate receptors (iGluRs) on a-cells (Cabrera et al., 2008). Thus, absence of functional b-cells may cause glutamate hypersecretion, followed by desensitization of the a-cell iGluRs, and ultimately by defects in GCR, as conjectured (Cabrera et al., 2008). Interestingly, a similar hypothesis to explain the defective GCR in diabetes by increased chronic a-cell activity due to lack of b-cell signaling can be formulated based on our results. However, in our case hyperglucagonemia is the main reason for the GCR defects. The two hypotheses are not mutually exclusive, but ours can also explain the in vivo GCR pulsatility during hypoglycemia observed by us (Farhy et al., 2008) and others (Genter et al., 1998). Most importantly, the a-cell positive autoregulation is consistent with the delayed negative a-cell auto-feedback proposed here, which could be mediated in part by iGluR desensitization, as suggested (Cabrera et al., 2008). The autocrine regulation is implicitly incorporated in our model equations in the parameter rGL. The b-cells may control the d-cells, which are downstream from the b-cells in the order of intraislet vascular perfusion. However, in one study, anterograde infusion of insulin antibody in the perfused rat pancreas stimulated
both glucagon and somatostatin (Samols and Stagner, 1988), while another immunoneutralization study documented a decrease in somatostatin at high glucose concentrations (Brunicardi et al., 2001). Suppression of the a-cells by insulin (as proposed here) could explain this apparent contradiction. It is also possible that the d-cells inhibit the b-cells (Brunicardi et al., 2003; Huypens et al., 2000; Schuit et al., 1989; Strowski et al., 2000). Finally, the MCN components are influenced by numerous extrapancreatic factors, some of which have important impacts on glucagon secretion and GCR, including autonomic input, catecholamines, growth hormone, ghrelin, and incretins (Gromada et al., 2007; Havel and Ahren, 1997; Havel and Taborsky, 1989; Heise et al., 2004). For example, the incretin GLP-1 inhibits glucagon, though the mechanism of this inhibition is still controversial (Gromada et al., 2007). Also, there are three major autonomic influences on the a-cell: sympathetic nerves, parasympathetic nerves, and circulating epinephrine, all of which are activated by hypoglycemia and are capable of stimulating glucagon and suppressing insulin (Bolli and Fanelli, 1999; Brelje et al., 1989; Taborsky et al., 1998). We cannot track all signals that control the GCR, and most of them have no explicit terms in our model. However, they are not omitted or considered unimportant. In fact, when we describe the MCN mathematically, we include the impact of the nervous system and other factors, even though they have no individual terms in the equations. Thus, the MCN unifies all factors that control glucagon release, based on the assumption that these factors act through the primary physiological relationships that are explicit in the MCN.

The model-based simulations suggest that the postulated MCN model of GCR regulation is consistent with the experimental data. However, at this stage we cannot estimate how good this model is, and it is therefore hard to assess the validity of its predictions. The simulations can only reconstruct the general "averaged" behavior of the in vivo system, and new experimental data are required to support the important property that the model can explain the GCR response in individual animals. These should involve interventional studies to manipulate the vascular input to the pancreas and analyze the corresponding changes in the output by collecting frequently sampled portal vein data for multiple hormones simultaneously. These data must then be analyzed with the mathematical model to estimate whether the MCN provides an objectively good description of the action of the complex GCR control mechanism. Note that with this approach we cannot establish the model-based inferences in "micro" detail, since they imply molecular mechanisms that are out of reach of the in vivo methodology. The approach cannot, nor is it intended to, address the microscopic behavior of the a-cells or the molecular mechanisms that govern this behavior. In this regard, insulin and glucagon (and somatostatin) should be viewed only as (macroscopic) surrogates for the activity of the different cell types under a variety of other intra- and extrapancreatic influences.
Even though it is usually not stated explicitly, simple models are always used in experimental studies, and, especially in in vivo experiments, many factors are ignored or postulated to have no impact on the outcome. Using constructs like the ones described in this work to analyze hormone concentration data has the advantage that the underlying model is very explicit, incorporates multiple relationships, and uses well-established mathematical and statistical techniques to show its validity and reconstruct the involved signals and pathways.
8. Conclusions

In the current work, we present our interdisciplinary efforts to investigate the system-level network control mechanisms that mediate the GCR and their abnormalities in diabetes, a concept as yet almost completely unexplored for GCR. The results confirm the hypothesis that a streamlined model, which omits an explicit (but not implicit) somatostatin (d-cell) node, entirely reproduces the results of our original, more complex models. Our new findings define more precisely the components that are most critical for the system and strongly suggest that a delayed a-cell auto-feedback plays a key role in GCR regulation. The results demonstrate that such regulation is consistent not only with most of the in vivo system behavior typical of the insulin deficient pancreas, but also explains key features characteristic of the transition from a normal to an insulin deficient state. A major advantage of the current model is that its only explicit components are BG, insulin, and glucagon. These are clinically measurable, which would allow the application of the new construct to the study of the control, function, and abnormalities of the human glucagon axis.
ACKNOWLEDGEMENT

The study was supported by NIH/NIDDK grant R21 DK072095.
REFERENCES

Ashcroft, F. M., Proks, P., Smith, P. A., Ammala, C., Bokvist, K., and Rorsman, P. (1994). Stimulus-secretion coupling in pancreatic beta cells. J. Cell. Biochem. 55(Suppl), 54-65.
Banarer, S., McGregor, V. P., and Cryer, P. E. (2002). Intraislet hyperinsulinemia prevents the glucagon response to hypoglycemia despite an intact autonomic response. Diabetes 51(4), 958-965.
Bell, G. I., Pilkis, S. J., Weber, I. T., and Polonsky, K. S. (1996). Glucokinase mutations, insulin secretion, and diabetes mellitus. Annu. Rev. Physiol. 58, 171-186.
Bergman, R. N., Ider, Y. Z., Bowden, C. R., and Cobelli, C. (1979). Quantitative estimation of insulin sensitivity. Am. J. Physiol. 236, E667-E677.
Bergman, R. N., Prager, R., Volund, A., and Olefsky, J. M. (1987). Equivalence of the insulin sensitivity index in man derived by the minimal model method and the euglycemic glucose clamp. J. Clin. Invest. 79, 790-800.
Bertrand, G., Gross, R., Puech, R., Loubatieres-Mariani, M. M., and Bockaert, J. (1992). Evidence for a glutamate receptor of the AMPA subtype which mediates insulin release from rat perfused pancreas. Br. J. Pharmacol. 106(2), 354-359.
Bolli, G. B., and Fanelli, C. G. (1999). Physiology of glucose counterregulation to hypoglycemia. Endocrinol. Metab. Clin. North Am. 28, 467-493.
Breda, E., Cavaghan, M. K., Toffolo, G., Polonsky, K. S., and Cobelli, C. (2001). Oral glucose tolerance test minimal model indexes of beta-cell function and insulin sensitivity. Diabetes 50(1), 150-158.
Brelje, T. C., Scharp, D. W., and Sorenson, R. L. (1989). Three-dimensional imaging of intact isolated islets of Langerhans with confocal microscopy. Diabetes 38(6), 808-814.
Brunicardi, F. C., Kleinman, R., Moldovan, S., Nguyen, T. H., Watt, P. C., Walsh, J., and Gingerich, R. (2001). Immunoneutralization of somatostatin, insulin, and glucagon causes alterations in islet cell secretion in the isolated perfused human pancreas. Pancreas 23(3), 302-308.
Brunicardi, F. C., Atiya, A., Moldovan, S., Lee, T. C., Fagan, S. P., Kleinman, R. M., Adrian, T. E., Coy, D. H., Walsh, J. H., and Fisher, W. E. (2003). Activation of somatostatin receptor subtype 2 inhibits insulin secretion in the isolated perfused human pancreas. Pancreas 27(4), e84-e89.
Cabrera, O., Jacques-Silva, M. C., Speier, S., Yang, S. N., Köhler, M., Fachado, A., Vieira, E., Zierath, J. R., Kibbey, R., Berman, D. M., Kenyon, N. S., Ricordi, C., et al. (2008). Glutamate is a positive autocrine signal for glucagon release. Cell Metab. 7(6), 545-554.
Cejvan, K., Coy, D. H., and Efendic, S. (2003). Intra-islet somatostatin regulates glucagon release via type 2 somatostatin receptors in rats. Diabetes 52(5), 1176-1181.
Cobelli, C., Pacini, G., Toffolo, G., and Sacca, L. (1986). Estimation of insulin sensitivity and glucose clearance from minimal model: New insights from labeled IVGTT. Am. J. Physiol. 250, E591-E598.
Cobelli, C., Brier, D. M., and Ferrannini, E. (1990). Modeling glucose metabolism in man: Theory and practice. Horm. Metab. Res. Suppl. 24, 1-10.
Cryer, P. E. (1999). Hypoglycemia is the limiting factor in the management of diabetes. Diabetes Metab. Res. Rev. 15(1), 42-46.
Cryer, P. E. (2002). Hypoglycemia: The limiting factor in the glycaemic management of type I and type II diabetes. Diabetologia 45(7), 937-948.
Cryer, P. E., and Gerich, J. E. (1983). Relevance of glucose counterregulatory systems to patients with diabetes: Critical roles of glucagon and epinephrine. Diabetes Care 6(1), 95-99.
Cryer, P. E., Davis, S. N., and Shamoon, H. (2003). Hypoglycemia in diabetes. Diabetes Care 26, 1902-1912.
Diem, P., Redmon, J. B., Abid, M., Moran, A., Sutherland, D. E., Halter, J. B., and Robertson, R. P. (1990). Glucagon, catecholamine and pancreatic polypeptide secretion in type I diabetic recipients of pancreas allografts. J. Clin. Invest. 86(6), 2008-2013.
Dumonteil, E., Magnan, C., Ritz-Laser, B., Ktorza, A., Meda, P., and Philippe, J. (2000). Glucose regulates proinsulin and prosomatostatin but not proglucagon messenger ribonucleic acid levels in rat pancreatic islets.
Endocrinology 141(1), 174-180.
Dunne, M. J., Harding, E. A., Jaggar, J. H., and Squires, P. E. (1994). Ion channels and the molecular control of insulin secretion. Biochem. Soc. Trans. 22(1), 6-12.
Efendic, S., Nylen, A., Roovete, A., and Uvnas-Wallenstein, K. (1978). Effects of glucose and arginine on the release of immunoreactive somatostatin from the isolated perfused rat pancreas. FEBS Lett. 92(1), 33-35.
Epstein, S., Berelowitz, M., and Bell, N. H. (1980). Pentagastrin and glucagon stimulate serum somatostatin-like immunoreactivity in man. J. Clin. Endocrinol. Metab. 51, 1227-1231.
Farhy, L. S. (2004). Modeling of oscillations in endocrine networks with feedback. Methods Enzymol. 384, 54-81.
Farhy, L. S., and McCall, A. L. (2009). System-level control to optimize glucagon counterregulation by switch-off of a-cell suppressing signals in b-cell deficiency. J. Diabetes Sci. Technol. 3(1), 21-33.
Farhy, L. S., and Veldhuis, J. D. (2003). Joint pituitary-hypothalamic and intrahypothalamic autofeedback construct of pulsatile growth hormone secretion. Am. J. Physiol. Regul. Integr. Comp. Physiol. 285(5), R1240-R1249.
Farhy, L. S., and Veldhuis, J. D. (2004). Putative GH pulse renewal: Periventricular somatostatinergic control of an arcuate-nuclear somatostatin and GH-releasing hormone oscillator. Am. J. Physiol. Regul. Integr. Comp. Physiol. 286(6), R1030-R1042.
Farhy, L. S., and Veldhuis, J. D. (2005). Deterministic construct of amplifying actions of ghrelin on pulsatile growth hormone secretion. Am. J. Physiol. Regul. Integr. Comp. Physiol. 288, R1649-R1663.
Farhy, L. S., Straume, M., Johnson, M. L., Kovatchev, B., and Veldhuis, J. D. (2001). A construct of interactive feedback control of the GH axis in the male. Am. J. Physiol. Regul. Integr. Comp. Physiol. 281(1), R38-R51.
Farhy, L. S., Straume, M., Johnson, M. L., Kovatchev, B., and Veldhuis, J. D. (2002). Unequal autonegative feedback by GH models the sexual dimorphism in GH secretory dynamics. Am. J. Physiol. Regul. Integr. Comp. Physiol. 282(3), R753-R764.
Farhy, L. S., Bowers, C. Y., and Veldhuis, J. D. (2007). Model-projected mechanistic bases for sex differences in growth-hormone (GH) regulation in the human. Am. J. Physiol. Regul. Integr. Comp. Physiol. 292, R1577-R1593.
Farhy, L. S., Du, Z., Zeng, Q., Veldhuis, P. P., Johnson, M. L., Brayman, K. L., and McCall, A. L. (2009). Amplification of pulsatile glucagon secretion by switch-off of a-cell suppressing signals in streptozotocin treated rats. Am. J. Physiol. Endocrinol. Metab. 295, E575-E585.
Fujitani, S., Ikenoue, T., Akiyoshi, M., Maki, T., and Yada, T. (1996). Somatostatin and insulin secretion due to common mechanisms by a new hypoglycemic agent, A-4166, in perfused rat pancreas. Metab. Clin. Exp. 45(2), 184-189.
Fukuda, M., Tanaka, A., Tahara, Y., Ikegami, H., Yamamoto, Y., Kumahara, Y., and Shima, K. (1988). Correlation between minimal secretory capacity of pancreatic beta-cells and stability of diabetic control. Diabetes 37(1), 81-88.
Gedulin, B. R., Rink, T. J., and Young, A. A. (1997). Dose-response for glucagonostatic effect of amylin in rats. Metabolism 46, 67-70.
Genter, P., Berman, N., Jacob, M., and Ipp, E. (1998). Counterregulatory hormones oscillate during steady-state hypoglycemia. Am. J. Physiol. 275(5), E821-E829.
Gerich, J. E. (1988). Lilly lecture: Glucose counterregulation and its impact on diabetes mellitus. Diabetes 37(12), 1608-1617.
Gerich, J. E., Langlois, M., Noacco, C., Karam, J. H., and Forsham, P. H. (1973). Lack of glucagon response to hypoglycemia in diabetes: Evidence for an intrinsic pancreatic alpha cell defect. Science 182(108), 171-173.
Gopel, S. O., Kanno, T., Barg, S., and Rorsman, P. (2000a). Patch-clamp characterisation of somatostatin-secreting d-cells in intact mouse pancreatic islets. J. Physiol. 528(3), 497-507.
Gopel, S. O., Kanno, T., Barg, S., Weng, X. G., Gromada, J., and Rorsman, P. (2000b). Regulation of glucagon release in mouse a-cells by KATP channels and inactivation of TTX-sensitive Na+ channels. J. Physiol. 528, 509-520.
578
Leon S. Farhy and Anthony L. McCall
Grapengiesser, E., Salehi, A., Quader, S. S., and Hellman, B. (2006). Glucose induces glucagon release pulses antisynchronous with insulin and sensitive to purinoceptors inhibition. Endocrinology 147, 3472–3477. Grimmichova, R., Vrbikova, J., Matucha, P., Vondra, K., Veldhuis, P., and Johnson, M. (2008). Fasting insulin pulsatile secretion in lean women with polycystic ovary syndrome. Physiol. Res. 57, 1–8. Gromada, J., Franklin, I., and Wollheim, C. B. (2007). a-Cells of the endocrine pancreas: 35 Years of research but the enigma remains. Endocr. Rev. 28(1), 84–116. Guyton, J. R., Foster, R. O., Soeldner, J. S., Tan, M. H., Kahn, C. B., Koncz, L., and Gleason, R. E. (1978). A model of glucose-insulin homeostasis in man that incorporates the heterogeneous fast pool theory of pancreatic insulin release. Diabetes 27, 1027–1042. Havel, P. J., and Ahren, B. (1997). Activation of autonomic nerves and the adrenal medulla contributes to increased glucagon secretion during moderate insulin-induced hypoglycemia in women. Diabetes 46, 801–807. Havel, P. J., and Taborsky, G. J. Jr. (1989). The contribution of the autonomic nervous system to changes of glucagon and insulin secretion during hypoglycemic stress. Endocr. Rev. 10(3), 332–350. Heimberg, H., De Vos, A., Pipeleers, D., Thorens, B., and Schuit, F. (1995). Differences in glucose transporter gene expression between rat pancreatic alpha- and beta-cells are correlated to differences in glucose transport but not in glucose utilization. J. Biol. Chem. 270(15), 8971–8975. Heimberg, H., De Vos, A., Moens, K., Quartier, E., Bouwens, L., Pipeleers, D., Van Schaftingen, E., Madsen, O., and Schuit, F. (1996). The glucose sensor protein glucokinase is expressed in glucagon-producing alpha-cells. Proc. Natl. Aca. Sci. USA 93(14), 7036–7041. Heise, T., Heinemann, T., Heller, S., Weyer, C., Wang, Y., Strobel, S., Kolterman, O., and Maggs, D. (2004). Effect of pramlintide on symptom, catecholamine, and glucagon responses to hypoglycemia in healthy subjects. Metabolism 53(9), 1227–1232. Hermansen, K., Christensen, S. E., and Orskov, H. (1979). Characterization of somatostatin release from the pancreas: The role of potassium. Scand. J. Clin. Lab. Invest. 39(8), 717–722. Hilsted, J., Frandsen, H., Holst, J. J., Christensen, N. J., and Nielsen, S. L. (1991). Plasma glucagon and glucose recovery after hypoglycemia: The effect of total autonomic blockade. Acta Endocrinol. 125(5), 466–469. Hirsch, B. R., and Shamoon, H. (1987). Defective epinephrine and growth hormone responses in type I diabetes are stimulus specific. Diabetes 36(1), 20–26. Hoffman, R. P., Arslanian, S., Drash, A. L., and Becker, D. J. (1994). Impaired counterregulatory hormone responses to hypoglycemia in children and adolescents with new onset IDDM. J. Pediatr. Endocrinol. 7(3), 235–244. Hope, K. M., Tran, P. O., Zhou, H., Oseid, E., Leroy, E., and Robertson, R. P. (2004). Regulation of alpha-cell function by the beta-cell in isolated human and rat islets deprived of glucose: The ‘‘switch-off ’’ hypothesis. Diabetes 53(6), 1488–1495. Huypens, P., Ling, Z., Pipeleers, D., and Schuit, F. (2000). Glucagon receptors on human islet cells contribute to glucose competence of insulin release. Diabetologia 43(8), 1012–1019. Inagaki, N., Kuromi, H., Gonoi, T., Okamoto, Y., Ishida, H., Seino, Y., Kaneko, T., Iwanaga, T., and Seino, S. (1995). Expression and role of ionotropic glutamate receptors in pancreatic islet cells. FASEB J. 9(8), 686–691. Insel, P. A., Liljenquist, J. E., Tobin, J. D., Sherwin, R. 
S., Watkins, P., Andres, R., and Berman, M. (1975). Insulin control of glucose metabolism in man. A new kinetic analysis. J. Clin. Invest. 55, 1057–1066. Ishihara, H., Maechler, P., Gjinovci, A., Herrera, P. L., and Wollheim, C. B. (2003). Islet b-cell secretion determines glucagon release from neighboring a-cells. Nat. Cell Biol. 5, 330–335.
Network Control of Glucagon Counterregulation
579
Ito, K., Maruyama, H., Hirose, H., Kido, K., Koyama, K., Kataoka, K., and Saruta, T. (1995). Exogenous insulin dose-dependently suppresses glucopenia-induced glucagon secretion from perfused rat pancreas. Metab. Clin. Exp. 44(3), 358–362. Jaspan, J. B., Lever, E., Polonsky, K. S., and Van Cauter, E. (1986). In vivo pulsatility of pancreatic islet peptides. Am. J. Physiol. 251(2 Pt 1), E215–E226. Kawai, K., and Unger, R. H. (1982). Inhibition of glucagon secretion by exogenous glucagon in the isolated, perfused dog pancreas. Diabetes 31(6), 512–515. Kawamori, D., Kurpad, A. J., Hu, J., Liew, C. W., Shih, J. L., Ford, E. L., Herrera, P. L., Polonsky, K. S., McGuinness, O. P., and Kulkarni, R. N. (2009). Insulin signaling in alpha cells modulates glucagon secretion in vivo. Cell Metab. 9(4), 350–361. Kieffer, T. J., Heller, R. S., Unson, C. G., Weir, G. C., and Habener, J. F. (1996). Distribution of glucagon receptors on hormone-specific endocrine cells of rat pancreatic islets. Endocrinology 137(11), 5119–5125. Klaff, L. J., and Taborsky, G. J. Jr. (1987). Pancreatic somatostatin is a mediator of glucagon inhibition by hyperglycemia. Diabetes 36(5), 592–596. Kleinman, R., Gingerich, R., Wong, H., Walsh, J., Lloyd, K., Ohning, G., De Giorgio, R., Sternini, C., and Brunicardi, F. C. (1994). Use of the Fab fragment for immunoneutralization of somatostatin in the isolated perfused human pancreas. Am. J. Surg. 167(1), 114–119. Kleinman, R., Gingerich, R., Ohning, G., Wong, H., Olthoff, K., Walsh, J., and Brunicardi, F. C. (1995). The influence of somatostatin on glucagon and pancreatic polypeptide secretion in the isolated perfused human pancreas. Int. J. Pancreatol. 18(1), 51–57. Kovatchev, B. P., Farhy, L. S., Cox, D. J., Straume, M., Yankov, V. I., GonderFrederick, L. A., and Clarke, W. L. (1999). Modeling insulin-glucose dynamics during insulin induced hypoglycemia. Evaluation of glucose counterregulation. J. Theor. Med. 1, 313–323. Kovatchev, B. P., Straume, M., Farhy, L. S., and Cox, D. J. (2000). Dynamic network model of glucose counterregulation in subjects with insulin-requiring diabetes. Methods Enzymol. 321, 396–410. Ludvigsen, E., Olsson, R., Stridsberg, M., Janson, E. T., and Sandler, S. (2004). Expression and distribution of somatostatin receptor subtypes in the pancreatic islets of mice and rats. J. Histochem. Cytochem. 52(3), 391–400. Mari, A. (1997). Assessment of insulin sensitivity with minimal model: Role of model assumptions. Am. J. Physiol. 272, E925–E934. Maruyama, H., Hisatomi, A., Orci, L., Grodsky, G. M., and Unger, R. H. (1984). Insulin within islets is a physiologic glucagon release inhibitor. J. Clin. Invest. 74(6), 2296–2299. Matthews, D. R., Hermansen, K., Connolly, A. A., Gray, D., Schmitz, O., Clark, A., Orskov, H., and Turner, R. C. (1987). Greater in vivo than in vitro pulsatility of insulin secretion with synchronized insulin and somatostatin secretory pulses. Endocrinology 120(6), 2272–2278. McCall, A. L., Cox, D. J., Crean, J., Gloster, M., and Kovatchev, B. P. (2006). A novel analytical method for assessing glucose variability: Using CGMS in type 1 diabetes mellitus. Diabetes Technol. Ther. 8(6), 644–653. Meier, J. J., Kjems, L. L., Veldhuis, J. D., Lefebvre, P., and Butler, P. C. (2006). Postprandial suppression of glucagon secretion depends on intact pulsatile insulin secretion: Further evidence for the intraislet insulin hypothesis. Diabetes 55(4), 1051–1056. Pipeleers, D. G., Schuit, F. C., Van Schravendijk, C. F., and Van de Winkel, M. (1985). 
Interplay of nutrients and hormones in the regulation of glucagon release. Endocrinology 117(3), 817–823. Prksen, N. (2002). The in vivo regulation of pulsatile insulin secretion. Diabetologia 45(1), 3–20.
580
Leon S. Farhy and Anthony L. McCall
Portela-Gomes, G. M., Stridsberg, M., Grimelius, L., Oberg, K., and Janson, E. T. (2000). Expression of the five different somatostatin receptor subtypes in endocrine cells of the pancreas. Appl. Immunohistochem. Mol. Morphol. 8(2), 126–132. Quon, M. J., Cochran, C., Taylor, S. I., and Eastman, R. C. (1994). Non-insulin mediated glucose disappearance in subjects with IDDM. Discordance between experimental results and minimal model analysis. Diabetes 43, 890–896. Ravier, M. A., and Rutter, G. A. (2005). Glucose or insulin, but not zinc ions, inhibit glucagon secretion from mouse pancreatic alpha cells. Diabetes 54, 1789–1797. Reaven, G. M., Chen, Y. D., Golay, A., Swislocki, A. L., and Jaspan, J. B. (1987). Documentation of hyperglucagonemia throughout the day in nonobese and obese patients with noninsulin-dependent diabetes mellitus. J. Clin. Endocrinol. Metab. 64(1), 106–110. Rorsman, P., and Hellman, B. (1988). Voltage-activated currents in guinea pig pancreatic alpha 2 cells. Evidence for Ca2þ-dependent action potentials. J. Gen. Physiol. 91(2), 223–242. Rorsman, P., Berggren, P. O., Bokvist, K., Ericson, H., Mohler, H., Ostenson, C. G., and Smith, P. A. (1989). Glucose-inhibition of glucagon secretion involves activation of GABAA-receptor chloride channels. Nature 341(6239), 233–236. Salehi, A., Quader, S. S., Grapengiesser, E., and Hellman, B. (2007). Pulses of somatostatin release are slightly delayed compared with insulin and antisynchronous to glucagon. Regul. Pept. 144, 43–49. Samols, E., and Stagner, J. I. (1988). Intra-islet regulation. Am. J. Med. 85(5A), 31–35. Samols, E., and Stagner, J. I. (1990). Islet somatostatin–microvascular, paracrine, and pulsatile regulation. Metab. Clin. Exp. 39(9 Suppl 2), 55–60. Schuit, F. C., Derde, M. P., and Pipeleers, D. G. (1989). Sensitivity of rat pancreatic A and B cells to somatostatin. Diabetologia 32(3), 207–212. Schuit, F., De Vos, A., Farfari, S., Moens, K., Pipeleers, D., Brun, T., and Prentki, M. (1997). Metabolic fate of glucose in purified islet cells. Glucose-regulated anaplerosis in beta cells. J. Biol. Chem. 272(30), 18572–18579. Schuit, F. C., Huypens, P., Heimberg, H., and Pipeleers, D. G. (2001). Glucose sensing in pancreatic beta-cells: A model for the study of other glucose-regulated cells in gut, pancreas, and hypothalamus. Diabetes 50(1), 1–11. Segel, S. A., Paramore, D. S., and Cryer, P. E. (2002). Hypoglycemia-associated autonomic failure in advanced type 2 diabetes. Diabetes 51(3), 724–733. Silvestre, R. A., Rodrı´guez-Gallardo, J., Jodka, C., Parkes, D. G., Pittner, R. A., Young, A. A., and Marco, J. (2001). Selective amylin inhibition of the glucagon response to arginine is extrinsic to the pancreas. Am. J. Physiol. Endocrinol. Metab. 280, E443–E449. Stagner, J. I., Samols, E., and Bonner-Weir, S. (1988). Beta-alpha-delta pancreatic islet cellular perfusion in dogs. Diabetes 37(12), 1715–1721. Stagner, J. I., Samols, E., and Marks, V. (1989). The anterograde and retrograde infusion of glucagon antibodies suggests that A cells are vascularly perfused before D cells within the rat islet. Diabetologia 32(3), 203–206. Steele, R., Rostami, H., and Altszuler, N. (1974). A two-compartment calculator for the dog glucose pool in the nonsteady state. Fed. Proc. 33, 1869–1876. Strowski, M. Z., Parmar, R. M., Blake, A. D., and Schaeffer, J. M. (2000). Somatostatin inhibits insulin and glucagon secretion via two receptors subtypes: An in vitro study of pancreatic islets from somatostatin receptor 2 knockout mice. 
Endocrinology 141(1), 111–117. Sumida, Y., Shima, T., Shirayama, K., Misaki, M., and Miyaji, K. (1994). Effects of hexoses and their derivatives on glucagon secretion from isolated perfused rat pancreas. Horm. Metab. Res. 26(5), 222–225.
Network Control of Glucagon Counterregulation
581
Taborsky, G. J. Jr., Ahren, B., and Havel, P. J. (1998). Autonomic mediation of glucagon secretion during hypoglycemia: Implications for impaired alpha-cell responses in type 1 diabetes. Diabetes 47(7), 995–1005. Tapia-Arancibia, L., and Astier, H. (1988). Glutamate stimulates somatostatin release from diencephalic neurons in primary culture. Endocrinology 123, 2360–2366. The Action to Control Cardiovascular Risk in Diabetes Study Group (2008). Effects of intensive glucose lowering in type 2 diabetes. N. Engl. J. Med. 358, 2545–2559. The Diabetes Control and Complications Trial Research Group (1993). The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. N. Engl. J. Med. 329, 977–986. Tirone, T. A., Norman, M. A., Moldovan, S., DeMayo, F. J., Wang, X. P., and Brunicardi, F. C. (2003). Pancreatic somatostatin inhibits insulin secretion via SSTR-5 in the isolated perfused mouse pancreas model. Pancreas 26(3), e67–e73. Toffolo, G., De Grandi, F., and Cobelli, C. (1995). Estimation of beta-cell sensitivity from intravenous glucose tolerance test C-peptide data. Knowledge of the kinetics avoids errors in modeling the secretion. Diabetes 44, 845–854. Toffolo, G., Breda, E., Cavaghan, M. K., Ehrmann, D. A., Polonsky, K. S., and Cobelli, C. (2001). Quantitative indices of b-cell function during graded up&down glucose infusion from C-peptide minimal models. Am. J. Physiol. 280, 2–10. Uehara, S., Muroyama, A., Echigo, N., Morimoto, R., Otsuka, M., Yatsushiro, S., and Moriyama, Y. (2004). Metabotropic glutamate receptor type 4 is involved in autoinhibitory cascade for glucagon secretion by alpha-cells of islet of Langerhans. Diabetes 53(4), 998–1006. UK Prospective Diabetes Study Group (1998). Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes. Lancet 352, 837–853. Unger, R. H. (1985). Glucagon physiology and pathophysiology in the light of new advances. Diabetologia 28, 574–578. Utsumi, M., Makimura, H., Ishihara, K., Morita, S., and Baba, S. (1979). Determination of immunoreactive somatostatin in rat plasma and responses to arginine, glucose and glucagon infusion. Diabetologia 17, 319–323. Van Schravendijk, C. F., Foriers, A., Van den Brande, J. L., and Pipeleers, D. G. (1987). Evidence for the presence of type I insulin-like growth factor receptors on rat pancreatic A and B cells. Endocrinology 121(5), 1784–1788. Wendt, A., Birnir, B., Buschard, K., Gromada, J., Salehi, A., Sewing, S., Rorsman, P., and Braun, M. (2004). Glucose inhibition of glucagon secretion from rat alpha-cells is mediated by GABA released from neighboring beta-cells. Diabetes 53(4), 1038–1045. Xu, E., Kumar, M., Zhang, Y., Ju, W., Obata, T., Zhang, N., Liu, S., Wendt, A., Deng, S., Ebina, Y., Wheeler, M. B., Braun, M., et al. (2006). Intraislet insulin suppresses glucagon release via GABA-GABAA receptor system. Cell Metab. 3, 47–58. Yamasaki, Y., Tiran, J., and Albisser, A. M. (1984). Modeling glucose disposal in diabetic dogs fed mixed meals. Am. J. Physiol. 246, E52–E61. Zhou, H., Tran, P. O., Yang, S., Zhang, T., LeRoy, E., Oseid, E., and Robertson, R. P. (2004). Regulation of alpha-cell function by the beta-cell during hypoglycemia in Wistar rats: The ‘‘switch-off ’’ hypothesis. Diabetes 53(6), 1482–1487. Zhou, H., Zhang, T., Oseid, E., Harmon, J., Tonooka, N., and Robertson, R. P. (2007a). 
Reversal of defective glucagon responses to hypoglycemia in insulin-dependent autoimmune diabetic BB rats. Endocrinology 148, 2863–2869. Zhou, H., Zhang, T., Harmon, J. S., Bryan, J., and Robertson, R. P. (2007b). Zinc, not insulin, regulates the rat a-cell response to hypoglycemia in vivo. Diabetes 56, 1107–1112.
CHAPTER TWENTY-TWO
Enzyme Kinetics and Computational Modeling for Systems Biology

Pedro Mendes, Hanan Messiha, Naglis Malys, and Stefan Hoops

Contents
1. Introduction
2. Computational Modeling and Enzyme Kinetics
   2.1. Standards in computational systems biology
   2.2. COPASI: A biochemical modeling and simulation package
3. Yeast Triosephosphate Isomerase (EC 5.3.1.1)
4. Initial Rate Analysis
5. Progress Curve Analysis
6. Concluding Remarks
Acknowledgments
References
Abstract

Enzyme kinetics is a century-old area of biochemical research which is regaining popularity due to its use in systems biology. Computational models of biochemical networks depend on rate laws and kinetic parameter values that describe the behavior of enzymes in the cellular milieu. While there is a considerable body of enzyme kinetic data available from the past several decades, a large number of enzymes of specific organisms were never assayed or were assayed in conditions that are irrelevant to those models. The result is that systems biology projects are having to carry out large numbers of enzyme kinetic assays. This chapter reviews the main methodologies of enzyme kinetic data analysis and proposes using computational modeling software for that purpose. It applies the biochemical network modeling software COPASI to data from enzyme assays of yeast triosephosphate isomerase (EC 5.3.1.1).

Manchester Centre for Integrative Systems Biology, The University of Manchester, Manchester, United Kingdom
School of Computer Science, The University of Manchester, Manchester, United Kingdom
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA
School of Chemistry, The University of Manchester, Manchester, United Kingdom
Faculty of Life Sciences, The University of Manchester, Manchester, United Kingdom
1. Introduction

Modern biochemical research is becoming a systems approach in which mathematical models of the dynamics of molecular networks play an important role. These models are needed to understand the relationship between the underlying biophysical and biochemical parameters and the nonlinear behavior of the system. Models are also important as devices that integrate the various types of data needed for these studies. Under the term systems biology we include two distinct types of studies: one is driven by whole-genome data such as that from transcriptomics and high-throughput protein–protein interactions, while the other is based on in vitro data from purified molecules. The former is a top-down (analytic) approach that centers on network inference, while the latter is a bottom-up (synthetic) approach that reconstructs the system based on knowledge of the individual parts. The ultimate objective of both approaches is the same, however: to understand how the behavior of living cells depends on the molecular mechanisms that compose them. To some extent, systems biology can be seen as the link between biochemistry and physiology.

The bottom-up approach to systems biology is based on existing knowledge of the network of molecular interactions, and much work is ongoing to create accurate descriptions of these networks (e.g., Herrgård et al., 2008). But assembling the network structure is only the first part, and while it provides for interesting analyses (Schilling et al., 1999), most cellular properties are dynamic and require dynamic models for their understanding. Dynamics are introduced into models through the kinetics of the molecular interactions, the majority of which are enzyme-catalyzed reactions. The determination of kinetic parameters and rate laws is thus an important activity in systems biology. But the new field of application has its own specific requirements that result in different constraints on assays and data analysis.

Traditionally, enzyme kinetics has been a vehicle for determining reaction mechanisms. This means that assays had to expose differences between mechanisms, which are often subtle, and therefore there was a strong emphasis on accuracy of results. Since the mechanism of catalysis of an enzyme is rarely different for each of its substrates, many assays were carried out with synthetic substrate analogs, which are often more readily available (and cheaper) than the physiological substrate; other reasons for the use of substrate analogs are related to advantageous physicochemical properties (e.g., solubility, light absorption, etc.). The same applies to modifiers, which were often also analogs of, or even entirely unrelated to, physiological metabolites of that pathway. Another common practice in the quest for mechanisms is to carry out the assays at the optimum pH of the enzyme, not the physiological pH.
Frequently only one of the directions of the reaction was assayed and parameters for the products were not determined. Finally, the enzyme preparations themselves were often not sufficiently pure, containing unknown proportions of isoenzymes.

To construct biochemical network models that are relevant to cellular physiology it is important to determine the kinetic properties of the enzyme in conditions as close as possible to the cellular milieu. At a minimum the pH and temperature should be consistent with the relevant cells. Importantly, synthetic substrates or inhibitor analogs are undesirable and provide no useful information to the model. As much as possible one should also determine the kinetic properties of each single isoenzyme (or at least the isoenzymes of relevance); after all, when several forms exist in an organism it is because they have different properties and fulfill different roles (even if in certain diseases or mutants one form may substitute for the other). The kinetic parameters of all substrates and products should be determined, so that one can appropriately include reversible reactions in the model. Even if one has to represent some reaction as irreversible, it is important that the rate law be sensitive to the product concentrations.

It is not a surprise that little data fulfilling the requirements above has been published to date. As it turns out, even without these requirements, the number of isoenzymes that have been studied kinetically in any form is smaller than is often portrayed. Consequently, there is a real need to assay a large number of different isoenzymes to provide data for the construction of physiologically relevant biochemical network models. Systems biology needs enzyme kinetic assays in large numbers and therefore, in this age of robotics, there is a real need for high-throughput enzyme characterizations that follow the principles described here.

But the enhanced interaction between systems biology and enzyme kinetics is synergistic: enzyme kinetics also has something to gain from systems biology. With the increased interest in modeling biochemical networks, computational systems biology has been creating a series of tools that are also useful when applied to enzyme kinetics. This is particularly true in the area of parameter estimation, where several algorithms developed for network models are equally applicable to enzyme kinetic data. The availability of increasingly sophisticated and standardized modeling and simulation software will undoubtedly benefit enzyme kinetics.

Here we review the main approaches to enzyme kinetic data analysis and discuss them in light of their new field of application and how systems biology modeling tools can be useful. An illustration is presented with the COPASI modeling software (Hoops et al., 2006) applied to the kinetics of purified yeast triosephosphate isomerase (EC 5.3.1.1).
2. Computational Modeling and Enzyme Kinetics

Biochemical networks are sets of reactions that are linked by common substrates and products. The dynamics of biochemical networks are frequently described as sets of coupled ordinary differential equations (ODEs) that represent the rate of change of the concentrations of the chemical species involved in the network. The right-hand side of these ODEs is the algebraic sum of the rate laws of the reactions that produce or consume the chemical species (positive when it is produced, negative when consumed). There is formally no difference between a biochemical network and an enzyme reaction mechanism, as both conform to this description. It is possible (though perhaps not desirable) to represent an entire biochemical network through elementary reactions, as was done in the past (Chance et al., 1960), but soon shown to be impractical and unnecessary (Rhoads et al., 1968). For the purposes of systems biology studies it suffices to represent each enzyme-catalyzed reaction as a single step and associate with it an appropriate integrated rate law. It is debatable whether the rate laws even need to be based on a mechanism, and generic rate laws have been proposed for this purpose (Liebermeister and Klipp, 2006). The systems biologist should be cautioned, though, that mechanistic details may indeed affect the dynamics, as is the case with competitive versus uncompetitive inhibitor drugs (Cornish-Bowden, 1986; Westley and Westley, 1996).
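To make the ODE formulation concrete, the following minimal sketch (ours, not part of the original chapter; the two-reaction chain, rate laws, and parameter values are invented for illustration) integrates a small network with scipy, writing each derivative as the signed sum of the rate laws that produce or consume the species:

```python
# Sketch of a biochemical network as coupled ODEs: S -> X -> P, with each
# reaction assigned an irreversible Michaelis-Menten rate law. All names
# and numbers here are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

V1, Km1 = 1.0, 0.5   # hypothetical parameters of reaction 1 (S -> X)
V2, Km2 = 0.8, 0.3   # hypothetical parameters of reaction 2 (X -> P)

def rhs(t, y):
    s, x, p = y
    v1 = V1 * s / (Km1 + s)      # rate law of reaction 1
    v2 = V2 * x / (Km2 + x)      # rate law of reaction 2
    # Each ODE is the algebraic sum of producing (+) and consuming (-) rates
    return [-v1, v1 - v2, v2]

sol = solve_ivp(rhs, (0.0, 20.0), [2.0, 0.0, 0.0],
                t_eval=np.linspace(0.0, 20.0, 5))
print(sol.y)   # concentrations of S, X, P at the requested time points
```

A full enzyme mechanism, written as its elementary steps with mass action kinetics, would be expressed in exactly the same way, which is why the same software serves both purposes.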
2.1. Standards in computational systems biology

A major force driving computational systems biology has been the establishment of standards for various aspects of modeling. The systems biology markup language (SBML) (Hucka et al., 2003) is a standard format to encode the information required to express a biochemical network model, including its kinetics. SBML is represented in the extensible markup language (XML), which is itself a standard widely adopted on the Internet. Although it could appear that a simple common format would not be terribly significant, the creation and subsequent development of SBML has resulted in the formation of a vibrant community of researchers that has passed critical mass. The consequence is that there are now several compatible software packages to model biochemical networks. Some are generic and provide many algorithms, while others are more specialized. Importantly, all of these are compatible in the sense that they can read and write models in a way that allows researchers to use them without hindrance. This includes not only simulators (Hoops et al., 2006) but also packages for graphical depiction of networks (Funahashi et al., 2003), databases of reactions and kinetic parameters (Rojas et al., 2007), and tools for network analysis and data visualization (Kohler et al., 2006; Shannon et al., 2003), and so on.
In some cases these packages can even work in a more integrated way, such as the SBW suite (Sauro et al., 2003), or CellDesigner and COPASI. SBML has also been a source of innovation, as the specification has covered modeling methods that were not previously supported very well or at all. Models represented in SBML can be based on ODEs, algebraic equations, stochastic kinetics, and discrete events. Beyond SBML, there are also standards for how to report models and their simulations (MIRIAM; Le Novère et al., 2005) and for the graphical representation of networks and models (SBGN; Le Novère et al., 2009). An ontology for systems biology is being developed, a large section of which covers enzyme kinetics terms. Finally, there are emerging standards for specifying modeling procedures (MIASE) and data (SBRML). All of these could also be useful to some extent to enzyme kinetics, and the software that has resulted from them is definitely useful indeed.
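As a small illustration of what this interoperability buys in practice, the sketch below (our example, not from the chapter) uses the python-libsbml bindings to read a hypothetical SBML file and print each reaction with its kinetic law; any SBML-compatible simulator could then integrate the same model. The file name is a placeholder, and we assume the standard libsbml Python API.

```python
# Read an SBML model and list its reactions with their kinetic laws.
# Assumes the python-libsbml package; 'model.xml' is a placeholder file.
import libsbml

doc = libsbml.readSBML("model.xml")
if doc.getNumErrors() > 0:
    doc.printErrors()                 # report parsing problems, if any
else:
    model = doc.getModel()
    for i in range(model.getNumReactions()):
        reaction = model.getReaction(i)
        law = reaction.getKineticLaw()
        formula = law.getFormula() if law is not None else "(no kinetic law)"
        print(reaction.getId(), ":", formula)
```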
2.2. COPASI: A biochemical modeling and simulation package

COPASI (Hoops et al., 2006) is an open source biochemical network modeling and simulation software package that we (PM and SH) have been developing with our colleagues Ursula Kummer and Sven Sahle (University of Heidelberg) and many coworkers. COPASI has implemented almost all of the features described in SBML (with the single exception of explicit time delays). It contains algorithms for simulation through ODEs, algebraic equations, the stochastic simulation algorithm of Gillespie (1977) and derivatives, and discrete events. It also allows several of these to be mixed in a single simulation. COPASI further includes a number of algorithms for stoichiometric analyses, systematic parameter scanning or Monte Carlo sampling, metabolic control analysis and generic sensitivity analysis, time scale and stability analysis, optimization, and parameter estimation. Of greatest relevance to the present topic are sensitivity analysis and parameter estimation. COPASI is available free of charge for nonprofit research.

The COPASI user represents a biochemical network model through the language of biochemistry, while the software internally constructs the appropriate mathematical representation (the user is able to check it if needed). As indicated earlier, models could consist of the elementary reactions of an enzyme-catalyzed mechanism or use integrated rate laws (of which there are several predefined). Thus, COPASI is also useful for modeling enzyme catalysis itself.

The parameter estimation infrastructure of COPASI is fairly sophisticated, allowing the use of data from several different experiments that can even be of different types (e.g., time courses or steady-state measurements) and be stored across several files. COPASI currently uses a least-squares approach, whereby the sum of the squared residuals between the data and the model is minimized.
The sum of squares can be constructed over several variables, which are scaled appropriately (such that all contribute equally to the total sum). The number and type of parameters to be estimated are unrestricted by the software. The minimization can be subject to arbitrary nonlinear constraints on any feature of the model. The approach used in COPASI follows the framework of Mendes and Kell (1998), in which a number of different nonlinear optimization algorithms can be used to minimize the sum of squares. These can be run as alternatives to each other or in sequence (Rodriguez-Fernandez et al., 2006).

The obvious application of COPASI's parameter estimation engine to enzyme kinetics is progress curve analysis. This is fairly straightforward and requires only (1) entering the relevant reactions and rate laws in the model (a single overall reaction following an integrated rate law, or a series of elementary reactions following mass action kinetics); (2) setting up the link between the data and the model, by identifying which elements of the model the columns in the data file represent; (3) selecting which parameters are to be estimated, their boundaries (if any), and whether the fit is to be independent for each experiment or global to all experiments; and (4) selecting an algorithm for minimization. In addition to progress curves, COPASI is also useful for initial rate analysis, being able to carry out the two steps needed for this approach: determination of initial rates and nonlinear regression to the appropriate rate law.
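Written out explicitly, the objective just described has the general weighted least-squares form below. The notation is ours, not reproduced from COPASI's documentation: p is the vector of estimated parameters, k indexes the measured variables, i indexes the data points, and w_k is the per-variable scaling weight.

$$ \mathrm{SSQ}(\mathbf{p}) = \sum_{k}\sum_{i} w_{k}\,\big( y^{\mathrm{exp}}_{k,i} - y^{\mathrm{sim}}_{k,i}(\mathbf{p}) \big)^{2} $$

The weights serve exactly the role described above: variables measured on different scales are normalized so that each contributes comparably to the total sum.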
3. Yeast Triosephosphate Isomerase (EC 5.3.1.1)

One of the objectives of the Manchester Centre for Integrative Systems Biology is to demonstrate the feasibility of the bottom-up approach by applying it to the metabolism of the yeast Saccharomyces cerevisiae. To achieve this we established a range of experimental and computational methodologies that consist of purification of proteins, kinetic assays, measurement of enzyme concentrations through targeted mass spectrometry, measurement of metabolite levels by GC–MS and LC–MS, and computational workflows to manage and analyze data. Here we use the yeast enzyme triosephosphate isomerase (EC 5.3.1.1) to illustrate the procedures discussed in the remainder of the chapter; many other enzymes are being analyzed in our pipeline.

Protein production and purification is based on the MORF mutant collection (Gelperin et al., 2005), composed of yeast strains that each overexpress a single one of the proteins of the yeast genome (for other proteins we also use the TAP mutant collection; Ghaemmaghami et al., 2003). Yeast cultures are grown in raffinose medium and then switched to galactose to trigger the overexpression of the protein of interest.
MORF proteins carry C-terminal tags that allow affinity purification using IgG and nickel. While the majority of the MORF tag is cleaved off, a small 6×His peptide is still left at the C terminus at the end of the purification. This might affect the kinetics of these enzymes, and ideally we would prefer to obtain native enzymes; however, this would require devising new constructs, which is presently beyond the scope of our work. Aliquots of the purified protein are stored at −20 °C in MES (2-[N-morpholino]-ethanesulfonic acid) buffer at pH 6.5, as used in the kinetic assays.

In this scenario, where hundreds of proteins are being assayed, it is important to standardize the assay conditions and to process them at as high a throughput as possible. Thus, we have settled on running spectrophotometric assays monitoring the consumption or production of NADH or NADPH, using one or more coupling reactions where needed. Assays are carried out with a NOVOstar plate reader in 384-well format plates with a reaction volume of 60 μl. A reaction buffer consisting of 100 mM MES (2-[N-morpholino]-ethanesulfonic acid), pH 6.5, 100 mM KCl, and 5 mM MgCl2 was used throughout.

Triosephosphate isomerase (EC 5.3.1.1) was isolated from the MORF strain overexpressing the gene TPI1 as described above. The kinetics of the purified enzyme were then determined in both reaction directions by coupling to glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12) or glycerol 3-phosphate dehydrogenase (EC 1.1.1.8). The forward reaction was measured according to Krietsch (1975) with slight modifications. The reaction mixture contained 1 mM NAD+, 1 mM EDTA, 120 mM DTT, 4 mM sodium arsenate, and 2.5 U glyceraldehyde 3-phosphate dehydrogenase in the reaction buffer at various concentrations of glycerone phosphate (DHAP). The overall reaction scheme considered is:

    DHAP → G3P
    G3P + NAD+ + arsenate → NADH + 3PG                    (22.1)
The reverse reaction was measured in the reaction buffer based on Bergmeyer et al. (1974) with minor modifications, with 8.5 U/ml glycerol 3-phosphate dehydrogenase and 0.15 mM NADH at various concentrations of glyceraldehyde 3-phosphate (G3P). The overall reaction considered is:

    G3P → DHAP
    DHAP + NADH → NAD+ + Gol3P                            (22.2)
In both cases the enzyme was incubated in the reaction mixture, the reactions were started by the addition of DHAP or G3P, and absorbance was collected every 19 s for 4731 s (nearly 80 min).
4. Initial Rate Analysis

In the early days of enzymology, when there were no computational aids for calculations, Michaelis and Menten (1913) proposed to determine the kinetics of enzymes by measuring initial rates of reaction. This had the advantage of simplifying calculations, as there is no product accumulation to consider. The methodology proceeds by determining progress curves at different concentrations of substrate and estimating the rate at t = 0. These data are then used to estimate the kinetic parameters by regression on the rate equation. In the case of Henri–Michaelis–Menten kinetics it is also possible to estimate these parameters by simple linear regression using transformations of the rate law (Lineweaver and Burk, 1934) or by graphical methods (Eisenthal and Cornish-Bowden, 1974). However, with the widespread availability of computers, this is now recognized to be best carried out through nonlinear regression. Several software packages exist that are capable of carrying out this type of regression, including DynaFit (Kuzmic, 1996), described elsewhere in this volume.

Several authors have criticized the (unfortunately still widespread) practice of determining the initial rate by linear regression of the "linear" part of the curve. Of course the curve has no linear part, and regression of a set of initial data points results in underestimating the rate (Duggleby, 1985). A better approach is to fit the parameters of a hyperbola to each progress curve and then use the corresponding initial substrate concentration to obtain the rate at t = 0:

$$ v_0 = \frac{V^{\mathrm{app}} S_0}{K_m^{\mathrm{app}} + S_0} \qquad (22.3) $$
where the parameters of the hyperbola, V^app and Km^app, are only gross estimates of V and Km but nevertheless allow an accurate estimation of the initial rate through Eq. (22.3). This procedure is easy to carry out with COPASI, which can estimate all of the initial rates in one step. Essentially one enters reaction scheme (22.1) or (22.2), assigns the irreversible Henri–Michaelis–Menten rate law to the reaction of interest and mass action kinetics to the coupling reaction (a more complex rate law could also be used, but if the assay was designed correctly the linking enzyme should be operating in conditions near first-order kinetics). An algebraic equation needs to be added to the model to express the conversion of absorbance units to the concentration of NADH:

$$ \mathrm{Abs}_{340\,\mathrm{nm}} = [\mathrm{NADH}]\,\varepsilon + \mathrm{offset} \qquad (22.4) $$

where Abs340nm is a new variable in the model, ε (a constant) is the molar absorptivity coefficient of NADH (in our case multiplied by the path length, which we calibrated to be 0.43 cm in the 384-well plate), and offset is another constant needed to adjust for the initial absorbance.
The data file is organized by rows representing each time-dependent reading, with columns containing the values of time, the initial concentrations of DHAP and G3P, the initial absorbance (offset), and the absorbance measured; an empty line separates one time course from the next. These data can easily be formatted by the multiplate reader software or with a simple (automated) script. Once this file is mapped to the appropriate model elements in COPASI, one selects the parameters to estimate (in this case V and Km for the enzyme of interest, as well as the rate constant representing the rate of the coupling enzyme reaction). Finally, one needs to choose an optimization method and run the minimization. Here we applied the SRES algorithm (Runarsson and Yao, 2000) followed by Levenberg–Marquardt (Levenberg, 1944; Marquardt, 1963), as suggested by Rodriguez-Fernandez et al. (2006). This is easily done by setting COPASI to update the model with the result of the estimation and then simply running the LM algorithm from where SRES finished.

Application of this method to the data of TPI's forward reaction yields a set of V^app and Km^app values, which are then used to calculate initial rates (in a spreadsheet, applying Eq. (22.3)). Note that while the absorbance is very well fit, many of the V^app and Km^app values are poor estimates of V and Km. The second step takes the initial rates already estimated and the corresponding initial substrate concentrations and fits them to the Michaelis–Menten equation. This is carried out in COPASI in a new model similar to the first, but in which we fixed the concentrations of the substrate and product and associated the measured initial rates with the steady-state rate of the TPI reaction in the model (the coupling reaction is no longer needed). The results are depicted in Fig. 22.1 and the final estimates for the parameter values are Km = 6.4265 ± 0.18582 mM and V = 8.5267 × 10⁻⁴ ± 7.1161 × 10⁻⁶ mM s⁻¹.

A similar procedure was repeated with the data for the reverse reaction, and it was observed at the end of the first stage that strong substrate inhibition was taking place (Fig. 22.2). This meant that a different rate law needed to be used in the second step. First, we attempted to fit the initial rates to the substrate inhibition rate law that is derived when a second molecule of substrate binds the enzyme–substrate complex (and forms a nonproductive complex):
$$ v = \frac{V S}{K_m + S\left(1 + \frac{S}{K_i}\right)} \qquad (22.5) $$
however, the software was not able to provide a good fit, even after applying global optimization algorithms (all of those available in COPASI).
Figure 22.1 Initial rate analysis of the forward reaction of triosephosphate isomerase (EC 5.3.1.1). (A) Independent fits to the time courses from which the initial rates were determined (crosses are data points, solid lines are the fitted curves). (B) Nonlinear regression of kinetic parameters on the initial rate data; the positions of Km and V are indicated by dashed lines.
Figure 22.2 Initial rate analysis of the reverse reaction of triosephosphate isomerase (EC 5.3.1.1), displaying strong substrate inhibition. (A) Independent fits to the time courses from which the initial rates were determined (crosses are data points, solid lines are the fitted curves). (B) Nonlinear regression of kinetic parameters on the initial rate data; the positions of Km and V are indicated by dashed lines.
Therefore, we attempted a rate law where the substrate inhibition term is raised to the fourth power:

$$ v = \frac{V S}{K_m + S\left(1 + \frac{S}{K_i}\right)^4} \qquad (22.6) $$
and this provided a very good fit to the data (Fig. 22.2). Mechanistic enzymologists would not normally use such an equation without identifying a mechanism that explains it. However, for our purpose of building a network model, this rate law is perfectly acceptable. With the network model we identify, using sensitivity analysis, which steps have a strong effect on other parts of the network, and those steps that do indeed have high levels of control are then chosen for a further, more thorough kinetic analysis. That means that TPI could be examined further if it turns out to have a strong effect on the rest of the network model; if it does not, then it is not important to identify a more accurate rate law. The most important feature for building a bottom-up biochemical network model is that the relation between the concentrations of the effectors and the rate be accurate; the underlying mechanism is secondary.
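As a rough illustration of the second step of this procedure outside COPASI, the sketch below fits Eq. (22.6) to a set of initial rates by nonlinear least squares with scipy. The (S0, v0) pairs stand in for the output of the per-curve hyperbola fits converted through Eq. (22.3); they are invented numbers, not the chapter's data.

```python
# Step two of the initial-rate analysis: nonlinear regression of the
# substrate-inhibition rate law, Eq. (22.6), on (S0, v0) pairs. The rates
# below are invented stand-ins for values produced by step one.
import numpy as np
from scipy.optimize import curve_fit

S0 = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 40.0, 60.0])          # mM G3P
v0 = np.array([1.6e-4, 2.6e-4, 3.3e-4, 2.3e-4,
               9.3e-5, 2.2e-5, 7.5e-6])                          # mM/s

def rate_law(S, V, Km, Ki):                                      # Eq. (22.6)
    return V * S / (Km + S * (1.0 + S / Ki) ** 4)

popt, pcov = curve_fit(rate_law, S0, v0, p0=[1e-3, 5.0, 30.0])
perr = np.sqrt(np.diag(pcov))              # standard errors of the estimates
for name, val, err in zip(("V", "Km", "Ki"), popt, perr):
    print(f"{name} = {val:.3g} +/- {err:.2g}")
```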
5. Progress Curve Analysis

What made progress curves problematic early in the history of enzyme kinetics is now what makes them very attractive: they combine information from the forward and reverse reactions. Thus, progress curves contain more information than initial rates, and because of that one may be able to estimate kinetic parameters from a smaller number of samples than with the initial rate approach. The main difficulty of progress curves stems from the need to integrate the ODEs, since the progress curve is an explicit relation between concentrations (rather than rates of change) and time. This is not a problem, however, for biochemical simulation software packages, which are equipped with integrators that can deal with a very wide range of initial value problems. In particular, those that also incorporate minimization algorithms, such as COPASI, are able to carry out progress curve analysis directly.

To carry out this type of analysis the model must include the reversible reaction, since immediately after the start there will be molecules of both substrate and product present and therefore the two reactions happen simultaneously. It is also important to include the coupling reactions, when they are used. Actually, it would be beneficial to include the full kinetic details of the coupling enzyme(s), as it is likely that at some point in a time course they are no longer operating in optimal conditions. But if the assay is designed carefully then the linking reaction can be represented with a fast mass action rate law.
While it is possible to obtain estimates of all parameters of a rate law from a single time course, those estimates are poor. A much more robust method is to perform a global analysis in which the same set of parameter values must fit all of the time courses measured. Setting up such a procedure in COPASI is similar to the first step of the initial rate analysis described above: the data file must contain all of the trajectories, with columns for all of the metabolites whose concentrations were changed plus the variables measured. It is better to use the measured signals (absorbance, fluorescence intensity, etc.) and to include in the model the equations that transform them into concentrations. This allows factors that are included in such equations to be adjusted as part of the fit, if needed. For the example of yeast TPI, we included all of the time courses up until the point when the absorbance reaches 3.25, where the detector saturates and no longer provides a linear relation between signal and concentration. We also removed obvious outliers, in this case an absorbance curve with a negative slope that should have been positive. At this stage we need to consider which rate law to use, since through the initial rate analysis we already identified that it should contain substrate inhibition by G3P. The solution is to use either
$$ v = \frac{\frac{V_f S}{K_{ms}} - \frac{V_r P}{K_{mp}}}{\left(1 + \frac{S}{K_{ms}} + \frac{P}{K_{mp}}\right)\left(1 + \frac{S}{K_i}\right)^4} \qquad (22.7) $$
for the reaction of G3P to DHAP, or

$$ v = \frac{\frac{V_f S}{K_{ms}} - \frac{V_r P}{K_{mp}}}{\left(1 + \frac{S}{K_{ms}} + \frac{P}{K_{mp}}\right)\left(1 + \frac{P}{K_i}\right)^4} \qquad (22.8) $$
for the reaction of DHAP to G3P. The reaction assays used here were planned exclusively for initial rate analysis and could be optimized further for progress curve analysis. For example, the levels of NAD and especially NADH used were fairly low, partly to ensure the linking enzyme was operating close to first order, but also because NADH strongly absorbs light. However, its initial concentration could still be increased two- or threefold in order to allow the reaction to proceed further. Ideally, one would like progress curves to reach close to equilibrium, as in this way each curve carries more information about the reverse reaction parameters. In the example presented here, the progress curves of the forward reaction have little information about the reverse, and vice versa. Figures 22.3 and 22.4 represent the progress curves of the forward and reverse reactions and their fits. It is clear by eye that the reverse direction is a worse approximation overall; a plot of residuals is not needed to reveal this (though that is usually the best way to assess the quality of a fit).
Figure 22.3 Progress curve analysis of the forward reaction of triosephosphate isomerase (EC 5.3.1.1). All curves were fit simultaneously to Eq. (22.8) and consequently share the same values of its kinetic parameters.
Figure 22.4 Progress curve analysis of the reverse reaction of triosephosphate isomerase (EC 5.3.1.1). All curves were fit simultaneously to Eq. (22.7) and consequently share the same values of its kinetic parameters.
A summary of all results obtained in the initial rate analysis and in the two progress curve analyses is presented in Table 22.1. If we take the parameters obtained by initial rate analysis as the most reliable, then one can conclude that the two progress curve analyses were able to obtain some parameter values in the correct range, but not those of the reaction in the opposite direction. This is because the progress curves ended quite far from equilibrium; had the assays been designed for progress curve analysis, those estimates would likely have been better. Despite this, the estimates for the substrate inhibition constant of G3P are quite consistent across the three methods.

The hardest kinetic data to find published are parameters for the reverse reaction (the direction that is less favorable thermodynamically). Progress curves are ideal for revealing some information about the parameters of the reverse reaction when it is not feasible to run assays in that direction. This is a problem in reactions that have a strong energetic drive in one direction (such as many kinases) but also when the products of the reaction are not available commercially. Obviously, when possible one should run the reaction in both directions, as the data obtained that way are of higher quality (irrespective of being based on initial rates or progress curves); but when this is not possible the availability of progress curve data is a much appreciated gift to the modeler.

Table 22.1 Summary of kinetic parameters of yeast triosephosphate isomerase (EC 5.3.1.1) determined by initial rate and progress curve analyses. Values are estimates ± standard deviations (coefficient of variation in brackets); Km and Ki in mM, V in mM s⁻¹.

Parameter   Initial rate                          Progress curves (forward)            Progress curves (reverse)
Km, DHAP    6.43 ± 0.186 (2.89%)                  8.82 ± 1.19 (13.5%)                  2.70 × 10⁻³ ± 0.759 × 10⁻³ (28.1%)
Km, G3P     5.25 ± 0.635 (12.1%)                  10.4 ± 0.925 (8.90%)                 9.21 × 10⁻³ ± 2.36 × 10⁻³ (25.6%)
Ki, G3P     35.1 ± 1.07 (3.06%)                   16.0 ± 1.10 (6.84%)                  25.3 ± 0.528 (2.08%)
Vf          0.853 × 10⁻³ ± 7.12 × 10⁻⁶ (0.835%)   0.938 × 10⁻³ ± 0.127 × 10⁻³ (13.6%)  1.00 × 10⁻⁸ ± 6.17 × 10⁻⁷ (6150%)
Vr          0.446 × 10⁻³ ± 22.4 × 10⁻⁶ (5.03%)    1.00 × 10⁻⁸ ± 8.75 × 10⁻⁷ (8750%)    1.20 × 10⁻³ ± 0.108 × 10⁻³ (9.02%)
6. Concluding Remarks

Systems biology has many innovative experimental and computational technologies that are revolutionizing research. But it is also creating a stronghold for a technology that is very well established and has a strong theoretical background: enzyme kinetics. In our own laboratory we have embarked on a large-scale effort to obtain enzyme kinetic data for the purpose of constructing models of metabolism. The objective, however, is clearly to learn more about how cells work by means of computational models, and not about the mechanisms of catalysis, except when these reveal themselves to be important to cellular function.

Computational systems biology has made considerable advances recently and appears poised to enter an exponential growth phase, fueled by a strong community that grew out of the standardization efforts. The technologies of the semantic Web are already impacting this field and more is to be expected (Kell and Mendes, 2008). Computational modeling and simulation software is becoming more and more sophisticated, allowing computations that would have been unthinkable only a couple of decades ago. These advances are also benefiting enzyme kinetic data analysis, and we foresee a time when the concept of "gene function" becomes synonymous with the kinetics of its protein product embedded in the cellular biochemical network.
ACKNOWLEDGMENTS

We are grateful to many colleagues for discussions about this topic, in particular Neil Swainston, Juergen Pahle, and Douglas B. Kell. COPASI is a collaborative project with Ursula Kummer and Sven Sahle (University of Heidelberg). PM and SH thank the National Institute for General Medical Sciences for financial support (R01 GM080219), PM and NM thank the BBSRC and EPSRC for funding the MCISB (BB/C008219/1), and PM and HM thank the BBSRC for funding through grant BB/F003501/1. This is a contribution from the Manchester Centre for Integrative Systems Biology.
REFERENCES

Bergmeyer, H. U., et al. (1974). Enzymes as biochemical reagents. In "Methods of Enzymatic Analysis" (H. U. Bergmeyer, ed.), Vol. I, pp. 425–522. Academic Press, New York, NY.
Chance, B., et al. (1960). Metabolic control mechanisms. V. A solution for the equations representing interaction between glycolysis and respiration in ascites tumor cells. J. Biol. Chem. 235, 2426–2439.
Cornish-Bowden, A. (1986). Why is uncompetitive inhibition so rare? A possible explanation, with implications for the design of drugs and pesticides. FEBS Lett. 203, 3–6.
Duggleby, R. G. (1985). Estimation of the initial velocity of enzyme-catalysed reactions by non-linear regression analysis of progress curves. Biochem. J. 228, 55–60.
Eisenthal, R., and Cornish-Bowden, A. (1974). The direct linear plot. A new graphical procedure for estimating enzyme kinetic parameters. Biochem. J. 139, 715–720.
Funahashi, A., et al. (2003). CellDesigner: A process diagram editor for gene-regulatory and biochemical networks. Biosilico 1, 159–162.
Gelperin, D. M., et al. (2005). Biochemical and genetic analysis of the yeast proteome with a movable ORF collection. Genes Dev. 19, 2816–2826.
Ghaemmaghami, S., et al. (2003). Global analysis of protein expression in yeast. Nature 425, 737–741.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361.
Herrgård, M. J., et al. (2008). A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat. Biotechnol. 26, 1155–1160.
Hoops, S., et al. (2006). COPASI: A complex pathway simulator. Bioinformatics 22, 3067–3074.
Hucka, M., et al. (2003). The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics 19, 524–531.
Kell, D. B., and Mendes, P. (2008). The markup is the model: Reasoning about systems biology models in the Semantic Web era. J. Theor. Biol. 252, 538–543.
Kohler, J., et al. (2006). Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics 22, 1383–1390.
Krietsch, W. K. (1975). Triosephosphate isomerase from yeast. Methods Enzymol. 41, 434–438.
Kuzmic, P. (1996). Program DYNAFIT for the analysis of enzyme kinetic data: Application to HIV proteinase. Anal. Biochem. 237, 260–273.
Le Novère, N., et al. (2005). Minimum information requested in the annotation of biochemical models (MIRIAM). Nat. Biotechnol. 23, 1509–1515.
Le Novère, N., et al. (2009). The systems biology graphical notation. Nat. Biotechnol. 27, 735–741.
Levenberg, K. (1944). A method for the solution of certain nonlinear problems in least squares. Quart. Appl. Math. 2, 164–168.
Liebermeister, W., and Klipp, E. (2006). Bringing metabolic networks to life: Convenience rate law and thermodynamic constraints. Theor. Biol. Med. Model. 3, 41.
Lineweaver, H., and Burk, D. (1934). The determination of enzyme dissociation constants. J. Am. Chem. Soc. 56, 658–666.
Marquardt, D. W. (1963). An algorithm for least squares estimation of nonlinear parameters. SIAM J. 11, 431–441.
Mendes, P., and Kell, D. (1998). Non-linear optimization of biochemical pathways: Applications to metabolic engineering and parameter estimation. Bioinformatics 14, 869–883.
Michaelis, L., and Menten, M. L. (1913). Die Kinetik der Invertinwirkung. Biochem. Z. 49, 333–369.
Rhoads, D. G., et al. (1968). A method of calculating time-course behavior of multi-enzyme systems from the enzymatic rate equations. Comput. Biomed. Res. 2, 45–50.
Rodriguez-Fernandez, M., et al. (2006). A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83, 248–265.
Rojas, I., et al. (2007). Storing and annotating of kinetic data. In Silico Biol. 7, S37–S44.
Runarsson, T., and Yao, X. (2000). Stochastic ranking for constrained evolutionary optimization. IEEE Trans. Evol. Comput. 4, 284–294.
Sauro, H. M., et al. (2003). Next generation simulation tools: The Systems Biology Workbench and BioSPICE integration. Omics 7, 355–372.
Schilling, C. H., et al. (1999). Metabolic pathway analysis: Basic concepts and scientific applications in the post-genomic era. Biotechnol. Prog. 15, 296–303.
Shannon, P., et al. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504.
Westley, A. M., and Westley, J. (1996). Enzyme inhibition in open systems. Superiority of uncompetitive agents. J. Biol. Chem. 271, 5347–5352.
CHAPTER TWENTY-THREE
Fitting Enzyme Kinetic Data with KinTek Global Kinetic Explorer

Kenneth A. Johnson
Department of Chemistry and Biochemistry, Institute for Cell and Molecular Biology, University of Texas, Austin, Texas, USA

Contents
1. Background
2. Challenges of Fitting by Simulation
3. Methods
   3.1. Defining the model
   3.2. Defining each experiment
   3.3. Defining output factors
   3.4. A note on units
   3.5. Information content of data
   3.6. A note on statistics
4. Progress Curve Kinetics
5. Fitting Full Progress Curves
   5.1. Error analysis
6. Slow Onset Inhibition Kinetics
7. Summary
Acknowledgments
References
Abstract

KinTek Global Kinetic Explorer software offers several advantages in fitting enzyme kinetic data. Behind the intuitive graphical user interface lie fast and efficient algorithms to perform numerical integration of rate equations, so that kinetic parameters or starting concentrations can be scrolled while the time dependence of the reaction is dynamically updated in the graphical display. This immediate feedback between the model and the output provides a powerful tool for learning kinetics, for exploring the complex relationships between rate constants and the observable signals, and for fitting data. Dynamic simulation provides an easy means to obtain starting estimates for kinetic parameters before fitting by nonlinear regression and for exploring parameter space after a fit is achieved. Moreover, the fast algorithms for numerical integration allow for the brute force computation of confidence contours to provide reliable estimates of the range over which parameters can vary, which is especially important because it reveals when parameters are not well constrained.
#
2009 Elsevier Inc. All rights reserved.
601
602
Kenneth A. Johnson
the brute force computation of confidence contours to provide reliable estimates of the range over which parameters can vary, which is especially important because it reveals when parameters are not well constrained. As illustrated by several examples outlined here, standard nonlinear regression methods fail to detect when parameters are not constrained by the data and generally produce standard error estimates that are extremely misleading. This brings forth an important distinction between a ‘‘good’’ fit where a minimum chi2 is achieved and one where all variable parameters are well constrained based upon sufficient information content of the data. These concepts are illustrated by example in fitting full progress curve kinetics and in fitting the time dependence of slow-onset inhibition.
1. Background

Fitting kinetic data based upon numerical integration of rate equations has several advantages over conventional fitting to mathematical functions derived by analytical solution of the rate equations (Barshop et al., 1983; Johnson et al., 2009a,b; Zimmerle and Frieden, 1989). In particular, by fitting primary data directly to a model by computer simulation, all aspects of the data are included in the fitting process, including rates as well as amplitudes of the reactions, without any simplifying assumptions. In contrast, conventional data fitting is dependent upon solution of mathematical expressions to define the time and concentration dependence of the reaction. Solving mathematical expressions usually requires simplifying assumptions that may only be valid to a first approximation. For example, steady-state kinetic methods assume that one can measure an initial velocity without having significant changes in the concentrations of substrate or product, and this restricts data collection to early stages of the reaction where the signal amplitude is low. Integration of differential equations for fitting pre-steady-state kinetic data usually requires construction of a simplified model with no more than two or three kinetically significant steps because of the complexities of the math, producing one exponential phase for each step. In either case, fitting the primary data to measure the rates of reaction is usually followed by subsequent analysis of the concentration dependence of the observed rates. By this process, one fits the data to multiple equations and parameters, some of which are redundant in their information content and many of which are subsequently discarded (e.g., in plotting the concentration dependence of only the observed rate and ignoring the amplitude of a reaction). As an end result, errors are compounded, or worse yet, glossed over in reaching mechanistic conclusions. As a point of contrast, we consider the fitting of data defining the formation of a quinonoid species upon reaction of serine with pyridoxal phosphate in the first step of the beta reaction of tryptophan synthase.
The reaction can be monitored by fluorescence stopped-flow and fit to a simple two-step model (Anderson et al., 1991) to derive all four rate constants:

E + S ⇌ ES ⇌ EA  (k1/k−1, k2/k−2)

By conventional methods, each transient obtained at a different serine concentration was fit to a double exponential function with five unknown variables: two amplitudes, two rates, and an endpoint:

$$Y = A_1 e^{-\lambda_1 t} + A_2 e^{-\lambda_2 t} + C$$

This fitting ignores the relationships between the rates and amplitudes that are inherent in the data set and therefore increases the errors in the process of extracting the two rates, λ1 and λ2. The rates of the fast and slow reaction phases were then plotted as a function of substrate concentration and fitted to equations obtained by solving the differential equations for the two-step reaction. In the end, the data, consisting of transients collected at four different substrate concentrations, were fit to a total of 23 independent parameters, but only three of the rate constants could be estimated from this analysis. A fourth rate constant was estimated by analysis of the reaction amplitudes (Anderson et al., 1991) to define the net equilibrium constant K1K2. Data fitting based upon numerical integration of rate equations overcomes the many limitations of conventional data fitting. In this process, primary data, consisting of the observable signal as a function of time at several substrate concentrations, are fit globally to the model, including appropriate output factors to scale the observable signal to the absolute concentrations of reactants. In the tryptophan synthase example, the data set can be fit directly to the model to derive all four rate constants and two fluorescence scaling factors, where the observed fluorescence was attributable to the formation and decay of the ES complex: $F = F_0 + \Delta F \cdot [\mathrm{ES}]$, as described in detail in Johnson et al. (2009a). Moreover, the full extent to which individual kinetic parameters are constrained by the data was revealed by analysis of the confidence contours derived by monitoring the sum square error as parameters are systematically varied while fitting the data (Johnson et al., 2009b).
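For readers who want to see the conventional analysis in concrete terms, a minimal sketch in Python with SciPy is shown below. The synthetic transient, noise level, and starting guesses are illustrative placeholders, not the tryptophan synthase data.

    # Conventional fitting: one transient, one double exponential, five unknowns.
    import numpy as np
    from scipy.optimize import curve_fit

    def double_exp(t, A1, lam1, A2, lam2, C):
        # Y = A1*exp(-lam1*t) + A2*exp(-lam2*t) + C
        return A1 * np.exp(-lam1 * t) + A2 * np.exp(-lam2 * t) + C

    rng = np.random.default_rng(1)
    t = np.linspace(0.0, 1.0, 200)
    y = double_exp(t, 0.5, 40.0, 0.2, 4.0, 1.0) + rng.normal(0.0, 0.005, t.size)

    popt, pcov = curve_fit(double_exp, t, y, p0=[0.4, 30.0, 0.1, 2.0, 0.9])
    lam_fast, lam_slow = sorted((popt[1], popt[3]), reverse=True)

Note that only lam_fast and lam_slow are carried forward to the secondary plots in the conventional workflow; the fitted amplitudes are discarded, which is exactly the loss of information the text describes.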
2. Challenges of Fitting by Simulation

There are two competing challenges to fitting data based upon computer simulation; namely, a model must be complete enough to provide an adequate description of the underlying mechanism, but not more complex than can be supported by the data. A complete model is required so that the
data fitting is built upon a realistic mechanism without unsupported simplifying assumptions. Even with a realistic minimal model, not all of the rate constants may be known or constrained by the data. Accordingly, one needs to have a good understanding of what can be determined in fitting the data and how to set up the system to extract meaningful information. Most importantly, after a good fit has been obtained, it is essential that the model and the parameter set be carefully evaluated to estimate how well each of the kinetic parameters is constrained by the data. Here, an important distinction must be made between a good fit and well-constrained parameters. A good fit is achieved when the minimum χ² value derived by nonlinear regression reflects the sigma value of the original data (Bates and Watts, 1988). However, if an equally good fit can be achieved with a different set of parameters, then the parameters are not well constrained. In this chapter, the concept of the information content of data will be introduced. That is, how many constants can be determined from a given set of data, and specifically which rate constants are constrained by the data? This is the most important question and one that is so often overlooked in fitting kinetic data to a model. With modern computer programs, it is far too easy to define an overly complex model with many parameters that are not determined by the data. Although one expects that the standard error analysis from nonlinear regression should indicate when parameters are ill-defined, this approach usually fails when fitting multiple parameters that are not well constrained by the data (Johnson et al., 2009b). Thus, one must clearly define what is known, what is not known, and what simplifying assumptions were made to enable the data to be fit. The process of developing a model and fitting experimental data will be illustrated with several examples in this chapter. The confidence contour analysis will be used to show what happens when parameters are not well constrained and how the problems can be overcome by performing additional experiments or by simplifying the model. Although the KinTek Explorer professional version is offered for sale to defray the programming costs, a free student version is available at www.kintek-corp.com, which includes an extensive instruction manual that describes the operation of the program in more detail than can be given here. In addition, each of the examples in this manuscript, illustrating the use of the simulation and data fitting, is also included with the simulation program in the examples folder of the software available online. Many of the concepts explained here are illustrated better by simply running the program, opening the appropriate file, and adjusting the rate constants and output factors in order to see how the curves change in shape. One unique feature of KinTek Explorer is the ability of the user to click with the computer mouse on a rate constant, starting concentration, or signal output factor and to scroll the value up and down while simultaneously observing the changes in the shape of the output curves. This dynamic simulation provides rich feedback to help learn kinetics and to
provide initial estimates of kinetic parameters for fitting by nonlinear regression. Perhaps more importantly, dynamic simulation affords a powerful means to explore parameter space: to see how well individual constants are constrained by the data, to assess whether individual parameters are linked to one another, and to search a wide range of parameter space in looking for alternative values to fit the data. In this short review, details about how to perform simulations and fit data will be given only in general terms, since the manual provided with KinTek Explorer gives the necessary instructions on how to use the software. Rather, the approach of using simulation to fit data will be illustrated by use of examples. In particular, the examples show how tools unique to KinTek Explorer can be used to evaluate the extent to which parameters are constrained by the data and then to use that information to design new experiments to fill in the gaps in knowledge, or to understand how the model must be reduced to be in line with the information inherent in the data.
3. Methods

In fitting kinetic data there is no substitute for a sound understanding of the principles in the design and interpretation of experiments. Nonetheless, by use of kinetic simulators in general and KinTek Explorer in particular, many of the pitfalls in interpretation can be avoided. Every week models are published that are simply not consistent with the data. These errors could be avoided by fitting data using computer simulation because all elements of the data must be consistent with the model to achieve a good fit. Moreover, the simulation program itself serves as a valuable learning tool. Prior to performing any experiments, the user can run a simulation and see what results might be obtained from a given experiment based upon different underlying models, as described below (see ahead to Fig. 23.1). Moreover, one can readily see the effects of changing substrate concentrations or rate constants on the observable outputs. In this way, intuition can be developed that helps a great deal in deciphering more complex kinetic data to divine the simplest model. A simulation is based upon four required elements: a model, a set of starting concentrations of reactants, an observable output function, and a set of rate constants. In using simulation to fit data, one seeks a minimal model and a set of unique rate constants that quantitatively account for the observable data.
3.1. Defining the model

To begin the simulation, the reaction sequence is entered using a simple text description.
[Figure 23.1 appears here: four panels (A–D) plotting [Product] (mM) versus time (s, 0–600).]

Figure 23.1 Progress curve kinetics. Curves were calculated by numerical integration to illustrate the changes in the shape of the curves dependent upon the kinetic parameters. All curves were computed with 1 μM enzyme, 10 mM substrate (unless noted), and the kinetic constants given in Table 23.1. (A) Effect of variable Km (Km = 0.1, 0.5, 1, 2, 5, 10 mM). (B) Effect of product inhibition (k−3 = 0, 1, 2, 5 μM⁻¹ s⁻¹). (C) Effect of reversible chemistry (k−2 = 0, 2, 5, 10, 20 s⁻¹, with k−3 = 5 μM⁻¹ s⁻¹). (D) Variable substrate concentration ([S] = 5, 10, 20 mM); the dotted line shows the simulation with irreversible chemistry and irreversible product release, and other constants given in Table 23.1.
E + S ⇌ ES ⇌ EP ⇌ E + P  (k1/k−1, k2/k−2, k3/k−3)
Scheme 23.1
For example, the reaction in Scheme 23.1 is entered simply as: E + S = ES = EP = E + P. The program then solves the differential equations and sets up the necessary equations for performing the numerical integration to simulate the time dependence of the reaction. Each enzyme species must have a unique, user-defined description consisting of case-sensitive alphanumeric characters plus special characters such as $ and #. Multiple steps of the reaction can be written on one continuous line as long as mass balance is maintained for each reaction. For example, more complex pathways involving two substrates and products, such as EPSP synthase, require multiple lines to maintain mass balance (Anderson et al., 1988):

E + A = EA
EA + B = EAB = EI = EPQ = EQ + P
EQ = E + Q
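As a point of reference, the sketch below shows, in Python with SciPy rather than in KinTek Explorer itself, what solving the rate equations for the text description E + S = ES = EP = E + P amounts to. The rate constants here are illustrative (borrowed from Table 23.2, row a), with units of μM and seconds; this is not the program's internal code.

    # Numerical integration of Scheme 23.1 as a system of ODEs (uM, s units).
    from scipy.integrate import solve_ivp

    k1, km1, k2, km2, k3, km3 = 10.0, 400.0, 180.0, 20.0, 1200.0, 10.0

    def rates(t, c):
        E, S, ES, EP, P = c
        v1 = k1 * E * S - km1 * ES   # E + S <-> ES
        v2 = k2 * ES - km2 * EP      # ES <-> EP
        v3 = k3 * EP - km3 * E * P   # EP <-> E + P
        return [-v1 + v3, -v1, v1 - v2, v2 - v3, v3]

    c0 = [1.0, 2000.0, 0.0, 0.0, 0.0]   # 1 uM enzyme, 2 mM substrate
    sol = solve_ivp(rates, (0.0, 200.0), c0, method="LSODA", dense_output=True)

The stiff LSODA integrator is used here because the binding steps are orders of magnitude faster than the overall turnover, which is the same numerical challenge the fast integration algorithms in the program are designed to handle.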
3.2. Defining each experiment

An experiment is defined by specifying the starting concentrations of the reactants and the signal that is measured, much the same way an experiment is defined in the laboratory. In fact, every aspect of the simulation should mimic the experimental details underlying the original data collection. In the software, the starting concentrations of reactants are entered into a table created by the program based upon the mechanism. In addition, if two or more reactants are allowed to equilibrate before adding additional reactants, that can easily be programmed by using multiple mixing steps. This is valuable because the rate constants governing the initial equilibration are included in the process of fitting the data. For example, in studies on DNA polymerases, we often incubated enzyme with DNA and then added the nucleotide substrate. In some experiments, the data fitting defines the DNA dissociation rate according to constraints imposed both during the pre-incubation phase, which determines the amplitude of the reaction, and during the subsequent phase, where the rate of multiple turnovers is limited by DNA release. A good example of this is in work on the inhibition of HIV reverse transcriptase with nonnucleoside inhibitors, which bind slowly to the enzyme (Spence et al., 1995). Fitting of the original data by simulation is given in the example file HIV_NNRTI.mec, provided with the software.
3.3. Defining output factors

An essential part of the definition of an experiment includes specifying the properties of the output signal. All simulations are performed in absolute concentrations of reacting species. One must then define an output expression that relates concentrations of species to observable signals. In a rapid quench-flow experiment, or another method based upon quenching a sample and quantifying the amount of product formed, the output may be the sum of all species containing the product. For example, for Scheme 23.1, total product will be defined by the sum EP + P because, upon quenching the reaction, product bound to the enzyme will be released. In the case of EPSP synthase, total product Q will be defined by the sum EPQ + EQ + Q. If there is an absorbance change upon conversion of substrate to product, the signal will be defined by the difference in extinction coefficients: a(ES + S) + b(EP + P). On the other hand, if one is monitoring a change in protein fluorescence with different enzyme-bound states, the net signal will be defined by the different fluorescence coefficients for each species: a·E + b·ES + c·EP. In this case, it is often useful to normalize the fluorescence relative to the starting enzyme and include a scaling factor: f(E + b·ES + c·EP). Defining the output expression in this manner helps the user keep track of the relative fluorescence change while fitting data and thereby avoid the pitfall of fitting data with an inordinately large fluorescence coefficient and a correspondingly low concentration of the species. Possible output expressions for Scheme 23.1 include:

Fluorescence: Signal = f(E + b·ES + c·EP)
Burst of product formation: Signal = EP + P
Absorbance of S and P: Signal = a(S + ES) + b(P + EP)

The output coefficients can readily be derived as unknowns during the fitting process, but care must be taken to define the minimal output expression. For example, an output expression such as f(a·E + b·ES + c·EP) is overdefined and has an infinite number of solutions, since any combination of the terms f and a, for example, can give a desired constant.
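These output expressions translate directly into functions of the simulated concentrations, as in the sketch below; the coefficient values are arbitrary examples, not values from the chapter.

    # Output expressions for Scheme 23.1 as functions of simulated species.
    def total_product(EP, P):
        # quench-flow readout: enzyme-bound product is released upon quenching
        return EP + P

    def fluorescence(E, ES, EP, f=1.0, b=1.4, c=0.8):
        # normalized protein fluorescence, f * (E + b*ES + c*EP)
        return f * (E + b * ES + c * EP)

    def absorbance(S, ES, P, EP, a=0.001, b=0.003):
        # extinction-coefficient-weighted sum, a*(S + ES) + b*(P + EP)
        return a * (S + ES) + b * (P + EP)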
3.4. A note on units

Units of time and concentration can be whatever is convenient for the experiments, but there must be consistency in that all concentrations must be in the same units and correspond to the dimensions of the second-order rate constants. Similarly, all rate constants must be entered in the units of time chosen for the experiment. For most enzymes, concentration units of micromolar and time in seconds are most appropriate, such that second-order rate constants are given in units of μM⁻¹ s⁻¹ (10⁶ M⁻¹ s⁻¹). In these units, the diffusion limit for substrate binding is approximately 1000 μM⁻¹ s⁻¹, and a conservative estimate may be 100 μM⁻¹ s⁻¹. First-order rate constants typically range from 0.001 s⁻¹ to 10,000 s⁻¹ for observable enzyme-catalyzed reactions. One can easily adopt different units for time and concentration, and it is advisable to keep entered numbers in the range of 1e−6 to 1e6, in part to avoid round-off errors in the math, but also to afford easier and therefore less error-prone data entry. Even though all math is done in 64-bit double precision, avoiding extremely large or small numbers in data entry will minimize round-off errors.
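For example, a diffusion-limited association rate constant of 1 × 10⁸ M⁻¹ s⁻¹ entered with concentrations in micromolar and time in seconds becomes 100 μM⁻¹ s⁻¹, which keeps the entry comfortably within the recommended 1e−6 to 1e6 range.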
3.5. Information content of data

Understanding the information content of data is important to prevent overinterpretation. KinTek Explorer offers several unique tools to assess whether fitted parameters are well constrained by the data (i.e., whether the model is overly complex). This question is distinct from whether a good fit can be achieved and cannot be answered based upon whether nonlinear regression returns small estimates of standard error. Standard error calculations fail when multiple parameters are underconstrained. Rather, we address these questions by exploring parameter space: scrolling rate constants and observing how strongly the curves depend upon a given parameter, in order to assess whether certain parameters may be linked and whether a very different area of parameter space may contain another good fit to the data. Finally, we also rely upon computation of the confidence contours by quantifying how the total sum square error surface varies as a function of individual parameters. There are currently 50 example files included with the software online that illustrate the use of KinTek Explorer in fitting data from multiple experiments, based largely on transient kinetic data. Here, the program will be illustrated using methods from the field of steady-state kinetics, but where rigorous fitting is greatly facilitated by use of computer simulation; namely, in the analysis of full progress curves and the fitting of slow-onset inhibition.
3.6. A note on statistics

Data fitting by nonlinear regression analysis is based upon finding the minimum sum square error, defined as the sum of the residuals squared:

$$\mathrm{SSE} = \sum_{i=1}^{N} \left( y_i - y(x_i) \right)^2$$

where $y_i$ is the observed data, $y(x_i)$ is the calculated y value for the x value at the ith data point, and N is the number of data points. When the standard deviation (sigma) values for the data are known, then the residuals are normalized by dividing by sigma to compute χ²:

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - y(x_i)}{\sigma_i} \right)^2$$

When the sigma values are not known, it is often assumed that the sigma value is constant for all of the data to enable one to compute an average sigma value:

$$\sigma_{\mathrm{AVE}}^2 = \frac{\chi^2}{N - M} = \frac{\sum_{i=0}^{N-1} \left[ y_i - y(x_i) \right]^2}{N - M}$$

where N is the number of data points and M is the number of parameters being fit to the data. In the examples shown here, we will use these three terms to evaluate goodness of fit. In particular, the calculated average sigma value can be compared to the sigma values input in generating artificial data.
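The three statistics translate directly into code; the sketch below is a plain transcription of the formulas above (not the program's internal implementation), assuming y_obs and y_calc are NumPy arrays and M is the number of fitted parameters.

    # Goodness-of-fit statistics used throughout the examples.
    import numpy as np

    def sse(y_obs, y_calc):
        return np.sum((y_obs - y_calc) ** 2)

    def chi2(y_obs, y_calc, sigma):
        # sigma may be a scalar or a per-point array of standard deviations
        return np.sum(((y_obs - y_calc) / sigma) ** 2)

    def sigma_ave(y_obs, y_calc, M):
        # average sigma when per-point errors are unknown
        return np.sqrt(sse(y_obs, y_calc) / (len(y_obs) - M))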
4. Progress Curve Kinetics

Standard steady-state kinetic analysis is based upon error-prone estimates of initial velocities, which restrict data to the first 10–20% of the reaction and require measuring the initial slope before the reaction starts to become nonlinear. The initial velocities must then be plotted as a function of substrate (and perhaps inhibitor) concentrations and then fit to another set of equations to extract kcat and Km values, all the while being careful to propagate error estimates. These time-consuming methods can be replaced by direct fitting of the primary data to the model using computer simulation (Johnson et al., 2009a). Moreover, once a reaction is started, it can be followed to completion to get the most information from each sample. In particular, analysis of the full progress curve as the reaction goes to completion allows definition of the kcat and Km values for substrate and, in some cases, the Kd for product inhibition, or possibly kcat and Km for the reverse reaction. Fitting the full progress curve is not new; in fact, Michaelis and Menten (1913) fit their data to the integrated form of the rate equation in their landmark 1913 paper. However, the ease and utility of fitting based upon computer simulation finally makes fitting to the full progress curves the preferred method, rather than restricting attention to initial velocities.
In order to understand the information content of the progress curve kinetic data, we begin by analysis of a simple enzyme-catalyzed reaction and compare the effects of different sets of rate constants, as shown in Fig. 23.1, using constants summarized in Table 23.1. This illustrates the use of KinTek Explorer to explore the landscape before doing an experiment, which can aid in defining optimal reaction conditions. If product inhibition is negligible, then the shape of the curvature is determined solely by the substrate concentration dependence of the rate. Figure 23.1A shows the effect of variable Km on the shape of the progress curve. As Km is increased, the curvature becomes more pronounced. Of course, the observed shape is also dependent upon the concentrations of enzyme and substrate, so if the Km is at the lower limit of what is shown in Fig. 23.1A, then the experiment needs to be repeated at a lower enzyme concentration and lower substrate concentration to resolve the curvature. Thus, the first step in fitting data is to collect data in the optimal region of concentration–time space to reveal the underlying parameters. Note, for example, that if one attempted to extract Km from the lowest-Km curve in Fig. 23.1A, one could only place an upper limit on the estimated value of Km. However, in practice, this initial estimate could be used to then design an experiment at lower concentration, optimized to measure the lower Km, a goal that can easily be accomplished using the simulation program. Product inhibition also changes the shape of the curves, as shown in Fig. 23.1B, illustrating the effect of decreasing Kd for product rebinding. A priori, one does not know whether the curvature in the time dependence is due to a higher Km for substrate or a lower Kd for product rebinding. Therefore, experiments must be done at several concentrations of substrate and/or product to resolve the two parameters, a process that can be achieved easily by global fitting of the family of curves, as illustrated below. The variation in the curvature as a function of starting substrate concentration or added product provides the information content to define both the Km for substrate and the Kd for product. If the chemical reaction is reversible (k−2 > 0), then the amplitude of the reaction is also affected (as shown in Fig. 23.1C), which provides the additional information necessary to define the overall equilibrium constant and therefore derive kcat and Km values in both the forward and reverse directions. However, the extent to which k−2 can be defined, based upon forward-rate measurements, is dependent upon the magnitude of its effect on the observed reaction, as described below. As a general rule, it is necessary to measure the full progress curves at several concentrations of starting substrate, as shown in Fig. 23.1D. Here, the fact that the curvature is different at the three substrate concentrations provides information to define product inhibition. For comparison, the dotted lines in Fig. 23.1D show the case where there is no product inhibition, and one can see that the curved portion could be superimposed for each
Table 23.1 Rate constants for computing progress curves^a

Figure     k1 (μM⁻¹ s⁻¹)   k−1 (s⁻¹)   k2 (s⁻¹)   k−2 (s⁻¹)          k3 (s⁻¹)   k−3 (μM⁻¹ s⁻¹)
A          10              Variable    120        0                  10,000     0
B          10              50,000      120        0                  10,000     0, 1, 2, 5
C          10              50,000      120        0, 2, 5, 10, 20    10,000     5
D          10              20,000      120        20                 10,000     5
D (dots)   10              20,000      120        0                  10,000     0

^a Curves displayed in Fig. 23.1 were calculated using Scheme 23.1 and the rate constants summarized here. Variable Km values of 0.1, 1, 2, 5, and 10 mM in Fig. 23.1A were obtained by varying k−1 from 1000 to 100,000 s⁻¹.
of the concentrations. If the kinetics are measured using a coupled enzyme assay so that product does not accumulate, then the data would follow the dotted line, and can be fit to only extract kcat and Km values for the forward reaction without complications due to product inhibition. Global fitting of several progress curves simultaneously allows definition of all relevant kinetic parameters.
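The kind of pre-experiment exploration described in this section can be mimicked outside the program; the sketch below (Python with SciPy, not KinTek Explorer itself) generates families of curves like those in Fig. 23.1D using the constants of Table 23.1, row D, and its irreversible counterpart, with concentrations in μM and time in seconds.

    # Explore progress curve shapes for Scheme 23.1 (uM, s units).
    import numpy as np
    from scipy.integrate import solve_ivp

    def scheme_23_1(k1, km1, k2, km2, k3, km3):
        def rates(t, c):
            E, S, ES, EP, P = c
            v1 = k1 * E * S - km1 * ES   # E + S <-> ES
            v2 = k2 * ES - km2 * EP      # ES <-> EP
            v3 = k3 * EP - km3 * E * P   # EP <-> E + P
            return [-v1 + v3, -v1, v1 - v2, v2 - v3, v3]
        return rates

    t = np.linspace(0.0, 600.0, 400)
    curves = {}
    for S0 in (5000.0, 10000.0, 20000.0):        # 5, 10, 20 mM expressed in uM
        for label, (km2, km3) in {"reversible": (20.0, 5.0),
                                  "irreversible": (0.0, 0.0)}.items():
            f = scheme_23_1(10.0, 20000.0, 120.0, km2, 10000.0, km3)
            sol = solve_ivp(f, (t[0], t[-1]), [1.0, S0, 0.0, 0.0, 0.0],
                            t_eval=t, method="LSODA")
            curves[(S0, label)] = sol.y[3] + sol.y[4]   # total product, EP + P

Overlaying the reversible and irreversible families reproduces the qualitative behavior described above: the curvature differs across substrate concentrations only when product rebinding and reverse chemistry are present.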
5. Fitting Full Progress Curves

In fitting steady-state or full progress curve kinetics, all that can be determined are kcat and Km values, possibly in both the forward and reverse directions depending upon the reversibility of the reaction and the properties of the data. Accordingly, one can only fit the data to extract two or four constants. However, the minimal model (Scheme 23.1) contains three steps and six rate constants. One easy approach is to simply fit to a model with all six rate constants as variable parameters and then calculate kcat and Km values. However, the set of six rate constants will not be unique, and it will not be possible to estimate errors on the kcat and Km values. That is, one could arbitrarily (within some limits) choose another set of six rate constants to fit the data and compute the same kcat and Km values. In fact, it is a useful exercise to fit a given set of data using multiple sets of rate constants and show that the same kcat and Km values are obtained. This was done in analysis of alanine racemase data (Johnson et al., 2009a,b) in order to refute claims that eight rate constants could be extracted from the progress curve data using DynaFit (Johnson et al., 2009a,b; Spies and Toney, 2007; Spies et al., 2004) and subsequent claims of fitting 18 rate constants (Spies and Toney, 2007). The alanine racemase example illustrates how easy it is to be misled in fitting multiple parameters to a data set without carefully considering the distinction between a good fit and one in which parameters are constrained by the data. In order to estimate errors on parameters, simplifications are needed to reduce the number of variables to correspond to the information content of the data. If there is no product inhibition and the reaction is largely irreversible, one can only get kcat and Km for the forward reaction, and one progress curve would be sufficient. Better yet, full progress curves performed at several substrate concentrations, or in the presence and absence of added product, would improve confidence in the parameters. In order to develop a general method for fitting progress curve kinetics, one must allow for the possibility of product inhibition and reversal of the chemical reaction. Therefore, the fitting procedure must provide estimates of kcat and Km in both the forward and reverse directions. This still entails fitting only four constants to a minimal model containing six rate constants.
One method to reduce the number of variable parameters involves setting the second-order rate constants for substrate and product binding at the diffusion limit. Under these conditions, the rates of product release and substrate release are then set to be much greater than kcat, so that k2 and k−2 limit the net rate of turnover in each direction.

E + S ⇌ ES ⇌ EP ⇌ E + P  (k1 = 100 μM⁻¹ s⁻¹, k−1; k2, k−2; k3, k−3 = 100 μM⁻¹ s⁻¹)
Scheme 23.2

By fitting the data to this rapid equilibrium binding model, Km,S = k−1/k1 and kcat = k2 for the forward reaction, and Km,P = k3/k−3 and kcat,rev = k−2. It is important to note that this does NOT imply that the rapid equilibrium binding model necessarily represents a valid description of the elementary rate constants. Rather, it serves only as a tool to extract the steady-state kinetic parameters. To illustrate this approach to fitting progress curve kinetics, artificial data were generated based upon the model shown below. The time course of reaction was simulated, and random noise was added (sigma = 0.02) to generate data at three substrate concentrations, as shown in Fig. 23.2A, using constants shown in Table 23.2.

E + S ⇌ ES ⇌ EP ⇌ E + P  (10 μM⁻¹ s⁻¹/400 s⁻¹; 180 s⁻¹/20 s⁻¹; 1200 s⁻¹/10 μM⁻¹ s⁻¹)

The data were then fit to a model assuming diffusion-limited substrate and product binding steps (fixed at 100 μM⁻¹ s⁻¹) so that only the remaining four rate constants were allowed to float during fitting. The following parameters were derived:

E + S ⇌ ES ⇌ EP ⇌ E + P  ((100 μM⁻¹ s⁻¹)/4780 s⁻¹; 157 s⁻¹/13.9 s⁻¹; 11,400 s⁻¹/(100 μM⁻¹ s⁻¹))

From this model it is easy to calculate that kcat = k2 = 157 s⁻¹ and Km,S = k−1/k1 = 47.8 μM in the forward reaction, and kcat,rev = k−2 = 13.9 s⁻¹ and Km,P = k3/k−3 = 114 μM for the reverse reaction. These are the same steady-state kinetic constants that are calculated from the starting model. Moreover, by limiting the number of parameters used in fitting to correspond to the information content of the data, standard error estimates derived in fitting apply directly to the kcat and Km values. The data can also be fit to the model in which all six rate constants are varied. One such fit is shown below, which yields the same kcat and Km values in each direction.

E + S ⇌ ES ⇌ EP ⇌ E + P  (4.4 μM⁻¹ s⁻¹/241 s⁻¹; 2000 s⁻¹/132 s⁻¹; 178 s⁻¹/1.85 μM⁻¹ s⁻¹)
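The complete workflow just illustrated (generate noisy data from the true model, then fit globally with both binding steps fixed at the diffusion limit) can be sketched as follows. This is a Python/SciPy approximation of the procedure, not KinTek Explorer's code; the random seed and starting guesses are arbitrary, and units are μM and seconds.

    # Generate artificial progress curves and fit them globally to Scheme 23.2.
    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    def simulate(k, S0, t):
        k1, km1, k2, km2, k3, km3 = k
        def rates(_, c):
            E, S, ES, EP, P = c
            v1 = k1 * E * S - km1 * ES
            v2 = k2 * ES - km2 * EP
            v3 = k3 * EP - km3 * E * P
            return [-v1 + v3, -v1, v1 - v2, v2 - v3, v3]
        sol = solve_ivp(rates, (t[0], t[-1]), [1.0, S0, 0.0, 0.0, 0.0],
                        t_eval=t, method="LSODA")
        return sol.y[3] + sol.y[4]                 # observed total product

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 200.0, 100)
    true_k = (10.0, 400.0, 180.0, 20.0, 1200.0, 10.0)   # Table 23.2, row a
    data = {S0: simulate(true_k, S0, t) + rng.normal(0.0, 0.02, t.size)
            for S0 in (2000.0, 5000.0, 10000.0)}        # 2, 5, 10 mM in uM

    def residuals(p):
        km1, k2, km2, k3 = p
        k = (100.0, km1, k2, km2, k3, 100.0)       # binding fixed at 100 uM^-1 s^-1
        return np.concatenate([simulate(k, S0, t) - y for S0, y in data.items()])

    fit = least_squares(residuals, x0=[1000.0, 100.0, 10.0, 5000.0], method="lm")
    kcat, Km_S = fit.x[1], fit.x[0] / 100.0        # kcat = k2, Km,S = k-1/k1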
[Figure 23.2 appears here: panel A plots [Product] (mM) versus time (s, 0–200); panel B plots [Product] (mM) versus time (s, 0–300).]

Figure 23.2 Simultaneous fitting of three progress curves. Simulated curves were calculated according to the constants in Table 23.2, line a, with an enzyme concentration of 1 μM, substrate concentrations of 2, 5, and 10 mM, and with added random errors giving a sigma value of 0.02. Two sets of fitted curves are shown with rate constants summarized in Table 23.2. An optimal global fit required fitting to four parameters according to the original model (solid black line). When an attempt was made to fit the data to a simplified irreversible model (dotted line), only one of the three curves could be fit adequately. (B) The reverse reaction was simulated using 1 μM enzyme, 2 mM product, and a trap to sequester any free substrate, with random noise added (sigma = 0.02).
Thus, multiple sets of parameters can be fit to the data and used to compute the values for kcat and Km. This exercise reinforces what we already know: steady-state kinetic data cannot be used to establish elementary rate constants in an enzyme-catalyzed reaction. One shortcut that has been suggested for fitting full progress curves is based upon reducing the model to a minimal two-step irreversible sequence:

E + S → ES → EP → E + P  (k1; k2; fast, with all reverse rate constants set to zero)
Table 23.2 Kinetic parameters in fitting three progress curves^a

Curve   k1 (μM⁻¹ s⁻¹)   k−1 (s⁻¹)   k2 (s⁻¹)   k−2 (s⁻¹)   k3 (s⁻¹)   k−3 (μM⁻¹ s⁻¹)   Sigma
a       (10)            400         180        20          1200       (10)             0.02
b       (100)           4780        157        13.9        11,400     (100)            0.0204
c       0.0701          (0)         297        (0)         (10,000)   (0)              0.0197^b

^a Three sets of constants were used to derive curves in attempting to fit simultaneously the three progress curves shown in Fig. 23.2. Curve a represents the parameters used to generate the fake data with a sigma value of 0.02 using 1 μM enzyme and 2, 5, and 10 mM substrate. Curve b represents the best fit derived with diffusion-limited binding of substrate and product fixed at 100 μM⁻¹ s⁻¹, as shown by the solid black lines. Curve c shows an attempt to fit the middle progress curve to a simplified irreversible model to derive kcat and Km values, but fails to account for the data derived at lower or higher substrate concentrations. Numbers in parentheses were held fixed during the fitting. Average sigma values were computed from the best fit.
^b In this case, the sigma value was calculated for fitting only the one curve at 5 mM substrate.
With this simplified model, k1 = kcat/Km and k2 = kcat. This approach could work, but only under the limited circumstances where product does not rebind to the enzyme during the approach to the endpoint. Because it is not known a priori whether product inhibition is significant, this approach can be very misleading unless the reactions are examined at several concentrations of substrate. As shown in Fig. 23.2A, data collected at one concentration can be fit using this model (the middle concentration in this example), but one cannot fit all three concentrations simultaneously using this oversimplified model. Because fitting by computer simulation does not require such a potentially misleading oversimplification, and this reduced model offers no advantages, it is not recommended.
5.1. Error analysis

The next step in the analysis is to assess errors on the estimates for each of the rate constants. Standard error analysis based upon the covariance matrix derived during nonlinear regression suggests that each of the rate constants is known with a great deal of certainty, as summarized in Table 23.3. However, confidence contour analysis, which provides a much more robust assessment of the limits on each parameter, suggests that the parameters are not well constrained. Construction and evaluation of FitSpace confidence contours are explained in more detail in Johnson et al. (2009b). In order to construct the confidence contour, individual rate constants are pushed to higher and lower values while allowing all other constants to be adjusted in deriving the best fit. The limits on each constant are then defined by the observed increase in the sum square error that is attributable to constraints on each parameter individually, without any assumptions regarding the values for other constants. A three-dimensional plot is then generated showing the dependence of the sum square error on each pair of parameters. The shape of the surface reveals underlying relationships between parameters, and a fixed threshold in the sum square error surface can be used to define upper and lower limits for each parameter.

Table 23.3 Error estimates on kinetic parameters in fitting three progress curves^a

Source    k−1 (s⁻¹)     k2 (s⁻¹)      k−2 (s⁻¹)    k3 (s⁻¹)
NR        4776 ± 241    156.6 ± 0.2   13.9 ± 0.2   11,380 ± 560
FS-A      14–10,300     155–276       13–3020      570–24,600
FS-A&B    2850–7350     155–159       13.4–14.4    6730–17,800

^a Error estimates were derived while simultaneously fitting the three progress curves shown in Fig. 23.2A. NR, nonlinear regression standard error; FS, FitSpace confidence contour error limits based upon a 10% increase in the sum square error. FS-A is based upon fitting the data in Fig. 23.2A. FS-A&B is based upon fitting the data in Fig. 23.2A and B simultaneously.

Figure 23.3 shows the confidence contours computed for the data shown in Fig. 23.2A fit to Scheme 23.2. The most striking results shown visually are that kcat for the reverse reaction (k−2) is not constrained by the data and that there is a linear correlation between k3 and k−1. The ranges allowed for individual parameters are listed in Table 23.3, row FS-A. Upon seeing these results, the initial reaction of most investigators is disbelief. Nonlinear regression gives very small errors, so how can it be that these constants are so poorly defined?
[Figure 23.3 appears here: pair-wise sum square error contours for k+2, k−2, k−1, and k+3, with SSE thresholds marked at the minimum and at 1.3×, 1.5×, and 2× the minimum.]

Figure 23.3 Confidence contours in fitting progress curves. Confidence contours are shown derived from the fitting of the data in Fig. 23.2A. Red shows the area of best fit, and the yellow band between red and green shows a threshold at which the sum square error increased by 10% over the minimum value. The results show that k−2 has no upper limit and that there is a wide range over which k+3 and k−1 can vary as long as the constant ratio of k+3/k−1 is maintained. The isolated peaks in the k+3 versus k−1 plot result from sampling on a grid, whereas the underlying function should produce a continuous ridge.
The answer is that nonlinear regression grossly underestimates the errors. Another test of whether one might believe the large range over which the rate constants can vary is to overlay on the data all the curves calculated at the extremes of the parameter set. This can be done within the simulation program, but it is difficult to display in print because all of the curves superimpose at the resolution of the figure. This analysis shows that even the most extreme ranges of rate constants still account for the data and produce traces that are largely indistinguishable. A careful reassessment of the experimental design and the parameters that were derived in fitting points to the possible limitations of the data. First, the rate of the reverse reaction is small and contributes negligibly to the observable signal. Therefore, perhaps an experiment should be performed to better define the reverse rate constants. Of course, this is easy when the experiments are done by simulation, but even in the real world, it is often useful to simulate experiments first to see whether they could help to distinguish models. An additional "experiment" was then performed by monitoring the reaction in reverse. In the simulation, the starting conditions contained only the product of the reaction, and the formation of substrate was monitored as a function of time. However, it was immediately recognized that one cannot simply drive the reaction in reverse without the addition of a coupled-enzyme assay to remove substrate. This can be programmed in KinTek Explorer simply by the addition of a trap to sequester substrate or by the full programming of the kinetic properties of the coupled-enzyme assay (Hanes and Johnson, 2008):

E + S ⇌ ES ⇌ EP ⇌ E + P  (k1/k−1, k2/k−2, k3/k−3)
S + trap = S·trap    or    E2 + S ⇌ E2S → E2 + X
The new "data" are shown in Fig. 23.2B. Including these data in the process of global fitting greatly improves confidence in the value of kcat in the reverse direction, as defined by k−2 in the model. Moreover, by increasing confidence in k−2, the range over which k−1 and k3 can vary was also restricted to provide a better global fit to all of the data (Fig. 23.2A and B) fit simultaneously. This is illustrated by the confidence contour shown in Fig. 23.4 and Table 23.3 (row FS-A&B), in which each constant is bounded by an upper and lower limit. In summary, full progress curve kinetic traces can be fit to a simplified model in which substrate and product binding rates are assumed to be diffusion limited only for the sake of extracting kcat and Km values. Simultaneous fitting of data collected at several concentrations is required to test
[Figure 23.4 appears here: pair-wise sum square error contours for k+2, k−2, k−1, and k+3 from the combined forward and reverse fit.]

Figure 23.4 Confidence contours in fitting progress curves for the forward and reverse reactions. Confidence contours are shown from the fitting of the data in Fig. 23.2A and B simultaneously. Colors are as in Fig. 23.3. The results show that all parameters are well constrained.
for and possibly quantify product inhibition. The process of data fitting and refinement is facilitated by careful use of the confidence contours to find gaps in the data that lead to large errors in estimated parameters, which can then be overcome by performing additional experiments.
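A one-dimensional sketch of the brute-force scan that underlies such confidence contours is shown below; it assumes the residuals() function and the best-fit result fit from the earlier global-fitting sketch, and is an approximation of the idea rather than the FitSpace implementation itself.

    # Profile the SSE for one parameter: fix it at a series of values and
    # re-optimize all remaining parameters at each value.
    import numpy as np
    from scipy.optimize import least_squares

    def sse_profile(best, index, values):
        profile = []
        for v in values:
            def residuals_fixed(free):
                p = list(free)
                p.insert(index, v)           # re-insert the fixed parameter
                return residuals(p)
            x0 = [x for i, x in enumerate(best) if i != index]
            result = least_squares(residuals_fixed, x0=x0, method="lm")
            profile.append(float(np.sum(result.fun ** 2)))
        return np.array(profile)

    # e.g., scan k-2 (parameter index 2) over four decades and look for the
    # 10% rise in SSE that defines the reported confidence limits:
    # prof = sse_profile(fit.x, 2, np.geomspace(1.0, 1.0e4, 25))

A parameter whose profile stays flat over decades, as k−2 does when only the forward-reaction data are fit, is unconstrained no matter how small its nonlinear regression standard error appears.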
6. Slow Onset Inhibition Kinetics

In this example, we consider data collected in the steady state involving slow-onset inhibition. The data shown in Fig. 23.5 were generously provided by Vern Schramm and Andrew Murkin of the Albert Einstein College of Medicine from their work in developing transition state analog inhibitors of purine nucleoside phosphorylase (PNPase) (Kicska et al., 2002). These unpublished data show the increase in absorbance with time in the presence of various concentrations of the DADMe-ImmH inhibitor with the PNPase from Plasmodium falciparum.
[Figure 23.5 appears here: panels A and B plot absorbance (0–0.6) versus time (s, 0–5000).]

Figure 23.5 PNPase slow onset inhibition kinetics. The time dependence of product formation is shown after starting the PNPase reaction with 1 mM substrate and various concentrations of the inhibitor, DADMe-ImmH (0, 0.02, 0.06, 0.1, 0.15, 0.3, 0.5, 1, 2, 5, 7, and 10 μM). The kcat = 0.34 s⁻¹ and Km = 5 μM values were used, and the enzyme concentration was adjusted to 26 nM to fit the trace in the absence of inhibitor. Data were then fit globally (black lines superimposed on the data shown as thicker green lines) to either a one- or two-step inhibitor binding model based upon a Km = 5 μM and kcat = 0.34 s⁻¹. Fitted curves are shown according to the constants summarized in Table 23.4. (A) Fit to the two-step binding model based upon fits a or b (Table 23.4). (B) Fitted curves based upon the one-step binding model (Scheme 23.4) with parameters in row c of Table 23.4. The three sets of fitted curves are indistinguishable. Data were kindly provided by Vern Schramm and Andrew Murkin of the Albert Einstein College of Medicine (Kicska et al., 2002).
The tight binding of this inhibitor makes it a promising candidate for treating malaria.
The relevant mechanistic question to address is whether the data reveal a two-step inhibitor binding mechanism, with an initial weak binding followed by a slower isomerization to tighter binding (Scheme 23.3), or whether a one-step binding model is sufficient to account for the data (Scheme 23.4). If the one-step binding model accounts for the slow inhibition, then it is still likely that the reaction occurs in two steps, but the initial binding may be too weak to measure. To address which model accounts for the data, we simply fit the data to both models and then examine the errors in the parameters and evaluate goodness of fit, both visually and computationally.

E + I ⇌ EI ⇌ FI  (K1; k2/k−2)
Scheme 23.3

E + I ⇌ EI  (k1/k−1)
Scheme 23.4
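Before fitting, it is worth noting the standard diagnostic that separates the two schemes: the observed onset rate kobs as a function of inhibitor concentration is hyperbolic for two-step binding but linear for one-step binding. The sketch below computes both; the expressions come from the general slow-binding inhibition literature rather than from the chapter, and the constants are illustrative values loosely following Table 23.4 (units of μM and s).

    # kobs versus [I] for the two candidate slow-onset binding mechanisms.
    import numpy as np

    def kobs_two_step(I, K1, k2, km2):
        # Scheme 23.3: rapid-equilibrium E + I <-> EI (K1), then slow EI <-> FI
        return km2 + k2 * I / (I + 1.0 / K1)

    def kobs_one_step(I, k1, km1):
        # Scheme 23.4: direct slow binding
        return km1 + k1 * I

    I = np.geomspace(0.02, 10.0, 50)
    # At [I] << 1/K1 the two-step expression reduces to km2 + (K1*k2)*[I], so
    # data confined to that regime define only the product K1*k2.
    two_step = kobs_two_step(I, K1=14.0, k2=0.0154, km2=1.3e-4)
    one_step = kobs_one_step(I, k1=0.25, km1=2.1e-4)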
The results of fitting the data to a two-step model are shown in Fig. 23.5A, with rate constants summarized in Table 23.4, row a. Initial analysis based upon nonlinear regression suggests that the parameters are well constrained, supporting the conclusion that the two-step binding model is well defined. However, one can scroll the constants and find another area of parameter space leading to an equally good fit, with the constants summarized in Table 23.4, row b. Clearly, the parameters are not as well constrained as the nonlinear regression error analysis would lead us to believe. A full confidence contour analysis of the fitting to a two-step model reveals the underlying problem, as shown in Fig. 23.6A. The figure shows a linear correlation between k2 and k−1. This implies that, above a lower limit, the data only define the ratio k2/k−1. Because k1 was assumed to be a constant, we can translate this to a constant term defined by k1k2/k−1, which equals K1k2 = 0.22 μM⁻¹ s⁻¹. This can be immediately recognized as the apparent second-order rate constant for a two-step binding reaction in the range of low concentrations ([I] ≪ 1/K1), where the rate is linearly dependent upon inhibitor concentration.
Table 23.4 Kinetic parameters in fitting slow onset inhibition of PNPase^a

Two-step model (Scheme 23.3):
Fit   k−1 (s⁻¹)     k2 (s⁻¹)          k−2 (s⁻¹)           χ²
a     7.15 ± 0.05   0.0154 ± 0.0003   0.00013 ± 0.0001    0.034
b     2000 ± 100    4.3 ± 0.2         0.00013 ± 0.0001    0.0422

One-step model (Scheme 23.4):
Fit   k1 (μM⁻¹ s⁻¹)    k−1 (s⁻¹)             χ²
c     0.250 ± 0.0006   0.00021 ± 0.000007    0.032

^a Three sets of parameters illustrate the fitting of the data in Fig. 23.5 to either a two-step inhibitor binding mechanism (a and b) or a one-step mechanism (c). In fitting these data to Scheme 23.3, k1 was fixed at 100 μM⁻¹ s⁻¹.
[Figure 23.6 appears here: (A) pair-wise SSE contours for k2, k−2, and k−1 of the two-step model; (B) the SSE contour for k1 versus k−1 of the one-step model.]

Figure 23.6 Confidence contours for fitting PNPase slow onset inhibition. Pair-wise confidence contours are shown after fitting the data in Fig. 23.5 to either a two-step inhibitor binding model with three variable parameters (A) or a one-step model with two variable parameters (B). The contours are colored with red showing the area of best fit. The yellow boundary separating red and green defines a threshold where the SSE was increased by 10% over the minimum. The numbers at the corners of each plot show the ranges for each kinetic parameter. These plots were used to derive the parameter confidence intervals summarized in Table 23.5. The analysis shows that the data do not support the definition of a two-step binding mechanism. Rather, in the two-step binding model, the product K1k2 defines a second-order rate constant for inhibitor binding equal to 0.22 μM⁻¹ s⁻¹ according to the slope of the diagonal boundary in the plot of SSE for k2 versus k−1 in (A). Note that the diagonal boundary in (B) demonstrates that the ratio defining the net Kd = k−1/k1 is known with greater certainty than either of the parameters individually. Nonetheless, both parameters are well constrained.
Table 23.5 PNPase kinetic parameter confidence intervals^a

Model      Parameter        Lower limit   Upper limit
Two-step   1/K1 (nM)        18            none
           k2 (s⁻¹)         0.0033        none
           k−2 (s⁻¹)        0.000096      0.00027
One-step   k1 (μM⁻¹ s⁻¹)    0.233         0.265
           k−1 (s⁻¹)        0.00017       0.00026

^a Confidence intervals on individual kinetic parameters were derived from the threshold defined by a 10% increase in the SSE, as described in Johnson et al. (2009b).
This analysis leads to the conclusion that although one can fit to a two-step model, there are no data to define the Kd (1/K1) for the initial complex, and the model collapses to a one-step mechanism. The fit to a one-step mechanism is shown in Fig. 23.5B, and the corresponding confidence contour is shown in Fig. 23.6B. Clearly, the data are adequately fit by the one-step model, and the rate constants for inhibitor binding and release are well constrained. The brief linear correlation between k−1 and k1 implies that the net dissociation constant 1/K1 = k−1/k1 is known with greater certainty than either of the rate constants, but still the range over which the individual constants can vary is relatively small, with the greatest uncertainty in k−1, as summarized in Table 23.5. This analysis once again illustrates that the standard error estimates derived from nonlinear regression are not to be trusted. However, the confidence contour analysis reveals the extent to which parameters are underconstrained and defines the underlying relationships between parameters. Careful analysis leads one to either simplify the model or perform additional experiments to fill in the gaps in the data.
7. Summary

The two examples for data fitting serve to illustrate the use of KinTek Explorer in fitting data to derive steady-state kinetic constants and the rates of slow onset inhibition. In these cases, the fitting based upon simulation is fast and reliable. By fitting the parameters of the model directly to the data, simplifying assumptions and errors are eliminated. Standard error analysis during nonlinear regression is not reliable, and it fails to reveal when parameters are seriously underconstrained. This can be understood in that the Hessian matrix that must be solved is singular when
parameters are not well constrained, and so there are huge round-off errors in computing the covariance matrix. We are in the process of solving this problem by singular value decomposition, but in the meantime, it is important to recognize that standard nonlinear regression routines used by all currently available programs for fitting data seriously underestimate errors. The software can easily be adapted to fit data to examine enzyme activation, an important area of research in the pharmaceutical industry. We are currently using the software to simultaneously fit data collected by rapid quench methods and data obtained by fluorescence methods in the stopped-flow instrument. The rigorous fitting of both datasets simultaneously overcomes many of the limitations in previous attempts to correlate the results from both experiments (Johnson and Taylor, 1978). The ease of use of the software and the efficiency of the program allow many experiments to be fit directly to models with the greatest accuracy in estimating kinetic parameters and evaluating models.

Financial conflict of interest: KinTek Explorer was developed using private funds, and a professional version of the software is offered for sale.
ACKNOWLEDGMENTS

Supported by KinTek Corporation (www.kintek-corp.com).
REFERENCES

Anderson, K. S., Sikorski, J. A., and Johnson, K. A. (1988). A tetrahedral intermediate in the EPSP synthase reaction observed by rapid quench kinetics. Biochemistry 27, 7395–7406.
Anderson, K. S., Miles, E. W., and Johnson, K. A. (1991). Serine modulates substrate channeling in tryptophan synthase. A novel intersubunit triggering mechanism. J. Biol. Chem. 266, 8020–8033.
Barshop, B. A., Wrenn, R. F., and Frieden, C. (1983). Analysis of numerical methods for computer simulation of kinetic processes: Development of KINSIM, a flexible, portable system. Anal. Biochem. 130, 134–145.
Bates, D. M., and Watts, D. G. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, New York.
Hanes, J. W., and Johnson, K. A. (2008). Real-time measurement of pyrophosphate release kinetics. Anal. Biochem. 372, 125–127.
Johnson, K. A., and Taylor, E. W. (1978). Intermediate states of subfragment 1 and acto-subfragment 1 ATPase: Reevaluation of the mechanism. Biochemistry 17, 3432–3442.
Johnson, K. A., Simpson, Z. B., and Blom, T. (2009a). Global Kinetic Explorer: A new computer program for dynamic simulation and fitting of kinetic data. Anal. Biochem. 387, 20–29.
Johnson, K. A., Simpson, Z. B., and Blom, T. (2009b). FitSpace Explorer: An algorithm to evaluate multidimensional parameter space in fitting kinetic data. Anal. Biochem. 387, 30–41.
Kicska, G. A., Tyler, P. C., Evans, G. B., Furneaux, R. H., Kim, K., and Schramm, V. L. (2002). Transition state analogue inhibitors of purine nucleoside phosphorylase from Plasmodium falciparum. J. Biol. Chem. 277, 3219–3225.
Michaelis, L., and Menten, M. L. (1913). Die Kinetik der Invertinwirkung. Biochem. Z. 49, 333–369.
Spence, R. A., Kati, W. M., Anderson, K. S., and Johnson, K. A. (1995). Mechanism of inhibition of HIV-1 reverse transcriptase by nonnucleoside inhibitors. Science 267, 988–993.
Spies, M. A., and Toney, M. D. (2007). Intrinsic primary and secondary hydrogen kinetic isotope effects for alanine racemase from global analysis of progress curves. J. Am. Chem. Soc. 129, 10678–10685.
Spies, M. A., Woodward, J. J., Watnik, M. R., and Toney, M. D. (2004). Alanine racemase free energy profiles from global analyses of progress curves. J. Am. Chem. Soc. 126, 7464–7475.
Zimmerle, C. T., and Frieden, C. (1989). Analysis of progress curves by simulations generated by numerical integration. Biochem. J. 258, 381–387.
Author Index
A Acton, S. T., 42 Ahmed, R., 88 Ahren, B., 574 Aitchison, J. D., 335–353 Akutsu, T., 179, 181, 184 Albert, I., 288, 293, 299 Albert, R., 183, 281–303 Aldana, M., 337 Alday, P. H., 135–159 Alder, B. J., 317 Aldridge, B. B., 283 Alexandrov, N. N., 315 Alexopoulos, L. G., 42 Allison, D. B., 60 Alon, U., 173 Alter, O., 64 Altman, C., 257 Altschul, S. F., 311–312 Alvarez-Buylla, E. R., 172, 283, 297, 339 Anastopoulous, A. D., 359, 361 Anderson, A. R. A., 32, 37 Anderson, D. R., 254, 273, 274 Anderson, K. S., 603, 607 Andrae, J., 463–464 Andrec, M., 179 Angold, A., 375 Antia, R., 85–86, 88, 92–93 Antognazza, M. R., 112 Ao, P., 122, 132 Apweiler, R., 309 Arstila, T. P., 80 Ashcroft, F. M., 555 Askelof, P., 500 Astier, H., 555 Atkinson, A., 276 Atkins, P. W., 391, 408 Attwood, T. K., 309 Augustin, H. G., 463 Autiero, M., 464, 471 B Bachut-Okrasinska, E., 253, 275 Baer, S. M., 18 Bains, I., 80 Bairoch, A., 311 Balleza, E., 173, 176 Banarer, S., 549, 551
Bansal, M., 179, 180 Bao, P., 463 Barabasi, A. L., 282 Bardi, J. S., 112 Barker, D. R., 502 Barkley, R. A., 360–361 Barrett, C. B., 176 Barrett, T., 230 Barrow, N. J., 500, 502 Barshop, B. A., 602 Bates, D. M., 138, 269, 604 Bates, D. O., 464–465 Bates, P. A., 315 Beal, M. J., 179 Beard, D. A., 126 Bear, J. E., 44 Beckett, D., 137 Beechem, J. M., 251–252, 262 Beenken, A., 463–464 Beirlant, J., 532, 540 Bell, G. I., 555 Bellman, R. E., 452 Benkovic, S. J., 256–257, 259 Ben–Naim, A., 132 Benson, D. A., 308 Bentele, M., 178 Ben–Zvi, A., 435–438 Berendsen, H. J. C., 319 Berg, H. C., 40 Bergman, R. N., 573 Bergmeyer, H. U., 589 Berman, H. M., 308 Bernard, A., 181 Bernasconi, C. F., 151 Bertram, R., 1–20 Bertrand, G., 573 Bevington, P. R., 504 Bheekha Escura, R., 83 Bidaut, G., 69–71, 229–244 Bogaert, E., 479 Bo¨hm, C. M., 81 Bolli, G. B., 574 Bolstad, B. M., 62 Bolster, C. H., 500, 502, 521, 524–526 Bonvin, A. M., 326, 329 Bornholdt, S., 171, 283, 286, 290, 297, 299, 338–339, 342, 352 Bosco, G., 248 Bossomaier, T., 293
627
628
Author Index
Bowser, M. T., 500, 512, 527 Box, G. E. P., 276 Boyd, J. C., 411–431 Braunewell, S., 339 Breda, E., 573 Breiman, L., 536 Brelje, T. C., 574 Brenner, S., 177 Bressloff, P. C., 2 Briknarova´, K., 250 Britt, H. I., 502 Brock, A., 24 Bromberg, S., 113 Brooks, B. R., 319 Brooks, I., 252, 269, 271–272 Brown, A. F., 38 Brown, D. A., 40 Brown, P. J., 137 Brown, T. E., 365–366 Bruck, J., 407 Bruggemann, F. J., 164 Brunet, J. P., 61 Brunicardi, F. C., 554–555, 574 Brun, M., 348, 350 Bruns, D. E., 411–431 Bryce, N. S., 44 Bryson, K., 313 Buck, M. J., 282, 288 Burk, D., 590 Burnham, K. B., 254, 273, 274 Burroughs, N. J., 85–86, 97 Butera, R., 13 Butler, J. T., 176 Butte, A. J., 62, 178 Bzowska, A., 253–255 C Cabrera, O., 573 Cai, L., 44 Caldwell, J. W., 319 Camacho, D., 180 Cann, J., 137 Cann, J. R., 137 Cao, Y., 352, 464 Carlin, B. P., 364 Carmona–Saez, P., 68 Carneiro, J., 85, 86 Carpenter, A. E., 30, 42 Carroll, R. J., 526 Carton, D. M., 86 Carvalho, C. M., 62, 67, 69 Casadevall, A., 83 Casal, A., 86, 90 Case, D. A., 319 Castellanos, F. X., 360, 375 Catron, D. M., 89 Cejvan, K., 554 Celada, F., 100
Chakraborty, U. K., 264 Chance, B., 586 Chandra, R., 202 Chang, H. H., 341 Chang, W.-C., 179 Chaouiya, C., 283, 286 Chaves, M., 183, 291–292, 297, 299, 338 Chay, T. R., 13 Cheatham, T. E., 319 Chen, D. D. Y., 500, 512, 527 Cheng, J., 309, 313 Cheng, L., 62 Chen, J., 113, 122, 132 Chen, K. C., 346 Chicone, C., 443 Ching, W. K., 184 Cho, K. H., 285 Chow, C. C., 4 Christie, K. R., 68 Chuang, H. Y., 230, 244 Church, G., 62, 68 Cleare, A., 437 Cle´, C., 261 Cleland, W. W., 500 Clutton-Brock, M., 502 Cobelli, C., 573 Codling, E. A., 38 Cohn, M., 100 Cole, C., 313 Colijn, C., 86, 87 Collinson, D. J., 466 Collom, S. L., 275 Connors, K. A., 501, 503, 511–512, 527 Contino, P. B., 500 Conzelmann, H., 283 Conzen, S. D., 179, 348 Coombes, S., 2 Cooper, J. A., 129, 130 Corana, A., 268 Corcos, L., 283 Cornelisse, L. N., 13 Cornish-Bowden, A., 408, 500, 586, 590 Cornish, P. V., 383 Correia, J. J., 135–159 Costa, M., 533 Cover, T. M., 544 Cowan, J. D., 4 Cox, D. J., 365–366, 370–371 Cressie, N., 536 Crofford, L., 437 Cross, J. B., 326, 330 Crouch, S. R., 526 Cryer, P. E., 549–550 Cullen, G. E., 261 Cummings, M. D., 326
629
Author Index D Dam, J., 137 D’Amore, P. A., 465 D’Ari, R., 172 Davendra, D., 264 Davidian, M., 526 Davidich, M. I., 171, 283, 297–299, 338, 342, 352 Davidson, A. R., 312 Davis–Smyth, T., 465 Dean, P. M., 2 de Boer, R. J., 85–86, 167, 168 De Donder, T., 115 De Jong, H., 297 de la Fuente, A., 179–180 de Levie, R., 506 Del Negro, C. A., 15 Deming, W. E., 502, 505 Deng, Q., 251 Deng, X., 179 de Paula, J., 408 de Pillis, L. G., 85–86 Desai, A., 136 Destexhe, A., 15 Deutsch, A., 436 de Vries, A. H., 319 Dey, D. K., 505 Diana, L. M., 502 Di Cera, E., 502 Diem, P., 551 Digits, J. A., 259 Dikovskaya, D., 30 Dill, K. A., 113 Dimitrova, E., 184–185, 187, 188 Diraviyam, K., 307–330 Dojer, N., 179 Dominguez, C., 326 Donev, A., 276 Donnelly, R., 466 Dove, A., 24 Dowd, J. E., 500 Drossel, B., 337 Duan, Y., 319 Duffy, K. J., 40 Duggleby, R. G., 249, 260, 276, 590 Dumonteil, E., 551 Dunne, M. J., 555 Dunn, G. A., 38 DuPaul, G. J., 359, 361, 370–372 Dwass, M., 271 E Ebos, J. M., 465 Eccleston, J. F., 151 Eddy, S. R., 312 Edgar, R., 62 Eeckman, F. H., 341 Efendic, S., 555
Eftink, M. R., 500 Eisen, M. B., 63 Eisenthal, R., 500, 590 Elf, J., 409 Eliason, S. R., 33 Ellner, S. P., 286 Endrényi, L., 276 English, S. B., 62 Epstein, S., 555 Erdogmus, D., 543 Ermentrout, G. B., 2, 4 Ernst, J., 179 Espinosa-Soto, C., 172, 283, 297 Evans, J. G., 24 F Fanelli, C. G., 574 Farhy, L. S., 547–575 Faunt, L. M., 503 Fauré, A., 171, 343 Fedorov, V., 276 Feldman, H. A., 500, 502 Feng, D., 487 Feoktistov, V., 264 Ferrara, N., 465 Fierke, C. A., 256, 259 Figeys, D., 282 Figge, M. T., 86, 90–91 Finn, R. D., 312 Fischer, M., 375 Folcik, V. A., 100 Fong, L., 81 Ford, R. M., 40 Foreman, D., 375 Forsythe, J. A., 479 Fossati, A., 366 Foth, B. J., 286 Fouchet, D., 85–86, 97 Franco, R., 276 Fraser, C. G., 412 Frasier, S. G., 252 Frazier, G. R., 526 Frieden, C., 602 Friedl, P., 34 Friedman, N., 179 Friesner, R. A., 326 Frigon, R. P., 137 Frigyesi, A., 64 Fu, B. M., 486, 487 Fujitani, S., 555 Fukuda, M., 550 Funahashi, A., 586 Furth, R., 38 G Gabhann, F. M., 461–494 Gadkar, K., 179
Gagnon, M. L., 465 Galon, J., 81 Gao, Y., 62, 68 Garbett, S. P., 23–54 Gardner, T. S., 179–180 Garlick, D. G., 487 Gasa, T., 248, 250, 275 Gaspard, P., 132 Gasteiger, E., 309 Gat-Viks, I., 173 Gautestad, A. O., 44 Gedulin, B. R., 553–554, 560 Gelinas, A. D., 137 Gelperin, D. M., 588 Geman, D., 65 Geman, S., 65 Genter, P., 555, 573 Gentleman, R. C., 231 Georgescu, W., 23–54 Gerber, H. P., 465 Gerich, J. E., 549–551 Gerrits, A., 231 Ghaemmaghami, S., 588 Ghiron, C. A., 500 Giannis, A., 463 Gibson, M. A., 407 Gilbert, D., 283 Gilbert, G. A., 151 Gilbert, H. F., 255 Gilbert, L. M., 151 Gilles, E. D., 283 Gillespie, D. T., 383–384, 407–409, 587 Giorgio, A., 437 Girling, J. E., 463 Glass, L., 297–298, 345 Goldberg, I. G., 30 Goldberg, P. A., 416 Goldbeter, A., 2, 128–129 Goldman, D., 477 Goldstein, H., 112 Gomperts, B. D., 284 Gonzalez, A., 172 Goodsell, D. S., 326 Gopel, S. O., 554–555 Gordon, M. S., 488, 490 Goryanin, I., 346 Gräf, R., 30 Grapengiesser, E., 555, 560 Greer, B., 238 Griesdale, D. E. G., 414 Grimmichova, R., 555, 562 Grimshaw, A., 197–226 Gromada, J., 550, 553, 574 Gropp, W., 202 Gschwind, A., 463 Guckenheimer, J., 286 Guerlain, S., 441
Guldberg, C. M., 248 Guldener, U., 68–69 Gupta, S., 171, 437–438, 440 Guyton, J. R., 573 H Habel, L. A., 359 Hackett, C. J., 80 Haigh, J. J., 463 Hairer, E., 140 Halgren, T. A., 326 Hanahan, D., 34 Hanes, J. W., 619 Hardin, C., 316 Harper, S. J., 464–465 Harris, M. P., 23–54 Harris, S. E., 175 Hartemink, A., 181 Harvey, I., 293 Ha, T., 383 Hatzimanikatis, V., 282 Havel, P. J., 574 Hedstrom, L., 259 Heilman, D., 360 Heimberg, H., 551, 555 Heise, T., 554, 574 Heller, H., 319 Hellman, B., 553, 555, 560 Heng, H. H., 24 Hermansen, K., 555 Herrgård, M. J., 171, 176, 584 Hess, B., 319 Heyman, B., 83 Hibbs, M. A., 230 Hill, T. L., 123 Hilsted, J., 551 Hinshaw, S. P., 360 Hirsch, B. R., 551 Hoffman, R. P., 550 Hofmann-Wellenhof, R., 37 Holmes, A. P., 271 Hoops, S., 264, 267, 583–598 Hope, K. M., 551 Hopfield, J. J., 112 Howard, J., 122 Hoyer, P., 69 Huang, A. C., 294 Huang, C.-H. C., 137 Huang, N., 325–326 Huang, S., 251, 340–341 Hucka, M., 586 Hughes, T. R., 60, 63, 68–69 Hulo, N., 309, 312 Hunter, S., 309, 313 Huypens, P., 573–574 Hyvärinen, A., 64, 535
I Ideker, T. E., 178–179, 184, 188 Illingworth, C. J., 330 Inagaki, N., 573 Ingber, D. E., 341 Ingle, J. D. Jr., 526 Insel, P. A., 573 Irizarry, R. A., 62 Irons, D. J., 343 Irwin, J. J., 326, 330 Ishihara, H., 553–554, 560 Ito, K., 553, 560 Ivanova, N. B., 233 Izhikevich, E. M., 13, 15 J Jacob, F., 341 Jacobson, L., 437 Jacobson, M. P., 315 Jacquez, J. A., 525 Jain, A. N., 325, 330 Jain, S., 171 Jamakhandi, A. P., 275 Jarrah, A. S., 163–192, 286 Jaspan, J. B., 555 Jefferys, W. H., 502, 524 Jenkins, R. C. Ll., 151 Jensen, P. S., 361 Jiang, B. H., 479 Jiang, D.-Q., 132 Jiao, X., 37 Ji, J. W., 477, 481 Jirstrand, M., 346 Johnson, K. A., 269, 601–625 Johnson, M. L., 138, 252, 269–272, 500, 503, 525 Jones, G., 326 Jones, M. C., 532, 536 Jorgensen, W. L., 319 K Kachalo, S., 283, 289, 302 Kaech, S. M., 88 Kalbfleisch, M. L., 371 Karlin, S., 113 Kauffman, S. A., 168, 173–176, 179, 181, 286, 337, 341, 345 Kaufman, M., 283 Kawai, K., 560 Kawakami, Y., 81 Kawamori, D., 551, 554 Kegeles, G., 137, 156–158 Keiding, N., 51 Keizer, J., 13 Kell, D. B., 177, 264–268, 588, 598 Kelley, L. A., 315 Kerbel, R. S., 463, 465
Kerr, M. K., 60 Kervizic, G., 283 Khan, J., 238 Kicska, G. A., 620–621 Kieffer, T. J., 555, 560, 573 Kilpinen, S., 230 Kimas, N., 437–438 Kimberly, M. M., 430 Kim, H. D., 40 Kim, J., 179 Kim, P. M., 61 Kim, P. S., 79–105 Kim, S., 347 Kimura, S., 180 King, E. L., 257 Kinniburgh, D. G., 500–502 Kipper, M. J., 38 Kirkpatrick, S., 268 Kitchen, D. B., 330 Klaff, L. J., 554 Kleinman, R., 554–555 Klein, R. G., 360 Klipp, E., 436, 586 Kohler, J., 587 Kollman, P. A., 319 Kontoyianni, M., 330 Koshland, D. E., 128–129 Kossenkov, A. V., 59–73 Kovatchev, B. P., 365–366, 415, 573 Kozakov, D., 326, 329 Kremling, A., 180 Kretsinger, R. H., 309 Krietsch, W. K., 589 Krogh, A., 309 Kruse, K., 409 Kuhn, A., 244 Kuperman, S., 361 Kürten, K. E., 297 Kuske, R., 18 Kut, C., 466 Kuzmic, P., 247–276, 590 L Lähdesmäki, H., 290, 348 Lake, D. E., 531–545 Lambert, J. D., 345 Langmuir, I., 500 Larkin, M. A., 312 Laskowski, R. A., 316 Laubenbacher, R., 163–192, 286 Lauffenburger, D. A., 178, 465, 468 Laughlin, R. B., 132 Lauritzen, S. L., 51 Laws, W. R., 500 Leach, A. R., 318, 324, 326 Leach, P. J., 204 Le Clainche, L., 248
Lee, D. D., 61, 68 Lee, H. Y., 85–87 Lee, J. H., 441, 449 Lee, J. M., 435–458 Lee, P. P., 79–105 Lee, S., 465, 469 Lee, T. I., 282, 465, 479 Lee, Y. S., 16 Lefever, R., 2 Le Novère, N., 587 Lensink, M. F., 326 León, K., 85–86, 97 Leskovar, A., 248 Letunic, I., 313 Levenberg, K., 591 Levy, D., 79–105 Liang, S., 179, 184 Liao, J. C., 62 Liebermeister, W., 64, 586 Lieb, J. D., 282, 288 Li, F., 171, 283, 294, 297–298, 342, 351 Linderman, J. L., 465, 468 Lineweaver, H., 501, 590 Li, N. K., 85 Lin, S. M., 64 Li, S., 171, 282–283, 293, 301–302 Li, X., 179 Li, Y.-X., 2 Llinas, R. R., 15 Lloyd, P. G., 481 Lobert, S., 158 Lobley, A., 315 Lockhart, D. J., 62 Lodish, H., 463, 465 Loo, L. H., 33–34 Loomis, W. F., 177 Lou, H., 360 Louis, T. A., 364 Ludvigsen, E., 554 Luecke, R. H., 502 Luthy, R., 316 Lybanon, M., 502, 524 M Macdonald, J. R., 502 Mac Gabhan, F., 470–473, 479–481, 483–485 Mackey, M. C., 86–87, 113, 121, 132 MacLeod, M. C., 340 Madden, T. L., 311 Madhusudhan, M. S., 309 Madura, J. D., 319 Maharaj, A. S., 465 Mahoney, M. W., 319 Malys, N., 583–598 Mannervik, B., 273, 500 Manninen, T., 352 Mannuzza, S., 360 Margolin, A. A., 179
Mari, A., 573 Marino, S., 180 Marquardt, D. W., 263, 591 Martin, D., 465 Martin, S. R., 179, 184, 381–409 Marti-Renom, M. A., 314 Maruyama, H., 553 Mason, D., 84 Mata, J., 100 Mathews, E. K., 2 Matsudaira, P., 24 Matthews, D. R., 555 May, R. M., 286 Mazitschek, R., 463 McCall, A. L., 547–575 McCammon, J. A., 317 McGinnis, S., 311 McHugh, R. B., 502 Mead, R., 138 Meffre, E., 80 Mehra, S., 179 Meier, J. J., 571 Meinert, C. L., 502 Mendes, P., 180, 264–268, 346, 583–598 Mendez, R., 326 Mendoza, L., 172, 283, 297 Menten, M. L., 590, 610 Mercado, R., 88 Merkel, R. L., 365–366, 371 Merrill, S. J., 85–86 Messiha, H., 583–598 Mewes, H. W., 68–69 Mian, S., 348 Michaelis, L., 590, 610 Mills, J. C., 233 Mitchell, M., 132 Mitchison, T. J., 136 Mohammadi, M., 463–464 Moitessier, N., 326 Molldrem, J. J., 81 Moloshok, T. D., 61 Monastra, V. J., 360 Monod, J., 341 Moore, H., 85–86 Moorman, J. R., 540, 545 Morari, M., 441, 449 Morgan, M. M., 197–226 Morrison, J. F., 249, 260 Morrow, D. A., 412 Mostofsky, S. H., 360 Mount, W. D., 315 Mukherjee, D. P., 42 Munson, P. J., 500, 502 Murakami, M., 82 Muroga, S., 342 Murphy, K., 348 Murray, J. D., 2, 112, 128 Muske, K. R., 453
Mysterud, I., 44 Myung, J. I., 254, 273, 275–276 N Naik, D. C., 204 Nariai, N., 179 Nelder, J. A., 138 Nelson, B. H., 81 Nichols, T. E., 271 Niedzwiecka, A., 253 Nikolayewaa, S., 175–176 Norusis, M., 525 Notredame, C., 313 Novak, B., 352 O Oberhauser, D. F., 137 Ochs, M. F., 59–73 Ochsner, S. A., 233 Olson, A. J., 326 Oltvai, Z. N., 282 Onsum, M., 86, 89 Onwubolu, G. C., 264 Orear, J., 502 Ornstein, L. S., 38 O’Shea, E. K., 351 Othmer, H. G., 183, 283 Oudes, A. J., 233 Ousterhout, J. K., 200 Ozbudak, E. M., 351 P Paffrath, D., 359 Pahle, J., 409 Pardoll, D. M., 81 Parkinson, H., 62, 230 Pastor, P. N., 359 Paulsson, J., 409 Pawlak, M., 541 Pearson, W. R., 311 Pe’er, D., 179 Pei, J., 312 Penberthy, J. K., 357–377 Penheiter, A. R., 260 Pepperkok, R., 24 Peranteau, A. G., 264 Perelson, A. S., 85–86 Perlman, Z. E., 24, 33–34 Petersen, P. H., 412 Pettersen, E. F., 313 Petzold, L. R., 409 Philo, J. S., 138 Pincus, S. M., 544 Pipeleers, D. G., 551 Pirofski, L. A., 83 Pitt, M. A., 254, 273, 275
Pollak, M., 463 Ponder, J. W., 319 Popel, A. S., 461–494 Pørksen, N., 562 Portela-Gomes, G. M., 554 Porter, I. M., 30 Potdar, A. A., 40 Pournara, I., 179 Powell, D. R., 502 Press, W. H., 140, 258, 504, 507 Presta, L. G., 488 Price, C. P., 413 Price, K. V., 264–265, 268 Pries, A. R., 477 Prvan, T., 501 Punta, M., 309 Q Qian, H., 111–132 Quaranta, V., 23–54 Quon, M. J., 573 Qutub, A. A., 463, 465, 479 R Raeymaekers, L., 173 Rahman, A., 317 Rajewsky, K., 80 Ramsey, S., 346 Rao, B. L. S. P., 540 Rao, C. V., 86, 89, 352 Rappaport, J. L., 360 Rarey, M., 326 Rasband, W. S., 30 Raser, J. M., 351 Ratkowsky, D. A., 510 Ravier, M. A., 554 Rawlings, J. B., 453 Ray, N., 42 Read, T., 536 Reaven, G. M., 555, 560 Reed, J. L., 282 Regoes, R., 85–86, 97 Reich, J. G., 263 Reiff, M. I., 360 Renkin, E. M., 487 Ren, P., 319 Reuben, C. A., 359 Rhoads, D. G., 586 Rice, J. J., 179 Richman, J. S., 540, 545 Richmond, C. S., 338 Riggs, D. S., 500 Rinzel, J., 2, 13–14, 16–17 Ripley, B., 536 Rissanen, J., 347 Ritchie, D. W., 326 Ritchie, R. J., 501
Rizzi, M., 436 Robeva, R., 168, 357–377 Rodbard, D., 500, 502, 526 Rodriguez-Fernandez, M., 588, 591 Roeder, K., 33 Roessler, C. G., 312 Rogers, P. A., 463 Rohl, C. A., 316 Rojas, I., 586 Ropers, D., 299 Rorsman, P., 553, 555, 560 Rosenberg, S. A., 81 Rost, B., 309, 313 Rowat, P., 18 Roy, H., 463, 465 Runarsson, T., 591 Rutter, G. A., 554 Rysselberghe, P., 115 S Sackmann, A., 283 Saeed, A. I., 63 Saez-Rodriguez, J., 171, 283 Safer, D. J., 359 Sakaguchi, S., 82, 84 Sakaue-Sawano, A., 53 Salehi, A., 555 Sali, A., 314 Samal, A., 171 Samols, E., 553–554, 560, 574 Samuels, D. C., 352 Sanchez, L., 172, 283, 297 Sander, C., 313 Sauro, H. M., 587 Savageau, M. A., 179 Saxena, A., 307–330 Scearce, L. M., 231 Scheele, R. B., 137 Schena, M., 62 Scherer, A., 86, 89 Schilling, C. H., 584 Schilstra, M. J., 159, 381–409 Schlippe, Y. V. G., 260 Schmidt, H., 346 Schuck, P., 137 Schueler-Furman, O., 326 Schuit, F. C., 554–555, 574 Schulthess, C. P., 505 Schulz-Gasch, T., 326 Schürmeyer, T. H., 441 Schuster, T. M., 137 Schwabe, U., 359 Schwede, T., 315 Scott, D. W., 532, 536, 541 Scott, M. G., 414 Secomb, T. W., 477 Segel, I. H., 117, 128, 248, 255, 258 Segel, S. A., 550
Segerstrom, L., 490 Seiden, P. E., 100 Sela, S., 465 Sept, D., 307–330 Seung, H. S., 61, 68 Shachtman, T., 113 Shahaf, G., 85–86 Shamir, R., 173 Shamoon, H., 551 Shannon, C. E., 112, 532 Shannon, P., 587 Shelton, T. L., 359, 361 Shen, R., 231 Shen, S., 486–487 Sherman, A., 2, 13–14 Sherwood, P. J., 135–159 Shi, J., 314 Shire, S. J., 137 Shmulevich, I., 173, 181, 184–185, 335–353 Shoichet, B. K., 330 Shortle, D., 316 Shpiro, A., 5–6 Shukla, G. K., 502 Sibisi, S., 62, 65 Sible, J. C., 346 Sibson, R., 532, 536 Silvestre, R. A., 554 Simons, K. T., 316 Simons, M., 464 Sippl, M. J., 316 Skilling, J., 62, 65–66 Slack, M. D., 33, 34 Slyke, D. D. V., 261 Sohal, D., 231 Sotiropoulou, P. A., 81 Sousa, S. F., 326 Sowell, E. R., 360–361 Spence, R. A., 608 Spies, M. A., 613 Stafford, W. F., 135–159 Stagner, J. I., 553–555, 560, 574 Stahl, M., 326 Starkuviene, V., 24 Staude, R. G., 51 Steele, R., 573 Stefanini, M. O., 461–494 Stein, M. A., 366 Sternberg, P. W., 177 Stern, J. V., 19 Steuer, R., 352–353 Stewart, J., 85–86 Stigler, B., 166, 168–171, 188 Stillinger, F. H., 317 Stockholm, D., 24 Stoeckert, C. J. Jr., 229–244 Storme, T., 248 Straume, M., 269–272, 525
Strogatz, S. H., 132 Strowski, M. Z., 554, 574 Sturmfels, B., 168 Subramaniam, S., 309 Su, J., 18 Sumida, Y., 554 Swain, P. S., 352 Swedlow, J. R., 30 Szallasi, Z., 409 Szedlacsek, S., 249, 260 T Tabak, J., 1–20 Taborsky, G. J. Jr., 551, 554, 574 Tai, M., 137, 158 Tang, K., 479 Tannock, R., 375 Tapia-Arancibia, L., 555 Taylor, E. W., 625 Taylor, H. M., 113 Tellinghuisen, J., 499–527 Teusink, B., 436 Thain, D., 198 Thakar, J., 283, 297, 299 Thieffry, D., 172, 283, 297 Thomas, J. A., 544 Thomas, R., 172, 180, 286, 337 Thompson, M., 526 Thomsen, A. R., 85, 93 Thusius, D., 137, 158–159 Tidor, B., 61 Tieleman, D. P., 319 Timasheff, S. N., 137 Tirone, T. A., 554 Tiwari, R., 326 Toffolo, G., 416, 573 Tomaiuolo, M., 1–20 Toney, M. D., 613 Trence, D. L., 416 Tringe, S., 179 Truskey, G. A., 482 Tsai, J., 234 Tsaneva-Atanasova, K., 2 Tusher, V. G., 60 Tyson, D. R., 23–54 Tyson, J. J., 2, 346 U Uehara, S., 573 Uhlenbeck, G. E., 38 Unger, R. H., 555, 560 Utsumi, M., 555 V Vajda, S., 326, 328–329 Valsami, G., 505 Van Boekel, M., 248
Van den Berghe, G., 414, 431 van der Mark, J., 2 van der Pol, B., 2 Van Goor, F., 2 Van Schravendijk, C. F., 554, 560 van Stipdonk, M. J., 88 Varela, F. J., 85–86 Veldhuis, J. D., 573 Veliz-Cuba, A., 166, 168–171 Vera-Licona, P., 189 Verheul, H. M., 465 Vidal, M., 282, 288 Viswanathan, G. M., 44 Vita, C., 248 Vlasselaers, D., 414 Voigt, J. H., 329 Voit, E., 180 von Dassow, G., 283 Von Weymarn, N., 248 Vriend, G., 316 W Waage, P., 248 Waddington, C. H., 173 Wagner, A., 179 Wainwright, T. E., 317 Walhout, A. J., 282, 288 Wallin, A. E., 44 Walsh, C. T., 249, 260 Wang, G., 68 Wang, R.-S., 281–303 Wanner, G., 140 Wardemann, H., 80 Ward, M. F., 365–366, 370 Warren, G. L., 326 Wasserman, L., 33 Waterhouse, A. M., 313 Watts, D. G., 138, 269, 271, 604 Weaver, D. C., 343 Weaver, W., 112 Wegner, A., 136 Weidow, B., 23–54 Weinberg, R. A., 34 Weisbuch, G., 85–86 Weiss, G., 360 Wells, A., 37 Welner, Z., 375 Wendt, A., 553, 560 Werhli, A. V., 180 Wernisch, L., 179 Westerhoff, H., 164 Westley, A. M., 586 Westley, J., 586 Weyandt, L. L., 366 Wheeler, D. L., 234 Wiederstein, M., 316 Wielgus-Kutrowska, B., 253–255
Wiener, R., 414 Wijelath, E. S., 465 Wilkinson, D. J., 408 Wilkinson, G. N., 500 Willett, C. G., 490 Williams, C. R., 260 Williams, J. W., 249, 260 Wilson, H. R., 4 Wodarz, D., 85, 93 Wolf, D. M., 341 Wolf, K., 34 Wolkenhauer, O., 285 Wong, D., 307–330 Wong, W. H., 62 Wu, F. T. H., 461–494 X Xu, E., 553–554, 560 Xu, L., 231 Y Yamasaki, Y., 573
Yao, X., 591 Yeung, M. K., 179–180 Young, M. A., 319 Yuan, F., 487 Yu, J., 179, 181, 184–185, 348 Z Zametkin, A. J., 360 Zdobnov, E. M., 309 Zeng, Q. C., 511, 525–526 Zhang, R., 171, 283, 302 Zhang, W., 338 Zhang, Y., 316, 350–351 Zhang, Z., 312 Zhao, H., 137 Zhou, H., 549, 551–552, 554, 556–557, 565–566, 569 Zimmerle, C. T., 602 Zito, J. M., 359 Zoete, V., 330 Zou, M., 179, 348
Subject Index
A ABCD systems c(r) distribution simulation, 142 concerted system simulation, 147 cooperative model data simulation, 148 correlation plots, 150 direct boundary analysis, 143 kinetically mediated concerted model, 145–146 koff values and MC analysis, 149 noise perturbed data simulation, 144, 146 velocity data simulation, 141 Abscisic acid-induced stomatal closure, 301–302 Adrenocorticotropic hormone (ACTH), 437–438, 440 Agent-based models (ABM), 89–90 ANN training and validation independence testing, 240–241 leave-one-out validation, 238–240 minimal error data set, 240 results interpretation, 243 whole data analysis pipeline, 241–243 Asymptotic Mean-Integrated Squared Error (AMISE), 539 Asymptotic mean-squared error (AMSE), 541–543 Attention-deficit hyperactivity disorder (ADHD) appraisal Bayesian probability algorithm meta-analysis tool, 369–373 procedure, 366–367 results, 367–369 score standardization, 363–364 statistical analyses, 367 subjects, 365–366 comprehensive psychophysiological assessment, 362 diagnosis of, 359 DSM-IV diagnostic criteria, 358 etiology of, 360–361 mean probability, age-gender groups, 374 prevalence, 359–360 problem summary, 361–362 types, 358 Autocrine signaling, 465
B Batch queuing systems, 201–202 Bayesian factor regression modeling (BFRM), 62 Bayesian probability algorithm meta-analysis tool procedure, 371–372 results, 373 subjects, 370–371 procedure, 366–367 results, 367–369 score standardization, 363–364 statistical analyses, 367 subjects, 365–366 Bellman equation, 453 Bi Bi Random mechanism, 248–249, 257 Boolean dynamic modeling, cellular signaling networks abscisic acid-induced stomatal closure, 301–302 biological implications and predictions, 295–296 Boolean switches to dose-response curves, 299–301 gene regulatory networks, 286 illustration, 287 network backbone construction, 288–289 piecewise linear systems, 298–299 robustness testing, 295 state transition model selection, 291–292 steady state analysis, 293–295 threshold Boolean networks, 297–298 T-LGL survival signaling network, 302–303 transfer function determination, 289–290 Boolean networks algebraic model framework, 173 dynamics of, 170 genetic regulatory model asynchronous updating, 338 attractors as cell types, 341–343 definition, 338 hysteresis, 340 PBN attractors role, 350 state transition probability, 347 steady-state analysis and stability, 350–351 switching probability, 349 transition matrix, 348
Boolean networks (cont.) random networks (RBN), 337 state-transition diagram, 340 truth tables, functions, 339 logical model, 172 nested canalyzing function (NCF), 171 phase space, 175 reverse-engineering deterministic and stochastic Boolean network, 181–184 inference, 184–185 lac operon model, 189–190 polynome, parameter estimation, 185–189 time-discrete dynamical system, 171 transcriptional network, 176 wiring diagram, 169 C Cell motility cellular parameter extraction dynamic expansion and contraction cell activity, 44–45 image acquisition and validation, 35–36 image processing, 36 instantaneous motion fraction, 44 motion fraction, 40 persistence time, 38 speed fluctuation, 42–44 statistical subpopulations, 45 step-length, 44 surface area, 42 turn-angle distribution, 40–42 image acquisition and validation, 35–36 image processing, 36 Cell proliferation H2BmRFP-labeled cells validation and image acquisition, 46 image processing and parameter extraction, 46–48 statistical analysis Fucci system, 53 progeny tree, 49–51 quality control, 53 sibling pair analysis, 51–53 single-cell IMT and generation rate, 48–49 Cellular signaling networks Boolean dynamic modeling abscisic acid-induced stomatal closure, 301–302 biological implications and predictions, 295–296 Boolean switches to dose-response curves, 299–301 gene regulatory networks, 286 illustration, 287 network backbone construction, 288–289 piecewise linear systems, 298–299
robustness testing, 295 state transition model selection, 291–292 steady state analysis, 293–295 threshold Boolean networks, 297–298 T-LGL survival signaling network, 302–303 transfer function determination, 289–290 directed graph representation, 284 hypothetical signal transduction process, 285 Checkpointing process, 219–222 Chronic fatigue syndrome (CFS), 437 Clinically relevant performance-assessment tools invariant manifold construction eigenvalues and eigenvectors, 442–443 optimal input signals, 441 stable manifold theorem, 443–444 steady-state points, 441 optimal control objective, 448–452 treatment options, evaluation, 445, 447–448 Clustering techniques, 63 COPASI software, 587–588, 590–591, 594–595 Coronary artery disease (CAD), 481 Corticotropin-releasing hormone (CRH), 437 D Decipher stem cell signature detection ANN training and validation independence testing, 240–241 leave-one-out validation, 238–240 minimal error data set, 240 results interpretation, 243 whole data analysis pipeline, 241–243 computing environment, 231–232 databases, 234–235 data sources, 232 final compendium index, 236–237 generalized hierarchy, 235–236 normalisation, 232–234 variation filtering, 237–238 vector projection, 237 Delay differential equations (DDE), 87–88 DynaFit software package, enzymology Bi Bi Random mechanism, 248–249 common features, 248 enzyme reactions steady-state initial rate equation, 256–260 thermodynamic cycles, initial rate models, 255–256 time course, 260–262 equilibrium binding studies independent binding sites and statistical factors, 252–253 interacting vs. independent sites, trimeric enzyme, 253–255 NMR study, protein-protein interactions, 251–252 initial model parameter estimation
global minimization, differential evolution, 264–269 model-discrimination analysis, 273–275 Monte-Carlo confidence intervals, 270–273 systematic parameter scan, 263–264 optimal design of experiments, 276 use and advantage of, 248 Dynamic programming (DP) deterministic systems Bellman equation, 453 cost-to-go function, 452–453 objective function, infinite horizon problems, 453 multistage optimal control problems, 452 worst-case cost, 454–455 E Echo statements, 206 Elliptic bursting model characteristics, 15 subthreshold oscillations, 16 voltage time course, 17 voltage trace, 16 Entropy balance equation, 118–119 Enzyme reactions invariant concentrations of reactants, 261–262 steady-state initial rate equation, 256–260 thermodynamic cycles, initial rate models, 255–256 Equilibrium binding studies independent binding sites and statistical factors, 252–253 interacting vs. independent sites, trimeric enzyme, 253–255 NMR study, protein-protein interactions, 251–252 F Fibroblast growth factor (FGF) system, 463 Fibromyalgia, 437–438 File staging, 222–223 Friedman-Tukey index, 535–536 AMSE, 541–542 mean-squared error (MSE), 541–543 plug-in and resubstitution estimate, 540 Fucci system, 53 G Genetic regulatory network modelling Boolean networks asynchronous updating, 338 attractors as cell types, 341–343 definition, 338 hysteresis, 340 random Boolean networks (RBN), 337 state-transition diagram, 340
truth tables, functions, 339 differential equation models function F+(x,1), 345 mutant phenotype prediction, 346–347 nonlinear time-dependent equation, 343 sigmoid functions, 344 time-invariant system, 343 probabilistic Boolean networks attractors role, 350 state transition probability, 347 steady-state analysis and stability, 350–351 switching probability, 349 transition matrix, 348 stochastic differential equation models, 351–353 Gibbs entropy, 114, 132 Gilbert theory, 151 Glucagon counterregulation (GCR) β-cell-deficient rat model, 549 dysregulation, diabetes hypoglycemia, 550–551 switch-off hypothesis, 551 initial qualitative analysis β-cell inhibition, α-cells, 553–554 α-cell stimulation, δ-cells, 554–555 δ-cell inhibition, α-cells, 554 glucose inhibition, α-cells, 555–556 glucose stimulation, β- and δ-cells, 555 interdisciplinary approach advantages and limitations, 571–575 Minimal Control Network (MCN), 553 mathematical models, control mechanisms α-cell inhibitor, 559 experimental findings, diabetic STZ-treated rats, 557 model equations, 558 model-predicted GCR responses, 558–559 somatostatin and glucagon concentration rates, 557 normal endocrine pancreas dynamic network approximation, MCN, 561–562 model parameters determination, 562–563 response, reduction, 567–569 response to hypoglycemia, 566 response to switch-off signals, 567 in silico experiments, 564–565 simulated transition, normal physiology-insulinopenic state, 569–571 validation, MCN, 561–562 Glucocorticoids, 437 Glyceraldehyde 3-phosphate (G3P), 589 Glycerone phosphate, 589 Grid systems, 223–226 Growth factor-receptor systems angiogenesis, biology, 462–463 mesoscale single-tissue 3D models, 474–482
Growth factor-receptor systems (cont.) cell based therapy, muscle ischemia, 480–481 exercise therapy, muscle ischemia, 481–482 gene therapy, muscle ischemia, 480 mathematical framework, 474–478 molecular level kinetics models mathematical framework, 468 PIGF synergy mechanism, 468–471 multi-tissue compartmental models anti-VEGF therapy, pharmacokinetics, 488–491 ligand trap, sVEGFR1, 491–493 lymphatic drainage, 487–488 macromolecular vascular permeability, 486–487 normal compartment, 485 single-tissue compartmental models mathematical framework, 482–483 pharmacodynamic mechanism, 483–485 vascular endothelial growth factor (VEGF) computational models, 466 multiscale biology, 464–466 systems biology, 463–464 H High-content automated microscopy (HCAM) cancer cell trait variability (see Quantitative cell traits (QCT) variability) implementation of, 24 High-throughput computing (HTC) application, 199–200 batch queuing systems, 201–202 checkpointing, 219–222 data transformation pattern BASH shell script and echo statements, 206 line detector submission control script, 207 PBS submission script, 204–205 submission manager or control script, 205–206 file staging, 222–223 grid systems, 223–226 iterative refinement, 217–218 Monte Carlo simulations, 211–214 parameter space study airflow over wing submission script., 208–210 BASH shell script, 209 portable batch system, 202–204 resource restrictions, 218–219 scripting languages, 200–201 throwDarts sequential program, 215–217 HIV protease inhibition distribution histograms, 272–273 least-squares fit, progress curves, 265 mechanistic model, 266 Homology modeling
secondary structure prediction, 313 sequence analysis, 306–313 structure validation, 316 tertiary structure prediction ab initio algorithms, 315–316 template-query alignment, 314–315 threading, 315 HTC. See High-throughput computing Hyperglycemia, 414, 425, 429 Hypoglycemia, 414, 417, 423–425, 427, 429 Hypothalamic-pituitary-adrenal (HPA) axis system ACTH, 437–438 block diagram, dynamics, 437 deterministic optimization, 455–456 exogenous ACTH (EACTH), 438 homeostasis maintenance, 437 steady-state analysis cortisol, 440 model predictive control (MPC), 441 nominal parameters, 439 stress-related disorders, 437 system model, 438–439 worst-case optimization, 456–458 Hypoxia-inducible factor 1 (HIF1) activation, 463 I Immune network, self-regulating immune regulation complexity, 81–83 mathematical modelling agent-based models (ABM), 89–90 delay differential equations (DDE), 87–88 ordinary differential equations (ODE), 85–87 partial differential equations (PDE), 88–89 stochastic differential equations (SDE), 90–91 self/nonself discrimination, 83–84 T-cell regulation (see Intracellular T-cell regulation) thymic selection, 80 tumor-associated antigens (TAA), 81 Insulin infusion protocol, 416 Insulin-like growth factor (IGF) system, 463 Intracellular T-cell regulation iTreg-based negative feedback death rate, 99 model diagram, 98 simulation, 103–105 T cell contraction, 98 T cell proliferation program antigen function graphs, 98 expanded diagram, 94–96 parameter estimates, 97 simulation, 100–103 summary, 93 variable definition, 94
K Kernel density bandwidth, 536–537 Epanechnikov kernel, 537 PDF, 532, 536–538 symmetry, 541 template matches, 544–545 zero mean and unit variance, 534 KinTek Global Kinetic Explorer software fitting data, simulation, 603–605 fitting full progress curves error analysis, 617–620 information content, 613 kcat and Km values, 613–615 kinetic parameters, 614, 616 rapid equilibrium binding model, 614 rate constants, 613–614 methods average sigma value, 610 experiment definition, 607–608 information content, data, 609 model definition, 605–607 nonlinear regression analysis, data fitting, 609 output factors definition, 608 time and concentration, units, 608–609 progress curve kinetics effect of variable, 607, 611 information content, 611 product inhibition effect, 607, 611 rate constants, 611–612 rate constants, 603 slow onset inhibition kinetics absorbance, 620–621 purine nucleoside phosphorylase (PNPase), 620–624 tryptophan synthase, 603 L Least-squares analysis data uncertainty, variance function estimation, 524–526 Michaelis-Menten enzyme kinetics, 501 multiple uncertain variables, Deming's treatment, 505 rectangular hyperbola, 500 standard linear and nonlinear least squares, 503–505 statistics, reciprocals binding and kinetics data, 510–511 implications, thumb rule, 509–510 Monte Carlo experiment, 506–509 uncertainty in functions, error propagation, 505–506 unusual weighting, dependent variable effective-variance-based weighting expressions, 523
effective variance treatment, 521–522 weights, true dependent variable constant sy, 511–512 Monte Carlo simulations, 517–521 perfectly fitting data, illustrations, 512–515 real data example, 515–517 Levenberg–Marquardt algorithm, 138 M Markov process entropy and balance equation, 118–119 equilibrium and time reversibility, 119–120 free energy and relative entropy, 121–122 Matrix factorization Bayes' equation, 65 Bayesian factor regression modeling (BFRM), 62 mathematical statement, 61 Occam's Razor argument, 65 positivity and dimensionality reduction, 66 transcriptional regulators, 67 Mean-Integrated Squared Error (MISE) definition, 538 optimal bandwidth, 539 Mesoscale single-tissue 3D models cell based therapy, muscle ischemia, 480–481 exercise therapy, muscle ischemia, 481–482 gene therapy, muscle ischemia, 480 mathematical framework blood flow volumes, 477 2D and 3D tissue geometry, 474–477 diffusion, 477 receptor-ligands interactions, 478–479 sequestration, ECM, 478 single-compartment models, 480 VEGF production/secretion rates, 479 Michaelis-Menten enzyme kinetics, 128–130, 501 Microarray data analysis clustering techniques, 63 mating activation and filamentation activation, 61 matrix factorization Bayes' equation, 65 Bayesian factor regression modeling (BFRM), 62 mathematical statement, 61 Occam's Razor argument, 65 positivity and dimensionality reduction, 66 transcriptional regulators, 67 nonnegative matrix factorization (NMF), 61–62, 68 Rosetta compendium, 63, 68–72 tightly coupled MAPK pathways, 60 traditional statistical approaches, 64
Minimal Control Network (MCN) interdisciplinary approach, 553 normal endocrine pancreas dynamic network approximation, 561–562 hypoglycemia, response, 566 model parameters determination, 562–563 response, reduction, 567–569 in silico experiments, 564–565 simulated transition, normal physiology-insulinopenic state, 569–571 switch-off signals, response, 567 validation, 561–562 Molecular docking basic components, 324–325 iterative docking and analysis, 328–329 molecule preparation macromolecule protein, 326–327 protein ligands, 327–328 small molecule ligands, 327 post analysis, 329 software selection, 325–326 virtual screening, 329–330 Molecular dynamics mechanics, 318–319 setting up and running simulations equilibration, 320–321 minimization and preparation, 320 production, 321 simulation analysis equilibration measures, 321–324 principal component analysis, 322–324 RMSF fluctuation, 321–322 Molecular modeling homology modeling secondary structure prediction, 313 sequence analysis, 306–313 structure validation, 316 tertiary structure prediction, 313–316 molecular docking basic components, 324–325 iterative docking and analysis, 328–329 molecule preparation, 326–328 post analysis, 329 software selection, 325–326 virtual screening, 329–330 molecular dynamics mechanics, 318–319 setting up and running simulations, 320–321 simulation analysis, 321–324 software and resources, 310–311 Monomer-tetramer model Gilbert theory, 151 kinetics of reequilibration, 157 monomer–dimer–trimer–tetramer species distributions, 155 normalised distribution, 153 reaction relaxation time, 156 sedimentation distribution analysis, 152
sequential model, 154 Monte Carlo simulations, 211–214 glucose control, 429 glucose meters, 430–431 hypoperfusion, skin capillaries, 430 limitations, 430–431 methods glucose concentration, 416–417 tight glucose control (TGC) regimens, 416 modeling approach assay imprecision and inaccuracy, 414–415 physiologic response, 415 University of Washington regimen hypoglycemia, 427, 429 inaccurate and imprecise glucose assay, 421, 429 insulin infusion rate, 427–428 Yale regimen bias vs. coefficient of variation (CV), 422–426 glucose concentrations, 417–423 insulin infusion rates, 417–422 Motor protein model biochemical kinetic scheme, 123 chemomechanical and futile cycle, 124 motor efficiency, 125 Multi-tissue compartmental models anti-VEGF therapy, pharmacokinetics, 488–491 ligand trap, sVEGFR1, 491–493 lymphatic drainage, 487–488 macromolecular vascular permeability, 486–487 normal compartment, 485 Muscle ischemia cell based therapy, 480–481 exercise therapy, 481–482 gene therapy, 480 N Newton-Raphson distribution, 140 Nonnegative matrix factorization (NMF), 61–62, 68 Nonparametric entropy cardiac rhythms, classification atrial fibrillation (AF), 533–534 implantable cardioverter-defibrillator (ICD), 534 trigeminy rhythms, 533–534 Friedman-Tukey index, 535–536 AMSE, 541–542 mean-squared error (MSE), 541–543 plug-in and resubstitution estimate, 540 Kernel density bandwidth, 536–537
Epanechnikov kernel, 537 PDF, 536–537 template matches, 544–545 Mean-Integrated Squared Error (MISE) definition, 538 optimal bandwidth, 539 Renyi entropy, 535–536 O Occam’s Razor argument, 65 Ordinary differential equations (ODEs), 85–87, 383, 586, 594 P Partial differential equations (PDE), 88–89 Peripheral artery disease (PAD), 481 Phosphorylation-dephosphorylation cycle (PdPC) kinetics Michaelis–Menten kinetics, 128–130 signaling switch and phosphorylation energy, 125–127 substrate specificity amplification, 130 Piecewise linear systems, 298–299 Platelet-derived growth factor (PDGF) system, 463 Portable batch system, 202–204 Probabilistic Boolean networks (PBN) attractors role, 350 state transition probability, 347 steady-state analysis and stability, 350–351 switching probability, 349 transition matrix, 348 Probability density function (PDF), 532, 536–538 Progress curve kinetics KinTek Global Kinetic Explorer software effect of variable, 607, 611 information content, 611 product inhibition effect, 607, 611 rate constants, 611–612 systems biology COPASI, 594–595 forward reaction, 595–596 G3P, 595 kinetic parameters, 597 Projection index, 536 Protein-protein interactions, NMR study, 251–252 Q Quantitative cell traits (QCT) variability cell proliferation Fucci system, 53 H2BmRFP-labeled cells validation and image acquisition, 46 image processing and parameter extraction, 46–48
progeny tree, 49–51 quality control, 53 sibling pair analysis, 51–53 single-cell IMT and generation rate, 48–49 computational workflow average data, 32 cellular parameter extraction, 31 computer-assisted analysis, 27 data management, 30 distribution data, 32 image processing, 30–31 statistical analysis, 31 statistical subpopulations, 32–34 time-lapse image acquisition, 28–30 definition, 24 single cell motility dynamic expansion and contraction cell activity, 44–45 image acquisition and validation, 35–36 image processing, 36 instantaneous motion fraction, 44 motion fraction, 40 persistence time, 38 speed fluctuation, 42–44 statistical subpopulations, 45 step-length, 44 surface area, 42 turn-angle distribution, 40–42 R Reaction kinetics first-order reactions, 392 pseudo-first-order reactions, 392–393 rate constants, 393 second-order reactions, 390–391 Receptor tyrosine kinases (RTKs), 463–464 Relaxation-type models elliptic bursting characteristics, 15 subthreshold oscillations, 16 voltage time course, 17 voltage trace, 16 phase duration determination algorithm, 19–20 prolactin-secreting pituitary lactotrophs, 18–19 relaxation oscillations activation function, 4 activity pattern, 9 a-nullcline shape, 10 divisive feedback, 5 noise effects, 8 on and off transition, 11 positive and negative feedback system, 5 s-model and y-model, 6–8 subtractive feedback, 6 survival analysis of particles, 10 Z-shaped nullcline, 12 scatter plots and correlation analysis, 3–4
Relaxation-type models (cont.) square wave bursting active and silent phase durations, 15 membrane potential, 13 wave spiking and voltage trace, 14 Renyi entropy, 535–536 Resource restrictions submission script, 219 reverse reaction, 595–596 Rosetta compendium, 63, 68–72 S SBML. See Systems biology markup language Scatter plots and correlation analysis, 3–4 SDE. See Stochastic differential equations Sedanal fitting models, 138 Sedimentation velocity profiles ABCD systems c(r) distribution simulation, 142 concerted system simulation, 147 cooperative model data simulation, 148 correlation plots, 150 direct boundary analysis, 143 kinetically mediated concerted model, 145–146 koff values and MC analysis, 149 noise perturbed data simulation, 144, 146 velocity data simulation, 141 advanced parameter kinetics equilibrium control window, 141 concerted tetramer model, 140 Levenberg–Marquardt algorithm, 138 monomer-tetramer model Gilbert theory, 151 kinetics of reequilibration, 157 monomer–dimer–trimer–tetramer species distributions, 155 normalised distribution, 153 reaction relaxation time, 156 sedimentation distribution analysis, 152 sequential model, 154 Newton–Raphson distribution, 140 Sedanal fitting models, 138 stathmin-eGFP to tubulin binding, 139 Shannon entropy, 112–113, 532, 535, 541 Simple stochastic simulation citation data, 383 graphical notation Petri-net format, 386–389 place and transition nodes, 387 Monte Carlo computer simulations, 383 reaction dynamics, 385–386 reaction kinetics first-order reactions, 392 pseudo-first-order reactions, 392–393 rate constants, 393 second-order reactions, 390–391 reactions, 389
transition firing rules first-order reactions, 395–401 ground rules, 394–395 pseudo-first-order and second-order reactions, 404–405 rate constants, 393 Single-tissue compartmental models mathematical framework, 482–483 pharmacodynamic mechanism, 483–485 Slow onset inhibition kinetics absorbance, 620–621 purine nucleoside phosphorylase (PNPase) confidence contours, 623 kinetic parameters, 622, 624 Plasmodium falciparum, 621 Square wave bursting model active and silent phase durations, 15 membrane potential, 13 wave spiking and voltage trace, 14 Stem Cell Analysis and characterization by Neural Networks (SCANN), 231 Stem Cell Genome Anatomy Project (SCGAP) Consortium, 230, 232, 235 Stochastically fluctuating system, thermodynamics cycle kinetics and thermodynamics box, 115–117 equilibrium and nonequilibrium steady state, 113–115 Gibbs and Shannon entropy, 132 historical reflection, 131–132 Markov process entropy and entropy balance equation, 118–119 equilibrium and time reversibility, 119–120 free energy and relative entropy, 121–122 PdPC kinetics Michaelis–Menten kinetics, 128–130 signaling switch and phosphorylation energy, 125–127 substrate specificity amplification, 130 three-state two-cycle motor protein biochemical kinetic scheme, 123 chemomechanical and futile cycle, 124 motor efficiency, 125 Stochastic differential equations (SDE), 90–91 Systems biology Boolean networks algebraic model framework, 173 dynamics of, 170 logical model, 172 nested canalyzing function (NCF), 171 phase space, 175 time-discrete dynamical system, 171 transcriptional network, 176 wiring diagram, 169 bottom-up approach, 584 computational modeling and enzyme kinetics
COPASI, biochemical modeling and simulation package, 587–588 ordinary differential equations (ODEs), 586 standards, computational systems biology, 586–587 gene regulatory networks, 165 Hill function, 166 initial rate analysis COPASI, 590–591 forward reaction, 592 Henri-Michaelis-Menten kinetics, 590 rate law, 594 reverse reaction, 593 lac operon, 166 network inference, 176–181 ODE, 166 progress curve analysis COPASI, 594–595 forward reaction, 595–596 G3P, 595 kinetic parameters, 597 ODEs, 594 reverse reaction, 595–596 reverse-engineering deterministic and stochastic Boolean network, 181–184 inferring Boolean networks, 184 inferring stochastic Boolean networks, 184–185 lac operon model, 189–190 polynome, parameter estimation, 185–189 vascular endothelial growth factor (VEGF), 463–464 yeast triosephosphate isomerase (EC 5.3.1.1) forward reaction, 589 kinetic parameters, 597 MORF proteins, 588–589 reverse reaction, 589 Saccharomyces cerevisiae, 588 Systems biology markup language (SBML), 586–587 T TAA. See Tumor-associated antigens T cell proliferation program antigen function graphs, 98 expanded diagram, 94–96 parameter estimates, 97
simulation cell dynamics dependence, 101 cell populations, 100 time evolution, 102 summary, 93 variable definition, 94 Threshold Boolean networks, 297–298 throwDarts sequential program, 215–217 Thymic selection, 80 T-LGL survival signaling network, 302–303 Transition firing rules first-order reactions, 395–401 ground rules, 394–395 pseudo-first-order and second-order reactions, 404–405 rate constants, 393 Trimeric enzyme, interacting vs. independent sites, 253–255 Tumor-associated antigens (TAA), 81 V Vascular endothelial growth factor (VEGF) computational models, 466 ligand shifting, 468 multiscale biology autocrine signaling, 465 heparin-binding affinity, 464–465 intertissue transport, 465 intratissue transport, 464 paracrine signaling, 465–466 nonlinear differential equations, 471 NRP1–VEGFR2 coupling, 472–474 in silico model formulation, 469 systems biology, 463–464 VEGF–VEGFR2 complex, 471 X X-linked agammaglobulinemia, 91 Y Yeast triosephosphate isomerase (EC 5.3.1.1) forward reaction, 589 kinetic parameters, 597 MORF proteins, 588–589 reverse reaction, 589 Saccharomyces cerevisiae, 588