Systems Biology Volume 1
Series Editor Sangdun Choi
For further volumes: http://www.springer.com/series/7890
Sangdun Choi Editor
Systems Biology for Signaling Networks
Editor
Sangdun Choi
Department of Molecular Science and Technology
Department of Biological Science
College of Natural Science
Ajou University
442-749 Suwon
Korea, Republic of (South Korea)
[email protected]

ISBN 978-1-4419-5796-2
e-ISBN 978-1-4419-5797-9
DOI 10.1007/978-1-4419-5797-9
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010932192
© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Systems biology is a recently emerged academic field that aims to understand the relationships from genes to organisms through networks in biological systems. The vast amount of data currently being generated should be addressed in a meaningful way, and the development of systematic approaches may lead to a new approach to scientific study. Since the year 2000, the term “Systems Biology” has been widely used in biosciences, but the actual process has not yet been well defined. Therefore, I have created a book series to describe systems biology in conjunction with Springer (http://www.springer.com).

The current book, Systems Biology for Signaling Networks, is part of a series (series editor: Sangdun Choi) consisting of

1. Systems Biology for Signaling Networks (Choi S)
2. Systems Immunology (Selvarajoo K and Tsuchiya M)
3. Physiologic Computer Modeling and Systems Analysis for the Clinical Researcher (Summers RL and Coleman TG)
4. Systems Biology of Regulated Exocytosis in Pancreatic β-Cells (Booß-Bavnbek B, Klösgen B, Larsen JK, Pociot F and Renström E)

The scope of this series of books is wide and ranges from the molecular parts of cells to network modeling. Among them, Systems Biology for Signaling Networks focuses on systematic approaches to cellular signaling in humans and animals. Although systems biology is in its infancy, our book offers an exciting solution in terms of exploring cellular signaling that will be a great help to biomedical studies in this century.

Suwon, Korea
Sangdun Choi
Contents

Part I Concepts

1 Systems Biology Approaches: Solving New Puzzles in a Symphonic Manner
   Sangdun Choi
2 Current Progress in Static and Dynamic Modeling of Biological Networks
   Bernie J. Daigle, Jr., Balaji S. Srinivasan, Jason A. Flannick, Antal F. Novak, and Serafim Batzoglou
3 Getting Started in Biological Pathway Construction
   Rebecca A. Sealfon and Stuart C. Sealfon
4 From Microarray to Biology
   Mikhail Dozmorov and Robert E. Hurst

Part II Modeling and Reconstruction

5 Computational Procedures for Model Identification
   Eva Balsa-Canto and Julio R. Banga
6 Assembly of Logic-Based Diagrams of Biological Pathways
   Tom C. Freeman
7 Automating Mathematical Modeling of Biochemical Reaction Networks
   Andreas Dräger, Adrian Schröder, and Andreas Zell
8 Strategies to Investigate Signal Transduction Pathways with Mathematical Modelling
   Julio Vera, Svetoslav Nikolov, and Olaf Wolkenhauer
9 Inferring Transcriptional Regulatory Network
   Ming Zhan
10 Finding Functional Modules
   Mutlu Mete, Fusheng Tang, Xiaowei Xu, and Nurcan Yuruk
11 Modeling the Dynamics of Biological Networks from Time Course Data
   Sašo Džeroski and Ljupčo Todorovski
12 Decision Making in Cells
   Tomáš Helikar, Naomi Kochi, John Konvalina, and Jim A. Rogers
13 Robustness of Neural Network Models
   Daniel W. Franks and Graeme D. Ruxton
14 Functional Modules in Protein–Protein Interaction Networks
   Tobias Müller and Marcus Dittrich
15 Mixture Model on Graphs: A Probabilistic Model for Network-Based Analysis of Proteomic Data
   Josselin Noirel, Guido Sanguinetti, and Phillip C. Wright
16 Integration of Network Information for Protein Function Prediction
   Xiaoyu Jiang and Eric D. Kolaczyk

Part III Applications for Signaling Networks

17 Cellular-Level Gene Regulatory Networks: Their Derivation and Properties
   Benjamin de Bivort
18 Tyrosine-Phosphoproteome Dynamics
   Masaaki Oyama, Shinya Tasaki, and Hiroko Kozuka-Hata
19 Systems Biology of the MAPK1,2 Network
   Melissa Muller and Prahlad T. Ram
20 Pathway Crosstalk Network
   Yong Li
21 Crosstalk Between Mitogen-Activated Protein Kinase and Phosphoinositide-3 Kinase Signaling Pathways in Development and Disease
   Jijun Hao, Marie A. Daleo, and Charles C. Hong
22 Systems-Level Analyses of the Mammalian Innate Immune Response
   David J. Lynn, Jennifer L. Gardy, Christopher D. Fjell, Robert E.W. Hancock, and Fiona S.L. Brinkman
23 Molecular Basis of Protective Anti-Inflammatory Signalling by Cyclic AMP in the Vascular Endothelium
   Claire Rutherford and Timothy M. Palmer
24 Construction of Cancer-Perturbed Protein–Protein Interaction Network of Apoptosis for Drug Target Discovery
   Liang-Hui Chu and Bor-Sen Chen
25 Transcriptional Changes in Alzheimer’s Disease
   Jeremy A. Miller and Daniel H. Geschwind
26 Pathogenesis of Obesity-Related Chronic Liver Diseases as the Study Case for the Systems Biology
   Ancha Baranova, Aybike Birerdinc, Michael Estep, and Zobair M. Younossi
27 The Evolving Transcriptome of Head and Neck Squamous Cell Carcinoma
   Yau-Hua Yu
28 Peptide Microarrays for a Network Analysis of Changes in Molecular Interactions in Cellular Signalling
   Michael D. Sinzinger and Roland Brock

Part IV Tools for Systems Biology

29 A Primer on Modular Mass-Action Modelling with CellML
   Michael T. Cooling
30 FERN – Stochastic Simulation and Evaluation of Reaction Networks
   Florian Erhard, Caroline C. Friedel, and Ralf Zimmer
31 Programming Biology in BlenX
   Lorenzo Dematté, Roberto Larcher, Alida Palmisano, Corrado Priami, and Alessandro Romanel
32 Discrete Modelling: Petri Net and Logical Approaches
   Ina Koch and Claudine Chaouiya
33 ProteoLens: A Database-Driven Visual Data Mining Tool for Network Biology
   Jake Yue Chen and Tianxiao Huan
34 MADNet: A Web Server for Contextual Analysis and Visualization of High-Throughput Experiments
   Igor Šegota, Petar Glažar, and Kristian Vlahoviček

Subject Index
Contributors

Eva Balsa-Canto (Bio)Process Engineering Group, IIM-CSIC (Spanish National Research Council), C/Eduardo Cabello, 6, 36208-Vigo, Spain, [email protected]
Julio R. Banga (Bio)Process Engineering Group, IIM-CSIC (Spanish National Research Council), C/Eduardo Cabello, 6, 36208-Vigo, Spain
Ancha Baranova Center for the Study of Genomics in Liver Diseases, Molecular and Microbiology Department, George Mason University, Fairfax, Virginia, USA, [email protected]
Serafim Batzoglou Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]
Aybike Birerdinc Center for the Study of Genomics in Liver Diseases, Molecular and Microbiology Department, George Mason University, Fairfax, Virginia, USA, [email protected]
Fiona S.L. Brinkman Department of Molecular Biology and Biochemistry, 8888 University Drive, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6
Roland Brock Department of Biochemistry, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands, [email protected]
Claudine Chaouiya IGC, Instituto Gulbenkian de Ciência, Rua da Quinta Grande 6, P-2780-156 Oeiras, Portugal, [email protected]
Bor-Sen Chen Department of Electrical Engineering, National Tsing Hua University, 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan, [email protected]
Jake Yue Chen Indiana University School of Informatics; Department of Computer & Information Science, Purdue University; Indiana Center for Systems Biology and Personalized Medicine, WK190, 719 Indiana AVE, Indianapolis, IN 46209, USA, [email protected]
Liang-Hui Chu Department of Electrical Engineering, National Tsing Hua University, 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan, [email protected]
Sangdun Choi Department of Molecular Science and Technology, Ajou University, Suwon, 443-749, Korea, [email protected]
Michael T. Cooling Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand, [email protected]
Bernie J. Daigle, Jr. Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA, [email protected]
Marie A. Daleo Department of Medicine, Division of Cardiovascular Medicine, Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, USA, [email protected]
Benjamin de Bivort Rowland Institute at Harvard, Harvard University, 100 Edwin Land Blvd, Cambridge, MA 02142, USA, [email protected]
Lorenzo Dematté CoSBi and Università di Trento, Trento, Italy
Marcus Dittrich Biocenter, Bioinformatics Department, University of Wuerzburg, 97074 Wuerzburg, Germany, [email protected]
Mikhail Dozmorov Departments of Urology, Oklahoma University Health Sciences Center, Oklahoma City, OK, 73104, USA
Andreas Dräger Center for Bioinformatics Tübingen (ZBIT), Sand 1, 72076 Tübingen, University of Tübingen, Germany, [email protected]
Sašo Džeroski Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia, [email protected]
Florian Erhard LFE Bioinformatik, Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstraße, 17, 80333 München, Germany, [email protected]
Michael Estep Betty and Guy Beatty Center for Integrated Research, Inova Health System, Falls Church, Virginia, USA, [email protected]
Christopher D. Fjell Centre for Microbial Diseases and Immunity Research, 232 - 2259 Lower Mall, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4
Jason A. Flannick Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA, [email protected]
Daniel W. Franks York Centre for Complex Systems Analysis (YCCSA), Department of Biology, and Department of Computer Science, University of York, YO10 5YW, UK, [email protected]
Tom C. Freeman The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian, Scotland, UK, EH25 9PS, [email protected]
Caroline C. Friedel LFE Bioinformatik, Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstraße, 17, 80333 München, Germany, [email protected]
Jennifer L. Gardy Centre for Microbial Diseases and Immunity Research, 232 - 2259 Lower Mall, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4; Genome Research Laboratory, B.C. Centre for Disease Control, 655 West 12th Ave., Vancouver, British Columbia, Canada, V5Z 4R4
Daniel H. Geschwind Department of Neurology and Center for Neurobehavioral Genetics, University of California, Los Angeles, CA, USA, [email protected]
Petar Glažar Bioinformatics Group, Division of Biology, Faculty of Science, Zagreb University, Horvatovac 102a, 10000 Zagreb, Croatia, [email protected]
Robert E.W. Hancock Centre for Microbial Diseases and Immunity Research, 232 - 2259 Lower Mall, University of British Columbia, Vancouver, British Columbia, Canada, V6T 1Z4
Jijun Hao Department of Medicine, Division of Cardiovascular Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA, [email protected]
Tomáš Helikar Department of Pathology and Microbiology, University of Nebraska Medical Center, 983135 Nebraska Medical Center, Omaha, NE 68198, USA
Charles C. Hong Research Medicine, VA TVHS, Vanderbilt University, Nashville, TN, USA, [email protected]
Tianxiao Huan Shandong University, Microbiology Building 608#, Shanda Nanlu 27#, 250100, Jinan, People’s Republic of China, [email protected]
Robert E. Hurst Departments of Urology and Biochemistry and Molecular Biology and The Oklahoma University Cancer Institute, Oklahoma University Health Sciences Center, Oklahoma City, OK 73104, USA, [email protected]
Xiaoyu Jiang Research and Development, Boehringer Ingelheim Pharmaceuticals, Inc., 900 Ridgebury Road, Ridgefield CT 06877 USA, [email protected]
Ina Koch Institute for Computer Science, Molecular Bioinformatics, Johann Wolfgang Goethe-University, Robert-Mayer-Str. 11-15, 60325 Frankfurt a. Main, Germany, [email protected]
Naomi Kochi Department of Genetics, Cell Biology, and Anatomy, University of Nebraska Medical Center, 985805 Nebraska Medical Center, Omaha, NE 68198, USA
Eric D. Kolaczyk Department of Mathematics and Statistics, Boston University, 111 Cummington Street, Boston MA 02215 USA, [email protected]
John Konvalina Department of Mathematics, University of Nebraska, 6001 Dodge Street, Omaha, NE 68182, USA
Hiroko Kozuka-Hata Medical Proteomics Laboratory, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Roberto Larcher CoSBi and Università di Trento, Trento, Italy
Yong Li Computational Biology, GlaxoSmithKline R&D, 709 Swedeland Road, UW2230, King of Prussia, PA 19406, USA, [email protected]
David J. Lynn Department of Molecular Biology and Biochemistry, 8888 University Drive, Simon Fraser University, Burnaby, British Columbia, Canada, V5A 1S6, [email protected]; Current address: Animal & Bioscience Research Department, Teagasc, Grange, Dunsany, Co. Meath, Ireland, [email protected]
Mutlu Mete Department of Computer Science, Texas A&M University-Commerce, Commerce, TX, USA, [email protected]
Jeremy A. Miller Interdepartmental Program for Neuroscience, University of California, Los Angeles, CA, USA
Melissa Muller Department of Systems Biology, University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Box 950, Houston, TX 77030, USA
Svetoslav Nikolov Systems Biology and Bioinformatics Group, University of Rostock, 18051 Rostock, Germany; Institute of Mechanics and Biomechanics, Bulgarian Academy of Science, 1113 Sofia, Bulgaria
Josselin Noirel ChELSI Research Institute, Department of Chemical and Process Engineering, University of Sheffield, Mappin St, S1 3JD Sheffield, UK
Antal F. Novak Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]
Masaaki Oyama Medical Proteomics Laboratory, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan, [email protected]
Timothy M. Palmer Biochemistry and Cell Biology, Faculty of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ Scotland, UK, [email protected]
Alida Palmisano CoSBi and Università di Trento, Trento, Italy
Corrado Priami CoSBi and Università di Trento, Trento, Italy
Prahlad T. Ram Department of Systems Biology, University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Box 950, Houston, TX 77030, USA, [email protected]
Jim A. Rogers Department of Pathology and Microbiology, University of Nebraska Medical Center, 983135 Nebraska Medical Center, Omaha, NE 68198, USA; Department of Mathematics, University of Nebraska, 6001 Dodge Street, Omaha, NE 68182, USA, [email protected]
Alessandro Romanel CoSBi and Università di Trento, Trento, Italy, [email protected]
Claire Rutherford Biochemistry and Cell Biology, Faculty of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ Scotland, UK
Graeme D. Ruxton Division of Environmental & Evolutionary Biology, Institute of Biomedical and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow, G12 8QQ, UK
Guido Sanguinetti ChELSI Research Institute, Department of Chemical and Process Engineering, University of Sheffield, Mappin St, S1 3JD Sheffield, UK; Department of Computer Science, 211 Portobello St, University of Sheffield, S1 4DP Sheffield, UK
Adrian Schröder Center for Bioinformatics Tübingen (ZBIT), Sand 1, 72076 Tübingen, University of Tübingen, Germany, [email protected]
Rebecca A. Sealfon Center for Translational Systems Biology, Mount Sinai School of Medicine, New York, NY 10029, USA
Stuart C. Sealfon Center for Translational Systems Biology, Department of Neurology, Box 1137, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, NY 10029, USA, [email protected]
Igor Šegota Department of Physics, Cornell University, Ithaca, NY 14853-2501, USA, [email protected]
Michael D. Sinzinger Department of Biochemistry, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands, [email protected]
Balaji S. Srinivasan Departments of Computer Science and Statistics, Stanford University, Stanford, CA, USA, [email protected]
Fusheng Tang Department of Biology, University of Arkansas at Little Rock, Little Rock, AR, USA, [email protected]
Shinya Tasaki Medical Proteomics Laboratory, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Ljupčo Todorovski Faculty of Administration, University of Ljubljana, Gosarjeva 5, SI-1000 Ljubljana, Slovenia, [email protected]
Julio Vera Systems Biology and Bioinformatics Group, University of Rostock, 18051 Rostock, Germany, [email protected]; Web: www.sbi.uni-rostock.de
Kristian Vlahoviček Bioinformatics Group, Division of Biology, Faculty of Science, Zagreb University, Horvatovac 102a, 10000 Zagreb, Croatia; Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway, [email protected]
Olaf Wolkenhauer Systems Biology and Bioinformatics Group, University of Rostock, 18051 Rostock, Germany
Phillip C. Wright ChELSI Research Institute, Department of Chemical and Process Engineering, University of Sheffield, Mappin St, S1 3JD Sheffield, UK
Xiaowei Xu Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA, [email protected]
Zobair M. Younossi Betty and Guy Beatty Center for Integrated Research, Inova Health System, Falls Church, Virginia, USA, [email protected]
Yau-Hua Yu School of Dentistry, National Yang-Ming University, Taipei, 112 Taiwan, Republic of China; Department of Dentistry and the Department of Medical Research and Education, Taipei Veterans General Hospital, Taipei, 112 Taiwan, Republic of China, [email protected]
Nurcan Yuruk Department of Applied Science, University of Arkansas at Little Rock, Little Rock, AR, USA, [email protected]
Andreas Zell Center for Bioinformatics Tübingen (ZBIT), Sand 1, 72076 Tübingen, University of Tübingen, Germany, [email protected]
Ralf Zimmer LFE Bioinformatik, Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstraße, 17, 80333 München, Germany, [email protected]

Part I
Concepts
Chapter 1
Systems Biology Approaches: Solving New Puzzles in a Symphonic Manner
Sangdun Choi
Abstract The AfCS (Alliance for Cellular Signaling) endeavored to delineate the complex immune signaling systems and control networks by using systems biological approaches. We, the AfCS, have analyzed the changes in transcription and cytokine levels after the addition of multiple ligands in murine B cells and macrophage RAW 264.7 cells. We have also examined the fluctuations in cAMP, calcium, and phosphoprotein levels, and measured protein–protein interactions and the effects of RNAi/drug perturbations. A time series examination of the combined effects of endogenous or exogenous ligands enabled the identification of signaling networks that are responsible for cellular signaling. Biological processes are driven by complex systems of functionally interacting macromolecules. Complex biological phenomena can be understood in terms of the interactions of these functional components, and the measurement of cellular responses after network perturbations can be used to probe the connectivity of signaling systems. Combined with current molecular biological tools, systems biological approaches are ideal for the description of signaling networks and the development of predictive/preventative medicine.

Keywords AfCS · Integrative manner · Omics · Signaling networks · Systems biology

S. Choi, Department of Molecular Science and Technology, Ajou University, Suwon 443-749, Korea; e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_1, © Springer Science+Business Media, LLC 2010

Seemingly common events in a person’s daily life become meaningful when we look at them in depth. For example, a mirror shows your left as right and your right as left; however, your face does not appear upside down (Fig. 1.1). The reasons for this phenomenon are revealed upon detailed analysis of the reflections. For example, when you hold a rock or scissors in your right or left hand, the mirror shows the same symbol. Here, our eyes perceive the signal as altered, even though it is just a simple physical phenomenon associated with the reflection by the mirror that makes us perceive that it is on the other side. Systematically, the hand position remains unaltered; hence the face should not be upside down. However, when we ask ourselves questions as scientists, we tend to overlook various aspects of the science by treating it as if we are studying images in a mirror.

Fig. 1.1 When you look in the mirror, the left and right are altered. Then, why is the face not upside down?

Traditionally, there are two methods used by scientists to learn about natural phenomena: a reductionist approach and a systems approach. In the reductionist approach, we see things partially and separately, which makes it difficult to make out the entire scenario. Conversely, in the systems approach, the entire picture is evaluated first. Systems biology is an approach to the study of biological phenomena in a systematic and symphonic manner. In systems biology, the emphasis shifts from asking specific questions or testing hypotheses based on a single approach to filtering out the most significant observations the data offer. As the actual cellular arena is very complex, it is necessary to employ high-throughput techniques that can describe individual events that are occurring simultaneously in a system as a whole. The research tools that provide extensive data include omics tools, such as genomics, transcriptomics, proteomics, interactomics, metabolomics, localizomics, and phenomics. Using this approach enables any kind of data to be explored, including gene expression, structural genomics, protein expression, phosphorylation, and protein interaction (Fig. 1.2).

One of the journeys toward systems biology was initiated by the Alliance for Cellular Signaling (AfCS: http://www.signaling-gateway.org) (Gilman et al. 2002), wherein the author of this chapter was actively engaged in the elucidation of the cellular signaling networks in murine macrophages and B cells (Fig. 1.3). The goal of the AfCS was to delineate the signaling pathways after stimulation with various ligands (around 50) to determine how the pathways interact in a complex cell system (Sambrano et al. 2002).
[Figure 1.2 labels: Gene Expression; Structural Genomics; Protein Expression; Phosphorylation; SNPs; Protein Interaction; Methylation; Cell State; Perturbation; Metabolitics; Disease; Drug Response; Protein Structure; Drug Structure]
Fig. 1.2 Any type of data can be explored using systems biological approaches

[Figure 1.3 labels: Alliance for Cellular Signaling (www.signaling-update.org); Lipidomics (Vanderbilt); Cell Preparation and Analysis (UTSW); Bioinformatics (UCSD); Antibody (UTSW); Development of Signaling Assays (UCSF); Microscopy (Stanford); Molecular Biology (Caltech); Protein Chemistry (UTSW)]
Fig. 1.3 The alliance for cellular signaling

If scientists look for a signaling mechanism caused by a single ligand, it may appear as a low-complexity scenario since the ligand stimulates one individual pathway. However, when multiple ligands are involved, the interactions become much more complex. The AfCS took advantage of this concept to study cellular responses after treatment with ligands applied singly, in pairs, and in multiple combinations. Using a systemic approach, the responses of the cell to a given ligand were monitored on time scales of a few seconds, a few minutes, and a few hours. For example, cellular calcium levels were measured as a parameter of immediate response, while cAMP measurement reflected a response that occurred within a few seconds, and phosphoproteins and gene expression were responses that required a few minutes (Fig. 1.4). In addition, the AfCS explored cytokine secretion, protein localization, and protein–protein interaction as late responses to the ligands. The AfCS then used this information for subsequent RNAi (RNA interference) and drug perturbations.

[Figure 1.4 labels: Single & Multiple Ligand Screen; Calcium; cAMP; Phosphoproteins; Gene expression; Cytokines; Protein expression; Protein location; Yeast 2 hybrid; Lipids; RNAi; Drug Perturbation; Nature/AfCS Database]
Fig. 1.4 Cellular responses were measured after treatment with single and multiple ligands
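To give a rough sense of how quickly the experimental space grows once ligands are combined, the short Python sketch below counts the possible single, double, and triple ligand treatments. It assumes exactly 50 ligands (the text says only "around 50"), and the function name `count_treatments` is purely illustrative; it does not describe the screening design that the AfCS actually used.

```python
from math import comb


def count_treatments(n_ligands: int, max_combination: int = 3) -> dict[int, int]:
    """Return {k: number of distinct k-ligand treatments} for k = 1..max_combination."""
    return {k: comb(n_ligands, k) for k in range(1, max_combination + 1)}


# Assumption: 50 ligands, taken from the "around 50" quoted in the text.
counts = count_treatments(50)
for size, n_treatments in counts.items():
    print(f"{size}-ligand treatments: {n_treatments}")
```

Under this assumption there are 50 single, 1,225 double, and 19,600 triple treatments, which illustrates why exhaustive screening beyond pairs rapidly becomes impractical.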
Once these high-throughput experiments were completed, the data were deposited in a database managed by Nature Publishing Group and the AfCS (www.signaling-gateway.org). The ligands chosen for screening included complement C5a, interferon-beta (IFNβ), interleukin 4 (IL4), lipopolysaccharide (LPS), macrophage colony-stimulating factor (MCSF), prostaglandin E2 (PGE2 ), sphingosine-1-phosphate (S1P), transforming growth factor-beta (TGFβ), and many other well-known cytokines (Pradervand et al. 2006). Clicking on each listed icon in the database leads to the data sets obtained from each measurement. Multiple ligand screening results are also shown in the database. In addition to Ca2+ and cAMP levels, gene expression was analyzed using oligonucleotide microarrays, and their expression profiles were then evaluated by cluster analysis (Fig. 1.5) (Lee et al. 2006; Zhu et al. 2004, 2006). Secreted cytokines were measured using antibody beads. The results revealed that many cytokines were significantly upregulated by Toll-like receptor ligands, such as LPS, PAM2CSK4, PAM3CSK4, and resiquimod (R-848). Protein expression was also screened by 2D gel electrophoresis followed by MALDI-TOF analysis. The phosphoprotein levels were measured after 1, 3, 10, and 30 min in single or double ligand treatments (Fig. 1.6) (Mumby and Brekken 2005). The results showed that AKT, ERK1, and ERK2 phosphorylation levels were increased in response to LPS treatment, but not in response to interferon-gamma (IFNγ). However, treatment with both LPS and IFNγ led to significant enhancement of the phosphorylation of these proteins. Moreover, systematic data analyses revealed that these non-additively regulated molecules were the hubs of cross talk
Systems Biology Approaches
7
15,840 gene elements
2MA AIG BAF BLC BOM 70L 40L CGS CPG DIM ELC FML GRH IFB IFG IGF IL10 IL4 LPA LPS LB4 M3A NEB NGF NPY PAF PGE S1P SDF SLC TER TGF TNF
1
33 ligands (4 time points each)
Fig. 1.5 Gene expression analyses
Fig. 1.6 Enhanced or attenuated protein phosphorylation levels. AKT, ERKs, and p90RSK revealed synergy
8
S. Choi
points between the ligand-induced pathways (Natarajan et al. 2006; Polouliakh et al. 2009; Roach et al. 2008; Wall et al. 2009). The regulated genes showed complex multiple transcription factor binding sites in their promoter regions. The communications between these transcriptions factors in the DNA upstream regions are methods of controlling the complex signaling networks, and transcription factors exhibit various functions by acting as inducers or repressors. Data analysis also revealed the attenuation of relevant molecules. Based on the obtained data, canonical signaling pathways were identified. To further compile the fundamental data, protein localization experiments were conducted. The location of the signaling molecule is very important in cellular signaling because certain molecules have to translocate to target areas to carry out their functional responses. For example, NFκB and p53 have to move into the nucleus to control the synthesis of new proteins, whereas AKT has to move from cytosol to the membrane following stimulation with S1P. A yeast two-hybrid assay was also conducted to identify the protein–protein interactions through which the signal was relayed. In the B-cell receptor pathway (Fig. 1.7), the protein–protein interaction
Fig. 1.7 Protein–protein interactions in the B-cell receptor pathway
1
Systems Biology Approaches
9
test revealed that Pak binds to Rac and Nck, while PLCγ binds to SHIP, BTK to calcineurin, and PDK1 to PDE4B3. Once the brief signaling pathways were identified using the experimental data, RNAi oligonucleotides that could be used to knock down the target genes and study the effects of the missing signaling molecule were designed (Lee et al. 2009; Shin et al. 2006; Zhu et al. 2007). For example, the molecules that are at the hubs of the G-protein signaling networks were knocked down using RNAi, and this perturbation enabled the identification of signaling molecules responsible for the important Ca2+ or PIP3 module (Hwang et al. 2004, 2005) (Fig. 1.8). The AfCS has extended these studies to drug perturbation. Specifically, the following drugs were chosen: piceatannol for Syk, pertussis toxin for Gαi, PP2 for Lyn, LY294002 for PI3K, U73122 for PLC beta and gamma, Go6976 for PKC, thapsigargin for SERCA, ML-7 for MLCK, cyclosporine A for calcineurin, and KN62 for CaMK. Calcium and PIP3 were measured as the output responses. To date, the AfCS has generated a large amount of omics data. The next step is to extract the most significant and meaningful observations from that vast amount of data. Once the meaning has been extracted, we will piece together the findings from each group of data and then use this information to elucidate signaling
Fig. 1.8 The target genes knocked down by RNAi. The underlined molecules indicate that the target, when knocked down, changed the Ca2+ response phenotype. Red color indicates an increase and green indicates a decrease of calcium level
Fig. 1.9 Cell signaling networks
pathways to construct complete network models. Figure 1.9 shows the cell signaling networks constructed for the B-cell receptor (BCR), IL4 receptor, calcium, chemotaxis, insulin receptor, and cAMP. Although the signaling networks presented here were primarily created based on previously identified canonical pathways, the future construction of signaling networks should be more comprehensive and integrative. Additionally, future analyses should include a simulated virtual cell system in which various factors can be supplied as inputs and the resulting outputs predicted. This should provide much more valuable information for understanding the actual biology carried out simultaneously by every component, and it could support the development of novel medicines more rapidly than typical individual technical approaches. Over the 8 years of the AfCS endeavor (2000–2008), the project was generally labeled as functional genomics and cellular signaling. However, evaluation of the overall project reveals that the AfCS was actually engaged in systems biology. Systems biology is a relatively novel field that focuses on the systematic study of complex biological interactions, mostly by using omics data. This term has become common in biosciences since the year 2000. As systems biology becomes more disciplined, demand for its effective execution will increase. The goal of systems biology is to construct all of the signaling networks that arise from all of the molecules in a biological system. To better understand the entire process being evaluated, it may be necessary to identify the individual parts first and then evaluate the unique effects of each molecule on the biological situation being investigated.
Scientific discovery is a type of puzzle-solving activity. As a result, we are always fascinated when we find an unexpected association between a puzzle and a fact that has been ignored. Systems biology may link facts that were previously ignored to new puzzles that were generated by recent advances. Solving such new biological puzzles in a symphonic and integrative manner is the essence of systems biology.

Acknowledgments This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (2010-0016256) and a grant (10182KFDA992) from the Korea Food & Drug Administration in 2010.
References
Gilman AG, Simon MI, Bourne HR et al (2002) Overview of the alliance for cellular signaling. Nature 420(6916):703–706
Hwang JI, Choi S, Fraser ID et al (2005) Silencing the expression of multiple Gbeta-subunits eliminates signaling mediated by all four families of G proteins. Proc Natl Acad Sci USA 102(27):9493–9498
Hwang JI, Fraser ID, Choi S et al (2004) Analysis of C5a-mediated chemotaxis by lentiviral delivery of small interfering RNA. Proc Natl Acad Sci USA 101(2):488–493
Lee G, Santat LA, Chang MS et al (2009) RNAi methodologies for the functional study of signaling molecules. PLoS One 4(2):e4559
Lee JA, Sinkovits RS, Mock D et al (2006) Components of the antigen processing and presentation pathway revealed by gene expression microarray analysis following B cell antigen receptor (BCR) stimulation. BMC Bioinform 7:237
Mumby M, Brekken D (2005) Phosphoproteomics: new insights into cellular signaling. Genome Biol 6(9):230
Natarajan M, Lin KM, Hsueh RC et al (2006) A global analysis of cross-talk in a mammalian cellular signalling network. Nat Cell Biol 8(6):571–580
Polouliakh N, Nock R, Nielsen F et al (2009) G-protein coupled receptor signaling architecture of mammalian immune cells. PLoS One 4(1):e4189
Pradervand S, Maurya MR, Subramaniam S (2006) Identification of signaling components required for the prediction of cytokine release in RAW 264.7 macrophages. Genome Biol 7(2):R11
Roach TI, Rebres RA, Fraser ID et al (2008) Signaling and cross-talk by C5a and UDP in macrophages selectively use PLCbeta3 to regulate intracellular free calcium. J Biol Chem 283(25):17351–17361
Sambrano GR, Chandy G, Choi S et al (2002) Unravelling the signal-transduction network in B lymphocytes. Nature 420(6916):708–710
Shin KJ, Wall EA, Zavzavadjian JR et al (2006) A single lentiviral vector platform for microRNA-based conditional RNA interference and coordinated transgene expression. Proc Natl Acad Sci USA 103(37):13759–13764
Wall EA, Zavzavadjian JR, Chang MS et al (2009) Suppression of LPS-induced TNF-alpha production in macrophages by cAMP is mediated by PKA-AKAP95-p105. Sci Signal 2(75):ra28
Zhu X, Chang MS, Hsueh RC et al (2006) Dual ligand stimulation of RAW 264.7 cells uncovers feedback mechanisms that regulate TLR-mediated gene expression. J Immunol 177(7):4299–4310
Zhu X, Hart R, Chang MS et al (2004) Analysis of the major patterns of B cell gene expression changes in response to short-term stimulation with 33 single ligands. J Immunol 173(12):7141–7149
Zhu X, Santat LA, Chang MS et al (2007) A versatile approach to multiple gene RNA interference using microRNA-based short hairpin RNAs. BMC Mol Biol 8:98
Chapter 2
Current Progress in Static and Dynamic Modeling of Biological Networks Bernie J. Daigle, Jr., Balaji S. Srinivasan, Jason A. Flannick, Antal F. Novak, and Serafim Batzoglou
Abstract The relentless advance of biochemistry has enabled us to take apart biological systems with ever more fine-grained and precise instruments. The fruits of this dissection are millions of measurements of base pairs and biochemical concentrations. Yet to make sense of these numbers, we need to reverse our dissection by putting the system back together on the computer. The first step in this process is reconstructing molecular anatomy through static modeling, the determination of which pieces (DNA, RNA, protein, and metabolite) are present and how they are related (e.g., regulator, target, inhibitor, cofactor). Given this broad outline of component connectivity, we may then attempt to reconstruct molecular physiology via dynamic modeling, computer simulations that model when cellular events occur (ODE), where they occur (PDE), and how frequently they recur (SDE). In this review we discuss techniques for both of these modeling paradigms, illustrating each by reference to important recent papers.

Keywords Biological networks · Computer simulation · Dynamic modeling · Static modeling
B.S. Srinivasan (✉), Departments of Computer Science and Statistics, Stanford University, Stanford, CA, USA; e-mail: [email protected]
B.J. Daigle Jr. and B.S. Srinivasan have contributed equally to this work.
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_2, © Springer Science+Business Media, LLC 2010

2.1 Introduction
The term “post-genomic era” became a cliché even before the human genome was sequenced, but it has a definite meaning. It refers to the refocusing of effort on tasks that were insurmountable without the genome as platform, such as the construction of hybridization probes for every human gene (Schena et al. 1996) or the phenotyping of knockout strains for every yeast ORF (Winzeler et al. 1999). Many different kinds of these genome-scale data sets are now available (Collins et al. 2007; Foster
et al. 2006; Gavin et al. 2006; Kim et al. 2005; Krogan et al. 2006; Lamb et al. 2006; Sachs et al. 2005), and each analysis tells the same story: the components of biological systems are not free-floating parts, but are organized into functional modules (Hartwell et al. 1999). Systems biology is the science of quantitatively defining and analyzing these modules (Bornholdt 2005) and can be divided into two broad areas: static modeling of an organism’s interactome (Section 2.2) and dynamic modeling of a biological system’s kinetics, spatial structure, or stochastic variation (Section 2.3). In general, static models tend to be broader and coarser in scope, often encompassing the entire interactome, while dynamic models usually focus on the details of a single subsystem, such as chemotaxis (Alon et al. 1999), lysogeny (Arkin et al. 1998), or morphogenesis (Igoshin et al. 2004a,b). Static modeling is less demanding from an experimental perspective, as just about any assay on a population of cells will prove informative. By contrast, deterministic dynamic models require temporally and sometimes spatially (Meinhardt and de Boer 2001) resolved data, and stochastic dynamic models require even more data in the form of population ensembles. In this review, we discuss both modeling strategies with an eye toward describing statistical considerations and summarizing recent successes.
2.2 Static Models of Biological Networks
Static modeling is best conceptualized as the computerized reconstruction of molecular anatomy. In much the same way that macroscopic anatomy tells us that the shinbone is connected to the kneebone, molecular anatomy tells us which molecules interact with each other. Yet the situation at the molecular level is complicated by the fact that we cannot yet “see” the molecular components of a cell at the same resolution that a pathologist can observe the bones and muscles of a cadaver. Our approach is rather more like that of an archaeologist who discovers many piles of bones in different configurations and must statistically reason that shinbones and kneebones are likely to be functionally related, as they are (1) often found near each other, (2) usually present together in different species, and (3) more correlated in size than random pairs of bones. This concept of statistically inferring static relationships via “guilt by association” is one of the core ideas behind static modeling. We visually represent these inferred static relationships by a network. Nodes of this network correspond to components of the system and edges to relationships between components. Different kinds of static models are usefully distinguished by the number and type of nodes and edges that are present. As a general rule, larger networks for more complex organisms require more data to reconstruct. In this section, we review methods for the representation and inference of static network models from multiple data sources. We describe a common Bayesian formulation which unifies the steps of network integration and experimental validation. By analogy to the concept of a reference genome assembly (Lander et al. 2001;
Venter et al. 2001), we then describe how recent large-scale efforts at network determination, such as the recent connectivity map (Lamb et al. 2006) and the proposed human interactome project (Ideker and Valencia 2006), have led naturally to the concept of ontologically labeled, richly typed reference networks. We conclude by discussing methods for network alignment, network visualization, and network-guided experimental prioritization.
2.2.1 Advantages of Static Models
Because static modeling is about determining which elements are present (nodes) and how they are interconnected (edges), it is a basic prerequisite for any kind of systems biological analysis. As one example, determining whether two bacteria can metabolize the same sets of compounds requires an enumeration of their functional modules, roughly corresponding to the evolutionarily conserved subgraphs in their respective static models. As another example, identifying proteins which are essential for cellular function can be greatly aided by knowledge of which proteins are central in static network models. In particular, static models are essential starting points for more complex dynamic modeling strategies.
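To make the notion of a "central" protein concrete, the following minimal sketch computes degree centrality, one simple centrality measure among several. The edge list and protein names are hypothetical, and the choice of degree centrality is an assumption for illustration; the text does not prescribe a particular measure.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's degree divided by the maximum possible degree (n - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Hypothetical undirected interaction edges (names are illustrative only).
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(toy_edges)
most_central = max(centrality, key=centrality.get)
```

In this toy network P1 touches every other node, so it is ranked first; in a real interactome such a ranking would only be a starting hypothesis about essentiality.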
2.2.2 Limitations of Static Models
Perhaps the most obvious limitation of a static model is that it is in fact static: it does not incorporate temporal, spatial, or conditional information except indirectly. In particular, less detailed static models may give little information about how different nodes talk to each other. For example, low-resolution models that predict solely whether two proteins “interact” with some probability are useful for generating hypotheses, but give little mechanistic insight as to whether they are related by physical contact, presence in the same pathway, or regulation of the same genes. These limitations can be partially overcome by including more types of edges, though there are fundamental limitations on the level of conditional detail (Fig. 2.1) possible in a static network.
2.2.3 Specific Tasks Associated with Static Modeling
We can order the process of static modeling into five sequential tasks:
1. Determine desired network detail. The first step in static modeling is to determine the scope and detail of the network reconstruction. Put simply, how many nodes and edges are desired, and what are their types? The goal here is to quantitatively parametrize the network by a response variable. For example, this could be an N² × 1 vector of Boolean connectivities on N² edges (as per L in Fig. 2.2)
[Figure 2.1 graphic.
Panel (a), "Marginal Correlations Can Be Determined Individually...": expression ratios over a time course and the gene correlation matrix Rt across time points.

          t1    t2    t3    t4
Gene 1   .97   .69   .65   1.1
Gene 2   .93   .74   .73   1.2
Gene 3   .32   .87   .65   .84

Rt (lower triangle): 1 / .96 1 / –.17 –.07 1

Panel (b), "...But Conditional Correlations Demand Cross-Sectional Data": protein profiling (subcellular organelle abundance) across fractions and time, and the protein correlation matrix Rf across fractions.

           f1      f2      f3
Prot. 1   37385   38217   19175
Prot. 2    5221    3071    4431
Prot. 3    9388    5698    5021

Rf (lower triangle): 1 / –.19 1 / .59 –.68 1

Extracting the submatrix at time point t4 and calculating the correlation gives Rf|t4, the conditional correlation across fractions at a given time point (–.19, .59, –.68).]
Fig. 2.1 Data availability constrains network detail. (a) Given a cell cycle time course of gene expression measurements, we can determine which genes are temporally coexpressed (Spellman et al. 1998). Similarly, from protein correlation profiling (Foster et al. 2006), we can determine which proteins are abundant in the same subcellular organelles, and thereby derive a rough measure of colocalization. (b) Suppose, however, that we wish to determine whether a given protein pair is colocalized at a particular time in the cell cycle. To calculate this conditional correlation we must (1) sharply increase the number of data points in our experiment and (2) collect both kinds of data on the same object at the same time. This may be difficult or impossible to do experimentally; for example, the methods for determining protein abundance across organelles are very different from those for determining an mRNA abundance time series. As more kinds of variables are incorporated (chemical stimuli, genetic background, etc.) the requisite number of data points increases exponentially. These constraints fundamentally limit the extent to which conditional interactions can be probed with independently collected data sets. Reproduced from (Srinivasan et al. 2007) with permission from Oxford University Press
or an N × 1 vector of node properties (as per Y in Fig. 2.3). Ideally some high-resolution data on Y or L is already available, a so-called training set. This data could come from well-established individual publications and/or from low-throughput, expensive experiments.
2. Enumerate input data sources. The next step is to work backward to determine which input variables could potentially predict the desired network properties. Input variables can usually be either N × 1 vectors that give data on each node or N² × 1 vectors that predict edge data (X and E, respectively, in Fig. 2.3).
3. Network reconstruction. Given predictors (X, E) and partial training data on (Y, L), we can use machine learning to predict the remaining elements of (Y, L) (Figs. 2.2 and 2.3). The result is a static network model (Ŷ, L̂) with information on the nodes and edges of the biological system under consideration. Here the “hat” denotes the fact that these values are estimates and of lower reliability than the “gold-standard” training set.
4. Experimental confirmation. In the ideal scenario, the properties of the predicted network are then experimentally tested. The gold standard is to make new high-resolution measurements on (Y, L) using the same methods used to assemble the training set and to compare these experimental measurements to
Fig. 2.2 Enumerating labels and predictors for data integration. For each protein pair, we can compute labels and predictors. At the top of the figure, two kinds of labels and four predictors have been tabulated for each pair of proteins; given N proteins, this table will have N (N − 1)/2 rows. Labels are directly useful to humans while predictors represent raw experimental data. Importantly, many labels correlate with predictors. For example, calculating conditional density estimates (lower left) for the phylogenetic profile correlation over all pairs in Mycoplasma genitalium shows that highly coinherited pairs are likely to functionally interact in the same KEGG category (Srinivasan et al. 2006). This statistical dependence can be used to put predictors on the same scale, by normalizing them in terms of their ability to recapitulate functional interactions. It can also be used to fill in uncurated labels and integrate different data types (Fig. 2.3). Reproduced from Srinivasan et al. (2007) with permission from Oxford University Press
the predictions (Ŷ, L̂). If the predictions match experiment sufficiently well, we can replace potentially expensive high-resolution measurements on (Y, L) with cheaper bulk measurements of the relevant (X, E) variables.
5. Network applications. Given an experimentally reliable static network, we can then proceed to further applications, such as comparisons of networks across species and conditions (network alignment) and network-guided experimental prioritization.
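The bookkeeping behind tasks 1 to 4 can be sketched in a few lines. This is an illustrative outline only: the label values are the three curated entries of L shown for four proteins in Fig. 2.3, while the `agreement` helper, the predicted values, and the "new measurements" are hypothetical stand-ins for a real classifier and a real validation experiment.

```python
from itertools import combinations

# Task 1: the response variable is one label per unordered protein pair,
# giving N(N - 1)/2 rows for N proteins (as in Fig. 2.2).
proteins = [1, 2, 3, 4]
pairs = list(combinations(proteins, 2))

# Curated labels L for some pairs (values as listed in Fig. 2.3);
# pairs absent from this dict are unlabeled (NA).
known_labels = {(1, 3): 1, (1, 4): 0, (3, 4): 0}
labels = [known_labels.get(pair) for pair in pairs]

# Task 3 setup: split pairs into a training set and pairs still to be predicted.
training = [(pair, label) for pair, label in zip(pairs, labels) if label is not None]
to_predict = [pair for pair, label in zip(pairs, labels) if label is None]


def agreement(predicted, measured):
    """Task 4: fraction of pairs where the prediction matches a new measurement."""
    shared = predicted.keys() & measured.keys()
    if not shared:
        raise ValueError("no pairs in common between predictions and measurements")
    return sum(predicted[p] == measured[p] for p in shared) / len(shared)


# Hypothetical predictions and hypothetical follow-up measurements.
predicted = {(1, 2): 0, (2, 3): 1, (2, 4): 0}
measured = {(1, 2): 0, (2, 3): 1, (2, 4): 1}
score = agreement(predicted, measured)
```

Keeping unlabeled pairs explicit, rather than dropping them, makes it clear which rows a reconstruction method is expected to fill in.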
[Figure 2.3 graphic: "Data Integration As Supervised Learning".

Left panel: enumerate data for individual proteins. Label: Y = KEGG category. Predictors: X1 = expression profile, X2 = phylogenetic profile.

ID   Y             X1,1  X1,2  X1,3  X2,1  X2,2  X2,3
1    00071,00280   .70   .80   .90   105   200   300
2    NA            .81   .95   .70    60   105    55
3    00071         .50   .51   .59   120   180   310
4    03010         .40   .10   .20   200    80    50

Calculate or approximate P(Y | X). Yields: integrated functional category prediction from data.

Right panel: and/or enumerate data for protein pairs. Label: L = shared KEGG. Predictors: E1 = coexpression, E2 = coinheritance, E3 = TAP/MS interaction, E4 = Y2H interaction.

ID1  ID2  L    E1     E2     E3   E4
1    2    NA   –.44   –.11   NA   NA
1    3    1     .91    .98   1    NA
1    4    0    –.65   –.94   NA   1
2    3    NA   –.77   –.30   1    0
2    4    NA   –.39   –.24   NA   1
3    4    0    –.29   –.86   NA   NA

Calculate or approximate P(L | E). Yields: integrated functional interaction prediction from data.]
Fig. 2.3 Data integration as supervised learning. For each biological object, we tabulate labels and predictors as in Fig. 2.2. Rather than comparing predictors in terms of their correlation with the label, we use all the predictors at the same time to estimate the label. For example, if we do this for the specific biological object of individual proteins, we can obtain an integrative prediction of protein function. If instead we do this for pairs of proteins, we can obtain an integrative prediction of protein interaction. Note that some of the columns in the pair table are only defined for pairs (in this case, the TAP/MS and Y2H data) while other quantities can be computed from the protein table. Note also that for statistical reasons, the interaction prediction problem can be easier than the function prediction problem. Function prediction is a multiclass classification problem with only a few thousand data points, whereas interaction prediction is a binary classification problem with millions of data points (Hastie et al. 2001). Importantly, the supervised learning framework can be applied to many other kinds of biological objects besides proteins and protein pairs. Reproduced from (Srinivasan et al. 2007) with permission from Oxford University Press
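The remark that some pair-table columns "can be computed from the protein table" can be checked directly. The sketch below assumes that the coexpression (E1) and coinheritance (E2) columns are Pearson correlations of the X1 and X2 profiles; that assumption is not stated in the caption, but it is consistent with the values shown in the figure. The profile values are transcribed from Fig. 2.3.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Protein table from Fig. 2.3: ID -> (X1 expression profile, X2 phylogenetic profile).
protein_table = {
    1: ([0.70, 0.80, 0.90], [105, 200, 300]),
    2: ([0.81, 0.95, 0.70], [60, 105, 55]),
    3: ([0.50, 0.51, 0.59], [120, 180, 310]),
    4: ([0.40, 0.10, 0.20], [200, 80, 50]),
}

# Derive the pairwise predictors E1 and E2 for every unordered protein pair.
pair_table = {}
for i, j in combinations(sorted(protein_table), 2):
    x1_i, x2_i = protein_table[i]
    x1_j, x2_j = protein_table[j]
    pair_table[(i, j)] = {
        "E1": pearson(x1_i, x1_j),  # coexpression
        "E2": pearson(x2_i, x2_j),  # coinheritance
    }
```

Under this assumption the derived values for pairs (1, 2) and (1, 3) round to the figure's entries: about –.44 and .91 for E1, and about –.11 and .98 for E2. The TAP/MS and Y2H columns cannot be derived this way because they are measured on pairs directly.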
The remainder of this section is split into four parts: a summary of data sources used for static modeling, an overview of algorithms for network reconstruction, a discussion of network representations, and a survey of applications of static networks.
2.2.4 Data for Static Modeling
2.2.4.1 Data Types and Sources
The goal of static modeling is to infer the properties of nodes and edges for a given biological system, which is often the entire interactome of a single organism. Subgraphs of tightly interconnected objects in these networks represent functional modules (Barabasi and Oltvai 2004). Some of these networks are obtained from edge predictors E in Fig. 2.3, in that they come from direct measurements of pairwise interactions (Zhu et al. 2007), including physical (Gavin et al. 2006; Krogan
et al. 2006), signaling (Pokholok et al. 2006; Ptacek and Snyder 2006), transcriptional (Davidson et al. 2002; Wei et al. 2006), metabolic (Covert et al. 2004), and epistatic (Collins et al. 2007; Schuldiner et al. 2005; Tong et al. 2001) networks. Other networks have their connectivity inferred indirectly through measurements on node predictors X, such as coexpression under the same conditions (Lamb et al. 2006), in the same tissues (Chen et al. 2006), or at the same time points (Laub et al. 2000; Spellman et al. 1998); coinheritance in the same species (Pellegrini et al. 1999; Srinivasan et al. 2005); collocation on chromosomes (Overbeek et al. 1999); coevolution of residues (Pazos et al. 2005); or shared mutant phenotype (Dudley et al. 2005). These indirect networks are constructed by using variation along one dimension (time, space, environmental perturbation, etc.) to inform the construction of the global network. For example, proteins that are abundant in the same subcellular organelles (Foster et al. 2006) are likely to functionally interact, as are genes that are expressed at the same time (Spellman et al. 1998); such interacting sets represent subgraphs in the global interaction network. Given that hundreds of these large-scale data sets are now available, it has become essential to consult meta-databases. Among the most useful are Pathguide (Bader et al. 2006), BiowareDB (Matthiessen 2003), BioGRID (Stark et al. 2006), the yearly Nucleic Acids Research Database (Galperin and Cochrane 2009) and Web Server (Benson 2009) issues, and a recent compilation of more than 150 publicly available functional genomic resources (Ng et al. 2006). As a general rule, it is probably best to limit one’s use of raw data to data sets curated by the major databases (NCBI, EBI, DDBJ, UCSC, etc.). Otherwise a great deal of time will be spent mapping identifiers and parsing various data formats. 
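As an illustration of how an indirect network can be derived from node predictors, the sketch below links genes whose phylogenetic profiles (presence or absence across species) are similar. The profiles, the gene names, the use of Jaccard similarity, and the threshold are all assumptions made for illustration; published coinheritance methods differ in their scoring, and a fixed cutoff is exactly the kind of arbitrary choice that the supervised approach of Section 2.2.5.3 is meant to replace.

```python
from itertools import combinations


def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def indirect_network(profiles, threshold):
    """Return {(gene_u, gene_v): score} for pairs whose similarity meets the threshold."""
    edges = {}
    for u, v in combinations(sorted(profiles), 2):
        score = jaccard(profiles[u], profiles[v])
        if score >= threshold:
            edges[(u, v)] = score
    return edges


# Hypothetical profiles: one 0/1 entry per species in which the gene is absent/present.
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],
    "geneC": [0, 0, 1, 0, 1],
    "geneD": [1, 1, 0, 0, 0],
}
edges = indirect_network(profiles, threshold=0.6)
```

Here variation along one dimension (species) is what informs the inferred edges; swapping in time points or tissues would give a coexpression network by the same pattern.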
2.2.4.2 Data Limits Static Model Complexity
As we shall see, the advantage of static modeling is that it can incorporate data sets compiled by a number of investigators at different times and under different conditions. However, this very advantage also imposes fundamental limitations on static model complexity. For example, consider the problem of determining a conditional network of interactions or correlations in each subcellular organelle. As Fig. 2.1 shows, this seemingly simple request dramatically increases the amount of data that must be simultaneously collected. Moreover, in many cases the extra resolution is simply unavailable with current experimental techniques. Microfluidic automation of basic laboratory procedures (Demello 2006; Hansen and Quake 2003) may make such cross-sectional measurements feasible in the future, but with few exceptions, such as the high-throughput construction and characterization of deletion strains (Giaever et al. 2002), fine-grained conditional data is usually unavailable. Even in large-scale studies, data is usually collected on only one variable at a time. Thus, the limitations of the available data tend to force us toward a static lowest common denominator map of interactions for most organisms, averaged over time, space, perturbation, and other variables. All is not lost, however, as this static network is still a significant conceptual leap beyond the raw genome sequence of an
organism. Moreover, variation of different kinds (e.g., upregulation of genes or spatial localization of proteins) can be visualized by superimposing tracks and layouts upon such static networks (Hu et al. 2007; Shannon et al. 2003) in the same way we view gene and motif tracks upon a genome assembly (Kuhn et al. 2007).
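The data requirement discussed in this subsection can be illustrated with the numbers shown in Fig. 2.1. In the sketch below, the marginal correlations are computed separately from each independently collected data set, and a small helper (hypothetical, for illustration) counts how many measurements a fully cross-sectional design would need. The row labelled "Prot. 1" is assumed to be the first abundance row of the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Values transcribed from Fig. 2.1: expression ratios at t1..t4 and
# protein abundances in three subcellular fractions.
expression = {
    "Gene 1": [0.97, 0.69, 0.65, 1.1],
    "Gene 2": [0.93, 0.74, 0.73, 1.2],
    "Gene 3": [0.32, 0.87, 0.65, 0.84],
}
abundance = {
    "Prot. 1": [37385, 38217, 19175],
    "Prot. 2": [5221, 3071, 4431],
    "Prot. 3": [9388, 5698, 5021],
}

# Marginal correlations: each needs only its own data set.
r_time = pearson(expression["Gene 1"], expression["Gene 2"])
r_fraction = pearson(abundance["Prot. 1"], abundance["Prot. 2"])


def measurements_needed(n_objects, levels_per_variable):
    """Measurements required when every object is assayed at every combination of levels."""
    total = n_objects
    for levels in levels_per_variable:
        total *= levels
    return total


marginal_design = measurements_needed(3, [3])        # fractions only
conditional_design = measurements_needed(3, [3, 4])  # fractions at each time point
```

The two marginal values come out near .96 and –.19, matching the matrices in the figure. The conditional design multiplies the measurement count by the number of time points, and each further variable multiplies it again, which is the exponential growth noted in the caption of Fig. 2.1.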
2.2.5 Network Reconstruction
2.2.5.1 Labels vs. Predictors
For the purposes of data integration, a useful data set is one that provides measurements on at least one type of biological object, such as genes, proteins, or protein pairs (Fig. 2.2). Such data sets can be divided into two broad categories: labels and predictors. Predictors, such as expression ratio measurements on a gene (Schena et al. 1996) or phylogenetic profiles of a protein (Pellegrini et al. 1999), are often “dense” in that they are available for most instances of a biological object and are acquired in a high-throughput way. For example, because most genes are present on standard microarrays, expression profiles are available for most genes (modulo missing values). In contrast, labels such as GO consortium gene annotations (Ashburner et al. 2000) or phosphorylation interactions culled from the literature (Saric et al. 2006) tend to be sparse and of high quality. One of the most important discoveries (Jansen et al. 2003) in functional genomics is that these curated labels, which represent directly useful information, can be statistically predicted from combinations of uncurated predictors.

2.2.5.2 Early Methods for Clustering and Integration
The road to this discovery began with early attempts at unsupervised integration and clustering. When the first microarray data sets became available, dozens of different algorithms for unsupervised clustering of these data sets were published (Altman and Raychaudhuri 2001; Sherlock 2000). These techniques were also applied to other data sets, such as phylogenetic profiling (Pellegrini et al. 1999). While individual clusters of genes were sometimes experimentally validated (Srinivasan et al. 2005; Stuart et al. 2003), it was difficult to assess the extent to which any given clustering reflected the true modules of the organism.
Given the fuzziness of the module concept, the fact that genes (and other biological objects) can belong to more than one module, and the often conditional nature of intra-module interactions, it was not clear whether the concept of a true set of modules was even a useful one. This problem became more pronounced when investigators began to combine interaction networks inferred from different assays, which in turn had apparently different modular structures. The first attempts (Tong et al. 2002) applied arbitrary thresholds to the interactions derived from different assays and used the union or intersection of these sets as an integrated network. In some cases, such as large-scale yeast two-hybrid data, the intersection was essentially the null set (Ito et al. 2001). While the goal of combining different assays to reduce noise was a step in the
right direction, the problem was that no clear method for weighting the confidence of different assays was available. As with unsupervised clustering, the underlying issue here was the lack of a true set of curated modules to benchmark different assays against.

2.2.5.3 Data Integration by Supervised Learning
Supervised Normalization
The solution (Jansen et al. 2003; Lee et al. 2004; Lu et al. 2005a; Srinivasan et al. 2006; Tanay et al. 2004; Troyanskaya et al. 2003; Wong et al. 2004) was to build a gold standard to compare different kinds of predictors. In general, different kinds of gold standards can be built from different labels; while a colocalization gold standard can be built from MIPS (Mewes et al. 1999), a functional interaction gold standard can be generated from EcoCyc (Karp et al. 2002), Reactome (Vastrik et al. 2007), GO (Harris et al. 2004), or KEGG (Kanehisa et al. 2006). Negative examples can then be easily generated via random permutation of positive labels (Ben-Hur and Noble 2006). Though simple, the permutation-based approach to generating negative examples has been shown to be superior to selecting a statistically biased subset of negative examples, such as proteins known to be in different subcellular localizations (Ben-Hur and Noble 2006). Given this gold standard, a useful predictor will separate positive from negative examples. This observed statistical separation can then be converted into a posterior probability by applying Bayes’ Rule (Srinivasan et al. 2006), allowing different predictors (uncurated data) to be compared in terms of their ability to recapitulate known biological labels (curated data). In the specific case of protein interaction prediction, a good predictor will recapitulate known labels by separating interacting protein pairs from non-interacting pairs (Figs. 2.2 and 2.3).

Detection of Corrupted Data
One important application of this result is screening microarray experiments for corrupted data (Srinivasan et al. 2006).
In addition to a battery of internal consistency checks (Irizarry et al. 2005; Woo et al. 2004), a series of expression measurements can also be used to calculate a correlation matrix, which can then be compared to a gold standard. If coexpression correlations separate positive and negative training examples as in the lower left panel of Fig. 2.2, the data set contains at least some signal; if no separation is observed, problems may have occurred with some hybridizations. Another important application is identifying which kinds of data may be systematically unreliable; for example, interactions from large-scale yeast two hybrid (Y2H) studies appear to be uncorrelated with any of several gold standards (Qi et al. 2006). This matches other results that indicate that the properties of hubs (Bloom and Adami 2003) and degree distributions (Deeds et al. 2006) in Y2H networks may be artifactual and may explain the low overlap of independently collected Y2H data
sets with each other (Gandhi et al. 2006; Goll and Uetz 2006; Hart et al. 2006) and with literature-curated interactions (Rual et al. 2005). Moreover, the generally low correlation of Y2H interactions with curated data stands in contrast to TAP/MS-derived interactions and most other kinds of functional genomic data, including expression arrays (Qi et al. 2006). The ability to perform such comparisons is one of the primary advantages of a gold standard.

Supervised Integration
In addition to allowing comparison of different predictors, a gold standard also enables us to perform data integration. In the context of protein interaction prediction, an array of association predictors is the input to a binary classifier function, which returns the integrated probability that two proteins are linked in the sense stipulated by the gold standard (Fig. 2.3). When this binary classifier function is applied to predict interaction probabilities for all protein pairs in a genome, the result is an integrated probabilistic protein interaction network. Variants of this approach have been used to predict functional associations (Jansen et al. 2003; Lee et al. 2004; Srinivasan et al. 2006), physical contacts (Qi et al. 2006), synthetically lethal genetic interactions (Wong et al. 2004), and colocalizations (Qi et al. 2006; Jansen et al. 2002). Importantly, this supervised learning framework for data integration is not limited to interaction prediction and has also been applied to direct prediction of protein function (Han et al. 2004; Lu et al. 2005b) and transcription factor/DNA binding (Beyer et al. 2006). In fact, the applications of supervised learning in functional genomics can be seen as a natural outgrowth of supervised learning methods in gene finding (Ratsch et al. 2007), protein sequence alignment (Do et al. 2006a), and RNA secondary structure prediction (Do et al. 2006b; Gruber et al. 2007).
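One simple way to realize such a binary classifier is a naive Bayes combination of predictors, sketched below. This is an assumption-laden illustration rather than the method of any cited study: predictors are reduced to binary evidence, they are treated as conditionally independent given the label, the gold-standard examples and the prior are hypothetical, and missing values (NA in Fig. 2.3) are simply skipped.

```python
from collections import Counter


def fit_likelihoods(examples, n_predictors, pseudocount=1.0):
    """Estimate P(evidence_j = value | label) from gold-standard examples.

    examples: list of (label, evidence) with label in {0, 1} and evidence a
    tuple whose entries are 0, 1, or None (missing).
    """
    counts = {label: [Counter() for _ in range(n_predictors)] for label in (0, 1)}
    for label, evidence in examples:
        for j, value in enumerate(evidence):
            if value is not None:
                counts[label][j][value] += 1

    def likelihood(label, j, value):
        seen = counts[label][j][0] + counts[label][j][1]
        # Laplace smoothing keeps unseen evidence values from giving zero probability.
        return (counts[label][j][value] + pseudocount) / (seen + 2 * pseudocount)

    return likelihood


def posterior(evidence, likelihood, prior):
    """Apply Bayes' rule in odds form, multiplying in one likelihood ratio per predictor."""
    odds = prior / (1.0 - prior)
    for j, value in enumerate(evidence):
        if value is None:
            continue  # a missing predictor contributes no evidence
        odds *= likelihood(1, j, value) / likelihood(0, j, value)
    return odds / (1.0 + odds)


# Hypothetical gold standard: (label, (predictor_0, predictor_1)).
gold = [
    (1, (1, 1)), (1, (1, 1)), (1, (1, 0)), (1, (0, 1)),
    (0, (0, 0)), (0, (0, 0)), (0, (0, 1)), (0, (1, 0)), (0, (0, 0)), (0, (0, 0)),
]
likelihood = fit_likelihoods(gold, n_predictors=2)
prior = 0.1  # assumed prior probability that a random pair interacts

p_both = posterior((1, 1), likelihood, prior)
p_one = posterior((1, None), likelihood, prior)
p_none = posterior((0, 0), likelihood, prior)
```

Applying `posterior` to every protein pair would yield a probabilistic network in the sense described above. Because each predictor enters as a likelihood ratio against the same gold standard, an uninformative assay contributes a ratio near one and is effectively down-weighted.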
2.2.6 Network Representation

2.2.6.1 From Reference Assemblies to Reference Networks

Along with algorithms for network reconstruction, a fundamental question in static modeling is the issue of data structures: how an inferred network model is represented on the computer. The field is currently moving from a number of ad hoc network representations to a better defined concept of a reference network. To motivate this, recall the concept of a “reference human genome assembly.” This concept is a fiction, as the genome coils and uncoils (Champoux 2001), moves about the cell (Riddihough 2003), is methylated and demethylated (Weber and Schubeler 2007), varies substantially between individuals (Abecasis et al. 2007), and has nontrivial three-dimensional structure (SantaLucia and Hicks 2004). Nevertheless, it is a useful fiction, as each of these phenomena can be visualized and analyzed by superimposing tracks upon the reference assembly, which represents a lowest common denominator of analysis. In particular, by separating the
2
Current Progress in Static and Dynamic Modeling
raw data (the reference assembly) from the metadata (the species-specific tracks and annotations), cross-species comparisons and genome alignments are enabled (ENCODE Project Consortium 2007; Brudno et al. 2003). Similarly, a feasible near-term goal for static modeling is the construction of reference networks for key model organisms with explicitly typed edges (Figs. 2.4 and 2.5). These reference networks may integrate multiple data types (Fig. 2.3) and incorporate explicit models of uncertainty. However, as they are meant to represent the average cell of a given organism near the median of the norm of reaction (Lynch and Walsh 1998), they should not directly incorporate interactions which only occur during certain perturbations, at specific times, or within particular cell types. As with reference assemblies, such conditional interactions should be modeled by superimposing tracks and layouts on the static reference network rather than incorporating conditional interactions directly into the reference network (Fig. 2.4).
[Fig. 2.4, panel (a): Reference assembly concept separates finished sequence from metadata (tracks). A reference sequence of high-confidence base pairs (...GCATGCTAC...) carries a gene track and a SNP track. Panel (b): Reference network concept separates physical interactions from metadata (tracks and layouts). A reference tabulation of high-confidence physical interactions among proteins, ncRNAs, small molecules, and regulatory motifs carries an essentiality node track (essential / not essential), an interaction conservation edge track (weak / strong), and extracellular and cytoplasmic layouts.]
Fig. 2.4 Reference assemblies and reference networks. (a) The concept of a reference assembly allows us to enforce a divide between data and metadata. Everything other than finished sequence data is visualized and represented as a metadata track associated with the raw sequence (Kuhn et al. 2007). (b) Enforcing a similar kind of separation for a reference network will have key advantages. By enumerating a static list of highly probable physical interactions which occur for an average cell of a given species (averaged over condition, space, time, etc.), we can obtain a lowest common denominator of interaction information to compare between species. Given this physical backbone, metadata can then be visualized via tracks and layouts. For example, we can apply a node track to flag essential nodes, an edge track to highlight strongly and weakly conserved edges, and a layout to mirror the known physical separation of modules. Reproduced from Srinivasan et al. (2007) with permission from Oxford University Press
By keeping the building blocks of the reference network separate from the details of when or where they interact, a separation between data and metadata is enforced that permits powerful kinds of network visualizations and alignments (Fig. 2.4). This is particularly valuable because network metadata is likely to accumulate in bits and pieces due to the prohibitive cost of compiling cross-sectional data on different network states (Fig. 2.1). With respect to visualizing this metadata, the primary new feature in the network context is the availability of layouts, which are particularly suitable for visualizing spatial or functional relationships, in addition to tracks (Fig. 2.4b).

2.2.6.2 Strongly Typed Static Network Models

One of the most important lessons learned from genome sequencing was the value of the Gene Ontology’s systematic, machine-readable approach to categorizing function (Ashburner et al. 2000). Before GO, it was impossible for a computer to discern that a protein annotated as an alcohol dehydrogenase was a kind of oxidoreductase. We propose that a similar state of affairs is currently prevalent in systems biology, and believe that a Network Ontology for explicit ontological markup of reference networks will prove to be an essential tool (Fig. 2.5). Specifically, note that the edges and nodes of the reference network in Fig. 2.4 have explicit ontological
[Fig. 2.5, panel (a): EGR1 interaction with DNA (three zinc fingers in complex). Panel (b): Schematic of physical interactions: EGR1 with three zinc finger domains; zinc cofactors enable binding; EGR1 binding site. Problem: implicit semantics prevent machine readability.]

Panel (c): Represent network as a list of triples (Network Ontology is a meta-ontology)

Subject          Predicate   Object           Note
CID:23994        is_a        MI:0682          Zinc is a cofactor
CID:23994        MI:0407     CDD:pfam00096    Zinc directly interacts w/ zinc finger domain
UniProt:P18146   is_a        GO:0003700       EGR1 is a transcription factor
craHsap:197014   is_a        SO:0000235       This motif is transcription factor-binding site
dom:P18146-d1    part_of     UniProt:P18146   First domain in protein
dom:P18146-d1    is_a        CDD:pfam00096    Type of domain
...              ...         ...

Panel (d): Explicit RDF representation (Turtle/N3 format)

#Define Namespaces
@prefix rdf: .
@prefix CID: .
@prefix CDD: .
@prefix UniProt: .
@prefix craHsap: .
@prefix MI: .
@prefix GO: .
@prefix SO: .
...
#Begin List of Triples
CID:23994 is_a MI:0682 .
CID:23994 MI:0407 CDD:pfam00096 .
UniProt:P18146 is_a GO:0003700 .
craHsap:197014 is_a SO:0000235 .
dom:P18146-d1 part_of UniProt:P18146 .
dom:P18146-d1 is_a CDD:pfam00096 .
#...more triples below...

Fig. 2.5 Network Ontology and RDF representation. Most current networks involve only one or two kinds of biological objects, such as proteins alone (Lee et al. 2004; Srinivasan et al. 2006; Jansen and Gerstein 2004), or transcription factors and motifs (Beyer et al. 2006). In order to achieve the ambition of a reference network, however, a notation must be devised for dealing with many kinds of typed interactions. (a) As a motivating example, consider the interaction of EGR1 with a transcription factor-binding site, which involves three zinc finger domains and a zinc cofactor. (b) One possible schematic of this interaction is shown, where an individual protein with three domains (top layer) conditionally binds a DNA position (bottom layer) in the presence of zinc (middle layer). The problem is that it is not immediately obvious how to represent this in machine-readable terms. (c) One solution lies in representing a network as a list of triples encoded in a “Network Ontology”. This proposed Network Ontology is a meta-ontology that draws on established ontologies and controlled vocabularies. By combining these source vocabularies, the small set of interactions described in panel (b) can be described in terms of a set of unordered triples. Each triple represents a fact about the network, expressed as a (subject, predicate, object) tuple. In general, each member of the triple has its own canonical identifier. For example, the triple (CID:23994, MI:0407, CDD:pfam00096) indicates that zinc (CID:23994 in PubChem) physically interacts (MI:0407 in PSI-MI) with the zinc finger domain (CDD:pfam00096 in the CDD). For simplicity, we have represented the is_a and part_of predicates as literals, but in general these should also be specified by URIs. For example, the subtleties regarding the Sequence Ontology’s part_of definition are treated during the discussion of extensional mereology operators in Eilbeck et al. (2005). (d) The advantage of the triple-based representation of the network is that it corresponds to the RDF standard of the W3C. While RDF can be expressed as an XML file, the N3/Turtle notation (Beckett and Berners-Lee 2007) is far more compact and human readable. Shown is an example of a Turtle format encoding of the triplestore described in panel (c). After the preliminary enumeration of namespaces, each non-comment line corresponds to a single triple. Reproduced from Srinivasan et al. (2007) with permission from Oxford University Press
labels. This Network Ontology is a kind of meta-ontology that derives largely from existing ontologies, something like a more focused analog of the Unified Medical Language System (Bodenreider 2004) for systems biology. Many of the terms can be derived from existing ontologies like the Gene and Sequence Ontologies and from lists of canonical identifiers such as those available through Entrez Gene (Wheeler et al. 2007), UniProt (Mulder et al. 2007), CDD (Marchler-Bauer et al. 2003), and PubChem (Wheeler et al. 2007). There are also several available standards in the systems biology space (Stromback and Lambrix 2005) which can serve as building blocks for this project, including SBML (Hucka et al. 2003), CellML (Nielsen and Halstead 2004), BioPax (Luciano 2005), and PSI-MI (Orchard et al. 2005). Of these, SBML and CellML are invaluable tools for detailed, time-dependent modeling but may be too granular for genomic scale networks. BioPax and PSI-MI are more appropriate; BioPax was originally developed for exchanging pathway data between databases such as KEGG and EcoCyc, and PSI-MI was built for describing the results of high-throughput experiments (Hermjakob et al. 2004).

By combining these source vocabularies, a Network Ontology provides a unified framework for defining a reference network and its associated metadata in terms of lists of triples (Fig. 2.5). Each triple corresponds to a fact about the network, represented as a subject/predicate/object tuple of uniform resource identifiers (URIs). Each URI represents a canonical identifier drawn from one of the established databases or ontologies. In addition to the vast number of ontological terms compiled by the members of the OBO Foundry (Rubin et al. 2006), good URIs currently exist for proteins via UniProt, domains via the CDD, genes via Entrez Gene, and small molecules via PubChem. Canonical names are also emerging for ncRNAs (Kin et al. 2007) and regulatory motifs (Robertson et al. 2006), though a consensus solution will remain elusive until NCBI or EBI launches a database.

Given a consensus set of URIs for biological objects, an explicitly typed reference network can then be naturally represented as a set of ontological triples, such as A physically_interacts_with B, or X is_a Y, in which canonical URIs are used for each member of the triple (Fig. 2.5). This triple-based representation of a network corresponds to the RDF format of the World Wide Web Consortium (Prudhommeaux and Seaborne 2007). Though originally developed for the Semantic Web (e.g., web page X links to web page Y), a list of triples (also known as a triplestore) is clearly also a natural representation for pathway and network information. Importantly, significant progress has already been made by the BioRDF working group (Stephens 2007) toward converting key biological databases into RDF format.

One of the principal advantages of representing network data as an RDF triplestore with canonical URIs for each member of the triple is that if everyone uses the same URIs, then facts produced by different providers can be integrated by forming the union of the two triplestores (though in practice statistical methods will be used to resolve any contradictory triples). Another advantage is that a network in RDF format with explicitly typed nodes and edges can be the subject of nontrivial queries based on the SPARQL query language (Prudhommeaux and Seaborne
2007), such as “find all X’s which are regulated by Y” or “find all signal transduction paths between A and B.” A network with explicitly marked nodes and edges also suggests natural possibilities for data visualization and enables rich kinds of network alignment (Section 2.2.7.2).

Reference networks can be inferred by a direct extension of the supervised learning methods described in Section 2.2.4. As depicted in Fig. 2.3, the shared thread behind the supervised learning methods for network integration and protein function prediction is to (1) select a biological object (protein pair, gene pair, protein, etc.), (2) calculate a list of desired labels and predictive features, and (3) use machine learning to compute a mapping between features and labels. Given sufficient labels and predictors, data on any kind of biological object can be integrated. This kind of approach has already been used to score interaction confidence during the process of data collection (Krogan et al. 2006); in the long run such techniques may become as common to network determination as PHRED and PHRAP (Ewing and Green 1998; Ewing et al. 1998) became in the early days of sequence determination.
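The triple-based representation discussed in this section can be sketched in a few lines. The example below is only an illustration under stated assumptions: plain Python sets stand in for an RDF triplestore, the match helper is a hypothetical wildcard lookup that loosely mimics a SPARQL pattern, and the split of the Fig. 2.5 triples between two "providers" is invented to show integration by set union.

```python
def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Triples from Fig. 2.5, written as (subject, predicate, object).
# The assignment of triples to two providers is hypothetical.
provider_a = {
    ("CID:23994", "is_a", "MI:0682"),
    ("CID:23994", "MI:0407", "CDD:pfam00096"),
    ("UniProt:P18146", "is_a", "GO:0003700"),
}
provider_b = {
    ("UniProt:P18146", "is_a", "GO:0003700"),  # duplicate fact, merged by the union
    ("craHsap:197014", "is_a", "SO:0000235"),
    ("dom:P18146-d1", "part_of", "UniProt:P18146"),
    ("dom:P18146-d1", "is_a", "CDD:pfam00096"),
}

# Shared identifiers make integration a plain set union.
network = provider_a | provider_b

# Query: which domains of EGR1 (UniProt:P18146) belong to a domain type
# that zinc (CID:23994) physically interacts with (MI:0407)?
zinc_partners = {o for _, _, o in match(network, "CID:23994", "MI:0407")}
egr1_domains = {s for s, _, _ in match(network, predicate="part_of", obj="UniProt:P18146")}
zinc_binding_domains = sorted(
    d for d in egr1_domains
    if any(o in zinc_partners for _, _, o in match(network, d, "is_a"))
)
```

The union silently merges the duplicated EGR1 fact, which is the property the text attributes to shared URIs; resolving contradictory triples would need the statistical treatment mentioned above and is not attempted here.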
2.2.7 Applications of Network Models

Now that static network modeling has been commonplace for several years, the trend is to make network analysis a starting point for applications such as user-friendly network visualization, network-guided experimental validation, and network alignment.

2.2.7.1 Experimental Prioritization

Ultimately, an interaction network is a model of a system, and a model is only useful to the extent that it successfully predicts experiments. In particular, one of the most important ways to leverage network data is not simply to analyze it, but to use it to understand what data to gather next. One way to formulate this problem is in terms of an experiment recommender, which uses network context to prioritize experiments. For example, network context can be used to identify genes that are likely to be in pathways of interest (Owen et al. 2003). Experiment recommenders of different kinds have also been used to determine rate constants (Flaherty et al. 2005), define metabolic topologies (Barrett and Palsson 2006), determine disease genes (Aerts et al. 2006), and discern causal structure in signaling pathways (Sachs et al. 2005). It is important to note that many such recommendation problems can be viewed as updates of an uncertain state variable, such as the GO category of a protein or the value of a rate constant. On a formal basis, this is highly similar to the Bayesian supervised learning model for data integration described in Fig. 2.3, in which a prior gold standard is updated to produce a posterior distribution. There is thus a
significant opportunity to unify the problems of data integration and experiment recommendation in a common Bayesian framework, where experiments are recommended in order of their ability to reduce the uncertainty of state variables of interest.
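One way to make "ability to reduce the uncertainty of state variables" concrete is expected information gain. The sketch below assumes a binary state variable (for example, whether an interaction exists) and experiments characterized by a sensitivity and a specificity; the candidate experiments, their error rates, and all function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a binary variable with P(true) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(prior, sensitivity, specificity):
    """Expected entropy reduction from one binary experiment.
    sensitivity = P(positive result | state true)
    specificity = P(negative result | state false)"""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    p_neg = 1 - p_pos
    # Bayes' rule for each possible outcome of the experiment.
    post_pos = sensitivity * prior / p_pos if p_pos > 0 else prior
    post_neg = (1 - sensitivity) * prior / p_neg if p_neg > 0 else prior
    expected_posterior = p_pos * entropy(post_pos) + p_neg * entropy(post_neg)
    return entropy(prior) - expected_posterior

def recommend(experiments):
    """Rank (name, prior, sensitivity, specificity) tuples, most informative first."""
    scored = [(expected_information_gain(p, se, sp), name) for name, p, se, sp in experiments]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical candidates: (name, prior belief, sensitivity, specificity).
candidates = [
    ("co-IP of A and B", 0.50, 0.90, 0.95),
    ("Y2H of A and B", 0.50, 0.30, 0.90),
    ("co-IP of C and D", 0.99, 0.90, 0.95),
]
ranking = recommend(candidates)
```

An accurate assay on a maximally uncertain pair ranks first, while a reliable assay on a pair whose status is already nearly certain contributes little, which matches the intuition behind recommending experiments by uncertainty reduction.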
2.2.7.2 Network Alignment

Once multiple genome sequences became available, research attention naturally turned to the question of comparative genomics (ENCODE Project Consortium 2007). Similarly, the availability of several different kinds of networks from different sources and species has ignited interest in comparative functional genomics. Many questions are still open in this area: for example, can we enumerate an organism’s inventory of modules much as we can enumerate its inventory of genes? Is it feasible to transfer module annotations from well-studied organisms to newly sequenced ones? And can we identify conserved modules of unknown function?

One promising way of answering such questions is through network alignment, which is a systems biological analog of sequence alignment. Network alignment allows us to compare interaction networks between different species to find conserved modules. When comparing protein interaction networks, conserved modules are sets of proteins that have both conserved primary sequences and conserved pairwise interactions between species. For example, we can apply network alignment to find all species with nitrate reduction systems similar to that of Escherichia coli, or to examine the extent to which the cell division apparatus is conserved across a set of microbes. A sample alignment found with the Graemlin network aligner is shown in Fig. 2.6; the figure displays a putative DNA uptake and transformation module in which seven protein families across four species show a conserved pattern of functional association (Flannick et al. 2006).

Network alignment has attracted much interest in recent years, beginning with manual alignments of metabolic pathways (Dandekar et al. 1999; Forst and Schulten 2001), proceeding to precursors of network alignment guided by best bidirectional BLAST hits (Ogata et al. 2000; Stuart et al. 2003; Yu et al. 2004), and culminating in more recent graph-based formulations (Kelley et al. 2003).
Recent alignment algorithms have introduced the ability to compare three networks at once (Sharan et al. 2005) as well as simple models of network evolution (Koyuturk et al. 2006). We recently developed the Graemlin network aligner, which was the first program capable of identifying conserved functional modules across an arbitrary number of dense association networks. By using a number of BLAST-like optimizations, Graemlin’s running time scaled linearly rather than exponentially with the number of species (Flannick et al. 2006). Just as sequence alignment rests upon substitution matrices (Henikoff and Henikoff 1993) and models of sequence evolution (Durbin et al. 1999), it will be crucial to provide a principled foundation for network alignment by developing a detailed theory of network evolution (Berg and Lassig 2006; Weitz et al. 2007).
[Fig. 2.6: integrated association networks for E. coli, V. cholerae, C. jejuni, and C. crescentus. Network alignment locates a conserved module (DNA uptake and transformation) comprising ruvC, ybgC, ruvA, tolR, pal, tolQ, and tolB.]
Fig. 2.6 Network alignment. A sample network alignment calculated with the Graemlin algorithm (Flannick et al. 2006). In the top row, integrated association networks for four microbes are depicted. In these large graphs, nodes represent proteins and edge weights are probabilities of association between proteins. Calculating a global network alignment finds several conserved modules, including one consisting of seven conserved protein families: ruvC, ruvA, tolR, tolB, tolQ, pal, and ybgC. Each family contains four homologous proteins, one in each species; node shape denotes the species of origin and proteins from a given family are grouped near each other. Moreover, the pattern of functional associations between protein families (as revealed by the edges) displays significant conservation. The alignment suggests a possible function for the module: exogenous DNA is allowed into the cell by the tol/exb membrane channel proteins and then incorporated into the chromosome by the ruv recombination proteins. The literature supports this hypothesis, as insertional disruption of tol/exb family proteins in Pseudomonas stutzeri reduces transformational efficiency to 20% of its previous level (Graupner and Wackernagel 2001). This strongly suggests that exogenous DNA travels through these channels before chromosomal incorporation. Reproduced from Srinivasan et al. (2007) with permission from Oxford University Press
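The definition of a conserved module given above (conserved sequences plus conserved pairwise interactions) can be illustrated with a deliberately naive sketch. This is not the Graemlin algorithm: it assumes homology has already been reduced to a protein-to-family lookup, ignores edge weights and self-loops, and handles only two species. The toy edges below reuse family names from Fig. 2.6 for readability but are invented, not taken from the published alignment.

```python
def conserved_edges(net_a, net_b, family_a, family_b):
    """Family pairs that are linked in both species.
    net_*: sets of frozenset({p, q}) edges without self-loops.
    family_*: dict mapping a protein to its (hypothetical) family name."""
    def family_edges(net, family):
        out = set()
        for edge in net:
            p, q = tuple(edge)
            if p in family and q in family and family[p] != family[q]:
                out.add(frozenset((family[p], family[q])))
        return out
    return family_edges(net_a, family_a) & family_edges(net_b, family_b)

def modules(edges):
    """Connected components of the conserved-edge graph (candidate modules)."""
    adjacency = {}
    for edge in edges:
        u, v = tuple(edge)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

def e(p, q):
    return frozenset((p, q))

# Invented toy networks for two species.
species_1 = {e("s1_ruvA", "s1_ruvC"), e("s1_tolQ", "s1_tolR"),
             e("s1_tolR", "s1_ruvA"), e("s1_x", "s1_ruvA")}
species_2 = {e("s2_ruvA", "s2_ruvC"), e("s2_tolQ", "s2_tolR"), e("s2_y", "s2_tolQ")}
fam_1 = {"s1_ruvA": "ruvA", "s1_ruvC": "ruvC", "s1_tolQ": "tolQ", "s1_tolR": "tolR"}
fam_2 = {"s2_ruvA": "ruvA", "s2_ruvC": "ruvC", "s2_tolQ": "tolQ", "s2_tolR": "tolR"}

shared = conserved_edges(species_1, species_2, fam_1, fam_2)
found = modules(shared)
```

Exact edge intersection is brittle on noisy data; practical aligners score approximate conservation instead, which is where the models of network evolution mentioned in the text come in.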
Moreover, just as fast algorithms for sequence alignment such as BLAST became ever more essential as sequence data accumulated, it seems clear that the utility of network alignment will rise in direct proportion to the quality of inferred interaction networks in different organisms.
Indeed, the pace of research in this area is accelerating, with several papers published in the last few months (Zhenping et al. 2007; Liang et al. 2006; Singh et al. 2007; Stumpf et al. 2007). Part of the reason for this interest is that many of the signal successes of bioinformatics have been concentrated in the area of alignment (Batzoglou 2005). Even though the vast majority of objects in biology have not been directly characterized by experimentalists, information on objects which have good digital encodings, like sequences and structures, can easily be propagated with an appropriate alignment tool. For example, we can characterize a protein in Drosophila melanogaster and immediately BLAST its digital representation to get some clue as to the function of that protein in other insects, or possibly even in humans or yeast. Yet the lack of digital representation means that many other interesting objects (like tissues or developmental hierarchies) are not yet easily “aligned” between organisms. Currently, we resort to simple phylogenetic interpolation to reason that if organism X is phylogenetically equidistant between organism Y and organism Z, then its characteristics are intermediate between these two organisms. However, it is well known that gene trees are not the same as species trees (Degnan and Rosenberg 2006; Nichols 2001; Pamilo and Nei 1988), and that it is far more accurate to compare genes via sequence alignment. While the divergence of a network tree from the species tree is likely to be less than that of a gene tree (as a collection of genes will have lower sampling variance than an individual gene), nevertheless the same principle holds: the evolutionary history of a module is distinct from that of its host.
The promise of network alignment, then, is that we may be able to improve upon crude phylogenetic interpolation by directly comparing network models of higher order processes (such as organs and developmental hierarchies) between species and individuals.

2.2.7.3 Network Visualization

Large interaction data sets with thousands of nodes and edges are best visualized interactively rather than statically. Several tools for this purpose are now available and can be divided into standalone applications, programming libraries, and web applications.

Desktop Tools

Among standalone programs, several options are available, including Cytoscape (Shannon et al. 2003), Osprey (Breitkreutz et al. 2003), Medusa (Hooper and Bork 2005), and Pajek (de Nooy et al. 2005). Cytoscape is a popular choice with many features and plugins, but as it is written in Java it requires large amounts of memory to navigate dense networks. Osprey is similar in functionality and is somewhat more responsive, but has a smaller user community. Medusa has several novel features, including support for multigraphs with multiple edges between a given pair of nodes. Pajek has many features for mathematical graph analysis but a comparatively steep learning curve.
Programming Libraries

Data analysts often wish to dynamically generate network visualizations from within programs, and many libraries for this purpose are available. Cytoscape, mentioned above, has an API that can be called from within Java. The Boost Graph Library (Siek et al. 2007) and AT&T’s Graphviz library (Ellson and North 2007) are open source C++ libraries which have bindings for many different programming languages, including R, Python, and Perl.
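As a small illustration of generating a visualization from within a program, the sketch below writes a weighted network in Graphviz's DOT text format using only the Python standard library, rather than any particular language binding. The to_dot helper, the edge list, and the probabilities are hypothetical.

```python
def to_dot(edges, name="network"):
    """Serialize weighted undirected edges as Graphviz DOT text.
    edges: iterable of (node_a, node_b, probability)."""
    lines = [f"graph {name} {{"]
    for a, b, prob in sorted(edges):
        # penwidth scales line thickness with the association probability.
        lines.append(f'  "{a}" -- "{b}" [label="{prob:.2f}", penwidth={1 + 4 * prob:.1f}];')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical association probabilities between protein families.
dot_text = to_dot([("tolQ", "tolR", 0.9), ("ruvA", "ruvC", 0.8)])
print(dot_text)
```

Saving the text to a file such as network.dot and running `dot -Tpng network.dot -o network.png` would render it, assuming Graphviz is installed.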
Online Network Browsers

Several rich web applications for network visualization have been described in recent years, including STRING (von Mering et al. 2007), PubGene (Jenssen et al. 2001), iHOP (Fernandez et al. 2007), and PSTIING (Ng et al. 2006). STRING provides several different kinds of interaction predictions between genes for many sequenced genomes. STRING, PubGene, and iHOP all allow browsing of literature co-occurrence networks. PSTIING is a powerful data browser which is particularly useful for analysts looking for new data sets to integrate.
2.2.8 Outstanding Challenges in Static Modeling

Now that hundreds of different functional genomic data sets are available through resources like NCBI’s GEO, an important near-term goal is the generation of static reference networks for major model organisms. In order to make these networks relevant, every predicted node and edge should have an associated gold-standard empirical test for verification purposes. For example, a postulated network of physical protein-protein interactions is in theory confirmable by exhaustive coimmunoprecipitation of protein pairs. Moreover, the parameters of static models should be designed to be flexible enough to be updated in the light of new information, e.g., by using Bayesian updates.
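A minimal sketch of such a Bayesian update follows. It assumes, purely for illustration, that belief in a single edge is modeled as a Beta distribution over the probability that an independent assay supports the interaction, so that each batch of confirmations and refutations updates the parameters by simple addition. The prior and the counts are hypothetical.

```python
def update_edge_belief(alpha, beta, confirmations, refutations):
    """Conjugate update of a Beta(alpha, beta) belief about one edge."""
    return alpha + confirmations, beta + refutations

def posterior_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Weak prior centered on 0.5 (hypothetical).
alpha, beta = 2.0, 2.0

# First batch of assays: three support the edge, one does not.
alpha, beta = update_edge_belief(alpha, beta, confirmations=3, refutations=1)
after_first_batch = posterior_mean(alpha, beta)

# Later evidence contradicts the edge; the same model absorbs it.
alpha, beta = update_edge_belief(alpha, beta, confirmations=0, refutations=4)
after_second_batch = posterior_mean(alpha, beta)
```

Because the posterior after one batch serves as the prior for the next, the edge confidence can be revised whenever new data sets appear without refitting the whole network.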
2.3 Dynamical Models of Biological Networks

A dynamical model is a reconstruction of molecular physiology, a description of how the state of a system evolves over time. This description usually consists of equations that describe the time dependence of each of the state variables of the system. To describe a biological network as a dynamical system requires identification of the variables (protein species, signaling molecules, and their associated amounts), how they interact (network connectivities), and how both the values of the state variables and their interactions change over time. A variety of approaches have been used to model signaling networks as dynamical systems; one useful means of organization is by whether the model uses discretely or continuously varying states.
2.3.1 Discrete Models

Discrete models require the states of the system variables (genes, proteins, signaling molecules) to take on integer values. Although at a molecular level this requirement is the most realistic, it is often used at a higher level to simplify the resulting models. A Boolean model provides one such simplification: it consists of binary-valued variables whose interrelationships are captured by Boolean functions. In systems biology, this expresses the state of a gene (“on” or “off”) as a Boolean function of the states of other genes. As an example, a Boolean model was constructed for the mammalian cell cycle (Faure et al. 2006), and it was shown to reproduce known wildtype and mutant behavior. Boolean models can either be deterministic or stochastic; the latter are referred to as probabilistic Boolean networks (PBNs) (Shmulevich et al. 2002).

In cases where a Boolean model is too coarse grained for a particular system, a more elaborate dynamic Bayesian network (DBN) can be used. These models can be either discrete or continuous, and they allow dynamical systems to be described probabilistically. An example of a recent discrete DBN applied to yeast cell cycle time series data is found in Zou and Conzen (2005). Although DBNs are more realistic than Boolean models, they are still more descriptive than mechanistic.

Short of molecular dynamics simulations that track the simultaneous position and velocity of every molecule in the system, the most realistic (and mechanistic) signaling network models fall under the stochastic chemical kinetics framework (Gillespie 2007). These models represent biological systems as well-stirred collections of finite numbers of chemical species; reactions are simulated probabilistically according to known reaction propensities. We shall return to these models in Section 2.3.8.
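A deterministic Boolean model of the kind described above can be sketched as follows. The three-gene circuit and its rules are hypothetical and are not taken from the cell cycle model of Faure et al.; the sketch also assumes synchronous updating, in which every gene reads the previous state at once.

```python
# Hypothetical three-gene circuit (illustrative rules only).
RULES = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}

def step(state):
    """Synchronous update: every gene is computed from the previous state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def state_key(state):
    """Hashable representation of a state, ordered by gene name."""
    return tuple(state[gene] for gene in sorted(state))

def attractor(state):
    """Iterate until a state repeats, then return the cycle that was reached."""
    trajectory, index_of = [], {}
    while state_key(state) not in index_of:
        index_of[state_key(state)] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    return trajectory[index_of[state_key(state)]:]

cycle = attractor({"A": True, "B": False, "C": False})
```

With these rules the delayed negative feedback from C onto A produces a sustained oscillation rather than a fixed point; a probabilistic Boolean network would replace each fixed rule with a random choice among several candidate rules.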
2.3.2 Continuous Models

Continuous models permit system variables to take on non-negative real-valued states. We focus on so-called chemical kinetic (mechanistic) models where states represent concentrations of molecules. These models, though approximate, are sufficiently accurate when the molecular populations of all species are orders of magnitude larger than one (Gillespie 2007).

The oldest and most common modeling formalism uses ordinary differential equations (ODEs) and known chemical kinetic/physico-chemical principles (Cornish-Bowden 1979) to deterministically model molecular concentrations as a function of time. Though these equations are not usually analytically solvable, there exist a wide variety of numerical tools that can efficiently model relatively complex systems (Rangamani and Iyengar 2007). We shall cover ODEs in more detail in Section 2.3.6.

Partial differential equation (PDE) models of signaling networks describe the evolution of molecular concentrations as functions of both space and time. These models are more physically realistic than ODEs, but they are also significantly more
difficult to solve and typically require custom-made numerical solution methods (Eungdamrong and Iyengar 2004). We discuss PDEs in detail in Section 2.3.7. The addition of a noise term to a deterministic differential equation yields a stochastic differential equation (SDE), which in chemical kinetic systems often takes the form of a chemical Langevin equation (CLE) (Gillespie 2000). The CLE follows from approximations to discrete stochastic chemical kinetics, and its solution can be computed much more efficiently than solutions for the corresponding discrete models (Wilkinson 2009). We discuss SDE modeling of signaling networks in Section 2.3.8.

Discrete dynamical models represent an active area of research in systems biology, and they have recently been discussed elsewhere (Uhrmacher et al. 2005). In the remainder of this chapter we restrict our focus to the three classes of continuous dynamical models listed above. As these models are mechanistic in nature, their means of specification and analysis are the most dissimilar to the descriptive models of static interaction networks (Section 2.2) and most of the discrete dynamical models mentioned above. We begin by describing some general advantages and limitations of representing biological networks with differential equations, followed by common tasks carried out when applying such models. These points will motivate the remaining discussion and the particular examples used for illustration.
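To make the ODE and CLE formalisms of this section concrete, the sketch below simulates a single species that is produced at a constant rate and degraded in proportion to its abundance. The system, the rate constants, and the fixed-step integrators (forward Euler and Euler-Maruyama) are assumptions chosen for brevity, not methods prescribed by the chapter; the clamp on the degradation propensity is a pragmatic guard, since the CLE itself can wander below zero.

```python
import math
import random

K_PROD = 100.0   # production rate, molecules per unit time (hypothetical)
K_DEG = 1.0      # degradation rate constant, per unit time (hypothetical)

def ode_euler(x0, t_end, dt):
    """Forward Euler for dx/dt = K_PROD - K_DEG * x."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x += (K_PROD - K_DEG * x) * dt
    return x

def cle_euler_maruyama(x0, t_end, dt, rng):
    """Euler-Maruyama for the chemical Langevin equation of the same system.
    Each reaction j contributes drift a_j*dt and noise sqrt(a_j*dt)*N(0, 1)."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        a_prod = K_PROD
        a_deg = K_DEG * max(x, 0.0)   # clamp: a propensity cannot be negative
        x += (a_prod - a_deg) * dt
        x += math.sqrt(a_prod * dt) * rng.gauss(0.0, 1.0)   # production adds one molecule
        x -= math.sqrt(a_deg * dt) * rng.gauss(0.0, 1.0)    # degradation removes one
    return x

def analytic(x0, t):
    """Closed-form solution of the ODE, available for this linear system."""
    steady = K_PROD / K_DEG
    return steady + (x0 - steady) * math.exp(-K_DEG * t)

x_ode = ode_euler(0.0, 5.0, 0.001)
x_cle = cle_euler_maruyama(0.0, 5.0, 0.001, random.Random(0))
```

The deterministic trajectory approaches the steady state K_PROD / K_DEG, while the CLE trajectory fluctuates around it; at much lower copy numbers the continuous approximation degrades and discrete stochastic simulation becomes the more appropriate choice.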
2.3.3 Advantages of Continuous Dynamical Models

The cellular environment is constantly changing as a result of deterministic chemical reactions and stochastic fluctuations. Thus, dynamical systems are more realistic depictions of biology than static models, and they can be used to answer detailed questions unanswerable by the latter (Mogilner et al. 2006). In particular, the modeler can test hypotheses that would be hard to query experimentally (Angeli et al. 2004). Through simulation, dynamical models enable characterization of nonlinear, emergent behavior that evolves over time. Such behavior is often only visible at a systems level and would be missed by reductionist methods (Bhalla and Iyengar 1999).

The outputs of differential equation models relate more closely to experimentally observed phenotypes than coarser-grained alternatives (Sauer 2004). As a result, though these models often require extensive parameterization, the parameter space can be constrained such that the model reproduces experimental data. This significantly reduces the complexity of model calibration and also enables easier model validation (Rangamani and Iyengar 2007). In addition, the models we shall discuss are mechanistic, and first principles of chemical kinetics (Cornish-Bowden 1979) and physics can reduce parametric uncertainty. These same principles are often inapplicable in more approximate models (Price and Shmulevich 2007). In general, the process of parameter learning sheds light on the correctness of initial hypotheses: if no parameter values exist which reproduce observed behavior, initial assumptions must be revisited (Tomlin and Axelrod 2007; You 2004; Ideker et al. 2001).
Finally, though these models require large amounts of high-resolution data, experimental systems are in place to make many of the needed measurements (Albeck et al. 2006).
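The idea that constrained parameter learning both calibrates a model and tests its assumptions can be shown with a minimal sketch. It assumes a one-parameter exponential decay model, a least-squares objective, and a grid search over a bounded range; the observations are synthetic, and the function names are hypothetical.

```python
import math

def simulate(k_deg, x0, times):
    """Model output: exponential decay x(t) = x0 * exp(-k_deg * t)."""
    return [x0 * math.exp(-k_deg * t) for t in times]

def sum_squared_error(k_deg, x0, times, observed):
    """Mismatch between model output and measurements."""
    return sum((m - o) ** 2 for m, o in zip(simulate(k_deg, x0, times), observed))

def calibrate(x0, times, observed, lower=0.0, upper=5.0, steps=5001):
    """Grid search for k_deg over a physically constrained range."""
    grid = [lower + i * (upper - lower) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda k: sum_squared_error(k, x0, times, observed))

times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [10.0, 6.1, 3.6, 2.3, 1.3]   # synthetic measurements
k_fit = calibrate(10.0, times, observed)
residual = sum_squared_error(k_fit, 10.0, times, observed)
```

A small residual indicates that some parameter value within the allowed range reproduces the data; a residual that stays large across the whole range would be the signal, noted above, that the model's initial assumptions need to be revisited.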
2.3.4 Limitations of Dynamical Models
The level of detail present in differential equation models can also impose limitations. Implementation of these systems often requires detailed prior biochemical/network knowledge that is not always readily available or uniformly reliable (Herrgard et al. 2003; Mogilner et al. 2006). Given the extensive parameterization needed, it can be hard to validate the entire model, and multiple solutions (network structures/parameter values) often exist. With limited amounts of data, models are also prone to overfitting (Amonlirdviman et al. 2005; You 2004). Though dynamical systems are able to reproduce experimental observations, their calibration is not always compatible with high-throughput data (Price and Shmulevich 2007). Instead, these models require costlier quantitative data to define concentrations of signaling components, kinetic/diffusion parameters, and initial/boundary conditions (Weng et al. 1999; Schnell and Turner 2004). Finally, due to their complexity, simulation of these models is computationally intensive, and models are often limited in size (Rangamani and Iyengar 2007; Tomlin and Axelrod 2007). In light of the above points, it is not surprising that successful examples of dynamical biological modeling are in systems that benefit from the advantages while minimizing the effects of the limitations. We will cover some of these examples in detail in the remainder of the chapter.
2.3.5 Specific Tasks Associated with Dynamical Modeling
We categorize the undertakings and objectives of dynamical modeling into five common tasks (Aldridge et al. 2006a), many of which follow from the characteristics of dynamical models listed above:
1. Model construction and calibration. The first step is to specify the structure and parameterization of a model from prior knowledge and experimental data. As we discuss below, this often requires advanced computational and statistical methods to process noisy or incomplete data (Brewer et al. 2008; Wilkinson 2009; van Riel and Sontag 2006; Jaqaman and Danuser 2006).
2. Model validation and testing. After calibration, it is important to compare model output with existing experimental data (Eungdamrong and Iyengar 2004; Ideker et al. 2001). This procedure is necessary (though not sufficient) to determine whether a model is specified correctly.
3. Parameter sensitivity analysis. Sensitivity analysis involves determining which molecular concentrations or kinetic parameters have the greatest influence on model behavior. This is valuable when prioritizing parameters for subsequent experimental measurement or perturbation (Rangamani and Iyengar 2007).
4. Analysis of emergent behavior. As mentioned, emergent behavior arises from systems-level properties that are not apparent from studying individual components. Many of these phenomena, which can include robustness to noise, feedback, bistability, and oscillation, are best characterized through simulation of the model (Gilbert et al. 2006; Angeli et al. 2004).
5. Predictive modeling and discovery. One of the most exciting areas of systems biology is prospective modeling to test hypotheses that are too difficult or expensive to query in vivo. Here, a prerequisite for making accurate predictions is a sufficiently detailed and accurate model (You 2004).
The remainder of the chapter is split into three parts: one each covering ODE, PDE, and SDE modeling of biological systems. Each section begins with an introduction to the corresponding modeling framework, followed by a brief review of early successes from the literature. We then focus in depth on current (within the last 5 years) examples, which we discuss in terms of the five tasks listed above. We conclude each section with outstanding research challenges.
2.3.6 ODE Systems
Ordinary differential equation models are by far the most common dynamical models used in biology (Andrews and Arkin 2006). They represent behavior at the level of chemical kinetics, whereby the concentration of each system component $y_i(t)$ as a function of time is represented in the following manner:
$$\frac{dy_i(t)}{dt} = f_i(\mathbf{y}(t)), \quad 1 \le i \le n, \tag{2.1}$$
where $\mathbf{y}(t) = (y_1(t), \ldots, y_n(t))$ and $f_i$ is a function that describes the rate of change of $y_i(t)$. This function can be constant (uninhibited synthesis), linear (a first-order reaction such as degradation), or nonlinear (e.g., a second-order reaction or Michaelis–Menten kinetics), and its precise form follows from qualitative prior experimental knowledge. These coupled expressions are often collectively referred to as reaction rate equations (RREs). The RREs of most biologically realistic systems cannot be solved analytically, but numerous well-developed and efficient numerical methods for solving these systems are available.
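To make the form of Eq. 2.1 concrete, the following minimal sketch integrates a small set of RREs numerically. The two-species system, its rate constants, and the function names (`rhs`, `rk4_step`, `integrate`) are hypothetical and chosen only for illustration; they are not taken from any model discussed in this chapter. The right-hand side combines a constant term (synthesis), linear terms (degradation), and a nonlinear Michaelis–Menten term, and a fixed-step fourth-order Runge–Kutta scheme stands in for the more sophisticated solvers used in practice.

```python
def rhs(y, k_syn=1.0, k_deg=0.1, v_max=0.5, k_m=2.0):
    """Right-hand side f(y) of a hypothetical two-species RRE system."""
    y1, y2 = y
    conversion = v_max * y1 / (k_m + y1)    # nonlinear Michaelis-Menten term
    dy1 = k_syn - k_deg * y1 - conversion   # constant synthesis, linear decay
    dy2 = conversion - k_deg * y2           # production from y1, linear decay
    return [dy1, dy2]


def rk4_step(f, y, dt):
    """Advance y by one classical fourth-order Runge-Kutta step."""
    k1 = f(y)
    k2 = f([yi + 0.5 * dt * k for yi, k in zip(y, k1)])
    k3 = f([yi + 0.5 * dt * k for yi, k in zip(y, k2)])
    k4 = f([yi + dt * k for yi, k in zip(y, k3)])
    return [yi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(f, y0, t_end, dt=0.01):
    """Return the state at time t_end, starting from y0 at t = 0."""
    y = list(y0)
    for _ in range(int(round(t_end / dt))):
        y = rk4_step(f, y, dt)
    return y


# Start with both species absent and integrate until the system has settled.
y_final = integrate(rhs, [0.0, 0.0], t_end=200.0)
```

In this toy system the sum of the two concentrations obeys a linear equation, so at long times it should approach `k_syn / k_deg`, which offers a simple check on the integrator. Realistic signaling models are often stiff, and an adaptive or implicit solver would normally replace the fixed-step scheme used here.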
2.3.6.1 Assumptions of ODE Biological Network Models
The relative ease with which ODE models of biological systems can be constructed and solved is a consequence of the simplifying assumptions made about the system. These assumptions are as follows:
• Reactions occur in a homogeneous, well-stirred volume (corollary: molecular concentrations are functions of time and not space)
• Reactions occur in a deterministic manner
• Discrete effects on molecular concentrations can be ignored (corollary: molecular populations of all species are orders of magnitude larger than one)
The solution to the RREs describes the deterministic time evolution of the system’s component concentrations; this solution often represents the average (mean) result of a population of many individual reaction trajectories in the presence of noise (Gillespie 2007). However, if any of the above assumptions is violated, ODE models of the system may be invalid and even exact solutions of such models can differ substantially from population averages. Even when the assumptions are met, it can be shown that the solution to the RREs is not, in general, equivalent to the population ensemble mean (Samoilov and Arkin 2006). Nevertheless, these models have proven useful in describing the dynamic behavior of biological networks, and they have been in use for several decades.
2.3.6.2 Early Examples of ODE Models Describing Biological Systems
One of the earliest uses of an ODE model to describe a biological network comes from Goodwin, who constructed equations describing the change in concentration of an mRNA species and its corresponding protein product (Goodwin 1963). This work simulated feedback loops, which were shown to give rise to nonlinear oscillations. Walter built upon this work by identifying a finite range of parameter values in a feedback system that led to oscillatory behavior (Walter 1970). Tyson and Othmer furthered our understanding of feedback control in biological networks, and they characterized emergent properties such as stability, bifurcation, periodicity, and hysteresis that were exhibited by these networks (Othmer 1976; Tyson 1975; Tyson and Othmer 1978).
Since that time, ODE models have been applied to biological networks governing a wide range of functions, including viral infection (Shea and Ackers 1985), chemotaxis (Spiro et al. 1997), cell cycle regulation (Novak et al. 1998), and developmental patterning (von Dassow et al. 2000); see You (2004) for additional references. Many of these biological ODE models focus on small, well-characterized biological systems, and for good reason: they can be easily parameterized from existing knowledge and they are computationally inexpensive to characterize and solve.
2.3.6.3 Modern Applications of ODE Models to Biological Networks
We now turn to more recent ODE modeling applications in systems biology. Several of the studies below perform many or all of the common tasks listed in Section 2.3.5, but we discuss only one per study. As the use of ODEs to model biological networks is quite widespread, we have tried to choose particularly novel or innovative examples.
Bayesian Calibration of a GPCR ODE Model Using Noisy Data
G-protein-coupled receptors (GPCRs) are a large family of transmembrane receptors that facilitate the transduction of a wide range of cellular signals. Cells exposed to multiple GPCR-binding ligands often respond as if the signals are additive, though occasionally the response can be synergistic. The precise mechanism of such synergy is unknown, which motivated the authors of Flaherty et al. (2008) to model the calcium release in mouse macrophage cells exposed to the signaling molecules complement factor 5a (C5a) and uridine diphosphate (UDP). Their mathematical model consisted of 53 ODEs (constructed using prior knowledge) with 84 parameters and 24 non-zero initial conditions. Parameters were estimated from a combination of preexisting data and knowledge and newly performed experiments. For the latter, the authors made time-resolved intracellular calcium measurements of mouse RAW264.7 cells in response to varying doses of C5a and UDP. They also collected similar measurements using five knockdown cell lines, illustrating the effects of decreasing quantities of five key signaling proteins (GRK2, Gαi2, Gαq, PLCβ3, and PLCβ4). These data were used to learn 20 of the 84 parameters most relevant to the five knockdown targets. Unlike most optimization procedures that choose point estimates of parameters maximizing the fit to the observed data, this study adopted a Bayesian procedure to estimate a full posterior distribution of parameters given the data. Bayesian methods are well suited for incorporating prior knowledge of parameter values with observed data to arrive at updated posterior parameter estimates. These posterior estimates are calculated as the mode of a posterior distribution, specified by Bayes’ rule:
$$\Pr(\theta \mid y) = \frac{p(y \mid \theta)\,\Pr(\theta)}{\Pr(y)}, \tag{2.2}$$
where θ represents the parameters, y the observed data, Pr(·) a probability measure, and p(·) a likelihood function. The posterior distribution often cannot be expressed in closed form; in these cases Markov chain Monte Carlo (MCMC) methods are used to generate samples from the distribution. The probabilistic nature of the Bayesian framework is appropriate for dealing with the presence of uncertainty; the authors note that measurement uncertainty and knockdown efficiency uncertainty are two such sources present in their data. Informative prior distributions were placed on the 20 parameters of interest that excluded negative values and centered on previous estimates of these parameters from biochemical experiments. A Metropolis–Hastings algorithm was then used to empirically estimate the posterior density of the parameters using the data measurements in conjunction with a Gaussian likelihood function. This procedure resulted in posterior parameter estimates that were sometimes quite different from (but still influenced by) their prior values. Figure 2.7 shows two examples. Each parameter’s posterior distribution provides an automatic measure of precision: the tighter the distribution, the more precise the estimate. As the authors note, parameters with low precision are good candidates for further biochemical experimentation.
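The sampling procedure described above can be illustrated with a deliberately small sketch. It is not the 53-ODE model of Flaherty et al. (2008): we assume a hypothetical one-parameter decay model, synthetic data, and an arbitrary prior, and all function names are ours. The sketch keeps the three ingredients named in the text: a prior that excludes negative parameter values (here a normal prior on the logarithm of the rate constant), a Gaussian likelihood, and a random-walk Metropolis–Hastings sampler.

```python
import math
import random


def decay_curve(k, times, y0=10.0):
    """Solution y(t) = y0 * exp(-k t) of the one-species ODE dy/dt = -k y."""
    return [y0 * math.exp(-k * t) for t in times]


def log_posterior(log_k, times, data, sigma, prior_mean, prior_sd):
    """Unnormalized log posterior of log(k): Gaussian likelihood plus prior."""
    predicted = decay_curve(math.exp(log_k), times)
    log_lik = -0.5 * sum(((d - p) / sigma) ** 2 for d, p in zip(data, predicted))
    # A normal prior on log(k) is a log-normal prior on k, so k stays positive.
    log_prior = -0.5 * ((log_k - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior


def metropolis_hastings(times, data, sigma, prior_mean, prior_sd,
                        n_samples=20000, step=0.1, seed=1):
    """Random-walk Metropolis-Hastings in log(k); returns samples of k."""
    rng = random.Random(seed)
    current = prior_mean
    current_lp = log_posterior(current, times, data, sigma, prior_mean, prior_sd)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)   # symmetric proposal
        proposal_lp = log_posterior(proposal, times, data, sigma,
                                    prior_mean, prior_sd)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(math.exp(current))
    return samples


# Synthetic "measurements": true k = 0.3 plus Gaussian noise (sigma = 0.5).
noise_rng = random.Random(0)
times = [float(t) for t in range(11)]
data = [y + noise_rng.gauss(0.0, 0.5) for y in decay_curve(0.3, times)]

samples = metropolis_hastings(times, data, sigma=0.5,
                              prior_mean=math.log(0.1), prior_sd=1.0)
kept = samples[5000:]                       # discard burn-in samples
posterior_mean = sum(kept) / len(kept)
```

The prior here is centered on a value (0.1) that differs from the rate used to generate the data (0.3), so the retained samples illustrate how a posterior can move away from, while still being informed by, its prior. The spread of `kept` plays the role of the precision measure discussed above.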
Fig. 2.7 Prior (light gray) and posterior (dark gray) density estimates for two parameters from the GPCR ODE model in Flaherty et al. (2008). Distributions consist of ∼30,000 MCMC samples; vertical line denotes parameter value chosen for the model. Densities are plotted as a function of parameter values on a log scale. Reproduced from Flaherty et al. (2008) with permission from PLoS
Simulation and analysis of the calibrated model led to new insight into synergistic GPCR-mediated calcium release in macrophage cells. Specifically, the authors discuss the mechanistic causes, robustness, and specificity of synergy. The authors also discuss two reasons why a Bayesian formulation was effective for calibration of their model: the abundance and quality of collected data and the speed and robustness of algorithmic methods for sampling from posterior distributions.
Validation and Testing of a Mathematical Model of Cell Death
The proper regulation of cellular apoptosis is essential for multicellular development, and its misregulation has been implicated in cancer, HIV progression, and viral infection, among other disorders. One of the mysteries of the apoptosis mechanism stems from the observation that cells receiving a tumor necrosis factor (TNF) or TNF-related apoptosis-inducing ligand (TRAIL) signal undergo a variable length delay followed by immediate cell breakdown. This breakdown is due to effector caspase activity on cellular substrates. To better understand the overall process, termed “variable-delay, snap-action” switching, the authors of Albeck et al. (2008b) built an ODE model including reactions both upstream and downstream of a pivotal apoptotic process: mitochondrial outer membrane permeabilization (MOMP). Their model, referred to as EARM v1.0 (extrinsic apoptosis reaction model), consists of 58 coupled ODEs describing 18 gene products and their modifications across two cellular compartments. The model requires values for 70 rate constants, which were manually adjusted to minimize the difference between simulated and experimental data measuring caspase activity, timing of MOMP, and effects of protein depletion and overproduction. An essential requirement of any calibrated mechanistic model is that it accurately reproduce experimental data. The authors simulated TRAIL treatment over a range of concentrations and measured the switching time between initial and complete effector caspase substrate cleavage (Ts), the fraction of cellular substrate cleaved by caspases upon cell death (f), and dose-dependent variation of the variable length delay period (Td). These matched previous experimentally observed values of ∼30 min and 1.0 for the first two, and a negatively sloped curve ranging from 3 to 10 h for the last. Simulated time courses of processes involving three gene products (Bid cleavage, Smac translocation, and cPARP levels) also closely matched experimentally observed trends. Figure 2.8 displays these results.
Fig. 2.8 Training data derived from live-cell microscopy used in Albeck et al. (2008b). a Simulation of Td (left) or Ts and f (right) as a function of TRAIL dose (lines) alongside corresponding experimental values (points with error bars indicating standard deviations). For predicted values of Td , an envelope of constant coefficient of variation (CV) is shown, as estimated from experimental data (CV ≈ 20%); the source of variation is not known. b Composite plot of IC-RP and EC-RP cleavage (measuring initiator and effector caspase activity, respectively) for >50 cells treated with 50 ng/ml TRAIL and aligned by the average time of MOMP (left) and model-based simulation of the corresponding species (right). Data in the left panel were originally reported in Albeck et al. (2008a). Reproduced from Albeck et al. (2008b) with permission from PLoS
Upon proper experimental validation, the model was then used to make six predictions concerning the molecular mechanisms of variable-delay, snap-action switching. These predictions were all supported experimentally, leading to a deeper understanding of TNF/TRAIL-regulated apoptosis. The authors also demonstrated that the level of mechanistic detail of their model (including compartmentalization) was necessary to faithfully reproduce experimental results, as a series of simpler models did not adequately fit the data. Though it is noted that the parameters of EARM v1.0 are mathematically non-identifiable, the empirical approach used to select parameter values that best matched observed data led to an accurate model of a relatively complex biological system.
Multivariate, Transient Response Sensitivity Analysis of Model Initial Conditions
Traditional sensitivity analysis measures the effects of single parameter changes on time-evolving model behavior. This univariate approach is useful for identifying reactions and species of importance to the overall reaction scheme, but it cannot characterize multiparameter effects on behavior. Naïve approaches that measure effects of changing multiple parameters simultaneously are often computationally intractable. In contrast, steady-state sensitivity analysis can identify and describe equilibrium system states as a function of multiple parameter values, but this approach necessarily ignores transient effects on system dynamics. As signal transduction networks often utilize short-lived signals to enact downstream function, methods that characterize transient parameter sensitivities would be beneficial. To satisfy both of the above requirements, the authors of Aldridge et al. (2006b) have applied direct finite-time Lyapunov exponent (DLE) analysis to a biological ODE system to determine sensitivities to model initial conditions (hereafter referred to as “parameters”). DLE analysis captures transient behavior as a function of all parameters simultaneously. The method can be used to identify separatrices, or regions in multivariate initial condition space that separate qualitatively different downstream responses. A DLE takes on the following form:
$$\mathrm{DLE}(t, x_0) = \log \lambda_{\max}\!\left[ \left(\frac{\partial x(t)}{\partial x_0}\right)^{T} \frac{\partial x(t)}{\partial x_0} \right], \tag{2.3}$$
where x0 is a vector of initial conditions, x(t) is a vector of species concentrations as a function of time, and λmax is the square of the spectral norm of the deformation gradient ∂x(t)/∂x0 . Thus, a DLE measures the local sensitivity to changes in parameters evaluated at a finite time, with large DLE values corresponding to large sensitivity of the system trajectory to parameter changes. Practically speaking, DLEs are calculated numerically across a multidimensional grid of parameter values; the presence of separatrices can be visualized in plots of DLE versus a two or three-dimensional subset of parameters. In Aldridge et al. (2006b), DLE analysis is applied to a subset of the apoptosis model in Albeck et al. (2008b) containing eight ODEs. This portion of the apoptosis pathway contains the activation of caspase-3 by caspase-8, leading to cell death, and the influence of X-linked inhibitor of apoptosis (XIAP), which negatively regulates caspase-3 activity. The authors note that this system is expected to have a separatrix due to the cell’s binary decision of life or death. Systems with more graded responses would have uniform DLEs and thus be unlikely to have discernible separatrices. DLE analysis on the apoptosis system identified a pronounced nonlinear separatrix between cell survival and death. The separatrix tends toward increased XIAP concentrations as the amount of active caspase-8 increases, highlighting the antagonistic effects of these species. For comparison, the authors also applied steady-state sensitivity analysis to the model and demonstrate that cell fate is indistinguishable based on steady-state locations.
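The numerical calculation of a DLE at a single point of initial condition space can be sketched as follows. This is an illustration under our own assumptions, not the eight-ODE caspase model analyzed by Aldridge et al. (2006b): we use a hypothetical two-species mutual-inhibition system that is bistable, approximate the deformation gradient ∂x(t)/∂x0 by central finite differences, and evaluate Eq. 2.3 as written (the logarithm of the largest eigenvalue of the 2 × 2 matrix formed from the deformation gradient). All function names and parameter values are ours.

```python
import math


def toggle_rhs(x, a=3.0):
    """Hypothetical two-species mutual-inhibition (bistable) system."""
    u, v = x
    return [a / (1.0 + v * v) - u, a / (1.0 + u * u) - v]


def flow(x0, t_end, dt=0.01):
    """Integrate the ODE from x0 to t_end with fixed-step RK4."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        k1 = toggle_rhs(x)
        k2 = toggle_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
        k3 = toggle_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
        k4 = toggle_rhs([xi + dt * k for xi, k in zip(x, k3)])
        x = [xi + dt / 6.0 * (p + 2 * q + 2 * r + s)
             for xi, p, q, r, s in zip(x, k1, k2, k3, k4)]
    return x


def dle(x0, t_end, eps=1e-6):
    """Log of the largest eigenvalue of J^T J, where J = dx(t)/dx0."""
    cols = []
    for j in range(2):                       # central finite differences
        plus, minus = list(x0), list(x0)
        plus[j] += eps
        minus[j] -= eps
        xp, xm = flow(plus, t_end), flow(minus, t_end)
        cols.append([(p - m) / (2.0 * eps) for p, m in zip(xp, xm)])
    # cols[j][i] = d x_i(t) / d x0_j; entries of the symmetric matrix J^T J
    c11 = sum(c * c for c in cols[0])
    c22 = sum(c * c for c in cols[1])
    c12 = sum(p * q for p, q in zip(cols[0], cols[1]))
    # Largest eigenvalue of a symmetric 2 x 2 matrix in closed form
    lam_max = 0.5 * (c11 + c22) + math.sqrt(0.25 * (c11 - c22) ** 2 + c12 ** 2)
    return math.log(lam_max)


dle_near = dle([1.0, 1.0], t_end=10.0)   # on the diagonal separating the fates
dle_far = dle([3.0, 0.2], t_end=10.0)    # well inside one basin of attraction
```

In this symmetric toy system the diagonal of initial condition space separates the two stable states, so `dle_near` should be positive (nearby trajectories diverge) while `dle_far` should be negative (nearby trajectories converge). Evaluating `dle` over a grid of initial conditions and plotting the result is the step that would reveal a separatrix as a ridge of large values.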
More generally, results of a DLE analysis allow the prediction of cell fate at a given time based only on initial species concentrations. This ability will likely be useful for a number of applications, including the characterization of cellular disease states. As ODE models of biological networks continue to increase in complexity and scale, it is likely that sophisticated dynamical systems tools like DLE analysis will be more frequently used for the identification and characterization of systems-level properties.
Dose-to-Duration Encoding as a Means to Transmit Quantitative Information
Signaling networks are responsible for transmitting extracellular signals to intracellular components to generate appropriate cellular responses. This transmission is not solely passive; many signaling networks involve signal modulation leading to phenomena like cross inhibition and negative feedback. Often, cellular response to a signal depends on the dose of that signal, so the signaling system is capable of transmitting quantitative information about the dose to downstream effectors. One example of such a system lies in the pheromone response pathway of Saccharomyces cerevisiae, where the dose of the pheromone leads to qualitatively different yeast phenotypes. At low dose, cells engage in vegetative growth; at intermediate pheromone levels the cells adopt an elongated shape, and at high dose the cells undergo growth arrest and extension of mating projections. The authors of Behar et al. (2008) propose a dose-to-duration mechanism as a means for encoding pheromone dose into varying downstream behaviors. Unlike linear response pathways, whose dynamic range is limited by saturation levels of network components (i.e., receptors), a dose-to-duration mechanism can increase the dynamic range of the system in such a way that dose-dependent responses can continue even after saturation of pathway components. 
This ability is due to the nonlinearity of the signaling pathway and can lead to a more robust transmission mechanism when acting between heterogeneous components. The authors begin with observations from previous work (Hao et al. 2008), which suggest that increasing doses of pheromone signal lead to increased dose and duration or only duration of two intracellular MAP kinases (Fus3 and Kss1). A hypothetical pathway architecture is constructed consisting of four components that make up a negative feedback loop (Fig. 2.9). Through calibration and simulation of a simple ODE model, it is shown that dose-to-duration encoding is a valid response of even this simple system. The authors then construct a similar signaling network using components of the yeast pheromone response pathway and fit the parameters of a six-ODE model to observed data. They demonstrate that this simple dynamic model results in dose-to-duration encoding that matches experimental data closely. Though this agreement does not prove the correctness of the model, it does suggest a biologically plausible mechanism for the observed emergent behavior. The authors emphasize that information transfer via a dose-to-duration mechanism occurs through transient activation of pathway components (enacted by signals of varying durations). This underscores the necessity of using dynamical (i.e., ODE) models to understand such behavior, as static or steady-state models cannot capture such transient phenomena.
Fig. 2.9 Pathway architectures that convert stimulus dose to signal duration. a Feed-forward and b negative feedback encoding modules (KK: Kinase–Kinase, K: Kinase, X: Phosphatase). Shown are cases of negative regulation operating by inhibiting activation (left) or promoting deactivation (right). Reproduced from Behar et al. (2008) with permission from PLoS
Further implications of the dose-to-duration mechanism are discussed, including its potential relevance to multicellular organisms. In particular, photoreceptors in rod cells encode intensity of light as the duration of downstream G-protein-mediated activity. Such behavior may be due to a biochemical mechanism similar to that observed in yeast. As typical signal transduction pathways exhibit more elaborate architecture than that modeled in this study, more complex variations of dose-to-duration encoding likely exist and the methods of analysis featured in this work will be useful to decipher such behavior.
Predictive Modeling with a Large-Scale ErbB Signaling ODE System
ErbB signaling, which encompasses the pathways activated by the ErbB1-4 receptor tyrosine kinases, is one of the best-studied components of multicellular eukaryotic signal transduction. Abnormal ErbB signaling has been implicated in many human cancers, and members of these pathways are common drug targets. The four ErbB receptors orchestrate a complex array of cellular signals, as they are known to bind 13 distinct ligands, form hetero- and homo-oligomers once bound, and activate multiple downstream pathways including the MAPK/ERK and PI3K/Akt cascades. It is not surprising that the precise mechanisms for how different ligands induce differing downstream responses are poorly understood. To improve our understanding of ErbB signaling, Chen et al. (2009) developed a large-scale ODE model including all four ErbB receptors and the ERK and Akt signaling pathways. In the interest of computational tractability, the authors made several simplifications in the number and type of receptor dimers, phosphorylation states, and structure of degradation pathways when constructing the model. Nevertheless, 828 reactions remained, which were described by 499 ODEs with 229 parameters. To calibrate the model, the authors set parameters to literature-derived values when possible, and a subset of the rest were learned from experimental data (chosen according to their impact on an objective function describing model fit). Experimental data consisted of ErbB1, Akt, and ERK activity levels across a 2-hour time course following stimulation with two different ligands. Given the complexity of the model, the parameters were expected to be non-identifiable (multiple combinations of parameter values fit the data equally well), and a simulated annealing optimization scheme was used repeatedly to identify these best-fit parameter value combinations. Once the model was (partially) constrained, the authors used simulation results to make predictions and test them experimentally. The first validated prediction involved differential sensitivity of ERK and Akt activity to treatment with the anti-ErbB drugs gefitinib and lapatinib. The ODE model predicted that Akt activity would be more sensitive to both drugs, and experimental results corroborated this prediction. Next, several predictions concerning the dose–response of the ErbB network to ligand were made and subsequently tested. One of these predictions was for a Hill coefficient (Happ) describing the steepness of the pathway response to increasing ligand concentration.
This coefficient appears in the Hill equation:
$$\mathrm{signal}(x) = \frac{x^{H_{app}}}{x^{H_{app}} + k^{H_{app}}}, \tag{2.4}$$
where “signal” is a measure of the pathway response, x represents the concentration of ligand, and k is the concentration of ligand that gives half-maximal response. Previous work in Xenopus oocytes predicted a switch-like ERK response to progesterone (acts as a proxy for EGF) (Huang and Ferrell 1996). This corresponded to a Hill coefficient of 4.9. In contrast, the ODE model of Chen et al. predicted a much more gradual response to EGF treatment (Happ ∼ 0.30), and experimental data confirmed this result. The reason for the discrepancy was identified when the authors created a sub-model of the ERK response pathway. Simulation of this model when treated with EGF reproduced switch-like activity, suggesting that modeling of the larger signaling context was necessary to faithfully reproduce the experimental observations. This study demonstrates that a large-scale, partially constrained ODE model consisting of many elementary reactions is capable of accurately predicting observed data.
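To give a feel for what the two Hill coefficients quoted above imply, the following sketch evaluates Eq. 2.4 and a commonly used steepness measure, the fold change in ligand needed to move the response from 10% to 90% of maximum. Inverting Eq. 2.4 shows that this ratio equals 81 raised to the power 1/Happ. The function names are ours and the half-maximal concentration is set to an arbitrary value of 1; nothing here reproduces the models of Huang and Ferrell (1996) or Chen et al. (2009).

```python
def hill(x, k, h):
    """Hill equation (Eq. 2.4): fractional response at ligand concentration x."""
    return x ** h / (x ** h + k ** h)


def dose_for_response(fraction, k, h):
    """Ligand concentration giving the requested fractional response."""
    # Inverting Eq. 2.4: x = k * (f / (1 - f)) ** (1 / h)
    return k * (fraction / (1.0 - fraction)) ** (1.0 / h)


def ec90_over_ec10(h):
    """Fold change in dose needed to go from 10% to 90% response."""
    return dose_for_response(0.9, 1.0, h) / dose_for_response(0.1, 1.0, h)


fold_switch_like = ec90_over_ec10(4.9)    # coefficient reported for oocytes
fold_graded = ec90_over_ec10(0.30)        # coefficient predicted by the model
```

With a coefficient of 4.9 the response moves from 10% to 90% over roughly a 2.5-fold change in ligand, whereas with a coefficient of 0.30 the same transition spans more than a million-fold change, which is the sense in which the second response is "much more gradual."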
2.3.6.4 Outstanding Challenges in ODE Modeling
As mentioned above, use of ODE models is widespread in systems biology, and recent applications have begun to model larger and larger signaling networks. We expect this trend to continue; thus, an obvious challenge is the proper calibration of these large, complex models. This will require advances in the quality and quantity of time-resolved data generation and collection. Several recent developments in this area are discussed in Albeck et al. (2006). Additionally, as models grow in size, parameter learning can become prohibitively difficult, and certain parameters will be non-identifiable given limited experimental data. Computational methods to identify and correctly deal with these parameters will be needed; the work of Chen et al. (2009) provides a nice example of how to address this task. An additional challenge arising with larger models is defining the structure of the individual reactions and their constituent species. In the past, this was mostly performed manually based on prior knowledge, but this process is time-consuming and often not feasible for less well-studied systems. Thus, automatic generation of model structure from high-throughput data is an area of active research; recent examples can be found in Carrera et al. (2009) and Bonneau (2008).
2.3.7 PDE Systems
Biological systems are known to exhibit spatial inhomogeneity, and some tasks require explicit modeling of the spatial dimension. This is especially true when the biological system in question extends across several cellular organelles, each potentially containing different components, or when the diffusion of individual components across the modeled space cannot be treated as an instantaneous process. Compartmental ODE models have been successfully used to model the former case, where components are assumed to be well mixed within compartments and transport between compartments occurs at a much slower, measurable rate (Aldridge et al. 2006a). As these models are modified versions of the ODE models described above, we will not discuss them further. In the latter case, i.e., when explicitly modeling the diffusion of certain components, partial differential equation models are necessary. Here, the spatial dimension is modeled as a continuous quantity, and the concentration of each component becomes a function of both space and time. The PDEs most commonly used to describe such systems are reaction–diffusion equations, where the concentration of each component $y_i(t)$ of the system can be represented as follows (derived using Fick’s second law of diffusion):
$$\frac{\partial y_i(t)}{\partial t} = f_i(\mathbf{y}(t)) + D_i \sum_{j=1}^{m} \frac{\partial^2 y_i(t)}{\partial x_j^2}, \quad 1 \le i \le n, \quad 1 \le m \le 3, \tag{2.5}$$
where y(t) is as above, $D_i$ is a diffusion coefficient, $x_j$ represents a spatial dimension, and m is the number of spatial dimensions modeled. The first term on the RHS, $f_i$, describes the contributions of chemical reactions to the time derivative, and the second term describes the contributions of diffusion. Compared to ODE models, PDE systems are much more challenging to solve, in part because they require many more parameters (Eungdamrong and Iyengar 2004). Aside from the kinetic parameters needed to specify $f_i$, the reaction–diffusion system requires a diffusion coefficient for each species (these are difficult to measure experimentally; Rangamani and Iyengar 2007), and fluxes and/or concentrations of each component must be specified at the boundary of the physical space being modeled. This latter constraint becomes even more prohibitive when considering complex physical geometries. Solutions to nonlinear PDE systems are almost exclusively numerical, and the added realism of the model comes at a computational cost due to the increased dimensionality of the system.
2.3.7.1 Early Examples of PDE Models Describing Biological Systems
Models of biological systems governed by PDEs employ two of the three simplifying assumptions of ODE models, with spatial homogeneity being the exception. Nonetheless, when mathematically and computationally tractable, these models can accurately reproduce spatially varying molecular behavior. One of the first examples of a PDE model describing a biological system modeled the behavior of two (generic) morphogens reacting and diffusing through simple geometries of cells (Turing 1952). These equations were constructed in a way that provided for analytical solutions, and these solutions (expressed as functions of the morphogen concentrations) gave rise to spatial patterns reminiscent of those seen in organismal development. Subsequent work elaborated upon this simple model of morphogen-controlled patterning. 
One study simulated a four-morphogen reaction–diffusion system to mimic pattern formation in Drosophila embryogenesis (Lacalli 1990). By adjusting model parameters, the authors could produce striped patterns of morphogen concentration compatible with observed wild-type and mutant phenotypes. Another application modified the simple morphogen model to allow diffusion coefficients to depend on the spatial variable (Maini et al. 1992). Model-derived patterns were shown to produce behavior more compatible with known mechanisms of vertebrate limb development. A model of pattern formation in the context of E. coli cell division was created using six coupled PDEs modeling three proteins in both membrane-bound and cytoplasmic states (Meinhardt and de Boer 2001). Results from the model confirmed observed spatial oscillatory behavior of two of the proteins and suggested a molecular mechanism for the centralized localization of the third. Additional reaction–diffusion models have been constructed to describe diverse biological processes such as striping patterns in fish (Kondo and Asai 1995; Asai et al. 1999), cell migration in butterfly wings (Sekimura et al. 1999), and avian embryogenesis (Painter et al. 2000). See Baker et al. (2008) for many additional references.
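As a concrete illustration of how a reaction–diffusion equation of the form of Eq. 2.5 is solved numerically, the sketch below treats the simplest case we could construct: one species, one spatial dimension, first-order degradation as the reaction term, and no-flux boundaries. It uses an explicit finite-difference scheme, which is only stable when the time step is small relative to the grid spacing. The grid, parameter values, and function names are hypothetical and are not drawn from any of the studies cited above.

```python
def reaction_diffusion_step(y, dt, dx, diff_coeff, reaction):
    """One explicit time step of dy/dt = reaction(y) + D * d2y/dx2 in 1D."""
    n = len(y)
    new_y = []
    for i in range(n):
        # Reflecting neighbors at the ends implement no-flux boundaries.
        left = y[i - 1] if i > 0 else y[i]
        right = y[i + 1] if i < n - 1 else y[i]
        laplacian = (left - 2.0 * y[i] + right) / dx ** 2
        new_y.append(y[i] + dt * (reaction(y[i]) + diff_coeff * laplacian))
    return new_y


def simulate(y0, n_steps, dt, dx, diff_coeff, reaction):
    """Advance the concentration profile y0 by n_steps explicit steps."""
    if diff_coeff * dt / dx ** 2 > 0.5:
        raise ValueError("time step too large for the explicit scheme")
    y = list(y0)
    for _ in range(n_steps):
        y = reaction_diffusion_step(y, dt, dx, diff_coeff, reaction)
    return y


k_deg = 0.1                                 # first-order degradation rate
initial = [0.0] * 51
initial[25] = 1.0                           # all material starts in one cell
final = simulate(initial, n_steps=100, dt=0.2, dx=1.0, diff_coeff=1.0,
                 reaction=lambda c: -k_deg * c)
```

Because the boundaries are no-flux, diffusion only redistributes material, and the total amount changes solely through the degradation term; this gives a simple consistency check on the scheme. Each additional species adds another profile and another diffusion coefficient, and each additional spatial dimension multiplies the number of grid points, which is the computational cost referred to in the text.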
B.J. Daigle et al.
2.3.7.2 Modern Applications of PDE Models to Biological Networks

Recent PDE models capture more biological detail (and are thus more realistic) than their earlier counterparts. As with the ODE section above, we focus on work published within the last 5 years and discuss one particularly innovative example per dynamical modeling task. We note that considerably fewer published studies use PDEs than ODEs to model signaling networks; this is due to the increased computational complexity and demand for experimental data imposed by these models.
Calibration of a Planar Cell Polarity Model Using Qualitative Phenotypes

The process by which planar cell polarity (PCP) signaling generates distally oriented hairs in cells of the D. melanogaster wing is not fully understood. Aside from wild-type function, certain single gene mutants in cell clones result in an aberrant hair phenotype in adjacent cells, a process called domineering non-autonomy. The primary molecular players in this process include the transmembrane receptors Van Gogh/strabismus (Vang) and frizzled (Fz), and the cytoplasmic proteins Dishevelled (Dsh) and Prickle-spiny-legs (Pk). Experimental evidence indicates that these proteins selectively accumulate on the distal or proximal sides of wing cells during wild-type function. In addition, unknown diffusible factors X and Z have been proposed to explain domineering non-autonomy, although no such factors have yet been experimentally identified. To better understand the process of PCP and to test whether domineering non-autonomy is possible without implicating unknown diffusible factors, Amonlirdviman et al. developed a reaction–diffusion model of hexagonal cells arranged in a planar array (Amonlirdviman et al. 2005). They included the four identified proteins listed above and simulated their known influences on each other via a feedback loop by allowing the formation of six protein complexes (DshFz, VangPk, FzVang, DshFzVang, FzVangPk, and DshFzVangPk). The FzVang interactions were designed to occur across adjacent cells, and all others occur intracellularly. The existence of most of these interactions is supported by experimental evidence. The feedback loop created by these four proteins is thought to amplify an initial asymmetry cue, resulting in their polarized spatial accumulation. The authors implemented two different forms of such a signal and found that both resulted in similar behavior.
The overall model contained a system of 10 nonlinear PDEs, whose rate constants, diffusion coefficients, and initial protein concentrations were unknown. To calibrate these parameters, the authors created an objective function describing the error between the model output and 12 qualitative experimentally observed phenotypes (both wild-type and mutant). Table 2.1 lists the phenotypes and corresponding genotypes. Numerical optimization methods were used to identify parameter values that satisfied all constraints, and sensitivity analysis demonstrated that some parameters were more tightly constrained than others.
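One way to turn qualitative phenotypes into an objective function is to encode each phenotype as an inequality on a simulated output and to penalize violations. The sketch below illustrates this idea only; it is not the objective function of Amonlirdviman et al. The function toy_model stands in for a full PDE simulation, the genotype names, thresholds, and parameter ranges are hypothetical, and random search replaces the unspecified numerical optimization method.

```python
import random

# Toy stand-in for the PDE simulation: maps parameters and a genotype to a
# single summary output (a signed "polarity" score). Entirely hypothetical.
def toy_model(params, genotype):
    gain, bias = params
    activity = {"wild-type": 1.0, "fz_null": 0.0}[genotype]
    return gain * activity + bias

# Each qualitative phenotype is encoded as a margin function that is
# non-negative exactly when the simulated output shows the phenotype.
CONSTRAINTS = [
    ("wild-type", lambda polarity: polarity - 0.5),      # clear distal polarity
    ("fz_null", lambda polarity: 0.1 - abs(polarity)),   # polarity lost
]

def objective(params):
    """Sum of squared constraint violations; zero when every phenotype is reproduced."""
    total = 0.0
    for genotype, margin in CONSTRAINTS:
        shortfall = max(0.0, -margin(toy_model(params, genotype)))
        total += shortfall ** 2
    return total

def random_search(n_trials=2000, seed=0):
    """Crude optimizer: keep the best of n_trials uniformly sampled parameter sets."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
        score = objective(params)
        if best is None or score < best[1]:
            best = (params, score)
    return best

best_params, best_score = random_search()
print(best_params, best_score)
```

Because the objective is zero on an entire region of parameter space, many parameter sets satisfy all constraints equally well, which is why a sensitivity analysis of the feasible region is a natural follow-up step.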
2
Current Progress in Static and Dynamic Modeling
Table 2.1 Characteristic PCP phenotypes (and associated references) used in model objective function. [Table reproduced from Table S1 of Amonlirdviman et al. (2005)]

Genotype | Phenotype
Wild-type | Asymmetric accumulation of Dsh and Fz on the distal cell membrane. Asymmetric accumulation of Pk and Vang on the proximal cell membrane (Tree et al. 2002; Bastock et al. 2003; Axelrod 2001; Strutt 2001).
dsh | Polarity disruption inside of the mutant clone. Autonomous phenotype (Klingensmith et al. 1994; Theisen et al. 1994).
fz | Distal domineering non-autonomy (Gubb and García-Bellido 1982).
Vang | Proximal domineering non-autonomy (Taylor et al. 1998).
pk | No polarity reversal (Amonlirdviman et al. 2005).
>>dsh (a) | Proximal domineering non-autonomy (unpublished).
>>fz | Proximal domineering non-autonomy (Strutt 2001).
>>Vang | Distal domineering non-autonomy (unpublished).
>>pk | Distal domineering non-autonomy.
fz^autonomous | Polarity disruption inside of the mutant clone. Autonomous phenotype (Jones et al. 1996).
>>fz^autonomous | Proximal domineering non-autonomy (Strutt 2001).
EnGAL4, UASpk | Overexpression of Pk results in protein accumulation to a degree greater than or equal to that for wild-type results (Tree et al. 2002).

(a) >> denotes overexpression
After calibration, numerical simulation of the model was used to investigate potential mechanisms for domineering non-autonomy. In particular, two mutant alleles of frizzled (fzF31, fzR52) that cause autonomous and non-autonomous phenotypes, respectively, were hypothesized to differ in their interactions with Vang. By making the necessary modifications to the model for each allele, the desired mutant behavior was reproduced, and experimental evidence confirmed the hypothesized differences in Fz-Vang interactions. The authors note that although simulation of their PDE system is able to reproduce all known phenotypes, this does not prove the correctness of the underlying biological model. Nevertheless, model results demonstrate the feasibility of the proposed mechanism for domineering non-autonomy and suggest that unknown diffusible factors are not needed to explain the behavior of this system.

Validation of a Simple Diffusion Model of Bicoid in the Drosophila Embryo

Another well-studied signal transduction system in D. melanogaster controls anteroposterior patterning in the developing embryo. Here, gradients in the concentration of maternal proteins establish gene expression domains that lead to eventual body segmentation. The Bicoid (Bcd) transcription factor is one of the best-studied maternal morphogens. Bcd RNA is deposited during oogenesis at the anterior pole of the egg, resulting in an anteroposterior protein gradient. Bcd has been shown to regulate the hunchback, krüppel, and even-skipped genes, which collectively generate striped gene expression patterns.
Though it was hypothesized that gradients of Bcd arise through simple diffusion, this claim had never been rigorously verified. To address this, Gregor et al. injected dextran particles of similar size to Bcd into a Drosophila egg and made concentration measurements across a time course at 18 spatial positions (Gregor et al. 2005). They compared these results to those predicted numerically by a simple three-dimensional diffusion model in an embryo-shaped volume, governed by the following equation:

$$\frac{\partial c(r, t)}{\partial t} = D \nabla^2 c(r, t), \tag{2.6}$$
where c(r, t) represents particle concentration at position r and time t, D is the diffusion coefficient, and $\nabla^2$ is the Laplace operator (sum of the unmixed second partial derivatives with respect to each spatial dimension). A nonlinear fitting routine was used to select the value of D that minimized the difference between the experimental and predicted concentrations. When simulation output was compared to experimental results, it was found that the simple diffusion model fit the data very closely. Gregor et al. then constructed a reaction–diffusion model for Bcd protein, parameterized by the Bcd diffusion coefficient and decay lifetime (τ). The corresponding PDE takes the following form:

$$\frac{\partial c(r, t)}{\partial t} = D \nabla^2 c(r, t) - \frac{1}{\tau} c(r, t), \tag{2.7}$$
where the second term on the RHS of Eq. (2.7) represents the degradation rate of Bcd. Immunofluorescence data of Bcd protein in Drosophila embryos were used to fit the parameter τ, resulting in an estimate of ∼6 min. The authors then asked how Bicoid behavior would scale with increasing embryo size, such as is observed in Calliphora vicina, a fly with an egg length ∼3 times greater than that of Drosophila. Dextran injection experiments were repeated for three additional species of fly with varying embryo sizes, and estimated diffusion coefficients were shown to be similar to that of Drosophila. The values of τ for these other species were estimated as before, yielding values that ranged from 3 min for the smallest embryo size to 32 min for the largest. These values are plausible, yet near the upper limit given the species' respective developmental time courses. Such a limit further supports the hypothesis of simple diffusion for Bcd behavior, as active cellular mechanisms restricting Bcd diffusion would require decay lifetimes for proper gradient formation that exceed developmental time scales. The collective findings from this study argue that Bcd gradient formation is controlled by simple diffusion and that the protein's decay lifetime increases with increasing egg size in different fly species. Interestingly, as fly embryos of different species develop along similar time scales, the above conclusion implies that pattern formation based on diffusible Bicoid would become physically impossible in fly embryos much larger than C. vicina.
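The link between decay lifetime and egg size can be illustrated with a one-dimensional version of Eq. (2.7). At steady state, with the concentration held fixed at the anterior end, the solution far from the posterior boundary decays exponentially with length scale $\sqrt{D\tau}$, so a gradient that spans a longer egg at fixed D requires a larger τ. The sketch below integrates the 1D equation with an explicit finite-difference scheme and recovers this length scale from the simulated profile. The domain length, grid, boundary conditions, and dimensionless parameter values are assumptions made for illustration and are not those of Gregor et al.

```python
import numpy as np

def simulate_gradient(D=1.0, tau=1.0, length=10.0, n=101, t_end=20.0):
    """Explicit finite-difference integration of dc/dt = D c_xx - c/tau in 1D.

    Assumed boundary conditions: c = 1 at x = 0 (source), c = 0 at x = length.
    """
    x = np.linspace(0.0, length, n)
    dx = x[1] - x[0]
    dt = 0.2 * dx**2 / D                 # below the explicit stability limit
    c = np.zeros(n)
    c[0] = 1.0
    for _ in range(int(t_end / dt)):
        lap = (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dx**2
        c[1:-1] += dt * (D * lap - c[1:-1] / tau)
    return x, c

def fitted_length_scale(x, c, x_max=3.0):
    """Length scale from a straight-line fit to log(c) near the source."""
    mask = x <= x_max
    slope = np.polyfit(x[mask], np.log(c[mask]), 1)[0]
    return -1.0 / slope

x, c = simulate_gradient(D=1.0, tau=1.0)
print(fitted_length_scale(x, c), np.sqrt(1.0 * 1.0))
```

Quadrupling τ at fixed D doubles the fitted length scale in this sketch, which is the scaling argument behind the statement that larger eggs require longer decay lifetimes.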
Sensitivity Analysis of a Sonic hedgehog Signaling PDE Model

Traditional studies of morphogen gradients in development have focused on steady-state signal levels and their effects on target genes. Recently, it has become clear that gradient dynamics are important to tissue patterning as well, as evidenced by the Sonic hedgehog (Shh) signaling pathway. Shh forms a concentration gradient during vertebrate development, and it is involved in limb bud, midbrain, and spinal cord patterning. It has been shown that both time of exposure to Shh and the timing of Shh secretion are determinants of tissue patterning. Besides passive diffusion, mechanisms like active transport and interactions with cell surface and extracellular matrix components are known to affect the temporal dynamics of signaling. To achieve a better understanding of Shh signaling dynamics, Saha and Schaffer constructed a multicellular PDE model of spinal cord patterning in the chick embryo (Saha and Schaffer 2006). The model comprises a transverse section of the developing neural tube during the time when Shh secretion from the floorplate induces dorsally oriented cells to switch from an interneuron to a motoneuron fate (∼33–116 hours after egg laying). System behavior is governed by eight coupled reaction–diffusion equations describing Shh diffusion and its interaction with receptors, membrane proteins, and downstream transcription factors. Model parameters were chosen from known values in the literature or estimated from similar biological systems, and parameters were adjusted to match experimental observations. Simulation was carried out using a finite element method numerical solver. The authors conducted sensitivity analysis on model parameters to determine how varying each one affected the response of a target transcription factor (Gli1) to changing Shh concentration.
Previous work identified two steady states corresponding to a Gli1 switch being 'on' and 'off' (concentration above and below a threshold, respectively). Three behavioral regimes were described: one where the 'on' state is stable, another where the 'off' state is stable, and a bistable regime. By varying the value of each parameter separately across four orders of magnitude, it was determined that some parameters did not affect the behavioral regime (rate constant of Shh receptor outflux), while others altered it given large enough changes (rate constant of Shh receptor influx). Choice of regime was most sensitive to the value of the maximum rate of Gli1 synthesis, where even small changes altered behavior. The authors then employed the model to reproduce known tissue patterning results and characterize novel behavior. They confirmed that Shh interactions and active transport played an important role in Shh-induced downstream changes, due in part to modulation of Shh signal dynamics. Characterization of this behavior would not have been possible using a steady-state model, underscoring the importance of dynamical models for understanding signal transduction systems.
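A one-at-a-time parameter sweep of this kind can be sketched on a much smaller system. The example below uses a hypothetical single-variable positive-feedback switch, not the eight-equation model of Saha and Schaffer: a species G is produced at a basal rate (signal), amplifies its own synthesis up to a maximum rate v_max, and decays linearly. All names, rate laws, and values are assumptions. The steady states are the positive real roots of a cubic, and the regime is classified by how many there are and where they lie relative to a threshold.

```python
import numpy as np

def steady_states(v_max, signal=0.01, K=1.0, k_deg=1.0):
    """Positive real steady states of dG/dt = signal + v_max*G^2/(K^2+G^2) - k_deg*G."""
    # Multiplying the right-hand side by (K^2 + G^2) gives a cubic in G.
    coeffs = [-k_deg, signal + v_max, -k_deg * K**2, signal * K**2]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    return np.sort(real[real > 0])

def regime(v_max, threshold=1.0, **kwargs):
    """Classify the switch as 'off', 'on', or 'bistable' for a given v_max."""
    states = steady_states(v_max, **kwargs)
    if len(states) == 3:
        return "bistable"
    return "on" if states[-1] > threshold else "off"

# Sweep the maximum synthesis rate across four orders of magnitude.
for v_max in np.logspace(-2, 2, 9):
    print(f"v_max = {v_max:8.3f}  regime = {regime(v_max)}")
```

Repeating the sweep for each of the other parameters (signal, K, k_deg) while holding the rest fixed gives a crude picture of which parameters move the system between regimes and which leave it unchanged.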
Wave Propagation in Astrocyte Signaling Networks

The nervous system is composed of two cell types: neurons and glial cells. For years glial cells were regarded as nothing more than support cells for neurons, until it was shown in 1990 that glutamate can induce Ca2+ waves in astrocyte (a type of glial cell) cultures (Cornell-Bell et al. 1990). Proposed mechanisms for astrocytic Ca2+ wave production fall into two categories: simple intercellular diffusion of the IP3 signaling molecule and active regeneration of signal by released ATP. Recent experimental evidence supports the latter theory, although the mechanism of ATP release is still largely unknown. A better understanding of this process would be clinically useful, as abnormal astrocytic wave propagation has been linked to disorders including migraine and epilepsy. Stamatakis and Mantzaris (2006) have attempted to clarify the mechanism of astrocytic wave propagation through mathematical modeling. They constructed both a single-cell ODE and a multiple-cell PDE model. The latter comprises a coupled set of four reaction–diffusion equations taking the following form:

$$\frac{\partial u}{\partial t} = D \nabla^2 u + f(u), \tag{2.8}$$

where $u = \big[\,[\mathrm{ATP}] \;\; [\mathrm{IP_3}] \;\; [\mathrm{Ca^{2+}}] \;\; h\,\big]^T$, f(u) is a vector of reaction terms for each species derived from the single-cell model, and D contains diffusion coefficients. h is a dimensionless variable containing information about the fraction of open channels. Values for the diffusion coefficients were estimated from the literature, and the PDE model was numerically simulated in both one- and two-dimensional domains. ATP-mediated wave propagation was characterized in PDE models employing one of two hypothetical mechanisms: Ca2+-dependent, excitable (by positive feedback) ATP release and IP3-dependent, non-excitable ATP release. The authors found that the first mechanism led to frequency-encoded oscillations in single cells and propagation of one-dimensional waves of infinite range in multiple cells. In two dimensions, a point stimulus of ATP led to spiral waves of ATP and Ca2+. In contrast, the second mechanism did not lead to single-cell oscillatory behavior, and multiple cells exhibited propagation of waves with finite range.
This behavior was due to the extracellular ATP concentration falling below a threshold at a certain distance from the original stimulus. Experimental observations support both ATP-mediated mechanisms. On the one hand, spiral waves have been detected in cell culture, and previous explanations invoked spatial inhomogeneities between cells as the cause. Model results from this study argue that a simple mechanism of Ca2+-dependent ATP release is sufficient to explain such patterns. On the other hand, experimentally observed astrocytic wave propagation is finite in range. This is suggestive of an IP3-dependent ATP release mechanism. Though the results of this study do not definitively favor either of the two mechanisms of astrocytic wave propagation (or argue strongly for both), the proposed spatial model does provide a testing ground for further hypotheses to ultimately elucidate the true mechanism.

Predictive Modeling of Molecular Mechanisms for Hair Follicle Spacing

As mentioned above, one of the earliest PDE models of a biological signaling system implicated only two morphogens reacting and diffusing through cellular space.
This model, proposed by Alan Turing, treated one morphogen as an activator and the other as an inhibitor, and its simulation produced patterns reminiscent of pigmentation in the animal kingdom (Turing 1952). Until recently, a bona fide real-world example of the model had never been identified; thus, doubts existed regarding its authenticity (Maini et al. 2006). Sick et al. (2006) model and experimentally characterize such an example: the Wnt signaling pathway controlling hair follicle spacing in mice. A collection of proteins from the Wnt family and their known inhibitors Dkk1 and Dkk4 are present in and around the hair follicle during development, and available data suggest that the Wnt and Dkk proteins are the primary determinants of follicle spacing patterns. To test this hypothesis in the framework of the Turing reaction–diffusion model, the authors constructed the following PDE model:

$$\begin{aligned}
\frac{\partial a}{\partial t} &= D_a \nabla^2 a + \rho_a \frac{a^2}{(K_h + h)(1 + \kappa a^2)} - \mu_a a, \\
\frac{\partial h}{\partial t} &= D_h \nabla^2 h + \rho_h \frac{a^2}{(K_h + h)(1 + \kappa a^2)} - \mu_h h,
\end{aligned} \tag{2.9}$$
where a and h are the concentrations of generic Wnt and Dkk proteins, respectively; Da and Dh are diffusion coefficients; ρa and ρh are reaction constants scaling the speed of the interaction between Wnt and Dkk; and μa and μh are decay constants (Kh and κ are additional reaction constants). Parameters were set arbitrarily, though it is noted that variations in their values do not qualitatively change behavior. The system was numerically simulated to create consecutive waves of hair follicle formation in a square grid of mouse skin. The authors then made experimental predictions based on modifications of their PDE model. Results from model simulation suggested that moderate overexpression of Wnt proteins increases follicular density, while strong overexpression completely disrupts patterning. In contrast, overexpression of Dkk in the model led to increased interfollicular spacing and clustering of new follicles in subsequent waves of formation (due to higher levels of Wnt around preexisting follicles). Transgenic expression of Dkk2 (another Wnt inhibitor) in mouse skin was used to test the latter two predictions. As expected, with increasing levels of inhibitor, hair follicle formation was impeded, leading to lower follicular density. A closer examination of mutant mouse skin demonstrated that follicle clusters were present, and ringlike patterns of Wnt signal-receiving cells were present around preexisting follicles. These results confirmed both model predictions concerning increased Wnt inhibitor concentration. The computational and experimental results of this study provide compelling evidence for a reaction–diffusion mechanism for the Wnt signaling pathway. Though additional signaling pathways are known to be involved in follicular patterning, acting mostly downstream of formation, a model consisting of just two morphogens provides an accurate representation of experimentally observed behavior.
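A system of the form of Eq. (2.9) can be integrated numerically with a simple explicit finite-difference scheme. The sketch below does so on a periodic one-dimensional domain rather than the two-dimensional grid used by Sick et al.; the parameter values, grid, time step, and initial conditions are arbitrary placeholders, and no claim is made that they reproduce the published patterns. Whether a spatial pattern emerges depends on these choices.

```python
import numpy as np

def simulate_wnt_dkk(n=128, dx=0.5, t_end=100.0, seed=0,
                     D_a=0.02, D_h=1.0, rho_a=1.0, rho_h=1.0,
                     mu_a=1.0, mu_h=1.2, K_h=0.1, kappa=0.1):
    """Explicit finite-difference integration of Eq. (2.9) on a periodic 1D domain."""
    rng = np.random.default_rng(seed)
    a = 1.0 + 0.01 * rng.random(n)      # activator (Wnt), small random perturbation
    h = np.ones(n)                      # inhibitor (Dkk), initially uniform
    dt = 0.2 * dx**2 / max(D_a, D_h)    # conservative step for the explicit scheme

    def laplacian(u):
        # Second-order central difference with periodic boundaries.
        return (np.roll(u, 1) - 2.0 * u + np.roll(u, -1)) / dx**2

    for _ in range(int(t_end / dt)):
        production = a**2 / ((K_h + h) * (1.0 + kappa * a**2))
        da = D_a * laplacian(a) + rho_a * production - mu_a * a
        dh = D_h * laplacian(h) + rho_h * production - mu_h * h
        a, h = a + dt * da, h + dt * dh
    return a, h

a, h = simulate_wnt_dkk()
print(a.min(), a.max(), h.min(), h.max())
```

Model perturbations such as overexpression of the activator or inhibitor could be explored in this sketch by changing rho_a or rho_h and comparing the resulting profiles, in the spirit of the predictions described above.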
2.3.7.3 Outstanding Challenges in PDE Modeling

Simulation of PDE network models is substantially more computationally intensive than simulation of their ODE counterparts, so a primary challenge is the development of mathematical approaches enabling the characterization of larger systems. Coarse-grained spatial methods like compartmental modeling can lessen the computational load in systems that are tolerant of reduced spatial resolution (de Jong 2002; Rangamani and Iyengar 2007). As with ODE systems, improvements in high-quality data collection will further enable the simulation of larger PDE models; Rangamani and Iyengar (2007) discuss techniques for experimentally estimating protein diffusion constants. Another challenge is the efficient spatial modeling of complex cellular geometries. Most existing work treats cells and subcellular components as simple geometric shapes; though this simplifies computation, it may not be sufficiently accurate for certain model systems. Advances in finite volume modeling enable representation of and computation on arbitrary cell morphologies, which can lead to more realistic (and accurate) models.
2.3.8 SDE Systems

Both ODE and PDE models of biological systems assume that reactions occur in a deterministic manner. This assumption seems to imply that biological reactions exhibit little to no heterogeneity or stochasticity ("intrinsic noise"), which is known to be false (McAdams and Arkin 1997). Rather, the main reason for the success of deterministic biological models is that stochastic effects are often rendered negligible by averaging across large numbers of molecules or cells. This phenomenon also underlies the success of continuous mechanistic models, where discrete numbers of molecules can be approximated with continuous concentrations. There are, however, a number of well-characterized biological systems where the modeling assumption of deterministic reactions leads to qualitatively incorrect depictions of behavior. A deterministic model of the circadian rhythm oscillator parameterized with particular degradation rates fails to oscillate; in contrast, the noise present in the corresponding stochastic model gives rise to more robust oscillatory behavior (Vilar et al. 2002). In a common class of biochemical reaction mechanisms, enzymatic futile cycles, extrinsic noise (i.e., noise due to components/processes outside the system) in a stochastic model was shown to induce bistable oscillatory behavior that was absent in a similar deterministic model (Samoilov et al. 2005). These (and other) important exceptions to deterministic reaction mechanisms have led to the application of stochastic models to biological systems. In this section, we review continuous stochastic chemical kinetic models in the form of stochastic differential equations (SDEs). We focus mostly on a class of SDEs that can be derived from principles of discrete stochastic chemical kinetics; for illustrative purposes, our treatment starts with this derivation in brief (for more details see Gillespie 2000; El Samad et al. 2005; Gillespie 2007; Resat et al. 2009).
We begin by representing the state of the system as a function of time with $Z(t) = [Z_1(t), \ldots, Z_n(t)]$, where $Z_i(t)$ represents the number of molecules of species i. Capital letters are used to emphasize the stochastic nature of the model; the $Z_i$'s are random variables. A specific instantiation of the system is represented by lowercase letters; i.e., $z = [z_1, \ldots, z_n]$. The system state can be altered by the firing of any of p reactions; each reaction changes the state by $v_k = [v_{1k}, \ldots, v_{nk}]$, $1 \le k \le p$, where $v_{ik}$ represents the change in the number of molecules of species i after the completion of reaction k. Each reaction can be characterized by its propensity function $a_k(z)$, defined so that $a_k(z)\,dt$ is equivalent to the probability that reaction k will occur once in the system in the infinitesimal time interval $[t, t + dt]$ given $Z(t) = z$. Given that any instantiation of the system is random, it would be useful to have a probabilistic expression for the time evolution of the system $P(z, t \mid z_0, t_0)$ (probability that the system is in state z at time t, given that it is in state $z_0$ at time $t_0$). Using the above quantities and the laws of probability, this can be derived as follows:

$$P(z, t + dt \mid z_0, t_0) = P(z, t \mid z_0, t_0) \times \left[ 1 - \sum_{k=1}^{p} a_k(z)\,dt \right] + \sum_{k=1}^{p} P(z - v_k, t \mid z_0, t_0) \times a_k(z - v_k)\,dt. \tag{2.10}$$

After rearranging and taking the limit as $dt \to 0$,

$$\frac{\partial P(z, t \mid z_0, t_0)}{\partial t} = \sum_{k=1}^{p} \big( a_k(z - v_k)\, P(z - v_k, t \mid z_0, t_0) - a_k(z)\, P(z, t \mid z_0, t_0) \big). \tag{2.11}$$
Equation (2.11) is called the chemical master equation (CME). Since the possible values of z are discretely varying, the CME is actually a set of coupled ODEs that is nearly as large as the number of possible combinations of molecules in the system. Consequently, except for very simple systems, these equations are not solvable analytically and numerical solutions are usually intractable. Progress has been made in developing approximation schemes for numerically solving the CME (Munsky and Khammash 2006; Deuflhard et al. 2007; Jahnke and Huisinga 2008), but most applications turn to Monte Carlo methods to sample from the distribution $P(z, t \mid z_0, t_0)$. The stochastic simulation algorithm (SSA), also known as the Gillespie algorithm, simulates reactions one at a time as they occur (Gillespie 1977). This approach has been widely used in stochastic modeling of biological networks, in part because it produces a draw from the exact probability distribution that solves the CME. A few example applications of the SSA include McAdams and Arkin (1997), Samoilov et al. (2005), Arkin et al. (1998), Weinberger et al. (2005), El-Samad et al. El-Samad and Khammash (2006), and Wang et al. (2006). For systems with large numbers of molecules, use of the SSA becomes very computationally intensive. An efficient approximation known as tau-leaping was developed to instantiate multiple reactions that occur during the elapse of a preselected time τ (Gillespie 2001). This gives the following (approximate) update equation for the system state, given that Z(t) = z at time t:

$$Z(t + \tau) \approx z + \sum_{k=1}^{p} v_k \times \mathrm{Pois}_k\big(a_k(z)\,\tau\big), \tag{2.12}$$
where the $\mathrm{Pois}_k(\lambda)$ are i.i.d. Poisson random variables with mean (and variance) λ. Subsequent work in approximate methods has led to further speedup of the SSA; some examples include Rao and Arkin (2003), Rathinam and El Samad (2007), Cao et al. (2005, 2007), and Cao and Petzold (2008). Discrete stochastic kinetic models are the most common stochastic approach for modeling biological networks, and they have previously been reviewed extensively (Li et al. 2008; Higham 2008; Resat et al. 2009; Wilkinson 2006, 2009). As our focus in this chapter is continuous differential equation models, we will not discuss them further. Under certain conditions (see Section 2.3.8.1), we can approximate a Poisson random variable with one that is normally distributed (with the same mean and variance), yielding

$$\begin{aligned}
Z(t + \tau) &\approx z + \sum_{k=1}^{p} v_k \times N_k\big(a_k(z)\,\tau,\; a_k(z)\,\tau\big) \\
&= z + \sum_{k=1}^{p} v_k \Big( a_k(z)\,\tau + \sqrt{a_k(z)} \times N_k(0, \tau) \Big) \\
&= z + \sum_{k=1}^{p} v_k\, a_k(z)\,\tau + \sum_{k=1}^{p} v_k \sqrt{a_k(z)} \times N_k(0, \tau),
\end{aligned} \tag{2.13}$$
where $N_k(\mu, \sigma^2)$ is a normally distributed random variable with mean μ and variance σ². Equation (2.13) is a form of the Langevin leaping formula, where the discretely valued Z(t) has become continuously valued due to the normal approximation. To convert (2.13) to a differential equation, we note that the increment $B_k(t + \tau) - B_k(t)$, where $B_k$ is a standard Brownian motion, is normally distributed with mean 0 and variance τ (Karlin and Taylor 1975). Thus, by rewriting τ as dt and rearranging we have the approximate relation (given that Z(t) = z):

$$\begin{aligned}
\frac{dZ(t)}{dt} &= \sum_{k=1}^{p} v_k\, a_k(Z(t)) + \sum_{k=1}^{p} v_k \sqrt{a_k(Z(t))} \times \frac{dB_k(t)}{dt} \\
&\approx \sum_{k=1}^{p} v_k\, a_k(Z(t)) + \sum_{k=1}^{p} v_k \sqrt{a_k(Z(t))} \times W_k(t),
\end{aligned} \tag{2.14}$$
where Wk (t) is a white noise process, which, though not well-defined in an ordinary calculus sense, can be characterized using stochastic calculus and acts as a useful approximation for naturally occurring noise (Karlin and Taylor 1981). Equation (2.14) is an SDE known as the chemical Langevin equation (CLE).
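The three simulation strategies described above (the exact SSA, the tau-leaping update of Eq. (2.12), and the Langevin leaping formula of Eq. (2.13)) can be compared on a small example. The sketch below uses a hypothetical birth-death network with two reactions, production at constant propensity K_B and degradation at propensity K_D z; the rate constants, leap size, and replicate counts are assumptions chosen so that the example runs quickly. Clipping the state at zero inside the propensities is a pragmatic guard added here and is not part of the formulas above.

```python
import numpy as np

K_B, K_D = 100.0, 1.0               # hypothetical rate constants
NU = np.array([1.0, -1.0])          # v_k: state change caused by each reaction

def ssa(t_end, rng, z0=0.0):
    """Exact stochastic simulation algorithm; returns Z(t_end) for one trajectory."""
    t, z = 0.0, z0
    while True:
        a = (K_B, K_D * z)                          # propensities a_k(z)
        a0 = a[0] + a[1]
        t += rng.exponential(1.0 / a0)              # waiting time to the next reaction
        if t > t_end:
            return z
        k = 0 if rng.random() * a0 < a[0] else 1    # index of the reaction that fires
        z += NU[k]

def tau_leap(t_end, tau, n_rep, rng, z0=0.0):
    """Tau-leaping update of Eq. (2.12), vectorized over n_rep replicates."""
    z = np.full(n_rep, z0)
    for _ in range(int(round(t_end / tau))):
        a = np.array([np.full(n_rep, K_B), K_D * np.maximum(z, 0.0)])
        fires = rng.poisson(a * tau)                # Pois_k(a_k(z) tau)
        z = z + NU @ fires
    return z

def cle(t_end, tau, n_rep, rng, z0=0.0):
    """Langevin leaping formula of Eq. (2.13), vectorized over n_rep replicates."""
    z = np.full(n_rep, z0)
    for _ in range(int(round(t_end / tau))):
        a = np.array([np.full(n_rep, K_B), K_D * np.maximum(z, 0.0)])
        noise = rng.normal(0.0, np.sqrt(tau), size=a.shape)   # N_k(0, tau)
        z = z + NU @ (a * tau + np.sqrt(a) * noise)
    return z

def compare(t_end=5.0, tau=0.05, seed=0):
    """Mean and variance of Z(t_end) under each method."""
    rng = np.random.default_rng(seed)
    samples = {
        "ssa": np.array([ssa(t_end, rng) for _ in range(200)]),
        "tau_leap": tau_leap(t_end, tau, 2000, rng),
        "cle": cle(t_end, tau, 2000, rng),
    }
    return {name: (s.mean(), s.var()) for name, s in samples.items()}

for name, (mean, var) in compare().items():
    print(f"{name:9s} mean = {mean:7.2f}  variance = {var:7.2f}")
```

For this network the stationary distribution of the exact process is Poisson with mean K_B / K_D, so all three methods should report a mean and variance near 100 once the initial transient has decayed. The SSA loop executes one iteration per reaction event, whereas the two leaping methods take a fixed number of steps regardless of the molecule count, which is the source of their efficiency for large systems.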
It is useful to make one final approximation to establish a connection between the CLE and the RRE (Section 2.3.6). If the volume of the system and the number of molecules of each species $Z_i(t)$ approach infinity in such a way that the concentration of each species remains constant (thermodynamic limit), the reaction propensities grow linearly with the system size (Gillespie 2007; El Samad et al. 2005). Referring to (2.14), we see that this will result in the second (noise) term becoming negligibly small with respect to the first (drift) term. If we set the noise term to zero and divide each state variable by the system volume, random molecule counts ($Z_i(t)$'s) become deterministic concentrations ($y_i(t)$'s) and we have

$$\frac{dy(t)}{dt} \approx \sum_{k=1}^{p} v_k\, a_k(y(t)). \tag{2.15}$$
This is a form of the RRE (1), which we have derived using principles of stochastic chemical kinetics.

2.3.8.1 Assumptions of SDE Biological Network Models

We now review the assumptions needed to derive and subsequently apply SDEs to modeling biological systems. They include

• As with ODE models, reactions occur in a homogeneous, well-stirred volume.
• There exist small-enough values for leap times τ such that, as the system evolves with each time leap, no propensity function changes by an appreciable amount as a result of executed reactions (permits approximation in Eq. (2.12)).
• The τ times from above are also large enough that the expected number of occurrences of each reaction during each leap is much greater than one (permits approximation in Eq. (2.13)).

The latter two assumptions also permit the replacement of τ with dt (macroscopic infinitesimal) in Eq. (2.14) (Gillespie 2002). These assumptions can usually be simultaneously satisfied if the numbers of molecules of each reacting species are sufficiently large (Gillespie 2002, 2007; El Samad et al. 2005). The major advantage of working with the CLE is that numerical solutions are much more efficient to generate than when using the Gillespie algorithm (Adalsteinsson et al. 2004); provided the above assumptions are satisfied, the solutions should not differ appreciably.

2.3.8.2 Modern Applications of SDE Models to Biological Networks

Use of SDE models in systems biology is even less common than that of PDEs, with most examples emerging in just the last few years. We thus focus immediately on recent applications to modeling biological networks, highlighting one example per dynamical modeling task.
Bayesian Calibration of an SDE Model from Noisy Time Course Measurements

One challenge in calibrating a dynamical model with time course measurements is the often coarse time resolution of the data. This is particularly true for SDE models, where approximations that assist in learning parameter values from data require measurements collected at uniformly closely spaced time points. Heron et al. illustrate this difficulty with an SDE model of the Hes1 autoregulatory network, which takes the following (differential) form (Heron et al. 2007):

$$\begin{aligned}
dM(t) &= \left[ \frac{s_M v_1}{1 + \big(D[P(t)] / s_P k_1\big)^n} - v_2 M(t) \right] dt + \sqrt{ s_M \left( \frac{s_M v_1}{1 + \big(D[P(t)] / s_P k_1\big)^n} + v_2 M(t) \right) }\; dB_M(t), \\
dP(t) &= \left[ \frac{s_P}{s_M} v_3 M(t) - v_4 P(t) \right] dt + \sqrt{ s_P \left( \frac{s_P}{s_M} v_3 M(t) + v_4 P(t) \right) }\; dB_P(t),
\end{aligned} \tag{2.16}$$
where M(t) and P(t) represent the relative quantities of Hes1 mRNA and protein, respectively; D[P(t)] represents a delay term acting between transcription and translation; $dB_M(t)$ and $dB_P(t)$ are independent infinitesimal increments of one-dimensional Brownian motion; $\{v_1, v_2, v_3, v_4, n, k_1\}$ are reaction parameters; and $s_M$ and $s_P$ are scaling factors. In order to learn parameter values from the data, an objective function is required that describes how well a given set of parameter values fits the data. As SDE models are by nature probabilistic, a likelihood-based approach is a natural choice. With small enough time steps $\Delta t_i = t_{i+1} - t_i$, the increments $M(t_{i+1}) - M(t_i)$ and $P(t_{i+1}) - P(t_i)$ are approximately normally distributed with means and variances derived from Eq. (2.16). This property allows the calculation of a likelihood function for parameter values given discrete data. When the time course data do not differ by small enough time steps (as is true for the Hes1 data used in Heron et al. 2007), this likelihood function cannot be used directly. To compensate, the authors employ a latent data-based Bayesian approach, whose application to systems biology was first described in Golightly and Wilkinson (2005, 2006). The method infers unobserved (latent) data at finer time points than those measured, the presence of which satisfies the assumptions of the likelihood function. MCMC methods are used to sample distributions of the latent data and parameters simultaneously, and parameter values can be chosen which maximize the Bayesian posterior probability function. Heron et al. applied this method to sparse Hes1 data from Hirata et al. (2002) and obtained convergent posterior distributions for all parameters except the scaling factors. They simulated the model with a wide range of high-likelihood parameter values and consistently recovered the cyclicity observed in the experimental data. Thus, behavior of the calibrated Hes1 model appears to be robust.
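The Gaussian-increment likelihood described above can be written down compactly for a one-dimensional SDE. The sketch below is not the Hes1 model: it uses a hypothetical production/degradation process without the delay or repression terms of Eq. (2.16), with drift v1 - v2 M and squared diffusion v1 + v2 M. It simulates a finely sampled path and evaluates the log-likelihood under the generating parameters and under two incorrect parameter sets. The latent-data augmentation and MCMC sampling used by Heron et al. are not implemented here; all function names and values are assumptions.

```python
import numpy as np

def drift(m, theta):
    v1, v2 = theta
    return v1 - v2 * m

def diffusion(m, theta):
    v1, v2 = theta
    return np.sqrt(v1 + v2 * np.maximum(m, 0.0))

def simulate(theta, m0, dt, n_steps, rng):
    """Euler-Maruyama path of dM = drift dt + diffusion dB (toy model, no delay)."""
    m = np.empty(n_steps + 1)
    m[0] = m0
    for i in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt))
        m[i + 1] = m[i] + drift(m[i], theta) * dt + diffusion(m[i], theta) * dB
    return m

def euler_loglik(m, dt, theta):
    """Log-likelihood treating each increment as Gaussian (reasonable only for small dt)."""
    prev, nxt = m[:-1], m[1:]
    mean = prev + drift(prev, theta) * dt
    var = diffusion(prev, theta) ** 2 * dt
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (nxt - mean) ** 2 / var)

rng = np.random.default_rng(0)
true_theta = (50.0, 1.0)
dt = 0.01
path = simulate(true_theta, m0=50.0, dt=dt, n_steps=5000, rng=rng)

for theta in [true_theta, (100.0, 1.0), (50.0, 2.0)]:
    print(theta, euler_loglik(path, dt, theta))
```

If the same path were subsampled so that consecutive observations were far apart, the Gaussian approximation to each increment would degrade, which is the situation the latent-data approach is designed to address.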
In an ideal world, experimental data would always be available at sufficient resolution for straightforward model calibration. As modeling is used to study more and more complex signaling networks, this ideal becomes less realistic. In the increasingly likely scenario where experimental data are sparse, probabilistic methods like the one described in this study will be useful for robust model calibration.
Comparison of a Stochastic Cell Cycle Model with Experimental Data

The eukaryotic cell cycle has been a popular choice for systems biology modeling due to its fundamental role in development and reproduction. During this ordered sequence of molecular events, a cell duplicates its components and divides them between two daughter cells. In the fission yeast Schizosaccharomyces pombe, fundamental players in this process include the M (mitotic)-phase promoting factor (MPF), its negative regulator Wee1, and a collection of cell division cycle (Cdc) proteins (e.g., Cdc25, which leads to increased levels of MPF). Over the years, increasingly elaborate (and realistic) models of this system have been built, most of them deterministic. In Steuer (2004), the authors adapt a previously published yeast ODE model (Novak et al. 2001) to include stochastic influences. Written as Langevin-type SDEs with multiplicative noise, the equations take the following form:

$$
\frac{dx_i}{dt} = f_i(\cdots) + \sqrt{2D_i}\, x_i\, W_i(t), \tag{2.17}
$$
where $x_i$ represents the concentration of a single species, $f_i(\cdots)$ comes from the original deterministic equation, $D_i$ is a constant denoting the noise amplitude, and $W_i(t)$ is white noise. The second term on the RHS of Eq. (2.17) is not derived from elementary biochemical reactions; rather, it constitutes a general noise term that includes both intrinsic and extrinsic sources. The amplitude of the noise is controlled by the parameter $D_i$.

After parameterizing their model with the values used in the original work (and a small but constant value for all $D_i$), the authors run simulations of both the original ODE and their SDE system. They compare the resulting behavior to well-known experimental observations of cell cycle time and cell division size distributions. Both models reproduce the negative correlation seen in wild-type cells between cycle time and mass at birth (representing a form of cell size control). The authors then evaluate simulations of a wee1− cdc25 double mutant, whose behavior has been shown experimentally to result in three to four clusters in the cycle time vs. mass at birth plot. The deterministic model does not generate clustered behavior, whereas the stochastic model consistently does. Mechanistically, this is due to the occasional inability of MPF levels to reach the threshold needed for entry into mitosis, as a result of Cdc25 absence and stochastic fluctuations in the activity of Pyp3 (a weaker MPF activator). These fluctuations can lead to a random number of G2 phase resets, which in turn lead to varying but quantized cycle times.
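The comparison between an ODE and its Langevin counterpart can be sketched numerically. The Python example below integrates a single-species equation of the form of Eq. (2.17) with the Euler-Maruyama scheme, assuming the noise term has the standard multiplicative form $\sqrt{2D}\,x\,W(t)$ and an Itô interpretation (the text above does not specify the interpretation). The logistic function `f`, the noise amplitude, and the step size are hypothetical choices for illustration only; they are not the cell cycle kinetics of Novak et al. (2001).

```python
import math
import random


def euler_maruyama(f, x0, noise_amplitude, dt, n_steps, rng):
    """Integrate dx/dt = f(x) + sqrt(2*D) * x * W(t) with the Euler-Maruyama
    scheme (Ito interpretation assumed). D = 0 recovers forward Euler for the ODE."""
    x = x0
    path = [x]
    scale = math.sqrt(2.0 * noise_amplitude)
    for _ in range(n_steps):
        dW = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Brownian increment over dt
        x = x + f(x) * dt + scale * x * dW
        path.append(x)
    return path


def f(x):
    """Hypothetical deterministic kinetics: logistic relaxation toward x = 1."""
    return x * (1.0 - x)


rng = random.Random(0)
ode_path = euler_maruyama(f, x0=0.1, noise_amplitude=0.0, dt=0.01, n_steps=2000, rng=rng)
sde_path = euler_maruyama(f, x0=0.1, noise_amplitude=0.005, dt=0.01, n_steps=2000, rng=rng)
print("deterministic endpoint:", ode_path[-1])
print("stochastic endpoint:   ", sde_path[-1])
```

Because the noise multiplies $x$, fluctuations shrink as the concentration approaches zero, which is one practical reason for choosing a multiplicative rather than an additive noise term.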
58
B.J. Daigle et al.
Thus, an SDE depiction of the cell cycle appears to be more realistic than a deterministic model. The authors go on to characterize the cell size control checkpoint in their stochastic model, identifying noise-induced oscillations that occur at sufficiently (but not too) large noise amplitudes (so-called “coherence resonance”). These oscillations have not yet been detected experimentally, suggesting the existence of additional regulatory mechanisms that stabilize this behavior.

Sensitivity of Cell Cycle Behavior to Intrinsic vs. Extrinsic Noise

More recently, Yi et al. investigated the separate effects of intrinsic and extrinsic noise on the S. pombe cell cycle (Yi et al. 2008). Unlike Steuer (2004), Yi et al. derived a CLE from elementary reactions in the spirit of Section 2.3.8. Thus, the noise modeled in their SDE system is exclusively intrinsic in origin. The authors parameterized their model in the same manner as above and numerically simulated the cell cycle behavior of the wee1− cdc25 double mutant. With a system size small enough to exhibit fluctuations, the results look quite similar to those obtained from the SDE in Eq. (2.17) (which combines intrinsic and extrinsic noise), with visible clustering in the cycle time vs. mass at birth plot. To test the sensitivity of system behavior to noise of extrinsic origin, the authors added noise to the parameter governing Pyp3 activation of MPF (see above), turning it into a random variable. They inserted this parameter into the ODE model from Novak et al. (2001) and again simulated the double mutant strain. Clustering was still observed in the cycle time vs. mass plot, although the number of cells with long cycle times was markedly reduced. Finally, the authors incorporated this parametric noise into their SDE model to explicitly test the combined effects of intrinsic and extrinsic noise. Once again, clustering in the cycle time plot was present, with the number of cells having longer cycle times similar to that in the original intrinsic-noise simulation.

In this study, the behavior of the fission yeast cell cycle model demonstrates sensitivity to the source of noise, with intrinsic noise giving rise to more cells with long cell cycle times. The precise cause of this difference was not investigated, and further experiments would be expected to clarify the relative importance of the two noise sources in this system. It is noteworthy that both sources of noise led to behavior that is qualitatively different from that of a deterministic model (and similar to experimental observations), which highlights the importance of modeling stochastic effects. Though most biological networks are subject to both intrinsic and extrinsic noise, modeling approaches that separate the two can be useful for understanding the mechanisms and consequences of noise generation.

Oscillatory Behavior Due to Coherence Resonance in a Stochastic Model of Circadian Rhythm

As mentioned above, the phenomenon of coherence resonance (CR) describes emergent behavior (e.g., oscillations) due to an optimal amount of noise present in the
system. In CR, noise amplitudes lower or higher than the optimum diminish the oscillatory behavior. Yi et al. (2006) characterize this phenomenon as a function of both intrinsic and extrinsic noise in an SDE model of the Drosophila circadian oscillator. Their model includes two proteins, PER and dCLOCK, which combine to form both negative and positive feedback loops. Exposure to light causes the degradation of PER, which endows the system with a circadian rhythm of 24 h. The authors derive a CLE from the elementary chemical reactions outlined in Smolen et al. (2002), resulting in a stochastic model with intrinsic noise. The relevant SDEs are as follows:

$$
\begin{aligned}
\frac{dP(t)}{dt} &= v_{sp}\,\frac{L_{free}(t-\tau_1)}{K_1 + L_{free}(t-\tau_1)} - k_{dp}P(t)
 + \frac{1}{\sqrt{V}}\left[\sqrt{v_{sp}\,\frac{L_{free}(t-\tau_1)}{K_1 + L_{free}(t-\tau_1)}}\;W_1(t) - \sqrt{k_{dp}P(t)}\;W_2(t)\right],\\
\frac{dL(t)}{dt} &= v_{sc}\,\frac{K_2}{K_2 + L_{free}(t-\tau_2)} - k_{dc}L(t)
 + \frac{1}{\sqrt{V}}\left[\sqrt{v_{sc}\,\frac{K_2}{K_2 + L_{free}(t-\tau_2)}}\;W_3(t) - \sqrt{k_{dc}L(t)}\;W_4(t)\right],
\end{aligned}
\tag{2.18}
$$

where $P(t)$ and $L(t)$ represent the concentrations of the PER and dCLOCK proteins, respectively; $L_{free} = \max(L - P, 0)$; $\tau_1$ and $\tau_2$ are time delays; $v_{sp}$, $v_{sc}$, $K_1$, $K_2$, $k_{dp}$, and $k_{dc}$ are kinetic parameters; $V$ is the system volume; and $W_1(t), \ldots, W_4(t)$ are independent white noise processes.

The authors parameterized the model according to Smolen et al. (2002), and they carried out simulations with $k_{dp} = 2.85$ h$^{-1}$ (the light-induced degradation rate of PER), which in the deterministic system induces a non-oscillatory steady state. By varying the system volume $V$, Yi et al. could control the strength of the intrinsic noise (the noise terms in Eq. (2.18) scale as $1/\sqrt{V}$) and, as a result, the amplitude of noise-induced oscillations. Oscillations (with period ∼24 h) were most clearly observed for a moderate system size. The authors created a metric β, which measures the strength of CR oscillation in terms of signal-to-noise ratio (SNR), and they found that optimal oscillatory behavior resulted when $V = 500$.

Yi et al. then added extrinsic noise to the system by making $k_{dp}$ a random variable with a mean of 2.85 h$^{-1}$. They used a parameter $D$ to control the strength of the noise, and they studied the behavior of the modified SDE as a function of both $V$ and $D$. They found that the highest values of β were achieved when $V$ was very large and $D \approx 0.04$, which corresponds to no intrinsic noise and an optimal level of extrinsic noise. This value of β was roughly 10-fold higher than the maximum achieved above with only intrinsic noise in the system. In general, increasing levels of intrinsic noise in a system already under the influence of extrinsic noise
reduced oscillatory behavior, whereas increasing extrinsic noise in a system already exhibiting intrinsic noise-induced CR greatly increased the oscillations.

This study provides an initial characterization of noise-induced oscillations that arise in a well-characterized biological system. Future work will include uncovering the mechanisms involved in the differential sensitivity of CR to noise sources. In addition, as the intrinsic and extrinsic noise sources in the above model were simulated as independent, the authors plan to study the potential coupling effects of the two types of noise. Such effects would be expected to occur in biological systems where the number of reactant molecules is very small.

Prediction of Noise-Induced Bistability in an Enzymatic Futile Cycle

Enzymatic futile cycles, a ubiquitous control mechanism in biological systems, consist of two reactions: conversion of a substrate to a product via a forward enzyme and conversion of the product back to the substrate via a reverse enzyme. This motif is present in diverse signaling processes including MAPK cascades, cell division cycles, and stress response pathways. Deterministic models of futile cycles have demonstrated that they can act both as molecular switches that convert continuous input signals to binary outputs and as signal amplifiers. To characterize the effects of noise on enzymatic futile cycles, Samoilov et al. developed an SDE model that adds an extrinsic noise term to the deterministic ODE, yielding the following differential form (Samoilov et al. 2005):

$$
dX^{*} = \left[\frac{k_{+}E_{+}X}{K_{+} + X} - \frac{k_{-}E_{-}(X_0 - X)}{K_{-} + X_0 - X}\right]dt + \frac{\sigma_{+}E_{+}^{p}\,k_{+}X}{K_{+} + X}\,dB(t), \tag{2.19}
$$

where $X$ and $X^{*}$ represent the concentrations of the substrate and product, respectively; $X_0 = X + X^{*}$ is the total concentration; $E_{+}$ and $E_{-}$ are the concentrations of the forward and reverse enzymes, respectively; $k_{+}$ and $k_{-}$ are reaction rate constants; $K_{+}$ and $K_{-}$ are Michaelis–Menten constants; $\sigma_{+}$ and $p$ parameterize the extrinsic noise; and $dB(t)$ is an infinitesimal increment of Brownian motion. The authors solve for the stationary-state response curve $R(\cdot)$, which gives

$$
R(X_{ss}, E_{+}; E_{-}) = E_{+} - \frac{k_{-}E_{-}(X_0 - X_{ss})(K_{+} + X_{ss})}{k_{+}X_{ss}(K_{-} + X_0 - X_{ss})} + \frac{\sigma_{+}E_{+}^{p}\,k_{+}K_{+}}{(K_{+} + X_{ss})^{2}} = 0, \tag{2.20}
$$

where $X_{ss}$ is the steady-state concentration of $X$. This equation is fourth order in $X_{ss}$, which allows for a bistable solution (unlike the deterministic equivalent, which lacks the noise-dependent term, i.e., the second fraction in Eq. (2.20)). Thus, a prediction of the SDE model is that $X_{ss}$ can be multivalued. Further analysis by the authors suggests that the onset of bistability occurs with $20 \le E_{+} \le 30$, $p > 0.75$, and $\sigma_{+} \approx 20\%$.

Samoilov et al. tested their bistability predictions by constructing a discrete stochastic model consisting of elementary biochemical reactions. Stochasticity in this system arises only through fluctuations in the individual components. Using the
Gillespie algorithm, the authors simulated trajectories of the model and confirmed the above predictions on bistability, suggesting that the form of extrinsic noise in the SDE formalism is sufficiently realistic to predict system behavior. Ideally, experimental evidence would corroborate the existence of noise-induced bistability in a real-world enzymatic futile cycle. Unfortunately, to our knowledge no such evidence has yet been discovered, perhaps due in part to the difficulty of separating noise-induced behavior from measurement error. Nonetheless, as the authors suggest, given the ubiquity of enzymatic futile cycles in nature “it is reasonable to assume that such a behavior is exploited in at least some cellular systems.”

2.3.8.3 Outstanding Challenges in SDE Modeling

SDE modeling of a signaling network requires the existence of a timestep τ that acts as a macroscopic infinitesimal (Section 2.3.8.1). If this assumption is violated, more computationally intensive (but more accurate) discrete stochastic models must be used. Thus, an active area of research is the development of modeling frameworks that combine features of both model types, partitioning the system into discrete and continuous components as necessary (Bentele and Eils 2005; Salis and Kaznessis 2005).

Another challenge arises when using probabilistic or MCMC techniques for SDE model calibration, as in Heron et al. (2007). With large enough models, these methods become computationally intractable, suggesting the need for a more efficient strategy. One approach, borrowed from weather and climate modeling, involves creating an approximate surrogate of the model called an emulator. Parameter estimation can then be performed more cheaply on the emulator than on the full model (Wilkinson 2009). Though few applications of this method exist so far (Henderson et al. 2009 demonstrate one), calibration approaches like these will be necessary as biological models increase in complexity.
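The emulator idea can be illustrated with a deliberately simple Python sketch: the full model is run at a handful of design points, a cheap surrogate is fitted to those runs, and calibration is then carried out on the surrogate alone. Everything here is an assumption made for illustration: `expensive_model` is a hypothetical stand-in for a costly simulator, a piecewise-linear interpolant stands in for the statistical emulators used in practice, and a grid search stands in for a proper Bayesian calibration. It is not the method of Wilkinson (2009) or Henderson et al. (2009).

```python
def expensive_model(theta, n_steps=5000, dt=1e-3):
    """Stand-in for a costly simulator: integrates the hypothetical kinetics
    dx/dt = theta / (1 + x) - x from x = 0 and returns the final value."""
    x = 0.0
    for _ in range(n_steps):
        x += (theta / (1.0 + x) - x) * dt
    return x


def build_emulator(design, outputs):
    """Return a piecewise-linear interpolant through the (sorted) design points."""
    def emulator(theta):
        # Find the design interval that brackets theta.
        i = 0
        while i < len(design) - 2 and theta > design[i + 1]:
            i += 1
        w = (theta - design[i]) / (design[i + 1] - design[i])
        return (1.0 - w) * outputs[i] + w * outputs[i + 1]
    return emulator


# 1. Run the full model at a small number of design points.
design = [0.5 * k for k in range(13)]            # theta = 0.0, 0.5, ..., 6.0
outputs = [expensive_model(th) for th in design]
emulator = build_emulator(design, outputs)

# 2. Calibrate against a synthetic observation using only the emulator.
observed = expensive_model(2.3)                  # pretend measurement, true theta = 2.3
candidates = [k / 100.0 for k in range(601)]     # dense grid, cheap to evaluate
theta_hat = min(candidates, key=lambda th: (emulator(th) - observed) ** 2)
print("calibrated theta:", theta_hat)
```

The full model is evaluated 13 times to build the surrogate, whereas the calibration step queries the surrogate 601 times; the saving grows with the cost of the simulator and the number of candidate parameter values.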
2.3.9 Relevant Software

Several excellent reviews have been written detailing available software packages for constructing and simulating ODE, PDE, and SDE models of biological networks. Rather than reproduce that information here, we refer the reader to these articles: Gilbert et al. (2006); Resat et al. (2009); You (2004). Of particular note is SBML (Hucka et al. 2003), a data format for representing systems biology models; data files for many of the models discussed in this review are available at http://www.sbml.org.
2.3.10 Hybrid Dynamical Models of Biological Systems

As model systems continue to grow in size and complexity, hybrid models will become more important. A hybrid approach is computationally advantageous in that it can limit expensive procedures (e.g., stochastic versus deterministic modeling) to
system subsets where those procedures are necessary. The remaining parts of the system can be evaluated using computationally cheaper methods without appreciable losses in overall accuracy. One example of a hybrid dynamical model was given in Section 2.3.8.3, combining a continuous SDE model with a discrete stochastic model. Another hybrid scheme combines spatial with stochastic modeling, creating a so-called “spatial Langevin” system (Andrews and Arkin 2006; Elf and Ehrenberg 2004). This provides arguably the most realistic modeling framework short of molecular dynamics simulations. Several of the software packages listed above allow implementation of multiple types of hybrid modeling schemes.

Hybrid modeling can also be used to incorporate coarser-grained, non-dynamical approaches like flux-balance analysis (FBA). FBA uses the steady-state assumption to model reaction fluxes as a system of linear equations. Because the system is assumed to be at steady state, these equations are algebraic and can be solved efficiently using linear programming methods. Covert et al. created such a hybrid model of E. coli, which combines FBA with Boolean and ODE systems (Covert et al. 2008). As the systems biology community moves toward whole cell models, which in eukaryotic organisms could contain $> 10^{12}$ reactions (Resat et al. 2009), hybrid models will be essential for their efficient simulation and characterization.
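As a concrete reference point for the discrete stochastic component that such hybrid schemes reserve for low-copy-number subsystems, the following Python sketch implements the direct method of the Gillespie algorithm (Gillespie 1977) for a reduced futile cycle. The model is an assumption made for brevity: the forward and reverse conversions are treated as first-order steps with hypothetical rate constants `k_f` and `k_r`, rather than the explicit enzyme kinetics of Samoilov et al. (2005).

```python
import random


def gillespie_futile_cycle(x0, xstar0, k_f, k_r, t_end, rng):
    """Direct-method stochastic simulation of X -> X* (propensity k_f * X)
    and X* -> X (propensity k_r * X*). Returns a list of (time, X, X*)."""
    t, x, xs = 0.0, x0, xstar0
    trajectory = [(t, x, xs)]
    while True:
        a_forward = k_f * x
        a_reverse = k_r * xs
        a_total = a_forward + a_reverse
        if a_total == 0.0:
            break                       # no reaction can fire
        t += rng.expovariate(a_total)   # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to propensity.
        if rng.random() * a_total < a_forward:
            x, xs = x - 1, xs + 1
        else:
            x, xs = x + 1, xs - 1
        trajectory.append((t, x, xs))
    return trajectory


rng = random.Random(42)
trajectory = gillespie_futile_cycle(x0=100, xstar0=0, k_f=1.0, k_r=0.5, t_end=10.0, rng=rng)
final_time, final_x, final_xstar = trajectory[-1]
print("events simulated:", len(trajectory) - 1)
print("final state (X, X*):", final_x, final_xstar)
```

Every reaction event is simulated individually, so the cost grows with the total propensity; this is the expense that a hybrid scheme avoids by treating abundant species with a continuous approximation.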
2.4 Conclusions

With thousands of sequenced genomes (Wheeler et al. 2007) and hundreds of functional genomic data sets (Barrett et al. 2005), the future of systems biology is bright. In static modeling, the supervised learning approach, in which high-throughput data are compared against a small training set of curated knowledge, has proven to be the most fruitful data integration strategy to date. In particular, supervised predictions of function and interaction from multiple data sets are more robust than those derived from individual data sets and have provided a foundation for recent work on network alignment and systematic validation. The primary challenges for static modeling are to (1) decide on a set of reference networks and (2) tie every predicted node and edge in such networks to a gold-standard experimental test, such as co-immunoprecipitation for confirmation of physical protein interactions. These steps will be crucial to bringing network predictions to the same level of confidence and widespread utilization as gene predictions.

For dynamic models, the core problem is that the area will remain data starved (Albeck et al. 2006) until high-throughput methods for the determination of rate constants (Famili et al. 2005) and spatial substructure (Foster et al. 2006; Schubert et al. 2006) become commonplace. Recent efforts at compiling and curating a number of biological constants (Milo et al. 2009) and developing a repository of systems biology models (Hucka et al. 2003) are important steps toward establishing a repository of “consensus constants.”
Ultimately, the relevance of both kinds of models is directly proportional to their ability to predict experimental results. In particular, a framework (likely Bayesian) for smoothly incorporating new measurements into updated parameter estimates will probably be of central importance. We also believe that the incorporation of tools from system identification (Nelles 2000) and parameter estimation will prove useful in the years to come. In short, now that the mathematical aspects of the field have matured, we expect experimental testability to increasingly become the focus of the field, with models specifically formulated to be updated as new experimental data arrive.

Acknowledgements

We thank Russ Altman for helpful discussions.
References

Abecasis G, Tam P, Bustamante C, et al (2007) Human genome variation 2006: emerging views on structural variation and large-scale SNP analysis. Nat Genet 39(2):153–155
Adalsteinsson D, McMillen D, Elston T (2004) Biochemical network stochastic simulator (BioNetS): software for stochastic modeling of biochemical networks. BMC Bioinformatics 5
Aerts S, Lambrechts D, Maity S, et al (2006) Gene prioritization through genomic data fusion. Nat Biotechnol 24(5):537–544
Albeck JG, MacBeath G, White FM, et al (2006) Collecting and organizing systematic sets of protein data. Nat Rev Mol Cell Biol 7(11):803–812
Albeck JG, Burke JM, Aldridge BB, et al (2008a) Quantitative analysis of pathways controlling extrinsic apoptosis in single cells. Mol Cell 30(1):11–25
Albeck JG, Burke JM, Spencer SL, et al (2008b) Modeling a snap-action, variable-delay switch controlling extrinsic cell death. PLoS Biol 6(12):2831–2852
Aldridge BB, Burke JM, Lauffenburger DA, et al (2006a) Physicochemical modelling of cell signalling pathways. Nat Cell Biol 8(11):1195–1203
Aldridge BB, Haller G, Sorger PK, et al (2006b) Direct Lyapunov exponent analysis enables parametric study of transient signalling governing cell behaviour. IEE Proceedings Syst Biol 153(6):425–432
Alon U, Surette MG, Barkai N, et al (1999) Robustness in bacterial chemotaxis. Nature 397(6715):168–171
Altman RB, Raychaudhuri S (2001) Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol 11(3):340–347
Amonlirdviman K, Khare N, Tree D, et al (2005) Mathematical modeling of planar cell polarity to understand domineering nonautonomy. Science 307(5708):423–426
Andrews SS, Arkin AP (2006) Simulating cell biology. Curr Biol 16(14):R523–R527
Angeli D, Ferrell J, Sontag E (2004) Detection of multistability, bifurcations, and hysteresis in a large class of biological positive-feedback systems. Proc Natl Acad Sci USA 101(7):1822–1827
Arkin A, Ross J, McAdams H (1998) Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells. Genetics 149(4):1633–1648
Asai R, Taguchi E, Kame Y, et al (1999) Zebrafish Leopard gene as a component of the putative reaction-diffusion system. Mech Dev 89(1–2):87–92
Ashburner M, Ball CA, Blake JA, et al (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
Axelrod JD (2001) Unipolar membrane association of Dishevelled mediates Frizzled planar cell polarity signaling. Genes Dev 15(10):1182–7
Bader GD, Cary MP, Sander C (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34(Database issue)
Baker RE, Gaffney EA, Maini PK (2008) Partial differential equations for self-organization in cellular and developmental biology. Nonlinearity 21(11):R251–R290
Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5(2):101–113
Barrett C, Palsson B (2006) Iterative reconstruction of transcriptional regulatory networks: an algorithmic approach. PLoS Comput Biol 2(5):e52
Barrett T, Suzek TO, Troup DB, et al (2005) NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Res 33(Database issue)
Bastock R, Strutt H, Strutt D (2003) Strabismus is asymmetrically localised and binds to Prickle and Dishevelled during Drosophila planar polarity patterning. Development 130(13):3007–14
Batzoglou S (2005) The many faces of sequence alignment. Brief Bioinform 6(1):6–22
Beckett D, Berners-Lee T (2007) RDF Primer, Turtle Version, www.w3.org/TeamSubmission/turtle. Accessed 31 Aug 2009
Behar M, Hao N, Dohlman HG, et al (2008) Dose-to-duration encoding and signaling beyond saturation in intracellular signaling networks. PLoS Comput Biol 4(10)
Ben-Hur A, Noble WS (2006) Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinform 7 Suppl 1:S2
Benson G (2009) Nucleic Acids Research annual Web Server Issue in 2009. Nucl Acids Res 37(suppl_2):W1–2
Bentele M, Eils R (2005) General stochastic hybrid method for the simulation of chemical reaction processes in cells. Comput Meth Syst Biol 3082:248–251
Berg J, Lassig M (2006) Cross-species analysis of biological networks by Bayesian alignment. Proc Natl Acad Sci USA 103(29):10,967–72
Beyer A, Workman C, Hollunder J, et al (2006) Integrated assessment and prediction of transcription factor binding. PLoS Comput Biol 2(6):e70
Bhalla US, Iyengar R (1999) Emergent properties of networks of biological signaling pathways. Science 283(5400):381–387
Bloom JD, Adami C (2003) Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol Biol 3
Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucl Acids Res 32(suppl_1):D267–270
Bonneau R (2008) Learning biological networks: from modules to dynamics. Nat Chem Biol 4(11):658–64
Bornholdt S (2005) Systems Biology: less is more in modeling large genetic networks. Science 310(5747):449–451
Breitkreutz BJ, Stark C, Tyers M (2003) Osprey: a network visualization system. Genome Biol 4(3):R22
Brewer D, Barenco M, Callard R, et al (2008) Fitting ordinary differential equations to short time course data. Philos Trans R Soc A Math Phys Eng Sci 366(1865):519–544
Brudno M, Do CB, Cooper GM, et al (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13(4):721–731
Cao Y, Petzold L (2008) Slow-scale tau-leaping method. Comput Meth Appl Mech Eng 197(43–44):3472–3479
Cao Y, Gillespie D, Petzold L (2005) The slow-scale stochastic simulation algorithm. J Chem Phys 122(1)
Cao Y, Gillespie DT, Petzold LR (2007) Adaptive explicit-implicit tau-leaping method with automatic tau selection. J Chem Phys 126(22)
Carrera J, Rodrigo G, Jaramillo A (2009) Model-based redesign of global transcription regulation. Nucleic Acids Res 37(5):e38
Champoux JJ (2001) DNA topoisomerases: structure, function, and mechanism. Annu Rev Biochem 70:369–413
Chen WW, Schoeberl B, Jasper PJ, et al (2009) Input-output behavior of ErbB signaling pathways as revealed by a mass action model trained against dynamic data. Mol Syst Biol 5
Chen X, Wu JM, Homischer K, et al (2006) TiProD: the Tissue-specific Promoter Database. Nucleic Acids Res 34(Database issue)
Collins S, Miller K, Maas N, et al (2007) Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 446(7137):806–810
Cornell-Bell AH, Finkbeiner SM, Cooper MS, et al (1990) Glutamate induces calcium waves in cultured astrocytes: long-range glial signaling. Science 247(4941):470–3
Cornish-Bowden A (1979) Fundamentals of Enzyme Kinetics. Butterworths
Covert MW, Knight EM, Reed JL, et al (2004) Integrating high-throughput and computational data elucidates bacterial networks. Nature 429(6987):92–96
Covert MW, Xiao N, Chen TJ, et al (2008) Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics 24(18):2044–50
Dandekar T, Schuster S, Snel B, et al (1999) Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochem J 343 Pt 1:115–124
von Dassow G, Meir E, Munro EM, et al (2000) The segment polarity network is a robust developmental module. Nature 406(6792):188–192
Davidson EH, Rast JP, Oliveri P, et al (2002) A genomic regulatory network for development. Science 295(5560):1669–1678
Mrvar A, Batagelj V (2005) Exploratory social network analysis with Pajek. Cambridge University Press, Cambridge
Deeds EJ, Ashenberg O, Shakhnovich EI (2006) A simple physical model for scaling in protein-protein interaction networks. Proc Natl Acad Sci USA 103(2):311–316
Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2(5)
Demello A (2006) Control and detection of chemical reactions in microfluidic systems. Nature 442(7101):394–402
Deuflhard P, Huisinga W, Jahnke T, et al (2007) Adaptive discrete Galerkin methods applied to the chemical master equation. SIAM J Sci Comp 30(6):2990–3011
Do C, Gross S, Batzoglou S (2006a) CONTRAlign: discriminative training for protein sequence alignment. Proceedings of the tenth annual international conference on computational molecular biology (RECOMB 2006), pp 160–164
Do CB, Woods DA, Batzoglou S (2006b) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14)
Dudley A, Janse D, Tanay A, et al (2005) A global view of pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol 1(1):msb4100,004–E1–msb4100,004–E11
Durbin R, Eddy S, Krogh A, et al (1999) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge
Eilbeck K, Lewis SE, Mungall CJ, et al (2005) The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol 6(5):R44
El-Samad H, Khammash M (2006) Regulated degradation is a mechanism for suppressing stochastic fluctuations in gene regulatory networks. Biophys J 90(10):3749–3761
El-Samad H, Khammash M, Petzold L, et al (2005) Stochastic modelling of gene regulatory networks. Int J Robust Nonlinear Control 15(15):691–711
El-Samad H, Kurata H, Doyle J, et al (2005) Surviving heat shock: Control strategies for robustness and performance. Proc Natl Acad Sci USA 102(8):2736–2741
Elf J, Ehrenberg M (2004) Spontaneous separation of bi-stable biochemical systems into spatial domains of opposite phases. Syst Biol (Stevenage) 1(2):230–236
Ellson J, North S (2007) Graphviz: Graph Visualization Software. www.graphviz.org. Accessed 31 Aug 2009
ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799–816
Eungdamrong N, Iyengar R (2004) Modeling cell signaling networks. Biol Cell 96(5):355–362
Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3):186–194
Ewing B, Hillier L, Wendl MC, et al (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8(3):175–185
Famili I, Mahadevan R, Palsson B (2005) k-Cone Analysis: Determining All Candidate Values for Kinetic Parameters on a Network Scale. Biophys J 88(3):1616–1625
Faure A, Naldi A, Chaouiya C, et al (2006) Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinformatics 22(14):E124–E131
Fernandez J, Hoffmann R, Valencia A (2007) iHOP web services. Nucleic Acids Res
Flaherty P, Giaever G, Kumm J, et al (2005) A latent variable model for chemogenomic profiling. Bioinformatics 21(15):3286–3293
Flaherty P, Radhakrishnan ML, Dinh T, et al (2008) A dual receptor crosstalk model of G-protein-coupled signal transduction. PLoS Comput Biol 4(9)
Flannick J, Novak A, Srinivasan BS, et al (2006) Graemlin: general and robust alignment of multiple large interaction networks. Genome Res 16(9):1169–1181
Forst CV, Schulten K (2001) Phylogenetic analysis of metabolic pathways. J Mol Evol 52(6):471–489
Foster L, de Hoog C, Zhang Y, et al (2006) A Mammalian Organelle Map by Protein Correlation Profiling. Cell 125(1):187–199
Galperin MY, Cochrane GR (2009) Nucleic acids research annual database issue and the NAR online Molecular Biology Database Collection in 2009. Nucl Acids Res 37(suppl_1):D1–4
Gandhi TKB, Zhong J, Mathivanan S, et al (2006) Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 38(3):285–293
Gavin AC, Aloy P, Grandi P, et al (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084):631–636
Giaever G, Chu AM, Ni L, et al (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature 418(6896):387–391
Gilbert D, Fuss H, Gu X, et al (2006) Computational methodologies for modelling, analysis and simulation of signalling networks. Brief Bioinform 7(4):339–353
Gillespie D (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81(25):2340–2361
Gillespie D (2000) The chemical Langevin equation. J Chem Phys 113(1):297–306
Gillespie D (2001) Approximate accelerated stochastic simulation of chemically reacting systems. J Chem Phys 115(4):1716–1733
Gillespie D (2002) The chemical Langevin and Fokker-Planck equations for the reversible isomerization reaction. J Phys Chem A 106(20):5063–5071
Gillespie DT (2007) Stochastic simulation of chemical kinetics. Ann Rev Phys Chem 58:35–55
Golightly A, Wilkinson D (2005) Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics 61(3):781–788
Golightly A, Wilkinson D (2006) Bayesian sequential inference for stochastic kinetic biochemical network models. J Comput Biol 13(3):838–851
Goll J, Uetz P (2006) The elusive yeast interactome. Genome Biol 7(6):223
Goodwin BC (1963) Temporal organization in cells: a dynamic theory of cellular control processes. Academic Press, New York
Graupner S, Wackernagel W (2001) Identification and characterization of novel competence genes comA and exbB involved in natural genetic transformation of Pseudomonas stutzeri. Res Microbiol 152(5):451–460
Gregor T, Bialek W, van Steveninck R, et al (2005) Diffusion and scaling during early embryonic pattern formation. Proc Natl Acad Sci USA 102(51):18,403–18,407
Gruber AR, Neubeck R, Hofacker IL, et al (2007) The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures. Nucleic Acids Res 35:W335–338
Gubb D, García-Bellido A (1982) A genetic analysis of the determination of cuticular polarity during development in Drosophila melanogaster. J Embryol Exp Morphol 68:37–57
Han JD, Bertin N, Hao T, et al (2004) Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430(6995):88–93
Hansen C, Quake SR (2003) Microfluidics in structural biology: smaller, faster… better. Curr Opin Struct Biol 13(5):538–544
Hao N, Nayak S, Behar M, et al (2008) Regulation of cell signaling dynamics by the protein kinase-scaffold Ste5. Mol Cell 30(5):649–56
Harris MA, Clark J, Ireland A, et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32(Database issue)
Hart GT, Ramani AK, Marcotte EM (2006) How complete are current yeast and human protein-interaction networks? Genome Biol 7(11):120
Hartwell LH, Hopfield JJ, Leibler S, et al (1999) From molecular to modular cell biology. Nature 402(6761 Suppl)
Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning. Springer, New York, NY
Henderson DA, Boys RJ, Krishnan KJ, et al (2009) Bayesian emulation and calibration of a stochastic computer model of mitochondrial DNA deletions in substantia nigra neurons. J Am Stat Assoc 104(485):76–87
Henikoff S, Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins 17(1):49–61
Hermjakob H, Montecchi-Palazzi L, Bader G, et al (2004) The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol 22(2):177–183
Heron EA, Finkenstaedt B, Rand DA (2007) Bayesian inference for dynamic transcriptional regulation; the Hes1 system as a case study. Bioinformatics 23(19):2596–2603
Herrgard M, Covert M, Palsson B (2003) Reconciling gene expression data with known genome-scale regulatory network structures. Genome Res 13(11):2423–2434
Higham DJ (2008) Modeling and simulating chemical reactions. SIAM Rev 50(2):347–368
Hirata H, Yoshiura S, Ohtsuka T, et al (2002) Oscillatory expression of the bHLH factor Hes1 regulated by a negative feedback loop. Science 298(5594):840–3
Hooper SD, Bork P (2005) Medusa: a simple tool for interaction graph analysis. Bioinformatics 21(24):4432–3
Hu Z, Mellor J, Wu J, et al (2007) Towards zoomable multidimensional maps of the cell. Nat Biotechnol 25(5):547–554
Huang CY, Ferrell JE Jr (1996) Ultrasensitivity in the mitogen-activated protein kinase cascade. Proc Natl Acad Sci USA 93(19):10,078–83
Hucka M, Finney A, Sauro HM, et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524–531
Ideker T, Valencia A (2006) Bioinformatics in the human interactome project. Bioinformatics 22(24):2973–2974
Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2:343–372
Igoshin O, Neu J, Oster G (2004a) Developmental waves in myxobacteria: A distinctive pattern formation mechanism. Phys Rev E 70(4)
Igoshin O, Welch R, Kaiser D, et al (2004b) Waves and aggregation patterns in myxobacteria. Proc Natl Acad Sci USA 101(12):4256–4261
Irizarry R, Warren D, Spencer F, et al (2005) Multiple-laboratory comparison of microarray platforms. Nat Meth 2(5):345–350
Ito T, Chiba T, Ozawa R, et al (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98(8):4569–4574
Jahnke T, Huisinga W (2008) A dynamical low-rank approach to the chemical master equation. Bull Math Biol 70(8):2283–2302
B.J. Daigle et al.
Jansen R, Gerstein M (2004) Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol 7(5):535–45
Jansen R, Lan N, Qian J, et al (2002) Integration of genomic datasets to predict protein complexes in yeast. J Struct Funct Genomics 2(2):71–81
Jansen R, Yu H, Greenbaum D, et al (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302(5644):449–453
Jaqaman K, Danuser G (2006) Linking data to models: data regression. Nat Rev Mol Cell Biol 7(11):813–819
Jenssen TK, Laegreid A, Komorowski J, et al (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28(1):21–28
Jones KH, Liu J, Adler PN (1996) Molecular analysis of EMS-induced frizzled mutations in Drosophila melanogaster. Genetics 142(1):205–15
de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103
Kanehisa M, Goto S, Hattori M, et al (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(Database issue)
Karlin S, Taylor HM (1975) A first course in stochastic processes, 2nd edn. Academic Press, New York
Karlin S, Taylor HM (1981) A second course in stochastic processes. Academic Press, New York
Karp P, Riley M, Saier M, et al (2002) The EcoCyc Database. Nucl Acids Res 30(1):56–58
Kelley BP, Sharan R, Karp RM, et al (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci USA 100(20):11,394–11,399
Kim JK, Gabel HW, Kamath RS, et al (2005) Functional genomic analysis of RNA interference in C. elegans. Science 308(5725):1164–1167
Kin T, Yamada K, Terai G, et al (2007) fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Res 35(Database issue)
Klingensmith J, Nusse R, Perrimon N (1994) The Drosophila segment polarity gene dishevelled encodes a novel protein required for response to the wingless signal. Genes Dev 8(1):118–30
Kondo S, Asai R (1995) A reaction-diffusion wave on the skin of the marine angelfish Pomacanthus. Nature 376(6543):765–768
Koyuturk M, Kim Y, Subramaniam S, et al (2006) Detecting conserved interaction patterns in biological networks. J Comput Biol 13(7):1299–1322
Krogan NJ, Cagney G, Yu H, et al (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084):637–643
Kuhn RM, Karolchik D, Zweig AS, et al (2007) The UCSC genome browser database: update 2007. Nucleic Acids Res 35(Database issue)
Lacalli TC (1990) Modeling the Drosophila pair-rule pattern by reaction-diffusion: gap input and pattern control in a 4-morphogen system. J Theor Biol 144(2):171–194
Lamb J, Crawford ED, Peck D, et al (2006) The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935
Lander ES, Linton LM, Birren B, et al (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921
Laub MT, McAdams HH, Feldblyum T, et al (2000) Global analysis of the genetic network controlling a bacterial cell cycle. Science 290(5499):2144–2148
Lee I, Date SV, Adai AT, et al (2004) A probabilistic functional network of yeast genes. Science 306(5701):1555–1558
Li H, Cao Y, Petzold LR, et al (2008) Algorithms and software for stochastic simulation of biochemical reacting systems. Biotechnol Prog 24(1):56–61
Liang Z, Xu M, Teng M, et al (2006) Comparison of protein interaction networks reveals species conservation and divergence. BMC Bioinformatics 7:457
Lu LJ, Xia Y, Paccanaro A, et al (2005a) Assessing the limits of genomic data integration for predicting protein networks. Genome Res 15(7):945–953
Lu P, Szafron D, Greiner R, et al (2005b) PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization. Nucleic Acids Res 33(Database issue)
Luciano JS (2005) PAX of mind for pathway researchers. Drug Discov Today 10(13):937–942
Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA
Maini P, Benson D, Sherratt J (1992) Pattern-formation in reaction diffusion-models with spatially inhomogeneous diffusion-coefficients. IMA J Math Appl Med Biol 9(3):197–213
Maini PK, Baker RE, Chuong CM (2006) Developmental biology. The Turing model comes of molecular age. Science 314(5804):1397–8
Marchler-Bauer A, Anderson JB, DeWeese-Scott C, et al (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31(1):383–387
Matthiessen MW (2003) BioWareDB: the biomedical software and database search engine. Bioinformatics 19(17):2319–2320
McAdams H, Arkin A (1997) Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 94(3):814–819
Meinhardt H, de Boer PA (2001) Pattern formation in Escherichia coli: a model for the pole-to-pole oscillations of Min proteins and the localization of the division site. Proc Natl Acad Sci USA 98(25):14,202–14,207
Mewes HW, Heumann K, Kaps A, et al (1999) MIPS: a database for genomes and protein sequences. Nucleic Acids Res 27(1):44–48
Milo R, Jorgensen P, Springer M (2009) Bionumbers: The Database of Useful Biological Numbers. bionumbers.hms.harvard.edu. Accessed 31 Aug 2009
Mogilner A, Wollman R, Marshall WF (2006) Quantitative modeling in cell biology: What is it good for? Dev Cell 11(3):279–287
Mulder NJ, Apweiler R, Attwood TK, et al (2007) New developments in the InterPro database. Nucleic Acids Res 35(Database issue)
Munsky B, Khammash M (2006) The finite state projection algorithm for the solution of the chemical master equation. J Chem Phys 124(4)
Nelles O (2000) Nonlinear system identification: from classical approaches to neural networks and fuzzy models, 1st edn. Springer, New York, NY
Ng A, Bursteinas B, Gao Q, et al (2006) pSTING: a ‘systems’ approach towards integrating signalling pathways, interaction and transcriptional regulatory networks in inflammation and cancer. Nucleic Acids Res 34(Database issue)
Nichols R (2001) Gene trees and species trees are not the same. Trends Ecol Evol 16(7):358–364
Nielsen P, Halstead M (2004) The evolution of CellML. Conf Proc IEEE Eng Med Biol Soc 7:5411–5414
Novak B, Csikasz-Nagy A, Gyorffy B, et al (1998) Mathematical model of the fission yeast cell cycle with checkpoint controls at the G1/S, G2/M and metaphase/anaphase transitions. Biophys Chem 72(1–2):185–200
Novak B, Pataki Z, Ciliberto A, et al (2001) Mathematical model of the cell division cycle of fission yeast. Chaos 11(1):277–286
Ogata H, Fujibuchi W, Goto S, et al (2000) A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 28(20):4021–4028
Orchard S, Hermjakob H, Taylor CF, et al (2005) Further steps in standardisation. Report of the second annual Proteomics Standards Initiative Spring Workshop (Siena, Italy 17–20th April 2005). Proteomics 5(14):3552–3555
Othmer H (1976) Qualitative dynamics of a class of biochemical control-circuits. J Math Biol 3(1):53–78
Overbeek R, Fonstein M, D’Souza M, et al (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96(6):2896–2901
Owen A, Stuart J, Mach K, et al (2003) A Gene Recommender Algorithm to Identify Coexpressed Genes in C. elegans. Genome Res 13(8):1828–1837
Painter KJ, Maini PK, Othmer HG (2000) A chemotactic model for the advance and retreat of the primitive streak in avian development. Bull Math Biol 62(3):501–525
Pamilo P, Nei M (1988) Relationships between gene trees and species trees. Mol Biol Evol 5(5):568–583
Pazos F, Ranea J, Juan D, et al (2005) Assessing Protein Co-evolution in the Context of the Tree of Life Assists in the Prediction of the Interactome. J Mol Biol 352(4):1002–1015
Pellegrini M, Marcotte EM, Thompson MJ, et al (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96(8):4285–4288
Pokholok DK, Zeitlinger J, Hannett NM, et al (2006) Activated signal transduction kinases frequently occupy target genes. Science 313(5786):533–6
Price ND, Shmulevich I (2007) Biochemical and statistical network models for systems biology. Curr Opin Biotechnol 18(4):365–370
Prudhommeaux E, Seaborne A (2007) SPARQL Query Language for RDF. www.w3.org/TR/rdfsparql-query. Accessed 31 Aug 2009
Ptacek J, Snyder M (2006) Charging it up: global analysis of protein phosphorylation. Trends Genet 22(10):545–54
Qi Y, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Struct Funct Bioinform 63(3):490–500
Rangamani P, Iyengar R (2007) Modelling spatio-temporal interactions within the cell. J Biosci 32(1):157–167
Rao C, Arkin A (2003) Stochastic chemical kinetics and the quasi-steady-state assumption: Application to the Gillespie algorithm. J Chem Phys 118(11):4999–5010
Rathinam M, El Samad H (2007) Reversible-equivalent-monomolecular tau: A leaping method for “small number and stiff” stochastic chemical systems. J Computational Phys 224(2):897–923
Ratsch G, Sonnenburg S, Srinivasan J, et al (2007) Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol 3(2):e20
Resat H, Petzold L, Pettigrew MF (2009) Kinetic modeling of biological systems. Methods Mol Biol 541:311–335
Riddihough G (2003) Chromosomes through space and time. Science 301(5634):779
van Riel NAW, Sontag ED (2006) Parameter estimation in models combining signal transduction and metabolic pathways: the dependent input approach. IEE Proc Syst Biol 153(4):263–274
Robertson G, Bilenky M, Lin K, et al (2006) cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res 34(Database issue)
Rual JF, Venkatesan K, Hao T, et al (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437(7062):1173–8
Rubin DL, Lewis SE, Mungall CJ, et al (2006) National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge. OMICS 10(2):185–198
Sachs K, Perez O, Pe’er D, et al (2005) Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308(5721):523–529
Saha K, Schaffer D (2006) Signal dynamics in Sonic hedgehog tissue patterning. Development 133(5):889–900
Salis H, Kaznessis Y (2005) Accurate hybrid stochastic simulation of a system of coupled chemical or biochemical reactions. J Chem Phys 122(5)
Samoilov M, Plyasunov S, Arkin AP (2005) Stochastic amplification and signaling in enzymatic futile cycles through noise-induced bistability with oscillations. Proc Natl Acad Sci USA 102(7):2310–2315
Samoilov MS, Arkin AP (2006) Deviant effects in molecular reaction pathways. Nat Biotechnol 24(10):1235–1240
SantaLucia J, Hicks D (2004) The thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct 33:415–440
Saric J, Jensen LJ, Ouzounova R, et al (2006) Extraction of regulatory gene/protein networks from medline. Bioinformatics 22(6):645–650
Sauer U (2004) High-throughput phenomics: experimental methods for mapping fluxomes. Curr Opin Biotechnol 15(1):58–63
Schena M, Shalon D, Heller R, et al (1996) Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA 93(20):10,614–10,619
Schnell S, Turner T (2004) Reaction kinetics in intracellular environments with macromolecular crowding: simulations and rate laws. Prog Biophys Mol Biol 85(2–3):235–260
Schubert W, Bonnekoh B, Pommer A, et al (2006) Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat Biotechnol 24(10):1270–1278
Schuldiner M, Collins S, Thompson N, et al (2005) Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell 123(3):507–519
Sekimura T, Zhu M, Cook J, et al (1999) Pattern formation of scale cells in lepidoptera by differential origin-dependent cell adhesion. Bull Math Biol 61(5):807–827
Shannon P, Markiel A, Ozier O, et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Sharan R, Suthram S, Kelley RM, et al (2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102(6):1974–1979
Shea M, Ackers G (1985) The “or” control-system of bacteriophage-lambda - a physical-chemical model for gene-regulation. J Mol Biol 181(2):211–230
Sherlock G (2000) Analysis of large-scale gene expression data. Curr Opin Immunol 12(2):201–205
Shmulevich I, Dougherty E, Kim S, et al (2002) Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2):261–274
Sick S, Reinker S, Timmer J, et al (2006) WNT and DKK determine hair follicle spacing through a reaction-diffusion mechanism. Science 314(5804):1447–1450
Siek J, Lee L, Lumsdaine A (2007) The Boost Graph Library. www.boost.org/libs/graph/. Accessed 31 Aug 2009
Singh R, Xu J, Berger B (2007) Pairwise Global Alignment of Protein Interaction Networks by Matching Neighborhood Topology. Proceedings of the 11th Annual International Conference on Computational Molecular Biology (RECOMB 2007)
Smolen P, Baxter D, Byrne J (2002) A reduced model clarifies the role of feedback loops and time delays in the Drosophila circadian oscillator. Biophys J 83(5):2349–2359
Spellman PT, Sherlock G, Zhang MQ, et al (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273–3297
Spiro P, Parkinson J, Othmer H (1997) A model of excitation and adaptation in bacterial chemotaxis. Proc Natl Acad Sci USA 94(14):7263–7268
Srinivasan B, Caberoy N, Suen G, et al (2005) Functional genome annotation through phylogenomic mapping. Nat Biotechnol 23(6):691–698
Srinivasan BS, Novak AF, Flannick J, Batzoglou S, McAdams HH (2006) Integrated protein interaction networks for 11 microbes. In: RECOMB, pp 1–14
Srinivasan BS, Shah NH, Flannick JA, et al (2007) Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinform 8(5):318–332
Stamatakis M, Mantzaris NV (2006) Modeling of ATP-mediated signal transduction and wave propagation in astrocytic cellular networks. J Theor Biol 241(3):649–668
Stark C, Breitkreutz BJ, Reguly T, et al (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34(Database issue)
Stephens S (2007) HCLSIG BioRDF Subgroup. esw.w3.org/topic/HCLSIG_BioRDF_Subgroup. Accessed 31 Aug 2009
Steuer R (2004) Effects of stochasticity in models of the cell cycle: from quantized cycle times to noise-induced oscillations. J Theor Biol 228(3):293–301
Stromback L, Lambrix P (2005) Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics 21(24):4401–4407
Strutt DI (2001) Asymmetric localization of frizzled and the establishment of cell polarity in the Drosophila wing. Mol Cell 7(2):367–75
Stuart J, Segal E, Koller D, et al (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302(5643):249–255
Stumpf M, Kelly W, Thorne T, et al (2007) Evolution at the system level: the natural history of protein interaction networks. Trends Ecol Evol 22(7):366–373
Tanay A, Sharan R, Kupiec M, et al (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101(9):2981–2986
Taylor J, Abramova N, Charlton J, et al (1998) Van Gogh: a new Drosophila tissue polarity gene. Genetics 150(1):199–210
Theisen H, Purcell J, Bennett M, et al (1994) dishevelled is required during wingless signaling to establish both cell polarity and cell identity. Development 120(2):347–60
Tomlin CJ, Axelrod JD (2007) Biology by numbers: mathematical modelling in developmental biology. Nat Rev Genet 8(5):331–340
Tong AH, Evangelista M, Parsons AB, et al (2001) Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294(5550):2364–2368
Tong AH, Drees B, Nardelli G, et al (2002) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295(5553):321–324
Tree DRP, Shulman JM, Rousset R, et al (2002) Prickle mediates feedback amplification to generate asymmetric planar cell polarity signaling. Cell 109(3):371–381
Troyanskaya OG, Dolinski K, Owen AB, et al (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 100(14):8348–8353
Turing A (1952) The Chemical Basis of Morphogenesis. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 237(641):37–72
Tyson J (1975) Existence of oscillatory solutions in negative feedback cellular control processes. J Math Biol 1(4):311–315
Tyson J, Othmer H (1978) The dynamics of feedback control circuits in biochemical pathways. Prog Theor Biol 5:1–60
Uhrmacher A, Degenring D, Zeigler B (2005) Discrete event multi-level models for systems biology. Transactions on computational systems biology I, pp 66–89. Springer, Berlin
Vastrik I, D’Eustachio P, Schmidt E, et al (2007) Reactome: a knowledgebase of biological pathways and processes. Genome Biol 8:R39
Venter JC, Adams MD, Myers EW, et al (2001) The sequence of the human genome. Science 291(5507):1304–1351
Vilar JMG, Kueh HY, Barkai N, et al (2002) Mechanisms of noise-resistance in genetic oscillators. Proc Natl Acad Sci USA 99(9):5988–5992
von Mering C, Jensen LJ, Kuhn M, et al (2007) STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35(Database issue)
Walter CF (1970) The occurrence and the significance of limit cycle behavior in controlled biochemical systems. J Theor Biol 27(2):259–272
Wang X, Hao N, Dohlman H, et al (2006) Bistability, stochasticity, and oscillations in the mitogen-activated protein kinase cascade. Biophys J 90(6):1961–1978
Weber M, Schubeler D (2007) Genomic patterns of DNA methylation: targets and function of an epigenetic mark. Curr Opin Cell Biol 19(3):273–280
Wei CL, Wu Q, Vega VB, et al (2006) A global map of p53 transcription-factor binding sites in the human genome. Cell 124(1):207–219
Weinberger LS, Burnett JC, Toettcher JE, et al (2005) Stochastic gene expression in a lentiviral positive-feedback loop: HIV-1 Tat fluctuations drive phenotypic diversity. Cell 122(2):169–182
Weitz J, Benfey P, Wingreen N (2007) Evolution, interactions, and biological networks. PLoS Biol 5(1):e11
Weng G, Bhalla U, Iyengar R (1999) Complexity in biological signaling systems. Science 284(5411):92–96
Wheeler DL, Barrett T, Benson DA, et al (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35(Database issue)
Wilkinson DJ (2006) Stochastic modelling for systems biology. Chapman and Hall/CRC mathematical and computational biology series, Taylor and Francis
Wilkinson DJ (2009) Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet 10(2):122–133
Winzeler EA, Shoemaker DD, Astromoff A, et al (1999) Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285(5429):901–906
Wong SL, Zhang LV, Tong AH, et al (2004) Combining biological networks to predict genetic interactions. Proc Natl Acad Sci USA 101(44):15,682–15,687
Woo Y, Affourtit J, Daigle S, et al (2004) A Comparison of cDNA, Oligonucleotide, and Affymetrix GeneChip Gene Expression Microarray Platforms. J Biomol Tech 15(4):276–284
Yi M, Jia Y, Liu Q, et al (2006) Enhancement of internal-noise coherence resonance by modulation of external noise in a circadian oscillator. Phys Rev E 73(4)
Yi M, Jia Y, Tang J, et al (2008) Theoretical study of mesoscopic stochastic mechanism and effects of finite size on cell cycle of fission yeast. Phys A-Stat Mech Appl 387(1):323–334
You L (2004) Toward computational systems biology. Cell Biochem Biophys 40(2):167–184
Yu H, Luscombe NM, Lu HX, et al (2004) Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14(6):1107–1118
Zhenping L, Zhang S, Wang Y, et al (2007) Alignment of molecular networks by integer quadratic programming. Bioinformatics 23(13):1631–1639
Zhu X, Gerstein M, Snyder M (2007) Getting connected: analysis and principles of biological networks. Genes Dev 21(9):1010–1024
Zou M, Conzen S (2005) A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1):71–79
Chapter 3
Getting Started in Biological Pathway Construction
Rebecca A. Sealfon and Stuart C. Sealfon
Abstract The increasingly extensive data on dynamic cellular processes are revolutionizing biology and medicine. Computational tools are necessary for exploring these data. A familiar and intuitive approach is to organize information into biological pathways. This chapter provides a general overview and an introduction to the specific software and notations used to store, construct, and analyze biological pathways. Biological pathways are collected in public and private databases, which are often curated by many researchers. In these databases, information is represented in one of several standard notations, is usually viewable in several different layouts, and can be readily updated by multiple users. Specialized software allows users to manually or automatically mine the data in order to construct biological pathways.
Keywords Pathway · Systems biology mark-up language · SBML · Curation
S.C. Sealfon (corresponding author), Center for Translational Systems Biology, Department of Neurology, Mount Sinai School of Medicine, One Gustave L. Levy Place, Box 1137, New York, NY 10029, USA; e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_3, © Springer Science+Business Media, LLC 2010
3.1 Introduction
Organisms maintain homeostasis by monitoring and responding to their internal and external environments. This essential adaptation of living matter depends on the function of elaborate molecular interaction networks. The large-scale study, reconstruction, and modeling of these biological networks are important aspects of the field of systems biology. Although individual components of these networks have been studied for decades (Akino et al. 1971, Mcclay and Gooding 1978, Medicus et al. 1976), the accumulation of sufficient data to reconstruct molecular networks is a recent advance (Bradham et al. 2006, Iyengar 2009, Kitano 2002, Ma’ayan et al. 2009, Robertson et al. 2006, Viswanathan et al. 2007, 2008, Voit et al. 2006, Weitz et al. 2007). Analysis and visualization tools are still being developed
and refined to help mine and synthesize the data emerging from high-throughput research (Ma’ayan et al. 2009, Viswanathan et al. 2007). It is convenient to represent the chains of cause-and-effect relationships driving biological processes as pathways and to collect the information about these pathways in annotated databases. The aim of this chapter is to review the approaches and techniques for assembling and using representations of biological pathways. Systems biology provides a framework for understanding the dynamic nature of biological networks (Kitano 2002). Thus it may provide a spatiotemporal understanding, at the level of the molecular processes, of emergent cellular response states. One challenge of systems biology is to organize the data generated by research communities into navigable databases. Pathway construction is one approach to literature and data mining with many practical ramifications. A comprehensive understanding of biological pathways can aid in the development of drugs to target specific cellular mechanisms while avoiding unwanted side effects (Cho et al. 2006, Kell 2006, Loging et al. 2007, Materi and Wishart 2007). The applications of systems biology to the understanding and treatment of various diseases are explored in subsequent chapters of this volume. For example, pathway analysis has emerged as an important component of the study of tumorigenesis (Hernandez et al. 2007, Maxwell et al. 2008, Reed 2008). Tumorigenesis contributes to the formation of many cancers (Holland and Cleveland 2009, Tu et al. 2009, Wong and Lemoine 2009) and even when benign may be problematic due to a tumor’s location or size (Al Habeeb et al. 2009, Saint-Blancard and Trueba 2009). Since each tumor can be caused by a different set of mutations, an understanding of the systems biology of tumorigenesis may eventually enable us to create more specific drugs that target each tumor individually (Bagirov et al. 2003, Beroukhim et al. 2007, Hoshida et al. 2007).
3.2 Approaches and Examples of Pathway Construction
Pathway construction refers both to the storage of pathway information in databases (Viswanathan et al. 2008) and to the curation, mining, and synthesis of this information (Wierling et al. 2007). The most appropriate approach depends on the specific datasets utilized and the hypotheses being evaluated. Some pathways and pathway databases are created around a particular question or topic of interest, usually according to the results of multiple experiments. Others are created from the results of a single experiment, such as a microarray analysis, to organize the information about gene and protein relationships.
The database of the molecular pathways implicated in the mammalian innate immune response, a system discussed later in this volume, is primarily topic based. The innate immune system is only one part of the mammalian immune system and is responsible for generic responses to pathogens such as phagocytosis. Unlike the adaptive immune system, which is unique to jawed vertebrates, the innate immune
system is found in all known multicellular eukaryotes, and aspects of innate immunity even exist in unicellular organisms (Janeway et al. 2001). Many questions about the mammalian innate immune system, such as how it can produce distinct responses to particular pathogens, remain unanswered (Lynn et al. 2008). Lynn et al. (2008) have created InnateDB, a manually curated database of the molecules, interactions, and pathways involved in the human and mouse innate immune responses. A large database such as InnateDB may be helpful in understanding major aspects of the innate immune system.
Gilchrist et al. (2006) describe a database created from the results of a single experiment. These researchers studied innate immune system regulation by analyzing mouse bone marrow macrophages (BMMs) with activated toll-like receptors (TLRs), innate immune receptors that recognize structurally conserved molecules found on microbes. TLR activation involves changes in the expression of over a thousand genes and is thus well suited to being studied with a systems biology approach (Nau et al. 2002). A dataset of gene expression levels was collected using microarrays of these TLR-activated BMMs. Substantially perturbed genes were identified using significance analysis of microarrays. This dataset was used to experimentally identify a major group of genes regulated by the transcription factor ATF3.
Regardless of one’s initial approach to pathway construction, it is important to ensure that pertinent information is accurate, accessible, and easy to integrate with other studies. Annotations such as cell type, developmental stage, or species, for example, can provide additional information by which the pathway can be understood and searched. Experts should be able to correct, update, or annotate data that have already been added to a database, and both experts and laymen should be able to find and comprehend the data.
A number of standard notations and software tools have been developed to facilitate the storage and data mining of different types of pathways, and the computational resources are becoming progressively easier to use. The next sections provide an overview of the available knowledge bases and current methods of pathway construction.
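The role of annotations in searching can be illustrated with a minimal sketch. It assumes a deliberately simple in-memory representation; the class names, the find_pathways helper, and the example records are hypothetical and are not taken from InnateDB or any other database mentioned in this chapter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interaction:
    """One directed edge in a pathway, e.g. 'TLR4 activates MYD88'."""
    source: str
    target: str
    effect: str  # e.g. "activates" or "inhibits"


@dataclass
class Pathway:
    """A pathway plus free-form annotations that can be used for searching."""
    name: str
    interactions: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def find_pathways(pathways, **criteria):
    """Return the pathways whose annotations match every given criterion."""
    return [
        p for p in pathways
        if all(p.annotations.get(key) == value for key, value in criteria.items())
    ]


# Illustrative records only; the annotations are placeholders, not curated data.
pathways = [
    Pathway(
        name="TLR signaling (toy example)",
        interactions=[Interaction("TLR4", "MYD88", "activates")],
        annotations={"species": "mouse", "cell_type": "macrophage"},
    ),
    Pathway(
        name="Wnt signaling (toy example)",
        interactions=[Interaction("WNT", "FZD", "activates")],
        annotations={"species": "human", "cell_type": "epithelial"},
    ),
]

hits = find_pathways(pathways, species="mouse", cell_type="macrophage")
print([p.name for p in hits])
```

Keeping annotations as key-value pairs separate from the interactions means that a new annotation type, such as developmental stage, can be added without changing the search function.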
3.3 Pathway Databases
Numerous public and private databases have been created to store various types of information relevant to pathway construction. Only some of these databases contain pathways per se. Nucleotide sequence and protein information, for example, may be useful for identifying the nodes in various pathways. Systems biologists often use nucleotide sequence, gene, protein, and other databases provided by the National Center for Biotechnology Information (NCBI) or the European Bioinformatics Institute (EMBL-EBI), including the genetic sequence database (GenBank), the Ensembl gene annotation database, and the UniProt protein sequence and annotation database (Wierling et al. 2007). Large databases of ontologies, such as the gene
ontology annotation (GOA) database, offer a standardized system of annotating biological data (Barrell et al. 2009). A detailed description of the large number of existing biological databases is beyond the scope of this chapter. New databases are regularly developed. Nucleic Acids Research publishes an annual issue devoted entirely to new and improved databases, and a new journal on databases, DATABASE, recently printed its first issue (Landsman et al. 2009). A comprehensive list of pathway databases can be found at http://www.pathguide.org. In many databases, information is intentionally available in more than one layout. A later chapter of this volume describes MADNet (Microarray Database Network Web Server), which is freely available to academic users at http://www.bioinfo.hr/madnet. MADNet, which integrates information supplied by the user with other biological databases, allows several types of data, such as biological pathway organization, gene expression levels, and transcription factor regulation, to be visually represented (Segota et al. 2008). In addition to the results of microarray experiments, MADNet also analyzes other forms of high-throughput data, including phage display and metagenomic information. A related data visualization program is BiologicalNetworks, which is based on the PathSys data integration platform (Baitaluk et al. 2006a, b). It currently analyzes a larger set of databases than MADNet (over 20 as of 2006). Unlike MADNet, BiologicalNetworks searches for and analyzes molecular pathways based on length rather than statistical significance (Baitaluk et al. 2006b, Segota et al. 2008). Other data visualization programs focus on more network-specific properties. For example, BioTapestry is a software tool designed to represent networks that increase in size and interconnectedness over time (Longabaugh et al. 2005, 2009), such as the gene regulatory network implicated in the development of the sea urchin (Su et al. 
2009) or the signaling network determining T lymphocyte fate in mammals (Georgescu et al. 2008). Data visualization software will be described in greater detail in a separate section, but these more specialized programs are suitable for databases that would call for a hybrid topic-based and experiment-based approach.
3.4 Standard Notations for Representing Biological Pathways
There are several notations for representing biological pathways, including the Systems Biology Markup Language (SBML), Proteomics Standards Initiative-Molecular Interactions (PSI-MI), and Biological Pathways Exchange (BioPAX) (Viswanathan et al. 2008). All are designed for easy storage and parsing by computers, but each has a slightly different primary use. SBML is best suited for mathematical modeling and simulation; PSI-MI is designed for structured representation of experimental evidence; and BioPAX, a newer language that is still developing, integrates PSI-MI within a pathway representation format and provides general representation mechanisms that permit storage of additional information such as mathematical models (Viswanathan et al. 2008). These and other commonly used
notations are based on the XML markup language, which combines the chief advantages of its predecessor HTML (HyperText Markup Language) with an attempt to overcome HTML’s disadvantages (Achard et al. 2001). HTML is the most commonly used markup language on the World Wide Web. Programming in HTML is fast and web pages in HTML are easy to read and navigate (McMurdo 1996, Pallen 1995). It is adequate for representing documents with a simple structure, including most types of web pages, but is not designed for encoding information with a complex structure. In addition, programmers are not prevented from placing HTML tags in a counterintuitive, misleading order, such as using level 1 headings to subdivide sections indicated by level 2 headings, and HTML itself provides little information about the content of a document (Achard et al. 2001). Headings and content information can be useful when searching through highly structured databases, which is often required when constructing or analyzing biological pathways. XML (eXtensible Markup Language), like HTML, is derived from the standard generalized markup language (SGML), an ISO (International Organization for Standardization) standard meta-language for designing markup languages (Achard et al. 2001, Roberts 1998). In general, markup languages are systems of notations for indicating the format, layout, and structure of text documents. Unlike HTML, XML can easily represent complex structured documents, such as biological information organized into several levels of subcategories. It also allows the sections of these documents to be identified by content or other attributes, a necessary feature for enabling users to search through a large database for specific molecules, pathways, or other biological processes (Achard et al. 2001). In XML, structural and content constraints of documents are represented by a special description called a schema. One example of a schema is DTD (Document Type Definition). 
A schema is useful for dividing documents into pieces of textual data known as elements. Elements are identified and separated by tags, which include the relevant information about the document subsections they mark (Lee and Chu 2000). This information can be represented in a standardized fashion, such as according to an ontological classification (Jiang and Nash 2006). An XML document is called well-formed if it obeys the basic XML syntax rules, and valid if its logical structure matches the structure specified in its schema (Achard et al. 2001). XML editors generally allow developers to check whether documents are well formed, and many check whether documents are valid (Schroeder and Mello 2009). These features help prevent users from submitting improperly formatted data to a database.
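The distinction between a well-formed document and one that is not can be illustrated with a short sketch. It uses only Python's standard-library XML parser; the element and attribute names (`pathway`, `molecule`, `id`) are hypothetical and are not taken from any of the notations discussed here. The standard-library parser checks only well-formedness; checking validity against a schema would need a validating parser, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text obeys the basic XML syntax rules."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def molecules_with_id(xml_text: str, molecule_id: str) -> list:
    """Return the text of every <molecule> element whose id attribute matches."""
    root = ET.fromstring(xml_text)
    return [m.text for m in root.findall(f"molecule[@id='{molecule_id}']")]


# Hypothetical records: the tag and attribute names are invented for illustration.
GOOD = '<pathway name="example"><molecule id="m1">kinase A</molecule></pathway>'
BAD = '<pathway name="example"><molecule id="m1">kinase A</pathway>'  # unclosed tag

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
print(molecules_with_id(GOOD, "m1"))
```

Because sections are identified by tags and attributes rather than by position, the lookup by `id` works without knowing where in the document the element sits.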
3.5 Pathway Building Tools
Pathway building tools are required to populate, visualize, and store a pathway. Currently, there are various pathway building tools that provide the ability to extract information as well as to support multiple standard formats (Stromback
et al. 2006). Cytoscape, CellDesigner, and JDesigner are graphical environments for constructing pathways that can import or export SBML models for simulation. Cytoscape can also access large databases of protein and gene interactions and additionally supports the PSI-MI and BioPAX formats. Pathway Analysis Tool for Integration and Knowledge Acquisition (PATIKA) provides a web-based interface to public databases such as Reactome, HPRD, and IntAct, and supports both SBML and BioPAX formats. Its visualization and layout tools facilitate pathway analysis. Reactome displays reactions as pathway diagrams and provides online tools for authoring, curation, and visualization, as well as export to SBML and BioPAX formats. Ingenuity Pathway Analysis, a web-based interface to the Ingenuity knowledge base that is available by paid subscription, enables users to query molecular interactions, biological functions, and diseases in order to generate and analyze customized pathways.
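To give a sense of what a tool does when it imports an SBML model, the following is a minimal sketch that reads an SBML-style fragment and lists its species and reactions. It is not how any of the tools above is implemented. The toy model (`toy_pathway`, `ligand`, `receptor`, `complex`) is invented for illustration, the fragment has not been checked against the SBML schema, and real applications would normally use a dedicated SBML library rather than a generic XML parser.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4; the model below is a hypothetical toy example.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

MODEL = f"""<sbml xmlns="{SBML_NS}" level="2" version="4">
  <model id="toy_pathway">
    <listOfCompartments>
      <compartment id="cell"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="ligand" compartment="cell"/>
      <species id="receptor" compartment="cell"/>
      <species id="complex" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="binding">
        <listOfReactants>
          <speciesReference species="ligand"/>
          <speciesReference species="receptor"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="complex"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""


def summarize(sbml_text: str):
    """Return (species ids, {reaction id: (reactant ids, product ids)})."""
    ns = {"s": SBML_NS}
    root = ET.fromstring(sbml_text)
    species = [sp.get("id") for sp in root.findall(".//s:species", ns)]
    reactions = {}
    for rx in root.findall(".//s:reaction", ns):
        reactants = [r.get("species")
                     for r in rx.findall("s:listOfReactants/s:speciesReference", ns)]
        products = [p.get("species")
                    for p in rx.findall("s:listOfProducts/s:speciesReference", ns)]
        reactions[rx.get("id")] = (reactants, products)
    return species, reactions


species, reactions = summarize(MODEL)
print(species)
print(reactions)
```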
3.6 The Pathway Building Process
Pathway curation can be manual or automated. Manual curation provides the most reliable means of extracting information from the literature. However, the pace of new discovery can make manually populated databases difficult to maintain. In the mining process, use of appropriate keywords increases the chances of identifying the relevant information. Automated text mining through natural language processing (the automated processing of data written in languages normally used by humans) reduces the personnel required to recover information but is severely limited in accuracy. Information in the scientific literature is highly specialized, semantically unpredictable, and often not textual. Agreeing on “facts” is difficult even for expert curators. The present generation of text mining tools is probably most useful as an aid to manual curation. The efficient mining of information from the plethora of resource databases hinges on identifying the most useful primary literature and databases for the biological area of interest. This often poses a challenge, because the choice of databases and mining strategies is specific to the biological area. We find Reactome, UniHI, and Ingenuity Systems useful and appropriate for many biological areas.
3.7 Summary and Conclusions
The representation of biological pathways is a multi-step process that entails obtaining data from the literature, mining existing databases, and/or gathering data from experiments; assembling the pathways step by step; and submitting the pathway information and annotations to public or private databases. Tools are being continually developed to aid in each step. This promising field of bioinformatics may yield a greater spatiotemporal understanding of cellular responses.
References
Achard F, Vaysseix G, Barillot E (2001) XML, bioinformatics and data integration. Bioinformatics 17:115–125
Akino T, Abe M, Arai T (1971) Studies on biosynthetic pathways of molecular species of lecithin by rat lung slices. Biochim Biophys Acta 248:274–281
Al Habeeb A, Alkhalidi H, Idikio H, Ghazarian D (2009) Cutaneous solitary neural hamartoma: report of an unusual case. Am J Dermatopathol 31:484–486
Bagirov AM, Ferguson B, Ivkovic S, Saunders G, Yearwood J (2003) New algorithms for multiclass cancer diagnosis using tumor gene expression signatures. Bioinformatics 19:1800–1807
Baitaluk M, Qian X, Godbole S, Raval A, Ray A, Gupta A (2006a) PathSys: integrating molecular interaction graphs for systems biology. BMC Bioinformatics 7:55. DOI 10.1186/1471-2105-7-55
Baitaluk M, Sedova M, Ray A, Gupta A (2006b) BiologicalNetworks: visualization and analysis tool for systems biology. Nucleic Acids Res 34:W466–471. DOI 10.1093/nar/gkl308
Barrell D, Dimmer E, Huntley RP, Binns D, O’Donovan C, Apweiler R (2009) The GOA database in 2009 – an integrated gene ontology annotation resource. Nucleic Acids Res 37:D396–403. DOI 10.1093/nar/gkn803
Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S, Du J, Kau T, Thomas RK, Shah K, Soto H, Perner S, Prensner J, Debiasi RM, Demichelis F, Hatton C, Rubin MA, Garraway LA, Nelson SF, Liau L, Mischel PS, Cloughesy TF, Meyerson M, Golub TA, Lander ES, Mellinghoff IK, Sellers WR (2007) Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci USA 104:20007–20012. DOI 10.1073/pnas.0710052104
Bradham CA, Foltz KR, Beane WS, Arnone MI, Rizzo F, Coffman JA, Mushegian A, Goel M, Morales J, Geneviere AM, Lapraz F, Robertson AJ, Kelkar H, Loza-Coll M, Townley IK, Raisch M, Roux MM, Lepage T, Gache C, McClay DR, Manning G (2006) The sea urchin kinome: a first look. Dev Biol 300:180–193. DOI 10.1016/j.ydbio.2006.08.074
Cho CR, Labow M, Reinhardt M, van Oostrum J, Peitsch MC (2006) The application of systems biology to drug discovery. Curr Opin Chem Biol 10:294–302. DOI 10.1016/j.cbpa.2006.06.025
Georgescu C, Longabaugh WJ, Scripture-Adams DD, David-Fung ES, Yui MA, Zarnegar MA, Bolouri H, Rothenberg EV (2008) A gene regulatory network armature for T lymphocyte specification. Proc Natl Acad Sci USA 105:20100–20105. DOI 10.1073/pnas.0806501105
Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, Roach JC, Kennedy K, Hai T, Bolouri H, Aderem A (2006) Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature 441:173–178. DOI 10.1038/nature04768
Hernandez P, Huerta-Cepas J, Montaner D, Al-Shahrour F, Valls J, Gomez L, Capella G, Dopazo J, Pujana MA (2007) Evidence for systems-level molecular mechanisms of tumorigenesis. BMC Genomics 8:185. DOI 10.1186/1471-2164-8-185
Holland AJ, Cleveland DW (2009) Boveri revisited: chromosomal instability, aneuploidy and tumorigenesis. Nat Rev Mol Cell Biol 10:478–487. DOI 10.1038/nrm2718
Hoshida Y, Brunet JP, Tamayo P, Golub TR, Mesirov JP (2007) Subclass mapping: identifying common subtypes in independent disease data sets. PLoS One 2:e1195. DOI 10.1371/journal.pone.0001195
Iyengar R (2009) Computational biochemistry: systems biology minireview series. J Biol Chem 284:5425–5426. DOI 10.1074/jbc.R800066200
Janeway C, National Center for Biotechnology Information (U.S.), National Institutes of Health (U.S.). PubMed. (2001) Immunobiology: the immune system in health and disease. http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=imm.TOC&depth=2 Connect to free full text of book.
Jiang K, Nash C (2006) Application of XML database technology to biological pathway datasets. Conf Proc IEEE Eng Med Biol Soc 1:4217–4220. DOI 10.1109/IEMBS.2006.259850
Kell DB (2006) Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today 11:1085–1092. DOI 10.1016/J.Drudis.2006.10.004
Kitano H (2002) Systems biology: a brief overview. Science 295:1662–1664. DOI 10.1126/science.1069492
Landsman D, Gentleman R, Kelso J, Francis Ouellette BF (2009) DATABASE: A new forum for biological databases and curation. Database 2009:bap002. DOI 10.1093/database/bap002
Lee D, Chu WW (2000) Comparative analysis of six XML schema languages. Sigmod Record 29:76–87
Loging W, Harland L, Williams-Jones B (2007) High-throughput electronic biology: mining information for drug discovery. Nat Rev Drug Discov 6:220–230. DOI 10.1038/Nrd2265
Longabaugh WJ, Davidson EH, Bolouri H (2005) Computational representation of developmental genetic regulatory networks. Dev Biol 283:1–16. DOI 10.1016/j.ydbio.2005.04.023
Longabaugh WJ, Davidson EH, Bolouri H (2009) Visualization, documentation, analysis, and communication of large-scale gene regulatory networks. Biochim Biophys Acta 1789:363–374. DOI 10.1016/j.bbagrm.2008.07.014
Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy JL, Roche FM, Chan TH, Shah N, Lo R, Naseer M, Que J, Yau M, Acab M, Tulpan D, Whiteside MD, Chikatamarla A, Mah B, Munzner T, Hokamp K, Hancock RE, Brinkman FS (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4:218. DOI 10.1038/msb.2008.55
Ma’ayan A, Jenkins SL, Webb RL, Berger SI, Purushothaman SP, Abul-Husn NS, Posner JM, Flores T, Iyengar R (2009) SNAVI: Desktop application for analysis and visualization of large-scale signaling networks. BMC Syst Biol 3:10. DOI 10.1186/1752-0509-3-10
Materi W, Wishart DS (2007) Computational systems biology in drug discovery and development: methods and applications. Drug Discov Today 12:295–303. DOI 10.1016/J.Drudis.2007.02.013
Maxwell CA, Moreno V, Sole X, Gomez L, Hernandez P, Urruticoechea A, Pujana MA (2008) Genetic interactions: the missing links for a better understanding of cancer susceptibility, progression and treatment. Mol Cancer 7:4. DOI 10.1186/1476-4598-7-4
McClay DR, Gooding LR (1978) Involvement of histocompatibility antigens in embryonic cell recognition events. Nature 274:367–368
McMurdo G (1996) HTML for the lazy. J Inf Sci 22:198–212
Medicus RG, Schreiber RD, Gotze O, Mullereberhard HJ (1976) Molecular concept of properdin pathway. Proc Natl Acad Sci USA 73:612–616
Nau GJ, Richmond JF, Schlesinger A, Jennings EG, Lander ES, Young RA (2002) Human macrophage activation programs induced by bacterial pathogens. Proc Natl Acad Sci USA 99:1503–1508. DOI 10.1073/pnas.022649799
Pallen M (1995) Guide to the Internet. The world wide web. BMJ 311:1552–1556
Reed JA (2008) Deciphering the melanoma interactome. J Cutan Pathol 35:11–15. DOI 10.1111/J.1600-0560.2008.01119.X
Roberts A (1998) What is SGML (standard generalised markup language)? IHRIM 39:48–49
Robertson AJ, Croce J, Carbonneau S, Voronina E, Miranda E, McClay DR, Coffman JA (2006) The genomic underpinnings of apoptosis in Strongylocentrotus purpuratus. Dev Biol 300:321–334. DOI 10.1016/j.ydbio.2006.08.053
Saint-Blancard P, Trueba F (2009) A rare splenic lesion, the splenoma or splenic hamartoma. Rev Med Interne 30:533–536. DOI 10.1016/J.Revmed.2008.07.017
Schroeder R, Mello RD (2009) Designing XML documents from conceptual schemas and workload information. Multimedia Tools Appl 43:303–326. DOI 10.1007/S11042-009-0272-1
Segota I, Bartonicek N, Vlahovicek K (2008) MADNet: microarray database network web server. Nucleic Acids Res 36:W332–335. DOI 10.1093/nar/gkn289
Stromback L, Jakoniene V, Tan H, Lambrix P (2006) Representing, storing and accessing molecular interaction data: a review of models and tools. Brief Bioinform 7:331–338. DOI 10.1093/bib/bbl039
Su YH, Li E, Geiss GK, Longabaugh WJ, Kramer A, Davidson EH (2009) A perturbation model of the gene regulatory network for oral and aboral ectoderm specification in the sea urchin embryo. Dev Biol. DOI 10.1016/j.ydbio.2009.02.029
Tu LC, Foltz G, Lin E, Hood L, Tian Q (2009) Targeting stem cells – clinical implications for cancer therapy. Curr Stem Cell Res Ther 4:147–153
Viswanathan GA, Nudelman G, Patil S, Sealfon SC (2007) BioPP: a tool for web-publication of biological networks. BMC Bioinformatics 8:168. DOI 10.1186/1471-2105-8-168
Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC (2008) Getting started in biological pathway construction and analysis. PLoS Comput Biol 4:e16. DOI 10.1371/journal.pcbi.0040016
Voit E, Neves AR, Santos H (2006) The intricate side of systems biology. Proc Natl Acad Sci USA 103:9452–9457. DOI 10.1073/pnas.0603337103
Weitz JS, Benfey PN, Wingreen NS (2007) Evolution, interactions, and biological networks. PLoS Biol 5:e11. DOI 10.1371/journal.pbio.0050011
Wierling C, Herwig R, Lehrach H (2007) Resources, standards and tools for systems biology. Brief Funct Genomic Proteomic 6:240–251. DOI 10.1093/bfgp/elm027
Wong HH, Lemoine NR (2009) Pancreatic cancer: molecular pathogenesis and new therapeutic targets. Nat Rev Gastroenterol Hepatol 6:412–422. DOI 10.1038/Nrgastro.2009.89
Chapter 4
From Microarray to Biology
Mikhail Dozmorov and Robert E. Hurst
Abstract Microarrays have become an essential part of the molecular biologist's toolkit. However, the complexity of the data (thousands of genes, several replicates, or time points) poses a significant challenge for interpretation. Important questions do not have simple answers. Which experimental design should be used? How should several, often heterogeneous, microarray data sets be handled? Which software tools are available for microarray data analysis? How can statistics be used to identify reproducible results? Given the complexity of microarray data analysis, we focus on the best and most reliable techniques and tools. The reader is encouraged to explore other exciting possibilities in the microarray field and in other high-information/high-throughput techniques. This chapter provides an overview of current microarray technologies and offers some answers to the questions above.
Keywords Microarray · System analysis · Gene expression profile · Gene regulation · Gene networks · Gene expression analysis
R.E. Hurst (✉)
Departments of Urology and Biochemistry and Molecular Biology and Oklahoma University Cancer Institute, Oklahoma University Health Sciences Center, Oklahoma City, OK 73104, USA
e-mail: [email protected]
Supported in part by a grant from the National Institutes of Health, DK 069808
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_4, © Springer Science+Business Media, LLC 2010

4.1 Introduction
A system is a group of interacting, interrelated, or interdependent elements forming a complex whole. This means that altering any one element alters the entire system and therefore all the other elements. A further corollary is that the properties of the system as a whole cannot be determined from studying the elements in isolation, even if such isolation is possible. In other words, the whole is greater than the sum of the parts. The idea that biological organisms form integrated systems has a long history (Savageau 1976, Von Bertalanaffy 1933). However, until the completion of the Human Genome Project and the development of high-throughput
and high-information technologies, the idea of studying biological organisms as systems has mostly been a theoretical practice rather than one that could actually be carried out. In addition, the success of reductionist science has led many to take the dogmatic view that reductionist science is the only acceptable practice. Reductionist science represents the other view of science, namely that complex systems can be understood by studying the parts in isolation. In other words, the whole is just the sum of the parts. With the flowering of molecular biology that began in the 1970s, it became easy to study individual genes and learn how they interacted with other genes, one gene at a time. This approach provided huge amounts of information. With completion of the Human Genome Project, it became evident that we knew far less about how organisms functioned than we thought. Our knowledge about the genome turns out to be very unevenly distributed. Even today, about a third of the genome is completely unannotated, which is to say, about a third of genes have no known functions and no publications about them have appeared. Half the genome has, at most, only one publication associated with it, and often that is a paper reporting on the identification of thousands of genes. The reason for this uneven distribution of knowledge is that almost all of it was developed using the reductionist approach. One problem with the reductionist approach is that it requires a hypothesis, which is really just a guess. We are very bad at guessing, and the farther we step from what we already know, the worse becomes our ability to guess. Therefore, knowledge tends to follow knowledge, which leaves vast areas of the human genome and proteome unmapped. As mentioned above, all this is changing with new technology that now makes possible genome-wide studies. We can now determine how every gene in the genome responds to a particular biological challenge in a particular system. 
Entire genomes can now be sequenced in a week or two. Every mRNA can be identified, not just the ones we know about. This unparalleled increase in our ability to obtain data and information now makes “systems biology” possible. However, it calls for new ways of thinking. The advantage of the reductionist approach is that at least in theory it seeks to hold all biological variables constant except for the one under study. This approach greatly simplifies data analysis. If a parameter changes, it is assumed to be because of the challenge applied to the system. In fact, life is much more complicated. Even genetically identical mice were attached to different places in the placenta, have different birth orders, and may be treated in different orders so they are exposed to different levels of stress. Although these differences may seem trivial and probably are for the usual reductionist experiment, when the effect of a treatment on the entire genome is being investigated, these differences can lead to changes in gene expression that could be indistinguishable from those produced by the biological experiment. The challenge of using the new technologic advances is to produce biological insight, not just information, to identify mechanisms, not just changes, and to integrate the findings into a whole that can be used to predict the behavior of the system in response to challenges other than the one just investigated. Some call this “systems biology,” but the term itself is contentious. There is no accepted definition of “systems biology,” so we will tend to avoid the term. For the purposes of this chapter we will use it as meaning a set of steps and methods composed of theory,
analytic, or computational modeling to answer specific testable hypotheses about a biological system. The objective of this chapter is to provide the reader with both theoretical and practical knowledge of how huge amounts of data or information can be analyzed to produce biological insight and predictive models. We will also show that the reductionist and inductive models are not in competition: both are necessary, and working together these two modes of scientific enquiry can advance discovery much faster than either approach alone. When the two are integrated, the inductive approach identifies the important elements of the system, which should then be investigated in detail using the reductionist model.
4.2 Biology vs. Information
Biological systems perform a variety of tasks, which can be viewed as the transformation of incoming data into a response that maintains biological homeostasis. This information processing occurs at all levels of biological organization, from the whole organism to a single cell responding to changes in its microenvironment. Such information flow requires the coordinated action of several signaling and metabolic pathways, as well as the translocation and modification of signaling molecules. Figure 4.1 illustrates how hierarchies of interaction and organization exist in living organisms. Understanding how information flows biologically provides the basis for combining measurements of the system into an integrated understanding of biological
Fig. 4.1 Living organisms form a system with a hierarchy of organization that depends upon the state of all other levels of organization and on interactions with the external environment. At the center is cell signaling, which integrates the internal state of the cell with signaling molecules appearing at its surface and the connections that the cell forms with neighboring cells and the extracellular matrix. Cells are organized into tissues that are further organized into organs, which then form the organized individuals. Paracrine and endocrine interactions alter the cells within tissues and organs by altering cellular signaling. The external environment also plays a strong role in determining the state of the organism
processes at all levels and for building a deeper understanding of the underlying biology. The roots of our understanding of biological information lie in the 1860s, in Gregor Mendel’s famous experiments demonstrating that information about different traits passes from parent to offspring. Although Friedrich Miescher, a Swiss physician, isolated what is now known as DNA from cell nuclei in 1869, it was not until 1952 that Alfred Hershey and Martha Chase, in their now famous experiment, proved conclusively that DNA and not protein carried the genetic information (Hershey and Chase 1952). On April 25, 1953, “Nature” carried what is perhaps the most important paper in biology, the famous paper by James Watson and Francis Crick on the structure of DNA (Watson and Crick 1953). This short paper, which won the authors the Nobel Prize, contains one of the great understatements of science: “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.” Thus began the great revolution in molecular biology that inevitably brought us to the
[Fig. 4.2 diagram: Information Flow in Molecular Biology. 1965 panel: DNA, transcription, RNA, translation, protein, with feedback through the transcriptional complex. 2009 panel: the same flow with methylation, regulation of RNA processing by small RNAs, post-translational modification, and ubiquitination and degradation added.]
Fig. 4.2 Comparison of knowledge of information flow in 1965 and in 2009. In 1965, the pioneering work of Jacob and Monod and others had shown that DNA was transcribed to RNA, which was translated into protein, and that the rate of transcription was controlled by a feedback loop in which protein levels regulated the activity of the transcriptional complex. By 2009, the flow of information was understood to be much more complex. DNA itself had been discovered not to be constant: genes can be permanently silenced by methylation of the 5-carbon of cytosine in the sequence CpG in promoter regions. Proteins also carry information. Post-translational modification with phosphate, acetyl, glycosyl, and other groups can change a protein from active to inactive or vice versa. Protein levels are regulated by adding ubiquitin to the protein, which sends it to the proteasome, the cellular recycling bin. Genes can also be transcribed in alternate forms, or splice variants, in which entire exons may be omitted or alternate exons used instead. This process can produce proteins that lack or gain particular functions. The latest discovery is that processing of mRNA is more extensive than previously known and that small RNA molecules can silence entire sets of genes by degrading the message
current place. Following along shortly were the fathers of molecular biology, the great French scientists Jacques Monod and François Jacob, who with André Lwoff shared the 1965 Nobel Prize in Physiology or Medicine for their work showing that the expression of enzymes in bacteria was controlled by feedback mechanisms at the level of transcription into messenger RNA. Jacob and Monod and their doctoral student Jean-Pierre Changeux also showed how information flowed at the protein level by altering the three-dimensional structure of interacting proteins in a process called allosterism. By the mid-1960s, the basis for understanding how information is integrated into biology was in place. This was often referred to as the “central dogma” of molecular biology. Information flows from DNA to RNA and then to protein, which then feeds back to regulate transcription through the transcriptional complex, as is shown in Fig. 4.2. It all looked so simple. That it could be and was more complex was shown independently in 1970 by Howard Temin and David Baltimore, who won the Nobel Prize in Physiology or Medicine (with Renato Dulbecco) for their discovery of reverse transcriptase. In the ensuing decades, even greater complexity has emerged, as also illustrated in Fig. 4.2.
4.3 Microarray History
The first technique that was capable of measuring gene expression on a large scale was the microarray. With the completion of the Human Genome Project, it became possible to place probes for every gene in the genome on an array and therefore measure its level of transcription. Microarray technology was developed from the Southern blotting technique (Southern 1992) and added a new dimension to the field of molecular biology. It was comprehensively described in 1995 (Schena et al. 1995) and is now as widely used as Western blots. The accuracy and reliability of microarrays are acquiring “gold standard” status. Studies from multiple groups have shown about a 90% correlation between microarray and quantitative PCR results when the PCR amplifies the same sequence that the probe binds. When the two disagree, it is not clear that the PCR finding is always correct. Microarrays are widely used to assess the expression of thousands of genes simultaneously and can quickly provide an overview of which patterns of genes are active in a particular tissue or in cultured cells. The output of a microarray experiment is called a “gene expression profile” when the aim is to measure the abundance of mRNAs. Various platforms and applications of microarrays exist, from custom printed microarrays addressing several hundreds or thousands of probes to high-density commercial arrays containing the entire genome plus a number of splice variants. Gene expression profiling has advanced well beyond the simple goal of identifying a few differentially expressed genes to the point where it can be used to examine gene expression as a system property. The array format can also be applied to assess expression levels of proteins (antibody arrays) and microRNAs (miRNA profiling), to detect single nucleotide polymorphisms (SNPs) in DNA, and for comparative genomic hybridization (CGH) studies. At the genome level for SNP and other measurements, the format is the “tiling array” in which the
entire genome is loaded onto the array in the form of oligonucleotides. In the current chapter only microarrays used for mRNA profiling and approaches for their analysis will be considered.
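The agreement between microarray and quantitative PCR measurements mentioned in this section can be quantified in several ways, and the text does not say which statistic the cited studies used. As a minimal sketch, assuming Pearson's correlation coefficient is meant, the calculation could look like the following. The fold-change values are made up for illustration and are not data from any study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 fold changes for the same five genes (illustrative only).
array_log2fc = [1.8, -0.6, 0.2, 2.5, -1.1]
qpcr_log2fc = [2.1, -0.4, 0.1, 3.0, -1.5]

r = pearson(array_log2fc, qpcr_log2fc)
print(f"Pearson r = {r:.2f}")
```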
4.4 Microarray Technology
Although all microarrays rely on the principle of hybridization between nucleic acid strands, a number of different technologies have been exploited (Hardiman 2004). A typical microarray consists of a polymer-coated glass slide with densely packed probes affixed to its surface. The RNA sample, labeled with a fluorescent dye, is evenly distributed over the surface of the microarray and hybridizes with the probes on the slide. The intensity of fluorescence at each spot where a probe is attached is proportional to the amount of RNA bound. Microarrays fall into two groups, based on the type of probe: (1) cDNA arrays, in which probes are constructed from PCR products and can be up to a few thousand base pairs long, and (2) oligonucleotide arrays, which contain short (∼30 bp) or long (∼70 bp) oligonucleotide probes. The cDNA arrays are rapidly disappearing. All microarrays use a common approach in which the RNA is reverse transcribed and simultaneously labeled with a fluorescent dye, normally a member of the cyanine family (Cy3 or Cy5). Following hybridization, the slide is washed and laser light excites the fluorescent dye on the surface. Fluorescence intensity depends on the amount of initial RNA (and the quality of hybridization), and scanning the intensity of emission gives an estimate of the relative amounts of the different transcripts represented by probes on the slide. For example, the signal produced by 100 molecules will be twice as strong as the signal produced by 50 such molecules. Arrays can be either two color or one color. In the two-color array, the test sample is labeled with one dye and a reference sample with a dye of a different color. The relative intensities of the two colors show the difference in concentration of the message between the two samples. The one-color array uses a single dye to label only the test sample. The output is a direct measurement of the absolute concentration of the message.
Because the dynamic range is wider, the one-color array is gradually supplanting the two-color approach. The main manufacturers, listed in approximate order of their popularity, are Affymetrix, Illumina, Agilent, and Roche NimbleGen. Affymetrix uses multiple probes per gene transcript, selected from a sequence near the 3′ end. Each probe is a 25-nt oligonucleotide. The arrays are built up by a process similar to the printing of semiconductors. About 10 perfectly matching (PM) probes per gene sequence are used. Mismatch oligos were also used to account for non-specific hybridization, but their use is gradually becoming obsolete, and non-specific hybridization is now accounted for by specific probes enriched in GC content. Illumina uses a completely different technology and binds 50-nt probes to beads that have a “barcode” to identify the sequence bound to the bead. There are at least 30 beads containing multiple copies of the same probe in each analysis. Mismatch oligos are not used. Agilent Technologies uses a printed
From Microarray to Biology
array in which the 60-nt probes are printed onto small spots on a glass slide. The number of probes on a microarray can range from hundreds to millions, with the typical gene expression microarray containing 20,000–50,000 probes that cover the whole genome of an organism, with multiple probes related to the same gene or to its splice variants. Roche NimbleGen also uses a 60-mer printed array with several probes per target mRNA and between one and four replicates per target, depending upon the array format. Box 4.1 lists different gene annotation sources.
Box 4.1 Gene/Pathway Annotations
BioCarta – http://www.biocarta.com/genes/index.asp
BioGPS – http://biogps.gnf.org/#goto=search
iHOP: Information hyperlinked over proteins – http://www.ihop-net.org/UniPub/iHOP/
KEGG: Kyoto Encyclopedia of Genes and Genomes – http://www.genome.jp/kegg/
Panther: Protein ANalysis THrough Evolutionary Relationships – http://www.pantherdb.org/
Pathway Interaction Database – http://pid.nci.nih.gov/index.shtml
PubGene – http://www.pubgene.org/
SOURCE – http://smd.stanford.edu/cgi-bin/source/sourceBatchSearch
Synergizer – http://llama.med.harvard.edu/synergizer/translate/
The Cancer Cell Map – http://cancer.cellmap.org/cellmap/
4.5 Microarray Data Analysis Although the analysis of microarray data relies heavily on statistics, it is not solely a problem in statistics. In the end, it is a problem in biology, and statistical analysis is a tool that is used within a biological context. The biological context imposes constraints, but also provides two guiding principles. The first constraint is that microarrays can only detect changes in gene expression. For example, if a drug leads to phosphorylation that switches a protein from an active to an inactive form, then unless phosphorylation leads to changes in gene expression, there will be no discernible change in the gene expression of the system. Events that occur only at the protein level, and which may represent fundamental changes in the system, may therefore be undetectable. The second constraint is that low-expression genes may be difficult to measure because their level of expression may be near the noise in the system. The third constraint is that in a system, changes in gene expression may be distributed in a non-obvious manner throughout the system. The fourth constraint is that all microarray experiments are underdetermined, meaning there are far more unknowns (genes) than there are equations (samples). This means that there is no
M. Dozmorov and R.E. Hurst
unique solution or even an optimal solution, and different analysis methods will yield different solutions, all of which have some validity. Because biology is a science, its principles can help in finding solutions that at least are consistent with the laws of biology. The first guiding principle is that in any experiment, the expression of the vast majority of genes is not biologically modulated and therefore any variability in their expression is due to technical factors alone. The second guiding principle is that genes that are co-expressed or whose expression tracks each other during an experiment tend to function together. The sheer amount of data acquired from a microarray experiment requires at least some level of computer-aided automation to convert data to information and eventually to biological insight. However, such high-informational data along with a relatively high noise level present several problems for careful biological interpretation. These include the following.
4.6 Experimental Design A clear formulation of the scientific question ultimately leads to well-designed, elegant experiments, and microarray experiments are no exception. Several approaches to the design of microarray experiments exist, depending on what answers one wants to get from profiling thousands of genes. The main types of microarray experiments can be divided into four major groups: class comparison, class prediction, class discovery, and time series. The goal of class comparison is to identify genes behaving differently between groups of samples. The groups may represent different stages of development, cell cycle, or conditions before and after intervention. If the comparison is made between different biological subjects, then unpaired analysis is used, whereas assessment of the same subject (before and after treatment) calls for paired analysis. Class prediction is used to identify a group of genes that can predict membership of a sample in a class, such as patients with aggressive cancer vs. unaggressive cancer. Class discovery identifies whether several groups can be distinguished in the gene expression data set, for example, groups related to different types of a disease with similar clinical symptoms. Time series analysis is used when there is a need to look at the dynamics of gene expression changes following a treatment or intervention, or with age.
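The distinction between unpaired and paired analysis in class comparison can be illustrated with a small sketch. The expression values below are hypothetical, and the two helper functions (a Welch-type unpaired statistic and a paired statistic on per-subject differences) are one common choice rather than the specific procedure used in this chapter.

```python
import math
from statistics import mean, stdev

def unpaired_t(a, b):
    """Welch-type t statistic for two independent groups of samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def paired_t(before, after):
    """Paired t statistic computed on the per-subject differences."""
    diffs = [y - x for x, y in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical log2 expression of one gene in four subjects,
# measured before and after a treatment.
before = [7.0, 8.0, 9.0, 10.0]
after = [7.5, 8.4, 9.6, 10.5]

t_unpaired = unpaired_t(after, before)  # ignores which subject is which
t_paired = paired_t(before, after)      # removes between-subject variation

print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

Because the subjects differ far more from each other than each subject changes with treatment, the paired statistic is much larger than the unpaired one for the same data, which is why the design should determine the test.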
4.6.1 Replicates Microarrays undergo robust quality control, and the accuracy and precision of measurements continue to improve. Still, use of replicated experiments greatly helps to distinguish truly differentially expressed genes. Two types of replicates are used – technical and biological. Technical replicates are replicate analyses of the same sample. Many modern microarrays include technical replicates on the chip itself in the
form of more than one spot or bead containing the same gene probe. Illumina, for example, has at least 30 beads containing the same probe. Technical replicates are probably a waste of time and money because microarrays have improved to the point where reproducibility of technical replicates is better than ± 10%. Biological replicates are replicates of the experiment. For example, instead of pooling the RNAs from several animals treated with a drug, analyzing each animal separately provides biological replicates. The problem with replicates, of course, is cost. However, as the costs of arrays have fallen, the advantage of using replicates outweighs the costs. The number of replicates also is a function of the experimental design. If the experiment seeks to compare expression in one state with that in another, and the statistical test being used is a comparison of mean values by a t-test, then the standard considerations of statistical power govern the choice. On the other hand, tests based on variance may not require as many replicates for reasons discussed below. With biological replicates it is possible to capture the effects of biological variables that are not those of primary interest but that could affect how results could be reproduced in another laboratory. Consider as an example the time between when the nutrient medium is changed and the RNA is isolated from cells. When the entire genome is scanned, it is likely that this factor will affect gene expression. The problem is that all biological variables will affect the results of a gene array experiment, and with the entire genome being scanned, the probability is high that expression of some genes will be sensitive to such extraneous biological variables. 
Thus if two labs perform what is nominally the same experiment, but in one lab the cells are harvested for RNA isolation 18 hours after changing the nutrient medium but in another, they are harvested 26 hours later, then superimposed upon the primary effect (for example, response to a hormone or drug) will be the effect of the difference in timing. Undoubtedly factors such as these explain the common observation that gene lists derived at different times and places are not robust. However, if the investigator wishes to capture this variability, then the time between changing the nutrient medium and cell harvest will be deliberately varied over some range. Those genes that are insensitive to this variable will show low variability characteristic of the inherent reproducibility of the microarray. This variability is denoted as the technical variability. In contrast, those that are sensitive to timing will show higher variability, even within a nominally homogeneous group such as treated cells or controls. The matter of variability is dealt with in more detail below.
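The statistical-power consideration mentioned above for t-test comparisons can be sketched with the usual normal-approximation formula for a two-group comparison. The function name, the default significance level, and the default power are assumptions chosen for illustration; an exact calculation based on the t distribution would give slightly larger numbers for small groups.

```python
import math
from statistics import NormalDist

def replicates_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate number of biological replicates per group needed to
    detect a mean difference `delta` when the per-group standard
    deviation is `sigma` (two-sided test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Difference equal to one standard deviation, and twice that.
n_small_effect = replicates_per_group(delta=1.0, sigma=1.0)
n_large_effect = replicates_per_group(delta=2.0, sigma=1.0)
print(n_small_effect, n_large_effect)
```

The required number of replicates falls with the square of the effect size relative to the variability, so genes sensitive to extraneous biological variables (higher sigma) need more replicates to reach the same power.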
4.6.2 Data Preprocessing and Normalization Multiple factors can influence high-information microarray data – different efficiencies of reverse transcription, reagents used for labeling and hybridization, physical problems with the arrays, and even the temperature outside the lab. Therefore it is necessary to adjust microarray data from different arrays to eliminate low-quality or
questionable data and to be able to reliably compare relative gene expression levels among different experiments. With Affymetrix data, summarization is needed because transcripts are represented by multiple probes. For each gene, the background adjusted and normalized probe intensities need to be summarized into one value. This is done by software from Affymetrix. To understand how differentially expressed genes are identified, let us begin at the beginning and examine some raw data from microarray experiments. Figure 4.3 shows a histogram of the raw intensity data for a printed array, an Illumina array, and an Affymetrix array. Interestingly, these different array technologies provide somewhat different overall dynamics. The spotted array shows a distinctly bimodal pattern with a clear zero point. The figure also shows how expression is background subtracted and the expression data normalized to the system noise to facilitate deciding whether a given gene is expressed or not. Figure 4.4 illustrates the process of normalization. Multiple normalization methods exist: total intensity normalization, mean/median centering, linear regression, log mean centering normalization,
Fig. 4.3 Histogram of dynamics of gene expression. A histogram shows the number of objects with a given value as a function of that value. Two modes are apparent in the data from the spotted array (a). First is the mode around zero. This represents the unexpressed genes. The reason this is not exactly zero is there is some non-specific binding or trapping of fluorescent molecules that gives a very low background. The mode of the background represents this low value, and the value of the mode is subtracted from all other measurements. There is some dispersion to this background, that is some spots have a little more background, some a little less. The other array types (b – Illumina Human Ref 8, c – Affymetrix Mouse 430_2) do not show this bimodal distribution and show a continuous distribution instead
Fig. 4.4 Normalization and regression of data. (a) The left-most peak from the spotted array representing noise around zero is normally distributed. The standard deviation is calculated and is used to normalize the expression of all genes to units of noise. This facilitates filtering by expression. Choosing a minimum of three noise units to call a gene expressed means that 99.7% of the time, the call will be correct. Raising the value to 5 will eliminate some very low expression genes but will also reduce the variability of the data set. For other platforms such as Illumina or Affymetrix, which lack the bimodal histogram (Fig. 4.3), the left portion of the distribution is normally distributed and the standard deviation calculated from the left half is used for noise normalization. (b) Illustration of the effect of normalization of all the arrays against each other using robust linear regression with box-whisker plots of the data before (left panel) and after regression. It can be clearly seen that the median and quartile distributions of the data before regression are quite heterogeneous, while after regression the groups become more homogeneous
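The noise normalization described in this caption can be sketched as follows. The data are simulated, the function names are hypothetical, and the sketch assumes background-subtracted intensities whose noise peak is centered at zero, so that the values at or below zero form the "left half" from which the standard deviation is estimated.

```python
import math
import random

def noise_sd_from_left_half(values, center=0.0):
    """Estimate the noise standard deviation from values at or below the
    center of the noise peak (root mean square deviation from center)."""
    left = [v - center for v in values if v <= center]
    return math.sqrt(sum(d * d for d in left) / len(left))

def call_expressed(values, threshold=3.0, center=0.0):
    """Express each value in noise units and call it expressed when it
    reaches the threshold (three noise units by default)."""
    sd = noise_sd_from_left_half(values, center)
    return [(v - center) / sd >= threshold for v in values]

# Simulated array: 2000 unexpressed probes (pure noise) and 200 expressed ones.
rng = random.Random(0)
noise = [rng.gauss(0.0, 10.0) for _ in range(2000)]
signal = [rng.uniform(100.0, 1000.0) for _ in range(200)]
intensities = noise + signal

sd = noise_sd_from_left_half(intensities)
calls = call_expressed(intensities)
print(f"estimated noise SD = {sd:.2f}, probes called expressed = {sum(calls)}")
```

Raising `threshold` from 3 to 5 trades a few low-expression genes for fewer false calls among the noise probes, which is the trade-off the caption describes.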
quantile and Lowess normalization, and others. While many of these can be found in software packages such as Bioconductor, in our lab we prefer robust linear regression. This method is informed by the first guiding principle, namely that the expression of most genes does not change. If there is a little more or a little less
probe on one array as compared to another, then they will have different slopes. Normalization seeks to correct for such differences. Robust linear regression adjusts all the arrays in an experiment to have the same slope and weights points that are differentially expressed according to the inverse of their distance from the slope. This means that the effect of differentially expressed genes is down-weighted. Other methods such as log mean centering are simpler, but we believe that the most stable normalization is the robust linear regression.
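A minimal sketch of normalization by robust linear regression is given below. It assumes one particular weighting scheme, iteratively reweighted least squares with weights equal to the inverse of each point's distance from the current line, which matches the description above but is not necessarily the exact implementation used in the authors' laboratory. All data are simulated and all names are hypothetical.

```python
import random
from statistics import median

def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, my - slope * mx

def robust_line_fit(x, y, iterations=25, eps=1e-3):
    """Refit repeatedly, down-weighting points far from the current line."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        slope, intercept = weighted_line_fit(x, y, w)
        w = [1.0 / max(abs(yi - (intercept + slope * xi)), eps)
             for xi, yi in zip(x, y)]
    return slope, intercept

# Simulated log-scale data: the second array has a different slope and
# offset, and the first 25 of 500 genes are truly upregulated.
rng = random.Random(1)
reference = [rng.uniform(4.0, 14.0) for _ in range(500)]
array = [1.2 * r + 0.5 + rng.gauss(0.0, 0.05) for r in reference]
for i in range(25):
    array[i] += 3.0

slope, intercept = robust_line_fit(reference, array)
normalized = [(a - intercept) / slope for a in array]
residual_after = median(abs(n - r)
                        for n, r in zip(normalized[25:], reference[25:]))
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The unchanged majority of genes determines the fitted line, so the upregulated genes remain visible as outliers after normalization instead of distorting it.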
4.6.3 Identification of Genes of Interest The first step in analysis is to identify the set of “genes of interest” that exceed some statistical criterion that places them in a set that is different from those genes that express only technical variability. These “genes of interest” are then analyzed by a variety of experimental and computational techniques to develop an understanding of the underlying biology. To understand the effect of a drug, we may ask which genes are upregulated (increase in their expression) and which are downregulated (decrease in their expression) following drug administration. The simplest approach is to use a fold-change threshold based on what constitutes a statistically significant fold change in expression. With this approach, the overall dynamics of gene expression are examined to discover the minimum statistically significant difference in expression. For the Affymetrix array, this is actually about 1.4-fold. Therefore every gene with a higher fold change is considered to be differentially expressed. The well-accepted t-test is often used to compare two experimental situations. However, the p-value should be adjusted for the multiple comparison problem. The p-value is the probability of obtaining a result at least as extreme as the one observed if there were no real biological effect, that is, if the finding were due to chance alone. Generally for a single value, p 0).
– The methods are highly efficient when started close to the solution.
Local methods have been widely used in combination with the single and the multiple shooting approaches for the purpose of parameter estimation. However, the nonlinear character of biological dynamic models leads to the presence of several suboptimal solutions, and thus local methods may end up in a suboptimal solution. It has been argued that multiple shooting-based approaches can circumvent some local minima by allowing for discontinuous trajectories while searching for the global minimum.
Although this may be true in some cases, for example, for oscillatory systems, convergence to the global solution cannot be guaranteed (Balsa-Canto et al. 2008b). Moreover, in the presence of a bad fit, there is no way of knowing whether it is due to a wrong model formulation or is simply a consequence of local convergence.
5.3.2.2 Global Methods Global methods have emerged as the alternative for locating the global optimum. One of the simplest global methods is the Multistart method. Here, a large number of initial guesses are drawn from a distribution and each is subjected to a parameter estimation algorithm based on a local optimization approach. The smallest minimum found is then regarded as the global optimum. In practice, however, there is no guarantee of arriving at the global solution and the computational effort can be quite large. These difficulties arise because it is not clear a priori how many random initial guesses are necessary. Over the last decade more suitable techniques for the solution of multimodal optimization problems have been developed (see, e.g., Pardalos et al. 2000 for a review). The successful methodologies combine effective mechanisms for exploring the search space and for exploiting the knowledge already obtained by the search. Depending on how the search is performed and on the information they exploit, the alternatives may be classified into three major groups: deterministic, stochastic, and hybrid. Global deterministic methods in general take advantage of the problem’s structure and even guarantee convergence to the global minimum for some particular problems that satisfy specific conditions of smoothness and differentiability. Reviews of these methods can be found in Pinter (1996) or Floudas (2000). Several recent works propose the application of global deterministic methods for model calibration in the context of chemical processes, biochemical processes,
E. Balsa-Canto and J.R. Banga
metabolic pathways, and signaling pathways (Esposito and Floudas 2000; Gau and Stadtherr 2000; Lin and Stadtherr 2006; Polisetty et al. 2006). Although these methods are very promising and powerful, there are still limitations to their application, mainly due to the rapid increase of computational cost with the size of the considered system and the number of its parameters. As opposed to deterministic approaches, global stochastic methods do not require any assumptions about the problem’s structure. Stochastic global optimization algorithms make use of pseudorandom sequences to determine search directions toward the global optimum. This leads to an increasing probability of finding the global optimum during the runtime of the algorithm. The main advantage of these methods is that they rapidly arrive in the vicinity of the solution. The number of stochastic methods has rapidly increased in the last decades. The most successful approaches lie in one (or more) of the following groups: pure random search and adaptive sequential methods, clustering methods, population-based methods, or nature-inspired methods (Dréo et al. 2006). Figure 5.6 presents a classification of the most widely used ones.
Fig. 5.6 Classification of global stochastic methods. Sequential adaptive methods generate one solution at a time, every solution is used to generate search directions and new iterates are being generated considering their quality. Population-based methods generate a set of solutions at a time, the most promising are usually selected to generate new populations
Some of these strategies have been successfully applied to parameter estimation problems in the context of systems biology, see Mendes and Kell (1998) for the application of simulated annealing, Moles et al. (2003) and Rodriguez-Fernandez et al. (2006a) for the application of evolutionary search algorithms, or Sugimoto et al. (2005) for genetic programming.
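The Multistart method described at the beginning of this section can be sketched on a toy problem. The one-parameter objective below is hypothetical and was chosen only because it has one global and one local minimum; plain gradient descent stands in for the local optimizer, which in practice would be a dedicated nonlinear least-squares routine.

```python
import random

def objective(p):
    """Toy cost with a global minimum near p = -2 and a local one near p = 2."""
    return (p * p - 4.0) ** 2 + p

def gradient(p):
    return 4.0 * p * (p * p - 4.0) + 1.0

def local_search(p0, step=0.01, iterations=2000):
    """Fixed-step gradient descent; converges to the minimum of whichever
    basin the initial guess p0 lies in."""
    p = p0
    for _ in range(iterations):
        p -= step * gradient(p)
    return p

def multistart(n_starts, lower, upper, rng):
    """Run the local search from uniformly drawn initial guesses and
    return (cost, parameter) pairs sorted from best to worst."""
    results = []
    for _ in range(n_starts):
        p_final = local_search(rng.uniform(lower, upper))
        results.append((objective(p_final), p_final))
    return sorted(results)

rng = random.Random(0)
results = multistart(20, -3.0, 3.0, rng)
best_value, best_p = results[0]
n_global = sum(1 for value, _ in results if value < 0.0)
print(f"best p = {best_p:.3f}, cost = {best_value:.3f}, "
      f"{n_global} of {len(results)} starts reached the global basin")
```

Only the starts that happen to fall in the global basin find the global minimum, which is why the number of restarts needed is unknown a priori and why the smallest minimum found is not guaranteed to be global.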
Computational Procedures for Model Identification
Despite the fact that many stochastic methods can locate the vicinity of global solutions very rapidly, the computational cost associated with the refinement of the solution is usually very large. To surmount this difficulty, hybrid methods and metaheuristics that speed up these methodologies while retaining their robustness have recently been presented for the solution of parameter estimation problems (Balsa-Canto et al. 2008b; Rodriguez-Fernandez et al. 2006a, b). In particular, the Scatter Search metaheuristic (Rodriguez-Fernandez et al. 2006a; Egea et al. 2007) and the sequential hybrid with automatic switching (Balsa-Canto et al. 2008b) showed speed-ups of between one and two orders of magnitude with respect to the use of stochastic global methods.
5.3.2.3 An Illustrative Example: The Goodwin Model Parameter estimation for oscillating systems is usually more involved than for systems showing a transient behavior. A well-known model describing oscillations in biological applications is the model suggested by Goodwin (1965). It consists of the following set of ordinary differential equations:
\[
\begin{aligned}
\dot{x} &= \frac{a}{A + z^{\sigma}} - b x \\
\dot{y} &= \alpha x - \beta y \\
\dot{z} &= \gamma y - \delta z
\end{aligned}
\tag{5.7}
\]
Here, x may represent an enzyme concentration whose rate of synthesis is regulated by feedback control via a metabolite z. The intermediate product y regulates the synthesis of z. Oscillatory behavior is not a necessary characteristic of this set of equations. Different values for the parameters may result in limit cycle oscillations, damped oscillations, or monotonic convergence to a steady state. In fact, only a restricted range of parameter values results in oscillations. The following values have been used here: x(0) = 0.3617, y(0) = 0.9137, z(0) = 1.3934 for the initial conditions and a = 3.4884, A = 2.1500, b = 0.0969, α = 0.0969, β = 0.0581, γ = 0.0969, σ = 10, and δ = 0.0775 for the model parameters, resulting in oscillatory behavior. These values were used to generate noisy experimental data with a maximum variance of 10%. From the simulated data we aim to estimate the rate constants a, A, b, α, β, γ, and δ, whose global optimum value is, in this case, known. For the local optimization methods (single and multiple shooting) we used multistarts, where the initial guess of each restart is randomly chosen from the intervals [0, 5] (Box 5), [0, 10] (Box 10), and [0, 100] (Box 100) using a uniform distribution. For each box size 100 restarts are chosen. The results are summarized in Fig. 5.7, which shows the percentage of convergence to the global minimum, local minima, or failure for different box sizes. Both local methods encounter difficulties in finding the global optimum: single shooting rapidly gets trapped in local minima or diverges and converges to the global solution in only a small percentage of the runs, whereas multiple shooting performs better than single shooting in all cases, at the expense of higher computational costs.
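The Goodwin model with the parameter values and initial conditions quoted above can be simulated with a few lines of code. The sketch below uses a fixed-step fourth-order Runge–Kutta integrator; the integrator, the step size, and the simulation horizon are assumptions made for illustration and are not taken from the original study.

```python
# Parameter values and initial conditions quoted in the text.
PARAMS = dict(a=3.4884, A=2.1500, b=0.0969, alpha=0.0969,
              beta=0.0581, gamma=0.0969, sigma=10.0, delta=0.0775)
STATE0 = (0.3617, 0.9137, 1.3934)  # x(0), y(0), z(0)

def goodwin_rhs(state, p):
    """Right-hand side of the Goodwin equations (5.7)."""
    x, y, z = state
    dx = p["a"] / (p["A"] + z ** p["sigma"]) - p["b"] * x
    dy = p["alpha"] * x - p["beta"] * y
    dz = p["gamma"] * y - p["delta"] * z
    return (dx, dy, dz)

def rk4_step(rhs, state, h, p):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = rhs(state, p)
    k2 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), p)
    k3 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), p)
    k4 = rhs(tuple(s + h * k for s, k in zip(state, k3)), p)
    return tuple(s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(state0, p, t_end=200.0, h=0.1):
    """Integrate from t = 0 to t_end and return the list of states."""
    n_steps = int(round(t_end / h))
    trajectory = [state0]
    for _ in range(n_steps):
        trajectory.append(rk4_step(goodwin_rhs, trajectory[-1], h, p))
    return trajectory

trajectory = simulate(STATE0, PARAMS)
xs = [s[0] for s in trajectory]
print(f"x ranges from {min(xs):.3f} to {max(xs):.3f} over the simulation")
```

Adding noise to such a trajectory yields pseudo-experimental data of the kind used in this example; the cost function of the estimation problem is then the mismatch between these data and simulations for trial parameter values.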
Fig. 5.7 Summary of results for the example of the Goodwin oscillator using a local indirect method (generalized quasi-Newton) in combination with single and multiple shooting. Shown is the percentage of convergence to the global minimum, local minima, or failure of the optimization method using 100 restarts. The initial guess of each restart is randomly chosen from interval [0, 100] (Box 100) and [0, 1000] (Box 1000) using a uniform distribution
In the case of the global approaches two methods were tested: an evolutionary algorithm, SRES (Runarsson and Yao 2000), and differential evolution, DE (Storn and Price 1997). Both techniques have been quite successful in solving global optimization problems. SRES in particular has been widely used in the context of parameter estimation of biological dynamic models (for example, see Moles et al. 2003; Rodriguez-Fernandez et al. 2006b). For this particular example only DE was able to arrive at the global optimum, and only with robust (and thus slower) strategy parameters. This emphasizes the difficulties in finding the optimal solution for oscillatory systems even for global search strategies. Figure 5.8 shows representative convergence curves for DE and the hybrid to the global optimum of the Goodwin problem. The benefit of the hybrid can be appreciated by comparing the left panel (DE) with the right panel (hybrid). For box size 10 the hybrid converges almost 10 times faster, while for larger box sizes the benefit is even more pronounced. Figure 5.9 presents the fit corresponding to a typical local solution and the one corresponding to the global solution.
Fig. 5.8 Illustrative convergence curves for DE and the hybrid in the solution of the Goodwin model. Every figure presents five different convergence curves for the different box sizes considered. It should be noted that for Box 100, DE did not converge in all five runs, whereas the hybrid converged in every run and with significant reductions in computational cost
Fig. 5.9 Comparison of fits for a typical local solution and the global solution for the Goodwin model
5.4 Identifiability In most practical situations only a limited number of components in the network can be measured, usually far fewer than the number of components incorporated into the model; only some specific stimuli are available, and the system may be stimulated only in very specific ways; the number of sampling times is usually rather limited; and the experimental data are subject to substantial experimental noise. These constraints, together with the dynamic and usually nonlinear character of the models of cell signaling pathways, result in identifiability problems, i.e., in the impossibility of providing a unique solution for the parameters.
5.4.1 Structural Versus Practical Identifiability Parameter identifiability is concerned with the possibility of finding a unique value for the parameters. We can distinguish between structural and practical identifiability. Structural identifiability is a theoretical property of the model structure depending only on the observation and the input functions. The parameters of a model are structurally globally identifiable if, under ideal conditions of noise-free observations and error-free model structure and independently of the particular values of the parameters, they can be uniquely estimated from the designed experiment (Walter and Pronzato 1997). Practical identifiability analysis is concerned with whether it is possible to find a unique solution for the parameters given a particular pairing of model and experimental data. Although the two questions seem similar, there are several crucial differences:
– Structural analysis is performed in the absence of noise, whereas for the practical analysis the experimental error is crucial.
– In most cases a lack of structural identifiability will be unsolvable, and one should consider reformulating the model or aggregating measured quantities.
– A lack of practical identifiability will in general be solvable, provided the experimental constraints allow sufficiently rich experiments to be designed.
– Performing a structural identifiability analysis is by far the more complicated task; complex symbolic manipulations are required, and this might make a full analysis impossible for highly nonlinear large-scale models. Practical identifiability analysis is in general computationally intensive, but it is extremely helpful for assessing the reliability of parameter estimates and for comparing possible experimental designs.
5.4.2 Methods for Practical Identifiability Practical identifiability may be evaluated by the use of the observables’ sensitivity with respect to the parameters evaluated at the optimum. If the sensitivity functions are linearly dependent the model is not identifiable, and if they are nearly linearly dependent, parameters are highly correlated (Ljung 1999). Additionally one may compute the confidence or uncertainty region for the parameters and evaluate its properties. At this point, it is important to note that, in the presence of experimental error, there are several equivalent solutions defining the confidence region of the parameters, but this does not necessarily mean that the model is not practically identifiable. The shape and size of such a region determine whether or not practical identifiability is guaranteed. Confidence regions may be computed by using the Fisher Information Matrix or a Monte Carlo-based approach. Both methods are described in detail in Balsa-Canto et al. (2008a); here they are only briefly discussed.
5.4.2.1 Fisher Information Matrix The Fisher Information Matrix (FIM) provides a measure of the quantity and quality of the information of a given experiment for a particular value of the parameters. It is mathematically defined as follows:
\[
\mathrm{FIM} = \mathop{E}_{\tilde{y}\,|\,\mu} \left\{ \left[ \frac{dJ(\theta)}{d\theta} \right] \left[ \frac{dJ(\theta)}{d\theta} \right]^{T} \right\},
\tag{5.8}
\]
where E represents the expected value and μ is a value of the parameters hopefully close to their “real” value. The Cramér–Rao inequality provides a lower bound on the covariance of the estimators:
\[
C \geq \mathrm{FIM}^{-1}(\mu)
\tag{5.9}
\]
The diagonal elements of the covariance matrix are the variances of the parameter estimates, from whose square roots the confidence intervals for the parameters are obtained. The correlation between parameters i and j may be computed as
\[
Cr_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}}
\]
in such a way that if Cr_ij = ±1 the parameters i and j are fully correlated, whereas if Cr_ij = 0 the parameters are fully uncorrelated.
5.4.2.2 The Monte Carlo-Based Approach The Monte Carlo-based approach requires the solution of the parameter estimation problem under hundreds of realizations of the experimental data. This makes it possible to generate a cloud of solutions for the parameters that represents the confidence region. The numerical information about the parameter uncertainties is obtained here through the manipulation of the resulting matrix of estimated parameters (Balsa-Canto et al. 2008a). Figure 5.10 presents some illustrative examples of Monte Carlo-based confidence regions by pairs of parameters.
Fig. 5.10 Examples of Monte Carlo-based parameter confidence regions by pairs of parameters. (a) Spherical confidence region means that the parameters are highly uncorrelated and thus identifiable; (b) elongated ellipse means that the parameters are correlated, but if the size of the ellipse is small enough the parameters are still identifiable; (c) infinite elongated ellipse means that the parameters are highly correlated and may be non-identifiable, for example in this case only θ1 can be identified
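The Monte Carlo-based approach can be sketched on a deliberately simple stand-in problem. Instead of a dynamic model, the sketch below fits a straight line y = θ1 + θ2·t, for which the least-squares estimate has a closed form; the model, the sampling times, the noise level, and all names are assumptions made for illustration. The cloud of estimates is then summarized by its standard deviations and by the correlation coefficient between the two parameters.

```python
import random
from statistics import mean, stdev

def fit_line(t, y):
    """Closed-form least-squares estimate of (theta1, theta2) in y = theta1 + theta2*t."""
    mt, my = mean(t), mean(y)
    slope = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
             / sum((ti - mt) ** 2 for ti in t))
    return my - slope * mt, slope

def monte_carlo_estimates(t, theta_true, noise_sd, n_realizations, rng):
    """Re-estimate the parameters for many noisy realizations of the data."""
    estimates = []
    for _ in range(n_realizations):
        y = [theta_true[0] + theta_true[1] * ti + rng.gauss(0.0, noise_sd)
             for ti in t]
        estimates.append(fit_line(t, y))
    return estimates

def correlation(u, v):
    """Sample correlation coefficient between two lists of estimates."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)
    return cov / (stdev(u) * stdev(v))

rng = random.Random(0)
times = [float(i) for i in range(10)]
cloud = monte_carlo_estimates(times, (2.0, 0.5), 0.5, 500, rng)
theta1 = [e[0] for e in cloud]
theta2 = [e[1] for e in cloud]
corr = correlation(theta1, theta2)
print(f"theta1 = {mean(theta1):.3f} +/- {stdev(theta1):.3f}, "
      f"theta2 = {mean(theta2):.3f} +/- {stdev(theta2):.3f}, corr = {corr:.2f}")
```

In this stand-in the two parameters are strongly negatively correlated, so the cloud forms an elongated ellipse of the kind shown in panel (b); whether the parameters count as identifiable then depends on the size of the ellipse.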
5.4.3 Illustrative Example: The Brusselator Model The Brusselator model describes the dynamics of the following set of multimolecular reactions:
\[
\begin{aligned}
A &\xrightarrow{k_1} X \\
B + X &\xrightarrow{k_2} Y + D \\
2X + Y &\xrightarrow{k_3} 3X \\
X &\xrightarrow{k_4} E
\end{aligned}
\]
The corresponding mathematical formulation of the model is
\[
\begin{aligned}
\dot{A} &= -k_1 A \\
\dot{B} &= -k_2 B X \\
\dot{D} &= k_2 B X \\
\dot{E} &= k_4 X \\
\dot{X} &= k_1 A - k_2 B X + k_3 X^2 Y - k_4 X \\
\dot{Y} &= k_2 B X - k_3 X^2 Y
\end{aligned}
\tag{5.10}
\]
with the following initial conditions: A(0) = 1.5 mol/L and B(0) = 2 mol/L. X(t) and Y(t) have been measured, giving the data in Table 5.1. The objective is to compute the parameter values in the range [0, 1000] so as to make the model reproduce these data.
Table 5.1 Experimental data for the Brusselator example
Time (min)    X(t) (mol/L)    Y(t) (mol/L)
0             0               0
0.1           0.1288457       0.0131707
0.2           0.1953830       0.0453177
0.5           0.2841268       0.1845682
0.7           0.2928819       0.3031271
1.0           0.2857117       0.4132581
1.5           0.2034689       0.5674398
2.0           0.1413993       0.7156677
3.0           0.0649589       0.8098184
4.0           0.0237066       0.8369413
5.0           0.0095934       0.8628408
6.0           0.0034794       0.8512628
7.0           0.0012887       0.8675203
8.0           0.0004757       0.8934875
9.0           0.0001766       0.9055489
10.0          6.223e-5        0.8899257
First the problem is solved with a multistart of a local method, dn2fb, an adaptive nonlinear least squares algorithm (Dennis et al. 1981), resulting in the histogram of solutions shown in Fig. 5.11:
Fig. 5.11 Histogram of solutions achieved with a multistart of 100 runs of a local method (dn2fb) in the solution of the Brusselator example
From the histogram it is clear that there are several suboptimal solutions. Since we cannot know whether the best of them is the global solution, we solve the problem again with global methods. We tested different possibilities (DE, SRES, and ssm) and all of them converged to the global solution: k1 = 0.979, k2 = 1.003, k3 = 0.943, and k4 = 0.997. Since the experimental data are subject to experimental noise, we should now perform the identifiability analysis to compute the uncertainty in the parameter values. Using the Monte Carlo-based approach we obtain the following results: k1 = 0.9786 ± 0.0003; k2 = 1.0030 ± 0.0007; k3 = 0.9402 ± 0.0080, and k4 = 0.9971 ± 0.0005, revealing that the maximum uncertainty is under 1%. Figure 5.12 presents the ellipses obtained by pairs of parameters.
Fig. 5.12 Robust identifiability analysis of the Brusselator problem. Confidence ellipses by pairs of parameters
From the figures it may be concluded that the pairs [k1, k3], [k2, k3], and [k3, k4] are the most correlated, since k3 appears to be less identifiable than the other parameters. In any case, this correlation is of little practical importance because the ellipses are sufficiently small.
5.5 Optimal Experimental Design

Performing experiments to obtain a rich enough set of experimental data is a costly and time-consuming activity. The purpose of optimal experimental design (OED) is to devise the necessary dynamic experiments in such a way that the parameters are estimated from the resulting experimental data with the best possible statistical quality, which is usually a measure of the accuracy and/or correlation of the estimated parameters. In other words, based on model candidates, we seek to design the best possible experiments in order to facilitate parametric identification.

Mathematically, the OED problem can be formulated as a dynamic optimization problem where the objective is to find a set of inputs u, usually time-varying variables (stimuli) together with initial conditions, sampling times, and experiment durations, so as to maximize or minimize a cost function which is related to the FIM (Eq. 5.8). Since the FIM may be related to the confidence hyper-ellipsoid for the parameters, the different OED criteria provide information about its shape and size. The D-criterion, i.e., the maximization of the determinant of the FIM, minimizes the confidence ellipsoid's volume; the E-criterion, i.e., the maximization of the minimum eigenvalue of the FIM, minimizes the length of the largest axis; and the modified E-criterion, i.e., the minimization of the condition number of the FIM, minimizes the ratio of the largest to the smallest axis, seeking to make those ellipsoids as spherical as possible.

Numerical solutions for this dynamic optimization problem can be obtained using direct methods, which transform the original problem into a nonlinear programming (NLP) problem via parameterizations of the inputs and/or the states. However, because of the frequent non-smoothness of the cost functions, the use of gradient-based methods to solve this NLP might lead to local solutions. As in parameter estimation, global optimization methods are needed to ensure proper solutions.
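The three FIM-based criteria can be illustrated on a two-parameter case, where the eigenvalues of the symmetric FIM have a closed form. The sketch below is only an illustration: both matrices are hypothetical, not taken from any model in this chapter, and the function names are invented for the example.

```python
import math


def sym2x2_eigenvalues(fim):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = fim
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def oed_criteria(fim):
    """Scalar OED criteria computed from the FIM eigenvalues."""
    lam_min, lam_max = sym2x2_eigenvalues(fim)
    return {
        "D": lam_min * lam_max,           # determinant, to be maximized
        "E": lam_min,                     # smallest eigenvalue, to be maximized
        "modified_E": lam_max / lam_min,  # condition number, to be minimized
        # Confidence-ellipse semi-axes scale with 1/sqrt(eigenvalue),
        # so this is the ratio of the longest to the shortest axis.
        "axis_ratio": math.sqrt(lam_max / lam_min),
    }


fim_a = [[10.0, 0.0], [0.0, 1.0]]  # hypothetical FIM for design A
fim_b = [[2.5, 0.5], [0.5, 2.5]]   # hypothetical FIM for design B

crit_a = oed_criteria(fim_a)
crit_b = oed_criteria(fim_b)
print("A:", crit_a)
print("B:", crit_b)
```

With these made-up matrices the criteria disagree: design A has the larger determinant (smaller ellipse area), whereas design B has the larger minimum eigenvalue and the smaller condition number (a shorter longest axis and a rounder ellipse). This is why the choice of criterion matters in practice.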
5.5.1 Numerical Methods: The Control Vector Parameterization Approach

The control vector parameterization (CVP) method proceeds by dividing the duration of the experiment(s) into a number of elements and approximating the stimuli using low-order polynomials (Balsa-Canto et al. 2008a). Figure 5.13(a) presents the general CVP method. In practice, however, not all stimulation profiles are possible. In most cases, only sustained, pulse-wise, or stair-wise stimulations can be applied (Fig. 5.13(b)).
Fig. 5.13 (a) Illustrative representation of the CVP approach. (b) Typical stimulation profiles in practice
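As a minimal sketch of the CVP idea with zero-order (piecewise-constant) polynomials, the function below turns a vector of stimulus levels and switching times into a stimulus profile u(t). The function name, the argument names, and the numerical values are hypothetical; they only illustrate how a finite decision vector can encode a pulse-wise or stair-wise stimulation.

```python
from bisect import bisect_right


def piecewise_constant(levels, switch_times):
    """Return u(t) that holds levels[i] on the i-th element of the time grid.

    switch_times are the interior element boundaries, so
    len(levels) == len(switch_times) + 1.
    """
    if len(levels) != len(switch_times) + 1:
        raise ValueError("need one more level than switching times")
    if list(switch_times) != sorted(switch_times):
        raise ValueError("switching times must be increasing")

    def u(t):
        # bisect_right counts how many boundaries lie at or before t
        return levels[bisect_right(switch_times, t)]

    return u


# Hypothetical pulse-wise stimulus: on for 0 <= t < 5, off for 5 <= t < 20, on again after.
pulse = piecewise_constant(levels=[1.0, 0.0, 1.0], switch_times=[5.0, 20.0])

# The quantities an NLP solver would adjust: three levels and two switching times.
decision_vector = [1.0, 0.0, 1.0, 5.0, 20.0]
```

A sustained stimulation is the special case with a single level and no switching times, and a stair-wise profile simply uses several distinct levels.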
Once the CVP has been applied, the general OED problem is transformed into an NLP, the decision variables being the amount of stimulation plus the experimental sampling times, the duration of the experiments, and the initial conditions. Note that the solution of this NLP requires a suitable NLP solver and an IVP solver, as in the parameter estimation problem (Fig. 5.14). Regarding the NLP solver, both local and global methods may be selected. However, several tests (Banga et al. 2002; Balsa-Canto et al. 2008a) revealed the multimodality of the OED problem; thus the use of global techniques, as detailed above, is necessary.

Fig. 5.14 Iterative procedure for the solution of the optimal experimental design problem

Regarding the IVP solver, several alternatives exist: to use symbolic manipulation or automatic differentiation to obtain the parametric sensitivity equations and solve them together with the system dynamics using a standard IVP solver, or to exploit a backward differentiation formula (BDF)-based method so as to simultaneously solve the original initial value problem and the parametric sensitivities. Typical solvers are ODESSA (Leis and Kramer 1988) and the more recently developed CVODES, included in the SUNDIALS suite of IVP solvers (Hindmarsh et al. 2005).
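The first alternative, solving the sensitivity equations together with the system dynamics, can be sketched on a one-state toy problem. For dx/dt = -k x, differentiating with respect to k gives the sensitivity equation ds/dt = -x - k s with s(0) = 0, where s = ∂x/∂k. The code below is an illustration under these assumptions only: it uses a hand-written Runge-Kutta scheme rather than a BDF method, and it is unrelated to the ODESSA or CVODES interfaces.

```python
import math


def augmented_rhs(state, k):
    """x' = -k x together with its sensitivity s = dx/dk: s' = -x - k s."""
    x, s = state
    return (-k * x, -x - k * s)


def rk4(state, k, t_end, n_steps):
    """Integrate the augmented system with classical fixed-step RK4."""
    h = t_end / n_steps
    for _ in range(n_steps):
        s1 = augmented_rhs(state, k)
        s2 = augmented_rhs(tuple(y + h / 2 * d for y, d in zip(state, s1)), k)
        s3 = augmented_rhs(tuple(y + h / 2 * d for y, d in zip(state, s2)), k)
        s4 = augmented_rhs(tuple(y + h * d for y, d in zip(state, s3)), k)
        state = tuple(
            y + h / 6 * (a + 2 * b + 2 * c + d)
            for y, a, b, c, d in zip(state, s1, s2, s3, s4)
        )
    return state


k, x0, t_end = 0.8, 2.0, 3.0  # hypothetical values
x_num, s_num = rk4((x0, 0.0), k, t_end, n_steps=3000)

# Analytical solution for comparison: x = x0 exp(-k t), dx/dk = -t x0 exp(-k t).
x_exact = x0 * math.exp(-k * t_end)
s_exact = -t_end * x_exact
print(abs(x_num - x_exact), abs(s_num - s_exact))
```

The sensitivities obtained in this way are the ingredients of the FIM, which is why every evaluation of an OED criterion requires one such augmented integration.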
5.5.2 An Illustrative Example: The NFκB Regulatory Module

NFκB is implicated in several common diseases, especially those with inflammatory or autoimmune components, such as septic shock, cancer, arthritis, diabetes, and atherosclerosis. Mathematical models connected to experimental data have played a key role in revealing forms of regulation of NFκB signaling and the underlying molecular mechanisms. Several models have been proposed that include additional feedback loops, cross talk with other pathways, and NFκB oscillations.

The model to be considered, proposed by Lipniacki et al. (2004), involves two-compartment kinetics of the activators IKK and NFκB, the inhibitors A20 and IκBα, and their complexes. It is assumed that IKK exists in any one of three forms: neutral (IKKn), active (IKKa), or inactive (IKKi). In the presence of an extracellular signal such as TNF, IKK is transformed into its active (phosphorylated) form. In this form it is capable of phosphorylating IκBα, and this leads to its degradation. In resting cells, the unphosphorylated IκBα binds to NFκB and sequesters it in an inactive form in the cytoplasm. As a result, degradation of IκBα releases the second activator, NFκB. The free NFκB enters the nucleus and upregulates transcription of the two inhibitors IκBα and A20 and of a large number of other genes, including the control gene cgen. The newly synthesized IκBα again inhibits NFκB, while A20 inhibits IKK by catalyzing its transformation into another inactive form in which it is no longer capable of phosphorylating IκBα. The scheme of the pathway and the corresponding mathematical model are presented in Fig. 5.15. It should be noted that the model consists of 15 nonlinear ordinary differential equations with 30 parameters.

Fig. 5.15 Scheme of the NFκB regulatory module and the corresponding mathematical model (Lipniacki et al. 2004)

In their paper, Lipniacki et al. (2004) fixed some of the model parameters by using values from the literature. To fit the unknown parameters they used experimental data from previous works by Lee et al. (2000) and Hoffmann et al. (2002). Lee et al. (2000) considered wild-type cells subject to a persistent TNF signal and collected data for A20 mRNA (A20t), total IKK (IKKn + IKKa + IKKi), activated IKK (IKKa), total cytoplasmic IκBα (IκBα + (IκBα|NF-κB)), IκBα mRNA (IκBαt), and free nuclear NFκB (NFκBn). Hoffmann et al. (2002) measured the responses of the free nuclear NFκB (NFκBn) and the cytoplasmic IκBα (IκBα + (IκBα|NF-κB)) in wild-type cells under persistent and pulse-wise TNF stimulation. The combination of these two types of experiments will be referred to from now on as ESLH (experimental scheme by Lee et al. and Hoffmann et al.).

Lipniacki et al. (2004) concluded in their work that several different values of parameters were capable of reproducing the data. Let us analyze this in more detail through the practical identifiability analysis. Consider that we want to estimate θ = [c3a; c4a; c5; k1; k3; kprod; i1; i1a]^T; all other parameters are known. We assume that we can perform a battery of hundreds of experiments (1,000 experiments in this case) under such experimental conditions and, furthermore, that we get experimental data with zero-mean Gaussian noise with unknown varying standard deviation but with a maximum corresponding to 10%. To perform the quantitative analysis according to the Monte Carlo approach, the model calibration problem was solved for all sets of data by using Scatter Search (SSm). Results (Table 5.2) reveal that with the experimental scheme ESLH the mean value of the cloud of solutions for the parameters is close to the optimal value, but most of the confidence regions for the parameters are over 16% and up to 23%. Figure 5.16 presents the confidence regions for the worst (k3) and the best case (c3a).

Table 5.2 Practical identifiability analysis for the experimental scheme ESLH

Parameter    Nominal value    Mean value      Uncertainty (%)
c3a          4 × 10^-4        4.06 × 10^-4    6.7
c4a          0.5              0.49            8.1
c5           3 × 10^-4        2.86 × 10^-4    19.0
k1           2.5 × 10^-3      2.49 × 10^-3    19.7
k3           1.5 × 10^-3      1.42 × 10^-3    23.1
kprod        2.5 × 10^-5      2.5 × 10^-3     7.0
i1           2.5 × 10^-3      2.44 × 10^-3    16.8
i1a          1 × 10^-3        1.06 × 10^-3    19.3
Fig. 5.16 Robust confidence regions for the worst (k3) and the best case (c3a) for the available experimental scheme (ESLH). Histograms of the solutions achieved with the Monte Carlo-based approach. μ represents the mean value achieved for the corresponding parameter and ∗ represents the optimum expected value
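The Monte Carlo idea behind Table 5.2 and Fig. 5.16 can be sketched on a deliberately trivial model: generate many noisy pseudo-experiments, re-estimate the parameter for each, and summarize the resulting cloud. Everything here is an assumption made for illustration: the linear model y = θt, the sampling times, the fixed 10% relative noise (the chapter allows a varying standard deviation up to that maximum), the closed-form estimator used in place of SSm, and the 1.96-sigma half-width used as the uncertainty measure.

```python
import random
import statistics


def fit_slope(times, values):
    """Closed-form least-squares estimate of theta in y = theta * t."""
    return sum(t * y for t, y in zip(times, values)) / sum(t * t for t in times)


def monte_carlo_cloud(theta_true, times, rel_noise, n_experiments, seed=0):
    """Re-estimate theta from n_experiments noisy pseudo-data sets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_experiments):
        noisy = [
            theta_true * t * (1.0 + rng.gauss(0.0, rel_noise))
            for t in times
        ]
        estimates.append(fit_slope(times, noisy))
    return estimates


times = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sampling times
cloud = monte_carlo_cloud(theta_true=2.5e-3, times=times,
                          rel_noise=0.10, n_experiments=1000)

mean = statistics.mean(cloud)
spread = statistics.stdev(cloud)
uncertainty_percent = 100.0 * 1.96 * spread / mean  # one possible 95% measure
print(f"mean = {mean:.3e}, uncertainty = {uncertainty_percent:.1f}%")
```

For the NFκB module each re-estimation is a full global optimization over eight parameters, which is what makes the analysis computationally demanding; the histogram of `cloud` plays the role of the histograms in Fig. 5.16.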
Fig. 5.17 Robust confidence ellipses by pairs of parameters for the available experimental scheme (ESLH). The most (k1 and c3a) and the least (k3 and c5) correlated pairs of parameters
Looking at the confidence regions in more detail, it may be concluded that pulse-wise stimulation is capable of decorrelating parameters. The eccentricity values by pairs of parameters range from 1.12 for the least correlated pair (k3 and c5), very close to the ideal minimum of 1, to 3.35 for the most correlated pair (k1 and c3a). Figure 5.17 shows the corresponding confidence ellipses. Note, in addition, that the mean eccentricity corresponds to a value of 2.00.

In order to improve the identifiability properties we considered a parallel-sequential optimal experimental design, in such a way that the information reported by the experimental scheme ESLH was taken into account by introducing the experiments in the Fisher Information Matrix. New experiments were designed within the following experimental constraints:

– Initial conditions correspond to those for wild-type cells after resting.
– The TNF stimulus is activated and may be pulse-wise.
– The maximum number of sampling times will be 15 and they may be optimally located.
– The experimental noise corresponds to a maximum variance of 10%.
– The reference value for the parameters in the FIM corresponds to the mean value obtained for ESLH.

Regarding the FIM-based criteria for optimal experimental design, the D- and E-optimality criteria are usually the preferred ones. For this particular example, and in view of the eccentricity values obtained for the experimental scheme ESLH, D-optimality seemed to be the most suitable, since it promotes the reduction of the expected uncertainty, even though the parameters may end up being more correlated. Figure 5.18 presents the resulting experimental scheme ESOPT.

Fig. 5.18 Overall experimental scheme: two experiments included in the ESLH plus the optimally designed pulse-wise experiment

Results (Table 5.3) reveal that the addition of one optimally designed experiment led the mean value to practically coincide with the nominal (“real”) value. In addition, the expected uncertainty was, except for k1, less than 16%, with substantial improvements as compared to the confidence regions found for the experimental scheme ESLH. The maximum eccentricity of 3.53 corresponds to the pair (k1 and c3a), as in ESLH, whereas the minimum of 1.08 now corresponds to the pair (i1 and i1a); the mean eccentricity is 2.03. Figure 5.19 presents a comparison of the expected uncertainties for the most and the least correlated pairs of parameters.

Table 5.3 Practical identifiability analysis after optimal experimental design

Parameter    Nominal value    Mean value      Uncertainty (%)
c3a          4 × 10^-4        4.0 × 10^-4     4.9
c4a          0.5              0.5             3.5
c5           3 × 10^-4        3.02 × 10^-4    14.2
k1           2.5 × 10^-3      2.5 × 10^-3     17.2
k3           1.5 × 10^-3      1.51 × 10^-3    14.0
kprod        2.5 × 10^-5      2.5 × 10^-3     5.3
i1           2.5 × 10^-3      2.5 × 10^-3     13.9
i1a          1 × 10^-3        1.0 × 10^-3     11.2
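The improvement brought by the designed experiment can be read directly from the uncertainty columns of Tables 5.2 and 5.3. The short sketch below only tabulates those published percentages; the variable names are illustrative.

```python
params = ["c3a", "c4a", "c5", "k1", "k3", "kprod", "i1", "i1a"]
eslh = [6.7, 8.1, 19.0, 19.7, 23.1, 7.0, 16.8, 19.3]    # Table 5.2, uncertainty (%)
esopt = [4.9, 3.5, 14.2, 17.2, 14.0, 5.3, 13.9, 11.2]   # Table 5.3, uncertainty (%)

# Absolute reduction in percentage points for each parameter.
reduction = {p: before - after for p, before, after in zip(params, eslh, esopt)}
largest_gain = max(reduction, key=reduction.get)
still_at_or_above_16 = [p for p, after in zip(params, esopt) if after >= 16.0]

for p in params:
    print(f"{p:>5}: -{reduction[p]:.1f} percentage points")
print("largest gain:", largest_gain)
print("still at or above 16%:", still_at_or_above_16)
```

Every parameter improves, the largest gain is for k3 (the worst case under ESLH), and k1 is the only parameter that remains above 16%, in agreement with the text.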
Fig. 5.19 Comparison of robust uncertainty ellipses by pairs of parameters for ESLH and ESOPT. The most (k1 and c3a) and the least (i1 and i1a) correlated pairs of parameters
5.6 Overview

In this chapter we have focused on three critical steps of the model building loop: parameter estimation, identifiability analysis, and optimal experimental design. The parameter estimation and the optimal experimental design problems can be formulated as nonlinear optimization problems that may be solved by using nonlinear programming methods. We have presented a rough overview of available optimization techniques with special emphasis on the necessity of using global optimization methods to ensure proper solutions for these problems. In addition, the identifiability analysis was presented as a way to measure the quality of the parameter estimates or to anticipate the quality of a given experimental design.

The Goodwin oscillator model helped us to illustrate the benefits of using global optimization methods for parameter estimation. The Brusselator model was used to illustrate the use of identifiability analysis to measure the quality of the parameter estimates. The NFκB regulatory module was then used to illustrate identifiability problems and the usefulness of optimal experimental design to iteratively improve the quality of parameter estimates.

Acknowledgments The authors acknowledge financial support from Spanish MICINN project “MultiSysBio,” ref. DPI2008-06880-C03-02.
References

Balsa-Canto E, Alonso AA, Banga JR (2008a) Computational procedures for optimal experimental design in biological systems. IET Syst Biol 2(4):163–172
Balsa-Canto E, Peifer M, Banga JR et al (2008b) Hybrid optimization method with general switching strategy for parameter estimation. BMC Syst Biol 2:26
Banga JR, Versyck KJ, Van Impe JF (2002) Computation of optimal identification experiments for nonlinear dynamic process models: a stochastic global optimization approach. Ind Eng Chem Res 41:2425–2430
Bock H (1981) Numerical treatment of inverse problems in chemical reaction kinetics. In: K E, P D, W J (eds) Modelling of chemical reaction systems. Springer, New York
Bock H (1983) Recent advances in parameter identification techniques for ordinary differential equations. In: P D, E H (ed) Numerical treatment of inverse problems in Differential and Integral Birkhäuser
Cho KH, Wolkenhauer O (2003) Analysis and modelling of signal transduction pathways in systems biology. Biochem Soc Trans 31:1503–1509
Dennis JE, Gay DM, Welsch RE (1981) An adaptive nonlinear least-squares algorithm. ACM Trans Math Software 7(3)
Dréo J, Petrowski A, Taillard E, Siarry P (2006) Metaheuristics for hard optimization. Methods and case studies. Springer, New York
Egea JA, Rodriguez-Fernandez M, Banga JR, Martí R (2007) Scatter search for chemical and bioprocess optimization. J Glob Opt 37(3):481–503
Esposito WR, Floudas C (2000) Global optimization for the parameter estimation of differential-algebraic systems. Ind Eng Chem Res 39:1291–1310
Fletcher R (1987) Practical methods of optimization. Wiley, UK
Floudas CA (2000) Deterministic global optimization: theory, methods and applications. Kluwer Academics, Netherlands
Gau CY, Stadtherr MA (2000) Reliable nonlinear parameter estimation using interval analysis: error in variable approach. Comp Chem Eng 24:631–637
Goodwin BC (1965) Oscillatory behavior in enzymatic control processes. Adv Enz Regul 3:425–428
Hairer E, Nørsett SP, Wanner G (1993) Solving ordinary differential equations I: Nonstiff problems, 2nd edn. Springer, Berlin
Hairer E, Wanner G (1996) Solving ordinary differential equations II: Stiff and differential-algebraic problems, 2nd edn. Springer, Berlin
Hindmarsh AC, Brown PN, Grant KE, Lee SL, Serban R, Shumaker DE, Woodward CS (2005) Sundials: Suite of nonlinear and differential/algebraic equation solvers. ACM Trans Math Softw 31(3):363–396
Hoffmann A, Levchenko A, Scott ML, Baltimore D (2002) The IκB-NF-κB signaling module: temporal control and selective gene activation. Science 298:1241–1245
Janes KA, Lauffenburger DA (2006) A biological approach to computational models of proteomic networks. Curr Op Chem Biol 10:73–80
Klipp E, Liebermeister W (2006) Mathematical modelling of intracellular signaling pathways. BMC Neurosci 7. doi:10.1186/1471-2202-7-S1-S10
Lee EG, Boone DL, Chai S, Libby SL, Chien M, Lodolce JP, Ma A (2000) Failure to regulate TNF-induced NF-κB and cell death responses in A20-deficient mice. Science 289:2350–2354
Leis JR, Kramer MA (1988) ODESSA: an ordinary differential-equation solver with explicit simultaneous sensitivity analysis. ACM Trans Math Soft 14:61–67
Lin Y, Stadtherr MA (2006) Deterministic global optimization for parameter estimation of dynamic systems. Ind Eng Chem Res 45:8438–8448
Lipniacki T, Paszek P, Brasier AR et al (2004) Mathematical model of NFκB regulatory module. J Theor Biol 228:195–215
Ljung L (1999) System identification: theory for the user. Prentice Hall, NJ
Mendes P, Kell DB (1998) Nonlinear optimization of biochemical pathways: Applications to metabolic engineering and parameter estimation. Bioinformatics 14:869–883
Moles C, Mendes P, Banga J (2003) Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res 13:2467–2474
Nash SG, Sofer A (1996) Linear and nonlinear programming. McGraw-Hill, New York, NY
Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313
Pardalos P, Romeijn H, Tuy H (2000) Recent developments and trends in global optimization. J Comp App Math 124:209–228
Peifer M, Timmer J (2007) Parameter estimation in ordinary differential equations for biochemical processes using the method of multiple shooting. IET Syst Biol 1(2):78–88
Pinter J (1996) Global optimization in action. Continuous and Lipschitz optimization: Algorithms, implementations and applications. Kluwer, Netherlands
Polisetty P, Voit E, Gatzke E (2006) Identification of metabolic system parameters using global optimization methods. Theor Biol Med Mod 3:4
Rodriguez-Fernandez M, Egea JA, Banga JR (2006a) Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems. BMC Bioinform 7:483
Rodriguez-Fernandez M, Mendes P, Banga JR (2006b) A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Bio Syst 83(2–3):248–265
Runarsson T, Yao X (2000) Stochastic ranking for constrained evolutionary optimization. IEEE Trans Evol Comp 564:284–294
Schittkowski K (2002) Numerical data fitting in dynamical systems. Kluwer, Netherlands
Seber GAF, Wild CJ (1989) Nonlinear regression. Wiley series in probability and mathematical statistics. Wiley, New York
Storn R, Price K (1997) Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J Global Optim 11:341–359
Sugimoto M, Kikuchi S, Tomita M (2005) Reverse engineering of biochemical equations from time-course data by means of genetic programming. BioSystems 80:155–164
Swameye I, Müller T, Timmer J et al (2003) Identification of nucleocytoplasmic cycling as a remote sensor in cellular signaling by data-based modeling. Proc Natl Acad Sci 100(3):1028–1033
Vera J, Balsa-Canto E, Wellstead P et al (2007) Power-law models of signal transduction pathways. Cell Signal 19:1531–1541
Walter E, Pronzato L (1997) Identification of parametric models from experimental data. Springer, New York
Wolkenhauer O, Ullah M, Kolch W et al (2004) Modeling and simulation of intracellular dynamics: Choosing an appropriate framework. IEEE Trans Nanobiosci 3(3):200–207
Chapter 6
Assembly of Logic-Based Diagrams of Biological Pathways

Tom C. Freeman
Abstract The networks of molecular interactions that underpin cellular function are highly complex and dynamic. The topology, behaviour and logic of these systems, even on a relatively small scale, are far too complicated to understand intuitively. Furthermore, enormous amounts of systems-level data pertaining to the nature of genes and proteins, and their potential cellular interactions, have now been generated, but we struggle to interpret these data. There is therefore general agreement amongst biologists about the need for good pathway diagrams. However, the challenge of creating models that reflect our current understanding of these systems and displaying this information in an intuitive and logical manner is not trivial. The modified Edinburgh pathway notation (mEPN) scheme is founded on a notation system originally devised a number of years ago, but through use has now been refined extensively. This has been primarily driven by the author’s attempts to produce process diagrams for a diverse range of biological pathways, particularly with respect to immune signalling in mammals. Whilst requiring a considerable effort, the assembly of pathway models provides a resource for training, literature/data interpretation, computational pathway modelling and hypothesis generation. Here I discuss the mEPN scheme, its symbols and rules for its use and thereby hope to provide a coherent guide to those planning to construct pathway diagrams of their biological systems of interest.

Keywords Pathway modelling · Notation scheme · Process diagram · Graphical representation
T.C. Freeman
The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin, Midlothian EH25 9PS, UK
e-mail: [email protected]

S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_6, © Springer Science+Business Media, LLC 2010
6.1 Introduction

Complete genome sequencing of hundreds of pathogenic and model organisms over the last decade has provided us with the parts list of life (Janssen et al. 2003). At the same time enormous amounts of data pertaining to the nature of genes and proteins and their potential cellular interactions have now been generated using new analytical platforms including, but not limited to, gene coexpression analysis, yeast two-hybrid assays, mass spectrometry and RNA interference (Reed et al. 2006). With the advent of next generation sequencing technologies and advances in other fields, this deluge of data on biological systems only looks set to continue and increase. Whilst the data from these ‘omics’ platforms can be overwhelming, these analyses finally allow us to open a window on to the complex cellular and molecular networks that underpin life (Kitano 2002; Nurse 2003). The main problem we now face is how to interpret all these data and use them to better understand the structure and function of biological pathways in health and disease (Cassman 2005).

Our existing knowledge of biological pathways and systems is still largely based on the painstaking efforts of countless investigators whose work has been, and continues to be, focused on a specific cell type and the function of one or a small number of proteins within that cell. Their studies have produced our current framework of understanding of how proteins and genes interact with each other to form the metabolic, signalling and effector systems that together regulate biological form and function. Much of this work, however, remains locked inside the literature where specific insights into the functional role of cellular components are subject to the semantic irregularities that come with their description by different authors. As a result, the details of a given pathway have traditionally been known only to a few experts in the field whose research is often focused on a single protein and its immediate interaction partners within that pathway. These pathways are understood more generally by their description in reviews and diagrams produced on an ad hoc basis.

To a certain degree the concept of a biological pathway is an artificial construct and in reality there is only one big integrated network of molecular interactions operating within a cell. However, it is still useful to think in terms of pathways as being connected modules of this network. As such, a pathway may be considered to consist of a specific biological input or event that initiates a series of directional interactions between the components of a system leading to an appropriate shift in cellular activity. In other words, a biological pathway might be viewed as starting from the engagement of a ligand with its receptor and extending to all the downstream consequences of that interaction. This is not to say that the cellular components utilised for such a pathway will be necessarily unique to it, only that they are connected in this context.

As we begin to appreciate the complexity of these molecular networks, their topology and interconnectivity, there is increasing interest in moving away from the traditional gene-centric view of life to a systems- or pathway-level appreciation of biological function. To do this we need to create models of these pathways. Pathway diagrams act as a visual representation of known networks of interaction between cellular components, and modelling them is fundamental to our understanding of them. At their best, formalised diagrams of biological pathways act as a clear and concise visual representation of the known interactions between cellular components. However, the task of assimilating the large amounts of available data on a particular pathway and representing this information in an intuitive manner remains an ongoing challenge. Indeed, there are numerous different ways that one can represent a pathway and pathway diagrams are currently available in a plethora of different forms. Using the term in the broadest sense, they can be a picture that accompanies a review article, wall charts distributed by journals and companies, small schematic diagrams used to support mathematical modelling efforts or network graphs reflecting all known protein interactions based on the results of large-scale interaction studies or literature mining. As such, pathway models are an invaluable resource for interpreting the results of genomics studies (Antonov et al. 2008; Arakawa et al. 2005; Babur et al. 2008; Cavalieri and De Filippo 2005; Dahlquist et al. 2002; Ekins et al. 2007; Pandey et al. 2004) and for performing computational modelling of biological processes (Eungdamrong and Iyengar 2004; Kwiatkowska and Heath 2009; Ruths et al. 2008; van Riel 2006; Watterson et al. 2008), and are fundamentally important in defining the limits of our existing knowledge.

To support these efforts there are also a growing number of databases that serve up a wide range of pathways which are either curated centrally (http://www.biopax.org/; http://www.ingenuity.com/; Kanehisa and Goto 2000; Thomas et al. 2003) or increasingly by the community (Joshi-Tope et al. 2005; Pico et al. 2008; Schaefer et al. 2009; Vastrik et al. 2007). These offer searchable access to pathway diagrams and interaction data derived from a combination of manual and automated (text mining) extraction of primary literature, reviews and large-scale molecular interaction studies.
The sheer range of resources available (Bader et al. 2006) reflects the current interest in pathway science. Whilst invaluable and in many ways the best we have, a major problem with these efforts is that the information content of these diagrams is frequently limited and generic, and visualisations of these systems are of variable and often poor quality; pathways are drawn using informal and idiosyncratic notation systems using a variety of shapes (glyphs) to illustrate component ‘type’. There are variable degrees of accuracy and specificity in defining what pathway components are being depicted and the relationships between them. Resources are often fragmented with some proteins or metabolites being members of numerous pathways, the concept of pathway membership being a highly subjective division. The pathways themselves are rarely available as a cohesive network and there are numerous pathway exchange formats in current use (Hucka et al. 2003; Lloyd et al. 2004; Luciano 2005). Finally, pathway diagrams are generally highly subjective, reflecting the curator’s bias, such that two diagrams depicting the ‘same’ pathway may share little in common. Together these factors commonly result in uncertainty as to what exactly is being shown. All in all, despite the huge efforts in time and resources that have been poured into pathway science, the state of the art leaves a lot to be desired.

As our appreciation of systems-level biology increases rapidly, there has been an increasing realisation of the need for comprehensive well-constructed maps of known pathways. Over the past 10 years a number of groups have suggested formalised notation schemes and syntactical rules for drawing ‘wiring diagrams’ of cellular pathways (Cook et al. 2001; Kitano et al. 2005; Kohn 1999; Moodie et al. 2006; Pirson et al. 2000). These have been used to construct a number of large pathway diagrams (Calzone et al. 2008; Oda et al. 2004; Oda and Kitano 2006). These pioneering efforts have all contributed to the field and more recently the Systems Biology Graphical Notation (SBGN) group has proposed a series of formalised pathway notation schemes to be adopted by all (Novere et al. 2009). Of course in principle this is an excellent idea but it remains to be seen whether the SBGN schemes are going to be widely taken up or indeed whether they are flexible enough to suit all purposes.

Our own efforts on pathway modelling stem from our interest in macrophage biology and in understanding pathways known to be activated in these cells during infectious and inflammatory disease. Therefore, over the last few years we have been constructing large graphical models of macrophage-related pathways as a way of recording what is known about the signalling events controlling this cell’s immune biology (Raza et al. 2008, Raza et al. 2010). In so doing our main objectives have been to create models that

1. support the detailed representation of a diverse range of biological entities, interactions and pathway concepts
2. represent a consensus view of pathway knowledge in a semantically and visually unambiguous manner
3. are easy to assemble and understandable by a biologist
4. are useful in the interpretation of ‘omics’ data
5. are sufficiently well defined that software tools can convert these graphical models into formal models, suitable for analysis and simulation

In attempting to achieve these goals we have faced one of the central challenges in pathway biology: How exactly does one construct clear, concise pathway diagrams of the known interactions between cellular components that can be understood by, and are useful to, a biologist? In the beginning our efforts were largely based on the principles of the process diagram notation (PDN) (Kitano et al. 2005) and the original Edinburgh pathway notation (EPN) scheme (Moodie et al. 2006). However, during the course of working with these notation schemes it became apparent that the available diagrams drawn using these systems were not always easy to interpret and the schemes were a challenge to implement. Furthermore, we found that these notation schemes did not support all of the concepts that we wished to represent in order to reflect the full diversity of pathway components and the relationships between them. As a result of our efforts we have significantly modified these existing schemes and created what has now been named the ‘modified Edinburgh pathway notation’ (mEPN) scheme (Raza et al. 2008, Freeman et al. 2010). Below I describe the basic principles behind the mEPN scheme and illustrate how it can be used to depict a wide variety of biological pathways.
6
Assembly of Logic-Based Diagrams of Biological Pathways
143
6.2 Definition of the Modified Edinburgh Pathway Notation (mEPN) Scheme
A pathway may be considered to be a directional network of molecular interactions between components of a biological system that act together to regulate a cellular event or process. In this context a component is any physical entity involved in a pathway that contributes to or influences its activity, e.g. a protein, protein complex, nucleic acid (DNA, RNA) or other molecule. The mEPN scheme is a collection of formalised symbols that form the constituent parts of a graphical system for depicting the components of a biological pathway and the interactions between them. The mEPN scheme is based on the node and edge principles of depicting networks. This allows one to use ideas and tools previously developed in graph theory and applied more recently to computational systems biology. Cellular components are represented as nodes (vertices) in the network, and specific glyphs (stylised graphical symbols) are used that impart information nonverbally on the class of biological entity portrayed, e.g. protein, gene, biochemical. The processes that connect components are also represented by nodes, using different glyphs, and the connectivity between them is defined by edges (lines/arcs). Edges represent interactions or relationships between one component and another, usually where one component influences the activity of another, e.g. by binding to it, inhibiting it or catalysing its conversion. The network of interactions between cellular components and processes thereby defines a pathway.
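The node and edge principle described above can be illustrated with a minimal sketch. This is not part of the mEPN specification: the `Node` and `Pathway` classes, the `kind` values and the node identifiers are hypothetical names chosen for illustration, and the label `B` for the binding process node is an assumed code. The sketch shows two components linked to a complex through a process node, so that components never connect to each other directly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str   # "component" or "process" (hypothetical categories)
    label: str  # text shown on the glyph


@dataclass
class Pathway:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # directed (source_id, target_id) pairs

    def add_node(self, node_id, kind, label):
        self.nodes[node_id] = Node(node_id, kind, label)

    def add_edge(self, source, target):
        # An edge may only connect nodes that are already part of the pathway.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of an edge must be existing nodes")
        self.edges.append((source, target))

    def successors(self, node_id):
        return [target for source, target in self.edges if source == node_id]


# Proteins X and Y bind to form the complex X:Y via a binding process node.
pathway = Pathway()
pathway.add_node("X", "component", "X")
pathway.add_node("Y", "component", "Y")
pathway.add_node("bind", "process", "B")  # "B" is an assumed label
pathway.add_node("XY", "component", "X:Y")
pathway.add_edge("X", "bind")
pathway.add_edge("Y", "bind")
pathway.add_edge("bind", "XY")
```

Because processes are nodes in their own right, standard graph operations (finding successors, counting inputs and outputs) apply equally to components and to the events that connect them.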
6.2.1 Depiction of Pathway Components
When drawing pathways one has to decide on the level of biological detail that one wishes to depict. It is not uncommon in pathway depiction to use component glyphs that imply structural or functional characteristics of the entities depicted. For instance, receptors may be shown using a glyph with a specific ligand binding site or possibly as a protein containing membrane-spanning domains. Whilst on one level this approach is appealing to the eye and imparts visual information on the nature of the molecular species depicted, it can lead to complications. After all, both depictions described above may be appropriate for any receptor, and a protein may also have other functional domains which could be graphically depicted. If one tries to impart all this information visually it leads to a notation system that is difficult to implement and to remember. Such a system also requires the development of specific pathway editing tools that support it. In contrast we have used a set of standard shapes to represent different classes of components (molecular species) and in so doing created a notation scheme that is supported by generic network-editing/visualisation tools; in particular, the tool of choice for all our work has been the freely available yEd (yFiles, Tübingen). There is, however, a variety of other pathway and network-editing tools available (Pavlopoulos et al. 2008). It is worth remembering that the
144
T.C. Freeman
ability to graphically depict a wide variety of pathway concepts depends not only on the tool used to construct and display them but also on the pathway notation scheme employed. The mEPN scheme as described here is based on the concepts first described for the process diagram notation (PDN) scheme (Kitano et al. 2005). However, our experience in building large-scale pathway models of a variety of biological systems has required us to depict concepts that were not supported by the original PDN scheme. Furthermore, the lack of available pathway editing tools when we began this work and the scale of our diagrams have both played their part in determining our approach to pathway depiction. As a result there are a number of important differences between the mEPN scheme described here and other PDN-based schemes. First, in common with PDN, the mEPN scheme uses simple shapes to define the class of a component, but it uses only a labelling system to define the exact identity of components (nodes). Other schemes use circles overlaid on nodes to depict protein modifications. We have found this a considerable overhead to implement, and one that can interfere with the clarity of what is depicted rather than enhance it. Furthermore, the PDN scheme is not supported by many of the general purpose network visualisation tools, e.g. yEd, Cytoscape, BioLayout Express3D (Freeman et al. 2007; http://www.yworks.com; Yeung et al. 2008), requiring instead the use of dedicated pathway editing software, e.g. CellDesigner (Funahashi et al. 2008). Second, we have avoided the use of different styles of arrowheads to depict the nature of interactions (edges), an approach that limits the vocabulary of edges and can be challenging to remember. Instead, where appropriate, we have chosen to use inline annotation nodes to depict the meaning of edges; these carry a visual clue (a letter symbolising the meaning of the edge, e.g. A for activation, I for inhibition) and can potentially support a wider range of edge meanings.
Again the use of a wide variety of arrowheads is not supported by many pathway/network-editing software packages. Finally, we explicitly state the nature of interactions by the use of labelled process nodes. Under other PDN-based schemes process nodes are used but generally not as a means to convey the nature of interactions except in the case of protein binding (association) and dissociation. When pathways are large and the distance between interacting species may be great, having a visual clue as to the nature of interactions is very important in our experience. The full set of glyphs employed in the mEPN scheme is shown in Fig. 6.1. Under the scheme peptides, proteins and protein complexes are all represented by a rounded rectangle and genes depicted using a rectangle. Parallelograms may be used to show a specific DNA sequence known to play a specific functional role, e.g. promoter sequence. This may be shown on its own or associated with a gene or other genomic feature. Simple biochemicals, e.g. sugars, amino acids, nucleic acids, metabolites, are represented using a hexagon. It is often the case that an interacting component of a pathway is not an exact molecular entity but rather a molecular class or complex entity such as a virus or other pathogen. In this case we use a flattened circle (ellipse) to depict any generic entity. A small molecule or biologic known to affect a biological system is shown using a trapezoid. These may be licensed as a drug or used for experimental manipulation of biological components, e.g. enzyme
Fig. 6.1 List of the glyphs used by the modified Edinburgh pathway notation (mEPN) scheme. Unique shapes and identifiers are used to distinguish between each element of the notation scheme. The notation scheme essentially consists of categories of nodes representing cellular components, processes and Boolean logic operators. Edges are used to denote the interactions between components, the nature of the relationship between them being described using process nodes, Boolean operators and edge annotations. The cellular compartment in which these components reside is depicted by their spatial localisation in the network and background colour
inhibitor, siRNA. Finally, ions, e.g. Ca2+, Na+, Cl−, or other simple molecules, e.g. H2O, NO, O2, CO2, are represented using a diamond-shaped glyph.
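The assignment of shapes to component classes described in this section can be summarised as a simple lookup. The sketch below is illustrative only: the dictionary keys are hypothetical class names chosen here, whereas the shapes are those stated in the text.

```python
# Component class -> glyph shape, as described in Sect. 6.2.1.
# The class names (keys) are hypothetical labels, not mEPN terms.
GLYPH_SHAPES = {
    "peptide": "rounded rectangle",
    "protein": "rounded rectangle",
    "protein complex": "rounded rectangle",
    "gene": "rectangle",
    "DNA sequence": "parallelogram",
    "simple biochemical": "hexagon",
    "generic entity": "ellipse",
    "drug": "trapezoid",
    "ion or simple molecule": "diamond",
}


def glyph_for(component_class):
    """Return the shape used to draw a component of the given class."""
    try:
        return GLYPH_SHAPES[component_class]
    except KeyError:
        raise ValueError(f"no glyph defined for class {component_class!r}") from None
```

Because each class maps to an ordinary geometric shape, a table of this kind can in principle be reproduced in any generic network-editing tool that offers these shapes.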
6.2.2 Component Annotation
Multiple names are often available to describe any given component. For example, the same protein may be called by several different names in the literature. In other cases the same name has been used to describe different proteins, and some protein names are quite different from the gene name. Other names sometimes used for labelling components in pathways do not represent any specific entity at all, e.g. NF-κB. Therefore, when non-standard nomenclature is used to name pathway components it frequently leads to ambiguity as to the exact identity of what is being depicted. Use of standard nomenclature to denote a component’s identity removes this uncertainty and also assists in the comparison and overlay of experimental data with pathway models. Under mEPN we recommend the use of standard gene nomenclature systems, e.g. the HUGO Gene Nomenclature Committee (HGNC) or Mouse Genome Database (MGD) systems, to name human or mouse genes/proteins, respectively. These nomenclature systems now provide a near-complete annotation of all human and mouse genes. Their use in the naming of proteins as well as genes
provides a direct link between the two. Therefore, when a protein or gene is discussed within a paper almost the first act is to search the databases in order to record the identity of the component according to standard nomenclature. Where other names (‘aliases’) are in common use these may be shown as an addition to the label on the glyph representing the protein, included after the official gene symbol in rounded ( ) brackets. Protein complexes are named as a concatenation of the names of the proteins belonging to the complex, separated by colons. Again, if the complex is commonly referred to by a generic name this may be shown. There are no strict rules as to the order in which the protein names are shown in the complex; they are often shown in the order in which the proteins join the complex, in the position they are likely to hold relative to other members of the complex (where known) or in their position relative to cellular compartments, e.g. with receptor proteins in a membrane-bound protein complex protruding into the extracellular space. Where a specific protein is present multiple times within a complex, this may be represented by placing the number of times the protein is present within the complex in angular brackets < >. If the number of proteins in the complex is unknown this may be represented by <n>. The particular ‘state’ of an individual protein or a protein within a complex may be altered as a consequence of a particular process. This change in the component’s state is marked using square [ ] brackets following the component’s name, each modification being placed in separate brackets. This notation may be used to describe the whole range of protein modifications, e.g. phosphorylation [P], truncation [t], ubiquitination [Ub]. Where details of the site of modification are known this may be represented as, e.g. [P-L232] = phosphorylation at leucine 232.
Alternatively, the details of a particular modification may be placed as a note on the node, visible only during ‘mouse-over’ or when viewing a node’s properties. Where multiple sites are modified this may be shown using multiple brackets, each modification (state) being shown in separate brackets. Unfortunately, there appears to be no universally recognised nomenclature system for many of the other classes of biologically active molecules, e.g. lipids, metabolites, drugs, and therefore when these are included in a pathway we have generally used names commonly recognised by biologists. Colour may also be added to the diagrams to assist in their interpretation. Components may be coloured to impart information on a component’s type, location or state, e.g. to visually differentiate between a protein and a complex, to denote cellular location or to denote a component’s expression level. In addition, process nodes, Boolean operators, compartments and edge annotations are generally coloured to improve the visual impact of the diagram. However, it must be stated that the exact choice of colours is down to individual taste and colour recognition capabilities, and the mEPN scheme has been designed to work even in the absence of colour.
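The labelling conventions of this section can be sketched as two small helper functions. These are illustrative only: `component_label` and `complex_label` are hypothetical names, and the order in which the alias, copy number and state brackets are appended is an assumption made here, since the text specifies only that each follows the official gene symbol.

```python
def component_label(symbol, alias=None, copies=None, states=()):
    """Build a glyph label from an official gene symbol.

    alias  -> appended in rounded brackets, e.g. ERC1(ELKS)
    copies -> appended in angular brackets (placement is an assumption)
    states -> each modification appended in its own square brackets
    """
    label = symbol
    if alias:
        label += f"({alias})"
    if copies is not None:
        label += f"<{copies}>"
    for state in states:
        label += f"[{state}]"
    return label


def complex_label(member_labels):
    """Name a complex by joining the labels of its members with colons."""
    return ":".join(member_labels)


atm = component_label("ATM", states=("P",))
ikbkg = component_label("IKBKG", states=("P", "Ub"))
atm_ikbkg = complex_label([atm, ikbkg])
```

With these inputs `atm_ikbkg` is the string `ATM[P]:IKBKG[P][Ub]`, which matches the form of the complex labels used in Table 6.1.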
6.2.3 Depiction of Biological Processes
A process in the context of this notation system can be defined as a specific action, transformation or transition occurring between components or to a component, and is represented by a process node. Process nodes impart information on the type of process that is associated with transformation of a component from
one state to another or movement in cellular location. They also act as junctions between components and as such may have multiple inputs or outputs to or from components. All process nodes are represented by a small circular glyph and the process they represent is indicated by a one-to-three letter code. Colour has been used as a visual clue to group processes into ‘type’ but is not necessary for inferring meaning. There are currently 31 process nodes recorded under the mEPN. Different process nodes generally have different network connectivity. For instance, a process node depicting a component’s translocation from one compartment to another will generally only have one input and output edge (Fig. 6.2a). In contrast a ‘binding’ node will have multiple inputs and one output (Fig. 6.2b); the opposite is true for a dissociation node (Fig. 6.2c). Process nodes also act as a way of collating information about a given event; for example, protein X may be converted from one state to another by a process activated by protein Y (Fig. 6.2d). However, this process may also be inhibited by such a protein (Fig. 6.2e).
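The typical connectivity of the three process nodes mentioned above can be expressed as a simple check. This is a sketch under stated assumptions rather than a rule of the scheme: the function and dictionary names are hypothetical, ‘multiple’ is read as two or more, and only edges to and from the components being transformed are counted (edges from activators or inhibitors of the process are excluded).

```python
# Typical (inputs, outputs) for each process type; None means "two or more".
# The text says "generally", so a mismatch is a warning sign, not an error.
EXPECTED_CONNECTIVITY = {
    "translocation": (1, 1),
    "binding": (None, 1),
    "dissociation": (1, None),
}


def _matches(expected, actual):
    return actual >= 2 if expected is None else actual == expected


def has_typical_connectivity(process_type, n_inputs, n_outputs):
    """Check a process node's edge counts against the usual pattern."""
    expected_in, expected_out = EXPECTED_CONNECTIVITY[process_type]
    return _matches(expected_in, n_inputs) and _matches(expected_out, n_outputs)
```

A check of this kind could be run over a finished diagram to flag, for example, a binding node with a single input, which usually indicates a missing edge.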
Fig. 6.2 Depiction of basic concepts in pathway biology using the mEPN scheme. (a) Depiction of the transition of a component from one location or state to another, e.g. the translocation of a protein from the cytoplasm to the nucleus or transcription/translation of a gene to protein. (b) Binding (association) of two proteins to form a complex. (c) Dissociation of a complex into its constituent parts. (d) Activation of the transformation of one component by another. (e) Inhibition of the transformation of one component by another. (f) Activation of the transformation of two components by another. (g) Absolute requirement (co-dependency) of two components for the activation of a process. (h) Requirement of either of two components for the activation of a process. (i) Activation of the transformation of one component by another that requires ATP. (j) Depiction of a ‘conditional gate’ that indicates the start of potentially multiple alternative pathway outcomes which are dependent on other factors. The main octagon is labelled with the process name, e.g. G1 to S phase checkpoint, and the other smaller octagons are used to denote the factors that influence progression down one pathway or another
6.2.4 Boolean Logic Operators
Components in a pathway are dependent on each other. For example, if a process requires X and Y to be present for it to proceed, perhaps because they are independently acting cofactors in a given reaction, then the process will not proceed unless both are present. Alternatively, if a given process can be catalysed by either X or Y, then the process will proceed if either component is present. Such dependencies can be captured using Boolean logic operators, which are used to define the relationships between multiple inputs into a process. An ‘AND’ operator is used when two or more components are required to bring about a process, i.e. an event is dependent on more than one factor being present (Fig. 6.2g). In modelling of flow through networks these act in a similar manner to ‘bind’ process nodes, i.e. all inputs must be present before a product is formed or a reaction proceeds. In contrast an ‘OR’ operator is used when one component or another may orchestrate the same change in another component (Fig. 6.2h). For instance, multiple kinases, e.g. MAP2K3, MAP2K6, MAP2K7, may catalyse the phosphorylation of p38 (MAPK14) and are therefore shown connecting with p38 via an OR operator. OR operators have also occasionally been used to indicate that one or more components can potentially lead to multiple outcomes.
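The behaviour of the two operators can be sketched in a few lines. The function name `evaluate_operator` and the presence/absence values below are hypothetical; the kinase names are those given in the text.

```python
def evaluate_operator(operator, inputs_present):
    """Return True if the downstream process can proceed.

    operator       -- "AND" or "OR"
    inputs_present -- iterable of booleans, one per input component
    """
    inputs_present = list(inputs_present)
    if len(inputs_present) < 2:
        raise ValueError("a Boolean operator combines two or more inputs")
    if operator == "AND":
        return all(inputs_present)  # every input is required
    if operator == "OR":
        return any(inputs_present)  # any one input is sufficient
    raise ValueError(f"unknown operator: {operator!r}")


# Any one of the three kinases is enough for phosphorylation of p38 (MAPK14).
kinase_present = {"MAP2K3": False, "MAP2K6": True, "MAP2K7": False}
p38_can_be_phosphorylated = evaluate_operator("OR", kinase_present.values())
```

Here `p38_can_be_phosphorylated` is `True` because MAP2K6 is present; had the three kinases been joined by an AND operator, the same inputs would give `False`.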
6.2.5 Depiction of Other Concepts
There are a number of glyphs that represent concepts that do not sit neatly under the headings of component, process or logic operator. These include the following:
Energy/molecular transfer nodes are used to represent simple co-reactions associated with or required to drive certain processes (e.g. ATP→ADP, GTP→GDP, NADPH→NADP+). They are linked directly to the node representing the process in which they take part (Fig. 6.2i).
Conditional gates are used where there are potentially multiple fates of a component and the output is dependent on other factors such as the component’s concentration and time, or is associated with a cellular state (Fig. 6.2j). These have been used to depict events such as the checkpoint controls in the cell cycle, where the decision to go on to the next phase of cell replication is under the control of a number of factors and two or more outcomes are possible. Another example is where cholesterol, depending on its intracellular concentration, may either be exported out of the cell or trigger the cholesterol biosynthetic pathway.
Pathway modules define complicated processes or events that are not otherwise fully described. Examples include signalling cascades, endocytosis and compartment fusion. They are a short-hand way of representing molecular events that are not known, not recorded or not shown.
Pathway outputs detail the cumulative output of a series of interactions or the function of an individual component at the end of a pathway. Pathway outputs are shown
in order to describe the significance of those interactions in the context of a biological process or with respect to the cell. The input lines leading into a pathway output node have been coloured light blue to emphasise the end of the pathway description.
6.2.6 Depiction of Interactions Between Components and the Use of Edges
Interactions are depicted by edges, sometimes referred to as lines or, when directional, arcs. They signify a relationship between components/processes in a pathway and convey the directionality of that interaction. The nature of an interaction is conveyed through the use not only of process nodes and Boolean logic operators but also of edge annotation nodes. An edge annotation node is characterised as having only one input (with no arrowhead) and one output, and functions to describe the type of activity implied by the line, e.g. activation, inhibition, catalysis (Fig. 6.2). A number of notation schemes use different arrowheads to indicate the ‘type’ of interaction, but their use has been avoided in the mEPN scheme for several reasons. First, the number of distinguishable arrowhead types is limited and potentially falls below the number of biological concepts one may need to depict. Second, differentiating between different arrowheads is sometimes difficult when viewed at a distance. Third, few arrowheads are symbolic or indicative of the action they are designed to describe, requiring them to be committed to memory. Finally, multiple arrowhead types are not always supported by different network-editing/visualisation software. Interaction edges may be coloured for visual emphasis but, as with nodes, the definition of meaning is not reliant on colour. In certain instances edge annotation nodes can also be used as distribution nodes, e.g. where one component activates many others, such as the transcriptional activation of a number of genes by a transcription factor; this reduces the number of edges emanating from the transcription factor and therefore simplifies the representation (Figs. 6.2f and 6.3). Where separate depiction of modules belonging to the same component is desirable, an undirected edge (no arrowhead) is used to denote a physical connection (bond) between two or more components.
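The structure of an edge annotation node, with one input lacking an arrowhead and one directed output, can be sketched as follows. The function name, the identifier format and the dictionary layout are hypothetical; the letter codes A and I are those given earlier in the chapter for activation and inhibition.

```python
# Letter codes stated in the text; other edge meanings would need further codes.
EDGE_ANNOTATIONS = {"A": "activation", "I": "inhibition"}


def annotate_edge(source, target, code):
    """Replace a typed edge by source -> annotation node -> target.

    Returns the annotation node's identifier and the two edges that
    connect it: the first has no arrowhead, the second is directed.
    """
    if code not in EDGE_ANNOTATIONS:
        raise ValueError(f"unknown edge annotation code: {code!r}")
    annotation_id = f"{source}-{code}-{target}"  # hypothetical naming scheme
    edges = [
        {"source": source, "target": annotation_id, "arrowhead": False},
        {"source": annotation_id, "target": target, "arrowhead": True},
    ]
    return annotation_id, edges


# Protein Y activates the process node "p1" that transforms protein X.
annotation_id, edges = annotate_edge("Y", "p1", "A")
```

Using the same annotation node as the source of several directed edges would give the distribution behaviour described above, with a single undirected edge leaving the activating component.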
6.2.7 Cellular Compartments
Pathway components exist in different cellular compartments. A cellular compartment can be a region of the cell, an organelle or cellular structure, dedicated to particular processes and/or hosting certain subsets of components, e.g. genes are found only in the nuclear compartment. In principle a subcellular compartment can be any size or shape. Compartments are defined by a labelled background to the pathway and arranged with spatial reference to cell structure. Compartments are coloured differently for emphasis. Similar or related compartments are shown to
Fig. 6.3 Example of a small pathway depicted using the mEPN scheme. Interferon B (IFNB) is a cytokine released from many cell types in response to immune stimulation. It homodimerises and binds to a cell surface receptor complex composed of the receptor proteins IFNAR1 and IFNAR2 and the intracellular kinases TYK2 and JAK1. The complex is composed of two of each of these proteins. Binding causes a conformational change in the complex resulting in the autophosphorylation of JAK1. Once activated the complex catalyses the phosphorylation of STAT2, which forms a heterodimer with STAT1. This complex then binds interferon regulatory factor 9 (IRF9), forming the complex often referred to as ISGF3, and translocates to the nucleus. Here it binds to the ISGF3 element in the promoter of a number of genes including IRF2, IL12B, STAT1, IL15, TAP1, GBP1 and PSMB9, initiating their transcription. For a more detailed view of this and other immune-related pathways, see Raza et al. (2008)
share the same fill colour but different coloured perimeters. This has been used to differentiate between different but related compartments, e.g. different classes of vesicles derived from the endoplasmic reticulum or plasma membrane.
6.3 Collation of Information and Pathway Assembly
The assembly of a pathway diagram is an extraordinarily interesting and informative exercise. The act of converting text-based information into a visual resource forces one to understand the information that is being presented to a level that the mere reading of an article never requires. When presented with a long textual description of a process involving numerous components all interacting through a complex series of events, it is easy to read about them but far more difficult to construct an accurate picture of them in the mind’s eye. Furthermore, the semantics of the written word does not always make sense when drawn, at least not when done in a logical fashion. The art of pathway construction therefore relies on the ability to convert numerous textual descriptions, in which different words may be used to describe the same or similar processes between multiple components that in turn may or may not be given the same name, into a concise and unambiguous model of events. When embarking on the construction of a pathway diagram there is a need to define the specific areas that are of interest to you. This sounds obvious, but in reading the literature on one system it is common to find that other systems are discussed (the one big network scenario) and it is easy to stray from the area of original interest. This in itself is not a problem and is indeed part of the learning exercise, as long as the area covered has been documented correctly before moving on. The danger is that after a mapping exercise has been ‘completed’ what results is a sketch covering many components in related systems, where the relationships between them have not been documented to a sufficient level of detail to render the diagram truly useful or informative. It is therefore better to aim for quality over quantity when engaging in this activity.
It is also true that what makes sense to the pathway curator does not necessarily make sense to another individual. Great emphasis should therefore be placed on the need to discuss and justify the information represented to others. If the knowledge gained by the curator cannot be communicated clearly and effectively, then they have not done their job properly. Pathway content, adherence to the notation system and layout should always be assessed by others to ensure that the graphical depiction of pathway/interactions is intelligible and unambiguous to another individual familiar with the notation scheme. Ideally the work should also be inspected by those intimately familiar with the field of research that one is attempting to depict; this is always a good test of the accuracy and completeness of the information. The best source of information about pathways is buried in the primary literature. However, the amount of pathway information that can be gleaned from any one paper is generally limited as a given piece of work will tend to focus only on one or a small number of components and their interacting partners. It is therefore
advisable to spend some time gaining a high-level view of any given pathway or system of interest. Internet searches for images of the pathway or specific complexes within it provide a framework for understanding the pathway of interest. Pathway databases such as Reactome or KEGG (Kanehisa and Goto 2000; Vastrik et al. 2007) can be used to gain a high-level view of the pathway. Interaction databases, e.g. STRING, IntAct, Ingenuity, HPRD or BIND (Alfarano et al. 2005; Hermjakob et al. 2004; Jensen et al. 2009; Mishra et al. 2006), might also be used to gain a view of the molecular interactions of a given component. Our experience, however, has been that such resources present a generic network view of pathways and often capture seemingly erroneous interactions, thereby limiting their utility for this purpose. One of the best starting points is literature reviews. Whilst they frequently discuss information of limited use to pathway construction, e.g. concerning protein structure, evolution of protein families, high-level concepts, they frequently provide graphical depictions of subsystems and are an excellent portal into the primary literature. The point is not to get too involved too early but to take snapshots of the current understanding of the system and construct a framework of understanding and sources of available information prior to going into detail. During the course of a pathway mapping exercise many papers will be read and snippets of information will be mentally recorded concerning all aspects of pathway biology. It is important to have mechanisms in place that allow the curator to record this information and its source, otherwise all this information will be lost. Evidence to support an interaction derived from the primary literature (and reviews) must be recorded in an interaction table. This must include the identity of interacting partners, the direction of the interaction, e.g.
HGNC1 → HGNC2, the type of interaction (phosphorylation, cleavage), the method by which the interaction was determined, the PubMed ID of the paper reporting the interaction and the site of any specific change of state, e.g. phosphorylation of serine 123. Of course more than one paper may be used to support the same interaction and arguably two or more references are preferable to a single work reporting an interaction. Indeed no interaction should be included within the pathway without published evidence to back it up. An example of a pathway interaction table is shown in Table 6.1. Additional notes and hyperlinks to external databases are also useful in linking additional information on the biology depicted. GraphML files support this activity and pathway diagrams may include URL links to Entrez Gene (or other database of choice) for each protein or gene component in the pathway. Furthermore, component descriptions obtained from databases, PubMed IDs and textual descriptions can be included and stored on appropriate edges or nodes. These can be accessed under the properties description tab for nodes or edges or appear when hovering over a node or edge, thereby supplementing what is shown graphically. As a final note on pathway construction, it should be emphasised that the visualisation of specific events as well as the overall layout of a diagram is everything in ensuring the pathway’s usability. Under the mEPN system each step in a given process is explicitly depicted. For example, if the activation of a given signalling pathway requires the receptor complex to go through a series of changes, e.g. binding, phosphorylation or dissociation events following ligand binding, then each
Table 6.1 Example of the information that should be stored when recording an interaction associated with the construction of a pathway
Interaction no. | Partner 1: official gene symbol | Partner 1: gene ID | Partner 1: interactant type | Partner 1: interactant as on map | Partner 2: official gene symbol | Partner 2: gene ID | Partner 2: interactant type | Partner 2: interactant as on map | Interaction type | Interaction location | NCBI PubMed ID
1 | ATM | 472 | Protein | ATM[P] | IKBKG | 4214 | Protein | IKBKG[SU] | Binding | nucleus | 16497931
2 | ATM:IKBKG | 472:4214 | Complex | ATM[P]:IKBKG[P][Ub] | CHUK | 1147 | Protein | CHUK | Binding | cytoplasm | 16497931
3 | ATM:IKBKG | 472:4214 | Complex | ATM[P]:IKBKG[P][Ub] | ERC1 | 23085 | Protein | ERC1(ELKS) | Binding | cytoplasm | 16497931
4 | BCL2 | 596 | Gene | BCL2 | NFKB1(p50):NFKB1(p50) | N/A | Complex | NFKB1(p50):NFKB1(p50) | Activation | nucleus | 14668329
5 | CDC37 | 11140 | Protein | CDC37 | HSP90AA1 | 3320 | Protein | HSP90AA1 | Binding | cytoplasm | 15371334
6 | CHUK | 1147 | Protein | CHUK | CHUK | 1147 | Protein | CHUK | Binding | cytoplasm | 15145317
intermediate stage should ideally be shown (see Fig. 6.3). Whilst this can make the depiction of events long-winded, it accurately reflects what is known and may ultimately be important in understanding the pathway’s regulation. Another important rule is that although a given pathway component can play a role in numerous different processes, it may only be represented once in any given cellular compartment. Whilst this rule can potentially lead to a tangle of edges due to certain components possessing numerous connections to other components spread across the pathway, the benefits of the rule outweigh the issues in adhering to it. The number of edges leaving each node gives the reader an exact indication of a component’s interactions with other components and hence potential activity, without the need for scanning the entire diagram to find other instances where the component is described. A component may, however, be shown more than once in a given cellular compartment if it changes from one state to another, e.g. from an inactive form to an active form, in which case both forms are represented as separate components. As a general rule nodes (components, processes, operators) and edges (interactions) should be drawn in such a way as to make the diagram compact, with a minimum of crossing over, changes in direction of edges and edge length, i.e. edges should be easy to follow. Hierarchical relationships between components should be shown in the layout of interactions. In order to do this an orientation of pathway flow is chosen, e.g. left to right or top to bottom, and where possible should be maintained throughout the diagram. Ideally the direction of interactions should follow the ‘flow’ through the pathway, although it is appreciated this becomes more difficult in larger diagrams. A certain degree of consistency should also be aimed for when depicting components and their interactions, e.g.
components should be depicted using nodes of a similar size, similar pathway relationships should be drawn in a consistent manner. Visual clarity relies on a ‘clean’ layout of pathways and whilst there are a number of automated algorithms available for network layout, they are currently no substitute for a curator with an attention to detail and an artistic eye.
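One way of enforcing the requirement that every interaction be backed by published evidence is to record each row of the interaction table as a structured record that refuses to be created without a PubMed ID. The sketch below is illustrative only: the class and field names are hypothetical, chosen to mirror the column headings of Table 6.1, and the example values are those of interaction 6 in that table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interactant:
    gene_symbol: str       # official gene symbol
    gene_id: str           # gene ID, or "N/A" where none applies
    interactant_type: str  # e.g. "Protein", "Complex", "Gene"
    label_on_map: str      # label as drawn on the diagram


@dataclass(frozen=True)
class InteractionRecord:
    number: int
    partner_1: Interactant
    partner_2: Interactant
    interaction_type: str
    location: str
    pubmed_ids: tuple  # one or more PubMed IDs supporting the interaction

    def __post_init__(self):
        # No interaction should be recorded without published evidence.
        if not self.pubmed_ids:
            raise ValueError("an interaction needs at least one PubMed ID")


chuk = Interactant("CHUK", "1147", "Protein", "CHUK")
record = InteractionRecord(6, chuk, chuk, "Binding", "cytoplasm", ("15145317",))
```

Holding the PubMed IDs in a tuple allows a second or third reference to be added to the same interaction, in line with the preference for more than one supporting paper.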
6.4 Summary
The networks of molecular interactions that underpin cellular function are highly complex and dynamic. The topology, behaviour and logic of these systems, even on a relatively small scale, are far too complicated to understand intuitively. Formalised models provide a possible solution to the problem. However, the challenge of creating models that reflect our current understanding of these systems and display this information in an intuitive and logical manner is not trivial. The task of constructing pathway diagrams is time consuming and laborious, involving many hours of work. On the other hand, it summarises the results of investigations that may have taken many thousands of hours to perform, and it is difficult to envisage how one could précis such a body of work in any other meaningful way. The act of creating a pathway model forces you to formalise what you know about a system and justify
it using appropriate sources. It allows you to explore the nature of relationships that might previously have existed only as a mental picture, and the need to depict them graphically in a formalised way is in itself highly informative. As well as defining what you do know about a system, it is equally useful in defining what you do not. The mEPN scheme described here provides a system where pathways can be represented in a logical, unambiguous and biologist-friendly fashion, whatever the system of interest. What we would like to see, and believe is essential, is the support of the wider community in assembling and editing such diagrams. Such efforts are underway (Pico et al. 2008; Schaefer et al. 2009; Vastrik et al. 2007) and are already providing a vital forum for debate on the known details of pathways in different cell systems. Ideally these efforts will result in detailed models of biological systems that can be shared and assimilated. However, in order to achieve this end, pathway models clearly need to be assembled using standard rules and graphical languages. We therefore hope our work will contribute to the ongoing community effort to develop such standards (Le Novère et al. 2009). To gain a systems-level view of these pathways is to gain an insight into the molecular networks that regulate normal function and whose malfunction underpins disease pathology. Greater understanding of the overall architecture of the pathways and their susceptibility to deregulation by disease-causing agents should ultimately lead to new strategies and targets for therapeutic intervention. For my group the creation of pathway models has provided a resource for training, literature/data interpretation, computational pathway modelling and hypothesis generation. As such the approach is now central to our ongoing investigations of macrophage biology and has transformed the way we think about these cells and our interpretation of results of investigations into their immune biology.
References Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H et al (2005) The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res 33:D418–424 Antonov AV, Dietmann S, Mewes HW (2008) KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol 9:R179 Arakawa K, Kono N, Yamada Y, Mori H, Tomita M (2005) KEGG-based pathway visualization tool for complex omics data. In Silico Biol 5:419–423 Babur O, Colak R, Demir E, Dogrusoz U (2008) PATIKAmad: putting microarray data into pathway context. Proteomics 8:2196–2198 Bader GD, Cary MP, Sander C (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34:D504–506 Calzone L, Gelay A, Zinovyev A, Radvanyi F, Barillot E (2008) A comprehensive modular map of molecular interactions in RB/E2F pathway. Mol Syst Biol 4:173 Cassman M (2005) Barriers to progress in systems biology. Nature 438:1079 Cavalieri D, De Filippo C (2005) Bioinformatic methods for integrating whole-genome expression results into cellular networks. Drug Discov Today 10:727–734 Cook DL, Farley JF, Tapscott SJ (2001) A basis for a visual language for describing, archiving and analyzing functional models of complex biological systems. Genome Biol 2:RESEARCH0012
Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31:19–20 Ekins S, Nikolsky Y, Bugrim A, Kirillov E, Nikolskaya T (2007) Pathway mapping tools for analysis of high content data. Methods Mol Biol 356:319–350 Eungdamrong NJ, Iyengar R (2004) Modeling cell signaling networks. Biol Cell 96:355–362 Freeman TC, Goldovsky L, Brosch M, van Dongen S, Maziere P, Grocock RJ, Freilich S, Thornton J, Enright AJ (2007) Construction, visualisation, and clustering of transcription networks from microarray expression data. PLoS Comput Biol 3:2032–2042 Freeman TC, Raza S, Theocharidis A, Ghazal P (2010) The mEPN Scheme: an intuitive and flexible graphical system for rendering biological pathways BMC Syst Biol 4:65 Funahashi A, Matsuoka Y, Jouraku A, Morohashi M, Kikuchi N, Kitano H (2008) CellDesigner 3.5: A versatile modeling tool for biochemical networks. Proc IEEE 96:1254–1265 Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res 32:D452–455 Accessed on June 1, 2010. http://www.biopax.org/. Biological Pathways Exchange Accessed on June 1, 2010. http://www.ingenuity.com/. Ingenuity Pathway Analysis. Accessed on June 1, 2010. http://www.yworks.com. yEd Graph Editor – yWorks the diagramming company. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin, II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. 
Bioinformatics 19:524–531 Janssen P, Audit B, Cases I, Darzentas N, Goldovsky L, Kunin V, Lopez-Bigas N, Peregrin-Alvarez JM, Pereira-Leal JB, Tsoka S, Ouzounis CA (2003) Beyond 100 genomes. Genome Biol 4:402 Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C (2009) STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 37:D412–416 Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33:D428–432 Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27–30 Kitano H (2002) Computational systems biology. Nature 420:206–210 Kitano H, Funahashi A, Matsuoka Y, Oda K (2005) Using process diagrams for the graphical representation of biological networks. Nat Biotechnol 23:961–966 Kohn KW (1999) Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol Biol Cell 10:2703–2734 Kwiatkowska MZ, Heath JK (2009) Biological pathways as communicating computer systems. J Cell Sci 122:2793–2800 Le Novère N, Hucka M, Mi H, Moodie S, Shreiber F, Sorokin A, Demir E, Wegner K, Aladjem MI, Wimalaratne SM, Bergman FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villéger A, Boyd SE, Calzone L, Courtot M, Dogrusoz U, Freeman TC, Funahashi A, Ghosh S, Jouraku A, Kim S, Kolpakov F, Luna A, Sahle S, Watterson S, Wu G, Goryanin I, Kell DB, Sander C, Sauro H, Snoep JL, Kohn K, Kitano H. (2009) The systems biology graphical notation. Nat Biotechnol 27:735–741 Lloyd CM, Halstead MD, Nielsen PF (2004) CellML: its future, present and past. Prog Biophys Mol Biol 85:433–450 Luciano JS (2005) PAX of mind for pathway researchers. Drug Discov Today 10:937–942
Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS et al (2006) Human protein reference database— 2006 update. Nucleic Acids Res 34:D411–414 Moodie SL, Sorokin A, Goryanin I, Ghazal P (2006) A graphical notation to describe the logical interactions of biological pathways. J Integr Bioinform 3:11 Novere NL, Hucka M, Mi H, Moodie S, Schreiber F, Sorokin A, Demir E, Wegner K, Aladjem MI, Wimalaratne SM, Bergman FT, Gauges R, Ghazal P, Kawaji H, Li L, Matsuoka Y, Villeger A, Boyd SE, Calzone L, Courtot M et al (2009) The systems biology graphical notation. Nat Biotechnol 27:735–741 Nurse P (2003) Systems biology: understanding cells. Nature 424:883 Oda K, Kimura T, Matsuoka Y, Funahashi A, M. M, Kitano H. (2004) Molecular interaction map of a macrophage. The alliance for cellular signaling (AfCS) Research Reports, vol. 2, www.signaling-gateway.org/reports/v2/DA0014/DA0014.htm Oda K, Kitano H (2006) A comprehensive map of the toll-like receptor signaling network. Mol Syst Biol 2:2006 0015 Pandey R, Guru RK, Mount DW (2004) Pathway Miner: extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data. Bioinformatics 20:2156–2158 Pavlopoulos GA, Wegener, A-L., Schneider, R. (2008) A survey of visualization tools for biological network analysis. BioData Mining 1:12 Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C (2008) WikiPathways: pathway editing for the people. PLoS Biol 6:e184 Pirson I, Fortemaison N, Jacobs C, Dremier S, Dumont JE, Maenhaut C (2000) The visual display of regulatory information and networks. 
Trends Cell Biol 10:404–408 Raza S, McDerment N, Lacaze PA, Robertson K, Watterson S, Chen Y, Chisholm M, Eleftheriadis G, Monk S, O’Sullivan M, Turnbull A, Roy D, Theocharidis A, Ghazal P, Freeman TC (2010) Construction of a large scale integrated map of macrophage pathogen recognition and effector systems. BMC Syst Biol 4:63 Raza S, Robertson KA, Lacaze PA, Page D, Enright AJ, Ghazal P, Freeman TC (2008) A logicbased diagram of signalling pathways central to macrophage activation. BMC Syst Biol 2:36 Reed JL, Famili I, Thiele I, Palsson BO (2006) Towards multidimensional genome annotation. Nat Rev Genet 7:130–141 Ruths D, Muller M, Tseng JT, Nakhleh L, Ram PT (2008) The signaling petri net-based simulator: a non-parametric strategy for characterizing the dynamics of cell-specific signaling networks. PLoS Comput Biol 4:e1000005 Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH (2009) PID: the Pathway Interaction Database. Nucleic Acids Res 37:D674–679 Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O (2003) PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res 31:334–341 van Riel NA (2006) Dynamic modelling and analysis of biochemical networks: mechanism-based models and model-based experiments. Brief Bioinform 7:364–374 Vastrik I, D’Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8:R39 Watterson S, Marshall S, Ghazal P (2008) Logic models of pathway biology. Drug Discov Today 13:447–456 Yeung N, Cline MS, Kuchinsky A, Smoot ME, Bader GD (2008) Exploring biological networks with Cytoscape software. Curr Protoc Bioinformatics Chapter 8 :Unit 8 13
Chapter 7
Automating Mathematical Modeling of Biochemical Reaction Networks Andreas Dräger, Adrian Schröder, and Andreas Zell
Abstract In this chapter we introduce a five-step modeling pipeline that ultimately leads to a mathematical description of a biochemical reaction system. We discuss how to automate each individual step and how to put these steps together. First, we create a topology of interconversion processes and mutual influences between reactive species. The Systems Biology Markup Language (SBML) encodes the model in a computer-readable form and allows us to add semantic information to each component of the model. Second, from such an annotated network, the procedure known as SBMLsqueezer generates kinetic equations in a context-sensitive manner. The resulting model can then be combined with already existing models. Third, we estimate the values of all newly introduced parameters in each created rate law. This procedure requires that a time series of quantitative measurements of the reactive species within this system be available, because we calibrate the parameters with the aim that the model will fit these experimental data. Fourth, an experimental validation of the resulting model is advisable. Fifth, a model report is generated automatically to document the model with all of its components. For a better understanding, we will begin with an introduction to current standardization attempts in systems biology and generalized approaches for common rate equations before discussing computer-aided modeling, parameter estimation, and automatic report generation. We complete this chapter with a discussion of possible further improvements to our modeling pipeline. Keywords Computer aided modeling · Automatic rate law generation · Model documentation · Model annotation · Model semantics · Model merging · Modeling tools · Software in systems biology
A. Zell (B) Center for Bioinformatics Tübingen (ZBIT), Sand 1, 72076 Tübingen, University of Tübingen, Germany e-mail:
[email protected] S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_7, © Springer Science+Business Media, LLC 2010
7.1 A Straightforward Modeling Pipeline The mathematical modeling of biochemical reaction networks plays a central role in understanding the behavior of complex biological systems (Kitano 2002a; Lloyd et al. 2004) (Fig. 7.1). All these networks bear many resemblances in their structure: a set of reactions interconverts substances, often referred to as reacting species, that are located within some cellular compartment. The type of species and therefore also the type of reaction can strongly vary due to the variety of substances a cell is composed of. Some species interfere with a reaction but are neither consumed nor produced. These species are often called modulators of the reaction. This does not mean, however, that the amount of modulators cannot change, because, just like other species, modulators of one reaction can act as reactants or products in other reactions. In cellular environments, a special type of modulator, enzymes, catalyze most reactions. Other modulators speed up the reaction rate and are therefore called potentiators. If a modulator lowers the velocity of a reaction, it is referred to as an inhibitor.
[Figure 7.1: diagram with the labels Biological Phenomenon, Experimental data, Structural knowledge, Mathematical Modeling, Hypothesis, In silico Experiments, In vivo Experiments, Simulated Data, Validation, New Knowledge]
Fig. 7.1 The process of knowledge discovery in systems biology. This figure shows how research in systems biology proceeds, starting with a biological phenomenon to be investigated. In close collaboration between experimenters and modelers, new insights into the phenomenon can be discovered by iteratively performing in silico and in vivo experiments. This procedure refines the mathematical model, leading to new hypotheses and finally new biological knowledge. In this chapter we focus on the question of how to create such a model aided by a combination of dedicated software tools
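The network structure described at the start of this section (reactions that interconvert species, plus modulators that are neither consumed nor produced by the reaction they influence) can be pictured as a small data structure. The following is only a minimal sketch: the class, field, and function names are hypothetical and do not come from any of the tools discussed in this chapter.

```python
from dataclasses import dataclass, field
from enum import Enum


class ModifierRole(Enum):
    """Roles a modulator can play, following the terms used in the text."""
    ENZYME = "enzyme"            # catalyzes the reaction
    POTENTIATOR = "potentiator"  # speeds up the reaction
    INHIBITOR = "inhibitor"      # lowers the velocity of the reaction


@dataclass
class Reaction:
    """One reaction of the network (all field names are hypothetical)."""
    name: str
    reactants: dict  # species name -> stoichiometry
    products: dict   # species name -> stoichiometry
    modifiers: dict = field(default_factory=dict)  # species name -> ModifierRole


def modifier_only_species(reactions):
    """Species that modulate some reaction but are never consumed or produced."""
    converted, modulating = set(), set()
    for reaction in reactions:
        converted.update(reaction.reactants, reaction.products)
        modulating.update(reaction.modifiers)
    return modulating - converted


network = [
    Reaction("r1", {"S": 1}, {"P": 1}, {"E": ModifierRole.ENZYME}),
    # P is a product of r1 and, at the same time, a modulator of r2.
    Reaction("r2", {"A": 1}, {"B": 1}, {"P": ModifierRole.INHIBITOR}),
]

print(sorted(modifier_only_species(network)))
```

In this toy network only E keeps a constant amount; P modulates r2 but is produced by r1, which mirrors the remark that the amount of a modulator can still change through other reactions.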
In order to create a mathematical model to quantitatively describe these reaction dynamics, specialized ordinary differential equations are needed. These equations are given by kinetic equations derived from the experimental analysis of the reaction mechanism and may contain several uncertain parameters (Liebermeister and Klipp 2005). To set up a model of a biochemical system, the following five steps have to be undertaken:
1. Determination of the structure of the reaction system, the network topology.
2. Assignment of appropriate rate laws to each reaction, thereby considering all modifiers (e.g., catalysts, inhibitors, and potentiators), and the type and stoichiometry of all reactants and products.
3. Model calibration, i.e., determination of values for all the parameters within the rate laws with the aim that the dynamic behavior of the model will mirror given measurement data of the reacting species.
4. Experimental validation of the model.
5. Model documentation.
This model building process, however, continues to be a highly complicated and labor-intensive task that requires human expertise in numerous details of biochemistry and mathematics. In this chapter we explore possibilities to let automatic procedures do the work, but in some steps human intervention remains indispensable.
To tackle the first task of our modeling pipeline, we make use of databases like KEGG and MetaCyc (Caspi et al. 2008; Kanehisa et al. 2006), which provide a large set of known reaction pathways in a multitude of organisms. There we can easily look up the processes we are interested in. But before we can extract any information from these pathway databases, we have to think about the level of detail of the biological process we intend to investigate (Wilkinson 2006, p. 1). Instead of considering all reactions within the pathway map of interest, it often makes sense to lump several reactions together. To give an example of this, a chain of coupled reactions within a protein complex may proceed much faster than all other reactions and we can therefore ignore intermediate steps. Sometimes it may be desirable to abstract from the real process because, for technical reasons, the amount of certain species cannot be measured with sufficient accuracy. In other cases we may want to disregard changes in the amount of small molecules such as carbon dioxide in order to keep the system simple. Changes in the amount of water molecules can usually also be neglected because normally water is found abundantly in cellular systems.
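The two simplifications mentioned above, disregarding abundant small molecules and lumping a chain of reactions into one overall step, can be illustrated with a short sketch. It assumes a deliberately simple representation (each reaction is a pair of species lists, stoichiometry is not tracked), and the pathway, the species names, and the helper functions `prune` and `lump` are hypothetical.

```python
# Species treated as constant background and left out of the model.
# Which species to ignore is a modeling decision; this set is only an example.
IGNORED = {"H2O", "CO2"}


def prune(reactions, ignored=IGNORED):
    """Drop ignored species from every reaction (stoichiometry is not tracked)."""
    return {
        name: (
            [s for s in reactants if s not in ignored],
            [s for s in products if s not in ignored],
        )
        for name, (reactants, products) in reactions.items()
    }


def lump(reactions, chain, lumped_name):
    """Replace a chain of reactions by one overall reaction.

    Intermediates that are produced by one step of the chain and consumed
    by another cancel out of the overall reaction.
    """
    consumed = [s for step in chain for s in reactions[step][0]]
    produced = [s for step in chain for s in reactions[step][1]]
    overall = (
        [s for s in consumed if s not in produced],
        [s for s in produced if s not in consumed],
    )
    remaining = {k: v for k, v in reactions.items() if k not in chain}
    remaining[lumped_name] = overall
    return remaining


# Hypothetical pathway: reaction name -> (reactants, products)
pathway = {
    "r1": (["A", "H2O"], ["B"]),
    "r2": (["B"], ["C", "CO2"]),
    "r3": (["C"], ["D"]),
}

simplified = lump(prune(pathway), chain=["r1", "r2"], lumped_name="r1_r2")
print(simplified)
```

Here the intermediate B disappears because it occurs only inside the lumped chain. In a real model an intermediate that also takes part in reactions outside the chain would have to be kept, so a sketch like this can support, but not replace, the modeler's judgment about the appropriate level of detail.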
As soon as we have decided on the level of detail our model should encompass, we can use an appropriate graphical pathway modeling tool like CellDesigner (Funahashi et al. 2003) to build a reaction pathway based on the information obtained from pathway databases. CellDesigner itself provides instant access to several online databases. Additionally, tools such as KEGG2SBML (http://sbml.org/Software/KEGG2SBML) can assist us in this step. KEGG2SBML converts KEGG’s pathway files into a format we can open in CellDesigner.
In the second step, each reaction within the network topology has to be described by a kinetic equation. Manifold equation types have been suggested for this purpose. Some rate laws, the so-called mechanistic equations, are based upon the exact mechanism of the reaction. Often these equations have been individually derived for a particular reaction. The reaction kinetics database SABIO-RK provides a large collection of curated and annotated rate equations from multiple sources (Krebs et al. 2007; Wittig et al. 2006). For many reactions there is, however, still a lack of detailed rate laws (Bulik et al. 2009). Therefore, in these cases more general equations have been suggested to provide an approximative description of the real process. Purely phenomenon-oriented equations reproduce the dynamics of the reaction without taking the reaction process itself into account, whereas semi-mechanistic equations simplify the underlying process. We call both types of simplified equations approximative equations to distinguish them from detailed kinetic equations. Many errors can potentially be introduced when assembling complicated kinetic equations manually. It may also be desirable to apply several types of approximative kinetic equations to the reactions and compare the differences in the dynamic behavior of the whole system (Dräger et al. 2009a). For these reasons, the second step of our pipeline requires the modeler to have a considerable level of biochemical and mathematical knowledge about the underlying reaction mechanisms and also to invest a significant amount of time and effort into this process (Ziller 2009). It is impossible to entirely automate this second step, even though this would be desirable (Dräger et al. 2008). To be beneficial for the modeler, a semiautomatic procedure that suggests rate equations for each reaction must be able to take the types of reacting species on both sides of the reaction, as well as all the types of modifiers, into account. Due to the wide variety of cellular components and different reaction types, a plethora of special cases needs to be covered by such a program. After the model has been appropriately annotated it can be integrated into already existing models.
In the third step of the modeling pipeline, the parameters within the kinetic equations are assigned meaningful values that lead to a biologically plausible dynamic behavior of the model as a whole. Often optimization procedures try to estimate the parameter values with regard to the error between model output and measured data. Many optimization methods exist and it is often difficult to choose the best method for each case. 
Most optimization procedures themselves provide several parameters that highly influence their performance. A systematic comparison of several procedures, together with their specific settings, improves the quality of the solution (Dräger et al. 2009a). In the next step, an experimental validation of the result is desirable. To this end, several in silico experiments should be performed with the model that has already been determined to fit the experimental data. We can, for instance, introduce variations in the initial concentration of some substances or modify the temperature to affect the temperature-dependent parameters. Analyzing the long-term behavior of the model usually suggests another in silico experiment. All these model variants or simulation outputs should be verified by an experimenter. In additional wet-lab experiments, the predictive power of the model can be further checked by setting the experimental conditions to those from our in silico experiments. A subsequent quantification of all model components and a comparison of these new values with the predicted values is performed. Basically, two contrary results are possible, but in practice several intermediate levels can occur. First, the model simulation may provide an exact prediction of the experimental values. In this case, the model is considered valid. Second, the model predictions may deviate strongly from the new experimental data. This means that either the model or the system under study is incorrect. We have to go back and critically rethink both our model and the system. In the best case, this leads to new insights about the pathways in the system (Fig. 7.1). Furthermore, a model that reproduces experimental data can also serve
as a source of inspiration for the experimenter to perform completely different investigations on the system. Finally, a model should be annotated and fully documented to make it reusable by other researchers even after a long time. As with the previous steps, an automatic procedure that assists the modeler in this time-consuming step is desirable (Dräger et al. 2009b; Laible and Le Novère 2007; Liebermeister et al. 2008, 2009). Such an automatic documentation procedure can also assist in the construction of the model structure and help the modeler to gain an overview of the equations within the model. An important prerequisite for automating the modeling process is a standard data format that ensures reusability for exchange of quantitative models. A computer program can only deal with formal model representations like the Systems Biology Markup Language (SBML) (Hucka et al. 2003, 2008), which should not be written by hand but be created with the help of computer programs (Lloyd et al. 2004). In this way, modelers do not have to be concerned about particular file formats that are required to represent both mathematical and biological concepts in computer-interpretable constructs. Such software tools should also check the syntactical correctness of the model during creation to ensure consistent mathematical frameworks. To communicate the semantics of a model unambiguously to both computers and humans, controlled vocabularies like the Systems Biology Ontology (SBO) are important (Le Novère et al. 2006b). Otherwise, an automatic procedure will not be able to create rate equations in a context-dependent manner. An alternative approach would be to apply the same type of approximative rate equations to every reaction within a network (Borger et al. 2007a). However, a minimal amount of computer-interpretable knowledge is still required, for instance, in order to distinguish between different types of modification. 
In the following sections, we give an overview of all the requirements of the modeling pipeline described above. We assume that the topology of the system is known and for the sake of simplicity, we also assume that the sizes of compartments and the values of all parameters stay constant during the period of simulation.
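The calibration step of the pipeline (step 3) can be made concrete with a deliberately small sketch. It assumes a single irreversible reaction S → P with mass-action kinetics, dS/dt = −k·S, noise-free synthetic "measurements" generated from k = 0.5 purely for illustration, and a golden-section search as a simple stand-in for the more elaborate optimization procedures mentioned above. All function names are hypothetical.

```python
import math


def simulate(k, s0, times, steps_per_unit=100):
    """Integrate dS/dt = -k*S with the classical Runge-Kutta method.

    Returns the simulated amount of S at each requested time point
    (times must be non-negative and increasing).
    """
    def rate(s):
        return -k * s

    values, s, t = [], s0, 0.0
    for target in times:
        n = max(1, round((target - t) * steps_per_unit))
        h = (target - t) / n
        for _ in range(n):
            k1 = rate(s)
            k2 = rate(s + 0.5 * h * k1)
            k3 = rate(s + 0.5 * h * k2)
            k4 = rate(s + h * k3)
            s += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = target
        values.append(s)
    return values


def squared_error(k, s0, times, measured):
    """Sum of squared differences between model output and data."""
    model = simulate(k, s0, times)
    return sum((m - d) ** 2 for m, d in zip(model, measured))


def golden_section(objective, low, high, tol=1e-6):
    """Minimize a one-dimensional unimodal objective on [low, high]."""
    ratio = (math.sqrt(5) - 1) / 2
    while high - low > tol:
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if objective(left) < objective(right):
            high = right
        else:
            low = left
    return (low + high) / 2


# Synthetic "measurements", generated from k = 0.5 purely for illustration.
S0 = 10.0
TIMES = [0, 1, 2, 4, 8]
DATA = [S0 * math.exp(-0.5 * t) for t in TIMES]

k_fit = golden_section(lambda k: squared_error(k, S0, TIMES, DATA), 0.0, 5.0)
print(f"estimated k = {k_fit:.4f}")
```

A one-parameter search like this only works because the error is unimodal in k. Realistic models have many parameters and noisy data, which is why the choice and tuning of the optimization method matter as much as the text suggests.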
7.2 Standards in Systems Biology Any automatic procedure that assists modelers during the model building process greatly depends on the availability of specific information about the model components and reaction types. In particular, our modeling pipeline requires that the results of each modeling step can be used directly as the input for the next procedure. For this purpose, clear and standardized semantics for all model components are important because it does matter whether, for instance, a modifier acts as an enzyme or as an inhibitor. The role of each reaction participant should be clearly and unambiguously highlighted to be interpretable for both humans and machines. Models of biological processes are often published as a set of mathematical equations or in a descriptive form (Nickerson and Buist 2009). Both formats hamper the reproducibility and reusability of the models. In even worse cases, the publication
of a model in the form of program code shifts the focus from the model itself to the algorithmic description of how to compute and simulate it. This form of representing the model can therefore potentially hide its mathematical meaning (Lloyd et al. 2004). Furthermore, the exchange of such a model between different simulation environments may not be easy (Finney and Hucka 2003). Often reaction pathways are encoded in figures with certain graphical shapes (glyphs) representing reactants, products, and modifiers. Several types of arcs interconnect these glyphs representing processes such as the reaction, catalysis, and inhibition. Such figures can only be interpreted by humans with the help of a figure legend and are not easily machine-readable. Model description standards are thus required to ensure both software interoperability and model reusability. Otherwise one day the plethora of naming and drawing conventions in biology, and file formats in software development, will resemble the confusion of tongues at the Tower of Babel (Fig. 7.2) leading to the need to reinvent the wheel again and again.
Fig. 7.2 The construction of the Tower of Babel. According to biblical legend (Genesis 11:1–9), before the Tower of Babel was built, mankind had one language and one culture. Upon seeing man’s conceit and this arrogant monument to himself, God became angry and disrupted the work on it by creating such a diversity of tongues or languages that workers could no longer communicate with each other. The Tower of Babel could hence never be completed and the people were dispersed over the face of the earth. Reproduced from the oil painting “The Tower of Babel” by Pieter Bruegel the Elder, 1563 AD, with permission from the Kunsthistorisches Museum Vienna, Austria
Since even syntactically correct models can still be semantically incorrect, or barely understandable, a standardized form of model annotation, together with its visual representation, will allow automated validation of model consistency (Le Novère 2006; Le Novère et al. 2006b).
7.2.1 The Systems Biology Markup Language A well-known problem in the lifetime of software is that as soon as the program is no longer being further enhanced or administrated, its internal storage standard may disappear. In the field of research such cases are especially likely to occur, because often only functional software prototypes are developed. As a consequence, models that have been created with this software and stored in its own format may no longer be readable and therefore not usable anymore. If one wants to reuse such a model in further investigations, it often has to be recreated from scratch or translated into another language or format, which is often an error-prone process. In even worse cases, models are hard-coded in a specific programming language and have therefore never been exchangeable with other software tools (Rodriguez et al. 2007). To avoid such cases a standard format for model creation and storage is required to provide the scientific community with a highly reproducible and reusable data structure. To this end, the Systems Biology Markup Language (SBML) has been developed to store and represent various kinds of biochemical models, i.e., gene-regulatory or signal transduction pathways, and metabolic networks (Hucka et al. 2003, 2004). As an XML-based language (eXtensible Markup Language), SBML is designed as a platform-independent, computer-readable, and tool-neutral data format (Shapiro et al. 2007). A complete documentation can be found at http://sbml.org. With more than 180 software tools that now support SBML (December 2009), it has become a widely accepted standard in the systems biology community and now defines a special MIME type of the Internet Engineering Task Force (IETF) (Le Novère 2006). The development of SBML is driven by the needs of software developers and the scientific community (Le Novère 2006). 
SBML thus mirrors a consensus as it does not cover all requested features but the currently most important aspects of modeling (Hucka et al. 2008). One main concept of the language is that it is organized on coexisting levels. Minor changes in the language definitions are called versions. Currently, SBML Level 1 Version 2 and SBML Level 2 Version 4 define the two language specifications and SBML Level 3 Version 1 Core Release 1 Candidate has just been proposed (December 2009) (Finney and Hucka 2003; Hucka et al. 2008). Besides the precise XML schema for SBML, the powerful and easily usable library, libSBML (Bornstein et al. 2008), constitutes one of the main reasons for the widespread acceptance of SBML: libSBML assures software interoperability because it provides not only parsers and writers for SBML but also methods for consistency checks and interconversion methods between SBML levels and versions. With the help of libSBML sophisticated user interfaces can be created that assist modelers when encoding their models in SBML. CellML constitutes another freely available XML-based modeling language with an objective similar to that of SBML (Lloyd et al. 2004). Since there is a broad overlap in the scope of the two modeling languages, CellML models can be converted to SBML (Schilstra et al. 2006). However, the slightly different language structures may cause a loss of information during this conversion. A full documentation on CellML can be found at http://www.cellml.org. An advantage of CellML is that it
allows both modular and multiscale models, a feature that is also intended to be introduced into SBML with the release of Level 3. In contrast to CellML, SBML is a language tailored specifically for biochemical models and therefore serves as a lingua franca in this field of research (Hucka et al. 2004). Therefore, we will only focus on SBML, but similar approaches could be taken for CellML as well. In the remainder of this section, we introduce and explain some main components of SBML Level 2 Version 4. Every SBML file starts with the declaration of the XML type used, followed by the definition of one model object, which may contain several other components each collected and defined within dedicated list objects (Listing 1). The name of each such component list starts with the prefix listOf followed by the plural form of the name of the component, for instance, UnitDefinitions or Compartments. In our description of SBML, we focus only on the subset of components described below.
Listing 1 Minimal framework of an SBML model
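As a rough illustration of such a framework, the following sketch builds the outer sbml and model elements with Python's standard XML library and checks that the component lists discussed in this section appear in the order SBML Level 2 prescribes. It assumes the Level 2 Version 4 namespace URI given in the SBML specification; the identifiers (example_model, c1, s1) and the helper functions are hypothetical, and the check is no substitute for the consistency checks of libSBML.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4 (taken from the SBML specification).
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

# The component lists covered in this section, in the order in which
# they have to appear inside the model element.
LIST_ORDER = [
    "listOfUnitDefinitions",
    "listOfCompartments",
    "listOfSpecies",
    "listOfParameters",
    "listOfReactions",
]


def q(name):
    """Qualify an element name with the SBML namespace."""
    return f"{{{SBML_NS}}}{name}"


def minimal_sbml(model_id):
    """Build the outer sbml/model framework without any component lists."""
    sbml = ET.Element(q("sbml"), {"level": "2", "version": "4"})
    ET.SubElement(sbml, q("model"), {"id": model_id})
    return sbml


def lists_in_order(sbml):
    """Check that the component lists that are present follow LIST_ORDER."""
    model = sbml.find(q("model"))
    names = [child.tag.split("}", 1)[1] for child in model]
    ranks = [LIST_ORDER.index(n) for n in names if n in LIST_ORDER]
    return ranks == sorted(ranks)


ET.register_namespace("", SBML_NS)  # serialize without a namespace prefix
doc = minimal_sbml("example_model")
model = doc.find(q("model"))

# Lists are optional, so only those that are needed get added.
compartments = ET.SubElement(model, q("listOfCompartments"))
ET.SubElement(compartments, q("compartment"), {"id": "c1", "size": "1"})
species = ET.SubElement(model, q("listOfSpecies"))
ET.SubElement(species, q("species"), {"id": "s1", "compartment": "c1"})

print(ET.tostring(doc, encoding="unicode"))
print(lists_in_order(doc))
```

Writing the file through a library rather than by hand follows the recommendation made earlier in this chapter; in practice libSBML would take the place of the generic XML calls used here.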
listOfUnitDefinitions If undefined, SBML predefines the following units: substance (in moles), volume (in liters), area (in square meters), length (in meters), and time (in seconds). User-defined units can be useful in order to ensure unit balance and consistency. listOfCompartments In SBML a compartment constitutes a finite reaction space, which does not have to be a cellular compartment. At least one compartment should be defined. listOfSpecies The reacting substances within the model that can be referenced in reactions as reactants, modifiers, or products. listOfParameters A list of model-wide valid constants or variables. A KineticLaw object defined within one reaction may contain a list of local parameters whose values are always constant and only valid within the particular rate equation. listOfReactions Each reaction should reference species acting as reactants, modifiers, or products and may contain one KineticLaw object. For a more comprehensive description, we refer the reader to the SBML specifications (Hucka et al. 2008). Note that SBML requires a designated order of all lists in the file, but all lists are optional. The reactive species play a central role within the model. Each species has to be located inside of exactly one compartment. Therefore, a model that contains species
must also contain at least one compartment. Listing 2 declares the species S, P, and E, each within the compartment cytosol with a volume of 1 nl. To take part in a reaction, a species must act as a reactant, product, or modifier of that reaction. For this purpose, a reaction contains references to the species. With the help of its unambiguous identifier, such a speciesReference refers to the original species in the model. Listing 3 encodes the irreversible reaction of substrate S being converted into product P catalyzed by enzyme E:

\[ S \xrightarrow{\;E,\,P\;} P, \tag{7.1} \]
with feedback inhibition by the product. To simulate the model dynamically, the reaction needs a kineticLaw object assigned to it. This law is written in a subset of the XML-based mathematics language MathML and may refer to the values of several parameters. These parameters can be defined either globally within the model or locally within the reaction. Local parameters are valid only within the kineticLaw that defines them. Global parameters are not necessarily constants: they act as variables of the model if their constant attribute is set to false. Listing 4 encodes the following equation, the enzymatic rate law describing the competitive inhibition of an irreversible unireactant enzyme by the product:

\[ v = \frac{V \times S}{K_S + \frac{K_S}{K_P}\,P + S}. \tag{7.2} \]
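To make Eq. (7.2) concrete, the following minimal Python sketch evaluates the rate law for two illustrative states. The function name and all numerical values are assumptions chosen for illustration; they do not come from the listings of this chapter.

```python
def competitive_product_inhibition(V, S, P, Ks, Kp):
    """Rate law of Eq. (7.2): v = V*S / (Ks + (Ks/Kp)*P + S)."""
    return V * S / (Ks + (Ks / Kp) * P + S)


# Illustrative (hypothetical) values: all constants set to one.
v_without_product = competitive_product_inhibition(V=1.0, S=1.0, P=0.0, Ks=1.0, Kp=1.0)
v_with_product = competitive_product_inhibition(V=1.0, S=1.0, P=2.0, Ks=1.0, Kp=1.0)
```

With these values the accumulated product halves the rate (0.5 versus 0.25), which is the feedback inhibition that the species P exerts as a modifier of the reaction.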
Listing 2 Definition of compartments and species in SBML

Listing 3 Definition of a reaction in SBML

Listing 4 Definition of a rate equation in SBML

Listing 5 Definition of a unit in SBML
For the parameter V we declare the unit mol_per_s (Listing 5), i.e.,

\[ 1\ \mathrm{mol\,s^{-1}}, \tag{7.3} \]

to ensure the validity of the model, because every kineticLaw should evaluate to units of substance per time. The parameters $K_S$ and $K_P$ have the predefined unit substance, which is mol by default. The species S, P, and E also use the default substance unit because initial amounts are declared for these species and their hasOnlySubstanceUnits attribute is set to true. This flag indicates that the species always has to be interpreted in terms of amounts of substance (such as molecule counts) and not in terms of concentrations, e.g., mol per litre. Therefore, the unit of the reaction rate is exactly the unit of parameter V.
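The components described in this section can be assembled into a skeleton document. The following Python sketch uses only the standard library module xml.etree.ElementTree and is meant as an illustration under stated assumptions, not as a reproduction of Listings 1 to 5: the identifiers example_model and R1, the initial amounts, and the parameter value are hypothetical, and the kineticLaw with its MathML content is omitted.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"


def build_minimal_model():
    """Build an SBML skeleton with the lists in the order SBML requires."""
    sbml = ET.Element("sbml", {"xmlns": SBML_NS, "level": "2", "version": "4"})
    model = ET.SubElement(sbml, "model", {"id": "example_model"})  # hypothetical id

    # User-defined unit mol_per_s = mole * second^-1.
    unit_defs = ET.SubElement(model, "listOfUnitDefinitions")
    unit_def = ET.SubElement(unit_defs, "unitDefinition", {"id": "mol_per_s"})
    units = ET.SubElement(unit_def, "listOfUnits")
    ET.SubElement(units, "unit", {"kind": "mole"})
    ET.SubElement(units, "unit", {"kind": "second", "exponent": "-1"})

    # One compartment of 1 nl (the default volume unit is litre).
    compartments = ET.SubElement(model, "listOfCompartments")
    ET.SubElement(compartments, "compartment", {"id": "cytosol", "size": "1e-9"})

    # Species interpreted as amounts, not concentrations.
    species = ET.SubElement(model, "listOfSpecies")
    for species_id in ("S", "P", "E"):
        ET.SubElement(species, "species", {
            "id": species_id,
            "compartment": "cytosol",
            "initialAmount": "1",  # placeholder value
            "hasOnlySubstanceUnits": "true",
        })

    parameters = ET.SubElement(model, "listOfParameters")
    ET.SubElement(parameters, "parameter",
                  {"id": "V", "value": "1", "units": "mol_per_s"})  # placeholder value

    # Irreversible reaction S -> P with the modifiers E and P.
    reactions = ET.SubElement(model, "listOfReactions")
    reaction = ET.SubElement(reactions, "reaction", {"id": "R1", "reversible": "false"})
    reactants = ET.SubElement(reaction, "listOfReactants")
    ET.SubElement(reactants, "speciesReference", {"species": "S"})
    products = ET.SubElement(reaction, "listOfProducts")
    ET.SubElement(products, "speciesReference", {"species": "P"})
    modifiers = ET.SubElement(reaction, "listOfModifiers")
    for modifier_id in ("E", "P"):
        ET.SubElement(modifiers, "modifierSpeciesReference", {"species": modifier_id})
    return sbml


xml_text = ET.tostring(build_minimal_model(), encoding="unicode")
```

A document produced in this way is only a structural sketch; for real models a dedicated library such as libSBML, with its consistency checks, is the more reliable choice.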
7.2.2 The Systems Biology Ontology

From the SBML example model in the previous section we learn that there is no way to distinguish between different kinds of modifiers within a reaction (Listing 3). Looking at the SBML model alone, it remains unclear whether a modifier acts as an inhibitor, a catalyst, or an activator. However, as the kinetic equation must reflect the role of the modifier, knowledge about this function should also be contained, at least implicitly, within the SBML model. If rate equations are to be derived automatically from a given network topology, this distinction even becomes a prerequisite. But why does SBML not contain special objects for each kind of modification? The reason is that this semantic information goes far beyond the scope of SBML, which has been developed to store and exchange a pure description of the mathematical model, in particular, complete models that already contain kinetic equations. Fortunately, SBML offers several ways to overcome this limitation. The Systems Biology Ontology (SBO, http://www.ebi.ac.uk/sbo) (Le Novère et al. 2006) provides the simplest way to satisfy our requirement.
In the context of information science, an ontology is an explicit specification of a conceptualization. A conceptualization is a simplified, abstract view of the world that we wish to represent for a specific purpose. Ontologies are represented as a collection of controlled vocabulary terms and formal axioms that constrain their interpretation and guarantee the well-formed use of these terms. Each term has a unique identifier and a verbal definition. Ontologies can often be viewed as a hierarchically structured, directed acyclic graph, a taxonomic hierarchy of classes (Gruber 1993). Figure 7.3 illustrates a part of the SBO graph up to level five. Several other ontologies have also been defined to help clarify and structure the usage of concepts in science, of which the Gene Ontology (GO) (Ashburner et al. 2000) is one of the best-known examples in biology. The SBO is similar to the GO, but only contains terms that are clearly related to systems biology. The SBO mainly aims to define relations between semantic descriptions of model components and model structure.

The SBO is organized in six vocabularies, each with an is-a hierarchy of sub-classes. The entity class allows us to specify the material entity for species in SBML, e.g., whether a species represents a ribonucleic acid or an ion. In the interaction branch the SBO provides terms to distinguish between various kinds of reactions such as transport processes or degradation. Furthermore, the SBO already contains several mathematical expressions with predefined rate laws to describe specific types of reactions, each including links to other SBO terms. These links point into two further branches: quantitative parameter, which defines various kinds of kinetic parameters, and participant role, which characterizes products, reactants, biological activity, compartments, and modifiers like inhibitors, potentiators, or catalysts. The modeling framework branch is useful to ascertain whether a model is to be interpreted in a continuous, discrete, or logical manner. Therefore, all kinetic equations in the SBO point to sub-categories of this branch. The model itself, however, should be annotated with a term from the interaction branch. The program semanticSBML provides convenient methods to create and edit SBO annotations within an SBML file (Liebermeister et al. 2008, 2009).

With this powerful annotation at hand, an automatic procedure can derive a list of the most suitable rate equations from the stoichiometry of a reaction and the knowledge of the roles and material entities of the participating species. As an example of what such an annotated model looks like, let us consider the definition of Reaction 1 in Listing 3. An annotated version of this definition is shown in Listing 6. This version allows an automatic procedure to clearly distinguish between the different roles of the two modifiers E and P. The SBO itself already contains many special cases of such equations. Conversely, generated rate laws and their parameters can also be annotated automatically using the corresponding terms from the SBO. In this way, even automatically generated models are human- and machine-interpretable. For an even more comprehensive model annotation, parameters, compartments, and species should also be annotated accordingly.
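How an automatic procedure might exploit such annotations can be sketched in a few lines of Python. The sketch assumes that each modifier carries an sboTerm value and that the identifiers SBO:0000013 (catalyst), SBO:0000020 (inhibitor), and SBO:0000021 (potentiator) are the relevant terms; these identifiers should be verified against the SBO browser, and the function and variable names are hypothetical.

```python
# Assumed SBO identifiers for participant roles; verify against the SBO itself.
SBO_ROLES = {
    "SBO:0000013": "catalyst",
    "SBO:0000020": "inhibitor",
    "SBO:0000021": "potentiator",
}


def classify_modifiers(annotated_modifiers):
    """Split modifiers into catalysts and signed modulators.

    annotated_modifiers maps species identifiers to sboTerm values.
    Returns (catalysts, modulation), where modulation maps a species to
    -1 (inhibition) or +1 (activation).
    """
    catalysts = []
    modulation = {}
    for species_id, sbo_term in annotated_modifiers.items():
        role = SBO_ROLES.get(sbo_term)
        if role == "catalyst":
            catalysts.append(species_id)
        elif role == "inhibitor":
            modulation[species_id] = -1
        elif role == "potentiator":
            modulation[species_id] = 1
        else:
            raise ValueError("no known modifier role for %s (%s)" % (species_id, sbo_term))
    return catalysts, modulation


# The two modifiers of the example reaction: enzyme E and inhibiting product P.
catalysts, modulation = classify_modifiers({"E": "SBO:0000013", "P": "SBO:0000020"})
```

Without the annotation, both E and P would simply be modifiers; with it, E can be treated as the enzyme of the reaction and P as an inhibitor.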
Fig. 7.3 Extract from the Systems Biology Ontology graph (June 2009). This figure shows the SBO terms up to level five in the directed acyclic SBO graph, with the six branches entity, interaction, modeling framework, quantitative parameter, mathematical expression, and participant role. All arcs mean is a and are directed toward the parent node. The nodes labeled with “. . .” stand for omitted terms. The graph is split into six distinct graphs, each rooted at the same node sbo term (highlighted in gray)
Listing 6 Definition of a reaction in SBML including SBO annotations
7.2.3 The Systems Biology Graphical Notation

Up to now, various glyphs have been used in a multitude of publications to display reaction pathways under the tacit assumption that a specific set of known symbols is widely accepted within the community. However, in a cross-disciplinary field such as systems biology, these diagrams should also be intuitively interpretable for scientists from other fields of research. To this end, the Systems Biology Graphical Notation (SBGN) explicitly defines these formerly implicitly accepted symbols in a comprehensive system and brings them together into one context with precise and visually unambiguous semantics. The SBGN consistently defines syntactic rules for the usage of all specified shapes, arcs, and arrows. In this way, interconversion between graphical models for visualization and formal models for analysis or simulation becomes possible, because all glyphs and edges are intended to correspond to specific SBO terms. Graph drawing software based on SBGN makes it easy to publish, teach, or simply visualize biological models. Due to the unambiguous interpretation of the graphical information, it will also be possible to automatically extract models from printed diagrams. The SBGN homepage (http://sbgn.org) provides the documentation of the standard for process diagrams (Kitano et al. 2005; Le Novère et al. 2008). Figure 7.4 gives an overview of several single reactions in SBGN. In our modeling pipeline, we apply the SBGN-based modeling tool CellDesigner, which allows us to easily translate the information from pathway databases into an SBML-formatted file without having to care about SBO terms or exact SBML syntax.

Fig. 7.4 Example reactions in systems biology graphical notation: (a) reversible ion-catalyzed reaction; (b) reversible uni–uni enzyme reaction with feedback inhibition; (c) reversible bi–uni enzyme reaction; (d) reversible bi–uni enzyme reaction with feedback inhibition; (e) irreversible bi–bi enzyme reaction; (f) irreversible bi–bi enzyme reaction with feedback inhibition; (g) irreversible signal transduction reaction; (h) transcription and translation. Ovals represent simple molecules, whereas ions are displayed as circles. Rectangles with rounded edges denote macromolecules. Residues of these macromolecules are highlighted using circles at the edges of these glyphs (Fig. 7.4g). According to CellDesigner, genetic elements can be drawn as rectangles. The empty set symbol Ø illustrates sources and sinks of reactions. In CellDesigner, parallelograms denote RNA. Table 7.1 explains the various kinds of reaction modifications, such as catalysis and inhibition. CellDesigner places the identifier of each reaction next to each arc, e.g., re1
7.3 Toward Generalized Rate Laws

The topology of a given reaction system corresponds to a stoichiometric matrix $\mathbf{N}$, which SBML encodes implicitly. Each row in this matrix stands for one species and each column is assigned to one reaction. A positive value $n_{ij}$ means that reaction $R_j$ produces $n_{ij}$ molecules of species $S_i$, whereas negative values represent the consumption of the species. An entry $n_{ij} = 0$ means that species $S_i$ does not take part as a reactant or product in reaction $R_j$. In a similar way, we define the ternary modulation matrix $\mathbf{W}$ that contains the values $w_{jm} = -1$ if species $S_m$ inhibits reaction $R_j$, $w_{jm} = 1$ if $S_m$ acts as an activator, or $w_{jm} = 0$ if no interaction takes place between $R_j$ and $S_m$. In our framework we can then compute the rates of change of each species within the system as follows:

\[ \frac{\mathrm{d}}{\mathrm{d}t}\,\mathbf{S} = \mathbf{N}\,\mathbf{v}(\mathbf{S}(t), t, \mathbf{W}, \mathbf{p}), \tag{7.4} \]
with the vector of reaction velocities or rate laws $\mathbf{v}$ that depends on time $t$, the vector of reacting species $\mathbf{S}$, the modulation matrix $\mathbf{W}$, and the parameter vector $\mathbf{p}$. In the rate laws we consider here, $\mathbf{v}$ depends on time only implicitly. Every rate law $v_j$ corresponds to one dynamic modeling framework, i.e., a description of how to interpret the equation. We classify these modeling frameworks according to a two-dimensional scheme. On the first axis we distinguish discrete and continuous equations. On the second axis we classify each rate equation as either probabilistic or deterministic. Many kinetic equations have been proposed for each of the four possible combinations within this scheme of modeling frameworks (Albert 2007). The purpose of our model and its desired level of detail play a central role when selecting a modeling framework together with appropriate kinetic equations. In the remainder of this section we give a short overview of some important classes of continuous deterministic rate equations, which we can insert into Eq. (7.4). More information about specific rate laws and special cases can be found in relevant textbooks on enzyme kinetics (Bisswanger 2000; Cornish-Bowden 2004; Segel 1993) and in the SBO specifications.

Here we mainly consider the reversible forms of selected rate equations because, in multi-enzyme systems, the rate equation of an irreversible reaction implies that the product concentrations are zero and do not interfere with the progress of the reaction. The assumption of irreversible reactions originates from in vitro enzyme studies on single reactions and is, in most cases, not satisfied for multi-enzyme systems (Cornish-Bowden 2004, pp. 312–314). Several studies have found that the effects of the products play an important role in multi-enzyme systems (Dräger et al. 2007a,b, 2009a). Irreversible rate equations should therefore only be applied in very specific cases like the transport of a substance out of the system. In many cases actual reaction mechanisms remain unknown and therefore reliable, specific rate equations cannot be derived (Bulik et al. 2009). With an increasing number of pathways and reactions stored in databases like KEGG or MetaCyc
(Caspi et al. 2008; Ogata et al. 2000), the discovery of the actual underlying reaction mechanisms becomes more and more important. In vitro experiments with isolated reactions of purified enzymes are often applied to uncover the individual reaction steps. The same enzyme can, however, behave differently in its cellular environment (Cornish-Bowden 2004, p. 277), leading to discrepancies between measured rate equations and its actual behavior in vivo. In models encoded in SBML it is usually not evident how the reaction mechanism proceeds because, for the sake of simplicity, intermediate steps are typically omitted. Therefore, the mechanism often cannot be reconstructed from such a topology. Any method that assigns rate laws more or less automatically needs some level of abstraction from the reaction mechanism. Hence, what is needed are generalized rate equations that abstract from the underlying processes to a certain degree and are therefore applicable to a wider range of reactions than equations derived individually for very specific reaction mechanisms.
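As a small illustration of Eq. (7.4), the following Python sketch computes the rates of change for the single example reaction S → P and integrates them with an explicit Euler scheme. It is a sketch under stated assumptions: the rate law of Eq. (7.2) is used for the only reaction, all parameter values are hypothetical, the modulation matrix is not needed because the inhibition is already part of the rate law, and the Euler scheme merely stands in for a proper ODE solver.

```python
def rates_of_change(N, v):
    """Return dS/dt = N v (rows of N: species, columns of N: reactions)."""
    return [sum(n_ij * v_j for n_ij, v_j in zip(row, v)) for row in N]


def velocities(S, p):
    """Vector of rate laws; here only Eq. (7.2) for the reaction S -> P."""
    substrate, product = S
    return [p["V"] * substrate / (p["Ks"] + p["Ks"] / p["Kp"] * product + substrate)]


def euler(S0, N, rate_fn, p, dt, steps):
    """Explicit Euler integration of dS/dt = N v(S, p)."""
    S = list(S0)
    for _ in range(steps):
        dS = rates_of_change(N, rate_fn(S, p))
        S = [s + dt * d for s, d in zip(S, dS)]
    return S


# Species order (S, P); one reaction that consumes S and produces P.
N = [[-1], [1]]
params = {"V": 1.0, "Ks": 1.0, "Kp": 1.0}  # hypothetical values

initial_slope = rates_of_change(N, velocities([1.0, 0.0], params))
final_state = euler([1.0, 0.0], N, velocities, params, dt=0.01, steps=100)
```

Because the two rows of N sum to zero, the total amount of S and P is conserved during the integration, which provides a simple sanity check for the stoichiometry.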
7.3.1 Generalized Mass Action Kinetics

For cases in which the enzyme mechanism does not play the main role, or for non-catalyzed or non-enzymatic reactions, the mass action rate law can be applied. This very simple equation is derived from the assumption that the reaction probability grows proportionally to the collision probability of the reactants and is therefore proportional to the concentrations of all substrates raised to the power of their stoichiometric molecularity, i.e., the number of molecules that need to collide to initiate the reaction:

\[ v_j(\mathbf{S}, \mathbf{p}) = k_{+j} \prod_i [S_i]^{n^-_{ij}} - k_{-j} \prod_i [S_i]^{n^+_{ij}}, \tag{7.5} \]
where $n^{\pm}_{ij}$ are the absolute values of the negative and the positive stoichiometric coefficients and the parameters $k_{\pm j}$ denote the forward and reverse rate constants that depend on temperature and pressure. Square brackets around a species denote the concentration of the species. In 1983, Schauer and Heinrich proposed a generalized form of the mass action equation (Heinrich and Schuster 1996; Schauer and Heinrich 1983):

\[ v_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) = F_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) \times \left( k_{+j} \prod_i [S_i]^{n^-_{ij}} - k_{-j} \prod_i [S_i]^{n^+_{ij}} \right), \tag{7.6} \]
in which the $F_j$ terms are arbitrary positive functions introducing saturation and inhibition effects into the mass action equation. With this generalization it becomes possible to include the effects of activators and inhibitors even in the mass action equation. This can, for instance, be achieved by setting

\[ F_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) = \prod_m h_A([S_m], k^A_{jm})^{w^+_{jm}}\; h_I([S_m], k^I_{jm})^{w^-_{jm}}. \tag{7.7} \]
Here, the modulation matrix $\mathbf{W}$ comes into play. Accordingly, the $w^{\pm}_{jm}$ denote the absolute values of the positive and negative elements in $\mathbf{W}$. The product runs over all modulators. Both the activation function $h_A$ and the inhibition function $h_I$ depend on the concentration of the modulating species $S_m$ and on one parameter, $k^A_{jm}$ or $k^I_{jm}$, and are defined as follows (Liebermeister and Klipp 2006):

\[ h_A([S_m], k^A_{jm}) = \frac{[S_m]}{k^A_{jm} + [S_m]}, \tag{7.8} \]

\[ h_I([S_m], k^I_{jm}) = \frac{k^I_{jm}}{k^I_{jm} + [S_m]}. \tag{7.9} \]
Including both functions allows the generalized mass action kinetics to be applied to various reactions. In several publications this rate law has been used successfully as an approximate equation in cases where detailed knowledge about the reaction kinetics is lacking (Bulik et al. 2009; Dräger et al. 2007b, 2009a). It should also be noted that reactions with more than two reactants are unlikely to take place if all molecules are required to collide by chance within the medium. In many cases these reactions consist of several reaction steps, each involving only two reactants at a time (Cornish-Bowden 2004, p. 6).
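Equations (7.6) to (7.9) translate almost literally into code. The following Python sketch evaluates the generalized mass action rate of one reaction; the dictionary-based data layout, the function names, and the example values are assumptions made for illustration only.

```python
import math


def h_activation(conc, k_a):
    """Activation function of Eq. (7.8)."""
    return conc / (k_a + conc)


def h_inhibition(conc, k_i):
    """Inhibition function of Eq. (7.9)."""
    return k_i / (k_i + conc)


def generalized_mass_action(conc, n_col, w_row, k_fwd, k_rev, k_a=None, k_i=None):
    """Generalized mass action rate of Eqs. (7.6) and (7.7) for one reaction.

    conc  : species -> concentration
    n_col : species -> signed stoichiometric coefficient in this reaction
    w_row : species -> -1 (inhibitor), 0 (no effect), or +1 (activator)
    k_a, k_i : species -> activation or inhibition constant
    """
    k_a = k_a or {}
    k_i = k_i or {}
    # Forward term uses the absolute values of the negative coefficients,
    # the reverse term uses the positive coefficients.
    forward = k_fwd * math.prod(conc[s] ** -n for s, n in n_col.items() if n < 0)
    reverse = k_rev * math.prod(conc[s] ** n for s, n in n_col.items() if n > 0)
    modulation = 1.0
    for m, w in w_row.items():
        if w > 0:
            modulation *= h_activation(conc[m], k_a[m])
        elif w < 0:
            modulation *= h_inhibition(conc[m], k_i[m])
    return modulation * (forward - reverse)


# Hypothetical example: A + B <-> C, inhibited by species I.
v = generalized_mass_action(
    conc={"A": 2.0, "B": 1.0, "C": 0.5, "I": 1.0},
    n_col={"A": -1, "B": -1, "C": 1},
    w_row={"I": -1},
    k_fwd=1.0, k_rev=2.0, k_i={"I": 1.0},
)
```

In this example the net mass action term is 2 − 1 = 1 and the inhibitor at a concentration equal to its inhibition constant halves it, so the resulting rate is 0.5.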
7.3.2 Generalizing Enzyme Kinetics

The well-known Michaelis–Menten equation from 1913 constitutes the basis of modern enzyme kinetics and has been extended several times (Cornish-Bowden 2004; Segel 1993). The full reversible form of this equation assumes the reaction mechanism pictured in Fig. 7.5, which is sometimes referred to as a uni–uni reaction because one substrate molecule is converted to one product molecule.

\[ E + S_1 \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} ES_1 \underset{k_{-2}}{\overset{k_2}{\rightleftharpoons}} E + S_2, \qquad E + I \overset{K_{Ia}}{\rightleftharpoons} EI, \qquad ES_1 + I \overset{K_{Ib}}{\rightleftharpoons} ES_1I \]

Fig. 7.5 Uni–uni reaction scheme. This reaction scheme demonstrates the underlying mechanism of an enzyme-catalyzed reaction, including the states in which an inhibitor interferes with the reaction

An outstanding feature of this rate law is that it already contains terms describing the influence of inhibitors. Activation is, however, not considered in this equation but can be included by combining this equation with a function similar to $F_j(\mathbf{S}, \mathbf{W}, \mathbf{p})$ from the generalized mass action equation:

\[ v_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) = [E]_0 \underbrace{\prod_m h_A([S_m], k^A_{jm})^{w^+_{jm}}}_{\text{activation}} \; \frac{\dfrac{k_{\mathrm{cat}+j}}{K_{Mj,S_1}}[S_1] - \dfrac{k_{\mathrm{cat}-j}}{K_{Mj,S_2}}[S_2]}{1 + \dfrac{[I]}{K_{Iaj}} + \left( \dfrac{[S_1]}{K_{Mj,S_1}} + \dfrac{[S_2]}{K_{Mj,S_2}} \right) \left( 1 + \dfrac{[I]}{K_{Ibj}} \right)}. \tag{7.10} \]

Besides the activation constant $k^A_{jm}$, this equation contains several other parameters: the turnover numbers or catalytic constants $k_{\mathrm{cat}\pm j}$, the Michaelis constants $K_{Mj,S_1}$ and $K_{Mj,S_2}$ of the substrate $S_1$ and the product $S_2$, and the inhibition constants $K_{Iaj}$ and $K_{Ibj}$ that belong to the inhibitory species $I$. Often the initial enzyme concentration $[E]_0$ is unknown and is therefore avoided in the formula by introducing the limiting rates $V_{\pm j} = [E]_0 \times k_{\mathrm{cat}\pm j}$ as new parameters. It should be noted that the vector $\mathbf{S}$ includes the amounts of all reacting species and modifiers taking part in the reaction, i.e., $S_1$, $S_2$, $E$, and $I$. The SBO tree contains several special cases of this formula and also other enzyme rate laws that go beyond the scope of this chapter.

We would like to draw the reader’s attention to some special cases of enzyme kinetics with more than one substrate molecule. For those reactions, the order in which the substrate molecules bind to the enzyme strongly influences the velocity of the reaction. For two substrate and two product molecules (bi–bi reactions), we distinguish three basic types of reaction mechanisms: random order, compulsory order, and ping-pong mechanism. In the first case, both substrate molecules can bind to the enzyme in a random order. The products are also released randomly. In contrast, in the ordered mechanism, the order in which substrate molecules bind to the enzyme and the products are released is strictly fixed. The ping-pong mechanism works differently: the binding of the first substrate molecule induces a modification of the enzyme that enables the second substrate to bind right after the first product is released. Then the second substrate reacts to the second product, thereby recovering the enzyme for the next reaction. A rate law for these types of reactions must reflect these different reaction steps.
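The structure of Eq. (7.10) may be easier to follow in executable form. The Python sketch below evaluates the reversible uni–uni rate law with one inhibitor and an optional list of activators; the function signature and the numerical values are hypothetical and serve only to illustrate the equation.

```python
def reversible_uni_uni(s1, s2, inhibitor, e0, kcat_f, kcat_r,
                       km_s1, km_s2, ki_a, ki_b, activators=()):
    """Rate law of Eq. (7.10).

    activators is a sequence of (concentration, activation constant) pairs,
    one per activating species; each contributes one factor h_A of Eq. (7.8).
    """
    activation = 1.0
    for conc, k_a in activators:
        activation *= conc / (k_a + conc)
    numerator = kcat_f / km_s1 * s1 - kcat_r / km_s2 * s2
    denominator = (1 + inhibitor / ki_a
                   + (s1 / km_s1 + s2 / km_s2) * (1 + inhibitor / ki_b))
    return e0 * activation * numerator / denominator


# Hypothetical values; without product and inhibitor the law reduces to the
# irreversible Michaelis-Menten form V*S/(Km + S) with V = e0 * kcat_f.
common = dict(s1=1.0, s2=0.0, e0=1.0, kcat_f=2.0, kcat_r=1.0,
              km_s1=1.0, km_s2=1.0, ki_a=1.0, ki_b=1.0)
v_free = reversible_uni_uni(inhibitor=0.0, **common)
v_inhibited = reversible_uni_uni(inhibitor=1.0, **common)
```

With these values the uninhibited rate is 1.0 and the inhibitor, present at a concentration equal to both inhibition constants, reduces it to 0.5.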
Many special cases of rate laws are described in dedicated textbooks (Bisswanger 2000; Cornish-Bowden 2004; Segel 1993), all of which also cover the method of King and Altman (1956) to derive additional rate laws from the reaction mechanism. Many rate equations are already included in the SBO together with their formula and a short description. In 2006, Liebermeister and Klipp derived a generalization of the Michaelis–Menten equation, known as convenience kinetics:

\[ v_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) = F_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) \times [E_j] \; \frac{k_{\mathrm{cat}+j} \prod_i \left( \dfrac{[S_i]}{K_{Mji}} \right)^{n^-_{ij}} - k_{\mathrm{cat}-j} \prod_i \left( \dfrac{[S_i]}{K_{Mji}} \right)^{n^+_{ij}}}{\prod_i \sum_{m=0}^{n^-_{ij}} \left( \dfrac{[S_i]}{K_{Mji}} \right)^{m} + \prod_i \sum_{m=0}^{n^+_{ij}} \left( \dfrac{[S_i]}{K_{Mji}} \right)^{m} - 1}. \tag{7.11} \]
The convenience rate law is based on the random order ternary-complex mechanism, in which two substrate molecules bind in random order to the enzyme. One important achievement of the authors in the derivation of convenience kinetics is that they
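ensure the thermodynamic correctness of the system’s parameters by deriving a second form of this equation for cases in which the stoichiometric matrix of the system contains linearly dependent columns, i.e., $\mathbf{N}$ does not have full column rank:

\[ v_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) = F_j(\mathbf{S}, \mathbf{W}, \mathbf{p}) \times k^V_j \times [E_j] \times \frac{\prod_i \left( \dfrac{[S_i]}{K_{Mji}} \right)^{n^-_{ij}} \left( k^G_i K_{Mji} \right)^{-\frac{n_{ij}}{2}} - \prod_i \left( \dfrac{[S_i]}{K_{Mji}} \right)^{n^+_{ij}} \left( k^G_i K_{Mji} \right)^{\frac{n_{ij}}{2}}}{\prod_i \sum_{m=0}^{n^-_{ij}} \left( \dfrac{[S_i]}{K_{Mji}} \right)^{m} + \prod_i \sum_{m=0}^{n^+_{ij}} \left( \dfrac{[S_i]}{K_{Mji}} \right)^{m} - 1}. \tag{7.12} \]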
This condition is necessary because if the stoichiometric matrix does not have full column rank, at least one reaction thermodynamically depends on another reaction. Note that the thermodynamically independent form shown in Eq. (7.12) replaces the parameters $k_{\mathrm{cat}\pm j}$ of the simple form in Eq. (7.11) with $\left( k^G_i K_{Mji} \right)^{\mp n_{ij}/2}$, where the energy constant $k^G_i$ belongs to species $S_i$ regardless of reaction $R_j$. Additionally, the whole fraction is multiplied by the velocity constant $k^V_j$, which is only assigned to reaction $R_j$. The Michaelis constant $K_{Mji}$ contains both indices $i$ and $j$ and therefore belongs to one species in one reaction.
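The simple form of convenience kinetics, Eq. (7.11), can be sketched in Python as follows. The sketch assumes integer stoichiometric coefficients, takes the modulation factor as a precomputed number, and uses hypothetical names and values throughout; the thermodynamically independent form of Eq. (7.12) is not covered.

```python
import math


def convenience_rate(conc, n_col, km, enzyme, kcat_f, kcat_r, modulation=1.0):
    """Convenience kinetics of Eq. (7.11) for one reaction.

    conc  : species -> concentration
    n_col : species -> signed integer stoichiometric coefficient
    km    : species -> Michaelis constant of the species in this reaction
    modulation : value of F_j, e.g., computed from Eq. (7.7)
    """
    def scaled(species):
        return conc[species] / km[species]

    def saturation(species, order):
        # Polynomial 1 + x + ... + x^order in the scaled concentration x.
        return sum(scaled(species) ** m for m in range(order + 1))

    substrates = {s: -n for s, n in n_col.items() if n < 0}
    products = {s: n for s, n in n_col.items() if n > 0}

    numerator = (kcat_f * math.prod(scaled(s) ** n for s, n in substrates.items())
                 - kcat_r * math.prod(scaled(s) ** n for s, n in products.items()))
    denominator = (math.prod(saturation(s, n) for s, n in substrates.items())
                   + math.prod(saturation(s, n) for s, n in products.items())
                   - 1)
    return modulation * enzyme * numerator / denominator


# Hypothetical bi-uni example A + B <-> C with all Michaelis constants set to one.
v_bi_uni = convenience_rate(
    conc={"A": 1.0, "B": 1.0, "C": 1.0},
    n_col={"A": -1, "B": -1, "C": 1},
    km={"A": 1.0, "B": 1.0, "C": 1.0},
    enzyme=1.0, kcat_f=3.0, kcat_r=1.0,
)
```

Here the numerator is 3 − 1 = 2 and the denominator is 2·2 + 2 − 1 = 5, so the rate is 0.4. For a uni–uni reaction the same function reproduces the reversible Michaelis–Menten equation.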
7.3.3 The Hill Equation

In 1910, Archibald V. Hill derived a purely empirical equation that describes the cooperative effects of the binding of oxygen to hemoglobin (Heinrich and Schuster 1996, p. 23; Cornish-Bowden 2004, pp. 243–245). Mathematically, Hill’s equation takes the following form:

\[ v_j(S_i, \mathbf{p}) = \frac{V_j [S_i]^{h_j}}{K_j + [S_i]^{h_j}}, \tag{7.13} \]
with the limiting rate $V_j$, a constant $K_j$ (sometimes referred to as the half saturation constant, e.g., by Heinrich and Schuster (1996, p. 23)), and the Hill coefficient $h_j$. For $h_j = 1$ this equation simplifies to the irreversible non-modulated Michaelis–Menten equation. Note that $h_j$ is not necessarily an integer. It is a measure of the cooperativity effects of the enzymes, which does not depend on the number of substrate binding sites. Positive cooperative effects can be obtained with $h_j > 1$, whereas values of $h_j < 1$ lead to the much less frequently observed effect of negative cooperativity. The Hill equation has been used successfully to model the effects of gene-regulatory reactions (Hinze et al. 2007). Equation (7.13) does not take reverse reactions into account, which are, however, of great importance in systems with multiple enzyme-catalyzed reactions. Athel Cornish-Bowden therefore generalizes this equation to produce a reversible form that can be applied to multi-enzyme systems and also includes the effects of a modifier species M (Cornish-Bowden 2004, p. 314):
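\[ v_j(S, P, M, \mathbf{p}) = \frac{V_j \dfrac{[S]}{k_{Sj}} \left( 1 - \dfrac{[P]}{K_j [S]} \right) \left( \dfrac{[S]}{k_{Sj}} + \dfrac{[P]}{k_{Pj}} \right)^{h_j - 1}}{\dfrac{k_{Mj}^{h_j} + [M]^{h_j}}{k_{Mj}^{h_j} + \beta_j [M]^{h_j}} + \left( \dfrac{[S]}{k_{Sj}} + \dfrac{[P]}{k_{Pj}} \right)^{h_j}}, \tag{7.14} \]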
with the limiting rate of the forward reaction $V_j$ and the constants $k_{Sj}$, $k_{Pj}$, and $k_{Mj}$, which describe the substrate S, product P, and modifier concentrations giving $\tfrac{1}{2} V_j$. $K_j$ denotes the equilibrium constant and $h_j$ is the Hill coefficient. The modifier M acts either as an activator for $\beta_j > 1$ or as an inhibitor if $\beta_j < 1$. For $\beta_j = 1$ the modification term equals one, so that the modifier has no effect. However, one problem with this equation is the dependence between some of its parameters.
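Both forms of the Hill equation are short enough to be checked numerically. The following Python sketch implements Eqs. (7.13) and (7.14); argument names and values are hypothetical, and the reversible form assumes a nonzero substrate concentration because of the term [P]/(K_j [S]).

```python
def hill(s, V, K, h):
    """Irreversible Hill equation, Eq. (7.13)."""
    return V * s ** h / (K + s ** h)


def reversible_hill(s, p, m, V, k_s, k_p, k_m, K_eq, h, beta):
    """Reversible Hill equation with one modifier, Eq. (7.14); requires s > 0."""
    sigma = s / k_s            # scaled substrate concentration
    rho = p / k_p              # scaled product concentration
    modifier_term = (k_m ** h + m ** h) / (k_m ** h + beta * m ** h)
    numerator = V * sigma * (1 - p / (K_eq * s)) * (sigma + rho) ** (h - 1)
    return numerator / (modifier_term + (sigma + rho) ** h)


# Hypothetical values.
v_cooperative = hill(s=2.0, V=1.0, K=1.0, h=2.0)
base = dict(V=1.0, k_s=1.0, k_p=1.0, k_m=1.0)
v_neutral = reversible_hill(s=1.0, p=0.0, m=1.0, K_eq=1.0, h=1.0, beta=1.0, **base)
v_activated = reversible_hill(s=1.0, p=0.0, m=1.0, K_eq=1.0, h=1.0, beta=4.0, **base)
v_equilibrium = reversible_hill(s=1.0, p=2.0, m=0.0, K_eq=2.0, h=2.0, beta=1.0, **base)
```

The sketch reproduces the properties stated in the text: for a Hill coefficient of one, no product, and a neutral modifier it reduces to the Michaelis–Menten form, a value of beta above one increases the rate, and the rate vanishes when the ratio of product to substrate equals the equilibrium constant.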
7.4 Computer-Aided Mathematical Modeling of Biological Systems

After this introduction to two fundamental modeling requirements, standards in systems biology and generalized rate equations, we now direct the reader’s attention to our modeling pipeline. We start with the graphical design of a network topology and then discuss how to automatically equip all reactions with appropriate rate equations in a context-sensitive manner using the information from the annotated process diagrams.
7.4.1 The Graphical Modeling Tool CellDesigner

The SBGN and the model storage format SBML could only become accepted standards in the scientific community because of the existence of user-friendly tools that provide Graphical User Interfaces (GUIs) for model creation in SBML format according to SBGN. The program CellDesigner was developed as a graphical process diagram editor to fulfill exactly this task as a modeling tool for gene-regulatory, signaling, and biochemical networks (Funahashi et al. 2003, 2007b). Although a growing number of other modeling tools have since become available, and many of them even provide a diagram-based GUI (Alves et al. 2006), CellDesigner remains one of the most frequently used tools among members of the systems biology community (Klipp et al. 2007), and it has been used in several studies for model creation (Calzone et al. 2008; Dampier and Tozeren 2007; Oda et al. 2005). The reasons for this success are probably the user-friendly nature of the tool (Fig. 7.6), the intuitive program layout, and its free availability for Windows, Linux, and Mac OS (since August 2008, CellDesigner has been available in version 4.0.1 at http://celldesigner.org). The internal data format of all diagrams created with CellDesigner is SBML. CellDesigner does not convert the data to any other data structure. Therefore, CellDesigner is fully SBML compliant.

Fig. 7.6 Screenshot of the SBGN-based graphical modeling tool CellDesigner version 4.0.1. CellDesigner’s tool bar offers a large variety of glyphs representing certain types of species, compartments, state transitions, or types of modification. The user can insert any glyph by selecting it from the tool bar and clicking on the desired position within the process diagram. The menu bar provides several options, for instance, the plug-in menu or access to online databases. On the left-hand side, the hierarchy of model components is shown. At the bottom, all details about the components within the model can be viewed in well-arranged tables

The graphical layout of CellDesigner allows the user to create process diagrams (Kitano et al. 2005) that can be exported to figures in many different formats, including PDF, SVG, EPS, PNG, and JPEG. As one of the very early tools that supported graphical SBML editing, it uses software-specific annotation tags in SBML to store its layout information in a way that can only be interpreted by CellDesigner. More recently, the SBML layout extension has been suggested, and it will become an integral part of SBML Level 3 (Deckard et al. 2006; Gauges et al. 2006), but it is currently not supported by CellDesigner’s SBML code. CellDesigner is a Systems Biology Workbench (SBW)-enabled program (Hucka et al. 2002). This allows other programs to access CellDesigner’s functions through the SBW broker. CellDesigner can also call methods from other SBW-enabled programs. To simulate networks created with, or imported into, CellDesigner, it contains the SBML ODE Solver (Machné et al. 2006), and COPASI (Hoops et al. 2006) can also be easily integrated. All simulation results can be exported to JPEG, PNG, and various bitmap file formats.
Furthermore, specialized online databases assist the user when setting up the topology of a reaction system: SGD (Saccharomyces genome database, http://www.yeastgenome.org) (Cherry et al. 1998), DBGET (database retrieval system for a diverse range of molecular biology databases, http://www.genome.jp/dbget) (Fujibuchi et al. 1998), iHOP (information hyperlinked over proteins, http://www.ihop-net.org/UniPub/iHOP) (Hoffmann and Valencia 2004), PubMed (http://www.ncbi.nlm.nih.gov/pubmed), and BioModels.net (http://www.ebi.ac.uk/biomodels-main) (Le Novère et al. 2006a) can all be accessed via menu items (Funahashi et al. 2007a). A web service integration of CellDesigner and SABIO-RK allows us to equip many reactions with enzyme kinetic equations from this powerful database (Funahashi et al. 2007a).

7.4.2 Context-Sensitive Assignment of Rate Equations

CellDesigner aims to provide easy methods for model creation, simulation, and analysis and to allow users to convert a graphically represented model into mathematical formulas for analysis and simulation (Funahashi et al. 2007b). However, its internal equation generator does not allow for context-sensitive rate law creation. Since kinetic equations can only be selected from a rather limited set of predefined formulas, the user has to assemble most equations manually, which is a highly error-prone and cumbersome task. Furthermore, such a manual procedure runs the risk of rate equations conflicting with the SBGN representation of the process diagram. A fast comparison of several modeling approaches only becomes possible with a quick procedure that creates kinetic equations for given topologies based on the SBGN representation. The current version of CellDesigner cannot benefit from the SBO effort (Le Novère et al. 2006b), although CellDesigner contains its own annotation system that provides many features to distinguish different kinds of species, reactions, and, most importantly for our purposes, modifications. At this point a specialized rate law generator is required, one which performs such a semi-automatic rate law selection based on the context of each reaction and creates the user-selected equations. Since the release of CellDesigner 4.0α in December 2006, a plug-in interface has been available that allows developers to create a customized tool to fill this gap. The great advantage of CellDesigner over other model creation tools is that it has been offering methods to distinguish between several different glyphs for reactions, species, and regulatory modes since before the actual SBGN and SBO were specified. Figure 7.4 shows examples for many special cases of reversible and irreversible reactions in CellDesigner’s SBGN representation, including various
Table 7.1 Types of modulation. The exact meaning of the symbol modulation remains unclear. Enzymatic catalysis is a special case of catalysis. Currently, both forms of catalysis cannot be distinguished from each other based on the process diagram. Since most enzymes are proteins or RNA, we may consider a catalyst an enzyme if it belongs to these classes of molecules

SBGN connecting arc (definition) | Interpretation
Modulation | Activation, inhibition, or catalysis
(Physical) stimulation | Activation
Catalysis | Catalysis or enzymatic catalysis if E1 is, for instance, a protein or RNA
Inhibition | Inhibition
Trigger | Necessary activation
species types (ions, simple molecules, genes, RNA, proteins, and phosphorylation sites; empty sets as unknown sources) and the modulation symbols for feedback inhibition, catalysis, and necessary stimulation. Table 7.1 gives an overview of all the kinds of modulation arcs in CellDesigner and SBGN together with their interpretation. CellDesigner encodes this additional information within the SBML annotation tags, which contain software-specific information on reactions, modifiers, and species. Therefore, there is no conflict between the SBML standard and CellDesigner’s annotations. Since the release of SBML Level 2 Version 3, the same information can also be encoded using the SBO attributes of each element. In any case, an automatic procedure can interpret this information to suggest possible rate laws for each reaction. In particular, the annotations for different kinds of modulation are a prerequisite for such a rate law generation, because interpreting a modifier as an inhibitor, for example, leads to a very different rate law than interpreting the same species as a potentiator. The special reaction types transcription, translation, and regular state transition may also deserve special attention. It should also be considered that each species type may act differently. A reaction that takes place on the surface of a metal plate, e.g., with an ion species as the catalyst (Fig. 7.4a), can hardly fulfill the properties of an enzyme-catalyzed reaction (Fig. 7.4b–g), because the metal cannot, for instance, pass the substrate from one catalytic center to another. In contrast, several other species, especially proteins, complexes, and RNA, may show enzyme-like behavior under certain circumstances. A rate law generator must thus take the annotation of the species into account. The stoichiometric structure
of the reaction also plays an important role. For enzyme-catalyzed reactions with one reactant and one product (uni–uni), we may apply a Michaelis–Menten-based kinetic equation. The convenience rate law constitutes a generally applicable choice for all enzyme-catalyzed reactions with any number of reactants and products. In other cases, the mass action rate law can be selected. Note that the catalyst may be omitted from the process diagram for the sake of simplicity, but the reaction may still be considered to be enzyme-catalyzed. What still cannot be distinguished from a process diagram, even with SBO annotation, are both the order in which the reactants bind to the catalyst and the order in which products are released into the surrounding medium. Therefore, for bi–uni and bi–bi reactions (Fig. 7.4c–f) or even more complicated stoichiometries, any automatic procedure can only suggest the kinetic equation for compulsory or random order reactions or the substituted enzyme mechanism (also known as ping-pong mechanism) in the case of two reactants and two products. The knowledge of the user is still required to select one of these equations. Another special case of reactions are those that involve genes or mRNA, the so-called gene-regulatory reactions. Often these reactions show cooperativity or saturation effects because several transcription factors determine the expression level (the mRNA concentration) of genes, a behavior that can be modeled with Hill-like equations (Hinze et al. 2007) or a zeroth order mass action rate law in the case of the absence of any modifiers (basal transcription rate). The fact is often neglected that gene expression, i.e., the formation of RNA molecules, is a complex reaction process. Actually, this process is a reaction of nucleic acids to produce RNA molecules and requires genes as template. In many cases transcription factors, i.e., specialized proteins, inhibit or enhance the transcription process. 
The SBGN suggests that one should illustrate this overall process in order to be consistent with the example in Fig. 7.4h. Without SBO annotation, the SBML code corresponding to this cascade only describes two distinct uni–uni reactions with several modifiers. This information is insufficient for an automatic rate law generator, because such a procedure relies on annotations or SBO attributes to point out that the reactant is actually an empty set, the product of the first reaction is an RNA molecule, and that the modifiers are a gene and some stimulating or inhibiting proteins. Therefore, a controlled vocabulary together with a standardized diagrammatic network representation are required for all these attempts on automation.
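The selection rules described in this section can be summarized in a small decision function. The following Python sketch is only an illustration under simplifying assumptions: the class `Reaction`, its fields, and the returned rate law names are hypothetical and do not reproduce the implementation of any existing tool. In particular, the sketch ignores the case in which the catalyst is omitted from the diagram although the reaction is enzyme-catalyzed.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified description of a reaction's context.
# None of these names are taken from CellDesigner or SBMLsqueezer.
@dataclass
class Reaction:
    n_reactants: int
    n_products: int
    catalyst_types: list = field(default_factory=list)  # e.g. ["protein"] or ["ion"]
    gene_regulatory: bool = False                        # transcription or translation
    n_modifiers: int = 0                                 # regulators of a gene-regulatory step

# Species types that may show enzyme-like behavior according to the text.
ENZYME_LIKE = {"protein", "RNA", "complex"}

def suggest_rate_laws(r: Reaction) -> list:
    """Return candidate rate laws for a reaction, following the rules in the text."""
    if r.gene_regulatory:
        # Hill-like kinetics with regulators, basal transcription rate without.
        return ["Hill-like equation"] if r.n_modifiers > 0 else ["zeroth-order mass action"]
    if not any(t in ENZYME_LIKE for t in r.catalyst_types):
        # No enzyme-like catalyst (e.g. an ion or no catalyst at all).
        return ["mass action"]
    laws = ["convenience rate law"]  # applicable to any stoichiometry
    if (r.n_reactants, r.n_products) == (1, 1):
        laws.insert(0, "Michaelis-Menten")
    if r.n_reactants == 2:
        # The binding order cannot be read from the diagram, so both are offered.
        laws += ["random order mechanism", "compulsory order mechanism"]
        if r.n_products == 2:
            laws.append("substituted enzyme (ping-pong) mechanism")
    return laws
```

For a bi–bi reaction catalyzed by a complex, the function returns several candidates, which reflects the statement above that the knowledge of the user is still required to select one of them.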
7.4.3 SBMLsqueezer
SBMLsqueezer is an application that was designed to perform context-sensitive rate law generation as described above (Dräger et al. 2008). SBMLsqueezer is a freely available plug-in for CellDesigner and entirely written in Java™. Up-to-date versions of the source code and binaries for this application can be found at the project homepage.11
11 http://www.ra.cs.uni-tuebingen.de/software/SBMLsqueezer
SBMLsqueezer interprets the context of the reactions in accordance with the SBGN representation to assemble lists of applicable kinetic equations for each reaction, thereby supporting most of the continuous rate laws defined in the SBO. Moreover, it suggests several rate laws that are not yet specified by the SBO consortium, especially those that were introduced in Section 7.3. One of these is the convenience rate law (Liebermeister and Klipp 2006). Other examples include the kinetic equations for the detailed ternary-complex mechanisms with random and compulsory order as well as the substituted enzyme mechanism. Each generated equation reflects the influence of modifiers that inhibit or activate the reaction. SBMLsqueezer always sets the boundaryCondition flag of genes to true to prevent genes from being consumed or produced within reactions. It highlights improper usage of transcription or translation arcs. Kinetic equations can even be generated in the case of non-integer stoichiometries of the reaction participants. All equations are written to SBML in MathML format, an XML-based, machine-readable code that is hard for humans to read or write. SBMLsqueezer can be used in two different ways. First, it is possible to assign rate equations to all reactions within the model in a single step (Fig. 7.7). Second, the reaction context menu allows the user to apply SBMLsqueezer to single reactions (Fig. 7.8). In either case, the user can change all suggestions made by the program. In the current version, SBMLsqueezer presents the SBO term of its created equations but is unable to assign these to the model because CellDesigner lacks this feature. One problem that may arise when using SBMLsqueezer is the following: for SBML models not created in CellDesigner, a default annotation is added upon import into CellDesigner. Species become proteins, reactions always become state transitions, and knowledge about the type of modulation can get lost.
In these cases SBMLsqueezer may provide unexpected results. With the increasing use of, and support for, SBO, even in CellDesigner, this problem will vanish. In the future, SBMLsqueezer will be extended with additional rate laws like power law approximations (Savageau 1969a,b, 1970), loglin (Hatzimanikatis and Bailey 1996; Hatzimanikatis et al. 1996), or linlog (Visser and Heijnen 2002, 2003) kinetics and a stand-alone version will be released that interprets SBO attributes instead of the CellDesigner-specific annotation tags. This way, SBMLsqueezer can then be integrated as an equation-generating core into several applications. This could be used, for instance, to generate and evaluate the dynamic behavior of several versions of one model – each based on a different type of rate equation.
7.4.4 Model-Merging Using MIRIAM Annotations
Various SBML models are publicly available in curated repositories like the BioModels.net database (Le Novère et al. 2006a) or JWS online (Olivier and Snoep 2004). By modifying and combining existing pathway models, more comprehensive models of the cell can be built (Snoep et al. 2006).
[Figure 7.7 (screenshot): excerpt of the generated "Rate Laws" listing, showing 1.1 Reaction re1, Hill equation, microscopic form (rate v5); 1.2 Reaction re2, reversible simple convenience kinetics (rate v1); 1.3 Reaction re3, kinetics of non-modulated unireactant enzymes (rate v2)]
Fig. 7.7 Creating rate laws for the whole network in a single step, adapted from Dräger et al. (2008). If started from the plugin menu, SBMLsqueezer can create all rate equations for the whole model at once. By default, all reactions are set to be reversible and modeled accordingly. Already existing equations are not overwritten and parameters are stored globally. All these defaults can be changed by clicking on show options. Details on all suggested equations, including the SBO number if available, are then presented in a table. Double-clicking on an equation’s name allows the user to select a different formula for the reaction. A single click elsewhere in any row shows an equation preview
Fig. 7.8 SBMLsqueezer’s reaction context menu, adapted from Dräger et al. (2008). The selection of suggested kinetic equations can change if the user sets the reaction from reversible to irreversible or the other way around. SBMLsqueezer shows an equation preview using HotEqn (available under the terms of GPL Version 3 at http://www.esr.ruhr-uni-bochum.de/VCLab/software/HotEqn/HotEqn.html) for all generated or existing kinetic equations. With the radio buttons at the bottom, the user can select whether to store parameters globally or locally for the particular reaction. Note that some kinetic equations, like convenience kinetics, contain parameters that belong to the species and therefore have to be stored globally. SBMLsqueezer treats these parameters correctly even if local parameters is selected. Already existing kinetic equations are also displayed to the user. To avoid inconsistencies for these equations, neither reversibility nor the way parameters are stored can be changed
However, when models have processes or substances in common and do not provide special interfaces for combining them, the merged output model will contain redundant elements. If the original models contain different mathematical statements (values, formulas, kinetics) for these elements, conflicts will arise and for each such element, one of the statements will have to be selected. Thus, key tasks in model-merging are (i) matching all redundant elements and (ii) recognizing and resolving all conflicts between them, in order to obtain a single consistent element in the output model (Liebermeister 2008). In severe cases, models may disagree so strongly in their way of describing biochemical processes and substances that merging is impossible. Serious conflicts arise when a model contains lumped reactions or substances (e.g., variables representing several modifications of one substance), especially if models describe the same processes at different levels of detail. Software tools can assist the user by detecting redundant elements, highlighting the conflicts, suggesting possible solutions, or resolving the conflicts automatically according to rules (e.g., based on a priority ranking of the input models). A basic requirement for matching the elements is that their biological meaning be recognizable automatically; comparing elements by their names would not suffice because the input models may adopt different naming conventions. The biochemical meaning of model elements (for instance, the fact that a certain species element represents water) can be declared in SBML in terms of annotations compliant with the standard MIRIAM (Minimal Information Required In the
Annotation of Models). MIRIAM is a set of guidelines for the annotation and curation processes of computational models, which facilitates their exchange and reuse (Laible and Le Novère 2007; Le Novère et al. 2005). Such an annotation points to an entry in web resources such as public databases or database ontologies. Water, for instance, can be represented by the entry CHEBI:15377 in the ChEBI database12 (Degtyarenko et al. 2008) and by the compound identifier C00001 in the KEGG database. A number of official data types for annotation are listed in the MIRIAM resources13 (Laible and Le Novère 2007; Le Novère et al. 2005). To declare that the SBML species A represents water, an annotation tag as shown in Listing 7 should be inserted into the declaration tag of species A.
Listing 7 MIRIAM annotation for species A
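As a hedged illustration of what such an annotation can look like, the following Python sketch embeds an RDF annotation that relates a species to the two water entries mentioned above and reads the referenced resources back with the standard library. The meta identifier `meta_A`, the use of the `bqbiol:is` qualifier, and the URN form of the resource identifiers are assumptions made for this example; they follow the MIRIAM conventions as we understand them and are not copied from the chapter's own listing.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"

# Assumed example annotation for a species with metaid "meta_A" (hypothetical).
# The qualifier bqbiol:is states that the species is exactly the referenced entity.
ANNOTATION = f"""
<annotation>
  <rdf:RDF xmlns:rdf="{RDF}" xmlns:bqbiol="{BQBIOL}">
    <rdf:Description rdf:about="#meta_A">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A15377"/>
          <rdf:li rdf:resource="urn:miriam:kegg.compound:C00001"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
"""

def miriam_resources(annotation_xml: str) -> list:
    """Collect the rdf:resource attribute of every rdf:li element, in document order."""
    root = ET.fromstring(annotation_xml.strip())
    return [li.get(f"{{{RDF}}}resource") for li in root.iter(f"{{{RDF}}}li")]
```

Two models whose species carry the same resource identifier can then be recognized as describing the same substance, regardless of the species names chosen by the modelers.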
The RDF (Resource Description Framework) syntax allows the modeler to specify a relation between the model element A and the database entry B in more detail: various possible relations (such as A is exactly B, A is an instance of B, or A contains B as a physical part) can be declared by BioModels.net qualifiers (Le Novère et al. 2006b).
12 Database of chemical entities of biological interest, http://www.ebi.ac.uk/chebi
13 http://www.ebi.ac.uk/miriam
Fig. 7.9 Annotating an SBML model with semanticSBML (screenshot). All SBML elements that can be annotated are shown in a tree on the left. Tickmark icons indicate that an element has MIRIAM annotations or SBO terms, and flag icons mark elements without annotation. Annotations and SBO terms for a selected element can be edited on the right
The web application Saint automatically annotates uploaded models (Lister et al. 2009). A program known as semanticSBML14 (Fig. 7.9) allows modelers and model curators to edit MIRIAM-compliant annotations and SBO terms and to merge several SBML models (Liebermeister et al. 2009). In the annotation section, the user can browse through the elements of a model and edit the annotations. The tool checks existing annotations automatically for syntactic correctness, based on the list of database resources specified by MIRIAM (Le Novère et al. 2005). New entries for MIRIAM-compliant annotations and possible SBO terms can be found easily by a keyword search. In the merging section, the program generates a preliminary version of the merged model based on the MIRIAM annotations of all elements in the input models (Fig. 7.10). The merged model structure can be manually refined by the user to create a biologically meaningful structure. Each element of the merged model can be selected in an overview. The properties of a selected element are listed in a detailed view. This view groups the properties and highlights conflicts that arise through the merging of the elements. For instance, matching species elements can have different initial values. The user can resolve each conflict manually or, alternatively, let the program automatically resolve all conflicts. SemanticSBML makes its decisions based on a list of model priorities supplied by the user: In cases of conflicts, the program chooses the property value of the model with the highest priority. After
14 http://www.semanticsbml.org
Fig. 7.10 Model-merging with semanticSBML (screenshot). On the left, matching elements from all input models are aligned with each other. Details about a selected element (as described by the input models and by the output model in its preliminary form) are shown on the right. The user can edit their properties and change the matching between elements until the models are ready for merging
all conflicts have been resolved, the merged model can be exported into the SBML format. An online version of semanticSBML with limited functionality is accessible at the project homepage.15
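The priority-based conflict resolution described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of semanticSBML: the function name, the dictionary layout, and the property names are hypothetical, and elements are matched solely by an identical annotation string.

```python
def merge_by_priority(models, priority):
    """Merge annotated elements of several models.

    models:   dict mapping a model name to {annotation: {property: value}}
    priority: list of model names, highest priority first
    Returns the merged elements and a list of (annotation, property, model)
    entries for every value that was overruled by a higher-priority model.
    """
    merged = {}
    conflicts = []
    for name in priority:  # highest priority is processed first and therefore wins
        for annotation, props in models[name].items():
            if annotation not in merged:
                merged[annotation] = dict(props)
                continue
            for key, value in props.items():
                if key not in merged[annotation]:
                    merged[annotation][key] = value   # complementary information
                elif merged[annotation][key] != value:
                    conflicts.append((annotation, key, name))  # overruled value
    return merged, conflicts
```

If two models both contain a species annotated with CHEBI:15377 but with different initial values, the value of the model listed first in `priority` is kept, and the overruled entry is reported so that the user can still inspect it.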
7.5 Obtaining Model Parameters
When the topology of the system under study is known, the model is annotated, and kinetic equations are defined for all reactions within the system, one question remains: How does one determine all the parameters belonging to the equations in the model, e.g., Michaelis constants, limiting rates, or the constants describing the influence of inhibitors and activators? By default, SBMLsqueezer sets all newly created parameters to one. In contrast to CellDesigner’s default of zero, this allows a direct simulation of the model. The argument to justify this decision is that in many rate equations, parameters that are set to zero lead to trivial terms within the mathematical expression or to even more questionable results if division by these parameters becomes necessary. Let us consider Eq. (7.2) on p. 9 to illustrate both problems. The whole expression equates to zero when setting parameter V = 0. For
15 http://www.semanticsbml.org
KP = 0, the rate equation becomes undefined. However, both cases are avoided if the parameter value is set to one. Nevertheless, default values only occasionally constitute a good choice for a parameter value, since it is usually desired that the model reproduces experimental data and allows predictions of unmeasured values. Many parameters such as Michaelis constants or first-order rate constants can be determined experimentally. This requires expensive and time-consuming experiments and is not possible for all parameters. Two alternative ways exist to obtain meaningful parameter values for newly created rate laws: one way is to select known parameter values from freely available online databases such as the Brunswick Enzyme Database BRENDA (Barthelmes et al. 2007; Chang et al. 2009; Schomburg et al. 2002). The second way is to estimate the parameter values with respect to given measurement data. Like Borger et al. (2007b), we suggest a combination of both strategies. The values within enzyme databases often correspond to in vitro experiments of isolated reactions. Hence, the values of these parameters are obtained with respect to the dynamics of the particular reaction. Often experimental conditions vary from reaction to reaction, because, e.g., different pH buffers are needed to stabilize the enzyme. When combining many such reactions in a multi-enzyme model, it can therefore not be ensured that the in vitro parameter values lead to realistic in vivo dynamics of the overall system (Teusink et al. 2000). Subsequent adjustment of the parameter values is therefore advisable. An advantage of enzyme databases is that these often contain parameter values of the same reaction from multiple organisms. Hence we can gain, besides meaningful initial parameter values, additional statistical properties of many parameters like lower and upper bounds, averages, or medians. For some parameters we can even guess a distribution.
These key features of the parameters should be considered in subsequent optimizations. In this section we describe how naturally inspired heuristic optimization procedures can be used to tackle this task. The basic idea of the parameter estimation approach is that the parameters should be adapted in such a way that the difference between the simulation result of the model and the given measurement data becomes minimal. This distance serves as a quality measure for a given parameter set. Heuristic optimization procedures try to minimize this distance by iteratively sampling from the parameter space and simulating the model. The parameter set leading to the smallest distance between experimental data and model output is called optimal with respect to the data. This procedure is often referred to as model calibration. In the first step, one needs to choose an appropriate distance function as a measure of quality for the parameters: the target function f (p) that depends on the parameter vector p. Minimization of the target function is defined as the search for a vector p0 ∈ P in the set of possible solutions P satisfying ∀p ∈ U ⊆ P : f (p) ≥ f (p0 ), with U being a connected set. We call p0 a global optimum of f if U = P and a local optimum of f otherwise. Several studies mention the relative squared error (RSE)
\[
f_{\mathrm{RSE}}(\hat{x}(p), X) = \sum_{i=1}^{\dim \hat{x}(p)} \sum_{t=1}^{\dim \tau} \left( \frac{\hat{x}_i(\tau_t, p) - x_{ti}}{x_{ti}} \right)^{2} \tag{7.15}
\]
as a useful distance function (Spieth et al. 2004, 2005a,b, 2006b). Here, x̂ is the model output vector and the matrix X = (x_ti) contains all experimental data. The vector τ contains the time points from the experiment. The first sum runs over all measured state variables in the model (reacting species, variable parameters, and compartments) and the second sum runs over all time points. Missing values are ignored. The advantage of f_RSE over other distance measures like the Euclidean distance lies in the weighting of the difference between model output and experimental data for each state variable. In biological systems, certain substances occur in much higher concentrations than others. An appropriate quality measure has to take this into account; otherwise, more highly concentrated species would dominate the parameter optimization procedure. Furthermore, f_RSE is a dimensionless quantity, which becomes especially important if state variables are of mixed types like compartment sizes and amounts or concentrations of substances. To avoid division by zero, a sufficiently high default value is returned instead of the fraction in cases in which x_ti = 0. Many optimization approaches have been proposed and applied to the parameter estimation problem in biochemical pathways (Balsa-Canto et al. 2008; Rodriguez-Fernandez et al. 2006a,b). The simplest randomized method, Monte Carlo, virtually rolls the dice multiple times and memorizes the best solution found, whereas deterministic approaches try to perform a clever and almost exhaustive search in the space of possible solutions using, for instance, a grid search that can be combined with a branch-and-bound strategy. Regarding the cost-performance trade-off, one class of optimization procedures has been found to be especially promising: naturally inspired heuristic optimization procedures, which traverse the search space using a heuristic strategy to “look” at a larger set of potential solutions in parallel.
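The relative squared error of Eq. (7.15) is straightforward to compute once the simulated and measured values are available on the same time grid. The sketch below is a minimal Python version under stated assumptions: data are given as nested lists indexed by time point and state variable, missing measurements are encoded as `None` or NaN, and the default value returned for a zero measurement (`zero_penalty`) is an arbitrary choice, since the text only requires it to be sufficiently high.

```python
import math

def relative_squared_error(simulated, measured, zero_penalty=1e3):
    """Relative squared error between model output and measurements.

    simulated[t][i] and measured[t][i] hold the value of state variable i
    at time point t. Missing measurements (None or NaN) are ignored.
    zero_penalty is an assumed default returned in place of the fraction
    whenever a measured value is exactly zero.
    """
    total = 0.0
    for sim_row, meas_row in zip(simulated, measured):
        for x_hat, x in zip(sim_row, meas_row):
            if x is None or (isinstance(x, float) and math.isnan(x)):
                continue                      # missing value
            if x == 0:
                total += zero_penalty         # avoid division by zero
            else:
                total += ((x_hat - x) / x) ** 2
    return total
```

A deviation by a factor of two contributes the same amount to the error regardless of whether the measured value is 1 or 100, which is exactly the weighting effect described above.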
Many evolutionary algorithms that are inspired by Darwinian evolution maintain a population of several individuals, each representing a possible solution, and use mechanisms like crossover and mutation to exchange parameter values between their individuals to increase the chance of escaping local optima. Examples are the genetic algorithm (Holland 1975), the evolution strategy (Rechenberg 1973), and differential evolution (Storn 1996). Other approaches mimic the way a mountaineer ascends a mountain (hill climbing, suggested by Tovey (1985)) or simulate the formation of crystal structures in metallurgy (simulated annealing, introduced by Kirkpatrick et al. (1983)). More recent approaches try to emulate the swarm behavior of fish or birds (particle swarm optimization (Clerc and Kennedy 2002; Clerc 2005)). Hybrid strategies have been suggested to combine the advantages of global and local strategies (Balsa-Canto et al. 2008; Rodriguez-Fernandez et al. 2006b). The great advantage of naturally inspired algorithms is that these procedures are capable of optimizing even highly multimodal16 and non-convex target functions. Model calibration often belongs to this class of problems due to the high nonlinearities of the rate laws.
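To make the idea of population-based calibration concrete, the following Python sketch fits a single rate constant of a toy model with a very simple (μ+λ) evolution strategy. Everything in it is an assumption made for illustration: the toy model (first-order decay), the function names, the fixed mutation step size, and the omission of crossover. It is not the algorithm used in any of the cited studies.

```python
import math
import random

def simulate(k, times, x0=1.0):
    """Toy model (assumed for illustration): first-order decay x(t) = x0 * exp(-k t)."""
    return [x0 * math.exp(-k * t) for t in times]

def rse(k, times, data):
    """Relative squared error between the toy model output and the data."""
    return sum(((x_hat - x) / x) ** 2 for x_hat, x in zip(simulate(k, times), data))

def evolution_strategy(times, data, mu=5, lam=25, generations=40,
                       bounds=(0.0, 2.0), sigma=0.1, seed=0):
    """Minimal (mu+lambda) evolution strategy with Gaussian mutation only."""
    rng = random.Random(seed)
    lo, hi = bounds
    population = [rng.uniform(lo, hi) for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(population)
            # Mutate the parent and clamp the child to the admissible range.
            child = min(hi, max(lo, parent + rng.gauss(0.0, sigma)))
            offspring.append(child)
        # Plus-selection: parents and offspring compete, the best mu survive.
        candidates = population + offspring
        population = sorted(candidates, key=lambda k: rse(k, times, data))[:mu]
    return population[0]
```

Calling `evolution_strategy(times, simulate(0.5, times))` with, e.g., `times = [1.0, 2.0, 3.0, 4.0, 5.0]` should return a value close to the rate constant 0.5 that generated the artificial data. Because of the plus-selection, the best individual never gets worse from one generation to the next.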
16 An optimization problem is called multimodal if it contains a large number of local optima.
The performance of optimization procedures can be influenced by several settings of the particular algorithm. Examples are temperature in simulated annealing or population size in population- or swarm-based approaches. The large number of available optimization procedures, together with their various specific settings, leads to the question: Which method actually estimates the parameters most successfully (Banga 2008; Dräger et al. 2007a,b; Moles et al. 2003)? Taking this one step further, an additional question arises: Which approximative rate law leads to the best results when the model is calibrated to experimental data? A study (Dräger et al. 2009a) investigates both questions, with seven variants of one exemplary network, based on four different rate laws: generalized mass action, Michaelis–Menten, convenience kinetics, and the stochastic Langevin equation (Gillespie 2000). For each model, except the stochastic one, there is one counterpart in which all reactions are considered reversible. Note that each model contains a different number of parameters. According to William Occam’s famous dictum from the 14th century, “shave away all that is unnecessary,” also known as Occam’s razor, the model with the smallest number of parameters for an adequate representation of data constitutes the preferred choice. This principle is called the law of parsimony in science. The reason for this strategy is that the fit of any model can be increased by introducing a higher number of free parameters. However, a model that is too simple may miss important treatment effects in experimental settings. Therefore, there is a tradeoff between model complexity and parsimony (Burnham and Anderson 2002, pp. 29–37). 
To evaluate both the dynamic properties of all seven model variants and the performance of various parameter optimization methods, several heuristic optimization procedures (Monte Carlo optimization, hill climber, simulated annealing, genetic algorithm, evolution strategy, differential evolution, particle swarm optimization, and tribes) are systematically evaluated and benchmarked with alternative settings for each dynamic model. More details about each of these optimization methods can be found in the additional material of the study.17 As an example network, the authors pick the well-investigated biosynthesis of the amino acids valine and leucine in Corynebacterium glutamicum. All seven models are calibrated based on in vivo measurements of metabolite concentrations along the pathway (Magnus et al. 2006). The authors conclude that the settings-free and, therefore, easily usable tribes algorithm yields good results for first optimization attempts. Table 7.2 summarizes the most successful optimization procedures together with their specific settings. Furthermore, the reversible forms of the generalized mass action and convenience kinetics are found to be the most promising approximative approaches for the kinetics of each reaction when compared to modeling using kinetic equations for irreversible reactions (Dräger et al. 2009a). As explained in Section 7.3, in multi-enzyme systems, all reactions should be considered reversible.
17 A comprehensive introduction to naturally inspired heuristic optimization procedures can be found at http://www.biomedcentral.com/imedia/8946429342473639/supp1.pdf
Table 7.2 Most successful settings for heuristic optimization procedures. Shown are five promising optimization procedures together with their best settings according to a benchmark study (Dräger et al. 2009a)

Algorithm | Settings | Population size
Particle swarm optimization | φ1 = φ2 = 2.05 on linear3, star, or grid3 topology | 25
Differential evolution | The triplet (f, λ, CR) should be set to (0.8, 0.5, 0.3), (0.8, 0.5, 0.5), or (0.8, 0.8, 0.3) | 100
Evolution strategy | Covariance matrix adaptation as mutation operator without crossover | (5+25) or (10, 50)
Binary genetic algorithm | Adaptive mutation and one-point or no crossover at all | 250 or 100
Tribes | – | –
The freely available workbench EvA2¹⁸ (Kronfeld 2008; Streichert and Ulmer 2005; Ulmer 2005) contains several naturally inspired heuristic optimization procedures, including those presented in Table 7.2, and has therefore been integrated into software tools such as JCell for the inference of gene-regulatory networks (Spieth et al. 2006a) or the Systems Biology Toolbox (SBToolbox2) (Schmidt and Jirstrand 2006; Schmidt 2007; Schmidt et al. 2007). For our purposes, the combination of EvA2 with the MATLAB™-based SBToolbox2 provides several useful features. We can import our SBML file into the SBToolbox2 and apply its very fast integration routine as well as several analytical functions. Access to the optimization procedures of EvA2 is well described on the project homepage.19 EvA2 provides a well-documented Application Programming Interface (API) and can therefore be easily included as the parameter estimator into an automatic modeling pipeline.
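The particle swarm settings φ1 = φ2 = 2.05 in Table 7.2 coincide with the values commonly used for the constriction variant of Clerc and Kennedy (2002). Assuming that this variant is meant, the following Python sketch computes the constriction coefficient and performs a single velocity update for one coordinate of one particle. The function names are hypothetical, and the random numbers are passed in explicitly so that the update can be checked by hand; the sketch does not use the API of EvA2.

```python
import math

def constriction_coefficient(phi1=2.05, phi2=2.05):
    """Constriction coefficient chi = 2 / |2 - phi - sqrt(phi^2 - 4 phi)|, phi = phi1 + phi2."""
    phi = phi1 + phi2
    if phi <= 4.0:
        raise ValueError("the constriction formula requires phi1 + phi2 > 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def update_velocity(v, x, personal_best, neighborhood_best, r1, r2,
                    phi1=2.05, phi2=2.05):
    """One constricted velocity update for a single coordinate.

    r1 and r2 are random numbers from [0, 1), supplied by the caller.
    Which particles form the neighborhood depends on the chosen topology.
    """
    chi = constriction_coefficient(phi1, phi2)
    return chi * (v
                  + phi1 * r1 * (personal_best - x)
                  + phi2 * r2 * (neighborhood_best - x))
```

With φ1 = φ2 = 2.05 the coefficient is approximately 0.73, so the velocity is damped in every step, which keeps the swarm from diverging without an explicit velocity limit.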
18 http://www.ra.cs.uni-tuebingen.de/software/EvA2
19 The homepage http://www.sbtoolbox2.org freely provides SBToolbox2 for MATLAB™ including comprehensive documentation.
7.6 Generation of Model Reports with SBML2LATEX
Once a mathematical description of the biochemical network has been created, it is desirable to create a human-readable report about the model as a whole. Such documentation can enhance the model’s re-usability. The best way to create a model report would be to obtain the necessary information directly from the SBML file without writing additional text. Similar approaches exist for modern programming languages like Java™, for which a complete API documentation can be created using the specialized tool JavaDoc. The idea of JavaDoc is basically to extract the existing comments from the source code and to put them in context with the methods the objects provide in a human-readable text format, e.g., HTML. In this way, the
double work of writing comments and documentation, which often leads to inconsistencies between the documentation and source code after modification of one or both, can be avoided. In the object-oriented language SBML, all components inherit from the abstract class SBase. This class passes the two optional tags notes and annotation to every SBML component. The notes tag allows modelers to insert comments, descriptions, and other information into each SBML element in XHTML format, intended to be displayed to human readers. Additionally, in the annotation field, modelers can place MIRIAM descriptions pointing to external resources (Le Novère et al. 2005). With the help of annotations and notes, knowledge comparable to that of comments in source code can be assigned to every SBML component in two different ways. SBML models contain some information only implicitly, e.g., the ordinary differential equation system of the time-dependent change of the concentrations or the amounts of all species. These equation systems are defined by the reactions each species participates in together with its constant and boundaryCondition fields. In this system, the rate of change of some species is not determined by reactions at all but by other SBML constructs such as events or all kinds of rules that affect this species. The size of compartments or the value of variable parameters can also change due to rules or events, and initialAssignments can override the initial values of all elements. In many cases, these complex coherences are not directly apparent from the SBML file. Furthermore, SBML itself was never intended to be a language that is written or read by humans directly. It is designed as a computer-readable and writable modeling language. Furthermore, it may happen that, during the modeling process, a user does not carefully annotate some of the components in the model or that the units of the components turn out to be inconsistent. 
For these reasons, the SBML community provides two helpful applications. First, the SBML validator²⁰ checks the consistency and validity of uploaded models. Second, SBML2LATEX (Dräger et al. 2009b) incorporates this SBML validator and weaves its results into a comprehensive model report, in which all components of the model are clearly arranged in tables or written text. SBML2LATEX extracts all information from the SBML file, including notes, MIRIAM annotations, and SBO terms, and compiles it into a single document. If the model contains any SBO terms, a glossary with the name and definition of each term appears at the end of the report. The convenient online version of SBML2LATEX directly creates various human-readable files, including PDF, PS, DVI, or LaTeX (Fig. 7.11). Copying the often complex mathematical expressions from SBML files into external documentation by hand is especially error-prone; instead, these expressions can be adopted directly from the generated LaTeX file. The report also helps modelers to identify potential errors in the model or insufficient annotation that cannot be detected by a pure validity check alone.
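To illustrate how the implicit differential equation system mentioned above arises, the following minimal Python sketch assembles the rate of change of each species from the reactions it participates in while respecting the constant and boundaryCondition flags. The classes Species and Reaction, the function derivatives, and the toy network are hypothetical and not part of any SBML library; rules, events, compartments, and initialAssignments are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Species:
    id: str
    constant: bool = False            # mirrors the SBML attribute "constant"
    boundary_condition: bool = False  # mirrors "boundaryCondition"


@dataclass
class Reaction:
    reactants: Dict[str, float]  # species id -> stoichiometry
    products: Dict[str, float]   # species id -> stoichiometry
    rate: Callable[[Dict[str, float]], float]  # kinetic law v(state)


def derivatives(species: List[Species], reactions: List[Reaction],
                state: Dict[str, float]) -> Dict[str, float]:
    """Return dx/dt for every species, given the current state."""
    # Species flagged as constant or as boundary condition are not
    # changed by reactions, even if they take part in them.
    fixed = {s.id for s in species if s.constant or s.boundary_condition}
    dxdt = {s.id: 0.0 for s in species}
    for r in reactions:
        v = r.rate(state)
        for sid, n in r.reactants.items():
            if sid not in fixed:
                dxdt[sid] -= n * v
        for sid, n in r.products.items():
            if sid not in fixed:
                dxdt[sid] += n * v
    return dxdt


# Hypothetical toy network A -> B -> C with mass action kinetics;
# A is a boundary species and therefore keeps a zero rate of change.
species = [Species("A", boundary_condition=True), Species("B"), Species("C")]
reactions = [
    Reaction({"A": 1}, {"B": 1}, lambda x: 0.5 * x["A"]),
    Reaction({"B": 1}, {"C": 1}, lambda x: 0.2 * x["B"]),
]
rates = derivatives(species, reactions, {"A": 2.0, "B": 1.0, "C": 0.0})
```

Even in this tiny example, the equation for B can only be read off after combining two reactions with the flags of A, which is why a generated report that lists the resulting equations explicitly is helpful.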
20 http://sbml.org/Facilities/Validator
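As a small illustration of the kind of information a report generator has to collect, the following sketch reads notes, SBO terms, and MIRIAM resources from a toy SBML fragment using only Python's standard library. It is not how SBML2LATEX is implemented; the embedded model, the function names describe_species and missing_annotation, and the chosen identifiers are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespaces of SBML Level 2 Version 4, XHTML notes, and RDF annotations.
NS = {
    "sbml": "http://www.sbml.org/sbml/level2/version4",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bqbiol": "http://biomodels.net/biology-qualifiers/",
}

# Hypothetical toy model: one annotated and one unannotated species.
TOY_SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments>
      <compartment id="cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glc" metaid="m1" compartment="cell"
               initialConcentration="1" sboTerm="SBO:0000247">
        <notes>
          <body xmlns="http://www.w3.org/1999/xhtml"><p>Glucose pool.</p></body>
        </notes>
        <annotation>
          <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
            <rdf:Description rdf:about="#m1">
              <bqbiol:is><rdf:Bag>
                <rdf:li rdf:resource="urn:miriam:kegg.compound:C00031"/>
              </rdf:Bag></bqbiol:is>
            </rdf:Description>
          </rdf:RDF>
        </annotation>
      </species>
      <species id="x" compartment="cell" initialConcentration="0"/>
    </listOfSpecies>
  </model>
</sbml>"""


def describe_species(sbml_text):
    """Collect id, SBO term, notes text, and MIRIAM resources per species."""
    root = ET.fromstring(sbml_text)
    rows = []
    for sp in root.iterfind(".//sbml:species", NS):
        notes = sp.find("sbml:notes", NS)
        # Flatten the XHTML content of the notes into plain text.
        text = " ".join("".join(notes.itertext()).split()) if notes is not None else ""
        resources = [li.get(f"{{{NS['rdf']}}}resource")
                     for li in sp.iterfind(".//rdf:li", NS)]
        rows.append({"id": sp.get("id"), "sbo": sp.get("sboTerm"),
                     "notes": text, "miriam": resources})
    return rows


def missing_annotation(rows):
    """Ids of species that carry no MIRIAM resource at all."""
    return [row["id"] for row in rows if not row["miriam"]]


rows = describe_species(TOY_SBML)
unannotated = missing_annotation(rows)
```

Listing the unannotated components in this way corresponds to the insufficient annotation that a pure validity check does not report.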
Fig. 7.11 The online version of SBML2LATEX, adapted from Dräger et al. (2009b). This convenient online version allows direct creation of SBML model reports from a given SBML file without local installation of any software. Several output formats are available; the font size and style, paper size, and page orientation can be changed. Landscape pages are very useful because it is often not possible to introduce a line break into long denominators, and the resulting formula therefore does not fit on portrait pages. The following features can be switched off: translation of MIRIAM annotations to online resources, SBML consistency check, and the inclusion of all predefined units. Other layout options enable the user to choose whether the details of all reaction participants should be listed in one table or in separate tables for each group (reactants, modifiers, and products). Instead of a simple headline, a separate title page can be created.
SBML2LATEX can also be used locally as a stand-alone tool (Fig. 7.12). SBMLsqueezer includes a flattened version of SBML2LATEX. Like the current version of SBMLsqueezer itself, its SBML2LATEX translator depends on CellDesigner. A comprehensive model report that supports even the latest specification of SBML can only be obtained by using SBML2LATEX, which is freely available at the project homepage²¹ and on the SBML homepage²².

Fig. 7.12 Stand-alone version of SBML2LATEX. In addition to the settings of the online version, the GUI of the stand-alone version provides the ability to select alternative fonts for headings, main text, and typewriter passages. All settings are also available as command line options, enabling the user to apply this program in batch mode to multiple files in one go. The SBML2LATEX project homepage http://www.ra.cs.uni-tuebingen.de/software/SBML2LATEX provides detailed information and the latest version of this program.
7.7 Conclusions

In this chapter, we have introduced a simple model-building pipeline and discussed how each of its steps can be automated. There are still several possibilities for improving the suggested procedure. In particular, what we have proposed here is to combine, in a semi-automatic way, several freely available software tools that support the modeling language SBML (Hucka et al. 2003, 2008). The Linux live DVD SB.OS²³ (Systems Biology Operational Software) contains most of the freely available software tools described in this chapter as well as several additional programs useful for research in systems biology. SB.OS is based on the popular Linux distribution Ubuntu and runs directly from a bootable DVD; hence, it does not require any software to be installed on the local computer.

21 http://www.ra.cs.uni-tuebingen.de/software/SBML2LATEX
22 http://sbml.org
23 SB.OS can be freely downloaded at www.sbos.eu for all non-commercial purposes.

In the first step of the pipeline, we set up in CellDesigner (Funahashi et al. 2003, 2007a) the topology of the network for which we want to create a dynamic model. To this end, we can use tools like KEGG2SBML and the information from databases like KEGG or MetaCyc (Caspi et al. 2008; Kanehisa et al. 2006). Saving the resulting process diagram in SBML, including CellDesigner-specific annotations, allows machine-readable distinctions between several specific types of reactions, modifiers, and species.

In the second step, the modeling tool SBMLsqueezer (Dräger et al. 2008) equips all reactions in these pathways with kinetic equations. SBMLsqueezer derives each such equation from the context provided by CellDesigner's annotations. A stand-alone version of SBMLsqueezer would require SBO annotations (Le Novère et al. 2006b) instead. In any case, manual interaction is still needed because SBMLsqueezer can only suggest applicable kinetic equations. From these, the user selects, based on his or her knowledge, the most appropriate equations for the envisaged purpose. To simplify this process, SBMLsqueezer offers standard equations for each type of reaction and can thus create all equations of the network in one step. This mode reduces the number of necessary user interactions to a minimum. With the help of semanticSBML, the resulting model can be annotated with MIRIAM qualifiers and combined with already existing models.

In the third step, the modeler estimates the parameters within the kinetic equations with respect to given measurement data. We recommend combining the MATLAB™-based SBToolbox2 (Schmidt and Jirstrand 2006; Schmidt 2007; Schmidt et al. 2007) with EvA2 (Kronfeld 2008; Streichert and Ulmer 2005) as its integrated optimization core, using appropriate settings. The modeling tool COPASI (Hoops et al.
2006) and the parameter estimation tool SBML-PET (Zi and Klipp 2006) are valuable alternatives.

As soon as the dynamic model behavior fits the data well enough, the model should be validated experimentally (Chassagnole et al. 2002). To this end, the experimenters take additional samples at time points that were not measured in the original experiment, or under different conditions, e.g., with varied initial concentrations of certain intermediates. We compute the model's output at these time points or for the new conditions and then compare the results of this in silico experiment with those of the new in vivo experiment. If the two strongly deviate from each other, we should return to an earlier step of our modeling pipeline and rethink our modeling approach or the structure of the system.

Finally, a model report should be created to document all parameter values and the model structure as a whole. The reporting tool SBML2LATEX provides an overview of all model equations and the reaction system (Dräger et al. 2009b). Because such reports uncover sources of potential mistakes, they allow the model to be refined iteratively. This overall process should be viewed as an iterative procedure of topological modeling, rate law generation, model annotation, parameter estimation, model
validation, report generation, and renewed modification of the model's structure. It may also be advisable to produce model reports at earlier stages of the pipeline for the sake of a better overview of the model.

Two ways exist to move from the semi-automatic method to a fully automated modeling process. First, command line scripts can be written that combine the tools introduced in this chapter into a pipeline. To this end, shells such as Bash provide the pipe operator, which sends the output of one tool to the input stream of the next. With the SBO-based stand-alone version of SBMLsqueezer, such a procedure becomes possible as long as the underlying SBML file contains SBO annotations at least for modifiers and species. The script must contain all command line settings of each program in the pipeline. Second, the availability of the source code of these tools makes it possible, with some programming effort, to create another application that directly combines all programs.

Despite these details, we acknowledge that several questions were not discussed within this chapter. Here we give a brief overview of other important modeling issues that require additional effort and that should be incorporated into our modeling pipeline to make it complete.

The mathematical space of possible models that are able to reproduce given quantitative time series data exceeds the space of biologically plausible, realistic, and meaningful models by several orders of magnitude. Therefore, physical model properties should also be taken into account. As pointed out in Section 7.3.2, the convenience rate law is designed in a thermodynamically independent form in which the values of all parameters can be estimated independently. In other cases, care must be taken to obtain thermodynamically valid parameters. Magnus et al. (2006) suggest a method to incorporate thermodynamics into a calibration procedure.
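The estimation and validation steps of the pipeline summarized in these conclusions can be illustrated with a deliberately small Python sketch. It fits the single rate constant of an irreversible mass action reaction to measurements and then checks the fitted model against samples taken at time points that were not used for fitting. The (1+1) evolution strategy is only a simple stand-in for optimization frameworks such as EvA2 and not the procedure used by the authors; all measurement values are invented for illustration, and the 5% tolerance as well as the function names are assumptions.

```python
import math
import random


def simulate(k, s0, times):
    """Analytical solution of dS/dt = -k * S (irreversible mass action)."""
    return [s0 * math.exp(-k * t) for t in times]


def squared_error(k, s0, times, measured):
    """Sum of squared deviations between model output and measurements."""
    return sum((m - y) ** 2 for m, y in zip(measured, simulate(k, s0, times)))


def estimate_k(s0, times, measured, k_start=1.0, steps=2000, seed=0):
    """(1+1) evolution strategy with a simple step size adaptation."""
    rng = random.Random(seed)
    best_k, sigma = k_start, 0.5
    best_err = squared_error(best_k, s0, times, measured)
    for _ in range(steps):
        candidate = abs(best_k + rng.gauss(0.0, sigma))  # keep k non-negative
        err = squared_error(candidate, s0, times, measured)
        if err < best_err:
            best_k, best_err = candidate, err
            sigma *= 1.5                      # success: search more boldly
        else:
            sigma = max(sigma * 0.9, 1e-12)   # failure: search more locally
    return best_k


def validate(k, s0, times, measured, tolerance=0.05):
    """True if every prediction deviates by at most `tolerance` (relative)."""
    predicted = simulate(k, s0, times)
    return all(abs(p - m) <= tolerance * abs(m)
               for p, m in zip(predicted, measured))


# Invented data for illustration only.
S0 = 10.0
fit_times, fit_data = [1, 2, 4, 8], [7.41, 5.49, 3.01, 0.91]
new_times, new_data = [3, 6], [4.07, 1.65]   # samples held back for validation

k_hat = estimate_k(S0, fit_times, fit_data)
model_is_valid = validate(k_hat, S0, new_times, new_data)
```

If the validation fails, the iterative procedure described above applies: one returns to an earlier step, for instance to a different rate law or network structure, rather than merely re-running the optimizer.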
The time scales of very fast and very slow processes with respect to the time frame of interest constitute another important topic. By setting these significantly slower or faster reactions to constant rates (i.e., a zeroth order mass action rate), the system can be greatly simplified. Since biological systems are known to be robust (Kitano 2002b), a stability analysis and a (global) sensitivity analysis of the system, whose behavior strongly depends on the estimated parameters, would constitute valuable improvements of our modeling pipeline. Tools like SBML-SAT (SBML Sensitivity Analysis Tool) (Zi et al. 2008) or SBToolbox2 (Schmidt and Jirstrand 2006) provide easily usable methods to perform such analyses. Furthermore, examining the existence or nonexistence of one or multiple steady states and performing bifurcation analyses gives meaningful insights into the model's properties; both can also be done using existing tools like SBToolbox2.

Acknowledgments The authors are grateful to Michael J. Ziller, Marcel Kronfeld, Catherine Lloyd, Falko Krause, and Wolfram Liebermeister for helpful advice, discussion, and contribution. This work was funded by the German Federal Ministry of Education and Research (BMBF) in the two projects National Genome Research Network (NGFN-II EP under grant number 0313323, later NGFN-Plus under grant number 01GS08134) and HepatoSys under grant number 0313080 L, and by the German Federal State of Baden-Württemberg in the two projects Identifikation und Analyse metabolischer Netze aus experimentellen Daten under contract number 7532.22-26-18 and Tübinger Bioinformatik-Grid under contract number 23-7532.24-4-18/1.
References

Albert R (2007) Network Inference, Analysis, and Modeling in Systems Biology. Plant Cell 19(11):3327–3338. doi:10.1105/tpc.107.054700. http://www.plantcell.org/cgi/reprint/19/11/3327.pdf
Alves R, Antunes F, Salvador A (2006) Tools for kinetic modeling of biochemical networks. Nat Biotechnol 24(6):667–672. doi:10.1038/nbt0606-667. http://dx.doi.org/10.1038/nbt0606-667
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29. doi:10.1038/75556. http://dx.doi.org/10.1038/75556
Balsa-Canto E, Peifer M, Banga JR, Timmer J, Fleck C (2008) Hybrid optimization method with general switching strategy for parameter estimation. BMC Syst Biol 2(1):26. doi:10.1186/1752-0509-2-26. http://dx.doi.org/10.1186/1752-0509-2-26
Banga JR (2008) Optimization in computational systems biology. BMC Syst Biol 2:47. doi:10.1186/1752-0509-2-47. http://www.biomedcentral.com/1752-0509/2/47/
Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D (2007) BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res 35(Suppl_1):D511–514. doi:10.1093/nar/gkl972. http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D511, http://nar.oxfordjournals.org/cgi/reprint/35/suppl_1/D511.pdf
Bisswanger H (2000) Enzymkinetik – Theorie und Methoden, 3rd edn. Wiley-VCH, Weinheim
Borger S, Liebermeister W, Uhlendorf J, Klipp E (2007a) Automatically generated model of a metabolic network. Int Conf Genome Inform 18:215–224. doi:10.1142/9781860949920_0021. http://eproceedings.worldscinet.com/9781860949920/9781860949920_0021.html
Borger S, Uhlendorf J, Helbig A, Liebermeister W (2007b) Integration of enzyme kinetic data from various sources. In Silico Biol 7(2 Suppl):S73–S79. http://www.bioinfo.de/isb/2007/07/S1/09/
Bornstein BJ, Keating SM, Jouraku A, Hucka M (2008) LibSBML: an API Library for SBML. Bioinformatics 24(6):880–881. doi:10.1093/bioinformatics/btn051. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/6/880, http://bioinformatics.oxfordjournals.org/cgi/reprint/24/6/880.pdf
Bulik S, Grimbs S, Huthmacher C, Selbig J, Holzhütter HG (2009) Kinetic hybrid models composed of mechanistic and simplified enzymatic rate laws – a promising method for speeding up the kinetic modelling of complex metabolic networks. FEBS J 276(2):410–424. doi:10.1111/j.1742-4658.2008.06784.x. http://www3.interscience.wiley.com/journal/121588609/abstract
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York, NY. http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-95364-9?cm
Calzone L, Gelay A, Zinovyev A, Radvanyi F, Barillot E (2008) A comprehensive modular map of molecular interactions in RB/E2F pathway. Mol Syst Biol 4:173. doi:10.1038/msb.2008.7. http://dx.doi.org/10.1038/msb.2008.7
Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse M, Paley S, Rhee SY, Shearer AG, Tissier C, Walk TC, Zhang P, Karp PD (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 36(Database issue):D623–D631. doi:10.1093/nar/gkm900. http://nar.oxfordjournals.org/cgi/content/abstract/36/suppl_1/D623
Chang A, Scheer M, Grote A, Schomburg I, Schomburg D (2009) BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009. Nucleic Acids Res 37(Database issue):D588–D592. doi:10.1093/nar/gkn820. http://nar.oxfordjournals.org/cgi/content/full/gkn820, http://nar.oxfordjournals.org/cgi/reprint/37/suppl_1/D588.pdf
Chassagnole C, Noisommit-Rizzi N, Schmid JW, Mauch K, Reuss M (2002) Dynamic modeling of the central carbon metabolism of Escherichia coli. Biotechnol Bioeng 79(1):54–73. doi:10.1002/bit.10288. http://www3.interscience.wiley.com/journal/93519745/abstract
Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res 26(1):73–79. doi:10.1093/nar/26.1.73. http://nar.oxfordjournals.org/cgi/content/abstract/26/1/73
Clerc M (2005) Particle swarm optimization. ISTE Ltd, London
Clerc M, Kennedy J (2002) The particle swarm–explosion, stability, and convergence in a multidimensional complex space. IEEE Trans Evol Comput 6(1):58–73
Cornish-Bowden A (2004) Fundamentals of enzyme kinetics, 3rd edn. Portland Press Ltd., 59 Portland Place, London
Dampier W, Tozeren A (2007) Signaling perturbations induced by invading H. pylori proteins in the host epithelial cells: a mathematical modeling approach. J Theor Biol 248(1):130–144. doi:10.1016/j.jtbi.2007.03.014. http://dx.doi.org/10.1016/j.jtbi.2007.03.014
Deckard A, Bergmann FT, Sauro HM (2006) Supporting the SBML layout extension. Bioinformatics 22(23):2966–2967. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/23/2966, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/23/2966.pdf
Degtyarenko K, Matos Pd, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36(Database issue):D344–D350. doi:10.1093/nar/gkm791. http://nar.oxfordjournals.org/cgi/content/full/gkm791v1
Dräger A, Kronfeld M, Supper J, Planatscher H, Magnus JB, Oldiges M, Zell A (2007a) Benchmarking evolutionary algorithms on convenience kinetics models of the Valine and Leucine Biosynthesis in C. glutamicum. In: Srinivasan D, Wang L (eds) 2007 IEEE congress on evolutionary computation, IEEE computational intelligence society. IEEE Press, Singapore, pp 896–903
Dräger A, Supper J, Planatscher H, Magnus JB, Oldiges M, Zell A (2007b) Comparing various evolutionary algorithms on the parameter optimization of the valine and leucine biosynthesis in Corynebacterium glutamicum. In: Srinivasan D, Wang L (eds) 2007 IEEE congress on evolutionary computation, IEEE computational intelligence society. IEEE Press, Singapore, pp 620–627
Dräger A, Hassis N, Supper J, Schröder A, Zell A (2008) SBMLsqueezer: a CellDesigner plugin to generate kinetic rate equations for biochemical networks. BMC Syst Biol 2(1):39. doi:10.1186/1752-0509-2-39. http://dx.doi.org/10.1186/1752-0509-2-39
Dräger A, Kronfeld M, Ziller MJ, Supper J, Planatscher H, Magnus JB, Oldiges M, Kohlbacher O, Zell A (2009a) Modeling metabolic networks in C. glutamicum: a comparison of rate laws in combination with various parameter optimization strategies. BMC Syst Biol 3:5. doi:10.1186/1752-0509-3-5. http://www.biomedcentral.com/1752-0509/3/5
Dräger A, Planatscher H, Wouamba DM, Schröder A, Hucka M, Endler L, Golebiewski M, Müller W, Zell A (2009b) SBML2LATEX: Conversion of SBML files into human-readable reports. Bioinformatics. doi:10.1093/bioinformatics/btp170. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp170v1, http://bioinformatics.oxfordjournals.org/cgi/reprint/btp170v1.pdf
Finney A, Hucka M (2003) Systems biology markup language: Level 2 and beyond. Biochem Soc Trans 31(Pt 6):1472–1473. doi:10.1042/. http://www.biochemsoctrans.org/bst/031/1472/bst0311472.htm
Fujibuchi W, Goto S, Migimatsu H, Uchiyama I, Ogiwara A, Akiyama Y, Kanehisa M (1998) DBGET/LinkDB: an integrated database retrieval system. Pac Symp Biocomput pp 683–694
Funahashi A, Tanimura N, Morohashi M, Kitano H (2003) CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. BioSilico 1(5):159–162. doi:10.1016/S1478-5382(03)02370-9. http://www.sciencedirect.com/science/article/B75GS4BS08JD-5/2/5531c80ca62a425f55d224b8a0d3f702
Funahashi A, Jouraku A, Matsuoka Y, Kitano H (2007a) Integration of CellDesigner and SABIO-RK. In Silico Biol 7(2 Suppl):S81–S90. http://www.bioinfo.de/isb/200707S110/main.html
Funahashi A, Morohashi M, Matsuoka Y, Jouraku A, Kitano H (2007b) CellDesigner: a graphical biological network editor and workbench interfacing simulator. In: Choi S (ed) Introduction to systems biology, Humana Press, chap 21, pp 422–434. doi:10.1007/978-1-59745-531-2_21. http://www.springerlink.com/content/hqk374162wg70146/
Gauges R, Rost U, Sahle S, Wegner K (2006) A model diagram layout extension for SBML. Bioinformatics 22(15):1879–1885. doi:10.1093/bioinformatics/btl195. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/15/1879, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/15/1879.pdf
Gillespie DT (2000) The chemical Langevin equation. J Chem Phys 113:297–306. doi:10.1063/1.481811. http://link.aip.org/link/?JCPSA6/113/297/1
Gruber TR (1993) Toward principles for the design of ontologies used for knowledge sharing? Int J Hum Comput Stud 43(5–6):907–928. http://dx.doi.org/10.1006/ijhc.1995.1081
Hatzimanikatis V, Bailey J (1996) MCA has more to say. J Theor Biol 182(3):233–242. doi:10.1006/jtbi.1996.0160. http://dx.doi.org/10.1006/jtbi.1996.0160
Hatzimanikatis V, Floudas CA, Bailey JE (1996) Analysis and design of metabolic reaction networks via mixed-integer linear optimization. AIChE J 42(5):1277–1292. doi:10.1002/aic.690420509
Heinrich R, Schuster S (1996) The regulation of cellular systems. Chapman and Hall, New York, NY
Hinze T, Hayat S, Lenser T, Matsumaru N, Dittrich P (2007) Hill Kinetics meets P systems: a case study on gene regulatory networks as computing agents in silico and in vivo. In: Eleftherakis G, Kefalas P, Paun G (eds) Proceedings of the Eighth Workshop on Membrane Computing, SEERC, pp 363–381
Hoffmann R, Valencia A (2004) A gene network for navigating the literature. Nat Genet 36(7):664. doi:10.1038/ng0704-664. http://www.nature.com/ng/journal/v36/n7/full/ng0704664.html
Holland JH (1975) Adaptation in natural and artificial systems. The University of Michigan Press, Cambridge, MA
Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI–a COmplex PAthway SImulator. Bioinformatics 22(24):3067–3074. doi:10.1093/bioinformatics/btl485. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/24/3067, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/24/3067.pdf
Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H (2002) The ERATO systems biology workbench: enabling interaction and exchange between software tools for computational biology. Pac Symp Biocomput pp 450–461
Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles ED, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JHS, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner JM, Wang J, the rest of the SBML Forum (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524–531. doi:10.1093/bioinformatics/btg015. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/4/524, http://bioinformatics.oxfordjournals.org/cgi/reprint/19/4/524.pdf
Hucka M, Finney A, Bornstein BJ, Keating SM, Shapiro BE, Matthews J, Kovitz BL, Schilstra MJ, Funahashi A, Doyle JC, Kitano H (2004) Evolving a lingua franca and associated software infrastructure for computational systems biology: the systems biology markup language (SBML) project. Syst Biol IEE 1(1):41–53. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1334988
Hucka M, Finney A, Hoops S, Keating SM, Le Novère N (2008) Systems biology markup language (SBML) Level 2: structures and facilities for model definitions. Tech. Rep., Nat Preced. doi:10.1038/npre.2008.2715.1. http://dx.doi.org/10.1038/npre.2008.2715.1
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(1):D354–357. doi:10.1093/nar/gkj102. http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D354
King EL, Altman C (1956) A schematic method of deriving the rate laws for enzyme-catalyzed reactions. J Phys Chem 60(10):1375–1378. doi:10.1021/j150544a010. http://pubs.acs.org/doi/abs/10.1021/j150544a010
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680. http://www.sciencemag.org/cgi/content/abstract/220/4598/671
Kitano H (2002a) Computational systems biology. Nature 420(6912):206–210. http://dx.doi.org/10.1038/nature01254
Kitano H (2002b) Systems biology: a brief overview. Science 295(5560):1662–1664. http://www.sciencemag.org/cgi/content/abstract/295/5560/1662
Kitano H, Funahashi A, Matsuoka Y, Oda K (2005) Using process diagrams for the graphical representation of biological networks. Nat Biotechnol 23(8):961–966. http://dx.doi.org/10.1038/nbt1111
Klipp E, Liebermeister W, Helbig A, Kowald A, Schaber J (2007) Systems biology standards–the community speaks. Nat Biotechnol 25(4):390–391
Krebs O, Golebiewski M, Kania R, Mir S, Saric J, Weidemann A, Wittig U, Rojas I (2007) SABIO-RK: A data warehouse for biochemical reactions and their kinetics. J Integr Bioinform 4(1). doi:10.2390/biecoll-jib-2007-49. http://journal.imbio.de/index.php?paper_id=49
Kronfeld M (2008) EvA2 Short documentation. University of Tübingen, Department of Computer Architecture, Tübingen, Germany. http://www.ra.cs.uni-tuebingen.de/software/EvA2
Laibe C, Le Novère N (2007) MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Syst Biol 13(58):58–67. doi:10.1186/1752-0509-1-58. http://www.biomedcentral.com/1752-0509/1/58
Le Novère N (2006) Model storage, exchange and integration. BMC Neurosci 7(Suppl 1):S11. doi:10.1186/1471-2202-7-S1-S11. http://dx.doi.org/10.1186/1471-2202-7-S1-S11
Le Novère N, Finney A, Hucka M, Bhalla US, Campagne F, Collado-Vides J, Crampin EJ, Halstead M, Klipp E, Mendes P, Nielsen P, Sauro H, Shapiro BE, Snoep JL, Spence HD, Wanner BL (2005) Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotechnol 23(12):1509–1515. doi:10.1038/nbt1156. http://www.nature.com/nbt/journal/v23/n12/abs/nbt1156.html
Le Novère N, Bornstein BJ, Broicher A, Courtot M, Donizelli M, Dharuri H, Li L, Sauro H, Schilstra M, Shapiro B, Snoep JL, Hucka M (2006a) BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Res 34:D689–D691. doi:10.1093/nar/gkj092. http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D689
Le Novère N, Courtot M, Laibe C (2006b) Adding semantics in kinetics models of biochemical pathways. In: Kettner C, Hicks MG (eds) 2nd International ESCEC workshop on experimental standard conditions on enzyme characterizations. Beilstein Institut, ESCEC, Rüdesheim/Rhein, Germany, pp 137–153. http://www.beilsteininstitut.de/escec2006/proceedings/LeNovere/LeNovere.pdf
Le Novère N, Moodie S, Sorokin A, Hucka M, Schreiber F, Demir E, Mi H, Matsuoka Y, Wegner K, Kitano H (2008) Systems biology graphical notation: process diagram level 1. Tech. Rep., Nat Preced. doi:hdl:10101/npre.2008.2320.1. http://hdl.handle.net/10101/npre.2008.2320.1
Liebermeister W (2008) Validity and combination of biochemical models. In: Kettner C, Hicks MG (eds) Proceedings of 3rd International ESCEC Workshop on experimental standard conditions on enzyme characterizations, ESCEC, Rüdesheim/Rhein, pp 163–179. http://www.molgen.mpg.de/~lieberme/data/Liebermeister_Merging_Validity_2008.pdf
Liebermeister W, Klipp E (2005) Biochemical networks with uncertain parameters. IEE Proc Syst Biol 152(3):97–107. doi:10.1049/ip-syb:20045033. http://link.aip.org/link/?BDJ/152/97/1
Liebermeister W, Klipp E (2006) Bringing metabolic networks to life: convenience rate law and thermodynamic constraints. Theor Biol Med Model 3(42):41. doi:10.1186/1742-4682-3-41. http://dx.doi.org/10.1186/1742-4682-3-41
Liebermeister W, Krause F, Klipp E (2008) Merging of systems biology models with semanticSBML. Tech. Rep., Max Planck Institute for Molecular Genetics, Berlin. http://www.molgen.mpg.de/~lieberme/data/semanticSBML_heidelberg_2008.pdf
Liebermeister W, Krause F, Uhlendorf J, Lubitz T, Klipp E (2009) SemanticSBML: a tool for annotating, checking, and merging of biochemical models in SBML format. In: 3rd International biocuration conference, Nature Publishing Group. doi:10.1038/npre.2009.3093.1. http://dx.doi.org/10.1038/npre.2009.3093.1
Lister AL, Pocock M, Taschuk M, Wipat A (2009) Saint: a lightweight integration environment for model annotation. Bioinformatics p btp523. doi:10.1093/bioinformatics/btp523. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp523v2, http://bioinformatics.oxfordjournals.org/cgi/reprint/btp523v2.pdf
Lloyd CM, Halstead MDB, Nielsen PF (2004) CellML: its future, present and past. Prog Biophys Mol Biol 85(2-3):433–450. doi:10.1016/j.pbiomolbio.2004.01.004. http://dx.doi.org/10.1016/j.pbiomolbio.2004.01.004
Machné R, Finney A, Müller S, Lu J, Widder S, Flamm C (2006) The SBML ODE Solver Library: a native API for symbolic and fast numerical analysis of reaction networks. Bioinformatics 22(11):1406–1407. doi:10.1093/bioinformatics/btl086. http://dx.doi.org/10.1093/bioinformatics/btl086, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/11/1406.pdf
Magnus JB, Hollwedel D, Oldiges M, Takors R (2006) Monitoring and modeling of the reaction dynamics in the valine/leucine synthesis pathway in Corynebacterium glutamicum. Biotechnol Prog 22(4):1071–1083. http://dx.doi.org/10.1021/bp060072f
Moles CG, Mendes P, Banga JR (2003) Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res 13(11):2467–2474. doi:10.1101/gr.1262503. http://www.genome.org/cgi/doi/10.1101/gr.1262503
Nickerson DP, Buist ML (2009) A physiome standards-based model publication paradigm. Phil Trans R Soc A 367:1823–1844. doi:10.1098/rsta.2008.0296
Oda K, Matsuoka Y, Funahashi A, Kitano H (2005) A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 1:2005.0010. doi:10.1038/msb4100014. http://dx.doi.org/10.1038/msb4100014
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27(1):29–34
Olivier BG, Snoep JL (2004) Web-based kinetic modelling using JWS Online. Bioinformatics 20(13):2143–2144. doi:10.1093/bioinformatics/bth200. http://dx.doi.org/10.1093/bioinformatics/bth200
Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart
Rodriguez N, Donizelli M, Le Novère N (2007) SBMLeditor: effective creation of models in the systems biology markup language (SBML). BMC Bioinform 8:79. doi:10.1186/1471-2105-8-79. http://dx.doi.org/10.1186/1471-2105-8-79
Rodriguez-Fernandez M, Egea JA, Banga JR (2006a) Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems. BMC Bioinform 7:483. doi:10.1186/1471-2105-7-483. http://dx.doi.org/10.1186/1471-2105-7-483
Rodriguez-Fernandez MR, Mendes P, Banga JR (2006b) A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83:248–265. doi:10.1016/j.biosystems.2005.06.016. http://www.sciencedirect.com/science/article/B6T2K4HC776X-4/2/2a48c31a0d9aa413bc616023689e55c8
Savageau MA (1969a) Biochemical systems analysis. I. Some mathematical properties of the rate law for the component enzymatic reactions. J Theor Biol 25(3):365–369
Savageau MA (1969b) Biochemical systems analysis. II. The steady-state solutions for an n-pool system using a power-law approximation. J Theor Biol 25(3):370–379
Savageau MA (1970) Biochemical systems analysis. 3. Dynamic solutions using a power-law approximation. J Theor Biol 26(2):215–226
Schauer M, Heinrich R (1983) Quasi-steady-state approximation in the mathematical modeling of biochemical reaction networks. Math Biosci 65:155–171
Schilstra MJ, Li L, Matthews J, Finney A, Hucka M, Le Novère N (2006) CellML2SBML: conversion of CellML into SBML. Bioinformatics 22(8):1018–1020. doi:10.1093/bioinformatics/btl047. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/8/1018, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/8/1018.pdf
Schmidt H (2007) SBaddon: high performance simulation for the systems biology toolbox for MATLAB. Bioinformatics 23(5):646–647. doi:10.1093/bioinformatics/btl668. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/5/646, http://bioinformatics.oxfordjournals.org/cgi/reprint/23/5/646.pdf
Schmidt H, Jirstrand M (2006) Systems biology toolbox for MATLAB: a computational platform for research in systems biology. Bioinformatics 22(4):514–515. doi:10.1093/bioinformatics/bti799. http://dx.doi.org/10.1093/bioinformatics/bti799
Schmidt H, Drews G, Vera J, Wolkenhauer O (2007) SBML export interface for the systems biology toolbox for MATLAB. Bioinformatics 23(10):1297–1298. doi:10.1093/bioinformatics/btm105. http://dx.doi.org/10.1093/bioinformatics/btm105, http://bioinformatics.oxfordjournals.org/cgi/reprint/23/10/1297.pdf
Schomburg I, Chang A, Schomburg D (2002) BRENDA, enzyme data and metabolic information. Nucleic Acids Res 30(1):47–49. doi:10.1093/nar/30.1.47. http://nar.oxfordjournals.org/cgi/content/abstract/30/1/47
Segel IH (1993) Enzyme Kinetics–Behavior and Analysis of Rapid Equilibrium and Steady-State Enzyme Systems. Wiley-Interscience, New York, NY
Shapiro BE, Finney A, Hucka M, Bornstein BJ, Funahashi A, Jouraku A, Keating SM, Le Novère N, Matthews J, Schilstra MJ (2007) Introduction to systems biology. Humana Press, Totowa, NJ, chap SBML Models and MathSBML, pp 395–421. doi:10.1007/978-1-59745-531-2. http://www.springerlink.com/content/q28j426582387022/
Snoep JL, Bruggeman F, Olivier BG, Westerhoff HV (2006) Towards building the silicon cell: a modular approach. Biosystems 83(2-3):207–216. doi:10.1016/j.biosystems.2005.07.006. http://dx.doi.org/10.1016/j.biosystems.2005.07.006
Spieth C, Streichert F, Speer N, Zell A (2004) Optimizing topology and parameters of gene regulatory network models from time-series experiments. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004), LNCS, vol 3102 (Part I), pp 461–470
Spieth C, Streichert F, Speer N, Zell A (2005a) Inferring regulatory systems with noisy pathway information. In: German conference on bioinformatics (GCB 2005), vol P-71, pp 193–203
Spieth C, Streichert F, Supper J, Speer N, Zell A (2005b) Feedback memetic algorithms for modeling gene regulatory networks. In: Proceedings of the IEEE symposium on computational intelligence in bioinformatics and computational biology (CIBCB 2005), pp 61–67
Spieth C, Supper J, Streichert F, Speer N, Zell A (2006a) JCell–a Java-based framework for inferring regulatory networks from time series data. Bioinformatics 22(16):2051–2052. doi:10.1093/bioinformatics/btl322. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/16/2051, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/16/2051.pdf
Spieth C, Worzischek R, Streichert F, Supper J, Speer N, Zell A (2006b) Comparing evolutionary algorithms on the problem of network inference. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2006)
Storn R (1996) On the usage of differential evolution for function optimization. In: 1996 Biennial Conference of the North American Fuzzy Information Processing Society, IEEE, New York, Berkeley, pp 519–523
Streichert F, Ulmer H (2005) JavaEvA–A Java framework for evolutionary algorithms. Technical Report WSI-2005-06, Center for Bioinformatics Tübingen, University of Tübingen, Tübingen, Germany. doi:urn:nbn:de:bsz:21-opus-17022. http://w210.ub.uni-tuebingen.de/dbt/volltexte/2005/1702/
Teusink B, Passarge J, Reijenga CA, Esgalhado E, van der Weijden CC, Schepper M, Walsh MC, Bakker BM, van Dam K, Westerhoff HV, Snoep JL (2000) Can yeast
7
Automating Mathematical Modeling of Biochemical Reaction Networks
205
glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur J Biochem 267(17):5313–5329. doi:10.1046/j.1432-1327.2000.01527.x. http://www3.interscience.wiley.com/journal/119181440/abstract Tovey CA (1985) Hill climbing with multiple local optima. Alg Disc Meth 6(3):384–393. doi:10.1137/0606040. http://link.aip.org/link/?SML/6/384/1 Ulmer H (2005) Modellunterstützte evolutionäre optimierungsverfahren in javaeva. PhD thesis, Eberhard-Karls-Universität Tübingen Visser D, Heijnen JJ (2002) The mathematics of metabolic control analysis revisited. Metab Eng 4:114–123. doi:10.1006/mben.2001.0216. http://www.sciencedirect.com/science/ article/B6WN3-45V802C-3/2/d624a20d0e70ca2a1058359d7fd00cb0 Visser D, Heijnen JJ (2003) Dynamic simulation and metabolic re-design of a branched pathway using linlog kinetics. Metab Eng 5(3):164–176 Wilkinson DJ (2006) Stochastic modelling for systems biology. CRC Press, Boca Raton, FL Wittig U, Golebiewski M, Kania R, Krebs O, Mir S, Weidemann A, Anstein S, Saric J, Rojas I (2006) SABIO-RK: Integration and curation of reaction kinetics data. In: Leser U, Naumann F, Eckmann B (eds) Data integration in the life sciences, Springer, Berlin pp 94–103. doi:10.1007/11799511. http://www.springerlink.com/content/kw1kv13614272400 Zi Z, Klipp E (2006) SBML-PET: a systems biology markup language-based parameter estimation tool. Bioinformatics 22(21):2704–2705. doi:10.1093/bioinformatics/btl443. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/21/2704, http://bioinformatics.oxfordjournals.org/cgi/reprint/22/21/2704.pdf Zi Z, Zheng Y, Rundell AE, Klipp E (2008) SBML-SAT: a systems biology markup language (SBML) based sensitivity analysis tool. BMC Bioinforma 9(1):342. doi:10.1186/ 1471-2105-9-342. http://www.biomedcentral.com/1471-2105/9/342 Ziller MJ (2009) Automatisierte mathematische Modellierung biochemischer Reaktionsnetzwerke. 
Master’s thesis, Eberhard-Karls-Universität Tübingen, Center for Bioinformatics Tübingen, Sand 1, 72076 Tübingen
Chapter 8
Strategies to Investigate Signal Transduction Pathways with Mathematical Modelling
Julio Vera, Svetoslav Nikolov, and Olaf Wolkenhauer
Abstract Systems biology is an approach by which biological questions are addressed through integrating experiments in iterative cycles with computational modelling, simulation and theory. Systems biology is particularly suitable for the study of cell signalling systems because of the inherent complexity of the signalling networks, the amount and variety of the quantitative data combined for their analysis and some special features of cell signalling systems. Among these features we include the prevalence of transient activation processes and the emergence of non-linear behaviour such as signal amplification, ultrasensitivity, multistability and self-sustained oscillations. In this book chapter we discuss the elements of a systems biology methodology for the investigation of cell signalling systems and illustrate it with a number of examples.

Keywords Cell signaling pathways · Mathematical modelling · JAK2-STAT5 · Erythropoiesis · Protein homodimerisation · Multilevel modelling
J. Vera (✉) Systems Biology and Bioinformatics Group, University of Rostock, 18051 Rostock, Germany
e-mail: [email protected]; Web: www.sbi.uni-rostock.de
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_8, © Springer Science+Business Media, LLC 2010

8.1 Introduction

8.1.1 A Definition for Systems Biology

About a decade after the emergence of systems biology as a new interdisciplinary approach in the life sciences, a definition of its scope and goals is emerging: “Systems biology aims at an understanding of the dynamic interactions between components of a living system, between living systems and their interactions with the environment. Systems biology is an approach by which biological questions are addressed through integrating experiments in iterative cycles with computational modelling, simulation and theory. Modelling is not the final goal, but is a tool to increase understanding of the system, to develop more directed experiments and finally allow predictions” (www.erasysbio.net).

Combining a range of technologies, systems biology is a new way of dealing with biological complexity. Focussing on subcellular processes and cell–cell interactions, the most important sources of complexity include (i) a large variety of interacting components, (ii) multiple temporal and spatial scales and (iii) non-linearities. Cell functions like growth, apoptosis, cell differentiation and the cell cycle are prototypical examples of complex non-linear dynamical systems. Understanding them necessitates mathematical modelling, in particular techniques from systems theory. Constructing models of dynamical systems in turn requires quantitative experimental techniques.
8.1.2 Expected Results from Systems Biology

From a biomedical perspective, systems biology is a strategy to make medicine predictive, personalized, preventive and participatory (Aebersold et al. 2009). The adequate integration of multi-level quantitative data of cellular and extracellular pathways through mathematical modelling will allow: (i) the identification and prioritization of potential biomarkers for a faster, more reliable diagnosis in cancer and metabolic or degenerative diseases; (ii) the multi-level simulation of drug effects in key signalling pathways and tissues, effectively shortening the timeline of drug development; (iii) the faster detection of potential drug targets in cancer-related signalling pathways (Kitano 2003) and metabolic or degenerative diseases (Kitano 2004; Vera et al. 2007b) and (iv) the prediction of optimal patterns for cancer chronotherapy, making the treatment of cancer more effective and less harmful (Lévi et al. 2008).

From a biotechnological perspective, systems biology can improve the productivity, efficiency and controllability of biotechnological strains by the detection and fine tuning of the key regulatory networks (signalling and metabolic pathways) leading to the production machinery in the microorganisms (Vera et al. 2003). In combination with synthetic biology, systems biology should enable the design of microorganisms with the precise genomic, metabolic and signalling content to minimize secondary metabolic expenditures that are not needed during cultivation processes.

From a more practical perspective, a systems biology approach supports the experimental work of cell biologists (Kitano 2007; Wolkenhauer 2007). Considering mathematical modelling as a means to formulate and test hypotheses, systems biology can guide and optimize the design of experiments.
Moreover, the appropriate combination of mathematical modelling and experimentation for a given pathway may facilitate the investigation of structural and dynamical properties not accessible with current experimental devices and techniques (Ferrel 2002; Jordan et al. 2000). Systems theory provides a formal conceptual framework to investigate essential properties of signalling pathways and other biochemical systems that emerge from multiple, linked feedback loops, generating properties like homeostasis, adaptivity, robustness (structural stability), multistability or oscillations (Barkai and Leibler 1997; Ferrel 2002; Kholodenko 2006; Wolkenhauer et al. 2005).
8.1.3 Features of Systems Biology as a New Paradigm for Cell Biology

The way cell biology is perceived through systems biology differs in several respects from the classical approach (Table 8.1), including the following features:

1. First, in systems biology the objects of study are biochemical networks that involve regulatory structures (feedback loops) and for which we can establish a limited number of biological inputs and outputs under the experimental conditions investigated.
2. The tools and methods of systems biology become fully operative when we investigate emergent properties, those special features of the system that only appear when the complete system is considered. Most of the constitutive properties of signalling systems (homeostasis, robustness, fragility, signal amplification, ultrasensitivity, self-sustained oscillations, etc.) qualify as emergent properties of the system, and therefore this approach is becoming inherent to the investigation of cell signal transduction.
3. The method requires the generation, integration and analysis of quantitative biological data. Depending on the nature of the biochemical system investigated, this will require the production of time courses for the concentration or expression levels of the biomolecules that make up the system (proteins, metabolites, mRNA, etc.) or for their subcellular distribution. Since different molecules and timescales may be involved, this will require a variety of experimental techniques, which are listed for cell signalling systems in Table 8.2.
4. The biological knowledge about the structure of the pathway and the hypotheses to be tested are encoded as mathematical equations. The features, complexity and structure of those equations will depend on the nature of the data available and the precise biological properties that we investigate. This has remarkable consequences: (a) there is a variety of modelling frameworks, whose suitability depends on the nature of the investigated problem and the data available (Table 8.3); (b) the mathematical models are specifically designed for the investigated hypothesis and are not a universal description of the investigated system, valid under any experimental circumstances; (c) the aim of a mathematical model is not to generate ultra-precise predictive simulations of the system but to generate an improved qualitative description of the system, able to dissect its structural and dynamic complexity.
5. In most cases systems biology is a collective effort that requires close collaboration and even integration of researchers with completely different backgrounds. Depending on the complexity of the investigation, the multidisciplinary team can be composed of researchers with a variety of professional skills (see Table 8.4).

Table 8.1 Distinctive features of a systems biology approach
Investigation of biochemical networks involving emergent and systemic properties, arising from dynamic interactions in living systems
Use of quantitative (ex vivo and in vivo) data, including the integration of data from a variety of technologies
Biological hypotheses encoded by mathematical equations, supporting the design of experiments
The outcome is an improved (biological) description of the system
Integrating experiments in iterative cycles with mathematical modelling

Table 8.2 Quantitative experimental techniques for signalling systems biology
Molecule | Technique | References
mRNA and microRNA | Quantitative northern blotting | Roth (2002)
mRNA and microRNA | Real-time PCR analysis | Schefe et al. (2006)
mRNA and microRNA | Sandwich hybridization | Rautio et al. (2003)
mRNA and microRNA | RNA microarrays | Wilhelm and Landry (2009)
Proteins and phosphoproteins | Quantitative western blotting | Schilling et al. (2005)
Proteins and phosphoproteins | ELISA kits | Heyman (2006)
Proteins and phosphoproteins | Mass spectrometry-based proteomics | Kito and Ito (2008)
Proteins and phosphoproteins | Live cell imaging | Mullassery et al. (2008)
Proteins and phosphoproteins | Tandem liquid chromatography and mass spectrometry | Gerber et al. (2003)
Proteins and phosphoproteins | Reverse-phase protein arrays | Sahin et al. (2007)
Metabolites | Gas chromatography; high-performance liquid chromatography (HPLC); capillary electrophoresis (CE) | Xiayan and Legido-Quigley (2008)
Metabolites | Nuclear magnetic resonance (NMR) | Bollard et al. (2005)
Metabolites | Gas chromatography–mass spectrometry (GC–MS) | Schauer et al. (2005)
Metabolites | In vivo labelling | Birkemeyer et al. (2005)
8.1.4 Systems Biology for Signalling Pathways

Signal transduction pathways form the basis for cell communication and cell regulation. A pathway is defined as a network of interacting proteins that deal with: (i) the reception and emission of chemical signals within and between cells; (ii) the modulation and integration of signals which control gene expression and ultimately cell function and (iii) the control and coordination of metabolic processes responsible for intracellular bioenergetics and biosynthesis (Fig. 8.1).

Several elements make systems biology particularly suitable for the study of cell signalling systems. The first is the apparent increase in the complexity of signalling networks, together with the amount and variety of the quantitative data available for their analysis (see Table 8.2). Under these circumstances, systems biology emerges as the right approach to deal with the systematic integration and analysis of this heterogeneous corpus of information.

More importantly, cell signalling systems have some special structural and dynamical features that make mathematical modelling indispensable for a deeper understanding of their functioning. While metabolic systems very often work under a steady-state or nearly constant regime, the vast majority of signalling systems have to deal with both steady-state and transient activation. In addition, some of the previously cited emergent properties, barely present in other biochemical pathways, are constitutive properties of signalling systems, including signal amplification (Vera et al. 2008a), ultrasensitivity (Kim and Ferrell 2007), multistability (Bhalla et al. 2002; Reynolds et al. 2003), self-sustained oscillations (Hamstra et al. 2006) and cross talk (Kim et al. 2007).

A final motivation for the use of mathematical modelling in cell signalling is the fact that the complete structure of a signalling pathway has in most cases not been fully elucidated. In this situation one requires tools and methodologies to identify functional modules or networks, their boundaries and structure, before studying their behaviour.

Table 8.3 Modelling frameworks used for investigating the dynamics of cell signalling systems
Modelling framework | Basic features | References
Mass action models | Precise description of protein–protein interactions. Extensive quantitative data sets required | Albridge et al. (2006)
Power-law models | Simple description of non-linear processes. Feasible with incomplete description of protein–protein interactions. Quantitative data required; more phenomenological interpretation | Vera et al. (2007a)
Stoichiometric networks | Only structural data about protein–protein interactions. Suitable for structural analysis. Reduced ability to simulate dynamics | Papin et al. (2005)
Petri nets | Only structural data required. Suitable for structural analysis and model reduction. Reduced ability to simulate dynamics | Chaouiya (2007)
Boolean networks | Only structural data about protein–protein interactions. Suitable for structural analysis. Reduced ability to simulate dynamics | Klamt et al. (2006)
Stochastic models | Precise description of protein–protein interactions under special conditions. Computational requirements scale fast with complexity of model | Turner et al. (2004)

Table 8.4 Structure of a systems biology team
Team member | Task to be performed
Experimental cell biologist | Formulation of hypotheses, definition of an experimental system, design of experiments, generation of data
Modeller | Mathematical modelling and analysis: definition of model structure, parameter estimation, numerical simulations, analytical studies
Bioinformatician | Database searches, data management, data integration, high-throughput data analysis, clustering and classification
Mathematician | Developing novel systems theoretic tools for system identification and analysis
Engineer | Advancing measurement technologies, image analysis, pattern recognition
Statistician | Data pre-processing, normalization, standardization, quality control and significance testing, correlation and regression analysis

Fig. 8.1 Connections of signalling pathways to the other cellular mechanisms involved in cell communication and regulation
[Diagram labels: Tissue coordination; Environmental stimulus; Cell-2-cell stimulus; Intracellular signalling systems (cell signals integration); Gene networks (regulation of transcription); Metabolic networks (energetics, biosynthesis)]
8.1.5 A Sketch of the General Methodology Used in Signalling Systems Biology

The generally accepted systems biology approach includes the following basic steps (see Fig. 8.2):

(1) Set-up of the mathematical model. The list of relevant proteins and basic interactions and the structure of the model in mathematical terms are decided upon. For this purpose, the biomedical literature, biological databases describing protein–protein interactions (Bader and Hogue 2000; HPRD, Peri et al. 2003; GO, Ashburner et al. 2000) and also text mining (Hoffmann and Valencia 2005) are used.
(2) Model calibration. Quantitative data, specifically defined to set up and calibrate the mathematical model, are used in combination with computational data fitting techniques to make the model match the data (Balsa-Canto et al. 2010).
(3) Model assessment and model refinement. The quality of the resulting model is tested by computational analysis, based on simulations and analytical tools (e.g. sensitivity analysis; Saltelli et al. 2004), and by additional quantitative experiments. A failure of the model to reproduce the behaviour of the system specified by the experiments or analytical properties leads to a refinement of the mathematical representation of the pathway by iterating the previous steps.
(4) Predictive simulations. Once sufficiently validated, the model can be used to make predictions. Predictive computational simulations and other methodologies are used to detect the key biochemical processes and dynamical features of the signalling pathway. The new biological information generated (critical protein–protein interactions, new structural motifs, non-linear dynamical behaviour, etc.) is tested with additional experiments specifically designed and performed to validate these results.

Fig. 8.2 The “standard” iterative approach for signalling systems biology
[Flow diagram with the boxes: Biological Knowledge; Experimental Data; Mathematical Modelling; Model Calibration; Model Validation; Model Refinement; Predictive Simulations]
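The four steps above can be made concrete with a deliberately small toy example. The sketch below is not taken from the chapter: it assumes a hypothetical one-variable activation model, dX/dt = k_on·u − k_off·X, synthetic noise-free "measurements", and an exhaustive grid search in place of the data fitting techniques cited in step (2). All function and variable names are illustrative.

```python
import math

def model(t, k_on, k_off, u=1.0):
    """Step 1: closed-form solution of dX/dt = k_on*u - k_off*X with X(0) = 0."""
    return (k_on * u / k_off) * (1.0 - math.exp(-k_off * t))

def sse(params, data, u=1.0):
    """Sum of squared errors between the model and (time, value) data points."""
    k_on, k_off = params
    return sum((model(t, k_on, k_off, u) - x) ** 2 for t, x in data)

def calibrate(data, grid):
    """Step 2: exhaustive grid search for the best-fitting (k_on, k_off)."""
    return min(((a, b) for a in grid for b in grid), key=lambda p: sse(p, data))

def steady_state(k_on, k_off, u=1.0):
    """Steady state of the toy model for a constant input u."""
    return k_on * u / k_off

# Synthetic, noise-free "measurements" from hypothetical true parameters.
TRUE = (2.0, 0.5)
data = [(0.5 * i, model(0.5 * i, *TRUE)) for i in range(1, 13)]
calibration_set, validation_set = data[::2], data[1::2]

# Step 2: calibration on one half of the time points.
grid = [i / 10 for i in range(1, 41)]
fit = calibrate(calibration_set, grid)

# Step 3: assessment on held-out data plus a local sensitivity analysis
# (relative change of the steady state per relative change of k_off).
validation_error = sse(fit, validation_set)
k_on, k_off = fit
sensitivity = ((steady_state(k_on, 1.01 * k_off) - steady_state(k_on, k_off))
               / steady_state(k_on, k_off) / 0.01)

# Step 4: prediction for a condition not used in calibration (doubled input).
prediction = steady_state(k_on, k_off, u=2.0)
```

In a real study the validation step would use independent experiments rather than held-out points of the same time course, and a failed validation would send the modeller back to step (1).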
8.1.6 Decision Making on the Modelling Strategy to Investigate a Cell Signalling Problem

The general structure of the systems biology approach discussed in the previous section is not a closed, rigid methodological framework; it changes depending on the specific features of the investigated problem. Thus, the selection of the research strategy and modelling framework will depend on a subtle trade-off between (a) the nature of the biochemical system investigated, (b) the biological knowledge and experimental data available for its characterization and (c) the questions about its dynamics and structure that we attempt to answer through mathematical modelling. Depending on these factors, we can distinguish at least the following sub-types of systems biology strategies:

Investigation of complex (non-linear) dynamics. In this case, the structure of the investigated system, in terms of protein–protein interactions, is well characterized and the aim of the investigation is to elucidate the nature of its non-linear dynamics, associated with the existence of feedback loops or other regulatory structures. In Shin et al. (2009), the authors used systems biology to show that positive and negative feedback loops and inhibitory proteins cooperate to configure the response pattern of the Ras-Raf-MEK-ERK pathway. In a similar way, Reynolds et al. (2003) analysed via mathematical modelling the reaction network that couples receptor tyrosine kinase (RTK) activity to protein tyrosine phosphatase (PTP) inhibition by reactive oxygen species and succeeded in verifying experimentally the existence of bistability.

Design of experiments for formulation and validation of hypotheses. Systems biology can also help in the formulation of hypotheses about the structure and dynamics of signalling pathways that are not completely elucidated and in the design of appropriate experiments to verify them. In Blüthgen et al. (2009) the authors implemented a mathematical model to test their hypothesis about transcriptional negative feedback regulation of MAPK signalling via dual-specificity phosphatases; they designed experiments according to the model analysis and then confirmed the validity of some premises of the starting hypothesis.
Investigation of design principles. Mathematical modelling of signalling systems can also be used to reveal general patterns common to a family of pathways or systems, the so-called design principles. In this case, the mathematical model developed is an abstraction of the family of pathways, containing the basic common features of this set of pathways, and is used to analyse the dynamical or structural features shared by the whole class of signalling systems. Barkai and Leibler (1997) derived a simple mathematical model that contains many of the basic common features of bacterial chemotaxis, including proper responses to chemical gradients; they showed that signalling pathways must be robust in some of their key properties in order to ensure their proper functioning under unstable biological conditions. In Gutenkunst et al. (2007) the authors used a collection of models from the literature to test whether sloppy parameters (parameters poorly constrained during data fitting) are common in systems biology. Their analysis reveals that sloppiness is a common pattern, a design principle, in a variety of biochemical systems.

Analysis of highly complex biochemical networks. Mathematical modelling proves to be very useful for the analysis of large signalling networks composed of dozens to hundreds of proteins in different activation states, with multiple cross talk between different cascades. In these cases, the scale of the biochemical system under consideration rules out any direct intuitive analysis, and dynamical models become a useful tool to investigate the complex interrelations that emerge from such highly interconnected networks. Saez-Rodriguez et al. (2007) developed a dynamical model to analyse the complex signalling network controlling the activation of T cells by several surface receptors; the model predicted unexpected signalling events after antibody-mediated perturbation, which were later experimentally confirmed.
Integration of experimental data and biological scales. A promising new use for mathematical models in systems biology is the integration of multi-level quantitative data of cellular and extracellular processes. The rationale is that modelling is a suitable tool to deal with massive, diverse sets of data describing intracellular and tissue-level aspects of the same biological phenomenon. This strategy is currently considered critical to advance our understanding in biomedicine from pathway-level knowledge to its physiological repercussions in tissue organization and functionality. Leeuwen et al. (2009) derived a multi-scale model for colorectal crypt dynamics linking phenomena at the subcellular, cellular and tissue levels; they used their model to investigate the connection between intracellular signalling pathways, crucial in colorectal tissue organization and cancer progression, and physiological-level features like stem cell clonal expansion and niche succession. Ribba and collaborators (2006) proposed a multi-scale model of colorectal cancer growth, including genes, signalling pathways, tissue dynamics and radio-sensitivity dependence; with the help of their model they analysed the role of gene-dependent cell cycle regulation in the response to therapeutic irradiation.
In the following we draw on our own experience with some of the approaches discussed above.
8.2 Investigation of Non-Linear Dynamics: Signal Amplification in the JAK2–STAT5 Pathway

The main purpose of mathematical modelling in cell signalling is to investigate the behaviour of non-linear dynamical systems. Our investigation of the JAK2–STAT5 pathway focuses on the analysis of responsiveness and signal amplification (Vera et al. 2008a). The Janus kinase–signal transducer and activator of transcription (JAK–STAT) pathways are a complete family of signalling pathways that have been intensively investigated in recent years due to their important biomedical implications (Aaronson and Horvath 2002). More precisely, the JAK2–STAT5 pathway, which can be activated through various receptors including the erythropoietin receptor, is crucial for the adequate differentiation of red blood cells. It often appears deregulated in several kinds of blood-related diseases, e.g. leukaemia and cancer (Kisseleva et al. 2002). In Vera et al. (2008a) we derived, calibrated and tested a mathematical model describing the JAK2–STAT5 pathway, which was used for analysing signal responsiveness and amplification. In our investigation we followed the usual workflow for systems biology, described in Fig. 8.2: we set up a mathematical model based on published information, calibrated it using quantitative experimental data, and then assessed and refined it before using it for model-based predictive simulations. The model developed is represented in Fig. 8.3. It includes equations accounting for the dynamics of the receptor complex EpoR/JAK2 and of STAT5, the main components of the signalling pathway.
Since there were technical limitations to characterizing every biochemical process in the pathway in detail, we derived a simplified model with the following premises: (a) the model includes a description of every biochemical process essential to characterize the pathway; (b) it contains the description of the protein states and protein–protein interactions relevant for the investigation of signal amplification and responsiveness and finally, (c) further simplifications and hypotheses can be justified on the basis of published knowledge. For lack of the quantitative data needed to generate a detailed kinetic model, we decided to use a simplified power-law model for our investigation. These models, based on ordinary differential equations (ODEs), allow for non-integer kinetic orders (Vera et al. 2007a) and have the following mathematical structure:

$$\frac{\mathrm{d}X_i}{\mathrm{d}t} = \sum_{j} c_{ij}\,\gamma_j \prod_{k=1}^{p} X_k^{g_{jk}}, \qquad i = 1, \ldots, n_d.$$

Here, $X_i$ represents any of the $n_d$ dependent variables of the signalling system (e.g. protein or phosphoprotein concentrations, RNA, level of gene expression), while every biochemical process $j$ is described as the product of a rate constant ($\gamma_j$) and the $p$ variables of the system involved in the process, each raised to a characteristic kinetic order ($g_{jk}$). The $c_{ij}$ are the so-called stoichiometric coefficients of the system, which describe mass conservation in the processes. The distinctive feature of simplified power-law models is the kinetic orders, whose values are estimated from experimental data and may be non-integer numbers: $g$ between 0 and 1 represents saturation-like behaviour; $g$ higher than 1 implies a cooperative process; $g$ equal to 1 represents a conventional kinetic-like process; $g$ equal to 0 implies an absence of interactions; and negative values represent inhibition.

[Fig. 8.3, left panel, conceptual scheme: Epo in the extracellular medium; EJ and pEpJ in the cytoplasm, with the processes recruitment, degradation and deactivation; S and DpS in the cytoplasm; DpSnc in the nucleus, leading to cell differentiation and proliferation.]

ODE model (Fig. 8.3, right panel):

$$\begin{aligned}
\frac{\mathrm{d}\,EJ}{\mathrm{d}t} &= \gamma_1 - \gamma_2 \cdot EJ \cdot Epo - \gamma_1 \cdot EJ \\
\frac{\mathrm{d}\,pEpJ}{\mathrm{d}t} &= \gamma_2 \cdot EJ \cdot Epo - \gamma_3 \cdot pEpJ \\
\frac{\mathrm{d}\,S}{\mathrm{d}t} &= 2 \cdot \gamma_4 \cdot DpS_{nc} - 2 \cdot \gamma_5 \cdot S \cdot pEpJ \\
\frac{\mathrm{d}\,DpS}{\mathrm{d}t} &= \gamma_5 \cdot S \cdot pEpJ - \gamma_6 \cdot DpS \\
\frac{\mathrm{d}\,DpS_{nc}}{\mathrm{d}t} &= \gamma_6 \cdot DpS - \gamma_4 \cdot DpS_{nc}
\end{aligned}$$

Fig. 8.3 Structure of the JAK2–STAT5 pathway model. The left-hand side represents the initial conceptual scheme used in our investigation, highlighting the biochemical processes considered. The right-hand side is the translation of the scheme into a mathematical model. Our model is a simplified representation of the JAK2–STAT5 pathway, considering only those biochemical processes that are essential for our investigation. The receptor EpoR and the Janus kinase JAK2 are assumed to form a stable complex, EpoR/JAK2, for all signalling processes. Two possible states were considered for the EpoR/JAK2 complex: non-activated, EJ, and activated Epo-bound EpoR/JAK2 complex, pEpJ; the processes included for the receptor complex dynamics are receptor activation by Epo, receptor deactivation, recruitment of new EpoR/JAK2 and the degradation of non-activated receptor. For STAT5, we consider three states: non-activated cytosolic STAT5, S; activated cytosolic STAT5, DpS; and activated nuclear STAT5, $DpS_{nc}$; the processes considered for the STAT5 dynamics are the activation of STAT5 by the activated receptor complex, the nuclear translocation of cytosolic activated STAT5 and the deactivation and cytoplasmic translocation of nuclear STAT5. The extracellular concentration of Epo, Epo, is considered the input signal. A complete description, derivation and calibration of the mathematical model can be found in Vera et al. (2008a)
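To make the power-law structure tangible, the following sketch evaluates the general rate expression and applies it to the five equations of Fig. 8.3, written as seven processes with a stoichiometric matrix. It is an illustration only: the rate constants are hypothetical values chosen for readability (not the calibrated parameters of Vera et al. 2008a), all kinetic orders are set to 0 or 1 as they appear in the figure, Epo is treated as a constant input, and the fixed-step Runge-Kutta integrator is a convenience choice of this sketch.

```python
def power_law_rhs(x, c, gamma, g):
    """Evaluate dX_i/dt = sum_j c[i][j] * gamma[j] * prod_k x[k] ** g[j][k]."""
    rates = []
    for gamma_j, orders in zip(gamma, g):
        v = gamma_j
        for x_k, g_jk in zip(x, orders):
            if g_jk != 0:  # kinetic order 0: variable not involved in process j
                v *= x_k ** g_jk
        rates.append(v)
    return [sum(c_ij * v_j for c_ij, v_j in zip(row, rates)) for row in c]

# Variable order: EJ, pEpJ, S, DpS, DpSnc, followed by the input Epo.
# Hypothetical rate constants gamma_1 ... gamma_6 (assumed, not fitted values).
g1, g2, g3, g4, g5, g6 = 0.1, 1.0, 0.5, 0.2, 1.0, 0.3

# One entry per process: recruitment, degradation, activation, deactivation,
# STAT5 activation, nuclear import, nuclear export/deactivation.
GAMMA = [g1, g1, g2, g3, g5, g6, g4]
G = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
C = [
    [1, -1, -1, 0, 0, 0, 0],  # EJ
    [0, 0, 1, -1, 0, 0, 0],   # pEpJ
    [0, 0, 0, 0, -2, 0, 2],   # S
    [0, 0, 0, 0, 1, -1, 0],   # DpS
    [0, 0, 0, 0, 0, 1, -1],   # DpSnc
]

def simulate(x0, epo, t_end, dt=0.01):
    """Integrate the model with a fixed-step fourth-order Runge-Kutta scheme."""
    def f(state):
        return power_law_rhs(state + [epo], C, GAMMA, G)
    x = list(x0)
    for _ in range(round(t_end / dt)):
        k1 = f(x)
        k2 = f([a + 0.5 * dt * b for a, b in zip(x, k1)])
        k3 = f([a + 0.5 * dt * b for a, b in zip(x, k2)])
        k4 = f([a + dt * b for a, b in zip(x, k3)])
        x = [a + dt / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(x, k1, k2, k3, k4)]
    return x

X0 = [1.0, 0.0, 1.0, 0.0, 0.0]  # normalized initial state, no activated species
stimulated = simulate(X0, epo=1.0, t_end=50.0)
unstimulated = simulate(X0, epo=0.0, t_end=10.0)
```

Two properties of the printed equations can be checked with this sketch: the total amount of STAT5, S + 2·DpS + 2·DpSnc, is conserved along any trajectory, and without Epo the system rests at EJ = 1 with no activated species. Replacing a kinetic order of 1 in `G` by a non-integer value turns the corresponding process into a genuine power-law term.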
The parameters in the model were estimated using data provided by the group of Ursula Klingmüller at the German Cancer Research Center in Heidelberg. Briefly: (1) red blood progenitor cells were starved for 5 h in order to turn off any signalling through the pathway; (2) after starvation, in every replicate of the experiment the culture was stimulated with the same amount of Epo; (3) a constant number of cells was extracted from the culture and lysed at every time point considered; (4) immunoblots for the measured proteins were incubated, exposed and quantified using appropriate software and (5) the quantification was enhanced using normalization, protein calibrators and other experimental design techniques. Following this procedure, we obtained time courses of cytoplasmic activated STAT5 and of the activated EpoR/JAK2 complex. In an independent but identical replicate of the experiment, the amount of extracellular Epo was quantified during the time course (cf. Vera et al. 2008a).

For the data fitting, a genetic algorithm adapted and optimized for power-law models was used for parameter estimation. Several models of increasing structural complexity were tested against the quantitative data until we obtained a model that was adequate in terms of structural simplicity, ability to fit the quantitative experimental data and suitability for the analysis of responsiveness and amplification. Finally, additional tests were applied to the model in order to ensure its quality and validate its predictive abilities.

We then investigated the responsiveness of the system and its ability to amplify signals via mathematical simulation. Here we discuss our simulation results for experimental conditions of sustained and transient Epo stimulation.
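The details of the genetic algorithm used for the fitting are given in Vera et al. (2008a) and are not reproduced here. As a generic illustration of the idea, the sketch below runs a simple elitist evolutionary search that estimates the rate constant and kinetic order of a single power-law rate law, v = γ·X^g, from synthetic noise-free data. The population size, mutation schedule, search bounds and all names are assumptions of this sketch, not the settings of the original study.

```python
import random

def rate(x, gamma, g):
    """Power-law rate law v = gamma * x ** g."""
    return gamma * x ** g

def sse(params, data):
    """Sum of squared errors between the rate law and (x, v) data points."""
    gamma, g = params
    return sum((rate(x, gamma, g) - v) ** 2 for x, v in data)

def evolve(data, generations=150, pop_size=30, n_elite=10, seed=0):
    """Elitist evolutionary search for (gamma, g) minimising the SSE."""
    rng = random.Random(seed)
    pop = [(rng.uniform(0.0, 5.0), rng.uniform(0.0, 3.0)) for _ in range(pop_size)]
    sigma = 0.3  # mutation width, reduced a little every generation
    for _ in range(generations):
        elite = sorted(pop, key=lambda p: sse(p, data))[:n_elite]
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.choice(elite), rng.choice(elite)
            # Uniform crossover followed by Gaussian mutation.
            gamma = (a[0] if rng.random() < 0.5 else b[0]) + rng.gauss(0.0, sigma)
            g = (a[1] if rng.random() < 0.5 else b[1]) + rng.gauss(0.0, sigma)
            children.append((max(gamma, 0.0), g))  # keep the rate constant >= 0
        pop = elite + children
        sigma *= 0.97
    return min(pop, key=lambda p: sse(p, data))

# Synthetic data from hypothetical "true" parameters with a saturation-like
# kinetic order (0 < g < 1).
TRUE_GAMMA, TRUE_G = 1.5, 0.6
data = [(0.2 * i, rate(0.2 * i, TRUE_GAMMA, TRUE_G)) for i in range(1, 11)]
best_gamma, best_g = evolve(data)
```

With real immunoblot data the objective would compare simulated time courses of the full ODE model with the measurements, and several independent runs would normally be compared to judge how well the parameters are constrained.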
In the case of sustained stimulation, we performed iterative simulations for conditions of constant Epo stimulation, $Epo_{ss}$, from very low concentrations (much smaller than physiological values) up to tenfold the Epo concentration used in the experiments. The induced steady-state values of the nuclear fraction of activated STAT5, $DpS_{nc}$, were computed and are shown on the left side of Fig. 8.4. The figure shows sigmoidal behaviour on the logarithmic scale of $Epo_{ss}$, with maximal sensitivity to changes in the concentration of Epo in the interval (grey area) around the physiological value of Epo in mouse serum (solid black line). Weaker stimulation does not significantly activate the system, while for stronger stimuli the system becomes saturated (dashed red line, indicating the maximum value of nuclear-activated STAT5) and virtually insensitive to any increase in the stimulus beyond the Epo value used in the experiment (grey double-dashed line).

To analyse signal amplification in the system, we define the logarithmic amplification, LA, as the base-10 logarithm of the ratio between the total nuclear-activated STAT5 production, $DpS_{nc}$, and the total activated receptor production, pEpJ [see Vera et al. (2008a) for a complete description]. LA thus measures the signal amplification between pEpJ and $DpS_{nc}$. LA smaller than 0 means signal attenuation, while positive values imply signal amplification. LA = 1 means that on average each molecule of pEpJ produces the activation (and subsequent translocation) of ten molecules of $DpS_{nc}$ before deactivation. As we see in Fig. 8.4 (right side), the logarithmic amplification factor has a value slightly higher than 2 (LA ≈ 2) for the different values of sustained stimulation simulated, suggesting that an activated receptor can on average activate and induce the nuclear translocation of up to 100 units of STAT5 before its deactivation.

Fig. 8.4 Responsiveness and signal amplification for sustained stimulation. Left: steady-state values of $DpS_{nc}$ ($DpS_{nc,ss}$, normalized units) for different values of sustained stimulation with Epo ($Epo_{ss}$, units/ml). Right: logarithmic amplification, LA (logarithmic units), for different values of sustained stimulation with Epo ($Epo_{ss}$, units/ml). The solid black line indicates the physiological value for the serum concentration of Epo (approx. $7.9 \times 10^{-3}$ units/ml), while the grey area accounts for the physiologically feasible interval of values for Epo in mouse serum

We furthermore investigated the response of the system to transient stimulation by Epo, which was characterized by the average duration of the stimulus, $T_{Epo}$, and the average value of Epo during the transient stimulation, $Epo_{tr}$. Figure 8.5 shows the response of the system in terms of $DpS_{nc,tr}$ for transient stimulation. The system shows maximum sensitivity to input signals with a pulse duration ($T_{Epo}$) between 1 and 100 min and an average intensity ($Epo_{tr}$) around the physiological values, $[5 \times 10^{-3}, 5 \times 10^{-1}]$, showing saturation for intense average stimulation. For longer stimulation, even at very high concentrations of Epo, there is a significant loss in $DpS_{nc,tr}$, which recovers for very long stimulation and which we hypothesize is a consequence of slow receptor complex recruitment.

Taken together, our results suggest that the JAK2/STAT5 pathway is a signal amplifier, with maximum sensitivity for input signals whose intensity is in the interval of physiological values and saturation for very intense and long stimulation. This model shows how a quantitative data-based model can be used to investigate dynamical non-linear properties of signalling pathways.
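The interpretation of LA given above (LA = 1 for a tenfold amplification, LA of about 2 for a hundredfold amplification) can be sketched as follows. This is a minimal illustration that assumes LA is the base-10 logarithm of the total nuclear-activated STAT5 production per unit of total activated receptor production; the function name and the input numbers are hypothetical, and the complete definition is given in Vera et al. (2008a).

```python
import math


def logarithmic_amplification(total_pEpJ, total_DpSnc):
    """log10 of nuclear-activated STAT5 produced per activated receptor complex."""
    if total_pEpJ <= 0 or total_DpSnc <= 0:
        raise ValueError("both totals must be positive")
    return math.log10(total_DpSnc / total_pEpJ)


# Hypothetical totals: attenuation, no amplification, tenfold and hundredfold amplification.
for pEpJ, DpSnc in ((10.0, 1.0), (1.0, 1.0), (1.0, 10.0), (1.0, 100.0)):
    print(pEpJ, DpSnc, logarithmic_amplification(pEpJ, DpSnc))
```

Negative values correspond to attenuation and positive values to amplification, matching the sign convention used in the text.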
Fig. 8.5 Responsiveness for transient stimulation. Average fraction of dimerized phosphorylated STAT5 in the nucleus ($DpS_{nc,tr}$) during transient stimulation for modulation of stimulus duration ($T_{Epo} \in [0.1, 10^{4}]$ min) and concentration ($Epo_{tr} \in [10^{-6}, 500]$ units/ml)
8.3 Investigation of Design Principles: Dynamical Implications of Homodimerization in Receptor–Transducer Interactions

One of the most interesting uses for mathematical models is the investigation of general principles common to large families of structurally similar biochemical pathways. In this case, the hope is that modelling will facilitate the detection of dynamical or structural features shared by all the pathways in the class. The strategy commonly followed is depicted in Fig. 8.6. Prior biological knowledge, based on experience with one or more pathways, suggests a hypothesis about the features associated with a dynamical motif shared by all those systems. Published information and databases can be used to verify whether the suggested dynamical motif is actually shared by a group of pathways. This information is used to set up a mathematical model that is an abstraction of the family of pathways, containing only the basic features common to this set of pathways. Computational simulations or mathematical analysis are used to elucidate whether any relevant dynamical property is actually linked to the structure of the motif and therefore shared by the whole class of signalling systems. In that case, a design principle, that is, a dynamical characteristic common to all the systems with a given structure, emerges from the analysis.

Fig. 8.6 Workflow to investigate design principles in signalling pathways. Boxes in the workflow: Biological Knowledge; Published Data Mining; Hypothesis Formulation; Qualitative Mathematical Model; Simulations and analysis; Emerged design principle

We used this strategy to investigate the dynamics inherent to homodimerization in the interactions between receptors and protein transducers. Our preliminary work with the JAK2/STAT5 signalling pathway indicated that interactions between homodimer receptors and homodimer protein transducers occur often in the JAK/STAT signalling family. To verify whether this pattern is shared by other signalling pathways, we designed a workflow to merge data from the Gene Ontology (GO) and the Biomolecular Interaction Network Database (BIND), two databases with information about protein function and protein–protein interactions (Vera et al. 2009). Using this procedure, we generated a list of interactions between homodimer receptors and homodimer proteins by considering a homodimer–homodimer interaction to be one between a receptor previously identified as a homodimer (R–R) and a transducer identified as a homodimer (P–P) (Fig. 8.7). We focused our analysis on signalling pathways in human cells and found 31 homodimer receptors (among the 57 detected) that interact with 67 different homodimer proteins. Several authors suggest a role for dimerization as a regulatory mechanism in signal transduction (Klemm et al. 1998).

Fig. 8.7 Scheme of the search method used in our analysis to find homodimer–homodimer interactions. Labels in the scheme: receptor homodimer (R–R), transducer homodimer (P–P), HD–HD interaction, other interacting proteins (A, B, Q)

We developed a simple model, using ordinary differential equations, that accounts for the interaction between a prototypical homodimer receptor and a homodimer protein transducer (Vera et al. 2008b). The model focuses on the possible mechanisms by which this homodimer–homodimer interaction can occur, and our analysis investigates the distinctive dynamics associated with those different reaction mechanisms. Our model is depicted in Fig. 8.8. We propose two possible mechanisms of interaction between the receptor and the protein transducer. In the single activation mechanism, two transducer monomers, $P$, bind to two independent activated receptors, $R^{*}$, and become activated, $P^{*}$. After release into the cytoplasm, the monomers form an activated homodimer, $(P^{*}P^{*})$, which is able to transduce the signal downstream in the pathway. In this case, the rate equation accounting for monomer activation is linear in the concentration of the transducer, and the rate describing the subsequent dimerization is quadratic in $P^{*}$. In the double activation mechanism, two transducer monomers bind to the same activated receptor. Immediately afterwards, (double) parallel protein activation occurs at the receptor, leading to the subsequent formation of an activated dimer. In this case, the rate equation accounting for monomer activation is quadratic in the protein transducer $P$, while dimerization has the same structure as in single activation. The difference in the kinetic order for $P$ has important dynamical consequences, which we study with computational simulations.
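The difference in kinetic order between the two mechanisms can be made concrete with a minimal sketch that uses only the activation terms described above (linear in P for single activation, quadratic in P for double activation). The function names, rate constants and concentrations are hypothetical and chosen purely for illustration; the activated receptor concentration is held fixed, which is an assumption of this sketch and not of the model.

```python
def single_activation_rate(k1, r_star, p):
    """Monomer activation on independent receptors: linear in P."""
    return 2.0 * k1 * r_star * p


def double_activation_rate(k3, r_star, p):
    """Parallel activation of two monomers on one receptor: quadratic in P."""
    return 2.0 * k3 * r_star * p ** 2


k1 = k3 = 1.0   # hypothetical rate constants
r_star = 1.0    # hypothetical, fixed concentration of activated receptor
for p in (0.5, 1.0, 2.0, 4.0):
    print(p, single_activation_rate(k1, r_star, p), double_activation_rate(k3, r_star, p))
```

Doubling the transducer concentration doubles the single activation rate but quadruples the double activation rate, so the relative weight of the two mechanisms depends on how much transducer is available.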
In a generalization of this, we assume that both activation mechanisms may occur simultaneously. We call this the dual activation mechanism, with the dynamics described by the following equation:

$$\frac{d}{dt}P^{*} = 2k_1 \cdot R^{*} \cdot P + 2k_3 \cdot R^{*} \cdot P^{2} - 4k_2 \cdot (P^{*})^{2}$$

Fig. 8.8 Activation mechanisms proposed for the activation of homodimer protein transducers in homodimer receptors: single activation (left-hand side, finely dashed black arrow) and double activation (left-hand side, solid black arrow). Equations for both mechanisms are included in the right-hand side. A third mechanism is also investigated, which is a combination of both and is called dual mechanism. Reaction schemes and rate equations shown in the figure:
Single activation: $R^{*} + P \rightarrow R^{*} + P^{*}$ (on receptor 1 and on receptor 2), followed by $2P^{*} \rightarrow P^{*}P^{*}$; $\frac{d}{dt}P^{*} = 2k_1 \cdot R^{*} \cdot P - 4k_2 \cdot (P^{*})^{2}$
Double activation: $R^{*} + 2P \rightarrow R^{*} + 2P^{*}$, followed by $2P^{*} \rightarrow P^{*}P^{*}$; $\frac{d}{dt}P^{*} = 2k_3 \cdot R^{*} \cdot P^{2} - 4k_2 \cdot (P^{*})^{2}$

To elucidate the consequences of these alternative designs for transducer activation, we performed a range of computational simulations. Our results establish that the dual mechanism of activation happens only when the constant accounting for the efficiency of double activation, $k_3$, lies in the interval $k_1/10 < k_3$ [...]

[...] 4 we found that the network can still learn the correct responses, but will often produce biases in its generalization (Fig. 13.4 g–h). However, we find that these biases do not occur for all replications where the value of w was kept constant and the random seed varied (Fig. 13.5). The probability of a bias occurring appeared to increase with the magnitude of w. In contrast, cases a–f shown in Fig. 13.1 do not show significant variation over 20 replications. Although we used five hidden nodes for the cases shown, we found that these results stand regardless of the number of hidden nodes.
Fig. 13.4 Generalization gradients showing the network output produced from a given input for an individual run with the same random seed. Each gradient is the result of a single run after 500,000 back-propagation iterations, trained to output 0.2 for an input of 0.2 and to output 0.6 for an input of 0.8. The network topology comprised a single input node, 5 hidden nodes, a single output node, and learning rate α = 1. For all cases the initial network weights were randomly selected from a uniform distribution with p equal to (a) 0, (b) 0.1, (c) 1, (d) 2, (e) 3, (f) 4, (g) 5. Generalization gradients for many replications with values of w > 5 are similar to that of w = 5 (but see Fig. 13.2)
D.W. Franks and G.D. Ruxton
Fig. 13.5 Three-dimensional plot of the generalization gradients showing the network output produced from a given input for individual runs with different random seeds. Training is as in Fig. 13.1, but with the number of hidden nodes fixed at m = 5 and the initial weight range fixed at (a) w = 5 (case g in Fig. 13.4), (b) w = 10 (case h in Fig. 13.4). Note the wide quantitative and qualitative variation in results across the 20 replications with different random seeds, especially for w = 10
13.3.3 Changing the Activation Function to tanh

The two most common activation functions used in feed-forward neural networks are the logistic (sigmoid) function given in the original model definition and the hyperbolic tangent (tanh). The tanh function usually takes the form

$$\varphi(x) = a \tanh(bx)$$

where a and b can be set to any real value, but are typically set to a = 1.716 and b = 0.667. There seems to be no empirical justification for these values, but they tend to be used because previous studies have used them. However, the precise values used are not thought to make a difference to the network's generalization ability (Guyon 1991). It is desirable that the transfer functions allow the network to act non-linearly, without producing a bias in the way the network processes certain values.

We take the standard approach of replacing the sigmoid function with the tanh function for the hidden nodes, and keep the sigmoid function for the output node. We find that the tanh function causes a bias for input values close to zero (Fig. 13.6). This occurs because the tanh function gives a value of zero for inputs of zero, meaning that any weight adjustments have little effect on the response to inputs close to zero. Thus, unless the network is heavily trained to specifically output a desired value for inputs close to zero, the network will exhibit a generalization bias if the tanh function is used (but not if the logistic function is used). Note that this bias will occur regardless of the values of a and b.
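The origin of this bias can be illustrated with a minimal sketch of a single hidden node. The sketch assumes a node without a bias term, which is a simplification made here for illustration and may differ from the network definition used in the chapter; the weights and stimulus values are hypothetical.

```python
import math


def logistic(x):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-x))


def scaled_tanh(x, a=1.716, b=0.667):
    """Scaled hyperbolic tangent, a * tanh(b * x)."""
    return a * math.tanh(b * x)


def hidden_output(activation, weight, stimulus):
    """Output of one hidden node without a bias term (an assumption of this sketch)."""
    return activation(weight * stimulus)


# Whatever the weight, a tanh node passes nothing on for a stimulus of zero,
# whereas a logistic node outputs 0.5.
for stimulus in (0.0, 0.05):
    for weight in (0.1, 1.0, 5.0):
        print(stimulus, weight,
              hidden_output(scaled_tanh, weight, stimulus),
              hidden_output(logistic, weight, stimulus))
```

Because the tanh node outputs zero at a stimulus of zero for every weight, changing that weight cannot change the node's contribution there, and the contribution stays small for stimuli near zero.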
13
Robustness of Neural Network Models
Fig. 13.6 A generalization gradient showing the network output produced from a given input. The solid line shows the generalization where the tanh function is used for hidden node activation, and the dashed line shows the generalization where the sigmoid function is used for the hidden node activation. The gradient is the result of a single run after 500,000 back-propagation iterations, trained to output 0.2 for an input of 0.2 and to output 0.6 for an input of 0.8. The network topology comprised a single input node, 5 hidden nodes, a single output node and learning rate α = 1. Network weights were initialized with values selected from a uniform random distribution with w = 0.1. Note the extreme difference between the outputs for input values close to zero
13.3.4 Multi-dimensional Stimuli

To confirm that our findings hold for more complicated problems, we perform the same experiments with multi-dimensional stimuli by replicating the intensity test of Ghirlanda and Enquist (1998). For this the network is given 20 real-valued inputs (i.e. i = 20), five hidden nodes and a single output node. The initial weights were selected with w = 0.1. The network is trained on two stimuli, S− and S+, defined as follows:

S− = [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
S+ = [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7, 0.8, 0.7, 0.7, 0.6, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]

Each stimulus is weighted by a level of intensity (0.6 for S+ and 0.2 for S−) by simply multiplying each value in the stimulus vector by the intensity weighting. The network is trained to give a response of 0.6 to the weighted S+ and a response of 0.2 to the weighted S−. The generalization gradient is then examined by feeding the network with the stimuli at intensities varied systematically from 0 to 1. For our study with multi-dimensional stimuli, we find qualitatively identical results to those above for one-dimensional stimuli (thus no graphs are shown). This suggests that a single-dimensional stimulus representation is equally valid in terms of network generalization for modelling stimulus selection.
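The construction of the intensity-weighted stimuli can be sketched as follows. The vectors and intensity weightings are those given above; the helper name `weight_by_intensity`, the choice of eleven test intensities and the use of S+ for the test series are assumptions made for this illustration.

```python
S_MINUS = [0.4] * 20
S_PLUS = [0.4] * 7 + [0.6, 0.7, 0.8, 0.7, 0.7, 0.6] + [0.4] * 7


def weight_by_intensity(stimulus, intensity):
    """Multiply every element of the stimulus vector by the intensity weighting."""
    return [intensity * value for value in stimulus]


# Training pairs: (weighted stimulus, desired response).
training_set = [
    (weight_by_intensity(S_PLUS, 0.6), 0.6),
    (weight_by_intensity(S_MINUS, 0.2), 0.2),
]

# Test series for the generalization gradient: intensity varied from 0 to 1.
test_intensities = [step / 10 for step in range(11)]
test_stimuli = [weight_by_intensity(S_PLUS, c) for c in test_intensities]

print(len(S_PLUS), len(S_MINUS), len(test_stimuli))
```

Each element of `test_stimuli` would then be presented to the trained network and its output plotted against the corresponding intensity.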
13.3.5 Multiple Stimuli

Returning to a single input, we examined the network's generalization with three stimuli. Again, our previous results are confirmed for three stimuli, where the desired generalization curves were similar to those for two stimuli. However, the inclusion of three stimuli allows for a case where we have an S+ between two S−, or an S− between two S+, so that the often-desired generalization gradient is bell-shaped or U-shaped. Figure 13.7 shows the results of varying the number of hidden nodes in this case. In each case the network accurately learns the correct response to all stimuli. However, the generalization gradients all show a peak shift towards zero. As the stimulus locations move closer to zero, the peak shift becomes more exaggerated and the generalization width narrows. Twenty replications with different random seeds show similar results. We changed the order of presenting stimuli to the network and repeated the replications. The order of presentation does not qualitatively change the results.

Fig. 13.7 Generalization gradients showing the network output produced from a given input. Each gradient is the result of a single run after 500,000 back-propagation iterations. The network topology comprised a single input node, three hidden nodes, a single output node, and learning rate α = 1. Network weights were initialized with values selected from a uniform random distribution with w = 0.1. The network was trained with back-propagation to output 0.6 for positive stimuli (S+) and 0.2 for negative stimuli (S−). The locations of the stimuli are (a) S1− = 0.0, S2− = 0.6, S+ = 0.3, (b) S1− = 0.2, S2− = 0.8, S+ = 0.5, (c) S1− = 0.4, S2− = 1.0, S+ = 0.7. Vertical lines show the location of S+. Note that each generalization gradient peaks at a value that is shifted away (always towards zero) from the location of S+. The extent of the shift reduces as the locations of the stimuli increase

We examined the three-stimuli case using tanh as the activation function for the hidden nodes. We find that, unlike the two-stimuli case, the network is unable to learn reasonable responses to all stimuli with the tanh function (Fig. 13.8, cf. Fig. 13.7). Additional runs showed that this failure to learn was robust to variation in the learning rate and persisted for networks with multiple hidden layers.
Fig. 13.8 Generalization gradients showing the network output produced from a given input. The tanh function is used for the activation of hidden nodes. Each gradient is the result of a single run after 500,000 back-propagation iterations. The network topology comprised a single input node, three hidden nodes, a single output node, and learning rate α = 1. Network weights were initialized with values selected from a uniform random distribution with w = 0.1. The network was trained with back-propagation to output 0.6 for positive stimuli (S+) and 0.2 for negative stimuli (S−). The locations of the stimuli are (a) S1− = 0.0, S2− = 0.6, S+ = 0.3, (b) S1− = 0.2, S2− = 0.8, S+ = 0.5, (c) S1− = 0.4, S2− = 1.0, S+ = 0.7. The network is unable to learn an appropriate response in each case, and it settles on exactly the same weights (not shown) for cases a and b, indicating that the network has arrived at the same local optimum
13.3.6 Evolving Network Weights

We now test the multiple-stimuli case above with the same network configuration, using a genetic algorithm (Holland 1975; Mitchell 1998) to optimize the weights. We evolve a population of 200 networks, and use tournament selection to differentially select and asexually reproduce the fittest network configurations each generation. Tournament selection takes place by selecting two networks at random from the population and comparing their fitness (see below for the fitness measure). The fitter network is reproduced in the new population with probability k (here we use k = 0.8). This process is repeated (with replacement) until the next generation has a full population of 200 networks. Mutation occurs with probability 0.05 for each weight of a child network. The mutation operator proceeds by adding a value, selected from a Gaussian distribution with zero mean and variance σ = 0.1, to the weight. The genetic algorithm was run for 500,000 generations.

We first evolve the networks to give desired responses to three stimuli (as in the previous section). Every generation, each network's fitness is found by assessing its responses to each of the three sample stimuli. The fitness f of each network i is updated after each training sample j using the network error:

$$f_i = f_i + 1 - |d_j - p_i|$$

where $d_j$ is the desired output for stimulus j (e.g. 0.6 for S+ and 0.2 for S−), and $p_i$ is the network's output for that stimulus. Thus, the function gives higher rewards to networks with lower average errors. The final generalization gradients produced by the evolved networks are similar to those produced by the back-propagation algorithm, with the peak-shift bias reoccurring (Fig. 13.9; cf. Fig. 13.7). Twenty replications with different random seeds show similar results. We also examined the three-stimuli case using tanh as the activation function for the hidden nodes. As with back-propagation, we find that the network with the tanh function is unable to learn reasonable responses to all stimuli.

Fig. 13.9 Generalization gradients showing a network output produced from a given input. Each gradient is the result of a single run after 500,000 generations of optimization using a genetic algorithm. The network topology comprised a single input node, three hidden nodes and a single output node. Network weights were initialized with values selected from a uniform random distribution with w = 0.1. The network was evolved to output 0.6 for positive stimuli (S+) and 0.2 for negative stimuli (S−). The locations of the stimuli are (a) S1− = 0.0, S2− = 0.6, S+ = 0.3, (b) S1− = 0.2, S2− = 0.8, S+ = 0.5, (c) S1− = 0.4, S2− = 1.0, S+ = 0.7. Vertical lines show the location of S+. Note that each generalization gradient peaks at a value that is shifted away (always towards zero) from the location of S+. The extent of the shift reduces as the locations of the stimuli increase
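The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not stated in the text: the network has bias terms, initial weights are drawn uniformly from [−w, w], the less fit network is reproduced when the fitter one is not, the value 0.1 is used as the standard deviation of the mutation, and only 50 generations are run so that the example finishes quickly. All function names are hypothetical.

```python
import math
import random

# (input, desired output) for S1-, S+ and S2-, using the locations of case (b).
STIMULI = [(0.2, 0.2), (0.5, 0.6), (0.8, 0.2)]
N_HIDDEN = 3
N_WEIGHTS = 3 * N_HIDDEN + 1  # input weights, hidden biases, output weights, output bias


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def network_output(weights, stimulus):
    """Feed-forward pass: one input, N_HIDDEN logistic hidden nodes, one logistic output."""
    w_in = weights[:N_HIDDEN]
    b_hid = weights[N_HIDDEN:2 * N_HIDDEN]
    w_out = weights[2 * N_HIDDEN:3 * N_HIDDEN]
    b_out = weights[-1]
    hidden = [logistic(w * stimulus + b) for w, b in zip(w_in, b_hid)]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def fitness(weights):
    """Accumulate 1 - |d_j - p| over the training stimuli."""
    return sum(1.0 - abs(desired - network_output(weights, x)) for x, desired in STIMULI)


def tournament(population, scores, rng, k=0.8):
    """Pick two networks at random; return the fitter one with probability k."""
    i, j = rng.randrange(len(population)), rng.randrange(len(population))
    fitter, weaker = (i, j) if scores[i] >= scores[j] else (j, i)
    return population[fitter] if rng.random() < k else population[weaker]


def mutate(weights, rng, rate=0.05, sigma=0.1):
    """Add zero-mean Gaussian noise to each weight with probability `rate`."""
    return [w + rng.gauss(0.0, sigma) if rng.random() < rate else w for w in weights]


def evolve(generations, pop_size=200, w=0.1, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-w, w) for _ in range(N_WEIGHTS)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [mutate(tournament(population, scores, rng), rng)
                      for _ in range(pop_size)]
    return max(population, key=fitness)


best = evolve(generations=50)
print("fitness of best network:", round(fitness(best), 3), "(maximum possible: 3.0)")
```

Running `evolve` with different values of `seed` gives the replications with different random seeds; the generalization gradient of an evolved network is obtained by calling `network_output(best, x)` over a grid of inputs.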
13.4 Discussion

Generalization is an essential component of stimulus selection, and has consequently been the focus of much research. Feed-forward artificial neural networks are often used to model stimulus selection because of their inherent ability to generalize (Enquist and Ghirlanda 2005). Despite an increase in their use, there has been no systematic study of the sensitivity of neural networks to key network and training parameters and assumptions. If we are to use neural networks to study animal behaviour, it is essential to be aware of any inherent generalization biases and to understand what values to set for each parameter. Thus, in this chapter we systematically explored the effect of different parameter values on network generalization for various stimulus selection problems. We found that the number of hidden nodes and the initial network weights affect the network's ability to generalize effectively and without bias. Importantly, we found that certain values for these parameters cause biases and reduce the network's robustness to replication.

We have a number of simple suggestions to follow when developing neural network models of stimulus control. The first is to use the sigmoid function for hidden node transfers because, unlike tanh, the logistic function does not suffer from extreme biases when dealing with low-value inputs. A further problem with the tanh function is that it caused the network to fail to learn appropriate responses to the task with three stimuli, whereas the sigmoid function allowed the network to do this easily. However, the sigmoid function also showed some bias, with the generalization peak always shifted to the left (i.e. towards low values). It is important to keep this result in mind when considering peak-shift effects in neural network models. We also suggest that initial weights are selected from values within the saturation boundaries of the activation function and the input range (i.e. for the sigmoid function, 0 ≤ p ≤ 1) to avoid premature node saturation.

We found that the number of hidden nodes required to avoid generalization bias is fairly low for all problems we tested. The results show that having too many hidden nodes (i.e. more than are required) is better than having too few. The number of hidden nodes required may increase with problem complexity. How can we know if we are using an appropriate number of nodes? We suggest doubling the number of nodes used and testing for any differences in the results. If there are no differences, then the original number of nodes should be adequate. Otherwise, the same process should be repeated until an adequate number of nodes is found.

The test with three stimuli showed a bias: the network tends to produce a peak shift towards input values of zero. Although this peak-shift property is a bias of the network setup, it does not mean that network learning cannot reverse the peak shift as a result of the learning process and particular stimulus properties. It is, however, important to be aware of this network property when modelling stimulus selection.

Given that back-propagation is a deterministic algorithm, one might expect it to be robust to replication. However, the random initialization of network weights adds stochasticity to the training process, causing the results to vary for different random seeds.
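The node-doubling check suggested above can be sketched as a small helper. The callable `train_and_generalize`, the tolerance and the upper limit on the node count are hypothetical: the callable stands for whatever routine trains a network with a given number of hidden nodes and returns its outputs on a fixed grid of inputs, and "no differences" is interpreted here as a maximum absolute difference below the tolerance.

```python
def adequate_hidden_nodes(train_and_generalize, start_nodes, tolerance, max_nodes=1024):
    """Double the hidden-node count until the generalization gradient stops changing.

    train_and_generalize(n) is a user-supplied callable that trains a network
    with n hidden nodes and returns its outputs on a fixed grid of inputs.
    """
    nodes = start_nodes
    gradient = train_and_generalize(nodes)
    while 2 * nodes <= max_nodes:
        doubled = train_and_generalize(2 * nodes)
        difference = max(abs(a - b) for a, b in zip(gradient, doubled))
        if difference <= tolerance:
            return nodes  # doubling made no appreciable difference
        nodes, gradient = 2 * nodes, doubled
    raise RuntimeError("no adequate node count found up to max_nodes")


def toy_gradient(n):
    """Stand-in for a trained network: a gradient whose offset shrinks as 1/n."""
    return [x / 10 + 1.0 / n for x in range(11)]


print(adequate_hidden_nodes(toy_gradient, start_nodes=1, tolerance=0.05))
```

In practice the comparison would be repeated over several random seeds, since a single run may differ from another for reasons unrelated to the number of hidden nodes.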
We recommend replication with random seeding as one method for catching any generalization biases that might occur in a select number of runs as a result of one particular selection of initial values. This conclusion supports the appeal for replication in neural network models to capture variation between runs (Tosh and Ruxton 2007). The genetic algorithm and back-propagation resulted in similar network generalization for each task, suggesting that network generalization is not highly sensitive to the particular type of weight optimization algorithm.

To conclude, it is important that we understand how model assumptions affect the generalization abilities of neural networks when modelling stimulus selection. It is also important that authors explicitly communicate the full details of their network parameters and training regime. Although neural networks are relatively robust to parameter perturbations, we have shown that certain network and training conditions produce undesirable artefacts that need to be avoided. Even if some biases cannot be resolved, it is important that researchers are aware of any biases that occur as a result of the particular network setup or training regime used, so that they (and reviewers and readers of their work) can be entirely clear that their results represent the effects of the underlying biology that they are interested in and are not simply artefacts of arbitrarily chosen parameters of the particular network arrangement.

Acknowledgements We thank Stefano Ghirlanda and Samar Buchala for their help. Dan Franks is supported by an RCUK research fellowship.
References

Balogh ACV, Leimar O (2005) Müllerian mimicry: an examination of Fisher's theory of gradual evolutionary change. Proc R Soc Lond B: Biol Sci 272:2269–2275
Barnard C (2004) Animal behaviour: mechanism development function and evolution. Person Scientific
Baum EB, Haussler D (1990) What size net gives valid generalization? Neural Comput 1:151–160
Enquist M, Ghirlanda S (2005) Neural networks and animal behavior. Princeton University Press, Princeton
Franks DW, Sherratt TN (2006) The evolution of multi-component mimicry. J Theor Biol 244:631–639
Gamberale-Stille G, Tullberg BS (1999) Experienced chicks show biased avoidance of stronger signals: an experiment with natural colour variation with live aposematic prey. Evol Ecol 13:579–589
Ghedira H, Bernier M (2004) The effect of some internal neural network parameters on SAR texture classification performance. IEEE Int Geosci Remote Sens Symp 6:3845–3848
Ghirlanda S, Enquist M (1998) Artificial neural networks as models of stimulus control. Anim Behav 56:1383–1389
Ghirlanda S, Enquist M (2003) A century of generalization. Anim Behav 66:15–36
Ghirlanda S, Enquist M (2007) The effect of training and testing histories on generalization: a test of simple neural networks. Philos Trans R Soc B: Biol Sci
Guyon IP (1991) Applications of neural networks to character recognition. Int J Pattern Recognition Artif Intel 5:353–382
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Michigan
Holmgren NMA, Enquist M (1999) Dynamics of mimicry evolution. Biol J Linnean Soc 66:145–158
Holmgren NMA, Getz WM (2000) Evolution of host plant selection in insects under perceptual constraints: a simulation study. Evol Ecol Res 1:81–106
Krakauer DC (1995) Prey confuse predators by exploiting perceptual bottlenecks. Behav Ecol Sociobiol 36:421–429
Krogh A (1992) A simple weight decay can improve generalization. Adv Neural Inf Proc Syst 4:950–957
Levin RI, Lieven NAJ, Lowenberg MH (2000) Measuring and improving neural network generalization for model updating. J Sound Vibration 238:401–424
Lindstrom L, Alatalo RV, Mappes J, Rippi M, Vertainen L (1999) Can aposematic signals evolve by gradual change? Nature 249–251
Mitchell M (1998) An introduction to genetic algorithms. MIT Press, Massachusetts
Phelps S, Ryan M (1998) Neural networks predict response biases of female túngara frogs. Proc R Soc London B: Biol Sci 265:279–285
Purtle RB (1973) Peak shift: a review. Psychol Bull 80:408–421
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representation by backpropagation of error. Nature 323:533–536
Schmidt WF, Raudys S, Kraaijveld MA, Skurichina M, Duin RPW (1993) Initializations, backpropagation and generalization of feed-forward classifiers. Proceedings of the 1993 IEEE Int Conf on Neural Networks 598–604
Shettleworth SJ (1998) Cognition, evolution and behaviour. Oxford University Press, Oxford
Siraj F, Partridge D (2002) Improving generalization of neural networks using multilayer perceptron discriminants. Syst Anal Model Simul 42:1059–1068
Tosh C, Ruxton GD (2007) The need for stochastic replication of ecological neural networks. Philos Trans R Soc B: Biol Sci 362:455–460
Tosh CR, Jackson AL, Ruxton GD (2006) The confusion effect in predatory neural networks. Am Natur 167:E52–65
Chapter 14
Functional Modules in Protein–Protein Interaction Networks

Tobias Müller and Marcus Dittrich
Abstract Modern high-throughput technologies in genomics, transcriptomics, and proteomics typically produce long lists of significantly deregulated genes and proteins. These large amounts of data call for new integrative analysis approaches which allow the investigation of these single genes within their functional network context. The integrative analysis of expression data with protein–protein interaction networks and further gene-wise information can identify novel deregulated modules within the entire complex cellular interaction network. Several heuristic approaches for the identification of functional modules have previously been proposed. Here we describe an exact solution for this problem, which is based on integer-linear programming and typically computes provably optimal solutions in a few minutes, even in large networks. In this chapter, we describe this exact approach in detail, using a well-established lymphoma microarray data set in combination with associated survival data and the large PPI network obtained from the Human Protein Reference Database (HPRD). The presented example further illustrates the benefit of integrated network approaches over classical plain gene-wise analyses.

Keywords Functional modules · Integrated analysis · Network biology · Systems biology · Microarray analysis · Protein–protein interaction networks · Survival analysis

The increasing amount of biological data on a molecular level calls for new methods to analyze it in an integrative manner and to study the functional interplay between the single molecules. High-throughput expression profiling technologies provide a plenitude of information on gene expression in various tissues and under diverse experimental conditions. Combining this information with the knowledge of interactions of the gene products in a protein–protein interaction network generates a meaningful biological context in terms of functional association of differentially expressed genes.

T. Müller (B) Biocenter, Bioinformatics Department, University of Wuerzburg, 97074 Wuerzburg, Germany e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_14, © Springer Science+Business Media, LLC 2010
To assess the influence of specific genes on a disease phenotype, microarray technologies are commonly used. Especially in tumor biology, the identification of differentially expressed genes in diverse tissue samples or cancer stages is a well-established method to classify tumors and tumor subtypes. Ordinary gene expression analysis yields a number of genes that are up- or down-regulated with a certain significance, but it reveals neither causal effects nor functional associations between these genes. The disease relevance of the genes is also of particular interest: survival analysis makes it possible to determine survival-relevant genes and to make predictions about the prognosis for the patients. The combined analysis of expression profiles, protein–protein interaction data, and further information on the influence of specific genes on a disease-specific pathophysiology thus allows the detection of previously unknown dysregulated modules that are not recognizable when each data set is analyzed on its own.

Several approaches have been proposed to identify interaction modules in an integrative network analysis. These are based on p-values derived from gene-specific patient data, e.g., from the analysis of differentially expressed genes, in combination with a biological network. Ideker et al. (2002) first devised a function for the scoring of networks and an appropriate algorithm to find high-scoring subnetworks. The underlying combinatorial problem of identifying maximum-scoring subnetworks has been proven to be NP-hard. Therefore, the authors introduced a heuristic approach based on simulated annealing. This strategy allows the integration of multivariate p-values into a network score to measure the significance of a subnetwork. Heuristic approaches in general cannot guarantee to identify the global maximum-scoring subgraph.
Furthermore, the proposed techniques are often computationally demanding and tend to result in large high-scoring networks, which may be difficult to interpret afterward in a meaningful biological sense.

In this chapter, we describe a novel exact approach that delivers provably optimal and suboptimal solutions to the maximal-scoring subgraph problem by integer-linear programming in acceptable running time. We introduce a novel scoring function for network nodes that is based on p-values and allows the integration of multivariate p-values using order statistics. This modular scoring function is based on a signal–noise decomposition of the p-value distribution using a beta-uniform mixture model (BUM). An adjustment parameter that can be statistically interpreted as a false discovery rate (FDR) makes it possible to control the size of the resulting subnetwork and therefore to extract smaller modules of interpretable size. Scoring of the network, including the decomposition of the p-value distribution and the aggregation of multiple p-values into one p-value of p-values, will be outlined in detail in the following sections.

As an example we describe a stepwise analysis based on gene expression and survival data on diffuse large B-cell lymphomas. Section 14.1 introduces the microarray and protein–protein interaction (PPI) data used in the analysis. Section 14.2 explains in detail how an adequate node score can be derived from the p-values of different analyses and gives an overview of the search strategy used to find maximal-scoring subnetworks. In Section 14.3 we review the obtained modules in the lymphoma network, and in Section 14.4 we compare the performance of the described exact analysis to the heuristic approach implemented in the Cytoscape plugin jActiveModules (Ideker et al. 2002). All of the methods needed for the analysis are implemented in the R package BioNet (Beisser et al. 2010), which can be downloaded at http://bionet.bioapps.biozentrum.uni-wuerzburg.de and will be made available at Bioconductor (http://www.bioconductor.org).
14.1 Data Integration
In the following, we briefly introduce the data sets that will be analyzed in the remainder of the chapter. This will exemplify how an integrated network analysis can be performed including PPI, microarray, and clinical survival data. To illustrate the approach, we analyze a network obtained by combining the gene expression data from two different lymphoma subtypes (GCB and ABC) (Rosenwald et al. 2002) with survival data and a comprehensive interactome network derived from the Human Protein Reference Database (HPRD) (Peri et al. 2003).
14.1.1 Microarray and Survival Data
The case study presented here is based on microarray data from diffuse large B-cell lymphomas (DLBCL) (Rosenwald et al. 2002). This comprises gene expression data from 112 tumors with the germinal center B-like phenotype (GCB DLBCL) and from 82 tumors with the activated B-like phenotype (ABC DLBCL). These tumor subtypes differ in their malignancy as well as in the treatment options for the patients. Expression profiling has been performed on the Lymphochip including 12,196 cDNA probe sets corresponding to 3,583 genes (Rosenwald et al. 2002). In addition, survival information from 190 patients is available (Rosenwald et al. 2002).

As a first step, we are interested in two questions: first, which genes are differentially expressed between the two tumor subtypes, and second, which genes are associated with the risk of relapse. After normalization, the significance of differential expression between the two subtypes ABC and GCB can be assessed by using robust statistics based on linear models and a moderated t-test (Smyth 2004). This yields an uncorrected p-value for differential expression for each gene. Alternatively, simple gene-wise t-tests between the groups could be used to obtain p-values. These p-values constitute a quantitative measurement describing the significance of differential expression for each gene.

To assess the risk association of each gene we subsequently perform a survival analysis by fitting a univariate Cox model to the expression data of each gene using the routines implemented in the R package survival (Andersen and Gill 1982). From the likelihood ratio test of the regression coefficient we can obtain p-values for each gene denoting the association with survival, independent of the assigned tumor subtype. Thus we have p-values from both analyses, corresponding to differential expression on the one hand and to risk association on the other hand. In the next step, we combine these two p-values for each gene into one p-value from which we derive a score.
14.1.2 Network
For the network data we use a data set of literature-curated human protein–protein interactions that has been obtained from the Human Protein Reference Database (HPRD) (Peri et al. 2003; Mishra et al. 2006). The entire interactome network assembled from these data consists of 36,504 interactions between 9,392 different proteins. From this we derive a Lymphochip-specific interactome network as the vertex-induced subgraph extracted by the subset of genes for which we have expression data on the Lymphochip (Fig. 14.1). The resulting network comprises 2,561 different gene products and 8,538 interactions with a large connected component of 2,034 proteins (79.4%) and 8,399 interactions (98.4%). The remaining proteins are either non-interacting single nodes in the network (472) or form tiny clusters of a handful of nodes (23). Since we want to identify modules as connected subgraphs we focus on the largest connected component. Visualization and further network analysis can be performed with Cytoscape (Shannon et al. 2003); Cline et al. (2007) give an introductory Cytoscape tutorial.

Fig. 14.1 Integration of PPI, microarray, and clinical data into a common framework. First the vertex-induced subgraph of all genes on the array is extracted (dark nodes). Differential expression can be calculated using a standard t-test and relapse risk association for each gene is estimated by Cox regression. All nodes in the network are annotated with the vector of p-values derived from these analyses. There is practically no limitation to the number of p-values (here two) that can be integrated by the presented approach
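The extraction of the chip-specific network and its largest connected component can be sketched in a few lines. This is a minimal illustration on a toy edge list, not the BioNet implementation; the node names and the helper functions `induced_subgraph` and `connected_components` are hypothetical.

```python
from collections import deque

def induced_subgraph(edges, keep):
    """Return adjacency sets of the subgraph induced by the node set `keep`."""
    keep = set(keep)
    adj = {v: set() for v in keep}
    for u, v in edges:
        # Keep an interaction only if both partners are measured; drop self-loops.
        if u in keep and v in keep and u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def connected_components(adj):
    """Return the connected components as sets, largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u] - comp:
                comp.add(w)
                queue.append(w)
        seen |= comp
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)

# Toy interactome with hypothetical identifiers (not HPRD data).
ppi_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "X"), ("E", "F"), ("G", "Y")]
on_chip = {"A", "B", "C", "D", "E", "F", "G"}   # genes with expression data

adj = induced_subgraph(ppi_edges, on_chip)
components = connected_components(adj)
largest = components[0]
print(sorted(largest))    # nodes of the largest connected component
print(len(components))    # number of components, single nodes included
```

Nodes such as `G` that lose all partners in the induced subgraph remain as single-node components, which mirrors the non-interacting single nodes mentioned in the text.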
14.2 Scoring and Searching
In principle, the problem of identifying functional modules can be decomposed into two separate subproblems. The first part of the problem is the definition of a scoring function, which captures the information of the experimental data and maps them onto the nodes of the network. After scoring each node of the network, the second problem is to find an adequate algorithm to search for the maximal scoring connected subgraph. In the presented example we use the p-values derived from differentially expressed genes and the p-values of the Cox-regression model indicating risk association for each gene (Fig. 14.2).
Fig. 14.2 Definition of a node scoring function. First all p-values are aggregated using an order statistic (Section 14.2.1). Subsequently, a signal–noise decomposition is performed based on a beta-uniform mixture model (Section 14.2.2). The score is derived by a log-likelihood ratio of signal and noise component (Section 14.2.3). The final subnetwork score can be simply calculated by the sum of all considered node scores. Finding highest-scoring networks is described in Section 14.2.4
14.2.1 Aggregation of p-Values
Having annotated each node of the interaction network with experimentally derived p-values, we are faced with the problem of aggregating these p-values into an adequate score for each node. We solve this problem in three steps. First, we aggregate the vector of p-values into one p-value using an order statistic. Second, the resulting p-value distribution is decomposed into a noise and a signal component based on a beta-uniform mixture model. The final step defines the score as the log-likelihood ratio of the signal and noise components.

Here we start by aggregating the p-values at each node in the network by asking for the ith order statistic of the associated p-values, resulting in one p-value of p-values. By definition, p-values are uniformly distributed under the null hypothesis (Wasserman 2005). This means that, if we consider the p-value as a random variable p, the following equation holds:

\[ P(p \le x) = x \tag{14.1} \]

for all $x \in [0, 1]$. Hence, the distribution function of the random variable p is the identity function and thus equal to the distribution function of the uniform distribution. In general, the probability density function of the ith smallest observation $x_{(i)}$ is given by

\[ f(x_{(i)}) = \frac{n!}{(n-i)!\,(i-1)!}\, f(x)\, F(x)^{i-1}\, (1 - F(x))^{n-i} \tag{14.2} \]

for distribution function F(x) and density function f(x) of a random variable X and for $i \in \{1, \ldots, n\}$ (Lindgren 1993). Thus we can apply Eq. (14.2) with f(x) = 1 and F(x) = x and get

\[ f(x_{(i)}) = \frac{n!}{(n-i)!\,(i-1)!} \cdot 1 \cdot x^{i-1} (1 - x)^{n-i}, \qquad 0 \le x \le 1 \tag{14.3} \]

or, in other words, the ith order statistic $x_{(i)}$ is distributed according to the beta-distribution B(i, n − i + 1) with the associated cumulative distribution function:

\[ F(x_{(i)}) = \frac{n!}{(n-i)!\,(i-1)!} \int_0^{x_{(i)}} z^{i-1} (1 - z)^{n-i}\, dz. \tag{14.4} \]

For example, let us consider a vector x of n = 4 ordered p-values (0.001, 0.05, 0.1, 0.5). The p-value of the second order statistic is derived by applying Eq. (14.4):

\[ F(x_{(2)}) = \frac{4!}{2!\,1!} \int_0^{x_{(2)}} z (1 - z)^2\, dz = 12 \left[ \frac{z^2}{2} - \frac{2z^3}{3} + \frac{z^4}{4} \right]_0^{x_{(2)}} = 6x_{(2)}^2 - 8x_{(2)}^3 + 3x_{(2)}^4 . \]
For x(2) = 0.05 we get a significant p-value of 0.014. All order statistics, from the first to the fourth, yield p-values of (0.004, 0.014, 0.0037, 0.0625).
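The worked example can be checked numerically. The sketch below evaluates Eq. (14.4) without numerical integration by using the standard identity that the Beta(i, n − i + 1) distribution function at x equals the probability that at least i of n independent uniform variables fall below x. The function name `order_statistic_pvalue` is an assumption made for this illustration.

```python
from math import comb

def order_statistic_pvalue(pvalues, i):
    """P-value of the i-th smallest of n p-values under the uniform null.

    Equals the Beta(i, n - i + 1) distribution function at x_(i), written here
    as a binomial tail: the chance that at least i of n uniforms fall below x.
    """
    n = len(pvalues)
    x = sorted(pvalues)[i - 1]
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(i, n + 1))

pvals = [0.001, 0.05, 0.1, 0.5]
aggregated = [order_statistic_pvalue(pvals, i) for i in range(1, 5)]
print([round(p, 4) for p in aggregated])   # approximately [0.004, 0.014, 0.0037, 0.0625]
```

For i = 2 the sum reduces to the polynomial 6x² − 8x³ + 3x⁴ derived above.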
14.2.2 Signal–Noise Decomposition
Based on these aggregated p-values, we derive a new scoring function. Following Pounds and Morris (2003), we consider the distribution of the p-values as a mixture of a noise and a signal component. Visual inspection of the empirical p-value distribution as displayed in Fig. 14.5 suggests that the signal component can be appropriately modeled by a B(a, 1)-distribution whereas the noise component is naturally modeled by a B(1, 1) = uniform distribution as detailed above. Thus the family of beta-distributions comprises both the signal distribution and the noise distribution (Fig. 14.3).

Fig. 14.3 Versatile shapes of beta-distribution with varying parameter a and fixed parameter b = 1 (density plotted against x for Beta(0.2,1), Beta(0.4,1), Beta(0.6,1), Beta(0.8,1), Beta(0.9,1), and Beta(1,1)). Note that for a = 1 the beta-distribution is equal to the uniform distribution. Thus the family of beta-distributions comprises both the signal distribution and the noise distribution

The B(a, b) distribution is given by

\[ B(a, b)(x) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}, \]

where Γ(·) denotes the Gamma function with $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$. Thus the distribution $f_{\mathrm{mix}}$ of the mixture model with mixture parameter λ and shape parameter a reduces to

\[ f_{\mathrm{mix}}(a, \lambda)(x) = \lambda B(1, 1)(x) + (1 - \lambda) B(a, 1)(x) = \lambda + (1 - \lambda)\, a x^{a-1} \]

for x, a ∈ [0, 1], because $\frac{\Gamma(a+1)}{\Gamma(a)\,\Gamma(1)} = a$. For given data $x = x_1, \ldots, x_n$ the log likelihood is

\[ \log L(\lambda, a; x) = \sum_{i=1}^{n} \log\left(\lambda + (1 - \lambda)\, a x_i^{a-1}\right), \]
and consequently the maximum-likelihood estimates of the unknown parameters are given by

\[ [\hat{\lambda}, \hat{a}] = \operatorname*{argmax}_{\lambda, a} L(\lambda, a; x) \]

(Fig. 14.4). Both parameters can be obtained by standard numerical optimization methods (e.g., the L-BFGS-B method (Byrd et al. 1995) as implemented in R). Applying this optimization to the presented lymphoma data set delivers a value of 0.536 for the mixture parameter λ and 0.276 for the shape parameter a of the beta-distribution, as depicted in Fig. 14.5.

Fig. 14.4 Log-likelihood surface of the mixture parameter λ (x-axis) and the shape parameter a (y-axis) of the beta-uniform mixture model for the B-cell lymphoma data set. The numerically determined optimal parameter pair is indicated by the cross lines. The maximum-likelihood estimates are $\hat{\lambda} = 0.536$ and $\hat{a} = 0.276$

Fig. 14.5 Left: The histogram depicts the fitted mixture model (curved line) to the empirical distribution of the p-values (second order statistics). The optimal parameters for the model are a = 0.276 and λ = 0.536. The horizontal line indicates the upper bound π for the fraction of noise. Right: quantile–quantile plot of the fitted distribution (quantiles of expected p-values under the mixture model) and the empirical distribution (observed p-values, second order statistics). The straight line indicates that the shape of the fitted model coincides with the shape of the empirical distribution (Dittrich et al. 2008)
As stated above, p-values are by definition uniformly distributed under the null hypothesis, while the true signal distribution is not known a priori. Therefore a uniform distribution will adequately model the noise component. Modeling the signal component by a beta-distribution is an assumption that has to be justified. Indeed, the fitted density function of the mixture model fits the data very well, as demonstrated in the left plot of Fig. 14.5. Furthermore, the quantile–quantile plot of the fitted density function versus the empirical distribution function is close to a straight line, indicating that the signal component is well captured by the beta-distribution.
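To make the fitting step concrete, the following sketch evaluates the BUM log likelihood and maximizes it on simulated p-values. It is an illustration under stated assumptions: the data are simulated with assumed parameters (λ = 0.5, a = 0.3) rather than taken from the lymphoma study, and a coarse grid search stands in for the L-BFGS-B optimizer mentioned in the text. The function names are hypothetical.

```python
import math
import random

def bum_log_likelihood(lam, a, pvalues):
    """Log likelihood of the beta-uniform mixture: sum of log(lam + (1 - lam) * a * x**(a - 1))."""
    # max(x, 1e-300) guards against a p-value of exactly zero.
    return sum(math.log(lam + (1 - lam) * a * max(x, 1e-300) ** (a - 1)) for x in pvalues)

def fit_bum_grid(pvalues, steps=20):
    """Coarse grid search over (lam, a) in (0, 1) x (0, 1); a stand-in for L-BFGS-B."""
    grid = [k / steps for k in range(1, steps)]
    return max(((lam, a) for lam in grid for a in grid),
               key=lambda p: bum_log_likelihood(p[0], p[1], pvalues))

# Simulated p-values with assumed parameters (not the lymphoma data).
random.seed(1)
true_lam, true_a = 0.5, 0.3
pvalues = [random.random() if random.random() < true_lam
           else random.random() ** (1 / true_a)      # inverse CDF of Beta(a, 1)
           for _ in range(2000)]

lam_hat, a_hat = fit_bum_grid(pvalues)
print(lam_hat, a_hat)
```

A grid search is chosen here only because it needs no external libraries; for real data a bounded quasi-Newton method such as the one cited in the text is the more efficient choice.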
14.2.3 Network Score
Inspired by the Neyman–Pearson lemma, a scoring function can be defined by the log ratio of the signal to the noise component. In the BUM model the signal component is the B(a, 1) distribution (with fitted parameter a) while the noise component is given by B(1, 1), which is equivalent to the uniform distribution. This simplifies the denominator in the score function to the constant 1:

\[ S(x) = \log\frac{B(a, 1)(x)}{B(1, 1)(x)} = \log\frac{a x^{a-1}}{1} = \log(a) + (a - 1)\log(x). \]

In summary, this function transforms a given set of p-values to a real-valued score where positive scores indicate significant p-values and negative scores denote non-significant p-values. As detailed in Pounds and Morris (2003), the BUM model allows the estimation of the false discovery rate (FDR). This can now be used to fine-tune the zero value threshold: Incorporating a threshold p-value τ(FDR) into the scoring function, we can derive an adjusted log-likelihood ratio score given as

\[ S^{\mathrm{FDR}}(x) = \log\frac{a x^{a-1}}{a \tau^{a-1}} = (a - 1)\left(\log(x) - \log(\tau(\mathrm{FDR}))\right) \tag{14.5} \]

The adjusted and unadjusted scores differ only by an additive offset dependent on the parameter τ. With the adjusted score, p-values above the threshold τ(FDR) are considered to be noise and will be assigned a negative score, whereas p-values below the threshold are considered significant and will thus obtain a positive score. Having obtained a score for each node in the network, we now need to derive an aggregated score for the network modules. Due to the logarithmic scale of the scoring function this can consequently be defined by the sum over all node scores in the subnetwork T:

\[ S^{\mathrm{FDR}}(T) := \sum_{x_i \in T} S^{\mathrm{FDR}}(x_i). \]
In the context of the presented lymphoma study with p-values from the t-test and Cox-regression model this score combines the information on differential expression with that on risk association. This means that genes that are differentially expressed between the GCB and ABC DLBCL subgroups and are simultaneously associated with overall survival will obtain a positive score. Thus we are searching for differentially expressed and risk-associated modules in the integrated PPI network. These can now be identified by searching for the maximal-scoring subnetwork(s).
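The adjusted score of Eq. (14.5) and the additive subnetwork score can be sketched as follows. The chapter does not spell out how τ is obtained from the FDR; the function `fdr_threshold` below assumes the BUM-based FDR estimate of Pounds and Morris (2003), FDR(τ) = πτ / (λτ + (1 − λ)τ^a) with π = λ + (1 − λ)a, solved for τ. Treat that formula and all function names as assumptions of this illustration.

```python
import math

def fdr_threshold(fdr, lam, a):
    """P-value threshold tau for a given FDR under the BUM model (assumed formula, see lead-in)."""
    pi_upper = lam + (1 - lam) * a      # upper bound for the noise fraction
    return ((pi_upper - fdr * lam) / (fdr * (1 - lam))) ** (1 / (a - 1))

def node_score(p, a, tau):
    """Adjusted log-likelihood ratio score of Eq. (14.5)."""
    return (a - 1) * (math.log(p) - math.log(tau))

def subnetwork_score(pvalues, a, tau):
    """Additive score of a subnetwork: the sum of its node scores."""
    return sum(node_score(p, a, tau) for p in pvalues)

lam, a = 0.536, 0.276                   # estimates reported in the text
tau = fdr_threshold(0.001, lam, a)
for p in (tau / 10, tau, tau * 10):
    # Scores are positive below the threshold, zero at it, negative above it.
    print(p, node_score(p, a, tau))
```

Because a < 1, the factor (a − 1) is negative, so smaller p-values yield larger scores.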
14.2.4 Searching
We are now faced with the task of finding an optimal-scoring subnetwork

\[ T^{*} = \operatorname*{argmax}_{T \in \mathcal{T}} S^{\mathrm{FDR}}(T), \tag{14.6} \]

where $\mathcal{T}$ is the set of all connected subgraphs of the protein–protein interaction network.
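For intuition, Eq. (14.6) can be solved by exhaustive enumeration on a very small graph. The sketch below is a brute-force illustration with hypothetical node scores; it is exponential in the number of nodes and is not the integer-linear programming approach described in this chapter.

```python
from itertools import combinations

def is_connected(nodes, adj):
    """Depth-first check that `nodes` induce a connected subgraph."""
    nodes = set(nodes)
    stack, seen = [next(iter(nodes))], set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend((adj[u] & nodes) - seen)
    return seen == nodes

def best_connected_subgraph(scores, adj):
    """Enumerate all non-empty node subsets; exponential, toy graphs only."""
    best, best_score = None, float("-inf")
    for k in range(1, len(scores) + 1):
        for subset in combinations(sorted(scores), k):
            if is_connected(subset, adj):
                total = sum(scores[v] for v in subset)
                if total > best_score:
                    best, best_score = set(subset), total
    return best, best_score

# Hypothetical node scores on a path a - b - c - d with a branch c - e.
scores = {"a": 3.0, "b": -1.0, "c": 2.0, "d": -4.0, "e": 1.5}
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d", "e"}, "d": {"c"}, "e": {"c"}}
module, score = best_connected_subgraph(scores, adj)
print(sorted(module), score)
```

The optimal module accepts the negative node `b` because it connects two positive regions, which is the behavior the additive score is designed to allow.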
Since the network score derived in the previous sections is additive with respect to the network nodes, this combinatorial search problem is exactly the Maximum-Weight Connected Subgraph Problem (MWCS). Because edges have no weights, any solution for MWCS can always be trimmed to a tree of the same weight while maintaining connectivity of the solution. Furthermore, in the case of only non-negative node weights, an optimal solution is easily computed in polynomial time by picking the maximum-weight spanning tree in a spanning forest. If, however, both positive and negative node weights exist, the problem becomes much more difficult in theory. In fact, in the supplement of Ideker et al. (2002), Karp shows that MWCS is an NP-hard problem (Fig. 14.6).
Fig. 14.6 Transformation of MWCS to PCST. (a) Example of an MWCS instance. The minimum weight is S = −2. (b) Vertex profits in PCST result from subtracting S from every node weight. (c) Finally, all edge weights are set to −S = 2. Optimal solutions are marked in black. The Maximum-Weight Connected Subgraph T has weight S(T) = 7, the optimal prize-collecting Steiner tree has profit P(T) = 23 − 14 = 9. Observe that P(T) = S(T) − S
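The transformation summarized in the caption of Fig. 14.6 can be checked on a small example. The sketch below uses a hypothetical instance (not the one drawn in the figure) and verifies that, for a tree, the PCST profit equals the MWCS score minus the minimum node weight: a tree with k nodes has k − 1 edges, so the k shifts added to the profits exceed the edge costs by exactly one shift.

```python
def mwcs_to_pcst(node_weights, edges):
    """Shift node weights by the minimum weight and charge that shift on every edge."""
    w_min = min(node_weights.values())          # assumed negative, as in Fig. 14.6
    profits = {v: w - w_min for v, w in node_weights.items()}
    edge_costs = {e: -w_min for e in edges}
    return profits, edge_costs, w_min

def mwcs_score(tree_nodes, node_weights):
    """Sum of the original node weights of a subgraph."""
    return sum(node_weights[v] for v in tree_nodes)

def pcst_profit(tree_nodes, tree_edges, profits, edge_costs):
    """Collected vertex profits minus paid edge costs."""
    return sum(profits[v] for v in tree_nodes) - sum(edge_costs[e] for e in tree_edges)

# Hypothetical instance: a path a - b - c - d - e.
node_weights = {"a": 3, "b": -1, "c": 2, "d": -2, "e": 4}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
profits, edge_costs, w_min = mwcs_to_pcst(node_weights, edges)

tree_nodes = ["a", "b", "c"]                     # a connected subtree
tree_edges = [("a", "b"), ("b", "c")]
s = mwcs_score(tree_nodes, node_weights)
p = pcst_profit(tree_nodes, tree_edges, profits, edge_costs)
print(s, p, s - w_min)
```

Since the two objectives differ only by the constant shift on every tree, a tree that is optimal for one problem is optimal for the other.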
In contrast to Ideker et al. (2002), who approach this problem heuristically, we propose to solve it to provable optimality using techniques from mathematical programming. This field provides powerful tools to address NP-hard combinatorial optimization problems (Nemhauser and Wolsey 1988). Starting from an integer linear programming (ILP) formulation that models the problem under consideration, i.e., a linear program with integer variables, sophisticated techniques like cutting plane methods or Lagrangian relaxation can be combined with branch and bound to generate provably optimal solutions. Of course, these methods do not guarantee polynomial running time in the general case. For many practically relevant instances, however, they work astonishingly well. The advantages over ad hoc heuristic methods are threefold:
• Methods from mathematical programming guarantee the quality of solutions, i.e., each new feasible solution comes with a maximal distance to an optimal solution. This allows the implementation of a trade-off between running time and solution guarantee.
• Having provably optimal solutions at hand allows evaluating the quality of a model.
• The sound mathematical formulation and investigation often lead to new insights into the original problem.
For MWCS, we choose a solution approach via a strong relation to a well-studied combinatorial problem. More precisely, we transform instances of MWCS into instances of the prize-collecting Steiner tree problem (PCST). This problem occurs in classical applications from operations research such as planning district heating or telecommunications networks, where profit-generating customers and a connecting network have to be chosen in the most profitable way. We then use the mathematical programming-based algorithm for PCST by Ljubić et al. (2006) to find solutions of Eq. (14.6). Hereby, we exploit the strong connection between the problems and establish a one-to-one correspondence between feasible solutions. Our computational results in the next section show that this approach finds provably optimal and suboptimal subnetworks in short computation time for biologically relevant instance sizes. Astonishingly, we also outperform the heuristic methods in terms of speed. The details of our method and the transformation can be found in Dittrich et al. (2008).

In addition to computing the best solution to Eq. (14.6), our approach is also able to compute a list of promising solutions. Instead of applying straightforward deletion and re-iteration, we propose a different approach to generate suboptimal solutions: In our ILP approach, binary variables $x_v$ determine the presence of nodes in a subgraph $T = (V_T, E_T)$, that is, $x_v = 1$ if $v \in V_T$ and $x_v = 0$ otherwise. Now let $T^{*} = (V_{T^{*}}, E_{T^{*}})$ be an optimal subnetwork as identified by the branch-and-cut algorithm. Adding the Hamming distance-like inequality

\[ \sum_{v \in V_{T^{*}}} (1 - x_v) \ge \alpha\, |V_{T^{*}}| \]

with α ∈ [0, 1] and re-optimizing leads to a best solution differing in at least $\alpha |V_{T^{*}}|$ nodes from $T^{*}$. This procedure can be iterated k times. Two advantages of this strategy are that the user can determine the number k of suboptimal solutions that should be reported and may adjust the variety of solutions via the parameter α.
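The effect of the Hamming distance-like inequality can be illustrated by enumeration on a toy graph. This is a sketch, not the branch-and-cut procedure: it replaces the ILP by brute force, and it assumes that the inequality for every previously reported module stays in force during later iterations. The scores, graph, and function names are hypothetical.

```python
from itertools import combinations

def connected(nodes, adj):
    """Depth-first check that `nodes` induce a connected subgraph."""
    nodes = set(nodes)
    stack, seen = [next(iter(nodes))], set()
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend((adj[u] & nodes) - seen)
    return seen == nodes

def ranked_modules(scores, adj, k, alpha):
    """Return up to k modules; each must drop >= alpha * |T| nodes of every earlier module T."""
    candidates = [set(c) for r in range(1, len(scores) + 1)
                  for c in combinations(sorted(scores), r) if connected(c, adj)]
    found = []
    for _ in range(k):
        # len(t - c) counts the nodes of t with x_v = 0 in candidate c.
        feasible = [c for c in candidates
                    if all(len(t - c) >= alpha * len(t) for t in found)]
        if not feasible:
            break
        found.append(max(feasible, key=lambda c: sum(scores[v] for v in c)))
    return found

scores = {"a": 3.0, "b": -1.0, "c": 2.0, "d": -4.0, "e": 1.5}
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d", "e"}, "d": {"c"}, "e": {"c"}}
modules = ranked_modules(scores, adj, k=3, alpha=0.5)
for module in modules:
    print(sorted(module), sum(scores[v] for v in module))
```

With α = 0.5 the second module must omit at least half of the nodes of the first one, so larger values of α force more diverse solutions at the price of lower scores.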
14.3 Resulting Functional Modules
Applying the above-described searching procedure to the lymphoma network, we obtain the optimal-scoring subnetwork shown in Fig. 14.7 for the combined score using a restrictive FDR of 0.001. The resultant module comprises 46 nodes, of which 37 have a positive and 9 a negative score. The overall subnetwork score sums up to 70.2, with 102.9 from the positive and −32.8 from the negative nodes. Further, the optimal solution contains and extends interactome modules that have been identified previously and described to play major biological roles in the ABC and GCB DLBCL subtypes. The resulting optimal module connects and expands the proliferation module, which is more highly expressed in the ABC subtype (Rosenwald et al. 2002). It includes the genes MYC, CCNE1, CDC2, APEX1, DNTTIP2, and PCNA (highlighted in red) and parts of the oncogenic NFκB pathway (highlighted in red) containing the genes IRF4, TRAF2, and BCL2. Similarly, one can ask for the module up-regulated in the less malignant GCB subtype using a one-sided t-test and combine this information with the p-values resulting from the survival analysis. This scoring scheme identifies an interactome module (Fig. 14.8) associated with non-malignant by-stander cells in the lymphoma specimens. It clusters together proteins which are expressed in non-malignant fibroblasts and histiocytes, specifically the genes Fibronectin, SPARC, MMP9, CTSK, ITGA5, and ITGB5 (Rosenwald et al. 2002).

Fig. 14.7 Optimal subnetwork identified by using a score based on the p-values of a gene-wise two-sided t-test, a univariate Cox-regression hazard model and an FDR of 0.001. An overexpression of the proliferation module (MYC, CCNE1, CDC2, APEX1, DNTTIP2, and PCNA) can be observed. Proteins are denoted by their Entrez Gene names

Fig. 14.8 Optimal subnetwork identified by using a score based on the p-values of a one-sided t-test for over-expression in GCB, survival as in Fig. 14.7 and an FDR of 0.05. Genes belonging to the by-stander module (FN1, SPARC, MMP9, CTSK, ITGA5, and ITGB5) are down-regulated in the ABC subtype. Proteins are denoted by their Entrez Gene names
14.4 Comparison and Validation
To validate the performance of our approach, including the scoring function and search algorithm, we simulate artificial signal modules in microarray data and analyze these data with the presented algorithm. For this we use the induced subnetwork of the HPRD network comprising the genes present on the hgu133a Affymetrix chip. Within this network we set artificial signal modules of biologically relevant sizes of 30 and 150 nodes, respectively; the remaining genes are considered as background noise. For all considered genes we simulate microarray data as follows: We divide the set of arrays (20) into two groups of 10 repetitions each. We draw expression values for genes in the noise component according to a normal distribution, with standard deviation of 1 and with the same mean μ0 for both groups. In contrast, genes for the signal component are set according to a normal distribution with the same standard deviation of 1, but with different means (μ1, μ2) for both groups. The difference in the means μ2 − μ1 is termed signal strength in the following. In this validation study we apply a signal strength of 2. Subsequently, we analyze the simulated gene expression data analogously to the real expression analysis as detailed above. A large range of FDRs (0–0.8) is scanned in order to evaluate solutions of different sizes, reflecting the fine-tuning of the signal–noise decomposition. The modules are evaluated in terms of recall (true-positive rate) and precision (ratio of true positives to all positively classified), see Fig. 14.9. Of special interest is the upper right region of the plots that covers an area with a recall and precision higher than 90%. A large number of solutions, in the FDR range between 0.1 and 0.4, pass through it.

We contrast the performance of our approach to that of Ideker et al. (2002) implemented in the jActiveModules Cytoscape plugin. Since it provides no adjustable scoring function we follow the proposal of Ideker et al. (2002) and recursively apply their algorithm to the obtained solution several times for five independent simulations. By this, we obtain six discrete solution spaces for different module sizes, visualized as shaded polygons representing their convex hulls in Fig. 14.9. The obtained module sizes decrease from large subnetworks with a poor precision to smaller ones. Although after several recursive iterations the number of false positives decreases substantially and the resultant subnetworks are considerably smaller, they never fall within the region of high precision and recall in the upper right corner. However, these solutions display a large variance, especially for the smaller simulated signal modules. An overall similar behavior is observed for the larger module, but with a smaller variance since the signal in the network is stronger. In contrast to jActiveModules, our approach is very robust with respect to recall and precision for different module sizes.

Fig. 14.9 Evaluation of simulated data sets. For a wide range of FDRs (gray scheme), three replication modules were calculated based on simulated microarray data. These were evaluated regarding recall vs. precision and contrasted to the algorithm by Ideker et al. (2002), for which we display six convex hulls (triangles) of solutions (solutions 5 and 6 partially overlap) for the recursive applications of the algorithm on five independent simulated data sets. We evaluate two different signal component sizes (30, left plot and 150, right plot) with the same procedure. The presented exact approach captures the signal with high precision and recall (over 90% for both) for a large range of FDRs. In contrast, none of the solutions delivered by the heuristic approach falls within the upper right region of high precision and high recall. Data points have been jittered for a better visualization

A major strength of the presented methodology is its flexibility and its generalizability. Indeed, any data and analysis method resulting in p-values as a measurement of significance is, by nature, a suitable input for the scoring function and can thus be used to score network nodes for subsequent module searching. Although p-values derived from order statistics of p-values from t-tests or Cox-regressions of gene expression data can generally be modeled appropriately with a BUM model, it is always advisable to check the applicability of the model for the observed p-value distribution. A quantile–quantile plot of the observed vs. the theoretical distribution of the BUM model visualizes potential deviation from the theoretical model with high sensitivity. Depending on the signal content of the data, the flexible score makes it possible to fine-tune sensitivity and specificity, and thereby the size of the resulting modules, by choice of an appropriate FDR. When selecting an FDR, care should be taken to make sure that the expectation value of the score is negative to guarantee proper localization of the network solutions.
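The two evaluation measures used in this section follow directly from their definitions. The sketch below computes recall and precision for a reported module against a planted signal module; the gene names and module sizes are hypothetical and chosen only to mirror the 30-node simulation.

```python
def recall_precision(predicted, truth):
    """Recall = TP / |truth|, precision = TP / |predicted| for two node sets."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    recall = tp / len(truth) if truth else 0.0
    precision = tp / len(predicted) if predicted else 0.0
    return recall, precision

# Hypothetical example: a planted module of 30 genes, a reported module of 32.
truth = {f"g{i}" for i in range(30)}
predicted = {f"g{i}" for i in range(2, 34)}      # 28 true hits, 4 false positives
recall, precision = recall_precision(predicted, truth)
print(round(recall, 3), round(precision, 3))
```

A module in the upper right region of Fig. 14.9 would need both values above 0.9; the example above meets this for recall but not for precision.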
14.5 Summary and Conclusion
Systematic analyses of cellular systems are gaining more and more importance in biological and medical research. Recent advances in experimental high-throughput technologies make large amounts of data on the genomic, transcriptomic, proteomic, and phenomic levels available. Microarray-based technologies allow highly parallelized measurements of mRNA or DNA levels. In the field of proteomics, mass spectrometry techniques are becoming capable of analyzing the entire cellular proteome on a quantitative level. In addition, large-scale data on protein–protein interactions are accumulating and permit the generation of large cellular interaction networks. In contrast to reductionist approaches, which usually focus on specific isolated parts, systems biology analyzes the entire system as a whole.

Classical approaches for network analysis investigate the topological structure of the network. This implicitly assumes that all nodes and edges in the network are alike and thus can be treated as equivalent. Structural analysis of interaction networks may successfully recover complexes as highly interconnected regions in the network or can describe global network properties such as a scale-free network architecture. In biological networks, however, analysis of network structure alone can only deliver limited insight into the functioning of a cellular system. In general, different proteins in a cell, corresponding to different network nodes, take over different functions in different cellular processes, are expressed in different amounts, or may be localized in different subcompartments of a cell. Disregarding this kind of information considerably reduces the biological insight to be gained from network analysis. Integrated network analysis, in contrast, allows the superposition of biological information onto the network structure; the subsequent analysis can then be performed in this biological context.

Since more and more functional and molecular data will become available in the future, the integrated analysis of molecular networks, as exemplified in this chapter, will certainly become an important standard tool in the analysis of biological large-scale data.
14
Functional Modules in Protein–Protein Interaction Networks
Chapter 15
Mixture Model on Graphs: A Probabilistic Model for Network-Based Analysis of Proteomic Data
Josselin Noirel, Guido Sanguinetti, and Phillip C. Wright
Abstract High-throughput mass spectrometry techniques are playing an increasingly important role in systems biology, owing to their ability to assay a large number of proteins simultaneously. Despite its unquestionable success, high-throughput proteomics still faces a number of challenges: noise levels can be high and coverage of the proteome is generally quite patchy. This often makes it difficult to rationalise the results of a proteomic experiment and consequently limits the insights one may draw. To obviate these problems, we recently introduced mixture model on graphs (MMG), a probabilistic model that integrates the structure of the metabolic network into the analysis of high-throughput proteomic data. This results in both a principled statistical handling of noise and a clearer interpretation of results in terms of the underlying biology. In this chapter, we review the mathematical and biological basis of MMG and illustrate its power on a number of examples, at the same time providing a tutorial on how to use the open-access software package that implements it (R module).
Keywords Metabolic network · MMG (mixture model on graphs) · Probabilistic model · Proteomics
15.1 Introduction
Technological innovation in mass spectrometry-based proteomics has recently enabled researchers to conveniently profile the proteome¹ of practically any sequenced organism (Aebersold and Mann 2003; Ong and Mann 2005). Proteome profiling is important, as it offers biologically relevant information about phenotype at the molecular level (Gstaiger and Aebersold 2009). It is indeed reasonable to assume that many phenotypes, such as those observed in mutant strains or infected cells, emerge from molecular changes at the proteomic level: even mutations in coding and regulatory genomic sequences are most likely silent unless they are reflected by proteomic changes. Moreover, it has become increasingly clear that the relation between transcriptomic and proteomic data is not unequivocal, as might have been hoped from the Central Dogma, even in the simplest organisms (de Godoy et al. 2008; Gstaiger and Aebersold 2009; Ideker et al. 2001) (see also Lu et al. 2007). Nevertheless, proteomics, and more particularly quantitative proteomics, has not yet fully matured, technologically and methodologically, and challenges remain to be overcome. On the one hand, the analysis of proteomic data, owing to the high-throughput nature of the experimental technique, should naturally be undertaken under the ‘systems biology’ paradigm. On the other hand, specific statistical approaches are needed, owing to the idiosyncrasies of proteomic data sets (Efron et al. 2001; Ghosh 2004; Hwang and Park 2009; Newton et al. 2001; Rapaport et al. 2007; Tusher et al. 2001; Wei and Li 2007; Wei and Pan 2008). Mixture model on graphs (MMG), presented in this chapter, is a model we developed to address these challenges (Noirel et al. 2008; Sanguinetti et al. 2008). An R package has been developed to facilitate MMG’s use (Noirel and Sanguinetti 2008; R Development Core Team 2008).
The introduction is structured as follows. We shall first discuss the problems addressed by systems biology, in particular the natural use of biological networks. We shall then describe a particular class of biological networks: metabolic networks. Finally, we shall describe a particular technique used in quantitative proteomics and the specific problems that it raises. The rest of the chapter is devoted to describing the mixture model on graphs and showing the kinds of results that were obtained with it.
¹ The proteome is the set of proteins expressed in a given cell at a given time and under particular conditions (Wilkins et al. 1996).
G. Sanguinetti (✉) ChELSI Research Institute, Department of Chemical and Process Engineering, University of Sheffield, Mappin St, S1 3JD Sheffield, UK; Department of Computer Science, 211 Portobello St, University of Sheffield, S1 4DP, Sheffield, UK e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_15, © Springer Science+Business Media, LLC 2010
15.1.1 Systems Biology and Biological Networks
Systems biology seeks to understand biological responses as properties of a global system. To do so, one requires, on the one hand, comprehensive data that describe the system as a whole. The development of high-throughput -omic techniques, such as microarrays, whereby the expression levels of thousands of genes can be measured at once, has therefore contributed to the progress made by systems biology. On the other hand, knowledge of the functional organisation of the biological entities, together with system-wide models, is required to integrate the considerable amount of experimental data (databases and software). A network is a set of nodes, some of which may be connected together through edges. Sometimes the edges are ‘weighted’: a number is attached to every edge, and this number measures the importance or the capacity of
the edge. Biological networks allow the description of many biological features by putting biological entities (the nodes) within their functional context (the edges). For example, protein–protein interaction networks place a protein in the context of its functional binders, without which its function is not possible; transcriptional networks connect a pair of genes whenever one gene is regulated by the other gene’s expression product; metabolic networks describe as a graph the chain of biochemical reactions through which building blocks are assembled or energy is produced by breaking down biomolecules (Képès 2008). Biological networks are based on experimental evidence, such as two-hybrid experiments (Fields and Song 1989), or inferred, by homology or reverse engineering (d’Alché Buc 2008). Once built, they are convenient tools to help organise the biological actors, even though they very often constitute an approximation of the molecular reality: networks are static representations; an interaction network contains far less information than a matrix recording all dissociation constants; a metabolic network cannot convey information as to the enzymatic mechanisms or kinetic constants; some connections may have been wrongly predicted rather than experimentally confirmed. Nevertheless, biological networks are useful on many occasions for the following reasons:
• Graph theory provides researchers with algorithms and tools to analyse networks. This has led, for example, to the study of topological patterns in biological networks, which may help to understand biological evolution (Ciriello and Guerra 2008; Jeong et al. 2000).
• As the simplest description of biological systems, biological networks allow practicable models to be devised, where more realistic descriptions would remain unworkable because they introduce too many parameters into the model (Palsson 2006).
• Biologists have traditionally presented their results as networks or subnetworks. Biological networks can therefore be thought of as an extension of the traditional approach to understanding biological mechanisms.
A key idea is that the edges could help the interpretation of -omic data sets since, in biological networks, they represent a functional dependence. For example, the expression measured for a gene is likely to be related to that measured for another gene if both genes are neighbours in a transcriptional network. This idea was exploited by Wei and Li (2007), who developed a model that favoured the biological interpretations of a data set in which regulatory states (regulated or nonregulated) were similar for genes that were neighbours in a transcriptional network. Similarly, one expects similar behaviours for neighbours in protein–protein interaction networks, since protein–protein interactions will require the various interacting partners to be at commensurable concentrations. This can be seen by plotting the expression level of a node against the average expression level of the node’s neighbours; applied to the yeast data set published by Ideker et al. (2001), we measure, for instance, a correlation factor of 0.21. Though weak, this correlation is significant, with a p-value less than 10⁻⁸. The correlation is moreover obscured by experimental
noise: if expression were measured accurately, a better correlation could, in principle, be achieved. Rapaport et al. (2007) suggested that, conversely, the assumption that such a correlation should exist along branches of a metabolic network may be used to reduce experimental noise. Their approach consists in decomposing the measurements along the metabolic network’s branches into low- and high-frequency components. Only the low-frequency components, assumed to be devoid of noise, are kept for further analysis.
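The comparison of a node's expression level with the average over its neighbours, mentioned above, can be sketched in a few lines. This is a minimal illustration on a toy four-node path, assuming a network stored as an adjacency dictionary; the function names and all values are hypothetical and are not taken from the Ideker et al. (2001) data set.

```python
import math

def neighbourhood_average(values, neighbours):
    """Map each measured node to the mean value of its measured neighbours."""
    averages = {}
    for node, adjacent in neighbours.items():
        measured = [values[n] for n in adjacent if n in values]
        if node in values and measured:  # skip nodes without usable data
            averages[node] = sum(measured) / len(measured)
    return averages

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy log ratios on a path a - b - c - d (illustrative values only).
values = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

averages = neighbourhood_average(values, neighbours)
nodes = sorted(averages)
r = pearson([values[n] for n in nodes], [averages[n] for n in nodes])
print(averages, round(r, 3))
```

Nodes without a measurement, or without any measured neighbour, are simply left out here; how such nodes should be treated on real data is a separate modelling decision.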
15.1.2 Functional Correlation in Metabolic Networks
There are different metabolic network representations depending on the area of research. Although others exist (see for instance the bipartite network used by Croes et al. 2006), the most common representations are either ‘metabolite centred’ or ‘enzyme centred’; both are described below. Although representing the metabolism as a network seems natural, it must be noted that such representations are not always fully satisfactory. This is because most reactions catalysed by enzymes involve more than one substrate or product (Schuster et al. 2000; Fell 2008).
15.1.2.1 Metabolite-Centred Network
Certain metabolites are called ‘currency metabolites’, for they are pervasive in metabolism and are of little relevance to the biological understanding of a biochemical reaction. For instance, phosphofructokinase is understood as the enzyme that catalyses the production of fructose-1,6-bisphosphate from fructose-6-phosphate rather than the one that catalyses the production of ADP from ATP; in this sense, ADP and ATP are ‘currency metabolites’, whereas fructose-1,6-bisphosphate and fructose-6-phosphate are ‘commodity metabolites’ (Huss and Holme 2007). Approaching the problem of distinguishing between currency and commodity metabolites requires one to represent the metabolism as a network of metabolites, where a pair of metabolites is connected whenever one metabolite can be transformed into the other through enzyme catalysis (see Fig. 15.1a and 15.1b).
15.1.2.2 Enzyme-Centred Network
An alternative view of the metabolism, which emphasises the role of enzymes and their regulation, is to regard it as a network of enzymes. In this chapter, it makes more sense to represent the metabolism as a network of enzymes that are connected when they belong to the same metabolic pathway. In practice, this can be done by connecting every pair of enzymes in which one enzyme produces a metabolite that can be used as a reactant by the other (see Fig. 15.1a and 15.1c). For instance, in such a network, an edge would exist between phosphoglucose isomerase and phosphofructokinase, since the former produces fructose-6-phosphate that can then be further processed by the latter.
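The construction of the enzyme-centred network can be sketched as follows. This is a minimal illustration restricted to the two reactions of Fig. 15.1; the dictionary layout, the function name `enzyme_network` and the optional exclusion of currency metabolites are assumptions made for the example, not part of the MMG package.

```python
from itertools import permutations

# The two reactions of Fig. 15.1 (E1: phosphoglucose isomerase,
# E2: phosphofructokinase), written as substrate and product sets.
reactions = {
    "E1": {"substrates": {"G6P"}, "products": {"F6P"}},
    "E2": {"substrates": {"F6P", "ATP"}, "products": {"F1,6bP", "ADP"}},
}
CURRENCY = frozenset({"ATP", "ADP"})  # assumed list of currency metabolites

def enzyme_network(reactions, currency=frozenset()):
    """Connect two enzymes when a product of one is a substrate of the other."""
    edges = set()
    for first, second in permutations(reactions, 2):
        shared = reactions[first]["products"] & reactions[second]["substrates"]
        if shared - currency:  # ignore links made only of currency metabolites
            edges.add(frozenset((first, second)))  # undirected edge
    return edges

print(enzyme_network(reactions, CURRENCY))
```

Ignoring currency metabolites when linking enzymes is a design choice of this sketch: without it, every pair of ATP-dependent enzymes could become connected, which would blur the pathway structure that the enzyme-centred view is meant to capture.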
Fig. 15.1 Network representations of metabolism. (a) Two enzymatic reactions are considered: that catalysed by phosphoglucose isomerase (E1), converting glucose-6-phosphate (G6P) into fructose-6-phosphate (F6P), and that catalysed by phosphofructokinase (E2), converting fructose-6-phosphate (F6P) into fructose-1,6-bisphosphate (F1,6bP). (b) A metabolite-centred view shows only the metabolites as nodes, while the edges connect metabolites engaged in the same enzymatic reaction. Because ATP and ADP are currency metabolites, they are likely to acquire many connections. (c) An enzyme-centred view shows only the enzymes as nodes, while the edges are established between enzymes sharing a reactant/product (here F6P)
15.1.2.3 Structure of Metabolic Networks
A number of studies have sought to elucidate the topological features of metabolic networks. Metabolic networks belong to the category of ‘scale-free’ networks. They are characterised by two important features. First, in such a network, the number of neighbours possessed by a node follows a power law. This distribution appears linear in a log–log plot (Barabási and Albert 1999). Some nodes have far more neighbours than expected if the edges were distributed equiprobably; these nodes are called ‘hubs’. Second, the average path length between any two nodes is small compared to that found in Erdős and Rényi’s (1959) random graphs (Jeong et al. 2000; Wagner and Fell 2001). This is the so-called small world property (see Fig. 15.2). It was put forth that the scale-free structure had been favoured by evolution for its rapid dissipation properties. The biological and biochemical relevance of the scale-free structure was the object of some debate (Arita 2004; Rahman et al. 2005), the outline of which can be read in Fell (2008). Regardless of whether metabolic networks truly are scale free, there are many different routes (or paths) connecting two arbitrary metabolites or enzymes, owing to the many hubs, or currency metabolites, present in the network. Thus, the question of finding a biologically meaningful series of reactions that catalyse the transformation of one metabolite (source) into another (target) cannot be satisfactorily addressed by simply computing the shortest path between the two metabolites of interest. However, a heuristic solution was presented by Croes et al. (2006) to circumvent this limitation. Essentially, they assigned a weight to each metabolite; the weight of a metabolite was set to the number of times this metabolite appeared in the metabolic network. 
Instead of looking for the shortest series of reactions converting the source metabolite into the target metabolite, they looked for the lightest series, i.e. the one having minimal weight (calculated as the sum of the weights of the metabolites that it traversed). Lightest paths were found to be biologically meaningful, since the passage through hubs, which are heavy, was prevented.
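The lightest-path idea can be sketched with a standard Dijkstra search in which the cost of a path is the sum of the weights of the nodes it visits. This is a minimal illustration, not the implementation of Croes et al. (2006): the weight of a metabolite is approximated here by its number of neighbours, and the toy graph, the node names and the function name `lightest_path` are hypothetical.

```python
import heapq

def lightest_path(graph, source, target):
    """Return (path, weight) minimising the summed node weights along the path."""
    weight = {node: len(adjacent) for node, adjacent in graph.items()}
    best = {source: weight[source]}  # lightest known cost to reach each node
    previous = {}
    queue = [(weight[source], source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best[node]:
            continue  # stale queue entry
        for adjacent in graph[node]:
            new_cost = cost + weight[adjacent]
            if new_cost < best.get(adjacent, float("inf")):
                best[adjacent] = new_cost
                previous[adjacent] = node
                heapq.heappush(queue, (new_cost, adjacent))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]

# Toy network: A and D are two steps apart via a hub, three steps via B and C.
graph = {
    "A": ["HUB", "B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "HUB"],
    "HUB": ["A", "D", "X1", "X2", "X3", "X4"],
    "X1": ["HUB"], "X2": ["HUB"], "X3": ["HUB"], "X4": ["HUB"],
}
print(lightest_path(graph, "A", "D"))
```

In this toy graph the route through the hub has fewer steps but a larger total weight (2 + 6 + 2) than the route through B and C (2 + 2 + 2 + 2), so the search avoids the hub, as described in the text.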
Fig. 15.2 Network topologies. (a) An important network topology found in many applications is the scale-free topology. It is called scale free because of the power law that describes the distribution of the number of neighbours. It is characterised by the existence of hubs: while most nodes have few neighbours, a few accumulate a very large number of connections. (b) The more intuitive model of random network invented by Erdős and Rényi in 1959, where there are no hubs (Erdős and Rényi 1959)
15.1.2.4 Regulatory Correlation
Experimental Evidence. Before using the assumption that enzymes that are neighbours in the network are more likely to be regulated likewise, and use it for predictive purposes, one can simply evaluate the correlation of the signal on existing data sets. In an attempt to understand the galactose metabolism in yeast, Ideker et al. (2001) compared the transcriptome of yeast in the presence and the absence of galactose. Their microarray data set is available as supplementary information. For each enzyme $i$ whose mRNA log ratio, $m_i$, was measured, we computed the ‘neighbourhood average’:
\[ \mu_i = \langle m_j \rangle = \frac{1}{n_i} \sum_{j} m_j , \tag{15.1} \]
where the average is computed over the $n_i$ neighbours $j$ of enzyme $i$. The values $m_i$ and $\mu_i$ were plotted against each other (see Fig. 15.3a), demonstrating a correlation factor of 0.23 (p-value
Save As. If Cytoscape is installed locally, these session files can be accessed anytime, allowing one to easily return to a given analysis. By examining data overlaid onto individual pathways of interest using Cerebral, trends in the data that were not obvious from the table-format results alone can be readily observed. In this example, for instance, we see that while the expression of the interferon-gamma receptors increases slightly from day 3 to day 4, expression of the interferon-gamma ligand returns to baseline levels on day 4, as do levels of the STAT1 transcription factor, indicating a possible abatement of the interferon-gamma response after an early peak post-infection.
22.4.7 Generating and Exploring Molecular Interaction Networks Using InnateDB
InnateDB pathway and Gene Ontology analyses can be very powerful in determining which annotated pathways and biological processes are significantly associated with a data set of genes. Such analyses, however, rely on the association of genes with known biological pathways or Gene Ontology terms. Annotation of pathways and Gene Ontology terms is far from complete, and pathways are often annotated as relatively simple linear cascades. Network analysis can move the investigation from this simple view of the signaling response to a more comprehensive analysis of the molecular interactions between genes of interest and their encoded proteins and RNAs, potentially allowing one to uncover as yet unknown signaling cascades or pathways, functionally relevant sub-networks and the central molecules, or hubs, of these networks. InnateDB is one of the most comprehensive databases of all human and mouse experimentally supported molecular interactions (∼130,000), and it also specifically includes annotation on more than 12,900 manually curated human and mouse innate immunity-relevant interactions, many of which are not present in any other database. InnateDB allows one to upload a gene list of interest along with associated gene expression data and returns these data integrated in a molecular interaction network context for visualization and further interrogation and analysis.
1. Return to the Data Analysis page and this time, select Return a list of interactions. This will bring up the interaction filtering dialog box. Three options are available. Do not filter the results will display all of the interactions in which any of the uploaded genes participate. By investigating networks, such as this,
22
Systems-Level Analyses of the Mammalian Innate Immune Response
that include interactions between differentially expressed genes and their nondifferentially expressed interacting partners, one has the potential to identify key regulators of gene expression, even though these regulators themselves may not be differentially expressed but may be regulated at the post-transcriptional level. For a gene list with hundreds of entries, however, this network can consist of several thousand interactions (Fig. 22.10). For this reason, with large gene lists it is often preferable to first create a more focused network in which only interactions between the genes in the uploaded gene list will be shown (for example, differentially expressed genes only). This filtering – Only show interactions between uploaded molecules – is demonstrated below. The third filtering option, Filter for interactions in pathway, provides an even more focused view, allowing one to display only those interactions that comprise a given pathway.
1. Select Only show interactions between uploaded molecules and execute the search for interactions between uploaded molecules. As in our earlier analyses,
Fig. 22.10 Network of interactions between differentially expressed genes (and their encoded products) at day 3 and/or day 4 and all known interacting partners in InnateDB. The network was displayed in Cytoscape using the Cerebral plugin launched from InnateDB. Nodes encoded by up-regulated genes are shown in red, down-regulated in green. Analysis of this network enables the identification of central regulators (hubs/bottlenecks that are not necessarily regulated at the transcriptional level)
D.J. Lynn et al.
the search returns its results in a table format which can be edited, sorted and/or downloaded.
2. To visualize these retrieved interactions, click on the Cerebral button at the top of the page. A Cerebral view is launched in Cytoscape showing all of the interactions (Fig. 22.11).
Fig. 22.11 A network of molecular interactions only between genes (and their encoded products) which were differentially expressed at day 3 and/or day 4. Interactions involving molecules that were not differentially expressed are not shown. The network was displayed in Cytoscape using the Cerebral plugin. Up-regulated genes at day 3 are shown in red and down-regulated genes in green. Un-shaded nodes were not differentially regulated at day 3. This network is useful to investigate molecular interactions between molecules encoded by differentially expressed genes
3. In this interaction-based analysis, it is often worthwhile to lay the data out in different formats for an alternative perspective, particularly when several hundred interactions are displayed. In the Cytoscape Control Panel, select the Network tab. The name of the network – a string of numbers automatically generated by InnateDB – will appear highlighted in green. Right-click this name and select Destroy View, then right-click the name again and select Create View. This will redraw the network in Cytoscape’s default grid format.
Fig. 22.12 An alternative layout of the network in Fig. 22.11. In this view, we have used one of Cytoscape’s native layouts to visualize the relationships between the genes in our list, as the network structure is more apparent with this layout than with the Cerebral layout. The larger network shows all of the interactions between our genes of interest colored according to their expression level on day 4, while the inset view shows a network of transcriptional regulators extracted from the larger network. This type of analysis, which does not rely on pre-existing information such as GO or pathway annotation, can reveal novel processes, functions, or complexes active in a data set
4. From the Cytoscape Layout menu, select yFiles > Organic or any of the other available layout options. This will redraw the network in an alternative manner (Fig. 22.12). At this point, you may wish to analyze the network using other tools and approaches that are presented in this book. To do this one may export the network and its attributes in a number of formats. From Cytoscape select File > Export. By examining the relationships between the nodes of the network, new insight into particular processes or protein complexes can be gained. As an example, the inset of Fig. 22.12 shows a network of transcription regulators extracted from the larger network in Fig. 22.12 and colored according to their expression at day 4. A number of transcription factors not identified through GO or pathway analysis are observed to be active, and the user may wish to follow up on this analysis by examining whether genes regulated by these transcription factors are enriched at subsequent time points in the experiment.
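The difference between the unfiltered network and the network restricted to uploaded molecules, and the search for highly connected nodes, can also be reproduced outside the web interface on an exported interaction table. The sketch below is a minimal illustration on a plain list of node pairs; it does not use any InnateDB or Cytoscape programming interface, and the function names and identifiers (`g1`, `x1`, and so on) are hypothetical placeholders rather than real genes.

```python
from collections import Counter

def interactions_between(interactions, uploaded):
    """Keep interactions whose two partners are both in the uploaded list."""
    return [(a, b) for a, b in interactions if a in uploaded and b in uploaded]

def interactions_involving(interactions, uploaded):
    """Keep interactions with at least one partner in the uploaded list."""
    return [(a, b) for a, b in interactions if a in uploaded or b in uploaded]

def degrees(interactions):
    """Count the interactions in which each node takes part."""
    counts = Counter()
    for a, b in interactions:
        counts[a] += 1
        counts[b] += 1
    return counts

# Toy interaction table; g* stand for uploaded genes, x* for other partners.
interactions = [("g1", "g2"), ("g1", "x1"), ("g2", "x1"), ("g3", "x1"), ("x1", "x2")]
uploaded = {"g1", "g2", "g3"}

focused = interactions_between(interactions, uploaded)
broad = interactions_involving(interactions, uploaded)
print(focused)
print(degrees(broad).most_common(1))
```

In this toy table the most connected node of the broad network is a partner that was not in the uploaded list, which mirrors the point made earlier that central regulators need not be differentially expressed themselves.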
22.5 Conclusions and Future Directions
Systems biology approaches to investigating the innate immune response are beginning to provide novel insight and new understanding of the early host response to infectious disease. As we have shown, InnateDB greatly facilitates the interpretation of large-scale -omics data by allowing users to carry out a range of analyses on their data with just a few clicks of the mouse (Lynn et al. 2008). Uploading of data is simple, requiring only a spreadsheet containing the genes of interest, and from the Data Analysis page users can access tools ranging from network construction and visualization to powerful over-representation analyses. Customizable display options and multiple download formats enable users to retrieve and store their data in the format of their choice, while visualizations through the Cytoscape/Cerebral tools (Shannon et al. 2003; Barsky et al. 2007; Barsky et al. 2008) allow for more intuitive approaches to analyzing data. Thus, in only a few steps a user can begin to interpret a gene list and generate specific testable biological hypotheses for follow-up. Many challenges, however, remain. Despite the large number of interactions currently annotated in InnateDB and other databases, it is estimated that only approximately 15% of the human interactome is currently known (Bader et al. 2008). In addition, almost all of these interactions are protein–protein interactions, with only a small fraction of potential transcription factor-DNA interactions currently experimentally validated and even fewer RNA interactions currently known. Fortunately, ChIP-chip methods are now enabling large-scale identification of transcription factor-DNA interactions (Ramsey et al. 2008) and new array platforms are allowing genome-wide profiling of microRNA expression. 
Currently, however, the incomplete nature of the networks used for systems biology-oriented analyses undoubtedly means that important connections between signaling proteins and pathways are being missed. InnateDB curation efforts, whereby we have curated nearly 13,000 interactions of relevance to innate immunity, have assisted in providing a more complete picture of the innate immunity interactome based on data available in the biomedical literature. Large-scale interactome mapping efforts are essential, however, to ensure that novel molecular interactions continue to be described to fill in the gaps in the interactome. Another important issue moving forward is that although we are getting closer to determining the entire human interactome, the interactome is not a static entity. The interactions that occur at any given time depend on the genes being expressed, post-translational modifications, the cell type or tissue type, exogenous and endogenous stimuli and the particular conditions being investigated. The interactome is thus a dynamic entity changing over time. Fortunately, gene and protein array technologies can assist us in determining which particular networks of interactions are likely most relevant to a given response, by providing quantitative data that can be analyzed and interpreted in the framework of the interactome. More detailed investigation and annotation of the context of particular interactions, such as the cell type in which they occur, will greatly help in moving away from a static view of the interactome. Better appreciation and understanding of host–pathogen interactions also need to be accounted for in systems-based approaches. The host responses to disease
and the signaling networks involved can be actively manipulated by pathogens. For example, live (but not dead) Mycobacterium tuberculosis interferes with signaling in macrophages (Ehrt et al. 2001), while several viruses produce microRNAs to specifically modulate the host response by suppressing components of the innate immune response to inhibit apoptosis and promote virus latency (Pedersen and David 2008). Similarly, host factors influence the pathogens; interferon-gamma expressed by the host, for example, is sensed by Pseudomonas aeruginosa and causes expression of virulence factors (Wu et al. 2005). Several new databases that specialize in host–pathogen interactions provide valuable supplemental information to InnateDB, including VirusMINT (mint.bio.uniroma2.it/virusmint/) (Chatr-aryamontri et al. 2009) and the Pathogen Interaction Gateway (PIG, molvis.vbi.vt.edu/pig) (Driscoll et al. 2009). Despite these and many other challenges, systems biology approaches are already providing significant new insight into innate immunity (for a review, see Gardy et al. 2009) and promise a far deeper understanding of our first line of defense against invading pathogens than previously possible.
Acknowledgments The authors’ systems biology research has been funded by Genome Canada and Genome BC through the Pathogenomics of Innate Immunity (PI2) project and by the Foundation for the National Institutes of Health and the Canadian Institutes of Health Research under the Grand Challenges in Global Health Research Initiative (Grand Challenges ID: 419). DJL and JLG hold Postdoctoral Trainee Awards from the Michael Smith Foundation for Health Research (MSFHR) and JLG also holds a Sanofi Pasteur CIHR fellowship. FSLB is a Canadian Institutes of Health Research (CIHR) New Investigator and a MSFHR Senior Scholar. REWH holds a Canada Research Chair (CRC).
References Abbas AR, Baldwin D, Ma Y et al (2005) Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data. Genes Immun 6:319–331 Akira S (2006) TLR signaling. Curr Top Microbiol Immunol 311:1–16 Alibes A, Yankilevich P, Canada A et al (2007) IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC Bioinformatics 8:9 Andersen J, VanScoy S, Cheng TF et al (2008) IRF-3-dependent and augmented target genes during viral infection. Genes Immun 9:168–175 Angus DC, Linde-Zwirble WT, Lidicker J et al (2001) Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Crit Care Med 29: 1303–1310 Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29 Bader S, Kuhner S and Gavin AC (2008) Interaction networks for systems biology. FEBS Lett 582:1220–1224 Barsky A, Gardy JL, Hancock REW et al (2007) Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. Bioinformatics 23:1040–1042 Barsky A, Munzner T, Gardy J et al (2008) Cerebral: visualizing multiple experimental conditions on a graph with biological context. IEEE Trans Vis Comput Graph 14:1253–1260 Benjamini Y and Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. Series B 57:289–300
D.J. Lynn et al.
Bertrand MJ, Doiron K, Labbe K et al (2009) Cellular inhibitors of apoptosis cIAP1 and cIAP2 are required for innate immunity signaling by the pattern recognition receptors NOD1 and NOD2. Immunity 30:789–801 Bhoj VG and Chen ZJ (2009) Ubiquitylation in innate and adaptive immunity. Nature 458:430–437 Bi Y, Liu G and Yang R (2009) MicroRNAs: novel regulators during the immune response. J Cell Physiol 218:467–472 Brownstein BH, Logvinenko T, Lederer JA et al (2006) Commonality and differences in leukocyte gene expression patterns among three models of inflammation and injury. Physiol Genomics 24:298–309 Chatr-aryamontri A, Ceol A, Peluso D et al (2009) VirusMINT: a viral protein interaction database. Nucleic Acids Res 37:D669–673 Chen XM, Splinter PL, O’Hara SP et al (2007) A cellular micro-RNA, let-7i, regulates Toll-like receptor 4 expression and contributes to cholangiocyte immune responses against Cryptosporidium parvum infection. J Biol Chem 282:28929–28938 Chuang T and Ulevitch RJ (2001) Identification of hTLR10: a novel human Toll-like receptor preferentially expressed in immune cells. Biochim Biophys Acta 1518:157–161 Chuang TH and Ulevitch RJ (2000) Cloning and characterization of a sub-family of human toll-like receptors: hTLR7, hTLR8 and hTLR9. Eur Cytokine Netw 11:372–378 Cohen J and Enserink M (2002) Public health. Rough-and-tumble behind Bush’s smallpox policy. Science 298:2312–2316 Collas P and Dahl JA (2008) Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Front Biosci 13:929–943 Dimitriou ID, Clemenza L, Scotter AJ et al (2008) Putting out the fire: coordinated suppression of the innate and adaptive immune systems by SOCS1 and SOCS3 proteins. Immunol Rev 224:265–283 Driscoll T, Dyer MD, Murali TM et al (2009) PIG—the pathogen interaction gateway. 
Nucleic Acids Res 37:D647–650 Ehrt S, Schnappinger D, Bekiranov S et al (2001) Reprogramming of the macrophage transcriptome in response to interferon-gamma and Mycobacterium tuberculosis: signaling roles of nitric oxide synthase-2 and phagocyte oxidase. J Exp Med 194:1123–1140 Gardy JL, Lynn DJ, Brinkman FS et al (2009) Enabling a systems biology approach to immunology: focus on innate immunity. Trends Immunol 30:249–262 Gilchrist M, Thorsson V, Li B et al (2006) Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature 441:173–178 Heng TS and Painter MW (2008) The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol 9:1091–1094 Hermjakob H, Montecchi-Palazzi L, Bader G et al (2004) The HUPO PSI’s molecular interaction format—a community standard for the representation of protein interaction data. Nat Biotechnol 22:177–183 Hijikata A, Kitamura H, Kimura Y et al (2007) Construction of an open-access database that integrates cross-reference information from the transcriptome and proteome of immune cells. Bioinformatics 23:2934–2941 Honda K and Taniguchi T (2006) IRFs: master regulators of signalling by Toll-like receptors and cytosolic pattern-recognition receptors. Nat Rev Immunol 6:644–658 Hsueh RC, Natarajan M, Fraser I et al (2009) Deciphering signaling outcomes from a system of complex networks. Sci Signal 2:ra22 Hubbard TJ, Aken BL, Ayling S et al (2009) Ensembl 2009. Nucleic Acids Res 37:D690–697 Inohara N and Nunez G (2001) The NOD: a signaling module that regulates apoptosis and host defense against pathogens. Oncogene 20:6473–6481 Joshi-Tope G, Gillespie M, Vastrik I et al (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33:D428–432 Kanehisa M, Araki M, Goto S et al (2007) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36:D480–484
Kanneganti TD, Lamkanfi M and Nunez G (2007) Intracellular NOD-like receptors in host defense and disease. Immunity 27:549–559 Kolchanov NA, Merkulova TI, Ignatieva EV et al (2007) Combined experimental and computational approaches to study the regulatory elements in eukaryotic genes. Brief Bioinform 8:266–274 Korb M, Rust AG, Thorsson V et al (2008) The Innate Immune Database (IIDB). BMC Immunol 9:7 Langlais D, Couture C, Balsalobre A et al (2008) Regulatory network analyses reveal genome-wide potentiation of LIF signaling by glucocorticoids and define an innate cell defense response. PLoS Genet 4:e1000224 Lee MS and Kim YJ (2007) Signaling pathways downstream of pattern-recognition receptors and their cross talk. Annu Rev Biochem 76:447–480 Litvak V, Ramsey SA, Rust AG et al (2009) Function of C/EBPdelta in a regulatory circuit that discriminates between transient and persistent TLR4-induced signals. Nat Immunol 10: 437–443 Lynn DJ, Winsor GL, Chan C et al (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4:218 MacLeod H and Wetzler LM (2007) T cell activation by TLRs: a role for TLRs in the adaptive immune response. Sci STKE 2007:pe48 Maglott D, Ostell J, Pruitt KD et al (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33:D54–58 Manicassamy S and Pulendran B (2009) Modulation of adaptive immunity with Toll-like receptors. Semin Immunol. doi: 10.1016/j.smim.2009.05.005 Medzhitov R, Janeway CA Jr (1997) Innate immunity: the virtues of a nonclonal system of recognition. Cell 91:295–298 Medzhitov R, Preston-Hurlburt P and Janeway CA Jr (1997) A human homologue of the Drosophila Toll protein signals activation of adaptive immunity. Nature 388:394–397 Mookherjee N, Hamill P, Gardy J et al (2009) Systems biology evaluation of immune responses induced by human host defence peptide LL-37 in mononuclear cells. 
Mol Biosyst 5:483–496 Nilsson R, Bajic VB, Suzuki H et al (2006) Transcriptional network dynamics in macrophage activation. Genomics 88:133–142 Oda K and Kitano H (2006) A comprehensive map of the toll-like receptor signaling network. Mol Syst Biol 2:2006 0015 Okabe Y, Sano T and Nagata S (2009) Regulation of the innate immune response by threoninephosphatase of Eyes absent. Nature 460:520–524 Orchard S, Salwinski L, Kerrien S et al (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25:894–898 Pedersen I and David M (2008) MicroRNAs in the immune response. Cytokine 43:391–394 Pruitt KD, Tatusova T, Klimke W et al (2009) NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res 37:D32–36 Ramsey SA, Klemm SL, Zak DE et al (2008) Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics. PLoS Comput Biol 4:e1000021 Robertson G, Hirst M, Bainbridge M et al (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4:651–657 Rock FL, Hardiman G, Timans JC et al (1998) A family of human receptors structurally related to Drosophila Toll. Proc Natl Acad Sci USA 95:588–593 Rubins KH, Hensley LE, Jahrling PB et al (2004) The host response to smallpox: analysis of the gene expression program in peripheral blood cells in a nonhuman primate model. Proc Natl Acad Sci USA 101:15190–15195 Seet BT, Johnston JB, Brunetti CR et al (2003) Poxviruses and immune evasion. Annu Rev Immunol 21:377–423
Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504 Smith B, Ashburner M, Rosse C et al (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25:1251–1255 Taganov KD, Boldin MP, Chang KJ et al (2006) NF-kappaB-dependent induction of microRNA miR-146, an inhibitor targeted to signaling proteins of innate immune responses. Proc Natl Acad Sci USA 103:12481–12486 Takeuchi O, Kawai T, Sanjo H et al (1999) TLR6: A novel member of an expanding toll-like receptor family. Gene 231:59–65 Tegner J, Nilsson R, Bajic VB et al (2006) Systems biology of innate immunity. Cell Immunol 244:105–109 The UniProt Consortium (2008) The universal protein resource (UniProt). Nucleic Acids Res 36:D190–195 Thompson AJ and Locarnini SA (2007) Toll-like receptors, RIG-I-like RNA helicases and the antiviral innate immune response. Immunol Cell Biol 85:435–445 von Bernuth H, Picard C, Jin Z et al (2008) Pyogenic bacterial infections in humans with MyD88 deficiency. Science 321:691–696 Wall EA, Zavzavadjian JR, Chang MS et al (2009) Suppression of LPS-induced TNF-alpha production in macrophages by cAMP is mediated by PKA-AKAP95-p105. Sci Signal 2:ra28 Wu L, Estrada O, Zaborina O et al (2005) Recognition of host immune activation by Pseudomonas aeruginosa. Science 309:774–777 Yoneyama M, Kikuchi M, Natsukawa T et al (2004) The RNA helicase RIG-I has an essential function in double-stranded RNA-induced innate antiviral responses. Nat Immunol 5:730–737
Chapter 23
Molecular Basis of Protective Anti-Inflammatory Signalling by Cyclic AMP in the Vascular Endothelium Claire Rutherford and Timothy M. Palmer
Abstract The prototypical second messenger cyclic AMP (cAMP) was originally thought to mediate its effects through activation of cAMP-dependent protein kinase (PKA). However, it is now clear that cells possess multiple alternative sensors of cAMP accumulation, of which the “exchange protein directly activated by cAMP” (Epac) proteins have been studied most intensively. This article will describe recent insights into the molecular mechanisms by which Epac proteins mediate key protective effects of cAMP on two specific aspects of vascular endothelial cell function, namely barrier function and suppression of inflammatory signalling. It will also examine how integrative and unbiased global approaches are currently being deployed to answer several wider questions that have arisen from the identification of Epac as a trigger of gene transcription events and the E3 ubiquitin ligase component “suppressor of cytokine signalling-3” (SOCS-3) as a key gene target regulated by this pathway. Keywords Cyclic AMP · Cytokine · Signalling · Inflammation · Barrier function
23.1 Introduction 23.1.1 Dysfunctional Vascular Endothelium and Disease The endothelium comprises a one-cell-thick lining over both blood vessels (the so-called vascular endothelium) and lymphatic vessels (the lymphatic endothelium). The surface area formed by the vascular endothelium is approximately 350 m², thus providing an enormous interface for dynamic two-way communication at the
T.M. Palmer (B) Biochemistry and Cell Biology, Faculty of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ, Scotland, UK e-mail:
[email protected] S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_23, © Springer Science+Business Media, LLC 2010
blood/endothelium interface. While the endothelium was initially viewed as a relatively passive barrier between the circulation and underlying tissue, it is now clear that vascular endothelia have common central roles in regulating vascular tone, maintaining tissue homeostasis, limiting coagulation and thrombosis and also limiting local immune and inflammatory responses. However, vascular endothelia from distinct tissues demonstrate a degree of functional heterogeneity commensurate with distinct roles. For example, while most endothelia form a selectively permeable barrier to the free exchange of blood components with the underlying interstitium, permeability at the blood–brain barrier is minimal due to the presence of a large number of intercellular tight junctions (Davson et al. 1993). Conversely, some vascular beds have a lower tight junction density and may even have gaps between cells, as is found in environments where sampling or filtration takes place, such as liver sinusoids (Davson et al. 1993). As well as structural differences, different vascular beds display distinct biochemical and immunological properties. For example, arterial endothelial cells (ECs) typically generate higher levels of autacoids, such as nitric oxide (NO) and prostacyclin (PGI2) (Fisslthaler et al. 1999), while venous ECs have more autacoid receptors, such as those for histamine, and more readily induce the expression of adhesion molecules, which are required for leukocyte adhesion (Cotran et al. 1986; Simionescu et al. 1982). Given its many pivotal roles, dysfunction of the vascular endothelium has been strongly implicated in the pathophysiology of diseases in which excessive platelet activation or enhanced leukocyte infiltration into underlying tissue are key steps. These include atherosclerosis (Vanhoutte 2009), rheumatoid arthritis (Middleton et al. 2004), systemic lupus erythematosus (D’Cruz 1998), anti-phospholipid syndrome (Rand 2003), acute respiratory distress syndrome (Zimmerman et al. 
1999) and graft versus host disease (Tichelli and Gratwohl 2008). Conversion of the endothelium from a predominantly anti-inflammatory/anticoagulant state to a pro-inflammatory/fibrotic phenotype is a complex process triggered by multiple chemical stimuli (e.g. cytokines, pathogen-derived molecules and bioactive lipids) as well as mechanical stress and hypoxia (Libby 2002). Key changes include the mobilisation of gene transcription programmes that control the induction of chemokines, such as monocyte chemoattractant protein-1 (MCP-1)/CCL2 and IL8/CXCL8, pro-inflammatory cytokines, such as tumour necrosis factor (TNF) α, interleukin (IL)-1β and IL-6, and adhesion molecules like E-selectin, VCAM-1 and ICAM-1. Each of these gene products has specific roles in initiating and propagating the inflammatory response. Thus, chemokines direct the migration and recruitment of specific immune cell subsets to the activated endothelium, where adhesion molecules can then immobilise them onto the endothelial surface. Once immobilised, leukocytes are then activated by the induced cytokines prior to their migration down chemokine gradients into underlying tissue. These cytokines also act in an autocrine and paracrine manner to further activate neighbouring endothelial and other cell types, thus stimulating further leukocyte recruitment and amplifying the response (Gimbrone 1995; Libby 2002).
23.1.2 Cyclic AMP Signalling 23.1.2.1 Basic Architecture of cAMP Signalling Modules The prototypical intracellular second messenger cyclic AMP (cAMP) has long been known to play an important role in endothelial cell (EC) biology. Synthesis of cAMP is triggered upon agonist occupation and activation of specific plasma membrane-localised G-protein-coupled receptors capable of productive interaction with the guanine nucleotide-binding regulatory protein (G-protein) Gs. Under non-stimulated conditions, Gs exists in its inactive state as a heterotrimeric complex of a GDP-bound α-subunit bound to β- and γ-subunits. Interaction with an agonist-occupied receptor triggers conformational changes that promote the release of GDP from the α-subunit, thus allowing GTP to bind. This event precipitates the dissociation of the complex into a βγ dimer and GTP-bound Gsα, which can directly bind and activate adenylyl cyclases to stimulate the synthesis of cAMP from ATP (Beavo and Brunton 2002). Changes in cAMP levels are translated into intracellular effects by multiple cAMP-binding effector proteins; these include cyclic nucleotide-gated ion channels, cAMP-dependent protein kinase (PKA) and exchange proteins directly activated by cAMP (Epacs) (Bos 2006; Schmidt et al. 2007). The signal is turned off by hydrolysis of cAMP to inactive 5′-AMP in a reaction catalysed by a large superfamily of cyclic nucleotide phosphodiesterases (PDEs) (Lynch et al. 2006). An increasingly important aspect of cAMP’s effects is the generation of intracellular cAMP gradients arising from its synthesis by adenylyl cyclase isoforms at the plasma membrane and its degradation by PDEs localised to specific intracellular compartments (Baillie 2009; Lynch et al. 2006). 
The ability of distinct regions within the cell to sample these gradients is determined by a family of “A-kinase anchoring protein” (AKAP) scaffolds, which have been shown to target PDEs, RI and RII regulatory cAMP-binding subunits of PKA and Epacs to specific intracellular locations. In the case of PKA, binding of cAMP to R subunits releases catalytic C subunits from the PKA holoenzyme, thereby allowing phosphorylation of adjacent substrates (Dodge-Kafka et al. 2006).
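The balance between cAMP synthesis by adenylyl cyclase and hydrolysis by PDEs described above can be illustrated with a deliberately simple one-compartment model. The sketch below is only an illustration under stated assumptions: synthesis is treated as zero-order and hydrolysis as first-order, spatial gradients and AKAP-mediated compartmentation are ignored, and the rate constants `k_syn` and `k_deg` are hypothetical values in arbitrary units rather than measured parameters.

```python
# Minimal one-compartment sketch of cAMP turnover (illustrative only).
# Assumptions: zero-order synthesis by adenylyl cyclase (k_syn) and
# first-order hydrolysis by PDEs (k_deg); all values are hypothetical.

def simulate_camp(k_syn, k_deg, camp0=0.0, dt=0.01, t_end=60.0):
    """Integrate d[cAMP]/dt = k_syn - k_deg * [cAMP] with forward Euler."""
    steps = int(round(t_end / dt))
    camp = camp0
    trajectory = [camp]
    for _ in range(steps):
        camp += dt * (k_syn - k_deg * camp)
        trajectory.append(camp)
    return trajectory


def steady_state(k_syn, k_deg):
    """Analytical steady state of the model above: k_syn / k_deg."""
    return k_syn / k_deg


# Basal conditions, then a tenfold rise in synthesis standing in for
# receptor/Gs-mediated activation of adenylyl cyclase.
basal = simulate_camp(k_syn=0.1, k_deg=0.5)
stimulated = simulate_camp(k_syn=1.0, k_deg=0.5, camp0=basal[-1])

print("basal steady state:", round(basal[-1], 3))
print("stimulated steady state:", round(stimulated[-1], 3))
```

In this toy model the steady-state level depends only on the ratio of synthesis to degradation, so raising PDE activity lowers cAMP just as effectively as reducing cyclase activity. Capturing the intracellular gradients discussed in the text would require a spatial model, which this sketch does not attempt.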
23.1.2.2 Exchange Proteins Activated by cAMP (Epacs) In silico searches for genes encoding novel cAMP-binding signalling proteins to explain PKA-independent effects of cAMP elevation identified two related gene products, termed Epac1/cAMP-GEFI and Epac2/cAMP-GEFII, that function in a PKA-independent manner as cAMP-activated guanine nucleotide exchange factors (GEFs) for the Rap1 and Rap2 family of small G-proteins (de Rooij et al. 1998; Kawasaki et al. 1998). Both consist of multiple domains in which, in the absence of cAMP, a regulatory N-terminal region confers an auto-inhibitory effect on a C-terminal catalytic region (Fig. 23.1). The most striking difference between Epac1 and Epac2 is the presence of two cyclic nucleotide binding domains (CNBs)
Fig. 23.1 Schematic representation of the domain structures of PKA RII and Epacs. Domains of the type II regulatory subunit of PKA (RIIα), Epac1 and Epac2 are shown, with the regulatory and catalytic regions of the Epacs indicated. C, kinase domain binding site; AKAP, A-kinase anchoring protein binding site; CNB, cAMP-binding domain; DEP, disheveled/Egl10/pleckstrin domain involved in localisation; REM, Ras exchange motif, thought to stabilise a catalytic helix within CDC25 HD (homology domain); RA, Ras association motif
in the latter (Fig. 23.1). The N-terminal CNB in Epac2 displays an approximately 20-fold lower affinity for cAMP in vitro than the conserved CNB. Moreover, deletion of the N-terminal CNB does not significantly diminish Epac2 responsiveness to cAMP in intact cells (de Rooij et al. 2000). Elegant crystallographic studies of active and inactive conformations of an N-terminally truncated Epac2, coupled with mutational analyses and comparison with structural studies of PKA, have led to the proposal of a model whereby, in the inactive state, access of Rap to the catalytic CDC25 homology domain (HD) is sterically blocked by the CNB. Upon binding of cAMP, the CNB orientates itself away from the Rap binding site in the CDC25 HD, a change which also renders the inactive conformation of the protein energetically less favourable. This allows binding of GDP-bound Rap to the CDC25 HD and subsequent guanine nucleotide exchange (Rehmann et al. 2006, 2008). In addition to one or more CNBs, the regulatory region of Epac also contains a DEP (Dishevelled/Egl-10/Pleckstrin) domain which, while not influencing Epac regulation by cAMP, may be involved in controlling its localisation to membranes (de Rooij et al. 2000; Qiao et al. 2002). The C-terminal catalytic region contains a REM (Ras exchange motif) domain, which is conserved in many GEFs with a CDC25 HD such as the Ras GEF Son of Sevenless (Sos), and may be involved in determining subcellular localisation. The REM domain and CDC25 HD are separated by a so-called Ras association (RA) domain, although this region is significantly different between Epac1 and Epac2 and association with Ras has only been reported for Epac2 (Li et al. 2006). One potentially important role that has been proposed for RA domain-mediated targeting of Epac2 to GTP-bound Ras at the plasma membrane may be to facilitate cAMP-stimulated signalling from membrane-
localised Epac2 to ERK via Rap1 in response to simultaneous activation of Ras (Liu et al. 2008). The identification of Epacs as alternative intracellular cAMP sensors, coupled with the rational development of agonistic PKA- and Epac-selective cAMP analogues, has shed new light on how cAMP controls important biological processes; these include exocytosis and insulin secretory granule release in pancreatic β-cells (Doyle and Egan 2007), adipocyte differentiation (Petersen et al. 2008), neural differentiation (Christensen et al. 2003) and subcellular localisation of DNA-dependent protein kinase (Huston et al. 2008). However, the focus of this article will be the linkage of Epac to two critical protective aspects of cAMP function in the endothelium: inhibition of pro-inflammatory cytokine signalling and the potentiation of barrier function.
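The approximately 20-fold difference in cAMP affinity between the two CNBs of Epac2 noted above can be made concrete with a simple binding calculation. The sketch below assumes non-cooperative, single-site binding at equilibrium and uses a hypothetical dissociation constant in arbitrary units. Only the 20-fold ratio comes from the text; the absolute values are not measured affinities.

```python
# Illustrative single-site binding calculation for the two Epac2 CNBs.
# Assumptions: simple equilibrium binding without cooperativity; the
# absolute Kd values are hypothetical, only their 20-fold ratio is
# taken from the text.

def fractional_occupancy(camp, kd):
    """Fraction of binding sites occupied at a given free cAMP level."""
    return camp / (kd + camp)


KD_CONSERVED = 1.0                   # hypothetical Kd, arbitrary units
KD_NTERMINAL = 20.0 * KD_CONSERVED   # 20-fold lower affinity = 20-fold higher Kd

for camp in (0.5, 1.0, 5.0, 20.0):
    conserved = fractional_occupancy(camp, KD_CONSERVED)
    n_terminal = fractional_occupancy(camp, KD_NTERMINAL)
    print(f"cAMP={camp:5.1f}  conserved CNB={conserved:.2f}  "
          f"N-terminal CNB={n_terminal:.2f}")
```

Under these assumptions the conserved CNB is already half-occupied at a cAMP level where the N-terminal CNB is almost empty. This is at least consistent with the observation that deleting the N-terminal CNB has little effect on Epac2 responsiveness in intact cells.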
23.2 The Control of Endothelial Barrier Function by Cyclic AMP 23.2.1 Introduction Since maintenance of endothelial barrier function is critical for vascular homeostasis, vascular permeability is tightly and selectively regulated in order to limit the spread of infection and allow repair of damaged tissue. However, in diseases such as atherosclerosis and sepsis, which are both associated with vascular inflammation, the processes that maintain endothelial barrier function break down and result in unregulated paracellular transport (i.e. transport between adjacent ECs) of solutes and plasma proteins (Weis 2008). Consequently, therapeutic strategies aimed at restoring barrier function require an understanding at the molecular level of how the key proteins at EC junctions that determine paracellular permeability are organised and also how their function is controlled in response to specific environmental cues. The key processes that determine barrier function are: (1) adherens junction (AJ) complexes, which comprise the transmembrane protein VE-cadherin and associated catenins; (2) integrin-containing junctions, which maintain barrier integrity by tethering cells to the basement membrane, and in which integrins also bind accessory proteins such as focal adhesion kinase (FAK) and p21-activated kinase (PAK) that allow for regulation of integrin function; and (3) the actin cytoskeleton, which interacts dynamically with both AJs and integrin-containing junctions and whose architecture is controlled directly by the small G-proteins Rho, Rac and Cdc42. Changes in any of these processes can produce the increases in endothelial permeability triggered by agonists such as thrombin, tumour necrosis factor α (TNFα) and vascular endothelial growth factor (VEGF), all of which can accumulate in response to inflammation or injury (Weis 2008). For example, thrombin has been
shown to promote a rapid (3, we defined the hub as a date hub. Figures 24.5a and 24.5b depict dynamic PPIs in caspase formation of cancer cells within the hub caspases during 0–8 h and 4–30 h after induction of apoptosis, respectively. Figures 24.6a and 24.6b depict dynamic PPIs in caspase formation of normal cells within the hub caspases during 0–8 h and 4–36 h after induction of apoptosis, respectively. The bold lines represent distinct interactions at different times. Caspase signaling results in time-variant PPIs, and dynamic modeling allows specification of the time-dependent interactome. In cancerous cells, the date hubs include BIRC2, CASP2, and CASP3, and the party hubs include TP53, TNF, BIRC3, BAX, CASP1, and CASP9. In normal cells, the date hubs include CASP3 and CASP9 and the party hubs include TNFRSF6, TP53, BIRC2, BIRC3, BCL2, BAX, and CASP1. Effector caspase-3 is a date hub in both cell types because the intrinsic and extrinsic pathways converge on caspase-3. Because date hubs appear to be more
L.-H. Chu and B.-S. Chen
important than party hubs, caspase-2 and -9 are important date hubs that differentiate network topologies of cancerous and normal cells. TP53, BIRC3, BAX, and CASP1 are party hubs in both cell types. Party hubs are found in static complexes where they interact with most of their partners simultaneously. In other words, we believe these four proteins play central roles in functional complexes in both cancerous and normal cells.
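The distinction drawn above between party hubs, which interact with most of their partners simultaneously, and date hubs, whose interactions differ between time windows, can be sketched as a small classification routine. This is a minimal illustration and not necessarily the criterion the authors applied. The cutoffs `min_degree` and `max_changes`, the function names and the protein labels (`HUB_A`, `P1` and so on) are all hypothetical, and the toy edge lists are invented solely to exercise the code.

```python
from collections import defaultdict

# Illustrative date/party hub classification from time-windowed PPI
# snapshots. Assumptions: a hub is any protein with at least min_degree
# partners overall; a hub is called a "date" hub when more than
# max_changes of its partners are absent from at least one window.
# Both cutoffs and all protein labels are hypothetical.


def partners_by_window(snapshots):
    """Map each protein to a list of partner sets, one per time window."""
    proteins = {p for edges in snapshots for edge in edges for p in edge}
    result = {p: [] for p in proteins}
    for edges in snapshots:
        adjacency = defaultdict(set)
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
        for p in proteins:
            result[p].append(set(adjacency[p]))
    return result


def classify_hubs(snapshots, min_degree=3, max_changes=1):
    """Label each hub as 'date' or 'party' under the assumptions above."""
    labels = {}
    for protein, partner_sets in partners_by_window(snapshots).items():
        all_partners = set().union(*partner_sets)
        if len(all_partners) < min_degree:
            continue  # too few partners to count as a hub
        constant = set.intersection(*partner_sets)
        changing = all_partners - constant
        labels[protein] = "date" if len(changing) > max_changes else "party"
    return labels


# Toy snapshots for an early and a late time window (invented data).
early = [("HUB_A", "P1"), ("HUB_A", "P2"), ("HUB_A", "P3"),
         ("HUB_B", "P1"), ("HUB_B", "P2"), ("HUB_B", "P4")]
late = [("HUB_A", "P1"), ("HUB_A", "P2"), ("HUB_A", "P3"),
        ("HUB_B", "P5"), ("HUB_B", "P6"), ("HUB_B", "P4")]

labels = classify_hubs([early, late])
print(labels)
```

In this toy example `HUB_A` keeps the same three partners in both windows and is labelled a party hub, whereas `HUB_B` swaps most of its partners and is labelled a date hub. With real data the snapshots would come from the time-dependent interactions identified by the dynamic model.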
24.5 Conclusions Construction of cancer-perturbed PPI networks for apoptosis has shed light on disease mechanisms at a systems level, generating results that could be applied to drug target discovery. In this study, a nonlinear stochastic model was used to describe individual and cooperative protein interactions with a target protein. This model is more precise in PPI computation than the linear models presented in previous papers. Microarray and proteome data sets have been successfully integrated to delineate the cancer-perturbed PPI apoptosis networks, which illustrate the apoptosis mechanism at the systems level and which can be used, together with data from the literature, to predict apoptosis drug targets. The predictions of cancer apoptosis drug targets developed here are highly consistent with the current apoptosis cancer drug discovery process, which should help researchers find further possible drug targets for other mechanisms in future work.
24
Protein–Protein Interaction Network of Apoptosis for Drug Target Discovery
Chapter 25
Transcriptional Changes in Alzheimer’s Disease

Jeremy A. Miller and Daniel H. Geschwind

Abstract Alzheimer’s disease (AD) is the most common form of dementia, affecting millions of aging people worldwide with no known cure. Pathological analyses have identified extracellular β-amyloid plaques and intracellular neurofibrillary tangles of hyperphosphorylated tau as core features, but the key causal biochemical pathways remain elusive. In fact, functional genomic analyses suggest a number of other cellular changes occurring early in the disease, including synaptic and bio-energetic dysfunction, as well as a role for neuron–glia interactions. Given the complexity of AD and the limitations of our current knowledge, the field of AD research would benefit from taking hypothesis-independent approaches that allow one to view the data in a systems biology framework. This paradigm shift in AD transcriptional research would permit consolidation of the large body of often-conflicting transcriptional, proteomic, and other screening results into testable theories of AD progression. In this chapter, we will first review transcriptional studies of AD that use postmortem human brain, animal model systems, and peripheral human tissue, addressing the biological pathways that these studies suggest degrade with AD progression. Next, we will discuss recent studies with more multifaceted designs, and how results from such studies lend support to some of these theories. Finally, we will suggest how using a more systems-level approach to the study of AD could help scientists further clarify what goes wrong in the brain at various stages of AD progression and be beneficial for the development of more targeted therapies.

Keywords Alzheimer’s disease · Functional genomic analysis · Systems-level approach · Transcriptional change
D.H. Geschwind (✉)
Department of Neurology and Center for Neurobehavioral Genetics, University of California, Los Angeles, CA, USA
e-mail: [email protected]

S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_25, © Springer Science+Business Media, LLC 2010
25.1 Introduction

Alzheimer’s disease (AD) is the most common form of dementia, affecting over 27 million people worldwide, and is one of the most costly diseases to society (see Cacabelos et al. 2005; Carr et al. 1997; Khachaturian 2000; Munoz and Feldman 2000; Drachman 2006 for more comprehensive reviews of AD background). AD has no cure and while ...

50 : r ) delete; A : |A| > 50 : r ) delete(50); : A -> 20, A

Table 31.6 (first model, using complexes)

let E : bproc = #(x,DE) [ rep x!().nil ];
let S : bproc = #(y,DS) [ y?().ch(y,DP).nil ];
run 1 E || 100 S

{ DE, DS, DP }
%%
{ (DE,DS, rate(ka), rate(kb), rate(kc) ), (DE,DP,0,inf,0) }

let ka : const = 1.0;
let kb : const = 1.0;
let kc : const = 1.0;

Table 31.6 (second model, using events)

// Using events
[steps = 1000]
let E : bproc = #(x,DE)[nil];
let S : bproc = #(y,DS)[nil];
let P : bproc = #(y,DP)[nil];
let ES : bproc = #(y,DES)[nil];
when(E,S::rate(ka)) join(ES);
when(ES::rate(kb)) split(E,S);
when(ES::rate(kc)) split(E,P);
run 1 E || 100 S

{ DE, DS, DP, DES }

let ka : const = 1.0;
let kb : const = 1.0;
let kc : const = 1.0;

Table 31.6 (third model, using functions)

// Michaelis-Menten kinetics
// using functions
[steps = 1000]
let E : bproc = #(x,DE) [ rep x!().nil ];
let S : bproc = #(y,DS) [ y?().ch(y,DP).nil ];
run 1 E || 100 S

{ DE, DS, DP }
%%
{ (DE,DS,f1) }

let ka : const = 1.0;
let kb : const = 1.0;
let kc : const = 1.0;
let Km : const = (kb + kc)/ka;
let VMax : const = 100;
let f1: function = VMax*(|S|/(Km+|S|));

$$E + S \;\underset{k_b}{\overset{k_a}{\rightleftharpoons}}\; ES \;\xrightarrow{k_c}\; E + P$$
where the rates associated with the arrows are all equal to 1. The first model in Table 31.6 implements this reaction using complexes, while the second implements it using only events. The last model uses functions to implement the reaction through
31
Programming Biology in BlenX
789
Michaelis–Menten kinetics, one of the most important reaction mechanisms in biochemistry, used to describe enzyme-catalysed reactions (Alberts et al. 2002). The ability to write models using different approaches (e.g. using events instead of complexes) is one of the key features of BlenX: it allows models to be created, refined and composed at different levels of abstraction; an example is given in Csikász-Nagy et al. (2009).
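As a rough numerical companion to the function-based model, the following Python sketch (plain Python, not BlenX, and not part of BWB) evaluates the expression used for f1 with the constants listed in Table 31.6. The helper name mm_rate is an assumption introduced only for this illustration.

```python
# Illustrative sketch only: evaluates f1 = VMax * |S| / (Km + |S|)
# with the constants of the function-based model (ka = kb = kc = 1, VMax = 100).
# The name mm_rate is hypothetical; it does not exist in BlenX or BWB.

def mm_rate(s: float, vmax: float, km: float) -> float:
    """Michaelis-Menten rate for a substrate population s."""
    return vmax * s / (km + s)

ka, kb, kc = 1.0, 1.0, 1.0
Km = (kb + kc) / ka          # Michaelis constant, here 2.0
VMax = 100.0

rate_initial = mm_rate(100, VMax, Km)   # rate with the initial 100 substrate boxes
rate_half = mm_rate(Km, VMax, Km)       # at |S| = Km the rate is half of VMax
```

The second value illustrates the usual reading of Km: it is the substrate population at which the reaction proceeds at half of its maximal rate.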
31.3 The Beta Workbench

The Beta Workbench (BWB for short) is a set of tools to design, simulate and analyse models written in BlenX. The core of BWB is a command line application (CoreBWB) that hosts three tools: the BWB simulator, the BWB CTMC generator and the BWB reactions generator. These three tools share the BlenX compiler and the BlenX runtime environment. CoreBWB takes as input the text files that represent a BlenX program (see Section 31.2) and passes them to the compiler, which translates them into a runtime representation that is then stored in the runtime environment. The logical arrangement of these computational blocks is depicted in Fig. 31.5.
Fig. 31.5 The logical structure of BWB
The BWB simulator is a stochastic simulation engine. The runtime environment provides the engine with primitives for checking the current state of the system (i.e. the current populations of all species and the current variable values) and for modifying it. The engine drives the simulation, handling the time evolution of the environment stochastically while preserving the previously described dynamics of BlenX. The stochastic simulation
790
L. Dematté et al.
engine implements an efficient variant of Gillespie’s algorithm (Gillespie 1977).

When a model is finite-state (i.e. the number of species and complexes that the model can generate is finite, the populations of species and complexes are finite and no continuous variables are used), a BlenX program gives rise to a continuous-time Markov chain (CTMC). The BWB CTMC generator iterates through the species, complexes and actions exposed by the runtime environment to traverse the whole state space of a BlenX program exhaustively. The CTMC generator also labels all the transitions between states with their exponential rates.

The BWB reactions generator identifies all the complexes and species that could be generated by the execution of a BlenX program as a result of interactions and state changes, without actually executing or simulating the model (in this case, too, the model is required to be finite). The output of the tool is a description of the system as a list of box species and a list of reactions in which these species are involved. These lists are abstracted as a digraph in which nodes represent species and edges represent reactions (see Fig. 31.6). This graph can be reduced to avoid, whenever possible, the presence of immediate reactions. The final result is an SBML description of the original BlenX program (Fig. 31.7).
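To make the idea of stochastic simulation concrete, here is a minimal Python sketch of the classic direct method of Gillespie (1977), applied to the enzymatic reaction E + S <-> ES -> E + P used earlier in the chapter. It is not the optimized variant implemented in BWB, whose internals are not described here; the function name, the mass-action propensities and the seed parameter are assumptions made for illustration.

```python
import random

def gillespie_enzyme(e, s, ka, kb, kc, t_end, seed=0):
    """Direct-method SSA for E + S <-> ES -> E + P (illustrative sketch).

    Returns the final populations (E, S, ES, P) and the time of the last event.
    Mass-action propensities are an assumption of this sketch.
    """
    rng = random.Random(seed)
    es, p, t = 0, 0, 0.0
    while True:
        # Propensities of binding, unbinding and catalysis in the current state.
        a = [ka * e * s, kb * es, kc * es]
        a0 = sum(a)
        if a0 == 0.0:
            break                      # nothing can fire any more
        tau = rng.expovariate(a0)      # exponentially distributed waiting time
        if t + tau > t_end:
            break
        t += tau
        # Pick the next reaction with probability proportional to its propensity.
        r = rng.random() * a0
        if r < a[0]:
            e, s, es = e - 1, s - 1, es + 1      # E + S -> ES
        elif r < a[0] + a[1]:
            e, s, es = e + 1, s + 1, es - 1      # ES -> E + S
        else:
            e, p, es = e + 1, p + 1, es - 1      # ES -> E + P
    return e, s, es, p, t

# One enzyme and 100 substrates, all rates equal to 1, run to completion.
final = gillespie_enzyme(1, 100, 1.0, 1.0, 1.0, t_end=float("inf"))
```

Run without a time limit, the loop stops only when no reaction is enabled, that is, when every substrate has been converted into product.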
Fig. 31.6 The graph of all the reactions generated by the BWB reactions generator
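The exhaustive traversal of a finite state space, as performed conceptually by a CTMC generator, can be sketched in a few lines of Python. The example below enumerates the states reachable by the enzymatic reaction E + S <-> ES -> E + P and labels each transition with an exponential rate. The state encoding (E, S, ES, P), the function name and the mass-action rates are assumptions of this sketch, not the data structures used by BWB.

```python
from collections import deque

def enumerate_ctmc(initial, ka, kb, kc):
    """Breadth-first enumeration of the states (E, S, ES, P) reachable from `initial`.

    Returns the set of states and a dict mapping (source, target) to the rate
    of that transition. Purely illustrative; mass-action rates are assumed.
    """
    seen = {initial}
    transitions = {}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        e, s, es, p = state
        moves = []
        if e > 0 and s > 0:
            moves.append(((e - 1, s - 1, es + 1, p), ka * e * s))   # binding
        if es > 0:
            moves.append(((e + 1, s + 1, es - 1, p), kb * es))      # unbinding
            moves.append(((e + 1, s, es - 1, p + 1), kc * es))      # catalysis
        for target, rate in moves:
            transitions[(state, target)] = rate
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, transitions

# One enzyme and two substrates give a small chain that is easy to inspect.
states, transitions = enumerate_ctmc((1, 2, 0, 0), 1.0, 1.0, 1.0)
```

With one enzyme and two substrates the chain has five states and six transitions; the state space grows with the initial populations, which is why the text restricts this kind of analysis to finite models.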
Around the core BWB, a set of tools has been developed to ease model writing and interpretation of results. Interpretation of results plays a very important role in language-based modelling approaches. Indeed, BlenX models are very compact with respect to other formalisms; this is mainly due to the choice of using a programming language to write models, thus expressing the behaviour of the single entities instead of listing all the possible interactions, as it happens in reaction-based input
Fig. 31.7 The SBML file generated by the BWB reactions generator.
formats like SBML. The resulting models are more compact and less complex, but more “dense”: complexity is transferred from the modeller to the software. During the simulation, the BWB runtime “unfolds” the reaction network on the user’s behalf, creating new boxes (species) as the simulation goes on. As a result, the simulation of a BlenX model produces not only a trace of species populations, but also a reaction graph (possibly partial, e.g. when the simulation did not explore some of the possible paths) and a list of generated species and complexes, with information about their structure and conformation. These three pieces of information represent the emerging dynamics of the system, and as such they are important for understanding how the system behaves. Understanding how a system behaves is one of the goals of modelling, and it is also important as a debugging² facility. The basic tool for these tasks is the BetaPlotter, a small tool for performing some initial analysis of simulation results. It can display the reactions as both a graph (with several different layouts³) and
² Debugging is a computer-related term that refers to the process of finding errors in a program (or a model, in this case) and correcting them.
³ A graph layout is an algorithm that decides, based on some criteria, how to place nodes and edges in a 2D or 3D space.
a reaction list; it plots the variation in population of all or selected species, and it can be used to see how species and complexes are structured.

Recently, we added CoSBiLab Graph to our toolbox, an application designed specifically to visualize and analyse large graphs. It can load not only BlenX simulation results, but also CTMCs produced by the CTMC generator, graphs produced by the BWB reactions generator and, in general, graphs in any common format. Furthermore, CoSBiLab Graph supports many more layouts⁴ and algorithms for performing different analyses on a graph (for example, statistics, classification of nodes, clustering, and computation of various coefficients on nodes and edges).

The need to classify and visualize complexes is addressed by the ComplexViewer. Complexes as native language constructs are a peculiar feature of BlenX (see Section 31.2 for more details), in which boxes play the role of monomeric units; they can represent any ensemble of two or more boxes, from dimers to more complex polymers. New complexes can be generated at runtime as a consequence of binding and unbinding actions, similarly to what happens when new boxes are created after monomolecular and bimolecular reactions, so relatively simple and compact programs can easily generate complex biopolymers made of dozens or hundreds of boxes (see Section 31.4.1 for a complete example). The drawback is that it is not easy to understand how a complex is shaped from the textual description that CoreBWB gives as output. To ease the task of analysing complexes, we created a tool to classify and draw them.

Finally, BWB includes some tools to ease the development of models. Models, as explained in Section 31.2, are BlenX programs composed of two, optionally three, text files. As with any programming language, you can use your text editor of choice to write BlenX programs. The text files can then be used as inputs for CoreBWB.
Using a textual representation for programs (as opposed to a graphical one) has both advantages and disadvantages. An advantage is the ability to scale with complexity: writing complex software in textual languages has repeatedly proven feasible. In fact, in some domains the density of data and its structure are such that a visual editor fails to represent it well. Programs, and likewise models written with a language-oriented approach, are one such domain. It is possible to create a visual designer to visualize flow control, branching logic, complex expressions and functions, but in this kind of scenario code in text format is more compact and easier to read and understand. This is especially true with an IDE (integrated development environment) designed and developed to handle increasingly complex program code in a seamless way. Moreover, software developers often see code as the most human-readable way to represent behaviour. In contrast, a graphical representation is easier to use for lay programmers: domain experts who are not professional programmers but program in a domain-specific language as a part of their work. For lay programmers, a programming language
⁴ Layouts and algorithms can be added through a plug-in system, so the set can be easily extended.
can be too difficult to use on a daily basis, while most of them are comfortable with diagrams, tables and point-and-click interfaces. As BlenX aims to target both audiences, i.e. people with software development experience and people with a biology or bioengineering background, we decided to offer both possibilities. For the first category, BWB comes with a Visual Studio plug-in for BlenX input files (with syntax highlighting, code completion, code snippets and so on); for the second, it comes with a graphical tool, BlenX Designer.
31.3.1 Usage

CoreBWB is a command line application. At the command line (your shell of choice under Unix/Linux, Command Prompt or Command Shell under Windows, Terminal under Mac) you enter the program name followed by the input files and one or more options. The most important option is “--running_mode” (short form: -r), used to invoke the various programs. To invoke the BWB simulator, use “SIM.exe --running_mode=GILLESPIE”; for the BWB CTMC generator use “SIM.exe --running_mode=TS” and for the BWB reactions generator use “SIM.exe --running_mode=SBML”. Usually output files are saved in the same directory as the input files; a different path or prefix for the output files can be specified with the --output option. In any case, CoreBWB writes the path to the output files at the end of the execution (see Fig. 31.8). Other options can be used to specify application-specific parameters (total virtual time or number of steps to run a simulation; limits on the CTMC generation; SBML version and so on). To obtain a complete list, use the --help command line switch.

BetaPlotter is a graphical tool that parses and displays simulation outputs as (1) a graph of the reactions executed by the simulator; (2) a Treemap of reaction causality;
Fig. 31.8 CoreBWB command line options
(3) a plot representing changes in populations; (4) a view of the internal behaviour of the generated entities (whose indentation is customizable by the user through a Python script) and (5) a list of the executed actions in a “chemical equation” style. Each of these displays is available by clicking the corresponding tab in the upper part of the plotter main interface (see Fig. 31.9a and b). The left panel shows all currently available species and those that are selected. It is possible to select species using this panel or, for example, the reaction graph (first
Fig. 31.9 (a) The BetaPlotter main interface. Notice the tabs in the upper part, where the different functionalities can be found. The selected tab shows the reaction graph of the cell cycle model explained in Section 31.4.2. (b) The oscillatory behaviour of different species of the cell cycle machinery, visualized with the BetaPlotter tool
tab). The selection is preserved across tabs, so it is possible to select species on one tab (say, the structure tab) and view the associated data on another (say, the plot tab).

CoSBiLab Graph is a new tool for graph visualization and manipulation. It is an improvement over the graph part of BetaPlotter, as it can deal with different graph formats (including the Dot format, used by the BWB CTMC generator and by the BWB reactions generator) and with very large graphs (up to hundreds of thousands of nodes, in the current release). CoSBiLab Graph is designed with a plug-in architecture, so it is possible to add general graph layouts and analysis algorithms to it; it currently includes a set of layouts and algorithms that fulfil basic needs. The interface is simple to use, with a classic menu for load and save operations and dockable panels for navigation, node and edge inspection, and analysis output (see Fig. 31.10).
Fig. 31.10 The main interface of CoSBiLab Graph. Notice the graph navigator and mini-map, designed to navigate graphs with thousands of nodes, the node and edge inspectors and the palette, used to inspect properties linked to these entities
ComplexViewer has been designed to let users exploit their domain-specific knowledge to drive how a complex should appear. ComplexViewer loads a CoreBWB output file and extracts species and complexes from it. Then, it classifies each box inside a complex into user-defined categories using pattern-matching rules (Fig. 31.11). In this way, it is possible to group and select boxes based on their characteristics and then tell the software how to draw them
Fig. 31.11 The logical structure of ComplexViewer and how it transforms a complex graph to display it (stages shown: source graph, graph adapter/filter, node classifier, layout rules, visualizer)
node branch {
  successors_direction_adj = monomer:translation(1,0,0);
  color = green;
}
node monomer {
  successors_direction_adj = monomer:(translation(1,0,0)),
                             branch:(rotation(-60,0,0,1), translation(1,0,0));
  color = red;
}
Fig. 31.12 Drawing rule for branch
(Figs. 31.12 and 31.13), reproducing closely the representation of different polymers (DNA, Actin, etc.) usually found in textbooks. The end result is shown in Fig. 31.14. Applying different rules results in different visualizations as shown in Fig. 31.15. For more information on ComplexViewer and its usage, see Dematté and Larcher (2009).
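One plausible reading of drawing rules such as those in Fig. 31.12 is that each node places its successors relative to itself: a translation moves one unit along the current direction, and a rotation about the z axis turns that direction first. The Python sketch below follows this reading in two dimensions. It is an interpretation made for illustration, not the ComplexViewer implementation; the RULES table, the layout function and the node names are all hypothetical.

```python
import math

# Hypothetical encoding of the rules in Fig. 31.12: for each
# (parent kind, successor kind) pair, the rotation about the z axis
# (in degrees) applied before a unit step along the current direction.
RULES = {
    ("monomer", "monomer"): 0.0,
    ("monomer", "branch"): -60.0,
    ("branch", "monomer"): 0.0,
}

def layout(tree, kinds, root):
    """Return {node: (x, y)} for a complex given as {parent: [successors]}."""
    positions = {root: (0.0, 0.0)}
    headings = {root: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in tree.get(parent, []):
            heading = headings[parent] + RULES[(kinds[parent], kinds[child])]
            x, y = positions[parent]
            rad = math.radians(heading)
            positions[child] = (x + math.cos(rad), y + math.sin(rad))
            headings[child] = heading
            stack.append(child)
    return positions

# A three-monomer filament with a branch starting at the second monomer.
kinds = {"m1": "monomer", "m2": "monomer", "m3": "monomer",
         "arp": "branch", "m4": "monomer"}
tree = {"m1": ["m2"], "m2": ["m3", "arp"], "arp": ["m4"]}
positions = layout(tree, kinds, "m1")
```

Under these assumptions the main filament lies along the x axis, while the branch leaves it at an angle of 60 degrees, which resembles the textbook depiction of branched actin.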
Fig. 31.13 ComplexViewer execution chain
Fig. 31.14 Actin filament in ComplexViewer
Fig. 31.15 Different drawing rules used to produce different end results
BlenX Designer. The process of authoring a BlenX program can be divided into four steps: define boxes and events; define the internal programs of the boxes; define dynamics (interface types, compatibilities, constants, variables and functions); and define complexes. The BlenX Designer starts with an empty canvas on which new boxes and events can be “drawn” using the toolbox on the right side (see Fig. 31.16). Double-clicking a box makes it possible to define its internal program. In the textual form, the internal program is specified with a language (see Section 31.2 for more details); in the graphical designer, the same instructions can be easily specified and composed using some basic action blocks (input, output, change, nil, and in
Fig. 31.16 Definition of boxes and events in the BlenX Designer
Fig. 31.17 Definition of an internal behaviour as an action tree composition in the BlenX Designer. Notice the operator block (Choice, Rep and Parallel) and the basic action blocks (Input, Output, Change, Nil, etc.)
Fig. 31.18 Definition of compatibilities and functions in the BlenX Designer
general all the actions specified in Section 31.2) and some operator blocks (like “+”, “.”, “|” and in general all the operators specified in Section 31.2 – see Fig. 31.17). Finally, dynamics can be specified using appropriate tabs (Fig. 31.18). The graphical representation of both the boxes and the internal programs reflects closely the textual representation. The two representations are interchangeable: the tool can parse and generate the graphical representation from any valid BlenX program and generate the textual representation from the graphical form. The textual representation can be previewed and modified inside the designer; changes to the text code are reflected immediately on the graphical representation and vice versa. The textual representation can then be used as an input to the core BWB.
31.4 Case Studies

In this section we present two case studies. The first shows how to model the self-assembly of complexes in the context of actin polymerization. The second concentrates on modelling the cell cycle engine. The goal of the section is to show the effectiveness of BlenX and its features on real biological examples. Moreover, the case studies give an idea of how and when to use the different programming and modelling approaches (e.g. complexes, events, communication, functions) that BlenX offers.
31.4.1 Actin Polymerization

Actin is a small globular protein present in almost all known eukaryotes. It is one of the most conserved proteins among species because it is involved in fundamental cell processes, e.g. muscle contraction, cell motility, cell division and cytokinesis, vesicle and organelle movement, cell signalling and the establishment and maintenance of cell junctions and cell shape. All these tasks rely on the ability of actin to polymerize into long filaments (Alberts et al. 2002). These filaments are polarized, so their two ends can be distinguished; they are called the pointed and the barbed end. Actin can interact with a multitude of molecules; in particular it can bind the ARP2/3 complex, a seven-subunit protein complex. This complex can bind existing actin filaments and become a nucleation site for a new filament. This leads to the formation of branched filaments, which are important for processes like cell locomotion, phagocytosis and intracellular motility of lipid vesicles.

The BlenX model we will study implements the polymerization of actin monomers and the formation of branches through the ARP2/3 molecule. To make the modelling process easier to follow, we will introduce two models of increasing complexity: the first implements only polymerization and depolymerization of actin filaments; the second is an evolution of the first, introducing the ARP2/3 molecule and thus branching over the filaments. In
this context we are interested in modelling the mechanisms leading to actin filaments formation, hence we avoid considering realistic quantitative parameters of this process.
31.4.1.1 The BlenX model

Actin polymerization and depolymerization. We only need to define a single BlenX box representing monomers and orchestrate the interactions so that two free monomers bind together, starting the formation of a new filament. A filament can polymerize and depolymerize at both ends, but with different reaction rates. We assume that a filament cannot be broken in the middle, i.e. it can lose monomers only from its extremities (Fig. 31.19). The monomer box, which we call Actin, has only two interfaces: Pointed, the pointed end of the monomer, and Barbed, its barbed end:

let Actin : bproc = #(Pointed,P), #(Barbed,B) [ interaction_modifier ];
Fig. 31.19 Formation process of an actin filament. Actin monomers are represented with grey circles
The molecule can assume different conformations, referred to as states in this context. Figure 31.20 depicts the states that the Actin box can assume. The arrows show the possible state changes, and their labels describe the event that causes each change. The interface types change with the state of the monomer they belong to, so that different rates can be specified for polymerization/depolymerization at the pointed end and at the barbed end. The types of the Pointed and Barbed interfaces are P and B, respectively, if they belong to a free monomer (state 1). A compatibility between these types is defined in order to allow the complexation/decomplexation of two free monomers:
Fig. 31.20 Actin boxes can assume four states, free on both ends (state 1), free only on one end (states 2 and 3) and bound on both ends (state 4)
(P,B, rate(free_monomer_complexation), rate(free_monomer_decomplexation), 0)
The Pointed and the Barbed interfaces have types P_END and B, respectively, if a monomer is the pointed end of a filament (state 2). The compatibility between P_END and B that enables polymerization and depolymerization at the pointed end is

(P_END,B, rate(pointed_polymerization), rate(pointed_depolymerization), 0)

If the monomer is the barbed end of a filament (state 3), we have the mirror-image situation, in which the types involved are P and B_END. The compatibility for polymerization and depolymerization is

(P,B_END, rate(barbed_polymerization), rate(barbed_depolymerization), 0)
Finally, if the monomer is bound at both its ends (state 4), the Pointed and the Barbed interfaces have types P_END and B_END, respectively. In this case the rate of depolymerization depends on the state of the linked monomers. For example, if the monomer bound to the Pointed end is in state 2, the rate of depolymerization on this side is specified by the compatibility we have considered before for P_END and B. If instead this monomer is in state 4, we do not allow depolymerization, because a link in the middle of a filament cannot be broken: by not specifying a compatibility between P_END and B_END we make the two boxes inseparable.

In order to change the type of the interfaces depending on the state of the box, we define a process that we call interaction_modifier. Before starting with the implementation of this process, we note that the states and the state changes represented in Fig. 31.20 can be obtained by combining two smaller sets of states. One set takes care of the changes regarding the Barbed interface and the other considers those regarding the Pointed interface (Fig. 31.21).
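The bookkeeping of interface types and compatibilities can be summarized in a small Python sketch. It assumes, as the state descriptions suggest, that the type of each interface depends on whether the other interface of the same box is bound, and that a link always joins the Pointed interface of one monomer to the Barbed interface of another. The dictionary, the function names and the returned labels are hypothetical and serve only to illustrate which links can break.

```python
# Hypothetical rendering of the type bookkeeping described in the text.
# Keys are (Pointed-side type, Barbed-side type) of a link; values name the
# compatibility that governs it. The pair (P_END, B_END) is deliberately
# absent: with no compatibility, a link in the middle cannot be broken.
COMPATIBLE = {
    ("P", "B"): "free_monomer",
    ("P_END", "B"): "pointed",
    ("P", "B_END"): "barbed",
}

def interface_types(pointed_bound, barbed_bound):
    """Types (Pointed, Barbed) exposed by an Actin box in a given state."""
    pointed_type = "P_END" if barbed_bound else "P"
    barbed_type = "B_END" if pointed_bound else "B"
    return pointed_type, barbed_type

def link_rule(a_barbed_bound, b_pointed_bound):
    """Compatibility governing the link between a.Pointed and b.Barbed.

    `a` contributes its Pointed interface and `b` its Barbed interface, so the
    relevant facts are whether the other interface of each box is also bound.
    Returns None when the link is inseparable.
    """
    pointed_type = "P_END" if a_barbed_bound else "P"
    barbed_type = "B_END" if b_pointed_bound else "B"
    return COMPATIBLE.get((pointed_type, barbed_type))

state_types = {
    1: interface_types(False, False),   # free monomer
    2: interface_types(False, True),    # pointed end of a filament
    3: interface_types(True, False),    # barbed end of a filament
    4: interface_types(True, True),     # bound on both sides
}
```

In this rendering, link_rule(True, True) returns None, which corresponds to the absence of a compatibility between P_END and B_END.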
Fig. 31.21 We can represent the possible states of the Actin box considering separately the Pointed and the Barbed interfaces. On the left side we have the states considering only the Pointed end state changes and on the right side those considering only the Barbed end state changes. Combining these states in all the possible ways we obtain again the diagram of Fig. 31.20
According to this observation, we can define the interaction_modifier process as two independent processes running in parallel, each one taking care of updating the type of one of the interfaces. The first process, pointed_end_guard_p, modifies the Barbed interface:

let pointed_end_guard_p : pproc =
  Pointed_end_guard!() |
  rep Pointed_end_guard?().
    if (Pointed, bound) then
      ch (Barbed, B_END).
      if not (Pointed, bound) then
        ch (Barbed, B).
        Pointed_end_guard!()
      endif
    endif;
Given that the interface can continuously alternate between two types (B and B_END), the process implements an infinite behaviour. Infinite behaviours are obtained using the rep operator:

let recursive_process : pproc =
  channel!() |
  rep channel?(). actions.channel!()
This process starts a first instance of actions by firing an output on channel. After the execution of actions, another output on channel activates a new copy of the process under replication. In the case of pointed_end_guard_p, the role of channel is played by Pointed_end_guard and the repeated actions are the following⁵: the process is blocked by the if statement with condition (Pointed, bound) until a monomer binds to the Pointed interface. Once the binding has occurred, the process changes the type of the Barbed interface to B_END through the ch(Barbed, B_END) action, thus moving from state 1 to state 2 (Fig. 31.21, left side). Then the process is blocked by the if statement with condition not (Pointed, bound). When the Pointed interface unbinds, the process executes the ch(Barbed, B) action, which changes the type of the Barbed interface back to B, thus moving from state 2 to the initial state (Fig. 31.21, left side). The passage from one state to the other is immediate because all the communication rates involved are inf, as are those of the change actions. The process that implements the state changes on the right side of Fig. 31.21 is the same but operates on different interfaces. In this case the if statement checks the state of the Barbed interface and the change action manipulates the type of the Pointed interface:

let barbed_end_guard_p : pproc =
  Barbed_end_guard!() |
  rep Barbed_end_guard?().
    if (Barbed, bound) then
      ch (Pointed, P_END).
      if not (Barbed, bound) then
        ch (Pointed, P).
        Barbed_end_guard!()
      endif
    endif;
The interaction_modifier process, which implements the state changes represented in Fig. 31.20, is obtained by composing pointed_end_guard_p and barbed_end_guard_p in parallel as follows:

let interaction_modifier : pproc = pointed_end_guard_p | barbed_end_guard_p;
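The control flow of a guard process such as pointed_end_guard_p (block until the interface is bound, change a type, block until it is unbound, change the type back, repeat) can be mimicked with a Python generator. This is only an analogy for readers more familiar with general-purpose languages: the dictionary that stands for the box and the generator name are assumptions, and the sketch ignores rates and the rest of the BlenX semantics.

```python
def pointed_end_guard(box):
    """Generator mimicking the loop of pointed_end_guard_p (illustrative only).

    Each call to next() resumes the process and lets it run until it blocks
    again on a condition that does not hold yet.
    """
    while True:                              # replication: the behaviour repeats forever
        while not box["Pointed_bound"]:      # blocked on: if (Pointed, bound)
            yield
        box["Barbed"] = "B_END"              # ch(Barbed, B_END)
        while box["Pointed_bound"]:          # blocked on: if not (Pointed, bound)
            yield
        box["Barbed"] = "B"                  # ch(Barbed, B)

# A free monomer: Pointed is unbound and Barbed has type B.
box = {"Pointed_bound": False, "Barbed": "B"}
guard = pointed_end_guard(box)
history = []

next(guard)                      # nothing is bound yet, the guard just blocks
history.append(box["Barbed"])

box["Pointed_bound"] = True      # a monomer binds to the Pointed interface
next(guard)
history.append(box["Barbed"])

box["Pointed_bound"] = False     # the monomer unbinds again
next(guard)
history.append(box["Barbed"])
```

The recorded history shows the Barbed type going from B to B_END and back to B, matching the cycle on the left side of Fig. 31.21.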
⁵ Consider that at the beginning the monomer is free and thus it is in state 1.
ARP2/3 molecule. The ARP2/3 molecule forms a complex with an existing actin filament and acts as a nucleation site for a new filament, leading to the formation of a branch. In our model the ARP2/3 molecules cannot bind at the ends of a filament; moreover, two ARP2/3 molecules cannot bind to two consecutive monomers. We also impose that if a monomer is involved in a branch, its neighbours cannot be involved in a depolymerization process. This constraint has no biological basis in this particular system; however, it could be realistic in other situations, and we present it to show the capabilities of BlenX more effectively. The reactions we are going to introduce are depicted in Fig. 31.22. The first picture of the sequence shows a partially formed filament, an ARP2/3 molecule (filled in dark grey) and some free actin monomers. The first reaction is the binding between the filament and the ARP2/3 molecule, the second is the interaction between the ARP2/3 molecule and a free monomer, and the last is the growth of the new branch through the recruitment of another actin monomer.

The described process is implemented incrementally by adding new blocks of code to the previous BlenX model. The implementation we present is based on the notion of a state machine. We start with an abstract example that gives the intuition of what a state machine is and explains how it is implemented in BlenX. The programming style used to implement the state machine is highly reusable and is used very often in BlenX models. Keeping the abstract example separate from its use in the modelling of actin polymerization gives an idea of how general design patterns can be applied to real scenarios. Other examples of general BlenX design patterns can be found in Larcher et al. (2010).
Fig. 31.22 Formation of a branch over an existing filament. The dark grey molecule represents the ARP2/3 molecule. It interacts with only one actin monomer but, because of its type, it prevents other branches from forming around it
A state machine is composed of a finite set of states and transitions between those states. In our example we consider an entity representing a state machine with three states: S0, S1 and S2. Transitions are possible only between states with consecutive numbers. Thus the state machine can move from S0 to S1 and from S1 to S2, but not directly from S0 to S2. All the possible state changes are depicted in Fig. 31.23.
Fig. 31.23 The state machine with the new transitions
In our case the state machine is commanded by another entity (the controller) that interacts with it and controls its state changes. In particular, in our example we want the controller to command the state machine to perform the following sequence of state changes: S0-S1-S2-S1-S2-S1. To translate this into BlenX we start by defining a box, called state_machine, which represents the state machine. It is equipped with two interfaces: one called State, which tells us the state of the box, and another called x, for receiving messages from the controller. The State interface can be associated with three types, St0, St1 and St2, which represent the states S0, S1 and S2, respectively. The internal process implements the three possible states following a general schema. Each state is represented as a process under the replication operator, guarded by an input operation over a channel named after the state it represents. A list of alternative processes (choice operator) follows:

rep state_name?().( state_change_1 + ... + state_change_n )
These processes represent the possible state changes and they share a common pattern: an input action (the handler), followed by a change action that updates the State interface with the type representing the destination state of the transition, and by an output action over the channel that has the same name as the destination state:
    handler?().ch(State,dest_state_identifier).dest_state!()
Handlers used for state transitions from a given state have to be different. Given that in our case we have no more than two state transitions for each state, we need only two handlers. We call them plus and minus. Processes representing states are put in parallel, obtaining the following process:
    let state_implementation : pproc =
        rep S0?().plus?().ch(State,St1).S1!()
      | rep S1?().( minus?().ch(State,St0).S0!()
                  + plus?().ch(State,St2).S2!() )
      | rep S2?().minus?().ch(State,St1).S1!();
Note that for states S0 and S2 there is only one possible state change, so the list of processes joined through the choice operator degenerates into a single process implementing that unique state change. Now, given that the initial state of the state machine is S0, we add an output operation over S0. This operation activates the state changes that are possible from the state S0:
    let state_implementation_started_in_S0 : pproc = S0!() | state_implementation;
The last step to complete the definition of the state_machine box is the definition of the process that allows the communication with the controller. This component is a process that performs two operations: an input operation over the x interface of the box, which waits for the name of a handler, and then an output action over the received handler. The process is under replication because it has to handle an unbounded number of communications. At this point we have all the processes needed to define the state_machine box:
    let state_machine : bproc = #(x,R)
        [ rep x?(handler).handler!() | state_implementation_started_in_S0 ];
After that we define the controller box. It performs a sequence of output operations over its interface. The object of each output is a handler, and each handler drives one state change of the state_machine box:
    let controller : bproc = #(y,T)
        [ y!(plus).y!(plus).y!(minus).y!(plus).y!(minus) ];
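The interplay between the controller and the state machine can be mimicked outside BlenX with a plain transition table. The following Python sketch is only an illustration of the pattern, not BlenX semantics: the names TRANSITIONS and run_controller are hypothetical, and the stochastic timing of the communications is ignored.

```python
# Transition table of the abstract state machine: (state, handler) -> next state.
TRANSITIONS = {
    ("S0", "plus"): "S1",
    ("S1", "plus"): "S2",
    ("S1", "minus"): "S0",
    ("S2", "minus"): "S1",
}


def run_controller(handlers, start="S0"):
    """Feed the handlers one by one and return the list of visited states."""
    state = start
    visited = [state]
    for handler in handlers:
        key = (state, handler)
        if key not in TRANSITIONS:
            # In BlenX the output on the handler would stay blocked;
            # in this sketch we simply stop.
            break
        state = TRANSITIONS[key]
        visited.append(state)
    return visited


# The sequence sent by the controller box: plus, plus, minus, plus, minus.
trace = run_controller(["plus", "plus", "minus", "plus", "minus"])
print("-".join(trace))
```

With the handler sequence of the controller box, the visited states reproduce the sequence S0-S1-S2-S1-S2-S1 requested in the text.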
Note that, to make the communication between the two boxes possible, we have to define the compatibility (T,R,1). Suppose we want to introduce a new event that causes the state_machine box to restart (and thus move to state S0 independently of its current state). This event is the binding of a newly defined box to the x interface of the state_machine box. We call this new participant restarter:
    let restarter : bproc = #(z,S) [ nil ];
We add the compatibility (S,R,0.0001,1,0) to allow the complexation and the decomplexation between the state_machine and the restarter boxes. To make it possible to reset the state machine, we modify it by allowing new state changes: one for each state (excluding S0), moving from that state to the initial state S0. All the transitions are depicted in Fig. 31.24, where the arrows are labelled with the handler used to trigger each state change. Note that there are two state changes that cause a transition from S1 to S0; however, they are different because they are triggered through two different handlers: minus and reset.
Fig. 31.24 State changes of the ARP_site interface
Now we modify the state_implementation process in order to implement the new transitions (with respect to the previous definition, the additions are the alternatives guarded by the reset handler):
    let state_implementation : pproc =
        rep S0?().plus?().ch(State,St1).S1!()
      | rep S1?().( minus?().ch(State,St0).S0!()
                  + plus?().ch(State,St2).S2!()
                  + reset?().ch(State,St0).S0!() )
      | rep S2?().( minus?().ch(State,St1).S1!()
                  + reset?().ch(State,St0).S0!() );
After that we add a process that fires the new state changes. We define a new process to be added in parallel to the current internal process of the state_machine box:
    let state_machine_restarter : pproc =
        start_restarter!()
      | rep start_restarter?().
          if (x,bound) then
            reset!().
            start_restarter!()
          endif;
It implements a recursive behaviour exploiting the previously explained technique. The repeated action is an output over the reset channel guarded by an if statement that enables it only if the x interface is bound. Note that if the automaton is in state S0, the state_machine_restarter process is blocked even when the condition of the if statement is satisfied, because in that state there is no input action on the reset channel. Here is the new definition of the state_machine box:
    let state_machine : bproc = #(x,R)
    [
        rep x?(channel).channel!()
      | state_implementation_started_in_S0
      | state_machine_restarter
    ];
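The effect of the reset handler can be sketched in the same table-driven style. This Python fragment is an illustrative assumption rather than a translation of the BlenX code: the helper names step and restarter are hypothetical, and the restarter is applied once instead of looping as the replicated BlenX process does.

```python
# Transition table extended with the reset handler (no reset entry for S0).
TRANSITIONS = {
    ("S0", "plus"): "S1",
    ("S1", "plus"): "S2",
    ("S1", "minus"): "S0",
    ("S2", "minus"): "S1",
    ("S1", "reset"): "S0",
    ("S2", "reset"): "S0",
}


def step(state, handler):
    """Return the next state, or the same state if the handler is not accepted."""
    return TRANSITIONS.get((state, handler), state)


def restarter(state, x_bound):
    """Offer a reset only while the x interface is bound, as the if guard does."""
    if x_bound:
        return step(state, "reset")
    return state


print(restarter("S2", x_bound=True))   # reset accepted from S2
print(restarter("S2", x_bound=False))  # guard false, nothing happens
print(restarter("S0", x_bound=True))   # no reset input in S0, state unchanged
```

The third call mirrors the remark in the text: in S0 the guard may hold, but no input on reset is available, so nothing changes.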
After this brief introduction to state machines and their implementation, we can continue the development of the actin model by introducing the branching mechanism. The first addition to the previous model is the molecule responsible for this mechanism: the ARP2/3 molecule. We represent it as a box with two interfaces, one for interacting with a monomer of an existing filament (ARP_Pointed) and the other for recruiting a free monomer (ARP_Barbed), which gives rise to a branched filament.
    let ARP : bproc = #(ARP_Pointed, ARP_P), #(ARP_Barbed, ARP_B) [ arp_process ]
Then we add two interfaces to the Actin box: ARP_site, for interacting with the ARP box, and Neighbours, for storing information about the branching state of the neighbour monomers. Note that the ARP_site interface of a free monomer is hidden because actin in this state cannot interact with the ARP2/3 molecule. Moreover, Neighbours is associated with the type N0 because no neighbour is involved in the branching structure (for a free monomer this is trivially true, given that the monomer has no neighbour boxes). We define the branch_manager process and compose it in parallel with the previously defined interaction_modifier process within the Actin box:
    let Actin : bproc = #(Pointed,P), #(Barbed,B), #h(ARP_site,A), #(Neighbours,N0)
        [ interaction_modifier | branch_manager ];
The branch_manager is composed of three processes in parallel: arp_site_manager_p, arp_site_guard_p and neighbours_state_modifier. The arp_site_manager_p process implements the constraints on the branching capability of a filament. It makes the ARP_site interface visible when the Pointed interface and the Barbed interface are bound and the Neighbours interface has type N0 (meaning that neighbour boxes are not involved in branching). If one of these conditions is false the process hides the interface:
    let arp_site_manager_p : pproc =
        ARP_site_manager!()
      | rep ARP_site_manager?().
          if (Neighbours,N0) and (Pointed,bound) and (Barbed,bound) then
            unhide(ARP_site).
            if not ((Neighbours,N0) and (Pointed,bound) and (Barbed,bound)) then
              hide(ARP_site).
              ARP_site_manager!()
            endif
          endif;
Also in this case we have a recurring behaviour, i.e. the monomer keeps switching the visibility of the ARP_site interface as the conditions are updated. The arp_site_guard_p process performs a set of actions as a consequence of binding and unbinding events on the ARP_site interface. When ARP_site becomes bound, the process alerts the neighbour monomers that it is involved in a branch (so that they can update their Neighbours interface accordingly) by sending an output with name plus on the Pointed and the Barbed interfaces. Moreover, the type of these two interfaces is changed in such a way that the monomer cannot unbind from its neighbours (this is one of the constraints introduced at the beginning of this section). When the ARP_site interface unbinds, the mirror-image actions are performed: messages with name minus are sent and the two interfaces are associated with their initial types:
    let arp_site_guard_p : pproc =
        Arp_site_guard!()
      | rep Arp_site_guard?().
          if (ARP_site,bound) then
            Pointed!(plus).Barbed!(plus).
            ch(Pointed,BI).ch(Barbed,PI).
            if not(ARP_site,bound) then
              ch(Pointed,P_END).ch(Barbed,B_END).
              Pointed!(minus).Barbed!(minus).
              Arp_site_guard!()
            endif
          endif;
The actions performed by the process arp_site_guard_p are summarized in Fig. 31.25. The arrows are labelled with the event that causes the state change and with the output actions that the monomer performs while moving from one state to the other. Now we define the neighbours_state_modifier process, which receives the messages plus or minus sent by arp_site_guard_p and uses them to update the type of the Neighbours interface. This interface can be associated with three types: N0, N1 and N2. The number after N indicates how many neighbour monomers are involved in a branch. To obtain this kind of behaviour we need to implement a box that changes its state depending on events involving other boxes. We can think of it as a box that is remotely controlled by other boxes. This result is obtained through communications over the interfaces. In particular, when an Actin box binds to or unbinds from an ARP box, it must communicate the event to its neighbours, which update their Neighbours interface accordingly. These communications are performed inside arp_site_guard_p and are sent over the Barbed and Pointed interfaces.
Fig. 31.25 Representation of the state machine taken as example. The initial state S0 is double circled
Our goal is to update the Neighbours interface (of type N0, N1 or N2) correctly. We can separate the configurations of a monomer into three groups, according to the type associated with its Neighbours interface. These groups, depicted in Fig. 31.26, are the three states among which the Actin box switches. This behaviour can be implemented by exploiting the state machine design pattern introduced previously. We associate the state S0 with N0, S1 with N1 and S2 with N2; the possible state transitions are the same as in the state machine example. In the actin model, referring to Fig. 31.26, the state machine is represented by the monomer filled with grey and the controllers are the monomers bound to its Pointed and Barbed interfaces (actually all of them are both controller and state machine at the same time, but if we look at the state changes represented in the figure we can distinguish the two roles). The code for the process that implements the state machine inside the Actin boxes is in one-to-one correspondence with the state_implementation process:
    let finite_state_machine : pproc =
        rep stateN0?().plus?().ch(Neighbours,N1).stateN1!()
      | rep stateN1?().( plus?().ch(Neighbours,N2).stateN2!()
                       + minus?().ch(Neighbours,N0).stateN0!()
                       + restart_state_machine?().ch(Neighbours,N0).stateN0!() )
      | rep stateN2?().( minus?().ch(Neighbours,N1).stateN1!()
                       + restart_state_machine?().ch(Neighbours,N0).stateN0!() );
Fig. 31.26 Configurations of an actin monomer, grouped according to the type of the Neighbours interface. These three groups represent the three states that have to be recognized by the state machine we implement inside the Actin box
To complete the definition of the neighbours_state_modifier process we have to add a process that starts the state machine in its starting state (in this case stateN0), the processes that wait for messages from the controller monomers linked through the Pointed and Barbed interfaces, and the process that restarts the state machine.
    let neighbours_state_modifier : pproc =
        rep Pointed?(handler).handler!()
      | rep Barbed?(handler).handler!()
      | stateN0!()
      | state_machine_restarter!()
      | rep state_machine_restarter?().
          if (not(Pointed,bound)) and (not(Barbed,bound)) then
            restart_state_machine!().
            state_machine_restarter!()
          endif
      | finite_state_machine;
These processes are also in one-to-one correspondence with the previously described mechanism. In this case synchronization with the handlers can happen on two interfaces (instead of one), and thus we need two processes waiting for messages. The condition of the if statement ensures that when a monomer unbinds from a filament and becomes free, the state machine is restarted. As for the sending of the messages that govern the automaton, we refer to the already defined arp_site_guard_p process. The ARP2/3 molecule, as we have seen at the beginning of this section, is implemented as a box with two interfaces. The ARP_Pointed interface interacts with the ARP_site interface of a monomer, and the ARP_Barbed interface interacts with the Pointed interface of a free monomer, promoting the polymerization of a new branch (see Fig. 31.27).
Fig. 31.27 Process of branch formation
    let ARP : bproc = #(ARP_Pointed, ARP_P), #(ARP_Barbed, ARP_B)
    [
        Pointed_end_guard!()
      | rep Pointed_end_guard?().
          if (ARP_Pointed, bound) then
            ch(ARP_Barbed, ARP_B_END).
            if not (ARP_Pointed, bound) then
              ch(ARP_Barbed, ARP_B).
              Pointed_end_guard!()
            endif
          endif
      | Barbed_end_guard!()
      | rep Barbed_end_guard?().
          if (ARP_Barbed, bound) then
            ARP_Barbed!(plus).
            ch(ARP_Pointed, ARP_P_END).
            if not (ARP_Barbed, bound) then
              ch(ARP_Pointed, ARP_P).
              Barbed_end_guard!()
            endif
          endif
    ];
The internal process is mostly the same as that of the Actin box of the first model we presented. It differs in the output action ARP_Barbed!(plus) over the ARP_Barbed interface. This communication changes the state of the monomer that binds to the ARP_Barbed interface. This prevents an ARP box from binding to the ARP_site of a monomer that is already bound to an ARP box on its Barbed interface. To make the new complexations, decomplexations and communications among boxes possible, compatibilities have been added to the proper file.
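The bookkeeping performed by the Neighbours and ARP_site interfaces in this section can be pictured with a small Python sketch. It is a simplification under stated assumptions: the class Filament and its methods are hypothetical, the filament is a fixed linear list, the neighbour count is computed directly instead of being updated by plus and minus messages, and neither stochastic timing nor the effect of the ARP_Barbed!(plus) message on the first branch monomer is modelled.

```python
class Filament:
    """Linear filament; arp_bound[i] tells whether monomer i carries an ARP."""

    def __init__(self, length):
        self.arp_bound = [False] * length

    def neighbours(self, i):
        """Number of adjacent monomers involved in a branch (0, 1 or 2)."""
        adjacent = [j for j in (i - 1, i + 1) if 0 <= j < len(self.arp_bound)]
        return sum(self.arp_bound[j] for j in adjacent)

    def site_visible(self, i):
        """ARP_site is visible only for an interior monomer of type N0."""
        interior = 0 < i < len(self.arp_bound) - 1  # Pointed and Barbed bound
        return interior and self.neighbours(i) == 0

    def bind_arp(self, i):
        """Try to bind an ARP to monomer i; return True on success."""
        if self.arp_bound[i] or not self.site_visible(i):
            return False
        self.arp_bound[i] = True
        return True

    def unbind_arp(self, i):
        """Release the ARP from monomer i; return True if one was bound."""
        was_bound = self.arp_bound[i]
        self.arp_bound[i] = False
        return was_bound


filament = Filament(6)
print(filament.bind_arp(0))    # an end monomer cannot branch
print(filament.bind_arp(2))    # interior monomer with no branched neighbour
print(filament.bind_arp(3))    # refused: its neighbour 2 is already branched
print(filament.neighbours(3))  # monomer 3 now sees one branched neighbour
```

The refusal at monomer 3 corresponds to the rule that two ARP2/3 molecules cannot bind to consecutive monomers.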
31.4.2 Analysis
This example shows how to model the dynamics of the formation of actin filaments. In this context we did not attempt to associate realistic rates with each reaction. Thus, in this section we do not present any plot of the populations of the species involved in the simulation, because such plots would have no biological relevance. Instead we show two pictures (Fig. 31.28), for which we used the Complex Viewer tool to create a representation of two complexes among those generated during a simulation. They have been drawn in two different ways (using the capabilities of the tool). The image on the left represents a filament with a linear arrangement of the monomers; the one on the right is more realistic from a biological point of view and represents a filament as two twisted strands of monomers. In both pictures the monomers have different colours. The colour depends on the molecule represented by the sphere: if the molecule is an ARP complex the sphere is filled with the darkest grey; if the molecule is an actin monomer, lighter greys are used, with different tones depending on the state of the Neighbours interface.
Fig. 31.28 Two different representations of actin filaments obtained with the Complex Viewer tool
31.4.3 Cell Cycle
The cell cycle is a coordinated set of steps by which a cell replicates all its components and divides them into two nearly identical daughter cells. The eukaryotic cell cycle is driven by an underlying molecular network whose main components are complexes of cyclin-dependent kinases (Cdk) and their regulatory cyclin partners. Population changes of active Cdk/cyclin dimers are responsible for the transition from one phase of the cell cycle to the next (Alberts et al. 2002). The transitions between the different phases are depicted in Fig. 31.29a. During G1 phase the activity of Cdk is low because cyclin transcription is mostly inhibited and the cyclin that is produced is rapidly degraded by Cdc20 and Cdh1. The remaining Cdk/cyclin dimers are sequestered by CKI, a stoichiometric Cdk inhibitor, which forms an inactive heterotrimer with Cdk/cyclin dimers. If the environmental conditions are favourable, the mass of the cell grows as the cell progresses through G1 phase, and this leads to an increased production of Cln3, a cyclin that is resistant to Cdh1 and CKI. Cln3 can activate the transcription factors that induce the production of other starter kinases (Cln3, Cln2 and Clb5 collectively modelled by the SK species). They trigger the passage between G1 and S/G2 phase and they mediate the inactivation of both Cdh1 and CKI: this allows the other cyclins to start accumulating in the cell. The key regulator of entry into M phase is Clb2 (modelled by the CycB species). Cyclin synthesis is induced and cyclin degradation is inhibited throughout the rest of the cell cycle, hence the Clb2 population increases throughout S, G2 and M phases. A high population of active Cdk/Clb2 also causes the inactivation of the transcription factors for the starter kinases, which have already accomplished their role in the cell cycle. Moreover, Cdk/Clb2 induces the synthesis of the Cdc20 protein. At the metaphase/anaphase transition, Cdc20 molecules bind to the APC and Cdk/Clb2 activates them. The active Cdc20 induces the separation of sister chromatids and the degradation of the Clb's, and activates Cdh1. As the Cdk activity reverts to low levels, telophase completes and the cell divides. The synthesis of the APC-related protein Cdc20 stops as the activity of Cdk is lost. The newborn cells are back in G1 phase with low cyclin levels and the process starts again.
Fig. 31.29 Budding yeast cell cycle model. (a) The different phases of the cell cycle (Gap1, Synthesis, Gap2, Mitosis) and the proteins involved in the transition between them. (b) Protein reaction network of the cell cycle engine. Solid lines connect reactants to products, while dashed lines represent the mediation effect that some species have on reactions
In the literature, the dynamics of the interactions that drive the cell cycle engine have been modelled with both deterministic and stochastic approaches: we can find cell cycle models built with stochastic ODEs of Langevin type (Steuer 2004), with the Gillespie method (Sandip et al.
2009), with process algebra languages with stochasticity on transitions (Lecca and Priami 2007) and with stochastic Petri nets (Mura and Csikász-Nagy 2008). One of the most popular deterministic models is the one by Novák and Tyson [see Novák and Tyson (2003, p. 270)], whose network of reactions is depicted in Fig. 31.29b. Following a procedure similar to the one explained in Palmisano et al. (2009), in the next section we show how to write this model in BlenX.
31.4.3.1 The BlenX model
In this section we explain how to code the budding yeast wild-type cell cycle (Novák and Tyson 2003) in BlenX in order to use the BWB framework to perform stochastic simulations. The rationale behind the codification is to use the same level of abstraction adopted in the original model, where most of the reactions are non-elementary and so cannot be modelled with the simple synchronization actions seen in the previous case study. As shown in Section 31.2, BlenX can also model complex mechanisms such as enzymatic interactions and Michaelis–Menten responses through the use of events whose rate is a user-defined mathematical function.
Synthesis mechanisms. In the deterministic model, CYCBT is synthesized at a constant rate “k1” (Fig. 31.30). To code this in BlenX we define a function d_dtCYCBT_1 whose expression is k1/alpha: the alpha term is the conversion factor that has to be used in order to convert the deterministic rate k1 into the stochastic framework (this term takes into account Avogadro's number and the volume in which the reactions take place). Further details about the conversion and the naming conventions can be found in Palmisano et al. (2009). Then, in the model file, we introduce a new event for the CYCBT species with rate d_dtCYCBT_1:
    //in the declaration file
    let d_dtCYCBT_1: function = k1/alpha;
    //in the program file
    let CYCBT: bproc = when(CYCBT:: d_dtCYCBT_1) new(1);
Fig. 31.30 Graphical representation of the synthesis mechanism that the BlenX code on the left is modelling
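The role of alpha in the expression k1/alpha can be illustrated numerically. The Python sketch below rests on an assumption that the text does not spell out: alpha is taken to be 1/(N_A·V), with V the reaction volume in litres, so that a zero-order rate given in concentration units is divided by alpha to obtain molecules per unit time. The function names and the numerical values are hypothetical; the exact definition used in the model is the one given in Palmisano et al. (2009).

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def alpha(volume_litres):
    """Assumed conversion factor 1 / (N_A * V); not taken from the source."""
    return 1.0 / (AVOGADRO * volume_litres)


def zero_order_stochastic(k, volume_litres):
    """Constant synthesis rate: k in mol/(L*time) -> molecules/time (k / alpha)."""
    return k / alpha(volume_litres)


# Hypothetical values, for illustration only.
volume = 1.0e-15  # litres
k1 = 1.0e-9       # mol / (L * time unit)
rate = zero_order_stochastic(k1, volume)
print(rate)       # expected number of CYCBT molecules created per time unit
assert math.isclose(rate, k1 * AVOGADRO * volume)
```

Under the same assumption, a second-order (bimolecular) rate constant would be multiplied by alpha instead of divided by it, because its units contain an inverse concentration.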
In general, the synthesis of a species can also be driven by more complex mechanisms: it can be influenced by other species in a simple way (e.g. SK synthesis driven by TF) or in a more complex way (e.g. CDC20_IN protein synthesis depends on the mass and on alphaDimer, a complex mathematical function accounting for the amount of active Cdk/CycB dimer in the system). However, the BlenX code for their synthesis follows the same pattern as the simple case: a definition of the rate function (either simple or more complex and involving other species) and a new event with this rate.
Activation/inactivation mechanisms. In the model there are several species that switch from an active state to an inactive one and vice versa (see Fig. 31.31). As in the synthesis case, this reaction can be driven by simple mass action kinetics (e.g. the switch from IEP to IE) or by more complex Michaelis–Menten reactions (e.g. the switch from IE to IEP, from CDH1 to CDH1_IN, etc.). For the former case, we can use the atomic change action in the internal program of the species in its initial state (see the code accompanying Fig. 31.31); for example, we can define the product species IE as an empty box with an interface whose type is IE. Then we define the IEP species as a box whose internal process is composed of a single change action that changes the type of its interface from IEP to IE with rate k10. For the more complex case, as done for the synthesis mechanism, we need to use an event, as events allow us to use arbitrary functions. We define a function that computes the Michaelis–Menten rate and then use it as the rate of a split event whose subject is the initial state of the species (IE, in Fig. 31.31) and whose products are two boxes, one of which is Nil and the other is the final state of the molecule (IEP, in Fig. 31.31).
Fig. 31.31 Graphical representation of the IE activation/inactivation that the BlenX code on the left is modelling
Degradation mechanisms. In the deterministic model the degradation of a protein can be spontaneous (i.e. dependent only on its own population) or it can be triggered by some other species (see Fig. 31.32). Moreover, as in the activation/inactivation case, this trigger can be driven by simple mass action kinetics (e.g. CYCBT degradation induced by CDH1) or by more complex reactions (e.g. CKIT degradation induced by the CycB dimer and the mass).
    // in the declaration file
    let k2p : const = 0.04;
    let k2s_stoch : const = k2s * alpha;
    // in the program file
    let CYCBT : bproc = #(x,CYCBT) [ die(rate(k2p)) + x?().die.nil ];
    let CDH1 : bproc = #(y,CDH1) [ rep y!().nil ];
    // in the interfaces file
    (CDH1, CYCBT, 0.0, 0.0, rate(k2s_stoch))
Fig. 31.32 Graphical representation of the degradation mechanisms that the BlenX code on the left is modelling
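The two degradation routes of CYCBT, spontaneous and triggered by CDH1, can be pictured as two competing stochastic channels. The following Python sketch uses a Gillespie-style loop as a stand-in for the BlenX simulator; this is an assumption about how such a model behaves, not a description of the BWB engine. The function name, the initial populations and the value of the triggered rate constant are hypothetical; only k2p = 0.04 comes from the code above.

```python
import random


def simulate_degradation(cycbt, cdh1, k2p, k2s_stoch, t_end, seed=0):
    """Simulate CYCBT decay through a spontaneous and a CDH1-triggered channel."""
    rng = random.Random(seed)
    t = 0.0
    history = [(t, cycbt)]
    fired = {"spontaneous": 0, "triggered": 0}
    while cycbt > 0:
        a_spont = k2p * cycbt               # die action of each CYCBT box
        a_trig = k2s_stoch * cdh1 * cycbt   # CDH1 signal followed by die
        total = a_spont + a_trig
        if total <= 0.0:
            break
        t += rng.expovariate(total)         # waiting time to the next event
        if t > t_end:
            break
        channel = "spontaneous" if rng.random() * total < a_spont else "triggered"
        fired[channel] += 1
        cycbt -= 1                          # CDH1 itself is left unchanged
        history.append((t, cycbt))
    return history, fired


history, fired = simulate_degradation(
    cycbt=50, cdh1=10, k2p=0.04, k2s_stoch=0.01, t_end=1000.0, seed=1
)
print(len(history) - 1, fired)
```

Because CDH1 is never consumed, its population only scales the propensity of the triggered channel, which is the role played by the replicated output in the CDH1 box.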
For the independent self-degradation case, we can simply use the atomic die action in the internal code of the species. To model the mass action degradation mechanism, it is enough to add paired input/output actions to the internal programs of the two species involved, followed by a die action for the species that we want to degrade and by an action that resets the species we want to keep unchanged to its initial state (it does not change because it is just the trigger of the degradation). For example, if we want to allow CDH1 to degrade a CYCBT box, we can add to the internal behaviour of CYCBT the possibility of receiving a signal through its interface x and, right after that, performing a die action that immediately deletes the box from the system (see Fig. 31.32). In order to trigger the execution of this input action, we need to add to the internal behaviour of CDH1 an output action on its interface y and we need to set, in the interfaces file, a non-zero communication compatibility between the CDH1 and CYCBT types. Moreover, the output action is under the replication operator rep because, after the first degradation signal, CDH1 remains active and is able to send this signal to the rest of the CycB dimers in the system. Finally, if the degradation is driven by a non-elementary mechanism, the input/output actions of BlenX cannot be used, and the degradation should be modelled through a delete event whose rate is a non-elementary function.
Continuous variables. The growth of the mass is modelled by a continuous variable that depends on time and whose value can be updated (i.e. halved) when a certain condition in the system is met. To model this scenario in BlenX, we define a continuous variable m in the function file as follows:
    let m (0.1) : var = mu * m * (1 - m/mstar) init 0.70405;
    let mass_div : function = m / 2;
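As a numerical companion to the two BlenX declarations above (which the following paragraph explains line by line), the Python sketch below integrates the same logistic growth law and defines the halving function. Several choices are assumptions made for illustration: the explicit Euler scheme (the integrator used by BlenX is not stated here), the values of mu and mstar, and the number of steps. The condition that triggers the division is not modelled.

```python
def grow(m, mu, mstar, dt):
    """One explicit Euler step of dm/dt = mu * m * (1 - m / mstar)."""
    return m + dt * mu * m * (1.0 - m / mstar)


def mass_div(m):
    """Halve the mass, as the BlenX function mass_div does."""
    return m / 2.0


m = 0.70405                      # initial value taken from the declaration
for _ in range(1000):            # 1000 steps of 0.1 time units each
    m = grow(m, mu=0.01, mstar=10.0, dt=0.1)  # mu and mstar are hypothetical

print(m)             # mass after 100 time units of growth
print(mass_div(m))   # mass right after a division
```

The logistic term keeps the mass below mstar, so repeated halving and regrowth can produce a bounded oscillation of the cell mass.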
The first line of the BlenX code expresses the fact that we want to recalculate m with a time step of 0.1 and that the initial value of the variable is 0.70405. The second line defines a function that divides the mass whenever it is called. This function is used in the following update event:
    when ( : mCycB -> 0.2, mCycB […]
[…] 0. The firing rule in P/T nets characterises the dynamic behaviour and is atomic and timeless. The definition below formalises the firing rule.
Definition 32.3 (Firing Rule) Let $N = (P, T, F, W, m_0)$ be a Petri net and $m$ a marking of $N$.
• A transition $t \in T$ may fire in a marking $m$ if $t$ is enabled in $m$;
• the firing of $t$ yields a successor marking $m'$ of $m$, which is $m' := m + \Delta t$, where $\Delta t(p) = W(t, p) - W(p, t)$ for every place $p$.
We write $m \xrightarrow{t} m'$ to represent the firing of $t$ from $m$ to $m'$. To illustrate all these concepts, let us consider the example in Fig. 32.1. It represents a hypothetical small signal transduction network, from the binding of the signal to the receptor, through a cascade, to the response in terms of transcription. The P/T net of Fig. 32.1 consists of 15 places (p1 to p15: Signal, Receptor, Signal_Receptor, Complex_builder, Complex, X, XP, Y, YP, Z, ZP, A, B, C and Destroyed_receptor) and 9 transitions (t1 to t9: Receptor_binding, Complex_formation, Phosphorylation_X, Phosphorylation_Y, Phosphorylation_Z, Transcription_A, Transcription_B, Transcription_C and Receptor_degradation). The last transition
Discrete Modelling: Petri Net and Logical Approaches
[Fig. 32.1 diagram: places Signal, Receptor, Signal_receptor, Complex_builder, Complex, X, XP, Y, YP, Z, ZP, A, B, C and Destroyed_receptor; transitions t1 Receptor_binding, t2 Complex_formation, t3 Phosphorylation_X, t4 Phosphorylation_Y, t5 Phosphorylation_Z, t6 Transcription_A, t7 Transcription_B, t8 Transcription_C and t9 Receptor_degradation]
Fig. 32.1 A small signal transduction Petri net, with 15 places and 9 transitions. The signal (e.g. a ligand) binds to the receptor and triggers the complex formation. Then, this complex is responsible for a phosphorylation cascade, which induces the transcription of several proteins. The model contains a feedback loop representing the fact that, when target A has been transcribed, it causes the degradation of the receptor, thus terminating the pathway
is an output transition through which the receptor will be removed from the system by its degradation, which is not explicitly modelled here. The underlying chemical processes of the model are the following. Note that we are working at a different abstraction level than in the case of metabolic networks.
t1 Receptor_binding: Signal + Receptor → Signal_Receptor
t2 Complex_formation: Signal_Receptor + Complex_builder → Complex
t3 Phosphorylation_X: Complex + X → XP
t4 Phosphorylation_Y: XP + Y → YP
t5 Phosphorylation_Z: YP + Z → ZP
t6 Transcription_A: ZP → A
t7 Transcription_B: ZP → B
t8 Transcription_C: ZP → C
t9 Receptor_degradation: A + Receptor → Destroyed_receptor
I. Koch and C. Chaouiya
The initial marking m0 is [2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]. Thus, transition t1 (Receptor_binding) is enabled and can fire. After the firing of Receptor_binding, transition t2 (Complex_formation) is enabled and can fire. Now a cascade starts, which is typical of signal transduction: t3 (Phosphorylation_X) can fire, thus inducing the firing of t4 and t5 (Phosphorylation_Y, Phosphorylation_Z). In the cascade, read arcs are used to model the phosphorylation processes. Read arcs are drawn as bidirectional arrows. They denote a condition that is necessary to enable a transition but is not modified by its firing, i.e. the tokens in a place connected to a transition by a read arc are required to enable the transition but are not consumed (see also Section 32.3.1). The token in place ZP triggers the transcription of A, B and C. Afterwards, tokens produced in place A feed back to destroy the receptor through the firing of transition t9 (Receptor_degradation). Thus, the signal flow is interrupted after a single response.
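The first two firings described above can be replayed with a few lines of Python. The sketch covers only transitions t1 and t2 and the five places they touch; it assumes that all arc weights are 1 and uses the usual enabling condition (each input place holds at least as many tokens as the arc weight). The dictionary layout and the function names are hypothetical.

```python
# Each transition maps to (tokens consumed per place, tokens produced per place).
# All arc weights are assumed to be 1.
TRANSITIONS = {
    "Receptor_binding": (
        {"Signal": 1, "Receptor": 1},
        {"Signal_Receptor": 1},
    ),
    "Complex_formation": (
        {"Signal_Receptor": 1, "Complex_builder": 1},
        {"Complex": 1},
    ),
}


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    pre, _ = TRANSITIONS[transition]
    return all(marking[place] >= weight for place, weight in pre.items())


def fire(marking, transition):
    """Return the successor marking m' = m + delta(t); the input is not modified."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition} is not enabled")
    pre, post = TRANSITIONS[transition]
    successor = dict(marking)
    for place, weight in pre.items():
        successor[place] -= weight
    for place, weight in post.items():
        successor[place] += weight
    return successor


# Restriction of the initial marking m0 to the first five places.
m0 = {"Signal": 2, "Receptor": 1, "Signal_Receptor": 0,
      "Complex_builder": 1, "Complex": 0}
m1 = fire(m0, "Receptor_binding")
m2 = fire(m1, "Complex_formation")
print(m1)
print(m2)
```

In m0 only Receptor_binding is enabled; Complex_formation becomes enabled in m1, which matches the order of firings given in the text.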
32.2.2 Petri Net Properties
Two classes of properties are commonly studied: structural properties, which do not depend on the initial marking, and behavioural properties, which do. Among these properties, we briefly present the ones that are of clear interest for Petri nets representing biological networks.
32.2.2.1 Behavioural Properties
The reachability of a given marking m ensures that there is a firing sequence leading from the initial marking m0 to m. Fig. 32.2a provides a simple example. The reachability of a marking is often hard to verify, and model-checking techniques can be very useful in this case (Clarke et al. 1999). A transition is live if, whatever the previous evolution of the net, it can always become enabled again. A P/T net is live if all its transitions are live. A transition is dead in a marking if, whatever the future evolution of the net, it will never be enabled. A deadlock is a marking for which all transitions are dead. More subtle levels of liveness can be defined; see, for example, Murata (1989) or Peterson (1981) for details and further illustrations. In biochemical networks, such properties can be related to the possible future occurrence of a given reaction or to a steady state of the system.
[Fig. 32.2 diagram: two simple P/T nets, (a) and (b)]
Fig. 32.2 Two simple P/T nets to illustrate basic properties: (a) For $m_0 = [1, 0, 0, 0]$, the marking $m = [1, 0, 1, 0]$ is reachable through the firing sequence $(t_0, t_2)$, whereas $m' = [0, 1, 0, 1]$ is not reachable. Moreover, $m'' = [0, 0, 0, 1]$ is a deadlock, hence this P/T net is not live. The places p0, p1 and p3 are bounded, which is not the case of p2, because infinitely repeating the firing sequence $(t_0, t_2)$ leads to an infinite marking of p2. (b) For $m_0 = [1, 0, 0]$ the net is live and reversible
A Petri net is bounded if the number of tokens in each place is limited (see Fig. 32.2a). In biochemical networks, this means that no product can accumulate without limit. A Petri net is reversible if m0 is reachable from any reachable marking. In other words, whatever the evolution of the net, it will always be able to return to its initial marking. The coverability graph, see Fig. 32.3, is a graph representation of the behaviour of a P/T net. It is a labelled directed graph, where each node is a marking and each arc, going from a marking $m$ to a marking $m'$ and labelled by a transition $t$, indicates that $m \xrightarrow{t} m'$. In a coverability graph, a marking containing one or more wild-cards ω covers a set of markings $m'$ such that a marking $m$ already exists with $m(p) \le m'(p)$ for all places $p$ of the net. Some dynamic properties can then be checked on these graphs.
[Fig. 32.3 diagram: (a) a coverability graph with nodes 1000, 0100, 0001, 10ω0, 01ω0 and 00ω1; (b) a reachability graph with nodes including 100 and 011]
Fig. 32.3 Illustration of the behaviour representation of P/T nets. On the left, the coverability graph of the net in Fig. 32.2a for the initial marking [1, 0, 0, 0]. One can note the dead marking (from which no transition is enabled) and the unboundedness of place p2. On the right, the reachability graph of the net in Fig. 32.2b for m0 = [1, 0, 0], from which the reversibility of the net is obvious
For example, a net is bounded if and only if ω does not appear in any node of its coverability graph, which in this case is called the reachability graph of the net (see Fig. 32.3). The nodes of the reachability graph are exactly the markings reachable from the initial marking. If a marking m is reachable from m0, then there exists a node m′ in the coverability graph such that m ≤ m′. Note that this is a necessary condition, not a sufficient one. A transition is dead in the initial marking if and only if it does not appear as an arc label in the coverability graph. Most dynamic properties can be checked on reachability graphs; for example, a marking m is reachable if and only if there exists a path from m0 to m.
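The checks described above can be sketched for a bounded net by exploring the reachability graph breadth-first. The following is a minimal illustration, not the procedure of any particular tool: the three-place cycle, the names `PRE` and `POST`, and the state limit are assumptions introduced here and do not correspond to any figure of this chapter.

```python
from collections import deque

# Hypothetical three-place cycle (not one of the chapter's figures):
# p0 --t0--> p1 --t1--> p2 --t2--> p0, all arc weights equal to 1.
PRE = {"t0": (1, 0, 0), "t1": (0, 1, 0), "t2": (0, 0, 1)}   # tokens consumed
POST = {"t0": (0, 1, 0), "t1": (0, 0, 1), "t2": (1, 0, 0)}  # tokens produced


def enabled(marking, t):
    """A transition is enabled if every pre-place holds enough tokens."""
    return all(m >= w for m, w in zip(marking, PRE[t]))


def fire(marking, t):
    """Return the marking obtained by firing t."""
    return tuple(m - a + b for m, a, b in zip(marking, PRE[t], POST[t]))


def reachability_graph(m0, limit=10_000):
    """Breadth-first exploration; returns {marking: {transition: successor}}."""
    graph, queue = {}, deque([m0])
    while queue:
        m = queue.popleft()
        if m in graph:
            continue
        if len(graph) >= limit:
            raise RuntimeError("state limit hit; the net may be unbounded")
        graph[m] = {t: fire(m, t) for t in PRE if enabled(m, t)}
        queue.extend(graph[m].values())
    return graph


def reaches(graph, source, target):
    """Depth-first search for a path from source to target."""
    seen, stack = {source}, [source]
    while stack:
        m = stack.pop()
        if m == target:
            return True
        for nxt in graph[m].values():
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


m0 = (1, 0, 0)
graph = reachability_graph(m0)
deadlocks = [m for m, succ in graph.items() if not succ]
reversible = all(reaches(graph, m, m0) for m in graph)
dead_transitions = set(PRE) - {t for succ in graph.values() for t in succ}
```

For this toy net the graph has three markings, no deadlock and no dead transition, and the net is reversible. The `limit` guard only signals that exploration did not terminate; deciding unboundedness properly would require constructing the coverability graph with ω markings, which this sketch does not do.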
However, such analyses, based on exhaustive searches, can become intractable when the coverability or reachability graph is too large.

32.2.2.2 Structural Properties

Structural properties hold for all initial conditions, i.e. for all initial markings. To introduce the two main relevant structures, we first need the incidence matrix C = (c_ij), p_i ∈ P, t_j ∈ T, of a P/T net:

\[
c_{ij} = t_j(p_i) =
\begin{cases}
W(t_j, p_i) - W(p_i, t_j) & \text{if } p_i \in {}^{\bullet}t_j \cup t_j^{\bullet}, \\
0 & \text{otherwise.}
\end{cases}
\]
We define place (P-) and transition (T-) invariants hereafter. Further details on T-invariants can be found in Section 32.3.2. P-invariants relate to sets of places for which the weighted sum of tokens is constant, independently of the sequence of firings: a non-zero vector x of non-negative integers defines a P-invariant if and only if C^T · x = 0 (see Fig. 32.4). In metabolic networks, the incidence matrix C coincides with the stoichiometric matrix of the network. Thus, P-invariants correspond to conservation relations of metabolites. If the net is covered by P-invariants, i.e. all places appear in at least one P-invariant, it is bounded.
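[Figure 32.4, left: drawing of the P/T net with places P1–P4 and transitions T1–T3.]

\[
C = \begin{pmatrix} -2 & 1 & 1 \\ 1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & -2 & 2 \end{pmatrix}, \qquad
x_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \end{pmatrix}, \quad
x_2 = \begin{pmatrix} 2 \\ 0 \\ 4 \\ 1 \end{pmatrix}, \quad
y_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\]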
Fig. 32.4 Illustration of P/T net invariants. On the left, a simple P/T net encompassing four places and three transitions. On the right, the corresponding incidence matrix is given. c1,1 = −2 because transition T1 consumes two tokens from P1, in other words T1(P1) = −2, whereas c3,1 = 1 because transition T1 produces one token in P3, i.e. T1(P3) = 1. The vectors x1 and x2 are the minimal non-negative integer solutions of C^T x = 0 and denote two minimal P-invariants giving the following two conservation relations: m(P1) + m(P2) + m(P3) = cst and 2m(P1) + 4m(P3) + m(P4) = cst, where cst denotes a constant whose value depends on the initial marking. The vector y1 is a T-invariant, i.e. a solution of C y = 0. This P/T net is bounded
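The invariants of Fig. 32.4 can be checked directly against the incidence matrix. The sketch below uses only the matrix and vectors given in the figure; the helper names and the initial marking `[2, 0, 0, 2]` are assumptions chosen here for illustration.

```python
# Incidence matrix of Fig. 32.4: rows P1..P4, columns T1..T3.
C = [
    [-2,  1,  1],
    [ 1, -1,  0],
    [ 1,  0, -1],
    [ 0, -2,  2],
]


def is_p_invariant(x):
    """x is a P-invariant iff C^T . x = 0 (one equation per transition)."""
    return all(sum(C[i][j] * x[i] for i in range(len(C))) == 0
               for j in range(len(C[0])))


def is_t_invariant(y):
    """y is a T-invariant iff C . y = 0 (one equation per place)."""
    return all(sum(c * n for c, n in zip(row, y)) == 0 for row in C)


def fire(marking, j):
    """Add column j of C to the marking (enabledness is not checked here)."""
    return [m + row[j] for m, row in zip(marking, C)]


def weighted_sum(x, marking):
    """Weighted token sum that a P-invariant x keeps constant."""
    return sum(a * m for a, m in zip(x, marking))


x1, x2, y1 = [1, 1, 1, 0], [2, 0, 4, 1], [1, 1, 1]

m_start = [2, 0, 0, 2]          # hypothetical initial marking
m = m_start
for j in (0, 2, 1):             # fire T1, then T3, then T2 (each once, as in y1)
    m = fire(m, j)
m_end = m
```

Firing each transition once, as prescribed by y1, returns the net to the starting marking, and both weighted sums (here 2 for x1 and 6 for x2) are unchanged, which is what the two conservation relations in the caption express.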
It is easy to verify that any linear combination of P-invariants is also a P-invariant. Hence, we introduce minimality criteria. Given a P-invariant x, the set of places pi ∈ P such that xi ≠ 0, denoted supp(x), is called the support of x, i.e. the set of places participating in the P-invariant. The support is said to be minimal if no proper subset of it is also the support of a P-invariant; a P-invariant is minimal if its support is minimal and the greatest common divisor of its entries is equal to 1. T-invariants denote firing sequences which reproduce a marking: a non-zero vector y of non-negative integers defines a T-invariant if C · y = 0, meaning that each transition
has to fire as many times as indicated by the value of the corresponding component in the solution vector y. A minimality criterion applies similarly to T-invariants, yielding minimal T-invariants. In biological terms, minimal T-invariants represent cyclical behaviours and the minimal sets of transitions which operate at steady state. Moreover, minimal T-invariants correspond to elementary modes (Schuster et al. 1993), which are well established in systems biology and have been mainly applied to metabolic systems. The whole system behaviour can be reproduced by linear combinations of the minimal T-invariants. Because T-invariants define the basic network behaviour of a system, they are important for model validation. The net should be connected and covered by minimal T-invariants, i.e. all transitions appear in at least one T-invariant. Moreover, each T-invariant should have a biological meaning.
32.2.3 Petri Net Extensions

This section gives a brief overview of Petri net extensions that have proved useful for modelling biological networks, in particular signal transduction networks. These extensions of the formalism either take transition timing into account (as in time and stochastic Petri nets) or enrich the meaning of the marking (as in continuous, hybrid or coloured Petri nets). Whereas time (or interval) Petri nets and stochastic Petri nets are discrete like P/T nets, continuous Petri nets do not consider discrete tokens, but concentrations of chemical compounds, which are processed by a continuous firing rule using the same ODEs as in classical systems biology (Heinrich and Rapoport 1974). Hybrid Petri nets involve both discrete and continuous places and transitions, with the corresponding firing rules.

Time or Interval Petri nets: A number of extensions have been defined by the Petri net community to take time constraints into account, among them Time Petri Nets (TPNs), introduced by Merlin and Farber in the 1970s (Merlin and Farber 1976). In TPNs, two time values at and bt, with 0 ≤ at ≤ bt, are associated with each transition t. This time interval gives the earliest time at and the latest time bt at which transition t can fire after being enabled. The firing itself takes no time. This type of modelling can be useful when reaction rates can be estimated from experimental data. An application of TPNs to the modelling of biochemical systems can be found in Popova-Zeugmann et al. (2005).

Stochastic Petri nets: Stochastic Petri nets (SPNs) are also discrete. In SPNs, enabled transitions fire after exponentially distributed delays. Hence SPNs can be related to Markov jump processes with discrete state space. This framework is well suited to handle the stochasticity arising from small numbers of molecules, which are represented by the tokens in the places.
Arcs are labelled according to the stoichiometric numbers as in P/T nets and TPNs. The firing rule involves a probability function, which includes stoichiometric relationships and depends on the number of molecules (Goss and Peccoud 1998; Srivastava et al. 2001).
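A stochastic firing rule of this kind can be sketched with a direct-method stochastic simulation in the style of Gillespie. The example below is a minimal illustration under stated assumptions: the reversible binding net A + B <-> AB, its rate constants, and the mass-action propensity (rate constant times the number of distinct token combinations on the pre-places) are choices made here, not taken from the cited works.

```python
import math
import random

# Hypothetical reversible binding A + B <-> AB; rate constants are made up.
TRANSITIONS = {
    "bind":   {"pre": {"A": 1, "B": 1}, "post": {"AB": 1}, "rate": 0.01},
    "unbind": {"pre": {"AB": 1}, "post": {"A": 1, "B": 1}, "rate": 0.1},
}


def propensity(marking, t):
    """Rate constant times the number of token combinations on the pre-places."""
    h = t["rate"]
    for place, weight in t["pre"].items():
        h *= math.comb(marking[place], weight)
    return h


def simulate(marking, t_end, rng):
    """Direct-method simulation; returns the final marking and the event trace."""
    marking, time, trace = dict(marking), 0.0, []
    while True:
        props = {name: propensity(marking, t) for name, t in TRANSITIONS.items()}
        total = sum(props.values())
        if total == 0:                       # no transition enabled: deadlock
            break
        time += rng.expovariate(total)       # exponentially distributed delay
        if time > t_end:
            break
        threshold = rng.random() * total     # pick a transition proportionally
        cumulative = 0.0
        for name, a in props.items():
            cumulative += a
            if threshold < cumulative:
                break
        chosen = TRANSITIONS[name]
        for place, weight in chosen["pre"].items():
            marking[place] -= weight
        for place, weight in chosen["post"].items():
            marking[place] += weight
        trace.append((time, name, dict(marking)))
    return marking, trace


rng = random.Random(0)                       # fixed seed for repeatability
final, trace = simulate({"A": 50, "B": 40, "AB": 0}, 10.0, rng)
```

Whatever the random trajectory, the sums A + AB and B + AB stay at their initial values, because these two pairs of places form P-invariants of the toy net.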
Such SPNs relate to the well-founded simulation algorithm developed by Gillespie (1977), which is the basis for further developments (see, for example, Gibson and Bruck 2000) and has proved useful for simulating biochemical processes in which significant stochastic fluctuations arise.

Continuous Petri nets (CTPNs): In contrast to P/T nets, TPNs and SPNs, continuous models are deterministic. As for SPNs, the chemical properties are adapted to the Petri net formalism. In continuous Petri nets, places do not carry discrete tokens, but concentrations of chemical compounds, which are processed by a continuous firing rule yielding the same systems of coupled ODEs as used in classical kinetic modelling (Heiner and Koch 2004). The firing rule encompasses the reaction rate according to the assumed kinetics. Mostly, and in particular for metabolic networks, the well-known Michaelis–Menten kinetics (Michaelis and Menten 1913) is used; see textbooks on biochemistry, e.g. Voet and Voet (2004), and physical chemistry, e.g. Atkins and de Pavla (2006). For a reaction network, this leads to a set of coupled differential equations, usually ODEs. Modelling signal transduction using differential equations is described in detail, for example, in Kofahl and Klipp (2004). The application to Petri nets can be found in Koch and Heiner (2008). The critical point in this approach is the availability of the kinetic parameters, which makes the use of hybrid Petri nets attractive.

Hybrid Petri nets (HPNs): Hybrid Petri nets (HPNs) contain both discrete and continuous places and transitions, with their corresponding firing rules. The rates of input and output of a transition are the same, thus defining a firing rate of the transition. In CTPNs and HPNs, the rates assigned to arcs are constant. In hybrid functional Petri nets (HFPNs) (Matsuno et al. 2003), different functions for the input and output rates, weights and delays were introduced.
A further extension in HFPNs is the inclusion of additional biological information, such as sequence information of DNA, mRNA and protein (Nagasaki et al. 2004). As mentioned above, HPNs and HFPNs give good simulation results especially in cases where not enough kinetic parameters are available. The most widely used tool for hybrid Petri nets is the Cell Illustrator (Doi et al. 2003; Nagasaki et al. 2003, 2005). Many different types of networks have been modelled using HPNs, among them signal transduction pathways (Tasaki et al. 2006) and gene regulatory networks (Matsuno et al. 2000, 2006).

High-Level Petri Nets: In high-level Petri nets or coloured Petri nets (CPNs), tokens are distinguishable in that they carry (complex) data types, referred to as colours. The enabling, consumption and production rules associated with the transitions depend on expressions labelling the arcs (Jensen 1997). Each CPN can be translated into a behaviourally equivalent P/T net and vice versa. The translation from CPNs to P/T nets is unique, whereas the translation in the other direction can be done in different ways. The existence of an equivalent P/T net is extremely useful, because it indicates how to generalise the basic concepts and analysis methods in order to apply them to CPNs. This class of Petri nets has been employed in Voss et al. (2003), where the authors use colours to differentiate molecules of the same species, according to the paths along which they are produced and consumed, and perform a qualitative
analysis of the combined glycolysis and pentose phosphate pathway in erythrocytes. Genrich et al. proposed a high-level Petri net model to study metabolic pathways quantitatively (Genrich et al. 2001). The colours associated with the places are pairs encompassing the names and concentrations of the related substrates. Kinetic functions are then assigned to the transitions. In Lee et al. (2006), CPNs are also employed specifically to analyse signal transduction pathways. Finally, this class of Petri nets has been used to compactly represent logical models of genetic regulatory networks (Chaouiya et al. 2004; Comet et al. 2005).
32.3 Specific Modelling Techniques

In this section we consider special Petri net modelling and analysis methods, ranging from peculiarities to new techniques. To illustrate the methods we consider parts of a Petri net model of the mating pheromone response pathway in S. cerevisiae (baker's yeast). First, let us give a very brief overview of the biology of this well-understood pathway. Reviews can be found in Bardwell (2004) and Ciliberto et al. (2003).

In S. cerevisiae two different mating types exist, the MATa and MATα cells, which can mate by fusing into a diploid cell. This process is induced by small peptides, called pheromones: the α-factor and the a-factor. A haploid cell secretes its own factor and carries receptors for the factor of the opposite cell type. When two cells of different types are located close to each other, each can be stimulated by the factor of the other. In response to this stimulation, a complex signalling pathway, the pheromone response pathway, starts in order to induce a series of physiological changes in preparation for mating. Here, we will only sketch the pathway for MATa cells, which is nearly the same as for MATα cells.

The α-factor-specific cell surface receptor Ste2 binds to a heterotrimeric G protein, which itself consists of the three subunits Gα, Gβ and Gγ, whereby the latter two can act as the heterodimer Gβγ. The heterodimer Gβγ mediates a signal to the MAP kinase cascade by interacting with Cdc24, the protein kinase Ste20 and the scaffold protein Ste5. In series, other kinases (Ste20, Ste11 and Ste7) are activated and phosphorylate the two MAP kinases, Fus3 and Kss1. Phosphorylated Fus3 in the nucleus induces the transcription of genes that prepare the mating of the two cells and the fusion of their nuclei.

The gene products are responsible for cell cycle arrest in the G1 phase and for the synthesis of some signalling regulators, for example the protease Bar1, which terminates the pathway through a negative feedback loop that triggers receptor degradation. The corresponding Petri net consists of 42 places and 48 transitions. For details see Sackmann et al. (2006). We will consider some modelling techniques which have proved to be useful in modelling signal transduction pathways using Petri nets.
32.3.1 The Role of Place Invariants and Read Arcs

In signalling networks, we often consider proteins or protein complexes in an activated or inactivated form. We can model this situation by two places, one for the active and one for the inactive form, whose total number of tokens remains constant in every system state, thus forming a P-invariant (see Fig. 32.5). To ensure that the tokens remain in this cycle, the neighbouring transitions are connected via read arcs, which test for the presence of tokens in a place without consuming them.
[Figure 32.5: Petri net with the places Inactivated_protein, Activated_protein, p1 and p2 and the transitions Activation, Deactivation, t1, t2, t3 and t4.]
Fig. 32.5 A typical substructure representing activation and deactivation of the same protein. Read arcs have been used to guarantee that the token is either in the place Inactivated_protein or in the place Activated_protein
In the pheromone pathway, we apply this modelling technique to the G protein cycle. A change in the conformation of the receptor catalyses an exchange of GDP for GTP in the Gα subunit, leading to a dissociation of this monomer. The monomer reassociates with the dimer to form the trimeric G protein when GTP is hydrolysed to GDP. Fig. 32.6 depicts the more complex structure, consisting of two place invariants which overlap in one place. The two P-invariants are each composed of two places. They both involve the place Heterotrimer_Gαβγ and differ in the other place: one invariant contains GαGTP and the other Gβγ_dimer.
32.3.2 Feasible T-Invariants

T-invariants represent multi-sets of transitions. Firing these transitions reproduces the initial marking. The transitions can only fire if they are enabled, so whether the transitions of a T-invariant can actually fire to reproduce a given marking depends on the initial marking. If the initial marking is sufficient, i.e. all transitions of a T-invariant can fire, the T-invariant is feasible.
[Figure 32.6: Petri net with the places Receptor_complex, Heterotrimer_Gαβγ, GαGTP and Gβγ_dimer and the transitions Cleavage_α unit, Hydrolysis_GTP−>GDP, Interaction_Far1, Binding_to_Ste5 and Phosphorylation_of_Ste20.]
Fig. 32.6 The Petri net model of the G protein cycle. Two P-invariants are modelled, one consisting of Heterotrimer_Gαβγ and GαGTP and the other of Heterotrimer_Gαβγ and Gβγ_dimer
For the investigation of T-invariants, we have to consider two aspects. First, the feasibility of T-invariants depends on the initial marking, but the computation of T-invariants is independent of the initial marking. To get feasible T-invariants, P-invariant structures have to contain a sufficient number of tokens, assuming that read arcs are the only source of non-realisability. Second, read arcs cannot be properly represented in the incidence matrix of the Petri net, because the token balance on the place connected by a read arc to a transition is zero. This means that, in the corresponding linear equation system, this connection is not considered. Consequently, we obtain T-invariants which stop or start, respectively, at a P-invariant that is connected by read arcs to other parts of the net. Thus, if a P-invariant with read arcs in between occurs, T-invariants cannot reflect the entire way from the signal to the cell response.

However, we are interested in all possible ways of signal flow, i.e. information flow, from the signal to the cell response. Therefore, we have to connect these T-invariants in a proper way that preserves feasibility. For this reason, we process the minimal T-invariants by connecting them into linear combinations, resulting in feasible T-invariants z. Let t1, ..., tk be all the pre-transitions of pi which are not connected via a read arc with pi (see Fig. 32.7). These transitions belong to feasible T-invariants x. There is another transition tj which is connected to place pi via a read arc. A T-invariant which includes tj and excludes tl for all l ∈ {1, ..., k} is not feasible if place pi does not contain tokens. We have to apply the following procedure to each read arc between a transition tj and an empty place pi of a P-invariant. We combine the feasible T-invariants x and one non-feasible T-invariant y by

\[
z = x + y \quad \forall x \ \exists\, l \in \{1, \dots, k\} : x_l \neq 0, \quad \text{with } x_j = 0,\ y_j \neq 0 \ \wedge\ \forall\, l \in \{1, \dots, k\} : y_l = 0. \tag{32.1}
\]
[Figure 32.7: subnet with the pre-transitions t1, t2, ..., tk of place pi (T-invariants x) and the transition tj connected to pi by a read arc (T-invariant y).]

Fig. 32.7 A subnet illustrating the combination of minimal T-invariants in order to bridge a read arc, which is not properly represented in the incidence matrix and therefore interrupts T-invariants. t1, ..., tk and tj belong to minimal T-invariants. Place pi belongs to a P-invariant
Thus, we construct one linear combination z for each feasible minimal T-invariant x that produces a token on place pi. These combined T-invariants are not necessarily minimal anymore. However, because we construct only the necessary combinations of minimal T-invariants, the resulting combined T-invariants are minimal with respect to their feasibility in the initial marking. If there are several read arcs in the net, appropriate combinations have to be performed iteratively.

Let us return to the example in Fig. 32.5. Table 32.1 describes the T-invariants after processing them. The middle columns indicate whether the T-invariants are minimal, i.e. non-processed, and whether they are feasible.

Table 32.1 The minimal and feasible T-invariants of the example in Fig. 32.5

No.  Involved transitions               Minimal?  Feasible?  Composed of
1    Activation, Deactivation           √         √          –
2    t1, t3                             √         √          –
3    t2, t4                             √         –          –
4    Activation, Deactivation, t2, t4   –         √          3 + 1
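The feasibility column of Table 32.1 can be reproduced with a small search over firing orders. The sketch below is hypothetical in its wiring: the arcs of Fig. 32.5 cannot be recovered from the text, so the net below (t1 reading Inactivated_protein, t2 reading Activated_protein, one initial token on Inactivated_protein) is an assumption chosen so that the outcome agrees with the table. The function names are likewise introduced here.

```python
# Hypothetical wiring, chosen so that the outcome matches Table 32.1;
# the actual arcs of Fig. 32.5 are not recoverable from the text.
NET = {
    # transition: (consumed, produced, read)
    "Activation":   ({"Inactivated_protein": 1}, {"Activated_protein": 1}, {}),
    "Deactivation": ({"Activated_protein": 1}, {"Inactivated_protein": 1}, {}),
    "t1": ({}, {"p2": 1}, {"Inactivated_protein": 1}),
    "t3": ({"p2": 1}, {}, {}),
    "t2": ({}, {"p1": 1}, {"Activated_protein": 1}),
    "t4": ({"p1": 1}, {}, {}),
}
M0 = {"Inactivated_protein": 1, "Activated_protein": 0, "p1": 0, "p2": 0}


def enabled(marking, t):
    """Read arcs are tested exactly like ordinary input arcs."""
    consumed, _, read = NET[t]
    return (all(marking[p] >= w for p, w in consumed.items())
            and all(marking[p] >= w for p, w in read.items()))


def fire(marking, t):
    """Read arcs leave the marking unchanged; only ordinary arcs move tokens."""
    consumed, produced, _ = NET[t]
    new = dict(marking)
    for p, w in consumed.items():
        new[p] -= w
    for p, w in produced.items():
        new[p] += w
    return new


def feasible(invariant, marking=M0):
    """True if the multiset {transition: count} can be fired completely."""
    if not invariant:
        return True
    for t, n in invariant.items():
        if enabled(marking, t):
            rest = {u: c for u, c in invariant.items() if u != t}
            if n > 1:
                rest[t] = n - 1
            if feasible(rest, fire(marking, t)):
                return True
    return False


def combine(x, y):
    """z = x + y as in Eq. (32.1): component-wise sum of firing counts."""
    return {t: x.get(t, 0) + y.get(t, 0) for t in {**x, **y}}


inv1 = {"Activation": 1, "Deactivation": 1}
inv2 = {"t1": 1, "t3": 1}
inv3 = {"t2": 1, "t4": 1}
inv4 = combine(inv3, inv1)
feasibility = [feasible(inv) for inv in (inv1, inv2, inv3, inv4)]
```

Under these assumptions, invariant 3 is not feasible on its own because its read arc points at an empty place, whereas the combination 3 + 1 is feasible once Activation has fired. The exhaustive search over firing orders is only practical for small invariants.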
32.3.3 Modularisation Using MCT-Sets and T-Clusters

The transitions of a T-invariant together with the places between these transitions define a connected subnet. Thus, T-invariants decompose the network into smaller subnetworks, each defining a biologically meaningful pathway. However, the number of T-invariants grows exponentially with the network size and complexity. For network validation and exploration, a further decomposition of the network on the basis of T-invariants is therefore useful. There are two concepts for further network decomposition: MCT-sets and T-clusters.
In the first approach, we group transitions in such a way that those transitions that occur in exactly the same T-invariants form a maximal common transition set (MCT-set). Thus, MCT-sets are based on the supports of T-invariants. This leads to maximal sets of transitions, ϑ, for which

\[
\forall x \in X : \vartheta \subseteq \mathrm{supp}(x) \ \vee\ \vartheta \cap \mathrm{supp}(x) = \emptyset \tag{32.2}
\]
holds, where X denotes the set of all minimal T-invariants x. By Eq. (32.2), MCT-sets are disjoint sets. Because MCT-sets are based on T-invariants, which exhibit a biological meaning, MCT-sets should also be biologically significant. Thus, they have to be inspected for their biological meaning. Another important point is that the transitions of an MCT-set always act together. Thus, the corresponding genes have to be co-expressed. Another implication is that, for performing systematic single knockout experiments, it is sufficient to knock out only one of the transitions of an MCT-set. Hence, using MCT-sets we reduce the network complexity in a way which is biologically significant. Fig. 32.8 illustrates a small example with its T-invariants and MCT-sets.
[Figure 32.8: Petri net with the places p1, p2, p3 and p4 and the transitions Signal, t1, t2, t3, t4 and Response.]

T-invariants: 1. (Signal, t3, t4, Response) 2. (Signal, t1, t2, t4, 2 × Response)

MCT-sets: 1. (t3) 2. (Signal, t4, Response) 3. (t1, t2)
Fig. 32.8 A small example of a signalling pathway with its T-invariants and MCT-sets
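The MCT-sets of Fig. 32.8 can be recomputed from the supports of its two T-invariants. The sketch below relies on the observation that a set of transitions satisfies Eq. (32.2) exactly when all its members occur in the same T-invariants, so the maximal sets are the groups of transitions sharing one membership pattern. The function name and data layout are assumptions made for this illustration.

```python
from collections import defaultdict

# Supports of the two T-invariants listed in Fig. 32.8.
T_INVARIANTS = {
    1: {"Signal", "t3", "t4", "Response"},
    2: {"Signal", "t1", "t2", "t4", "Response"},
}


def mct_sets(invariants):
    """Group transitions that occur in exactly the same T-invariants."""
    membership = defaultdict(set)
    for index, support in invariants.items():
        for transition in support:
            membership[transition].add(index)
    groups = defaultdict(set)
    for transition, indices in membership.items():
        groups[frozenset(indices)].add(transition)
    # Sort the groups by their sorted member names for a stable output order.
    return sorted(groups.values(), key=sorted)


result = mct_sets(T_INVARIANTS)
```

The result contains the three sets listed in the figure: (Signal, t4, Response), (t1, t2) and (t3).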
In the second approach of network decomposition, we cluster the support vectors of T-invariants, which leads to a more differentiated network decomposition. The resulting T-clusters and the corresponding places form overlapping subnetworks. As for MCT-sets, we consider the support of T-invariant vectors. Applying a simple distance measure, the Tanimoto index, we use hierarchical clustering, mainly UPGMA or Neighbour Joining, to get T-clusters. For a more detailed description, see Grafahrend-Belau et al. (2008). In contrast to T-invariants, MCT-sets and T-clusters with the corresponding places can also lead to disconnected subnets. For the results for the pheromone pathway, see Sackmann et al. (2006).
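The distance computation underlying T-clusters can be sketched as follows. This is a minimal illustration that assumes the distance is taken as one minus the Tanimoto coefficient of two supports; the exact variant used in the cited work may differ, and the hierarchical clustering step itself (UPGMA or Neighbour Joining) is not shown.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two supports: shared transitions over all transitions."""
    return len(a & b) / len(a | b)


# Supports of the two T-invariants of Fig. 32.8.
supports = [
    {"Signal", "t3", "t4", "Response"},
    {"Signal", "t1", "t2", "t4", "Response"},
]

# Pairwise distance matrix (assumed here to be 1 - similarity).
distance = [[1 - tanimoto(a, b) for b in supports] for a in supports]
```

The two T-invariants share three of six transitions, giving a distance of 0.5. A matrix of this form is the usual input to a hierarchical clustering routine.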
32.3.4 Mauritius Maps and Knockout Analysis

One important application of theoretical models, in particular in biochemistry, is the simulation of knockout experiments, which give new insights into system behaviour. Through knockout analyses, alternative pathways can be found, which can be used, for example, to estimate the robustness of a system. A ranking of transitions can also be deduced, which can be applied in experiment design. Using Petri nets, knockouts can easily be performed by removing transitions or places from the network graph. Then, we compute the T-invariants of the modified model again. Now, we can determine the percentage of affected T-invariants, each representing a specific function which has been destroyed by the knockout. To determine which transitions have the highest or lowest impact on the net behaviour, we measure the number of affected, i.e. destroyed, T-invariants after knocking out the transition(s) of interest. To explore functional dependencies between T-invariants we construct a Mauritius map, which represents the T-invariants of a system in a special data structure, a binary tree, for which the following holds:

Definition 32.4 (Mauritius Map) Let N = (P, T, F, W, m0) be a Petri net and X the set of all T-invariants x. A Mauritius map is a finite binary tree, T = (V, E), where
• V is the finite non-empty set of all transitions belonging to at least one T-invariant. The root vertex is located in the lower left corner.
• E = (H, R) is a finite non-empty set of edges between vertices, indicating dependencies of T-invariants.
• The set H contains horizontal edges, which connect, together with the vertical edges lying in between, vertices of the same T-invariant.
• The set R contains vertical edges, which represent bifurcation points and connect vertices of the left subtree with vertices of the right upper subtree. Both subtrees belong to the same T-invariant.
The root vertex has no left subtree, but a right subtree, which contains all transitions covering the Petri net. The leaves correspond to the transitions with a left subtree and a right subtree which consists of only that transition. Interior vertices exhibit left and right subtrees, each describing subnetworks. These subnetworks are determined by the vertices on the path from the root to the interior vertex considered. Bifurcation points in the tree, located on an edge, indicate that the remaining transitions belong to two different T-invariants. Both T-invariants contain the transitions on the edge before the bifurcation. Additionally, the T-invariants involve either the transitions on the horizontal edge to the right of the bifurcation or the transitions on the horizontal edge connected by the vertical edge of the bifurcation. For the example of Fig. 32.8, whose Mauritius map is shown in Fig. 32.9, the transitions Signal, Response and t4 belong to both T-invariants. Transitions t1 and t2, located on the horizontal edge to the right of the bifurcation, are
part of T-invariant 2, and transition t3, located on the horizontal edge drawn parallel to the other horizontal edge, is part of T-invariant 1. When a transition is knocked out, the net is separated into two subnets: one part of the net remains active and the other part loses its biological function. The subnet defined by the left child does not contain the knocked-out transition and represents the function of the network that has not been affected by the knockout. Thus, only those pathways which cover the left child and its successors are not affected and maintain their biological functionality. The subnet defined by the right child and its successors covers all affected pathways. The most important transitions are given by the horizontal line which begins in the root and runs to the first bifurcation point. Knocking out one of these transitions affects all T-invariants, because these transitions are common to all T-invariants. Hence, their impact on the system is the highest in comparison with knocking out other transitions. For illustration, see Figs. 32.8 and 32.9. The network can be decomposed into two fundamental pathways described by the two T-invariants. For the second T-invariant, the transition Response has to fire twice because place p3 receives two tokens, one from the firing of transitions t1 and t2 and one from the firing of transition t4. There are two MCT-sets consisting of more than one element.
[Figure 32.9: Mauritius map with edges labelled Signal, Response, t4, t1, t2 and t3.]
Fig. 32.9 The graphical representation of the Mauritius map for the example in Fig. 32.8. The root is located in the left bottom corner. Following the horizontal lines, we see the two T-invariants with the common part of the three transitions, Signal, Response, and t4, and the different parts separated by the vertical line. The most important transitions are Signal, t4 and Response. A knockout of one of them would be lethal for the system. In contrast, a knockout of transitions t1 and/or t2 or of transition t3 would not influence the respective other T-invariant, and the system would still be able to trigger the signal to the response
T-invariant 2 is represented by the horizontal line at the bottom. The other T-invariant, T-invariant 1, is drawn as two horizontal lines connected by a vertical line. The most important transitions are those of MCT-set 2 (Signal, t4 and Response). A knockout of one of these transitions would be lethal for the system. If transition t1 or t2 is knocked out, the system can still work via T-invariant 1, i.e. through the firing of transitions t3 and t4.
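The ranking of transitions by knockout impact can be sketched for the example of Fig. 32.8. The sketch assumes that only transitions are knocked out, in which case the surviving minimal T-invariants are exactly those whose support does not contain a removed transition, so recomputing the invariants can be replaced by a simple filter. Function and variable names are introduced here for illustration.

```python
# Supports of the two T-invariants listed in Fig. 32.8.
T_INVARIANTS = {
    1: {"Signal", "t3", "t4", "Response"},
    2: {"Signal", "t1", "t2", "t4", "Response"},
}


def knockout_impact(invariants, knocked_out):
    """Percentage of T-invariants destroyed by removing the given transitions."""
    removed = set(knocked_out)
    affected = [i for i, support in invariants.items() if support & removed]
    return 100.0 * len(affected) / len(invariants)


transitions = sorted(set().union(*T_INVARIANTS.values()))
impact = {t: knockout_impact(T_INVARIANTS, [t]) for t in transitions}

# Highest impact first; ties are broken alphabetically.
ranking = sorted(transitions, key=lambda t: (-impact[t], t))
```

Signal, t4 and Response each destroy both T-invariants (100 %), whereas t1, t2 and t3 each destroy only one (50 %), in line with the reading of the Mauritius map. Knocking out places would change the invariant structure and is not covered by this shortcut.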
32.4 Logical Approach

The logical approach introduced by R. Thomas defines models in which components influence each other (see Thomas et al. 1995 and references therein). This abstracted description contrasts with reaction models, as it only retains the activatory or inhibitory effects between components (Fages and Soliman 2008). After a brief introduction to logical modelling, we discuss the dynamic properties that can be studied within this formalism (for further details, see Chaouiya et al. 2003; Thomas et al. 1995).

Definition 32.5 A logical regulatory graph (LRG) R = (G, E, K) is a graph (G, E) associated with a set of logical functions K, defined as follows:
• G = {G1, G2, ..., GN} is the set of regulatory components (genes, proteins or even phenomenological components), each being associated with a variable xi, which denotes its discrete level (of concentration, of activity, etc.) and takes its values in a finite set of non-negative integers {0, ..., Maxi}. A state of the system is thus defined as a vector x = (x1, ..., xN) ∈ S = ∏_{i=1,...,N} {0, ..., Maxi} (the set S is the set of all states).
• E ⊆ G × G is the set of regulatory interactions between the elements of G.
• K = {Ki, i = 1, ..., N} is the set of logical functions that define the behaviours of the regulatory components depending on the state of the system. Given a state x, Ki(x) gives the target level of Gi in state x: ∀Gi ∈ G, Ki : S → {0, ..., Maxi}.

A number of remarks follow this definition. First, the Boolean case (i.e. Maxi = 1, ∀Gi ∈ G) is often sufficient to convey the roles of the regulatory components. However, the consideration of multi-valued variables has proved to be useful in some specific cases. For example, if a component regulates two targets, it is not likely that both regulatory effects occur for the same range of values of the regulator (see e.g. Thieffry and Thomas 1995).
Moreover, in some cases, dual effects occur when, depending on the level of the regulator, the effect is either positive or negative. For simplicity, in what follows, we will restrict ourselves to the Boolean case. However, all the techniques and properties can be generalised to the multi-valued case, at the cost of cumbersome notation. For any regulatory component Gi of G, its logical function fully defines the set of its regulators Reg(Gi) ⊆ G:

\[
G_j \in \mathrm{Reg}(G_i) \ \text{ if } \ \exists x \in S,\ K_i(x) \neq K_i(x^{j}),
\]

where $x^{j}_k = x_k$ for all $k \neq j$ and $x^{j}_j = x_j \pm 1$, i.e. the vectors $x$ and $x^{j}$ differ only in their jth components. The logical functions are conveniently represented by ordered binary decision diagrams (BDD) (Bryant 1986; Naldi et al. 2007). Given Gi, the BDD representation of Ki is a rooted directed acyclic graph (DAG) encompassing (internal) decision
nodes and two terminal nodes labelled 0 and 1. Decision nodes correspond to the Boolean variables carrying the levels of the regulators of Gi . Each decision node has two child nodes (left and right nodes), and the edge from a decision node to its left (resp. right) node represents an assignment of this variable to 0 (resp. 1). The decision variables appear always in the same order (arbitrarily chosen). A state of the regulators of Gi determines a unique path in the BDD representing Ki and the reached terminal node gives the target level of Gi (see Fig. 32.10).
Fig. 32.10 A logical regulatory graph encompassing four regulatory components with the BDD representation of the logical functions (top). From the BDD of KG2, we recover all situations for which the target value of G2 is 0 by listing all the paths leading to the terminal node labelled 0. Hence G2's target value is 0 if (x1 = x3 = 0) or (x1 = 0 and x3 = x4 = 1) or (x1 = x4 = 1). Similarly, G2's target value is 1 if (x1 = x4 = 0 and x3 = 1) or (x1 = 1 and x4 = 0). The asynchronous STG (bottom right) for the initial state in fuchsia (0, 1, 0, 1) (i.e. x1 = x3 = 0, x2 = x4 = 1) shows that the three possible stable states (oval nodes) are reachable. In the case of a synchronous updating (bottom left), from the same initial state, the four components change simultaneously, reaching the state (1, 0, 1, 0), which in turn has the initial state as unique successor. Hence, the updating scheme drastically changes the qualitative behaviour: in the asynchronous dynamics, the system reaches one stable state, whereas in the synchronous dynamics, the system is trapped in a cyclical attractor
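The path-following evaluation of a BDD can be sketched for KG2 using only the conditions listed in the caption of Fig. 32.10. The variable order x1, x3, x4, the tuple encoding of decision nodes and the sharing of the x4 node are assumptions made for this illustration; the diagram in the original figure may be laid out differently.

```python
# Decision node: (variable, child for value 0, child for value 1).
# Terminal nodes are the integers 0 and 1.
X4_NODE = ("x4", 1, 0)                 # x4 = 0 leads to 1, x4 = 1 leads to 0
K_G2 = ("x1",
        ("x3", 0, X4_NODE),            # branch x1 = 0
        X4_NODE)                       # branch x1 = 1


def evaluate(node, state):
    """Follow the unique path selected by the state down to a terminal node."""
    while isinstance(node, tuple):
        variable, low, high = node
        node = high if state[variable] else low
    return node


def k_g2_from_caption(x1, x3, x4):
    """Target value 1 exactly in the two cases listed in the caption."""
    return int((x1 == 0 and x4 == 0 and x3 == 1) or (x1 == 1 and x4 == 0))


target = evaluate(K_G2, {"x1": 0, "x3": 0, "x4": 1})   # x1 = x3 = 0, so target 0
```

In the initial state (0, 1, 0, 1) of the figure, x1 = x3 = 0, so the path ends in the terminal node 0 and G2 is called to decrease from 1 to 0.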
For multi-valued models, multi-valued decision diagrams (MDD) represent the logical functions in a similar way. In this case the decision variables may have more than two values, see Naldi et al. (2007) and references therein for further details. In the definition above, we have employed the terms “regulatory components” and “regulatory interactions” because this formalism originates from the purpose of handling gene regulatory networks. It is worth clarifying here the meaning of “regulatory”: an activatory or inhibitory effect arises, leading to the change of the
level of the interaction target. This semantics is thus well adapted to handle signal flows in signal transduction networks (Saez-Rodriguez et al. 2007).

Behaviours of logical models are commonly represented as a state transition graph, where nodes are the states of the system and arcs are possible transitions connecting these states.

Definition 32.6 Given an LRG R = (G, E, K), its fully asynchronous state transition graph (STG) is a (finite) directed graph (S, T), where
• S is the state space,
• $T \subset S^2$ is the set of transitions defined as follows: $(x, y) \in T$ (y is a successor of x) if and only if $\exists G_i \in G$ such that for all $j \neq i$, $x_j = y_j$ and $y_i = x_i + \delta_i(x)$, where

\[
\delta_i(x) =
\begin{cases}
0 & \text{if } K_i(x) = x_i \text{ (the value of } G_i \text{ should not change)},\\
1 & \text{if } K_i(x) - x_i > 0 \text{ (the value of } G_i \text{ should increase)},\\
-1 & \text{if } K_i(x) - x_i < 0 \text{ (the value of } G_i \text{ should decrease)}.
\end{cases}
\]
Given an initial state x0 ∈ S, one can also define (S|x0, T|x0), sub-graph of (S, T), as follows: x0 ∈ S|x0 and ∀x ∈ S|x0, ∃y ∈ S s.t. (x, y) ∈ T ⟹ y ∈ S|x0 and (x, y) ∈ T|x0. The STG as defined above encompasses all the 2^N states of the state space S and all possible transitions between these states. It is worth noting here that we have considered a fully asynchronous dynamics, meaning that a successor state differs from its predecessor in exactly one component. More precisely, a state has as many successors as the number of components called to change their values. This means that, in the absence of information on the related delays, conflicting transitions are not resolved (that is, which component wins the race and changes first). This updating policy is similar to that of P/T nets, leading to non-deterministic behaviours. In contrast, a number of authors developing Boolean models rely on synchronous dynamics where each state has at most one successor, all changes being performed simultaneously (see e.g. Irons 2009). Figure 32.10 illustrates the differences between the two dynamics.
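The difference between the two updating schemes can be sketched as follows. The two-component network used here (each component inhibits the other) is a hypothetical toy model chosen for brevity, not the four-component model of Fig. 32.10, and the function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
Rule = Callable[[State], int]  # logical function K_i: state -> target level


def delta(target: int, current: int) -> int:
    """Return +1, -1 or 0 depending on whether the level should rise, fall or stay."""
    return (target > current) - (target < current)


def async_successors(x: State, rules: Sequence[Rule]) -> List[State]:
    """One successor per component called to change (fully asynchronous updating)."""
    successors = []
    for i, k in enumerate(rules):
        d = delta(k(x), x[i])
        if d != 0:
            successors.append(x[:i] + (x[i] + d,) + x[i + 1:])
    return successors


def sync_successor(x: State, rules: Sequence[Rule]) -> State:
    """All components called to change are updated simultaneously."""
    return tuple(xi + delta(k(x), xi) for xi, k in zip(x, rules))


# Hypothetical toy model: G1 and G2 inhibit each other.
toggle = [lambda x: 1 - x[1], lambda x: 1 - x[0]]

print(async_successors((0, 0), toggle))  # [(1, 0), (0, 1)]
print(sync_successor((0, 0), toggle))    # (1, 1)
print(sync_successor((1, 1), toggle))    # (0, 0)
```

From (0, 0) the asynchronous scheme yields two competing successors, each of which is a stable state, whereas the synchronous scheme oscillates between (0, 0) and (1, 1). This mirrors, on a smaller scale, the contrast shown in Fig. 32.10.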
32.4.1 Analysis of Logical Regulatory Graphs

The analysis of an LRG behaviour can be performed on the basis of the STG. In particular, it is possible to identify the stable states and more complex attractors (cycle compositions or terminal strongly connected components in terms of graph theory). Moreover, by searching paths in the STG, it is possible to evaluate the reachability of these attractors. However, as already pointed out for Petri nets, this graph representation of the behaviour is often too large to be efficiently analysed. Hence the necessity to resort to other methods.
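For small Boolean models, the STG-based analysis described above can be sketched directly. The code below builds the part of the asynchronous STG reachable from an initial state and extracts its attractors as terminal strongly connected components. The two circuits are hypothetical toy models, and the quadratic closure computation is a simplification that only suits small graphs.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

State = Tuple[int, ...]
Rule = Callable[[State], int]


def async_successors(x: State, rules: Sequence[Rule]) -> List[State]:
    """Boolean asynchronous successors: update one component called to change."""
    return [x[:i] + (k(x),) + x[i + 1:] for i, k in enumerate(rules) if k(x) != x[i]]


def reachable_stg(x0: State, rules: Sequence[Rule]) -> Dict[State, List[State]]:
    """Explore the STG restricted to the states reachable from x0."""
    graph: Dict[State, List[State]] = {}
    stack = [x0]
    while stack:
        x = stack.pop()
        if x not in graph:
            graph[x] = async_successors(x, rules)
            stack.extend(graph[x])
    return graph


def reach(graph: Dict[State, List[State]], x: State) -> Set[State]:
    """States reachable from x, including x itself."""
    seen, stack = {x}, [x]
    while stack:
        for y in graph[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def attractors(graph: Dict[State, List[State]]) -> Set[FrozenSet[State]]:
    """Terminal SCCs: x belongs to one iff every state reachable from x reaches x."""
    closure = {x: reach(graph, x) for x in graph}
    return {frozenset(c) for x, c in closure.items()
            if all(x in closure[y] for y in c)}


# Hypothetical toy circuits on two components.
POSITIVE = [lambda x: 1 - x[1], lambda x: 1 - x[0]]  # mutual inhibition
NEGATIVE = [lambda x: 1 - x[1], lambda x: x[0]]      # G1 activates G2, G2 inhibits G1

stable_attractors = attractors(reachable_stg((0, 0), POSITIVE))
cyclic_attractors = attractors(reachable_stg((0, 0), NEGATIVE))
```

With these toy rules, the positive circuit gives two attractors that are stable states, (1, 0) and (0, 1), both reachable from (0, 0), while the negative circuit gives a single cyclic attractor visiting all four states. For realistic models the reachable STG quickly becomes too large for such an explicit exploration, which is the limitation noted in the text.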
All potential stable states are efficiently identified from the specification of the LRG (Naldi et al. 2007). Although this does not solve the reachability problem, it allows a validation of the model and might facilitate the use of model-checking tools by specifying both the initial and ending points of the required path.

Circuit analysis constitutes a powerful means for analysing LRGs. It is based on the following rules, initially stated by R. Thomas and formally proved since then: a positive regulatory circuit is necessary for multistationarity (emergence of multiple attractors), whereas a negative circuit is necessary for cyclical behaviour (see Thieffry 2007 for a review). These are only necessary conditions because surrounding regulators can prevent a circuit from being functional, that is to say from generating the expected property. Based on the notion of functionality contexts, the software tool GINsim pinpoints the circuits that may be at the origin of essential dynamical properties (see Table 32.3 and Naldi et al. 2009). It is not our purpose to develop this topic further in this chapter, although feedback circuits play essential roles in signalling pathways.

Finally, it is worth noting that perturbation analyses are easily performed with logical models, as they generally amount to fixing the value of a discrete variable (see e.g. Fauré et al. 2009; Sánchez et al. 2008).
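The remark that a positive circuit is necessary but not sufficient for multistationarity can be illustrated on a small example. The sketch below enumerates stable states by brute force over the whole state space; this is an assumption made for simplicity and is not the decision-diagram-based method of Naldi et al. (2007). The three-component model is hypothetical: G1 and G2 inhibit each other, and an input G3 additionally represses G1.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
Rule = Callable[[State], int]


def stable_states(rules: Sequence[Rule]) -> List[State]:
    """Brute-force search for states x with K_i(x) = x_i for every component."""
    n = len(rules)
    return [x for x in product((0, 1), repeat=n)
            if all(k(x) == x[i] for i, k in enumerate(rules))]


# Hypothetical model, state x = (x1, x2, x3).
rules = [
    lambda x: int(not x[1] and not x[2]),  # G1: active if neither G2 nor G3 is present
    lambda x: int(not x[0]),               # G2: active in the absence of G1
    lambda x: x[2],                        # G3: input, keeps its current value
]

found = stable_states(rules)
without_input = [x for x in found if x[2] == 0]
with_input = [x for x in found if x[2] == 1]
```

When the input is absent (x3 = 0), the positive circuit between G1 and G2 is functional and two stable states coexist, (0, 1, 0) and (1, 0, 0). When the input is present (x3 = 1), G1 is locked at 0 and only (0, 1, 1) remains, so the same circuit no longer generates multistationarity. The enumeration visits all 2^N states and is therefore only practical for small N.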
32.4.2 From Logical Regulatory Graphs to Petri Nets

Due to the impressive amount of genomic data, regulatory networks to be studied tend to increase in size and complexity, complicating the modelling and analysis of their behaviours. The P/T net representation of logical models complements the techniques mentioned above, as it opens the way to the use of computational tools developed by the Petri net community for almost half a century. An additional motivation is the development of a Petri net-based framework for the qualitative integrated modelling of regulated metabolic networks as delineated in Simão et al. (2005).

In contrast to the consumption/production mechanism underlying a chemical reaction (naturally represented in terms of P/T nets), a regulatory interaction denotes an influence in the course of which the target state changes, while the regulator state remains unchanged. P/T nets are not meant to model such interactions. In the case of an activatory effect, the presence of the activator leads to an increase of the target level and the absence of the activator may provoke the decrease of the target level (and the other way around for a repression). Such situations might be represented in Petri nets using inhibitory arcs, which allow testing for zero. However, the analysis methods based on the algebraic representation of P/T nets are no longer valid when using these inhibitory arcs. Fortunately, when places are bounded (their markings are limited), one can avoid the use of inhibitory arcs by considering new complementary places.

In what follows, a representation of LRGs in terms of P/T nets is presented. This representation was initially defined in Chaouiya et al. (2004) for Boolean models. Definition 32.7 formalises this construction of a P/T net, relying on the BDD
representation of the logical functions. The generalisation to multi-valued models can be found in Chaouiya et al. (2006). Recall that, for the sake of simplicity, we restrict ourselves to Boolean models.

Given a Boolean LRG, the construction of a P/T net representation encompasses two complementary places for each regulatory component, and as many transitions as paths in the BDDs representing the logical functions. Let us first introduce some additional notations. Given $G_i$, a regulatory component, and $K_i$ its logical function, we denote $\{\Phi_k\}_{k=0,\ldots,p}$ the set of paths from the root to a terminal node of the BDD representing $K_i$. The number of paths, p, is less than or equal to $2^{|Reg(G_i)|}$ (2 to the power of the number of regulators of $G_i$). A path $\Phi_k$ ending at a terminal node labelled $v_k$ (with $v_k$ = 0 or 1) specifies a combination of the levels of the regulators for which the target value of $G_i$ is $v_k$. More precisely,
• if path $\Phi_k$ encompasses a left edge going out of the decision variable $x_j$, the assignment of $x_j$ for $\Phi_k$ equals 0, denoted $\Phi_k(x_j) = 0$;
• if path $\Phi_k$ encompasses a right edge going out of the decision variable $x_j$, the assignment of $x_j$ for $\Phi_k$ equals 1, denoted $\Phi_k(x_j) = 1$;
• if, along the path $\Phi_k$, a decision variable $x_j$ ($G_j \in Reg(i)$) does not appear (due to the simplification of the BDD), it means that the assignment for the remaining variables is sufficient to determine the target value of $G_i$. It is the case in the example of Fig. 32.10, for which the assignment x1 = x3 = 0 is sufficient to determine the target value of G2.

Definition 32.7 Given an LRG R = (G, E, K), with $Max_i = 1$, $\forall i = 1, \ldots, N$ (i.e. a Boolean LRG) and an initial state $x^0 \in S$, we define the corresponding P/T net $N = (P, T, F, W, m_0)$ as follows:
• $P = \{G_i, \widetilde{G}_i\}_{G_i \in G}$ is the set of places, with two complementary places for each regulatory component, with $m_0(G_i) = x^0_i$ and $m_0(\widetilde{G}_i) = 1 - x^0_i$.
• For each $G_i \in G$, for each path $\Phi_k$ from the root to a leaf of the BDD representing $K_i$, one transition is defined, denoted $t^+_{i,k}$ if $v_k = 1$ or $t^-_{i,k}$ if $v_k = 0$.
• A transition $t^+_{i,k}$ (or $t^-_{i,k}$) is connected to
  – place $G_j$, $j \in Reg(i)$, with a test arc if $x_j$ appears along $\Phi_k$ and if $\Phi_k(x_j) = 1$,
  – place $\widetilde{G}_j$, $j \in Reg(i)$, with a test arc if $x_j$ appears along $\Phi_k$ and if $\Phi_k(x_j) = 0$.
• A transition $t^+_{i,k}$ is further connected to
  – place $G_i$, with an incoming arc (increasing the level of $G_i$),
  – place $\widetilde{G}_i$, with an outgoing arc (ensuring that the current level of $G_i$ is 0 and decreasing the marking of this complementary place).
• Symmetrically, a transition $t^-_{i,k}$ is further connected to
  – place $G_i$, with an outgoing arc decreasing the level of $G_i$,
  – place $\widetilde{G}_i$, with an incoming arc.

Fig. 32.11 P/T net construction from the BDD representation of the logical functions illustrated for G2, component of the model in Fig. 32.10. The BDD representing KG2 (top left) is unfolded for a better visualisation of the five paths from the root to the terminal nodes (top right). These paths are named $\Phi_0$ to $\Phi_4$. The piece of P/T net related to G2 is displayed on the bottom part: transitions are coloured according to the corresponding path colour. Given a path $\Phi_i$ leading to a terminal node 0 (resp. a terminal node 1), read arcs connect the transition $t^-_{\Phi_i}$ (resp. $t^+_{\Phi_i}$) according to the assignment indicated by $\Phi_i$, and the effect of $t^-_{\Phi_i}$ (resp. $t^+_{\Phi_i}$) is carried out by the arc going out of $G_2$ and the arc going into $\widetilde{G}_2$ (resp. by the arc going out of $\widetilde{G}_2$ and the arc going into $G_2$)

This construction, illustrated in Fig. 32.11, has been implemented in the software GINsim, which provides an export function from logical to Petri net models, with the possibility to choose among several formats (Naldi et al. 2009). It can be easily proved that every reachable marking m verifies $m(G_i) + m(\widetilde{G}_i) = 1$. Moreover, the reachability graph of N is isomorphic to the STG $(S|_{x^0}, T|_{x^0})$.
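The construction of Definition 32.7 can be sketched for the single component G2 of Fig. 32.10. In the code below, the five paths are those read off the caption of Fig. 32.10 and are numbered in an arbitrary order that need not match Fig. 32.11. The `Transition` record, the `~` prefix for complementary places and the function names are all illustrative assumptions; this is not the export routine of GINsim.

```python
from typing import Dict, List, NamedTuple, Tuple

Marking = Dict[str, int]


class Transition(NamedTuple):
    name: str
    tests: Tuple[str, ...]  # places connected by test (read) arcs
    consumes: str           # place with an arc going into the transition
    produces: str           # place with an arc coming from the transition


def comp(place: str) -> str:
    """Name of the complementary place (hypothetical naming convention)."""
    return "~" + place


def transitions_for(target: str, bdd_paths) -> List[Transition]:
    """One transition per BDD path: t+ if the terminal is 1, t- if it is 0."""
    result = []
    for k, (assignment, value) in enumerate(bdd_paths):
        tests = tuple(g if level == 1 else comp(g) for g, level in assignment.items())
        if value == 1:  # raise the level: take the token from ~G, put one in G
            result.append(Transition(f"t+_{target},{k}", tests, comp(target), target))
        else:           # lower the level: take the token from G, put one in ~G
            result.append(Transition(f"t-_{target},{k}", tests, target, comp(target)))
    return result


def initial_marking(state: Dict[str, int]) -> Marking:
    """m0(G) = x0 and m0(~G) = 1 - x0 for every component."""
    marking: Marking = {}
    for g, level in state.items():
        marking[g] = level
        marking[comp(g)] = 1 - level
    return marking


def enabled(t: Transition, m: Marking) -> bool:
    return m[t.consumes] >= 1 and all(m[p] >= 1 for p in t.tests)


def fire(t: Transition, m: Marking) -> Marking:
    new = dict(m)
    new[t.consumes] -= 1
    new[t.produces] += 1
    return new


# Paths of the BDD of K_G2 as (assignment of the tested regulators, terminal value).
PATHS_G2 = [
    ({"G1": 0, "G3": 0}, 0),
    ({"G1": 0, "G3": 1, "G4": 0}, 1),
    ({"G1": 0, "G3": 1, "G4": 1}, 0),
    ({"G1": 1, "G4": 0}, 1),
    ({"G1": 1, "G4": 1}, 0),
]

net_g2 = transitions_for("G2", PATHS_G2)
m0 = initial_marking({"G1": 0, "G2": 1, "G3": 0, "G4": 1})
fireable = [t for t in net_g2 if enabled(t, m0)]
m1 = fire(fireable[0], m0)
```

For the initial state (0, 1, 0, 1), only the transition built from the path x1 = x3 = 0 is enabled. Firing it moves the token of G2 to its complementary place, so the level of G2 drops to 0 while the sum of the two markings stays equal to 1. Only the piece of net related to G2 is built here, because the chapter text does not spell out the logical functions of the other three components.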
32.4.3 Illustration: Mating and Filamentous Pathways in Yeast

We present a very much simplified model of the crosstalk between two signalling pathways in yeast: the mating pathway (see Section 32.3), triggered by pheromone, and the filamentous growth pathway, which responds to nutrient limitation. These pathways have been subject to various modelling studies, in particular focusing on
the response specificity (Bardwell et al. 2007; Rensing and Ruoff 2009). Rather than a study of the functioning of these pathways, our aim here is to illustrate the logical modelling as a convenient framework to qualitatively describe signalling cascades in an abstract way.
Fig. 32.12 A simple logical model of yeast mating/invasive growth signalling pathways (inspired from Bardwell et al. 2007). On the top, the set G of the nine nodes is given, as well as a few elements of E, the set of interactions. The regulatory graph is displayed together with the logical functions associated with Fus3 and FREs. The rule for Fus3 stipulates that both Ste5 and Ste7 must be activated to phosphorylate Fus3. This active form of Fus3 then leads to the activation of the genes required for mating (PREs) and to the inhibition of the genes related to filamentous growth (FREs)
We consider the main specific and shared components of the two pathways as delineated in Bardwell et al. (2007) (see Fig. 32.12):
• two external signals (pheromone and nutrient deprivation),
• the scaffold protein Ste5 recruited during mating,
• the kinases Ste7 and Ste11,
• the MAPK Fus3 and the MAPK Kss1,
• the outcomes of the signalling pathways: PREs (pheromone response elements) or FREs (filamentation response elements).
The behaviour of this model is illustrated in Fig. 32.13. Using GINsim, all potential stable states of the system can be easily identified: these encompass a trivial state (where all components are inactive), two mating states and one filamentous growth state. Note that the two mating states correspond to the situations where the cell is exposed to a pheromone signal alone or in combination with a nutrient deprivation signal. Hence, this model suggests that in the presence of both signals, the cell undergoes mating. However, this model obviously needs to be refined. For example, it does not correctly account for the phenotype observed in cells lacking Fus3. Indeed, the simulation of a Fus3 knockdown (which amounts to constraining the corresponding
              pheromone  nutrient deprivation  Ste5  Ste11  Ste7  Fus3  Kss1  PREs  FREs
trivial           0               0              0     0      0     0     0     0     0
filamentous       0               1              0     1      1     0     1     0     1
mating            1               *              1     1      1     1     1     1     0

Fig. 32.13 Dynamic properties of the logical model presented in Fig. 32.12. On the left, the table displays the four stable states of the model. The first row indicates the names of the nodes, while the following rows give the activity states (0 or 1). Besides the trivial state (no signal provided, no component activated), the second state relates to the sole presence of the nutrient deprivation signal and corresponds to the activation of filamentous growth. The fourth row indicates that in the presence of a pheromone signal (whatever the state of the nutrient deprivation signal), the cell is in a mating state. On the right, the STG obtained when the initial state (on the top of the graph) encompasses a pheromone signal (the order of the variables in the vector states is that of the table of the stable states). This STG displays the possible alternative trajectories towards the final state (ellipse node at the bottom). In particular, states characteristic of a filamentous growth response (dark nodes) might transiently occur. The chosen trajectory depends on the relative delays between conflicting transitions
variable to the constant value 0) with a pheromone signal, leads to a filamentous state, with the level of FREs being 1 and that of PREs being 0. This is not correct, although in the absence of Fus3, pheromone activates filamentation-specific genes through the activation of Kss1. Figure 32.14 illustrates the P/T representation of the simple logical model of the yeast mating/invasive growth depicted in Fig. 32.12.
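The knockdown experiment described above can be sketched as follows. Only the rule for Fus3 (Ste5 AND Ste7) is stated in the chapter; every other rule below is an assumption chosen so that the model reproduces the stable states described for Fig. 32.13, so this should be read as a hypothetical stand-in and not as the authors' model. A knockdown is simulated by forcing the target value of the chosen node to 0.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

State = Dict[str, int]

NODES = ["pheromone", "nutrient", "Ste5", "Ste11", "Ste7",
         "Fus3", "Kss1", "PREs", "FREs"]

RULES: Dict[str, Callable[[State], int]] = {
    "pheromone": lambda s: s["pheromone"],                # input keeps its value
    "nutrient":  lambda s: s["nutrient"],                 # input keeps its value
    "Ste5":  lambda s: s["pheromone"],                    # assumed rule
    "Ste11": lambda s: s["pheromone"] | s["nutrient"],    # assumed rule
    "Ste7":  lambda s: s["Ste11"],                        # assumed rule
    "Fus3":  lambda s: s["Ste5"] & s["Ste7"],             # rule given with Fig. 32.12
    "Kss1":  lambda s: s["Ste7"],                         # assumed rule
    "PREs":  lambda s: s["Fus3"],                         # assumed rule
    "FREs":  lambda s: s["Kss1"] & (1 - s["Fus3"]),       # assumed rule
}


def stable_states(knockdown: Optional[str] = None) -> List[State]:
    """Brute-force stable states; `knockdown` locks one node at level 0."""
    found = []
    for values in product((0, 1), repeat=len(NODES)):
        s = dict(zip(NODES, values))
        if all((0 if n == knockdown else RULES[n](s)) == s[n] for n in NODES):
            found.append(s)
    return found


wild_type = stable_states()
mutant = [s for s in stable_states(knockdown="Fus3")
          if s["pheromone"] == 1 and s["nutrient"] == 0]
```

Under these assumed rules the wild type has four stable states (trivial, filamentous, and mating with or without nutrient deprivation), and the Fus3 knockdown exposed to pheromone settles in a state with FREs = 1 and PREs = 0, which is the behaviour discussed in the text.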
32.5 Summary and Conclusions

This chapter introduces two qualitative modelling techniques of molecular networks, in particular of signal transduction networks. In a first step, Petri net modelling is presented. After a short presentation of the main definitions and properties, illustrations are provided through models of signal transduction pathways. We mainly focus on transition invariant analysis, which plays a major role in model validation and in in silico knockout analysis. Minimal T-invariants, also known as elementary modes, cover the basic system behaviour of a network at steady state. Each possible behaviour can be obtained from linear combinations
Fig. 32.14 The P/T representation of the logical model of the yeast mating/invasive growth shown in Fig. 32.12. The P/T net encompasses 18 places and 18 transitions. For the initial marking depicted here (pheromone signal and all other components being inactive), the marking graph is isomorphic to the STG shown in Fig. 32.13
of minimal T-invariants. T-invariants together with the places and arcs in between define connected subnetworks, which can be interpreted as biologically significant modules. The assignment of a biological meaning to these pathways is a valuable criterion to validate a model. For signal transduction pathways, we introduce the notion of feasible T-invariants, since modelling P-invariants with read arcs, for example for the activated and inactivated forms of a protein, may split the resulting T-invariants. To facilitate the exploration of a huge number of T-invariants, we define new structures in Petri net models, the MCT-sets and T-clusters, providing a further decomposition. MCT-sets exclusively summarise common parts of T-invariants. Thus, a set of transitions is given that always occur together and should therefore exhibit a similar expression behaviour. MCT-sets can also be regarded as building blocks of a system. By definition, they do not overlap, whereas T-clusters can overlap. T-clusters are obtained by using hierarchical clustering methods based on the Tanimoto distance
measure. Here again, we obtain sets of T-invariants, which form subnetworks having a biological interpretation as functional modules. We further introduce a concept, called Mauritius map, to represent dependencies between T-invariants. A Mauritius map is a special data structure in terms of a binary tree. These maps provide an overview of affected subnetworks and are therefore useful for systematic knockout analyses.

Overall, it can be concluded that P/T nets constitute a useful framework for modelling and analysing signal transduction pathways. We can represent different abstraction levels in one model and thus combine signalling networks with metabolic and/or gene regulatory processes. This is a strong advantage of Petri net modelling. Moreover, in contrast to the elementary mode analysis, Petri nets provide an intuitive graphical interface. Further analysis methods, such as checks for deadlocks, liveness and reachability, and model-checking techniques based on temporal logics, can easily be applied. Many software tools are freely available and interested users can easily edit and explore their own examples. The limitation of Petri nets, as for most other modelling methods, lies in their application to large systems encompassing several thousands of nodes. The invariant or elementary mode computation is NP-complete and fails for large systems.

The logical formalism, originally dedicated to transcriptional networks, proved useful to represent in a convenient abstracted way a large variety of regulatory processes, including post-transcriptional regulation and complex formation. Here we present the framework and briefly discuss the properties that can be checked on logical models. In contrast to structural properties of P/T nets as delineated in Section 32.3, which relate to a steady-state behaviour of the system, logical model analyses mainly focus on dynamic behaviours with a special attention to attractors (and their reachability). Note that such properties can also be checked on P/T net models by constructing their coverability graph or, often better, by using model-checking techniques. Fortunately, logical models can be represented in terms of P/T nets, applying a well-defined procedure as presented in this chapter. This representation permits the use of Petri net tools for the analysis of these models. More importantly, this P/T net representation facilitates the integrated modelling of molecular networks, considering different levels of abstraction.

Finally, it is worthwhile to relate the presented approaches to more detailed and quantitative models if the required kinetic data are available. In addition to classical kinetic modelling methods, which provide many powerful software tools, Petri net extensions can be used to increase the expressive power of P/T nets. However, the more complex the system, the harder it is to efficiently analyse the resulting models. The discrete approaches, as presented in this chapter, are useful to delineate the structure and basic functioning of biochemical networks, in particular of signalling networks.

Acknowledgments This work is partly supported by the Federal German Ministry of Education and Research (BMBF), BCB project 0312705D.
Appendix

For those interested readers who would like to develop their own models using Petri net and/or logical formalisms, Tables 32.2 and 32.3 present a brief overview of software tools. We have chosen those running under Linux and Windows. All but one are free for academic users. Further information and references can be found at the indicated webpages. The Petri net community is very well organised and offers an extensive web portal, Petri Nets World (TGI-group 2009). There, a more complete list of tools dealing with Petri nets and their extensions is provided. Concerning tools supporting logical modelling, we have selected the more recent ones that have been used to study real case applications.

Table 32.2 A selection of software packages supporting Petri nets. All tools have been used in the field of systems biology and run under main operating systems. They are all free of charge to academic users, except Cell Illustrator

Name: Main features/Homepage
Cell Illustrator: Dedicated to the modelling and simulation of a wide range of biological processes. It includes an editor with many features for the representation of biochemical networks and is based on extended HFPNs. http://www.cellillustrator.com/home
CPN Tools: A tool for editing, simulating and analysing CPNs. A fast simulator efficiently handles both untimed and timed nets. Full and partial state spaces can be generated and analysed, and a standard state space report contains information such as boundedness properties and liveness properties. http://wiki.daimi.au.dk/cpntools/cpntools.wiki
GDToolkit: A graph drawing toolkit designed to efficiently manipulate several types of graphs and to automatically draw them according to many different aesthetic criteria and constraints. It is useful for high-level Petri nets and P/T nets. http://www.dia.uniroma3.it/~gdt/gdt4/index.php
Great SPN: A powerful graphical editor and analyser for high-level Petri nets, TPN and SPN. It provides state space computation, invariant analysis, structure analysis and advanced performance analysis. http://www.di.unito.it/~greatspn/index.html
INA: Although not maintained any longer, it is still widely used as a general purpose tool for the analysis of standard (timed) PNs and CPNs. Most Petri net editors provide export facilities towards INA, which has no graphical editor. http://www2.informatik.hu-berlin.de/~starke/ina.html
Maria: Implements extensive reachability analysis and model checking for P/T and high-level nets. http://www.tcs.hut.fi/Software/maria/
Möbius: Supports stochastic Petri nets, among other stochastic model types, and provides multiple analysis tools. http://www.mobius.illinois.edu/
PIPE: A powerful editor with fast simulation for P/T nets, TPNs and SPNs. It provides state space analysis, invariant analysis, structural analysis, simple and advanced performance analysis. http://pipe2.sourceforge.net/
PROD: A powerful tool for high-level Petri nets and P/T nets. It facilitates state space analysis and LTL and CTL model checking. http://www.tcs.hut.fi/Software/prod/
QPME: A graphical editor and fast simulator for high-level Petri nets, P/T nets and SPNs, with advanced performance analysis. http://www.dvs.tu-darmstadt.de/staff/skounev/QPME/index.html
Snoopy: Allows the editing and animation of P/T nets and various extensions, including continuous and stochastic Petri nets. Comprises export facilities to INA, TINA and Maria. http://www-dssz.informatik.tu-cottbus.de/software/snoopy.html
TimeNET: A graphical (editor) and interactive toolkit for modelling high-level Petri nets, P/T nets, SPNs and stochastic colored Petri nets (SCPNs). It provides invariant analysis, structural analysis and advanced performance analysis. http://www.tu-ilmenau.de/fakia/8086.html
Tina: A tool for (time) Petri net analysis, including graphic-editing facilities and efficient symbolic representation of behaviours. http://www.laas.fr/tina/

Table 32.3 A selection of software packages supporting logical modelling. All tools have been used in the field of systems biology and run under main operating systems

Name: Main features/Homepage
BooleanNet: Supports the simulation of Boolean models following various updating policies, including a “piecewise differential equation” mode that associates a set of continuous variables to each discrete variable. http://code.google.com/p/booleannet/
CellNetAnalyzer: This MATLAB package provides structural and functional analysis (based on network topologies) for metabolic (stoichiometric) networks as well as signaling and regulatory networks (Boolean interaction networks). http://www.mpi-magdeburg.mpg.de/projects/cna/cna.html
ChemChains: Recently released as a tool for simulation and analysis of Boolean networks, it implements means to set up series of simulations, varying external environments (input nodes) as well as perturbation conditions (mutations). http://www.bioinformatics.org/chemchains
GINsim: A framework for the specification of multi-valued logical models providing functionalities for (a)synchronous simulations and circuit analysis. Export facilities are provided towards Petri net formats, among others. http://gin.univ-mrs.fr/GINsim/
Squad: A tool for the study of regulatory networks using Boolean models combined with specific ODEs, to easily identify steady states and to simulate the dynamic behaviour of the network in time. http://www.enfin.org/squad
References

Abou-Jaoudé W, Ouattara DA, Kaufman M (2009) From structure to dynamics: frequency tuning in the p53-Mdm2 network I. Logical approach. J Theor Biol 258(4):561–577
Atkins PW, de Paula J (2006) Physical chemistry. Oxford University Press, Oxford
Bardwell L (2004) A walk-through of the yeast mating pheromone response pathway. Peptides 26(2):339–350
Bardwell L, Zou X, Nie Q, Komarova NL (2007) Mathematical models of specificity in cell signaling. Biophys J 92:3425–3441
Bortfeldt R, Schuster S, Koch I (2010) Exhaustive analysis of the modular structure of the spliceosomal assembly network – a Petri net approach. In silico Biol 10(1–2):89–123
Bryant RE (1986) Graph-based algorithms for Boolean function manipulation. IEEE Trans Comput 35(8):677–691
Cao T-H, Sanderson AC (1996) Intelligent task planning using fuzzy Petri nets. In: Series in Intelligent Control and Intelligent Automation – Vol. 3, World Scientific Publishing Company
Chaouiya C, Remy E, Mossé B, Thieffry D (2003) Qualitative analysis of regulatory graphs: a computational tool based on a discrete formal framework. LNCIS 294:119–126
Chaouiya C, Remy E, Ruet P, Thieffry D (2004) Qualitative modelling of genetic networks: from logical regulatory graphs to standard Petri nets. Proc ICATPN 2004, LNCS 3099:137–156
Chaouiya C, Remy E, Thieffry D (2006) Qualitative Petri net modelling of genetic networks. LNCS, TCSB VI, 4220:95–112
Chaouiya C (2007) Petri net modelling of biological networks. Brief Bioinf 8:210–219
Chaves M, Albert R, Sontag ED (2005) Robustness and fragility of Boolean models for genetic regulatory networks. J Theor Biol 235(3):431–449
Chen M, Hofestädt R (2003) Quantitative Petri net model of gene regulated metabolic networks in the cell. Silico Biol 3:30
Chen M, Hofestädt R (2006) A medical bioinformatics approach for metabolic disorders: Biomedical data prediction, modeling, and systematic analysis. J Biomed Inform 39(2):147–159
Ciliberto A, Novak B, Tyson JJ (2003) Mathematical model of the morphogenesis checkpoint in budding yeast. J Cell Biol 163(6):1243–1254
Clarke EM, Grumberg O, Peled DA (1999) Model checking. MIT Press, Cambridge, MA
Comet JP, Klaudel H, Liauzu S (2005) Modeling multi-valued genetic regulatory networks using high-level Petri nets. LNCS 3536:208–227
David R, Alla H (2004) Discrete, continuous, and hybrid Petri nets. Springer, Berlin
Desrochers AA, Al’Jaar RY (1995) Applications of Petri nets in manufacturing systems: modelling, control and performance analysis. IEEE Press, New York
Doi A, Nagasaki M, Fujita S, Matsuno H, Miyano S (2003) Genomic object: net II. modeling biopathways by hybrid functional Petri nets with extension. Appl Bioinf 2:185–188
Doi A, Nagasaki M, Matsuno H, Miyano S (2006) Simulation based validation of the p53 transcriptional activity with hybrid functional Petri net. Silico Biol 6:1
Ehrig H, Reisig W, Rozenberg G, Weber H (2003) Petri net technology for communication-based systems. In: Lecture Notes in Computer Science, vol. 2472, Springer, Berlin
Fages F, Soliman S (2008) From reaction models to influence graphs and back: a theorem. LNCS 5054:90–102
Fauré A, Naldi A, Chaouiya C, Thieffry D (2006) Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle. Bioinf 22(14):124–131
Fauré A, Naldi A, Lopez F, Chaouiya C, Ciliberto A, Thieffry D (2009) Modular Logical Modelling of the Budding Yeast Cell Cycle. Molecular BioSystems 5:1787–1796
Genrich H, Küffner R, Voss K (2001) Executable Petri net models for the analysis of metabolic pathways. Int J STTT 3:394–404
Gibson MA, Bruck J (2000) Efficient exact stochastic simulation of chemical systems with many species and many channels. J Phys Chem A 104:1876–1889
Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81(25):2340–2361
González A, Chaouiya C, Thieffry D (2008) Logical modelling of the role of the Hh pathway in the patterning of the Drosophila wing disc. Bioinf 24:i234–i240
Goss PJE, Peccoud J (1998) Quantitative modeling of stochastic systems in molecular biology by using stochastic Petri nets. PNAS 95:6750–6755
Grafahrend-Belau E, Schreiber F, Heiner M, Sackmann A, Junker BH, Grunwald S, Speer A, Winder K, Koch I (2008) Modularization of biochemical networks based on classification of Petri net t-invariants. BMC Bioinf 9:90, doi:10.1186/1471-2105-9-90
Grunwald S, Speer A, Ackermann J, Koch I (2008) Petri net modelling of gene regulation of the Duchenne muscular dystrophy. BioSystems 92(2):189–205
Haas PJ (2002) Stochastic Petri nets. Springer, Berlin
Hardy S, Robillard PN (2004) Modelling and simulation of molecular biology systems using Petri nets: modelling goals of various approaches. J Bioinform Comput Biol 2:595–613
Heiner M, Koch I (2004) Petri net based model validation in systems biology. Proc ICATPN 2004, LNCS 3099:216–237
Heiner M, Koch I, Will J (2004) Model validation of biological pathways using Petri nets demonstrated for apoptosis. BioSystems 75(1–3):15–28
Heinrich R, Rapoport TA (1974) A linear steady-state treatment of enzymatic chains: general properties, control and effector strength. Eur J Biochem 42(1):89–95
Hofestädt R (1994) A Petri net application of metabolic processes. Syst Anal Model Simulat 16:113–122
Hofestädt R, Thelen S (1998) Quantitative modelling of biochemical networks. Silico Biol 1:39–53
Irons DJ (2009) Logical analysis of the budding yeast cell cycle. J Theor Biol 257(4):543–559
Jensen K (1997) Coloured Petri nets. Basic concepts, analysis methods and practical use. Vol. 3, Practical use. Monographs in theoretical computer science. Springer, Berlin
Kauffman S (1969) Metabolic stability and epigenesis in randomly constructed genetics nets. J Theor Biol 22:437–467
Kielbassa J, Bortfeldt R, Schuster S, Koch I (2009) Modeling of the U1 snRNP assembly pathway in alternative splicing in human cells using Petri nets. Comp Biol Chem 33:46–61
Klamt S, Saez-Rodriguez J, Lindquist JA, Simeoni L, Gilles ED (2006) A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinf 7:56
Koch I, Junker BH, Heiner M (2005) Application of Petri net theory for modelling and validation of the sucrose breakdown pathway in the potato tuber. Bioinf 21(7):1219–1226
Koch I, Heiner M (2008) Petri nets in biological network analysis. In: Junker BH, Schreiber F (eds) Analysis of biological networks. Wiley, New York, pp 139–179
Kofahl B, Klipp E (2004) Modelling the dynamics of the yeast pheromone pathway. Yeast 21(19):831–850
Lee DY, Zimmer R, Lee SY, Park S (2006) Colored Petri net modeling and simulation of signal transduction pathways. Metab Eng 8(2):112–122
Marsan MA, Balbo G, Conte G, Donatelli S, Franceschinis G (1994) Modelling with generalized stochastic Petri nets. Wiley Series in Parallel Computing, Wiley, New York
Matsuno H, Doi A, Nagasaki M, Miyano S (2000) Hybrid Petri net representation of gene regulatory network. Proc Pac Symp Biocomput 5:338–349
Matsuno H, Tanaka Y, Aoshima H, Doi A, Matsui M, Miyano S (2003) Biopathway representation and simulation on hybrid functional Petri net. Silico Biol 3(3):389–404
Matsuno H, Fujita S, Doi A, Nagasaki M, Miyano S (2003) Towards biopathway modeling and simulation. Proc ICATPN 2003, LNCS 2679:3–22
854
I. Koch and C. Chaouiya
Matsuno H, Inouye, ST, Okitsu Y, Fujii Y, Miyano S (2006) A new regulatory interaction suggested by simulations for circadian genetic control mechanism in mammals. J Bioinf Comput Biol 4:139–153 Mendoza L, Thieffry D, Alvarez-Buylla ER (1999) Genetic control of flower morphogenesis in Arabidopsis thaliana: a logical analysis. Bioinform 15(7–8):593–606 Mendoza L (2006) A network model for the control of the differentiation process in Th cells. Biosyst 84(2):101–114 Merlin P, Farber D (1976) Recoverability of communication protocols–implications of a theoretical study. IEEE Trans Commun 24(9):1036–1043 Michaelis L, Menten, ML (1913) Die Kinetik der Invertinwirkung. Biochem Z 49:333–369 Mura I, Csika´sz-Nagy A (2008) Stochastic Petri net extension of a yeast cell cycle model. J Theor Biol 254:850–860 Murata T (1989) Petri nets: Properties, analysis and applications. Proc IEEE 77:541–580 Nagasaki M, Doi A, Matsuno H, Miyano S (2003) Genomic object: net I. A platform for modeling and simulating biopathways. Appl Bioinform 2:181–184 Nagasaki M, Doi A, Matsuno H, Miyano S (2004) A versatile Petri net based architecture for modeling and simulation of complex biological processes. Gen Inform 15(1):180–197 Nagasaki M, Doi A, Matsuno H, Miyano S (2005) A versatile Petri net based architecture for modeling and simulation of complex biological processes. Gen Inform 15:180–197 Naldi A, Thieffry D, Chaouiya C (2007) Decision diagrams for the representation of logical models of regulatory networks. LNBI 4695:233–247 Naldi A, Berenguier D, Fauré A, Lopez F, Thieffry D, Chaouiya C (2009) Logical modelling of regulatory networks with GINsim 2.3. Biosystems 97(2):134–139 Oliveira JS, Jones-Oliveira JB, Dixon DA, Bailey CG, Gull DW (2004) Hypergraph-theoretic analysis of the EGFR signaling network: initial steps leading to GTP:ras complex formation. J Comp Biol 11:812–842 Peterson, JL (1981) Petri net theory and the modeling of systems. Prentice-Hall, Inc. 
Upper Sadle River, NJ Petri CA (1962) Communication with automata (in German) Institut für Instrumentelle Mathematik. Bonn: Schriften des IIM Nr. 3 Popova-Zeugmann L, Heiner M, Koch I (2005) Time Petri nets for modelling and analysis of biochemical networks. Fundamenta Informaticae 67:149–162 Reddy VN, Mavrovouniotis ML, Liebman MN (1993) Petri net representation in metabolic pathways. In: Proc Int Conf Intell Syst Mol Biol 1:328–336 Reddy VN, Liebman MN, Mavrovouniotis ML (1996) Qualitative analysis of biochemical reaction systems. Comput Biol Med 26:9–24 Reisig W, Rozenberg G (1998) Advances in Petri nets In: Lectures on Petri nets II: Applications, Lecture Notes in Computer Science 1492, Springer, Berlin Rensing L, Ruoff P (2009) How can yeast cells decide between three activated MAP kinase pathways? A model approach. J Theor Biol 257(4):578–587 Sackmann A, Heiner A, Koch I (2006) Application of Petri net based analysis techniques to signal transduction pathways. BMC Bioinf 7:482, doi:10.1186/1471-2105-7-482 Sackmann A, Formanowicz D, Formanowicz P, Koch I, Blazewicz J (2007) An analysis of the Petri net based model of the human body iron homeostasis process. Comput Biol and Chem 31(1): 1–10 Saez-Rodriguez J, Simeoni L, Lindquist JA, Hemenway R, Bommhardt U, Arndt B, Haus U-U, Weismantel R, Gilles ED, Klamt S, Schraven B (2007) A logical model provides insights into T cell receptor signaling. PLOS Comput Biol 8(3):e163 Sánchez L, Thieffry D (2003) Segmenting the fly embryo: a logical analysis of the pair-rule crossregulatory module. J Theor Biol 224(4):517–537 Sánchez L, Chaouiya C, Thieffry D (2008) Segmenting the fly embryo: a logical analysis of the segment polarity cross-regulatory module. Int J Dev Biol 52(8):1059–1075
32
Discrete Modelling: Petri Net and Logical Approaches
855
Schuster S, Hilgetag C, Schuster R (1993) Determining elementary modes of functioning in biochemical reaction networks at steady state. Proc Second Gauss Symp 1996:101–114 Simão E, Remy E, Thieffry D, Chaouiya C (2005) Qualitative modelling of regulated metabolic pathways: application to the tryptophan biosynthesis in E.coli. Bioinf 21 suppl. 2:ii190–ii196 Espinosa-Soto C, Padilla-Longoria P, Alvarez-Buylla ER (2004) A gene regulatory network model for cell-fate determination during Arabidopsis thaliana flower development that is robust and recovers experimental gene expression profiles. Plant Cell 16(11):2923–2939 Srivastava R, Peterson MS, Bentley WE (2001) Stochastic kinetic analysis of the Escherichia coli stress circuit using σ 32 -targeted antisense. Biotechnol Bioeng, 75(1):120–129 Tasaki S, Nagasaki M, Oyama M, Hata H, Ueno K, Yoshida R, Huguchi T, Sugano S, Miyano S (2006) Modeling and estimation of dynamic EGFR pathway by data assimilation approach using time series proteomic data. Gen Inform 57(2):226–238 TGI-group (2009) Petri Net World. http://www.informatik.uni-hamburg.de/TGI/PetriNets/. 31 May 2010 Thieffry D, Thomas R (1995) Dynamical behaviour of biological regulatory networks, II. Immunity control in bacteriophage lambda. Bull Math Biol 57(2):277–297 Thieffry, D (2007) Dynamical roles of biological regulatory circuits. Brief Bioinform 8(4):220–225 Thomas R, Thieffry D, Kaufman M (1995) Dynamical behaviour of biological regulatory networks–I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state. Bull Math Biol 57(2), 247–276 van der Aalst, WMP, Desel J, Oberweis A (1999) Business process management: models, techniques, and empirical studies. In: Lecture Notes in Computer Science, vol. 1806, Springer, Berlin Voet D, Voet JG (2004) Biochemistry. Wiley, New York Voss K, Heiner M, Koch I (2003) Steady state analysis of metabolic pathways using Petri nets. 
Silico Biol 3:367–387 Windhager L, Zimmer R (2008) Intuitive modeling of dynamic systems with Petri nets and fuzzy logic. Proc German Conf Bioinf 136:106–115
Chapter 33
ProteoLens: A Database-Driven Visual Data Mining Tool for Network Biology
Jake Yue Chen and Tianxiao Huan
Abstract Systems biology studies require researchers to understand how myriad biomolecular entities act in concert to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decades to help researchers visually navigate large networks of biomolecular interactions with built-in, template-based query capabilities. This chapter introduces ProteoLens, a visual analytic software tool for creating, annotating, and exploring multi-scale biological networks. ProteoLens is a stand-alone software tool written in the Java programming language. Its architectural design makes it suitable for bioinformatics data analysts who are experienced in relational database management and who need to perform large-scale, integrated visual exploration of networks. It presents biological networks in several layout types, such as organic and hierarchical layouts. Users can write queries to specify and store “associations” between nodes, either as “interactions” or as node and edge annotations. These associations can then be used to visually annotate large displayed biological networks through node and edge shape, size, weight, color, and text. Below, we describe the design ideas, the overall architecture, and the major operations of the software, and present case studies that demonstrate in detail how it is used to answer questions about multi-scale biological networks.
Keywords Systems biology · Biological multi-scale networks · Visualization software · Visual mining
T. Huan (corresponding author), Shandong University, Microbiology Building 608#, Shanda Nanlu 27#, 250100, Jinan, People’s Republic of China; e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_33, © Springer Science+Business Media, LLC 2010
J.Y. Chen and T. Huan
33.1 Introduction
33.1.1 Biomolecular Network and Visualization Software
The concept of networks is ubiquitous in systems biology. In the past few years, high-throughput techniques have produced an abundance of biomolecular interaction data, and many biomolecular interaction databases have been built, especially protein–protein interaction (PPI) databases such as HPRD (Peri et al. 2003), OPHID (Brown and Jurisica 2005), and HAPPI (Chen et al. 2009). This accumulation of molecular data supports research that integrates different types of genomic annotation to uncover the complex interaction relationships among biomolecules under perturbation by internal and external environmental factors. The current trend in biological network analysis and visualization software is to give users an extended ability to query and interpret existing experimental data, particularly data from “Omics” platforms, in the emerging context of biomolecular interaction networks. Many network visualization tools have been developed recently to make raw text data easier to understand and to help researchers formulate more reasonable biological hypotheses (Baitaluk et al. 2006; Han et al. 2004; Hu et al. 2005; Shannon et al. 2003). For example, Cytoscape provides the basic functions for visualizing, annotating, and analyzing prepared molecular network data (Shannon et al. 2003). VisANT integrates statistical functions that calculate several key topological parameters (Hu et al. 2005). WebInterViewer adopts a fast layout algorithm for graph presentation and provides several abstraction and comparison operations for analyzing large-scale biological networks effectively (Han et al. 2004).
33.1.1.1 Multi-Scale Biological Entities and ProteoLens
Because biological interaction networks are so large, one major problem is that the relationships among thousands of nodes are hard to visualize in a single graph. A better solution is to organize the basic biomolecules modularly into higher-level biological entities and to explore the network from a multi-scale point of view. Analyzing data in multi-scale biological networks is inherently more challenging than analyzing biomolecular interaction networks. Goh et al. explored all known associations of disease phenotypes by representing disease phenotypes, instead of molecular entities, as nodes in a network graph (Goh et al. 2007). The resulting human disease network showed a clear pattern: many diseases share common genetic origins. Yildirim et al. built a drug–target network to analyze the global characteristics of the relationship between the protein targets of all drugs and the human protein–protein interaction network (Yildirim et al. 2007). To support visual analytics studies of multi-scale biological networks, advanced visualization software should provide a friendly and efficient bi-directional link between the data sources and the visualization component. Iterative, exploratory, and bi-directional data analysis capabilities, which let users save temporary results and build visualization sessions on top of one another, should be a prerequisite. However, although tens of visualization software tools are available for biological network analysis, most of them are designed only for presenting and analyzing molecular interaction networks and remain “menu-driven” or “mouse-click intensive.” As Suderman and Hallett recently surveyed, none of the 35 commonly used biological network visualization tools supports directly embedded query languages (Suderman and Hallett 2007). ProteoLens is a Java-based visual analytic tool for creating, annotating, and exploring multi-scale biological networks. Compared with existing biological network visualization tools, ProteoLens is designed to be more suitable for bioinformatics data analysts with SQL skills who work on large sets of data. First, it supports direct connectivity to Oracle and PostgreSQL databases and SQL statements in both the data definition language (DDL) and the data manipulation language (DML). Second, it supports graph and network data expressed in the standard graph modeling language (GML) format. Visual layouts produced in comparable software tools can therefore interoperate with ProteoLens as long as those tools also support GML. Third, it decouples the user interface into two separate functional layers, network visualization and network annotation, which are unified by the concept of “association.” Users can flexibly choose an “interaction association” for network visualization or use a “node/edge association” for data attributes (e.g., score, rank, description).
33.2 Concepts and Software Architecture
ProteoLens is a stand-alone software tool written in the Java programming language. Its software architecture consists of two separate functional layers, a data processing layer at the backend and a data visualization layer at the frontend, connected by a network data association engine (Fig. 33.1). This design enables integrative, large-scale visual network analysis and biological hypothesis exploration using heterogeneous data associated with biological networks.
Fig. 33.1 An overview of the ProteoLens core architecture. The design of ProteoLens decouples data processing and visualization into two layers, which communicate through abstract data associations. The major components of ProteoLens are the SQL data retrieval engine, the network layout engine, and the graph attribute editing engine (Huan et al. 2008). Reprinted with permission from BioMed Central
33.2.1 Data Associations: The Concept
The data association, which connects the data processing layer and the data visualization layer, is the fundamental concept in the design of ProteoLens. It provides the interface between external data and visualization: only data associations are accessible from the visualization layer. A data association is a two-way, many-to-many relationship between data objects (“entities”). Entities, in turn, are context- and problem-specific and are treated as indivisible. In our model, each pairing of two separate conceptual entities requires the explicit definition of a data association. The data visualization layer recognizes two kinds of data associations submitted by the data processing layer: (1) an interaction association, which is used to build the network, and (2) an annotation association, which is used to attach additional information to the network. For example, a typical protein interaction data table may contain two columns holding the identifiers of the interacting proteins and a third column holding interaction confidence scores. The following associations can then be defined: (1) an interaction association, protein interaction (Protein ID A↔Protein ID B), and (2) an edge annotation association, interaction confidence score ([Protein ID A, Protein ID B]↔score).
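The two-way, many-to-many nature of a data association can be illustrated with a small sketch. This is not ProteoLens code: the class name `DataAssociation`, its methods, and the toy rows are assumptions introduced here only to make the concept concrete.

```python
from collections import defaultdict


class DataAssociation:
    """Hypothetical two-way, many-to-many relationship between two entities."""

    def __init__(self, name):
        self.name = name
        self._forward = defaultdict(set)   # left entity  -> right entities
        self._backward = defaultdict(set)  # right entity -> left entities

    def add(self, left, right):
        # Store the pair in both directions so lookups work either way.
        self._forward[left].add(right)
        self._backward[right].add(left)

    def rights_of(self, left):
        return set(self._forward.get(left, ()))

    def lefts_of(self, right):
        return set(self._backward.get(right, ()))


# Toy three-column table (illustrative values): protein A, protein B, score.
rows = [("P1", "P2", 0.9), ("P1", "P3", 0.4), ("P2", "P3", 0.7)]

interaction = DataAssociation("protein interaction")
confidence = DataAssociation("interaction confidence score")
for protein_a, protein_b, score in rows:
    interaction.add(protein_a, protein_b)            # Protein ID A <-> Protein ID B
    confidence.add((protein_a, protein_b), score)    # [A, B] <-> score
```

The same table therefore yields one interaction association, which would define the edges of the network, and one edge annotation association, which could later drive edge width or color.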
33.2.2 Functional Layers
ProteoLens includes two functional layers. (1) The data processing layer is where biological data from different sources, including flat files, XML data, and tabular data in relational databases, are managed and converted from one format into another for subsequent analysis. It allows users to specify which subsets of data from which sources should be available to the application. A single external data source (flat file or database table) that is rich enough can give rise to multiple data associations. A key feature of the architecture is the embedded structured query language (SQL) engine, which is discussed in detail in Section 33.3. (2) The data visualization layer is where the specified network data attributes and data association rules are converted into network layouts and network visual properties. The software supports multiple independent network views, each of which is a fully functional graph editor.
33.3 Top Features
Figure 33.2 illustrates the core functionalities of ProteoLens. We compared ProteoLens with several popular visualization tools, namely Cytoscape, VisANT, and BiologicalNetworks (Table 33.1). The novel features and capabilities of ProteoLens are summarized below.
Fig. 33.2 Overview of the core functionalities of ProteoLens. The figure labels some of the core functionalities: (a) ProteoLens can access both relational databases and the local file system; (b) SQL statements can be edited and run within the software environment to build data associations; (c) SQL-like selection builds sub-networks by retrieving nodes/edges with particular characteristics; (d) nodes in the network view can be queried quickly and sub-networks retrieved; and (e) annotations can be added flexibly and comprehensively (Huan et al. 2008). Reprinted with permission from BioMed Central
Table 33.1 Comparison of ProteoLens with Cytoscape, VisANT, and BiologicalNetworks

| Attribute | ProteoLens | Cytoscape | VisANT | BiologicalNetworks |
| --- | --- | --- | --- | --- |
| Graph manipulation | yFiles | yFiles and GINY | In house | Cytoscape |
| Layout algorithms | Force-directed, radial, hierarchical, circular, orthogonal | More than 13 kinds of layout styles | Force-directed | Grid, circular, force-directed |
| Drawing appearance | Node shape/color/border/label, edge color/style/direction/label | Node shape/color/border/label, edge color/style/direction/label | Node shape/color/size | Node shape/color/border/label, edge color/style/direction/label |
| Filters | Select nodes/links according to properties, or select table attributes directly with SQL statements | Select nodes/links according to properties (SQL-like) | Several “select” filters available | Select nodes/links according to properties (SQL-like) |
| Expand/collapse nodes | Expand node neighbors | Plug-in | Expand node neighbors | not listed |
| Database | Common relational database | Plug-in | Predictome | PathSys |
| System requirements | Java stand-alone | Java applet or stand-alone | Java applet | JSP (Java Server Pages) |
| Save | GML, XML session | GML, SIF | Network with layout | Save all work as projects |
| Imports | Text, GML, XML, Oracle or PostgreSQL | Text, GML, expression matrix, OBO | PSI-MI, BioPAX, KGML, network relations (text) | Microarray data (Stanford, Affymetrix, TIGR, GenePix), SBML, SIF, PSI-MI, BioPAX |
| Exports | JPEG, BMP, GML, network relations, node lists, selected node lists (text) | Graphical file, SVG, GML, network relations (text) | PSI-MI, BioPAX, SVG, JPEG, network relations (text) | GIF, JPEG, SWF, PDF, PNG, PostScript, RAW, SVG, BMP |
| Comments and other features | Embedded SQL queries make the software more flexible for expert bioinformatics users | Solid support for plug-ins, a growing number of which are available | Statistical functions for topological characteristic analysis; integrates several biological databases | Integrated visualization and analysis of expression data |

The attributes of Cytoscape, VisANT, and BiologicalNetworks are summarized from the detailed review by Suderman and Hallett (2007). (Huan et al. 2008)
33.3.1 Comprehensive Input and Output Support
Currently, ProteoLens supports two types of physical data sources: text files on the local file system that contain data in column-delimited format, and relations obtained from SQL queries to an Oracle or PostgreSQL database. Given the volume of data available today, users tend to use a database as the source, and ProteoLens has built-in support for robust relational database engines (Oracle 10g DBMS and PostgreSQL 8.x). A user can directly query data stored in a local database using Oracle SQL or PostgreSQL and immediately visualize the results as a network. ProteoLens also supports the semi-structured graph modeling language (GML) format, the standard file format of the Graphlet graph editor system, for non-relational graphs. A network visualization created in a view can be saved in a GML file, which allows it to be reopened and edited further in a new session or exchanged without a relational database. ProteoLens stores every data association in a session configuration XML file, so users can save a session and resume their analysis at any time. The network view can also be exported as a JPEG or PNG file. In addition to the structured data stored in relational databases, users can import and manipulate any network data in the standard GML file format.
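To give a feel for what a GML file contains, the following sketch serializes a small network using only the basic `graph`, `node`, and `edge` keys of the format. The helper `to_gml` and the node names are hypothetical; the attributes that ProteoLens itself writes (layout coordinates, graphics) are not described in this chapter and are not reproduced here.

```python
def to_gml(nodes, edges):
    """Serialize node labels and (source, target) label pairs as minimal GML text."""
    ids = {label: i for i, label in enumerate(nodes)}  # GML nodes need integer ids
    lines = ["graph [", "  directed 0"]
    for label, node_id in ids.items():
        lines.append(f'  node [ id {node_id} label "{label}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {ids[source]} target {ids[target]} ]")
    lines.append("]")
    return "\n".join(lines)


# Toy network with illustrative node names.
gml_text = to_gml(["N1", "N2", "N3"], [("N1", "N2"), ("N2", "N3")])
```

Because the result is plain text, it can be written to a file and exchanged with any other tool that reads GML, which is the interoperability argument made above.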
33.3.2 Declarative SQL-Based Visual Analysis
ProteoLens is the first biomolecular network visualization software with full SQL support. It supports direct connectivity to Oracle and PostgreSQL database tables and views and the entire set of SQL statements, including both the data definition language (DDL) and the data manipulation language (DML). This extends the range of data that expert users can bring into later network visualizations for annotation and visual exploration tasks. Users can iteratively prepare data stored in relational databases without leaving the visual analytic environment. Data from different tables in a complex relational database schema can also be queried on the fly to create networks at the appropriate level for exploration.
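The kind of query-driven preparation described here can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for an Oracle or PostgreSQL connection, and the table names, column names, and values are assumptions made for illustration. It combines DDL (`CREATE TABLE`, `CREATE VIEW`) and DML (`INSERT`, `SELECT`) to derive a gene-level edge list from two tables.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Oracle/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DDL: hypothetical schema
    CREATE TABLE protein (protein_id TEXT PRIMARY KEY, gene_symbol TEXT);
    CREATE TABLE interaction (protein_a TEXT, protein_b TEXT, score REAL);

    -- DML: toy rows
    INSERT INTO protein VALUES ('P1', 'G1'), ('P2', 'G2'), ('P3', 'G3');
    INSERT INTO interaction VALUES ('P1', 'P2', 0.9), ('P1', 'P3', 0.3), ('P2', 'P3', 0.8);

    -- DDL: a view that joins both tables and keeps confident interactions only
    CREATE VIEW confident_edges AS
        SELECT pa.gene_symbol AS gene_a, pb.gene_symbol AS gene_b, i.score
        FROM interaction i
        JOIN protein pa ON pa.protein_id = i.protein_a
        JOIN protein pb ON pb.protein_id = i.protein_b
        WHERE i.score >= 0.5;
""")

# Each result row is one edge (gene_a, gene_b) with its score as an annotation.
edges = conn.execute(
    "SELECT gene_a, gene_b, score FROM confident_edges ORDER BY gene_a, gene_b"
).fetchall()
```

In ProteoLens the result of such a query would be wrapped as a data association rather than fetched into a Python list; the sketch only shows how a join and a filter can move the network to a different level of description.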
33.3.3 Layout Choices of Biological Network
ProteoLens supports a variety of automated network layout methods, such as organic, hierarchical, and circular layouts.
33.3.3.1 Hierarchical Layout
The hierarchical layout is a good way to reveal hierarchical information that is sometimes hidden inside a molecular interaction network. Usually, nodes at the same level of the hierarchy are placed on the same horizontal line, and a series of horizontal lines indicates multiple levels, so that edges are directed from nodes on lower horizontal lines to nodes on higher horizontal lines. Figure 33.3a shows the hierarchical layout; details are given in Huan et al. (2008).
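One simple way to obtain such levels is longest-path layering of a directed acyclic graph, sketched below. This is an illustrative technique chosen here, not necessarily the algorithm used by the layout library inside ProteoLens; the function names and the toy edges are assumptions, and the sketch expects an acyclic input.

```python
from collections import defaultdict, deque


def assign_layers(edges):
    """Longest-path layering of a DAG: a node sits one level above its deepest predecessor."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))
    layer = {node: 0 for node in nodes}
    # Process nodes in topological order, starting from nodes without predecessors.
    queue = deque(sorted(node for node in nodes if indegree[node] == 0))
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            layer[nxt] = max(layer[nxt], layer[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return layer


def hierarchical_positions(edges, x_gap=1.0, y_gap=1.0):
    """Place nodes of the same layer on the same horizontal line (same y value)."""
    layer = assign_layers(edges)
    rows = defaultdict(list)
    for node in sorted(layer):
        rows[layer[node]].append(node)
    return {
        node: (column * x_gap, level * y_gap)
        for level, members in rows.items()
        for column, node in enumerate(members)
    }


toy_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "D")]
layers = assign_layers(toy_edges)
positions = hierarchical_positions(toy_edges)
```

With this layering every edge points from a lower level to a strictly higher one, which matches the drawing convention described above.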
33.3.3.2 Circular Layout
The circular layout is a simple way to visualize a molecular interaction network: all molecules are drawn at equal size and placed at predetermined positions along circles in two dimensions (Fig. 33.3b).
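For a single circle, the predetermined positions can be computed by spacing the nodes at equal angles. The sketch below is a minimal illustration under that assumption; the function name `circular_layout` and its parameters are introduced here and are not part of ProteoLens.

```python
import math


def circular_layout(nodes, radius=1.0):
    """Place nodes at equal angular steps on a circle centred at the origin."""
    count = len(nodes)
    return {
        node: (
            radius * math.cos(2 * math.pi * i / count),
            radius * math.sin(2 * math.pi * i / count),
        )
        for i, node in enumerate(nodes)
    }


circle = circular_layout(["A", "B", "C", "D"], radius=2.0)
```

Layouts with several concentric circles could reuse the same function with a different radius for each group of nodes.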
Fig. 33.3 Sketches of the major network layouts supported by ProteoLens: (a) hierarchical layout; (b) circular layout; (c) organic layout
33.3.3.3 Organic Layout
The organic layout, which is based on a spring-force algorithm, is an effective and popular model for producing relatively good network drawings that highlight the centrality of the network. It builds a spring-force model between each pair of nodes that pulls linked nodes together and pushes unlinked nodes apart, iterating until all forces reach mechanical equilibrium. An example of the organic layout, applied to a cancer subtype association network, is shown in Fig. 33.3c (see Case study 2 for details).
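The spring-force idea can be sketched in a few lines. The force laws used here (a linear spring with a rest length for linked pairs and an inverse-square repulsion for unlinked pairs), the fixed number of iterations, and all parameter values are simplifying assumptions for illustration; they are not the formulas of the layout engine used by ProteoLens.

```python
import math


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def spring_layout(nodes, edges, iterations=200, rest_length=1.0, step=0.05):
    """Toy spring embedder: springs on linked pairs, repulsion on unlinked pairs."""
    nodes = list(nodes)
    count = len(nodes)
    # Deterministic start: nodes evenly spaced on the unit circle.
    pos = {
        node: [math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count)]
        for i, node in enumerate(nodes)
    }
    linked = {frozenset(edge) for edge in edges}
    for _ in range(iterations):
        force = {node: [0.0, 0.0] for node in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                dist = math.hypot(dx, dy) or 1e-9
                if frozenset((u, v)) in linked:
                    pull = dist - rest_length            # spring: positive pulls together
                else:
                    pull = -(rest_length ** 2) / dist ** 2  # negative pushes apart
                fx, fy = pull * dx / dist, pull * dy / dist
                force[u][0] += fx
                force[u][1] += fy
                force[v][0] -= fx
                force[v][1] -= fy
        for node in nodes:
            pos[node][0] += step * force[node][0]
            pos[node][1] += step * force[node][1]
    return {node: tuple(xy) for node, xy in pos.items()}


pair = spring_layout(["A", "B"], [("A", "B")])
triple = spring_layout(["A", "B", "C"], [("A", "B")])
```

In the two-node case the spring relaxes to its rest length; in the three-node case the linked pair ends up closer together than either node is to the unlinked node. A production layout would stop when the net forces fall below a threshold rather than after a fixed number of iterations.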
33.3.4 Sub-network Retrieving Capability
Users of ProteoLens can conveniently specify sub-networks of existing networks to conduct studies in a specific biological context. The software supports “network browsing” and “network querying” operations. Note also that a single network view can subsequently be reattached to different network sources, so that a complex network with different types of relations from multiple datasets can be built. For example, the Alzheimer’s disease-related protein–protein interaction network (see Case study 1) can be regarded as a disease-specific sub-network retrieved from the whole human PPI network.
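A “network querying” operation of this kind amounts to selecting nodes by a property and keeping the edges among them. The sketch below shows that idea on toy data; the function name, the attribute `location`, and the values are hypothetical and serve only as an example of a node property.

```python
def induced_subnetwork(edges, node_attributes, predicate):
    """Keep nodes whose attributes satisfy the predicate and the edges among them."""
    kept = {node for node, attrs in node_attributes.items() if predicate(attrs)}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


# Toy network and attributes (illustrative values only).
all_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
attributes = {
    "P1": {"location": "nucleus"},
    "P2": {"location": "nucleus"},
    "P3": {"location": "membrane"},
    "P4": {"location": "nucleus"},
}

sub_nodes, sub_edges = induced_subnetwork(
    all_edges, attributes, lambda attrs: attrs["location"] == "nucleus"
)
```

Note that P4 is kept as a node but becomes isolated, because its only neighbor does not satisfy the query.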
33.4 Using ProteoLens
Below, we describe the major operations of ProteoLens. For details, refer to the user guide, which can be downloaded from the web site http://bio.informatics.iupui.edu/proteolens/.
33.4.1 Installing ProteoLens and Launching the Application
ProteoLens can be downloaded as a ready-to-install executable from the web site http://bio.informatics.iupui.edu/proteolens/. To run ProteoLens, Java Runtime Environment version 1.4.2 or higher must first be installed; it can be obtained from http://java.sun.com. A principal advantage of ProteoLens is that it connects directly to a database. To help users get started, brief instructions are provided here for downloading and installing Oracle XE, a free, entry-level database. As an alternative to Oracle XE, PostgreSQL can also be installed and used with ProteoLens. Oracle XE can be downloaded from http://www.oracle.com/technology/products/database/xe/index.html. Three parameters must be specified during the installation. The values used for the examples in this manual are:
(1) Destination Folder: C:\oraclexe
(2) HTTP Listener Port: 8081
(3) System administrator password (for both SYS and SYSTEM): your choice
Note that the database SID (db_name) for Oracle XE is set to “XE” by default. ProteoLens is released as a standard Windows software installation package. After downloading the ProteoLens installation executable, double-click it and click “OK” to install.
33.4.2 Connecting to Database and File-Based Input
In the ProteoLens application, input can be read from either a flat file in two- or three-column relational format or a supported database type (Oracle or PostgreSQL).
33.4.2.1 Connecting to Database Input
The database types supported by ProteoLens are Oracle and PostgreSQL. Databases can be accessed across the network or hosted on the same computer that runs the ProteoLens application. The advantage of connecting to a database for input is that you can quickly iterate through different relational associations by sending SQL queries from the ProteoLens interface directly to the backend database.
A connection begins in the Filesystems window, where a file or database object can be viewed (right-click with the mouse). To connect to a database object, first mount the database: right-click the root Filesystems node in the Filesystems window and select the Mount database option from the submenu, as shown in Fig. 33.4a. As shown in Fig. 33.4b, use the thin connection type and enter the connection parameters.
Fig. 33.4 Connecting to database input
Use the Filesystems window to navigate to the database (named XE), open the schema named MYTESTUSER, right-click the table object GENEDISEASETABLE, and select View from the submenu (Fig. 33.4c). The contents (first few lines) of the data source are displayed as a table. If the data source is a database table, the SQL query is displayed in the top pane. An arbitrary query can be entered here (it should be executed first with Query→Run to retrieve the correct metadata), and the result of this query can be wrapped as a data association (Fig. 33.4d).
33.4.2.2 Connecting to File-Based Input
Navigate to a file in the Filesystems window. Right-click the file that contains the data you wish to load into ProteoLens, and a submenu will appear as shown in Fig. 33.5a. Click the table data check box, then select the View option from the submenu. A window then appears, as shown in Fig. 33.5b, in which the appropriate options can be selected.
Fig. 33.5 Connecting to file-based input
33.4.3 Create Data Association
Visualizations and annotations are created from data associations. After connecting to a database table or file, follow the steps below to create data associations. From the table pane menu, select “Result→create data association” (Fig. 33.6a). A dialog will pop up asking which columns of the underlying table should be wrapped by the association (Fig. 33.6b); a unique name for the new association must also be specified here. If only two columns of the underlying table are selected, the association is created immediately, and each column represents a separate data entity. Otherwise, another dialog will pop up asking which columns should be combined into the first, or “leftmost,” entity in the data association (“Key columns”); the remaining columns are wrapped by the second, or “rightmost,” entity. Repeat these steps as necessary (i.e., if more data associations have to be created from the same data source).
Fig. 33.6 Create data association
Data associations are linked directly to the underlying data source, so the table views from which an association was created can be safely closed at any moment. Note that network nodes and edges can represent proteins and protein interactions, whereas node/edge size, width, shape, and color can all be bound dynamically to customized data fields (such as gene symbol, functional category, and confidence score) for visualization.
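The column-wrapping rule described in this section (two selected columns give one entity each; with more columns, the key columns form the leftmost entity and the rest form the rightmost entity) can be written down as a short sketch. The function `create_association` and the column names are hypothetical and only mirror the behavior of the dialog as described above.

```python
def create_association(rows, columns, selected, key_columns=None):
    """Wrap selected table columns into (left entity, right entity) pairs."""
    index = {name: i for i, name in enumerate(columns)}
    if key_columns is None:
        if len(selected) != 2:
            raise ValueError("key columns are required when more than two columns are selected")
        key_columns = selected[:1]  # two columns: each column is its own entity
    value_columns = [name for name in selected if name not in key_columns]
    association = set()
    for row in rows:
        left = tuple(row[index[name]] for name in key_columns)     # "leftmost" entity
        right = tuple(row[index[name]] for name in value_columns)  # "rightmost" entity
        association.add((left, right))
    return association


# Toy table with illustrative column names and values.
table_columns = ["protein_a", "protein_b", "score"]
table_rows = [("P1", "P2", 0.9), ("P1", "P3", 0.4)]

interaction_pairs = create_association(table_rows, table_columns, ["protein_a", "protein_b"])
score_pairs = create_association(
    table_rows, table_columns, table_columns, key_columns=["protein_a", "protein_b"]
)
```

Creating both associations from the same rows corresponds to repeating the dialog twice on one data source.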
33.4.4 Attach Network Source to the View
Visualization is the graphical layout of nodes and edges in the network. In a newly opened Network View window, the Load Network from data association option can be used to construct the network from the uploaded data association (see Fig. 33.7a and b). After you select the network source (Fig. 33.7c) and specify the loading conditions (Fig. 33.7d), a network view similar to the one shown in Fig. 33.7e will appear.
Fig. 33.7 Attach network source to the view
Note, however, that if you repeat exactly the same procedure, the node-to-node associations will remain the same, but the physical layout of the network on the screen will vary somewhat at random. The network shown presents how genes are linked to each other through association with the same disease. An SQL statement can be used to construct a network that connects diseases directly to each other when they share a common gene; see Case study 2 for details.
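A disease-to-disease network of this kind can be derived from a gene-disease table with a self-join, as sketched below. The table name follows the GENEDISEASETABLE example used earlier; the column names `gene` and `disease`, the toy rows, and the use of Python's `sqlite3` module in place of Oracle or PostgreSQL are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Oracle/PostgreSQL backend
conn.execute("CREATE TABLE genediseasetable (gene TEXT, disease TEXT)")
conn.executemany(
    "INSERT INTO genediseasetable VALUES (?, ?)",
    [("G1", "D1"), ("G1", "D2"), ("G2", "D2"), ("G2", "D3"), ("G3", "D1"), ("G3", "D2")],
)

# Self-join on the gene column: two diseases are linked if they share a gene.
# The condition a.disease < b.disease keeps each unordered pair once.
disease_pairs = conn.execute("""
    SELECT a.disease, b.disease, COUNT(DISTINCT a.gene) AS shared_genes
    FROM genediseasetable a
    JOIN genediseasetable b
      ON a.gene = b.gene AND a.disease < b.disease
    GROUP BY a.disease, b.disease
    ORDER BY a.disease, b.disease
""").fetchall()
```

The first two columns of the result would serve as an interaction association, and the shared-gene count could be attached as an edge annotation.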
33.4.5 Add Annotation
An annotation is a modification of nodes (e.g., labels, sizes, colors) or edges (e.g., labels, line widths, colors) based on input that is linked to the identities of the nodes or edges. Follow the steps below to attach annotations to the Network View:
(1) In the Network View menu, select Visualization→Nodes→Add annotation or Visualization→Edges→Add annotation (Fig. 33.8a).
(2) A dialog appears (a screenshot for edge annotation is shown in Fig. 33.8b and one for node annotation in Fig. 33.8c). In the top-left pane, select the association to be used for annotation. Using the combo boxes at the bottom left, select the type of visualization to be used (i.e., whether the values provided by the data association should be used for the text label, color, or shape).
(3) If a visualization scheme cannot support multiple annotations (e.g., shape), the effect of “use all” is currently undefined; use “use first” instead (although an arbitrary one of the available annotation values will then be displayed), or “use Max”/“use Min” for numerical properties.
(4) The annotations can be:
(A) “As is”: the annotation values are used as text strings, such as labels.
(B) Categorical: a particular graphic attribute, such as color or shape, is assigned to each annotation value of interest. Many values can be mapped onto the same attribute value; for example, many different GO groups can all be drawn in the same color or the same shape. Not all annotation values have to be mapped onto visualization attributes; empty attributes (the default) are allowed.
(C) Continuous: numerical values, such as expression values, are mapped onto color, size, or width gradients. Note that if continuous annotation visualization is requested, the “Data Values” pane shows only the «MIN» and «MAX» values, for which the corresponding colors, sizes, or widths should be specified; intermediate values are drawn as gradients between these two values. (See the user manual for all functionalities in detail.)
(5) Press OK; the current view will be re-rendered (Fig. 33.8d).
Fig. 33.8 Major steps to create and add annotations
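The three annotation modes and the “use first”/“use Max”/“use Min” policies can be made concrete with a short sketch. The function names, the linear interpolation used for the gradient, and the example colors are assumptions introduced here; the chapter does not specify how ProteoLens interpolates between the «MIN» and «MAX» settings.

```python
def resolve_annotation(values, policy="use first"):
    """Reduce several annotation values for one node or edge to a single value."""
    if policy == "use first":
        return values[0]
    if policy == "use Max":
        return max(values)
    if policy == "use Min":
        return min(values)
    raise ValueError(f"unsupported policy: {policy}")


def categorical_attribute(value, mapping, default=None):
    """Categorical mode: look up a graphic attribute; unmapped values stay empty."""
    return mapping.get(value, default)


def continuous_color(value, vmin, vmax, color_min, color_max):
    """Continuous mode: linear RGB gradient between the MIN and MAX colors."""
    fraction = 0.0 if vmax == vmin else (value - vmin) / (vmax - vmin)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside [vmin, vmax]
    return tuple(round(lo + fraction * (hi - lo)) for lo, hi in zip(color_min, color_max))


BLUE, RED = (0, 0, 255), (255, 0, 0)
shape_map = {"kinase": "diamond", "receptor": "ellipse"}  # illustrative categories

chosen = resolve_annotation([0.4, 0.9, 0.1], policy="use Max")
shape = categorical_attribute("kinase", shape_map)
unmapped = categorical_attribute("transporter", shape_map)
quarter = continuous_color(2.5, 0.0, 10.0, BLUE, RED)
```

Size and width gradients would follow the same interpolation with a single number in place of the RGB triple.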
33.5 Systems Biology Case Studies
To illustrate the architecture of ProteoLens, we show through several case studies how it enables bioinformatics users to address different biological problems clearly and effectively.
33.5.1 Case Study 1: Alzheimer’s Disease-Related Protein Interaction Network
Alzheimer’s disease (AD) is a progressive neurodegenerative disease with 4.5 million patients in the United States today. In this case study, we show how ProteoLens helped to generate an AD-specific PPI sub-network. [This work was previously published in Chen et al. (2006).] The computational procedure developed for the AD protein interaction sub-network analysis is summarized as follows. First, we searched the OMIM database to obtain an initial collection of AD-related genes. Second, we built an expanded AD protein interaction sub-network by nearest-neighbor expansion in the OPHID database. Third, we visualized and annotated the AD interaction sub-network. All three steps can be performed within ProteoLens. Before the visual analytic process begins, the OMIM table and the OPHID data should be downloaded and stored in a local database. To obtain a list of AD-related genes, we wrote
33
ProteoLens
SQL in the database panel of ProteoLens to retrieve each OMIM gene record in which the “description” field contains the term “Alzheimer.” We then expanded the sub-network by retrieving the protein neighbors that interact with the seed genes and, finally, created the association “AD_INTERACTION.” Here, we denote the initial 70 AD-related proteins as the seed AD set. To build the AD sub-network, we pulled out the interacting protein pairs in OPHID in which at least one member of the pair belongs to the seed AD set. We call this set of interacting pairs the AD interaction set. In this study, the AD interaction set contains 775 human protein interactions (Fig. 33.9).
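The seed selection and nearest-neighbor expansion described in this case study can be sketched with an in-memory SQLite database. This is a minimal illustration, not the queries used in the original study: the table names (omim_gene, ophid_interaction), column names, and all rows are placeholders invented for the example, and SQLite stands in for the local database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical local copies of the OMIM gene table and the OPHID pairs.
CREATE TABLE omim_gene (gene_symbol TEXT, description TEXT);
CREATE TABLE ophid_interaction (protein_a TEXT, protein_b TEXT);

-- Placeholder rows, not real OMIM or OPHID content.
INSERT INTO omim_gene VALUES
  ('SEED1',  'Alzheimer disease, placeholder entry 1'),
  ('SEED2',  'Alzheimer disease, placeholder entry 2'),
  ('OTHER1', 'Unrelated disorder');
INSERT INTO ophid_interaction VALUES
  ('SEED1',      'NEIGHBOR_A'),
  ('NEIGHBOR_B', 'SEED2'),
  ('SEED1',      'SEED2'),
  ('OTHER1',     'NEIGHBOR_C');

-- Step 1: seed set = genes whose description mentions "Alzheimer".
CREATE VIEW ad_seed AS
  SELECT DISTINCT gene_symbol FROM omim_gene
  WHERE description LIKE '%Alzheimer%';
""")

seed = [row[0] for row in
        conn.execute("SELECT gene_symbol FROM ad_seed ORDER BY gene_symbol")]

# Step 2: keep every pair with at least one member in the seed set.
ad_interaction = conn.execute("""
    SELECT protein_a, protein_b
    FROM ophid_interaction
    WHERE protein_a IN (SELECT gene_symbol FROM ad_seed)
       OR protein_b IN (SELECT gene_symbol FROM ad_seed)
    ORDER BY protein_a, protein_b
""").fetchall()
```

In ProteoLens, the second query would be the one registered as the “AD_INTERACTION” association.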
Fig. 33.9 Alzheimer’s disease-related protein interaction network. The bigger nodes with light color are the key proteins (seed proteins)
33.5.2 Case Study 2: Gene Ontology Cross-Talk Network
Platinum-based chemotherapy, usually with the cancer drug cisplatin, has been the primary treatment for ovarian cancer. In this case study, we demonstrate how ProteoLens helps biologists analyze the differences in cellular processes between cisplatin-resistant and cisplatin-sensitive ovarian cancer cell lines by building a GO cross-talk network (Chen et al. 2007).
After analyzing mass spectrometry-derived proteomics data, Zhong et al. identified 574 differentially expressed proteins in cisplatin-sensitive vs. cisplatin-resistant ovarian cell line samples. They then identified protein interaction partners for this differentially expressed protein set (the “seed proteins”) and developed a novel systems biology approach that identifies “significantly interacting protein categories” based on the gene ontology (GO) annotation of the proteins. A GO–GO functional interaction category is a pair of GO categories derived by aggregating all the protein–protein interaction pairs whose interacting proteins carry the same pairing of GO annotation categories. For example, if three protein interactions share annotation category A on one side of the interaction and annotation category B on the other side, then A–B is a functional interaction category with an observed count of 3. In this way, the protein–protein interaction sub-network was transformed into a GO cross-talk sub-network. Seventeen significant GO categories were filtered from 70 GO categories. Figure 33.10 shows a visualization of the activated biological process
Fig. 33.10 Activated biological process network in cisplatin-resistant ovarian cancer cells using cisplatin-sensitive cell lines as controls. Red lines stand for “significant,” while blue lines stand for “not significant.” The p-values of activated protein category significance in the sub-network are encoded as node color intensity, on a scale from light yellow (less significant) to dark red (more significant) (Chen et al. 2007). Reprinted with permission from World Scientific Publishing
functional network, in which nodes represent significantly over-/underrepresented protein functional categories and edges represent significantly interacting protein functional categories. Several additional types of information are also represented. The original abundance (by count) of each functional category is encoded as node size. The p-values of activated protein category significance in the sub-network are encoded as node color intensity, on a scale from light yellow (less significant) to dark red (more significant). From this figure we can see that cisplatin-resistant ovarian cancer cells demonstrated significant cellular physiological changes, which are related to the cancer cell’s native response to stimuli that are endogenous, abiotic, and stress related. Interestingly, we also observed that the regulation of the viral life cycle plays a very significant role in the drug resistance process. This previously unknown observation may be further examined at the protein level to formulate hypotheses about acquired cisplatin resistance in ovarian cancer.
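The aggregation of protein–protein interactions into GO–GO functional interaction categories described in this case study can be sketched as a simple counting step. The Python below is a minimal illustration under stated assumptions: the function name, the protein identifiers, and the category labels are hypothetical, category pairs are treated as unordered, and the significance testing used in the published work is not reproduced.

```python
from collections import Counter
from itertools import product


def go_crosstalk_counts(interactions, go_annotations):
    """Count interactions per unordered pair of GO categories.

    interactions:   iterable of (protein, protein) pairs
    go_annotations: dict mapping a protein to a set of GO categories
    Each interaction contributes at most once to a given category pair.
    """
    counts = Counter()
    for p, q in interactions:
        pairs = {tuple(sorted((a, b)))
                 for a, b in product(go_annotations.get(p, ()),
                                     go_annotations.get(q, ()))}
        counts.update(pairs)
    return counts


# Hypothetical data mirroring the example in the text: three
# interactions with category A on one side and B on the other.
go_annotations = {"P1": {"A"}, "P2": {"B"}, "P3": {"A"},
                  "P4": {"B"}, "P5": {"A"}, "P6": {"B"}}
interactions = [("P1", "P2"), ("P3", "P4"), ("P6", "P5"), ("P1", "P3")]

counts = go_crosstalk_counts(interactions, go_annotations)
```

Here `counts[("A", "B")]` is the observed count for the A–B functional interaction category; each such pair would become one edge of the GO cross-talk network.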
33.5.3 Case Study 3: Human Cancer Association Network
Decade-long study of disease-related genes has generated a comprehensive set of “disease disorder–gene” relationship pairs (also referred to as the “diseasome”), which are represented in the OMIM morbidity map (Hamosh et al. 2005). Goh et al. recently presented a global view of the “human disease network” (HDN), which included 22 disease disorder classes, 1,284 disease disorders, and 1,777 disease genes (Goh et al. 2007). In this case study, we use ProteoLens to construct a disease gene association network for 13 kinds of cancer, using SQL to retrieve the different associations and annotations from a single table that contains disorder–disease gene association pairs (Fig. 33.11). As shown in Fig. 33.11, every node represents a disease, and two nodes are connected by an edge if the two diseases have disorder genes in common. The size of a node indicates the number of disease genes. The color of a node indicates the number of cases in the United States in 2007. The width of an edge indicates the number of genes shared by the two diseases. The visual analytic process can be divided into five steps:
(1) Create a table with two columns, one for the disease name and the other for the gene symbol:
CREATE TABLE GENE_DISEASE (DISEASE_NAME VARCHAR2(50), GENE_SYMBOL VARCHAR2(50))
Then import the gene_disease.txt file (which can be downloaded from the ProteoLens web site) into GENE_DISEASE.
(2) Create a disease–disease association named “Disease_inter”. “Disease_inter” links two diseases, DISEASE_A and DISEASE_B, if they share at least one disease gene.
Fig. 33.11 Disease–disease association network. This is a sub-network of the cancer disease association network, built by retrieving 13 common kinds of cancer. Each node is a kind of cancer, and an edge connects two kinds of cancer if they have genetic disorder genes in common. Node size indicates the number of cancer-associated disorder genes, and node color indicates the number of cases in the United States in 2007: dark color indicates more cases, light color indicates fewer cases, and white indicates that statistical data were lacking. Edge width indicates the number of genetic disorder genes shared by the two kinds of cancer (Huan et al. 2008). Reprinted with permission from BioMed Central
SELECT DISTINCT a.DISEASE_NAME DISEASE_A, b.DISEASE_NAME DISEASE_B
FROM PROTEOLENS.GENE_DISEASE a, PROTEOLENS.GENE_DISEASE b
WHERE a.GENE_SYMBOL = b.GENE_SYMBOL AND a.DISEASE_NAME > b.DISEASE_NAME
(3) Create a node annotation, “Node_Gene_No”, giving the number of genes associated with each disease, and map it continuously onto node size.
SELECT DISTINCT DISEASE_NAME, COUNT(GENE_SYMBOL) count_number
FROM PROTEOLENS.GENE_DISEASE
GROUP BY DISEASE_NAME
(4) Create an edge annotation, “Edge_Common_Gene_No”, giving the number of genes shared by the two connected diseases, and map it continuously onto line width.
SELECT DISTINCT A.DISEASE_NAME DISEASE1, B.DISEASE_NAME DISEASE2, COUNT(DISTINCT A.GENE_SYMBOL) edge_count
FROM PROTEOLENS.GENE_DISEASE A, PROTEOLENS.GENE_DISEASE B
WHERE A.GENE_SYMBOL = B.GENE_SYMBOL AND A.DISEASE_NAME > B.DISEASE_NAME
GROUP BY A.DISEASE_NAME, B.DISEASE_NAME
(5) Create a node annotation, “Node_Case_No”, giving the number of cases in the United States in 2007 (http://www.cancer.org), and map it onto a continuous color scale from white (low occurrence) to red (high occurrence).
CREATE TABLE DISEASE_CASENO (DISEASE_NAME VARCHAR2(50), CASE_NO VARCHAR2(50))
Import Disease_caseNo.txt into DISEASE_CASENO, then create the annotation “Node_Case_No”:
SELECT DISTINCT DISEASE_NAME, CASE_NO
FROM PROTEOLENS.DISEASE_CASENO
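To see what the association and annotation queries of steps (1) to (4) return, they can be run against a toy table. The sketch below uses Python's sqlite3 module rather than the Oracle-style setup implied by VARCHAR2 and the PROTEOLENS schema prefix, so the column types and the missing schema prefix are adaptations; the disease names and gene symbols are placeholders, not real associations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GENE_DISEASE (DISEASE_NAME TEXT, GENE_SYMBOL TEXT)")

# Placeholder rows standing in for the contents of gene_disease.txt.
conn.executemany(
    "INSERT INTO GENE_DISEASE VALUES (?, ?)",
    [("CANCER_A", "G1"), ("CANCER_A", "G2"),
     ("CANCER_B", "G1"), ("CANCER_B", "G2"), ("CANCER_B", "G3"),
     ("CANCER_C", "G3")],
)

# Step (2): the "Disease_inter" association (one row per linked pair).
disease_inter = conn.execute("""
    SELECT DISTINCT a.DISEASE_NAME, b.DISEASE_NAME
    FROM GENE_DISEASE a, GENE_DISEASE b
    WHERE a.GENE_SYMBOL = b.GENE_SYMBOL AND a.DISEASE_NAME > b.DISEASE_NAME
    ORDER BY 1, 2
""").fetchall()

# Step (3): "Node_Gene_No", the number of genes per disease.
node_gene_no = dict(conn.execute("""
    SELECT DISEASE_NAME, COUNT(GENE_SYMBOL)
    FROM GENE_DISEASE
    GROUP BY DISEASE_NAME
"""))

# Step (4): "Edge_Common_Gene_No", the number of shared genes per edge.
edge_common_gene_no = {
    (d1, d2): n for d1, d2, n in conn.execute("""
        SELECT a.DISEASE_NAME, b.DISEASE_NAME, COUNT(DISTINCT a.GENE_SYMBOL)
        FROM GENE_DISEASE a, GENE_DISEASE b
        WHERE a.GENE_SYMBOL = b.GENE_SYMBOL AND a.DISEASE_NAME > b.DISEASE_NAME
        GROUP BY a.DISEASE_NAME, b.DISEASE_NAME
    """)
}
```

The condition `a.DISEASE_NAME > b.DISEASE_NAME` keeps each unordered pair of diseases once and excludes self-pairs, which is why CANCER_B appears first in the edge with CANCER_A.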
33.6 ProteoLens Project
The ProteoLens project is designed to accommodate iterative visual layout, annotation, and exploration of biomolecular networks. It effectively relieves advanced data analysts of the burden of data preparation and processing. The ProteoLens project home page is http://bio.informatics.iupui.edu/proteolens/. The current version is 1.1 (August 11, 2008), and ProteoLens is under continuous development. In future releases, we plan to add open application program interfaces (APIs) so that (1) ProteoLens can interoperate with other bioinformatics software tools and (2) third-party plug-ins can be developed to accommodate the needs of an expanding user community. Users can join the ProteoLens Google group to offer suggestions or collaborate with us: http://groups.google.com/group/proteolens.
References
Baitaluk M, Sedova M et al (2006) Biological networks: visualization and analysis tool for systems biology. Nucleic Acids Res 34(Web Server issue):W466–471
Brown KR, Jurisica I (2005) Online predicted human interaction database. Bioinformatics 21(9):2076–2082
Chen JY, Mamidipalli S et al (2009) HAPPI: a database of human annotated and predicted protein interactions. BMC Genomics 10(Suppl 1):S16
Chen JY, Shen C et al (2006) Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pac Symp Biocomp’06. Maui, HI 11:367–378
Chen JY, Yan Z et al (2007) A systems biology approach to the study of cisplatin drug resistance in ovarian cancers. J Bioinform Comput Biol 5(2a):383–405
Goh KI, Cusick ME et al (2007) The human disease network. Proc Natl Acad Sci USA 104(21):8685–8690
Hamosh A, Scott AF et al (2005) Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33(Database issue):D514–517
Han K, Ju BH et al (2004) WebInter viewer: visualizing and analyzing molecular interaction networks. Nucleic Acids Res 32(Web Server issue):W89–95
Hu Z, Mellor J et al (2005) VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Res 33(Web Server issue):W352–357
Huan T, Sivachenko AY et al (2008) ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining. BMC Bioinform 9(Suppl 9):S5
Peri S, Navarro JD et al (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 13(10):2363–2371
Shannon P, Markiel A et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Suderman M, Hallett M (2007) Tools for visually exploring biological networks. Bioinformatics 23(20):2651–2659
Yildirim MA, Goh KI et al (2007) Drug-target network. Nat Biotechnol 25(10):1119–1126
Chapter 34
MADNet: A Web Server for Contextual Analysis and Visualization of High-Throughput Experiments
Igor Šegota, Petar Glažar, and Kristian Vlahoviček
Abstract Efficient data integration and visualization represents an important component of any systems biology approach. With large datasets resulting from experiments on a complex biological system it is often impossible to analyze and interpret results one by one – rather, we require tools to help us understand the outcome of our experiment in a broader context and with as much visual information as possible. MADNet, the MicroArray Database Network web server, is a user-friendly data mining and visualization tool with a simple and straightforward interface for rapid analysis of diverse high-throughput biological experiment results, such as microarray, phage display, or even metagenome analysis. It visually presents experimental results in the context of metabolic and signaling pathways, transcription factors, and drug targets through minimal user input, consisting only of the file containing a list of genes and associated expression values. This data is integrated with information extracted from various biological databases such as NCBI nucleotide and protein sequence databanks, metabolic and signaling pathway databases (KEGG), transcription regulation (TRANSFAC©), and drug target database (DrugBank). MADNet is freely available for academic use at http://www.bioinfo.hr/madnet.
Keywords Microarray · Oligonucleotide array sequence analysis · Gene expression profiling · Metabolic networks and pathways · Signal transduction networks · Transcription networks · Enrichment analysis
K. Vlahoviček (B) Bioinformatics Group, Division of Biology, Faculty of Science, Zagreb University, Horvatovac 102a, 10000 Zagreb, Croatia; Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway e-mail: [email protected]
S. Choi (ed.), Systems Biology for Signaling Networks, Systems Biology 1, DOI 10.1007/978-1-4419-5797-9_34, © Springer Science+Business Media, LLC 2010
I. Šegota et al.
34.1 Introduction
Present-day technology in the biomolecular sciences enables us to study life at high levels of system organization. In particular, we can use high-throughput methods to measure total molecular and enzymatic processes in the cell, determine the genetic content of entire ecosystems, scan chemical compound libraries for drug activity, etc. Such experiments, involving thousands or even millions of measurements performed in parallel, normally produce large quantities of measured data that must eventually be processed with a computer in order to extract meaningful knowledge from the collected information. However, researchers lacking practical knowledge of bioinformatics and computer automation can easily find themselves overwhelmed by the abundance of experimental data, which can prevent them from exploiting its full potential and lead to inadequate conclusions being drawn from the experiments. The example of high-throughput technology that we will use in this chapter is the microarray experiment. Microarray technology allows us to take a snapshot of the expression of many genes or even entire genomes at the same time. Briefly, this is done by manufacturing an array that contains up to a few thousand microscopic spots with attached DNA oligonucleotides, short segments of a gene, which are then hybridized with cDNA from a sample. Usually, the experiment is performed under two or more different conditions, where one involves cells in a native condition and the others involve cells that are either mutated (cancer), treated with a substance (potential drug, toxin, etc.), or subjected to some environmental stress (increased nutrients, salt, pressure, etc.). Such measurements can be taken in a time series or they can be performed only to answer yes/no questions.
From the difference in measured expression between the native and altered state we can derive the relative change in the amount of mRNA for each gene that hybridizes on the microarray chip – the quantity ordinarily termed fold change. Usually, measurements are performed in several (commonly 3–10) so-called technical replicates (where experimental conditions are not altered), which enables us to statistically estimate the reliability of each measured fold change value (with the p-value). The list of genes produced and associated fold change values then become the subjects of evaluation by researchers, with the aim of reconstructing the behavior of the system (cell) from its components (transcribed genes). During the last 2 decades, microarrays have rapidly gained in popularity; however, reproducibility and, especially, the biological interpretation of the experimental data still present a challenge (Jaluria et al. 2007; Slonim 2002; Yang and Speed 2002). MADNet, the MicroArray Database Network web server, is a web-based software tool designed with the goal to overcome some of the aforementioned difficulties for a user who does not have extensive knowledge of bioinformatics and to help researchers in the final steps of every large-scale experiment: biological interpretation of experimental results.
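As a rough illustration of how a fold change and an indication of its reliability could be derived from replicate measurements before the data reach MADNet, consider the Python sketch below. It assumes paired treated/control intensities for one gene and reports a one-sample t statistic on the log2 ratios; the function name and the numbers are hypothetical, and converting the t statistic into a p-value (which requires the t distribution) is left to dedicated statistical software.

```python
import math
import statistics


def summarize_gene(treated, control):
    """Summarize replicate measurements of one gene.

    Returns (mean log2 fold change, fold change, t statistic).
    The t statistic is None when the replicates show no spread.
    """
    log_ratios = [math.log2(t / c) for t, c in zip(treated, control)]
    mean_lfc = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample standard deviation
    t_stat = mean_lfc / (sd / math.sqrt(len(log_ratios))) if sd > 0 else None
    return mean_lfc, 2 ** mean_lfc, t_stat


# Hypothetical intensities from three replicates.
mean_lfc, fold_change, t_stat = summarize_gene(
    treated=[400.0, 800.0, 1600.0],
    control=[100.0, 100.0, 100.0],
)
```

Working on the log scale makes up- and down-regulation symmetric: a log2 fold change of +1 is a doubling and −1 is a halving.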
34
MADNet
34.2 MADNet Web Server The flowchart in Fig. 34.1 outlines a simplified workflow of a typical microarray experiment with a blown-up box of tasks handled by MADNet. In a nutshell, MADNet inputs the list of genes and associated fold change values (with optional p-value estimates), distributes them across metabolic and signaling pathways, and outputs a ranked list of pathways statistically affected by the microarray experiment. The user can then visually browse and inspect each pathway, which is further annotated with the information integrated from various other database sources, including information about gene identifiers (NCBI, http://www.ncbi.nlm.nih.gov), transcription factors (TRANSFAC – Matys et al. 2006), and drug targets (DrugBank – Wishart et al. 2008). Each gene in a pathway is linked to information databases explaining their function and role in the cell and the display is colored according to their expression value (fold change), which provides a direct visual clue about the behavior of the entire pathway. Furthermore, where available, genes are marked as known drug targets, with a list of drugs known to affect the function of that gene. Genes for which the upstream transcription regulation factor is known are also visually cued on the map, providing another level of contextual information to interpret experiments. Furthermore, MADNet offers the possibility of looking into transcriptional cascades by generating graphs of transcription regulation networks based on information in the TRANSFAC database and enriched with color-coded expression information from the experiment.
Fig. 34.1 Workflow of a typical microarray project, using MADNet, with a blown-up box of tasks handled and information merged by MADNet
34.2.1 Web Server Implementation MADNet is implemented in the form of a World Wide Web server. In our opinion, this has several advantages over stand-alone computer applications. First, it is easily available and platform independent. The user is not required to install anything on their computer and the interface is exactly the same regardless of whether
the program is being used on a PC, a Mac, or any other platform. Second, the server approach is more flexible for integrating various information sources onto metabolic and signaling pathways, relieving the end user of the need for the technical knowledge required to, say, incorporate another database or update an existing one. It also enables MADNet to support more gene identifiers than would be feasible with a stand-alone version. There are disadvantages as well, mostly related to the periodic unavailability of web servers and to the need to transfer large experiment files over the Internet, which increases the delay before the analysis begins. However, it is our belief that the pros outweigh the cons in favor of the Internet-based application. MADNet utilizes PHP web server technology, and its user interface is built as a series of dynamically generated graphical HTML pages. The user uploads a data file, which is stored in a temporary directory on the server, and the server performs all the data analysis remotely. The analysis is fast: the user can analyze thousands of genes within a few minutes, the main limitation being the time it takes to upload the data file to the server. Each user visit is organized and stored as a session, which significantly increases responsiveness and facilitates repeated queries. Special attention has been paid to simplifying the user interface so that the user can focus on biology rather than on numerous processing options that distract from interpreting the data. The workflow is separated into several sections, represented as tabs on the web page (Fig. 34.2). Every analysis begins with data input, and as the user proceeds through the analysis, tabs at the top of the page become available for selection.
MADNet can be tested using an existing microarray data file already available on the server by clicking on the “MADNet Demonstration” button instead of uploading a file and proceeding as follows.
34.2.2 Data Input
The central feature of contextual analysis with MADNet is the input experiment file, consisting of three columns: one with a gene identifier, a second with the corresponding differential expression (fold change), and (optionally) a third with the statistical significance of gene expression, i.e., the p-value. This input file can be either tab-delimited or comma-separated (CSV), and expression values can optionally be log-transformed (a common transformation for expression data, where differences in expression can span several orders of magnitude). It is worth noting that MADNet does not perform any so-called low-level data normalization, noise reduction, or other data manipulation, so it is assumed that these steps have been carried out beforehand using dedicated software such as Bioconductor (Gentleman et al. 2004). Gene identifiers (i.e., accession numbers) help to unambiguously define each gene in terms of sequence and host organism. Depending on the type of the experiment and even on the manufacturer of the microarray slide, these identifiers can
Fig. 34.2 MADNet start page
point to different sequence databases (e.g., NCBI RefSeq, NCBI GenBank, KEGG GeneID, UniProt ID, proprietary identifiers). Upon file upload, MADNet will try to determine the type of gene identifier used in the file and associate it with the host organism. The file will also be scanned to determine whether it contains a header row and to determine the format of the fold change values (log vs. non-log). The auto-detection algorithm scans the first 100 lines of the user input data file and tries to cross-reference the user’s gene identifiers with all identifiers available in the MADNet databases in order to select the organism with the most matches. It also assumes that the two leftmost columns containing floating-point numbers hold the differential expression data and the p-values, respectively. However, because detection can be ambiguous, the user is allowed to specify all these parameters manually. Upon completion of automatic file format detection, the user is presented with a summary and can confirm or modify the suggested parameters (Fig. 34.3). The MADNet database currently supports NCBI GenBank, NCBI RefSeq, NCBI GeneID, UniProt, and ENSEMBL Gene identifiers. Affymetrix gene identifiers are not supported in the present release and users are encouraged to convert them
Fig. 34.3 The information about the input file format and type of the experiment conducted (“processing criteria” tab). Default values are based on MADNet auto-detection algorithm (see text), and the user is allowed to manually adjust parameters
to NCBI RefSeq identifiers using the Affymetrix proprietary software. Currently supported genomes are human, mouse, and the plant Arabidopsis thaliana.
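The auto-detection steps described in this section (delimiter, header row, numeric columns, organism, log vs. non-log) can be sketched in a few lines of Python. This is a hedged illustration, not MADNet's PHP implementation: the function names, the identifier sets in known_ids, the sample file, and the rule "any negative value implies log-transformed data" are all assumptions made for the example.

```python
import csv


def is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def detect_format(text, known_ids, max_lines=100):
    """Guess the layout of an experiment file from its first lines."""
    lines = text.splitlines()[:max_lines]
    delimiter = "\t" if "\t" in lines[0] else ","
    rows = [row for row in csv.reader(lines, delimiter=delimiter) if row]
    has_header = not any(is_float(cell) for cell in rows[0])
    data = rows[1:] if has_header else rows
    n_cols = min(len(row) for row in data)
    # Columns in which every data row parses as a floating-point number.
    numeric = [i for i in range(n_cols) if all(is_float(r[i]) for r in data)]
    # Purely numeric identifiers would need extra handling here.
    id_col = next(i for i in range(n_cols) if i not in numeric)
    ids = {row[id_col] for row in data}
    organism = max(known_ids, key=lambda org: len(ids & known_ids[org]))
    return {
        "delimiter": delimiter,
        "has_header": has_header,
        "id_col": id_col,
        "fold_col": numeric[0],
        "p_col": numeric[1] if len(numeric) > 1 else None,
        "organism": organism,
        "log_transformed": any(float(r[numeric[0]]) < 0 for r in data),
    }


# Hypothetical identifier sets and input file.
known_ids = {"human": {"ID_0001", "ID_0002"}, "mouse": {"ID_9999"}}
sample = "gene\tfold\tp\nID_0001\t1.5\t0.01\nID_0002\t-2.0\t0.20\n"
detected = detect_format(sample, known_ids)
```

As in MADNet, the result of such a guess is best shown to the user for confirmation, since any of these heuristics can be fooled by unusual files.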
34.2.3 Analysis and Visualization
After data input and confirmation of the format parameters, an overall distribution of expression values is produced (Fig. 34.4). The fold change values are usually partitioned by defining two numerical threshold values that separate the input data into three categories (underexpressed genes, overexpressed genes, and genes with no significant differential expression). Threshold values are often selected ad hoc, by inspecting the quality of the experimental data (how much noise the experiment contains) and according to the specific requirements of the experimenter (whether we want to isolate the extremes of our experiment or are looking for finer differences in expression). MADNet offers the user two automatically determined values (from the 2σ interval of the binomial distribution of log-transformed expression values), which can also be set manually to any desired value. Usually, changes of at least twofold in either direction are considered significant. In addition, expression values are colored according to the commonly adopted scheme based on the fold change magnitude: underexpressed genes (expression lower than the control) are colored green to yellow, overexpressed genes (expression higher than the control) yellow to red, and genes with no significant differential expression are uniformly colored yellow.
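The partitioning into three categories can be sketched as follows. The sketch assumes that the 2σ interval means the mean plus or minus two standard deviations of the log-transformed values; how MADNet actually derives its automatic thresholds is not specified beyond the sentence above, so the function names and the choice of estimator are assumptions.

```python
import statistics


def auto_thresholds(log_fold_changes, k=2.0):
    """Return (low, high) as mean -/+ k standard deviations."""
    mu = statistics.mean(log_fold_changes)
    sigma = statistics.pstdev(log_fold_changes)
    return mu - k * sigma, mu + k * sigma


def classify(log_fold_change, low, high):
    """Assign a gene to one of the three expression categories."""
    if log_fold_change < low:
        return "underexpressed"
    if log_fold_change > high:
        return "overexpressed"
    return "unchanged"


# Manual thresholds of -1 and +1 on the log2 scale correspond to a
# twofold change in either direction.
categories = [classify(x, -1.0, 1.0) for x in (-1.5, 0.2, 2.3)]
```

Each category then selects a color range: green to yellow for underexpressed genes, yellow to red for overexpressed genes, and plain yellow for the rest.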
Fig. 34.4 Histogram of differential gene expression for the entire input data file, showing overall tendencies in the data set. The coloring that will be used throughout the entire MADNet analysis is set at this point, according to the threshold values (see text)
The overall histogram can quickly provide insight into the general trends in our experiment, i.e., whether the result is global suppression (green bars more numerous) or activation (more red bars) of biological systems in response to experimental conditions. The same coloring scheme is then used in all subsequent MADNet visualizations. The user can navigate the rest of MADNet by clicking on the tabs at the top of the window and choose whether to investigate specific pathways, particular genes belonging to pathways, or even transcription cascades.
34.2.3.1 Metabolic and Signaling Pathways
When the user clicks on the “Pathway list” tab, MADNet sorts the input genes onto the corresponding metabolic and signaling pathways and provides a list of pathways ranked by the statistical significance of the amount of change that occurred in each pathway. Two types of statistics, the Z-score (Doniger et al. 2003) and the p-value, are used to estimate the magnitude at which pathways were affected, and both take into account two variables: the ratio of altered genes vs. total gene count in a pathway and the average fold change of genes per pathway. The default statistic on which pathways are ranked is the Z-score. Furthermore, MADNet calculates additional statistical parameters for each pathway to help in biological evaluation: minimum and maximum expression values (i.e., expression extremes), median expression of all found genes, total gene count, and a count of how many genes from the experiment are assigned to that pathway. Each pathway in the list is identified visually with an arrow showing the general tendency of the pathway to be under- or
Fig. 34.5 Part of a list of all the pathways that contain recognized genes from the input data, based on the information in the KEGG database. Default sorting is by Z-score
overexpressed, or in case there is no tendency, a dot is displayed (Fig. 34.5). With this list, it is possible to get a quick overview of significantly altered pathways. By clicking on each individual pathway, a new browser window will open with a graphical representation of the entire pathway and visual annotation of genes pertaining to the experiment. Any number of pathways can be opened and visualized simultaneously, making it convenient to analyze multiple pathways at the same time or the same pathway across different experiments. Upon clicking on a pathway name from the list, MADNet generates an interactive, clickable image of a given metabolic or signaling pathway (Fig. 34.6). The pathway image consists of a metabolic map imported from the Kyoto Encyclopedia
Fig. 34.6 Part of a MADNet-annotated interactive graphical representation of the Wnt signaling pathway, with the gene expression data represented by a coloring scheme (color available online), according to the threshold values
of Genes and Genomes pathway database (KEGG – Kanehisa et al. 2006), additionally color-coded based on gene expression values and including gene annotations, with references to the NCBI Gene database. Users can quickly get a visual overview of the up- or down-regulation of a particular pathway or its subsection. By hovering the mouse over a specific gene location on the map, the pop-up window will summarize the information about gene(s) at that location: the expression values, p-values, known transcription factors (according to annotation in the TRANSFAC database), and whether a gene is a target to any known drug. Each gene also contains a hyperlink to additional information in the NCBI Gene database. Pathway images are accompanied with a legend describing all annotations as well as the expression histogram that enables easy identification of the overall trend of expression for each particular pathway (Fig. 34.7).
Fig. 34.7 MADNet-annotated metabolic or signaling pathway. Black and gray frames on pathways indicate transcription factors and/or drug targets respectively, and the exclamation mark tags mark the genes on the map with statistically large uncertainty (big p-value). In addition, the user has an overview of a histogram of fold change of all the genes on the particular pathway
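For readers unfamiliar with pathway Z-scores, the sketch below computes the hypergeometric-style Z-score in the form used by MAPPFinder (Doniger et al. 2003) and ranks pathways by it. This is an illustration only: whether MADNet uses exactly this expression, and how it incorporates the average fold change, is not stated here, and the pathway names and counts are hypothetical.

```python
import math


def pathway_z_score(r, n, R, N):
    """Z-score for the number of changed genes in one pathway.

    N: genes measured in the experiment
    R: measured genes that pass the expression threshold
    n: measured genes assigned to the pathway
    r: pathway genes that pass the expression threshold
    """
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    if variance <= 0:
        return 0.0
    return (r - expected) / math.sqrt(variance)


# Hypothetical experiment: 1000 measured genes, 100 of them changed.
N, R = 1000, 100
pathways = {"pathway_A": (15, 50), "pathway_B": (5, 50), "pathway_C": (1, 40)}

z_scores = {name: pathway_z_score(r, n, R, N)
            for name, (r, n) in pathways.items()}
ranked = sorted(z_scores, key=z_scores.get, reverse=True)
```

A Z-score near zero means the pathway contains about as many changed genes as expected by chance; large positive values indicate enrichment.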
34.2.3.2 Transcription Factors
Mapping gene expression data onto transcription regulation networks is another possibility for systems-level analysis. MADNet cross-references genes from the input data set to the TRANSFAC database of transcription factors. The server generates a clickable list of all transcription factors found, with two numbers in parentheses for each transcription factor: the number of genes regulated by that transcription factor that show a significant expression change in the input set, and the total number of genes known in the TRANSFAC database to be regulated by that transcription factor. By selecting an individual transcription factor from the list and clicking the “Go to. . .” button, the user is presented with a
detailed report on all the genes in the downstream regulation cascade: their differential expressions/p-values, expression averages/variance, and the option to display these genes on pathways (“view on pathways”), which generates another list of metabolic and signaling pathways onto which only the genes regulated by the previously selected transcription factor are mapped. This allows complex pathways to be analyzed in a simplified and targeted way. Furthermore, individual genes or transcription factors are also clickable, linking to their corresponding entry in the NCBI Gene database. In addition to mapping transcription factors onto metabolic and signaling pathways, MADNet can also dynamically generate its own network graphs of transcription regulation cascades, based on information in the TRANSFAC database, annotated with gene expression information and color-coded according to the commonly adopted scheme based on fold change magnitude. Network graphs can be generated by selecting one or more transcription factors from the list and clicking the “Submit” button. The user is then presented with the transcription cascade graph containing the selected transcription factors and their downstream-regulated genes. The nodes in this graph represent transcription factors or genes, and the connecting lines represent the direction of regulation; for example, a line that starts and ends at the same node represents a transcription factor regulating the gene that codes for that transcription factor (i.e., self-regulation) (Fig. 34.8).
Fig. 34.8 Transcription factor network graph, generated based on recognized transcription factors from the input data
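A transcription cascade graph of this kind can be represented as a set of directed edges collected by walking downstream from the selected transcription factors. The Python sketch below is one possible reading of that description: the regulation pairs are placeholders rather than TRANSFAC content, the function names are hypothetical, and following the cascade over several levels is an assumption.

```python
from collections import defaultdict


def build_cascade(regulations, selected_tfs):
    """Collect directed edges reachable from the selected factors.

    regulations: iterable of (regulator, target) pairs
    Returns a set of (regulator, target) edges.
    """
    targets = defaultdict(set)
    for regulator, target in regulations:
        targets[regulator].add(target)
    edges, visited, stack = set(), set(), list(selected_tfs)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        for target in targets.get(node, ()):
            edges.add((node, target))
            stack.append(target)  # a target may itself be a regulator
    return edges


def self_regulating(edges):
    """Nodes with an edge that starts and ends at the same node."""
    return {source for source, target in edges if source == target}


# Placeholder regulation pairs.
regulations = [("TF1", "TF1"), ("TF1", "GENE_A"), ("TF1", "TF2"),
               ("TF2", "GENE_B"), ("TF3", "GENE_C")]
cascade = build_cascade(regulations, ["TF1"])
loops = self_regulating(cascade)
```

Coloring each node by its fold change, as in the pathway maps, then gives a graph comparable in spirit to Fig. 34.8.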
34.2.4 Output
The annotated pathway images and corresponding figure legends can easily be copied into a word processing program or elaborated further using image manipulation software for eventual inclusion in scientific publications. In addition to images, MADNet also generates all data analysis reports as tab-delimited text files and Microsoft Excel spreadsheets, enabling further data processing and the efficient analysis of experiment batches and of experiments containing large quantities of data.
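Because the reports are available as plain tab-delimited text, they are straightforward to post-process with standard tools. The following sketch uses only Python's standard `csv` module; the column names and values are hypothetical and are not taken from MADNet's actual report layout.

```python
import csv
import io

# Hypothetical report content; MADNet's real column names may differ.
REPORT = (
    "gene_id\tfold_change\tp_value\n"
    "GENE_1\t2.10\t0.001\n"
    "GENE_2\t0.20\t0.600\n"
    "GENE_3\t-1.80\t0.004\n"
)


def significant_genes(handle, cutoff=0.05):
    """Yield (gene_id, fold_change) for rows with a p-value below the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if float(row["p_value"]) < cutoff:
            yield row["gene_id"], float(row["fold_change"])


# io.StringIO stands in for a report file opened with open(path, newline="").
hits = list(significant_genes(io.StringIO(REPORT)))
for gene_id, fold_change in hits:
    print(gene_id, fold_change)
```

The same function could be applied to each report in a batch of experiments, which is the kind of downstream processing the text-based output is meant to support.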
34.3 Conclusions and Future Work
MADNet, the MicroArray Database Network web server, is a versatile platform for data mining and visualization of high-throughput experimental data. It integrates experimental results with existing biological data in the context of metabolic and signaling pathways, transcription factors, and drug targets, and presents the results graphically in an intuitive way that is centered on the biological problem, requires minimal technical knowledge, and places no limit on the size of the experiment. MADNet can be used to analyze any information that can be organized as gene identifiers with an associated numerical quantity (and, optionally, its statistical significance). This makes it possible to analyze and visualize data from a number of high-throughput experiments, such as phage display, SAGE, tiling arrays, protein microarrays, or even metagenomes (in terms of the abundance of functional gene categories or the presence and absence of metabolic pathways), which is especially interesting in the context of recent progress in environmental sample sequencing. MADNet complements similar available software packages (Bouton and Pevsner 2002; Chung et al. 2005; Dennis et al. 2003a; Diehn et al. 2003b; Grosu et al. 2002; Salomonis et al. 2007) with a distinctive visual approach, DrugBank and TRANSFAC integration, the ability to process chips of unlimited length, several different statistical measurements of pathway alterations, and an extensible and modular system for including future database links and annotations. MADNet is subject to continuous improvements and feature upgrades. Future work will include consolidation of the underlying database in terms of gene identifiers, as well as the addition of new species mappings to the database structure. A major improvement foreseen in upcoming releases is the dynamic rendering of pathways, with the possibility of analyzing user-submitted pathways.
We also plan to include automatic recognition of all standard chip layouts and gene identifiers and to provide better integration with microarray data repositories (GEO, the Gene Expression Omnibus at NCBI – http://www.ncbi.nlm.nih.gov/geo; ArrayExpress at EBI – http://www.ebi.ac.uk/microarray-as/ae/; SMD at Stanford – http://smd.stanford.edu/), further reducing the number of steps needed to reach the visualization stage. Furthermore, MADNet can easily be adapted to visualize data in the context of functional categories, such as Gene Ontology (GO) or clusters of orthologous genes.
Acknowledgments This work is funded by the EMBO Young Investigator Program (Installation Grant 1431/2006), ICGEB Collaborative Research Programme Grant and Croatian MSES Grant 119-0982913-1211 to KV. The authors gratefully acknowledge the help of Maša Roller in proofreading the manuscript.
References
Bouton CM, Pevsner J (2002). DRAGON View: information visualization for annotated microarray data. Bioinformatics 18(2):323–324
Chung HJ, Park CH et al. (2005). ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 33(Web Server issue):W621–626
Dennis G Jr, Sherman BT et al. (2003a). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4(5):P3
Diehn M, Sherlock G et al. (2003b). SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 31(1):219–223
Doniger SW, Salomonis N et al. (2003). MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 4(1):R7
Gentleman RC, Carey VJ et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
Grosu P, Townsend JP et al. (2002). Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Res 12(7):1121–1126
Jaluria P, Konstantopoulos K et al. (2007). A perspective on microarrays: current applications, pitfalls, and potential uses. Microb Cell Fact 6:4
Kanehisa M, Goto S et al. (2006). From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(Database issue):D354–357
Matys V, Kel-Margoulis OV et al. (2006). TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34(Database issue):D108–110
Salomonis N, Hanspers K et al. (2007). GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8:217
Slonim DK (2002). From patterns to pathways: gene expression data analysis comes of age. Nat Genet 32(Suppl):502–508
Wishart DS, Knox C et al. (2008). DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36(Database issue):D901–906
Yang YH, Speed T (2002). Design issues for cDNA microarray experiments. Nat Rev Genet 3(8):579–588
890 Bayesian statistics, 381–384 computational issues and solutions, 383–384 definitions, 381–382 ‘degree of belief’, 382 interpretations, 381–382 ‘marginal distribution’, 382 ‘posterior distribution’, 382 B-cell receptor (BCR) pathway, protein–protein interactions in, 8–9 Behavioural properties, Petri net, 828–830 Benjamini–Hochberg correction, 96 BetaPlotter, 791, 793 Beta-uniform mixture model (BUM), 354 Beta Workbench (BWB), 789–800 BlenX compiler, 789 BlenX runtime environment, 789 BWB CTMC generator, 789–790 BWB reactions generator, 789–791 BWB simulator, 789 logical structure of, 789 usage, 793–800 BetaPlotter, 793 BlenX Designer, 798 ComplexViewer, 795 CoreBWB, 793 CoSBiLab Graph, 795 Bicoid (Bcd) transcription factor, 47–48 Bidirectional BLAST hits, 28 Bimolecular reaction, 782 Binomial-Neighborhood (BN) model, 404–406, 408 assumption on PPI, 404–405 probabilistic inference from, 405–406 Biochemical reaction networks automating mathematical modeling of, 159–198 See also under Automating mathematical modeling Biological networks dynamics modeling from time course data, 275–292 grammar-based representation of domain knowledge, 279–283 metabolic network, 280 nonterminal symbols, 279 ODEs, 276 polynomial models and constraints, 277–279 process-based models, 283–285 reaction network, 279 representing knowledge for, 277–285 See also Learning dynamics
Subject Index mixture model on graphs for, 372–374 Biological pathways, 139–155 biological pathways eXchange (BioPAX), 78 See also Logic-based diagrams of biological pathways Biological process, 261–264 depiction, 146–147 Biological replicates, 92–93 Biology vs. Information, 87–89 Biomolecular Interaction Network Database (BIND), 221 BioNetS, 759 Bistability, 317 BlenX Designer, 798 BlenX, programming biology in, 777–820 BASERATE, 780, 782 bimolecular reaction, 782 BlenX language, 778–789 change action, 779 continuous variables, 784 declaration file, 778, 783 delay action, 779 delete verbs, 786 denotational descriptions, 777 events reaction, 781 hide action, 779 interfaces file, 778 join verb, 786 monomolecular reaction, 781 new verb, 786 operational descriptions, 777 output action, 779 program file, 778 split verb, 786 unhide action, 779 update verb, 786 See also Actin polymerization; Beta Workbench (BWB) Bone marrow macrophages (BMMs), 77 Boolean logic operators, 148 Boolean model, 32 Boolean networks, 211 Branch_manager process, 809 Branch-and-cut, 364 Brusselator model, 126–128 Burn.in, 383, 389 C Caenorhabitis elegans (roundworms), 623 cAMP-response element-binding (CREB) protein, 571 Cancer-perturbed PPI network construction for apoptosis, 589–608
Subject Index date hubs (dynamic hubs), 591 disease-perturbed PPI networks, 590 drug targets prediction, 600–606 apoptosis regulators, 600, 605 CASP2, 600 CASP3, 600 CASP9, 600 CCND1, 605–606 CDKN1A, 605–606 common pathway, 601–603 extrinsic pathway, 603 IGF1, 605–606 intrinsic pathway, 603–605 PCNA, 605–606 PRKCD, 605–606 stress-induced signaling, 600, 605 TNF, 600 TNFRSF6, 600 experimental data processing, 592–594 selecting, 592 gain of function, 597–598 human-perturbed PPI networks, 590 initial PPI networks construction, 591–592 interactions in, identification, 594–596 modification, 596–597 loss of function, 597–598 party hubs (static hubs), 591 Caspase activation through static and dynamic hubs, 606–608 CCAAT/enhancer binding proteins (C/EBPs), 576–577 CCAAT/enhancer-binding protein delta (Cebpd), 535 cDNA arrays, 90 Cell cycle behavior sensitivity to intrinsic vs. extrinsic noise, 58 Cell death, mathematical model of, 38–39 Cell signaling networks, 10 CellDesigner, 80, 161, 179–183 CellML, modular mass-action modelling with, 165, 721–749 anatomy of, 725–726 tags, 725 tags, 725 tags, 726 tag, 726 tags, 726 tags, 725 decoupling the components, 745–749 Cformation component, 745 Eformation component, 745
891 importing example, 742–744 mass-action modelling, 721–749 bidirectional reaction example, 727–735 OpenCell, 726 mathematical formalism, 722–725 bidirectional reaction, 723 unidirectional reaction, 722 modularisation, motivation for, 745 multi-environment reaction, 735–738 combined example model, 738–742 Cell-penetrating peptides (CPP), 716 Cellular automata (CA), 299–302 complex system, 299, 303 periodic dynamics, 299, 303 random dynamics, 299, 304 Cellular compartments, 149–151 Cellular component, 261–264 Cellular-level gene regulatory networks, 429–444 large-scale cellular-level models, 432–434 model derivation, 434–435 model interpretation, 435 models inferred from alliance for cell signaling data, 436–440 model properties, 436–439 regulatory influence statistical support, 439 temporal regulation, 439–440 multiscale cellular networks, 440–443 Cellular responses measurement, 6 Cellular signaling, alliance for, 5 Chaos in NK Boolean networks, 306–309 Chemical kinetic (mechanistic) models, 32 Chemical Langevin equation (CLE), 33, 54 Chemical master equation (CME), 53, 754 Chromatin immunoprecipitation (ChIP), 536 Chronic liver diseases (CLD), 645–681 See also Obesity-related CLD pathogenesis Chronic myelogenous leukemia (CML), 519 Circular layout choice of biological network, 863 Clustering, 20–21, 100, 236 agglomerative (hierarchical clustering), 100 average linkage, 100 complete linkage tends, 100 noncanonical (principal component analysis), 100 partitioning (self-organizing maps, k-means clustering), 100 single linkage clustering, 100 CNM clusters with highest p-values, 265–266
892 Coherence resonance (CR), 58–59 in stochastic model of circadian rhythm, oscillatory behavior due to, 58–60 Coloured Petri nets (CPN), 832 Common pathway in cancer drug targets prediction, 600 Comparative genomics hybridization studies (CGH), 89 Complement factor 5a (C5a), 37 Complete linkage tends, 100 Complexity analysis, 270–271 Complexity in NK Boolean networks, 306–309 ComplexViewer, 792, 795–796 Component annotation, 145–146 Composition/rejection method, 758 Computational capacity in cells, 315–323 See also Feedback loop connecting dynamics to, 302–309 Computational modeling MAPK1,2 network, 467–470 tyrosine-phosphoproteome dynamics data, 451–452 Computational procedures for model identification, 111–135 See also under Model identification, computational procedures for Computational resources for innate immunity, 537–538 Computational scientific discovery, 277 Computer-aided mathematical modeling of biological systems, 179–189 context-sensitive assignment of rate equations, 181–183 modulation, 182 graphical modeling tool CellDesigner, 179–181 MIRIAM annotations, model-merging using, 184–189 RDF (Resource Description Framework) syntax, 187 SBMLsqueezer, 183–184 semanticSBML, 188–189 Conditional gates, 147 Conditional probability, 381 Connector–enhancer of KSR (CNK), 458 Conserved modules, identification, 249 Conserved transcriptional modules, identification, 242–245 Constrained Induction of Polynomial Equations (CIPER), 278, 288
Subject Index Constrained probabilistic sparse matrix factorization (cPSMF) algorithm, 240 Context-free grammars (CFG), 279–280 Context-sensitive assignment of rate equations, 181–183 Continuous dynamical models, 32–33 advantages of, 33–34 Continuous Petri nets (CTPNs), 832 Continuous time Markov process (CTMC), 789 Control vector parameterization (CVP) approach, 128–130 COPASI, 759 Copy number variation (CNV) probes, 655 CoreBWB, 793 Corrupted data, detection, 21–22 Cortical basal degeneration (CBD), 612 CoSBiLab Graph, 792, 795 Cost function, 115–116 Crammèr–Rao inequality, 125 Cross-reference ID, 542 Current drug discovery strategy, limitations of, 491–492 Currency metabolites, 374 Cut constraints, 363 Cyclic AMP (cAMP) signalling, 561–582 architecture of, 563 C/EBPs as cAMP-activated EPAC1-regulated transcription factors, 576–577 endothelial barrier function control by, 565–568 actin cytoskeleton, 565 adherens junction (AJ) complexes, 565 co-ordination by EPAC, 566–568 co-ordination by PKA, 566–568 integrin-containing junctions, 565 microtubule (MT) network in, 567 genome-wide control of transcription by cAMP, PKA-independent route for, 580–581 IL-6 signalling control by, 574–576 pro-inflammatory signalling regulation, 568–579 IL-6 signalling through gp130 and IL-6Rα homodimers, 568–571 JAK-mediated activation of ERK1,2 and STATs via gp130, 570 SOCS-3 induction by, 577–579 systems-based future directions, 580–581 new SOCS-3 targets identification, 580
Subject Index in vascular endothelium, anti-inflammatory signalling by, 561–582 Epac, 563–565 Cyclic nucleotide binding domains (CNBs), 563 Cytokines, 532 cytokine receptor signalling inhibition, 571–574 Cytoscape, 30, 80, 768–770 D 2D-PAGE proteomics profiling, 654 Data definition languages (DDL), 859 Data from diffuse large B-cell lymphomas (DLBCL), 355 activated B-like phenotype (ABC DLBCL), 355 germinal center B-like phenotype (GCB DLBCL), 355 Database for annotation, visualization, and integrated discovery (DAVID), 693 Date hubs (dynamic hubs), 589 Decision making in cells, 295–332 emergent, 327–331 mathematical basis for, 311–314 nontrivial biochemical network activity in, 325–327 apoptosis, 326 differentiation, 326–327 growth, 326 movement, 325–326 proliferation, 326 signal transduction networks, 327 stem cell differentiation process, 327 Declarative SQL-based visual analysis, 863 Definition for systems biology, 207 Delete verbs, 786 Desensitization, 317 Desktop tools, for network visualization, 30–31 Cytoscape, 30 Medusa, 30 online network browsers, 31 Osprey, 30 Pajek, 30 programming libraries, 31 Differential equation models, 33 Differential evolution, 122 Direct finite-time Lyapunov exponent (DLE) analysis, 40 Direct method, 757 Direct-search methods, nonlinear programming solvers, 118
893 Direct structural reachability, 257 Directed acyclic graphs (DAGs), 404, 407 Discrete modelling, 32, 829–851 See also Petri net foundations Discrete Stochastic Models Test Suite (DSMTS), 765 Divergent modules, identification, 249 Divergent transcriptional modules, identification, 242–245 Domain knowledge, grammar-based representation of, 279–283 Domineering non-autonomy, 46 Dose-to-duration encoding, 41–42 Double activation mechanism, 222–223 Dual activation mechanism, 222 Dynamic hubs, caspase activation through, 606–608 Dynamical models of biological networks, 31–62 chemical Langevin equation (CLE), 33, 54 continuous models, 32–33 discrete models, 32 hybrid dynamical models, 61–62 limitations, 34 ordinary differential equation (ODE) systems, 32, 35–44 See also individual entry partial differential equation (PDE) models, 32, 44–52 See also individual entry stochastic differential equation (SDE), 33, 52–61 See also individual entry tasks associated with, 34–35 emergent behavior analysis, 35 model construction and calibration, 34 model validation and testing, 34 parameter sensitivity analysis, 34–35 predictive modeling and discovery, 35 Dysfunctional vascular endothelium and disease, 561–562 E Edge in mEPN, 143 Edinburgh pathway notation (EPN), 142 Elementary cellular automata, 299 Elongin A, 572 Emergent decision making in cells, 327–331 Emergent networks, 309–311 equivalence class, 312 functional properties of, 309–311 nonlinear connectivity, 310–311 nonlinear functions, 309–310
894 Emergent networks (cont.) pattern recognition, 312 nontrivial, 313 trivial, 313 structural properties of, 309–311 Emulator, 61 Enabling Rule, 826 Endoplasmic reticulum (ER) stress, 592 Endothelial barrier function control of by cyclic AMP, 565–568 See also under Cyclic AMP (cAMP) signalling Energy/molecular transfer nodes, 148 Enzymatic futile cycle, noise-induced bistability prediction in, 60–61 Enzyme-centred network, 374–375 Epidermal growth factor receptor (EGFR), 462, 471 Equation discovery, 277 ErbB signaling ODE system, 42–43 Erythropoietin (Epo)-mediated modulation of erythropoiesis, 227–231 JAK2–STAT5 pathway activation, 226 multi-level model for, 227–231 red blood cells, differentiation and proliferation of, 228 European Bioinformatics Institute (EMBL-EBI), 77 Events reaction, 781 Exchange protein directly activated by cAMP (Epac), 563–565 C-terminal CNB in, 564 N-terminal CNB in, 564 Exocyst complex cluster, 269 Expression signature, 97 EXtensible Markup Language (XML), 79 elements, 79 schema, 79 well-formed, 79 Extracellular signal-regulated kinase (ERK), 455–481 Extrinsic pathway in cancer drug targets prediction, 600 F False discovery rate (FDR), 96, 354, 362 Family-wise error rate (FWER), 96 Farnesyltransferase inhibitors (FTI), 477 Feasible Steiner arborescence, 363 Feasible T-invariants, 834–836 Features of systems biology, 209–210 Feedback loops, 315–323 in MAPK1,2 signaling, 464–467
Subject Index negative, 467 positive, 464–467 with multi-step signaling cascades, 322–323 negative, 316–317 positive, 317–320 and negative, combining, 320–322 Fibroblast growth factor (FGF) signaling, 458 FGF8, 517 FGFR-1, 479 FGFR2, 674 Fick’s second law of diffusion, 44 Firing rule, 825–828 Enabling Rule, 826 First reaction method, 756–757 Fisher Exact Test, 495 Fisher Information Matrix (FIM), 125 Fisher’s exact test, 102 Flux-balance analysis (FBA), 62 Flux equation, 733 Fold change, 878 Formal task specification, 285–286 Framework for evaluation of reaction networks (FERN), 751–773 accuracy of, 765–766 basic usage of, 768 cell growth simulation, 770–771 Command Line Tool, 766–768 Cytoscape plugin for stochastic simulation, 768–770 division using observers, 770–771 implementation details, 760–765 AmountManager controls, 761 AnnotationManager, 761 Decorators, 762 Evolution algorithms, 762 networks, 761–762 PropensityCalculator, 762 Readers, 85 networks, import and export of, 762–763 simulation algorithms, 763–764 tau-leaping algorithms, 763 observer system, 764–765 Gnuplot class, 764 Image objects, 764 Stochastics, 764 Petri nets, 752–754 reaction networks evaluation, 751–773 runtime performance of, 765–766 stochastic chemical kinetics, 754–756 stochastic simulation methods, 756–759 composition/rejection method, 758 direct method, 757
Subject Index first reaction method, 756–757 hybrid methods, 758–759 next reaction method, 757 Tau-leaping methods, 758 stochastic simulation, 751–773 From microarray to biology, 85–106 biology vs. information, 87–89 experimental design, 92–105 data normalization, 93–96 data preprocessing, 93–96 gene expression, 94 genes identification, 96–99 replicates, 92–93 gene networks, 103–104 clustering, 100 dynamic behavior of, 99–101 modular behavior of, 99–101 genes of interest, functional meaning behind, 101–102 known genes performing known functions, 101 known genes performing unknown functions, 101 unknown genes performing unknown functions, 101 information flow in 1969 as compared to 2009, 88 microarray data analysis, 91–92 microarray history, 89–90 microarray technology, 90–91 cDNA arrays, 90 oligonucleotide arrays, 90 transcription factors, 104–105 unannotated genes, 105 Frontotemporal dementia (FTD), 612, 637 Functional modules, finding, 253–272 algorithm scan, 258–260 application, 260–271 complexity analysis, 270–271 methods, 255–260 structure-connected clusters, 255–258 See also individual entry G G protein coupled receptors (GPCRs), 456 Gaussian distribution, 349 Gauss–Newton method, 118 Gene expression analyses, 7 Gene Expression Omnibus (GEO), 105, 618 Gene networks, 103–104 Ariadne Genomics Pathway Studio, 103 canonical pathways, 103 GeneGo, 103
895 Ingenuity Pathway Analysis, 103 interaction networks, 103 Gene Ontology (GO), 245–246, 401–403, 407, 632 gene ontology annotation (GOA) database, 78 hierarchy, 403 the true path rule, 402 validation metric based on, 261–267 anaphase-promoting complex proteins, 267–268 biological process, 261–266 cellular component, 261–266 clustering score, 263 CNM clusters with highest p-values, 265–266 exocyst complex cluster, 269 molecular function, 261–266 SCAN clusters with highest p-values, 264–265 translation initiation complex cluster, 268 Gene regulatory networks, inference of, 240–242 Gene set enrichment analysis (GSEA), 672 GeneGo, 103 General repository for interaction datasets (GRID), 401 Generalized mass action (GMA)-based models, 112, 175–176 Generalized probabilistic sparse matrix factorization (gPSMF) algorithm, 242 Generalized rate laws, 174–179 generalized mass action kinetics, 175–176 generalizing enzyme kinetics, 176–178 Hill equation, 178–179 modulation matrix, 174 stoichiometric matrix, 174 Generalizing enzyme kinetics, 176–178 Gene-regulatory reactions, 183 Genes identification, 96–99 Benjamini–Hochberg correction, 96 expression signature, 97 false discovery rate (FDR), 96 family-wise error rate (FWER), 96 p-value, 96 significance analysis of microarrays (SAM), 96 t-test, 96 variance-based approaches, 98
896 Genes of interest, functional meaning behind, 101–102 known genes performing known functions, 101 known genes performing unknown functions, 101 unknown genes performing unknown functions, 101 Genetic sequence database (GenBank), 77 Gibbs sampling, 383 Gillespie algorithm, 53 Global hierarchical conditional probability, 409–410 Global methods, 119–121 deterministic, 119 stochastic, 120 Glyphs in mEPN, 143–145 Goodwin model, 121–123 G-protein-coupled receptors (GPCRs), 37 Bayesian calibration using noisy data, 37–38 Graemlin network aligner, 28 Grammar, 283 grammar-based equation discovery, 289 grammar-based representation of domain knowledge, 279–283 Graph modeling language (GML), 859, 861 Graphical modeling tool CellDesigner, 179–181 Growth factor receptor bound-2 (Grb2), 508–509 ‘Guilt-by-association’, 401 H Hair follicle spacing, molecular mechanisms modeling for, 50 Half saturation constant, 178 Head and Neck Squamous Cell Carcinoma (HNSCC), 687–702 clinical management of, 688 complex formation in T cell signalling, 705–706 data extraction, 690–693 data formatting, 693–694 bounded fold changes, 694 common template, 693 data retrieval and processing, 689–694 systematic reviews and meta-analysis, 689 demographics of, 688 different cell lines, comparison, 714 evolving transcriptome of, 687–702 highly differentially expressed chromosomal regions, 700–701
Subject Index interaction motif for network organization, 715–716 peptide microarray-based approach, 706–709 peptides selection, 708 microarrays, generation and processing of, 710–712 progressive stages of, 689 Meta, 689 Pre, 689 TvN, 689, 692 protein interaction networks, large-scale analyses, 708–709 T cell signalling, 705 search strategy and flow diagram, 690 signaling pathways, topological analysis, 695–700 EMP1, 695 integrin signaling pathways, 698–699 inter-modular vs. intra-modular hubs, 695–698 MMP1, 695 signalling complexes architecture, analysis, 714–715 signalling-dependent changes in complex formation, 712–713 siRNA knockdown of integrin molecules, 699–700 systems-level analysis, 688–689, 694–701 gene expression signatures, consensus membership of, 695 progressive trends, 694–695 tissue-specificity as an example of validity assessment, 694 ‘Healthy’ tissue controls, 662–666 Hepatocyte growth factor receptor (HGFR), 519 Hesl autoregulatory network, 56 Heterogeneous genome-wide protein data, 415–422 network information integration with, 415–422 GO hierarchy contribution, 419–422 network comparison, 418–419 PHIPA, 416–418 protein feature contribution, 419–422 STRING, 415 Hidden nodes, varying the number of, 341–342 Hierarchical binomial-neighborhood (HBN) assumptions, 408–409 inference from, 409–410 global hierarchical conditional probability, 409–410
Subject Index local hierarchical conditional probability, 409 Hierarchical clustering (HC), 246 Hierarchical layout choice of biological network, 863 High-level Petri nets, 832 High-throughput approaches to CLD, 650–659 focused proteomics research, 659 genomics, 650–659 ‘healthy’ tissue controls, 662–666 heterogenous composition of tissues, 660–662 issues related to, 666–668 mRNA and protein levels correlation, 666 mRNA profiles evaluation, 650–653 NextGen sequencing technologies, 652 protein profiles evaluation, 653–659 2D-PAGE proteomics profiling, 654 copy number variation (CNV) probes, 655 FTICR MS, 656 ICAT technique, 658 mass spectrometry (MS), 654–655, 657 SELDI spectrometers, 657 SILAC, 658 single nucleotide polymorphisms (SNPs), 655 proteomics, 650–659 publicly available data sets, 669–671 sample size, 668–669 two-channel microarray profiling, 651 High-throughput yeast 2-hybrid (Y2H) screenings, 492 Hill equation, 43, 178–179 Homodimerization in receptor–transducer interactions, 220–224 cell lines required for, 226 double activation mechanism, 222–223 dual activation mechanism, 222 dynamical implications of, 220–224 homodimer–homodimer interaction, 221 single activation mechanism, 222–223 workflow, 221 Homodimers, 568–571 Hosphatidylinositol 3,4,5-trisphosphate (PIP3), 511 Hubs, 255, 261, 376 Human lung microvascular and human umbilical vein ECs (HUVECs), 568 Human Protein Reference Database (HPRD), 355–356 Human Proteome Organization (HUPO), 671
897 Hybrid dynamical models of biological systems, 61–62 Hybrid functional Petri nets (HFPN), 452, 832 Hybrid methods, 121, 758–759 Hybrid Petri nets (HPNs), 832 HyperText markup Language (HTML), 79 Hypervariable (HV) genes, 97 Hysteresis, 319 I Identifiability, 123–128 Brusselator model, 126–128 Fisher Information Matrix (FIM), 125 Monte Carlo-based approach, 125 structural versus practical, 124 IL-6 signalling through gp130 homodimers, 568–571 IL-6Rα, 568–570 ‘Improper prior distributions’, 385 Immune Response In Silico database (IRIS), 537 Immunoreceptor tyrosine-based activation motifs (ITAMs), 705 In vitro models of AD, 621–622 Independent component analysis (ICA), 238, 246 Indirect methods, in nonlinear programming solvers, 118 Inductive process modeling (IPM), 289–291 Inferring transcriptional regulatory network, 235–250 See also Transcriptional regulatory network, inferring Information flow in 1969 as compared to 2009, 88 Information quantification, 297–314 See also Decision making in cells; Network-based Information processing Ingenuity Pathway Analysis (IPA), 103, 672, 695 Initial value problem (IVP) solver, 116 Innate immunity, mammalian, systems-level analyses, 531–557 complexity of, 533–536 computational resources for, 537–538 See also Smallpox gene expression data set using InnateDB Insulin growth factor receptor (IGFR), 519 Integer linear program (ILP), 363 Integrated analysis, 236, 247, 368, 537 Integration, 20–21
898 Integrating Network Objects with Hierarchies (INOH) pathway, 547 Integrin signaling pathways, HNSCC, 698–699 ITGA3, 699 ITGA5, 699 ITGA6, 699 ITGB1, 699 Integrin-containing junctions, 565 Interaction_modifier process, 809 Interferon-beta (IFNβ), 6 Interferon-gamma (IFNγ), 6 Interferon regulatory factor 1 (IRF1), 536 Interferon regulatory factor 3 (IRF3), 534 Interleukin 4 (IL4), 6 Inter-modular vs. intra-modular hubs, 695–698 Internet Engineering Task Force (IETF), 165 Intracellular signal cascade (ISC) prediction on yeast genes, 410–415 cross-validation design, 411–412 evaluations, 412 GO hierarchy used for, 411 Intracellular signal cascade, 418–421 Intrinsic pathway in cancer drug targets prediction, 600–601 Invertebrate models of AD, 623–626 Irreversibility, 319 ISBJava library, 759 Isobaric tag for relative and absolute quantitation (iTRAQ), 449 Isotope-coded affinity tag (ICAT) technique, 448, 658–656 iTRAQ-based proteomics, idiosyncrasies of, 378–381 workflow for, 380 labelling using iTRAQ tags, 380 protein extraction, 380 trypsic digestion, 380 J JAK2–STAT5 pathway, signal amplification in, 216–220 responsiveness, 216 for sustained stimulation, 219 for transient stimulation, 220 structure of JAK2–STAT5 pathway model, 217 Jdesigner, 80 Join verb, 786 Joint probability, 382 K KEGG database use, 161 k-means clustering approach, 100 K-means clustering, 246
Subject Index Knockout analysis, 838–839 Kruskall–Wallis test, 673 Kyoto Encyclopedia of Genes and Genomes (KEGG), 632 L Labels, 20 LAGRAMGE searches, 285, 289 LAGRAMGE2, 289–291 Langevin method, 54, 755 Large-scale cellular-level models, 432–434 Large-scale microarray atlases, of transcriptional changes in AD, 613, 620–621 Laser capture micro dissection (LCM), 661 Latent data-based Bayesian approach, 56 Law of parsimony, 192 Layout choices of biological network, 863–864 circular, 863 hierarchical, 863 organic, 863–864 sub-network retrieving capability, 864 Learning dynamics, 285–291 CIPER, 288 formal task specification, 285–286 general algorithm for, 286–288 inductive process modeling, 289–291 LAGRAMGE, 289 Leukemia inhibitory factor (LIF), 536 Levenberg–Marquardt method, 118 Lipopolysaccharide (LPS), 6–7 Liquid chromatography-tandem mass spectrometry (LC-MS/MS) technology, 448 Liver, 647 Local density enrichment, 404 Local hierarchical conditional probability, 409 Local methods, 118–119 Logical approach, discrete modeling, 840–847 from Logical regulatory graphs to Petri nets, 843–845 logical regulatory graphs, analysis, 842–843 software packages supporting, 850 yeast, mating and filamentous pathways in, 845 Logic-based diagrams of biological pathways, 139–155 information collation, 151–154 modified Edinburgh pathway notation (mEPN), 139, 142–151
See also individual entry pathway assembly, 151–154 pathway diagrams, 140 Lymphochip-specific interactome network, 356 M M (mitotic)-phase promoting factor (MPF), 57 Macrophage colony-stimulating factor (MCSF), 6 Manipulation languages (DML), 859 ‘Marginal distribution’, 382–383 Markov chain Monte Carlo (MCMC) methods, 37 Mass Action Law, 783 Mass action models, 211 Mathematical modelling, 114, 207–231 See also Signal transduction pathways investigation strategies Matrix decomposition, 236 Mauritius maps, 836–837 Maximal common transition set (MCT-set), modularisation using, 837 Maximum-Weight Connected Subgraph Problem (MWCS), 363 Medusa, 30 MEK inhibitors, 479–480 AZD6244, 480 MEK 1, 479–480 MEK 2, 479–480 PD0325901, 480 PD98059, 480 Metabolic control theory (MCT), 377 Metabolic networks, 253, 280 functional correlation in, 374–378 enzyme-centred network, 374–375 metabolite-centred network, 374 regulatory correlation, 376–378 structure of, 375–376 topologies, 376 Metabolites, 825 MetaCyc database use, 161 Metaheuristics, 121 Metropolis–Hastings algorithm, 37 Michaelis–Menten equation, 60, 176–177, 183, 723–724 Microarray and survival data, 355–356 Microarray data analysis, 91–92 constraint in, 91 Microarray Database Network Web Server (MADNet), 78, 878, 887 analysis and visualization, 877–887 data input, 880 fold change, 878
metabolic and signaling pathways, 883–885 output, 886–887 technical replicates, 878 transcription factors, 885–886 Web Server implementation, 879–880 Microarray Gene Expression Data Society (MGED), 669 Microarray history, 89–90 Microarray technology, 90–91 MicroRNAs (miRNA profiling), 89 Microtubule (MT) network, 567 Minimal spanning tree (MST) algorithm, 407, 416 MiniMental State Examination (MMSE) score, 617 Minimum description length (MDL) principle, 288 Minimum Information About a Microarray Experiment (MIAME), 669 Minimum Information About a Proteomics Experiment (MIAPE) standard, 670 MIRIAM annotations, model-merging using, 184–189 Mitochondrial antiviral signaling (MAVS), 534 Mitochondrial outer membrane permeabilization (MOMP), 38 Mitogen activated protein kinase (MAPK1,2) network, systems biology of, 455–481 alternative MAPK1,2 modeling methods, 475–476 computational modeling, 467–470 as a drug target, 477–480 farnesyltransferase inhibitors (FTI), 477 PLX4032, 479 Raf inhibitors, 478 RAF265, 479 Ras inhibitors, 477–478 XL281, 479 MEK inhibitors, 479–480 quantitative models development, 476 role in disease, 463–465 signaling cascades/network, 460, 467–478 cross talk with other signaling pathways, 461 feedback loops in, 464–467 FGF receptor substrate 2 (FRS2) in, 473 kinetic analysis, 473–474 MAPK phosphorylation and activation, duration and amplitude of, 460 negative feedback loops, 471 ODE-based model, 472–474 oncogenic mutations in, 462
positive feedback loops, 464–466 recent trends, 474–475 scaffolding effects, 473 specific scaffolding and binding interactions, 460 sub-cellular compartmentalization, 460 Mitogen-activated protein kinase (MAPK), crosstalk between, 319, 505–522 during embryogenesis, 514–517 during artery specification, 514–516 during vein specification, 514–516 ERK1/2 signaling, 515 VEGF signaling, 515 HGF stimulation, 508 and human cancers, 518–520 potential therapeutic targets, 518–520 levels, 508 Raf/MAPK and PI3K/Akt signaling pathways, biochemical crosstalk between, 512–514 at the level of adaptor proteins, 513 at the level of ERK and TSC, 513–514 at the level of Raf and Akt, 513 near cell membrane, 512–513 Raf-1, 509 Ras protein in, 506–507 in vertebrate limb development, 516–517 by FGF8-mediating MAPK (ERK), 517 by PI3K/Akt pathways, 517 Mitogen-activated protein kinase kinase (MAPKK), 322 Mitogen-activated protein kinase kinase kinase (MAPKKK), 322 Mixture model on graphs (MMG), 371–394 Bayesian statistics, 383–386 See also individual entry iTRAQ-based proteomics, idiosyncrasies of, 378–381 metabolic networks, functional correlation in, 374–378 α parameter, effect, 386 posterior distribution, 384–387 prior distribution, 384–387 systems biology and biological networks, 372–374 Model-based approaches, for inferring transcriptional regulatory network, 235 Model identification, computational procedures for, 111–135 model building loop, 111–113
generalized mass action (GMA)-based models, 112 identifiability analysis, 112 mathematical modeling, 112 parameter estimation, 112–116, See also Parametric identification power-law models, 112 signaling pathways, representing, 111–112 optimal experimental design, 128–135 NFκB regulatory module, 130–135 numerical method, 128–130 See also Identifiability Modeling pipeline, 160–163 KEGG database use, 161 kinetic equation for describing network topology, 161 equipping with values, 162 experimental validation of result, 162 model annotation, 164 MetaCyc database use, 161 Modified Edinburgh pathway notation (mEPN), 139, 142–151 biological processes, depiction, 146–147 Boolean logic operators, 148 cellular compartments, 149–151 component annotation, 145–146 conditional gates, 147 edge, 143 edges use, 149 energy/molecular transfer nodes, 148 glyphs, 143–145 interactions between components, depiction, 149 node, 143 pathway components, depiction, 143–145 pathway modules, 148 pathway outputs, 148–149 Modular modelling with CellML, 744–749 See also under CellML, modular mass-action modelling with Modularity-based algorithms, 254 Modulation, 182 modulation matrix, 174 Molecular function, 261–264 Molecular interaction networks using InnateDB, generating and exploring, 552–555 Monocyte chemoattractant protein-1 (MCP-1), 562 Monomolecular reaction, 781 Monte Carlo-based approach, 124–126 mRNA profiles evaluation, 650–653
Multi-dimensional stimuli, 345–346 multiple stimuli, 346–347 Multilayer perceptron (MLP) network, 238–241 Multiple shooting, 116–117 Multi-scale biological entities and ProteoLens, 858–859 Multiscale cellular networks, 440–443 Multi-step signaling cascades, feedback loops with, 322–323 Multi-valued decision diagrams (MDD), 841 N National Center for Biotechnology Information (NCBI), 77 Nearest-Neighbor (NN) algorithm, 411 Negative feedback loops, 316–317 Neighbourhood average, 376 ε-Neighborhood, 256 Nelder–Mead simplex method, 118 Network-based analysis of proteomic data, probabilistic model for, 371–394 See also Mixture model on graphs (MMG) Network-based information processing, 297–314 attractors, 306 basins of attraction, 306 cellular automata (CA), 299–302 connecting dynamics to computational capacity, 302–309 mathematical basis for, 297–299 NK Boolean networks, 304–306 Network clustering, 254 Network configuration, 339–340 Network models, applications, 27–31 alignment, 28–29 experimental prioritization, 27–28 visualization, 30–31 Network reconstruction, 20–22 See also under Static modeling of biological networks Network representation, static modeling, 22–27 network ontology, 25–26 RDF representation, 25–26 advantages, 26 triple-based, 26 from reference assemblies to reference networks, 22–24 strongly typed, 24–27 Network training, 340 Neural network models, robustness of, 337–350
back-propagation, 340 evolving network weights, 348–349 hidden nodes, varying the number of, 341–342 methods, 339–340 multi-dimensional stimuli, 345–346 network configuration, 339–340 network training, 340 starting weight composition, varying, 343–344 stimulus control, 337–350 stimulus generalization, 337 stimulus selection, 337–350 tanh, activation function changing to, 344–345 New verb, 786 Next reaction method, 757–758 NFκB regulatory module, 130–135 Nitrogen-fixing processes, 389–391 NK Boolean networks, 304–306 chaos in, 306–309 complexity in, 306–309 ‘edge of chaos’ Boolean networks, 308 order in, 306–309 Node Attribute Browser, 550 Node in mEPN, 143 Node-scoring function, 357 Noise-induced bistability prediction in enzymatic futile cycle, 60–61 Non-alcoholic fatty liver disease (NAFLD), 647–650, 670, 675–677 pathological spectrum of, 648 pathophysiological hallmark of, 648 Non-alcoholic steatohepatitis (NASH), 647–649 miRNA expression in, 677–679 Non-core vertices, 258 Nonlinear connectivity, 310–311 Non-linear dynamics investigation, 216–220, 318 signal amplification in JAK2–STAT5 pathway, 216–220 Nonlinear functions, 309–310 Nonlinear independent component analysis (NICA), 236, 238 Nonlinear programming method (NLP)/Nonlinear programming solvers, 116–123 direct-search methods, 118 Gauss–Newton method, 118 global methods, 119–121 deterministic, 119 differential evolution, 122
SRES, 122 stochastic, 120 Goodwin model, 121–123 indirect methods, 118 Levenberg–Marquardt method, 118 local methods, 118–119 Nelder–Mead simplex method, 118 Non-member vertices, 259 Non-small cell lung cancers (NSCLC), 519 Nonterminal symbols, 280 Nontrivial biochemical network activity, 325–327 Normalization data, 93–96 Nostoc, 389–391 O Obesity-related CLD pathogenesis, 645–681 high-throughput approaches, 650–659 See also individual entry non-alcoholic fatty liver disease (NAFLD), 647–650 ‘omics’ approaches, 679 Occam’s razor, 192 Oligonucleotide arrays, 90 ‘Omics’ approaches, 679, 706 Online network browsers, 31 Open Biomedical Ontology (OBO), 537 ‘Optimisation principles’, 378 Order in NK Boolean networks, 304–306 Ordinary differential equation (ODE) systems, 32, 35–44, 276, 472 assumptions, 35–36 Bayesian calibration of GPCR ODE model using noisy data, 37–38 cell death, mathematical model of, 38–39 challenges in, 44 dose-to-duration encoding, 41–42 early examples, 36 ErbB signaling ODE system, 42–43 modern applications, 36–43 multivariate approach, 40–41 transient response sensitivity analysis, 40–41 Organic layout choice of biological network, 863–864 Osprey, 30 Outlier, 255, 261 Over-representation analysis (ORA), 544–546 pathway ORA, performing, 547–550 Over-represented interferon-gamma pathway, 551
P Pajek, 30 Parametric identification, 113–116 numerical solution, 116–123 backward differentiation formulae (BDF)-based approaches, 117 initial value problem (IVP) solver, 116 multiple shooting, 116–117 nonlinear programming method (NLP), 116 nonlinear programming solvers, 117–123 Runge–Kutta approach, 117 single shooting, 116–117 problem formulation, 113–116 cost function, 115–116 experimental data, 114–115 experimental scheme, 114–115 mathematical model formulation, 114 Partial differential equation (PDE) models, 32, 44–52 astrocyte signaling networks, wave propagation in, 49 glial cells, 49 neurons, 49 Bicoid (Bcd) transcription factor, 47–48 challenges in, 52 early examples of, 45 hair follicle spacing, molecular mechanisms modeling for, 50 modern applications of, 46–51 planar cell polarity (PCP) model using qualitative phenotypes, calibration, 46 reaction–diffusion equations, 44 Sonic hedgehog (Shh) signaling pathway, 49 Partial least squares (PLS) regression, 451 Party hubs (static hubs), 591 Pathogen associated molecular patterns (PAMPs), 533 Pathway analysis tools for integration and knowledgebase (PATIKA), 80 Pathway construction, 75–80 approaches, 76–77 databases, 77–78 EMBL-EBI, 77 GenBank, 77 GOA database, 77 MADNet, 78 NCBI, 77 examples, 76–77
pathway building tools, 79–80 CellDesigner, 80 Cytoscape, 80 Jdesigner, 80 standard notations for, 78–79 BioPAX, 78 HTML, 79 PSI-MI, 78 SBML, 78 XML, 79 Pathway crosstalk network (PCN), 491–501 annotation, 500 direction, 500 mode, 500 CDK4-Rb-E2F pathway, 493 construction of, 494–496 one-sided Fisher Exact Test, 495 protein interaction count, background distribution of, 494 protein interactions number, counting, 494 interpreting transcriptomic profiling using, 498–499 cell migration, 499 signaling, 499 network approach toward understanding biology, 492–494 properties of, 496–498 Ras-Raf-MAPK pathway, 493 ‘RNA metabolism’ cluster, 497 Small GTPase-Mediated Signal Transduction, 497 topology measurements of, 497 visualization of, 496 Pathway Interaction Database (PID), 549 Pathway modules, 148 Pathway outputs, 148–149 Pattern recognition, 312, 533 PD98059, 479 Peak-shift property, 350 Peptide microarray-based detection of cellular signalling changes, 706–709 Peptide microarrays for molecular interactions detection, 710–712 Peripheral blood mononuclear cells (PBMCs), 535, 669 Peripheral transcriptional changes, 628–631 Petri net foundations, 821–851 behavioural properties, 828–830 bounded, 830 coverability graph, 829 deadlock, 829 reachability, 829
reversible, 829 continuous variables, 819 degradation mechanisms, 818 feasible T-invariants, 834–836 marking, 824 Petri Net extensions, 831–833 coloured Petri nets (CPN), 832 Continuous Petri nets (CTPNs), 832 high-level Petri nets, 832 hybrid functional Petri nets (HFPN), 832 Hybrid Petri nets (HPNs), 832 Stochastic Petri nets (SPNs), 831 time or interval Petri nets, 831 place invariants role, 834 Place/Transition nets, 823–828 places, 824 read arcs role, 834 software packages supporting, 850–851 specific modelling techniques, 833–839 structural properties, 830–831 P/T net invariants, 830 P-invariants, 830 T-invariants, 830 tokens, 824 transitions, 824 Petri nets, 211, 475–476, 752–754 current state of, 753 Petri Net Modeling Application (PNMA), 772 simulation of, 753 Phosphatidylinositol 4,5-bisphosphate (PIP2), 511 Phosphoinositide-3 kinase (PI3K) pathways, crosstalk between, 505–522 and human cancers, 518–522 potential therapeutic targets, 518–522 levels, 508 PI3K/Akt signaling pathway, 511–512 Phosphoinositide-dependent protein kinase-1 (PDK1), 511 Phylogenetic profiling, 20 PI3K/Akt targeting for cancer therapy, 518–519 Place/Transition nets, 824–828 firing rule, 825–828 Planar cell polarity (PCP) signaling, 46 Platelet-derived growth factor receptor (PDGFR), 519 Pointed interface, 803 Polynomial models and constraints, 277–279 Positive feedback loops, 317–320
Posterior distribution, MMG, 384–387 Power-law models, 112, 211 Practical identifiability, 124 Predictors, 20 Primate models of AD, 628 Prior distribution, in MMG, 382–386 Prize-collecting Steiner tree problem (PCST), 363–364 Probabilistic Boolean networks (PBNs), 32 Probabilistic Hierarchical Inference of Protein Activity (PHIPA), 415 inference from, 416–418 assumptions, 416–417 feature component, calculation for, 417–418 notations, 416–417 Probabilistic inference from BN model, 405–406 Probabilistic sparse matrix fractionation (PSMF) methods, 237–239, 246 Process-based models, 283–285 formalisms for representing, 283–284 Process diagram notation (PDN) scheme, 142–144 Programming libraries, 31 Progressive supranuclear palsy (PSP), 612 Projection-based approaches, for inferring transcriptional regulatory network, 237 Promoter Analysis and Interaction Network Toolbox (PAINT), 104 Prostaglandin E2 (PGE2), 6 Protein function prediction, 399–424 GO, 401–403 GRID, 401 heterogeneous genome-wide protein data, 415–422 network information for, integration, 399–424 PPI network, 400–401, 403–407 See also Protein–protein interaction (PPI) networks by relational and hierarchical information integration, 407–415 GO hierarchy processing, 407–408 HBN assumptions, 408–409 inference from HBN model, 409–410 STRING, 401 Yeast genes, intracellular signal cascade prediction on, 410–415 Protein inhibitors of activated STATs (PIAS), 571–572
Protein–protein interaction (PPI) maps/networks, 253–254, 260–261, 400–401, 492 limitations for PFP, 406–407 protein functions prediction by, 403–407 BN model, 404–406 notations, 403–404 Protein–protein interaction (PPI) networks, functional modules in, 353–368 comparison and validation, 366–367 data integration, 355–357 microarray and survival data, 355–356 network, 356–357 network score, 361–362 node-scoring function, 357 optimal subnetwork, 365 p-values aggregation, 358–359 scoring, 357–364 searching, 357–364 mathematical programming in, 364 Maximum-Weight Connected Subgraph Problem (MWCS), 363 signal–noise decomposition, 359–361 suboptimal solutions, 364 Protein Tyr phosphatases (PTPs), 571 ProteoLens, 857–875 add annotation, 869–870 Alzheimer’s disease-related protein interaction network, 870–871 attach network source to view, 868 biomolecular network, 858–859 concepts, 859–860 connecting to database input, 865–866 connecting to file-based input, 867 create data association, 867–868 Cytoscape and, 862 data associations, 860 disease–disease association network, 874 functional layers, 860 Gene Ontology cross-talk network, 871–873 human cancer association network, 873–875 input and output supporting, 861–863 installing and launching the application, 865 layout choices of biological network, 863–864 multi-scale biological entities and, 858–859 declarative SQL-based visual analysis, 863
software architecture, 859–860 top features, 861–864 VisANT and, 862 visualization software, 858–859 Proteomics Standards Initiative Molecular Interaction (PSI-MI), 78, 537 Pure cell population studies, of transcriptional changes in AD, 613, 616, 618–619 hippocampus, 619 Puzzle-solving activity in systems biology, 3–10 p-Values, 96, 262–263 aggregation in PPI networks, 358–359 Q Quality of reporting of meta-analyses (QUOROM) statement, 689 Quantitative experimental techniques for signalling systems biology, 210 Quantitative phosphoproteomics, 448–450 Quasi-steady-state assumption, 723 R Raf inhibitors, 478 Raf/MEK/ERK signaling, 455 Raf/MEK/ERK targeting for cancer therapy, 520–521 Random NK Boolean networks, 304 Ras association (RA) domain, 564 Ras inhibitors, 477–478 Reaction network, 279 Receptor tyrosine kinases (RTKs), 296, 456 in tumorigenesis, 518–520 therapeutic opportunities, 518–520 Reductionist approach, 4 Reference assemblies concept, 22–24 Reference Database of Immune Cells (RefDIC), 537 Reference networks concept, 22–24 Regression of data, 95 Regulatory correlation, metabolic networks, 376–378 experimental evidence, 376 theoretical arguments, 377 Regulatory networks, 247, 253 Relative squared error (RSE), 190 Replicates, 92–93 biological, 93 technical, 92–93 Representing knowledge for modeling dynamics, 277–285 Resource Description Framework (RDF) syntax, 187 Response bias, 338
Reverse-phase protein array (RPPA) technology, 463, 659 Rodent models of AD, 626–627 Runge–Kutta approach, 117 S Saccharomyces Genome Database (SGD), 260 Salirasib, 477 SBML2LaTeX, model reports generation with, 193–196 online version, 195 stand-alone version, 197 SBMLsqueezer, 183–186 Scatter Search metaheuristic, 121 ‘Scale-free’ networks, 375 Search Tool for Retrieval of Interacting Genes/Proteins (STRING), 401, 415 Self-organizing maps (SOM), 100, 236 SemanticSBML, 188–189 Sequential quadratic approach (SQP), 119 Signal transducer and activator of transcription (STAT) 1, 570 Signal transduction pathways investigation strategies, 207–232 decision making on modelling strategy, 214–216 complex (non-linear) dynamics investigation, 214 design principles investigation, 215 experimental data and biological scales, integration of, 215 highly complex biochemical networks analysis, 215 hypothesis, formulation and validation of, 214 Epo-mediated modulation of erythropoiesis, 227–231 general methodology used in, 213–214 mathematical model set-up, 213 model assessment and model refinement, 213 model calibration, 213 predictive simulations, 214 homodimer receptor–homodimer transducer mechanism of interaction, 225–227 modelling frameworks used for, 211 Boolean networks, 211 mass action models, 211 Petri nets, 211 power-law models, 211 stochastic models, 211 stoichiometric networks, 211
quantitative experimental techniques, 208 systems biology for, 210–212 See also Non-linear dynamics investigation Signal–noise decomposition, 359–361 Significance Analysis of Microarrays (SAM), 96 Simple steatosis (SS), 647 Single activation mechanism, 222–223 Single-channel superarray microarrays (GEArrays), 653 Single linkage clustering, 100 Single nucleotide polymorphisms (SNPs), 89, 655 Single shooting, 116–117 Singular matrix decomposition (SVD), 249 SiRNA knockdown of integrin molecules, 699–700 Smallpox gene expression data set using InnateDB, 538–555 data preparation for analysis, 539–541 GO over-representation analysis, performing, 544–547 interaction networks, 538–539 molecular interaction networks, generating and exploring, 552–555 between differentially expressed genes, 553 only between genes, 554 pathways, 538–555 processes, 538–555 uploading data to InnateDB, 541–543 visualizing pathway data with cerebral, 550–552 Small world property, 375 Sonic hedgehog (Shh) signaling pathway, 49 Sorafenib, 480 Sparse matrix analysis, 236 ‘Spatial Langevin’ system, 62 SpeciesReference, 167 Sphingosine-1-phosphate (S1P), 6 Split verb, 786 Src-family kinase (SFK), 705 Stable isotope labeling by amino acids in cell culture (SILAC), 448–450, 658 Standards in systems biology, 163–173 See also Systems Biology Markup Language (SBML); Systems Biology Ontology (SBO) Starting weight composition, varying, 343–344 State machine notion, 805–809 State transition, 182, 842
Static hubs, caspase activation through, 606–608 Static modeling of biological networks, 14–31 advantages, 15 challenges in, 31 data for, 18–20 limiting model complexity, 19–20 sources, 18–19 types, 18–19 data integration as supervised learning, 18 inferred static relationships, 14 limitations, 15 network reconstruction, 20–22 clustering and integration methods, 20–21 corrupted data, detection, 21–22 data integration by supervised learning, 21 labels vs. predictors, 20 supervised integration, 22 supervised normalization, 21 tasks associated with, 15–18 data availability constraining network detail, 16 experimental confirmation, 16–17 input data sources, enumerating, 16–17 network applications, 17 network details, determining, 15–16 network reconstruction, 16 See also Network representation, static modeling Stem cell differentiation process, 327 Stimulus control, 338–339 Stimulus generalization, 337–338 Stimulus selection, 338, 346 Stochastic chemical kinetics, 754–756 Stochastic differential equation (SDE) systems, 33, 52–61 assumptions of, 55 cell cycle behavior sensitivity to intrinsic vs. extrinsic noise, 58 challenges in, 61 coherence resonance (CR), 58–60 modern application of, 55–61 noise-induced bistability prediction in enzymatic futile cycle, 60–61 stochastic cell cycle model comparison with experimental data, 57 Stochastic methods, 120, 211 Stochastic Petri nets (SPNs), 831 Stochastic simulation algorithm (SSA), 53, 752, 759, 763–765 StochKit software, 759
STOCKS, 759 Stoichiometric matrix, 174 Stoichiometric networks, 211 Stress-induced signaling in cancer drug targets prediction, 600, 605 Strongly typed static network models, 24–27 Structural Clustering Algorithm for Networks (SCAN), 253–254 pseudo-code of, 258–259 SCAN clusters with highest p-values, 264–265 Structural identifiability, 124 Structural properties, Petri net, 830–831 Structural similarity, 256 Structure-connected clusters, 255–258 core vertex, 257 direct connections, 255 direct structural reachability, 257 hubs, 255 ε-neighborhood, 256 non-core vertices, 258 non-member vertices, 259 outlier, 255 structural similarity, 256 structure-connected cluster, 258 vertex structure, 256 Supernormal stimulation, 338 Supervised learning, data integration by, 21–22 Supervised normalization, 21 Suppressor of cytokine signalling (SOCS) proteins, 535, 571–574 down-regulating cytokine signalling, mechanism, 572–573 KIR interacting with binding site of JAK2, 572 pTyr residues binding via SH2 domain, 572 SOCS-3, 576–578 Suppressor of cytokine signalling-3, 571 Suppressor of Ras mutations-8 (SUR8), 458 Systems approach, 3–11 B-cell receptor (BCR) pathway, protein–protein interactions in, 8 cellular responses measurement, 6 gene expression analyses, 7 ligands for screening, 4–6 phosphorylation level, measuring, 6 target genes knocked down by RNAi, 9 Systems Biology Graphical Notation (SBGN), 142, 172–174 irreversible bi–bi enzyme reaction, 173 irreversible signal transduction reaction, 173
reversible bi–uni enzyme reaction, 173 reversible ion–catalyzed reaction, 173 reversible uni–uni enzyme reaction with feedback inhibition, 173 transcription and translation, 173 Systems Biology Markup Language (SBML), 78, 159, 165–169 compartments in, 166 rate equation in, 168 reaction in, 167 species in, 167 unit in, 169 Systems Biology Ontology (SBO), 163, 169–172 entity class, 170 interaction branch, 170 modelling framework, 170 participant role, 170 quantitative parameters, 170 T T cell signalling, complex formation in, 705–706 Tanh, activation function changing to, 344–345 Tau-leaping, 53, 758 Technical replicates, 92–93, 878 Temporal regulation, 439–440 The true path rule, 408–410 Thinning, 384 Threshold p-value τ (FDR), 362 Time Petri Nets (TPNs), 831 T-invariants, 834–836 modularisation using, 836–837 TNF-related apoptosis-inducing ligand (TRAIL), 38 Toll-like receptors (TLRs), 77, 533 Training set, 16 Transcription factors, 104–105, 183 Transcriptional changes in Alzheimer’s disease, 611–639 ‘βA-ptists’ group of researchers, 612 changes in blood, 629–631 CSF, changes in, 631 frontotemporal dementia (FTD), systems biology study of, 637 invertebrate models of AD, 623–626 mammalian models, 624–626 peripheral transcriptional changes, 628–631 arrays, 630 probands, 630 postmortem studies, 613–621 large-scale microarray atlases, 613
pure cell population studies, 613, 616, 618–619 whole tissue studies, 614–618 primate models of AD, 628 rodent models of AD, 626–627 systems biology study approaches, 632–637 combining multiple transcriptional studies, 632–635 combining transcription and imaging, 636–637 combining transcription with genomics, 635–636 combining transcription with proteomics, 636 ‘Tau-ists’ group of researchers, 612 in vitro models of AD, 621–622 Transcriptional modules, discovery of, 237–240 Transcriptional regulatory network, inferring, 235–250 clustering, 236 conserved modules, identification, 249 conserved transcriptional modules, identification, 242–245 discovery of transcriptional modules, 246 divergent modules, identification, 249 divergent transcriptional modules, identification, 242–245 gene regulatory networks, inference of, 240–242 independent component analysis (ICA), 238 matrix decomposition, 236 methodology, 237–245 model-based approaches, 237 multilayer perceptron (MLP) network, 238 parameters of algorithms, 239 projection-based approaches, 237 regulatory network inference, 247 transcriptional modules, identification, 239 Transcriptomic profiling interpretation using PCN, 498–499 Transforming growth factor-beta (TGFβ), 6 Transient response sensitivity analysis, 40–41 Translation, 182 translation initiation complex cluster, 270 t-Test, 96 Tumor necrosis factor (TNF), 38, 603 Turing reaction–diffusion model, 51 Two-stage matrix decomposition approach, 240
Tyrosine-phosphoproteome dynamics, 447–452 future prospects, 452 signaling molecules in cellular networks quantitative phosphoproteomics for, 448–450 temporal dynamics of, 448–450 time-resolved description, 449 U Ub-mediated degradation, 574 Ultrasensitivity, 323 Unannotated genes, 105 Uniform resource identifiers (URIs), 26 Uni–uni reaction scheme, 176 Update verb, 786 Uridine diphosphate (UDP), 37 V Variable-delay, snap-action switching, 38 Variance-based approaches, 98 Vascular endothelial growth factor receptor (VEGFR)-2, 479 Vertebrate limb development, crosstalk in, 516–517 Vertex structure, 256 Visualization, network, 30–31 desktop tools, 30–31 Cytoscape, 30 Medusa, 30 Osprey, 30 Pajek, 30 Von Hippel-Lindau (VHL), 572 W Weighted gene co-expression network analysis (WGCNA), 633 Weighted neighbourhood average, 376 Whole tissue studies, of transcriptional changes in AD, 614–618 affected vs. unaffected brain regions, 617 Clinical Dementia Rating (CDR) scale, 614 MiniMental State Examination (MMSE) score, 617 NFT burden, 617 X X-linked inhibitor of apoptosis (XIAP), 40 Y Yeast genes, ISC prediction on, 410–415 See also Intracellular signal cascade (ISC) prediction on yeast genes Yeast two hybrid (Y2H) networks, 21–22