Genome Informatics 2008
GENOME INFORMATICS SERIES (GIS)
ISSN: 0919-9454
The Genome Informatics Series publishes peer-reviewed papers presented at the International Conference on Genome Informatics (GIW) and some conferences on bioinformatics. The Genome Informatics Series is indexed in MEDLINE.
No.  Title                                    Year  ISBN
1    Genome Informatics Workshop I            1990  (in Japanese)
2    Genome Informatics Workshop II           1991  (in Japanese)
3    Genome Informatics Workshop III          1992  (in Japanese)
4    Genome Informatics Workshop IV           1993  4-946443-20-7
5    Genome Informatics Workshop 1994         1994  4-946443-24-X
6    Genome Informatics Workshop 1995         1995  4-946443-33-9
7    Genome Informatics 1996                  1996  4-946443-37-1
8    Genome Informatics 1997                  1997  4-946443-47-9
9    Genome Informatics 1998                  1998  4-946443-52-5
10   Genome Informatics 1999                  1999  4-946443-59-2
11   Genome Informatics 2000                  2000  4-946443-65-7
12   Genome Informatics 2001                  2001  4-946443-72-X
13   Genome Informatics 2002                  2002  4-946443-79-7
14   Genome Informatics 2003                  2003  4-946443-82-7
15   Genome Informatics 2004 Vol. 15, No. 1   2004  4-946443-88-6
16   Genome Informatics 2004 Vol. 15, No. 2   2004  4-946443-91-6
17   Genome Informatics 2005 Vol. 16, No. 1   2005  4-946443-93-2
18   Genome Informatics 2005 Vol. 16, No. 2   2005  4-946443-96-7
19   Genome Informatics 2006 Vol. 17, No. 1   2006  4-946443-97-5
20   Genome Informatics 2006 Vol. 17, No. 2   2006  4-946443-99-1
21   Genome Informatics 2007 Vol. 18          2007  978-1-86094-991-3
22   Genome Informatics 2007 Vol. 19          2007  978-1-86094-984-5
23   Genome Informatics 2008 Vol. 20          2008  978-1-84816-299-0
Genome Informatics Series Vol. 20
ISSN: 0919-9454
Genome Informatics 2008
Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008)
Zeuthen Lake, Berlin, Germany
9-11 June 2008
Ernst-Walter Knapp Free University Berlin, Germany
Gary Benson Boston University, USA
Hermann-Georg Holzhütter Charité-University Medicine Berlin, Germany
Minoru Kanehisa Kyoto University, Japan
Satoru Miyano University of Tokyo, Japan
Published by Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
GENOME INFORMATICS 2008 Proceedings of the 8th Annual International Workshop on Bioinformatics and Systems Biology (IBSB 2008) Copyright © 2008 by the Japanese Society for Bioinformatics (http://www.jsbi.org) All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the JSBi.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-1-84816-299-0
ISBN-10 1-84816-299-5
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
CONTENTS
Preface

Program Committee

Exploring the Effect of Variable Enzyme Concentrations in a Kinetic Model of Yeast Glycolysis
    J. Bruck, W. Liebermeister & E. Klipp

The Role of IP3R Clustering in Ca2+ Signaling
    A. Skupin & M. Falcke

Rule-Based Reasoning for System Dynamics in Cell Systems
    E. Jeong, M. Nagasaki & S. Miyano

Estimation of Nonlinear Gene Regulatory Networks via L1 Regularized NVAR from Time Series Gene Expression Data
    K. Kojima, A. Fujita, T. Shimamura, S. Imoto & S. Miyano

ModelMage: A Tool for Automatic Model Generation, Selection and Management
    M. Flottmann, J. Schaber, S. Hoops, E. Klipp & P. Mendes

A Framework for Determining Outlying Microarray Experiments
    R. Wan, A. M. Wheelock & H. Mamitsuka

Exploring the Impact of Osmoadaptation on Glycolysis Using Time-Varying Response-Coefficients
    C. Kuhn, E. Petelenz, B. Nordlander, J. Schaber, S. Hohmann & E. Klipp

Comparing Flux Balance Analysis to Network Expansion: Producibility, Sustainability and the Scope of Compounds
    K. Kruse & O. Ebenhöh

Semi-Supervised Graph Partitioning with Decision Trees
    T. Hancock & H. Mamitsuka

Measuring Correlations in Metabolomic Networks with Mutual Information
    J. Numata, O. Ebenhöh & E.-W. Knapp

Optimality Criteria for the Prediction of Metabolic Fluxes in Yeast Mutants
    E. S. Snitkin & D. Segre

Biosynthetic Potentials from Species-Specific Metabolic Networks
    G. Basler, Z. Nikoloski, O. Ebenhöh & T. Handorf

Generalized Reaction Patterns for Prediction of Unknown Enzymatic Reactions
    Y. Shimizu, M. Hattori, S. Goto & M. Kanehisa

Optimal Metabolic Regulation Using a Constraint-Based Model
    W. J. Riehl & D. Segre

Comparative Determination of Biomass Composition in Differentially Active Metabolic States
    H.-C. Chiu & D. Segre

Suffix Techniques as a Rapid Method for RNA Substructure Search
    R. A. Bauer, K. Rother, J. M. Bujnicki & R. Preissner

The Relationship between Fine Scale DNA Structure, GC Content, and Functional Elements in 1% of the Human Genome
    S. C. J. Parker, E. H. Margulies & T. D. Tullius

A Novel Strategy to Search Conserved Transcription Factor Binding Sites Among Coexpressing Genes in Human
    Y. Hatanaka, M. Nagasaki, R. Yamaguchi, T. Obayashi, K. Numata, A. Fujita, T. Shimamura, Y. Tamada, S. Imoto, K. Kinoshita, K. Nakai & S. Miyano

Modeling IL-2 Gene Expression in Human Regulatory T Cells
    M. Benary, H. Bendfeldt, R. Baumgrass & H. Herzel

Toxicity versus Potency: Elucidation of Toxicity Properties Discriminating between Toxins, Drugs, and Natural Compounds
    S. Struck, U. Schmidt, B. Gruening, I. S. Jaeger, J. Hossbach & R. Preissner

Comparative VEGF Receptor Tyrosine Kinase Modeling for the Development of Highly Specific Inhibitors of Tumor Angiogenesis
    U. Schmidt, J. Ahmed, E. Michalsky, M. Hoepfner & R. Preissner

Network Analysis of Adverse Drug Interactions
    M. Takarabe, S. Okuda, M. Itoh, T. Tokimatsu, S. Goto & M. Kanehisa

Sampling Geometries of Protein-Protein Complexes
    A. Guerler, S. Lorenzen, F. Krull & E.-W. Knapp

Computer Aided Optimization of Carbon Atom Labeling for Tracer Experiments
    B. S. Menküc, C. Gille & H.-G. Holzhütter

Web-Links as a Means to Document Annotated Sequence and 3D-Structure Alignments in Systems Biology
    C. Gille, A. Hoppe & H.-G. Holzhütter

Author Index
PREFACE
Genome Informatics Vol. 20 contains a selection of peer-reviewed papers presented at the Eighth Annual International Workshop on Bioinformatics and Systems Biology on 9-11 June 2008. This time the workshop was held in the Teikyo Hotel at Zeuthen Lake near Berlin, jointly organized by the German members of the International Research Training Group (IRTG) 'Genomics and Systems Biology of Molecular Networks' and supported by the German Science Foundation (DFG). These workshops were created to give doctoral students and young researchers the opportunity to present and discuss their research work in Bioinformatics and Systems Biology in the frame of an international scientific meeting. The first workshop was held in 2001 in Berlin. It was organized by Prof. Dr. Reinhart Heinrich, a co-founder of this series of workshops. Since 2001, the workshop has been held in Boston (2002), Berlin (2003), Kyoto (2004), Berlin (2005), Boston (2006) and Tokyo (2007). The present workshop was held in Zeuthen near Berlin as part of a collaborative educational program involving the following programs and partner institutions in the US, Japan and Germany:

• Boston - Graduate Program in Bioinformatics, Boston University
• Berlin - The International Research Training Group (IRTG) "Genomics and Systems Biology of Molecular Networks"
• Kyoto/Tokyo - Joint Bioinformatics Education Program of Kyoto University and University of Tokyo

Partner Institutions

• Boston University
• Charité Berlin
• Free University Berlin
• Humboldt University Berlin
• Kyoto University, Bioinformatics Center, Institute for Chemical Research
• Kyoto University, Department of Bioinformatics and Chemical Genomics, Graduate School of Pharmaceutical Sciences
• Max Delbrück Centre for Molecular Medicine, Berlin
• Max Planck Institute for Molecular Plant Physiology, Potsdam
• Max-Planck Institute of Molecular Genetics, Berlin
• University of Tokyo, Human Genome Center, Institute of Medical Science
This time we decided to hold the workshop first and to collect and review the manuscripts three weeks later, so that the discussions and criticisms at the workshop could be considered appropriately by the authors. However, there was also a pre-selection of the oral and poster contributions to be accepted at the workshop. The contributors were then allowed to submit manuscripts for the Genome Informatics volume. These contributions were reviewed by the members of the workshop event. We have selected 25 papers after revision. These papers will be indexed in Medline, and their electronic versions are freely available from the website of the Japanese Society for Bioinformatics as Genome Informatics Online (http://www.jsbi.org/modules/journal/index.php/index.html). Former publications are also electronically available as Genome Informatics Vol. 15, No. 1 (2004), Vol. 16, No. 1 (2005), Vol. 17, No. 1 (2006), and Vol. 18 (2007).

We wish to thank all of those who submitted papers and helped with the reviewing process. We also wish to thank all those who helped in organizing this workshop for their efforts in local arrangement, especially the Local Committee Members: Martin Falcke, Alexander Skupin, Bianca Sprincenatu, Oliver Ebenhöh, Moritz Schütte, and Johannes Bausch.
Program Committee Chair
Ernst-Walter Knapp

Organizers
Gary Benson
Hermann-Georg Holzhütter
Minoru Kanehisa
Satoru Miyano
PROGRAM COMMITTEE

Ernst-Walter Knapp         Free University Berlin, PC Chair
Tatsuya Akutsu             Kyoto University
Gary Benson                Boston University
Oliver Ebenhöh             Humboldt University Berlin
Martin Falcke              Max-Delbrück-Center for Molecular Medicine
Hermann-Georg Holzhütter   Charité-University Medicine Berlin
Minoru Kanehisa            Kyoto University
Hiroshi Mamitsuka          Kyoto University
Satoru Miyano              University of Tokyo
Robert Preissner           Charité-University Medicine Berlin
Daniel Segre               Boston University
Brandon Xia                Boston University
EXPLORING THE EFFECT OF VARIABLE ENZYME CONCENTRATIONS IN A KINETIC MODEL OF YEAST GLYCOLYSIS

JOZSEF BRUCK 1,2
[email protected]
WOLFRAM LIEBERMEISTER 1
[email protected]
EDDA KLIPP 1
[email protected]

1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
2 Humboldt University Berlin, Department of Biology, Chair of Theoretical Biophysics, Invalidenstr. 42, 10115 Berlin, Germany
Metabolism is one of the best studied fields of biochemistry, but its regulation involves processes on many different levels, some of which are still not understood well enough to allow for quantitative modeling and prediction. Glycolysis in yeast is a good example: although high-quality quantitative data are available, well-established mathematical models typically only cover direct regulation of the involved enzymes by metabolite binding. The effect of various metabolites on the enzyme kinetics is summarized in carefully developed mathematical formulae. However, this approach implicitly assumes that the enzyme concentrations themselves are constant, thus neglecting other regulatory levels - e.g. transcriptional and translational regulation - involved in the regulation of enzyme activities. It is believed, however, that different experimental conditions result in different enzyme activities regulated by the above mechanisms. Detailed modeling of all regulatory levels is still out of reach since some of the necessary data - e.g. quantitative large-scale enzyme concentration data sets - are lacking or rare. Nevertheless, a viable approach is to include the regulation of enzyme concentrations in an established model and to investigate whether this improves the predictive capabilities. Proteome data are usually hard to obtain, but levels of mRNA transcripts may be used instead as clues for changes in enzyme concentrations. Here we investigate whether including mRNA data in an established model of yeast glycolysis allows us to predict the steady-state metabolite concentrations for different experimental conditions. To this end, we modified an established ODE model for the glycolytic pathway of yeast to include changes of enzyme concentrations. Presumed changes were inferred from mRNA transcript level measurement data. We investigate how this approach can be used to predict metabolite concentrations for steady-state yeast cultures at five different oxygen levels ranging from anaerobic to fully aerobic conditions. We were partly able to reproduce the experimental data and present a number of changes that were necessary to improve the modeling result.

Keywords: yeast; glycolysis; fermentation; respiration; kinetic modeling; metabolic regulation
1. Introduction
Cellular metabolism is one of the key components of living systems. Its most basic functions are to generate the energy and the building blocks necessary to sustain the cell's life. Elucidation of central carbon metabolism, the source of energy for all heterotrophic life, is one of the success stories of biochemistry: the function and mechanism of most of its components are known in considerable detail. A large class of the regulatory mechanisms of metabolism is well understood: the catalytic function of many enzymes is influenced by metabolites present in the cell. Interactions of this kind have been successfully quantified in enzyme kinetic laws, which has led to ODE-based models of metabolic pathways with considerable predictive power, as described in [4, 2, 7] and applied, among others, in [9, 5, 11].

However, metabolism is also regulated by other functional units of the cell, most importantly the transcriptional-regulatory system. It acts by changing the concentrations of various enzymes via regulated production and degradation. This kind of regulation is necessary for the cell to steer its metabolism to meet its needs under various conditions. However, changes in protein levels are usually not implemented in kinetic models: these typically adopt kinetic expressions for the included reactions with fixed maximal velocities, which amounts to the implicit assumption of constant enzyme concentrations. One of the possible reasons is that quantitative data on concentrations of single proteins in different experimental conditions are still lacking or rare.

A fundamental determinant of the concentration of an enzyme's active form, and hence its activity, is the amount of mRNA transcripts present in the cell. However, many other layers of regulation exist, e.g. at the level of translation and allosteric regulation of the final protein. It is controversial to what extent the final enzyme activity is determined by or correlated with the concentrations of the corresponding mRNAs. While genome-wide comparisons between mRNA and enzyme concentrations exist [1, 3], the abundance of a given set of proteins and their corresponding transcription rates should be systematically compared in different cell states to obtain a clearer picture. To the authors' knowledge such studies are not yet available.

Based on an established ODE-based model of yeast glycolysis, we present an approach for modeling how metabolism is regulated by the transcriptional-regulatory system. In the model we include the change in enzyme concentrations in various experimental conditions. We used experimental data [12] from steady-state yeast cultures with five different oxygen levels ranging from anaerobic to fully aerobic conditions. We implemented the change in enzyme concentrations by changing the maximal rates of the enzymatic reactions. For the above-mentioned reasons, we determined these changes from mRNA concentration measurements, using them as inputs for the model. The model allows for computing metabolite concentrations and fluxes, which we compared to the corresponding experimental values. We performed parameter estimation to determine a set of parameters that best fits the experimental data. The main question posed is the following: to what extent can experimental data for different cell states be explained by including expression data in the model under the assumption that biochemical reaction rates obey rate laws known from enzyme kinetics?
2. Methods
2.1. Experimental data
We used metabolite concentration and flux data from Wiebe et al. [12] obtained from cultures of Saccharomyces cerevisiae CEN.PK113-1A grown in glucose-limited chemostat cultures (dilution rate D = 0.10/h). External conditions in these cultures could be controlled to a high extent. Steady-state cultures were obtained under one anaerobic (0% oxygen) and four aerobic conditions (0.5%, 1%, 2.8%, 20.9% oxygen in the inlet gas) with all other external conditions being kept constant. Measured quantities included biomass, concentrations of external metabolites^a (glucose, ethanol, glycerol) and of intermediate metabolites (G6P, F6P, F16P, PEP, PYR, ATP, ADP, AMP, and the sum of 3PG and 2PG concentrations), net fluxes (consumption rates of oxygen and glucose and exhaust rates of ethanol, glycerol and CO2) per unit of biomass, and relative fold changes of the mRNA concentrations compared to the anaerobic cultures for 69 genes with functions in carbon metabolism.

2.2. Mathematical model
We constructed a mathematical model of central carbon metabolism in S. cerevisiae based on the glycolytic pathway model by Teusink et al. [11]. The original model was based on measurements on steady-state cell cultures under anaerobic conditions and on comparison with experimental data for the concentrations and fluxes of intermediate and external metabolites. The sum of the concentrations [NAD+] and [NADH] is a conserved moiety of the model. The adenosine species [ATP], [ADP] and [AMP] are not dynamical variables of the original model; instead, they were written as analytic expressions in terms of the sum of high-energy phosphates. These were obtained under the assumptions that a) the sum of their concentrations is conserved, and b) the reaction catalyzed by adenosine kinase is fast in comparison to the other reactions, and hence in equilibrium. The metabolites GAP and DHAP are lumped into a single chemical species called "triose", reflecting the assumption that the transforming reaction between them (catalyzed by TPI) is also in equilibrium. The kinetic constants were largely obtained from experiments and fitted only to a minimal extent. The side branches of glycolysis contained in the model were found to be crucial to reproduce the data. The glycerol-producing branch was simplified to the reaction catalyzed by the enzyme G3PDH. The products ethanol and CO2 were assumed to diffuse out of the cell quickly, thus their concentrations inside and outside the cell were taken as equal in the steady state.

We obtained the original model in SBML format from the JWS online database [14] (downloaded on 26 May 2008). It is worth noting that the kinetic expression for PFK in the published SBML file differs from the one described in the article [11]; we adopted the latter version.

^a Abbreviations: G6P: Glucose-6-phosphate; F6P: Fructose-6-phosphate; F16P: Fructose-1,6-bisphosphate; Triose-P: sum of GAP: Glyceraldehyde-3-phosphate and DHAP: Dihydroxyacetone phosphate; BPG: 1,3-bisphosphoglycerate; 3PG and 2PG: 3- and 2-phosphoglycerate, respectively; PG: sum of 3PG and 2PG; PEP: Phosphoenolpyruvate; ACA: Acetaldehyde; AMP, ADP, ATP: Adenosine-mono-, di-, and triphosphate, respectively; NAD+, NADH: oxidation states of Nicotinamide adenine dinucleotide. Enzymes: ENO: Enolase; GAPDH: D-glyceraldehyde-3-phosphate dehydrogenase; ADH1, ADH2: Alcohol dehydrogenase 1 and 2, respectively; HK: Hexokinase; PGI: Phosphogluco isomerase; PFK: Phosphofructokinase; ALD: Aldolase; G3PDH: Glycerol-3-phosphate-dehydrogenase; PGK: Phosphoglycerate kinase; PGM: Phosphoglycerate mutase; PYK: Pyruvate kinase; PDC: Pyruvate decarboxylase; FBP1: Fructose-1,6-bisphosphatase.

Table 1. List of the reactions which were added to the Teusink model. Numbers in brackets refer to reactions in Fig. 1. Square brackets denote concentrations described by dynamic variables of the mathematical model. All other quantities are parameters of the model: their values are either adopted from [11], set to the measured values of external metabolites, or estimated.

Name | Reaction | Reaction rate expression
Adenosine kinase (19) | ATP + AMP ⇌ 2 ADP | $V_{mAK}\,([ATP][AMP] - [ADP][ADP]/K_{eqAK})$
G6P consumption (3) | G6P + ATP → ADP | $V_{mG6P}\,[G6P][ATP]$
glycerol transport (9) |  | $V_{mGLY}\,([GLY] - GLY_{out})$
TCA (16) | 4 NAD + ADP + ACE ⇌ 4 NADH + ATP + 2 CO2 | $V_{mTCA}\,([ACE][NAD][ADP] - [NADH][ATP]/K_{eqTCA})$
respiration (18) | 0.5 O2 + NADH + 2.5 ADP ⇌ NAD + 2.5 ATP | $V_{mRESP}\,(O_2\,[NADH][ADP] - [NAD][ATP]/K_{eqRESP})$
ATP consumption (20) | ATP → ADP | $v = V_{mATPase}\,[ATP]$
PDC (15) | PYR ⇌ ACE + CO2 |
FBP1 (6) | F16P → F6P |
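The mass-action rate expressions of Table 1 share one simple form, which the following minimal Python sketch illustrates. It is not the authors' implementation: the function names, argument order and numerical values are hypothetical and chosen only for illustration, and the PDC and FBP1 rate laws are left out.

```python
def reversible_mass_action(v_max, substrates, products, k_eq):
    """Rate v = v_max * (product of substrates - product of products / k_eq)."""
    forward = 1.0
    for concentration in substrates:
        forward *= concentration
    backward = 1.0
    for concentration in products:
        backward *= concentration
    return v_max * (forward - backward / k_eq)


def adenosine_kinase_rate(v_max, k_eq, atp, amp, adp):
    # ATP + AMP <-> 2 ADP (reaction 19); ADP enters the backward term twice.
    return reversible_mass_action(v_max, [atp, amp], [adp, adp], k_eq)


def g6p_consumption_rate(v_max, g6p, atp):
    # G6P + ATP -> ADP (reaction 3), irreversible mass action.
    return v_max * g6p * atp


def glycerol_transport_rate(v_max, glycerol_in, glycerol_out):
    # Diffusive transport (reaction 9); glycerol_out is a fixed parameter.
    return v_max * (glycerol_in - glycerol_out)


def atp_consumption_rate(v_max, atp):
    # ATP -> ADP (reaction 20), irreversible mass action.
    return v_max * atp


if __name__ == "__main__":
    # Illustrative numbers only; none are taken from the fitted model.
    print(adenosine_kinase_rate(v_max=1.0, k_eq=0.5, atp=2.0, amp=1.0, adp=1.0))
```

The TCA and respiration rates of Table 1 follow the same pattern and could be obtained by calling `reversible_mass_action` with the corresponding concentration lists.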
We modified the original Teusink model in several details to fit our purposes. Reaction numbers refer to Fig. 1; for details of the stoichiometry and the kinetic expressions see Table 1.

1. We explicitly modeled the concentrations of AMP, ADP, and ATP as dynamic variables. The adenosine kinase reaction (reaction 19), modeled with reversible mass-action kinetics, was introduced to maintain the moiety conservation of the pool of these species.
Fig. 1. Reaction scheme of the kinetic model of glycolysis. The numbers refer to the following reactions: 1: glucose transport; 2: HK; 3: G6P consumption; 4: PGI; 5: PFK; 6: FBP1; 7: ALD; 8: G3PDH; 9: glycerol diffusion; 10: GAPDH; 11: PGK; 12: PGM; 13: ENO; 14: PYK; 15: PDC; 16: TCA; 17: ADH; 18: respiration; 19: adenosine kinase; 20: ATP consumption. Reaction 7 produces two Triose-P per F16P, as indicated. Subscript "out" refers to species outside the cell. Reactions which were added to the original model by Teusink et al. [11] are listed in Table 1.
2. Instead of considering two side chains with constant fluxes at G6P (leading to glycogen and trehalose), we replaced them by a single G6P-consuming process (reaction 3) with irreversible mass-action kinetics. We did not distinguish between them since we do not have measurements for metabolites or fluxes of these branches that would allow for distinguishing one from the other.
3. At the end of the glycerol-producing branch, we included a diffusive transport reaction for glycerol out of the cell (reaction 9).
4. The original model contains the TCA cycle in the form of a succinate production branch. In this reaction, two molecules of acetaldehyde are consumed to produce one molecule of succinate. Since our model is aimed at describing respiration, we replaced this reaction by a simplified description of a running TCA cycle (reaction 16) and the respiratory chain (reaction 18): we consider two reactions which consume acetaldehyde and oxygen to produce energy in the form of ATP and NADH as well as the by-product CO2 [8]. We assumed that the CO2 concentration in the cell remains low due to rapid diffusion; therefore we did not include it in the backward rate expression of reaction 16.
5. The ATP-consuming reactions are summarized in one effective ATPase reaction (20). In the original model, this reaction had a constant flux, which we replaced by irreversible mass-action kinetics.
6. Reversibility of the main glycolytic chain is crucial to obtain qualitative agreement with the measured fluxes. Therefore, we changed the irreversible Hill kinetics of the PDC reaction (reaction 15) to reversible kinetics by including an additional term with a parameter $K_{eq}^{PDC}$ in the original rate expression, as shown in the table.
7. The reaction catalyzed by PFK is also irreversible and modeled without product inhibition. To allow for a slowing down of the glycolytic flux at higher product concentrations, we included the reaction catalyzed by FBP1 in the model (reaction 6). In gluconeogenesis, this reverses the effect of PFK, but without involvement of ATP.

All other parts of the model, including the values of the parameters which are not explicitly mentioned in this article, were adopted from [11]. In contrast to glucose and glycerol, it was assumed that ethanol diffusion through the cell membrane is fast enough to keep the outer and inner concentrations close; therefore no distinction was made between extra- and intracellular ethanol. The resulting model has 20 reactions and 17 dynamic variables representing metabolite concentrations. It is available in SBML and text formats as supplementary material.
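The moiety conservation mentioned for the adenosine kinase reaction (modification 1) can be checked numerically. The sketch below integrates only reaction 19 in isolation with a hand-written fourth-order Runge-Kutta step; the parameter values, initial concentrations and step size are assumptions made for illustration and are unrelated to the fitted model.

```python
def ak_rate(atp, adp, amp, v_max=1.0, k_eq=0.45):
    # Reversible mass action for ATP + AMP <-> 2 ADP (illustrative constants).
    return v_max * (atp * amp - adp * adp / k_eq)


def derivatives(state):
    atp, adp, amp = state
    v = ak_rate(atp, adp, amp)
    # One turnover consumes one ATP and one AMP and yields two ADP,
    # so d(ATP + ADP + AMP)/dt = -v + 2v - v = 0.
    return (-v, 2.0 * v, -v)


def rk4_step(state, dt):
    def shifted(s, k, h):
        return tuple(x + h * dx for x, dx in zip(s, k))

    k1 = derivatives(state)
    k2 = derivatives(shifted(state, k1, dt / 2.0))
    k3 = derivatives(shifted(state, k2, dt / 2.0))
    k4 = derivatives(shifted(state, k3, dt))
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, dt=0.01, steps=5000):
    # Integrate the isolated reaction until it has relaxed to equilibrium.
    for _ in range(steps):
        state = rk4_step(state, dt)
    return state


if __name__ == "__main__":
    start = (3.0, 1.0, 0.2)  # ATP, ADP, AMP (arbitrary units)
    end = simulate(start)
    print("total before:", sum(start), "total after:", sum(end))
```

In the full model the other ATP- and ADP-converting reactions also leave the sum ATP + ADP + AMP unchanged, so the same invariant would be expected to hold there.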
2.3. Transcriptional regulation and external metabolites

In order to include transcriptional regulation in our model, we write the reaction rate for reaction i in the experimental condition j as

$$v_{ij} = E_{ij}\, R_i(\mathbf{c}_j) \qquad (1)$$

where $E_{ij}$ denotes the concentration of the active form of the corresponding enzymes in the steady-state cultures, $R_i$ denotes the rest of the kinetic expression, and $\mathbf{c}_j$ denotes the vector of all metabolite concentrations at condition j. We compared the four aerobic states to the anaerobic state. We indicate quantities belonging to this condition by the subscript j=0. Transcriptional regulation was accounted for in the following way: for each enzymatic reaction i and each aerobic condition j, we calculated $E_{ij}/E_{i0}$, the relative change of enzyme concentration with respect to the anaerobic state, from the transcription data by setting

$$\frac{E_{ij}}{E_{i0}} = g_{ij}^{\alpha} \qquad (2)$$

where the scaling exponent $\alpha$ is a constant and $g_{ij}$ denotes the transcription fold change associated with reaction i in condition j. By definition, $g_{i0} = 1$ for every reaction. Assuming that the activity of an enzyme is proportional to its concentration, we describe the effect of transcriptional regulation on the reaction rate $v_{ij}$ through replacing it by

$$v_{ij} = g_{ij}^{\alpha}\, E_{i0}\, R_i(\mathbf{c}_j) \qquad (3)$$

for each reaction i and condition j. For most enzymatic reactions, we calculated $g_{ij}$ as the arithmetic mean of the measured mRNA concentration fold changes for the genes associated with reaction i. See the Appendix for the list of genes associated with each enzymatic reaction. Since the transcriptional activities corresponding to Enolase and GAPDH were not measured, for these reactions we computed the value of $g_{ij}$ by averaging the values for the next-neighbor reactions (PGM, PYK) and (ALD, PGK), respectively.

The reaction ADH was also treated differently. The expression data for ADH1, together with ADH2, the isoenzyme responsible for converting ethanol to acetaldehyde, indicate that net ethanol production is shut down with growing oxygen supply, reaching virtually zero in the fully aerobic condition. The resulting ethanol flux also reflects this behavior (Fig. 2). For simplicity, instead of including ADH2, which would involve yet more unknown parameters, we only included the reaction for ADH1 and described its regulation by setting $g_{ij}$ to the values of the measured ethanol flux, normalized to the anaerobic condition. The resulting $g_{ij}$ values for all experimental conditions are shown in Fig. 2.
[Figure 2: four panels (A-D) plotted against the oxygen concentration in the steady-state cell culture (0%, 0.5%, 1%, 2.8%, 20.9%).]

Fig. 2. A, B, C: fold change of mRNA concentration associated with reactions in the mathematical model, normalized to the anaerobic state (denoted by gij in the text). The values were calculated from the expression data of the genes associated with each reaction as given in the appendix. Numbers in brackets refer to reaction numbers in Fig. 1. For reactions marked with (*), no transcript analysis was undertaken; the values were averaged from neighbors as described in the text. D: fold change of the genes ADH1 and ADH2 and the resulting ethanol flux. At the highest oxygen concentration the flux drops to zero (not shown in the logarithmic scale).
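The mapping from transcript data to maximal rates described in this section can be sketched in a few lines of Python. The helper names and all numbers below are hypothetical; the sketch only assumes, as stated in the text, that the enzyme concentration scales with the fold change raised to a constant exponent and that the maximal rate is proportional to the enzyme concentration.

```python
def mean_fold_change(gene_fold_changes):
    # Arithmetic mean of the mRNA fold changes of all genes mapped to a reaction.
    return sum(gene_fold_changes) / len(gene_fold_changes)


def neighbor_average(g_first, g_second):
    # Used where no transcript was measured (e.g. ENO from PGM and PYK).
    return 0.5 * (g_first + g_second)


def scaled_vmax(vmax_reference, g, alpha):
    # Maximal rate in an aerobic condition relative to the anaerobic reference,
    # following E_ij / E_i0 = g_ij ** alpha.
    return vmax_reference * g ** alpha


if __name__ == "__main__":
    # Hypothetical fold changes for one aerobic condition.
    g_pgm = mean_fold_change([1.2, 0.8])
    g_pyk = mean_fold_change([1.6])
    g_eno = neighbor_average(g_pgm, g_pyk)
    print(scaled_vmax(vmax_reference=100.0, g=g_eno, alpha=0.5))
```

For the anaerobic reference condition the fold change is 1 by definition, so `scaled_vmax` returns the reference value unchanged for any exponent.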
The external metabolites glucose, glycerol and ethanol were represented by the model species Glucose_out, Glycerol_out and Ethanol (cf. Fig. 1). Their concentrations were set to constant values according to the experimental data: Glucose_out was set to the corresponding concentration in the inlet feed solution, 55.55 mmol/l, in all conditions. The measured glycerol concentration was 8.90 mmol/l for the anaerobic condition, and zero for all aerobic conditions. The measured ethanol concentration was 75.37 mmol/l, 59.01 mmol/l, 47.56 mmol/l, 3.66 mmol/l, and 0 mmol/l for the conditions with 0%, 0.5%, 1%, 2.8%, and 20.9% oxygen, respectively.

2.4. Parameter estimation

We performed parameter estimation on a subset of the model parameters to achieve agreement with the data. Metabolite concentrations were compared with concentrations in the model. The measured fluxes for glucose, oxygen, ethanol, glycerol and CO2 were each compared to the rates 0.5r", r18, r17, r8, r15 + 2r16, respectively, where ri denotes the rate of the reaction i in Fig. 1. We quantified goodness of fit for each possible set P of values for the estimated parameters by the following cost function:
$$F(P) = \sum_{j}\sum_{k} \frac{\left(V_{kj}^{sim}(P) - V_{kj}^{exp}\right)^{2}}{\sigma_{kj}^{2}} \qquad (3)$$

where we denote the steady-state value of a metabolite concentration or flux k for the condition j by $V_{kj}^{sim}$ and $V_{kj}^{exp}$ for simulation results and experimental data values, respectively. $V_{kj}^{sim}$ values were obtained by runs of 10000 seconds of simulation time. $1/\sigma_{kj}^{2}$ is a weight factor in which $\sigma$ is often set to the value of the experimental error. However, this choice does not reflect an appropriate weight measure in our case, since we do not expect to be able to reproduce the experimental data within the errors. At the same time, a small experimental error of a quantity does not necessarily correlate with higher importance of a good fit compared to other quantities with larger errors. To assign the same weight to all relative deviations, we set $\sigma_{kj}$ to be proportional to $V_{kj}^{exp}$ in the following way:

$$\sigma_{kj} = \begin{cases} 0.15 \cdot V_{kj}^{exp}, & \text{in case } V_{kj}^{exp} \neq 0, \\ 0.15 \cdot \min(V_{k}), & \text{in case } V_{kj}^{exp} = 0, \end{cases}$$

where $V_k$ denotes all nonzero values for the concentration or flux k among all conditions. To avoid non-steady-state solutions, we introduced a penalty term $(\exp(K) - 1)$ in the cost function. The term K quantifies the deviation of the solution from steady state. It is defined as

$$K = \sum_{k=1}^{17} \sum_{i=1}^{3} \left| x_k(t_i^{comp}) - x_k(t_{last}) \right|$$
where $x_k(t_{last})$ denotes the simulated value of the concentration $x_k$ at the last time instance $t_{last}$ = 10000 sec, and $x_k(t_i^{comp})$ denotes its value at some earlier time instance $t_i^{comp}$. The values $t_i^{comp}$ were chosen as $t_1^{comp} = 0.5 \cdot t_{last}$, $t_2^{comp} = 0.75 \cdot t_{last}$, and $t_3^{comp} = 0.8 \cdot t_{last}$.

We estimated a total of 31 parameters, which was an acceptable number given a total of 70 data points. The values of all other parameters were taken from [11]. We estimated the following groups of parameters:

1. Since the experiment by Wiebe et al. and the experiments underlying the Teusink model differ in the experimental conditions and the yeast strain, we could not rely on the absolute enzyme concentrations to be comparable. Therefore, we fitted all Vm values and the diffusion coefficient for reaction 9 (20 parameters).
2. We also fitted the new kinetic parameters of the reactions that were added to the original model (4 parameters, cf. Table 1).
3. The sum of [NAD+] and [NADH] is a conserved quantity of the model, determined by the initial concentrations of these species. Since experimental data were not available, we estimated this quantity for each condition separately (5 parameters).
4. We fitted the scaling exponent $\alpha$ from Eq. (2).
5. Concentration units: reaction rate expressions in our model are based on enzyme kinetics and hence the concentrations of the reactants need to be known. However, all metabolite concentrations and fluxes were measured in units per gram dry weight of biomass (gDW). The values were determined after collecting the cells from the culture by centrifugation, washing with distilled water, and drying to constant weight at 100 °C. To determine the cytosolic concentrations from the measured values, the net cytosol volume of the cells per 1 gDW is needed. Although estimates for this number exist (1 g dry weight amounting to 2 ml cytosol, [13]), we preferred to fit this quantity along with the other parameters of the model.
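The goodness-of-fit measure of Section 2.4 can be sketched as follows. This is a hedged illustration rather than the authors' code: the data layout (dictionaries mapping a quantity name to a list over conditions), the function names and the trajectory callback are hypothetical, and adding the penalty term to the weighted sum of squares is an assumption, since the text only states that the penalty enters the cost function.

```python
import math

REL_WEIGHT = 0.15  # proportionality factor between sigma and the data value


def sigma(exp_values_k, j):
    # exp_values_k: experimental values of one quantity k over all conditions.
    # Assumes at least one nonzero value exists for the quantity.
    value = exp_values_k[j]
    if value != 0:
        return REL_WEIGHT * value
    nonzero = [v for v in exp_values_k if v != 0]
    return REL_WEIGHT * min(nonzero)


def weighted_squares(sim, exp):
    # sim, exp: dict mapping quantity name -> list of values over conditions.
    total = 0.0
    for k, exp_k in exp.items():
        for j, v_exp in enumerate(exp_k):
            total += ((sim[k][j] - v_exp) / sigma(exp_k, j)) ** 2
    return total


def steady_state_deviation(trajectory, t_last, fractions=(0.5, 0.75, 0.8)):
    # trajectory: callable t -> list of simulated concentrations at time t.
    final = trajectory(t_last)
    return sum(
        abs(x_comp - x_last)
        for fraction in fractions
        for x_comp, x_last in zip(trajectory(fraction * t_last), final)
    )


def cost(sim, exp, trajectory, t_last=10000.0):
    # Assumption: the penalty exp(K) - 1 is added to the weighted squares.
    k_term = steady_state_deviation(trajectory, t_last)
    return weighted_squares(sim, exp) + (math.exp(k_term) - 1.0)
```

For a trajectory that has reached steady state, K vanishes and the penalty contributes nothing; any residual drift is penalized exponentially.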
2.5. Genetic algorithm and semiglobal search
We adopted the genetic algorithm Differential Evolution [16] to search for the parameter set with the best fit to the experimental data. In a truly global search, parameters could assume any value between zero and infinity, with the aim of finding a global optimum of the fit. However, we found that this approach was not practicable, since many parameter sets, although viable in principle, are not practical to work with. Some may not generate a steady state (for example due to accumulation of F16BP), others require long computation times. Therefore we developed the following semi-global approach: at a given time, only a limited region of parameter space was screened. This was achieved by limiting each parameter to a certain range. If a parameter repeatedly (4 out of the last 5 times) produced values in the upper or lower 20% of its search range, the range was relocated such that the parameter value corresponding to the best result so far became the center of the search range of this parameter. If this would have resulted in a negative value for the lower limit, the latter was set to zero. The width of the search range was
J. Bruck, W. Liebermeister & E. Klipp
kept constant during the process and was set at the beginning of the parameter estimation to $[(1 - r)\,p_0,\ (1 + r)\,p_0]$, where $p_0$ denotes the initial value of the parameter and $r$ was set to 0.5. Since evaluating the cost function (Eq. 3) involves numerically integrating a system of 20 differential equations, we used various software tools to convert the SBML model to executable C code for faster integration [6, 10]. The integrator used in the process was CVODE from SUNDIALS [15].
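The range relocation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `SearchRange` and its arguments are hypothetical, the "upper or lower 20%" is read as a fraction of the range width, and two details the text leaves open are filled by assumption (the width is preserved when the lower limit is clipped at zero, and the 4-out-of-5 count restarts after a relocation).

```python
from collections import deque


class SearchRange:
    """Search range of one parameter; all names here are illustrative."""

    def __init__(self, p0, r=0.5, edge=0.2, window=5, hits=4):
        # Initial range [(1 - r) * p0, (1 + r) * p0]; its width stays constant.
        self.lower = (1 - r) * p0
        self.upper = (1 + r) * p0
        self.width = self.upper - self.lower
        self.edge = edge    # fraction of the range counted as "near a bound"
        self.hits = hits    # relocate after this many near-bound results ...
        self.history = deque(maxlen=window)  # ... among the last `window` results

    def near_bound(self, value):
        # Relative position inside the current range (0 = lower, 1 = upper).
        position = (value - self.lower) / self.width
        return position <= self.edge or position >= 1 - self.edge

    def update(self, best_value):
        """Record the best value of one generation; return True if the range moved."""
        self.history.append(self.near_bound(best_value))
        if sum(self.history) < self.hits:
            return False
        # Centre the range on the best value and clip the lower limit at zero.
        self.lower = max(0.0, best_value - self.width / 2)
        self.upper = self.lower + self.width  # assumption: width kept after clipping
        self.history.clear()                  # assumption: counting restarts
        return True
```

With `p0 = 10` the initial range is [5, 15]; four consecutive best values of 14.5 trigger a relocation to [9.5, 19.5].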
3. Results
3.1. Parameter estimation
We ran four parameter estimation processes to find model parameter values that produce the best possible fit to the experimental data. Fig. 3 shows the evolution of the goodness of fit (as quantified by the cost function) and the values of five parameters during one parameter estimation process (data for all parameters are published as a supplementary file). Most, but not all, parameters converged to a certain value. However, a unique
[Figure 3: six panels plotted against nr. of generations: change of cost during estimation; fmma (start value: 1, last value: 1.6); fwstst (start value: 500, last value: 29); nadsum_3 (start value: 1.6, last value: 0.3); GLYtrs_VmGLY; vPDC_KmPDCACE (start value: 5).]
Fig. 3. Evolution of goodness of fit (cost) of the best parameter set (top left) and corresponding values of five of the 31 model parameters during a parameter estimation process of ca. 49000 generations. Shown are values corresponding to the parameter set with the best fit to data (as defined by the cost function, see text) after a certain number of generations. The momentary search range for each parameter (see text for description) is specified by upper and lower bounds (shown by lines). The parameters fmma (called a in the text), fwstst and nadsum_3 are explained in Section 2.4 under points 4, 5 and 3, respectively. GLYtrs_VmGLY denotes the diffusion coefficient in reaction 9, and vPDC_KmPDCACE denotes the constant in reaction 15 (cf. Table 1).
parameter set with best fit could not be determined within the available computing time (24 hours of computing time amounting to roughly 7000 generations on an AMD 3800+ processor), since a number of parameters did not converge to similar values during these parameter estimations (data not shown).
Notably, these parameter sets produced mostly similar simulation values. As shown in Fig. 4, the largest quantitative differences between the predictions generated by the four parameter sets are observed in the simulation results for the F16P concentration (0% oxygen) and for O2. Some of the parameters were seemingly not, or only weakly, determined, i.e. their values had little effect on the cost function. This was to be expected, since only about two thirds of the dynamical variables of the model were measured. Since the number of data points (70) is more than twice the number of parameters (31), we do not expect
[Figure 4: panels comparing experiment and simulation against oxygen concentration (%) at 0, 0.5, 1, 2.8 and 20.9. Panels: PG, F16P, F6P, G6P, ATP, ADP, AMP, PEP, CO2 flux, glucose flux.]
[Figure 2: reaction network diagrams, panels (A) to (C).]
Fig. 2. Examples of removed species. (A) Removal of an intermediate species. Species T is removed and reactions r1 and r2 are combined. (B) When an enzyme-substrate complex is removed from a reaction, ModelMage creates a new reaction with the enzyme set as a modifier. (C) Removal of species AB leads to a new reaction that is a combination of all incoming and outgoing reactions of the removed species.
2.1.2. Removing Species
When the user specifies a species for removal, ModelMage analyzes the neighborhood of this species and rewires the model. The rewiring heuristic follows the principle of reachability and works as follows. First, ModelMage detects the incoming and outgoing reactions of the species to be removed and analyzes the species involved in these reactions, i.e. the substrates of which the removed species was the product and the products of which the removed species was the substrate (see Fig. 2). Then, ModelMage tries to combine each pair of incoming and outgoing reactions into a single new reaction. The new reaction inherits all substrates and products from the combined reactions and is assigned a kinetic. The inserted kinetic can either
A Tool for Automatic Model Generation, Selection and Management
be the kinetics of the ancestors, if they were equal and are still suitable for the combined reaction, or mass action otherwise. There are three possible cases for each combination of incoming and outgoing reactions, which have to be treated separately. (1) If combining a pair of incoming and outgoing reactions would result in a reaction that has the same species both as substrates and as products, i.e. a self loop on a list of species, the respective reactions are not combined. (2) If only a subset of species is shared between substrates and products, these species are regarded as enzymes and are set as modifiers of the resulting reaction (Fig. 2 B). (3) If the sets are disjoint, the reactions are combined in the way described above. The main algorithm to remove one species looks like this:

    def remove(s):
        for i in s.incomingReactions:
            for o in s.outgoingReactions:
                if i == o:                  # case 1: self loop, not combined
                    selfLoop(i, o)
                elif partiallyEqual(i, o):  # case 2: shared species become modifiers
                    substrateEnzyme(i, o)
                else:                       # case 3: disjoint species sets
                    combine(i, o)
        del(i, o)                           # remove the original reactions
        del(s)                              # remove the species itself
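The three cases can be made concrete with a small, self-contained sketch. It is an illustration under stated assumptions rather than ModelMage's actual code: the `Reaction` class and `remove_species` function are hypothetical, reactions are treated as irreversible, kinetics are ignored, and cases (2) and (3) are handled by one branch because a disjoint pair simply has an empty set of shared species.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical, minimal reaction record (no kinetics)."""
    name: str
    substrates: frozenset
    products: frozenset
    modifiers: frozenset = frozenset()


def remove_species(reactions, s):
    """Return a new reaction list in which species `s` no longer occurs."""
    incoming = [r for r in reactions if s in r.products]
    outgoing = [r for r in reactions if s in r.substrates]
    combined = []
    for i in incoming:
        for o in outgoing:
            subs = (i.substrates | o.substrates) - {s}
            prods = (i.products | o.products) - {s}
            if subs == prods:
                continue  # case 1: self loop, the pair is not combined
            shared = subs & prods  # case 2: shared species act as enzymes
            combined.append(Reaction(
                name=f"{i.name}_{o.name}",
                substrates=subs - shared,
                products=prods - shared,
                modifiers=(i.modifiers | o.modifiers | shared) - {s},
            ))  # case 3 is the same call with an empty `shared` set
    # Drop every reaction that touched the removed species.
    kept = [r for r in reactions if r not in incoming and r not in outgoing]
    return kept + combined
```

For an enzyme-substrate complex (E + S -> ES, ES -> E + P), removing ES yields a single reaction S -> P with E as modifier, which corresponds to panel (B) of Fig. 2.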
As mentioned before, removing combinations of species and reactions can lead to large numbers of candidate models. The user has to be careful about how many components are combined for removal, because there is a danger of combinatorial explosion, especially if the components are combined by OR operators (Fig. 3). Exchanging kinetics also increases the number of models, because every structural model is generated with every possible set of kinetics. To simplify the formulation of the logical expressions that create certain model families, the user can specify different sets of models in a single run of ModelMage. This is achieved by concatenating different logical formulas with ',', e.g. ModelMage.py -r (species_1 ~ species_2), (reactionA & reaction_1) example.cps. This would generate three new alternative models. Two are generated from the first bracket: species_1 is removed in one and species_2 in the other. The third model is generated from the second bracket, where both reactions are removed from the model.
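To see why OR combinations grow so quickly, the number of candidate models per operator can be enumerated with a short sketch. The function `removal_sets` is hypothetical and encodes one reading of Fig. 3: AND removes all listed components in one model, XOR removes exactly one component per model (an assumed generalization beyond two operands), and OR produces every non-empty combination.

```python
from itertools import combinations


def removal_sets(components, op):
    """Enumerate the sets of components removed in each generated model."""
    components = list(components)
    if op == "and":   # one model, everything removed together
        return [frozenset(components)]
    if op == "xor":   # one model per component
        return [frozenset([c]) for c in components]
    if op == "or":    # every non-empty subset: 2**n - 1 models
        return [
            frozenset(subset)
            for size in range(1, len(components) + 1)
            for subset in combinations(components, size)
        ]
    raise ValueError(f"unknown operator: {op}")


# Two comma-separated formulas, as in the command-line example above.
models = removal_sets(["species_1", "species_2"], "xor") \
    + removal_sets(["reactionA", "reaction_1"], "and")
```

Under this reading the example yields three models, while an OR over five components already yields 31, before any exchange of kinetics multiplies the count further.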
2.2. Model Discrimination
When data for certain components is available, ModelMage can select the best model from the generated model family. ModelMage can automatically fit the models to the data by estimating parameters and is able to rank the models by different
M. Flöttmann et al.
[Figure 3: network diagrams, panels (A) to (D).]
Fig. 3. Logical combinations of alternative models. (A) Master model from which the alternatives can be generated by ModelMage. (B) One model is generated by the logical string "B & re_2". The logical AND means that species B and reaction re_2 must both be removed in this model. (C) The XOR in the command "B - re_2" produces two models, because in each of them only one node is removed. (D) The OR operator "B | re_2" creates a combination of B and C.
statistical measures to determine the best model. For parameter estimation ModelMage uses Copasi's various parameter estimation routines, which makes it very fast and flexible. The parameter estimation task is most conveniently defined in Copasi's graphical user interface and can later be executed by ModelMage. The user has to set up the task only once for the master model. ModelMage automatically defines the parameter estimation task for all generated candidate models. Parameters of new or changed reactions are added to the estimation task if the parameters of the original reactions were part of the parameter estimation task of the master model. Parameters that do not exist in a specific alternative model are deleted from the estimation task for this model. The parameter estimation in Copasi creates output files for all of the models, which contain details about the results of the estimation. ModelMage parses these files and extracts the objective value, which is the sum of weighted squared residuals of the fitted model (RSS). From this, e.g., the Akaike Information Criterion [13] is calculated for each model:
$$\mathrm{AIC} = 2k + n\,(\ln(\mathrm{RSS}/n) + 1), \qquad (1)$$
where n is the number of observations and k is the number of estimated parameters, which are also parsed from the Copasi output. From the AIC we can also compute the second order AIC (AICc):
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \qquad (2)$$
AICc is used as the standard measure for ranking and model selection in ModelMage. The lower the AICc, the better the fit to the data and the higher the model is ranked. AICc corrects the AIC for small numbers of observations, which are common in systems biology. It can also be employed with larger samples, because it converges to AIC as the sample size grows [14, 15].
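Eqs. (1) and (2) translate directly into code. The sketch below is a minimal illustration; the function names and the `fits` dictionary layout are hypothetical and not part of ModelMage, and the formulas require n > k + 1.

```python
import math


def aic(rss, n, k):
    """Eq. (1): AIC from the residual sum of squares, n observations, k parameters."""
    return 2 * k + n * (math.log(rss / n) + 1)


def aicc(rss, n, k):
    """Eq. (2): second order AIC; only defined for n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)


def rank_models(fits):
    """Sort model names by ascending AICc.

    `fits` maps a model name to a tuple (rss, n, k).
    """
    return sorted(fits, key=lambda name: aicc(*fits[name]))
```

Because the correction term 2k(k+1)/(n-k-1) shrinks as n grows, the two criteria give nearly identical rankings for large data sets.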
2.3. Implementation
We use SBML and libSBML because SBML is accepted as a standard for the exchange of models between systems biology tools [9, 16]. We decided to develop ModelMage in close relation to Copasi because it is widely used in the community to work with dynamical models and has a rich set of features to build upon. ModelMage is written in Python, which makes it very flexible and portable to many platforms. Currently, it is tested on Linux, Mac OS and Windows. Parameter estimation is done by the fast Copasi algorithms. The rest of ModelMage's features are not computationally intensive, which justifies the use of an interpreted language like Python for the tool. To install and run ModelMage, a system must meet the following requirements:
• Python ≥ 2.5
• Networkx package for Python
• libSBML ≥ 3.0.1 [17]
• Copasi ≥ 4.2
3. Results and Discussion
3.1. Example
To verify ModelMage's functionality, we created a small master model that resembles a signaling cascade and includes three hypothetical feedback loops (Fig. 4). From one alternative model we generated time-series data by simulating the model for 80 time units and sampling the values of species S, X5 and X6 at 7 time points. To get a more realistic test case, we introduced small normally distributed errors into the data. After that, we used ModelMage to generate a family of ten models from the master model and fitted every model to the artificial data in a blind test, i.e. the person who used ModelMage did not know beforehand which model had produced the data. The model that produced the data was correctly recovered by the discrimination procedure of ModelMage. Only the parameters of the reactions re3, re10, re11 and re12 were fitted. The rest of the parameters were set to 0.1 to reduce the number of estimated parameters. For the parameter estimation we used the "Evolutionary Programming" algorithm in Copasi and ranked the models by AICc. Results are shown in Table 2.
[Figure 4: diagram of the example signaling cascade, with the hypothesized feedbacks drawn as dashed lines.]
Fig. 4. Example model for presenting the features of ModelMage. The dashed lines were our hypotheses about feedbacks that possibly regulate the production of T. ModelMage was started with the parameters -r 're11, re11:X5 & re11:X6, X4 & re11:X5 & re11:X6, re11:X6 & re3:X6, re11:X6 & re3:X6 & X4, re11:X5 & re3:X6, re11:X5 & re3:X6 & X4' -k "re11 (MM)" and produced all possible candidate models with different kinetics for re11 where possible. All these alternatives were also created with a double or single phosphorylation. Additionally, one model with re11 completely removed was created.
#    Model                      Objective Value   AIC        AICc       n    k
1.   re11:X5 & re3:X6 & X4      0.0731            -46.9367   -44.9367   21   4
2.   re11:X5 & re11:X6 & X4     0.1246            -33.7391   -29.7391   21   5
3.   re11:X5 & re11:X6 & X4     0.1621            -30.2117   -28.2117   21   4
4.   re3:X6 & re11:X6 & X4      0.1833            -27.6309   -25.6309   21   4
5.   master model               1.3734            14.6623    16.6623    21   4
6.   re11:X5 & re11:X6          1.6720            18.7939    20.7939    21   4
7.   re11:X5 & re3:X6           1.6841            18.9455    20.9455    21   4
8.   re3:X6 & re11:X6           1.7246            19.4449    21.4449    21   4
9.   re11:X5 & re11:X6          1.5689            19.4577    23.4577    21   5
10.  re11                       4.4530            37.3652    37.3652    21   3
The ranking of candidate models clearly divided the models into three separate groups. The first group consists of the models that did not include species X4. They had objective values that were about 10 times smaller than those of the second group while fitting the same number of parameters. This also leads to big differences in the AICc. Within this group, the first-ranking model differs markedly from the following three models, which all have very similar AICc values. This model is the one the data was created with.
The second group includes all the models that still include species X4 and represent different ways of feedback. The master model, which was also fitted, ranked highest in this group, which is probably due to the fact that it still includes all feedbacks and can therefore regulate the concentration of X6 better than all the other models. Because of this obvious classification, we ran time course simulations for the best candidate model and the one that ranked seventh. These models are very similar; the only difference is that the seventh-ranking model contains X4 and the first-ranking model does not (Fig. 5). The third group consists of the model in which reaction re11 is completely removed, which has the worst fit of all the models. The bad fit is mainly due to the missing degradation of S. The other curves in this model fit quite well, because there is still one feedback that can regulate X5 and X6.
[Figure 5: three time-course panels, (A) S, (B) X5 and (C) X6, plotted over 0 to 80 time units.]
Fig. 5. Plots of simulations with the fitted models. (A) The signal is fitted similarly by models from both groups. (B) The lower ranking models do not reach the high amplitude in the X5 concentration. (C) The concentrations of X6 reach a maximum and decay after 70s in the model from the second group, whereas they reach a limit in the model from the first group. This difference can be seen in all the models.
3.2. Conclusions
The software we developed can substantially facilitate and accelerate the generation and discrimination of model alternatives. Model families can be created, analyzed and changed far more easily than was possible before. The generated models are portable to any other SBML-compliant software, which gives the user the possibility to view and analyze them with an array of existing tools. Our example model could be generated and clearly recovered from a family of slightly different models in a very short time. ModelMage must be installed locally and can be downloaded from http://sysbio.molgen.mpg.de/modelmage; it would also be possible to create a web-based version of the generator to make it easier to use. The ranking criteria worked well despite the limited set of data to fit. Whether this is also the case in real biological examples remains to be investigated. Nevertheless
the user has to be careful when selecting models, and the ranking by AIC should only be used as a hint as to which model might be the best [8]. The most complicated step in using ModelMage is the formulation of the logical combination of removals, which can become quite difficult in some cases. We hope to improve this by adding a more sophisticated user interface to ModelMage in upcoming versions. We are also planning to integrate a broader set of exchangeable kinetics to give the user more possibilities for alternatives.
Acknowledgements
JS is supported by the European Commission (CELLCOMPUT (043310)) and MF is supported by the MPI for Molecular Genetics.
References
[1] Kuepfer, L., Peter, M., Sauer, U., and Stelling, J., Ensemble modeling for analysis of cell signaling dynamics, Nat Biotechnol, 25: 1001-1006, 2007.
[2] Geva-Zatorsky, N., Rosenfeld, N., Itzkovitz, S., Milo, R., Sigal, A., Dekel, E., Yarnitzky, T., Liron, Y., Polak, P., Lahav, G., and Alon, U., Oscillations and variability in the p53 system, Mol Syst Biol, 2: 0033, 2006.
[3] Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., and Kummer, U., COPASI-a complex pathway simulator, Bioinformatics, 22(24): 3067-3074, 2006.
[4] Funahashi, A., Morohashi, M., Kitano, H., and Tanimura, N., CellDesigner: a process diagram editor for gene-regulatory and biochemical networks, Biosilico, 1: 159-162, 2003.
[5] Shapiro, B.E., Levchenko, A., Meyerowitz, E.M., Wold, B.J., and Mjolsness, E.D., Cellerator: extending a computer algebra system to include biochemical arrows for signal transduction simulations, Bioinformatics, 19(5): 677-678, 2003.
[6] Blinov, M.L., Faeder, J.R., Goldstein, B., and Hlavacek, W.S., BioNetGen: software for rule-based modeling of signal transduction based on the interactions of molecular domains, Bioinformatics, 20(17): 3289-3291, 2004.
[7] Lok, L. and Brent, R., Automatic generation of cellular reaction networks with Moleculizer 1.0, Nat Biotechnol, 23(1): 131-136, 2005.
[8] Haunschild, M., Freisleben, B., Takors, R., and Wiechert, W., Investigating the dynamic behavior of biochemical networks using model families, Bioinformatics, 21(8): 1617-1625, 2005.
[9] Klipp, E., Liebermeister, W., Helbig, A., Kowald, A., and Schaber, J., Systems biology standards-the community speaks, Nat Biotechnol, 25: 390-391, 2007.
[10] Finney, A. and Hucka, M., Systems biology markup language: Level 2 and beyond, Biochem Soc Trans, 31: 1472-1473, 2003.
[11] Hucka, M., Finney, A., Sauro, H. M., Bolouri, H., Doyle, J. C., Kitano, H., Arkin, A. P., Bornstein, B. J., Bray, D., Cornish-Bowden, A., et al., The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models, Bioinformatics, 19: 524-531, 2003.
[12] Schulz, M., Uhlendorf, J., Klipp, E., and Liebermeister, W., SBMLmerge, a system for combining biochemical network models, Genome Inform, 17(1): 62-71, 2006.
[13] Akaike, H., Information theory and an extension of the maximum likelihood principle, Selected Papers of Hirotugu Akaike, 1998.
[14] Wagenmakers, E.-J. and Farrell, S., AIC model selection using Akaike weights, Psychon Bull Rev, 11: 192-196, 2004.
[15] Burnham, K. and Anderson, D., Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer, 2002.
[16] Kell, D. B. and Mendes, P., The markup is the model: Reasoning about systems biology models in the semantic web era, J Theor Biol, 252(3): 538-543, 2008.
[17] Bornstein, B. J., Keating, S. M., Jouraku, A., and Hucka, M., LibSBML: An API library for SBML, Bioinformatics, 24(6): 880-881, 2008.
A FRAMEWORK FOR DETERMINING OUTLYING MICROARRAY EXPERIMENTS
RAYMOND WAN 1 (ryan@kuicr.kyoto-u.ac.jp)
ASA M. WHEELOCK 2 (asa@para-docs.org)
HIROSHI MAMITSUKA 1 (mami@kuicr.kyoto-u.ac.jp)
1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, 611-0011, Japan
2 Lung Research Lab L4:01, Respiratory Medicine Unit, Department of Medicine, Karolinska Institutet, 171 76 Stockholm, Sweden
Microarrays are high-throughput technologies whose data are known to be noisy. In this work, we propose a graph-based method which first identifies the extent to which a single microarray experiment is noisy and then applies an error function to clean individual expression levels. These two steps are unified within a framework based on a graph representation of a separate data set from some repository. We demonstrate the utility of our method by comparing our results against statistical methods by applying both techniques to simulated microarray data. Our results are encouraging and indicate one potential use of microarray data from past experiments.
Keywords: microarrays; distance-based outliers; data cleaning; simulated microarrays
1. Introduction
Microarrays are high-throughput technologies that allow researchers to determine the expression levels of thousands of genes simultaneously. Microarray experiments themselves are known to have problems with noise. The validity of a completed microarray experiment needs to be evaluated by the experimentalist while taking into account the monetary costs of producing the slide. This translates into a potential bias in their decision. While statistical techniques exist for determining the validity of a microarray slide across replicates, we present an alternative method which assesses a microarray experiment and optionally cleans individual expression levels by making use of past microarray data as a guide. We propose a framework which extends some earlier work [19] and allows experimentalists to determine the extent to which a microarray experiment t is an "outlier" by using other data R from some repository as a guide. This is in contrast to methods embodied within microarray acquisition software, which evaluate each microarray experiment in isolation. The external data could be from the same laboratory or from a public repository and are assumed to be "correct" or, at least, of sufficient quality to compare against. As a starting point to our work, we assume the repository data are replicates. This allows us to use statistical methods for replicates as baselines. The framework builds an undirected graph representation of R where each probe is a vertex and edges indicate probes with similar values across R. We apply this framework in two different ways. The first way scores a microarray experiment using techniques related to distance-based outlier detection [8]. The score of a microarray experiment is calculated from the number of probes which are similar in R, but differ in t. The second method employs an energy function E which cleans individual expression levels which were previously marked as outliers. We demonstrate our method with simulated microarray data created using the SIMAGE web service in order to give us better control over the type of data being used [1]. This paper is structured as follows. We provide some background to this problem and our method in Section 2. Then, in the next three sections, we discuss our framework, method for outlier detection, and method for probe cleaning. In Section 6, we describe statistical methods for one-dimensional replicate data which can be used as a baseline. Experimental results using data sets compiled by SIMAGE are reported in Section 7. Section 8 summarizes our work and provides some future directions.
2. Background
2.1. Notation
The following notation for microarrays and graph theory is used in describing our methodology. A microarray platform specifies the m probes of the microarray slides based on it. A microarray slide is subjected to a specific condition to form a microarray experiment. In this work, we apply our method to two-channel cDNA microarrays. These microarrays employ two colored dyes in an experiment to distinguish between two different conditions (for example, a treatment and a control): Cy3 (green) and Cy5 (red). A researcher then forms a data set of n microarray experiments for a particular study. The n experiments in the data set typically vary in experimental conditions, are biological or experimental replicates, or a combination of both. The expression level at probe i in experiment j is denoted as $p_{ij}$, for $1 \le i \le m$ and $1 \le j \le n$. All of the expression levels for probe i are denoted as $p_i$, which is a vector of length n. The purpose of this work is to assess a single experiment t in the context of a set of experiments R which are obtained from a private or public repository. These experiments are assumed to represent a "consensus", making them "more reliable" than the single experiment in question. In order to simplify the problem, we assume that all of the experiments are based on the same platform. We adopt the method chosen by SIMAGE and subtract the background of each probe spot for both of the channels to arrive at an expression level for each probe. If we denote the two channels as Cy3 and Cy5 and their backgrounds as Cy3bgr and Cy5bgr, respectively, then the background-subtracted log-ratio of the two channels for a probe $p_{ij}$ is:
$$p_{ij} = \log_2 \frac{\mathrm{Cy5}_{ij} - \mathrm{Cy5bgr}_{ij}}{\mathrm{Cy3}_{ij} - \mathrm{Cy3bgr}_{ij}}. \qquad (1)$$
Other forms of background correction are possible, including not subtracting at all. Further information about microarrays can be found in other sources [21]. The basis of our method is to form an undirected graph G(V, E) using R to act as a guide in assessing t. The vertex set V is formed from the set of all m probes. An edge between two vertices $v_i$ and $v_j$ indicates that the two probes have expression levels which are similar across R. Note that we refer to microarray expression values as either $v_i$ or $p_{ij}$, depending on whether the value is a value in a vertex or within experiment j in a data set of n experiments.
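The background-subtracted log-ratio of Eq. (1) can be sketched as a small helper. This is an illustration only: it assumes that Eq. (1) is the base-2 logarithm of background-subtracted Cy5 over background-subtracted Cy3, the function name `log_ratio` is hypothetical, and returning `None` for non-positive background-subtracted intensities is an assumed way of marking a value as missing.

```python
import math


def log_ratio(cy5, cy5_bgr, cy3, cy3_bgr):
    """Background-subtracted log2 ratio of the red (Cy5) over the green (Cy3) channel."""
    red = cy5 - cy5_bgr
    green = cy3 - cy3_bgr
    if red <= 0 or green <= 0:
        # Assumption: the logarithm is undefined here, so the value is treated as missing.
        return None
    return math.log2(red / green)
```

A spot with Cy5 = 900, Cy3 = 300 and a background of 100 in both channels has a background-subtracted ratio of 800/200 = 4, i.e. a log-ratio of 2.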
2.2. Related Work
Related topics include combining microarray data, statistical techniques for detecting outliers across replicates, distance-based outlier detection, and data cleaning. As newer techniques for microarray analysis are made available, previously published data can be re-examined. The aggregation of multiple data sets from public repositories has been an active topic in recent years. In one study, researchers looked at pairs of co-expressed genes in 60 human data sets covering 3,924 microarray experiments across multiple platforms [11]. In their work, they looked at pairs of co-expressed genes in each individual data set and then compared these co-expression links between data sets. Others have focussed on combining single-channel Affymetrix data [17], cancer classification using Support Vector Machines [20], and even determining the amount of variation between studies [3]. In contrast, our method assumes that the repository data R is "correct" and that a single experiment t is being evaluated. Of course, the quality of data in a repository varies, since the data can reside on an experimenter's computer as part of a past experiment or be in a public repository. Thus, we are assessing how similar t is to R without any claims on the reliability of R. One method which more closely resembles ours involves the construction of a gene regulatory network [5]. Their gene regulatory network is depicted as a directed graph of probes, constructed using an algorithm dubbed "mode-of-action by network identification" (MNI). The direction of an edge indicates that one gene regulates another. This network is then applied to an experiment test set in order to determine which genes are associated with a particular drug treatment. Their method is an iterative procedure based on principal components regression. In comparison, our method asserts a weaker statement which simply says that two probes are "similar" to each other in terms of their expression profiles. Even so, our assertion is sufficient for our needs since the two purposes differ; instead of relying on gene regulation, we are concerned with outlier detection. Noise in microarray data is handled through various normalization techniques that range from operating at the probe level up to the slide level. Often, they adjust
expression levels according to some distribution within the data set [12, 21]. In our work, we make a distinction between R and t since R guides the analysis and noise cleaning of t. Distance-based outlier detection is a method of locating outliers in a database of records in order to find records of interest [8]. These records are not necessarily erroneous, but have characteristics that separate them from every other record. While the values in each record can be either continuous or discrete, a suitable distance function is required to handle whatever data types are used. For example, if all fields in the database are continuous, then the Euclidean distance between records is one option. Additional parameters are also required which dictate what is an outlier. At least three sets of parameters have been considered in the literature [2]:
(1) Outliers are records for which there are fewer than p other examples within a distance d;
(2) Outliers are the top n examples whose distance to the kth nearest neighbor is greatest; and
(3) Outliers are the top n examples whose average distance to the k nearest neighbors is greatest.
Regardless of which definition of outliers is employed, distance-based outlier detection requires that every record be compared with every other record. In our case, the repository R provides information that restricts which comparisons are performed. These restrictions are represented as the undirected graph. The next step beyond detecting outliers or, more generally, "problematic" values, is to replace them. In the context of microarray data, data cleaning is synonymous with normalization. Outside of microarray data, more general data "polishing" has been investigated as an augmentation to the C4.5 decision tree algorithm [15, 18]. Others have constructed a probabilistic model of three components: a clean model, a noise model, and a data corruption model [9]. Our method of cleaning assumes that probes are related to each other according to the undirected graph G.
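The three definitions of distance-based outliers listed above can be written down compactly for continuous records with the Euclidean distance. The sketch is a brute-force illustration with hypothetical function names; "within a distance d" is read as "at distance d or less", and every record is compared with every other record, which is exactly the cost that the graph built from R is meant to reduce.

```python
import math


def neighbor_distances(records, idx):
    """Sorted Euclidean distances from record `idx` to all other records."""
    return sorted(
        math.dist(records[idx], other)
        for j, other in enumerate(records)
        if j != idx
    )


def outliers_by_count(records, p, d):
    """Definition (1): fewer than p other records within distance d."""
    return [
        i for i in range(len(records))
        if sum(dist <= d for dist in neighbor_distances(records, i)) < p
    ]


def outliers_by_kth_neighbor(records, n, k):
    """Definition (2): top n records by distance to the kth nearest neighbor."""
    score = {i: neighbor_distances(records, i)[k - 1] for i in range(len(records))}
    return sorted(score, key=score.get, reverse=True)[:n]


def outliers_by_mean_knn(records, n, k):
    """Definition (3): top n records by average distance to the k nearest neighbors."""
    score = {
        i: sum(neighbor_distances(records, i)[:k]) / k
        for i in range(len(records))
    }
    return sorted(score, key=score.get, reverse=True)[:n]
```

For four records on the unit square plus one record at (10, 10), all three definitions single out the distant record for suitable parameter choices, e.g. p = 2 and d = 2, or n = 1 and k = 2.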
3. Framework
Our graph-based method is illustrated in Figure 1. The basis of our framework is the construction of an undirected graph G(V, E) from R. In this graph, each vertex is a probe from R. An undirected edge is added to E if two probes are similar to each other. Unlike work by others (for example, [11]), "similarity" refers to expression levels which are equivalent in value and not in expression patterns. This difference is due to our aim being outlier detection rather than knowledge discovery and our use of replicate experiments for R. If R is not composed of replicate experiments, then a different and potentially more relaxed measure would be required. As a result, distance measures are more appropriate than correlation ones, and the one we have chosen is the Euclidean distance between two probes. The number of edges added to E is regulated by a distance threshold $d_t$. All edges whose weight (similarity) is less than $d_t$ are added to the graph. If $d_t = \infty$, then every node is connected to every other node. If the distance between two nodes cannot be calculated due to excessive missing values, then the distance is set to $\infty$.
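The construction of G from R can be sketched as follows. This is a minimal illustration under stated assumptions: R is represented as a dictionary from probe name to a list of expression levels with `None` for missing values, the parameter `max_missing` is a hypothetical stand-in for "excessive missing values", and the function names are not taken from the paper.

```python
import math
from itertools import combinations


def probe_distance(a, b, max_missing=0):
    """Euclidean distance between two probes across R (None marks a missing value)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(a) - len(pairs) > max_missing:
        return math.inf  # too many missing values: distance cannot be calculated
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


def build_graph(R, d_t, max_missing=0):
    """Return G as an adjacency dictionary: probe -> set of neighboring probes."""
    graph = {probe: set() for probe in R}
    for u, v in combinations(R, 2):
        # An edge is added only if the distance is strictly less than d_t.
        if probe_distance(R[u], R[v], max_missing) < d_t:
            graph[u].add(v)
            graph[v].add(u)
    return graph
```

With the strict comparison used here, a pair whose distance was set to infinity stays unconnected even for an infinite threshold, while all pairs with a finite distance become connected.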
R. Wan, A. M. Wheelock & H. Mamitsuka
Fig. 1. Illustration of our graph-based framework. At the top, we have the repository data R of 5 probes and 4 experiments. Below, we have the single experiment t to be evaluated, with the same 5 probes.
For the sake of convenience, we normalize all edges so that they lie within the range [0,100]. After G is built, we apply it to experiment t by inserting the values from t into G. The structure of G indicates which expression levels from t should be compared since each vertex has a neighborhood of vertices. In this work, we make use of the neighborhood of distance 1 from each vertex. If an edge does not exist between probes, then their values across R differ enough that they also should not be compared for t. This framework is used for outlier detection and probe cleaning.
4. Outlier Detection
Our application of distance-based outliers follows from the idea used for identifying outlying records in a database. The most significant difference is that we no longer compare every vertex (record) to every other vertex. Our approach is most similar to the first definition of distance-based outliers from Section 2.2 [8]. The combination of a distance d and a proportion p is replaced with two parameters: d_t, as described above, and e_t, which provides an explicit threshold indicating when the expression levels of two adjacent vertices differ enough so that one of them is labeled an outlier. The difference between adjacent vertices is again based on the Euclidean distance since we have numerical expression levels. That is, the distance e_ij between two vertices i and j is: (2) As with d_t, in order to bound e_t, we normalize all differences between vertices of t so that they are within the range [0,100]. An outlying probe is a probe for which the majority of the differences to its neighboring expression levels are greater than e_t. This cut-off
A Framework for Determining Outlying Microarray Experiments
may be too simplistic and in future work, an explicit proportion between expression levels less than et ("near" neighbors) to ones greater than et ("far" neighbors) may be required.
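The majority rule described in this section can be sketched as below. The sketch assumes that, for single expression levels, the Euclidean distance between two adjacent vertices reduces to an absolute difference, and that differences are scaled to [0, 100] by the largest difference over all edges; the function name `detect_outliers` and the toy data are hypothetical.

```python
def detect_outliers(graph, t, e_t):
    """Return the probes for which more than half of the neighbors are "far".

    graph: {probe: set of neighbors}, t: {probe: expression level},
    e_t: threshold on the normalized difference, in [0, 100].
    """
    # Largest absolute difference over all edges, used for normalization (assumed).
    largest = max(
        (abs(t[p] - t[q]) for p, nbrs in graph.items() for q in nbrs),
        default=0.0,
    )
    outliers = set()
    for p, nbrs in graph.items():
        if not nbrs:
            continue  # no neighbors, nothing to compare against
        far = 0
        for q in nbrs:
            diff = abs(t[p] - t[q])
            scaled = 100.0 * diff / largest if largest > 0 else 0.0
            if scaled > e_t:
                far += 1
        if far > len(nbrs) / 2:  # strict majority of "far" neighbors
            outliers.add(p)
    return outliers

# Hypothetical graph and single experiment t.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D"},
}
t = {"A": 1.0, "B": 1.1, "C": 6.0, "D": 2.0, "E": 2.1}
outliers = detect_outliers(graph, t, e_t=50.0)
```

In this toy case only C is flagged: both of its neighbors are far from it, whereas A and B each have one near and one far neighbor, which is not a majority.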
5. Probe Cleaning
Probes which have been marked as outliers can also be cleaned using an error function. We define an error function E based on the Euclidean distance between connected, adjacent nodes in G, as shown in Equation (3). This error function defines the energy of the graph as the sum over the difference in expression level of every pair of connected vertices. The higher the energy, the greater the difference between connected vertices. As each pair is counted twice, the energy is halved:
$$E = \frac{1}{2} \sum_{i}^{m} \sum_{j}^{m} (v_i - w_{ij} v_j)^2. \qquad (3)$$
In order to minimize the error in expression levels, we first take the partial derivative with respect to some vertex v_k where 1 ≤ k ≤ m. Then, we set this equation to 0 and solve for v_k, leaving us with:
$$v_k = \frac{2 \sum_{i}^{m} w_{ki} v_i}{|N_k| + \sum_{i}^{m} w_{ki}^2}, \qquad (4)$$
where |N_k| is the size of the neighborhood of v_k. As we want the minimum (local) energy, all m equations are represented as:
$$v = A \cdot v + c. \qquad (5)$$
In Equation (5), v is the solution vector and A is an n x n matrix. For the moment, c is a zero column vector. Thus, the coefficient for row i, column j in A is:
$$a_{ij} = \frac{2 w_{ij}}{|N_i| + \sum_{k}^{m} w_{ik}^2}. \qquad (6)$$
Vertices which our previous method has labeled as outliers are the only values which are cleaned. If a vertex v_i was not marked as an outlier, it is left unchanged and its corresponding row in matrix A can be removed. Furthermore, v_i adds a constant to all remaining rows in the matrix. These constants are moved into the constant vector c. Solving these m equations simultaneously gives the locally best expression levels as v. If w_ij = w_ji in Equation (3), then a local minimum of E is produced. Details are omitted here, but we use a Hessian matrix of second order partial derivatives of E to show that $H(f)(x) = \frac{1}{2} v^{t} H v$ is positive for all vectors v [6]. This implies all eigenvalues must be positive and that E'' > 0. Our implementation makes use of LU-decomposition and back substitution routines [14] instead of Gaussian elimination since it is about three times faster and more numerically stable to round-off errors [4].
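The cleaning step amounts to solving a small linear system for the outlying vertices only. The sketch below is a minimal illustration, assuming unit edge weights (w_ij = 1 for every edge) unless a weight function is supplied; `clean_outliers` and its arguments are hypothetical names. The coefficients follow the form 2w / (|N| + sum of squared weights), terms involving non-outlying neighbors are moved into the constant vector c, and the system (I - A) v = c is solved with NumPy's `numpy.linalg.solve`, which relies on an LU factorization.

```python
import numpy as np

def clean_outliers(graph, values, outliers, weight=lambda i, j: 1.0):
    """Return a copy of values in which only the outlying vertices are replaced."""
    out = sorted(outliers)
    if not out:
        return dict(values)
    index = {k: row for row, k in enumerate(out)}
    A = np.zeros((len(out), len(out)))
    c = np.zeros(len(out))
    for k in out:
        nbrs = graph[k]
        # Denominator of the coefficient: |N_k| + sum of squared weights.
        denom = len(nbrs) + sum(weight(k, i) ** 2 for i in nbrs)
        for i in nbrs:
            a_ki = 2.0 * weight(k, i) / denom
            if i in index:
                A[index[k], index[i]] = a_ki      # neighbor is also being cleaned
            else:
                c[index[k]] += a_ki * values[i]   # fixed neighbor goes into c
    # v = A v + c  is rewritten as  (I - A) v = c.
    solved = np.linalg.solve(np.eye(len(out)) - A, c)
    cleaned = dict(values)
    for k in out:
        cleaned[k] = float(solved[index[k]])
    return cleaned

# Hypothetical path graph 0 - 1 - 2 - 3 in which vertices 1 and 2 are outliers.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
values = {0: 0.0, 1: 7.0, 2: -4.0, 3: 3.0}
cleaned = clean_outliers(graph, values, outliers={1, 2})
```

With unit weights each cleaned vertex becomes the mean of its neighbors, so the two outliers on the path are replaced by 1.0 and 2.0, which interpolate between the fixed end points 0.0 and 3.0.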
[Figure 2: two grids with probes A to E as rows and experiments as columns; panel (a) Statistical methods, panel (b) Our graph-based method.]
Fig. 2. A comparison of statistically-based outlier methods against our graph-based one. Each of the two figures represents a microarray data set of replicates where each row is a probe and each column is an experiment. The black square represents the value being evaluated and the gray squares indicate the values used to make the evaluation.
6. Statistical Methods
As a baseline for microarray experiment scoring, statistical methods for one-dimensional data can be applied as usual for each probe. The difference is that there is no distinction made between the experiments of R and t. These methods are applied to the combined data set R ∪ t on an expression level-by-expression level basis. Figure 2 illustrates how these statistical methods differ from our framework. In Figure 2(a), the grid represents the unified microarray data set of R ∪ t so that a row is a probe and a column is an experiment. The expression level being evaluated is shaded in black and the values which it is compared with are in gray. Statistical methods treat every experiment the same way and compare each expression level with the replicates within the same probe. In Figure 2(b), our method makes a distinction between R and t, as described earlier. Statistical methods perform a direct comparison while our framework constructs a graph using the shaded values of R and the evaluation is performed using the shaded values of t. At least three types of statistical methods are at our disposal: (a) comparison against the inter-quartile range (IQR), (b) standardized scores (or Z-scores), and (c) the Q-test. The inter-quartile range is the range from the first to the third quartile. Values outside of this range are considered outliers. The Z-test calculates a standardized score or Z-score for each value p_ij against the overall average and standard deviation for all replicates of p_i. The Z-score reports the number of standard deviations σ_i the expression level is from the mean μ_i:
$$z_{ij} = \frac{p_{ij} - \mu_i}{\sigma_i}. \qquad (7)$$
For both IQR and standardized scores, a cut-off is required to indicate either how many times the IQR or how many standard deviations from μ_i are accepted before labeling a value as an outlier. A larger cut-off yields a more conservative test. In the natural sciences, the Q-test compares each value to its nearest neighbor and the overall range of values according to some confidence interval (critical values according to a 90% confidence interval are shown in Table 1):
$$Q(p_{ij}) = \frac{|p_{ij} - (\text{closest value to } p_{ij})|}{\text{range}} \qquad (8)$$
Table 1. Critical values for the Q-test for a 90% confidence interval [16, pg. 35].
N     3     4     5     6     7     8     9     10
Q_c   0.94  0.76  0.64  0.56  0.51  0.47  0.44  0.41
Table 2. Simulated data sets created using SIMAGE.
Name  Probes  Experiments  Dye-swap  Random noise
V1    11,664  100          Yes       N(0, 0.219)
V2    11,664  100          No        N(0, 0.219)
V3    11,664  10           Yes       N(0, 0.500)
V4    11,664  10           No        N(0, 0.500)
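Two of the baseline tests can be sketched for the replicates of a single probe as follows. This is an illustration rather than the authors' code: the function names are hypothetical, the population standard deviation is an assumed choice (the text does not say which estimator was used), and the critical values are those of Table 1.

```python
import statistics

# Critical values Q_c for a 90% confidence interval, keyed by the number of values N.
Q_CRITICAL_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}

def z_scores(values):
    """Standardized score of each replicate of one probe.

    Assumes the values are not all identical, otherwise the deviation is zero.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(x - mu) / sigma for x in values]

def q_statistic(values, j):
    """Gap between values[j] and its closest other value, divided by the range."""
    others = values[:j] + values[j + 1:]
    gap = min(abs(values[j] - x) for x in others)
    return gap / (max(values) - min(values))

def q_test_outliers(values):
    """Indices whose Q statistic exceeds the critical value for N = len(values)."""
    q_c = Q_CRITICAL_90[len(values)]
    return [j for j in range(len(values)) if q_statistic(values, j) > q_c]

# Hypothetical replicates of one probe; the last value is suspicious.
replicates = [1.0, 1.1, 1.2, 1.3, 5.0]
z = z_scores(replicates)
flagged = q_test_outliers(replicates)
```

For these five replicates the Q statistic of the last value is 3.7 / 4.0 = 0.925, which exceeds the critical value 0.64 for N = 5, so only that index is flagged. Because Table 1 stops at N = 10, the lookup fails for larger groups, which mirrors the limit on the number of experiments that can be combined for this test.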
7. Experiment Results
Both the statistical methods in the previous section and our framework were applied to simulated microarray data sets.
7.1. Simulated Microarray Data
We employed simulated microarray data to give us better control over our experiments. Several researchers have looked into creating simulated microarray data which are still "real" since they model real microarray data sets [1, 13]. The SIMAGE system is a publicly available web server (URL: http://bioinformatics.biol.rug.nl/websoftware/simage/) which models various aspects of microarray data in a controlled way, including effects from spot pins, channels, and replication [1]. Four data sets were constructed using SIMAGE, as summarized in Table 2. SIMAGE has default parameters that were chosen through the modeling of a data set of 23 experiments [1]. These default values, which were left unchanged throughout our work, are not shown in this table. Every data set consists of 11,664 probes and either 100 or 10 experiments. Two data sets were dye-swapped (V1 and V3) and two were not (V2 and V4). As SIMAGE simulates real microarray data, the default parameters already introduce noise into the microarray data as a Gaussian distribution of N(0, 0.219). The first two data sets contained this level of noise; the remaining two have a larger standard deviation of 0.500. Therefore, two sets of experiments are conducted. In the first set, we used simulated dye-swapped data and formed G using all of V1 and then applied the graph to the first 10 experiments of V1 and V3, where the ones in V3 are known to have more noise. In the second scenario, non-dye-swapped data is considered and V2 is used to form G and it is applied to the first 10 experiments in V2 and V4.
[Figure 3: flow diagram leading from the repository R to the percentage of outlying probes (initial) and the percentage of outlying probes (final).]
Fig. 3. The framework for assessing our graph-based method.
7.2. Framework of Experiments
The framework of our experiments encompasses both the statistical tests and the use of our graph-based method. For the statistical tests, we combined 9 of the experiments from R with only one experiment known to have more noise to act as t since critical values for the Q-test are available for only up to 10 values (see Table 1). The aim is to determine how well statistical methods can isolate t. As for our graph-based method, we evaluate outlier detection and probe cleaning together using the framework shown in Figure 3. The repository data R is used to construct a graph G by selecting a value for d_t. The graph is applied to t and the percentage of outlying probes is reported as the "initial" percentage using a fixed value for e_t. Afterwards, the probes are cleaned using the same graph structure. Next, the "final" percentage of outlying probes is reported using the same value for e_t. In addition, the first application of outlier detection is done for the first 10 experiments in R and averaged to act as a baseline. The aim of our framework is to demonstrate the usefulness of our graph-based method in comparison to more well-established statistical methods. In order to unify the comparison, the statistical methods also report a percentage indicating the number of probes which they deemed to be outliers. The baseline for the statistical methods is the average percentage across the 9 experiments from R. This is compared to the single percentage obtained from evaluating the probes of the test set t.
7.3. Results
The results from our experiments are summarized in the graphs of Figure 4 for both simulated dye-swapped and non-dye-swapped data sets. Beginning with the dye-swapped data sets, Figure 4(a) and Figure 4(b) present the results for statistical methods and methods based on our framework. In both figures, the vertical axes indicate the percentage of probes that are marked as outliers. Along the horizontal axes is the parameter relevant to the method. Beginning with the statistical methods in Figure 4(a), it would seem that the IQR test performs better than the Z-test as there is a clear separation between the two graphs for the baseline and the test set. As expected, for both methods, the number of probes identified as outliers decreases as the parameter increases for
[Figure 4: plots for the dye-swapped (V1 and V3) and non-dye-swapped (V2 and V4) data sets; panels (a) Statistical methods and (b) Graph-based methods; axis labels "Parameter" and "Expression threshold"; legend entries: IQR, Z-test, and Q-test (baseline, averaged; test set), and baseline, initial test set, and final test set at 3% and 10%.]
[...] 0 and $[Nv]_i \geq 0$ for $i \notin U$. (2)
For the components i ∈ U, there is no restriction since these compounds may be imported from the environment. Condition (2) can be tested by phrasing it as a linear programming problem. Following the terminology introduced in [9], we call metabolites fulfilling this condition producible from the nutrients U. The entirety of all metabolites that are producible from U is denoted P(U). By definition, a network may carry a stationary flux leading to an increase in concentration of the producible metabolites while only the nutrients are consumed. However, this interpretation holds only as long as it is assumed that the cell is in a stationary, non-growing state. If a growing and reproducing organism is considered, stricter conditions for the producibility of metabolites must be imposed. In particular, all metabolites not contained in the set P(U) are not producible and therefore their amount may not continuously increase. If we assume a persistent increase in cellular volume, the concentrations of such metabolites necessarily decrease and eventually reach zero and, as a consequence, are not available as substrates for other reactions. We take into account these considerations by repeating the calculation of all producible metabolites with the additional constraint that all those reactions are forbidden which use as substrate any metabolite that is not contained in P(U). More precisely, all those metabolites are identified for which flux vectors v = (v_j)
K. Kruse & O. Ebenhöh
exist with v_j ≥ 0 and v_l = 0 if reaction l uses a substrate not contained in P(U). A reaction l fulfils this condition if the set {i ∉ P(U) | n_il < 0} is non-empty. If this additional restriction results in a reduction of the set of producible compounds, the calculation is repeated with even stricter conditions. This process is iterated until the set of producible metabolites remains unchanged. The final set of metabolites is denoted by S(U) and a metabolite within this set is termed sustainable since it has the property that it can be produced from the nutrients U even if the cell is constantly growing. Sustainable metabolites are determined by repeatedly decreasing the set of producible metabolites until only those remain which can be produced from available nutrients without requiring the presence of any non-sustainable intermediates. In contrast, in the method of network expansion the scope of the seed U is determined by stepwise expanding a set of metabolites. Starting with the set U, all those reactions are identified that use exclusively substrates contained in the set and their products are included in the expanding set. Expansion stops if no further products are included and the final set is called the scope of the nutrients U, denoted Σ(U). From the construction of the scope it is evident that every metabolite contained in the scope is also sustainable in the above defined sense. Therefore,
$$\Sigma(U) \subseteq S(U) \subseteq P(U). \qquad (3)$$
The concepts of producibility, sustainability and scope can be viewed as different definitions of which metabolites can be synthesized by a given network, with increasingly strict conditions.
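Of the three concepts, the scope is the simplest to compute, and the expansion process can be sketched in a few lines. The sketch below assumes that each reaction is given as a directed pair of substrate and product sets (a reversible reaction would have to be listed in both directions); the function name `scope` and the toy network are hypothetical.

```python
def scope(reactions, seed):
    """Expand the seed until no reaction adds a new product.

    reactions: iterable of (substrates, products) pairs of sets.
    """
    available = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            # A reaction fires only if all of its substrates are already available.
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available

# Hypothetical toy network.
reactions = [
    ({"A"}, {"B"}),
    ({"B", "C"}, {"D"}),   # bimolecular: needs both B and C
    ({"D"}, {"C", "E"}),
]
scope_a = scope(reactions, {"A"})
scope_ac = scope(reactions, {"A", "C"})
```

Starting from A alone, the expansion stops at {A, B} because the bimolecular reaction never finds C. Adding C to the seed unlocks the remaining reactions and the scope grows to all five metabolites, which illustrates how strongly the scope can depend on a single additional seed compound.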
[Figure 2: panels (a) E. coli and (b) M. barkeri; horizontal axes: producible metabolites.]
Fig. 2. Comparison between scope size and numbers of producible metabolites. Each dot represents one metabolite. Dots on the straight line represent metabolites for which the scope size equals the number of producible metabolites.
to condition (2). In approximately 44% of all cases, the scope is identical to the set of producible metabolites. However, identity is only observed for small sets with the majority being those cases in which the scope is identical to the seed. In most cases the set of producible metabolites is considerably larger than the size of the corresponding scope. This is not surprising, considering that the criteria for obtaining producible metabolites are weaker than for metabolites in the scope. Interestingly, the size distributions of both sets are clearly structured. For the scopes, this property has been extensively investigated in [2, 7] and the results have been used to derive a hierarchical ordering of metabolism [5, 12]. Apparently, there exists a similar ordering of sets of producible metabolites. Fig. 3 shows the direct comparison of scope sizes and numbers of sustainable metabolites. In the E. coli network, these sets are identical in 97% of all cases and in M. barkeri identity is observed in almost 99%. Those metabolites for which the corresponding sets differ are labelled by the abbreviations used in [4, 13]. Remarkably, many metabolites in the E. coli network exhibiting differences in the sets of sustainable metabolites and those in the scope are related to important cofactors. In particular, many adenine nucleotide phosphates and nicotinamide dinucleotide phosphates belong to this class. In both networks, many sugar phosphates also show a considerable difference in the corresponding sets. Because cofactors apparently take on a role as key metabolites in both networks, a detailed investigation of their influence on scope size and contents is performed.
We specifically consider the following four cofactor functionalities: 1) transfer of a phosphate group from ATP to an acceptor, yielding ADP, 2) simultaneous hydrolysis of two phosphate groups from ATP yielding AMP, 3) reduction of NAD+ to yield NADH, thereby oxidizing another compound, 4) the analogous process but involving NADP+/NADPH. Apparently, the introduction of a cofactor functionality can only increase the scope. We have systematically compared the scopes resulting for all 16 combinations of cofactor functionalities with the sets of sustainable metabolites.
Comparing FBA and Network Expansion
[Figure 3: panels (a) E. coli and (b) M. barkeri; horizontal axes: sustainable metabolites.]
Fig. 3. Comparison between scope size and number of sustainable metabolites. Metabolites are represented as dots. Metabolites for which the scope size is not identical to the number of sustainable metabolites (located below the diagonal) are labeled. For clarity, two metabolites were omitted in figure (a): acg5p (sustainable metabolites 279, scope 3) and glu5p (264, 3).
In Fig. 4, the results for the E. coli network for the four cofactor combinations ATP/ADP, ATP/ADP and ATP/AMP, NADH and NADPH, and all cofactors are shown. Interestingly, introduction of the redox cofactors NADH and NADPH leads to a stronger increase in scope size than the introduction of the phosphate transfer cofactors ATP/ADP and ATP/AMP. The latter case, in which both ATP-related cofactor functionalities are introduced, is of particular importance. Here, the scopes of many central metabolites including NAD+, NADP+ and deoxyadeninephosphates are identical to the corresponding sets of sustainable metabolites. There exist, however, other metabolites whose scope is always considerably lower than the set of sustainable metabolites, which holds true for both investigated networks. A thorough investigation of the participating reactions preventing the expansion of the scope leads to the identification of metabolites whose addition directly to the seed resulted in identity of scope and sustainable metabolites. In Table 1, the most predominant examples are presented.
Table 1. Selection of metabolites that have to be added to the seed in order to obtain the same result for scope and sustainable metabolites.
network: E. coli; M. barkeri; both
addition to seed         affected metabolites
Proton (H+)              ps, 3dglnp, orot5p
ATP                      dnad, nadh, nadph
D-Ribulose 5-phosphate   e4p, s7p
D-Ribulose 5-phosphate   man1p, man6p, g1p, f6p, g6p, e4p
Proton (H+)              pran, 2cpr5p
[Figure 4: four panels, (a) ATP/ADP, (b) ATP/ADP, ATP/AMP, (c) NADH, NADPH, and (d) all cofactors; horizontal axes: sustainable metabolites.]
Fig. 4. Comparison of scope size with different cofactor functionalities with the numbers of sustainable metabolites for the E. coli network. In (b) metabolites have been labeled whose scope rose to the size of the sustainable metabolites by adding both cofactor functionalities of ATP.
To study whether the finding that the inclusion of cofactor functionality improves the agreement of scopes with sets of sustainable metabolites is of a general nature, we performed a Monte Carlo simulation. For this, we randomly generated 1000 seeds with sizes varying between 10 and 100. For both networks, the scopes for all possible combinations of cofactors as well as the corresponding sets of sustainable metabolites have been determined. In Fig. 5 the degree of agreement of the sets is plotted versus the seed size. Interestingly, the behaviour differs for both networks. Whereas in both cases the agreement increases with increasing seed size
[Figure 5: panels (a) E. coli and (b) M. barkeri; horizontal axes: average seed size.]
Fig. 5. Degree of identity of scopes and sustainable metabolites for both investigated networks as a function of seed size. Black line represents network expansion without cofactors, the green line with the two ATP-related cofactors and the red line considering all four cofactor functionalities.
when cofactors are included, this is not true for scopes without cofactors. In the case of E. coli, the best agreement is obtained by considering all cofactor functionalities simultaneously (for seed sizes larger than 10). In contrast, in M. barkeri inclusion of both ATP related cofactor functionalities for large seed sizes (> 40) yields the highest degree of identity.
4. Discussion
We have introduced several mathematical descriptions defining the producibility of metabolites from available nutrients. Simple producibility is given when a steady state flux through the metabolic network may exist such that the concentration of a metabolite increases while exclusively consuming the nutrients. By considering a cell under persistent growth, we arrive at the concept of sustainability, which defines metabolites whose concentrations may be increased even if all intermediates are simultaneously diluted. The method of network expansion provides the concept of a scope of nutrients, describing what a network may produce if exclusively the nutrients are present and all intermediates possess zero concentration. We have systematically compared sets of producible and sustainable metabolites with the scopes obtained from single initial compounds and found that the scope is often identical to the set of sustainable compounds. We could further show that including cofactor functionalities, which are derived from heuristic arguments, can significantly increase the number of identical cases. More importantly, Monte Carlo simulations for larger sets of nutrients showed a tendency towards greater accordance of scope and sustainability with an increasing number of nutrients. For some metabolites, the introduction of cofactor functionality was not sufficient to produce a scope identical to the set of sustainable metabolites. It is to be expected that this also holds true for combinations of seed compounds. In some
cases, the addition of protons to the nutrients was sufficient to enlarge the scope to the sustainable metabolites. Since protons in most cases do not influence the size of a scope, it seems reasonable to generally include them in the seed. This is in particular plausible since we always considered water to be abundant and in aqueous solutions protons are always present. An interesting observation was made for some metabolites occurring in the pentose phosphate pathway. Erythrose-4-phosphate (E4P), for example, exhibits a very small scope but in both networks the corresponding sets of sustainable metabolites are significantly larger. This observation can be explained by considering the structure of the pentose phosphate cycle which contains many bimolecular reactions. A subset can easily be assembled allowing for a stationary flux producing, for example, xylulose-5-phosphate and glyceraldehyde-3-phosphate from two molecules of E4P. However, since E4P never appears as a single substrate, it is evident that the scope of E4P only contains E4P itself. This fact has practical consequences for a whole class of other organism-specific networks. Most photosynthetic organisms, such as plants or green algae, can fix CO2 by means of the Calvin cycle which bears high similarities with the pentose phosphate cycle. To realistically assess the biosynthetic capabilities from nutrient combinations including CO2, other compounds of the Calvin cycle, such as ribulose-1,5-bisphosphate, should also be added. A thorough investigation of genome-scale networks of photoautotrophic organisms is still outstanding. Although the concept of sustainability is mathematically more rigorous, it has the drawback that it is computationally very intensive. For some calculations of sets of sustainable metabolites, several hundred linear programming problems have to be solved.
In contrast, the network expansion algorithm is extremely simple and fast and can easily be applied millions of times on a normal personal computer, rendering it suitable for large-scale applications, for example to investigate thousands of nutrient combinations for hundreds of networks. Considering that the agreement of scopes with sets of sustainable metabolites is in most cases extremely accurate, we conclude that the enormous gain in computational speed justifies the inaccuracies that the network expansion method unavoidably displays due to the introduction of heuristic cofactor functionalities.
References
[1] Ebenhöh, O., Handorf, T., and Kahn, D., Evolutionary changes of metabolic networks and their biosynthetic capacities. Syst Biol (Stevenage), 153(5):354-358, Sep 2006.
[2] Ebenhöh, O., Handorf, T., and Heinrich, R., Structural analysis of expanding metabolic networks. Genome Inform, 15(1):35-45, 2004.
[3] Edwards, J. S. and Palsson, B. O., Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions. BMC Bioinformatics, 1:1, 2000.
[4] Feist, A.M., Scholten, J.C.M., Palsson, B.O., Brockman, F.J., and Ideker, T., Modeling methanogenesis with a genome-scale metabolic reconstruction of Methanosarcina barkeri. Molecular Systems Biology, 2:2006.0004, 2006.
[5] Handorf, T., Ebenhöh, O., Kahn, D., and Heinrich, R., Hierarchy of metabolic compounds based on their synthesising capacity. Syst Biol (Stevenage), 153(5):359-363, Sep 2006.
[6] Handorf, T., Christian, N., Ebenhöh, O., and Kahn, D., An environmental perspective on metabolism. J Theor Biol, 252:530-537, Nov 2007.
[7] Handorf, T., Ebenhöh, O., and Heinrich, R., Expanding metabolic networks: scopes of compounds, robustness, and evolution. J Mol Evol, 61(4):498-512, Oct 2005.
[8] Ibarra, R.U., Edwards, J.S., and Palsson, B.O., Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature, 420(6912):186-189, Nov 2002.
[9] Imielinski, M., Belta, C., Rubin, H., and Halasz, A., Systematic analysis of conservation relations in Escherichia coli genome-scale metabolic network reveals novel growth media. Biophysical Journal, 90:2659-2672, 2006.
[10] Kauffman, K.J., Prakash, P., and Edwards, J.S., Advances in flux balance analysis. Curr Opin Biotechnol, 14(5):491-496, Oct 2003.
[11] Liolios, K., Mavromatis, K., Tavernarakis, N., and Kyrpides, N.C., The genomes on line database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res, 36(Database issue):D475-D479, Jan 2008.
[12] Matthäus, F., Salazar, C., and Ebenhöh, O., Biosynthetic potentials of metabolites and their hierarchical organization. PLoS Comput Biol, 4(4):e1000049, Apr 2008.
[13] Reed, J.L., Vo, T.D., Schilling, C.H., and Palsson, B.O., An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR). Genome Biology, 4(9):R54.1-R54.12, 2003.
[14] Schilling, C. H., Edwards, J. S., Letscher, D., and Palsson, B. O., Combining pathway analysis with flux balance analysis for the comprehensive study of metabolic systems. Biotechnol Bioeng, 71(4):286-306, 2000.
SEMI-SUPERVISED GRAPH PARTITIONING WITH DECISION TREES
TIMOTHY HANCOCK
timhancock@kuicr.kyoto-u.ac.jp
HIROSHI MAMITSUKA
mami@kuicr.kyoto-u.ac.jp
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
In this paper we investigate a new framework for graph partitioning using decision trees to search for sub-graphs within a graph adjacency matrix. Graph partitioning by a decision tree seeks to optimize a specified graph partitioning index such as ratio cut by recursively applying decision rules found within nodes of the graph. Key advantages of tree models for graph partitioning are that they provide a predictive framework for evaluating the quality of the solution, determining the number of sub-graphs and assessing overall variable importance. We evaluate the performance of tree based graph partitioning on a benchmark dataset for multiclass classification of tumor diagnosis based on gene expression. Three graph cut indices will be compared: ratio cut, normalized cut and network modularity. They are assessed in terms of their classification accuracy, power to estimate the optimal number of sub-graphs and ability to extract known important variables within the dataset.
Keywords: graph partitioning; decision trees; multiclass classification
1. Introduction
The recent interest of computational biologists in graph partitioning stems from the idea of a highly organised community structure within biological networks, such as metabolism [12]. These communities manifest themselves as sub-graphs within the larger network. Graph partitioning describes the set of algorithms that seek to identify these sub-graphs. Common solutions to the graph partitioning problem are recursive k-way partitioning methods such as METIS [7] and approximate methods such as spectral approaches [4]. These methods, however, only output the optimal partition and offer no clues as to which features determine each sub-graph. Therefore, after the optimal partition has been found, the sub-graphs must then be analyzed to see if they possess a specific biological function. This second step often proves to be more time consuming than initially finding the optimal partition. If a graph partitioning algorithm could also provide a list of important variables that are related to specific sub-graphs, then this feature would be of considerable use to computational biologists. In this paper we propose such an interpretable solution to graph partitioning through the construction of a decision tree. Graphs can be represented in many forms; however, the most common form is
with an adjacency matrix, $S$, which is an $N \times N$ symmetric matrix that summarizes the distances between the $N$ nodes of the graph (Figure 1). In Figure 1 a partition on a graph has the action of dividing the adjacency matrix into four sub-matrices, $S_L$, $S_R$, $S_C$ and $S_C^T$, where $S_L$ and $S_R$ are the sub-graphs created by the partition and $S_C$ contains the edges that connect them.

Fig. 1. Diagrammatic representation of graph partitioning (panels: Graph, Adjacency Matrix, Partition). The adjacency matrix shown is

$$S = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{pmatrix}$$

and the partition divides it into the blocks $S_L$, $S_C$, $S_C^T$ and $S_R$.

Graph partitioning defines the quality of
a specified partition through a graph cut index, such as ratio cut [4], normalized cut [10] and, more recently, network modularity [12]. However, the search for the optimal partition using these indices is an NP-hard problem, because optimality requires a search through all possible permutations of the nodes of a graph.

Tree based models present a framework for predicting a response, $y$, by a hierarchy of decision rules found within a predictor dataset, $X$ [2]. These decision rules are simple binary inequalities: for example, all observations that satisfy $x \le 0.71$ follow the left path down the tree, otherwise they follow the right path. Tree models provide a predictive framework to estimate the optimal size of the tree and are able to measure the importance of each variable to the construction of the tree. These powerful features have led to many extensions that enable them to be used in situations other than prediction, such as clustering and feature selection [5, 13].

In this work we investigate the potential of decision trees to solve the graph partitioning problem. The construction of a tree for graph partitioning requires a greedy search over all binary decision rules within an external (predictor) set of variables $X$ that can partition the adjacency matrix. This restricts the search space of possible partitions, allowing the problem to be solved within a realistic computational time frame. In this paper we investigate the feasibility of using decision trees to solve the graph partitioning problem by using graph cut criteria as the homogeneity measure required to construct a tree. We compare the performance of ratio cut, normalized cut and normalized modularity cut on a benchmark multiclass classification dataset [8]. The indices are compared with respect to their classification accuracy, their ability to estimate the optimal number of sub-graphs and their power as a feature selection technique.
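To make the action of a single decision rule concrete, the following minimal sketch applies a rule of the form $x \le 0.71$ to a small six-node adjacency matrix in the spirit of Figure 1 and extracts the resulting blocks. The predictor values `x` and the helper names `apply_rule` and `block` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Six-node example graph: two triangles (nodes 0-2 and 3-5) joined by two edges.
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

# Hypothetical predictor variable, one value per node (illustrative only).
x = [0.10, 0.35, 0.60, 0.80, 0.90, 0.95]


def apply_rule(values, threshold):
    """Send observations with value <= threshold left, all others right."""
    left = [i for i, v in enumerate(values) if v <= threshold]
    right = [i for i, v in enumerate(values) if v > threshold]
    return left, right


def block(matrix, rows, cols):
    """Extract the sub-matrix with the given row and column indices."""
    return [[matrix[i][j] for j in cols] for i in rows]


left, right = apply_rule(x, 0.71)
S_L = block(S, left, left)    # edges inside the left sub-graph
S_R = block(S, right, right)  # edges inside the right sub-graph
S_C = block(S, left, right)   # edges connecting the two sub-graphs

print("left nodes:", left, "right nodes:", right)
print("S_C =", S_C)
```

In this toy case the rule separates the two triangles, so $S_C$ contains only the two connecting edges.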
The benchmark dataset used for the comparison of the three indices is the Ramaswamy et al. (2001) microarray dataset [8] for multiclass classification. This is an ideal benchmark because it is a combination of multiple microarray datasets, each on a different tissue type, and contains experiments from both tumor and normal
tissues. These two classification problems allow for analysis of this dataset to be performed at two different scales. Firstly, the large scale is to classify tumor vs normal irrespective of tissue type, and secondly the small scale is to classify tissue type irrespective of the tumor/normal classes. These two resolutions provide an ideal test bed to assess the robustness of the three graph cut indices. Furthermore, Ramaswamy et al. (2001) performed a feature selection routine, based on SVM, which ranked each gene according to its ability to classify each tissue type. The gene rankings computed by [8] provide a convenient benchmark for comparing the feature selection power of each graph cut index.
2. Methods
2.1. Preliminaries and Notations
For the purposes of decision tree construction it is only necessary to consider a binary partition of the form in Figure 1, where the sub-graphs for a given partition are defined by $S_L$ and $S_R$. Let $k$ denote either sub-graph, $L$ or $R$, and define $\sigma_k$ to be the sum of the edges within $k$, $\sigma_k = \sum_{(i,j) \in k} S_{ij}$, and define $n_k$ to be the number of nodes within sub-matrix $k$. Additionally, we define $\sigma_C$ to be the sum of the edges between the sub-graphs and $\sigma_T$ to be the sum of all edges within $S$.
2.2. Graph Cut Indices
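The graph partition indices under evaluation in this paper are the ratio cut, normalized cut and the normalized modularity. These indices have the following forms:

Table 1. Graph cut indices.

Ratio Cut:
$$R_R(S) = \min\left\{ \sum_{k \in (L,R)} \frac{\sigma_C}{n_k} \right\}$$

Normalized Cut:
$$R_N(S) = \min\left\{ \sum_{k \in (L,R)} \frac{\sigma_C}{\sigma_k + \sigma_C} \right\}$$

Normalized Modularity Cut:
$$R_M(S) = \max\left\{ \sum_{k \in (L,R)} \frac{n}{n_k} \left( \frac{\sigma_k}{\sigma_T} - \left( \frac{\sigma_k + \sigma_C}{\sigma_T} \right)^2 \right) \right\}$$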
The indices in Table 1 vary in increasing order of complexity, starting with the ratio cut, which simply searches for the minimum number of edges between the sub-graphs normalized by their size in number of nodes, $n_k$. Ratio cut, however, pays no attention to the density of edges within each sub-graph, only the edges between them. Normalized cut considers the density of each sub-graph by normalizing by the total number of edges within and between each sub-graph. These indices are natural choices in the context of graph partitioning because both seek to minimize the edges between the sub-graphs.
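As an illustration of how the three indices behave, the sketch below evaluates the bracketed quantity of each index in Table 1 for one given binary partition. Two conventions are assumptions made here and not statements from the paper: every $\sigma$ is computed as a plain sum of adjacency-matrix entries over the corresponding block, and $n$ in the modularity index is taken to be the total number of nodes. The function names and the example graph are likewise hypothetical.

```python
def block_sum(S, rows, cols):
    """Sum of the adjacency-matrix entries S[i][j] over the given block."""
    return sum(S[i][j] for i in rows for j in cols)


def cut_indices(S, left, right):
    """Evaluate the three bracketed quantities of Table 1 for one partition."""
    nodes = list(range(len(S)))
    n = {"L": len(left), "R": len(right)}
    sigma = {"L": block_sum(S, left, left), "R": block_sum(S, right, right)}
    sigma_c = block_sum(S, left, right)   # entries of the off-diagonal block
    sigma_t = block_sum(S, nodes, nodes)  # all entries of S
    n_total = len(S)                      # assumed meaning of n in Table 1

    ratio = sum(sigma_c / n[k] for k in "LR")
    normalized = sum(sigma_c / (sigma[k] + sigma_c) for k in "LR")
    modularity = sum(
        (n_total / n[k])
        * (sigma[k] / sigma_t - ((sigma[k] + sigma_c) / sigma_t) ** 2)
        for k in "LR"
    )
    return {"ratio": ratio, "normalized": normalized, "modularity": modularity}


# Two triangles joined by two edges (illustrative graph).
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

good = cut_indices(S, [0, 1, 2], [3, 4, 5])  # separates the two triangles
poor = cut_indices(S, [0], [1, 2, 3, 4, 5])  # isolates a single node
print(good)
print(poor)
```

Under these assumptions the partition that separates the two triangles has the smaller ratio cut and normalized cut and the larger modularity, which is the direction of optimization stated in Table 1.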
For modularity cut, however, the sub-graphs are assumed to be communities within the entire network. A community is defined as a sub-graph that has a non-random structure. A random structure is defined to be one where the probability of an edge between two nodes is independent of any specific sub-graph structure. The modularity of a sub-graph is defined to be how much the probability of each edge within a sub-graph differs from the probability of that edge existing by random chance. This definition of sub-graph structure appeals to biologists, and it has been shown that many biological networks, such as metabolism, are organized by a hierarchy of modularity [9].
2.3. Tree Based Graph Partitioning
Graph partitioning by decision trees starts with the adjacency matrix, $S$, as a single sub-graph in the root node of the tree. The tree is then built by recursively finding binary partitions on $S$ given a set of predictor variables $X$. To do this, for all terminal nodes (sub-graphs) of the current tree the next best graph partition is found using a greedy search over all possible decision rules. The tree is then grown at the node that has the optimal graph partition index. This process is shown diagrammatically in Figure 2. The first tree in Figure 2 results in the four sub-matrices where $S_1$ and $S_2$ are the sub-graphs and $S_{12}$ are the edges between them. The larger sub-graphs are expected to be found first, as it is logical that identifying the larger sub-graphs first is more likely to optimize the graph cut index. The second tree in Figure 2 is created by partitioning $S_1$ to create four new sub-matrices where $S_3$ and $S_4$ are the sub-graphs and $S_{34}$ are the edges between them. Note that by recursively partitioning the sub-graphs we are not changing the network structure but reordering the rows and columns of the adjacency matrix according to a specified graph cut criterion such that the identified sub-graphs lie on the block diagonal of $S$.

Fig. 2. Diagrammatic representation of tree graph partitioning (panels: First Cut, Second Cut).
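The greedy search over decision rules can be sketched for a single node of the tree as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: only the ratio cut is used as the criterion, candidate thresholds are taken as midpoints between consecutive distinct values of each predictor, and the predictor matrix `X` is hypothetical. Growing a full tree would repeat this search on every terminal node and split the node with the best index.

```python
def ratio_cut(S, left, right):
    """Ratio cut of one binary partition (sum of cut entries over block sizes)."""
    sigma_c = sum(S[i][j] for i in left for j in right)
    return sigma_c / len(left) + sigma_c / len(right)


def best_split(S, X):
    """Greedy search over rules X[i][j] <= t; returns (score, variable, threshold)."""
    n = len(S)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i in range(n) if X[i][j] <= t]
            right = [i for i in range(n) if X[i][j] > t]
            score = ratio_cut(S, left, right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Two triangles joined by two edges (illustrative graph).
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

# Hypothetical predictors: variable 0 is unrelated to the graph structure,
# variable 1 separates the two triangles.
X = [
    [0.9, 0.1],
    [0.2, 0.2],
    [0.7, 0.3],
    [0.1, 0.8],
    [0.8, 0.9],
    [0.3, 1.0],
]

score, variable, threshold = best_split(S, X)
print(score, variable, threshold)
```

Because only rules on the predictor variables are examined, the number of candidate partitions grows with the number of variables and distinct values rather than with the number of possible node subsets.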
3. Data and Methodology
The dataset under examination is the Ramaswamy et al. (2001) microarray dataset [8] for multiclass classification. This dataset is an agglomeration of microarrays spanning 16063 genes measured on 14 different tissue types, which are summarized in Table 2. We perform the same data preprocessing steps as described in [8], however,
for our analysis we further reduce the number of genes by taking the top 1000 genes ranked by standard deviation. The work on this dataset by Ramaswamy et al. (2001) focused on identifying genes to classify only the tissue types from within the tumor observations; for our purposes, however, we must also consider the normal tissues. It can be seen from Table 2 that not all tissue types are observed in the normal observations. The absence of class assignments within the dataset occurs when no normal observation is possible, as would be expected for classes such as leukemia. In the case where it is not possible to obtain a normal observation, microarray experiments on comparable tissue types have been added into the data, such as microarrays of blood from non-leukemia patients. Taking into account normal microarrays, it can be seen that the full dataset is extended to 18 classes of tissue type. Ramaswamy et al. (2001) [8] define separate test and training datasets of the microarrays within the tumor classes for tissue type classification. However, as our intention is to consider both the tumor/normal and tissue type classification problems, we keep the test set of Ramaswamy et al. (2001) for the tumor tissue types but randomly assign 45 normal observations to our test set and assign the other 45 observations to our training set. Our defined training/test partition is described in Table 2.

Table 2. Summary of the Ramaswamy et al. microarray dataset: numbers of tumor and normal observations in the training and test sets for each tissue type. Tissue types (abbreviations): Bladder (BL), Breast (BR), Central Nervous System (CNS), Colorectal (CO), Leukemia (LE), Lung (LU), Lymphoma (LY), Melanoma (ML), Mesothelioma (ME), Ovary (OV), Pancreas (PA), Prostate (PR), Renal (RE), Uterus (UT), Germinal (GERMINAL), Cerebellum (CEREBELLUM), Blood (BLOOD) and Brain (BRAIN).
To construct the graph adjacency matrix we consider two major aspects of our problem. Firstly, we are considering a supervised problem and therefore would like the sub-graphs within the adjacency matrix to agree as much as possible with the known response classes. Secondly, we are analyzing the performance of graph partitioning with decision trees and would therefore also like the sub-graphs to be generated by a tree structure. Fortunately, these two issues can be addressed by using the random forest proximity matrix [1] as the graph adjacency matrix. A random forest is an ensemble of classification tree models where each split within each tree is evaluated from a separate random sample of variables and observations [1]. The trees within a random forest are generated independently and the ensemble classification is performed by a majority vote on the predicted classes of each observation made by each tree. It is well established that by creating a random forest the predictive performance will stabilize and improve when compared to a single decision tree. It has also been found that random forests are also suitable for
feature selection [1] and observation of response class structure through the random forest proximity matrix [11]. The random forest proximity matrix is a graph adjacency matrix where the microarray experiments are the nodes and the edges between them are the number of times any two experiments are placed in the same terminal node over all trees within the random forest. As a random forest proximity matrix is built from an ensemble of trees, it provides an ideal network structure for evaluating the relative performance of the graph cut measures. In this paper separate random forests are created on the training sets for the tumor/normal and for tissue type classification. The random forests are built using the randomForest R package [3, 6] and consist of 500 decision trees where each split is evaluated on a random sample of 31 genes. A heat map of the random forest proximity matrix for both tumor/normal and tissue type classification, reordered by the known classes, is presented in Figure 3. In Figure 3 yellow represents high similarities between the observations within each class and red represents low similarities. It is immediately obvious within Figure 3 that there are two different resolutions within the dataset. Through closer observation of Figure 3 it is clear that the tumor/normal random forest is more accurately classifying the tumor class compared to the normal class. For the tissue type classification we see that the larger tumor classes, CNS, LE and LY, are easily classified but the smaller groups are not as easily found.

Fig. 3. Random forest proximity matrices for training datasets for tumor/normal and tissue type classification (panels: Tumor/Normal Adjacency Matrix, Tissue Type Adjacency Matrix).
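The construction of a proximity matrix from terminal-node memberships can be sketched as below. The input format (one list of terminal-node identifiers per tree), the function name and the toy values are hypothetical, and setting the diagonal to zero so that the graph has no self-edges is an assumption of this sketch rather than a detail given in the text.

```python
def proximity_matrix(leaves):
    """Count, for each pair of observations, the trees that place them in the same leaf.

    leaves[t][i] is the terminal-node identifier of observation i in tree t.
    """
    n = len(leaves[0])
    proximity = [[0] * n for _ in range(n)]
    for tree in leaves:
        for i in range(n):
            for j in range(n):
                # Diagonal left at zero (assumption: no self-edges).
                if i != j and tree[i] == tree[j]:
                    proximity[i][j] += 1
    return proximity


# Hypothetical terminal-node memberships of four observations in three trees.
leaves = [
    [1, 1, 2, 2],
    [1, 2, 2, 2],
    [3, 3, 3, 4],
]

P = proximity_matrix(leaves)
for row in P:
    print(row)
```

The resulting matrix is symmetric by construction, so it can be used directly as the adjacency matrix $S$ of a weighted graph.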
To compare the performance of each graph cut index for decision tree partitioning of the adjacency matrices in Figure 3 we perform 10-fold cross-validation for tree sizes ranging from 1 to 25 partitions or 2 to 26 sub-graphs. To show the sensitivity of each graph cut index at each tree size, the graph cut indices are evaluated on the training set. Furthermore, to assess predictive power of each index the correct classification rates (CCR) of each tree classifying the relevant response for both the
training and test sets are also presented. Over the course of each cross-validation we keep a count of which genes are used to construct the tree. This count is then used as a measure of variable importance. To assess which graph cut index is selecting the most informative genes, we compare our importance measure to the top 1000 "One vs. All" (OVA) features for each tissue type identified by [8].
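The counting scheme used as the importance measure can be sketched as follows. The gene names and the per-fold lists are hypothetical, and counting a gene at most once per fold is an assumption of this sketch.

```python
from collections import Counter


def selection_counts(genes_per_fold):
    """Count in how many cross-validation folds each gene was used by the tree."""
    counts = Counter()
    for genes in genes_per_fold:
        counts.update(set(genes))  # a gene counts at most once per fold
    return counts


# Hypothetical genes used by the tree grown on each of three folds.
folds = [
    ["gene_a", "gene_b", "gene_c"],
    ["gene_a", "gene_c", "gene_c"],
    ["gene_a", "gene_d"],
]

importance = selection_counts(folds)
# Sort by decreasing count, breaking ties alphabetically.
ranking = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

With 10 folds, a count of 10 under this scheme would mean that a gene was used in the tree built on every training set.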
4. Results and Discussion
The performance results for the 10-fold cross-validation are shown in Figure 4. In Figure 4 the left graphs plot the graph cut indices, the middle graphs plot the training set correct classification rate (CCR), and the rightmost graphs plot the test set CCR for each tree size over the course of 10-fold cross-validation. The top row of plots in Figure 4 shows the results for the tumor/normal classification and the bottom row shows the results for the tissue type classification. It should be noted that for comparison each graph cut index has been scaled such that the maximum value is 1, and the error bars are ±1 standard deviation from the mean. Additionally, the best partition for ratio and normalized cut is when the index is minimized, whereas the best partition for modularity is when the index is maximized. For performance comparison purposes, the SVM classifier employed by Ramaswamy et al. (2001) [8] classified the tumor/normal classes at 92% accuracy and the tissue types at 78% accuracy. However, it should be noted that the datasets are not exactly the same, because in our dataset we have divided the normal microarray observations into test/training samples.
From Figure 4 it is immediately obvious that for each index, as the tree is grown the CCR of the known classes on both the training and testing subsets increases. The increasing CCR indicates that each measure is finding the predictive structure within the adjacency matrix. It can also be observed that for tumor/normal classification, as the tree size increases each index converges to the same classification performance; however, for the tissue type classification problem, modularity appears to perform slightly worse. The reduced performance of modularity cut for tissue type classification may be a result of a lack of sensitivity to smaller sub-graphs.
Fig. 4. 10-fold cross-validation results for tumor/normal and tissue type classification models (panels: Tumor/Normal Graph Cut Index, Tumor/Normal Training Set CCR, Tumor/Normal Test Set CCR, Tissue Type Graph Cut Index, Tissue Type Training Set CCR, Tissue Type Test Set CCR).
The trend of the modularity index, however, is more reliable than that for ratio or normalized cut, because it is observed in Figure 4 that after a tree size of 19 the modularity decreases. Interestingly, this decrease in modularity after 19 splits does not seem to affect the classification performance, suggesting that there are no more predictive sub-graphs remaining to be found. Therefore the decrease in modularity after 19 splits indicates that the optimal number of sub-graphs has been reached and any further partitioning is not improving the result. The power to estimate the optimal tree size is not observed in either the ratio or normalized cut indices. The top 10 important genes for each index for both classification problems are presented in Table 3. The important decision tree variables are sorted in decreasing order of importance, with a decision tree rank of 10 indicating that a gene was
selected in the building of each decision tree in all 10 cross-validation training sets. For the Ramaswamy et al. (2001) OVA rank, the lower the value the more important a gene is to a tissue type, with an OVA rank of 1 for the most important and 1000 for the least. The published list can be found in the supplementary materials section of [8]. In Table 3 an examination of the selected genes for both experiments shows a high degree of similarity in the lists for ratio and normalized cut; however, the list for modularity cut appears different. This is seen most clearly for tissue type classification, where 5 of the 10 genes appear in both the ratio and normalized cut lists, but only 1 gene is found to be similar within the modularity cut list. This result is expected, as modularity and both normalized and ratio cut differ considerably in their definition of the structure of a sub-graph.

Comparing the decision tree rankings with the OVA ranking (Table 3), it is seen that for tumor/normal classification the genes selected by all three indices are not well ranked genes in the OVA scheme and seem to span multiple tissue types. For tumor/normal classification the poor OVA rankings and lack of tissue type specificity are expected, as the OVA rank is specific for separating tissue type classes, not for separating tumor/normal classes. For tissue type classification the selected genes for ratio and normalized cut are found to be well ranked genes in the OVA rankings. In particular 3 genes selected by the decision trees, AB006781_s_at, RC_AA176975_s_at and L20688, are found to be the top ranked genes for colorectal, prostate and leukemia tissue types respectively. However, it is observed that modularity cut does not find genes that are well ranked in the OVA scheme. In particular, modularity cut identifies 5 genes in the 10 top ranked genes that do not appear in any OVA ranking for any tissue type. The lack of power for feature selection of modularity cut is surprising given the comparable classification rates in Figure 4. Overall, from Table 3 it appears that both ratio and normalized cut are clearly identifying important genes found to be specific to several tissue types, whereas modularity cut is not as useful for this purpose.

Table 3. Decision tree VIP ranking compared with OVA tissue type ranking, reported separately for tumor/normal classification and tissue type classification. Columns: Graph Cut (Ratio Cut, Normalized Cut, Modularity Cut), Variable Name, Decision Tree Rank and Tissue Type OVA Rank (BL, BR, CNS, CO, LE, LU, LY, ME, ML, OV, PA, PR, RE, UT).

5. Conclusions
Overall this work has shown that using decision trees in combination with graph partition indices is an accurate and informative method for identifying sub-graphs within an adjacency matrix. Our experiments show that both ratio cut and normalized cut appear to be more accurate and informative than the modularity cut.
However, it was observed that the modularity index gave more information on the optimal number of sub-graphs. Future work to assess the performance of each index for decision tree construction would have to consider more datasets with differing network structures. Furthermore, further exploration of the feature selection properties of each index is required, focusing on the effect of surrogate splits and position within the decision tree.
6. Acknowledgements
This work was in part supported by a Japan Society for the Promotion of Science (JSPS) fellowship and the Japan Science and Technology Agency - Institute for Bioinformatics Research and Development (JST-BIRD) project.
References
[1] Breiman, L., Random forests. Mach. Learn., 45:5-32, 2001.
[2] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
[3] Breiman, L., Cutler, A., Liaw, A., and Wiener, M., randomForest: Breiman and Cutler's random forests for classification and regression, 2008.
[4] Hagen, L. and Kahng, A. B., New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074-1085, 1992.
[5] Hancock, T., Multivariate Consensus Trees: Tree-based clustering and profiling for mixed data types. PhD thesis, Mathematics and Statistics Department, James Cook University, 2006.
[6] Ihaka, R. and Gentleman, R., R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.
[7] Karypis, G. and Kumar, V., Multilevel k-way hypergraph partitioning. VLSI Design, 11(3):285-300, 2000.
[8] Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., Poggio, T., Gerald, W., Loda, M., Lander, E. S., and Golub, T. R., Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA, 98(26):15149-15154, December 2001.
[9] Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and Barabasi, A. L., Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551-1555, August 2002.
[10] Shi, J. and Malik, J., Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[11] Shi, J. and Horvath, S., Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(38):118-138, 2006.
[12] Shiga, M., Takigawa, T., and Mamitsuka, H., A spectral clustering approach to optimally combining numerical vectors with a modular network. Proceedings of the 13th ACM SIGKDD, pages 647-656, 2007.
[13] Smyth, C., Coomans, D., Everingham, Y., and Hancock, T., Auto-associative multivariate regression trees for cluster analysis. Chemometrics and Intelligent Laboratory Systems, 80:120-129, 2005.
MEASURING CORRELATIONS IN METABOLOMIC NETWORKS WITH MUTUAL INFORMATION
OLIVER EBENHÖH 2
[email protected]
JORGE NUMATA 1
[email protected]
ERNST-WALTER KNAPP 1
[email protected]
1 Macromolecular Modeling Group, Freie Universität Berlin, Takustr. 6, Berlin, 14195 Germany
2 Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, Potsdam-Golm, 14476 Germany
Non-linear correlations based on mutual information are evaluated to measure statistical dependencies among data points measured from metabolism in two dimensional space. While the Pearson correlation coefficient is only rigorously applicable to characterize strictly linear correlations with Gaussian noise, the mutual information coefficient is more generally valid. Here, we use recent distribution-free (non-parametric) mutual information estimators based on k-nearest neighbor distances. The mutual information algorithm of Kraskov et al. is found to yield estimates with low systematic and statistical error. The significance of the different methods is probed for artificial sets of tens to hundreds of data points, a size currently typical for metabolomic data. We analyze experimental data on metabolite concentrations from Arabidopsis thaliana by using these procedures. The mutual information was able to detect additional non-linear correlations undetectable by the Pearson coefficient.
Keywords: statistical correlation; Pearson coefficient; non-linear correlation; mutual information; k-nearest neighbor entropy; metabolomics; Arabidopsis thaliana.
1. Introduction: Linear and Non-linear Correlation Measures
1.1. Correlations of metabolite concentration data
Metabolomics is a crucial tool in systems biology, since it allows insight into the phenotypic result of gene expression. Metabolites show coupled changes in concentration, both under the influence of genomic and stress perturbations, and as part of the intrinsic variability of a biological network. The meaning of these correlations in terms of biochemical network topology and gene expression remains to be fully elucidated [1-3]. One important stumbling block is summarized in the saying "correlation is not causation", which equally applies to non-linear correlations. Steuer et al. observed that large linear (Pearson) correlation coefficients often do not coincide with metabolite pairs that are neighbors in the biochemical network [4]. Here, we follow a more modest goal of providing suitable measures for statistical correlations among metabolite concentrations and testing their significance. The method is also able to detect non-linear correlations not accessible to the linear Pearson coefficient $r_C$.
Statistical correlation measures based on mutual information are able to capture more features of the data than the linear Pearson correlation coefficient. At the same time, they demand larger data sets than the Pearson coefficient to be significant. Here, we test recent developments in non-parametric methods for entropy estimation [5-7] to provide a general, non-linear measure of statistical dependencies.
1.2. Advantages of mutual information as a measure of correlation
Mutual information is a non-linear measure of statistical dependence based on information theory [8]. It has advantages over other methods,
• since it requires no decomposition of the data into modes, so there is no need to assume additivity of the original variables, as is done in Principal Component Analysis (PCA) and Independent Component Analysis (ICA) [9].
• since it makes no assumptions about the functional form (Gaussian or non-Gaussian) of the statistical distribution that produced the data. Hence, it is a non-parametric method.
A numerical implementation based on k-nearest neighbor distances [5-7, 10] is more attractive than other methods to estimate mutual information [11],
• since it requires no binning to generate histograms.
• since it consumes less computational resources and its parameters are easier to tune than for kernel density estimators.
It is a common practice to normalize the data to zero mean and unit variance using a linear transformation, which has no effect on the Pearson correlation coefficient. Linear transformations are smooth and uniquely invertible maps, as are the more general homeomorphic (non-linear) transformations. Mutual information for pairs of variables is not altered by general homeomorphic transformations of the data [5, 12]. These properties are important because metabolomic data rarely yield absolute concentrations, but rather ratios of concentrations [2].
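The statement that normalization to zero mean and unit variance leaves the Pearson coefficient unchanged can be checked numerically with a short sketch. The data values and function names below are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def standardize(values):
    """Linear transformation to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Hypothetical relative concentrations of two metabolites (arbitrary units).
a = [1.2, 0.8, 1.9, 2.4, 1.1, 3.0]
b = [10.5, 9.1, 14.0, 15.2, 11.3, 16.8]

r_raw = pearson(a, b)
r_std = pearson(standardize(a), standardize(b))
r_scaled = pearson([100.0 * v for v in a], b)  # change of units only
print(r_raw, r_std, r_scaled)
```

All three values agree up to rounding, which is why data reported only as concentration ratios can still be compared with this coefficient.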
1.3. Entropy, mutual information and statistical (in)dependence
We will employ the usual symbol for entropy from information theory ($H$) instead of the thermodynamic notation ($S$). All logarithms (ln) refer to base $e$, so that entropy and mutual information are measured in nats. To convert to bits, divide by $\ln 2$. We are interested in testing the correlation between two random variables $x_i$ and $x_j$, which have marginal probability densities $p_i(x_i)$ and $p_j(x_j)$ and a joint probability density $p_{(i,j)}(x_i, x_j)$. In the current non-parametric approach, no particular functional form for probability densities is assumed. The corresponding differential entropies (for continuous variables) are:

$$H_i = -\int p_i(x_i) \ln p_i(x_i)\, dx_i, \qquad H_{(i,j)} = -\iint p_{(i,j)}(x_i, x_j) \ln p_{(i,j)}(x_i, x_j)\, dx_i\, dx_j. \tag{1}$$

The mutual information $I_{(i,j)}$ shared by $x_i$ and $x_j$ is

$$I_{(i,j)} = \iint p_{(i,j)}(x_i, x_j) \ln\left[ \frac{p_{(i,j)}(x_i, x_j)}{p_i(x_i)\, p_j(x_j)} \right] dx_i\, dx_j. \tag{2}$$

Two variables $x_i$ and $x_j$ are statistically independent if and only if the joint probability density equals the product of the marginal densities, $p_{(i,j)}(x_i, x_j) = p_i(x_i)\, p_j(x_j)$, since in that case the argument of the logarithm in Eq. 2 is unity and the mutual information vanishes. The mutual information $I_{(i,j)}$ can also be written [8] as:

$$I_{(i,j)} = H_i + H_j - H_{(i,j)}. \tag{3}$$

If the variables $x_i$ and $x_j$ are correlated, $I_{(i,j)}$ will take a positive value up to $\min(H_i, H_j)$. We employ a more intuitive non-linear correlation coefficient $r_I$ [6, 13, 14] that assumes values in the interval (0,1) for correlated variables. This coefficient $r_I$ is a measure of the generalized statistical dependence between two variables. For strict correlation of the variables $x_i$ and $x_j$ (e.g., $x_i = x_j$ or $x_i = -x_j$), $r_I$ adopts the maximum value of +1; in the absence of correlation $r_I$ vanishes. Although the exact value of $I_{(i,j)}$ cannot become negative, approximate evaluations can. Therefore, we propose here a modification of the coefficient $r_I$ to allow also for negative values (-1, 0) that can quantify possible numerical errors in estimating mutual information

(4)

Note that negative values of $r_I$ should not be interpreted as anti-correlations, since $I_{(i,j)}$ adopts also positive values in that case. In contrast to mutual information, the Pearson correlation coefficient quantifies exclusively linear correlations, and is actually given as a normalized covariance

(5)

where

(6)

Since the Pearson correlation coefficient is based on quadratic forms, it is relatively sensitive to outliers. Negative values of $r_C$ for two variables denote anti-correlation (appearing as negative slope in a linear fit). In the numerical implementation, a value of $r_C = 0$ is assigned to cases where one of the variances in the denominator of Eq. (5) vanishes. A non-vanishing $r_C$ means that a linear fit can describe the correlation between $x_i$ and $x_j$ approximately. Similarly, a positive non-linear coefficient $r_I$ means that the variables $x_i$ and $x_j$ are correlated, and a non-linear fit could describe this relationship.
This is a very general statement and does not imply any particular functional form (such as a quadratic polynomial).
1.4. Numerical methods: k-nearest neighbor entropy and Kraskov mutual information
We employ two different methods to estimate mutual information $I_{(i,j)}$ and its correlation coefficient $r_I$ from Eq. 4. One method uses the k-nearest neighbor entropy [10] introduced by Hnizdo et al. [6, 7], which estimates $H_i$, $H_j$ and $H_{(i,j)}$ individually and then calculates the mutual information $I_{(i,j)}$ from Eq. 3, yielding the coefficient $r_I^{kNN}$ from Eq. 4. The second method estimates the mutual information $I_{(i,j)}$ in a more direct way using the Kraskov et al. algorithm [5], which is also based on a nearest-neighbor approach. The more direct estimate is advantageous, since it avoids accumulation of systematic biases inherent in the terms $H_i$, $H_j$ and $H_{(i,j)}$ when using Eq. 3 for $I_{(i,j)}$. For both algorithms, $r_I^{kNN}$ and $r_I^{Kras}$, the only adjustable parameter is the number of neighbors. We employ the k = 6th nearest neighbor, which proved to be a good compromise between systematic and statistical errors (data not shown).
2. Application to Constructed Data
2.1. Non-linear correlations are captured by mutual information
The non-linear correlation coefficient based on mutual information is able to detect additional correlations invisible to the linear Pearson coefficient. Cases A1-A7 in Fig. 1 show comparable performance for both coefficients in linear cases. However, the non-linear nature of the correlation between the variables in cases B1-B6 causes r^PC to vanish. Visually it is obvious that a relationship exists, and this is quantified by r_I^Kras.
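The contrast between the two coefficients can be reproduced on a small scale with the following sketch. It implements a brute-force version of the first nearest-neighbor estimator of Kraskov et al. [5] with k = 6 and compares it with the Pearson coefficient on a noisy parabola. The function names, the sample size of 300 points, the noise level and the seed are assumptions made for illustration. The conversion to a coefficient assumes the form r_I = sqrt(1 - exp(-2 I)), which equals |r^PC| for bivariate Gaussian data; whether this matches Eq. 4 exactly, and the clipping of negative estimates to zero, are assumptions of the sketch.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_table(n_max):
    """Return psi(1..n_max) for integer arguments; index 0 is unused."""
    psi = [0.0, -EULER_GAMMA]
    for n in range(2, n_max + 1):
        psi.append(psi[-1] + 1.0 / (n - 1))  # psi(n) = psi(n-1) + 1/(n-1)
    return psi


def pearson(x, y):
    """Pearson coefficient; returns 0 when one of the variances vanishes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def ksg_mutual_information(x, y, k=6):
    """Nearest-neighbor estimate of I(X;Y) in nats (O(N^2) neighbor search)."""
    n = len(x)
    psi = digamma_table(n + 1)
    acc = 0.0
    for i in range(n):
        # Max-norm distances from point i to all other points in the joint space.
        dists = sorted(
            max(abs(x[i] - x[j]), abs(y[i] - y[j])) for j in range(n) if j != i
        )
        eps = dists[k - 1]  # distance to the k-th nearest neighbor
        # Marginal neighbors strictly closer than eps.
        nx = sum(1 for j in range(n) if j != i and abs(x[i] - x[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(y[i] - y[j]) < eps)
        acc += psi[nx + 1] + psi[ny + 1]
    return psi[k] + psi[n] - acc / n


def mi_to_coefficient(mi):
    """Assumed conversion r_I = sqrt(1 - exp(-2 I)); negative I is clipped to 0."""
    return math.sqrt(1.0 - math.exp(-2.0 * max(mi, 0.0)))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1.0, 1.0) for _ in range(300)]
    y = [a * a + rng.gauss(0.0, 0.05) for a in x]  # non-linear relationship
    print("r_PC     =", round(pearson(x, y), 3))
    print("r_I_Kras =", round(mi_to_coefficient(ksg_mutual_information(x, y)), 3))
```

The brute-force neighbor search keeps the sketch free of dependencies; for sample sizes such as the 10^5 points of Fig. 1 a spatial index would be needed.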
[Figure 1: 14 scatter-plot panels in two rows (A1-A7 and B1-B7), each annotated with r^PC and r_I^Kras. Panel annotations (r^PC, r_I^Kras) in the order in which they appear: (1.0, 1.0), (0.01, 0.89), (0.80, 0.80), (0.02, 0.63), (0.38, 0.38), (0.00, 0.06), (0.00, 0.67), (0.00, 0.79), (-0.38, 0.38), (0.00, 0.92), (-0.80, 0.80), (0.00, 0.81), (-1.0, 1.0), (0.01, 0.00).]
Figure 1: Comparison of the performance of the Pearson (linear) correlation coefficient r^PC and the non-linear measure r_I^Kras (an implementation of r_I) based on mutual information. Each of the 14 cases is an artificial example showing a different functional relationship between the variables X_i and X_j. The artificial data sets are large: N_size = 10^5 points, some of them with Gaussian noise. The first row (A1-A7) represents linear correlations, and for A4 a lack thereof. Except for the sign, r_I^Kras and r^PC perform comparably when the correlation is linear. The second row (B1-B7) displays non-linear cases where r^PC fails to detect any correlation, while r_I^Kras can quantify it. Case B7 is shown to be uncorrelated by both coefficients.
J. Numata, O. Ebenhöh & E.-W. Knapp
Note that anti-correlation (negative r^PC) in A5-A7 is shown simply as statistical dependence in r_I^Kras, which is strictly never negative in the absence of numerical errors. In any case, the concept of anti-correlation is not applicable to relations with changing slope such as B4 or B6 of Fig. 1. Furthermore, anti-correlation loses its meaning even for linear relations in more than two variables [14].
2.2. Significance of the coefficients given the limited sample size
Metabolomics and gene expression experiments currently yield tens to hundreds of data points. To probe the significance of the correlation coefficients among pairs of variables for such sample sizes, we numerically tested the artificial examples A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation undetectable by r^PC). From Fig. 2, we observe that reliable detection of correlations requires sample sizes N_size > 40. One important lesson from Fig. 2 is that weak correlations corresponding to small correlation coefficients cannot be discriminated from background noise, in agreement with the findings of Selbig et al. [11]. There is a gray area in the region 0.550 < r_I^Kras < 0.665 for N_size = 43, where a more thorough statistical analysis could be made [15]. In this work, we err on the safe side by considering only large correlations.
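The resampling procedure behind such empirical intervals can be sketched as follows. The example is restricted to the Pearson coefficient for uncorrelated Gaussian variables (the analogue of case A4) at N_size = 43. The function names, the seed and the choice of Gaussian variables are assumptions of the sketch, not details taken from the study.

```python
import math
import random


def pearson(x, y):
    """Pearson coefficient; returns 0 when one of the variances vanishes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def empirical_interval(n_size, n_samples=2000, level=0.95, seed=1):
    """Empirical central interval of r_PC for independent Gaussian variables."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_size)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_size)]
        values.append(pearson(x, y))
    values.sort()
    # Cut off the same number of samples in each tail.
    tail = int((1.0 - level) / 2.0 * n_samples)
    return values[tail], values[n_samples - 1 - tail]


if __name__ == "__main__":
    for n_size in (10, 43, 100, 400):
        low, high = empirical_interval(n_size)
        print(n_size, round(low, 3), round(high, 3))
```

The interval narrows as N_size grows, which is the effect that makes weak correlations indistinguishable from noise at small sample sizes.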
[Figure 2: grid of plots for the cases A2, A4 and B4, with one column for each of the coefficients r^PC, r_I^kNN and r_I^Kras; horizontal axis: sample size N_size (tick marks at 10, 40, 100 and 400); vertical axis: value of the coefficient.]
Figure 2: For cases A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation), all three correlation coefficients r^PC, r_I^kNN and r_I^Kras (vertical axis) were estimated as a function of the sample size N_size (horizontal axis). The error bars show empirical 95% confidence intervals (p = 0.05). They represent the observed, sometimes asymmetric variation around the mean for 2000 samples of N_size data points each. Among the non-linear methods, r_I^Kras shows smaller statistical and systematic errors than r_I^kNN. Negative values of r^PC denote anti-correlations. Negative values of r_I^kNN and r_I^Kras denote numerical errors in the estimation of mutual information.
The method of Hnizdo et al. [6, 7] yielding r_I^kNN is based on a nearest-neighbor entropy estimator. As suggested by Kraskov et al. [5], the systematic bias in the individual entropy estimates of 1-D and 2-D samples will not necessarily cancel out in Eq. 3. In our numerical experiment (Fig. 2, column r_I^kNN), a negative systematic bias is evident for N_size < 58 and a positive bias for N_size > 58, with a larger spread of values and frequent occurrences of negative r_I^kNN values, which are traces of numerical errors. Nevertheless, the kNN method is still useful for very large sample sizes N_size > 1000 and when the 1-D entropy of each variable is of interest [10]. Computing r_I^Kras to obtain the correlation is better suited than r_I^kNN for our small sample sizes, in particular if we are interested in the mutual information I(i,j) and not in the 1-D entropies. Of the computational methods compared, r_I^Kras shows less systematic bias and lower variability (statistical error). Negative values of I(i,j) were only found for small to medium correlation values r_I^Kras < 0.665 or very small sample sizes N_size < 40, where the presence of correlation is difficult to detect.
[Figure 3: scatter plots of |r^PC| against r_I^Kras for the test cases A2, A4 and B4; recoverable panel labels: "A4: uncorrelated", "false negatives", "true positives", with the percentages 0.02% and 99.8%.]
Figure 3: For the test cases A2, A4 and B4 (see Fig. 1 for more details), the absolute value of the Pearson coefficient |r^PC| is plotted against the mutual information coefficient r_I^Kras for 2·10^4 samples. Samples yielding points (+) outside the rectangle show significant correlations for both coefficients, while the points (x) from samples inside the rectangle do not. The rectangle marks the cutoff values |r^PC| = 0.545 and r_I^Kras = 0.665, which were chosen to minimize detection of false positives where no correlations are present. For linear correlation (A2) both coefficients provide similar information. But the Pearson coefficient r^PC is not able to detect the non-linear correlations in B4 and reports values similar to the uncorrelated case A4. Negative values of r_I^Kras denote numerical errors in the estimation of mutual information.
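The entropy-based route to mutual information discussed above, I(i,j) = H(i) + H(j) - H(i,j), can be sketched with a generic nearest-neighbor entropy estimator of the Kozachenko-Leonenko type, written here in the max-norm form given by Kraskov et al. [5]. It is not claimed to reproduce the exact estimator of Hnizdo et al. [6, 7]; the function names and test data are assumptions, and the sketch presupposes continuous data without duplicate points.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function psi(n) for a positive integer n."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(points, k=6):
    """Nearest-neighbor entropy estimate in nats for a list of d-tuples.

    Uses H = psi(N) - psi(k) + (d / N) * sum_i log(2 * r_i), where r_i is the
    max-norm distance from point i to its k-th nearest neighbor.
    """
    n = len(points)
    d = len(points[0])
    log_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(
            max(abs(a - b) for a, b in zip(p, q))
            for j, q in enumerate(points)
            if j != i
        )
        log_sum += math.log(2.0 * dists[k - 1])
    return digamma_int(n) - digamma_int(k) + d * log_sum / n


def mutual_information_from_entropies(x, y, k=6):
    """I = H(x) + H(y) - H(x, y); the three estimation errors need not cancel."""
    hx = knn_entropy([(a,) for a in x], k)
    hy = knn_entropy([(b,) for b in y], k)
    hxy = knn_entropy(list(zip(x, y)), k)
    return hx + hy - hxy


if __name__ == "__main__":
    rng = random.Random(3)
    x = [rng.gauss(0.0, 1.0) for _ in range(400)]
    print("H estimate:", round(knn_entropy([(a,) for a in x]), 3))
    print("H exact   :", round(0.5 * math.log(2.0 * math.pi * math.e), 3))
```

Because the three entropies are estimated separately, their individual biases add up in the difference, which is the motivation given in the text for preferring the direct estimate at small N_size.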
The cutoff values r^PC = 0.545 and r_I^Kras = 0.665 were chosen from Fig. 3 to minimize detection of false positives in the absence of correlation (case A4) for N_size = 43. With these conditions, we obtain three false positives for 2·10^4 samples using r_I^Kras (see Fig. 3, middle part), corresponding to 0.015% of all samples and a concomitant p value of p = 3/(2·10^4) = 0.00015. In the following we will deal with data of sample size N_size = 43, comparing 16290 pairs of metabolic variables. At the same time, we expect to detect 99.8% of linearly highly correlated pairs, but only 28.5% of the non-linear ones (see Fig. 3). This is because the sample size of 43 data points limits reliable detection to large values of non-linear correlation. In an analogous numerical simulation with 2·10^4 samples using the larger sample size N_size = 700, non-linear correlations with r_I^Kras > 0.75 could be completely separated from data obtained in the absence of correlation, r_I^Kras = 0 (data not shown). Even for small N_size = 43, applying both methods enriches detection of correlations in comparison to the usage of only r^PC.
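The arithmetic behind these numbers can be checked in a few lines. The sketch below recomputes the empirical false positive rate and, as an added illustration, the number of false positives one would expect when that rate is applied to all 16290 pairs. The variable names are hypothetical, and the reading of 16290 as the number of unordered pairs among 181 variables is an inference from the number itself, not a statement from the text.

```python
from math import comb

false_positives = 3          # false positives observed for case A4
n_samples = 2 * 10**4        # number of simulated samples
n_pairs = 16290              # pairs of metabolic variables compared

# Empirical false positive rate for the r_I^Kras cutoff.
p_value = false_positives / n_samples

# Expected number of false positives if every pair were uncorrelated.
expected_false_positives = p_value * n_pairs

# Inference (assumption): 16290 equals the number of unordered pairs
# that can be formed from 181 variables.
n_variables = 181
pairs_from_variables = comb(n_variables, 2)

if __name__ == "__main__":
    print("p value:", p_value)
    print("expected false positives:", round(expected_false_positives, 2))
    print("pairs among", n_variables, "variables:", pairs_from_variables)
```

Under these assumptions only two to three spurious pairs would be expected in the whole data set, which is consistent with the stated aim of finding few but highly significant correlations.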
[...] 0.545: linear (large r_I^Kras) (Fig. 6); r^PC > 0.545, r_I^Kras < 0.665: linear (small r_I^Kras) (Fig. 7); r^PC [...] 0.656, these plots exhibit
[Figure 6: scatter plots of metabolite pairs; recoverable panel titles: r^PC = 0.63, r_I^Kras = 0.70; r^PC = 0.69, r_I^Kras = 0.68; recoverable axis labels: glucose 6-phosphate, leucine, fucose, galactinol.]
Figure 6: Examples where both correlation coefficients r^PC and r_I^Kras indicate significant correlation (r^PC > 0.545, r_I^Kras > 0.656).
[Figure 7: scatter plots of metabolite pairs; recoverable panel titles: r^PC = 0.66, r_I^Kras = 0.60; r^PC = 0.67, r_I^Kras = 0.58; r^PC = 0.64, r_I^Kras = 0.56; r^PC = 0.59, r_I^Kras = 0.26; recoverable axis labels: succinic acid, threonic acid, glyceric acid, 2,4 hydroxybutyric acid.]
Figure 7: Examples for the case where only the Pearson coefficient was significant (r^PC > 0.545) but the non-linear coefficient was not (r_I^Kras < 0.656).
[Figure 8: scatter plots of metabolite pairs; recoverable panel titles: r^PC = 0.39, r_I^Kras = 0.34; r^PC = 0.10, r_I^Kras = 0.10; r^PC = -0.06, r_I^Kras = -0.41; r^PC = 0.17, r_I^Kras = 0.25; recoverable axis labels: glucose 6-phosphate, citric acid, cellobiose, xylitol.]
Figure 8: The above examples show likely uncorrelated pairs of metabolites, where the limited number of data points does not allow a clearer classification.
Figs. 4 and 5 present correlations which were only detectable as significant by the mutual information coefficient r_I^Kras, but were invisible to the Pearson correlation coefficient r^PC. In Fig. 4, the reason is the presence of outliers. The examples in Fig. 5, in addition to correlation, also present differences among plant lines, which cluster in different concentration regimes. Three of the plots in Fig. 5 involve cellobiose, which in another study [16] using a larger data set was found to be the largest contributor to phenotypic variations. The metabolic data analyzed in the present study are a subset of these data. In particular, the metabolic data with cellobiose are in an experimentally trustworthy concentration regime, where correlations are likely not caused by experimental error. Fig. 6 shows examples where both correlation coefficients r_I^Kras and r^PC adopt values that indicate significant correlation. The first plot corresponds to a large correlation found in a variety of studies [1]. The metabolites glucose 6-phosphate and fructose 6-phosphate are directly connected in the biochemical network by the enzyme EC 5.3.1.9 [17]. In the second plot, both metabolites are hydrophobic amino acids. But the chemical nature of the metabolites is seemingly unrelated in the third and fourth plots. In Fig. 7, we illustrate metabolite pairs where only the Pearson correlation coefficient, r^PC, points to significant correlation. Most such cases yield an intermediate value for r_I^Kras, in the "gray area" that does not allow clear discrimination. The last plot in Fig. 7 is a rare example where r_I^Kras is particularly small. Lastly, Fig. 8 shows either uncorrelated cases, or cases where the coefficients were not able to detect correlation reliably. The second plot shows two chemically related metabolites, which nevertheless show no correlation. The third plot shows a separate cluster for plant line Col-0, but no correlation. In the last plot some correlation seems to appear, but the correlation coefficients are too small to be significant.
4. Conclusion
There are two major advantages in using the mutual information coefficient. The first one is the discovery of additional correlations invisible to the Pearson coefficient, frequently because of the presence of outliers (see Fig. 4). The second advantage is the detection of correlation even if plant lines cluster in different concentration ranges. Although a cluster analysis would be able to detect these differences in concentration regimes, the present
method allows concurrent detection of correlation. For example, cellobiose displays a consistently lower concentration range when compared to galactinol for plant line Col-0, but not for the other three plant lines. Simultaneously, the two metabolites were found to be correlated by the mutual information coefficient, but not by the Pearson coefficient (Fig. 5). In this work, the emphasis was on discovering few but highly significant correlations, with a small risk of false classifications even for small sample sizes of N_size = 43. However, it should be noted that larger sample sizes of a few hundred data points would also allow smaller correlations to be detected.
Acknowledgments
This work was supported by the International Research Training Group "Genomics and Systems Biology of Molecular Networks" (GRK1360 of the DFG). We would like to thank Dr. Matthias Steinfath and Dr. Jan Lisec for useful discussions and for sharing their experimental data [16].
References
[1] Steuer, R., On the analysis and interpretation of correlations in metabolomic data. Briefings in Bioinformatics. 7(2): 151-158, 2006.
[2] Camacho, D., A. de la Fuente, and P. Mendes, The origin of correlations in metabolomics data. Metabolomics. 1(1): 53-63, 2005.
[3] Müller-Linow, M., W. Weckwerth, and M.-T. Hütt, Consistency analysis of metabolic correlation networks. BMC Systems Biology. 1(44), 2007.
[4] Steuer, R., et al., Observing and interpreting correlations in metabolomic networks. Bioinformatics. 19(8): 1019-1026, 2003.
[5] Kraskov, A., H. Stögbauer, and P. Grassberger, Estimating mutual information. Phys. Rev. E. 69: 066138, 2004.
[6] Hnizdo, V., et al., Nearest neighbor estimates of entropy. American J of Math and Manag Sciences. 23: 301-321, 2003.
[7] Hnizdo, V., et al., Nearest-Neighbor Nonparametric Method for Estimating the Configurational Entropy of Complex Molecules. J Comput Chem. 28(3): 655-668, 2007.
[8] Cover, T.M. and J.A. Thomas, Elements of Information Theory. 2nd ed. Wiley Series in Telecommunications, ed. D.L. Schilling. 2006.
[9] Steinfath, M., et al., Metabolite profile analysis: from raw data to regression and classification. Physiologia Plantarum. 132: 150-161, 2008.
[10] Numata, J., M. Wan, and E.W. Knapp, Conformational Entropy of Biomolecules: Beyond the Quasi-Harmonic Approximation. Genome Informatics. 18: 192, 2007.
[11] Steuer, R., et al., The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics. 18 Suppl. 2: S231-S240, 2002.
[12] Matsuda, H., Physical nature of higher-order mutual information: Intrinsic correlations and frustration. Phys Rev E. 3: 3096-3102, 2000.
[13] Dionisio, A., R. Menezes, and D.A. Mendes, Mutual information: a measure of dependency for nonlinear time series. Physica A: Statistical Mechanics and its Applications. 344(1-2): 326-329, 2004.
[14] Lange, O.F. and H. Grubmüller, Generalized Correlation for Biomolecular Dynamics. Proteins: Structure, Function, and Bioinformatics. 62: 1053-1061, 2006.
[15] Storey, J.D., The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value. The Annals of Statistics. 31(6): 2013-2035, 2003.
[16] Lisec, J., et al., Identification of metabolic and biomass QTL in Arabidopsis thaliana in a parallel analysis of RIL and IL populations. The Plant Journal. 53: 960-972, 2008.
[17] Mueller, L.A., P. Zhang, and S.Y. Rhee, AraCyc: A Biochemical Pathway Database for Arabidopsis. Plant Physiology. 132: 453-460, 2003.
OPTIMALITY CRITERIA FOR THE PREDICTION OF METABOLIC FLUXES IN YEAST MUTANTS
EVAN S. SNITKIN 1
DANIEL SEGRE 1,2
1 Graduate Program in Bioinformatics, Boston University, 44 Cummington St., Boston, Massachusetts, 02215, USA
2 Departments of Biology and Biomedical Engineering, Boston University, 24 Cummington St., Boston, Massachusetts, 02215, USA
Constraint-based models of cellular metabolism, such as flux balance analysis (FBA), use convex analysis and optimization to study metabolic networks at a genome scale. The availability of reaction lists for numerous organisms, along with a variety of network analysis and optimization tools, is making these approaches increasingly popular for metabolic engineering and biomedical applications, as well as for addressing fundamental biological questions. It is therefore very important to assess the predictive capacity of these models and to understand how to interpret them in a biologically relevant manner. Typically, model assessment is limited to gauging the ability to predict phenotypes, such as viability under different environmental and genetic conditions. These types of assessments, for the most part, focus only on the growth phenotype of the cells, but ignore the underlying flux predictions. While this may be sufficient for certain types of study, the question of whether flux balance models can reliably predict intracellular and transport fluxes is crucial for more detailed analysis, and remains largely unanswered. Here we compare FBA model predictions of yeast metabolic fluxes to a previously published set of experimentally determined fluxes for 13 different single gene deletion mutants across a variety of possible objective functions. We find that the specific optimization criteria used to determine fluxes have a significant impact on the accuracy of the predicted fluxes. Interestingly, while different optimization methods provide very different levels of agreement relative to experimental fluxes, they tend to provide similar predictions with respect to the effect of the perturbation on growth.
This demonstrates that assessment of models at the level of flux predictions is a critical step in assessing the biological validity of different models and optimization criteria.
Keywords: flux balance analysis; gene deletion; optimality criteria; flux measurements
1. Introduction
A century of detailed biochemical studies, in conjunction with the genomic revolution, has culminated in the release of metabolic reconstructions for a number of model organisms. These metabolic reconstructions comprise the stoichiometries of all known enzymatic reactions in a given organism. In addition to enabling the study of metabolic networks in diverse organisms [19], these reconstructions have yielded the ability to create genome-scale predictive models by using the steady state framework of flux balance analysis [12]. Flux balance models have been released for a number of bacterial organisms such as E. coli [7] and H. pylori [14], and more recently also for the eukaryotes yeast [9] and human [5]. With the ability to generate models largely from sequence data, it should be expected that the pace of model development will only increase in the coming months and years.
Along with the increase in model availability has come a widening of the spectrum of reported applications of flux balance models. Recent work has demonstrated the use of flux balance models to address cutting edge research questions ranging from understanding the dynamics of microbial communities [17] to predicting perturbations required to fulfill complex metabolic engineering objectives [2]. These various applications of flux balance models often require different levels of predictive abilities from the models. For instance, for some applications, being able to accurately capture the range of possible metabolic behaviors of an organism is sufficient [3], while for others the ability to predict the precise metabolic state resulting from specific perturbations is required [2]. Given that different model applications may require different levels of predictive proficiency, it is important to be able to evaluate the appropriateness of models for addressing different research questions. A common method for evaluating models is to quantify their ability to predict the effects of environmental and genetic perturbations on growth rate. The attractiveness of this approach for model evaluation largely stems from the availability of high-throughput growth phenotype data for many organisms, in addition to the ease with which the effects of environmental and genetic perturbations on growth can be determined using these models. While such assessments evaluate model behavior in response to diverse perturbations, the assessments are typically limited to growth phenotype. An open question is how a model's ability to predict the growth phenotypes under a variety of conditions translates into its ability to predict the fluxes underlying the growth predictions.
Here, we utilized a compendium of experimentally determined fluxes for yeast single gene deletion mutants [1] to gain insight into the ability of yeast flux balance models to predict central carbon metabolic fluxes in response to perturbations. In addition to assessing the relationship between predictions of growth phenotypes and predictions of the underlying fluxes, we also compared the ability of different objective functions to predict the metabolic response to genetic perturbations. Through this analysis we hoped not only to assess the predictive abilities of flux balance models at the level of flux predictions, but also to understand what drives the metabolic response to genetic perturbations. Our results support previous studies which suggested that the metabolic response to genetic perturbations is best described as a minimal rerouting of fluxes around the perturbation. Despite the clear superiority of an objective function implementing minimal flux rerouting to predict mutant fluxes, all tested objective functions correctly predicted the growth phenotype for all 13 mutants considered. This suggests that correct predictions of growth phenotype do not necessarily imply an accurate prediction of the underlying fluxes.
2. Methods
2.1. Experimental flux data
All experimentally measured fluxes and uptake/secretion rates were taken from the supplementary material of the 2005 manuscript by Blank et al. [1]. Among the 38 single gene deletion mutants for which fluxes were measured, we focused on 13 for which the deleted gene did not have any duplicates. The reason for this is that gene duplicates are
implemented in a trivial manner in flux balance models, unless regulation is explicitly taken into account. In a typical flux balance calculation duplicate genes completely back one another up under all conditions.
2.2. Flux Balance Analysis
Flux balance analysis is a linear constraint-based modeling approach which has been described in detail elsewhere [6]. Briefly, flux balance analysis consists of two critical steps: (1) the imposition of linear constraints on fluxes, stemming from the assumption of steady state, and (2) an optimization step by which a particular set of fluxes fulfilling the given constraints is selected. These linear constraints limit the feasible flux solutions to those which result in no net production or consumption of any metabolite. These steady state constraints can be described by the nullspace of the m x n stoichiometric matrix S. The columns of S represent the n reactions, and its rows the m different metabolites. An entry S_ij represents the stoichiometric coefficient of metabolite i in reaction j. In addition to the steady state constraints, additional linear constraints are imposed to set upper and lower bounds on individual fluxes (a_j ≤ v_j ≤ b_j). These constraints can be applied to fix maintenance requirements, restrict reversibility of reactions and set limits on nutrient uptake rates. The previously released iLL672 yeast metabolic reconstruction was used for all analyses [13]. Constraints on uptake rates were imposed to mimic the minimal glucose conditions under which the experimental fluxes used here were determined. Gene deletions were implemented in the model by setting the flux to zero for all reactions requiring the protein product of the deleted gene.
2.3. Objective functions to predict mutant fluxes
While the imposition of the linear constraints mentioned above restricts the space of possible metabolic behaviors, there are still potentially an infinite number of flux states which can fulfill the given constraints. To select a particular flux state, which can in turn be compared to the experimentally measured fluxes, one typically maximizes or minimizes a linear combination of fluxes, based on a biologically relevant criterion. Here we evaluated the flux predictions made using several different criteria. A summary of the different objective functions and the motivation for testing them can be found in Table 1.
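The constraints described in Section 2.2 can be made concrete with a toy example. The sketch below checks the steady state condition S v = 0 and the flux bounds for candidate flux vectors, and implements a gene deletion by forcing the bounds of the affected reactions to zero. The four-reaction network, the gene names g1 and g2 and all function names are hypothetical and unrelated to the iLL672 reconstruction; the optimization step itself, which requires a linear programming solver, is deliberately left out.

```python
def mat_vec(S, v):
    """Multiply the stoichiometric matrix S (list of rows) by a flux vector v."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True if no metabolite is produced or consumed on net (S v = 0)."""
    return all(abs(r) <= tol for r in mat_vec(S, v))


def within_bounds(v, bounds, tol=1e-9):
    """True if every flux satisfies a_j <= v_j <= b_j."""
    return all(lo - tol <= f <= hi + tol for f, (lo, hi) in zip(v, bounds))


def knock_out(bounds, gene_to_reactions, gene):
    """Return new bounds with all reactions requiring `gene` forced to zero."""
    new_bounds = list(bounds)
    for j in gene_to_reactions.get(gene, []):
        new_bounds[j] = (0.0, 0.0)
    return new_bounds


# Hypothetical toy network. Rows: metabolites A and B.
# Columns: uptake (-> A), route 1 (A -> B, gene g1),
#          route 2 (A -> B, gene g2), growth drain (B ->).
S = [
    [1, -1, -1, 0],
    [0, 1, 1, -1],
]
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]
gene_to_reactions = {"g1": [1], "g2": [2]}

v_wildtype = [10.0, 6.0, 4.0, 10.0]
mutant_bounds = knock_out(bounds, gene_to_reactions, "g1")
v_rerouted = [10.0, 0.0, 10.0, 10.0]  # all flux rerouted through route 2

if __name__ == "__main__":
    print(is_steady_state(S, v_wildtype), within_bounds(v_wildtype, bounds))
    print(within_bounds(v_wildtype, mutant_bounds))
    print(is_steady_state(S, v_rerouted), within_bounds(v_rerouted, mutant_bounds))
```

In this toy case the wildtype flux vector becomes infeasible after the deletion, while a vector that reroutes all flux through the alternative reaction still satisfies every constraint; choosing among such feasible vectors is the task of the objective functions.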
3. Results
3.1. Experimentally determined fluxes
To evaluate the relative abilities of different objective functions to accurately predict the metabolic flux response to genetic perturbations, we utilized the aforementioned compendium of experimentally determined fluxes for S. cerevisiae single gene deletion mutants [1]. The mutants analyzed by Blank et al. were selected on the basis that the deleted genes encoded enzymes which catalyzed reactions that were active under minimal glucose conditions, but were not essential to growth. In other words, these genes encoded enzymes in flexible reactions, such that by observing how the metabolic network responds to their deletion, insight could be gained into the metabolic basis for
the robustness to gene deletions that has been previously observed in yeast metabolism [1, 4]. Despite the fact that the set of mutants analyzed by Blank et al. targeted genes in various central carbon metabolic processes, the nature of the metabolic flux responses was largely similar. Specifically, it was observed that for most mutants, the metabolic response was a local rerouting of flux around the perturbed reaction, with the relative flux through other pathways remaining similar to the wildtype. The exceptions to this rule were mutants in reactions critical to redox metabolism, where more distant rerouting was observed. An important caveat to the observed similarity in the flux distributions of the different mutants is that the absolute flux of carbon varied greatly. This aspect of the deletion mutant response is demonstrated in Fig. 1, where the glucose uptake and biomass production for the 13 mutants analyzed in the current study are shown. It can be seen that although the efficiency with which carbon is utilized is largely similar across different mutants, the growth rates vary greatly.
[Fig. 1: scatter plot; horizontal axis: Glucose Uptake Rate (mmol/g/h); vertical axis: physiological fitness; recoverable point labels: LSC1, MAE1, WT, CTP1, SFC1, GLY1, GCV2, PCK1, OAC1, SDH1, FUM1, PDA1, RPE1.]
Fig. 1. Experimentally determined glucose uptake rates and fitness for strains analyzed in the current study. Glucose uptake rates were plotted against the physiological fitness for the 13 mutants analyzed in the current study, along with the wildtype. Each point represents an individual strain, which is labeled with the gene which was deleted, or with WT if no gene was deleted. Physiological fitness was computed by normalizing a strain's growth rate by that of the wildtype. The wide range of glucose uptake rates indicates variation in the absolute metabolic flux carried in the different mutants. On the other hand, the strong correlation between glucose uptake rate and physiological fitness suggests that the glucose is largely being used in a similar manner across the different mutants.
3.2. Objective functions used to predict mutant fluxes
Our assessment of the ability of yeast flux balance models to predict fluxes in single gene deletion mutants included the evaluation of a set of 9 different objective functions (see Table 1). These 9 objective functions can be divided into four categories: growth maximization, minimization of metabolic adjustment, experimentally motivated and alternate maximization criteria.
Table 1. Objective functions used to determine mutant fluxes.
Optimization Method | Primary Optimization Function | Additional Notes
FBA_MIN_AV | \max v_{growth}^{KO} | A secondary optimization was performed to minimize the sum of the absolute values of the fluxes
FBA_WT_MIN_DIST | \max v_{growth}^{KO} | A secondary optimization was performed to minimize the distance from an experimentally constrained WT solution
MOMA_LP | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT}| | LP refers to the use of linear programming to minimize the Manhattan distance
MOMA_QP | \min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT})^2 | QP refers to the use of quadratic programming to minimize Euclidean distance
MOMA_LP_WT_CONSTR | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT-EXP}| |
MOMA_QP_WT_CONSTR | \min \sum_{i=1}^{m} (v_i^{KO} - v_i^{WT-EXP})^2 |
MOMA_LP_GLC_UP_NORM | \min \sum_{i=1}^{m} |v_i^{KO} / v_{GLC}^{KO} - v_i^{WT} / v_{GLC}^{WT}| |
MOMA_LP_BM_SINK | \min \sum_{i=1}^{m} |v_i^{KO} - v_i^{WT}| | During the optimization sink reactions were created for each biomass component
FBA_MAX_ETOH | \max v_{EtOH}^{KO} | For both primary and secondary optimizations biomass was fixed to the experimental value determined for the given mutant
The experimentally constrained WT solution was computed minimizing the sum of fluxes, given the experimental constraints [13].
Abbreviations: WT = Wildtype, KO = Knock Out, LP = Linear Programming, QP = Quadratic Programming, BM = Biomass, GLC = Glucose, EXP = Experimental
3.2.1. Growth maximization
This set consisted of two objective functions, which both select flux solutions which maximize biomass production. The two objective functions differ in their secondary objective functions, which are used to select among the set of alternative flux solutions which all result in optimal biomass production. The first, FBA_MIN_AV, performs a
secondary optimization which finds the flux distribution which produces the optimal biomass and has the minimal sum of the absolute values of fluxes through all reactions. The hypothesis underlying this approach is that yeast will attempt to achieve maximal growth at a minimal expense in terms of enzyme usage [10, 15]. The second objective function, FBA_WT_MIN_DIST, performs a secondary optimization which finds the set of fluxes which produces the optimal biomass and has the minimal Manhattan distance from an experimentally constrained wildtype solution. The motivation for this secondary objective was the aforementioned observation that the distribution of flux in deletion mutants is overall very similar to the wildtype.
3.2.2. Minimization of metabolic adjustment
This set consisted of four objective functions, all of which minimize the distance from a wildtype flux solution, given the additional constraint of the gene deletion [16]. These objectives differ in the distance metric used and the wildtype flux solution to which the distance was minimized. The distance metrics were Manhattan (MOMA_LP and MOMA_LP_WT_CONSTR) and Euclidean (MOMA_QP and MOMA_QP_WT_CONSTR) distances, both of which have been used in previous applications of the minimization of metabolic adjustment criteria [13, 16]. The wildtype flux distributions differed in that one uses experimental flux data to constrain the solution space (MOMA_LP_WT_CONSTR and MOMA_QP_WT_CONSTR), and the other does not (MOMA_LP and MOMA_QP).
3.2.3. Experimentally motivated
Both of the experimentally motivated objective functions are derivatives of minimization of metabolic adjustment, but with additions which were motivated by some of the observations made by Blank et al. [1], and others [8], in the analysis of fluxes in genetic mutants. MOMA_GLC_NORM used an experimentally constrained wildtype solution as above, but minimized the distance between fluxes normalized by the glucose uptake rate (see Table 1). The motivation for MOMA_GLC_NORM was the observed variation in the absolute flux among the different deletion mutants. The second objective is MOMA_BM_SINK, which minimized the Manhattan distance from an experimentally constrained wildtype solution as above, but included sink reactions for all biomass components. The motivation for MOMA_BM_SINK was to alleviate constraints on maintaining wildtype growth, when minimizing distance to the wildtype flux solution.
3.2.4. Alternate maximization criteria
The only objective function in this category maximized ethanol production in the mutant, given that biomass production was fixed to the experimentally observed value. The FBA_MAX_ETOH objective was motivated by the well known phenomenon whereby yeast preferentially ferments glucose, although it can be more efficiently broken down through oxidative phosphorylation [11]. Some have theorized that this aspect of yeast metabolism is a result of a selective advantage in maximizing ethanol production, so as to create a poor environment for potential competitors [18].
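The distance measures that the minimization of metabolic adjustment objectives in Sections 3.2.2 and 3.2.3 seek to minimize can be written out explicitly. The sketch below only evaluates these measures for given flux vectors; it does not perform the constrained minimization, which requires a linear or quadratic programming solver. The function names, the position of the glucose uptake flux in the vectors and the example fluxes are hypothetical.

```python
def manhattan_distance(v_ko, v_wt):
    """Sum of absolute flux differences (quantity minimized by the LP variants)."""
    return sum(abs(a - b) for a, b in zip(v_ko, v_wt))


def squared_euclidean_distance(v_ko, v_wt):
    """Sum of squared flux differences (quantity minimized by the QP variants)."""
    return sum((a - b) ** 2 for a, b in zip(v_ko, v_wt))


def glucose_normalized_distance(v_ko, v_wt, glc_index):
    """Manhattan distance after dividing each vector by its glucose uptake flux."""
    ko = [f / v_ko[glc_index] for f in v_ko]
    wt = [f / v_wt[glc_index] for f in v_wt]
    return manhattan_distance(ko, wt)


# Hypothetical flux vectors; entry 0 is the glucose uptake flux.
v_wt = [10.0, 6.0, 4.0, 10.0]
v_ko = [8.0, 0.0, 8.0, 8.0]
v_scaled = [5.0, 3.0, 2.0, 5.0]  # wildtype pattern at half the absolute flux

if __name__ == "__main__":
    print(manhattan_distance(v_ko, v_wt))
    print(squared_euclidean_distance(v_ko, v_wt))
    print(glucose_normalized_distance(v_ko, v_wt, 0))
    print(manhattan_distance(v_scaled, v_wt),
          glucose_normalized_distance(v_scaled, v_wt, 0))
```

The last line illustrates the motivation for the normalization: a mutant that keeps the wildtype flux pattern at a lower absolute flux is far from the wildtype in absolute terms, but at zero distance once both vectors are expressed per unit of glucose taken up.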
3.3. Correlations between experimental and predicted fluxes
Initial evaluation of the different objective functions was done by computing the Spearman rank correlation between predicted fluxes and 36 experimental flux measurements. These 36 fluxes, which consist of fluxes through central carbon metabolism along with uptake/secretion rates, were selected for correlation analysis because they represent a set of linearly independent variables in the genome-scale yeast model used. The results of the correlation analysis are shown in Fig. 2 for four optimization methods, which were found to be representative of the nine evaluated. For all 13 mutants tested, the objective functions which computed minimal distance from an experimentally determined wildtype solution achieved the best correlations. The performance of this set of methods was largely unaffected by the choice of distance metric (Manhattan or Euclidean), the addition of sinks for biomass components or by computing distances based on fluxes normalized by glucose uptake rates. On the other hand, the nature of the wildtype reference from which the distance was minimized was found to be very important. Specifically, inferior performance was observed across all mutants when using the method which minimizes the distance from a wildtype solution predicted by assuming maximal biomass production.
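The correlation measure used here can be illustrated with a short, dependency-free Python sketch: the Spearman coefficient is the Pearson correlation of the rank-transformed values, with tied values receiving the mean of their ranks. The helper names and the example flux values are hypothetical and serve only to show the computation.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical measured and predicted fluxes for one mutant.
measured = [1.0, 0.2, 0.5, 0.0, 0.9]
predicted = [0.8, 0.1, 0.9, 0.05, 0.75]
r = spearman(measured, predicted)
```

Because only ranks enter the coefficient, a prediction that reproduces the ordering of fluxes scores well even when its absolute magnitudes are off.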
[Figure 2: plot of Spearman rank R (y-axis, 0.65 to 1.00) for each mutant (x-axis, "Mutants"; recoverable tick labels: CTP1, FUM1, GCV2, GLY1, LSC1, MAE1, OAC1, PCK1, PDA1).]
Fig. 2. Spearman correlations of predicted fluxes with experimentally determined fluxes. Spearman rank correlation R values were computed between experimentally determined fluxes and the fluxes predicted by each of the 9 objective functions for the 13 different gene deletion mutants. Here, the R values for 4 objective functions are shown, as these 4 were found to be representative of all 9. Specifically, MOMA_LP performed the same as MOMA_QP, while MOMA_LP_WT_CONSTR performed the same as MOMA_QP_WT_CONSTR, MOMA_GLC_NORM, and MOMA_BM_SINK. For virtually all mutants the strongest correlation was achieved using an objective which minimized the distance from an experimentally constrained wildtype flux solution (black circles). The reference flux solution was critical, as minimizing the distance from a wildtype solution computed with the assumption of optimal growth resulted in a decreased correlation in all mutants (gray triangles). The objective maximizing production of ethanol (gray diamonds) produced fluxes which were least correlated with the experimental measurements. Notably, despite the respirofermentative behavior of yeast in aerobic glucose conditions, maximization of ethanol did a worse job of describing the flux response than maximization of growth (black squares) for all 13 mutants.
[Figure 3: heatmap; recoverable row labels: ACETATE SECRETION, ANAPLEROTIC REACTIONS, BIOMASS, CITRATE CYCLE, ETC COMPLEX II, ETC COMPLEX IV, ETHANOL SECRETION, GLUCOSE UPTAKE, GLYCEROL SECRETION, GLYCOLYSIS, PENTOSE PHOSPHATE CYCLE, SUCCINATE SECRETION.]
Fig. 3. Normalized difference of fluxes predicted by MOMA_LP_WT_CONSTR from experimental values. Differences were computed between the experimentally determined and model predicted fluxes. Before taking the difference between fluxes, all fluxes were normalized by the glucose uptake rate for the given mutant. In order to make differences comparable for fluxes of different magnitudes, flux differences were then normalized by the range of a given flux across all experimental measurements. Finally, flux differences for reactions in the same metabolic pathway were averaged together to allow for easier interpretation of incorrect flux predictions. Displaying this data in a heatmap, where black represents maximal difference and white minimal difference, reveals that the largest differences between experimental and model predicted fluxes are for the pda1, zwf1 and rpe1 mutants. This fits with the correlation analysis, as these mutants had three of the lowest Spearman R values for the MOMA_LP_WT_CONSTR objective. Looking at the heatmap to identify the processes with the largest differences for these three mutants provides insight into the cause of the low correlations. For pda1, the large difference in succinate secretion is a result of the model failing to predict that the TCA cycle is used to maintain NADH/NAD balance in the absence of the pyruvate dehydrogenase reaction. For rpe1, the model did not capture rerouting present in many pathways. Most of these reroutings stemmed from differential use of the pentose phosphate pathway resulting from the gene deletion. Finally, for zwf1, there is a large increase in the flux through malic enzyme to compensate for the inability to produce NADPH through the pentose phosphate pathway. The increased flux through malic enzyme is associated with an increase in flux through the TCA cycle and the respiratory chain, which is not predicted by the model. In general, these three gene deletions all result in reroutings to maintain redox balance, and the full scope of these reroutings is missed by the model predictions.
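The normalization steps described in the caption of Fig. 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: absolute differences are used, the range of each flux is taken over the glucose-normalized experimental values of all strains, and the data layout, the function name `pathway_errors`, and the key `glc_uptake` are hypothetical.

```python
from collections import defaultdict

def pathway_errors(exp, pred, pathway_of, uptake="glc_uptake"):
    """exp, pred: {strain: {reaction: flux}}; pathway_of: {reaction: pathway}.
    Returns {strain: {pathway: mean range-normalized flux difference}}."""
    def normalized(fluxes):
        # Step 1: divide every flux by the glucose uptake of its strain.
        return {m: {r: v / fl[uptake] for r, v in fl.items()}
                for m, fl in fluxes.items()}

    nexp, npred = normalized(exp), normalized(pred)
    errors = {}
    for m in nexp:
        per_pathway = defaultdict(list)
        for r, p in pathway_of.items():
            # Step 2: range of this flux across all experimental strains.
            values = [nexp[k][r] for k in nexp]
            span = max(values) - min(values)
            if span > 0:
                per_pathway[p].append(abs(nexp[m][r] - npred[m][r]) / span)
        # Step 3: average the differences of reactions in the same pathway.
        errors[m] = {p: sum(d) / len(d) for p, d in per_pathway.items()}
    return errors

# Hypothetical toy data for a wildtype and one deletion strain.
exp = {"wt": {"glc_uptake": 10.0, "pgi": 8.0, "pyk": 9.0},
       "pda1": {"glc_uptake": 5.0, "pgi": 3.0, "pyk": 4.0}}
pred = {"wt": {"glc_uptake": 10.0, "pgi": 8.0, "pyk": 9.0},
        "pda1": {"glc_uptake": 5.0, "pgi": 3.5, "pyk": 4.25}}
pathway_of = {"pgi": "glycolysis", "pyk": "glycolysis"}
errors = pathway_errors(exp, pred, pathway_of)
```

Fluxes with zero experimental range are skipped in this sketch, since the range normalization is undefined for them.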
While the objective function which minimizes distance from an experimentally constrained wildtype solution was best for all mutants, there is variability in its relative performance across mutants. To explore this variability in more detail, we examined predicted fluxes for MOMA_LP_WT_CONSTR, and assessed how well the fluxes through different metabolic pathways were predicted for different mutants. We hoped that the results of this analysis, which are displayed in a heatmap in Fig. 3, would provide insight into the sources of the decreased performance in certain mutants. The most erroneous flux predictions for most pathways are largely restricted to three mutants: rpe1, pda1, and zwf1. The pda1 and zwf1 mutants are in reactions which utilize redox cofactors, and as described by Blank et al. such mutants tend to enact more distant rerouting to maintain redox balance. It therefore fits with intuition that an objective function which minimizes distance from the wildtype would struggle to capture more distant flux changes. A detailed examination of the predicted fluxes for these two mutants shows that while adjustments are predicted which resolve the redox imbalances caused by the given mutation, they are not the same adjustments found experimentally. For instance, for the pda1 mutant, the NAD/NADH imbalance caused by the mutation is predicted to be resolved using the NADH-dependent acetaldehyde dehydrogenase, but it seems that instead yeast increases respiratory activity to achieve redox balance. For the zwf1 mutant, the model fails to predict the huge increase in flux through the TCA cycle and malic enzyme, which occurs in yeast to counteract the deficiency in NADPH resulting from the lack of an intact pentose phosphate pathway. These examples indicate that the flux rerouting in yeast metabolism which takes place in order to maintain redox balance does not represent a minimal adjustment, or at least not minimal with respect to the distance metrics evaluated here.
3.4. Prediction of absolute flux changes
While the correlations computed above quantify how well the different objective functions predict the nature of flux reroutings in the various deletion mutants, they do not capture how well the different objectives predict the absolute flux through the system. As discussed above, while the 13 different mutants analyzed here largely have the same relative flux through different pathways as observed in the wildtype, the absolute flux varies greatly. To evaluate how well the different objective functions capture different mutations' effects on absolute flux, we compared predicted biomass production in each mutant to the corresponding experimentally measured values. The results of this comparison are displayed in Fig. 4 for the MOMA_LP_WT_CONSTR objective function. Fig. 4 indicates that despite the strong correlation between predicted and observed fluxes for all deletion mutants, there is little success in predicting the relative effects of the same mutations on the growth rate. The same trend observed in Fig. 4 was seen for all objective functions. Specifically, across all objective functions no mutant was predicted to have less than 90% of the wildtype growth, whereas experimental measurements found that 9 of the 13 mutants in fact had less than 90% of the wildtype growth rate.
[Figure 4: scatter plot of model predicted fitness (y-axis, 0.95 to 1.00) against experimental fitness (x-axis, "Experimental Fitness"); recoverable point labels: WT, CTP1, FUM1, GCV2, GLY1, LSC1, MAE1, OAC1, PCK1, PDA1, RPE1, SDH1, SFC1.]
Fig. 4. Comparison of model predicted and experimentally determined growth rates for different strains. Experimentally determined fitness was plotted against fitness predicted using the MOMA_LP_WT_CONSTR objective function for the 13 gene deletion and wildtype strains. Fitness was defined as the ratio between the growth rate of a given strain and the growth rate of the wildtype. While the experimental fitness values have a wide range across the 13 mutants, the model predicts that no mutant has a growth rate less than 95% that of the wildtype.
4. Discussion
We evaluated the proficiency with which yeast flux balance models can predict the flux response to a variety of gene deletion mutations. Specifically, we assessed the flux predictions made by nine different objective functions, in response to 13 different single gene deletions. Comparison of flux predictions to complementary experimentally measured fluxes revealed that for all mutants the best performing objective functions were those which minimized the distance of mutant fluxes from an experimentally constrained wildtype solution. Importantly, while the 9 objective functions showed major differences in the accuracy of their predicted fluxes, all objectives correctly predicted that the 13 mutants would be able to produce biomass. This clearly demonstrates that the ability to correctly predict growth phenotypes does not necessarily translate into the ability to correctly characterize the underlying response at the level of reaction fluxes. The fact that for all mutants the flux response was best described by objectives which implemented minimal flux rerouting supports previous analyses of the metabolic response to gene deletions. Although the minimal rerouting objectives were consistently the best, predictions for all mutants were not equally good. Specifically, it was found that for mutants in reactions involving redox cofactors, a minimal adjustment was not sufficient to completely describe the flux response. We hypothesize that the reason for this is that there are a number of degrees of freedom in redox balancing, and the minimal rerouting criterion by itself is not sufficient to accurately predict the observed response. Likely, criteria which cannot easily be captured by flux balance models, such as enzyme affinity for redox substrates and kinetic rate constants, are crucial in determining how redox balance is achieved. In addition to issues with redox mutants, all objective functions failed to predict the absolute flux for different mutants. Specifically, despite accurate predictions of how fluxes were rerouted in the mutants, the model predictions did not capture the reduction in the overall flux observed in the experiments. The inability of any objective function to capture this aspect of the mutant response leaves the mechanism responsible for this observation unidentified. Again, it is likely that features of the metabolic response which cannot be captured by flux balance models are important here. Specifically, the relative efficiency of alternative pathways may limit the overall flux in mutants. Alternatively, regulatory responses to imbalances resulting from the gene deletions may cause an overall reduction in metabolic activity. Despite some of the shortcomings in the abilities of flux balance models to predict mutant flux responses, overall they largely capture the salient features of the response to the different gene deletions. Importantly, the selection of objective function proved critical to the accuracy of the predicted fluxes, despite little effect on the prediction of mutant growth.
Acknowledgements
The authors would like to thank Bill Riehl and Hsuan-Chao Chiu for critical reading of the manuscript. The authors would also like to acknowledge support from the NASA Astrobiology Institute, the US Department of Energy, and Boston University.
References
[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Burgard, A.P., Pharkya, P. and Maranas, C.D., Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization, Biotechnol Bioeng, 84(6):647-57, 2003.
[3] Burgard, A.P., Nikolaev, E.V., Schilling, C.H., et al., Flux coupling analysis of genome-scale metabolic network reconstructions, Genome Res, 14(2):301-12, 2004.
[4] Deutscher, D., Meilijson, I., Kupiec, M., et al., Multiple knockout analysis of genetic robustness in the yeast metabolic network, Nat Genet, 38(9):993-8, 2006.
[5] Duarte, N.C., Becker, S.A., Jamshidi, N., et al., Global reconstruction of the human metabolic network based on genomic and bibliomic data, Proc Natl Acad Sci USA, 104(6):1777-82, 2007.
[6] Edwards, J.S., Ibarra, R.U. and Palsson, B.O., In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data, Nat Biotechnol, 19(2):125-30, 2001.
[7] Feist, A.M., Henry, C.S., Reed, J.L., et al., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[8] Fischer, E. and Sauer, U., Large-scale in vivo flux analysis shows rigidity and suboptimal performance of Bacillus subtilis metabolism, Nat Genet, 37(6):636-40, 2005.
[9] Forster, J., Famili, I., Fu, P., et al., Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network, Genome Res, 13(2):244-53, 2003.
[10] Holzhutter, H.G., The principle of flux minimization and its application to estimate stationary fluxes in metabolic networks, Eur J Biochem, 271(14):2905-22, 2004.
[11] Johnston, M. and Kim, J.H., Glucose as a hormone: receptor-mediated glucose sensing in the yeast Saccharomyces cerevisiae, Biochem Soc Trans, 33(Pt 1):247-52, 2005.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] Kuepfer, L., Sauer, U. and Blank, L.M., Metabolic functions of duplicate genes in Saccharomyces cerevisiae, Genome Res, 15(10):1421-30, 2005.
[14] Schilling, C.H., Covert, M.W., Famili, I., et al., Genome-scale metabolic model of Helicobacter pylori 26695, J Bacteriol, 184(16):4582-93, 2002.
[15] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[16] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[17] Stolyar, S., Van Dien, S., Hillesland, K.L., et al., Metabolic modeling of a mutualistic microbial community, Mol Syst Biol, 3:92, 2007.
[18] Thomson, J.M., Gaucher, E.A., Burgan, M.F., et al., Resurrecting ancestral alcohol dehydrogenases from yeast, Nat Genet, 37(6):630-5, 2005.
[19] Vitkup, D., Kharchenko, P. and Wagner, A., Influence of metabolic network structure and function on enzyme evolution, Genome Biol, 7(5):R39, 2006.
BIOSYNTHETIC POTENTIALS FROM SPECIES-SPECIFIC METABOLIC NETWORKS
GEORG BASLER 1,2
basler@mpimp-golm.mpg.de
ZORAN NIKOLOSKI 1,2
nikoloski@mpimp-golm.mpg.de
OLIVER EBENHÖH 1,2
ebenhoeh@mpimp-golm.mpg.de
THOMAS HANDORF 3
handorf@mpimp-golm.mpg.de
1 Institute for Biochemistry and Biology, University of Potsdam, 14476 Potsdam, Germany
2 Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam, Germany
3 Theoretical Biophysics, Humboldt-University Berlin, 10115 Berlin, Germany
Studies of genome-scale metabolic networks allow for qualitative and quantitative descriptions of an organism's capability to convert nutrients into products. The set of synthesizable products strongly depends on the provided nutrients as well as on the structure of the metabolic network. Here, we apply the method of network expansion and the concept of scopes, describing the synthesizing capacities of an organism when certain nutrients are provided. We analyze the biosynthetic properties of four species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli. Matthaus et al. [12] have recently developed a method to identify clusters of scopes, reflecting specific biological functions and exhibiting a hierarchical arrangement, using the network comprising all reactions in KEGG. We extend this method by considering random sets of nutrients on well-curated networks of the investigated species from BioCyc. We identify structural properties of the networks that allow us to differentiate their biosynthetic capabilities. Furthermore, we evaluate the quality of the clustering of scopes applied to the species-specific networks. Our study provides a novel assessment of the biosynthetic properties of different species.
Keywords: biosynthetic capabilities; clustering; scope; species-specific
1. Introduction
Recently, there has been tremendous interest in the comparison of metabolic network structures in order to quantitatively and qualitatively explain the organizational structure and identify possible intrinsic network design principles. While the research in this field historically concentrated on kinetic modelling of small parts of metabolism, e.g., the glycolytic pathway [15], the emergence of biochemical databases, such as KEGG [10], Brenda [11], and BioCyc [16], has prompted interest in analyses of large-scale metabolic networks. As kinetic data corresponding to genome-wide, species-specific metabolic networks are often difficult to obtain or precisely determine, novel, topology-based methods have been introduced in the last decade to allow a functional analysis of such networks. In particular, such networks have been investigated by graph-theoretic approaches [1, 18, 20], steady-state analysis, e.g., elementary flux modes [17] or the related concept of extreme pathways [14], flux balance methods [5, 19], or, recently, by characterizing their synthesizing capacities using the concept of scopes [7]. The concept of a scope provides an effective method for determining which products a network can synthesize when it is provided with a given set of nutrient metabolites. In [8], it was shown that the synthesizing capacities of the nutrient metabolites, i.e., their scopes, form a complex hierarchy in the species-independent network defined by the KEGG database. This hierarchy is mainly determined by the chemical composition of the metabolites: those with a larger number of chemical elements or chemical groups (and, therefore, with a larger scope) are placed on top of metabolites with a simpler composition. In a recent paper [12], this complex hierarchy was condensed into a terse hierarchy of descriptive consensus scopes resulting from a clustering of scopes originating from all nutrient metabolites, taken individually. These consensus scopes represent sets of highly similar scopes, and could be assigned to characteristic combinations of chemical elements and a few chemical groups. As it is computationally infeasible to calculate the synthesizing capacities of all nutrient combinations, the consensus scopes are useful to efficiently describe the biosynthetic potential of a given metabolic network. Here, we investigate at which meaningful threshold values the formerly observed hierarchies and corresponding consensus scopes can also be found in species-specific networks. Our analysis comprises the metabolic networks of four model species: Arabidopsis thaliana, Saccharomyces cerevisiae, Buchnera aphidicola, and Escherichia coli, as defined in the BioCyc database.
These species have been chosen as representatives of different domains of life and contrasting living environments. In particular, Arabidopsis thaliana (abbr. Arabidopsis, taxon 3702) is a eukaryotic multicellular CO2-fixing plant, while Buchnera aphidicola (abbr. Buchnera, taxon 107806) is a highly specialized, intracellular parasite in aphids. Escherichia coli (abbr. E. coli, taxon 83333) is a well-studied bacterium that can grow in a variety of environments, and Saccharomyces cerevisiae (abbr. Yeast, taxon 4932) is a unicellular eukaryote and fungus that has been extensively used as a model organism. Furthermore, we perform extensive analyses focused on the effect of different parameters on the outcome of the clustering approach. Finally, as the concept of scope strongly depends on the network structure, we discuss the influence of properties, characteristic for the investigated species-specific networks, on the scopes. Organization and contributions: The methods employed in this study are presented in Section 2: The employed network representations and the scope algorithm are outlined in Subsections 2.1 and 2.2. In Subsections 2.3 - 2.5, the three main methods used in evaluating the influence of different parameters on the scope hierarchies, namely the scope size distribution, (dis)similarity indices, and weighted modularity of a given clustering, are presented. The results from our analysis appear in Section 3, while discussion about the effect of the network properties on the investigated approach for determining a representative scope hierarchy is given in Section 4.
2. Methods
In this section, we describe the methods for testing the sensitivity of the approach proposed by Matthaus et al. [12] in order to investigate the biosynthetic potential of specific species. In Subsection 2.1, we detail the retrieval and representation of networks used in this study. The main method, the calculation of the scope, is formally presented in Subsection 2.2. The size distributions of scopes on the investigated networks are discussed in Subsection 2.3, and the approach for determining the relationship between the parameters and methods for clustering is discussed in Subsections 2.4 and 2.5.
2.1. Species-specific networks
A metabolic network is typically represented by a directed bipartite graph $G = (V, E)$. The node set $V$ of $G$ can be partitioned into two subsets: $V_r$, containing reaction nodes, and $V_m$, comprised of metabolite nodes, such that $V_r \cup V_m = V$. The edges in $E$ are directed either from a node $u \in V_m$ to a node $v \in V_r$, in which case the metabolite $u$ is called a substrate of the reaction $v$, or from a node $v \in V_r$ to a node $u \in V_m$, in which case $u$ is called a product of the reaction $v$. In the following, we refer to substrates as predecessors (abbr. pred), and products as successors (abbr. succ). Such a representation of a metabolic network can be retrieved from a publicly available database of biochemical reactions. Here, the metabolic networks of the four investigated species were obtained from the BioCyc database [16]. Similarly to the network retrieval procedure specified in Matthaus et al. [12], the reactions were checked for consistency, and, consequently, those showing erroneous stoichiometry were removed. In addition, generic reactions and metabolites integrating sets of related metabolites were removed from the network, as proposed in [6]. The curation process was applied to the BioCyc database release from December 5, 2007, and resulted in networks of the following sizes: 1329 compounds and 1404 reactions (Arabidopsis), 1158 compounds and 1256 reactions (E. coli), 620 compounds and 594 reactions (Yeast), 356 compounds and 336 reactions (Buchnera). The BioCyc database also provides information on the reversibility of biochemical reactions. Every enzymatic reaction (with a given direction), in principle, may also proceed in the reverse direction. However, the direction in which a reaction actually proceeds strongly depends on the metabolite concentrations, and may therefore vary for different physiological conditions. Thus, for analyzing the structure of a metabolic network from a given species, all reactions may be considered as being operable in both directions. Here, as a result, all reactions are assumed to be reversible. Hence, the network is represented by a bipartite graph $G = (V, E)$, where the successors and predecessors of a reaction are interchangeably considered as reactants or products.
2.2. Biosynthetic potential of metabolites via scope
Given a metabolic network $G$ of an investigated species, the biosynthetic potential for a given set of metabolites, acting as substrates, can be described in terms of their scope, i.e., the metabolites that can be synthesized in the network from the substrates. The scope concept is related to reachability in the metabolic network $G$: A reaction node $v \in V_r$ is reachable if all of its substrates are reachable. Given a subset $S$ of metabolite nodes, called a seed, a node $u \in V_m$ is reachable either if $u \in S$ or if $u$ is a product of a reachable reaction. With these clarifications, we can present a precise mathematical formulation for the scope of a given seed [3]:
Definition 2.1. Given a metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope of the seed $S$, denoted by $R(S)$, is the set of all metabolite nodes reachable from $S$.
For a given metabolic network $G = (V, E)$ and a set $S \subseteq V_m$, the scope $R(S)$ can be determined in polynomial time of the order $O(|E| \cdot |V|)$, as can be established by analyzing the following algorithm:
Algorithm 1: Scope for a set of seed metabolites $S$ in a metabolic network $G$
Input: Metabolic network $G = (V_m \cup V_r, E)$, set of seed metabolites $S \subseteq V_m$
Output: Scope $R(S)$
    mark all nodes in $V_r$ as unreachable and unvisited
    $R(S) = S$
    repeat
        if there is a reachable unvisited node $r \in V_r$ then
            mark $r$ as visited
            $R(S) = R(S) \cup pred(r) \cup succ(r)$
        end
        foreach node $r \in V_r$ do
            if $pred(r) \subseteq R(S)$ or $succ(r) \subseteq R(S)$ then
                mark $r$ as reachable
            end
        end
    until no reachable unvisited nodes in $V_r$
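The expansion performed by Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the network is assumed to be given as a list of (substrates, products) pairs of sets, which is a simplification of the bipartite graph $G$, and the function name `scope` and the toy network are hypothetical.

```python
def scope(reactions, seed):
    """Return the scope R(S) of a seed, treating all reactions as reversible.

    reactions: list of (pred, succ) pairs, each a set of metabolite names.
    seed: iterable of metabolite names.
    """
    reached = set(seed)
    visited = [False] * len(reactions)
    changed = True
    while changed:
        changed = False
        for idx, (pred, succ) in enumerate(reactions):
            if visited[idx]:
                continue
            # A reaction becomes reachable once all metabolites on one of
            # its two sides are available; it then yields both sides.
            if pred <= reached or succ <= reached:
                visited[idx] = True
                reached |= pred | succ
                changed = True
    return reached

# Hypothetical toy network: A + B <-> C, C <-> D + E, F <-> G
toy = [({"A", "B"}, {"C"}), ({"C"}, {"D", "E"}), ({"F"}, {"G"})]
full = scope(toy, {"A", "B"})
partial = scope(toy, {"D"})
```

The loop repeats full passes over the reactions until a pass adds nothing new, which mirrors the repeat-until structure of Algorithm 1 at the cost of some redundant checks.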
In our analysis, the seed, $S$, is chosen uniformly at random from the set of metabolite nodes in a given network $G$. Algorithm 1 is then applied to each of $f = 3000$ sets $S$ of a specified cardinality $c$. In the following, we describe how one can determine the distribution and clustering of scopes for a given cardinality, $c$, of the seed.
2.3. Distribution of scope sizes
Given a species $X$ with a metabolic network represented by $G_X$, let $\Sigma_X$ be the set of all scopes for $f$ randomly chosen sets $S$, such that $c = |S|$. The scope size distribution for $\Sigma_X$ gives the probability, $P_X(s)$, that a scope, randomly chosen from $\Sigma_X$, is of size $s$. The effect of the parameter $c$ on the distribution $P_X(s)$ can be investigated by plotting the curves $P_X(s)$ for different values of $c$. To investigate the (possible) difference in the scope size distribution for several species, the sizes of the scopes are normalized by the number of metabolites in the corresponding network for each species. The scope size distributions of the investigated species are analyzed in Subsection 3.1.
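The sampling of seeds and the construction of the normalized scope size distribution can be sketched as follows. The helper names, the fixed random generator, and the example scope sizes are assumptions made for illustration; in an actual analysis the scope sizes would come from applying the scope algorithm to each sampled seed.

```python
import random
from collections import Counter

def random_seeds(metabolites, c, f, rng):
    """Draw f seeds of cardinality c uniformly at random."""
    pool = sorted(metabolites)  # random.sample needs a sequence
    return [frozenset(rng.sample(pool, c)) for _ in range(f)]

def scope_size_distribution(scope_sizes, n_metabolites):
    """Map each normalized scope size s / n to its empirical probability."""
    counts = Counter(scope_sizes)
    total = len(scope_sizes)
    return {s / n_metabolites: k / total for s, k in sorted(counts.items())}

rng = random.Random(0)  # fixed only to make the sketch reproducible
seeds = random_seeds({"A", "B", "C", "D", "E"}, c=2, f=4, rng=rng)

# Hypothetical scope sizes of five seeds in a network of 10 metabolites.
dist = scope_size_distribution([4, 4, 4, 7, 10], n_metabolites=10)
```

Dividing by the number of metabolites puts networks of different sizes on a common axis, which is what allows the distributions of the four species to be compared.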
2.4. Clustering of scopes
Existing studies of biosynthetic potential [8, 12] have identified that a large number of metabolites have scopes similar in size and metabolite composition. Here, we investigate this idea by hierarchical clustering for a set of scopes $\Sigma_X$ generated from seeds with cardinality $c$ and a given metabolic network of a species $X$. Hierarchical clustering is based on a given distance (dissimilarity) matrix for the elements of $\Sigma_X$. Similar to [12], we employ the reversed Jaccard index as a distance measure for a pair of scopes, $R(S_i)$ and $R(S_j)$, $|S_i| = |S_j| = c$, $1 \le i, j \le f$. The computation is in the order of $O(f^2)$ for $f$ scopes. For completeness, we give the definition of the Jaccard distance, $J_{R(S_i)R(S_j)}$:
$$J_{R(S_i)R(S_j)} = 1 - \frac{|R(S_i) \cap R(S_j)|}{|R(S_i) \cup R(S_j)|}$$
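The Jaccard distance above, and the pairwise distance matrix it induces, can be computed with a few lines of Python. The function names and the example scopes are hypothetical, and the value returned for two empty sets is a convention assumed here, since the definition leaves that case open.

```python
def jaccard_distance(a, b):
    """One minus the ratio of intersection size to union size of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty scopes
    return 1.0 - len(a & b) / len(union)

def distance_matrix(scopes):
    """Symmetric matrix of pairwise Jaccard distances between scopes."""
    n = len(scopes)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jaccard_distance(scopes[i], scopes[j])
    return d

# Hypothetical scopes of three seeds.
scopes = [{"A", "B", "C"}, {"B", "C", "D"}, {"E"}]
d = distance_matrix(scopes)
```

Only the upper triangle is computed explicitly, which keeps the number of set operations at roughly half of the quadratic total.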
We investigate the effect of a nearest neighbor group-average clustering algorithm [9]. Nearest neighbor clustering is a bottom-up clustering method in which clusters with increasing distance are iteratively joined, starting with clusters composed of single elements (scopes). Group-averaging refers to the method of defining the distance between two clusters as the average over all distances between pairs of the corresponding cluster elements. The output of a hierarchical clustering algorithm is a tree, which can be cut at a given distance between the clusters, to retrieve the clusters of scopes. The clusters obtained from a cut at distance $T$ contain all scopes whose mutual distance is not greater than $T$. The results of the clustering of scopes are presented in Subsection 3.2.
2.5. Evaluation of parameter values
To evaluate the influence of the size of the seed, $c$, and the distance, $T$, at which the clustering tree is cut on the quality of the obtained clusters, we use weighted modularity [2], a generalization of the graph cluster quality measure proposed by Newman and Girvan [13]. To apply graph cluster quality measures, one first has to build a graph from a given matrix of dissimilarity indices. Here, we construct a graph $H$ from the dissimilarity matrix by creating a node for each scope and a weighted edge for each pair of scopes: let $J$ be the dissimilarity matrix used in the hierarchical clustering. The weighted adjacency matrix $A$ of the graph $H$ is given by $A_{ij} = 1 - J_{R(S_i)R(S_j)}$, over all pairs $R(S_i)$ and $R(S_j)$ in $\Sigma_X$. The edges of graph $H$ are thus weighted by the similarity of the scopes $R(S_i)$ and $R(S_j)$. Let $C = \{C_1, \ldots, C_p\}$ be the set of scope clusters obtained by cutting the clustering tree at distance $T$. Given a graph $H$, with node set given by the $f$ scopes and weighted edges as defined above, the modularity of $C$ measures the quality of the clustering, or how separated nodes (scopes) from different clusters are from each other. It is defined as:
$$Q_{c,T} = \frac{1}{2m} \sum_{i,j=1}^{f} \left( A_{ij} - \frac{d(i)\,d(j)}{2m} \right) \delta_{ij}$$
where $m = \sum_{ij} A_{ij}$ is the weighted number of edges in $H$, $A_{ij}$ is the element of the adjacency matrix in row $i$ and column $j$, $d(i)$ is the weighted degree of scope $i$ in $H$, $d(j)$ is the weighted degree of scope $j$ in $H$, and $\delta_{ij} = 1$ if $i$ and $j$ are in the same cluster of $C$, and $0$ otherwise. With regard to this definition, the modularity measure assesses the closeness of the scopes placed in the same cluster (according to the employed clustering algorithm) and their "distance" from the scopes placed in the other clusters with respect to the weighted adjacency matrix (i.e., the similarity matrix). We investigate the behavior of the cluster quality for different sizes of the seed and different values for the parameter $T$ at which the clustering tree is cut to obtain the set of clusters $C$ (see Subsection 3.3).
3. Results
Here, we analyze and compare the scope size distributions, cluster agglomeration, and weighted modularities of scope clusters, obtained from the networks of the four investigated species. The scope size distributions and cluster agglomeration reveal characteristic features of the networks, while the weighted modularities determined for different values of cut-off and seed size allow us to systematically and quantitatively assess the relative influence of these parameters on the clustering.
3.1. Scope size distributions
Analyses of the scope concept have already shown that metabolites exhibit different biosynthetic potentials, i.e., the number of reachable metabolites strongly
Biosynthetic Potentials from Species-Specific Metabolic Networks
depends on the composition of the seed [3]. Therefore, we use the size of the scope to quantitatively characterize the biosynthetic potential of the seed metabolites in a given metabolic network. To this end, we empirically determine the size distributions of scopes resulting from the four investigated species (see Fig. 1). To make the distributions comparable, the scope sizes were normalized by the size of the network, and the counts of scopes were turned into a probability distribution (see Subsection 2.3 for details).
[Figure 1, panels (a)-(d): scope size distributions for (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, and (d) Buchnera aphidicola; x-axis: scope size (normalized); curves for seed sizes 4, 14, and 24.]
Fig. 1. Scope size distributions of (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera, normalized by the number of metabolites in the corresponding network. The distributions are shown for seed sizes 4 (red), 14 (blue), and 24 (yellow). The highest frequencies for seed size 4 are excluded for clarity: $P_{\mathrm{Arabidopsis}}(4) = 0.38$, $P_{\mathrm{E.\,coli}}(4) = 0.35$, $P_{\mathrm{Yeast}}(4) = 0.39$, and $P_{\mathrm{Buchnera}}(4) = 0.38$.
We observe that with small seeds of four metabolites, the scope size distributions of all investigated networks share a high peak for very small scope sizes, indicating that a large number of seeds exhibit a very low biosynthetic potential. The remaining large isolated peaks in the networks of Arabidopsis (Fig. 1a) and E. coli (Fig. 1b) correspond to characteristic scopes reachable from a relatively large number of
G. Basler et al.
different seeds. These characteristic scopes correspond to large subnetworks with a high degree of mutually reachable metabolites, which we refer to as scope communities: if the seed contains metabolites from within such a scope community, then there is a high probability of reaching all the metabolites within the community. In addition, a scope community is self-contained in the sense that metabolites outside of the community can only be reached if the seed also contains certain metabolites outside of the community. Note that one characteristic peak could in principle correspond to several such scope communities with a similar scope size; however, this is not observed in the networks of Arabidopsis and E. coli. Instead, the subsequent clustering reveals that scopes pertaining to the same characteristic peak are agglomerated into one cluster at a merging distance not greater than 0.2. Furthermore, the relatively large sizes of the communities (approx. 35%, 46%, and 60% of the network size in Arabidopsis, see Fig. 1a, and approx. 38% and 45% in E. coli, see Fig. 1b) suggest that the smaller scope communities form subsets of the larger ones and, thus, exhibit a hierarchical arrangement, as identified by Matthäus et al. [12]. By increasing the seed size, the probability of reaching any particular metabolite increases, and, therefore, one obtains larger scopes. In particular, we observe that for all networks the fraction of small scopes decreases, while the overall scope sizes increase. For the more complex networks of Arabidopsis and E. coli, we observe that the centers of the large peaks shift towards larger scope sizes. This indicates that seeds containing metabolites from within a scope community now frequently contain additional metabolites from outside of the community, which account for a small increase of the scope size. Moreover, seeds containing no metabolites from within a scope community continue to have a small scope, regardless of the increased seed size.
Consequently, scope communities in the more complex networks represent a prominent feature that is robust with respect to the seed size. In contrast to these findings, an increase of the seed size in the smaller networks of Yeast (Fig. 1c) and Buchnera (Fig. 1d) results in more evenly distributed scope sizes. This observation suggests that scope communities do not exist or are less pronounced compared to the cases of Arabidopsis and E. coli. For Yeast and Buchnera, there are many scopes containing a distinct fraction of the metabolites in the network. Finally, while the scope size distributions of Arabidopsis and E. coli are easily distinguishable by the frequency, relative scope size and number of scope communities, this is not the case for Yeast and Buchnera.
3.2. Cluster agglomeration
The dissimilarity matrix serves as the basis for the clustering described in Subsection 2.4. During the clustering process, scopes are agglomerated into clusters, starting with the most similar. At a merging distance of 0, every scope forms an individual cluster, so that the number of clusters equals the number of scopes f, i.e., 3000. The number of clusters monotonically decreases with an increasing merging distance,
until, at a distance of 1, all scopes form a single cluster. The number of clusters obtained at a certain merging distance provides information on the overall mutual similarities between scopes. In the case of many highly similar scopes, a small number of clusters will be obtained for a small merging distance, while the opposite holds for the case of many dissimilar scopes. For instance, if at a distance of 0.5 the number of clusters is half the number of scopes, then more than half of the scopes have a mutual distance of at most 0.5; therefore, more than half of the scopes share at least two thirds of their metabolites with another scope (cf. Subsection 2.4).
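The "two thirds" figure in the example above can be checked with a short derivation. The following is a sketch that assumes the standard Jaccard distance between two scopes A and B; the symbols a, b, and k are introduced here for illustration only and do not appear in the original text.

```latex
% Assumed notation: a = |A| \le b = |B|, and k = |A \cap B|.
\[
  d_J(A,B) \;=\; 1 - \frac{|A \cap B|}{|A \cup B|}
           \;=\; 1 - \frac{k}{a + b - k}.
\]
% A mutual distance of at most 1/2 is equivalent to
\[
  d_J(A,B) \le \tfrac{1}{2}
  \;\Longleftrightarrow\;
  \frac{k}{a + b - k} \ge \tfrac{1}{2}
  \;\Longleftrightarrow\;
  3k \ge a + b .
\]
% Since b \ge a, the shared fraction of the smaller scope satisfies
\[
  \frac{k}{a} \;\ge\; \frac{a + b}{3a} \;\ge\; \frac{2}{3}.
\]
```

Under these assumptions the bound of two thirds applies to the smaller of the two scopes (and to both when they have equal size); for the larger scope the same argument yields a shared fraction of at least one half.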
[Figure 2, panels (a)-(d): cluster agglomeration for (a) Arabidopsis thaliana, (b) E. coli, (c) Saccharomyces cerevisiae, and (d) Buchnera aphidicola; x-axis: merging distance; curves for seed sizes 4, 14, and 24.]
Fig. 2. Frequency of observed clusters over the merging distance for (a) Arabidopsis, (b) E. coli, (c) Yeast and (d) Buchnera. While steps appear in the frequencies for seed size of 4 (solid line) as a consequence of numerical effects of the Jaccard distance, the shapes appear continuous for seed sizes of 14 (dashed line) and 24 (dotted line). Furthermore, the overall mutual distances of scopes decrease when increasing the seed size, resulting in a smaller fraction of clusters at a particular merging distance.
As shown in Fig. 2, the mutual similarities of scopes exhibit significant differences when using varying seed sizes. As a trend, the number of clusters obtained at
a certain merging distance is reduced with the increase of the seed size, demonstrating that more similar scopes result from a larger seed size. This conforms to intuition, as larger seeds result in larger scopes with a higher probability of sharing common metabolites. While the agglomeration curves from seed sizes 14 and 24 appear continuous, steps appear in the curves from seed size 4. For the latter curves, a large number of scopes is agglomerated into clusters at certain distances. For Arabidopsis (Fig. 2a) and E. coli (Fig. 2b), there are large steps of more than 160 scopes at characteristic distances of 2/3 and 3/4, and steps of more than 530 scopes at a distance of 6/7. In Yeast (Fig. 2c) and Buchnera (Fig. 2d), there are steps of more than 300 scopes at distances of 2/3 and 6/7. These are numerical effects of the Jaccard distance, which provides a discrete number of possible dissimilarity values that decreases with smaller cardinalities of the compared entities. When using a small seed size, the fraction of small scopes is very large (cf. Subsection 3.1). Consequently, for a large number of scopes there is a small number of possible distances to consider. For instance, at a distance of 2/3, all scopes of size four with two metabolites in common are merged, and all scopes of size six with three metabolites in common, and so on. With many small scopes, these characteristic distances occur more frequently, leading to the observed steps. For the clustering of Arabidopsis and E. coli with seed sizes of 14 and 24, a significant fraction of scopes is agglomerated with a merging distance of less than 0.1. This indicates that there are many scopes with a high mutual similarity. In contrast, this does not hold for Yeast and Buchnera, where the similarities between scopes are more uniformly distributed and, thus, result in cluster agglomerations at higher distances. Again, there are significant differences between the calculated scopes of Arabidopsis and E. coli on the one hand, and Yeast and Buchnera on the other hand.
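The discreteness of the Jaccard distance described above can be reproduced with exact fractions. The sketch below assumes the standard definition of the Jaccard distance; the helper names `jaccard_distance` and `possible_distances` and the placeholder metabolite labels are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b|, returned as an exact fraction."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return Fraction(0)  # convention chosen here for two empty sets
    return 1 - Fraction(len(a & b), len(union))


def possible_distances(max_size):
    """All distances attainable by two non-empty sets of at most max_size elements."""
    values = set()
    for size_a in range(1, max_size + 1):
        for size_b in range(1, max_size + 1):
            for shared in range(min(size_a, size_b) + 1):
                values.add(1 - Fraction(shared, size_a + size_b - shared))
    return sorted(values)


# Two scopes of size four with two metabolites in common.
d_four_two = jaccard_distance({"m1", "m2", "m3", "m4"}, {"m3", "m4", "m5", "m6"})
# Two scopes of size six with three metabolites in common.
d_six_three = jaccard_distance(range(0, 6), range(3, 9))
# Two scopes of size four with a single metabolite in common.
d_four_one = jaccard_distance({1, 2, 3, 4}, {4, 5, 6, 7})

print(d_four_two, d_six_three, d_four_one)
print(len(possible_distances(4)), len(possible_distances(12)))
```

The first two pairs land on the same distance, which is why many small scopes are merged simultaneously, and the set of attainable distances grows with the allowed scope size, consistent with the smoother curves for larger seeds.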
3.3. Influence of cut-off and seed size
Due to the observed large impact of the employed seed size and cut-off on the calculated scopes and the resulting clustering, we aim at evaluating the influence of these parameters on the quality of clustering. In particular, we are interested in those parameter values that yield clusters of highest weighted modularity. Moreover, a thorough investigation of the parameter space may provide insights into the presented approach of scope clustering. We determine scopes from random seeds as described in Subsection 2.2 for seed sizes 2 ≤ c ≤ 25. For each set of scopes resulting from a given network and seed size, we perform the clustering of scopes as described in Subsection 2.4. Finally, we cut the obtained cluster trees at cut-off distances 0.05 ≤ T ≤ 1 with a step size of 0.05, and determine the weighted modularities of the resulting sets of clusters, as defined in Subsection 2.5. In Fig. 3, the resulting matrices of weighted modularities for different parameter
[Figure 3, four panels: influence of cut-off and seed size on cluster quality for Arabidopsis thaliana, E. coli, Saccharomyces cerevisiae, and Buchnera aphidicola; heat maps with a color key ("Value"); axis label: seed size.]
[Figure 4, panels A-C: plots with axis label "v perturbed".]
Figure 4: Single perturbation trajectory. Λ matrices were calculated to restore steady state after perturbations to the flux v3 in the network in Figure 1. Plots A and B use a Λ calculated to return a perturbation of 2 to 1, while plot C uses a Λ that returns a perturbation of 1/3 to 1. In each plot, the dotted line is the diagonal, the dashed line is the parabola described in Figure 3, and the solid line is the trajectory after several iterations of Equation (14). (A) Convergent regulation. A perturbed flux value of 0.1 will return to steady state after several regulatory steps. (B) Divergent regulation. A perturbed flux value of 3.5 will approach −∞. (C) Chaotic regulation. For some values of Λ and initial perturbations (here, the initial perturbation is 0.4), any regulation performed may behave chaotically, never converging on a steady state or diverging toward infinity.
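The three regimes listed in this caption have a well-known analogue in the logistic map studied by May [13]. The sketch below iterates the textbook map x(n+1) = r x(n) (1 − x(n)); it is not Equation (14) of this paper, and the function name `logistic_trajectory` as well as the values of r and the starting points are assumptions chosen purely for illustration.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the textbook logistic map x -> r * x * (1 - x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


# Convergent regime: for r = 2.5 the map has a stable fixed point at 1 - 1/r.
convergent = logistic_trajectory(2.5, 0.1, 100)

# Divergent regime: a starting point outside [0, 1] runs off towards -infinity.
divergent = logistic_trajectory(2.5, 1.5, 6)

# Chaotic regime: for r = 4 the trajectory stays in [0, 1] but does not settle.
chaotic = logistic_trajectory(4.0, 0.4, 200)

print(convergent[-1])
print(divergent[-1])
print(chaotic[-5:])
```

Changing only the parameter or the starting value switches the same one-line update rule between the three behaviors, which is the qualitative point of the comparison.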
This dynamical regulation process behaves similarly to a logistic map [13], displaying regimes of convergence, divergence or apparently chaotic trajectories, depending on the values of the parameters Λ and v. With regard to metabolic regulation, this finding potentially implies that chaotic or divergent behavior might be easily encountered by regulatory networks, unless specific ranges of parameters are avoided. This may pose constraints on possible regulatory networks optimized through evolutionary adaptation.
4. Glycolysis
An obvious question is whether our method can be used to predict the topology and dynamics of regulation in real-world networks. As a simple example, we chose a simplified (condensed) version of the glycolytic pathway, previously used for similar testing of computational approaches (Figure 5) [15]. As was done for the simple linear pathways (Figure 2), we approach this network by perturbing each flux individually and predicting the optimal network to restore homeostasis.
Figure 5: Perturbations in a simplified model of glycolysis. Solid lines represent metabolic reactions, and dashed lines represent predicted optimal metabolic regulation. Reactions represented as bold lines are the ones being perturbed. G = glucose, F = fructose-6-phosphate, B = fructose-1,6-bisphosphate, P = phosphoenolpyruvate, Y = pyruvate, L = lactate, T = adenosine triphosphate, D = adenosine diphosphate.
W. J. Riehl & D. Segrè
In all cases, only one regulatory metabolite was necessary for optimal regulation that restores a given steady state. For each of the reactions involving an energy carrier, ADP was predicted to act as the main regulatory molecule (Figures 5B, 5C, and data not shown). Lactate also acted as a negative feedback regulator on its own production (Figure 5D), and glucose acted as a negative regulator on the influx of glucose (Figure 5A).
5. Discussion
In this work we developed new algorithms and methods for predicting optimal metabolic regulation based on the topology and stoichiometry of a metabolic network. Thus far, we have applied these algorithms to small pathways that are linear in nature in order to understand how accurate and robust the predictions are. Initially we found that while a single regulatory scheme can be robust for some perturbed values (Figures 3 and 4), it quickly becomes clear that a single regulatory approach predicted by this method is incapable of effectively regulating all perturbations. For example, a regulatory scheme focused on regulating perturbations to a single flux will have little or no effect on other fluxes. We also observed that multiple applications of a single regulatory system can produce unexpected, apparently chaotic results (Figure 4C). While some of these results may be unrealistic consequences of the mathematical approximations used, they may also capture some fundamental properties of biological regulation systems evolved to respond to multiple perturbations. Recent work has shown, for example, that some metabolic states are more stable than others, and that perturbations occurring on top of unstable states can lead to cell death [9]. It is worth emphasizing that each of these predicted optimized regulatory mechanisms represents just that: the optimal amount of regulation necessary to respond to a given perturbation. In all cases explored (perturbations to a single flux in the network), the optimal controlling metabolite turns out to be either a reactant or product in the perturbed reaction. However, it remains a point of interest that for many perturbations in glycolysis, the controlling metabolite predicted most often was ADP. This is interesting because both ADP and ATP are known to be strong regulators (either activators or inhibitors) of glycolysis. 
This may point to the utility of this method as both a quantitative (degree of regulation necessary) and a qualitative (type of metabolite functioning as a regulator) prediction generator. The current model involves simplifying hypotheses and approximations, some of which may be unjustified from the biochemical point of view. These include the assumption that the regulatory response is based on concentration changes, rather than absolute concentration values; the fact that we do not include flux relaxation induced by plain kinetic effects; the use of arbitrary values for flux perturbations; the implementation of a dynamical process based on discrete time points; and the limitation to noncompetitive inhibition as the only form of feedback. In ongoing work, we are addressing each of these assumptions to determine their impact on our results, and possible
Optimal Metabolic Regulation Using a Constraint-Based Model
strategies for more realistic implementations. We plan to expand on this work and use it to explore more complex systems. First, we will examine how the method predicts regulation in response to different and multiple perturbations to a system. We expect that when two or more fluxes are perturbed, the regulatory network will quickly become complex and intricate. Next, we plan to explore the regulation of networks with complex topologies that include branching and cyclical pathways. Eventually we intend to apply this predictive method to whole-genome models of flux balance, such as the Escherichia coli model produced by Feist et al. [6] or the Saccharomyces cerevisiae model produced by Blank et al. [1].
Acknowledgements
The authors wish to thank Hsuan-Chao Chiu, Niels Klitgord, and Evan Snitkin for meaningful discussion and critical reading of the manuscript. Linear Programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under free academic license. This work was partially supported by the NASA Astrobiology Institute, the US Department of Energy and the US National Institutes of Health (NIGMS).
References
[1] Blank, L.M., Kuepfer, L. and Sauer, U., Large-scale 13C-flux analysis reveals mechanistic principles of metabolic network robustness to null mutations in yeast, Genome Biol, 6(6):R49, 2005.
[2] Covert, M.W., Schilling, C.H. and Palsson, B., Regulation of gene expression in flux balance models of metabolism, J Theor Biol, 213(1):73-88, 2001.
[3] Covert, M.W. and Palsson, B.O., Constraints-based models: regulation of gene expression reduces the steady-state solution space, J Theor Biol, 221(3):309-25, 2003.
[4] Ebenhöh, O. and Heinrich, R., Stoichiometric design of metabolic networks: multifunctionality, clusters, optimization, weak and strong robustness, Bull Math Biol, 65(2):323-57, 2003.
[5] Edwards, J.S. and Palsson, B.O., Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions, BMC Bioinformatics, 1:1, 2000.
[6] Feist, A.M., Henry, C.S., Reed, J.L., Krummenacker, M., Joyce, A.R., Karp, P.D., Broadbelt, L.J., Hatzimanikatis, V. and Palsson, B.O., A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information, Mol Syst Biol, 3:121, 2007.
[7] Fell, D., Understanding the Control of Metabolism, Portland Press Ltd., 1997.
[8] Goyal, S. and Wingreen, N.S., Growth-induced instability in metabolic networks, Phys Rev Lett, 98(13):138105, 2007.
[9] Grimbs, S., Selbig, J., Bulik, S., Holzhütter, H.G. and Steuer, R., The stability and robustness of metabolic states: identifying stabilizing sites in metabolic networks, Mol Syst Biol, 3:146, 2007.
[10] Hatzimanikatis, V., Floudas, C.A. and Bailey, J.E., Optimization of regulatory architectures in metabolic reaction networks, Biotechnology and Bioengineering, 52(4):485-500, 1996.
[11] Heinrich, R. and Rapoport, T.A., A linear steady-state treatment of enzymatic chains. General properties, control and effector strength, Eur J Biochem, 42(1):89-95, 1974.
[12] Kauffman, K.J., Prakash, P. and Edwards, J.S., Advances in flux balance analysis, Curr Opin Biotechnol, 14(5):491-6, 2003.
[13] May, R.M., Simple mathematical models with very complicated dynamics, Nature, 261(5560):459-67, 1976.
[14] Shlomi, T., Eisenberg, Y., Sharan, R. and Ruppin, E., A genome-scale computational study of the interplay between transcriptional regulation and metabolism, Mol Syst Biol, 3:101, 2007.
[15] Vance, W., Arkin, A. and Ross, J., Determination of causal connectivities of species in reaction networks, Proc Natl Acad Sci USA, 99(9):5816-21, 2002.
COMPARATIVE DETERMINATION OF BIOMASS COMPOSITION IN DIFFERENTIALLY ACTIVE METABOLIC STATES
HSUAN-CHAO CHIU¹
[email protected]
DANIEL SEGRÈ¹,²
[email protected]
¹ Graduate Program in Bioinformatics, Boston University, Boston, MA, 02215, USA
² Departments of Biology and Biomedical Engineering, Boston University, Boston, MA, 02215, USA
Flux Balance Analysis (FBA) has been successfully applied to facilitate the understanding of cellular metabolism in model organisms. Standard formulations of FBA can be applied to large systems, but the accuracy of predictions may vary significantly depending on environmental conditions, genetic perturbations, or complex unknown regulatory constraints. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Specifically, we seek to use experimental data (such as flux measurements, or mRNA expression levels) to infer best matching stoichiometrically balanced fluxes and metabolite sinks. Our algorithm is designed to provide predictions based on the comparative analysis of two metabolic states (e.g. wild-type and knockout, or two different time points), so as to be independent from possible arbitrary scaling factors. We test our algorithm using experimental data for metabolic fluxes in wild type and gene deletion strains of E. coli. In addition to demonstrating the capacity of our approach to correctly identify known exchange fluxes and biomass compositions, we analyze E. coli central carbon metabolism to show the changes of metabolic objectives and potential compensation for reducing power due to a single enzyme gene deletion in the pentose phosphate pathway.
Keywords: flux balance analysis; systems biology; data integration; metabolic objectives
1. Introduction
An important goal of systems biology is to reconstruct and simulate biological networks to facilitate the understanding of complex cellular metabolism. Constraint-based approaches have been applied to characterize the cellular flux distribution and predict metabolic phenotypes for cells grown in different conditions. One of the most prominent constraint-based approaches, Flux Balance Analysis (FBA), relies on a steady state approximation and optimization algorithms to predict metabolic fluxes at the cellular level [15]. The steady state approximation translates into a set of constraints on the fluxes, namely that the net sum of all fluxes producing or consuming each metabolite has to be zero. FBA determines these steady state fluxes by searching the space of feasible solutions, a polyhedral space defined by multiple constraints, for a choice of fluxes that minimizes/maximizes an objective function associated with a biological task. For instance, for a unicellular organism, one may ask what is the solution that maximizes an appropriately defined growth (or biomass production) flux, reflecting selection for fast growth during evolution [15]. In addition to maximizing growth, van Gulik and Heijnen suggested maximization of ATP yield, based on the assumption that evolution drives
maximal energy efficiency [14]. Bonarius et al. suggested minimization of overall intracellular flux, reflecting the hypothesis that organisms have evolved to maximize enzymatic efficiency [1]. Several works have proposed methods to identify objective functions from experimental data. Knorr et al. proposed a Bayesian-based probability ranking method to evaluate multiple objective functions [7]. Schuetz et al. have measured fluxes and evaluated different objectives with a Euclidean metric approach [11]. Among all the objectives studied by Schuetz et al., nonlinear maximization of the ATP yield best described unlimited growth on glucose in oxygen or nitrate respiring batch cultures, while linear maximization of the overall ATP or biomass yields achieved the highest accuracy under nutrient-limited continuous cultures [11]. Although FBA optimal growth seems to work well in several cases, it has been shown to be sometimes insufficient for predicting perturbed metabolic states, such as those found in gene deletion knockout strains. A better way to determine mutant fluxes is to use Minimization Of Metabolic Adjustment (MOMA) [12], which assumes that the mutants would stay as close to the wild type flux distribution as possible. One lesson learned from MOMA is that metabolic networks perturbed from a simple average behavior may be better described by objective functions different than standard growth rate maximization. One can imagine, in general, that a living system may switch its objective when facing a physiological change. For example, the diauxic shift in yeast, which is the switching from anaerobic growth to aerobic respiration upon depletion of glucose, is known to be correlated with widespread changes in the expression of genes involved in carbon metabolism, protein synthesis, and carbohydrate storage [3, 6]. Understanding the physiology of such a natural process is still an open challenge.
Lacking knowledge of objectives for perturbed cells and changes of objectives under different metabolic states limits the capacity to correctly describe metabolic networks using FBA methods. An alternative way to study metabolism is to infer metabolic flux objectives from available data. Comparative analyses of biomass compositions in different physiological states, either between wild type and mutants or throughout naturally occurring physiological transitions, could provide insight helpful towards understanding the design of metabolic networks. Previously Burgard and colleagues proposed ObjFind and BOSS to identify putative objective functions from flux measurements. Specifically, these methods identify the coefficients of importance responsible for flux distributions in E. coli and yeast [2, 4]. Uygun et al. proposed a multilayer optimization framework to discover the major fluxes of metabolic objective that account for the flux distribution in a mammalian cell [13]. However, these methods rely on flux measurements and cannot take advantage of other high throughput data. Here we present an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our method is designed to incorporate high throughput data for comparatively determining metabolic objectives in two physiological states. As a first step, we analyze here flux data from E. coli central carbon metabolism pathways [5] to demonstrate our method for predicting metabolic objectives.
2. Method
2.1. Flux Balance Analysis
FBA describes the cellular-level reaction rates (fluxes) under a steady state approximation, thereby imposing linear mass balance constraints. All nutrients taken up from the extracellular environment are consumed to produce biomass or other byproducts, which leave the system, without intracellular metabolite accumulation. The steady state equation responsible for mass balance can be written as follows:
$$\frac{dx}{dt} = Sv = 0 \qquad (1)$$
where x is the vector of metabolite concentrations, v is the vector of reaction fluxes and S is the stoichiometric matrix of the network. S is an m by n matrix where m is the number of metabolites and n is the number of reactions. The value $S_{ij}$ in S is the stoichiometric coefficient for metabolite i in reaction j. Additional constraints such as lower and upper bounds for specific enzymatic reactions or nutrient uptake rates may also be imposed as $LB_j \le v_j \le UB_j$ for reaction $v_j$. FBA determines a specific flux prediction by maximizing/minimizing a linear objective function associated with a biological task. A typical FBA objective used in microbial systems is the maximization of biomass production [15], based on the assumption that unicellular organisms have been selected to reach maximum growth performance during evolution. Biomass production is approximated by a growth flux $v_{growth}$, which is defined as follows:
(2)
where c is the vector of biomass coefficients, whose component $c_i$ indicates the proportion of metabolite $X_i$ required for the formation of a unit of biomass. The linear programming statement for maximizing growth in FBA can be formulated as:
$$\max\; v_{growth} \quad \text{s.t.} \quad Sv = 0, \quad LB_j \le v_j \le UB_j \qquad (3)$$
2.2. Objective inference
We extend the conventional FBA formulation to concurrently infer metabolic objectives in two different metabolic states of a system. Here we limit our search to maximization of biomass production as an objective function, but we allow the biomass composition to assume in principle any vector of coefficients. For instance, the two states could be the wild type and a given mutant. The goal is then to infer the corresponding $c^1$ and $c^2$ vectors of biomass coefficients best representing the metabolic objectives for the two corresponding physiological states.
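The constraints of Eq. (3) in Subsection 2.1 can be illustrated on a very small example. The following Python sketch only checks candidate flux vectors against the mass balance and bound constraints; it is not an LP solver and not the model used in this work. The three-reaction network, its bounds, and the function names `mass_balance` and `is_feasible` are hypothetical.

```python
def mass_balance(S, v):
    """Return S·v, the net production rate of every metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_feasible(S, v, lower, upper, tol=1e-9):
    """Check the steady state condition S·v = 0 and the flux bounds."""
    balanced = all(abs(rate) <= tol for rate in mass_balance(S, v))
    bounded = all(lo - tol <= x <= hi + tol
                  for x, lo, hi in zip(v, lower, upper))
    return balanced and bounded


# Hypothetical linear pathway: uptake -> A, A -> B, B -> growth sink.
# Rows are metabolites (A, B); columns are reactions (uptake, conversion, growth).
S = [
    [1, -1, 0],
    [0, 1, -1],
]
lower = [0, 0, 0]
upper = [10, 1000, 1000]  # uptake is capped at 10 flux units

balanced_flux = [10, 10, 10]     # satisfies both constraint types
accumulating_flux = [10, 8, 8]   # metabolite A would accumulate
over_uptake_flux = [12, 12, 12]  # balanced, but violates the uptake bound

for v in (balanced_flux, accumulating_flux, over_uptake_flux):
    print(v, mass_balance(S, v), is_feasible(S, v, lower, upper))
```

In this toy pathway the steady state forces all three fluxes to be equal, so the growth flux can never exceed the uptake bound; an LP solver applied to Eq. (3) would return that bound as the optimum.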
To reverse engineer the objectives, we implement a linear optimization procedure to identify the FBA objectives maximally compatible with given vectors $E^1$ and $E^2$ encoding reference experimental data:
$$\begin{aligned}
\min\; & \sum_{j:\; E_j^1, E_j^2 \neq 0} \left| \frac{v_j^1}{E_j^1} - \frac{v_j^2}{E_j^2} \right| \\
\text{s.t.}\; & S' \cdot v' = 0, \quad \text{where } S' = \begin{bmatrix} S & 0 \\ 0 & S \end{bmatrix}, \quad v' = \begin{bmatrix} v^1 \\ v^2 \end{bmatrix} \\
& LB_j^i \le v_j^i \le UB_j^i \\
& \sum_j \left| v_j^i \right| \ge v_{min}
\end{aligned} \qquad (4)$$
where v is the vector of fluxes to be determined and 0 is a zero matrix with the same dimensions as S. In this optimization problem the overall flux activity (the sum of the absolute values of all fluxes) is required to be above a threshold $v_{min}$ (e.g. 25% of the flux activity obtained with regular FBA). Biomass production reactions for the first and second state are disabled in the stoichiometric matrix and a sink reaction for each biomass component is added. Each single biomass component originally flowing into biomass is exported separately, and the inferred fluxes will correspond to the biomass coefficients for the corresponding metabolic state. Our optimization method tries to optimize biomass coefficients simultaneously for two metabolic states, hence allowing us to take advantage of the fact that certain data could provide only relative changes between reaction activities in the two states. Here we limit the optimization to intracellular fluxes. To test our objective function inference approach and demonstrate its performance, we apply our method to experimental flux measurements in E. coli central carbon metabolism pathways, taken from the paper published by Ishii et al. [5]. In their flux measurements, the wild type strain of E. coli K-12 and 24 single gene deletion mutants of glycolysis and the pentose phosphate pathway were grown in glucose-limited chemostat cultures. The mutant cells were grown at a fixed dilution rate of 0.2 hours⁻¹, and wild-type cells cultured at the same specific growth rate were used as a reference sample. They also cultured wild type cells at different dilution rates (0.1, 0.4, 0.5, and 0.7 hours⁻¹) for comparison. In this work, we apply an E. coli central carbon metabolism FBA model [9] to study these data.
The biomass production reaction in this model is a sink for the linear combination of several metabolites that are precursors of amino acids, nucleotides or lipids:

0.205 g6p + 0.361 e4p + 1.496 3pg + 1.787 oaa + 1.079 akg + 2.833 pyr + 0.898 r5p + 0.519 pep + 0.129 g3p + 0.071 f6p + 18.225 nadph + 3.748 accoa + 3.547 nad + 55.703 atp + 55.703 h2o
→ 18.225 nadp + 3.748 coa + 3.547 nadh + 55.703 adp + 55.703 pi + 41.025 h    (5)
Comparative Determination of Biomass Composition
Because the model is small and contains many sink reactions for the metabolites, the optimization in Eq. (4) has many alternative optima. We therefore use Minimization Of Metabolic Adjustment (MOMA) [12] to find the most probable steady-state solution for v^i by exploring the solution space obtained from optimizing Eq. (4). The coefficients for the biomass precursors listed above are inferred after this primary and secondary optimization process.
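The secondary selection step can be sketched as follows. Real MOMA solves a quadratic program over the whole feasible set; as a simplified stand-in, the sketch picks, from a finite list of alternative optima, the flux vector with the smallest Euclidean distance to a reference vector. The candidate list and the reference vector are hypothetical, and which reference the selection uses is an assumption here.

```python
import math


def euclidean_distance(v, w):
    """Euclidean distance between two flux vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def closest_to_reference(candidates, v_ref):
    """Return the candidate flux vector nearest to v_ref (MOMA-like criterion)."""
    return min(candidates, key=lambda v: euclidean_distance(v, v_ref))


# Hypothetical reference fluxes and three alternative optima of the primary problem
v_ref = [1.0, 1.0, 0.0]
alternative_optima = [
    [1.0, 0.0, 1.0],
    [1.0, 0.8, 0.2],
    [0.0, 1.0, 1.0],
]
chosen = closest_to_reference(alternative_optima, v_ref)
```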
3. Results
Performing gene deletions is a commonly used approach to study how an organism responds to perturbations. FBA and MOMA have been used to generate predictions of these metabolic responses. MOMA, in particular, has been shown to be more accurate than FBA for predicting mutant fluxes. However, there are cases in which neither the FBA nor the MOMA objective captures the true metabolic state well enough (Fig. 1). Hence this is a good test case for our algorithm, which searches for biomass composition coefficients that are compatible with experimental data.
Fig. 1. Intracellular flux determination for mutant strain Δzwf. Both panels plot predicted fluxes against the experimentally measured fluxes v_i (experiment). Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). (a) FBA flux predictions for the mutant do not correlate well with experimental measurements. (b) MOMA predictions are expected to correlate better with experimental fluxes. However, in this case, even the MOMA-predicted mutant fluxes are not satisfactory enough for inferring biomass coefficients.
We applied our method to flux measurements in E. coli central carbon metabolism pathways [5] to infer the metabolic objectives in wild type and mutant strains. The reference (Ref) strain we use is the average of the four replicates available experimentally. Fig. 2 shows the correlation of predicted and experimental exchange rates (which were not part of the input of the above inference algorithm). Our predictions for glucose uptake rates agree with all wild type and mutant measurements studied here. In general, predicted oxygen uptake rates match well with experiments, except for those of the GR03 wild type strain (see Table 1). The less accurate predictions for oxygen uptake or CO2 production rates may be caused by reactions consuming or producing these compounds that are not in central carbon metabolism pathways. For instance,
H.-C. Chiu & D. Segre
Ubiquinone-8 biosynthesis requires oxygen. Therefore, under-predicted oxygen uptake rates will result in a corresponding under-prediction of Ubiquinone-8 related reactions, such as NADH dehydrogenase or succinate dehydrogenase. In addition, predictions may also be affected by inaccurate flux measurements. For example, three CO2-associated reactions have large standard deviations (larger than 0.5 × mean, see Table S5B in [5]) in the wild type replicates, possibly due to experimental difficulties or to the fitting procedure used for flux corrections to achieve isotopomeric steady state.
Fig. 2. Predicted uptake and secretion rates in wild type and three mutant strains Δzwf, Δpgl and Δgnd (pentose phosphate pathway single gene deletion mutants). Both panels plot predicted rates against measured rates v_j (experiment). The unit for flux is millimoles per gram dry weight per hour (mmol/gDW/h). Negative values refer to uptake rates. (a) Glucose (glc), O2 and CO2 exchange rates. Glucose uptake rates in all cultures are predicted quite well. Some oxygen uptake rates are under-predicted in wild type strains under high dilution rates. The wild type strain with the largest dilution rate (GR04) has a large deviation for CO2 predictions. (b) Ethanol production rates. All significant ethanol production rates are correctly predicted.
3.1. Conserved biomass coefficients across different glucose supply rates
For the wild type strains grown at different dilution rates, we apply our algorithm relative to the reference strain (Ref) mentioned above. In Fig. 3, the predicted production rates of the ten biomass precursors defined in the FBA model are plotted against the corresponding biomass coefficients. A linear correlation is observed across all dilution rates, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. In addition, the slope of the line determined by the aligned data points roughly reflects the growth rate of each wild type strain. For instance, the slope observed in Fig. 3d is roughly 1.8 times the one in Fig. 3b, which matches the fold change of growth rates (0.7 vs. 0.4 h⁻¹) between these two experiments. These results suggest that E. coli grown in glucose-supplied cultures applies robust metabolic objectives for biomass precursors, in agreement with the FBA optimal growth assumption for central carbon metabolism, regardless of glucose supply rate.
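The slope comparison can be illustrated with a short sketch. The biomass coefficients are those of Eq. (5); the production rates are synthetic values generated to be exactly proportional to the coefficients, not the predictions shown in Fig. 3, and the function name is hypothetical.

```python
# Precursor coefficients from the biomass reaction in Eq. (5)
coeff = {
    "g6p": 0.205, "e4p": 0.361, "3pg": 1.496, "oaa": 1.787, "akg": 1.079,
    "pyr": 2.833, "r5p": 0.898, "pep": 0.519, "g3p": 0.129, "f6p": 0.071,
}


def slope_through_origin(x, y):
    """Least-squares slope k of the line y = k * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


x = list(coeff.values())
# Synthetic production rates: coefficient times growth rate (assumption)
rates_b = [0.4 * c for c in x]  # stands in for the 0.4 h^-1 culture
rates_d = [0.7 * c for c in x]  # stands in for the 0.7 h^-1 culture

slope_b = slope_through_origin(x, rates_b)
slope_d = slope_through_origin(x, rates_d)
fold_change = slope_d / slope_b  # ratio of growth rates, 0.7 / 0.4
```

With perfectly proportional data the fold change equals 0.7/0.4 = 1.75, which is the value the observed factor of roughly 1.8 is compared against.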
3.2. Influence of single gene deletion in pentose phosphate pathway
The pentose phosphate pathway is responsible for generating NADPH and nucleotides. A perturbation in the pentose phosphate pathway could change the levels of NADPH and nucleotides and may result in less efficient growth. To study the changes of metabolic objectives caused by the deletion of pentose phosphate pathway genes, we applied our algorithm to three single gene deletion mutants, Δzwf (glucose 6-phosphate 1-dehydrogenase), Δpgl (6-phosphogluconolactonase) and Δgnd (6-phosphogluconate dehydrogenase), relative to the Ref state. Our goal was to see whether considerable changes of biomass coefficients or rerouting of fluxes could be detected.
Fig. 3. Predicted production rates for biomass precursors in the wild type strain under different dilution rates. Units for both axes are millimoles per gram dry weight per hour (mmol/gDW/h). The y coordinate of each data point represents the predicted production rate for the corresponding biomass component, and the x coordinate is the expected biomass coefficient taken from the FBA model (see Eq. (5)). E. coli is cultured at dilution rates of 0.1 h⁻¹ (a, GR01), 0.4 h⁻¹ (b, GR02), 0.5 h⁻¹ (c, GR03) and 0.7 h⁻¹ (d, GR04), respectively [5]. The slope of the data line roughly reflects the growth rate for each experiment. Pyruvate (PYR) coefficients in GR02 and in GR03 are erroneously predicted to be zero. This might be related to the less accurate prediction of the CO2 production rate, since several reactions consuming or producing pyruvate generate CO2.
Fig. 4. Map of the initial reactions in the pentose phosphate pathway (diagram labels: g6p, 6pgl, 6pgc, r5p and x5p; reactions zwf and pgl; NADPH produced at two steps; regions marked Glycolysis and Pentose Phosphate Pathway). zwf and gnd are responsible for NADPH production to generate reducing power for growth.
Fig. 4 illustrates the reactions knocked out from the pentose phosphate pathway in our computational study. Detailed predictions for biomass components and several key fluxes are shown in Table 1. Note that all mutants were grown in chemostat cultures at the same dilution rate (0.2 h⁻¹) as the wild type strain (column Ref in Table 1). Hence they can all be considered to grow at the same rate. The predicted production rates for the different biomass precursors can therefore be directly compared between different mutants, and relative to the corresponding biomass composition coefficients used in FBA calculations, appropriately normalized. Our results show that, across different strains, most production rates for biomass precursors remain proportional to the biomass coefficients themselves. However, individual deviations from this trend can be seen. Fig. 5 shows the predicted production rates for biomass precursors in wild type and mutants. Amino acid and nucleotide precursors (e4p bar to pep bar) tend to be over-produced in Δpgl and under-produced in Δgnd, compared to the Ref strain. The measured dry weights for Δpgl and Δgnd show the same trend as the production rates for these biomass precursors. One explanation for the deviations is that these mutants may not grow at exactly the same rate, due to possible experimental error, since the dry weight matches the under/over-production trend. Another interpretation would be that these mutants reprogram their fluxes differently in response to gene deletion. However, more investigation would be required to draw a clear conclusion.
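The over- and under-production trend can be checked with simple ratios. The values below are transcribed from Table 1 under the assumption that its strain columns are ordered Ref, Δzwf, Δpgl, Δgnd; the dictionary layout is only illustrative.

```python
# Predicted production rates (mmol/gDW/h) as (Ref, pgl mutant, gnd mutant),
# transcribed from Table 1; the column assignment is an assumption.
rates = {
    "e4p": (0.089, 0.101, 0.076),
    "3pg": (0.299, 0.343, 0.257),
    "oaa": (0.398, 0.454, 0.345),
    "akg": (0.243, 0.279, 0.210),
    "pyr": (0.615, 0.702, 0.000),
    "r5p": (0.198, 0.225, 0.169),
    "pep": (0.109, 0.124, 0.094),
}

# Ratio of each mutant's rate to the Ref rate, per precursor
ratios = {name: (pgl / ref, gnd / ref) for name, (ref, pgl, gnd) in rates.items()}

over_in_pgl = all(r_pgl > 1.0 for r_pgl, _ in ratios.values())
under_in_gnd = all(r_gnd < 1.0 for _, r_gnd in ratios.values())
```

Both flags evaluate to true for these seven precursors, consistent with the trend described for Fig. 5.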
Fig. 5. Left panel: predicted production rates for biomass precursors (mmol/gDW/h) in wild type and mutants under the same dilution rate (0.2 h⁻¹), for the series FBA model, Ref, Δzwf, Δpgl and Δgnd and the precursors g6p, e4p, 3pg, oaa, akg, pyr, r5p, pep, g3p and f6p. Right panel: measured dry weight (g/L) for these strains.
NADPH serves as the electron donor in reductive biosynthesis. Gene deletions in the pentose phosphate pathway perturb NADPH levels and further cause oxidative damage to the mutants [8]. One possible response for these mutants is to reroute their fluxes and generate NADPH from NADP in other pathways. Our results suggest that these mutants may use another strategy for replenishing the NADPH level. As shown in Table 1, all three mutants are predicted to have higher PntAB transhydrogenase activity, suggesting that mutants may replenish NADPH by converting NADH into NADPH. This prediction supports the previous suggestion that PntAB transhydrogenase plays an important role in generating NADPH in E. coli [10]. The predicted PntAB flux ratio for Δzwf/Ref (1.55) agrees with the previously reported Δzwf/wild-type mRNA ratio (about 1.7) [10].

In contrast to the predicted robust metabolic requirements for biomass precursors, the cofactor requirements vary. NAD and NADPH requirements per unit of biomass production show a considerable increase in Δzwf and Δpgl. It is not clear how to interpret the increase of redox requirements in these mutants biologically. One possible explanation is that the gene deletions result in a redox imbalance to which the regulatory networks react, directing the network operation for central carbon metabolism in unusual ways.

In all cases analyzed here, ATP coefficients are predicted to be zero. This is because ATP synthase is present in the FBA model, and we have no information about the proton and phosphate uptake rates. Hence, the reverse ATP synthase flux is indistinguishable from the sink flux of ATP into biomass, and the ATP synthase flux actually contains the ATP biomass production (but in the opposite direction). When we block ATP synthase in the model (set its flux to zero), the resulting ATP biomass coefficient equals the absolute value of the ATP synthase flux listed in Table 1. However, this is not enough to draw a conclusion at this point on the actual ATP biomass coefficient for each strain. This issue could be examined in detail in the future, with more experimental information.

4. Discussion
We proposed an FBA-based approach to infer the biomass compositions that best describe multiple physiological states of a cell. Our results show that E. coli maintains robust biomass coefficients for biomass precursors in central carbon metabolism pathways in glucose-supplied medium, ranging from an almost glucose-starved state to a nearly unlimited glucose supply. This result suggests that E. coli operates its central carbon metabolism pathways with the same biomass objective, in agreement with optimal growth criteria in glucose-supplied medium. One should keep in mind that this result might be partially biased by the fact that experimental inference of fluxes requires fitting to a stoichiometric model that usually involves a biomass production flux as well. Our predictions for mutants indicate an increased usage of the PntAB transhydrogenase flux, suggesting another potential strategy for the mutants to compensate for the less efficient NADPH production caused by single gene deletion in the pentose phosphate pathway.

Some of our flux predictions cannot be fully understood, partly due to the use of an incomplete model (as opposed to a genome-scale one) and partly due to potential experimental errors in the flux measurements. For instance, if we had information about Ubiquinone-8 associated fluxes, we could correct the missing information in the current model and improve the accuracy of the oxygen uptake rate prediction. On the other hand, it would be difficult to apply a genome-scale E. coli FBA model in our study, since the experimental data are limited to central carbon metabolism pathways. At present, large-scale flux measurements are still unavailable, due to experimental difficulties. One way to overcome this limitation would be to take advantage of other types of high-throughput data. Our method is designed to incorporate not only flux measurements, but also other high-throughput data as the reference vector E in Eq. (4), such as mRNA expression or protein levels for two distinct physiological states.

In ongoing work, we are applying our method to time series data such as gene expression along the cell cycle, to provide insights into the physiology of cellular growth. This will allow us to learn more about how living organisms organize their biomass requirements and manage energy or redox balance during their life cycle. The method should provide insights into how a cell allocates its metabolic resources in a time-dependent and condition-specific manner, and can be extended to integrate multiple data sources with FBA models, to shed new light on the system-level organization of metabolic networks.
Biomass coefficients (dilution rate in h⁻¹ given in parentheses)

Biomass        FBA      Ref      Δzwf     Δpgl     Δgnd     GR01*    GR02*    GR03*    GR04*
component      model*   (0.2)    (0.2)    (0.2)    (0.2)    (0.1)    (0.4)    (0.5)    (0.7)

Biomass precursors
g6p            0.043    0.043    0.044    0.050    0.038    0.040    0.043    0.042    0.044
e4p            0.076    0.089    0.089    0.101    0.076    0.080    0.086    0.085    0.088
3pg            0.314    0.299    0.296    0.343    0.257    0.272    0.288    0.292    0.297
oaa            0.375    $0.398   0.394    0.454    0.345    0.364    0.384    0.387    0.396
akg            0.226    0.243    0.242    0.279    0.210    0.222    0.235    0.236    0.241
pyr            0.594    0.615    0.610    0.702    0.000    0.562    0.000    0.000    0.613
r5p            0.188    0.198    0.197    0.225    0.169    0.182    0.190    0.191    0.196
pep            0.109    $0.109   0.108    0.124    0.094    0.098    0.104    0.106    0.107
g3p            0.027    0.033    0.032    0.037    0.029    0.030    0.033    0.032    0.033
f6p            0.015    0.021    0.022    0.024    0.018    0.018    0.020    0.021    0.021

Cofactors
#atp           11.684   0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
nad            0.744    23.631   40.600   63.507   25.496   0.000    23.516   0.000    0.000
nadph          3.823    31.252   45.218   61.827   34.733   12.992   28.807   12.322   21.787
accoa          0.786    0.825    0.817    0.941    0.000    0.752    0.000    0.000    0.000

Specific reactions          Ref      Δzwf     Δpgl     Δgnd     GR01*    GR02*    GR03*    GR04*

NADPH->NADH                 0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
£NADH->NADPH                27.713   £42.818  £57.991  £32.026  9.538    25.244   9.618    15.141
©NADH->NAD                  9.279    14.506   21.057   9.053    2.526    8.043    ©1.146   ©[?].406
#ADP->ATP (ATP synthase)    -5.906   -7.114   -6.839   -8.288   -5.030   -5.695   -6.455   -9.590
EX_ac                       0.000    0.000    0.000    1.245    0.000    1.389    1.597    1.371
EX_akg                      0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
EX_co2                      8.251    9.747    9.482    9.456    7.512    6.128    6.311    12.380
EX_etoh                     0.000    0.013    0.000    0.000    0.000    0.000    0.058    0.044
EX_for                      0.000    0.000    0.000    0.532    0.000    0.594    0.600    0.000
EX_fum                      0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
EX_glc                      -2.934   -3.178   -3.361   -2.922   -2.676   -2.525   -2.653   -3.813
EX_h2o                      4.252    8.711    15.300   2.103    -2.172   2.939    -3.947   -6.669
EX_h                        9.924    6.903    0.955    12.474   15.098   8.902    16.164   25.450
EX_lac_D                    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
EX_o2                       -5.792   -8.765   -11.867  -6.006   -2.250   -4.785   ©-1.408  ©-2.786
EX_pi                       -0.792   -0.788   -0.904   -0.681   -0.722   -0.763   -0.769   -0.787
EX_succ                     0.000    0.000    0.000    5.2E-6   0.000    0.000    0.000    0.000

Table 1. Predicted biomass production and important fluxes (mmol/gDW/h). Negative values refer to uptake fluxes. *Normalized to the same scale as the Ref column for comparison. £PntAB transhydrogenase activities increase in all three mutants. #The ATP biomass coefficient would be the absolute value of the ATP synthase flux if we blocked the ATP synthase reaction in the model. $One flux pair (Ref and Δzwf) fails to predict the correct value for oaa and pep (results not shown). This is due to an erroneous prediction for a single reaction, ppc (phosphoenolpyruvate carboxylase), which converts pep and CO2 into oaa. The deviation of ppc fluxes in the two predictions of Ref biomass (Ref vs. Δzwf and Ref vs. Δgnd) matches the deviation of CO2 production rates. In addition, the flux measurement for ppc has a large standard deviation [5]. ©The NADH dehydrogenase flux seems to be under-predicted in GR03 and GR04 due to the imprecise oxygen uptake prediction.
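As a small consistency check, the PntAB (NADH->NADPH) fluxes in Table 1 can be converted to ratios relative to Ref and compared with the Δzwf/Ref value of 1.55 quoted in Section 3.2. The numbers are transcribed from the table; the assignment of the Δpgl and Δgnd values to columns is an assumption about the original layout.

```python
# PntAB (NADH->NADPH) fluxes in mmol/gDW/h, transcribed from Table 1.
pntab_flux = {"Ref": 27.713, "zwf": 42.818, "pgl": 57.991, "gnd": 32.026}

# Mutant-to-reference flux ratios
pntab_ratio = {
    strain: flux / pntab_flux["Ref"]
    for strain, flux in pntab_flux.items()
    if strain != "Ref"
}

all_mutants_higher = all(r > 1.0 for r in pntab_ratio.values())
zwf_ratio = pntab_ratio["zwf"]  # about 1.545, quoted as 1.55 in the text
```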
Acknowledgements
The authors would like to thank Evan Snitkin, Niels Klitgord and William Riehl for discussion and critical reading of the manuscript. Linear programming calculations were performed using the software Xpress, kindly provided by Dash Optimization under a free academic license. This work is supported by research grants from the US National Institutes of Health (5012846-00) and the US Department of Energy (DE-FG02-07ER64388 and DE-FG02-07ER64483).

References
[1] Bonarius, H.P.J., Hatzimanikatis, V., Meesters, K.P.H., et al., Metabolic flux analysis of hybridoma cells in different culture media using mass balances, Biotechnol Bioeng, 50(3):299-318, 1996.
[2] Burgard, A.P. and Maranas, C.D., Optimization-based framework for inferring and testing hypothesized metabolic objective functions, Biotechnol Bioeng, 82(6):670-7, 2003.
[3] DeRisi, J.L., Iyer, V.R. and Brown, P.O., Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, 278(5338):680-6, 1997.
[4] Gianchandani, E.P., Oberhardt, M.A., Burgard, A.P., et al., Predicting biological system objectives de novo from internal state measurements, BMC Bioinformatics, 9:43, 2008.
[5] Ishii, N., Nakahigashi, K., Baba, T., et al., Multiple high-throughput analyses monitor the response of E. coli to perturbations, Science, 316(5824):593-7, 2007.
[6] Johnston, M. and Carlson, M., The Molecular Biology of the Yeast Saccharomyces: Gene Expression, 1992.
[7] Knorr, A.L., Jain, R. and Srivastava, R., Bayesian-based selection of metabolic objective functions, Bioinformatics, 23(3):351-7, 2007.
[8] Minard, K.I. and McAlister-Henn, L., Antioxidant function of cytosolic sources of NADPH in yeast, Free Radic Biol Med, 31(6):832-43, 2001.
[9] Palsson, B.O., Systems Biology: Properties of Reconstructed Networks, Cambridge University Press, 2006.
[10] Sauer, U., Canonaco, F., Heri, S., et al., The soluble and membrane-bound transhydrogenases UdhA and PntAB have divergent functions in NADPH metabolism of Escherichia coli, J Biol Chem, 279(8):6613-9, 2004.
[11] Schuetz, R., Kuepfer, L. and Sauer, U., Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli, Mol Syst Biol, 3:119, 2007.
[12] Segre, D., Vitkup, D. and Church, G.M., Analysis of optimality in natural and perturbed metabolic networks, Proc Natl Acad Sci USA, 99(23):15112-7, 2002.
[13] Uygun, K., Matthew, H.W. and Huang, Y., Investigation of metabolic objectives in cultured hepatocytes, Biotechnol Bioeng, 97(3):622-37, 2007.
[14] van Gulik, W.M. and Heijnen, J.J., A metabolic network stoichiometry analysis of microbial growth and product formation, Biotechnol Bioeng, 48(6):681-698, 1995.
[15] Varma, A., Boesch, B.W. and Palsson, B.O., Stoichiometric interpretation of Escherichia coli glucose catabolism under various oxygenation rates, Appl Environ Microbiol, 59(8):2465-73, 1993.
SUFFIX TECHNIQUES AS A RAPID METHOD FOR RNA SUBSTRUCTURE SEARCH

RAPHAEL A. BAUER 1,2,* raphael.bauer@charite.de
KRISTIAN ROTHER 3,4,* krother@genesilico.pl
JANUSZ M. BUJNICKI 3,4 iamb@genesilico.pl
ROBERT PREISSNER 1 robert.preissner@charite.de

1 Institute of Molecular Biology and Bioinformatics, Structural Bioinformatics Group, Charité Universitätsmedizin (Medical University), Arnimallee 22, 14195 Berlin, Germany
2 Graduate School: Genomics and Systems Biology of Molecular Networks, Monbijoustr. 2, 10117 Berlin, Germany
3 International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
4 Laboratory of Bioinformatics, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland

The RNA Ontology Consortium recently proposed a two-letter representation of the RNA backbone conformation. In this study, we compare the suite notation to a custom string representation that utilizes η-θ pseudotorsion angles. Both representations were used to assess similarity and self-similarity in several RNA structure datasets. For the detection of similarities between two RNA structures we utilize suffix techniques that allow for the detection of substructure similarity within some degree of inexactness. The suite representation as well as the pseudotorsion representation was tested on four diverse RNA datasets. The ability to detect structural similarities in these datasets allowed us to recover many homologous structural elements that have implications for further understanding of the RNA apparatus in Systems Biology. The software as well as the utilized datasets are freely available from http://suiterna.sourceforge.net.
Keywords: RNA; structural search; suffix array; suite encoding
1. Introduction
String-based approaches to RNA structure analysis are widely used as far as secondary structures are concerned. However, there have been few attempts to express 3D features in a string notation. Recently, the RNA Ontology Consortium [11] proposed a string representation for the conformation of RNA backbones. This in turn allows the use of classical string matching methodology to compare structural features. In this manuscript, we explore how suffix techniques can be used to find similar regions in RNA backbone strings.

*Both authors contributed equally to the paper.
R. A. Bauer et al.
RNA secondary structures are most commonly expressed in the dot-bracket grammar, which contains all nested Watson-Crick and wobble base pairs. This string notation is easy to handle, and has therefore been widely used to describe local motifs [10], for computational approaches comparing RNA sequences by tree grammars [16], and for aligning two or more sequences [4]. To distinguish subtle structural motifs, like the sarcin-ricin motif, RNAse P, pseudoknots, and tertiary interactions, this notation is not sufficient. These features depend on specific base pairing and stacking interactions, and on a specific arrangement of the RNA backbone.

The RNA Ontology Consortium has bundled efforts to describe RNA structures. It provides a platform where structural bioinformaticians can exchange ideas and discuss formal nomenclature. Systematic approaches to describe RNA tertiary structure have been started from many sides. A typology of base pairs, the basic unit of which RNA is built, was defined [19]. This made it possible to identify interchangeable pairs of base-base interactions (known as the isostericity principle) [12]. Stacking is conceived as a major stabilizing force, and two complementary typologies have been introduced [13]. To describe larger local structural units, circular topologies (residues interconnected by backbone, base-pair or stacking interactions) have been introduced. Assembly of these building blocks has been used successfully in constructing tertiary structures, given that the topology is known or well predicted [15]. Jane Richardson et al. created a string representation of the RNA backbone [17], in which the backbone conformation of ribose-to-ribose 'suite' units is represented by two letters.

To analyze the RNA backbone, the most significant features are the torsion angles. For each residue, there are six of them, one for each bond from one phosphodiester unit to the next. These torsion angles show a characteristic distribution. More distinct clusters of the torsions can be found if RNA 'suites' (units from one ribose to the next) are considered instead of the traditional phosphate-phosphate units [14]. Each suite consists of seven torsion angles, including both C4'-C3' bonds. The torsion angles were clustered, each cluster being defined as a hyperellipsoid in the 7D space formed by the seven torsions of one suite. In total, 46 distinct conformations of the backbone were identified. For each cluster, a two-character code was assigned. The first character corresponds to the first three torsion angles, and the second to the other four. Thus, it is possible to write an entire RNA 3D structure as a 1D string representing the backbone.

The main disadvantage of the suite representation is that its scope is limited to well-defined backbones. For a high-quality dataset, it covers 90-95% of the residues in RNA structures. The other residues are disregarded either because one of the backbone torsions is outside well-defined boundaries, or because the suite is not close enough to any of the hyperellipsoids in 7D space. Most of the unassigned residues are in flexible regions with a high temperature factor, or they belong to conformations that are too sparsely populated to form a separate cluster.

An alternative description of the RNA backbone is based on pseudotorsion angles. For this, the RNA structure is reduced to C4' and P atoms, similar to the Cα trace of proteins. Between these atoms, two pseudotorsions η and θ are defined.
Even though this description is more coarse-grained, the η-θ angles encode important features such as the sugar pucker to a satisfying degree. The Amigos program can be used to calculate pseudotorsions [6]. The P and C4' atoms are frequently used to construct an initial backbone trace in X-ray crystallography. Recently, it was reported that using P-C1' pseudotorsions improves the assignment of the backbone and ribose to electron density maps (K. Keating, personal communication), but it was not explored how these pseudotorsions map to other structural features.

It is very tempting to utilize these backbone representations to compare local structures of RNA to each other. Only a few tools are available to compare RNA structures. Most of them are based on secondary structures and use the dot-bracket grammar. Among them, RNAforester [16], Vienna [9] and ARTS [5] are the most common. Recently a webserver (SARSA) was released [3] that uses a custom vector quantization to cluster the RNA bases into 23 distinct conformers that are translated into a string representation. SARSA subsequently applies traditional string alignments to find similar motifs. SARSA is especially useful when applied to multiple alignments of RNA structures; however, a search against a database of RNA structures is not supported. The RNAFRABASE web site (http://rnafrabase.ibch.poznan.pl/) contains a large number of loop fragments from RNA structures, but it is very limited both in the kind of fragments contained and in the possible search methodology. To our knowledge, no method exists that allows fast queries for similar RNA substructures against a database.

Therefore, we decided to use string representations of the RNA backbone in order to take advantage of existing algorithmic solutions for efficient string search. Alternatively, we calculate a pseudotorsion representation based on η-θ angles. To cope with the thousands of motifs and thousands of RNA structures available, we use a suffix technique [7] that holds all information in an index and can be traversed almost linearly. The main objectives of this work are, in brief:
(1) Verification of the applicability of the RNA Ontology Consortium suite code, by examining the suites of differently structured RNA.
(2) Presentation of a suffix method to compare RNAs to each other, giving an overview of which structures and substructures are similar.
(3) Discussion of possible alternatives (regarding the structure-to-string coding and the search algorithms used) and applications.
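The idea of a suffix index over a backbone string can be sketched as follows. This is a naive illustration with exact matching only, not the implementation used in this work; the function names and the toy suite string are assumptions, and the inexact matching mentioned in the abstract is not covered.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern in text, found by binary search on the index."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


def suite_aligned(positions, code_length=2):
    """Keep only hits that start on a suite boundary (two characters per suite)."""
    return [p for p in positions if p % code_length == 0]


# Toy backbone string made of two-character suite codes (illustrative only)
backbone = "1a1a1c7r1a1a1g"
index = build_suffix_array(backbone)
hits = suite_aligned(find_occurrences(backbone, index, "1a1a"))
```

The index is built once per dataset, after which each query costs a binary search rather than a scan over all structures. A production index would use a linear-time construction and a separator character between structures.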
2. Methods
We constructed suffix arrays from strings consisting of the RNA Ontology Consortium suite codes for four different datasets: motifs from the SCOR database, all tRNA structures, a high-resolution dataset, and the representative RNADB05 set. Each of them was then queried for matching subsequences in the suffix array to detect structural similarities. As an alternative approach, strings representing η-θ angles of the RNA backbone were constructed and processed in the same way.
2.1. Datasets used
2.1.1. SCOR dataset
First, we wanted to know whether known RNA motifs annotated in SCOR can be recovered by the suite representation. SCOR is a database containing 15,945 structural, functional and tertiary interaction motifs that have been annotated manually [18]. A hierarchical classification inspired by the SCOP database [1] has been established, but the database has not been updated since 2004. Therefore, a reliable automatic recognition of motifs could be useful. Currently, no such procedure is available; the circular motif library of the MC-Sym program probably comes closest [15]. For this analysis, all 4,501 structural and 100 tertiary interaction motifs from SCOR (version 2.0.4) were used. The corresponding fragments of PDB structures had lengths of 2-11 suites for structural motifs and 4-60 suites for tertiary interaction motifs. This set was termed "SCOR". Functional motifs annotate entire RNAs; they were excluded here and are considered in the later datasets.
2.1.2. tRNA dataset
Second, as a positive control, we were interested in proving that a set of structurally highly conserved RNAs can be recognized by the suite representation. For this, the tRNA was chosen as one of the most conserved molecules in life. Although tRNA sequences started diverging even before the genetic code itself was fixed, and their structures are highly modified by post-transcriptional additions, all of them need a highly conserved tertiary structure in order to work in the translation machinery. Thus, it is not surprising that all example tRNAs from the PDB look the same from afar, and we were convinced that they should have very similar backbone conformations when represented as suites. To examine whether this hypothesis holds, all tRNA structures from the NDB database [2] were retrieved. The resulting tRNA set consists of 102 tRNA structures from all kingdoms of life and is termed "TRNA".
2.1.3. RNADB05 and HIRES sets
Third, we wanted to check for similarities among RNAs of different origin. This was done for two sets of RNA structures. One was the dataset used by Richardson et al. (termed RNADB05) [17]. The RNADB05 set is a manually refined representative set of 173 RNA structures from both X-ray and NMR experiments. The second set (HIRES) consists of 74 high-resolution X-ray structures. They were filtered from the PDB by applying the constraints resolution ≤ 2.5 Å and R-value ≤ 0.25. Structures with identical sequences, and sequences with fewer than four bases, were discarded.
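The HIRES selection criteria can be written down as a small filter. This is a sketch under assumptions: the record layout, the field names and the example entries are hypothetical, and "identical sequences discarded" is read here as keeping the first structure seen for each sequence.

```python
def filter_hires(entries, max_resolution=2.5, max_r_value=0.25, min_length=4):
    """Keep X-ray entries passing the resolution, R-value, length and uniqueness filters."""
    seen_sequences = set()
    kept = []
    for entry in entries:
        if entry["resolution"] > max_resolution or entry["r_value"] > max_r_value:
            continue  # not high-resolution enough
        if len(entry["sequence"]) < min_length:
            continue  # fewer than four bases
        if entry["sequence"] in seen_sequences:
            continue  # identical sequence already represented
        seen_sequences.add(entry["sequence"])
        kept.append(entry)
    return kept


# Hypothetical records; the identifiers are placeholders, not PDB codes
candidates = [
    {"id": "A", "resolution": 1.9, "r_value": 0.21, "sequence": "GGCUAAGCC"},
    {"id": "B", "resolution": 3.1, "r_value": 0.22, "sequence": "GGCGCAAGCC"},
    {"id": "C", "resolution": 2.0, "r_value": 0.20, "sequence": "GGCUAAGCC"},
    {"id": "D", "resolution": 1.5, "r_value": 0.18, "sequence": "GCA"},
    {"id": "E", "resolution": 2.5, "r_value": 0.25, "sequence": "CCGGUACCGG"},
]
hires_ids = [entry["id"] for entry in filter_hires(candidates)]
```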
2.2. Calculation of RNA backbone string representation
For each structure in each of these datasets, two strings were calculated: one using the suite representation and one based on the pseudotorsions. The same calculation is applied to structures that are queried against one of these datasets. The method to calculate suites from a structure was re-implemented according to the description in [17]. The seven torsion angles were calculated according to Figure 1 in 5' to 3' direction. Each suite was then assigned to one or none of the 46 suite clusters. First, suites are grouped according to their δ, δ-1, and γ angles to limit the number of clusters to be considered. Second, the 7D distances to the 7D hyperellipsoid of each cluster were calculated. If the suite was inside a hyperellipsoid, the name of that cluster was assigned to the suite. The extents of these hyperellipsoids vary depending on the cluster. In particular, some of the clusters partially overlap; in these cases the closest hyperellipsoid center was used.
Fig. 1. Definition of RNA suites. A suite stretches from one ribose unit to the next, involving seven dihedral angles along the RNA backbone. Note that the δ angle is used by two adjacent suites. In the suite encoding, the first three dihedral angles are represented by a number, the next four by a letter. The example is taken from the tRNA structure with PDB code 1b23.
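The two computational steps of the suite calculation, computing a torsion angle from four atom positions and testing a 7D angle vector against cluster hyperellipsoids, can be illustrated as follows. This is a simplified sketch, not the procedure of [17]: the function names are hypothetical, the ellipsoid test uses a plain scaled Euclidean distance, the preliminary grouping by δ, δ-1 and γ is omitted, and the cluster centers and widths in the usage example are made-up placeholders rather than real suite parameters.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four points (range -180..180)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def assign_suite(angles, clusters, outlier="oo"):
    """Return the name of the closest cluster whose ellipsoid contains `angles`.

    `clusters` maps a name to (center, half_widths), both 7-tuples in degrees.
    A scaled distance <= 1 counts as "inside"; otherwise the outlier code is
    returned. Overlaps are resolved by the smallest scaled distance, which is
    an assumption of this sketch.
    """
    best_name, best_dist = outlier, float("inf")
    for name, (center, widths) in clusters.items():
        dist = math.sqrt(sum((angle_diff(a, c) / w) ** 2
                             for a, c, w in zip(angles, center, widths)))
        if dist <= 1.0 and dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Usage with two made-up, overlapping clusters (placeholder values only).
toy_clusters = {
    "c1": ((60.0,) * 7, (20.0,) * 7),
    "c2": ((70.0,) * 7, (20.0,) * 7),
}
print(assign_suite((62.0,) * 7, toy_clusters))
```

Handling the angular wrap-around in `angle_diff` matters because torsions near +180° and -180° are neighbours.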
Even though Richardson et al. recommend not calculating suites for residues with a high B-factor or with clashes, we decided to include them anyway. This was done for two reasons. First, we wanted a continuous string representation for all RNA structures. This is particularly important considering that 5-15% of the residues cannot be assigned to suites, so that on average only short fragments of a structure would remain for the calculation. Second, we wanted to assess the number of errors that occur in a real-life dataset. There were four kinds of errors: missing atoms in the residue (resulting in a '--' suite code), a single torsion angle outside the boundaries defined in [17] (a so-called triaged residue, resulting in a 'tt' suite code), an outlier suite which is not
close to any cluster (resulting in an 'oo' suite code), and a close outlier inside a 4D hyperellipsoid but outside it in 7D space (resulting in a '!!' suite code). The second way to translate the 3D structure of an RNA into a sequence of characters is to calculate the η-θ pseudotorsion angles from the backbone atoms of the same residues as the suites. For η, this was the C4'(i)-P(i+1)-C4'(i+1)-P(i+2) dihedral, and for θ the P(i)-C4'(i)-P(i+1)-C4'(i+1) dihedral. Each of these angles was divided into 36 ten-degree bins, and an alphanumeric character was assigned to each bin. Thus, a single η-θ tuple, conceptually corresponding to the RNA suite, was represented by two characters as well. Only when any of the atoms defining a dihedral was missing was a '--' code assigned in place of the η-θ tuple.
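The η-θ binning can be sketched as below. The text only states that 36 ten-degree bins are mapped to alphanumeric characters; the concrete alphabet (digits followed by lowercase letters), the choice of 0° as the start of the first bin, and the function names are assumptions of this example.

```python
import math
import string

# Assumed alphabet: 10 digits + 26 lowercase letters = 36 symbols.
ALPHABET = string.digits + string.ascii_lowercase


def bin_char(angle_deg):
    """Map an angle in degrees to the character of its ten-degree bin."""
    # The final "% 36" guards against 360.0 produced by float rounding.
    index = int(math.floor((angle_deg % 360.0) / 10.0)) % 36
    return ALPHABET[index]


def encode_eta_theta(eta, theta):
    """Two-character code for one eta-theta tuple; '--' if an angle is missing."""
    if eta is None or theta is None:
        return "--"
    return bin_char(eta) + bin_char(theta)


def encode_chain(pairs):
    """Concatenate the codes of consecutive (eta, theta) tuples into a string."""
    return "".join(encode_eta_theta(eta, theta) for eta, theta in pairs)
```

Because '-' is not part of the alphabet, the missing-atom code cannot collide with a valid bin pair.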
2.3. Suffix tree and array implementation
Our studies were performed using a suffix array. Even simple implementations of suffix trees can search for a given substring in O(m), with m being the length of the search string; nevertheless, we used the slightly slower suffix array because of its smaller memory footprint. An algorithmic introduction to suffix trees and suffix arrays is given in [8]. The suffix array implementation we used can search in O(m log n), with m being the length of the search string and n the total length of the indexed strings. This is fast enough considering the absolute number of structures to index, even for all RNA structures in the PDB (currently 1500). In principle, a suffix array works in the following manner: To index a string s of length m, each suffix of s (starting at positions 0 to m-1) is put into an array. This array is then sorted alphabetically. Once the sorted array is established, a substring of s can be retrieved by binary search over the index, which yields the O(m log n) bound. A conceptual disadvantage of suffix techniques is that a substring search can only be performed in an exact manner. To overcome this disadvantage, we use n-grams to perform an inexact search and to score one input structure against a whole database. This similarity score (SCORE) is generated by searching all consecutive substrings of length n (n-grams) of the input string against the database.
\[ \mathrm{SCORE} = \frac{\text{number of matches found}}{\text{number of matches expected}} \tag{1} \]
This allows us to rank the best matching entries in the database and to generate an all-against-all ranking of the entities in one database. One drawback of this scoring scheme is that ubiquitous repeating substrings (like '1a1a1a1a') are found in nearly every entity in the database and therefore add a strong bias to the calculation. To avoid that, substrings consisting of repeating entities are excluded from the search.
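The indexing and scoring scheme of this section can be illustrated with a small, deliberately naive suffix array. This is a sketch and not the implementation used for the study: it stores whole suffix strings (quadratic memory), the class and function names are hypothetical, n is counted in characters, and "matches expected" is taken to be the number of non-repeating n-grams of the query, which is an assumption.

```python
from bisect import bisect_left
from collections import Counter

SUITE_WIDTH = 2  # each suite is encoded by two characters


class SuffixArray:
    """Sorted list of (suffix, name) pairs over several suite strings."""

    def __init__(self, texts):
        # Only suffixes starting at a suite boundary are indexed.
        self.suffixes = sorted(
            (text[i:], name)
            for name, text in texts.items()
            for i in range(0, len(text), SUITE_WIDTH)
        )

    def find(self, query):
        """Names of all indexed strings that contain `query`."""
        i = bisect_left(self.suffixes, (query,))  # binary search
        names = set()
        while i < len(self.suffixes) and self.suffixes[i][0].startswith(query):
            names.add(self.suffixes[i][1])
            i += 1
        return names


def ngrams(text, n):
    """Consecutive n-grams of a suite string, skipping single-suite repeats."""
    grams = []
    for i in range(0, len(text) - n + 1, SUITE_WIDTH):
        gram = text[i:i + n]
        suites = {gram[j:j + SUITE_WIDTH] for j in range(0, n, SUITE_WIDTH)}
        if len(suites) > 1:  # drop repeats such as '1a1a1a1a'
            grams.append(gram)
    return grams


def score(index, query_text, n):
    """SCORE per database entry: matching n-grams / n-grams searched."""
    grams = ngrams(query_text, n)
    if not grams:
        return {}
    found = Counter()
    for gram in grams:
        for name in index.find(gram):
            found[name] += 1
    return {name: count / len(grams) for name, count in found.items()}


# Usage with three made-up suite strings.
index = SuffixArray({
    "A": "1a1a1c2g1a1a",
    "B": "1a1c2g4d6d1a",
    "C": "1a1a1a1a1a1a",
})
scores = score(index, "1a1c2g4d", 4)
```

Running `score` for every entry of a database against the index built from the same database gives the all-against-all ranking mentioned in the text.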
Apart from the theoretical runtimes given above, the practical runtime of the n-gram search with the current suffix array implementation is below 5 seconds for an all-against-all search of the RNADB05 set (257 entries) on a commodity PC (dual core 2.2 GHz, 3 GB RAM).
3. Results
In this analysis, we systematically looked for similar backbone conformations, and then checked whether they occur in RNAs that are annotated in a similar way. We calculated the suite strings and η-θ binning strings for 4,950 structures in all datasets. In Table 1, the distribution of suite codes is shown.
Table 1. Ratio of suite codes, as they occur in the four datasets examined here. Each entry is the number of suites of a particular kind, divided by the total number of suites (including outliers) for the corresponding dataset.
Suite   TRNA     SCOR     RNADB05  HIRES
!!      0.0221   0.0167   0.0100   0.0094
--      0.0005   0.0015   0.0094   0.0343
0b      0.0007   0.0012   0.0020   0.0011
1[      0.0077   0.0065   0.0078   0.0057
1b      0.0110   0.0165   0.0202   0.0244
1e      0.0045   0.0063   0.0049   0.0068
1g      0.0226   0.0217   0.0127   Om05
1t      0.0046   0.0019   0.0025   0.0031
2[      0.0011   0.0057   0.0048   0.0026
2g      0.0056   0.0007   0.0015   0.0017
2o      0.0003   0.0007   0.0009   0.0020
3a      0.0045   0.0084   0.0038   0.0020
3d      0.0026   0.0056   0.0027   0.0011
4b      0.0023   0.0028   0.0045   0.0048
4n      0.0003   0.0013   0.0019   0.0017
5d      0.0009   0.0029   0.0019   0.0017
5n      0.0012   0.0023   0.0010   0.0009
5q      0.0004   0.0006   0.0005   0.0003
6g      0.0001   0.0032   0.0033   0.0028
6p      0.0003   0.0052   0.0044   0.0043
7d      0.0042   0.0046   0.0027   0.0011
7r      0.0012   0.0023   0.0017   0.0000
9a      0.0019   0.0052   0.0042   0.0051
tt      0.1120   0.0352   0.0543   0.0709
&a      0.0188   0.0252   0.0170   0.0119
0a      0.0007   0.0047   0.0041   0.0034
1L      0.0422   0.0252   0.0269   0.0201
1a      0.4504   0.5760   0.5943   0.6015
1c      0.0769   0.0426   0.0477   0.0471
1f      0.0098   0.0058   0.0044   0.0023
1m      0.0314   0.0177   0.0111   0.0071
1z      0.0001   0.0029   0.0023   0.0011
2a      0.0117   0.0110   0.0109   0.0122
2h      0.0017   0.0010   0.0019   0.0011
2u      0.0016   0.0005   0.0009   0.0020
3b      0.0004   0.0022   0.0022   0.0014
4a      0.0003   0.0026   0.0020   0.0020
4d      0.0042   0.0012   0.0017   0.0017
4p      0.0017   0.0021   0.0019   0.0014
5j      0.0007   0.0020   0.0016   0.0014
5p      0.0007   0.0008   0.0011   0.0009
6d      0.0057   0.0020   0.0030   0.0045
6j      0.0003   0.0008   0.0008   0.0006
7a      0.0078   0.0076   0.0043   0.0014
7p      0.0020   0.0029   0.0028   0.0034
8d      0.0005   0.0026   0.0020   0.0000
oo      0.1179   0.0766   0.0723   0.0590
As expected, the helical stem suite variants (1a, 1m, 1L, &a) are predominant. In the two representative datasets, the 1a suites account for up to 60% of all suites, and its three satellite clusters together contain another 5%. In SCOR these numbers are very similar, indicating that the 1a backbone conformation is apt to form many of the motifs annotated there (verified by visual inspection of the primary suite strings). In TRNA the proportion of 1a is lower (45%). This is a common feature of the tRNA fold, as the observation holds for all tRNA suite strings. In turn,
some of the other suites are more highly represented. In particular, 1L, 1c, 1m, 2g, 4d, 6d, and 1t seem to play an important structural role in tRNA. The total proportion of all four kinds of invalid suites ('tt', 'oo', '!!', and '--') is 25.25% in the tRNA set, 13.00% in SCOR, and 14.60% and 17.36% in the RNADB05 and HIRES datasets, respectively. At first, the latter seems surprising, because one would expect fewer errors in high-resolution structures. The percentage is partly caused by 3.4% of residues with missing atoms. The remaining 13.9% are caused by 'triaged' dihedral angles and by outlier suites for which no suitable cluster could be found. One interpretation is that these are unusual backbone conformations which are only visible at a better resolution; in low-resolution structures they probably get smoothed out by the refinement process. In SCOR, the number of invalid suites is much lower. It is clearly biased by the manual selection of motifs, which by definition must occur in well-defined regions. In the tRNA set, the high error rate was examined in more detail. It appears that the three loop regions contain many conformations that do not fit in any cluster (resulting in several 'oo' or 'tt' suites in a row for some structures). This can be a result of strong constraints on the structure during the refinement or of interactions with other molecules. In the high-resolution tRNA entry with PDB code 1ehz, the rate of triaged and outlier suites is lower than in the RNADB05 and HIRES sets, and the clusters of outliers do not occur. It is unclear whether modified bases contribute to the problem, but in the examined high-resolution structures they did not cause problems either. This observation indicates that lower-resolution RNA structures are to be treated with caution.
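The ratios of Table 1 and the proportion of invalid suites follow directly from the suite strings. The following minimal sketch shows the bookkeeping; the function names are hypothetical, and the two-character codes for the four kinds of invalid suites are the ones named in the text.

```python
from collections import Counter

INVALID_CODES = ("--", "tt", "oo", "!!")


def suite_ratios(suite_strings):
    """Ratio of each two-character suite code over all given suite strings."""
    counts = Counter()
    for text in suite_strings:
        counts.update(text[i:i + 2] for i in range(0, len(text), 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: count / total for code, count in counts.items()}


def invalid_fraction(ratios):
    """Summed ratio of the four kinds of invalid suite codes."""
    return sum(ratios.get(code, 0.0) for code in INVALID_CODES)


# Usage with two made-up suite strings.
ratios = suite_ratios(["1a1a!!1c", "tt1a--oo"])
print(ratios["1a"], invalid_fraction(ratios))
```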
3.1. Analysis of SCOR motifs
The 4,601 motifs from SCOR were divided into a 20% training set and an 80% test set. The training motifs were stored in the suffix array, and the test motifs were searched in it by all their substrings of 12 characters. One would assume that, for example, loops of a given type have similar backbone conformations. Therefore, we wanted to know which motifs can be identified this way, and whether they are distinct from other motifs. We counted how many motifs from the test set could be correctly identified based on matches of their suite strings. In Figure 2, the sensitivity and specificity of this analysis are given for each motif class separately. It turns out that the predictability of the SCOR motifs is low. While the specificity is above 0.6 for almost all classes examined, and at 1.0 for many of them, the sensitivity covers almost the entire range from zero to one. The reason is a high number of false negatives in each class. To find out where these come from, the suite strings of several classes were inspected in more detail. The '180 degree turn' class consists of 24 motifs. 17 of them are just two suites (three residues) long, all having the suite string '4b6p'. The remaining 7 contain five suites, which are small variations of '1a3a1g9a1a'. These two groups fully correspond
[Figure 2 plot: "Recognition of SCOR motifs by substring matching"; axis label: Sensitivity.]
Fig. 2. SCOR motifs recognized by substring matching. The entire set of SCOR motifs was divided into a 20% training set and an 80% test set. The number of correctly matched substrings of length 12 (or the entire motif, if it was shorter), the number of matches from different SCOR motifs, and the total number of motif pairs compared were used to calculate the sensitivity and specificity of the search.
to two homologous positions in different structures of the 23S rRNA (1874-1876 for the first, and 1789-1794 for the second). A similar effect can be observed for many other motifs like '3 non-WC base pair', 'About 90 Degree Turn With All Bases Simply Stacked', and 'Multiple Twist'. In other cases, like the 'Ustk stack swap' motif, even more variations can be found. On the positive side, the homologous motifs can be recognized well from as few as 2-4 suites, and their structures are conserved. As stated above, the manual selection of motifs probably facilitates this. No examples were found where two non-homologous motifs belonging to the same class could be identified on the basis of their suites alone. One reason for this observation is that the rules by which SCOR motifs have been annotated are based on individual decisions made by experts. It appears that the base pairing/secondary structure scheme that is specific to a particular motif class does not constrain the backbone strongly enough to allow a prediction. On the other hand, this implies that an independent set of frequently occurring conformations could exist in the RNA backbone that has not been described yet.
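The per-class sensitivity and specificity reported for the SCOR motif search can be computed as sketched below. This is an illustration using the standard one-versus-rest definitions, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); how exactly the counts were derived from substring matches in the study is not fully specified in the text, so the input format (one predicted class or `None` per test motif) and the function name are assumptions.

```python
def per_class_rates(results):
    """Sensitivity and specificity per class.

    `results` is a list of (true_class, predicted_class) pairs, where
    predicted_class is None if no match was found for the test motif.
    Returns {class: (sensitivity, specificity)}.
    """
    rates = {}
    for cls in {true for true, _ in results}:
        tp = sum(1 for t, p in results if t == cls and p == cls)
        fn = sum(1 for t, p in results if t == cls and p != cls)
        fp = sum(1 for t, p in results if t != cls and p == cls)
        tn = sum(1 for t, p in results if t != cls and p != cls)
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        rates[cls] = (sensitivity, specificity)
    return rates


# Usage with made-up class labels.
rates = per_class_rates([
    ("loop", "loop"),
    ("loop", None),
    ("turn", "turn"),
    ("turn", "loop"),
])
```

A class with many unmatched test motifs, as observed for most SCOR classes, shows up here as a low first value with a high second value.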
3.2. Similarities among tRNA
Next, a set of 102 tRNA structures with well-defined backbone structures was examined. Because all tRNA structures have a highly conserved tertiary structure, one would expect this to be reflected in the suite strings as well. In the TRNA dataset, several suites are over-represented compared to the RNADB05 and HIRES sets (particularly '6d', '2g', '7d', '1f', '1c' and '1L'). These
can be found in corresponding positions of most tRNAs. In Figure 3, we have locally aligned several D-loops from tRNA structures with the corresponding suite strings. While each backbone follows the loop along the same path, there are several small differences in the suite codes. These include local variants, often replacing one suite by one close to it in the 7D dihedral space (e.g. the '1a'-'1L' and '1m'-'1[' exchanges). The structures are also occasionally interrupted by outlier suites. These outliers are visible, but hardly distinguishable in the visualization. They do not alter the direction of the backbone and by no means disrupt the loop structure. Rather, it seems that many of them result from improper refinement or low structure quality, as high-resolution structures such as PDB codes 1ehz and 1b23 are less affected. One important consequence is that the suite codes are a very detailed description of tRNA backbone structure. They are apparently not suitable for describing a well-defined structure such as the D-loop in a general and unambiguous way: for the same loop trace, many combinations of suites are possible.
Fig. 3. The backbone of the dihydrouridine loops from the tRNA structures with PDB codes 1b23, 1efwC, 1gts, 1qf6, and 1qrs, superimposed by their backbone atoms. The labels indicate the residue numbers. The suite codes of the dihydrouridine loops are described in the table on the right. Outlier suites are underlined; valid but singleton suite codes at a given position are highlighted in bold.
Another observation is that up to half of the D-loop suites are of the '1a' type, which was described in [17] as the conformer forming 'A-form helices'. The D-loop contains a noncanonical base pair between residues 54 and 58, and two adjacent GC base pairs (53-61 and 52-62). Apart from that, many of the bases are involved in tertiary stacking (57, 58) and base pairing (59, 60) interactions. In total, the D-loop stem is more than a simple helix, showing that the abundant 1a suite can accommodate different structural roles. Although we did not attempt to align all structures explicitly, this seems feasible from these observations and can be expected to result in a consensus alignment
of suites. A more detailed analysis could be used to identify individual conformations of tRNA at a high level of detail. An all-against-all search of substrings of all tRNA suite strings was performed using the suffix array and the n-gram algorithm, as described in section 2.3. In Table 2, the numbers of hits found for different word lengths are given.
Table 2. Results of the all-against-all search in the TRNA, RNADB05, and HIRES datasets using the n-gram approach. The columns "hits" indicate how many exactly matching n-grams were found for the given word length. The columns "score" give the average score for these hits. The score is calculated as the sum of the inverse frequencies from Table 1 for the matching n-gram.
n-gram length   TRNA hits   TRNA score   RNADB05 hits   RNADB05 score   HIRES hits   HIRES score
4               6824        5.4          27978          6.1             10543        3.7
6               6732        13.0         22543          10.6            10111        7.1
8               6386        19.1         17674          14.9            8917         10.9
10              5381        24.5         13657          16.6            6497         16.5
12              3812        38.1         10436          20.4            4823         20.8
14              2817        60.9         6504           30.7            3321         34.3
16              1990        96.0         3554           45.2            2683         46.8
18              1542        140.3        2376           62.0            2001         59.2
20              1306        175.4        1443           86.7            1283         86.2
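The score column of Table 2 is described as the sum of the inverse suite frequencies of a matching n-gram. A minimal sketch of that calculation is given here; the function name is hypothetical, and the frequencies in the usage example are rounded illustrative values, not a reproduction of Table 1.

```python
def ngram_rarity_score(ngram, frequencies, width=2):
    """Sum of inverse frequencies of the suites in `ngram`.

    `frequencies` maps a two-character suite code to its ratio in the
    dataset. Codes with missing or zero frequency raise a ValueError,
    since they cannot occur in a matching n-gram of that dataset.
    """
    total = 0.0
    for i in range(0, len(ngram), width):
        code = ngram[i:i + width]
        freq = frequencies.get(code, 0.0)
        if freq <= 0.0:
            raise ValueError("no frequency available for suite " + code)
        total += 1.0 / freq
    return total


# Usage with illustrative frequencies (assumed values).
example_freqs = {"1a": 0.45, "1c": 0.08, "2g": 0.005}
common = ngram_rarity_score("1a1a1c", example_freqs)
rare = ngram_rarity_score("1a1c2g", example_freqs)
```

With such a weighting, an n-gram containing a rare suite like the illustrative '2g' scores far higher than one made mostly of the ubiquitous '1a' suite, which is why longer matches with uncommon suites dominate the high-score end of the table.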
The tRNA dataset is heterogeneous enough that, on average, only 69 other structures contain a sufficient number of matching n-grams. For the structures that are found, however, the number of words within one hit is high. With increasing word length, the number of hit structures decreases continuously. This is expected, as it gets increasingly difficult to find a longer word in the set of suite strings, because each of the occasional variations will disrupt the search for a local match. The number of words found within a structure drops correspondingly at first, but starts to rise again at a word length of 16 (data not shown). This can be explained by the fact that these hits occur only in a few highly similar tRNA structures, where little or no variation occurs. We therefore conclude that a word size of 12 or 14 is optimal for finding similarities within the set with as little background noise as possible, while not restricting the search to almost-identical structures. The outcome of the all-against-all search has been visualized in Figure 4 (TRNA depicted left). There, the normalized number of word hits for a given pair of structures is plotted. This indicates that an overall level of similarity exists between most pairs of tRNAs. The bright spots result from a small group of highly similar tRNA structures (the ones still remaining with word size 20). The dark regions (the line at 31, and several between 56 and 68) are structures with very low similarity. The structures in this region (among others, PDB codes 1yl4, 2ow8, 2v46, 3tra) were examined more closely. It turned out that these contain a much higher proportion
(up to 40%) of outlier and erroneous suite codes. Three of the examples here are structures of tRNAs bound to ribosomes, with resolutions of 3.7 Å or worse. The fourth (PDB code 3tra) is a free tRNA, but it has also been determined at an inferior resolution. This clearly shows that the suite nomenclature is of very limited use for structures that are not of high resolution.
Fig. 4. Scores of the all-against-all search in the a) TRNA (left), b) RNADB05 (middle), and c) HIRES (right) datasets. On each axis, the structures used are sorted according to their PDB code. The color indicates the score found for a particular pair of structures. The scaling was chosen such that dark areas correspond to repeating '1a' matches. The higher the score, the more uncommon suites a particular hit contains. The results shown here are for n-grams of length 12.
3.3. Similarities in the representative RNA sets
To assess whether these observations are meaningful, we compared both the 107 high-resolution structures and the 254 structures from the RNADB05 set. The number of hits found is given in Table 2. The corresponding similarity maps are depicted in Figure 4. First, some of the suite strings in the datasets were too short to match anything (empty rows/columns and an interrupted diagonal in the heat map). Also, both the HIRES and RNADB05 datasets contained a number of sequences with trivial structures, consisting of '1a' repeats and not much more. The scoring also depends on the length of the query string, and therefore the matrices need not be symmetric. In Figure 4, it is clearly visible that the overall number of structures in RNADB05 and HIRES with detected similarities drops more sharply than in the TRNA set. The total number of hits changes in the same way. Even though the RNADB05 set is larger, only few hit structures remain there at word size 20 (see also Table 2). One reason is that the average size of the structures in both reference datasets is smaller, as they contain many hairpin loops and other short RNAs. In both reference sets, the number of A-form helical stems (repeating regions consisting of '1a' suites) is higher, and they are practically excluded from the evaluation by the scoring function. This leaves only a fraction of hits in the reference sets compared to the tRNA set. In tRNA, not only does a higher number of hits exist, but the hits are also less random because they consist of less frequently occurring suites. This shows that the similarity among tRNAs is non-random, which can be taken as a proof of concept for the method. One structure in the RNADB05 set (rr0082H09, the 23S subunit of the ribosome) was matched by almost every other structure from this database. The structural variety in this single structure easily matches that of the remaining dataset taken together, and any motif found somewhere else is probably found there as well (see the white vertical line in Figure 4 for the RNADB05 dataset). Interestingly, when searching for a set of local RNA structures other than helical stems with either of the methods, we find non-homologous hits. This works for: a) an internal loop of the SRP and the ribosomal SSU, b) a biotin-binding pseudoknot and the tRNA, and c) a tRNA and the E-loop from 5S RNA.
4. Discussion
Geometrically, the suite representation does not cover variations that could occur in the bond lengths and flat angles of the RNA backbone. While bond lengths have a very narrow distribution throughout all structure files, bond angles show significant variation. This means that there is a degree of freedom that makes it impossible to rebuild RNA structures from a string, even if the suite nomenclature determined the dihedrals with perfect precision. There are two obvious ways to resolve this: (1) Encode the flat angles in a similar way as the suites. (2) Encode base-base interactions in the string in order to constrain the structure, and subsequently use a 3D modeling procedure. We believe that the second method is more promising, because it would include those interactions that shape the function of RNA instead of restricting the structure of RNA to the backbone alone. Such a reconstruction of structures from a descriptive grammar (not string-based) was already demonstrated in [15]. Another implication of this approach is that an RNA region without further constraints may be structurally flexible. Therefore, the second approach would indirectly encode flexibility. Having a rapid method for string-based motif recognition has a number of potential applications. First, it could be used to systematically find frequently occurring backbone motifs in RNA structures, as has been demonstrated here. Further, it can be used to sample large numbers of backbone conformations in order to generate native-like RNA backbones which could be modeled subsequently. Finally, it allows on-the-fly evaluation of RNA models which are generated during manual structure modeling or automatic refinement. The combination of this technique with more
elaborate string representations would bring further improvement. We therefore think it is possible to accurately re-model the structure of RNA from a string representation by including additional structural features like base pairs, base stacking, or even tertiary interactions, combined with energy minimization instead of extensive probing of the local conformational space. The η-θ binning approach was shown to produce too many different local conformations for effective substring matching. One could argue that the matching could be improved by decreasing the number of bins. However, it has been shown earlier that the pseudotorsion angles contain specific regions that are characteristic for some structural motifs [6]. Decreasing the number of bins would ignore these regions and would therefore be inaccurate. Therefore, either explicit clusters in the pseudotorsion space would have to be defined, or string matching techniques allowing for more inexact matches than the current suffix array would be necessary. We emphasize that a fuzzier search method could improve the usefulness of the suite codes as well. In particular, it could eliminate the adverse effects of the occasionally occurring erroneous or undefined suites. Practically, this could be implemented as a classical similarity matrix between the suite codes; as a starting point, its values could simply be based on a normalized 7D distance between the 46 suite clusters. Given the performance of the suffix array, the analysis presented here could easily be extended to the entire NDB [2]. Identifying structures that should be expected to be similar (e.g. based on their function) is more challenging if one does not want to rely on sequence similarity alone.
5. Conclusions
This work presented the first approach that uses an indexing technique to scan the structural space of RNA. The indexing was implemented using suite codes and an η-θ binning approach and tested on four distinct datasets. It could be shown that this approach can be used to rapidly identify similar substructures. This has applications not only for querying the RNA structure space but also for the modeling of RNAs, by rapidly predicting possible conformations and, in turn, by on-the-fly evaluation of proposed RNA models regarding structural and functional similarities. All datasets as well as the source code are freely available from http://suiterna.sourceforge.net. We hope this will be useful for the community and are looking forward to receiving feedback.
Acknowledgements
This effort is supported by DFG SFB-449, Deutsche Krebshilfe, the DFG (Deutsche Forschungsgemeinschaft) International Research Training Group (IRTG) on "Genomics and Systems Biology of Molecular Networks" (GRK1360) and the 6th Marie Curie EU Research Training Network "DNA Enzymes", grant no. MRTN-CT-2005-019566. Without the use of free and/or open source software this effort would not
have been possible.
References
[1] Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C., and Murzin, A. G., Data growth and its impact on the SCOP database: new developments. Nucleic Acids Research, 36(Database issue):419-425, January 2008.
[2] Berman, H. M., Westbrook, J., Feng, Z., Iype, L., Schneider, B., and Zardecki, C., The Nucleic Acid Database. Acta Crystallographica Section D, 58(6 Part 1):889-898, June 2002.
[3] Chang, Y. F., Huang, Y. L., and Lu, C. L., SARSA: a web tool for structural alignment of RNA using a structural alphabet. Nucleic Acids Research, 36(Web Server issue):W19-W24, May 2008.
[4] Dowell, R. D. and Eddy, S. R., Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7:400+, September 2006.
[5] Dror, O., Nussinov, R., and Wolfson, H. J., The ARTS web server for aligning RNA tertiary structures. Nucleic Acids Research, 34(Web Server issue), July 2006.
[6] Duarte, C. M. and Pyle, A. M., Stepping through an RNA structure: a novel approach to conformational analysis. Journal of Molecular Biology, 284(5):1465-1478, December 1998.
[7] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction. Algorithmica, 19(3):331-353, November 1997.
[8] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.
[9] Hofacker, I. L., Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429-3431, July 2003.
[10] Hofacker, I. L., Bernhart, S. H., and Stadler, P. F., Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14):2222-2227, September 2004.
[11] Leontis, N. B., Altman, R. B., Berman, H. M., Brenner, S. E., Brown, J. W., Engelke, D. R., Harvey, S. C., Holbrook, S. R., Jossinet, F., Lewis, S. E., Major, F., Mathews, D. H., Richardson, J. S., Williamson, J. R., and Westhof, E., The RNA Ontology Consortium: an open invitation to the RNA community. RNA, 12(4):533-541, April 2006.
[12] Lescoute, A., Leontis, N. B., Massire, C., and Westhof, E., Recurrent structural RNA motifs, Isostericity Matrices and sequence alignments. Nucleic Acids Research, 33(8):2395-2409, 2005.
[13] Lescoute, A. and Westhof, E., The interaction networks of structured RNAs. Nucleic Acids Research, 34(22):6587-6604, December 2006.
[14] Murray, L. J. W., Richardson, J. S., Arendall, W. B., III, and Richardson, D. C., RNA backbone rotamers: finding your way in seven dimensions. Biochemical Society Transactions, pages 485-487, 2005.
[15] Parisien, M. and Major, F., The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452(7183):51-55, 2008.
[16] Reeder, J., Höchsmann, M., Rehmsmeier, M., Voss, B., and Giegerich, R., Beyond Mfold: Recent advances in RNA bioinformatics. J Biotechnol, March 2006.
[17] Richardson, J. S., Schneider, B., Murray, L. W., Kapral, G. J., Immormino, R. M., Headd, J. J., Richardson, D. C., Ham, D., Hershkovits, E., Williams, L. D., Keating, K. S., Pyle, A. M., Micallef, D., Westbrook, J., and Berman, H. M., RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). RNA, 14(3):465-481, March 2008.
[18] Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E., and Holbrook, S. R., SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Research, 32(Database issue), January 2004.
[19] Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H., and Westhof, E., Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Research, 31(13):3450-3460, July 2003.
THE RELATIONSHIP BETWEEN FINE SCALE DNA STRUCTURE, GC CONTENT, AND FUNCTIONAL ELEMENTS IN 1% OF THE HUMAN GENOME
STEPHEN C. J. PARKER 1
[email protected]
ELLIOTT H. MARGULIES 2
[email protected]
THOMAS D. TULLIUS 1, 3
[email protected]
1 Graduate Program in Bioinformatics, Boston University, Boston MA 02215, U.S.A.
2 National Human Genome Research Institute, National Institutes of Health, Bethesda MD 20892, U.S.A.
3 Department of Chemistry, Boston University, Boston MA 02215, U.S.A.
GC content has been shown to be an important aspect of human genomic function. Beyond GC content alone, there is a class of regions in the genome that have especially high GC content and are enriched for the CG dinucleotide; these are called CpG islands. CpG islands have been linked to biologically functional genomic elements. DNA structure also contributes to biological function. Recent studies found that some DNA structural properties are correlated with CpG island functionality [5, 14]. Here, we use hydroxyl radical cleavage patterns as a measure of DNA structure to explore the relationship between GC content and fine-scale DNA structure. We show that there is a positive correlation between GC content and the solvent-accessible structural properties of a DNA sequence, and that the strength of this correlation decreases as genomic resolution increases. We demonstrate that regions of the genome that have highly solvent-accessible DNA structure tend to overlap functional genomic elements. Our results suggest that fine-scale DNA structural properties that are encoded in the genome are important for biological function, and that the highly solvent-accessible nature of high GC content regions and some CpG islands may account for some of their functional properties.
Keywords: DNA structure; GC content; CpG islands; hydroxyl radical cleavage; functional element; human genome
1. Introduction
GC content, the fraction of G or C nucleotides within a given window, is variable across the human genome [17, 36]. This observed heterogeneity in sequence composition has been implicated as a marker for some functional genomic regions. One example is CpG islands, which are regions of the genome characterized by high GC content and enrichment of the CG dinucleotide [11]. CpG islands have been linked to many regulatory processes [7, 18, 24, 33, 37-39]. Beyond the primary order of nucleotides in a genome that is used to define GC content and CpG islands, the local structural profile of DNA has been implicated in a number of biological processes. Recent studies suggest that DNA structure is important for some of the same processes as CpG islands: namely DNA-protein interactions [20], promoter function [1, 29], epigenetically controlled gene regulation [4, 23, 32, 34, 40],
and DNase I hypersensitivity [14]. However, the precise relationship between GC content, fine-scale DNA structure, and genome function remains unclear.
A critical first step in assessing this relationship is the ability to predict the local DNA structural profile for genomic sequences. Hydroxyl radical cleavage patterns of DNA have been used to study structural properties for a wide variety of sequences [13, 19, 30]. The cleavage pattern of naked DNA is a reflection of an important structural parameter, the solvent-accessible surface area of the DNA backbone [2]. The cleavage pattern thus provides a high-resolution quantitative measure of the shape of the DNA backbone and how it varies with respect to its sequence. We have recently shown that using a database of experimentally-determined hydroxyl radical cleavage patterns, the cleavage pattern of any DNA sequence can be predicted with a high degree of accuracy [13].
Although GC content has recently been implicated in defining hydroxyl radical cleavage patterns of DNA [35], this analysis was conducted at a relatively low genomic resolution of 333 base pairs. Single-nucleotide, genome-scale DNA structure predictions are feasible [13], which makes exploring the relationship between GC content and fine-scale DNA structure possible. Since different DNA sequences can have similar local structural properties [10, 13], directly correlating GC content with DNA structure is an important experiment.
Results from the ENCODE Pilot Project provide a rich resource for functional annotations in 1% of the human genome [3]. These developments facilitate the investigation of the relationship between GC content, DNA structure, and functional elements in this 1% of the human genome.
Here, we compare GC content to DNA structure (measured as hydroxyl radical cleavage patterns) at various genomic resolutions, with an emphasis on fine-scale DNA structure.
We then measure the occurrence of significantly over-represented DNA structural motifs with known functional annotations. Our results show that GC content only weakly influences fine-scale DNA structure, and that local structural properties may be important in conferring biological functionality to genomic regions like CpG islands.
2. Materials and Methods
2.1. DNA sequence and functional annotation data sources
The DNA sequence for the NCBI build 36 (March 2006), hg18 version of the ENCODE regions within the human genome was downloaded from the UCSC genome browser (http://genome.ucsc.edu/ENCODE/) [21, 22]. We used the following functional annotations for comparisons with DNA sequence and structural features. All the annotations are available through the UCSC genome browser (see above), unless otherwise noted. For all analyses, the hg18 version of each annotation track was used.
• DNase I hypersensitive sites (DHSs) represent regions of open chromatin architecture where protein-DNA interactions occur. We used a Union set of DHSs derived from the human GM06990 cell line, as described in [3, 14].
• Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) is an alternative method used to locate regions of open chromatin. FAIRE sites are enriched for regulatory elements [12].
• Promoters were defined as the region 2.5 kilobases upstream from gene start sites. We used the GENCODE [16] gene track to define genes.
• Ancestral Repeats (ARs) are mobile elements that inserted before the common ancestor of most mammals. They are thought to be neutrally evolving and are therefore typically used to represent nonfunctional regions of the human genome [9, 15, 28, 31, 41]. We used the AR regions defined in [3].
• CpG islands are regions of the human genome with high GC content and higher-than-expected CG dinucleotide density. We used the CpG islands track from the UCSC genome browser, which was constructed using the CpG island definition described in [11].
• Evolutionarily constrained regions are areas of the human genome that are under purifying selection against nucleotide changes. For this analysis we used the 'moderate track', which is a summary of regions identified by multiple sequence alignment and constraint detection algorithms described in [3, 25].
• Transcription start sites used here are described in [3, 8].
• As a control, we constructed a 'random annotation' by randomly selecting 500 base pair intervals within the ENCODE regions. We repeated this process 1000 times to create the random annotation track used here. Since this annotation set was derived randomly, there should be no association with any given set of functional elements.
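The construction of the random annotation in the last item can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `random_annotation`, the dictionary of region lengths, and the region names are hypothetical, and the sketch assumes that intervals are drawn uniformly and independently (so they may overlap one another), a detail the description above leaves open.

```python
import random


def random_annotation(region_lengths, n_intervals=1000, width=500, seed=0):
    """Draw `n_intervals` random intervals of `width` bases.

    region_lengths maps a region name to its length in bases (an assumed
    input format). Regions are weighted by their number of valid start
    positions, so every possible interval is equally likely. Intervals are
    returned as (region, start, end) with 0-based, half-open coordinates.
    """
    rng = random.Random(seed)
    names = [n for n, length in region_lengths.items() if length >= width]
    weights = [region_lengths[n] - width + 1 for n in names]
    intervals = []
    for _ in range(n_intervals):
        name = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randrange(region_lengths[name] - width + 1)
        intervals.append((name, start, start + width))
    return intervals


# Hypothetical region names and lengths, for illustration only.
regions = {"region_A": 500_000, "region_B": 1_000_000}
track = random_annotation(regions)
```

Fixing the seed makes the control track reproducible, which is convenient when the same random annotation is reused across several comparisons.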
2.2. Local DNA structure prediction and GC content analysis
We used predicted hydroxyl radical cleavage patterns as a measure of local DNA structure. Hydroxyl radical cleavage patterns were predicted using the Sliding Tetramer Window algorithm described in [13] for all the ENCODE regions. After the cleavage intensity at each base was predicted, we averaged the cleavage values within a window for all possible windows within the ENCODE regions. For GC content analysis we calculated the fraction of G or C bases within all possible windows of various sizes within the ENCODE regions. To calculate CpG density we counted the observed number of CG dinucleotides within the same windows.
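The windowed quantities described above can be sketched with a running sum. This is an illustrative sketch only: the function names are hypothetical, the per-base cleavage values are taken as an already-predicted input (the Sliding Tetramer Window algorithm of [13] is not reproduced here), and CG dinucleotides are counted only when both bases lie inside the window, which is an assumption.

```python
def window_sums(values, size):
    """Sum of `values` over every window of `size` consecutive positions."""
    if size <= 0 or size > len(values):
        return []
    total = sum(values[:size])
    sums = [total]
    for i in range(size, len(values)):
        total += values[i] - values[i - size]  # slide the window by one base
        sums.append(total)
    return sums


def window_means(values, size):
    """Mean per-base value (e.g. predicted cleavage intensity) per window."""
    return [s / size for s in window_sums(values, size)]


def gc_fraction(seq, size):
    """Fraction of G or C bases in every window of `size` bases."""
    is_gc = [1 if base in "GC" else 0 for base in seq.upper()]
    return window_means(is_gc, size)


def cpg_counts(seq, size):
    """Number of CG dinucleotides lying entirely inside each window."""
    s = seq.upper()
    if size < 1 or size > len(s):
        return []
    if size == 1:
        return [0] * len(s)  # a single base cannot hold a dinucleotide
    # starts[i] is 1 when a CG dinucleotide begins at position i
    starts = [1 if s[i:i + 2] == "CG" else 0 for i in range(len(s) - 1)]
    # a window of `size` bases contains size - 1 dinucleotide start positions
    return window_sums(starts, size - 1)
```

All three functions return one value per window start, so for a given window size the GC, CpG, and cleavage series line up position by position.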
2.3. Annotation proximity and overlap statistics
To calculate the proximity of various windows to functional annotations, we computed the distance, in base pairs, from the closest base in a given window to the closest base from the nearest element in the specified annotation. To calculate the observed overlap statistics between different annotations, for example when comparing the regions in annotation X to the regions in annotation Y, we first computed the fraction of regions in annotation X that overlap any region from annotation Y. We then constructed a null distribution of the fraction of expected overlaps by using the block bootstrap method described in [3]. We calculated the mean and standard deviation from the null distribution to assess the statistical significance of the observed overlap. This allowed us to determine if the regions in annotation X overlap the regions in annotation Y significantly more or less than random expectation.
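The proximity and overlap quantities can be sketched as below. The function names are hypothetical and intervals are assumed to be 0-based and half-open. The block bootstrap of [3] is not reproduced: the null fractions are simply passed in as a list, and standardizing the observed fraction against their mean and standard deviation is one plausible reading of how significance was assessed.

```python
from statistics import mean, stdev


def distance(window, element):
    """Bases between the closest base of `window` and of `element` (0 if they overlap)."""
    gap_right = element[0] - (window[1] - 1)  # element lies to the right
    gap_left = window[0] - (element[1] - 1)   # element lies to the left
    return max(gap_right, gap_left, 0)


def distance_to_nearest(window, annotation):
    """Distance from a window to the nearest element of an annotation."""
    return min(distance(window, element) for element in annotation)


def overlap_fraction(xs, ys):
    """Fraction of regions in xs that overlap at least one region in ys."""
    if not xs:
        return 0.0
    hits = sum(1 for x in xs if any(distance(x, y) == 0 for y in ys))
    return hits / len(xs)


def z_score(observed, null_fractions):
    """Observed overlap fraction standardized against a null distribution."""
    return (observed - mean(null_fractions)) / stdev(null_fractions)
```

A large positive score indicates more overlap than expected under the null, and a large negative score indicates less.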
3. Results
3.1. Correlation between GC content and local DNA structure
Given the data reported in [35] that show a high correlation between GC content and mean hydroxyl radical cleavage patterns at a window size of 333 base pairs, we first sought to reproduce and supplement these results. We computed the Pearson correlation between GC content and mean hydroxyl radical cleavage for windows of size N, where N = {2, 3, 4, 5, 10, 20, 50, 100, 333, 500, 1000, 10000}, in the ENCODE regions. We observe a positive correlation between the size of a window and the strength of the correlation between GC content and hydroxyl radical cleavage (Figure 1A). That is, while large windows have a high correlation between GC content and mean hydroxyl radical cleavage, small windows, which reflect the fine-scale structure of DNA, do not.
To determine whether the above result is unique to the DNA in the ENCODE regions, we randomized all of the ENCODE sequences. We used a first order Markov model trained on the real ENCODE sequences to preserve all dinucleotide frequencies. The random sequences follow the same correlation trend as the real ENCODE sequences (data not shown), which suggests that the observed correlations are an inherent property of DNA and not an artifact of the ENCODE sequences chosen for this analysis.
We next focused on the relationship between CpG density and mean hydroxyl radical cleavage over windows of size N (Figure 1B). For equivalent values of N, the strength of the correlation between DNA structure and CpG density is less than for GC content (compare Figure 1B to Figure 1A).
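The sequence randomization with a first order Markov model can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, and a sampled chain of this kind reproduces the dinucleotide frequencies of the training sequence only in expectation, not exactly.

```python
import random
from collections import defaultdict


def train_first_order(seq):
    """Count how often each base is followed by each other base."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(seq, seq[1:]):
        counts[current][following] += 1
    return counts


def sample_sequence(counts, length, first, seed=0):
    """Sample a sequence whose transitions follow the trained counts."""
    if length < 1:
        return ""
    rng = random.Random(seed)
    out = [first]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:  # base never seen with a successor in training
            break
        bases = list(followers)
        weights = [followers[b] for b in bases]
        out.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(out)


real = "ACGTACGGTCA" * 20  # toy training sequence, for illustration only
model = train_first_order(real)
randomized = sample_sequence(model, len(real), real[0], seed=1)
```

Because each base is drawn conditionally on its predecessor, the randomized sequence can only contain dinucleotides that occur in the training sequence.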
[Figure 1, panels A and B; x-axis: Window size (bases)]

[Figure; x-axis: -log (LC50)]
Fig. 1. Distribution of the compounds according to their toxicity. The ratio of compounds is plotted against the -log (LC50) values as a measurement of toxicity.
Molecular weight
Figure 2 depicts the distribution of the molecular weight of natural compounds, drugs and toxic compounds. It is noteworthy that the drugs have the lowest weight, followed by the groups of slightly, medium, and highly toxic compounds, whereas natural compounds have intermediate weights.
Toxicity versus Potency
[Figure 2. Panel title: Molecular Weight; x-axis: Weight [g/mol]; series: TC 3-6, TC 6-9, TC >9, NC, Drugs]
Fig. 2. Distribution of the molecular weight of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
[Figure 3. Panel title: Molecular Weight; x-axis: Weight [g/mol]; series: TC 6-7, TC 7-8, TC 8-9]
Fig. 3. Detailed distribution of the molecular weight regarding the group of medium toxicity (-log (LC50): 6-9). (TC = toxic compounds)
Figure 3 shows a detailed distribution of the medium toxic compounds with respect to their molecular weight. This diagram reflects the same trend as shown in Figure 2. The slightly toxic compounds are characterized by a lower molecular weight compared to the more toxic compounds. These findings support the tendency that toxic compounds have a higher molecular weight than non-toxic compounds. In summary, the investigated groups of compounds differ in their molecular weight, forming a clear sequence: drugs, slightly toxic compounds, natural compounds, medium toxic compounds, and highly toxic compounds. Thus, a clear correlation between toxicity and molecular weight can be found. As drugs are designed as small molecules which can enter cells easily, these compounds are comparatively small. Among the highly toxic compounds are toxins such as valinomycin (Streptomyces fulvissimus) and halichondrin (Axinella sp.). These large compounds function by binding to receptors or forming pores in membranes and are therefore very effective, resulting in a high toxicity.
S. Struck et al.
Hydrogen bond donors and acceptors
Hydrogen atoms attached to a relatively electronegative atom gain a positive partial charge, which makes them very reactive. Thus, they act as hydrogen bond donors in the formation of a hydrogen bond to electronegative atoms such as fluorine, oxygen, or nitrogen, which serve as hydrogen bond acceptors. Hydrogen bond donors and acceptors are ideal components of toxic compounds due to their high reactivity. Therefore, toxic compounds are active even at very low concentrations by interacting with biological macromolecules such as enzymes or cellular receptors.
[Figure 4. Panel title: H-Bond Acceptors; x-axis: Acceptors [n], 0 to >20; series: TC 3-6, TC 6-9, TC >9, Drugs, NC]
Fig. 4. Distribution of the amounts of hydrogen bond acceptors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
[Figure 5. Panel title: H-Bond Donors; x-axis: Donors [n], 0 to >20; series: TC 3-6, TC 6-9, TC >9, Drugs, NC; with a small inset diagram]
Fig. 5. Distribution of the amounts of hydrogen bond donors of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic). The small diagram shows a detailed distribution of the amounts of hydrogen bond donors regarding the group of medium toxicity (-log (LC50): 6-9).
To test this supposition, the number of hydrogen bond donors and acceptors was compared between toxic compounds, natural compounds, and drugs (Figures 4 and 5). The natural compounds, the slightly and medium toxic compounds, and the drugs have very similar numbers of hydrogen bond acceptors and donors, ranging between three and six hydrogen bond acceptors and between zero and two hydrogen bond donors. The lowest number of hydrogen bond acceptors was found within drugs, as they are chemically designed to fulfill Lipinski's rule of five [15]. According to this rule, they should comprise not more than 10 hydrogen bond acceptors in order to have adequate ADME properties [16]. In contrast, the group of highly toxic compounds shows both more hydrogen bond donors and more acceptors. Across the groups of slightly, medium, and highly toxic compounds, the number of hydrogen bond acceptors and donors rises. This was confirmed by a more detailed investigation of the medium toxic compounds, which show the same trend regarding the hydrogen bond acceptors (data not shown) and donors (Figure 5, small graph). For the hydrogen bond acceptors, the same sequence of compound groups is found as for the molecular weight: the drugs feature the fewest hydrogen bond acceptors, followed by the slightly toxic compounds, natural compounds, and the medium toxic compounds, concluding with the highly toxic compounds as the group with the most hydrogen bond acceptors. The same order occurs for the hydrogen bond donors, except that the natural compounds show the fewest hydrogen bond donors and the drugs follow the slightly toxic compounds. Thus, the assumption was confirmed that the more toxic a compound is, the more hydrogen bond donors and acceptors are found in its structure.
3.2. Functional properties
The distribution of functional groups in toxic compounds, drugs and natural compounds was analyzed and is depicted exemplarily in Figure 6. The occurrence of functional groups clearly rises with increasing toxicity, whereas the natural compounds and the drugs exhibit frequencies among those of the toxic compounds. The highly toxic compounds differ significantly from the other compounds in the amounts of alcohol and sugar groups. The more hydroxyl groups a molecule contains, the more hydrogen bond donors are available and the higher its reactivity. Sugar molecules have many chiral centers and are therefore characterized by a high stereoselectivity. Given the huge number of different sugar molecules, there is a vast number of possible combinations, resulting in a high specificity in the binding affinity to their targets. Alcohol, or phenol as an aromatic alcohol, is characterized by its reactivity and corrosiveness, resulting in a high toxicity. These properties are explained by the denaturing effect of phenol on membrane proteins, forming pores which may lead to cell death. Acetal includes a hydroxyl group which, as mentioned above, makes molecules more reactive. Acetals are stable with respect to hydrolysis by bases. This is an important property for toxic compounds, since the better protected they are from hydrolysis, the better they can exert their effects. In summary, an order can be defined, starting with the slightly toxic compounds with the lowest amounts of the depicted functional groups, followed by the natural compounds, the drugs, and the medium toxic compounds, and concluding with the highly toxic compounds, which possess the highest frequencies of the mentioned functional groups.
[Figure 6. Panel title: Functional properties; categories: Alcohol, Acetal/Acetal-like, Phenol, Sugar; x-axis: compounds [%], 0 to 100; series: TC 3-6, TC 6-9, TC >9, NC, Drugs]
Fig. 6. Distribution of the occurrences of functional groups of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = slightly toxic, 6-9 = medium toxic, >9 = highly toxic).
3.3. Structural properties
Structural properties were also investigated as toxicity indicators. The most distinct ones are represented in Figure 7. The analyses of the structural characteristics in the three groups of toxicity show results analogous to the analyses of the functional properties: the more toxic a compound, the more distinctive the property. Since chiral centers occur in high amounts in sugar molecules, their distributions correlate with those of the sugar group and have the same origin: the high specificity and selectivity they provide ensure a very efficient and specific mode of action of toxic compounds. Conjugated double bonds contribute to the stability of a molecule, so that a high number of them hampers degradation and enables the toxin to exert its effects. Earlier studies revealed that the centers of aromatic rings act as hydrogen bond acceptors [17], which is expected to play a significant role in molecular associations. This ensures a very specific and selective mode of action, which explains the increasing amount of ring systems with increasing toxicity.
[Figure 7. Panel title: Structural properties; categories: Ring system, Conjugated double bond, Chiral center; x-axis: compounds [%], 0 to 100; series: TC 3-6, TC 6-9, TC >9, NC, Drugs]
Fig. 7. Distribution of the occurrences of structural properties of toxic compounds (TC), natural compounds (NC) and drugs. The toxic compounds are split into three classes according to their toxicity values (-log (LC50): 3-6 = low, 6-9 = medium, >9 = high).
3.4. Case study
Amatoxins are cyclic non-ribosomal oligopeptides found in several members of the Amanita genus of mushrooms, one being the Death cap (Amanita phalloides). The most deadly of all the amatoxins is α-amanitin, with an oral LD50 of approximately 0.1 mg/kg. It is an inhibitor of RNA polymerase II, blocking the transcription of DNA into RNA [18]. This leads to a total failure of protein synthesis, causing severe effects on liver and kidney [19]. Death usually occurs around a week after ingestion [20]. A map of the purine and pyrimidine pathway, which can be found in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [21], is shown in Figure 8. It displays in detail the function of RNA polymerase II and the effects its inhibition by α-amanitin would cause.
[Figure 8. KEGG pathway map excerpt; legible labels include 5'-Acetylphosphoadenosine and 5'-Benzoylphosphoadenosine (mitochondria)]
Fig. 8. Excerpt of the purine pathway extracted from KEGG. The enzyme colored in red with the number "2.7.7.6" depicts the RNA polymerase II.
With a molecular weight of 918.97 g/mol, 13 hydrogen bond donors, and 15 hydrogen bond acceptors, the chemical "toxicity properties" of α-amanitin are consistent with our findings for the highly toxic compounds. Its many ring systems, conjugated double bonds, and chiral centers also fit our results on the structural "toxicity properties" of the highly toxic compounds.
4. Conclusion and Future Perspectives
In this work we were able to elucidate a continuous trend in structural, chemical, and functional properties within the different groups of toxic compounds. The analysis of hydrogen bond donors and acceptors as well as certain functional groups and structural features revealed a positive correlation between occurrence and toxicity, whereas drugs and natural compounds have values similar to those of the slightly toxic compounds. Toxic compounds function in a variety of ways, and subgroups such as the highly toxic ones react with their target in a completely different manner than drugs. While drugs are usually small compounds, able to enter the cell and to affect targets within it, many toxic compounds function by forming pores in membranes (e.g. alpha toxin from Staphylococcus aureus), by permanent activation of, for example, sodium channels (aconitine), or by interaction with neurotransmitter receptors (strychnine). With the help of such mechanisms these toxic compounds are able to affect critical pathways which often cannot be circumvented. Therefore, these molecules are very effective. The data presented here provide valuable insight into the phenomenon of toxicity by elucidating "toxicity properties", characteristics of toxic compounds. Thus, the properties analyzed here will function as additional criteria to predict toxicities with the help of QSAR. Additional toxicity relevant properties, as presented here, will be helpful to improve such analysis. Further efforts will be made in the prediction of potential targets of unknown compounds.
Acknowledgements
This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the German Research Foundation (DFG).
References
[1] Watson, P. and Spooner, R.A., Toxin entry and trafficking in mammalian cells, Adv Drug Deliv Rev, 58: 1581-1596, 2006.
[2] Hong, H., Xie, Q., Ge, W., Qian, F., Fang, H., Shi, L., Su, Z., Perkins, R. and Tong, W., Mold(2), Molecular Descriptors from 2D Structures for Chemoinformatics and Toxicoinformatics, J Chem Inf Model, 2008.
[3] Hughes, L.D., Palmer, D.S., Nigsch, F. and Mitchell, J.B., Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P, J Chem Inf Model, 48: 220-232, 2008.
[4] Dunkel, M., Fullbeck, M., Neumann, S. and Preissner, R., SuperNatural: a searchable database of available natural compounds, Nucleic Acids Res, 34: D678-683, 2006.
[5] Goede, A., Dunkel, M., Mester, N., Frommel, C. and Preissner, R., SuperDrug: a conformational drug database, Bioinformatics, 21: 1751-1753, 2005.
[6] http://dtp.nci.nih.gov/
[7] http://chem.sis.nlm.nih.gov/chemidplus
[8] http://pubchem.ncbi.nlm.nih.gov/
[9] Teuscher, E. and Lindequist, U., Biogene Gifte, Gustav Fischer Verlag, Germany, 1994.
[10] Gunther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J. et al., SuperTarget and Matador: resources for exploring drug-target relationships, Nucleic Acids Res, 36: D919-922, 2008.
[11] Guha, R., Howard, M.T., Hutchison, G.R., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J. and Willighagen, E.L., The Blue Obelisk - interoperability in chemical informatics, J Chem Inf Model, 46: 991-998, 2006.
[12] http://openbabel.sourceforge.net/
[13] http://mychem.sourceforge.net/
[14] http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[15] Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv Drug Deliv Rev, 46: 3-26, 2001.
[16] van de Waterbeemd, H. and Gifford, E., ADMET in silico modelling: towards prediction paradise?, Nat Rev Drug Discov, 2: 192-204, 2003.
[17] Levitt, M. and Perutz, M.F., Aromatic rings act as hydrogen bond acceptors, J Mol Biol, 201: 751-754, 1988.
[18] Lindell, T.J. et al., Specific Inhibition of Nuclear RNA Polymerase II by α-Amanitin, Science, 170: 447-449, 1970.
[19] Wieland, T., Poisonous Principles of Mushrooms of the Genus Amanita: Four-carbon amines acting on the central nervous system and cell-destroying cyclic peptides are produced, Science, 159: 946-952, 1968.
[20] Mas, A., Mushrooms, amatoxins and the liver, Journal of Hepatology, 42: 166-169, 2005.
[21] http://www.genome.jp/kegg/
COMPARATIVE VEGF RECEPTOR TYROSINE KINASE MODELING FOR THE DEVELOPMENT OF HIGHLY SPECIFIC INHIBITORS OF TUMOR ANGIOGENESIS
ULRIKE SCHMIDT 1
[email protected]
JESSICA AHMED 1
[email protected]
MICHAEL HOEPFNER 2
[email protected]
ELKE MICHALSKY 1
[email protected]
ROBERT PREISSNER 1
[email protected]
1 Structural Bioinformatics Group, Institute for Molecular Biology and Bioinformatics, Charite (CBF), Arnimallee 22, 14195 Berlin, Germany, http://bioinformatics.charite.de
2 Molecular Tumor Therapy and Tumor Angiogenesis Group, Institute of Physiology, Charite (CBF), Arnimallee 22, 14195 Berlin, Germany
The Vascular Endothelial Growth Factor receptors (VEGF-Rs) play a significant role in tumor development and tumor angiogenesis and are therefore interesting targets in cancer therapy. Targeting the VEGF-R is of special importance because the supply of the tumor has to be reduced. In general, this can be carried out by inhibiting the tyrosine kinase function of the VEGF-R. Nevertheless, known kinase inhibitors have specificity problems: they bind to the ATP-binding site and inhibit a number of kinases; moreover, the most specific inhibitors so far act on at least these three major types of VEGF-Rs: Flt-1, Flk-1/KDR, and Flt-4. The goal is a selective VEGF-R-2 (Flk-1/KDR) inhibitor, because this receptor triggers rather unspecific signals from VEGF-A, -C, -D and -E. Here, we describe a protocol starting from an established inhibitor (Vatalanib) with 2D-/3D-searching and property filtering of the in silico screening hits and the "negative docking approach". With this approach we were able to identify a compound which shows a fourfold higher reduction of the proliferation rate of endothelial cells compared to the reduction effect of the lead structure.
Keywords: VEGF; cancer; tumor angiogenesis; homology modeling; in silico screening; docking
1. Introduction
Angiogenesis, the formation of new blood vessels, normally occurs moderately in adults, e.g. during wound healing and during the menstrual cycle. The process of angiogenesis is regulated by activators and inhibitors [1]. Tumor angiogenesis is the formation of networks of blood vessels supplying the tumor with oxygen and nutrients. Tumor cells induce this process by releasing signaling proteins to the surrounding normal tissue. The most important signaling proteins, which are also released by most cancer cells, are the vascular endothelial growth factors (VEGFs). The VEGF family consists of the following secreted glycoproteins: VEGF-A, VEGF-B, VEGF-C, VEGF-D, VEGF-E and the placental growth factors (PlGF-1 and -2) [2-4]. The VEGFs bind to VEGF receptor (VEGF-R) proteins on the endothelial cell surface with different binding affinities for each of the VEGF-Rs. Expression of VEGF-Rs varies in specific endothelial cell layers. The VEGF-R-2 is located on almost all endothelial cells; however, the VEGF-R-1 and -3 are alternatively located on endothelial cells in distinct vascular layers [5].
Since angiogenesis was found to be necessary for tumor growth [6], the inhibition of pathological angiogenesis is a main goal in cancer therapy. In particular, the VEGF/VEGF-R pathway plays a significant role in the development of angiogenesis and therefore represents a point of interference for therapy in oncology [5]. Different strategies to inhibit tumor angiogenesis exist: it is possible to interfere with angiogenesis from the extracellular as well as from the intracellular side. In the extracellular region, for example, antibodies and soluble receptors can prevent binding of the VEGF to the binding site of the receptor [6]. Moreover, VEGF antagonists block the ligand binding site of the VEGF-R on the extracellular side. Another way is the inhibition of the VEGF-R in the intracellular region by blocking the ATP-binding site of the tyrosine kinase [7].
However, known tyrosine kinase inhibitors have specificity problems: they bind into the ATP-binding site and inhibit a number of kinases. So far the most specific inhibitors act on the VEGF-Rs. The goal would be to find a selective inhibitor for the VEGF-R-2 (KDR), because it is expressed on almost all endothelial cells and the majority of the effects in angiogenesis, including cell proliferation, micro-vascular permeability [8], invasion, migration, and survival [9, 10], are mediated by VEGF-R-2. To find new compounds by using structure-based drug design, structural information about the target is needed. To date, however, no complete crystal structures of the VEGF-Rs are available.
Here, we describe a protocol to find novel potential VEGF-R inhibitors starting from an established inhibitor (Vatalanib, see Figure 1) [11]. A known inhibitor was used as lead structure for an in silico two- and three-dimensional search in an "Inhouse" database to identify novel potential VEGF-R tyrosine kinase inhibitors. Moreover, the structures of the ATP-binding site of three VEGF-Rs were modeled, starting from an incomplete crystal structure of the VEGF-R-2. These homology models were then used for comparative docking as a qualitative evaluation of the in silico screening results.
2. Methods
The in silico searching protocol consists of several steps, which are described in this section. In Figure 1 the procedure is schematically depicted.
2.1. Compound database
To search for new potential VEGF-R inhibitors we used our Inhouse database, which contains about four million compounds and more than 140 million conformers, which were pre-calculated by using the MedChemExplorer of Accelrys [12, 13]. Around 95% of the compounds stored in the Inhouse database are commercially available for experimental validation.
[Figure 1. Flowchart; legible labels: In silico screening, Lead structure (Vatalanib), sequence, Preliminary alignment]
Fig. 1. Scheme of the in silico and in vitro screening protocol.
2.2. Two-dimensional searching
To search for similar structures in our Inhouse database we pursued 2D-searching. The screening is based on the chemical similarity between two molecules according to the similar property principle of Johnson and Maggiora [14].
A structural fingerprint [15], a binary string encoding the chemical characteristics of a compound, was calculated for the lead structure as well as for the database compounds. To screen the database, the fingerprint of the lead structure was compared to the fingerprints of the database entries by using the Tanimoto coefficient [16]. The Tanimoto coefficient is defined as:
$$T(a, b) = \frac{N_{ab}}{N_a + N_b - N_{ab}}$$
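A minimal sketch of such a fingerprint screen, using the usual form of the Tanimoto coefficient (bits set in both fingerprints divided by bits set in either), is given below. Fingerprints are represented as sets of on-bit positions, which is an assumption about the encoding; the function names and the toy database are hypothetical, and the cutoff of 0.85 is the one used in this work.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    n_a, n_b = len(bits_a), len(bits_b)
    n_ab = len(bits_a & bits_b)          # bits set to 1 in both fingerprints
    denominator = n_a + n_b - n_ab       # bits set to 1 in at least one
    return n_ab / denominator if denominator else 0.0


def similar_to_lead(lead_bits, database, threshold=0.85):
    """Return (name, score) pairs at or above the threshold, best first.

    The text gives the cutoff both as "greater than 85%" and as ">= 0.85";
    the inclusive comparison is used here.
    """
    scored = [(name, tanimoto(lead_bits, bits)) for name, bits in database.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints, for illustration only.
lead = set(range(20))
toy_database = {"close_analog": set(range(19)), "distant": set(range(10))}
hits = similar_to_lead(lead, toy_database)
```

Representing fingerprints as sets keeps the sketch short; a bit-vector representation with population counts would compute the same three quantities.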
$N_a$ is the number of bits that are set to 1 in the fingerprint of compound a, $N_b$ is the number of bits that are set to 1 for compound b, and $N_{ab}$ is the number of bits that compound a and compound b have set to 1 in common. A molecule with a similarity greater than 85% (≥ 0.85) to an active compound is assumed to be biologically active itself [17]. Therefore, only compounds with a similarity greater than 85% to the lead structure were considered.
2.3. Three-dimensional searching
A 3D-similarity search was applied to identify potential scaffold hoppers. For this purpose, the lead structure was compared to the conformers of drug-like compounds stored in our database. A plane representing the moment of inertia was placed into each structure. To compare two structures, the long and short sides of the planes were superimposed, which resulted in four possible superimpositions. The superimpositions were evaluated using a scoring function that includes the number of superimposed atoms and the root mean square deviation (RMSD). This scoring function is defined as:
$$\text{score} = (\text{percentage of superimposed atoms}) \cdot e^{-\mathrm{RMSD}}$$
2.4. Homology modeling
Homology modeling of the three VEGF-Rs required several steps, which were performed with the aid of the Swiss-PdbViewer [18]. A crystal structure of VEGF-R-2 (PDB code: 1YWN) was obtained from the Protein Data Bank (PDB). This structure is not complete; two gaps are located in and near the ATP-binding site. The ATP-binding pocket was completed using the SuperLooper web server [19]. Loops were extracted from the LIP database [20] and inserted into the structure via the web service. The completed model of VEGF-R-2 was then used as the template structure for VEGF-R-1 and VEGF-R-3. Finally, the models were subjected to an energy minimization using the respective function of the Swiss-PdbViewer.
2.5. Property filtering
To estimate the drug-likeness of the 2D/3D search results, the compounds were filtered according to their molecular properties using the Lipinski "rule of five". It comprises four empirical rules, which state that an orally available drug has:
• not more than 5 hydrogen bond donors
• not more than 10 hydrogen bond acceptors
• a molecular weight below 500 g/mol and
• a logP (n-octanol/water partition coefficient) < 5.
If a compound breaks more than one rule, it is unlikely to become an orally available drug [21]. Therefore, only compounds with at most one violation of the Lipinski rules were considered. The properties were calculated with the Accord for Excel Add-On [22].
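The filter described above can be sketched in a few lines of Python. This is only an illustration, not the Accord for Excel workflow used in this study: the `Compound` record, its field names and the example values are hypothetical, and the molecular properties are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record holding precomputed molecular properties."""
    name: str
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors
    mol_weight: float   # molecular weight in g/mol
    logp: float         # n-octanol/water partition coefficient (logP)


def lipinski_violations(c: Compound) -> int:
    """Count how many of the four rules the compound breaks."""
    return sum([
        c.h_donors > 5,
        c.h_acceptors > 10,
        c.mol_weight >= 500.0,
        c.logp >= 5.0,
    ])


def is_drug_like(c: Compound, max_violations: int = 1) -> bool:
    """Keep compounds with at most one violation, as in the protocol."""
    return lipinski_violations(c) <= max_violations


# Made-up example values, for illustration only.
candidates = [
    Compound("cpd_A", 2, 5, 350.0, 3.2),    # no violation
    Compound("cpd_B", 6, 5, 420.0, 4.1),    # one violation (donors)
    Compound("cpd_C", 6, 12, 510.0, 4.9),   # three violations
]
kept = [c.name for c in candidates if is_drug_like(c)]
```

With these made-up values, `cpd_A` and `cpd_B` pass the filter and `cpd_C` is discarded.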
2.6. Docking
To evaluate the remaining drug-like candidates, they were docked into the ATP-binding site of the modeled VEGF-Rs using the docking program Glide from Schrödinger [23]. The Glide scoring function (Glide SP score) was used to rank the docking results. The docking scores and the visual inspection of the docked ligand-protein complexes served as a qualitative evaluation of the candidates and resulted in a ranking of the compounds. The best molecules were used for further in vitro screening.
2.7. In vitro screening
A kinase assay was used to test the drug candidates for their inhibitory effect on VEGF-Rs. The inhibitory potential is expressed by the IC50 value (the concentration at which kinase activity is reduced to 50%). Cytotoxicity was measured using an LDH assay. The ability to inhibit cell proliferation was tested on different cell lines (including the endothelial cell line EA-HY 926) for each of the potential angiogenesis inhibitors.
3. Results and Discussion
3.1. Sequence alignment and homology modeling
The sequence alignment of the VEGF-Rs, as shown in Figure 2, is the basis of homology modeling. In a second step, the non-identical amino acids of the template structure were exchanged according to the VEGF-R sequences. Only gaps in the ATP-binding pocket were filled in.
[Figure 2: sequence alignment of VEGFR-3, VEGFR-2 and VEGFR-1.]
Fig. 2. Sequence alignment of the three VEGF-Rs after the homology modeling. Amino acid differences in the ATP-binding site are highlighted in black; other differences in grey.
Figure 3 shows a superimposition of the ATP-binding sites of all three homology modeled VEGF-Rs. Different amino acid residues in the ATP-binding site are shown in stick representation.
Fig. 3. Superimposition of the homology models of VEGF-R-1 (light grey), VEGF-R-2 (dark grey) and VEGF-R-3 (black). Different amino acid residues are shown in stick representation.
3.2. In silico screening
The 2D/3D-similarity screening of the Inhouse database for chemically and structurally similar compounds resulted in about 60 compounds that resemble the lead structure (with a Tanimoto coefficient ≥ 0.85). Applying the Lipinski rule of five as a molecular property filter reduced the number of potential candidates to 21 drug-like compounds.
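The similarity cut-off used in this screen can be illustrated as follows. The sketch assumes that each fingerprint is available as the set of indices of its bits that are set to 1; the function names, compound names and bit positions are made up for illustration and do not reproduce the fingerprint keys or the Inhouse database.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient N_ab / (N_a + N_b - N_ab) of two sets of 'on' bits."""
    n_ab = len(fp_a & fp_b)
    denom = len(fp_a) + len(fp_b) - n_ab
    # Two empty fingerprints share nothing; report 0.0 rather than divide by zero.
    return n_ab / denom if denom else 0.0


def screen(lead_fp: set, database: dict, threshold: float = 0.85) -> list:
    """Return the names of entries whose similarity to the lead is >= threshold."""
    return [name for name, fp in database.items()
            if tanimoto(lead_fp, fp) >= threshold]


# Made-up fingerprints: the lead has 20 bits set.
lead = set(range(20))
database = {
    "near_analog": set(range(19)),     # 19 shared bits, similarity 19/20
    "distant": set(range(10, 30)),     # 10 shared bits, similarity 10/30
}
hits = screen(lead, database)
```

Only `near_analog` exceeds the 0.85 threshold in this toy example.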
3.3. Docking
The remaining 21 structures were docked into the ATP-binding site of the VEGF-Rs. The docking scores and the visual inspection of the docked ligand-receptor complexes were combined as a qualitative evaluation of the in silico screening results. As examples, the docked structures of the lead compound Vatalanib and of compound 10 in VEGF-R-1, -2 and -3 are shown in Figure 4a-c and Figure 4d-f, respectively.
Fig. 4. Ligand docked into the ATP-binding site (surface representation of the VEGF-Rs). Lead structure (Vatalanib): a) in VEGF-R-1, b) in VEGF-R-2 and c) in VEGF-R-3. Compound 10: d) in VEGF-R-1, e) in VEGF-R-2 and f) in VEGF-R-3.
Table 1 lists the docking scores for Vatalanib and compound 10. The docking results reveal better (more negative) scores for compound 10 than for the lead structure. This suggests that compound 10 should have similar or even better biological activity. Therefore, compound 10 was one of the 21 substances selected for experimental validation.
Table 1: Docking scores (Glide Score SP)
                    VEGF-R-1   VEGF-R-2   VEGF-R-3
Lead (Vatalanib)    -4.51      -4.27      -4.92
Compound 10         -5.01      -4.86      -5.15
3.4. Experimental validation
The twelve compounds were tested in vitro for inhibition of VEGF-R kinase activity, cell proliferation and migration, and for cytotoxicity. As an example, Figure 5 shows the result of a cell proliferation assay on the endothelial cell line EA-HY 926 for compound 10 compared to the lead structure Vatalanib.
It can be concluded that compound 10, at a concentration of 10 µM, reduces cell proliferation by about 40% (light grey), whereas cell proliferation decreases by about 8% when treated with the lead compound. The results shown here confirm the in silico screening results.
[Figure 5: plot titled "Cell proliferation (EA-HY 926)" against concentration [µM], with curves for Vatalanib (dark grey) and compound 10 (light grey).]
Fig. 5. Cell proliferation assay (endothelial cell line EA-HY 926).
4. Conclusion and Future Work
Using this approach, we were able to identify new potential VEGF-R tyrosine kinase inhibitors. One of the hits was found to have a stronger inhibitory effect on cell proliferation than the lead structure. Therefore, we reason that this compound is a specific inhibitor of tumor angiogenesis. This compound will undergo further in vitro and in vivo experiments and will be the starting point for further refinement cycles.
Acknowledgements
This work was supported by the International Research Training Group Boston-Kyoto-Berlin, funded by the DFG.
References
[1] Nishida, N., et al., Angiogenesis in cancer. Vasc Health Risk Manag, 2(3): 213-219, 2006.
[2] Ferrara, N., H.P. Gerber, and J. LeCouter, The biology of VEGF and its receptors. Nat Med, 9(6): 669-676, 2003.
[3] Tischer, E., et al., The human gene for vascular endothelial growth factor. Multiple protein forms are encoded through alternative exon splicing. J Biol Chem, 266(18): 11947-11954, 1991.
[4] Houck, K.A., et al., The vascular endothelial growth factor family: identification of a fourth molecular species and characterization of alternative splicing of RNA. Mol Endocrinol, 5(12): 1806-1814, 1991.
[5] Hicklin, D.J. and L.M. Ellis, Role of the vascular endothelial growth factor pathway in tumor growth and angiogenesis. J Clin Oncol, 23(5): 1011-1027, 2005.
[6] Los, M., J.M. Roodhart, and E.E. Voest, Target practice: lessons from phase III trials with bevacizumab and vatalanib in the treatment of advanced colorectal cancer. Oncologist, 12(4): 443-450, 2007.
[7] Underiner, T.L., B. Ruggeri, and D.E. Gingrich, Development of vascular endothelial growth factor receptor (VEGFR) kinase inhibitors as anti-angiogenic agents in cancer therapy. Curr Med Chem, 11(6): 731-745, 2004.
[8] Dvorak, H.F., Vascular permeability factor/vascular endothelial growth factor: a critical cytokine in tumor angiogenesis and a potential target for diagnosis and therapy. J Clin Oncol, 20(21): 4368-4380, 2002.
[9] Zeng, H., H.F. Dvorak, and D. Mukhopadhyay, Vascular permeability factor (VPF)/vascular endothelial growth factor (VEGF) receptor-1 down-modulates VPF/VEGF receptor-2-mediated endothelial cell proliferation, but not migration, through phosphatidylinositol 3-kinase-dependent pathways. J Biol Chem, 276(29): 26969-26979, 2001.
[10] Millauer, B., et al., High affinity VEGF binding and developmental expression suggest Flk-1 as a major regulator of vasculogenesis and angiogenesis. Cell, 72(6): 835-846, 1993.
[11] Drevs, J., PTK/ZK (Novartis). IDrugs, 6(8): 787-794, 2003.
[12] Smellie, A., et al., Conformational analysis by intersection: CONAN. J Comput Chem, 24(1): 10-20, 2003.
[13] MedChemExplorer, Accelrys Inc., http://www.accelrys.com/dstudio/ds_medchem.
[14] Johnson, M. and G. Maggiora, Concepts and Applications of Molecular Similarity. Wiley, NY, 1998.
[15] 960 bit MDL (Molecular Design LTD.) MACCS keys.
[16] Delaney, J.S., Assessing the ability of chemical similarity measures to discriminate between active and inactive compounds. Mol Divers, 1(4): 217-222, 1996.
[17] Martin, Y.C., J.L. Kofron, and L.M. Traphagen, Do structurally similar molecules have similar biological activity? J Med Chem, 45(19): 4350-4358, 2002.
[18] Guex, N. and M.C. Peitsch, SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18(15): 2714-2723, 1997.
[19] SuperLooper, http://bioinformatics.charite.de/superlooper, 2007.
[20] Michalsky, E., A. Goede, and R. Preissner, Loops in Proteins (LIP) - a comprehensive loop database for homology modelling. Protein Eng, 16: 979, 2003.
[21] Lipinski, C.A., et al., Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev, 46(1-3): 3-26, 2001.
[22] Accelrys Inc., http://accelrys.com/
[23] Schrödinger, Glide, version 4.5, Schrödinger, LLC, New York, NY, 2007.
NETWORK ANALYSIS OF ADVERSE DRUG INTERACTIONS
MASATAKA TAKARABE (1)
TOSHIAKI TOKIMATSU (1)
SHUJIRO OKUDA (1)
SUSUMU GOTO (1)
MASUMI ITOH (1)
MINORU KANEHISA (1, 2)
(1) Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
(2) Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
Harmful effects associated with the use of drugs are caused by their side effects and by the combined use of different drugs. These drug interactions result in increased or decreased drug effects, or produce other new unwanted effects, and are serious problems for medical institutions and pharmaceutical companies. In this study, we created a drug-drug interaction network from drug package inserts and characterized drug interactions. The known information about the potential risk of drug interactions is described in drug package inserts. Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database, and GenomeNet provides the GenomeNet pharmaceutical products database, which integrates the JAPIC and KEGG databases. We extracted drug interaction data from GenomeNet, where interactions are classified according to risk as contraindications or cautions for coadministration, and some entries include information about enzymes metabolizing the drugs. We defined drug targets and drug-metabolizing enzymes as interaction factors using information on them in KEGG DRUG, and classified drugs into pharmacological/chemical subgroups. In the resulting drug-drug interaction network, the drugs that are associated with the same interaction factors are closely interconnected. Mechanisms of these interactions were then identified by each interaction factor. To characterize other interactions without interaction factors, we used the ATC classification system and found an association between interaction mechanisms and pharmacological/chemical subgroups.
Keywords: drug interaction; network; KEGG
1. Introduction
Adverse drug events caused by drug interactions are significant problems in medication and in the development of new drugs. These drug interactions lead to an increase or decrease of drug effects or to other serious reactions. For example, cyclosporin, which is widely used as an immunosuppressant drug, is known to interact with many other drugs such as ketoconazole and erythromycin [1, 2]. Cyclosporin is metabolized by CYP3A4, which is a member of the cytochrome P450 family and catalyzes the oxidation of a number of substrates, whereas ketoconazole and erythromycin inhibit CYP3A4 enzyme activity. Thus, the combined use of these drugs results in delayed clearance and an elevated blood level of cyclosporin, which increases or prolongs both its therapeutic and adverse effects. Assessing and managing such drug interactions is therefore important for clinical practice and drug development.
In this study, we focused on adverse drug interactions and created drug-drug interaction networks to characterize and investigate them. To create the networks, we extracted drug interaction data from Japanese drug package inserts, which contain known information about the potential risk of drug interactions. The Japanese drug package inserts are stored in the JAPIC (Japan Pharmaceutical Information Center) database [12]. We have integrated the JAPIC and KEGG databases [3] and provide the result as the GenomeNet pharmaceutical products database [13]. Additionally, we defined interaction factors and merged drugs into pharmacological/chemical subgroups to characterize the drug interactions. In the resulting drug-drug interaction networks, drugs that are associated with the same interaction factors are closely interconnected, and the mechanisms of the drug interactions were identified by the interaction factors (for example, the CYP enzyme family or monoamine receptors). Some other drug interactions without interaction factors were characterized using information from pharmacological/chemical subgroups.
2. Method
2.1. Datasets
The GenomeNet pharmaceutical products database provides Japanese drug package insert data linked to the KEGG DRUG database. Each entry contains information on the brand/generic name, physicochemical/pharmacokinetic properties, drug interactions, etc. The drug interaction section lists the drugs or the classes of drugs that cause adverse interactions with the product, and these interactions are classified according to risk as contraindications or cautions for coadministration. In addition, some entries contain sections with information on enzymes that metabolize the products, such as the cytochrome P450 family. Most entries are assigned KEGG DRUG IDs (D numbers), which correspond to the active ingredient of the products. The KEGG DRUG database is a chemical structure-based database in which each entry includes information on chemical structure, efficacy, drug target, pathway, ATC code, etc.
2.2. Drug interaction network
We used the data from the GenomeNet pharmaceutical products database as of March 26, 2008. The database contained 13,973 pharmaceutical product entries, of which 7,562 contained drug interaction information. We extracted drug names from the drug interaction section of each entry and listed the JAPIC IDs that correspond to these drug names to create drug interaction data between JAPIC IDs. Next, JAPIC IDs were merged according to the D numbers assigned to them, because we assumed that products assigned the same D number, and hence the same medicinal properties, carry the same potential risk of drug interactions. Consequently, we obtained drug interaction data between D numbers and used the data to create drug interaction networks.
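The merging step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the actual processing pipeline: the identifiers are made up, interactions are treated as undirected, and pairs whose products lack a D number or map to the same D number are skipped.

```python
def merge_to_d_numbers(japic_pairs, japic_to_d):
    """Collapse product-level pairs (JAPIC IDs) into ingredient-level pairs (D numbers)."""
    edges = set()
    for a, b in japic_pairs:
        d_a, d_b = japic_to_d.get(a), japic_to_d.get(b)
        # Assumption: drop unmapped products and pairs with identical D numbers.
        if d_a is None or d_b is None or d_a == d_b:
            continue
        # Undirected interaction: store each pair once, in sorted order.
        edges.add(tuple(sorted((d_a, d_b))))
    return edges


# Made-up identifiers: products P1 and P2 share the active ingredient D_A.
japic_to_d = {"P1": "D_A", "P2": "D_A", "P3": "D_B"}
japic_pairs = [("P1", "P3"), ("P2", "P3"), ("P3", "P1"), ("P3", "P9")]
d_edges = merge_to_d_numbers(japic_pairs, japic_to_d)
```

Three product-level pairs collapse into the single ingredient-level interaction between `D_A` and `D_B`, and the pair involving the unmapped product `P9` is dropped. The same kind of collapse is why far fewer interactions remain at the D-number level than at the JAPIC-ID level.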
To characterize the drug interactions, we defined drug targets and drug-metabolizing enzymes as interaction factors for each D number and searched for drug interactions in which both drugs are associated with the same interaction factor. Information on the interaction factors was collected from the package insert data and the KEGG DRUG database. Drug target gene data stored in the KEGG DRUG database were merged by functional type of protein according to KEGG BRITE, which is a collection of hierarchical classifications [3].
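The search for interactions that share an interaction factor amounts to a set intersection per edge. The following sketch assumes that each D number is annotated with a set of factor names; the drug identifiers and annotations are invented for illustration, and the function names are not taken from any KEGG tool.

```python
from collections import defaultdict


def shared_factors(edges, factors):
    """For each interaction, return the interaction factors common to both drugs."""
    shared = {}
    for a, b in edges:
        common = factors.get(a, set()) & factors.get(b, set())
        if common:
            shared[(a, b)] = common
    return shared


def tally(shared):
    """Per factor: (number of interactions, number of distinct drugs involved)."""
    n_edges = defaultdict(int)
    drugs = defaultdict(set)
    for (a, b), common in shared.items():
        for factor in common:
            n_edges[factor] += 1
            drugs[factor].update((a, b))
    return {factor: (n_edges[factor], len(drugs[factor])) for factor in n_edges}


# Made-up drugs and annotations, for illustration only.
factors = {
    "D_A": {"CYP3A"},
    "D_B": {"CYP3A", "CYP2D"},
    "D_C": {"CYP3A", "CYP2D"},
    "D_D": set(),
}
edges = [("D_A", "D_B"), ("D_B", "D_C"), ("D_A", "D_D")]
shared = shared_factors(edges, factors)
summary = tally(shared)
```

In this toy network two interactions involving three drugs share CYP3A, one interaction involving two drugs shares CYP2D, and the interaction with the unannotated drug `D_D` has no common factor.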
2.3. Pharmacological/chemical subgroups
We used the Anatomical Therapeutic Chemical classification system (ATC classification system), developed by the WHO Collaborating Centre for Drug Statistics Methodology [14], to group D numbers. The ATC classification system divides drugs at 5 different levels according to their sites of action and their therapeutic and chemical characteristics. Each level is assigned a code consisting of 1 letter or 2 digits, which corresponds to the pharmacological/chemical subgroup at that level. Drugs assigned the same ATC code therefore belong to the same pharmacological/chemical subgroup. Thus, D numbers were grouped into chemical substance subgroups according to the pharmacological/chemical categories of the ATC classification system.
3. Results
We extracted 29,663 interactions between JAPIC IDs for contraindications and 1,196,494 for cautions for coadministration, and merged the JAPIC IDs into D numbers. As a result, 1,513 and 36,040 interactions between D numbers were obtained, respectively (Table 1).
Table 1. Number of drug interactions and entries involved in the interactions.
            Contraindications          Cautions
            Interaction   Entry        Interaction   Entry
JAPIC ID    29,663        3,043        1,196,494     9,432
D number    1,513         517          36,040        1,431
3.1. Interaction factors
We created network graphs from the resulting data on the drug interactions and interaction factors. Figure 1 shows the obtained network of contraindications for coadministration. In the network, nodes represent the D numbers that correspond to the drugs, and edges represent interactions. Node sizes are proportional to the numbers of edges they have. Bold edges indicate interactions between drugs associated with the same interaction factors and are colored according to the interaction factors.
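Because node size reflects the number of edges, the quantity being drawn is simply the node degree. A minimal sketch, assuming the network is given as a list of undirected D-number pairs (the identifiers below are made up):

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map every node to the set of its neighbors in an undirected network."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degrees(adjacency):
    """Number of edges per node, i.e. the value that scales the node size."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


# Made-up network for illustration.
edges = [("D_A", "D_B"), ("D_A", "D_C"), ("D_A", "D_D"), ("D_B", "D_C")]
adjacency = build_adjacency(edges)
degree = degrees(adjacency)
hub = max(degree, key=degree.get)   # node with the most interactions
```

Here `D_A` has three interactions and would be drawn as the largest node.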
[Figure 1: network graph; legend: CYP family, Monoamine receptor, Other interaction factors.]
Fig. 1. Drug interaction network of contraindications for coadministration. Interaction factors were merged into the CYP enzyme family, monoamine receptor, and others. Bold edges were colored according to these interaction factor groups.
We obtained 12 interaction factors for contraindications and 38 for cautions for coadministration. Table 2 shows the top 5 interaction factors with which both drugs in an interaction are associated. CYP families and monoamine (adrenaline, serotonin, dopamine, histamine, etc.) receptors are the most frequently observed interaction factors associated with both drugs in an interaction. Drugs whose interactions involve the same interaction factors are closely interconnected.
Table 2. Number of interactions and drugs with interaction factor.
Contraindications
Interaction factor     # of interactions   # of drugs
CYP3A                  181                 77
Adrenaline receptor    33                  17
Serotonin receptor     28                  8
CYP2D                  17                  14
CYP1A                  16                  16
Cautions
Interaction factor     # of interactions   # of drugs
CYP3A                  1,916               147
Adrenaline receptor    200                 52
CYP2C                  200                 50
Dopamine receptor      182                 42
CYP1A                  113                 31
Information on the mechanisms of these interactions is provided in the package inserts. For instance, drug interactions involving CYP families are caused by inhibition or induction of the enzymes and result in an increase or decrease in the effects of the drugs. In the case of drug interactions involving monoamine receptors, both drugs affect the same receptors, which results in additive effects at the receptors.
Next, we investigated other interactions without interaction factors by using information from pharmacological/chemical subgroups. In the network of contraindications for coadministration, 398 D numbers were assigned ATC codes and merged into 331 pharmacological/chemical subgroups. In the network of cautions for coadministration, 1042 D numbers were merged into 941 subgroups. To explore an association between interaction mechanisms and pharmacological/chemical subgroups, we searched for hub nodes and for the common pharmacological/chemical categories of their neighboring nodes. Figure 2 shows the example of D00951 (Medroxyprogesterone acetate) and its neighboring nodes with pharmacological/chemical subgroup information in the network of contraindications for coadministration. D00951 interacts with 97 different drugs, of which 43 are included in the most common category "Corticosteroids, plain", which corresponds to the third level ATC code "D07A". These interactions between D00951 and the "Corticosteroids, plain" subgroup increase the risk of side effects of both drugs, such as cardiovascular disease [4, 5, 6].
4. Discussion
We created drug interaction networks from Japanese drug package insert information to explore adverse drug interactions. In the resulting networks, many drugs are associated with the same interaction factors and are closely connected with each other. Consequently, many drugs interact mostly with drugs associated with the same interaction factors. For example, D02211 (Dihydroergotamine mesilate) interacts with 37 different drugs, of which 30 are associated with CYP3A, and D00560 (Pimozide) interacts with 23 different drugs, of which 21 are associated with CYP3A. Dihydroergotamine mesilate and pimozide are reported to be metabolized by CYP3A [7, 8], and coadministration of these drugs with CYP3A inhibitors or with drugs metabolized by CYP3A causes serious side effects such as QT prolongation or ventricular arrhythmia. These interaction factors enabled us to characterize drug interactions and identify their mechanisms, because the interaction mechanisms and clinical symptoms depend on the interaction factors.
The obtained drug interaction networks include many nodes and edges. In the network of cautions for coadministration in particular, it is difficult to explore drug interactions from the network graph. For efficient analysis, eliminating drugs and interactions associated with the same interaction factors may be an effective way to reduce the nodes and edges in the drug networks.
Next, we used the ATC classification system to investigate interactions between drugs that have no interaction factor information or that are assigned different interaction factors. We applied the information on pharmacological/chemical subgroups to the neighboring nodes of each node and searched for their common pharmacological/chemical categories at the third or fourth level of the ATC code. For some drugs, common pharmacological/chemical categories were found among the neighboring nodes, and characteristic interaction mechanisms or clinical symptoms are related to these categories. Figure 2 illustrates one example of the association between interaction mechanisms and pharmacological/chemical subgroups, and Figure 3A shows another. D00386 (Triamterene) interacts with 8 different drugs, of which 6 are classified in the "Acetic acid derivatives" subgroup, and these interactions cause acute renal failure [9, 10]. Figure 3B illustrates the case of D00089 (Oxytocin), where the interactions enhance the effects of both drugs and lead to serious events [11]. The results indicate that this method using pharmacological/chemical subgroups is effective for investigating drug interactions that lack interaction factor information. However, some drug interactions remain uncharacterized. Further research will need more exhaustive data, including drug interactions, targets and other new pharmacological/chemical properties, to characterize these interactions.
Acknowledgments
We thank J.B. Brown for critical reading of our manuscript. This work was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology of Japan and the Japan Science and Technology Agency. The computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.
Fig. 2. D00951 and its neighboring nodes in the network of contraindications for coadministration. Red nodes represent nodes that are included in the "Corticosteroids, plain" subgroup ("D07A").
Fig. 3. Associations between interaction mechanisms and pharmacological/chemical subgroups in the network of contraindications for coadministration. Red nodes represent nodes that are included in the same pharmacological/chemical subgroups. (A) D00386 (Triamterene) interacts with 6 drugs classified in the "Acetic acid derivatives" subgroup. (B) D00089 (Oxytocin) interacts with 5 drugs classified in the "Prostaglandins" subgroup.
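The neighbor-category counts behind Figures 2 and 3 can be outlined as follows. The sketch relies on the ATC code layout (1 letter, 2 digits, 1 letter, 1 letter, 2 digits), so that the first 4 characters give the third level and the first 5 the fourth level. The function names and neighbor labels are hypothetical, and the codes in the example only serve to show the format.

```python
from collections import Counter

# Cumulative code length at ATC levels 1 to 5.
ATC_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_prefix(code: str, level: int) -> str:
    """Truncate an ATC code to the given level, e.g. level 3 keeps 4 characters."""
    return code[:ATC_LENGTH[level]]


def most_common_category(neighbors, atc_codes, level=3):
    """Return (category, count) for the most frequent ATC category among neighbors."""
    counts = Counter()
    for node in neighbors:
        # A set ensures a drug with several codes in one category counts once.
        counts.update({atc_prefix(code, level) for code in atc_codes.get(node, ())})
    return counts.most_common(1)[0] if counts else None


# Made-up neighbor annotations, for illustration only.
atc_codes = {
    "n1": ["D07AA02"],
    "n2": ["D07AC01"],
    "n3": ["M01AB05"],
    "n4": [],            # neighbor without an ATC code
}
neighbors = ["n1", "n2", "n3", "n4"]
top = most_common_category(neighbors, atc_codes, level=3)
```

In this example two of the four neighbors fall into the third-level category `D07A`, which is therefore reported as the most common one.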
References
[1] Wadhwa, N.K., Schroeder, T.J., Pesce, A.J., Myre, S.A., Clardy, C.W., First, M.R., Cyclosporine drug interactions: a review, Ther. Drug Monit., 9(4): 399-406, 1987.
[2] Pichard, L., Fabre, I., Fabre, G., Domergue, J., Saint Aubert, B., Mourad, G., Maurel, P., Cyclosporin A drug interactions. Screening for inducers and inhibitors of cytochrome P-450 (cyclosporin A oxidase) in primary cultures of human hepatocytes and in liver microsomes, Drug Metab. Dispos., 18(5): 595-606, 1990.
[3] Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., KEGG for linking genomes to life and the environment, Nucleic Acids Res., 36: D480-D484, 2008.
[4] Falkeborn, M., Persson, I., Adami, H.O., Bergstrom, R., Eaker, E., Lithell, H., Mohsen, R., Naessen, T., The risk of acute myocardial infarction after oestrogen and oestrogen-progestogen replacement, Br. J. Obstet. Gynaecol., 99(10): 821-828, 1992.
[5] Lacroix, K.A., Bean, C., Reilly, R., Curran-Celentano, J., The effects of hormone replacement therapy on antithrombin III and protein C levels in menopausal women, Clin. Lab. Sci., 10(3): 145-148, 1997.
[6] Al-Farra HM, Al-Fahoum SK, Tabbaa MA., First MR., Effect of hormone replacement therapy on hemostatic variables in post-menopausal women, Saudi Med. J., 26(12): 1930-1935, 2005.
[7] Moubarak AS, Rosenkrans CF Jr, Johnson ZB., Modulation of cytochrome P450 metabolism by ergonovine and dihydroergotamine, Vet. Hum. Toxicol., 45(1): 6-9, 2003.
[8] Desta Z, Kerbusch T, Soukhova N, Richard E, Ko JW, Flockhart DA, Identification and characterization of human cytochrome P450 isoforms interacting with pimozide, J. Pharmacol. Exp. Ther., 285(2): 428-437, 1998.
[9] Favre L, Glasson P, Vallotton MB., Reversible acute renal failure from combined triamterene and indomethacin: a study in healthy subjects, Ann. Intern. Med., 96(3): 317-320, 1982.
[10] Favre L, Vallotton MB., Relationship of renal prostaglandins to three diuretics, Prostaglandins Leukot. Med., 14(3): 313-319, 1984.
[11] Tomialowicz M, Florjanski J, Zimmer M., The use of oxytocin and prostaglandin in pregnancies after cesarean delivery or uterine surgery, Ginekol Pol., 71(4): 242-246, 2000.
[12] http://database.japic.or.jp/nw/index
[13] http://www.genome.jp/kusuri/
[14] http://www.whocc.no/atcddd/
SAMPLING GEOMETRIES OF PROTEIN-PROTEIN COMPLEXES
AYSAM GUERLER
STEPHAN LORENZEN
FLORIAN KRULL
ERNST-WALTER KNAPP
Freie Universität Berlin, Department of Chemistry and Biochemistry, Fabeckstr. 36a, 14195 Berlin-Dahlem, Germany
Protein-protein docking is a major task in structural biology. In general, the geometries of protein pairs are sampled by generating docked conformations, analyzing them with scoring functions and selecting appropriate geometries for further refinement. Here, we present an algorithm in real space to sample geometries of protein pairs. To this end, we first determine uniformly distributed points on the surfaces of the two protein structures to be docked and additionally define a set of uniformly distributed rotations. Then, the sampling method generates structures of protein pairs as follows: (i) We rotate one protein of the protein pair according to a selected rotation and (ii) translate it along a line connecting two surface points belonging to different proteins such that these surface points coincide. The resulting protein pair geometries are then analyzed and selected using a scoring function that considers residues and atom pairs. We applied this approach to a set of 22 enzyme-inhibitor complexes and demonstrate that a discretisation of the rigid-body search in real space provides an efficient and robust sampling scheme. Our method generates decoy sets with a considerable fraction of near-native geometries for all considered enzyme-inhibitor complexes.
Keywords: protein-protein docking; rigid-body geometry search; interface analysis
1. Introduction
Proteins are important regulators of biochemical processes in biological cells. They are used, for instance, to catalyze chemical reactions, to transport substrates through membranes and to stabilize cellular structures. Interactions with other molecules can affect a protein's macromolecular structure and functionality. For proteins whose function is to form specific complexes with other proteins, the shape of the contact surface and the residue pair interactions at the contact surface are especially relevant [1]. This protein-protein interaction obeys the lock-and-key principle and is driven by free energy contributions, resulting in high binding affinities. Binding can influence the function of proteins in diverse ways, from total inhibition to enhancement or induction. Although genome-wide proteomics studies indicate that many proteins interact with each other, the number of complexes in the Protein Data Bank (PDB) increases very slowly. Possibly, this is related to the instability of transient protein-protein interactions, which makes a crystallographic analysis difficult. Therefore, theoretical approaches for the identification and prediction of protein-protein interactions can be of great importance. Many efforts have been made to find a computational solution to this problem. Unlike the prediction of the binding modes for small molecules (e.g. FlexX [2],
ICM [3] and Fado [4]), most protein-protein docking approaches consider the structures of the individual proteins in the complex to be rigid. Initially, a wide variety of docked conformations are generated and simultaneously evaluated by scoring functions. In general, these methods perform well when applied on individual protein conformations that are directly taken from the corresponding co-crystallized structures. However, predicting protein complex geometries using protein structures obtained from separate crystallizations essays remains difficult, often leading to many false positives. The binding process often involves conformational changes. Although these are generally subtle, they make it more difficult to find the proper complex geometry. Therefore, a further refinement of the proposed complex geometries by other methods, e.g. Monte Carlo approaches, is often necessary. Currently, most established methods for rigid-body analysis of protein-protein interactions are based on the convolution technique in Fourier space as initially utilized by Katchalski-Katzir et al. in 1992 [5]. These approaches include ZDOCK [6], MolFit [7], 3D-Dock [8], DOT [9], GRAMM [10] and others. These methods use a scoring function defined on a discrete grid for each of the two proteins. Instead of evaluating the scoring function in real space, which is computationally expensive, the values of the scoring function are obtained by multiplication the corresponding Fourier transformed grids. This is done by assigning the atomic interaction parameters for each protein on separate grids, which are subsequently transformed by the fast Fourier transform (FFT) algorithm. In the Fourier space the Fourier coefficients are multiplied and the results are transformed back to real space. This is done for a large set of protein orientations [5]. Besides the FFT-based approaches, a variety of other procedures have also been applied on the protein-protein docking problem. Nussinov et al. 
proposed an algorithm based on geometric matching of knobs on the interacting surfaces [11]. Others, such as Baker [12] and Abagyan [13], have developed highly accurate methods using Monte Carlo simulations. The protein complex geometries are clustered [14] and their stability is analyzed by perturbation studies using different scoring functions [15].
The development of proper scoring functions is a non-trivial problem in protein-protein docking. A large variety of scoring functions attempt to capture the biophysically relevant properties of protein complex formation, such as interactions based on physical principles, on residue pair distributions or on geometric fit [16-20].
In this work, we describe a real-space rigid-body protein-protein docking approach. Instead of assigning atom-specific interaction parameters to each grid point, as is necessary for FFT methods, we can take into consideration all interactions of atom pairs within a certain cutoff distance from the protein surfaces. To reduce the computational cost in real space, an efficient sampling strategy for the search space is used, which in turn allows additional parameters to be considered in the scoring function. The two proteins are translated and rotated by a discrete set of transformations. To obtain the corresponding parameters for the transformations, the protein surfaces are uniformly covered by surface points. In addition, a set Q of uniformly distributed quaternions is generated, from which the rotations are obtained. The translational vector is defined by the line connecting the
pair of surface points selected from each of the two proteins. The residues interacting in the resulting geometry are evaluated by a statistical scoring function, which comprises geometrical and physicochemical components by considering residue pairs and atom pairs. The parameters of the scoring function were determined by Heuser et al. for enzyme-inhibitor complexes [20, 21].
2. Methods
2.1. Preparing surface and grid representation
From now on, we call the smaller of the two proteins the ligand (L) and the larger the receptor (R). We embed both proteins in a grid with a grid constant of 1.0 Å. Points of the receptor grid G_R that lie within the van der Waals (vdW) sphere of a receptor atom (radius of 1.8 Å for all atoms) are inside the receptor and are marked as receptor points. Receptor grid points outside the vdW volume of the protein store a neighbor list of the protein atoms within a distance cutoff of r_cut(neighbor) = 7 Å. This neighbor list provides an efficient way to find atomic interaction partners between the two proteins in the complex structure.
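The grid preparation described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: the function name build_grid, the bounding-box margin and the dictionary layout are assumptions made for this example, while the grid constant, the vdW radius and the neighbor cutoff are the values given in the text. Grid points inside the vdW volume are stored as None; all other points store the indices of the atoms within the cutoff.

```python
import math
from itertools import product

SPACING = 1.0         # grid constant in Å (value from the text)
VDW_RADIUS = 1.8      # vdW radius in Å, same for all atoms (value from the text)
R_CUT_NEIGHBOR = 7.0  # neighbor-list cutoff in Å (value from the text)


def build_grid(atoms, spacing=SPACING, margin=R_CUT_NEIGHBOR):
    """Map each grid point to None (inside the protein) or to a neighbor list.

    `atoms` is a list of (x, y, z) tuples. The bounding box is padded by
    `margin`, which is an assumption of this sketch.
    """
    lo = [math.floor(min(a[d] for a in atoms) - margin) for d in range(3)]
    hi = [math.ceil(max(a[d] for a in atoms) + margin) for d in range(3)]
    axes = [
        [lo[d] + i * spacing for i in range(int((hi[d] - lo[d]) / spacing) + 1)]
        for d in range(3)
    ]
    grid = {}
    for point in product(*axes):
        # Brute-force distances to all atoms; fine for a toy example only.
        dists = [math.dist(point, atom) for atom in atoms]
        if min(dists) <= VDW_RADIUS:
            grid[point] = None  # inside the vdW sphere of some atom
        else:
            grid[point] = [i for i, d in enumerate(dists) if d <= R_CUT_NEIGHBOR]
    return grid


# Toy "protein" made of two atoms (hypothetical coordinates).
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
grid = build_grid(atoms)
print(grid[(5.0, 0.0, 0.0)])  # atom indices within the cutoff of this grid point
```

For real proteins the all-pairs distance loop would be replaced by a cell list or a spatial tree; the lookup of interaction partners through the precomputed list at a grid point stays the same.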
Fig. 1. Generation of neighbor list and surface points. Small spheres denote the protein atoms. a) Atom neighbor list of a reference grid point (center of large sphere) contains the numbers of atoms within the cut-off distance (largest sphere). b) Initial surface points (thicker red points of the grid) are all grid points, which are within a specified minimal and maximal distance (medium size blue spheres denoted by dashed lines) to the nearest protein atoms. c) The initial surface points are translated towards the center of the nearest protein atom until the vdW surface of the atom is reached (blue points on the surface of the gray spheres).
For both proteins (ligand and receptor) the grids are also used to determine surface points and surface normal vectors (see Fig. 1 for more details). In a first approximation, the protein surface points are those grid points whose distances to the nearest protein atoms are between 4.0 and 6.0 Å. These points are then projected onto the vdW surface of the nearest atom sphere. For each such surface point, we calculate a surface normal vector connecting the assigned atom center with the surface point. Then, we compute for
all atoms of a residue the average of the surface normal vectors. Next, we reduce the number of surface points. To obtain an even distribution of surface points, we randomly select a single surface point and delete all other surface points within a distance of r_cut(surface) = 7 Å. We then select the nearest remaining surface point and repeat the procedure until all surface points have been selected or deleted. We denote the resulting sets of surface points by S_R and S_L and the corresponding sets of normal vectors by V_R and V_L for the receptor and ligand, respectively. For the rotations, a set Q of 8000 uniformly distributed quaternions is calculated with the approach described by Kuffner [22].
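The surface-point construction and the subsequent thinning can be illustrated as follows. This is a hedged sketch under simplifying assumptions, not the authors' code: the function names initial_surface_points, project and thin are hypothetical, distances are computed by brute force, and the per-residue averaging of the normal vectors is omitted. The shell limits (4.0 to 6.0 Å), the vdW radius (1.8 Å) and the thinning radius (7 Å) are the values given in the text.

```python
import math
import random
from itertools import product

VDW_RADIUS = 1.8                 # Å (value from the text)
SHELL_MIN, SHELL_MAX = 4.0, 6.0  # Å, distance shell for initial surface points
R_CUT_SURFACE = 7.0              # Å, thinning radius (value from the text)


def initial_surface_points(atoms, spacing=1.0):
    """Return (grid_point, nearest_atom_index) pairs lying in the 4-6 Å shell."""
    lo = [math.floor(min(a[d] for a in atoms) - SHELL_MAX) for d in range(3)]
    hi = [math.ceil(max(a[d] for a in atoms) + SHELL_MAX) for d in range(3)]
    axes = [
        [lo[d] + i * spacing for i in range(int((hi[d] - lo[d]) / spacing) + 1)]
        for d in range(3)
    ]
    selected = []
    for point in product(*axes):
        nearest = min(range(len(atoms)), key=lambda k: math.dist(point, atoms[k]))
        if SHELL_MIN <= math.dist(point, atoms[nearest]) <= SHELL_MAX:
            selected.append((point, nearest))
    return selected


def project(point, atom):
    """Move `point` toward the atom center until the vdW sphere is reached.

    Returns (surface_point, unit_normal); the normal points from the atom
    center to the surface point.
    """
    d = math.dist(point, atom)
    normal = tuple((p - a) / d for p, a in zip(point, atom))
    surface = tuple(a + VDW_RADIUS * n for a, n in zip(atom, normal))
    return surface, normal


def thin(points, r_cut=R_CUT_SURFACE, rng=random):
    """Greedy thinning: keep a point, delete all points within r_cut,
    then continue from the nearest remaining point."""
    remaining = list(points)
    current = rng.choice(remaining)  # random starting point
    kept = []
    while True:
        kept.append(current)
        remaining = [p for p in remaining if math.dist(p, current) > r_cut]
        if not remaining:
            return kept
        current = min(remaining, key=lambda p: math.dist(p, current))


# Toy example with two atoms (hypothetical coordinates).
atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
projected = [project(p, atoms[i]) for p, i in initial_surface_points(atoms)]
surface = thin([s for s, _ in projected], rng=random.Random(0))
print(len(surface), "surface points kept")
```

By construction, any two retained points are more than 7 Å apart, and every deleted point lies within 7 Å of a retained one, which is what yields the even coverage of the surface.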
2.2. Sampling strategy
During the generation of the protein-protein geometries (called decoys), the receptor stays fixed, while the ligand is moved, i.e. translated and rotated. A decoy is defined by the triplet [q(k), s_R(i), s_L(j)] of a quaternion q(k) ∈ Q and surface points s_R(i) and s_L(j) of the receptor and ligand, respectively. For each pair [s_R(i), s_L(j)] of surface points we compute the angle