Series on Advances in Bioinformatics and Computational Biology - Volume 4
Life Sciences Society
COMPUTATIONAL SYSTEMS BIOINFORMATICS CSB2006 CONFERENCE PROCEEDINGS Stanford CA, 14-18 August 2006
Editors
Peter Markstein
Ying Xu
Imperial College Press
Life Sciences Society
COMPUTATIONAL SYSTEMS BIOINFORMATICS
SERIES ON ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
Series Editors: Ying XU (University of Georgia, USA), Limsoon WONG (National University of Singapore, Singapore)
Associate Editors: Ruth Nussinov (NCI, USA), Rolf Apweiler (EBI, UK), Ed Wingender (BioBase, Germany), See-Kiong Ng (Inst for Infocomm Res, Singapore), Kenta Nakai (Univ of Tokyo, Japan), Mark Ragan (Univ of Queensland, Australia)
Published:
Vol. 1: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, Eds: Yi-Ping Phoebe Chen and Limsoon Wong
Vol. 2: Information Processing and Living Systems, Eds: Vladimir B. Bajic and Tan Tin Wee
Vol. 3: Proceedings of the 4th Asia-Pacific Bioinformatics Conference, Eds: Tao Jiang, Ueng-Cheng Yang, Yi-Ping Phoebe Chen and Limsoon Wong
Vol. 4: Computational Systems Bioinformatics, Eds: Peter Markstein and Ying Xu
ISSN: 1751-6404
Published by
Imperial College Press
57 Shelton Street, Covent Garden, London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Series on Advances in Bioinformatics and Computational Biology — Vol. 4
COMPUTATIONAL SYSTEMS BIOINFORMATICS
Proceedings of the Conference CSB 2006

Copyright © 2006 by Imperial College Press

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 1-86094-700-X
Printed in Singapore by B & JO Enterprise
Life Sciences Society
THANK YOU LSS Corporate Members and CSB2006 Platinum Sponsors!
The Life Sciences Society, LSS Directors, together with the CSB2006 program committee and conference organizers, are extremely grateful to the Hewlett-Packard Company and Microsoft Research for their LSS Corporate Membership and for their Platinum Sponsorship of the Fifth Annual Computational Systems Bioinformatics Conference, CSB2006, at Stanford University, California, August 14-18, 2006.
COMMITTEES
Steering Committee
Phil Bourne - University of California, San Diego
Eric Davidson - California Institute of Technology
Steven Salzberg - The Institute for Genomic Research
John Wooley - University of California, San Diego, San Diego Supercomputer Center
Organizing Committee
Russ Altman - Stanford University, Faculty Sponsor (CSB2005)
Serafim Batzoglou - Stanford University, Faculty Sponsor (CSB2002-CSB2004)
Pat Blauvelt - Communications
Ed Buckingham - Local Arrangements Chair
Kass Goldfein - Finance Consultant
Karen Hauge - Local Arrangements - Food
VK Holtzendorf - Sponsorship
Robert Lashley - Sun Microsystems Inc, Co-Chair
Steve Madden - Agilent Technologies
Alexia Marcous - CEI Systems Inc, Sponsorship
Vicky Markstein - Life Sciences Society, Co-Chair, LSS President
Yogi Patel - Stanford University, Communications
Gene Ren - Finance Chair
Jean Tsukamoto - Graphics Design
Bill Wang - Sun Microsystems Inc, Registration Chair
Peggy Yao - Stanford University, Sponsorship
Dan Zuras - Group 70, Recorder
Program Committee
Tatsuya Akutsu - Kyoto University
Vineet Bafna - University of California, San Diego
Serafim Batzoglou - Stanford University
Chris Bystroff - Rensselaer Polytechnic Institute
Jake Chen - Indiana University
Amar Das - Stanford University
David Dixon - University of Alabama
Terry Gaasterland - University of California, San Diego
Robert Giegerich - Universität Bielefeld
Eran Halperin - University of California, Berkeley
Wolfgang R. Hess - University of Freiburg
Ivo Hofacker - University of Vienna
Wen-Lian Hsu - Academia Sinica
Daniel Huson - Tübingen University
Tao Jiang - University of California, Riverside
Sun-Yuan Kung - Princeton University
Dong Yup Lee - Singapore
Cheng Li - Harvard School of Public Health
Jie Liang - University of Illinois at Chicago
Ann Loraine - University of Alabama
Bin Ma - University of Western Ontario
Peter Markstein - Hewlett-Packard Co., Co-chair
Satoru Miyano - University of Tokyo
Sean Mooney - Indiana University
Ruth Nussinov - National Cancer Institute
Mihai Pop - University of Maryland
Isidore Rigoutsos - IBM TJ Watson Research Center
Marie-France Sagot - Université Claude Bernard
Mona Singh - Princeton University
Victor Solovyev - Royal Holloway, University of London
Chao Tang - University of California at San Francisco
Olga Troyanskaya - Princeton University
Limsoon Wong - Institute for Infocomm Research
Ying Xu - University of Georgia, Co-chair
Assistants to the Program Co-Chairs
Misty Hice - Hewlett-Packard Labs
Ann Terka - University of Georgia
Joan Yantko - University of Georgia
Poster Committee
Dick Carter - Hewlett-Packard Labs
Robert Marinelli - Stanford University
Nigam Shah - Stanford University, Chair
Kathleen Sullivan - Five Prime Therapeutics, Inc
Tutorial Committee
Carol Cain - Agency for Healthcare Research and Quality, US Department of Health and Human Services
Betty Cheng - Stanford University Biomedical Informatics Training Program, Chair
Al Shpuntoff
Workshop Committee
Will Bridewell - Stanford University, Chair
Demonstrations Committee
AJ Chen - Stanford University, Chair
Rong Chen - Stanford University
Referees
Larisa Adamian, Tatsuya Akutsu, Doi Atsushi, Vineet Bafna, Purushotham Bangalore, Serafim Batzoglou, Sebastian Boecker, Chris Bystroff, Jake Chen, Shihyen Chen, Zhong Chen, Amar Das, Eugene Davydov, Tobias Dezulian, David A. Dixon, Chuong B. Do, Kelsey Forsythe, Ana Teresa Freitas, Terry Gaasterland, Irene Gabashvili, Robert Giegerich, Samuel S. Gross, Juntao Guo, Eran Halperin, Wolfgang Hess, Ivo Hofacker, Wen-Lian Hsu, Daniel Huson, Seiya Imoto, Tao Jiang, Uri Keich, Gad Kimmel, Bonnie Kirkpatrick, S. Y. Kung, Vincent Lacroix, Dong Yup Lee, Xin Lei, Cheng Li, Guojun Li, Xiang Li, Jie Liang, Huiqing Liu, Jingyuan Liu, Nianjun Liu, Ann Loraine, Bin Ma, Man-Wai Mak, Fenglou Mao, Peter Markstein, Alice C. McHardy, Satoru Miyano, Sean Mooney, Jose Carlos Nacher, Rei-ichiro Nakamichi, Brian Naughton, Kay Nieselt, Ruth Nussinov, Victor Olman, Daniel Platt, Mihai Pop, Vibin Ramakrishnan, Isidore Rigoutsos, Marie-France Sagot, Nigam Shah, Baozhen Shan, Daniel Shriner, Mona Singh, Sagi Snir, Victor Solovyev, Andreas Sundquist, Ting-Yi Sung, Chao Tang, Eric Tannier, Olga Troyanskaya, Aristotelis Tsirigos, Adelinde Uhrmacher, Raj Vadigepalli, Gabriel Valiente, Limsoon Wong, Hongwei Wu, Lei Xin, Ying Xu, Rui Yamaguchi, Will York, Hiroshi Yoshida, Ryo Yoshida, Noah Zaitlen, Stanislav O. Zakharkin
PREFACE
The Life Sciences Society, LSS, was launched at the CSB2005 conference. Its goal is to combine the power available from computer science and the engineering capability to design complex automated instruments with the weight of centuries of accumulated knowledge from the biosciences. LSS directors, organizing committee and members have dedicated time and talent to make CSB2006 one of the premier life sciences conferences in the world. Besides the huge volunteer effort for CSB, it is important that this conference be properly financed. LSS and CSB are thankful for the continuous and generous support from Hewlett-Packard and from Microsoft Research. We also want to thank the CSB2006 authors who have trusted us with the results of their research. In return, LSS has arranged to have the CSB2006 Proceedings distributed to libraries as a volume in the "Advances in Bioinformatics and Computational Biology" book series from Imperial College Press. CSB proceedings are indexed in Medline. A very big thank you to John Wooley, CSB steering committee member par excellence, who was there to help whenever needed. The general conference Co-Chair for CSB2006, Robert Lashley, has done a phenomenal job in his first year with LSS. Ed Buckingham as Local Arrangements Chair continues to provide outstanding professional leadership for CSB for the fourth consecutive year. Once again the Program Committee, co-chaired by Peter Markstein and Ying Xu, has orchestrated a stellar selection of thirty-eight bioinformatics papers for the plenary sessions and for publication in the Proceedings. The selection of the best posters was done under the supervision of Nigam Shah, Poster Chair. Selection of the ten tutorial classes was conducted by Betty Cheng, Tutorial Chair, and of the seven workshops by Will Bridewell, Workshop Chair. Ann Loraine's work with PubMed has been instrumental in getting CSB proceedings indexed in Medline. Kirindi Choi is again Chair of Volunteers. Pat Blauvelt is LSS Membership Chair, Bill Wang is Registration Chair, and Gene Ren is Finance Chair. Together with the above committee members, all CSB committee members deserve a special thank you. This has been an incredibly dedicated CSB organizing committee! If you believe that Sharing Matters, you are invited to join our drive for successful knowledge transfer and persuade a colleague to join LSS. Thank you for participating in CSB2006.

Vicky Markstein
President, Life Sciences Society
CONTENTS
Committees vii

Referees ix

Preface xi
Keynote Addresses
Exploring the Ocean's Microbes: Sequencing the Seven Seas
  Marvin E. Frazier et al. 1
Don't Know Much About Philosophy: The Confusion Over Bio-Ontologies
  Mark A. Musen 3

Invited Talks
Biomedical Informatics Research Network (BIRN): Building a National Collaboratory for BioMedical and Brain Research
  Mark H. Ellisman 5
Protein Network Comparative Genomics
  Trey Ideker 7
Systems Biology in Two Dimensions: Understanding and Engineering Membranes as Dynamical Systems
  Erik Jakobsson 9
Bioinformatics at Microsoft Research
  Simon Mercer 11
Movie Crunching in Biological Dynamic Imaging
  Jean-Christophe Olivo-Marin 13
Engineering Nucleic Acid-Based Molecular Sensors for Probing and Programming Cellular Systems
  Christina D. Smolke 15
Reactome: A Knowledgebase of Biological Pathways
  Lincoln Stein et al. 17
Structural Bioinformatics
Effective Optimization Algorithms for Fragment-Assembly Based Protein Structure Prediction
  Kevin W. DeRonne and George Karypis 19
Transmembrane Helix and Topology Prediction Using Hierarchical SVM Classifiers and an Alternating Geometric Scoring Function
  Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu 31
Protein Fold Recognition Using the Gradient Boost Algorithm
  Feng Mao, Jinbo Xu, Libo Yu and Dale Schuurmans 43
A Graph-Based Automated NMR Backbone Resonance Sequential Assignment
  Xiang Wan and Guohui Lin 55
A Data-Driven, Systematic Search Algorithm for Structure Determination of Denatured or Disordered Proteins
  Lincong Wang and Bruce Randall Donald 67
Multiple Structure Alignment by Optimal RMSD Implies that the Average Structure is a Consensus
  Xueyi Wang and Jack Snoeyink 79
Identification of α-Helices from Low Resolution Protein Density Maps
  Alessandro Dal Palù, Enrico Pontelli, Jing He and Yonggang Lu 89
Efficient Annotation of Non-Coding RNA Structures Including Pseudoknots via Automated Filters
  Chunmei Liu, Yinglei Song, Ping Hu, Russell L. Malmberg and Liming Cai 99
Thermodynamic Matchers: Strengthening the Significance of RNA Folding Energies
  Thomas Höchsmann, Matthias Höchsmann and Robert Giegerich 111

Microarray Data Analysis and Applications
PEM: A General Statistical Approach for Identifying Differentially Expressed Genes in Time-Course cDNA Microarray Experiment without Replicate
  Xu Han, Wing-Kin Sung and Lin Feng 123
Efficient Generalized Matrix Approximations for Biomarker Discovery and Visualization in Gene Expression Data
  Wenyuan Li, Yanxiong Peng, Hung-Chung Huang and Ying Liu 133

Computational Genomics and Genetics
Efficient Computation of Minimum Recombination with Genotypes (Not Haplotypes)
  Yufeng Wu and Dan Gusfield 145
Sorting Genomes by Translocations and Deletions
  Xingqin Qi, Guojun Li, Shuguang Li and Ying Xu 157
Turning Repeats to Advantage: Scaffolding Genomic Contigs Using LTR Retrotransposons
  Ananth Kalyanaraman, Srinivas Aluru and Patrick S. Schnable 167
Whole Genome Composition Distance for HIV-1 Genotyping
  Xiaomeng Wu, Randy Goebel, Xiu-Feng Wan and Guohui Lin 179
Efficient Recursive Linking Algorithm for Computing the Likelihood of an Order of a Large Number of Genetic Markers
  S. Tewari, S. M. Bhandarkar and J. Arnold 191
Optimal Imperfect Phylogeny Reconstruction and Haplotyping (IPPH)
  Srinath Sridhar, Guy E. Blelloch, R. Ravi and Russell Schwartz 199
Toward an Algebraic Understanding of Haplotype Inference by Pure Parsimony
  Daniel G. Brown and Ian M. Harrower 211
Global Correlation Analysis Between Redundant Probe Sets Using a Large Collection of Arabidopsis ATH1 Expression Profiling Data
  Xiangqin Cui and Ann Loraine 223

Motif Sequence Identification
Distance-Based Identification of Structure Motifs in Proteins Using Constrained Frequent Subgraph Mining
  Jun Huan, Deepak Bandyopadhyay, Jan Prins, Jack Snoeyink, Alexander Tropsha and Wei Wang 227
An Improved Gibbs Sampling Method for Motif Discovery via Sequence Weighting
  Xin Chen and Tao Jiang 239
Detection of Cleavage Sites for HIV-1 Protease in Native Proteins
  Liwen You 249
A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment
  Osman Abul, Finn Drabløs and Geir Kjetil Sandve 257

Biological Pathways and Systems
Identifying Biological Pathways via Phase Decomposition and Profile Extraction
  Yi Zhang and Zhidong Deng 269
Expectation-Maximization Algorithms for Fuzzy Assignment of Genes to Cellular Pathways
  Liviu Popescu and Golan Yona 281
Classification of Drosophila Embryonic Developmental Stage Range Based on Gene Expression Pattern Images
  Jieping Ye, Jianhui Chen, Qi Li and Sudhir Kumar 293
Evolution versus "Intelligent Design": Comparing the Topology of Protein-Protein Interaction Networks to the Internet
  Qiaofeng Yang, Georgos Siganos, Michalis Faloutsos and Stefano Lonardi 299
Protein Functions and Computational Proteomics
Cavity-Aware Motifs Reduce False Positives in Protein Function Prediction
  Brian Y. Chen, Drew H. Bryant, Viacheslav Y. Fofanov, David M. Kristensen, Amanda E. Cruess, Marek Kimmel, Olivier Lichtarge and Lydia E. Kavraki
Protein Subcellular Localization Prediction Based on Compartment-Specific Biological Features
  Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu
Predicting the Binding Affinity of MHC Class II Peptides
  Fatih Altiparmak, Altuna Akalin and Hakan Ferhatosmanoglu
Codon-Based Detection of Positive Selection Can Be Biased by Heterogeneous Distribution of Polar Amino Acids Along Protein Sequences
  Xuhua Xia and Sudhir Kumar
Bayesian Data Integration: A Functional Perspective
  Curtis Huttenhower and Olga G. Troyanskaya
An Iterative Algorithm to Quantify the Factors Influencing Peptide Fragmentation for MS/MS Spectra
  Chungong Yu, Yu Lin, Shiwei Sun, Jinjin Cai, Jingfen Zhang, Zhuo Zhang, Runsheng Chen and Dongbo Bu
Complexity and Scoring Function of MS/MS Peptide De Novo Sequencing
  Changjiang Xu and Bin Ma

Biomedical Applications
Expectation-Maximization Method for Reconstructing Tumor Phylogenies from Single-Cell Data
  Gregory Pennington, Charles A. Smith, Stanley Shackney and Russell Schwartz
Simulating In Vitro Epithelial Morphogenesis in Multiple Environments
  Mark R. Grant, Sean H. J. Kim and C. Anthony Hunt
A Combined Data Mining Approach for Infrequent Events: Analyzing HIV Mutation Changes Based on Treatment History
  Ray S. Lin, Soo-Yon Rhee, Robert W. Shafer and Amar K. Das
A Systems Biology Case Study of Ovarian Cancer Drug Resistance
  Jake Y. Chen, Changyu Shen, Zhong Yan, Dawn P. G. Brown and Mu Wang

Author Index
EXPLORING THE OCEAN'S MICROBES: SEQUENCING THE SEVEN SEAS

Marvin E. Frazier¹, Douglas B. Rusch¹, Aaron L. Halpern¹, Karla B. Heidelberg¹, Granger Sutton¹, Shannon Williamson¹, Shibu Yooseph¹, Dongying Wu², Jonathan A. Eisen², Jeff Hoffman¹, Charles H. Howard¹, Cyrus Foote¹, Brooke A. Dill¹, Karin Remington¹, Karen Beeson¹, Bao Tran¹, Hamilton Smith¹, Holly Baden-Tillson¹, Clare Stewart¹, Joyce Thorpe¹, Jason Freemen¹, Cindy Pfannkoch¹, Joseph E. Venter¹, John Heidelberg², Terry Utterback¹, Yu-Hui Rogers¹, Shaojie Zhang³, Vineet Bafna³, Luisa Falcon⁴, Valeria Souza⁴, German Bonilla⁴, Luis E. Eguiarte⁴, David M. Karl⁵, Ken Nealson⁶, Shubha Sathyendranath⁷, Trevor Platt⁷, Eldredge Bermingham⁸, Victor Gallardo⁹, Giselle Tamayo¹⁰, Robert Friedman¹, Robert Strausberg¹, J. Craig Venter¹

¹ J. Craig Venter Institute, Rockville, Maryland, United States of America
² The Institute for Genomic Research, Rockville, Maryland, United States of America
³ Department of Computer Science, University of California San Diego
⁴ Instituto de Ecologia, Dept. Ecologia Evolutiva, National Autonomous University of Mexico, Mexico City, 04510 Distrito Federal, Mexico
⁵ University of Hawaii, Honolulu, United States of America
⁶ Dept. of Earth Sciences, University of Southern California, Los Angeles, California, United States of America
⁷ Dalhousie University, Halifax, Nova Scotia, Canada
⁸ Smithsonian Tropical Research Institute, Balboa, Ancon, Republic of Panama
⁹ University of Concepción, Concepción, Chile
¹⁰ University of Costa Rica, San Pedro, San José, Republic of Costa Rica
The J. Craig Venter Institute's (JCVI) environmental genomics group has collected ocean and soil samples from around the world. We have begun shotgun sequencing of microbial samples from more than 100 open-ocean and coastal sites across the Pacific, Indian and Atlantic Oceans. These data are being augmented with deep sequencing of 16S and 18S rRNA and the draft sequencing of ~150 cultured marine microbial species. The JCVI is also developing and refining bioinformatics tools to assemble, annotate, and analyze large-scale metagenomic data, along with the appropriate database infrastructure to enable directed analyses. The goals of this Global Ocean Survey are to better understand microbial biodiversity; to discover new genes of ecological importance, including those involved in carbon cycling; to discover new genes that may be useful for biological energy production; and to establish a freely shared, global environmental genomics database that can be used by scientists around the world. Using newly developed metagenomic methods, we are able to examine not only the community of microorganisms, but the community of genes that enable them to capture energy from the sun, remove carbon dioxide from the air, take up organic carbon, and cycle
nitrogen in its various forms through the ecosystem. To date, we have discovered many thousands of new microbial species and millions of new genes, with no apparent slowing of the rate of discovery. These data will be of great value for the study of protein function and protein evolution. The goal of this new science, however, is not to merely catalog sequences, genes and gene families, and species for their own sake. We are attempting to use these new data to better understand the functioning of natural ecosystems. Environmental metagenomics examines the interplay of perhaps thousands of species present and functioning at a point in space and time. Each individual sequence is no longer just a piece of a genome. It is a piece of an entire biological community. This is a resource that can be mined by microbial ecologists worldwide to better understand biogeochemical cycling. Moreover, within this data set is a huge diversity of previously unknown, energy-related genes that may be useful for developing new methods of biological energy production. We acknowledge the DOE, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig
Venter Science Foundation for funding to undertake this study. We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the Governments of Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and French Polynesia/France for facilitating sampling activities. All sequencing data collected from waters of the above named countries remain part of the genetic patrimony of the country from which they were obtained. Canada's Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford basin. The Universidad Nacional Autonoma de Mexico (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Representatives from Costa Rica's Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortes) and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements and scientific analysis. Our visit to the Galapagos Islands was facilitated by assistance
from the Galapagos National Park Service Director, Washington Tapia, the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat. We especially thank Greg Estes (guide), Hector Chauz Campo (Institute of Oceanography of the Ecuador Navy) and a National Park Representative, Simon Ricardo Villemar Tigrero, for field assistance while in the Galapagos Islands. Martin Wikelski (Princeton) and Rod Mackie (University of Illinois) provided advice for target regions in the Galapagos to sample. We thank Matthew Charette (Woods Hole Oceanographic Institution) and Dave Karl (University of Hawaii) for nutrient analysis work and advice. We also acknowledge the help of Michael Ferrari and Jennifer Clark for assistance in acquiring the satellite images. The U.S. Department of State facilitated Governmental communications on multiple occasions. John Glass (JCVI) provided valuable assistance in methods development. Tyler Osgood (JCVI) facilitated many of the vessel related technical needs. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical related issues for the crew of the Sorcerer II. Finally, special thanks also to the captain and crew of the S/V Sorcerer II.
DON'T KNOW MUCH ABOUT PHILOSOPHY: THE CONFUSION OVER BIO-ONTOLOGIES Mark A. Musen, M.D., Ph.D. The National Center for Biomedical Ontology Stanford University 251 Campus Drive, X-215 Stanford, CA 94305 USA
Abstract: For the past decade, there has been increasing interest in ontologies in the biomedical community. As interest has peaked, so has the confusion. The confusion stems from the multiple knowledge-representation languages used to encode ontologies (e.g., frame-based systems, Semantic Web standards such as RDF(S) and OWL, and languages created specifically by the bioinformatics community, such as OBO), where each language has explicit strengths and weaknesses. Biomedical scientists use ontologies for multiple purposes, from annotation of experimental data, to natural-language processing, to data integration, to construction of decision-support systems. Each of these purposes imposes different requirements concerning which entities ontologies should encode and how those entities should be encoded. Although the biomedical informatics community remains excited about ontologies, exactly what an ontology is and how it should be represented within a computer are points about which, with considerable questioning, we can see little uniformity of opinion. The confusion will persist until we can understand that different developers have very different requirements for ontologies, and therefore those developers will make very different assumptions about how ontologies should
be created and structured. We will review those assumptions and the corresponding implications for ontology construction. Our National Center for Biomedical Ontology (http://bioontology.org) is one of the seven national centers for biomedical computing formed under the NIH Roadmap. The Center takes a broad perspective on what ontologies are and how they should be developed and put to use. Our goal, simply put, is to help to eliminate much of the current confusion. The Center recognizes the importance of ontologies for use in a wide range of biomedical applications, and is developing new technology to make all relevant ontologies widely accessible, searchable, alignable, and useable within software systems. Ultimately, the Center will support the publication of biomedical ontologies online, much as we publish scientific knowledge in print media. The advent of biomedical knowledge that is widely available in machine-processable form will alter the way that we think about science and perform scientific experiments. The biomedical community soon will enter an era in which scientific knowledge will become more accessible, more useable, and more precise, and in which new methods will be needed to support a radically different kind of scientific publishing.
BIOMEDICAL INFORMATICS RESEARCH NETWORK (BIRN): BUILDING A NATIONAL COLLABORATORY FOR BIOMEDICAL AND BRAIN RESEARCH Mark H. Ellisman, Ph.D., Professor UCSD Department of Neurosciences and Director of the BIRN Coordinating Center (www.nbirn.net) The Center for Research on Biological Systems (CRBS) at UCSD
The Biomedical Informatics Research Network (BIRN) is an initiative within the National Institutes of Health (US) that fosters large-scale collaborations in biomedical science by utilizing the capabilities of the emerging national cyberinfrastructure (high-speed networks, distributed high-performance computing and the necessary software and data integration capabilities). Currently, the BIRN involves a consortium of 20 universities and 30 research groups participating in three test bed projects centered around brain imaging of human neuropsychiatric disease and associated animal models. These groups are working on large scale, cross-institutional imaging studies on Alzheimer's disease, depression, and schizophrenia using structural and functional magnetic resonance imaging (MRI). Others are studying animal models relevant to multiple sclerosis, attention deficit disorder, and Parkinson's disease through MRI, whole brain histology, and high-resolution light and electron microscopy. These test bed projects present practical and immediate requirements for performing large-scale bioinformatics studies and provide a multitude of usage cases for distributed computation and the handling of heterogeneous data. The promise of the BIRN is the ability to test new hypotheses through the analysis of larger patient populations and unique multi-resolution views of animal models through data sharing and the integration of site independent resources for collaborative data refinement. The BIRN Coordinating Center (BIRN-CC) is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the scientific goals pursued by these test bed
scientists. These components include high bandwidth inter-institutional connectivity via Internet2, a uniformly consistent security model, grid-based file management and computational services, software and techniques to federate data and databases, data caching and replication techniques to improve performance and resiliency, and shared processing, visualization and analysis environments. As a core component of the BIRN infrastructure, Internet2 provides a solid foundation for the future expansion of the BIRN as well as the stable high performance network required by researchers in a national collaboratory. Researchers within BIRN are also benefiting directly from the connectivity to high performance computing resources, such as TeraGrid. Currently researchers are performing advanced shape analyses of anatomical structures to gain a better understanding of diseases and disorders. These analyses run on TeraGrid have produced over 10 TB of resultant data which were then transferred back to the BIRN Data Grid. BIRN intertwines concurrent revolutions occurring in biomedicine and information technology. As the requirements of the biomedical community become better specified through projects like the BIRN, the national cyberinfrastructure being assembled to enable large-scale science projects will also evolve. As these technologies mature, the BIRN is uniquely situated to serve as a major conduit between the biomedical research community of NIH-sponsored programs and the information technology development programs, mostly supported by other government agencies (e.g., NSF, NASA, DOE, DARPA) and industry.
PROTEIN NETWORK COMPARATIVE GENOMICS Trey Ideker University of California San Diego
With the appearance of large networks of protein-protein and protein-DNA interactions as a new type of biological measurement, methods are needed for constructing cellular pathway models using interaction data as the central framework. The key idea is that, by comparing the molecular interaction network with other biological data sets, it will be possible to organize the network into modules representing the repertoire of distinct functional processes in the cell. Three distinct types of network comparisons will be discussed, including those to identify: (1) protein interaction networks that are conserved across species; (2) networks in control of gene expression changes; (3) networks correlating with systematic phenotypes and synthetic lethals. Using these computational modeling and query tools, we are constructing network models to explain the physiological response of yeast to DNA damaging agents.
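As a minimal illustration of the first type of comparison, the Python sketch below (not the speaker's PathBLAST implementation; the protein names and ortholog map are hypothetical) lists the interactions in one species that map, through an ortholog table, onto interactions in a second species:

    # Hypothetical sketch: find interactions conserved across two species.
    # net_a and net_b are sets of unordered protein pairs (frozensets);
    # orthologs maps species-A protein names to species-B protein names.
    def conserved_interactions(net_a, net_b, orthologs):
        conserved = set()
        for pair in net_a:
            p1, p2 = tuple(pair)
            if p1 in orthologs and p2 in orthologs:
                mapped = frozenset({orthologs[p1], orthologs[p2]})
                if mapped in net_b:
                    conserved.add(pair)
        return conserved

    yeast = {frozenset({"SLN1", "YPD1"}), frozenset({"SLN1", "SSK1"})}
    fly = {frozenset({"dSLN1", "dYPD1"})}
    orth = {"SLN1": "dSLN1", "YPD1": "dYPD1", "SSK1": "dSSK1"}
    print(conserved_interactions(yeast, fly, orth))
    # -> {frozenset({'SLN1', 'YPD1'})}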
Relevant articles and links
1. Yeang, C.H., Mak, H.C., McCuine, S., Workman, C., Jaakkola, T., and Ideker, T. Validation and refinement of gene regulatory pathways on a network of physical interactions. Genome Biology 6(7): R62 (2005).
2. Kelley, R. and Ideker, T. Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology 23(5): 561-566 (2005).
3. Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R. M., and Ideker, T. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102(6): 1974-79 (2005).
4. Suthram, S., Sittler, T., and Ideker, T. The Plasmodium network diverges from those of other species. Nature 437 (November 3, 2005).
5. http://www.pathblast.org
6. http://www.cytoscape.org
Acknowledgements
We gratefully acknowledge funding through NIH/NIGMS grant GM070743-01; NSF grant CCF-0425926; Unilever, PLC; and the Packard Foundation.
SYSTEMS BIOLOGY IN TWO DIMENSIONS: UNDERSTANDING AND ENGINEERING MEMBRANES AS DYNAMICAL SYSTEMS Erik Jakobsson
University of Illinois at Urbana-Champaign Director, National Center for the Design of Biomimetic Nanoconductors
Theme: The theme of our NIH Nanomedicine Development Center is design of biomimetic nanoconductors and devices utilizing nanoconductors. The model theoretical systems are native and mutant biological channels and other ion transport proteins and synthetic channels, and heterogeneous membranes containing channels and transporters. The model experimental systems are engineered protein channels and synthetic channels in isolation, and in self-assembled membranes supported on nanoporous silicon scaffolds. The ultimate goal is to understand how biomimetic nanoscale design can be utilized in devices to achieve the functions that membrane systems accomplish in biological systems: a) electrical and electrochemical signaling, b) generation of osmotic pressures and flows, c) generation of electrical power, and d) energy transduction.
Broad Goals: Our Center's broad goals are:
1. To advance theoretical, computational, and experimental methods for understanding and quantitatively characterizing biomembrane and other nanoscale transport processes, through interactive teams doing collaborative macromolecular design and synthesis, computation/theory, and experimental functional characterization.
2. To use our knowledge and technical capabilities to design useful biomimetic devices and technologies that utilize membrane and nanopore transport.
3. To interact synergistically with other workers in the areas of membrane processes, membrane structure, the study of membranes as systems, biomolecular design, biomolecular theory and computation, transport processes, and nanoscale device design.
4. To disseminate enhanced methods and tools for: theory and computation related to transport, experimental characterization of membrane function, theoretical and experimental characterization of nanoscale fluid flow, and nanotransport aspects of device design.
Initial Design Target: A biocompatible biomimetic battery (the "biobattery") to power an implantable artificial retina, extendable to other neural prostheses. Broad design principles are suggested by the electrocyte of the electric eel, which generates large voltages and current densities by stacking large areas of electrically excitable membranes in series. The potential advantages of the biomimetic battery are lack of toxic materials, and ability to be regenerated by the body's metabolism.
Major Emergent Reality Constraints: The development and maintenance of the electrocyte in the eel are guided by elaborate and adaptive pathways under genetic control, which we can not realistically hope to include in a device.
Our approach will include replacing the developmental machinery with a nanoporous silicon scaffold, on which membranes will self-assemble. The lack of maintenance machinery will be compensated for by making the functional components of the biobattery from more durable, less degradable molecules.
Initial Specific Activities:
1. Making a detailed dynamical model, including electrical and osmotic phenomena and incorporating specific geometry, of the eel electrocyte.
2. Do initial design of a biomimetic battery that is potentially capable of fabrication/self-assembly.
3. Search for more durable functional analogues of the membranes and transporters of the electrocyte. Approaches being pursued include designing beta-barrel functional analogues for helix-bundle proteins, mining extremophile genomes for appropriate transporters, chemically functionalized silicon pores, and design of durable synthetic polymer membranes that can incorporate transport molecules by self-assembly. These approaches combine information technology, computer modeling, and simulation with experiment.
4. Fabrication of nanoporous silicon supports for heterogeneous membranes in complex geometries.
Organizational Principles of Center: Our core team is supported by the NIH Roadmap grant, but we welcome collaborations with all workers with relevant technologies and skills, and aligned interests.
BIOINFORMATICS AT MICROSOFT RESEARCH Simon Mercer Microsoft Research One Microsoft Way Redmond, WA 98052, USA
The advancement of the life sciences in the last twenty years has been in part the story of increasing integration of computing with scientific research, a trend that is set to transform the practice of science in our lifetimes. Conversely, biological systems are a rich source of ideas that will transform the future of computing. In addition to supporting academic research in the life sciences, Microsoft Research is a source of tools and technologies well suited to the needs of basic scientific
research - current projects include new languages to simplify data extraction and processing, tools for scientific workflows, and biological visualization. Computer science researchers also bring new perspectives to problems in biology, such as the use of schema-matching techniques in merging ontologies, machine learning in vaccine design, and process algebra in understanding metabolic pathways.
MOVIE CRUNCHING IN BIOLOGICAL DYNAMIC IMAGING

Jean-Christophe Olivo-Marin
Quantitative Image Analysis Unit, Institut Pasteur, CNRS URA 2582
25 rue du Dr Roux, 75724 Paris, France

Recent advances in biological imaging technologies have enabled the observation of living cells with high resolution during extended periods of time and are impacting biological research in such different areas as high-throughput image-based drug screening, cellular therapies, cell and developmental biology and gene expression studies. Deciphering the complex machinery of cell functions and dysfunction indeed necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological systems. However, understanding the wealth of data generated by multidimensional microscopy depends critically on decoding the visual information contained therein and on the availability of the tools to do so. Innovative automatic techniques to extract quantitative data from image sequences are therefore of major interest. I will present methods we have recently developed to perform the computational analysis of image sequences coming from multidimensional microscopy, with particular emphasis on tracking and motion analysis for 3D+t image sequences using active contours and multiple particle tracking.
1. INTRODUCTION
The advent of multidimensional microscopy (real-time optical sectioning and confocal, TIRF, FRET, FRAP, FLIM) has enabled biologists to visualize cells, tissues and organs in their intrinsic 3D and 3D+t geometry, in contrast to the limited 2D representations that were available until recently. These new technologies are already impacting biological research in such different areas as high-throughput image-based drug screening, cellular therapies, cell and developmental biology and gene expression studies, as they are putting at hand the imaging of the inner working of living cells in their natural context. Expectations are high for breakthroughs in areas such as cell response and motility modification by drugs, control of targeted sequence incorporation into the chromatin for cell therapy, spatial-temporal organization of the cell and its changes with time or under infection, assessment of pathogens routing into the cell, interaction between proteins, sanitary control of pathogen evolution, to name but a few. Deciphering the complex machinery of cell functions and dysfunction necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological material. However, understanding the wealth of data generated by multidimensional
microscopy depends critically on decoding the visual information contained therein. Within the wide interdisciplinary field of biological imaging, I will concentrate on work developed in our laboratory on two aspects central to cell biology, particle tracking and cell shape and motility analysis, which have many applications in the important field of infectious diseases.
2. PARTICLE TRACKING
Molecular dynamics in living cells is a central topic in cell biology, as it opens the possibility to study with submicron resolution molecular diffusion, spatio-temporal regulation of gene expression and pathogen motility and interaction with host cells. For example, it is possible, after labelling with specific fluorochromes, to record the movement of organelles like phagosomes or endosomes in the cell [6], the movement of different mutants of bacteria or parasites [2] or the positioning of telomeres in nuclei (Galy et al., 2000) [3]. I will describe the methods we have developed to perform the detection and the tracking of microscopic spots directly on four dimensional (3D+t) image data [4, 5]. They are able to detect with high accuracy multiple
biological objects moving in three-dimensional space and incorporate the possibility to follow moving spots switching between different types of dynamics. Our methods decouple the detection and the tracking processes and are based on a two-step procedure: first, the objects are detected in the image stacks thanks to a procedure based on a three-dimensional wavelet transform; then the tracking is performed within a Bayesian framework where each object is represented by a state vector evolving according to biologically realistic dynamic models.
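As a rough sketch of this decoupled detect-then-track idea, the Python fragment below reduces the wavelet-based detector to a simple intensity threshold and the Bayesian tracker to a constant-velocity state prediction with nearest-neighbor association; the published methods are considerably richer, and all names here are illustrative only.

    import numpy as np

    def detect_spots(stack, threshold):
        # Detector stub: coordinates of voxels above an intensity threshold.
        return [d.astype(float) for d in np.argwhere(stack > threshold)]

    class Track:
        def __init__(self, pos):
            self.pos = np.asarray(pos, dtype=float)  # current 3D position
            self.vel = np.zeros(3)                   # constant-velocity model

        def predict(self):
            return self.pos + self.vel

        def update(self, obs):
            self.vel = obs - self.pos
            self.pos = obs

    def step(tracks, detections, gate=5.0):
        # Greedy nearest-neighbor association within a gating distance.
        free = list(detections)
        for t in tracks:
            if not free:
                break
            dists = [np.linalg.norm(t.predict() - d) for d in free]
            j = int(np.argmin(dists))
            if dists[j] < gate:
                t.update(free.pop(j))
        return tracks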
3. CELL TRACKING
Another important project of our laboratory is motivated by the problem of cell motility. The ability of cells to move and change their shape is important in many important areas of biology, including cancer, development, infection and immunity [7]. We have developed algorithms to automatically segment and track moving cells in dynamic 2D or 3D microscopy [1, 8]. For this purpose, we have adopted the framework of active contours and deformable models that is widely employed in the computer vision community. The segmentation proceeds by evolving the front according to evolution equations that minimize an energy functional (usually by gradient descent). This energy contains both data attachment terms and terms encoding prior information about the boundaries to be extracted, e.g. smoothness constraints. Tracking, i.e. linking segmented objects between time points, is simply achieved by initializing front evolutions using the segmentation result of the previous frame, under the assumption that inter-frame motions are modest. I will describe some of our work on adapting these methods to the needs of cellular imaging in biological research.

References
1. A. Dufour, V. Shinin, S. Tajbakhsh, N. Guillen, J.-C. Olivo-Marin, and C. Zimmer, Segmenting and
tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces, IEEE Trans. Image Processing, vol. 14, no. 9, pp. 1396-1410, 2005.
2. F. Frischknecht, P. Baldacci, B. Martin, C. Zimmer, S. Thiberge, J.-C. Olivo-Marin, S. L. Shorte, and R. Menard, Imaging movement of malaria parasites during transmission by Anopheles mosquitoes, Cell Microbiol, vol. 6, no. 7, pp. 687-94, 2004.
3. V. Galy, J.-C. Olivo-Marin, H. Scherthan, V. Doye, N. Rascalou, and U. Nehrbass, Nuclear pore complexes in the organization of silent telomeric chromatin, Nature, vol. 403, pp. 108-112, 2000.
4. A. Genovesio, B. Zhang, and J.-C. Olivo-Marin, Tracking of multiple fluorescent biological objects in three dimensional video microscopy, IEEE International Conference on Image Processing ICIP 2003, vol. I, pp. 1105-1108, Barcelona, Spain, September 2003.
5. A. Genovesio, T. Liedl, V. Emiliani, W. Parak, M. Coppey-Moisan, and J.-C. Olivo-Marin, Multiple particle tracking in 3D+t microscopy: method and application to the tracking of endocytosed Quantum Dots, IEEE Trans. Image Processing, vol. 15, no. 5, pp. 1062-1070, 2006.
6. C. Murphy, R. Saffrich, J.-C. Olivo-Marin, A. Giner, W. Ansorge, T. Fotsis, and M. Zerial, Dual function of RhoD in vesicular movement and cell motility, Eur. Journal of Cell Biology, vol. 80, no. 6, pp. 391-398, 2001.
7. C. Zimmer, E. Labruyere, V. Meas-Yedid, N. Guillen, and J.-C. Olivo-Marin, Segmentation and tracking of migrating cells in videomicroscopy with parametric active contours: a tool for cell-based drug testing, IEEE Trans. Medical Imaging, vol. 21, pp. 1212-1221, 2002.
8. C. Zimmer and J.-C. Olivo-Marin, Coupled parametric active contours, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1838-1842, 2005.
ENGINEERING NUCLEIC ACID-BASED MOLECULAR SENSORS FOR PROBING AND PROGRAMMING CELLULAR SYSTEMS Professor Christina D. Smolke California Institute of Technology, Department of Chemical Engineering
Information flow through cellular networks is responsible for regulating cellular function at both the single cell and multi-cellular systems levels. One of the key limitations to understanding dynamic fluctuations in intracellular biomolecule concentrations is the lack of enabling technologies that allow for user-specified probing and programming of these cellular events. I will discuss our work in developing the molecular design and cellular engineering strategies for the construction of tailor-made sensor platforms that can temporally and spatially monitor and regulate information flow through diverse cellular networks. The construction of sensor platforms based on allosteric regulation of non-coding RNA (ncRNA) activity will be presented, where molecular recognition of a ligand-binding event is coupled to a conformational change in the RNA
molecule. This regulated conformational change may be linked to an appropriate readout signal by controlling a diverse set of ncRNA gene regulatory activities. Our research has demonstrated the modularity, design predictability, and specificity inherent in these molecules for cellular control. In addition, the flexibility of these sensor platforms enables these molecules to be incorporated into larger circuits based on molecular computation strategies to construct sensor sets that will perform higher-level signal processing toward complex systems analysis and cellular programming strategies. In particular, the application of these molecular sensors to the following downstream research areas will be discussed: metabolic engineering of microbial alkaloid synthesis and 'intelligent' therapeutic strategies.
REACTOME: A KNOWLEDGEBASE OF BIOLOGICAL PATHWAYS Lincoln Stein, Peter D'Eustachio, Gopal Gopinathrao, Marc Gillespie, Lisa Matthews, Guanming Wu Cold Spring Harbor Laboratory Cold Spring Harbor, NY, USA
Imre Vastrik, Esther Schmidt, Bernard de Bono, Bijay Jassal, David Croft, Ewan Birney European Bioinformatics Institute Hinxton, UK
Suzanna Lewis Lawrence Berkeley National Laboratory Berkeley, CA, USA
Reactome, located at http://www.reactome.org, is a curated, peer-reviewed resource of human biological processes. Given the genetic makeup of an organism, the complete set of possible reactions constitutes its reactome. The basic unit of the Reactome database is a reaction; reactions are then grouped into causal chains to form pathways. The Reactome data model allows us to represent many diverse processes in the human system, including the pathways of intermediary metabolism, regulatory pathways, and signal transduction, and high-level processes, such as the cell cycle. Reactome provides a qualitative framework, on which quantitative
data can be superimposed. Tools have been developed to facilitate custom data entry and annotation by expert biologists, and to allow visualization and exploration of the finished dataset as an interactive process map. Although our primary curational domain is pathways from Homo sapiens, we regularly create electronic projections of human pathways onto other organisms via putative orthologs, thus making Reactome relevant to model organism research communities. The database is publicly available under open source terms, which allows both its content and its software infrastructure to be freely used and redistributed.
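The data model sketched in this abstract (reactions as the basic unit, grouped into causal chains that form pathways) can be pictured with a few lines of code. The Python below is a hypothetical illustration, far simpler than Reactome's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class Reaction:
        name: str
        inputs: list
        outputs: list
        preceding: list = field(default_factory=list)  # causal predecessors

    @dataclass
    class Pathway:
        name: str
        events: list = field(default_factory=list)  # reactions or sub-pathways

    # Two reactions chained causally, grouped into a pathway.
    r1 = Reaction("glucose -> G6P", ["glucose", "ATP"], ["G6P", "ADP"])
    r2 = Reaction("G6P -> F6P", ["G6P"], ["F6P"], preceding=[r1])
    glycolysis = Pathway("Glycolysis", events=[r1, r2])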
EFFECTIVE OPTIMIZATION ALGORITHMS FOR FRAGMENT-ASSEMBLY BASED PROTEIN STRUCTURE PREDICTION

Kevin W. DeRonne* and George Karypis
Department of Computer Science & Engineering, Digital Technology Center, Army HPC Research Center,
University of Minnesota, Minneapolis, MN 55455
*Corresponding author. Email: {deronne, karypis}@cs.umn.edu

Despite recent developments in protein structure prediction, an accurate new fold prediction algorithm remains elusive. One of the challenges facing current techniques is the size and complexity of the space containing possible structures for a query sequence. Traditionally, to explore this space, fragment assembly approaches to new fold prediction have used stochastic optimization techniques. Here we examine deterministic algorithms for optimizing scoring functions in protein structure prediction. Two previously unused techniques are applied to the problem, called the Greedy algorithm and the Hill-climbing algorithm. The main difference between the two is that the latter implements a technique to overcome local minima. Experiments on a diverse set of 276 proteins show that the Hill-climbing algorithms consistently outperform existing approaches based on Simulated Annealing optimization (a traditional stochastic technique) in optimizing the root mean squared deviation (RMSD) between native and working structures.
1. INTRODUCTION
Reliably predicting protein structure from amino acid sequence remains a challenge in bioinformatics. Although the number of known structures continues to grow, many new sequences still lack a known homolog in the PDB [2], which makes it harder to predict structures for these sequences. The conditional existence of a known structural homolog to a query sequence commonly delineates a set of subproblems within the greater arena of protein structure prediction. For example, the biennial CASP competition [3] (http://predictioncenter.org/) breaks down structure prediction as follows. In homologous fold recognition the structure of the query sequence is similar to a known structure for some other sequence. However, these two sequences have only a low (though detectable) similarity. In analogous fold recognition there exists a known structure similar to the correct structure of the query, but the sequence of that structure has no detectable similarity to the query sequence. Still more challenging is the problem of predicting the structure of a query sequence lacking a known structural relative, which is called new fold (NF) prediction.

Within the context of the NF problem knowledge-based methods have attracted increasing attention over the last decade. In CASP, prediction approaches that assemble fragments of known structures into a candidate structure [18, 7, 10] have consistently outperformed alternative methods, such as those based largely on explicit modeling of physical forces. Fragment assembly for a query protein begins with the selection of structural fragments based on sequence information. These fragments are then successively inserted into the query protein's structure, replacing the coordinates of the query with those of the fragment. The quality of this new structure is assessed by a scoring function. If the scoring function is a reliable measure of how close the working structure is to the native fold of the protein, then optimizing the function through fragment insertions will produce a good structure prediction. Thus, building a structure in this manner can break down into three main components: a fragment selection technique, an optimizer for the scoring function, and the scoring function itself. To optimize the scoring function, all the leading assembly-based approaches use an algorithm involving a stochastic search (e.g. Simulated Annealing [18], genetic algorithms [7], or conformational space annealing [10]). One potential drawback of such techniques is that they can require extensive parameter tuning before producing good solutions.
In this paper we wish to examine the relative performance of deterministic and stochastic techniques to optimize a scoring function. The new algorithms presented below are inspired by techniques originally developed in the context of graph partitioning [4], and do not depend on a random element. The Greedy approach examines all possible fragment insertions at a given point and chooses the best one available. The Hill-climbing algorithm follows a similar strategy but allows for moves that reduce the score locally, provided that they lead to a better global score. Several variables can affect the performance of optimization algorithms in the context of fragment-based ab initio structure prediction. For example, how many fragments per position are available to the optimizer, how long the fragments are, whether they should be of multiple sizes at different stages [18] or all different sizes used together [7], and other parameters specific to the optimizer can all influence the quality of the resulting structures. Taking the above into account, we varied fragment length and number of fragments per position when comparing the performance of our optimization algorithms to that of a tuned Simulated Annealing approach. Our experiments test these algorithms on a diverse set of 276 protein domains derived from SCOP 1.69 [14]. The results of these experiments show that the Hill-climbing-based approaches are very effective in producing high-quality structures in a moderate amount of time, and that they generally outperform Simulated Annealing. On average, Hill-climbing is able to produce structures that are 6% to 20% better (as measured by the root mean square deviation (RMSD) between the computed and the actual structure), and the relative advantage of Hill-climbing-based approaches improves with the length of the proteins.
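The control flow of the two deterministic optimizers can be sketched as follows, assuming a lower-is-better scoring function such as RMSD. This is an illustrative reduction, not the authors' code: moves(), m.apply(), and score() are hypothetical stand-ins for the fragment-insertion machinery.

    import copy

    def greedy(structure, moves, score):
        # Repeatedly apply the single best score-improving insertion.
        while True:
            best_score, best_move = score(structure), None
            for m in moves(structure):
                s = score(m.apply(copy.deepcopy(structure)))
                if s < best_score:
                    best_score, best_move = s, m
            if best_move is None:
                return structure  # no insertion improves the score
            structure = best_move.apply(structure)

    def hill_climb(structure, moves, score, patience=50):
        # Always take the best available insertion, even if it locally
        # worsens the score, and remember the best structure seen globally.
        best, bad = copy.deepcopy(structure), 0
        while bad < patience:
            candidates = list(moves(structure))
            if not candidates:
                break
            m = min(candidates,
                    key=lambda mv: score(mv.apply(copy.deepcopy(structure))))
            structure = m.apply(structure)
            if score(structure) < score(best):
                best, bad = copy.deepcopy(structure), 0
            else:
                bad += 1
        return best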
2. MATERIALS AND METHODS

2.1. Data
The performance of the optimization algorithms studied in this paper was evaluated using a set of proteins with known structure that was derived from SCOP 1.69 [14] as follows. Starting from the set of domains in SCOP, we first removed all membrane and cell surface proteins, and then used Astral's tools [3] to construct a set of proteins with less than 25% sequence identity. This set was further reduced by keeping only the structures that were determined by X-ray crystallography, filtering out any proteins with a resolution greater than 2.5 Å, and removing any proteins with a Cα-Cα distance greater than 3.8 Å times their sequential separation (no bond lengths were modified to fit this constraint; proteins not satisfying it were simply removed from consideration). The above steps resulted in a set of 2817 proteins. From this set, we selected a subset of 276 proteins (roughly 10%) to be used in evaluating the performance of the various optimization algorithms (i.e., a test set), whereas the remaining 2541 sequences were used as the database from whence to derive the structural fragments (i.e., a training set); this dataset is available at http://www.cs.umn.edu/~deronne/supplement/optimize. The test sequences, whose characteristics are summarized in Table 1, were selected to be diverse in length and secondary structure composition.

Table 1. Number of sequences at various length intervals and SCOP class.

SCOP Class    <100    100-200    >200    Total
alpha           23       40         6      69
beta            23       27        18      69
alpha/beta       4       26        39      69
alpha+beta      15       36        17      69
2.2. Neighbor Lists
As the search space for fragment assembly is much too vast, fragment-based ab initio structure prediction approaches must reduce the number of possible structures that they consider. They accomplish this primarily by restricting the number of structural fragments that can be used to replace each k-mer of the query sequence. In evaluating the various optimization algorithms developed in this work, we followed a methodology for identifying these structural fragments that is similar in spirit to that used by the Rosetta [18] system. Consider a query sequence X of length l. For
each position i, we identify a list (L_i) of n structural fragments by comparing the query sequence against the sequences of the proteins in the training set. For fragments of length k, these comparisons involve the k-mer of X starting at position i (0 ...

... TS_i; therefore, the final topology for the N-terminal loop is outside (o).
3.2. Evaluation metrics
There are two sets of evaluation measures for TMH prediction: per-segment and per-residue accuracies [13]. Per-segment scores indicate how accurately the location of a TMH region is predicted and per-residue scores report how well each residue is predicted. Table 1 lists the per-segment and per-residue metrics used in this paper. In the calculation of per-segment scores, two issues must be addressed when counting a helix as correctly predicted. First, a minimal overlap of observed helix segments must be defined. For this, we use a less relaxed criterion which requires at least 9 overlapping residues. An evaluation study by Chen et al. [13] used a more relaxed minimal overlap of only 3 residues. Second, we do not allow an overlapping observed helix to be counted twice. We use the following examples to illustrate these two issues (H = helix):

Observation    ---HHHHHHHHHHHHH----HHHHHHHHHHHHH---
Prediction 1   ---HHHHHHHHH----HHHHHHHHH-----------
Prediction 2   HHHHHHHHHHHHHHHHHHHHHHH-------------
Prediction 3   ---HHH---HHHHHHHHH--HHHHHHHHH-------
Prediction 1 achieves 100% accuracy if the minimal overlap is 3 residues; if the minimal overlap is 9 residues, Prediction 1 achieves 50% accuracy. Prediction 2 achieves 50% accuracy because it already overlaps with the first observed helix. Prediction 3 achieves 100% accuracy if the minimal overlap is 3 residues, but its second predicted helix is an over-prediction, since we count an overlapping observed helix only once. Prediction 3 achieves 50% accuracy if the minimal overlap is 9 residues, because the first predicted helix does not satisfy the minimal overlap requirement; in addition, the second predicted helix is again an over-prediction, so it is not counted.

3.3. Performance of input feature combinations for helix prediction

We test the performance of different input feature combinations for the first classifier. The following combinations are considered: 1) AA only; 2) AA and any one of DP, HS, and AM; 3) AA and any two of DP, HS, and AM; and 4) all four features. We also construct a consensus prediction from the two top-performing combinations through probability estimation using LIBSVM25. The value of the estimated probability for each residue corresponds to the confidence given for its predicted class. In the case of disagreement between the predicted classes, the consensus prediction takes the result of the prediction with the higher probability, as sketched below.
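As an illustration only (the residue-level representation and the function below are hypothetical stand-ins, not the implementation used in this work), the consensus rule can be sketched as follows:

    def consensus(pred_a, prob_a, pred_b, prob_b):
        # Combine two per-residue helix predictions (e.g., Combinations 5 and 6).
        # pred_*: predicted class per residue ('H' = helix, '-' = non-helix).
        # prob_*: LIBSVM-style probability estimate for that predicted class.
        result = []
        for pa, qa, pb, qb in zip(pred_a, prob_a, pred_b, prob_b):
            if pa == pb:
                result.append(pa)                    # both classifiers agree
            else:
                result.append(pa if qa >= qb else pb)  # keep the more confident one
        return result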
Table 1. Evaluation metrics used in this work. Per-segment metrics include Qok, Q%obs_htm, Q%prd_htm, and TOPO. Per-residue metrics include Q2, Q%obs_2T, and Q%prd_2T. Nprot is the number of proteins in a data set. We follow the same performance measures proposed by Chen et al.13

- Qok = (1/Nprot) × Σ_i δ_i × 100%, with δ_i = 1 if all observed TMH segments of protein i are predicted correctly and δ_i = 0 otherwise: percentage of proteins in which all TMH segments are predicted correctly.
- Q%obs_htm = (number of correctly predicted TMH in data set / number of TMH observed in data set) × 100%: TMH segment recall.
- Q%prd_htm = (number of correctly predicted TMH in data set / number of TMH predicted in data set) × 100%: TMH segment precision.
- TOPO = (number of proteins with correctly predicted topology / Nprot) × 100%: percentage of correctly predicted topology.
- Q2 = (1/Nprot) × Σ_i (number of residues predicted correctly in protein i / number of residues in protein i) × 100%: percentage of correctly predicted TMH residues.
- Q%obs_2T = (number of residues correctly predicted in TM helices / number of residues observed in TM helices) × 100%: TMH residue recall.
- Q%prd_2T = (number of residues correctly predicted in TM helices / number of residues predicted in TM helices) × 100%: TMH residue precision.
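To make the per-segment bookkeeping concrete, the sketch below counts a predicted helix as correct only if it overlaps a not-yet-matched observed helix by at least 9 residues, so that no observed helix is credited twice (Section 3.2). The segment representation and tie-breaking are assumptions for illustration, not the code used in this work.

    def count_correct_segments(observed, predicted, min_overlap=9):
        # observed, predicted: lists of (start, end) helix segments, inclusive.
        # Returns the number of predicted helices counted as correct; each
        # observed helix may be matched at most once.
        matched = set()
        correct = 0
        for ps, pe in predicted:
            for i, (os_, oe) in enumerate(observed):
                if i in matched:
                    continue  # an observed helix is never counted twice
                overlap = min(pe, oe) - max(ps, os_) + 1
                if overlap >= min_overlap:
                    matched.add(i)
                    correct += 1
                    break
        return correct

    # TMH segment recall and precision over a data set then follow as
    #   Q%obs_htm = 100 * total_correct / total_observed
    #   Q%prd_htm = 100 * total_correct / total_predicted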
Table 2 shows the performance of the input feature combinations and the consensus prediction. Combination 5 achieves the highest Qok score at 71.9% and performs consistently well on the other per-segment and per-residue measures. Combination 6 has a strikingly high Q%obs_2T score of 85.9%. The purpose of the consensus prediction is to combine the benefits of both combinations. Indeed, the consensus increases the Qok score of Combination 6 by 1.5%, while its Q%obs_2T score decreases by only 0.3%. Compared to Combination 5, the consensus shows a decrease in Qok of 1.4% but an increase in Q%obs_2T of 3.8%. In addition, the consensus approach also scores the highest for Q2, at 89.1%. The consensus approach is selected as our best model for comparison with other approaches.
Table 2. Performance of input feature combinations and the consensus method. Input features: AA (amino acid composition), DP (di-peptide composition), HS (hydrophobicity scale)16 and AM (amphiphilicity)17. Per-segment scores: Qok, Q%obs_htm, Q%prd_htm; per-residue scores: Q2, Q%obs_2T, Q%prd_2T (all in %).

    No.  Input feature(s)  Qok   Q%obs_htm  Q%prd_htm  Q2    Q%obs_2T  Q%prd_2T
    1    AA                71.2  93.8       93.9       89.1  82.9      83.0
    2    AA+DP             69.8  94.0       93.8       88.9  81.9      83.2
    3    AA+HS             71.2  92.8       94.2       89.1  81.9      84.0
    4    AA+AM             70.5  93.6       93.6       89.1  83.0      82.9
    5    AA+DP+HS          71.9  93.6       94.2       89.0  81.8      83.7
    6    AA+DP+AM          69.0  93.4       94.0       89.0  85.9      80.6
    7    AA+HS+AM          68.3  93.3       94.2       88.8  79.8      84.4
    8    AA+DP+HS+AM       69.1  92.3       95.4       89.0  80.9      84.3
    9    Consensus (5+6)   70.5  93.2       94.9       89.1  85.6      81.4

Table 3. Performance of prediction methods for the low- and high-resolution data sets. Per-segment and per-residue scores of all compared methods are taken from the evaluation by Chen et al.13. TOPO scores for the high-resolution data set were re-evaluated due to the update of topology information. The shaded area outlines the four top-performing methods. Note that we do not have cross-validation results for the other methods; their accuracies might therefore be over-estimated. In addition, we use a minimal overlap of 9 residues, whereas Chen et al.13 used only 3 residues. Methods are sorted by their Qok values for the low-resolution data set.

[Table 3 body: per-segment (Qok, Q%obs_htm, Q%prd_htm, TOPO) and per-residue (Q2, Q%obs_2T, Q%prd_2T) scores on both data sets for SVMtmh, TMHMM2, PHDpsiHtm08, HMMTOP2, PRED-TMR, PHDhtm08, PHDhtm07, SOSUI, TopPred2, DAS, Ben-Tal, Wolfenden, WW, GES, Eisenberg, KD, Heijne, Hopp-Woods, Sweet, Av-Cid, Roseman, Levitt, Nakashima, A-Cid, Lawson, Radzicka, Bull-Breese, EM, Fauchere; the column alignment of the numerical entries is lost in extraction.]

3.4. Performance on high- and low-resolution data sets

SVMtmh is compared to other methods for the high- and low-resolution data sets in Table 3. For the low-resolution set, SVMtmh ranks the highest among all the compared methods for the per-segment measures TOPO, Qok, and Q%prd_htm, at 84%, 71%, and 95%, respectively. Specifically, SVMtmh improves TOPO by 5% over the second-best method for the low-resolution data set. For the high-resolution set, most notably, SVMtmh has the highest TOPO score at 91%, a 14% improvement over the second-best method. Another marked improvement is also observed for the high-resolution set, in
which SVMtmh obtains the highest Q2 score, at 86%, compared to 80% for the second-best methods. Generally, SVMtmh performs 3% to 12% better on the high-resolution set than on the low-resolution set in terms of per-segment scores. Meanwhile, for per-residue scores, the accuracy on the high- and low-resolution data sets is similar, in the range of 81% to 90%. The shaded area in Table 3 denotes the four top-performing approaches, which are selected to further predict newly solved membrane protein structures (Section 3.7).
3.5. Discrimination between soluble and membrane proteins

To assess our method's ability to discriminate between soluble and membrane proteins, we apply SVMtmh to the soluble protein data set. A cut-off length is chosen as the minimum TMH length: any protein that does not have at least one predicted TMH exceeding this length is classified as a soluble protein. We calculate the false positive (FP) rate for the soluble protein set, where a false positive is a soluble protein falsely classified as a membrane protein. Similarly, we calculate the false negative (FN) rates for both the high-resolution (FN_high) and low-resolution (FN_low) membrane protein sets using the chosen cut-off length. Clearly, the cut-off length is a trade-off between the FP and FN rates; the selected cut-off length must therefore minimize FP + FN_high + FN_low. Fig. 4 shows the FP and FN rates as a function of cut-off length. The cut-off length of 18, which minimizes the sum of all errors, is used to discriminate between soluble and membrane proteins, as sketched below. Table 4 shows the results of our method compared to the other methods. SVMtmh is capable of distinguishing soluble and membrane proteins with FP and FN_low rates of less than 1% and an FN_high rate of 5.6%. In general, the most advanced methods, such as TMHMM2 (ref. 3) and PHDpsiHtm08 (ref. 12), achieve better accuracies than simple hydrophobicity scale methods such as Kyte-Doolittle (KD)8 and White-Wimley (WW)10.
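The cut-off selection can be sketched as a simple search. The per-protein representation below (a list of predicted helix lengths) is a hypothetical simplification for illustration:

    def choose_cutoff(soluble, membrane_high, membrane_low, candidates=range(1, 31)):
        # Each argument is a list of proteins, one protein being represented
        # (hypothetically) as the list of its predicted TMH lengths.
        def is_membrane(helix_lengths, cutoff):
            return any(length >= cutoff for length in helix_lengths)

        best_cutoff, best_error = None, None
        for c in candidates:
            fp = sum(is_membrane(p, c) for p in soluble) / len(soluble)
            fn_high = sum(not is_membrane(p, c) for p in membrane_high) / len(membrane_high)
            fn_low = sum(not is_membrane(p, c) for p in membrane_low) / len(membrane_low)
            error = fp + fn_high + fn_low    # the quantity minimized in Fig. 4
            if best_error is None or error < best_error:
                best_cutoff, best_error = c, error
        return best_cutoff                   # 18 for the data sets in this work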
Fig. 4. The false positive and false negative rates as a function of cut-off length. The x-axis: cut-off length; the y-axis: false positive and false negative rates (%). Discrimination between soluble and membrane proteins is based on the chosen cut-off length. The cut-off length of 18 (dashed line) minimizes the sum of all three error rates (FP + FN_low + FN_high).

Table 4. Confusion between soluble and membrane proteins. The results of all compared methods are taken from Chen et al.13. The second column gives false positive rates for soluble proteins; the third and fourth columns give false negative rates for the low- and high-resolution membrane protein sets. Methods are sorted by false positive rate.

    Methods        FP (%)   FN low-res (%)   FN high-res (%)
    SVMtmh         0.5      0                5.6
    TMHMM2         1        4                8
    SOSUI          1        4                8
    PHDpsiHtm08    2        8                3
    PHDhtm08       2        23               19
    Wolfenden      2        13               39
    Ben-Tal        3        4                11
    PHDhtm07       3        16               14
    PRED-TMR       4        1                8
    HMMTOP2        6        1                0
    TopPred2       10       11               8
    DAS            16       0                0
    WW             32       0                0
    GES            53       0                0
    Eisenberg      66       0                0
    KD             81       0                0
    Sweet          84       0                0
    Hopp-Woods     89       0                0
    Nakashima      90       0                0
    Heijne         92       0                0
    Levitt         93       0                0
    Roseman        95       0                0
    A-Cid          95       0                0
    Av-Cid         95       0                0
    Lawson         98       0                0
    FM             99       0                0
    Fauchere       99       0                0
    Bull-Breese    100      0                0
    Radzicka       100      0                0

3.6. Effect of the alternating geometric scoring function on topology accuracy

We characterize the dependency of topology accuracy (TOPO) on the values of the base (b) and the exponent increment (EI) used in the alternating geometric scoring function for the low-resolution data set. Fig. 5 shows
the relationship between topology accuracy (coded by colours) and the variables in the scoring function. The white circles indicate the highest topology accuracy, at about 84%, and their corresponding values of b and EI. The region containing half of the white circles (8/16) falls in the ranges for b and EI between [1.5, 2.5]
and [0.5, 1.5], respectively. The set of values (b, EI) we choose for the scoring function is (1.6, 1.0). An interesting observation is that low topology accuracy (80%: blue; 79%: navy) occurs in the vertical-left, lower-horizontal, and upper-right regions. In the vertical-left (b = 1) and lower-horizontal (EI = 0) regions, the scoring function reduces to assigning an equal weight of 1 to all loop signals regardless of their distance from the N-terminus. Conversely, in the upper-right region, when both b and EI are large, the scoring function assigns very small weights to the loop signals downstream of the N-terminus. The poor accuracy in the vertical-left and lower-horizontal regions results from weighting the contribution of every loop-segment signal equally. In the upper-right region, the poor performance is due to the contribution from downstream signals being made negligible by the scoring function. Therefore, our analysis supports the assumptions we have made about our scoring function: 1) topology formation is a result of contributing signals distributed along the protein sequence, particularly in the loop regions; and 2) the contribution of each downstream loop segment to the first loop segment is not equal and diminishes as a function of distance from the N-terminus. Our results suggest that the inclusion of both assumptions in modeling membrane protein topology is a key factor in achieving the best topology accuracy.
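The exact definition of the scoring function is given earlier in the paper; the sketch below assumes one plausible reading consistent with this section, namely that the k-th loop segment from the N-terminus contributes its signal weighted by b^(-k·EI). Under this assumption the limiting behaviours discussed above fall out directly: b = 1 or EI = 0 weights all loops equally, while large b and EI make downstream loops negligible.

    def topology_score(loop_signals, b=1.6, ei=1.0):
        # loop_signals[k]: topology signal (e.g., positive-charge content) of
        # the k-th loop segment counted from the N-terminus. The geometric
        # weight b**(-(k * ei)) is an assumed form (see lead-in above).
        return sum(s * b ** (-(k * ei)) for k, s in enumerate(loop_signals))

The inside and outside scores TS_i and TS_o of Section 3.1 would then be obtained by applying such a weighted sum to the loops assigned to each side, with the larger score fixing the N-terminal topology.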
3.7. Performance on newly solved structures and analysis of bacteriorhodopsin

To illustrate the performance of the top four methods on the high- and low-resolution data sets shown in Table 3, we test four recently solved membrane protein structures not included in the training set. The results are shown in Table 5. The best-predicted protein is a photosynthetic reaction center protein (PDB ID: 1umx_L), for which all methods predict all helices correctly (Qok = 100%). On the other hand, only two methods are capable of correctly predicting all the helices of a bacteriorhodopsin (bR) structure (PDB ID: 1tn0_A) (Qok = 100%). In terms of topology prediction, most methods predict correctly for all four proteins. We devote our analysis to bR to illustrate that TMH prediction is by no means a trivial task and that continued development in this area is indispensable for advancing our understanding of membrane protein structures. Fig. 6(a) displays the high-resolution structure of bR from the PDB. Bacteriorhodopsin is a member of the rhodopsin family, which is characterized by seven distinct transmembrane helices, indexed from Helix A to G. Studies of synthetic peptides of each of the seven TM helices of bR have shown that Helix A to Helix E can form independently stable helices when inserted into a lipid bilayer26.
Table 5. Performance of the top four approaches shaded in Table 3 on newly solved membrane proteins. Proteins are indicated by their PDB codes and observed topologies. Topology terms: N_in, N-terminal loop on the inside of the membrane; N_out, N-terminal loop on the outside of the membrane. PRED_TOPO: predicted topology. For each protein, the methods SVMtmh, TMHMM2, PHDpsiHtm08 and HMMTOP2 are scored by PRED_TOPO, the per-segment measures Qok, Q%obs_htm and Q%prd_htm, and the per-residue measures Q2, Q%obs_2T and Q%prd_2T.

[Table 5 body: proteins 1tn0_A (N_out), 1vfp_A (N_in), 1umx_L (N_in) and 1xfh_A (N_in); the column alignment of the entries is lost in extraction. Recoverable highlights: for 1tn0_A, Qok is 100% for SVMtmh and HMMTOP2 and 0% for TMHMM2 and PHDpsiHtm08; for 1umx_L, all four methods reach Qok = 100%.]

Fig. 5. The relationship between base (b) and exponent increment (EI) in the alternating geometric scoring function and topology accuracy. The x-axis: base (b); the y-axis: exponent increment (EI). The accuracy of topology prediction (TOPO) for the low-resolution data set is divided into 8 levels, each indicated by a colour. The best accuracy (84%) and its associated (b, EI) values occur within the white circles.

Fig. 6(a). The structure of a bacteriorhodopsin (bR) (PDB ID: 1tn0_A). Each helix is coloured and indexed from A to G. The figure was prepared with ViewerLite (ref. 29). Fig. 6(b). Prediction results for bR by the top four methods (* = predicted helix). The observed helices are indicated by coloured boxes. The region of Helix G (purple) and its predictions are highlighted in grey.
However, Helix G does not form a stable helix in detergent micelles27 and exhibits structural irregularity at Lys216, where it forms a π-bulge28. Despite its atypical structure, Helix G is important to the function of bR, as it binds retinal and undergoes a conformational change during the photosynthetic cycle28. The results of the predictions by all four approaches are shown in Fig. 6(b). Interestingly, all approaches are successful in identifying the first six helices (Helix A to F) with good accuracy. However, most methods do not predict Helix G with the same level of success. In particular, TMHMM2 misses Helix G entirely, and PHDpsiHtm08 merges its predictions for Helix F and Helix G into one long helix. SVMtmh and HMMTOP2 (ref. 11) are the only two of the four methods that correctly identify the presence of Helix G. Furthermore, upon closer examination of Helix G, HMMTOP2 over-predicts by 3 residues at the N-terminus and severely under-predicts by 9 residues at the C-terminus, whereas SVMtmh only under-predicts by 2 residues at the N-terminus of Helix G. The poor prediction results may be due to the intrinsic structural irregularity described earlier, which adds another level of complexity to the TMH prediction problem. Despite the difficulties involved in predicting the correct location of Helix G, SVMtmh produces a prediction for the bR structure that is in close agreement with the experimentally determined structure. One possible reason for our success in this case is the integration of multiple biological input features that encompass both global and local information for TMH prediction. TMHMM2 and HMMTOP2 rely solely on amino acid composition as sequence information, while PHDpsiHtm08 only uses sequence information from multiple sequence alignments. In contrast, SVMtmh incorporates a combination of both physico-chemical and sequence-based input features for helix prediction.
4. CONCLUSION

We have proposed an approach based on SVM in a hierarchical framework to predict transmembrane helices and topology in two successive steps. We demonstrate that by separating the prediction problem between two classifiers, specific biological input features associated with each classifier can be applied more effectively. By integrating both sequence and structural input features and using a novel topology scoring function, SVMtmh achieves comparable or better per-segment and topology accuracy for both the high- and low-resolution data sets. When tested for confusion between membrane and soluble proteins, SVMtmh discriminates between them with the lowest false positive rate among the compared methods. We further analyze a set of newly solved structures and show that SVMtmh is capable of predicting the correct helices and topology of bacteriorhodopsin as derived from a high-resolution experiment.

With regard to future work, we will continue to enhance the performance of our approach by incorporating more relevant features in both stages of helix and topology prediction. We will also consider some complexities of TM helices, including helix lengths, tilts, and structural motifs, as in the case of bacteriorhodopsin. Supported by the results we achieved, our approach could prove valuable for genome-wide predictions to identify potential integral membrane proteins and their topologies. While obtaining high-resolution structures for membrane proteins remains a major challenge in structural biology, accurate prediction methods are in high demand. We believe that the continued development of computational methods integrating biological knowledge in this area will be immensely fruitful.

Acknowledgments

We gratefully thank Jia-Ming Chang, Hsin-Nan Lin, Wei-Neng Hung, and Wen-Chi Chou for helpful discussions and computational assistance. This work was supported in part by the thematic program of Academia Sinica under grants AS94B003 and AS95ASIA02.

References
1. Wallin E and von Heijne G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 1998; 7: 1029-1038.
2. Stevens TJ and Arkin IT. The effect of nucleotide bias upon the composition and prediction of transmembrane helices. Protein Sci 2000; 9: 505-511.
3. Krogh A, Larsson B, von Heijne G, and Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001; 305: 567-580.
4. Ubarretxena-Belandia I and Engelman DE. Helical membrane proteins: diversity of functions in the context of simple architecture. Curr Opin Struct Biol 2001; 11: 370-376.
5. White SH. The progress of membrane protein structure determination. Protein Sci 2004; 13: 1948-1949.
6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235-242.
7. van Geest M and Lolkema JS. Membrane topology and insertion of membrane proteins: search for topogenic signals. Microbiol Mol Biol Rev 2000; 64: 13-33.
8. Kyte J and Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982; 157: 105-132.
9. Eisenberg D, Weiss RM, and Terwilliger TC. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA 1984; 81: 140-144.
10. White SH and Wimley WC. Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 1999; 28: 319-365.
11. Tusnady GE and Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 1998; 283: 489-506.
12. Rost B, Fariselli P, and Casadio R. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996; 5: 1704-1718.
13. Chen CP, Kernytsky A, and Rost B. Transmembrane helix predictions revisited. Protein Sci 2002; 11: 2774-2791.
14. Chang CC and Lin CJ. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
15. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992; 225: 487-494.
16. Hessa T, Kim H, Bihlmaier K, Lundin C, Boekel J, Andersson H, Nilsson I, White SH, and von Heijne G. Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature 2005; 433: 377-381.
17. Mitaku S, Hirokawa T, and Tsuji T. Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane-water interfaces. Bioinformatics 2002; 18: 608-616.
18. Zhou H and Zhou Y. Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 2003; 12: 1547-1555.
19. Jayasinghe S, Hristova K, and White SH. Energetics, stability, and prediction of transmembrane helices. J Mol Biol 2001; 312: 927-934.
20. Goder V and Spiess M. Topogenesis of membrane proteins: determinants and dynamics. FEBS Letters 2001; 504: 87-93.
21. Popot JL and Engelman DM. Membrane protein folding and oligomerization: the two-stage model. Biochemistry 1990; 29: 4031-4037.
22. Moller S, Kriventseva EV, and Apweiler R. A collection of well characterised integral membrane proteins. Bioinformatics 2000; 16: 1159-1160.
23. Bairoch A and Apweiler R. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J Mol Med 1997; 5: 312-316.
24. Cao B, Porollo A, Adamczak R, Jarrell M, and Meller J. Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics 2006; 22: 303-309.
25. Wu TF, Lin CJ, and Weng RC. Probability estimates for multi-class classification by pairwise coupling. JMLR 2004; 5: 975-1005.
26. Booth PJ. Unravelling the folding of bacteriorhodopsin. Biochim Biophys Acta 2000; 1460: 4-14.
27. Hunt JF, Earnest TN, Bousche O, Kalghatgi K, Reilly K, Horvath C, Rothschild KJ, and Engelman DM. A biophysical study of integral membrane protein folding. Biochemistry 1997; 36: 15156-15176.
28. Luecke H, Schobert B, Richter HT, Cartailler JP, and Lanyi JK. Structure of bacteriorhodopsin at 1.55 Å resolution. J Mol Biol 1999; 291: 899-911.
29. ViewerLite for molecular visualization. Software available at http://www.jaici.or.jp/sci/viewer.htm.
PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM
Feng Jiao*
School of Computer Science, University of Waterloo, Canada
[email protected]

Jinbo Xu†
Toyota Technological Institute at Chicago, USA
j3xu@tti-c.org

Libo Yu
Bioinformatics Solutions Inc., Waterloo, Canada
[email protected]

Dale Schuurmans
Department of Computing Science, University of Alberta, Canada
dale@cs.ualberta.ca
Protein structure prediction is one of the most important and difficult problems in computational molecular biology. Protein threading represents one of the most promising techniques for this problem. One of the critical steps in protein threading, called fold recognition, is to choose the best-fit template for the query protein whose structure is to be predicted. The standard method for template selection is to rank candidates according to the z-score of the sequence-template alignment. However, the z-score calculation is time-consuming, which greatly hinders structure prediction at a genome scale. In this paper, we present a machine learning approach that treats the fold recognition problem as a regression task and uses a least-squares boosting algorithm (LS_Boost) to solve it efficiently. We test our method on Lindahl's benchmark and compare it with other methods. According to our experimental results we can draw the following conclusions: (1) Machine learning techniques offer an effective way to solve the fold recognition problem. (2) Formulating protein fold recognition as a regression rather than a classification problem leads to a more effective outcome. (3) Importantly, the LS_Boost algorithm does not require the calculation of the z-score as an input, and can therefore obtain significant computational savings over standard approaches. (4) The LS_Boost algorithm obtains superior accuracy, with less computation for both training and testing, than alternative machine learning approaches such as SVMs and neural networks, which also need not calculate the z-score. Finally, by using the LS_Boost algorithm, one can identify important features in the fold recognition protocol, something that cannot be done using a straightforward SVM approach.
1. INTRODUCTION

In the post-genomic era, understanding protein function has become a key step toward modelling complete biological systems. It has been established that the functions of a protein are directly linked to its three-dimensional structure. Unfortunately, current "wet-lab" methods used to determine the three-dimensional structure of a protein are costly, time-consuming and sometimes infeasible. The ability to predict a protein's structure directly from its sequence is urgently needed in the post-genomic era, where protein sequences are becoming available at a far greater rate than the corresponding structure information.

Protein structure prediction is one of the most important and difficult problems in computational molecular biology. In recent years, protein threading has turned out to be one of the most successful approaches to this problem7, 14, 15. Protein threading predicts protein structures by using statistical knowledge of the relationship between protein sequences and structures. The prediction is made by aligning each amino acid in the target sequence to a position in a template structure and evaluating how well
*Work performed at the Alberta Ingenuity Centre for Machine Learning, University of Alberta. †Contact author.
the target fits the template. After aligning the sequence to each template in the structural template database, the next step is to separate the correct templates from incorrect templates for the target sequence, a step we refer to as template selection or fold recognition. After the best-fit template is chosen, the structural model of the sequence is built based on the alignment between the sequence and the chosen template. The traditional fold recognition technique is based on calculating the z-score, which statistically tests the possibility of the target sequence folding into a structure very similar to the template3. In this technique, the z-score is calculated for each sequence-template alignment by first determining the distribution of alignment scores among random re-shufflings of the sequence, and then comparing the alignment score of the correct sequence (in standard deviation units) to the average alignment score over random sequences. Note that the z-score calculation requires the alignment score distribution to be determined by randomly shuffling the sequence many times (approximately 100 times), meaning that the shuffled sequence has to be threaded to the template repeatedly. Thus, the entire process of calculating the z-score is very time-consuming. In this paper, instead of using the traditional z-score technique, we propose to solve the fold recognition problem by treating it as a machine learning problem. Several research groups have already proposed machine learning methods, such as neural networks9, 23 and support vector machines (SVMs)20, 22, for fold recognition. In this general framework, for each sequence-template alignment, one generates a set of features to describe the instance, treats the extracted features as input data, and treats the alignment accuracy or similarity level as a response variable. Thus, the fold recognition problem can be expressed as a standard prediction problem that can be solved by supervised machine learning techniques for regression or classification. In this paper we investigate a new approach that proves to be simpler to implement, more accurate and more computationally efficient. In particular, we combine the gradient boosting algorithm of Friedman5 with a least-squares loss criterion to obtain a least-squares boosting algorithm, LS_Boost. We use LS_Boost to estimate the alignment accuracy
of each sequence-template alignment and employ this as part of our fold recognition technique. To evaluate our approach, we experimentally test it on Lindahl's benchmark12 and compare the resulting performance with other fold recognition methods, such as the z-score method, SVM regression, SVM classification, neural networks and Bayes classification. Our experimental results demonstrate that the LS_Boost method outperforms the other techniques in terms of both prediction accuracy and computational efficiency. It is also a much easier algorithm to implement.

The remainder of the paper is organized as follows. We first briefly introduce the idea of using protein threading for protein structure prediction. We show how to generate features from each sequence-template alignment and convert protein threading into a standard prediction problem (making it amenable to supervised machine learning techniques). We discuss how to design the least-squares boosting algorithm by combining gradient boosting with a least-squares loss criterion, and then describe how to use our algorithm to solve the fold recognition problem. Finally, we describe our experimental set-up and compare LS_Boost with other methods, leading to the conclusions we present at the end.

2. PROTEIN THREADING AND FOLD RECOGNITION

2.1. The threading method for protein structure prediction

The idea of protein threading originated from the observation that the number of different structural folds in nature may be quite small, perhaps two orders of magnitude fewer than the number of known protein sequences11. Thus, the structure prediction problem can potentially be reduced to a recognition problem: choosing a known structure into which the target sequence will fold. Put another way, protein threading is in fact a database search technique: given a query sequence of unknown structure, one searches a structure (template) database and finds the best-fit structure for the given sequence. Thus, protein threading typically consists of the following four steps:

(1) Build a template database of representative
three-dimensional protein structures, which usually involves removing highly redundant structures.
(2) Design a scoring function to measure the fitness between the target sequence and the template, based on knowledge of the known relationship between structures and sequences. Usually, the minimum value of the scoring function corresponds to the optimal sequence-template alignment.
(3) Find the best alignment between the target sequence and the template by minimizing the scoring function.
(4) Choose the best-fit template for the sequence according to a criterion based on all the sequence-template alignments.

In this paper, we focus only on the final step; that is, we only discuss how to choose the best template for the sequence, which is called fold recognition. We use our existing protein threading server RAPTOR21, 22 to generate all the sequence-structure alignments. For the fold recognition problem, there are two different approaches: the z-score method3 and the machine learning method9, 23.
2.2. The z-score method for fold recognition

The z-score is defined to be the "distance" (in standard deviation units) between the optimal alignment score and the mean alignment score obtained by randomly shuffling the target sequence. An accurate z-score can cancel out the sequence composition bias and offset the mismatch between the sequence size and the template length. Bryant et al.3 proposed the following procedure to calculate the z-score:

(1) Shuffle the aligned sequence residues randomly.
(2) Find the optimal alignment between the shuffled sequence and the template.
(3) Repeat the above two steps N times, where N is on the order of one hundred, and calculate the distribution of these N alignment scores.

After the N alignment scores are obtained, we calculate the deviation of the optimal alignment score from the distribution of these N alignment scores.
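This procedure translates directly into code. In the sketch below, thread (returning the optimal alignment score of a sequence against a template) is a hypothetical stand-in for the threading engine; everything else follows the three steps above.

    import random
    import statistics

    def z_score(sequence, template, thread, n_shuffles=100):
        # z-score of a sequence-template alignment, following Bryant et al.:
        # shuffle the aligned residues, rethread, and compare the true
        # optimal score to the shuffled-score distribution (in std-dev units).
        true_score = thread(sequence, template)
        residues = list(sequence)
        shuffled_scores = []
        for _ in range(n_shuffles):
            random.shuffle(residues)                          # step (1)
            shuffled_scores.append(thread("".join(residues), template))  # step (2)
        mu = statistics.mean(shuffled_scores)                 # step (3)
        sigma = statistics.stdev(shuffled_scores)
        return (true_score - mu) / sigma

The cost is visible at a glance: each z-score requires n_shuffles full threading runs, which is what makes the technique expensive at genome scale.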
We can see from the above that in order to calculate the z-score for each sequence-template alignment, we need to shuffle and rethread the target sequence many times, which takes a significant amount of time and essentially prevents this technique from being applied to genome-scale structure prediction.

2.3. Machine learning methods for fold recognition

Another approach to the fold recognition problem is to use machine learning methods, such as neural networks, as in the GenTHREADER9 and PROSPECT-I23 systems, or SVMs, as in the RAPTOR system22. Current machine learning methods generally treat the fold recognition problem as a classification problem. However, there is a limitation to the classification approach that arises when one realizes that there are three levels of similarity that one can draw between two proteins: fold-level similarity, superfamily-level similarity and family-level similarity. Currently, classification-based methods treat the three different similarity levels as a single level, and thus are unable to effectively differentiate one similarity level from another while maintaining a hierarchical relationship between the three levels. Even a multi-class classifier cannot deal with this limitation very well, since the three levels are in a hierarchical relationship. Instead, we use a regression approach, which simply uses the alignment accuracy as the response value. That is, we reformulate the fold recognition problem as predicting the alignment accuracy of a threading pair, which is then used to differentiate the similarity level between proteins. In our approach, we use SARF2 to generate the alignment accuracy between the target protein and the template protein. The alignment accuracy of a threading pair is defined to be the number of correctly aligned positions, based on the correct alignment generated by SARF. A position is correctly aligned only if its alignment position is no more than four position shifts away from its correct alignment. On average, the higher the similarity level between two proteins, the higher the value of the alignment accuracy. Thus alignment accuracy can help to effectively differentiate the three similarity levels. Below we show in our experiments that the regression approach obtains
much better results than the standard classification approach.
3. FEATURE EXTRACTION

One of the key steps in the machine learning approach is to choose a set of proper features to be used as inputs for predicting the similarity between two proteins. After optimally threading a given sequence to each template in the database, we generate the following features from each threading pair.

(1) Sequence size: the number of residues in the sequence.
(2) Template size: the number of residues in the template.
(3) Alignment length: the number of aligned residues. Usually, two proteins from the same fold class should share a large portion of similar sub-structure. If the alignment length is considerably smaller than the sequence size or the template size, it indicates that this threading pair is unlikely to be in the same SCOP class.
(4) Sequence identity. Although a low sequence identity does not imply that two proteins are not similar, a high sequence identity can indicate that two proteins should be considered similar.
(5) Number of contacts with both ends aligned to the sequence. There is a contact between two residues if their spatial distance is within a given cutoff. Usually, a longer protein has more contacts.
(6) Number of contacts with only one end aligned to the sequence. If this number is large, it might indicate that the sequence is aligned to an incomplete domain of the template, which is undesirable since the sequence should fold into a complete structure.
(7) Total alignment score.
(8) Mutation score, which measures the sequence similarity between the target protein and the template protein.
(9) Environment fitness score, which measures how well a residue fits a specific environment.
(10) Alignment gap penalty. When aligning a sequence and a template, some gaps are allowed.
However, if there are too many gaps, it might indicate that the quality of the alignment is poor, and therefore the two sequences may not be at the same similarity level.
(11) Secondary structure compatibility score, which measures the secondary structure difference between the template and the sequence at all positions.
(12) Pairwise potential score, which characterizes the capability of a residue to make contact with another residue.
(13) The z-score of the total alignment score, and the z-scores of single score items such as the mutation score, environment fitness score, secondary structure score and pairwise potential score. Notice that here we still take the traditional z-score into consideration for the sake of performance comparison; later we show that we can obtain nearly the same performance without using the z-score, which means it is unnecessary to calculate the z-score as one of the features.

We calculate the alignment accuracy between the target protein and the template protein using the structure comparison program SARF, and use the alignment accuracy as the response variable. Given the training set with input feature vectors and the response variable, we need to find a prediction function that maps the features to the response variable. Using this function, we can estimate the alignment accuracy of each sequence-template alignment. All the sequence-template alignments can then be ranked by predicted alignment accuracy, and the top-ranked one is chosen as the best alignment for the sequence. Thus we have converted the protein structure problem into a function estimation problem. In the next section, we show how to design our LS_Boost algorithm by combining the gradient boosting algorithm of Friedman5 with a least-squares loss criterion.
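Once such a prediction function is trained, fold recognition reduces to ranking. A minimal sketch, in which predict_accuracy stands for the trained regressor (the pair representation is an assumption for illustration):

    def recognize_fold(alignments, predict_accuracy):
        # alignments: list of (template_id, feature_vector) pairs for one
        # query sequence; predict_accuracy maps a feature vector to an
        # estimated alignment accuracy (e.g., the LS_Boost model below).
        ranked = sorted(alignments,
                        key=lambda pair: predict_accuracy(pair[1]),
                        reverse=True)
        return [template_id for template_id, _ in ranked]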
4. LEAST-SQUARES BOOSTING ALGORITHM FOR FOLD RECOGNITION

The problem can be formulated as follows. Let x denote the feature vector and y the alignment accuracy. Given an input variable x, a response variable y and
some samples $\{y_i, x_i\}_{i=1}^{N}$, we want to find a function $F^*(x)$ that can predict $y$ from $x$ such that, over the joint distribution of $\{y, x\}$ values, the expected value of a specified loss function $L(y, F(x))$ is minimized5. The loss function is used to measure the deviation between the real $y$ value and the predicted $y$ value:

$$F^*(x) = \arg\min_{F(x)} E_{y,x}\, L(y, F(x)) = \arg\min_{F(x)} E_x\big[ E_y\, L(y, F(x)) \mid x \big] \qquad (1)$$

Normally $F(x)$ is a member of a parameterized class of functions $F(x; P)$, where $P$ is a set of parameters. We use the form of the "additive" expansions to design the function as follows:

$$F(x; P) = \sum_{m=0}^{M} \beta_m h(x; \alpha_m) \qquad (2)$$

where $P = \{\beta_m, \alpha_m\}_{m=0}^{M}$. The functions $h(x; \alpha)$ are usually simple functions of $x$ with parameters $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_M\}$. When we wish to estimate $F(x)$ non-parametrically the task becomes more difficult. In general, we can choose a parameterized model $F(x; P)$ and change the function optimization problem into a parameter optimization problem; that is, we fix the form of the function and optimize the parameters instead. A typical parameter optimization method is a "greedy-stagewise" approach: we optimize $\{\beta_m, \alpha_m\}$ after all of the $\{\beta_i, \alpha_i\}$ $(i = 0, 1, \ldots, m-1)$ are optimized. This process can be represented by the following two recursive equations:

$$(\beta_m, \alpha_m) = \arg\min_{\beta, \alpha} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta h(x_i; \alpha)\big) \qquad (3)$$

$$F_m(x) = F_{m-1}(x) + \beta_m h(x; \alpha_m) \qquad (4)$$

Friedman proposed a steepest-descent method to solve the optimization problem described in Equation 2 (ref. 5). This algorithm is called the Gradient Boosting algorithm, and its entire procedure is given in Figure 1.

Algorithm 1: Gradient_Boost
- Initialize $F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$
- For $m = 1$ to $M$ do:
  - Step 1. Compute the negative gradient: $\tilde{y}_i = -\big[ \partial L(y_i, F(x_i)) / \partial F(x_i) \big]_{F(x) = F_{m-1}(x)}$
  - Step 2. Fit a model: $\alpha_m = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} \big[ \tilde{y}_i - \beta h(x_i; \alpha) \big]^2$
  - Step 3. Choose a gradient descent step size: $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho h(x_i; \alpha_m)\big)$
  - Step 4. Update the estimate of $F(x)$: $F_m(x) = F_{m-1}(x) + \rho_m h(x; \alpha_m)$
- end for
- Output the final regression function $F_M(x)$

Fig. 1. Gradient boosting algorithm.

Algorithm 2: LS_Boost
- Initialize $F_0 = \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
- For $m = 1$ to $M$ do:
  - $\tilde{y}_i = y_i - F_{m-1}(x_i)$, $i = 1, \ldots, N$
  - $(\rho_m, \alpha_m) = \arg\min_{\rho, \alpha} \sum_{i=1}^{N} \big[ \tilde{y}_i - \rho h(x_i; \alpha) \big]^2$
  - $F_m(x) = F_{m-1}(x) + \rho_m h(x; \alpha_m)$
- end for
- Output the final regression function $F_M(x)$

Fig. 2. LS_Boost algorithm.

By employing the least-squares loss function $L(y, F) = (y - F)^2 / 2$ we obtain the least-squares boosting algorithm shown in Figure 2. For this procedure,
$\rho$ is calculated as follows:

$$(\rho, \alpha_m) = \arg\min_{\rho, \alpha} \sum_{i=1}^{N} \big[ \tilde{y}_i - \rho h(x_i; \alpha) \big]^2$$

and therefore

$$\rho = \frac{\sum_{i=1}^{N} \tilde{y}_i \, h(x_i; \alpha_m)}{\sum_{i=1}^{N} h(x_i; \alpha_m)^2} \qquad (5)$$

The simple function $h(x; \alpha)$ can have any form that can be conveniently optimized over $\alpha$. In terms of boosting, optimizing over $\alpha$ to fit the training data is called weak learning. In this paper, for considerations of speed, we choose functions for which $\alpha$ is easy to obtain. The simplest function to use here is the linear regression function

$$y = ax + b \qquad (6)$$

where $x$ is the input feature and $y$ is the alignment accuracy. The parameters of the linear regression function can be solved in closed form:

$$a = \frac{l_{xy}}{l_{xx}}, \qquad b = \bar{y} - a\bar{x},$$

where

$$l_{xx} = n \sum_{i=1}^{n} x_i^2 - \Big( \sum_{i=1}^{n} x_i \Big)^2, \qquad l_{xy} = n \sum_{i=1}^{n} x_i y_i - \Big( \sum_{i=1}^{n} x_i \Big)\Big( \sum_{i=1}^{n} y_i \Big).$$
There are many other simple functions one can use, such as an exponential function $y = a + e^{bx}$, a logarithmic function $y = a + b \ln x$, a quadratic function $y = ax^2 + bx + c$, or a hyperbolic function $y = a + b/x$. In our application, at each round we choose one feature and obtain the simple function $h(x; \alpha)$ with the minimum least-squares error. The underlying reasons for choosing a single feature at each round are: i) we would like to see the role of each feature in fold recognition; and ii) we notice that alignment accuracy is proportional to some features. For example, the higher the alignment accuracy, the lower the mutation score, fitness score and pairwise score. Figure 3 shows the relation between alignment accuracy and mutation score.
Fig. 3. The relation between alignment accuracy and mutation score.
In the end, we combine these simple functions to form the final regression function. As such, Algorithm 2 translates into the following procedure, sketched in code below:

(1) Calculate the difference between the real alignment accuracy and the predicted alignment accuracy; we call this difference the alignment accuracy residual. The initial predicted alignment accuracy is taken to be the average alignment accuracy of the training data.
(2) Choose the single feature that correlates best with the alignment accuracy residual. The parameter $\rho$ is calculated using Equation 5, and the alignment accuracy residual is predicted using this chosen feature and parameter.
(3) Update the predicted alignment accuracy by adding the predicted alignment accuracy residual. Repeat steps (1) and (2) until the predicted alignment accuracy no longer changes significantly.
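As an illustration, the procedure can be realized in a few dozen lines. This is a sketch of Algorithm 2 with single-feature linear weak learners, not the authors' implementation; in practice one would stop when the residual no longer changes significantly, as described in step (3).

    def fit_linear(x, y):
        # Closed-form least-squares fit of y = a*x + b (cf. the l_xx, l_xy
        # formulas above). Returns (a, b), or None for a constant feature.
        n = len(x)
        sx, sy = sum(x), sum(y)
        lxx = n * sum(v * v for v in x) - sx * sx
        if lxx == 0:
            return None
        lxy = n * sum(u * v for u, v in zip(x, y)) - sx * sy
        a = lxy / lxx
        b = (sy - a * sx) / n
        return a, b

    def ls_boost(features, y, rounds=500):
        # features: one column (list of values) per feature; y: alignment
        # accuracies. Returns the mean prediction F0 and the fitted ensemble
        # as (feature_index, a, b) terms.
        n = len(y)
        f0 = sum(y) / n                    # initial prediction: mean accuracy
        pred = [f0] * n
        ensemble = []
        for _ in range(rounds):
            resid = [yi - pi for yi, pi in zip(y, pred)]      # step (1)
            best = None
            for j, col in enumerate(features):                # step (2)
                fit = fit_linear(col, resid)
                if fit is None:
                    continue
                a, b = fit
                sse = sum((r - (a * v + b)) ** 2 for r, v in zip(resid, col))
                if best is None or sse < best[0]:
                    best = (sse, j, a, b)
            _, j, a, b = best
            ensemble.append((j, a, b))
            pred = [p + a * v + b for p, v in zip(pred, features[j])]  # step (3)
        return f0, ensemble

    def predict(f0, ensemble, x):
        # x: feature vector for one threading pair; returns the estimated
        # alignment accuracy used to rank templates.
        return f0 + sum(a * x[j] + b for j, a, b in ensemble)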
5. EXPERIMENTAL RESULTS

When one protein structure is to be predicted, we thread its sequence to each template in the database and obtain the predicted alignment accuracy using the LS_Boost algorithm. We choose the template with the highest predicted alignment accuracy as the basis for building the structure of the target sequence.

We can describe the relationship between two proteins at three different levels: the family level, the superfamily level and the fold level. If two proteins are similar at the family level, they have evolved from a common ancestor and usually share more than 30% sequence identity. If two proteins are similar only at the fold level, their structures are similar even though their sequences are not. Superfamily-level similarity lies in between the family level and the fold level. If the target sequence has a template in the same family, it is easier to predict the structure of the sequence; if two proteins are similar only at the fold level, they share less sequence similarity and it is harder to predict their relationship. We use the SCOP database16 to judge the similarity between two proteins and evaluate our predicted results at the different levels. If the predicted template is similar to the target sequence at the family level according to the SCOP database, we treat it as a correct prediction at the family level. If the predicted template is similar at the superfamily level but not at the family level, we assess the prediction as correct at the superfamily level. Similarly, if the predicted template is similar at the fold level but not at the other two levels, we assess the prediction as correct at the fold level. When we say a prediction is correct according to the top K criterion, we mean that there are no more than K - 1 incorrect predictions ranked before this prediction. The fold-level relationship is the hardest to predict because two proteins share very little sequence similarity in this case.

To train the parameters of our algorithm, we randomly choose 300 templates from the FSSP list1 and 200 sequences from Holm's test set6. By threading each sequence to all the templates, we obtain a set of 60,000 training examples. To test the algorithm, we use Lindahl's benchmark, which contains 976 proteins, each pair of which shares at most 40% sequence identity. By threading each one against all the others, we obtain a set of 976 × 975 threading pairs. Since the training set is chosen randomly from a set of non-redundant proteins, the overlap between the training set and Lindahl's benchmark is fairly small (no more than 0.4 percent of the whole test set). To ensure the complete separation of the training and testing sets, these overlapping pairs are removed from the test data.
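The top K criterion can be written directly as a check over the ranked template list. The representation below (a list of per-template correctness flags in rank order) is an assumed illustration, not the authors' evaluation code:

    def correct_top_k(ranked_correct, k):
        # A prediction is correct under the top-K criterion if a correct
        # template appears before K incorrect ones, i.e. no more than K-1
        # incorrect predictions are ranked above it.
        incorrect_before = 0
        for is_correct in ranked_correct:
            if is_correct:
                return True
            incorrect_before += 1
            if incorrect_before >= k:
                return False
        return False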
We calculate the recognition rate of each method at the three similarity levels.

5.1. Sensitivity

Figure 4 shows the sensitivity of our algorithm at each training round. We can see that the LS_Boost algorithm nearly converges within 100 rounds, although we train the algorithm further to obtain higher performance.

Fig. 4. Sensitivity curves during the training process. The x-axis gives the number of training rounds; the curves show sensitivity according to the Top 1 and Top 5 criteria at the family, superfamily and fold levels.
Table 1 lists the results of our algorithm against several other algorithms. PROSPECT II uses the z-score method, and its results are taken from Kim et al.'s paper10. We can see that the LS_Boost algorithm is better than PROSPECT II at all three levels. The results for the other methods are taken from Shi et al.'s paper18. Here we can see that our method apparently outperforms the other methods; however, since we use different sequence-structure alignment methods, this disparity may be partially due to different threading techniques. Nevertheless, we can see that the machine learning approaches normally perform much better than the other methods. Table 2 shows the results of our algorithm against several other popular machine learning methods. We will not describe the details of each method here. In this experiment, we use RAPTOR to generate all the sequence-template alignments. For each method, we tune the parameters on the training set and test the model on the test set. In total we test the following six other machine learning methods.
Table 1. Sensitivity of the LS_Boost method compared with other structure prediction servers.

    Methods             Family Top1 / Top5   Superfamily Top1 / Top5   Fold Top1 / Top5
    RAPTOR (LS_Boost)   86.5% / 89.2%        60.2% / 74.4%             38.8% / 61.7%
    PROSPECT II         84.1% / 88.2%        52.6% / 64.8%             27.7% / 50.3%
    FUGUE               82.3% / 85.8%        41.9% / 53.2%             12.5% / 26.8%
    PSI-BLAST           71.2% / 72.3%        27.4% / 27.9%             4.0% / 4.7%
    HMMER-PSIBLAST      67.7% / 73.5%        20.7% / 31.3%             4.4% / 14.6%
    SAMT98-PSIBLAST     70.1% / 75.4%        28.3% / 38.9%             3.4% / 18.7%
    BLASTLINK           74.6% / 78.9%        29.3% / 40.6%             6.9% / 16.5%
    SSEARCH             68.6% / 75.7%        20.7% / 32.5%             5.6% / 15.6%
    THREADER            49.2% / 58.9%        10.8% / 24.7%             14.6% / 37.7%
Table 2. Performance comparison of seven machine learning methods. The sequence-template alignments are generated by RAPTOR.

    Methods                  Family Top1 / Top5   Superfamily Top1 / Top5   Fold Top1 / Top5
    LS_Boost                 86.5% / 89.2%        60.2% / 74.4%             38.8% / 61.7%
    SVM (regression)         85.0% / 89.1%        55.4% / 71.8%             38.6% / 60.6%
    SVM (classification)     82.6% / 83.6%        45.7% / 58.8%             30.4% / 52.6%
    AdaBoost                 82.8% / 84.1%        50.7% / 61.1%             32.2% / 53.3%
    Neural Networks          81.1% / 83.2%        47.4% / 58.3%             30.1% / 54.8%
    Bayes classifier         69.9% / 72.5%        29.2% / 42.6%             13.6% / 40.0%
    Naive Bayes classifier   68.0% / 70.8%        31.0% / 41.7%             15.1% / 37.4%

(1) SVM regression. Support vector machines are based on the concept of structural risk minimization from statistical learning theory19. Here the fold recognition problem is treated as a regression problem, so we consider SVMs used for regression. We use the SVMlight software package8 with an RBF kernel to obtain the best performance. As shown in Table 2, LS_Boost performs slightly better than SVM regression.
(2) SVM classification. The fold recognition problem is treated as a classification problem, and we consider an SVM for classification; the software and kernel are the same as for SVM regression. In this case, SVM classification performs worse than SVM regression, especially at the superfamily and fold levels.
(3) AdaBoost. Boosting is a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee". We use the standard AdaBoost algorithm4 for classification, which is similar to LS_Boost except that it performs classification rather than regression and uses the exponential instead of the least-squares loss function. The AdaBoost algorithm achieves a result comparable to SVM classification but is
worse than both of the regression approaches, LS_Boost and SVM regression.
(4) Neural networks. Neural networks are one of the most popular methods used in machine learning17. Here we use a multi-layer perceptron for classification, based on the Matlab neural network toolbox. The performance of the neural network is similar to that of SVM classification and AdaBoost.
(5) Bayes classifier. A Bayes classifier is a probability-based classifier which assigns a sample to a class based on the probability that it belongs to that class13.
(6) Naive Bayes classifier. The naive Bayes classifier is similar to the Bayes classifier except that it assumes the features of each class are independent, which greatly decreases computation13. Both the Bayes classifier and the naive Bayes classifier obtain poor performance.

Our experimental results show clearly that: (1) the regression-based approaches demonstrate better performance than the classification-based approaches; (2) LS_Boost performs slightly better than SVM regression and significantly better than the other methods; and (3) the computational efficiency of
LS_Boost is much better than that of SVM regression, SVM classification and the neural network.

One of the advantages of our boosting approach over SVM regression is its ability to identify important features, since at each round LS_Boost chooses only a single feature to approximate the alignment accuracy residual. The following are the top five features chosen by our algorithm. The simple functions associated with each of these features are all linear regression functions y = ax + b, showing that there is a strong linear relation between these features and the alignment accuracy; from Figure 3, for example, we can see that a linear regression function is the best fit.

(1) Sequence identity;
(2) Total alignment score;
(3) Fitness score;
(4) Mutation score;
(5) Pairwise potential score.
It may seem surprising that the widely used z-score is not chosen as one of the most important features. This suggests that the z-score may be redundant. To confirm this hypothesis, we re-trained our model using all the features except the z-scores; that is, we conducted the same training and test procedures as before, but with the reduced feature set. The results given in Table 3 show that for LS_Boost there is almost no difference between using the z-score as an additional feature and leaving it out. Thus, we conclude that with the LS_Boost approach it is unnecessary to calculate the z-score to obtain the best performance. This means that we can greatly improve the computational efficiency of protein threading without sacrificing accuracy, by completely avoiding the calculation of the expensive z-score.

To quantify the margin of superiority of LS_Boost over the other machine learning methods, we use the bootstrap method for error analysis. After training the model, we randomly sample 600 sequences from Lindahl's benchmark and calculate the sensitivity as before. We repeat the sampling 1000 times and obtain the mean and standard deviation of the sensitivity of each method, as listed in Table 4. We can see that the LS_Boost method is slightly better than SVM regression and much better than the other methods.
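The error analysis translates into a short resampling loop. In this sketch, sensitivity is a hypothetical stand-in for the recognition-rate computation; following the description above, each resample draws 600 distinct sequences (a classical bootstrap would instead resample with replacement):

    import random
    import statistics

    def bootstrap_sensitivity(benchmark, sensitivity, n_rep=1000, sample_size=600):
        # Resample sequences from the benchmark and report the mean and
        # standard deviation of the sensitivity across resamples.
        values = []
        for _ in range(n_rep):
            sample = random.sample(benchmark, sample_size)
            values.append(sensitivity(sample))
        return statistics.mean(values), statistics.stdev(values)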
5.2. Specificity

We further examine the specificity of the LS_Boost method on Lindahl's benchmark. All threading pairs are ranked by their confidence score (i.e., the predicted alignment accuracy, or the classification score if an SVM classifier is used) and the sensitivity-specificity curves are drawn in Figures 5, 6 and 7. Figure 6 demonstrates that at the superfamily level, the LS_Boost method is consistently better than SVM regression and classification across the whole spectrum of sensitivity. At both the family level and the fold level, LS_Boost is slightly better when the specificity is high but worse when the specificity is low. At the family level, LS_Boost achieves sensitivities of 55.0% and 64.0% at 99% and 50% specificity, respectively, whereas SVM regression achieves 44.2% and 71.3%, and SVM classification achieves 27.0% and 70.9%, respectively. At the superfamily level, LS_Boost has sensitivities of 8.2% and 20.8% at 99% and 50% specificity, respectively; in contrast, SVM regression has 3.6% and 17.8%, and SVM classification has 2.0% and 16.1%. Figure 7 shows that at the fold level there is no large difference between the LS_Boost, SVM regression and SVM classification methods.
Fig. 5. Family-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS_Boost, SVM regression and SVM classification are compared.
Table 3. Comparison of fold recognition performance with z-score and without z-score.

                              Family Level        Superfamily Level    Fold Level
                              Top 1     Top 5     Top 1     Top 5      Top 1     Top 5
LS_Boost with z-score         86.5%     89.2%     60.2%     74.4%      38.8%     61.7%
LS_Boost without z-score      85.8%     89.2%     60.2%     73.9%      38.3%     62.9%
Table 4. Error analysis of seven machine learning methods. The sequence-template alignments are generated by RAPTOR.

            Family Level                   Superfamily Level              Fold Level
            Top 1         Top 5            Top 1         Top 5            Top 1         Top 5
            mean   std    mean   std       mean   std    mean   std       mean   std    mean   std
LS-Boost    86.6%  0.029  89.2%  0.031     60.2%  0.029  74.3%  0.034     38.9%  0.027  61.8%  0.036
SVM (R)     85.2%  0.031  89.2%  0.031     55.6%  0.029  72.0%  0.033     38.7%  0.027  60.7%  0.035
SVM (C)     82.5%  0.028  83.8%  0.030     45.8%  0.026  58.9%  0.030     30.4%  0.024  52.8%  0.032
Ada-Boost   82.9%  0.030  84.2%  0.029     50.7%  0.028  61.2%  0.031     32.1%  0.025  53.4%  0.034
NN          81.8%  0.029  83.5%  0.030     47.5%  0.027  58.4%  0.031     30.2%  0.024  55.0%  0.033
BC          70.0%  0.027  72.6%  0.027     29.1%  0.021  42.6%  0.026     13.7%  0.016  40.1%  0.028
NBC         68.8%  0.026  71.0%  0.028     31.1%  0.022  41.9%  0.025     15.1%  0.017  37.3%  0.027
5.3. Computational Efficiency
Fig. 6. Superfamily-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS_Boost, SVM regression and SVM classification are compared.
Overall, the LS_Boost procedure achieves superior computational efficiency during both training and testing. Running our program on a 2.53 GHz Pentium IV processor, after extracting the features, the training time is less than thirty seconds and the total test time is approximately two seconds. Our technique is thus very fast compared to other approaches, in particular machine learning approaches such as neural networks and SVMs, which require much more time to train. Table 5 lists the running time of several different fold recognition methods. From this table, we can see that the boosting approach is more efficient than the SVM regression method, which is desirable for genome-scale structure prediction. The running time shown in this table does not include the computational time of sequence-template alignment.
6. CONCLUSION
Fig. 7. Fold-level specificity-sensitivity curves on Lindahl's benchmark set. The three methods LS-Boost, SVM regression and SVM classification are compared.
In this paper, we propose a new machine learning approach, LS_Boost, to solve the protein fold recognition problem. We use a regression approach which proves to be both more accurate and more efficient than classification-based approaches. One of the most significant conclusions of our experimental evaluation is that we do not need to calculate the standard z-score, and can thereby achieve substantial computational savings without sacrificing prediction accuracy. Our algorithm achieves strong sensitivity results compared to other fold recognition methods, including both machine learning methods and z-score based methods. Moreover, our approach is significantly more efficient in both the training and testing phases, which may allow genome-scale structure prediction.

Table 5. Running time of different machine learning approaches.

Method                    Training time   Testing time
LS-Boost                  30 seconds      2 seconds
SVM classification        19 mins         26 mins
SVM regression            1 hour          4.3 hours
Neural Network            2.3 hours       2 mins
Naive Bayes Classifier    1.8 hours       2 mins
Bayes Classifier          1.9 hours       2 mins
References

1. T. Akutsu and S. Miyano. On the approximation of protein threading. Theoretical Computer Science, 210:261-275, 1999.
2. N.N. Alexandrov. SARFing the PDB. Protein Engineering, 9:727-732, 1996.
3. S.H. Bryant and S.F. Altschul. Statistics of sequence-structure threading. Current Opinion in Structural Biology, 5:236-244, 1995.
4. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.
5. J.H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), October 2001.
6. L. Holm and C. Sander. Decision support system for the evolutionary classification of protein structures. 5:140-146, 1997.
7. J. Moult, T. Hubbard, F. Fidelis, and J. Pedersen. Critical assessment of methods on protein structure prediction (CASP)-round III. Proteins: Structure, Function and Genetics, 37(S3):2-6, December 1999.
8. T. Joachims. Making Large-scale SVM Learning Practical. MIT Press, 1999.
9. D.T. Jones. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287:797-815, 1999.
10. D. Kim, D. Xu, J. Guo, K. Ellrott, and Y. Xu. PROSPECT II: Protein structure prediction method for genome-scale applications. Protein Engineering, 16(9):641-650, 2003.
11. H. Li, R. Helling, C. Tang, and N. Wingreen. Emergence of preferred structures in a simple model of protein folding. Science, 273:666-669, 1996.
12. E. Lindahl and A. Elofsson. Identification of related proteins on family, superfamily and fold level. Journal of Molecular Biology, 295:613-625, 2000.
13. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification (edited collection). Ellis Horwood, 1994.
14. J. Moult, F. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods on protein structure prediction (CASP)-round IV. Proteins: Structure, Function and Genetics, 45(S5):2-7, December 2001.
15. J. Moult, F. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods on protein structure prediction (CASP)-round V. Proteins: Structure, Function and Genetics, 53(S6):334-339, October 2003.
16. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.
17. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Springer, 1995.
18. J. Shi, T. Blundell, and K. Mizuguchi. FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. Journal of Molecular Biology, 310:243-257, 2001.
19. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
20. J. Xu. Protein fold recognition by predicted alignment accuracy. IEEE Transactions on Computational Biology and Bioinformatics, 2:157-165, 2005.
21. J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology, 1(1):95-117, 2003.
22. J. Xu, M. Li, G. Lin, D. Kim, and Y. Xu. Protein threading by linear programming. In Biocomputing: Proceedings of the 2003 Pacific Symposium, pages 264-275, Hawaii, USA, 2003.
23. Y. Xu, D. Xu, and V. Olman. A practical method for interpretation of threading scores: an application of neural networks. Statistica Sinica Special Issue on Bioinformatics, 12:159-177, 2002.
A GRAPH-BASED AUTOMATED NMR BACKBONE RESONANCE SEQUENTIAL ASSIGNMENT
Xiang Wan and Guohui Lin*
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada
*Email: [email protected]
The success of backbone resonance sequential assignment is fundamental to protein three dimensional structure determination via NMR spectroscopy. Such a sequential assignment can roughly be partitioned into three separate steps: grouping resonance peaks in multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning strings of spin systems to non-overlapping consecutive amino acid residues in the target protein. Dealing with these three steps separately has been adopted in many existing assignment programs; it works well on protein NMR data that is close to ideal quality, but only moderately or even poorly on most real protein datasets, where noise as well as data degeneracy occur frequently. We propose in this work to partition the sequential assignment not into physical steps but only virtual steps, and to use their outputs to cross-validate each other. The novelty lies in that the ambiguities in the grouping step are resolved by finding the highly confident strings in the chaining step, and the ambiguities in the chaining step are resolved by examining the mappings of strings in the assignment step. In this way, all ambiguities in the sequential assignment are resolved globally and optimally. The resultant assignment program is called GASA, which was compared to several recent similar developments: RIBRA, MARS, PACES and a random graph approach. The performance comparisons with these works demonstrate that GASA might be more promising for practical use. Keywords: Protein NMR backbone resonance sequential assignment, chemical shift, spin system, connectivity graph.
1. INTRODUCTION

Nuclear Magnetic Resonance (NMR) spectroscopy has been increasingly used for protein three-dimensional structure determination. Although it has not been able to achieve the same accuracy as X-ray crystallography, enormous technological advances have brought NMR to the forefront of structural biology1 since the publication of the first complete solution structure of a protein (bull seminal trypsin inhibitor) determined by NMR in 19852. The underlying mathematical principle of protein NMR structure determination is to employ NMR spectroscopy to obtain local structural restraints, such as the distances between hydrogen atoms and the ranges of dihedral angles, and then to calculate the three-dimensional structure. Local structural restraint extraction is mostly guided by the backbone resonance sequential assignment, which is therefore crucial to accurate three-dimensional structure calculation. The resonance sequential assignment maps the identified resonance peaks from multiple NMR spectra to their corresponding nuclei in the target protein, where every peak captures a nuclear

*To whom correspondence should be addressed.
magnetic interaction among a set of nuclei, and its coordinates are the chemical shift values of the interacting nuclei. Normally, such an assignment procedure is roughly partitioned into three main steps: grouping resonance peaks from multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning the strings of spin systems to non-overlapping consecutive amino acid residues in the target protein, as illustrated in Figure 1, where the scoring scheme quantifies the residual signature information of the peaks and spin systems. Dealing with these three steps separately has been adopted in many existing assignment programs3-10. Furthermore, depending on the availability of NMR spectral data, different programs may have different starting points. To name a few automated assignment programs: PACES6, a random graph approach8 (abbreviated as RANDOM in the rest of the paper) and MARS10 assume the availability of spin systems and focus on chaining the spin systems and their subsequent assignment; AutoAssign3 and RIBRA9 can start with the multiple spectral peak lists and automate the whole sequential
Fig. 1. The flow chart of the NMR resonance sequential assignment: peak lists are grouped into spin systems, chained into strings, and assigned to produce candidates, guided by the scoring scheme.
assignment process. In terms of computational techniques, PACES uses exhaustive search algorithms to enumerate all possible strings and then performs the string assignment; RANDOM8 avoids exhaustive enumeration through multiple calls to Hamiltonian path/cycle generation in a randomized way; MARS10 first searches all possible strings of length 5 and then uses their mapping positions to filter out the correct strings; AutoAssign3 uses a best-first search algorithm with constraint propagation to look for assignments; RIBRA9 applies a weighted maximum independent set algorithm for assignments. The above mentioned sequential assignment programs all work well on high quality NMR data, but most of them remain unsatisfactory in practice and even fail when the spectral data is of low resolution. Through a thorough investigation, we identified that the bottleneck of automated sequential assignment is resonance peak grouping. Essentially, a good grouping output gives well organized, high quality spin systems, for which the correct strings can be fairly easily determined and the subsequent string assignment also becomes easy. In AutoAssign and RIBRA, the grouping is done through a binary decision model that considers the HSQC peaks as anchor peaks and subsequently maps the peaks from other spectra to these anchor peaks. For such a mapping, the HN and N chemical shift values in the other peaks are required to fall within the pre-specified HN and N chemical shift tolerance thresholds of the anchor peaks. However, this binary-decision model inevitably suffers from its sensitivity to the tolerance thresholds. In practice, from one protein dataset to another, chemical shift thresholds vary due to the experimental conditions and the structure complexity. Large tolerance thresholds could create too many ambiguities in the resultant spin systems, and consequently in the later chaining and assignment, leading to a dramatic decrease of assignment
accuracy; on the other hand, small tolerance thresholds would produce too few spin systems when the spectral data resolution is low, hardly leading to a useful assignment. Secondly, we found that in the traditional three-step procedure, which is the basis of many automated sequential assignment programs, each step is executed separately, without consideration of inter-step effects. Basically, the input to each step is assumed to contain enough information to produce meaningful output. However, for low resolution spectral data, the ambiguities appearing in the input of one step seem very hard to resolve internally. Though it is possible to generate multiple sets of outputs, the uncertainties contained in one input might cause more ambiguities in the outputs, which are taken as inputs to the succeeding steps. Consequently, the whole process would fail to produce a meaningful resonance sequential assignment, which might be possible if the outputs of succeeding steps were used to validate the input to the current step. In this paper, we propose a two-phase Graph-based Approach for Sequential Assignment (GASA) that uses the spin system chaining results to validate the peak grouping and uses the string assignment results to validate the spin system chaining. Therefore, GASA not only addresses the chemical shift tolerance threshold issue in the grouping step but also presents a new model to automate the sequential assignment. In more detail, we propose a two-way nearest neighbor search approach in the first phase to eliminate the requirement of user-specified HN and N chemical shift tolerance thresholds. The output of the first phase consists of two lists of spin systems. One list contains the perfect spin systems, which are regarded as of high quality, and the other the imperfect spin systems, in which some ambiguities have to be resolved to produce legal spin systems. In the second phase, the spin system chaining is performed to
resolve the ambiguities contained in the imperfect spin systems, and the string assignment step is included as a subroutine to identify the confident strings. In other words, the ambiguities in the imperfect spin systems are resolved through finding the highly confident strings in the chaining step, and the ambiguities in the chaining step are resolved through examining the mappings of strings in the assignment step. Therefore, GASA does not separate the sequential assignment into physical steps but only virtual steps, and all ambiguities in the whole assignment process are resolved globally and optimally. The rest of the paper is organized as follows. In Section 2, we introduce the detailed operations in GASA. Section 3 presents our experimental results and discussion. We conclude the paper in Section 4.
2. THE GASA ALGORITHM

The input data to GASA could be a set of peak lists or, assuming the grouping is done, a list of spin systems. In the case of a given list of spin systems, GASA skips the first phase and directly invokes the second phase to conduct the spin system chaining and the assignment. In the other case, GASA first conducts a bidirectional nearest neighbor search to generate the perfect spin systems and the imperfect spin systems with ambiguities. It then invokes the second phase, which applies a heuristic search, guided by the quality of the string mapping to the target protein, to perform the chaining and assignment, resolving the ambiguities in the imperfect spin systems and meanwhile completing the assignment.
2.1. Phase 1: Filtering

For ease of exposition and fair comparison with RANDOM, PACES, MARS and RIBRA, we assume the availability of spectral peaks containing chemical shifts for the amide protons, nitrogens and carbon alpha/beta nuclei. An HNCACB spectrum contains 3D peaks each of which is a triple of chemical shifts for a nitrogen, the directly adjacent amide proton, and a carbon alpha/beta from the same or the preceding amino acid residue; a CBCA(CO)NH spectrum contains 3D peaks each of which is a triple of chemical shifts for a nitrogen, the directly adjacent amide proton, and a carbon alpha/beta from the preceding amino acid residue. For ease of presentation, a 3D peak containing a chemical shift of the intra-residue carbon alpha is referred to as an intra-peak; otherwise, an inter-peak. The goal of filtering is to identify all perfect spin systems without asking for chemical shift tolerance thresholds. Note that, to the best of our knowledge, all existing peak grouping models require manually set chemical shift tolerance thresholds in order to decide whether two resonance peaks should be grouped into the same spin system or not. Consequently, different tolerance thresholds clearly produce different sets of possible spin systems, and for low resolution spectral data, a minor change of tolerance thresholds can lead to a huge difference in the formed spin systems and subsequently the final sequential assignment. In fact, the proper tolerance thresholds are normally dataset dependent, and how to choose them is a very challenging issue in automated resonance assignment. We propose to use the nearest neighbor approach, detailed as follows using the triple spectra as an example. Due to the high quality of the HSQC spectrum, the peaks in HSQC are considered as centers, and every peak in CBCA(CO)NH and HNCACB is distributed to the closest center, using the normalized Euclidean distance. Given a center C = (HN_C, N_C) and a peak P = (HN_P, N_P, C_P), the normalized Euclidean distance between them is defined as D = …
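A small sketch of this grouping step; since the normalization constants of the distance are not shown here, R_HN and R_N below are illustrative placeholders, not the paper's values:

```python
import numpy as np

# Illustrative normalization factors for the HN and N chemical shift
# dimensions (ppm); placeholders, not the constants used by GASA.
R_HN, R_N = 0.05, 0.5

def normalized_distance(center, peak):
    """Normalized Euclidean distance between an HSQC center (HN, N)
    and a 3D peak (HN, N, CA/CB) in their shared (HN, N) dimensions."""
    d_hn = (center[0] - peak[0]) / R_HN
    d_n = (center[1] - peak[1]) / R_N
    return np.hypot(d_hn, d_n)

def group_peaks(centers, peaks):
    """Assign every CBCA(CO)NH / HNCACB peak to its nearest HSQC center."""
    groups = {i: [] for i in range(len(centers))}
    for p in peaks:
        i = min(range(len(centers)),
                key=lambda k: normalized_distance(centers[k], p))
        groups[i].append(p)
    return groups
```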
… (φ, ψ) angles are sufficient to determine a backbone conformation. The structure determination problem for a denatured protein is to compute an ensemble of presumably heterogeneous structures that are consistent with the experimental data within a relatively large range for the data. More precisely, the structure determination problem for denatured proteins can be formulated as the computation of a set of conformation vectors, c_n, given the distributions for all the RDCs r and for all the PREs d.
4. PREVIOUS WORK

Solution NMR spectroscopy is the only experimental technique currently capable of measuring geometric restraints for individual residues of a denatured protein at the atomic level. Traditional NMR structure determination methods5, 11, developed for computing structures in the native state, require more than 10 restraints per residue, derived mainly from NOE experiments, to compute a well-defined native structure. Recently developed RDC-based approaches for computing native structures rely on either heuristic approaches such as restrained molecular dynamics (MD) and simulated annealing (SA)10, 13 or a structural database8, 21. It is not clear how to extend these native structure determination approaches to compute the desired denatured structures. Traditional NOE-based approaches cannot be used since long-range NOEs, which are critical for applying the traditional approaches to determine NMR structures, are usually too weak to be detected in the denatured
The main difference between PRE and NOE is that PRE results from the dipole-dipole interaction between an electron and a nucleus, while the physical basis of NOE is the dipole-dipole interaction between two nuclei. Under the isolated two-spin assumption, both PRE and NOE (that is, the observed intensity of cross-peaks in either a PRE or NOE experiment) are proportional to r^(-6), where r is the distance between two spins.
state^d. Previous RDC-based MD/SA approaches typically require either more than 5 RDCs per residue or at least 3 RDCs and 1 NOE per residue (most of them should be long-range) to compute a well-defined native structure. In the database-based approaches, RDCs are employed to select structural fragments mined from the protein databank (PDB)3, a database of experimentally determined native structures. A backbone structure for a native protein is then constructed by linking together the RDC-selected fragments using a heuristic method. Compared with the MD/SA approaches, the database-based approaches require fewer RDCs. However, these database-based approaches have not been extended to compute structures for denatured proteins. In summary, neither the traditional NOE-based methods nor the above RDC-based approaches can be applied to compute all-atom backbone structures in the denatured state at this time. Recently, approaches14, 4 have been developed to build structural models for the denatured state using one RDC per residue. These approaches are generate-and-test. They begin with the construction of a library of backbone (φ, ψ) angles using only the angles occurring in the loops of the native proteins deposited in the PDB. Then, they randomly select (φ, ψ) angles from the library to build an ensemble of backbone models. Finally, the models are tested by comparing the experimental RDCs with the average RDCs back-computed from the ensemble of backbone structures. There are three problems with these methods. First, the (φ, ψ) angle library is biased since only the (φ, ψ) angles from the loops of the native proteins are used. Consequently, the models constructed from the library may be biased towards the native conformations in the PDB. Second, random selection may miss valid conformations. Third, the agreement of the experimental RDCs with the average RDCs back-computed from the ensemble of structures may result from over-fitting. Over-fitting is likely since one RDC per residue is not enough to restrain the orientation of an internuclear vector (such as the NH bond vector) to a finite set. In fact, given an alignment tensor S, an infinite number of backbone conformations can agree with one RDC per residue, while only a finite number of conformations agree with two RDCs per residue29, 28, 31.
All-atom models for the denatured state have been computed previously in a generate-and-test manner in 16, by using PREs to select the structures from all-atom MD simulation at high temperature. Due to the data sparsity and large experimental errors, PREs alone are, in general, insufficient to define precisely even the backbone Cα-trace. The generated models have large uncertainty. A generate-and-test approach6 using mainly NOE distance restraints has been developed to determine the ensemble of all-atom structures of an SH3 domain in the unfolded state in equilibrium with a folded state^e. However, the relatively large experimental errors as well as the sparsity and locality of NOEs similarly introduce large uncertainty in the resulting ensemble of structures, which is selected mainly by the NOEs.
5. THE MATHEMATICAL BASIS OF OUR ALGORITHM

Our algorithm uses a set of low-degree polynomial equations to compute the backbone φ_i, ψ_i angles, respectively, from the CH RDC of residue i and the NH RDC of residue i+1, using the following two Propositions:

Proposition 5.1 28 Given the orientation of peptide plane i in the POF (see section 2) of RDCs, the x-component of the CH unit vector u of residue i, in the POF, can be computed from the CH RDC by solving a quartic equation in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given u, the sine and cosine of the φ_i angle can be computed by solving linear equations.
^d The denatured state in this paper (see section 1) has been called the "unfolded state"20. ^e An unfolded state in equilibrium with a folded state6 differs from the denatured state in this paper. In 6, the observed NOEs result from the equilibrium between the folded and unfolded states, not from the unfolded state alone.
Proposition 5.2 28 Given the orientation of peptide plane i in the POF of RDCs, the x-component of the NH unit vector v of residue i+1, in the POF, can be computed from the NH RDC by solving a quartic equation in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given v, the sine and cosine of the ψ_i angle can be computed by solving linear equations.

According to Propositions 5.1-5.2, given the orientation of peptide plane i (plane i stands for the peptide plane for residue i in the protein sequence), the sines and cosines of the backbone φ_i, ψ_i angles can be computed, exactly in closed form, from the CH RDC of residue i and the NH RDC of residue i+1. Furthermore, the orientation of the peptide plane for residue i+1 can be computed, exactly in closed form, from the orientation of the peptide plane for residue i and the sines and cosines of the intervening φ_i, ψ_i angles. Thus, given a tensor S, the orientation of the peptide plane for residue 1 (the first peptide plane) of the protein sequence, and CH and NH RDCs, all the sines and cosines of the backbone (φ, ψ) angles can be computed … computed from R_t, where the φ_k angle for residue k is computed according to Proposition 5.1 from the sampled CH RDC for residue k, and the ψ_k angle is computed according to Proposition 5.2 from the sampled NH RDC for residue k+1. An optimal conformation vector is a vector which has the minimum score under a scoring function T_F defined as
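As an illustration of how such a proposition is used numerically, the sketch below finds the real roots of an assumed quartic and completes the unit vector; the quartic coefficients and the function implementing Eq. (1) are inputs here because their derivation is not reproduced in this text:

```python
import numpy as np

def unit_vector_from_rdc(quartic_coeffs, y_from_x):
    """Solve the quartic for the x-component of an internuclear unit
    vector (in the spirit of Propositions 5.1/5.2). quartic_coeffs are
    the five polynomial coefficients, highest degree first, assumed to
    be derived from the measured RDC and the alignment tensor."""
    solutions = []
    for x in np.roots(quartic_coeffs):
        if abs(x.imag) > 1e-9:
            continue  # discard complex roots (a "real solution filter")
        x = x.real
        y = y_from_x(x)            # Eq. (1) of the paper, supplied by caller
        z2 = 1.0 - x * x - y * y   # unit-sphere constraint x^2 + y^2 + z^2 = 1
        if z2 < 0:
            continue
        z = np.sqrt(z2)
        solutions.extend([(x, y, z), (x, y, -z)])  # both z signs are feasible
    return solutions
```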
+ wvE2v
(2)
where Er = y/Zi-lET=£$k~ri,k)' is the RDC RMSD, u is the number of RDCs for each residue, rjk and r' k are, respectively, the experimental RDC for RDC j of residue k, and the corresponding RDC backcomputed from the structure. The variables wv and Ev are, respectively, the relative weight and score for van der Waals (vdW) repulsion. For each conformation vector cm of a fragment, Ev is computed with respect to a quasipolyalanine model built with c m . The quasi-polyalanine
model consists of alanine, glycine and proline residues with proton coordinates. If a residue is neither a glycine nor a proline in the protein sequence, it is replaced with an alanine residue. If the vdW distance between two atoms computed from the model is larger than the minimum vdW distance between the two atoms, the contribution of this pair of atoms to E_v is set to zero. Since the (φ, ψ) angles are computed from the sampled CH and NH RDCs by exact solution, the back-computed NH and CH RDCs are in fact the same as their sampled values. For additional RDCs (CC or NC RDCs), E_r is minimized as cross-validation using Eq. (2). For each sampled set of RDCs, R_t, t = 1, …, b, the output of this systematic search step is the optimal conformation vector c_{1,t} in Fig. 2. The search step is followed by an SVD step to update tensors, S_{1,t}, using the experimental RDCs and the just-computed fragment structure. Next, the algorithm repeats the cycle of systematic search followed by SVD (systematic-search/tensor-update) to compute a new ensemble of structures using each of the newly-computed tensors, S_{1,t}, t = 1, …, b. The output of the fragment computation for a fragment i is a set of conformation vectors c_{h,w}, w = 1, …, b^h, where h is the number of cycles of systematic-search/tensor-update.
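A minimal sketch of the scoring in Eq. (2), assuming the RDC arrays and a precomputed vdW score are available; the mean inside the RMSD is taken over all RDCs, which is our reading of the (garbled) formula:

```python
import numpy as np

def rdc_rmsd(r_exp, r_calc):
    """E_r: root-mean-square deviation between experimental RDCs and
    RDCs back-computed from the structure, over all residues and
    RDC types."""
    r_exp, r_calc = np.asarray(r_exp), np.asarray(r_calc)
    return np.sqrt(np.mean((r_exp - r_calc) ** 2))

def score_tf(r_exp, r_calc, e_vdw, w_vdw=1.0):
    """T_F = E_r^2 + w_v * E_v^2 (Eq. 2); w_vdw is a placeholder weight."""
    return rdc_rmsd(r_exp, r_calc) ** 2 + w_vdw * e_vdw ** 2
```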
6.3. Linker computation and assembly

Given a common tensor S in set Q and the orientations of two fragments F_1 and F_2 in the POF for S, an m-residue linker L_1 between them is computed as shown in Fig. 3. The computation of a linker can start from either its N-terminus, as detailed in Fig. 3, or from its C-terminus, depending on the availability of experimental data. For the latter, the interested reader can see Propositions 10.1 and 10.2 (section 10 of APPENDIX) for the details. Every two consecutive fragments are assembled (combined), recursively, into a single fragment, and the process stops when all the fragments have been assembled. The scoring function for the linker computation, T_L, is computed similarly to T_F:

T_L = E_r² + w_v E_v² + w_p E_p²    (3)
The main difference is that E_v for a linker is computed with respect to an individual structure comprising all the previously-computed and linked fragments and the current linker built with the backbone (φ, ψ) angles computed from RDCs. In addition, the PRE violation, E_p, which is essentially the PRE RMSD for an individual structure comprising all the previously-computed and linked fragments and the current linker, is computed as E_p = sqrt( (1/o) Σ_{i=1}^{o} (d_i − d'_i)² ), where d_i and d'_i are, respectively, the experimental PRE distance and the corresponding distance between the two Cα atoms back-computed from the model, and o is the number of PRE restraints. An experimental PRE distance restraint is between two Cα atoms computed from the PRE peak intensity16. If d'_i < d_i, the contribution of PRE violation i to E_p is set to zero.
Fig. 1. Divide-and-conquer strategy. The input to the algorithm is the protein sequence, at least two RDCs per residue in a single medium, and PREs (if available). The terms c_i denote conformation vectors for the complete backbone structure. Please see the text for the definitions of other terms and an explanation of the algorithm.
Fig. 2. Fragment computation: the computation of a structure ensemble of a fragment. The figure shows only two cycles of systematic search followed by SVD. Please see the text for the definition of terms and an explanation of the algorithm.

This search step is similar to our previous systematic searches as detailed in 29, 28, 31. The key difference is that the linker scoring function, Eq. (3), has two new terms, E_v and E_p, and lacks the term in 29, 28, 31 for restraining (φ, ψ) angles to the favorable Ramachandran region for a typical α-helix or β-strand.
7. APPLICATION TO REAL BIOLOGICAL SYSTEMS

We have applied our algorithm to compute the structure ensembles of two proteins, an acid-denatured ACBP and a urea-denatured eglin C, from real experimental NMR data.

Application to acid-denatured ACBP. An ensemble of 231 structures has been computed for ACBP denatured at pH 2.3. The experimental NMR data9 has both PREs and four backbone RDCs per residue: NH, CH, NC and CC. All the 231 structures have no vdW repulsion larger than 0.1 Å, except for a few vdW violations as large as 0.35 Å between the two nearest neighbors of a proline and the proline itself. These 231 structures satisfy all the experimental RDCs (CH, NH, CC and NC) much better than the native structure, and have PRE violations, E_p, in the range of 4.4–7.0 Å. The native structure also has very different Saupe elements, S_yy and S_zz. Further analysis of the computed ensemble shows that the acid-denatured ACBP is neither random coil nor native-like.

Application to urea-denatured eglin C. An ensemble of 160 structures was computed for eglin C denatured at 8 M urea. No structures in the ensemble have a vdW violation larger than 0.1 Å, except for a few vdW violations as large as 0.30 Å. The computed structures satisfy the experimental CH and NH RDCs much better than the native structure. The native structure also has very different Saupe elements, S_yy and S_zz. Further analysis of the computed ensemble also shows that the urea-denatured eglin C is neither random coil nor native-like.
8. ALGORITHMIC COMPLEXITY AND PRACTICAL PERFORMANCE

The complexity of the algorithm (Fig. 1) can be analyzed as follows. Let the protein sequence be divided into p m-residue fragments and p−1 m-residue linkers, and let the size of the samplings be b. The systematic-search step in Fragment computation takes O(b p f^m) time to compute all the p ensembles for p fragments (Fig. 2), where f is the number of (φ, ψ) pairs for each residue computed from two quartic equations (Propositions 5.1-5.2) and pruned using a real solution filter as described in 28 and also a vdW filter (repulsion). A single SVD step in Fig. 2 takes m·5² + 5³ = O(m) time. Thus, h cycles of systematic-search/SVD take t_F time in the worst case, where t_F = Σ_{j=1}^{h} p b^j (f^m + m) = p ((b^{h+1} − b)/(b − 1)) (f^m + m) = O(p b^{h+1} (f^m + m)) = O(p b^{h+1} f^m), since f^m is much larger than m. In the implementation, b = 8 × 1024 and h = 2 (see section 11 of APPENDIX). In practice, only a small number (about 100) of structures, out of all the possible b^h computed structures for fragment i (section 6.2 and Fig. 2), are selected and saved in W_i (Fig. 1); that is, the selected structures have T_F < T_max or T_L < T_max, where T_F and T_L are computed, respectively, by Eq. (2) and Eq. (3), and T_max is a threshold. The Merge step takes O(p w^p log w) time, where w = |W_i| is the number of structures in W_i. The Merge step generates q p-tuples of alignment tensors, where q = γ w^p and γ is the percentage of p-tuples selected from the Cartesian product of the sets T_i, i = 1, …, p, according to the ranges for S_yy and S_zz (section 6.1). The SVD step for computing q common tensors from p m-residue fragments takes q(mp·5² + 5³) = O(mpq) time. The linkers are computed and assembled top-down using a binary tree. The Linker computation and assembly step then takes t_L time,
since at depth k of the tree, vdW repulsion and PRE violation are computed for the assembled fragment consisting of 2^k m-residue fragments and an m-residue linker (Fig. 3). The total time is O(p b^{h+1} f^m + p w^p log w + mpq + b q p 2^{(c+1)m+1} f^m), where c = log f.
The largest possible value for f is 16, but on average f is about 2. The largest possible value for γ is 1, but in practice it is very small, about 10^{-9}, and q = 10^3 with w = 100. Although the worst-case time complexity is exponential in O(h), O(m) and O(p), the parameters m, h, p are rather small constants in practice, with typical values of m = 10, h = 2, p = 6 for a 100-residue protein. In practice, on a Linux cluster with 32 2.8 GHz Xeon processors, 20 days are required for computing an ensemble of 231 structures for ACBP, and 7 days for computing the ensemble of 160 structures for eglin C.
Fig. 3. Linker computation: for each sampled set of RDCs, compute the (φ, ψ) angles of the linker by systematic search, compute φ_m and ψ_m by Proposition 10.3 (section 10 of APPENDIX), build a polyalanine model for linker L_1 using the vector c'_m, link L_1 to F_1 and F_2, and compute E_p and a new score T'_L by Eq. (3) for the assembled fragment F_1 ∪ L_1 ∪ F_2. If T'_L < T_max, save the assembled fragment.

…

Σ_{i=1}^{n} Σ_{k=1}^{m} w_k |R_i^{new} p_{i,k} + t_i^{new} − p̄_k|² ≤ Σ_{i=1}^{n} Σ_{k=1}^{m} w_k |R_i p_{i,k} + t_i − p̄_k|² = SD
From theorem 2, in step 4 we have

SD^{new} = Σ_{i=1}^{n} Σ_{k=1}^{m} w_k |R_i^{new} p_{i,k} + t_i^{new} − p̄_k^{new}|² ≤ Σ_{i=1}^{n} Σ_{k=1}^{m} w_k |R_i^{new} p_{i,k} + t_i^{new} − p̄_k|²
So SD^{new} ≤ SD, and SD decreases in each iteration. We stop when this decrease is less than the threshold ε; this will be a local minimum of SD. Horn's method calculates the optimal rotation matrix for two m-atom structures in O(m) operations, so initialization and each iteration take O(n·m) operations. Our experiments show that for any start positions of all n structures, the algorithm converges in a maximum of 4-6 iterations when ε = 1.0×10^{-5}. The number of iterations is one fewer when the proteins start with a preliminary alignment from the optional initialization in step 1. Because the lower bound for aligning n structures with m points per structure is O(n·m), this algorithm is close to the optimum. We must make two remarks about the paper of Sutcliffe et al.8, which proposed the algorithm above. First, they actually give different weights to individual atoms, which they change during the minimization. We can establish analogues of Theorems 1-3 for individual atom weights if the weight of a corresponding pair of atoms is the half-normalized product of the individual weights. To minimize wRMSD for such weights, however, we have observed that it is no longer sufficient to translate the structure centroids to the origin. We believe that this may explain why Sutcliffe's algorithm can take many iterations to converge: the weights are not well-grounded in mathematics. We plan to explore atom weights more thoroughly in a subsequent paper. Second, their termination condition was that the deviation between two average structures was small, which actually tests only the second inequality on the decrease of SD above. It is a stronger condition to terminate based on the deviation of SD.
While preparing the final version of this paper, we found two papers with similar iterative algorithms13, 14. Both algorithms use singular value decomposition (SVD) as the subroutine for finding an optimal rotation matrix; quaternions should be used instead because they preserve chirality. Pennec14 presented an iterative algorithm for unweighted multiple structure alignment, and our work can be regarded as an extension of his. Verboon and Gabriel13 presented their iterative algorithm as minimizing wRMSD with atom weights (different atoms having different weights), but in fact it works only for position weights, because the optimization of translation and of rotation cannot be separated with atom weights.
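The iteration just described is compact enough to sketch. The following Python sketch (our illustration, not the authors' MATLAB code) centers each structure at its weighted centroid, then alternates Horn's quaternion rotation against the current average with re-averaging until the decrease of SD falls below ε; the function names are ours, and w can be set to all ones for unweighted RMSD.

```python
import numpy as np

def horn_rotation(P, Q, w):
    """Optimal rotation (Horn's quaternion method) taking the centered
    point set P onto the centered point set Q under position weights w;
    quaternions preserve chirality, unlike an unguarded SVD."""
    S = (w[:, None] * P).T @ Q  # 3x3 weighted correlation matrix
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz]])
    eigvals, eigvecs = np.linalg.eigh(N)             # symmetric 4x4 eigenproblem
    q0, q1, q2, q3 = eigvecs[:, np.argmax(eigvals)]  # optimal unit quaternion
    return np.array([
        [q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3), q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2), 2*(q2*q3 + q0*q1), q0*q0 - q1*q1 - q2*q2 + q3*q3]])

def align_to_average(structures, w, eps=1e-5):
    """Iterative wRMSD minimization: translate weighted centroids to the
    origin, then alternately rotate every structure onto the current
    average and recompute the average, until SD drops by less than eps."""
    w = np.asarray(w, dtype=float)
    X = [s - np.average(s, axis=0, weights=w) for s in structures]
    sd_old = np.inf
    while True:
        avg = np.mean(X, axis=0)                          # average structure
        X = [x @ horn_rotation(x, avg, w).T for x in X]   # optimal rotations
        avg = np.mean(X, axis=0)
        sd = sum((w * ((x - avg) ** 2).sum(axis=1)).sum() for x in X)
        if sd_old - sd < eps:
            return X, avg, sd
        sd_old = sd
```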
3. RESULTS AND DISCUSSION

3.1. Performance

We test the performance of our algorithm by minimizing the RMSD for 23 protein families from HOMSTRAD19, which are all the families that contain more than 10 structures with total aligned length longer than 100. We set ε = 1.0×10^{-5} and run the experiment on a 1.8 GHz Pentium M laptop with 768M memory. The code is written in MATLAB and is downloadable at http://www.cs.unc.edu/~xwang/. We run our algorithm 5,000 times for each protein family. Each time we begin by randomly rotating each structure in 3D space and then minimize the RMSD. We expect the changes in RMSD to be small, since these proteins were carefully aligned with a combination of tools, but want to make sure that our algorithm does not become stuck in local minima that are not the global minimum. The results are shown in Table 1. For each protein family's 5,000 tests, the difference between the maximum RMSD and the minimum RMSD is less than 1.0×10^{-8}, so they converge to the same local minimum. Moreover, the optimal RMSD values found by our algorithm are less than the original RMSD from the alignments in HOMSTRAD in all cases. In three cases the relative difference is greater than 3%; in each of these cases there is an aligned core for all proteins in the family, but some disordered regions allow our algorithm to find alignments with better RMSD. These cases clearly call for weighted alignment.
Table 1. Performance of the algorithm on different protein families from HOMSTRAD. We report n, the number of proteins; m, the number of atoms aligned; the RMSD from the HOMSTRAD Alignment (HA); the RMSD for the optimal alignment from our algorithm; and statistics on iterations and time (milliseconds) for 5,000 runs of each alignment.

Protein family                                           n    m    RMSD HA (Å)  optim. RMSD  % rel. diff  Iterations (avg, med, max)  Time in ms (avg, median, max)
immunoglobulin domain - V set heavy chain                21   107  1.224        1.213        0.91         3.8, 4, 4                   11.7, 10, 30
globin                                                   41   109  1.781        1.747        1.95         4.0, 4, 5                   24.4, 20, 40
phospholipase A2                                         18   111  1.492        1.478        0.95         3.9, 4, 4                   10.5, 10, 41
ubiquitin conjugating enzyme                             13   114  1.729        1.714        0.88         4.0, 4, 5                   7.9, 10, 11
Lipocalin family                                         15   118  2.881        2.873        0.28         4.0, 4, 5                   9.3, 10, 30
glycosyl hydrolase family 22 (lysozyme)                  12   119  1.357        1.342        1.12         3.9, 4, 4                   7.3, 10, 11
Fatty acid binding protein-like                          17   122  1.825        1.824        0.05         4.0, 4, 5                   10.5, 10, 40
Proteasome A-type and B-type                             17   148  3.302        3.032        8.91         4.8, 5, 6                   9.3, 10, 21
phycocyanin                                              12   148  2.188        2.077        5.34         4.0, 4, 5                   11.0, 10, 40
short-chain dehydrogenases/reductases                    13   177  1.971        1.954        0.87         4.0, 4, 5                   8.8, 10, 11
serine proteinase - eukaryotic                           27   181  1.454        1.435        1.32         3.8, 4, 4                   17.4, 20, 40
Papain fam cysteine proteinase                           13   190  1.396        1.383        0.94         3.9, 4, 5                   8.9, 10, 30
glutathione S-transferase                                14   200  2.336        2.315        0.91         4.0, 4, 5                   9.8, 10, 20
Alpha amylase, catalytic dom.                            23   201  2.327        2.293        1.48         4.0, 4, 5                   16.1, 20, 40
legume lectin                                            12   202  1.302        1.287        1.17         3.8, 4, 4                   8.0, 10, 30
Serine/Threonine protein kinases, catalytic domain       15   205  2.561        2.503        2.32         4.0, 4, 5                   10.6, 10, 21
subtilase                                                11   222  2.279        2.268        0.49         4.0, 4, 5                   8.1, 10, 30
Alpha amylase, catalytic and C-terminal domains          23   224  2.668        2.602        2.54         4.0, 4, 5                   16.6, 20, 40
triose phosphate isomerase                               10   242  1.398        1.386        0.87         3.7, 4, 4                   7.0, 10, 11
pyridine nucleotide-disulphide oxidoreductases class-I   11   262  3.870        3.420        13.16        4.7, 5, 6                   10.1, 10, 21
lactate/malate dehydrogenase                             14   266  2.036        2.024        0.59         4.0, 4, 5                   10.9, 10, 21
cytochrome p450                                          12   295  2.872        2.861        0.38         4.0, 4, 5                   9.8, 10, 30
aspartic proteinase                                      13   297  1.932        1.877        2.93         4.0, 4, 4                   10.5, 10, 30
Fig. 1. Average running time vs. the number of atoms or the number of structures: (a) average running time vs. number of atoms; (b) average running time vs. number of structures.
The maximum number of iterations is 6 and the average and median number of iterations is around 4, so the number of iterations is a small constant and the algorithm achieves the lower bound of multiple structure alignment, which is Θ(n·m). All of the average running times are less than 25 milliseconds and all of the maximum running times are less than 40 milliseconds, which means our algorithm is highly efficient. Figures 1a and 1b show the relationship between the average running time and the number of atoms (n×m) and the number of structures (n) in each protein family. The average running time shows a linear relation with the number of structures but not with the number of atoms, because the most time-consuming operation is computing eigenvectors and eigenvalues of a 4×4 matrix in Horn's method, which takes O(n) in each iteration.
3.2. Consensus structure

For a given protein family, one problem is to find a consensus structure to summarize the structure information. Altman and Gerstein20 and Chew and Kedem21 propose to use the average structure of the
Fig. 2. Multiple structure alignment for pyridine nucleotide-disulphide oxidoreductases class-I: (a) all 11 aligned proteins; (b) the consensus structure; (c) the structure with minimum RMSD; (d) the structure with maximum RMSD.
conserved core as the consensus structure. In fact, by Theorems 1 and 2, the wRMSD is minimized by aligning to the average structure, and no other structure has better wRMSD with all structures. Thus, we claim that the average structure is the natural candidate for the consensus structure. One objection to this claim is that the average structure is not a true protein structure: it may have physically unrealizable distances or angles due to the averaging. This depends on the intended use for the consensus structure; in fact, some other proposed consensus structures are even more schematic: Taylor et al.22, Chew and Kedem21, and Ye and Janardan23 use vectors between neighboring Cα atoms to represent protein structures and define a consensus structure as a collection of average vectors from aligned columns. But a more significant answer comes from Theorem 3: if you do have a set of structures from which you wish to choose a consensus, including the proposal of Gerstein and Levitt10 to use the true protein structure that has the minimum RMSD to all other structures, or POSA of Ye and Godzik24, which builds a consensus structure by rearranging input structures based on alignments of partial order graphs based on
Fig. 3. 3D Gaussian distribution analysis of the distances from each atom to corresponding points on the average structure: (a) distribution of the best aligned position; (b) histogram of R² for all aligned positions.
these structures, then you should choose from this set the structure with minimum wRMSD to the average. Figure 2 shows the alignment of the conserved core of the protein family pyridine nucleotide-disulphide oxidoreductases class-I, the consensus structure, the protein structure with the minimum RMSD to all other structures, and the structure with maximum RMSD to the other structures.

3.3. Statistical analysis of deviation from consensus in aligned structures

Deriving a statistical description of the aligned protein structures is an intriguing question that has significant theoretical and practical implications. As a first step, we investigate the following question concerning the spatial distribution of aligned positions in a protein family. More specifically, we want to test the null hypothesis that, at a fixed position k, the distances at which the n atoms can be found from the average p̄_k, especially those in the "core" area of protein structures, are consistent with distances from a 3D Gaussian distribution. We chose the Gaussian not only because it is the most widely used distribution function, due to the central limit theorem of statistics, but also because previous studies hint that the Gaussian is the best model to describe the aligned structures25. If, by checking our data, we can establish that aligned positions are distributed according to a 3D Gaussian distribution, the set of aligned protein structures can be conveniently described by a concise model composed of the average structure and the covariance matrix specifying the distribution of the positions. To test the fitness of our data to the hypothesized 3D Gaussian model, we adopted the Quantile-Quantile Plot (q-q plot) procedure26, which is commonly used to
determine whether two data sets come from a common distribution. In our procedure, the y-axis is the distances from each structure to the average structure for each aligned position, and the x-axis is the quantile data from the 3D Gaussian. Figure 3a shows the q-q plot for the best aligned position. The correlation coefficient R² is 0.9632, which suggests that the data fits the 3D Gaussian model quite well. We carried out the same experiments for all the aligned positions, and the histogram of the collected correlation coefficients R² is shown in Figure 3b. We find that more than 79% of the positions we check have R² > 0.8. The types of curves in q-q plots reveal information that can be used to classify whether a position should be deemed part of the core. The illustrated q-q plot has the last two curves above the line, which indicates that the two corresponding structures have larger errors in this position than would be predicted by a Gaussian distribution. Most positions produce curves like this, or with all or almost all points on a line through the origin. A low slope indicates that they align well, and that the residuals may fit a 3D Gaussian distribution with a small scale. A few plots begin above the line and come down, or stay on a line of higher slope, indicating that such positions are disordered and should not be considered part of the core.
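A minimal sketch of this test: the radial distances of an isotropic 3D Gaussian follow a chi distribution with 3 degrees of freedom, so the q-q comparison and its R² can be computed as below (our formulation; the plotting positions are an assumption):

```python
import numpy as np
from scipy import stats

def qq_r2_against_3d_gaussian(distances):
    """q-q comparison of per-position distances |p_ik - avg_k| with the
    radial distribution of an isotropic 3D Gaussian (a chi distribution
    with 3 degrees of freedom); returns R^2 of the q-q points."""
    d = np.sort(np.asarray(distances))
    n = len(d)
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions
    theory = stats.chi.ppf(probs, df=3)       # 3D Gaussian radial quantiles
    r = np.corrcoef(theory, d)[0, 1]
    return r * r
```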
3.4. Determining and weighting the core for aligned structures

There are many ways in which we can potentially use this model of the alignment in a family to determine the structurally conserved core of the family and help biologists to compare protein structures. Due to space constraints, we briefly demonstrate one heuristic for determining position weights to identify and align the conserved core of two of our structure families.
Figure 4. Aligned protein families using position weights: (a) pyridine nucleotide-disulphide oxidoreductases class-I; (b) proteasome A-type and B-type. The black colored positions satisfy a_k …

Fig. 1. Helix models.
The intuition of the segmentation process is that each local maximal density voxel can be related to the presence of a packed set of atoms. This situation arises when amino acids are arranged into specific patterns that provide a high local density contribution. For example, helices are arranged so that the side chains of the amino acids involved show an average increase of local density w.r.t. normal coil, due to the helical packing of the backbone. At low resolution, this is characterized by a clear increase of local density that reflects the three dimensional shape of the helix. Hence, the problem boils down to recognizing such clusters made of locally higher density. Every maximal density voxel v is a representative of a volume that is defined as the set of voxels that can be reached from v without increasing the density along the path followed. Each volume is a maximal set of voxels and it contains, in general, small parts of individual helices. The key idea is that this segmentation offers a robust identification of subsets of helices' volumes. Thus, the problem boils down to correctly merging some of these volumes in order to reconstruct the identified helices. The method involves gradient analysis, and it is substantially different from simple density value thresholding (as used in previous proposals7, 8). The gradient is vectorial information, expressed in terms of a 3D direction and an intensity. Intuitively, the gradient shows the direction that points to the locally maximal increase of density. The gradient information is computed for each voxel, considering the density map as the discretization of a continuous function from R³ to density values. In this perspective, the gradient corresponds to the first derivative
of this density function. For processing purposes, the gradient associated with each voxel is approximated using a discrete and local operator over the original density map. Using the gradient direction as a pointer, we can follow these directions from voxel to voxel, until a maximal density value voxel is found. The paths generated touch every voxel, and can be partitioned according to the ending points reached. Paths that share the same ending point form a tree structure that is associated with the same volume. This process generates in output the segmentation we require for helix detection. The motivation for requiring such a segmentation is that low resolution density maps witness the presence of a helix as a dense cylinder-like shape, where the maximal density increases gradually towards its axis. When close to the axis (e.g., < 5 Å), the gradient points towards that axis. This means that the high density voxels of the trees identified on the gradient paths can be employed to characterize the location of the helix axis. Observe that we use gradient trees to segment volumes, by collecting in a single volume all the nodes whose gradient paths lead to the same maximal density voxel. Thus, each of these volumes will contribute to only a part of a helix, and further analysis is required to study the properties of the volumes (and the relationships between their maximal density voxels) and determine whether different volumes actually belong to the same helix. The complete process is articulated in the following phases, described in the next subsections: (i) gradient calculation, (ii) graph construction and processing, (iii) detection of helices.

2.2. Gradients determination

The density map is processed in order to build the map of gradients. The gradient is approximated using Sobel-like convolution masks (3 × 3 × 3) over the original density map5. The gradient is represented by a vector whose direction and intensity can be calculated using the Sobel-like masks in Figure 2. For each voxel, a 3D convolution is performed using the three masks: each mask is overlapped on the density map, and the summation of a point-by-point product is performed in order to collect the intensity of the gradient component for each dimension. The
addition of the three resulting vectors generates the gradient associated with the voxel. For example, the component of the gradient along the X axis can be calculated by using the three matrices in the first row of Figure 2. In Fig. 3(a), we show a slice of a density map; Fig. 3(b) indicates the corresponding z-projection of the gradient for each point. Fig. 3(c) is the overlay of Fig. 3(a) and Fig. 3(b). Observe how the gradient lines are "pointing" towards the denser regions of the density map (shown in darker color).
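A compact sketch of this phase and of the path-following segmentation it feeds, using scipy's n-dimensional Sobel operator; the single-voxel stepping rule below is our simplification of the gradient following described above:

```python
import numpy as np
from scipy import ndimage

def density_gradient(density):
    """Approximate the gradient of a 3D density map with 3x3x3 Sobel-like
    convolution masks, one per axis."""
    gx = ndimage.sobel(density, axis=0, mode="nearest")
    gy = ndimage.sobel(density, axis=1, mode="nearest")
    gz = ndimage.sobel(density, axis=2, mode="nearest")
    return gx, gy, gz

def segment_volumes(density):
    """Follow the gradient direction from voxel to voxel until a local
    density maximum is reached; voxels whose paths end at the same
    maximal-density voxel form one volume (a gradient tree). Simplified:
    each step moves one voxel along the sign of the gradient and is
    accepted only if the density strictly increases, which guarantees
    termination."""
    gx, gy, gz = density_gradient(density)
    shape = np.array(density.shape)
    label = {}
    for start in np.ndindex(density.shape):
        path, v = [], start
        while v not in label:
            path.append(v)
            g = np.array([gx[v], gy[v], gz[v]])
            nxt = tuple(np.clip(np.array(v) + np.sign(g).astype(int), 0, shape - 1))
            if nxt == v or density[nxt] <= density[v]:
                break  # v is a local density maximum along this path
            v = nxt
        root = label.get(v, v)
        for p in path:
            label[p] = root
    return label  # maps every voxel to its representative maximum
```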
2.3. Construction of the graph

The next step of the algorithm involves the construction of a graph describing the structure of the density map. In particular, a directed graph G = (N, E) is used to summarize the gradient properties, where N is the set of nodes of the graph and E ⊆ N × N is the set of edges. Nodes represent voxels of interest (as described later), while edges connect voxels that are "adjacent" in the density map. Let us consider two voxels V_i = (x_i, y_i, z_i) and V_j …

… MFE_GF(s). Note that it is not always possible to fold a sequence into a particular motif. In this case the TDM returns an empty result.

2.1. Z-scores

A Z-score is the distance from the mean of a distribution normalized by the standard deviation. Mathematically: Z(x) = (x − μ)/δ, with μ being the mean and δ the standard deviation. Z-scores are useful for quantifying how different from normal a recorded value is. This concept has been applied to eliminate an effect that is well known for minimum free energy folding: the energy distribution is biased by the G/C content of a sequence as well as its length and dinucleotide composition. To calculate the Z-score for a particular sequence, the distribution of MFE values for random sequences with the same dinucleotide composition must be known. The lower the Z-score, the lower is the energy compared to energies from random sequences. Clote et al.11 observed that Z-score distributions for RNA genes are lower than the Z-score
distribution for random RNA. However, this difference is fairly small and only significant if the whole distribution is considered. It is not sufficient to distinguish an individual RNA gene from random RNA10. The reason for the insufficient significance of Z-scores is the combinatorics of RNA folding: there is often some structure in the complete search space that obtains a low energy.
Fig. 3. Z-score histogram for 10000 random sequences with a length of 100 nucleotides, for two TDMs and the general folding.
Here, our aim is not the general prediction of non-coding RNA, but the detection of new members of a known, or at least defined, RNA family. By restricting the folding space, we can, as we demonstrate in Section 3, shift Z-scores for family members into a significant zone. Structures with MFE_GF = MFE_G for a grammar G get a lower Z-score, since the distribution of MFE_G for random RNA is shifted to higher energies. Even if this seems to be right for the grammars used in this paper, the effect of a folding space restriction on the energy distribution is not obvious. Clearly, the mean is shifted to more positive values, but the effect on the variance is not yet understood mathematically. Therefore, our applications must provide evidence that the Z-scores are affected in the desired way. Let D_G(s) be the frequency distribution of MFE values for random sequences with the same dinucleotide frequency as s, i.e. the minimum free energy versus the fraction of structures s' obtaining that energy with TDM_G(s'). Z_G(s) is the Z-score for a sequence s with respect to the distribution D_G(s).
115 The value-mean and the standard deviation can be determined by a sampling procedure. For our experiments, we generate 1000 random sequences preserving the dinucleotide frequencies of s. The distribution of Z-scores for random RNA sequences is shown in Figure 3. Interestingly, a restriction of the folding space does not affect the Z-score distribution. At least this holds for the TDMs shown in this paper. For a reliable detection of RNA genes, a Z-score of lower than -4 is needed 10 . Our experiments showed that over 99.98% of random RNAs have Z-scores greater then -4. To distinguish RNA genes from other RNA on a genomic scale, a threshold should be set to a Z-value such that the number of false predictions is trackable.
the Rfam database, the consensus shown there is a good starting point; at least the structural part of it. Alternatively, the consensus of known sequences can be obtained with programs that predict a common structure, like PMmulti22 and RNAcast23. motif
AD hloop
region hloop
2.2. Design and implementation
AD hloop
SS
Designing a thermodynamic matcher means defining its structure space. On the one hand it must be large enough to support good sensitivity, and on the other hand it must be small enough to provide good specificity. A systematic analysis of the relation between structure space restriction and its effect on specificity and sensitivity of MFE based Z-scores is subject of our current research.
region hloop
IL base hloop base BL region hloop
A-e
f.
I
region hloop region I
BR hloop region
HL
E
• '-6
*&*•
region F i g . 5. Simplified version of t h e grammar QRNAIReconsider the grammar in Figure 1. Instead of an axiom t h a t derives arbitrary RNA structures, t h e axiom motif derives three hairpin loops {hloop) connected by single stranded regions.
F i g . 4 . Consensus structure for RNAI genes taken from the Rfam database.
The design of a TDM for an RNA gene requires a consensus structure. If an RNA family is listed in
We now exemplify the design of a TDM. For instance, we are interested in stable secondary structures that consist of three hairpin loops separated by single stranded regions, like the structures of RNAI genes as shown in Figure 4. A specialized grammar for RNAI must only allow structures compatible with this motif. A simplified version of the grammar QRNAU which abstracts from length constraints for stems and loops, is given in Figure 5. Since we want to demonstrate that with a search
116 space reduction new members of an RNA family can be detected by their energy based Z-score, we do not incorporate explicit sequence constraints in a thermodynamic matcher other than those necessary to form the required base-pairs. However, this could be easily incorporated in our framework. We use the algebraic dynamic programming (ADP) framework19 to turn RNA secondary structure space grammars into thermodynamic matchers. In the context of ADP, writing a grammar in a text based notation is equivalent to writing a dynamic programming structure prediction program. This approach is similar to using an engine for searching with regular expressions. There is no need to implement the search routines, it is only a matter of specifying the search results. A grammar, which constitutes the control structure of an unrestricted folding algorithm, is augmented by an evaluation algebra incorporating the established energy rules 5 . All TDMs share these rules, only the grammar changes. The time complexity of a TDM depends on the motif complexity. If multiloops are included the runtime is 0(n3) where n is the length of the sequence that is folded. Without multiloops the time complexity is 0(n2), if the size of bulges and loops is bounded by a constant. In both cases the memory consumption scales with 0(n2).
3. RESULTS We constructed TDMs for the non-coding RNA families RNAI and hammerhead type III ribozyme (hammerheadlll) taken from the Rfam database Version 7.0 16 ' 17 . All TDMs used in this section utilize the complete energy model for RNA folding6 and therefore have more complex grammars than the grammars presented to explain our method. To assess if TDMs can be used to find candidates for an RNA family, we searched for known members in genomic data. The known members are those from Rfam seeds, which are experimental validated. We apply our TDMs to genomes containing the seed sequences and measure the relation between Z-score threshold, sensitivity, and specificity. We define sensitivity as T P / ( T P + F N ) and specificity as TN/(TN+FP), where T P is the number of true positives, TN is the number true negatives, FP is the number of false positives, and FN is the number of
false negatives.
3.1. RNA I Replication of ColEl and related bacterial plasmids is initiated by a primer, the plasmid encoded RNAII transcript, which forms a hybrid with its template DNA. RNAI is a shorter plasmid-encoded RNA that acts as a kinetically controlled suppressor of replication and thus controls the plasmid copy number 24 . Sequences coding for RNAI fold into stable secondary structures with Z-scores reaching from —3.6 to - 6 . 7 (Table 1). Table 1. Z-score for the RNAI seed sequences computed with TDMgGF and TDMgRNA]. EMBL Accession number AF156893.2 X80302.1 Y17716.1 Y17846.1 U80803.1 D21263.1 S42973.1 U65460.1 X63534.1 AJ132618.1
Z
GCF
-6.61 -4.88 -5.74 -5.06 -6.33 -3.96 -4.53 -6.73 -3.63 -5.93
Z
QRNAI
-7.31 -6.20 -6.29 -6.16 -6.84 -5.33 -5.82 -7.41 -5.41 -6.71
The Rfam consensus structure consists of three adjacent hairpin loops connected by single stranded regions (Figure 4). Structures for this consensus are described by the grammar QRNAI (Figure 5). If we allow for arbitrary stem lengths in our motif, all structures that consist of three adjoined hairpins would be favored by TDMgRNAI. This has an undesired effect: It would be possible to fold a sequence, that folds (with general folding) into a single hairpin with low energy, into a structure with one long and two very short hairpins. Although the energy of the restricted folding is higher than the energy of the unrestricted folding, it would still obtain a good energy resulting in a low Z-score. Clearly, these structures do not really resemble the structures of RNAI genes. In refinement, each stem loop is restricted to a minimal length of 25 nucleotides and the length of the complete structure is restricted to up to 100 nucleotides. These restrictions are compatible with the consensus of RNAI and increase the sensitivity
117 and specificity of TDMgRNAI. Sequences from the seed obtain ZgRNAI values between —5.33 and —7.41 (Table 1). For random RNA the frequency distribution of ZgRNAI is similar to ZgCF (see Figure 3). The ZQRNAI score difference is large enough to distinguish RNAI genes from random RNA.
Z-score in the range of 5 nucleotides to the left or right of the starting position of an RNAI gene has a Z-score equal or lower than the current threshold. In this region, no negative hits are counted. Figure 6 shows the result for a plasmid of Klebsiella pneumoniae. It is also possible to use a complete sequence as input for a TDM. However, this will return the best substructure (or substructures) in terms of energy, which not always corresponds to the substructure with the lowest Z-score. 100
Sequence position [nt]
(a) General folding
TTZ
(TDMgG 80
2a> o> 8
60
40 0
1000
2000 3000 Sequence position [nt]
(b) Restricted Folding
4000
5000
(TDMg R N A I )
F i g . 6. TDM scan for RNAI in a plasmid of Klebsiella pneumoniae (EMBL Accession number AF156893). The known RNAI gene is located at position 4498 indicated by the dotted vertical line, (a) In steps of 5 nucleotides, the score ZgCF is shown for the following 100 nucleotides and for their reverse complement. T h e Z-scores for both directions are drawn versus the same sequence position. The position where the known RNAI gene starts achieves a low Z-score, but there is another position with a lower Z-score (position ~ 1450) and positions, with nearly as low scores (around position 750). (b) shows corresponding values for ZgRNAJ. The RNAI gene now clearly separates from all other positions. Sequences that fold into some unrelated stable structure are penalized because they cannot fold into a stable RNAI structure.
To verify whether RNAI genes can also be distinguished from genomic RNA, we applied our matcher to 10 plasmids that contain the seed sequences (one in each of them). The Plasmid length ranges from 108 to 8193 nucleotides in this experiment. All plasmids together have a length of ~ 27500 nucleotides. For each plasmid, a 100 nucleotides long window was slid from 5' to 3' with a successive offset of 5. ZgRNAI was computed for every window. RNA I can be located on both strands of the plasmid. Therefore, TDMgRNAI was also applied to the reverse complement. Overall, this results in ~ 11000 ZgRNA, scores. An RNAI sequence was counted as positive hit if a
20
Sensitivity (G GF ) — Specificity (G e F ) — Sensitivity ( G R N A | ) —• Specificity ( G R N A , ) --•-
-8
-7
-6
-3
-2
-1
Fig. 7. Sensitivity and specificity versus the Z-value threshold. TDMgRNA1 improves sensitivity and specificity compared to TDMgGF.
If we set the Z-score threshold to —5, we obtain for TDMgRNAI a sensitivity of 100% and a specificity of 99.89%, which means 10 true positives and 12 false positives (for all plasmids). For TDMgGF, we obtain only a sensitivity of 80% and a specificity of 99.10%, which means 8 true positives and 99 false positives. A threshold of —3.5 is required to find all RNAI genes of the seed. The specificity in this case is 96.71% resulting in 362 false positives. (Figure 7). Although the specificity is fairly low, it makes a big difference to the number of false positives for genome wide applications. 3.2. Hammerhead ribozyme (type III) The hammerhead ribozyme was originally discovered as a self-cleaving motif in viroids and satellite RNAs. These RNAs replicate using the rolling circle mech-
118 anism, which generates long multimeric replication intermediates. They use the cleavage reaction to resolve the multimeric intermediates into monomeric forms. The region able to self-cleave has three base paired helices connected by two conserved single stranded regions and a bulged nucleotide. Hammerhead type III ribozymes (Hammerheadlll) form stable secondary structures with Z-scores varying from -6 to -2 for general folding. The seed sequences from the Rfam database vary in their length. 6 sequences have a length of around 80 nucleotides. All other seed sequences are around 55 nucleotides long. To be able to use length constraints, which are not too vague, we removed the 6 long sequences for our experiment. Thus, TDMgHH is not designed to search for Hammerheadlll candidates with a sequence length larger than 60 nucleotides.
quences are only about 45 nucleotides long. They fold into two adjacent hairpin loops and do not form a multiloop with TDMgGF They are forced into our Hammerheadlll motif with considerable higher free energy. If a family has many members, it might be necessary to separately consider subfamilies.
I
Fig. 9. Z-scores distribution for 68 hammerhead ribozyme type III sequences.
U-c Qfi«3 l i
A ..Cn C cCCA %AU0^O «
Fig. 8. Consensus structure for hammerhead ribozyme type III genes taken from the Rfam database.
Grammar QHH describes the folding space for the consensus structure shown in Figure 8. The maximal length of our motif is 60 nucleotides. The single stranded region between the two stem loops in the multiloop has to be between 5 and 6 nucleotides long. The stem lengths are not explicitly restricted. TDMgHH improves the distribution of Z-scores for the seed sequences (Figure 9). Most sequences now obtain a Z-score smaller than —4, but some obtain a higher score. These se-
We applied TDMgHH to 59 viroid sequences with length of 290 to 475 nucleotides. Hammerheadlll can be located on both strands of the DNA. Each sequence contains one or two Hammerheadlll genes. A 60 nucleotides long window was slid from 5' to 3' with a successive offset of 2. For the sequence (and for its reverse complement), of each window ZgHH was computed. Overall, this resulted in ~ 19500 scores. An Hammerheadlll sequence was counted as positive hit if a Z-score in the range of 3 nucleotides to the left or right of the starting position of an Hammerheadlll gene has a Z-score equal or lower than the current threshold. In this region, no negative hits are counted. The sensitivity and specificity depending on the Z-score threshold is shown in Figure 10. The sensitivity is improved significantly compared to TDMgGF. However, the specificity is lower for Zscores thresholds smaller than —3, which is the relevant region. It turned out that many false positives with Z-values of smaller —4 maybe true positives, which are not part of the Rfam seed, but are predicted as new RNAI candidate genes in Rfam. Figure 11 shows sensitivity and specificity if false nega-
119 0.4 0.35
1"
1
1
'
'
G^ ' G
-
HH
random sequences — -
1
—r
t
1
0.15
i
1
0.2
i
0.25
i
0.3 -
1
tives, that are candidate genes in Rfam, are counted as true positives. All RNA candidate genes that are provided in Rfam achieve low Z-scores as shown in Figure 12. Unlike Infernal16, which is used for the prediction of candidate family members in Rfam, we use pure thermodynamics rather than a covariance based optimization. This gives further and independent evidence for the correctness of both predictions.
\
0.1 0.05 0 -10
r-t—/
-8
V.'. ' \ v ' " K - - , i
Y-.
-4 -2 Z-score
F i g . 12. Distribution of Z-scores for all 274 Hammerheadlll gene and gene candidate sequences taken from the Rfam database.
4. DISCUSSION
F i g . 1 0 . Selectivity and specificity versus the Z-value threshold. TDMgHH improves sensitivity and specificity compared to TDMgGF.
100
80
60
40 -
20
Sensitivity Specificity Sensitivity Specificity
(GGF) (G GF ) (G HH ) (GHH)
-4 Z-score
P i g . 1 1 . Selectivity and specificity versus the Z-value threshold. TDMgHH improves sensitivity and specificity compared to TDMgGF. Candidates predicted by Rfam are treated as positive hits.
The current debate about the quality of thermodynamic prediction of RNA secondary structures is extended by our observations regarding specialized folding spaces. It is well known that the MFE structure from predictions in most cases only shares a small number of base-pairs that can be detected by more reliable sources than MFE such as compensational base mutations. This is a consequence of the combinatorics of the RNA folding space, which provides many "good" foldings. Thus, MFE on its own can not be used to discriminate non-coding RNAs. We demonstrated that, given a consensus structure for a family of non-coding RNA, a restriction of the folding space to this family prunes low energy foldings for non-coding RNA that do not belong to this family. The overlap of Z-score distributions for MFE values for family members and non-family members can be reduced by our technique resulting in a search technique with high sensitivity and specificity, called thermodynamic matching. In our experiments for RNA I and the hammerhead type III riboyzme, we did not include other restrictions than size restrictions for parts of the structure. These matchers can be fine tuned and can also include sequence restrictions, which could further increase their sensitivity and specificity. It is also possible to include H-type pseudoknots in the motif using techniques presented in Ref. 18.
120 We demonstrated that a TDM can detect members of RNA families by scanning single sequences. It seems promising to extend the TDM approach to scan aligned sequences using a combined energy and covariance scoring in spirit of RNAalifold12. This should further increase selectivity, or, if this is not necessary, allow "looser" motif definitions. A question that arises from our observations is: Can our TDM approach be incorporated in a gene prediction strategy? If we would guess a certain motif and find stable structures with significant Zscores, they might be somehow biologically relevant. In a current research project, we focus on a systematic generation of TDMs for known RNA families from the Rfam database. We are also working on a graphical user interface to facilitate biologists to create their own TDMs, without requiring the knowledge of the underlying algebraic dynamic programming technique. Beside the two RNA families shown here we have implemented TDMs for 7 other non-coding RNA families, including transfer RNA, micro RNA percursor and the Nanos 3' UTR translation control element. The results were consistent with our observations for RNAI and the hammerhead ribozyme given here, and will be used to analyze further the predictive power of thermodynamic matchers.
5.
6. 7. 8.
9.
10.
11.
12.
13.
ACKNOWLEDGEMENTS We thank Marc Rehmsmeier for helpful discussions and Michael Beckstette for comments on the manuscript.
14.
References
16.
1. A. F. Bompfiinewerer, C. Flamm, C. Fried, G. Fritzsch, I. L. Hofacker, J. Lehmann, K. Missal, A. Mosig, B. Miiller, S. J. Prohaska, B. M. R. Stadler, P. F. Stadler, A. Tanzer, S. Washietl, and C. Witwer, "Evolutionary patterns of non-coding RNAs," Theor. Biosci, vol. 123, pp. 301-369, 2005. 2. S. R. Eddy, "Non-coding RNA Genes and the Modern RNA World," Nature Reviews Genetics, vol. 2, pp. 919-929, 2001. 3. S. Washietl, I. L. Hofacker, and P. F. Stadler, "From The Cover: Fast and reliable prediction of noncoding RNAs," PNAS, vol. 102, no. 7, pp. 2454-2459, 2005. 4. E. Rivas and S. Eddy, "Noncoding RNA gene de-
15.
17.
18.
19.
tection using comparative sequence analysis," BMC Bioinformatics, vol. 2, no. 1, p. 8, 2001. D. H. Turner, N. Sugimoto, and S. M. Freier, "RNA Structure Prediction," Annual Review of Biophysics and Biophysical Chemistry, vol. 17, no. 1, pp. 167192, 1988. M. Zuker, "Mfold web server for nucleic acid folding and hybridization prediction," Nucl. Acids Res., vol. 31, no. 13, pp. 3406-3415, 2003. I. L. Hofacker, "Vienna RNA secondary structure server," Nucl. Acids Res., vol. 31, no. 13, pp. 34293431, 2003. W. Seffens and D. Digby, "mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences," Nucl. Acids Res., vol. 27, no. 7, pp. 1578-1584, 1999. C. Workman and A. Krogh, "No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution," Nucl. Acids Res., vol. 27, no. 24, pp. 4816-4822, 1999. E. Rivas and S. R. Eddy, "Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs," Bioinformatics, vol. 16, no. 7, pp. 583-605, 2000. P. Clote, F. Ferre, E. Kranakis, and D. Krizac, "Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency," RNA, vol. 11, no. 5, pp. 578-591, 2005. S. Washietl and I. L. Hofacker, "Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics," J Mol Biol, vol. 342, pp. 19-30, 2004. S.-Y. Le, J.-H. Chen, D. Konings, and J. Maizel, Jacob V., "Discovering well-ordered folding patterns in nucleotide sequences," Bioinformatics, vol. 19, no. 3, pp. 354-361, 2003. R. Giegerich, B. Voss, and M. Rehmsmeier, "Abstract Shapes of RNA," Nucl. Acids Res., vol. 32, no. 16, pp. 4843-4851, 2004. B. Voss, R. Giegerich, and M. Rehmsmeier, "Complete probabilistic analysis of RNA shapes," BMC Biology, vol. 4, no. 5, 2006. S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S. R. Eddy, "Rfam: an RNA family database," Nucl. Acids Res., vol. 31, no. 1, pp. 439441, 2003. S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman, "Rfam: annotating non-coding RNAs in complete genomes," Nucl. Acids Res., vol. 33, no. suppl 1, pp. D121-124, 2005. J. Reeder and R. Giegerich, "Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics," BMC Bioinformatics, vol. 5, no. 104, 2004. R. Giegerich, C. Meyer, and P. Steffen, "A discipline of dynamic programming over sequence data,"
121 Science of Computer Programming, vol. 51, no. 3, pp. 215-263, 2004. 20. P. Steffen and R. Giegerich, "Versatile and declarative dynamic programming using pair algebras," BMC Bioinformatics, vol. 6, no. 224, 2005. 21. T. J. Macke, D. J. Ecker, R. R. Gutell, D. Gautheret, D. A. Case, and R. Sampath, "RNAMotif, an RNA secondary structure definition and search algorithm," Nucl. Acids Res., vol. 29, no. 22, pp. 47244735, 2001.
22. I. L. Hofacker, S. H. F. Bernhart, and P. F. Stadler, "Alignment of RNA Base Pairing Probability Matrices," Bioinformatics, vol. 20, pp. 2222-2227, 2004. 23. J. Reeder and R. Giegerich, "Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction," Bioinformatics, vol. 21, no. 17, pp. 3516-3523, 2005. 24. Y. Eguchi and J. Itoh, T Tomizawa, "Antisense RNA," Annu. Rev. Biochem., vol. 60, pp. 631-652, 1991.
This page is intentionally left blank
123
PEM: A GENERAL STATISTICAL APPROACH FOR IDENTIFYING DIFFERENTIALLY EXPRESSED GENES IN TIME-COURSE CDNA MICROARRAY EXPERIMENT WITHOUT REPLICATE XuHan* Genome Institute of Singapore, 60, Biopolis Street, Singapore 138672 'Email:
[email protected] Wing-Kin Sung Genome Institute of Singapore, 60, Biopolis Street, Singapore 138672 School of Computing, National University of Singapore, Singapore 117543 Email:
[email protected],
[email protected] Lin Feng School of Computer Engineering, Nanyang Technological University, Singapore 637553 Email:
[email protected] Replication of time series in microarray experiments is costly. To analyze time series data with no replicate, many model-specific approaches have been proposed. However, they fail to identify the genes whose expression patterns do not fit the pre-defined models. Besides, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment is poorly understood. We propose a method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In the PEM method, we assume the gene expressions vary smoothly in the temporal domain. This assumption is comparatively weak and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic is incorporated into the permutation based SAM framework for significance analysis. We evaluated the PEM method with an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and the generality of the PEM method. It outperforms the previous versions of SAM and the spline based EDGE approaches in identifying genes of interest, which are differentially expressed in various manner. Keywords: Time course, cDNA microarray, differentially expressed gene, PEM.
1. INTRODUCTION Time-course cDNA microarray experiments are widely used to study the cell dynamics from a genomic perspective and to discover the associated gene regulatory relationship. Identifying differentially expressed genes is an important step in time course microarray data analysis to select the biologically significant portion from the genes available in the dataset. A number of solutions have been proposed in the literature for this purpose.
* Corresponding author.
When replicated time course microarray data is available, various statistical approaches, like ANOVA and its modifications, are employed (Lonnstedt & Speed, 2002; Park et al, 2003; Smyth, 2004). This category of approaches has been extended to recent work on longitudinally sampled data, where the microarray measurements span in multi-dimensional space with the coordinates to be gene index, individual donor, and time point, etc. (Guo et al, 2003; Storey et al, 2005). However, replication of time series or longitudinal sampling is costly if the number of time points is comparatively large. For the sake of this, many published time course datasets have no replicate.
124 When replicated time course is not available, clustering based approaches and model-specific approaches are widely used. Clustering based approaches select genes whose patterns are similar to each other. A famous example of clustering software is the Eisen's Cluster (Eisen et al., 1998). Clustering based approaches are advantageous in finding co-expressed genes. The drawback is that clustering does not provide a ranking for the individual genes, and it is difficult to determine a cut-off threshold based on confidence analysis. Additionally, cluster analysis may fail to detect changing genes that belong to clusters for which most genes do not change (BarJoseph et al, 2003). Model-specific approaches identify differentially expressed genes based on prior knowledge of their temporal patterns. For instance, Spellman et al. (1998) used Fourier transform to identify cell-cycle regulated genes; Peddada et al. (2003) proposed an orderrestricted model to select responsive genes; Xu et al. (2002) developed a regression-based approach to identify the genes induced in Huntington's disease transgenic model; in the recent versions of SAM (Tusher et al, 2001), two alternative methods, slope based and signed area based, are provided for analyzing single time course data. However, the assumption underlying the model-specific approaches is too strong and some biologically informative genes that do not fit the predefined model may be ignored. Bar-Joseph et al. (2002) proposed a spline based approach, which is established on comparatively weaker assumptions. The software of EDGE (Storey et al, 2005) implemented natural cubic spline and polynomial spline for testing the statistical significance of genes. In spline based approaches, the dimension of spline needs to be chosen carefully to balance the robustness and the diversity of gene patterns, and an empirical setting of dimension may not be applicable for some applications. The goal of this paper is to propose a new statistical method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In time-course experiments, the measurements are sampled from continuously varying gene expressions. Thus it is often observed that the logratio expression profiles of the differentially expressed
genes are featured with "smooth" patterns, of which the energies mainly concentrate in low frequency. To utilize this feature, we employ two simple convolution kernels that function as a low-pass filter and a high-pass filter, namely smoothing kernel and differential kernel, respectively. The basic statistic for testing the smoothness of a temporal pattern is represented by the energy ratio of the convoluted profiles. We further improve the performance of the statistic for microarray anlaysis by introducing a concept called partial energy to solve the problem caused by "steep edge", which refers to rapid increasing or decreasing of gene expression level. The proposed ratio statistic is incorporated into the permutation based SAM (Tusher et al., 2001) framework for determining confidence interval and false discovery rate (Benjamini and Hochberg, 1995). In the SAM framework, a small positive constant called "relative difference" is added to the denominator of the ratio, which efficiently stabilizes the variance of the proposed statistic. An artificial dataset and two published cDNA microarray datasets are employed to evaluate our approach. The published datasets include the yeast environment response dataset (Gasch et al, 2000) and the yeast cell cycle dataset (Spellman et al, 1998). The experiment results showed the robustness and generality of the proposed PEM method. It outperforms previous versions of SAM and spline based EDGE in identifying genes differentially expressed in various manner. In the experiment with yeast cell cycle dataset, the PEM method not only identified the periodically expressed genes, but also identified a set of non-periodically expressed genes, which are verified to be biologically informative.
2. METHOD 2.1 Signal/noise microarray data
model
for
cDNA
Consider a two-channel cDNA time-course microarray experiment over m genes: gj, g2, .... gm, and n time points: th t2, ..., tn. the log-ratio expression profile of the gene g, (;' = 1 to m) can be represented by X, = [Xfai), Xi(t2), ... X,(f„)]T, where X&tj) (/' = 1 to n) represents the log-ratio expression value of g; at they'th time point.
125 We model the log-ratio expression profile X; as the sum of its signal component 5, = [S^fy), Sfa), ... 5,(f„)]T and its noise component et = [Si(t,), e,(f2), ... e,{tn)]T, i.e. X, = Si + £;. We have the following assumption on the noise component: Assumption of noise: £;(f;), eit2), ..., £;(/„) are independent random variables following a symmetric distribution with the mean equal to zero. Note that the noise distribution in our assumption is not necessarily normal so that this gives a better model of the heavily tailed symmetrical noise distribution that is often observed in microarray log-ratio data. For a non-differentially expressed gene g„ we assume its expression signals in two channels are identical at all the time points. In this case, the signal component 5,- is constantly zero, and the log-ratio expression profile Xt only consists of the noise component. Thus the null hypothesis is defined as follow:
H0:
X, =e,
Due to the variation of populations in cDNA microarray experiments, there is bias between the expression signals in two channels. Thus the assumption underlying the null hypothesis may not be established if the log-ratios are calculated directly from the raw data. We suggest using pre-processing approaches such as Lowess regression to compensate the global bias (Yang et al., 2002). To further overcome the influence of the genespecific bias, we adopted the SAM framework, in which a small positive constant called "relative difference" was introduced to stabilize the variance of the statistic (Tusher et al, 2001). Nevertheless, the null hypothesis provides a mathematical foundation for demonstration of our method.
simple convolution kernels for time series data analysis, namely the smoothing kernel and the differential kernel. The smoothing kernel is represented by a slidingwindow Ws = [1, 1], and the differential kernel is represented by Wd = [-1, 1]. In signal processing, the smoothing kernel and the differential kernel function as a low-pass filter and a high-pass filter, respectively, to detect the edges. Given a vector V = [V(ti), V(t2),..., V(tn)]T representing a time-series, the smoothed profile and the differential profile of V are represented by V*WS = [V(f;) + V(t2), V(t2) + V(t3), .... V(tn.,) + V(tn)f, and V*Wd = [V(t,) V(t2), V(t2) - V(t3) V{tn.i) - V(tn)]T, respectively, where * is the convolution operator. Since the energy of the signal component St is likely to concentrate in low frequency, we have: Assumption of signal: If 5, is a non-zero signal vector, then
E(\Si*Ws\2)>E(\Si*Wd\2) where E(\ St *WS |2) and E(\ St *Wd |2) represent the expected energies of the corresponding smoothed profile and differential profile. Next, we derive two propositions from the Assumption of noise and the Assumption of signal, as follows: Proposition 1: If the noise component e, satisfies the Assumption of noise, then E(\ei*Ws\2)
In time-course experiments, the measurements are sampled from continuously varying gene expressions. If there is adequate number of sampled time points, the temporal pattern of the signal St will be comparatively smooth so that the energy of S, will concentrate in low frequency. To utilize this feature, we introduce two
(1)
Proposition 2: If the signal component St satisfies Assumption of signal, and the noise component e, satisfies the Assumption of noise, then E(\(Si+£i)*Ws
2.2 Smoothing convolution and differential convolution
= E(\ei*Wd\2)
\2)>E(\(Si
+£i)*Wd | 2 )
(2)
Propositions 1 and 2 can be proven based on the symmetry of noise distribution and the linear decomposability of convolution operation. Note that the log-ratio expression profile Xt = S, + £,. According to Eq. (1) and Eq. (2), we define a statistic called energy ratio (ER) for testing the null hypothesis, as follow:
126 4-
2 -
3-
1.5 -
ft
l\
1-
2-
f1
1\ \
-• „ -\ :
Logarithm of ER -n=15
-n=10
i
n
\8
•0.5 -
time points
•n=5
Fig. 1. The numerically estimated distribution of logarithm of ER(£,),
Fig. 2. An example of responsive gene expression profile where a
where n is the number of time points.
"steep edge" occurs between the 3 rd and the 4* time points.
2
ER(Xi)
=
\x,*w,\ \x,*wd\2
(3)
The distributions of logarithm of ER(ed are shown in Fig. 1, where the number of time points varies from 5 to 15 and the distribution of et is multivariate normal. We take logarithm simply for the convenience of visualization. Obviously, the logarithm of £/?(£,) follows a symmetric distribution highly peaked around zero mean. The distribution is two-tailed, but we are only interested in the positive tail when testing the null hypothesis. This is because the negative tail implies the energy concentrates in the high frequency. According to Nyquist sampling criterion, the high frequency component is not adequately sampled thus the expression profile may not be reliable. When n—»°°, ER(sd is asymptotically independent on the distribution of Si, which can be easily proven based on central limit theorem.
2.3 Partial energy In most time-course microarray experiments, the number of time points is limited. Due to insufficient sampling, the smoothness of the signal component St is not guaranteed at all the time points. We call this a "steep edge" problem. A steep edge refers to rapid increasing or decreasing of gene expression level at certain time points. Fig. 2 shows an example of responsive gene
expression profile in which a steep up-slope edge occurs between the 3"1 and the 4th time points. When the number of time points is limited, the steep edge adds a large value to the denominator in Eq. (3), hence reduces the statistical significance of the ER score. To solve the "steep edge" problem, we propose a new concept called partial energy. The basic idea of partial energy is to exclude the steep edges in calculating the energy of a differential profile. Denote Y = [Yi, Y2, ... y„]T be a vector representing a profile, the border partial energy of Y is defined as:
i=l
where k
(j"i:'
ii.is1
IISMI
I)! !
0.74.'
ii *.'.'
i) : .V
Mldi-ildi
!i i'm
i) (rfi.-
0 fhV
l ! \ ; n - i n-..iiii|i-
UOV
'I " J l
(Mi'-l
f i.-*K.=J
ilW'i
!;<j;S
i ):;.•,! i n :\
(J4')1
DM1.
(I
l i 74*1
(I /"!0.7) in most experiments, except for the Menadione exposure experiment in which all the methods do not perform well. To further show the
130 superiority of PEM, we averaged the ROC scores over all experiments for each method, and used paired t-test for comparison of the performance of PEM and the other methods. The p-values of the paired t-test demonstrate the significance of the improvement made by PEM.
Cluster 1 Cluster 2 Cluster 3
periodic clusters
Cluster 4
3.3 Evaluation with Yeast Ceil Cycle Dataset
Cluster 5 Cluster 6
The yeast cell cycle dataset (Spellman et at, 1998) consists of the measurements in three experiments (Alpha factor, CDC15, CDC28) on cell cycle synchronized yeast S. Cerevisiae cells. We employed a reference list containing 104 cell cycle regulated genes determined by traditional biological experiments, as mentioned in the original paper. In addition to SAM and EDGE, we also include the method of Fourier transform (Spellman et at, 1998) in our evaluation. The Fourier transform (FT) method was introduced specifically for identifying periodically expressed genes.
Cluster 7
non-periodic clusters
Cluster 8
=r*
Fig. 5. Clustering result shows periodic and non-periodic patterns of differentially expressed genes identified by PEM in alpha factor experiment.
Table 2. ROC scores for evaluation of the methods in identifying periodically expressed cell cycle regulated genes.
The ROC scores are shown in Table 2. The PEM method outperforms the SAM approaches and the spline based EDGE approaches in all experiments. The FT method performs slightly better than PEM in identifying periodically expressed genes. However, the PEM method also identified a number of non-periodically expressed genes, which account for considerable false positives in calculating ROC scores. To show this, we clustered the top 706 differentially expressed genes identified by the PEM in the alpha factor experiment. These genes are selected based on a false discovery rate equal to 0.1. We applied K-mean clustering using Eisen's Cluster software (Eisen et al, 1998) and came up with eight clusters, as shown in Fig. 5. Five of the clusters are periodic and the remaining three are nonperiodic. Note that the non-periodic portion of the differentially expressed genes is not significant with the Fourier transform approach. The non-periodic clusters are mapped to the gene ontology clusters using GO Term Finder in SGD database (http://db.yeastgenome. org/cgi-bin/GO/goTermFinder/). We selected four significant gene ontology terms corresponding to the non-periodic clusters, as listed in Table 3. The Bonferroni correlated hypergeometric P-values show
"linpi
.V.jili.l
< i;:m
S-.,h
ll.l.eil
a:«-3
S,Milr
hllhlK-
^M
ba.M-d
Iwn-J
SAM
r.iM.i-
1.1 Kil-
n
I'l-M
alp/u
(] S'-'J
ci r» .-•*
n 'ASA
!) : •/
0.917
0 «S.?
. .K i *
(141) J
U jSu
0 iji-,4
0 vm
0.811
II Si»s
• i T>:S
7
!
0.859
0 "ft')
..;. »s
l.-lh*
:i
l
!! IM
Table 3. Selected significant gene ontology terms mapped to nonperiodic clusters. The GO terms and cluster IDs are retrieved from SGD database.
( I:MCM'I
• ;•«!!•.••;;. • R, where the weights of the j - t h frequency is given in fj = (wij,W2j, • • •, w\o\j)- F ° r e& ch frequency /_,-, Wj = c(r,fj) integrates the weights from fj into 5 by evaluating the resonance strength recorded in r. Again, c is abstract, and can be materialized using the inner product c(r, fj) = r • fj = J2i wn • r(6, Oi). Finally, we compute 6 = norm(6) and record it as 6(fe+1) = 6. Test Convergence Compare o( fc+1 ) against b^k\ If the result converges, go to the next step; else apply r on O again (i.e., forcing resonance), and then adjust o. M a t r i x R e a r r a n g e m e n t Sort the objects Oj € O by the coordinates of r in descending order; and sort the frequencies fa € T by the coordinates of 6 in descending order. For clearly stating the whole process above, we further express it in the following formulas, r = norm(r(W6 (fc) )) ;(fe+i)
norm (c(W
[ T-(k+l)
(1) ))
(2)
To illustrate how the matrix is sorted, let's take a look at a real-life example from a yeast gene expression data 19 . The symmetric gene correlation matrix is computed by Pearson correlation measure. After the resonance model, we obtained the converged r*
norm(x) = x/||x||2, where ||x||2 = ($27=1 z ? ) 1 / 2 & 2-norm of vector x = ( x i , . . .
,xn)
137
Responso. Weighted Function
(a) basic resonance model
(b) GMA-1: extended resonance model 1
»h Adjustment Function
(c) GMA-2: extended resonance model 2
Pig. 2. The resonance models of approximating the matrix for different purposes: (a) collecting the high values into the left-top corner; (b) simultaneously collecting high/low values into the left-top corners of k classes or submatrices W~ or W*; (c) collecting the extremely high similarity/correlation values into the left-top corner to form a dense cluster.
and 6* with the decreasing order, and also sorted Oj e O and fj € T accordingly. Certainly, the rows and columns of the matrix S are also rearranged with the same orders of Oi and fj. The sorted S in this example is shown in Fig. 1(c). We also draw its corresponding 1-rank approximation matrix r*6* T in Fig. 1(d). This example in Fig. 1(c) and (d) illustrates two observations: (1) the function of the resonance model is to collect the large values in the left-top corner of the rearranged matrix and leave the small values to the right-bottom corner; (2) the underlying rationale is to employ the 1-rank matrix r*6* T to approximate 5. Actually, it is essential that the value distribution of r*6* T determines how the values of the sorted S are distributed.
3. TWO GENERALIZED MATRIX APPROXIMATIONS BY EXTENDING RESONANCE MODEL FOR GENE SELECTION In this section, we extend and generalize the basic mechanism of the resonance model in Section 2 for the purpose of the gene selection in two aspects. The first is to rank genes and samples for selecting those differentially expressed genes Q={gi,. • -,gk}- The second is to discover those very dense clusters in the correlation matrix computed from Q, and remove the redundant genes in Q by only selecting one or two representative genes from each dense cluster. In the two steps, we particularly designed two extended resonance models. From the perspective of the matrix computation, they are two generalized matrix approximation methods based on the basic resonance
model.
3.1. GMA-1 for Ranking Differentially Expressed Genes Consider the general case of the gene expression data, suppose the data set consists of m genes and n samples with k classes, whose number of samples are n i , . . . , nit respectively and n\ + ... + nk—n. Without losing the generality, we suppose the first fc_ classes are negative, the following k+ classes are positive, and k- + k+ = k. Therefore, a general genesample matrix WmXn = [ W~ , Wf ) is shown with submatrix blocks in Fig.3(a). Because the target of analyzing differentially expressed genes is to find up-regulated or down-regulated genes between negative and positive sample classes, the basic resonance model should be changed, from collecting high values to the left-top corner of W, to: (1) A series of low values collections in each W^~ into the left-top corner, and simultaneously a series of high values collections in each W* into the left-top corner. (2) Controlling the differences of left-top corners between the negative classes W[~ and W*. An example figure of such matrix approximation is illustrated in Fig.4. Therefore, to meet these two goals, we extended the basic resonance model, called GMA-1, according to this task as follows. (1) Transformation of W: before doing the GMA-1, we need to transform the original gene-sample matrix W to W. The structure of W is made of
Negative Classes
Positive Classes
up regulation I
w=
Wj
Wu
I +
w1
l-wr
WT
1-W+
w= VJ2-
">
[i-Wf. w+ 1
j
Y n = n, + ... +n t
(a) original matrix W = [ Wi
, W4+ ]
(b) transformed matrix W ' = [ W'~
F i g . 3 . Transformation of the matrix W: the transformed matrix W in (a), but with different submatrix W s ' - and W^+ as listed in (b).
the submatrix blocks W~ and Wf of negative classes and positive classes as shown in Fig.3(a). In the case of finding up-regulated and differentially expressed genes, since we need to collect the low values of W~ into the left-top corner, we need to reverse the values of W~ so that low values become high and vice versa. In other words, we do the transformation by W'~ = 1 — W~. In this way, the result of collecting high values of W'~ and W[+ into their own left-top corners naturally lead to the result of collecting the low values of W~ into the left-top corners and the high values of Wf into the left-top corners. This is an essential step to meet the first goal aforementioned. We can also use other reverse functions in stead of the simple 1 — x function used in Fig.3(b). Similarly, we can transform W by W[+ = 1 — Wf in the case of finding downregulated and differentially expressed genes. (2) The k partitions of the forcing object 6: an implicit requirement in the first goal is that the relative order of each class (submatrix W-~ or W/ + ) should be kept the same after doing GMA-1 and sorting W. For example, after running our algorithm, it is required that all columns of the submatrix W!f must appear after all columns of W[~, although we can change the order of columns or samples within W{~ or W^~. To satisfy this requirement, we partition the original forcing object's frequency vector 6 into k parts corresponding to k classes or submatrices. f
down regulation
, W,' +
has the same structure of submatrix blocks as shown
Specifically, 6 = ( 6 i ; . . . ; 6 k ) f , where each 5i corresponds to a sample class. In the process of GMA-1, we separately normalize each 6i and then sum their resonance strength vectors together with the factor a to control the differentiation between the negative and positive classes. (3) The factor a for controlling the differentiation between the negative and positive classes: the frequency vector of 6 is divided into k = fc_ + k+ parts, each of which is normalized independently. Therefore, we can control the differentiation between the negative and positive classes, by magnifying the resonance strengths rf = norm(W i ' + 6;) of k+ positive classes, or minifying the frequency subvectors r~ = norm(W i ' _ 6i) of fc_ negative classes. In formal,
'(
+
• + rk_
+ ar+ + ... + av+
)
fe_ negative classes fc+ positive classes (3)
where a ~£ 1 and a as a scaling factor is multiplied with the normalized positive classes' resonance strength vectors. With the increasing of a, the proportions of positive classes in the resonance strength vector r will increase and thus result in the increasingly large differences in the top-left corners between positive and negative classes. In this way, the user can tune a. to get a suitable differential contrast of two types of classes.
T h e concatenation of k = k- + k+ vectors is expressed in MATLAB format.
139 To summarize the above changes of the resonance model, we draw the architecture of the GMA-1 in Fig.2(b) and express its process in the following formulas: r-(fc+1)=norm(W;-67«), i= l fc~ r+ ( f c + 1 ) = n o r m ( W ; + 6 + W ) , i = 1 , . . . , k+ =nor»(E^1ri-(fc+1) +aE£iri+(fc+1)) r ( fc+1 ) (fc+1) 6=norm((W i '-) T r( f e + 1 )), i = 1 , . . . , k~ 5+(fc+i) = n o r m ( ( ^+ ) T r ( f c + i)), j
=
3.2. GMA-2 for Reducing Redundancy by Finding Dense Clusters
i;...; fc+
(4)
A l g o r i t h m 3.1 (GMA-1): Biomarker Discovery. Input:
(1) Wmxn, expression matrix from m genes set G and n samples set S; (2) ( m , . . . ,nfc) T , sizes of t h e fc sample classes with the submatrix structure as in Fig.3(a). (3) (fc_, k+)T, numbers of negative and positive classes. (4) regulation option, down or up; (5) a, differentiation factor. O u t p u t : (1) (gi,...,gm), ranking sequence of m genes; (2) ( s i , . . . , s„), ranking sequence of n samples. 1: preprocess W so that the values of W in [0,1] as following the steps in Subsection 2.1. 2: transform W t o W according t o formulas in Fig. 3(b) with the knowledge of the matrix structure given by ( n i , . . . ,rik)T, and (fc_, fc+)T and regulation option. 3: iteratively run equations in Eqn.(4) t o obtain the converged r* and 5* ( i = l , 2 , . . . , fc). 4: sort r* in decreasing order t o get the ranking gene sequence (gi,...,gm), and sort each of o £ , . . . , o £ in decreasing order t o get t h e sorted sample sequence {comment: Because the positions of all sample classes in W keep not changing as shown in Fig.3(a), each sorting ofo* can only change the order of samples within the i-th sample class W^.}.
where ri,vf ,r~ £ R m x l and 6~ e Mn< x l , of £ Rntxl. Comparing Eqn.(l) and (2) with Eqn.(4), besides using the linear functions r = c = I , we partitioned the matrix W to k submatrix blocks and divided the frequency vector 6 into k subvectors. Therefore, two equations in the basic resonance model are expanded to the (2k + 1) equations in GMA-1. We also formally summarize it as Algorithm 3.1 GMA-1 for the biomarker discovery. A real-life example of the overall process in Algorithm GMA-1 is visually shown in Fig.4. g
In practice, GMA-1 can quickly converge. Considering that GMA-1 is a generalized resonance model by partitioning the matrix into k submatrices, its computational complexity is the same as the resonance model on the whole matrix, i.e., 0(mn).
It has been recognized that the top-ranked genes may not be the minimum subset of genes for biomarker and classification 9 | 4 ' 23 , because there are correlations among the top-ranked genes, which induces the problem of reducing "redundancy" from the topranked gene subsets. One of the effective strategies is to take into account the gene-to-gene correlation and remove redundant genes through pairwise correlation analysis among genes 9 ' 4 ' 21 . In this section, we proposed to use the GMA-2, an special instance of the basic resonance model to reduce the redundancy of the top-ranked genes selected by GMA1. The GMA-2 is a clustering method to find the high-density clusters. Then we can simply select one or more representative genes from each cluster and therefore reduce the redundancy. The underlying rationale is "members of a very homogeneous and dense cluster are highly correlated and with more redundancy; while a heterogeneous and loose cluster means bigger variety in genes". Although similar work has been done by Jaeger et al. 9 , the authors used the fuzzy clustering algorithm which is not a suitable algorithm to control the density of the clusters. Comparing with the fuzzy clustering algorithm, the GMA-2 can not only find clusters with different densities, but also provide the membership degree for a cluster for each gene. Given a pairwise correlation or similarity matrix of a set of genes g , the GMA-2 outputs the largest cluster with the fixed density. To find more clusters with the fixed density, the GMA-2 can be iteratively run on the remaining matrix by removing rows and columns of the genes in clusters already found. Unlike the GMA-1 which is a generalization of the basic resonance model, the GMA-2 is actually a special instance of the basic resonance model. Observing Fig. 1(c) and (d), the linear basic resonance model is
I n our context, this set of genes are the top-ranked m' genes selected by the GMA-1.
140 able to collect the high values of a symmetric matrix to the left-top corner of the sorted matrix. This means that it can approximate a high-density cluster. Therefore, we customized the basic resonance model to find the dense cluster by setting the response and adjustment functions to be I or E. When r = c = I, we called this linear resonance model as RML; and when r = c = E, this non-linear resonance model is called RME. The overall architecture of RML and RME is illustrated in Fig.2(c). With these settings and S = ST, two equations in the basic resonance model (i.e., Eqn.(l) and (2)) can be combined together by removing 6, and therefore RML and RME can be represented by Eqn.(5) and Eqn.(6) respectively as follows,
r(fc+1> = n o r m ( S r « ) fe+1
r(
> =norm(E(Sr))
(5) (6)
A theoretical analysis is given in the following to show how RML works. Given a nonnegative gene correlation matrix S = (sij)nxn £ R n x " , a nonnegative membership vector x = (xi,... ,xn)T € {0, l } n x l is supposed to indicate the membership degree of each gene belonging to the dense and largest cluster, when the values of x are 0 or 1, D(x) in Eqn.(7) means the density of a cluster formed by those genes whose corresponding Xi is 1.
n
n
D(x) = Y J y j SijXiXj = x T 5 x
(7)
i=lj=\
However, there are extensive studies on the problem of finding the densest subgraph h which is known as the NP-hard problem 6 . A typical strategy in approximation algorithms is to relax the integer constraints (i.e., x take the binary values 0 or 1) in x to the continuous real numbers, e.g., x e [0, l ] " x l and normalize it as ||x||2 = y/J27=i x1 = 1- In this way, the membership degree x changes from the binary number to the continuous number. According to the matrix computation theory 8 , we have the following theorem,
T h e o r e m 3.1 (Rayleigh-Ritz). Let S e R n x " be a real symmetric matrix and Xmax(S) be the largest eigenvalue of S, then we have, xT5x ^max(S) = max ,, ,, = max x r 5 x x£E" | | x | | 2
(8)
||x|| 2 = l
and the eigenvector x* corresponding to A m a x (5) is the solution on which the maximum is attained. Theorem 3.1 indicates that the first eigenvector x* of S is the solution of -D(x) and therefore reveals a dense cluster. According to the linear algebra, the iterative running of Eqn.(5) in RML will lead to the convergence of r to the first eigenvector of 5 , i.e., r* = x*. Therefore, the RML can reveal the dense cluster. In practice, we found that the non-linear resonance model RME works better than the linear RML by using the exponential function to magnify the roles of high values in the dense cluster. Hence, based on RME, the GMA-2 is formally stated in Algorithm 3.2, A l g o r i t h m 3.2 (GMA-2): Find a HK(H) because IG{H) is subgraph of MIG(G). Therefore, \EG\ > MaxHK(G). To show the converse, it is sufficient to show that \EG\ < HK{H) for some HI solution H for G. This is not immediate because it is not necessarily true that MIG(G) = IG{H) for some HI solution H for G. But, if we can find an HI solution H for G where all the edges of EG are in IG(H) (where they will be non-overlapping), then \EG\ < HK(H). The edges in EG induce a graph, and consider one of the connected components, C, of that graph. Because the edges in EG are non-overlapping and C is a connected component, the edges in C form a simple connected path along the nodes in C ordered from left to right in the embedded MIG(G). Let s\, s%,..., s* denote the ordered nodes in C. To construct the desired H, we first phase sites sy, S2 to make pair si, S2 incompatible (that is possible since edge (si,S2) is
149 in MIG(G)). Now we move to site S3. We want to make pair S2,S3 incompatible but we have already chosen how s-2 will be phased with respect to si. The critical observation is that this prior decision does not constrain the ability to make pair s^, S3 incompatible, although one has to pay attention to how S2 was phased. In choosing how to phase S3 relative to S2, the only rows in G where a phasing choice has any effect on whether pair 3^,83 will be incompatible, are the rows where both those sites have value 2 in the genotype matrix G. For one such row k of G, suppose we need to phase the 2's in S2, S3 to produce the pair 0,1 or the pair 1,0 or both, in order to make pair si, S2m incompatible. (The case where we need 0,0 and/or 1,1 is similar and omitted.) If column S2 (for row k) has been phased as [ ] we phase S3 (for
that MinCC{G) can be computed in polynomial time by Algorithm MinCC, using an idea similar to one used for MaxHK(G). The problem of efficiently computing MaxCC{G) is currently open. Algorithm
MinCC
1. Given genotype matrix G, construct graph MIG(G) and remove all trivial components. 2. For each remaining component C, let G{C) be the matrix G restricted to the sites in C. For each such C, determine if there is a PPH solution for G{C), and remove component C if there is a PPH solution for G(C). 3. Let Kc be the number of remaining connected components. We claim that Kc = MinCC (G).
row k) as [ ]. Otherwise, we phase S3 as [ ]. In either case, we will produce the needed binary pairs in sites S2, S3 for row k. Similarly, we can follow the same approach to phase sites s 4 , . . . , sk, making each consecutive pair of sites incompatible. In this way, we can construct a haplotyping solution H for G where all the edges of EG (and possibly more) appear in IG(H), and hence \Ea\ < HK(H) < MaxHK{G). But since \EG\ > MaxHK(G), \EG\ = MaxHK(G), completing the proof of the correctness of Algorithm MaxHK.
3.2. The case of connected-component lower bound A "non-trivial" connected component, C, of a graph is a connected component that contains at least one edge. A trivial connected component has only one node, and no edges. For a graph / , we use cc(I) to denote the number of non-trivial connected components in graph I. It has previously been established 13, 1 that for a haplotype matrix H, cc{IG{H)) < Rmin(H), and that this lower bound can be, but is not always, superior to the HK bound when applied to specific haplotype matrices. Therefore, for the same reasons we want to compute MinHK{G) and MaxHK(G), we define MinCC(G) and MaxCC(G) respectively as the minimum and maximum values of cc(IG(H)) over every HI solution H for G. In this section we show
Time analysis: Constructing MIG{G) takes 0(nm2) time. Finding all components takes O(m) time. Checking all components for PPH solutions takes 0(nm) time. Thus, the entire algorithm takes 0{nm2) time. Correctness. We first argue that cc(IG(H)) > Kc for every HI solution H for G. Let H be an arbitrary HI solution for G, and consider one of the Kc remaining connected components, C, found by the algorithm. Since G(C) does not have a PPH solution, there must be at least one incompatible pair of sites in H, and so at least one edge in C must also be in IG(H). Further, since IG{H) is a subgraph of MIG{G), every connected component of IG{H) must be completely contained in a connected component of MIG(G). Therefore, there must be at least one non-trivial connected component of IG(H) contained in C, and so cc(IG(H)) > Kc. To finish the proof of correctness, it suffices to find an HI solution H' for G where cc{IG(H')) = Kc. Note that we can phase the sites in each connected component of MIG(G) separately, assured that no pair of sites in different components will be made incompatible. This is due to the maximality of connected components, and the definition of MIG(G). To begin the construction of H', for a non-trivial component C of MIG(G) where G{C) has a PPH solution, we phase the sites in C to create a PPH solution. As a result, none of those sites will be in-
Next we phase the sites of one of the Kc remaining components, C, so that in H', the nodes of C form a connected component of IG(H'). To do this, first find an arbitrary rooted, directed spanning tree T of C. Then phase the site at the root and one of its children in T so that those two sites are made incompatible. Any other site can be phased as soon as its unique parent site has been phased. As in the proof of correctness for Algorithm MaxHK, and because each node has a unique parent, each site can be phased to be made incompatible with its parent site, no matter how that parent site was phased. The result is that all the sites of C will be in a single connected component of IG(H'), so Kc ≥ cc(IG(H')). But cc(IG(H)) ≥ Kc for every HI solution H for G, so MinCC(G) = Kc, and the correctness of Algorithm MinCC is proved.

Final comments on the polynomial-time methods

Above, we developed polynomial-time methods to compute MaxHK(G) and MinCC(G), given genotypes G. These are two specific cases of our interest in efficiently computing MinL(G) and MaxL(G) for different lower-bounding methods L that work on haplotypes. Clearly, for the best application of such numerical values, we would like to compute MinL(G) and MaxL(G) for the lower-bound methods L that obtain the highest lower bounds on Rmin(H) when given haplotypes H. The HK and CC lower bounds are not the best, but are of interest because they allow provably polynomial-time methods to compute MinHK(G), MaxHK(G) and MinCC(G). Those results contribute to the theoretical study of lower-bound methods, and may help to obtain polynomial-time, or practical, methods for better lower bounds. In the next section we discuss a practical method (on moderate-size data) to compute better lower bounds given genotypes.

3.3. Parsimony-based lower bound

One of the most effective methods to compute lower bounds on Rmin(H), for a haplotype matrix H, was developed in Myers et al.30, further studied in Bafna et al.2, and optimized in Song et al.34. All of the methods in those papers produce lower bounds
on Rmin(H) that are much superior to HK(H) and CC(H), particularly when n > m. Therefore, given G, we would like to compute the minimum and/or maximum of these better bounds over all HI solutions for G. Unfortunately, we do not have a polynomial-time method for that problem, and we presently solve it only for very small data. However, we have developed a lower-bounding method that works on genotype matrices of moderate size, using an idea related to the cited methods, and we have observed that when n > m, the lower bound obtained is often much superior to MinHK(G) and MinCC(G).

All the lower-bound methods in the papers cited above work by first finding (local) lower bounds for (selected) intervals or subsets of sites in H, and then combining those local bounds to form a composite lower bound on Rmin(H). The composition method was developed in Myers et al.30 and is the same for all of the methods. What differs between the methods is the way local bounds are computed. We do not have space to fully detail the methods, but all the local bounds are computed with some variation of the following idea30: Let Hap(H) be the number of distinct rows of H, minus the number of distinct columns, minus 1. Then Hap(H) ≤ Rmin(H). Hap(H) is called the Haplotype lower bound. When applied to the entire matrix H, Hap(H) is often a very poor lower bound, but when used to compute many local lower bounds in small intervals, and these local bounds are combined with the composition method, the overall lower bound on Rmin(H) is generally quite good.
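The Haplotype bound is straightforward to compute; here is a short sketch following the definition just given (H is assumed to be a binary haplotype matrix given as a list of equal-length rows):

```python
def hap_bound(H):
    """Haplotype lower bound: the number of distinct rows of H, minus
    the number of distinct columns of H, minus 1.  Hap(H) <= Rmin(H)."""
    distinct_rows = len({tuple(row) for row in H})
    distinct_cols = len(set(zip(*H)))   # columns as tuples
    return distinct_rows - distinct_cols - 1

# Example: four distinct rows over two distinct columns force at least
# one recombination under the infinite sites model.
H = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(hap_bound(H))   # 4 - 2 - 1 = 1
```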
Similar to the methods that work on haplotype data, given a genotype matrix G, we compute relaxed Haplotype lower bounds for many small intervals, and then use the composition method to create an overall number Ghap(G), which is a lower bound on the minimum of Rmin(H) over every HI solution H for G. Of course, to be of value, it must be that Ghap(G) is larger than MinHK(G) and MinCC(G) for a large range of data. We now explain how we compute the local bounds in G that combine to create Ghap(G).

When restricted to the sites in an interval, we have a submatrix G' of G. An HI solution H' for a genotype matrix G' is called a "pure parsimony" solution if it minimizes the number of distinct haplotypes used, over all HI solutions for G'. If the number of distinct haplotypes in a pure-parsimony HI solution for G' is p(G'), and G' has m' sites, it is easy to show that p(G') − m' − 1 ≤ Rmin(H') for any HI solution H' for G'. We call this bound Par(G'). To compute Ghap(G), we compute the local bound Par(G') for each submatrix of G defined by an interval of sites of G, and then combine those local bounds using the composition method from Myers et al.30. It is easy to show that Ghap(G) ≤ Rmin(H) for every HI solution H for G. The problem of computing a pure-parsimony haplotyping solution is known to be NP-hard17, 22, so computing Par(G') is also NP-hard. But a pure-parsimony HI solution can be found relatively efficiently in practice on datasets of moderate size by using integer linear programming10, and other papers have shown how to solve the problem on larger datasets4, 5. Therefore, each local Par(G') bound can be computed in practice when the size of G' is moderate, and so Ghap(G) can be computed in practice for a wide range of data.

Our experiments show that Ghap(G) is often smaller than MinHK(G) or MinCC(G) when n < m and when the recombination rate is low. However, as n increases, Ghap(G) becomes higher than MinHK(G) or MinCC(G). Our simulations show that for datasets with 20 genotypes and 20 sites, Ghap(G) is larger than MinHK(G) or MinCC(G) for over 80% of the data. As an example, a real biological dataset (from Orzack et al.31) has 80 rows and 9 sites; MinHK(G) = MinCC(G) = 2, while Ghap(G) is 5 (which is equal to Rmin(G), as shown in Section 5.3).
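To illustrate how local bounds such as Par(G') combine into Ghap(G), here is a dynamic-programming sketch in the style of the composition method of Myers et al.30; this is our paraphrase of the composition idea, not code from any of the cited papers, and local_bound is an assumed callable (e.g. Par on the interval submatrix, computed by integer programming):

```python
def composite_bound(m, local_bound):
    """Combine local lower bounds on intervals of sites into a global
    lower bound.  Recombinations falling in disjoint intervals are
    distinct events, so local bounds along any chain of disjoint
    intervals add up; the DP takes the best such chain.

    m           -- number of sites (sites are 0 .. m-1)
    local_bound -- local_bound(i, j): a lower bound on the number of
                   recombinations within the interval of sites i..j
    """
    best = [0] * (m + 1)           # best[j] = composite bound using sites < j
    for j in range(1, m + 1):
        best[j] = best[j - 1]      # option: no interval ends at site j-1
        for i in range(j):
            best[j] = max(best[j], best[i] + local_bound(i, j - 1))
    return best[m]
```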
4. CONSTRUCTING A MINIMUM ARG FOR GENOTYPE DATA USING BRANCH AND BOUND

In this section, we consider the problem of constructing an ancestral recombination graph (ARG) that derives an HI solution H for the genotype matrix G and uses the fewest recombinations. We call such an ARG a minimum ARG for G, and denote the minimum number of recombinations in this ARG by Rmin(G). Formally:

Haplotyping on a minimum ARG: Given genotype data G, find an HI solution H for G such that we can derive H on an ARG with the fewest recombinations.
Here, as usual, we assume the infinite sites model of mutations. It is easy to see that this problem is difficult: there is no known efficient algorithm for constructing a minimum ARG even for haplotype data37, 3, and haplotype data can be considered a special case of genotype data. Here, we show that under certain conditions, we can solve this problem by a branch-and-bound method.

The intuition for our method comes from the hypercube of length-m binary sequences. Note that up to 2^m possible sequences in the hypercube can appear in an ARG that derives an HI solution for G. Conceptually, we can build the ARG as follows. We try every sequence node in the hypercube as the root of the ARG. At each step, we try all possible ways of deriving a new sequence by (1) an (unused) mutation from a derived sequence, or (2) a recombination of two derived sequences. The ARG grows as we derive new sequences. Once the ARG derives an HI solution for G, we have found an ARG that is potentially the solution. We can find the minimum ARG by searching through all possible ways of deriving new sequences and finding the ARG with the smallest number of recombinations.

Directly applying the above idea is not practical as the data size increases, so we develop a practical method using branch and bound. We start building the ARG from a sequence chosen as the root. At each step, we maintain the set of sequences that have been derived. We also maintain the best ARG found so far, i.e. the ARG that derives an HI solution for G and uses the smallest number of recombinations (denoted Rmin). We derive a new sequence by a recombination of two already-derived sequences, or by an unused mutation from a derived sequence. We then check whether the current ARG derives an HI solution. If so, we store this solution if this ARG uses fewer recombinations than Rmin. If not, we compute a lower bound on the minimum number of recombinations needed to derive an HI solution, given the choices made along the search path. If the lower bound is not smaller than Rmin, the current partially built ARG cannot lead to a better solution, and we stop this search path.
Otherwise, we continue to derive more sequences from the currently derived sequences. We illustrate the basic procedure of the branch-and-bound method in Algorithm GenoMinARG; a schematic code sketch follows the algorithm.

Algorithm GenoMinARG

1. Root. We maintain a set of sequences called the derived set (containing the sequences that are part of the ARG built so far). Initialize the derived set with a binary sequence sr as the root of the ARG. Maintain a variable Rmin as the currently known minimum number of recombinations. Initialize Rmin to ∞ (or some pre-computed upper bound).
2. Deriving sequences. Repeat until all search paths are explored or terminated. Then return to Step 1 if there are more root sequences to try; stop the algorithm otherwise.
2.1 Through either a recombination or an (unused) mutation from sequences in the derived set, grow the derived set by deriving a new sequence.
2.2 Check whether the derived set contains an HI solution. If so, stop this search path. Denote the number of recombinations in this ARG by Rminc. If Rminc < Rmin, set Rmin = Rminc.
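The following schematic Python skeleton paraphrases the control flow of the branch-and-bound search described above; it is not the authors' implementation, and the helpers successors, contains_hi_solution and lower_bound are assumed names for the derivation step, the HI-solution test, and the lower-bound computation used for pruning:

```python
import math

def geno_min_arg(G, candidate_roots, successors, contains_hi_solution,
                 lower_bound):
    """Branch-and-bound skeleton for haplotyping on a minimum ARG.

    candidate_roots      -- initial states, one per root sequence (Step 1)
    successors(state)    -- pairs (next_state, added_recombs): next_state
                            derives one new sequence, by an unused mutation
                            (added_recombs = 0) or by a recombination of
                            two derived sequences (added_recombs = 1)
    contains_hi_solution -- does the derived set in `state` contain an
                            HI solution for G?
    lower_bound(state)   -- lower bound on the recombinations still needed
                            to reach an HI solution from `state`
    """
    rmin = math.inf   # or initialize with a pre-computed upper bound

    def search(state, recombs):
        nonlocal rmin
        if contains_hi_solution(state, G):
            rmin = min(rmin, recombs)    # Step 2.2: record a better ARG
            return                       # stop this search path
        if recombs + lower_bound(state) >= rmin:
            return                       # prune: cannot beat the best ARG
        for next_state, added in successors(state):
            search(next_state, recombs + added)   # Step 2.1

    for root in candidate_roots:         # Step 1: try each root sequence
        search(root, 0)
    return rmin
```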
2" 2'
3'...-S'.::-: f'"" 8'
7k
9'
10' 10' l l l
'©
7'
»*
O
6
4'
4'
5"
5'
6'
6fc
*'
O
0
&
6
l l ' 12" 12' 13' 13k 141 i 4 »
^
O——O
i 5 » 15' 16'
O 16 »
Cycle Graph
(1-3)
(4...S)
(9...11)
6 (11.-13)
(14-10
Forest F i g . 1.
T h e cycle graph 5(11, T) and the forest F n .
3.2. The New Definition for a Translocation

In G(Π, T), an indirect black edge determines not an adjacency of genome Π but an interval containing only genes to be deleted. We thus have to redefine what we mean by "the bad translocation acting on two black edges" and "the proper translocation determined by an interchromosomal gray edge". Let e = (a, b) be an indirect edge in G(Π, T). The segment [x, δ(a)] designates the interval bounded on the left by x and on the right by the element of AΠ adjacent to b. The segment [δ(a), x] designates the interval bounded on the left by the element of AΠ adjacent to a and on the right by x. To state Definition 3.1 simply, we define δ(a) = ∅ for a direct black edge e = (a, b); then the segment [x, δ(a)] designates the interval bounded on the left by x and on the right by a, and the segment [δ(a), x] designates the interval bounded on the left by b and on the right by x.

Definition 3.1. Assume the two black edges e = (a, b) and f = (c, d) are on two different chromosomes X = x1, ..., a, δ(a), b, ..., xp and Y = y1, ..., c, δ(c), d, ..., yq, where xi (1 ≤ i ≤ p) and yj (1 ≤ j ≤ q) are vertices of G(Π, T).
(1) The translocation determined by g = (a, c) exchanges the segment [x1, a] of X with the segment [δ(c), yq] of Y.
(2) The translocation determined by g = (b, d) exchanges the segment [x1,