Life Sciences Society
COMPUTATIONAL SYSTEMS BIOINFORMATICS
Life Sciences Society
COMPUTATIONAL SYSTEMS
BIOINFORMATICS CSB2007 CONFERENCE PROCEEDINGS Volume 6
University of California San Diego, USA
13-17 August 2007
EDITORS
Peter Markstein, In Silico Labs, LLC, USA
Ying Xu, University of Georgia, USA
Imperial College Press
Published by
Imperial College Press 57 Shelton Street Covent Garden London WC2H 9HE Distributed by
World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
COMPUTATIONAL SYSTEMS BIOINFORMATICS Proceedings of the Conference CSB 2007 - Vol. 6 Copyright © 2007 by Imperial College Press. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-1-86094-872-5
ISBN-10 1-86094-872-3
Printed by FuIsland Offset Printing (S) Pte Ltd, Singapore
Life Sciences Society
Thank You CSB2007 Gold Sponsor The Life Sciences Society, LSS Directors, together with the CSB2007 Program Committee and Conference Organizing Committee are extremely grateful to the
Hewlett-Packard Company for their Gold Sponsorship of the Sixth Annual Computational Systems Bioinformatics Conference, CSB2007, at the University of California San Diego, La Jolla, California, August 13-17, 2007
PREFACE
The 21st Century has seen the emergence of tremendous vigor and excitement at the interface between computing and biology. Following years of investment by industry and then private foundations, the US Federal Government has greatly increased its support. Increasingly, the experimental findings from all of the biological sciences are becoming data rich, and their practitioners are turning to the use of computational methods to manage and analyze the data. In light of the growing opportunities and excitement at the frontier interface between computing and biology, a few scientists turned conference planners organized the first Computational Systems Bioinformatics (CSB) conference in 2001 at Stanford, CA; CSB continued each August at Stanford over the next five years. During this time, many computer scientists and other engineers, as well as biologists, have attended CSB meetings, which have particularly served to introduce cutting-edge biological inquiry and challenge problems to investigators from quantitative science backgrounds. CSB, more recently, became the public venue for the not-for-profit Life Sciences Society, or LSS, which was founded, in part, to enhance the opportunities at the interface between the quantitative / engineering sciences and the biological sciences. In 2006, LSS was honored to be invited to hold CSB2007 on the campus of the University of California at San Diego, UCSD. Some future meetings at UCSD and, ultimately, at other universities, as well as satellite sessions at bioinformatics meetings around the world, are anticipated over the next several years. The Stanford / Bay Area / Silicon Valley venue, along with presenting highlights in bioinformatics, has been especially valuable for connecting individuals from the computer and electronics industry with investigators in the pure and applied life sciences. The current venue should provide some connections to telecommunications, while sustaining some of the earlier opportunities, but we anticipate an enhanced interaction with basic and applied biotechnology. The University of California at San Diego grew from the Scripps Institution of Oceanography and began as a graduate school with a strong focus on
the natural sciences. The rich research culture around UCSD and many neighboring institutions would generally be termed the venue of the Torrey Pines Mesa, an area very rich in biotechnology research activities. Today, San Diego has the largest cluster of Life Sciences centers, with 26 research institutes (including UCSD and a suite of institutes: Salk, Neurosciences, Scripps Research, Burnham Medical, as well as smaller not-for-profits) located in an area of less than 10 square miles. For more on San Diego's R&D Life Sciences centers and the vibrant biotechnology and pharmacology efforts, do visit the BioCom website: http://www.biocom.org/Portals/0/SanDiegoLifeScienceNumbers-Fall06.pdf. CSB2007 will continue to be a 5-day single-track conference, with three core plenary presentation days sandwiched between a day of practical tutorials, long a very popular feature, and a day of workshops exploring the future. Thus, CSB2007 includes several half-day tutorials, 30 refereed papers plus keynote and invited speakers, and posters, during its five full days. Special events for the evenings are planned. CSB2007, as in each of its previous years, owes a lot to its many hard-working volunteers, who are listed under the Committees. The indefatigable energy of Vicky and Peter Markstein continues to sustain the magnitude and amplitude of the extraordinary science and technology vector that is CSB, and their partnership with Ying Xu also remains essential. The efforts to manage and enhance local arrangements, by Kayo Arima, Patrick Shih, Lydia Grech and Ed Buckingham, should also be acknowledged. A few words, naturally, about SoCal: bring family and guests to enjoy San Diego's world-famous attractions such as SeaWorld, the San Diego Zoo, the Wild Animal Park and LEGOLAND California, as well as the historic cultural gems Balboa Park and Old Town, and of course, the "endless" beach.
John Wooley, General Conference Chair
COMMITTEES
Steering Committee
Phil Bourne - University of California, San Diego
Eric Davidson - California Institute of Technology
Steven Salzberg - The Institute for Genomic Research
John Wooley - University of California San Diego, San Diego Supercomputer Center
Organizing Committee
Kayo Arima - University of California San Diego, Local Arrangements
Pat Blauvelt - Communications
Ed Buckingham - LSS VP Conferences
Kass Goldfein - Finance Consultant
Lydia Grech - University of California San Diego, Local Arrangements
Fenglou Mao - University of Georgia, On-Line Registration and Refereeing Website
Vicky Markstein - Life Sciences Society, Co-Chair, LSS President
Patrick Shih - University of California San Diego, Local Arrangements
Jean Tsukamoto - Graphics Design
Bill Wang - Sun Microsystems Inc, LSS Information Technology
John Wooley - University of California San Diego, San Diego Supercomputer Center, Co-Chair
Program Committee
Tatsuya Akutsu - Kyoto University
Phil Bourne - University of California San Diego
Jake Chen - Indiana University
Amar Das - Stanford University
Chris Ding - Lawrence Berkeley Laboratory
Roderic Guigo - IMIM, Barcelona
Tao Jiang - University of California Riverside
Lydia Kavraki - Rice University
Hoong-Chien Lee - National Central University, Taiwan
Ann Loraine - University of Alabama
Michele Markstein - Harvard University
Peter Markstein - Hewlett-Packard Co., Co-chair
Satoru Miyano - University of Tokyo
Sean Mooney - Indiana University
Jan Mrazek - University of Georgia
Isidore Rigoutsos - IBM TJ Watson Research Center
Andrey Rzhetsky - Columbia University
Hershel M. Safer - Weizmann Institute of Science
David States - University of Michigan
Anna Tramontano - University of Rome
Olga Troyanskaya - Princeton University
Alfonso Valencia - Centro Nacional de Biotecnologia, Spain
Eberhard Voit - Georgia Tech
Limsoon Wong - Institute for Infocomm Research
Ying Xu - University of Georgia, Co-chair
Aidong Zhang - SUNY Buffalo
Michael Zhang - Cold Spring Harbor Laboratory
Xianghong Jasmine Zhou - University of Southern California
Yaoqi Zhou - Indiana University
Assistants to the Program Co-Chairs
Ann Terka - University of Georgia
Joan Yantko - University of Georgia
Poster Committee
Nigam Shah - Stanford University, Chair
Patrick Shih - University of California San Diego
Tutorial Committee
Weizhong Li - University of California San Diego
Al Shpuntoff - Syngenta Biotechnology Institute, Chair
John Wooley - University of California San Diego
Workshop Committee
Iddo Friedberg - University of California San Diego, Co-Chair
Weizhong Li - University of California San Diego, Co-Chair
Patrick Shih - University of California San Diego
REFEREES
Tatsuya Akutsu, Mar Alba, Takis Benos, Phil Bourne, Liming Cai, Ildefonso Cases, Robert Castelo, Dongsheng Che, Jake Chen, Liang Chen, David Chew, Young-Rae Cho, I-Chun Chou, Xiangqin Cui, PhuongAn Dam, Amar Das, David de Juan, Chris Ding, Iakes Ezkurdia, Matteo Floris, Sylvain Foissac, David Gilley, Gautam Goel, Roderic Guigo, Scott Harrison, Nurit Haspel, Jianjun Hu, Woochang Hwang, Seiya Imoto, Tao Jiang, Yuki Kato, Lydia Kavraki, Melissa Kemp, HC Lee, Hoong-Chien Lee, Haiquan Li, Jing Li, Xiaoman Shawn Li, Guohui Lin, Chun-Chi Liu, Guimei Liu, Huiqing Liu, Yunlong Liu, Ann Loraine, Michia Ma, Fenglou Mao, Peter Markstein, David Martin, Satoru Miyano, Mark Moll, Jan Mrazek, Masao Nagasaki, Luay Nakhleh, Christoforos Nikolaou, Juan Nunez-Iglesias, Victor Olman, Miguel Padilla, Grier Page, Florencio Pazos, Daniel Platt, Zhen Qi, Predrag Radivojac, Isidore Rigoutsos, Andrey Rzhetsky, Hershel M. Safer, Sudipto Saha, David States, Wing-Kin Sung, Takeyuki Tamura, Anna Tramontano, Olga Troyanskaya, Aristotelis Tsirigos, Alfonso Valencia, Siren Veflingstad, Eberhard Voit, John Wagner, Mingyi Wang, Limsoon Wong, Hongwei Wu, Jialiang Wu, Min Xu, Ying Xu, Weiwei Yin, Kangyu Zhang, Michael Zhang, Shiju Zhang, Fengfeng Zhou, Ruhong Zhou, Wen Zhou, Xianghong Jasmine Zhou, Yaoqi Zhou
CONTENTS

Preface vii
Committees ix
Referees xi

Keynote Address
Quantitative Aspects of Gene Regulation in Bacteria: Amplification, Threshold, and Combinatorial Control (Terry Hwa)
Whole-Genome Analysis of Dorsal Gradient Thresholds in the Drosophila Embryo (Julia Zeitlinger, Rob Zinzen, Dmitri Papatsenko et al.)

Invited Talks
Learning Predictive Models of Gene Regulation (Christina Leslie) 9
The PhyloFacts Phylogenomic Encyclopedias: Structural Phylogenomic Analysis Across the Tree of Life (Kimmen Sjolander) 11
Mapping and Analysis of the Human Interactome Network (Kavitha Venkatesan) 13
Gene-Centered Protein-DNA Interactome Mapping (A.J. Marian Walhout) 15

Proteomics
Algorithm for Peptide Sequencing by Tandem Mass Spectrometry Based on Better Preprocessing and Anti-Symmetric Computational Model (Kang Ning and Hon Wai Leong) 19
Algorithms for Selecting Breakpoint Locations to Optimize Diversity in Protein Engineering by Site-Directed Protein Recombination (Wei Zheng, Xiaoduan Ye, Alan M. Friedman and Chris Bailey-Kellogg) 31
An Algorithmic Approach to Automated High-Throughput Identification of Disulfide Connectivity in Proteins Using Tandem Mass Spectrometry (Timothy Lee, Rahul Singh, Ten-Yang Yen and Bruce Macher) 41

Biomedical Application
Cancer Molecular Pattern Discovery by Subspace Consensus Kernel Classification (Xiaoxu Han) 55
Efficient Algorithms for Genome-Wide tagSNP Selection Across Populations via the Linkage Disequilibrium Criterion (Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang) 67
Transcriptional Profiling of Definitive Endoderm Derived from Human Embryonic Stem Cells (Huiqing Liu, Stephen Dalton and Ying Xu) 79

Pathways, Networks and Systems Biology
Bayesian Integration of Biological Prior Knowledge into the Reconstruction of Gene Regulatory Networks with Bayesian Networks (Dirk Husmeier and Adriano V. Werhli) 85
Using Indirect Protein-Protein Interactions for Protein Complex Prediction (Hon Nian Chua, Kang Ning, Wing-Kin Sung et al.) 97
Finding Linear Motif Pairs from Protein Interaction Networks: A Probabilistic Approach (Henry C.M. Leung, M.H. Siu, S.M. Yiu et al.) 111
A Markov Model Based Analysis of Stochastic Biochemical Systems (Preetam Ghosh, Samik Ghosh, Kalyan Basu and Sajal K. Das) 121
An Information Theoretic Method for Reconstructing Local Regulatory Network Modules from Polymorphic Samples (Manjunatha Jagalur and David Kulp) 133
Using Directed Information to Build Biologically Relevant Influence Networks (Arvind Rao, Alfred O. Hero III, David J. States and James Douglas Engel) 145
Discovering Protein Complexes in Dense Reliable Neighborhoods of Protein Interaction Networks (Xiao-Li Li, Chuan-Sheng Foo and See-Kiong Ng) 157
Mining Molecular Contexts of Cancer via In-Silico Conditioning (Seungchan Kim, Ina Sen and Michael Bittner) 169

Genomics
Prediction of Transcription Start Sites Based on Feature Selection Using AMOSA (Xi Wang, Sanghamitra Bandyopadhyay, Zhenyu Xuan et al.) 183
Clustering of Main Orthologs for Multiple Genomes (Zheng Fu and Tao Jiang) 195
Deconvoluting the BAC-Gene Relationships Using a Physical Map (Yonghui Wu, Lan Liu, Timothy J. Close and Stefano Lonardi) 203
A Grammar Based Methodology for Structural Motif Finding in ncRNA Database Search (Daniel Quest, William Tapprich and Hesham Ali) 215
IEM: An Algorithm for Iterative Enhancement of Motifs Using Comparative Genomics Data (Erliang Zeng, Kalai Mathee and Giri Narasimhan) 227
MANGO: A New Approach to Multiple Sequence Alignment (Zefeng Zhang, Hao Lin and Ming Li) 237
Learning Position Weight Matrices from Sequence and Expression Data (Xin Chen, Lingqiong Guo, Zhaocheng Fan and Tao Jiang) 249

Structural Bioinformatics
Effective Labeling of Molecular Surface Points for Cavity Detection and Location of Putative Binding Sites (Mary Ellen Bock, Claudio Garutti and Concettina Guerra) 263
Extraction, Quantification and Visualization of Protein Pockets (Xiaoyu Zhang and Chandrajit Bajaj) 275
Uncovering the Structural Basis of Protein Interactions with Efficient Clustering of 3-D Interaction Interfaces (Zeyar Aung, Soon-Heng Tan, See-Kiong Ng and Kian-Lee Tan) 287
Enhanced Partial Order Curve Comparison Over Multiple Protein Folding Trajectories (Hong Sun, Hakan Ferhatosmanoglu, Motonori Ota and Yusu Wang) 299
fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information (Huzefa Rangwala and George Karypis) 311
Consensus Contact Prediction by Linear Programming (Xin Gao, Dongbo Bu, Shuai Cheng Li et al.) 323
Improvement in Protein Sequence-Structure Alignment Using Insertion/Deletion Frequency Arrays (Kyle Ellrott, Jun-Tao Guo, Victor Olman and Ying Xu) 335
Composite Motifs Integrating Multiple Protein Structures Increase Sensitivity for Function Prediction (Brian Y. Chen, Drew H. Bryant, Amanda E. Cruess et al.) 343

Ontology, Database and Text Mining
An Active Visual Search Interface for Medline (Weijian Xuan, Manhong Dai, Barbara Mirel et al.) 359
Rule-Based Human Gene Normalization in Biomedical Text with Confidence Estimation (William W. Lau, Calvin A. Johnson and Kevin G. Becker) 371
CBioC: Beyond a Prototype for Collaborative Annotation of Molecular Interactions from the Literature (Chitta Baral, Graciela Gonzalez, Anthony Gitter et al.) 381

Biocomputing
Supercomputing with Toys: Harnessing the Power of NVIDIA 8800GTX and Playstation 3 for Bioinformatics Problems (Justin Wilson, Manhong Dai, Elvis Jakupovic et al.) 387
Exact and Heuristic Algorithms for Weighted Cluster Editing (Sven Rahmann, Tobias Wittkop, Jan Baumbach et al.) 391
Method for Effective Virtual Screening and Scaffold-Hopping in Chemical Compounds (Nikil Wale, George Karypis and Ian A. Watson) 403

Transcriptomics and Phylogeny
Improving the Design of GeneChip Arrays by Combining Placement and Embedding (Sergio Anibal de Carvalho Jr. and Sven Rahmann) 417
Modeling Species-Genes Data for Efficient Phylogenetic Inference (Wenyuan Li and Ying Liu) 429
Reconciliation with Non-Binary Species Trees (Benjamin Vernot, Maureen Stolzer, Aiton Goldman and Dannie Durand) 441

Author Index 453
Computational Systems Bioinformatics 2007
Keynote Address
QUANTITATIVE ASPECTS OF GENE REGULATION IN BACTERIA: AMPLIFICATION, THRESHOLD, AND COMBINATORIAL CONTROL
Terry Hwa
Center for Theoretical Biological Physics and Department of Physics
University of California San Diego
9500 Gilman Drive
La Jolla, CA 92093-0374
Biological organisms possess an enormous repertoire of genetic responses to ever-changing combinations of cellular and environmental signals. Unlike digital electronic circuits, however, signal processing in cells is carried out by a limited number of asynchronous devices in fluctuating aqueous environments. In this talk, I will discuss the control of genetic responses in bacteria. Theoretical analysis of the known mechanisms of transcriptional control suggests "programmable" mechanisms for implementing a broad class of combinatorial control. Further analysis of post-transcriptional control suggests mechanisms for signal amplification, threshold response, and noise attenuation. I will present experimental characterization of some of these bio-computational "devices", as well as experiments illustrating how promoter sequences may be "trained" by directed evolution. Quantitative characterization and controlled manipulation of these devices may bring about predictive understanding of biological control systems, and reveal interesting, novel strategies of distributed computation.
WHOLE-GENOME ANALYSIS OF DORSAL GRADIENT THRESHOLDS IN THE DROSOPHILA EMBRYO
Julia Zeitlinger(1), Rob Zinzen(2), Dmitri Papatsenko(2), Rick Young(1), and Mike Levine(2)
(1) Whitehead Institute, M.I.T., Cambridge, MA
(2) Dept. MCB, Center for Integrative Genomics, UC Berkeley, Berkeley, CA
Dorsal is a sequence-specific transcription factor related to NF-κB. The protein is distributed in a broad nuclear gradient in the precellular Drosophila embryo. This gradient controls dorsal-ventral patterning by regulating at least 50 target genes in a concentration-dependent manner. Dorsal works with two additional regulatory proteins, Twist and Snail, that are encoded by genes directly regulated by the gradient. To determine how the Dorsal gradient generates diverse thresholds of gene activity, we have used ChIP-chip assays with Dorsal, Twist, and Snail antibodies. This method efficiently identified 20 known enhancers and predicted another 30-50 novel enhancers associated with known or suspected dorsal-ventral patterning genes. At least one-third of the Dorsal target genes appear to contain "shadow" enhancers. These are additional cis-regulatory sequences with activities that overlap the principal enhancer guiding the expression of the associated gene. Shadow enhancers might arise from duplications of regulatory DNAs and could provide an important source for novel patterns of gene expression during evolution. The analysis of ~30 different Dorsal target enhancers suggests that those mediating gene expression in response to high levels of the Dorsal gradient contain a series of disordered low-affinity Dorsal and/or Twist activator binding sites. In contrast, enhancers mediating expression in response to low levels of the gradient (5% or less of the peak levels of the Dorsal protein) contain an ordered arrangement of optimal Dorsal and Twist binding sites. This organization is likely to foster cooperative occupancy of linked operator sites. We
discuss the importance of enhancer structure in mediating a sensitive threshold response to a morphogen gradient. Although there are many examples of gene regulation via elongation of stalled polymerase (Pol) II, it is not known to what extent this mechanism is used to establish differential patterns of gene expression during Drosophila embryogenesis. To investigate this issue, we performed ChIP-chip assays using antibodies directed against Pol II. A specific mutant embryo was used, Toll10B, that contains high, uniform levels of the Dorsal, Twist, and Snail proteins. As a result, all of the cells form mesoderm derivatives. Ectodermal derivatives, such as the CNS and extraembryonic membranes, are completely absent. Previous whole-genome tiling arrays identified every gene that is active and inactive in Toll10B mutant embryos. Neurogenic genes that are activated by intermediate levels of the Dorsal gradient are repressed due to overexpression of the Snail repressor. Although silent, most of these genes contain a peak of Pol II binding at the 5' end of the transcription unit. In contrast, genes that are uniformly expressed in these embryos display distinct Pol II binding profiles (across the length of the transcription unit). It was possible to classify 75% of all protein-coding genes in the Drosophila genome into 3 categories based on Pol II binding profiles: uniform binding, no binding, or restricted binding near the start site. The ~3,600 genes exhibiting a uniform Pol II binding profile encode housekeeping functions that are constitutively expressed throughout embryogenesis. The ~5,000 genes lacking
Pol II binding tend to be silent in the embryo, but expressed during larval and adult development. Finally, the ~1,600 genes that exhibit 5' binding (i.e. stalling) tend to exhibit localized patterns of gene expression during embryogenesis and function as developmental control genes, such as Hox genes and components of the FGF, Wnt, Hedgehog, TGFβ, and Notch signaling pathways.
These observations suggest that the regulation of Pol II elongation is a major mechanism of differential gene activity in the Drosophila embryo. We discuss the use of Pol II stalling as a mechanism of transcriptional repression, and as a means of preparing developmental control genes for rapid and dynamic induction during embryogenesis.
Invited Talks
LEARNING PREDICTIVE MODELS OF GENE REGULATION
Christina Leslie
Memorial Sloan-Kettering Cancer Center
New York City, NY
Studying the behavior of gene regulatory networks by learning from high-throughput genomic data has become one of the central problems in computational systems biology. Most work in this area focuses on learning structure from data -- e.g. finding clusters or modules of potentially co-regulated genes, or building a graph of putative regulatory "edges" between genes -- and generating qualitative hypotheses about regulatory networks. Instead of adopting the structure learning viewpoint, our focus is to build predictive models of gene regulation that allow us both to make accurate quantitative predictions on new or held-out experiments (test data) and to capture mechanistic information about transcriptional regulation. Our algorithm, called MEDUSA, integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the
differential expression of target genes. MEDUSA does not rely on clustering or correlation of expression profiles to infer regulatory relationships. Instead, the algorithm learns to predict up/down expression of target genes by identifying condition-specific regulators and discovering regulatory motifs that may mediate their regulation of targets. We use boosting, a technique from machine learning, to help avoid overfitting as the algorithm searches through the high dimensional space of potential regulators and sequence motifs. We will describe results of a recent gene expression study of hypoxia in yeast, in collaboration with the lab of Li Zhang. We used MEDUSA to propose the first global model of the oxygen and heme regulatory network, including new putative context-specific regulators. We then performed biochemical experiments to confirm that regulators identified by MEDUSA indeed play a causal role in oxygen regulation.
THE PHYLOFACTS PHYLOGENOMIC ENCYCLOPEDIAS: STRUCTURAL PHYLOGENOMIC ANALYSIS ACROSS THE TREE OF LIFE
Kimmen Sjolander
Berkeley Phylogenomics Group
University of California, Berkeley
http://phylogenomics.berkeley.edu
Protein families evolve a multiplicity of functions and structures through gene duplication, domain shuffling, speciation and other processes. Phylogenomic analysis, combining phylogenetic tree construction, integration of experimental data, and differentiation of orthologs and paralogs, has been shown to address the systematic errors associated with standard protocols of protein function prediction. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution, and improves function prediction accuracy. The Berkeley Phylogenomics Group has developed the PhyloFacts Phylogenomic Encyclopedia for protein
families across the Tree of Life. At present (April 17, 2007), PhyloFacts contains over 27,000 “books” for protein families and domains and over 988,000 hidden Markov models (HMMs) enabling classification of proteins to functional families and subfamilies. Other functionality provided by PhyloFacts includes prediction of protein structure, active site residues, and cellular localization. In this talk, I will present new methods developed by my group for key tasks in a phylogenomic pipeline, including multiple sequence alignment, phylogenetic tree construction, subfamily identification and critical residue prediction.
MAPPING AND ANALYSIS OF THE HUMAN INTERACTOME NETWORK Kavitha Venkatesan
Harvard University
Email: kavitha-venkatesan@dfci.harvard.edu
1. INTRODUCTION
We have mapped a first version of the human interactome network using a high-throughput Y2H (HT-Y2H) technology. Our data set, CCSB-HI1, is high in specificity and adds ~2,700 new protein-protein interactions to existing interactome maps. CCSB-HI1 interactions are enriched for correlations with mRNA coexpression, presence of shared conserved cis-regulatory DNA motifs, shared phenotypes and shared function. A systematic quantitative examination of various existing human interactome maps shows that, contrary to existing notions, high-throughput Y2H maps are in fact higher in specificity than the composite information obtained from curating a literature containing a large number of papers describing one or a few interactions at a time.
Furthermore, combined experimental and computational modeling of repeat trials of a HT-Y2H screen predicts the size of the Y2H-detectable human interactome and demonstrates the feasibility of mapping a nearly complete set of human interactions through multiple screens in a reasonable time frame. Novel candidate disease genes and associated hypotheses emerge for more than 300 interactions involving disease proteins from this data set. This existing interactome map can be used to begin to investigate how cellular networks are perturbed in disease. For example, from analysis of a draft interactome map of Epstein-Barr virus proteins with human proteins that we generated, we find that EBV proteins tend to target highly connected or hub proteins in the human interactome, and moreover, proteins that are central in the network, having relatively short paths to other proteins in the network.
GENE-CENTERED PROTEIN-DNA INTERACTOME MAPPING AJ Marian Walhout Program in Gene Function and Expression and Program in Molecular Medicine UMass Medical School Worcester, MA
Transcription regulatory networks play a pivotal role in the development, function and pathology of metazoan organisms. Such networks are comprised of protein-DNA interactions between transcription factors (TFs) and their target genes [1]. We are interested in the architecture and functionality of such networks. We developed high-throughput gene-centered methods [2] for the identification of protein-DNA interactions between large sets of regulatory gene segments and various TF resources, including novel Steiner Triple System-based TF smart pools and a TF array [3]. So far, we have mapped two gene-centered networks using C. elegans gene promoters [4, 5]. These networks have already provided insights into differential gene expression at a systems level. For instance, we found that most C. elegans genes are controlled by a layered hierarchy of TFs that sometimes function in a modular manner. Our data can be accessed in our database, EDGEdb [6].
1. Walhout, A. J. M. Unraveling Transcription Regulatory Networks by Protein-DNA and Protein-Protein Interaction Mapping. Genome Res 16, 1445-1454 (2006).
2. Deplancke, B., Dupuy, D., Vidal, M. & Walhout, A. J. M. A Gateway-compatible yeast one-hybrid system. Genome Res 14, 2093-2101 (2004).
3. Vermeirssen, V. et al. A C. elegans transcription factor array and Steiner Triple System-based smart pools: high-performance tools for transcription regulatory network mapping. Nat Methods, in press (2007).
4. Deplancke, B. et al. A gene-centered C. elegans protein-DNA interaction network. Cell 125, 1193-1205 (2006).
5. Vermeirssen, V. et al. Transcription factor modularity in a gene-centered C. elegans core neuronal protein-DNA interaction network. Genome Res, May 18 [Epub ahead of print] (2007).
6. Barrasa, M. I., Vaglio, P., Cavasino, F., Jacotot, L. & Walhout, A. J. M. EDGEdb: a transcription factor-DNA interaction database for the analysis of C. elegans differential gene expression. BMC Genomics 8, 21 (2007).
Proteomics
ALGORITHM FOR PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY BASED ON BETTER PREPROCESSING AND ANTI-SYMMETRIC COMPUTATIONAL MODEL
Kang Ning and Hon Wai Leong
Department of Computer Science, National University of Singapore
Block S15, 3 Science Drive 2, Singapore 117543
{ningkang, leonghw}@comp.nus.edu.sg

Peptide sequencing by tandem mass spectrometry is a very important, interesting, yet challenging problem in proteomics. The problem has been extensively investigated recently, and peptide sequencing results are becoming more and more accurate. However, many of these algorithms use computational models based on unverified assumptions. We believe that investigating the validity of these assumptions and related problems will lead to improvements in current algorithms. In this paper, we first investigate peptide sequencing without preprocessing the spectrum, and show that with preprocessing, peptide sequencing can be faster, easier and more accurate. We then investigate one very important issue, the anti-symmetric problem in peptide sequencing, and show experimentally that models that simply ignore the anti-symmetric problem, or that remove all anti-symmetric instances, are too simple for the peptide sequencing problem. We propose a new, more realistic model for the anti-symmetric problem, as well as a novel algorithm that incorporates the preprocessing and the new anti-symmetric model; experiments show that this algorithm has better performance on the datasets examined.
1. INTRODUCTION

Peptide sequencing by mass spectrometry (referred to as "peptide sequencing" in the following) is the process of interpreting a peptide sequence from a mass spectrum. Peptide sequencing is an important problem in proteomics. Although high-throughput mass spectrometers have generated huge amounts of spectra, peptide sequencing from these spectrum data is still slow and inaccurate. Algorithms for peptide sequencing can be categorized into database search algorithms [1-3] and de novo algorithms [4-6]. The database search algorithms are suitable for known sequences already existing in a database; however, they do not perform well on novel sequences not available in a database. For such peptide sequences, de novo algorithms are the methods of choice. De novo algorithms interpret peptide sequences from spectrum data purely by analyzing the intensity and correlation of the peaks in the spectrum. Though extensive current research in de novo peptide sequencing is helping to improve accuracies, there are still many obstacles for both the de novo and database search approaches, which make further improvement of peptide sequencing accuracy difficult. Among these obstacles, preprocessing to remove the noise from a spectrum before peptide sequencing, as well as the anti-symmetric problem, are two very important issues.
Preprocessing to remove noisy peaks
A peak in a spectrum is noisy if it does not correspond to a peptide fragment, but to a contaminant in the mass spectrometer, the experimental environment, etc. Most spectra contain a significant amount of noise, and noisy peaks may mislead interpretation; therefore, preprocessing to remove noisy peaks from the spectrum is necessary.
The anti-symmetric problem
A peak $p_i$ is anti-symmetric if there can be different fragment ion interpretations for $p_i$; otherwise, $p_i$ is symmetric. There is an anti-symmetric problem in a spectrum $S$ if $S$ has a peak $p_i$ that is anti-symmetric. For the spectrum graph $G$ [4] used to represent a spectrum, a path in $G$ is called anti-symmetric if no two vertices (fragment ion interpretations) on the path represent the same peak; otherwise, we say that the path has the anti-symmetric problem. The anti-symmetric problem is common in peptide sequencing. Currently there are two general approaches to it. One approach is to ignore the anti-symmetric problem [6]; the other is to apply the "strict" anti-symmetric rule, which requires each peak to be represented by at most one vertex (fragment ion interpretation) on a path in the spectrum graph $G$ [4, 7, 8]. The "strict" anti-symmetric rule is used in many peptide sequencing algorithms, but whether applying this rule is realistic is doubtful. In this paper, we describe a computational model that removes noisy peaks from the spectrum. This model also includes a method for introducing "pseudo peaks"
into the spectrum to improve peptide sequencing accuracies. We also propose the restricted anti-symmetric model for the anti-symmetric problem, and a novel peptide sequencing algorithm that incorporates these two computational models.
2. ANALYSIS OF PROBLEMS AND CURRENT ALGORITHMS

In this section, we analyze the presence of noise in the spectrum, as well as the difference between algorithms that use preprocessing and those that do not. We also investigate how significant the anti-symmetric problem is in peptide sequencing by mass spectrometry, and how current algorithms cope with this problem.
2.1. General Terminologies
We first define some general terms. Through a mass spectrometer or tandem mass spectrometer, a peptide $P = (a_1 a_2 \ldots a_n)$, where each of $a_1, \ldots, a_n$ is one of the amino acids, is fragmented into a spectrum $S$ with maximum charge $\alpha$. The parent mass of the peptide $P$ is given by $M = m(P) = \sum_{i=1}^{n} m(a_i)$. Consider a peptide prefix fragment $P_k = (a_1 a_2 \ldots a_k)$, for $k \le n$; the prefix mass is defined as $m(P_k) = \sum_{i=1}^{k} m(a_i)$. Suffix masses are defined similarly. We always express a fragment mass in an experimental spectrum using the PRM (prefix residue mass) representation, which is the mass of the prefix fragment. Mathematically, for a fragment $q$ with mass $m(q)$, we define $PRM(q) = m(q)$ if $q$ is a prefix fragment (such as a b-ion), and $PRM(q) = M - m(q)$ if $q$ is a suffix fragment (such as a y-ion). A spectrum $S$ is composed of many peaks $\{p_1, p_2, \ldots, p_m\}$. Each peak $p_i$ is represented by its intensity $intensity(p_i)$ and mass-to-charge ratio $mz(p_i)$. If a peak $p_i$ is not a noisy peak, then it represents a fragment ion of $P$. Each peak $p_i$ can be characterized by its ion-type, specified by $(t, h, z) \in (\Delta_t \times \Delta_h \times \Delta_z)$, where $\Delta_z$ is the set of charges of the ions, $\Delta_t$ is the set of basic ion-types, and $\Delta_h$ is the set of neutral losses incurred on the ion. In this paper, we restrict our attention to the set of ion-types $\Delta_R = (\Delta_t \times \Delta_h \times \Delta_z)$, where $\Delta_z = \{1, 2, \ldots, \alpha\}$, $\Delta_t = \{\text{a-ion}, \text{b-ion}, \text{y-ion}\}$ and $\Delta_h = \{0, -H_2O, -NH_3\}$. Suppose the $(t, h, z)$-ion of a fragment $q$ (prefix or suffix fragment) produces an observed peak $p_i$ in the experimental spectrum $S$ that has a mass-to-charge ratio of $mz(p_i)$;
then $m(q)$ can be computed using a shifting function, Shift, which maps $mz(p_i)$ and the ion-type $(t, h, z)$ back to the fragment mass.

2.2. Datasets

Table 1 lists the number of spectra and the number of peaks per spectrum for the different charges of the GPM and ISB spectra.

Table 1. The number of spectra, and the number of peaks per spectrum, based on the GPM and ISB datasets of different charges. (Only the totals are recoverable in this copy: 2328 GPM spectra and 995 ISB spectra.)
Each GPM spectrum has between 20-50 peaks (usually high-quality peaks), with an average of about 40 peaks. In contrast, each ISB spectrum has between 50-300 peaks, with an average of 150 peaks. Moreover, for the corresponding peptide sequences, the GPM sequences have an average length of 14.5 amino acids, and the ISB sequences an average length of 15.0.
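To make the PRM representation from Section 2.1 concrete, here is a minimal sketch (in Python rather than the authors' Perl; the residue-mass table is an abbreviated, assumed subset, and the peptide is purely illustrative):

```python
# Monoisotopic residue masses (Da); abbreviated subset for this example.
RESIDUE_MASS = {'A': 71.03711, 'G': 57.02146, 'F': 147.06841,
                'D': 115.02694, 'P': 97.05276, 'R': 156.10111}

def parent_mass(peptide):
    """M = m(P) = sum of the residue masses of the peptide."""
    return sum(RESIDUE_MASS[a] for a in peptide)

def prm(fragment_mass, parent, is_prefix):
    """PRM(q) = m(q) for a prefix fragment (e.g. a b-ion),
       PRM(q) = M - m(q) for a suffix fragment (e.g. a y-ion)."""
    return fragment_mass if is_prefix else parent - fragment_mass

peptide = 'AGFAGDDAPR'                                # illustrative peptide
M = parent_mass(peptide)
prefix = sum(RESIDUE_MASS[a] for a in peptide[:3])    # m(AGF)
suffix = sum(RESIDUE_MASS[a] for a in peptide[3:])    # m(AGDDAPR)
# A prefix fragment and its complementary suffix fragment map to the
# same PRM value, which is what makes PRM a convenient common scale.
assert abs(prm(prefix, M, True) - prm(suffix, M, False)) < 1e-6
```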
2.3. Problems Analysis
Since binning is a general prerequisite for spectrum data preprocessing, in this section we first analyze methods for binning the peaks in the spectrum, then discuss preprocessing that removes noisy peaks from, and introduces "pseudo peaks" into, the spectrum. We then analyze the anti-symmetric problem.

Binning of peaks in spectrum
Binning discretizes the mass-to-charge ratios of the peaks into a series of bins of equal size, so that each bin contains a single peak. The binning idea is already embedded in [12, 13] for mass spectrum alignment: there, the peaks of the spectrum are packed into many bins of the same size, and the spectrum is transformed into a sequence of 0s and 1s. Recently, a database search algorithm, COMET [14], was proposed that uses the bins (usually of size 1 Da) for correlation and statistical analysis (Z-score) for accurate peptide sequencing by database search (spectrum comparison). The important parameters considered in binning include the size of the bins, the number of supporting peaks, and the intensities of the peaks. The lemma below shows that connected peaks remain connected after binning if we adjust the mass tolerance properly.
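As an illustration of this step, a minimal binning sketch (Python for illustration; the 0.25 Da bin width is the value chosen later in Section 3.1, and keeping the most intense peak per bin is an assumption, since the merge rule is not stated here):

```python
def bin_spectrum(peaks, bin_size=0.25):
    """Discretize peak m/z values into bins of equal width so that
    each bin holds a single representative peak.

    peaks: list of (mz, intensity) pairs.
    Returns a dict mapping bin index -> (mz, intensity).
    Assumption: when several peaks fall into one bin, the most
    intense one is kept."""
    bins = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_size)
        if idx not in bins or intensity > bins[idx][1]:
            bins[idx] = (mz, intensity)
    return bins

# Two peaks 0.05 Da apart land in the same (or a neighbouring) bin;
# with a 0.5 Da mass tolerance they are still treated as connected.
print(bin_spectrum([(117.10, 20.0), (117.15, 35.0), (191.12, 8.0)]))
```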
Table 2. The average contents of different types of peaks in GPM and ISB spectra. The symmetric peaks are counted only once in the total content measures. (The rows cover the ion types {b-ion, y-ion} with neutral losses {0, -H2O, -NH3} and charges {1, 2}, plus "Noises" and "Total" rows; the individual cell values are not recoverable in this copy, and the key contents are summarized in the text below.)
Therefore, it is clear that given the proper value of the tolerance, binning preserves accuracy. The binning method makes the removal of noise easier, and also makes sequencing faster and potentially more accurate, especially for noisy spectra.

Preprocessing to remove noisy peaks and introduce pseudo peaks
Noisy peaks exist in every spectrum, but distinguishing them from "true" peaks is not easy. The first step is to analyze the spectrum data and find the patterns of the noisy peaks. To this end, we analyzed the most abundant ion types: {b-ion, 0, 1}, {b-ion, 0, 2}, {b-ion, -H2O, 1}, {b-ion, -NH3, 1}, {y-ion, 0, 1}, {y-ion, 0, 2}, {y-ion, -H2O, 1}, {y-ion, -NH3, 1}, and assume that peaks not of these ion types are noise. The analysis was done on the binned GPM and ISB datasets. The experimental spectrum and the theoretical spectrum for the corresponding sequence were compared, and peaks in the experimental spectrum that matched certain ion types were counted. The "content of peaks" for a specific ion type is defined as the ratio of the number of peaks (in the experimental spectrum) of that ion type over the total number of peaks in the experimental spectrum. The numbers of peaks and the contents of peaks of different ion types were analyzed, with average results in Table 2. From Table 2, we can see that noisy peaks comprise a significant portion of the peaks in the experimental spectrum. For the GPM datasets, 80% of the peaks are noisy peaks, and the most abundant ion types, the b- and y-ion types, compose only 6% and 5% of the peaks. For the
ISB datasets, 83% of the peaks are noisy peaks, and the most abundant ion types, the b- and y-ion types, compose only 4% and 5% of the peaks. ISB spectra have more noisy peaks, and peptide sequencing for these spectra is more difficult. Further analysis of the noisy peaks indicates that there are more noisy peaks in the middle part (according to mass-to-charge ratios) of the spectrum than at the two ends. Also, most of the noisy peaks have some features in common, such as low intensity and little support from other ions (b-, y-, loss of water or ammonia, for example). Some well-known algorithms, such as Lutefisk [6], have no such preprocessing to remove noise. PEAKS [15] and PepNovo [5] are two well-known algorithms that have implemented preprocessing. In PEAKS, the noise level of the spectrum is estimated, and the intensities of all the peaks in the spectrum are reduced by this noise level; all peaks with zero or negative intensities are then removed. In PepNovo, preprocessing is applied to remove or downgrade peaks that have low intensity and do not appear to be b- or y-ions. Recently, the AUDENS algorithm has been proposed [16]; it has a flexible preprocessing module that screens through the peaks in the spectrum and distinguishes between signal and noise peaks. Previous preprocessing for peptide sequencing by mass spectrometry only considered how to remove noisy peaks. However, some fragment ions are not represented by any of the peaks. The appropriate introduction of "pseudo peaks" into the spectrum may help
in the interpretation of these fragment ions, and increase the sequencing accuracies. The idea of "pseudo peaks" was first described in PEAKS [15]. PEAKS assumes that peaks exist at every place in the spectrum, and that those not present in the actual spectrum are peaks with 0 intensity. It has been shown that the appropriate introduction of "pseudo peaks" can partially solve the problem of missing edges in the spectrum graph approach [15]. Our preprocessing computational model performs both the removal of noisy peaks from, and the introduction of pseudo peaks into, the spectrum (a sketch follows below). Note that though the process is similar to previous work, the computational model is different.
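A sketch of the pseudo-peak step (Python for illustration; `bins` is the structure produced by the binning sketch above, the 1/10-of-minimum intensity follows the value given later in Section 3.1, and placing pseudo peaks at bin centres is an assumption):

```python
def add_pseudo_peaks(bins, max_bin, bin_size=0.25):
    """Fill every empty bin with a low-intensity pseudo peak.

    bins: dict mapping bin index -> (mz, intensity).
    Each pseudo peak gets 1/10 of the lowest observed intensity
    (the empirically chosen value from Section 3.1) and is placed
    at the centre of its bin (an assumption for illustration)."""
    floor = min(i for _, i in bins.values()) / 10.0
    for idx in range(min(bins), max_bin + 1):
        if idx not in bins:
            bins[idx] = ((idx + 0.5) * bin_size, floor)
    return bins

demo = {468: (117.15, 35.0), 470: (117.60, 8.0)}
print(add_pseudo_peaks(demo, max_bin=472))   # bins 469, 471, 472 get pseudo peaks
```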
The anti-symmetric problem
We have mentioned that there are two approaches to the anti-symmetric problem: 1) ignore the anti-symmetric problem, and 2) apply the "strict" anti-symmetric rule. In the following, we show that since both approaches are based on unverified assumptions, they do not reflect the nature of real spectra. First, we give part of a real spectrum from the GPM datasets (Fig 1). Note that peak no. 1 has multiple annotations. If we just ignore this peak, then there are two peptide fragments that we cannot interpret (AGFAGDDA and AGFAGDDAPRAVFPS), while the peptide itself has 21 amino acids. Therefore, we see that the simple model which applies the strict anti-symmetric rule may miss some interpretations of peptide fragments.
Fig 1. Example of a real spectrum (left), listing peak number, m/z and intensity, with its corresponding peptide AGFAGDDAPRAVFPSIVGRPR (right). The ion types are represented by (t, h, z) in (Δt × Δh × Δz), as defined above. The bracket after each peptide fragment gives the corresponding peak number. (The peak list itself is not reproduced here.)
To analyze the significance of the anti-symmetric problem in peptide sequencing, we generated the theoretical spectra of known peptide sequences. We analyzed the most abundant ion types {b-ion, 0, 1}, {b-ion, 0, 2}, {b-ion, -H2O, 1}, {b-ion, -NH3, 1}, {y-ion, 0, 1}, {y-ion, 0, 2}, {y-ion, -H2O, 1}, {y-ion, -NH3, 1}, and assumed there is no noise. Two peaks are said to overlap if their mass difference is within a threshold (default 0.25 Da). Note that each such overlapping peak is equivalent to a symmetric peak. Results on selected GPM and ISB spectrum datasets are shown in Table 3. The "average numbers" are the average number of symmetric peaks in the theoretical spectrum of one peptide sequence, and the "average ratios" are computed as the "average numbers" over the average number of peaks in the theoretical spectrum. It is obvious that instances of overlap (within the threshold, 0.25 Da) are quite common. For the overlaps of b- and y-ions in the GPM datasets, there is one overlap instance in about 5 peptide sequences, or in about 67 amino acids. The overall overlap instances are even more common: one instance in about 0.36 sequences, or about 5 amino acids. The ISB datasets have slightly fewer overlaps, but overall there is still more than one instance in 0.35 sequences, or in 4 amino acids. Note that we have not considered peaks with higher charges (≥ 3), but previous research [9] has found a significant number of higher-charge (≥ 3) peaks in high-charge spectra. It is natural that the number of overlapping instances will increase when we have
considered high-charge peaks and more ion types. Therefore, the "strict" anti-symmetric rule is not realistic.
Table 3. The average numbers and ratios of overlapping instances for different kinds of overlaps. Results on all of the GPM and ISB data. (The rows cover the pairwise overlap types b-ion,0,1 <-> y-ion,0,1; b-ion,0,1 <-> y-ion,0,2; b-ion,0,1 <-> y-ion,-H2O,1; b-ion,0,1 <-> y-ion,-NH3,1; y-ion,0,1 <-> b-ion,0,2; y-ion,0,1 <-> b-ion,-H2O,1; y-ion,0,1 <-> b-ion,-NH3,1; b-ion,0,2 <-> y-ion,0,2; b-ion,0,2 <-> y-ion,-H2O,1; b-ion,0,2 <-> y-ion,-NH3,1; y-ion,0,2 <-> b-ion,-H2O,1; y-ion,0,2 <-> b-ion,-NH3,1; b-ion,-H2O,1 <-> y-ion,-H2O,1; b-ion,-H2O,1 <-> y-ion,-NH3,1; b-ion,-H2O,1 <-> b-ion,-NH3,1; b-ion,-NH3,1 <-> y-ion,-NH3,1; plus an "All" total row. The individual cell values are not recoverable in this copy.)
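To make the overlap analysis behind Table 3 concrete, here is a small sketch of how overlapping instances can be counted on a theoretical spectrum (Python for illustration; the authors' implementation was in Perl, the residue-mass table is an abbreviated assumed subset, and only singly charged b/y ions without neutral losses are used for simplicity):

```python
from itertools import combinations

# Abbreviated monoisotopic residue masses (Da); illustrative subset only.
RESIDUE_MASS = {'A': 71.03711, 'G': 57.02146, 'F': 147.06841,
                'D': 115.02694, 'P': 97.05276, 'R': 156.10111}
PROTON, WATER = 1.00728, 18.01056

def theoretical_peaks(peptide):
    """Singly charged b- and y-ion m/z values (no neutral losses)."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    peaks = []
    for k in range(1, len(peptide)):
        peaks.append(('b', k, sum(masses[:k]) + PROTON))
        peaks.append(('y', len(peptide) - k, sum(masses[k:]) + WATER + PROTON))
    return peaks

def count_overlaps(peaks, tol=0.25):
    """Count pairs of peaks of different ion types whose m/z values lie
    within tol Da of each other (the overlap criterion from the text)."""
    return sum(1 for p, q in combinations(peaks, 2)
               if p[0] != q[0] and abs(p[2] - q[2]) < tol)

print(count_overlaps(theoretical_peaks('AGFAGDDAPR')))
```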
Experiments were also performed with random introduction of noise into the theoretical spectra. The results (details not shown) indicate a significant increase in the number of overlap instances. Therefore, ignoring the anti-symmetric problem is also not realistic, especially for noisy spectra. In Lutefisk [6], the anti-symmetric problem is assumed not to exist, and a peak can be annotated as different ion types. In the Sherenga algorithm [4], only one ion type is possible for each peak, but the exact algorithm that solves the anti-symmetric problem is not described. The dynamic programming algorithm for solving the anti-symmetric problem is described in [7, 8], and a suboptimal algorithm that gives suboptimal results for the anti-symmetric problem is shown in [17]. Since our experiments have shown that neither of the two approaches (assumptions) to the anti-symmetric problem is realistic, the simple models based on these assumptions may be obstacles to improving current algorithms. Therefore, we propose a more realistic computational model for the anti-symmetric problem.

3. NEW COMPUTATIONAL MODELS AND ALGORITHM

We propose a new algorithm that is based on two new computational models: 1) preprocessing that removes noisy peaks from, and introduces pseudo peaks into, the spectrum; and 2) a new anti-symmetric model that is more flexible and realistic about the anti-symmetric problem.

3.1. Preprocessing to remove noisy peaks and introduce pseudo peaks
First, the binning process is applied to the peaks in the spectrum. The masses of amino acids differ by at least 1.0 Da (except for (I, L) and (Q, K), which cannot be distinguished by any de novo peptide sequencing algorithm without isotope information). We thus set the mass tolerance $m_t$ to 0.5 Da, and the bin size $m_{bin}$ to 0.25 Da (according to Lemma 1). With binning, later processing can be even more accurate (Lemma 1 shows that there is no loss of accuracy) as well as more efficient, because fewer peaks are considered. After binning, pseudo peaks are introduced into every empty bin, each with 1/10 of the lowest intensity in the original spectrum (an empirically determined value).

After binning the peaks and introducing the pseudo peaks, a support score is computed for every bin (peak). Here, we transform each bin (peak) into vertices (ion interpretations) in the extended spectrum graph $G_e(S_P^*)$, and then score each vertex. Define $N_{support}(v_i)$ as the number of vertices $v_j$ ($v_j \neq v_i$) with $PRM(v_j) = PRM(v_i)$. Define the intensity function as $f_{intensity}(v_i) = \max(0.01, \log_{10}(intensity(v_i)))$, where $\log_{10}(intensity(v_i))$ is normalized so that $f_{intensity}(v_i)$ cannot be negative. Let $L$ be the total number of incoming and outgoing edges for $v_i$, and let $a_j$ be the amino acid for the edge $(v_i, v_j)$ (or $(v_j, v_i)$). Then $\sum_j \big| |PRM(v_j) - PRM(v_i)| - mass(a_j) \big| / L$ is the average mass error for $v_i$. To avoid a divide-by-zero error in calculating the weight function, we define the error function as $f_{error}(v_i) = \max(0.05, \sum_j \big| |PRM(v_j) - PRM(v_i)| - mass(a_j) \big| / L)$, which ensures that $f_{error}(v_i)$ is at least 0.05, a reasonably small error value. The score of vertex $v_i$ in $G_e(S_P^*)$ is then defined as a combination of $N_{support}(v_i)$, $f_{intensity}(v_i)$ and $f_{error}(v_i)$ (Eq. (9)).

For each bin, the support score is computed and ranked. Some of the actual peaks that are highly likely to be noise are deleted, and some of the pseudo peaks that are highly likely to represent ion types are kept. Using this method, we can 1) prune out noise in the spectrum and 2) introduce meaningful peaks into the spectrum, and so create a better spectrum graph to process. Based on the analysis of the scores of peaks in the spectrum (details not shown here), the lowest 20% of bins in the score ranking, or those bins with scores less than 1% of the highest, are filtered out.
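A minimal sketch of this scoring and filtering step (Python for illustration; the authors' implementation was in Perl). Since Eq. (9) is not recoverable from this copy, the particular combination score = N_support x f_intensity / f_error is an assumption consistent with the definitions above; the 20% and 1% thresholds follow the text:

```python
import math

def vertex_score(n_support, intensity_norm, avg_mass_error):
    """Combine the three quantities defined in the text.

    f_intensity = max(0.01, log10(normalized intensity));
    f_error     = max(0.05, average | |PRM(v_j)-PRM(v_i)| - mass(a_j) |).
    The multiplicative combination below is an assumption standing in
    for the unrecoverable Eq. (9)."""
    f_intensity = max(0.01, math.log10(max(intensity_norm, 1e-12)))
    f_error = max(0.05, avg_mass_error)
    return n_support * f_intensity / f_error

def filter_bins(scores, frac_lowest=0.20, rel_cutoff=0.01):
    """Drop the lowest 20% of bins by score, plus any bin scoring below
    1% of the best bin (thresholds as stated in the text)."""
    ranked = sorted(scores, key=scores.get)
    drop = set(ranked[:int(len(ranked) * frac_lowest)])
    best = max(scores.values())
    drop |= {b for b, s in scores.items() if s < rel_cutoff * best}
    return {b: s for b, s in scores.items() if b not in drop}
```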
3.2. The Anti-symmetric Problem
Since a significant fraction of the peaks in a spectrum can be (correctly) annotated as different ion types, the anti-symmetric rule should not be strictly followed; otherwise, there is a loss of information. However, since quite a few noisy peaks remain after preprocessing, peptide sequencing that ignores the anti-symmetric problem may also be misled by noisy peaks, and is thus not preferable. Therefore, it is better to apply a more flexible, less strict anti-symmetric rule to the spectrum. We propose the restricted anti-symmetric model, in which a restricted number ($r$) of peaks can have different ion types. It is easy to observe that the two current approaches to the anti-symmetric problem are special cases of this model: the approach that ignores the anti-symmetric problem is the one with an unbounded $r$, and the approach that applies the "strict" anti-symmetric rule is the one with $r = 0$.

The restricted anti-symmetric model is based on the extended spectrum graph $G_e(S_P^*)$ model using multi-charge strong tags [18]. Multi-charge strong tags are highly reliable tags in the spectrum graph $G_e(S_P^*)$. A multi-charge strong tag of ion-type $(z^*, t, h) \in \Delta_R$ is a maximal path $(v_0, v_1, v_2, \ldots, v_k)$ in $G_1(S_P^*, \{\Delta_R\})$ in which every vertex $v_i$ is a $(z^*, t, h)$-ion, where $t$ and $h$ must be the same for all vertices and $z^*$ can be any number from $\{1, \ldots, \alpha\}$. The principle of the restricted anti-symmetric model is that if a multi-charge strong tag $T_i$ in $G_e(S_P^*)$ has a high score, and the number ($r$) of overlapping instances on this tag (an instance being two vertices of different ion types for the same peak) is within a certain tolerance (half the length of the tag), then $T_i$ is a good tag in $G_e(S_P^*)$, and it is selected for subsequent processing. It is easy to see that the preprocessing and restricted anti-symmetric models can be applied to any de novo peptide sequencing algorithm to improve its accuracy (details in the experiments). Below we describe our novel algorithm based on these two models.
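A sketch of the restricted anti-symmetric rule applied to a single strong tag (Python for illustration; the tag representation, with each vertex as a (peak_id, ion_type) pair, is hypothetical, while the at-most-half-the-tag-length bound follows the text):

```python
def satisfies_restricted_rule(tag):
    """tag: list of (peak_id, ion_type) vertices along a strong tag.

    Counts overlapping instances -- peaks that appear on the tag under
    more than one ion-type interpretation -- and accepts the tag only
    if their number r is at most half the tag length. Note that r = 0
    recovers the strict rule, and an unbounded r recovers the
    ignore-the-problem approach."""
    interpretations = {}
    for peak_id, ion_type in tag:
        interpretations.setdefault(peak_id, set()).add(ion_type)
    r = sum(1 for types in interpretations.values() if len(types) > 1)
    return r <= len(tag) / 2

# Peak 1 is interpreted both as a b-ion and a y-ion: one overlap on a
# tag of length 4, which is within the tolerance.
print(satisfies_restricted_rule([(1, 'b'), (2, 'b'), (3, 'b'), (1, 'y')]))
```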
3.3. Novel Peptide Sequencing Algorithm

Our novel algorithm (GST-SPC*) is based on the previously proposed GST-SPC algorithm [18], which has good performance. The GST-SPC algorithm has two phases. In the first phase, it computes a set of tags: the set of all multi-charge strong tags (corresponding to tags of maximal length in the extended spectrum graph), which leads to an improvement in the sensitivity that can be achieved. In the second phase, it links these tags and computes a peptide sequence that is optimal with respect to the shared peaks count (SPC) over all sequences derived from the tags. GST-SPC performs comparably to or better than other de novo sequencing algorithms (Lutefisk and PepNovo), especially on multi-charge spectra.

In the GST-SPC* algorithm, before peptide sequencing, all of the peaks of the spectrum are binned, with each bin of mass range $m_{bin}$ (0.25 Da). Pseudo peaks are introduced into every empty bin. Bins (transformed to vertices in the extended spectrum graph) that have very low scores or low support rank are filtered out: based on the analysis of the peaks in the spectrum, the lowest 20% of bins, as well as those bins with support scores less than 5% of the highest, are filtered out. In the GST-SPC algorithm, we note that all of the tags can have their SPC computed before deriving the paths in the spectrum. So in the GST-SPC* algorithm, after tags are generated in the extended spectrum graph $G_e(S_P^*)$, we filter out the tags that violate the "restricted anti-symmetric rule"; for the restricted anti-symmetric model on tags, we restrict $r$ to be at most half the length of the tag. We then compute the SPC for the "good" tags. A variant of breadth-first search is then applied to $G_e(S_P^*)$ to find paths from $v_0$ to $v_M$ that have high SPC and are consistent with the restricted anti-symmetric model. Since the number of tags is small, the algorithm is efficient. A flowchart of the whole algorithm is illustrated in Fig 2.

4. EXPERIMENTS

4.1. Experiment Settings

All of the experiments in this paper were performed on a PC with a 3.0 GHz CPU and 1.0 GB of memory, running Linux. Our algorithm is implemented in Perl. We also selected Lutefisk [6], PepNovo [5] and PEAKS [15], three modern and commonly used algorithms with freely available implementations (an online portal for PEAKS), for analysis and comparison. The best results given by the different algorithms are used for comparison. To measure sequencing performance, we adopted the following measurements, based on sensitivity and positive predictive value (PPV):

Sensitivity = #correct / |p|   (10)
PPV = #correct / |P|   (11)
Tag-Sensitivity = #tag-correct / |p|   (12)
Tag-PPV = #tag-correct / |P|   (13)

where #correct is the number of correctly sequenced amino acids and #tag-correct is the sum of the lengths of correctly sequenced tags (of length > 1). #correct is computed as the longest common subsequence (LCS) of the correct peptide sequence $p$ and the sequencing result $P$. Sensitivity indicates the quality of the sequence with respect to the correct peptide sequence, and a high sensitivity means that the algorithm recovers a large portion of the correct peptide. The tag-sensitivity accuracy takes into consideration the continuity of the correctly sequenced amino acids. For a fair comparison with algorithms such as PepNovo that only output the highest-scoring tags, we also use the PPV and tag-PPV measures, which indicate how much of the output is correct.

Upper Bound on Sensitivity: Given a spectrum $S$ and the correct peptide sequence $p$, let $U(S_P^*, \{d\})$ denote the theoretical upper bound on sensitivity that can be attained by any algorithm using the extended spectrum graph $G_e(S_P^*)$, namely using the extended spectrum $S_P^*$ and a connectivity $d$. The bound $U(S_P^*, \{d\})$ is computed as the maximum number of amino acids that can be identified from $G_e(S_P^*)$ with all of the ion types in $\Delta_R$, over the length of $p$. PepNovo and Lutefisk, which consider charges of up to 2, are bounded by $U(S_P^2, \{2\})$, and there is a sizeable gap between $U(S_P^2, \{2\})$ and $U(S_P^*, \{2\})$. This bound was introduced in [18] for the analysis of multi-charge spectra. In this paper, we also compute this bound to evaluate the performance of the different algorithms.
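To make the accuracy measures above concrete, here is a small sketch (Python for illustration; sequences are treated as plain strings, and the I/L and Q/K ambiguities are ignored for simplicity):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    used as #correct in Eqs. (10)-(11)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def sensitivity_and_ppv(true_peptide, predicted):
    correct = lcs_length(true_peptide, predicted)
    return correct / len(true_peptide), correct / len(predicted)

# A prediction missing one residue: sensitivity 0.9, PPV 1.0.
print(sensitivity_and_ppv('AGFAGDDAPR', 'AGFAGDAPR'))
```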
Fig 2. Flowchart of the whole algorithm: binning, introduction of "pseudo peaks", computing scores and removing noisy peaks, generation of multi-charge strong tags, and application of the restricted anti-symmetric model. "Bad" tags are tags that violate the restricted anti-symmetric model.
27
4.2. Results
We first analyzed the performance of the preprocessing method, and compared the results of Lutefisk, PepNovo, PEAKS and GST-SPC. We also compared these results with the theoretical upper bounds on sensitivity, to measure how close the results of these algorithms are to the optimal ones. The GPM and ISB spectra are categorized by charge (given with the spectrum data). The results are shown in Table 4. From the results, we observe that preprocessing to remove noise can effectively increase the sequencing accuracies. Compared with the results from the original GST-SPC without preprocessing, both the PPV and sensitivity accuracies increase by about 8% for the GPM datasets and about 5% for the ISB datasets. This difference is probably because ISB spectra contain more noise than GPM spectra, so even after preprocessing to filter out noise, the ISB spectra still retain more noise. These accuracies are much superior to the results from the Lutefisk algorithm, especially on spectra with high charges (≥ 3). The novel algorithm outperforms the PepNovo algorithm on the GPM dataset; for the ISB dataset, the accuracies are closer. Interestingly, when compared with PEAKS, we discovered that though PEAKS's results on spectra with charge 1 and 2 are comparable with ours, they are better than our results on multi-charge spectra. This is because PEAKS also has a preprocessing step to remove noisy peaks and introduce pseudo peaks, again showing that such preprocessing is necessary. As can be seen later, when we use the new anti-symmetric model, the accuracies of our algorithm improve, and there is almost no difference between them. Compared with the theoretical upper bounds, we can see that there is still much room for improvement.

We then performed an analysis of the restricted anti-symmetric model. All of the results based on the GST-SPC algorithm are preprocessed. The results based on the restricted anti-symmetric model (GST-SPC*) are compared with the results based on the strict anti-symmetric rule (strict anti-symmetric) and results from GST-SPC, which ignores the anti-symmetric issue (no anti-symmetric). The results are shown in Table 5.
Table 5. Results based on the restricted anti-symmetric model, compared with other models. The accuracies in cells are represented in a (PPV/sensitivity [tag-PPV/tag-sensitivity]) format. [The per-dataset rows were garbled in extraction; only the Total row is clearly recoverable:]

Total    GST-SPC (no anti-symmetric):      0.409/0.447 [0.109/0.120]
         GST-SPC (strict anti-symmetric):  0.419/0.464 [0.118/0.112]
         GST-SPC*:                         0.427/0.475 [0.119/0.141]
Table 4. The performance of preprocessing. The accuracies in cells are represented in a (PPV/sensitivity) format. "-" means that the value is not available from the algorithm, and "*" marks average values based on charge 1 and charge 2 spectra. [Table body not preserved in extraction.]
Table 5 shows that the restricted anti-symmetric model yields superior accuracies. Compared with the results from the algorithm that ignores the anti-symmetric problem (no anti-symmetric), applying the restricted anti-symmetric model improves the accuracies by about 5%, probably because the restricted anti-symmetric model removes some "bad" tags. An improvement of about 2% to 5% is observed when compared with the results from the strict anti-symmetric model; this is consistent with the significance of the anti-symmetric problem shown in Table 3. The results also show a great improvement in tag-PPV and tag-sensitivity from using the restricted anti-symmetric rule, especially on the ISB datasets. This may also be attributed to the restricted anti-symmetric model removing the "bad" tags. Comparing the results in Table 5 with those in Table 4, we also observe that with the restricted anti-symmetric rule in GST-SPC*, the peptide sequencing results are more accurate. The results of GST-SPC* are close to the accuracies of PepNovo (charge 1 and 2) and PEAKS, and significantly better than the results of Lutefisk. We also note that these results are still about 20% (charge 1 and charge 2 spectra) to 50% (charge 5 spectra) below the theoretical upper bounds on accuracy given in [9]. We then computed the number of results that match the correct peptide sequences 100% (sensitivity = 1 and PPV = 1). The results show that all of these algorithms output more than 5% fully matching results. For our novel algorithm, which introduces pseudo peaks, the problem that many of the missing fragmentations do not have enough peak support still exists. We believe that a better scoring function can help to improve the ratio of 100% matches.
We also applied the preprocessing and restricted anti-symmetric model to other algorithms. We selected the PepNovo algorithm for this experiment. PepNovo takes as input the spectra preprocessed by our preprocessing model and outputs tags. We then rescored and ranked these tags according to the restricted anti-symmetric model. We refer to this method based on preprocessing and the restricted anti-symmetric model as PepNovo*.
Table 7. The performance of preprocessing and the anti-symmetric model on PepNovo. The accuracies in cells are represented in a (PPV/sensitivity) format. [Spectrum counts were lost in extraction.]

Dataset     PepNovo        PepNovo with preprocessing    PepNovo*
Charge 1    0.322/0.186    0.320/0.190                   0.330/0.201
Charge 2    0.481/0.445    0.480/0.445                   0.488/0.445
Total       0.486/0.455    0.485/0.417                   0.489/0.425
Table 6. Sequencing results of Lutefisk, PepNovo, GST-SPC and our novel algorithm. The accurate subsequences are labeled in italics. "M/Z" means mass-to-charge ratio, "Z" means charge, and "-" means there is no result. [Table body not preserved in extraction.]
In Table 6, we list a few "good" interpretations by the GST-SPC* algorithm on spectra for which Lutefisk does not provide good results. It is interesting to note that more and longer peptide fragments are correctly sequenced by the novel algorithm, demonstrating the power of preprocessing and the restricted anti-symmetric rule. In these interpretations, we observe that the novel algorithm, which incorporates preprocessing and the restricted anti-symmetric model, can predict more and longer fragments of the correct peptides than Lutefisk, PepNovo and the original GST-SPC. For example, for the peptide sequence "PAAPAAPAPAEKTPVKK", the two tags "APAAPAPA" and "KE" are both interpreted correctly only by this novel algorithm.

Efficiency: The GST-SPC* algorithm can process a GPM spectrum (fewer peaks) in about 8 seconds, and an ISB spectrum (many peaks) in about 20 seconds. This is a little faster than the original GST-SPC algorithm, but slower than the Lutefisk algorithm (within 10 seconds for these spectra) and PepNovo (about 10 to 15 seconds for these spectra). This is because preprocessing reduces the number of peaks, but the restricted anti-symmetric rule increases the running time. For the PEAKS algorithm, the average processing time is 0.3 seconds per spectrum on the powerful computation facility of PEAKS Online (http://www.bioinfor.com:8080/peaksonline). Because of preprocessing, the space needed by GST-SPC* is less than for the original GST-SPC algorithm. The novel algorithm used approximately 20 MB of memory to process a GPM spectrum, and about 50 MB to process an ISB spectrum; most of the space is used to store the extended spectrum graph.
5. CONCLUSIONS
In this paper, we have addressed two important issues in peptide sequencing. The first is preprocessing to remove noisy peaks from the spectrum while introducing pseudo peaks into the spectrum at the same time. We have shown by experiments that there is a significant portion of noisy peaks in the spectrum, and that our preprocessing method, which removes noisy peaks and introduces pseudo peaks, can make peptide sequencing more efficient and more accurate. The second issue is the anti-symmetric problem. We have shown that both the strict anti-symmetric rule and ignoring the anti-symmetric problem are unrealistic, and we have proposed a restricted anti-symmetric model. Both techniques help improve the accuracies of de novo algorithms, and the novel GST-SPC* algorithm that incorporates them is shown to perform well on the datasets examined. However, there are still gaps between the accuracies of this algorithm and the theoretical upper bounds. The algorithm can be improved by using a better scoring function (rather than SPC), a better preprocessing method, and a more adaptable anti-symmetric model. We are currently working on these aspects, and preliminary results are encouraging. The peptide sequencing problem is a very interesting problem in bioinformatics, and there are many related problems, such as peptide sequence assembly. We will apply our computational models to some of these problems in the future.
References

1. Eng, J.K., McCormack, A.L. and Yates, J.R., III (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, JASMS, 5, 976-989.
2. Perkins, D.N., Pappin, D.J.C., Creasy, D.M. and Cottrell, J.S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data, Electrophoresis, 20, 3551-3567.
3. Tanner, S., Shu, H., Frank, A., Mumby, M., Pevzner, P. and Bafna, V. (2005) InsPecT: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra, Anal. Chem., 77, 4626-4639.
4. Dancik, V., Addona, T., Clauser, K., Vath, J. and Pevzner, P. (1999) De novo protein sequencing via tandem mass-spectrometry, J. Comp. Biol., 6, 327-341.
5. Frank, A. and Pevzner, P. (2005) PepNovo: De novo peptide sequencing via probabilistic network modeling, Anal. Chem., 77, 964-973.
6. Taylor, J.A. and Johnson, R.S. (1997) Sequence database searches via de novo peptide sequencing by tandem mass spectrometry, Rapid Commun. Mass Spectrom., 11, 1067-1075.
7. Chen, T., Kao, M.-Y., Tepel, M., Rush, J. and Church, G.M. (2001) A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry, Journal of Computational Biology, 8, 325-337.
8. Lu, B. and Chen, T. (2004) Algorithms for de novo peptide sequencing via tandem mass spectrometry, Drug Discovery Today: BioSilico, 2, 85-90.
9. Chong, K.F., Ning, K., Leong, H.W. and Pevzner, P. (2006) Modeling and characterization of multi-charge mass spectra for peptide sequencing, Journal of Bioinformatics and Computational Biology, 4, 1329-1352.
10. Craig, R. and Beavis, R.C. (2004) TANDEM: matching proteins with mass spectra, Bioinformatics, 20, 1466-1467.
11. Keller, A., Purvine, S., Nesvizhskii, A.I., Stolyar, S., Goodlett, D.R. and Kolker, E. (2002) Experimental protein mixture for validating tandem mass spectral analysis, OMICS, 6, 207-212.
12. Pevzner, P.A., Dancik, V. and Tang, C.L. (2000) Mutation-tolerant protein identification by mass-spectrometry, International Conference on Computational Molecular Biology (RECOMB 2000), 231-236.
13. Tsur, D., Tanner, S., Zandi, E., Bafna, V. and Pevzner, P.A. (2005) Identification of post-translational modifications via blind search of mass-spectra, IEEE Computer Society Bioinformatics Conference (CSB) 2005.
14. Keller, A., Eng, J., Zhang, N., Li, X.-j. and Aebersold, R. (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats, Molecular Systems Biology, doi:10.1038/msb4100024.
15. Ma, B., Zhang, K. and Liang, C. (2005) An effective algorithm for the peptide de novo sequencing from MS/MS spectrum, Journal of Computer and System Sciences, 70, 418-430.
16. Grossmann, J., Roos, F.F., Cieliebak, M., Liptak, Z., Mathis, L.K., Müller, M., Gruissem, W. and Baginsky, S. (2005) AUDENS: A tool for automated peptide de novo sequencing, J. Proteome Res., 4, 1768-1774.
17. Lu, B. and Chen, T. (2003) A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry, J. Comput. Biol., 10, 1-12.
18. Ning, K., Chong, K.F. and Leong, H.W. (2007) De novo peptide sequencing for multi-charge mass spectra based on strong tags, Fifth Asia Pacific Bioinformatics Conference (APBC 2007).
ALGORITHMS FOR SELECTING BREAKPOINT LOCATIONS TO OPTIMIZE DIVERSITY IN PROTEIN ENGINEERING BY SITE-DIRECTED PROTEIN RECOMBINATION
Wei Zheng^1, Xiaoduan Ye^1, Alan M. Friedman^2*, and Chris Bailey-Kellogg^1*
^1 Department of Computer Science, Dartmouth College
^2 Department of Biological Sciences, Markey Center for Structural Biology, Purdue Cancer Center, and Bindley Bioscience Center, Purdue University
* Contact authors. CBK: 6211 Sudikoff Laboratory, Hanover, NH 03755, USA; [email protected]. AMF: Lilly Hall, Purdue University, West Lafayette, IN 47907, USA; [email protected].

Protein engineering by site-directed recombination seeks to develop proteins with new or improved function, by accumulating multiple mutations from a set of homologous parent proteins. A library of hybrid proteins is created by recombining the parent proteins at specified breakpoint locations; subsequent screening/selection identifies hybrids with desirable functional characteristics. In order to improve the frequency of generating novel hybrids, this paper develops the first approach to explicitly plan for diversity in site-directed recombination, including metrics for characterizing the diversity of a planned hybrid library and efficient algorithms for optimizing experiments accordingly. The goal is to choose breakpoint locations to sample sequence space as uniformly as possible (which we argue maximizes diversity), under the constraints imposed by the recombination process and the given set of parents. A dynamic programming approach selects optimal breakpoint locations in polynomial time. Application of our method to optimizing breakpoints for an example biosynthetic enzyme, purE, demonstrates the significance of diversity optimization and the effectiveness of our algorithms.
1. INTRODUCTION

Protein engineering aims to create amino acid sequences encoding proteins with desired characteristics, such as improved or novel function. Two contrasting strategies are commonly employed to attempt to improve an existing protein. One approach focuses on redesigning a single sequence towards a new purpose, selecting a small number of mutations to the wild-type [1-5]. Another approach creates libraries of variant proteins to be selected or screened for desired characteristics. The library approach samples a larger portion of the sequence space, accumulating multiple mutations in each library member, increasing both the ability to reveal novel solutions to attaining function and the risk of obtaining non-functional sequences.

Protein engineering by site-directed recombination (Fig. 1) provides one approach for generating libraries of variant proteins. A set of homologous parent genes are recombined at defined breakpoint locations, yielding a combinatorial set of hybrids. In contrast to stochastic library construction, site-directed approaches choose breakpoint locations to optimize expected library quality, e.g., predicted disruption [7, 13, 14]. In both cases, the use of recombination enables the creation of protein variants that simultaneously accumulate a relatively large number of "natural" mutations relative to the parent. The mutations have been previously proven compatible with each other and within a similar structural and functional context, and are thus less disruptive than random mutations. Recombination-based approaches, when combined with high-throughput screening and selection, can avoid the need for precise modeling of the biophysical implications of mutations. They employ an essentially "generate-and-test" paradigm. As always, the goal is to bias the "generate" phase to improve the hit rate of the "test" phase.

A library is completely determined by selecting a set of parents and a set of breakpoint locations. To optimize an experiment so as to improve the expected quality of the resulting library, there are essentially two competing goals: we want the resulting proteins to be both viable and novel. Most previous work on planning site-directed recombination experiments has focused on enhancing viability, by seeking to minimize the amount of structural disruption due to recombination [6, 14-17]. However, breakpoints can also be selected so as to enhance novelty, by maximizing the diversity of the hybrids. For example, consider choosing one internal breakpoint (in addition to the one at the end) for the three parents in Fig. 1, left. If we put the breakpoint between the
last two residues, all hybrids will be the same as the parents (i.e., a zero-mutation library). To improve the chance of getting novel hybrids, we must choose breakpoints that make hybrids different from each other and/or from the parents (Fig. 1, right).
Fig. 1. Diversity optimization in site-directed protein recombination. (Left) Recombination of three parent sequences at a set of three breakpoints (we always include an extra breakpoint at the end of the sequence). A total of 3^3 = 27 hybrids results, including three sequences equivalent to the parents. (Right) Repulsive spring analogy for library diversity. Hybrids (circles) are defined by parents (stars) and breakpoint locations. In order to sample the sequence space well, we want to choose breakpoint locations that push hybrids away from each other. (For clarity, only some relationships are illustrated.) Since the parents will also appear in the hybrid library, the hybrids are pushed away from them as well. Alternatively, an explicit goal may be to push the hybrids away from the parents as much as possible, so as to maximize the possibility of novel characteristics that are not found in the parents. We capture these two goals as the v_HH (hybrid-hybrid) and v_HP (hybrid-parent) metrics below, and demonstrate that they are highly correlated as a function of breakpoint location. Note that at all times, the hybrids are restricted to being a combination of the parents.
Diversity has been experimentally demonstrated to be important to obtaining new characteristics. The number of mutations has been correlated with functional change from the wild-type in several proteins modified by different methodologies. Hybrid cytochromes P450 with the most altered profiles and greatest activity on a new substrate (allyloxybenzene) were found to have higher effective mutation levels (30-50 mutations among the 460 residues) than the enzymes with activities similar to the parents [16]. A random mutant library of TEM-1 β-lactamase with a minimal mutation load (8.2 mutations/gene) was found to have the highest frequency of clones carrying wild-type or minimally different activity, while a mutant library with a maximal mutation load (27.2 mutations/gene) had the highest frequency of clones with improved activity on the normally poor substrate cefotaxime [18]. In a study of single-chain Fv antibodies, the greatest affinity improvement was exhibited by libraries of moderate to high mutation levels (3.8-22.5 mutations/gene) [19]. Mutants with significantly higher affinity than the wild-type were well represented within the active fraction of the library population with high mutation levels.

This paper represents the first approach to explicitly plan for diversity in site-directed recombination. We develop metrics for evaluating diversity, in terms of both the differences among hybrids and the differences between hybrids and parents. We develop polynomial-time dynamic programming algorithms to select optimal breakpoint locations for these diversity metrics. We show that the algorithms are effective and significant in optimizing libraries from the purE family of biosynthetic enzymes.

2. METHODS

We are given a set of n parent sequences P = {P_1, P_2, ..., P_n}, forming a multiple sequence alignment with each sequence of length l, including residues and gaps. Our goal is to select a set of λ breakpoint locations X = {x_1, x_2, ..., x_λ | 1 ≤ x_1 < x_2 < ... < x_λ = l}. For simplicity of notation, we always place the final breakpoint after the final residue position (i.e., x_λ = l). The breakpoints partition each parent P_a into λ fragments with sequences P_a[1, x_1], P_a[x_1+1, x_2], ..., P_a[x_{λ-1}+1, x_λ], where in general we use S[r, r'] to denote the amino acid string from position r to r' in sequence S, and S[r] to denote the single amino acid at position r. A hybrid protein H_i is a concatenation of chosen parental fragments, assembled in the original order; thus it is also of length l. A hybrid library H(P, X) = {H_1, H_2, ..., H_{n^λ}} includes all combinations. Our goal is to choose X (such that |X| = λ and x_λ = l) to optimize the diversity of library H(P, X), for a set P of parents.
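To make the library construct concrete, a small sketch (assuming parents are given as aligned strings and breakpoints are 1-based positions ending at l, as defined above):

```python
from itertools import product

def hybrid_library(parents, breakpoints):
    """Enumerate all n**lambda hybrids of aligned parents for breakpoints
    x_1 < ... < x_lambda = l (1-based; the last breakpoint is the sequence end)."""
    starts = [0] + list(breakpoints[:-1])
    fragments = [[p[s:e] for p in parents]          # n fragment choices per segment
                 for s, e in zip(starts, breakpoints)]
    return [''.join(choice) for choice in product(*fragments)]

parents = ["ACDEF", "AGHIF", "KCDEW"]
print(hybrid_library(parents, [2, 5]))  # 3**2 = 9 hybrids, parents included
```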
2.1. Library Diversity

For two amino acid sequences S and S' of length l, we define the mutation level m(S, S') as the number of corresponding residues that differ:

m(S, S') = \sum_{1 \le r \le l} I\{S[r] \neq S'[r]\},   (1)
where the indicator function I is 1 when the predicate is true and 0 when it is false. To mitigate the effect of neutral mutations, rather than using literal equality we measure functional relatedness using one of the standard sets of amino acid classes: {{C}, {F,Y,W}, {H,R,K}, {N,D,Q,E}, {S,T,P,A,G}, {M,I,L,V}}. In either case, a "gap" in the alignment is taken as a distinct amino acid type. Our approach can be used with any similarly-structured metric for mutation level.

While our goal is to optimize library diversity, we show that the choice of parents and number of breakpoints, independent of breakpoint location, determines the mutation level between all pairs of hybrids (Claim 2.1), between one parent and all hybrids (Claim 2.2), and between all hybrids and all parents (Claim 2.3).

Claim 2.1. \sum_{i=1}^{n^\lambda - 1} \sum_{j=i+1}^{n^\lambda} m(H_i, H_j) = n^{2(\lambda-1)} \sum_{a=1}^{n-1} \sum_{b=a+1}^{n} m(P_a, P_b).

Claim 2.2. \forall P_a \in P: \sum_{i=1}^{n^\lambda} m(H_i, P_a) = n^{\lambda-1} \sum_{b=1}^{n} m(P_a, P_b).

Claim 2.3. \sum_{i=1}^{n^\lambda} \sum_{a=1}^{n} m(H_i, P_a) = n^{\lambda-1} \sum_{a=1}^{n} \sum_{b=1}^{n} m(P_a, P_b).

Proof. Consider residue position r, where 1 ≤ r ≤ l. Over the set of n^λ hybrids, there must be n^{λ-1} instances of P_1[r], n^{λ-1} of P_2[r], ..., and n^{λ-1} of P_n[r]. Thus we have

\sum_{i=1}^{n^\lambda - 1} \sum_{j=i+1}^{n^\lambda} I\{H_i[r] \neq H_j[r]\} = n^{2(\lambda-1)} \sum_{a=1}^{n-1} \sum_{b=a+1}^{n} I\{P_a[r] \neq P_b[r]\}.   (2)

By extending this over all positions and pairs we have (Claim 2.1):

\sum_{i=1}^{n^\lambda - 1} \sum_{j=i+1}^{n^\lambda} m(H_i, H_j) = n^{2(\lambda-1)} \sum_{a=1}^{n-1} \sum_{b=a+1}^{n} m(P_a, P_b),   (3)

and by similarly comparing to a fixed parent we have (Claim 2.2):

\sum_{i=1}^{n^\lambda} m(H_i, P_a) = n^{\lambda-1} \sum_{b=1}^{n} m(P_a, P_b).   (4)

Claim 2.3 follows immediately from Claim 2.2. □

The right-hand sides of the claims involve the parents but not the hybrids. Thus, surprisingly, the total number of mutations differentiating hybrids from each other and from the parents is independent of breakpoint locations and determined solely by the choice of parents. However, the distribution of the diversity within the library does depend on the breakpoints.

2.2. Metrics for Breakpoint Selection

Intuitively (Fig. 1, right), hybrids sample a sequence space defined by the parents and the breakpoint locations. A priori, we don't know what parts of the space are most promising, and thus we seek to generate novel proteins by sampling the space as uniformly as possible, rather than clustering hybrids near each other or near the parents.

More formally, consider one particular hybrid H_i. We want to make the other hybrids roughly equally different from H_i; i.e., for the other H_j, the various m(H_i, H_j) should be roughly equal. If we do this for all H_i, then we will also make the H_j different from each other (and not just from one particular H_i). That is, we want to make m(H_i, H_j) relatively uniform, or minimize its deviation:

\sum_{i=1}^{n^\lambda - 1} \sum_{j=i+1}^{n^\lambda} \left( m(H_i, H_j) - \bar{m} \right)^2,   (5)

where \bar{m} is the mean value of m(H_i, H_j). Expanding the square in Eq. (5) yields an m(H_i, H_j)^2 term, a constant \bar{m}^2 term, and an \bar{m} \times m(H_i, H_j) term whose sum is constant by Claim 2.1. Thus we need only minimize the m(H_i, H_j)^2 term, which we call the "variance." This gives us the first of two diversity optimization targets.
Problem 2.1. (Hybrid-Hybrid Diversity Optimization) Given n parent sequences P of l residues and a positive integer λ, choose a set X of λ breakpoints (with x_λ = l) to minimize the hybrid-hybrid "variance" v_HH(X) of the resulting library, where

v_HH(X) = \sum_{i=1}^{n^\lambda - 1} \sum_{j=i+1}^{n^\lambda} m(H_i, H_j)^2.   (6)

In addition to making hybrids different from each other, we also may want to focus on making them different from the parents. Following a similar intuition and argument as above, we obtain a second diversity optimization target:

Problem 2.2. (Hybrid-Parent Diversity Optimization) Given n parent sequences P of l residues and a positive integer λ, choose a set X of λ breakpoints (with x_λ = l) to minimize the hybrid-parent "variance" v_HP(X) of the resulting library, where

v_HP(X) = \sum_{i=1}^{n^\lambda} \sum_{a=1}^{n} m(H_i, P_a)^2   (7)

for H_i ∈ H(P, X), P_a ∈ P. Intuitively (Fig. 1, right), both H-H and H-P diversity optimization will spread hybrids out in sequence space. In fact, we can show that for any set X of λ breakpoints, v_HH(X) and v_HP(X) are related by a fixed algebraic identity (Eq. (8); the equation did not survive extraction). Due to lack of space, we omit the proof, which is an algebraic manipulation of the terms. This relationship means that the two criteria should be highly correlated, as our results below confirm.
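The two variances can be checked by brute force on a toy library (exponential in λ, which motivates the dynamic program of the next section). Literal residue equality stands in here for the class-based mutation level, and the names are ours.

```python
from itertools import product, combinations

def m(s, t):
    # Eq. (1), with literal equality in place of the class-based test.
    return sum(a != b for a, b in zip(s, t))

def library(parents, bps):
    starts = [0] + list(bps[:-1])
    frags = [[p[s:e] for p in parents] for s, e in zip(starts, bps)]
    return [''.join(c) for c in product(*frags)]

def v_hh(H):
    return sum(m(hi, hj) ** 2 for hi, hj in combinations(H, 2))

def v_hp(H, parents):
    return sum(m(h, p) ** 2 for h in H for p in parents)

parents = ["ACDEF", "AGHIF", "KCDEW"]
n, lam = len(parents), 2
H = library(parents, [2, 5])

# Claim 2.1: the total pairwise mutation level is n^(2(lambda-1)) times
# that of the parents, whatever the breakpoint locations.
total_hh = sum(m(hi, hj) for hi, hj in combinations(H, 2))
total_pp = sum(m(pa, pb) for pa, pb in combinations(parents, 2))
assert total_hh == n ** (2 * (lam - 1)) * total_pp

print(v_hh(H), v_hp(H, parents))
```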
2.3. Dynamic Programming for Breakpoint Selection
In order to select an optimal set of breakpoints, we select breakpoints from left to right (N- to C-terminal) in the sequences. We slightly abuse our previous notation, truncating the parents at the last breakpoint selected (consistent with our previous use of the end of the sequence as the final breakpoint). As Fig. 2 illustrates, a hybrid library with breakpoints X = {x_1, ..., x_{k-1} = r', x_k = r} extends a hybrid library with breakpoints X' = {x_1, ..., x_{k-1} = r'} by concatenating each of its hybrids with each parent fragment P_a[r'+1, r]. Optimal substructure holds, since the best choice for x_k depends only on the best choice for x_{k-1}.

Fig. 2. Library substructure: library H(P, X) ending at position r extends library H'(P, X') ending at position r' by adding each parent fragment P_a[r'+1, r] to each hybrid H'_i in H'(P, X').

H-H Diversity Optimization. We use this insight to devise a dynamic programming recurrence to compute the optimal value of v_HH for the kth breakpoint location, based on the optimal values of v_HH for the possible (k-1)st locations. Define d_HH(r, k) to be the minimum value of v_HH(X) for any X = {x_1, ..., x_k = r}. Then d_HH(l, λ) is the optimal value for H-H diversity optimization.

Claim 2.4. We can compute d_HH(r, k) recursively in time O(λ n^2 l^2) as

d_HH(r, k) = { \sum_{a=1}^{n-1} \sum_{b=a+1}^{n} m(P_a[1, r], P_b[1, r])^2   if k = 1,
             { \min_{r' < r} \{ n^2 d_HH(r', k-1) + e_HH(k, r, r') \}       if k > 1,   (9)

where e_HH is defined in Eq. (10).
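Since Eqs. (10)-(14) did not survive extraction, the transition cost e_HH in the sketch below is re-derived from the proof outline that follows: expanding m(H_i, H_j)^2 = (m(H'_i, H'_j) + mu_ab)^2 and collapsing the linear term with Claim 2.1. It reproduces the O(λ n^2 l^2) structure of Claim 2.4, but it is our reconstruction, not necessarily the paper's exact e_HH.

```python
from itertools import combinations

def m(s, t):
    return sum(a != b for a, b in zip(s, t))

def optimize_hh(parents, lam):
    """Dynamic program in the shape of Claim 2.4; returns the optimal v_HH
    and the breakpoint set {x_1 < ... < x_lam = l}."""
    n, l = len(parents), len(parents[0])
    pairs = list(combinations(range(n), 2))
    # M[r][(a,b)] = m(P_a[1..r], P_b[1..r]) for all prefixes.
    M = [dict.fromkeys(pairs, 0)]
    for r in range(1, l + 1):
        M.append({(a, b): M[r - 1][(a, b)] + (parents[a][r - 1] != parents[b][r - 1])
                  for a, b in pairs})
    INF = float("inf")
    d = [[INF] * (lam + 1) for _ in range(l + 1)]
    back = [[None] * (lam + 1) for _ in range(l + 1)]
    for r in range(1, l + 1):
        d[r][1] = sum(M[r][ab] ** 2 for ab in pairs)        # k = 1 base case
    for k in range(2, lam + 1):
        n_prev = n ** (k - 1)                               # hybrids before step k
        for r in range(k, l + 1):
            for rp in range(k - 1, r):
                mu = {ab: M[r][ab] - M[rp][ab] for ab in pairs}  # fragment levels
                lin = n ** (2 * (k - 2)) * sum(M[rp][ab] for ab in pairs)
                s1, s2 = sum(mu.values()), sum(v * v for v in mu.values())
                # cross terms + between-sub-library pairs + same-prefix pairs
                e = 4 * lin * s1 + n_prev * (n_prev - 1) * s2 + n_prev * s2
                cost = n ** 2 * d[rp][k - 1] + e
                if cost < d[r][k]:
                    d[r][k], back[r][k] = cost, rp
    bps, r = [l], back[l][lam]
    for k in range(lam - 1, 1, -1):
        bps.append(r); r = back[r][k]
    if lam > 1:
        bps.append(r)
    return d[l][lam], sorted(bps)

print(optimize_hh(["ACDEF", "AGHIF", "KCDEW"], 2))
```

On small instances this agrees with the brute-force v_hh above, which is how we validated the reconstructed cost.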
Proof. As discussed above, the hybrid library H(P, X) is extended from H(P, X'), where X' is missing the final breakpoint in X. Let us use H_i for the members of H(P, X) and H'_i for those of H(P, X'), and "+" to denote sequence concatenation. Following the structure in Fig. 2, we can separate v_HH into terms between hybrids in a single "sub-library" H'(P, X') + P_a[r'+1, r] sharing the same added fragment, and terms between separate "sub-libraries" H(P, X') + P_a[r'+1, r] and H(P, X') + P_b[r'+1, r] with distinct added fragments. This gives Eq. (11). Expanding the second term on the right-hand side in Eq. (11) gives Eq. (12). By Claim 2.1 for parents with k-1 breakpoints (and thus truncated at r'), we have Eq. (13). We can substitute twice the right-hand side of Eq. (13) into the third term in Eq. (12) (with "twice" to account for summing over all pairs vs. all distinct pairs), noting that the sums over the parents a and b in Eqs. (12) and (13) are independent. We can then substitute the resulting formula back into Eq. (11). Simplification yields Eq. (14), where most terms are collected into e_HH, except for the sums of m(H'_i, H'_j)^2, including n from the first term in Eq. (11) and twice (n choose 2) from the first term in Eq. (12) (with "twice" again due to all vs. all distinct). Because Eq. (14) only depends on r' and not the previous breakpoints, d_HH(r, k) = min_{r'<r} { n^2 d_HH(r', k-1) + e_HH(k, r, r') }. □ [Eqs. (11)-(14) did not survive extraction.]

3. RESULTS

[The opening of the results section did not survive extraction.] ... purE catalyzes steps in the de novo synthesis of purines. While clear orthologs, purE proteins carry out substantially different enzymatic activities in different organisms: in eubacteria, fungi and plants (as well as probably most archaebacteria), the purE product functions as a mutase in the second step of a two-step reaction, while in metazoans and methanogenic archaebacteria, the purE product functions as a carboxylase in a single-step reaction that yields the same product [20, 21]. A genetic system allows selection in vivo for both the catalytic mechanism and different levels of enzymatic activity. In order to uncover explanations for the striking divergence of function (mutase vs. carboxylase activity) within homologous sequences, we sought to evenly partition the sequence space, bridging the two "islands." To establish a set of purE parents, we performed standard sequence search and alignment techniques, eliminated columns not mapped to the structure of E. coli purE (PDB id: 1qcz), and eliminated sequences with more than 20% gaps. This yielded a diverse set of 367 sequences of 162 residues each, including 28 of the rarer class of metazoans and methanogens with inferred carboxylase activity. The average pairwise sequence identity (under the classes of Sec. 2.1) is 65.8%. We first chose three diverse parent sequences from the purE family: P_1 from the eubacterium Escherichia coli, P_2 from the vertebrate chicken (Gallus gallus), and P_3 from the methanogenic archaebacterium Methanothermobacter thermautotrophicus. The mutation levels among these three parent sequences are m(P_1, P_2) = 94, m(P_1, P_3) = 65 and m(P_2, P_3) = 85. We applied our algorithms to choose sets of 4, 5, 6 and 7 internal breakpoints (Fig. 3).
[Fig. 3 panels: breakpoint locations and fragment mutation levels for H-H (top) and H-P (bottom) optimization; the numeric annotations are not recoverable from the extracted text.]

Fig. 3. Breakpoint locations for three purE proteins, under (top) H-H and (bottom) H-P diversity optimization. The sequence is labeled with residue indices, with α-helices shown with light boxes and β-sheets with dark ones, according to the crystal structure of E. coli purE (PDB id: 1qcz). Numbers above the dashed lines indicate the positions of breakpoints. Numbers within the fragments give the sum of the intra-fragment mutation levels between all pairs of parents.

For 4, 5, and 6 internal breakpoints, both H-H and H-P optimization yield the same breakpoint locations. For 7 internal breakpoints, the locations differ only by a few residues for the last two breakpoints. As the mutation levels show, in seeking to make hybrids distributed uniformly in the sequence space, breakpoint selection optimization equalizes the contributions to diversity from the fragments. To show that it is not likely to generate equivalent diversity by chance, we chose 10000 random sets of four internal breakpoints. The distributions of v_HH and v_HP for these random sets are plotted in Fig. 4.

Fig. 4. Distribution of diversity values for random breakpoint selection compared with dynamic programming optimization. The x-axis indicates different diversity values. The y-axis indicates the frequencies of the diversity value among 10000 random sets of four internal breakpoints. Dark diamonds indicate diversity values for breakpoints selected by our algorithm: 9.63 × 10^7 for H-H, 2.39 × 10^6 for H-P, and 8565 for sum-min (using the H-P breakpoints).
The breakpoints selected by our algorithms are better than any random selection. For comparison, we also calculated the "sum-min" diversity metric \sum_i \min_a m(H_i, P_a) used by Arnold and colleagues [13]. Currently no efficient algorithm has been found to directly maximize sum-min diversity, but our H-H and H-P optimization algorithms also apparently do a good job of optimizing it; no random breakpoint selection was found to do better.

As we proved in Claims 2.1-2.3, the choice of parent sequences determines the total number of mutations. We also expect it to affect library diversity, since the choice of parents defines the available sequence space (we can only recombine the parents). To test the effect of parent diversity on optimization of library diversity, we randomly chose 1000 three-member purE parent sets. For each set, we selected optimized breakpoints with our algorithms, and calculated the three diversity values as above (using the H-P breakpoints for calculation of sum-min diversity). For each parent set, we also calculated the means of the three diversity metrics over 1000 random sets of four internal breakpoints. Fig. 5 plots the additive difference between values under our optimized breakpoint sets vs. mean values for random breakpoint sets. As the total mutation level of the parents increases, so does the improvement of our breakpoints over random. Presumably, more parent diversity provides more opportunity to explicitly optimize library diversity.

As shown by the ratio analysis of v_HH and v_HP in Eq. (8) and confirmed empirically in Fig. 3, hybrid-parent diversity optimization is highly correlated with hybrid-hybrid diversity optimization. It also appears to be highly correlated with the sum-min diversity of Arnold and co-workers. Fig. 6(a,c) shows the relationship among these values, using the same random breakpoint selections as in Fig. 4. Optimization for hybrid-parent diversity also achieves good diversity according to the other two metrics. Fig. 6(b,d) shows that the correlation remains extremely high (R near 1 and -1) over the random parent sets and random breakpoint sets used in Fig. 5. These correlations allow us to do just one polynomial-time diversity optimization, achieving three goals simultaneously.
Fig. 5. Effect of parent selection on diversity optimization. The x-axis indicates the total number of mutations between pairs of purE parents in 1000 randomly chosen three-parent plans. The y-axis indicates, for each parent choice, the improvement in diversity from 1000 random plans to the optimized plan (larger y values indicate more improvement). For H-H and H-P, improvement is measured as the mean random plan value minus the value of our plan; for sum-min, improvement is the value of our plan minus the mean random plan value.
4. CONCLUSION
While diversity in hybrid libraries is the key to finding novel function, library design has instead previously focused on reducing the fraction of non-viable hybrids. Diversity has been a side-effect, rather than an explicit optimization target. In this initial approach to optimizing diversity, we showed that the total number of mutations in a library is fixed by the choice of parents, but that their distribution among hybrids can be optimized so that the hybrids broadly sample sequence space. Our metrics and algorithms enable efficient selection of breakpoint locations to optimize diversity. In practical applications, a suitable combination of diversity and viability will be desired. Since the dynamic programming approach here has a similar structure to algorithms for minimizing disruption [13, 14], it might be possible to optimize for a desired trade-off between these two competing goals. We likewise anticipate integrating knowledge of important residues (e.g., targeting an active site), via appropriate weights. Finally, since the parents define the searchable sequence space and the total possible diversity, the importance of parent selection is re-emphasized.
Fig. 6. Relationship among three diversity metrics. (a,c): Correlation over random four-breakpoint sets with the fixed three-parent set of Fig. 4. The x-axis indicates H-P variance (v_HP); the y-axis indicates H-H variance (v_HH) or sum-min diversity, respectively. (b,d): Histograms of correlation coefficients of diversity metrics for random sets of four internal breakpoints with the same random parent sets as Fig. 5. Note that the histograms are focused on a small region very near 1 and -1, respectively.

ACKNOWLEDGMENTS

This work was supported in part by an NSF CAREER award to CBK (IIS-0444544) and a grant from NSF SEIII (IIS-0502801) to CBK, AMF, and Bruce Craig.

References
1. B. Kuhlman, G. Dantas, G.C. Ireton, G. Varani, B.L. Stoddard, and D. Baker. Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364-8, 2003.
2. L.L. Looger, M.A. Dwyer, J.J. Smith, and H.W. Hellinga. Computational design of receptor and sensor proteins with novel functions. Nature, 423(6936):185-90, 2003.
3. R.H. Lilien, B.W. Stevens, A.C. Anderson, and B.R. Donald. A novel ensemble-based scoring and search algorithm for protein redesign and its application to modify the substrate specificity of the gramicidin synthetase A phenylalanine adenylation enzyme. J. Comput. Biol., 12(6):740-61, 2005.
4. J. Li, Z. Yi, M.C. Laskowski, M. Laskowski Jr., and C. Bailey-Kellogg. Analysis of sequence-reactivity space for protein-protein interactions. Proteins, 58(3):661-71, 2005.
5. I. Georgiev, R.H. Lilien, and B.R. Donald. A novel minimized dead-end elimination criterion and its application to protein redesign in a hybrid scoring and search algorithm for computing partition functions over molecular ensembles. In Proc. RECOMB, pages 530-45, 2006.
6. C.A. Voigt, C. Martinez, Z.G. Wang, S.L. Mayo, and F.H. Arnold. Protein building blocks preserved by recombination. Nat. Struct. Biol., 9(7):553-8, 2002.
7. M.M. Meyer, J.J. Silberg, C.A. Voigt, J.B. Endelman, S.L. Mayo, Z.G. Wang, and F.H. Arnold. Library analysis of SCHEMA-guided protein recombination. Protein Sci., 12:1686-93, 2003.
8. C.R. Otey, M. Landwehr, J.B. Endelman, K. Hiraga, J.D. Bloom, and F.H. Arnold. Structure-guided recombination creates an artificial family of cytochromes P450. PLoS Biol., 4(5):e112, 2006.
9. L. Saftalov, P.A. Smith, A.M. Friedman, and C. Bailey-Kellogg. Site-directed combinatorial construction of chimaeric genes: general method for optimizing assembly of gene fragments. Proteins, 64(3):629-42, 2006.
10. W.P. Stemmer. Rapid evolution of a protein in vitro by DNA shuffling. Nature, 370(6488):389-91, 1994.
11. A.M. Aguinaldo and F.H. Arnold. Staggered extension process (StEP) in vitro recombination. Methods Mol. Biol., 231:105-10, 2003.
12. W.M. Coco. RACHITT: Gene family shuffling by random chimeragenesis on transient templates. Methods Mol. Biol., 231:111-127, 2003.
13. J.B. Endelman, J.J. Silberg, Z.G. Wang, and F.H. Arnold. Site-directed protein recombination as a shortest-path problem. Protein Eng. Des. Sel., 17:589-594, 2004.
14. X. Ye, A.M. Friedman, and C. Bailey-Kellogg. Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination. J. Comput. Biol., in press, 2007. Conference version: Proc. RECOMB, 2006, pp. 15-29.
15. G.L. Moore and C.D. Maranas. Identifying residue-residue clashes in protein hybrids by using a second-order mean-field approach. PNAS, 100(9):5091-6, 2003.
16. C.R. Otey, J.J. Silberg, C.A. Voigt, J.B. Endelman, G. Bandara, and F.H. Arnold. Functional evolution and structural conservation in chimeric cytochromes P450: calibrating a structure-guided approach. Chem. Biol., 11(3):309-18, 2004.
17. M.C. Saraf, A. Gupta, and C.D. Maranas. Design of combinatorial protein libraries of optimal size. Proteins, 60(4):769-77, 2005.
18. M. Zaccolo and E. Gherardi. The effect of high-frequency random mutagenesis on in vitro protein evolution: a study on TEM-1 beta-lactamase. J. Mol. Biol., 285:775-83, 1999.
19. P.S. Daugherty, G. Chen, B.L. Iverson, and G. Georgiou. Quantitative analysis of the effect of the mutation frequency on the affinity maturation of single chain Fv antibodies. PNAS, 97:2029-34, 2000.
20. S.M. Firestine, S.W. Poon, E.J. Mueller, J. Stubbe, and V.J. Davisson. Reactions catalyzed by 5-aminoimidazole ribonucleotide carboxylases from Escherichia coli and Gallus gallus: a case for divergent catalytic mechanisms. Biochemistry, 33:11927-34, 1994.
21. J. Thomas et al., in preparation.
AN ALGORITHMIC APPROACH TO AUTOMATED HIGH-THROUGHPUT IDENTIFICATION OF DISULFIDE CONNECTIVITY IN PROTEINS USING TANDEM MASS SPECTROMETRY

Timothy Lee and Rahul Singh*
Department of Computer Science, San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132-4025, U.S.A.

Ten-Yang Yen and Bruce Macher
Department of Chemistry and Biochemistry, San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132-4025, U.S.A.

* Corresponding author. Email: [email protected]

Knowledge of the pattern of disulfide linkages in a protein leads to a better understanding of its tertiary structure and biological function. At the state-of-the-art, liquid chromatography/electrospray ionization-tandem mass spectrometry (LC/ESI-MS/MS) can produce spectra of the peptides in a protein that are putatively joined by a disulfide bond. In this setting, efficient algorithms are required for matching the theoretical mass spaces of all possible bonded peptide fragments to the experimentally derived spectra to determine the number and location of the disulfide bonds. The algorithmic solution must also account for issues associated with interpreting experimental data from mass spectrometry, such as noise, isotopic variation, neutral loss, and charge state uncertainty. In this paper, we propose an algorithmic approach to high-throughput disulfide bond identification using data from mass spectrometry that addresses all the aforementioned issues in a unified framework. The complexity of the proposed solution is of the order of the input spectra. The efficacy and efficiency of the method were validated using experimental data derived from proteins with diverse disulfide linkage patterns.
1. INTRODUCTION

Cysteine residues have a property unique among the 20 naturally occurring amino acids, in that they can pair to form disulfide bonds. These covalent bonds occur when the sulfhydryl groups of cysteine residues become oxidized (S-H + S-H → S-S + 2H) [1]. Because disulfide bonds impose length and angle constraints on the backbone of a protein, knowledge of the location of these bonds significantly constrains the search space of possible stable tertiary structures into which the protein folds. The disulfide linkage pattern of a protein can also have an important effect on its function. For example, the disulfide bond structures of ST8Sia IV are necessary for its polysialylation activity [2].

Methods for determining disulfide bonds in a protein can be classified as either: (1) purely predictive, based completely on the protein's primary structure, or (2) based on analyzing data from experimental methods, such as crystallography, NMR, and mass spectrometry [3, 4]. Predictive approaches typically aim to infer the disulfide bonding state of cysteine residues in a protein, primarily by characterizing a heuristically defined local sequence environment. Towards this goal, predictive approaches include graph-theoretic methods [5], combinatorial optimization formulations [6], techniques based on efficient indexing of the search space [7], and a variety of supervised learning formulations involving neural networks, hidden Markov models, and support vector machines [8-10]. However, Vullo and Frasconi concluded that any prediction algorithm must have a computational time complexity bounded by O(n(n/2)!), where n is the number of cysteines in the protein [8]. This limits the application of such an algorithm to proteins with only a few disulfide bonds. In addition, the prediction accuracies of these methods, defined as the fraction of the total number of proteins whose connectivity patterns are correctly predicted, are currently limited to about 60%.

By contrast, determination of disulfide bonds can also be achieved with high accuracy for any number of bonds by analyzing data from structure elucidation techniques such as X-ray crystallography and NMR. These techniques require relatively large amounts (10 to 100 mg) of pure protein in a particular solution or crystalline state and are fundamentally low-throughput in nature. In this context, the use of information from mass spectrometric (MS) analysis constitutes an important direction for elucidation of structural features, such as disulfide bonds. For identification of disulfide linkages, the general strategy involves mass spectrometry-based
analysis to make an initial identification of the putative peptides involved in a disulfide bond. These peptides are then fragmented, and a tandem mass spectrum (MS/MS) of the fragments is generated. The MS/MS spectrum is subsequently analyzed to confirm the initial identification of a disulfide bond. Such an approach can offer accurate identification and, in principle, can scale to any number of bonds with much less stringent sample purity requirements than NMR or X-ray crystallography.

Although the aforementioned approach is conceptually straightforward, the actual task of identifying the MS/MS spectra corresponding to disulfide linkages is non-trivial. In this paper we investigate this precise problem. The key contributions of this work lie in addressing the problem of disulfide bond identification in the context of the technical challenges arising from the use of real-world data from tandem mass spectrometric analysis. The combination of experimental procedure and algorithmic analysis proposed is scalable to structures having a large number of disulfide bonds. Furthermore, the processing is inherently high-throughput. Other features of the proposed approach include:

Invariance to the topology of the disulfide bonds: Disulfide bonds may be classified as intra-molecular (within a single peptide chain) or inter-molecular (between different peptide chains). The proposed methodology can identify such bonds within a single framework.

Analysis of experimental errors/noise at the level of the produced spectrum: Our proposed methodology requires the mass spectra and tandem mass spectra to be converted into a finite set of discrete "mass peaks." We present algorithms to resolve such peaks from spectra having peaks of non-zero width. We also address how to obtain the optimal set of peaks from each tandem mass spectrum.

Accounting for neutral loss and isotopic variation: During the collision-induced disassociation step of an LC/ESI-MS/MS analysis, a peptide fragment may have undergone neutral loss, resulting in the loss of a small molecule such as water or ammonia. In addition, the constituent atoms that comprise an amino acid exist in a number of isotopic forms. As a result, peptides consisting of the same sequence of amino acids will be measured as a series of masses by the mass spectrometer. This must be considered
when computing the expected mass of a disulfide bonded peptide fragment.

Interpretation of the charge state: Precursor ions with a high charge state (triply charged or greater) can be misinterpreted by the MS data processing programs commonly supplied as part of the MS instrumentation. For example, ion trap mass spectrometers have a relatively low resolution; in such cases, a quadruply charged ion may not be well resolved and can be misinterpreted as a triply charged ion. This error often cannot be identified unless a higher resolution scan (zoom scan) is employed during the experiment. Consequently, the mass of a disulfide bonded pair of peptides is incorrectly computed, resulting in either not identifying (false negative) or incorrectly identifying (false positive) the bond.
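To see concretely why a charge-state misreading corrupts the computed mass, consider the standard neutral-mass relation (general mass spectrometry practice, not specific to this paper); the peak value below is hypothetical:

```python
PROTON = 1.00728  # Da, mass of a proton

def neutral_mass(mz: float, z: int) -> float:
    """Neutral mass implied by an observed m/z at an assumed charge z:
    M = z * (m/z) - z * m_proton."""
    return z * (mz - PROTON)

mz = 800.40  # hypothetical precursor peak
for z in (3, 4):
    print(f"assumed z = {z}: M = {neutral_mass(mz, z):.2f} Da")
# Reading a 4+ ion as 3+ understates the mass by roughly 800 Da here, enough
# to miss a true bonded pair (false negative) or hit a wrong one (false positive).
```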
1.1. Comparison of the Proposed Approach with Related Works

Examples of techniques employing a purely predictive methodology include DiANNA [11], DISULFIND [12], and PreCys [13]. Each of these implementations employs weighted graph matching to predict the final disulfide connectivity pattern; in fact, these implementations all use a program (by Rothberg) that implements Gabow's algorithmic solution to the maximal weighted graph matching problem [14]. Additionally, a learning strategy is involved where fundamental assumptions are made about the relationship between the cysteine residues in order to obtain the edge weights. Examples of such assumptions include the length of the local sequence environment to be considered, the formulation of the residue contact potential function used, and assumptions involved in defining the training set. However, their reported prediction accuracies indicate that these underlying assumptions remain open to further investigation.

Existing web-based programs such as MS-Bridge in the ProteinProspector tools [15], X! Protein Disulphide Linkage Modeler [16], and Peptidemap [17] are useful when analyzing MS data from MALDI-TOF (Matrix Assisted Laser Desorption Ionization-Time of Flight) experiments. However, these programs do not analyze MS/MS data, thus missing the useful structural information inherent in this data. The program MS2Assign can be used to analyze disulfide linkages from MS/MS data [18]. However, because it was designed
primarily for the analysis of results from cross-linking studies, MS2Assign requires the user to input detailed information on the specific modifications expected. As a result, there is a need for a software tool that utilizes both MS and MS/MS data to identify disulfide linkage patterns in a high-throughput manner.
2. THE PROPOSED METHOD
2.1. Problem Formulation

Let {a_i} denote the set of amino acid residues, each with mass m(a_i). A peptide p = (a_i) is then a string of amino acids with mass m(p) = Σ_i m(a_i) + 18 Da (daltons). Peptides have a specific directionality: the string starts at the unbonded amide group, called the N terminus, and ends at the carboxylic acid group, called the C terminus. The term 18 Da is included in this formula to account for the masses of the H and OH at the N- and C-termini of the peptide, respectively. In a LC/ESI-MS/MS experiment, a protease is used to divide a protein into peptides. A protein A = {p_i} denotes the set of all peptides. A cysteine-containing peptide c is a peptide of protein A that has at least one of its amino acids identified as a cysteine residue. Thus if C = {c_i} is the set of all cysteine-containing peptides, then C ⊆ A. In practice, it is very rare that C = A.

A disulfide bonded peptide D_{1,2} is a pair of cysteine-containing peptides c_1 and c_2, with mass m(D_{1,2}) = m(c_1) + m(c_2) - 2m(H), where 2m(H) accounts for the mass of the two protons that are lost when the disulfide bond is formed.

A disulfide connectivity pattern can be modeled in terms of an undirected graph G = (V, E). The vertex set V represents the set of bonded cysteines and an edge e ∈ E corresponds to a disulfide bridge between its adjacent cysteines. Admissible vertex and edge sets are constrained because an even number of intra-chain bonded cysteines is required and a cysteine can be bridged to one and only one other cysteine. Thus, we have |V| = 2B, |E| = B and degree(v) = 1 for any v ∈ V (a perfect matching), where B denotes the number of disulfide bonds in a chain. The problem of identifying the correct connectivity pattern for a given disulfide bonded chain is then formulated as finding the best possible candidate as given by a suitable scoring function. If we consider only those cysteines that are known to be involved in a disulfide bond, it is evident that this problem is equivalent to the problem of computing the maximum-weight perfect matching. In a perfect matching, every vertex of the graph is incident to exactly one of the edges of the matching. In this formulation, we attribute a weight w_e, greater than or equal to zero, to the edge e of G that was initially identified by the MS spectrum match for each pair of cysteines.

The disulfide bond mass space BMS = {bms_i} of a protein is the set of masses of every possible pair of cysteine-containing peptides. A mass list ML = {ml_i} is the set of numbers that represent the masses of the precursor ions obtained from a LC/ESI-MS/MS experiment. A bond match bm_k occurs between bms_i and ml_j when |bms_i - ml_j| < bmt, where bmt is defined as the bond mass tolerance, the amount of experimental uncertainty that ml_j is allowed to have to determine the match. A bond spectrum match is the set of matches BSM = {bsm_k} between ML and BMS.

In a LC/ESI-MS/MS experiment, each precursor ion undergoes collision-induced disassociation, resulting in fragment ions that constitute a MS/MS spectrum. If the precursor ion is a disulfide bonded pair of peptides, the fragmentation process typically keeps this bond intact. Let FML = {fml_i} denote the set of MS/MS values corresponding to the masses of the peptide fragments. A peptide fragment F is a substring of a peptide with mass m(F) = Σ_{r≤i≤s} m(a_i), where r and s denote the locations of the starting and ending amino acids of the peptide fragment. A disulfide bonded fragment F_{1,2} is a pair of peptide fragments F_1 and F_2, with mass m(F_{1,2}) = m(F_1) + m(F_2) - 2m(H). For there to be a disulfide bond between F_1 and F_2, each fragment must contain at least one cysteine. The disulfide bonded fragment mass space FMS = {fms_i} for two cysteine-containing peptides P_i and P_j is the set of every disulfide bonded fragment mass that can be obtained from these two peptides.

A fragment match fm_k between FML and FMS occurs between fml_i and fms_j when |fms_j - fml_i| < fmt, where fmt is defined as the MS/MS mass tolerance, the amount of experimental uncertainty that fms_j is allowed to have to determine the match. A MS/MS spectrum match TSM is the set of matches TSM = {tsm_k} between FMS and FML. The match ratio r is then defined as the number of matches divided by the size of the tandem mass spectrum, i.e. r = |TSM|/|FMS|. Since each match ratio is a measure of how well the LC/ESI-MS/MS experimental data supports the hypothesis of a disulfide bond between two of the cysteines in the protein being analyzed, we denote r_{1,2} as
the match ratio for a bond between cysteines C_1 and C_2. As a result, each r is assigned as the weight w_e of the graph G which models the overall connectivity pattern. Thus, the disulfide linkage pattern identification problem is to find a perfect matching in G of maximum weight.
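This final matching step can be handed to an off-the-shelf blossom-style solver; for example, networkx's max_weight_matching (with maxcardinality=True so that matchings covering all vertices, i.e. perfect matchings, are preferred). The match ratios below are invented for illustration.

```python
import networkx as nx

# Hypothetical match ratios r_ij between cysteines (higher = stronger
# MS/MS support for a bond); integers name cysteine positions.
r = {(1, 2): 0.72, (1, 3): 0.15, (1, 4): 0.08,
     (2, 3): 0.11, (2, 4): 0.05, (3, 4): 0.66}

G = nx.Graph()
G.add_weighted_edges_from((u, v, w) for (u, v), w in r.items())

# maxcardinality=True prefers matchings that cover every vertex,
# i.e. a perfect matching when one exists.
pattern = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in pattern))  # -> [(1, 2), (3, 4)]
```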
2.2. Algorithmic Framework

Determining the disulfide linkage patterns involves solving the following four sub-problems:

(1) Find the bond spectrum match BSM between the mass list ML and the disulfide bond mass space BMS.
(2) Determine the MS/MS spectrum match TSM between the disulfide bonded fragment mass space FMS and the MS/MS mass list FML.
(3) Find a perfect matching of maximum weight for a fully connected graph with |C| vertices, with edges of weight r_{i,j}.
(4) Utilize experimental data that contains noise, isotopic variation, neutral loss, and charge state uncertainty to achieve the matchings in sub-problems 1 and 2.

In the following subsections, we present our approach to each of these sub-problems.
2.2.1. Finding the MS spectrum match

Let k denote the number of sites where an arbitrary protein A can be cleaved with a certain protease. The construction of the mass space then requires O(k^2) time. This is because the k proteolytic amino acids divide the protein A into k+1 subsequences, leading to k(k+1)/2 unique pairs of subsequences that can be formed. For the case of disulfide bonds, we are concerned with forming unique pairs of subsequences from C as opposed to A. Because |C| is much smaller than |A| for almost all proteins and proteases, the disulfide bond mass space BMS is likely to be smaller than the mass space obtained from every peptide in A. The quadratic time complexity can be further reduced if the data structure used to construct and search D does not require computing the mass of every member of D. The intuition lies in computing the masses only of the possible disulfide bonded peptides that are expected to be close in value to a peak in the mass spectrum S. This can be done by use of the expected amino acid mass, defined as the weighted mean over A, i.e. m_e = sum_i w_i m(a_i), where {w_i} denotes the relative abundance of each amino acid. Using published values for masses and relative abundances,19 we obtain m_e = 111.17 Da. Using this definition, we can predict the mass of a peptide as m(p) = ||p|| m_e + 18, where ||p|| represents the number of amino acids contained in the peptide. The additional 18 Da was explained in Section 2. Statistically, this is justified to a first approximation because the weighted standard deviation, again using published data,19 is 28.86 Da. Thus, the number of amino acids in the bonded pair of peptides, denoted ||d_i||, can be used to construct BMS in such a way that it is approximately mass sorted. This is the motivation for exploring the use of a hash table to construct and search BMS. The hash table is a well known data structure for efficient searching of a data space.20 If the hash function employed satisfies the assumption of simple uniform hashing, then the expected time to search for an element is O(1). Simple uniform hashing means that, given a hash table T with m buckets, any data element d_i is equally likely to hash into any bucket, independently of where any other element has hashed to. Using the expected amino acid mass to predict the mass of a peptide, we implement the simple hashing function h(d_i) = ||d_i|| as a first approximation. This results in our algorithm for this sub-problem (which we call MSHashID) having an overall complexity of O(|C|^2 + |BMS|), where |BMS| is the size of the disulfide bond mass space. Table 1 presents a toy example illustrating the construction of the hash table. In this example, the three pairs of peptides will be hashed to buckets 10, 12, and 14 respectively. Let the MS spectrum for peptides of the protein being considered in this example contain a mass peak having the value of m(p) = 1332 Da. Following our approach, this results in an estimated number of amino acids of 12 (||p|| = 12). Subsequently, the corresponding bucket in the hash table is accessed.
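To make the bucketing concrete, the following is a minimal sketch of the hashing step, assuming peptides are plain strings; the 111.17 Da expected residue mass is taken from the text, while the function names and the dictionary-backed table are illustrative rather than the authors' implementation.

from collections import defaultdict

EXPECTED_RESIDUE_MASS = 111.17  # Da, weighted mean amino acid mass (see text)

def build_hash_table(peptide_pairs):
    """Bucket each cysteine-containing peptide pair by its residue count,
    h(d_i) = ||d_i||, so that buckets are approximately mass sorted."""
    table = defaultdict(list)
    for p1, p2 in peptide_pairs:
        table[len(p1) + len(p2)].append((p1, p2))
    return table

def candidate_pairs(table, precursor_mass):
    """Estimate the residue count from the precursor mass and probe the
    corresponding bucket plus its two neighbors."""
    n = round(precursor_mass / EXPECTED_RESIDUE_MASS)  # 1332 Da -> 12, as above
    for bucket in (n - 1, n, n + 1):
        yield from table.get(bucket, [])

table = build_hash_table([("ECGR", "NVNCTK"), ("ECGR", "AIQCLDEH"),
                          ("NVNCTK", "AIQCLDEH")])
print(list(candidate_pairs(table, 1332.0)))  # probes buckets 11, 12, 13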
Table 1. Example showing how the hash table is constructed.

1. Given protein sequence and protease: ECGRNVNCTKAIQCLDEH, trypsin (cleaves after K and R)
2. Identify cysteine-containing peptides: EC2GR, NVNC8TK, AIQC14LDEH
3. Form all possible pairs of peptides, and 4. determine the hash table bucket:
   EC2GR + NVNC8TK -> bucket 10
   EC2GR + AIQC14LDEH -> bucket 12
   NVNC8TK + AIQC14LDEH -> bucket 14

In our example, this bucket contains the peptide NVNCTK. The mass of this peptide is then computed and compared with m(p) to determine if there is a match. Because there is a possibility that another bucket may contain a peptide pair with a matching mass, neighboring buckets (i.e. buckets 11 and 13) are also accessed.

2.2.2. Finding the MS/MS spectrum match

Based on experimental observation, when peptides undergo collision-induced dissociation (CID), the fragments produced are mostly either b-ions (containing the N terminus) or y-ions (containing the C terminus).3 We have also observed that the disulfide bond remains intact during CID. Let p1 denote a peptide with its possible y-ions y1 and b-ions b1, and similarly y2 and b2 for peptide p2. If p1 and p2 are in a disulfide bond, four types of fragments may occur: y1+y2, y1+b2, b1+y2, and b1+b2. The most convenient way to compute and display the disulfide bonded pair mass space is to generate four tables in which each row represents the mass of an ion of the first peptide and each column represents the mass of an ion of the second peptide. Then, each entry in this MS/MS mass table (subsequently referred to as a mass table) is the sum of its row and column, minus 2m(H) Da. Next, let peptides p1 and p2 consist of m and n amino acid residues, respectively. The first step is to compute the lowest and highest masses m_min and m_max in the mass table. The former is in the first row and first column of the mass table, and the latter is in its last row and last column. Also, because the dynamic range of amino acid residue masses is relatively small (about 3.3:1 in the extreme case of tryptophan:glycine), the increase in mass is approximately linear as the values are read "diagonally" from the lowest to the highest value. Thus, given an MS/MS fragment ion mass, it is possible to make an initial estimate of the location of the diagonal band of table masses that are most likely to match this fragment ion mass.

Let s be an MS/MS fragment ion mass peak value. If either s < m_min or s > m_max, the algorithm returns no value. Otherwise, in the second step, we compute the average amino acid residue mass E = (m(p1) + m(p2))/(n + m). This is the approximate mass difference between an element and the (up to) four elements that are a "step" away from it. A step is defined to be the movement of an index that points from an element to a neighboring element, either vertically or horizontally, in a mass table. Thus, the estimate of the number of steps used to index into the table to locate the band for a particular mass peak is n_steps = s / E. While any continuous path of steps from m_min to m_max can be used to locate the band, it is simplest to step along the perimeter of the mass table. In this algorithm, we start by stepping "down" along the first column, and then "across" along the last row. We note that the initial estimate may not index into the actual location of the band. Therefore, we need a strategy to reach the actual location starting from the initial estimate. For relatively short peptides of under one hundred amino acid residues (much longer than usually encountered in tryptic digests), one can simply generate neighboring mass table elements along the path used to index into the table until the band is reached. The location of the band is identified as the index of the element that has the mass closest to s. Once the location of the band is identified, the remaining elements of the band are generated and compared to s. The second element will be found either directly above, or above and to the right (row = row - 1, column = column + k, where k depends on the relative sizes of the peptides), of the first element.

As an example, let the two amino acid sequences be p1 = NVNCTK and p2 = AIQCLDEH. Table 2 shows all of the possible y- and b-ions that contain a cysteine, as well as the mass of each ion. Note that for y-ions, an additional 18 Da is added to the sum of the residue masses. The resulting mass table for the b1 + y2 combination is shown in Table 3.

The algorithm described by this approach, IndexID, has a worst case time complexity of O(n + m) to locate the band. However, because this approach usually indexes into the mass table just a few elements away from the band, that cost can be estimated by a constant. Because the band is in general a diagonal along the mass table, generating the band elements has a complexity of O(sqrt(nm)). Since IndexID is invoked once for each element of the MS/MS mass list, the time complexity of the solution to sub-problem 2 for a protein is O(|FML| sqrt(nm)), where |FML| is the size of the tandem mass spectrum.
Table 2. Example fragment mass space.

Peptide  Ion type  Sequence   Mass (Da)
1        y         CTK        351
1        y         NCTK       446
1        y         VNCTK      564
1        y         NVNCTK     678
1        b         NVNC       431
1        b         NVNCT      532
1        b         NVNCTK     660
2        y         CLDEH      501
2        y         QCLDEH     639
2        y         IQCLDEH    752
2        y         AIQCLDEH   813
2        b         AIQCL      -
2        b         AIQCLD     -
2        b         AIQCLDE    673
2        b         AIQCLDEH   810
Table 3. Example mass table for the b1 + y2 combination (entry = b1 + y2 - 2m(H)).

              CLDEH (501)  QCLDEH (639)  IQCLDEH (752)  AIQCLDEH (813)
NVNC (431)    930          1068          1181           1242
NVNCT (532)   1031         1169          1282           1343
NVNCTK (660)  1159         1297          1410           1471
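The band search can be sketched as follows; this is an illustrative reading of IndexID, not the authors' code: the mass table is kept implicit, the perimeter walk supplies the initial estimate, and a simplified staircase walk enumerates the band (a production version would also test the immediate neighbors of each visited cell).

def index_id(rows, cols, s, tol):
    """Find (i, j) entries of the implicit mass table within tol of s.
    rows/cols: ascending fragment masses (e.g. b1-ion and y2-ion masses)."""
    entry = lambda i, j: rows[i] + cols[j] - 2.0   # minus 2m(H) for the S-S bond
    n, m = len(rows), len(cols)
    if s < entry(0, 0) or s > entry(n - 1, m - 1):
        return []                                   # outside the table range
    # Perimeter path: "down" the first column, then "across" the last row.
    path = [(i, 0) for i in range(n)] + [(n - 1, j) for j in range(1, m)]
    # Initial estimate of the number of steps, n_steps = s / E in the text.
    E = (rows[-1] - rows[0] + cols[-1] - cols[0]) / (n + m - 2)
    k = min(int((s - entry(0, 0)) / E), len(path) - 1)
    # Slide along the path until the element with mass closest to s is reached.
    while k + 1 < len(path) and abs(entry(*path[k + 1]) - s) < abs(entry(*path[k]) - s):
        k += 1
    while k > 0 and abs(entry(*path[k - 1]) - s) < abs(entry(*path[k]) - s):
        k -= 1
    # Generate the band from the anchor: move up (lighter) or right (heavier).
    i, j = path[k]
    hits = []
    while i >= 0 and j < m:
        if abs(entry(i, j) - s) <= tol:
            hits.append((i, j))
        if entry(i, j) < s:
            j += 1
        else:
            i -= 1
    return hits

# Using the Table 3 example: a fragment ion near 1170 Da matches NVNCT + QCLDEH.
print(index_id([431, 532, 660], [501, 639, 752, 813], 1170.0, tol=1.5))  # [(1, 1)]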
2.2.3. Finding a perfect matching of maximum weight for a fully connected graph

Sub-problem 3, the maximum-weight perfect matching problem, is a well understood problem in graph theory. At present, the best performing algorithm that solves this problem for a fully connected graph was designed by Gabow.21 This algorithm has a worst-case bound of O(|C|^3).
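As an illustration of this step (the paper itself uses Rothberg's WMATCH implementation of Gabow's algorithm; see Section 3.2), the same matching can be computed with networkx's blossom-based solver; the cysteine labels and ratios below are hypothetical.

import networkx as nx

def disulfide_pattern(match_ratios):
    """match_ratios: dict mapping (cys_i, cys_j) -> match ratio r_ij.
    Returns the bonds selected by a maximum-weight perfect matching."""
    G = nx.Graph()
    for (ci, cj), r in match_ratios.items():
        G.add_edge(ci, cj, weight=r)
    # maxcardinality=True forces the matching to be perfect when one exists.
    return nx.max_weight_matching(G, maxcardinality=True)

# Hypothetical 4-cysteine example: the matching pairs C1-C2 and C3-C4.
print(disulfide_pattern({("C1", "C2"): 0.9, ("C1", "C3"): 0.4,
                         ("C1", "C4"): 0.3, ("C2", "C3"): 0.2,
                         ("C2", "C4"): 0.5, ("C3", "C4"): 0.8}))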
2.2.4. Consideration of missed proteolytic cleavages and intra-molecular bonded cysteines

In the laboratory, a protease used to digest a protein may sometimes miss a cleavage point. For example, a protein with sequence NRDKTA should be digested by trypsin into three peptides: NR, DK, and TA. However, if one cleavage point is missed, two peptides are created: either NRDK and TA, or NR and DKTA. We model this behavior by including the parameter m_max, the maximum number of missed cleavages allowed. It can be inferred by induction that a protein with k cleavage sites and m_max = m will digest into approximately (m + 1)k unique peptides, assuming k >> m. Note that m_max includes all smaller missed cleavage levels, e.g. m_max = 2 includes m = 1 and m = 0 as well. If m_max is small (e.g. three or smaller), missed cleavages can be considered a constant multiplicative factor in our time complexity analysis, as described earlier. Since the proteolytic digestion process can produce peptides that contain two or more cysteine residues, there is the possibility that intra-molecular bonds may occur, i.e. disulfide bonds exist within a single peptide. These peptides must be included in the mass list ML, with mass m(p) = sum_i m(a_i) - 2, if at most one disulfide bond per peptide is considered. The impact on time complexity is simply a larger disulfide bond mass space D, which can be modeled as an additive factor f(|P|, |C|, m_max). The disulfide bonded fragment mass space DF for an intra-molecular bonded peptide consists of the union of the mass spaces of the possible b-ions and y-ions that can result from its fragmentation. For example, for the peptide ASICQQNCQY, the possible b-ions are b1, b2, b3, b8, b9, and b10, and the possible y-ions are y1, y2, y7, y8, y9, and y10. Thus the complexity of the solution to sub-problem 2 is increased by an additive factor, O(|FML| max(n, m)).
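The missed-cleavage model can be sketched as follows, assuming trypsin-style cleavage after K and R (the proline exception is ignored for brevity); this is illustrative rather than the authors' code.

import re

def digest(sequence, m_max=1):
    """Tryptic digestion allowing up to m_max missed cleavages: up to
    m_max + 1 consecutive fragments are joined into one peptide."""
    fragments = re.findall(r'[^KR]*[KR]|[^KR]+$', sequence)
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + m_max + 1, len(fragments))):
            peptides.add(''.join(fragments[i:j + 1]))
    return peptides

print(sorted(digest("NRDKTA", m_max=1)))
# ['DK', 'DKTA', 'NR', 'NRDK', 'TA'] -- matches the NRDKTA example above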
2.2.5. Peak finding in the presence of noise

Using the Bioworks software from Thermo-Fisher, the raw data obtained from an LC/ESI-MS/MS analysis of a single protein is converted to a series of DTA files. The DTA format is very simple: the first line contains the mass of the precursor ion and the peptide charge state as a pair of space-separated values, and subsequent lines contain space-separated pairs of fragment ion mass-to-charge ratios (denoted m/z) and intensity values, sorted in order of increasing m/z.
Typically hundreds of DTA files are produced per analysis. A typical DTA file contains on the order of 10^2 to 10^3 lines of fragment ion information, and the intensity values can span several orders of magnitude. It is expected that only a fraction of these lines correspond to true fragment ion peaks. We therefore define a threshold t as a percentage of the maximum intensity in the file, and discard peaks whose intensity falls below it; in addition, when the number of remaining peaks is large (> 100), a limit l is placed on the number of peaks to be considered for matching. Next, we consider the correlation between MS/MS spectrum peaks and the mass/intensity lines in the associated DTA file. In the graphical representation of an MS/MS spectrum the peaks are very sharp. In the DTA file the more intense mass peaks typically occupy several neighboring lines, reflecting the slightly differing masses of the isotopes of a fragmented ion. If each line in a DTA file were considered to be a separate mass peak, the data analysis would be biased towards masses associated with more intense peaks. To correct for this bias, we represent a set of neighboring lines with similar intensity as a single peak. We formalize the concept of "neighborhood" by defining the maximum peak width p_w as the maximum difference in mass-to-charge ratio that two consecutive lines in the DTA file can have and yet be considered to belong to a single peak. "Similar" is defined as the absolute difference in intensity of two neighboring peaks being less than 50% of the larger intensity. We denote the set of peaks that result as p_i, where 0 <= i <= l. Figure 1 illustrates an example of how the threshold t, limit l, and maximum peak width p_w work together to find the best mass peaks. Let the masses of the six peaks shown there be labeled a through f, and let t = 10%, l = 2, and p_w be the mass range as shown. Peak c has the maximum intensity; peak f is eliminated, since its intensity is less than 10% of c's. Peaks a, b, and c would have been replaced by a single peak whose mass is the average mass of these three peaks. However, because the intensity of peak b is less than 50% of peak a, this is not done. Instead, the peak window moves to peaks c, d, and e. In this case, these peaks are replaced by a single peak of mass (c + d + e)/3. Since the limit is two, this peak and peak a are identified as the peaks to use for subsequent analysis.

Figure 1. Illustrative example of peak finding.
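A sketch of this peak-finding procedure follows, assuming the DTA lines are already parsed into (m/z, intensity) pairs; the intensity assigned to a merged peak is not specified in the text, so summing it here is an assumption.

def find_peaks(lines, t=0.02, l=50, p_w=2.0):
    """lines: (m/z, intensity) pairs sorted by m/z, as read from a DTA file.
    Applies threshold t, merges isotope neighbors within p_w whose intensities
    differ by < 50% of the larger, and keeps at most l peaks."""
    max_int = max(inten for _, inten in lines)
    kept = [(mz, inten) for mz, inten in lines if inten >= t * max_int]
    groups = []
    for mz, inten in kept:
        if groups:
            last_mz, last_int = groups[-1][-1]
            if mz - last_mz <= p_w and abs(inten - last_int) < 0.5 * max(inten, last_int):
                groups[-1].append((mz, inten))
                continue
        groups.append([(mz, inten)])
    # Each group becomes one peak at the mean m/z; summed intensity is an
    # assumption, since the text does not specify the merged intensity.
    peaks = [(sum(m for m, _ in g) / len(g), sum(i for _, i in g)) for g in groups]
    return sorted(peaks, key=lambda p: -p[1])[:l]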
2.2.6. Addressing isotopic variation and neutral loss
To account for the possibility of neutral loss, for each element fml_i of the MS/MS mass list FML computed in the preceding section, we add three more elements: fml_i + m(H2O), fml_i + m(NH3), and fml_i + m(H2O) + m(NH3), where m(H2O) is the mass of a water molecule and m(NH3) is the mass of an ammonia molecule. This accounting increases the size of the search space by a factor of four. In addition, we use the average masses for amino acid residues to compute the mass of peptides and their fragments with molecular weights greater than 1500 Da. Our experiments using an ion trap indicate that this results in more accurate correlations with observed fragment ion peaks than simply using monoisotopic masses. As a consequence of this step, we empirically observed a much closer correlation between the MS/MS mass space FML and the disulfide bonded fragment mass space FMS values.
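A sketch of this expansion (illustrative; average masses are assumed for the water and ammonia offsets):

M_H2O, M_NH3 = 18.02, 17.03  # Da, assumed average masses

def expand_neutral_loss(fml):
    """For each mass in FML, add the three neutral-loss variants described
    above, quadrupling the number of candidate masses."""
    return [m + d for m in fml for d in (0.0, M_H2O, M_NH3, M_H2O + M_NH3)]

print(expand_neutral_loss([500.0]))  # [500.0, 518.02, 517.03, 535.05]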
2.2.7. Interpretation of peaks given charge state uncertainty
For some low resolution mass spectrometers, it has been observed that the charge state of the precursor ion used to generate the MS/MS spectra may be reported incorrectly. An incorrect charge state will significantly impact the MS/MS mass space that is searched for matches. To address such cases, our system is implemented so that the user can intervene and correct the mass assignment. Next, we examine how to process the values of fragment ion m/z in the DTA file to obtain the MS/MS mass list FML used to search for matches with the disulfide bonded fragment mass space FMS. Let the charge state value (reported or corrected) for a DTA file be denoted as c. No fragment of the precursor ion can have a charge larger than c. Then each element of FML is obtained by computing FML_i(z) = z p_i - (z - 1) m(H), where 1 <= z <= c and 1 <= i <= l, and m(H) is the mass of a single proton. The second term is needed because FMS is computed for singly protonated ions.
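A sketch of this conversion (illustrative; the proton mass constant is an assumed value):

M_PROTON = 1.00728  # Da, assumed proton mass

def fml_from_peaks(peak_mzs, c):
    """Convert observed peak m/z values into singly protonated masses:
    FML_i(z) = z*p_i - (z-1)*m(H), for every charge z from 1 up to c."""
    return [z * mz - (z - 1) * M_PROTON for mz in peak_mzs for z in range(1, c + 1)]

print(fml_from_peaks([650.40], c=2))  # [650.40, 1299.79...]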
2.2.8. Overall complexity

The overall time complexity of our algorithmic approach is computed as follows. Finding the bond spectrum match BSM between the mass list ML and the disulfide bond mass space BMS is performed once per analysis, with a time complexity of |ML| O(MSHashID) = O(|ML|(|C|^2 + |BMS|)). Determining the MS/MS spectrum match TSM between the disulfide bonded fragment mass space FMS and the MS/MS mass list FML is performed each time there is a bond spectrum match, or |BSM| times, with a time complexity of |BSM| O(IndexID) = O(|BSM||FML| sqrt(nm)). Finding a perfect matching of maximum weight for a fully connected graph with |C| vertices has a time complexity of O(|C|^3). The techniques developed to utilize experimental data constitute a constant factor multiplying |ML| and |FML|. Thus, the overall complexity of our approach is O(|ML|(|C|^2 + |BMS|) + |BSM||FML| sqrt(nm) + |C|^3). Since n, m and |C| are typically small (< 100), the performance of this algorithm is dominated by |ML|, |FML| and the I/O cost to process the spectrum data.

3. EXPERIMENTAL RESULTS

3.1. Description of the Data and Experimental Procedures
The proposed method was validated using MS and MS/MS data obtained by LC/ESI-MS/MS analysis of three eukaryotic glycosyltransferases with varying numbers of cysteines and disulfide bonds:
1. Mouse Core 2 1,6-N-Acetylglucosaminyltransferase I (C2GnT-I)22
2. ST8Sia IV Polysialyltransferase (ST8Sia IV)
3. Human Fucosyltransferase VII (FT VII)23
The disulfide linkage pattern for each of these proteins is known and reported in the cited references. The experimental data were obtained using a capillary liquid chromatography system coupled with a Thermo-Fisher LCQ ion trap mass spectrometer (LC/ESI-MS/MS). Further details of the experimental protocols used are available.24 We obtained the primary sequences from the Swiss-Prot database,25 and DTA files were obtained from LC/ESI-MS/MS analyses of each protein. For each experiment, we set the bond mass tolerance bm_t = 3.0 Da, the maximum peak width p_w = 2 Da, the threshold t = 2% of the maximum intensity, and the limit l = 50 peaks. We used an MS/MS mass tolerance fm_t = 1.0 Da, except when intramolecular bonded cysteines were identified, in which case a value of 1.5 Da was used. The protease is set to what was used in the actual experiment. We set the maximum number of missed cleavages allowed m_max = 1, except for one case where a combination of trypsin and chymotrypsin was used, where we set m_max = 3.
3.2. Summary of Results

The proposed method was applied to determine the disulfide bonding patterns of three proteins with varying numbers of cysteines and disulfide bonds. Our results are presented in the form of a connectivity matrix, as proposed in.26 Each matrix element below the diagonal corresponds to a possible disulfide bond. In this matrix we indicate the "known" linkage patterns by a gray shaded matrix element. If our method computes a match ratio of over 50% for a particular combination, we record it in the table. In addition, we assign one of the values TP, FP, FN, or TN to each matrix element per the following conventions:
- For match ratios of at least 50%, a true positive (TP) is assigned if the same matrix element is shaded gray.
- A false positive (FP) is assigned if the matrix element is not shaded.
- A false negative (FN) is assigned to a matrix element if the matrix element is shaded but its match ratio is less than 50%.
Table 4 summarizes our results for an analysis of 233 DTA files of C2GnT-I. For this dataset, the charge state reported in two DTA files needed to be reinterpreted in order to avoid false negative results. In Table 5 we present the results from the analysis of 79 DTA files of ST8Sia IV, and Table 6 contains the results obtained from the analysis of 158 DTA files of FucT VII. We evaluate the performance using the following metrics:
- Precision P = TP/(TP+FP)
- Recall R = TP/(TP+FN)
- Specificity S = TN/(TN+FP)
Table 7 summarizes our results for these metrics. Although our precision result for C2GnT-I is low compared to the precision results for ST8Sia IV and FucT VII, it still compares favorably with the results reported by the purely predictive methods.11-13 In addition, we note that we can improve the precision from P = 0.40 to P = 0.70 if we choose to ignore all match ratios less than 85%.
Table 5. ST8Sia IV validation testing results (connectivity matrix over cysteine locations 142, 156, 292, and 356).
Table 6. FucT VII validation testing results.
Table 7. Overall performance results: precision, recall, and specificity for C2GnT-I, ST8Sia IV, and FT VII.
Following the implementations of the purely predictive methodology, we adapted WMATCH, Rothberg's implementation of Gabow's algorithm,14,21 to find the maximum weight matching. This analysis component was only conducted for the C2GnT-I intermediate results, as the linkage patterns for ST8Sia IV and FucT VII are already evident. Our result was in agreement with the published bonding pattern.22
3.2.1. Analysis of the effect of varying threshold t on results

The values we used for many of the parameters introduced in this paper, such as the threshold t, limit l, and maximum peak width p_w, were based on heuristics
developed by experimenters. In this section, we examine the effect of varying the threshold t on our results, using the C156-C356 bond in ST8Sia IV as the test data. Figure 2 consists of two graphs: (1) a plot of match ratio vs. t, and (2) a plot of the fraction of total peaks used vs. t. The intersection of these two graphs is close to t = 2, confirming that the heuristic value used in our experiments balances performance and data utilization.
Figure 2. Match ratio and peak utilization vs. threshold t.

3.2.2. Comparison with the MS2Assign program

As discussed in Section 1, the program MS2Assign can be configured to process MS/MS data to identify disulfide bonds in proteins. However, we note that while MS2Assign automates the identification of disulfide bonds, it does not do so in a high throughput manner. For example:
- The two peptides MS2Assign takes as input must be obtained from another program, such as Peptidemap.
- MS2Assign accepts the input of only one MS/MS mass list (from one DTA file).
Also, because MS2Assign does not account for experimental noise, isotopic variation, or the intensity of the fragmented ion, the accuracy of its results may not be as high as that of a program which takes these factors into consideration. To investigate this, we identified the DTA files that MS2DB used to obtain the match ratios for C13 to C59 (a true positive identification) and C199 to C413 (a false positive identification) of C2GnT-I. We then copied the fragment ion m/z portion of each file to use as the Peak List in MS2Assign. Our results are summarized in the two tables below.

Comparison for the C13-C59 bond (true positive identification):

Program    Number of peaks utilized  Number of matches  Match ratio
MS2Assign  1774                      1646               0.93
MS2DB      50                        48                 0.96

Comparison for the C199-C413 bond (false positive identification):

Program    Number of peaks utilized  Number of matches  Match ratio
MS2Assign  2169                      1791               0.78
MS2DB      50                        44                 0.72

These studies suggest that MS2DB may be slightly better than MS2Assign at discriminating between a true positive and a false positive result. More studies are needed to support this conclusion.

4. CONCLUSIONS AND DISCUSSION
In this paper we have presented a comprehensive algorithmic framework for the determination of disulfide bonds utilizing data from tandem mass spectrometry. The proposed approach involves addressing four key sub-problems. First, the match between a given mass spectrum and the set of every possible pair of cysteine-containing peptides of the given protein is obtained. Next, the correspondence between the tandem mass spectrum and the set of every disulfide bonded fragment mass is determined. The actual disulfide connectivity pattern is then determined by solving the maximum weight matching problem. The salient contribution of our approach is the use of real-world data from mass spectrometry in the above steps. Doing so requires addressing a series of algorithmic challenges that include peak finding in noisy spectra, addressing issues of isotopic variation and neutral loss, peak interpretation in the presence of charge state uncertainty, consideration of both inter-peptide and intra-peptide bonds, and consideration of missed proteolytic cleavages. Until now, techniques for disulfide bond identification have tended to remain on either side of the model-or-measure dichotomy. The proposed work seeks to span this divide and identifies the core algorithmic challenges at the intersection of purely computational and purely experimental strategies. Experimental results highlight the high precision and recall that can be obtained with such a hybrid strategy. Another advantage of this approach is its data-driven
and high-throughput nature. An implementation of our approach is available for public use at http://tintin.sfsu.edu:3319/ims2db/.
Acknowledgments

The research presented in this paper was partially supported by grants IIS-0644418 and CHE-0619163 of the National Science Foundation, a grant from the Center for Computing in Life Science of San Francisco State University, and grant P20MD000262 from the NIH. The authors also thank the anonymous reviewers for their comments.
References
1. Creighton TE, Zapun A, Darby NJ. Mechanisms and catalysts of disulfide bond formation in proteins. Trends in Biotechnology 1995; 13:18-23.
2. Angata K, Yen TY, El-Battari A, Macher BA, Fukuda M. Unique disulfide bond structures found in ST8Sia IV polysialyltransferase are required for its activity. J Biol Chem 2001; 18:15369-15377.
3. Gorman JJ, Wallis TP, Pitt JJ. Protein disulfide bond determination by mass spectrometry. Mass Spectrometry Reviews 2002; 21:183-216.
4. Brunger AT. X-ray crystallography and NMR reveal complementary views of structure and dynamics. Nature Structural Biology 1997; 4 Suppl:862-865.
5. Klepeis JL, Floudas CA. Prediction of beta-sheet topology and disulfide bridges in polypeptides. J Comput Chem 2003; 24:191-208.
6. Taskar B, Chatalbashev V, Koller D, Guestrin C. Learning structured prediction models: a large margin approach. Proc. of the International Conference on Machine Learning; 2005.
7. Ferre F, Clote P. Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics 2005; 21:2336-2346.
8. Vullo A, Frasconi P. Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics 2004; 20:653-659.
9. Baldi P, Cheng J, Vullo A. Large-scale prediction of disulphide bond connectivity. Advances in Neural Information Processing Systems 2004; 11:97-104.
10. Tsai CH, Chen BJ, Chan CH, Liu HL, Kao CY. Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics 2005; 21:4416-4419.
11. DiANNA: http://clavius.bc.edu/~clotelab/DiANNA/
12. DISULFIND: http://disulfind.dsi.unifi.it/
13. PreCys: http://bioinfo.csie.ntu.edu.tw:5433/Disulfide/
14. WMATCH: Solver for the Maximum Weight Matching Problem: http://elib.zib.de/pub/Packages/mathprog/matching/weighted/
15. MS-Bridge: http://prospector.ucsf.edu/prospector
16. X! Protein Disulphide Linkage Modeler: http://www.systemsbiology.ca/x-bang/DisulphideModeler/DisulphideModeler.html
17. Peptidemap: http://prowl.rockefeller.edu/prowl
18. MS2Assign: http://roswell.ca.sandia.gov/~mmyoung/ms2assign.html
19. Amino acid masses and abundances: http://prowl.rockefeller.edu/aainfo/struct.htm
20. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, MIT Press, 2001: 224-229.
21. Gabow H. Implementation of Algorithms for Maximum Matching on Nonbipartite Graphs. Ph.D. thesis, Stanford University, 1973.
22. Yen TY, Macher BA, Bryson S, Chang X, Tvaroska I, Tse R, Takeshita S, Lew AM, Datti A. Highly conserved cysteines of mouse Core 2 1,6-N-acetylglucosaminyltransferase I form a network of disulfide bonds and include a thiol that affects enzyme activity. J Biol Chem 2003; 278:45864-81.
23. De Vries T, Yen T, Joshi RK, Storm J, van den Eijnden DH, Knegtel RMA, Bunschoten H, Joziasse DH, Macher BA. Neighboring cysteine residues in human fucosyltransferase VII are engaged in disulfide bridges, forming small loop structures: a proposed 3D model based on location of cysteines, and threading and homology modeling. Glycobiology 2001; 11:423-432.
24. Yen TY, Macher BA. Determination of glycosylation sites and disulfide bond structures using LC/ESI-MS/MS analysis. Methods in Enzymology 2006; 415:103-113.
25. Swiss-Prot database: http://ca.expasy.org/
26. Fariselli P, Casadio R. Prediction of disulfide connectivity in proteins. Bioinformatics 2001; 17:957-964.
Biomedical Application
CANCER MOLECULAR PATTERN DISCOVERY BY SUBSPACE CONSENSUS KERNEL CLASSIFICATION

Xiaoxu Han
Department of Mathematics and Bioinformatics Program, Eastern Michigan University, Ypsilanti, MI 48197, USA
xiaoxu.han@emich.edu

Efficient discovery of cancer molecular patterns is essential in molecular diagnostics. The characteristics of gene/protein expression data challenge traditional unsupervised classification algorithms. In this work, we describe a subspace consensus kernel clustering algorithm based on the projected gradient nonnegative matrix factorization (PG-NMF). The algorithm is a consensus kernel hierarchical clustering (CKHC) method in the subspace generated by the PG-NMF. It integrates convergence-sound parts-based learning, subspace clustering and kernel space clustering in microarray and proteomics data classification. We first integrate subspace methods and kernel methods by following our framework of input space, subspace and kernel space clustering. We demonstrate more effective classification results from our algorithm by comparison with the classic NMF and sparse-NMF classifications and with supervised classifications (kNN and SVM) for four benchmark cancer datasets. Our algorithm can generate a family of classification algorithms in machine learning by selecting different transforms to generate subspaces and different kernel clustering algorithms to cluster data.
1. INTRODUCTION
With the development of genomics and proteomics, molecular diagnostics has emerged as a new tool to diagnose cancers. It takes a patient's tissue or blood samples and uses DNA microarray or mass spectrometry (MS) based proteomics techniques to generate their gene expressions or protein expressions. The gene/protein expressions reflect gene/protein activity patterns in different types of cancerous or precancerous cells. They are molecular patterns or molecular signatures of cancers. Different cancers will have different molecular patterns, and the molecular patterns of a normal cell will be different from those of a cancer cell. Clinicians identify potential biomarkers by analyzing the gene/protein patterns. However, robustly classifying cancer molecular patterns is still a challenge for clinicians and bioinformaticians. Many classification methods from statistics and machine learning have been proposed for cancer molecular pattern classification. These methods can be generally classified as supervised classification methods, such as k-nearest neighbor (kNN), linear discriminant analysis (LDA), neural networks (NN), and support vector machines (SVM); unsupervised classification (clustering) methods, such as hierarchical clustering (HC), self-organizing maps (SOM), and principal component analysis (PCA); and their variants, such as particle swarm optimization support vector machines (PSO-SVM), kernel principal component analysis (KPCA), etc.4-7 We are particularly interested in unsupervised molecular pattern discovery algorithms, because they do not need or have prior knowledge about the data. They also have the potential to explore the latent structure of data. However, the traditional clustering algorithms HC and SOM have already been proved unstable for gene and protein expression data, although they are widely used in the cancer molecular pattern discovery community.4,8,15 Actually, the characteristics of gene and protein expression data challenge traditional unsupervised classification algorithms. These high dimensional data can be represented by an n x m matrix after preprocessing. The row data in the matrix are the expression levels of a gene across different experiments, or the intensity values of a measured data point in different samples (observations) corresponding to an m/z ratio. The column data are the gene expression levels of a genome under an experiment, or the intensity values of all measured data points in a sample corresponding to m/z ratios. Usually, n >> m; that is, the number of variables in a dataset is much greater than the number of observations/experiments. For gene expression data, the number of variables is usually on the order of 5000 or more; for proteomics data, the number of samples is < 200 and the number of variables is generally on the order of 10^5 - 10^6. These data are not noise free, because their raw data contain noise that preprocessing algorithms cannot remove completely. Although there are a large number of variables in these data, only a small set of variables account for most of the data variation.
It is obvious that dimension reduction / feature selection should be conducted to reduce the data to a much lower dimension before classification. Several well-known global feature selection methods, such as principal component analysis (PCA), singular value decomposition (SVD), and independent component analysis (ICA), have been applied in cancer molecular pattern classification.9-12 However, the holistic feature selection mechanism of these methods prevents alternative local feature selection. For example, PCA can only capture the global characteristics of data, and each principal component (PC) contains information from all input variables. This makes PCs hard to interpret intuitively. Data representation in PCA is not "purely additive": each PC has both positive and negative entries, which are likely to partly cancel each other in the feature selection. On the other hand, there is a local feature selection algorithm with a parts-based learning mechanism: nonnegative matrix factorization (NMF).13 In contrast to the global feature selection algorithms, NMF can capture the variables contributing to local characteristics of the data, with obvious interpretations, and makes the global characteristics simple additive combinations of the local characteristics. In fact, data representation in NMF is purely additive because of the nonnegativity constraints in the NMF.

1.1. Nonnegative matrix factorization

Given a nonnegative matrix X in R^{n x m} and a rank r < min(n, m), NMF is a nonlinear programming problem to find two optimal nonnegative matrices W in R^{n x r} and H in R^{r x m} that minimize the reconstruction error between the matrices X and WH, measured by a distance metric: E(W, H) = ||X - WH||; that is, X ~ WH. We name W a basis matrix and H a feature matrix. The columns of W (a set of bases) set up a new coordinate system, and the elements of H are the coordinates of X in this new coordinate system. The feature matrix H is the prototype dataset of X after feature selection, where each column is the prototype of an observation. After NMF, each column (observation) of X can be approximately represented as a linear combination of the r bases W_i, i = 1, 2, ..., r:

    X_j ~ sum_{i=1}^{r} h_{ij} W_i    (1)

That is, each observation is expressed as the product of the basis matrix and its corresponding prototype after feature selection. The objective function E(W, H) = ||X - WH|| can be expressed as the Euclidean distance or the Kullback-Leibler (KL) divergence between X and WH. For example, the Euclidean distance objective function is defined as:

    E(W, H) = ||X - WH||^2 = sum_{i,j} (X_{ij} - (WH)_{ij})^2    (2)
Lee and Seung gave a multiplicative update algorithm for NMF by conducting dynamic-step gradient descent learning with respect to W and H.13 The iteration schemes for the Euclidean distance objective function are as follows (the schemes for the KL divergence are similar); in the iteration, W and H are initialized randomly:

    H <- H .* (W^T X) ./ (W^T W H),    W <- W .* (X H^T) ./ (W H H^T)    (3)

where .* and ./ denote element-wise multiplication and division.
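As a concrete reading of Eq. (3), the following is a minimal NumPy sketch of the multiplicative updates (illustrative only; eps guards the divisions, and the iteration count is arbitrary).

import numpy as np

def nmf_multiplicative(X, r, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for the Euclidean objective (Eq. 3)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W, H = rng.random((n, r)), rng.random((r, m))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H, Eq. (3)
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W, Eq. (3)
    return W, H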
The multiplicative update algorithm works well experimentally. However, there is no guarantee that it can converge to local minimum points of the objective function, because the limit of the non-increasing sequence [W^(k), H^(k)] generated by the multiplicative update algorithm may not be a stationary point;14 that is, it lacks "convergence soundness". Brunet et al. used NMF to classify cancer molecular patterns by conducting NMF based clustering for gene expression data.15 Their NMF clustering consists of three steps. First, decompose the gene expression data X
under a rank r by the multiplicative update algorithm; i.e. each observation is represented as a linear combination of the bases by Eq. (1), where h_{ij} is the i-th element of H_j, the prototype of the j-th observation X_j after feature selection. Second, clustering is conducted through the following query asked by each sample: 'which basis has the largest expression level in my prototype? I will belong to the cluster associated with that basis'. For example, suppose h_{ij} is the largest value in H_j; then sample X_j will be assigned to cluster i, because the i-th basis has the largest expression level in its prototype H_j. The number of clusters is just the decomposition rank r. Finally, the rank leading to the most meaningful clustering is decided by a Monte Carlo based model selection mechanism that finds the rank with the maximum cophenetic correlation coefficient in the hierarchical clustering. The cophenetic correlation coefficient is a measure to evaluate the stability of a hierarchical clustering: it is the correlation between the pairwise distances and the linkage distances in the hierarchical clustering. A large cophenetic correlation coefficient indicates high stability of a hierarchical clustering. Brunet et al. proved this method superior to the HC and SOM methods for three benchmark cancer datasets.15 Inspired by this work, Gao and Church developed a sparse nonnegative matrix factorization to cluster the cancer samples by adding sparseness control to the basic NMF formulation (sparse-NMF).16,17 They demonstrated that sparse-NMF based clustering was superior to the basic NMF clustering method for the same datasets. However, Brunet et al.'s NMF based clustering method has the following weak points. 1. The multiplicative update algorithm in the NMF lacks convergence soundness. 2. The model selection mechanism in the NMF clustering is expensive, because it requires computing cophenetic correlation coefficients for the hierarchical clusterings conducted at all possible ranks to decide the final optimal decomposition rank.
1.2. Contributions

In this study, we describe a subspace consensus kernel clustering technique based on the projected gradient nonnegative matrix factorization (PG-NMF), developed by Lin,14 to conduct cancer molecular pattern classification for microarray and proteomics data. The PG-NMF has sound convergence and converges faster than the basic NMF.14 In addition, we present the ideas of input space, subspace and kernel space clustering before elaborating on our PG-NMF based classification method under the framework of subspace and kernel space clustering. The idea of our method is to transform a gene/protein expression dataset X into a subspace S by using the PG-NMF algorithm. Then, a consensus kernel hierarchical clustering (CKHC) algorithm is developed to cluster the projections of the dataset X in the subspace S, to infer the latent structure of the data. We show that the PG-NMF based subspace kernel clustering (PG-NMF-CKHC) is superior to the basic NMF clustering, sparse-NMF clustering and supervised classification (kNN and SVM) in cancer molecular pattern discovery for four benchmark cancer datasets. This paper is organized as follows. Section 2 presents the concepts of input space, subspace and kernel space clustering; Section 3 introduces our PG-NMF based consensus kernel hierarchical clustering; Section 4 shows the experimental results of our algorithm. Finally, we discuss possible generalizations of the algorithm and draw conclusions.
2. INPUT SPACE, SUBSPACE AND KERNEL CLUSTERING

For a given dataset X = (x_1, x_2, ..., x_m) in R^{n x m}, clustering is to find an implicit classification function f: X -> Gamma that maps each data sample x_j to its target function value y_j (label) in a set Gamma according to some dissimilarity metric (j = 1, 2, ..., |Gamma|). Data samples with the same target function value (label) after classification are said to share the same cluster. We classify clustering into input space, subspace and kernel space clustering according to where the implicit classification function f is computed. In input space clustering, the implicit classification function f is computed in the input space R^{n x m} of the dataset. Hierarchical clustering (HC), K-means clustering and expectation maximization (EM) clustering all belong to input space clustering. In kernel space clustering, the classification function f is computed in a kernel space Omega of the input space, which is a high dimensional Hilbert space generated by a feature map Phi: X -> Omega, dim(Omega) >> dim(X). That is, the clustering is conducted on the high dimensional data Phi(X). On the other hand, in subspace clustering, the classification function f is computed in a subspace S of the input space, generated by a linear or nonlinear transform phi, dim(S) <= dim(X). Generally, almost all input space clustering methods can be used in subspace clustering to cluster the feature data in the subspace. However, not all input space clustering algorithms have corresponding kernel space clustering algorithms. In the following, we use HC as an example to demonstrate input space, subspace and kernel space clustering.
2.1. Subspace clustering

A subspace S is generated from a linear or nonlinear transform phi: X in R^{n x m} -> X* in R^{r x m}, and clustering is conducted on the transformed data X*. For example, SOM and PCA based clustering are typical subspace clustering approaches. Most likely, the subspace has lower dimensionality than the original dataset, i.e. dim(S) < dim(X). Each transform phi applied to X can be represented as TX = X*, where T is the matrix representation of the transform phi. Writing it as a matrix decomposition of X, we have X = WX*, where the matrix W is the inverse or pseudo-inverse of the matrix T. We still call W a basis matrix and X* a feature matrix. The columns of the basis matrix span the subspace S = span(W_1, W_2, ..., W_r). Depending on the properties of the transform phi, the basis matrix may not be unique, and the corresponding matrix decomposition may not be unique either. Geometrically, each column of X* gives the coordinates of the corresponding observation/column of X in the subspace S, which can be viewed as a new coordinate system. Self-organizing map clustering can be viewed as a simple subspace clustering, where the target function value of each sample is determined by the location of its corresponding reference vector of the best matching unit (BMU) on the SOM plane. In the nonlinear transform conducted by a self-organizing map (SOM), the feature matrix X* is called the prototype data, including all reference vectors on the SOM plane. The subspace bases (W_1, W_2, ..., W_r) can be obtained by solving r least squares problems, where r is the number of neurons on the SOM plane. Actually, the transform phi can be implemented by any linear or nonlinear feature selection method, such as principal component analysis (PCA), independent component analysis (ICA), self-organizing maps (SOM) and nonnegative matrix factorization (NMF). Spectral analysis methods like the fast Fourier transform and wavelet transforms can also implement phi. Any input space clustering algorithm can then be employed to cluster the feature data X*. For example, clustering the principal components of the data (PCA clustering) by HC or other input space clustering methods is a typical subspace clustering, where the subspace generated by the PCA transform is an orthogonal space.18 Similar are the hierarchical clustering of the independent components of data (ICA clustering) and of the FFT coefficients of data (FFT clustering).19
2.2. Kernel space clustering: conducting clustering in a high dimensional space with kernel tricks

Kernel space clustering conducts clustering in the kernel/feature space Omega of a dataset X in R^{n x m}. The motivation for kernel space clustering is that classification/learning in a high dimensional space can have desirable results. We use the kernel trick to avoid the huge computational complexity of clustering directly in the feature space Omega. To apply the kernel trick in clustering, we first need to formulate an input space clustering algorithm in inner product form. Then a kernel function k(x, y) = <Phi(x), Phi(y)> is employed to evaluate all the inner products. The kernel function has to satisfy the Mercer theorem.20 Through the kernel trick, classification/clustering can be conducted in a high dimensional space while paying only input-space-level computational complexity, and the feature map Phi need not be explicit. Although several input space clustering methods have corresponding kernel extensions, we give the kernelization of hierarchical clustering (HC) in this work. Qin et al. mentioned applications of kernel hierarchical clustering to gene expression data; however, they only gave an approximation based kernel extension rather than a rigorous kernel extension of the classic hierarchical clustering. Kernelization of the general hierarchical clustering algorithm consists of two steps: kernelize the pairwise distance computation and the linkage computation. In the kernelization of the pairwise distances, we focus on the Euclidean and correlation distances, because they are the most used
dissimilarity metrics in HC. The Euclidean distance between samples x_i and x_j in the kernel space can be kernelized as:

    d(Phi(x_i), Phi(x_j)) = (K_ii - 2 K_ij + K_jj)^{1/2}    (5)

where K_ij = k(x_i, x_j) = <Phi(x_i), Phi(x_j)>. In the kernelization of the correlation distance between samples x_i and x_j, if we assume the mapped vectors Phi(x_i), Phi(x_j) are zero mean data in the kernel space Omega, the correlation distance between Phi(x_i) and Phi(x_j) can be formulated in the following inner product form:

    c_ij = 1 - <Phi(x_i), Phi(x_j)> / ( <Phi(x_i), Phi(x_i)>^{1/2} <Phi(x_j), Phi(x_j)>^{1/2} )    (6)

However, we shall drop this assumption in the kernel space for more general practice. We use the expectation of all feature data to center each feature datum:

    Phi~(x_i) = Phi(x_i) - (1/m) sum_{k=1}^{m} Phi(x_k)    (7)

Then the corresponding correlation distance can be formulated in the same form as Eq. (6). Let K~_ij = <Phi~(x_i), Phi~(x_j)>; then we have the following result:

    K~_ij = K_ij - (1/m) sum_k K_ik - (1/m) sum_k K_jk + (1/m^2) sum_{k,l} K_kl    (8)

Since the kernel matrix K is a semi-positive definite matrix, summarizing the previous results, the correlation distance in the kernel space between Phi(x_i) and Phi(x_j) can be computed as:

    c~_ij = 1 - K~_ij / ( K~_ii^{1/2} K~_jj^{1/2} )    (9)

The extension of the single, complete and average linkages to the kernel space is trivial, but not that of the centroid linkage. The centroid linkage between two clusters is defined as the Euclidean distance between the centroids of the two clusters. We give the centroid linkage d_rs between the clusters C_r and C_s in Eq. (10):

    d_rs = ( (1/|C_r|^2) sum_{i,j} k_ij^{(r)} + (1/|C_s|^2) sum_{i,j} k_ij^{(s)} - (2/(|C_r||C_s|)) sum_{i,j} k_ij^{(r,s)} )^{1/2}    (10)

where x_i^{(r)} is the i-th sample in the cluster C_r; |C_r| and |C_s| are the numbers of samples in the clusters C_r and C_s; and k_ij^{(r)} = k(x_i^{(r)}, x_j^{(r)}), k_ij^{(s)} = k(x_i^{(s)}, x_j^{(s)}), k_ij^{(r,s)} = k(x_i^{(r)}, x_j^{(s)}).

2.3. What is the ideal unsupervised classification algorithm for high dimensional gene/protein expression data?

We believe that an ideal unsupervised classification or clustering algorithm for high dimensional gene and protein expression data should satisfy the following criteria. 1. Some feature selection method ought to be applied to reduce the data dimensions such that the data are "clean and compact". 2. The feature selection method employed should have the parts-based learning property, to maintain the data locality well; that is, the feature selection method should conduct local feature selection. 3. Kernel tricks are desirable in the clustering of the data after feature selection, to achieve better classification results in a kernel space. According to these criteria, we give our subspace consensus kernel classification algorithm based on the projected gradient NMF (PG-NMF). The basic idea is to apply a local feature selection algorithm with convergence soundness, the PG-NMF, to the gene/protein expression dataset X, which is equivalent to projecting the dataset X into the subspace S generated by the PG-NMF: X ~ WH, where W is the basis matrix generating the subspace. Then kernel hierarchical clustering is applied to the column data of the feature matrix H, which are the prototype data of the original data. Since the basis matrix and feature matrix are not unique in the NMF, we develop the consensus kernel hierarchical clustering algorithm (CKHC) to get the final classification.

3. PG-NMF SUBSPACE KERNEL HIERARCHICAL CLASSIFICATION

PG-NMF based subspace kernel classification conducts consensus kernel hierarchical clustering (CKHC) on each feature matrix H in a subspace generated by the PG-NMF. The CKHC algorithm runs the kernel hierarchical clustering in a Monte Carlo simulation approach and computes the final classification by building a consensus tree. It consists of two general steps. 1. Build a consensus tree for the expression dataset X at each rank by conducting CKHC
to the feature matrices H from the PG-NMF. 2. Then the best consensus tree, which is the final classification, is selected by our novel model selection method. The following algorithm describes the consensus kernel hierarchical clustering (CKHC) at rank r.

Algorithm 1. Consensus kernel hierarchical clustering at rank r
Input: nonnegative matrix X (n x m), rank r, PG-NMF running times N >= 100, kernel function k(x, y), linkage metric l
Output: the consensus tree T at rank r
// Run PG-NMF X ~ WH to do feature selection at rank r, N times
1. For run = 1 : N
2.   Initialize W and H randomly
3.   Compute X ~ WH by PG-NMF, W in R^{n x r}, H in R^{r x m}
4.   Compute the kernel pairwise distances between the columns of the feature matrix H in the kernel space by Eq. (5)/(9)
5.   Record the kernel pairwise distances in an m(m-1)/2 x 1 vector d
6.   Concatenate all such kernel distance vectors for the N feature matrices in a matrix D: D = [D, d]
7. End
8. Compute a consensus kernel distance vector d_consensus by weighting the ratios of the sum of each column in D over the sum of the elements of the matrix D
9. Build the consensus tree T from the consensus kernel distance vector under the linkage metric l
10. Return T
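A compact sketch of Algorithm 1 follows; it is illustrative only: scikit-learn's NMF stands in for the PG-NMF, a Gaussian kernel is assumed, and a plain average of the per-run distance vectors replaces the column-sum weighting of step 8.

import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.decomposition import NMF

def ckhc(X, r, N=100, gamma=1e-4):
    """Consensus kernel hierarchical clustering at rank r (sketch)."""
    D = []
    for run in range(N):
        model = NMF(n_components=r, init='random', random_state=run, max_iter=500)
        W = model.fit_transform(X)           # X ~ WH
        H = model.components_                # r x m feature matrix
        P = H.T                              # m prototypes, one per observation
        sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
        K = np.exp(-gamma * sq)              # Gaussian kernel matrix
        diag = K.diagonal()
        # Kernel-space Euclidean distance, Eq. (5).
        d = np.sqrt(np.maximum(diag[:, None] - 2 * K + diag[None, :], 0))
        D.append(d[np.triu_indices_from(d, 1)])   # condensed m(m-1)/2 vector
    d_consensus = np.mean(D, axis=0)         # simplified consensus weighting
    return linkage(d_consensus, method='average')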
We still need to answer the following question: what is the model selection method to find the most robust consensus tree (classification)? To avoid an exhaustive search over all possible ranks, we give a singular-value based rank selection method to find an optimal rank search interval [2, r*]. The idea can be described as follows. Given a threshold epsilon (epsilon in [0.90, 1)), we compute the smallest r* such that the importance ratio of the first r* singular values is at least the threshold. The importance ratio of the first r* singular values is defined as the ratio of the sum of the first r* singular values over the sum of all singular values:

    rho(r*) = ( sum_{i=1}^{r*} sigma_i ) / ( sum_i sigma_i ) >= epsilon    (11)

That is, PG-NMF is only conducted in the optimal rank search interval [2, r*], and we only search for the best consensus tree among the consensus trees for ranks 2 through r*. From which rank in the interval [2, r*] will the most robust consensus tree come? It is reasonable that the most robust consensus tree should come from a rank where the bases of the subspace generated by the PG-NMF each time represent all levels of patterns inherent in the dataset. From the point of view of data variability, it is a rank where the ratio between the largest data variability and the smallest data variability of the basis data reaches its maximum value. We propose a measure, the robust index delta, to find the most robust consensus tree according to the previous considerations.
The robust index delta is the condition number of the covariance matrix of the average basis matrix E(W) from the N runs of the PG-NMF. The average basis matrix is defined as:

    E(W) = (1/N) sum_{i=1}^{N} W^{(i)}

The condition number of the covariance matrix of the average basis matrix E(W) is the ratio between the maximum and minimum eigenvalues of that covariance matrix: delta = lambda_max / lambda_min. Here lambda_max is the variance of the first principal component of the average basis matrix, the largest data variability of the basis data, and lambda_min is the variance of the last principal component of the average basis matrix, the smallest variability of the basis data. The robust index can be huge, but it cannot reach infinity, because lambda_min is the smallest positive eigenvalue of the covariance matrix of E(W). The final classification is just the consensus tree with the largest robust index. The PG-NMF based consensus kernel hierarchical clustering algorithm (PG-NMF-CKHC) can be described as follows.

Algorithm 2. PG-NMF based consensus kernel hierarchical clustering
Input: an n x m nonnegative data matrix X, importance ratio threshold epsilon >= 0.90
Output: the final consensus tree T
1. Decide the rank search interval [2, r*] by the importance ratio threshold epsilon
2. For r = 2 : r*
3.   Conduct consensus kernel hierarchical clustering at rank r to get a consensus tree T_r at rank r
4.   Compute the robust index delta of the consensus tree T_r
5. End
6. T <- T_r with the maximum robust index
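The two model selection ingredients, the rank interval of Eq. (11) and the robust index delta, can be sketched as follows (illustrative; W_runs is assumed to be the collection of basis matrices from the N PG-NMF runs).

import numpy as np

def rank_upper_bound(X, epsilon=0.90):
    """Smallest r* whose leading singular values carry >= epsilon of the
    total singular value mass (Eq. 11)."""
    s = np.linalg.svd(X, compute_uv=False)
    ratios = np.cumsum(s) / s.sum()
    return int(np.searchsorted(ratios, epsilon)) + 1

def robust_index(W_runs):
    """delta = lambda_max / lambda_min of the covariance matrix of the
    average basis matrix E(W); W_runs is an N x n x r array."""
    EW = np.mean(W_runs, axis=0)                 # n x r average basis matrix
    eigvals = np.linalg.eigvalsh(np.cov(EW, rowvar=False))
    return eigvals[-1] / eigvals[0]              # eigvalsh sorts ascending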
4. EXPERIMENTS

We apply the PG-NMF-CKHC algorithm to discover the cancer molecular patterns of several benchmark cancer datasets. We use a measure called the classification rate, C_r = sum_{i=1}^{m} delta(i) / m, to evaluate the accuracy of the unsupervised classification for a dataset with m samples, where delta(i) = 1 if sample i is assigned to the correct cluster and delta(i) = 0 otherwise. We use three kernel functions in our algorithm: the linear, polynomial and Gaussian kernels. The dissimilarity measures in the kernel hierarchical clustering are chosen as the Euclidean and correlation distances. We choose the average linkage metric in the kernel hierarchical clustering. The PG-NMF algorithm is run N = 100 times in each optimal rank search interval with tolerance 1.0e-9. The first dataset is the Leukemia dataset, a benchmark dataset in cancer research consisting of 38 samples. It can be classified into 27 acute lymphoblastic leukemia (ALL) and 11 acute myelogenous leukemia (AML) marrow samples. The ALL samples can be further divided into 19 'B' and 8 'T' subtypes. HC and SOM were proved to be unstable for this dataset.15 The optimal search interval for this dataset is [2,6] under the importance ratio threshold 0.90. The robust index in PG-NMF-CKHC reaches its largest value at rank 5 for a Gaussian kernel under the correlation distance (Figure 2). Figure 1 is the visualization of the final consensus tree. It is clear that there are three clusters, AML, ALL-B, and ALL-T, in the final consensus tree. There is only one misclassification, i.e. ALL-14749-B-cell was assigned to AML. We have found that the combinations of the Gaussian kernel function with the correlation and Euclidean distances under the average linkage metric both reach the best performance in the classification. Under the linear kernel, we can see that the classification results under the correlation distance are better than those under the Euclidean distance (Figure 3). The NMF clustering has two misclassified samples: ALL-21302-B-cell and ALL-14749-B-cell. Sparse-NMF clustering has one misclassified sample: AML-12. However, the running times of the NMF and sparse-NMF clustering are more than twice that of our algorithm.
Fig. 1. The visualization of the consensus tree at rank 5 for a Gaussian kernel under the correlation distance and average linkage metric.
Fig. 2. The largest robust index reached at rank 5 for the Gaussian kernel with correlation distance.
Fig. 3. The classification rates under linear, polynomial and Gaussian kernel for Euclidean and correlation distances.
The second dataset is the medulloblastoma dataset, gene expression data from childhood brain tumors known as medulloblastomas. The pathogenesis of these tumors is still not well understood by investigators. However, there are two generally accepted histological subclasses: classic and desmoplastic. The samples are divided into 25 classic and 9 desmoplastic medulloblastomas. General HC and SOM failed to reveal the classifications of these samples.15 The robust index reaches its maximum in the optimal rank search interval [2,10] at rank 7 for a polynomial kernel under the correlation distance. Figure 4 is the visualization of the final classification. There are 8 desmoplastic samples clustered, and a total of 2 samples are misclassified: sample 25 and sample 33.
Fig. 4. Visualization of the final consensus tree of the medulloblastoma dataset at rank 7 under the polynomial kernel with the average linkage metric and correlation distance.

The NMF decomposition algorithm also has 2 samples misclassified at its best rank 5.15 However, it only gets 7 desmoplastic samples clustered. Although our algorithm also has 2 misclassified samples, we have a better clustering structure since 8 desmoplastic samples are clustered. On the other hand, sparse-NMF has 7 samples misclassified at its best rank 5.16 It seems that sparseness constraints do not contribute to improving the classification rates for this dataset. Since the pathogenesis of medulloblastoma is still not well understood, we did not compute the classification rates for this dataset.

The third dataset is an ovarian cancer dataset, an MS proteomics dataset consisting of 20 cancer and 20 normal samples, which presents as a 15142x40 positive matrix. This dataset is a subset of Ovarian Dataset 8-7-02, which was generated using the WCX2 protein array and includes 91 controls and 162 ovarian cancers. For this dataset, we try supervised classification first. We randomly pick another 40 samples (20 cancer and 20 normal) from the original dataset as a training set; then we use kNN under the Euclidean and correlation distances to classify the MS data. We have found that the best classification rate from kNN is 92%, but it cannot classify samples 3, 12 and 36 correctly. Our algorithm reaches its best classification at rank 7 in the optimal rank search interval [2,10]. There is only one misclassified sample: sample 36 (Figure 6).
Fig. 5. The largest robust index reached at rank 7 for the polynomial kernel with correlation distance.

Fig. 6. The final consensus tree at rank 7 under the Gaussian kernel with correlation distance.
Figure 7 shows the performance of the linear, Gaussian and polynomial kernels in the classification. The combination of the polynomial kernel and correlation distance has the best performance under the average linkage metric. Classification rates generally decrease after rank 7, and the correlation distance generally performs better than the Euclidean distance in the classification.
Fig. 7. The classification rates of PG-NMF-CKHC for this dataset: the polynomial kernel + correlation distance reaches the best classification rate.
We also apply NMF and sparse-NMF classification to the proteomics data, although they were developed in the context of gene expression data. There are 8 samples misclassified by NMF clustering and 12 samples misclassified by sparse-NMF clustering for our ovarian cancer dataset. Both algorithms indicate that there are 2 clusters from their cophenetic coefficients. Since a proteomics dataset generally has much higher dimensionality than a gene expression dataset, NMF and sparse-NMF clustering have large time complexity for a proteomics dataset. For this dataset, NMF clustering takes >78 hours and sparse-NMF clustering takes >153 hours on two PCs with 3.0 GHz CPUs and 504 MB RAM running the Windows XP OS. It seems that the NMF-based clustering/classification mechanism cannot work well in the context of proteomics data.
4.1. Comparing classification results from kNN, sparse-NMF and support vector machines (SVM)

We compare PG-NMF-CKHC on four datasets (the leukemia, medulloblastoma and ovarian cancer datasets, and a colon cancer dataset consisting of 22 control and 40 cancer samples) with the classic NMF clustering, sparse-NMF clustering, and SVM and kNN classifications. For kNN and SVM, we run the classification 10 times under holdout cross-validation with a 50% holdout percentage for each case. We take the average classification rates as the final classification rates. In the SVM classification, we also use the linear, polynomial and Gaussian kernels. We select the best final classification rate from the three kernels as the final classification rate of SVM. For the leukemia data, we use SVM/kNN to classify the ALL and AML types instead of all three types. Although the pathogenesis of medulloblastoma is not well established, we still compute the classification rates of this dataset based on the general assumption that the samples are divided into 25 classic and 9 desmoplastic medulloblastomas, for the convenience of comparisons.

Table 1 shows the classification rates for the four benchmark datasets from the kNN, PG-NMF-CKHC, NMF, sparse-NMF and SVM classifications. We have found that our algorithm is superior to the NMF, sparse-NMF and supervised SVM classification algorithms for these datasets. The NMF classification has better performance than SVM and kNN for the three gene expression datasets. Sparse-NMF has, on average, better performance than kNN for the three gene expression datasets. However, NMF and sparse-NMF cannot compete with kNN and SVM on the proteomics data. According to our classification results, it seems that the sparseness constraint on NMF may not always contribute to an improvement in the classifications for some datasets. Besides the ovarian dataset, for the medulloblastoma dataset the classic NMF clustering seems to perform better in classifying desmoplastic medulloblastomas than the sparse-NMF clustering at rank 5, where both algorithms reach their most robust reproducibility partitions. We also noticed that the NMF and sparse-NMF clustering cannot compete with the SVM classification for the ovarian dataset. It is interesting to see that the sparseness constraint may not lead to better classification results for the colon cancer dataset. The classic NMF clustering reaches its largest cophenetic correlation coefficient at rank 2 (2 clusters), and its corresponding classification rate is 0.9355. However, the sparse-NMF clustering reaches its largest cophenetic correlation coefficient at rank 4 (4 clusters), and its corresponding classification rate is 0.7581. This is possibly due to the fact that the expression patterns of dominant co-expressed genes, such as oncogenes and tumor suppressor genes, are not extracted in the sparse representation. This may also indicate that sparseness control may not always lead to better classification results for some datasets.

Figures 8 and 9 give the visualization of the NMF and sparse-NMF clustering from ranks 2-5 for the colon cancer dataset. The probability of two samples being clustered together is indicated by color. Generally, blue indicates a numeric value near 0 and red indicates a numeric value near 1. Deep blue, standing for 0, indicates samples that are never assigned to one cluster, and dark red, standing for 1, indicates samples that are always assigned to one cluster.
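As an illustration of how such a consensus matrix can be assembled, the following sketch uses scikit-learn's generic NMF as a stand-in for the paper's PG-NMF; the repeated-restart counting is the Monte Carlo mechanism described above:

```python
# Sketch of consensus-matrix construction; scikit-learn's NMF stands in for
# the paper's PG-NMF. Entry (i, j) estimates the probability that samples
# i and j are clustered together over n_runs random restarts.
import numpy as np
from sklearn.decomposition import NMF

def consensus_matrix(V, rank, n_runs=100, seed=0):
    """V: nonnegative genes-by-samples matrix."""
    m = V.shape[1]
    C = np.zeros((m, m))
    rng = np.random.RandomState(seed)
    for _ in range(n_runs):
        model = NMF(n_components=rank, init="random",
                    random_state=int(rng.randint(1 << 30)), max_iter=500)
        W = model.fit_transform(V.T)      # samples x rank encoding
        labels = W.argmax(axis=1)         # assign each sample to its dominant basis
        C += labels[:, None] == labels[None, :]
    return C / n_runs

# Plotting C with a blue-to-red colormap reproduces the Fig. 8/9 color coding,
# e.g. matplotlib's plt.imshow(consensus_matrix(V, 2), cmap="jet", vmin=0, vmax=1).
```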
Fig. 8. The visualization of the NMF clustering from ranks 2-5 for the colon dataset.
5. CONCLUSIONS

As a parts-based machine learning algorithm, NMF has found successful application in image analysis, document clustering and cancer molecular pattern discovery. In this study, we present an NMF-based subspace kernel clustering algorithm, PG-NMF-CKHC, based on the input space, subspace and kernel space clustering framework. We have shown that PG-NMF-CKHC improves cancer molecular pattern discovery for the four well-studied datasets. It can work well for both gene expression data and protein expression data according to our current results. Our algorithm can be generalized to a family of subspace kernel classification/clustering algorithms in machine learning by selecting different transforms to generate subspaces and different kernel clustering algorithms to cluster data. For example, one can conduct kernel k-means clustering in a subspace generated by independent component analysis (ICA) applied to a high-dimensional dataset, or conduct kernel Fisher discriminant analysis (KFDA) 22 in a subspace generated by principal component analysis (PCA). Despite its promising features, it is also worth pointing out that PG-NMF based consensus kernel hierarchical clustering has the limitation of greater algorithmic complexity, especially compared with traditional hierarchical clustering (HC). However, it is clear that our algorithm is easy to fit into a parallel computing structure due to its Monte Carlo simulation mechanism. Thus, we plan to implement a parallel version of the subspace-based kernel classification algorithm for cancer molecular pattern classification in future work.
Fig. 9. The visualization of the sparse-NMF clustering from ranks 2-5 for the colon dataset.
Table 1. Comparison of the PG-NMF-CKHC classification results with those of the NMF, sparse-NMF, SVM and kNN classifications.

Cancer Name       Data Size   #type   kNN      PG-NMF-CKHC   NMF      Sparse-NMF   SVM
Leukemia          5000x38     3       0.8860   0.9737        0.9470   0.9737       0.9132
Medulloblastoma   5893x34     2       0.7611   0.9412        0.9412   0.8235       0.8300
Ovarian           15142x40    2       0.8990   0.9750        0.8000   0.7000       0.9474
Colon             2000x62     2       0.7667   0.9355        0.9032   0.7581       0.8542
Acknowledgments

The author thanks the New Faculty Research Award at Eastern Michigan University for supporting this research.
References
1. Lilien R and Farid H. Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. Journal of Computational Biology 2003; 10(6): 925-946.
2. Golub T et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286: 531-537.
3. Furey T, Cristianini N, Duffy N, Bednarski D, Schummer M and Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000; 16(10): 906-914.
4. Hautaniemi S, Yli-Harja O, Astola J, Kauraniemi P et al. Analysis and visualization of gene expression microarray data in human cancer using self-organizing maps. Machine Learning 2003; 52: 45-66.
5. Ressom H, Varghese R, Saha D, Orvisky R et al. Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics 2005; 21: 4039-4045.
6. Liu Z, Chen D and Bensmail H. Gene expression data classification with kernel principal component analysis. J Biomed Biotechnol. 2005; (2): 155-159.
7. Eisen M et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 1998; 95: 14863-14868.
8. Tamayo P et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA 1999; 96: 2907-2912.
9. Bicciato S et al. PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 2003; 19: 571-578.
10. Wall M, Rechtsteiner A, Rocha L. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis. Berrar D, Dubitzky W, Granzow M, eds. Kluwer: Norwell, 2003; 91-109.
11. Tan Y, Shi L, Tong W and Wang C. Multi-class cancer classification by total principal component regression using microarray gene expression data. Nucleic Acids Res. 2005; 33(1): 56-65.
12. Zhang X, Yap Y, Wei D, Chen F and Danchin A. Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis. European Journal of Human Genetics 2005; 1-9: 1018-4813.
13. Lee DD and Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999; 401: 788-791.
14. Lin C. Projected gradient methods for non-negative matrix factorization. Neural Computation 2007; in press.
15. Brunet J, Tamayo P, Golub T and Mesirov J. Molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 2004; 101(12): 4164-4169.
16. Gao Y and Church G. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 2005; 21(21): 3970-3975.
17. Hoyer PO. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 2004; 5: 1457-1469.
18. Yeung K and Ruzzo W. Principal component analysis for clustering gene expression data. Bioinformatics 2001; 17(9): 763-774.
19. Lee S and Batzoglou S. ICA-based clustering of genes from microarray expression data. Neural Information Processing Systems (NIPS) 2003.
20. Vapnik V. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
21. Qin J et al. Kernel hierarchical gene clustering from microarray expression data. Bioinformatics 2003; 19(16): 2097-2104.
22. Mika S, Ratsch G, Weston J, Scholkopf B and Muller KR. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX, 1999; 41-48.
EFFICIENT ALGORITHMS FOR GENOME-WIDE TAGSNP SELECTION ACROSS POPULATIONS VIA THE LINKAGE DISEQUILIBRIUM CRITERION
Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang*
Department of Computer Science and Engineering, University of California, Riverside, CA 92507, USA
*Email:
[email protected]:edu In this paper, we study the tagSNP selection problem on multiple populations using the pairwise T~ linkage disequilibrium criterion. We propose a novel combinatorial optimization model for the tagSNP selection problem, called the rninirnurn coi?inzontugSNP selection (MCTS) problem, and present efficient solutions for MCTS. Our approach consists of three main steps including (i) partitioning the SNP markers into small disjoint components, (ii) applying some data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e. the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, it is the first time tagging lower bounds are discussed in the literature. We assess the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrate that our algorithms run 3 to 4 orders of magnitude faster than the existing single-population tagging programs like FESTA,LD-Select and the multiplepopulation tagging method MultiPop-Tagselect. Our method also greatly reduces the required tagSNPs compared to LD-Select on a single population and MultiPop-Tagselect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal since they are very close to the corresponding lower bounds obtained by our method.
1. INTRODUCTION

The rapid development of high-throughput genotyping technologies has recently enabled genome-wide association studies to detect connections between genetic variants and human diseases. The single-nucleotide polymorphism (SNP) is the most frequent form of polymorphism in the human genome. Common SNPs with a minor-allele frequency (MAF) of 5% have been estimated to occur once every ~600 bps, and there are more than 10 million verified SNPs in dbSNP. Given these numbers, it is currently infeasible to consider all the available SNPs to carry out association studies. This motivates the selection of a subset of informative SNPs, called tagSNPs.

The selection of tagSNPs in silico is a well-studied research topic. Existing computational methods for tagSNP selection can be classified into two categories: haplotype-based methods 1, 12, 17, 19, 24, 28, 31, 32, 34 and haplotype-independent methods 5, 15, 16, 20-22, 25-27, 32. The haplotype-based methods require phased multi-locus haplotypes, whereas the haplotype-independent methods do not require haplotype information. The main shortcoming of haplotype-based methods is that the preprocessing step (i.e. the inference of haplotypes from genotypes) is computationally demanding. In addition, since there is not an authoritative inference method, the haplotypes generated by the existing haplotype inference methods are often quite different 7, 35. Consequently, the tagSNPs selected by the haplotype-based methods would be quite different. Recently, Carlson et al. proposed a haplotype-independent method that employs the r² linkage disequilibrium (LD) statistical criterion to
measure the association between SNPs. The tagSNPs selected by this method are shown to be effective in disease association mapping studies, because the measure r² is directly related to the statistical power of association mapping. Because this method has comparable performance at a lower computational cost than many other methods 33, 27, tagging approaches based on r² LD statistics have gained popularity among researchers in the SNP community 2, 5, 8, 22, 26, 33.

Most approaches using the r² criterion require that tagSNPs be defined within a single population, because LD patterns (see the caption of Figure 1(A) for a definition) are quite susceptible to population stratification. In two populations with different evolutionary histories, a pair of SNPs having remarkably different allele frequencies and very weak LD may show strong LD in the admixed population (see such an example in Table 1). A recent study shows that the LD patterns and allele frequencies across populations are in fact very different 29. For example, among the populations collected in the HapMap project (i.e. YRI, CEU, CHB and JPT), 81% of the SNPs in the YRI population have a near perfect proxy (i.e. SNPs that have r² ≥ 0.8 with other SNPs), while in the other three populations, 91% of the SNPs have a near perfect proxy. Therefore, tagSNPs picked from the combined populations or from one of the populations might not be sufficient to capture the variations in all populations. In order to maintain the power of association mapping, we need to generate a common (or universal) tagSNP set to type all the populations with sufficient accuracy. A simple approach to selecting a universal tagSNP set is to tag one population first and then select a supplementary set for each of the other populations one by one 22.

*Corresponding author.
Table 1. r² statistics for a pair of SNP markers in single and admixed populations. One SNP has alleles denoted A and a, while the other SNP has alleles denoted B and b. Population 3 is an even mixture of populations 1 and 2.

Population 1 (r² = 0):
         B        b
A     0.9025   0.0475   0.95
a     0.0475   0.0025   0.05
      0.95     0.05

Population 2 (r² = 0):
         B        b
A     0.0025   0.0475   0.05
a     0.0475   0.9025   0.95
      0.05     0.95

Population 3 (r² = 0.6561):
         B        b
A     0.4525   0.0475   0.5
a     0.0475   0.4525   0.5
      0.5      0.5
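The admixture effect in Table 1 is easy to verify numerically; the following few lines (our illustration, not from the paper) compute r² from the haplotype and allele frequencies listed above:

```python
# Verifying Table 1: r^2 = D^2 / (pA (1-pA) pB (1-pB)), with D = P(AB) - pA*pB.
def r_squared(p_AB, p_A, p_B):
    D = p_AB - p_A * p_B
    return D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(r_squared(0.9025, 0.95, 0.95))  # population 1: 0.0
print(r_squared(0.0025, 0.05, 0.05))  # population 2: 0.0
print(r_squared(0.4525, 0.50, 0.50))  # even admixture: 0.6561
```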
For instance, we can select a tagSNP set for non-African populations and a supplement for populations with significant African ancestry 23. However, this sequential approach might not give a satisfactory solution, as the tagSNP set selected for one population might be far from adequate to type the SNPs of the remaining populations. As a result, the supplementary tagSNP sets are large and the total number of tagSNPs chosen is far from the optimum. Moreover, the performance of the approach is sensitive to the specific order of the input populations. In order to generate the smallest set of tagSNPs on K populations, one would have to execute the tagging procedure K! times considering all possible orderings, which would be extremely inefficient for genome-wide tagging. We can improve the performance of the tagging approach by evaluating multiple populations at the same time. When choosing tagSNPs, we prefer those with "good properties" with respect to the collection of populations as a whole. An example of our tagging strategy is given in Figure 1.
Fig. 1. (A) LD patterns in two populations. The vertices denote the SNP markers and the edges denote pairs of markers with strong LD (i.e. the r² measure between the markers is greater than a given threshold). (B) Tagging results of the above simple sequential approach. We first choose markers 3 and 6 to tag population 1 and then choose an additional marker 5 to tag population 2. Three markers are selected in total to tag both populations. (C) Tagging results of an improved approach. We select markers 4 and 6 considering both populations simultaneously. Only two markers are selected in total to tag both populations.
Previous work on tagSNP selection based on the linkage disequilibrium criterion. There is a large body of scientific literature on the problem of selecting tagSNPs based on the r² LD criterion. Carlson et al. suggested a greedy procedure called LD-Select, which works as follows: (i) select the SNP with the maximum number of proxies, (ii) remove the SNP and its proxies from consideration, and (iii) repeat the above two steps until all SNPs have been tagged 5. This algorithm is very simple; however, it may miss solutions with the smallest number of tagSNPs in general, as shown in 26. More recently, Qin et al. implemented a comprehensive search algorithm called FESTA, which first breaks down a large set of markers into disjoint pieces (called precincts), and then performs an exhaustive search on each piece if the estimated computational cost is below a certain threshold 26. FESTA usually gives a better solution than LD-Select, but due to the fact that it employs exhaustive search, it is too slow to be practical for genome-wide tagSNP selection.

The above methods are only applicable to single-population tagSNP selection. Recently, Howie et al. presented an algorithm for multiple populations, called MultiPop-Tagselect. MultiPop-Tagselect combines the tagSNPs selected for each population by LD-Select to produce a universal tagSNP set for a collection of populations 13. The algorithm works reliably, and it could in principle be used with any tagSNP selection method for single populations. However, its accuracy highly depends on the performance of the single-population tagSNP selection method. Magi et al. 22 also designed a software tool called REAPER which is rather similar to LD-Select if applied to a single population. To select a universal tagSNP set for several populations, it first selects a tagSNP set for one population, and then it selects a supplement for the remaining populations one by one. As mentioned above, the performance of the method crucially depends on the choice of the initial tagSNP set and the ordering of the populations. It is not clear, moreover, how one should select tagSNPs for the first population so as to minimize the size of the final solution. The LD-Select loop is sketched below.
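A minimal sketch of the LD-Select greedy loop described above (our illustration; Carlson et al.'s implementation additionally handles MAF filtering and tie-breaking rules not shown here):

```python
# LD-Select greedy sketch for a single population. snps[v] is the set of
# proxies of v (markers with r^2 >= threshold to v), excluding v itself.
def ld_select(snps):
    untagged = set(snps)
    tags = []
    while untagged:
        # (i) pick the SNP covering the most still-untagged markers
        best = max(sorted(untagged),
                   key=lambda v: len((snps[v] | {v}) & untagged))
        tags.append(best)
        # (ii) remove it and its proxies; (iii) repeat until all are tagged
        untagged -= snps[best] | {best}
    return tags

# A 7-marker toy graph in the spirit of Figure 1(A), population 1:
print(ld_select({1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3},
                 5: {6}, 6: {5, 7}, 7: {6}}))   # [3, 6]
```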
Our contribution on tagSNP selection based on the linkage disequilibrium criterion. In this paper, we take a different approach to the multi-population tagSNP selection problem. Contrary to the previous methods, we do not generate a tagSNP set for each individual population separately; rather, we evaluate all the populations at the same time. The method that we propose can be used to generate a universal or cosmopolitan tagSNP set for multi-ethnic, ethnicity-unknown or admixed populations 13. The main idea of our approach is to transform the multi-population tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem (to be defined more precisely later in the paper), into a minimum common dominating vertex set problem on multiple graphs. Each graph corresponds to one of the populations under consideration. The vertices in a graph correspond to the SNP markers of the population, and there is an edge between two markers when they are in strong LD
(according to some given threshold). To find an optimal solution to MCTS, we first decompose it into disjoint subproblems, each of which is essentially a connected component of the union graph a and represents a precinct as defined in 26. Then, for each precinct, we apply three data reduction rules repeatedly to further reduce the size of the subproblem, until none of the rules can be applied anymore. Finally, the reduced subproblems are solved by either a simple greedy approach (similar to cosmopolitan tagging) or a more sophisticated Lagrangian relaxation heuristic. Both algorithms will be explained in detail later in the paper. Along with the solution produced by our algorithm, we also obtain lower bounds on the minimum number of tagSNPs required, which allows us to quantitatively assess how close our solution is to the optimum.

We evaluate the performance of our method on real HapMap data for genome-wide tagging. The experimental results demonstrate that our algorithms run 3 to 4 orders of magnitude faster than existing single-population tagging programs like FESTA and LD-Select and the multiple-population tagging method MultiPop-Tagselect. Our method also greatly reduces the number of required tagSNPs compared to LD-Select on a single population and MultiPop-Tagselect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal since they are very close to the corresponding lower bounds provided by our method. For example, the gap between our solution and the lower bound is 1061 SNPs with the r² threshold being 0.5 and 142 SNPs with the r² threshold being 0.8, given the entire human genome with 2,862,454 SNPs (MAF ≥ 5%).

The rest of the paper is organized as follows. In Section 2, we first propose a combinatorial optimization model for the MCTS problem and then present a computational complexity result. In Section 3, we introduce three rules to reduce the size of the problem, and devise a greedy tagging algorithm, called GreedyTag, and a Lagrangian relaxation heuristic, called LRTag. After showing the experimental results in Section 4, we conclude the paper with some remarks about the performance of our tagging method in Section 5. Due to the page limit, some of the illustrative figures and tables are given in the appendix.

2. FORMULATION OF THE MCTS PROBLEM
Consider K distinct populations and a set V of biallelic SNP markers $v_1, v_2, \ldots, v_n$. Since the r² coefficient is unreliable for rare SNPs when the sample size is small 5, we will consider only SNPs with MAF ≥ 5%. The set of SNPs might be different from population to population. We use $V_i \subseteq V$ to denote the SNP set in population i. Clearly, we have $V = V_1 \cup V_2 \cup \ldots \cup V_K$.
For a pair of SNP markers $v_{j_1}$ and $v_{j_2}$ in a population i (for any $1 \le i \le K$), the r² coefficient between them is denoted by $r_i^2(v_{j_1}, v_{j_2})$. Markers $v_{j_1}$ and $v_{j_2}$ are said to be in high LD in population i if $r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0$, where $\gamma_0$ is a pre-defined threshold ($\gamma_0$ will be set to 0.5 or higher in our study). Moreover, $v_{j_1}$ (or $v_{j_2}$) is considered to be a tagSNP or proxy for $v_{j_2}$ (or $v_{j_1}$, respectively) in population i. For convenience, we define $E_i$ to be the set containing all the high-LD marker pairs in population i, i.e. $E_i = \{(v_{j_1}, v_{j_2}) \mid r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0,\ v_{j_1}, v_{j_2} \in V_i\}$. Now we can formally define the MCTS problem.

MINIMUM COMMON TAGSNP SELECTION (MCTS)
Instance: A collection of K populations and a set V of biallelic SNP markers. Each population i ($1 \le i \le K$) has its marker set $V_i \subseteq V$ and LD patterns $E_i = \{(v_{j_1}, v_{j_2}) \mid r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0,\ v_{j_1}, v_{j_2} \in V_i\}$, where $\gamma_0$ is a pre-defined threshold.
Feasible solution: A subset $T \subseteq V$ such that for any marker $v \in V_i$, $v \notin T$, from some population i, there exists a marker $v'$ in $T \cap V_i$ with $(v, v') \in E_i$ (that is, $r_i^2(v, v') \ge \gamma_0$).
Objective: Minimize $|T|$.

It is easy to observe that any feasible solution to the MCTS problem is a common dominating vertex set in the graphs $\{G_i \mid 1 \le i \le K\}$, where $G_i = (V_i, E_i)$. In particular, the smallest set of tagSNPs for a single population is a minimum dominating vertex set of the corresponding graph. Obviously, the MCTS problem is NP-hard, since it is a generalization of the minimum dominating vertex set problem, which is known to be NP-hard 9.
Theorem 2.1. The MCTS problem is NP-hard.

We introduce some additional notation to be used later. To differentiate the occurrences of a marker in different populations, we use $v_j^i$ to represent the jth marker appearing in the ith population. Given a marker $v_j \in V$, we define the following two sets:

$N^i(v_j) = \{v_{j'}^i \mid (v_j, v_{j'}) \in E_i,\ v_j, v_{j'} \in V_i\} \cup \{v_j^i \mid v_j \in V_i\}$,
$N^*(v_j) = \bigcup_{1 \le i \le K} N^i(v_j)$.    (1)

The set $N^i(v_j)$ represents the subset of marker occurrences in strong LD with $v_j$ in population i, and the set $N^*(v_j)$ represents the union of such subsets over all the populations. Note that $N^i(v_j)$ is empty if $v_j \notin V_i$. Given a marker occurrence $v_j^i \in V_i$ in population i, we define the following set:

$C(v_j^i) = \{v_{j'} \mid (v_j, v_{j'}) \in E_i,\ v_j, v_{j'} \in V_i\} \cup \{v_j\}$.    (2)

The set $C(v_j^i)$ is the subset of markers each of which can tag the occurrence $v_j^i$, whereas $N^*(v_j)$ is the subset of occurrences that the marker $v_j$ can tag.

a Given graphs $G_i = (V_i, E_i)$ ($1 \le i \le K$), the union graph is defined as $G = (V, E)$, where $V = \bigcup_i V_i$ and $E = \bigcup_i E_i$.
Based on the above definitions, the MCTS problem can also be viewed as a set cover problem: the universe U is the set of all marker occurrences, and each marker $v_j$ covers the occurrences in $N^*(v_j)$.
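The following sketch (ours, for illustration only) implements the plain greedy for this dominating-set/set-cover view; the paper's GreedyTag and LRTag additionally apply the data reduction rules of Section 3 and compute lower bounds:

```python
# Greedy for the common dominating-set / set-cover view of MCTS (sketch).
# pops[i] maps each marker of V_i to its proxy set within V_i; markers not
# genotyped in a population must not appear in that population's graph.
def greedy_mcts(pops):
    universe = {(i, v) for i, g in enumerate(pops) for v in g}  # occurrences
    markers = {v for g in pops for v in g}
    # N*(v): all occurrences that marker v can tag, across populations
    cover = {v: {(i, u) for i, g in enumerate(pops)
                 for u in g if u == v or v in g[u]} for v in markers}
    tags = set()
    while universe:
        v = max(sorted(markers - tags),
                key=lambda w: len(cover[w] & universe))
        tags.add(v)
        universe -= cover[v]
    return tags

pop1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop2 = {2: {3}, 3: {2, 4}, 4: {3}}
print(greedy_mcts([pop1, pop2]))   # {2, 3}: one common tag set for both
```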
Their partial correlation $\sigma_{YX|Z}$ is then given by

$\hat{\sigma}_{YX|Z} = \frac{a_2}{\sqrt{a_1 a_3}}$,

with

$a_1 = \Sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$,
$a_2 = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$,
$a_3 = \Sigma_{XX} - \Sigma_{XZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$.

Recalling results for conditional Gaussian distributions, these can be denoted by $a_1 = \Sigma_{Y|Z}$, $a_2 = \Sigma_{XY|Z}$ and $a_3 = \Sigma_{X|Z}$. Thus $\hat{\sigma}_{YX|Z} = \Sigma_{Y|Z}^{-1/2}\Sigma_{XY|Z}\Sigma_{X|Z}^{-1/2}$. Extending the above result from the mutual information to the directed information case, we have

$\rho_{DTI} = \sqrt{1 - e^{-2\sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1})}}$.

We recall the primary difference between MI and DTI (note the superscript on X):

MI: $I(X^N; Y^N) = \sum_{i=1}^{N} I(X^N; Y_i \mid Y^{i-1})$.
DTI: $I(X^N \to Y^N) = \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1})$.

Having found the normalized DTI, we ask if the obtained DTI estimate is significant with respect to a 'null DTI distribution' obtained by random chance. This is addressed in the next two sections.

6. KERNEL DENSITY ESTIMATION (KDE)

The goal in density estimation is to find a probability function $\hat{f}(z)$ that approximates the underlying density $f(z)$ of the random variable Z. Under certain regularity conditions, the kernel density estimator $\hat{f}_h(z)$ at the point z is given by

$\hat{f}_h(z) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{z - z_i}{h}\right)$,

where n is the number of samples $z_1, z_2, \ldots, z_n$ from which the density is to be estimated and h is the bandwidth of the kernel $K(\cdot)$ used during density estimation.
A kernel density estimator at z works by weighting the samples (in $z_1, z_2, \ldots, z_n$) around z by a kernel function (window) and counting the relative frequency of the weighted samples within the window width. As is clear from such a framework, the choice of the kernel function $K(\cdot)$ and the bandwidth h determines the fit of the density estimate. Some figures of merit to evaluate various kernels are the asymptotic mean integrated squared error (AMISE), the bias-variance characteristics and the region of support [8]. It is preferred that a kernel have a finite range of support, low AMISE and a favorable bias-variance tradeoff. The bias is reduced if the kernel bandwidth (region of support) is small, but the estimate then has higher variance because of the small effective sample size; for a larger bandwidth, this is reversed (i.e. large bias and smaller variance). Under these requirements, the Epanechnikov kernel has most of these desirable characteristics, i.e. a compact region of support, the lowest AMISE compared to other kernels, and a favorable bias-variance tradeoff [8]. The Epanechnikov kernel is given by

$K(u) = \frac{3}{4}(1 - u^2)\, I(|u| \le 1)$,

with $I(\cdot)$ being the indicator function conveying a window spanning [-1, 1] centered at 0. An optimal choice of the bandwidth is $h = 1.06\,\hat{\sigma}\, n^{-1/5}$, following [14]; here $\hat{\sigma}$ is the standard error from the bootstrap DTI samples $z_1, z_2, \ldots, z_n$. Hence the kernel density estimate for the bootstrapped DTI (with n = 1000 samples), $\hat{I}_B(X^N \to Y^N)$, becomes $\hat{f}_h(z) = \frac{3}{4nh}\sum_{i=1}^{n}\left[1 - \left(\frac{z - z_i}{h}\right)^2\right] I\left(\left|\frac{z - z_i}{h}\right| \le 1\right)$, with $h = 2.676\,\hat{\sigma}$ and n = 1000. We note that $\hat{I}_B(X^N \to Y^N)$ is obtained by finding the DTI for each random permutation of the X, Y time series, performing this permutation B times, and obtaining an estimate of the density over these B permutations.
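A minimal sketch of this estimator (our illustration, following the formulas above) with the Epanechnikov kernel and the stated bandwidth rule:

```python
# Epanechnikov KDE sketch following the formulas above.
import numpy as np

def epanechnikov_kde(samples, z, h=None):
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if h is None:
        h = 1.06 * samples.std(ddof=1) * n ** (-0.2)   # bandwidth rule of [14]
    u = (z - samples) / h
    k = 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)     # K(u) = (3/4)(1-u^2) 1{|u|<=1}
    return k.sum() / (n * h)

# Density of a synthetic null DTI distribution at an extreme observed value:
null_dti = np.random.RandomState(0).beta(2, 8, size=1000)  # stand-in for B permutations
print(epanechnikov_kde(null_dti, z=0.9911))                # ~0: the observation is extreme
```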
7. BOOTSTRAPPED CONFIDENCE INTERVALS

Since we do not know the true distribution of the DTI estimate, we find an approximate confidence interval for the DTI estimate $\hat{I}(X^N \to Y^N)$ using the bootstrap [19]. We denote the cumulative distribution function (over the bootstrap samples) of $\hat{I}(X^N \to Y^N)$ by $F_{\hat{I}_B(X^N \to Y^N)}$ (Figure 3). Let the mean of the bootstrapped null distribution be $\bar{I}_B(X^N \to Y^N)$. We denote by $t_{1-\alpha}$ the $(1-\alpha)$ quantile of this distribution, i.e.

$P\left(\frac{\hat{I}_B(X \to Y) - \bar{I}_B(X \to Y)}{\hat{\sigma}} \le t_{1-\alpha}\right) = 1 - \alpha$.

Since we need the real $\hat{I}(X^N \to Y^N)$ to be significant and close to 1, we need $\hat{I}(X^N \to Y^N) \ge \bar{I}_B(X^N \to Y^N) + t_{1-\alpha} \times \hat{\sigma}$, with $\hat{\sigma}$ being the standard error of the bootstrapped distribution,

$\hat{\sigma} = \sqrt{\frac{\sum_{b=1}^{B}\left[\hat{I}_b(X^N \to Y^N) - \bar{I}_B(X^N \to Y^N)\right]^2}{B - 1}}$,

where B is the number of bootstrap samples. For the Pax2-Gata3 interaction, we show the kernel density estimate of the bootstrapped histogram using the Epanechnikov kernel (Fig. 3) as well as the position of the true DTI estimate in relation to the overall histogram. With the obtained kernel density estimate of the Pax2-Gata3 interaction, we can find significance values of the true DTI estimate in relation to the bootstrapped null distribution.
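The resulting significance test can be sketched as follows (our illustration; the null samples would come from the permutation procedure of Section 6):

```python
# Bootstrapped significance sketch: accept an influence only if the observed
# DTI exceeds the null mean plus t_{1-alpha} standard errors, as above.
import numpy as np

def dti_is_significant(observed, null_samples, alpha=0.05):
    null_samples = np.asarray(null_samples, dtype=float)
    mean, se = null_samples.mean(), null_samples.std(ddof=1)
    t = np.quantile((null_samples - mean) / se, 1.0 - alpha)  # t_{1-alpha}
    return observed >= mean + t * se

rng = np.random.RandomState(1)
print(dti_is_significant(0.9911, rng.uniform(0.0, 0.4, size=1000)))  # True
```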
Fig. 3. Cumulative distribution function for the bootstrapped I(Pax2 → Gata3). The true I(Pax2 → Gata3) = 0.9911.
8. SUMMARY OF ALGORITHM

We now present two versions of the DTI algorithm: one which involves the inference of a general influence network between all genes of interest (unsupervised-DTI), and another, a focused search for effector genes which influence one particular gene of interest (supervised-DTI). Our proposed approach for supervised-DTI is as follows; a sketch of the loop is given after the list.

- Identify the G key genes based on the required phenotypical characteristic using fold-change studies.
- Preprocess the gene expression profiles by normalization and cubic spline interpolation. We now assume that there are N points for each gene.
- Bin each of the expression profiles into K quantiles (here, we use K = 4), thus building a joint histogram. The granularity of sampling can be an issue during entropy estimation, hence the Darbellay-Vajda method can also be used here. We note that the presence of probe-level or sample replicates greatly enhances the accuracy of the entropy estimation step.
- For each pair of genes $A_i$ and B among these G genes:
  - Look for a phylogenetically conserved binding site of the protein encoded by gene $A_i$ in the upstream region of gene B.
  - Find $DTI(A_i, B) = I(A_i^N \to B^N)$ and the normalized DTI from $A_i$ to B, $\widetilde{DTI}(A_i, B) = \sqrt{1 - e^{-2 I(A_i^N \to B^N)}}$.
  - Bootstrapping over several permutations of the data points of $A_i$ and B yields a null distribution (using KDE) for $DTI(A_i, B)$.
  - If the true $DTI(A_i, B)$ is greater than the 95% upper limit of the confidence interval (CI) from this null histogram, infer a potential influence from $A_i$ to B. The value of the normalized DTI from $A_i$ to B gives the putative strength of interaction/influence. Every gene $A_i$ which potentially influences B is an 'affector'.
- This search is done for every gene $A_i$ among these G genes ($A_1, A_2, \ldots, A_G$).
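The following compact sketch (our simplification, not the authors' estimator) strings the steps together: quantile binning, a lag-1 plug-in approximation of the conditional mutual information terms, normalization, and a permutation null. The Darbellay-Vajda partitioning and full-history conditioning discussed above are omitted for brevity:

```python
# Supervised-DTI loop, simplified: bin into K quantiles, approximate the
# directed information with a lag-1 plug-in estimate of I(A_{i-1}; B_i | B_{i-1}),
# normalize, and test against a permutation null.
import numpy as np
from collections import Counter

def quantize(x, k=4):
    """Map a profile to K quantile bins (0..k-1)."""
    ranks = np.argsort(np.argsort(x))
    return ranks * k // len(x)

def _entropy(cols):
    """Plug-in joint entropy of the tuples formed by the given columns."""
    counts = Counter(zip(*cols))
    n = sum(counts.values())
    return -sum(c / n * np.log(c / n) for c in counts.values())

def dti(a, b, k=4):
    a, b = quantize(a, k), quantize(b, k)
    x, y, z = a[:-1], b[1:], b[:-1]
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), all plug-in estimates
    raw = _entropy([x, z]) + _entropy([y, z]) - _entropy([x, y, z]) - _entropy([z])
    return np.sqrt(1.0 - np.exp(-2.0 * max(raw, 0.0)))   # normalized DTI

def influence(a, b, n_perm=1000, alpha=0.05, seed=0):
    """Return (significant?, normalized DTI from a to b)."""
    rng = np.random.RandomState(seed)
    observed = dti(a, b)
    null = [dti(rng.permutation(a), b) for _ in range(n_perm)]
    return observed > np.quantile(null, 1.0 - alpha), observed
```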
We observe that both phylogenetic information and expression data are inherently built into the influence network inference step above.

For unsupervised DTI, we adapt the above approach for every pair of genes (A, B) in the list, noting that DTI(A, B) ≠ DTI(B, A). In this case we are not looking at any interaction in particular, but are interested in the entire influence network that can potentially be inferred from the given time series expression data. The network adjacency matrix has entries depending on the direction of influence and is related to the strength of influence as well as the false discovery rate. We note that it is fairly simple to include some a priori biological knowledge (if a subset of upstream TFs at the promoter are already known, either experimentally or from other sources) - a search among the binding partners of these known TFs can reduce the set of potential effectors and reduce the complexity of the unsupervised procedure. Another element that has been added is the control of the false discovery rate (FDR) [27] to screen each of the G(G - 1) hypotheses (both directions) during network discovery amongst G genes.

In Table 1 we compare the various contemporary methods of directed network inference. Recent literature has introduced several interesting approaches, such as graphical Gaussian models (GGMs), the coefficient of determination (COD) and state space models (SSMs), for directed network inference. This comparison is based primarily on expectations from such inference procedures - we would like any such metric/procedure to:

- Resolve cycles in recovered interactions.
- Be capable of resolving directional and potentially non-linear interactions. This is because interactions amongst genes involve non-linear kinetics.
- Be a non-parametric procedure, to avoid distributional assumptions (noise etc.).
- Be capable of recovering interactions that a biologist might be interested in. Rather than use a method that purely discovers interactions underlying the data, the biologist should be able to use prior knowledge (from the literature, perhaps). For example, a biologist can examine the strength and significance of a known interaction and use this as a basis for finding other such interactions.

Table 1. Comparison of various network inference methods.

Method    Resolve cycles   Nonlinear framework   Search for interaction   Nonparametric framework
SSM [1]   Y                N                     N                        N
COD [3]   N                N                     Y                        N
GGM [6]   N                Y                     N                        N
DTI [5]   Y                Y                     Y                        Y

From the above comparisons, we see that DTI is the only metric which can recover interactions under all these considerations.
9. RESULTS

In this section, we give some scenarios where DTI can complement existing bioinformatics strategies to answer several questions pertaining to transcriptional regulatory mechanisms. We address three different questions. To infer gene influence networks between genes that have a role in early kidney development and T-cell activation, we use unsupervised DTI with relevant microarray expression data, noting that these influence networks are not necessarily transcriptional regulatory networks. To find transcription factors that might be involved in the regulation of a target gene (like Gata3) at the promoter, a common approach is to first look for phylogenetically conserved binding motif sequences across related species. These species are selected based on whether the particular biological process is conserved in them. To add additional credence to the role of these conserved TFBSes, microarray expression can be integrated via supervised DTI to check for evidence of an influence between the TF-encoding gene and the target gene. Before proceeding, we examine the performance of this approach on synthetic data.

9.1. Synthetic Network

A synthetic network is constructed in the following fashion: we assume that there are two genes g1 and g3 which drive the remaining genes of a seven-gene network. The evolution equations are as below:

$g_{2,t} = \frac{1}{2} g_{1,t-1} + \frac{1}{3} g_{3,t-2} + g_{7,t-1}$;
$g_{4,t} = g_{2,t-1} + g_{3,t-1}^{1/2}$;
$g_{5,t} = g_{2,t-2} + g_{4,t-1}$;
$g_{7,t} = \frac{1}{2} g_{4,t-1}^{1/3}$.
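For reference, simulating the reconstructed equations is straightforward; in this sketch (ours) the driver profiles for g1 and g3 and the initial conditions are assumptions:

```python
# Simulating the reconstructed synthetic network. The profiles of the
# driver genes g1, g3 and the initial two time points are assumptions.
import numpy as np

def simulate(T=100, seed=0):
    rng = np.random.RandomState(seed)
    g = rng.rand(8, T)                       # rows 1..7 used; positive values
    g[1], g[3] = np.abs(rng.randn(T)), np.abs(rng.randn(T))   # drivers
    for t in range(2, T):
        g[2, t] = 0.5 * g[1, t-1] + g[3, t-2] / 3.0 + g[7, t-1]
        g[4, t] = g[2, t-1] + g[3, t-1] ** 0.5
        g[5, t] = g[2, t-2] + g[4, t-1]
        g[7, t] = 0.5 * g[4, t-1] ** (1.0 / 3.0)
    return g[1:]

profiles = simulate()   # feed gene pairs to the DTI estimator sketched earlier
```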
For the purpose of comparison, we study the performance of the Coefficient of Determination (COD) approach for directed influence network determination. The COD allows the determination of association between two genes via an R² goodness-of-fit statistic. The methods of [3] are implemented on the time series data. Such a study is useful to determine the relative merits of each approach. We believe that no one procedure can work for every application, and the choice of an appropriate method would be governed by the biological question under investigation. Each of these methods uses some underlying assumptions, and if these are consistent with the question that we ask, then that method has utility.
Fig. 4. The synthetic network as recovered by (a) DTI and (b) COD.
As can be seen (Fig. 4), though COD can detect linear lag influences, the non-linear ones are missed. DTI detects these influences and almost exactly reproduces the synthetic network. Given the non-linear nature of transcriptional kinetics, this is essential for reliable network inference. DTI is also able to resolve loops and cycles (g3, [g2, g4], g5 and g2, g4, g7, g2). Based on these observations, we examine the networks inferred using DTI in both the supervised and unsupervised settings.

9.2. Directed Network Inference: Gata3 Regulation in Early Kidney Development

Biologists have an interest in influence networks that might be active during organ development. Advances in laser capture microdissection, coupled with those in microarray methodology, have enabled the investigation of temporal profiles of genes putatively involved in these embryonic processes. Forty-seven genes are expressed differentially between the ureteric bud and metanephric mesenchyme [25] and are putatively involved in bud branching during kidney development. The expression data [10] temporally profile kidney development from day 10.5 dpc to the neonate stage. The influence amongst these genes is
shown below (Fig. 5). Several of the presented interactions are biologically validated, but there is an interest in confirming the novel ones pointed out in the network. The annotations of some of these genes are given in Table 2.
Fig. 5. Overall influence network using DTI during early kidney development.
Some of the interactions that have been experimentally validated include the Rara-Mapk1 [18], Pax2-Gata3 [16] and Agtr-Pax2 [17] interactions. We note that this result clarifies the application of DTI for network inference in an unsupervised manner - i.e. discovering interactions revealed by the data rather than examining the strengths of interactions known a priori. Such a scenario will be explored later (Sec. 9.4). We note that though several interaction networks are recovered, we only show the largest network, including Gata3, because this is the gene of interest in this study. An important shortcoming of most gene network inference approaches is that these relationships are detected based on mRNA expression levels alone. To understand these interactions with greater fidelity, there is a need to integrate other data sources corresponding to phosphorylation, dephosphorylation as well as other post-transcriptional/translational activities, including miRNA activity.

9.3. Directed Network Inference: T-cell Activation
To clarify the validity of the presented approach, we present a similar analysis on another dataset - the T-cell expression data [1], in Fig. 6. This dataset examines the expression of various genes after T-cell activation using stimulation with the phorbol ester PMA and ionomycin. It has the profiles of about 58 genes over 10 time points, with 44 (34+10) replicate measurements for each time point.
Fig. 6. DTI based T-cell network.
Several of these interactions are confirmed in earlier studies [1, 29, 30, 31] and again point to the strength of DTI in recovering known interactions. The annotations of some of these genes are given in Table 3. We note that the network of Fig. 6 shows the largest influence network (containing Gata3) that can be recovered. Gata3 is involved in T-cell development as well as kidney development, and hence it is interesting to see networks relevant to each context in Figs. 5 and 6. Also, these 58 genes relevant to T-cell activation are very different from those for kidney development, with fairly low overlap. For example, this list does not include Pax2 (which is relevant in the kidney development data).

9.4. Phylogenetic conservation of TFBS effectors
A common approach to the determination of "functional" transcription factor binding sites in genomic regions is to look for motifs in regions conserved across various species. Here we focused on the inter-species conservation of TFBSes (Fig. 2) in the Gata3 promoter to determine which of them might be related to transcriptional regulation of Gata3. Such conservation across multiple species suggests selective evolutionary pressure on the region, with a potential relevance for function. As can be seen in Fig. 2, we examine the Gata3 gene promoter and find at least forty different transcription factors that could putatively bind at the promoter as part of the transcriptional complex. Some of these TFs, however, belong to the same family.
Table 2. Functional annotations (Entrez Gene) of some of the genes with Gata2 and Gata3 during nephrogenesis.

Gene Symbol   Gene Name                                    Possible Role in Nephrogenesis (Function)
Rara          Retinoic Acid Receptor                       crucial in early kidney development
Gata2         GATA binding protein 2                       several aspects of urogenital development
Gata3         GATA binding protein 3                       several aspects of urogenital development
Pax2          Paired Homeobox-2                            conversion of MM precursor cells to tubular epithelium
Lamc2         Laminin                                      cell adhesion molecule
Pgf           Placental Growth Factor                      arteriogenesis, growth factor activity during development
Col18a1       collagen, type XVIII, alpha 1                extracellular matrix structural constituent, cell adhesion
Agtrap        Angiotensin II receptor-associated protein   ureteric bud cell branching
Table 3. Functional annotations of some of the genes following T-cell activation.

Gene Symbol   Gene Name                                 Possible Role in T-cell Activation (Function)
Casp7         Caspase 7                                 involved in apoptosis
JunD          Jun D proto-oncogene                      regulatory role in T lymphocyte proliferation and Th cell differentiation
CKR1          Chemokine Receptor 1                      negative regulator of the antiviral CD8+ T cell response
Il4r          Interleukin 4 receptor                    inhibits IL4-mediated cell proliferation
Mapk4         Mitogen activated kinase 4                signal transduction
AML1          acute myeloid leukemia 1; aml1 oncogene   CD4 silencing during T-cell differentiation
Rb1           Retinoblastoma 1                          cell cycle control
Using supervised DTI, we examined the strength of influence from each of the TF-encoding genes ($A_i$) to Gata3, based on expression level [10, http://spring.imb.uq.edu.au/]. These "strength of influence" DTI values are first checked for significance at a p-value of 0.05 and then ranked from highest to lowest (noting that the objective is to maximize $I(A_i \to Gata3)$). Based on this ranking, we indicate some of the TFs that have the highest influence on Gata3 expression (Fig. 7). Obviously, this information is far from complete, because of examination only at the mRNA level for both the effectors and Gata3.
Fig. 7. Putative upstream TFs using DTI for the Gata3 gene. The numbers in each TF oval represent the DTI rank of the respective TF.
Table 4 shows the embryonic kidney-specific expression of the TFs from Fig. 7. This is an independent annotation obtained from UniProt (http://expasy.org/sprot/).
Table 4. Functional annotations of some of the transcription factor genes putatively influencing Gata3 regulation in kidney.

Gene Symbol   Description                                  Expressed in Kidney
PPAR          peroxisome proliferator-activated receptor   Y
Pax2          Paired Homeobox-2                            Y
HIF1          Hypoxia-inducible factor 1                   Y
SP1           SP1 transcription factor                     Y
GLI           GLI-Kruppel family member                    Y
EGR3          early growth response 3                      Y
To understand the notion of kidney-specific regulation of Gata3 expression by various transcription factors, we have integrated three different criteria. We expect that the TFs regulating expression would have an influence on Gata3 expression, be expressed in the kidney, and have a conserved binding site at the Gata3 promoter. This is clarified in part by Fig. 7 and Table 4. As an example, we see that the TFs Pax2, PPAR and SP1 have high influence via DTI and are expressed in embryonic kidney (Table 4), apart from having conserved TFBSes. This lends good computational evidence for the role of these TFs in Gata3 regulation, and presents a reasonable hypothesis worthy of experimental validation. As an additional step, we also examined the influence for another two TFs - STE12 and HP1, both of which have a high co-expression correlation with Gata3 as well as conserved TFBSes in the promoter region. The DTI criterion gave us no evidence of influence between these two TFs and Gata3's activity. We believe that this information, coupled with the present evidence concerning the non-kidney specificity of STE12 and HP1, presents some argument for the non-involvement of these TFs in kidney-specific regulation of Gata3. Hopefully, these findings will guide a more focused experiment to identify the key TFs involved in Gata3 activity.
CONCLUSIONS

In this work, we have presented the notion of directed information (DTI) as a reliable criterion for the inference of influence in gene networks. After motivating the utility of DTI in discovering directed non-linear interactions, we present two variants of DTI that can be used depending on context. One version, unsupervised-DTI, like traditional network inference, enables the discovery of influences (regulatory or non-regulatory) among any given set of genes. The other version (supervised-DTI) aids the modeling of the strength of influence between two specific genes of interest - a question that arises in the study of transcriptional influence. It is interesting that DTI enables the use of the same framework for both these purposes and is general enough to accommodate arbitrary lag, non-linearity, loops and direction. We see that the above combination of supervised and unsupervised variants enables their applicability to several important problems in bioinformatics (such as upstream TF discovery), some of which are presented in the Results section. The network inference approach can also allow the incorporation of additional biophysical knowledge - both pertaining to physical mechanisms as well as protein interactions that exist during transcription. We point out that, given the diverse nature of biological data of varying throughput, one has to adopt an approach that integrates such data to make biologically relevant findings, and hence the DTI metric fits very naturally into such an integrative framework.
ACKNOWLEDGEMENTS

The authors gratefully acknowledge the support of the NIH under award 5R01-GM028896-21 (J.D.E). We would like to thank Prof. Sandeep Pradhan and Mr. Ramji Venkataramanan for useful discussions on directed information. We are also grateful to the reviewers for having helped us to improve the quality of the manuscript.
References
1. Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL, Falciani F, "Modeling T-cell activation using gene expression profiling and state-space models", Bioinformatics, 20(9): 1361-72, June 2004.
2. Stuart RO, Bush KT, Nigam SK, "Changes in gene expression patterns in the ureteric bud and metanephric mesenchyme in models of kidney development", Kidney International, 64(6): 1997-2008, December 2003.
3. Hashimoto RF, Kim S, Shmulevich I, Zhang W, Bittner ML, Dougherty ER, "Growing genetic regulatory networks from seed genes", Bioinformatics 2004 May 22; 20(8): 1241-7.
4. Woolf PJ, Prudhomme W, Daheron L, Daley GQ, Lauffenburger DA, "Bayesian analysis of signaling networks governing embryonic stem cell fate decisions", Bioinformatics 2005 Mar; 21(6): 741-53.
5. Rao A, Hero AO, States DJ, Engel JD, "Inference of biologically relevant gene influence networks using the directed information criterion", Proc. of the IEEE Conference on Acoustics, Speech and Signal Processing, 2006.
6. Opgen-Rhein R and Strimmer K, "Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data", Proc. of the Fourth International Workshop on Computational Systems Biology, WCSB, 2006.
7. G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Trans. on Information Theory, vol. 45, pp. 1315-1321, May 1999.
8. Hastie T, Tibshirani R, The Elements of Statistical Learning, Springer 2002.
9. Geweke J, "The measurement of linear dependence and feedback between multiple time series," Journal of the American Statistical Association, 1982, 77, 304-324. (With comments by E. Parzen, D. A. Pierce, W. Wei, and A. Zellner, and rejoinder.)
10. Challen G, Gardiner B, Caruana G, Kostoulias X, Martinez G, Crowe M, Taylor DF, Bertram J, Little M, Grimmond SM, "Temporal and spatial transcriptional programs in murine kidney development", Physiol Genomics 2005 Oct 17; 23(2): 159-71.
11. Kreiman G, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes", Nucleic Acids Res. 2004 May 20; 32(9): 2889-900.
12. MacIsaac KD, Fraenkel E, "Practical strategies for discovering regulatory DNA sequence motifs", PLoS Comput Biol. 2006 Apr; 2(4): e36.
13. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A, "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context", BMC Bioinformatics 2006 Mar 20; 7 Suppl 1: S7.
14. J. Ramsay, B. W. Silverman, Functional Data Analysis (Springer Series in Statistics), Springer 1997.
15. H. Joe, "Relative entropy measures of multivariate dependence", J. Am. Statist. Assoc., 84: 157-164, 1989.
16. Grote D, Souabni A, Busslinger M, Bouchard M, "Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney", Development 2006 Jan; 133(1): 53-61.
17. Zhang SL, Moini B, Ingelfinger JR, "Angiotensin II increases Pax-2 expression in fetal kidney cells via the AT2 receptor", J Am Soc Nephrol. 2004 Jun; 15(6): 1452-65.
18. Balmer JE, Blomhoff R, "Gene expression regulation by retinoic acid", J. Lipid Res. 2002 Nov; 43(11): 1773-808.
19. Efron B, Tibshirani RJ, An Introduction to the Bootstrap (Monographs on Statistics and Applied Probability), Chapman & Hall/CRC, 1994.
20. I. Ovcharenko, M.A. Nobrega, G.G. Loots, and L. Stubbs, "ECR Browser: a tool for visualizing and accessing data from comparisons of multiple vertebrate genomes", Nucleic Acids Research, 32, W280-W286 (2004).
21. Khandekar M, Suzuki N, Lewton J, Yamamoto M, Engel JD, "Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system", Mol Cell Biol. 2004 Dec; 24(23): 10263-76.
22. J. Massey, "Causality, feedback and directed information," in Proc. 1990 Symp. Information Theory and Its Applications (ISITA-90), Waikiki, HI, Nov. 1990, pp. 303-305.
23. Hudson JE, "Signal processing using mutual information", Signal Processing Magazine, 23(6): 50-54, Nov. 2006.
24. Gubner JA, Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006.
25. Schwab K, Patterson LT, Aronow BJ, Luckas R, Liang HC, Potter SS, "A catalogue of gene expression in the developing kidney", Kidney Int. 2003 Nov; 64(5): 1588-604.
26. H. Marko, "The bidirectional communication theory - a generalization of information theory", IEEE Transactions on Communications, Vol. COM-21, pp. 1345-1351, 1973.
27. Benjamini Y and Hochberg Y, "Controlling the false discovery rate: a practical and powerful approach to multiple testing", J. Roy. Statist. Soc. Ser. B 1995; 57: 289-300.
28. Cover TM, Thomas JA, Elements of Information Theory, Wiley-Interscience, 1991.
29. Ezzat S, Mader R, Yu S, Ning T, Poussier P, Asa SL, "Ikaros integrates endocrine and immune system development", J Clin Invest. 2005 Apr; 115(4): 844-8.
30. Zhang DH, Yang L, and Ray A, "Differential responsiveness of the IL-5 and IL-4 genes to transcription factor GATA-3", J Immunol 161: 3817-3821, 1998.
31. Rogoff HA, Pickering MT, Frame FM, Debatis ME, Sanchez Y, Jones S, Kowalik TF, "Apoptosis associated with deregulated E2F activity is dependent on E2F1 and Atm/Nbs1/Chk2", Mol Cell Biol. 2004 Apr; 24(7): 2968-77.
DISCOVERING PROTEIN COMPLEXES IN DENSE RELIABLE NEIGHBORHOODS OF PROTEIN INTERACTION NETWORKS
Xiao-Li Li*
Knowledge Discovery Department, Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
*Email: xlli@i2r.a-star.edu.sg

Chuan-Sheng Foo
Computer Science Department, Stanford University, Stanford CA 94305-9025, USA
Email: csfoo@stanford.edu

See-Kiong Ng
Knowledge Discovery Department, Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
Email: skng@i2r.a-star.edu.sg

*Corresponding author.

Multiprotein complexes play central roles in many cellular pathways. Although many high-throughput experimental techniques have already enabled systematic screening of pairwise protein-protein interactions en masse, the amount of experimentally determined protein complex data has remained relatively lacking. As such, researchers have begun to exploit the vast amount of pairwise interaction data to help discover new protein complexes. However, mining for protein complexes in interaction networks is not an easy task because there are many data artefacts in the underlying protein-protein interaction data due to the limitations in the current high-throughput screening methods. We propose a novel DECAFF (Dense-neighborhood Extraction using Connectivity and conFidence Features) algorithm to mine for dense and reliable subgraphs in protein interaction networks. Our method is devised to address two major limitations in current high-throughput protein interaction data, namely, incompleteness and high data noise. Experimental results with yeast protein interaction data show that the interaction subgraphs discovered by DECAFF matched significantly better with actual protein complexes than other existing approaches. Our results demonstrate that pairwise protein interaction networks can be effectively mined to discover new protein complexes, provided that the data artefacts in the underlying interaction data are taken into account adequately.
1. INTRODUCTION

Multiprotein complexes play central roles in many cellular pathways. Common examples include the ribosomes for protein biosynthesis, the proteasomes for breaking down proteins, and the nuclear pore complexes for regulating proteins passing through the nuclear membrane. Searching for protein complexes is therefore an important research focus in molecular and cell biology. However, while tens of thousands of pairwise protein-protein interactions have been detected by high-throughput experimental techniques (e.g. yeast two-hybrid), only a small subset of the many possible protein complexes has been experimentally determined 1.

Given that protein complexes are molecular aggregations of proteins assembled from multiple stable protein-protein interactions, researchers have recently begun to explore the possibility of exploiting the current abundant datasets of pairwise protein-protein interactions to help discover new protein complexes (see Section 2). In fact, it has been observed that densely connected regions in protein interaction graphs often correspond to actual protein complexes 2, suggesting that the identities of protein complexes can be revealed as tight-knit subcommunities in protein-protein interaction maps. This has led to previous works that looked into the mining of cliques or other dense graphical subcomponents in the interaction graphs for putative complexes 4-7.

However, the protein interaction networks derived from current high-throughput screening methods are not an easy source for mining, as there are still many data artefacts in the underlying interaction data due to inherent experimental limitations. In fact, it has been repeatedly shown that the current protein interaction data is still incomplete and noisy 8, and it is important to take this into account when devising algorithms to mine the protein interaction networks. For example, the use of cliques for detecting complexes would be too constraining and cannot provide satisfactory coverage.

In this work, we propose a novel DECAFF (Dense-neighborhood Extraction using Connectivity and conFidence Features) algorithm that is devised to address two major limitations in current high-throughput protein interaction data, namely, incompleteness and high data noise. Unlike conventional methods, our DECAFF method specifically mines for maximal dense local neighborhoods (instead of cliques) and filters the unreliable protein complexes by estimating the reliability of each protein interaction in the network. Experimental results with yeast protein interaction data show that the interaction subgraphs discovered by DECAFF matched significantly better with actual protein complexes than other existing approaches. Our results confirm that there are indeed dense graphical subcomponents in the pairwise protein interaction networks that correspond to actual multiprotein complexes, and we could exploit the interactome to help map the protein complexome more effectively by taking into account the data artefacts in the underlying protein interaction data.
2. RELATED WORKS

By modeling protein interaction data as a large undirected graph where the vertices represent unique proteins and edges denote interactions between two proteins, Ref. 2 was one of the first to reveal that protein complexes generally correspond to dense regions (highly interconnected subgraphs) in the protein interaction graphs. Ref. 3 then exploited this finding and used cliques (fully connected subgraphs) as a basis to detect protein complexes and functional modules in protein interaction networks. However, the use of cliques was too constraining given the incompleteness of the currently available interaction data; as a result, the method could only detect a few protein complexes. Bader then proposed a novel MCODE algorithm that discovered protein complexes based on the proteins' connectivity values in a protein interaction graph 4. The algorithm first computes each vertex's weighting from its neighborhood density and then traverses outward from a seed protein with a high weighting value to recursively include neighboring vertices whose weights are above a given threshold. As the highly weighted vertices may not be highly connected to one another, this approach does not guarantee that the discovered regions are dense. As a result, not all the detected regions correspond to protein complexes. In fact, in the post-processing step of the MCODE algorithm, there was a need to filter for the so-called "2-cores" in an attempt to eliminate some obviously non-dense regions detected by the algorithm. Clustering algorithms have also been proposed to identify dense regions in a given graph by partitioning it into disjoint clusters. However, these general graph clustering algorithms cluster each vertex (protein) into one specific group, which makes them inappropriate for this biological application as a protein is often involved in multiple complexes (i.e. clusters) 13. Another clustering approach was proposed by Ref. 5, which used a restricted neighborhood search clustering algorithm (RNSC) to predict protein complexes by partitioning the protein-protein interaction network using a cost function. However, like many clustering algorithms, its results depend on the quality of the initial random seeds. In addition, relatively few complexes were predicted by this algorithm, reflecting another limitation of clustering approaches. In our recent work 6, we proposed the LCMA algorithm (Local Clique Merging Algorithm) to mine dense subgraphs for protein complexes. Instead of adopting the over-constraining cliques as the basis for protein complexes, LCMA adopted a local clique merging method as an attempt to address the current incompleteness limitation of protein interaction data. Evaluation results showed that LCMA was better at detecting complexes than the full-clique 3, MCODE 4 and RNSC 5 algorithms. However, LCMA also shared the same drawback as MCODE in that the graphical components detected by the algorithm are not guaranteed to be dense subgraphs. Most recently, Ref. 7 proposed an algorithm based on the assumption that two nodes that belong to the same cluster have more common neigh-
bors than two nodes that are not in the same cluster. Besides ensuring the high density (≥ 0.7) of a graph, their algorithm also keeps track of its cluster property, a numerical measure of whether a dense graph contains more than one dense component. If a graph has a low value for the cluster property, then it will be separated into multiple subgraphs. However, given the high proportion of noisy protein interactions (up to 50%) in current protein interaction networks, the formation of clusters will be greatly affected when the algorithm computes the cluster property. In this paper, we propose the DECAFF algorithm, which first mines local dense neighborhoods (in addition to local cliques) for each vertex (protein) and then merges these local neighborhoods according to their affinity to form maximal dense regions that correspond to possible protein complexes. In addition, given the potentially high false positive rate in the protein interaction data, DECAFF also filters away possible false protein complexes that have low reliability scores, ensuring that the proteins in the predicted protein complexes are connected by high-confidence protein interactions in the underlying network. The overall DECAFF algorithm is described in Section 3.3.

3. THE PROPOSED TECHNIQUES

Mathematically, a protein-protein interaction (PPI) network can be represented as a graph G_ppi = (V_ppi, E_ppi), where V_ppi represents the set of the interacting proteins and E_ppi denotes all the detected pairwise interactions between proteins from V_ppi. Our objective is to detect a set of subgraphs C = {g = (V, E) | |V| ≥ 3, V ⊆ V_ppi, E ⊆ E_ppi}, where each g is a dense subgraph (possibly overlapping) in G_ppi that may correspond to an actual multiprotein complex. Additionally, since many false positive protein interactions in G_ppi may be assembled into false protein complexes, we also require that each detected dense graph g has a high reliability score.
3.1. Mining for dense subgraphs Let us first introduce the notion of the local neighborhood graph for each vertex:
Definition 3.1. The local neighborhood graph of a vertex v_i ∈ V in G = (V, E) is defined as G_vi = (V_vi, E_vi), where

V_vi = {v_i} ∪ {u | u ∈ V, {u, v_i} ∈ E}, and
E_vi = {{v_j, v_k} | {v_j, v_k} ∈ E, v_j, v_k ∈ V_vi}   (1)
In other words, vertex v_i's local neighborhood graph is the subgraph formed by v_i and all its immediate neighbors, with the corresponding interactions in G. In this work, we have devised our algorithm to focus first on each vertex's local neighborhood graph in a bottom-up fashion, as it is impractical to directly detect dense subgraphs in a top-down fashion from G_ppi, which is usually a very large graph with thousands of vertices and tens of thousands of edges. Let us now define the notion of the density of a graph:
Definition 3.2. The density of a graph g = (V, E) is defined as its clustering coefficient (cc) 12:

cc(g) = 2|E| / (|V| · (|V| − 1))   (2)

Note that 0 ≤ cc(g) ≤ 1 since the maximum number of edges in an undirected graph g = (V, E) is |V| · (|V| − 1)/2. If g is a clique, then cc(g) = 1 as it has the maximum number of edges. In this work, we detect putative protein complexes from dense subgraphs of G_ppi instead of the conventional requirement for cliques. We define a dense graph as one whose density is at least max(δ, 0.5), where δ is a user-defined threshold to provide for more stringent conditions. The results reported in this paper are based on setting δ to 0.7, which is also the same setting used in the recent work by Ref. 7. The following theorem indicates that we can adopt a bottom-up approach to discover dense subgraphs from a protein interaction network:
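To make Definition 3.2 concrete, here is a minimal sketch of the density computation on a small graph; the function name and the edge representation are our own choices, not part of DECAFF itself.

```python
def density(vertices, edges):
    """cc(g) = 2|E| / (|V| * (|V| - 1)) for an undirected graph g = (V, E).

    `vertices` is a set of vertex labels; `edges` is a set of frozensets
    {u, v} with both endpoints in `vertices`.
    """
    n = len(vertices)
    if n < 2:
        return 0.0
    return 2.0 * len(edges) / (n * (n - 1))

# A 4-vertex clique has the maximum 6 edges, hence density 1.
v = {"A", "B", "C", "D"}
e = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("A", "D"),
                            ("B", "C"), ("B", "D"), ("C", "D")]}
assert density(v, e) == 1.0
```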
Theorem 1. Every dense neighborhood g in G_ppi can be assembled using only the dense neighborhoods of its inner vertices.

The formal proof for Theorem 1 can be found in Appendix A of the Supplementary Materials (available at http://www1.i2r.a-star.edu.sg/~xlli/csb-supp.pdf). Theorem 1 suggests a strategy of first finding the local dense neighborhoods for each vertex, and then obtaining larger dense neighborhoods by merging these dense sub-regions. As such, the DECAFF algorithm mines for dense subgraphs in two steps:
(1) First, we compute the local dense neighborhoods for all the vertices in the given interaction graph G_ppi. We use a local clique mining method to locate the local cliques, and then deploy a novel hub-removal technique to heuristically detect local dense subgraphs in each vertex's local neighborhood graph. Such systematic scanning of the local dense neighborhoods in the entire interaction graph allows DECAFF to discover most of the local dense regions, resulting in significantly higher recall than other algorithms (see Section 4).
(2) Then, we merge the extracted local dense neighborhoods to obtain maximal dense neighborhoods that correspond to larger complexes.
3.1.1. Mining for local dense subgraphs

Given that we already have an efficient method for discovering local cliques 6, we first mine for each vertex's local cliques, and then expand the collection with other local dense subgraphs using a hub-removal procedure which we will describe shortly. In this way, we can ensure that both cliques and non-clique dense subgraphs are detected effectively.
Fig. 1. A local clique obtained from YBR112C’s local neighborhood graph
To detect local cliques, we adapt the method from the LCMA algorithm which is basically an elimination process in which the neighborhood vertices of a given vertex are iteratively removed, starting from the least connected vertex (vertex with lowest degree), to increase the overall density of the local neighborhood graph. The details of this
step can be found in Appendix B of the Supplementary Materials. Here, we show an example (Figure 1) of mining a local clique from the local neighborhood graph of the vertex (protein) YBR112C to illustrate how it works. In this case, the neighbors YIL061C, YDR043C, YGL035C, YMR240C, YCL067C, YLR176C and YCR084C were sequentially removed. This results in the final local dense neighborhood shown in the circled area of Figure 1, which is a clique d = (V, E), V = {YBR112C, YDL005C, YOR174W, YGL025C}, with density cc(d) = 1 (|V| = 4 and |E| = 6). Although the LCMA algorithm can obtain the local cliques, an actual protein complex may not be presented as a fully connected subgraph in a protein interaction network for various reasons as previously discussed (e.g. incompleteness of current protein interaction data). There are thus possibly many other dense but non-clique subgraphs for each vertex that could form parts of a target complex. In DECAFF, we devise a Hub Removal algorithm to efficiently detect multiple dense subgraphs with densities larger than the given threshold δ. In the hierarchical network model proposed by Ref. 14, a biological network is constructed from a small cluster of highly connected nodes by generating replicas of the network at each step and linking the external nodes of the replicated clusters to the central node of the old cluster. This construction procedure suggests a heuristic for recovering the smaller dense clusters in the network by reversing the process, which forms the basis for the Hub Removal algorithm. Basically, we start by removing the most highly connected node (the hub) and its corresponding edges from the network, and then recursively repeat this procedure on its connected components until a dense cluster is recovered; the removed hub is then re-inserted back into the cluster. A more detailed description of this algorithm can be found in Appendix B of the Supplementary Materials. Figure 2 shows the results of applying the Hub Removal algorithm to further discover dense subgraphs in the local neighborhood of the protein YBR112C. While the previous LCMA algorithm could only discover a single fully connected graph {YBR112C, YDL005C, YOR174W, YGL025C} in this neighborhood graph, our recursive Hub Removal algorithm is able to detect an additional 4 dense subgraphs: {YBR112C, YGL035C,
YMR240C}, {YBR112C, YCR084C, YLR176C}, {YBR112C, YCR084C, YCL067C} and {YBR112C, YDL005C, YOR174W, YGL025C, YCR084C}. Note that as this approach allows the discovery of multiple, possibly overlapping, dense neighborhoods for each vertex, it also allows the possibility of a vertex (protein) participating in multiple complexes.
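A minimal sketch of the Hub Removal heuristic as we read the description above: remove the highest-degree vertex, recurse on the connected components, and stop when a component is dense enough. The graph layout (a dict of adjacency sets), the density threshold default, and the choice to re-insert every hub removed along the recursion path are our assumptions; the authors' exact procedure lives in Appendix B of their Supplementary Materials.

```python
def hub_removal(adj, delta=0.7):
    """Return dense vertex sets from `adj` (dict: vertex -> set of neighbors)."""
    def dens(nodes):
        n = len(nodes)
        if n < 2:
            return 0.0
        e = sum(len(adj[v] & nodes) for v in nodes) // 2  # each edge counted twice
        return 2.0 * e / (n * (n - 1))

    def components(nodes):
        seen, comps = set(), []
        for s in nodes:
            if s in seen:
                continue
            stack, comp = [s], set()
            while stack:
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend((adj[v] & nodes) - comp)
            seen |= comp
            comps.append(comp)
        return comps

    def recurse(nodes, removed):
        if len(nodes) < 3:
            return []
        if dens(nodes) >= delta:
            return [nodes | removed]          # re-insert the removed hub(s)
        hub = max(nodes, key=lambda v: len(adj[v] & nodes))
        found = []
        for comp in components(nodes - {hub}):
            found.extend(recurse(comp, removed | {hub}))
        return found

    return recurse(set(adj), set())
```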
Fig. 2. Multiple dense subgraphs obtained from YBR112C's local neighborhood graph

3.1.2. Merging for maximal dense neighborhoods

In an interaction graph with potentially incomplete interaction data, it is likely that a large protein complex is presented in the PPI graph as a composite of multiple overlapping dense neighborhoods. In addition, there is also biological evidence that many complexes are formed by multiple substructures such as subcomplexes 8, 15. We therefore adopt an additional step to merge the individual local dense neighborhoods (that have been detected in Section 3.1.1) using a heuristic that assigns overlapping neighborhoods with comparable sizes a high affinity to be merged.
Definition 3.3. Neighborhood Affinity. Given two neighborhoods (subgraphs) A and B, we define the Neighborhood Affinity NA between them as

NA(A, B) = |A ∩ B|² / (|A| · |B|)   (3)
Equation 3 quantifies the degree of similarity between neighborhoods. Note that if one neighborhood's size, e.g. |B|, is much bigger than |A|, then NA(A, B) will be small, since |A ∩ B|/|A| ≤ 1 and |A ∩ B|/|B| ≪ 1. This heuristic is based on the hypothesis that if two neighborhoods have larger intersection sets and similar sizes, then they are more similar and have a larger affinity. The merging step takes the set of local dense neighborhoods LDN (comprising the local cliques output by the LCMA algorithm and the dense neighborhoods obtained from the Hub Removal algorithm) and tries to merge neighborhoods that have affinity values greater than a threshold ω. The merging process is performed iteratively until the average density of the subgraphs in LDN starts to fall. The details of the algorithm are provided in Appendix B of the Supplementary Materials, which also contains a further illustrative example in Appendix C.
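Definition 3.3 and the merge test reduce to a few lines; the threshold value below is a placeholder, since the paper leaves ω to the user:

```python
def neighborhood_affinity(a, b):
    """NA(A, B) = |A intersect B|^2 / (|A| * |B|) for vertex sets a and b."""
    inter = len(a & b)
    return (inter * inter) / (len(a) * len(b))

def should_merge(a, b, omega=0.8):
    # Merge two local dense neighborhoods when NA exceeds the threshold omega;
    # 0.8 is an illustrative value, not one prescribed by the paper.
    return neighborhood_affinity(a, b) > omega
```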
3.2. Filtering for reliable subgraphs

Theorem 2.2. The randomized MCSPCA algorithm achieves approximation ratio (1 − e⁻¹).

Proof. Let us still use {X*_{o,c} | ∀o ∈ O, c ∈ C} and {Y*_q | ∀q ∈ Ω′} to denote the optimal solutions obtained by solving the linear program. Let OPT_f be the optimal value of the objective function of the linear program, that is, OPT_f = Σ_{q∈Ω′} Y*_q. Let I_q be an indicator random variable, which is set to 1 if the constraint q is satisfied under our randomized rounding step and to 0 otherwise. Let W denote the total number of satisfied constraints after the rounding step. Clearly, W = Σ_{q∈Ω′} I_q, and E(W) = Σ_{q∈Ω′} Prob(I_q = 1). Consider the following two cases:

(1) Y*_q = 1. Let q be of the form (c, S) where c ∈ C and S ⊆ O. In this case, Σ_{o∈S} X*_{o,c} ≥ 1, and

Prob(I_q = 1) = 1 − Prob(I_q = 0) = 1 − Π_{o∈S} (1 − X*_{o,c}) ≥ 1 − e^{−Σ_{o∈S} X*_{o,c}} ≥ 1 − e⁻¹.

So, Prob(I_q = 1)/Y*_q ≥ 1 − e⁻¹.

(2) Y*_q < 1. Let q be of the form (c, S) where c ∈ C. Then Σ_{o∈S} X*_{o,c} ≥ Y*_q, and

Prob(I_q = 1) ≥ 1 − e^{−Y*_q} ≥ Y*_q · min_{0<y≤1} (1 − e^{−y})/y = (1 − e⁻¹) Y*_q.

Therefore, to sum up, E(W) ≥ (1 − e⁻¹) OPT_f. □

The algorithm can be de-randomized via the method of conditional expectation to achieve a deterministic performance guarantee; the derandomization step follows the procedure on page 132 of Ref. 13. We observe that the approximation algorithm we propose here is similar to the one for MAX-SAT 14, 5. To summarize, a sketch of the MCSPCA algorithm is presented in Figure 2.

2.4. Deconvolution Using a Perfect Physical Map
As said, although it is not realistic to assume a perfect or near-perfect physical map, this variation of the problem allows us to establish the limits of how many assignments we can correctly deconvolute from the hybridization data. This is particularly useful for simulations, to ensure that our algorithm can achieve good results if the input physical map is of good quality. If we are given a perfect (or near-perfect) physical map, the problem can be tackled from the "opposite" direction. Instead of trying to take advantage of the grouping of BACs into disjoint contigs, we partition each BAC into several pieces. We preprocess the physical map as follows. We align the BACs along the chromosome according to their relative positions on the physical map. Then, starting from the 5' end, we cut the chromosome at each location where either a BAC starts or ends. This process breaks the genome into at most 2n fragments, where n is the total number of BACs. Each fragment is covered by one or more BACs, and some fragments may be covered by exactly the same set of BACs. In the latter case, only one fragment is kept while the others are removed. At the end of this preprocessing phase, a set of fragments is produced where each fragment is covered by a distinct set of overlapping BACs.
Algorithm MCSPCA(O, C, B, Ω)

0. Convert Ω to Ω′
1. Generate the integer program in (2) from (O, C, Ω′)
2. Solve the LP relaxation of the ILP in step 1, and obtain the optimal fractional solution {X*_{o,c}} and {Y*_q}
3. Apply K steps of randomized rounding and save the best solution:
   for each o ∈ O do
     C_o = {c | c ∈ C, X*_{o,c} > 0}
     assign o to c ∈ C_o with probability X*_{o,c}, or to none of the contigs with probability 1 − Σ_{c∈C_o} X*_{o,c}
4. Further assign probes to BACs:
   if o is assigned to c in step 3, then assign o to the set of BACs {b ∈ c | ∃(b, O_{b,p}) ∈ Ω s.t. o ∈ O_{b,p}}

Fig. 2. Sketch of the two-phase deconvolution algorithm that exploits an imperfect physical map
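Step 3 of Figure 2 can be rendered directly in code. The sketch below performs one rounding trial and keeps the best of K trials; the dictionary-based data layout and all names are ours, and the basic-deconvolution and ILP-construction steps are assumed to have happened elsewhere.

```python
import random

def round_once(x_star):
    """One randomized-rounding trial.  `x_star` maps each probe o to a dict
    {contig c: X*_{o,c}} with sum_c X*_{o,c} <= 1.  Returns {probe: contig|None}."""
    assignment = {}
    for o, fractions in x_star.items():
        r, acc = random.random(), 0.0
        assignment[o] = None          # unassigned with prob. 1 - sum_c X*_{o,c}
        for c, p in fractions.items():
            acc += p
            if r < acc:
                assignment[o] = c
                break
    return assignment

def count_satisfied(constraints, assignment):
    """A constraint (c, S) is satisfied if some probe in S landed on contig c."""
    return sum(any(assignment.get(o) == c for o in S) for (c, S) in constraints)

def best_of_k(x_star, constraints, k=100):
    return max((round_once(x_star) for _ in range(k)),
               key=lambda a: count_satisfied(constraints, a))
```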
Let us denote the final set of fragments as 𝔽. Given a fragment f ∈ 𝔽 and a BAC b ∈ B, we use B(f) to denote the set of BACs that f is covered by, and F(b) to denote the set of fragments that b covers. For the same reasons mentioned above, we expect that each probe will match its intended place in the genome and nowhere else. Our goal is to assign each probe to one fragment while at the same time maximizing the number of satisfied constraints in Ω. A constraint (b, O_{b,p}) ∈ Ω is satisfied if one or more probes from O_{b,p} is assigned to any of the fragments in the set F(b). Given an assignment between probes and fragments, the probe-BAC assignment can be easily obtained. Below is a formal statement of our new optimization problem.

MAXIMUM CONSTRAINT SATISFYING PROBE-FRAGMENT ASSIGNMENT (MCSPFA)
Instance: A set of fragments 𝔽, a set of probes O, a set of BACs B, and a list of constraints Ω = {(b, O_{b,p}) | b ∈ B, O_{b,p} ⊆ O}.
Objective: Assign each probe in O to at most one fragment in 𝔽, such that the number of satisfied constraints in Ω is maximized.

The MCSPFA problem is also NP-hard, since it is a special case of MCSPCA in which all BACs in B are disjoint.

2.4.1. Solving the MCSPFA problem via integer programming
A variant of the integer program that we presented for MCSPCA can also solve this problem optimally. Let X_{o,f} be a variable associated with the possible assignment of probe o to fragment f, which is set to 1 if o is assigned to f, and 0 otherwise. Let Y_q be defined in the same way as in the previous integer program. The integer linear program for MCSPFA follows.
Maximize    Σ_{q∈Ω} Y_q
Subject to  Σ_{f∈𝔽} X_{o,f} ≤ 1                      ∀o ∈ O
            Σ_{o∈S} Σ_{f∈F(b)} X_{o,f} ≥ Y_q          ∀q = (b, S) ∈ Ω
            X_{o,f} ∈ {0, 1}                          ∀o ∈ O, f ∈ 𝔽
            Y_q ∈ {0, 1}                              ∀q ∈ Ω            (3)
The major difference between ILP (3) and ILP (2) lies in the constraint that couples the Y_q variables to the assignment variables: in the ILP above, it translates the fact that a constraint (b, S) ∈ Ω is satisfied if any probe in S is assigned to any fragment in F(b).

2.4.2. Relaxation, rounding and analysis
Following the same strategy used for the MCSPCA problem, the ILP is relaxed to its corresponding LP, and the LP can then be solved optimally. Let {X*_{o,f} | ∀o ∈ O, f ∈ 𝔽} and {Y*_q | ∀q ∈ Ω} denote the optimal solutions to the LP. The fractional solution is rounded to an integer solution by interpreting X*_{o,f} as the probability of assigning probe o to fragment f. Let OPT_f be the optimal value of the objective function in the LP, that is, OPT_f = Σ_{q∈Ω} Y*_q. Let I_q be the indicator random variable, which is 1 if the constraint q is satisfied under the above randomized rounding step, and 0 otherwise. Let W denote the total number of satisfied constraints after the rounding step. Clearly, W = Σ_{q∈Ω} I_q, and E(W) = Σ_{q∈Ω} Prob(I_q = 1). A similar analysis to the one carried out for MCSPCA applies to MCSPFA as well, and as a consequence we can prove that E(W) ≥ (1 − e⁻¹) OPT_f. We can therefore claim the following theorem.

Theorem 2.3. The randomized MCSPFA algorithm achieves approximation ratio (1 − e⁻¹).
The pseudo-code of the MCSPFA algorithm is presented in Figure 3.
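For illustration only, the LP relaxation of ILP (3) can be posed with an off-the-shelf solver; the authors used GLPK, but the sketch below uses scipy and a dense constraint matrix (fine for toy instances, far too large for the problem sizes reported in Section 3). All names and the data layout are ours.

```python
import numpy as np
from scipy.optimize import linprog

def mcspfa_lp(num_probes, num_frags, constraints, frags_of_bac):
    """LP relaxation of ILP (3).  `constraints` is a list of (b, S) pairs with
    S a set of probe indices; `frags_of_bac[b]` is F(b), a set of fragment
    indices.  Returns the fractional X (num_probes x num_frags) and Y."""
    nx, nq = num_probes * num_frags, len(constraints)
    c = np.zeros(nx + nq)
    c[nx:] = -1.0                                  # maximize sum_q Y_q
    rows, rhs = [], []
    for o in range(num_probes):                    # sum_f X_{o,f} <= 1
        row = np.zeros(nx + nq)
        row[o * num_frags:(o + 1) * num_frags] = 1.0
        rows.append(row); rhs.append(1.0)
    for q, (b, S) in enumerate(constraints):       # Y_q <= sum_{o in S, f in F(b)} X_{o,f}
        row = np.zeros(nx + nq)
        row[nx + q] = 1.0
        for o in S:
            for f in frags_of_bac[b]:
                row[o * num_frags + f] = -1.0
        rows.append(row); rhs.append(0.0)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(0.0, 1.0)] * (nx + nq))
    return res.x[:nx].reshape(num_probes, num_frags), res.x[nx:]
```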
3. EXPERIMENTAL RESULTS

In order to evaluate the performance of our algorithms, we applied them to two datasets. The first one is partially simulated data on the rice genome, while the second is real-world data from hybridizations carried out in Prof. Close's lab at UC Riverside on the barley genome. Before delving into the experimental setup, we give a short description of the pooling design, which is relevant to the discussion.
3.1. Pooling Design

Pooling design (or group testing) is a well-studied problem in the scientific literature (see Ref. 2 and references therein). Traditionally, biologists use the rather rudimentary 2D or 3D grid design. There are, however, more sophisticated pooling strategies (see, e.g., Refs. 7, 3, 12). To the best of our knowledge, the shifted transversal design (STD) 12 is among the best choices due to its capability to handle multiple positives, and its flexibility and efficiency. STD pools are constructed in layers, where each layer consists of P pools, with P a prime number. Each layer constitutes a partition of the probes, and the larger the number of layers, the higher the decodability of the pooling. More specifically, let Γ be the smallest integer such that P^(Γ+1) is greater than or equal to the number of probes to be pooled, and let L be the number of layers. Then, the decodability of the pool set is equal to ⌊(L − 1)/Γ⌋. In order to increase the decodability by one, an additional Γ·P pools (i.e., Γ more layers) are needed. By a simple calculation, one can see that the number of pools required to provide sufficient information for deconvolution (e.g., to be at least 10-decodable) for a real-world problem with 50,000 BACs and 50,000 unigenes is prohibitively high.
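The construction below is a sketch of STD as we understand it from Ref. 12: every probe index is written in base P, and in layer j the probe goes to pool (Σ_c i_c · j^c) mod P. We omit STD's extra layers beyond the polynomial construction and any optimizations, so this should be checked against Ref. 12 before real use.

```python
def std_pools(num_probes, P, L, Gamma):
    """L layers of P pools each, for probes 0..num_probes-1 (P prime, L <= P,
    and P**(Gamma+1) >= num_probes so base-P representations are distinct)."""
    assert P ** (Gamma + 1) >= num_probes
    layers = [[set() for _ in range(P)] for _ in range(L)]
    for i in range(num_probes):
        digits, x = [], i
        for _ in range(Gamma + 1):       # base-P digits i_0 .. i_Gamma of probe i
            digits.append(x % P)
            x //= P
        for j in range(L):               # layer j: pool (sum_c i_c * j^c) mod P
            pool = sum(d * pow(j, c, P) for c, d in enumerate(digits)) % P
            layers[j][pool].add(i)
    return layers

# Pool set 1 from Section 3.2: P = 13, L = 3; Gamma = 2 since 13**3 >= 2,002.
pool_set_1 = std_pools(2002, 13, 3, 2)   # decodability floor((3-1)/2) = 1
```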
Algorithm MCSPFA(O, 𝔽, B, Ω)

1. Generate the integer program (3) from (O, 𝔽, Ω)
2. Solve the LP relaxation of the ILP in step 1, and obtain the optimal fractional solution {X*_{o,f}} and {Y*_q}
3. Apply K steps of randomized rounding and save the best solution:
   for each o ∈ O do
     𝔽_o = {f | f ∈ 𝔽, X*_{o,f} > 0}
     assign o to f ∈ 𝔽_o with probability X*_{o,f}, or to none of the fragments with probability 1 − Σ_{f∈𝔽_o} X*_{o,f}
4. Further assign probes to BACs:
   if o is assigned to f in step 3, then assign o to all the BACs in B(f)

Fig. 3. Sketch of the two-phase deconvolution algorithm that exploits a perfect physical map

3.2. Experimental Results on the Rice Genome

The rice "virtual" BAC library used here is a subset of the Nipponbare BAC library, whose BAC end sequences are hosted at the Arizona Genomics Institute (AGI). The fingerprinting data for this library was also obtained from AGI. Our rice virtual library contains the subset of BACs in the original library whose location on the rice genome (Oryza sativa) can be uniquely identified and for which restriction fingerprinting data was available. The location of the BACs was determined by BLASTing the BAC end sequences against the rice genome. Since our rice virtual BAC library is based on real fingerprinting data (agarose gel), we expect the overlapping structure of the rice BACs in the physical map to be an accurate representation of what would happen in other organisms. Also, since we know the actual coordinates of the BACs on the rice genome, we can also produce a perfect physical map and test the maximum amount of deconvolution that can be extracted from the hybridization data. For the purposes of this simulation, we restricted our attention to chromosome 1 of rice, which includes 2,629 BACs. This set of BACs provides an 8.59× coverage of chromosome 1. We created a physical map of that chromosome by running FPC 10 on the fingerprinting data with the cutoff parameter set to 1e−15 (all other parameters were left at their defaults). FPC assembled the BACs into 347 contigs and 416 singletons. Not including the singletons, each contig contains on average about 6.4 BACs. Given that the fingerprinting data is noisy, the physical map assembled by FPC cannot be expected to be perfect. It is also well known that the order of the BACs within a contig is generally not reliable 4. We then obtained rice unigenes from NCBI (build #65) with the objective of designing the probes. First, we had to establish the subset of these unigenes that belong to chromosome 1. We BLASTed the unigenes against the rice genome, and we selected a subset of 2,301 unigenes for which we had high confidence of being located on chromosome 1. Then, we computed unique probes using OLIGOSPAWN 16, 17. This tool produces 36-nucleotide-long probes, each of which matches exactly the unigene it represents while at the same time not matching (even approximately) any other unigenes in the dataset. OLIGOSPAWN successfully produced unique probes for 2,002 unigenes (out of 2,301). The remaining unigenes were not represented by a probe because no unique 36-mer could be found. This set of 2,002 probes is named probe set 1. Some of the probes in probe set 1, however, did not match anywhere in the genome. This can happen when the chosen probe crosses a splice site or when the unigene from which it was selected was misassembled. In probe set 1, 330 probes did not match anywhere on rice chromosome 1. The remaining 1,672 probes matched exactly once in rice chromosome 1. This latter set constitutes our probe set 2, which is a "cleaner" version of probe set 1. We observe that in order to clean the probe set one has to have access to the whole genome, which is somewhat unrealistic in practice. Each BAC contains on average 5.8 probes and at most 20 probes. The probes were hybridized in silico to the BACs using the following criterion: a 36-nucleotide probe hybridizes to a BAC if they share a common (exact) substring of 30 nucleotides or more. This criterion was debated at length with the biologists in Prof. Close's lab and, among all the suggestions, this one was chosen for its simplicity. Observe that the hybridization table is the only synthetic data in this dataset. The next step was to pool the probes for group testing. We followed the shifted transversal design pooling strategy 12 and designed four sets of pools, for different choices of the pooling parameters P and L. Recall that in STD the number of pools is P·L. Pool set 1 is 1-decodable, obtained by choosing P = 13 and L = 3. Pool set 2 uses two extra layers (P = 13, L = 5), which increased the decodability by
1. For pool set 3, we chose P = 47 and L = 2. Pool set 3 is also 1-decodable, but since each pool contains a smaller number of probes, it deconvolutes the BAC-probe relationships better than pool set 1. Pool set 4 is constructed from pool set 3 by adding an additional layer (P = 47, L = 3). As a consequence, pool set 4 is 2-decodable. The four pooling designs were applied to probe set 1 and probe set 2; in total, we constructed 8 sets of pools. The hybridization tables between pools and BACs were formed for the 8 sets of pools. Then, the basic deconvolution step (described in Section 2.2) was carried out. This step produced a list of constraints, some of which were exact pairs and could be deconvoluted immediately. The set of BACs, the set of probes, the list of constraints from the previous step, and the physical map produced by FPC were then fed into MCSPCA. Our ILP-based algorithm produced a set of BAC-probe assignments, which was then merged with the exact pairs obtained by the basic deconvolution to produce the final assignment. Similarly, the set of BACs, the set of probes, the list of constraints from the basic deconvolution, and the perfect physical map were fed into MCSPFA. The assignment obtained by MCSPFA was also merged with the exact pairs to produce the final assignment. For both algorithms we used the GNU Linear Programming Kit (GLPK) to solve the linear programs. The linear programs are quite large: the number of variables ranged from 47,820 to 165,972, and the number of constraints ranged from 29,412 to 60,475. To evaluate the accuracy of our algorithms, we employ two performance metrics, namely recall and
precision. Recall is defined as the number of correct assignments made by our algorithm divided by the total number of true assignments. Precision is defined as the number of correct assignments made by our algorithm divided by the total number of assignments our algorithm made. Tables 1 and 2 summarize the assignment accuracy of our algorithms. "Basic recall" is the recall of the basic deconvolution step (precision is not reported because it is always 100%). A few observations are in order. First, note that 2-decodable pooling designs achieve a much better performance than 1-decodable pooling. Second, probe set 2 provides better quality data and as a consequence it improves the deconvolution. However, if we stick with the more realistic probe set 1 and the noisy physical map, our algorithm still achieves 91% recall and 94% precision for the pooling P = 47, L = 3. Even more impressive is the amount of additional deconvolution achieved for the other 2-decodable pooling when compared to the basic deconvolution. For example, for P = 13, L = 5 the pooling is composed of only 65 pools, and thereby the basic deconvolution achieves just 27% recall. Our algorithm however achieves 62% recall with 77% precision. The results for the perfect map show that our algorithm could potentially deconvolute all BAC-probe pairs with almost 100% precision (if the pooling is "powerful" enough). Finally, in order to show the effectiveness of our randomized rounding scheme, Tables 3 and 4 report the total number of constraints, the optimal value OPT_f of the LP, and the number W of satisfied constraints. Note that the rounding scheme does not significantly affect the value of the objective function (see the ratio W/OPT_f).

Table 1. Assignment accuracy of MCSPCA (with imperfect physical map) and MCSPFA (with perfect physical map) on probe set 1

  pooling         #pools   #true assigns   basic recall   MCSPCA recall   MCSPCA precision   MCSPFA recall   MCSPFA precision
  P = 13, L = 3     39         14742          0.0103          0.199            0.2647           0.4857           0.4227
  P = 13, L = 5     65         14742          0.2726          0.618            0.7668           0.9708           0.8511
  P = 47, L = 2     94         14742          0.0173          0.4005           0.5236           0.8856           0.7626
  P = 47, L = 3    141         14742          0.763           0.9069           0.9446           0.9991           0.9798

Table 2. Assignment accuracy of MCSPCA (with imperfect physical map) and MCSPFA (with perfect physical map) on probe set 2

  pooling         #pools   #true assigns   basic recall   MCSPCA recall   MCSPCA precision   MCSPFA recall   MCSPFA precision
  P = 13, L = 3     39         14742          0.0121          0.2163           0.3182           0.625            0.6214
  P = 13, L = 5     65         14742          0.3111          0.6488           0.8314           0.9984           0.9964
  P = 47, L = 2     94         14742          0.0298          0.4348           0.6009           0.9971           0.9962
  P = 47, L = 3    141         14742          0.8182          0.9285           0.9767           0.9995           0.9997

Table 3. Performance of the randomized rounding scheme on probe set 1

  pooling         #constraints   MCSPCA OPT_f   MCSPCA W   W/OPT_f   MCSPFA OPT_f   MCSPFA W   W/OPT_f
  P = 13, L = 3      35071           28683         22615     0.7884       35033         30562     0.8724
  P = 13, L = 5      58472           45220         41462     0.9169       58425         58277     0.9975
  P = 47, L = 2      27591           21472         18633     0.8678       27567         26524     0.9622
  P = 47, L = 3      41509           29458         29378     0.9973       41467         41467     1
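For reference, the recall and precision columns in Tables 1 and 2 reduce to a one-liner each; `predicted` and `truth` below are hypothetical sets of (probe, BAC) pairs:

```python
def recall_precision(predicted, truth):
    correct = len(predicted & truth)
    return correct / len(truth), correct / len(predicted)
```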
3.3. Experimental Results on the Barley Genome
Table 4. Performance of the randomized rounding scheme on probe set 2

  pooling         #constraints   MCSPCA OPT_f   MCSPCA W   W/OPT_f   MCSPFA OPT_f   MCSPFA W   W/OPT_f
  P = 13, L = 3      35102           27161         21567     0.7940       35089         31183     0.8887
  P = 13, L = 5      58210           42795         39934     0.9331       58179         58179     1
  P = 47, L = 2      27739           20127         17990     0.8938       27711         27711     1
  P = 47, L = 3      41378           28567         28532     0.9988       41378         41338     0.9990

The second dataset is related to the barley (Hordeum vulgare) project currently in progress at UC Riverside. The barley BAC library used is a Morex library covering 6.3 genome equivalents 15. Selected BACs from the BAC library were fingerprinted using a technique that employs four different restriction enzymes, called high information content fingerprinting 6. The physical map was constructed using FPC. The total number of BACs that were successfully fingerprinted and that were present in the physical map is 43,094. Among the set of BACs present in the physical map, about 20 have been fully sequenced; they will be used for validation of our algorithm. The barley unigenes were obtained by assembling the ESTs downloaded from the NCBI EST database. The unigene assembly contains in total 26,743 contigs and 27,121 singletons. About a dozen research groups around the world contributed hybridization data. Each group designed probes for certain genes of interest and performed the hybridization experiments. Since those efforts were not centrally coordinated, the probe design and the pool design were completely ad hoc. The length of the probes ranges from 36 nucleotides to a few hundred bases. The number of unigenes that each pool represents also ranges from one to a few hundred. We collected the data, and we transformed it into a list of constraints that we processed first with the basic deconvolution. Recall that if we obtain an exact pair, the assignment is immediate. But if a constraint is non-exact, we cannot conclude much even if the size of O_{b,p} is very small. However, intuitively those constraints for which |O_{b,p}| is small are the most informative. In an attempt to filter out the noise and isolate the informative constraints, we selected only those for which |O_{b,p}| ≤ 50. In total 14,796 constraints were chosen. Then, we focused only on the 5,327 BACs and the 2,263 unigenes that were involved in this selected set of constraints. We then used our MCSPCA method on this reduced set (along with the barley physical map produced by FPC) and we obtained 9,587 assignments. We cross-referenced these assignments to the small set of sequenced BACs and we determined that six of them were in common with the 5,327 BACs we selected. Our algorithm assigned
eight unigenes to those 6 BACs, and six of them turned out to be correct by matching them to the sequences of 20 known BACs.
4. CONCLUSIONS

In this paper, we proposed a new method to solve the BAC-gene deconvolution problem. Our method compensates for a weaker pooling design by exploiting a physical map. The deconvolution problem is formulated as a pair of combinatorial optimization problems, both of which are proved to be NP-complete. The combinatorial optimization problems are solved approximately via integer programming, followed by linear programming relaxation and randomized rounding. Our experimental results on both real and simulated data show that our method is very accurate in determining the correct mapping between unigenes and BACs. The right combination of combinatorial pooling and our method not only can dramatically reduce the number of pools required, but can also deconvolute the BAC-gene relationships almost perfectly.

ACKNOWLEDGMENTS

This project was supported in part by NSF CAREER IIS-0447773, NIH LM008991-01, and NSF DBI-0321756. The authors thank Serdar Bozdag for providing the data for the simulation on the rice genome.

References
1. Barbazuk WB, Bedell JA, Rabinowicz PD. Reduced representation sequencing: a success in maize and a promise for other plant genomes. Bioessays 2005; 27: 839-848.
2. Du DZ, Hwang FK. Combinatorial Group Testing and Its Applications, 2nd edition. World Scientific 2000.
3. Dyachkov A, Hwang F, Macula A, et al. A construction of pooling designs with some happy surprises. Journal of Computational Biology 2005; 12: 1129-1136.
4. Flibotte S, Chiu R, Fjell C, et al. Automated ordering of fingerprinted clones. Bioinformatics; 20: 1264-1271.
5. Goemans MX, Williamson DP. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics 1994; 7: 656-666.
6. Luo MC, Thomas C, You FM, et al. High-throughput fingerprinting of bacterial artificial chromosomes using the SNaPshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics 2003; 82: 378-389.
7. Macula AJ. A simple construction of d-disjunct matrices with certain constant weights. Discrete Mathematics 1996; 162: 311-312.
8. Sipser M. Introduction to the Theory of Computation. International Thomson Publishing 1996.
9. Sandhu D, Gill KS. Gene-containing regions of wheat and the other grass genomes. Plant Physiology 2002; 128: 803-811.
10. Soderlund C, Longden I, Mott R. FPC: a system for building contigs from restriction fingerprinted clones. Computer Applications in the Biosciences 1997; 13: 523-535.
11. Sumner AT, de la Torre J, Stuppia L. The distribution of genes on chromosomes: a cytological approach. Journal of Molecular Evolution 1993; 37: 117-122.
12. Thierry-Mieg N. A new pooling strategy for high-throughput screening: the Shifted Transversal Design. BMC Bioinformatics 2006; 7: 28-37.
13. Vazirani VV. Approximation Algorithms. Springer 2001.
14. Yannakakis M. On the approximation of maximum satisfiability. SODA '92: Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms 1992; 1-9.
15. Yu Y, Tomkins JP, Waugh R, et al. A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theoretical and Applied Genetics 2000; 101: 1093-1099.
16. Zheng J, Close TJ, Jiang T, Lonardi S. Efficient selection of unique and popular oligos for large EST databases. Bioinformatics 2004; 20: 2101-2112.
17. Zheng J, Svensson JT, Madishetty K, et al. OligoSpawn: a web-based tool for the design of overgo probes from large unigene databases. BMC Bioinformatics 2006; 7: 7-15.
A GRAMMAR BASED METHODOLOGY FOR STRUCTURAL MOTIF FINDING IN ncRNA DATABASE SEARCH
Daniel Quest*, William Tapprich†, Hesham Ali*

*College of Information Science and Technology, University of Nebraska at Omaha
†Department of Biology, University of Nebraska at Omaha
Omaha, NE 68182-0694, USA
E-mail: djquest@unmc.edu
In recent years, sequence database searching has been conducted through local alignment heuristics, pattern matching, and comparison of short statistically significant patterns. While these approaches have unlocked many clues as to sequence relationships, they are limited in that they do not provide context-sensitive searching capabilities (e.g. considering pseudoknots, protein binding positions, and complementary base pairs). Stochastic grammars (hidden Markov models, HMMs, and stochastic context-free grammars, SCFGs) do allow for flexibility in terms of local context, but the context comes at the cost of increased computational complexity. In this paper we introduce a new grammar-based method for searching for RNA motifs that exist within a conserved RNA structure. Our method constrains computational complexity by using a chain of topology elements. Through a case study we present the algorithmic approach and benchmark our approach against traditional methods.
1. INTRODUCTION

Functional non-coding RNA (ncRNA) has received great attention in recent years because of its diverse functional activities within the cell. ncRNA forms a secondary structure that enables other molecules to interact with it and carry out functional activities. In many cases, molecules interact with conserved primary structure patterns, or motifs, given that the ncRNA is in the correct secondary structure conformation. Because of this, the bioinformatics community has focused considerable energy on methods that predict ncRNA secondary structure and search for homologous structures within a sequence database (e.g. Refs. 1, 2, 3). Currently there are many approaches to finding an RNA homolog. The first approach is to construct a structure for the sequence, and then use that structure to query a sequence database 1. One option to construct the structure is to use sequence profiles of the RNA as it is conserved through evolution 4. Another approach is to use a package such as Mfold 5 together with chemical probing validation experiments to determine the RNA structure. Once the structure is determined, one can use pattern-matching software to find structural homologs within the RNA database. Pattern-matching software packages were first used to find homologous tRNAs 6, 7. Over time, they have evolved to consider multiple different abstractions of the structural patterns. Pattern-matching
programs have evolved from regular expression tools to scripting languages capable of considering errors, non-Watson-Crick base pairs, complementary base pairing, and common structural profiles. Some example programs include RnaBob 8, RNAMOT 9, Palingol 10, and RNAMotif 11. Although these methods are extremely powerful and fast, they require significant user expertise to obtain reliable profiles. In addition, they do not easily allow probabilistic scoring schemes to be integrated into them. This implies that these tools return all hits that are possible given our current understanding of the secondary structure. These tools do not rank profiles based on what is most likely to occur given the phylogenetic relationships of the ncRNA. A second approach to finding an ncRNA homolog is to use a stochastic context-free grammar (SCFG) 12 to simultaneously align the primary sequence and the secondary structure. SCFGs have an advantage over pattern-matching programs in that they require less manual expertise and tuning to find accurate structural alignments once the global parameters are set. In practice, however, they are impractical because of their O(n⁴) running time 1. If pseudoknots are considered 13, the O(n⁶) time complexity makes database searches impossible. To circumvent these obstacles, Weinberg and Ruzzo proposed an HMM filter that allows for faster ncRNA searches without loss of accuracy 2. Recently, Zhang et al. proposed an additional filter 3 and a sequence filtering methodology
14 for constructing fast SCFG searches without loss of accuracy. This capability allows us to construct queries over large datasets based on primary and secondary structure instead of primary sequence alone. However, implicit in the assumptions of these filtering techniques is the concept that scoring matrices are homogeneous across all putative alignment regions. In some cases, however, we have more evidence, which requires our scoring system to be heterogeneous. For example, when some of the bases have been biologically verified by chemical probing and other bases have not, we wish for a model with two classes: verified and not verified. We then wish to search for RNAs that have a conserved secondary structure subject to the constraint that all verified bases remain functional. In other words, we wish the bases in the motif to remain functional, so they cannot be considered in other base pair interactions in the folding of the molecule. This problem can be solved with pattern-matching programs, but search results suffer because errors are not scored in a probabilistic way. Consequently, for a short ncRNA, such a program can return a pattern that satisfies all constraints but is not closely related to known ncRNAs found in nature. This implies that the number of matches is dictated by the length of the database instead of by functional relationships inside the database. SCFGs can be modified to impose additional constraints through additional grammar rules; however, this process is time consuming. More importantly, changing the grammar and the parameters has the effect of also changing the relationships used to construct the filters that allow SCFGs to run in reasonable time. In this paper, we propose a new approach to search a sequence database for RNA structures that have known functional sites (motifs). Our approach uses the strategy of nested grammars to simultaneously integrate secondary structure, primary structure, and biologically verified constraints. We will show that our method is capable of finding significant substrings or motifs when pattern-matching approaches cannot, and that our method can serve as a reasonable second step for imposing constraints on putative hits from a filtered SCFG (therefore avoiding the need to construct constraint-aware filters). To illustrate our nested-grammar paradigm, we will first show that a grammar with favorable runtime characteristics can be used as an approximation for a grammar with more complex runtime characteristics. In this way, a heuristic for a complex grammar G can be generated via a simple grammar G′. We also show that G′ can provide a solution within τ for G, where τ is an arbitrary error threshold. Finally, we illustrate our algorithm for evaluating G and G′ via an example and a case study.
2. PROBLEM DESCRIPTION AND METHODOLOGY

Our primary interest in this work is to search large databases for significant signals within a conserved two-dimensional structure. Given that we know some pattern or signal from biologically verified data, we wish to find two-dimensional structural homologs in a database subject to the constraint that the structural homolog must contain this signal. In this paper, we present a robust grammar-based approach for finding non-deterministic RNA structural motifs in a conserved secondary structure. Like the pattern-matching approaches, our approach allows the user to decide the level of flexibility of the constraints of the profile to search. Given reasonable constraints, the proposed approach also has a favorable computational complexity. The core idea is to define a primary grammar for running the nucleotide comparisons, and a secondary grammar to model the secondary structure relationships. This idea is similar to the idea independently developed in MilPat using constraint networks 15. In MilPat, a constraint network is used to model secondary structure dependencies. Our approach, on the other hand, uses a secondary grammar to model constraints. The key advantage of using nested grammars is that all constraints can be integrated homogeneously using Bayes' rule. The direct impact is that we allow for mismatches, and thus our models can be entirely probabilistic in nature. Our nested-grammar-based method functions by considering two grammars: G and G′. The structural alignment grammar G represents the known constraints that exist in the molecule. The sequence alignment grammar G′ represents an ordered set of phylogenetic subsequences found in the structure. Elements of G′ may be scored with traditional
scoring matrices, or additional information from biological experiments can be added to score subsequences heterogeneously. We use a pairwise hidden Markov model (discussed below) to evaluate all possible alignment positions for subsequences in G′, and then combine evidence using the more robust SCFG grammar to select the subsequence alignments with the most supporting evidence and construct the grammar-to-sequence alignment. The next few paragraphs provide background for our method. In Sections 2.1 to 2.5 we detail the key components of the algorithm. The algorithm in its entirety is presented in Section 3.

Consider a pattern P = {p_1, p_2, ..., p_m} that we wish to search for, and a sequence S such that |P| < |S| (|x| denotes the length of x). Our objective is to create the sequence S from P using production rules from the set: insert (I = (−/b)), delete (D = (a/−)), match (M = (a/a)), and mismatch (X = (a/b)), where a and b are any characters in the terminal alphabet Σ such that a ≠ b, and − is a space. The production transcript T is an ordered list of production rules that produces S from P 16. T is a generative regular probabilistic grammar (G) that, for each production rule, generates a pair of characters, one corresponding to P and one corresponding to S, such that both P and S are produced (a pairwise hidden Markov model, or PHMM). For example, if P = AAAC and S = TAGCC, we could construct both P and S with the production transcript T = {X, D, M, I, M, I}, as shown in Figure 1.
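The transcript example above can be replayed mechanically. The following sketch (our own scaffolding, not the authors' code) applies a transcript over M/X/D/I to a pattern and sequence and returns the implied alignment rows:

```python
def check_transcript(P, S, transcript):
    """Verify that `transcript` produces S from P and return the alignment."""
    i = j = 0
    top, bottom = [], []
    for op in transcript:
        if op == "M":                     # match: identical characters
            assert P[i] == S[j]
            top.append(P[i]); bottom.append(S[j]); i += 1; j += 1
        elif op == "X":                   # mismatch: differing characters
            assert P[i] != S[j]
            top.append(P[i]); bottom.append(S[j]); i += 1; j += 1
        elif op == "D":                   # delete: consume P only
            top.append(P[i]); bottom.append("-"); i += 1
        elif op == "I":                   # insert: consume S only
            top.append("-"); bottom.append(S[j]); j += 1
    assert i == len(P) and j == len(S)    # both strings fully produced
    return "".join(top), "".join(bottom)

print(check_transcript("AAAC", "TAGCC", list("XDMIMI")))
# -> ('AAA-C-', 'T-AGCC')
```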
Fig. 1. A PHMM representation of a production transcript

PHMMs have proven to be useful in global sequence alignments between two sequences. In order to cluster gaps or account for palindromic base pairing, the grammar needs to be extended to include both primitive production rules (I, D, M, X) and non-terminal production rules. A non-terminal production rule is a grammar production rule that may produce any other production rule (terminal or non-terminal, including itself) from a finite set of options. Non-terminal production rules exist to allow shortcuts in the alignment path that more closely approximate our biological problem. The classic example is one where gaps are clustered together to represent introns in an alignment between genomic DNA and messenger RNA. Such a grammar could be represented as follows:

G1:
  T → (a/b)T | (a/−)L | (−/b)R | E
  L → (a/−)L | T | E
  R → (−/b)R | T | E

In G1, production rules L and R represent the gaps in P and S respectively. E represents the final character in P and S. More recently, non-terminal production rules have been used to model structural parameters in RNA folding. Such folding parameters are modeled by production rules that produce two characters simultaneously. The resulting production rules represent a palindromic language. For example, we could extend our non-terminals to include production rules such as (a/b)T(a′/b′), where the notation implies that a base-pairs with a′ in P and b base-pairs with b′ in S. A simple but effective grammar for RNA structure prediction was proposed by Knudsen and Hein in the PFold package 17:

G2:
  S → LS | L
  L → (a/b)F(a′/b′) | (a/b)
  F → (a/b)F(a′/b′) | LS
Additional production rules allow models to more closely represent biological function. They also increase computational complexity, sometimes so much so that realistic models on large datasets cannot be computed in reasonable time even on a large cluster. Production rules that allow for non-regularity (i.e. both-sides emission) make database search intractable in practice without filters. To circumvent this problem, statistical techniques are used to infer where non-terminal operations can be applied. Traditionally, the approach has been to find some statistical properties of a dataset given a grammar G and then restrict the search based on those properties. In this work, we wish to show a method for
compiling evidence that can restrict the number of non-regular production rules to the areas of greatest interest and therefore manage tractability through a multi-level grammar strategy. In other words, given a grammar G that is difficult to compute, we wish to run a grammar G′ that approximates G to some threshold τ. Given those approximations, we then wish to bind the search for non-terminal operations that exist in G to regions generated by G′ that have the most evidence to support a non-terminal shortcut.

2.1. Transcript Evidence

Each of the production rules has an associated cost. To calculate the cost of a production transcript, we sum the costs of all production rules in that production transcript. Traditionally, alignment imposes few limitations on the costs chosen for each of the production rules; drastically different alignment summaries can be obtained from different scoring schemes 18. To combine multiple production transcripts in a logically consistent manner, we use a Bayesian method for scoring production transcripts. Imagine we draw production rules from an urn at random to construct our production transcript. At each draw, we are constrained by the pattern and the sequence of the production rule we choose, because P must produce S. If H_i is the hypothesis that a production operation exists in the transcript at position i, and X_i is our prior information about all other possible production operations at position i (a position-specific scoring matrix), then we can relate our hypotheses by the inversion formula:

P(H_i | P, S, X_i) = P(H_i | X_i) · P(P, S | H_i, X_i) / P(P, S | X_i)   (1)
Axiomatically, if H̄_i represents the hypothesis that any production operation other than H_i exists at position i in the transcript, then we can construct an identical equation for P(H̄_i | P, S, X_i). If we take the log of the ratio of P(H_i | P, S, X_i) and P(H̄_i | P, S, X_i), we obtain the evidence:

e(H_i | P, S, X_i) = e(H_i | X_i) + log [ P(P, S | H_i, X_i) / P(P, S | H̄_i, X_i) ]   (2)

In Equation 2, e(H_i | X_i) represents our prior evidence for production rule H at position i based on our grammar production model. If this is zero, it indicates
that we have no evidence supporting or refuting H. The evidence for an entire production transcript T can be calculated as:

e(T | P, S, X) = Σ_i e(H_i | P, S, X_i)   (3)

The evidence in a production transcript depends only on our prior knowledge stated explicitly in X. Given that N represents all possible production operations at position i in the transcript, SM is the substitution matrix based on sampling of production rules from ncRNA, and W represents the current production operation, we have a general formula for evaluating the evidence of a production rule versus all other production rules at the same position in the transcript:

e(H_i = W | P, S, X_i = SM) = e(H_i = W | X_i = SM) + log [ P(P, S | H_i = W, X_i = SM) / P(P, S | H_i ∈ N∖{W}, X_i = SM) ]   (4)

At this point, we can integrate our chemical probing data or other biological evidence using the term e(H_i = W | X_i = SM).
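In code, Equations 2 and 3 are a per-position log-odds plus a prior, summed over the transcript. The log base (and any decibel-style scaling) is a convention the surviving text does not fix, so the sketch below simply uses the natural log:

```python
import math

def rule_evidence(prior, p_given_h, p_given_not_h):
    """Equation 2: e(H_i|P,S,X_i) = e(H_i|X_i) + log[P(P,S|H_i)/P(P,S|not H_i)]."""
    return prior + math.log(p_given_h / p_given_not_h)

def transcript_evidence(position_evidences):
    """Equation 3: the evidence of a transcript is the sum over its positions."""
    return sum(position_evidences)
```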
2.2. Scoring a Grammar

The objective now is to find a maximum production list amongst all possible production lists.
Definition 2.1. A maximum weighted grammar production list (MWGPL) is an ordered list of all allowable grammar production steps and their associated evidence such that the production list: (1) produces S from P when the production rules are taken in order, and (2) has maximum total evidence over all paths that produce S from P.

If the MWGPL is known, then we can make a statement about how well the pattern and the sequence correspond to the model. A great deal of evidence is likely to imply that the model, the sequence and the pattern agree, and that the sequence has the same characteristics as the pattern. This leads us to the three key considerations that are the subject of this work: (1) find a partition of S such that our algorithms can be run efficiently in practice with minimum loss to the quality of a query; (2) minimize the use of non-terminals but retain the benefits of non-terminal operations; and (3) integrate relationships in our data into the grammar model.
2.3. Optimization Through Nesting Grammars
To manage tractability of the evaluation polynomial, we would like to be able to partition S recursively as the query collects evidence towards the most likely propositions. To obtain some guarantee about the running time of our partition, we also want to select production operations that are consistent with runtime expectations. To do this, we define the notion of a topology element.
Definition 2.2. A topology element TE = {PS_te, R} is a grammar production rule set that contains a collection of patterns PS_te = {ps_1, ps_2, ...} and a set of allowed production rules R. A topology element must use the set of production rules R to produce S′, a subsequence of S. Each topology element must have one prior associated with all of the production rules in R. In a grammar G, a topology element may be used only once.

A topology element is also a grammar. To evaluate the evidence that topology element TE produced S′, we use Equation 3, selecting ps_i from PS_te and selecting S′ in S such that evidence is maximized. As each topology element may have more than one string, the production of S from the topology element series is constructed by picking the maximum weighted grammar production list over all topology alternatives in TE. Topology elements are produced through a grammar. The global grammar G contains productions for the topology grammar G′. A heuristic grammar HG for G approximates G by using production rules in G and production rules from a simpler grammar G′. The topology element paradigm allows us to manage complexity by recursively defining partitions on S. Consequently, it allows us to restrict complexity by bounding non-terminal production rules to regions specified by the partition. A topology element serves as a heuristic to cut vertices in the production transcript graph so that the graph may be evaluated using divide and conquer (for a description of the relationship between edit transcripts and edit graphs and how grammar production rules can be represented as both a graph and a sequence of rules, see Refs. 16, 12). The topology element serves as evidence towards G with G′. In practice, we want to evaluate all topology elements that satisfy an evidence threshold τ. If τ is large, the approximation for G will be inaccurate because G′ will miss many candidates, although the runtime will be favorable. If τ is small, the likelihood that we miss a production transcript lessens, but the computational cost of evaluating HG increases. A topology element TE is evaluated with grammar G′. Evaluating TE with G′ inevitably reduces the correctness of the MWGPL for G because some production rules in G do not exist in G′. There are two approaches to solving this problem: (1) allow grammar refinement, or (2) assume that higher-order relationships can be approximated, given enough alternatives in the data. Grammar refinement is a strategy where we may reexamine the production list for topology element TE produced by G′ and substitute production rules that exist in G (but not in G′) into the production list for TE. Using a grammar refinement strategy, we can guarantee that the MWGPL for HG has the same evidence as the MWGPL for G. The disadvantage of this approach is that in the worst case we will actually evaluate G. While there are many potential approaches to bound the number of refinements, we choose to save this for future work. Instead, we choose to focus on a data-driven approach. We assume that a relationship found in a higher-order grammar production can be discovered if we have enough supporting sequences in our model. As the number of sequences increases, the known alternatives for a topology element approach the real number of alternatives in the database. As an example, consider the sequence A = tgtCCCaTATAaGGGata that we know can be partitioned into 3 consensus regions: TE1 = tata, TE2 = ccc, and TE3 = ggg. TE2 and TE3 are related because they complementary base pair. Here is an example topology-element-based grammar for A:
GT1:
    A → xA | B A | ε
    B → TE2 C TE3
    C → xC | TE1 C | ε
(where x ranges over {a, c, g, u})
In GT1, insertion of TE1, TE2, and TE3 is done via calls to a simpler grammar GT1′. Insertion may allow grammar substitution operations that increase evidence based on known topology relationships. For example, in the above grammar TE2 and TE3 are known to complementary base pair, so the regular grammar productions of the form M → xM on the MWGPL of TE2 and the regular grammar productions of the form M → yM on the MWGPL of TE3 can be substituted with palindromic grammar productions in GT1 of the form M → xMy, with y the complement of x, representing complementary base pairs. Using this framework, we can pursue likely complementary base pairs without being forced to evaluate all possible complementary base pairs.
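The substitution test itself is simple; the following minimal sketch (illustrative names, not the paper's code) checks that two topology element strings are reverse-complementary and, if so, emits the paired productions:

COMPLEMENT = {"a": "u", "u": "a", "g": "c", "c": "g"}

def is_reverse_complement(te2, te3):
    # True if te3 is the reverse complement of te2 over the RNA alphabet.
    return len(te2) == len(te3) and all(
        COMPLEMENT[x] == y for x, y in zip(te2, reversed(te3)))

def palindromic_productions(te2, te3):
    # Replace per-base productions M -> xM with paired productions M -> xMy,
    # where y is the Watson-Crick complement of x, when TE2/TE3 pair up.
    if not is_reverse_complement(te2, te3):
        return None  # keep the independent single-base productions instead
    return ["M -> %sM%s" % (x, COMPLEMENT[x]) for x in te2]

print(palindromic_productions("ccc", "ggg"))
# ['M -> cMg', 'M -> cMg', 'M -> cMg']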
2.4. Non-terminal Grammar Operations

The goal of this section is to show how one grammar can be used to approximate another grammar. Consider the following grammar, G3, that produces a local alignment:
G3:
    L → xL | A
    A → (x, y)A | (x, −)A | (−, y)A | R
    R → xR | ε
(the pair (x, y) advances in both P and S; (x, −) advances in P only; (−, y) in S only)
In this grammar, L and R are production rules that result in no evidence, because we wish to produce a local alignment. Production rule A is where G3 collects evidence. If we use dynamic programming to build all maximal solutions, A requires O(|P| × |S|) time to evaluate the MWGPL. The bottleneck comes from the fact that at each position in the production list, we have three choices: we may advance our position in S but not P, or in P but not S, or in both P and S.
Consider a prototype grammar, G4, that we wish to use to approximate G3. Given an evidence cutoff τ, we can store all possible production transcripts with evidence over τ in O(|P| + |S|) space. Note that G4 needs only a starting position and an ending position; given these two positions, the evidence transcript is unique. This grammar can be evaluated in O(|P| + |S|) time. We can further increase speed by hashing all possible production transcripts resulting in a score of at least τ and indexing all instances that actually appear in producing S from P. To produce G3 with the transcript T from G4 we could use the following grammar G5:
G5:
    L → xL | T
    T → xT | R
    R → xR | ε
(where each T production replays a stored high-evidence transcript from G4)
G5 functions to stitch elements with large amounts of supporting evidence from G4 and construct our approximation for G3. For each grammar alignment in G4 with evidence over τ, we construct the table for the grammar production list. As an example, consider a list lst of non-overlapping components that produce S, lst = (A, B, C, ...), via G4. To stitch A, B, C together using G5, we first make all the grammar productions in A, thus constructing the final row of the dynamic programming table from A. The final row in the table is then used as we make grammar rule productions in G5 until we get to the position in S_i where i is the first position in B. We then make all production rules in B and again use the last row from B to continue making grammar productions using G5. This process continues until S is produced. If τ is large, then most of the computational time will be spent evaluating G5 instead of G4 and the computational time will approach that of G3. If τ is small, then computing time is dominated by G4 and our approximation algorithm will be nearly linear.
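A minimal sketch of the stitching pass, with assumed coordinates and an assumed per-character bridging cost:

def stitch(components, seq_len, bridge_cost=-0.1):
    # components: list of (start, end, evidence) segments from G4,
    # non-overlapping and sorted by start position.
    total, pos = 0.0, 0
    for start, end, ev in components:
        total += (start - pos) * bridge_cost  # G5 bridges the gap
        total += ev                           # reuse the stored G4 table
        pos = end
    total += (seq_len - pos) * bridge_cost    # trailing region
    return total

# Components A, B, C as in the text, with assumed coordinates/evidence:
print(stitch([(2, 6, 5.0), (9, 14, 7.0), (20, 24, 4.0)], seq_len=30))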
2.5. Integrating Relationships in Data into Search

A topology element allows us to produce sequences instead of characters. Assume that we have a regular grammar that produces the hairpin in Figure 2a. The grammar is divided into topology elements T1–T11. Figure 2b shows several example sequences that all contain the same hairpin. At the base of Figure 2b are the totals of the number of bases at each position. T4 and T8 are highly conserved and easily detected. However, the signal for T4 and T8 has very little information content. We would like to use the surrounding elements to increase (or decrease) the evidence that we have supporting T4 and T8 as a real site.
Fig. 2. a) A hairpin structure containing the loop E motif from coxsackievirus B3. b) An illustrative example of sequences from the characteristic portions of the loop E motif. c) A phylogenetic tree constructed from sequences S1–S8. d) A phylogenetic tree constructed from T5 + T7. Note that we chose the partitions because of our chemical probing data.
Simple graphical models such as an HMM will not be able to detect the site T4 + T8, because the evidence found in the other sites is lost when you consider only the previous base. On the other hand, SCFGs will catch the base pairing relationships between elements T3, T9 and T5, T7, but they must check every possible base pair in the sequences to find the relationship. We would prefer a method that can detect the base pairing but is not forced to evaluate all possible pair alignments. Our approach is to use the phylogenetic relationships found in the data to add evidence toward base pairing. To construct our prior belief in the sequence relationships, we cluster all of the known sequences by constructing a phylogenetic tree as shown in Figure 2c. Then, to represent the complementary base pairs, we concatenate T5 and T7 and construct a tree. The evidence that a relationship exists between S_i and S_j is the distance between S_i and S_j in the tree shown in Figure 2c minus the distance between S_i and S_j in the tree shown in Figure 2d. This is evaluated in the grammar when the term e(H_i | X_i) of Equation 2 sums evidence for topology elements.
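A minimal sketch of this evidence term, with illustrative distance matrices standing in for the trees of Figures 2c and 2d:

def pairing_evidence(i, j, dist_full, dist_paired):
    # Positive values mean the paired elements co-vary more tightly than
    # the sequences as a whole, supporting complementary base pairing.
    return dist_full[i][j] - dist_paired[i][j]

# Illustrative pairwise tree distances for sequences S1..S3:
dist_full = [[0.0, 0.8, 0.9], [0.8, 0.0, 0.7], [0.9, 0.7, 0.0]]
dist_paired = [[0.0, 0.3, 0.5], [0.3, 0.0, 0.4], [0.5, 0.4, 0.0]]
print(pairing_evidence(0, 1, dist_full, dist_paired))  # 0.5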
3. ALGORITHM

To score a sequence, first we introduce a global grammar G for evaluating sequences in the database D. For example, to evaluate the structure of the hairpin in Figure 2 we can use the following grammar G:
G:
    D → xD | E D | ε
    E → xE | P1 E | ε
    P1 → T3 P2 T9
    P2 → T4 P3 T8
    P3 → T5 T6 T7
(where x ranges over the bases)
Our algorithm contains two phases, top down and bottom up. Top down refers to the phase where we traverse productions in G, generating instances of G' that will produce S. Bottom up is the procedure where instances of G' are stitched together with operations in G into a production list that produces S. The transcript with the maximum evidence is selected as the approximation for the MWGPL of G. In this example, we assume the grammars for evaluating T3, T5, T6, T8 and T9 are instances of the regular grammar G3. While we could choose to evaluate G3 with an approximation, as was done in the previous section, the overhead in RNA structure matching comes from the palindromic non-terminals. Because the partitions of S1–S8 in our example are such short sequences, evaluating G3 directly using the Gotoh 19 memory reduction method gives better alignments for small sequences. Because the sequences are short, the dynamic programming tables are also short and the overhead is low. T4 and T8 can be evaluated with G' = G4 because they are absolutely conserved.
Fig. 3. A grammar for finding the loop E RNA motif.
The top down algorithm for constructing G proceeds forward by first producing P1. P1 in turn produces T3, P2, and T9. Candidates for the MWGPL for G over T3 and T9 are computed via G'. All non-overlapping candidates that satisfy the condition that T3_k < T9_l and that the evidence e(T3_k) + e(T9_l) > τ are stored in a table (table1). G then proceeds by producing P2, constrained such that T3_k < T4 < T8 < T9_l and the evidence for the transcript is greater than τ. Because we do not know which entries in table1 exist in the MWGPL for G, we must store all potential candidates for T4 and T8 that exist between S_{k+1}, min(k), and S_l, max(l), in table2. Variables k and l represent an index in 0 to |S|. Note that a potential candidate for a topology element such as T4 may not overlap with another candidate for the production list of T4, but it may overlap with a candidate from any other topology element (e.g. T5). In a similar way, G then produces T5, T6, and T7, and the non-self-overlapping production lists over τ are stored in table3. Once tables are created for all elements of G, we construct the MWGPL approximation for G by stitching candidate topology production lists together if they contain more evidence after being merged. Intuitively, the tables mark candidate positions for G that may exist on the MWGPL. Figure 3 illustrates this basic idea. On S, we have putative positions marked by the forward production operations on the grammar G.

score(S, t):
    l = []
    productionTranscript = []
    while productionTranscript.hasMoreWaysToMakeS():
        productionTranscript += makeProduction(G, S, t, l)
        if productionTranscript.produces(S) == True:
            l += productionTranscript
            productionTranscript = []
    forAll i in l:
        G'.computeEvidence(i)
        if i.isMaxEvidence():
            return i

makeProduction(G, S, t, l):
    selectNextRule = FSA.DP(l)
    if productionSet.contains(TE):
        evaluateAndMark(TE) forAll TE > t
        BayesNet[l.index].add(productionRule)
    return productionRule
4. RESULTS

Nondeterministic structural motif finding is one of the outstanding problems in bioinformatics. The proposed method, the advanced grammar alignment search tool (AGAST), can be applied to find motifs in any biological sequences, including DNA, RNA, and protein. In this section, we assess the performance of the proposed method in finding loop E motifs in conserved secondary RNA structures. The loop E motif is a fold that organizes structure in hairpin loops and multi-helix junctions in many RNA molecules. The motif is prevalent in 16S and 23S ribosomal RNA and derives its name from its discovery in loop E of 5S rRNA 20. The loop E motif is particularly significant in RNA structure because it uses a series of non-canonical base pairs to form a characteristic fold. This fold widens the major groove of the RNA helix and presents a cross-strand adenosine stack that serves as a recognition feature for RNA-protein and RNA-RNA interactions 21. The presence of this motif in molecules as diverse as ribosomal RNA, potato tuber spindle viroid, RNase P RNA, and the hairpin ribozyme shows that the loop E motif is an important feature in RNA structure and function. Sequence comparison and chemical probing analysis have revealed a consensus pattern for the loop E motif. This pattern consists of a parallel purine-purine pair (usually AA), a bulged nucleotide, a non-Watson-Crick UA that is absolutely conserved, and a purine-purine pair (AA or AG, but not GA). As a result of the non-canonical pairing, the motif generates a signature pattern of susceptibilities in chemical probing experiments 22. We have identified the sequence pattern and the chemical probing pattern of the loop E motif in the coxsackievirus B3 (CVB3) genomic RNA. The general character of the absolutely conserved properties of this motif was a.ua.*gaa. The CVB3 loop E motif in the context of the surrounding RNA is shown in Figure 2. In this section, we compare the performance of our proposed method, AGAST, with RSEARCH and RNAMotif in finding the loop E motif. We selected RNAMotif because, of the pattern-matching tools, it is one of the most flexible and can be customized to our specific problem. We selected RSEARCH because it is guaranteed to give optimal results over other methods, since it computes all possible sequence-structure alignments. We did not use MilPat because its current release does not allow errors and is thus more constrained than both of these programs. The structure of the loop E motif must correspond exactly to the character of the sequence shown in Figure 2. The character of the motif is a hairpin loop with the loop E sequence immediately flanked by paired regions. The loop E section must be absolutely conserved (with motif a.ua.*[ga]aa). Complementary base-pairing flanking the loop E is responsible for maintaining the structure of the loop. The turn at the top of the loop may contain a large secondary structure (instead of a 4-base turn). To test the sensitivity and specificity of our approach, we collected a set of sequences from coxsackievirus B3 (CVB3) genomic RNA. We partitioned these sequences into two groups, testing and modeling. With the modeling sequences, we constructed a multiple sequence alignment as shown above by overlaying our chemical probing data, phylogenetic conservation and possible folding conformations from mfold. Then, for each sequence in the testing set, we constructed a false positive sequence using a third order Markov chain (to preserve the motif, but destroy the complementary base pairing required for secondary structure). For each of the sequences in the dataset we ran RSEARCH, RNAmotif, and AGAST. In the case of RNAmotif, we designed two pattern-matching queries using the same information that we had in constructing the AGAST query. The first query, which we call RNAmotif-intuitive, constrains results such that they form a hairpin around the conserved loop E motif and that complementary base pairs exist in the hairpin both 5' and 3' of the motif. This query is based on our understanding that the motif can only be formed if there is significant stability provided by complementary base pairs both 5' and 3' of the motif. In our second RNAmotif query, which we call RNAmotif-permissive, we give RNAmotif the 5' and 3' regions surrounding the loop E motif. Because this query did not match any sequences in our test database, we gradually increased the error threshold in the regions 5' and 3' of the motif until we obtained matches. RSEARCH was provided only with the sequence from the hairpin, from which it builds a grammar. Each of these grammars was queried against our database. The results from this experiment are in Table 1. These results indicate that the traditional methods of SCFGs (RSEARCH) and expertly tuned queries (RNAMotif-permissive) remain the most sensitive methodologies when searching for a double stranded RNA motif in a two-dimensional structure. However, this sensitivity comes at the cost of an increased number of false predictions. In sequences that have no conserved two-dimensional structure (HMM-3 jumbled sequences), we found the false positive rates to be 1.22 and 1.29 for RSEARCH and RNAMotif-permissive respectively. This is because both programs predicted more sites than there exist sequences in the generated database. The RNAMotif-intuitive query was able to substantially reduce the number of false predictions, but it was far too restrictive, eliminating 85% of true positives. We believe that our approach has significant promise because it was capable of maintaining relatively high sensitivity while increasing specificity to the same level as our intuitive description of the motif. Moreover, upon closer investigation, we realized that the false positives found in our real database were all phylogenetically diverse from the instances we had in our training set (all forming in different clades from our training sequences). This indicates that our approach may perform better with a representative sequence from each clade in the phylogenetic tree, but finding such representatives in a new domain remains a challenging problem.
Table 1.
Tool                  True Positives   Time
RSEARCH               55               963m53.2s
RNAMotif-intuitive    8                89.65s
RNAMotif-permissive   55               15.2s
AGAST                 50               68.36s
Among the other methods, none produces acceptable values for both sensitivity and specificity in our problem domain. On the other hand, AGAST had over 90% for both parameters. Another experiment was conducted to search a larger dataset to find additional unknown instances of the loop E motif. Because ribosomal RNA sequences are known to contain the loop E motif, we generated a data set of all ribosomal RNA by parsing species-specific data files (e.g. gbpri1) from GenBank release 143 for all files with ribosomal RNA. We found 176,371 rRNA records using this method. We shuffled each of the sequences in the database using a Markov chain of order three and ran our algorithm on both the real and shuffled ribosomal RNA sequences. Figure 4 shows the distribution of scores for records from the two sets.
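The decoy construction can be sketched as follows (a toy order-3 Markov generator, not the exact procedure used here): trinucleotide statistics are preserved while the long-range complementarity needed for secondary structure is destroyed.

import random
from collections import defaultdict

def markov3_decoy(seq, rng=random.Random(0)):
    # Map each trinucleotide context to the bases observed after it.
    nxt = defaultdict(list)
    for i in range(len(seq) - 3):
        nxt[seq[i:i + 3]].append(seq[i + 3])
    out = list(seq[:3])
    for _ in range(len(seq) - 3):
        cands = nxt.get("".join(out[-3:]))
        out.append(rng.choice(cands) if cands else rng.choice("acgu"))
    return "".join(out)

print(markov3_decoy("gggaaauaacccuuugacuguaaua"))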
Fig. 4. Grammar scores (max{evidence} − evidence found) versus number of records for finding the loop E RNA motif.
In this experiment, there is a significant difference between the sequences generated by the Markov chain that contain the sequence a.ua.*[ga]aa and the rRNA sequences. Also, of the top 10 records returned in our search, we were able to verify that 9 of the records did contain the loop E motif; the status of the remaining record is unknown. We are currently working to verify whether the final sequence also contains the loop E motif. This demonstrates that our tool can find motifs through structural homology even in a large database.
5. CONCLUSIONS

In this paper, we have proposed a grammar-based method built on constructing graphical models that relate subsequences instead of forcing the evaluation of individual characters. We have used this method to find the loop E structural motif inside ncRNA with conserved secondary structure. Our results show that our method produced the best sensitivity/specificity combination among the tested methods for the problem domain. It may also serve as a strong complement to current methods in accelerating ncRNA homology detection, because it can be more specific than SCFGs in the case where we have additional information about interior structural motifs. We believe that well structured data relationships can play a key role in difficult problems such as motif searching. We also believe topology models are very general and could be used in modeling and searching for complex patterns in DNA or proteins. We believe that this work points to the need for more general approaches to automatically generate RNA database queries, especially queries where some possible structures can be eliminated from the SCFG on the basis of biological evidence. Our method would serve well for building filters that can be combined with existing methods such as FastR for increased specificity in selecting structures from the SCFG with conserved structural motifs.
ACKNOWLEDGMENTS

We would like to thank Mohammad Shafiullah for help on scaling the code to large architectures, and Laura A. Quest, Brad Friedman, and Mark Pauley for help with the manuscript. This research project was made possible by NSF grant number EPS0091900 and NIH grant number P20 RR16469 from the INBRE Program of the National Center for Research Resources.
References
1. Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, September 2003; vol. 4.
2. Weinberg Z, Ruzzo WL. Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, August 2004; vol. 20 Suppl 1.
3. Zhang S, Haas B, Eskin E, and Bafna V. Searching genomes for noncoding RNA using FastR. IEEE/ACM Trans Comput Biol Bioinform, 2005; vol. 2, no. 4: 366-379.
4. Rivas E, Klein RJ, Jones TA, and Eddy SR. Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr Biol, September 2001; vol. 11, no. 17: 1369-1373.
5. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res, July 2003; vol. 31, no. 13: 3406-3415.
6. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res, March 1997; vol. 25, no. 5: 955-964.
7. Fichant GA, Burks C. Identifying potential tRNA genes in genomic DNA sequences. J Mol Biol, August 1991; vol. 220, no. 3: 659-671.
8. Eddy SR. RNABob: A program to search for RNA secondary structure motifs in sequence databases. http://selab.wustl.edu/cgi-bin/selab.pl?mode=software
9. Laferrière A, Gautheret D, and Cedergren R. An RNA pattern matching program with enhanced performance and portability. Comput Appl Biosci, April 1994; vol. 10, no. 2: 211-212.
10. Billoud B, Kontic M, and Viari A. Palingol: a declarative programming language to describe nucleic acids secondary structures and to scan sequence databases. Nucleic Acids Res, April 1996; vol. 24, no. 8: 1395-1403.
11. Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R. RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res, November 2001; vol. 29, no. 22: 4724-4735.
12. Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, London, 1998.
13. Rivas E, and Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol, February 1999; vol. 285, no. 5: 2053-2068.
14. Zhang S, Borovok I, Aharonowitz Y, Sharan R, Bafna V. A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements. Bioinformatics, July 2006; vol. 22, no. 14.
15. Thébault P, de Givry S, Schiex T, Gaspin C. Searching RNA motifs and their intermolecular contacts with constraint networks. Bioinformatics, July 2006.
16. Gusfield D. Algorithms on Strings, Trees and Sequences. Cambridge University Press, London, 1999.
17. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, July 2003; vol. 31, no. 13: 3423-3428.
18. Dewey CN, Huggins PM, Woods K, Sturmfels B, Pachter L. Parametric alignment of drosophila genomes. PLoS Computational Biology, June 2006; vol. 2, no. 6: e73+.
19. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol, December 1982; vol. 162, no. 3: 705-708.
20. Branch AD, Benenfeld BJ, and Robertson HD. Ultraviolet light-induced crosslinking reveals a unique region of local tertiary structure in potato spindle tuber viroid and HeLa 5S RNA. PNAS, October 1985; vol. 82, no. 19: 6590-6594.
21. Correll CC, Wool IG, and Munishkin A. The two faces of the Escherichia coli 23S rRNA sarcin/ricin domain: the structure at 1.11 Å resolution. J Mol Biol, September 1999; vol. 292, no. 2: 275-287.
22. Leontis NB, Westhof E. A common motif organizes the structure of multi-helix loops in 16S and 23S ribosomal RNAs. J Mol Biol, October 1998; vol. 283, no. 3: 571-583.
IEM: AN ALGORITHM FOR ITERATIVE ENHANCEMENT OF MOTIFS USING COMPARATIVE GENOMICS DATA

Erliang Zeng¹, Kalai Mathee², and Giri Narasimhan¹*

¹Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, Florida 33199, USA, and ²Department of Biological Sciences, Florida International University, Miami, Florida 33199, USA.

Understanding gene regulation is a key step in investigating gene functions and their relationships. Many algorithms have been developed to discover transcription factor binding sites (TFBS); they are predominantly located in upstream regions of genes and contribute to transcription regulation if they are bound by a specific transcription factor. However, traditional methods focusing on finding motifs have shortcomings, which can be overcome by using comparative genomics data that is now increasingly available. Traditional methods to score motifs also have their limitations. In this paper, we propose a new algorithm called IEM to refine motifs using comparative genomics data. We show the effectiveness of our techniques with several data sets. Two sets of experiments were performed with comparative genomics data on five strains of P. aeruginosa. One set of experiments was performed with similar data on four species of yeast. The weighted conservation score proposed in this paper is an improvement over existing motif scores.
Keywords: Comparative Genomics, Motif, EM algorithm
1. INTRODUCTION

Gene expression is a fundamental biological process. The first step in this process, called transcription, transmits genetic information from DNA to messenger RNA (mRNA). A transcription factor (TF) is a protein that regulates transcription of a gene by interacting with specific short DNA sequences, located often in the upstream region of the regulated genes. Such short DNA sequences are called transcription factor binding sites (TFBS) or regulatory elements. The regulatory elements can be described as sequence signatures and will be referred to in this paper as motifs. One TF can regulate a large set of genes, and a single gene may be regulated by the combination of several TFs. The upstream region of each gene regulated by the same TF must have at least one binding site specific for that particular TF. These binding sites must be specific enough so that the TF can “recognize” them and bind to them. However, it is well known that different sites bound by the same TF are not necessarily identical. The computational challenge is to find these sites and to succinctly and accurately describe all such binding sites.
*To whom correspondence should be addressed.
The simplest way to describe a binding site is to write down its consensus sequence. However, this is very imprecise and does not do justice to the complexity of the sequence signature. A sequence alignment of all known binding sites captures its complexity, but is not succinct enough. A logo format (Schneider and Stephens 1990; Crooks, Hon et al. 2004) is succinct enough, but is merely visual. The appropriate description is a profile, which is also referred to as a position-specific scoring matrix (PSSM) or a position weight matrix (PWM) (Werner 1999; Stormo 2000). A profile is a 4 x K matrix (K is the length of the binding site) whose entries give a measure of the preference of a base appearing at any given position. Examples of sophisticated algorithms to identify TF binding sites include MEME (Bailey and Elkan 1994), AlignACE (Hertz and Stormo 1999), Bioprospector (Liu, Brutlag et al. 2001), MDscan (Liu, Brutlag et al. 2002), YMF (Sinha and Tompa 2003), Weeder (Pavesi, Mereghetti et al. 2004) and many more. All these methods attempt to find sequence signatures that are significantly overrepresented in the upstream regions of a given gene set (typically a cluster of co-regulated genes from analyzing microarray data, or a gene set inferred
from a ChIP-Chip experiment) when compared to an appropriately chosen background. Despite the successful application of the algorithms listed above, each of them has certain limitations (Hu, Li et al. 2005; Tompa, Li et al. 2005; GuhaThakurta 2006; MacIsaac and Fraenkel 2006; Sandve and Drablos 2006). First, all these methods are prone to predict a large number of motifs, many of which are false positives, partly because TFs show remarkable flexibility in the binding sites they can potentially bind to. Second, all these methods report statistically overrepresented motifs. However, statistical significance of motifs need not be synonymous with biological relevance of motifs. Binding of TFs to their binding sites is a complex process and may be assisted or hindered by many other unexplained factors. Comparative genomics data is a promising new source of information that can help to improve motif prediction. With the availability of an increasing number of whole genome sequences of evolutionarily-related genomes, it is practical to incorporate comparative genomics data into the motif discovery process. The basic assumption is that transcription factors and transcriptional mechanisms involved in fundamental cellular processes are likely to be conserved among evolutionarily-related genomes. Consequently, the binding sites for such TFs are also likely to be conserved. Therefore, the availability of comparative genomics data is likely to provide additional support to the predictions of binding sites. The simplest way to deal with data on additional genomes is to pool together the upstream regions of all available genomes and to apply traditional motif detection methods. However, this is not an optimal utilization of the comparative genomics data. The “phylogenetic footprinting” strategy is a sophisticated method used to find motifs that are conserved for a particular gene across related organisms (Blanchette and Tompa 2002). Several subtle approaches such as PhyloCon (Wang and Stormo 2003), orthoMEME (Prakash, Blanchette et al. 2004), CompareProspector (Liu, Liu et al. 2004), EMnEM (Moses, Chiang et al. 2004), PhyME (Sinha, Blanchette et al. 2004), and PhyloGibbs (Siddharthan, Siggia et al. 2005) were developed recently to solve this problem. In these approaches, either an EM-based algorithm, a greedy algorithm or a Gibbs sampling strategy was applied to optimize an objective function, while taking the phylogenetic relationships into account. The main problem with these methods is that phylogenetic relationships are often not easy to infer and not very reliable. Also, any motif that is unique to particular genomes or in upstream regions of genes with no orthologs
in some related genomes will not be detected. Most of the above methods also need an alignment of the input sequences. Like phylogenetic relationships, alignments are also often unreliable. Inaccurate alignments (or phylogenies) lead to errors in profile matrices, and ultimately in motif prediction. Another challenge in motif prediction is to develop scoring functions that reflect biological significance. Several popular scoring functions include IC (information content), MAP, the Group Specificity score, LLBG (least likely under the background model) and the Bayesian scoring function. However, as explained earlier, algorithms that use these scoring schemes end up with a large number of false positives in their predictions. When dealing with multiple genomes, the degree of conservation of the ‘hits’ of a profile across the many genomes can be used as a crude surrogate for the significance of the motif. However, this metric has its shortcomings. In this paper, we propose a metric to measure such biological significance. We also propose a new algorithm called IEM (Iteratively Enhancing Motif discovery). IEM is an iterative version of an earlier algorithm called EMR (Enhancing Motif Refinement) (Zeng and Narasimhan 2007). It differs from other earlier approaches in that no attempt is made to perform de novo detection of motifs (although that would be easy to incorporate). Instead, comparative genomics data is used to “enhance” any given motif. These motifs may have been discovered by other computational methods, or may have been identified by laboratory techniques. Thus our method leverages the best-known motif discovery methods, or utilizes the (potentially incomplete) knowledge of previous studies, while incorporating newly available comparative genomics data. The research described here is significant for the following reasons. First, there is a clear need to reduce the number of false positives predicted by traditional tools. Second, our method can make use of partial information (on one or more binding sites), which may be available as a result of biological experiments. Third, with the availability of high throughput gene expression techniques like microarrays and ChIP-Chip experiments, it is possible to get sets of co-expressed genes involved in the same metabolic pathway (and, therefore, potentially co-regulated). Finally, our results show that the IEM algorithm has superior ability to overcome the shortcomings of previous methods and to effectively utilize any available comparative genomics data.
2. METHODS

2.1. Algorithm

The IEM algorithm takes as input an “unrefined” motif for a given genome Γ1 (called the reference genome); this motif may have been generated using any reasonable existing motif detection method. Alternatively, the input could be a known binding site or a crude approximation based loosely on some experiments. Using one or more additional genomes Γ2 (referred to as the related genomes), and the corresponding orthology information between Γ1 and Γ2, the algorithm returns an enhanced motif. The refinement procedure is EM-based, as described below in Section 2.1.3.

2.1.1. Basic Expectation Maximization (EM) Algorithm

Since our algorithm is EM-based, we first present an adaptation of the classical EM algorithm (Dempster, Laird et al. 1977) for ab initio motif discovery (Lawrence and Reilly 1990). Motif prediction can be thought of as a parameter estimation process for a mixture model: (1) a model for the motif and (2) a model for the background. Roughly speaking, the algorithm can be described as follows: in the (Expectation) E-step, for every site, the likelihood that it belongs to either model of the mixture is computed. And, in the (Maximization) M-step, a set of parameters (i.e., the entries of the profile) for the individual models (motif model and background model) is recomputed using the likelihood values computed in the E-step as weights in the calculation. Upon convergence, we end up with two models: one for the motif and one for the background. We randomly initialize parameters for the motif model (by randomly choosing the locations of the binding sites), and then the E-step and M-step are iterated until convergence.
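A minimal sketch of one such E-step/M-step pair (a toy version with equal mixing priors, not the authors' implementation; sequences are strings over "acgt" and a profile is a list of per-position base frequencies):

def e_step(seqs, motif, background, K):
    # For every length-K site, the relative likelihood that it came from
    # the motif model rather than the background (equal mixing priors).
    Z = []
    for s in seqs:
        row = []
        for p in range(len(s) - K + 1):
            lm, lb = 1.0, 1.0
            for j in range(K):
                lm *= motif[j][s[p + j]]
                lb *= background[s[p + j]]
            row.append(lm / (lm + lb))
        Z.append(row)
    return Z

def m_step(seqs, Z, K):
    # Re-estimate the profile, weighting each site by its E-step likelihood.
    motif = [{b: 0.01 for b in "acgt"} for _ in range(K)]  # pseudocounts
    for s, row in zip(seqs, Z):
        for p, z in enumerate(row):
            for j in range(K):
                motif[j][s[p + j]] += z
    for col in motif:
        total = sum(col.values())
        for b in col:
            col[b] /= total
    return motif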
2.1.2. Improvements in MEME
The original version of EM as proposed by Lawrence and Reilly (Lawrence and Reilly 1990) suffers from several limitations. For example, it does not state how to choose a starting point. It assumes that each sequence in the dataset contains exactly one occurrence of the motif; it also assumes that there is only one instance of the motif in each upstream region and does not attempt to find multiple instances. Bailey and Elkan proposed a modified EM method called MEME to eliminate these limitations (Bailey and Elkan 1994). Their method used sequences from the input as random start points. The
method allows multiple instances of a motif in one upstream region. Furthermore, once the algorithm converges upon a motif, it is eliminated from consideration and then the algorithm restarts to look for other motifs. MEME works reasonably well on many data sets, and is widely used. However, it has shortcomings. First, even though it chooses a start point from among the subsequences of the input sequence, it may not converge upon a desired motif. Thus, it is not suitable for finding motifs for which we may know partial information. Second, the only way it can deal with comparative genomics data is by merely pooling the input sequences from multiple genomes. However, as mentioned before, this leaves the comparative genomics data underutilized. Our proposed IEM method considers comparative genomics data in a “dual” manner.

2.1.3. IEM Algorithm

The IEM algorithm is described below in Figure 1. Assume the input consists of profile M1 = [m_ij], which is a 4 × K matrix. K is the length of the motif and m_ij is the entry in the ith row and jth column of M1. Let the indicator variable matrix be defined as Z = (z_pq), where z_pq = 1 if an instance of the motif starts from the pth position in the upstream region of the qth gene, and is equal to 0 otherwise. These indicator variables approximate the probability that a specific site (i.e., the sequence starting from the pth position in the upstream region of the qth gene) is a binding site according to the profile matrix. The IEM algorithm estimates the indicator variable matrix Z1 and profile matrix M1 in the reference genome and the indicator variable matrix Z2 and profile M2 in the related genomes iteratively. The estimation process is similar to that in MEME (Bailey and Elkan 1994). However, in IEM a dual-step estimation is applied by incorporating comparative genomics data. Given indicator variable z_pq in one data source (either the reference genome or the related genomes) and a motif model (i.e., profile matrix) M for the entire data set (merged from M1 and M2), we can calculate the probability of observing a given upstream region U_q as follows:
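(sketched here in the standard Lawrence–Reilly mixture form, as an assumption consistent with the definitions that follow, rather than a verbatim quotation of the authors' formula)

\Pr(U_q \mid z_{pq} = 1, M) \;=\; \prod_{j=1}^{k} m_{u_{p+j-1},\,j} \prod_{\substack{1 \le i \le l \\ i \notin [p,\, p+k)}} m_{0,\,u_i} \qquad (1)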
where m_{0,a} is the background frequency for base a, m_{a,j} is the frequency for base a at position j in the motif model, k is the motif length, n is the number of 1s in Z_pq, and l is the length of the upstream sequence. Then, by Bayes’ rule, we can calculate the probability that the site at position p in upstream region q is a binding site as follows:
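(again a sketched reconstruction, not a verbatim quotation)

z_{pq} \;=\; \frac{\Pr(U_q \mid z_{pq} = 1, M)\, \Pr(z_{pq} = 1)}{\sum_{p'} \Pr(U_q \mid z_{p'q} = 1, M)\, \Pr(z_{p'q} = 1)} \qquad (2)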
Intuitively, the IEM algorithm tries to refine a motif in each iteration in two successive EM steps. In each step, it computes the likelihood for each site in one data set over a model M (not merely M1 or M2), which is arrived at by the previous maximization step applied over all the data sets. Comin et al. reported a subtle motif discovery method using a similar two-step strategy (Comin and Parida 2007). The differences are twofold. First, we incorporate comparative genomics data, and second, we use profiles instead of consensus sequences to represent the motifs.

Input:
a) Profile M1, motif length l, and associated gene set G1 from genome Γ1
b) Upstream sequences of the ORFs in G1
c) Additional genome(s) Γ2, and the orthology map for all the genomes
d) Upstream sequences of the ORFs in G2, the orthologs of G1 in Γ2
Output: Refined motif weight matrix M_r
Algorithm:
    Estimate Z2 in G2 from M1.
    while (not converged) do
        Re-estimate M2 in G2 from Z2.
        M = merge(M1, M2)
        Re-estimate Z1 in G1 from M.
        Re-estimate M1 in G1 from Z1.
        M = merge(M1, M2)
        Re-estimate Z2 in G2 from M.
    endwhile
    Return M_r.

Figure 1. IEM Algorithm
In summary, the IEM algorithm does the following 4 steps iteratively:
1. In the first E step, the probabilities that each site in the reference genome belongs to the profile M1 are computed by using formula (2).
2. In the first M step, the new profile M1 is estimated by using every (indicated) binding site in the reference genome (i.e., weighted with Z1). Profile M is updated using the new sites.
3. In the second E step, the probabilities that each site in the related genomes belongs to the profile M2 are computed by using formula (2).
4. In the second M step, the new profile M2 is estimated by using every (indicated) binding site in the related genomes (i.e., weighted with Z2). Profile M is updated by using the new sites.
The “merge” operation mentioned in the algorithm is achieved by creating the profile matrix from the instances of the sites with indicator value 1 from all the genomes. Note that a generalization of the merging step is possible where the sites are weighted by the probability of that site belonging to a model (i.e., its score against the profile).
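The loop of Figure 1 can be sketched as follows (a skeleton under assumed interfaces, not the authors' code: e_step(seqs, M) returns site weights Z, m_step(seqs, Z) returns a profile, and merge(M1, M2) pools the two profiles into one model):

def iem(M1, G1, G2, e_step, m_step, merge, n_iter=25):
    # G1 / G2: upstream sequences of the reference / related genomes.
    Z2 = e_step(G2, M1)        # initialize sites in the related genomes
    M2 = M1
    for _ in range(n_iter):
        M2 = m_step(G2, Z2)    # profile from the related genomes
        M = merge(M1, M2)
        Z1 = e_step(G1, M)     # E step in the reference genome, against M
        M1 = m_step(G1, Z1)    # M step in the reference genome
        M = merge(M1, M2)
        Z2 = e_step(G2, M)     # E step back in the related genomes
    return merge(M1, M2)       # the refined motif M_r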
2.2. Evaluation Approaches

Evaluation of the IEM algorithm is a nontrivial task because very little experimentally verified data is available. Even the available experimentally verified data is often only partial information. In one of the experiments described below, we consider the critical regulation activities in the arginine metabolic pathways in the bacterium P. aeruginosa (PAO1). We show that our algorithm, with the help of the complete genomes of six strains of P. aeruginosa, produces refined motifs with improved accuracy (see the Results section for details). The performance in such cases can be measured in terms of true positives and false positives from the available partial information. Here the true positives measure indicates the number of known binding sites that are predicted, while the false positives are the number of known non-binding sites that are predicted. In another experiment, where no experimentally verified data was available, we have proposed two approaches to evaluate our results. One approach is to investigate the functional enrichment of the genes whose upstream regions have a predicted binding site. Using gene ontology analysis, we observed that the terms that were enriched were closely related to what is known about the regulator. Another approach is to compute meaningful measures of motif scores. Traditional ones such as MAP and IC scores are not well-suited for comparative genomics data. A better approach is to use scores based on how well the predicted binding site is conserved across all the genomes under consideration. The simplest measure along these lines is what we will refer to as the conservation score. It is the average number of genomes in which any given predicted binding site occurs simultaneously in the upstream sequences of orthologous genes. This value ranges between 0 and m, where m is the number of genomes (besides the reference genome) being analyzed. Such a measure was proposed earlier (Gertz, Riles et al. 2005). Let m denote the number of genomes (besides the reference genome) being considered. Let n be the total number of genes in the reference genome whose upstream sequence has at least one predicted site of the motif, and let s_i be the number of genomes in which the ortholog of gene i contains a site in its upstream region. Then the conservation score S is defined as:

    S = (1/n) Σ_{i=1}^{n} s_i    (3)
The weakness of this conservation score is that it does not account for some key facts. In the following discussion, let A and B be two predicted motifs with the same conservation score, i.e., the same average number of hits per genome. (1) If A has more instances than B in which s_i equals m, it should be considered more significant. (2) If A has more hits than B in the reference genome, then it should be considered more significant. To overcome the above disadvantages, we propose a new score, which we refer to as the weighted conservation score. It is given as:

    S_w = log[mn] × ( Σ_{i=1}^{m} n_i w_i ) / ( n Σ_{i=1}^{m} w_i ),   with w_i > w_{i−1} for all i    (4)
where m is the number of genomes being considered, n is the number of genes in the reference genome whose upstream regions contain at least one instance of the predicted motif, n_i is the number of genes for which there are i genomes in which the corresponding ortholog contains at least one instance of the motif in its upstream region, and w_i is a suitable weight constant that satisfies w_i > w_{i−1} for all i, implying that if a motif instance occurs in more orthologs then it should be weighted higher. w_i is chosen to be i in the following example. We highlight the differences between the conservation score and the weighted conservation score using simple examples. In Figure 2, motifs A and B have the same conservation score. Unlike motif B, motif A has instances across all related genomes in the upstream regions of three orthologous gene sets. We argue that motif A is more conserved than motif B. The weighted conservation score reflects this intuition. Motif C, with the same conservation score as motif D, has more instances in the reference genome, which may indicate a more important biological role. The weighted conservation score rewards motifs A and C.
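A small sketch of both scores; the weighted form follows the reading of formula (4) above, with w_i = i, and should be taken as one plausible interpretation rather than the definitive formula:

import math

def conservation_score(s):
    # s[i] = number of related genomes in which gene i's ortholog has a hit.
    return sum(s) / len(s)

def weighted_conservation_score(s, m):
    n = len(s)
    n_i = [s.count(i) for i in range(1, m + 1)]   # n_i for i = 1..m
    w = list(range(1, m + 1))                     # w_i = i, as in the text
    return math.log(m * n) * sum(wi * ni for wi, ni in zip(w, n_i)) / (n * sum(w))

print(conservation_score([3, 3, 2, 0]))               # S for one motif, m = 3
print(weighted_conservation_score([3, 3, 2, 0], 3))   # S_w for the same motif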
3. RESULTS

Metabolic pathways have been widely studied. They can be extremely complex, and may involve large numbers of genes. Often every path in the network involves one or more TFs and the genes regulated by them. However, only a few of the genes and TFs in the pathways may have been identified, and even fewer of the TF binding sites may be known. A useful problem is to identify the genes and TFs and their binding sites specifically involved in a specific pathway. Starting from one or two experimentally verified binding sites, can we predict the rest of the relevant binding sites of the genes in the pathway? Furthermore, can we identify such a gene set? We will show that our IEM algorithm can help to address these questions. In order to evaluate our results, we used a well studied pathway, the arginine metabolic pathway in P. aeruginosa, as an example. It is already known that P. aeruginosa possesses four different pathways for utilization of arginine (Lu, Yang et al. 2004): the arginine deiminase (ADI) pathway, the arginine succinyltransferase (AST) pathway, the arginine decarboxylase (ADC) pathway, and the arginine dehydrogenase (ADH) pathway. Under anaerobic conditions, arginine can be used as a direct source of ATP via the ADI pathway. ArgR is a TF in the ADH pathway. Lu et al. used microarray experiments to identify candidate genes for the ArgR regulon (Lu, Yang et al. 2004). It was reported that ArgR regulated 37 (28 induced and 9 repressed) genes from 17 operons. Eighteen of the 28 arginine-inducible genes are in 4 transcriptional units that have been reported previously as members of the ArgR regulon (Itoh 1997; Park, Lu et al. 1997; Nishijyo, Park et al. 1998; Lu, Winteler et al. 1999; Lu and Abdelal 2001; Hashim, Kwon et al. 2004). Lu et al. also identified several new ArgR regulon members among these 37 genes, and verified them by wet lab experiments. Since the ArgR system is well studied, we used it to test the IEM algorithm.

3.1. Results on the Arginine Metabolic Pathway Study
3.1.1. Arginine pathway data set

Upstream regions of the 17 transcriptional units (operons) were obtained for five strains of P. aeruginosa (PAO1, PA14, PACS2, PA2192, and PAC3719). We also included 6 genes involved in the ADC pathway and the ADH pathway that were known not to bind to ArgR.
Figure 2. Shown are examples that highlight the differences between the conservation score, S, and the weighted conservation score, S_w.

3.1.2. Prediction comparison procedure

To show the power of our technique, we assumed for our experiments that we know only one (randomly chosen) instance of a binding site for ArgR. We used a subset of the operons mentioned above (12 out of 17 from the ADI pathways and all 6 from the ADC/ADH pathways). We then set out to see if the algorithm was successful in locating previously known binding sites in the remaining 5 operons. On average, the refined motif missed 1.2 of the 5 known binding sites. We applied MEME, AlignACE, and IEM to the same data set. The results were compared for an experiment with data from two genomes (PAO1 and PA14) and another experiment with data from five genomes (PAO1, PA14, PACS2, PA2192, and PAC3719). The idea was to get a sense of how much the comparative genomics data helped in the task. MEME and AlignACE were applied to the pooled data. For IEM, the initial profile was created using the motif instance. The frequency of the base from the consensus sequence was set at 0.7, and the frequencies of the other bases were set at 0.1. Each of the three programs was run 10 times on the data set introduced earlier. We counted the number of true predictions (TP, True Positives), the number of false predictions (FP, False Positives) and the motif scores IC (Information Content), MAP (maximum a posteriori probability) and the weighted conservation scores S_w.

3.1.3. Arginine pathway prediction comparison results

Tables 1 and 2 present the results from two experiments (the two-genome case vs. the five-genome case) for the 10 runs. The three columns present the results with the three programs. In cases where a motif was reported, the number of TPs and FPs along with three measures of quality of the motif are reported. The IEM algorithm finds the ArgR binding motif in every instance. In the experiments involving two genomes, the motif scores (using the MAP, IC, and S_w measures) are comparable to the reported ones using MEME or AlignACE. However, when four related genomes were used, the scores using the IEM algorithm were markedly superior to those with the other two methods (when they were reported).

3.2. Results on AmpR
In this section, we discuss our experiments with the IEM algorithm applied to data from experiments on the transcription factor AmpR in P. aeruginosa. AmpR was recently reported as a global transcription factor that regulates the expression of many virulence factors (Kong, Jayawardena et al. 2005). To better understand the regulon of AmpR, the consensus sequence (5'-TCTGCTGCAAATTT-3') of AmpR binding sites in C. freundii and E. cloacae was used by Kong et al. to find an exactly conserved sequence site within the upstream region of ampC in PAO1 (Kong, Jayawardena et al. 2005). They also analyzed the upstream regions of all the genes putatively regulated by AmpR with the hope of finding a potential AmpR binding site. Tools such as MEME and AlignACE failed to find anything resembling the binding site from the upstream region of ampC. The IEM algorithm was then applied using the consensus sequence mentioned above, a potential handcrafted list of 10 genes possibly regulated by AmpR, and newly available comparative genomics data sets from four closely related strains of Pseudomonas (PA14, PA2192, PACS2, and PAC3719). As mentioned in the previous section, a crude motif profile was constructed
based on the consensus sequence. The results before and after applying the IEM algorithm are shown in Table 3.
The refined motif showed improved scores according to three different motif scores. After refinement, we found that the putative AmpR binding site appears in only 3 of the 10 genes mentioned above (lasA, lasR, and ampC) across all five strains of P. aeruginosa. Support for these 3 predictions was obtained using lacZ fusions in the Mathee lab. Further experimental verification is needed and work is underway in the Mathee lab. We conjecture that the remaining 7 genes are only indirectly regulated by AmpR. We then used the refined motif to scan the entire PAO1 genome for instances of the motif in the upstream regions. Based on the likelihood value calculated in formula (2), we ranked the “hits”, chose the top 150 genes, and followed up with gene function enrichment analysis. See Table 4 for the results. The term with the top hit, i.e., the lowest P-value, was “periplasmic space”. This is considered significant because ampR is known to be involved in cell-wall recycling. A similar search with the motif before refinement did not find this GO term.
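A minimal sketch of such a scan-and-rank pass (illustrative names; the profile is a list of per-position base frequencies, and the score is the log-likelihood ratio underlying formula (2)):

import math

def best_hit(seq, profile, background):
    # Highest log-likelihood-ratio window of length K = len(profile).
    K = len(profile)
    return max(
        sum(math.log(profile[j][seq[p + j]] / background[seq[p + j]])
            for j in range(K))
        for p in range(len(seq) - K + 1))

def rank_genes(upstreams, profile, background, top=150):
    # upstreams: {gene: upstream sequence}; returns the top-scoring genes.
    ranked = sorted(upstreams,
                    key=lambda g: best_hit(upstreams[g], profile, background),
                    reverse=True)
    return ranked[:top]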
Table 1. Motif predicted by IEM, MEME, and AlignACE using data on 2 strains of P. aeruginosa (PAO1 and PA14).
Table 3. Characteristics of motif before and after refinement
3.3. Results on Whole Genomic Data
Table 2. Motif predicted by IEM, MEME, and AlignACE using data on 5 strains of P. aeruginosa (PAO1, PA14, PA2192, PACS2, and PAC3719).
Next we discuss our experiments with yeast data sets. Recently, Kellis et al. compared five yeast species to identify regulatory elements in the entire genome by searching for conserved segments across different yeast species (Kellis, Patterson et al. 2003). They developed a motif score called MCS (Motif Conservation Score) to measure the conservation ratio of a motif compared to the random patterns of the same length and degeneracy (Kellis, Patterson et al. 2003). A list of 72 full motifs having MCS at least 4 was reported. These 72 predicted motifs showed strong overlap with 28 of the 33 known motifs in yeast. However, the motifs used in the paper were represented using generalized consensus sequences (i.e., using IUPAC codes to represent nucleotide degeneracy) instead of the more powerful profile matrix. We set out to consider whether the IEM algorithm could improve the predictions from that work. Starting from the results of Kellis et al., we used IEM to refine each of the 72 motifs mentioned above.
Data from four yeast genomes (S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus) were used. Complete results on the refined motifs are available at our supplementary results website: [http://biorg.cs.fiu.edu/IEM/]. Below we show some of the highlights in Table 5. In each case the number of hits went down after the refinement.
Table 4. GO enrichment analysis for the AmpR experiments.
4. DISCUSSION AND CONCLUSION

In this paper we propose a new algorithm to refine motifs with the help of comparative genomics data. The algorithm incorporates an improved scoring scheme that is sensitive to hits in the related genomes. The algorithm is inspired by the technique of "co-training" from the field of data mining, where lessons learnt from one data source are iteratively used to model the situation for another data source. The results show clear improvements in the quality of the output motifs. The IEM algorithm does have its own shortcomings, which we continue to improve. First, it does not attempt to change the length of the motif from the initial motif it started with. Second, it works best if the genomes considered are very closely related, and is useful in cases where the phylogenetic relationships between the genomes are not known. If phylogenetic information is available, then the algorithm can be modified to factor this in, along the lines of several previous algorithms.
ACKNOWLEDGMENTS

The work of GN was supported in part by a grant from NIH under NIH/NIGMS S06 GM008205. We thank Camilo Valdes for helping us compile the upstream sequence data for the five strains of P. aeruginosa and for his help with Figure 2 in the paper.
Table 5. Results of motif refinement for the yeast data set. For each of the five motifs, the upper row is the consensus sequence from Kellis et al., while the lower row is the result after refinement by the IEM algorithm. (Recoverable entries include the consensus motifs YCGTnnnnmRYGAY, hRCCCYTWM, and CGGCnnMGnnnnnnnCGC, with motif scores (IC, MAP) such as 1.89, 0.40 before refinement and 9.83, 5.61 after.)
REFERENCES 1. Bailey, T. L. and C. Elkan (1994). "Fitting a mixture model by expectation maximization to discover motifs in biopolymers." Proc Int Conf Intel1 Syst Mol Biol2: 28-36. 2. Blanchette, M. and M. Tompa (2002). "Discovery of regulatory elements by a computational method for phylogenetic footprinting." Genome Res 12(5): 739-48. 3. Comin, M. and L. Parida (2007). Subtle Motif Discovery for Detection of DNA regulatory sites. Pac Bioinfo Conf (APBC2007), Hong Kong. 4. Crooks, G. E., G. Hon, et al. (2004). "WebLogo: A sequence logo generator." Genome Res 14(6): 11881190. 5. Dempster, A. P., N. M. Laird, et al. (1977). "Maximum likelihood estimation from incomplete data via the EM algorithm." J. R.Statist. SOC.B 39: 1-38. 6. Gertz, J., L. Riles, et al. (2005). "Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics." Genome Res 15(8): 1145-52. 7. GuhaThakurta, D. (2006). "Computational identification of transcriptional regulatory elements in DNA sequence." Nucl Acids Res 34( 12): 3585-98. 8. Hashim, S., D. H. Kwon, et al. (2004). "The arginine regulatory protein mediates repression by arginine of the operons encoding glutamate synthase and anabolic glutamate dehydrogenase in Pseudomonas aeruginosa." J Bacteriol 186(12): 3848-54. 9. Hertz, G. Z. and G. D. Stormo (1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8): 563-77.
10. Hu, J., B. Li, et al. (2005). "Limitations and potentials of current motif discovery algorithms." Nucl Acids Res 33(15): 4899-4913.
11. Itoh, Y. (1997). "Cloning and characterization of the aru genes encoding enzymes of the catabolic arginine succinyltransferase pathway in Pseudomonas aeruginosa." J Bacteriol 179(23): 7280-90.
12. Kellis, M., N. Patterson, et al. (2003). "Sequencing and comparison of yeast species to identify genes and regulatory elements." Nature 423(6937): 241-54.
13. Kong, K. F., S. R. Jayawardena, et al. (2005). "Pseudomonas aeruginosa AmpR is a global transcriptional factor that regulates expression of AmpC and PoxB beta-lactamases, proteases, quorum sensing, and other virulence factors." Antimicrob Agents Chemother 49(11): 4567-75.
14. Lawrence, C. E. and A. A. Reilly (1990). "An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences." Proteins 7(1): 41-51.
15. Liu, X., D. L. Brutlag, et al. (2001). "BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes." Pac Symp Biocomput: 127-38.
16. Liu, X. S., D. L. Brutlag, et al. (2002). "An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments." Nat Biotechnol 20(8): 835-9.
17. Liu, Y., X. S. Liu, et al. (2004). "Eukaryotic regulatory element conservation analysis and identification using comparative genomics." Genome Res 14(3): 451-8.
18. Lu, C. D. and A. T. Abdelal (2001). "The gdhB gene of Pseudomonas aeruginosa encodes an arginine-inducible NAD(+)-dependent glutamate dehydrogenase which is subject to allosteric regulation." J Bacteriol 183(2): 490-9.
19. Lu, C. D., H. Winteler, et al. (1999). "The ArgR regulatory protein, a helper to the anaerobic regulator ANR during transcriptional activation of the arcD promoter in Pseudomonas aeruginosa." J Bacteriol 181(8): 2459-64.
20. Lu, C. D., Z. Yang, et al. (2004). "Transcriptome analysis of the ArgR regulon in Pseudomonas aeruginosa." J Bacteriol 186(12): 3855-61.
21. MacIsaac, K. D. and E. Fraenkel (2006). "Practical strategies for discovering regulatory DNA sequence motifs." PLoS Comput Biol 2(4): e36.
22. Moses, A. M., D. Y. Chiang, et al. (2004). "Phylogenetic motif detection by expectation-maximization on evolutionary mixtures." Pac Symp Biocomput: 324-35.
23. Nishijyo, T., S. M. Park, et al. (1998). "Molecular characterization and regulation of an operon encoding a system for transport of arginine and ornithine and the ArgR regulatory protein in Pseudomonas aeruginosa." J Bacteriol 180(21): 5559-66.
24. Park, S. M., C. D. Lu, et al. (1997). "Cloning and characterization of argR, a gene that participates in regulation of arginine biosynthesis and catabolism in Pseudomonas aeruginosa PAO1." J Bacteriol 179(17): 5300-8.
25. Pavesi, G., P. Mereghetti, et al. (2004). "Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes." Nucl Acids Res 32(Web Server issue): W199-203.
26. Prakash, A., M. Blanchette, et al. (2004). "Motif discovery in heterogeneous sequence data." Pac Symp Biocomput: 348-59.
27. Sandve, G. K. and F. Drablos (2006). "A survey of motif discovery methods in an integrated framework." Biol Direct 1: 11.
28. Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display consensus sequences." Nucl Acids Res 18(20): 6097-100.
29. Siddharthan, R., E. D. Siggia, et al. (2005). "PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny." PLoS Comput Biol 1(7): e67.
30. Sinha, S., M. Blanchette, et al. (2004). "PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences." BMC Bioinformatics 5: 170.
31. Sinha, S. and M. Tompa (2003). "YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation." Nucl Acids Res 31(13): 3586-8.
32. Stormo, G. D. (2000). "DNA binding sites: representation and discovery." Bioinformatics 16(1): 16-23.
33. Tompa, M., N. Li, et al. (2005). "Assessing computational tools for the discovery of transcription factor binding sites." Nat Biotechnol 23(1): 137-44.
34. Wang, T. and G. D. Stormo (2003). "Combining phylogenetic data with co-regulated genes to identify regulatory motifs." Bioinformatics 19(18): 2369-80.
35. Werner, T. (1999). "Models for prediction and recognition of eukaryotic promoters." Mamm Genome 10(2): 168-75.
36. Zeng, E. and Narasimhan, G. (2007). "Enhancing motif refinement by incorporating comparative genomic data." Proc of the Int Symp on Bioinfo Res and Appl (ISBRA), Lect Notes in Comp Sci, Vol. 4463, Springer Verlag, pp. 329-337.
MANGO: A NEW APPROACH TO MULTIPLE SEQUENCE ALIGNMENT
Zefeng Zhang and Hao Lin
Computational Biology Research Group, Division of Intelligent Software Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: {zhangzf, linhao}@ict.ac.cn

Ming Li*
David R. Cheriton School of Computer Science, University of Waterloo, Ont. N2L 3G1, Canada
* Email: [email protected]

Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. Full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes the information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks, showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0. MANGO is available at http://www.bioinfo.org.cn/mango/
1. Introduction

Multiple sequence alignment is a basic and essential step of many sequence analysis methods6. For example, multiple sequence alignment is used in phylogenetic inference, RNA structure analysis, homology search, non-coding RNA (ncRNA) detection and motif finding. For recent reviews in this area, see Refs. 35 and 13. Finding the optimal alignment (under the SP score with affine gap penalty) for multiple sequences has been shown to be NP-hard44. A trivial solution by dynamic programming takes O(n^k) time with k sequences, each of length n. Under moderate assumptions (a), the problem has a polynomial time approximation scheme (PTAS)28. However, this PTAS remains a theoretical solution, since it has a high polynomial power related to the error rate. With the rapid growth of molecular sequences, the problem becomes more prominent. Thus many modern alignment programs resort to heuristics to reduce the computational cost
while sacrificing accuracy. The prevailing strategy is the progressive alignment method14, 41, implemented in the popular ClustalW software, as well as in the more recent multiple sequence alignment programs MUSCLE, T-Coffee, MAFFT19, and Progressive-POA, to name a few. The idea behind progressive alignment is to build the multiple sequence alignment on the basis of pairwise alignments under the guidance of an evolutionary tree. A distance matrix is computed from the similarities of sequence pairs, according to which a phylogenetic tree is built. The multiple alignment is then constructed by aligning two sequences or alignment profiles along the phylogenetic tree. In this way, the progressive alignment method avoids the exponential search. For a large number of sequences, the distance matrix calculation can be slow, and the optimal phylogenetic tree construction itself, under the usual assumptions of parsimony, max likelihood, or max number of quartets, is NP-hard anyway. After all, sometimes the purpose of doing multiple sequence alignment is
*Corresponding author.
a For example: when the average number of gaps per sequence is a constant, the problem has a PTAS.
to construct a phylogenetic tree itself. Then if we did not believe the initial phylogeny (constructed to do multiple sequence alignment), why should we believe in the phylogeny that is constructed based on a multiple alignment which in turn is based on the untrusted phylogeny? Again, heuristics were used to accelerate the phylogenetic tree construction. Pairwise similarity is estimated using fast k-mer counting in MUSCLE, and a similar strategy can be seen in the fast version of ClustalW. However, in spite of its most attractive virtue, speed, the progressive approach (adding sequences greedily to form the multiple sequence alignment) is born with the well-known pitfall that errors introduced in early stages cannot be corrected later (so-called 'once a gap, always a gap'). Many efforts have been made to remedy this drawback and enhance the accuracy of the final alignment. MUSCLE adds tree refinement after the progressive alignment stage to tune the result. T-Coffee uses a consistency-based score reflecting global information to assist pairwise alignment during the progressive alignment process. PROBCONS adopts a probabilistic consistency-based approach. All of them do achieve better accuracy. There are two alternatives to progressive approaches. One is simultaneous alignment of all sequences by standard dynamic programming (DP). Two packages, MSA26 and DCA, follow this idea. However, algorithms in this category do not scale up because of their heavy computational costs. Another alternative is the iterative strategies16, 18, 25. Starting from an initial alignment, these methods iteratively tune the alignment until there are no improvements to the objective functions. These iterative strategies require a good initial alignment as a starting point; otherwise the iterative process will be time-consuming or easily fall into local optima. We now introduce the core idea of MANGO. Given a set of sequences, how do we really judge an alignment? Do we really care about aligning a non-homologous region well? No. What we really care about is the aligner's ability to put similar regions together, including the distant homologous regions. Some similar regions are shared by most (if not all) sequences, while others may be shared by only a few sequences. Gaps are inserted to get an alignment which lines up the similar regions properly. With this simple observation, we describe a new paradigm to do multiple sequence alignment.
Our new algorithm uses the novel idea of optimized spaced seeds, introduced by Ma, Tromp, and Li31 initially for pairwise alignment, to find similar regions and bind them together via sophisticated algorithms (which are of theoretical interest in their own right), and then refines the alignment. Note that similar approaches (Ref. 32) using consecutive or non-optimized k-mers (with some gaps) may have been used in some programs to some degree; however, without optimized spaced seeds, such approaches cannot achieve sensitivity and specificity as high as MANGO's. Also note that these optimized spaced seeds do not depend on any data: they are independently optimized, as in Refs. 31, 29 and 9. Our new algorithm requires neither the slow global multiple alignment nor the inaccurate progressive local pairwise alignment. Optimized spaced seeds showed their advantages over traditional consecutive seeds for pairwise alignment in the PatternHunter software31. They have since been adopted by most modern homology search software, including BLAST (MegaBLAST). It has been shown31, 22, 7, 30 that optimized spaced seeds can achieve much better sensitivity and specificity than consecutive seeds. After that, the idea of a single optimized spaced seed was extended to multiple optimized seeds in the PatternHunter II program, vector seeds3, and neighbor seeds; multiple seeds were studied for even better sensitivity and specificity. One multiple genomic DNA alignment program4 uses optimized spaced seeds to find hits between two sequences. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. MANGO uses multiple optimized spaced seeds to catch similar regions (hits) quickly, with high sensitivity and specificity. A scoring scheme is designed to encode global similarity information in the hits. Hits with scores beyond a threshold are arranged carefully to form parts of the alignment. Under the constraint of these hits, banded dynamic programming is carried out to achieve a global solution. Experiments were carried out on large 16S RNA benchmarks, showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW, MUSCLE, MAFFT, ProbConsRNA, Dialign, DIALIGN-T, T-Coffee, POA and Kalign. The experiments were performed only on
nucleotide sequences, for the purpose of justifying our new approach. For multiple protein sequence alignment, other factors, such as similarity scoring schemes and secondary structure information, affect the alignment quality considerably and hence would potentially blur the comparative results on the effectiveness of the optimized spaced seed approach.
2. Method

The workflow of MANGO is given in Fig. 1. Our strategy contains three stages. After any stage, MANGO can be stopped and it will output the alignment constructed so far. In stage one of template construction, MANGO locates super motifs in the input sequences, and builds a skeletal alignment by pasting each sequence to the template exposed by the motifs. In stage two of hit binding, MANGO first sorts the hits according to their agreements among themselves and then tries to bind hits one by one into the skeletal alignment. Iterative refinement is then carried out in stage three to produce the final alignment, where MANGO picks out one sequence at a time, and aligns it to the current alignment of the rest of the sequences.
2.1. Seeds selection

Following the original notation of Ref. 31, we denote a spaced seed by a binary string. A "1" in the spaced seed means it requires a match at that position, and a "0" indicates a "don't care" position not requiring a match. The length of the seed is the length of the binary string, and the weight of the seed is the number of 1's in the seed. For reasons why optimized spaced seeds are much better than consecutive BLAST-type seeds, please see Refs. 31 or 30, which point to many more recent theoretical studies. We have used eight highly independent spaced seeds (b) generated from the parent seed 1110110010110101111, with seed weight 13 and seed lengths ranging from 19 to 23, from Ref. 9. 82 single optimal spaced seeds, with weights ranging from 9 to 16 and lengths ranging from 9 to 22, were optimized against a 64-length IID region of similarity level 0.7, using the dynamic programming algorithm described in Ref. 22. These 82 seeds are sorted by decreasing weight, to prefer specificity over sensitivity. We used the first seed to locate the super motifs and construct the profile template; then we applied the other seeds one by one. For each seed match (namely a hit), we call the matched fragment a k-mer. The k-mer has the same length as the seed.
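To make the seed notation concrete, the following sketch (Python; the helper names are ours, not MANGO's) projects each window of a sequence through the must-match positions of a seed, so that two windows hit each other exactly when their projected k-mers agree:

```python
# A minimal sketch of spaced-seed matching under the binary-string
# notation above; illustrative only, not MANGO's implementation.
def seed_positions(seed):
    """Indices of the '1' (must-match) positions of a spaced seed."""
    return [i for i, c in enumerate(seed) if c == "1"]

def extract_kmers(sequence, seed):
    """Yield (offset, projected k-mer) for every window of the sequence.

    Two windows hit each other under the seed iff their projected
    k-mers are identical; the '0' (don't-care) positions are ignored.
    """
    pos = seed_positions(seed)
    for i in range(len(sequence) - len(seed) + 1):
        window = sequence[i:i + len(seed)]
        yield i, "".join(window[p] for p in pos)

seed = "1110110010110101111"   # the parent seed quoted above
assert seed.count("1") == 13   # weight 13
assert len(seed) == 19         # length 19
```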
Fig. 1. Three stages of MANGO: template construction, hit binding and iterative refinement, producing skeletal alignment, initial alignment and final alignment, respectively.
b The eight spaced seeds are: 11100110110010101111, 1101110110000110100111, 1011110010110111011, 11001110000010110101111, 10110111010110001111, 10101010110010100101111, 1110110001111101101, 11001110110010010001111.
2.2. Stage one: constructing profile template

If a piece of sequence segment is shared by a considerably large portion of the input sequences, we call it a super motif. The super motifs reflect the conserved parts among sequences and are very likely to appear in the final alignment; hence pinning them down will give guidance to the whole alignment process. MANGO uses an optimized spaced seed to detect super motifs, so they are not necessarily identical (due to the "don't care" positions of the spaced seed), as long as they have high enough similarity to be caught by the spaced seed. The detection is performed as follows. Highly frequent k-mers (currently defined as 25% of the input sequences having the k-mer) extracted by the spaced seed are lined up according to their relative appearance in each sequence. Then MANGO determines the overlapping portion (see the middle of Fig. 2 for a demonstration of overlap) of adjacent k-mers, by searching for their existing overlapping parts in each sequence. In this way, a super motif is represented
by a series of overlapping k-mers, and all those k-mers are concatenated together to form the profile template. After that, MANGO aligns each sequence to the template. As Fig. 2 indicates, the highly frequent k-mers residing inside each sequence are directed (aligned) to the corresponding positions in the profile template, thus producing a skeletal alignment.
Fig. 2. MANGO locates super motifs by highly frequent k-mers (shaded boxes) and constructs the profile template by concatenating them. Those k-mers inside each sequence are aligned to the profile template, producing a skeletal alignment.
The impact of this stage is twofold: the profile template provides anchors and constraints for the later alignment process, and wiping out the highly frequent k-mers greatly reduces the number of hits to be considered in stage two of hit binding.
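A sketch of the super-motif detection step, reusing the extract_kmers helper from the seed sketch above (the 25% frequency cutoff follows the text; everything else is an illustrative assumption):

```python
from collections import Counter

def super_motif_kmers(sequences, seed, frac=0.25):
    """Projected k-mers present in at least frac of the input sequences."""
    seen_in = Counter()
    for seq in sequences:
        # count each distinct k-mer at most once per sequence
        seen_in.update({kmer for _, kmer in extract_kmers(seq, seed)})
    cutoff = frac * len(sequences)
    return {kmer for kmer, count in seen_in.items() if count >= cutoff}
```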
2.3. Stage two: binding the hits

2.3.1. Vote among hits

After getting rid of the super motifs, the less frequent k-mers extracted by single or multiple spaced seeds generate a set of hits among the sequences. Hits may conflict with each other, and MANGO tries to select a good compatible subset of them. Since the consistency relationships among these hits reflect the global similarity of the input sequences, MANGO encodes global information into each hit by assigning it a priority score, which is voted on by the other hits to agree or disagree that this hit should appear in the final alignment. The consistency and inconsistency relationships of the hits are illustrated in Fig. 3, corresponding to "yes" votes (positive score) and "no" votes (negative score), respectively. Assume that hit_i and hit_j occur between the same two sequences. Let S(i, j) be the vote score for hit_i by hit_j, which is calculated as follows:

(1) if hit_i and hit_j are incompatible (either they are order incompatible, as in Fig. 3(1.a), or they are overlapped but their nucleotide mapping order is inconsistent, as in Fig. 3(1.b)), they cannot appear in the same alignment simultaneously. Hence hit_j will vote against hit_i, and S(i, j) = -W_disagree;

(2) if hit_i and hit_j are order compatible, as Fig. 3(2.a) indicates, the appearance of hit_j in a certain alignment will encourage hit_i to appear too, and S(i, j) = W_agree; if hit_i and hit_j are overlapped and their nucleotide mapping order is consistent, as Fig. 3(2.b) indicates, MANGO further considers the size of their overlapped region. Define the overlapping ratio between them as Q = overlap_size / (2l - overlap_size), where l is the spaced seed (hit) length. Then S(i, j) = Q * W_overlap_high + (1 - Q) * W_overlap_low.

We also have indirect votes from k-mers on other sequences. If hit_j votes for or against hit_i (as in Fig. 3(3.a) and (3.b)), those k-mers identical to the k-mer of hit_j on other sequences (C in Fig. 3(3)) increase the power of the vote, since the occurrence of C through hit_j enhances the probability that hit_i appears, or does not appear, in the final correct alignment.
Fig. 3. To uncover global similarities, MANGO assigns hit_i a priority score, which is voted on by other hits: (1) "no" vote: hit_i and hit_j are order incompatible in 1.a, or they have inconsistent nucleotide mapping order in 1.b; (2) "yes" vote: hit_i and hit_j are order compatible in 2.a, or they have consistent nucleotide mapping order in 2.b; (3) indirect vote from a k-mer C, which increases the voting power of hit_j for hit_i.
Let N(hit) be the number of sequences that have the same k-mer as that inside the hit. The priority score assigned to hit_i is calculated as sum_j S(i, j) * (1 + (N(j) - 2) * W_indirect). The voting results are collected and hits are sorted according to their scores. Low-scored hits (probably random hits) are removed, and the remaining hits are considered as candidates for the next hit-binding stage.
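The voting scheme can be sketched as follows (Python); here a hit between a fixed pair of sequences is represented by its two start positions, and the W_* weights are illustrative placeholders, since the paper does not publish its exact values:

```python
W_AGREE, W_DISAGREE = 1.0, 1.0
W_OVERLAP_HIGH, W_OVERLAP_LOW, W_INDIRECT = 2.0, 0.5, 0.1

def vote(hit_i, hit_j, l):
    """S(i, j): the vote cast for hit_i by hit_j; hits are (a, b) start
    positions on the same pair of sequences, l is the seed length."""
    (ai, bi), (aj, bj) = hit_i, hit_j
    da, db = ai - aj, bi - bj
    if (da < 0) != (db < 0) or (da == 0) != (db == 0):
        return -W_DISAGREE                    # order incompatible
    if abs(da) < l:                           # the two hits overlap
        if da != db:
            return -W_DISAGREE                # inconsistent mapping order
        ov = l - abs(da)
        q = ov / (2 * l - ov)                 # overlapping ratio Q
        return q * W_OVERLAP_HIGH + (1 - q) * W_OVERLAP_LOW
    return W_AGREE                            # order compatible

def priority(i, hits, l, n_kmer):
    """sum_j S(i, j) * (1 + (N(j) - 2) * W_indirect), with N(j) = n_kmer[j],
    the number of sequences sharing hit_j's k-mer (indirect votes)."""
    return sum(vote(hits[i], hits[j], l) * (1 + (n_kmer[j] - 2) * W_INDIRECT)
               for j in range(len(hits)) if j != i)
```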
2.3.2. Bind hits greedily
In the second part of stage two, MANGO tries to bind each hit (highest score first) into the skeletal alignment generated in stage one, greedily checking the hit candidates one by one. To bind a hit is to align all corresponding sequence letter pairs along the hit and fix their positions (once aligned, they remain aligned). What we want is to arrange the relative positions of the hits carrying similarity information. Thus, a natural way is to formulate an alignment solution as a directed acyclic graph (DAG), viewing aligned nucleotides as one vertex. We note that Ref. 25 employs a similar DAG formulation.
If the above criteria are satisfied, MANGO binds the hit to the DAG (updating the DAG by merging the corresponding vertices); otherwise, the hit is discarded. Let reach(s, t) be a predicate which is true iff s ≠ t and there is a directed path from s to t in the DAG. If we merge two vertices x and y when reach(x, y) ∨ reach(y, x) is true, then the resulting graph is no longer a DAG, due to the cycle introduced. So a hit ((s1, t1), (s2, t2), . . . , (sl, tl)) can be bound into the alignment DAG if and only if ¬reach(si, ti) ∧ ¬reach(ti, si), for 1 ≤ i ≤ l.
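A sketch of this feasibility test over an adjacency-list DAG (illustrative; MANGO's actual implementation expands predecessors incrementally, as Fig. 5 describes):

```python
def reach(graph, s, t):
    """True iff s != t and a directed path s -> t exists (plain DFS)."""
    if s == t:
        return False
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v == t:
                return True
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def can_bind(graph, hit):
    """hit is a list of vertex pairs [(s1, t1), ..., (sl, tl)]; binding is
    allowed iff no pair is already connected in either direction."""
    return all(not reach(graph, s, t) and not reach(graph, t, s)
               for s, t in hit)
```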
Fig. 5. (1) The success of binding a hit ((s1, t1), (s2, t2), . . . , (sl, tl)) requires that there is no directed path between any vertex pair (si, ti), 1 ≤ i ≤ l; (2) we check that by expanding the left predecessors (grey area) of (si, ti), bypassing those already expanded by (s(i-1), t(i-1)) (dark grey area).
Fig. 4. By viewing aligned nucleotides as one vertex, an alignment can be formulated as a DAG.
Given N sequences, each with Li nucleotides, we denote the jth nucleotide in sequence Si as a vertex Si,j. Directed edges are linked from Si,j to Si,j+1, for 0 ≤ j < Li - 1. Thus, the initial graph G = (V, E) has Σi Li vertices.
2. RELATED AND NEW WORK

2.1. Related Work

Previously, folding simulation analysis was performed mainly for testing various protein folding models18, 33, such as the folding pathway model and the funnel model, and/or for studying energetic aspects of folding kinetics5, 19. The geometric shapes of the conformations involved in folding trajectories have not been widely explored6, 14, 28, despite their important role in folding. A particularly interesting work in this direction is by Ota et al.28, who provide a quite detailed study of the folding trajectories of the mini-protein Trp-cage using a phylogenetic tree combined with expert knowledge. However, in general, an automatic tool to facilitate folding simulation analysis at large scales is still missing. This paper provides an important step towards this goal by modeling folding trajectories as curves and using a new multiple curve comparison (MCC) algorithm to detect critical folding events.
The closest relative of our MCC problem in computational biology is the multiple structure alignment (MSTA) problem, which aims at aligning a family of protein structures, each modeled as a three-dimensional polygonal curve representing its backbone. MSTA is a very hard problem. In fact, even the pairwise comparison problem of aligning two structures A and B is believed to be NP-hard, since one has to optimize simultaneously both the correspondence between A and B and the relative transformation of one structure with respect to the other. Given k structures, even the problem of aligning them optimally without considering transformations becomes intractable: it takes Ω(n^k) time using the standard dynamic programming algorithm, where n is the size of each protein involved. Numerous heuristic-based algorithms have been developed in practice for this fundamental problem.

In practice, progressive methods are widely used to attack the MSTA problem21. For example, given a set of structures, many approaches start with a seed structure and then progressively align the remaining structures onto it one by one3, 20, 25, 26, 35. A consensus or core structure is typically built throughout, to maintain the common substructures among the proteins that are already aligned. At each round, only pairwise structure comparison is usually performed to align the current consensus with a new structure.
Obviously, the above progressive MSTA framework is a greedy approach. Its performance depends on the underlying pairwise comparison methods used, the order of the structures that are progressively aligned, as well as the consensus structure maintained. Various heuristics have been exploited to find a good order for the progressive alignments. Note that this order can also be guided by a tree instead of a linear sequence, which removes the need to choose a seed structure. The progressive procedure may also be iterated several times to locally refine the multiple structure alignments.
2.2. Our Results

There are two main differences between the MCC problem we are interested in and the traditional MSTA problem. In the case of protein structures, it is usually explicitly or implicitly assumed that (the majority of) the input proteins belong to one family (a), or at least share some relations. As such, one can expect that some consensus of the family should exist. However, in our case, the set of curves comes from a set of simulations including both successful and unsuccessful runs, and we wish to classify this diverse set of curves and capture common features within as well as across its sub-families. Secondly, and more importantly, the level of similarity existing in these folding trajectories is usually much lower than that in a family of related proteins.
"How to classify a set of input structures into different families is a related problem, and many such classifications exist
12, 2 2 , 2 7 .
Fig. 1. Aligning five trajectories (IDs 1 to 5) using (a) a linear graph, and (b) a partial order graph. Symbols in the circles are the node IDs and numbers on edges are trajectory IDs. Note that the linear alignment in (a) will not be able to record the partial similarity between curves 3 and 4, which is maintained in (b) (i.e., node d).
Hence we aim at an algorithm with high sensitivity, which is able to detect small-scale partial similarity. In this paper, we propose and develop a sensitive MCC algorithm, called the EPO (enhanced partial order) algorithm, to compare a set of diverse high dimensional curves. Our algorithm follows a similar framework as the POA algorithm17, 35, encoding the similarities of aligned curves in a partial order graph instead of in the linear structure used by many traditional MSTA algorithms. This has the advantage that, other than similarities among all curves, similarities among a subset of the input curves can also be encoded in this graph. See Figure 1 for an example, where nodes in both graphs represent a group of aligned points from the input curves. For the more important problem of sensitivity, we observe that, being a greedy approach, the progressive MSTA framework tends to be inherently insensitive to low levels of similarity: if one early local decision is wrong, it may completely miss a small-scale partial similarity. To improve this aspect of the performance of the progressive framework, we first propose a novel two-level scoring function to measure similarity, which, together with a clustering idea, greatly enhances the quality of the local pairwise alignment produced at each round. We then develop an effective merging step to post-process the obtained alignments. This step helps to reassemble vertices from input curves that should be matched together, but were scattered in several sub-clusters in the alignments due to some earlier non-optimal
decisions. Both techniques are general and can be used to improve the performance of many existing MSTA algorithms. Experimental results show that our MCC algorithm is highly sensitive and able to classify input curves. We also demonstrate the power of our tool in mining critical events from protein folding trajectories using a detailed case study of the mini-protein Trp-cage. Although our EPO algorithm is developed for comparing folding trajectories, the algorithm is general and can be applied to other domains as well, such as protein structures or pedestrian trajectories extracted from surveillance videos34. EPO fits especially well in those applications where the level of similarity is low.
3. METHOD

In this section, we describe our EPO algorithm for comparing a set of possibly high dimensional general curves. If we are given a set of protein folding data, we first convert each folding trajectory to a high dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing different states of this protein during the simulation of its folding process. We represent each conformation using the distance map between its alpha-carbon atoms (b), so that it is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is an n × n matrix M where M[i][j] equals the distance between the ith and jth alpha-carbon atoms along the protein backbone. This matrix can then be considered as a point in n^2 dimensions. This way, we map each trajectory of m conformations to a curve in R^(n^2) with m vertices. In the remaining part of this paper, we use the terms trajectories and curves interchangeably.
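The conversion just described is straightforward to state in code; the sketch below (NumPy; the names are ours) turns a trajectory of m conformations, each given by n alpha-carbon coordinates, into a curve of m points in n^2 dimensions:

```python
import numpy as np

def distance_map(coords):
    """coords: (n, 3) alpha-carbon positions -> (n, n) distance matrix,
    which is invariant under rigid transformations."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def trajectory_to_curve(conformations):
    """conformations: (m, n, 3) array -> (m, n*n) curve in R^(n^2)."""
    return np.stack([distance_map(c).ravel() for c in conformations])
```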
"One can also encode the side-chain information into the high dimensional curves, or map a substructure into a high dimensional point.
3.1. Notations and Algorithm Overview

Before we formally define the MCC problem, we introduce some necessary notations. Given a set of elements V = {v1, . . . , vl}, a relation ≺ over V is transitive if vi ≺ vj and vj ≺ vk imply that vi ≺ vk. In this paper, we also refer to vi ≺ vj as a partial order constraint. A partial order graph (POG) G = (V, E) is a directed acyclic graph with V = {v1, . . . , vl}, where vi ≺ vj if there is an edge (vi, vj). Note that by the transitivity of this relation, two nodes may have a partial order constraint even when there is no edge between them in G. Let R be the set of partial order constraints induced by G. We say that V is a partial order list w.r.t. G if for any vi ≺ vj ∈ R, we have that i < j. In other words, the linear order in V is a total order satisfying all partial order constraints induced from G. See Figure 2 for an example.
Fig. 2. A POG G of 5 nodes. Note that there is a partial order constraint a ≺ d even though there is no edge between them. Both {a, b, c, d, e} and {a, c, b, d, e} are valid partial order lists w.r.t. G.
Let T = {T1, . . . , TN} be a set of N trajectories in R^d, where each trajectory Ti is an ordered sequence of n points p_1^i, . . . , p_n^i (c). The goal of the MCC algorithm is to find aligned sub-sequences from T. More formally, an aligned node o is a collection of vertices from the Ti's, with at most one point from each Ti. Given a 3-tuple (T, τ, ε), where τ and ε are input thresholds, an alignment of T is a POG G with the corresponding set of partial order constraints R and a partial order list of aligned nodes O = {o1, . . . , oL} such that the following three criteria are satisfied:

C1. |ok| ≥ τ, for any k ∈ [1, L];
C2. for any p_j^i, p_j'^i' ∈ ok, ||p_j^i - p_j'^i'|| ≤ ε;
C3. if p_j^i ∈ ok1 and p_j'^i ∈ ok2 with ok1 ≺ ok2, then j < j'.
(C1) indicates that the number of vertices of input curves aligned to each aligned node is greater than a size threshold τ, and (C2) requires that these aligned points are tightly clustered together (i.e., the diameter is bounded by a distance threshold ε). (C3) enforces that points in different aligned nodes still maintain their partial order along their respective trajectory. Our goal is to maximize L, the size of such an alignment O. See Figure 3(b) for an example of an alignment graph.
Algorithm overview. At a high level, the EPO algorithm has two stages (see Figure 3): (S1) the initial POG construction stage and (S2) the merging stage. The first stage generates an initial alignment for T, encoded in a POG G. The procedure has the same framework as the POA algorithm, but its performance, especially when the similarity is low, is significantly improved via the use of a clustering preprocessing step and a new two-level scoring function. In the second stage, we develop a novel and effective procedure to merge nodes from G to produce aligned nodes with large size, and output a better final alignment G*. Below, we describe each stage in detail.

3.2. Initial POG Construction
Standard dynamic programming (DP)23, 31 is an ideal method for pairwise comparison between sequences. It produces the optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequence alignment progressively based on this DP pairwise comparison method. Roughly speaking, in the ith round of the algorithm, the alignment of the first i - 1 sequences is represented in a consensus sequence. The algorithm then updates this consensus by aligning it with the ith sequence Si using the standard DP algorithm. Information from Si that is not aligned to the consensus sequence is essentially lost. See Figure 1(a). The partial order alignment (POA) algorithm17 greatly alleviates this problem by encoding the consensus in a POG instead of a linear sequence
c For simplicity, we assume without loss of generality that all Ti's have the same length n.
Fig. 3. Symbols inside the circles are the node IDs. The table associated with each node encodes the set of points aligned to it. In particular, each row represents a point with its trajectory ID (T column) and its index along the trajectory (S column). In (a), a POG is initialized by the trajectory T1. An example of a POG after aligning a few trajectories is shown in (b). Note that a new node/branch is created when a point cannot be aligned to any existing nodes. For example, node e is created when p_3^2 (i.e., the 3rd point of T2) is inserted. (c) shows the POG after merging point p_2^1 from the node b to the node e, constrained by the distance threshold ε.
(see Figure 1(b)). In particular, the alignment of S1, . . . , Si-1 is encoded in a partial order graph Gi, which is then updated by aligning it with Si. The alignment between Gi and Si can still be achieved by a DP algorithm. The main difference is that in this DP procedure, to find the optimal score of aligning a node u ∈ Gi and an element s ∈ Si, one has to inspect the alignment between all parents of u in Gi and the parent of s in Si. The POA algorithm reduces the influence of the order of the sequences aligned, and is able to capture alignments between a subset of sequences. More details of the POA algorithm can be found in Refs. 17 and 35.
In our case, each trajectory is mapped to an ordered sequence of points (i.e., a polygonal curve), and a similar algorithm can be applied to our trajectory data, where instead of the usual 1D sequences, we now have dD sequences (d). Below we explain two main differences between our EPO algorithm and the POA algorithm.
3.2.1. Size of POG
The first problem with the current POA algorithm is that the size of the POG maintained expands quickly when the level of similarity is low. For example, suppose we are updating the current POG Gi to Gi+1 by aligning it with a new curve Ti. If a point p ∈ Ti cannot be aligned to any node in Gi, then it will create a new node in Gi+1, as this node may potentially be aligned later with the remaining curves. Consequently, if the similarity is sparse, many new nodes are created without producing significantly aligned nodes later, and the size of the POG
increases rapidly. This induces high computational complexity. To address this problem, our algorithm first preprocesses all points from the input curves T by clustering them into groups13 whose diameter is smaller than a user-defined threshold, which is fixed as the distance threshold ε in our experiments. We keep only those clusters whose size is greater than a certain threshold (τ/2 in our experiments), and collect their centers in C = {c1, . . . , cr}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes. If, in the process of aligning Ti with Gi, a point p ∈ Ti is not aligned to any node in Gi, then we insert a new node in Gi+1 only if p is within ε of some canonical center from C. If p is far from all the canonical cluster centers, then there is little chance that p can form a significant alignment with points from later curves, as that would have implied that p belongs to a dense cluster. The set of canonical cluster centers also contributes to the scoring function described below.
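The gating rule for new nodes is then a one-line test against the canonical centers (a sketch; the clustering itself can be any diameter-bounded grouping):

```python
import numpy as np

def should_create_node(p, canonical_centers, eps):
    """Create a new POG node for an unaligned point p only if p lies
    within eps of some canonical cluster center."""
    return any(np.linalg.norm(p - c) <= eps for c in canonical_centers)
```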
3.2.2. Scoring Function
The choice of the scoring function when aligning Gi = (Vi, Ei) with Ti is in general a crucial aspect of an alignment algorithm. Given a point p ∈ Ti and a node o ∈ Gi, let δ(o, p) be the similarity between p and o, the definition of which will be described shortly. The score of aligning p with o is usually
d Since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves.
defined as:

Score(o, p) = max { max_{(o',o) ∈ Ei} (Score(o', q) + δ(o, p)),
                    max_{(o',o) ∈ Ei} Score(o', p),
                    Score(o, q) },
where q is the predecessor (i.e., parent) of the point p along Ti, and o' ranges over all predecessors of o in the POG Gi. It is easy to verify that such scores can be computed by a dynamic programming procedure, due to the inherent order existing in both the trajectory and the POG. A common way to define δ(o, p), the similarity between o and p, is as follows. Assume that each node o is associated with a node center w(o) to represent all the points aligned to this node. Then δ(o, p) is positive only when p falls within distance ε of w(o).
An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o. In order to align as many points as possible, it is intuitively desirable that the influence regions of the nodes in the current POG cover as much space as possible. Natural choices for the node center w(o) of o include using a canonical cluster center computed earlier, or the center of the minimum enclosing ball of the points already aligned to this node (or some weighted variant of it). The advantage of the former is that canonical cluster centers tend to spread apart, which helps to increase coverage. Furthermore, the canonical cluster centers serve as good candidates for node centers because we already know that there are many points around them. The disadvantage is that this choice does not consider the distribution of the points aligned to the node. See Figure 4, where without considering the distribution of points aligned to oa and ob, the new point p will be aligned to ob even though oa is a better choice. Using the center of the minimum enclosing ball alleviates this problem. However, such centers depend heavily on the order of the curves aligned, and the influence regions of nodes produced this way tend to overlap much more than when using the canonical cluster centers. We combine the advantages of both approaches into the following two-level scoring function for measuring similarities.
More specifically, for a node o, let q be the first point aligned to this node. This means that at the time we were examining q, q could not be aligned to any existing node in the POG. Let ck ∈ C be the nearest canonical cluster center of q; recall that the node o was created because ||q - ck|| ≤ ε. We add ck as a point aligned to this node, and at any time, the center of the minimum enclosing ball of the currently aligned points, including ck, is used as the node center w(o). Now let D(o) be the diameter of the points currently aligned to o. We define:

δ(o, p) = { 2ε, if ||p - w(o)|| < D(o);
            ε,  if D(o) ≤ ||p - w(o)|| < ε;
            0,  otherwise.                            (2)
In other words, the new scoring function encourages centering points around previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives a higher similarity score to points that are more tightly grouped together with those already aligned at the current node, addressing the problem shown in Figure 4. Our experimental tests have shown that this two-level scoring function significantly outperforms the ones using either only the canonical centers or only the centers of minimum enclosing balls. We remark that it is possible to use variants of the above two-level scoring function, such as making it continuous (instead of a step function). We choose the current form for its simplicity. Furthermore, experiments show that there is only a marginal difference if we use the continuous version.
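Equation (2) translates directly into code; the following sketch assumes the node's center w(o) and diameter D(o) are tracked as described above:

```python
import numpy as np

def delta(p, w, D, eps):
    """Two-level similarity of Eq. (2) between point p and a node with
    center w and current diameter D."""
    dist = np.linalg.norm(p - w)
    if dist < D:        # tightly grouped with the already-aligned points
        return 2 * eps
    if dist < eps:      # merely inside the node's influence region
        return eps
    return 0.0
```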
Fig. 4. Empty and solid points are aligned to the nodes oa and ob, respectively. For a new point p (the star), although it is closer to w(ob), it is better grouped with points aligned to oa. Hence ideally, it should be aligned to oa instead of to ob.
3.3. Merging Stage

In the first stage, we applied a progressive method to align each trajectory onto an alignment graph one by one. In the ith iteration, a point from Ti is either aligned to the best-matched node in the current POG Gi, or a new node is created containing this point and the corresponding canonical cluster center. After processing all of the N trajectories in order, we return the final POG G = GN. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G using a novel merging process. Given the greedy nature of the POA algorithm, the alignment obtained in G is not optimal and depends on the alignment order. Furthermore, given that the influence regions of different nodes may overlap, no matter how we improve the scoring function, it is sometimes simply ambiguous to decide locally where to align a new point, and a wrong decision may have grave consequences later.
Fig. 5. Empty and solid points are aligned to the nodes oa and ob, respectively, while points in the dotted region should be grouped together.
For example, see Figure 5, where the set of points P (enclosed in the dotted circle) should have been aligned to one node. However, suppose the nodes oa and ob already exist before any point in P is inserted. Then as points from P come in, it is rather likely that they are distributed evenly into both oa and ob. This problem becomes much more severe in higher dimensions, where P can be distributed to several nodes whose centers are well-separated around P, but whose influence regions still cover some points from P (the number of such regions grows exponentially w.r.t. the dimension d). Hence instead of being captured in one heavily aligned node, P is broken into several nodes of small size. Our experimental tests confirm that this happens rather commonly in the POA algorithm.
To address this problem, we propose a novel post-processing on G. The goal is to merge qualified points from neighboring less-aligned nodes to augment more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:
(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;
(I2) The partial order constraints induced by the POG are always consistent with the order of points along each trajectory.

The second criterion means that at any time in the POG G', if p ∈ o1, q ∈ o2, p, q ∈ Ti and p precedes q along the trajectory Ti, then either o1 ≺ o2, or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of T with respect to the same thresholds. As an example, see Figure 3, where the point p_2^1 (i.e., the second point of T1) in the node b in (b) is moved to the node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)), in order to maintain the invariants (I1) and (I2). When all points aligned to a node o are merged into other nodes (i.e., o becomes empty), we delete o, and its successors in the POG then become the successors of its parent.
Algorithm 3.1: MERGING PROCESSING(G = {o1, . . . , om, . . .}, |om| ≥ |om+1|)
  while significant progress
    for each om ∈ G in increasing order of m
      for each neighbor on, |on| [the remainder of the pseudocode is illegible in the source]
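The recoverable part of Algorithm 3.1 suggests the following greedy structure (a sketch under our own naming; the qualifies predicate must enforce invariants (I1) and (I2)):

```python
def merging_pass(nodes, qualifies, merge_point):
    """nodes: aligned nodes ordered by decreasing size |o1| >= |o2| >= ...;
    repeatedly move qualified points from smaller nodes into larger ones."""
    progress = True
    while progress:                            # 'while significant progress'
        progress = False
        for target in nodes:                   # heavier nodes absorb first
            for other in nodes:
                if other is target or len(other) >= len(target):
                    continue
                for p in list(other):
                    if qualifies(p, other, target):   # I1 and I2 hold
                        merge_point(p, other, target)
                        progress = True
```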
Fig. 4. Effect of the boosting factor: F-measure and AUC versus the boosting factor w7, while the other weights were set to w_train. Results obtained against the testing set.
Using the BioCreAtIvE datasets for evaluation of the algorithm, the best F-score we achieved on the test data was 0.7622, when the feature weights were optimized with the training data. Without the thresholding process, the gene tagging component alone could attain an F-score of 0.647 with a recall of 0.869. Recall at this step essentially limits the recall obtainable in the thresholding process. A majority of the undetected mentions have complex syntax that is not
handled by the rules we defined. Table 4 provides some examples of challenging cases that contributed to the false negative counts in the tagging process. Nevertheless, many genes are referred to in the text both by their name and their symbol, so the undetected mentions have a smaller impact on the recall performance. Figs. 2 and 3 show the individual contribution of each internal feature we measure in the confidence score. We call these internal features because the scores are computed out of context, based solely on the evidence presented by the mentions themselves. The only exception is the scaling factor s on gene symbols, which is influenced by whether the symbol is extracted from text enclosed by a set of brackets. We can observe from the figures that all six features are useful for the gene normalization task because their optimal weights are all greater than zero. As the weight of a feature increases, the feature becomes more dominant in determining the final confidence score. Inverse distance and uniqueness are the only features that produced better (on AUC) or only slightly degraded (on F-score) results from zero weight to a weight of 1. All the other features posted worse performance when they became dominant. Although the best performance is achieved using a combination of these features, our observation suggests that inverse distance and uniqueness have good enough discriminatory power to estimate the level of confidence by themselves when other information is not available. In addition to the internal features, several contextual factors are used to determine whether the confidence score is boosted or not. Since the boosting factor is added as an exponent, the effect is non-linear. Boosting exerts most of its influence on mentions for which the internal features may be ineffective. When a gene is mentioned for the first time in the text, the authors often specify that the entity of interest is a gene, especially when the gene is ambiguous or not very well known. Boosting is useful, as illustrated in Fig. 4. However, sometimes a wrong mention can be boosted. Moreover, when counter-indicators are detected, the boosting factor is inverted and the score is thus reduced. It can be argued that the punishing factor should be made more severe in order to successfully remove those mentions that have high scores but actually refer to something else.
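The paper does not spell out the combination formula, but the description above (weighted internal features, a boosting factor applied as an exponent, inverted for counter-indicators) suggests a shape like the following sketch; the weighted sum and the reciprocal exponent are our assumptions:

```python
def confidence(features, weights, w_boost, boosted=False, counter=False):
    """Confidence score in [0, 1] from internal features, optionally
    boosted or punished by the contextual factor w_boost (> 1)."""
    score = sum(w * f for w, f in zip(weights, features))
    if boosted:
        score **= 1.0 / w_boost   # exponent < 1 raises a score in (0, 1)
    elif counter:
        score **= w_boost         # inverted factor reduces the score
    return score
```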
Features for confidence measure. In contrast with the other features, the effects of coverage, inverse distance, and uniqueness are clearly pivotal, as there is significant performance improvement from zero weight to their optimal settings. It can be argued that uniqueness is the most important feature in our evaluation: lack of this feature would result in severe degradation of performance, most noticeable in the AUC. Uniqueness is a statistical measure built on the assumption that gene mentions should have a low frequency of occurrence. This is a good assumption in most cases. However, it fails for legitimate genes that actually appear frequently in the literature (e.g., Interleukin 1) and for relatively rare terms with multiple meanings, one of which is a gene reference. For example, "ADA" can stand for the American Diabetes Association or for the gene adenosine deaminase. Our solution to the second issue is to look at whether a symbol is mentioned within a set of brackets; if it is, presence of the gene name becomes a determining factor. We found this contextual feature to be very helpful for improving precision. Another important feature is the inverse distance, a dictionary-based measure that calculates the similarity between the candidate mention and the corresponding gene term in the database. Currently, the character is the basic unit in the calculation of edit distance. For names, the effect of changing the word order depends on the length of the words; it may be more appropriate to use the word as the unit of measurement, as the sketch below illustrates. Coverage is mostly a heuristic measure in which we assume that longer mentions are more likely to be true. Although it is a very good measure, performance degraded when it became a dominant factor, suggesting that length alone is not reliable.
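For reference, a standard Levenshtein distance works over any token sequence, so switching the unit from characters to words is only a matter of how the input is split (a generic sketch, not the system's code):

```python
def edit_distance(a, b):
    """Levenshtein distance over arbitrary token sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

# character-level vs word-level units:
edit_distance("tumor necrosis factor", "necrosis tumor factor")
edit_distance("tumor necrosis factor".split(), "necrosis tumor factor".split())
```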
Comparison to other gene normalization tools. A number of gene tagging tools are freely available to the community but, to our knowledge, no standalone gene normalization systems have been made publicly accessible. No comparison is made between our tool and ABNER or GAPSCORE because the task of those tools (i.e., NER) is different from ours (i.e., normalization), and such a comparison would not be particularly meaningful. In the second BioCreAtIvE challenge, 20 teams entered the gene normalization task12. Many teams followed the same general approaches we employed. Several participants built upon "off-the-shelf" gene tagging tools. The best F-score from each team ranges from 0.394 to 0.810, with a median of 0.731. The highest recall and precision achieved are 0.833 and 0.841, respectively. The difference in performance is primarily due to the way filtering of candidates, including disambiguation, was performed. Some relied on pruning of the lexicon and some implemented rules of various degrees of sophistication to reduce false positives. Nevertheless, the results of the top-scoring teams, including ours, are comparable. It is important to note that the recall of 0.869 at a precision of 0.515, which we achieved after the first step of the process, is advantageous when high recall is required. Another benefit of our system is that each mention is associated with a confidence score. This feature affords users the ability to choose a suitable balance between recall and precision.

Table 4. Examples of false negative cases which the algorithm was not able to detect at all.
Description        Examples
Range              ORP-1 to ORP-6
Ambiguity          p32
Choice of words    IFN-induced protein of 10 kDa
Boundary           Protein kinase C alpha, epsilon, and zeta isoforms
Fig. 5. Recall versus precision as tested on the test data with the original weights (w0) and the optimized weights (w_train).
5. CONCLUSION

We have developed a gene normalization algorithm that relies heavily on rules that combine statistics and heuristics. The confidence measure provides a means to quantify the degree of conformance to these rules and allows users to choose the proper compromise between recall and precision based on the situation. In our evaluation, only basic knowledge about the genes was used to disambiguate mentions with multiple mappings. A majority of candidates that mapped to more than one gene identifier actually referred to gene families. For future work, information about gene families and associations of various terms can be applied for more sophisticated filtering. Part-of-speech tagging may also help to discern mention boundaries and improve system efficiency by only considering noun phrases.
Acknowledgments

This research was supported by the Intramural Research Program of the National Institutes of Health, Center for Information Technology. We appreciate the contributions of Alex Wang and Jigar Shah.
References
1. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002; 18: 1124-1132.
2. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet. 2006; 7: 119-129.
3. Liu H, Hu ZZ, Torii M, Wu C, Friedman C. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006; 13: 497-507.
4. Zhou G, Zhang J, Su J, Shen D, Tan C. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics. 2004; 20: 1178-1190.
5. Hakenberg J, Bickel S, Plake C, et al. Systematic feature evaluation for gene name recognition. BMC Bioinformatics. 2005; 6 Suppl 1: S9.
6. Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform. 2005; 6: 357-369.
7. Dickman S. Tough mining: The challenges of searching the scientific literature. PLoS Biol. 2003; 1: E48.
8. Settles B. ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005; 21: 3191-3192.
9. Chang JT, Schutze H, Altman RB. GAPSCORE: Finding gene and protein names one word at a time. Bioinformatics. 2004; 20: 216-225.
10. Hirschman L, Colosimo M, Morgan A, Yeh A. Overview of BioCreAtIvE task 1B: Normalized gene lists. BMC Bioinformatics. 2005; 6 Suppl 1: S11.
11. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001; 28: 21-28.
12. Morgan A, Hirschman L. Overview of BioCreative II Gene Normalization. Proc of the Second BioCreative Challenge Evaluation Workshop 2007.
13. Becker KG, Hosack DA, Dennis G Jr, et al. PubMatrix: A tool for multiplex literature mining. BMC Bioinformatics. 2003; 4: 61.
14. Lau W, Johnson C. Rule-based gene normalization with a statistical and heuristic confidence measure. Proc of the Second BioCreative Challenge Evaluation Workshop 2007.
15. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J. ProMiner: Rule-based protein and gene entity recognition. BMC Bioinformatics. 2005; 6 Suppl 1: S14.
16. Tamames J, Valencia A. The success (or not) of HUGO nomenclature. Genome Biol. 2006; 7: 402.
17. Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965; 7: 308-313.
CBioC: BEYOND A PROTOTYPE FOR COLLABORATIVE ANNOTATION OF MOLECULAR INTERACTIONS FROM THE LITERATURE

C. Baral, G. Gonzalez, A. Gitter, C. Teegarden, and A. Zeigler

School of Computing and Informatics, Arizona State University, Tempe, AZ 85281, USA
Email: chitta@asu.edu, ggonzalez@asu.edu
G. Joshi-Topé

Northeast Biosciences, Inc., New York, NY, USA

In molecular biology research, looking for information on a particular entity such as a gene or a protein may lead to thousands of articles, making it impossible for a researcher to individually read these articles or even just their abstracts. Thus, there is a need to curate the literature to extract various nuggets of knowledge, such as an interaction between two proteins, and store them in a database. However, the body of existing biomedical articles is growing at a very fast rate, making it impossible to curate them manually. An alternative approach of using computers for automatic extraction has problems with accuracy. We propose to leverage the advantages of both techniques, extracting binary relationships between biological entities automatically from the biomedical literature and providing a platform that allows community collaboration in the annotation of the extracted relationships. Thus, the community of researchers that writes and reads the biomedical texts can use the server for searching our database of extracted facts, and as an easy-to-use web platform to annotate facts relevant to them. We presented a preliminary prototype as a proof of concept earlier1. This paper presents the working implementation, available for download at http://www.cbioc.org as a browser plug-in for both Internet Explorer and Firefox. This current version has been available since June of 2006 and has over 160 registered users from around the world. Aside from its use as an annotation tool, data from CBioC has also been used in computational methods with encouraging results2.
1. INTRODUCTION

There are about 15 million abstracts currently indexed in PubMed, with anywhere between 300,000 and 500,000 [3] being added each year. To illustrate the problem, consider the following example. A search for the gene TNF alpha in PubMed yields 74,430 articles (as of March 2007) and 6,193 review articles. Refining the search to TNF alpha and inflammation reduces this number to 15,126 regular articles and 1,757 review articles, still far too many for a researcher to review. It would be significantly easier if he or she had access to a database that stores relevant nuggets of knowledge, such as the relationships between genes and biological processes. The problem of constructing such a database has been recognized as one that needs to be solved to move forward into the great challenges of science for this century [4]. Currently, two approaches are used to extract such facts from biomedical publications: (i) human curation and (ii) development and use of automated information extraction systems. However, the constantly increasing
number of articles and the complexity inherent in their annotation result in data sources that are continuously outdated. For example, GeneRIF (Gene Reference Into Function) was started in 2002, yet it covers only about 1.7% of the genes in Entrez [5] and 25% of human genes. Automatic extraction and annotation seems a natural way to overcome the limitations of manual curation, and a lot of work has been done in this area, including the automatic extraction of genes and gene products [6], protein-protein interactions [7], relationships between genes and biological functions [8], and genes and diseases [9], among others. However, the reliability of the extracted information varies greatly and thus discourages biologists from using it for their research. CBioC represents a new approach to the problem through mass collaboration, where the community of researchers that writes and reads the biomedical texts can contribute to the curation process, dictating the pace at which it is done. Automated text extraction is used
Figure 1. CBioC automatically launches the interaction web band at the bottom of the main window when a user visits PubMed, and displays the facts available for the displayed abstract. If the abstract has not been processed, extraction occurs "on the fly". The left image corresponds to the interactions display, allowing the user to tab through the different types of relationships (protein/protein, gene/disease, and gene/bioprocess). The right image shows the simplest annotation mechanism (a yes/no vote on "Correctly Extracted") and the agreement level (% Approval). Users may also modify and add interactions.
as a starting point to bootstrap the database, but then it is up to researchers to improve upon the extracted data through modifications, additions of missed facts, and voting on the accuracy of extraction. CBioC runs as a web browser extension and allows unobtrusive use of the system during the regular course of research, allowing the natural "checks and balances" of community consensus to take hold to resolve inconsistencies when possible, or to point out disagreements and controversial findings. Although most of the data in CBioC currently comes from automatic extraction, users have contributed over 500 interactions, which are currently being evaluated. This shows how, with CBioC, small or large groups of researchers can easily annotate articles and find facts of interest to them.
2. METHODS

CBioC is available for both Internet Explorer and Firefox, for PCs, Macs, and Linux machines. Once installed, CBioC runs unobtrusively: when one visits the Entrez (PubMed) web site, CBioC automatically opens within a "web band" at the bottom of the main browser window. Users who do not wish to install the plug-in can get similar functionality by logging in from our home page. For automatic extraction, CBioC uses a modified version of its extraction system (called IntEx [7]) that uses Natural Language Processing methods to extract protein-protein interactions, gene-disease relations, and gene-bioprocess relations.

2.1. Usage
Figure 2. A CBioC search provides a simple way to browse through interactions involving a particular gene or the list of genes involved in a disease or biological process.
Consider a variation of the research scenario introduced before. A PubMed search for "TNF alpha atherosclerosis" returns over 900 abstracts. One of the abstracts (PMID 16814297) reports that TNF-alpha modulates MCP-1, a common alias of CCL2. Expression of CCL2 has been found to be increased in cardiovascular diseases, and it is of high interest as a biomarker of atherosclerosis [10]. However, as of March 2007, none of the public curated databases had captured this important interaction, and any researcher who missed the article would probably not learn about it. CCL2 is involved in immunoregulatory and inflammatory processes [11]. Thus, the fact that TNF-alpha modulates CCL2, as
reported in the article (and supported by others, such as PMID 9920834), is significant: it matters both for assessing the relevance of TNF alpha with respect to atherosclerosis and for any systems biology simulation. Relying solely on curated data could leave this piece of information out. Consider the same scenario, but with CBioC installed. The user could start with a TNF alpha search in CBioC (see Figure 2). Quickly scrolling down through the listed interactions gives the researcher a general idea of the known relevant associated genes, even though some of them might not be accurate. If MCP-1 catches the researcher's attention, the rest of the interactions in that abstract can be quickly displayed by clicking on the PMID of the interaction of interest among the search results. A list of other articles that report the same interaction can be viewed by clicking on the "Related Articles" link.
2.2. Functionality

2.2.1. Displaying data
When one searches the PubMed database and displays a particular abstract, CBioC automatically displays the interactions found for that abstract. If the abstract has not been processed by CBioC before, an extraction system runs "on the fly". CBioC also displays interactions found for the article in publicly accessible databases.

2.2.2. Searching
As a registered CBioC user, one can search the CBioC database for all facts related to a particular protein, gene, disease, or interaction word by simply typing the relevant term in the Search box within the CBioC web band. CBioC automatically expands a search term with known synonyms of the term. One can also display the facts available for a set of abstracts by typing a comma-separated list of their PMIDs in the search box. The search box also lets one see all the facts we have from a particular database by typing its name, such as "BIND" or "MINT".

2.2.3. Modifying and adding
Registered CBioC users can vote on the accuracy of an extraction, modify the interactions, or add interactions that the extraction system missed. If the interaction seems correctly extracted, one can click the "Yes" button to approve it. Otherwise, one can vote "No" or modify the data by clicking "Modify". If "Modify" is clicked, the data fields open up for editing. The user's screen ID will be displayed in the "Source" column from then on, with the previous data stored and accessible via the "History" link. The modified information is then subject to community vote. Similarly, if an interaction present in the abstract or in the full article is missing, it can be entered in the last row.
3. RESULTS AND DISCUSSION

Although the CBioC system has moved well beyond its prototype stage, it is still considered a "beta" system, and new features are being added. It is, however, fully functional. To date, over 4.5 million abstracts have been pre-processed, and CBioC performs dynamic ("on the fly") extraction when a user views an abstract that has not been pre-processed. This is an important feature that gives users total control over which abstracts are processed. Additionally, we have incorporated interactions from BIND, GRID, MINT, DIP, and IntAct. A total of 261 distinct users have downloaded the CBioC plug-in, with 161 of them becoming registered users since June 2006, when CBioC was mentioned in Science Magazine's NetWatch [12]. Partial statistics for those who chose a personal title (such as "Doctor", "Professor", or "Researcher") during the registration process show that our users include 53 doctors, 30 researchers, 17 professors, 8 post-docs, and 40 students. Actions of registered users are tracked, and have so far yielded a total of over 500 curated interactions (either added, modified, or approved through a "yes" vote). This adds to the more than 1.5 million relationships automatically extracted from text. As a point of comparison, at the time of its publication, IntAct [13] had 2,200 interactions, most of them from high-throughput experiments (not curated). Two years after its conception, MINT [14] had 2,500 curated mammalian interactions and was the largest publicly available dataset of curated entries at the time. It will be interesting to see how many curated interactions CBioC will have when it hits the two-year mark in June 2008. Table 1 shows statistics about content and user actions. About 55% of the votes confirm that the automatic extraction
is correct ("yes" votes), an indicator of the extraction system's precision. This use of community validation is another area to explore as value added by the CBioC platform. Aside from the web interface, data from CBioC has also been used in computational methods, with encouraging results [15]. We presented in Ref. 15 a computational method to
uncover possible gene-disease relationships that are not directly stated in an abstract or were missed by the initial mining of the literature. Ranked lists of genes obtained from the method reach a precision of 98% for the top 50 genes, and up to 92% for the top 200 genes, in contrast to about 70% accuracy for simple co-occurrence searches.
Table 1. CBioC statistics (as of March 2007). The left table details the type of information stored in the CBioC database, accessible via term searches or by PMID. The right table details the number of actions by registered users. Actions by non-registered users are not tracked. IntAct interactions are being updated, with over 130,000 becoming available soon. Users (excluding the development team) include 53 doctors, 30 researchers, 17 professors, 8 post-docs, and 40 students.

  Integrated data                            User actions
  Total Processed          1,618,878         Add interaction      163
  Abstracts                51,721            Modify interaction   133
  Total Protein/Protein    972,769           Rate interaction     71
  Total Gene/Disease       301,547           Vote (total)         370
  Total Gene/Bio-Process   251,233           Vote (yes)           207
  BIND Interactions        114,685           Search               793
  GRID Interactions        58,467            View (article)       3,169
  DIP Interactions         52,070
  MINT Interactions        6,734
  IntAct Interactions      (update in progress)

References
1. Baral C, et al. Collaborative curation of data from bio-medical texts and abstracts and its integration. In: Data Integration in the Life Sciences, 309-312 (Lecture Notes in Computer Science, San Diego, CA, 2005).
2. Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature. In: Pacific Symposium on Biocomputing (Maui, Hawaii, 2007).
3. Soteriades ES, Falagas ME. Comparison of amount of biomedical research originating from the European Union and the United States. BMJ 331: 192-194 (2005).
4. Emmott S. Towards 2020 Science: a report. In: Towards 2020 Science Workshop (Cambridge, 2006).
5. Lu Z, Cohen KB, Hunter L. Finding GeneRIFs via Gene Ontology annotations. In: Pacific Symposium on Biocomputing Vol. 11, 52-63 (World Scientific, Maui, Hawaii, USA, 2006).
6. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics 18: 1124-1132 (2002).
7. Ahmed ST, Chidambaram D, Davulcu H, Baral C. IntEx: a syntactic role driven protein-protein interaction extractor for bio-medical text. In: BioLINK: Linking Literature, Information and Knowledge for Biology (Detroit, Michigan, 2005).
8. Koike A, Niwa Y, Takagi T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 21: 1227-1236 (2005).
9. Chun H-W, et al. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. In: Pacific Symposium on Biocomputing Vol. 11, 4-15 (2006).
10. Herder C, et al. Chemokines and incident coronary heart disease: results from the MONICA/KORA Augsburg case-cohort study, 1984-2002. Arterioscler Thromb Vasc Biol (2006).
11. Entrez Gene entry for CCL2 (GeneID: 6347).
12. Leslie M. NetWatch - Software: Annotate while you read. Science 312: 1721 (2006).
13. Hermjakob H, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res 32: D452-D455 (2004).
14. Ceol A, et al. The (new) MINT database. In: BITS 2004 (Padova, Italy, 2004).
15. Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature: incorporating interactions, connectivity, confidence, and context measures. In: Pacific Symposium on Biocomputing (Maui, Hawaii, 2007).
SUPERCOMPUTING WITH TOYS: HARNESSING THE POWER OF NVIDIA 8800GTX AND PLAYSTATION 3 FOR BIOINFORMATICS PROBLEMS
Justin Wilson, Manhong Dai, Elvis Jakupovic, Stanley Watson and Fan Meng*
Molecular and Behavioral Neuroscience Institute and Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, United States of America
*Email: mengf@umich.edu

Modern video cards and game consoles typically have much better performance-to-price ratios than general-purpose CPUs. The parallel processing capabilities of game hardware are well suited for high-throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.
1. INTRODUCTION

Biomedical data analysis, visualization, and mining demand more and more computing power in the post-genome era. Computer clusters are the prevailing solution for many bioinformatics laboratories and centers for accelerated large-scale data analysis. However, expanding the computing capacity of an existing cluster by more than an order of magnitude using traditional methods in a time of leveling-off processor speeds is difficult and expensive. State-of-the-art game consoles and graphics processing units possess enormous computing power that can be directed at a variety of data analysis tasks [1-4]. However, the use of game hardware in bioinformatics is still rare and limited to special applications. The GPGPU website listed only one bioinformatics-related application, which reported a 2.7-fold speedup of the most time-consuming loop in the RAxML phylogenetic tree inference program when using a GeForce FX 5700 LE graphics card instead of a Pentium 4 3.2 GHz processor [5]. Most recently, the famous Folding@Home project developed clients for both ATI graphics processing units (GPUs) and the Sony PlayStation 3 (PS3). In fact, the PS3 already exceeds all participating computers in the number of TFLOPs contributed to the Folding@Home project [6]. A major obstacle to the widespread deployment of such promising game hardware was the lack of development tools. Traditionally, a developer had to learn a graphics API and cast their problem as a
*Corresponding author
graphics problem in order to use a GPU for general computation. However, the recent release of the Compute Unified Device Architecture (CUDA) by NVIDIA has circumvented this problem and greatly facilitated developing software for NVIDIA GPUs [7]. In addition, the highly acclaimed Cell Broadband Engine (CBE) in the PS3 can be programmed using C instead of assembly with the free IBM Cell SDK [8]. Furthermore, third-party vendors such as PeakStream [9] and RapidMind [10] allow the same program to be compiled and automatically optimized without modification for different multi-core platforms, thus greatly shortening the development cycle for different parallel computing platforms. The computationally intense nature of high-throughput data analysis led us to examine the possibility of utilizing game hardware to speed up several common algorithms. Our results are very encouraging, and we believe game hardware is an effective platform for many bioinformatics problems.
2. MATERIALS AND METHODS

Single- and multiple-CPU tests were performed on an 8x Opteron 865 (dual core) server with 64 GB of PC2700 memory running Fedora Core 2. GPU tests were performed on a 2x Opteron 275 (dual core) server with 4 GB of memory and a BFG GeForce 8800GTX with a core frequency of 600 MHz. The PS3 used in this project was a 60 GB version. The compiler used for single and multiple Opteron core implementations was GCC 3.3.3. CUDA 0.8 and IBM Cell SDK
2.0 were used for the 8800GTX and PS3 programs, respectively. We used RapidMind version 2.0 beta 3 and followed their write-once, run-anywhere paradigm for each platform. See our webpage (http://wiki.mbni.med.umich.edu/wiki/index.php/Toycomputing) for the details of our tests.
3. RESULTS

3.1. 8800GTX and CBE vs. 1x and 16x CPU
Table 1 summarizes the performance of a single-precision vector calculation when using the native development environments for the 8800GTX and CBE as well as the RapidMind platform. The calculation is described by

    Σ_{n=1}^{N} (b ∘ b) + a,

where ∘ is an element-wise division operator, a and b are vectors of 9,437,184 elements, and N is the number of repeated b ∘ b calculations. The column headings for Table 1 are as follows: "I" is the number of times the calculation was performed, "N" is the number of repeated b ∘ b calculations, "1x" represents a single CPU, "16x" represents 16 CPUs, "GPU" represents the 8800GTX, "PS3" represents the CBE, and "RM" represents the designated hardware under the control of RapidMind.

Table 1. Vector multiplication/division performance on different platforms (seconds)

  I     N    1x      16x    GPU    GPU RM  PS3    PS3 RM
  10    500  426.6   32.3   2.2    2.5     96.4   574.6
  1     500  42.7    3.3    0.3    1.5     9.7    559.5
  10    100  79.1    6.2    1.3    1.9     19.6   11.5
  1     100  7.9     0.6    0.2    0.7     2.0    6.1
  100   10   56.7    7.1    11.7   13.2    21.4   21.0
  10    10   5.7     0.7    1.2    1.9     2.2    2.6
  1000  1    59.7    30.6   116.2  127.9   41.3   17.0
  100   1    6.0     4.1    11.7   13.4    4.4    2.2
  1000  0    19.7    15.0   106.0  127.8   20.0   -
  100   0    2.0     1.6    10.2   13.2    2.3    -
Due to their physical design, game hardware does not provide an advantage for operations involving a large number of memory reads and writes (lower half of Table 1). When a small number of memory operations (low iteration count) is combined with CPU-intensive operations (high calculation count), the PS3 is more than 4 times the speed of a single Opteron 865 core. Most strikingly, a single NVIDIA 8800GTX is about 200 times faster than a single Opteron 865 core and more than 10 times faster than our 16-core (8x2) server. These results should be interpreted with the understanding that these numbers represent the upper limit of game hardware performance, since the entire problem resided in the main memory of each device and there were no conditional statements. Executables generated by RapidMind showed similar performance improvements on the 8800GTX when compared to executables generated using CUDA. The version we tried lacked optimization support for the CBE, but RapidMind has promised such optimization in future versions [11]. Regardless, the ability to use the same source code for different multi-core platforms should significantly help the adoption of game hardware.
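For reference, the operation being timed is tiny. The following single-threaded C++ sketch shows the computation itself; it is illustrative only (the authors' versions use CUDA, the IBM Cell SDK, and RapidMind, and the names here are ours):

    #include <cstddef>
    #include <vector>

    // Computes sum_{n=1}^{N} (b (/) b) + a, with (/) the element-wise
    // division used in the benchmark; the paper uses vectors of
    // 9,437,184 single-precision elements.
    std::vector<float> benchmark(const std::vector<float>& a,
                                 const std::vector<float>& b, std::size_t N) {
        std::vector<float> result(a);            // start from a
        for (std::size_t n = 0; n < N; ++n)      // N repeated terms
            for (std::size_t i = 0; i < b.size(); ++i)
                result[i] += b[i] / b[i];        // element-wise division
        return result;
    }

On the GPU and the CBE the structure stays the same; it is the inner element-wise loop that gets mapped onto the parallel hardware.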
3.2. Clustering Algorithms

Clustering is one of the most widely used approaches in bioinformatics. However, clustering algorithms are CPU-intensive, and a speedup would benefit problems ranging from gene expression analysis to document mining. A full clustering algorithm usually has two main components: determining the similarity of various samples (vectors) through a distance measure, and the classification of samples into different groups through a clustering method [12]. We decided to implement two distance calculation methods, Euclidean and B-spline-based mutual information [13], and two clustering methods, single-link hierarchical clustering and centroid k-means clustering, for an 8800GTX, and to investigate their performance under various conditions. These implementations have been used to generate similarity matrices and to cluster documents from the MEDLINE database represented by MeSH term vectors, as well as gene expression values from U133A GeneChips. GPUs are best suited for parallel data processing with a high ratio of arithmetic to I/O and a minimal amount of conditional instructions. Memory reads and writes between the host computer and GPU should be minimized. Data should be aligned
in memory, and memory access patterns should be sequential and regular. A good strategy for designing algorithms for the GPU is to examine the data dependency between the stages of an algorithm and have a kernel for each stage. Furthermore, having each thread or each block compute one independent element of the output of a stage automatically eliminates the need for synchronization between blocks. Using these rules yields a distance matrix calculation kernel where each element in the distance matrix is computed by one block. First, the vectors are copied to the device and aligned in memory. Each thread then computes the difference between two elements of two vectors and accumulates the results until both vectors are exhausted. Then, the shared memory between the threads can be utilized to sum up the contribution of each thread. Finally, the computed value is written to the distance matrix.

Finding the minimum in a distance matrix and updating values according to the Lance-Williams formula are both activities in hierarchical clustering that can be parallelized. Finding the minimum is similar to computing a distance matrix, only the location of the minimum must be remembered. Updating the distance matrix can also be performed in parallel, because only the rows and columns containing the two merged elements need to be updated. Consequently, one thread can process each column in the matrix.

The above techniques are also used in the k-means algorithm. The only seemingly difficult issue is adding up the vectors to calculate the new cluster centers. Since the GPU lacks atomic operations, having different blocks update the centers at the same time will not work correctly. However, by having each thread compute one element of one new cluster center, we circumvent the need for atomic operations. We also minimize the number of memory reads by using the assignment matrix.

The computational speedup for calculating Euclidean distance matrices and mutual information matrices is presented in Figure 1. The legend shows the number of elements in each vector and the type of calculation ("D" for distance, "B" for B-spline). The B-spline mutual information algorithm was configured to use 10 bins and a spline order of 3 [13]. As expected, the B-spline mutual information matrix shows better GPU acceleration due to its higher arithmetic-to-I/O ratio. The figure also shows that
it may not be worthwhile to perform small Euclidean distance calculations with a GPU since most of the processing time will be spent on memory operations.
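A minimal CUDA kernel following the one-block-per-matrix-entry design described above might look as follows. This is a hedged sketch written from the stated design rules, not the authors' kernel; the names and launch configuration are illustrative, and blockDim.x is assumed to be a power of two:

    // Launch example: distanceKernel<<<dim3(n, n), 128, 128 * sizeof(float)>>>
    __global__ void distanceKernel(const float* vectors, float* dist,
                                   int n, int dim) {
        extern __shared__ float partial[];       // one slot per thread
        int i = blockIdx.y, j = blockIdx.x;      // this block owns entry (i, j)
        const float* u = vectors + i * dim;
        const float* v = vectors + j * dim;

        float sum = 0.0f;                        // per-thread partial sum
        for (int k = threadIdx.x; k < dim; k += blockDim.x) {
            float d = u[k] - v[k];               // strided, sequential reads
            sum += d * d;
        }
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction in shared memory combines the partial sums.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            dist[i * n + j] = sqrtf(partial[0]); // write one matrix entry
    }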
Fig. 1. Similarity matrix calculation speedup (y-axis) versus the number of vectors (256-8192, x-axis), for Euclidean distance ("D") and B-spline mutual information ("B") calculations with vector dimensions 512, 2048, and 8192.
Fig. 2. Clustering speedup (y-axis) versus the number of vectors (256-8192, x-axis), for hierarchical clustering ("H") and k-means clustering ("K") with vector dimensions 512, 2048, and 8192.
The computational speedup for hierarchical clustering ("H", including the initial distance calculation) and k-means clustering ("K") is presented in Figure 2. For k-means, the number of iterations was
fixed and the number of clusters was 4. As expected, both figures show that the speedup is strongly related to the dimensionality of the vectors to be classified, because the elements of a data point can usually be operated on in parallel.
3.3. Monte Carlo Permutation

Permutation is widely used in statistical analysis but is often the most time-consuming step in genome-wide data analysis. Table 2 compares the performance of an efficient Monte Carlo permutation procedure [14] for correlation calculation on different platforms, using expression values for 4096 genes from 7226 U133A GeneChips deposited in the NCBI GEO database. It is obvious from the table that the 8800GTX can drastically speed up Monte Carlo permutations without expanding an existing cluster, given an open PCIe slot and an adequate power supply.

Table 2. Monte Carlo permutation on game hardware (seconds)

  Number  CPU (1x)  GPU    PS3
  1       258.77    11.12  57.33
  2       517.50    21.99  114.36
  4       1035.01   43.75  228.60
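To illustrate the class of workload, here is a plain CPU-side sketch of a Monte Carlo permutation test for a correlation coefficient. The paper benchmarks the more efficient procedure of Lin (Ref. 14); this is a generic textbook version with illustrative names, not the authors' implementation:

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <random>
    #include <vector>

    double pearson(const std::vector<double>& x, const std::vector<double>& y) {
        const std::size_t n = x.size();
        double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
        double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / std::sqrt(sxx * syy);
    }

    // Fraction of random permutations whose |correlation| reaches the
    // observed one; shuffling x breaks the pairing between x and y.
    double permutationPValue(std::vector<double> x,
                             const std::vector<double>& y,
                             int permutations, std::mt19937& rng) {
        const double observed = std::fabs(pearson(x, y));
        int hits = 0;
        for (int p = 0; p < permutations; ++p) {
            std::shuffle(x.begin(), x.end(), rng);
            if (std::fabs(pearson(x, y)) >= observed) ++hits;
        }
        return static_cast<double>(hits + 1) / (permutations + 1);
    }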
4. DISCUSSION
Although we have just started developing with game hardware, our results suggest that the NVIDIA GeForce 8800GTX is a very attractive co-processor, capable of increasing single-precision floating point calculation speed by more than one order of magnitude in clustering and Monte Carlo permutation procedures. It is likely that many other parallel-data bioinformatics algorithms, particularly those related to high-throughput genome-wide data analyses, will benefit from a port to game hardware.
Acknowledgments
The authors are members of the Pritzker Neuropsychiatric Disorders Research Consortium, which is supported by the Pritzker Neuropsychiatric Disorders Research Fund L.L.C. This work is also supported in part by the National Center for Integrated Biomedical Informatics through NIH grant 1U54DA021519-01A1 to the University of Michigan.

References
1. Angel E, Baxter B, Bolz J, Buck I, Carr N, Coombe et al. http://www.gpgpu.org/ 2007.
2. Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P. ACM Transactions on Graphics 2004; 23(3): 777-786.
3. Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell TJ. Computer Graphics Forum 2007; 26(1): 80-113.
4. Mueller F. http://moss.csc.ncsu.edu/~mueller/cluster/ps3/ 2007.
5. Charalambous M, Trancoso P, Stamatakis A. Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005), Springer LNCS 2005; 415-425.
6. Folding@Home. http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats 2007.
7. NVIDIA. http://developer.nvidia.com/object/cuda.html 2007.
8. IBM. http://www.alphaworks.ibm.com/tech/cellsw 2006.
9. PeakStream. http://www.peakstreaminc.com/product/overview/ 2006.
10. RapidMind. http://www.rapidmind.net/ 2006.
11. RapidMind. http://www.rapidmind.net/pdfs/RapidMindCellPorting.pdf 2006.
12. Wikipedia. http://en.wikipedia.org/wiki/Data_clustering 2007.
13. Daub CO, Steuer R, Selbig J, Kloska S. BMC Bioinformatics 2004; 5: 118.
14. Lin DY. Bioinformatics 2005; 21(6): 781-787.
EXACT AND HEURISTIC ALGORITHMS FOR WEIGHTED CLUSTER EDITING
Sven Rahmann, Tobias Wittkop, Jan Baumbach, and Marcel Martin
Computational Methods for Emerging Technologies group, Genome Informatics, Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany
Address correspondence to:
[email protected]. de
Anke Truß and Sebastian Böcker
Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, D-07743 Jena, Germany

Clustering objects according to given similarity or distance values is a ubiquitous problem in computational biology with diverse applications, e.g., in defining families of orthologous genes, or in the analysis of microarray experiments. While there exists a plenitude of methods, many of them produce clusterings that can be further improved. "Cleaning up" initial clusterings can be formalized as projecting a graph onto the space of transitive graphs; it is also known as the cluster editing or cluster partitioning problem in the literature. In contrast to previous work on cluster editing, we allow arbitrary weights on the similarity graph. To solve the so-defined weighted transitive graph projection problem, we present (1) the first exact fixed-parameter algorithm, (2) a polynomial-time greedy algorithm that returns the optimal result on a well-defined subset of "close-to-transitive" graphs and works heuristically on other graphs, and (3) a fast heuristic that uses ideas similar to those from the Fruchterman-Reingold graph layout algorithm. We compare quality and running times of these algorithms on both artificial graphs and protein similarity graphs derived from the 66 organisms of the COG dataset.
1. INTRODUCTION
The following problem arises frequently in clustering applications: Given a set of objects V and a similarity or distance measure for each unordered pair {u, v} of objects, we want to partition V into disjoint clusters. A common strategy is to choose a similarity threshold and construct the corresponding threshold graph: The objects constitute the nodes of the graph, and an edge is drawn between u and v if their similarity exceeds (distance falls below) the given threshold. In this case, u and v are called "similar", which we write as u ~ v. However, the resulting graph need not be transitive, meaning that u ~ v and v ~ w do not necessarily imply u ~ w. We wish to "clean up" such a preliminary clustering with as few edge changes as possible. Formal definitions are given below.

The similarity graph. We write V for the set of objects to be clustered; these are the vertices or nodes of the graph. We use uv as shorthand for an unordered pair {u, v} of distinct objects. We assume the availability of a symmetric similarity function s that maps each unordered pair to a real number, such that u and v are similar, u ~ v, if and only if s(u, v) := s(uv) > 0. The edge set of the similarity graph is E := {uv : u ~ v}. Note that the similarity of an object to itself is not and need not be defined here. For any set F of unordered pairs, we define s(F) := Σ_{uv ∈ F} s(u, v). A perfect clustering is characterized by the condition that the graph G = (V, E) is transitive, defined by any of the following equivalent conditions:

(1) For each triple uvw of distinct vertices, the implication "uv ∈ E and vw ∈ E imply uw ∈ E" holds.
(2) G contains no induced paths of length two, i.e., for each triple uvw of distinct vertices, we have |E ∩ {uv, vw, uw}| ≠ 2.
(3) G is a disjoint union of cliques (i.e., of complete graphs).

Our goal is to edit a given graph G = (V, E) by removing and adding edges in such a way that it becomes transitive. Each operation incurs a nonnegative cost: If uv ∈ E, the edge removal cost of uv is s(u, v). If uv ∉ E, the edge addition cost of uv is −s(u, v). Note the following subtlety: If s(u, v) = 0, then initially uv ∉ E, but it costs nothing to add this edge. The cost to transform the initial graph G = (V, E) into a graph G' = (V, E') with a different edge set E' is consequently defined as

    cost(G → G') := s(E \ E') − s(E' \ E).

Problem statement. The weighted transitive graph projection problem (WTGPP) is defined as follows. Given a similarity function s and the weighted undirected graph G = (V, E, s) with E := {uv : s(uv) > 0}, compute δ(G) := min{cost(G → G') : G' transitive} and find one or all transitive G* with cost(G → G*) = δ(G). Such G* are called best transitive approximations to G, or transitive projections, or least-cost cluster edits of G. We also call this problem the weighted cluster editing problem.
Previous work and results. The unweighted version of this problem, where s(u, v) ∈ {+1, −1} and cost(G → G') = |E \ E'| + |E' \ E| = |E Δ E'|, has been extensively studied and is also known as cluster editing in the literature. The first study that we are aware of goes back to Zahn in 1964 and solves the problem on specially structured graphs (2-level hierarchies). On the negative side, the problem has been proven NP-hard in general at least twice independently [4, 16]. On the positive side, fixed-parameter tractability (FPT) of the unweighted cluster editing problem, using the minimum number of edge changes as parameter k, is well studied. Gramm et al. [9] give a simple algorithm with running time O(3^k + |V|^3) and, by applying a refined branching strategy, improve the time complexity to O(2.27^k + |V|^3). Recent experiments by Dehne et al. [3] suggest that the O(2.27^k + |V|^3) algorithm is indeed faster than the O(3^k + |V|^3) algorithm in practice. In theory, the best known algorithm [8] for the problem has running time O(1.92^k + |V|^3), but this algorithm uses very complicated branching rules (137 initial cases) and has never been implemented. Damaschke [2] shows how to enumerate all optimal solutions. Unfortunately, it is also known that almost all graphs are almost maximally far away from transitivity in the following sense, as shown by Moon [12]. Let Q_n be the set of all graphs on n vertices. Note that each G = (V, E) ∈ Q_n satisfies δ(G) ≤ n(n−1)/4, i.e., half the number of vertex pairs, because if |E| ≤ n(n−1)/4 we can remove all edges and obtain the transitive empty graph, and if |E| ≥ n(n−1)/4 we can add all missing edges and obtain the transitive complete graph. Now define the class G_{n,ε} of graphs that are "far away" from transitivity in the sense that δ(G) ≥ (1 − ε) · n(n−1)/4. Then, for any ε > 0, the fraction of graphs in Q_n that belong to G_{n,ε} tends to one as n grows; that is, almost all graphs are almost maximally far from transitivity.

Preliminaries. In the following, the input to each of our algorithms is a similarity function s, or equivalently the weighted graph G = (V, E, s) with E := {uv : s(uv) > 0}. We set m := |E|. Without loss of generality, we may assume that the input graph consists of a single connected component. If not, we can treat each connected component separately, because an optimal solution will never join separate components; this is easily proved by contradiction.

The output of each algorithm is an edge set E* and a cost c* := cost(G → (V, E*)). We say that an algorithm correctly solves instance G = (V, E, s) if c* = δ(G). Let N(v) := {u : s(uv) > 0} ⊆ V denote the set of neighbors of v. We call N∩(u, v) := N(u) ∩ N(v) the common neighbors of u and v, and NΔ(u, v) := (N(u) Δ N(v)) \ {u, v} their non-common neighbors; here A Δ B is the symmetric set difference of sets A and B. Let C(G) be the set of all conflict triples, i.e., triples uvw that induce a path of length two: C(G) := {uvw : |E ∩ {uv, vw, uw}| = 2}. As noted, G is transitive if and only if C(G) = ∅.
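To make these definitions concrete, the following C++ sketch computes the edit cost of a candidate solution and counts conflict triples. It assumes a dense symmetric similarity matrix (an edge uv exists iff s[u][v] > 0); all names are illustrative, not taken from the authors' implementations:

    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // cost(G -> G'): pay s(uv) for each removed edge, -s(uv) for each added.
    double editCost(const Matrix& s, const std::vector<std::vector<bool>>& sol) {
        double cost = 0.0;
        const std::size_t n = s.size();
        for (std::size_t u = 0; u < n; ++u)
            for (std::size_t v = u + 1; v < n; ++v) {
                const bool before = s[u][v] > 0;
                if (before && !sol[u][v]) cost += s[u][v];   // edge removal
                if (!before && sol[u][v]) cost += -s[u][v];  // edge addition
            }
        return cost;
    }

    // Count conflict triples uvw (exactly two of the three pairs are edges).
    // G is transitive iff the count is zero.
    long countConflictTriples(const Matrix& s) {
        long conflicts = 0;
        const std::size_t n = s.size();
        for (std::size_t u = 0; u < n; ++u)
            for (std::size_t v = u + 1; v < n; ++v)
                for (std::size_t w = v + 1; w < n; ++w) {
                    int edges = (s[u][v] > 0) + (s[u][w] > 0) + (s[v][w] > 0);
                    if (edges == 2) ++conflicts;
                }
        return conflicts;
    }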
2. FIXED-PARAMETER ALGORITHM

Fixed-parameter algorithmics were introduced by Downey and Fellows in the late nineties [5]. They enable us to find exact solutions for several NP-hard problems. The basic idea is to choose a parameter for a given problem such that the problem is solvable in polynomial time when the parameter is fixed. A problem is fixed-parameter tractable with respect to the given parameter if there exists an algorithm which solves the problem in a running time of O(f(k) · |I|^c), where f is a function dependent only on the parameter k, |I| is the size of the input, and c is a constant. See Ref. 13 for a recent overview of fixed-parameter algorithms.

In the following, we propose a fixed-parameter algorithm for the WTGPP parameterized with the (real-valued) cost k of an optimal solution. Given an instance of the problem and fixed k, the algorithm is guaranteed to find an optimal solution with cost at most k or to report that no such solution exists. The algorithm roughly adopts the branching strategy and data reduction rules of the O(3^k + |V|^3) algorithm from Ref. 9 and runs in time O(3^k + |V|^3) if every edge deletion or insertion has a cost of at least 1 (if not, costs may be scaled up to fulfill this requirement). While our algorithm accepts any positive real numbers as input, minimum edit costs are required to achieve a provable running time, because there can be no fixed-parameter algorithm solving the problem with arbitrarily small weights unless P = NP. Our algorithm requires a cost parameter k. So in order to find an optimal solution, i.e., the smallest k for which a G* with cost(G → G*) ≤ k exists, we call the algorithm repeatedly, starting with k = 1.
If we do not find a solution with this value, we increase k by 1, call the algorithm again, and so forth. Note that for every k, we have to traverse the complete search tree and find the best solution with cost ≤ k, if any. The overall structure of the algorithm is recursive. In the beginning, we start with the full input graph and the given parameter k. Given G and k ≥ 0, we first call the data reduction procedure described below. Then we pick a conflict triple uvw ∈ C(G) and repair it in each possible way by recursively branching into three sub-problems. In order to ensure that the sub-problems do not overlap, we will in the process set some nonexistent edges to "forbidden" (so we can never add them) and some existent edges to "permanent" (so they cannot be removed). Initially, all edges carry no such label.

Data reduction. The following operations reduce the problem size. They are performed initially and for every sub-problem.
Remove cliques: Identify connected components and remove all components that are cliques from the input graph. The algorithm can be called separately for each component.

Check for unaffordable edge modifications: For each pair uv, we calculate lower bounds icf(uv) and icp(uv) on the cost induced by setting uv to forbidden or permanent, respectively. When setting uv to forbidden, we state that u and v should be in different components and therefore should have no common neighbors. Conversely, setting uv to permanent means getting rid of all non-common neighbors. Lower bounds on the induced costs are obtained as

    icf(uv) = Σ_{w ∈ N∩(u,v)} min{ s(uw), s(vw) };
    icp(uv) = Σ_{w ∈ NΔ(u,v)} min{ |s(uw)|, |s(vw)| }.

We maintain lists in which these costs are sorted by size and update these lists every time an edit operation is carried out. Data reduction now works as follows: (a) For all pairs uv where icf(uv) > k (i.e., which cannot be forbidden): insert uv if necessary, and set uv to "permanent". (b) For all pairs uv where icp(uv) > k (i.e., which cannot be made permanent): delete uv if necessary, and set uv to "forbidden".
If there is a pair uv such that both icp(uv) > k and icf(uv) > k, the (sub-)problem instance is not solvable with parameter k.

Merge vertices incident to permanent edges: As soon as we set an edge uv to permanent, it is obvious that u and v must be in the same clique in each solution found in this branch of the algorithm. In this case we merge u and v, creating a new vertex x. Note that if w is a neighbor both of u and of v, we create a new edge xw whose deletion costs as much as the deletion of both uw and vw. If w is neither a neighbor of u nor of v, we calculate the insertion cost of the nonexistent edge xw analogously. In case w is a neighbor of either u or v but not both, uvw is a conflict triple, and we have to decide whether we delete the edge connecting w with u or v, or insert the nonexistent edge. By summing the similarities (one of which is negative) to calculate the respective value for xw, we carry out the cheaper operation and maintain the possibility to edit xw later. Thus, we merge u and v into a new vertex x as follows: For each vertex w ∈ V \ {u, v}, set s(xw) ← s(uw) + s(vw). Let k ← k − icp(uv), and delete u and v from the graph.
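A direct transcription of the two lower bounds reads as follows; this is a sketch under the same dense-matrix assumption as before, whereas the engineered implementation keeps these values in sorted, incrementally updated lists instead of recomputing them:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // icf(uv): lower bound for setting uv "forbidden". Every common neighbor
    // w then forces deletion of uw or vw, costing at least min{s(uw), s(vw)}.
    double icf(const Matrix& s, int u, int v) {
        double bound = 0.0;
        for (int w = 0; w < static_cast<int>(s.size()); ++w) {
            if (w == u || w == v) continue;
            if (s[u][w] > 0 && s[v][w] > 0)               // common neighbor
                bound += std::min(s[u][w], s[v][w]);
        }
        return bound;
    }

    // icp(uv): lower bound for setting uv "permanent". Every non-common
    // neighbor w must be resolved by editing uw or vw, whichever is cheaper.
    double icp(const Matrix& s, int u, int v) {
        double bound = 0.0;
        for (int w = 0; w < static_cast<int>(s.size()); ++w) {
            if (w == u || w == v) continue;
            if ((s[u][w] > 0) != (s[v][w] > 0))           // non-common neighbor
                bound += std::min(std::fabs(s[u][w]), std::fabs(s[v][w]));
        }
        return bound;
    }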
Branching strategy. After data reduction, let uvw ∈ C(G) be a conflict triple, where u is the vertex of degree two and v, w are the leaves. We recursively branch into three cases.

(1) Insert the missing edge vw, and set all edges uv, uw, vw to "permanent".
(2) Delete edge uv, and set the remaining edge uw to "permanent" and the absent edges uv and vw to "forbidden".
(3) Delete edge uw, and set it to "forbidden" (do not set the other edge labels).

In each branch, we lower k by the insertion or deletion cost required for the executed operation. If this would lead to k < 0, we skip this branch. This branching strategy gives us a search tree of size O(3^k), but usually much smaller in practice.

Time complexity analysis. If we set an edge to forbidden or permanent, this can reduce the parameter k because we have to delete or insert an edge. This, in turn, may trigger other edges to be forbidden or permanent. We can show that the running time for merging two vertices is O(|V|^2), and the total running time for data reduction of an arbitrary input graph is O(|V|^3). A detailed proof is deferred to a full journal version of this paper. If every edge deletion or insertion has a cost of at least 1, then we can show that our data reduction results in a problem kernel with at most 2k^2 + k vertices. For the weighted cluster editing algorithm, this would result in a total running time of O(3^k · k^4 + |V|^3). We use interleaving [14] by performing data reduction repeatedly during the course of the search tree algorithm whenever possible. This reduces the total running time to O(3^k + |V|^3). We stress that the faster O(2.27^k + |V|^3) algorithm of Gramm et al. [9] for the unweighted case cannot be used to solve the WTGPP, because its branching strategy is based on an observation that does not hold for weighted graphs (Lemma 5 in Ref. 9). We are currently working on adapting this branching strategy to the weighted case.

3. GREEDY HEURISTIC

As in the fixed-parameter algorithm, all conflict triples uvw ∈ C(G) must be repaired to make G transitive. A repair consists of either removing one of the two existing edges or adding the missing edge. Observe that the hard part is to correctly "guess" the set of edges to remove. Thereafter, the edge insertions can easily be found by transitive closure, that is, by adding those edges required to make each connected component a clique. Our idea is to define a function that scores edge removals and then let the algorithm greedily delete the highest-scoring edge in each step until further deletions do not improve the solution.

Scoring edges. We define G's deviation from transitivity D(G) as

    D(G) := Σ_{uvw ∈ C(G)} min{ |s(uv)|, |s(vw)|, |s(uw)| }.    (1)

We can now score edge removals: Let uv be an edge in G = (V, E, s). Removing it yields G'_uv := (V, E \ {uv}, s'), where s'(xy) = s(xy), except s'(uv) = −∞ ("forbidden"). We call

    Δ_uv(G) := D(G) − D(G'_uv) − s(uv)    (2)

the transitivity improvement of edge uv. The term −s(uv) penalizes the edge removal.
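Both scores can be computed naively as in the following sketch (illustrative only; the actual algorithm, described next, updates the scores incrementally in O(n) per deletion rather than recomputing D from scratch):

    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // D(G), Eq. (1): sum over conflict triples of the cheapest resolving edit.
    double deviation(const Matrix& s) {
        double d = 0.0;
        const int n = s.size();
        for (int u = 0; u < n; ++u)
            for (int v = u + 1; v < n; ++v)
                for (int w = v + 1; w < n; ++w) {
                    int edges = (s[u][v] > 0) + (s[u][w] > 0) + (s[v][w] > 0);
                    if (edges == 2)
                        d += std::min({std::fabs(s[u][v]), std::fabs(s[u][w]),
                                       std::fabs(s[v][w])});
                }
        return d;
    }

    // Delta_uv(G), Eq. (2): removing uv marks it "forbidden" (s = -infinity);
    // the matrix is taken by value because this is a what-if evaluation.
    double transitivityImprovement(Matrix s, int u, int v, double dBefore) {
        const double suv = s[u][v];
        s[u][v] = s[v][u] = -std::numeric_limits<double>::infinity();
        return dBefore - deviation(s) - suv;   // D(G) - D(G'_uv) - s(uv)
    }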
Algorithm. In addition to the main algorithm, the greedy heuristic consists of two auxiliary functions, which we describe first.

Algorithm REMOVE-CULPRIT(G) returns the highest-scoring edge argmax_{uv ∈ E} {Δ_uv(G)} and removes it from G. There are m edges; computing each Δ_uv(G) can be done in O(n) since only triples containing uv need to be considered. Thus, the runtime of the first invocation is O(mn). Subsequent invocations need only O(m + n) time: O(n) to update scores for edges around the deleted edge, and O(m) to find the maximum score.

Algorithm TRANSITIVE-CLOSURE-COST(G) assumes G is connected; it returns the total cost Σ_{uv} max{ −s(uv), 0 } of all edge additions required for a transitive closure of G, in time O(n^2).

Algorithm GREEDY-HEURISTIC(G) is the main algorithm. It returns a pair (deletions, cost), where deletions is the list of edges to be removed from G and cost is the total cost of all edit operations (both removals and additions). Remember that G is connected.

(1) cost ← TRANSITIVE-CLOSURE-COST(G).
(2) If cost = 0, return an empty list and cost 0.
(3) Set deletions ← empty list; delcost ← 0.
(4) Repeat the following steps until G consists of two connected components G1 and G2:
    (a) uv ← REMOVE-CULPRIT(G)
    (b) append uv to deletions
    (c) increase delcost by s(uv)
(5) Adjust deletions such that it only includes edges that contribute to the cut between G1 and G2. Adjust delcost accordingly, and re-add incorrect edges to G1 and G2.
(6) Solve the problem recursively for G1 and G2, as long as there is a chance for a better solution:
    If delcost ≥ cost, return (empty list, cost).
    (list1, cost1) ← GREEDY-HEURISTIC(G1).
    If delcost + cost1 ≥ cost, return (empty list, cost).
    (list2, cost2) ← GREEDY-HEURISTIC(G2).
    If delcost + cost1 + cost2 ≥ cost, return (empty list, cost).
(7) Append list1 and list2 to deletions. Return (deletions, delcost + cost1 + cost2).

If the "safety net" in step 5 is never invoked, GREEDY-HEURISTIC deletes each of the m edges at most once across all recursions. After each deletion, both determining connected components and REMOVE-CULPRIT require O(m + n) time. Also, TRANSITIVE-CLOSURE-COST takes O(n^2) time for each cut, of which there are at most n − 1. Thus, the runtime is O(m(m + n) + n^3).
Correctness of the greedy heuristic for special graphs. We show that the greedy heuristic correctly computes the transitive projection of certain classes of graphs in the unweighted case, where s(i, j) ∈ {±1}. Here Eq. (1) becomes D(G) = |C(G)| and Eq. (2) becomes Δ_uv(G) = |C(G)| − |C(G'_uv)| − 1 = |NΔ(u,v)| − |N∩(u,v)| − 1, since triples not containing edge uv cancel out. Let T be an unweighted transitive graph consisting of r cliques C_1, ..., C_r with n_i := |C_i|. Graph G is obtained from T by edge modifications. Let δ_u be the number of u-incident edges deleted from T, and ε_u the number of u-incident edges added to T, to obtain G.

Lemma. (1) Let uv ∈ E(G) ∩ E(T) be an intra-cluster edge of C_i. Then Δ_uv(G) ≤ 2δ_u + 2δ_v + ε_u + ε_v − n_i + 1. (2) Let xy ∈ E(G) \ E(T) be an inter-cluster edge between C_i ∋ x and C_j ∋ y. Then Δ_xy(G) ≥ n_i + n_j − (δ_x + δ_y + 2ε_x + 2ε_y) + 1.

Proof. We count the common and non-common neighbors. (1) There are no non-common neighbors of uv in T, and each edge deletion or insertion incident to u or v creates at most one. Therefore |NΔ(u,v)| ≤ δ_u + δ_v + ε_u + ε_v. There are n_i − 2 common neighbors of uv in T, and each edge deletion incident to u or v removes at most one. Thus |N∩(u,v)| ≥ n_i − 2 − (δ_u + δ_v). (2) After inserting xy into T, this edge has (n_i − 1) + (n_j − 1) non-common neighbors. Each deletion incident to x or y decreases this number, and each of the ε_x − 1 plus ε_y − 1 additional insertions incident to x or y might also decrease this number. Thus |NΔ(x,y)| ≥ n_i + n_j − (δ_x + δ_y + ε_x + ε_y). On the other hand, each insertion can also create a common neighbor; thus |N∩(x,y)| ≤ ε_x + ε_y − 2.

Theorem. GREEDY-HEURISTIC(G) recovers the original transitive graph T if the following assumption holds: For each vertex from any C_i in T, at most 2n_i/9 edges to vertices in other clusters are added, and at most 2n_i/9 of the edges to vertices in the same cluster are removed, to obtain G.

Proof. We show that Δ_e(G) > Δ_f(G) for any inter-cluster edge e and intra-cluster edge f. Assume that e = xy lies between C_i and C_j, and that f = uv lies in C_i. Using the Lemma and the 2/9-assumption, Δ_e(G) − Δ_f(G) ≥ 2n_i − (δ_x + 2ε_x + 2δ_u + 2δ_v + ε_u + ε_v) + n_j − (δ_y + 2ε_y) ≥ n_j/3 > 0, as all n_i-terms cancel out. Therefore, GREEDY-HEURISTIC will always remove inter-cluster edges first. This also shows that the "safety net" (step 5) of the algorithm is unnecessary here.
4. LAYOUT-BASED HEURISTIC

Our final heuristic is based on physical intuition and motivated by graph layout algorithms, initially introduced by Fruchterman and Reingold. These have later been extended and used for the visualization of structural and functional relationships in biological networks, e.g., in BioLayout [7]. The main idea of these layout algorithms is to arrange all nodes on a 2-dimensional plane to fit esthetic criteria (such as even node distribution in a frame and reflection of inherent symmetries). The graph's nodes are interpreted as magnets (or electrical charges of the same kind), and edges are replaced by rubber bands to form a physical system. The nodes are initially placed randomly or in a circle, for example, and then left to the forces of the system, so that the magnetic repulsion and the bands' attraction forces move the system to a minimal energy state. While a physical system provides the motivation for these algorithms, in the actual implementation the nodes need not move according to exact physical laws. We have adapted and extended these ideas: The layout of the graph is used to partition it into disjoint connected components. Our algorithm proceeds in three phases: (1) layout, (2) partitioning, and (3) postprocessing.

Layout phase. The goal is to find a position pos[i] = (pos[i]_1, pos[i]_2) ∈ R^2 for each node 1 ≤ i ≤ n, starting with a circular layout of radius ρ_0 (a user-defined parameter) around the origin. We define the distance d(i, j) of nodes i and j as their Euclidean distance in the layout:

    d(i, j) := ( Σ_{d=1}^{2} (pos[i]_d − pos[j]_d)^2 )^{1/2}.
For a user-defined number R of iterations, we compute the displacement of each node and update the position pos[i] of each node i accordingly. We have allowed ourselves some freedom in deriving a good displacement vector. In particular, we do not compute forces, accelerations, and velocities of points but, for simplicity's sake, directly apply a displacement vector to a node once it has been computed according to the rules below. In this sense, the physical system described above serves only as a motivation, not as a model, for the algorithm.

In round r ∈ {1, ..., R}, we compute the displacement of node i as follows. For each node j ≠ i with s(ij) > 0, we move i into the direction of j (the unit vector of this direction is (pos[j] − pos[i])/d(i, j)) by an amount of f_att · F_att(d(i, j)) · s(ij). Here F_att(d) is a strictly increasing function of the distance (we use F_att(d) := log(d + 1)), and f_att > 0 is a user-defined scaling factor for attraction. Conversely, for each node j ≠ i with s(ij) < 0, we move i away from j by an amount of f_rep · F_rep(d(i, j)) · |s(ij)|, where F_rep(d) := 1/F_att(d) is strictly decreasing, and f_rep > 0 is another scaling factor. Finally, the magnitude of the displacement vector is cut off at a maximal value M(r) that depends on the iteration r: we use a cap of the form M(r) = n · M_0 · (1/(r+1))^c for a fixed exponent c > 0, to obtain increasingly small displacements in later iterations. Again, M_0 > 0 is a user-defined parameter. After the displacement of a node i has been computed, the node is immediately moved, before the displacement of node i+1 is computed. While this does not agree with the physical model, we have found that it speeds up convergence of the layout and saves the memory for per-node displacement vectors. After all nodes have been moved, the next iteration starts. The layout phase obviously runs in O(R · n^2) time. The actions of the algorithm are visualized in Figure 1.

For the cluster editing problem based on protein sequence similarities, we use the following parameters: number of iterations R = 186, initial circular layout radius ρ_0 = 200, repulsion scaling factor f_rep = 1.687, attraction scaling factor f_att = 1.245, and M_0 = 633. The best parameter constellation is (more or less) specific to the concrete problem and has been obtained by an evolutionary training procedure using the cost function as quality function. It is included in our implementation to enable the user to perform parameter calibration for arbitrary applications.

Fig. 1. Layout of a graph with 41 nodes after (A) 3, (B) 10, and (C) 90 iterations.

Partitioning phase. The nodes' positions after R rounds are used to partition the graph geographically. Given a distance parameter δ, we single-linkage cluster all nodes, meaning that nodes i and j belong to the same cluster if there exist nodes i = i_0, i_1, ..., i_K = j such that d(i_{k−1}, i_k) ≤ δ for all k = 1, ..., K. We determine cost(G → G*_δ) for the so-defined transitive graph G*_δ. To find a good G*_δ, we start with a small distance parameter δ_init := 0 and increase it (δ ← δ + σ) by a growing step size σ: initially σ ← σ_init := 0.01; subsequently σ ← σ · f_σ with factor f_σ := 1.1. This continues until δ ≥ δ_max := 300. The best value of δ, along with its cost, is remembered. Obviously, the time complexity of the partitioning phase is O(D · n^2), where D is the number of different values for δ.
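One round of the layout phase can be sketched as follows. This is an illustrative C++ rendering of the displacement rules above (the authors' implementation is in Java); M stands for the cap M(r) of the current round, and nodes are moved immediately, as in the paper:

    #include <cmath>
    #include <vector>

    struct Point { double x, y; };

    void layoutRound(std::vector<Point>& pos,
                     const std::vector<std::vector<double>>& s,
                     double fAtt, double fRep, double M) {
        const int n = pos.size();
        for (int i = 0; i < n; ++i) {
            double dx = 0.0, dy = 0.0;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                double ex = pos[j].x - pos[i].x, ey = pos[j].y - pos[i].y;
                double d = std::sqrt(ex * ex + ey * ey) + 1e-9;
                double fatt = std::log(d + 1.0);                 // F_att(d)
                double mag = (s[i][j] > 0)
                    ? fAtt * fatt * s[i][j]                      // attraction
                    : -fRep * (1.0 / fatt) * std::fabs(s[i][j]); // repulsion
                dx += mag * ex / d;                              // unit direction
                dy += mag * ey / d;
            }
            double len = std::sqrt(dx * dx + dy * dy);
            if (len > M) { dx *= M / len; dy *= M / len; }       // cap at M(r)
            pos[i].x += dx;                                      // move node i
            pos[i].y += dy;                                      // immediately
        }
    }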
Postprocessing. The geometric single-linkage clustering is further improved by postprocessing, which takes O(n^4) time in the worst case, although this worst case almost never occurs in practice; effectively, the overall running time is O(n^2). The two postprocessing steps are:

(1) For each pair of clusters, we check if joining them into a single cluster decreases the overall cost, and perform this operation if appropriate. During this step, we especially reduce the number of erroneous singleton nodes. This happens in arbitrary but deterministic order, as long as merging a pair of clusters results in an improvement.
(2) For each node i and cluster C with i ∉ C, we check if moving i to C decreases the overall cost, and perform this operation if appropriate. We repeat this step as long as further improvements result.
5. RESULTS

We implemented the FP algorithm in C++, the greedy heuristic in Python, and the layout-based heuristic in Java. While with modern Java virtual machines the running times of Java programs are comparable to those of C++ programs, there is a higher start-up cost, which especially hurts performance on small problem instances. Python running times are about 10 times slower than those of a comparable C++ implementation. This should be kept in mind when comparing the running times of our implementations. All measurements were taken on a SunFire 880 with 900-MHz UltraSPARC III+ processors and 32 GB of RAM.

Artificial graphs. We generate random artificial graphs as follows. Given the number of nodes n, we randomly select an integer k ∈ [1, n] and define the corresponding nodes to be a cluster. We proceed in the same way with the remaining n − k
398 Table 1. Results on artificial graphs with different numbers of nodes n, resulting in different ranges of edge numbers m. For each n 5 50, ten random instances were generated. For each n 2 60, where the F P algorithm did not finish in reasonable time, only five instances were generated. Costs and running times are averages over these 10 resp. 5 instances. Smallest costs and running times are marked in boldface. The (Diff.) columns show the relative cost difference against the optimal solution returned by FP, where possible. Abbreviations: FP: fixed parameter algorithm; Greedy: greedy heuristic; Layout: layout-based heuristic. Parameters n m E 10 [11,30] 20 [65,165] 30 [138,296] 40 [251,533] 50 [402,821] 60 [515,1252] 70 [694,1911] 80 [1141,2094] 90 [1248,2969] 100 11711.31571
FP 95.75 301.89 671.25 1238.3 1859.99 __ -
-
Greedy 96.17 305.22 671.51 1238.31 1859.99 2742.3 3608.54 4729.52 6106.56 7494.36
costs (Diff.) (+O%) (+l%) (SO%) (SO%) (+O%)
(-)
(-1 (-1 (-1
(-)
nodes until no nodes are left. This gives us a random number of clusters of random sizes. Then the similarities of objects within a cluster are drawn from a Gaussian distribution N ( p i n ,o?*);they are positive on average, but negative with some probability. Similarities of objects in different clusters are conversely drawn from a Gaussian distribution N ( p e x ,o:~), which leads to negative values on average. If the parameters are chosen carefully, this construction leads to “almost transitive” graphs. For our experiments, we choose pin = 21, pex = -21, gin = f l e x = 20, so that the probability of seeing an undesired or missing edge is about 0.147 per node pair. Table 1 shows the results. We see that the FP algorithm is the fastest one for small graphs, but reaches its limits above 50 nodes. On the other hand, the greedy and layout-based heuristics perform almost as well, while requiring significantly less time. The layout-based heuristic is much faster on large components, but first requires a good choice of parameters, as discussed in Section 4.
Protein similarity graph from the COG dataset. We test the algorithms on the 66 organisms of the COG dataset17 from http ://www .ncbi .nlm.nih. gov/COG/, i.e., on the protein sequences from ftp: //ftp.ncbi.nih.gov/pub/COG/COG/myva/. We define the similarity score of two proteins as follows: First let s ( ~+ w) := CHE7LH(u+V) [-log,oE(H)I - 2 . (IZ(u .)I - 1). Here Z ( u + w) denotes the set of high-scoring pairs (HSPs) with E-value better than returned when +
Layout 95.75 301.89 671.25 1238.31 1859.99 2742.3 3609.48 4722.08 6106.56 7494.36
(Diff.) (SO%) (+O%)
(+O%) (SO%)
(+O%)
(-1 (-1 (-1 (-) f-)
Running Times Is] .. FP Greedy Layout 0.035 0.242 0.845 1.407 0.152 0.538 2.756 1.157 1.876 72.109 3.167 2.816 3.353 8.315 2204.862 19.198 3.972 4.358 58.124 69.056 4.698 128.986 5.384 207.958 5.464 -
BLASTing u against w. We subtract a penalty of 2 score points for each HSP beyond the highestscoring one. We similarly define the score s(u + u) by BLASTing w against u. Finally we define the symmetric similarity score s(u,w) := min{s(u + w),s ( v + u ) } - T , where we use a threshold of T = 10, corresponding to an E-value of The resulting similarity matrix defines a graph of 42563 (trivially transitive) connected components of size 1 and 2, and 8037 larger components, 3964 of which are not transitive; these are the input to our algorithms. Figure 2 shows a histogram of initial component sizes IVI. There are 70 intransitive components with IVI > 200 that are not shown in the histogram, the largest of size 8836. As all three algorithms perform well on very small components (which could be solved by exhaustive enumeration), we now restrict our attention to the 1243 components with IVI 2 20. For each instance and each algorithm, we limit computation time to 48 hours; thus we could find the exact FP solution for 825 of the 1243 components in the alloted time. Figure 3 (left) shows the relative cost of the solutions found by Greedy and Layout in comparison to the optimal one found by FP for the 825 components. Both heuristics work quite well: In 635 out of 825 cases, Greedy returns the optimal solution; and in 811 out of 825 cases, Layout returns the optimal solution. This behavior is relatively independent of the size or complexity of the graph (shown on the x-axis). The solution returned by Layout deviates in only two cases by more than 5% from the optimal so-
Fig. 2. Initial distribution of component sizes |V| for the complete COG dataset in the range 3 ≤ |V| < 20 (left) and 20 ≤ |V| ≤ 200 (right). Cyan (lower bars): number of non-transitive components. Magenta (upper bars): number of transitive components.
With Greedy, this happens in 95 cases, and its maximal deviation is about 50% in rare cases. Figure 3 (right) visualizes the running times of the different algorithms against component complexity for all 1243 components. It is evident that the FP algorithm is fastest for small components, but quickly hits a wall for larger ones. Greedy is quickest for medium-sized components, but its running time grows faster with graph complexity than that of Layout, which is the only feasible algorithm for the largest components.
6. DISCUSSION AND CONCLUSION

We have put forward three algorithms for weighted transitive graph projection, or weighted cluster editing, that cover the whole spectrum from an exact fixed-parameter algorithm to pure heuristics. If graphs that arise from "real" data are not far from transitivity (in contrast to random graphs, which are highly intransitive with high probability according to Moon's result12), we can find the optimal solution to the WTGPP with an FP algorithm in reasonable time for medium-sized components, and close-to-optimal solutions with well-engineered heuristics in guaranteed polynomial time. The FP and the Greedy algorithm complement each other well: the former guarantees the exact solution (and runs quickly for almost transitive graphs); the latter always runs in polynomial time and guarantees an optimal solution for close-to-transitive graphs. The
Layout heuristic works very well in practice, but has no provable guarantees. Our study shows that real protein similarity graphs are indeed close to transitive, and the three algorithms perform quite well on these WTGPP instances. Although not in the scope of this paper, the WTGPP has numerous potential applications to be investigated. Here we merely used the COG dataset as a comparative illustration of the respective capabilities of our three algorithms. Applications naturally arise in delineating gene and protein families11, 17 (which in turn can be used as a preprocessing method for gene cluster discovery15) and in the discovery of structure in protein complexes or of communities in social or biological networks. To further understand and improve the FP algorithm, it is of interest to systematically compare the branching strategy of our FP algorithm with that of a general ILP solver using the cutting plane algorithm of Ref. 10, which so far has not been attempted on large components.

Acknowledgments and availability.
Tobias Wittkop is supported by the DFG GK Bioinformatik. Jan Baumbach is supported by the International NRW Graduate School in Bioinformatics and Genome Research. The fixed-parameter algorithm was implemented and engineered by Sebastian Briesemeister. We thank M. Madan Babu for many constructive comments, and Andreas Dress for pointing out the work of Grötschel and Wakabayashi. Supplementary material and source code are available at http://gi.cebitec.uni-bielefeld.de/transitivegraphprojection/.
Fig. 3. Left: Relative cost differences in percent (y-axis) of the solutions found by Greedy and Layout in comparison to the exact fixed-parameter (FP) algorithm. Only those components whose exact solution could be computed in less than 48 hours are shown. For both Greedy and Layout, the optimal solution is found in the majority of cases. Note that the x-axis, which shows the component complexity (we use |V| · |E|), is logarithmic. Right: Running times of FP, Greedy, and Layout against component complexity. Both axes are logarithmic.
References

1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389-3402, 1997.
2. P. Damaschke. On the fixed-parameter enumerability of cluster editing. In D. Kratsch, editor, Proc. of International Workshop on Graph Theoretic Concepts in Computer Science (WG 2005), volume 3787 of LNCS, pages 283-294. Springer, 2005.
3. F. Dehne, M. A. Langston, X. Luo, S. Pitre, P. Shaw, and Y. Zhang. The cluster editing problem: Implementations and experiments. In Proc. of International Workshop on Parameterized and Exact Computation (IWPEC 2006), volume 4169 of LNCS, pages 13-24. Springer, 2006.
4. S. Delvaux and L. Horsten. On best transitive approximations to simple graphs. Acta Informatica, 40(9):637-655, 2004.
5. R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999.
6. T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software Practice and Experience, 21(11):1129-1164, 1991.
7. L. Goldovsky, I. Cases, A. J. Enright, and C. A. Ouzounis. BioLayout(Java): versatile network visualisation of structural and functional relationships. Applied Bioinformatics, 4(1):71-74, 2005.
8. J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321-347, 2004.
9. J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Exact algorithms for clique generation. Theor. Comput. Syst., 38(4):373-392, 2005.
10. M. Grötschel and Y. Wakabayashi. A cutting plane algorithm for a clustering problem. Mathematical Programming, Series B, 45:59-96, 1989.
11. A. Krause, J. Stoye, and M. Vingron. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics, 6:15, 2005.
12. J. W. Moon. A note on approximating symmetric relations by equivalence classes. SIAM Journal of Applied Mathematics, 14(2):226-227, 1966.
13. R. Niedermeier. Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.
14. R. Niedermeier and P. Rossmanith. A general method to speed up fixed-parameter-tractable algorithms. Inform. Process. Lett., 73:125-129, 2000.
15. S. Rahmann and G. W. Klau. Integer linear programs for discovering approximate gene clusters. In P. Bucher and B. Moret, editors, Proceedings of the 6th Workshop on Algorithms in Bioinformatics (WABI), volume 4175 of LNBI, pages 298-309. Springer, 2006.
16. R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144:173-182, 2004.
17. R. L. Tatusov, N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41, 2003.
18. C. T. Zahn Jr. Approximating symmetric relations by equivalence relations. Journal of the Society of Industrial and Applied Mathematics, 12(4):840-847, 1964.
METHODS FOR EFFECTIVE VIRTUAL SCREENING AND SCAFFOLD-HOPPING IN CHEMICAL COMPOUNDS
Nikil Wale* and George Karypis
Department of Computer Science, University of Minnesota, Twin Cities
*Email: [email protected], [email protected]

Ian A. Watson
Eli Lilly and Company, Lilly Research Labs, Indianapolis
Email: [email protected]

Methods that can screen large databases to retrieve a structurally diverse set of compounds with desirable bioactivity properties are critical in the drug discovery and development process. This paper presents a set of such methods, designed to find compounds that are structurally different from a given query compound while retaining its bioactivity properties (scaffold hops). These methods utilize various indirect ways of measuring the similarity between the query and a compound that take into account additional information beyond their structure-based similarities. Two sets of techniques are presented that capture these indirect similarities, based respectively on automatic relevance feedback and on analyzing the similarity network formed by the query and the database compounds. Experimental evaluation shows that many of these methods substantially outperform previously developed approaches, both in their ability to identify structurally diverse active compounds and in identifying active compounds in general.
1. INTRODUCTION

The discovery, design, and development of new drugs is an expensive and challenging process. Any new drug should not only produce the desired response to the disease but should do so with minimal side effects. One of the key steps in the drug design process is the identification of chemical compounds (hit compounds, or just hits) that display the desired and reproducible activity against the specific biomolecular target23. This represents a significant hurdle in the early stages of drug discovery. A popular approach for finding these hits is to use a compound known to possess some of the desired activity properties as a reference and identify other compounds from a large compound database that have a similar structure. This is nothing more than ranked retrieval using the reference compound as a query. The approach relies on the well-known fact that compounds sharing key structural features will most likely have similar activity against a biomolecular target, which is referred to as the structure-activity relationship (SAR)9. The similarity between the compounds is usually computed
by first representing their molecular graph as a vector in a particular descriptor-space and then using a variety of vector-based methods to compute their similarity8, 9. However, the task of identifying hit compounds is complicated by the fact that the query might have undesirable properties such as toxicity, bad ADME (absorption, distribution, metabolism, and excretion) properties, or may be promiscuous17, 26. These properties will also be shared by most of the highest-ranked compounds, as they will correspond to very similar structures. In order to overcome this problem, it is important to rank highly as many chemical compounds as possible that not only show the desired activity for the biomolecular target but also have different structures (come from diverse chemical classes or chemotypes). Finding novel chemotypes using the information of already known bioactive small molecules is termed scaffold-hopping17, 32, 33. In this paper we address the problem of scaffold-hopping by developing a set of techniques that measure the similarity between the query and a compound by taking into account additional information
beyond their structure-based similarities. These indirect ways of measuring similarity enable the retrieval of compounds that are structurally different from the query but at the same time possess the desired bioactivity properties. We present two sets of techniques to capture such indirect similarities. The first set contains techniques that are based on automatic relevance feedback, whereas the second set derives the indirect similarities by analyzing the similarity network formed by the query and the database compounds. Both sets of techniques operate on the descriptor-space representation of the compounds and are independent of the selected descriptor-space. We experimentally evaluate the performance of these methods using three different descriptor-spaces and six different datasets. Our results show that most of these methods are quite effective in improving the scaffold-hopping performance over standard ranked-retrieval. Among them, the methods based on the similarity network perform the best and substantially outperform previously developed scaffold-hopping schemes. Moreover, even though these methods were created to improve the scaffold-hopping performance, our results show that many of them are quite effective in improving the ranked-retrieval performance as well. The rest of the paper is organized as follows. Section 2 describes the problems addressed in this paper. Section 3 introduces the definitions and notations used in this paper. Section 4 introduces the various descriptor-spaces for this problem. Section 5 describes the methods developed in this paper. Section 6 gives an overview of the related work in this field. Section 7 describes the materials used in our experimental methodology. Section 8 compares and discusses the results obtained and summarizes the results of this paper.
2. PROBLEM STATEMENT AND MOTIVATION

The ranked-retrieval and the scaffold-hopping problems that we consider in this paper are defined as follows:

Definition 2.1 (Ranked-Retrieval Problem) Given
a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity.
Definition 2.2 (Scaffold-Hopping Problem) Given
a query compound and a parameter k, retrieve the k compounds that are similar to the query in terms of their bioactivity but whose structure is as dissimilar as possible to that of the query.

The solution to the ranked-retrieval problem relies on the well-known fact that the chemical structure of a compound relates to its activity (SAR)9. As such, effective solutions can be devised that rank the compounds in the database based on how structurally similar they are to the query. For scaffold-hopping, however, the compounds retrieved must be sufficiently similar in structure to possess similar bioactivity, but at the same time must be structurally dissimilar enough to constitute a novel chemotype. This is a much harder problem than simple ranked-retrieval, as it has the additional constraint of maximizing dissimilarity, which runs counter to SAR. Methods that have the ability to rank higher the compounds that are structurally different (different chemotypes) have advantages over methods that do not. They improve the odds of finding a compound that is not only active for a biomolecular target but also has all the other desired properties (non-toxicity, good ADME properties, target specificity, etc.8, 17) that the reference structure and compounds with similar structures might not possess. Such a compound is then more likely to become a true drug candidate. Furthermore, scaffold-hopping is also important from the point of view of unpatented chemical space. Many important lead compounds and drug candidates have already been patented. In order to find new therapies and offer alternative treatments, it is important for a pharmaceutical company to discover novel leads away from the existing patented chemical space. Methods that perform scaffold-hopping can achieve those objectives.
3. DEFINITIONS AND NOTATIONS

Throughout the paper we will use D to denote a database of chemical compounds, q to denote a query compound, and c to denote a chemical compound present in the database. Given two compounds ci and cj, we will use sim(ci, cj) to denote their (direct) similarity, which
is computed with respect to their descriptor-space representation by a suitable similarity measure. Given a compound ci and a set of compounds A, we will use sim(ci, A) to denote the average pairwise similarity between ci and all the compounds in A. Given a query compound q, a database D, and a parameter k, we define top-k to be the k compounds in D that are most similar to q. Given a compound c, a set of compounds A, and a similarity measure, its k-nearest-neighbor list contains the k compounds in A that are most similar to c.
Finally, throughout the paper we will refer to the task of retrieving active compounds as ranked-retrieval and the task of retrieving scaffold-hops as scaffold-hopping.

4. DESCRIPTOR SPACES FOR RANKED-RETRIEVAL
The similarity between chemical compounds is usually computed by first transforming them into a suitable descriptor-space representation8, 9. A number of different approaches have been developed to represent each compound by a set of descriptors. These descriptors can be based on physiochemical properties as well as topological and geometric substructures (fragments)31, 1, 3, 12, 25, 18, 29. In this study we use three descriptor-spaces that have been shown to be very effective in the context of ranked-retrieval and/or scaffold-hopping: the graph fragments (GF)29, the extended connectivity fingerprints (ECFP)25, 18, and the extended reduced graph (ErG) descriptors27. GF is a 2D topology-based descriptor-space29 that is based on all the graph fragments of a molecular graph up to a predefined size. ECFP is also a 2D topological descriptor-space, and many flavors of these descriptors have been described by several authors18. The idea behind this descriptor-space is to capture the topology around each atom in the form of shells whose radius (number of bonds) ranges from 1 to l, where l is a user-defined parameter. We use the ECZ3 variation of ECFP, in which each atom is assigned a label corresponding to its atomic number and the maximum shell radius is set to three. Both extended connectivity fingerprints (ECFP) and GF have been shown to be highly effective for the ranked-retrieval of chemical compounds18, 29.
Extended reduced graph descriptors (ErG) form a pharmacophoric descriptor-space. A pharmacophore is defined as a critical 3D or 2D arrangement of molecular fragments forming a necessary but not sufficient condition for biological activity. Descriptors that rely only on 2D information are called 2D pharmacophoric descriptors, whereas descriptors that utilize 3D information are called 3D pharmacophoric descriptors. ErG is a 2D pharmacophoric descriptor-space that combines reduced graphs15, 14 and binding property pairs22 to generate a pharmacophoric descriptor-space. A detailed description of the generation of these pharmacophoric descriptors can be found in Ref. 27.
5. METHODS
In order to improve the scaffold-hopping performance, we developed a set of techniques that measure the similarity between the query and a compound by taking into account additional information beyond their descriptor-space-based representation. These methods are motivated by the observation that if a query compound q is structurally similar to a database compound ci, and ci is structurally similar to another database compound cj, then q and cj could be considered as being similar or related even though they may have zero or very low direct similarity. This indirect way of measuring similarity can enable the retrieval of compounds that are structurally different from the query but at the same time, due to associativity, possess the same bioactivity properties as the query. We developed two sets of techniques to capture such indirect similarities, inspired by research in the fields of information retrieval and social network analysis. The first set contains techniques that use various forms of automatic relevance feedback to identify a set of compounds to be used for creating an indirect similarity measure, whereas the second set derives the indirect similarities by analyzing the network formed by a k-nearest-neighbor graph representation of the query and the database compounds. Both sets of techniques operate on the descriptor-space representation of the compounds and are independent of the selected descriptor-space.
5.1. Relevance-Feedback-based Methods

5.1.1. Top-k Weighting

This approach, which is based on the Rocchio24 scheme for automatic relevance feedback, first retrieves the top-k compounds for a given query q and then uses these compounds to derive an indirect similarity between q and each of the compounds in the database. Specifically, if A is the initial set of top-k compounds, the new similarity, sim_A(q, c), between q and a compound c is given by

$\mathrm{sim}_A(q,c) = \alpha \, \mathrm{sim}(q,c) + (1-\alpha) \, \mathrm{sim}(c,A),$   (1)
where $0 \le \alpha \le 1$ is a user-specified parameter that controls the degree to which the new similarity is affected by the compounds in A. We will refer to this method as TOPKAVG. The motivation behind this approach is that for reasonably small values of k, the set A will contain a relatively large number of active compounds. Thus, by modifying the similarity between q and a compound c to also include how similar c is to the compounds in A, we obtain a similarity measure that is reinforced by A's active compounds. This enables the retrieval of active compounds that are similar to the compounds present in A even if their similarity to the query is not very high, thus enabling scaffold-hopping.
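A compact sketch of TOPKAVG follows (our own Python illustration, not the authors' code; `sim` stands for any direct similarity function, e.g., the Tanimoto similarity used in the experiments):

```python
def topk_avg(q, database, sim, k=10, alpha=0.5):
    """TOPKAVG: use the top-k compounds A as relevance feedback and
    re-rank the database by Equation 1:
    sim_A(q, c) = alpha * sim(q, c) + (1 - alpha) * sim(c, A)."""
    A = sorted(database, key=lambda c: sim(q, c), reverse=True)[:k]
    def sim_A(c):
        feedback = sum(sim(c, a) for a in A) / len(A)   # sim(c, A): average
        return alpha * sim(q, c) + (1 - alpha) * feedback
    return sorted(database, key=sim_A, reverse=True)
```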
5.1.2. Cluster Weighting

This method is similar in spirit to TOPKAVG, but employs a clustering-based approach to identify the set of compounds to use for automatic relevance feedback. We will refer to this scheme as CLUSTWT; it consists of the following four steps. First, it finds the top-k most similar compounds to a query q. Second, it clusters these compounds into l = k/m sets {S_1, ..., S_l}, each of size m (assuming that k is a multiple of m). Third, it selects among these sets the set S* that has the highest similarity to the query. Fourth, it uses Equation 1 to re-rank all the compounds in the database using S* as the relevance feedback set (i.e., A = S*). The clustering is computed using a fixed-capacity heuristic min-cut partitioning algorithm on the complete weighted graph whose nodes are the k compounds and whose edge weights are the similarities between them21, 20. Consequently, the inter-cluster compound-to-compound similarities are explicitly minimized, leading to clusters in which the intra-cluster similarities are implicitly maximized (i.e., each cluster will end up containing similar compounds). By using for relevance feedback the set S*, which contains compounds that are most similar to the query, CLUSTWT selects the cluster that will most likely have a large number of active compounds. This is similar in spirit to the method that TOPKAVG uses to select its own relevance feedback set A. However, since S* contains compounds that are also very similar to each other, the number of active compounds that it contains will tend to be higher than that contained in A (assuming that both A and S* have the same size). In fact, S* has already incorporated some form of automatic relevance feedback, since all pairwise similarities between its compounds were taken into account during the clustering process. The fact that objects that are relevant to a query tend to cluster together is well known within the document retrieval community and is usually referred to as the clustering hypothesis16.
5.1.3. Sum-based Search

The performance of TOPKAVG and CLUSTWT depends on selecting a reasonable value for the size of the set used to provide automatic relevance feedback. If that set is too small, it may not incorporate a sufficiently large number of active compounds and thus lead to limited (if any) performance improvements, whereas if the set is too large, it may degrade the performance by incorporating a relatively large number of inactive compounds. Unfortunately, our initial experiments showed that the right size of the relevance feedback set is dataset dependent. Motivated by this observation, we developed a scheme for automatic relevance feedback which, instead of using a fixed number of compounds, operates in a progressive fashion. Specifically, if A is the set of compounds that have been retrieved thus far, then the compound selected next, c_next, is the one that has the highest average similarity to the set A ∪ {q}. That is,

$c_{\mathrm{next}} = \arg\max_{c_i \in D - A} \{ \mathrm{sim}(c_i, A \cup \{q\}) \}.$   (2)
This compound is added to A, and the overall process is repeated until the desired number of compounds is retrieved or all the compounds in D have been
ranked. Thus, in this scheme, as soon as a compound is retrieved it is used to expand the set of compounds that provides relevance feedback. We will refer to this method as BESTSUMDESCSIM.

5.1.4. Max-based Search
A common characteristic of the three schemes described so far is that the final ranking of each compound is computed by taking into account all the similarities between the compound and the compounds in the relevance feedback set. Since the compounds in the relevance feedback set will tend to be structurally similar to the query compound (with CLUSTWT potentially being an exception), this approach is rather conservative in its attempt to identify active compounds that are structurally different from the query (i.e., scaffold-hops). To overcome this problem, we developed a best-first search scheme that is based on the BESTSUMDESCSIM approach but, instead of selecting the next compound based on its average similarity to A ∪ {q}, selects the compound that is most similar to one of the compounds in A ∪ {q}. That is, the next compound is given by
$c_{\mathrm{next}} = \arg\max_{c_i \in D - A} \{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \}.$   (3)
In this approach, if a compound cj other than q has the highest similarity to some compound ci in the database, ci is chosen as c_next and added to A irrespective of its similarity to q. Thus, the query-to-compound similarity is not necessarily included in every iteration as in the other schemes, allowing the identification of compounds that are structurally different from the query. We will refer to this scheme as BESTMAXDESCSIM.
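Both progressive schemes fit one small routine (our own sketch; the `budget` cap and the initialization of A with q are our reading of the description):

```python
def best_first_search(q, database, sim, use_max=False, budget=50):
    """BESTSUMDESCSIM (use_max=False) and BESTMAXDESCSIM (use_max=True):
    repeatedly select from D - A the compound with the highest average
    (Equation 2) or highest single (Equation 3) similarity to A ∪ {q},
    then add it to A.  Naive scan: O(|D| * budget * |A|)."""
    A, ranking = [q], []
    remaining = list(range(len(database)))
    while remaining and len(ranking) < budget:
        def score(i):
            sims = [sim(database[i], c) for c in A]
            return max(sims) if use_max else sum(sims) / len(sims)
        nxt = max(remaining, key=score)
        remaining.remove(nxt)
        A.append(database[nxt])
        ranking.append(nxt)
    return ranking        # database indices in retrieval order
```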
5.2. Nearest-Neighbor Graph-based Methods

These methods, motivated by the field of social (relational) network analysis, determine the similarity between a pair of compounds by taking into account any other compounds that are very similar to either or both of them. Thus, the similarity depends on the structure of the network formed by all highly similar pairs of compounds. The network linking the database compounds with each other and with the query is determined
by using a k-nearest-neighbor (NG) and a k-mutual-nearest-neighbor (MG) graph. Both of these graphs contain a node for each of the compounds as well as a node for the query. However, they differ in the set of edges that they contain. In the k-nearest-neighbor graph there is an edge between a pair of nodes corresponding to compounds ci and cj if ci is in the k-nearest-neighbor list of cj or vice versa. In the k-mutual-nearest-neighbor graph, an edge exists only when ci is in the k-nearest-neighbor list of cj and cj is in the k-nearest-neighbor list of ci. As a result of these definitions, each node in NG will be connected to at least k other nodes (assuming that each compound has a non-zero similarity to at least k other compounds), whereas in MG each node will be connected to at most k other nodes. Since the neighbors of each compound in these graphs correspond to some of its most structurally similar compounds, and due to the relation between structure and activity, each pair of adjacent compounds will tend to have similar activity. Thus, these graphs can be considered as network structures for capturing bioactivity relations. A number of different approaches have been developed for determining the similarity between nodes in social networks that take into account various topological characteristics of the underlying graphs13. In our work, we determine the similarity between a pair of nodes as a function of the intersection of their adjacency lists, which takes into account all two-edge paths connecting these nodes. Specifically, the similarity between ci and cj with respect to graph G is computed from the overlap adjG(ci) ∩ adjG(cj) (Equation 4, following Ref. 28),
where adjG(ci) and adjG(cj) are the adjacency lists of ci and cj in G, respectively. This measure assigns a high similarity value to a pair of compounds if both are very similar to a large set of common compounds. Since a pair of active compounds will be more similar to other active compounds than an active-inactive pair, their similarity according to Equation 4 will be high. Also, since Equation 4 can potentially assign a high similarity value to a pair of compounds even if their direct similarity is very low (as long as they have a large number of common neighbors), it facilitates scaffold-hopping.
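The two graphs, and an indirect similarity derived from shared neighbors, can be sketched as follows (our own illustration). Since Equation 4 itself is not reproduced here, the Tanimoto-style normalization below is our assumption; only the use of the adjacency-list intersection is taken from the text.

```python
def neighbor_graphs(objects, sim, k):
    """Adjacency sets of the k-nearest-neighbor graph NG (edge if either
    object lists the other among its k nearest neighbors) and of the
    k-mutual-nearest-neighbor graph MG (edge only if both do)."""
    n = len(objects)
    nn = [set(sorted((j for j in range(n) if j != i),
                     key=lambda j: sim(objects[i], objects[j]),
                     reverse=True)[:k])
          for i in range(n)]
    ng = [set() for _ in range(n)]
    mg = [set() for _ in range(n)]
    for i in range(n):
        for j in nn[i]:
            ng[i].add(j); ng[j].add(i)
            if i in nn[j]:
                mg[i].add(j); mg[j].add(i)
    return ng, mg

def indirect_sim(adj, i, j):
    """Similarity from common neighbors (two-edge paths); normalizing by
    the union is our assumption, not necessarily Equation 4's form."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0
```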
For each of the NG and MG graphs we developed two retrieval schemes that use Equation 4 as the similarity measure in the sum- and max-based search strategies represented in Equations 2 and 3. For example, in the case of the NG graph and the sum-based search strategy, the next compound c_next to be retrieved is given by

$c_{\mathrm{next}} = \arg\max_{c_i \in D - A} \{ \mathrm{sim}_{NG}(c_i, A \cup \{q\}) \},$   (5)
where sim_NG(ci, A ∪ {q}) is the average pairwise similarity between ci and the compounds in A ∪ {q}, computed using Equation 4 for the NG graph. The equations for the other schemes are derived in a similar fashion. We will refer to these four schemes as BESTSUMNG, BESTMAXNG, BESTSUMMG, and BESTMAXMG, respectively.

6. RELATED WORK

Many methods have been proposed for ranked-retrieval and scaffold-hopping. These can be divided into two groups. The first contains methods that rely on better-designed descriptor-space representations, whereas the second contains methods that are not specific to any descriptor-space representation but utilize different search strategies to improve the overall performance. Among the first set of methods, 2D descriptors such as path-based fingerprints4, dictionary-based keys3, and more recently extended connectivity fingerprints (ECFP)18 and graph fragments (GF)29 have all been successfully applied to the retrieval problem. Pharmacophore-based descriptors such as ErG27 have been shown to outperform simple 2D topology-based descriptors for scaffold-hopping33. Lastly, descriptors based on the 3D structure or conformations of the molecule have also been applied successfully for scaffold-hopping33, 26. The second set of methods includes the turbo search schemes (TURBOSUMFUSION and TURBOMAXFUSION)17 and the structural unit analysis based techniques32, all of which utilize relevance feedback ideas. These have been shown to be effective for both scaffold-hopping and ranked-retrieval. The turbo search techniques operate as follows. Given a query q, they start by retrieving the top-k compounds from the database. Let A be the (k + 1)-size set that contains q and the top-k compounds. For each compound c ∈ A, all the compounds in the database are ranked in decreasing order based on their similarity to c, leading to k + 1 ranked lists. These lists are used to obtain the final similarity of each compound with respect to the initial query. In particular, in TURBOMAXFUSION, the similarity between q and a compound c is equal to the similarity corresponding to the maximum ranking of c in the k + 1 lists, whereas in TURBOSUMFUSION, the similarity is equal to the sum of all the similarities in these rankings. Similar methods based on consensus scoring, rank averaging, and voting have been investigated in Ref. 33. The TURBOSUMFUSION approach is similar to TOPKAVG, described in Section 5.1.1, as it utilizes a relevance feedback mechanism to re-rank a database with respect to a query. However, TURBOSUMFUSION treats every compound in the top-k set as equally important along with the query, whereas in TOPKAVG each compound in A is given a weight of (1 - α)/|A| relative to the weight α of q.
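For comparison, the turbo search baselines can be sketched as follows (our own Python illustration of the description above; indices are returned in retrieval order):

```python
def turbo_fusion(q, database, sim, k=10, use_max=True):
    """TURBOMAXFUSION / TURBOSUMFUSION: build A = {q} ∪ top-k(q); each
    a in A ranks the database by sim(., a).  Max fusion scores each
    compound by its best (smallest) rank over the k+1 lists; sum fusion
    scores it by the sum of its similarities to all compounds in A."""
    n = len(database)
    A = [q] + sorted(database, key=lambda c: sim(q, c), reverse=True)[:k]
    if use_max:
        best = [n] * n
        for a in A:
            order = sorted(range(n), key=lambda i: sim(database[i], a),
                           reverse=True)
            for rank, i in enumerate(order):
                best[i] = min(best[i], rank)
        return sorted(range(n), key=lambda i: best[i])
    return sorted(range(n),
                  key=lambda i: sum(sim(database[i], a) for a in A),
                  reverse=True)
```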
7. MATERIALS
7.1. Datasets

We used datasets that contain compounds binding to six different biomolecular targets: COX2 (cyclooxygenase 2), CDK2 (cyclin-dependent kinase 2), FXa (coagulation factor Xa), PDE5 (phosphodiesterase 5), A1A (alpha-1A adrenoceptor), and MAO (monoamine oxidase). Each of these sets represents a different activity class. The datasets for the first five targets were obtained from Refs. 5 and 19. The entire set consists of 2142 compounds, and there are 50 active compounds for each of the targets (250 in total). The rest of the compounds are "decoys" (inactives) obtained from the National Cancer Institute diversity set. For each target, we constructed a dataset that contains its 50 active compounds and all the decoys. These datasets are termed COX2, CDK2, PDE5, FXa, and A1A. The dataset of the sixth target was derived from Refs. 11 and 29; after removing compounds with impossible Kekulé forms and valence errors, it contains 1458 compounds. The compounds in this dataset have been categorized into four classes, 0, 1, 2, and 3, based on their levels of activity, with 0 indicating no activity. For our experiments we treat all the compounds that have a non-zero activity level (268 compounds) as active.
7.2. Definition of Scaffold-Hopping Compounds
The molecular scaffold is a widely cited concept and is used to evaluate the performance of a method with respect to its scaffold-hopping ability. However, the definition of a scaffold-hop is highly subjective, with numerous papers using different criteria to define what constitutes a scaffold-hop17, 32, 33, 10. In this paper we use an objective way of defining which compounds can be considered as scaffold-hops, using an approach that directly relies on the scaffold-hopping problem definition (Section 2). In particular, for a given query q, the active compounds are ranked based on their structural similarity to q, and the lowest 50% of them are defined to be the scaffold-hops for q. Thus, this approach identifies a set of scaffold-hopping compounds that are specific to each query and represent the 50% most dissimilar active compounds to the query. We use the 2048-bit path-based fingerprints generated by ChemAxon's screen program4 for measuring the structural similarity between a query and an active compound. These fingerprints are well designed to capture the structural similarity between two compounds27.
7.3. Experimental Methodology
All the experiments were performed on dual-core AMD Opterons with 4 GB of memory. We used the descriptor-spaces GF, ECZ3, and ErG (described in Section 4) for evaluating the methods introduced in this paper. Each method is tested against six datasets (Section 7.1) using three different descriptor-spaces (Section 4), leading to a total of 18 different combinations of datasets and descriptor-spaces. We will refer to them as 18 different problems. We use the Tanimoto similarity8, 31 for all direct similarity calculations. The Tanimoto similarity function is given by

$\mathrm{sim}(c_i, c_j) = \frac{\sum_{k=1}^{n} c_{ik} c_{jk}}{\sum_{k=1}^{n} c_{ik}^2 + \sum_{k=1}^{n} c_{jk}^2 - \sum_{k=1}^{n} c_{ik} c_{jk}},$   (6)

where c_ik and c_jk are the values of the kth dimension in the n-dimensional descriptor-space representation of the compounds ci and cj, respectively. This similarity function was selected because it has been shown to be an effective way of measuring the similarity between chemical compounds30, 31 for ranked-retrieval and is the most widely used similarity function in cheminformatics.

For each dataset we used each of its active compounds as a query and evaluated the extent to which the various methods lead to effective retrieval of the other active compounds and scaffold-hops. For CLUSTWT we used hMETIS21, 20 to perform the clustering into fixed-size clusters. We varied the parameter values for the methods described in Section 5 and obtained results by averaging over four different sets of values. For TOPKAVG, which depends on the number of compounds k used in relevance feedback, we used k = 5, 10, 15, and 20. For CLUSTWT, which depends on the cluster size m and the number of compounds k on which the clustering is performed, we used m = 25 and 40 and k = 200 and 400. For CLUSTWT and TOPKAVG, which have α as a parameter, we used a value of 0.5. These parameter values were selected because they gave the best results in our experiments. For the nearest-neighbor methods, which depend on the number of neighbors, we used k = 4, 6, 8, and 10 for the BESTSUMNG and BESTMAXNG schemes, and k = 12, 16, 20, and 24 for the BESTSUMMG and BESTMAXMG schemes. These values were chosen because they gave good results. Moreover, for NG, values of k below 4 lead to graphs with many connected components, whereas for MG the corresponding value is 12; hence, we decided not to use values below these thresholds. Note that the threshold for NG is lower than that of MG because the criterion for an edge to exist between two nodes of the neighborhood graph is stricter for MG than for NG (Section 5.2). We also compared our schemes against TURBOMAXFUSION and TURBOSUMFUSION17. For both of these methods we used k = 5, 10, 15, and 20; these values gave the best results, and the results degraded as k was further increased.
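Equation 6 in code (a direct transcription, assuming real-valued descriptor vectors of equal length):

```python
import numpy as np

def tanimoto(ci, cj):
    """Tanimoto similarity (Equation 6) of two n-dimensional
    descriptor-space vectors."""
    ci, cj = np.asarray(ci, dtype=float), np.asarray(cj, dtype=float)
    dot = float(ci @ cj)
    return dot / (float(ci @ ci) + float(cj @ cj) - dot)
```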
7.4. Standard Retrieval
For each problem, we obtain a baseline performance by ranking all the compounds with respect to each active compound using the Tanimoto similarity8, 30, 31. We call this Standard Retrieval and denote it by STDRET.
7.5. Performance Assessment Measures
We measure ranked-retrieval and scaffold-hopping performance using uninterpolated precision16, calculated as follows. For each active compound that appears in the top 50 retrieved compounds, we compute a precision value. For ranked-retrieval this is defined as the ratio of the number of actives retrieved over the number of compounds retrieved thus far. For scaffold-hopping it is defined as the number of scaffold-hops retrieved over the number of compounds retrieved thus far. For both ranked-retrieval and scaffold-hopping we sum all the precision values and normalize them by dividing by 50. This is called the total uninterpolated precision for a query. Such values are obtained for all the queries of a dataset, and the total uninterpolated precision is the average of these values. Note that the total uninterpolated precision captures the number of active compounds (scaffold-hops) for each query as well as the position (rank) information of the actives (scaffold-hops). To compare the ranked-retrieval or scaffold-hopping performance of two methods, we evaluate their relative performance over all 18 problems. This is achieved as follows. Let r_i and q_i represent the ranked-retrieval or scaffold-hopping performance achieved by methods r and q on the ith problem, respectively. We calculate the log-ratio log(r_i/q_i) for every problem and take the average of these values. We call this quantity the Average Relative Performance, or ARP, of r with respect to q. If the ARP is less than zero, r performs worse than q on average, whereas if the ARP is greater than zero, r performs better than q. Note that the reason we use log-ratios as opposed to simple ratios is that the distribution of the ratio of two random variables is not symmetric, whereas their log-ratios are approximately normally distributed. This allows us to compute their average and compare them in an unbiased way. We also assess whether the ARP for a given pair of methods is statistically significant using the Student's t-test7, which is well suited to assess the statistical significance of a sample of values drawn from a normal distribution. The null hypothesis being tested is that the log-ratios are centered around a mean of zero.
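Both measures are simple to compute; the following sketch (ours, not the authors' code) makes the definitions concrete:

```python
import math

def total_uninterpolated_precision(ranking, relevant, cutoff=50):
    """For every relevant compound (active, or scaffold-hop) among the
    top `cutoff` retrieved, add the precision at its rank; finally
    normalize by dividing the sum by `cutoff`."""
    total, hits = 0.0, 0
    for rank, compound in enumerate(ranking[:cutoff], start=1):
        if compound in relevant:
            hits += 1
            total += hits / rank
    return total / cutoff

def arp(r_values, q_values):
    """Average Relative Performance of method r w.r.t. method q:
    the mean of the log-ratios log(r_i / q_i) over the problems."""
    ratios = [math.log(r / q) for r, q in zip(r_values, q_values)]
    return sum(ratios) / len(ratios)
```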
8. RESULTS

8.1. Overall Performance Assessment

Tables 1 and 2 compare the performance of all the methods in a pairwise fashion for scaffold-hopping and ranked-retrieval, respectively. In each of these tables we present two statistics. The first is the ARP of the row method (r) with respect to the column method (q), as described in Section 7.5. The second statistic, shown immediately below the ARP value in parentheses, is its p-value obtained from the Student's t-test. For the remainder of this section we will consider the ARP of two methods to be statistically significant if p ≤ 0.01. The rest of this section highlights some of the key observations that can be made by analyzing the results in these tables.

8.1.1. Performance of Relevance Feedback Methods
Comparing the performance of the four relevance-feedback-based methods described in Section 5.1 against STDRET, we see that all of them lead to better scaffold-hopping results. Among them, the results achieved by CLUSTWT and BESTSUMDESCSIM are 63% and 94% better than STDRET, respectively, and these improvements are statistically significant. However, all four of these methods achieve somewhat worse ranked-retrieval performance (3% to 15%). Moreover, these differences are statistically significant for BESTSUMDESCSIM and BESTMAXDESCSIM. Comparing the four methods against TURBOSUMFUSION and TURBOMAXFUSION, we observe that the relative performance of most of these methods varies, with some methods doing better for scaffold-hopping and others doing better for ranked-retrieval. However, with the exception of TOPKAVG, which is statistically better than the two fusion-based schemes for ranked-retrieval, the other differences are not statistically significant. Comparing the four relevance-feedback-based methods against each other, we see that most of them perform the same for both scaffold-hopping and ranked-retrieval, and whatever differences exist are not statistically significant. Despite this, the average performance of BESTSUMDESCSIM is better than BESTMAXDESCSIM, indicating that the sum-based search strategy leads to better results. The
results also show that CLUSTWT is better than TOPKAVG for scaffold-hopping and that this difference is statistically significant.
8.1.2. Performance of Nearest-Neighbor Graph-Based Methods
Comparing the performance of the nearest-neighbor methods, we observe that all of these schemes show good performance for scaffold-hopping as well as ranked-retrieval. Among them, the best-performing method is BESTSUMNG: it achieves the best balance between ranked-retrieval and scaffold-hopping performance. Furthermore, similar to the relevance-feedback-based methods, the sum-based search methods outperform the corresponding max-based methods, although these differences are not statistically significant. The results also show that the nearest-neighbor methods perform significantly better than all the other methods for scaffold-hopping, and most of these differences are statistically significant (BESTSUMDESCSIM and BESTMAXDESCSIM are the two exceptions). In particular, the performance of the nearest-neighbor methods is 62% to 300% better than STDRET and the fusion-based methods, and 46% to 244% better than the relevance-feedback-based methods. The nearest-neighbor methods also achieve better performance than all the other methods for ranked-retrieval, although most of these differences are not statistically significant. BESTSUMNG is a clear exception, as its ranked-retrieval performance is also statistically significantly better than all the other non-graph-based techniques; for example, compared to the fusion-based techniques its ranked-retrieval performance is 62% to 209% better.

8.2. Performance of Descriptor-Spaces and Datasets
Our discussion so far focused on evaluating the average performance of the different methods across the various descriptor-space representations and datasets. In this section we analyze the performance of the methods on the individual descriptor-spaces and datasets. We limit our evaluation to the CLUSTWT and BESTSUMNG methods, as these achieve the best scaffold-hopping and ranked-retrieval performance among the relevance-feedback- and graph-based methods, respectively. The results of these evaluations are shown in Figures 1 and 2, which compare the performance of STDRET against CLUSTWT and BESTSUMNG, respectively. In these figures, the left y-axis represents uninterpolated precision values for ranked-retrieval, whereas the right y-axis represents uninterpolated precision values for scaffold-hopping. For CLUSTWT and BESTSUMNG we also show error bars that correspond to the standard deviation of the results obtained for the four sets of parameter values used for these schemes. These results show that for scaffold-hopping, CLUSTWT outperforms STDRET in most dataset and descriptor-space combinations. However, the actual performance gains are dataset and descriptor-space dependent. For example, CLUSTWT achieves significant gains on the A1A and FXa datasets for the ErG and ECZ3 descriptor-spaces, whereas the gains for the other datasets and/or descriptor-spaces are not as dramatic. In terms of ranked-retrieval performance, these results show that in the case of the GF descriptor-space, CLUSTWT performs consistently better than STDRET across all datasets. However, CLUSTWT's ranked-retrieval performance for the other two descriptor-spaces is somewhat mixed. Finally, the results in Figure 2 show that for scaffold-hopping, BESTSUMNG performs consistently better than STDRET for all descriptor-space and dataset combinations. However, similarly to CLUSTWT, the actual gains are dataset and descriptor-space dependent. For example, the gains are particularly high for the FXa, A1A, and COX2 datasets and for the ErG descriptor-space. Similar trends can be observed in the ranked-retrieval results, with BESTSUMNG outperforming STDRET. Moreover, the performance gains achieved on some problems by BESTSUMNG are usually much higher than the performance degradations on others.
CONCLUSION
In this paper we introduced a number of methods based on relevance feedback and social (relational) network analysis to improve scaffold-hopping and ranked-retrieval. Our results showed that among these methods, the ones based on social network analysis consistently and substantially outperform the standard retrieval as well as previously introduced methods for these problems.
Table 1.: Performance for Scaffold-Hopping.

[Table body: pairwise ARP values, each with its p-value, among STDRET, TURBOSUMFUSION, TURBOMAXFUSION, TOPKAVG, CLUSTWT, BESTSUMDESCSIM, BESTMAXDESCSIM, BESTSUMNG, BESTMAXNG, BESTSUMMG, and BESTMAXMG.]

The top entry in each cell corresponds to the average of the log ratios of the uninterpolated precision of the row method to the column method for the 18 problems. The number below this entry, in parentheses, corresponds to the p-value obtained from the Student's t-test for that entry.
"
mStdRet (Hits) EClustWt (Hits)
0 StdRet (Scanolds) 0.50 -
PCIuStWt (Scaffolds)
,E 0.40 -
--
0 10
- 0.08
,5 H z"E
.o 0
t
P ~
m 0 06
030
'g 0
1
v
' 0
0.20
0.04
0 to
0.02
0 00
0 00
ErG
ECZ3
2
GF
Fig. 1.: STDRET versus CLUSTWT.

ACKNOWLEDGEMENTS

This work was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, the
High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.
Table 2.: Performance for Ranked-Retrieval.

[Table body: pairwise ARP values, each with its p-value, among STDRET, TURBOSUMFUSION, TURBOMAXFUSION, TOPKAVG, CLUSTWT, BESTSUMDESCSIM, BESTMAXDESCSIM, BESTSUMNG, BESTMAXNG, BESTSUMMG, and BESTMAXMG.]

The top entry in each cell corresponds to the average of the log ratios of the uninterpolated precision of the row method to the column method for the 18 problems. The number below this entry, in parentheses, corresponds to the p-value obtained from the Student's t-test for that entry.
Fig. 2.: STDRET versus BESTSUMNG.

References

1. http://www.daylight.com. Daylight Inc.
2. http://www.digitalchemistry.co.uk/. Digital Chemistry Inc.
3. http://www.mdl.com. MDL Information Systems Inc.
4. www.chemaxon.com. ChemAxon Inc.
5. www.cheminformatics.org. Cheminformatics.
6. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
7. J. M. Bland. An Introduction to Medical Statistics. 2nd edn. Oxford University Press, 1995.
8. H. J. Bohm and G. Schneider. Virtual Screening for Bioactive Molecules. Wiley-VCH, 2000.
9. Gianpaolo Bravi, Emanuela Gancia, Darren Green, V. S. Hann, and M. Mike. Modelling structure-activity relationship. Virtual Screening for Bioactive Molecules, 2000.
10. N. Brown and E. Jacoby. On scaffolds and hopping in medicinal chemistry. Mini Rev Medicinal Chemistry, 6(11):1217-1229, 2006.
11. R. Brown and Y. Martin. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Info. Model., 36(1):576-584, 1996.
12. Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE TKDE, 17(8):1036-1050, 2005.
13. F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random walk computation of similarities between nodes of a graph with application to collaborative filtering. IEEE TKDE, 19(3):355-369, 2007.
14. V. J. Gillet, P. Willet, and J. Bradshaw. Similarity searching using reduced graphs. J. Chem. Inf. Comput. Sci., 43:338-345, 2003.
15. G. Harper, G. S. Bravi, S. D. Pickett, J. Hussain, and D. V. Green. The reduced graph descriptor in virtual screening and data-driven clustering of high-throughput screening data. J. Chem. Info. Model., 44(6):2145-2156, 2004.
16. Marti Hearst and Jan Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. ACM/SIGIR, 1996.
17. J. Hert, P. Willet, and D. Wilton. New methods for ligand based virtual screening: Use of data fusion and machine learning to enhance the effectiveness of similarity searching. J. Chem. Info. Model., (46):462-470, 2006.
18. J. Hert, P. Willet, D. Wilton, P. Acklin, K. Azzaoui, E. Jacoby, and A. Schuffenhauer. Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Organic and Biomolecular Chemistry, 2:3256-3266, 2004.
19. Robert N. Jorissen and Michael K. Gibson. Virtual screening of molecular databases using support vector machines. J. Chem. Info. Model., 45(3):549-561, 2005.
20. George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. Design and Automation Conference, pages 526-529, 1997.
21. George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning. Design and Automation Conference, pages 343-348, 1999.
22. S. K. Kearsley, S. Sallamack, E. M. Fluder, J. D. Andose, R. T. Mosley, and R. P. Sheridan. Chemical similarity using physiochemical property descriptors. J. Chem. Inf. Comput. Sci., 36:118-127, 1996.
23. Andrew R. Leach. Molecular Modeling: Principles and Applications. Prentice Hall, Englewood Cliffs, NJ, 2001.
24. J. J. Rocchio. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Chapter 14, 1971.
25. D. Rogers, R. Brown, and M. Hahn. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening. J. Biomolecular Screening, 10(7):682-686, 2005.
26. Jamal C. Saeh, Paul D. Lyne, Bryan K. Takasaki, and David A. Cosgrove. Lead hopping using SVM and 3D pharmacophore fingerprints. J. Chem. Info. Model., 45:1122-1133, 2005.
27. Nikolaus Stiefl, Ian A. Watson, Knut Baumann, and Andrea Zaliani. ErG: 2D pharmacophore descriptions for scaffold hopping. J. Chem. Info. Model., 46:208-220, 2006.
28. B. Teufel and S. Schmidt. Full text retrieval based on syntactic similarities. Information Systems, 31(1), 1988.
29. Nikil Wale and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. International Conference on Data Mining (ICDM), 2006.
30. Martin Whittle, Valerie J. Gillet, and Peter Willett. Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: A comparison of similarity coefficients. J. Chem. Info. Model., 44:1840-1848, 2004.
31. Peter Willett. Chemical similarity searching. J. Chem. Info. Model., 38(6):983-996, 1998.
32. P. N. Wolohan, L. B. Akella, R. J. Dorfman, P. G. Nell, S. M. Mundt, and R. D. Clark. Structural units analysis identifies lead series and facilitates scaffold hopping in combinatorial chemistry. J. Chem. Inf. Comput. Sci., 46:1188-1193, 2005.
33. Qiang Zhang and Ingo Muegge. Scaffold hopping through virtual screening using 2D and 3D similarity descriptors: Ranking, voting and consensus scoring. J. Chem. Info. Model., 49:1536-1548, 2006.
Transcriptomics and Phylogeny
IMPROVING THE DESIGN OF GENECHIP ARRAYS BY COMBINING PLACEMENT AND EMBEDDING
Sérgio A. de Carvalho Jr.* and Sven Rahmann
Computational Methods for Emerging Technologies (COMET), Genome Informatics, Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany; DFG GK Bioinformatik and Institute for Bioinformatics, CeBiTec, Bielefeld University
Email: {Sergio.Carvalho, Sven.Rahmann}@cebitec.uni-bielefeld.de

The microarray layout problem is a generalization of the border length minimization problem and asks to distribute oligonucleotide probes on a microarray and to determine their embeddings in the deposition sequence in such a way that the overall quality of the resulting synthesized probes is maximized. Because of its inherent computational complexity, it is traditionally attacked in several phases: partitioning, placement, and re-embedding. We present the first algorithm, Greedy+, that combines placement and embedding and results in improved layouts in terms of border length and conflict index (a more realistic measure of probe quality), both on arrays of random probes and on existing Affymetrix GeneChip® arrays. We also present a large-scale study of how the layouts of GeneChip arrays have improved over time, and show how Greedy+ can further improve layout quality by as much as 8% in terms of border length and 34% in terms of conflict index.
1. INTRODUCTION

Microarrays are a ubiquitous tool in molecular biology with a wide range of whole-genome applications, including high-throughput gene expression analysis, genotyping, and resequencing. This article is about improving the design of high-density oligonucleotide microarrays, sometimes called DNA chips. This type of microarray consists of relatively short DNA probes (20-30-mers) synthesized at specific locations, called features or spots, of a solid surface; they are usually built by light-directed combinatorial chemistry, nucleotide by nucleotide. For example, Affymetrix GeneChip® arrays have up to 1.3 million spots on a fused silica substrate measuring a little over 1 cm². The spots are as narrow as 5 μm (0.005 mm) and are arranged in a regularly spaced rectangular grid. GeneChip arrays are produced with techniques derived from micro-electronics and integrated-circuit fabrication. Probes are usually 25 bases long and are synthesized on the chip, in parallel, in a series of repetitive steps. Each step appends the same kind of nucleotide to the probes of selected regions of the chip. The sequence of nucleotides added in each step is called the deposition sequence. The selection of which probes receive the nucleotide is achieved with the help of photolithographic masks3. The quartz wafer of a GeneChip
array is initially coated with a chemical compound topped with a light-sensitive protecting group that is removed when exposed to ultraviolet light, activating the compound for chemical coupling. A mask is used to direct light and remove the protecting groups of only those positions that should receive the nucleotide of a particular synthesis step. A solution containing adenine (A), thymine (T), cytosine (C) or guanine (G) is then flushed over the chip surface, but the chemical coupling occurs only in those positions that have been previously deprotected. Each coupled nucleotide also bears another protecting group so that the process can be repeated until all probes have been fully synthesized. An alternative method of in situ synthesis uses an array of miniature mirrors to direct or deflect the incidence of light on the chip. Regardless of which method is used to direct light, it is possible that some probes are accidentally activated for chemical coupling because of light diffraction, scattering or internal reflection on the chip surface. The unwanted illumination introduces unexpected nucleotides that change the probe sequences, significantly reducing their chances of successful hybridization with their targets, and increasing the risk of cross-hybridization with unintended targets. This problem can be (and has been) alleviated by
improving the production process, which, however, is expensive. Here, we are interested in computational methods that re-arrange the probes on the chip in such a way that the problem is minimized. Note that the problem of unintended illumination primarily occurs near the borders between masked and unmasked spots (in the case of maskless synthesis, between a spot that is receiving light and a spot that is not); we thus speak of a border conflict. By carefully designing the arrangement of the probes on the chip and their embeddings (the sequences of masked and unmasked steps used to synthesize each probe), it is possible to reduce the risk of unintended illumination. The problem has received some attention in the past, mostly by Hannenhalli et al.4, Kahng et al.6-8, and ourselves. In this paper, we put forward a new idea: we efficiently combine probe placement with probe embedding in a single algorithm; previously, these tasks have been done in separate phases. We also present a large-scale layout-quality study on several old and recent GeneChip arrays and propose alternative layouts with reduced conflicts. In the next section, we state the microarray layout problem formally and define two different objective functions to be minimized. Section 3 contains our study of GeneChip arrays and shows how their layouts can be improved. Section 4 explains our new Greedy+ algorithm that achieves these improvements. Since Greedy+ builds on previous work, we briefly review the relevant details in Section 4.1 before presenting Greedy+ in Section 4.2 and results on chips with random probes in Section 4.3. Section 5 contains a concluding discussion. Supplementary material is available at http://gi.cebitec.uni-bielefeld.de/comet/chiplayout/affy/.
2. THE MICROARRAY LAYOUT PROBLEM

Data. The data for the microarray layout problem (MLP) consists of
(i) a set of probes $P = \{p_1, p_2, \ldots, p_n\}$, where each $p_k \in \{A, C, G, T\}^*$ with $1 \le k \le n$ is produced by a series of $T$ synthesis steps (frequently, but not necessarily, all probes have the same length $\ell$);
(ii) a geometry of spots, or sites, $S = \{s_1, s_2, \ldots, s_m\}$, where each spot s accommodates many copies of a unique probe $p_k \in P$; each probe is synthesized at a unique spot, hence there is a one-to-one assignment between probes and spots (if we assume that there are as many spots as probes, i.e., $m = n$); some microarrays may have complex physical structures, but we assume that the spots are arranged in a rectangular grid;
(iii) the nucleotide deposition sequence $N = N_1 N_2 \cdots N_T$ corresponding to the sequence of nucleotides added at each synthesis step; it is a supersequence of all $p \in P$ and often a repeated permutation of the alphabet $\Sigma = \{A, C, G, T\}$, mainly because of its regular structure and because such sequences maximize the number of distinct subsequences.

Each synthesis step t uses a mask $M_t$ to induce the addition of a particular nucleotide $N_t \in \Sigma$ to a subset of P (Figure 1).

A probe may be embedded within N in several ways. An embedding of $p_k$ is a T-tuple $\varepsilon_k = (\varepsilon_{k,1}, \varepsilon_{k,2}, \ldots, \varepsilon_{k,T})$ in which $\varepsilon_{k,t} = 1$ if probe $p_k$ receives nucleotide $N_t$ (at step t), and 0 otherwise. In particular, a left-most embedding is an embedding in which the bases are added as early as possible (as $\varepsilon_1$ in Figure 1); a small code sketch follows the problem statement below. Finding good embeddings is part of the problem.

Problem statement. Given P, S, and N as specified above, the MLP asks to specify a chip layout $(\lambda, \varepsilon)$ that consists of
(a) a bijective assignment $\lambda : S \to \{1, \ldots, n\}$ that specifies a probe index $\lambda(s)$ for each spot s (meaning that $p_{\lambda(s)}$ will be synthesized at s), and
(b) an assignment $\varepsilon : \{1, \ldots, n\} \to \{0, 1\}^T$ specifying an embedding $\varepsilon_k = (\varepsilon_{k,1}, \ldots, \varepsilon_{k,T})$ for each probe index k, such that $N[\varepsilon_k] := (N_t)_{t : \varepsilon_{k,t} = 1} = p_k$,

such that a given penalty function is minimized. We now describe two such penalty functions: total border length and total conflict index.
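To make the notion of a left-most embedding concrete, here is a minimal sketch (ours, not part of the paper; Python is used for illustration only) that greedily computes a left-most embedding of a probe within a deposition sequence, returning the binary T-tuple, or None when the probe is not a subsequence of N:

```python
def leftmost_embedding(probe, deposition):
    """Return the left-most embedding of `probe` within `deposition`
    as a 0/1 list of length T, or None if `probe` is not a
    subsequence of `deposition` (and hence cannot be synthesized)."""
    embedding = [0] * len(deposition)
    i = 0  # next probe position to synthesize
    for t, nucleotide in enumerate(deposition):
        if i < len(probe) and probe[i] == nucleotide:
            embedding[t] = 1  # spot is unmasked at step t
            i += 1
    return embedding if i == len(probe) else None

# Example: a 3-mer probe in 2.5 cycles of ACGT (as in Figure 1).
print(leftmost_embedding("ACT", "ACGTACGTAC"))
# [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
```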
Fig. 1. Synthesis of a hypothetical 3×3 chip with photolithographic masks. Left: chip layout with 3-mer probe sequences. Center: deposition sequence with 2.5 cycles (delimited with dashed lines) and probe embeddings. Right: first six masks (masks 7 to 10 not shown).
Objective functions. The total border length $B(\lambda, \varepsilon)$ of a chip layout $(\lambda, \varepsilon)$ was first introduced by Hannenhalli et al.4, who defined the border length $B_t(\lambda, \varepsilon)$ of a mask $M_t$ as the number of borders separating masked and unmasked spots at synthesis step t. Then $B(\lambda, \varepsilon) = \sum_{t=1}^{T} B_t(\lambda, \varepsilon)$. As an example, the six masks shown in Figure 1 have $B_1 = 4$, $B_2 = 3$, $B_3 = 5$, $B_4 = 4$, $B_5 = 8$ and $B_6 = 9$. The total border length of that layout is 52 (masks $M_7$ to $M_{10}$ are not shown). Note that $B(\lambda, \varepsilon)$ can be expressed with the Hamming distance between embeddings of probes at adjacent spots: let $H_\varepsilon(k, k')$ be the number of synthesis steps in which the embeddings $\varepsilon_k$ and $\varepsilon_{k'}$ differ. Then $B(\lambda, \varepsilon) = \sum_{s, s' \text{ adjacent}} H_\varepsilon(\lambda(s), \lambda(s'))$.

Ideally, all probes should have roughly the same risk of being damaged by unintended illumination, so that all hybridization signals are affected in approximately the same way. Total border length treats every conflict in the same way, which is reasonable without further information. However, it has been suggested previously7 that stray light might activate not only adjacent neighbors but also spots that lie as far as three cells away from the targeted spot, and that imperfections produced in the middle of a probe are more harmful than in its extremities. Therefore, as in Ref. 1, we define the total conflict index of a layout as $C(\lambda, \varepsilon) := \sum_s C(s)$, where $C(s) \equiv C(s; \lambda, \varepsilon)$ is the conflict index of a spot s, defined as:
$$C(s) := \sum_{t=1}^{T} \Big( \mathbb{1}_{\{\varepsilon_{\lambda(s),t} = 0\}} \cdot \omega(\varepsilon_{\lambda(s)}, t) \cdot \sum_{s' :\, \text{neighbor of } s} \mathbb{1}_{\{\varepsilon_{\lambda(s'),t} = 1\}} \cdot \gamma(s, s') \Big) \qquad (1)$$
The indicator functions ensure that there is a conflict at s during step t if and only if s is masked ($\varepsilon_{\lambda(s),t} = 0$) and a neighbor s' is unmasked ($\varepsilon_{\lambda(s'),t} = 1$). The function $\gamma(s, s')$ is a "closeness" measure between s and s', defined as $\gamma(s, s') := (d(s, s'))^{-2}$, where $d(s, s')$ is the Euclidean distance between the spots s and s'. Note that, in (1), s' ranges over all neighboring spots that are at most three cells away from s. The position-dependent weighting function $\omega(\varepsilon, t)$ accounts for the significance of the location inside the probe sequence where the undesired nucleotide is introduced in case of accidental illumination. It increases exponentially with the distance $\delta(\varepsilon, t)$ of the synthesized nucleotide from the probe's closer end, as motivated by thermodynamic considerations: $\omega(\varepsilon, t) := c \cdot \exp(\theta \cdot \delta(\varepsilon, t))$, where $c > 0$ and $\theta > 0$ are constants. The parameter $\theta$ controls how steeply the exponential weighting function rises towards the middle of the probe. In our experiments, we use probes of length $\ell = 25$, and parameters $\theta = 5/\ell$ and $c = 1/\exp(\theta)$.
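As an illustration of equation (1) and the parameter choices above, here is a minimal sketch (ours; all identifiers are ours, and the reading of $\delta(\varepsilon, t)$ as one plus the distance to the closer probe end, based on the number of bases synthesized before step t, is our assumption matching the cost $M_{i,t}$ in Section 4) that evaluates the conflict index of a single spot:

```python
import math

ELL = 25                     # probe length used in the experiments
THETA = 5.0 / ELL            # theta = 5 / ell
C = 1.0 / math.exp(THETA)    # c = 1 / exp(theta)

def gamma(d):
    """Closeness between two spots at Euclidean distance d: d^-2."""
    return d ** -2.0

def omega(embedding, t, ell=ELL):
    """Position-dependent weight at a masked step t; delta is read as
    1 + min(b, ell - b), where b bases are synthesized before step t
    (our assumption, mirroring M_{i,t} in Section 4)."""
    b = sum(embedding[:t])
    return C * math.exp(THETA * (1 + min(b, ell - b)))

def conflict_index(spot, embeddings, positions):
    """Evaluate eq. (1) for `spot`. `embeddings` maps spot id -> 0/1
    tuple of length T; `positions` maps spot id -> (row, col).
    Neighbors are spots at most three cells away in each direction."""
    T = len(embeddings[spot])
    r0, c0 = positions[spot]
    total = 0.0
    for s2, (r, c) in positions.items():
        if s2 == spot or max(abs(r - r0), abs(c - c0)) > 3:
            continue
        closeness = gamma(math.hypot(r - r0, c - c0))
        for t in range(T):
            # conflict iff `spot` is masked and the neighbor is unmasked
            if embeddings[spot][t] == 0 and embeddings[s2][t] == 1:
                total += omega(embeddings[spot], t) * closeness
    return total
```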
Problem variants and per-chip measures. We consider two variants of the MLP:
BLM Border Length Minimization (BLM) means that the objective is to minimize $B(\lambda, \varepsilon)$.
CIM Conflict Index Minimization (CIM) means that the objective is to minimize $C(\lambda, \varepsilon)$, which depends on the weighting functions $\gamma$ and $\omega$ and their parameters, which we choose as described above.

In either case, we can measure both $B(\lambda, \varepsilon)$ and $C(\lambda, \varepsilon)$. Naturally, after BLM, $B(\lambda, \varepsilon)$ will be low,
whereas $C(\lambda, \varepsilon)$ may be relatively large; the converse holds after CIM. In order to better compare chips of different size, we introduce normalized versions of these quantities.

NBL If the chip is a rectangular grid with $n_r$ rows and $n_c$ columns, the number of internal borders is $n_b = n_r(n_c - 1) + n_c(n_r - 1) \approx 2 n_r n_c = 2|S|$, and we call $B(\lambda, \varepsilon)/n_b$ the normalized border length (NBL). We may also refer to the NBL of a particular mask $M_t$ as $B_t/n_b$.
ABC Real arrays have a significant number of empty spots (as much as 11.94% on the Affymetrix Chicken Genome array). To better compare chips with different amounts of empty spots, we use the average number of border conflicts per probe (ABC), defined as $B(\lambda, \varepsilon)/|P|$. We roughly have ABC $= 2 \cdot$ NBL if $|S| \approx |P|$. The ABC of a particular mask $M_t$ is $B_t/|P|$.
ACI We define the average conflict index (ACI) of a layout as $C(\lambda, \varepsilon)/|P|$.
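These quantities are easy to compute from a layout via the Hamming-distance formulation of $B(\lambda, \varepsilon)$ given above. A minimal sketch (ours, not from the paper), with the grid stored as a 2D list of embeddings and None marking empty spots:

```python
def hamming(e1, e2):
    """Number of synthesis steps in which two embeddings differ."""
    return sum(a != b for a, b in zip(e1, e2))

def border_length(grid):
    """B(lambda, epsilon): sum of Hamming distances between embeddings
    of horizontally/vertically adjacent non-empty spots (None = empty)."""
    nr, nc = len(grid), len(grid[0])
    total = 0
    for r in range(nr):
        for c in range(nc):
            if grid[r][c] is None:
                continue
            for rr, cc in ((r, c + 1), (r + 1, c)):  # each border once
                if rr < nr and cc < nc and grid[rr][cc] is not None:
                    total += hamming(grid[r][c], grid[rr][cc])
    return total

def nbl_abc(grid, num_probes):
    """Normalized border length and average border conflicts per probe."""
    nr, nc = len(grid), len(grid[0])
    n_borders = nr * (nc - 1) + nc * (nr - 1)
    b = border_length(grid)
    return b / n_borders, b / num_probes
```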
3. ANALYSIS OF GENECHIP ARRAYS

We obtained the specifications of several GeneChip arrays, containing the list of probe sequences and their positions on the chip, from Affymetrix's web site (a). We make a few assumptions because some details, such as the deposition sequence used to synthesize the probes, the probe embeddings, and the contents of "special" spots, are not publicly available (some of the special spots contain quality control probes used to detect failures during the production of the chip). Not knowing the contents of these special spots barely interferes with our analysis because, in all arrays we examined, they amount to at most 1.22% of the total number of spots. It has been reported that a fixed 74-step deposition sequence is used by Affymetrix7. All GeneChip arrays we analyzed, regardless of their size, can be synthesized in $N = (\mathrm{TGCA})^{18}\mathrm{TG}$, i.e., 18.5 cycles of TGCA, and a shorter deposition sequence is indeed unlikely. This suggests that only subsequences of this particular deposition sequence can be used as probes on Affymetrix chips. In principle, this should not be a problem, as this sequence covers about 98.45% of all 25-mers9. Probes of GeneChip arrays appear in pairs: the
perfect match (PM) probe, which perfectly matches its target sequence, and the mismatch (MM) probe, which is used to quantify cross-hybridizations and unpredictable background signal variations. The MM probe is a copy of the PM probe except for the middle base (position 13 of the 25-mer), which is exchanged with its Watson-Crick complement. The layout of a GeneChip alternates rows of PM probes with rows of MM probes in such a way that the probes of a pair are always adjacent on the chip. Moreover, PM and MM probes are pair-wise left-most embedded. Informally, a pair-wise left-most embedding is obtained from left-most embeddings by shifting the second half of one embedding to the right until the two embeddings are "aligned" in the synthesis steps that follow the mismatched middle bases. This approach reduces border conflicts between the probes of a pair, but it leaves a conflict in the steps that add the middle bases. The fact that probes must appear in pairs restricts even more which sequences can be used as probes on GeneChip arrays, because both PM and MM probes must "fit" in the deposition sequence.
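Both constraints are mechanical to check. The sketch below (ours; the example probe is arbitrary) derives an MM probe from a PM probe and tests whether a sequence "fits", i.e., is a subsequence of the 74-step deposition sequence $(\mathrm{TGCA})^{18}\mathrm{TG}$:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
DEPOSITION = "TGCA" * 18 + "TG"   # 74-step sequence, 18.5 cycles of TGCA

def mismatch_probe(pm):
    """MM probe: copy of the 25-mer PM probe with the middle base
    (position 13) exchanged with its Watson-Crick complement."""
    assert len(pm) == 25
    mid = len(pm) // 2            # 0-based index 12 = position 13
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def fits(probe, deposition=DEPOSITION):
    """True iff `probe` is a subsequence of `deposition`."""
    steps = iter(deposition)
    return all(base in steps for base in probe)

pm = "TGCA" * 6 + "T"             # an arbitrary 25-mer example
mm = mismatch_probe(pm)
print(fits(pm), fits(mm))         # both True for this example
```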
Results. Figure 2 shows the ABC for each masking step of three GeneChip arrays (Yeast, Human and E. coli). We assume that the probes are pair-wise left-most embedded in $N = (\mathrm{TGCA})^{18}\mathrm{TG}$, and we consider all spots whose contents are not available as empty spots. In all chips we analyzed, the ABC is higher in the steps that add the middle bases, a result of placing PM and MM probes in adjacent spots. The Yeast Genome S98 array has the worst layout in terms of border conflicts, and most of the earlier GeneChip arrays, such as the E. coli Antisense Genome, have similar levels of conflicts. The layout of the Human Genome U95A2 array has significantly fewer border conflicts than the Yeast array, suggesting that it was designed with a better placement strategy. The curve of the E. coli Genome 2.0 array, with very low levels of conflicts in the first 10 masks, is typical of the latest generation of GeneChip arrays, including the Chicken Genome and the Wheat Genome (one of the largest GeneChip arrays currently available, with 1164 × 1164 spots), which suggests yet another placement strategy. Table 1 shows summary statistics on several
(a) http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays
Fig. 2. Average number of border conflicts per probe (scale on the left y-axis) of selected GeneChip arrays: Yeast Genome S98, Human Genome U95A2, and E. coli Genome 2.0. The histogram shows the number of middle bases added per synthesis step on the E. coli 2.0 chip (scale on the right y-axis).
commercially available arrays. The layout of the Human Genome U95A2 array is one of the best in terms of NBL and the best in terms of ACI. This, however, has more to do with empty spots than with the placement strategy, as this chip has about 1.83% empty spots that are evenly distributed on the chip surface. In contrast, the Chicken Genome array has an exceptionally high percentage of empty spots (11.94%) that contribute to its low NBL, but not equally to a low ABC in comparison with the Human Genome array, because the empty spots are concentrated in the lower part of the chip (figures illustrating the distribution of empty spots on these chips are available on the supplementary web page). GeneChip arrays exhibit relatively low levels of NBL and ABC when compared to layouts produced by the best algorithms for arrays of random probes of similar dimensions (see next section). This can be explained by the fact that each probe has a nearly identical copy next to it. However, they have relatively high ACIs because the conflicts are concentrated on the synthesis steps of the middle bases, which are expensive in the conflict index model.

Design improvements. We used our new algorithm Greedy+ with different parameters Q, and the Sequential8 re-embedding algorithm (see Section 4 for explanations; in general, larger Q gives better layouts, but also increases the running time), to create alternative layouts for two of the latest generation of GeneChip arrays: E. coli Genome 2.0 and Wheat Genome. Greedy+ was modified to avoid placing probes on special spots or empty spots that we believe might have a function on the chip. For each chip we separately ran both BLM and CIM versions of the algorithms. The main difference between our layouts and the original ones is that we do not require the arrays to alternate rows of PM and MM probes; hence, probes of a pair are not necessarily placed on adjacent spots. This is especially helpful for CIM since it avoids conflicts in the middle bases. With BLM, we observe that Greedy+ places between 90.7% and 95.2% of the PM probes adjacent to their corresponding MM probes. With CIM, this rate drops to between 12.9% and 21.3%. Figure 3 shows the NBL for each masking step of the layout produced by Greedy+ and Sequential for the E. coli Genome 2.0 array in comparison with the original Affymetrix layout. It can be clearly seen that the CIM variant of our algorithm greatly reduces the number of border conflicts in the middle synthesis steps, where conflicts are expensive. In the BLM variant, the conflicts are distributed more evenly across all synthesis steps. To compare the new layout algorithm with re-embedding only, we also show the result of running a pair-wise version of Sequential on the original layout (this version ensures that the embeddings of PM-MM pairs remain pair-wise "aligned"). The total NBL and ACI values of these layouts are also shown in Table 2, together with several layouts for the Wheat Genome array.
Table 1. Average number of border conflicts per probe (ABC), normalized border length (NBL) and average conflict index (ACI) of selected GeneChip arrays. The dimension of the chip, the percentage of spots with unknown content and the percentage of empty spots are also shown.

GeneChip Array            Dimension    Unknown  Empty   ABC      NBL      ACI
Yeast Genome S98          534 × 534    1.22%    1.70%   44.8168  21.7945  669.0663
E. coli Antisense Genome  544 × 544    1.17%    3.12%   43.3345  20.7772  663.7353
Human Genome U95A2        640 × 640    0.96%    1.83%   28.2489  13.7517  510.3418
E. coli Genome 2.0        478 × 478    1.08%    0.46%   29.2038  14.4079  550.2014
Chicken Genome            984 × 984    0.46%    11.94%  28.2087  12.3680  540.5022
Wheat Genome              1164 × 1164  0.38%    0.08%   27.6569  13.7771  539.9632

Greedy+ with Q = 10K produces a layout with 8.10% fewer border conflicts than the original layout for the E. coli array (13.2406 versus 14.4079) in 218.3 minutes. With Q = 2K, the improvement is almost as good (7.15%), but requires only 46.9 minutes. For the larger Wheat array, Greedy+ with Q = 2K generates a layout with 7.36% fewer border conflicts than the original layout (12.7622 versus 13.7771). In terms of CIM, our results show that Greedy+ can improve the quality of GeneChip arrays by as much as 34.31% (from 550.2014 to 361.4418 for the E. coli array).

4. ALGORITHMS

Traditionally, the MLP has been attacked heuristically in two phases, as exact solutions are computationally infeasible. First, an initial embedding of the probes is fixed and an arrangement of these embeddings on the chip with minimum conflicts is sought; this is usually referred to as the placement phase. Placement algorithms typically assume that an initial embedding of the probes is given (which can be a left-most or otherwise pre-computed embedding), and do not change the given embeddings. Second, a post-placement optimization phase re-embeds the probes considering their location on the chip, in such a way that the conflicts with neighboring spots are further reduced. For superlinear placement algorithms, the chip is often partitioned into smaller sub-regions before the placement phase in order to reduce running times, especially on larger chips. We briefly review the best known placement and re-embedding principles and then present a new algorithm, Greedy+, the first one to combine placement and embedding into a single phase. In addition to the results presented in the previous section, we show in Section 4.3 that Greedy+ compares favorably to the best known placement strategy (Row-Epitaxial). Partitioning algorithms such as Centroid-based Quadrisection and Pivot Partitioning are not discussed.

4.1. Review of Existing Placement and Re-Embedding Strategies

Placement. The following elements of placement strategies have proven successful in practice for large-scale chips.
Initial ordering. The probe sequences (or their binary embeddings) are initially ordered, either lexicographically7, which is easy, or to minimize the sum of distances of consecutive probes, which leads to an instance of the NP-hard traveling salesman problem (TSP) that is then solved heuristically4.

k-threading. The sequence of ordered probes is threaded onto the chip. This can happen row-by-row, where the first row is filled left-to-right, the second one right-to-left, and so on. This leads to an arrangement where consecutive probes in the same row have few border conflicts, but probes in the same column may have a significant number of conflicts. An alternative is provided by k-threading4, in which the right-to-left and left-to-right steps are interspaced with alternating upward and downward movements over k sites. Row-by-row threading can be seen as k-threading with k = 0.

Iterative refinement. The Row-Epitaxial7 algorithm refines an existing layout as follows (a code sketch follows at the end of this review): spots are re-considered in a pre-defined order, from top to bottom, left to right. For each spot s, a user-defined number Q of probe candidates below and to the right of s is considered for an exchange with the probe p at s. Probe p is then swapped with the probe that generates the minimum number of border conflicts between s and its left and top neighbors. In the experiments conducted by Kahng et al.7, Row-Epitaxial was the best large-scale placement algorithm for the BLM problem. We have adapted Row-Epitaxial to CIM by choosing the probe candidate that minimizes the sum of conflict indices in a region around s, restricted to those neighboring spots that have already been refilled.

Table 2. Normalized border length (NBL) and average conflict index (ACI) of layouts for the E. coli 2.0 and Wheat GeneChip arrays. Greedy+ used k-threading with k = 5 for BLM and k = 0 for CIM. Running times in minutes include placement and two passes of re-embedding with Sequential.

Array        Layout                                           NBL      ACI       Time
E. coli 2.0  Affymetrix with pair-wise left-most              14.4079  550.2014  -
             Affymetrix after "pair-aware" Sequential (BLM)   13.5005  541.0954  -
             Greedy+ with Q = 2K and Sequential (BLM)         13.3774  529.8129  46.9
             Greedy+ with Q = 10K and Sequential (BLM)        13.2406  515.5917  218.3
             Greedy+ with Q = 2K and Sequential (CIM)         17.6935  394.9905  54.9
             Greedy+ with Q = 10K and Sequential (CIM)        17.5575  361.4418  225.7
Wheat        Affymetrix with pair-wise left-most              13.7771  539.9632  -
             Affymetrix after "pair-aware" Sequential (BLM)   12.9151  531.2692  -
             Greedy+ with Q = 2K and Sequential (BLM)         12.7622  519.0869  279.2
             Greedy+ with Q = 5K and Sequential (BLM)         12.6670  511.7193  676.0
             Greedy+ with Q = 2K and Sequential (CIM)         17.1047  387.8430  322.7
             Greedy+ with Q = 5K and Sequential (CIM)         17.1144  366.6045  704.7

Re-embedding. Most current re-embedding strategies are based on the Optimum Single Probe Embedding algorithm (OSPE; see below), first introduced by Kahng et al.6, and differ mainly in the order in which the spots are considered. Some of the proposed strategies are Chessboard, Greedy and Batched Greedy, and Sequential. The Sequential strategy proceeds spot by spot, from top to bottom, left to right, re-embedding each probe optimally with regard to its neighbors using OSPE. Once the end of the array is reached, it is restarted at the top left corner of the array for the next iteration, until a locally optimal solution is found, or until improvements drop below a given threshold, or until a given number of passes have been executed. Sequential is not only the simplest but also the fastest and most effective known strategy. Therefore, we skip the discussion of other strategies.
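To make the iterative-refinement element above concrete, here is a compact sketch (ours; the published implementation certainly differs in details such as candidate bookkeeping) of the Row-Epitaxial loop for BLM, where each spot is offered the best of the next Q probes, judged against its already-final top and left neighbors:

```python
def hamming(e1, e2):
    return sum(a != b for a, b in zip(e1, e2))

def row_epitaxial(grid, Q):
    """Refine a layout in place (BLM). `grid` is a 2D list of probe
    embeddings (row-major). For each spot, consider up to Q candidate
    probes at spots below/right of it (i.e., later in row-major order)
    and swap in the one minimizing border conflicts with the already
    finalized top and left neighbors."""
    nr, nc = len(grid), len(grid[0])
    spots = [(r, c) for r in range(nr) for c in range(nc)]
    for idx, (r, c) in enumerate(spots):
        fixed = []
        if r > 0: fixed.append(grid[r - 1][c])   # top neighbor, final
        if c > 0: fixed.append(grid[r][c - 1])   # left neighbor, final
        best_j = idx
        best_cost = sum(hamming(grid[r][c], f) for f in fixed)
        for j in range(idx + 1, min(idx + 1 + Q, len(spots))):
            rj, cj = spots[j]
            cost = sum(hamming(grid[rj][cj], f) for f in fixed)
            if cost < best_cost:
                best_j, best_cost = j, cost
        rb, cb = spots[best_j]
        grid[r][c], grid[rb][cb] = grid[rb][cb], grid[r][c]
```

Because candidates are drawn only from spots not yet finalized, each probe is placed exactly once, preserving the one-to-one assignment between probes and spots.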
OSPE is a dynamic programming algorithm (a variant of global sequence alignment) that computes an optimum embedding of a single probe p (of length $\ell$) at a given spot s into the deposition sequence N (of length T) with respect to p's neighbors, whose embeddings are considered as fixed. The algorithm was originally developed for BLM, but a more general form designed for conflict index minimization (CIM) was given by de Carvalho Jr. and Rahmann. OSPE fills an $(\ell + 1) \times (T + 1)$ dynamic programming matrix D, where $D[i, t]$ is defined as the minimum cost of an embedding of $p_{1..i}$ into $N_{1..t}$ for $0 \le i \le \ell$, $0 \le t \le T$. The cost is the sum of conflicts induced by the embedding of $p_{1..i}$ on its neighbors (when s is unmasked and a neighbor is masked), plus the conflicts suffered by $p_{1..i}$ because of the embeddings of its neighbors (when s is masked and a neighbor is unmasked). The basic recurrence is
$$D[i, t] = \begin{cases} \min\{\, D[i, t-1] + M_{i,t},\; D[i-1, t-1] + U_t \,\} & \text{if } p_i = N_t, \\ D[i, t-1] + M_{i,t} & \text{otherwise.} \end{cases}$$
Fig. 3. NBL for each masking step of the original Affymetrix layout for the E. coli 2.0 GeneChip compared with alternative layouts produced by Greedy+ (with Q = 10K) and Sequential. The layout resulting from running pair-wise Sequential on the original layout is also shown. (Curves: Affymetrix layout with pair-wise left-most embeddings; Affymetrix layout after pair-wise Sequential; Greedy+ and Sequential (BLM); Greedy+ and Sequential (CIM).)

In accordance with the conflict index model, the additional costs $U_t$ (incurred at masked neighbors when s is unmasked, only possible if $p_i = N_t$) and $M_{i,t}$ (incurred at masked s because of unmasked neighbors) are
$$U_t := \sum_{s' :\, \text{neighbor of } s} \mathbb{1}_{\{\varepsilon_{\lambda(s'),t} = 0\}} \cdot \omega(\varepsilon_{\lambda(s')}, t) \cdot \gamma(s', s),$$

$$M_{i,t} := c \cdot \exp\big(\theta \cdot (1 + \min\{i, \ell - i\})\big) \cdot \sum_{s' :\, \text{neighbor of } s} \mathbb{1}_{\{\varepsilon_{\lambda(s'),t} = 1\}} \cdot \gamma(s, s').$$

The initialization is given by $D[0, 0] = 0$, $D[i, 0] = \infty$ for $0 < i \le \ell$, and $D[0, t] = D[0, t-1] + M_{0,t}$ for $0 < t \le T$.
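For the border-length model the costs simplify: $M_{i,t}$ reduces to the number of unmasked neighbors at step t and $U_t$ to the number of masked ones, independent of i. Below is a runnable sketch of OSPE under this simplification (ours, not the authors' code; the CIM version would substitute the weighted costs above):

```python
def ospe_blm(probe, deposition, neighbor_embeddings):
    """Optimum Single Probe Embedding, border-length model.
    D[i][t] = minimum border conflicts of embedding probe[:i] into
    deposition[:t], given fixed 0/1 neighbor embeddings. Returns the
    optimal cost; the embedding itself follows by a standard traceback."""
    L, T = len(probe), len(deposition)
    INF = float("inf")
    n_nb = len(neighbor_embeddings)
    unmasked = [sum(e[t] for e in neighbor_embeddings) for t in range(T)]
    M = unmasked                          # cost of staying masked at step t
    U = [n_nb - u for u in unmasked]      # cost of being unmasked at step t
    D = [[INF] * (T + 1) for _ in range(L + 1)]
    D[0][0] = 0
    for t in range(1, T + 1):             # initialization: empty prefix
        D[0][t] = D[0][t - 1] + M[t - 1]
    for i in range(1, L + 1):
        for t in range(1, T + 1):
            best = D[i][t - 1] + M[t - 1]           # leave spot masked
            if probe[i - 1] == deposition[t - 1]:   # may add base i here
                best = min(best, D[i - 1][t - 1] + U[t - 1])
            D[i][t] = best
    return D[L][T]
```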