Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray & Teri E. Klein
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2007
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2007 Maui, Hawaii 3-7 January 2007
Edited by Russ B. Altman Stanford University, USA
A. Keith Dunker Indiana University, USA
Lawrence Hunter University of Colorado Health Sciences Center, USA
Tiffany Murray Stanford University, USA
Teri E. Klein Stanford University, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
BIOCOMPUTING 2007 Proceedings of the Pacific Symposium Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-270-417-5
Printed in Singapore by Mainland Press
PACIFIC SYMPOSIUM ON BIOCOMPUTING 2007

Biomedical computing has become a key component of the biomedical research infrastructure. In 2004 and 2005, the U.S. National Institutes of Health established seven National Centers for Biomedical Computation, focusing on a wide range of application areas and enabling technologies, including simulation, systems biology, clinical genomics, imaging, ontologies and others (see http://www.bisti.nih.gov/ncbc/). The goal of these centers is to help seed an information infrastructure to support biomedical research. The Pacific Symposium on Biocomputing (PSB) presented critical early sessions in most of the areas covered by these National Centers, and we are proud to continue the tradition of helping to define new areas of focus within biomedical computation.

Once again, we are fortunate to host two outstanding keynote speakers. Dr. Elizabeth Blackburn, Professor of Biology and Physiology in the Department of Biochemistry and Biophysics at the University of California, San Francisco, will speak on "Interactions among telomeres, telomerase, and signaling pathways." Her work has advanced our understanding of the overall organization and control of chromosomal dynamics. Our keynote speaker in the area of Ethical, Legal and Social implications of technology will be Marc Rotenberg, Executive Director of the Electronic Privacy Information Center (EPIC) in Washington, D.C. He will speak on "Data mining and privacy: the role of public policy." Many biomedical computation professionals have grappled, and continue to grapple, with privacy issues as interest in mining human genotype-phenotype data collections has increased.

PSB has a history of providing early sessions focusing on hot new areas in biomedical computation. These sessions are often conceived during the previous PSB meeting, as trends and new results are pondered and discussed.
Very often, new sessions are led by new faculty members trying to define a scientific niche and bring together leaders in the emerging areas. We are proud that many areas in biocomputing received their first significant focused attention at PSB. If you have an idea for a new session, we, the organizers, are available to talk with you, either at the meeting or later by e-mail.

Again, the diligence and efforts of a dedicated group of researchers have led to an outstanding set of sessions, with associated introductory tutorials. These organizers provide the scientific core of PSB, and their sessions are as follows:
Indra Neil Sarkar
Biodiversity Informatics: Managing Knowledge Beyond Humans and Model Organisms

Bobbie-Jo Webb-Robertson & Bill Cannon
Computational Proteomics: High-throughput Analysis for Systems Biology

Martha Bulyk, Ernest Fraenkel, Alexander Hartemink, & Gary Stormo
DNA-Protein Interactions and Gene Regulation: Integrating Structure, Sequence and Function

Russ Greiner & David Wishart
Computational Approaches to Metabolomics

Pierre Zweigenbaum, Dina Demner-Fushman, Kevin Bretonnel Cohen, & Hong Yu
New Frontiers in Biomedical Text Mining

Maricel Kann, Yanay Ofran, Marco Punta, & Predrag Radivojac
Protein Interactions in Disease

In addition to the sessions and survey tutorials, this year's program includes two in-depth tutorials. The presenters and titles of these tutorials are:

Giselle M. Knudsen, Reza A. Ghiladi, & D. Rey Banatao
Integration Between Experimental and Computational Biology for Studying Protein Function

Michael A. Province & Ingrid B. Borecki
Searching for the Mountains of the Moon: Genome Wide Association Studies of Complex Traits
We thank the Department of Energy and the National Institutes of Health for their continuing support of this meeting. Their support provides travel grants to many of the participants. Applied Biosystems and the International Society for Computational Biology continue to sponsor PSB, and as a result, we are able to provide additional travel grants to many meeting participants.
We would like to acknowledge the many busy researchers who reviewed the submitted manuscripts on a very tight schedule. The partial list following this preface does not include many who wished to remain anonymous, and of course we apologize to any who may have been left out by mistake. Aloha! Russ B. Altman Departments of Genetics & Bioengineering, Stanford University A. Keith Dunker Department of Biochemistry and Molecular Biology, Indiana University School of Medicine Lawrence Hunter Department of Pharmacology, University of Colorado Health Sciences Center Teri E. Klein Department of Genetics, Stanford University
Pacific Symposium on Biocomputing Co-Chairs September 28, 2006
Thanks to the Reviewers

Finally, we wish to thank the scores of reviewers. PSB requires that every paper in this volume be reviewed by at least three independent referees. Since there is a large volume of submitted papers, paper reviews require a great deal of work from many people. We are grateful to all of you listed below and to anyone whose name we may have accidentally omitted or who wished to remain anonymous.

Joshua Adkins Eugene Agichtein Gelio Alves Sophia Ananiadou Alan Aronson Ken Baclawski Joel Bader Breck Baldwin Ziv Bar-Joseph Serafim Batzoglou Asa Ben-Hur Sabine Bergler Olivier Bodenreider Alvis Brazma Kevin Bretonnel Yana Bromberg Harmen Bussemaker Andrea Califano Bob Carpenter Michele Cascella Saikat Chakrabarti Shih-Fu Chang Pierre Chaurand Ting Chen Hsinchun Chen Nawei Chen Praveen Cherukuri Wei Chu James Cimino Aaron Cohen Nigel Collier Matteo Dal Peraro
Vlado Dancik Rina Das Tijl De Bie Dina Demner-Fushman Rob DeSalle Luis DeSilva Diego Di Bernardo Chuong Do Michel Dumontier Mary G. Egan Roman Eisner Emilio Esposito Mark Fasnacht Oliver Fiehn Alessandro Flammini Fabian Fontaine Lynne Fox Ari Frank Kristofer Franzen Tema Fridman Carol Friedman Robert Futrelle Feng Gao Adam Godzik Roy Goodacre Michael Grusak Melissa A. Haendel Henk Harkema Marti Hearst P. Bryan Heidorn Bill Hersh
Lynette Hirschman Terence Hwa Sven Hyberts Lilia Iakoucheva Navdeep Jaitly Helen Jenkins Kent Johnson Andrew Joyce James Kadin Martin R. Kalfatovic Manpreet S. Katari Sun Kim Oliver King Tanja Kortemme Harri Lahdesmaki Ney Lemke Gondy Leroy Christina Leslie Li Liao John C. Lindon Chunmei Liu Yves Lussier Hongwu Ma Kenzie MacIsaac Tom Madej Ana Maguitman Askenazi Manor Costas Maranas Leonardo Marino John Markley Pedro Mendes Ivana Mihalek
Leonid Mirny Joyce Mitchell Matthew Monroe Sean Mooney Rafael Najmanovich Preslav Nakov Leelavati Narlikar Adeline Nazarenko Jack Newton William Noble Christopher Oehmen Christopher Oldfield Zoltan Oltvai Matej Oresic Bernhard Palsson Chrysanthi Paranavitana Matteo Pellegrini Aloysius Phillips Paul J. Planet Christian Posse Natasa Przulj Teresa Przytycka Bin Qian Weijun Qian Arun Ramani Kathryn Rankin Andreas Rechtsteiner Haluk Resat Tom Rindflesch Martin Ringwald Elizabeth Rogers Pedro Romero Graciela Rosemblat Andrea Rossi Erik Rytting Jasmin Saric Indra Neil Sarkar Yutaka Sasaki Tetsuya Sato
Santiago Schnell Rob Schumaker Robert D. Sedgewick Eran Segal Kia Sepassi Anuj Shah Paul Shapshak Hagit Shatkay Mark Siddall Mona Singh Mudita Singhal Saurabh Sinha Thereza Amelia Soares Bruno Sobral Ray Sommorjai Orkun Soyer Irina Spasic Padmini Srinivasan Paul Stothard Eric Strittmatter Shamil Sunyaev Silpa Suthram Lorrie Tanabe Haixu Tang Igor Tetko Jun'ichi Tsujii Peter Uetz Vladimir Uversky Vladimir Vacic Alfonso Valencia Karin Verspoor Mark Viant K. Vijay-Shanker Hans Vogel Slobodan Vucetic Alessandro Vullo Wyeth Wasserman Bonnie Webber Aalim Weljie
John Wilbur Kazimierz O. Wrzeszczynski Dong Xu Yoshihiro Yamanishi Yuzhen Ye Hong Yu Peng Yue Pierre Zweigenbaum
CONTENTS

Preface
v
PROTEIN INTERACTIONS AND DISEASE Session Introduction Maricel Kann, Yanay Ofran, Marco Punta, and Predrag Radivojac
1
Graph Kernels for Disease Outcome Prediction from Protein-Protein Interaction Networks Karsten M. Borgwardt, Hans-Peter Kriegel, S.V.N. Vishwanathan, and Nicol N. Schraudolph
4
Chalkboard: Ontology-Based Pathway Modeling and Qualitative Inference of Disease Mechanisms Daniel L. Cook, Jesse C. Wiley, and John H. Gennari
16
Mining Gene-Disease Relationships from Biomedical Literature Weighting Protein-Protein Interactions and Connectivity Measures Graciela Gonzalez, Juan C. Uribe, Luis Tari, Colleen Brophy, and Chitta Baral
28
Predicting Structure and Dynamics of Loosely-Ordered Protein Complexes: Influenza Hemagglutinin Fusion Peptide Peter M. Kasson and Vijay S. Pande
40
Protein Interactions and Disease Phenotypes in the ABC Transporter Superfamily Libusha Kelly, Rachel Karchin, and Andrej Sali
51
LTHREADER: Prediction of Ligand-Receptor Interactions Using Localized Threading Vinay Pulim, Jadwiga Bienkowska, and Bonnie Berger
64
Discovery of Protein Interaction Networks Shared by Diseases Lee Sam, Yang Liu, Jianrong Li, Carol Friedman, and Yves A. Lussier
76
An Iterative Algorithm for Metabolic Network-Based Drug Target Identification Padmavati Sridhar, Tamer Kahveci, and Sanjay Ranka
88
Transcriptional Interactions During Smallpox Infection and Identification of Early Infection Biomarkers Willy A. Valdivia-Granda, Maricel G. Kann, and Jose Malaga
100
COMPUTATIONAL APPROACHES TO METABOLOMICS Session Introduction David S. Wishart and Russell Greiner
112
Leveraging Latent Information in NMR Spectra for Robust Predictive Models David Chang, Aalim Weljie, and Jack Newton
115
Bioinformatics Data Profiling Tools: A Prelude to Metabolic Profiling Natarajan Ganesan, Bala Kalyanasundaram, and Mahe Velauthapillai
127
Comparative QSAR Analysis of Bacterial, Fungal, Plant and Human Metabolites Emre Karakoc, S. Cenk Sahinalp, and Artem Cherkasov
133
BioSpider: A Web Server for Automating Metabolome Annotations Craig Knox, Savita Shrivastava, Paul Stothard, Roman Eisner, and David S. Wishart
145
New Bioinformatics Resources for Metabolomics John L. Markley, Mark E. Anderson, Qiu Cui, Hamid R. Eghbalnia, Ian A. Lewis, Adrian D. Hegeman, Jing Li, Christopher F. Schulte, Michael R. Sussman, William M. Westler, Eldon L. Ulrich, and Zsolt Zolnai
157
Setup X — A Public Study Design Database for Metabolomic Projects Martin Scholz and Oliver Fiehn
169
Comparative Metabolomics of Breast Cancer Chen Yang, Adam D. Richardson, Jeffrey W. Smith, and Andrei Osterman
181
Metabolic Flux Profiling of Reaction Modules in Liver Drug Transformation Jeongah Yoon and Kyongbum Lee
193
NEW FRONTIERS IN BIOMEDICAL TEXT MINING Session Introduction Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and K. Bretonnel Cohen
205
Extracting Semantic Predications from Medline Citations for Pharmacogenomics Caroline B. Ahlers, Marcelo Fiszman, Dina Demner-Fushman, François-Michel Lang, and Thomas C. Rindflesch
209
Annotating Genes Using Textual Patterns Ali Cakmak and Gultekin Ozsoyoglu
221
A Fault Model for Ontology Mapping, Alignment, and Linking Systems Helen L. Johnson, K. Bretonnel Cohen, and Lawrence Hunter
233
Integrating Natural Language Processing with Flybase Curation Nikiforos Karamanis, Ian Lewin, Ruth Seal, Rachel Drysdale, and Edward Briscoe
245
A Stacked Graphical Model for Associating Sub-Images with Sub-Captions Zhenzhen Kou, William W. Cohen, and Robert F. Murphy
257
GeneRIF Quality Assurance as Summary Revision Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter
269
Evaluating the Automatic Mapping of Human Gene and Protein Mentions to Unique Identifiers Alexander A. Morgan, Benjamin Wellner, Jeffrey B. Colombe, Robert Arens, Marc E. Colosimo, and Lynette Hirschman
281
Multiple Approaches to Fine-Grained Indexing of the Biomedical Literature Aurelie Neveol, Sonya E. Shooshan, Susanne M. Humphrey, Thomas C. Rindflesch, and Alan R. Aronson
292
Mining Patents Using Molecular Similarity Search James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, and Patricia Ordonez
304
Discovering Implicit Associations Between Genes and Hereditary Diseases Kazuhiro Seki and Javed Mostafa
316
A Cognitive Evaluation of Four Online Search Engines for Answering Definitional Questions Posed by Physicians Hong Yu and David Kaufman
328
BIODIVERSITY INFORMATICS: MANAGING KNOWLEDGE BEYOND HUMANS AND MODEL ORGANISMS Session Introduction Indra Neil Sarkar
340
Biomediator Data Integration and Inference for Functional Annotation of Anonymous Sequences Eithon Cadag, Brent Louie, Peter J. Myler, and Peter Tarczy-Hornoch
343
Absent Sequences: Nullomers and Primes Greg Hampikian and Tim Andersen
355
An Anatomical Ontology for Amphibians Anne M. Maglia, Jennifer L. Leopold, L. Analia Pugener, and Susan Gauch
367
Recommending Pathway Genes Using a Compendium of Clustering Solutions David M. Ng, Marcos H. Woehrmann, and Joshua M. Stuart
379
Semi-Automated XML Markup of Biosystematic Legacy Literature with the GoldenGATE Editor Guido Sautter, Klemens Böhm, and Donat Agosti
391
COMPUTATIONAL PROTEOMICS: HIGH-THROUGHPUT ANALYSIS FOR SYSTEMS BIOLOGY Session Introduction William Cannon and Bobbie-Jo Webb-Robertson
403
Advancement in Protein Inference from Shotgun Proteomics Using Peptide Detectability Pedro Alves, Randy J. Arnold, Milos V. Novotny, Predrag Radivojac, James P. Reilly, and Haixu Tang
409
Mining Tandem Mass Spectral Data to Develop a More Accurate Mass Error Model for Peptide Identification Yan Fu, Wen Gao, Simin He, Ruixiang Sun, Hu Zhou, and Rong Zeng
421
Assessing and Combining Reliability of Protein Interaction Sources Sonia Leach, Aaron Gabow, Lawrence Hunter, and Debra S. Goldberg
433
Probabilistic Modeling of Systematic Errors in Two-Hybrid Experiments David Sontag, Rohit Singh, and Bonnie Berger
445
Prospective Exploration of Biochemical Tissue Composition via Imaging Mass Spectrometry Guided by Principal Component Analysis Raf Van de Plas, Fabian Ojeda, Maarten Demi, Ludo Van Den Bosch, Bart De Moor, and Etienne Waelkens
458
DNA-PROTEIN INTERACTIONS: INTEGRATING STRUCTURE, SEQUENCE, AND FUNCTION Session Introduction Martha L. Bulyk, Alexander J. Hartemink, Ernest Fraenkel, and Gary Stormo
470
Discovering Motifs With Transcription Factor Domain Knowledge Henry C.M. Leung, Francis Y.L. Chin, and Bethany M.Y. Chan
472
Ab initio Prediction of Transcription Factor Binding Sites L. Angela Liu and Joel S. Bader
484
Comparative Pathway Annotation with Protein-DNA Interaction and Operon Information via Graph Tree Decomposition Jizhen Zhao, Dongsheng Che, and Liming Cai
496
PROTEIN INTERACTIONS AND DISEASE

MARICEL KANN
National Center for Biotechnology Information, NIH, Bethesda, MD 20894, U.S.A.

YANAY OFRAN
Department of Biochemistry & Molecular Biophysics, Columbia University, New York, NY 10032, U.S.A.

MARCO PUNTA
Department of Biochemistry & Molecular Biophysics, Columbia University, New York, NY 10032, U.S.A.

PREDRAG RADIVOJAC
School of Informatics, Indiana University, Bloomington, IN 47408, U.S.A.
In 2003, the US National Human Genome Research Institute (NHGRI) articulated grand challenges for the genomics community, in which the translation of genome-based knowledge into disease understanding, diagnostics, prognostics, drug response and clinical therapy is one of the three fundamental directions ("genomics to biology," "genomics to health" and "genomics to society").1 At the same time, the National Institutes of Health (NIH) laid out a similar roadmap for biomedical sciences.2 Both the NHGRI grand challenges and the NIH roadmap recognized bioinformatics as an integral part of the future of life sciences. While this recognition is gratifying for the bioinformatics community, its task now is to answer the challenge of making a direct impact on medical science and benefiting human health. Innovative use of informatics in the "translation from bench to bedside" becomes key for bioinformaticians.

In 2005, the Pacific Symposium on Biocomputing (PSB) first solicited papers related to one aspect of this challenge, protein interactions and disease, which directly addresses computational approaches in the search for the molecular basis of disease. The goal of the session was to bring together scientists interested in both bioinformatics and medical sciences to present their research progress. The session generated great interest, resulting in a number of high-quality papers and testable hypotheses regarding the involvement of proteins in various disease pathways. This year, the papers accepted for the session on Protein Interactions and Disease at PSB 2007 follow the same trend.
The first group of papers explores structural aspects of protein-protein interactions. Kelly et al. study ABC transporter proteins, which are involved in substrate transport through the membrane. By investigating intra-transporter domain interfaces, they conclude that nucleotide-binding interfaces are more conserved than those of transmembrane domains. Disease-related mutations were mapped onto these interfaces. Pulim et al. developed a novel threading algorithm that predicts interactions between receptors (membrane proteins) and ligands. The method was tested on cytokines, proteins implicated in inter-cellular communication and immune system response. Novel candidate interactions, which may be implicated in disease, were predicted. Kasson and Pande use molecular dynamics to address high-order molecular organization in cell membranes. A large number of molecular dynamics trajectories provided clues into structural aspects of the insertion of an approximately 20-residue fusion peptide into a cell membrane by a hemagglutinin trimer of the influenza virus. The authors explain the effects of mutations that preserve the peptide's monomeric structure but incur loss of viral infectivity.

The second group of studies focused on the analysis of protein interaction networks. Sam et al. investigate molecular factors responsible for diseases with different causes but similar phenotypes, and postulate that some are related to breakdowns in shared protein-protein interaction networks. A statistical method is proposed to identify protein networks shared by diseases. Sridhar et al. developed an efficient algorithm for perturbing metabolic networks in order to stop the production of target compounds while minimizing unwanted effects. The algorithm is aimed at drug development, where the toxicity of the drug should be reduced. Borgwardt et al. were interested in predicting clinical outcome by combining microarray and protein-protein interaction data.
They use graph kernels as a measure of similarity between graphs and develop methods to improve their scalability to large graphs. Support vector machines were used to predict disease outcome. Gonzalez et al. extracted a large number of gene-disease relationships by parsing the literature and mapping them onto known protein-protein interaction networks. They propose a method for ranking proteins by their involvement in disease. The method was tested on atherosclerosis. Valdivia-Granda et al. devised a method to integrate protein-protein interaction data, along with other genomic annotation features, with microarray data. They applied it to microarray data from a study of non-human primates infected with variola and identified early infection biomarkers. The study was complemented with a comparative protein domain analysis between host and pathogen. This work contributes to the understanding of the mechanisms of infectivity and disease, and suggests potential therapeutic targets. Finally, Cook et al. developed a novel ontology of biochemical pathways. They present Chalkboard, a tool for
building and visualizing biochemical pathways. Chalkboard can be used interactively and is capable of making inferences.

Acknowledgements

The session co-chairs would like to thank the numerous reviewers for their help in selecting the best papers among many excellent submissions.

References

1. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003; 422(6934):835.
2. Zerhouni E. The NIH roadmap. Science 2003; 302(5642):63.
GRAPH KERNELS FOR DISEASE OUTCOME PREDICTION FROM PROTEIN-PROTEIN INTERACTION NETWORKS

KARSTEN M. BORGWARDT AND HANS-PETER KRIEGEL
Institute for Computer Science, Ludwig-Maximilians-University Munich, Oettingenstr. 67, 80538 Munich, Germany
E-mail: [email protected], [email protected]

S.V.N. VISHWANATHAN AND NICOL N. SCHRAUDOLPH
Statistical Machine Learning Program, National ICT Australia, Canberra, 0200 ACT, Australia
E-mail: [email protected], [email protected]
It is widely believed that comparing discrepancies in the protein-protein interaction (PPI) networks of individuals will become an important tool in understanding and preventing diseases. Currently PPI networks for individuals are not available, but gene expression data is becoming easier to obtain and allows us to represent individuals by a co-integrated gene expression/protein interaction network. Two major problems hamper the application of graph kernels - state-of-the-art methods for whole-graph comparison - to compare PPI networks. First, these methods do not scale to graphs of the size of a PPI network. Second, missing edges in these interaction networks are biologically relevant for detecting discrepancies, yet these methods do not take this into account. In this article we present graph kernels for biological network comparison that are fast to compute and take into account missing interactions. We evaluate their practical performance on two datasets of co-integrated gene expression/PPI networks.
1. Introduction

An important goal of research on protein interactions is to identify relevant interactions that are involved in disease outbreak and progression. Measuring discrepancies between protein-protein interaction (PPI) networks of healthy and ill patients is a promising approach to this problem. Unfortunately, establishing individual networks is beyond the current scope of technology. Co-integrated gene expression/PPI networks, however, offer an attractive alternative to study the impact of protein interactions on disease. But researchers in this area are often faced with a computationally challenging problem: how to measure similarity between large interaction networks? Moreover, biologically relevant information can be gleaned both from the presence and absence of interactions. How does one make use of this domain knowledge? The aim of this paper is to answer both these questions systematically.
1.1. Interaction Networks are Graphs
We begin our study by observing that interaction networks are graphs, where each node represents a protein and each edge represents the presence of an interaction. Conventionally there are two ways of measuring similarity between graphs. One approach is to perform a pairwise comparison of the nodes and/or edges in two networks, and calculate an overall similarity score for the two networks from the similarity of their components. This approach takes time quadratic in the number of nodes and edges, and is thus computationally feasible even for large graphs. However, this strategy is flawed in that it completely neglects the structure of the networks, treating them as sets of nodes and edges instead of graphs.

A more principled alternative would be to deem two networks similar if they share many common substructures, or more technically, if they share many common subgraphs. To compute this, however, we would have to solve the so-called subgraph isomorphism problem, which is known to be NP-complete, i.e., the computational cost of this problem increases exponentially with problem size, seriously limiting this approach to very small networks [1]. Many heuristics have been developed to speed up subgraph isomorphism by using special canonical labelings of the graphs; none of them, however, can avoid an exponential worst-case computation time.

Graph kernels as a measure of similarity on graphs offer an attractive middle ground: they can be computed in polynomial time, yet they compare non-trivial substructures of graphs. In spite of these attractive properties, graph kernels as they exist neither scale to large interaction networks nor address the issue of missing interactions. In this paper, we present fast algorithms for computing graph kernels which scale to large networks. Simultaneously, by using a complement graph - a graph made up of all the nodes and the missing edges in the original graph - we address the issue of missing interactions in a principled manner.

Outline. The remainder of this article is structured as follows. In Section 2, we review existing graph kernels and illustrate the problems encountered when applying graph kernels to large networks. In Section 3, we present algorithms for speeding up graph kernel computation, and in Section 4, we define graph kernels that take missing interactions into account as well. In our experiments (see Section 5), we employ our fast and enhanced graph kernels for disease outcome prediction, before concluding with an outlook and discussion.

2. Review of Existing Graph Kernels

Existing graph kernels can be viewed as a special case of the R-convolution kernels proposed by Haussler [2]. The basic idea is to decompose the graph into smaller substructures, and build the kernel based on similarities between the decomposed substructures. Different kernels mainly differ in the way they decompose the graph for comparison and the similarity measure they use to compare the decomposed substructures.

Random walk kernels are based on a simple idea: given a pair of graphs, decompose them into paths obtained by performing a random walk, and count the number of matching walks [3-5]. Various incarnations of these kernels use different methods to compute similarities between walks. For instance, Gartner et al. [4] count the number of nodes in the random walk which have the same label. They also include a decay factor to ensure convergence. Borgwardt et al. [3], on the other hand, use a kernel defined on nodes and edges in order to compute similarity between random walks. Although derived using a completely different motivation, it was recently shown by Vishwanathan et al. [6] that the marginalized graph kernels of Kashima et al. [5] are also essentially a random walk kernel. Mahe et al. [7] extend the marginalized graph kernels in two ways. They enrich the labels by using the so-called Morgan index, and modify the kernel definition to prevent tottering, i.e., to prevent similar smaller substructures from generating high similarity scores. Both these extensions are particularly relevant for chemoinformatics applications. Other decompositions of graphs, which are well suited for particular application domains, include subtrees [8], molecular fingerprints based on various types of depth first searches [9], and structural elements like rings, functional groups and so on [10].

While many domain-specific variants of graph kernels yield state-of-the-art performance, they are plagued by computational issues when used to compare large graphs like those frequently found in PPI networks. This is mainly due to the fact that the kernel computation algorithms typically scale as O(n^6) or worse. Practical applications therefore either compute the kernel approximately or make unrealistic sparsity assumptions on the input graphs. In contrast, in the next section, we discuss three efficient methods for computing random walk graph kernels which are both theoretically sound and practically efficient.

3. Fast Random Walk Kernels

In this section we briefly describe a unifying framework for random walk kernels, and present fast algorithms for their computation. We warn the biologically motivated reader that this section is rather technical. But the algorithms presented below allow us to efficiently compute kernels on large graphs, and hence are crucial building blocks of our classifier for disease outcome prediction.

3.1. Notation
A graph G(V, E) consists of an ordered and finite set of n vertices V denoted by {v_1, v_2, ..., v_n}, and a finite set of edges E ⊂ V × V. G is said to be undirected if (v_i, v_j) ∈ E ⇔ (v_j, v_i) ∈ E for all edges. Edge labels are collected in a matrix L, where L_ij ∈ X is the label of edge (v_i, v_j), and a special label ε ∈ X marks pairs of vertices that are not adjacent. Let Φ : X → H denote the corresponding feature map, which maps ε to the zero element of H. We use Φ(L) to denote the feature matrix of G. For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well.

3.2. Product Graphs
Given two graphs G(V, E) and G'(V', E'), the product graph G×(V×, E×) is a graph with nn' vertices, each representing a pair of vertices from G and G', respectively. An edge exists in E× iff the corresponding vertices are adjacent in both G and G'. Thus

V× = {(v_i, v'_i') : v_i ∈ V ∧ v'_i' ∈ V'},
E× = {((v_i, v'_i'), (v_j, v'_j')) : (v_i, v_j) ∈ E ∧ (v'_i', v'_j') ∈ E'}.

The adjacency matrix of the product graph is A× = A ⊗ A', where ⊗ represents the Kronecker product of matrices. If G and G' are edge-labeled, we can associate a weight matrix W× ∈ R^(nn'×nn') with G×, defined as W× = Φ(L) ⊗ Φ(L'). Recall that Φ(L) and Φ(L') are matrices defined in an RKHS. Hence we use a slightly extended version of the Kronecker product and define the (in'+j, i'n'+j')-th entry of W× as κ(L_ij, L'_i'j'). As a consequence of this definition, the entries of W× are non-zero only if the corresponding edges exist in the product graph.

We assume that H = R^d is endowed with the usual dot product, and that there are d distinct edge labels {1, 2, ..., d}. Moreover, we let κ be a delta kernel, i.e., its value between any two edges is one iff the labels on the edges match, and zero otherwise. Let ˡA denote the adjacency matrix of the graph filtered by the label l, i.e., ˡA_ij = A_ij if L_ij = l and zero otherwise. Some simple algebra (omitted for the sake of brevity) shows that the weight matrix of the product graph can be written as

W× = Σ_{l=1}^{d} ˡA ⊗ ˡA'.    (3)

Let p and p' denote initial probability distributions over vertices of G and G'. Then the initial probability distribution p× of the product graph is p× := p ⊗ p'. Likewise, if q and q' denote stopping probabilities (i.e., the probability that a random walk ends at a given vertex), the stopping probability q× of the product graph is q× := q ⊗ q'.
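The product-graph quantities above are mechanical to compute. The sketch below (numpy; the function names are ours, and we restrict to unlabeled graphs, for which W× reduces to the adjacency Kronecker product A ⊗ A') builds A× and the product distributions for two toy graphs:

```python
import numpy as np

def product_graph(A, A_prime):
    """Adjacency matrix of the direct product graph, A_x = A (x) A'.
    An edge exists in the product graph iff the corresponding vertex
    pairs are adjacent in both input graphs."""
    return np.kron(A, A_prime)

def product_distributions(p, p_prime, q, q_prime):
    """Initial and stopping distributions of the product graph:
    p_x = p (x) p' and q_x = q (x) q'."""
    return np.kron(p, p_prime), np.kron(q, q_prime)

# Two small undirected graphs: a single edge, and a path on 3 vertices.
A = np.array([[0., 1.],
              [1., 0.]])
A_prime = np.array([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
A_x = product_graph(A, A_prime)          # 6 x 6 adjacency of G_x

# Uniform start/stop distributions over each graph's vertices.
p_x, q_x = product_distributions(np.full(2, 0.5), np.full(3, 1/3),
                                 np.full(2, 0.5), np.full(3, 1/3))
```

Each nonzero entry of A× corresponds to a pair of edges, one from each input graph, exactly as in the definition of E×.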
3.3. Kernel
Definition
An edge exists in the product graph if, and only if, an edge exists in both G and G'. Therefore, performing a simultaneous random walk on G and G' is equivalent to performing a random walk on the product graph [11]. Given the weight matrix W× and the initial and stopping probability distributions p× and q×, the kernel is defined as

k(G, G') := Σ_{k=0}^{∞} μ(k) q×ᵀ W×^k p×,    (4)

for an appropriately chosen sequence of coefficients μ(k). A popular choice to ensure convergence of (4) is to assume μ(k) = λ^k for some λ > 0. If λ is sufficiently small^a, then (4) is well defined, and we can write

k(G, G') = Σ_k λ^k q×ᵀ W×^k p× = q×ᵀ (I − λW×)^{−1} p×,    (5)

where I denotes the identity matrix. It can be shown (see Vishwanathan et al. [6]) that the marginal graph kernels of Kashima et al. [5] as well as the geometric graph kernels of Gartner et al. [4] are special cases of (5).

3.4. Fast Computation
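Before the fast schemes, it helps to see the naive evaluation of (5) that they are designed to avoid; a NumPy sketch for the unlabeled case (our own illustration, not the authors' implementation; λ must keep λ·ρ(W×) below 1 for the series to converge):

```python
import numpy as np

def geometric_graph_kernel_direct(A, Ap, p, pp, q, qp, lam):
    """Naive evaluation of the kernel: k(G, G') = q_x^T (I - lam*W_x)^{-1} p_x.
    Forms the nn' x nn' product-graph matrix explicitly, hence O((nn')^3)."""
    Wx = np.kron(A, Ap)   # unlabeled case: W_x = A kron A'
    px = np.kron(p, pp)   # initial distribution on the product graph
    qx = np.kron(q, qp)   # stopping distribution on the product graph
    N = Wx.shape[0]
    return qx @ np.linalg.solve(np.eye(N) - lam * Wx, px)
```

Solving the linear system is preferred over forming the matrix inverse explicitly, but the cubic cost in nn' remains, which motivates the three schemes below.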
Direct computation of (5) is prohibitively expensive since it involves the inversion of an nn' × nn' matrix, which scales as O(n^6). We now outline three efficient schemes whose worst-case computational complexity is lower, and whose practical performance as measured by our experiments is up to three orders of magnitude faster. Vishwanathan et al. [6] contains more technical and algorithmic details.

3.4.1. Sylvester Equation Methods

Consider the following equation, commonly known as the Sylvester or Lyapunov equation:

X = SXT + X₀.    (6)

Here, S, T, X₀ ∈ R^{n×n} are given and we need to solve for X ∈ R^{n×n}. These equations can be readily solved in O(n^3) time with freely available code [12], e.g. Matlab's dlyap method.
^a The values of λ which ensure convergence depend on the spectrum of W×.
It can be shown that if the weight matrix W× can be written as in (3), then the problem of computing the graph kernel (5) can be reduced to the problem of solving the following generalized Sylvester equation:

X = λ Σ_{l=1}^{d} lA' X lAᵀ + X₀,    (7)
where vec(X₀) = p×, with vec(·) being the function that flattens a matrix by vertically concatenating its columns.

3.4.2. Conjugate Gradient Methods

Given a matrix M and a vector b, conjugate gradient (CG) methods solve the system of equations Mx = b efficiently [13]. They are particularly efficient if the matrix is rank deficient, or has a small effective rank, i.e., number of distinct eigenvalues. Furthermore, if computing matrix-vector products is cheap, the CG solver can be sped up significantly [13]. The graph kernel (5) can be computed by a two-step procedure: first we solve the linear system

(I − λW×) x = p×,    (8)
for x, then we compute q×ᵀx. By using extensions of tensor calculus rules to RKHS, one can compute W×r for an arbitrary vector r rather efficiently, which in turn can be used to speed up the CG solver.

3.4.3. Fixed-Point Iterations

Fixed-point methods begin by rewriting (8) as

x = p× + λW× x.    (9)
Now, solving for x is equivalent to finding a fixed point of the above iteration [13]. Letting x_t denote the value of x at iteration t, we set x₀ := p×, then compute

x_{t+1} = p× + λW× x_t    (10)

repeatedly until ||x_{t+1} − x_t|| < ε, where ||·|| denotes the Euclidean norm and ε some pre-defined tolerance. Observe that each iteration of (10) involves computation of the matrix-vector product W× x_t, and hence the extensions of tensor calculus to RKHS mentioned previously can again be used to speed up the computation.
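A minimal sketch of the fixed-point scheme, using the Kronecker identity (A ⊗ A') vec(X) = vec(A X A'ᵀ) (row-major convention) so that W× is never materialized; the function names and dense-matrix inputs are our own illustration, not the authors' code:

```python
import numpy as np

def graph_kernel_fixed_point(As, Aps, p, pp, q, qp, lam, tol=1e-10, max_iter=1000):
    """Iterate x_{t+1} = p_x + lam * W_x x_t without forming W_x explicitly.
    As, Aps: lists of label-filtered adjacency matrices of G and G'."""
    n, n_prime = len(p), len(pp)
    px = np.kron(p, pp)
    qx = np.kron(q, qp)
    x = px.copy()
    for _ in range(max_iter):
        X = x.reshape(n, n_prime)
        # W_x @ x = sum_l vec(lA X lA'^T): cheap matrix-vector product
        Wx_x = sum(lA @ X @ lAp.T for lA, lAp in zip(As, Aps)).ravel()
        x_new = px + lam * Wx_x
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return qx @ x
```

Each iteration costs a few small matrix products per label instead of one nn' × nn' matrix-vector product, which is the source of the speedup the text describes.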
4. Composite Graph Kernel

The presence of an edge in a graph signifies interactions between the end nodes. In many applications these interactions are significant. For instance, in chemoinformatics the presence of an edge indicates the presence of a chemical bond between two atoms. In the case of PPI networks, the presence of an edge indicates that the corresponding proteins interact. But, when studying protein interactions in disease, not just the presence but also the absence of interactions is significant. Existing graph kernels (e.g. (5)) cannot take this into account. We propose to modify the existing kernels to take this information into account. Key to our exposition is the notion of a complement graph, which we define below. Suppose G(V, E) is a graph with vertex set V and edge set E. Then its complement Ḡ(V, Ē) is a graph with the same vertex set V, but with a different edge set Ē := (V × V) \ E. In other words, the complement graph is made up of all the edges missing from the original graph. Using the concept of a complement graph, we can now define a composite graph kernel as follows:

k_comp(G, G') = k(G, G') + k(Ḡ, Ḡ').
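The composite kernel can be assembled from any base graph kernel; a small sketch with NumPy adjacency matrices (our own illustration; note that, as Ē = (V × V) \ E, the complement as defined also includes self-pairs):

```python
import numpy as np

def complement(A):
    """Adjacency matrix of the complement graph: every vertex pair absent
    from E is included (self-pairs too, per E_bar = (V x V) \\ E)."""
    return np.ones_like(A) - A

def composite_kernel(k, A, Ap):
    """k_comp(G, G') = k(G, G') + k(G_bar, G'_bar) for a base graph kernel k."""
    return k(A, Ap) + k(complement(A), complement(Ap))
```

Any kernel of the form (5) can be passed in as the base kernel k, so the composite kernel inherits the fast computation schemes above.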
Figure 3. (Top) The ligand and kinase model from Figure 2. (Bottom) The Inference network (not visualized in the Chalkboard user interface). Functional attributes (FAs) for each Entity and Action are represented and linked by a network of Operators (white circles with mathematical symbols) and arcs (arrows) that represent the directed dependencies of attribute values on each other. PathTracing displays one "main" FA for each Entity or Action (bold frames). FAs in this model include: amt = Amount of a Molecule or Site (molarity or concentration); act = Activity of a Site (percent or fraction); avl = Available amount = act × amt (molarity or concentration); occ = Bind-site occupancy of a Bind site (percent or fraction); bnd = Bound amount of a Bind site (molarity or concentration); Met = Chemical flow rate of a reaction (moles/s or concentration/s); Del = Change of a Site attribute (percent or fraction); mod = Modulator (of an action; percent or fraction).

token. Feedback loops are characterized as positive or negative according to the net polarity of perturbations in the token's list. Tokens are also terminated when they reach nodes with no outgoing arcs (as at each occ in Figure 3). Chalkboard reuses the Inference network to automatically generate JSim [5] mathematical biosimulation code (not shown) that includes: (a) system state variables (one for each FA value) with default units, (b) algebraic or differential equations for each Operator (e.g., a rate equation), and (c) Operator equation parameters (e.g., reaction rate constants). The JSim system interprets Chalkboard-generated code, while parameter values are set by users at runtime.

Figure 4. A view of APP proteolysis within Chalkboard where the action between LRP and the proteolysis by BACE is clamped. Under this condition, if more LRP is bound to Fe65, or if more LRP is available, then Amyloid production decreases.
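The token-propagation idea behind PathTracing can be caricatured in a few lines (a loose sketch of the behavior described above, not Chalkboard's actual algorithm; the graph encoding and the ±1 polarity convention are our own assumptions):

```python
def path_trace(arcs, start, polarity=1):
    """Propagate a perturbation token through directed arcs, flipping its sign
    on negative arcs; tokens stop at revisited nodes (cutting feedback loops)
    and at nodes with no outgoing arcs.
    arcs: dict mapping node -> list of (successor, arc_polarity in {+1, -1})."""
    reached = {}
    stack = [(start, polarity, frozenset([start]))]
    while stack:
        node, pol, seen = stack.pop()
        reached[node] = reached.get(node, 0) + pol
        for succ, arc_pol in arcs.get(node, []):
            if succ not in seen:  # terminate tokens that would close a loop
                stack.append((succ, pol * arc_pol, seen | {succ}))
    return reached
```

In this sketch, "clamping" an action (as with the LRP→BACE arc in Figure 4) would simply amount to removing that arc before tracing.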
A Chalkboard model of APP processing

Alzheimer's Disease is a pervasive neurodegenerative disorder associated with aging, characterized by diffuse cortical plaques and neurofibrillary tangles [6]; the primary constituent of the plaques is a small peptide derived from the β-amyloid precursor protein (APP) [11]. The primary theory of Alzheimer's Disease etiology is the "amyloid hypothesis," by which elevated levels of β-amyloid production result in neuronal degeneration, cortical plaques, cognitive dementia, and ultimately death [6]. Effective therapy requires that scientists understand the complex events of APP proteolysis, in both normal and pathologic situations. APP is a single-pass transmembrane protein that is sequentially proteolytically cleaved by enzymes to peptides (yellow and blue, respectively, in Figure 4). Primary cleavage occurs in the luminal/extracellular domain at the α-secretase cleavage site by metalloproteinases such as TACE [12], or at the β-secretase cleavage site by the atypical aspartyl protease BACE [13]. Subsequently, the remaining carboxy-terminal fragments of APP (C99 and C83 in Figure 4) are
cleaved by the heterotetrameric γ-secretase complex [14]. Cleavage of APP at the α- and γ-secretase sites (left-hand side of Figure 4) liberates the APP extracellular domain (APPsα), the p3 peptide, and the APP intracellular domain CTFγ (also called AICD) [15]. Alternatively, cleavage of APP at the β- and γ-secretase sites (right side, Figure 4) generates a soluble extracellular domain (APPsβ), an intracellular domain CTFγ, and the amyloid β peptide [15]. CTFγ plays an important role in transcription. In particular, the heterotrimeric APP-CTFγ/Fe65/Tip60 complex functions as a nuclear-targeted transcriptional regulator [16, 17]. It is currently unclear, however, how the CTFγ/Fe65/Tip60 complex affects neuronal survival [18, 19]. Furthermore, APP proteolysis by the γ-secretase complex may be regulated by the APP-associated factor LRP [20] via Fe65 [21], and also may involve the stimulation of either α-secretase or β-secretase cleavage [20, 22]. To test these possibilities, we have included the LRP/Fe65 binding in our Chalkboard model (Figure 4), and included LRP activation of both BACE and TACE proteolysis. Then, we clamped the effect of LRP on β-secretase cleavage (the red slash sign), to show that the downstream effect is to decrease amyloid production. The inherent complexity of the interactions among APP, the proteolytic processing enzymes, and the associated binding proteins is an arena in which a detailed modeling system such as Chalkboard would be extremely helpful. Potentially, Chalkboard could help provide valuable insights into predictions about both mechanisms of action and potential experimental manipulations that could guide the development of effective therapeutic approaches to treating AD.
Discussion and related work

Chalkboard is an ontology-based computational tool for representing biomolecular pathways, using a graphical language and model-editing environment to build pathway models that can be analyzed qualitatively with a built-in PathTracing tool (Section 2.2) and quantitatively by exporting model simulation code (Section 2.3) to the JSim simulation system. As such, Chalkboard relates to several threads of computational research that deserve in-depth discussion beyond the scope of this paper. Here, however, we emphasize Chalkboard's relation to three areas of pathway-informatics research: ontology research, qualitative inference, and quantitative analysis. We also address the tradeoffs between scalability and the rich biochemical representation we employ with Chalkboard.

1.4. Ontology-based representations of biomolecular pathways

The Chalkboard ontology continues to evolve from the BioD biological description language [7], concurrently with biomolecular pathway ontologies including BioPAX [2], PATIKA [23], CellDesigner [24], and others. As expected, there is
considerable representational overlap that should, with community effort, be resolvable into a high-level ontology or, at least, an alignment between related ontologies. We are committed to such efforts, as advocated by others [10, 25]. We note, however, important representational differences, particularly in modeling molecular "states". Many ontologies consider different states of a physical entity (e.g., a molecule) to be separate entities (e.g., a molecule, its phosphorylated form, and its active form). Chalkboard takes an "object-oriented" view that a single entity Molecule can have Functional sites as parts, and each part can have an independent operational state, so that the state of a Molecule is specified by the values of its own Functional attributes plus the FAs of its parts (e.g., Occupied, Active, etc.). We adopt the Functional attribute approach because it maps well to both qualitative and quantitative analyses (Section 2.3). Furthermore, we suggest, the Functional attribute approach generalizes readily to other biophysical domains such as membrane biophysics (e.g., membrane potentials, conductances, and currents), structural mechanics (e.g., elastance), and fluid flow (e.g., diffusive or bulk flows). We see this generalizability as a prerequisite for the integration of pathway knowledge and analysis into multiscale (molecules, cells, organs, organ systems, etc.), multidomain (biochemistry, biophysics, mechanics) models.

7.5. Qualitative inference and quantitative analysis

Qualitative reasoning tools in biological research have been driven by the scarcity and high cost of the quantitative datasets required for quantitative modeling. However, many representational schemes do not, as yet, support qualitative inference (e.g., BioPAX [2], CellDesigner [24]), and those that do use graph-theoretic query methods (e.g., PATIKA [23]) or rule-based reasoning (e.g., BioCyc [3]) rather than state-based modeling.
Chalkboard's qualitative inference is based more directly on the principles of quantitative modeling, by tracking the propagation of (small) perturbations through a network of essentially quantitative relationships. The benefits of coupling graphical representations to the computational analysis of biological systems have long been recognized, resulting in a variety of implementations, including our own KineCyte [26], which integrates graphical modeling with biosimulation. Chalkboard, however, relies on existing simulation engines to interpret automatically generated simulation code (currently we use JSim, but we intend to support CellML [27] and SBML [28]). Although other molecular pathway representations (e.g., PATIKA, CellDesigner) may have sufficient rigor and expressiveness to export simulation code, to our knowledge this is not yet available for existing simulation languages [29].
1.6. Scalability and representational richness

We recognize trade-offs between Chalkboard's semantically rich graphical view of biological pathways and the less rich but more scalable representations used by applications such as Cytoscape [30]. We believe that scientists need both sorts of tools: although Cytoscape is appropriate for coarse-grained visualization of large networks, only tools like Chalkboard, which use richer representations, can capture notions of competitive binding and cooperative and anti-cooperative effects. We recognize that Chalkboard will not be the only tool used by a researcher, and thus we have designed the system to export its models in a sharable format. Chalkboard models are saved in an XML text file that represents all model entities, model actions, and their linkages in a form that can be read and parsed by other applications. More specifically, our plans include inter-operating with the BioPAX standard [2] (as much as possible, given the differences in modeling), as well as with CellML and SBML for simulation code.
Summary

We have argued that modern pathway researchers need tools for building and reasoning about causal models based on an inference method. Chalkboard is one prototype system that fills this need. The key characteristics of Chalkboard are: (1) the use of an expressive ontology of Entities, Actions, and Functional attributes to model pathways at a level based on the physics and biochemistry of inter- and intra-molecular interactions; and (2) Chalkboard's ability to carry out high-level symbolic qualitative inference (PathTracing) and to generate quantitative (JSim) simulation code, which allows users to avoid two pitfalls: (a) being tied to quantitative models whose utility and relevance are limited by the (typical) lack of quantitative data, and (b) over-simplified biochemical representations whose fidelity to actual biochemical processes is limited. We have introduced the Chalkboard modeling environment and demonstrated its use in analyzing a cell-signaling pathway with important scientific and clinical implications. The design of effective therapeutics requires a rigorous understanding of how modulation of a particular molecular entity would affect a distributed signaling system. As Chalkboard is designed to assess this issue, we suggest that use of Chalkboard modeling could facilitate the identification of appropriate pharmacogenetic therapeutic targets within Alzheimer's Disease and other human pathologies.
References
1. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research 2005;33:D428-D432.
2. BioPAX - Biological Pathways Exchange Language Level 2. http://www.biopax.org.
3. Karp PD, Ouzounis CA, Moore-Kochlacs C, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005;33(19):6083-9.
4. Davis R, Shrobe H, Szolovits P. What is a knowledge representation? AI Magazine 1993;Spring:17-33.
5. National Simulation Resource. http://nsr.bioeng.washington.edu/PLN
6. Hardy J, Selkoe DJ. The amyloid hypothesis of Alzheimer's disease: progress and problems on the road to therapeutics. Science 2002;297(5580):353-6.
7. Cook DL, Farley JF, Tapscott SJ. A basis for a visual language for describing, archiving and analyzing functional models of complex biological systems. Genome Biol 2001;2(4):RESEARCH0012.
8. Cook DL, Mejino JLV, Rosse C. The Foundational Model of Anatomy: a template for the symbolic representation of multi-scale physiological functions. Medinfo 2005;12.
9. Rosse C, Mejino JLV. A Reference Ontology for Bioinformatics: The Foundational Model of Anatomy. Journal of Biomedical Informatics 2003;36:478-500.
10. Open Biomedical Ontologies. http://obo.sourceforge.net
11. Glenner GG, Wong CW, Quaranta V, Eanes ED. The amyloid deposits in Alzheimer's disease: their nature and pathogenesis. Appl Pathol 1984;2(6):357-69.
12. Buxbaum JD, Liu KN, Luo Y, et al. Evidence that tumor necrosis factor alpha converting enzyme is involved in regulated alpha-secretase cleavage of the Alzheimer amyloid protein precursor. J Biol Chem 1998;273(43):27765-7.
13. Vassar R, Bennett BD, Babu-Khan S, et al. Beta-secretase cleavage of Alzheimer's amyloid precursor protein by the transmembrane aspartic protease BACE. Science 1999;286(5440):735-41.
14. De Strooper B. Aph-1, Pen-2, and Nicastrin with Presenilin generate an active gamma-Secretase complex. Neuron 2003;38(1):9-12.
15. Selkoe DJ. Alzheimer's disease: genes, proteins, and therapy. Physiol Rev 2001;81(2):741-66.
16. Cao X, Sudhof TC. A transcriptionally [correction of transcriptively] active complex of APP with Fe65 and histone acetyltransferase Tip60. Science 2001;293(5527):115-20.
17. Baek SH, Ohgi KA, Rose DW, et al. Exchange of N-CoR corepressor and Tip60 coactivator complexes links gene expression by NF-kappaB and beta-amyloid precursor protein. Cell 2002;110(1):55-67.
18. Kinoshita A, Whelan CM, Berezovska O, Hyman BT. The gamma secretase-generated carboxyl-terminal domain of the amyloid precursor protein induces apoptosis via Tip60 in H4 cells. J Biol Chem 2002;277(32):28530-6.
19. Sastre M, Steiner H, Fuchs K, et al. Presenilin-dependent gamma-secretase processing of beta-amyloid precursor protein at a site corresponding to the S3 cleavage of Notch. EMBO Rep 2001;2(9):835-41.
20. Pietrzik CU, Busse T, Merriam DE, et al. The cytoplasmic domain of the LDL receptor-related protein regulates multiple steps in APP processing. EMBO J 2002;21(21):5691-700.
21. Pietrzik CU, Yoon IS, Jaeger S, et al. FE65 constitutes the functional link between the low-density lipoprotein receptor-related protein and the amyloid precursor protein. J Neurosci 2004;24(17):4259-65.
22. Yoon IS, Pietrzik CU, Kang DE, Koo EH. Sequences from the low density lipoprotein receptor-related protein (LRP) cytoplasmic domain enhance amyloid beta protein production via the beta-secretase pathway without altering amyloid precursor protein/LRP nuclear signaling. J Biol Chem 2005;280(20):20140-7.
23. Demir E, Babur O, Dogrusoz U, et al. An ontology for collaborative construction and analysis of cellular pathways. Bioinformatics 2004;20(3):349-56.
24. Kitano H, Funahashi A, Matsuoka Y, Oda K. Using process diagrams for the graphical representation of biological networks. Nat Biotechnol 2005;23(8):961-6.
25. Stromback L, Lambrix P. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics 2005;21(24):4401-7.
26. Cook DL, Gerber AN, Tapscott SJ. Modeling stochastic gene expression: implications for haploinsufficiency. Proc Natl Acad Sci USA 1998;95(26):15641-6.
27. CellML. http://www.cellml.org
28. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19(4):524-31.
29. National Simulation Resource. http://nsr.bioeng.washington.edu/PLN.
30. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003;13(11):2498-504.
MINING GENE-DISEASE RELATIONSHIPS FROM BIOMEDICAL LITERATURE: WEIGHTING PROTEIN-PROTEIN INTERACTIONS AND CONNECTIVITY MEASURES

GRACIELA GONZALEZ, JUAN C. URIBE, LUIS TARI, COLLEEN BROPHY, CHITTA BARAL
Department of Biomedical Informatics and Computer Science and Engineering Department, Ira A. Fulton School of Engineering; Center for Metabolic Biology, Department of Kinesiology, Arizona State University, Tempe, Arizona 85281, USA

Motivation: The promises of the post-genome era's disease-related discoveries and advances have yet to be fully realized, with many opportunities for discovery hiding in the millions of biomedical papers published since. Public databases give access to data extracted from the literature by teams of experts, but their coverage is often limited and lags behind recent discoveries. We present a computational method that combines data extracted from the literature with data from curated sources in order to uncover possible gene-disease relationships that are not directly stated or were missed by the initial mining.

Method: An initial set of genes and proteins is obtained from gene-disease relationships extracted from PubMed abstracts using natural language processing. Interactions involving the corresponding proteins are similarly extracted and integrated with interactions from curated databases (such as BIND and DIP), assigning a confidence measure to each interaction depending on its source. The augmented list of genes and gene products is then ranked by combining two scores: one that reflects the strength of the relationship with the initial set of genes and incorporates user-defined weights, and another that reflects the importance of the gene in maintaining the connectivity of the network. We applied the method to atherosclerosis to assess its effectiveness.
Results: Top-ranked proteins from the method are related to atherosclerosis with accuracy between 0.85 and 1.00 for the top 20, and between 0.64 and 0.80 for the top 90 if duplicates are ignored, with 45% of the top 20 and 75% of the top 90 derived by the method rather than extracted from text. Thus, though the initial gene set and interactions were automatically extracted from text (and subject to the imprecision of automatic extraction), their use for further hypothesis generation is valuable given adequate computational analysis.
1. Introduction
Post-genome-project data and techniques available to the research community have exponentially increased the capacity of researchers to conduct experiments and publish results. The resulting deluge of biomedical literature, however, has reached a point that exceeds the capacity of any researcher to process and assimilate, making it difficult to realize the full benefit of these findings. From 1994 to 2004, close to 3 million biomedical articles were published by US and European researchers [1]. This publication rate has resulted in approximately 16 million publications currently indexed in PubMed.
Figure 1. Overview and data flow of the computational method presented here to mine the biomedical literature for genes potentially related to a specific disease.
Efforts have been made to extract data from articles and abstracts. For example, Entrez's OMIM [2] has summaries of published work that relate genes to diseases. However, it covers only about 20% of the human genes in the Entrez Gene database. A similar initiative for gene-function annotation, GeneRIF (Gene Reference Into Function), was started in 2002, but it covers only about 1.7% of all the genes in Entrez and 25% of human genes [3]. New findings usually take a long time to be reflected in curated sources such as these, and any computational method that relies solely on them will necessarily have its hands tied. To fill this void, the Collaborative Bio Curation (CBioC) project [4, 5] was started to bring together nuggets of information automatically extracted from the published biomedical literature and the intellectual power of a social network of researchers, who can rate the accuracy of the extraction. Extracted facts include protein-protein interactions and gene-disease and gene-bioprocess relationships. This paper describes a computational method that uses extracted facts from the CBioC database and integrates them with curated sources to find a set of proteins potentially related to a target disease, ranking them so that existing knowledge (known gene-disease relationships and curated protein-protein interactions) is balanced with the potential impact of new information (protein-protein interactions extracted from the literature) and the researcher's intuition. An assessment of the method through a study of atherosclerosis is also described and reported in the Results section. This balance of different factors, notably a network-connectivity impact measure for each gene, among others, marks the difference between our approach and others such as MedGene [6] and the method in [7]. The scope of the initial gene-disease data also differs, as well as the level of user interaction,
which is very limited in other approaches. A comparative view of these efforts is presented in the Related Work section and in Section 2.4. The resulting ranked list of genes and gene products can provide the basis for further focused experiments to investigate the genetic determinants of any disease. On top of helping to find gene-disease relationships that were not discovered in the information-extraction step (false negatives), this focused analysis could uncover as-yet-unexplored genetic linkages and provide insight into specific genetic and proteomic pathways related to any disease, as our study of atherosclerosis will show. The method is implemented in Java, using SQL to access the CBioC database (which is stored as a MySQL database). On-demand runs can be requested by contacting the authors. A web-based interface to the software is in development. The other sections cover the computational method, the results of applying the method to the study of atherosclerosis, and a comparison with related work.

2. The computational method
The computational method presented here takes a four-step approach to the task of finding and ranking genes and gene products related to a given disease, relying not only on automatic computation but also allowing (not requiring) user input at different levels. The method can be summarized as follows:
1. Obtain a list of genes or gene products known to be involved with the target disease from the CBioC [5] database.
2. Apply heuristics to unify variants of extracted names, and use HUGO [8] to normalize both the set obtained in the previous step and the names stored in CBioC. This will be referred to as the initial set.
3. Apply nearest-neighbor expansion to the initial set to build a protein interaction network using data from the CBioC database and curated databases. Analyze the connectivity of the network. The genes and proteins in this network (derived from the interactions) form the extended set.
4. Apply a heuristic scoring formula to the extended set to predict the proteins most likely related to the disease. One part of the formula measures the number of interactions of each gene in the extended set with proteins in the initial set, incorporating contextual information if indicated by the user. The second part measures the role of the protein in the connectivity of the protein network, since high degrees of local network interconnectivity can identify sets of functionally related proteins [9, 10].
Researchers can focus the analysis through different interventions. Figure 1 shows the data and process flow of the method. Each step is detailed next.
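Step 3, the nearest-neighbor expansion, amounts to keeping every interaction that touches the initial set; a minimal sketch (our own data layout, using (protein, protein, confidence) triples rather than the actual CBioC schema):

```python
def expand_network(initial_set, interactions):
    """One round of nearest-neighbor expansion: retain interactions touching
    the initial set; the extended set is every protein they involve."""
    network = [(a, b, conf) for a, b, conf in interactions
               if a in initial_set or b in initial_set]
    extended = set(initial_set)
    for a, b, _ in network:
        extended.update((a, b))
    return extended, network
```

The connectivity analysis of step 4 then operates on the returned network, treating each triple as a weighted edge.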
2.1. Initial set of disease-related genes and gene products

The initial set of genes and gene products of interest is obtained by querying the CBioC database using the disease of interest and any variants or synonyms of its name. CBioC uses a natural language processing extraction system, IntEx [11], which is based on the identification of syntactic roles, such as subjects, objects, verbs, and modifiers. English grammar dependencies reported by Link Grammar [12] are used to identify the roles and transform complex sentences of interest into triplets of the form (Entity1, interaction, Entity2). We extended IntEx to extract not only protein-protein interactions but also gene-disease relationships, using MeSH [13] terms under the disease category to recognize them in the abstracts. Even though the natural language processing approach allows for more precise extractions than co-occurrence [11], the gene-disease relationships and protein-protein interactions extracted directly from the literature are not perfect. In fact, IntEx reports a 65.7% precision in extracted interactions [11]. Thus, there will be genes and gene products in the initial set that are not related to the disease (false positives), just as there will be others that are not retrieved even though they are related (false negatives). The protein-interaction network analysis and the incorporation of protein-protein interactions from curated sources help assuage the impact of these problems. Also, users might filter the initial set to narrow the focus to a particular set of genes and gene products.

2.2. Unifying extracted gene and protein names

One of the challenges of using data extracted directly from biomedical texts is the great variety of names used for the same entity: one gene or gene product might appear under different synonyms and variants.
For example, HNF4A might appear as hepatocyte nuclear factor 4 alpha or any of a number of aliases (such as HNF4, MODY, TCF, or TCF14), or variants of any of these, such as HNF4-alpha or HNF 4A. An additional problem is that the triplets in the CBioC database sometimes include modifiers that were in the same noun phrase or modifying phrase, such as "HNF4A protein" or "HNF4A mutation". It was necessary to unify (normalize) the names so that, when the protein network is built, all the interactions of the same protein are clustered into a single node. A naive normalization algorithm was applied to entries in the CBioC database to eliminate non-essential words (such as "protein" or "mutation" at the end of a name), in order to then find the official abbreviation in the HUGO [8] database.

2.3. Build the protein network

The CBioC database is queried for any and all interactions involving the genes and gene products in the initial set. On top of the extracted interactions, CBioC
integrates interaction data from BIND [14], MINT [15], DIP [16], IntAct [17], and BioGRID [18]. A nearest-neighbor algorithm is run to build a protein interaction network, noting the confidence level for each interaction as follows:
1. If the interaction comes from any of the curated sources, its confidence level is noted as 1.
2. If the interaction comes from CBioC and it has received "Yes" votes from the community of users, its confidence level is noted as .65 plus .07 for each "Yes" vote, up to 1. CBioC counts only one vote per user per fact.
3. If the interaction comes from CBioC and has not been rated by any user, its confidence level is given as .65 (the measured precision of IntEx [11]).

2.4. Rank the genes and gene products in the expanded network

To rank the genes in the resulting set, we score each gene or gene product based on the number and confidence levels of its interactions with proteins in the initial set, and combine this measure with another that reflects how relevant it is for maintaining the protein network's connectivity. Both measures are important. The first helps discover the most active proteins with respect to the disease (high precision), preferring interactions with the highest confidence level (high fidelity), while the second finds those that could potentially play a crucial role in a pathway related to the disease or that are very likely related to the known (extracted) genes, as high degrees of local network interconnectivity can identify sets of functionally related proteins [9, 10]. The first score also incorporates user-defined weights. For example, given interactions as triplets (Entity1, interaction-term, Entity2), users might indicate that interactions that include "phosphorylates" as an interaction term should be given greater weight. Let us assume for now that no user weights are defined. We use a variation of the formula given in [7] for this level, removing a bias towards the initial set that the formula in [7] suffers from.
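The per-interaction confidence rules of Section 2.3 reduce to a few lines (a sketch; the source tag and vote-count parameters are our own hypothetical encoding):

```python
def interaction_confidence(source, yes_votes=0):
    """Confidence per Section 2.3: curated sources get 1; CBioC extractions
    start at .65 (IntEx's measured precision) plus .07 per "Yes" vote, capped at 1."""
    if source == "curated":  # BIND, MINT, DIP, IntAct, BioGRID
        return 1.0
    return min(1.0, 0.65 + 0.07 * yes_votes)
```

For example, a CBioC-extracted interaction with two community "Yes" votes would be assigned confidence 0.79.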
Let
• A be the extended set of proteins (initial set plus interactions),
• N(i) be the set of proteins in the initial set interacting with protein i,
• p(i,j) be the confidence level of the interaction between proteins i and j,
• N(i,j) = 1 if protein i ∈ A and j ∈ N(i), and 0 otherwise.

Then a score t_i is assigned to each protein i by applying Eq. (1):

t_i = u_i^2 * |N(i)|    (1)

u_i = ( Σ_{j ∈ N(i)} p(i,j) ) / |N(i)|    (2)

In Equation (2), u_i is the average confidence level of the interactions involving i. Equation (1) results from expanding the formula used in [7], noting that in [7], N(i) is the set of proteins interacting with protein i and N(i,j) = 1 if j ∈ N(i) ∩ A.
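Reading the ranking score as t_i = u_i^2 · |N(i)|, with u_i the mean confidence over the neighbors N(i), the computation can be sketched as follows (names are illustrative, not from the paper):

```python
def rank_score(confidences):
    """t_i = u_i^2 * |N(i)|, where `confidences` lists p(i,j) for each
    j in N(i) and u_i is their mean (Eq. 2 of the text)."""
    if not confidences:
        return 0.0  # a protein with no neighbors in the initial set scores 0
    u = sum(confidences) / len(confidences)
    return u * u * len(confidences)
```

For example, a protein with two neighbors at confidences 1.0 and 0.5 has u_i = 0.75 and score 0.75^2 * 2 = 1.125.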
The formula used in [7] is

t_i = exp( k * ln( Σ_{j ∈ N(i)∩A} p(i,j) / |N(i)∩A| ) + ln |N(i)∩A| )
    = ( Σ_{j ∈ N(i)∩A} p(i,j) / |N(i)∩A| )^k * |N(i)∩A|    (3)

From the last expression, using k = 2 as in [7], and noting that |N(i)∩A| = |N(i)| (since only interactions in A are included), we get Eq. (1). Since u_i
H = - Σ_aa P_aa log2 P_aa    (1)

was calculated for each column, where P_aa is defined as

P_aa = S_aa(i) / Σ_aa S_aa(i),

where S_aa(i) = Σ_{m=1}^{n_aa} w_m, w_m is the sequence weight, and n_aa is the number of amino acid residues of a particular type seen in the column. Because of the minus sign in Equation (1), lower numbers indicate greater evolutionary conservation. A MATLAB implementation of Henikoff weighting and the sequence weighting-based Shannon entropy calculation are available on request.

2.3. Interface definition

Domain interfaces were defined according to PiBase, a database of domain interactions from x-ray crystal structures in the PDB that uses a 5.5 Å cutoff for heavy-atom interatomic distances to define residues at an interface [21]. The functional unit of ABC transporters is two transmembrane domains (TMD) complexed with two nucleotide-binding domains (NBD) [3, 13, 5]. For the complete ABC transporter structure BtuCD, we define three interfaces: NBD/TMD, NBD/NBD and TMD/TMD. For the dimeric structures we define only the NBD/NBD interfaces. The structure of each domain was aligned to the BtuCD structure (for TMD/NBD interactions) and the MJ0796 structure (for NBD/NBD interactions) with the salign routine in MODELLER [17]. All residues that aligned to interface residues in BtuCD or MJ0796 were predicted to also be interface residues.

2.4. Homology transfer annotations

We used the multiple sequence alignments to predict the locations of interface amino acid residues in each of the human ABC transporter NBDs. We assume that if a residue aligns to a known interface residue in the MJ0796 structure (for NBD/NBD alignments) or the BtuCD structure (for NBD/TMD alignments), it is also at an interface in homologous family members. Residues that aligned to defined interface residues (Interface definition) were examined for disease associations as annotated in the VARIANT records of the Uniprot database [16].

2.5. Surface conservation

We used the molecular graphics visualization program Chimera [22] to identify sites of putative binding interactions that have not yet been functionally characterized, by locating surface regions of medium to high conservation (excluding defined interface sites). Medium conservation is defined as no greater than half of the highest column entropy found in a given alignment (Figure 2B).
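The sequence-weighted column entropy described above can be sketched as follows; this is a minimal Python version (the authors' implementation is in MATLAB, and the function name is illustrative):

```python
import math
from collections import defaultdict

def column_entropy(residues, weights):
    """Sequence-weighted Shannon entropy of one alignment column:
    H = -sum_aa P_aa log2 P_aa, with P_aa proportional to the summed
    per-sequence (e.g. Henikoff) weights of residue type aa."""
    s = defaultdict(float)
    for aa, w in zip(residues, weights):
        s[aa] += w  # S_aa: accumulate sequence weights per residue type
    total = sum(s.values())
    return -sum((v / total) * math.log2(v / total) for v in s.values())
```

A fully conserved column gives H = 0; a column split evenly between two residue types gives H = 1 bit, so lower values indicate greater conservation, as in the text.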
[Figure 1 image: the NBD dimer structure with its ATP binding site, NBD/NBD interface, and NBD/TMD interface regions labeled.]
Figure 1. ABC transporter interdomain interfaces. Interfaces are defined according to PiBase [21] and are mapped onto the representative structure of the MJ0796 ABC transporter nucleotide binding domain (NBD) dimer from M. jannaschii (PDB ID: 1L2T) [6].
Figure 2. Disease-associated residues at putative ABC transporter interfaces.
[Figure 2 image: panels A and B with labeled residues G480-CFTR, D614-CFTR, T668-ABCD1, and D1099-ABCA1.]
(A) A close-up of the first nucleotide binding domain of the human CFTR (PDB ID: 1XMI) [15]. Interface residues were defined using homology transfer annotation based on the structure of an NBD dimer from M. jannaschii (PDB ID: 1L2T) [6] and are shown in gold. Residues with known cystic fibrosis associations at the NBD/NBD interface are shown in black (Table 2). An N-terminal helix in the CFTR structure is hidden to show the complete interface as defined by the 1L2T structure. (B) The exposed, non-NBD surface of the 1L2T structure. Residue positions in yellow have entropies of no more than 2.1 bits. Residue positions in black are associated with cystic fibrosis (CFTR), adrenoleukodystrophy (ALD) and high-density lipoprotein deficiency type 2 (ABCA1).
3. Results

3.1. Differential conservation of interfaces in ABC transporters

We examined evolutionary conservation at the amino acid residue level for three different interfaces in ABC transporter structures (Interface definition).
[Figure 3 plot: mean column entropy (y-axis, roughly 1.00-3.00 bits) for the ATP binding site, NBD/NBD, NBD/noATP, TMD/TMD, and random sites (legend), plotted for the structures 1L7V, 1Z2R, 1XEF, 1L2T, 1JJ7, and 1XMI (x-axis).]
Figure 3. Evolutionary conservation at binding and interface sites in six ABC transporter structures. The ATP binding site, which forms part of the interface between the nucleotide binding domains (NBDs) has the lowest entropy due to highly conserved residues in the Walker A, B and 'signature' motifs. The NBD/NBD interface is well conserved even when the ATP-binding residues are removed from consideration (NBD/noATP). The TMD interfaces, both with the cognate NBD and the cognate TMD (only definable for 1Z2R and 1L7V) are not highly conserved.
We were only able to define the TMD/TMD interface for the two complete structures, 1Z2R and 1L7V. We found that the NBD/NBD interface was consistently more conserved than either the NBD/TMD or TMD/TMD interface (Figure 3, Discussion).

3.2. Disease-associated mutations at ABC transporter interfaces

We found a total of 68 disease-associated positions at PiBase-defined interfaces in 10 transporters (Table 2, Figure 2A). Of these positions, 65 were single residue mutations and three were deletions. Thirty-eight mutations were at the NBD/NBD interface and 30 at the NBD/TMD interface. We also found conserved surface residues that included two positions associated with disease in several ABC transporters. These residues correspond to the 1L2T residues 1, 2, 31, 60, 164, and 213. There are 587 total known disease mutations in the 10 transporters, of which 504 are found in ABCC7, ABCD1 or ABCA4 [16].
Table 2. Disease-associated mutations at putative ABC transporter interfaces. Human protein residues that aligned with the NBD/NBD or NBD/TMD interface were examined for disease association using Uniprot [16]. The two interfaces overlap by two residues.

Transporter: Disease(s)
ABCA1: High density lipoprotein deficiency type 2
ABCA3: Respiratory distress syndrome
ABCA4: Stargardt disease (STGD), Fundus flavimaculatus (FFM), Age-related macular degeneration 2 (ARMD2)
ABCA12: Lamellar ichthyosis
ABCC2: Dubin-Johnson syndrome
ABCC6: Autosomal recessive pseudoxanthoma elasticum
ABCC7/CFTR: Cystic fibrosis
ABCC8: Persistent hyperinsulinemic hypoglycemia of infancy
ABCD1: Adrenoleukodystrophy
ABCG5: Sitosterolemia

NBD/NBD interface mutations: N935S; N568D; R943W (STGD/FFM); N965S (STGD); S1063P (STGD); E1087D/K (STGD); G1091E (FFM); G1975R (STGD); E2096K (STGD); H2128R (STGD); N1380S; G1381E; E1539K; R768W; T1301I; G1302R; Q1347H; G458V; S549I/N/R; G551S; R553Q; D579G; G1244E; G715V; V1359M; G1377R; G1380S; R1435Q; E1505K; G507V; S552P; S606P/L; G608D; E609G/K; E630G; S633I; V635M; S636I

NBD/TMD interface mutations: L1014R (STGD); T1019A (STGD); K1031E (STGD); E1036K (STGD); V1072A (STGD); L2027F (STGD/FFM); R2030Q (STGD/FFM); L2035P (STGD); Q1382R; ΔM1393; R1314Q; Q1347H; D1361N; S492F; E504Q; ΔF507; ΔF508; W1282R; R1283M; F1286S; N1303H; R1392H; R1419C; R1435Q; P543L; S552P; Q556R; P560R; M566K; E146Q
4. Discussion

We have comprehensively mapped known disease-associated mutations to putative interfaces and found that 68 disease-associated positions in 10 transporters fall at putative interfaces. This indicates that a majority of disease-associated ABC transporters (10/17) have mutations at interface regions. Single residue point mutations were the most common and accounted for 65 of the disease-associated positions; the other three were single residue deletions. Thirty-eight mutations were at the NBD/NBD interface and 30 at the NBD/TMD interface. We hypothesize that many disease-associated mutations involving ABC transporters may be due to disruption of domain-domain binding interactions. Proper function of ABC transporters involves cycles of substrate binding and release, currently thought to be governed by an 'ATP switch'-type mechanism in which ATP binding and hydrolysis cause formation and dissociation of an NBD/NBD dimer. The switch between open and closed dimer states causes conformational changes in the TMDs that enable substrate transport [5]. While large conformational changes have been seen in mammalian ABC transporters using electron microscopy [23, 24], knowledge of the specific residue interactions at the NBD/NBD and TMD/NBD interfaces that are involved in these changes in human transporters is lacking. As noted earlier, interface mutations can disrupt ABC transporter domain interactions in several ways: by interfering with ATP binding or hydrolysis, by destabilizing or preventing proper folding and association of the domains, or by interfering with the allosteric communication between domains that is suggested by the large conformational changes seen during the transport cycle. Defining residues at these interfaces is useful to experimentalists interested in examining specific residue interactions that stabilize or abrogate interface interactions.
For example, in CFTR, a hypothesized hydrogen bond between R555 in NBD1 and T1246 in NBD2 stabilizes the open, chloride-transporting state of the protein [27]. The high conservation of the NBD/NBD interface at the superfamily level suggests that there are likely additional residue interactions that stabilize dimer formation and facilitate transport. The relative lack of conservation at the TMD/NBD interface and the large number of disease-associated mutations at this interface (Table 2) may indicate that NBD/TMD mutations lead to defects in folding and maturation rather than directly affecting the function of a properly processed, intact transporter. The ΔF508 mutant falls at the TMD/NBD interface and leads to an immature protein that is tagged for degradation and does not localize properly to the cell
membrane [2]. A recent study showed that mutating the analogous residue in P-glycoprotein (MDR1), Y409, also led to an immature form of the protein with an altered NBD/NBD interface. This observation indicates a misfolded protein with improper or incomplete domain associations [25]. Another predicted TMD/NBD interface mutant, R1435Q in ABCC8, could not form functional KATP channels and showed 10-fold reduced expression compared to wild-type ABCC8. Either protein instability or defective transport to the cell membrane could cause this phenotype [26]. Alternatively, the lack of conservation at this interface might suggest that TMD/NBD interactions are subfamily specific, in contrast to the overall high conservation of residues at the NBD/NBD interface. Given the 30 disease mutants at this interface, the lack of conservation at the NBD/TMD interface does not indicate that this region is unimportant for ABC transporter function. Rather, it suggests that instead of a large conserved interaction footprint as seen at the NBD/NBD interface, perhaps a small number of conserved residues form the necessary contacts for communication between the domains. In the TMD of BtuCD, A221 in the L2 loop is one of only three moderately to highly conserved residues in the TMD. In MsbA, the residues G122 and E208 are well conserved and contribute to the TMD/NBD interface. We also suggest a possible interaction site distinct from those observed in the crystallographic structures, based on conserved surface residues and disease associations in human ABC transporters (Figure 2B). Surface residues not at defined interfaces are generally not well conserved in our analysis (Appendix Figure A1). However, a moderately to highly conserved region on the surface of the 1L2T structure includes the aligned human mutations D1099Y in ABCA1, associated with high density lipoprotein deficiency type 2; D614G in CFTR, associated with cystic fibrosis; and T668I in ABCD1, associated with adrenoleukodystrophy.
Observing three different transporters with disease-associated mutations at the same solvent-exposed position suggests that this position is conserved for a functional reason. If the residues indeed form part of an interaction site with an unknown partner, that partner might also be conserved in multiple transporters. Alternatively, these residues could indicate a region that stabilizes oligomerization of complete ABC transporters. This example also demonstrates the utility of homology transfer annotation for locating functionally important residues. There is little experimental data available defining the specific effect of disease-associated mutations on ABC transporters. A recent review noted that the majority of CFTR mutants have not been experimentally characterized [2].
The difficulty of working with these large membrane proteins underscores the need for computational analysis that provides experimentally verifiable hypotheses for the mechanism of domain interactions in ABC transporters. We used this analysis to prioritize residues selected to experimentally probe domain interactions in the human multidrug ABC transporter P-gp. We will apply our method to new ABC transporter structures as they become available, and we intend to explore other measures of residue conservation, including site-specific mutation rates and coevolving residues, in the future [28, 29].
Acknowledgments We thank John Chodera (UCSF) and our reviewers for comments on the manuscript as well as Dr. Deanna Kroetz, Dr. Kathy Giacomini, Jason Gow and Marco Sorani (UCSF) for helpful discussions about membrane transporters. This work is supported by NIH (F32 GM-072403-02, U01 GM61390, U54 GM074929-G1, R01 GM54762), the Burroughs Wellcome foundation, the Sandler Family Supporting Foundation, SUN, IBM, and Intel.
Appendix
Figure A.1. Representative sequence conservation at putative ABC transporter interfaces. Residue conservation was mapped onto the structure of an NBD dimer from M. jannaschii (PDB ID: 1L2T) [6]. Conservation is colored from black to white, with black indicating high conservation and white indicating low conservation. The TMD/NBD interface region (left panel), defined by alignment to the BtuCD structure, is circled and shows low conservation. The NBD interface is visible as a curve of high conservation extending from one ATP molecule (shown in stick) to the other. The right panel is rotated 180 degrees horizontally and shows some solvent-exposed regions of higher conservation.
References
1. S. V. Ambudkar, et al., Annu. Rev. Pharmacol. Toxicol., 39, 361 (1999).
2. D. Gadsby, P. Vergani, and L. Csanady, The ABC protein turned chloride channel whose failure causes cystic fibrosis. Nature, 440, 477 (2006).
3. K. Locher, A. Lee, and D. Rees, The E. coli BtuCD structure: a framework for ABC transporter architecture and mechanism. Science, 296, 1091 (2002).
4. M. Dean and T. Annilo, Evolution of the ATP-binding cassette (ABC) transporter superfamily in vertebrates. Annu. Rev. Genomics Hum. Genet., 6, 123 (2005).
5. C. F. Higgins and K. J. Linton, The ATP switch model for ABC transporters. Nat. Struct. Mol. Biol., 11, 918 (2004).
6. P. C. Smith, et al., ATP binding to the motor domain from an ABC transporter drives formation of a nucleotide sandwich dimer. Mol. Cell, 10, 139 (2002).
7. J. Zaitseva, et al., H662 is the linchpin of ATP hydrolysis in the nucleotide-binding domain of the ABC transporter HlyB. EMBO J., aop (2005).
8. J. Chen, G. Lu, J. Lin, A. L. Davidson and F. A. Quiocho, A tweezers-like motion of the ATP-binding cassette dimer in an ABC transport cycle. Mol. Cell, 12, 651 (2003).
9. A. Bhatia, H. J. Schafer and C. A. Hrycyna, Oligomerization of the human ABC transporter ABCG2: evaluation of the native protein and chimeric dimers. Biochemistry, 44, 10893 (2005).
10. J. Xu, Y. Liu, Y. Yang, S. Bates and J. T. Zhang, Characterization of oligomeric human half-ABC transporter ATP-binding cassette G2. J. Biol. Chem., 279, 19781 (2004).
11. C. Nichols, KATP channels as molecular sensors of cellular metabolism. Nature, 440, 470 (2006).
12. C. Li and A. Naren, Macromolecular complexes of cystic fibrosis transmembrane conductance regulator and its interacting partners. Pharmacology & Therapeutics, 108, 208 (2005).
13. L. Reyes and G. Chang, Structure of the ABC transporter MsbA in complex with ADP·vanadate and lipopolysaccharide. Science, 308, 1028 (2005).
14. R. Gaudet and D. C. Wiley, Structure of the ABC ATPase domain of human TAP1, the transporter associated with antigen processing. EMBO J., 20, 4964 (2001).
15. H. A. Lewis, et al., Structure of nucleotide-binding domain 1 of the cystic fibrosis transmembrane conductance regulator. EMBO J., 23, 282 (2004).
16. A. Bairoch, et al., The Universal Protein Resource (UniProt). Nucleic Acids Res., 33 (2005).
17. U. Pieper, et al., MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res., 34 (2006).
18. S. F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389 (1997).
19. C. E. Shannon, A mathematical theory of communication. Bell Sys. Tech. J., 27, 379 (1948).
20. S. Henikoff and J. Henikoff, Position-based sequence weights. Journal of Molecular Biology, 243, 574 (1994).
21. F. P. Davis and A. Sali, PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics, 21, 1901 (2005).
22. E. F. Pettersen, et al., UCSF Chimera - a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605 (2004).
23. M. F. Rosenberg, et al., Purification and crystallization of the cystic fibrosis transmembrane conductance regulator (CFTR). J. Biol. Chem., 279, 39051 (2004).
24. M. F. Rosenberg, et al., Repacking of the transmembrane domains of P-glycoprotein during the transport ATPase cycle. EMBO J., 20, 5615 (2001).
25. T. W. Loo, M. C. Bartlett, and D. M. Clarke, Processing mutations located throughout the human multidrug resistance P-glycoprotein disrupt interactions between the nucleotide binding domains. J. Biol. Chem., 279, 38395 (2004).
26. Y. Tanizawa, et al., Genetic analysis of Japanese patients with persistent hyperinsulinemic hypoglycemia of infancy: nucleotide-binding fold-2 mutation impairs cooperative binding of adenine nucleotides to sulfonylurea receptor 1. Diabetes, 49, 114 (2000).
27. P. Vergani, et al., CFTR channel opening by ATP-driven tight dimerization of its nucleotide-binding domains. Nature, 433, 7028 (2005).
28. Z. Yang and S. Kumar, Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol. Biol. Evol., 13, 650 (1996).
29. S. W. Lockless and R. Ranganathan, Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286, 295 (1999).
LTHREADER: PREDICTION OF LIGAND-RECEPTOR INTERACTIONS USING LOCALIZED THREADING

JADWIGA BIENKOWSKA(1,2,*) and BONNIE BERGER(1,*)
(1) Computer Science and Artificial Intelligence Laboratory, MIT; (2) Biomedical Engineering Dept., Boston University. (*) Corresponding author.

Identification of ligand-receptor interactions is important for drug design and treatment of diseases. Difficulties in detecting these interactions using high-throughput experimental techniques motivate the development of computational prediction methods. We propose a novel threading algorithm, LTHREADER, which generates accurate local sequence-structure alignments and integrates statistical and energy scores to predict interactions within ligand-receptor families. LTHREADER uses a profile of secondary structure and solvent accessibility predictions with residue contact maps to guide and constrain alignments. Using a decision tree classifier and low-throughput experimental data for training, it combines information inferred from statistical interaction potentials, energy functions, correlated mutations and conserved residue pairs to predict likely interactions. The significance of predicted interactions is evaluated using the scores for randomized binding surfaces within each family. We apply our method to cytokines, which play a central role in the development of many diseases including cancer and inflammatory and autoimmune disorders. We tested our approach on two representatives from different structural classes (all-alpha and all-beta proteins) of cytokines. In comparison with the state-of-the-art threader RAPTOR, LTHREADER generates on average 20% more accurate alignments of interacting residues. Furthermore, in cross-validation tests, LTHREADER correctly predicts experimentally confirmed interactions for a common binding mode within the 4-helical long chain cytokine family with 75% sensitivity and 86% specificity. For the TNF-like family our method achieves 70% sensitivity with 55% specificity.
This is a dramatic improvement over existing methods. Moreover, LTHREADER predicts several novel potential ligand-receptor cytokine interactions. Supplementary website: http://theory.csail.mit.edu/~vinyAthreader
1. Introduction
Proteins are essential for the proper operation of living cells and viruses, performing a wide variety of functions. Most often, they do so by interacting with other proteins. The study of these interactions is extremely important, as many diseases can be traced to undesirable or malfunctioning protein-protein interactions (PPIs). Currently, methods exist for predicting PPIs that have achieved some degree of success, relying mostly on data obtained from high-throughput (HTP) experiments such as yeast-two-hybrid screens. Receptors are proteins embedded within the cell membrane. Interactions with their extra-cellular ligands occupy a central role in inter-cellular signaling and biological processes that lead to the development and progression of many diseases. Of particular importance to human diseases are cytokines. Cytokine
interactions with their receptors are responsible for innate and adaptive immunity, hematopoiesis and cell proliferation. The etiology of cancer and autoimmune disorders can be attributed in part to cytokine signaling through their receptors. For example, the long-chain 4-helical bundle cytokines erythropoietin and human growth hormone are already used for the treatment of cancer and growth disorders. Many other therapies altering cytokine-receptor interactions are in clinical development [1]. However, ligand-receptor (L-R) interactions are much more difficult to predict than general PPIs, and methods that work well for PPIs often fail when applied to L-R binding pairs. In particular, the lack of high-throughput experimental data for these interactions makes it difficult to apply existing prediction methods that depend on this information (see Related Work). We consider the problem of predicting whether a ligand and receptor interact, given only their sequence information and several confirmed L-R PPIs among members of the same structural SCOP family [2]. Even when one or more complex structures are available within an L-R family, it is often a challenge to effectively use this information to predict interactions among other members of the family. One reason is the difficulty in identifying the interacting residues that are common among distant family members. The conformational differences that often occur at the interface of bound proteins make such identification non-obvious. Our approach is to thread the sequences onto the binding interface of a solved L-R complex and to evaluate the complementarity of the resulting surface.
In so doing, we face four challenges: (1) identifying the residues at the binding interface that are common to an L-R family; (2) threading the query sequences onto the binding interface; (3) scoring the resulting threaded sequences in order to differentiate between binding and non-binding partners; and (4) evaluating the significance of the predicted interaction scores. Related Work. Many computational approaches have been applied to prediction of PPIs such as: threading of structural complexes [3] and scoring them with statistical potentials [4]; correlated mutations [5-8]; and docking methods using physical force fields [9, 10]. However, the performance of all of these methods is highly dependent on the accuracy of the alignment to the structural template, and thus for distantly related proteins is more prone to errors. For example, the PPI predictor InterPrets [11] cannot find a confident match for any of the sequences from the cytokine families that we consider. Integrative machine learning methods also have been applied to prediction of PPIs and networks [12, 13]. Many of these approaches rely on HTP experimental PPI data itself as a predictor and this information is scarce for L-R pairs.
Contributions. This paper proposes a novel threading algorithm, LTHREADER, which first incorporates secondary structure (SS) and relative solvent accessibility (RSA) predictions with residue contact maps to guide and constrain alignments. While existing threading algorithms (e.g. RAPTOR [3]) are not so successful at aligning interacting residues in sequences with low homology [15], LTHREADER achieves much higher accuracy (see Section 3.1). Given interaction data from gold-standard low-throughput experiments, LTHREADER predicts L-R interactions using statistical and energy scores. We apply our algorithm to the cytokines, performing significantly better than existing in silico methods (see Section 3). We investigate two structurally distinct cytokine families: the 4-helical bundle cytokines and the TNF-like family belonging to the all-beta structural class. Cytokine interactions with receptors are particularly difficult to predict because they display a high level of structural similarity but almost no sequence similarity, preventing the effective use of simple homology-based methods or general threading techniques. Furthermore, little experimental interaction data exists for cytokine interactions, and the structures of only a few cytokine-receptor complexes have been determined. Therefore, accurate prediction of cytokine interactions is a good indicator of the success we can achieve with our algorithm. Finally, our method predicts previously undocumented cytokine interactions which may have implications for diseases. We evaluate the significance of our predictions by comparing them to those of randomized interaction surfaces.

2. Algorithm
Overview. LTHREADER threads two given protein sequences onto a representative template complex in order to determine and score the putative interaction surface. Our interaction prediction algorithm is divided into three stages (Figure 1). In the first stage (Figure 1, Stage 1), from the set (at least two) of template complexes, we determine the residues that are most likely to be involved with L-R binding. We do this by generating a multiple alignment of clusters of interacting residues from each complex and determining the positions that are most conserved. We build a generalized profile for each position in the alignment of interacting residues [16]. In the second stage (Stage 2), the profile is used to identify the most likely location of interacting residues in the query sequences. The locations of the interacting residues in the query sequences define the putative interaction surface. In the third stage, this surface is scored using several methods and an interaction prediction is made using a decision tree classifier (Stage 3). The significance of the classification is then evaluated by
estimating the probability of predicting an interaction between the L-R pair using a randomized interaction surface.
Figure 1: Schematic of LTHREADER. In Stage 3, CM is the compensatory mutation score, SP the statistical potential score, FF the force field score, and CR the conserved residue score.
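Stage 3 combines the four surface scores (CM, SP, FF, CR in Figure 1) with a trained decision tree classifier. A toy hand-built stand-in gives the flavor; the thresholds and tree structure below are illustrative placeholders, not the classifier trained by LTHREADER:

```python
def predict_interaction(cm, sp, ff, cr):
    """Toy stand-in for the Stage 3 decision tree over the four scores
    (compensatory mutation, statistical potential, force field, conserved
    residue). All thresholds are hypothetical; LTHREADER learns its tree
    from gold-standard low-throughput interaction data."""
    if cr >= 0.5:                      # strongly conserved residue pairs
        return cm >= 0.3 or sp <= 0.0  # accept with supporting evidence
    return ff <= -2.0 and sp <= -0.5   # otherwise require favorable energies
```

In practice such a tree would be fit with a standard library (e.g. a CART implementation) on labeled interacting and non-interacting L-R pairs.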
2.1. Stage 1: Generation of Localized Profiles for Interaction Cores

In this stage, we assume that if a set of ligands and receptors have similar structures and binding orientation, then their corresponding interface surfaces will align well. We first examine the L-R pairs that have solved structures for their bound complex and align the ligand and receptor structures separately using POSA [17]. Then, clusters of interacting residues are identified within these complexes and mapped to their corresponding ligand and receptor sequences, thus delimiting core regions of interaction within each sequence. Given a set (minimum two) of complexes, the positions of the cores are then optimized to ensure that the locations of the interactions contained in the clusters overlap as much as possible between complexes. Finally, generalized profiles are computed for each residue in the core regions of all pairs of L-R sequences.

Clustering of Residue Interactions. For two interacting domains in a complex structure we define the interface residues as those in contact with residues from the other domain. We define two residues to be in contact if the distance between any two of their heavy atoms is less than 4.5 Å. This cutoff is the same as that used by Lu et al. [4] to determine statistical potentials for contacting residues. We define a contact map as a matrix C such that c_ij = 1 if the i-th residue of the ligand and the j-th residue of the receptor interact, and c_ij = 0 if they do not. Given a contact map C, we group together clusters of interacting pairs (non-zero entries of C) using a simple index-based distance function to determine inclusion. The distance between two interacting pairs {i1,j1} and {i2,j2} in C, where i1, j1 are the ligand and receptor indices for the first interacting pair and i2, j2 for the second, is defined as

dist({i1,j1}, {i2,j2}) = sqrt((i1 - i2)^2 + (j1 - j2)^2),

with the distance taken to be infinite when any two of the residues do not interact.
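The heavy-atom contact criterion (any atom pair within 4.5 Å) can be sketched as follows; the coordinate encoding (lists of (x, y, z) heavy-atom tuples per residue) and the function name are illustrative assumptions:

```python
import math

def contact_map(ligand, receptor, cutoff=4.5):
    """C[i][j] = 1 if any heavy-atom pair between ligand residue i and
    receptor residue j lies within `cutoff` angstroms; 0 otherwise.
    Residues are given as lists of (x, y, z) heavy-atom coordinates."""
    def in_contact(res_a, res_b):
        return any(math.dist(a, b) < cutoff for a in res_a for b in res_b)
    return [[1 if in_contact(ra, rb) else 0 for rb in receptor] for ra in ligand]
```

The nested `any` checks every atom pair, so a single close atom pair is enough to mark two residues as contacting.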
Interacting residue pairs that are separated by a distance, dist, of less than three are considered members of the same cluster. A cluster in contact map C implies a corresponding sub-matrix whose non-zero entries are members of that cluster. Note that cluster edges delimit a contiguous sequence stretch on both the ligand and receptor sequences, referred to as a core (see Figure 2). Thus we can define a notation for indexing a cluster by the index of its corresponding cores in the ligand and receptor. Given contact map C, we denote C^{k,l} as the sub-matrix containing the cluster indexed by the k-th core in the ligand and the l-th core in the receptor. The size and position of C^{k,l} within C can vary as long as the requirement that only one cluster can be contained within C^{k,l} is not violated.
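The grouping of interacting pairs can be sketched as single-linkage clustering over the non-zero contact-map entries; this sketch assumes a Euclidean index distance and uses a simple union-find, both implementation choices of ours rather than details from the paper:

```python
def cluster_contacts(C, max_dist=3.0):
    """Single-linkage grouping of interacting pairs (non-zero entries of C):
    two pairs join the same cluster when their index distance is below
    `max_dist`. Union-find with path halving keeps the sketch short."""
    pairs = [(i, j) for i, row in enumerate(C) for j, v in enumerate(row) if v]
    parent = list(range(len(pairs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i1, j1), (i2, j2) = pairs[a], pairs[b]
            if ((i1 - i2) ** 2 + (j1 - j2) ** 2) ** 0.5 < max_dist:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for idx, p in enumerate(pairs):
        clusters.setdefault(find(idx), []).append(p)
    return list(clusters.values())
```

Each returned cluster's row and column extents then delimit a ligand core and a receptor core, as described in the text.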
Figure 2: An illustration of how ligand (red) and receptor (blue) cores are derived from a clustering of interactions within the interaction map (at right). The yellow dots correspond to interacting residues and the green dots in the interaction map indicate an interaction. A black line in the cartoon on the left denotes that an interaction occurs between the residues at its endpoints.
Alignment of Clusters for a Pair of Ligand-Receptor Complexes. The next step of our algorithm optimizes the length and location of cores within a pair of L-R complexes so that the similarity score of corresponding clusters is maximized. Let C be the contact map for the first complex, and D the contact map for the second. Let m be the number of cores in the ligands for both complexes, and n the number of cores in the receptors for both complexes. Let C^{k,l} refer to the k,l-th cluster in C, and D^{k,l} refer to the corresponding k,l-th cluster in D. We set the height and width of both sub-matrices to the maximum of the height and width of each sub-matrix. (Note that this accounts for the rare case when two clusters in one complex map to a single larger cluster in another.) The precise alignment of the interaction cores is the goal of the following optimization procedure. For the k,l-th cluster we fix the starting position of C^{k,l}, but allow the starting position of D^{k,l} to vary. Let D^{k,l}_{p,q} be equal to D^{k,l} offset by p along the first dimension of D and offset by q along the second dimension. Our goal then is to maximize the objective function,
f(p_{1,1}, q_{1,1}, ..., p_{m,n}, q_{m,n}) = Σ sim(C^{k,l}, D^{k,l}_{p_{k,l}, q_{k,l}}), for 1 ≤ k ≤ m and 1 ≤ l ≤ n, subject to the following constraints:
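The offset search in this objective can be sketched as follows. The similarity function here (overlap of non-zero entries) and all dimensions are invented stand-ins, not the paper's sim.

```python
# Illustrative sketch: for one cluster pair, slide D^{k,l} by offsets
# (p, q) and keep the offset maximizing a similarity score. `sim` is a
# stand-in scoring function, not the paper's.
import numpy as np

def sim(C_sub, D_sub):
    return int(np.sum(C_sub * D_sub))  # overlap of non-zero entries

def best_offset(C_sub, D, top, left, max_shift=2):
    h, w = C_sub.shape
    best = (0, 0, -1)
    for p in range(-max_shift, max_shift + 1):
        for q in range(-max_shift, max_shift + 1):
            r, c = top + p, left + q
            if r < 0 or c < 0 or r + h > D.shape[0] or c + w > D.shape[1]:
                continue  # offset window falls outside D
            s = sim(C_sub, D[r:r + h, c:c + w])
            if s > best[2]:
                best = (p, q, s)
    return best

C_sub = np.array([[1, 1], [0, 1]])
D = np.zeros((5, 5), int)
D[2:4, 2:4] = [[1, 1], [0, 1]]
print(best_offset(C_sub, D, 1, 1))  # → (1, 1, 3)
```

Summing the best score over all k, l cluster pairs gives the value of the objective for one choice of offsets.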
[Figure 1 pipeline: associated genes (Homo sapiens subset only); SQL Reactome filter ('Direct Complex' and 'Reaction' types only); MySQL JOIN to generate gene pairs for each disease pair (M) and identify common gene pairs (m) through repeated join operations; Perl hypergeometric calculator; output: correlated disease pairs.]
Figure 1. Method to correlate human diseases based on their underlying protein interactions. M and m refer to parameters of the hypergeometric calculations as described in equation 1.
2. Methods
In order to identify associations between diseases, with statistical significance values, by mapping their respective protein interaction networks, we took the following steps. An overview of the process is pictured in Figure 1. Extraction of human protein-disease relationships was achieved through Structured Query Language querying of the PhenoGO database. We extracted all UMLS-coded diseases classified under the "Disease" semantic type hierarchy along with their associated proteins. In this study, we chose to be conservative and extracted only diseases associated with more than 4 proteins, to avoid errors stemming from mis-assignment in PhenoGO and to reduce spurious predictions in the next step from the
hypergeometric distribution, because a single error has a proportionally larger statistical impact on a smaller sample of proteins in the statistical method that follows (equation 1). These UMLS-coded terms fall under the UMLS semantic types 'Congenital Abnormality', 'Disease or Syndrome', 'Experimental Model of Disease', 'Anatomical Abnormality', and 'Neoplastic Process'. The resultant set consists of 154 diseases and their 1,931 associated proteins (http://phenos.bsd.uchicago.edu/PSB2007/). Integration and Discovery. The second step is to correlate diseases with their underlying protein-protein interaction networks using a statistical approach. In this study, we used the Reactome protein interaction dataset [8] to define the underlying topological networks associated with these diseases. The common proteins between disease-associated proteins in PhenoGO and proteins in Reactome were identified using UniProt identifiers [30]. The Reactome data set defines four distinct types of reactions: 1) neighboring reactions, which define interactions that occur consecutively; 2) indirect complexes, which define interactions that involve sub-complex interaction but not direct binding; 3) direct complexes, defining protein-protein complexes; and 4) reactions, representing situations where the two proteins participate in the same reaction [8]. The Reactome dataset was normalized to a set of paired Swiss-Prot accession numbers, and filtered to remove neighboring reactions and indirect complexes, leaving only entries for binary interactions and direct complexes. This data set contains 20,317 distinct interactions corresponding to 1,140 distinct proteins. From the 154 diseases, we generated all pairs of diseases, and for each pair of diseases, proteins in the two diseases were also paired in all potential combinations.
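The pairing step just described can be sketched as follows. All data values and variable names below are invented toy placeholders, not PhenoGO or Reactome content.

```python
# Toy sketch of the pairing step: for every pair of diseases, every
# cross-combination of their proteins is generated and checked against
# the filtered Reactome interaction set. Data values are invented.
from itertools import combinations, product

disease_proteins = {
    "D1": {"P1", "P2"},
    "D2": {"P2", "P3"},
    "D3": {"P4"},
}
# filtered interaction set: unordered pairs (direct complex / reaction)
reactome = {frozenset({"P1", "P3"})}

for d_a, d_b in combinations(sorted(disease_proteins), 2):
    shared = disease_proteins[d_a] & disease_proteins[d_b]      # identity
    interacting = {
        frozenset(pair)
        for pair in product(disease_proteins[d_a], disease_proteins[d_b])
        if frozenset(pair) in reactome
    }
    print(d_a, d_b, "shared:", shared, "interacting:", interacting)
```

The shared proteins and interacting pairs correspond to the two relationship types (identity and direct interaction) that feed the hypergeometric calculation.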
These protein pairs were then cross-referenced with our filtered Reactome data set to determine if they participated in reactions or formed direct complexes with one another. There are two basic types of relationships used in the calculations in our methods. These relationships correspond to the two scenarios we considered to determine whether two diseases share interaction networks: 1) an identity relationship, where common proteins are shared by the two diseases, and 2) a direct interaction between protein A in one disease and protein B in the other disease. As related diseases can share both types of relations, and due to the requirements of the hypergeometric distribution, we consider both in the underlying protein-protein interaction network in diseases. Based on this, we calculated the correlations between all possible pairs of diseases by applying the hypergeometric distribution function to identify significantly correlated diseases (equation 1), with adjustments for multiple a posteriori comparisons (equation 2), as shown below:

P(i ≥ m | N, M, n) = Σ_{i=m}^{n} [C(M, i) · C(N − M, n − i)] / C(N, n)   (Equation 1)

In equation 1, 'N' represents the total number of all pair combinations between proteins of any two diseases in the experiment, including the possibility of sharing the
same proteins (identical protein pairs between two diseases); 'M' as the sum of the number of observed distinct pairs of interacting proteins that exist in the Reactome database over all the diseases in the experiment (direct interactions only); 'n' as the putative total number of pairs of proteins that could exist in a pair of diseases; and 'm' as the sum of the observed number of common proteins shared between the two specific diseases and the number of distinct pairs of interacting proteins observed in the Reactome database for these two specific diseases (M ∩ n). This measure gives a p-value which is then adjusted for multiple comparisons with the Dunn-Sidak method (a derivative of the Bonferroni method):

p' = 1 − (1 − p)^r   (Equation 2)
In equation 2, p' and p represent the corrected and uncorrected p-values, respectively, and r represents the number of independent comparisons, which is the number of disease pairs (r = 11,703) used in the study. These corrected p-values are then thresholded at p.

... an enzyme vector V_E = [e_1, e_2, ..., e_{|E|}], a reaction vector V_R = [r_1, r_2, ..., r_{|R|}], and a compound vector V_C = [c_1, c_2, ..., c_{|C|}]. Each value e_i in V_E denotes the damage of inhibiting enzyme E_i ∈ E. Each value r_i in V_R denotes the damage incurred by stopping the reaction R_i ∈ R. Each value c_i in V_C denotes the damage incurred by stopping the production of the compound C_i ∈ C. Initialization: Here, we describe the initialization of the vectors V_E, V_R, and V_C. We initialize V_E first, V_R second, and V_C last. Enzyme vector: The damage e_i, ∀i, 1 ≤ i ≤ |E|, is computed as the number of non-target compounds whose production stops after inhibiting E_i. We find the number of such compounds by doing a breadth-first traversal of the metabolic network starting from E_i. We calculate the damage e_i associated with every enzyme E_i ∈ E, ∀i, 1 ≤ i ≤ |E|, and store it at position i in the enzyme vector V_E. Reaction vector: The damage r_j is computed as the minimum of the damages of the enzymes that catalyze R_j, ∀j, 1 ≤ j ≤ |R|. In other words, let E_{π_1}, E_{π_2}, ..., E_{π_k} be the enzymes that catalyze R_j. We compute the damage r_j as r_j = min_{i=1}^{k} {e_{π_i}}. This computation is intuitive since a reaction can be disrupted by inhibiting any of its catalyzers. We calculate r_j associated with every reaction R_j ∈ R, ∀j, 1 ≤ j ≤ |R|, and store it at position j in the reaction vector V_R. Let E(R_j) denote the set of enzymes that produced the damage r_j. Along with r_j, we also store E(R_j). Note that in our model, we do not consider back-up enzyme activities, for simplicity. Compound vector: The damage c_k, ∀k, 1 ≤ k ≤ |C|, is computed by considering the reactions that produce C_k. Let R_{π_1}, R_{π_2}, ..., R_{π_j} be the reactions that produce C_k.
We first compute a set of enzymes E(C_k) for C_k as E(C_k) = E(R_{π_1}) ∪ E(R_{π_2}) ∪ ... ∪ E(R_{π_j}). We then compute the damage value c_k as the number of non-target compounds that are deleted after the inhibition of all the enzymes in E(C_k). This computation is based on the observation that a compound disappears from the system only if all the reactions that produce it stop. We calculate c_k associated with every compound C_k ∈ C, 1 ≤ k ≤ |C|, and store it at position k in the compound vector V_C. Along with c_k, we also store E(C_k). Column I_0 in Table 1 shows the initialization of the vectors for the network in Figure 1. The damage e_1 of E_1 is three, as inhibiting E_1 stops the production of the three non-target compounds C_2, C_3 and C_4. Since the disruption of E_2 or E_3 alone does not stop the production of any non-target compound, their damage values are zero. Hence, V_E = [3, 0, 0]. The damage values for reactions are computed as the minimum over their catalyzers (r_1 = r_2 = e_1 and r_3 = r_4 = e_2). Hence, V_R = [3, 3, 0, 0]. The damage values for compounds are computed from the reactions that produce them. For instance, R_1 and R_2 produce C_2. E(R_1) = E(R_2) = {E_1}. Therefore, c_2 = e_1. Similarly, c_5 is equal to the damage of inhibiting the set E(R_3) ∪ E(R_4) = {E_2, E_3}. Thus, c_5 = 1. Iterative steps: We iteratively refine the damage values in vectors V_R and V_C in a
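The initialization of the enzyme damages can be illustrated with a toy network. The network and all data structures below are invented, not the paper's Figure 1.

```python
# Toy sketch of the initialization step. Inhibiting an enzyme stops the
# reactions only it catalyzes; a compound is lost once every reaction
# producing it has stopped; a lost compound in turn stops the reactions
# consuming it (a breadth-first cascade to a fixed point).
def damage(enzyme, catalyzes, produces, consumes, compounds, targets):
    """Number of non-target compounds lost after inhibiting `enzyme`."""
    stopped = {r for r, enzs in catalyzes.items() if enzs == {enzyme}}
    lost = set()
    changed = True
    while changed:  # propagate until nothing new stops
        changed = False
        for c in compounds:
            producers = {r for r, outs in produces.items() if c in outs}
            if c not in lost and producers and producers <= stopped:
                lost.add(c)
                changed = True
        for r, ins in consumes.items():
            if r not in stopped and ins & lost:
                stopped.add(r)
                changed = True
    return len(lost - targets)

catalyzes = {"R1": {"E1"}, "R2": {"E2"}}
produces = {"R1": {"C2"}, "R2": {"C3"}}
consumes = {"R1": set(), "R2": {"C2"}}
compounds = {"C1", "C2", "C3"}
print(damage("E1", catalyzes, produces, consumes, compounds, {"C1"}))  # 2
print(damage("E2", catalyzes, produces, consumes, compounds, {"C1"}))  # 1
```

Running this for every enzyme fills V_E; the reaction and compound vectors then follow from the min and union rules described above.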
Table 1. Iterative steps: I_0 is the initialization step; I_1 and I_2 are the iterations. V_R and V_C represent the damage values of reactions and compounds respectively, computed at each iteration. V_E = [3, 0, 0] in all iterations.

      V_R            V_C
I_0   [3, 3, 0, 0]   [3, 3, 3, 3, 1]
I_1   [1, 3, 0, 0]   [1, 3, 3, 3, 1]
I_2   [1, 3, 0, 0]   [1, 3, 3, 3, 1]
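The refinement pass that produces rows I_1 and I_2 of Table 1 can be condensed as follows. Enzyme-set bookkeeping is collapsed into `union_damage`, a stand-in for the damage of a combined enzyme set; all structures are illustrative assumptions.

```python
# Condensed sketch of the refinement pass, reproducing Table 1's numbers.
def refine(v_r, v_c, inputs_of, producers_of, union_damage):
    """One iteration of the update rules; returns True if anything changed."""
    changed = False
    for j, ins in inputs_of.items():        # r_j = min{r_j, min_i c_{pi_i}}
        new = min([v_r[j]] + [v_c[k] for k in ins])
        if new < v_r[j]:
            v_r[j], changed = new, True
    for k, prs in producers_of.items():     # c_k = min{c_k, damage(union)}
        new = min(v_c[k], union_damage(prs))
        if new < v_c[k]:
            v_c[k], changed = new, True
    return changed

v_r = [3, 3, 0, 0]                    # initial V_R from Table 1
v_c = [3, 3, 3, 3, 1]                 # initial V_C from Table 1
inputs_of = {0: [4]}                  # R1's input compound is C5
producers_of = {0: ["R1"]}            # C1 is produced by R1
union_damage = lambda prs: 1          # damage({E2, E3}) = 1 in this example

while refine(v_r, v_c, inputs_of, producers_of, union_damage):
    pass
print(v_r, v_c)  # [1, 3, 0, 0] [1, 3, 3, 3, 1] -- matches I_1 and I_2
```

The loop stops as soon as an iteration changes nothing, which is the convergence condition described in the text.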
number of steps. At each iteration, the values are updated by considering the damages of the precursors of the precursors. Thus, at the nth iteration, the precursors from which a reaction or a compound is reachable on a path of length up to n are considered. We define the length of a path on the graph constructed for a metabolic network as the number of reactions on that path (see Definition 4.2). There is no need to update V_E, since the enzymes are not affected by the reactions or the compounds. Next, we describe the actions taken to update V_R and V_C at each iteration. We later discuss the stopping criterion for the iterations. Reaction vector: Let C_{π_1}, C_{π_2}, ..., C_{π_t} be the compounds that are input to R_j. We update the damage r_j as r_j = min{r_j, min_{i=1}^{t} {c_{π_i}}}. The first term of the min function denotes the damage value calculated for R_j during the previous iteration. The second term provides the damage of the input compound with the minimum damage found in the previous iteration. This computation is intuitive since a reaction can be disrupted by stopping the production of any of its input compounds. The damages of all the input compounds were already computed in the previous iteration (say, the (n − 1)th iteration). Therefore, at iteration n, the second term of the min function considers the impact of the reactions and compounds that are away from R_j by n edges in the graph for the metabolic network. Let E(R_j) denote the set that contains the enzymes that produced the new damage r_j. Along with r_j, we also store E(R_j). We update all r_j ∈ V_R using the same strategy. Note that the values r_j can be updated in any order, i.e., the result does not depend on the order in which they are updated. Compound vector: The damage c_k, ∀k, 1 ≤ k ≤ |C|, is updated by considering the damage computed for C_k in the previous iteration and the damages of the reactions that produce C_k. Let R_{π_1}, R_{π_2}, ..., R_{π_j} be the reactions that produce C_k.
We first compute a set of enzymes E(R_{π_1}) ∪ E(R_{π_2}) ∪ ... ∪ E(R_{π_j}). Here, E(R_{π_t}), 1 ≤ t ≤ j, is the set of enzymes computed for R_{π_t} after the reaction vector is updated in the current iteration. We then update the damage value c_k as c_k = min{c_k, damage(∪_{i=1}^{j} E(R_{π_i}))}. The first term here denotes the damage value computed for C_k in the previous iteration. The second term shows the damage computed from all the precursor reactions in the current step. Along with c_k, we also store E(C_k), the set of enzymes which provides the current minimum damage c_k. Condition for convergence: At each iteration, each value in V_R and V_C either remains the same or decreases by an integer amount. This is because a min function
is applied to update each value as the minimum of the current value and a function of its precursors. Therefore, the values in V_R and V_C do not increase. Furthermore, a damage value is always an integer, since it denotes the number of deleted non-target compounds. We stop our iterative refinement when the vectors V_R and V_C do not change in two consecutive iterations. This is justified because, if these two vectors remain the same after an iteration, the damage values in V_R and V_C cannot be minimized any further using our refinement strategy. Columns I_1 and I_2 in Table 1 show the iterative steps to update the values of the vectors V_R and V_C. In I_1, we compute the damage r_1 for R_1 as the minimum of its current damage (three) and the damage of its precursor compound, c_5 = 1. Hence, r_1 is updated to 1 and its associated enzyme set is changed to {E_2, E_3}. The other values in V_R remain the same. When we compute the values for V_C, c_1 is updated to 1, as its new associated enzyme set is {E_2, E_3} and the damage of inhibiting both E_2 and E_3 together is 1. Hence, V_R = [1, 3, 0, 0] and V_C = [1, 3, 3, 3, 1]. In I_2, we find that the values in V_R and V_C do not change any more. Hence, we stop our iterative refinement and report the enzyme combination {E_2, E_3} as the iterative solution for stopping the production of the target compound, C_1. Complexity analysis: Space complexity: The number of elements in the reaction and compound vectors is (|R| + |C|). For each element, we store an associated set of enzymes. Hence, the space complexity is O((|R| + |C|) · |E|). Time complexity: The number of iterations of the algorithm is O(|R|) (see Section 4). The computational time per iteration is O(G · (|R| + |C|)), where G is the size of the graph. Hence, the time complexity is O(|R| · G · (|R| + |C|)). 4. Maximum number of iterations In this section, we present a theoretical analysis of our proposed algorithm.
We show that the number of iterations for the method to converge is finite. This is because the number of iterations depends on the length of the longest non-self-intersecting path (see the definitions below) from any enzyme to a reaction or compound. Definition 4.1. In a given metabolic network, a non-self-intersecting path is a path which traces any vertex on the path exactly once. • For simplicity, we will use the term path instead of non-self-intersecting path in the rest of this section. Definition 4.2. In a given metabolic network, the length of a path from an enzyme E_i to a reaction R_j or compound C_k is defined as the number of unique reactions on that path. • Note that the reaction R_j is counted as one of the unique reactions on the path from enzyme E_i to R_j.
Definition 4.3. In a given metabolic network, the preceding path of a reaction R_j (or a compound C_k) is defined as the length of the longest path from any enzyme in that network to R_j (or C_k). • Theorem 4.1. Let V_E = [e_1, e_2, ..., e_{|E|}], V_R = [r_1, r_2, ..., r_{|R|}], and V_C = [c_1, c_2, ..., c_{|C|}] be the enzyme, reaction and compound vectors respectively (see Section 3). Let n be the length of the longest path (see Definitions 4.1 and 4.2) from any enzyme E_i to a reaction R_j (or a compound C_k). The value r_j (or c_k) remains constant after at most n iterations. • Proof: We prove this theorem by induction on the number of reactions on the longest path (see Definitions 4.1 and 4.2) from any enzyme E_i corresponding to e_i ∈ V_E to C_k. Basis: The basis is the case when the longest path from an enzyme E_i is of length 1 (i.e., the path consists of exactly one reaction). Let R_j be such a reaction. This implies that there is no other reaction on a path from any E_i to R_j. As a result, the value r_j remains constant after initialization. Let C_k be a compound such that there is at most one reaction on any path from an enzyme to C_k. Let R_{π_1}, R_{π_2}, ..., R_{π_j} be the reactions that produce C_k. By our assumption, there is no precursor reaction to any of these reactions; otherwise, the length of the longest path would be greater than one. Therefore, the values r_{π_1}, r_{π_2}, ..., r_{π_j} and the sets E(R_{π_1}), E(R_{π_2}), ..., E(R_{π_j}) do not change after initialization. The value c_k is computed as the damage of E(C_k) = E(R_{π_1}) ∪ E(R_{π_2}) ∪ ... ∪ E(R_{π_j}). Thus, c_k remains unchanged after initialization and the algorithm terminates after the first iteration. Inductive step: Assume that the theorem is true for reactions and compounds that have a preceding path with at most n − 1 reactions. We now prove the theorem for reactions and compounds that have a preceding path with n reactions. Let R_j and C_k denote such a reaction and compound.
We prove the theorem for each one separately. Proof for R_j: Let C_{π_1}, C_{π_2}, ..., C_{π_t} be the compounds that are input to R_j. The preceding path length of each of these input compounds, say C_{π_s}, is at most n; otherwise, the preceding path length of R_j would be greater than n. Case 1: If the preceding path length of C_{π_s} is less than n, then by our induction hypothesis, c_{π_s} remains constant after the (n − 1)th iteration. Thus, the input compound C_{π_s} will not change the value of r_j after the nth iteration. Case 2: If the preceding path length of C_{π_s} is n, then R_j is one of the reactions on this path. In other words, C_{π_s} and R_j are on a cycle of length n; otherwise, the preceding path length of R_j would be greater than n. Recall that at each iteration, the algorithm considers a new reaction or compound on the preceding path, starting from the closest one. Thus, at the nth iteration of the computation of r_j, the algorithm completes the cycle and considers R_j. This, however, will not modify r_j, because the value of r_j monotonically decreases (or remains the same) at each iteration. Thus, the initial damage value computed from R_j is guaranteed to be no
better than r_j after n − 1 iterations. We conclude that r_j will remain unchanged after the nth iteration. Proof for C_k: Let R_{π_1}, R_{π_2}, ..., R_{π_j} be the reactions that produce C_k. The preceding path length of each of these reactions, say R_{π_s}, is at most n; otherwise, the preceding path length of C_k would be greater than n. Case 1: If the preceding path length of R_{π_s} is less than n, then by our induction hypothesis, r_{π_s} remains constant after the (n − 1)th iteration. Thus, the reaction R_{π_s} will not change the value of c_k after the nth iteration. Case 2: If the preceding path length of R_{π_s} is n, then from our earlier discussion in the proof for R_j, r_{π_s} remains unchanged after the nth iteration. Therefore, R_{π_s} will not change the value of c_k after the nth iteration. Hence, by induction, Theorem 4.1 holds. • 5. Experimental results We evaluate our proposed iterative algorithm using the following three criteria: Execution time: The total time (in milliseconds) taken by the method to finish execution and report whether a feasible solution is identified. Number of iterations: The number of iterations performed by the method to arrive at a steady-state solution. Average damage: The average number of non-target compounds that are eliminated when the enzymes in the result set are inhibited. We extracted the metabolic network information of Escherichia coli (E. coli) from KEGG [19] (ftp://ftp.genome.jp/pub/kegg/pathways/eco/). The metabolic network in KEGG has been hierarchically classified into smaller networks according to functionality. We performed experiments at different levels of the hierarchy of the metabolic network and on the entire metabolic network, which is an aggregation of all the functional subnetworks. We devised a uniform labeling scheme for the networks based on the number of enzymes. According to this scheme, a network label begins with 'N' and is followed by the number of enzymes in the network.
For instance, 'N20' indicates a network with 20 enzymes. Table 2 shows the metabolic networks chosen, along with their identifiers and the numbers of compounds (C), reactions (R) and edges (Ed). The edges represent the interactions in the network. For each network, we constructed query sets of sizes one, two and four target compounds, by randomly choosing compounds from that network. Each query set contains 10 queries. We implemented the proposed iterative algorithm and an exhaustive search algorithm which determines the optimal enzyme combination to eliminate the given set of target compounds with minimum damage. We implemented the algorithms in Java. We ran our experiments on an Intel Pentium 4 processor with a 2.8 GHz clock speed and 1 GB of main memory, running the Linux operating system. Evaluation of accuracy: Table 3 shows the comparison of the average damage values of the solutions computed by the iterative algorithm versus the exhaustive
Table 2. Metabolic networks from KEGG with identifier (Id). C, R and Ed denote the number of compounds, reactions and edges (interactions) respectively.

Id     Metabolic Network                 C     R     Ed
N08    Polyketide biosynthesis           11    11    33
N13    Xenobiotics biodegradation        47    58    187
N14    Citrate or TCA cycle              21    35    125
N17    Galactose                         38    50    172
N20    Pentose phosphate                 26    37    129
N22    Glycan biosynthesis               54    51    171
N24    Glycerolipid                      32    49    160
N28    Glycine, serine and threonine     36    46    151
N32    Pyruvate                          21    51    163
N42    Other amino acid                  69    63    208
N48    Lipid                             134   196   654
N52    Purine                            67    128   404
N59    Energy                            72    82    268
N71    Nucleotide                        102   217   684
N96    Vitamins and Cofactors            145   175   550
N170   Amino acid                        54    378   1210
N180   Carbohydrate                      247   501   1659
N537   Entire network                    988   1790  5833
Table 3. Comparison of average damage values of solutions determined by the iterative algorithm versus the exhaustive search algorithm.

Pathway Id          N14    N17    N20    N24    N28    N32
Iterative damage    2.51   8.73   1.63   3.39   1.47   0.59
Exhaustive damage   2.51   8.73   1.63   3.17   1.47   0.59
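The exhaustive baseline against which Table 3 compares must examine every non-empty enzyme subset, which is why results beyond N32 are impractical. A sketch of that baseline, with hypothetical `kills_targets` and `damage_of` callbacks:

```python
# Sketch of an exhaustive baseline: try every non-empty enzyme subset
# (2^|E| - 1 of them) and keep the minimum-damage subset that removes
# all targets. The callbacks are hypothetical stand-ins.
from itertools import combinations

def exhaustive(enzymes, kills_targets, damage_of):
    """Minimum-damage enzyme subset whose inhibition removes all targets."""
    best = None
    for size in range(1, len(enzymes) + 1):
        for subset in combinations(enzymes, size):
            if kills_targets(subset) and (
                best is None or damage_of(subset) < damage_of(best)
            ):
                best = subset
    return best

# Subset counts for three of the networks in Table 2:
for n in (8, 32, 537):
    print(f"N{n:03d}: {2 ** n - 1} subsets to examine")
```

For N537 the count exceeds 10^161 subsets, so only the iterative approximation is feasible at that scale.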
Figure 2. Evaluation of the iterative algorithm. (a) Average execution time in milliseconds. (b) Average number of iterations.
search algorithm. We show results only up to N32, as the exhaustive search algorithm took longer than one day to finish even for N32. The damage values of our method exactly match the damage values of the exhaustive search for all the networks except N24. For N24, the average damage differs from the exhaustive solution by only 0.02%. This shows that the iterative algorithm is a good approximation of the exhaustive search algorithm, which computes an optimal solution. The slight deviation in damage is the tradeoff for achieving the scalability of the iterative algorithm (described next). Evaluation of scalability: Figure 2(a) plots the average execution time of our iterative method for increasing sizes of metabolic networks. The running time increases slowly with the network size. As the number of enzymes increases from 8 to 537, the running time increases from roughly 1 to 10 seconds. The largest network, N537, consists of 537 enzymes, and hence an exhaustive evaluation would inspect 2^537 − 1 combinations, which is computationally infeasible. Thus, our results show that the iterative method scales well for networks of increasing sizes. This property makes our method an important tool for identifying the right enzyme combination for eliminating target compounds, especially for those networks for which an exhaustive search is not feasible. Figure 2(b) shows a plot of the average number of iterations for increasing sizes of metabolic networks. The iterative method reaches a steady state within 10 iterations in all cases. The parameters (see Table 2) that influence the number of iterations are the numbers of enzymes, compounds and reactions, and especially the number of interactions in the network (represented by edges in the network graph). A larger number of interactions increases the number of iterations considerably, as can be seen for networks N22, N48, N96 and N537, where the number of iterations is greater than 5. This shows that, in addition to the number of enzymes, the number of compounds and reactions in the network and their interactions also play a significant role in determining the number of iterations. Our results show that the iterative algorithm can reliably reach a steady state and terminate, for networks as large as the entire metabolic network of E. coli. 6. Conclusion Efficient computational strategies are needed to identify the enzymes (i.e., drug targets) whose inhibition will achieve the required effect of eliminating a given target set of compounds while incurring minimal side-effects.
An exhaustive evaluation of all possible enzyme combinations to find the optimal subset is computationally infeasible for large metabolic networks. We proposed a scalable iterative algorithm which computes a sub-optimal solution to this problem within reasonable time bounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. We evaluated the immediate precursors of a target compound and iteratively moved backwards to identify the enzymes whose inhibition stopped the production of the target compound while incurring minimum damage. We showed that our method converges within a finite number of such iterations. In our experiments on the E. coli metabolic network, the accuracy of a solution computed by the iterative algorithm deviated from that found by an exhaustive search by only 0.02%. Our iterative algorithm is highly scalable: it solved the problem for even the entire metabolic network of E. coli in less than 10 seconds. References 1. 'Proteome Mining' can zero in on Drug Targets. Duke University medical news, Aug 2004. 2. M. Arita. The metabolic world of Escherichia coli is not small. PNAS, 101(6):1543-7,
2004. 3. S. Broder and J. C. Venter. Sequencing the Entire Genomes of Free-Living Organisms: The Foundation of Pharmacology in the New Millennium. Annual Review of Pharmacology and Toxicology, 40:97-132, Apr 2000. 4. S. K. Chanda and J. S. Caldwell. Fulfilling the promise: Drug discovery in the post-genomic era. Drug Discovery Today, 8(4):168-174, Feb 2003. 5. A. Cornish-Bowden. Why is uncompetitive inhibition so rare? FEBS Letters, 203(1):3-6, Jul 1986. 6. A. Cornish-Bowden and J. S. Hofmeyr. The Role of Stoichiometric Analysis in Studies of Metabolism: An Example. Journal of Theoretical Biology, 216:179-191, May 2002. 7. J. Drews. Drug Discovery: A Historical Perspective. Science, 287(5460):1960-1964, Mar 2000. 8. Davidov et al. Advancing drug discovery through systems biology. Drug Discovery Today, 8(4):175-183, Feb 2003. 9. Deane et al. Catechol-O-methyltransferase inhibitors versus active comparators for levodopa-induced complications in Parkinson's disease. Cochrane Database of Systematic Reviews, 4, 2004. 10. Hatzimanikatis et al. Metabolic networks: enzyme function and metabolite structure. Current Opinion in Structural Biology, (14):300-306, 2004. 11. Imielinski et al. Investigating metabolite essentiality through genome scale analysis of E. coli production capabilities. Bioinformatics, Jan 2005. 12. Imoto et al. Computational Strategy for Discovering Druggable Gene Networks from Genome-Wide RNA Expression Profiles. In PSB 2006 Online Proceedings, 2006. 13. Jeong et al. Prediction of Protein Essentiality Based on Genomic Data. ComPlexUs, 1:19-28, 2003. 14. Lemke et al. Essentiality and damage in metabolic networks. Bioinformatics, 20(1):115-119, Jan 2004. 15. Ma et al. Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics, 20(12):1870-6, 2004. 16. Mombach et al.
Bioinformatics analysis of mycoplasma metabolism: Important enzymes, metabolic similarities, and redundancy. Computers in Biology and Medicine, 2005. 17. Teichmann et al. The Evolution and Structural Anatomy of the Small Molecule Metabolic Pathways in Escherichia coli. JMB, 311:693-708, 2001. 18. Yeh et al. Computational Analysis of Plasmodium falciparum Metabolism: Organizing Genomic Information to Facilitate Drug Discovery. Genome Research, 14:917-924, 2004. 19. M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28(1):27-30, Jan 2000. 20. C. Smith. Hitting the target. Nature, 422:341-347, Mar 2003. 21. R. Somogyi and C. A. Sniegoski. Modeling the complexity of genetic networks: Understanding multi-gene and pleiotropic regulation. Complexity, 1:45-63, 1996. 22. P. Sridhar, T. Kahveci, and S. Ranka. Opmet: A metabolic network-based algorithm for optimal drug target identification. Technical report, CISE Department, University of Florida, Sep 2006. 23. R. Surtees and N. Blau. The neurochemistry of phenylketonuria. European Journal of Pediatrics, 159:109-13, 2000. 24. T. Takenaka. Classical vs reverse pharmacology in drug discovery. BJU International, 88(2):7-10, Sep 2001.
TRANSCRIPTIONAL INTERACTIONS DURING SMALLPOX INFECTION AND IDENTIFICATION OF EARLY INFECTION BIOMARKERS* WILLY A. VALDIVIA-GRANDA Orion Integrated Biosciences Inc., 265 Centre Ave., Suite 1R, New Rochelle, NY 10805, USA. Email: [email protected]
MARICEL G. KANN National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA. Email: [email protected]
JOSE MALAGA Orion Integrated Biosciences Inc. Email: [email protected]

Smallpox is a deadly disease that could be intentionally reintroduced into the human population as a bioweapon. While host gene expression microarray profiling can be used to detect infection, the analysis of this information using unsupervised and supervised classification techniques can produce contradictory results. Here, we present a novel computational approach that incorporates molecular genome annotation features that are key for identifying early infection biomarkers (EIBs). Our analysis identified 58 EIBs expressed in peripheral blood mononuclear cells (PBMCs) collected from 21 cynomolgus macaques (Macaca fascicularis) infected with two variola strains via aerosol and intravenous exposure. The level of expression of these EIBs was correlated with disease progression and severity. No overlap was found between the EIB co-expression data and protein interaction data reported in public databases. This suggests that a pathogen-specific re-organization of the gene expression and protein interaction networks occurs during infection. To identify potential genome-wide protein interactions between variola and humans, we performed a protein domain analysis of all smallpox and human proteins. We found that only 55 of the 161 protein domains in smallpox are also present in the human genome. These co-occurring domains are mostly represented in proteins involved in blood coagulation, complement activation, angiogenesis, inflammation, and hormone transport. Several of these proteins fall within the EIB category and suggest potential new targets for the development of therapeutic countermeasures.
* Correspondence should be addressed to: [email protected]
1. INTRODUCTION The virus that causes smallpox, known as variola major, belongs to the genus Orthopoxvirus within the family Poxviridae. During 1967, the year the smallpox global eradication program began, an estimated 10 to 15 million smallpox cases occurred in 43 countries, causing the death of 2 million people annually (1). Intensive vaccination programs led in 1979 to the eradication of the disease. Since then, vaccination has ceased, and levels of immunity have dropped dramatically (2). In recent years there has been increasing concern that this virus could be used as a bioweapon (3, 4). In the very early stages of viral infection and during the progression of the disease, a series of physiological and molecular changes, including differential gene expression, occur in the host. This information can be used to identify biomarkers correlated with the presence or absence of a specific pathogen, the prognosis of the disease, or the efficacy of vaccines and drug therapies. Since microarrays can measure genome-wide gene expression profiles, the use of peripheral blood mononuclear cells (PBMCs) can allow the identification of pathogen-specific biomarkers before clinical symptoms appear. While the collection of PBMCs is a minimally invasive method which facilitates the assessment of host responses to infection, doubts about their usefulness persist. These revolve around two strong arguments. First, expression signals might come from a minority of cells within the bloodstream; thus, expression changes might be a secondary consequence rather than a primary effect of viral infection. Second, the PBMC population is not in a homogeneous biological state; therefore, there is an inherent biological noise which could make the data impossible to reproduce. Rubins et al. (5) used cDNA microarrays to measure the expression changes occurring in PBMCs collected from the blood of cynomolgus macaques infected with two strains of variola by aerosol and intravenous exposure.
Clustering analyses revealed that variola infection induced the expression of genes involved in cell cycle and proliferation, DNA replication, and chromosome segregation. These transcriptional changes were attributed to the fact that poxviruses encode homologues of the mammalian epidermal growth factor (EGF) that bind ErbB protein family members, which are potent stimulators of cell proliferation. However, the conclusions of Rubins et al. (5) were limited by the ability of unsupervised microarray data analysis algorithms, such as clustering, to detect true gene product interactions (6, 7). This is relevant because an increasing body of data suggests that proteins involved in the regulation of cellular events resulting from viral infections are organized in a modular fashion rather than in a particular class or cluster (8-10). While some microarray data analysis tools use gene ontologies to improve the classification of gene expression data (11, 12), these methods
incorporate molecular annotation only after the gene expression values have been classified. However, many human genes have little or no functional annotation, or they have multiple molecular functions that can change across database versions. The identification of biomarkers is therefore challenging, because it is not possible to quantify the contribution of the molecular annotation to the overall classification process. To address these limitations, and to gain a better understanding of the molecular complexity arising during host-pathogen interactions, we developed a new method for microarray data classification and for the discovery of early infection biomarkers (EIBs). Our approach incorporates different molecular biological datasets and narrows the set of attributes required for the classification process. This information is represented as transcriptional networks in which genes associated with early viral infection events and disease severity are selected. These interactions were overlapped with physical protein-protein interaction data reported in the scientific literature. To complement these analyses, and to identify possible human receptors used by smallpox during cellular entry, replication, assembly, and budding (13, 14), we identified all protein domains (from the PFAM protein domain database (15)) within 197 smallpox proteins that are also present within human proteins. The results of our analysis provide new insights into receptor co-evolution and suggest potential therapeutic targets that might diminish the lethal manifestations of smallpox.
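The domain-overlap scan described above amounts to a set intersection between viral and human Pfam domain assignments. A toy sketch of that idea follows; the protein and domain identifiers are placeholders, not real smallpox or human annotations:

```python
# Minimal sketch of a cross-species Pfam domain scan: find domains present
# in both viral and human proteins. All identifiers below are toy values.

def shared_domains(viral_domains, human_domains):
    """Map each Pfam domain found in any viral protein to the set of
    human proteins that also contain it."""
    # Invert the human mapping: domain -> human proteins carrying it
    by_domain = {}
    for protein, domains in human_domains.items():
        for d in domains:
            by_domain.setdefault(d, set()).add(protein)
    # Keep only the domains that also occur in a viral protein
    overlap = {}
    for domains in viral_domains.values():
        for d in domains:
            if d in by_domain:
                overlap[d] = by_domain[d]
    return overlap

viral = {"viral_prot_1": {"PF_A", "PF_B"}, "viral_prot_2": {"PF_C"}}
human = {"human_prot_1": {"PF_A"}, "human_prot_2": {"PF_A", "PF_D"}}
```

In the real analysis, `viral` would hold Pfam assignments for the 197 smallpox proteins and `human` the assignments for the human proteome.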
2. METHODS

2.1. Transcriptional Network Reconstruction

We used the microarray gene expression data from the experiments by Rubins et al. (5). This information consists of the molecular profiles collected from PBMCs of 21 male cynomolgus macaques (Macaca fascicularis) exposed to two variola strains (India-7124 and Harper-99) via subcutaneous injections (5 × 10^8 plaque-forming units, p.f.u.) and aerosol exposure (10^9 p.f.u.). For the analysis of these data, we developed an algorithm to identify genes responding similarly to the viral challenge across different exposed animals. We then proceeded to identify infection-specific genes corresponding to a particular time-point after the inoculation (16). As shown in Figure 1, our implementation consists of two main steps. First, a nearest neighbor voting (NNV) classification, including gene expression values and gene annotation features, in which the best attributes associated with a particular transcriptional network are selected (17). Second, a genetic algorithm (GA) optimization that uses as its fitness function the trade-off between the false negative and false positive rates for every possible cutoff, represented by the area under the receiver operating characteristic (ROC) curve (17).
Pred(G) = Im(G) + Sim(G)    (1.1)

Im(G) = W_L(G) + W_I(G) + W_A(G)    (1.2)

Sim(G) = Σ_{S ∈ trnSet} [ Σ_{f ∈ features} W_f · Match_f(G, S) + Im(G) ]    (1.3)
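A minimal sketch of how Equations 1.1-1.3 could be computed. The gene records, attribute encoding and example weights are assumptions made for illustration; the paper does not publish its implementation:

```python
# Sketch of the predictor-voting score of Eqs. 1.1-1.3 under assumed
# data structures (dict-based gene records; hypothetical weights).

def importance(gene, w_loc, w_int, w_attr):
    """Im(G) = W_L(G) + W_I(G) + W_A(G): weighted localization,
    interaction-count and attribute-count terms for gene G (Eq. 1.2)."""
    return (w_loc * gene["localization_score"]
            + w_int * len(gene["interactions"])
            + w_attr * len(gene["attributes"]))

def similarity(gene, train_set, feature_weights, w_loc, w_int, w_attr):
    """Sim(G): for each training gene S, sum the weighted feature
    matches W_f * Match_f(G, S), plus Im(G) (Eq. 1.3)."""
    total = 0.0
    for s in train_set:
        match = sum(w_f * (gene["attributes"].get(f) == s["attributes"].get(f))
                    for f, w_f in feature_weights.items())
        total += match + importance(gene, w_loc, w_int, w_attr)
    return total

def pred(gene, train_set, feature_weights, w_loc, w_int, w_attr):
    """Pred(G) = Im(G) + Sim(G) (Eq. 1.1)."""
    return (importance(gene, w_loc, w_int, w_attr)
            + similarity(gene, train_set, feature_weights, w_loc, w_int, w_attr))
```

The GA described in the text would then search over the localization, interaction and attribute weights together with the per-feature weights W_f, using the area under the ROC curve of Pred as the fitness function.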
Equation 1.1 defines the function used for predictor voting (Pred) of specific transcriptional interactions, estimated as the sum of the importance of a given gene G (Im(G)) and the similarity of attributes (Sim(G)) of its gene neighbors. The importance of gene G is given by Equation 1.2 and is based on the weights for scoring the cellular compartment localization of the gene (W_L), its number of interactions with other genes (W_I), and its number of attributes (W_A). Considering that there are multiple attributes to select, we optimized the weight space (W_f, used in Equation 1.3) by scoring the best combination of weights with a standard genetic algorithm (GA), matching each of the features (f) voted as important. This approach selects the best and/or fittest solution and allows only the higher scores to proceed in the form of transcriptional interactions. The ROC value of the prediction is used as the fitness evaluator. Depending on the fitness value, random mutation is used occasionally to change or optimize an existing solution. For the visualization of the final transcriptional interactions we calculated the probability (p

igma_mol_i, Most_Pos_Charge, Most_Pos_Rs_i_mol, Most_Pos_Sigma_i_mol, Most_Pos_Sigma_mol_i,
Softness_of_Most_Pos, Sum_Hardness, Sum_Neg_Hardness, Total_Neg_Softness, b_double, b_rotN, b_rotR, b_triple, chiral, rings, a_nN, a_nO, a_nS, FCharge, lip_don, KierFlex, a_base, vsa_acc, vsa_acid, vsa_base, vsa_don, density, logP(o/w), a_ICM, chi1v_C, chiral_u, balabanJ, logS, ASA, ASA+, ASA-, ASA_H, ASA_P, CASA+, CASA-, DASA, DCASA

For more details on 'inductive' parameters see references [1-5]; the conventional QSAR parameters used can be accessed through the MOE program [16].

References

1. A. Cherkasov, Curr. Comp.-Aided Drug Design. 1, 21 (2005).
2. A. Cherkasov and B. Jankovic, Molecules. 9, 1034 (2004).
3. A. Cherkasov, Z. Shi, M. Fallahi, and G.L. Hammond, J. Med. Chem. 48, 3203 (2005).
4. A. Cherkasov, J. Chem. Inf. Model. 46, 1214 (2006).
5. E. Karakoc, S.C. Sahinalp, and A. Cherkasov, J. Chem. Inf. Model. 46, in press (2006).
6. ChemIDplus database: http://chem.sis.nlm.nih.gov/chemidplus/, May 2006.
7. Journal of Antibiotics database: http://www.nih.go.jp/~jun/NADB/byname.html, May 2006.
8. F. Tomas-Vert, F. Perez-Gimenez, M.T. Salabert-Salvador, F.J. Garcia-March, and J. Jaen-Oltra, J. Molec. Struct. (Theochem). 504, 249 (2000).
9. M.T.D. Cronin, A.O. Aptula, J.C. Dearden, J.C. Duffy, T.I. Netzeva, H. Patel, P.H. Rowe, T.W. Schultz, A.P. Worth, K. Voutzoulidis, and G. Schuurmann, J. Chem. Inf. Comp. Sci. 42, 869 (2002).
10. M. Murcia-Soler, F. Perez-Gimenez, F.J. Garcia-March, M.T. Salabert-Salvador, W. Diaz-Villanueva, M.J. Castro-Bleda, and A. Villanueva-Pareja, J. Chem. Inf. Comput. Sci. 44, 1031 (2004).
11. The Merck Index 13.4 CD-ROM Edition, CambridgeSoft, Cambridge, MA, 2004.
12. Analyticon Discovery Company: www.ac-discovery.com, May 2006.
13. Assinex Gold Collection, Assinex Ltd., Moscow, 2004.
14. Human Metabolome Database: http://redpoll.pharmacy.ualberta.ca/~aguo/www_hmdb_ca/HMDB/, May 2006.
15. T.A. Halgren, J. Comp. Chem. 17, 490 (1996).
16. Molecular Operating Environment (MOE), 2005, Chemical Computing Group Inc., Montreal, Canada.
17. E. Karakoc, A. Cherkasov, and S.C. Sahinalp, Bioinformatics, in press (2006).
18. CPLEX: High-performance software for mathematical programming: http://www.ilog.com/products/cplex/, May 2006.
19. M. Tasan, J. Macker, M. Ozsoyoglu, and S. Cenk Sahinalp, Distance Based Indexing for Sequence Proximity Search, IEEE Data Engineering Conference ICDE'03, Bangalore, India (2003).
BIOSPIDER: A WEB SERVER FOR AUTOMATING METABOLOME ANNOTATIONS
CRAIG KNOX, SAVITA SHRIVASTAVA, PAUL STOTHARD, ROMAN EISNER, DAVID S. WISHART
Department of Computing Science, University of Alberta, Edmonton, AB T6G-2E8, Canada
One of the growing challenges in life science research lies in finding useful, descriptive or quantitative data about newly reported biomolecules (genes, proteins, metabolites and drugs). An even greater challenge is finding information that connects these genes, proteins, drugs or metabolites to each other. Much of this information is scattered through hundreds of different databases, abstracts or books, and almost none of it is particularly well integrated. While some efforts are being undertaken at the NCBI and EBI to integrate many different databases together, this still falls short of the goal of having some kind of human-readable synopsis that summarizes the state of knowledge about a given biomolecule - especially small molecules. To address this shortfall, we have developed BioSpider. BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules - both large and small. Specifically, BioSpider allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and it returns an in-depth synoptic report (~3-30 pages in length) about that biomolecule and any other biomolecule it may target. This summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. Because of its breadth, depth and comprehensiveness, we believe BioSpider will prove to be a particularly valuable tool for researchers in metabolomics. BioSpider is available at: www.biospider.ca
1. Introduction
Over the past decade we have experienced an explosion in the breadth and depth of information available, through the internet, on biomolecules. From protein databases such as the PDB [1] and Swiss-Prot [18] to small molecule databases such as PubChem (http://pubchem.ncbi.nlm.nih.gov/), KEGG [2], and ChEBI (http://www.ebi.ac.uk/chebi/), the internet is awash in valuable chemical and biological data. Unfortunately, despite the abundance of this data, there is still a need for new tools and databases to connect chemical data (small, biologically active molecules such as drugs and metabolites) to biological data (biologically active targets such as proteins, RNA and DNA), and vice versa. Without this linkage, clinically important or pharmaceutically relevant information is often lost. To address
this issue we have developed an integrated cheminformatics/bioinformatics reporting system called BioSpider. Specifically, BioSpider is a web-based search tool that was created to scan the web and to automatically find, extract and assemble quantitative data about small molecules (drugs and metabolites) and their large molecule targets. BioSpider can be used both as a research tool and as a database annotation tool to assemble fully integrated drug, metabolite or protein databases. So far as we are aware, BioSpider is a unique application. It is essentially a hybrid of a web-based genome annotation tool, such as BASys [3], and a text mining system, such as MedMiner [4]. Text mining tools such as MedMiner, iHOP [5], MedGene [6] and LitMiner [7] exploit the information contained within the PubMed database. These web servers also support more sophisticated text and phrase searching, phrase selection and relevance filtering using specially built synonym lists and thesauruses. However, these text mining tools were designed specifically to extract information only from PubMed abstracts, as opposed to other database resources. In other words, MedMiner, MedGene and iHOP do not search, display, integrate or link to external molecular database information (i.e. GenBank, OMIM [8], PDB, SwissProt, PharmGKB [9], DrugBank [10], PubChem, etc.) or to other data on the web. This database- or web-based information-extraction feature is what is unique about BioSpider.
2. Application Description

2.1. Functionality
Fundamentally, BioSpider is a highly sophisticated web spider, or web crawler. Spiders are software tools that browse the web in an automated manner and keep copies of the relevant information from the visited pages in their databases. However, BioSpider is more than just a web spider. It is also an interactive text mining tool that contains several predictive bioinformatic and cheminformatic programs, all of which are available through a simple and intuitive web interface. Typically, a BioSpider session involves a user submitting a query about one or more biological molecules of interest through its web interface, waiting a few minutes and then viewing the results in a synoptic table. This hyperlinked table typically contains more than 80 data fields covering all aspects of the physico-chemical, biochemical, genetic and physiological information about the query compound. Users may query BioSpider with either small molecules (drugs or metabolites) or large molecules (human proteins). The queries can be in almost any form, including chemical names, CAS numbers, SMILES strings [11], InChI identifiers, MOL files or PubChem IDs (for small molecules), or protein names and/or Swiss-Prot IDs (for macromolecules). In extracting the data and assembling its tabular reports, BioSpider employs several robust data-gathering techniques based on screen-scraping, text-mining, and various modeling or predictive algorithms. If a BioSpider query is made for a small molecule, the program will perform a three-stage search involving: 1) Compound Annotation; 2) Target Protein/Enzyme Prediction and 3) Target Protein/Enzyme Annotation (see below for more details). If a BioSpider query is made for a large molecule (a protein), the program will perform a complete protein annotation. BioSpider always follows a defined search path (outlined in Figure 1, and explained in detail below), extracting a large variety of different data fields for both chemicals and proteins (shown in Table 1). In addition, BioSpider includes a built-in referencing application that maintains the source for each piece of data obtained. Thus, if BioSpider obtains the PubChem ID for a compound using KEGG, a reference "Source: KEGG" is added to the reference table for the PubChem ID.

Figure 1 - Simplified overview of a BioSpider search

(1) Obtain Chemical Information: CAS, IUPAC Name, Synonyms, Melting Point, etc.
(2) Predict Drug Targets or Metabolizing Enzymes
(3) For each predicted Drug Target or Metabolizing Enzyme, obtain protein information including sequence information, description, SNPs, etc.
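The three-step flow outlined in Figure 1 can be sketched as a simple dispatch routine. The function names and returned fields below are hypothetical stand-ins for BioSpider's internal (Perl) routines, shown here only to illustrate the control flow:

```python
# Illustrative sketch of BioSpider's search flow: small-molecule queries
# run the three-stage search; protein queries go straight to annotation.
# All function names and fields are assumptions, not BioSpider's real API.

def annotate_compound(query):          # step 1: compound annotation
    return {"name": query, "synonyms": [], "cas": None}

def predict_targets(compound):         # step 2: target/enzyme prediction
    return ["target_enzyme_1"]

def annotate_protein(protein):         # step 3 (also used for protein queries)
    return {"name": protein, "sequence": None}

def biospider_search(query, small_molecule=True):
    report = {}
    if small_molecule:
        report["compound"] = annotate_compound(query)
        report["targets"] = [annotate_protein(t)
                             for t in predict_targets(report["compound"])]
    else:
        report["protein"] = annotate_protein(query)
    return report
```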
Table 1 - Summary of some of the fields obtained by BioSpider

Drug or Compound Information: Generic Name; Brand Names/Synonyms; IUPAC Name; Chemical Structure/Sequence; Chemical Formula; PubChem/ChEBI/KEGG Links; SwissProt/GenBank Links; FDA/MSDS/RxList Links; Molecular Weight; Melting Point; Water Solubility; pKa or pI; LogP or Hydrophobicity; NMR/Mass Spectra; MOL/SDF Text Files; Drug Indication; Drug Pharmacology; Drug Mechanism of Action; Drug Biotransformation/Absorption; Drug Patient/Physician Information; Drug Toxicity

Drug Target or Receptor Information: Name; Synonyms; Protein Sequence; Number of Residues; Molecular Weight; pI; Gene Ontology; General Function; Specific Function; Pathways; Reactions; Pfam Domains; Signal Sequences; Transmembrane Regions; Essentiality; GenBank Protein ID; SwissProt ID; PDB ID; Cellular Location; DNA Sequence; Chromosome Location
Step 1: Compound Annotation

Compound annotation involves extracting or calculating data about small molecule compounds (metabolites and drugs). This includes data such as common names, synonyms, chemical descriptions/applications, IUPAC names, chemical formulas, chemical taxonomies, molecular weights, solubilities, melting or boiling points, pKa, LogP values, state(s), MSDS sheets, chemical structures (MOL, SDF and PDB files), chemical structure images (thumbnail and full-size PNG), SMILES strings, InChI identifiers, MS and NMR spectra, and a variety of database links (PubChem, KEGG, ChEBI). The extraction of these data involves accessing, screen scraping and text mining ~30 well-known databases (KEGG, PubChem), calling a number of predictive programs (for calculating MW, solubility) and running a number of file conversion scripts and figure generation routines via CORINA [12], Checkmol (http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html) and other in-house methods. The methods used to extract and generate these data are designed to be called independently, but they are also "aware" of certain data dependencies. For instance, if a user only wanted an SDF file for a compound, they would simply call a single method: get_value('sdf_file'). There is no need to explicitly call methods that might contain the prerequisite information for getting an SDF file. Likewise, if BioSpider needs a PubChem ID to grab an SDF file, it will obtain it automatically, and, consequently, if the PubChem ID requires a KEGG ID, BioSpider will then jump ahead to try and get the KEGG ID automatically.

Step 2: Target/Enzyme Prediction

Target/enzyme prediction involves taking the small-molecule query and identifying the enzymes likely to be targeted by, or involved in the metabolism of, that compound. This process involves looking for metabolite-protein or drug-protein associations in several well-known databases, including SwissProt, PubMed, DrugBank and KEGG.
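The dependency-aware retrieval described in Step 1, where a single get_value call transparently resolves its own prerequisites, can be sketched as a memoized resolver. The class name, field chain and fetcher functions here are illustrative assumptions, not BioSpider's actual code:

```python
# Sketch of dependency-aware field resolution: asking for one field (the
# SDF file) resolves its prerequisites (a PubChem ID, which may in turn
# need a KEGG ID). Field names and fetchers are illustrative stand-ins.

class Annotator:
    def __init__(self, name):
        self.name = name
        self.cache = {}
        # field -> (prerequisite fields, fetcher taking resolved prereqs)
        self.fields = {
            "kegg_id":    ([],             lambda v: f"KEGG:{self.name}"),
            "pubchem_id": (["kegg_id"],    lambda v: f"CID-for-{v['kegg_id']}"),
            "sdf_file":   (["pubchem_id"], lambda v: f"SDF-from-{v['pubchem_id']}"),
        }

    def get_value(self, field):
        # Resolve prerequisites recursively, caching each result
        if field not in self.cache:
            prereqs, fetch = self.fields[field]
            values = {p: self.get_value(p) for p in prereqs}
            self.cache[field] = fetch(values)
        return self.cache[field]
```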
The script begins by constructing a collection of query objects from the supplied compound information. Each query object contains the name and synonyms for a single compound, as well as any similar but unwanted terms. For example, a query object for the small molecule compound "pyridoxal" would contain the term "pyridoxal phosphatase" as an unwanted term, since the latter name is for an enzyme. The list of unwanted or excluded terms for small molecule compounds is assembled from a list of the names and synonyms of all human proteins. These unwanted terms are identified automatically by testing for cases where one term represents a subset of another. Users can also include their own "exclusion" terms in BioSpider's advanced search interface. The name and synonyms from a query object are then submitted using WWW agents or public APIs to a variety of abstract and protein sequence databases, including Swiss-Prot, PubMed, and KEGG. The name and synonyms are each submitted separately, rather than as a single query, since queries consisting of multiple synonyms typically produce many irrelevant results. The relevance of each of the returned records is measured by counting the number of occurrences of the compound name and synonyms, as well as the number of occurrences of the unwanted terms. Records containing only the desired terms are given a "good" rating, while those containing some unwanted terms are given a "questionable" rating. Records containing only unwanted terms are discarded. The records are then sorted based on their qualitative score. BioSpider supports both automated and semi-automated identification. For automated identification, only the highest scoring hits (no unwanted terms, hits to more than one database) are selected. In semi-automated mode, the results are presented to a curator who must approve the selection. To assist with the decision, each of the entries in the document is hyperlinked to the complete database record so that the curator can quickly assess the quality of the results. Note that metabolites and drugs often interact with more than one enzyme or protein target.

Step 3: Target/Enzyme Annotation

Target/enzyme annotation involves extracting or calculating data about the proteins that were identified in Step 2. This includes data such as protein name, gene name, synonyms, protein sequence, gene sequence, GO classifications, general function, specific function, PFAM [13] domains, secondary structure, molecular weight, subcellular location, gene locus, SNPs and a variety of database links (SwissProt, KEGG, GenBank). Approximately 30 annotation sub-fields are determined for each drug target and/or metabolizing enzyme. The BioSpider protein annotation program is based on previously published annotation tools developed in our lab, including BacMap [14], BASys and CCDB [15].
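The record-rating rule described above can be made concrete with a short sketch. The scoring is an assumption-laden simplification: because each unwanted term is, by construction, a superset of a wanted term (e.g. "pyridoxal phosphatase" contains "pyridoxal"), unwanted occurrences are subtracted from the raw wanted count:

```python
# Sketch of the rating rule: only desired terms -> "good"; a mix of
# desired and unwanted terms -> "questionable"; only unwanted terms ->
# discarded (None). Simplified relative to BioSpider's actual scoring.

def rate_record(text, wanted, unwanted):
    text = text.lower()
    n_unwanted = sum(text.count(t.lower()) for t in unwanted)
    # Each unwanted occurrence also matches the embedded wanted term,
    # so subtract it from the raw wanted count.
    n_wanted = sum(text.count(t.lower()) for t in wanted) - n_unwanted
    if n_wanted > 0 and n_unwanted == 0:
        return "good"
    if n_wanted > 0:
        return "questionable"
    return None  # discard
```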
The Swiss-Prot and KEGG databases are searched initially to retrieve protein and gene names, protein synonyms, protein sequences, specific and general functions, signal peptides, transmembrane regions and subcellular locations. If any annotation field is not retrieved from the abovementioned databases, then either alternate databases are searched or internally developed/installed programs are used. For example, if transmembrane regions are not annotated in the Swiss-Prot entry, then a locally installed transmembrane prediction program called TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) is used to predict the transmembrane regions. This protein annotation tool also coordinates the updating of fields that are calculated from the contents of other fields, such as molecular weight and isoelectric point. The program also retrieves chromosome location, locus location and SNP information from GeneCards [16] on the basis of the gene name. BLAST searches are also performed against the PDB database to identify structural homologues. Depending upon the sequence similarity between the query protein sequence and a sequence represented in the PDB database, a program
called HOMODELLER (X. Dong, unpublished data) may generate a homology model for the protein sequence.

2.2. Implementation

The BioSpider backend is a fully object-oriented Perl application, making it robust and portable. The frontend (website, shown in Figure 2) utilizes Perl CGI scripts which generate valid XHTML and CSS. BioSpider uses a relational database (MySQL 5) to store data as it runs. As BioSpider identifies and extracts different pieces of information, it stores the data in the database. To facilitate this storage process, a module called a "DataBean" is used to store and retrieve the desired information to and from the database. This approach was chosen for 3 reasons: 1) it provides an "audit trail" for the results obtained, 2) it provides a complete search result history, enabling the easy addition of "saved searches" to the website, and 3) it reduces memory load as the application is running. A screenshot of the BioSpider website is shown in Figure 2.

Figure 2 - A screen shot montage of BioSpider
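The "DataBean" idea, writing each field to persistent storage as it is extracted, together with its source, can be sketched as follows. SQLite stands in for MySQL here, and the schema and class shape are illustrative assumptions rather than BioSpider's Perl implementation:

```python
# Minimal DataBean-style store: every extracted value is persisted
# immediately with its provenance, giving an audit trail and a reusable
# search history. Illustrative schema; SQLite stands in for MySQL.

import sqlite3

class DataBean:
    def __init__(self, conn, search_id):
        self.conn, self.search_id = conn, search_id
        conn.execute("""CREATE TABLE IF NOT EXISTS fields
                        (search_id TEXT, field TEXT, value TEXT, source TEXT)""")

    def store(self, field, value, source):
        # Persist the value as soon as it is found, tagged with its source
        self.conn.execute("INSERT INTO fields VALUES (?, ?, ?, ?)",
                          (self.search_id, field, value, source))
        self.conn.commit()

    def fetch(self, field):
        return self.conn.execute(
            "SELECT value, source FROM fields WHERE search_id=? AND field=?",
            (self.search_id, field)).fetchone()
```

Because every write lands in the database immediately, a crashed or interrupted search leaves a usable partial record, and "saved searches" fall out of the same table for free.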
3. Validation, Comparison and Limitations
Text mining and data extraction tools can be prone to a variety of problems, many of which may lead to nonsensical results. To avoid these problems, BioSpider performs a number of self-validation or "sanity" checks on specific data extracted from the web. For example, when searching for compound synonyms, BioSpider will check that the PubChem substance page related to that synonym contains the original search name or original CAS number within the HTML for that page. This simple validation procedure can often remove bogus synonyms obtained from different websites. Another form of small-scale validation is a CAS number validation method, whereby the CAS check digit is used to validate the entire CAS number (CAS numbers incorporate a checksum: the digit immediately to the left of the check digit is multiplied by 1, the next digit by 2, the next by 3, and so on; these products are summed, and the sum modulo 10 must equal the check digit). Since the majority of the information obtained by BioSpider is screen-scraped from several websites, it is also important to validate the accessibility of these websites as well as their HTML formatting. Since screen-scraping requires one to parse the HTML, BioSpider must assume the HTML from a given website follows a specific format. Unfortunately, this HTML formatting is not static and changes over time as websites add new features or alter their design layout. For this reason, BioSpider contains an HTML validator application designed to detect changes in the HTML formatting of all the web resources that BioSpider searches. To achieve this, an initial BioSpider search was performed and saved for 10 pre-selected compounds, and the results for each field were manually validated. This validation application performs a search on these 10 pre-selected compounds weekly (as a cron job).
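The CAS check-digit rule described above is small enough to show as runnable code (this is the standard, publicly documented CAS checksum; the function name is ours):

```python
# CAS check-digit validation: the final digit is a check digit equal to
# (1*d1 + 2*d2 + 3*d3 + ...) mod 10, where d1 is the digit immediately
# left of the check digit, d2 the next one, and so on.

def valid_cas(cas):
    digits = cas.replace("-", "")
    if not digits.isdigit() or len(digits) < 5:
        return False
    check, body = int(digits[-1]), digits[:-1]
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(valid_cas("7732-18-5"))  # water: True
print(valid_cas("7732-18-4"))  # corrupted check digit: False
```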
The results of this weekly search are compared to the original results, and if there is any difference, a full report is generated and emailed to the BioSpider administrator. The assessment of any text mining or report generating program is difficult. Typically one must assess these kinds of tools using three criteria: 1) accuracy; 2) completeness; and 3) time savings. In terms of accuracy, the results produced are heavily dependent on the quality of the resources being accessed. Obviously, if the reference data are flawed or contradictory, the results from a BioSpider search will be flawed or contradictory. To avoid these problems, every effort has been made to use only high-accuracy, well-curated databases as BioSpider's primary reference sources (KEGG, SwissProt, PubChem, DrugBank, Wikipedia, etc.). As a result, perhaps the most common "detectable" errors made by BioSpider pertain to text parsing issues (with compound descriptions), but these appear to be relatively minor. The second most common error pertains to errors of omission (missing data that could be found by a human expert looking through the web or other references). In addition to these potential programmatic errors, the performance of BioSpider can be compromised by incorrect human input, such as a misspelled compound name, SMILES string or CAS number, or the submission of an erroneous MOL or SDF file. It can also be compromised by errors or omissions in the databases and websites that it searches. Some consistency or quality control checks are employed by the program to look for nomenclature or physical property disagreements, but these may not always work. BioSpider will fail to produce results for newly discovered compounds as well as compounds that lack any substantive electronic or web-accessible annotation. During real-world tests with up to 15 BioSpider users working simultaneously for 5-7 hours at a time, we typically find fewer than two or three errors being reported. This translates to roughly 1 error for every 15,000 annotation fields, depending on the type of query used. The number of errors returned is highest when searching by name or synonym, as it is difficult to ascertain correctness. Errors are much less likely when using a search that permits a direct mapping between a compound and the source websites used by BioSpider. It is thus recommended that users search by structure (InChI, SDF/MOL, SMILES) or unique database ID (PubChem ID, KEGG ID) first, resorting to CAS number or name only when necessary. Despite this high level of accuracy, we strongly suggest that every BioSpider annotation be looked over quickly to see whether any nonsensical or inconsistent information has been collected in the annotation process. Usually these errors are quite obvious. In terms of errors of omission, a human expert can typically find data for 1 or 2 fields that were not annotated by BioSpider; however, this search may take 30 to 45 minutes of intensive manual searching or reading. During the annotation of the HMDB and DrugBank, BioSpider was used to annotate thousands of metabolites, food additives and drugs.
During this process, it was noted that BioSpider was able to obtain at least some information about query compounds 91% of the time. The cases where no information was returned often involved compounds for which a simple web search would also return no results. This again spotlights one of the limitations of the BioSpider approach: its performance is directly proportional to the "web presence" of the query compound. Perhaps the most important contribution of BioSpider to annotation lies in the time savings it offers. Comparisons between BioSpider and skilled human annotators indicate that BioSpider can accelerate annotation by a factor of 40 to 50 over what is done by hand. To test this time-saving factor, 3 skilled volunteers were recruited. Each volunteer was given 3 compounds to annotate (2-Ketobutyric acid, Chenodeoxycholic acid disulfate and alpha-D-glucose) and the fields to fill in for each compound. Each volunteer was asked to search for all associated enzymes, but only asked to annotate a single enzyme by hand. The data obtained by the volunteers were then compared to the results produced by BioSpider. These tests indicated that the time taken to annotate the chemical fields averaged 40 minutes, and the biological fields 45 minutes, with a range between 22 and 64 minutes. The time taken by BioSpider was typically 5 minutes. In other words, to fill out a complete set of BioSpider data on a given small molecule (say, biotin) using manual typing and manual searches typically takes a skilled individual approximately 3 hours; using BioSpider, this can take as little as 2 minutes. Additionally, the quality of data gathered by BioSpider matched the human annotation for almost all of the fields. Indeed, it was often the case that a volunteer would give up on certain fields (PubChem substance IDs, OMIM IDs, etc.) long before completion. In terms of real-world experience, BioSpider has been used in several projects, including DrugBank and HMDB (www.hmdb.ca). It has undergone full stress testing during several "annotation workshops" with up to 50 instances of BioSpider running concurrently. BioSpider has also recently been integrated into a LIMS system (MetaboLIMS - http://www.hmdb.ca/labm/). This allows users to produce a side-by-side comparison of the data obtained using BioSpider and the data collected manually by a team of expert curators. Overall, BioSpider has undergone hundreds of hours of real-life testing, making it stable and relatively bug-free.

4. Conclusion
BioSpider is a unique application, designed to fill the gap between chemical (small-molecule) and biological (target/enzyme) information. It contains many advanced predictive algorithms and screen-scraping tools made interactively accessible via an easy-to-use web front-end. As mentioned previously, we have already reaped significant benefits from earlier versions of BioSpider in our efforts to prepare and validate a number of large chemical or metabolite databases such as DrugBank and HMDB. It is our hope that by offering the latest version of BioSpider to the public (and the metabolomics community in particular) its utility may be enjoyed by others as well.
5. Acknowledgments
The Human Metabolome Project is supported by Genome Alberta, in part through Genome Canada.
NEW BIOINFORMATICS RESOURCES FOR METABOLOMICS

JOHN L. MARKLEY, MARK E. ANDERSON, QIU CUI, HAMID R. EGHBALNIA,* IAN A. LEWIS, ADRIAN D. HEGEMAN, JING LI, CHRISTOPHER F. SCHULTE, MICHAEL R. SUSSMAN, WILLIAM M. WESTLER, ELDON L. ULRICH, ZSOLT ZOLNAI

Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Drive, Madison, Wisconsin 53706, USA
We recently developed two databases and a laboratory information system as resources for the metabolomics community. These tools are freely available and are intended to ease data analysis in both MS- and NMR-based metabolomics studies. The first database is a metabolomics extension to the BioMagResBank (BMRB, http://www.bmrb.wisc.edu), which currently contains experimental spectral data on over 270 pure compounds. Each small molecule entry consists of five or six one- and two-dimensional NMR data sets, along with information about the source of the compound, solution conditions, data collection protocol, and the NMR pulse sequences. Users have free access to peak lists, spectra, and original time-domain data. The BMRB database can be queried by name, monoisotopic mass, and chemical shift. We are currently developing a deposition tool that will enable members of the community to add their own data to this resource. Our second database, the Madison Metabolomics Consortium Database (MMCD, available from http://mmcd.nmrfam.wisc.edu/), is a hub for information on over 10,000 metabolites. These data were collected from a variety of sites with an emphasis on metabolites found in Arabidopsis. The MMC database supports extensive search functions and allows users to make bulk queries using experimental MS and/or NMR data. In addition to these databases, we have developed a new module for the Sesame laboratory information management system (http://www.sesame.wisc.edu) that captures all of the experimental protocols, background information, and experimental data associated with metabolomics samples. Sesame was designed to help coordinate research efforts in laboratories with high sample throughput and multiple investigators and to track all of the actions that have taken place in a particular study.
1. Introduction
The metabolome can be defined as the complete inventory of small molecules present in an organism. Its composition depends on the biological fluid or tissue studied and the state of the organism (health, disease, environmental challenge, etc.). Metabolomics is the study of the metabolome, usually as a high-throughput activity with the goal of discovering correlations between metabolite levels and the state of the organism. Metabolomics holds a place in systems biology
* Also Department of Mathematics, University of Wisconsin-Madison.
alongside genomics, transcriptomics, and proteomics as an approach to modeling and understanding reaction networks in cells [1-4]. Mass spectrometry (MS) and nuclear magnetic resonance (NMR) are the analytical techniques used in the majority of metabolomics studies [5, 6]. Although MS and NMR suffer from some well-documented technical limitations [7], both of these tools are of clear utility to modern metabolomics [8]. MS is now capable of detecting molecules at concentrations as low as 10^-18 molar, and high-field NMR can efficiently differentiate between molecules that are as similar in structure as glucose and galactose. Despite the availability of these impressive analytical tools, determining the molecular composition of complex mixtures is one of the most difficult tasks in metabolomics. One reason for this difficulty is a lack of publicly available tools for comparing experimental data with the existing literature on the masses and chemical shifts of common metabolites. We recently developed two databases of biologically relevant small molecules as practical tools for MS- and NMR-based research. The first of these databases is a metabolomics extension to the existing Biological Magnetic Resonance Data Bank (BioMagResBank, BMRB). The BMRB database contains experimental NMR data from over 270 pure compounds collected under standardized conditions. The peak lists, processed spectra, and raw time-domain data are freely available at http://www.bmrb.wisc.edu. Although the initial data were collected by the Madison Metabolomics Consortium (MMC), several groups in the metabolomics community have expressed interest in submitting data. We are currently developing a deposition tool that will facilitate these submissions and are encouraging others to submit their data.
Our second free resource, the Madison Metabolomics Consortium Database (MMCD, available at www.nmrfam.wisc.edu), acts as a hub for information on biologically relevant small molecules. The MMCD contains the molecular structures, monoisotopic masses, predicted chemical shifts, and database links for more than 10,000 small molecules. The interface supports single and batch-mode searches by name, molecular structure, NMR chemical shift, monoisotopic mass, and various other parameters. The MMCD is intended to be a practical tool to aid in identifying the metabolites present in complex mixtures. Another impediment in metabolomics research is the complex logistics associated with coordinating multiple investigators in studies with large numbers of samples. To address this problem, we have created a metabolomics module for our Sesame laboratory information management system (LIMS) [9].
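The kind of bulk monoisotopic-mass query the MMCD supports can be illustrated with a toy lookup: given an observed mass and a tolerance, return candidate metabolites. This is an illustrative sketch only; the three-compound candidate table and the 10 ppm tolerance are assumptions, not the MMCD's actual implementation.

```python
# Toy version of a monoisotopic-mass search of the kind the MMCD offers.
# The compound table and tolerance are illustrative assumptions.

CANDIDATES = {
    "glucose":   180.06339,  # C6H12O6
    "galactose": 180.06339,  # isomer of glucose: identical monoisotopic mass
    "alanine":    89.04768,  # C3H7NO2
}

def search_by_mass(observed_mass, tol_ppm=10.0):
    """Return compounds whose monoisotopic mass lies within tol_ppm of the query."""
    hits = []
    for name, mass in CANDIDATES.items():
        if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
            hits.append(name)
    return sorted(hits)
```

Note that glucose and galactose collide at any mass tolerance; this is exactly the kind of ambiguity that the NMR side of these resources is needed to resolve.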
We designed Sesame to capture the complete range of experimental protocols, background information, and experimental data associated with samples. The system allows users to define the actions and protocols to be tracked and supports bar-coded samples. Sesame is freely available at http://www.sesame.wisc.edu. In this paper we discuss the construction and mechanics of these resources as well as the details of our experimental designs and the sources we have drawn upon in developing these tools.

2. Data Model for Metabolomics
The Metabolomics Standards Initiative recently recommended that metabolomics studies report the details of the study design, metadata, and the experimental, analytical, data processing, and statistical techniques used [10]. Capturing these details is imperative, because they can play a major role in data interpretation [11-13]. As a result, informatics resources need to be built on a data model that can capture all of the relevant information while maintaining sufficient flexibility for future development and integration into other resources [14]. To meet this challenge, the Madison Metabolomics Consortium has adopted the Self-defining Text Archival and Retrieval (STAR) format [15-17] for storing and disseminating data. A STAR file is a flat text file with a simple format and an extensible Data Definition Language (DDL). Data are stored as tag-value pairs, and loop constructs resemble data tables. The STAR DDL is inherently a database schema that can be mapped one-to-one to a relational database model. Translating between STAR and other exchange file formats, such as XML, is a straightforward process. The STAR DDL used in our metabolomics resources was adapted from the existing data dictionary developed by the BMRB (NMR-STAR) for its work on NMR spectroscopic data of biological macromolecules and ligands. To describe the data for metabolite standard compounds, we used a subset of the NMR-STAR dictionary suitable for data from small molecules and extended the dictionary to include MS information. The information defined includes a complete chemical description of the compound (atoms, bonds, charge, etc.), nomenclature (including INChI and SMILES codes and synonyms), monoisotopic masses, links to databases through accession codes (PubChem, KEGG, CAS, and others), and additional information. Descriptions are provided for the NMR and mass spectrometers and chromatographic systems used in data
collection. Information on the sample contents and sample conditions is captured. Details of the NMR and mass spectrometry experiments can be included. For NMR, pointers to the raw NMR spectral data and the acquisition and processing parameters, experimental spectral peak parameters (peak chemical shifts, coupling constants, line widths, assigned chemical shifts, etc.), chemical shift referencing methods, theoretical chemical shift assignments, and details of the calculation methods are described. For MS, the chromatographic retention times for the compound(s) of interest and standards are defined, as well as the m/z values and intensities and pointers to the raw data files. The metabolite data dictionary is now being used to construct files containing all of the above information for the growing list of standard metabolic compounds analyzed by our consortium. The populated metabolite STAR files and the raw NMR and MS data files (instrumental binary formats) are being made freely available on the World Wide Web. The BMRB provides tools for converting NMR-STAR files into a relational database and XML files.
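The tag-value pairs and table-like loop constructs described in this section can be made concrete with a minimal sketch of a parser for a simplified STAR-like file. The tags shown and the parser itself are hypothetical illustrations of the general file shape, not the official NMR-STAR dictionary or BMRB tooling.

```python
# Minimal sketch of reading a STAR-style flat text file: top-level tag-value
# pairs plus a loop_ ... stop_ construct that behaves like a small data table.
# Simplified, illustrative subset only (no quoting rules, no save frames).

def parse_star(text):
    data, loop_tags, loop_rows, in_loop = {}, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "loop_":                      # start a tabular loop
            in_loop, loop_tags, loop_rows = True, [], []
        elif line == "stop_":                    # close the loop: store its rows
            data[tuple(loop_tags)] = loop_rows
            in_loop = False
        elif in_loop and line.startswith("_"):   # column tag inside a loop
            loop_tags.append(line)
        elif in_loop:                            # data row: one value per tag
            loop_rows.append(dict(zip(loop_tags, line.split())))
        elif line.startswith("_"):               # top-level tag-value pair
            tag, _, value = line.partition(" ")
            data[tag] = value.strip()
    return data

entry = """
_Chem_comp.Name          alanine
_Chem_comp.Monoisotopic_mass  89.04768
loop_
  _Atom.ID
  _Atom.Type
  C1 C
  C2 C
  N1 N
stop_
"""
parsed = parse_star(entry)
```

Because each loop is just a list of tag-keyed rows, mapping such a file onto relational tables (as the text notes for the STAR DDL) is a direct translation.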
3. Metabolite Database at BMRB
3.1. Approach

The metabolomics community would clearly benefit from an extensive, freely accessible spectral library of metabolite standards collected under standardized conditions. Although the METLIN database serves this role for the MS community (http://metlin.scripps.edu/about.php), most current NMR resources have limitations: they do not provide original spectral data (Sadtler Index [18], NMRShiftDB [19], the NMR metabolomics database of Linköping (MDL, http://www.liu.se/hu/mdl/main/)), contain data that were collected under non-standardized conditions ([19], MDL), or do not make their data freely available (AMIX/SBASE, http://bruker-biospin.de). To our knowledge, the Human Metabolome Project (http://www.hmdb.ca/) is the only NMR resource, apart from BMRB, without these limitations. The current sparse coverage of NMR metabolomics resources stems in part from the high investment required to compile a comprehensive library of biologically relevant small molecules under standardized conditions. Our solution is to provide at BMRB a well-defined, curated platform that will allow the deposition of data from multiple research groups and free access to all.
3.2. Rationale for Metabolomics at BMRB

The BMRB is a logical host for a metabolomics spectral library because of its history as a worldwide repository for biological macromolecule NMR data [20-22]. BMRB is a public domain service and is a member of the Worldwide Protein Data Bank. Along with its home office in Madison, Wisconsin, BMRB has mirror sites in Osaka, Japan, and Florence, Italy. BMRB is funded by the National Library of Medicine, U.S. National Institutes of Health, and its activities are monitored by an international advisory board. BMRB data are well archived, with daily onsite tape backups and offsite third-party data backup.

3.3. Data Collection and Organization

Currently, the BMRB metabolomics archive contains experimental NMR data for more than 270 compounds collected by the Madison Metabolomics Consortium. Entries contain NMR time-domain data, peak lists, processed spectra, and data acquisition and processing files for one-dimensional (1H, 13C, 13C DEPT 90°, and 13C DEPT 135°) and two-dimensional (1H-1H TOCSY and 1H-13C HSQC) NMR experiments. A BMRB entry represents a set of either experimental or theoretical data reported for a metabolic compound, mixture of compounds, or experimental sample by a depositor. Entries are further distinguished by the experimental method used (NMR or MS). Separate prefixes on entries serve to discriminate between experimental data (bmse-) and theoretical calculations (bmst-). As described above, the metadata describing the chemical compounds and experimental details, and the quantitative data extracted from experiments or theoretical calculations for a unique entry, are archived in NMR-STAR formatted text files. On the BMRB ftp site (ftp://ftp.bmrb.wisc.edu/pub/metabolomics), directories are defined for each compound or non-interconverting form of a compound (e.g., the L-amino acids). Subdirectories for NMR, MS, and literature data are listed under each compound directory.
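The entry prefixes and archive nesting just described can be sketched as two small helpers. The function names and the example identifiers are hypothetical; only the bmse-/bmst- prefixes and the compound/method/entry directory nesting are taken from the text.

```python
# Sketch of the BMRB metabolomics archive conventions described above.
# Helper names and example IDs are hypothetical illustrations.

def classify_entry(entry_id):
    """Distinguish experimental (bmse-) from theoretical (bmst-) entries."""
    if entry_id.startswith("bmse"):
        return "experimental"
    if entry_id.startswith("bmst"):
        return "theoretical"
    raise ValueError(f"unrecognized BMRB metabolomics prefix: {entry_id}")

def entry_path(compound, method, entry_id):
    """Compound directory -> method subdirectory -> per-entry directory."""
    assert method in ("NMR", "MS", "literature")
    return f"pub/metabolomics/{compound}/{method}/{entry_id}"
```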
All data associated with a BMRB experimental or theoretical entry are grouped together in a subdirectory, with the BMRB identifier located under the directory named for the compound studied and the appropriate subdirectory (NMR or MS). Data for compounds that form racemic mixtures in solution (e.g., many sugars) are grouped under a generic compound name. BMRB has developed internal tools to coordinately view spectra, peak lists, and the molecular structure; these tools are used to review deposited data for
quality assurance purposes. However, the depositor is ultimately responsible for the data submitted, and user feedback is the best defense against erroneous data in a public database. Users who encounter questionable data are encouraged to contact
[email protected]. Questionable data will be reviewed and corrected if possible; otherwise they may be removed from the site.

3.4. Presentation and Website Design
The BMRB metabolomics website has been developed to meet needs expressed by many of its users. The layout and usage of the metabolomics web pages have had several public incarnations and will probably undergo more as the site matures and grows. The first page a visitor sees contains a two-paragraph introduction to the field and a collection of Internet links to a few important small molecule sites; a more complete listing of metabolomics websites is accessed from a link in the sidebar. The information contained in these websites and databases is complementary to that collected by BMRB. The Standard Compounds page (Figure 1) provides the means for searching for metabolites of interest. For each compound archived, an individual summary page (Figure 2) is created dynamically from the collection of files located in the standard substance sub-directory associated with that compound. A basic chemical description is provided from information BMRB collects from PubChem at the National Institutes of Health, National Center for Biotechnology Information, National Library of Medicine (http://www.ncbi.nlm.nih.gov/). A two-dimensional stick drawing is created. Three-dimensional '.mol' files are generated from the two-dimensional '.sdf' files obtained from PubChem, and these are displayed using Jmol. Links are created to one or more PubChem entries and to the KEGG entry if available. Synonym information and various nomenclature descriptions such as INChI codes, IUPAC names, and SMILES strings are given.

Figure 1. Metabolomics standard substances page in the BMRB website.
The use of dynamic information presentation techniques allows BMRB to create tools that search through the data or calculate answers according to specific user input. NMR data can be displayed in a variety of ways: as a collection of spectra, as a spectrum along with its peak list, or simply as a single spectrum of interest. Links allow the user to access the
The doublet component, d(Ala-C2), is the main contributor to the isotopomer population of [2,3-13C2]alanine, as assessed by the relative intensity of the doublet (1JCC = 35 Hz) of alanine C2. The total fraction of pyruvate derived from the PPP can be estimated as 5/2 · d(Ala-C2).

TCA cycle and anaplerotic flux. [U-13C]pyruvate enters the TCA cycle either by pyruvate dehydrogenase oxidation or by the anaplerotic reaction of
pyruvate carboxylase. The first process generates [4,5-13C2]α-ketoglutarate via [1,2-13C2]acetyl-CoA. Since the intracellular α-ketoglutarate concentration is too low to be detected by NMR, its labeling state was assessed via glutamate, an abundant metabolite in rapid exchange with α-ketoglutarate. The isotopomer population of [4,5-13C2]glutamate reflects the flux through pyruvate dehydrogenase, which equals the TCA cycle (citrate synthase) flux, provided the acetyl-CoA synthetase flux is zero. The second process is expected to yield a distinct labeling pattern represented by [1,2,3-13C3] and [2,3-13C2] glutamate. This pattern reflects the formation of [1,2,3-13C3] and [2,3,4-13C3] oxaloacetate due to the pyruvate carboxylase reaction followed by the reversible interconversion between the asymmetric oxaloacetate and symmetric succinate (or fumarate). The relative activity of pyruvate carboxylase versus pyruvate dehydrogenase (vPC / vPDH) was calculated from the 13C multiplet components of glutamate at C3 and C4 using Eq. 1:

\[ \frac{v_{PC}}{v_{PDH}} = \frac{d(\text{Glu-C3})}{d^{*}(\text{Glu-C4})} \tag{1} \]

Here P_Gly-C2 and P_Ala-C2 are the specific enrichments of glycine C2 and alanine C2, P_n is the enrichment of the non-synthesized (natural-abundance) pool, and X_syn is the fraction of glycine synthesized de novo. P_Gly-C2 can be calculated from X_syn, P_n, and P_Ala-C2 using the relation P_Gly-C2 = X_syn · P_Ala-C2 + (1 − X_syn) · P_n. Therefore X_syn can be derived from the analysis of the 13C multiplets of alanine C2 and glycine C2 using Eq. 3:
\[ X_{syn} = \frac{P_{\text{Gly-C2}} - P_{n}}{P_{\text{Ala-C2}} - P_{n}} \tag{3} \]
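Eq. 3 is simply the inversion of the mixing relation stated above, which can be sanity-checked numerically. The enrichment values below are made up for illustration (P_n would typically sit near the 13C natural abundance of about 0.011).

```python
# Numerical sanity check of the glycine-synthesis mixing relation,
#   P_Gly-C2 = X_syn * P_Ala-C2 + (1 - X_syn) * P_n,
# and its inversion for X_syn (Eq. 3). Enrichment values are illustrative.

def mix(x_syn, p_ala_c2, p_n):
    """Enrichment of glycine C2 as a mixture of de novo and unlabeled pools."""
    return x_syn * p_ala_c2 + (1.0 - x_syn) * p_n

def x_syn(p_gly_c2, p_ala_c2, p_n):
    """Fraction of glycine synthesized de novo, solved from the mixing relation."""
    return (p_gly_c2 - p_n) / (p_ala_c2 - p_n)

p_n, p_ala = 0.011, 0.60          # illustrative enrichments
p_gly = mix(0.4, p_ala, p_n)      # forward: assume 40% of glycine made de novo
recovered = x_syn(p_gly, p_ala, p_n)
```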
3. Results

3.1. NMR Spectral Assignment

Fig. 1 shows a typical two-dimensional [13C,1H] HSQC spectrum of metabolites extracted from the human breast cancer cells. The assignment of 13C-1H cross peaks for various metabolites was made by comparing the carbon and proton chemical shifts with literature values (12-17), with spectra of pure compounds, and by spiking the samples. Overall, 24 metabolites could be unambiguously assigned. The details of peak assignments and the reference summary Table S1 of characteristic chemical shifts are provided in the SOM.
Fig. 1. A typical two-dimensional [13C,1H] HSQC spectrum of the metabolites extracted from breast cancer cells. Abbreviations for the assigned peaks are as in Table S1.
3.2. Metabolic Fluxes

A comparison of the relative intensities of 13C-13C scalar coupling multiplet components of various metabolites extracted from [U-13C]glucose-labeled MCF-10A and MDA-MB-435 cells is shown in Table 1. These data were used in the 13C isotopomer model to determine the metabolic fluxes or flux ratios through individual pathways, including glycolysis, the PPP, the TCA cycle and anaplerotic reaction, and the fatty acid and amino acid biosynthetic pathways (Fig. 2).
Table 1. Relative intensities of 13C multiplet components of metabolites extracted from MCF-10A and MDA-MB-435 cells grown on [U-13C]glucose a

Carbon position | Isotopomer population | Multiplet | MCF-10A | MDA-MB-435
Alanine-C2 | 2-13C | s | 0.27 | 0.16
 | 2,3-13C2 | d | 0.01 | 0.11
 | 1,2-13C2 | d* | 0.01 | 0.01
 | 1,2,3-13C3 | q | 0.71 | 0.72
Alanine-C3 | 3-13C | s | 0.28 | 0.17
 | 2,3-13C2 | d | 0.72 | 0.83
Lactate-C3 | 3-13C | s | 0.16 | 0.20
 | 2,3-13C2 | d | 0.84 | 0.80
Acetyl-CoA (GlcNAc/GalNAc)-C2 | 2-13C | s | 0.29 | 0.14
 | 1,2-13C2 | d | 0.71 | 0.86
Glutamine-C4 | 4-13C | s | 0.50 | _b
 | 3,4-13C2 | d | 0.01 | _b
 | 4,5-13C2 | d* | 0.48 | _b
 | 3,4,5-13C3 | q | 0.01 | _b
Glutamate-C3 | 3-13C | s | 0.73 | 0.72
 | 2,3-13C2/3,4-13C2 | d | 0.27 | 0.27
 | 2,3,4-13C3 | t | 0 | 0.01
Glutamate-C4 | 4-13C | s | 0.50 | 0.30
 | 3,4-13C2 | d | 0.01 | 0.01
 | 4,5-13C2 | d* | 0.48 | 0.66
 | 3,4,5-13C3 | q | 0.01 | 0.03
Glu (GSH)-C3 | 3-13C | s | 0.67 | 0.71
 | 2,3-13C2/3,4-13C2 | d | 0.32 | 0.28
 | 2,3,4-13C3 | t | 0.01 | 0.01
Glu (GSH)-C4 | 4-13C | s | 0.24 | 0.13
 | 3,4-13C2 | d | 0.02 | 0.02
 | 4,5-13C2 | d* | 0.70 | 0.73
 | 3,4,5-13C3 | q | 0.04 | 0.12
Gly (GSH)-C2 | 2-13C | s | 0.88 | 0.27
 | 1,2-13C2 | d | 0.12 | 0.73
Glycine-C2 | 2-13C | s | 0.86 | 0.27
 | 1,2-13C2 | d | 0.14 | 0.73
Proline-C4 | 4-13C | s | 1.00 | 0.25
 | 4,5-13C2 | d | 0.00 | 0.71
 | 3,4,5-13C3 | t | 0.00 | 0.04
Proline-C5 | 5-13C | s | 1.00 | 0.25
 | 4,5-13C2 | d | 0.00 | 0.75

a s, singlet; d, doublet (1JCC ~35 Hz); d*, doublet split by a large coupling constant (1JCC ~60 Hz); t, triplet; q, quartet.
b Resonance of glutamine C4 is below the detection level in the MDA-MB-435 cells.
The relative activity of the PPP versus glycolysis was determined based on the analysis of the 13C multiplets of alanine C2 as described above. The contribution of the signature doublet (1JCC = 35 Hz) to the multiplets of alanine C2 is very small in MCF-10A but significant in MDA-MB-435 cells (Table 1), suggesting that the relative contribution of the PPP to the production of pyruvate is substantially higher in malignant cells (28%) than in nonmalignant cells (~2%), where the bulk of pyruvate stems from glycolysis (Fig. 2). The increased use of the PPP enables the MDA-MB-435 cells not only to supply more ribose for nucleic acid synthesis, but also to recruit more of the NADPH reducing power for fatty acid synthesis. Indeed, the GC/MS analysis performed in this study revealed that 47% of palmitate is newly synthesized from glucose in MDA-MB-435 cells (Fig. 2), in correlation with the observed increase in PPP flux. The de novo synthesized fractions of palmitoleate, stearate, and oleate are 37%, 35%, and 18%, respectively. This is in marked contrast with almost no de novo fatty acid synthesis in MCF-10A cells, as evidenced by the lack of 13C tracer accumulation in palmitate, palmitoleate, stearate, or oleate.

Fig. 2. Metabolic fluxes in MCF-10A and MDA-MB-435 cells (mean + s.d.; n=4). Categories: pyruvate from the PP pathway; fatty acid synthesized from glucose; contribution of anaplerosis to the TCA cycle; Gly synthesized from glucose; Pro synthesized from glucose.
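The flux-diagnostic isotopomer ratios quoted in the text can be recomputed directly from the multiplet intensities in Table 1. The short sketch below does so for the acetyl-CoA C2, glutathione γ-glutamyl C4, and glutamate C4 ratios, using the tabulated values.

```python
# Recompute isotopomer ratios from the Table 1 multiplet intensities
# (each pair of values: MCF-10A, MDA-MB-435).

# Acetyl-CoA C2 (via GlcNAc/GalNAc): 1,2-13C2 (d) over 2-13C (s)
accoa = {"MCF-10A": 0.71 / 0.29, "MDA-MB-435": 0.86 / 0.14}

# Glutathione gamma-glutamyl C4: (4,5-13C2 + 3,4,5-13C3) / (4-13C + 3,4-13C2),
# i.e. (d* + q) / (s + d)
gsh_c4 = {"MCF-10A": (0.70 + 0.04) / (0.24 + 0.02),
          "MDA-MB-435": (0.73 + 0.12) / (0.13 + 0.02)}

# Glutamate C4, same combination of multiplet components
glu_c4 = {"MCF-10A": (0.48 + 0.01) / (0.50 + 0.01),
          "MDA-MB-435": (0.66 + 0.03) / (0.30 + 0.01)}
```

These reproduce the values discussed below (about 2.5 and 6.1 for acetyl-CoA, 2.8 and 5.7 for glutathione C4, and 0.96 and 2.2 for glutamate C4).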
The relative fluxes through pyruvate carboxylase and pyruvate dehydrogenase were estimated from the analysis of glutamate labeling. The major isotopomer populations of 4,5-13C2 of glutamate and of the γ-glutamyl moiety of glutathione indicated that these carbon atoms are derived from [1,2-13C2]acetyl-CoA (Table 1). The isotopomer ratio of acetyl-CoA C2, 1,2-13C2/2-13C1, which can be assessed via the acetyl moiety of GlcNAc or GalNAc, is 2.5 for MCF-10A and 6.1 for MDA-MB-435. Whereas these ratios are similar to the isotopomer ratios of (4,5-13C2 + 3,4,5-13C3)/(4-13C1 + 3,4-13C2) of glutathione C4 (2.8 for MCF-10A and 5.7 for MDA-MB-435), they are markedly different from the glutamate C4 ratios (0.96 for MCF-10A and 2.2 for MDA-MB-435). This indicates that the C4 and C5 in the γ-glutamyl moiety of glutathione are solely derived from acetyl-CoA, whereas glutamate is likely subject to isotopic dilution originating from a non-enriched carbon source (e.g., glutamine). Therefore, the isotopomer distribution of the γ-glutamyl moiety of glutathione was used to determine the relative activity of the anaplerotic reaction versus the TCA cycle. The observed flux ratio of the pyruvate carboxylase reaction over the TCA cycle is slightly decreased in MDA-MB-435 compared to MCF-10A cells (Fig. 2). Analysis of the 13C labeling patterns of the nonessential amino acids allowed us to determine the activity of the respective biosynthetic pathways. Using the 13C isotopomer model, we found that cysteine is obtained directly from media components, and that the activity of glutamate and glutamine biosynthesis is not changed significantly between MCF-10A and MDA-MB-435 cells (data not shown). Interestingly, MCF-10A cells do not utilize glucose for the synthesis of glycine and proline, whereas these amino acids are actively synthesized from glucose in MDA-MB-435 cells (Fig. 2).

3.3. Metabolite Pools

We used the 2D NMR data from the same labeling experiments to determine and compare the concentrations of unambiguously assigned metabolites (Table 2). Quantitation of metabolites with natural isotope abundance directly yields the total metabolite concentrations. At the same time, the differences observed for biosynthetically labeled metabolites may originate from changes in pool sizes as well as from the 13C enrichment. In many cases these effects can be decoupled, as illustrated below. Comparison of the MCF-10A and MDA-MB-435 cell lines revealed significant changes in the pool sizes of many metabolites.
For example, malignant cells exhibited significantly increased glutathione, m-inositol, and creatine concentrations and decreased isoleucine, leucine, valine, and taurine concentrations. The phosphocholine level is higher, whereas free choline and glycerophosphocholine were below the detection level in MDA-MB-435. The observed 12-fold increase in the C2 and C3 peaks of succinate cannot be explained solely by the 13C enrichment, which could account for only ~12% of the overall increase. The latter estimate is based on the labeling pattern of α-ketoglutarate deduced from the observed ~1.3-fold 13C enrichment at the C3 and C4 of the γ-glutamyl moiety of glutathione. Therefore, the total pool size of succinate was significantly increased in MDA-MB-435 cells. A similar approach allowed us to establish a substantial increase in the total pool size of GlcNAc or GalNAc and a decrease in those of alanine, glutamine, and glycine (Fig. 3).
Table 2. Comparison of metabolite concentrations in MCF-10A and MDA-MB-435 cells a

Metabolite | Ratio MDA-MB-435 / MCF-10A
Arginine b | 0.98 ± 0.15
GSH b | 1.59 ± 0.08
Isoleucine b | 0.27 ± 0.04
Leucine b | 0.48 ± 0.05
Lysine b | 0.74 ± 0.16
Valine b | 0.26 ± 0.03
m-Inositol b | 1.75 ± 0.10
Free choline b |
Phosphocholine b |
Glycerophosphocholine b |
Total choline b |
Phosphocholine / glycerophosphocholine b |
Creatine b |
Taurine b |
GlcNAc / GalNAc C2 c | 14.7
UDP-GlcNAc / UDP-GalNAc C2 c | 2.56 ± 0.64
UTP/UDP C1 c | 3.38 ± 0.53

a Relative amounts of the various compounds were obtained by normalizing peaks to the internal reference standard, and further normalized per 1 mg of total protein (mean + s.d.; n=4).
b Quantitation of metabolites with natural isotope abundance (a direct measure of metabolite concentrations).
c Differences observed for biosynthetically labeled metabolites may reflect both a 13C enrichment and a change in the total pool size.
4. Discussion

The key aspects of the metabolomics methodology used in this study were:
1. A comparative approach was applied to assess metabolic changes in a model system of the highly metastatic cell line MDA-MB-435 versus the immortalized nontumorigenic cell line MCF-10A.
2. [U-13C]glucose labeling followed by high-resolution 2D NMR spectroscopy allowed us to monitor twenty-four intracellular metabolites (Tables 1 and 2) in addition to the fatty acids analyzed by GC-MS.
3. An extensive 13C isotopomer model was developed to determine and compare fluxes through the key central metabolic pathways, including glycolysis, the PPP, the TCA cycle and anaplerotic reactions, and the biosynthetic pathways of fatty acids and non-essential amino acids (Fig. 2).
4. A combination of fluxes with individual metabolite pools within a single metabolic reconstruction framework expanded our ability to interpret the underlying metabolic transitions (Fig. 3).
Although most of the individual components of this approach have been described previously, to our knowledge this is the first study in which a combination of these techniques was systematically applied to the metabolomics of
cancer. Although comprehensive isotopomer models are widely used in microbial systems (18,19), only a few models have been described for human cells (20-29). Most of these models were restricted to relatively narrow metabolic subnetworks (20-25) or were based on the labeling data for one (i.e., glutamate (25,26)) or a few individual metabolites (27-29). Due to the higher sensitivity of the HSQC method compared to regular 13C NMR, we were able to decrease the amount of cells required for the analysis. The increased signal dispersion in 2D spectra allowed us to analyze a wide range of metabolites without prior separation.
Fig. 3. Metabolic profile changes in breast tumors compared with normal human mammary epithelial cells. The arrows represent the fluxes. Fluxes are normalized to the glucose uptake rate. The boldface arrows indicate the fluxes that are significantly upregulated. The pool sizes of boxed metabolites are directly assessed by [13C,1H] HSQC. Metabolites are colored if their concentrations are increased (black), decreased (white), or not changed (gray). G6P, glucose-6-phosphate; R5P, ribose-5-phosphate; GAP, glyceraldehyde-3-phosphate; 3-PG, 3-phosphoglycerate. See other abbreviations in Table S1 given in the SOM.
An integration of fluxes and pool sizes acquired within a single experiment
gives a more detailed fingerprint of the phenotype compared to conventional approaches based on a single parameter. Although fluxes provide a direct measure of metabolic activities pointing to potential targets, they can usually be obtained only for a subset of central metabolic pathways. Metabolite pools can be readily assessed for both central and peripheral metabolites. While providing only indirect evidence of metabolic activities, they can be used as biomarkers. We observed a sharp increase in metabolic activity of several pathways in cancer cells (Figs. 2 and 3). Some of these observations, such as upregulation of PPP and fatty acid synthesis, are consistent with previous reports (30,31), providing us with a validation of the approach. An increase in other fluxes, e.g., the synthesis of glycine and proline, is reported here for the first time. Possible implications of these changes in establishing and maintaining a breast cancer phenotype are yet to be explored. Some of the observed changes in metabolite pools can be readily interpreted in the context of the respective fluxes. For example, the pools of all monitored amino acids decreased or remained largely unchanged in cancer cells, despite the established upregulation of some of the respective biosynthetic pathways (Fig. 3). This is consistent with accelerated consumption of amino acids for protein synthesis. At the same time, the pool of glutathione (GSH in Fig. 3), which is not consumed at the same level, increased in keeping with the increased synthetic flux. Overproduction of GSH in tumors may reflect increased resistance towards oxidative stress (32). We observed significant alterations in pools of several peripheral metabolites (e.g., creatine and taurine), whose metabolism may not be easily assessed via flux measurements.
Therefore, the results obtained in this study, in addition to validating the approach, provide new information about metabolic aspects of tumorigenesis, and can aid the identification of new diagnostic and therapeutic targets. The presented approach constitutes a promising analytical tool to screen different metabolic phenotypes in a variety of cell types and pathological conditions.
REFERENCES
1. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., and Herskowitz, I. (1998) Science 282, 699-705
2. Klose, J., Nock, C., Herrmann, M., Stuhler, K., Marcus, K., Bluggel, M., Krause, E., Schalkwyk, L. C., Rastan, S., Brown, S. D., Bussow, K., Himmelbauer, H., and Lehrach, H. (2002) Nat Genet 30, 385-393
3. Voss, T., Ahorn, H., Haberl, P., Dohner, H., and Wilgenbus, K. (2001) Int J Cancer 91, 180-186
4. Moch, H., Schraml, P., Bubendorf, L., Mirlacher, M., Kononen, J., Gasser, T., Mihatsch, M. J., Kallioniemi, O. P., and Sauter, G. (1999) Am J Pathol 154, 981-986
5. Celis, J. E., Celis, P., Ostergaard, M., Basse, B., Lauridsen, J. B., Ratz, G., Rasmussen, H. H., Orntoft, T. F., Hein, B., Wolf, H., and Celis, A. (1999) Cancer Res 59, 3003-3009
6. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999) Science 286, 531-537
7. Dang, C. V., Lewis, B. C., Dolde, C., Dang, G., and Shim, H. (1997) J Bioenerg Biomembr 29, 345-354
8. Lee, W. N., Bassilian, S., Guo, Z., Schoeller, D., Edmond, J., Bergner, E. A., and Byerley, L. O. (1994) Am J Physiol 266, E372-383
9. Wittmann, C., and Heinzle, E. (1999) Biotechnol Bioeng 62, 739-750
10. Delaglio, F., Grzesiek, S., Vuister, G. W., Zhu, G., Pfeifer, J., and Bax, A. (1995) J Biomol NMR 6, 277-293
11. Szyperski, T. (1995) Eur J Biochem 232, 433-448
12. Gribbestad, I. S., Petersen, S. B., Fjosne, H. E., Kvinnsland, S., and Krane, J. (1994) NMR Biomed 7, 181-194
13. Gribbestad, I. S., Sitter, B., Lundgren, S., Krane, J., and Axelson, D. (1999) Anticancer Res 19, 1737-1746
14. Pal, K., Sharma, U., Gupta, D. K., Pratap, A., and Jagannathan, N. R. (2005) Spine 30, E68-72
15. Patel, A. B., Srivastava, S., Phadke, R. S., and Govil, G. (1999) Anal Biochem 266, 205-215
16. Sharma, U., Atri, S., Sharma, M. C., Sarkar, C., and Jagannathan, N. R. (2003) NMR Biomed 16, 213-223
17. Sharma, U., Mehta, A., Seenu, V., and Jagannathan, N. R. (2004) Magn Reson Imaging 22, 697-706
18. Dauner, M., Bailey, J. E., and Sauer, U. (2001) Biotechnol Bioeng 76, 144-156
19. Schmidt, K., Nielsen, J., and Villadsen, J. (1999) J Biotechnol 71, 175-189
20. Fernandez, C. A., and Des Rosiers, C. (1995) J Biol Chem 270, 10037-10042
21. Lapidot, A., and Gopher, A. (1994) J Biol Chem 269, 27198-27208
22. Jeffrey, F. M., Storey, C. J., Sherry, A. D., and Malloy, C. R. (1996) Am J Physiol 271, E788-799
23. Malloy, C. R., Sherry, A. D., and Jeffrey, F. M. (1988) J Biol Chem 263, 6964-6971
24. Vercoutere, B., Durozard, D., Baverel, G., and Martin, G. (2004) Biochem J 378, 485-495
25. Lu, D., Mulder, H., Zhao, P., Burgess, S. C., Jensen, M. V., Kamzolova, S., Newgard, C. B., and Sherry, A. D. (2002) Proc Natl Acad Sci USA 99, 2708-2713
26. Cline, G. W., Lepine, R. L., Papas, K. K., Kibbey, R. G., and Shulman, G. I. (2004) J Biol Chem 279, 44370-44375
27. Boren, J., Cascante, M., Marin, S., Comin-Anduix, B., Centelles, J. J., Lim, S., Bassilian, S., Ahmed, S., Lee, W. N., and Boros, L. G. (2001) J Biol Chem 276, 37747-37753
28. Boren, J., Lee, W. N., Bassilian, S., Centelles, J. J., Lim, S., Ahmed, S., Boros, L. G., and Cascante, M. (2003) J Biol Chem 278, 28395-28402
29. Portais, J. C., Schuster, R., Merle, M., and Canioni, P. (1993) Eur J Biochem 217, 457-468
30. Boros, L. G., Cascante, M., and Lee, W. N. (2002) Drug Discov Today 7, 364-372
31. Baron, A., Migita, T., Tang, D., and Loda, M. (2004) J Cell Biochem 91, 47-53
32. Meister, A. (1991) Pharmacol Ther 51, 155-194
METABOLIC FLUX PROFILING OF REACTION MODULES IN LIVER DRUG TRANSFORMATION

JEONGAH YOON, KYONGBUM LEE
Department of Chemical & Biological Engineering, Tufts University, 4 Colby Street, Medford, MA, 02155, USA

With appropriate models, the metabolic profile of a biological system may be interrogated to obtain both significant discriminatory markers as well as mechanistic insight into the observed phenotype. One promising application is the analysis of drug toxicity, where a single chemical triggers multiple responses across cellular metabolism. Here, we describe a modeling framework whereby metabolite measurements are used to investigate the interactions between specialized cell functions through a metabolic reaction network. As a model system, we studied the hepatic transformation of troglitazone (TGZ), an antidiabetic drug withdrawn due to idiosyncratic hepatotoxicity. Results point to a well-defined TGZ transformation module that connects to other major pathways in the hepatocyte via amino acids and their derivatives. The quantitative significance of these connections depended on the nutritional state and the availability of the sulfur-containing amino acids.
1. Introduction
Metabolites are intermediates of essential biochemical pathways that convert nutrient fuel to energy, maintain cellular homeostasis, eliminate harmful chemicals, and provide building blocks for biosynthesis. Many metabolites are in free exchange with the extracellular medium, and may be used to obtain quantitative estimates of biochemical pathway activities in intact cells. In recent years, metabolite measurement arrays, or metabolic profiles, in conjunction with appropriate models, have been used for a variety of applications, e.g. comparisons of plant phenotypes [1], elucidation of new gene functions [2], and discovery of disease biomarkers [3]. Another promising application is the study of drug-mediated toxicity in specialized metabolic organs such as the liver. One approach to identifying drug toxicity markers has been to extract characteristic fingerprints by applying pattern recognition techniques to 'metabonomic' data obtained through nuclear magnetic resonance (NMR) spectroscopy [4]. An alternative and complementary approach is to build structured network models applicable to metabolomic data. These models could be used, for example, to globally characterize the effects of drug chemicals across cell metabolism, and thereby identify potential metabolic burdens; to associate adverse events, such as the formation of a harmful derivative, with
specific marker metabolites; and to formulate hypotheses on the mechanisms of drug toxicity. Here, we describe a modeling framework for characterizing the modularity of specific reaction clusters, in this case xenobiotic transformation. At its core, this framework consists of an algorithm for top-down partitioning of directed graphs with non-uniform edge weight distributions. The core algorithm is further augmented with metabolic flux profiling and stoichiometric vector space analysis. Thus, our modeling framework is well-suited for leveraging advances in both analytical technologies and biological informatics, especially genome annotation and pathway database construction [5]. As a model system, we considered the metabolic network of the liver, which is the major site of xenobiotic transformation in the body. Representative metabolic profile data were obtained for cultured rat or human hepatocytes from prior work [6, 7]. The model xenobiotic was troglitazone (TGZ), an anti-diabetic drug that has recently been withdrawn due to idiosyncratic liver toxicity [8]. The exact mechanisms of toxicity remain unknown, but could involve the formation of harmful derivatives through metabolic activation, cellular energy depletion via mitochondrial membrane damage [9], or other metabolic burdens such as oxidative stress [10]. In this work, we utilize our modularity analysis model to characterize the connections between the reactions of known TGZ conjugates and the major pathways of liver cellular metabolism. This type of analysis should complement more detailed studies on the roles of specific conjugation enzymes by identifying their interdependence with other major components of the cellular metabolic network. In the case of TGZ transformation, our results indicate that the key connectors are sulfur-containing amino acids and their derivatives.

2. Methods
2.1. Liver metabolic network

Stoichiometric models of liver central carbon metabolism were constructed as follows. First, a list of enzyme-mediated reactions was collected from an annotated genome database [11]. Second, stoichiometric information was added for each of the collected enzymes by cross-referencing their common names and enzyme commission (EC) numbers using the KEGG database [12]. Third, biochemistry textbooks and the published literature [13] were consulted to build organ (liver) and nutritional state (fed or fasted) specific models. Net flux directions of reversible or reciprocally regulated pathways were set based on the nutritional state. These models were rendered into compound, directed graphs, visualized using the MATLAB (MathWorks, Natick, MA) Bioinformatics
toolbox, and corrected for missing steps and nonsensical dead ends. Reversible reactions flanked by irreversible reactions were assigned directionality so as to ensure unidirectional metabolic flux between the flanking reactions. The pathway memberships and other dimensional characteristics are summarized for each of the two models in Table 1*.

Table 1. Pathway memberships of the fed- and fasted-state liver models. Pathways: alcohol metabolism, amino acid metabolism, bile acid synthesis, cholesterol synthesis, gluconeogenesis, glycogen synthesis, glycolysis, ketone body metabolism, lipogenesis, lipolysis/β-oxidation, oxidative phosphorylation, PPP, TCA cycle, TGZ metabolism, urea cycle. [The fed/fasted membership marks of the original table are not recoverable.]

[Table of sulfur amino acid and TGZ transformation reactions with their fed- and fasted-state flux values; the numeric columns are not recoverable. The reactions listed are:]
… → Pyruvate + SO3^2- + Glutamate
Cysteine → Pyruvate + NH4+ + HS-
HS- + 2 Glutathione + 2 O2 → GSSG + HSO3- + H2O
TGZ uptake
Glutamate + Cysteine + Glycine → Glutathione
TGZ + Glutathione → TGZ-GSH
TGZ + SO3^2- → TGZ-Sulfate
TGZ + HSO3- → TGZ-Sulfate
TGZ → TGZ-Quinone
TGZ → TGZ-Glucuronide
TGZ-GSH secretion
TGZ-Sulfate secretion
TGZ-Glucuronide secretion
TGZ-Quinone secretion

Measured inputs are shown in bold. +TGZ: flux distribution calculated by MFA with total drug uptake set to 0.46 µmol/10^6 cells/day. Max GSH: flux distribution calculated by FBA with upper and lower bounds on glucose, TG, GLN, urea, and TGZ.
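The construction steps of Section 2.1 (collect reactions, attach stoichiometry, render the model as a compound directed graph) can be sketched with plain dictionaries. The three-reaction fragment below is illustrative only, not the actual liver model:

```python
# Render a stoichiometric reaction list as a compound directed graph,
# as in Section 2.1. The reactions here are an illustrative fragment only.

reactions = {
    # reaction name: (substrates, products)
    "GSH_synthesis": (["Glutamate", "Cysteine", "Glycine"], ["Glutathione"]),
    "TGZ_GSH_conjugation": (["TGZ", "Glutathione"], ["TGZ-GSH"]),
    "TGZ_GSH_secretion": (["TGZ-GSH"], []),
}

def to_directed_graph(rxns):
    """Edges run substrate -> reaction -> product (a compound graph)."""
    edges = set()
    for name, (subs, prods) in rxns.items():
        for s in subs:
            edges.add((s, name))
        for p in prods:
            edges.add((name, p))
    return edges

graph = to_directed_graph(reactions)
# Glutathione links GSH synthesis to the TGZ conjugation module:
print(("GSH_synthesis", "Glutathione") in graph,
      ("Glutathione", "TGZ_GSH_conjugation") in graph)
# → True True
```

Representing both metabolites and reactions as nodes (a compound graph) keeps many-substrate, many-product reactions unambiguous, which matters for the partitioning described later.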
Interestingly, the two models predicted qualitatively similar trends despite their significantly different compositions and measured inputs, suggesting that there were a limited number of actively engaged connections between TGZ transformation and the other metabolic pathways. The major quantitative difference involved the contribution of the GSH conjugate. Thus, we next
examined the effect of increasing the availability of this conjugation substrate by simulating flux distributions that maximized GSH synthesis under the same stoichiometric and thermodynamic constraints applied to the MFA problems. To obtain flux values numerically compatible with the MFA results, we also assigned upper and lower bounds to the major carbon and nitrogen sinks and sources based on their respective measured external flux values. As expected, the flux through the GSH synthesis step (vGSH) increased significantly for both the fed- and fasted-state models (in µmol/10^6 cells/day), from 0.07 to 0.59 and 0 to 0.50, respectively, when the maximization objective was paired with no direct constraints on the uptake or output of the amino acid reactants. The only indirect constraint on GLU was applied through the upper and lower bounds on GLN (0.75 and 3 µmol/10^6 cells/day, respectively), which were not approached. However, the higher vGSH flux for the fed-state model suggests a positive correlation with GLN uptake, which was significantly higher for the fed-state model. The predicted distribution of conjugation reaction fluxes was 65% TGZ-GSH and 35% TGZ-S for the fed-state model and 54% TGZ-GSH and 46% TGZ-S for the fasted-state model. Both models predicted zero fluxes for the formation of the glucuronide and quinone conjugates, suggesting that the distribution of the TGZ derivatives may be dramatically altered by the availability of GSH, which in turn is influenced by the medium supply of its constituent amino acids. The increase in TGZ-GSH was accompanied by an increase in TGZ-S formation, likely because the cysteine component of GSH also acts as a source of sulfate (HSO3- and SO3^2-), which drive the formation of TGZ-S. Cysteine as well as its sulfate derivatives mutually interacts with other intermediates of central carbon metabolism. These interactions have been further characterized through modularity analysis.

3.2. Reaction modules

To characterize the interconnections between TGZ derivatives and other major liver metabolites, we applied a partition algorithm to directed graph representations of the various network models with and without edge-weights. The left-hand panels of Fig. 1 show the optimal partitions of the fed-state model without an edge-weight matrix (a), with an edge-weight matrix derived from MFA (c), and with an edge-weight matrix derived from FBA (e). Figs. 1b, 1d, and 1f show the corresponding partitions of the fasted-state model. Optimality was evaluated based on the projection and match scores (see Methods, Fig. 2). For both the fed- and fasted-state models, the inclusion of reaction flux, or connection activity, significantly influenced their modularity. When only connectivity was considered, the (unweighted) fed-state network was optimally partitioned at iteration number 34 (Fig. 1a). Three modules were generated.
Figure 1. Optimal partitions of the liver network models. Left- and right-hand column panels show fed- and fasted-state models, respectively. Partition without flux weights (a, b), with flux weights (c, d), and with flux weights maximizing GSH (e, f). Arrows indicate carbon flow between modules as determined from the partition of the previous iteration.
The smallest module consisted of two metabolites in lipid synthesis (palmitate, PAL, and triglyceride, TG). The largest module included all other metabolites with the exception of TGZ and its direct derivatives, which constituted the remaining third module. When an edge-weight matrix was applied with MFA-derived fluxes, the optimal partition was reached at iteration 8 (Fig. 1c). Four modules (consisting of at least two connected nodes) were found. The smallest module consisted of metabolites in the urea cycle. A second module consisted of lipid synthesis and PPP metabolites. A third module consisted of the TCA cycle metabolites. The largest module included TGZ, its direct derivatives, and the intermediates of amino acid and pyruvate metabolism. When a different edge-weight matrix was used with a flux distribution corresponding to maximal GSH synthesis, the optimal partition (reached at iteration 8) consisted of three modules (Fig. 1e). The two smaller modules were identical to the two smallest modules of the partition in Fig. 1c. The third module essentially combined the larger two modules of Fig. 1c, with connections through the reactions in and around the urea and TCA cycles.
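The top-down partitioning used here builds on edge-betweenness centrality (refs. [19, 20] of this paper). A stripped-down sketch of one splitting step, simplified to an unweighted, undirected toy graph (the actual models are directed and edge-weighted):

```python
# A stripped-down Girvan-Newman-style split: repeatedly remove the edge with
# the highest shortest-path betweenness until the graph breaks into more
# components. Unweighted and undirected for simplicity; the toy graph is
# two triangles joined by a single bridge, not the liver network.
from collections import deque

def components(adj):
    """Connected components of an undirected adjacency dict."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def edge_betweenness(adj):
    """Approximate betweenness: count edges on one BFS path per node pair."""
    score = {}
    for src in adj:
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        for dst in parent:
            u = dst
            while parent[u] is not None:
                edge = tuple(sorted((u, parent[u])))
                score[edge] = score.get(edge, 0) + 1
                u = parent[u]
    return score

def split_once(adj):
    """Remove highest-betweenness edges until the component count rises."""
    adj = {u: set(vs) for u, vs in adj.items()}
    target = len(components(adj)) + 1
    while len(components(adj)) < target:
        eb = edge_betweenness(adj)
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by a single bridge edge C-D.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"}}
modules = split_once(toy)
print(sorted(sorted(m) for m in modules))
# → [['A', 'B', 'C'], ['D', 'E', 'F']]
```

The bridge edge carries every cross-triangle shortest path, so it accumulates the highest betweenness and is removed first, splitting the graph into its two natural modules.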
Fig. 2. Mean projection and match score plots for the fed-state model partitions. Legends refer to the flux distribution used to form the edge-weight matrix. For both series of partitions, the optimal iteration was set at 8, which corresponds to the first significant rise in the two scores.
The modularity of the fasted-state model was also significantly influenced by the connection diversity (flux) data. Without an edge-weight matrix, the net effect of the edge removals was to reduce the network graph size (Fig. 1b). Application of the MFA-derived fluxes as edge-weights generated an optimal partition with two modules at iteration 15 (Fig. 1d). Similar to Fig. 1a, TGZ and its derivatives formed a separate module. However, this module lacked TGZ-GSH, presumably because the fasted-state model calculated zero flux for GSH synthesis. Unlike the fed-state partition (Fig. 1c), the TGZ module did not connect directly to the other metabolic pathways. Direct connections remained absent when the GSH-maximizing flux distribution was used to form the edge-weight matrix (Fig. 1f). The major effect was to isolate a small module consisting of urea cycle metabolites from the largest reaction module. As expected from the results of Table 3, TGZ-G and TGZ-Q were eliminated from the TGZ module, and replaced with TGZ-GSH. Together, Figs. 1c-d suggest that the nutritional state of the liver directly impacts the connections between
reactions of TGZ transformation and the other major pathways of liver metabolism. Moreover, a comparison of the partitions in Figs. 1c and 1e indicated that conjugation substrate availability, in this case GSH, influences the extent of integration between these reaction modules.

4. Discussion
In this paper, we examined the interactions between the specialized reactions of TGZ transformation and the network of major metabolic reactions in hepatocytes. Using prior data, flux distributions were simulated that were in partial agreement with experimental observations on the relative distributions of various TGZ conjugates. With only the total TGZ clearance rate as input, TGZ-GSH was correctly predicted as a minor derivative, but the contribution of TGZ-S was significantly under-estimated, suggesting that additional measurements on the conjugation reactions are needed to improve the flux calculations. Nevertheless, we noted several useful outcomes. First, the thermodynamic constraints allowed convergent solutions to be found with relatively small numbers of measured inputs. Second, we avoided potential pitfalls of individual reaction-based inequality constraints. For example, flux calculations correctly predicted significant net production of TGZ-S in all cases, even though the individual reaction ΔGs of the final synthesis steps were positive (Table 3). These results directly reflect the energetic coupling between sequential reaction steps as specified by the EFM calculations. Third, the EFMs generated for the flux calculations provided an inventory of stoichiometrically and energetically feasible reaction routes of the model networks. A major obstacle to applying the EFM analysis to larger (e.g., genome-scale) networks is its computational intractability. One way to address this issue is to solve for a partial set of EFMs by eliminating high-degree currency metabolites. Many currency metabolites cannot be accurately measured or balanced, and thus are frequently not included in the stoichiometric constraints, but form metabolic cycles that significantly expand the EFM solution space. In this work, ATP, CO2, and O2 were not balanced, so the EFM calculations did not become NP-hard problems.
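The energetic-coupling point above (a step with positive ΔG can still carry net flux if the summed ΔG of its route is negative) reduces to simple arithmetic. The ΔG values below are illustrative, not the paper's computed values:

```python
# Feasibility of a reaction route by summed Gibbs free energy: a sequential
# route is treated as energetically feasible when the total dG is negative,
# even if an individual step has dG > 0. Values are illustrative only.

def route_feasible(step_dgs):
    """A route is feasible if the summed dG over its steps is negative."""
    return sum(step_dgs) < 0

# Final conjugation step alone: positive dG, infeasible in isolation.
final_step = [+19.0]
# Coupled route: a strongly exergonic upstream step drives the same final step.
coupled_route = [-38.0, +19.0]

print(route_feasible(final_step), route_feasible(coupled_route))
# → False True
```

Evaluating feasibility per route rather than per reaction is what lets the EFM-based calculation keep flux through the sulfation steps despite their positive individual ΔGs.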
The EFMs and the calculated flux distributions were ultimately used to examine the modularity of TGZ metabolism across different nutritional states and levels of conjugation substrate availability. While the connections between the immediate reactions of TGZ metabolism were well-conserved across these different conditions, connections to other major pathways varied. In the fasted state, interactions between the main carbon network and the TGZ module were limited, regardless of the GSH level. In contrast, a number of active connections were found for the fed state. These connections mainly involved the sulfur-containing amino acid cysteine (CYS) and its immediate reaction partners. The liberation of the sulfide moiety from CYS requires complete degradation of the amino acid via transamination reactions, which involve other high-degree metabolites such as GLU and α-ketoglutarate. Along with glycine, GLU and CYS make up GSH, which also interacts with the TGZ module as a conjugation substrate. Taken together, our findings suggest that the availability of common medium nutrients could significantly influence the formation of drug derivatives. Prospectively, metabolic profile-based studies on drug reaction modules could be used to analyze drug transformation under varying metabolic states, which in turn could facilitate the development of effective nutritional approaches for managing drug toxicity [10].

Acknowledgements
We thank Dr. Anselm Blumer in the Department of Computer Science at Tufts University for his help in implementing the edge-betweenness centrality algorithm. This work was in part funded by NIH grant 1-R21DK67228 to KL.

References
1. O. Fiehn et al., Nat Biotechnol 18, 1157 (2000).
2. R. N. Trethewey, Curr Opin Biotechnol 12, 135 (2001).
3. J. L. Griffin et al., Anal Biochem 293, 16 (2001).
4. J. K. Nicholson et al., Nat Rev Drug Discov 1, 153 (2002).
5. M. Kanehisa et al., Nucleic Acids Res 34, D354 (2006).
6. C. Chan et al., Metab Eng 5, 1 (2003).
7. R. P. Nolan, M.S. thesis, Tufts University (2005).
8. E. A. Gale, Lancet 357, 1870 (2001).
9. Y. Masubuchi et al., Toxicology 222, 233 (2006).
10. S. Tafazoli et al., Drug Metab Rev 37, 311 (2005).
11. H. Ma, A. P. Zeng, Bioinformatics 19, 270 (2003).
12. M. Kanehisa, S. Goto, Nucleic Acids Res 28, 27 (2000).
13. I. M. Arias, J. L. Boyer, The Liver: Biology and Pathobiology, 4th ed. (2001).
14. M. T. Smith, Chem Res Toxicol 16, 679 (2003).
15. N. J. Hewitt et al., Chem Biol Interact 142, 73 (2002).
16. R. P. Nolan et al., Metab Eng 8, 30 (2006).
17. S. Schuster et al., Nat Biotechnol 18, 326 (2000).
18. J. Yoon et al., Bioinformatics (2006).
19. M. E. Newman, M. Girvan, Phys Rev E Stat Nonlin Soft Matter Phys 69, 026113 (2004).
20. U. Brandes, J Math Sociol 25, 163 (2001).
21. K. Kawai et al., Xenobiotica 30, 707 (2000).
22. S. Prabhu et al., Chem Biol Interact 142, 83 (2002).
NEW FRONTIERS IN BIOMEDICAL TEXT MINING
PIERRE ZWEIGENBAUM, DINA DEMNER-FUSHMAN, HONG YU, AND K. BRETONNEL COHEN

1. Introduction

To paraphrase Gildea and Jurafsky [7], the past few years have been exhilarating ones for biomedical language processing. In less than a decade, we have seen an amazing increase in activity in text mining in the genomic domain [20]. The first textbook on biomedical text mining with a strong genomics focus appeared in 2005 [3]. The following year saw the establishment of a national center for text mining under the leadership of committed members of the BioNLP world [2], and two shared tasks [10,9] have led to the creation of new datasets and a very large community. These years have included considerable progress in some areas. The TREC Genomics track has brought an unprecedented amount of attention to the domain of biomedical information retrieval [8] and related tasks such as document classification [5] and question-answering, and the BioCreative shared task did the same for genomic named entity recognition, entity normalization, and information extraction [10]. Recent meetings have pushed the focus of biomedical NLP into new areas. A session at the Pacific Symposium on Biocomputing (PSB) 2006 [6] focused on systems that linked multiple biological data sources, and the BioNLP'06 meeting [20] focused on deeper semantic relations. However, there remain many application areas and approaches in which there is still an enormous amount of work to be done. In an attempt to facilitate movement of the field in those directions, the Call for Papers for this year's PSB natural language processing session was written to address some of the potential "New Frontiers" in biomedical text mining. We solicited work in these specific areas:
• Question-answering
• Summarization
• Mining data from full text, including figures and tables
• Coreference resolution
• User-driven systems
• Evaluation

31 submissions were received. Each paper received four reviews by a program committee composed of biomedical language processing specialists from North America, Europe, and Asia. Eleven papers were selected for publication. The papers published here present an interesting window on the nature of the frontier, both in terms of how far it has advanced, and in terms of which of its borders it will be difficult to cross. One paper addresses the topic of summarization. Lu et al. [14] use summary revision techniques to address quality assurance issues in GeneRIFs. Two papers extend the reach of biomedical text mining from the abstracts that have been the input to most BioNLP systems to date, towards mining the information present in full-text journal articles. Kou et al. [13] introduce a method for matching the labels of sub-figures with sentences in the paper. Seki and Mostafa [19] explore the use of full text in discovering information not explicitly stated in the text. Two papers address the all-too-often-neglected issue of the usability and utility of text mining systems. Karamanis et al. [12] present an unusual attempt to evaluate the usability of a system built for model organism database curators. Much of the work in biomedical language processing in recent years has assumed the model organism database curator as its user, so usability studies are well-motivated. Yu and Kaufman [22] examine the usability of four different biomedical question-answering systems. Two papers fit clearly into the domain of evaluation. Morgan et al. [15] describe the design of a shared evaluation, and also give valuable baseline data for the entity normalization task. Johnson et al. [11] describe a fault model for evaluating ontology matching, alignment, and linking systems. Four papers addressed more traditional application types, but at a deeper level of semantic sophistication than most past work in their areas.
Two papers dealt with the topic of relation extraction. Ahlers et al. [1] tackle an application area, information extraction, that has been a common topic of previous work in this domain, but do so at an unusual level of semantic sophistication. Cakmak and Ozsoyoglu [4] deal with the difficult problem of Gene Ontology concept assignment to genes. Finally, two papers focus on the well-known task of document indexing, but at unusual levels of refinement. Neveol et al. [16] extract MeSH subheadings and pair them with the appropriate primary heading, introducing an element of context that is lacking in most other work in BioNLP. Rhodes et al. [18]
describe a methodology for indexing documents based on the structure of chemicals that are mentioned within them. So, we see papers in some of the traditional application areas, but at increased levels of sophistication; we see papers in the areas of summarization, full text, user-driven work, and evaluation; but no papers in the areas of coreference resolution or question-answering. What might explain these gaps? One possibility is the shortage of publicly available datasets for system building and evaluation. Although there has been substantial annotation work done in the area of coreference in the molecular biology domain [21,17], only a single biomedical corpus with coreference annotation is currently freely available [17]. Similarly, although the situation will be different a year from now due to the efforts of the TREC Genomics track, there are currently no datasets freely available for the biomedical question-answering task.

2. Acknowledgments
K. Bretonnel Cohen's participation in this work was supported by NIH grant R01-LM008111 to Lawrence Hunter.

References
1. Caroline B. Ahlers, Marcelo Fiszman, Dina Demner-Fushman, Francois-Michel Lang, and Thomas C. Rindflesch. Extracting semantic predications from MEDLINE citations for pharmacogenomics. In Pacific Symposium on Biocomputing, 2007.
2. Sophia Ananiadou, Julia Chruszcz, John Keane, John McNaught, and Paul Watry. The National Centre for Text Mining: aims and objectives. Ariadne, 42, 2005.
3. Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine. Artech House Publishers, 2005.
4. Ali Cakmak and Gultekin Ozsoyoglu. Annotating genes by mining PubMed. In Pacific Symposium on Biocomputing, 2007.
5. Aaron M. Cohen and William R. Hersh. The TREC 2004 genomics track categorization task: classifying full text biomedical documents. Journal of Biomedical Discovery and Collaboration, 1(4), 2006.
6. K. Bretonnel Cohen, Olivier Bodenreider, and Lynette Hirschman. Linking biomedical information through text mining: session introduction. In Pacific Symposium on Biocomputing, pages 1-3, 2006.
7. Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288, 2002.
8. William R. Hersh, Ravi Teja Bhupatiraju, Laura Ross, Phoebe Roberts, Aaron M. Cohen, and Dale F. Kraemer. Enhancing access to the Bibliome: the TREC 2004 Genomics track. Journal of Biomedical Discovery and Collaboration, 2006.
9. William R. Hersh, Aaron M. Cohen, Jianji Yang, Ravi Teja Bhupatiraju, Phoebe Roberts, and Marti Hearst. TREC 2005 Genomics track overview. In Proceedings of the 14th Text Retrieval Conference. National Institute of Standards and Technology, 2005.
10. Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso Valencia. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, 2005.
11. Helen L. Johnson, K. Bretonnel Cohen, and Lawrence Hunter. A fault model for ontology mapping, alignment, and linking systems. In Pacific Symposium on Biocomputing, 2007.
12. Nikiforos Karamanis, Ian Lewin, Ruth Seal, Rachel Drysdale, and Edward J. Briscoe. Integrating natural language processing with FlyBase curation. In Pacific Symposium on Biocomputing, 2007.
13. Zhenzhen Kou, William W. Cohen, and Robert F. Murphy. A stacked graphical model for associating information from text and images in figures. In Pacific Symposium on Biocomputing, 2007.
14. Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter. GeneRIF quality assurance as summary revision. In Pacific Symposium on Biocomputing, 2007.
15. Alexander A. Morgan, Benjamin Wellner, Jeffrey B. Colombe, Robert Arens, Marc E. Colosimo, and Lynette Hirschman. Evaluating human gene and protein mention normalization to unique identifiers. In Pacific Symposium on Biocomputing, 2007.
16. Aurelie Neveol, Sonya E. Shooshan, Susanne M. Humphrey, Thomas C. Rindflesch, and Alan R. Aronson. Multiple approaches to fine indexing of the biomedical literature. In Pacific Symposium on Biocomputing, 2007.
17. J. Pustejovsky, J. Castano, R. Sauri, J. Zhang, and W. Luo. Medstract: creating large-scale information servers for biomedical libraries. In Natural language processing in the biomedical domain, pages 85-92. Association for Computational Linguistics, 2002.
18. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, and Patricia Ordonez. Mining patents using molecular similarity search. In Pacific Symposium on Biocomputing, 2007.
19. Kazuhiro Seki and Javed Mostafa. Discovering implicit associations between genes and hereditary diseases. In Pacific Symposium on Biocomputing, 2007.
20. Karin Verspoor, K. Bretonnel Cohen, Inderjeet Mani, and Benjamin Goertzel. Introduction to BioNLP'06. In Linking natural language processing and biology: towards deeper biological literature analysis, pages iii-iv. Association for Computational Linguistics, 2006.
21. Xiaofeng Yang, Guodong Zhou, Jian Su, and Chew Lim Tan. Improving noun phrase coreference resolution by matching strings. In IJCNLP04, pages 326-333, 2004.
22. Hong Yu and David Kaufman. A cognitive evaluation of four online search engines for answering definitional questions posed by physicians. In Pacific Symposium on Biocomputing, 2007.
EXTRACTING SEMANTIC PREDICATIONS FROM MEDLINE CITATIONS FOR PHARMACOGENOMICS

CAROLINE B. AHLERS,1 MARCELO FISZMAN,2 DINA DEMNER-FUSHMAN,1 FRANCOIS-MICHEL LANG,1 THOMAS C. RINDFLESCH1

1 Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland 20894, USA
2 The University of Tennessee, Graduate School of Medicine, Knoxville, Tennessee 37920, USA
We describe a natural language processing system (Enhanced SemRep) to identify core assertions on pharmacogenomics in Medline citations. Extracted information is represented as semantic predications covering a range of relations relevant to this domain. The specific relations addressed by the system provide greater precision than that achievable with methods that rely on entity co-occurrence. The development of Enhanced SemRep is based on the adaptation of an existing system and crucially depends on domain knowledge in the Unified Medical Language System. We provide a preliminary evaluation (55% recall and 73% precision) and discuss the potential of this system in assisting both clinical practice and scientific investigation.
1. Introduction
We discuss the development of a natural language processing (NLP) system to identify and extract a range of semantic predications (or relations) from Medline citations on pharmacogenomics. Core research in this field investigates the interaction of genes and their products with therapeutic substances. Discoveries hold considerable promise for treatment of disease [1], as clinical successes, notably in oncology, demonstrate. For example, Gleevec is a first-line therapy for chronic myelogenous leukemia, as it attacks the mutant BCR-ABL fusion tyrosine kinase in cancer cells, leaving healthy cells largely unharmed [2]. Automatic methods, including NLP, are increasingly used as important aspects of the research process in biomedicine [3,4,5,6]. Current NLP for pharmacogenomics concentrates on co-occurrence information without specifying exact relations [7]. We are developing a system (called Enhanced SemRep in this paper) which complements that approach by representing assertions in text as semantic predications. For example, the predications in (2) are extracted from the sentence in (1).

1) These findings therefore demonstrate that dexamethasone is a potent inducer of multidrug resistance-associated protein expression in rat
hepatocytes through a mechanism that seems not to involve the classical glucocorticoid receptor pathway.

2) Dexamethasone STIMULATES Multidrug Resistance-Associated Proteins
   Dexamethasone NEG_INTERACTS_WITH Glucocorticoid receptor
   Multidrug Resistance-Associated Proteins PART_OF Rats
   Hepatocytes PART_OF Rats

Enhanced SemRep is based on two existing systems: SemRep [8,9] and SemGen [10,11]. SemRep extracts semantic predications from clinical text, and SemGen was developed from SemRep to identify etiologic relations between genetic phenomena and diseases. Several aspects of these programs were combined and modified to identify a range of relations referring to genes, drugs, diseases, and population groups. The enhanced system extracts pharmacogenomic information down to the gene level, without identifying more specific genetic phenomena, such as mutations (e.g., CYP2C9*3), single nucleotide polymorphisms (e.g., C2850T), and haplotype information. In this paper we describe the major issues involved in developing Enhanced SemRep for pharmacogenomics.

2. Background
2.1. Natural Language Processing for Biomedicine

Several NLP systems identify relations in biomedical text. Due to the complexity of natural language, they often target particular semantic relations. In order to achieve high recall, some methods rely mainly on co-occurrence of entities in text (e.g. Yen et al. [12] for gene-disease relations). Some approaches use machine learning techniques to identify relations, for example Chun et al. [13] for gene-disease relations. Syntactic templates and shallow parsing are also used by Blaschke et al. [14] for protein interactions, Rindflesch et al. [15] for binding, and Leroy et al. [16] for a variety of relations. Friedman et al. [17] use extensive linguistic processing for relations on molecular pathways, while Lussier et al. [18] use a similar approach to identify phenotypic context for genetic phenomena. In pharmacogenomics, methods for extracting drug-gene relations have been developed, based on co-occurrence of drug and gene names in a sentence [19,7]. The system described in [19] is limited to cancer research, while Chang et al. [7] use machine learning to assign drug-gene co-occurrences to one of several broad relations, such as genotype, clinical outcome, or pharmacokinetics. The system we present here (Enhanced SemRep) addresses a
wide range of syntactic structures and specific semantic relations pertinent to pharmacogenomics, such as STIMULATES, DISRUPTS, and CAUSES. We first describe the structure of the domain knowledge in the Unified Medical Language System (UMLS) [20], upon which the system crucially depends.

2.2. The Unified Medical Language System

The Metathesaurus and the Semantic Network are components of the UMLS representing structured biomedical domain knowledge. In the current (2006AB) release, the Metathesaurus contains more than a million concepts. Editors combine terms from constituent sources having similar meaning into a concept, which is also assigned a semantic type, as in (3).

3) Concept: fever; Synonyms: pyrexia, febrile, and hyperthermia; Semantic Type: 'Finding'

The Semantic Network is an upper level ontology of medicine. Its core structure consists of two hierarchies (entities and events) of 135 semantic types, which represent the organization of phenomena in the medical domain.

4) Entity
     Physical Object
       Anatomical Structure
         Fully Formed Anatomical Structure
           Gene or Genome

Semantic types serve as arguments of "ontological" predications that represent allowable relationships between classes of concepts in the medical domain. The predicates in these predications are drawn from 54 semantic relations. Some examples are given in (5).

5) 'Gene or Genome' PART_OF 'Cell'
   'Pharmacologic Substance' INTERACTS_WITH 'Enzyme'
   'Disease or Syndrome' CO-OCCURS_WITH 'Neoplastic Process'

Semantic interpretation depends on matching asserted semantic predications to ontological semantic predications, and the current version of SemRep depends on the unedited version of the UMLS Semantic Network for this matching. One of the major efforts in the development of Enhanced SemRep was to edit the Semantic Network for application in pharmacogenomics.

2.3. SemRep and SemGen

SemRep: SemRep [8,9] is a rule-based symbolic natural language processing system developed to extract semantic predications from Medline citations on clinical medicine. As the first step in semantic interpretation, SemRep produces
an underspecified (or shallow) syntactic analysis based on the SPECIALIST Lexicon [21] and the MedPost part-of-speech tagger [22]. The most important aspect of this processing is the identification of simple noun phrases. In the next step, these are mapped to concepts in the Metathesaurus using MetaMap [23]. The structure in (7) illustrates syntactic analysis with Metathesaurus concepts and semantic types (abbreviated) for the sentence in (6).

6) Phenytoin induced gingival hyperplasia

7) [[head(noun(phenytoin)), metaconc('Phenytoin':[orch,phsu])],
    [verb(induced)],
    [head(noun('gingival hyperplasia')), metaconc('Gingival Hyperplasia':[dsyn])]]

The structure in (7) serves as the basis for the final phase in constructing a semantic predication. During this phase, SemRep relies on "indicator" rules, which map syntactic elements (such as verbs and nominalizations) to predicates in the Semantic Network, such as TREATS, CAUSES, and LOCATION_OF. Argument identification rules (which take into account coordination, relativization, and negation) then find syntactically allowable noun phrases to serve as arguments for indicators. For an indicator and the noun phrases serving as its syntactic arguments to be interpreted as a semantic predication, the following condition must be met: the semantic types of the Metathesaurus concepts for the noun phrases must match the semantic types serving as arguments of the indicated predicate in the Semantic Network. For example, in (7) the indicator induced maps to the Semantic Network relation in (8).

8) 'Pharmacologic Substance' CAUSES 'Disease or Syndrome'

The concepts corresponding to the noun phrases phenytoin and gingival hyperplasia can serve as arguments because their semantic types ('Pharmacologic Substance' (phsu) and 'Disease or Syndrome' (dsyn)) match those in the Semantic Network relation.
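The final phase just described can be illustrated with a minimal sketch. This is our illustration, not NLM's implementation: the tiny indicator and ontology tables below are invented for the example, and real SemRep draws on the full Semantic Network rather than a hand-made set.

```python
# Illustrative sketch of SemRep's final phase: an indicator rule maps a verb
# to a candidate predicate, and a predication is emitted only if the semantic
# types of the arguments match an ontological predication.
# The rule and ontology tables here are toy stand-ins, not UMLS data.

# Indicator rules: syntactic elements -> predicates.
INDICATORS = {"induced": "CAUSES", "treats": "TREATS"}

# Ontological predications: allowable (subject type, predicate, object type).
# 'phsu' = 'Pharmacologic Substance', 'dsyn' = 'Disease or Syndrome'.
ONTOLOGY = {("phsu", "CAUSES", "dsyn")}

def interpret(subj, verb, obj):
    """subj/obj are (Metathesaurus concept, [semantic types]) pairs."""
    predicate = INDICATORS.get(verb)
    if predicate is None:
        return None
    for st_s in subj[1]:
        for st_o in obj[1]:
            if (st_s, predicate, st_o) in ONTOLOGY:
                # Substitute the concepts for the semantic types.
                return (subj[0], predicate, obj[0])
    return None

print(interpret(("Phenytoin", ["orch", "phsu"]),
                "induced",
                ("Gingival Hyperplasia", ["dsyn"])))
# ('Phenytoin', 'CAUSES', 'Gingival Hyperplasia')
```

Note how the type check, not the verb alone, licenses the predication: with an object whose semantic types do not match the ontology, the same indicator yields no output.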
In the final interpretation (9), the Metathesaurus concepts from the noun phrases are substituted for the semantic types in the Semantic Network relation.

9) Phenytoin CAUSES Gingival Hyperplasia

SemGen: SemGen [10,11] was adapted from SemRep in order to identify semantic predications on the genetic etiology of disease. The main consideration in creating SemGen was the identification of gene and protein names as well as related genomic phenomena. For this, SemGen relies on ABGene [24], in addition to MetaMap and the Metathesaurus. Since the UMLS Semantic Network does not cover molecular genetics, ontological semantic relations for this domain were created for SemGen. The allowable relations were defined in two classes: gene-disease interactions (ASSOCIATED_WITH, PREDISPOSE, and CAUSE) and gene-gene interactions (INHIBIT, STIMULATE, and INTERACTS_WITH).
3. Methods
The development of Enhanced SemRep for pharmacogenomics began with scrutiny of the pharmacogenomics literature to identify relevant predications not identified by either SemRep or SemGen. Approximately 1000 Medline citations were retrieved with queries containing drug and gene names. From these, 400 sentences were selected as containing assertions most crucial to pharmacogenomics, including genetic (gene-disease), genomic (gene-gene), and pharmacogenomic (drug-gene, drug-genome) relations. In addition, relations between genes and population groups, relations between diseases and population groups, and pharmacological relations (drug-disease, drug-pharmacological effect, drug-drug) were scrutinized. Examples of relevant assertions include:

10) N-acetyltransferase 2 plays an important role in Alzheimer's Disease. (gene-disease)
    Ticlopidine is a potent inhibitor for CYP2C19. (drug-gene)
    Gefitinib and erlotinib for tumors with epidermal growth factor receptor (EGFR) mutations or increased EGFR gene copy numbers. (drug-gene)
    The CHF patients with the VDR FF genotype have higher rates of bone loss. (gene-disease and gene-process)

After processing these 400 sentences with SemRep, errors were analyzed and categorized by etiology. The majority of errors were missed predications that could be accounted for under three broad categories: a) the Semantic Network, b) errors in argument identification due to "empty" heads, and c) gene name identification. For Enhanced SemRep, gene name identification was addressed by adding ABGene [24] to the machinery provided by MetaMap and the Metathesaurus. The other classes of errors required more extensive modifications.

3.1. Modification of Semantic Network for Enhanced SemRep

The UMLS Semantic Network was substantially modified in Enhanced SemRep. New ontological semantic predications were added and the definitions of others were modified.
In order to accommodate semantic relations crucial to pharmacogenomics, semantic types stipulated as arguments of ontological semantic predications were reorganized into groups reflecting major categories in this field. Semantic Types: Semantic groups have been defined to organize the finer grained UMLS semantic types into broader semantic categories relevant to the clinical domain [25]. For Enhanced SemRep, five semantic groups (Substance, Anatomy, Living Being, Process, and Pathology) were defined to permit
systematic and comprehensive treatment of arguments in predications relevant to pharmacogenomics. These semantic groups are used to stipulate allowable arguments of the ontological semantic predications defined for each domain. Each group for pharmacogenomics is defined as:

11) Substance: 'Amino Acid, Peptide, or Protein', 'Antibiotic', 'Biologically Active Substance', 'Carbohydrate', 'Chemical', 'Eicosanoid', 'Element, Ion, or Isotope', 'Enzyme', 'Gene or Genome', 'Hazardous or Poisonous Substance', 'Hormone', 'Immunologic Factor', 'Inorganic Chemical', 'Lipid', 'Neuroreactive Substance or Biogenic Amine', 'Nucleotide Sequence', 'Organic Chemical', 'Organophosphorous Compound', 'Pharmacologic Substance', 'Receptor', 'Steroid', 'Vitamin'

12) Anatomy: 'Anatomical Structure', 'Body Part, Organ, or Organ Component', 'Cell', 'Cell Component', 'Embryonic Structure', 'Fully Formed Anatomical Structure', 'Gene or Genome', 'Neoplastic Process', 'Tissue'

13) Living Being: 'Animal', 'Archaeon', 'Bacterium', 'Fungus', 'Human', 'Invertebrate', 'Mammal', 'Organism', 'Vertebrate', 'Virus'

14) Process: 'Acquired Abnormality', 'Anatomical Abnormality', 'Cell Function', 'Cell or Molecular Dysfunction', 'Congenital Abnormality', 'Disease or Syndrome', 'Finding', 'Injury or Poisoning', 'Laboratory Test Result', 'Organism Function', 'Pathologic Function', 'Physiologic Function', 'Sign or Symptom'

15) Pathology: 'Acquired Abnormality', 'Anatomical Abnormality', 'Cell or Molecular Dysfunction', 'Congenital Abnormality', 'Disease or Syndrome', 'Injury or Poisoning', 'Mental or Behavioral Disorder', 'Pathologic Function', 'Sign or Symptom'

In addition to grouping semantic types, the semantic types assigned to two classes of Metathesaurus concepts were manipulated to handle the following generalizations.

16) Proteins are also genes.
Concepts assigned the semantic type 'Amino Acid, Peptide, or Protein' are also assigned the semantic type 'Gene or Genome' ("Cytochrome P-450 CYP2E1" now has 'Gene or Genome' in addition to 'Amino Acid, Peptide, or Protein').

17) Group members are human.

Concepts assigned the semantic type 'Group' (or its descendants) are also assigned the semantic type 'Human' ("Child" now has 'Human' in addition to 'Age Group').

Predications: Predications for the pharmacogenomics domain were defined in the categories shown in (18-23). Ontological predications are defined by specifying allowable arguments, that is, semantic types in the stipulated semantic
groups. The predications in (18-23) constitute a type of schema [26] for representing pharmacogenomic information.

18) Genetic Etiology: {Substance} ASSOCIATED_WITH OR PREDISPOSES OR CAUSES {Pathology}
19) Substance Relations: {Substance} INTERACTS_WITH OR INHIBITS OR STIMULATES {Substance}
20) Pharmacological Effects: {Substance} AFFECTS OR DISRUPTS OR AUGMENTS {Anatomy OR Process}
21) Clinical Actions: {Substance} ADMINISTERED_TO {Living Being}; {Process} MANIFESTATION_OF {Process}; {Substance} TREATS {Living Being OR Pathology}
22) Organism Characteristics: {Anatomy OR Living Being} LOCATION_OF {Substance}; {Anatomy} PART_OF {Anatomy OR Living Being}; {Process} PROCESS_OF {Living Being}
23) Co-existence: {Substance} CO-EXISTS_WITH {Substance}; {Process} CO-EXISTS_WITH {Process}

3.2. Empty Heads

"Empty" heads [27,28] are a pervasive phenomenon in pharmacogenomics text. An example is variants in (24).

24) We saw differential activation of CYP2C9 variants by dapsone.

Nearly 80% of the 400 sentences in the training set contain at least one empty head. These structures impede the process of semantic interpretation. In SemRep, the semantic type of the Metathesaurus concept corresponding to the head of a noun phrase qualifies that noun phrase for use as an argument. For example, from (24) we want to use the noun phrase CYP2C9 variants as an argument of STIMULATES, which requires that the semantic type of its object be a member of the Substance group. However, the semantic type of the head concept "Variant" is 'Qualitative Concept'. As has been noted (e.g. [28]), such words are not really empty (in the sense of having no semantic content). A complete interpretation would take the meaning of empty heads into account. However, that is beyond the present capabilities of the Enhanced SemRep system. It is possible to get a partial interpretation of structures containing this phenomenon by ignoring the empty head [27].
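The workaround of ignoring the empty head can be sketched in a few lines. This is a simplified illustration rather than SemRep's actual processing; the word list is abbreviated from the categories enumerated in the text, and the function name is ours.

```python
# Hedged sketch of the empty-head workaround: when the head of a noun phrase
# is on a list of semantically "empty" words, hide it and promote the word
# to its left to head status, so that word's semantic type is used instead.
# Word list abbreviated from the paper's categories; details are ours.

EMPTY_HEADS = {
    "allele", "mutation", "polymorphism", "variant", "variants",  # genetic/genomic
    "concentration", "levels",                                    # measurements
    "synthesis", "expression", "metabolism",                      # processes
}

def effective_head(noun_phrase_tokens):
    """Return the token that should serve as the head of the phrase."""
    head = noun_phrase_tokens[-1]
    if head.lower() in EMPTY_HEADS and len(noun_phrase_tokens) > 1:
        return noun_phrase_tokens[-2]  # relabel the word to the left as head
    return head

print(effective_head(["CYP2C9", "variants"]))  # CYP2C9
```

With CYP2C9 promoted to head, the phrase carries the semantic type 'Gene or Genome' and so qualifies as a Substance-group argument.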
We enumerated several categories of terms which we identified as semantically empty heads. These include general terms for genetic and genomic phenomena (allele, mutation, polymorphism, and variant), measurements (concentration, levels), and processes (synthesis, expression, metabolism). During processing in Enhanced SemRep, words from these lists that have been labeled as heads are hidden, and the word to their left is relabeled as head. After this processing, CYP2C9 becomes the head (with semantic type 'Gene or Genome', a member of the Substance group) in CYP2C9 variants above, thus qualifying as an argument of STIMULATES.

3.3. Evaluation

Enhanced SemRep was tested for recall and precision using a gold standard of 300 sentences randomly selected from the set of 36,577 sentences containing drug and gene co-occurrences found on the Web site [29] referenced by Chang and Altman [7]. These sentences were annotated by three physicians (CBA, DD-F, MF) for the predications discussed in the Methods section. That is, we did not mark up all assertions in the sentences, only those representing a predication defined in Enhanced SemRep. A total of 850 predications were assigned to these 300 sentences by the annotators.

4. Results
Enhanced SemRep generated 623 predications from the 300 sentences in the test collection. Of these, 455 were true positives, 168 were false positives, and 375 were false negatives, reflecting recall of 55% (95% confidence interval 49% to 61%) and precision of 73% (95% confidence interval 65% to 81%). We also calculated results for the groups of predications defined in categories (18-22) above. Recall and precision for the predications in the five categories are:

Genetic Etiology (ASSOCIATED_WITH, CAUSES, PREDISPOSES): recall 74%, precision 74%
Substance Relations (INTERACTS_WITH, INHIBITS, STIMULATES): recall 50%, precision 73%
Pharmacological Effects (AFFECTS, DISRUPTS, AUGMENTS): recall 41%, precision 68%
Clinical Actions (ADMINISTERED_TO, MANIFESTATION_OF, TREATS): recall 54%, precision 84%
Organism Characteristics (LOCATION_OF, PART_OF, PROCESS_OF): recall 63%, precision 71%

5. Discussion
5.1. Error Analysis

We assessed the etiology of errors separately for recall and precision. In considering both false negatives and false positives for Enhanced SemRep, the etiology of error was almost exclusively due to characteristics in SemRep before
enhancement, not to changes introduced for Enhanced SemRep. Word sense ambiguity was responsible for almost a third (28%) of all errors. For example, in interpreting (25), inhibition was wrongly mapped to the Metathesaurus concept "Psychological Inhibition," thus allowing the system to generate the false positive "CYP2C19 AFFECTS Psychological Inhibition."

25) Ticlopidine inhibition of phenytoin metabolism mediated by potent inhibition of CYP2C19.

Difficulty in processing coordinate structures caused more than a third (35%) of the false negatives seen in our evaluation. For example, in processing (26), although Enhanced SemRep identified the predication "Fluorouracil INTERACTS_WITH DPYD gene," it missed "mercaptopurine INTERACTS_WITH thiopurine methyltransferase."

26) The cytotoxic activities of mercaptopurine and fluorouracil are regulated by thiopurine methyltransferase (TPMT) and dihydropyrimidine dehydrogenase (DPD), respectively.

5.2. Processing Medline citations on CYP2D6

We processed 2849 Medline citations containing variant forms of CYP2D6 with Enhanced SemRep, which produced 36,804 predications, 22,199 of which were unique. 5219 total and 2310 unique predications contained CYP2D6 as an argument, with the remaining predications representing assertions about other genes, drugs, and diseases. The 5219 total predications containing CYP2D6 were analyzed according to two predication categories (Genetic Etiology and Substance Relations), and the results were compared with the relations listed for this gene on the PharmGKB Web site [30].

Genetic Etiology: 267 total predications represented CYP2D6 as an etiologic agent (CAUSES, PREDISPOSES, or ASSOCIATED_WITH) for a disease. The most frequent of these are the following: Parkinson's disease (35 occurrences), carcinoma of the lung (21), tardive dyskinesia (15), Alzheimer's disease (9), bladder carcinoma (8). All of the above relations were judged to be true positives. Only carcinoma of the lung occurs in PharmGKB.
Of the four PharmGKB CYP2D6-disease relations not obtained by Enhanced SemRep (hepatitis C, ovarian carcinoma, pain, and bradycardia), two were found not to contain the disease name in the referenced citation (ovarian carcinoma and pain).

Substance Relations: Enhanced SemRep retrieved 1128 total predications involving CYP2D6 and a drug. Sixty-nine drugs occurred three or more times in those predications. Forty-one of the 69 were in PharmGKB and 28 were not. Sixty-eight were true positives. For example, the following drugs (all true positives) were interpreted by Enhanced SemRep as inhibiting CYP2D6:
quinidine (45 occurrences in 1128 predications with CYP2D6), paroxetine (34), fluoxetine (27), fluvoxamine (8), sertraline (8). Quinidine and sertraline are not in PharmGKB. SemRep also retrieved predications that the following drugs (all true positives) interact with CYP2D6: bufuralol (27), antipsychotic agents (25), dextromethorphan (21), venlafaxine (19), debrisoquin (18). Bufuralol is not in PharmGKB. The PharmGKB relations SemRep failed to capture were CYP2D6 interactions with cocaine, levomepromazine, maprotiline, trazodone, and yohimbine. Two of these entries (levomepromazine and maprotiline) were found not to be based on the content of Medline citations.

6. Conclusion
We discuss the adaptation of an existing NLP system to the pharmacogenomics domain. The major changes for developing Enhanced SemRep from SemRep involved modifying the semantic space stipulated by the UMLS Semantic Network. The output of Enhanced SemRep is in the form of semantic predications that represent assertions from Medline citations expressing a range of specific relations in pharmacogenomics. The information provided by Enhanced SemRep has the potential to contribute to systems that go beyond traditional information retrieval to support advanced information management applications for pharmacogenomics research and clinical care. In the future we intend to adapt the summarization and visualization techniques developed for clinical text [31] to the pharmacogenomic predications generated by Enhanced SemRep.

Acknowledgments

This study was supported in part by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. The first author was supported by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the U.S. Department of Energy and the National Library of Medicine.

References

1. Halapi E, Hakonarson H. Advances in the development of genetic markers for the diagnosis of disease and drug response. Expert Rev Mol Diagn. 2002 Sep;2(5):411-21.
2. Druker BJ, Talpaz M, Resta DJ, et al. Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med. 2001 Apr 5;344(14):1031-7.
3. Yandell MD, Majoros WH. Genomics and natural language processing. Nature Reviews Genetics 2002;3(8):601-10.
4. Cohen KB, Hunter L. Natural language processing and systems biology. In Dubitzky and Pereira, Artificial intelligence methods and tools for systems biology. Springer Verlag, 2004.
5. Hirschman L, Park JC, Tsujii J, Wong L, Wu CH. Accomplishments and challenges in literature data mining for biology. Bioinformatics 2002;18(12):1553-61.
6. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 2006;7:119-29.
7. Chang JT, Altman RB. Extracting and characterizing gene-drug relationships from the literature. Pharmacogenetics. 2004 Sep;14(9):577-86.
8. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003 Dec;36(6):462-77.
9. Rindflesch TC, Fiszman M, Libbus B. Semantic interpretation for the biomedical research literature. In Chen, Fuller, Hersh, and Friedman, Medical informatics: Knowledge management and data mining in biomedicine. Springer, 2005, pp. 399-422.
10. Rindflesch TC, Libbus B, Hristovski D, Aronson AR, Kilicoglu H. Semantic relations asserting the etiology of genetic diseases. AMIA Annu Symp Proc. 2003:554-8.
11. Masseroli M, Kilicoglu H, Lang FM, Rindflesch TC. Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease. BMC Bioinformatics 2006 Jun 8;7(1):291.
12. Yen YT, Chen B, Chiu HW, Lee YC, Li YC, Hsu CY. Developing an NLP and IR-based algorithm for analyzing gene-disease relationships. Methods Inf Med. 2006;45(3):321-9.
13. Chun HW, Tsuruoka Y, Kim JD, Shiba R, Nagata N, Hishiki T, Tsujii J. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. Pac Symp Biocomput. 2006:4-15.
14. Blaschke C, Andrade MA, Ouzounis C, Valencia A. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, edited by Lengauer T, Schneider R, Bork P, Brutlag DL, Glasgow JI, Mewes H-W, Zimmer R. San Francisco, CA: Morgan Kaufmann Publishers, Inc; 1999:60-67.
15. Rindflesch TC, Rajan JV, Hunter L. Extracting molecular binding relationships from biomedical text. Proceedings of the ANLP-NAACL 2000:188-95. Association for Computational Linguistics.
16. Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform. 2003;36(3):145-158.
17. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001;17 Suppl 1:S74-S82.
18. Lussier YA, Borlawsky T, Rappaport D, Liu Y, Friedman C. PhenoGO: assigning phenotypic context to Gene Ontology annotations with natural language processing. Pac Symp Biocomput. 2006:64-75.
19. Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput. 2000:517-528.
20. Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc 1998 Jan-Feb;5(1):1-11.
21. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235-9.
22. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004;20(14):2320-1.
23. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001:17-21.
24. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002;18(8):1124-32.
25. McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Medinfo 2001;10(Pt 1):216-20.
26. Friedman C, Borlawsky T, Shagina L, Xing HR, Lussier YA. Bio-ontology and text: bridging the modeling gap. Bioinformatics. 2006 Jul 26.
27. Chodorow MS, Byrd RI, Heidorn GE. Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the ACL, 1985:299-304.
28. Guthrie L, Slater BM, Wilks Y, Bruce R. Is there content in empty heads? Proceedings of the 13th Conference on Computational Linguistics. 1990;v3:138-143.
29. http://bionlp.stanford.edu/genedrug/
30. Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE. PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res. 2002 Jan 1;30(1):163-5.
31. Fiszman M, Rindflesch TC, Kilicoglu H. Abstraction summarization for managing the biomedical research literature. Proc HLT-NAACL Workshop on Computational Lexical Semantics, 2004.
ANNOTATING GENES USING TEXTUAL PATTERNS

ALI CAKMAK, GULTEKIN OZSOYOGLU

Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
{ali.cakmak, [email protected]
Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot keep up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system to automatically infer new GO annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) based on the extracted terms, constructs textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through "pattern crosswalks", (iv) employs semantic pattern matching, rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the "quality" of the match between the pattern and the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision of 78% at a recall of 57%.
1. Introduction
In this paper, we present GEANN (Gene Annotator), a system to automatically infer new Gene Ontology (GO) annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. Currently, annotations for GO, a controlled term vocabulary describing the central attributes of genes [1], are most reliably done manually by experts who read the literature and decide on appropriate annotations. This approach is slow and costly. Compounding the problem is the rate of increase in the amount of available biological literature: at present, about 223,000 new genomics papers (papers that contain at least one of the words "gene", "protein", or "rna", and were published in 2005) per year are added to PubMed [3], far outstripping the capabilities of a manual annotation effort. Hence, effective computational tools are needed to automate the annotation of genes with GO terms. Currently, possibly many genes without appropriate GO annotations exist even though there may be sufficient annotation evidence in a scientific paper. We have observed that, as of Jan. 2006, only a small portion of the papers in
PubMed has been referred to in support of gene annotations (i.e., 0.9% of 3 million PubMed genomics papers with abstracts). We give an example.
Example. The following is an excerpt from an abstract [18] which discusses experiments indicating the translation repressor activity (GO:0030371) of the gene p97. However, at present, gene p97 does not have the translation repressor activity annotation. "...experiments show that p97 suppresses both cap-dependent and independent translation ... expression of p97 reduces overall protein synthesis...results suggest that p97 functions as a general repressor of translation by forming...".
GEANN can be used to (i) discover new GO annotations for a gene, and/or (ii) increase the annotation strength of existing GO annotations by locating additional paper evidence. We are currently integrating GEANN into PathCase [2], a system of web-based tools for metabolic pathways, in order to allow users to discover new GO annotations. In general, GEANN is designed to:
• facilitate and expedite the curation process in GO, and
• extract explicit information about a gene that is implicitly present in text.
GEANN uses paper abstracts, and utilizes textual pattern extraction techniques to discover GO annotations automatically. GEANN's methodology is to (i) extract textual elements identifying a GO term, (ii) construct patterns with reliability scores, conveying the semantics of how confidently a pattern represents a GO term, (iii) extend the pattern set with longer ones via "crosswalks", (iv) apply semantic pattern matching techniques using WordNet, and (v) annotate genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. In experiments, GEANN produced, on average, 78% precision at 57% recall. This level of performance is significantly better than that of the existing systems described in the literature and compared in sections 5.2.3 and 6.
Overview: The GEANN implementation has two phases, namely, the training and the annotation phases. The goal of the training phase is to construct a set of patterns that characterize a variety of indicators for the existence of a GO annotation. As the training data, annotation evidence papers [1] are used. The first step in the training phase is the tagging of genes in the papers. Then, significant terms/phrases that differentially appear in the training set are extracted. Next, patterns are constructed based on (i) the significant terms/phrases, and (ii) the terms surrounding significant terms. Finally, each pattern is assigned a reliability score. The annotation discovery phase looks for possible matches to the patterns in paper abstracts. Next, GEANN computes a matching score which indicates the strength of the prediction. Finally, GEANN determines the gene to be associated with the pattern match. At the end, new annotation predictions are ordered by their scores, and presented to the user.
The extracted patterns are flexible in that they match a set of phrases with close meanings. GEANN employs WordNet [5] to deduce the semantic closeness of words in patterns. WordNet is an online lexical reference system in which nouns, verbs, adjectives and adverbs are grouped into synonym sets, and these synonym sets are hierarchically organized through various relationships.
The paper is organized as follows. In section 2, we elaborate on significant term discovery and pattern construction. Sections 3 and 4 discuss pattern matching and the scoring scheme, respectively. Section 5 summarizes the experimental results. In section 6 (and section 5.2.3), we compare GEANN to other similar, competing systems.

2. Pattern Construction
In GEANN, the identifying elements of a GO concept are the representations of the concept in textual data, and the terms surrounding the identifying elements are considered auxiliary descriptors of the GO concept. A pattern is an abstraction which encapsulates the identifying elements and the auxiliary descriptors together in a structured manner. More specifically, a pattern is organized as a 3-tuple:
{LEFT} <MIDDLE> {RIGHT}
where each element corresponds to a set (bag) of words. The <MIDDLE> element is an ordered sequence of significant terms (identifying elements); the {LEFT} and {RIGHT} elements correspond to word sets that appear around the significant terms (auxiliary descriptors). The number of terms in the left and the right elements is adjusted by a window size. Each word or phrase in the significant term set is assigned to be the middle element of a newly created pattern template. A pattern is an instance of a pattern template; a template may lead to several patterns with a common middle element, but (possibly) different left or right elements. We give an example.
Example. Two of the patterns that are created from the pattern template {LEFT} <rna polymerase ii> {RIGHT} are listed below, where rna polymerase ii is found to be a significant term within the context of the GO concept positive transcription elongation factor with a window size of three. The {LEFT} and {RIGHT} tuples are instantiated from the surrounding words that appear before or after the significant term in the text.
{increase catalytic rate} <rna polymerase ii> {transcription suppressing transient}
{proteins regulation transcription} <rna polymerase ii> {initiated search proteins}
Patterns are contiguous blocks; that is, no gap is allowed between the tuples in a pattern. Each tuple is a bag of words, which are tokens delimited by white space characters. Since stop words are eliminated in the preprocessing stage, the patterns do not include words like "the", "of", etc.
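To make the pattern structure concrete, the following is a minimal sketch (hypothetical code, not GEANN's implementation) of a {LEFT} <MIDDLE> {RIGHT} pattern and its instantiation from text with a window size; the stop-word list is an assumption for the example:

```python
from dataclasses import dataclass

# Assumed stop-word list for the sketch; GEANN's actual list is not given.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in"}

@dataclass(frozen=True)
class Pattern:
    left: frozenset    # auxiliary descriptors before the significant term
    middle: tuple      # ordered significant-term sequence (identifying elements)
    right: frozenset   # auxiliary descriptors after the significant term

def instantiate(tokens, significant, window=3):
    """Yield one Pattern per occurrence of the significant phrase in tokens."""
    toks = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
    n = len(significant)
    for i in range(len(toks) - n + 1):
        if tuple(toks[i:i + n]) == tuple(significant):
            yield Pattern(frozenset(toks[max(0, i - window):i]),
                          tuple(significant),
                          frozenset(toks[i + n:i + n + window]))

text = ("increase the catalytic rate of rna polymerase ii "
        "transcription suppressing transient pausing").split()
pats = list(instantiate(text, ("rna", "polymerase", "ii")))
```

Running this on the example sentence above yields the first pattern of the Example, with {LEFT} and {RIGHT} drawn from the three non-stop-words on each side of the significant term.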
2.1. Locating Significant Terms and Phrases
Some words or phrases appearing frequently in the abstracts provide evidence for annotations by a specific GO term. For instance, RNA polymerase II, which performs elongation of RNA in eukaryotes, appears in almost all abstracts associated with the GO term "positive transcription elongation factor activity". Hence, intuitively, such frequent term occurrences should be marked as indicators of a possible annotation. In order to avoid marking word(s) common to almost all abstracts (e.g., "cell"), the document frequency of a significant term is required to be below a certain threshold (10% in our case). The words that constitute the name of a GO term are by default considered significant terms. Frequent phrases are constructed out of frequent terms through a procedure similar to the Apriori algorithm [9]. First, individual frequent terms are obtained using the IDF (inverse document frequency [4]) indices. Then, frequent phrases are obtained by recursively combining individual frequent terms/phrases, provided that the constructed phrase is also frequent. To obtain significant terms, one could use various methods, from random-walk networks to correlation mining [9]. Since the training set for each GO term is usually not large, and to keep the methodology simple, we use frequency information to determine the significant terms.
2.2. Pattern Crosswalks
Extended patterns are constructed by virtually walking from one pattern to another. The goal is to create larger patterns that can eliminate false GO annotation predictions, and boost the true candidates. Based on the type of the walk, GEANN creates two different extended patterns: (i) side-joined, and (ii) middle-joined patterns.
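The frequency-based significant-term step of section 2.1 can be sketched as follows. This is an illustrative reconstruction: the names and the minimum-count threshold are assumptions, while the 10% document-frequency cap comes from the text; a single Apriori-style pass joins adjacent frequent terms into two-word phrases that must themselves be frequent:

```python
from collections import Counter

def significant_terms(train_docs, all_docs, min_count=3, max_df=0.10):
    """Terms frequent in the GO term's training abstracts, but occurring in
    fewer than max_df of all documents (the 10% cap from the text)."""
    df = Counter()                       # corpus-wide document frequency
    for doc in all_docs:
        df.update(set(doc))
    counts = Counter(t for doc in train_docs for t in doc)
    return {t for t, c in counts.items()
            if c >= min_count and df[t] / len(all_docs) < max_df}

def grow_phrases(train_docs, frequent, min_count=3):
    """One Apriori-style pass: join adjacent frequent terms into two-word
    phrases, keeping a phrase only if it is itself frequent."""
    pair_counts = Counter()
    for doc in train_docs:
        for a, b in zip(doc, doc[1:]):
            if a in frequent and b in frequent:
                pair_counts[(a, b)] += 1
    return {p for p, c in pair_counts.items() if c >= min_count}
```

In the full algorithm this pass would be applied recursively to build longer phrases, as described in section 2.1.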
Transitive Crosswalk: Given a pattern pair P1 = {left1} <middle1> {right1} and P2 = {left2} <middle2> {right2}, if {right1} = {left2}, then patterns P1 and P2 are merged into a 5-tuple side-joined (SJ) pattern P3 = {left1} <middle1> {right1} <middle2> {right2}. Next, we give an example of an SJ pattern created for the GO term positive transcription elongation factor.
Example.
P1 = {factor increase catalytic} {rna polymerase ii}
P2 = {rna polymerase ii} <elongation factor> {[ge]}
[SJ Pattern] P3 = {factor increase catalytic} {rna polymerase ii} <elongation factor> {[ge]}
SJ patterns are helpful in detecting consecutive pattern matches that partially overlap. If there exist two consecutive regular pattern matches, then such a match should be evaluated differently than two separate matchings of regular patterns, as it may provide stronger evidence for the
existence of a possible GO annotation in the match region. Note that pattern merging through crosswalks is performed among the patterns of the same GO concept.
Middle Crosswalk: Based on partial overlapping between the middle and side (right or left) tuples of patterns, we construct the second type of extended pattern. Given the same pattern pair P1 and P2 as above, the patterns can be merged into a 4-tuple middle-joined (MJ) pattern if at least one of the following cases holds:
a. Right middle walk: {right1} ∩ <middle2> ≠ ∅ and <middle1> ∩ {left2} = ∅
b. Left middle walk: <middle1> ∩ {left2} ≠ ∅ and {right1} ∩ <middle2> = ∅
c. Middle walk: <middle1> ∩ {left2} ≠ ∅ and {right1} ∩ <middle2> ≠ ∅
MJ patterns have two middle tuples. For case (a), the first middle tuple is the intersection of the {right1} and <middle2> tuples. Case (b) is handled similarly. As for case (c), the first and the second middle tuples are subsets of <middle1> and <middle2>. Below, we give an example of MJ pattern construction for the GO term positive transcription elongation factor.
Example. (Middle-joined pattern construction)
P1 = {[ge] facilitates chromatin} {chromatin-specific elongation factor}
P2 = {classic inhibitor transcription} <elongation rna polymerase ii> {pol ii}
[MJ Pattern] P3 = {[ge] facilitates chromatin} <elongation> {pol ii}
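Under the definitions above, the two crosswalks reduce to simple set tests on the tuples. The following sketch (illustrative only, not the authors' code) implements the transitive (side-join) merge and classifies which middle-walk case, if any, applies to a pattern pair:

```python
def side_join(p1, p2):
    """Transitive crosswalk: if {right1} == {left2}, merge P1 and P2 into the
    5-tuple {left1} <middle1> {right1} <middle2> {right2}; otherwise None."""
    l1, m1, r1 = p1
    l2, m2, r2 = p2
    return (l1, m1, r1, m2, r2) if r1 == l2 else None

def middle_walk_case(p1, p2):
    """Classify which middle-crosswalk case (a/b/c) applies, or None."""
    l1, m1, r1 = p1
    l2, m2, r2 = p2
    right = bool(r1 & set(m2))   # {right1} ∩ <middle2> ≠ ∅
    left = bool(set(m1) & l2)    # <middle1> ∩ {left2} ≠ ∅
    if right and not left:
        return "a"               # right middle walk
    if left and not right:
        return "b"               # left middle walk
    if left and right:
        return "c"               # middle walk
    return None
```

Each pattern here is a (left, middle, right) triple with left/right as frozensets and middle as an ordered tuple, mirroring the 3-tuple structure of section 2.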
Like SJ patterns, MJ patterns capture consecutive pattern matches in textual data. In particular, MJ patterns detect partial information that may not be recognized otherwise, since we enforce the full matching of middle tuple(s) to locate a pattern match, which is discussed next.

3. Handling Pattern Matches
Since the middle tuples of a pattern are composed of significant terms, the condition for a pattern match is that the middle tuple of the pattern be completely included in the text. For the matching of the left and the right tuples, GEANN employs semantic matching. We illustrate with an example.
Example. Given a pattern "{increase catalytic rate} <rna polymerase ii>", we want to be able to detect phrases which give the sense that "transcription elongation" is positively affected. Through semantic matching, phrases like "stimulates rate of transcription elongation" or "facilitates transcription elongation" are also matched to the pattern.
GEANN first checks whether an exact match is possible between the left/right tuples of the pattern and the surrounding words of the matching phrase. Otherwise, GEANN employs WordNet [5] to check whether they have similar meanings, using an open source library [22] as the access interface to WordNet. First, a semantic similarity matrix, R[m, n], containing each pair of words is
built, where R[i, j] is the semantic similarity between the most appropriate sense of the word at position i of phrase X, and the word at position j of phrase Y. The most appropriate sense of a word is found through a sense disambiguation process: given a word w, each sense of the word is compared against the senses of the surrounding words, and the sense of w with the highest similarity to the surrounding words is selected. To compute semantic similarity, we adopt a simple approach: the semantic similarity between word senses w1 and w2 is inversely proportional to the length of the path between the senses in WordNet. The problem of computing semantic similarity between two sets of words X and Y is treated as the problem of computing a maximum total matching weight of a bipartite graph [7], where X and Y are two sets of disjoint nodes (i.e., words in our case). The Hungarian Method [7] is used to solve this problem, where R[i, j] is the weight of the edge from i to j. Finally, each individual pattern match is scored based on (i) the score of the pattern itself, and (ii) the semantic similarity computed using WordNet.
Having located a match, the next step is to decide on the gene to be associated with the match. To this end, two main issues are resolved: (i) detecting gene names in the text, and (ii) determining the gene to be annotated among possible candidates. For the first task, we utilized a biological named entity tagger, ABNER [20]. For the second task of locating the gene to be annotated, GEANN first looks into the sentence containing the match, and locates the genes positioned before/after the matching region in the sentence, or else in the previous sentence, and so on. The confidence of the annotation decays as the distance from the gene to the matching phrase increases. For more details, please see [14].

4. Pattern Evaluation and Scoring

4.1. Scoring Regular Patterns
Each constructed pattern is assigned a score conveying the semantics of how confidently the pattern represents a GO term. GEANN uses several heuristics for the final score of a pattern, based on the structural properties of its middle tuple.
i) Source of the Middle Tuple [MT]: Patterns whose middle tuples consist entirely of words from the GO term name get higher scores than those with middle tuples constructed from frequent terms.
ii) Type of Individual Terms in the Middle Tuple [TT]: The contribution of each word from the GO term name changes according to (a) its selectivity, i.e., the occurrence frequency of the word among all GO term names, and (b) the position of the word in the GO term name, based on the observation that words in a GO term name get more specific from right to left [21].
iii) Frequency of the Phrase in the Middle Tuple [PC]: A pattern's score is inversely proportional to the frequency of its middle tuple throughout the papers in the database.
iv) Term-Wise Paper Frequency of the Middle Tuple [PP]: Patterns with middle tuples that are highly frequent in the GO term's paper set get higher scores.
Based on the reasoning summarized above, GEANN uses the following heuristic score function:
PatternScr = (MT + TT + PP) * log(1/PC)

4.2. Scoring Extended Patterns
(a) Scoring SJ Patterns: SJ patterns serve to capture consecutive pattern matches. Our scoring scheme differentiates between two consecutive and two single pattern matches: consecutive pattern matches contribute to the final score in proportion to an exponent of the sum of the pattern scores (after experimenting with different exponent values in the extended pattern score functions for the highest accuracy, j and k were set to 2 and 1.5, respectively, for the experimental results section). This way, GEANN can assign considerably higher scores to consecutive pattern matches, which are considered much stronger indicators for an annotation than two individual pattern matches.
Score(SJ Pattern) = ( Score(Pattern1) + Score(Pattern2) )^j
(b) Scoring MJ Patterns: Consistent with the construction process, the score computation for MJ patterns is more complex than for SJ patterns.
Score(MJ Pattern) = ( DegreeOfOverlap1 * Score(Pattern1) + DegreeOfOverlap2 * Score(Pattern2) )^k
where DegreeOfOverlap represents the proportion of the middle tuple of pattern1 (pattern2) that is included in the left tuple of pattern2 (right tuple of pattern1). In addition, GEANN considers the preservation of word order, represented by the positionalDecayCoefficient.
The degree of overlap is computed by:
degreeOfOverlap = positionalDecayCoefficient * overlapFrequency
The positional decay coefficient is computed according to the alignment of the left or the right middle tuple of a pattern with the middle tuple of the other pattern. If a matching word is in the same position in both tuples, then the positional score of the word is 1; otherwise, it is 0.75.
PositionalDecayCoefficient = ( Σ_{w ∈ Overlap} PosScore(w) ) / Size(Overlap)
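A small numeric sketch of the score functions of section 4 (with j = 2 and k = 1.5 as reported in the text; function names and inputs are illustrative, not GEANN's code):

```python
import math

def pattern_score(mt, tt, pp, pc):
    # PatternScr = (MT + TT + PP) * log(1/PC): rarer middle tuples score higher
    return (mt + tt + pp) * math.log(1.0 / pc)

def sj_score(s1, s2, j=2):
    # Score(SJ Pattern) = (Score(Pattern1) + Score(Pattern2))^j
    return (s1 + s2) ** j

def mj_score(s1, s2, overlap1, overlap2, k=1.5):
    # Score(MJ Pattern) = (d1*Score(Pattern1) + d2*Score(Pattern2))^k
    return (overlap1 * s1 + overlap2 * s2) ** k

def positional_decay(tuple_a, tuple_b):
    # Matching words aligned at the same position score 1, shifted ones 0.75,
    # averaged over the overlap: (Σ PosScore(w)) / Size(Overlap).
    overlap = set(tuple_a) & set(tuple_b)
    if not overlap:
        return 0.0
    total = sum(1.0 if tuple_a.index(w) == tuple_b.index(w) else 0.75
                for w in overlap)
    return total / len(overlap)

def degree_of_overlap(tuple_a, tuple_b, overlap_frequency=1.0):
    # degreeOfOverlap = positionalDecayCoefficient * overlapFrequency
    return positional_decay(tuple_a, tuple_b) * overlap_frequency
```

Note how the exponentiation makes two consecutive matches (SJ) worth more than the sum of two separate single matches, as intended by the scoring scheme.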
5. Experimental Results

5.1. Data Set
In order to evaluate the performance of GEANN, we performed experiments on annotating genes in NCBI's GenBank with selected GO terms. A subset of PubMed abstracts was stored in a database. The experimental subset consisted of evidence papers cited by GO annotations, and reference papers cited for genes maintained by GenBank. This corpus of around 150,000 papers was used to approximate the word frequencies in the actual PubMed dataset. As part of pre-processing, abstracts/titles of papers were tokenized, stop words were removed, and inverse document indices were constructed for each token. GEANN was evaluated on a set of 40 GO terms (24 terms from the biological process, 12 terms from the molecular function, and 4 terms from the cellular component subontology). Our decision on which terms to choose for the performance assessment was shaped by the choices made in two previous studies [16, 17], for comparison purposes. For a complete list of GO terms used in the experiments, see [14]. The evidence papers referenced from at least one of the test GO terms were used for testing patterns. In total, 4694 evidence paper abstracts were used to annotate 4982 genes, where on average each GO term has 120 evidence papers and 127 genes.

5.2. Experiments
Our experiments are based on a precision-recall analysis of the predicted annotation set. We use the k-fold cross validation scheme [9] (k=10 in our case). Precision is the ratio of the number of genes that are correctly predicted to the number of all genes predicted by GEANN. Recall is the fraction of the correctly predicted genes in the whole set of genes known to be annotated with the GO term being studied. Genes that are annotated by GEANN and yet do not have a corresponding entry in GenBank are ignored, as there is no way to check their correctness. Additionally, GEANN uses the following heuristics.
Heuristic 1 (Shared Gene Synonyms): If at least one of the genes matching the annotated symbol has an annotation with the target GO term, then the prediction is considered a true positive.
Heuristic 2 (Incorporating the GO Hierarchy): A given GO term G also annotates all the genes that are annotated by any of its descendants (true-path rule).

5.2.1. Overall Performance:
For this experiment, predicted annotations were ordered by their confidence scores. Precision and recall values were computed by considering the top k
predictions; k was increased by 1 at each step until either all the annotations for a GO term were located, or all the candidates in the predicted set were processed.
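The precision and recall computations used throughout this section, together with the F-value used in section 5.2.3 (the harmonic mean of precision and recall), can be sketched as follows (illustrative code, not the evaluation scripts used in the paper):

```python
def precision_recall(predicted, gold):
    """Precision: correctly predicted genes / all predicted genes.
    Recall: correctly predicted genes / all genes known to be annotated."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def f_value(precision, recall):
    # F = (2 * Recall * Precision) / (Recall + Precision)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting four genes of which two are correct against a gold set of three genes gives precision 0.5 and recall 2/3.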
Figure 1: Overall System Performance & Approximate Error due to the NET. Figure 2: Annotation accuracy across different subontologies in GO.
Observation 1: Figure 1, which presents the average precision/recall values, shows that GEANN yields 78% precision (the top-most line) at 46% recall (the bottom-most line). The association of a pattern with a gene relies on the accurate tagging of genes in the text. However, named entity taggers (NETs) are still far from perfect (ABNER has 77% recall, 68% precision). It may be quite difficult to quantify NET errors exactly. Thus, we took a minimalist approach, and attempted to compute the rate of error that is guaranteed to be due to the fault of the NET.
Heuristic 4 (Tagger Error Approximation): If none of the synonyms of a gene has been recognized by the tagger in any of the papers associated with the target GO term G, then we label the gene as a tagger-missed gene.
Observation 2: After eliminating tagger-missed genes, the recall of GEANN increases from 46% to 57% at the precision level of 78% (the middle line in figure 1). Note that the actual error rate of the NET in practice may be much higher than what is estimated above. In addition, eliminating tagger-missed genes does not affect the precision; thus, precision is plotted only once.

5.2.2. Accuracy across Different Subontologies:
In experiment 2, the same steps as in experiment 1 were repeated, but average accuracy values were computed within the individual subontologies. Figure 2 plots precision/recall values for the different subontologies of GO (MF: Molecular Function, BP: Biological Process, CC: Cellular Component).
Observation 3: GEANN has the best precision for CC, where precision reaches 85% at 52% recall, while MF yields the highest recall (58% at 75% precision).
Observation 4: CC almost always provides the best precision values, possibly because the variety of words used to describe cellular locations is much lower. However, CC has the lowest recall (52%): the cellular location is well known for certain genomic entities, and hence is not stated explicitly in the text as often as MF or BP annotations.
Observation 5: Higher recall in MF is expected since, in general, the emphasis in a biomedical paper is on the functionality of a gene, where the process or cellular location information is usually provided as a secondary trait of the entity.
5.2.3. Comparative Performance Analysis with Other Systems:
Raychaudhuri et al. [16] and Izumitani et al. [17] built paper classifiers to label genes with GO terms through the classification of papers. Both works assume that a gene is a priori associated with several papers. This is a strong assumption: if experts are to invest sufficient time to read and associate a set of papers with a gene, then they can probably annotate the gene with the appropriate GO terms as well. Second, since both systems work at the document level, no direct evidence phrases are extracted from the text. Third, the classifiers employed by these studies need large training paper sets. In contrast, GEANN does not require a gene to be associated with any set of papers. Moreover, GEANN can provide specific match phrases as evidence rather than the whole document. Fourth, GEANN handles the reconciliation of two different genomic databases, whereas those studies have no such consideration.
Izumitani et al. compare their system to Raychaudhuri et al.'s study for 12 GO terms. Our comparative analysis is also confined to this set of GO terms. Among these GO terms, five of them (Ion homeostasis, Membrane fusion, Metabolism, Sporulation) either have no or very few annotations in GenBank to perform 10-fold cross validation, and one of the test terms (Biogenesis) has recently become obsolete (i.e., removed from GO). Therefore, here we present comparative results for the remaining 6 GO terms. Table 1 provides the overall F-values [9], while Table 2 provides F-values in terms of the subontologies. The F-value is the harmonic mean of precision and recall, computed as (2*Recall*Precision)/(Recall+Precision).

Table 1: Comparing F-values against Izumitani et al. and Raychaudhuri et al.

GO category   GEANN   Izumitani et al.   Raychaudhuri et al.
                                         Top1    Top2    Top3
GO:0006914    0.85    0.78               0.83    0.66    0.38
GO:0007155    0.66    0.51               0.19    0.19    0.13
GO:0007165    0.75    0.76               0.41    0.30    0.21
GO:0006950    0.69    0.65               0.41    0.27    0.24
GO:0006810    0.72    0.83               0.56    0.55    0.49
GO:0008219    0.75    0.58               0.07    0.06    0.02
Average       0.74    0.69               0.40    0.33    0.25

Table 2: Comparing F-values for GO subontologies

Subontology          GEANN   Izumitani et al.
Biological Process   0.66    0.60
Molecular Function   0.66    0.72
Cellular Location    0.64    0.58
Average              0.66    0.63

Observation 6: Although GEANN does not rely on the strong assumption that genes need to be associated with a set of papers, and provides annotation prediction at a finer granularity with much smaller training data, it is still comparable to or better than other systems in terms of accuracy.
5.2.4. Contributions of Extended Patterns:
Finally, we evaluated the effects of extended patterns. The experiments were conducted first with extended patterns, and then without them.
Observation 7: The use of extended patterns improves the precision by as much as 6.3% (GO:0005198). However, as the average improvement is quite small (0.2%), we conclude that the contribution of the extended patterns is unpredictable. We observe that extended patterns have a localized effect which does not necessarily apply in every case. Furthermore, since we only use paper abstracts, long descriptions that match extended patterns are unlikely to be found.
6. Related Work
The second task of the BioCreAtIvE challenge involves extracting annotation phrases given a paper and a protein. Most of the evaluated systems had low precision (46% for the best performing system) [15]. We are planning to participate in this assessment challenge in the near future. Raychaudhuri et al. [16] and Izumitani et al. [17] classify documents, and hence the genes associated with those documents, into GO terms. As discussed above, even though GEANN is more flexible in terms of its assumptions, its performance is still comparable to these systems. Koike et al. [19] employ actor-object relationships from the NLP perspective; their system is optimized for the biological process subontology, and it requires human input and manually created patterns. Fleischman and Hovy [8] present a supervised learning method which is similar to our flexible pattern approach in that it uses WordNet; however, we use significant terms to construct additional patterns so that we can locate additional semantic structures, while their work only considers the target instance as the base of its patterns. Riloff [10] proposes a technique to extract patterns that ignores the semantic side of the patterns; in addition, the patterns are strict in that they require word-by-word exact matching. Brin's DIPRE [11] takes an initial set of seed elements as input, and uses the seed set to extract patterns by analyzing the occurrences of seed instances in web documents. SNOWBALL [12] extends DIPRE's pattern extraction system by introducing the use of named-entity tags. Etzioni et al. developed a web information extraction system, KnowItAll [13], to automate the discovery of large collections of facts in web pages, which assumes redundancy of information on the web.

7. Conclusions and Future Work
In this paper, we have explored a new methodology to automatically infer new GO annotations for genes and gene products from biomedical paper abstracts.
We have developed GEANN, which utilizes existing annotation information to
construct textual extraction patterns characterizing an annotation with a specific GO concept. Exploring the accuracy of different semantic similarity measures for WordNet, disambiguation of genes that share a synonym, and determining the scoring weight parameters experimentally are among the future tasks.

Acknowledgments
This research is supported in part by NSF award DBI-0218061, a grant from the Charles B. Wang Foundation, and a Microsoft equipment grant.

References
1. The Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32, D258-D261, 2004.
2. PathCase, available at http://nashua.case.edu/pathways
3. PubMed, available at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
4. Salton, G. Automatic Text Processing. Addison-Wesley, 1989.
5. Fellbaum, C. An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
6. Mann, G. Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet, 2002.
7. Lovasz, L. Matching Theory. North-Holland, New York, 1986.
8. Fleischman, M., Hovy, E. Fine Grained Classification of Named Entities. COLING 2002.
9. Han, J., Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
10. Riloff, E. Automatically Generating Extraction Patterns from Untagged Text. AAAI/IAAI, 1996.
11. Brin, S. Extracting Patterns and Relations from the World Wide Web. WebDB 1998.
12. Agichtein, E., Gravano, L. Snowball: Extracting Relations from Large Plain-Text Collections. ACM DL 2000.
13. Etzioni, O. et al. Web-Scale Information Extraction in KnowItAll. WWW 2004.
14. Extended version of the paper, available at http://cakmak.case.edu/TechReports/GEANNExtended.pdf
15. Blaschke, C., Leon, E.A., Krallinger, M., Valencia, A. Evaluation of BioCreAtIvE Assessment of Task 2. BMC Bioinformatics, 2005.
16. Raychaudhuri, S. et al. Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature. Genome Res., 12(1):203-214.
17. Izumitani, T. et al. Assigning Gene Ontology Categories (GO) to Yeast Genes Using Text-Based Supervised Learning Methods. CSB 2004.
18. Imataka, H., Olsen, H., Sonenberg, N. A New Translational Regulator with Homology to Eukaryotic Translation Initiation Factor 4G. EMBO J., 1997.
19. Koike, A., Niwa, Y., Takagi, T. Automatic Extraction of Gene/Protein Biological Functions from Biomedical Text. Bioinformatics, 2005.
20. Settles, B. ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins, and Other Entity Names in Text. Bioinformatics, 2005.
21. Ogren, P. et al. The Compositional Structure of Gene Ontology Terms. PSB 2004.
22. WordNet Semantic Similarity Open Source Library, http://www.codeproject.com/useritems/semanticsimilaritywordnet.asp
A FAULT MODEL FOR ONTOLOGY MAPPING, ALIGNMENT, AND LINKING SYSTEMS
HELEN L. JOHNSON, K. BRETONNEL COHEN, AND LAWRENCE HUNTER
Center for Computational Pharmacology, School of Medicine, University of Colorado, Aurora, CO 80045, USA
E-mail: {Helen.Johnson, Kevin.Cohen, Larry.Hunter}@uchsc.edu
There has been much work devoted to the mapping, alignment, and linking of ontologies (MALO), but little has been published about how to evaluate systems that do this. A fault model for conducting fine-grained evaluations of MALO systems is proposed, and its application to the system described in Johnson et al. [15] is illustrated. Two judges categorized errors according to the model, and inter-judge agreement was calculated by error category. Overall inter-judge agreement was 98% after dispute resolution, suggesting that the model is consistently applicable. The results of applying the model to the system described in [15] reveal the reason for a puzzling set of results in that paper, and also suggest a number of avenues and techniques for improving the state of the art in MALO, including the development of biomedical domain specific language processing tools, filtering of high frequency matching results, and word sense disambiguation.
1. Introduction
The mapping, alignment, and/or linking of ontologies (MALO) has been an area of active research in recent years [4,28]. Much of that work has been groundbreaking, and has therefore been characterized by the lack of standardized evaluation metrics that is typical of exploratory work in a novel domain. In particular, this work has generally reported coarse metrics, accompanied by small numbers of error exemplars. However, in similar NLP domains, finer-grained analyses provide system builders with insight into how to improve their systems, and users with information that is crucial for interpreting their results [23,14,8]. MALO is a critical aspect of the National Center for Biomedical Ontology/Open Biomedical Ontologies strategy of constructing multiple orthogonal ontologies, but such endeavors have proven surprisingly difficult; Table 1 shows the results of a representative linking system, with correctness as low as 60.8% overall when aligning the BRENDA Tissue ontology with the Gene Ontology [15]. This paper proposes a fault model for evaluating lexical techniques in MALO systems, and applies it to the output of the system described in
Johnson et al. [15]. The resulting analysis illuminates reasons for differences in performance of both the lexical linking techniques and the ontologies used. We suggest concrete methods for correcting errors and advancing the state of the art in the mapping, alignment, and/or linking of ontologies. Because many techniques used in MALO are also applied in text categorization and information retrieval, the findings are also useful to researchers in those areas.
Previous lexical ontology integration research deals with false positive error analysis by briefly mentioning causes of those errors, along with some illustrative examples, but provides no further analysis. Bodenreider et al. mention some false positive alignments but offer no evaluations [3]. Burgun et al. assert that including synonyms of under three characters, substring matching, and case-insensitive matching contribute to false positive rates, and thus do not use them in their linking system [5]. They report that term polysemy across ontologies contributes to false positive rates, but do not explain the magnitude of the problem. Zhang et al. report a multi-part alignment system but do not discuss errors from the lexical system at all [29]. Lambrix et al. report precision from 0.285 to 0.875 on a small test set for their merging system, SAMBO, which uses n-grams, edit distance, WordNet, and string matching; WordNet polysemy and the n-gram matching method apparently produce 12.5% and 24.3% false positive rates, respectively [17,16]. Lambrix and Tan state that the same alignment systems produce different results depending on the ontology used; they give numbers of wrong suggestions but little analysis [18]. For a linking system that matches entities with and without normalization of punctuation, capitalization, stop words, and genitive markers, Sarkar et al. report, without examples, a 4-5% false positive rate [26]. Luger et al.
present a structurally verified lexical mapping system in which contradictory mappings occur at certain thresholds, but no examples or analyses are given [20]. Mork et al. introduce an alignment system with a lexical component but do not detail its performance [22]. Johnson et al. provide error counts sorted by search type and ontology but provide no further analysis [15]. Their system's performance for matching BRENDA terms to GO is particularly puzzling because correctness rates of up to 100% are seen with some ontologies, but correctness for matching BRENDA is as low as 7% (see Table 1). There has been no comprehensive evaluation of errors in lexical MALO systems. This leaves unaddressed a number of questions with real consequences for MALO system builders: What types of errors contribute to reduced performance? How much do they contribute to error rates? Are there scalable techniques for reducing errors without adversely impacting recall? Here we address these questions by proposing a fault model for false-positive errors in MALO systems, providing an evaluation of the errors produced by a biomedical ontology linking system, and suggesting
Table 1. Correctness rates for the ontology linking system described in Johnson et al. (2006). The three OBO ontologies listed in the left column were linked to the GO via the three lexical methods in the right columns.

                        Type of linking method
Ontology    Overall   Exact             Synonyms          Stemming
ChEBI       84.2%     98.3% (650/661)   60.0% (180/300)   73.5% (147/200)
Cell Type   92.9%     99.3% (431/434)   73.0% (65/89)     83.8% (88/105)
BRENDA      60.8%     84.5% (169/200)   76.0% (152/200)   11.0% (22/200)
methods to reduce errors in MALO.

2. Methods

2.1. The ontology linking method in Johnson et al. (2006)
Since understanding the methodology employed in Johnson et al. is important to understanding the analysis of its errors, we review that methodology briefly here. Their system models inter-ontology relationship detection as an information retrieval task, where a relationship is defined as any direct or indirect association between two ontological concepts. Terms from three OBO ontologies (BRENDA Tissue, ChEBI, and Cell Type) are searched for in GO terms [9,27,11,1]. Three types of searches are performed: (a) exact match to the OBO term, (b) match to the OBO term and its synonyms, and (c) match to the stemmed OBO term. The stemmer used in (c) was the implementation of the Porter stemmer provided with the Lucene IR library [13,25]. Besides stemming, this implementation also reduces characters to lower case, tokenizes on whitespace, punctuation, and digits (removing the latter two), and removes a set of general English stop words. The output of the system is pairs of concepts: one GO concept and one OBO concept. To determine the correctness of the proposed relationships, a random sample of the output (2,389 pairs) was evaluated by two domain experts who answered the question: Is this OBO term the concept that is being referred to in this GO term/definition? Inter-annotator agreement after dispute resolution was 98.2% (393/400). The experts deemed 481 relations to be incorrect, for an overall estimated system error rate of 20%. All of the system outputs (correct, incorrect, and unjudged) were made publicly available at compbio.uchsc.edu/dependencies.

2.2. The fault model
In software testing, a fault model is an explicit hypothesis about potential sources of errors in a system [2,8]. We propose a fault model, comprising three broad classes of errors (see Table 2), for the lexical components of MALO systems. The three classes of errors are distinguished by whether they are due to inherent properties of the ontologies themselves, are due to the processing techniques that the system builders apply, or are due to
including inappropriate metadata in the data that is considered for locating relationships. The three broad classes are further divided into more specific error types, as described below. Errors in the lexical ambiguity class arise from the inherent polysemy of terms in multiple ontologies (and in natural language in general) and from ambiguous abbreviations (typically listed as synonyms in an ontology). Errors in the text processing class come from manipulations performed by the system, such as the removal of punctuation, digits, or stop words, or from stemming. Errors in metadata matching occur when elements in one ontology match metadata in another ontology, e.g. references to sources that are found at the end of GO definitions. To evaluate whether or not the fault model is consistently applicable, two authors independently classified the 481 incorrect relationships from the Johnson et al. system into nine fine-grained error categories (the seven categories in the model proposed here, plus two additional categories, discussed below, that were rejected). The model allows for assignment of multiple categories to a single output. For instance, the judges determined that CH:29356 oxide(2-) erroneously matched GO:0019417 sulfur oxidation due both to character removal during tokenization ((2-) was deleted) and to stemming (the remaining oxide and oxidation both stemmed to oxid). Detailed explanations of the seven error categories, along with examples of each, are given below.^a

3. Results

Table 2 displays the counts and percentages of each type of error, with inter-judge agreement (IJA) for each category. Section 3.1 discusses inter-judge agreement and the implications that low IJA has for the fault model. Sections 3.2-3.3 explain and exemplify the categories of the fault model, and 3.4 describes the distribution of error types across orthogonal ontologies.

3.1. Inter-judge agreement
Inter-judge agreement with respect to the seven final error categories in the fault model is shown in Table 2. Overall IJA was 95% before dispute resolution and 99% after resolution. In the 1% of cases where the judges did not agree after resolution, the judge who was most familiar with the data assigned the categories. The initial fault model had two error categories that were eliminated from the final model because of low IJA. The first category, tokenization, had an abysmal 27% agreement rate even after dispute resolution. The second eliminated category, general English polysemy, had 80%
^a In all paired concepts in our examples, BTO = BRENDA Tissue Ontology, CH = ChEBI Ontology, CL = Cell Type Ontology, and GO = Gene Ontology. Underlining indicates the portion of GO and OBO text that matches, thereby causing the linking system to propose that a relationship exists between the pair.
pre-resolution agreement and 94% post-resolution agreement, with only 10 total errors assigned to this category. Both judges felt that all errors in this category could justifiably be assigned to the biological polysemy category; therefore, this category is not included in the final fault model.

Table 2. The fault model and results of its application to Johnson et al.'s erroneous outputs. The rows in bold are the subtotaled percentages of the broad categories of errors in relation to all errors. The non-bolded rows indicate the percentages of the subtypes of errors in relation to the broad category that they belong to. The counts for the subtypes of text processing errors exceed the total text processing count because multiple types of text processing errors can contribute to one erroneously matched relationship.

                                   Inter-judge agreement
Type of error                    pre-resolution  post-resolution  Percent  Count
Lexical ambiguity errors
  biological polysemy                 86%             96%           56%    (105/186)
  ambiguous abbreviation              98%             99%           44%    (81/186)
Lexical Ambiguity Total                                             38%    (186/481)
Text processing errors
  stemming                           100%            100%            6%    (29/449)
  digit removal                      100%            100%           51%    (231/449)
  punctuation removal                100%            100%           27%    (123/449)
  stop word removal                   99%            100%           14%    (65/449)
Text Processing Total                                               60%    (290/481)
Matched Metadata Total               100%            100%            1%    (5/481)
Total                                 95%             99%           99%    (481/481)

3.2. Lexical ambiguity errors
Lexical ambiguity refers to words that denote more than one concept. It is a serious issue when looking for relationships between domain-distinct ontologies [10:1429]. Lexical ambiguity accounted for 38% of all errors. Biological polysemy occurs when a term that is present in two ontologies denotes distinct biological concepts. It accounted for 56% of all lexical ambiguity errors. Examples of biological polysemy include (1-3) below. Example (1) shows a polysemous string that is present in two ontologies.

(1) BTO:0000280  cone
        def: A mass of ovule-bearing or pollen-bearing scales or bracts in trees of the pine family or in cycads that are arranged usually on a somewhat elongated axis.
    GO:0042676   cone cell fate commitment
        def: The process by which a cell becomes committed to become a cone cell.
OBO terms have synonyms, some of which polysemously denote concepts that are more general than the OBO term itself, and hence match GO concepts that are not the same as the OBO term. Examples (2) and (3) show lexical ambiguity arising because of the OBO synonyms.
(2) BTO:0000131  blood plasma
        synonym: plasma
        def: The fluid portion of the blood in which the particulate components are suspended.
    GO:0046759   lytic plasma membrane viral budding
        def: A form of viral release in which the nucleocapsid evaginates from the host nuclear membrane system, resulting in envelopment of the virus and cell lysis.

(3) CH:17997     dinitrogen
        synonym: nitrogen
    GO:0035243   protein-arginine omega-N symmetric methyltransferase activity
        def: ... Methylation is on the terminal nitrogen (omega nitrogen) ...
Example (4) shows that, by the same synonymy mechanism, terms from different taxa match erroneously.

(4) CL:0000338   neuroblast (sensu Nematoda and Protostomia)
        synonym: neuroblast
    GO:0043350   neuroblast proliferation (sensu Vertebrata)
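The synonym-expanded lookup that produces matches like (4) can be sketched as a naive whole-word search of OBO names and synonyms against GO term strings. This is a simplified stand-in for the Lucene-based retrieval the actual system uses; the function and variable names are ours:

```python
import re

def link(obo_terms, go_terms):
    # obo_terms: {obo_id: [name, synonym, ...]}; go_terms: {go_id: term text}.
    # Propose a relationship whenever any name or synonym occurs as a
    # whole word inside a GO term string (case-insensitive).
    pairs = []
    for obo_id, names in obo_terms.items():
        for name in names:
            pattern = r"\b" + re.escape(name.lower()) + r"\b"
            for go_id, text in go_terms.items():
                if re.search(pattern, text.lower()):
                    pairs.append((obo_id, go_id, name))
    return pairs

obo = {"CL:0000338": ["neuroblast (sensu Nematoda and Protostomia)", "neuroblast"]}
go = {"GO:0043350": "neuroblast proliferation (sensu Vertebrata)"}
# The full term does not match, but its bare synonym does:
print(link(obo, go))  # [('CL:0000338', 'GO:0043350', 'neuroblast')]
```

The false positive in (4) arises precisely because the taxon qualifier in the full term is absent from the synonym.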
Ambiguous abbreviation errors happen when an abbreviation in one ontology matches text in another that does not denote the same concept. The ambiguity of abbreviations is a well-known problem in biomedical text [7,6]. In the output of [15] it is the cause of 43% of all lexical ambiguity errors. The chemical ontology includes many one- and two-character symbols for elements (e.g. C for carbon, T for thymine, As for arsenic, and At for astatine). Some abbreviations are overloaded even within the chemical domain. For example, in ChEBI C is listed as a synonym for three chemical entities besides carbon, viz. L-cysteine, L-cysteine residue, and cytosine. So, single-character symbols match many GO terms, but with a high error rate. Examples (5) and (6) illustrate such errors.

(5) CH:17821     thymine
        synonym: T
    GO:0043377   negative regulation of CD8-positive T cell differentiation
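A first line of defense against such errors, used for example by Burgun and Bodenreider [5], is to exclude very short synonyms before searching. A minimal sketch; the helper name and threshold are illustrative:

```python
def usable_synonyms(synonyms, min_length=3):
    # Drop synonyms shorter than min_length characters, which in ChEBI are
    # mostly ambiguous abbreviations (T, In, As, C, ...).
    return [s for s in synonyms if len(s) >= min_length]

print(usable_synonyms(["thymine", "T"]))   # ['thymine']
print(usable_synonyms(["indium", "In"]))   # ['indium']
```

This sacrifices recall for genuinely intended short matches, but removes the single-character element symbols responsible for errors like (5) and (6).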
One- and two-character abbreviations sometimes also match closed-class or function words, such as a or in, as illustrated in example (6).

(6) CH:30430     indium
        synonym: In
    GO:0046465   dolichyl diphosphate metabolism
        def: ... In eukaryotes, these function as carriers of ...

3.3. Text processing errors
As previously mentioned, Johnson et al.'s system uses a stemmer that requires lower-case text input. The system performs this transformation with a Lucene analyzer that splits tokens on non-alphabetic characters, removing digits and punctuation in the process, and removes stop words. This transformed text is then sent to the stemmer. Example (7) shows a ChEBI term and a GO term, along with the search and match strings produced by this pipeline.
(7)              Original text         Tokenized/stemmed text
    CH:32443     L-cysteinate(2-)      l cystein
    GO:0018118   peptidyl-L-cysteine   peptidyl l cystein
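The processing chain behind example (7) (lowercasing, splitting on non-alphabetic characters, which deletes digits and punctuation, stop word removal, then stemming) can be approximated as follows. Here crude_stem is a deliberately simplified stand-in for the Porter stemmer, sufficient only to reproduce this example, and the stop word list is a small illustrative subset:

```python
import re

STOP_WORDS = {"a", "an", "and", "as", "at", "by", "in", "of", "on", "the", "to"}

def crude_stem(token):
    # Crude stand-in for the Porter stemmer: strip one common suffix,
    # keeping at least a three-character stem.
    for suffix in ("ation", "ate", "ing", "ed", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(term):
    # Lowercase, split on non-alphabetic characters (dropping digits and
    # punctuation), remove stop words, then stem each surviving token.
    tokens = re.split(r"[^a-z]+", term.lower())
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

print(analyze("L-cysteinate(2-)"))     # ['l', 'cystein']
print(analyze("peptidyl-L-cysteine"))  # ['peptidyl', 'l', 'cystein']
```

Both terms reduce to token sequences containing l cystein, so the system proposes a relationship even though the chemical species are distinct.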
Errors arise from the removal of digits and punctuation, the removal of stop words, and the stemming process itself (see Table 2). These are illustrated in examples (8-16). Few errors resulting from text processing can be attributed to a single mechanism. Digit removal is the largest contributor among the text processing error types, constituting 51% of the errors. Punctuation removal is responsible for 27% of the errors. These two are illustrated in examples (8-10).

(8) CL:0000624   CD4 positive T cell
    GO:0043378   positive regulation of CD8-positive T cell differentiation

(9) CH:20400     4-hydroxybutanal
    GO:0004409   homoaconitate hydratase activity
        def: Catalysis of the reaction: 2-hydroxybutane-1,2,4-tri ...

(10) CH:30509    carbon (1+)
     GO:0018492  carbon-monoxide dehydrogenase (acceptor) activity
Six percent of the errors involve the stemming mechanism. (This is somewhat surprising, since the Porter stemmer has been independently characterized as being only moderately aggressive [12].)

Table 3. Counts of correct and incorrect relationships that resulted after the stemming mechanism was applied.

Matches    -al  -ate  -ation  -e  -ed  -ic  -ing  -ize  -ous  -s
Correct     19    1      2    12   0    11    0     0     2   157
Incorrect    1   17      3    26   3     2    4     1     0    39
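The counts in Table 3 translate directly into a per-suffix risk list. Flagging every suffix whose removal produced more incorrect than correct matches singles out -ate and -e, among others; a sketch over the transcribed counts:

```python
# (correct, incorrect) counts per stripped suffix, transcribed from Table 3.
TABLE3 = {
    "-al": (19, 1), "-ate": (1, 17), "-ation": (2, 3), "-e": (12, 26),
    "-ed": (0, 3), "-ic": (11, 2), "-ing": (0, 4), "-ize": (0, 1),
    "-ous": (2, 0), "-s": (157, 39),
}

# Suffixes whose removal produced more incorrect than correct matches;
# these are candidates for exclusion in a domain-tuned stemmer.
risky = [s for s, (ok, bad) in TABLE3.items() if bad > ok]
print(risky)  # ['-ate', '-ation', '-e', '-ed', '-ing', '-ize']
```

A domain-specific stemmer could simply decline to strip the flagged suffixes while continuing to strip productive ones such as -s.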
Of the 580 evaluated relationships that were processed by the stemming mechanism in the original linking system, 43% (253/580) match because of the stemming applied. Of those, 73% (185/253) are correct relationships; 27% (68/253) are incorrect. Table 3 displays all suffixes that were removed during stemming and the counts of how many times their removal resulted in a correct or an incorrect match. Examples (11-13) display errors due to stemming:

(11) CH:25741    oxides
     GO:0016623  oxidoreductase activity, acting on the aldehyde or oxo ...
         def: Catalysis of an oxidation-reduction (redox) reaction ...

(12) CH:25382    monocarboxylates
     GO:0015718  monocarboxylic acid transport
         def: The directed movement of monocarboxylic acids into ...

(13) CH:32530    histidinate(2-)
     GO:0019558  histidine catabolism to 2-oxoglutarate
While stemming works most of the time to improve recall—the count of correct matches in Table 3 is more than double the count of incorrect matches (204 versus 96)—an analysis of the errors shows that in this data, there is a subset of suffixes that do not stem well from biomedical terms, at least in these domains. Removal of -e results in incorrect matches far more often than it results in correct matches, and removal of -ate almost never results in a correct match. These findings illustrate the need for a domain-specific stemmer for biomedical text. Finally, stop word removal contributed 14% of the error rate. Examples like (14-16) are characteristic:

(14) CL:0000197  receptor cell
     GO:0030152  bacteriocin biosynthesis
         def: ... at specific receptors on the cell surface ...

(15) CH:25051    lipid As
     GO:0046834  lipid phosphorylation

(16) CH:29155    His-tRNA(His)
     GO:0050562  lysine-tRNA(Pyl) ligase activity
3.4. Applying the fault model to orthogonal ontologies
The fault model that this paper proposes explains the patterns observed in the Johnson et al. work. They report an uneven distribution of accuracy rates across the ontologies (see Table 1); Table 4 shows that this corresponds to an uneven distribution of the error types across ontologies. Most striking is that ChEBI is especially prone to ambiguous abbreviation errors, which were entirely absent with the other two ontologies. BRENDA is prone to deletion-related errors — in fact, over half of the errors in the text processing error category are due to a specific type of term in BRENDA (169/290). These terms have the structure X cell, where X is any combination of capital letters, digits, and punctuation, such as B5/589 cell, T-24 cell, and 697 cell. The search strings rendered from these after the deletions—B cell, T cell, and cell, respectively—match promiscuously to GO (see Figure 1). Biological polysemy errors are a problem in all three ontologies. Sixty-four percent of the errors for Cell Type were related to polysemy, 20% in BRENDA, and 12% in ChEBI. Dealing with word sense disambiguation could yield a large improvement in performance for these ontologies. None of this error type distribution is apparent from the original data reported in [15], and all of it suggests specific ways of addressing the errors in aligning these ontologies with GO.

4. Fault-driven analysis suggests techniques for improving MALO

Part of the value of the fault model is that it suggests scalable methods for reducing the false positive error rate in MALO without adversely affecting recall. We describe some of them here.
Table 4. Distribution of error types across ontologies.

                                            Deletion of:
Ontology    Biological  Abbreviation   digit  punct.  stopword  Stemming  Totals
            polysemy    ambiguity
BRENDA          84           0          187     89       54         2       416
Cell Type       29           0            9      0        7         0        45
ChEBI           26          81           35     34        4        27       207

[Figure 1, a histogram of false-positive match counts per BRENDA term (x-axis: Number of Terms), with BY-2 cell, blood plasma, and T-84 cell as labeled outliers, is not reproduced here.]
Figure 1. A few terms from BRENDA caused a large number of errors.
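The problematic BRENDA terms have a recognizable surface shape, so they can be detected before text processing destroys their distinguishing prefix. A sketch; the regular expression and helper function are illustrative, not part of the original system:

```python
import re

# Hypothetical check: does a BRENDA term have the "X cell" shape, where X is
# capitals, digits, and punctuation that the analyzer will largely delete?
XCELL = re.compile(r"^[A-Z0-9/\-\.]+ cell$")

def degrades_to_cell(term):
    # Simulate the damage: after lowercasing and deleting digits and
    # punctuation, only the alphabetic characters of the prefix survive.
    if not XCELL.match(term):
        return None
    prefix = re.sub(r"[^a-zA-Z]", "", term[: -len(" cell")]).lower()
    return (prefix + " cell").strip()

print(degrades_to_cell("B5/589 cell"))  # 'b cell'
print(degrades_to_cell("T-24 cell"))    # 't cell'
print(degrades_to_cell("697 cell"))     # 'cell'
```

Terms flagged this way could be exempted from digit and punctuation removal, or searched only as exact strings.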
4.1. Error reduction techniques related to text processing
Johnson et al. reported exceptionally low accuracy for BRENDA relationships based on stemming: only 7-15% correctness. Our investigation suggests that this low accuracy is due to a misapplication of an out-of-the-box Lucene implementation of the Porter stemmer: it deletes all digits, which occur in BRENDA cell line names, leading to many false-positive matches against GO concepts containing the word cell. Similarly, bad matches between ChEBI chemicals and the GO (73-74% correctness rate) occur because of digit and punctuation removal. This suggests that a simple change to the text processing procedures could lower the error rate dramatically.

4.2. Error reduction techniques related to ambiguity
For ontologies with error patterns like ChEBI and BRENDA, excluding synonyms shorter than three characters would be beneficial; Burgun and Bodenreider did exactly this [5]. Length-based filtering of search candidates has been found useful for other tasks in this domain, such as identification and normalization of Drosophila gene names in text [21]. Numerous techniques have been proposed for resolving word sense ambiguities [24]. The OBO definitions may prove to be useful resources for
knowledge-based ontology term disambiguation [19].

4.3. Error reduction by filtering high error contributors
The Zipf-like distribution of error counts across terms (see Figure 1) suggests that filtering a small number of terms would have a beneficial effect on the error rates due to both text processing and ambiguity-related errors. This filtering could be carried out in post-processing, by setting a threshold for matching frequency or for matching rank. Alternatively, it could be carried out in a pre-processing step by including high-frequency tokens in the stop list. This analysis would need to be done on an ontology-by-ontology basis, but neither method requires expert knowledge to execute the filtering process. As an example of the first procedure, removing the top contributors to false-positive matches in each ontology would yield the results in Table 5.

Table 5. Effect of filtering high-frequency match terms.

Ontology    Terms removed                                  Increase in correctness  Decrease in matches
BRENDA      697 cell, BY-2 cell, blood plasma, T-84 cell            27%                     41%
Cell Type   band form neutrophil, neuroblast                         4%                      3%
ChEBI       iodine, L-isoleucine residue, groups                     2%                      2%
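The post-processing variant can be sketched as a threshold on per-term match frequency; the threshold and candidate pairs below are illustrative:

```python
from collections import Counter

def filter_promiscuous(pairs, max_matches=5):
    # pairs: candidate (obo_term, go_id) relationships proposed by the
    # lexical linker. A term matching more than max_matches GO concepts is
    # treated as promiscuous, and all of its pairs are dropped.
    counts = Counter(term for term, _ in pairs)
    return [(term, go) for term, go in pairs if counts[term] <= max_matches]

# Illustrative data: "697 cell" matches promiscuously, "neuroblast" does not.
pairs = [("697 cell", f"GO:{i:07d}") for i in range(20)]
pairs.append(("neuroblast", "GO:0043350"))
print(len(filter_promiscuous(pairs)))  # 1
```

No expert knowledge is needed: the filter uses only the match statistics of the system's own output, which is what makes the technique scalable.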
5. Conclusion

The analysis presented in this paper supports the hypotheses that it is possible to build a principled, data-driven fault model for MALO systems; that the model proposed can be applied consistently; that such a model reveals previously unknown sources of system errors; and that it can lead directly to concrete suggestions for improving the state of the art in ontology alignment. Although the fault model was applied to the output of only one linking system, that system included linking data between four orthogonal ontologies. The model proved effective at elucidating the distinct causes of errors in linking the different ontologies, as well as the puzzling case of BRENDA. A weakness of the model is that it addresses only false-positive errors; evaluating failures of recall is a thorny problem that deserves further attention. Based on the descriptions of systems and false positive outputs of related work, the fault model presented here could be applied to the output of many other systems, including at least [3,5,16,17,18,26,20,22,29]. Note that in the data examined in this paper, the distribution of error types differed not just across lexical techniques but across ontologies as well. This reminds us that specific categories in the model may not be represented in the output of
all systems applied to all possible pairs of ontologies, and that there may be other categories of errors that were not reflected in the data that was available to us. For example, the authors of the papers cited above have reported errors due to case folding, spelling normalization, and word order alternations that were not detected in the output of Johnson et al.'s system. However, the methodology that the present paper illustrates—i.e., combining the software testing technique of fault modelling with an awareness of linguistic factors—should be equally applicable to any lexically-based MALO system. Many of the systems mentioned in this paper also employ structural techniques for MALO. These techniques are complementary to, not competitive with, lexical ones. The lexical techniques can be evaluated independently of the structural ones; a similar combination of the software testing approach with awareness of ontological/structural issues may be applicable to structural techniques. We suggest that the quality of future publications in MALO can be improved by discussing error analyses with reference to this model or very similar ones derived via the same techniques.

6. Acknowledgments
The authors gratefully acknowledge the insightful comments of the three anonymous PSB reviewers, and thank Michael Bada for helpful discussion and Todd A. Gibson and Sonia Leach for editorial assistance. This work was supported by NIH grant R01-LM008111 (LH).

References
1. J. Bard, S. Y. Rhee, and M. Ashburner. An ontology for cell types. Genome Biol, 6(2), 2005.
2. R. V. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison-Wesley Professional, October 1999.
3. O. Bodenreider, T. F. Hayamizu, M. Ringwald, S. De Coronado, and S. Zhang. Of mice and men: aligning mouse and human anatomies. AMIA Annu Symp Proc, pages 61-65, 2005.
4. O. Bodenreider, J. A. Mitchell, and A. T. McCray. Biomedical ontologies: Session introduction. In Pac Symp Biocomput, 2003, 2004, 2005.
5. A. Burgun and O. Bodenreider. An ontology of chemical entities helps identify dependence relations among Gene Ontology terms. In Proc SMBM, 2005.
6. J. Chang and H. Schütze. Abbreviations in biomedical text. In S. Ananiadou and J. McNaught, editors, Text mining for biology and biomedicine, pages 99-119. Artech House, 2006.
7. J. T. Chang, H. Schütze, and R. B. Altman. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc, 9(6):612-620, 2002.
8. K. B. Cohen, L. Tanabe, S. Kinoshita, and L. Hunter. A resource for constructing customized test suites for molecular biology entity identification systems. BioLINK 2004, pages 1-8, 2004.
9. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet, 25(1):25-29, 2000.
10. The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Research, 11:1425-1433, 2001.
11. K. Degtyarenko. Chemical vocabularies and ontologies for bioinformatics. In Proc 2003 Intl Chem Info Conf, 2003.
12. D. Harman. How effective is suffixing? J Am Soc Info Sci, 42(1):7-15, 1991.
13. E. Hatcher and O. Gospodnetic. Lucene in Action (In Action series). Manning Publications, 2004.
14. L. Hirschman and I. Mani. Evaluation. In R. Mitkov, editor, Oxford handbook of computational linguistics, pages 414-429. Oxford University Press, 2003.
15. H. L. Johnson, K. B. Cohen, W. A. Baumgartner, Z. Lu, M. Bada, T. Kester, H. Kim, and L. Hunter. Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pac Symp Biocomput, pages 28-39, 2006.
16. P. Lambrix and A. Edberg. Evaluation of ontology merging tools in bioinformatics. Pac Symp Biocomput, pages 589-600, 2003.
17. P. Lambrix, A. Edberg, C. Manis, and H. Tan. Merging DAML+OIL bioontologies. In Description Logics, 2003.
18. P. Lambrix and H. Tan. A framework for aligning ontologies. In PPSWR, pages 17-31, 2005.
19. M. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on systems documentation, pages 24-26, New York, NY, USA, 1986. ACM Press.
20. S. Luger, S. Aitken, and B. Webber. Automated terminological and structural analysis of human-mouse anatomical ontology mappings. BMC Bioinformatics, 6(Suppl. 3), 2005.
21. A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh, and J. B. Colombe. Gene name identification and normalization using a model organism database. J Biomedical Informatics, 37(6):396-410, 2004.
22. P. Mork, R. Pottinger, and P. A. Bernstein. Challenges in precisely aligning models of human anatomy using generic schema matching. MedInfo, 11(Pt 1):401-405, 2004.
23. S. Oepen, K. Netter, and J. Klein. TSNLP - Test suites for natural language processing. In Linguistic Databases. CSLI Publications, 1998.
24. T. Pedersen and R. Mihalcea. Advances in word sense disambiguation. Tutorial, Conf of ACL, 2005.
25. M. Porter. An algorithm for suffix stripping. Program, 14:130-137, 1980.
26. I. N. Sarkar, M. N. Cantor, R. Gelman, F. Hartel, and Y. A. Lussier. Linking biomedical language information and knowledge resources: GO and UMLS. Pac Symp Biocomput, pages 439-450, 2003.
27. I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res, 32(Database issue), 2004.
28. P. Shvaiko and J. Euzenat. A survey of schema-based matching approaches. Journal on Data Semantics, 4, 2005.
29. S. Zhang and O. Bodenreider. Aligning representations of anatomy using lexical and structural methods. AMIA Annu Symp Proc, pages 753-757, 2003.
INTEGRATING NATURAL LANGUAGE PROCESSING WITH FLYBASE CURATION

NIKIFOROS KARAMANIS*, IAN LEWIN*, RUTH SEAL†, RACHEL DRYSDALE† AND EDWARD BRISCOE*

Computer Laboratory* and Department of Genetics†, University of Cambridge
E-mail for correspondence: [email protected]

Applying Natural Language Processing techniques to biomedical text as a potential aid to curation has become the focus of intensive research. However, developing integrated systems which address the curators' real-world needs has been studied less rigorously. This paper addresses this question and presents generic tools developed to assist FlyBase curators. We discuss how they have been integrated into the curation workflow and present initial evidence about their effectiveness.
1. Introduction

The number of papers published each year in fields such as biomedicine is increasing exponentially [1,2]. This growth in literature makes it hard for researchers to keep track of information, so progress often relies on the work of professional curators: specialised scientists trained to identify and extract prespecified information from a paper to populate a database. Although there is already a substantial literature on applying Natural Language Processing (NLP) techniques to the biomedical domain, how the output of an NLP system can be utilised by the intended user has not been studied as extensively [1]. This paper discusses an application developed under a user-centered approach which presents the curators with the output of several NLP processes to help them work more efficiently. In the next section we discuss how observing curators at work motivates our basic design criteria. Then, we present the tool and provide an overview of the NLP processes behind it, as well as of the customised curation editor we developed following the same principles. Finally, we discuss how these applications have been incorporated into the curation workflow and present a preliminary study on their effectiveness.

*William Gates Building, Cambridge, CB3 0FD, UK. †Downing Site, Cambridge, CB2 3EH, UK.
[Figure 1 shows (A) a schematic of the curation information flow (paper, curator, customised editor, controlled vocabulary and ontologies, post-processor, new record, database) and (B) excerpts of the gene proforma (Version 37: 5 Aug 2005) and allele proforma (Version 32: 5 Aug 2005); the full proformae are not reproduced here.]

Figure 1. (A) Overview of the curation information flow. (B) Gene and allele proformae.
2. The FlyBase curation paradigm

The tools presented in this paper have been developed under an approach which actively involves the potential user and consists of iterative cycles of (a) design, (b) system development, and (c) feedback and redesign [3]. The intended users of the system are the members of the FlyBase curation team in Cambridge (currently seven curators). FlyBase^a is a widely used database of genomic research on the fruit fly. It has been updated with newly curated information since 1992 by teams located in Harvard, Indiana and Berkeley, as well as the Cambridge group. Although the curation paradigm followed by FlyBase is not the only one, it is based on practices developed through years of experience and has been adopted by other curation groups. FlyBase curation is based on a watchlist of around 35 journals. Each curator routinely selects a journal from the list and inspects its latest issue to identify which papers to curate. Curation takes place on a paper-by-paper basis (as opposed to gene-by-gene or topic-by-topic). A simplified view of the curation information flow is shown in Figure 1A. A standard UNIX editor with some customised functions is used to produce a record for each paper. The record consists of several proformae (Figure 1B), one for each significant gene or allele discussed in the paper. Each proforma is made of 33 fields (not all of which are always filled): some fields require rephrasing, paraphrasing and/or summarisation, while others record very specific facts using terms from ontologies or a controlled vocabulary. In addition to interacting with the paper, typically viewed in printed form or loaded into a PDF viewer, the curator also needs to access the database
^a www.flybase.org
to fill in some fields. This is done via several task-specific scripts which search the database, e.g. for a gene name or a citation identifier. After the record has been completed, it is post-processed automatically to check for inconsistencies and technical errors. Once these have been corrected, it is uploaded to the database. Given that extant information retrieval systems such as MedMiner [4] or Textpresso [5] are devised to support the topic-by-topic curation model in other domains, FlyBase curators need additional technology tailored to their curation paradigm and domain. In order to identify users' requirements more precisely, several observations of curation took place, focussing on the various ways in which the curators interact with the paper: some curators skim through the whole paper first (often highlighting certain phrases with their marker) and then re-read it more thoroughly. Others start curation from a specific section (not necessarily the abstract or the introduction) and then move to another section in search of additional information about a specific concept. The "find" function of the PDF viewer is often used to search for multiple occurrences of the same term. Irrespective of the adopted heuristics, all curators agreed that identifying the sections of the text which contain information relevant to the proforma fields is laborious and time-consuming. Current NLP technology identifies domain-specific names of genes and alleles, as well as relations between them, relatively reliably. However, providing the curator simply with the typical output of several NLP modules is not going to be particularly helpful [1]. Hence, one of our primary aims is to design and implement a system which will not only utilise the underlying NLP processes but also enable the curators to interact with the text efficiently to accurately access segments which contain potentially useful information.
Crucially, this is different from providing them with automatically filled information extraction templates and asking them to go back to the text and confirm their validity, which would shift their responsibility to verifying the quality of the NLP output. Instead, we want to develop a system in which the curators maintain the initiative, following their preferred style, but are usefully assisted by software adapted to their work practices. Records are highly structured documents, so we additionally aimed to develop, using the same design principles, an enhanced editing tool sensitive to this structure in order to speed up navigation within a record too. This paper presents the tools we developed based on these premises. We anticipate that our work will be of interest to other curation groups following the paper-by-paper curation paradigm.
3. PaperBrowser

PaperBrowser^b presents the curator with an enhanced display of the text in which words automatically recognised as gene names are highlighted in a coloured font (Figure 4A). It enables the curators to quickly scan the whole text by scrolling up and down while their attention is directed to the highlighted names. PaperBrowser is equipped with two navigation panes, called PaperView and EntitiesView, that are organised in terms of the document structure and possible relations between noun phrases, both of which are useful cues for curation [2]. PaperView lists gene names such as "zen" in the order in which they appear in each section (Figure 4B). EntitiesView (Figure 4C) lists groups of words (noun phrases) automatically recognised as referring to the same gene or to a biologically related entity such as "the zen cDNA". The panes are meant not only to provide the curator with an overview of the gene names and the related noun phrases in the paper but also to support focused extraction of information, e.g. when the curator is looking for a gene name in a specific section or tries to locate a noun phrase referring to a certain gene product.

Clicking on a node in either PaperView or EntitiesView redirects the text window to the paragraph that contains the corresponding gene name or noun phrase, which is then highlighted in a different colour. The same colour is used to highlight the other noun phrases listed together with the clicked node in EntitiesView. In this way the selected node and all related noun phrases become more visible in the text. The interface allows the curators to mark a text segment as "read" by crossing it out (which is useful when they want to distinguish between the text they have read and what they still need to curate). A "find" function supporting case-sensitive and wrapped search is implemented too. The "Tokens to verify" tab is used to collect feedback about the gene-name recogniser in a non-intrusive manner.
This tab presents the curator with a short list of words (currently just 10 per paper) for which the recogniser is uncertain whether or not they are gene names. Each name in the list is hyperlinked to the text, allowing the curator to examine it in its context and decide whether it should be marked as a gene or not (by clicking on the corresponding button). Active learning [6] is then used to improve the recogniser's performance on the basis of the collected data.

^b
PaperBrowser is a "rich content" browser built on top of the Mozilla Gecko engine and JREX (see www.mozilla.org for more details).
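The selection of the "Tokens to verify" can be sketched as uncertainty sampling, a common active-learning strategy. The paper cites [6] but does not spell out the selection rule, so the rule, the function name and the probabilities below are illustrative assumptions, not the system's actual implementation:

```python
# Hypothetical sketch: pick the n tokens whose gene/not-gene probability
# is closest to 0.5, i.e. the tokens the recogniser is least sure about.

def tokens_to_verify(token_probs, n=10):
    """token_probs: {token: P(token is a gene name)}; return n most uncertain."""
    return sorted(token_probs, key=lambda t: abs(token_probs[t] - 0.5))[:n]

# Made-up probabilities for illustration.
probs = {"zen": 0.97, "dpp": 0.55, "table": 0.02, "msl": 0.48}
print(tokens_to_verify(probs, n=2))  # ['msl', 'dpp']
```

The curator's yes/no answers for these tokens then become new labelled training data for the recogniser.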
Figure 2. Paper processing pipeline: the paper in PDF is converted to XML, enriched with automatically recognised gene names and anaphoric dependencies, and output as FBXML (XML containing our own "added value" markup) for display in PaperBrowser.
4. Paper Processing Pipeline

In this section we discuss the technology used to produce the XML-based format which is displayed by PaperBrowser. This is a non-trivial task, requiring the integration into a unified system of several components, each addressing different but often inter-related problems. The pipeline in Figure 2 was implemented since it was unclear whether integrating these modules could be readily done within an existing platform such as GATE [7].

The input to the pipeline is the paper in PDF, which is currently the only "standard electronic format" in which all relevant papers are available. This needs to be translated to a format that can be utilised by the deployed NLP modules, but since current PDF-to-text processors are not aware of the typesetting of each journal, text in two columns, footnotes, headers and figure captions tends to be dispersed and mixed up during the conversion. This problem is addressed by the Document Parsing module, which is based on existing software for optical character recognition (OCR) enhanced by templates for deriving the structure of the document [8]. Its output is in a general XML format defined to represent scientific papers. By contrast to standard PDF-to-text processors, the module preserves significant formatting information such as characters in italics and superscripts, which may indicate the mention of a gene or an allele respectively.

The initial XML is then fed to a module that implements a machine-learning paradigm extending the approach in [9] to identify gene names in the text [10], a task known as Named Entity Recognition (NER).^c Then, the RASP parser [11] is employed to identify the boundaries of the noun phrase (NP) around each gene name and its grammatical relations with other NPs in the text. This information is combined with features derived

^c
The NER module may also be fed with papers in XML available from certain publishers.
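The staged design just described can be sketched as a chain of functions that each annotate a shared document structure. Everything below is a toy stand-in: the function names, the dict-based document and the gazetteer-style "recogniser" are our assumptions for illustration, not the project's actual modules:

```python
# Toy sketch of the paper-processing pipeline; all bodies are illustrative.

def document_parsing(pdf_text):
    # Real module: OCR plus journal-specific templates -> structural XML [8].
    return {"text": pdf_text, "entities": [], "chains": []}

def named_entity_recognition(doc):
    # Real module: machine-learned gene-name tagger [10]; here a tiny gazetteer.
    doc["entities"] = [t for t in doc["text"].split() if t in {"msl-1", "zen"}]
    return doc

def anaphora_resolution(doc):
    # Real module: links NPs referring to the same or related entities [12];
    # here each recognised name simply becomes its own coreference chain.
    doc["chains"] = [[e] for e in doc["entities"]]
    return doc

def run_pipeline(pdf_text):
    doc = document_parsing(pdf_text)
    for stage in (named_entity_recognition, anaphora_resolution):
        doc = stage(doc)
    return doc  # stands in for the FBXML handed to PaperBrowser

print(run_pipeline("male animals die when they are mutant for msl-1")["entities"])
# ['msl-1']
```

The point of the chained structure is that each stage only depends on the annotations added by earlier stages, so individual modules can be replaced or improved independently.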
Table 1. Performance of the modules for Document Parsing, Named Entity Recognition and Anaphora Resolution.

                            Recall   Precision   F-score
Named Entity Recognition    82.2%    83.4%       82.8%
Anaphora Resolution         75.6%    77.5%       76.5%
Document Parsing            96.2%    97.5%       96.8%
from an ontology to resolve the anaphoric dependencies between NPs [12]. For instance, in the following excerpt: "... is encoded by the gene male specific lethal-1 ... the MSL-1 protein localizes to several sites ... male animals die when they are mutant for msl-1 ...", the NER system recognises "male specific lethal-1" as a gene name. Additionally, the anaphora resolution module identifies the NP "the gene male specific lethal-1" as referring to the same entity as the NP "msl-1" and as being related to the NP "the MSL-1 protein". The result of the whole process is a version of the paper in FBXML (i.e. our customised XML format), which is displayed by PaperBrowser. The PaperView navigation pane makes use of the output of the NER system and information about the structure of the paper, while EntitiesView utilises the output of the anaphora resolution module as well. Images, which are very hard to handle for most text processing systems [2] but are particularly important to curators (see next section), are displayed in an extra window (together with their captions, which are displayed in the text too), since trying to incorporate them into the running text was too complex given the information preserved in the OCR output.

Following the standard evaluation methodology in NLP, we used collections of texts annotated by domain experts to assess the performance of the NER [10] and anaphora resolution [12] modules in terms of Recall (correct system responses divided by all human-annotated responses), Precision (correct system responses divided by all system responses) and their harmonic mean (F-score). Both modules achieve state-of-the-art results compared to semi-supervised approaches with similar architectures. The same measures were used to evaluate the document parsing module on an appropriately annotated corpus [8]. Table 1 summarises the results of these evaluations. Earlier versions of the NER and anaphora resolution modules are discussed in [13].
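The evaluation measures defined above are easy to compute directly; the short snippet below defines them and checks the NER row of Table 1 (P = 83.4%, R = 82.2% gives F ≈ 82.8%):

```python
def precision_recall_f(correct, system_total, gold_total):
    """P = correct/system responses, R = correct/gold responses, F = harmonic mean."""
    p = correct / system_total
    r = correct / gold_total
    return p, r, 2 * p * r / (p + r)

def f_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent scale)."""
    return 2 * precision * recall / (precision + recall)

# NER row of Table 1: P = 83.4%, R = 82.2% -> F-score 82.8%
print(round(f_score(83.4, 82.2), 1))  # 82.8
```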
5. ProformaEditor

In order to further support the curation process, we implemented an editing tool called ProformaEditor (Figure 4D). ProformaEditor supports all general and customised functionalities of the editor that it is meant to replace, such as: (a) copying text between fields and from/to other applications such as PaperBrowser, (b) finding and replacing text (enabling case-sensitive search and a replace-all option), (c) inserting an empty proforma, the fields of which can then be completed by the curator, and (d) introducing predefined text (corresponding to FlyBase's controlled vocabulary) to certain fields by choosing from the "ShortCuts" menu. Additionally, ProformaEditor visualises the structure of the record as a tree, enabling the curator to navigate to a proforma by clicking on the corresponding node. Moreover, the fields of subsequent proformae are displayed in different colours so that they can be distinguished more easily.

Since the curators do not store pointers to a passage that supports a field entry, finding evidence for that entry in the paper based on what has been recorded in the field is extremely difficult [2]. We address this problem by logging the curator's pasting actions to collect information which will enable us to further enhance the underlying NLP technology, such as: (a) where the pasted text is located in the paper, (b) which field it is pasted to, (c) whether it contains words recognised as gene names or related NPs, and (d) to what extent it is subsequently post-edited by the curator. This data collection also takes place without interfering with curation.

6. Integrating the tools into FlyBase's workflow

After some in-house testing, a curator was asked to produce records for 12 papers from two journals using a prototype version of the tools, to which she was exposed for the first time (Curation01). Curation01 initiated our attempt to integrate the tools into FlyBase's workflow.
This integration requires substantial effort and often needs to address low-level software engineering issues [14]. Thus, our aims were quite modest: (a) recording potential usability problems and (b) ensuring that the tools do not impede the curator from completing a record in the way that she had been used to. ProformaEditor was judged to be valuable, although a few enhancements were identified, such as the introduction of the "find and replace" function and the "ShortCuts" menu that the curators had in their old editor. Compared to that editor, the curator regarded the visualisation of the record structure as a very useful additional feature. PaperBrowser was tested less extensively during Curation01 due to the
loss of the images during the PDF-to-XML process, which was felt by the curator to be a significant impediment. Although the focus of the project is on text processing, the pipeline and PaperBrowser were adjusted accordingly to display this information. A second curation exercise (Curation02) followed, in which the same curator produced records for 9 additional papers using the revised tools. This time the curator was asked to base the curation entirely on the text as displayed in PaperBrowser and to advise the developers of any problems. Soon after Curation02, the curator also produced records for 28 other papers from several journals (Curation03) using ProformaEditor but not PaperBrowser, since these papers had not been processed by the pipeline. Like every other record produced by FlyBase curators, the outputs of all three exercises were successfully post-processed and used to populate the database.

Overall, the curator did not consider the tools to have a negative impact on task completion. ProformaEditor became the curator's editor of choice after Curation03 and has been used almost daily since then. The feedback on PaperBrowser included several cases in which identifying passages that provide information about certain genes, as well as their variants, products and phenotypes, using PaperView and/or EntitiesView was considered more helpful than looking at the PDF viewer or a printout. Since the prototype tools were found to be deployable within FlyBase's workflow, we concluded that the aims of this phase had been met. However, the development effort is not complete: the curator also noticed that the displayed text carries over errors made by the pipeline modules, and pointed out a number of usability problems, on the basis of which a list of prioritised enhancements was compiled.
The shortlisted improvements to PaperBrowser include: (a) making tables and captions more easily identifiable, (b) flagging clicked nodes in the navigation panes, and (c) saving text marked as read before exiting. We also intend to boost the performance of the pipeline modules using the curator's feedback, and to equip ProformaEditor with new pasting functionalities which will incorporate FlyBase's term normalisation conventions.

7. A pilot study on usability

This section presents an initial attempt to estimate the curator's performance in each exercise. To the best of our knowledge, this is, although preliminary, the first study of this kind relating to scientific article curation. Although the standard NLP metrics in Table 1 do not capture how useful a system actually is in the workplace [1], coming up with a quantitative
measure to assess the curator's performance is not straightforward either. At this stage we decided to use a gross measure by logging the time it took for the curator to complete a record during each curation exercise. This time was divided by the number of proformae in each record to produce an estimate of "curation time per proforma". The data were analysed following the procedure in [15]. Two outliers were identified during the initial exploration of the data and excluded from subsequent analysis.^e The average time per proforma for each curation exercise using the remaining datapoints is shown in Figure 3A. A one-way ANOVA returned a relatively low probability (F(2,44) = 2.350, p=0.107) and was followed by planned pairwise comparisons between the conditions using the independent-samples two-tailed t-test.

Curation01 took approximately 3 minutes and 30 seconds longer than Curation02, which suggests that revising the tools increased the curator's efficiency. This difference is marginally significant (t(44)=2.151, p=0.037), providing preliminary evidence in favour of this hypothesis. Comparing Curation03 with the other conditions suggests that the tools do not impede the curator's performance. In fact, Curation01 took on average about 2 minutes longer than Curation03 (the main difference between them being the use of the revised ProformaEditor during Curation03). The planned comparison shows a trend towards improved curation efficiency with the later version of the tool (t(44) = 1.442, p=0.156), although it does not provide conclusive evidence in favour of this hypothesis. The main difference between Curation02 and Curation03 is viewing the paper exclusively in PaperBrowser in Curation02 (as opposed to no use of this tool at all in Curation03).^f Completing a proforma using PaperBrowser is on average more than one minute and thirty seconds faster.
Although the planned comparison shows that the difference is not significant (t(44)=1.1712, p=0.248), this result again indicates that the tool does not have a negative impact on curation. Additional analysis using a more fine-grained estimate of "curation time per completed field" (computed by dividing the total time per record

^e
The first outlier corresponds to the first record ever produced by the curator. This happened while a member of the development team was assisting her with the use of the tools and recording her comments (which arguably delayed the curation process significantly). The logfile for the second outlier, which was part of Curation03, included long periods during which the curator did not interact with ProformaEditor. The version of ProformaEditor was the same in both cases, but the curator was more familiar with it during Curation03.
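As a rough sketch of the planned pairwise comparisons, the snippet below computes an independent-samples t statistic with pooled variance over per-proforma times. The sample values are made up for illustration, and the exact test variant (pooled rather than Welch) is an assumption; only the general procedure mirrors the analysis in the text:

```python
# Illustrative sketch: pooled-variance two-sample t statistic, as used in
# independent-samples planned comparisons. Data below are hypothetical.
from statistics import mean, variance

def t_statistic(a, b):
    """Independent-samples t with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

curation01 = [640, 610, 700, 580]  # seconds per proforma (made up)
curation02 = [420, 450, 400, 430]  # seconds per proforma (made up)
print(round(t_statistic(curation01, curation02), 2))
```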
Figure 3. Results of pilot study on usability. (A) Time per proforma; (B) time per completed field.

              (A) Time per proforma         (B) Time per completed field
              Average            St. dev.   Average            St. dev.   Papers
Curation01    631.64s (10m 32s)  192.21s    132.90s (2m 13s)   33.50s     11
Curation02    424.21s (7m 04s)   157.04s    104.67s (1m 45s)   41.47s     9
Curation03    520.95s (8m 41s)   236.91s    123.20s (2m 03s)   52.35s     27
by the number of completed fields) showed the same trends (Figure 3B). However, the ANOVA suggested that the differences were not significant (F(2,44)=0.925, p=0.404), probably because this measure ignores the time spent on non-editing actions.

Overall, this preliminary study provides some evidence that the current versions of ProformaEditor and PaperBrowser are more helpful than the initial prototypes and do not impede curation. These results concur with the curator's informal feedback. They also meet our main aim at this stage, which was to integrate the tools within the existing curation workflow. Clearly, more detailed and better controlled studies are necessary to assess the potential usefulness of the tools, building on the encouraging trends revealed in this pilot. Devising these studies is part of our ongoing work, which aims to collect data from more than one curator. Similarly to the pilot, we will attempt to compare different versions of the tools, which will be developed to address the compiled shortlist of usability issues. We are also interested in measuring variables other than efficiency, such as accuracy and agreement between curators. In our other work, we are currently exploiting the curator's feedback for the active learning experiments. We also intend to analyse the data collected in the logstore in order to build associations between proforma fields and larger text spans, aiming to be able to automatically identify and highlight such passages in subsequent versions of PaperBrowser.

Acknowledgments

This work takes place within the BBSRC-funded FlySlip project (grant No 38688).
We are grateful to Florian Wolf and Chihiro Yamada for their insights and contributions in earlier stages of the project. PaperBrowser and ProformaEditor are implemented in Java and will be available through the project's webpage at: www.cl.cam.ac.uk/users/av308/Project_Index/index.html
References

1. A. M. Cohen and W. R. Hersh. A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1):57-71 (2005).
2. A. S. Yeh, L. Hirschman and A. A. Morgan. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 19(suppl. 1):i331-i339 (2003).
3. J. Preece, Y. Rogers and H. Sharp. Interaction design: beyond human-computer interaction. John Wiley and Sons (2002).
4. L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein. MedMiner: an internet text-mining tool for biomedical information with application to gene expression profiling. BioTechniques 27(6):1210-1217 (1999).
5. H. M. Mueller, E. E. Kenny and P. W. Sternberg. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2(11):e309 (2004).
6. D. A. Cohn, Z. Ghahramani and M. I. Jordan. Active learning with statistical models. In G. Tesauro, D. Touretzky and J. Alspector (eds), Advances in Neural Information Processing, vol. 7, 707-712 (1995).
7. H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan. GATE: a framework and graphical development environment for robust NLP tools and applications. Proceedings of ACL 2002, 168-175 (2002).
8. B. Hollingsworth, I. Lewin and D. Tidhar. Retrieving hierarchical text structure from typeset scientific articles: a prerequisite for e-Science text mining. Proceedings of the 4th UK e-Science All Hands Meeting, 267-273 (2005).
9. A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh and J. B. Colombe. Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics 37(6):396-410 (2004).
10. A. Vlachos and C. Gasperin. Bootstrapping and evaluating NER in the biomedical domain. Proceedings of BioNLP 2006, 138-145 (2006).
11. E. Briscoe, J. Carroll and R. Watson. The second release of the RASP system. Proceedings of ACL-COLING 2006, 77-80 (2006).
12. C. Gasperin. Semi-supervised anaphora resolution in biomedical texts. Proceedings of BioNLP 2006, 96-103 (2006).
13. A. Vlachos, C. Gasperin, I. Lewin and E. J. Briscoe. Bootstrapping the recognition and anaphoric linking of named entities in Drosophila articles. Proceedings of PSB 2006, 100-111 (2006).
14. C. Barclay, S. Boisen, C. Hyde and R. Weischedel. The Hookah information extraction system. Proceedings of Workshop on TIPSTER II, 79-82 (1996).
15. D. S. Moore and G. S. McCabe. Introduction to the practice of statistics, 713-747. Freeman and Co (1989).
Figure 4. (A) Automatically recognised gene-names highlighted in PaperBrowser. Navigating through the paper using: (B) PaperView and (C) EntitiesView. (D) Editing a record with ProformaEditor.
A STACKED GRAPHICAL MODEL FOR ASSOCIATING SUB-IMAGES WITH SUB-CAPTIONS
ZHENZHEN KOU, WILLIAM W. COHEN AND ROBERT F. MURPHY

Machine Learning Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
E-mail: [email protected], [email protected], [email protected]

There is extensive interest in mining data from full text. We have built a system called SLIF (for Subcellular Location Image Finder), which extracts information on one particular aspect of biology from a combination of text and images in journal articles. Associating the information from the text and image requires matching sub-figures with the sentences in the text. We introduce a stacked graphical model, a meta-learning scheme to augment a base learner by expanding features based on related instances, to match the labels of sub-figures with labels of sentences. The experimental results show a significant improvement in the matching accuracy of the stacked graphical model (81.3%) as compared with a relational dependency network (70.8%) or the current algorithm in SLIF (64.3%).
1. Introduction

The vast size of the biological literature and the knowledge contained therein makes it essential to organize and summarize pertinent scientific results. Biological literature mining has been increasingly studied to extract information from huge amounts of biological articles 1-3. Most of the existing IE systems are limited to extracting information only from text. Recently there has been great interest in mining from both text and image. Yu and Lee 4 designed BioEx, which analyses abstract sentences to retrieve the images in an article. Rafkind et al. 5 explored the classification of general bioscience images into generic categories based on features from both text (the image caption) and image. Shatkay et al. 6 described a method to obtain features from images to categorize biomedical documents.

We have built a system called SLIF 7,8 (for Subcellular Location Image Finder) that extracts information about protein subcellular locations from both text and images. SLIF analyzes figures in biological papers, which include both images and captions. In SLIF, a large corpus of articles is fully analyzed
Figure 1. A figure-caption pair reproduced from the biomedical literature. (The reproduced caption, Fig. 5 of the source article, describes double immunofluorescence confocal microscopy: methanol-permeabilized and fixed HeLa cells were incubated with affinity-purified rabbit anti-mrnp 41 antibodies (a) and with monoclonal anti-cPABP antibodies (b), and bound antibodies were visualized with fluorescently labeled secondary antibodies.)
and the results of the analysis steps are stored in an SQL database as traceable assertions. An interface to the database (http://slif.cbi.cmu.edu) has been designed such that images and text of interest can be retrieved and presented to users 7.

In a system mining both text and images, associating the information from the text and the image is very challenging, since usually there are multiple sub-figures in a figure and we must match sub-figures with the sentences in the text. In the initial version of SLIF, we extracted the labels for the sub-figures and sentences separately and matched them by finding equal-value pairs. This naive matching approach ignores much context information (i.e., the labels for sub-figures are usually a sequence of letters, and people assign labels in a particular order rather than randomly), and could only achieve a matching accuracy of 64.3%. To obtain a satisfactory matching accuracy the naive approach requires high-accuracy image analysis and text analysis to get the labels. However, extracting labels from images is non-trivial. Inferring the label sequences and improving image processing allowed us to increase the F1 for panel label extraction to 78% 9.

In this paper, we introduce a stacked graphical model to match the labels of sub-figures with labels of sentences. The stacked model can take advantage of the context information and achieves an 81.3% accuracy. In the following, we give a brief review of SLIF in Section 2. Section 3 describes the stacked model used for the matching. Section 4 summarizes the experimental results and Section 5 concludes the paper.

2. SLIF Overview

SLIF applies both image analysis and text interpretation to figures. Figure 1a is a typical figure that SLIF can analyse.

^a
This figure is reproduced from the article "mRNA binding protein mrnp 41 localizes to both nucleus and cytoplasm", by Doris Kraemer and Günter Blobel, Cell Biology Vol. 94, pp. 9119-9124, August 1997.
Figure 2. Overview of the image and text processing steps in SLIF.
Figure 2 shows an overview of the steps in the SLIF system, with references to the publications in which they are described in more detail.

Image processing includes several steps:

Decomposing images into panels. For images containing multiple panels, the individual panels are recovered from the image.

Identifying fluorescence microscope images. Panels are classified as to whether they are fluorescence microscope images, so that appropriate image processing steps can be performed.

Image preprocessing and feature computations. Firstly, the annotations such as labels, arrows and indicators of scale contained within the image are detected, analyzed, and then removed from the image. In this step, panel labels are recognized by Optical Character Recognition (OCR). Panel labels are textual labels which appear as annotations to images, for example, "a" and "b" printed in the panels in Figure 1. Recognizing panel labels is very challenging: even after careful image pre-processing and enhancement the F1 accuracy is only about 75%. The OCR results are used as candidate panel labels, and after filtering candidates an F1 accuracy of 78% is obtained 9. Secondly, the scale bar is extracted, and finally subcellular location features (SLFs) are produced and the localization pattern of each cell is determined.

Caption processing is done as follows.

Entity name extraction. In the current version of SLIF we use an extractor trained on conditional random fields 10 and an extractor trained on Dictionary-HMMs 11 to extract the protein name. The cell name is extracted using hand-coded rules. Image
pointer extraction. The linkage between the panels and the text of captions is usually based on textual labels which appear as annotations to the images (i.e., panel labels) and which are also interspersed with the caption text. We call these textual labels appearing in text image pointers; for example, "(a)" and "(b)" in the caption in Figure 1. In our analysis, image pointers are classified into four categories according to their linguistic function: bullet-style image pointers, NP-style image pointers, citation-style image pointers, and other 12. The image-pointer extraction and classification steps are done via a machine learning method 12.

Entity to image pointer alignment. The scope of an image pointer is the section of text (sub-caption) that should be associated with it. The scope is determined by the class assigned to an image pointer 12.

3. A Stacked Model to Map Panel Labels to Image Pointers

3.1. Stacked Graphical Models for Classification
Stacked graphical models are a meta-learning scheme for collective classification 13, in which a base learner is augmented by expanding one instance's features with predictions on other related instances. Stacked graphical models work well on predicting labels for relational data with graphical structures (Kou and Cohen, in preparation). The inference converges much faster than the traditional Gibbs sampling method, and it has been shown empirically that one iteration of stacking is able to achieve good performance on many tasks. The disadvantage of stacking is that it requires more training time to achieve faster testing inference. Figure 3 shows the inference and learning methods for stacked graphical models.

In a stacked graphical model, the relational template C finds the related instances. For instance x_i, C(x_i) retrieves the indices i_1, ..., i_L of the instances x_{i_1}, ..., x_{i_L} that are related to x_i. Given predictions y for a set of instances x, C(x_i, y) returns the predictions on the related instances, i.e., C(x_i, y) = (y_{i_1}, ..., y_{i_L}). The idea of stacking is to take advantage of the dependencies among instances, or the relevance between inter-related tasks. In our application in this paper, we conjecture that panel label extraction and image pointer extraction are inter-related, and design a stacked model that combines them.

3.2. A Stacked Model for Mapping
In the previous version of SLIF, we map panel labels to image pointers by finding the equal-value pair. Below we apply the idea of stacked graphical
Figure 3. Stacked Graphical Learning and Inference.

Parameters: a relational template C and a cross-validation parameter J.

Learning algorithm: given a training set D = {(x, y)} and a base learner A:
- Learn the local model, i.e., when k = 0: return f^0 = A(D^0). Note that D^0 = D, x^0 = x, y^0 = y.
- Learn the stacked models, for k = 1...K:
  (1) Construct cross-validated predictions y^{k-1} for x in D as follows:
      (a) Split D into J equal-sized disjoint subsets D_1 ... D_J.
      (b) For j = 1...J, let f_j^{k-1} = A(D^{k-1} - D_j^{k-1}).
      (c) For x in D_j, y^{k-1} = f_j^{k-1}(x^{k-1}).
  (2) Construct an extended dataset D^k = (x^k, y) by converting each instance x_i to x_i^k as follows: x_i^k = (x_i, C(x_i, y^{k-1})), where C(x_i, y^{k-1}) returns the predictions for the examples related to x_i, so that x_i^k = (x_i, y_{i_1}^{k-1}, ..., y_{i_L}^{k-1}).
  (3) Return f^k = A(D^k).

Inference algorithm: given x:
(1) y^0 = f^0(x). For k = 1...K:
(2) Carry out step (2) above to produce x^k.
(3) y^k = f^k(x^k).
Return y^K.
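The data flow of Figure 3 can be illustrated with a minimal, self-contained sketch for K = 1. Here the relational template C returns chain neighbours, and, to keep the example short, the stacked model f^1 is a hand-written rule over the expanded features rather than a model trained by the base learner A; this illustrates the expand-then-repredict structure, not a faithful implementation:

```python
# Illustrative sketch of one level of stacking (K = 1).

def C(i, n):
    """Relational template: indices of the neighbouring instances on a chain."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def base_predict(x):
    """Local model f^0: toy threshold rule on a single raw feature."""
    return 1 if x >= 0.5 else 0

def stacked_predict(xs):
    n = len(xs)
    y0 = [base_predict(x) for x in xs]          # y^0 = f^0(x)
    ys = []
    for i in range(n):
        neigh = [y0[j] for j in C(i, n)]        # expand x_i with C(x_i, y^0)
        # Stacked model f^1 (hand-written here, learned in the real scheme):
        # trust the neighbours' predictions when the local feature is borderline.
        if 0.4 < xs[i] < 0.6 and neigh:
            ys.append(1 if 2 * sum(neigh) >= len(neigh) else 0)
        else:
            ys.append(y0[i])
    return ys

print(stacked_predict([0.9, 0.45, 0.8, 0.1]))  # [1, 1, 1, 0]
```

Note how the borderline instance (0.45) is flipped to 1 because both of its neighbours were confidently predicted 1; this is exactly the kind of context information the matching task exploits.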
models to map the panel labels and image pointers. In SLIF the image pointer finding was done as follows. Most image pointers are parenthesized and relatively short. We thus hand-coded an extractor that finds all parenthesized expressions that are (a) less than 15 characters long and (b) do not contain a nested parenthesized expression, and that replaces X-Y constructs with the equivalent complete sequence (e.g., "B-D" is replaced with "B,C,D"). We call the image pointers extracted by this hand-coded approach candidate image pointers. The hand-coded extractor has high recall but only moderate precision. Using a classifier trained with machine learning approaches, we then classify the candidate image pointers as bullet-style, citation-style, NP-style, or other. Image pointers classified as "other" are discarded, which compensates for the relatively low precision of the hand-coded extractor.12
In SLIF the panel label extraction was done as follows. Image processing and OCR techniques are applied to find the labels printed within the panels: first, candidate text regions are computed via image processing, and then OCR is run on these candidate regions to produce candidate panel labels. This approach has relatively high precision but low recall. We call the panel labels recognized by image processing and OCR candidate panel labels. A strategy based on grid analysis (a procedure which estimates how many panels there are in a figure and how the panels are arranged) is applied to the candidate panel labels to improve accuracy.9
The match between panel labels and image pointers can be formulated as a classification problem. We construct a set of pairs ⟨o_i, p_j⟩ for all candidate panel labels o_i and candidate image pointers p_j from the same figure. That is, for a panel with l_i representing the real label, o_i representing the panel label recognized by OCR, and p_j's representing the image pointers in the same figure, we construct a set of pairs ⟨o_i, p_j⟩. We label the pair ⟨o_i, p_j⟩ as positive only if l_i = p_j, and otherwise as negative. For example, in Figure 1, the real label l_i for panel a is "a". If OCR recognizes o_i = "a" and the image pointers for the figure are "a" and "b", we construct two pairs, ⟨a, a⟩ labelled positive and ⟨a, b⟩ labelled negative. Note that the pair is labelled according to the real label and the image pointers. If OCR recognizes o_i incorrectly for panel a in Figure 1, for example o_i = "o", we have two pairs, ⟨o, a⟩ labelled positive and ⟨o, b⟩ labelled negative.
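As a rough re-creation (not the original SLIF code), the hand-coded candidate image pointer extractor described above might look like this; the 15-character limit, the no-nesting rule, and the X-Y expansion follow the text.

```python
# Sketch of the hand-coded candidate image pointer extractor: parenthesized
# spans under 15 characters with no nested parentheses, with "B-D" ranges
# expanded to the full sequence "B,C,D".
import re

def candidate_image_pointers(caption):
    candidates = []
    for match in re.finditer(r'\(([^()]{1,14})\)', caption):
        span = match.group(1)
        # Expand X-Y constructs like "B-D" into "B,C,D".
        expanded = re.sub(
            r'\b([A-Za-z])-([A-Za-z])\b',
            lambda m: ','.join(chr(c) for c in
                               range(ord(m.group(1)), ord(m.group(2)) + 1)),
            span)
        candidates.append(expanded)
    return candidates
```

For example, `candidate_image_pointers("(A) control cells; (B-D) treated cells")` yields `["A", "B,C,D"]`; both candidates would then go to the style classifier.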
We design features based on the o_i's and p_j's. The base feature set contains three binary features: one boolean value indicating whether o_i = p_j; one boolean value indicating whether o_{i_left} = p_j − 1 or o_{i_upper} = p_j − 1; and another boolean value indicating whether o_{i_right} = p_j + 1 or o_{i_down} = p_j + 1, where i_left is the index of the panel to the left of panel i in the same row, i_upper is the index of the panel above panel i in the same column, p_j + 1 is the letter following p_j, and p_j − 1 is the letter preceding p_j. This feature set takes advantage of context information by comparing o_{i_left} to p_j − 1, and so on. The second and third features capture the first-order dependency: if a neighboring panel (an adjacent panel in the same row or the same column) is recognized as the corresponding "adjacent" letter, there is
Figure 4. Second-order dependency (a 3x3 grid of panels labelled a through i in row-major order).
a higher chance that o_i is equal to p_j. In the inference step for the base learner in the stacked model, if a pair ⟨o_i, p_j⟩ is predicted as positive, we set the value of o_i to be p_j, since empirically the image pointer extraction has a higher accuracy than the panel label recognition. That is, the predicted value ô_i is p_j for a positive pair, and ô_i remains o_i for a negative pair. After obtaining the ô_i's, we recalculate the features by comparing the ô_i's and p_j's. We call this procedure of predicting ⟨o_i, p_j⟩, updating ô_i, and recalculating the features "stacking". We choose MaxEnt as the base learner to classify ⟨o_i, p_j⟩, and in our experiments we implement one iteration of stacking. Besides the basic features, we also include another feature that captures the "second-order context", i.e., the spatial dependency among all the "sibling" panels, even when they are not adjacent. In general the arrangement of labels can be complex: labels may appear outside panels, or several panels may share one label. However, in the majority of cases, panels are grouped into grids, each panel has its own label, and labels are assigned to panels in either column-major or row-major order. The "panels" shown in Figure 4 are typical of this case. For such cases, we analyze the locations of the panels in the figure and reconstruct this grid, i.e., the number of columns and rows, and the row and column position of each panel. We compute the second-order feature as follows: for a panel located at row r and column c with label o, if there is a panel located at row r′ and column c′ with label o′ (r′ ≠ r and c′ ≠ c) such that, under either row-major or column-major order, the label assigned to panel (r′, c′) is o′ given that the label for panel (r, c) is o, we assign 1 to the second-order feature. For example, in Figure 4, recognizing the panel label "a" at row 1, column 1 helps to recognize "e" at row 2, column 2 and "h" at row 3, column 2.
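The three base (first-order) features for one candidate pair ⟨o_i, p_j⟩ can be illustrated as follows; the `panels` structure mapping each panel to its OCR label and to its left/right/upper/lower neighbor indices is an assumption of this sketch, not a structure from the paper.

```python
# Illustrative computation of the three base features for a pair (o_i, p_j).
# `panels[i]` holds the OCR label of panel i and the indices of its
# neighboring panels (None when there is no such neighbor).
def prev_letter(c):
    return chr(ord(c) - 1)

def next_letter(c):
    return chr(ord(c) + 1)

def base_features(i, p_j, panels):
    o = lambda idx: panels[idx]['label'] if idx is not None else None
    p = panels[i]
    f1 = int(o(i) == p_j)
    # first-order context: does a left/upper neighbor carry the previous letter?
    f2 = int(o(p['left']) == prev_letter(p_j) or
             o(p['upper']) == prev_letter(p_j))
    # ... and does a right/lower neighbor carry the next letter?
    f3 = int(o(p['right']) == next_letter(p_j) or
             o(p['down']) == next_letter(p_j))
    return [f1, f2, f3]
```

With two panels "a" and "b" side by side, the pair ⟨a, a⟩ gets features [1, 0, 1]: the labels match and the right-hand neighbor carries the next letter.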
With the first-order and second-order features, the chance increases that a missing or mis-recognized label is matched to an image pointer.

4. Experiments

4.1. Dataset
To evaluate the stacked model for panel label and image pointer matching, we collected a dataset of 200 figures comprising 1070 sub-figures. This is a random subsample of a larger set of papers from the Proceedings of the National Academy of Sciences. Our current approach can only analyse labels contained within panels (internal labels), due to limitations of the image processing stage; therefore our dataset contains only figures with internal labels. Though the dataset does not cover all cases, panels with internal labels are the vast majority in our corpus. We hand-labeled all the image pointers in the captions and the label for each panel. The match between image pointers and panels was also assigned manually.

4.2. Baseline algorithms
The approaches for finding the candidate image pointers and panel labels were described in Section 3.2. In this paper, we take the hand-coded approach and the machine learning approach12 as the baseline algorithms for image pointer extraction. The OCR-based approach and the grid analysis approach9 are the baseline algorithms for panel label extraction. We also compare the stacked model to relational dependency networks (RDNs).14 RDNs are undirected graphical models for relational data. Given a set of entities and the links between them, an RDN defines a full joint probability distribution over the attributes of the entities. Attributes of an object can depend probabilistically on other attributes of the same object, as well as on attributes of objects in its relational neighborhood. We build an RDN model as shown in Figure 5. In the RDN model there are two types of entities, image pointer and panel label. For an image pointer, the attribute p_j is the value of the candidate image pointer; o_i is the candidate panel label. p_true and o_true are the true values to be predicted. The linkages L_pre and L_next capture the dependency among the sequence of image pointers: L_pre points to the previous letter and L_next points to the successive letter. P_left, P_right, P_upper, and P_down point to the panels in the left, right, upper, and lower directions respectively. The RDN model takes the candidate image pointers
Figure 5. An RDN model.
and panel labels as input and predicts their true values. The match between the panel label and the image pointer is done via finding the equal-value pair.
4.3. Experimental Results
We used 5-fold cross-validation to evaluate the performance of the stacked graphical model for image pointer to panel label matching. The evaluation was reported in two ways: the performance on the matching, and the performance on image pointer and panel label extraction. Determining the matching is the "real" problem, i.e., what we really care about are the matches, not getting the labels correct. Evaluation of the image pointer and panel label extraction is a secondary check on the learning technique. Table 1 shows the accuracy of image pointer to panel label matching. For the baseline algorithms, the match was done by finding the equal-value pair. Baseline algorithm 1 compares the candidate image pointers to the candidate panel labels. Baseline algorithm 2 compares the image pointers extracted by the learning approach to the panel labels obtained after grid analysis. The stacked graphical model takes the same input as Baseline algorithm 2, i.e., the candidate image pointers extracted by the hand-coded algorithm and the candidate panel labels obtained by OCR. We observe that the stacked graphical model improves the matching accuracy. Both the first-order and second-order dependencies help to achieve better performance. The RDN also performed better than the two baseline algorithms. Our stacked model outperforms the RDN, because in stacking the dependency is captured and indicated "strongly" by the way we design the features.
Table 1. Accuracy of image pointer to panel label matching.

  Method                                            Accuracy
  Baseline algorithm 1                              48.7%
  Baseline algorithm 2 (current algorithm in SLIF)  64.3%
  RDN                                               70.8%
  Stacked model (first-order)                       75.1%
  Stacked model (second-order)                      81.3%
Table 2. Performance (F1) on image pointer extraction and panel label extraction.

  Method                                       Image pointer extraction  Panel label extraction
  Baseline algorithm 1                         60.9%                     52.3%
  Baseline algorithm 2                         89.7%                     65.7%
  RDN                                          85.2%                     73.6%
  Stacked model with first-order dependency    -                         77.8%
  Stacked model with second-order dependency   -                         83.1%
That is, the stacked model can treat the matching as a binary classification of ⟨o_i, p_j⟩ pairs and capture the first-order and second-order dependencies directly through our feature definitions. In RDNs, by contrast, the data must be formulated as types of entities described by attributes, and the dependency is modeled with links among attributes. Though RDNs can model the dependency among the data, the matching problem is decomposed into a multi-class classification problem plus a matching procedure. In addition, the second-order dependency cannot be modeled explicitly in the RDN. Table 2 shows the performance on the sub-tasks of image pointer extraction and panel label extraction. The results are reported as F1 measures. Since during stacking we update the value of o_i, setting it to p_j when a match is found, stacking also improves the accuracy of panel label extraction. The accuracy of image pointer extraction remains the same, since we do not update the value of p_j. Baseline algorithm 1 is the approach of finding candidate image pointers or candidate panel labels. Baseline algorithm 2 is the learning approach for image pointer extraction, and the grid analysis strategy for panel label extraction. The inputs to the stacked graphical model are the candidate image pointers and candidate panel labels. We observe that by updating the value of o_i,
Figure 6. Cases where the current algorithms fail: (a) a hard case for OCR; (b) a hard case for the stacked model.
we can achieve better panel label extraction performance, i.e., provide more "accurate" features for stacking. The RDN also improves performance, but the best performance is obtained via stacking.

4.4. Error Analysis
As mentioned in Section 2, OCR on panel labels is very challenging, and baseline algorithm 1 suffers from low recall. Most errors occur when not enough o_i's are recognized by the baseline algorithm to provide information about the first-order and second-order dependencies. Figure 6(a) shows a case where the current OCR fails. Figure 6(b) shows a case where there is not enough contextual information to determine the label for the upper-left panel.

5. Conclusions

In this paper we briefly reviewed the SLIF system, which extracts information on one particular aspect of biology from a combination of text and images in journal articles. In such a system, associating the information from the text and images requires matching sub-figures in a figure with sentences in the text. We used a stacked graphical model to match the labels of sub-figures with the labels in sentences. The experimental results show that the stacked graphical model can take advantage of context information and achieves a significant improvement in matching accuracy compared with a relational dependency network or the current algorithm in SLIF. In addition to accomplishing the matching at a higher accuracy, the stacked model also helps to improve the performance of finding labels for sub-figures.
The idea of stacking is to take advantage of context information, or the relevance between inter-related tasks. Future work will focus on applying stacked models to more tasks in SLIF, such as protein name extraction.

Acknowledgments

The work was supported by research grant 017396 from the Commonwealth of Pennsylvania Department of Health, NIH grants K25 DA017357 and R01 GM078622, and grants from the Information Processing Technology Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA).

References

1. B. de Bruijn and J. Martin, Getting to the (c)ore of knowledge: mining biomedical literature. Int. J. Med. Inf., 67(2002), 7-18.
2. M. Krallinger and A. Valencia, Text-mining and information-retrieval services for molecular biology. Genome Biology 2005, 6:224.
3. L. Hunter and K. B. Cohen, Biomedical language processing: what's beyond PubMed? Molecular Cell 21(2006), 589-594.
4. H. Yu and M. Lee, Accessing Bioscience Images from Abstract Sentences. Bioinformatics 2006, 22(14), 547-556.
5. B. Rafkind, M. Lee, S. F. Chang, and H. Yu, Exploring text and image features to classify images in bioscience literature. Proceedings of BioNLP 2006, 73-80.
6. H. Shatkay, N. Chen, and D. Blostein, Integrating Image Data into Biomedical Text Categorization. Bioinformatics 2006, 22(14), 446-453.
7. R. F. Murphy, Z. Kou, J. Hua, M. Joffe, and W. W. Cohen, Extracting and Structuring Subcellular Location Information from On-line Journal Articles: The Subcellular Location Image Finder. Proceedings of KSCE 2004, 109-114.
8. R. F. Murphy, M. Velliste, J. Yao, and G. Porreca, Searching Online Journals for Fluorescence Microscope Images Depicting Protein Subcellular Locations. Proceedings of BIBE 2001, 119-128.
9. Z. Kou, W. W. Cohen, and R. F. Murphy, Extracting Information from Text and Images for Location Proteomics. Proceedings of BIOKDD 2003, 2-9.
10. M. Ryan and P. Fernando, Identifying Gene and Protein Mentions in Text Using Conditional Random Fields. BMC Bioinformatics, 6(Suppl 1):S6, May 2005.
11. Z. Kou, W. W. Cohen, and R. F. Murphy, High-Recall Protein Entity Recognition Using a Dictionary. Bioinformatics 2005, 21(Suppl 1), 266-273.
12. W. W. Cohen, R. Wang, and R. F. Murphy, Understanding Captions in Biomedical Publications. Proceedings of KDD 2003, 499-504.
13. B. Taskar, P. Abbeel, and D. Koller, Discriminative probabilistic models for relational data. Proceedings of UAI 2002, 485-492.
14. D. Jensen and J. Neville, Dependency Networks for Relational Data. Proceedings of ICDM 2004, 170-177.
GeneRIF QUALITY ASSURANCE AS SUMMARY REVISION
ZHIYONG LU, K. BRETONNEL COHEN, AND LAWRENCE HUNTER
Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO, 80045, USA
E-mail: {Zhiyong.Lu, Kevin.Cohen, Larry.Hunter}@uchsc.edu

Like the primary scientific literature, GeneRIFs exhibit both growth and obsolescence. NLM's control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolete data: GeneRIFs are removed from the database when they are found to be of low quality. However, the rapid and extensive growth of Entrez Gene makes manual location of low-quality GeneRIFs problematic. This paper presents a system that takes advantage of the summary-like quality of GeneRIFs to detect low-quality GeneRIFs via a summary revision approach, achieving precision of 89% and recall of 77%. Aspects of the system have been adopted by NLM as a quality assurance mechanism.
1. Introduction

In April 2002, the National Library of Medicine (NLM) began an initiative to link published data to Entrez Gene entries via Gene References Into Function, or GeneRIFs. GeneRIFs consist of an Entrez Gene ID, a short text (under 255 characters), and the PubMed identifier (PMID) of the publication that provides evidence for the assertion in that text. The extent of NLM's commitment to this effort can be seen in the growth of the number of GeneRIFs currently found in Entrez Gene—there are 157,280 GeneRIFs assigned to 29,297 distinct genes (Entrez Gene entries) in 571 species as of June 2006. As we will demonstrate below, the need has arisen for a quality control mechanism for this important resource. GeneRIFs can be viewed as a type of low-compression, single-document, extractive, informative, topic-focussed summary [15]. This suggests the hypothesis that methods for improving the quality of summaries can be useful for improving the quality of GeneRIFs. In this work, we evaluate an approach to GeneRIF quality assurance based on a revision model, using three distinct methods. In one, we examined the recall of the system, using the set of all GeneRIFs that were withdrawn by the NLM indexers over a fixed period of time as a gold standard. In another, we performed a coarse assessment of the precision of the system by submitting
system outputs to NLM. The third involved a fine-grained evaluation of precision by manual judging of 105 system outputs.

1.1. A fault model for GeneRIFs

Binder (1999) describes the fault model—an explicit hypothesis about potential sources of errors in a system [3]. Viewing GeneRIFs as summaries suggests a set of related potential sources of errors. This set includes all sources of error associated with extractive summarization (discussed in detail in [16]). It also includes deviations from the NLM's guidelines for GeneRIF production—both explicit (such as definitions of scope and intended content) and tacit (such as the presumed requirement that they not contain spelling errors). Since the inception of the GeneRIF initiative, it has been clear that a quality control mechanism for GeneRIFs would be needed. One mechanism for implementing quality control has been the submission of individual suggestions for corrections or updates via a form on the Entrez Gene web site. As the size of the set of extant annotations has grown—today there are over 150,000 GeneRIFs—it has become clear that high-throughput, semi-automatable mechanisms will be needed as well: over 300 GeneRIFs were withdrawn by NLM indexers just in the six months from June to December 2005, and data that we present below indicates that as many as 2,923 GeneRIFs currently in the collection are substandard. GeneRIFs can be unsatisfactory for a variety of reasons:
• Being associated with a discontinued Entrez Gene entry
• Containing errors, whether minor—of spelling or punctuation—or major, i.e. with respect to content
• Being based only on computational data—the NLM indexing protocol dictates that GeneRIFs based solely on computational analyses are not in scope [7]
• Being redundant
• Not being informative—GeneRIFs should not merely indicate what a publication is about, but rather should communicate actual information
• Not being about gene function

This paper describes a system for detecting GeneRIFs with those characteristics. We begin with a corpus-based study of GeneRIFs for which we have third-party confirmation that they were substandard, based on their having been withdrawn by the NLM indexers. We then propose a variety of methods for detecting substandard GeneRIFs, and describe the results of an intrinsic evaluation of the methods against a gold standard, an internal evaluation by the system builders,
and an external evaluation by the NLM staff. In this work, we evaluate an approach to GeneRIF quality assurance based on a summary revision model. In summarization, revision is the process of changing a previously produced summary. [16] discusses several aspects of revision. As that author points out (citing [5]), human summarizers perform a considerable amount of revision, addressing issues of semantic content (e.g., replacing pronouns with their antecedents) and of form (e.g., repairing punctuation). Revision is also an important component of automatic summarization systems, and in particular of systems that produce extractive summaries, of which GeneRIFs are a clear example. (Extractive summaries are produced by "cutting-and-pasting" text from the original, and it has been repeatedly observed that most GeneRIFs are direct extracts from the title or abstract of a paper [2,9,12,15].) This suggests using a "revision system" to detect GeneRIFs that should be withdrawn.
2. Related Work

GeneRIFs were first characterized and analyzed in [17], which reported the number of GeneRIFs produced and the species covered, based on the LocusLink revision of February 13, 2003, and introduced the prototype GeneRIF Automated Alerts System (GRAAS) for alerting researchers to literature on gene products. Summarization in general has attracted considerable attention from the biomedical language processing community. Most of this work has focussed specifically on medical text—see [1] for a comprehensive review. More recently, computational biologists have begun to develop summarization systems targeting the genomics and molecular biology domains [14,15]. GeneRIFs in particular have attracted considerable attention in the biomedical natural language processing community. The secondary task of the TREC Genomics Track in 2003 was to reproduce GeneRIFs from MEDLINE records [9]; 24 groups participated in this shared task. More recently, [15] presented a system that can automatically suggest a sentence from a PubMed/MEDLINE abstract as a candidate GeneRIF by exploiting an Entrez Gene entry's Gene Ontology annotations, along with location features and cue words. The system can significantly increase the number of GeneRIF annotations in Entrez Gene, and it produces qualitatively more useful GeneRIFs than previous methods. In molecular biology, GeneRIFs have recently been incorporated into the MILANO microarray data analysis tool. The system builders evaluated MILANO on its ability to analyze a large list of genes that were affected by overexpression of p53, and found that a number of benefits accrued specifically from the system's use of GeneRIFs rather than PubMed as its literature source, including a reduction in the number of irrelevant
Table 1. GeneRIF statistics from 2000 to 2006. The second row shows the annual increase in new GeneRIFs. The third row shows the number of new species for the new GeneRIFs. The fourth row is the number of genes that gained GeneRIF assignments in the year listed in the first row. Note that although the gene indexing project was officially started by the NLM in 2002, the first set of GeneRIFs was created in 2000.

  Year           2000  2001  2002    2003    2004    2005    2006^a  Sum
  New GeneRIFs   47    617   15,960  37,366  35,887  45,875  21,628  157,280
  New Species    3     1     2       3       130     341     91      571
  New Genes      34    529   6,061   6,832   5,113   7,769   2,959   29,297
results and a dramatic reduction in search time [19]. The amount of attention that GeneRIFs are attracting from such diverse scientific communities, including not only bioscientists but natural language processing specialists as well, underscores the importance of ensuring the quality of the GeneRIFs stored in Entrez Gene.

3. A corpus of withdrawn GeneRIFs

The remarkable increase in the total number of GeneRIFs each year (shown in Table 1) comes despite the fact that some GeneRIFs have been removed internally by the NLM. We compared the GeneRIF collection of June 2005 against that of December 2005 and found that a total of 319 GeneRIFs were withdrawn during that period. These withdrawn GeneRIFs are a valuable source of data for understanding the NLM's model of what makes a GeneRIF bad. Our analyses are based on the GeneRIF files downloaded from the NCBI ftp site^b at three times over the course of a one-year period (June 2005, December 2005, and June 2006). The data and results discussed in this paper are available at a supplementary website^c.
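The snapshot comparison just described amounts to a set difference between two downloads of the GeneRIF file; keying each GeneRIF on a (GeneID, PMID, text) tuple is our assumption about what identifies a GeneRIF, not NLM's definition.

```python
# Recover withdrawn GeneRIFs by diffing two dated snapshots of the GeneRIF
# file. Rows present in the earlier snapshot but absent from the later one
# are the candidates for having been withdrawn.
def withdrawn_generifs(earlier_rows, later_rows):
    """Each row is a (gene_id, pmid, text) tuple parsed from the file."""
    return sorted(set(earlier_rows) - set(later_rows))
```

Applied to the June 2005 and December 2005 snapshots, this kind of diff yields the 319 withdrawn GeneRIFs analyzed in Section 3.1.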
3.1. Characteristics of the withdrawn GeneRIFs

We examined these withdrawn GeneRIFs and determined that four reasons accounted for the withdrawal of most of them (see Figure 1).
1. Attachment to a temporary identifier: GeneRIFs can only be attached to existing Entrez Gene entries. Existing Entrez Gene entries have unique identifiers. New entries that are not yet integrated into the database are assigned a temporary identifier (the string NEWENTRY), and all annotations that are associated with them are provisional, including GeneRIFs. GeneRIFs associated with these temporary IDs are often withdrawn. Also, when the temporary identifier becomes
^a From January 2006 to June 2006.
^b ftp://ftp.ncbi.nlm.nih.gov/gene
^c http://compbio.uchsc.edu/Hunter_lab/Zhiyong/psb2007
Figure 1. Distribution of reasons for GeneRIF withdrawal from June to December 2005: attached to NEWENTRY (39%), computational methods (37%), grammar, i.e. misspellings and punctuation (14%), miscellaneous corrections (6%), and unknown (4%).
obsolete, the GeneRIFs that were formerly attached to it are removed (and transferred to the new ID). 39% (123/319) of the withdrawn GeneRIFs were removed via one of these mechanisms.
2. Based solely on computational analyses: The NLM indexing protocol dictates that GeneRIFs based solely on computational analyses are not in scope. 37% (117/319) of the withdrawn GeneRIFs were removed because they came from articles whose results were based purely on computational methods (e.g., prediction techniques) rather than traditional laboratory experiments.
3. Typographic and spelling errors: Typographic errors are not uncommon in the withdrawn GeneRIFs. They include misspellings and extraneous punctuation. 14% (46/319) of the withdrawn GeneRIFs contained errors of this type (41 misspellings and 5 punctuation errors).
4. Miscellaneous errors: 6% (20/319) of the withdrawn GeneRIFs were removed for other reasons. Some included the authors' names at the end, e.g., Cloning and expression of ZAK, a mixed lineage kinase-like protein containing a leucine-zipper and a sterile-alpha motif. Liu TC, etc. Others were updated by adding new gene names or modifying existing ones. For example, the NLM replaced POPC with POMC in Mesothelioma cell were found to express mRNA for [POPC]... for the gene POMC (GeneID: 5443).
5. Unknown reasons: We were unable to identify the cause of withdrawal for the remaining 4% (13/319) of the withdrawn GeneRIFs.
These findings suggest that it is possible to develop automated methods for detecting substandard GeneRIFs.
4. System and Method

We developed a system containing seven modules, each of which addresses either the error categories described in Section 3.1 or the content-based problems described in Section 1.1 (e.g., redundancy, or not being about gene function).
Table 2. A total of 2,923 suspicious GeneRIFs found in the June 2006 data. See Sections 4.5-4.7 for the explanations of categories 5-7.

  No.  Category               GeneRIFs  GeneRIF example
  1.   Discontinued           202       GeneID 6841: SVS1 seems to be found only in rodents and does not exist in humans
  2.   Misspellings           1,754     GeneID 64919: CTIP2 mediates transcriptional repression with SIRT1 in mammmalian cells
  3.   Punctuation            505       GeneID 7124: ). TNF-alpha promoter polymorphisms are associated with severe, but not less severe, silicosis in this population.
  4.   Computational results  19        GeneID 313129: characterization of rat Ankrd6 gene in silico; PMID 15657854: Identification and characterization of rat Ankrd6 gene in silico
  5.   Similar GeneRIFs       209       GeneID 3937: two GeneRIFs for the same gene differ in the gene name in the parenthesis; Shb links SLP-76 and Vav with the CD3 complex in Jurkat T cells (SLP-76)
  6.   One-to-many            67        A single GeneRIF text "identification, cloning and expression" is linked to two GeneIDs (217214 and 1484476) and two PMIDs (12049647, 15490124)
  7.   Length constraint      167       GeneID 3952: review; GeneID 135: molecular model; GeneID 81657: protein subunit function
4.1. Finding discontinued GeneRIFs

Discontinued GeneRIFs are detected by examining the gene history file from the NCBI's ftp site, which includes information about GeneIDs that are no longer current, and then searching for GeneRIFs that are still associated with the discontinued GeneIDs.

4.2. Finding GeneRIFs with spelling errors

Spelling error detection has been extensively studied for general English (see [13]), as well as in biomedical text (e.g., [20]). It is especially challenging for applications like this one, since gene names have notoriously low coverage in many publicly available resources and exhibit considerable variability, both in text [10] and in databases [4,6]. In the work reported here, we utilized the Google spell-checking API^d. Since Google allows ordinary users only 1,000 automated queries a day, it was not practical to use it to check all of the 4 million words in the current set of GeneRIFs. To reduce the size of the input set for the spell-checker, we used it only to check tokens that did not contain upper-case letters or punctuation (on the assumption that tokens containing them are likely to be gene names or domain-specific terms) and that occurred five or fewer times in the current set of GeneRIFs (on the assumption that spelling errors are likely to be rare). (See Table 3 for the actual distributions of non-word spelling errors across unigram frequencies in the full June 2006 collection of GeneRIFs, which supports this assumption. We manually examined a small sample of these to ensure that they were actual errors.)

^d http://www.google.com/apis/

Table 3. Distribution of non-word spelling errors across unigram counts.

  Word frequency   1      2    3   4   5
  Spelling errors  1,348  268  84  34  20

4.3. Finding GeneRIFs with punctuation errors

Examination of the 319 withdrawn GeneRIFs showed that punctuation errors most often appeared at the left and right edges of GeneRIFs, e.g. the extra parenthesis and period in ). TNF-alpha promoter polymorphisms are associated with severe, but not less severe, silicosis in this population. (GeneID:7124), or the terminal comma in Heart graft rejection biopsies have elevated FLIP mRNA expression levels, (GeneID:8837). We used regular expressions (listed on the supplementary web site) to detect punctuation errors.

4.4. Finding GeneRIFs based solely on computational methods

Articles describing work that is based solely on computational methods commonly use words or phrases such as in silico or bioinformatics in their titles and/or abstracts. We searched explicitly for GeneRIFs based solely on computational methods by searching for those two keywords within the GeneRIFs themselves, as well as in the titles of the corresponding papers. GeneRIFs based solely on computational methods were incidentally also sometimes uncovered by the "one-to-many" heuristic (described below).

4.5. Finding similar GeneRIFs

We used two methods to discover GeneRIFs that were similar to other GeneRIFs associated with the same gene. The intuitions behind this are that similar GeneRIFs may be redundant, and that similar GeneRIFs may not be informative. The two methods involved finding GeneRIFs that are substrings of other GeneRIFs, and calculating Dice coefficients.

4.5.1. Finding substrings

We found GeneRIFs that are proper substrings of other GeneRIFs using Oracle.
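Plausible sketches of three of the checks above: the spell-check candidate filter, the edge-punctuation test, and the substring detection (done with Oracle in the paper). The authors' exact regular expressions are on their supplementary site, so the patterns here are guesses.

```python
# Hedged sketches of three quality-check heuristics described in the text.
import re
from collections import Counter

def spellcheck_candidates(generif_texts, max_freq=5):
    """Tokens worth sending to an external spell-checker: lower-case-only
    words (unlikely to be gene symbols) occurring at most max_freq times."""
    counts = Counter(tok for text in generif_texts for tok in text.split())
    return sorted(tok for tok, n in counts.items()
                  if n <= max_freq and re.fullmatch(r'[a-z]+', tok))

# stray punctuation at either edge of a GeneRIF (a guessed pattern)
EDGE_PUNCT = re.compile(r'^[).,;:\]]|[,;(\[]$')

def has_edge_punctuation_error(text):
    return bool(EDGE_PUNCT.search(text.strip()))

def substring_generifs(generifs):
    """GeneRIFs that are proper substrings of another GeneRIF for the same
    gene -- candidates for the redundancy check."""
    return {a for a in generifs
            for b in generifs if a != b and a in b}
```

The filter sends only rare, all-lower-case tokens to the external spell-checker, which keeps the query volume within the 1,000-per-day limit mentioned above.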
4.5.2. Calculating Dice coefficients

We calculated Dice coefficients using the usual formula, Dice(A, B) = 2|A ∩ B| / (|A| + |B|) ([11]:202), and set our threshold for similarity at ≥ 0.8.

4.6. Detecting one-to-many mappings

We used a simple hash table to detect one-to-many mappings of GeneRIF texts to publications (see category 6 in Table 2). We anticipated that this would address the detection of GeneRIF texts that were not informative. (It turned out to find more serious errors as well—see the Discussion section.)

4.7. Length constraints

We tokenized all GeneRIFs on whitespace and noted all GeneRIFs that were three or fewer tokens in length. The intuition here is that very short GeneRIFs are more likely to be indicative summaries, which give the reader some indication of whether or not they might be interested in reading the corresponding document, but are not actually informative [16]—for example, the single-word text Review—and therefore are out of scope, per the NLM guidelines.

5. Results

5.1. Evaluating recall against the set of withdrawn GeneRIFs

To test our system, we first applied it to the withdrawn GeneRIFs described in Section 3. GeneRIFs that are associated with temporary IDs are still in the curation process, so we did not attempt to deal with them, and they were excluded from the recall evaluation. To ensure a stringent evaluation with the remaining 196 withdrawn GeneRIFs, we included the ones in the miscellaneous and unknown categories. The system identified 151/196 of the withdrawn GeneRIFs, for a recall of 77%, as shown in Table 4. The system successfully identified 115/117 of the GeneRIFs that were based solely on computational results. It missed two because we limited our algorithm to searching only GeneRIFs and the corresponding titles, but the evidence for the computational status of those two is actually located in their abstracts. For the typographic error category, the system correctly identified 33/41 spelling errors and 3/5 punctuation errors.
It missed several spelling errors because we did not check words containing uppercase letters. For example, it missed the misspellings Muttant (Mutant), MMP-lo (MMP-10), and Frame-schift (Frame-shift). It missed punctuation errors that were not at the edges of the GeneRIF, e.g. the missing space after the semicolon in REVIEW:Association of expression ... and the missing space after the comma in ...lymphocytes,suggesting a role for trkB...

Table 4. Recall on the set of withdrawn GeneRIFs. Only the 196 non-temporary GeneRIFs were included in this experiment. Although we did not attempt to detect GeneRIFs that were withdrawn for miscellaneous or unknown reasons, we included them in the recall calculation.

Category                 Total   True Positive   False Negative   Recall
Computational methods      117             115                2      98%
Misspellings                41              33                8      80%
Punctuation                  5               3                2      60%
Miscellaneous               20               0               20       0%
Unknown                     13               0               13       0%
Sum                        196             151               45      77%
5.2. Third-party evaluation of precision

The preceding experiment allowed us to evaluate the system's recall, but provided no assessment of precision. To do this, we applied the system to the entire June 2006 set of GeneRIFs. The system identified 2,923 of the 157,280 GeneRIFs in that data set as being bad. Table 2 shows the distribution of the suspicious GeneRIFs across the seven error categories. We then sent a sample of those GeneRIFs to NLM, along with an explanation of how the sample had been generated, and a request that they be manually evaluated. Rather than evaluate the individual submissions, NLM responded by internally adopting the error categories that we suggested and implementing a number of aspects of our system into their own quality control process, as well as using some of our specific examples to train the indexing staff regarding what is "in scope" for GeneRIFs (Donna Maglott, personal communication).
5.3. In-house evaluation of precision

We constructed a stratified sample of system outputs by selecting the first fifteen unique outputs from each category. Two authors then independently judged whether each output GeneRIF should, in fact, be revised. Our inter-judge agreement was 100%, suggesting that the error categories are consistently applicable. We applied the most stringent possible scoring by counting any GeneRIF that either judge thought was incorrectly rejected by the system as a false positive. Table 5 gives the precision scores for each category.
Table 5. Precision on the stratified sample. For each error category, a random list of 15 GeneRIFs was independently examined by the two judges.

No.  Category                True Positive   False Positive   Precision
1.   Discontinued                       15                0        100%
2.   Misspellings                       15                0        100%
3.   Punctuation                        13                2       86.7%
4.   Computational methods              15                0        100%
5.   Similar GeneRIFs                   15                0        100%
6.   One-to-many                        15                0        100%
7.   Length constraint                   5               10       33.3%
8.   Overall                            93               12       88.6%
6. Discussion and Conclusion

The kinds of revisions carried out by human summarizers cover a wide range of levels of linguistic depth, from correcting typographic and spelling errors ([16]:37, citing [5]) to addressing issues of coherence requiring sophisticated awareness of discourse structure, syntactic structure, and anaphora and ellipsis ([16]:78-81, citing [18]). Automatic summary revision systems that are far more linguistically ambitious than the methods that we describe here have certainly been built; the various methods and heuristics that are described in this paper may seem simplistic, and even trivial. However, a number of the GeneRIFs that the system discovered were erroneous in ways that were far more serious than might be suspected from the nature of the heuristic that uncovered them. For example, of the fifteen outputs in the stratified sample that were suggested by the one-to-many text-to-PMID measure (category 6 in Table 2), six turned out to be cases where the GeneRIF text did not reflect the contents of the article at all. The articles in question were relevant to the Entrez Gene entry itself, but the GeneRIF text corresponded to only one of the two articles' contents, presumably due to a cut-and-paste error on the part of the indexer (specifically, pasting the same text string twice). Similarly, as trivial as the "extra punctuation" measure might seem, in one of the fifteen cases the extra punctuation reflected a truncated gene symbol (sir-2.1 became -2.1). This is a case of erroneous content, and not of an inconsequential typographic error. The word length constraint, simple as it is, uncovered a GeneRIF that consisted entirely of the URL of a web site offering Hmong language lessons—perhaps not as dangerous as an incorrect characterization of the contents of a PubMed-indexed paper, but quite possibly a symptom of an as-yet-unexploited potential for abuse of the Entrez Gene resource. The precision of the length constraint was quite low.
Preliminary error analysis suggests that it could be increased substantially by applying simple language models to differentiate GeneRIFs that are perfectly good indicative summaries but poor informative summaries, such as REVIEW or 3D model (which were judged as true positives by the judges), from GeneRIFs that simply happen to be brief but are still informative, such as regulates cell cycle or Interacts with SOCS-1 (both of which were judged as false positives by the judges). Our assessment of the current set of GeneRIFs suggests that about 2,900 GeneRIFs are in need of retraction or revision. GeneRIFs exhibit two of the four characteristics of the primary scientific literature described in [8]: growth and obsolescence. (They directly address the problem of fragmentation, or spreading of information across many journals and articles, by aggregating data around a single Entrez Gene entry; linkage is the only characteristic of the primary literature that they do not exhibit.) Happily, NLM control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolescence: GeneRIFs actually are removed from circulation when found to be of low quality. We propose here a data-driven model of GeneRIF errors, and describe several techniques, modelled as automation of a variety of tasks performed by human summarizers as part of the summary revision process, for finding erroneous GeneRIFs. Though we do not claim that it advances the boundaries of summarization research in any major way, it is notable that even these simple summary revision techniques are robust enough that they are now being employed by NLM: versions of the punctuation, "similar GeneRIF," and length constraint (specifically, single words) heuristics have been added to the indexing workflow. Previous work on GeneRIFs has focussed on quantity—this paper is a step towards assessing, and improving, GeneRIF quality. NLM has implemented some of the aspects of our system, and has already corrected a number of the examples of substandard GeneRIFs that are cited here.

7. Acknowledgments

This work was supported by NIH grant R01-LM008111 (LH). We thank Donna Maglott and Alan R. Aronson for their discussions of, comments on, and support for this work, and the individual NLM indexers who responded to our change suggestions and emails. Lynne Fox provided helpful criticism. We also thank Anna Lindemann for proofreading the manuscript.

References

1. S. Afantenos, V. Karkaletsis, and P. Stamatopoulos. Summarization from medical documents: a survey. Artificial Intelligence in Medicine, 33(2):157-177, Feb 2005.
2. G. Bhalotia, P. I. Nakov, A. S. Schwartz and M. A. Hearst. Biotext report for the TREC 2003 genomics track. In Proceedings of The Twelfth Text REtrieval Conference, page 612, 2003.
3. R. V. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison-Wesley Professional, 1999.
4. K. B. Cohen, A. E. Dolbey, G. K. Acquaah-Mensah, and L. Hunter. Contrast and variability in gene names. In Proceedings of the ACL Workshop on Natural Language Processing in the Biomedical Domain, pages 14-20. Association for Computational Linguistics, 2002.
5. E. T. Cremmins. The Art of Abstracting, 2nd edition. Information Resources Press, 1996.
6. H. Fang, K. Murphy, Y. Jin, J. S. Kim, and P. S. White. Human gene name normalization using text matching with automatically extracted synonym dictionaries. In Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology, pages 41-48. Association for Computational Linguistics, 2006.
7. GeneRIF: http://www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html
8. W. Hersh. Information Retrieval: a Health and Biomedical Perspective, 2nd edition. Springer-Verlag, 2006.
9. W. Hersh and R. T. Bhupatiraju. TREC genomics track overview. In Proceedings of The Twelfth Text REtrieval Conference, page 14, 2003.
10. L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. Overview of BioCreative Task 1B: normalized gene lists. BMC Bioinformatics, 6(Suppl. 1):S11, 2005.
11. P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co., 2002.
12. R. Jelier, M. Schuemie, C. van der Eijk, M. Weeber, E. van Mulligen and B. Schijvenaars. Searching for GeneRIFs: concept-based query expansion and Bayes classification. In Proceedings of The Twelfth Text REtrieval Conference, page 225, 2003.
13. D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, January 2000.
14. X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai and B. Schatz. Automatically generating gene summaries from biomedical literature. In Proceedings of the Pacific Symposium on Biocomputing, pages 40-51, 2006.
15. Z. Lu, K. B. Cohen and L. Hunter. Finding GeneRIFs via Gene Ontology annotations. In Proceedings of the Pacific Symposium on Biocomputing, pages 52-63, 2006.
16. I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001.
17. J. A. Mitchell, A. R. Aronson, J. G. Mork, L. C. Folk, S. M. Humphrey and J. M. Ward. Gene indexing: characterization and analysis of NLM's GeneRIFs. In Proceedings of the AMIA 2003 Symposium, pages 460-464, 2003.
18. H. Nanba and M. Okumura. Producing more readable extracts by revising them. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), pages 1071-1075, 2000.
19. R. Rubinstein and I. Simon. MILANO - custom annotation of microarray results using automatic literature searches. BMC Bioinformatics, 6:12, 2005.
20. P. Ruch, R. Baud and A. Geissbuhler. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artificial Intelligence in Medicine, 29(2):169-184, 2003.
EVALUATING THE AUTOMATIC MAPPING OF HUMAN GENE AND PROTEIN MENTIONS TO UNIQUE IDENTIFIERS

ALEXANDER A. MORGAN1, BENJAMIN WELLNER2, JEFFREY B. COLOMBE, ROBERT ARENS3, MARC E. COLOSIMO, LYNETTE HIRSCHMAN

MITRE Corporation, 202 Burlington Road, Bedford, MA, 01730, USA
Email: [email protected]; [email protected]

We have developed a challenge task for the second BioCreAtIvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.
1. Background

The first Critical Assessment of Information Extraction in Biology's (BioCreAtIvE) Task 1B involved linking mentions of model organism genes and proteins in MEDLINE abstracts to their corresponding identifiers in three different model organism databases (MGD, SGD, and FlyBase). The task is described in some detail in [1], along with descriptions of many different approaches to the task in the same journal issue. There has been quite a bit of past work associating text mentions of human genes and proteins with unique identifiers, including the early work by Cohen et al. [2] and the AZuRE system [3]. Very recently, Fang et al. [4] reported excellent results on a data set they created using one hundred MEDLINE abstracts. This widespread community interest in the issue and our experience with the first BioCreAtIvE motivated us to prepare another evaluation task for inclusion in the second BioCreAtIvE [5]. This task will require systems to link mentions of human genes and proteins with their corresponding EntrezGene (LocusLink) identifiers. We hope that researchers in this area can use this data set to compare techniques and

1 Currently at Stanford Biomedical Informatics, Stanford University
2 Also, the Department of Computer Science, Brandeis University
3 Currently at the Department of Computer Science, University of Iowa
gauge performance gains. It can also be used to address issues in the general portability of normalization techniques and to investigate the relationships between co-mentioned genes and proteins.

2. Task Definition
The most important part of evaluating system performance is, of course, a very careful definition of the task. The original Task 1B required each system to provide a list of all the model organism database identifiers for the species-specific (mouse, fly or yeast) genes and gene products mentioned in a MEDLINE abstract. There are a number of possible uses for such a system, such as improved document retrieval for specific genes, data mining over gene/protein co-mentions, or direct support of relation extraction (e.g., protein-protein interaction) and/or attribute assignment (e.g., assignment of Gene Ontology annotations). The latter might be immediately useful to researchers attempting to analyze high throughput experiments, performing whole genome or comparative genomics analyses, or data-mining for relationship discovery, all of which require links to the unique identifiers. Our initial investigations into a human gene/protein task suggested that UniProt identifiers [6] might be a good target to which we might normalize mentions of human proteins and their coding genes, and we hoped that this might bring the task into closer alignment with other efforts such as BioCreAtIvE I Task 2 [7], which required associating GO codes with human proteins identified through protein identifiers. UniProt provides a unified set of protein identifiers and represents a great leap forward for bioinformatics research, but it contains many redundancies: different fragments of the same polypeptide, polypeptide sequences derived from the same gene that differ in non-synonymous polymorphisms, and alternate transcripts from the same gene all may have separate entries and unique identifiers. We eventually settled on EntrezGene identifiers as unique target identifiers, despite incomplete mappings of UniProt to EntrezGene identifiers and what can be a complex many-to-many (e.g. alternate transcripts and gene duplications) relationship between genes and proteins.
As described in [8], our annotation viewed genes and their products as equivalent, because experience has found their typical usage interchangeable and/or indistinguishable. This is, of course, a simplification for purposes of evaluation; we recognize that this distinction is important in other cases. A significant difference between the normalized gene list task (BioCreAtIvE Task 1B) and general entity normalization/grounding is that each gene list is associated with the abstract as a whole, whereas general entity grounding requires the annotation of each mention in the text. The advantage of the "gene list" approach is that it avoids the issue of how to delimit the boundaries when annotating gene and protein mentions [9]. This becomes more of a problem in normalization when mentions are elided under various forms of conjunction. For example, it is difficult to identify the boundaries for the names of the different forms of PKC in "PKC isoforms alpha, delta, epsilon and zeta". Then there is the more difficult example of ellipsis: "AKR1C1-AKR1C4". Clearly AKR1C2 and AKR1C3 are being included in this mention, and functional information extracted about that group should include them. Fang et al. [4] excluded these cases from consideration, but we feel that these are important instances that need to be annotated and normalized. Equally difficult is the large gray area in gene and protein nomenclature between a description and a name, and the related question of what should be tagged. The text "Among the various proteins which are induced when human cells are treated with interferon, a predominant protein of unknown function, with molecular mass 56 kDa, has been observed" mentions the protein also known as "interferon-induced protein 56", but the text describes the entity rather than using the listed name derived from this description. Our compromise was to keep the gene list task, but to provide a richer data set that associates at least one text string with each entry in the gene list, a significant addition over the first BioCreAtIvE Task 1B. Polysemy in gene and protein names creates additional complexity, both within and between organisms [10]. Determination of the gene or protein being described may require the interpretation of the whole abstract - or several genes may be described with one "family name" term (see the Discussion section for further exploration of this issue).
The particular species can be intentionally under-specified when the text is meant to refer to all the orthologues in relevant species, but in other cases, a name is meant to be highly species specific. For example: "Anoxia activates AMP-activated protein kinase (AMPK), resulting in the inhibition of biosynthetic pathways to conserve ATP. In anoxic rat hepatocytes or in hepatocytes treated with 5-aminoimidazole-4-carboxamide (AICA) riboside, AMPK was activated and protein synthesis was inhibited." The mention of the properties of AMPK in the first sentence is meant to be general and to include activity in humans, but the subsequent experimental evidence is, of course, in rats.
3. Corpus Construction
3.1. Abstract Collection

To identify a collection of abstracts with a high likelihood of mentions of human genes and proteins, we obtained the gene_association.goa_human file [11] on 10 October 2005. This provided us with 11,073 PubMed identifiers for journal articles likely to have mentions of human genes and proteins. We obtained abstracts for 10,730 of these. The file gene2pubmed, obtained from NCBI [12] on 21 October 2005, was used, along with the GO annotations, to create the automatic/noisy annotations in the 5,000 abstracts set aside as a noisy training set, as described in [8]. This is further described in the Evaluation of Noisy Training Data section. We selected our abstracts for hand annotation from the 5,730 remaining abstracts.

3.2. Lexicon Creation

The basic gene symbol and gene name information corresponding to each human EntrezGene identifier was taken from the gene_info file from NCBI [12]. This was merged with name, gene and synonym entries taken from UniProt [6]. Suffixes containing "HUMAN", "1_HUMAN", "H_HUMAN", "protein", "precursor", and "antigen" were stripped from the terms and added to the lexicon as separate terms in addition to the original term. HGNC [13] symbol, name, and alias entries were also added. We identified the phrases most repeated across identifiers and those that had numerous matches in the 5,000 abstracts of noisy training data; we then used these to create a short (381 term) list to remove the most common terms that were unlikely to be gene or protein names but which had entered the lexicon as full synonyms. Examples of entries in this list are "recessive", "neural", "Zeta", "liver", "glycine", and "mediator". This list is available from the CVS archive [5]. This left us with a lexicon of 32,975 distinct EntrezGene identifiers linked to a total of 163,478 unique terms. The majority of identifiers have more than one term attached (average 5.5), although 8,385 had only one.
For example, identifier 1001 has the following synonyms: "PCAD; CDHP; CDH3; cadherin 3, type 1, P-cadherin (placental); HJMD". It is important to note that many of these terms are unlikely to be used as mentions in abstracts for the given proteins and genes. Many of the terms/synonyms were not unique among the identifiers, with the terms often being shared across a handful of identifiers (Table 1). Sometimes this reflects noise inherited from the source databases; the most egregious example is "hypothetical", which shows up as a name for 89 genes. Similarly, "human" (alone) shows up 15 times, "g protein coupled receptor" 12 times, and "seven transmembrane helix receptor" 30 times. Each normalized (Section 4) phrase included as a synonym in this relatively noisy lexicon is linked to an average of 1.1 different unique identifiers, although 80% of phrases link to only one identifier. These synonyms average 16.5 characters in length if whitespace is removed.

Table 1. Lexicon statistics.

Unique Gene IDs                    32,975
Unique Un-Normalized Terms        177,200
Unique Normalized Terms           163,478
Avg Term Length (Characters)        16.51
Avg Gene Identifiers per Term        1.12
Avg Term Length (Words)              2.17
Avg Terms per Identifier             5.55
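The suffix-stripping step of lexicon creation might look like the following sketch. The suffix list comes from the text above, but the exact delimiters and the stripping rule are our assumptions.

```python
# Suffixes stripped from lexicon terms; both the stripped variant and the
# original term are kept.  The delimiter handling here is an assumption.
SUFFIXES = ("1_HUMAN", "H_HUMAN", "_HUMAN", " protein", " precursor", " antigen")

def expand_term(term: str) -> set:
    """Return the original term plus any suffix-stripped variants."""
    variants = {term}
    for suffix in SUFFIXES:
        if term.endswith(suffix):
            variants.add(term[: -len(suffix)].rstrip(" _"))
    return variants
```

For instance, a UniProt-style entry name such as "CDH3_HUMAN" would yield both the original string and the bare symbol "CDH3".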
3.3. Annotation Tool and Annotation Process

We developed a simple annotation tool using dynamic webpages with PHP and MySQL to support the creation of the normalized gene lists and extraction of the associated mention excerpts from the text. Annotators could annotate via their own web browsers. We could also make rapid changes to the interface as soon as they were requested, without needing to update anything but the scripts on the server. The simple annotation guidelines and the PHP scripts used for the annotation are available for download from the Sourceforge CVS archive [5]. The interface presented the plain text of the title and abstract to the annotators, along with suggested annotations (based on the automatic/noisy process). Using these resources, annotators had to provide the EntrezGene identifiers and supporting text for all mentions of human genes and proteins. All annotations then went through a review process to examine abstracts marked with comments and to merge the differences between annotators before inclusion in the gold standard set. A total of 300 abstracts were annotated for the freely distributed training set, although 19 were removed for a variety of reasons, such as having mentions which could not be normalized to EntrezGene, leaving 281 for distribution. The annotators found an average of 2.27 different human genes mentioned per abstract. We have annotated another ~263 for use as an evaluation set. We plan to correct errors in these annotations based on pooling of the participants' submissions, as was done in the previous BioCreAtIvE [8]. The Sourceforge CVS archive will allow us to track corrections to these datasets [5].
3.4. Inter-annotator Agreement

We studied the agreement between different annotators on the same abstracts. The annotation was done by three annotators (two with PhDs in biological sciences, one with an MS; none are specialists in human biology, but all had previous experience in annotation). There was one annotator (primary) who did annotations for all abstracts. Our first pass of agreement studies was done on the first abstracts in the training set and was done mostly to check our annotation guidelines. Two annotators annotated the same 30 abstracts. There were 71 annotations (same EntrezGene identifiers for the abstract) in common and 7 differences (91% agreement). A second agreement experiment was performed with 26 new abstracts. There was only 87% agreement, but all disagreements were missed mentions or incorrect normalizations by the non-primary annotator. Unfortunately, these small sample sizes can only be suggestive of the overall level of agreement.

4. Characterizing the Data
In order to better characterize the properties of this dataset and task, we performed some baseline experiments, described below, to generate the list of EntrezGene identifiers for each abstract using the lexicon. We evaluated this using simple match against the gold standard annotations. For matching the terms from the lexicon, we ignored case and any punctuation or internal whitespace in the terms matched to the lexicon, but required match of start and end token boundaries as described in [14].

Table 2. Properties of the data.

Experiment                    True Positive   False Positive   False Negative   Precision   Recall
Noisy Training Data Quality             348               49              292       0.877    0.544
Coverage of Lexicon                     530             7941              110       0.063    0.828
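The matching rule described above — ignore case, punctuation, and internal whitespace, but require the match to start and end on token boundaries — can be sketched as follows. The helper names and the maximum span length are ours; the paper's exact procedure follows [14].

```python
import re
from collections import defaultdict

def normalize(term: str) -> str:
    """Lowercase and drop all punctuation and whitespace, as in the baseline."""
    return re.sub(r"[\W_]+", "", term.lower())

def build_index(lexicon):
    """Map each normalized synonym to the set of EntrezGene IDs carrying it.
    `lexicon` is a dict of {gene_id: [synonym, ...]}."""
    index = defaultdict(set)
    for gene_id, terms in lexicon.items():
        for term in terms:
            index[normalize(term)].add(gene_id)
    return index

def match_abstract(abstract: str, index, max_span: int = 8):
    """Return IDs whose synonyms match some whitespace-token span of the
    abstract, so matches always start and end on token boundaries."""
    tokens = abstract.split()
    hits = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_span, len(tokens)) + 1):
            key = normalize("".join(tokens[i:j]))
            if key in index:
                hits |= index[key]
    return hits
```

Under this rule, "P-cadherin" in the text matches the lexicon entry "P-cadherin" (both normalize to "pcadherin"), and the multi-token span "TNF alpha" matches "TNF-alpha", while a synonym embedded in the middle of a token would not match.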
4.1. Evaluation of Noisy (Automatically Generated) Training Data

We wanted to estimate the quality of the noisy training data and to evaluate our assumption that the document level annotations from the gene2pubmed file were indicative of a high likelihood of the mention of those genes in the abstract. To do this, we evaluated the gene lists derived from the gene2pubmed file (automatic/noisy data process) against those derived from human annotation (see Table 2). However, many genes may be mentioned in the abstract and paper but may not be included in the gene2pubmed file, causing our noisy training data to systematically underreport genes mentioned, and we estimate from this result that only half of all genes mentioned are included in the automatic/noisy data annotations (recall 0.544).

4.2. Evaluating the Coverage of the Lexicon

We also evaluated the coverage of the lexicon by using it to do simple pattern matching. This mirrors some of our early experiments in developing normalized gene lists for Drosophila melanogaster [15]. Our goal was to estimate a recall ceiling on performance for systems requiring exact match to the lexicon. The recall of 0.828 clearly shows the limits of the simple lexicon (Table 2). This demonstrates the need to extend exact lexical match beyond such simple rules as ignoring case, punctuation and white space. In some cases, very small affixes (e.g. h-, -p, -like), either in the lexicon or the text, caused a failure to match. There were numerous cases of acronyms, often embedded in longer terms, which caused problems ("actinin-1" vs. "ACTN1" or "GlyR alpha 1" vs. "Glycine receptor alpha-1 chain precursor" or "GLRA1"). The various modifiers indicating subtypes were a serious problem, e.g. "collagen, type V, alpha 1"; modifiers such as "class II", "beta subtype", "type 1", and "mu 1" varied in orthography and placement, and the modifier "1" is often optional. Conjunctions such as "freac1-freac7" are particularly costly from an evaluation perspective, since each can count as several false negatives at once. There was a considerable amount of name paraphrase (see the Discussion section), involving word ordering and term substitutions or insertions and deletions. This arises because the long phrases in the lexicon are often more descriptive than nominal, although the associated acronyms can give some indication as to how a mention might actually occur in text. For example, the text contains "kappa opioid receptor", whereas the lexicon contains "KOR" and "opioid receptor, kappa 1". Alan Aronson has investigated these issues in term variation while mapping concepts to text extensively [16]. Interestingly, self-embedded terms (e.g. "insulin-like growth factor-1 (IGF-I) receptor") seem to be a relatively rare problem at the level of the whole abstract. As expected, the precision based on lexical pattern matching (Table 2, row 2) was very low, due to false positive matches of terms in the lexicon against common English terms, ambiguous acronyms, and so forth.

4.3. Biological Context of Co-Mentioned Genes and Proteins

As an example of how this dataset might be used outside of the evaluation, we looked at the biological relationships between genes and proteins which are mentioned together in the same abstracts. Our experience annotating the abstracts
indicated that genes or proteins are typically co-mentioned because of sequence homology and/or some functional relationship (e.g., interaction), although cell markers (e.g., CD4) may be mentioned in a variety of contexts. Many sophisticated techniques have arisen for comparing genes based on functional annotations and sequence, but for this initial analysis we intentionally used something naive and simple. We computed two different similarity measurements for each pair of genes mentioned together in our dataset. For the sequence similarity computation, we used BioPython's pairwise2 function [17]: pairwise2.align.globalxs(seq1, seq2, -1, .1, penalize_end_gaps=0, score_only=1). For the sequence, we used the longest protein RefSeq for each gene. For a measure based on functional annotations, we computed the Jaccard set similarity (the Tanimoto distance) for the set of all GO annotations for each gene:

    Set similarity = |S1 ∩ S2| / (|S1| + |S2| - |S1 ∩ S2|)
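The GO-based measure is a plain Jaccard similarity over the two genes' annotation sets; a minimal sketch (the function name is ours):

```python
def go_similarity(gos1, gos2):
    """Jaccard set similarity |S1 & S2| / (|S1| + |S2| - |S1 & S2|) over two
    genes' GO annotation sets.  Annotations carrying a qualifier
    (contributes_to, colocalizes_with, NOT) are assumed already filtered out."""
    s1, s2 = set(gos1), set(gos2)
    if not (s1 or s2):
        return 0.0
    inter = len(s1 & s2)
    return inter / (len(s1) + len(s2) - inter)
```

Two genes sharing two of four distinct GO codes thus score 2 / (3 + 3 - 2) = 0.5.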
We excluded all GO codes that had an accompanying qualifier, which for human genes is restricted to "contributes_to", "colocalizes_with", and "NOT". This GO-derived similarity measure is a poor one for many reasons, including mixing experimental and homology based GO codes, ignoring the structure of GO, and ignoring the fact that the three main hierarchies are very different. Figure 1 shows the result of computing these similarity measures for the 737 pairs of genes that are co-mentioned in our hand annotated training set and for 1,630 pairs of randomly selected genes which are explicitly not co-mentioned. Of the 737 co-mentioned pairs, 100 have both similarity measures above 0.3, while none of the 1,630 non co-mentioned pairs do. This suggests that in the context of the evaluation, even simple biological knowledge may be helpful in such tasks as disambiguation (dealing with polysemy) for normalization, or in ascertaining whether co-mention suggests functional and/or physical interaction or simply homology. It is hoped that this dataset can encourage greater exploration into the use of biological knowledge to improve text mining.

Figure 1: Biological similarity between co-mentioned genes vs. not co-mentioned genes. A) Co-mentioned; B) NOT co-mentioned (x-axis: GO similarity, 0.0-1.0).
5. Discussion

It is interesting to compare this new corpus with Task 1B of BioCreAtIvE 1 for insights into the portability of normalization techniques. One set of measures in Table 3 seems to indicate that human may be easier than mouse; it has over twice the number of terms for each identifier, it has many fewer unique identifier targets, and

Table 3: A comparison of gene mention normalization.

                                   Human    Mouse    Yeast      Fly
Noisy Data Recall                   0.54     0.55     0.86     0.81
Noisy Data Precision                0.86     0.99     0.99     0.86
Max Recall Approach Recall          0.83     0.83     0.93     0.85
Max Recall Approach Precision       0.06     0.19     0.33     0.07
Average Synonym Length in Words     2.17     2.77     1.00     1.47
Number of Unique IDs              32,975   52,494    7,928   27,749
Average # Synonyms/Identifier       5.55     2.48     1.86     2.94
Average # Identifiers/Synonym       1.12     1.02     1.01     1.09
  (ambiguity)
BioCreAtIvE 1 Max Submitted            -     0.79     0.92     0.82
  F-measure
only slightly more ambiguity. However, this does not really represent how the terms in the lexicon map to the text. The synonyms in the model organism databases are drawn from text, whereas the lexicon that we created for human genes includes database identifiers or descriptive forms that have very little overlap with actual text mentions. This overestimates the number of useful term variants in the lexicon and probably underestimates ambiguity in practice. The affects of polysemy/ambiguity in gene/protein mention identification is discussed in detail in [10]. An important contrast between human and mouse nomenclature on the one hand, and yeast and fly on the other, is that the nomenclature is often much more descriptive than nominal as mentioned in the Task Definition section. In Drosophila, the gene rather whimsically named "Son of sevenless" ("Sos") is named just that. It would never be called "child of sevenless" or "Sevenless' son". However, the names of human genes may vary quite a bit. The Alzheimer's disease related "APP" gene is generally known as "beta-amyloid precursor protein", although "beta-amyloid precursor polypeptide" may be used as well. Many other equivalent transformations are also acceptable, such as "amyloid beta-protein precursor", and "betaAPP". In general, any semantically equivalent description of the gene or protein may be used as a name. However, the regularity of the allowed transformations suggests that it might be possible to design or automatically learn transformation rules to permit better matching, something investigated by past researchers [18]. As Vlachos et al. observed [19], in biomedical text there is a high occurrence of families of genes and proteins being mentioned by a single term such as: "Mxil
belongs to the Mad (Mxi1) family of proteins, which function as potent antagonists of Myc oncoproteins". In future work in biomedical entity normalization, we suggest that normalizing entity mentions to family mentions may be an effective way to support other biomedical text mining tasks. Possibly the protein families in InterPro [6] could be used as normalization targets for mentions of families. For example, the mention of "Myc oncoproteins" could link to InterPro:IPR002418. This would enable information extraction systems that extract facts (relations, attributes) on gene families to attach those properties to all family members.
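As a toy illustration of the transformation-rule idea discussed above, a bag-of-tokens normalization already equates several of the APP variants. This sketch is not part of the BioCreAtIvE evaluation; the function name and approach are illustrative only.

```python
import re

def normalize(mention: str) -> frozenset:
    """Reduce a gene-name mention to a bag of lowercase tokens,
    splitting on whitespace and hyphens, so that equivalent word
    reorderings and hyphenation variants compare equal."""
    return frozenset(re.split(r"[\s\-]+", mention.lower()))

# The equivalent APP variants from the text map to the same token set:
a = normalize("beta-amyloid precursor protein")
b = normalize("amyloid beta-protein precursor")
print(a == b)  # True: word order and hyphenation are factored out

# A fused form such as "betaAPP" still fails, which is why learned
# transformation rules go beyond simple token matching.
print(normalize("betaAPP") == a)  # False
```

This captures only the simplest class of transformations; the regularity noted in the text suggests richer rules (abbreviation expansion, protein/polypeptide substitution) could be learned as well.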
6. Conclusion
In summary, we have described the motivation and development of a dataset for evaluating the automatic mapping of mentions of human genes/proteins to unique identifiers, which will be used as part of the second BioCreAtIvE. We have elucidated some of the properties of this data set, and made some suggestions about how it may be used in conjunction with biological knowledge to investigate the properties of co-mentioned genes and proteins. Anonymized submissions by evaluation participants, along with the evaluation set gold standard annotations, will be made publicly available [5] after the workshop, tentatively scheduled for the spring of 2007.

7. References
1. Hirschman, L., et al. Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005. 6 Suppl 1: p. S11.
2. Cohen, K.B., et al. Contrast and variability in gene names. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pp. 14-20. Association for Computational Linguistics, 2002.
3. Podowski, R.M., et al. AZuRE, a scalable system for automated term disambiguation of gene and protein names. Proc IEEE Comput Syst Bioinform Conf, 2004: p. 415-24.
4. Fang, H., et al. Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology, pp. 41-48. Association for Computational Linguistics, New York, 2006.
5. http://biocreative.sourceforge.net/, BioCreAtIvE 2 Homepage.
6. Wu, C.H., et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 2006. 34(Database issue): p. D187-91.
7. Blaschke, C., et al. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 2005. 6 Suppl 1: p. S16.
8. Colosimo, M.E., et al. Data preparation and interannotator agreement: BioCreAtIvE Task 1B. BMC Bioinformatics, 2005. 6 Suppl 1: p. S12.
9. Tsai, R.T., et al. Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics, 2006. 7: p. 92.
10. Tuason, O., et al. Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput, 2004: p. 238-49.
11. http://www.geneontology.org/, The Gene Ontology.
12. ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, NCBI Gene FTP site.
13. Wain, H.M., et al. Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res, 2004. 32(Database issue): p. D255-7.
14. Wellner, B. Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pp. 1-8. Association for Computational Linguistics, Detroit, 2005.
15. Morgan, A.A., et al. Gene name identification and normalization using a model organism database. J Biomed Inform, 2004. 37(6): p. 396-410.
16. Aronson, A.R. The effect of textual variation on concept based information retrieval. Proc AMIA Annu Fall Symp, 1996: p. 373-7.
17. http://biopython.org, BioPython Website.
18. Hanisch, D., et al. Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput, 2003: p. 403-14.
19. Vlachos, A., et al. Bootstrapping the Recognition and Anaphoric Linking of Named Entities in Drosophila Articles. Pac Symp Biocomput, 2006. 11: p. 100-111.
MULTIPLE APPROACHES TO FINE-GRAINED INDEXING OF THE BIOMEDICAL LITERATURE

AURELIE NEVEOL1,2, SONYA E. SHOOSHAN1, SUSANNE M. HUMPHREY1, THOMAS C. RINDFLESCH1, ALAN R. ARONSON1

1 National Library of Medicine, NIH, Bethesda, MD 20894, USA
2 Equipe CISMeF, Rouen, France
The number of articles in the MEDLINE database is expected to increase tremendously in the coming years. To ensure that all these documents are indexed with continuing high quality, it is necessary to develop tools and methods that help the indexers in their daily task. We present three methods addressing a novel aspect of automatic indexing of the biomedical literature, namely producing MeSH main heading/subheading pair recommendations. The methods (dictionary-based, post-processing rules and Natural Language Processing rules) are described and evaluated on a genetics-related corpus. The best overall performance is obtained for the subheading genetics (70% precision and 17% recall with post-processing rules, 48% precision and 37% recall with the dictionary-based method). Future work will address extending this work to all MeSH subheadings and a more thorough study of method combination.
1. Introduction
1.1. Indexing the biomedical literature

To ensure efficient retrieval of the ever-increasing number of articles in the U.S. National Library of Medicine's (NLM's) MEDLINE® database, these documents must be systematically stored and indexed. In MEDLINE, the subject matter of articles is described with a list of descriptors selected from NLM's Medical Subject Headings (MeSH®). MeSH contains about 24,000 main headings covering specific concepts in the biomedical domain such as diseases, body parts, etc. It also contains 83 subheadings that denote broad areas in biomedicine such as immunology or genetics. Subheadings can be coordinated with a main heading in order to refer to a concept in a more specific way. NLM indexers select for each article an average of ten to twelve MeSH main headings (e.g., Williams Syndrome) or main heading/subheading pairs (e.g., Williams Syndrome/genetics). The indexing task is time consuming and requires skilled, trained individuals. In order to assist indexers in their daily practice, the NLM's Indexing Initiative [1] has investigated automatic indexing methods, which led to the development of the Medical Text Indexer (MTI) [2]. MTI is a software tool producing indexing recommendations in the form of a list of stand-alone main
headings (i.e. not associated with subheadings) shown on request to the indexers while they work on a record in the MEDLINE Data Creation and Maintenance System (DCMS). Other work on the automatic assignment of MeSH descriptors to medical texts in English has also focused on stand-alone main headings [3-4]. While the indexing resulting from some of these automatic systems has been shown to approach human indexing performance as measured by retrieval [5], there is a need for automatic means to provide finer-grained indexing recommendations, namely main heading/subheading pairs in addition to stand-alone main headings.

In fact, there are both theoretical and practical reasons for this effort. From a theoretical point of view, the MeSH indexing manual [6] states that indexers must choose descriptors that reflect the content of an article by first selecting correct main headings and second by attaching the appropriate subheadings. Consequently, selecting an isolated main heading where a main heading/subheading pair should have been assigned is, strictly speaking, erroneous - or at best, incomplete. On the practical side, indexers do use both main headings and main heading/subheading pairs when indexing a document. Therefore, stand-alone main heading recommendations, while useful, will always need to be complemented by attaching subheadings where appropriate.

The task of assigning MeSH descriptors to a document can be viewed as a multi-class classification problem where each document will be assigned several "classes" in the form of MeSH descriptors. When assigning MeSH main headings [4, 7] the scale of the classification problem is 23,883. If one attempts to assign MeSH main heading/subheading pairs instead, the number of classes increases to 534,981. Many machine learning methods perform very well on binary classification but prove more difficult to apply successfully to larger-scale problems.
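The scale jump from main headings to pairs can be made concrete with a back-of-the-envelope calculation using the figures above:

```python
# Rough scale of the classification problem described in the text:
# 23,883 stand-alone main headings, but 534,981 legal main
# heading/subheading pairs, i.e. on average roughly 22 allowable
# qualifiers per main heading.
main_headings = 23_883
pair_classes = 534_981
avg_allowable_qualifiers = pair_classes / main_headings
print(round(avg_allowable_qualifiers, 1))  # 22.4
```

The 22-fold blow-up of the label space is what makes direct application of many classifiers impractical and motivates the dictionary- and rule-based methods investigated here.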
As regards MeSH main heading classification, the hierarchical relationships between the classes have been used to reduce the complexity of the problem [4, 7]. Previous work on producing automatic MeSH pair recommendations that relied on dictionary and rule-based methods seemed promising [10]. For these reasons, we are investigating similar methods here.

1.2. Genetics literature

Following the rapid developments of genetics research in the past twenty years, the volume of genetics-related literature has grown accordingly. While genetics
literature represented about 6% of MEDLINE records for the year 1985*, it represents over 19% of MEDLINE records for 2005†. In this context, it seems that providing fine-grained indexing recommendations for genetics literature is particularly important, as it will impact a significant portion of the biomedical literature. Therefore, we have elected to concentrate our effort in this subdomain for our preliminary work investigating automatic methods of providing MeSH pair indexing recommendations. This led us to focus on the subheadings genetics, immunology and metabolism, which were found to be prevalent in the MeSH indexing of our genetics test corpus (see section 2.4).

1.3. Objective and approach

This paper presents the various methods we investigated to automatically identify MeSH main heading/subheading pairs from the text (title and abstract) of articles to be indexed for MEDLINE. The ultimate goal of this research is to add subheading-related features to DCMS when displaying recommendations to NLM indexers, in order to save time during the indexing process. A previous study of MTI usability showed that the possibility of selecting recommendations from a pick list saved look-up and typing time [8]. The ideal time-saving mechanism for subheading attachment would be to include relevant pairs in the current list of main headings available for selection. However, this solution is only viable if the precision of such recommendations is sufficiently high. The possible obstacle that we foresee to including pair recommendations in the current pick list is that high precision for pair recommendations might be difficult to achieve without any human input throughout the process. Work in the area of computer-assisted translation [9] has shown the usefulness of interactive systems in the context of highly demanding cognitive tasks such as translation or indexing.
For this reason, we are considering the possibility of either dynamically showing related pair recommendations once the indexer selects a main heading for the record, or highlighting the most likely subheadings for the current record when indexers are viewing the list of allowable subheadings for a given main heading that they selected. The remainder of this paper will address the difficult task of producing the recommendations themselves.
* 19,348 citations retrieved by the query genetics AND 1985 [dcom] AND MEDLINE [sb] compared to 313,638 records retrieved by the query 1985 [dcom] AND MEDLINE [sb] on 07/12/06.
† 114,530 citations retrieved by the query genetics AND 2005 [dcom] AND MEDLINE [sb] compared to 598,217 records retrieved by the query 2005 [dcom] AND MEDLINE [sb] on 07/12/06.
2. Material and methods
In this section, we describe the three methods we investigated to identify main heading/subheading pairs from medical text. We also introduce the genetics corpus we used to evaluate the methods.

2.1. Baseline dictionary-based method

The first method we considered consists of identifying main headings and subheadings separately for a given document and then attempting to pair them. Main headings are retrieved with the Medical Text Indexer [2] and subheadings are retrieved by looking up words from the title and abstract in a manually built dictionary in which each entry contains a subheading and a corresponding term or expression that is likely to represent the subheading in text. These terms are mainly derived from inflectional and derivational forms of the subheadings. They were obtained manually and tested on a general training corpus composed of a random 3% selection of MEDLINE 2004. Candidate terms were added to the dictionary if they benefited the method's performance on the training corpus. For example, gene, genes, genetic, genetics, genetical, genome and genomes are terms corresponding to /genetics. The dictionary contains 227 entries for all 83 subheadings, including 10 for /genetics. To obtain the pairs, the subheadings retrieved by the dictionary are coordinated with the main headings retrieved, if applicable. For each main heading, MeSH defines a set of subheadings called "applicable qualifiers" that can be coordinated with it (e.g. /genetics is applicable to Carcinoma, Renal Cell but not Odds Ratio). In the dictionary method, all the legal pairs that can be assembled from the sets of main headings and subheadings retrieved are recommended. For example, two occurrences of the dictionary entry genes were found in the abstract of MEDLINE record 15319295, which means that /genetics was identified for this record.
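The pairing step just described can be sketched as follows. The dictionary entries, MTI output, and allowable-qualifier table below are illustrative stand-ins for the NLM resources, not actual data:

```python
# Sketch of the pairing step of the baseline dictionary method.
SUBHEADING_DICT = {  # toy subset of the 227-entry dictionary
    "gene": "/genetics", "genes": "/genetics", "genetic": "/genetics",
}
ALLOWABLE = {  # toy subset of the MeSH "applicable qualifiers" table
    ("Carcinoma, Renal Cell", "/genetics"),
}

def recommend_pairs(text: str, mti_main_headings: list) -> set:
    """Pair every subheading found in the text with every MTI main
    heading for which MeSH allows that qualifier."""
    found = {SUBHEADING_DICT[w] for w in text.lower().split()
             if w in SUBHEADING_DICT}
    return {(mh, sh) for mh in mti_main_headings for sh in found
            if (mh, sh) in ALLOWABLE}

pairs = recommend_pairs("two genes linked to renal carcinoma",
                        ["Carcinoma, Renal Cell", "Odds Ratio"])
print(pairs)  # {('Carcinoma, Renal Cell', '/genetics')}
```

Note how Odds Ratio is silently filtered out by the allowable-qualifier check, mirroring the record 15319295 example discussed in the text.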
Attempts were made to attach /genetics to each of the twelve main headings recommended by MTI for this record, including Carcinoma, Renal Cell and Odds Ratio. The pair Carcinoma, Renal Cell/genetics was recommended because /genetics is an allowable qualifier for Carcinoma, Renal Cell. However, /genetics is not an allowable qualifier for Odds Ratio; therefore no other pair recommendation was made.

2.2. Indexing rules

The two methods detailed in this section are based on indexing practice, sometimes expressed in MeSH annotations. In previous work on the indexing of medical texts in French [10], indexing rules were derived from interviews with indexers. Similar rules were also available in the MedIndEx knowledge base
[11]. To build the sets of rules used here, we adapted existing rules [10-11] and manually created new rules. The rules were divided into two groups.

Post-processing rules

Post-processing (PP) rules build on a pre-existing set of indexing terms (i.e., the main heading recommendations from MTI), and enrich it by expanding on the underlying concepts denoted by the indexing terms within that set. Twenty-nine of these rules are currently implemented for /genetics (as well as 11 for /immunology and 8 for /metabolism). Rules that were created in addition to the existing rules from MedIndEx and the French system (such as the example shown in Figure 1) were evaluated using MEDLINE data. Specifically, we computed an estimated precision equal to the number of citations indexed with both the trigger terms and the recommended pair over the number of citations indexed with the trigger terms*. Only rules with an estimated precision over 0.6 were considered for inclusion in the rule sets. According to the sample rule shown in Figure 1, a pair recommendation is triggered by existing MTI recommendations including the main heading Mutation as well as a DISEASE term§. Since Mutation is a genetics concept, an inference is made that /genetics should be attached to the disease main heading. For example, both main headings Mutation and Pancreatic Neoplasms are recommended by MTI for the MEDLINE record 14726700. As Pancreatic Neoplasms is a disease term, the rule will be applied and the pair Pancreatic Neoplasms/genetics will be recommended.

If the main heading Mutation and a DISEASE term appear in the indexing recommendations then the pair DISEASE/genetics should also be used.

Figure 1. Sample post-processing rule for the subheading genetics
* For the sample rule shown in Figure 1, the estimated precision was 0.67. (On 09/06/06, the query mutation [mh] AND (diseases category/genetics [mh] OR mental disorders/genetics [mh]) retrieved 144,698 citations while mutation [mh] AND (diseases category [mh] OR mental disorders [mh]) retrieved 216,749 citations.)
§ DISEASE refers to any phrase that points to a MeSH main heading belonging to the diseases or mental disorders categories.
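The estimated-precision computation from the footnote above can be reproduced directly; the counts are the MEDLINE query totals given there:

```python
# Estimated precision of the sample PP rule in Figure 1: citations
# indexed with both the trigger terms and the recommended pair, over
# citations indexed with the trigger terms alone.
with_pair = 144_698     # mutation [mh] AND (... /genetics ...)
with_trigger = 216_749  # mutation [mh] AND (diseases category [mh] OR ...)
estimated_precision = with_pair / with_trigger
print(round(estimated_precision, 2))  # 0.67, above the 0.6 inclusion threshold
```

Any candidate rule whose ratio falls at or below 0.6 would be excluded from the rule sets under the criterion stated in the text.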
Natural Language Processing rules

Natural Language Processing (NLP) rules use cues from the title or abstract of an article to infer pair recommendations. A sample NLP rule is shown in Figure 2. In the original French system, this type of rule was implemented by a set of transducers that exploited information on each term's semantic category (DISEASE, etc.) stored in an integrated electronic MeSH dictionary. Although very efficient, this method is also heavily language-dependent. For English, such advanced linguistic analysis of medical corpora is performed by NLM's SemRep [12], a tool that is able to identify interactions between medical entities based on domain knowledge from the Unified Medical Language System® (UMLS®).

If a phrase such as "GENE** is associated with DISEASE" appears in text then the pair DISEASE/genetics should also be used.

Figure 2. Sample Natural Language Processing rule for the subheading genetics
Specifically, SemRep retrieves UMLS triplets composed of two concepts from the UMLS Metathesaurus® together with their respective UMLS Semantic Types (STs) and the relation between them, according to the UMLS Semantic Network. Hence, phrases corresponding to the pattern of the sample rule presented in Figure 2 would be extracted by SemRep as the triplet (gngm ASSOCIATED_WITH dsyn), where "gngm" denotes the ST "Gene or Genome" and "dsyn" denotes the ST "Disease or Syndrome". We can infer from this that there is an equivalence between the semantic triplet (gngm ASSOCIATED_WITH dsyn) and the MeSH pair DISEASE/genetics, where "dsyn" and DISEASE refer to the same entity. In this way, the NLP rules were used to obtain a set of equivalencies between these UMLS triplets and MeSH pairs. Subsequently, a restrict-to-MeSH algorithm [13] was used to translate UMLS concepts to their MeSH equivalents.

** GENE refers to any phrase that points to a MeSH main heading belonging to the GENE sub-hierarchy within the GENETIC STRUCTURES hierarchy.
†† In the Semantic Types hierarchy, "neop" is a descendant of "dsyn". By inheritance, rules that apply to a given Semantic Type also apply to its descendants.

For example, the phrase "Association of a haplotype of matrix metalloproteinase (MMP)-1 and MMP-3 polymorphisms with renal cell carcinoma" occurring in the MEDLINE record 15319295 was annotated by SemRep with the triplet (gngm ASSOCIATED_WITH neop)††, where the "Gene or Genome" was MMP and the "Neoplastic Process" ("neop") was Renal Cell Carcinoma. The latter UMLS concept can be restricted to its MeSH equivalent Carcinoma, Renal Cell and the
pair Carcinoma, Renal Cell/genetics is then recommended for the indexing. In the context of the genetics domain, we also use triplets retrieved by SemGen [14], a variant of SemRep specifically adapted to the identification of Gene-Gene and Gene-Disease interactions.

2.3. Combination of methods

In an attempt to assess the complementarity of the methods, we also evaluated the recommendations provided by any two methods. The combination consisted of examining all the recommendations obtained from two methods, and selecting only the concurring ones, if any. For example, the pairs Ascomycota/genetics, Capsid Proteins/genetics, RNA Viruses/genetics and Totivirus/genetics were recommended by the post-processing rules method for citation 15845253, while Viruses/genetics, RNA Viruses/genetics and Totivirus/genetics were recommended by the NLP rules for the same citation. Only the common pairs RNA Viruses/genetics and Totivirus/genetics are selected by combination of the two methods. In this case, the two pairs selected by combination were used to index the documents in MEDLINE. Two of the three discarded pairs (Ascomycota/genetics and Viruses/genetics) were not used by the indexers while the other one (Capsid Proteins/genetics) was.

2.4. Test corpus

All three methods (baseline dictionary-based, PP rules, NLP rules) were tested on a corpus composed of genetics-related articles selected from all citations indexed for MEDLINE in 2005. In order to avoid bias, the selection was not directly based on whether the articles were indexed with the subheading genetics. Instead we applied NLM's Journal Descriptor Indexing tool, which categorized the citations according to Journal Descriptors and also according to Semantic Types [15]. This categorization provided an indication of the biomedical disciplines discussed in the articles.
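The two-method combination of section 2.3 amounts to a set intersection; a minimal sketch using the citation 15845253 pairs discussed above:

```python
# Combination by agreement: keep only the recommendations on which
# both methods concur, i.e. the set intersection.
pp = {"Ascomycota/genetics", "Capsid Proteins/genetics",
      "RNA Viruses/genetics", "Totivirus/genetics"}
nlp = {"Viruses/genetics", "RNA Viruses/genetics", "Totivirus/genetics"}
combined = pp & nlp
print(sorted(combined))  # ['RNA Viruses/genetics', 'Totivirus/genetics']
```

As the results in section 3.2 show, this agreement filter raises precision at a substantial cost in recall, since any pair found by only one method is discarded.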
For our genetics-related corpus, we selected citations that met either of these criteria:
• "Genetics" or "Genetics, Medical" were among the top six Journal Descriptors
• "genf" (Gene Function) or "gngm" (Gene or Genome) were among the top six Semantic Types
A total of 84,080 citations were collected and used to test the methods presented above. At least one of the subheadings genetics, immunology and metabolism appears in 53,903 of the corpus citations.
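The triplet-to-pair translation used by the NLP rules (section 2.2) can be sketched as a lookup table keyed on generalized Semantic Types. The table and function names here are a toy stand-in for the full equivalence set and the restrict-to-MeSH step:

```python
# Illustrative mapping from SemRep semantic triplets to MeSH pair
# recommendations. Predicates and ST codes follow the UMLS examples
# in the text.
TRIPLET_TO_SUBHEADING = {
    # (subject ST, predicate, object ST) -> subheading attached to the
    # MeSH translation of the object concept
    ("gngm", "ASSOCIATED_WITH", "dsyn"): "/genetics",
}

ST_PARENT = {"neop": "dsyn"}  # "neop" is a descendant of "dsyn"

def generalize(st: str) -> str:
    """Walk one step up the ST hierarchy so rules apply to descendants."""
    return ST_PARENT.get(st, st)

def pair_from_triplet(subj_st, predicate, obj_st, obj_mesh_heading):
    key = (generalize(subj_st), predicate, generalize(obj_st))
    sh = TRIPLET_TO_SUBHEADING.get(key)
    return (obj_mesh_heading + sh) if sh else None

# The MMP / renal cell carcinoma example from the text:
print(pair_from_triplet("gngm", "ASSOCIATED_WITH", "neop",
                        "Carcinoma, Renal Cell"))
# Carcinoma, Renal Cell/genetics
```

The `generalize` step encodes the inheritance noted in the footnotes: a rule written for "dsyn" also fires for its descendant "neop".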
3. Results
3.1. Independent methods

Table 1 shows the performance of the methods of pair recommendation presented in section 2. For each method, we detail the results obtained for /genetics, /immunology and /metabolism. We also indicate the overall figures (All) for the total number of recommendations obtained (Nb_rec), the total number of citations impacted (Nb_cit), the number of recommendations that were selected by MEDLINE indexers (Nb_rec+), the precision (PREC) and the recall (REC). Precision corresponds to the number of recommendations that were actually used by MEDLINE indexers over the total number of recommendations provided by the methods. Recall corresponds to the number of recommendations that were used by the indexers over the total number of pairs that were used by the indexers.

Table 1. Performance of MeSH pair recommendation
Method             Nb_rec    Nb_rec+   Nb_cit    PREC   REC
Dictionary (GE)    97,553    46,804    29,632    0.48   0.3663
Dictionary (IM)     6,691     2,326     1,629    0.35   0.1095
Dictionary (ME)     5,317     2,166     1,577    0.41   0.0200
Dictionary (All)  109,561    51,296    31,476    0.47   0.1993
PP (GE)            31,164    21,752    16,441    0.70   0.1703
PP (IM)             1,451     1,048     1,027    0.72   0.0493
PP (ME)            25,823    13,578    10,391    0.53   0.1253
PP (All)           58,438    36,378    23,184    0.62   0.1413
NLP (GE)            2,480     1,566     2,327    0.63   0.0123
NLP (IM)               97        26        91    0.27   0.0012
NLP (ME)               21         3        17    0.33   0.0000
NLP (All)           2,598     1,605     2,435    0.62   0.0062

3.2. Combinations

Table 2. Cross precision of MeSH pair recommendation methods

Method       Dictionary   PP     NLP
Dictionary   0.47         0.73   0.75
PP           0.73         0.62   0.87
NLP          0.75         0.87   0.62
Table 2 shows the precision and Table 3 shows the recall obtained when the methods are combined two by two (bold figures on the diagonal reflect the performance of the methods considered independently, as presented in Table 1).

Table 3. Cross recall of MeSH pair recommendation methods

Method       Dictionary   PP       NLP
Dictionary   0.1993       0.0498   0.0055
PP           0.0498       0.1413   0.0028
NLP          0.0055       0.0028   0.0062

4. Discussion
4.1. General

The performance of each method can vary considerably depending on the subheading it is applied to. Moreover, the global performance of all three methods seems higher for /genetics than for /metabolism or /immunology. This may be explained by the fact that genetics is a more circumscribed domain than metabolism and immunology. The best overall precision is obtained with the post-processing rules, and the best overall recall is obtained with the dictionary method. Similar observations could be made on a general training corpus, where the scope of the methods was mostly limited to the genetics-related articles.

4.2. Error analysis

To gain a better understanding of the results and how they might be improved, we analyzed a number of recommendations that were inconsistent with our reference (MEDLINE indexing) and were therefore counted as errors. Table 4 presents a few characteristic cases. Most errors fall into these categories:
• Recommendation seems to be relevant
• Recommendation corresponds to a concept not substantively discussed
• Recommendation is incorrect
Especially with the NLP rules, there seem to be more cases where the recommendations address a relevant topic that is not discussed substantively in the article (e.g. PMID 15659801 in Table 4). Sometimes, however, as shown in the example of PMID 15638374 in Table 4, the concept denoted by the recommended pair seems relevant but was not indexed. The added value of our tool could include reducing the number of similar omissions in the future. Most "incorrect" recommendations come from the dictionary method, which is the most simplistic. Another common source of errors is the case exemplified
with PMID 15574482 in Table 4, where a given post-processing rule can apply to several main headings, but only one of the candidates is relevant for subheading attachment. This situation was particularly prevalent with /metabolism and resulted in a significantly lower precision for this subheading, compared to /immunology and /genetics.

Table 4. Analysis of sample erroneous pair recommendations
PMID 15574482
Recommendations: Seeds/GE, Seedling/GE, Oryza sativa/GE (three pairs were recommended when applying the rule; only Oryza sativa/GE was correct)
Method: PP: if MH Plants, Genetically Modified and a PLANT term appear in the indexing, the pair PLANT/genetics should be used.
Error interpretation: Three plants were discussed and the rule only applied to one, Oryza sativa, which was more specific (however, there is no direct ancestor-descendant relationship between the terms).

PMID 15638374
Recommendations: Phyllodes Tumor/GE
Method: NLP: the text "The aim of the study was an evaluation of PCNA and Ki-67 expression in the stromal component of fibroepithelial tumours."§§ was interpreted by SemRep as "gngm LOCATION_OF neop", which translates into Phyllodes Tumor/genetics.
Error interpretation: The recommended pair seems relevant for the article, although it doesn't appear in the MEDLINE indexing.

PMID 15659801
Recommendations: Liver Neoplasms/GE
Method: Dictionary: the phrase "... gene expression in liver tumors ..." contains the dictionary entry "gene", related to /genetics, which is an allowable qualifier for Liver Neoplasms, retrieved by MTI.
Error interpretation: The concept is not substantively discussed in the article.

§§ The original phrase was edited to enhance legibility in the table.

Error analysis can point to changes that should be made in the rules or formal concept description. Links between concepts in the case of PMID 15574482 in Table 4 would make it possible to consider a filtering according to main heading specificity. For example, if the fact that Oryza sativa is a more specific term than either Seeds or Seedling were available, one might consider
enforcing a rule stating that subheadings should only be attached to the most specific term when several terms belonging to the same hierarchy are candidates for attachment.

4.3. Complementarity of the methods

The overlap in recommendations is not significant. As a result, using different methods will help cover more citations and increase the overall recall. However, the gain in precision obtained when combining several methods is offset by a significant loss in recall. In fact, most of the recommendations resulting from the combination of methods concern the subheading genetics, especially where the NLP method is one of the combined methods. To overcome this problem we could consider the performance of post-processing rules and Natural Language Processing rules independently (e.g., there are 29 PP rules for /genetics). Rules that achieve high precision individually may be used as such.

5. Conclusion and Future Work
We have presented three methods to provide MeSH main heading/subheading pair recommendations for indexing the biomedical literature. These methods were applied to a genetics-related corpus to provide recommendations including the subheadings genetics, immunology and metabolism. Although performance may vary considerably depending on the subheading and the method used, the results are encouraging and seem to indicate that some useful pair recommendations could be used in indexing in the near future. In future work, we plan to expand the set of PP and NLP rules to cover all 83 MeSH subheadings. Statistical methods to provide pair recommendations will also be investigated. For example, in the specific field of genetics, links between MEDLINE and other Entrez databases such as Gene could be exploited. Based on the results from the combination of methods, more elaborate combination techniques will be studied in order to lessen the decrease in recall. Finer combinations at the rule level may be considered, as well as other factors such as the influence of the specific genetics corpus we used. Finally, a qualitative evaluation of this work will be sought from the indexers at NLM.

Acknowledgments

This research was supported in part by an appointment of A. Neveol to the Lister Hill Center Fellows Program sponsored by the National Library of Medicine and administered by the Oak Ridge Institute for Science and Education, and in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. The authors would like to thank Halil Kilicoglu
for his help in the use of SemRep/SemGen and James G. Mork for his help in the use of MTI (Medical Text Indexer) during the experiments.

References
1. AR. Aronson, O. Bodenreider, HF. Chang, SM. Humphrey, JG. Mork, SJ. Nelson, TC. Rindflesch and WJ. Wilbur. "The NLM Indexing Initiative". Proc AMIA Symp. 17-21 (2000).
2. AR. Aronson, JG. Mork, GW. Gay, SM. Humphrey, WJ. Rogers. "The NLM Indexing Initiative's Medical Text Indexer". Proc. Medinfo. 268-72 (2004).
3. P. Ruch, R. Baud, A. Geissbühler. "Learning-free Text Categorization". LNAI, 199-204 (2003).
4. L. Cai and T. Hofmann. "Hierarchical document categorization with support vector machines". Proc. CIKM. 396-402 (2004).
5. W. Kim, AR. Aronson and WJ. Wilbur. "Automatic MeSH term assignment and quality assessment". Proc AMIA Symp. 319-23 (2001).
6. http://www.nlm.nih.gov/mesh/indman/chapter_19.html (visited on 05/23/06)
7. M. Ruiz and P. Srinivasan. "Hierarchical neural networks for text categorization". Proc. SIGIR. 281-282 (1999).
8. C. Gay. "A MEDLINE Indexing Experiment Using Terms Suggested by MTI". National Library of Medicine Internal Report (2002).
9. P. Langlais, G. Lapalme and M. Loranger. "TransType: Development-Evaluation Cycles to Boost Translator's Productivity". Machine Translation 15, 77-98 (2002).
10. A. Neveol, A. Rogozan, SJ. Darmoni. "Automatic indexing of online health resources for a French quality controlled gateway". Inf. Process. Manage. 42, 695-709 (2006).
11. SM. Humphrey. "Indexing biomedical documents: from thesaural to knowledge-based retrieval systems". Artif Intell Med. 4, 343-371 (1992).
12. TC. Rindflesch and M. Fiszman. "The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text". J Biomed Inform. 36(6), 462-77 (2003).
13. O. Bodenreider, SJ. Nelson, WT. Hole, and HF. Chang. "Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies". Proc AMIA Symp. 815-9 (1998).
14. TC. Rindflesch, B. Libbus, D. Hristovski, AR. Aronson, H. Kilicoglu. "Semantic relations asserting the etiology of genetic diseases". Proc AMIA Symp. 8-1 (2003).
15. SM. Humphrey. "Automatic indexing of documents from journal descriptors: a preliminary investigation". J Am Soc Inf Sci Technol. 50(8), 661-674 (1999).
MINING PATENTS USING MOLECULAR SIMILARITY SEARCH

JAMES RHODES1, STEPHEN BOYER1, JEFFREY KREULEN1, YING CHEN1, PATRICIA ORDONEZ2

1 IBM Almaden Services Research, San Jose, CA 95120, USA, www.ibm.com
E-mail: jjrhodes, sboyer, kreulen, [email protected]
2 E-mail: ordopal@umbc.edu

Text analytics is becoming an increasingly important tool used in biomedical research. While advances continue to be made in the core algorithms for entity identification and relation extraction, a need for practical applications of these technologies arises. We developed a system that allows users to explore the US Patent corpus using molecular information. The core of our system contains three main technologies: a high-performing chemical annotator which identifies chemical terms and converts them to structures, a similarity search engine based on the emerging IUPAC International Chemical Identifier (InChI) standard, and a set of on-demand data mining tools. By leveraging this technology we were able to rapidly identify and index 3,623,248 unique chemical structures from 4,375,036 US Patents and Patent Applications. Using this system a user may go to a web page, draw a molecule, search for related Intellectual Property (IP) and analyze the results. Our results prove that this is a far more effective way of identifying IP than traditional keyword-based approaches.

Keywords: Chemical Similarity; Data Mining; Patents; Search Engine; InChI
1. Introduction

The US Patent corpus is an invaluable resource for any scientist with a need for prior art knowledge. Since patents need to clearly document all aspects of an invention, they contain a plethora of information. Unfortunately, much of this information is buried within pages upon pages of legal verbiage. Additionally, current search applications are designed around keyword queries, which prove ineffective when searching for chemically related information. Consider the drug discovery problem of finding a replacement molecule
for fluoroalkane sulfonic acid (CF3CF2SO3H). This molecule appears in everyday products like Scotchgard®, floor wax, Teflon®, and in electronic chip manufacturing materials like photoresists. The problem is that this molecule is a bioaccumulator and a potential carcinogen (a substance that causes cancer). Furthermore, it has made its way through the food chain and can now be found in polar bears and penguins. Companies are proactively trying to replace this acid with other, more environmentally friendly molecules. The sulfonic acid fragment, SO3H, is the critically necessary element. The harmful fragment is anything that looks like CF3(CF2)n. The problem then is to find molecules that have the SO3H fragment, and perhaps a benzene ring, which would allow the synthetic chemist to replace an alkyl group with something that accounts for the electron-withdrawing property of CF3CF2. The chemist would like to look for a candidate molecule based on its similarity to the molecular formula of the fragment, or the structure of the benzene, or some weighted combination of both. It is quite possible that the needed information already exists in the literature, but it may be costly and time consuming to discover. A system that allows users to search and analyze documents, such as patents, at the molecular level could be a tremendously useful tool for biomedical research. In this paper we describe a system that leverages text mining techniques to annotate and index chemical entities, provide graphical document searching, and discover biomedical/molecular relationships on demand. We prove the viability of such a system by indexing and analyzing the entire US Patent corpus from 1976-2005, and we present comparative results between molecular searching and traditional keyword-based approaches.

2. Extracting Chemicals

The first step in the process is to extract chemical compounds from the Patent corpus.
We developed two annotators which automatically parsed text and extracted potential chemical compounds. All of the potential chemicals were then fed through a name-to-structure program such as the Name=Struct® program from CambridgeSoft Corporation. Name=Struct makes no value judgments, focusing only on providing a structure that the name accurately describes.1 The output of Name=Struct in our system is a connection table. Using the openly available InChI code,10 these connection tables are converted into InChI strings. Due to page limits, this paper focuses on the similarity search technology. We have built a machine learning and dictionary based chemical annotator that can extract chemical names out of text and convert them
into structures. The similarity search capability is built on top of such annotation results, but is not tied to any specific underlying annotator implementation.

3. Indexing

As the new IUPAC International Chemical Identifier (InChI) standard continues to emerge, there is an increasing need to use InChI codes beyond compound identification. Given our background in text analytics, we reduced the problem to finding similar compounds based on the textual representation of the structure. Our experiments focused on the use of InChIs as a method for identifying similar compounds. Using our annotators we were able to extract 3,623,248 unique InChIs from the US Patent database (1976-2005) and Patent Applications (2001-2005). From this collection of InChIs an index was constructed using text mining techniques. We employed a traditional vector space model14 as our underlying data structure.

3.1. Vector Representation
InChIs are unique for each molecule and consist of multiple layers that describe different aspects of the molecule, as depicted in Figure 1. The first three layers (formula, connection and hydrogen) are considered the main layers (see15) and are the layers we used for our experiments. Using the main layers, we extracted unique features from a collection of InChI codes.

Fig. 1. A compound (caffeine) and its InChI description: InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
We defined features as one- to three-character unique phrases in the connection and hydrogen layers, and unique atoms or symbols in the formula layer. Features from each layer are preceded by a layer identifier. For the
connection and hydrogen layers, features for an InChI i with characters c_j can be defined as the unique terms c_j, c_j c_{j+1}, and c_j c_{j+1} c_{j+2}. These terms are added to the overall set of terms T, which includes the unique c_j from the formula layer. Given a collection of InChIs U with terms T_j, each InChI is represented by the vector

I_i = (d_i1, d_i2, ..., d_ij)

where d_ij represents the frequency of the jth term in the InChI. For example, the two InChIs InChI=1/H2O/h1H2 and InChI=1/N2O/c1-2-3 would produce the features H, O, h1, h1H, h1H2, hH, hH2, h2, N, c1, c1-, c1-2, c-, c-2, c-2-, c2, c2-, c2-3, c-3, c3 with the vector representations {2,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0} for water and {0,1,0,0,0,0,0,0,2,1,1,1,2,1,1,1,1,1,1,1} for nitrous oxide. In our experiments, the formula, connection and hydrogen layers produced 963, 69,334 and 55,256 features respectively. This makes the combined dimensionality of the dataset T = 125,553. Feature values are always nonnegative integers. To take the frequency of features into account when computing the similarity distance calculation, we represented the vectors in unary notation, where each of the three feature spaces is expanded by the maximum value of a feature in that space. This causes the dimensionality to explode to 31,288,976 features, and the sparsity increases proportionally. Of course, this unary representation is implicit and need not be implemented explicitly. Each InChI is processed by building three vectors for it, which are then added to the respective vector space models. The results are three vector space models of size 309MB, 950MB and 503MB for the formula (F1), connection (F2) and hydrogen (F3) layers. Each vector space model Fj defines a distance function Dj by taking the Tanimoto19 coefficient between the corresponding vectors. Consequently, for every two molecules x and y there are three distances defined between them, namely D1(x,y), D2(x,y) and D3(x,y).
3.2. Index Implementation
For indexing the vector space models we implemented the Locality Sensitive Hashing (LSH) technique of Indyk and Motwani.9 A major benefit of the algorithm is the relative size of the index compared to the overall vector space. In our implementation the objects (and their feature vectors) do not need to be replicated. Vectors are computed for each InChI and stored only in a single repository. Each index maintains a selection of k vector positions and a standard hash function for producing an actual bucket number. The buckets themselves are individual files on the file system, and they contain pointers to (or serial numbers of) vectors in the aforementioned single repository. This allows both the entire index and each bucket to remain small. This implementation is practical because the single large repository still fits in our computer's main memory (RAM). During index creation, not all hash buckets are populated. Additionally, the number of data points per hash bucket may vary quite a bit. In our implementation, buckets were limited to a maximum of B = 1000. The end result is an LSH index Lj for each of the three layers of the InChI.

3.3. Query Processing
For each query molecule Q, vectors dj are created from each vector space model Fj. Each vector is then processed by the LSH index Lj corresponding to the given layer. The LSH index provides a list of potential candidates Ci, which are then evaluated against the query vectors using the Tanimoto coefficient. The total similarity for each candidate Ci is computed by

S_i = (1/n) Σ_{j=1}^{n} D_j(Q, C_i)    (1)
where n is the total number of vector space models. The Tanimoto coefficient has been widely used as an effective measure of intermolecular similarity in both the clustering and searching of databases.6 While Willett et al.19 discuss six different coefficients for chemical similarity, we found that the Tanimoto coefficient was the most widely recognized calculation among our users. The results are then aggregated so that each vector with the same S is merged into a set of synonyms. By dereferencing the vectors to the InChIs they represent, and further dereferencing the InChIs to the original text within the corpus, a list of the top K matching chemical names and the respective documents that contain those names is returned.
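The scoring step can be sketched as below, using the sum-of-min over sum-of-max form of the Tanimoto coefficient, which coincides with the binary Tanimoto on the implicit unary expansion of count vectors described in Sec. 3.1; function names are our own illustration:

```python
def tanimoto(x, y):
    """Tanimoto coefficient on count vectors via sum-of-min over
    sum-of-max; equivalent to binary Tanimoto after unary expansion."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0

def total_similarity(query_layers, cand_layers):
    """Average the per-layer Tanimoto similarities over the formula,
    connection and hydrogen vector space models, as in Eq. (1)."""
    n = len(query_layers)
    return sum(tanimoto(q, c) for q, c in zip(query_layers, cand_layers)) / n
```

Candidates Ci drawn from the LSH buckets would be scored this way, and vectors sharing the same S merged into synonym sets before the top K names are returned.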
4. Experimental Results

In order to explain the experimental results, an overview of the application as it is currently implemented is required. We will conclude with a full description of the experimental process and its results.

4.1. Graphical Similarity Search
To use the Chemical Search Engine, a user may draw the chemical structure of the molecule to be searched, enter an InChI or SMILES string representing the molecule into a text field, or open a file which stores a SMILES or InChI value in the corresponding field. The engine converts the query into an InChI and returns a listing of molecules and their similarity values. Beside each molecule image is its similarity to the search molecule entered, its IUPAC name, an expandable list of synonyms, and the number of patents that were found containing that molecule, as seen in Fig. 2. Not surprisingly, for a query of a sketch of caffeine, the engine returned over 8,500 patents that contained a molecule with a similarity of 1.0, meaning that there was an exact match, and over 52 synonyms for that molecule. Six molecules with a similarity above 0.8 were rendered. For the experimental results, the canonical SMILES for the tested drug in the PubChem database was entered into the text field.
Fig. 2. Search results
4.2. Molecular Networks
In the upper right hand corner of the results page, the user may click on three different links to view selected molecules and their patents: either as a graph using Graph Results, as a listing of hyper-linked patents with View Patents, or as an analysis of claims with Claim Analysis. In this section, we will describe and illustrate the usefulness of the Graph Results page and, in the following, the Claim Analysis. The value of a graphical representation of the selected molecules and their corresponding patents is most evident if we select the molecules with similar affinities to caffeine, but not exact matches to caffeine. Fig. 3 shows a graph of the four molecules with the closest similarity to caffeine less than 1. In the graph, the search node is fixed as the center node and molecular representations of the other nodes surround it. In the future, the graph will also display each molecule's similarity to the search node, as indicated by the thickness of its edge to the center (search) node. When the user rolls over the center node, the comment "Search Node" is viewed, whereas for the other nodes the name of the molecule is displayed. Note that some of the same molecules have different names. The leaf nodes are the patents and patent applications associated with each molecule. If double-clicked, the node will launch a browser window displaying the corresponding patent or application. A mouseover of these nodes will render the name of the assignee of the document. The nodes are color-coded by assignee. A researcher may use this graph to view which molecules are most like the search node and, of those molecules, which have the greatest number of patents associated with them. It is also very useful for determining which assignees have the greatest number of patents for a particular molecular structure.
4.3. Affinity Analysis
The Claim Analysis page examines the claims of the patents associated with the molecules selected on the previous page to determine which medical conditions were found in the greatest number of patents. The more patents that mention a particular condition, the higher the condition's affinity to the molecule. Notice in Fig. 4 that for caffeine, migraine and headache have a high affinity, nausea and anxiety a moderate one, and burns and cough a low affinity.
Fig. 3. Graph of selected molecules
The conditions were derived from a dictionary of proteins, diseases, and biomarkers. A dictionary-based annotator annotates the full text of the selected patents in real time to extract the meaningful terms. A chi-squared test referencing the number of patents that contained the conditions was used to determine the affinity between the molecules and the conditions. On expanding a condition in the Claim Analysis page, a listing of the patents mentioning the condition in their text is rendered. The patent names are links to the actual patents. Thus, a researcher looking to patent a drug may do a search on the molecule and uncover what other uses the molecule has been patented for before. Such data may also serve to discover unexpected side effects or complications of a drug for the purposes of testing its safety.

4.4. Results

To evaluate the engine's effectiveness, we used a listing of the top 50 brand-name drugs prescribed in 2004 as provided by Humana.8 We acquired a
Fig. 4. Claims analysis of selected molecules
canonical SMILES value associated with each of the 25 top prescribed drugs from the PubChem database.7 PubChem could not provide the SMILES for two of the drugs, Yasmin 28 and Ortho Evra. If more than one molecule was returned from the database, we used the canonical SMILES value of the first one listed, except in the case of three of the drugs, Toprol XL, Premarin, and Plavix. In these cases, we used the SMILES string that returned the greatest number of matches when we performed a search on the chemical search engine. With the generic name of the drug, we performed a search on one of the most sophisticated patent databases known, Delphion, using a boolean query that examined the abstracts, titles, claims, and descriptions of patents from January 1, 1976 to December 31, 2005. The results can be seen in Fig. 5. On acquiring the 25 drug names, the first obstacle was that 2 of the drugs could not be found in the PubChem database, so the canonical SMILES for these drugs could not be determined. Out of the 23 drugs that remained, our results indicate that for 19 of them more patents associated with the drug were found on our system than on Delphion. In the instances where the engine found more matches, the number of matches that it found was in some cases up to 10 times more, because the search was based on
the molecular structure of the match and not on the generic name. The number of times that a text-based search outperformed the molecular search may be attributed to a mis-selection of the SMILES string from the PubChem database. Thus, one of the greatest limitations of the chemical search engine is finding an accurate SMILES string for a given drug. Nevertheless, our experimental results demonstrate the enormous potential of being able to search the patent database based on a molecular structure.
Fig. 5. A graph comparing the results of searching for the top 25 drugs listed by Humana8 on the Chemical Search Engine using a molecular search and on Delphion performing a text search of the compound's name.
5. Conclusion

We developed a practical system which leverages text analytics for indexing, searching and analyzing documents based on molecular information. Our results demonstrate that graphical structure search is a far more effective way to explore a document corpus than traditional keyword-based queries when searching for biomedical related literature. The system is flexible and may be expanded to include other data sources besides patents. These additional data sources would allow for meta-data information to
be tied to Patents through chemical annotations. Future versions may allow researchers to explore data sets based on chemical properties such as toxicity or molecular weight. In addition to discovering literature for an exact match, this tool can be used for identifying practical applications of a compound or possible negative side effects by examining the literature surrounding similar compounds.
References
1. J. Brecher. Name=Struct: A practical approach to the sorry state of real-life chemical nomenclature. Journal of Chemical Information and Computer Science, 39:943-950, 1999.
2. A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gunshurst, D. L. Grier, B. A. Leland, and J. Laufer. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information and Computer Science, 32(3):244-255, 1992.
3. Daylight Chemical Information Systems, Inc. Daylight Theory: Fingerprints, 2005. http://www.daylight.com/dayhtml/doc/theory/theory.finger.html.
4. Daylight Chemical Information Systems, Inc. Daylight Cheminformatics SMILES, 2006. http://daylight.com/smiles.
5. GNU FDL. Open Babel, 2006. http://openbabel.sourceforge.net.
6. D. Flower. On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Science, 38(3):379-386, 1998.
7. National Center for Biotechnology Information. PubChem, 2006. http://pubchem.ncbi.nlm.nih.gov/search.
8. Humana. Top 50 brand-name drugs prescribed, 2005. http://apps.humana.com/prescription_benefits_and_services/incl_des/Top50BrandDrugs.pdf.
9. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604-613, May 1998.
10. IUPAC. The IUPAC International Chemical Identifier (InChI™), 2005. http://www.iupac.org/inchi.
11. S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
12. Elsevier MDL. CTfile formats, 2005. http://www.mdl.com/downloads/public/ctfile/ctfile.pdf.
13. Elsevier MDL. MDL ISIS/Base, 2006. http://www.mdli.com/support/knowledgebase/faqs/faq_ib_22.jsp.
14. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613-620, 1975.
15. S. E. Stein, S. R. Heller, and D. Tchekhovskoi. An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier. In Proceedings of the 2003 International Chemical Information Conference (Nimes), 2003.
16. Murray-Rust Research Group, The University of Cambridge. The Unofficial InChI FAQ, 2006. http://wwmm.ch.cam.ac.uk/inchifaq/.
17. D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Science, 28(1):31-36, 1988.
18. D. Weininger, A. Weininger, and J. L. Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Science, 29(2):97-101, 1989.
19. P. Willett, J. M. Barnard, and G. M. Downs. Chemical Similarity Searching. Journal of Chemical Information and Computer Science, 38(6):983-996, 1998.
DISCOVERING IMPLICIT ASSOCIATIONS BETWEEN GENES AND HEREDITARY DISEASES
KAZUHIRO SEKI Graduate School of Science and Technology, Kobe University 1-1 Rokkodai, Nada, Kobe 657-8501, Japan E-mail:
[email protected] JAVED MOSTAFA Laboratory of Applied Informatics Research, Indiana University 1320 E. 10th St., LI 011, Bloomington, Indiana 47405-3907 E-mail:
[email protected] We propose an approach to predicting implicit gene-disease associations based on the inference network, whereby genes and diseases are represented as nodes and are connected via two types of intermediate nodes: gene functions and phenotypes. To estimate the probabilities involved in the model, two learning schemes are compared; one baseline using co-annotations of keywords and the other taking advantage of free text. Additionally, we explore the use of domain ontologies to complement data sparseness and examine the impact of full text documents. The validity of the proposed framework is demonstrated on the benchmark data set created from real-world data.
1. Introduction

The ever-growing textual data make it increasingly difficult to effectively utilize all the information relevant to our interests. For example, Medline, the most comprehensive bibliographic database in life science, currently indexes approximately 5,000 peer-reviewed journals and contains over 17 million articles. The number of articles is increasing rapidly, by 1,500-3,000 per day. Given the substantial volume of the publications, it is crucial to develop intelligent information processing techniques, such as information retrieval (IR), information extraction (IE), and text data mining (TDM), that could help us manage the information overload. In contrast to IR and IE, which deal with information explicitly stated in documents, TDM aims to discover heretofore unknown knowledge through automatic analysis of textual data.1 A pioneering work in TDM (or literature-based discovery) was conducted by Swanson in the 1980s. He argued that there were
two premises that were logically connected, but whose connection had gone unnoticed due to overwhelming publications and/or over-specialization. For instance, given two premises A → B and B → C, one could deduce a possible relation A → C. To prove the idea, he manually analyzed numerous articles and identified logical connections implying a hypothesis that fish oil was effective for the clinical treatment of Raynaud's disease.2 The hypothesis was later supported by experimental evidence. Based on his original work, Swanson and other researchers have developed computer programs to aid hypothesis discovery (e.g., see Refs. 3 and 4). Despite the prolonged efforts, however, the research in literature-based discovery can be seen to be at an early stage of development in terms of models, approaches, and evaluation methodologies. Most of the previous work was largely heuristic without a formal model, and evaluation was limited to only a small number of hypotheses that Swanson had proposed. This study is also motivated by Swanson's work and attempts to advance the research in literature-based discovery. Specifically, we will examine the effectiveness of the models and techniques developed for IR, the benefit of free- and full-text data, and the use of domain ontologies for more robust system predictions. Focusing on associations between genes and hereditary diseases, we develop a discovery framework adapting the inference network model5 from IR, and we conduct various evaluative experiments on realistic benchmark data.
2. Task Definition

Among the many types of information that are of potential interest to biomedical researchers, this study targets associations between genes and hereditary diseases as a test bed. Gene-disease associations are the links between genetic variants and the diseases whose susceptibility those variants influence. For example, BRCA1 is a human gene encoding a protein that suppresses tumor formation. A mutation of this gene increases the risk of breast cancer. Identification of these genetic associations has tremendous importance for the prevention, prediction, and treatment of diseases. In this context, predicting or ranking candidate genes for a given disease is crucial to select the more plausible ones for genetic association studies. Focusing on gene-disease associations, we assume a disease name and known causative genes, if any, as system input. In addition, a target region in the human genome may be specified to limit the search space. Given such input, we attempt to predict an (unknown) causative gene and produce a ranked list of candidate genes.
3. Proposed Approach

Focusing on gene-disease associations, we explored the use of a formal IR model, specifically the inference network,5 for this related but different problem targeting implicit associations. The following details the proposed model and how to estimate the probabilities involved in the model.

3.1. Inference Network for Gene-Disease Associations

In the original IR model, a user query and documents are represented as nodes in a network and are connected via intermediate nodes representing the keywords that compose the query and documents. To adapt the model to represent gene-disease associations, we treat the disease as the query and genes as documents, and use two types of intermediate nodes: gene functions and phenotypes, which characterize genes and disease, respectively (Fig. 1). An advantage of using this particular IR model is that it is essentially capable of incorporating multiple intermediate nodes. Other popular IR models, such as the vector space model, are not easily applicable as they are not designed to have different sets of concepts to represent documents and queries.
Figure 1. Inference network for gene-disease associations: mutated genes → gene functions (GO terms) → phenotypes (MeSH C terms) → disease.
The network consists of four types of nodes: genes (g), gene functions (f) represented by Gene Ontology (GO) terms (http://www.geneontology.org), phenotypes (p) represented by MeSH C terms (http://www.nlm.nih.gov/mesh), and disease (d). Each gene node g represents a gene and corresponds to the event that the gene is found in the search for the causative genes underlying d. Each gene function node f represents a function of gene products. There
are directed arcs from genes to functions, representing that instantiating a gene increases the belief in its functions. Likewise, each phenotype node p represents a phenotype of d and corresponds to the event that the phenotype is observed. The belief in p is dependent on the belief in the f's, since phenotypes are (partly) determined by gene functions. Finally, observing certain phenotypes increases the belief in d. As described in the following, the associations between genes and gene functions (g → f) are obtained from an existing database, Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene), whereas both the associations between gene functions and phenotypes (f → p) and the associations between phenotypes and disease (p → d) are derived from the biomedical literature. Given the inference network model, disease-causing genes can be predicted based on the probability defined below.
P(d|G) = Σ_i Σ_j P(d|p̄_i) × P(p̄_i|f̄_j) × P(f̄_j|G)    (1)
Equation (1) quantifies how much a set of candidate genes, G, increases the belief in the development of disease d. In the equation, p̄_i (or f̄_j) is defined as a vector of random variables with the i-th (or j-th) element being positive (1) and all others negative (0). By applying Bayes' theorem and some independence assumptions discussed later, we derive
P(d|G) ∝ Σ_i Σ_j [ P(p_i|d)/P(p_i) × P(f_j|p_i)/P(f_j|p̄_i) × F(p_i) × F(f_j) × P(f_j|G) ]    (2)

where

F(p_i) = Π_k P(¬p_k|d) / Π_k P(¬p_k),   F(f_j) = Π_l P(¬f_l|p_i) / Π_l P(¬f_l)    (3)
The first factor of the right-hand side of Eq. (2) represents the interaction between disease d and phenotype p_i, and the second factor represents the interaction between p_i and gene function f_j, which is equivalent to the odds ratio of P(f_j|p_i) and P(f_j|p̄_i). The third and fourth factors are functions of p_i and f_j, respectively, representing their main effects. The last factor takes either 0 or 1, indicating whether f_j is a function of any gene in G under consideration. The inference network described above assumes independence among phenotypes, among gene functions, and among genes. We assert, however, that the effects of such associations are minimal in the proposed model. Although there may be strong associations among phenotypes (e.g., phenotype p_x is often observed with phenotype p_y), the model does not intend to capture those associations. That
is, phenotypes are attributes of the disease in question, and we only need to know those that are frequently observed with disease d so as to characterize d. The same applies to gene functions; they are only attributes of the genes to be examined and are simply used as features to represent the genes under consideration.

3.2. Probability Estimation

3.2.1. Conditional Probabilities P(p|d)

Probability P(p|d) can be interpreted as a degree of belief that phenotype p is observed when disease d has developed. To estimate the probability, we take advantage of the literature data. Briefly, given a disease name d, a Medline search is conducted to retrieve articles relevant to d and, within the retrieved articles, we identify phenotypes (MeSH C terms) strongly associated with the disease based on chi-square statistics. Given disease d and phenotype p, the chi-square statistic is computed as

χ²(d, p) = N(n11·n22 − n21·n12)² / [(n11+n21)(n12+n22)(n11+n12)(n21+n22)]    (4)

where N is the total number of articles in Medline, n11 is the number of articles assigned p and included in the retrieved set (denoted as R), n22 is the number of articles not assigned p and not included in R, n21 is the number of articles not assigned p and included in R, and n12 is the number of articles assigned p and not in R. The resulting chi-square statistics are normalized by the maximum to treat them as probabilities P(p|d).

3.2.2. Conditional Probabilities P(f|p)

Probability P(f|p) indicates the degree of belief that gene function f underlies phenotype p. For probability estimation, this study adopts a framework similar to the one proposed by Perez-Iratxeta et al.6 Unlike them, however, this study focuses on the use of textual data and domain ontologies and investigates their effects on literature-based discovery. As training data, our framework uses Medline records that are assigned any MeSH C terms and are cross-referenced from any gene entry in Entrez Gene. For each such record, we can obtain a set of phenotypes (the assigned MeSH C terms) and a set of gene functions (GO terms) associated with the cross-referencing gene from Entrez Gene. Considering the fact that the phenotypes and gene functions are associated with the same Medline record, it is likely that some of the phenotypes and gene functions are associated. A question is, however, which phenotypes and functions are associated and how strong those associations are.
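The P(p|d) estimation of Sec. 3.2.1, Eq. (4) followed by max-normalization, can be sketched as follows; the function names and the shape of the contingency-table input are our own illustration:

```python
def chi_square(n11, n12, n21, n22, N):
    """Eq. (4): chi-square statistic for phenotype p vs. retrieved set R.
    n11: assigned p and in R; n12: assigned p, not in R;
    n21: in R, not assigned p; n22: neither. N: total Medline articles."""
    den = (n11 + n21) * (n12 + n22) * (n11 + n12) * (n21 + n22)
    return N * (n11 * n22 - n21 * n12) ** 2 / den if den else 0.0

def p_phenotype_given_disease(tables, N):
    """Normalize the chi-square scores by their maximum so they can be
    treated as probabilities P(p|d)."""
    scores = {p: chi_square(*t, N) for p, t in tables.items()}
    m = max(scores.values())
    return {p: s / m for p, s in scores.items()} if m else scores
```

A table with no association (equal counts in every cell) scores zero, and the most strongly associated phenotype receives probability 1.0 after normalization.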
We estimate those possible associations using two different schemes: SchemeK and SchemeT. SchemeK simply assumes a link between every pair of the phenotypes and gene functions with equal strength, whereas SchemeT seeks evidence in the textual portion of the Medline record, i.e., title and abstract, to better estimate the strength of associations. Essentially, SchemeT searches for co-occurrences of gene functions (GO terms) and phenotypes (MeSH terms) in a sliding window, assuming that associated concepts tend to co-occur in the same context more often than unassociated ones. A problem for SchemeT, however, is that gene functions and phenotypes are descriptive by nature and may not be expressed in concise GO and MeSH terms. In fact, Schuemie et al. analyzed 1,834 articles and reported that less than 30% of the MeSH terms assigned to an article actually appear in its abstract, and only 50% even in its full text. This suggests that relying on mere occurrences of MeSH terms would fail to capture many true associations. To deal with the problem, we apply the idea of query expansion, a technique used in IR to enrich a query by adding related terms. If GO and MeSH terms are somehow expanded, there is more chance that they will co-occur in text. For this purpose, we use the definitions (or scope notes) of GO and MeSH terms and identify representative terms by inverse document frequency (IDF), which has long been used in IR to quantify the specificity of terms in a given document collection. We treat term definitions as documents and define the IDF of term t as log(N/Freq(t)), where N denotes the total number of MeSH C (or GO) terms and Freq(·) denotes the number of MeSH C (or GO) terms whose definitions contain term t. Only the terms with high IDF values are used as the proxy terms to represent the starting concept, i.e., a gene function or phenotype.
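The proxy-term construction and the sliding-window matching it feeds can be sketched as follows; `tokenize`, `idf_threshold`, and the container shapes are our own simplifications, not the paper's implementation:

```python
import math
import re


def tokenize(text):
    """Lowercase word tokenizer used throughout this sketch."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(term, definitions):
    """IDF of a term over the collection of MeSH C (or GO) term definitions:
    log(N / Freq(t)), with Freq(t) = number of definitions containing t."""
    freq = sum(1 for d in definitions if term in tokenize(d))
    return math.log(len(definitions) / freq) if freq else 0.0


def proxy_terms(definition, definitions, idf_threshold):
    """Keep only the high-IDF terms of a concept's definition as its proxy."""
    return {t for t in set(tokenize(definition))
            if idf(t, definitions) >= idf_threshold}


def association_score(tokens, proxy_f, proxy_p, w, window):
    """SchemeT-style score: accumulate w(tf) * w(tp) for proxy-term pairs
    co-occurring within `window` tokens, normalized by the proxy sizes."""
    total = 0.0
    for i, t in enumerate(tokens):
        if t in proxy_f:
            context = tokens[max(0, i - window):i + window + 1]
            total += sum(w[t] * w[u] for u in context if u in proxy_p)
    return total / (len(proxy_f) * len(proxy_p))
```

Specific, rare terms (high IDF) survive the threshold and act as stand-ins for the concept, while common words drop out.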
Each co-occurrence of the two sets of proxy terms (one representing a gene function and the other representing a phenotype) can be seen as evidence that supports the association between the gene function and phenotype, increasing the strength of their association. We define the increase in strength as the product of the term weights, w, for the two co-occurring proxy terms. Then, the strength of the association between gene function f and phenotype p within article a, denoted as S(f, p, a), can be defined as the sum of the increases for all co-occurrences of the proxy terms in a. That is,

S(f, p, a) = Σ_{(t_f, t_p, a)} w(t_f) · w(t_p) / (|Proxy(f)| · |Proxy(p)|)    (5)

where t_f and t_p denote any terms in the proxy terms for f and p, respectively, and (t_f, t_p, a) denotes the set of all co-occurrences of t_f and t_p within a. The product of the term weights is normalized by the proxy sizes, |Proxy(·)|, to eliminate the effect of different proxy sizes. As the term weight w, this study used the TF·IDF weighting
scheme. For term t_p, for instance, we define TF(t_p) as 1 + log Freq(t_p, Def(p)), where Def(p) denotes p's definition and Freq(t_p, Def(p)) denotes the number of occurrences of t_p in Def(p). The association scores S(f, p, a) are computed for each cross reference (a pair of a Medline record and a gene) by either SchemeK or SchemeT and are accumulated over all articles to estimate the associations between f's and p's, denoted as S(f, p). Based on these associations, we define probability P(f|p) as S(f, p) / Σ_f S(f, p).

A possible shortcoming of the approach described above is that the obtained associations S(f, p) are symmetric despite the fact that the network presented in Fig. 1 is directional. However, since it is known that an organism's genotype (in part) determines its phenotype, and not the other way around, we assume that the estimated associations between gene functions and phenotypes are directed from the former to the latter.

3.2.3. Enhancing Probability Estimates P(f|p) by Domain Ontologies

The proposed framework may not be able to establish true associations between gene functions and phenotypes for various reasons; e.g., the amount of training data may be insufficient. Those true associations may be uncovered using the structure of MeSH and/or GO. MeSH and GO have a hierarchical structure,(d) and terms located nearby in the hierarchy are semantically close to each other. Taking advantage of these semantic relations, we enhance the learned probabilities P(f|p) as follows. Let us denote by A the matrix whose element a_ij is the probability estimate P(f_j|p_i) and by A' the updated or enhanced matrix. Then, A' is formalized as A' = W_p A W_f, where W_p denotes an n×n matrix with element w_p(i, j) indicating the proportion of a probability to be transmitted from phenotype p_j to p_i. Similarly, W_f is an m×m matrix with w_f(i, j) indicating the proportion transmitted from gene function f_j to f_i. This study experimentally uses only direct child-to-parent and parent-to-child relations and defines w_p(i, j) as

w_p(i, j) = 1                            if i = j
          = 1 / (# of children of p_j)   if p_i is a child of p_j
          = 1 / (# of parents of p_j)    if p_i is a parent of p_j
          = 0                            otherwise    (6)

(d) To be precise, GO's structure is a directed acyclic graph, allowing multiple parents.
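The transmit matrices of Eq. (6) and the update A' = W_p A W_f can be sketched with plain nested lists; the hierarchy encoding (each node's list of direct children) and the function names are our assumption:

```python
def transmit_matrix(children, n):
    """W of Eq. (6): column j sends probability from node j to itself (1),
    to each of its children (1 / #children(j)), and to each of its parents
    (1 / #parents(j)); all other entries are 0."""
    parents = [[] for _ in range(n)]
    for j, cs in enumerate(children):
        for c in cs:
            parents[c].append(j)
    W = [[0.0] * n for _ in range(n)]
    for j in range(n):
        W[j][j] = 1.0
        for c in children[j]:          # i is a child of j
            W[c][j] = 1.0 / len(children[j])
        for p in parents[j]:           # i is a parent of j
            W[p][j] = 1.0 / len(parents[j])
    return W


def matmul(X, Y):
    """Plain matrix product used by the enhancement step."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def enhance(A, Wp, Wf):
    """One enhancement step A' = Wp * A * Wf; iterating reaches relatives
    more distant than direct parents and children."""
    return matmul(matmul(Wp, A), Wf)
```

With identity transmit matrices the step leaves A unchanged, which makes the iterative application easy to sanity-check.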
Equation (6) means that the amount of probability is split equally among a node's children (or parents). Similarly, w_f(i, j) is defined by exchanging i and j in the right-hand side of Eq. (6). Note that the enhancement process can be applied iteratively to take advantage of more distant relationships than children/parents.

4. Evaluation

To evaluate the validity of the proposed approach, we implemented a prototype system and conducted various experiments on benchmark data sets created from the genetic association database (GAD).(e) GAD is a manually curated archive of human genetic studies, containing pairs of genes and diseases that are known to have causative relations.

4.1. Creation of Benchmark Data

For evaluation, benchmark data sets were created as follows using the real-world data obtained from GAD.

(1) Associate each gene-disease pair with the publication date of the article from which the entry was created. The date can be seen as the time when the causative relation became public knowledge.
(2) Group gene-disease pairs based on disease names. As GAD deals with complex diseases, a disease may be paired with multiple genes.
(3) For each pair of a disease and its causative genes,
    (a) Identify the gene whose relation to the disease was most recently reported based on the publication date. If the date is on or after 7/1/2003, the gene is used as the target (i.e., new knowledge), and the disease and the rest of the causative genes are used as system input (i.e., old knowledge).
    (b) Remove the most recently reported gene from the set of causative genes and repeat the previous step (3a).

The separation of the data by publication dates ensures that the training phase does not use new knowledge, in order to simulate gene-disease association discovery. The particular date was chosen arbitrarily, considering the size of the resulting data and the resources available for training.
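The benchmark construction above can be sketched as follows; the tuple representation of GAD entries and the ISO date strings are our assumption:

```python
from collections import defaultdict


def build_benchmark(gad_pairs, cutoff="2003-07-01"):
    """gad_pairs: iterable of (disease, gene, pub_date) with ISO date strings.

    Repeatedly peel off each disease's most recently reported gene as the
    target (new knowledge) while its date is on/after the cutoff; the
    remaining genes and the disease form the system input (old knowledge)."""
    by_disease = defaultdict(list)
    for disease, gene, date in gad_pairs:
        by_disease[disease].append((date, gene))

    cases = []
    for disease, dated_genes in by_disease.items():
        dated_genes.sort()                     # oldest first
        while len(dated_genes) > 1 and dated_genes[-1][0] >= cutoff:
            _, target = dated_genes.pop()      # most recent report
            cases.append((disease, [g for _, g in dated_genes], target))
    return cases
```

ISO dates compare correctly as strings, so no date parsing is needed for the cutoff test.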
Table 1 shows the number of gene-disease associations in the resulting test data, categorized under six disease classes defined in GAD. In the following experiments, the cancer class was used for system development and parameter tuning.

(e) http://geneticassociationdb.nih.gov
Table 1. Number of gene-disease associations in the benchmark data.

Cancer  Cardiovascular  Immune  Metabolic  Psych  Unknown  Total
45      36              61      23         12     80       257
4.2. Experimental Setup

Given input (disease name d, known causative genes, and a target region), the system computes the probability P(d|G) as in Eq. (3) for each candidate gene g located in the target region, where G is the set of the known causative genes plus g. The candidate genes are then output in decreasing order of their probabilities as the system output. As the evaluation metric, we use the area under the ROC curve (AUC) for its attractive properties as compared to the F-score measure (see Ref. 8 for more details). ROC curves are a two-dimensional measure of system performance, with the x axis being the false positive proportion (FPP) and the y axis the true positive proportion (TPP). TPP is defined as TP/(TP+FN), and FPP as FP/(FP+TN), where TP, FP, FN, and TN denote the number of true positives, false positives, false negatives, and true negatives, respectively. AUC takes a value between 0 and 1, with 1 being the best. Intuitively, AUC indicates the probability that a gene randomly picked from the positive set is scored more highly by the system than one picked from the negative set.

For data sets, this study used a subset of the Medline data provided for the TREC Genomics Track 2004.9 The data consist of the records created between the years 1994 and 2003, which account for around one-third of the entire Medline database. Within these data, 29,158 cross-references (pairs of a Medline record and a gene) were identified as training data such that they satisfied all of the following conditions: 1) the Medline records are assigned one or more MeSH C terms to be used as phenotypes; 2) the Medline records are cross-referenced from Entrez Gene to obtain gene functions; 3) the cross references are not from the target genes, to avoid using possible direct evidence; 4) the Medline records have publication dates before 7/1/2003, to avoid using new knowledge.
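The rank interpretation of AUC described above (a randomly picked positive gene outranks a randomly picked negative one, counting ties as one half) can be computed directly; the function below is our illustration, not the paper's code:

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via its rank interpretation:
    the fraction of positive/negative pairs the positive wins,
    with ties counting 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positive_scores
               for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))
```

A perfect ranking gives 1.0 and a random one about 0.5, matching the 0.5 baseline the results section compares against.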
Using the cross references and the test data in the cancer class, several parameters were empirically determined for each scheme, including the number of Medline articles used as the source of phenotypes (nm), the threshold for chi-square statistics to determine phenotypes (tc), the threshold for IDF to determine proxy terms (ti), and the window size for co-occurrences (ws). For SchemeT, they were set as nm=700, tc=2.0, ti=5.0, and ws=10 (words) by testing a number of combinations of their possible values.
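Testing combinations of parameter values on the development class amounts to an exhaustive sweep; a generic sketch, where the `evaluate` callback standing in for a full train-and-score run is hypothetical:

```python
from itertools import product


def tune_parameters(evaluate, grid):
    """Try every combination in `grid` (e.g., nm, tc, ti, ws) on the
    development class and return the combination with the highest AUC."""
    keys = sorted(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```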
4.3. Results

4.3.1. Overall Performance

With the best parameter settings learned in the cancer class, the system was applied to all the other classes. Table 2 shows the system performance in AUC.

Table 2. System performance in AUC for each disease class. The figures in the parentheses indicate percent increase/decrease relative to SchemeK.

Scheme   Cardiovascular  Immune         Metabolic      Psych          Unknown        Overall
SchemeK  0.677           0.686          0.684          0.514          0.703          0.682
SchemeT  0.737 (8.9%)    0.668 (-2.6%)  0.623 (-9.0%)  0.667 (29.8%)  0.786 (11.7%)  0.713 (4.6%)
Both SchemeK and SchemeT achieved significantly higher AUC than 0.5 (i.e., random guessing), indicating the validity of the general framework adapting the inference network to this particular problem. Comparing the two schemes, SchemeT does not always outperform SchemeK, but, overall, AUC improved by 4.6%. The result suggests the advantage of using textual data to acquire more precise associations between concepts. Incidentally, without the proxy terms described in Section 3.2.2, the overall AUC of SchemeT decreased to 0.682 (not shown in Tab. 2), verifying their effectiveness.

4.3.2. Impact of Full-Text Articles

This section reports preliminary experiments examining the impact of full-text articles on literature-based discovery. Since full-text articles provide more comprehensive information than abstracts, they are thought to be beneficial in the proposed framework. We used the full-text collection from the TREC Genomics Track 2004,9 which contains 11,880 full-text articles. However, the conditions described in Section 4.2 inevitably decreased the number of usable articles to 679. For a fair comparison, we conducted comparative experiments using these full-text articles and only the corresponding 679 abstracts in estimating P(f|p). Note that, due to the small data size, these results cannot be directly compared to those reported above. Table 3 summarizes the results obtained based on only titles and abstracts ("Abs") and complete full-text articles ("Full") using SchemeT. Examining each disease class, it is observed that the use of full-text articles leads to a large improvement over using abstracts, except for the immune class. Overall, the improvement achieved by full texts is 5.1%, indicating the potential advantage of full-text articles.
Table 3. System performance in AUC based on 679 articles. The figures in the parentheses indicate percent increase/decrease relative to Abs.

Text  Cardiovascular  Immune         Metabolic      Psych          Unknown       Overall
Abs   0.652           0.612          0.566          0.623          0.693         0.643
Full  0.737 (13.0%)   0.590 (-3.6%)  0.640 (13.0%)  0.724 (16.2%)  0.731 (5.5%)  0.676 (5.1%)
4.3.3. Enhancing Probability Estimates by Domain Ontologies

In order to examine the effectiveness of the use of domain ontologies for enhancing P(f|p), we applied the proposed method to SchemeT in Tab. 2 and to Full in Tab. 3. (Note that Full is also based on SchemeT for estimating P(f|p) but uses full-text articles instead of abstracts.) Figure 2 summarizes the results for different numbers of iterations, where the left and right plots correspond to SchemeT and Full, respectively. Incidentally, we used only child-to-parent relations in the GO hierarchy for this experiment, as this yielded the best results in the cancer class.

Figure 2. AUC over iterations for SchemeT (left) and Full, i.e., SchemeT with 679 full-text articles (right).
Figure 1: MedQA system architecture

2 MedQA
MedQA is a question answering system that automatically analyzes thousands of documents (both the Web documents and MEDLINE abstracts) to generate a short text
to answer definitional questions (21). In summary, MedQA takes in a question posed by either a physician or a biomedical researcher. It automatically classifies the posed question into a question type for which a specific answer strategy has been developed (24, 25). Noun phrases extracted from the question serve as query terms. Document Retrieval applies the query terms to retrieve documents from either the World Wide Web or locally indexed literature resources. Answer Extraction automatically identifies the sentences that provide answers to the question. Text Summarization condenses the text by removing redundant sentences. Answer Formulation generates a coherent summary. The summary is then presented to the user who posed the question. Figure 1 shows the architecture of MedQA, and Figure 2 shows MedQA's output for the question "What is vestibulitis?"

Most of the evaluation work on question answering systems (26) focuses on information retrieval metrics: given a question, a text corpus, and the answer, the evaluation task is to measure how correctly the text answer is extracted from the corpus. None of the studies, to our knowledge, apply cognitive methods to evaluate human-computer interaction, to measure the efficacy, accuracy, and perceived ease of use of a question answering system, or to compare a question answering system to other information systems such as information retrieval systems.

3 Cognitive Evaluation Methods
We designed a randomized controlled cognitive evaluation in order to assess the efficacy, accuracy, and perceived ease of use of Google, MedQA, OneLook, and PubMed. The study was approved by the Columbia University Institutional Review Board.

3.1 Question Selection
We manually examined a total of 4,653 questions(1) posed by physicians in various clinical settings (14, 27-29) and found a total of 138 definitional questions.(2) We observed that the definitional questions generally fell into several categories, including Disease or Syndrome, Drug, Anatomy and Physiology, and Diagnostic Guideline. In order to maximize the evaluation coverage, we attempted to select questions that cover most of the categories. After a preliminary examination, we found that many questions did not yield answers from two or more of the systems to be evaluated. For example, the question "what is proshield?" did not yield a meaningful answer from three systems (MedQA, OneLook, and PubMed). The objective was to compare different systems, and unanswerable questions present a problem for the analyses because they render such comparisons impossible. On the other hand, screening the questions with the four systems might introduce bias and a selective exclusion process. We therefore employed an independent information retrieval

(1) The question collection is freely accessible at http://clinques.nlm.nih.gov/
(2) All 138 definitional questions are listed at http://www.dbmi.columbia.edu/~yuh9001/research/definitional_questions.htm
system, BrainBoost,(3) which is a web-based question answering engine that accepts natural language queries. BrainBoost was presented with questions randomly selected from the categories of definitional questions, and the first twelve questions that returned an answer were included in the study. The task was performed by an unbiased assistant who was not privy to the reasons for doing the search. The 12 selected questions are shown in bold at http://www.dbmi.columbia.edu/~yuh9001/research/definitional_questions.htm.

Figure 2: MedQA's output for the question "What is vestibulitis?"

[...] An action that involves >10s of time spent examining the retrieved list of documents (e.g., Web documents or PubMed abstracts).
Query Modification: An action that involves modification of the existing query or user interface (e.g., a change from Google to Scholar.Google).
Read Document: An action in which a subject spends >10s reading the selected document.
Scroll-Down Document: Scroll down a document to search for the answer.
Search Text Box: A subject applies the "Find" function to locate relevant text.
Select Document: A subject selects and opens a document to examine whether the answer appears in the document.
Select Linkout: An action that involves selecting another link from the selected document.
Select Text as Answer: A subject selects the text as the answer to a question.
4 Evaluation Results

In the following section, we present the results of the cognitive evaluation. The first part of this section illustrates the processes of question answering. We also show the coding process used to characterize participants' actions. The second part of this section focuses on a quantitative comparison of the four systems. We include both objective measures, such as actions and response latency, and subjective measures, namely, participants' ratings of the quality of answers as well as their ease of use.

4.1 Illustrations
The following two coding excerpts illustrate the process of question answering on two pairs of systems: PubMed and MedQA, and OneLook and Google. The excerpts are representative of task performance. The subject was an experienced physician with a master's in informatics who was well versed in performing medical information seeking tasks.

Excerpt 1—PubMed and MedQA

The subject had completed five questions and was a little more than forty minutes into the session. The question in this excerpt was "What is vestibulitis?" The systems used to find the answer were PubMed and MedQA, respectively. The entire segment lasted 6 minutes, of which 4:25 was used to search PubMed and 1:11 to search MedQA.

44:23 ACTION (ENTER QUERY-PUBMED): vestibulitis
44:34 SYSTEM RESPONSE: 251 MEDLINE records returned
44:51 (User) COMMENT: OK, I definitely got some answers that do not apply at all... I have no idea why the first set of returns are coming back with psychological problems, but maybe not true, as a physician just makes assumption of that ENT would be returned, but if I am gynecologist, that probably is what I am looking for. Vulvar vestibulitis, I have no idea what it is. I guess I will go find out because I do not know.
45:22 ACTION SELECT DOCUMENT
45:23 ACTION SELECT FULL-TEXT OUT-LINK
45:24 SYSTEM RESPONSE: Out-link failed
45:25 ACTION SELECT FULL-TEXT OUT-LINK
45:26 SYSTEM RESPONSE: Out-link failed
45:33 ACTION FIND DOCUMENT
COMMENT: No... I can not find any definitions
46:11 ACTION (QUERY MODIFICATION, "vestibulitis")
COMMENT: Try vestibulitis only
46:14 SYSTEM RESPONSE: 251 MEDLINE records returned
46:17 ACTION SELECT DOCUMENT
COMMENT: Just try this one, surgical treatment of vulvar vestibulitis, this seems to be a good definition
46:29 ACTION SELECT FULL-TEXT
46:39 SYSTEM RESPONSE: Out-link failed
46:40 ACTION SELECT LINKOUT (of the full-text article)
46:41 SYSTEM RESPONSE: Out-link failed
COMMENT: It does not seem to have any outlink, it is only the abstract. The abstract does not give any characteristics of what the syndrome is.
47:10 ACTION SELECT TEXT AS ANSWER
47:49 ACTION FIND DOCUMENT
47:57 ACTION SELECT DOCUMENT
48:00 ACTION SELECT FULL-TEXT (PDF FILE)
48:02 ACTION READ DOCUMENT
COMMENT: seems to get pain syndromes
48:48 ACTION SELECT TEXT AS ANSWER
COMMENT: OK, I am going to leave PubMed
49:12 ACTION (ENTER QUERY-MEDQA): What is vestibulitis?
COMMENT: MedQA uses MEDLINE, probably will return the same information, hopefully, it will get other information as well.
49:52 SYSTEM RESPONSE: shown in Figure 2
COMMENT: OK, MedQA pulls back exactly the same information, nothing else.
50:23 ACTION SELECT TEXT AS ANSWER
GENERAL COMMENT: I would say that PubMed again all the information was there but was not held in a useful fashion and I need to search all and I have to filter myself... and quality of answer was OK and ease of use is poor because I need to go through everything. MedQA quality of answer is excellent and ease of use is excellent, I do not need to do anything.
Excerpt 2—OneLook and Google

The subject had completed nine questions and was a little more than an hour and a half into the session. The current question answered was "What is gemfibrozil?" The systems used to find the answer were OneLook and Google, respectively. The entire segment was 5:08 minutes, of which 1:44 was used to search the OneLook system and 2:46 to search Google.

1:31:08 ACTION (ENTER QUERY-ONELOOK): gemfibrozil
COMMENT: I know I am looking into medication, Gemfibrozil, I know that I have the advantage of what I am looking for.
31:32 SYSTEM RESPONSE: 4 matching dictionaries in General and 4 matching dictionaries in Medicine
COMMENT: So I get of course a General definition and Medicine related match. I will go to my favorite Wikipedia first
31:51 SYSTEM RESPONSE: Web Page Changes - Wiki...
COMMENT: it returns out-links...
COMMENT: Unfortunately, the Wikipedia isn't so good because it gives me more or less an outline of a whole set of other links that I would have to go find in order to get specific information. I am going back from Wikipedia and go to Medical online dictionaries, I am going to try Online Medical Dictionary first.
32:20 SYSTEM RESPONSE: Web Page Changes - Online Medical Dictionary
COMMENT: I got absolutely useless information. I am going to Stedman's and Stedman's is not working, I found it out before. I go to Dorland's, Dorland's Medical Dictionary...
32:30 SYSTEM RESPONSE: Web Page Changes - Dorland's Medical Dictionary
32:52 ACTION SELECT TEXT AS ANSWER
COMMENT: I get gemfibrozil... which is medication used to lower serum lipid level by decreasing triglyceride, it is just one line definition. I would say that it is probably acceptable, but if I had spent the time with the Wikipedia following the out-links, I probably would be able to find more information.
33:30 ACTION (ENTER QUERY-SCHOLAR.GOOGLE): gemfibrozil
COMMENT: Now I am going to Scholar.Google
33:43 SYSTEM RESPONSE: Web Page Changes - Google returned three article links
33:58 ACTION SELECT DOCUMENT (a full-text article)
34:10 ACTION READ DOCUMENT
COMMENT: On my first look on the medication...
34:32 ACTION SELECT TEXT AS ANSWER
COMMENT: I get quite a good description of the effects of new medication along with...
34:40 ACTION PULL UP PDF FILE
34:48 ACTION SCROLL-DOWN DOCUMENT
34:55 ACTION SELECT TEXT AS ANSWER
COMMENT: looks great... along with appropriate bibliography... With Google, with Google again, I got lucky, find an article very quickly, given me the best information about the medication.
35:50 ACTION (ENTER QUERY-GOOGLE): gemfibrozil
COMMENT: let's see what happened if I go to Google itself as opposed to Google Scholar.
36:05 SYSTEM RESPONSE: Web Page Changes - (Google returns 1,330,000 hits)
COMMENT: I got Medicine.com dictionary
36:16 ACTION SELECT TEXT AS ANSWER
COMMENT: I got some very good information... which is more an overview, put gemfibrozil in the context with other medications for lowering serum lipid levels, so I would get a more understanding from this perspective and therefore Google general as opposed to Scholar.Google is actually a better choice as the Google search engine.
GENERAL COMMENT: For this study, OneLook, I would say, was able to give me the definition which was OK in terms of quality, ease of use was poor because either that a lot of out-links are not working, or that the out-links link to useless information. Google in this instance the quality of answer is definitely good, excellent, and ease of use in this instance, again is excellent, right answer comes from the top.
The two excerpts show that the pattern of actions employed by participants reflects the nature of interactions supported by each system. For example, subjects would iteratively search PubMed until they found a satisfactory answer. As a consequence, they would examine multiple documents (necessitating find-link and Linkout actions), only a few of which were relevant. The subjects typically searched for full-text articles via Linkout actions. The iterative nature of the search was also evidenced by the number of actions pertaining to query modification, searching the text box, and document selection. Table 2 lists a summary of the comments made by subjects throughout the evaluation. Our results show that Google received more favorable comments than complaints. Both MedQA and OneLook received some good comments and some complaints. PubMed was generally criticized and was not given any favorable comments.

Table 2: A summary of comments on the different systems (D for disadvantages and A for advantages)

Google (D): retrieves a lot of links (to the question "What is cubital tunnel syndrome?"). Most of the links seem to relate to individual cases of the disease, not necessarily definitions.
Google (D): One needs to search and evaluate the definitions in Google.
Google (A): retrieves both patient-centric (Google) and physician-centric (Scholar.Google) information.
Google (A): Scholar.Google is much faster because it is the second link, while in PubMed the evaluator has to search through a lot of other articles.
MedQA (D): needs to type in 'What is' versus a direct query.
MedQA (D): takes considerably longer to respond than other systems.
MedQA (A): returns all the context that the evaluator would otherwise have to search for manually. It is only one step and gets exactly what is needed.
MedQA (A): gives an answer (to the question "What is Popper?") that OneLook did not, which is that the drug is injectable, which is important for a physician to know.
OneLook (D): pulls all links. It lets the user guess which link contains a comprehensive answer. Sometimes the links are broken. It is a matter of luck to get to the right links.
OneLook (D): answer quality is poor. It has a terrible user interface. It shows two ugly photos.
OneLook (A): definition has more content than PubMed.
PubMed (D): is not a good resource for definitions.
PubMed (D): is not useful. It takes forever to find information.
4.2 Quantitative Evaluation

The results show that the subjects did not find answers to a single question in Google ("Dawn's phenomenon"), 3 questions in OneLook ("epididymis appendix", "heel pain syndrome", and "Ottawa knee rules"), 3 questions in MedQA ("epididymis appendix", "Ottawa knee rules," and "paregoric"), and 2 questions in PubMed ("epididymis appendix" and "paregoric"). Both MedQA and OneLook acknowledged "no results found" and returned no answers when such an event occurred, while both PubMed and Google
returned a list of documents even if a subject could not identify the definitions from the documents within the 5-minute time limit. We observed that none of the subjects used Google:Definition as the service to identify definitions; instead, they applied the query terms in either Google or Scholar.Google. We also observed that subjects gave the poorest score (i.e., 1) for quality of answer when either MedQA or OneLook returned no answers, and a better score (i.e., 2-3) when a search engine (e.g., Google or PubMed) returned a list of documents, even if the subject could not find any answers in the documents within the 5-minute time limit. Subjects commented that even documents that do not contain answers frequently provided some knowledge about the answers. For example, subjects found that "popper" is a drug although no detailed definitions were found. On the contrary, the subjects typically gave a good score for ease of use when MedQA and OneLook returned no answers.

Table 3 presents descriptive statistics of the subjective and objective measures. In general, Google was the preferred system as reflected in both the quality of answer and ease of use ratings. MedQA achieved the second highest ratings in both measures. OneLook received the lowest rating for quality of answer, and PubMed was rated the worst in terms of ease of use. If we excluded the poor scores given when MedQA did not return any answer, the quality of answer for MedQA went up to 4.5.

Table 3: Average score and (standard deviation) of quality of answer and ease of use, average time spent (in seconds), and number of actions taken.

                   Google       MedQA        OneLook      PubMed
Time Spent         69.6 (6.9)   59.1 (57.7)  83.1 (63.6)  182.2 (85.8)
Number of Actions  4.4 (3.0)    2.1 (2.0)    6.5 (7.7)    10.3 (5.7)
Quality of Answer  4.90 (0.15)  4.0 (0.24)   2.77 (0.08)  2.92 (0.88)
Ease of Use        4.75 (0.29)  2.92 (0.24)  3.9 (0.32)   2.36 (0.88)
While the processing time to obtain an answer was almost instantaneous for Google, OneLook, and PubMed, the average time MedQA took to produce an answer to the 10 answerable questions was 15.8±7.1 seconds. MedQA was nevertheless the fastest system on average for a subject to obtain the definition. For measuring the average time spent, we excluded the cases in which MedQA and OneLook returned no answer. The subjects, on average, spent more time searching PubMed than any of the other systems. In fact, the average PubMed search required more than three times the amount of time required to search MedQA. This is at least partly due to the complexity of the interaction, as borne out by the fact that participants needed more than 10 actions in using PubMed to answer a question, whereas they required only 2 actions on average when they used MedQA. PubMed provides a range of affordances (e.g., limits, MeSH) that support iterative searching. Although this is a powerful tool, it also increases the complexity of the task and the user's cognitive load. MedQA offers the simplest mode of interaction because it eliminates several of the steps (e.g., uploading documents, searching text, and selectively accessing relevant information in a document) involved in searching for information. The results of the commercial search engines, Google and OneLook, fell in
between MedQA and PubMed. However, as evidenced by the high standard deviations, there was significant variability between questions.

5 Discussion
The evaluation results show that Google was the best system for quality of answer (4.90) and ease of use (4.75); recall that the highest possible score for both criteria was 5. The results indicate that Internet resources incorporate reliable medical definitions, and Google allows subjects to readily access those reliable definitions. This is in contrast to numerous other studies that found Internet information to be of poor quality in the medical context (1-10). However, there are significant differences between our study and the others. First, our study evaluated a more general type of question, namely definitional questions, while the other works examined more specific medical questions (e.g., "What further testing should be ordered for an asymptomatic solitary thyroid nodule, with normal TFTs?" in (23)). Secondly, physicians would rate Google highly if they found answers from some sites even if other sites did not provide answers to the questions. In other studies, precision (i.e., the number of hits that provide answers divided by the total retrieved top N hits) plays an important role in measuring quality. For example, one study (31) concluded that Google hits were of poor quality because only one link out of five contained relevant information. Lastly, in our study, the quality of answer was estimated by aggregating information from multiple Web pages. Other studies evaluated the quality of each Web page for answering a specific question; such an evaluation will certainly lead to a much poorer rating of the Internet, because one evaluation study (32) concluded that information is typically scattered across multiple sites: most Web pages incorporate information either in depth or in breadth, and few Web sites combine both depth and breadth. Our results show that OneLook came in third on most of the evaluation criteria.
We observed that the evaluators frequently expressed frustration with failed out-links and with nonspecific, general definitions that are of little value to physicians. We found that PubMed performed worst on almost all criteria. Unlike Google, which assigns weights to the returned documents, PubMed returns a list of documents in reverse chronological order, in which the most recent publications appear first. The most relevant documents in PubMed may never appear at the top, and therefore it usually takes a user significant time to identify answers. Previous research showed that it took an average of more than 30 minutes for a healthcare provider to search for an answer in PubMed, which meant that "information seeking was practical only 'after hours' and not in the clinical setting" (22). Finally, we found that MedQA in general outperformed all search engines except Google. In addition, MedQA outperformed Google in time spent and number of actions, two important efficiency criteria for obtaining an answer. Although it took less than a second for Google to retrieve a list of relevant documents for a query keyword and an average of 16 seconds for MedQA to generate a summary, the average time for a subject to identify a definition was 59.1±57.7 seconds with MedQA, faster than the 69.6±6.9 seconds with Google. This is due to the fact that information is scattered across the web (32). A subject typically needs to visit multiple web pages for an
answer. One can never be certain when a link will lead to useful information. This is a relative disadvantage for Google as compared to MedQA.
6. Conclusion
We evaluated four search engines, namely Google, MedQA, OneLook, and PubMed, for their quality and efficiency in answering definitional questions posed by physicians. We found that Google was in general the preferred system and that PubMed performed poorly. We also found that MedQA was the best in terms of time spent and number of actions needed to answer a question. It would be ideal if a powerful search engine such as Google could be integrated with an advanced question-answering system to yield timely and precise responses to users' specific information needs. Although we are encouraged by the findings, this research is best viewed as formative. The conclusions are limited by a number of factors, including the fact that only four physicians participated in the evaluation. Future research would need to include a larger and more diverse sample of clinicians with different levels of domain expertise and degrees of familiarity with information retrieval systems. In this study, we introduced a novel cognitive method for the in-depth study of the question answering process. The method would have to be validated in different contexts. Finally, the scope of the system (answering definitional questions) is rather narrow at this point, and we would want to conduct similar comparisons with different question types. In general, the results of this work suggest that MedQA presents a promising approach for clinical information access.
Acknowledgement: We thank three anonymous reviewers for valuable comments.
References
1. Purcell G. The quality of health information on the internet. BMJ. 2002;324(7337):557-8.
2. Jadad AR, Gagliardi A. Rating health information on the Internet: navigating to knowledge or to Babel? JAMA. 1998 Feb 25;279(8):611-4.
3. Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor - Let the reader and viewer beware. JAMA. 1997 Apr 16;277(15):1244-5.
4. Glennie E, Kirby A. The career of radiography: information on the web. Journal of Diagnostic Radiography and Imaging. 2006;6:25-33.
5. Childs S. Judging the quality of internet-based health information. Performance Measurement and Metrics. 2005;6(2):80-96.
6. Griffiths K, Christensen H. Quality of web based information on treatment of depression: Cross sectional survey. BMJ. 2000;321:1511-15.
7. Cline RJ, Haynes KM. Consumer health information seeking on the Internet: the state of the art. Health Educ Res. 2001 Dec;16(6):671-92.
8. Benigeri M, Pluye P. Shortcomings of health information on the Internet. Health Promot Int. 2003 Dec;18(4):381-6.
9. Wyatt J. Commentary: measuring quality and impact of the WWW. BMJ. 1997;314:1879.
10. McClung HJ, Murray RD, Heitlinger LA. The Internet as a source for current patient information. Pediatrics. 1998 Jun;101(6):E2.
11. Sacchetti P, Zvara P, Plante MK. The Internet and patient education - resources and their reliability: focus on a select urologic topic. Urology. 1999 Jun;53(6):1117-20.
12. Gemmell J, Bell G, Lueder R, Drucker S, Wong C. MyLifeBits: fulfilling the Memex vision. Proceedings of the 10th ACM international conference on Multimedia; France, pp. 235-8.
13. Podichetty V, Booher J, Whitfield M, Biscup R. Assessment of internet use and effects among healthcare professionals: a cross sectional survey. Postgrad Med J. 2006;82:274-9.
14. Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, et al. Analysis of questions asked by family doctors regarding patient care. BMJ. 1999 Aug 7;319(7206):358-61.
15. Pandolfini C, Bonati M. Follow up of quality of public oriented health information on the world wide web: systematic re-evaluation. BMJ. 2002 Mar 9;324(7337):582-3.
16. Sandvik H. Health information and interaction on the internet: a survey of female urinary incontinence. BMJ. 1999 Jul 3;319(7201):29-32.
17. Alper B, Stevermer J, White D, Ewigman B. Answering family physicians' clinical questions using electronic medical databases. J Fam Pract. 2001;50(11):960-5.
18. Jacquemart P, Zweigenbaum P. Towards a medical question-answering system: a feasibility study. Stud Health Technol Inform. 2003;95:463-8.
19. Takeshita H, Davis D, Straus S. Clinical evidence at the point of care in acute medicine: a handheld usability case study. Proceedings of the human factors and ergonomics society 46th annual meeting; 2002. p. 1409-13.
20. Bhavnani S, Bichakjian C, Johnson T, Little R, Peck F, Schwartz J, et al. Strategy Hubs: Domain Portals to help Find Comprehensive Information. JASIST. 2006;57(1):4-24.
21. Lee M, Cimino J, Zhu H, Sable C, Shanker V, Ely J, et al. Beyond information retrieval - medical question answering. AMIA. Washington DC, USA; 2006.
22. Hersh WR, Crabtree MK, Hickam DH, Sacherek L, Friedman CP, Tidmarsh P, et al.
Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions. J Am Med Inform Assoc. 2002;9(3):283-93.
23. Berkowitz L. Review and Evaluation of Internet-based Clinical Reference Tools for Physicians: UpToDate; 2002.
24. Yu H, Sable C. Being Erlang Shen: Identifying answerable questions. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence on Knowledge and Reasoning for Answering Questions; 2005.
25. Yu H, Sable C, Zhu H. Classifying Medical Questions based on an Evidence Taxonomy. Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains; 2005.
26. Voorhees E, Tice D. The TREC-8 question answering track evaluation. TREC; 2000.
27. Ely JW, Osheroff JA, Ferguson KJ, Chambliss ML, Vinson DC, Moore JL. Lifelong self-directed learning using a computer database of clinical questions. J Fam Pract. 1997 Nov;45(5):382-8.
28. Ely JW, Osheroff JA, Chambliss ML, Ebell MH, Rosenbaum ME. Answering physicians' clinical questions: obstacles and potential solutions. J Am Med Inform Assoc. 2005 Mar-Apr;12(2):217-24.
29. D'Alessandro DM, Kreiter CD, Peterson MW. An evaluation of information-seeking behaviors of general pediatricians. Pediatrics. 2004 Jan;113(1 Pt 1):64-9.
30. Eysenbach G, Kohler C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ. 2002 Mar 9;324(7337):573-7.
31. Berland GK, Elliott MN, Morales LS, Algazy JI, Kravitz RL, Broder MS, et al. Health information on the Internet: accessibility, quality, and readability in English and Spanish. JAMA. 2001 May 23-30;285(20):2612-21.
32. Bhavnani S. Why is it difficult to find comprehensive information? Implications of information scatter for search and design: Research Articles. Journal of the American Society for Information Science and Technology. 2005;56(9):989-1003.
BIODIVERSITY INFORMATICS: MANAGING KNOWLEDGE BEYOND HUMANS AND MODEL ORGANISMS
INDRA NEIL SARKAR
Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA
E-mail:
[email protected] In the biomedical domain, researchers strive to organize and describe organisms within the context of health and epidemiology. In the biodiversity domain, researchers seek to understand how organisms relate to one another, either historically through evolution or spatially and geographically in the environment. Currently, there is limited cross-communication between these domains. As a result, valuable knowledge that could inform studies in either domain often goes unnoticed. Biodiversity knowledge has long been a valuable source for many biomedical advances [1]. Before the creation of synthetic compounds, medicinal compounds originated from solely natural plant and animal extracts. Although there are upward estimates of between 10-100 million organisms on Earth [2], biomedical research primarily focuses on only a fraction of these as "model" organisms [3]. Furthermore, much knowledge may be lost from the biomedical community because many organisms are only described within biodiversity resources. These studies can form the basis for research on evolution, speciation, and distribution, and also provide an important baseline for studies of not only conservation but also the study of emerging diseases. The integration of biodiversity knowledge from museum collections, for example, has provided significant insights into the etiology and distribution of diseases such as hantavirus [4]. Understanding the etiology of diseases and their host epidemiology may also further the development of vaccinations and treatments that can help prevent epidemics or pandemics, such as the looming threat of the avian flu [5]. For emergent diseases (e.g., malaria), biomedical researchers traditionally focus on a limited number of species (e.g., four species of Plasmodium [6]). This represents very little in terms of the phylogenetic diversity of diseases that are known to infect numerous other organisms (e.g., malaria affects birds, lizards, and other mammals [7-12]). 
The incorporation of biodiversity knowledge in the context of biomedical advances may lead to breakthrough therapies for many of the diseases that still plague human society [13, 14]. The genomic revolution has resulted in a deluge of sequence data and derivative knowledge (e.g., protein structure prediction and gene expression experiments), which are the predominant data types in biomedical research.
INS is funded in part by NSF-IIS-0241229, NSF-BDI-0421604, and the D.A.B. Lindberg Research Fellowship from the Medical Library Association.
Three of the papers in this session examine how these data can be integrated, annotated, and interpreted in light of greater taxonomic sampling. First, Cadag et al. propose a semi-automated framework that uses a federated approach to incorporate relevant knowledge from heterogeneous resources to assist with gene annotation. Next, Ng et al. demonstrate a prototype application to suggest annotations for genes that are involved with biological pathways across a range of organisms. Finally, Hampikian and Andersen examine the existence and utility of gene sequence regions that are not present in organisms across the tree of life. The final two papers in this session consider how sequence and sequence-derived information can be complemented with biodiversity data (e.g., morphological, ecological, and temporal data). First, Maglia et al. propose a framework to incorporate existing ontologies and their structures towards the development of an ontology for amphibian morphology. Sautter et al. then describe a system to semi-automatically organize knowledge that is embedded in literature resources. There has been considerable discussion in both the scientific [15-18] and popular media [5, 19] with regard to the need for methods and tools to organize and integrate biodiversity and biomedical knowledge within the context of legacy, existing, and newly generated data. Significant infrastructural and methodological advancements are needed to incorporate knowledge from both biomedical and biodiversity resources. To this end, it is hoped that the papers that follow will spark synergistic activities that benefit both biomedical and biodiversity communities.
References
1. W. E. Muller, R. Batel, H. C. Schroder, and I. M. Muller, "Traditional and Modern Biomedical Prospecting: Part I - the History: Sustainable Exploitation of Biodiversity (Sponges and Invertebrates) in the Adriatic Sea in Rovinj (Croatia)," Evid Based Complement Alternat Med, vol. 1, pp. 71-82, 2004.
2. E. O.
Wilson, "The encyclopedia of life," Trends in Ecology and Evolution, vol. 18, pp. 77-80, 2003.
3. B. L. Umminger, "Unconventional organisms as models in biological research," J Exp Zool Suppl, vol. 4, pp. 2-5, 1990.
4. T. L. Yates, J. N. Mills, C. A. Parmenter, T. G. Ksiazek, R. R. Parmenter, J. R. Vande Castle, C. H. Calisher, S. T. Nichol, K. D. Abbott, J. C. Young, M. L. Morrison, B. J. Beaty, J. L. Dunnum, R. J. Baker, J. Salazar-Bravo, and C. J. Peters, "The Ecology and Evolutionary History of an Emergent Disease: Hantavirus Pulmonary Syndrome," BioScience, vol. 52, pp. 989-998, 2002.
5. D. G. McNeil, "Hitting the Flu at Its Source, Before It Hits Us," in New York Times. New York, 2005.
6. M. T. Makler, C. J. Palmer, and A. L. Ager, "A review of practical techniques for the diagnosis of malaria," Ann Trop Med Parasitol, vol. 92, pp. 419-33, 1998.
7. C. T. Atkinson, K. L. Woods, R. J. Dusek, L. S. Sileo, and W. M. Iko, "Wildlife disease and conservation in Hawaii: pathogenicity of avian malaria (Plasmodium relictum) in experimentally infected iiwi (Vestiaria coccinea)," Parasitology, vol. 111 Suppl, pp. S59-69, 1995.
8. P. C. Garnham, "Recent research on malaria in mammals excluding man," Adv Parasitol, vol. 11, pp. 603-30, 1973.
9. B. Mons and R. E. Sinden, "Laboratory models for research in vivo and in vitro on malaria parasites of mammals: Current status," Parasitol Today, vol. 6, pp. 3-7, 1990.
10. R. E. Ricklefs and S. M. Fallon, "Diversification and host switching in avian malaria parasites," Proc Biol Sci, vol. 269, pp. 885-92, 2002.
11. J. J. Schall, "Virulence of lizard malaria: the evolutionary ecology of an ancient parasite-host association," Parasitology, vol. 100 Suppl, pp. S35-52, 1990.
12. J. J. Schall, "Lizards infected with malaria: physiological and behavioral consequences," Science, vol. 217, pp. 1057-9, 1982.
13. L. A. Basso, L. H. da Silva, A. G. Fett-Neto, W. F. de Azevedo, Jr., S. Moreira Ide, M. S. Palma, J. B. Calixto, S. Astolfi Filho, R. R. dos Santos, M. B. Soares, and D. S. Santos, "The use of biodiversity as source of new chemical entities against defined molecular targets for treatment of malaria, tuberculosis, and T-cell mediated diseases - a review," Mem Inst Oswaldo Cruz, vol. 100, pp. 475-506, 2005.
14. F. Pelaez, "The historical delivery of antibiotics from microbial natural products - Can history repeat?," Biochem Pharmacol, 2005.
15. D. Agosti, "Biodiversity data are out of local taxonomists' reach," Nature, vol. 439, pp. 392, 2006.
16. J. Soberon and A. T.
Peterson, "Biodiversity informatics: managing and applying primary biodiversity data," Philos Trans R Soc Lond B Biol Sci, vol. 359, pp. 689-98, 2004.
17. S. Blackmore, "Environment. Biodiversity update - progress in taxonomy," Science, vol. 298, pp. 365, 2002.
18. F. A. Bisby, "The quiet revolution: biodiversity informatics and the internet," Science, vol. 289, pp. 2309-12, 2000.
19. "Today we have naming of parts: a global registry of animal species could shake up taxonomy," in Economist, 2006.
BIOMEDIATOR DATA INTEGRATION AND INFERENCE FOR FUNCTIONAL ANNOTATION OF ANONYMOUS SEQUENCES
EITHON CADAG, BRENT LOUIE, PETER J. MYLER, PETER TARCZY-HORNOCH
Depts. of Medical Education and Biomedical Informatics, Pathobiology, Pediatrics, and Computer Science and Engineering, University of Washington, Seattle, WA, USA
Seattle Biomedical Research Institute, Seattle, WA, USA
Scientists working on genomics projects are often faced with the difficult task of sifting through large amounts of biological information dispersed across various online data sources that are relevant to their area or organism of research. Gene annotation, the process of identifying the functional role of a possible gene, in particular has become increasingly time-consuming and laborious to conduct as more genomes are sequenced and the number of candidate genes continues to increase at a near-exponential pace; genes are left un-annotated, or worse, incorrectly annotated. Many groups have attempted to address the annotation backlog through automated annotation systems that are geared toward specific organisms, and which may thus not possess the necessary flexibility and scalability to annotate other genomes. In this paper, we present a method and framework which attempts to address problems inherent in manual and automatic annotation by coupling a data integration system, BioMediator, to an inference engine with the aim of elucidating functional annotations. The framework and heuristics developed are not specific to any particular genome. We validated the method with a set of randomly-selected annotated sequences from a variety of organisms. Preliminary results show that the hybrid data integration and inference approach generates functional annotations that are as good as or better than "gold standard" annotations ~80% of the time.
1. Introduction
The increasing rate of genomic discovery has left biologists with an overwhelming number of new and tentatively novel genes to examine. One of the first steps in scrutinizing a new genome is to annotate its genes with biochemical characteristics, cellular localization, and other functional properties to quickly identify targets of interest for further study. The re-visitation of "hypothetical" proteins using multiple updated molecular databases can reveal valuable biological information as well. It is estimated that between 25% and 66% of genes, depending on the organism, are annotated as "hypothetical" [1]. Annotation, however, is often a slow and laborious process, and the complete annotation of even a modestly-sized genome can take a small team of skilled annotators years to finish. Even with a large group of scientists the task remains non-trivial; collaborating scientists working on Drosophila
Corresponding author (
[email protected]) 343
melanogaster organized a two-week "jamboree" to accomplish functional annotation [2]. Coupled with the necessity to maintain currency as sequence information is revised and molecular reference databases are updated, annotation becomes a Sisyphean effort. Much of the challenge involved in annotating genes stems from scientists needing to consult various molecular databases to ensure complete and thorough annotations. Online data sources such as those furnished by the National Center for Biotechnology Information (NCBI)a, the Wellcome Trust Sanger Instituteb and many more made freely available by other researchers have become invaluable in helping annotators assign putative functions to genes based on computational results. The nature of how biologic information is stored, i.e. in separate, heterogeneous data sources, dictates that data integration is the first step in gene annotation [3]. Information regarding the functional properties of genes is fragmented across various online databases, which were developed independently and do not inherently interoperate. To annotate genes, biologists must manually query many individual data sources. Considerable research has been done investigating automated methods of annotation, which in addition to alleviating manual effort have the capability of querying and analyzing a far larger volume of information. While many of the automated annotation systems created thus far are very effective and successful at generating annotations, most are meant as one-off solutions for specific organisms or sets of organisms, or utilize only a select number of databases and analyses on which the annotation process is tailored; data integration is frequently ad hoc. As the number of molecular databases increases, scalable automated annotation systems will be increasingly necessary. In this paper we present and evaluate a hybrid approach that addresses both the data integration and analytical needs of gene annotation.
Recognizing that an effective annotation system must first be an effective data integration system and that biological expertise is indispensable in developing accurate annotations, we incorporated a robust inference engine on top of an already-existing data integration platform, BioMediatorc. We identified several promising online biologic databases based on the processes used for model and non-model genome annotation projects and formulated a set of pilot heuristics for the inference engine which would reason over database query results and draw conclusions toward the annotations for submitted sequences.
a http://www.ncbi.nlm.nih.gov
b http://www.sanger.ac.uk
c http://www.biomediator.org
To evaluate our methodology, 116 annotated genes were selected randomly from GenBank [4] as a sample set. These genes were re-annotated using our BioMediator-based approach, and our computational annotations were compared to the actual annotations as listed in GenBank. Relying on manual inspection to resolve ambiguity, we found that our automated method yielded functional annotations as good as or better than the listed annotation for 78% of the sample.
2. Related Work
Automated gene annotation is a well-studied subfield of bioinformatics, and many projects have arisen out of the need for expedient gene annotation. Most automated annotation systems rely on a pipeline-based approach [5-7], whereby data is transformed or analyzed step-wise to reach a predicted function. Often the data sources used for the pipeline are replications of publicly available online databases, housed in local data warehouses. Kasukawa et al., for instance, relied on a custom annotation pipeline with a well-defined control structure to generate first-pass annotations for the mouse genome, and provided an interface for human curation and modification of automated annotations [5]; Potter et al. included a protein annotation pipeline in the ENSEMBL analysis pipeline [7], which assigns InterPro [8] domains to putative proteins after the gene identification stage, derived from species-specific curated data. Marrying inference to gene annotation systems has also received research attention. Similar to MAGPIE [9], which uses PROLOG to reason over analytical results, FIGENIX uses the Java-based JLog to enact intelligent reasoning over specific portions of its annotation pipelines [6]. Like most other automated annotation systems, FIGENIX uses a data warehouse approach to storing information. In contrast to the automated annotation methods already mentioned, our approach uses a federated database system. The system does not store information locally; rather, queries are sent to the sources, normalized, cleansed and then analyzed in real time, providing a small client-side footprint. This has the advantage of always providing up-to-date data, a limitation of the aforementioned warehousing approaches [10].
Additionally, data integration is accomplished with the use of a mediated schema, which provides the necessary semantic linkage between data sources as well as a common ontology for the development of general heuristics that are not specific to any single data source or genome. Because of the multi-tiered architecture used for the data integration process, new data sources can be readily added and incorporated into the mediated schema with minimal overhead cost and without the large increase in system complexity that such a change might provoke in a pipeline-based system.
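To make the federated pattern described above concrete, the following is a minimal sketch in Python, not the BioMediator implementation itself; all function names, source payloads, and schema field names here are hypothetical. A mediator fans a query out to per-source wrappers, and source-to-schema mappings normalize each record into the mediated schema at query time, with nothing warehoused locally:

```python
# Hypothetical sketch of federated, real-time data integration:
# each wrapper queries its source and its mapping normalizes
# records into mediated-schema entities; nothing is stored locally.

def ncbi_wrapper(query):
    # A real wrapper would call a remote service here.
    return [{"name": "kinase, putative", "evalue": 1e-40}]

def pfam_wrapper(query):
    return [{"desc": "Protein kinase domain", "score": 55.2}]

def normalize_ncbi(rec):
    # Source-to-schema mapping for the (hypothetical) NCBI wrapper.
    return {"entity": "ProteinDatabaseHit",
            "annotation": rec["name"], "confidence": rec["evalue"]}

def normalize_pfam(rec):
    return {"entity": "DomainHit",
            "annotation": rec["desc"], "confidence": rec["score"]}

SOURCES = [(ncbi_wrapper, normalize_ncbi), (pfam_wrapper, normalize_pfam)]

def mediated_query(query):
    """Fan out to all sources; return schema-normalized results."""
    results = []
    for wrapper, normalize in SOURCES:
        results.extend(normalize(rec) for rec in wrapper(query))
    return results
```

Because every source answers under the same schema, downstream heuristics can be written once against entity types such as "ProteinDatabaseHit" rather than per source, which is the property the mediated-schema approach buys over an ad hoc pipeline.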
And, unlike other systems that rely on inference in annotation, the reasoning system is not restricted by an algorithmic pipeline, and is free to enact rules at arbitrary points in the data-gathering process.
3. Methods
3.1. Identifying Annotation Sources and Heuristics
To test our combined method of data integration and inference, it was first necessary to select a set of data sources as well as initial logical inference rules to reason over returned information. We created a list of online databases and other resources for use in the process of functionally annotating genomes, derived from methods used for the annotation of a set of organisms at the Seattle Biomedical Research Institute (SBRI). Scientists from SBRI participate in an international effort to sequence and annotate the genomes of three disease-causing parasites, Leishmania major, Trypanosoma brucei and Trypanosoma cruzi [11]. While the three genomes share considerable sequence similarity, most of their genes have little homology with genes in other species; approximately 66% of their genomes are annotated as "hypothetical". Additionally, we attempted to emulate the annotation of Haemophilus influenzae, the first non-viral genome to be completely sequenced. As such, H. influenzae is a far more studied genome than any of the Trypanosomatids. Our experience with these genomes provided the data sources and annotation processes on which we based our system. Understanding the annotation processes for a set of non-model genomes and a model genome yielded interesting observations. Many of the data sources relied on by scientists for annotating the Trypanosomatids were based on computational analyses, and, with the aid of Perl scripts, submission to multiple analytical services was done in parallel. Parsing through and drawing knowledge from the information, however, was a manual endeavor. Annotators for H. influenzae, while also employing some computational services, primarily NCBI's BLAST [12] and domain searches, relied more heavily on literature searches and some species-specific databases.
From the sources used by scientists for the aforementioned genomes, a subset was selected to act as the data sources for the evaluation of our automated annotation system: the NCBI BLAST database [12], the NCBI Conserved Domain Database (CDD) [13], Wellcome Trust Sanger's Pfam database [14], the PROSITE database [15], the Fred Hutchinson Cancer Research Center's BLOCKS database [16] and the ProDom database [17]. Information on how to apply expert knowledge to returned data was also elicited from scientists, and provided the basis for the initial logical inference rules. For example, heuristics provided by one scientist working on the
Trypanosomatid genomes noted that, in examining BLAST scores, it is not necessarily preferable to use the top-scoring results, because the best BLAST hits are not always the closest relation to the sequence in question [18].
3.2. Data Integration for Annotation with BioMediator
The BioMediator data integration system is the querying, retrieval and normalization platform for our automated annotation method. Developed at the University of Washington, BioMediator is a general-purpose biologic data integration system whose adaptability to various biomedical domains has been demonstrated in the past by providing a data integration platform for linking expression array data with analytics software and by uniting disparate neuroscience databases to identify locations in the cortex related to language processing [19-21]. A federated system that queries sources in real time, BioMediator relies on a multi-tiered architecture whose core is a mediated schema that translates data from heterogeneous data sources into entity instances from the schema, thus collecting all query results under a single semantic framework (see Figure 1).
Figure 1. Diagram of BioMediator's architecture: data comes from sources (far right, F) via wrappers (E), which serialize it to schema-mapped XML (A) via the metawrapper layer (C, D); it is then sent to the BioMediator query processor (B) and interface (G). Original image adapted from .
We manually created a mediated schema for generalized, non-genome-specific functional annotation using the Proteged ontology editor, along with wrappers to serialize data from the sources and source-to-schema mappings. During the evaluation of our annotation system, the schema contained 57 entities to represent data across genomic databases (e.g. 'Protein', 'ProteinDatabaseHit') and 55 binary relationships between those entities (such as 'ProteinHasProteinDatabaseHit' to describe a protein homology relationship).
d http://protege.stanford.edu
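As a toy illustration of the kind of entities and binary relationships the mediated schema contains (the two entity names and the relationship name are from the paper; the field names and example values are hypothetical), one might model them as:

```python
from dataclasses import dataclass

# Two of the 57 mediated-schema entities named in the text.
@dataclass
class Protein:
    accession: str
    sequence: str

@dataclass
class ProteinDatabaseHit:
    annotation: str   # functional description from a source database
    evalue: float

# One of the 55 binary relationships, 'ProteinHasProteinDatabaseHit',
# linking a query protein to a homologous database record.
@dataclass
class ProteinHasProteinDatabaseHit:
    subject: Protein
    object: ProteinDatabaseHit

p = Protein("XP_0001", "MKVLAA")
hit = ProteinDatabaseHit("cysteine protease, putative", 1e-32)
rel = ProteinHasProteinDatabaseHit(p, hit)
```

Representing every source's output as instances of a small set of typed entities and relationships like these is what lets the inference rules in the next section be written against the schema rather than against any individual database's record format.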
3.3. Heuristics for Anonymous Sequence Annotation
Utilizing BioMediator's plug-in architecture, we added the Java Expert System Shell (Jess) rule engine [22] to BioMediator, giving us the capability to formulate flexible sets of rules against mediated result sets. Unlike previous annotation systems that employ rule engines to manage pipelines or make decisions based on analyses, our approach to integrating Jess into BioMediator does not compartmentalize the scope of the rule engine by limiting when or where rules may fire; the Jess component is free to enact rules over any data as it enters the system piecemeal, after all data is loaded in aggregate, or any combination thereof, and treats all received data as part of the working memory. As a result of our approach, rules are applied in a consistent fashion for all annotations. For our evaluation, three classes of rules were created to emulate as closely as possible some of the annotation processes used by the genome annotators at SBRI. A total of 16 rules were developed for the pilot evaluation of our system (see Figure 2 for a rule example).

(defrule evalue-threshold-homologs
  (threshold (type evalue) (max ?M) (db ?D))
  ?F <- (homolog (evalue ?X&:(> ?X ?M))
                 (db ?B&:(eq ?B ?D))
                 (property ?P))
  =>
  (delete-reason ?P ?F "High expect value"))

Figure 2. Example rule that prunes homologs from the result set that do not pass a threshold. Specific thresholds for individual databases may be optionally set, and the final line above saves the reason for the removal of the record.
3.3.1 Filtering Rules
Filtering rules are heuristics that were limited to strictly ruling out possible annotations or other relevant data from further use by the inference engine. Rules that examined quantitative values against a minimum threshold, for instance, fell under this classification. Also, based on techniques utilized by the FANTOM2 annotation pipeline [5], a filtering rule for the perceived quality of information was created, using 12 regular expressions whose patterns indicate a possibly uninformative annotation. For example, homologous proteins that contained "unnamed" in their functional annotation were removed from further consideration. Data classified as removed do not leave the working memory; rather, the records are restructured so that the reason for their removal is noted, and they can be retrieved again if need be.
e Supplementary material on rules at: http://www.biomediator.org/publications
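The quality-filtering idea can be sketched in a few lines of Python rather than Jess. Only the "unnamed" pattern is taken from the paper; the other patterns, the function name, and the record layout are our assumptions:

```python
import re

# Hypothetical patterns flagging uninformative annotations; the
# paper's system used 12 such regular expressions ("unnamed" is
# the one it names explicitly).
UNINFORMATIVE = [re.compile(p, re.IGNORECASE)
                 for p in (r"\bunnamed\b", r"\bunknown\b", r"\bhypothetical\b")]

def filter_annotations(records):
    """Split records into kept and removed, noting removal reasons.

    Removed records are not discarded: each carries the reason for
    its removal so it can be retrieved again if need be.
    """
    kept, removed = [], []
    for rec in records:
        if any(p.search(rec["annotation"]) for p in UNINFORMATIVE):
            removed.append({**rec, "removal_reason": "uninformative annotation"})
        else:
            kept.append(rec)
    return kept, removed

kept, removed = filter_annotations([
    {"annotation": "unnamed protein product"},
    {"annotation": "cysteine protease, putative"},
])
```

Keeping removed records with an attached reason, instead of deleting them, mirrors the restructuring behavior described above for the working memory.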
3.3.2 Evidence-Building Rules
The second class of rules uses returned information to increment the evidence levels of tentative annotations. Homologous proteins enter working memory with a low evidence level. As evidence is found to support a protein annotation (e.g., corroborating domains or a large number of similar protein annotations returned), its evidence level is increased. This is analogous to the confidence classification system used by the scientists annotating the H. influenzae genome at SBRI, with an ordinal scale representing the level of evidence. Domains that recur in working memory multiple times, for example, may have their evidence level increased, as their likelihood of being associated with a target sequence is improved; likewise, functional annotations that are correlated with domain support will also reflect an increase in evidence. Because our initial annotation system does not yet make use of formal biomedical vocabularies, such as the Gene Ontology [23] (GO), and there is no universally-accepted nomenclature in practice across genomic databases, we establish correlations between the text of functional annotations provided by our data sources using a modified edit distance algorithm. Consider two strings, k and l, with lengths m and n respectively; a matrix G of (m + 1) x (n + 1) is created, where row 0 is initialized to 0...n, and column 0 is initialized to 0...m. The remaining positions in the matrix, G(i,j), are computed by:

    G(i,j) = min( G(i-1,j) + 1, G(i,j-1) + 1, G(i-1,j-1) + c ),
    where c = 0 if char(k,i) = char(l,j), and c = 1 otherwise.     (Eq. 1)
The value given by G(m,ri) I q is the phrase similarity measure we use between k and /, where q is the length of the longer of the two strings. In our annotation system, various evidence-building rules invoke this string-comparing algorithm, such as when protein homologies share similarly-phrased annotations. 3.3.3 Annotation Selection Rules The third classification in our initial rule-base is those that select likely functional annotations from the working memory, based on evidence levels. All possible annotations are stratified by their level of evidence; related annotations are percolated to the top of the list if they appear repeatedly, and the highestlevel annotation with the greatest amount of evidence is provided as the automated functional annotation, though the remaining possible annotations are
Where char(k,i) represents the (th character in the string k
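Eq. 1 is the standard Levenshtein recurrence with unit insertion, deletion and substitution costs, normalized by the longer string length. A direct sketch in Python (function and variable names are ours):

```python
def phrase_similarity(k, l):
    """Edit distance between k and l normalized by the length of the longer
    string (Eq. 1): 0.0 means identical, values near 1.0 mean dissimilar."""
    m, n = len(k), len(l)
    # (m+1) x (n+1) matrix; row 0 initialized to 0..n, column 0 to 0..m
    G = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        G[0][j] = j
    for i in range(m + 1):
        G[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if k[i - 1] == l[j - 1] else 1  # substitution cost
            G[i][j] = min(G[i - 1][j] + 1,        # deletion
                          G[i][j - 1] + 1,        # insertion
                          G[i - 1][j - 1] + c)    # match/substitution
    q = max(m, n)
    return G[m][n] / q if q else 0.0
```

A rule might then treat two annotations as corroborating when, say, `phrase_similarity` falls below some cutoff; the cutoff itself is a tuning choice not fixed by Eq. 1.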
available for viewing as well. If no annotation is available at any evidence level, the default "hypothetical" result is presented.

3.4. Evaluation

To evaluate the efficacy of our BioMediator-based automated annotation system, we randomly selected 116 genes from a local copy of the GenBank database from April 2006 [4]. The GenBank annotations for the 116 genes served as our "gold standard". We parsed out species names so that results from the source organism could be excluded from query returns; protein sequences from 58 bacteria, 31 eukaryotes, three viruses and one archaeon were represented. Once the genes were annotated by our system, the automated and actual annotations were compared, and each automated annotation was scored as incorrect, correct but inferior to actual, same as actual, or superior to actual. This quaternary scoring rubric was adopted to adjust for the known danger of outdated or incorrect GenBank annotations [24]. We used two measures in scoring: specificity and utility. Specificity refers to the level of granularity and precision provided in the annotation; e.g., "peptidase" would be a less specific annotation than "lysosomal cysteine-type endopeptidase", provided both are correct. Utility measures how informative annotations are based on their textual content. An annotation based on a GO term, for example, would be considered more informative than one that uses idiosyncratic nomenclature. In cases where the automated annotation did not match the actual annotation, we used manual annotation methods and referred our findings to a domain expert for final scoring.

4. Results of Automated Annotation Using Inference
Our evaluation showed that the automated annotations had specificity at the same level as, or better than, the GenBank annotations 78% of the time. Additionally, the automated annotation was equal to or more informative than the GenBank annotation for 85% of the sample genes. Because putative genes from non-model organisms are generally less likely to register sequence similarity hits in databases than those from well-studied model organisms, we also compared the system's performance stratified into model and non-model organisms, as determined by the NCBI Model Organisms Guide [25] (see Table 1). Of the 116 automated annotations generated, seven were deemed incorrect when compared to the GenBank annotations. Upon manual inspection, the incorrect annotations were attributable to either a) the genes having short sequences, which were subsequently expunged by expect-value rules, or b) pertinent information originating from the organism from which the sequence was taken, which was thus pruned out.
Table 1. Results of automated annotation in comparison to GenBank annotations.

                Non-model Organisms (n=60)                  Model Organisms (n=56)
        Wrong     Worse       Same        Better      Wrong     Worse       Same        Better
Spec.   5 (8.3%)  10 (16.7%)  37 (61.7%)  8 (13.3%)   2 (3.6%)  8 (14.3%)   30 (53.6%)  16 (28.6%)
Util.   5 (8.3%)  4 (6.7%)    42 (70.0%)  9 (15.0%)   2 (3.6%)  6 (10.7%)   41 (73.2%)  7 (12.5%)
Individual results varied in quality and nomenclature. The databases we relied on as sources did not share a common terminology, so semantically equivalent, though syntactically different, annotations were commonplace. In some cases, lower evidence levels provided superior annotations to those at higher evidence levels, though we used the highest evidence level presented in scoring. In seven cases, the automated annotation system presented a function for a gene for which GenBank records either show none or list "hypothetical". Manual annotation indicated that in four of the seven there was evidence to suggest that the automated annotation was correct; for the remaining three, some evidence suggested their correctness, though their true annotation remained relatively ambiguous (see Table 2 for example results).

Table 2. Selected automated annotation results juxtaposed with actual annotations from GenBank, with notes.

Automated Annotation               | Actual Annotation                    | Notes
Hypothetical                       | Ribosomal protein L34                | Sequence was small; relevant entries removed by expect-value rules
Anion exchange transporter         | SLC26A5 protein                      | Automated is less specific but more informative
Unnamed protein product            | Nicotinic acetylcholine receptor     | Evidence for automated is very convincing; affirmed
                                   | alpha4 subunit                       | with manual inspection
ABC-type uncharacterized transport | COG4619: ABC-type uncharacterized    | Automated and actual match, controlled vocabulary
system, ATPase component           | transport system, ATPase component   | used
GTP-binding protein RAB4           | PREDICTED: similar to ras-related    | Annotations are essentially the same, but varying
                                   | GTP-binding protein 4b               | naming conventions used
5. Discussion
The framework and methodology on which we base our approach to gene annotation differ from previous automated gene annotation solutions. By using BioMediator as a data integration platform to handle sequence queries and
retrieve results, we avoid the overhead involved in maintaining large repositories that replicate already-available data sources; responsibility for updating the data sources we use falls on the originators of the source data, and users of our system are generally removed from most maintenance tasks. Because of the system's relatively small memory and processor footprint, it can be used on the desktop computers of annotating scientists. BioMediator's tiered architecture also allows us to add and remove sources with relative ease, without the effort often necessary in warehouse systems, where database schemata and workflows may need to be altered considerably as data sources and tasks change over time. Scientists researching a novel genome, for example, could map any local in-house databases to the databases linked to BioMediator, thereby rapidly integrating their species-specific data with any sources already supported by BioMediator. Also, building the inference system around the schema rather than around individual sources afforded us a method of quickly developing annotation rules without necessarily having to address each data source individually. Inference rules are also a natural, transparent way of capturing annotator knowledge. Once the rules were conceived, development in Jess was rapid. It is important to note, though, that our results were obtained using a set of rules that were not tuned or optimized; we therefore expect results to improve as rules are refined based on feedback from annotators. The scalability and flexibility of our approach, however, did come at a cost: online data sources experience downtime. While testing the system, one of our sources was unavailable for several hours. We hope that by utilizing many more sources with partial redundancy in the future, the loss of any single source may be somewhat offset.
Still, as a federated data system, our ability to retrieve data is subject to the real-time availability of the data sources. An important handicap was that we did not rely on a structured ontology such as GO for our initial evaluation. While the schema we utilized was ontology-based, none of the sources we relied on used any controlled vocabulary on a consistent basis. Phylogenetic information was not represented in our evaluation, and could have provided valuable data relating evolutionary linkage to target sequences. Despite these shortcomings, the initial evaluation of our annotation system and methodology gives encouraging results: the efficacy of our approach is comparable to that of a previously evaluated species-specific, pipeline-based automated annotation system (75.1-78.6% estimated accuracy for FANTOM2 [5]), with the additional advantages of being non-specific to any genome and having an architecture oriented toward scalability.
6. Conclusion
The growing size, disparity and heterogeneity of biological data, and the necessity for expert curation in determining protein functions for the myriad of newly sequenced genomes, mean that an automated annotation system able to address future gene annotation requirements must be both a robust data integration platform and a powerful expertise-based system. In this paper, we have presented a technique and framework that couples these two important tasks in gene annotation into a cohesive platform, and evaluated its performance. Future iterations of the system will annotate genes using a controlled vocabulary, with the addition of data sources such as InterPro that regularly and consistently include GO terms in their records. While our initial system relies on online databases, incorporating analytical services such as transmembrane-locating or phylogeny-inferring software into the schema, and developing rules to take advantage of such information, would be a valuable addition. Alteration of current rules will also improve our annotation capabilities, such as a dynamically determined threshold to account for sequences of variable length. Additionally, we hope in the future to evaluate our system against more ongoing genome annotation projects, to compare automated annotation results with further manually created annotations. The true test of our system would be to annotate a novel genome in parallel with expert scientists.

Acknowledgements

This work is supported by NHGRI grant R01HG02288 and National Library of Medicine training grant T15LM07442. The authors would like to acknowledge Elizabeth Worthey and Alice Erwin for lending their knowledge of annotation to our research, as well as Ron Shaker, Janos Barberos and Dhileep Sivam for their technical assistance.

References

1. Worthey, E., Myler, P., Protozoan genomes: gene identification and annotation. International Journal for Parasitology, 2005. 35: p. 495-512.
2.
Adams, M., Celniker, S., et al., The Genome Sequence of Drosophila melanogaster. Science, 2000. 287(5461): p. 2185-2195.
3. Garrels, J.I., Yeast genomic databases and the challenge of the post-genomic era. Functional & Integrative Genomics, 2002. 2(4-5): p. 212-237.
4. GenBank. 2006 [cited April 2006]; Available from: http://www.ncbi.nlm.nih.gov/Genbank/
5. Kasukawa, T., Furuno, M., et al., Development and Evaluation of an Automated Annotation Pipeline and cDNA Annotation System. Genome Research, 2003. 13.
6. Gouret, P., Vitiello, V., et al., FIGENIX: Intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics, 2005. 6.
7. Potter, S., Clarke, L., et al., The Ensembl Analysis Pipeline. Genome Research, 2004. 14.
8. Apweiler, R., Attwood, T., et al., The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research, 2001. 29(1): p. 37-40.
9. Gaasterland, T., Sensen, C., MAGPIE: automated genome interpretation. Trends in Genetics, 1996. 12(2): p. 76-78.
10. Louie, B., Mork, P., et al., Data Integration and Genomic Medicine. Journal of Biomedical Informatics, 2006.
11. El-Sayed, N., Myler, P., et al., Comparative Genomics of Trypanosomatid Parasitic Protozoa. Science, 2005. 309(5733): p. 404-409.
12. Altschul, S., Gish, W., et al., Basic Local Alignment Search Tool. Journal of Molecular Biology, 1990. 215.
13. Marchler-Bauer, A., Anderson, J., et al., CDD: a Conserved Domain Database for protein classification. Nucleic Acids Research, 2005. 33(D): p. 192-196.
14. Bateman, A., Coin, L., et al., The Pfam protein families database. Nucleic Acids Research, 2004. 32(D).
15. Hulo, N., Bairoch, A., et al., The PROSITE database. Nucleic Acids Research, 2006. 34(D): p. 227-230.
16. Henikoff, S., Henikoff, J., Protein family classification based on searching a database of blocks. Genomics, 1994. 19(1): p. 97-107.
17. Corpet, F., Gouzy, J., et al., The ProDom database of protein domain families. Nucleic Acids Research, 1998. 26(1): p. 323-326.
18. Koski, L., Golding, B., The Closest BLAST Hit Is Often Not the Nearest Neighbor. Journal of Molecular Evolution, 2001. 52: p. 540-542.
19. Donelson, L., Tarczy-Hornoch, P., et al., The BioMediator System as a Data Integration Tool to Answer Diverse Biologic Queries. Proceedings of MedInfo, IMIA, 2004.
20. Wang, K., Tarczy-Hornoch, P., et al., BioMediator Data Integration: Beyond Genomics to Neuroscience Data, in American Medical Informatics Association 2005 Symposium Proceedings. 2005.
21.
Mei, H., Tarczy-Hornoch, P., et al., Expression Array Annotation Using the BioMediator Biological Data Integration System and the BioConductor Analytic Platform, in American Medical Informatics Association 2003 Symposium. 2003.
22. Jess, the Rule Engine for the Java Platform. 2006 [cited 2006]; Available from: http://herzberg.ca.sandia.gov/jess/
23. Ashburner, M., Ball, C., et al., Gene ontology: tool for the unification of biology. Nature Genetics, 2000. 25(1): p. 25-29.
24. Harris, J., Can you bank on GenBank? Trends in Ecology and Evolution, 2003. 18(7): p. 317-319.
25. National Center for Biotechnology Information, Model Organisms Guide. 2006 [cited June 2006]; Available from: http://www.ncbi.nih.gov/About/model/index.html
ABSENT SEQUENCES: NULLOMERS AND PRIMES

GREG HAMPIKIAN
Biology, Boise State University, 1910 N University Drive, Boise, Idaho 83725, USA

TIM ANDERSEN
Computer Science, Boise State University, 1910 N University Drive, Boise, Idaho 83725, USA
We describe a new publicly available algorithm for identifying absent sequences, and demonstrate its use by listing the smallest oligomers not found in the human genome (human "nullomers"), and those not found in any reported genome or GenBank sequence ("primes"). These absent sequences define the maximum set of potentially lethal oligomers. They also provide a rational basis for choosing artificial DNA sequences for molecular barcodes, show promise for species identification and environmental characterization based on absence, and identify potential targets for therapeutic intervention and suicide markers.
1. Introduction
As large-scale DNA sequencing becomes routine, the universal questions that can be addressed become more interesting. Our work focuses on identifying and characterizing absent sequences in publicly available databases. Through this we are attempting to discover the constraints on natural DNA and protein sequences, and to develop new tools for identification and analysis of populations. We term the short sequences that do not occur in a particular species "nullomers," and those that have not been found in nature at all "primes." The primes are the smallest members of the potential artificial DNA lexicon. This paper reports the results of our initial efforts to determine and map sets of nullomer and prime sequences, in order to demonstrate the algorithm and explore the utility of absent sequence analysis. It is well known that the number of possible DNA sequences is an exponentially increasing function of sequence length, equal to 4^n, where n is the sequence length. This means that any attempt to assemble the complete set of unused sequences is hopeless. We have instead developed an approach that examines the minimum-length sequences that are absent. These absent oligomers (nullomers and primes) occur at the boundary between the sets of natural and potentially unused sequences, and in part can be utilized to delineate the two sets [15]. By identifying the boundary nullomers surrounding the various branches
of the phylogenetic tree of life, we hope to produce a map of the negative sequence space around each group. While the nullomer and prime sets will shrink as more sequences are reported, the mechanisms of mutation allow rational predictions to be made about sequence evolution based on the accumulated nullomer data. The excluded sequences can be used for a number of purposes, including:

1. Molecular bar codes
2. Species identification
3. Sequence specification for RNAi, PCR primers and gene chips
4. Database verification and harmonization
5. Drug target identification
6. Suicide targets for recalling or eliminating genetically engineered organisms
7. Pesticide/antibiotic development
8. Environmental monitoring
9. Evolution studies

Our ultimate goal in studying nullomers is to model and predict which biosequences (DNA, RNA and amino acid) are unlikely to be found in the biosphere. If "forbidden" sequences can be identified and confirmed through bioassays, this information will be foundational to understanding the basic rules governing sequence evolution. The insights gained could also greatly improve the theoretical foundation for comparative genomics, and provide an important conceptual framework for genetic engineering using artificial sequences.

2. Background
A naive assumption of early genomic analysis was that sequence distribution over large genomes would approximate randomness: a 6-base sequence, for instance, would be found on average every 4^6, or 4096, bases. Assumptions of this kind were used for calculations such as the number of expected restriction enzyme recognition sites in a genome. But even early studies of genome organization using thermal melting and gradient centrifugation [8,13] showed that there is great non-uniformity in genomic sequences, particularly in warm-blooded vertebrates. What has emerged from many subsequent genome studies is a striking non-random distribution of certain large and short sequence motifs. Many of the described irregularities concern functional units of sequences. For example, AGA codons are rare in bacterial genes, and when artificially substituted for synonymous codons they often have lethal consequences. This is believed to be due to ribosome stalling and the consequent early termination of protein synthesis. The reason for this effect is that while the codon chart tells us that AGA is one of the codons for the amino acid arginine, most bacteria preferentially use CGA to code for arginine. Even though the bacteria have the requisite tRNAs to use an AGA codon, these tRNAs are in such low concentration that the ribosome complex is destabilized while waiting for the tRNA to load an arginine [6]. Examples of such "codon biases" have been seen in all species sequenced to date [20], and are a good example of the constraints on sequence evolution based on progenitor biases. In eukaryotes too, many genomic features have been identified which skew the distribution of very short sequence motifs. For example, one of the authors (GH) was involved in research that examined the role of GG sequences in oxidative damage to DNA. It was found that when oxidizing agents captured electrons from DNA, the electron holes were transferred along the DNA until they reached a GG sequence, where they induced strand breakage [12]. Subsequent studies have borne out our hypothesis that GGG stretches are rare in coding regions, and other researchers have shown that "sentinel GGG" motifs found in non-coding introns serve as sacrificial sinks for oxidative damage [11]. Statistical studies using the autocorrelation function of Bernaola-Galvan (2002) have shown that the human genome contains GC-rich isochores displaying long-range correlations and scale invariance. Other studies have shown long-range correlations between sequence motifs and regularly spaced structural features of the genome, such as nucleosome binding sites [2,21]. All of these studies demonstrate what we would expect for a highly ordered information processing system: it is highly organized, non-random, and constrained by many factors, including the architecture of its storage and processing systems. Thus, even though DNA is passed on through dynamic evolving systems, there are still limits on its content, and some of these limits exist within large species groups.
For example, any limits imposed by nucleosomal organization are applicable to all eukaryotic organisms, while bacteria, which lack nucleosomal structure, are immune to these constraints. This suggests one obvious use for our nullomer approach: the identification of molecular therapeutic targets that are present in the pathogen and absent in the host, or vice versa. Other constraints may be universal, since all organisms share a presumed origin, and many components of DNA function are highly conserved. By examining universally absent sequences (primes), we hope to discover insights into the most conserved mechanisms of molecular biology: inviolable rules which preclude these prime sequences. Interestingly, the vast majority of bio-sequence analysis has ignored the exploration of absent sequences, focusing instead entirely on sequences that are either very rare or very common. Some work has been done to characterize the expected number of missing words in a random text [19]; however, the primary focus
of this research was the application of the result to the construction of pseudorandom number generators. One group has discussed the "absence versus presence" of short DNA sequences for the sake of identifying species [10], and another group has examined absent protein sequences [18]; but our approach is unique in that we are studying the set of smallest absent sequences (nullomers and primes) in order to discover basic rules of sequence evolution, and then apply this understanding for practical purposes such as drug development and the development of a DNA tagging system. Our research stems from one of the primary assumptions of genomic analysis: that over- and under-represented sequences are more likely to be interesting. While our work focuses on the novel area of absent oligomers, the general determination of over- and under-represented sequences has received a great deal of attention [3-5,14,16,17,22]. For example, Nicodeme [16] developed a fast statistical approximation method for determining motif probabilities and demonstrated that over- and under-representation of protein motifs can be a good indicator of functional importance [17]. Stefanov [22] introduced a computationally tractable approach for determining the expected inter-site distance between pattern occurrences in strings generated by a Markov model. Bourdon and Vallee [5] and Flajolet [7] extended techniques for determining the likelihood and frequency of sequence motifs to generalized patterns, in particular patterns where the gap lengths between elements of the pattern in a random text are both bounded and unbounded. Amir et al. [1] generalize the notion of string matching further, developing statistical analysis techniques for an approach they term structural matching, in which the exact text of the strings is not important; rather, two strings are considered to match if some generalized relation between the two strings is satisfied.

3. Counting Sequences

We have developed a set of software utilities for counting sequences in a variety of sequence data. The main software package that we have created is SeqCount. This program has two primary functions. First, it counts the frequency of occurrence of all possible short sequences up to a user-given maximum length in a set of sequence data, and then writes this frequency count information to a file. Second, SeqCount determines the set of sequences that do not occur (nullomers) and writes these sequences to an additional set of files, one file for each sequence length being examined. The algorithm used for counting sequences is shown in Figure 1. The computational complexity of the algorithm is O(mn), where m is the maximum sequence length and n is the amount of DNA being processed. The algorithm
can calculate the frequency of DNA sequences up to length 13 for the human genome (3 billion bases) in approximately 25 minutes on a single-processor machine. The parallel version of the algorithm can process the human genome in less than 1 minute. A single pass through the entire set of DNA data downloaded from the NCBI web site takes approximately 12 hours. In addition to SeqCount, we have created a number of secondary support tools for manipulating and understanding the data output by SeqCount. These support tools are available in both C and Java versions. Also, we have created a web-based interface to some of the data that we have generated with SeqCount. In particular, one can access the sequence counts and nullomer sets for several species for sequences up to length 13.

1. Set the maximum sequence length under consideration (n) and the strand of DNA to examine.
2. Beginning with the 1st position, for each position in the strand of DNA being examined:
   a. Increment the count for the n-length sequence of nucleotides found at the current position.
3. After step 2 has finished:
   a. process the initial counts for the n-length sequences to determine the counts for the complementary strand,
   b. re-process the final n-length counts to determine the counts for all sequences of length n-1 through 1.

Figure 1. Algorithm for counting sequences.

Following is the full list of software packages and support tools that are available:

• SeqCount: Given a set of genomic data in binary format, counts the total number of all sequences up to a user-determined length. The counts are saved in a single file. Additionally, if any sequences within the length given are not found, these sequences are output to a set of nullomer files (one file for each nullomer length).
• GBK2Bin: Given a set of files in GenBank format, converts the files to a binary format wherein each DNA nucleotide is encoded as a 2-bit value. A single file is created for each contiguous sequence of DNA found in the GenBank files, with the file name encoding the location of the sequence.
• CountNulls: Counts the number of nullomers in a nullomer file and prints the result.
• Char2Null: Converts any set of carriage-return-delimited sequences encoded in ASCII format to the nullomer file format. This utility is typically used to take the piped output from DiffNulls, IntNulls, UnionNulls, or ViewNulls and convert the ASCII-based output of these tools to binary format.
• DiffNulls: Takes as input two or more nullomer files and prints to the screen the set difference of the first nullomer file minus the union of the rest of the nullomer files.
• IntNulls: Takes as input two or more nullomer files and prints to the screen the set intersection of the nullomer files.
• UnionNulls: Takes as input two or more nullomer files and prints to the screen the set union of the nullomer files.
• ViewNulls: Takes as input one nullomer file and prints to the screen, in ASCII format, the nullomers contained in the file.
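A minimal sketch of the counting idea behind Figure 1 and SeqCount, simplified to a single length n and a presence set rather than full frequency counts (the real tool packs nucleotides into 2-bit codes, tracks counts for every length up to n, and writes results to files):

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def nullomers(dna, n):
    """Scan every length-n window on the given strand, fold in the
    complementary strand (Figure 1, steps 2-3a), and return the length-n
    sequences that never occur. One pass over the data, as in SeqCount."""
    seen = set()
    for i in range(len(dna) - n + 1):
        window = dna[i:i + n]
        seen.add(window)
        seen.add(window.translate(COMPLEMENT)[::-1])  # reverse complement
    # All 4**n possible n-mers minus those observed on either strand.
    return {"".join(p) for p in product("ACGT", repeat=n)} - seen

missing = nullomers("ACGTACGTAGGC", 3)
```

Enumerating all 4**n candidates is only feasible for small n; this mirrors the sequential tool's length-13 ceiling, past which the parallel version partitions the 4**n sequence space across processors.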
SeqCount processes sequence data in a single pass, and has been optimized for processing speed. SeqCount can be executed either in parallel mode on a Beowulf cluster or in sequential mode on a single workstation. In sequential mode the program is limited to counting sequences up to length 13. When the program is executed in parallel mode and the user requests counts for sequences of length greater than 13, the program evenly divides the sequence space amongst the available processors, and each process is then responsible for counting the sequences that occur within its assigned sequence space. At the end of processing, the counts from each process are collected and written to a file as in the sequential version. The software packages, documentation, and web-based interface can be freely accessed at: http://trac.boisestate.edu/bioinformatics/nullomers.

4. Results
We have downloaded the entire sequence database from the NCBI web site and used our algorithms to determine the nullomer sequences for several fully sequenced organisms (chimpanzee, human, etc.); these results are given in Section 4.1. We have also processed all of the data in the entire DNA sequence database and determined the "prime" DNA sequences (sequences that do not occur in any of the data); these results are given in Section 4.3. In addition, we have processed the entire protein database, and those results are also given in Section 4.3.
4.1. Nullomers - fully sequenced organisms

Table 1 gives the number of DNA nullomers found at lengths 8 through 13 for several different organisms. The results for bacteria, fungi, and yeast are across all sequenced organisms.

Table 1. Number of DNA nullomers at sequence length 8 through 13.

Organism      10      11       12         13
arabid        107     23646    1167012    20237388
bacteria      -       -        541        562870
c elegans     -       7686     1152038    23339534
chicken       -       590      131515     4722702
chimp         -       136      45938      2426474
cow           -       96       45060      2432554
dog           -       40       25217      1868964
fruitfly      -       206      221616     12399300
human         -       80       39852      2232448
mouse         -       178      54383      2625646
rat           -       50       30708      1933220
zebrafish     -       2        15561      2469558
Table 2 shows how the nullomer sets of each of the organisms given in Table 1 intersect with the human nullomer set. The names of the organisms are listed in the first column. The 2nd through 4th columns show the actual size of each intersection for lengths 11 through 13. The 5th through 7th columns show the expected size (under the assumption that each set was independently and randomly generated), and the 8th through 10th columns give the ratio of actual to expected. For the ratio, numbers greater than 1 indicate the degree to which the intersection is larger than expected. The results are sorted in descending order on the ratio value at length 12.

Table 2. Intersection of human nullomers with the nullomers of other organisms.

             Actual size              Expected size                    Ratio (actual/expected)
             11   12     13           11        12        13           11        12        13
chimp        28   19581  1521778      0.002594  109.1195  80719.25     10794.16  179.4455  18.85273
dog          0    4963   731372       0.000763  59.89956  62173.08     0         82.85536  11.76348
rat          8    5975   734566       0.000954  72.94269  64310.63     8388.608  81.91363  11.42216
cow          0    7314   886544       0.001831  107.0339  80921.51     0         68.33348  10.9556
mouse        2    8765   927076       0.003395  129.1794  87344.92     589.0876  67.85136  10.61397
chicken      4    10946  1162632      0.011253  312.396   157105.7     355.4495  35.03886  7.400316
zebrafish    0    1080   504532       3.81E-05  36.96304  82152.48     0         29.21837  6.141409
fruitfly     0    2122   761094       0.003929  526.4187  412476       0         4.031012  1.845184
arabid       0    9521   1325550      0.451012  2772.079  673218.3     0         3.434607  1.968975
c elegans    0    8378   1273344      0.146599  2736.51   776414.5     0         3.061564  1.640031
bacteria     0    0      24242        0         1.285072  18724.47     0         0         1.294669
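The expected sizes in Table 2 are consistent with treating each nullomer set as an independent uniform sample of the 4^L possible length-L sequences, so that E|A ∩ B| = |A| × |B| / 4^L. A quick check against the chimp row, using the length-11 set sizes for human (80) and chimp (136) as reconstructed in Table 1:

```python
def expected_intersection(size_a, size_b, length):
    """Expected overlap of two sets drawn independently and uniformly at
    random from the 4**length possible DNA sequences of that length."""
    return size_a * size_b / 4 ** length

# Human (80) and chimp (136) nullomer counts at length 11, from Table 1.
e = expected_intersection(80, 136, 11)  # ~0.0026, matching Table 2's chimp row
ratio = 28 / e                          # actual/expected, ~1.08e4
```

The actual intersection of 28 is therefore roughly four orders of magnitude larger than chance would predict, which is what the ratio column expresses.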
Human and chimp have the greatest intersection between their absent sequences, and mammals in general show a much stronger intersection with human than the other listed organisms. While this is intuitively satisfying, further studies will be required to demonstrate if nullomer sets can be used to corroborate phylogenetic relationships among species.
4.2. Human genome nullomers

Other researchers have reported absent sequences as part of large-scale analyses [9]; however, as far as we know this is the first publication of an actual list of human nullomers. Our results also differ from earlier reports of 44 absent 11-mers, in that we have found 43 sequences (and their complements) which are not found in the two published human genomes (Table 3). Of these sequences, four 11-mers and their complements currently have no sequence match in any reported human sequence in GenBank, as determined by BLAST.

Table 3. Human nullomers at length 11, with the number of BLAST matches against reported human sequences in GenBank.

Matches  Nullomer         Matches  Nullomer
0        cgctcgacgta      3        cgcgcataata
0        gtccgagcgta      3        cgacggacgta
0        cgacgaacggt      3        cgaatcgcgta
0        ccgatacgtcg      3        cggtcgtacga
2        tacgcgcgaca      3        gcgcgtaccga
2        cgcgacgcata      3        cgcgtaatcga
2        tcggtacgcta      3        cgtcgttcgac
2        tcgcgaccgta      3        ccgtcgaacgc
2        cgatcgtgcga      3        acgcgcgatat
2        cgcgtatcggt      3        cgaacggtcgt
2        cgtcgctcgaa      3        cgcgtaacgcg
2        tcgcgcgaata      3        ccgaatacgcg
2        tcgacgcgata      3        catatcgcgcg
2        atcgtcgacga      4        cgcgacgttaa
2        ctacgcgtcga      4        gcgcgacgtta
2        cgtatacgcga      4        ccgacgatcgt
-        cgattacgcga      4        ccgttacgtcg
-        cgattcggcga      5        ccgcgcgatat
-        cgacgtaccgt      6        ccgacgatcga
-        cgacgaacgag      7        cgaccgatacg
-        cgcgtaatacg      20       cgaatcgacga
-        cgcgctatacg
We are presently searching the available single nucleotide polymorphism (SNP) databases to determine which, if any, of the nullomers are associated with known SNPs.

4.3. Primes - all sequence data

We have also used our algorithms to process the entire DNA sequence database available from NCBI, and found that length 15 is the shortest length at which primes (absent sequences) are found. At this length there are 60370 primes that are not found in any of the DNA sequence data. These sequences can be referenced through our web site at http://trac.boisestate.edu/bioinformatics/nullomers.
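The inventory step described above can be sketched in a few lines: record every k-mer that occurs in the input, subtract the observed set from all 4^k possibilities, and increase k until something is absent. This is a toy sketch for small inputs only (the published tool runs at genome scale on a cluster; the function names here are our own):

```python
from itertools import product

def nullomers(sequences, k):
    """Return the set of length-k strings over ACGT absent from all inputs."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N, etc.
                seen.add(kmer)
    return {"".join(p) for p in product("ACGT", repeat=k)} - seen

def shortest_prime_length(sequences, max_k=12):
    """Smallest k at which at least one k-mer is absent, with the absent set."""
    for k in range(1, max_k + 1):
        absent = nullomers(sequences, k)
        if absent:
            return k, absent
    return None
```

For the full NCBI nucleotide data this first succeeds at k = 15, as reported above; on a toy input it succeeds much sooner.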
We have also processed all available protein sequences, and identified 1799 primes of length 5. It should be noted that this number is significantly less than the 12,080 "zero count pentats" that were reported in 2004 by Otaki et al.18 In that paper, the researchers cloned 6 of their zero count pentats and showed that they were not lethal when expressed in E. coli. But we found (using our algorithm) that 5 of the 6 "zero-count" oligomers are actually presently listed in GenBank. This discrepancy is likely due to the addition of new protein data at NCBI since the zero count search was performed in September of 2003. This demonstrates the need for continued processing of these data, and the utility of our web-available program for conducting immediate absent sequence inventories. We believe that the approach taken by Otaki et al.18 is a valuable first step in examining the potential lethality of absent sequences. As the number of such sequences shrinks, and large scale expression projects become more routine, the fitness effects of nullomers and primes can be studied more systematically. The fact that the nucleotide and amino acid primes both presently correspond to a length of 5 amino acids (15 nucleotide bases in the DNA database, and 5 amino acids in the protein database) is coincidental. We examined all possible coding sequences for the 1799 length-5 protein primes, and did not find any intersection with the DNA primes at length 15. The nucleotide sequences include coding and non-coding DNA, while the protein database has only expressed (and hypothetically expressed) sequences. Thus it is likely that most nucleotide sequences representing codons for absent amino acid sequences are found only in non-coding regions of DNA. We are presently exploring the intersection of amino acid nullomers and DNA nullomers in coding regions, and will report those results separately.
On average, the protein primes had about half as many possible DNA coding sequences as expected for peptides of their length, which indicates that the set of protein primes is biased towards those protein sequences that have fewer DNA coding options. We found 5 protein primes that have a single DNA coding sequence: MWMWW, MWWWW, WMMWM, WMWWW, and WWMMW. We then performed a BLAST search for short, exact matches to each of these DNA coding sequences and examined the results. Each DNA sequence yielded a number of exact matches. Most of these matches were in intron-specific regions; however, several of the matches occurred in putative coding regions. We are currently working to resolve each of these database discrepancies. The identification of apparent discrepancies between protein and nucleotide primes in coding regions demonstrates the utility of the nullomer approach as a tool for harmonizing the various biomolecular databases.
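The five primes above have a single coding sequence because methionine (ATG) and tryptophan (TGG) are the only amino acids with exactly one codon in the standard genetic code, so any peptide built solely from M and W is uniquely codable. A small sketch of the degeneracy calculation (the helper name is ours; the degeneracy table is the standard genetic code):

```python
# Codon degeneracy of each amino acid in the standard genetic code.
CODON_COUNTS = {
    'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
    'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
    'T': 4, 'W': 1, 'Y': 2, 'V': 4,
}

def coding_sequence_count(peptide):
    """Number of distinct DNA sequences encoding the peptide: the product
    of the codon degeneracies of its residues."""
    n = 1
    for aa in peptide:
        n *= CODON_COUNTS[aa]
    return n

print(coding_sequence_count("MWMWW"))  # -> 1: Met and Trp each have one codon
```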
5. Conclusion and Discussion
We have developed a series of tools for the identification and study of absent sequences. Using these tools we have made publicly available the full set of amino acid and nucleotide primes (the shortest sequences not found in their respective databases). In order to allow creative extensions of our approach, the software packages, documentation, and web-based interface can be freely accessed at http://trac.boisestate.edu/bioinformatics/nullomers. In this paper we demonstrate some of the uses of these tools, and the elegance of the nullomer approach. It should be noted that nullomer searches have corollaries in the natural world, most notably in the development of the human immune system. During embryonic development a large variety of antigen-recognizing cells are generated by the random rearrangement of DNA cassettes coding for the "variable" segments of antibody-producing cells. This DNA shuffling results in the incredible diversity of immune cells, which produce molecular soldiers that each recognize a single small oligomer (peptide, lipid, sugar, or nucleotide). This army is reduced by a colossal "deselection" in the embryonic thymus. Here, any immune cell which finds its target among the "self" molecules is culled from the army. In essence, what is left is a sentinel army of nullomer hunters. They recognize and destroy only absent oligomer sequences. When an adult immune cell detects its particular nullomer, it is stimulated to reproduce, and sometimes to hypermutate in order to recognize related nullomers. Thus the natural defense system of the body is based on recognizing nullomers, and anticipating oligomers that may arise from them. This type of approach would be useful in any intelligent response to novel biological threats, natural or manmade. For example, nullomer detection in environmental samples could indicate the introduction of novel natural or engineered species.
The rapid response to such a potential threat should include the generation of agents to detect and possibly incapacitate related novel molecules. The absent sequences that we report here represent the largest possible set of artificial oligomers. Within this dynamic, shrinking set will be found all lethal oligomers, if any exist. These small molecules may prove to be powerful bioactive compounds which act in a species-specific or group-specific manner. Within the set of primes, there is even the possibility of a pan-lethal agent which could function as a sterilant, or a suicide gene for therapeutic and biocontrol applications. We have also shown that nullomer searches can be used to assess the harmony of molecular databases (nucleotide and protein), and to identify potential therapeutic targets that exist in a pathogenic species but not its host. The nullomer approach may also be useful for studying genome relationships, in that the absent oligomers (nullomers) are more similar in closely related species than in those more distantly related. Finally, it is easy to construct artificial tags of DNA or amino acids that
have not been reported in GenBank. But identifying the smallest oligomers that have not been found in a species or group of species provides the first rational basis for the construction of an artificial DNA lexicon. By devising tags based on nullomers and primes, more efficient and elegant artificial sequences can be constructed. These sequences can be used to identify artificial constructs, tag them with identifying characteristics, or even code for suicide genes in order to "recall" a genetically engineered product.

Acknowledgements: The authors wish to thank the following people for their help: Dr. Amit Jain and Ben Noland for assistance with the Beowulf cluster and initial algorithms; Barry Hall, Jim Smith and Ken Cornell for comments and criticism about the nullomer approach; Jim Munger for his encouragement and support; and the anonymous reviewers who provided valuable feedback.

References
1. Amir, A., Cole, R., Hariharan, R., Lewenstein, M., & Porat, E. (2003). Overlap Matching. Inf. Comput. 181(1), 57-74.
2. Audit, B., Vaillant, C., Arneodo, A., d'Aubenton-Carafa, Y., & Thermes, C. (2002). Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J. Mol. Biol. 316(4), 903-918.
3. Apostolico, A., Bock, M., & Lonardi, S. (2002). Monotony of Surprise and Large-Scale Quest for Unusual Words. Proceedings of the Sixth Annual International Conference on Computational Biology, pp. 22-3.
4. Apostolico, A., Gong, F., & Lonardi, S. (2004). Verbumculus and the Discovery of Unusual Words. Journal of Computer Science and Technology 19(1), 22-41.
5. Bourdon, J. & Vallee, B. (2002). Generalized Pattern Matching Statistics. In Mathematics and Computer Science II, Versailles, 249-265.
6. Cruz-Vera, L.R., Magos-Castro, M.A., Zamora-Romo, E., & Guarneros, G. (2004). Ribosome stalling and peptidyl-tRNA drop-off during translational delay at AGA codons. Nucleic Acids Res. 32(15), 4462-4468.
7. Flajolet, P., Guivarc'h, Y., Szpankowski, W., & Vallee, B. (2001). Hidden Pattern Statistics. ICALP 2001, 152-165.
8. Filipski, J. (1987). Correlation between molecular clock ticking, codon usage fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217, 184-186.
9. Fofanov, Y., Luo, Y., Katili, C., Wang, J., Y.B., Powdrill, T., Fofanov, V., Li, T.-B., Chumakov, S., & Pettitt, B.M. (2004). How independent are the appearances of n-mers in different genomes? Bioinformatics 20(15), 2421-2428.
10. Fofanov, V., Fofanov, Y., & Pettitt, B. (2002). Counting array algorithms for the problem of finding appearances of all possible patterns of size n in a sequence. In The 2002 Bioinformatics Symposium, Keck/GCC Bioinformatics Consortium, p. 14. W.M. Keck Center for Computational and Structural Biology, Houston, Texas.
11. Friedman, K. & Heller, A. (2001). On the Non-Uniform Distribution of Guanine in Introns of Human Genes: Possible Protection of Exons against Oxidation by Proximal Intron Poly-G Sequences. J. Phys. Chem. B 105(47), 11859-11865.
12. Henderson, P.T., Jones, D., Hampikian, G., Kan, Y., & Schuster, G.B. (1999). Long-distance charge transport in DNA: the phonon-assisted polaron-like hopping mechanism. Proc. Natl. Acad. Sci. USA 96, 8353-8358.
13. Inman, R.B. (1966). A denaturation map of the λ phage DNA molecule determined by electron microscopy. J. Mol. Biol. 18, 464-476.
14. Leung, M.Y., Marsh, G.M., & Speed, T.P. (1996). Over- and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Biol. 3, 345-360.
15. Mitchell, T. (1997). Machine Learning. New York: McGraw Hill.
16. Nicodeme, P. (2001). Fast approximate motif statistics. Journal of Computational Biology 8(3), 234-248.
17. Nicodeme, P., Doerks, T., & Vingron, M. (2002). Proteome Analysis Based on Motif Statistics. Bioinformatics 18, 161-171.
18. Otaki, J., Ienaka, S., Gotoh, T., & Yamamoto, H. (2005). Availability of short amino acid sequences in proteins. Protein Science 14, 617-625.
19. Rahmann, S. & Rivals, E. (2000). Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. CPM 2000, 375-387.
20. Reis, M., Sawa, R., & Wernisch, L. (2004). Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32, 5036-5044.
21. Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K., Wang, J.Z., & Widom, J. (2006). A Genomic Code for Nucleosome Positioning. Nature 442(7104), 772-778.
22. Stefanov, V. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. Journal of Applied Probability 40(4), 881-892.
AN ANATOMICAL ONTOLOGY FOR AMPHIBIANS*

ANNE M. MAGLIA
Department of Biological Sciences, University of Missouri-Rolla, 105 Schrenk Hall, Rolla, MO 65409, USA

JENNIFER L. LEOPOLD
Department of Computer Science, University of Missouri-Rolla, 317 Computer Science, Rolla, MO 65409, USA

L. ANALIA PUGENER
Department of Biological Sciences, University of Missouri-Rolla, 105 Schrenk Hall, Rolla, MO 65409, USA

SUSAN GAUCH
Department of Electrical Engineering & Computer Science, The University of Kansas, Lawrence, KS 66045, USA
Herein, we describe our ongoing efforts to develop a robust ontology for amphibian anatomy that accommodates the diversity of anatomical structures present in the group. We discuss the design and implementation of the project, current resolutions to issues we have encountered, and future enhancements to the ontology. We also comment on future efforts to integrate other data sets via this amphibian anatomical ontology.
1. Introduction

1.1. The Need for an Amphibian Anatomical Ontology

Studies of gene expression, molecular markers, and developmental biology are advancing our knowledge of the morphogenetic and evolutionary processes that lead to disease, physiological responses, adaptation, and phylogenetic diversity. Results from these studies promise both to enhance our quality of life and reveal the complex connection between genotype and phenotype. But to understand fully the results, we must have a detailed understanding of the anatomy of organisms. Unfortunately, the lack of terminological standardization for the anatomy of most organisms limits our ability to compare results across taxa, and thus has restricted the applicability of many embryological and gene expression experiments. The scientific community is well aware of this problem. In the hopes of facilitating the integration of genetic, embryological, and morphological studies, several groups are developing anatomical ontologies for certain model species (e.g., mouse, zebrafish). Further demonstrating the importance of anatomical ontologies was the recent National Center for Biomedical Ontology-sponsored workshop1 to bring researchers together to discuss issues associated with developing anatomical ontologies. The need for terminological standardization of anatomy is particularly pressing in amphibian morphological research. Amphibians are commonly used for gene expression and embryological studies, yet the three amphibian orders—Salientia (frogs and toads), Caudata (salamanders and newts), and Gymnophiona (caecilians)—are so morphologically distinct that studies of one order are rarely applied to another. As a consequence, morphological and developmental studies of frogs, salamanders, and caecilians are conducted by disassociated research groups, resulting in three different amphibian anatomical lexicons. Language inconsistencies confuse our understanding of homology, and thus, our ability to use morphology to understand the phylogeny and biodiversity among the orders. In addition, disparate anatomical lexicons limit our abilities to conduct comparative anatomical research, while hindering the integration of morphological, genomic, and embryological data. There are several challenges to developing an ontology for amphibian anatomy. First, the separate anatomical lexicons must be reconciled. Second, there are over 6,000 species of amphibians for which the anatomical terminology must be resolved.

* This work is partially supported by NSF grant DBI-0445752.
Although much of the terminology is similar across species, among-species variation will lead to a much larger ontology than those developed for a single model species. Third, because of anatomical diversity among amphibian orders, homologies of some structures are unknown; therefore, assigning terminological standards to them may be problematic. These challenges can be overcome by forging a partnership between the amphibian morphological community and the power of information extraction technology. Herein, we describe our ongoing efforts to develop a robust ontology for amphibian anatomy. We discuss the design and implementation of the project, resolutions to date for issues that we have encountered, and future enhancements and modifications to the ontology. In addition, we comment on future plans to integrate other data sets via the amphibian anatomical ontology.

1 http://www.bioontology.org/wiki/index.php/Anatomy_Ontology_Workshop

1.2. Prior Work in Biological Ontologies

As stated in [1], "ontologies are becoming popular largely due to what they promise: a shared and common understanding of a domain that can be communicated between people and application systems." The importance of ontologies has not been lost on the biological community—a research domain that is notorious for its complex form and semantics, and one that will benefit tremendously from data integration and analysis [2]. Perhaps the best known of the biological ontologies is the Gene Ontology (GO), which began in the late 1990s as a collaboration among three model-organism databases (FlyBase, the Saccharomyces Genome Database, and the Mouse Genome Database), but has grown to include many other genomic databases. The biomedical research community has made significant strides in developing medical and clinical ontologies. One of the most extensive projects is the U.S. National Library of Medicine's Unified Medical Language System (UMLS), a comprehensive knowledge-representation system that includes data sources and software tools (e.g., the Metathesaurus, the Semantic Network, and the Specialist Lexicon) that facilitate information retrieval, natural language processing, and other vocabulary services for biomedical research data. As an extension to the UMLS, the Digital Anatomist Foundational Model (FMA), an ontology of human anatomical relationships, was developed as part of the Digital Anatomist project [3]. Both GO and UMLS have proved to be extremely valuable for several widely-used applications (e.g., PubMed, Swiss-Prot). Some bio-ontology projects have begun integrating genomic and anatomical information for model species (e.g., the Zebrafish Information Network (ZFIN), The Jackson Laboratory's Mouse Anatomical Dictionary project, and the FlyBase list of Anatomy and Development terms).

Gene Ontology: http://www.geneontology.org
FlyBase: http://flybase.bio.indiana.edu
Saccharomyces Genome Database: http://www.yeastgenome.org
Mouse Genome Database: http://www.informatics.jax.org
UMLS: http://www.nlm.nih.gov/research/umls
PubMed: http://www.pubmed.gov
Swiss-Prot: http://www.ebi.ac.uk/swissprot
ZFIN: http://zfin.org
Mouse Anatomical Dictionary: http://www.informatics.jax.org/searches/anatdict_form.shtml
Unfortunately, some of these anatomical ontologies have restrictions that prevent their application to other organisms. For example, often there is a narrow set of relations, such as is-part-of and develops-from—terms that limit the options for describing the inter- and intra-relationships of anatomical parts. This limitation of concepts and properties also limits their use for phylogenetic and comparative anatomical analyses.

2. Methodological Considerations and Ontology Construction

The architecture of an ontology typically is sufficiently complex to require a considerable amount of manual effort. As such, the development of an ontology usually is carried out by experts in the knowledge domain. Based on [4], the process of constructing an ontology can be represented by the following steps:

1. Determine the boundaries of the ontology.
2. Consider reusing (parts of) existing ontologies.
3. Enumerate all the concepts to include.
4. Define an appropriate taxonomy to describe concepts, properties, and relationships.
5. Define properties of the concepts.
6. Define facets of the concepts, such as cardinality, required values, etc.
7. Define instances.
8. Check the consistency of the ontology.
Using the Protege-OWL editor [4], we developed an ontology in OWL DL for amphibian morphology that was consistent with the recommendations outlined in the Suggested Upper Merged Ontology (SUMO) [5]. In accordance with the list above, we first determined that the boundary for the ontology should include all anatomical physical, self-connected objects* for all amphibians (i.e., frogs, toads, salamanders, newts, and caecilians). We evaluated a number of existing sources for reuse, including the SUMO mapping of WordNet [6], the Unified Medical Language System (UMLS), and several species-specific anatomical ontologies (e.g., the Jackson Laboratory's Mouse Anatomical Dictionary, the Anatomical Dictionary†, and the ZFIN Anatomical Ontology of the Zebrafish). The SUMO mapping of WordNet provides basic descriptions of terms, and although we were able to identify a few concepts applicable to amphibian morphology, the terminology is too general to be useful for this project. The UMLS is an extensive biomedical ontology containing numerous concepts and relationships. However, our initial attempts to incorporate the UMLS terminology into the amphibian morphological ontology proved to be difficult because: 1) UMLS contains numerous concepts that are not relevant to the amphibian anatomical lexicon and, 2) those concepts that are relevant are not detailed enough for our needs. We also experimented with using an approach similar to the Foundational Model of Anatomy. Interestingly, the top-level organization of this ontology is based on abstract geometric concepts and relationships (e.g., spaces, points, adjacency, direction, etc.). Although such conceptual organization facilitates spatial queries at different levels of complexity, we felt that, for our initial efforts, a top-level organization based on anatomical systems was more consistent with facilitating comparisons among amphibian taxa. Of the species-specific anatomical ontologies, the ZFIN Zebrafish Anatomical Dictionary is most in line with the goals of our project‡. We adopted relevant concepts, hierarchy, and relationships from ZFIN as an initial framework for the amphibian morphological ontology. Subsequent modifications and enhancements to our knowledge base, including the addition of concepts and properties and the identification of instances, were made by manually mining literature sources [e.g., 7, 8, 9, 10]. Finally, the consistency of the ontology was evaluated through tools provided in the Protege-OWL ontology builder. End-user evaluations of the usability and usefulness of the ontology are planned (see Section 3.3).

* No abstract concepts were defined in the amphibian morphology ontology. Furthermore, each concept in the ontology is considered a self-connected object whose parts are all mediately or immediately connected with one another, and no collection concepts have been defined at this time. No process concepts are currently included in the ontology; however, such an extension may be added in the future to represent functional and physiological knowledge. See [5] for a more detailed discussion of these SUMO top-level ontological categories.
† http://www.dinosauria.com/dml/anatomy.htm
‡ It is important to note that at the time of this writing no information was publicly available about the dictionary of embryological anatomy of Xenopus (African clawed frog); thus, we could not evaluate the appropriateness of the contents of that knowledge base. When it becomes available, we plan to explore the integration of the dictionary with our amphibian anatomical ontology.
FlyBase Anatomy and Development terms: http://flybase.bio.indiana.edu/cgi-bin/fbcvq.html7start

3. The Amphibian Anatomical Ontology

3.1. The Semantic Network

The amphibian anatomy semantic network currently consists of 212 semantic concepts and 58 relationships. Each concept is given a textual definition, adopted from ZFIN (where appropriate) or manually mined from the literature. Properties in the ontology are symmetric (e.g., is-fused-to), inverse (e.g., forms vs. is-formed-from), functional and inverse functional (e.g., is-defined-as vs. is-the-definition-of), or transitive (e.g., is-part-of). A partial view of the concept hierarchy and properties for the amphibian anatomical ontology (as displayed in Protege) is shown in Figure 1.

3.2. Challenges and Current Solutions

Because of the broad range of organisms and morphologies included in our amphibian anatomical ontology, we faced several challenges in its development. For example, we were required to represent anatomical diversity in a logical and meaningful manner within the terminological and hierarchical framework of the ontology. To do this, we included taxonomic (i.e., Linnaean nomenclature) references as concepts in the ontology. In this way, we were able to designate the range of an instance of a concept as a given taxonomic group. This method also provided us with a way of referencing homologous and partially-homologous structures, while allowing the community to continue to use commonly-accepted terminology (e.g., the orbitosphenoid in salamanders is homologous to the sphenethmoid in frogs). An additional challenge arose from the need to include developmental stages in the ontology. Most ontologies that include development information are created specifically for that purpose, and often do not include information about adult anatomy (let alone anatomical diversity among groups). To overcome this challenge, we took an approach similar to the one above and included developmental stages as classes. As such, we could designate the range of a concept as an instance of a particular developmental stage.

3.3. Planned Modifications and Enhancements

As is the case with most biological ontologies (e.g., Gene Ontology, Plant Ontology), the current ontology of amphibian anatomy can be considered a partonomy, because it uses both is-a and part-of relationships in the hierarchical foundation.
Although the use of part-of relationships appears to be a logical representation of biological hierarchy, as shown by [11], the inclusion of part-of relationships in the hierarchy of a structural ontology can result in inconsistencies and multiple inheritances that are illogical, and can limit the mapping of an ontology into other such systems.
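The property characteristics described in Section 3.1 map directly onto OWL DL constructs. A hypothetical Turtle fragment illustrating the idea (the namespace and exact URIs are our own, not taken from the published ontology; the class and property names follow those mentioned in the text and Figure 1):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/amphibian#> .

:is-fused-to a owl:ObjectProperty , owl:SymmetricProperty .
:forms       a owl:ObjectProperty ; owl:inverseOf :is-formed-from .
:is-part-of  a owl:ObjectProperty , owl:TransitiveProperty .

# is-a expressed via subclassing; part-of via an existential restriction
:Chondrocranium rdfs:subClassOf :Cranial_skeleton .
:Cranial_skeleton rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :is-part-of ;
      owl:someValuesFrom :Skeletal_system ] .
```

Mixing rdfs:subClassOf (is-a) with restrictions on :is-part-of in the same hierarchy is exactly the partonomy pattern discussed above, and is the source of the multiple-inheritance issues noted in [11].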
Plant Ontology: http://www.plantontology.org/docs/otherdocs/poc_project.html
Figure 1. A partial view of the concept hierarchy and properties for the amphibian anatomical ontology, as displayed in Protege. The visible portion of the hierarchy includes Skeletal_system (Axial_skeleton; Cranial_skeleton: Chondrocranium, Dermatocranium, Neurocranium, Splanchnocranium; Forelimb_skeleton; Hindlimb_skeleton), Digestive_system, Embryonic_structures, Integument, and Muscular_system.
Figure 1. Detectability plot of a hypothetical protein A, broken up into tryptic peptides a-e, and protein B, containing peptides a-c and f-j. Assume that peptides a-c are identified by the peptide identification software (shaded). Peptides in each protein are sorted according to their detectability. The example shows the intuition for tie breaking in the proposed protein inference problem. Peptides a-c are more likely to be observed in protein A than d-e, while they are less likely to be observed than peptides f-j in protein B. Thus, protein A is more likely to be present in the sample than B. Note that the detectability for the same peptide within different proteins is not necessarily identical, due to the influence of neighboring regions in its parent proteins.

are degenerate.7 However, if all the tryptic peptides are ranked in each protein according to their detectabilities (Fig. 1), we may infer that protein A is more likely to be present in the sample than protein B. This is because if B is present we would have probably observed peptides f-j along with peptides a-c, which all have lower detectabilities than either f, g, h, or j. On the other hand, if protein A is present, we may still miss peptides d and e, which have lower detectabilities than peptides a-c, especially if A is at relatively low abundance.8 In summary, peptide detectability and its correlation with protein abundance provides a means of inferring the likelihood of identifying a peptide relative to all other peptides in the same parent protein. This idea can then be used to distinguish between proteins that share tryptic peptides based on a probabilistic framework. Based on this simple principle, we propose a reformulation of the protein inference problem so as to exploit the information about computed peptide detectabilities. We also propose a tractable heuristic algorithm to solve this problem. The results of our study show that this algorithm produces reliable and less ambiguous protein identities. These encouraging results demonstrate that peptide detectability can be useful not only for label-free protein quantification, but also for protein identification that is based on identified peptides.8,9
2. Problem Formulation

Consider a set of proteins F = {P_1, P_2, ..., P_N} such that each protein P_j consists of a set of tryptic peptides {p_j^i}, i = 1, 2, ..., n_j, where n_j is the number of peptides in {p_j^i}. Suppose that J = {f_1, f_2, ..., f_M} is the set of peptides identified by some database search tool, and that J ⊆ ∪{p_j^i}. Finally, assume each peptide p_j^i has a computed detectability D(p_j^i), for j = 1, 2, ..., N and i = 1, 2, ..., n_j. We use D to denote the set of all detectabilities D(p_j^i), for each i and j. The goal of a protein inference algorithm is to assign every peptide from J to a subset of proteins from F which are actually present in the sample. We call this assignment the correct peptide assignment. However, because in a real proteomics experiment the identity of the proteins in the sample is unknown, it is difficult to formulate the fitness function that equates optimal and correct solutions. Thus, the protein inference problem can be redefined to find an algorithm and a fitness function which result in the peptide-to-protein assignments that are most probable, given that the detectability for each peptide is accurately computed. In a practical setting, the algorithm's optimality can be further traded for its robustness and tractability. If all peptides in J are required to be assigned to at least one protein, the choice of the likelihood function does not affect the assignment of unique (non-degenerate) peptides in ∪{p_j^i}. On the other hand, the tie resolution for degenerate peptides will depend on all the other peptides that can be assigned to their parent proteins, and their detectabilities. In order to formalize our approach we proceed with the following definitions.

Definition 1. Suppose that the peptide-to-protein assignment is known. A peptide p_j^i ∈ {p_j^i} is considered assigned to P_j if and only if p_j^i ∈ J and D(p_j^i) ≥ M_j. Then, M_j ∈ D is called the Minimum Detectability of Assigned Peptides (MDAP) of protein P_j.

Definition 2.
A set of MDAPs {M_j}, j = 1, 2, ..., N, is acceptable if for each f ∈ J there exists P_j such that D(f) ≥ M_j. Thus, any acceptable MDAP set will result in an assignment of identified peptides that guarantees that every identified peptide is assigned to at least one protein.

Definition 3. A peptide p_j^i is missed if p_j^i ∉ J and D(p_j^i) ≥ M_j.

while J ≠ ∅:
    choose f ∈ J with the lowest detectability
    for each protein i containing f:
        compute the number of missed peptides, assuming M_i = D(f)
    select the protein j with the minimum number of missed peptides
    set M_j = D(f)
    remove from J all peptides from protein j

Figure 2. Pseudocode for the LDFA solution to the minimum missed peptide problem.

Out of 176,470 proteins from Swiss-Prot, 494 proteins (including the 12 proteins from the mixture) were identified as containing at least one identified peptide. The LDFA identified 12 proteins in the sample, 11 correctly. Of the 11 proteins that were correctly assigned, in only one instance could the algorithm not distinguish between the correct protein and one of its close homologs. We refer to this situation as a tie. Each tie is resolved by a random selection. The same data were tested using the GMPSA, which simply tries to explain the identified peptides with the smallest possible number of proteins. GMPSA also identified 12 proteins as the total number of proteins in the sample; however, it suffered in accuracy. For 5 out of the 12 proteins, the GMPSA could not distinguish between the correct proteins and their homologs. Since in each step the GMPSA considers only the number of the identified peptides per protein, it is much more likely to encounter ties than the LDFA. As shown in Fig. 1, the GMPSA does not have a means of differentiating between proteins containing no unique identified peptides and the same number of degenerate peptides. In practice, these result in ties involving more homologs than the LDFA, and thus reduce the chance of selecting the correct protein. An example of such a tie involves protein HBB_HUMAN.
The LDFA found two possible solutions (HBB_HUMAN and HBB_GORGO), resulting in a 50% chance of a correct selection. On the other hand, the GMPSA selected between four different proteins (HBB_HUMAN, HBB_HAPGR, HBB_HYLLA and HBB_PANPO), resulting in a 25% chance of a correct prediction. Furthermore, the smaller average number of proteins per tie encountered by the LDFA is advantageous for reporting identification results. To avoid an information leak in calculating peptide detectabilities, the training set for the predictor was constructed from a different synthetic dataset.
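The greedy loop in Fig. 2 can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; it assumes detectability is a single number per peptide sequence, and the names `ldfa` and `missed` are ours:

```python
# Hypothetical sketch of the LDFA (lowest-detectability-first assignment) loop.
# proteins:   {protein_id: set of tryptic peptides}
# detect:     {peptide: predicted detectability D(p)}
# identified: the set F of peptides reported by the search tool

def missed(pep_set, detect, identified, threshold):
    """Peptides of a protein predicted above the MDAP threshold but never identified."""
    return sum(1 for p in pep_set if detect[p] >= threshold and p not in identified)

def ldfa(proteins, detect, identified):
    """Return {protein_id: MDAP} for the proteins selected to explain F."""
    remaining = set(identified)
    mdap = {}
    while remaining:
        f = min(remaining, key=lambda p: detect[p])          # lowest detectability first
        candidates = [pid for pid, peps in proteins.items() if f in peps]
        # Pick the parent protein that would leave the fewest missed peptides
        # if its MDAP were set to D(f).
        best = min(candidates,
                   key=lambda pid: missed(proteins[pid], detect, identified, detect[f]))
        mdap[best] = detect[f]
        remaining -= proteins[best]                          # its peptides are now explained
    return mdap
```

For a degenerate peptide, the protein whose unidentified peptides have lower detectabilities wins the tie, which is exactly the behavior the text attributes to the LDFA.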
The one protein that was not identified correctly by the LDFA, bovine RNase A, was assigned to a close homolog from one of 7 organisms (69.4% average sequence identity) chosen at random. This assignment was made with a single identified peptide. Furthermore, the sequence for bovine RNase A in the Swiss-Prot database includes the 26-amino-acid signal peptide that is not actually present in the sample. Since the LDFA takes into consideration the detectabilities of both identified and unidentified peptides, the presence of the signal peptide in the database hinders the assignment of bovine RNase A. After the signal peptide is removed, the sequence identity compared to all seven sequences that match the identified peptide is 84.0%. In comparison, the GMPSA randomly selects among 20 proteins from Swiss-Prot sharing the identified peptide.

Another experiment was performed on a biological sample from R. norvegicus, in which the correct proteins were not known. The identified peptides in the sample (693 in total) were searched against an IPI (http://ncbi.nlm.nih.gov) database and were found in 805 proteins. These are the proteins that may potentially be present in the sample. Table 1 shows the distribution of these peptides by the number of proteins that contain them. In this experiment, about 60% of the identified peptides (397 out of 693) are degenerate, i.e. contained in two or more proteins. The two algorithms described above, LDFA and GMPSA, were run on this set.

Table 1. Distribution of identified peptides contained by different numbers of proteins in a R. norvegicus proteome analysis.

No. proteins    No. peptides
1               296
2-5             330
6-10            43
11-20           16
>20             8
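A tally like Table 1 can be reproduced directly from a peptide-to-protein mapping. The function below is a hypothetical sketch with our own bin labels, not the authors' pipeline:

```python
# Sketch: bin identified peptides by how many database proteins contain them,
# using the same brackets as Table 1.

from collections import Counter

def degeneracy_distribution(peptide_to_proteins):
    """peptide_to_proteins: {peptide: set of protein ids containing it}."""
    def bucket(n):
        if n == 1: return '1'
        if n <= 5: return '2-5'
        if n <= 10: return '6-10'
        if n <= 20: return '11-20'
        return '>20'
    return Counter(bucket(len(prots)) for prots in peptide_to_proteins.values())
```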
Mascot had originally assigned 301 proteins in this sample; the LDFA assigned 275 proteins and the GMPSA assigned 247 proteins. Taking into consideration all unique peptides from the rat sample, only 149 proteins could be assigned at least one unique peptide. Thus, any other protein assigned by any of the three methods would have to rely solely on degenerate peptides. Due to the prevalence of ties, the GMPSA was run 30 times. Only 153 proteins were consistently assigned in all runs. Out of 430 proteins assigned over all GMPSA runs, 229 were assigned less than 50% of the time. Since the correct proteins in this sample were not known, the accuracy of the LDFA and GMPSA could not be quantified as on the synthetic data. Instead, a different approach was taken, in which protein distinguishability was measured. Figure 3 shows, in grey, all pairs of the 805 identified proteins that shared at least one identified peptide. The y-axis corresponds to the percentage of sequence identity, while the x-axis represents the length of one of the proteins.
(d) Mouse FunctionalFlow

Figure 2. ROC analysis using a fixed probability threshold. (Legend: Cons LinOP 204k; Cons NoisyOR 213k; LogLikGS LinOP 306k; LogLikGS NoisyOR 306k; PropGS LinOP 511; PropGS NoisyOR 5k.)
The LogLikGS reliability assignments show the best performance in yeast, yet the worst performance in mouse. Since LogLikGS is similar to PropGS corrected for background linkage distributions, their relative performance suggests this may be due to different background distributions (roughly 0.15 in yeast versus 0.01 in mouse). In fact, we found that the numerical edge weights were nearly identical for both methods in yeast, while in mouse LogLikGS edge weights were generally twice the value of PropGS weights. Also, the yeast graph has a maximum of 77 neighbors while mouse has a maximum of 348, an enormous difference in neighborhood size which, together with the difference in weightings, allows FunctionalFlow to propagate many more noisy predictions. The Cons variants perform the best overall in mouse, suggesting better overlap of information from sources in mouse compared to those in yeast, even though mouse has fewer sources. Even in yeast, the Cons results, which do not use a function/pathway-based gold standard, are comparable to PropGS, which does. In fact, these results suggest that capturing a more diverse notion of 'interaction' using Cons still proves successful for the task of function prediction. Together, these results suggest Cons is a valuable alternative to LogLikGS and PropGS in less-studied organisms, where including diverse types of interaction information is critical. For the third question, of whether to use NoisyOR or LinOP to combine source reliabilities, the NoisyOR variants invariably have slightly higher performance than LinOP. For a given interaction, the value assigned by NoisyOR will be greater than that assigned by LinOP given the same set of reliability assignments to sources. In this task, this bias causes NoisyOR to make the same prediction as LinOP but at a higher threshold, accounting for the slight vertical shift between the two curves. The effect of this shift in distribution is the subject of the next figure.
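The two combination rules compared here can be written down concisely. The sketch below is our own reading of a linear opinion pool and a noisy-OR over per-source reliabilities; the paper's exact weighting and normalization may differ:

```python
# Illustrative sketch of two ways to combine per-source evidence for an edge,
# given reliabilities r_s in [0, 1] and evidence e_s per source (0/1 or a probability).

def linop(reliabilities, evidence):
    """Linear opinion pool: reliability-weighted average of the sources."""
    total = sum(reliabilities)
    return sum(r * e for r, e in zip(reliabilities, evidence)) / total

def noisy_or(reliabilities, evidence):
    """Noisy-OR: the edge is absent only if every source independently fails to support it."""
    prod = 1.0
    for r, e in zip(reliabilities, evidence):
        prod *= 1.0 - r * e
    return 1.0 - prod
```

With several sources of which only one reports an edge, the noisy-OR value exceeds the normalized linear-pool value, which is consistent with the vertical shift between the curves described above.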
The effect of different edge distributions Pr(e) can be seen by fixing a probability threshold and keeping only edges which exceed the threshold. Results using Pr(e) > 0.54 in yeast and Pr(e) > 0.22 in mouse are shown in Figure 2 (legends indicate graph size per method). Shorter curves mean fewer predictions were made, a comment on the connectivity. As noted above, the LinOP variants will include fewer edges than NoisyOR for a given threshold, though here we see little performance difference between the two for all methods except Cons (Fig. 2). This difference arises from the large size of Cons NoisyOR (339k edges) versus the others (mean 26k), in combination with the neighborhood-based FunctionalFlow; for sparse graphs the immediate neighborhood is equivalent to the extended neighborhood,
making FunctionalFlow nearly equivalent to Majority. In mouse, LinOP and NoisyOR yield similar graph sizes, so we do not see this effect repeated. Again, Cons performs strongly in mouse, suggesting this non-gold-standard-based approach will be valuable in less well-studied organisms.
3.2. Learning Regulatory Networks
Bayesian networks (BN) are a popular modelling formalism for learning regulatory networks from gene expression data (see Pe'er et al.33 for an excellent example). A BN has two components: a directed acyclic graph (DAG) capturing dependencies between variables, and a set of conditional probability distributions (CPDs) local to each node. Nodes represent expression values, arcs represent potential regulatory relationships, and the CPDs quantify those relationships. Algorithms that learn BNs from data can use prior knowledge about the probability of arcs, such as our Pr(e). Learning performs an iterative search starting from an initial graph, exploring the space of DAGs by adding, deleting or reversing a single arc, choosing the best-scoring model among these one-arc changes, and terminating when no further improvement in score can be made. Each candidate model is scored with respect to the log-likelihood (LL) of the data, i.e. how well the CPDs capture dependencies inherent in the expression data. To evaluate the quality of a search, we obtain a single performance measure as follows. Given a starting model, we obtain an LL-trace of the best model chosen at each iteration and average the trace over all iterations. We repeat this process for a set of starting models sampled from some distribution, and average the average LL-trace over all models. Starting models are sampled either from an informed structural prior (our Pr(e)) or from an uninformed prior which asserts uniform probability over edges. A high average LL-trace value for a given prior indicates that searches using that prior consistently explore high-scoring models. Using the yeast genome, as before we create informed structural priors Pr(e) using all interaction sources (including functional/pathway sources) together with the Cons, PropGS and LogLikGS methods to assign reliabilities (again, KEGG is the gold standard for the latter two) and the LinOP and NoisyOR methods to combine reliabilities.
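The iterative search described above can be sketched as generic hill-climbing over DAGs, with a pluggable score function standing in for the data log-likelihood. This is a simplified illustration (the move set here only toggles single arcs, and the scoring is supplied by the caller), not the authors' implementation:

```python
# Sketch: greedy single-arc hill-climbing over DAGs with a caller-supplied score.

from itertools import permutations

def is_acyclic(nodes, arcs):
    """Kahn-style check that `arcs` (a set of (parent, child) pairs) form a DAG."""
    indeg = {n: 0 for n in nodes}
    for _, c in arcs:
        indeg[c] += 1
    frontier = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while frontier:
        n = frontier.pop()
        seen += 1
        for p, c in arcs:
            if p == n:
                indeg[c] -= 1
                if indeg[c] == 0:
                    frontier.append(c)
    return seen == len(nodes)

def greedy_search(nodes, start_arcs, score):
    """Hill-climb over single-arc toggles; return the best DAG and its score trace."""
    arcs, trace = set(start_arcs), []
    current = score(arcs)
    while True:
        moves = []
        for p, c in permutations(nodes, 2):
            cand = arcs ^ {(p, c)}      # toggle: add the arc if absent, delete it if present
            if is_acyclic(nodes, cand):
                moves.append((score(cand), cand))
        best_score, best = max(moves, key=lambda m: m[0])
        if best_score <= current:       # no one-arc change improves the score
            return arcs, trace
        arcs, current = best, best_score
        trace.append(current)
```

Averaging `trace` over its iterations, and then over a set of sampled starting graphs, yields the single performance measure described in the text.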
We learn Bayesian networks for 50 genes using an expression dataset covering 1783 yeast microarray experiments (see refs. in Tanay et al.34). We also create priors using edge reliabilities calculated by other groups, namely STRING11 (a PropGS NoisyOR approach, on experts different than ours, for predicting protein complexes) and MAGIC35 (a hand-crafted BN for predicting function). Both use expression data as experts. As baselines, we include a uniform reliability assignment over experts (Unif5) and two random reliability assignments (Rand1 and Rand2). Figure 3 shows the LL-trace averages, scaled to give Uninformed the value 0. The worst overall performance, by Uninformed, demonstrates the value of using priors based on weighted reliabilities. The poor performance of the remaining baseline variants demonstrates the effect of neglecting to assign (Unif) or incorrectly assigning (Rand) reliability to interaction sources. Note NoisyOR performs worse than LinOP for the baseline priors, yet performs better for the non-baseline variants. This repeats the effect seen in the function prediction task, where NoisyOR assigns higher values than LinOP. Here, the performance difference indicates that LinOP is more robust to errors in reliability assignment than NoisyOR. The strength of STRING, LogLikGS and MAGIC is due in part to having few high probabilities and many low probabilities in the corresponding Pr(e), in contrast with the more evenly distributed Pr(e) of the other methods. Such conservatism allows the Bayesian learner to strongly preserve only the highest-confidence edges while remaining flexible for the others. Performance of the Cons variants is comparable to PropGS for this task as well, demonstrating the utility of our method, which does not require a gold standard.

[Figure 3 legend: STRING, LogOdds NoisyOR, LogOdds LinOP, MAGIC, PropGS NoisyOR, Cons NoisyOR, Cons LinOP, PropGS LinOP, Rand2 LinOP, Rand1 LinOP, Unif5 NoisyOR, Unif5 LinOP, Rand2 NoisyOR, Rand1 NoisyOR, Uninformed.]
Figure 3. Average of log-likelihood trace over all iterations.
4. Conclusions

Our results show that the Cons method for assigning reliability to interaction sources is an attractive alternative to existing methods and has the added advantage of not requiring a gold standard for assessment. In the task of predicting protein function, we demonstrated the effectiveness of using weighting strategies, where Cons proved competitive against other methods
which have the unfair advantage of using the same gold standard used for evaluation. For the task involving regulatory networks, we showed that learning greatly benefits from correctly informed estimates of reliability. Again, Cons was comparable to the other methods. We introduced LinOP as an alternative method for combining reliabilities and demonstrated its performance to be comparable to NoisyOR in most tasks and more robust to errors in others.

References
1. B. Schwikowski et al., Nature Biotech. 18, 1257 (2000).
2. H. Hishigaki et al., Yeast 18, 523 (2001).
3. A. M. Edwards et al., Trends Genet. 18, 529 (2002).
4. C. M. Deane et al., Mol. Cell. Proteomics 1, 349 (2002).
5. E. Sprinzak et al., J. Mol. Biol. 327, 919 (2003).
6. J. S. Bader et al., Nature Biotech. 22, 78 (2004).
7. Y. Qi et al., NIPS Workshop on Comp. Bio. and Anal. of Het. Data (2005).
8. S. Asthana et al., Genome Res. 14, 1170 (2004).
9. I. Lee et al., Science 306, 1555 (2004).
10. D. R. Rhodes et al., Nature Biotech. 23, 951 (2005).
11. C. von Mering et al., Nucl. Acids Res. 33, D433 (2005).
12. E. Nabieva et al., Bioinformatics 21, i302 (2005).
13. C. Genest and J. V. Zidek, Statistical Science 1, 114 (1986).
14. S. Suthram et al., BMC Bioinformatics 7, 360 (2006).
15. I. Xenarios et al., Nucl. Acids Res. 30, 303 (2002).
16. G. Bader et al., Nucl. Acids Res. 29, 242 (2001).
17. H. Hermjakob et al., Nucl. Acids Res. 32, D452 (2004).
18. C. Stark et al., Nucl. Acids Res. 34, D545 (2006).
19. T. I. Lee et al., Science 298, 799 (2002).
20. E. Wingender et al., Nucl. Acids Res. 28, 316 (2000).
21. J. C. Mellor et al., Nucl. Acids Res. 30, 306 (2002).
22. J. T. Eppig et al., Nucl. Acids Res. 33, D471 (2005).
23. H. W. Mewes et al., Nucl. Acids Res. 30, 31 (2002).
24. N. Hulo et al., Nucl. Acids Res. 32, 134 (2004).
25. A. Bateman et al., Nucl. Acids Res. 32, D138 (2004).
26. N. J. Mulder et al., Nucl. Acids Res. 33, D201 (2005).
27. P. T. Spellman et al., Mol. Biol. Cell 9, 3273 (1998).
28. M. Ashburner et al., Nature Genet. 25, 25 (2000).
29. M. Kanehisa et al., Nucl. Acids Res. 34, D354 (2006).
30. K. D. Dahlquist et al., Nature Genet. 31, 19 (2002).
31. S. C. Weller and N. C. Mann, Medical Decision Making 17, 71 (1997).
32. G. R. G. Lanckriet et al., PSB 9, 300 (2004).
33. D. Pe'er et al., Bioinformatics 17 Suppl. 1, S215 (2001).
34. A. Tanay et al., Molecular Systems Biology (2005).
35. O. G. Troyanskaya et al., PNAS 100, 8348 (2003).
PROBABILISTIC MODELING OF SYSTEMATIC ERRORS IN TWO-HYBRID EXPERIMENTS
DAVID SONTAG*, ROHIT SINGH*, BONNIE BERGER†‡

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
E-mail: {dsontag, rsingh, bab}@mit.edu
We describe a novel probabilistic approach to estimating errors in two-hybrid (2H) experiments. Such experiments are frequently used to elucidate protein-protein interaction networks in a high-throughput fashion; however, a significant challenge with these is their relatively high error rate, specifically, a high false-positive rate. We describe a comprehensive error model for 2H data, accounting for both random and systematic errors. The latter arise from limitations of the 2H experimental protocol: in theory, the reporting mechanism of a 2H experiment should be activated if and only if the two proteins being tested truly interact; in practice, even in the absence of a true interaction, it may be activated by some proteins, either by themselves or through promiscuous interaction with other proteins. We describe a probabilistic relational model that explicitly models the above phenomenon and use Markov Chain Monte Carlo (MCMC) algorithms to compute both the probability of an observed 2H interaction being true as well as the probability of individual proteins being self-activating/promiscuous. This is the first approach that explicitly models systematic errors in protein-protein interaction data; in contrast, previous work on this topic has modeled errors as being independent and random. By explicitly modeling the sources of noise in 2H systems, we find that we are better able to make use of the available experimental data. In comparison with Bader et al.'s method for estimating confidence in 2H-predicted interactions, the proposed method performed 5-10% better overall, and in particular regimes improved prediction accuracy by as much as 76%.

Supplementary Information: http://theory.csail.mit.edu/probmod2H
*These authors contributed equally to the work.
†Corresponding author.
‡Also in the MIT Dept. of Mathematics.

1. Introduction

The fundamental goal of systems biology is to understand how the various components of the cellular machinery interact with each other and the environment. In pursuit of this goal, experiments for elucidating protein-protein interactions (PPI) have proven to be one of the most powerful tools available. Genome-wide, high-throughput PPI experiments have started to
provide data that has already been used for a variety of tasks: predicting the function of uncharacterized proteins; analyzing the relative importance of proteins in signaling pathways; gaining new perspectives in comparative genomics through cross-species comparisons of interaction patterns; etc. Unfortunately, the quality of currently available PPI data is unsatisfactory, which limits its usefulness to some degree. Thus, techniques that enhance the availability of high-quality PPI data are of value. In this paper, we aim to improve the quality of experimentally available PPI data by identifying erroneous datapoints from PPI experiments. We attempt to move beyond current one-size-fits-all error models that ignore the experimental source of a PPI datapoint; instead, we argue that a better error model will also have components tailored to account for the systematic errors of specific experimental protocols. This may help achieve higher sensitivity without sacrificing specificity. This motivated us to design an error model tailored to one of the most commonly used PPI experimental protocols. We specifically focus on data from two-hybrid (2H) experiments6,4, which are among the most popular high-throughput methods for elucidating protein-protein interactions. Data from 2H experiments forms the majority of the known PPI data for many species: D. melanogaster, C. elegans, H. sapiens, etc. However, currently available 2H data also has unacceptably high false-positive rates: von Mering et al. estimate that more than 50% of 2H interactions are spurious11. These high rates of error seriously hamper analyses of the PPI data. As such, we believe an error model that performs better than existing models, even if it is tailored to 2H data, is of significant practical value, and may also serve as an example for the development of error models for other biological experiments.
Ideally, the reporting mechanism in a 2H experiment is activated if and only if the pair of proteins being tested truly interact. As in most experimental protocols, there are various sources of random noise. However, there are also systematic, repeatable errors in the data, originating from limitations of the 2H protocol. In particular, there exist proteins that are disproportionately prone to be part of false-positive observations (Fig. 1). It is thought that these proteins either activate the reporting mechanism by themselves or promiscuously bind with many other proteins in the particular setup (promiscuous binding is an experimental artifact; it does not imply a true interaction under plausible biological conditions).

Figure 1: The origin of systematic errors in 2H data. The cartoons (panels: True Positive, False Negative, False Positive) demonstrate the mechanism of 2H experiments. Protein A is fused to the DNA-binding domain of a particular transcription factor, while protein B is fused to the activation domain of that transcription factor. If A and B physically interact, then the combined influence of their respective enhancers results in the activation of the reporter gene. Systematic errors in such experiments may arise: false negatives occur when two proteins which interact in vivo fail to activate the reporter gene under experimental conditions. False positives may occur due to proteins which trigger the reporting mechanism of the system, either by themselves (self-activation) or by spurious interaction with other proteins (promiscuity). Spurious interaction can occur when a protein is grossly over-expressed. In the figure, protein A in the False Positive panel is such a protein: it may either promiscuously bind with B or activate the reporting mechanism even in the absence of B.

Contributions: The key contribution of this paper is a comprehensive error model for 2H experiments, accounting for both random and systematic errors, which is guided by insights into the systematic errors of the 2H experimental protocol. We believe this is the first model to account for both sources of error in a principled manner; in contrast, previous work on estimating error in PPI data has assumed that the error in 2H experiments (as in other experiments) is independent and random. Another contribution of the paper is a set of estimates of proteins especially likely to be self-activating/promiscuous (see Supp. Info.). Such estimates of "problem proteins" may enable the design of 2H experimental protocols with lower error rates. We use the framework of Bayesian networks to encode our assumption that a 2H interaction is likely to be observed if the corresponding protein pair truly interacts or if either of the proteins is self-activating/promiscuous. The Bayesian framework allows us to represent the inherent uncertainty and the relationships between promiscuity of proteins, true interactions and observed 2H data, while using all the available data to simultaneously learn the model parameters and predict the interactions. We use a Markov Chain Monte Carlo (MCMC) algorithm to do approximate probabilistic inference in our models, jointly inferring both desired sets of quantities: the probability of interaction, and the propensity of a protein for self-activation/promiscuity. We show how to integrate our error model into the two most common
probabilistic models used for combining PPI experimental data, and show that our error model can significantly improve the accuracy of PPI prediction.

Related Work: With data from the first genome-wide 2H experiments (Ito et al.6, Uetz et al.4) came the realization that 2H experiments may have significant systematic errors. Vidalain et al. identified the presence of self-activators as one source of such errors, and described changes in the experimental setup to reduce the problem10. Our work aims to provide a parallel, computational model of the problem, allowing post-facto filtering of the data even if the original experiment retained the errors. The usefulness of such an approach was recently demonstrated by Sun et al.2 (to reconstruct transcriptional regulatory networks). Previous computational methods of modeling systematic errors in PPI data can be broadly classified into two categories. The first class of methods5,11,8 exploits the observation that if two very different experimental setups (e.g. 2H and Co-IP) observe a physical interaction, then the interaction is likely to be true. This is a reasonable assumption because the systematic errors of two different experimental setups are likely to be independent. However, this approach requires multiple costly and time-consuming genome-wide PPI experiments, and may still result in missed interactions, since the experiments have high false-negative rates. Many of these approaches also integrate non-PPI functional genomic information, such as co-expression, co-localization, and Gene Ontology functional annotation. The second class of methods is based on the topological properties of the PPI networks. Bader et al.1, in their pioneering work, used the number of 2H interactions per protein as a negative predictor of whether two proteins truly interact.
Since the prior probability of any interaction is small, disproportionately many 2H interactions involving a particular protein could plausibly be explained by its being self-activating or promiscuous. However, such an approach is unable to make fine-grained distinctions: an interaction involving a high-degree protein need not be incorrect, especially if there is support for it from other experiments. Furthermore, the high degree of a promiscuous protein in one experiment (e.g. Ito et al.'s) should not penalize interactions involving that protein observed in another experiment (e.g. Uetz et al.'s) if the errors are mostly independent (e.g. they use different reporters). Our proposed probabilistic models solve all of these problems.
2. Data Sets

One difficulty with validating any PPI prediction method is that we must have a gold standard from which to say whether two proteins interact or do
not interact. We constructed a gold standard data set of protein-protein interactions in S. cerevisiae (yeast) with which we could validate our methods. Our gold standard test set is an updated version of Bader et al.'s data. Bader et al.'s data consisted of all published interactions found by 2H experiments; data from experiments by Uetz et al.4 (the UETZ2H data set) and Ito et al.6 (the ITO2H data set) comprised the bulk of the data set. They also included as possible protein interactions all protein pairs that were at distance at most two in the 2H network. Bader et al. then used published Co-Immunoprecipitation (Co-IP) data to label these purported interactions. When two proteins were found in a bait-hit or hit-hit interaction in Co-IP, they were labeled as having a true interaction. When two proteins were very far apart in the Co-IP network (distance larger than three), they were labeled as not interacting. We updated Bader et al.'s data to include all published 2H interactions through February 2006, taking our data from the MIPS7 database. We added, for the purposes of evaluation, recently published yeast Co-IP data from Krogan et al.3. This allowed us to significantly increase the number of labeled true and false interactions in our data set. Since the goal of our algorithms is to model the systematic errors in large-scale 2H experiments, we evaluated our models' performance on the test data where at least one of UETZ2H or ITO2H indicated an interaction. We were left with 397 positive examples, 2298 negative examples, and 2366 unlabeled interactions. We randomly chose 397 of the 2298 negative examples to be part of our test set. For all of the experiments we performed 4-fold cross-validation on the test set, hiding one fourth of the labels while using the remaining labeled data during inference.
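The distance-based labeling described above can be sketched with a bounded breadth-first search over the Co-IP network. The adjacency representation, cutoff handling, and function names below are our assumptions, not the authors' code:

```python
# Sketch: label a protein pair from its distance in the Co-IP network.
# Pairs seen together in Co-IP are positives; pairs farther than three
# steps apart are negatives; everything in between stays unlabeled.

from collections import deque

def coip_distance(adj, src, dst, cap=4):
    """BFS distance in the Co-IP network, cut off once it would exceed `cap`."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d >= cap:
            continue
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return cap + 1  # effectively "farther than cap" (including no path at all)

def label_pair(adj, a, b):
    d = coip_distance(adj, a, b)
    if d == 1:
        return 'interacting'
    if d > 3:
        return 'non-interacting'
    return 'unlabeled'
```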
3. Probabilistic Models

We show how to integrate our model of systematic errors into the two most common probabilistic models used for PPI prediction. Our first model is complementary to the relational probabilistic model proposed by Jaimovich et al.8, and can be easily integrated into their approach. Our second model is an extension of Bader et al.'s, and will form the basis of our comparison. Our models also adjust to varying error rates in different experiments. For instance, while we account for random noise and false negatives in our error model for both UETZ2H and ITO2H, we only model self-activation/promiscuity for ITO2H observations. The UETZ2H data set was smaller and included only one protein with degree larger than 20; ITO2H had 36 proteins with degree larger than 30, including one with degree as high as 285. Thus, while modeling promiscuity made a big difference for the ITO2H data, it did not significantly affect our results on the UETZ2H data.
3.1. Generative model

We begin with a simplified model of PPI interaction (Fig. 2). We represent the uncertainty about a protein interaction as an indicator random variable X_ij, which is 1 if proteins i and j truly interact, and 0 otherwise. For each experiment, we construct corresponding random variables (RVs) indicating whether i and j have been observed to interact in that experiment. Thus, U_ij is the observed random variable (RV) representing the observation from UETZ2H, and I_ij is the observed RV representing the observation from ITO2H. The arrow from X_ij to I_ij indicates the dependency of I_ij on X_ij. The box surrounding the three RVs indicates that this template of three RVs is repeated for all i, j = 1, ..., N (i.e. all pairs of proteins), where N is the number of proteins. In all models of this type, the I_ij RVs are assumed to be independent of one another. If an experiment provides extra information about each observation, the model can be correspondingly enriched. For instance, for each of their observed interactions, Ito et al. provide the number of times the interaction was discovered (called the number of IST hits). Rather than making I_ij binary, we have it equal the number of IST hits, capped at 3. We will refer to the portion of ITO2H observations with IST ≥ 3 as ITOCORE. The model is called "generative" because the ground truth about the interaction, X_ij, generates the observations in the 2H experiments, I_ij and U_ij. To our knowledge, all previous generative models of experimental interactions made the assumption that I_ij depended only on X_ij. They allowed for false positives by saying that Pr(I_ij > 0 | X_ij = 0) = δ_fp, where δ_fp is a parameter of their model. Similarly, they allowed for false negatives by saying that Pr(I_ij = 0 | X_ij = 1) = δ_fn, for another parameter δ_fn. However, these models are missing much of the picture.
For example, many experiments have particular difficulty testing the interactions of proteins along the membrane. For these proteins, δ_fn should be significantly higher. In the 2H experiment, for interactions that involve self-activating/promiscuous proteins, δ_fp will be significantly higher. In Fig. 3, we propose a novel probabilistic model in which the self-activating/promiscuous tendencies of particular proteins are explicitly modeled. The latent Bernoulli RV F_k is 1 if protein k is believed to be promiscuous or self-activating. In the context of our data set, this RV applies specifically to the ITO2H data; if self-activation/promiscuity in multiple experiments is to be modeled, we may introduce multiple such variables F_k^H (for protein k and experiment H).

Figure 2: Generative model. Figure 3: Generative model, with noise variables. (In both figures, clear nodes are unobserved (latent) RVs, shaded nodes are observed RVs, and the template is repeated for i, j = 1, ..., N.)

The I_ij RV thus depends on F_i and F_j. Intuitively, I_ij will be > 0 if either X_ij = 1 or one of F_i, F_j is 1. As we show later in the Results section, this model of noise is significantly more powerful than the earlier model, because it allows for the "explaining away" of false positives in ITO2H. Furthermore, it allows evidence from data sets other than ITO2H to influence (through the X_ij RVs) the determination of the F_k RVs. We also added the latent variables O^U_ij and O^I_ij, which will be 1 if the Uetz et al. and Ito et al. experiments, respectively, have the capacity to observe a possible interaction between proteins i and j. These RVs act to explain away the false negatives in UETZ2H and ITO2H. We believe that these RVs will be particularly useful for species where we have relatively little PPI data. The distributions in these models all have Dirichlet priors (θ) with associated hyperparameters α (see Supp. Info. for more details). There are many advantages to using the generative model described in this section. First, it can easily handle missing data without adding complexity to the inference procedure. This is important when integrating additional experimental data into the model. Suppose, for example, that we use gene expression correlation as an additional signal of protein interaction, by introducing new RVs E_ij (indicating coexpression of genes i and j) and corresponding edges X_ij → E_ij. If, for a pair of proteins, the coexpression data is unavailable, we simply omit the corresponding E_ij from the model. In Bader et al.'s model, and the second model that we propose below, we would need to integrate over possible values of the missing datapoint, a potentially complicated task.
Second, the generative model can easily be extended: e.g., we could combine this model with Jaimovich et al.'s in order to model the common occurrence of transitive closure in PPIs.
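A forward sampler makes the noise structure of this generative model concrete. The sketch below is illustrative only: the rates `delta_fn`, `delta_fp` and `p_act` are placeholder values standing in for the CPDs (with their Dirichlet priors) that the paper learns from data:

```python
# Illustrative forward sampler for the noise-aware generative model:
# a true interaction is observed unless a false negative occurs, and a
# self-activating/promiscuous protein can trigger the reporter regardless.

import random

def sample_observation(x_ij, f_i, f_j,
                       delta_fn=0.3,   # false-negative rate (placeholder)
                       delta_fp=0.01,  # random false-positive rate (placeholder)
                       p_act=0.9,      # activation rate for promiscuous proteins (placeholder)
                       rng=random):
    """Return 1 if the 2H experiment reports the pair (i, j), else 0."""
    if x_ij and rng.random() > delta_fn:
        return 1                                 # true interaction observed
    if (f_i or f_j) and rng.random() < p_act:
        return 1                                 # reporter fired by a promiscuous protein
    return 1 if rng.random() < delta_fp else 0   # residual random false positive
```

Sampling many pairs with X_ij = 0 but F_i = 1 reproduces the systematic false positives that the F_k variables are meant to explain away.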
Figure 4: Bader et al.'s logistic regression model (BaderLR).
Figure 5: Our Bayesian logistic model, with noise variables (BayesLR).
3.2. Bayesian logistic model
In Fig. 4 we show Bader et al.'s model (BADERLR); it includes three new variables in addition to the RVs already mentioned, whose values are pre-calculated using the 2H network. Two of these encode topological information: variable A_ij is the number of adjacent proteins in common between i and j, and variable D_ij is ln(d_i + 1) + ln(d_j + 1), where d_i is the degree of protein i. Variable L_ij is an indicator variable for whether this protein interaction has been observed in any low-throughput experiments. In Bader et al.'s model, I_ij^c is an indicator variable representing whether the interaction between proteins i and j was in the ITOCORE data set (IST ≥ 3). X_ij's conditional distribution is given by the logistic function:
P(X_ij = 1) = 1 / (1 + exp(w_offset + U_ij·w_U + I_ij^c·w_I + L_ij·w_L + A_ij·w_A + D_ij·w_D))
The weights w are discriminatively learned using the Iterative Re-weighted Least Squares (IRLS) algorithm, which requires that all of the above quantities are observed in the training data. In Fig. 5 we propose a new model (BAYESLR), with two significant differences. First, we no longer use the two proteins' degrees, D_ij, and instead integrate our noise model in the form of the F_k random variables. Second, instead of learning the model using IRLS, we assign the weights uninformative priors and do inference via Markov Chain Monte Carlo (MCMC). This is necessary because X_ij will have an unobserved parent, I_ij^c. The new RV I_ij^c will be 1 when the Ito et al. experiment should be considered for predicting X_ij. Intuitively, its value should be (I_ij > 0) ∧ ¬(F_i ∨ F_j). However, to allow greater flexibility, we give the conditional distribution for I_ij^c a Dirichlet prior, resulting in a noisy version of the above logical expression. The RVs O_ij are not needed in this logistic model because the parameterization of the X_ij conditional distribution induces a type of noisy-OR distribution in the posterior. Thus, logistic models can easily handle false negatives. Because we wanted to highlight the advantages of modeling the experimental noise, we omitted A_ij (one-hop) from both models, BAYESLR and BADERLR. The one-hop signal, gene expression, co-localization, etc. can easily be added to any of the models to improve their predictive ability.

3.3. Inference
As is common in probabilistic relational models, the parameters for the conditional distributions of each RV are shared across all of their instances. For example, in the generative model, the prior probability Pr(X_ij = 1) is the same for all i and j. With the exception of X_ij in BAYESLR, we gave all the distributions a Dirichlet prior. In BAYESLR, the conditional distribution of X_ij is the logistic function, and its weights are given Gaussian priors with mean μ_X = 0 and variance σ_X = 0.01. Note that by specifying these hyperparameters (e.g. μ_X, σ_X), we never need to learn the parameters (i.e., weights). Given the relational nature of our data, and the relatively small amount of it, we think this Bayesian approach is well suited. We prevent the models from growing too large by only including protein pairs where at least one experiment hinted at an interaction. We used BUGS 9 to do inference via Gibbs sampling. We ran 12 MCMC chains for 6000 samples each, from which we computed the desired marginal posterior probabilities. The process is simple enough that someone without much knowledge of machine learning could take our probabilistic models (which we provide in the Supplementary Information) and use them to interpret the results of their 2H experiments.
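As a sketch of the kind of marginal the Gibbs sampler produces, the toy model below estimates P(X = 1 | I = 1) for a single interaction variable X and a single noise variable F under a noisy-OR likelihood. All probabilities are invented placeholders, and the sampler is far simpler than the BUGS models provided in the Supplementary Information:

```python
import random

# Toy Gibbs sampler for the marginal P(X = 1 | I = 1) in a miniature
# noisy-OR model: one interaction variable X, one noise variable F.
# All probabilities here are illustrative placeholders.
P_X, P_F, LEAK, STRENGTH = 0.01, 0.05, 0.005, 0.9

def p_hit(x, f):
    # Noisy-OR likelihood of observing a 2H hit given the latents.
    return 1.0 - (1.0 - LEAK) * (1.0 - STRENGTH * x) * (1.0 - STRENGTH * f)

def gibbs(n_samples=200_000, seed=0):
    rng = random.Random(seed)
    x, f = 0, 0
    total = 0
    for _ in range(n_samples):
        # Resample X from P(X | F, I=1), then F from P(F | X, I=1).
        w1 = P_X * p_hit(1, f)
        w0 = (1 - P_X) * p_hit(0, f)
        x = 1 if rng.random() < w1 / (w1 + w0) else 0
        w1 = P_F * p_hit(x, 1)
        w0 = (1 - P_F) * p_hit(x, 0)
        f = 1 if rng.random() < w1 / (w1 + w0) else 0
        total += x
    return total / n_samples

print(gibbs())  # estimate of the exact posterior of X given a hit
```

The fraction of samples with X = 1 converges to the exact marginal posterior, which is the quantity reported by the real models for each candidate interaction.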
We also tried using loopy belief propagation instead of MCMC to do approximate inference in the generative model of Fig. 3. These results (see Supp. Info.) were very similar, showing that we are likely not being hurt by our choice of approximate inference method. Furthermore, our implementation of the inference algorithm (in Java) takes only seconds to run, and would easily scale to larger problems.
4. Results
We compared the proposed Bayesian logistic model (BAYESLR) with the model based on Bader et al.'s work (BADERLR). Both models were trained and tested on the new, updated version of Bader et al.'s gold standard data set. We show in Fig. 6 that BAYESLR achieves 5-10% higher accuracy
at most points along the ROC curve. We then checked to see that the improvement was really coming from the noise model, and not just from our use of unlabeled data and MCMC. We tried using a modified BAYESLR model (called Bayesian Bader) which has D_ij RVs instead of the noise model, and which uses ITOCORE instead of ITO2H. As expected, it performed the same as BADERLR. We also tried modifying this model to use ITO2H, and found that the resulting performance was much worse. Investigating this further, we found that the average maximum a posteriori (MAP) weights for BAYESLR were {w_U = -2.32, w_L = -10.85, w_I = -4.26, and w_offset = 7.34}. The weight corresponding to ITO2H is almost double the weight for UETZ2H. Interestingly, this is a similar ratio of weights to what would have been learned had we only used the ITOCORE data set, as in BADERLR. In the last of the above-mentioned experiments, the MAP weight for ITO2H was far smaller than the weight for UETZ2H, which indicates that UETZ2H was a stronger signal than ITO2H. Overall, these experiments demonstrate that we can get significantly better performance using data with many false positives (ITO2H) and a statistical model of the noise than by using prefiltered data (ITOCORE) and no noise model. In all regimes of the ROC curve, BAYESLR performs at least as well as BADERLR; in some, it performs significantly better (Fig. 8). The examples that follow demonstrate the weaknesses inherent in BADERLR and show how the proposed model BAYESLR solves these problems. When IRLS learns the weight for the degree variable (in BADERLR), it must trade off having too high a weight, which would cause other features to be ignored, against having too low a weight, which would insufficiently penalize the false positives caused by self-activation/promiscuity.
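Plugging the reported average MAP weights into the logistic form of Sec. 3.2 illustrates the sign convention (negative weights increase the interaction probability, because the weights sit inside the exponential of the denominator). This is a sketch; the feature combinations evaluated are hypothetical:

```python
import math

# Average MAP weights reported above for BayesLR.
W = {"offset": 7.34, "U": -2.32, "I": -4.26, "L": -10.85}

def p_x(u=0, i2h=0, l=0):
    # P(X_ij = 1) = 1 / (1 + exp(w_offset + U*w_U + I*w_I + L*w_L));
    # the A_ij term is omitted here, as in the experiments described above.
    z = W["offset"] + u * W["U"] + i2h * W["I"] + l * W["L"]
    return 1.0 / (1.0 + math.exp(z))

print(p_x())              # no evidence: prior probability is tiny
print(p_x(u=1, i2h=1))    # both 2H screens agree: probability rises
print(p_x(l=1))           # a low-throughput observation dominates
```

The large magnitude of w_L relative to the 2H weights reflects the much lower false positive rate of low-throughput experiments.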
In BADERLR, a high degree D_ij penalizes positive predictors from all the experiments (U_ij, I_ij, L_ij). However, the degree of a protein in a particular experiment (say, Ito et al.'s) only gives information about self-activation/promiscuity of the protein in that experiment. Thus, if a protein has a high degree in one experiment, even if that experiment did not predict an interaction (involving some other protein), the degree will negatively affect any predictions made by other experiments on that protein. Our proposed models solve this problem by giving every experiment a different noise model, and by having each noise model be conditionally independent given the X_ij variables. Thus, we get the desired property that noise in one experiment should not affect the influence of other experiments on the X_ij variables. Fig. 8(a) illustrates this by showing the prediction accuracy for the test points where D_ij > 4 and U_ij = 1 or L_ij = 1 (called the 'medium' degree
[ROC plots comparing the models: Bayesian LR with noise model, Bader, Bayesian Bader, Bayesian Bader with full Ito, and Random; x-axis: false positive rate.]
Figure 7: Comparison of generative models.
Figure 6: Comparison of logistic models.
> p, M'(T, 5) > p and M'(G, 6) > p. Since Class V has no characteristics, we assume all matrices belong to Class V, i.e. the regular expression is ".*". Since the size of the sample space for each motif class is not the same, the likelihood of a particular class g given a matrix M, i.e. P(M | g = k), k = 1, ..., 6, is not the same for different motif classes. In order to compare (without finding their exact values) the likelihoods of different motif classes given a matrix, we consider a 4 x 1 column vector CV = (μ(A), μ(C), μ(G), μ(T)) in a probability matrix. Since 0 ≤ μ(A), μ(C), μ(G), μ(T) ≤ 1 and μ(A) + μ(C) + μ(G) + μ(T) = 1, the sample space of CV can be represented by the set of points in the tetrahedron shown in Figure 1 [10]. The four corners of the tetrahedron at (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1) represent the four nucleotides A, C, G and T. Without loss of generality, let CV be the first column of a 4 x 4 matrix with the pattern "TAAT" in motif Class VI (Table 1), in which case μ(T) > p. To illustrate the idea, let us consider two classes of motif. In Class V a column vector CV is randomly picked from all possible column vectors, whereas in Class VI, a column vector CV is randomly picked from all column vectors with μ(T) > p. As the size of the sample space for column vectors with μ(T) > p, i.e. the tetrahedron shown in Figure 2, is (1 - p)^3 of the size of the sample space for arbitrary column vectors, i.e. the whole tetrahedron, the conditional probability P(CV | g = 6) is 1/(1 - p)^3 times higher than the conditional probability P(CV | g = 5). Similarly, we may compare the conditional probability of a particular matrix M' being picked given that it is from Class V (all probability matrices) with the conditional probability of another matrix M being picked given that it is from one of the remaining classes. For example, assume l = 4 and p = 0.8.
The conditional probability P(M | g = 6) that a particular 4 x 4 matrix M in Class VI is picked from all length-4 matrices in Class VI is 1/(2(1 - 0.8)^(3x4)) = 1.2 x 10^8 times larger than the conditional probability P(M' | g = 5) that another matrix M'
is picked from all length-4 matrices in Class V. Note that, if M' does not belong to Class VI, P(M' | g = 6) = 0. When the motif length l is not exactly 4, care should be taken not to double count those matrices with more than one sub-matrix satisfying the requirement (by using the Inclusion-Exclusion Principle).
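The quoted ratio can be checked directly; the constant factor of 2 is taken from the text as given:

```python
# Check of the likelihood ratio quoted in the text: for motif length
# l = 4 and threshold p = 0.8, each constrained column restricts the
# column-vector sample space to (1 - p)**3 of the whole tetrahedron,
# giving a ratio of 1 / (2 * (1 - p)**(3 * l)).
p, l = 0.8, 4
ratio = 1.0 / (2.0 * (1.0 - p) ** (3 * l))
print(f"{ratio:.3g}")  # about 1.2e+08, matching the value in the text
```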
3. DIMDom Algorithm
DIMDom, which stands for Discovering Motifs with DOMain knowledge, uses the expectation maximization (EM) approach to discover the motif matrix from the input sequences. In the expectation step (E-step), based on the current estimates of the parameters M, B, λ_b and g, the DIMDom algorithm calculates the expected log likelihood E(log L(B, λ_b, P_m | X, Z, M, g)) over the conditional probability distribution of the missing data Z from the input sequences X. In the maximization step (M-step), the DIMDom algorithm calculates a new set of parameters M, B, λ_b and g based on the newly estimated Z so as to maximize the log likelihood. These two steps are iterated in order to obtain a probability matrix with larger log likelihood. In order to discover the probability matrix with maximum log likelihood (instead of a local maximum), the DIMDom algorithm repeats the EM steps with different seed matrices.

3.1. Expectation step
Given a fixed probability matrix M^(0), the background probability B^(0), prior probability λ_b^(0) and motif class g^(0), the expected log likelihood E(log L(B, λ_b, P_m | X, Z, M, g)) is a sum of terms of the form log(λ_b^(0)), log(P^(0)(X_i, m)) and log(1 - λ_b^(0)), weighted by the conditional expectations of the missing indicators Z. In the maximization step, we calculate the probability matrix M^(1) for each motif class that maximizes the expected log likelihood. Considering g^(1) = 5, Equation (4) is maximized (by introducing a Lagrange multiplier for each column vector of M^(1)) when each column of M^(1) is the expected relative frequency of the four nucleotides at that position over the predicted binding sites.
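A minimal EM iteration of the class V (no-characteristics) kind can be sketched as follows; the toy sequences, uniform seeding, and fixed iteration count are placeholders, and DIMDom's motif-class machinery is omitted:

```python
import random

BASES = "ACGT"

def em_step(seqs, M, bg, l):
    """One EM iteration for a single-motif (class V) model: the E-step
    weights each length-l window by its likelihood ratio against the
    background; the M-step rebuilds M from the weighted windows."""
    counts = [[1e-3] * 4 for _ in range(l)]  # small pseudocounts
    for s in seqs:
        windows = [s[i:i + l] for i in range(len(s) - l + 1)]
        # E-step: relative posterior weight of each window being the site.
        w = [1.0] * len(windows)
        for k, win in enumerate(windows):
            for pos, ch in enumerate(win):
                b = BASES.index(ch)
                w[k] *= M[pos][b] / bg[b]
        total = sum(w)
        # M-step contribution: expected base counts per motif column.
        for win, wt in zip(windows, w):
            for pos, ch in enumerate(win):
                counts[pos][BASES.index(ch)] += wt / total
    return [[c / sum(col) for c in col] for col in counts]

random.seed(1)
# Toy data: 20 random bases followed by a planted "TAAT" site.
seqs = ["".join(random.choice(BASES) for _ in range(20)) + "TAAT"
        for _ in range(10)]
M = [[0.25] * 4 for _ in range(4)]
for _ in range(20):
    M = em_step(seqs, M, [0.25] * 4, 4)
consensus = "".join(BASES[max(range(4), key=col.__getitem__)] for col in M)
print(consensus)  # usually recovers the planted "TAAT" signal
```

Each iteration sharpens the matrix toward the windows it already favors, which is why restarts from different seed matrices are needed to escape local maxima.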
Table 2: Predicted motif classes (published classes in parentheses) and scores on each TRANSFAC data set for DIMDom restricted to class V, full DIMDom, and MEME; parenthesised scores are for outputs of 30 predicted motifs. Average scores: DIMDom (class V only) 0.0998 (0.2761), DIMDom 0.2501 (0.4471), MEME 0.1925 (0.3141).
4. Experimental Results
We have implemented DIMDom in C++ and have compared its performance
with that of the popular motif discovery algorithm MEME [2], which is also based on an EM approach, on real biological motifs from the TRANSFAC database (http://www.gene-regulation.com). For each transcription factor with at least one known binding site in fruit fly (Drosophila), we searched for all genes regulated by that transcription factor and used the 450 bp (base pairs) upstream and 50 bp downstream of the transcriptional start site of these genes as the input sequences. We set l' = 8 when constructing seed matrices and considered a substring X_i as a binding site if 1 - Z_i > 0.9, i.e. at 90% confidence. Higher thresholds, such as 0.95 and 0.99, failed to give satisfactory results as the number of predicted binding sites decreased sharply to almost zero. A score for each predicted motif is defined as:

score = |predicted sites ∩ published sites| / |predicted sites ∪ published sites|

A published binding site is correctly predicted if it overlaps with at least one predicted binding site. The score is in the range [0,1]. When all the published binding sites are correctly predicted without any mis-prediction, score = 1. When no published binding site is predicted correctly, score = 0. The value of the threshold β used in calculating the probability P(M | g) was determined by performing tests on another set of real data, from the SCPD database (http://rulai.cshl.edu/SCPD/) for yeast (Saccharomyces cerevisiae). DIMDom had the highest average score when β = 0.9. A smaller value of β did not give better performance because the values of log(P(M | g)) were then similar for different motif classes; as a result, DIMDom could not take much advantage of the different motif classes, and motifs from class V were predicted most of the time. Table 2 shows the performance of MEME [2] and DIMDom for two types of output, a single predicted motif and 30 predicted motifs (from now on, all results related to outputs with 30 predicted motifs will be parenthesised).
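The score can be sketched as a Jaccard index over site sets. The simplification below treats a correct prediction as exact identity of site identifiers, whereas the text counts any overlap between a predicted and a published site:

```python
def motif_score(predicted, published):
    """Score from the text: |predicted ∩ published| / |predicted ∪ published|,
    i.e. the Jaccard index of the two binding-site sets."""
    predicted, published = set(predicted), set(published)
    if not predicted and not published:
        return 0.0
    return len(predicted & published) / len(predicted | published)

# Hypothetical site identifiers:
print(motif_score({"s1", "s2", "s3"}, {"s2", "s3", "s4"}))  # 2/4 = 0.5
```

The score penalizes both missed published sites and spurious predictions, reaching 1 only when the two sets coincide.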
In order to have a fair comparison in our experiments, we ignored the known prior probabilities of the different motif classes and set them all equal. We also performed experiments on a version of DIMDom which considers only class V (the basic EM algorithm) so as to illustrate the improvement in performance gained by introducing the knowledge of different motif classes. It is not surprising that MEME (average score 0.1925 (0.3141)) performed better than the basic EM algorithm (average score 0.0998 (0.2761)). However, after introducing the five motif classes, DIMDom (average score 0.2501 (0.4471)) outperformed MEME when the same set of parameters was
used. Note that DIMDom was about 1.5 times more accurate than MEME when 30 predicted motifs could be output. Among the 47 data sets, both DIMDom and MEME failed to predict any published binding sites in 19 (9) data sets; DIMDom had a better performance (higher score) on 17.5 (27.5) data sets, while MEME had a better performance on only 10.5 (10.5). Thus, when the output has 30 predicted motifs, DIMDom outperformed MEME by a factor of 2.5 in the number of successes. In 5.5 of the 10.5 cases in which MEME did better than DIMDom, MEME predicted only 1 or 2 out of many not-so-similar binding sites because of the high threshold (0.9) used by DIMDom. Even with a simple description of the motif classes, DIMDom correctly predicted the motif class in 9 (12) out of 21 (25) instances. We expect better prediction results if more parameters are used to describe the motif classes [17]; however, more training data would be needed for tuning these parameters.
5. Conclusion
We have incorporated biological information, in terms of prior probabilities and pattern characteristics of possible motif classes, into the EM algorithm for discovering motifs and binding sites of transcription factors. Our algorithm DIMDom was shown to have better performance than the popular software MEME. DIMDom will have potentially even better performance if more motif classes are known and included in the algorithm. Like many motif discovery algorithms, DIMDom will work without the length of the motif being given. When the length of the motif is specified, DIMDom will certainly have better performance than when the length is not given and the likelihoods of motifs of different lengths must be compared.
References
1. W. Atchley and W. Fitch, Proc. Natl. Acad. Sci., 94, 5172-5176 (1997).
2. T. Bailey and C. Elkan, ISMB, 28-36 (1994).
3. F. Chin, H. Leung, S.M. Yau, T.W. Lam, R. Rosenfeld, W.W. Tsang, D. Smith and Y. Jiang, RECOMB04, 125-132 (2004).
4. E. Eskin, RECOMB04, 115-124 (2004).
5. S. Keles, M. Lann, S. Dudoit, B. Xing and M. Eisen, Statistical Applications in Genetics and Molecular Biology, 2, Article 5 (2003).
6. C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton, Science, 262, 208-214 (1993).
7. C. Lawrence and A. Reilly, Proteins: Structure, Function and Genetics, 7, 41-51 (1990).
8. H. Leung and F. Chin, JBCB, 4, 43-58 (2006).
9. H. Leung and F. Chin, WABI, 264-275 (2005).
10. H. Leung and F. Chin, Bioinformatics, 22(supp 2), ii86-ii92 (2005).
11. H. Leung and F. Chin, Bioinformatics (to appear).
12. H. Leung, F. Chin, S.M. Yiu, R. Rosenfeld and W.W. Tsang, JCB, 12(6), 686-701 (2005).
13. M. Li, B. Ma and L. Wang, Journal of Computer and System Sciences, 65, 73-96 (2002).
14. J.S. Liu, A.F. Neuwald and C.E. Lawrence, Journal of the American Statistical Association, 432, 1156-1170 (1995).
15. K. MacIsaac, D. Gordon, L. Nekludova, D. Odom, J. Schreiber, D. Gifford, R. Young and E. Fraenkel, Bioinformatics, 22(4), 423-429 (2006).
16. N.J. Mulder et al., Nucleic Acids Res., 31, 315-318 (2003).
17. L. Narlikar, R. Gordan, U. Ohler and A. Hartemink, Bioinformatics, 22(14), e384-e392 (2006).
18. L. Narlikar and A. Hartemink, Bioinformatics, 22(2), 157-163 (2006).
19. C. Pabo and R. Sauer, Annu. Rev. Biochem., 61, 1053-1095 (1992).
20. P. Pevzner and S.H. Sze, ISMB, 269-278 (2000).
21. A. Sandelin and W. Wasserman, JMB, 338, 207-215 (2004).
22. S. Sinha and M. Tompa, BIBE, 214-220 (2003).
23. S. Wolfe, L. Nekludova and C.O. Pabo, Annu. Rev. Biomol. Struct., 3, 183-212 (2000).
24. E. Xing and R. Karp, Natl. Acad. Sci., 101, 10523-10528 (2004).
25. J. Zilliacus, A.P. Wright, D.J. Carlstedt and J.A. Gustafsson, Mol. Endocrinol., 9, 389-400 (1995).
AB INITIO PREDICTION OF TRANSCRIPTION FACTOR BINDING SITES

L. ANGELA LIU and JOEL S. BADER*
Department of Biomedical Engineering and High-Throughput Biology Center, Johns Hopkins University, Baltimore, MD 21218, USA
*E-mail:
[email protected]
Transcription factors are DNA-binding proteins that control gene transcription by binding specific short DNA sequences. Experiments that identify transcription factor binding sites are often laborious and expensive, and the binding sites of many transcription factors remain unknown. We present a computational scheme to predict the binding sites directly from transcription factor sequence using all-atom molecular simulations. This method is a computational counterpart to recent high-throughput experimental technologies that identify transcription factor binding sites (ChIP-chip and protein-dsDNA binding microarrays). The only requirement of our method is an accurate 3D structural model of a transcription factor-DNA complex. We apply free energy calculations by thermodynamic integration to compute the change in binding energy of the complex due to a single base pair mutation. By calculating the binding free energy differences for all possible single mutations, we construct a position weight matrix for the predicted binding sites that can be directly compared with experimental data. As water-bridged hydrogen bonds between the transcription factor and DNA often contribute to the binding specificity, we include explicit solvent in our simulations. We present successful predictions for the yeast MAT-a2 homeodomain and GCN4 bZIP proteins. Water-bridged hydrogen bonds are found to be more prevalent than direct protein-DNA hydrogen bonds at the binding interfaces, indicating why empirical potentials with implicit water may be less successful in predicting binding. Our methodology can be applied to a variety of DNA-binding proteins.
Keywords: transcription factor binding sites; free energy; position weight matrix; hydrogen bond
1. Introduction
Transcription factors (TFs) are proteins that exert control over gene expression by recognizing and binding short DNA sequences.

DNA' + protein (aq) → DNA'·protein (aq)
ΔΔG = ΔG' − ΔG = ΔG_comp − ΔG_DNA
Fig. 1. Thermodynamic cycle used in the relative binding free energy calculation.
free energies of a protein with two different DNA sequences can be measured experimentally. The first horizontal reaction contains the native DNA and TF-DNA complex, whereas the second horizontal reaction contains the mutant DNA and its complex. In computations, it is relatively easy to calculate the free energy change caused by a mutation in the DNA sequence, indicated by the vertical reactions in the figure. The difference in binding free energy in the two experimental measurements, ΔG' − ΔG, is identical to the computational free energy difference, ΔG_comp − ΔG_DNA. This difference, ΔΔG, will be referred to as the relative binding free energy in this paper. More detailed theoretical background can be found in Refs. 20,21.

The molecular simulation package CHARMM30 was used to carry out the molecular dynamics simulations, and its BLOCK module was used for the free energy calculations. We first established well-equilibrated native protein-DNA complex and DNA-duplex configurations using molecular dynamics simulation. Missing hydrogen atoms were added to the crystal structures of MAT-a2 (PDB:1APL) and GCN4 (PDB:1YSA). Charges of the titratable amino acid residues were assigned to their values at neutral pH. TIP3P water molecules were added and periodic boundary conditions were applied. Counterions (Na+) were introduced to neutralize the system using the random water-replacement routine developed by Rick Venable.31 The CHARMM27 force field was used. The positions of the ions and water molecules were minimized, followed by full minimization of the entire system using the adopted basis Newton-Raphson method. The non-bonded cutoff radius was 14 Å. The system was then heated to 300 K and equilibrated for 1.5 ns in the NPT ensemble using a 1 fs time step. The final configurations contained about 7000 water molecules and 25000 atoms for both the MAT-a2 and GCN4 protein-DNA complexes. The protein-DNA complex and the DNA duplex were simulated separately. From the equilibrated native configurations, we used a house-built program to replace each native base pair by multi-copy base pairs.32,33 In this multi-copy approach, multiple base pairs are superimposed and their contributions to the total energy or force function are scaled by coupling parameters. In this paper, all multi-copy base pairs are a superposition of two physical base pairs; therefore, there are 6 possible multi-copy base pairs at one position. The standard base geometry34 was used to build a library of multi-copy base pair equilibrium geometries. Three consecutive rotations were applied to align the multi-copy base with the native base to preserve the orientation with respect to the rest of the DNA duplex. The structure with the multi-copy base pair was minimized first to remove possible bad contacts caused by the introduction of the multi-copy base. It was then heated to 350 K and equilibrated for 15 ps. This heating step helps move the conformation away from the native structure's local minima and may improve sampling of the glassy waters at the protein-DNA interface. The system was then cooled to 300 K and equilibrated for 65 ps. A 100 ps production run was performed, during which the trajectory was saved every 0.5 ps. The simulations were done in the NVT ensemble using the same periodic boundary conditions as for the fully-equilibrated native structure. The free energy analysis of the production trajectories is outlined below. Thermodynamic integration 20,21 was used to calculate the free energy change for mutating the original base pair into another possible base pair in the multi-copy base pair.
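Thermodynamic integration evaluates ΔG = ∫₀¹ ⟨∂U/∂λ⟩ dλ, and when the free energy gradient is close to linear in λ, a single evaluation at λ = 0.5 reproduces the full integral. The gradient function below is a made-up stand-in for the simulation average, used only to illustrate the quadrature:

```python
def dg_dlambda(lam):
    # Hypothetical, roughly linear free energy gradient (not simulation data).
    return -12.0 + 5.0 * lam

# Mid-point approximation: a single gradient evaluation at lambda = 0.5.
dG_mid = dg_dlambda(0.5)

# Dense trapezoidal integration over lambda in [0, 1] for comparison.
n = 1000
pts = [i / n for i in range(n + 1)]
dG_trap = sum((dg_dlambda(a) + dg_dlambda(b)) / 2 * (b - a)
              for a, b in zip(pts, pts[1:]))
print(dG_mid, dG_trap)  # the two agree for a linear gradient
```

For a genuinely linear gradient the mid-point rule is exact, which is the computational saving exploited in the protocol described here.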
The linear coupling scheme in the coupling parameter λ was used in BLOCK for the energy function of the multi-copy structures, which allows an analytical solution of the free energy gradient. Typically, multiple values of λ are required for the integration. From preliminary calculations, we found that the free energy gradient was approximately linear with respect to λ for multi-copy base pairs. Therefore, we used a mid-point approximation (λ = 0.5) for computational savings. The binding free energy difference decomposes into separate contributions from DNA, protein, and solvent (ions and water), using the same
notation as Fig. 1:

ΔΔG_total = ΔG_comp − ΔG_DNA = ΔΔG_internal + ΔΔG_external    (1)

ΔG_comp = ΔG_prot^c + ΔG_solvent^c + ΔG_DNA^c
ΔG_DNA = ΔG_solvent' + ΔG_DNA'
ΔΔG_internal = ΔG_DNA^c − ΔG_DNA'
ΔΔG_external = ΔG_prot^c + ΔG_solvent^c − ΔG_solvent',
where the superscripts c and ' represent the protein-DNA complex and the free DNA duplex, respectively. For homeodomains, the contribution of the N-terminus to the binding free energy difference was also calculated, using ΔΔG_Nterm = ΔG_Nterm^c − 0, where the zero represents the corresponding ΔG term in the free DNA duplex. The binding free energy differences in Eq. (1) are converted into Boltzmann factors and position weight matrices as in Ref. 15 using the additive approximation. These matrices are converted into sequence logos35 using WEBLOGO.36 For the TFs considered in this work (Sec. 2), the DNAs remain relatively undeformed upon TF binding, which may make the additive approximation accurate.14

3.2. Hydrogen bond analysis
The native protein-DNA complex and DNA-duplex trajectories were further analyzed to explore the role of water in the binding specificity. CHARMM's HBOND module was used to determine whether a hydrogen bond (H-bond) exists in a given frame of the trajectory. A distance cutoff of 2.4 Å was used as the maximum H-bond length (between acceptor and donor hydrogen), with no angle cutoff. A house-built program was then used to calculate the lifetime histograms for all occurrences of H-bonds. A 2 ps resolution was used, such that any breakage of an H-bond shorter than 2 ps is ignored.37 The existence of a direct or a water-bridged H-bond between the protein and DNA at each base pair position was also calculated. H-bonds formed by the N-terminal residues of MAT-a2 were considered separately from the rest of the protein.

4. Results and Discussions
Using the methods outlined in Sec. 3, the predicted sequence logos for the free energy terms in Eq. (1) are shown in Fig. 2. Our prediction of MAT-a2 achieves excellent agreement for all 5 positions in the "TTACA" consensus
sequence. This agreement verifies that the mid-point approximation for thermodynamic integration (Sec. 3) is valid for this TF. The N-terminus is
[Figure 2: predicted sequence logos for the free energy terms of Eq. (1), panels (a) and (b).]
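The lifetime bookkeeping of the hydrogen bond analysis (Sec. 3.2) can be sketched as follows: given a 0/1 per-frame presence series (0.5 ps per frame, matching the trajectory saving interval), interior gaps shorter than 2 ps are bridged before run lengths are measured. The series below is invented for illustration:

```python
def hbond_lifetimes(present, dt=0.5, gap_tol=2.0):
    """Return lifetimes (ps) of contiguous H-bond runs in a 0/1 series
    sampled every dt ps, ignoring breakages shorter than gap_tol ps."""
    frames = list(present)
    # Bridge interior gaps shorter than gap_tol.
    i = 0
    while i < len(frames):
        if frames[i] == 0:
            j = i
            while j < len(frames) and frames[j] == 0:
                j += 1
            interior = i > 0 and j < len(frames)
            if interior and (j - i) * dt < gap_tol:
                for k in range(i, j):
                    frames[k] = 1
            i = j
        else:
            i += 1
    # Measure run lengths of the (possibly bridged) presence series.
    lifetimes, run = [], 0
    for f in frames + [0]:
        if f:
            run += 1
        elif run:
            lifetimes.append(run * dt)
            run = 0
    return lifetimes

# A 1.5 ps break (3 frames) is ignored; a 2.5 ps break (5 frames) is not.
series = [1] * 6 + [0] * 3 + [1] * 4 + [0] * 5 + [1] * 2
print(hbond_lifetimes(series))  # -> [6.5, 1.0]
```

Histogramming these lifetimes over all protein-DNA and water-bridged H-bonds yields the distributions discussed in the Results.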
such that the overall sequence similarity between all genes in pathway Q and their corresponding orthologues in the template P, as well as the consistency of the operon and regulation structures between pathways P and Q, are as high as possible.

2.2. The methods
Our approach consists of the following steps: (1) For every gene in the template pathway P, find a set of homologs in the target genome T with BLAST. (2) Remove from the homologs genes unlikely to be orthologous to the corresponding gene in the template P. This is done based on available functional information, e.g., Clusters of Orthologous Groups (COG) 16; in particular, genes that are unlikely to be orthologous would have different COG numbers. (3) Obtain protein-DNA interactions and operon structures for the homologous genes in the template pathway and target genome from related databases 6,11, the literature, or computational tools 10,15. (4) Assign exactly one of the homologous genes as the ortholog for the corresponding gene in the template P. This is done based on the constraints given by the protein-DNA interaction and operon information (for any gene that is not covered by the structural information, due to incomplete data or other reasons, we simply assign the best BLAST hit as the ortholog). Such an orthology mapping or assignment should yield a predicted pathway that has overall high sequence similarity and structural consistency with the template pathway. By incorporating sophisticated structural information, the pathway prediction problem may become computationally intractable. We describe in detail in the following how an efficient algorithm can be obtained to find
the orthology mapping between the template pathway and the one to be predicted. We consider structural constraints from protein-DNA interactions and from operons in two separate steps.

2.2.1. Constraints with protein-DNA interactions
We use available protein-DNA interaction information, i.e. transcriptional regulation information, to constrain the orthology assignment, so as to identify orthologs whose regulation structures are consistent with those of the corresponding genes in the template pathway. Thinking of genes as vertices and relations among the genes as edges, the template pathway and the corresponding homologs in the target genome can naturally be formed into two graphs. The problem can thus be converted to finding the optimal common subgraph of these two graphs, which is in turn formulated as the maximum independent set (MIS) problem. Details are given below. For convenience, in this paper we call a regulon a gene encoding a transcription factor together with all the genes regulated by that factor.
Figure 1. Constraints with transcriptional regulations. (a) Regulation graph G1 for the template pathway. A directed edge points from a tf gene to a gene regulated by the corresponding TF, a solid edge connects two genes regulated by the same TF, and a dashed edge connects two genes belonging to different regulons. (b) Regulation graph G2 for the homologous genes in the target genome, constructed in a similar way to (a). (c) Merged graph G from G1 and G2. Each node is a pair of homologous genes.
(1) A regulation graph G1 = (V1, E1) is built for the template pathway P, where the vertex set V1 represents all genes in the template pathway P, and the edge set E1 contains three types of edges: an edge of type-1 connects a tf gene and every gene regulated by the corresponding product; an edge of type-2 connects two genes regulated by the same tf gene product; and edges of type-3 connect two genes from different regulons if they are not yet connected (Figure 1(a)). (2) A regulation graph G2 = (V2, E2) is built for the target genome in
501
the similar way, where V2 represents homologous genes in the target genomes (Figure 1(b)). (3) Graphs G\ and G% are merged into a single graph G = (V, E) such that V contains vertex [i,j] if and only if i e V\ and j £ V2 are two homologous genes. A weight is assigned to vertex [i,j] according to the BLAST score between genes i and j . Add an edge ([i,j], [i',jr]) if either (a) i = i' or j — j ' but not both, or (b) edges (i, i') G E\ and (j,j') € E-2 are not of the same type (Figure 1(c)). (4) Then the independent set in the graph G with the maximum weight should correspond to the desired orthology mapping that achieves the maximum sequence similarity and regulation consistency. This assigns one unique orthologous gene in this template pathway to each gene in the pathway to be predicted, as long as they are covered by the known protein-DNA interaction structures.
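The merged-graph construction of step (3) and the weighted MIS of step (4) can be sketched as follows. This is an illustrative toy implementation, not the authors' code: all function and variable names are ours, and the brute-force search merely stands in for the efficient tree decomposition algorithm of Section 2.3.

```python
from itertools import combinations

def merge_graphs(V1, E1, V2, E2, homologs, blast_score):
    """Build the merged graph G from regulation graphs G1 and G2.

    E1/E2 map a directed gene pair (i, i') to its edge type (1, 2 or 3);
    `homologs` is a set of (i, j) pairs with i in V1, j in V2; the vertex
    weight of [i, j] is its BLAST score.  (Hypothetical data layout;
    edges are assumed to be stored in a fixed orientation.)
    """
    V = [(i, j) for i in V1 for j in V2 if (i, j) in homologs]
    w = {v: blast_score[v] for v in V}
    E = set()
    for (i, j), (k, l) in combinations(V, 2):
        # (a) the two pairs share exactly one gene: conflicting assignments
        if (i == k) != (j == l):
            E.add(((i, j), (k, l)))
        # (b) the regulation edge types disagree between the two graphs
        elif (i, k) in E1 and (j, l) in E2 and E1[(i, k)] != E2[(j, l)]:
            E.add(((i, j), (k, l)))
    return V, E, w

def max_weight_independent_set(V, E, w):
    """Exhaustive maximum-weight independent set - small graphs only."""
    best, best_w = [], 0.0
    for r in range(1, len(V) + 1):
        for S in combinations(V, r):
            if all((u, v) not in E and (v, u) not in E
                   for u, v in combinations(S, 2)):
                sw = sum(w[v] for v in S)
                if sw > best_w:
                    best, best_w = list(S), sw
    return best, best_w
```

On the toy input below, every vertex pair conflicts, so the optimal mapping keeps only the single highest-scoring homolog pair.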
2.2.2. Constraints with operon structures

We now describe how confirmed or predicted operon information is used to further constrain the orthology assignment. This step applies to the genes not covered by protein-DNA interaction structures.
Figure 2. Constraints with operon information; see text for details. A dashed line connects two homologs. (a) Setting the weight of an operon. (b) A pair of partially conserved operons in the template pathway and target genome. (c) The mapping graph formed according to (b). (d) An operon that appears only in the target genome. (e) The mapping graph formed according to (d).
We first assign each gene i a weight w_i, set to the average of its BLAST scores with its top m (say, 5) homologs. The weight of an operon o is set to 0.5(n-1) Σ_{i∈o} w_i / n, where n is the number of genes in the operon (Figure 2(a)). The factor 0.5 lets an operon in one genome contribute only 50%, with a conserved operon in the other genome contributing the other 50%. The term n-1 excludes operons with only one gene from consideration, since such operons introduce no structural information. We then sort the operons by size and use the following greedy iterative process to constrain the orthology mapping as long as an unexamined operon remains. Repeat the following four steps:
(1) Select the largest unexamined operon and consider the related homologs in the other genome, as well as the available operon structures among them.
(2) Build a mapping graph Gm = (Vm, Em) (Figure 2(b)-(e)), where Vm contains two types of vertices: an operon vertex represents each of the involved operons, and a mapping vertex [i,j] represents each pair of homologous genes i and j. The edge set Em contains three types of edges: an edge connects every pair of mapping vertices ([i,j], [k,l]) with i ≠ k and j ≠ l; an edge connects an operon node and a mapping node if one of the two genes in the mapping node belongs to the operon; and an edge connects every pair of involved operons between the target genome and the template pathway.
(3) Find the maximum clique C on Gm.
(4) Remove the template genes that appear in the mapping nodes of C, together with their homologs. Remove an operon if all of its genes have been removed; if only a subset of its genes has been removed, keep the remaining genes as a reduced operon. Re-sort the remaining operons.
In this formulation, an edge in graph Gm denotes a consistent relationship between the two nodes it connects.
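The operon weight of Figure 2(a) can be sketched directly from the formula above. The data layout here is hypothetical (gene weights taken as the mean of plain lists of BLAST scores); only the formula itself comes from the text.

```python
def operon_weight(genes, blast_hits, m=5):
    """Weight of an operon o: 0.5 * (n-1) * (sum of gene weights) / n.

    Each gene's weight w_i is the mean of its top-m BLAST scores.
    The (n-1) factor zeroes out single-gene operons, matching the
    paper's intent of ignoring operons that carry no structural signal.
    (Illustrative sketch, not the authors' code.)
    """
    n = len(genes)
    w = {g: sum(sorted(blast_hits[g], reverse=True)[:m])
            / min(m, len(blast_hits[g]))
         for g in genes}
    return 0.5 * (n - 1) * sum(w.values()) / n
```

For a three-gene operon this reduces to 0.5 × 2 × [(w1 + w2 + w3)/3], i.e. the average gene weight.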
A maximum clique denotes a set of consistent operon and mapping nodes with the maximum total weight, and thus infers an optimal mapping. Note that an operon in one genome may have zero or more, complete or partially conserved operons in the other genome 10 . If it has one or more (Figure 2(b)), the constraint can be obtained from both genomes and is therefore called a two-side constraint; the procedure then finds the orthology mapping that maximizes the sequence similarity and the operon structural consistency. Otherwise, it is called a one-side constraint (Figure 2(d)); the procedure then finds the orthology mapping that minimizes the number of involved operons.
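For small mapping graphs, the maximum-weight clique of step (3) can be found by exhaustive search. The sketch below is illustrative only (the paper uses the tree decomposition algorithm of Section 2.3 for efficiency; names and data layout are ours):

```python
from itertools import combinations

def max_weight_clique(vertices, edges, weight):
    """Brute-force maximum-weight clique for a small mapping graph Gm.

    `edges` is a set of unordered vertex pairs stored as tuples;
    `weight` maps each vertex (operon node or mapping node) to its
    weight. (Toy implementation for illustration.)
    """
    def is_clique(S):
        return all((u, v) in edges or (v, u) in edges
                   for u, v in combinations(S, 2))
    best, best_w = [], 0.0
    for r in range(1, len(vertices) + 1):
        for S in combinations(vertices, r):
            if is_clique(S):
                sw = sum(weight[v] for v in S)
                if sw > best_w:
                    best, best_w = list(S), sw
    return best, best_w
```

A triangle of mutually consistent nodes beats a single heavier but isolated node, as in the example below.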
2.3. Tree decomposition based algorithm
As shown in Section 2.2, constraining the orthology mapping with protein-DNA interactions and with operon structures reduces to the maximum independent set (MIS) and maximum clique (CLIQUE) problems on graphs formulated from the structural constraints. Both problems are in general computationally intractable, and any naive optimization algorithm would be very inefficient given that pathway prediction is done at the genome scale. Our algorithmic techniques are based on graph tree decomposition. A tree decomposition 13 of a graph provides a topological view of the graph, and the tree width measures how tree-like the graph is. Informally, in a tree decomposition, vertices from the original graph are grouped into a number of possibly intersecting bags, and the bags topologically form a tree. Shared vertices among intersecting bags form graph separators; efficient dynamic programming traversal over the graph is possible when all the bags are of small size (i.e., when the tree width is small) 3 . In general, the graphs formulated from protein-DNA interactions and operon structures have small tree width. We employ the standard tree decomposition-based dynamic programming algorithm 3 to solve the MIS and CLIQUE problems on graphs of small tree width. On graphs with larger tree width, especially dense graphs, our approach applies the tree decomposition algorithm to the complement of the graph instead. The running time of the algorithms is O(2^t n), where t and n are, respectively, the tree width and the number of vertices in the graph. Such a running time scales to larger pathways. Due to space limitations, we omit the formal definition of tree decomposition and the dynamic programming algorithm, and refer the reader to 3 for details. We note that finding the optimal tree decomposition (i.e., the one with the smallest tree width) is NP-hard 2 . We therefore use a simple, fast approximation algorithm, greedy fill-in 4 , to produce a tree decomposition for the given graph. The approximated tree width t may affect the running time of the pathway prediction but not its accuracy.
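The greedy fill-in heuristic can be sketched as follows. This is a minimal version assuming an adjacency-set representation of the graph, not the exact code of 4 ; it returns only the width of the induced decomposition, which is what bounds the dynamic programming cost.

```python
def greedy_fillin_width(adj):
    """Estimate the tree width via the greedy fill-in elimination heuristic.

    Repeatedly eliminate the vertex whose neighbourhood needs the fewest
    fill edges to become a clique; the largest neighbourhood seen along
    the way is the width of the resulting tree decomposition.
    (Illustrative sketch; `adj` maps each vertex to its neighbour set.)
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # defensive copy
    width = 0
    while adj:
        def fill_cost(v):
            # number of missing edges among v's neighbours
            ns = list(adj[v])
            return sum(1 for a in range(len(ns))
                         for b in range(a + 1, len(ns))
                         if ns[b] not in adj[ns[a]])
        v = min(adj, key=fill_cost)
        ns = adj.pop(v)
        width = max(width, len(ns))
        # turn v's neighbourhood into a clique, then drop v
        for u in ns:
            adj[u].discard(v)
            adj[u] |= ns - {u}
    return width
```

A 4-cycle, for example, has tree width 2, and a simple path has tree width 1; the heuristic recovers both exactly (in general it only gives an upper bound).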
3. Evaluation Results

We evaluated TdPATH against BH, BBH and PMAP by using 40 known pathways in B. subtilis 168 from the KEGG pathway database 5 as templates (Table 1) to infer the corresponding pathways in E. coli K12. For TdPATH, operon structures were predicted according to the method used in 10 , and experimentally confirmed transcriptional regulation information was taken from 6 for B. subtilis 168 and from 11 for E. coli K12. For PMAP, predicted operon and regulon information was obtained according to the method used in 7 . Both TdPATH and PMAP include the COG filtering.

Table 1. Template pathways of B. subtilis 168, taken from the KEGG pathway database.
bsu00040 bsu00471 bsu00660 bsu00930 bsu03060 bsu00520
bsu00100 bsu00480 bsu00720 bsu00950 bsu00220 bsu00920
bsu00130 bsu00511 bsu00730 bsu01031 bsu00450 bsu03010
bsu00190 bsu00530 bsu00750 bsu01032 bsu00770 bsu00240
bsu00193 bsu00531 bsu00760 bsu02040 bsu00780 bsu00400
bsu00401 bsu00602 bsu00900 bsu03020 bsu01053
bsu00430 bsu00604 bsu00903 bsu03030 bsu02030
We evaluated the accuracy of the algorithms, measured as the arithmetic mean of sensitivity and specificity. Let K be the real target pathway and H be the set of homologous genes found by BLAST according to the corresponding template pathway. Let R be the size of K ∩ H, i.e. the number of genes common to both the real target pathway and the candidate orthologs. We use R as the number of real genes when calculating sensitivity and specificity, because it is the maximum number of genes a sequence-based method can predict correctly. Since BH (or BBH) can be considered a subroutine of PMAP and TdPATH, we evaluated efficiency only for PMAP and TdPATH. Running times from reading the inputs to outputting the predicted pathway were collected. For TdPATH, we also collected the tree width of the tree decompositions of the constructed graphs or their complement graphs. For all of the algorithms, the program NCBI blastp 1 was used for the BLAST search, with the E-value threshold set to 10^-6. The experiments ran on a PC with a 2.8 GHz Intel(R) Pentium 4 processor and 1 GB RAM, running RedHat Enterprise Linux version 4 AS. Running times were measured using the "time" function. The testing results are summarized in Table 2. On average, TdPATH has an accuracy of 0.88, better than those of the other algorithms. We give two examples to show that the improvement holds for small as well as large pathways. One is the nicotinate and nicotinamide metabolism pathway, which has 13 genes in B. subtilis 168 and 16
genes in E. coli K12. The prediction accuracy of TdPATH is 0.9, better than the 0.79, 0.83 and 0.79 of BH, BBH and PMAP, respectively. The other is the pyrimidine metabolism pathway, which has 53 genes in B. subtilis 168 and 58 in E. coli K12. TdPATH has a prediction accuracy of 0.82, better than the 0.79, 0.80 and 0.79 of BH, BBH and PMAP, respectively. PMAP has the second highest accuracy, which indicates that prediction accuracy can be improved even by incorporating structural information only partially.

Table 2. Evaluation results. T: time (in seconds), A: accuracy ((sensitivity+specificity)/2).

        BH      BBH     PMAP            TdPATH
        A       A       A       T       A       T
min     0.33    0.45    0.33    12.8    0.50    1.2
max     1.00    1.00    1.00    27.3    1.00    33.3
ave     0.84    0.85    0.86    16.4    0.88    11.5
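The accuracy measure used above can be sketched as follows. We read "specificity" here as the fraction of predicted genes that are correct, which matches the text's use of R = |K ∩ H| as the count of recoverable genes; variable names are ours and this is a reconstruction, not the authors' evaluation code.

```python
def pathway_accuracy(predicted, real, homologs):
    """Accuracy = (sensitivity + specificity) / 2.

    `real` is the known target pathway K, `homologs` the BLAST hit set H.
    R = |K ∩ H| serves as the number of real genes, since a sequence-based
    method can predict at most those correctly. (Interpretation of
    "specificity" as precision is our assumption.)
    """
    R = len(set(real) & set(homologs))
    tp = len(set(predicted) & set(real))
    sensitivity = tp / R if R else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return (sensitivity + specificity) / 2
```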
For efficiency, TdPATH takes an average of 11.5 seconds to predict a pathway, slightly better than the 16.4 seconds of PMAP. The tree width distribution is shown in Figure 3. On average, the tree width of the tree decompositions of the constructed graphs or their complement graphs is 3; 87% of them have tree width at most 5, and 94% at most 8. Since the running time to find the maximum independent set by the tree decomposition based method is theoretically O(2^t n) (where t is the tree width), we can conclude from these tree width statistics that our algorithm is efficient most of the time.

Figure 3. Distribution of the tree width of the tree decompositions of the constructed graphs or their complement graphs.
4. Discussion and Conclusion

We have shown how functional information and structural information, including protein-DNA interactions and operon structures, can be used in comparative-analysis-based pathway prediction and annotation. The structural information used to constrain the orthology assignment between the template pathway and the one to be predicted appears to be critical for improving prediction accuracy; the goal is to make the sequence similarity and the structural consistency between the template and the predicted pathways as high as possible. Technically, the problem was formulated as finding the maximum independent set on graphs constructed from the structural constraints. Our algorithm, based on non-trivial tree decomposition, coped well with the computational intractability and ran very efficiently. Evaluations on real pathway prediction for E. coli also showed the effectiveness of this approach, which can utilize incomplete data and tolerate some noise in the data. The tree decomposition based algorithm is sophisticated yet practically efficient. Simpler algorithms are possible if only functional information and sequence similarity are considered; however, computationally incorporating structural information such as protein-DNA interactions and operons into optimal pathway prediction appears to be inherently difficult, and naive optimization algorithms may not scale to larger pathways at the genome scale. Beyond computational efficiency, our graph-theoretic approach also makes it possible to incorporate further information, such as gene fusions and protein-protein interactions 12 , to improve accuracy, simply because such information can be represented as graphs as well. On the other hand, when a template pathway is not well conserved in the target genome, the method may fail to predict the pathway correctly. Multiple templates could be used to address this problem, since the conserved information from different templates can complement each other. We are working on building profiles from multiple template pathways and using them for pathway prediction.
References
1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., 25, 3389-3402, 1997.
2. H. L. Bodlaender, "Classes of graphs with bounded tree-width", Tech. Rep. RUU-CS-86-22, Dept. of Computer Science, Utrecht University, the Netherlands, 1986.
3. H. L. Bodlaender, "Dynamic programming algorithms on graphs with bounded tree-width", In Proceedings of the 15th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, 317, 105-119, Springer Verlag, 1987.
4. I. V. Hicks, A. M. C. A. Koster, E. Kolotoglu, "Branch and tree decomposition techniques for discrete optimization", In Tutorials in Operations Research: INFORMS - New Orleans, 2005.
5. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, M. Hirakawa, "From genomics to chemical genomics: new developments in KEGG", Nucleic Acids Res., 34, D354-D357, 2006.
6. Y. Makita, M. Nakao, N. Ogasawara, K. Nakai, "DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics", Nucleic Acids Res., 32, D75-D77, 2004.
7. F. Mao, Z. Su, V. Olman, P. Dam, Z. Liu, Y. Xu, "Mapping of orthologous genes in the context of biological pathways: An application of integer programming", PNAS, 103 (1), 129-134, 2006.
8. D. W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 516-517, 2000.
9. R. Nielsen, "Comparative genomics: Difference of expression", Nature, 440, 161, 2006.
10. M. N. Price, K. H. Huang, E. J. Alm, A. P. Arkin, "A novel method for accurate operon predictions in all sequenced prokaryotes", Nucleic Acids Res., 33, 880-892, 2005.
11. H. Salgado, S. Gama-Castro, M. Peralta-Gil, et al., "RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions", Nucleic Acids Res., 34, D394-D397, 2006.
12. J. L. Reed, I. Famili, I. Thiele, B. O. Palsson, "Towards multidimensional genome annotation", Nature Reviews Genetics, 7, 130-141, 2006.
13. N. Robertson and P. D. Seymour, "Graph minors II. Algorithmic aspects of tree-width", J. Algorithms, 7, 309-322, 1986.
14. P. Romero, J. Wagg, M. L. Green, D. Kaiser, M. Krummenacker, P. D. Karp, "Computational prediction of human metabolic pathways from the complete human genome", Genome Biology, 6, R2, 2004.
15. Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, Y. Xu, "Computational Inference of Regulatory Pathways in Microbes: an Application to Phosphorus Assimilation Pathways in Synechococcus sp. WH8102", Genome Informatics, 14, 3-13, 2003.
16. R. L. Tatusov, E. V. Koonin, D. J. Lipman, "A Genomic Perspective on Protein Families", Science, 278 (5338), 631-637, 1997.
PACIFIC SYMPOSIUM ON BIOCOMPUTING 2007

The Pacific Symposium on Biocomputing (PSB) 2007 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2007 will be held January 3-7, 2007 at the Grand Wailea, Maui. Tutorials will be offered prior to the start of the conference. PSB 2007 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.
The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's "hot topics." In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.