BIOCOMPUTING 2002
Edited by
Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauderdale & Teri E. Klein
World Scientific
Cover image: From the cover of the Proceedings of the Pacific Symposium on Biocomputing 1996, published by World Scientific Publishing Company. This image depicts a molecular model of the complex of B-DNA and the zinc finger moiety of FPG protein, and is used as a prototype system for understanding how DNA damage is recognized by repair enzymes. Image and molecular modeling studies by Teri E. Klein, UCSF Computer Graphics Laboratory. Used with permission from the Regents of the University of California, 1995. (Image is copyrighted to the Regents of the University of California.)
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2002 Kauai, Hawaii 3-7 January 2002
Edited by Russ B. Altman Stanford University, USA
A. Keith Dunker Washington State University, USA
Lawrence Hunter University of Colorado Health Sciences Center, USA
Kevin Lauderdale Stanford University, USA
Teri E. Klein Stanford University, USA
World Scientific
New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
BIOCOMPUTING Proceedings of the 2002 Pacific Symposium Copyright © 2001 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4777-X
Printed in Singapore by World Scientific Printers
PACIFIC SYMPOSIUM ON BIOCOMPUTING 2002

The seventh Pacific Symposium on Biocomputing (PSB) marks the first PSB held following the tragic events of September 11, 2001 in New York, Pennsylvania and Washington DC. These events have affected the world at large and cannot go unnoticed by the computational biology community. The organizers would like to add their condolences to those who suffered. In spite of technical and personal difficulties that individuals incurred, we are happy to be able to put forth these proceedings.

PSB is sponsored by the International Society for Computational Biology (http://www.iscb.org/). Meeting participants benefit once again from travel grants from the U.S. Department of Energy, the National Library of Medicine/National Institutes of Health, Applied Biosystems and Boston College. We gratefully acknowledge the hardware contributions from Compaq.

We thank Professor David Botstein in advance for his plenary address on Extracting Biologically Interesting Information from Microarrays and Professor Rebecca Eisenberg for her plenary address on Bioinformatics, Bioinformation and Biomolecules: the Role and Limitations of Patents.

Kevin Lauderdale has gone beyond the call of duty and once again expertly created the printed and online proceedings. Al Conde has ensured that the hardware and network systems are functional.

We would especially like to acknowledge the contributions of the session organizers, who solicited papers and reviews and ensured that the quality of the meeting remains high. The session organizers (and their associated sessions) are:

Inna Dubchak, Lior Pachter and Liping Wei (Genome-wide Analysis and Comparative Genomics)
Peter Karp, Pedro Romero and Eric Neumann (Genome, Pathway and Interaction Bioinformatics)
Willi von der Lieth (Expanding Proteomics to Glycobiology)
Lynette Hirschman, Jong C. Park, Junichi Tsujii, Cathy Wu and Limsoon Wong (Literature Data Mining for Biology)
Isaac Kohane, Clay Stephens, Julie Schneider and Francisco De La Vega (Human Genomic Variation: Disease, Drug Response, and Clinical Phenotypes)
Scott Stanley and Benjamin Salisbury (Phylogenetic Genomics and Genomic Phylogenetics)
Peter Clote, Gavin Naylor, and Ziheng Yang (Proteins: Structure, Function and Evolution)

The PSB organizers and session leaders relied on the assistance of those who capably reviewed the submitted manuscripts. A partial list of reviewers is provided elsewhere in this volume. We thank those who have been left off this list inadvertently or who wish to remain anonymous.

Aloha!
Pacific Symposium on Biocomputing Co-Chairs
Russ B. Altman, Stanford University
A. Keith Dunker, Washington State University
Lawrence Hunter, University of Colorado Health Sciences Center
Teri E. Klein, Stanford University
October 1, 2001
Thanks to reviewers...

Finally, we wish to thank the scores of paper reviewers. PSB requires that every paper in this volume be reviewed by at least three independent referees. Since there is a large volume of submitted papers, that requires a great deal of work from many people, and we are grateful to all of you listed below, and to any whose names we may have accidentally omitted.

Aram Adourian, Laura Almasy, Orly Alter, Chris Amos, Mike Bada, Pierre Baldi, Serafim Batzoglou, Jadwiga Bienkowska, Eckart Bindewald, Erich Bornberg-Bauer, Phil Bradley, Richard Broughton, Michael Brudno, Andrea Califano, Matt Callow, Roland Carel, Vincent J. Carey, Simon Cawley, Hue Sun Chan, Joseph Chang, Andrew Clark, Julio Collado-Vides, Josep Comeron, Olivier Couronne, Derek Dimcheff, Chris Ding, Roland Dunbrack, Jeremy Edwards, Jodi Vanden Eng, Niklas Eriksen, George Estabrook, Andras Fiser, Jennifer Gleason, Richard Goldstein, Susumu Goto, Douglas Greer, Igor Grigoriev, Mark Grote, Ivo Gut, Alexander J. Hartemink, Lynette Hirschman, Steve Holbrook, David Paul Holden, John Holmes, Roderick V. Jensen, Ruhong Jiang, Kenneth Karol, Peter Karp, Ju Han Kim, Jessica Kissinger, Alex Lancaster, Jobst Landgrebe, Rick Lathrop, Hans-Peter Lenhof, Jin-Long Li, Weizhong Li, Pat Lincoln, Jan Liphardt, Irene Liu, Xiaole Liu, Gaby Loots, Joanne Luciano, Andrew Martin, Kate McKusick, William Newell, Magnus Nordborg, Gary Nunn, Matej Oresic, Christos Ouzounis, Ivan Ovcharenko, Jong Park, Peter Park, Hugh Pasika, Len Pennacchio, Yitzhak Pilpel, Tom Plasterer, Darren Platt, David Pollock, John Quackenbush, Mark Rabin, Marco Ramoni, Aviv Regev, Michael Reich, Markus Ringner, Pedro Romero, Vincent Schachter, Steffen Schulze-Kremer, Jody Schwartz, Thomas Seidl, Imran Shah, Ron Shamir, Roded Sharan, Victor Solovyev, Terence Speed, Paul Spellman, Scott Stanley, Robert Stuart, Jane Su, Xiaoping Su, Zoltan Szallasi, Amos Tanay, Debra Tanguay, Glenn Tesler, Denis Thieffry, Glenys Thomson, Jeff Thorne, Martin Tompa, Jun'ichi Tsujii, Jacques van Helden, Mike Walker, Teresa Webster, Simon Whelan, Kelly Ewen White, Glenn Williams, Limsoon Wong, Cathy Wu, Yu Xia, Dong Xu, Ying Xu, Chen-Hsiang Yeang, John Yin, Ping Zhan, Ge Zhang, Yingdong Zhao
CONTENTS

Preface

HUMAN GENOME VARIATION: DISEASE, DRUG RESPONSE, AND CLINICAL PHENOTYPES

Session Introduction
    I. Kohane, C. Stephens, J. Schneider, and F. De La Vega
A Stability Based Method for Discovering Structure in Clustered Data
    A. Ben-Hur, A. Elisseeff, and I. Guyon
Singular Value Decomposition Regression Models for Classification of Tumors from Microarray Experiments
    D. Ghosh
An Automated Computer System to Support Ultra High Throughput SNP Genotyping
    J. Heil, S. Glanowski, J. Scott, E. Winn-Deen, I. McMullen, L. Wu, C. Gire, and A. Sprague
Inferring Genotype from Clinical Phenotype through a Knowledge Based Algorithm
    B.A. Malin and L.A. Sweeney
A Cellular Automata Approach to Detecting Interactions Among Single-nucleotide Polymorphisms in Complex Multifactorial Diseases
    J.H. Moore and L.W. Hahn
Ontology Development for a Pharmacogenetics Knowledge Base
    D.E. Oliver, D.L. Rubin, J.M. Stuart, M. Hewett, T.E. Klein, and R.B. Altman
A SOFM Approach to Predicting HIV Drug Resistance
    R.B. Potter and S. Draghici
Automating Data Acquisition into Ontologies from Pharmacogenetics Relational Data Sources Using Declarative Object Definitions and XML
    D.L. Rubin, M. Hewett, D.E. Oliver, T.E. Klein, and R.B. Altman
On a Family-Based Haplotype Pattern Mining Method for Linkage Disequilibrium Mapping
    S. Zhang, K. Zhang, J. Li, and H. Zhao

GENOME-WIDE ANALYSIS AND COMPARATIVE GENOMICS

Session Introduction
    I. Dubchak, L. Pachter, and L. Wei
Scoring Pairwise Genomic Sequence Alignments
    F. Chiaromonte, V.B. Yap, and W. Miller
Structure-Based Comparison of Four Eukaryotic Genomes
    M. Cline, G. Liu, A.E. Loraine, R. Shigeta, J. Cheng, G. Mei, D. Kulp, and M.A. Siani-Rose
Constructing Comparative Genome Maps with Unresolved Marker Order
    D. Goldberg, S. McCouch, and J. Kleinberg
Representation and Processing of Complex DNA Spatial Architecture and its Annotated Genomic Content
    R. Gherbi and J. Herisson
Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars
    I. Holmes and G.M. Rubin
Estimation of Genetic Networks and Functional Structures Between Genes by Using Bayesian Networks and Nonparametric Regression
    S. Imoto, T. Goto and S. Miyano
Automatic Annotation of Genomic Regulatory Sequences by Searching for Composite Clusters
    O.V. Kel-Margoulis, T.G. Ivanova, E. Wingender, and A.E. Kel
EULER-PCR: Finishing Experiments for Repeat Resolution
    Z. Mulyukov and P.A. Pevzner
The Accuracy of Fast Phylogenetic Methods for Large Datasets
    L. Nakhleh, B.M.E. Moret, U. Roshan, K. St. John, J. Sun, and T. Warnow
Pre-mRNA Secondary Structure Prediction Aids Splice Site Prediction
    D.J. Patterson, K. Yasuhara, and W.L. Ruzzo
Finding Weak Motifs in DNA Sequences
    S.-H. Sze, M.S. Gelfand, and P.A. Pevzner
Evidence for Sequence-Independent Evolutionary Traces in Genomics Data
    W. Volkmuth and N. Alexandrov
Multiple Genome Rearrangement by Reversals
    S. Wu and X. Gu
High Speed Homology Search with FPGAs
    Y. Yamaguchi, T. Maruyama, and A. Konagaya

EXPANDING PROTEOMICS TO GLYCOBIOLOGY

Session Introduction
    C.-W. von der Lieth
Glycosylation of Proteins: A Computer Based Method for the Rapid Exploration of Conformational Space of N-Glycans
    A. Bohne and C.-W. von der Lieth
Data Standardisation in GlycoSuiteDB
    C.A. Cooper, M.J. Harrison, J.M. Webster, M.R. Wilkins, and N.H. Packer
Prediction of Glycosylation Across the Human Proteome and the Correlation to Protein Function
    R. Gupta and S. Brunak

LITERATURE DATA MINING FOR BIOLOGY

Session Introduction
    L. Hirschman, J.C. Park, J. Tsujii, C. Wu, and L. Wong
Mining MEDLINE: Abstracts, Sentences, or Phrases?
    J. Ding, D. Berleant, D. Nettleton, and E. Wurtele
Creating Knowledge Repositories from Biomedical Reports: The MEDSYNDIKATE Text Mining System
    U. Hahn, M. Romacker, and S. Schulz
Filling Preposition-Based Templates to Capture Information from Medical Abstracts
    G. Leroy and H. Chen
Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations
    J. Pustejovsky, J. Castano, J. Zhang, M. Kotecki, and B. Cochran
Predicting the Sub-Cellular Location of Proteins from Text Using Support Vector Machines
    B.J. Stapley, L.A. Kelley, and M.J.E. Sternberg
A Thematic Analysis of the AIDS Literature
    W.J. Wilbur

GENOME, PATHWAY AND INTERACTION BIOINFORMATICS

Session Introduction
    P. Karp, P. Romero, and E. Neumann
Pathway Logic: Symbolic Analysis of Biological Signaling
    S. Eker, M. Knapp, K. Laderoute, P. Lincoln, J. Meseguer, and K. Sonmez
Towards the Prediction of Complete Protein-Protein Interaction Networks
    S.M. Gomez and A. Rzhetsky
Identifying Muscle Regulatory Elements and Genes in the Nematode Caenorhabditis Elegans
    D. Guhathakurta, L.A. Schriefer, M.C. Hresko, R.H. Waterston, and G.D. Stormo
Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models
    A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, and R.A. Young
The ERATO Systems Biology Workbench: Enabling Interaction and Exchange Between Software Tools for Computational Biology
    M. Hucka, A. Finney, H.M. Sauro, H. Bolouri, J. Doyle, and H. Kitano
Genome-Wide Pathway Analysis and Visualization Using Gene Expression Data
    M.P. Kurhekar, S. Adak, S. Jhunjhunwala, and K. Raghupathy
Exploring Gene Expression Data with Class Scores
    P. Pavlidis, D.P. Lewis, and W.S. Noble
Guiding Revision of Regulatory Models with Expression Data
    J. Shrager, P. Langley, and A. Pohorille
Discovery of Causal Relationships in a Gene-Regulation Pathway from a Mixture of Experimental and Observational DNA Microarray Data
    C. Yoo, V. Thorsson, and G.F. Cooper

PHYLOGENETIC GENOMICS AND GENOMIC PHYLOGENETICS

Session Introduction
    S. Stanley and B.A. Salisbury
Shallow Genomics, Phylogenetics, and Evolution in the Family Drosophilidae
    M. Zilversmit, P. O'Grady, and R. DeSalle
Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study
    L.-S. Wang, R.K. Jansen, B.M.E. Moret, L.A. Raubeson, and T. Warnow
Vertebrate Phylogenomics: Reconciled Trees and Gene Duplications
    R.D.M. Page and J.A. Cotton

PROTEINS: STRUCTURE, FUNCTION AND EVOLUTION

Session Introduction
    P. Clote, G.J.P. Naylor, and Z. Yang
Screened Charge Electrostatic Model in Protein-Protein Docking Simulations
    J. Fernandez-Recio, M. Totrov, and R. Abagyan
The Spectrum Kernel: A String Kernel for SVM Protein Classification
    C. Leslie, E. Eskin, and W.S. Noble
Detecting Positively Selected Amino Acid Sites Using Posterior Predictive P-Values
    R. Nielsen and J.P. Huelsenbeck
Improving Sequence Alignments For Intrinsically Disordered Proteins
    P. Radivojac, Z. Obradovic, C.J. Brown, and A.K. Dunker
ab initio Folding of Multiple-Chain Proteins
    J.A. Saunders, K.D. Gibson, and H.A. Scheraga
Investigating Evolutionary Lines of Least Resistance Using the Inverse Protein-Folding Problem
    J. Schonfeld, O. Eulenstein, K. Vander Velden, and G.J.P. Naylor
Using Evolutionary Methods to Study G-Protein Coupled Receptors
    O. Soyer, M.W. Dimmic, R.R. Neubig, and R.A. Goldstein
Progress in Predicting Protein Function from Structure: Unique Features of O-Glycosidases
    E.W. Stawiski, Y. Mandel-Gutfreund, A.C. Lowenthal, and L.M. Gregoret
Support Vector Machine Prediction of Signal Peptide Cleavage Site Using a New Class of Kernels for Strings
    J.-P. Vert
Constraint-Based Hydrophobic Core Construction for Protein Structure Prediction in the Face-Centered-Cubic Lattice
    S. Will
Detecting Native Protein Folds Among Large Decoy Sets with Hydrophobic Moment Profiling
    R. Zhou and B.D. Silverman
Session Introductions and Peer Reviewed Papers
HUMAN GENOME VARIATION: DISEASE, DRUG RESPONSE, AND CLINICAL PHENOTYPES

FRANCISCO M. DE LA VEGA
Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA

ISAAC S. KOHANE
Children's Hospital Informatics Program & Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA

JULIE A. SCHNEIDER and J. CLAIBORNE STEPHENS
Genaissance Pharmaceuticals, Inc., Five Science Park, New Haven, CT 06511, USA

With the completion of a rough draft of the human genome sequence in sight, researchers are shifting to leverage this new information in the elucidation of the genetic basis of disease susceptibility and drug response. Massive genotyping and gene expression profiling studies are being planned and carried out by both academic/public institutions and industry. Researchers from different disciplines are all interested in mining the data coming from those studies: human geneticists, population geneticists, molecular biologists, computational biologists and even clinical practitioners. These communities have different immediate goals, but at the end of the day what is sought is analogous: the connection between variation in a group of genes or in their expression and observed phenotypes. There is an imminent need to link information across the huge data sets these groups are producing independently. However, there are tremendous challenges in the integration of polymorphism and gene expression databases and their clinical phenotypic annotation.

This is the third session devoted to the computational challenges of human genome variation studies held at the Pacific Symposium on Biocomputing [1,2]. The focus of the session has been the presentation and discussion of new research that promises to facilitate the elucidation of the connections between genotypes and phenotypes using the data generated by high-throughput technologies. Nine accepted manuscripts comprise this year's original work presented at the conference.
A major incentive for collecting genetic variation data is to use this information to identify genomic regions that influence disease susceptibility or drug response. In this volume, Zhang et al. outline a new approach to identify clinically relevant genes that produce quantitative phenotypes. Although similar methods have been developed to measure the strength of association between haplotypes and binary (case-control) data, Zhang et al.'s method is particularly valuable because many important clinical phenotypes display quantitative inheritance. On the other hand, the manuscript of Moore and Hahn introduces a novel computational approach using cellular automata (CA) and parallel genetic algorithms to identify combinations of SNPs associated with clinical outcomes. They use a simulated dataset of a discordant sib-pair study design to demonstrate that the CA approach has good power to identify high-order nonlinear interactions with few false positives. Given the current uncertainties about the genetic architecture underlying complex disease [5], it is critical to develop new approaches, such as the CA advanced by the authors, that can test for association in the presence of allelic heterogeneity [6] and epistatic interactions between loci.

Large quantities of DNA sequence variation data are needed to better understand the contribution of genetics to human disease, drug response, and clinical phenotypes. In order to ensure the quality of these data, fully automated genotyping processes are required: from assay design, assay validation, assay interpretation, and quality control to data management and release. One of the major challenges involved in developing a streamlined, high-throughput genotyping process is creating appropriate software to support the system. In their conference paper, Heil et al. describe the components of a successful, ultra high-throughput genotyping process developed at Celera Genomics. Their approach could be an excellent starting point for those involved in developing similar infrastructures elsewhere.

How to properly store and combine complex biological data is an extremely important subject in the post-genome era. Among the challenges in developing an efficient data or knowledge base are the diversity of semantics, potential uses, and data sources.
Ontologies have been successfully applied in the past to develop knowledge base systems to store complex data, such as the Gene Ontology for gene annotations [3], and RiboWeb [4] for capturing experimental results in scientific literature. The contributions of Rubin et al. and Oliver et al. to this conference present a successful application of ontologies to genotype-phenotype data in relation to clinical drug response. The approach used in "PharmGKB" presented by the authors addresses many of the complex problems arising when retrieving data from diverse genomics and clinical databases, and when updating links to external database domains. Their methodology may be very helpful for making the diverse genomics data better suited for scientific analysis.

Molecular profiling is a tool that is gaining acceptance to classify tissue samples and other clinical outcomes based on gene and potentially protein expression profiles. Its accuracy depends on the appropriate analysis of the resulting datasets, and typically involves multivariate statistics and other machine learning techniques. The paper of Ben-Hur et al. describes an algorithm to investigate the stability of the solutions of clustering algorithms. The authors apply their method to the hierarchical clustering of microarray and synthetic data. On the other hand, Ghosh applies a regression analysis to data that has been first
transformed by Singular Value Decomposition (SVD), for uncovering possible relations between microarray expression data of tumor samples and tumor diagnosis. The problem is a novel application for SVD, which has been recently applied to microarray data in a different but complementary approach. The paper of Potter and Draghici addresses a clinically important problem: classification of HIV protease's resistance to IC90 drug solely from protein sequences. Their contribution shows that improved accuracy can be achieved by combining SOFM classifiers.

As high-throughput genotyping and expression-measurement methodologies are applied to large populations, the opportunity now exists to use existing clinical phenotypic annotations (i.e., the extended medical record) in the analysis of the relationship between genotype/haplotype variation and phenotype. Typically, however, the forward link is sought, leading from genetic variation data to the inference of clinical phenotypes. The paper of Malin and Sweeney in this volume offers instead a reverse approach, allowing the inference of genetic variability data based on clinical phenotypes. In this unusual approach, clinical/hospital/claims data is brought together with phenotype/genotype data through the use of machine learning techniques to predict the underlying genotype.

Acknowledgments

We would like to acknowledge the generous help of the anonymous reviewers who supported the selection process for this session, as well as the panelists who joined us to discuss the challenges in this field.

References

1. F.M. De La Vega and M. Kreitman. "Human genome variation" In: Pacific Symposium on Biocomputing 2000, R.B. Altman et al. (Eds.). World Scientific Press, Singapore (2000).
2. F.M. De La Vega, M. Kreitman, and I.S. Kohane. "Human genome variation: Linking genotypes to clinical phenotypes" In: Pacific Symposium on Biocomputing 2001, R.B. Altman et al. (Eds.). World Scientific Press, Singapore (2001).
3. The Gene Ontology Consortium. "Creating the gene ontology resource: design and implementation" Genome Res. 11(8), 1425-1433 (2001).
4. R.O. Chen, R. Feliciano, R.B. Altman. "RIBOWEB: linking structural computations to a knowledge base of published experimental data" Proc Int Conf Intell Syst Mol Biol 5, 84-87 (1997).
5. A.F. Wright and N.D. Hastie. "Complex genetic diseases: controversy over the Croesus code" Genome Biology 2(8), comment 2007.1-2007.8 (2001).
6. J.K. Pritchard. "Are Rare Variants Responsible for Susceptibility to Complex Diseases?" Am. J. Hum. Genet. 69, 124-137 (2001).
A stability based method for discovering structure in clustered data

Asa Ben-Hur*, Andre Elisseeff* and Isabelle Guyon*
BioWulf Technologies LLC
*2030 Addison St., Suite 102, Berkeley, CA 94704
†305 Broadway (9th Floor), New York, NY 10007

Abstract

We present a method for visually and quantitatively assessing the presence of structure in clustered data. The method exploits measurements of the stability of clustering solutions obtained by perturbing the data set. Stability is characterized by the distribution of pairwise similarities between clusterings obtained from sub-samples of the data. High pairwise similarities indicate a stable clustering pattern. The method can be used with any clustering algorithm; it provides a means of rationally defining an optimum number of clusters, and can also detect the lack of structure in data. We show results on artificial and microarray data using a hierarchical clustering algorithm.

1 Introduction
Clustering is widely used in exploratory analysis of biological data. With the advent of new biological assays such as DNA microarrays that allow the simultaneous recording of tens of thousands of variables, it has become more important than ever to have powerful tools for data visualization and analysis. Clustering, and particularly hierarchical clustering, plays an important role in this process [1,2,3]. Clustering provides a way of validating the quality of the data by verifying that groups form according to the prior knowledge one has about sample categories. It also provides a means of discovering new natural groupings [4]. Yet there is no generally agreed upon definition of what is a "natural grouping." In this paper we propose a method of detecting the presence of clusters in data that can serve as the basis of such a definition. It can be combined with any clustering algorithm, but proves to be particularly useful in conjunction with hierarchical clustering algorithms.

The method we propose in this paper is based on the stability of clustering with respect to perturbations such as sub-sampling or the addition of noise. Stability can be considered an important property of a clustering solution, since data, and gene expression data in particular, is noisy. Thus we suggest stability as a means for defining meaningful partitions. The idea of using stability to evaluate clustering solutions is not new. In the context of hierarchical clustering, some authors have considered the stability of the whole hierarchy [5]. However, our experience indicates that in most real world cases the complete dendrogram is rarely stable. The stability of partitions has also been addressed [6,7,8]. In this model, a figure of merit is assigned to a partition
of the data according to the average similarity of the partition to a set of partitions obtained by clustering a perturbed dataset. The optimal number of clusters (or other parameter employed by the algorithm) is then determined by the maximum value of the average similarity. However, we observed in several practical instances that considering the average, rather than the complete distribution, was insufficient. The distribution can be used both as a tool to visually probe the structure in the data, and to provide a criterion for choosing an optimal partition of the data: plotting the distribution for various numbers of clusters reveals a transition from a distribution of similarities that is concentrated near 1 (most solutions highly similar) to a wider distribution. In the examples we studied, the value of the number of clusters at which this transition occurs agrees with the intuitive choice of the number of clusters. We have developed a heuristic for comparing partitions across different levels of the dendrogram that makes this transition more pronounced. The method is useful not only in choosing the number of clusters, but also as a general tool for making choices regarding other components of the clustering algorithm. We have applied it in choosing the type of normalization and the number of leading principal components [9].

Many methods for selecting an optimum number of clusters can be found in the literature. In this paper we report results that show that our method performs well when compared with some of the more successful methods reported in recent surveys [10,11]. This may be explained by the fact that our method does not make assumptions about the distribution of the data or about cluster shape, as most other methods do [11,10]; only our method and the gap statistic can detect the absence of structure.
Our method has advantages over information-theoretic criteria based on compression efficiency considerations and over related Bayesian criteria [12] in that it is model free and works with any clustering algorithm. Some clustering algorithms have been claimed to generate only meaningful partitions, and so do not require our method for this purpose [4,13]. We also mention the method of Yeung et al. [14] for assessing the relative merit of different clustering solutions. They tested their method on microarray data; however, they do not give a way of selecting an optimal number of clusters, so no direct comparison can be made.

The paper is organized as follows: in Section 2 we introduce the dot product between partitions and express several similarity measures in terms of this dot product. In Section 3 we present our practical algorithm. Section 4 is devoted to experimental results of using the algorithm. This is followed by a discussion and conclusions.

2 Clustering similarity measures
In this section we present several similarity measures between partitions found in the literature [15,7], and express them with the help of a dot product. We begin by reviewing our notation. Let $X = \{x_1, \ldots, x_n\}$, with $x_i \in \mathbb{R}^d$, be the dataset to be clustered. A labeling $\mathcal{L}$ is a partition of $X$ into $k$ subsets $S_1, \ldots, S_k$. We use the following representation of a labeling by a matrix $C$ with components:

$$C_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ belong to the same cluster and } i \neq j, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
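Let labelings $\mathcal{L}_1$ and $\mathcal{L}_2$ have matrix representations $C^{(1)}$ and $C^{(2)}$, respectively. We define the dot product

$$\langle \mathcal{L}_1, \mathcal{L}_2 \rangle = \langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}. \tag{2}$$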
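This dot product computes the number of pairs of points clustered together, and can also be interpreted as the number of common edges in graphs represented by $C^{(1)}$ and $C^{(2)}$; we note that it can be computed in $O(k_1 k_2 n)$. As a dot product, $\langle \mathcal{L}_1, \mathcal{L}_2 \rangle$ satisfies the Cauchy-Schwarz inequality: $\langle \mathcal{L}_1, \mathcal{L}_2 \rangle \le \sqrt{\langle \mathcal{L}_1, \mathcal{L}_1 \rangle \langle \mathcal{L}_2, \mathcal{L}_2 \rangle}$, and thus can be normalized into a correlation or cosine similarity measure:

$$\mathrm{cor}(\mathcal{L}_1, \mathcal{L}_2) = \frac{\langle \mathcal{L}_1, \mathcal{L}_2 \rangle}{\sqrt{\langle \mathcal{L}_1, \mathcal{L}_1 \rangle \langle \mathcal{L}_2, \mathcal{L}_2 \rangle}}.$$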
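This similarity measure was introduced by Fowlkes and Mallows [7]. Next, we show that two commonly used similarity measures can be expressed in terms of the dot product defined above. Given two matrices $C^{(1)}, C^{(2)}$ with 0-1 entries, let $N_{ij}$ for $i, j \in \{0, 1\}$ be the number of entries on which $C^{(1)}$ and $C^{(2)}$ have values $i$ and $j$, respectively. The matching coefficient [15] is defined as the fraction of entries on which the two matrices agree:

$$M(\mathcal{L}_1, \mathcal{L}_2) = \frac{N_{00} + N_{11}}{N_{00} + N_{01} + N_{10} + N_{11}}.$$

The Jaccard coefficient is a similar ratio when "negative" matches are ignored:

$$J(\mathcal{L}_1, \mathcal{L}_2) = \frac{N_{11}}{N_{01} + N_{10} + N_{11}}.$$

The matching coefficient often varies over a smaller range than the Jaccard coefficient since the $N_{00}$ term is usually a dominant factor. These similarity measures can be expressed in terms of the labeling dot product and the associated norm:

$$J(\mathcal{L}_1, \mathcal{L}_2) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\langle C^{(1)}, C^{(1)} \rangle + \langle C^{(2)}, C^{(2)} \rangle - \langle C^{(1)}, C^{(2)} \rangle},$$

$$M(\mathcal{L}_1, \mathcal{L}_2) = 1 - \frac{1}{n^2} \left\| C^{(1)} - C^{(2)} \right\|^2.$$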
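The three measures of this section can be tried on small labelings. The following is a minimal sketch, not the authors' implementation: it assumes labelings are given as lists of cluster ids over the same points, builds the co-membership matrix directly (quadratic in the number of points, rather than the faster computation mentioned above), and the function names are illustrative only.

```python
import math


def comembership(labels):
    """Matrix C with C[i][j] = 1 if points i and j share a cluster
    and i != j, and 0 otherwise."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def dot(C1, C2):
    """Labeling dot product: sum over all entries of C1[i][j] * C2[i][j]."""
    return sum(a * b for row1, row2 in zip(C1, C2) for a, b in zip(row1, row2))


def correlation(labels1, labels2):
    """Normalized dot product (cosine similarity of the two matrices).
    Assumes each labeling has at least one cluster with two or more points;
    otherwise the denominator is zero."""
    C1, C2 = comembership(labels1), comembership(labels2)
    return dot(C1, C2) / math.sqrt(dot(C1, C1) * dot(C2, C2))


def jaccard(labels1, labels2):
    """Jaccard coefficient written with dot products only."""
    C1, C2 = comembership(labels1), comembership(labels2)
    d12 = dot(C1, C2)
    return d12 / (dot(C1, C1) + dot(C2, C2) - d12)


def matching(labels1, labels2):
    """Matching coefficient: 1 - ||C1 - C2||^2 / n^2."""
    C1, C2 = comembership(labels1), comembership(labels2)
    n = len(labels1)
    diff = sum((a - b) ** 2
               for row1, row2 in zip(C1, C2) for a, b in zip(row1, row2))
    return 1 - diff / n ** 2


if __name__ == "__main__":
    a = [0, 0, 1, 1]  # clusters {0, 1} and {2, 3}
    b = [0, 0, 0, 1]  # clusters {0, 1, 2} and {3}
    print(correlation(a, b), jaccard(a, b), matching(a, b))
```

Because the measures depend only on which pairs of points are co-clustered, relabeling the clusters (for example swapping ids 0 and 1) leaves all three values unchanged.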
Figure 1: Two 250 point sub-samples of a 400 point Gaussian mixture.
This is a result of the observation that $N_{11} = \langle C^{(1)}, C^{(2)} \rangle$, $N_{01} = \langle 1_n - C^{(1)}, C^{(2)} \rangle$, $N_{10} = \langle C^{(1)}, 1_n - C^{(2)} \rangle$, $N_{00} = \langle 1_n - C^{(1)}, 1_n - C^{(2)} \rangle$, where $1_n$ is an $n \times n$ matrix with entries equal to 1. The above expression for the Jaccard coefficient shows that it is close to the correlation similarity measure, as we have observed in practice.

3 The model explorer algorithm
When one looks at two sub-samples of a cloud of data points, with a sampling ratio $f$ (fraction of points sampled) not much smaller than 1 (say $f > 0.5$), one usually observes the same general structure (Figure 1). Thus it is reasonable to postulate that a partition into k clusters has captured the "inherent" structure in a dataset if partitions into k clusters obtained from running the clustering algorithm with different sub-samples are similar, i.e. close in structure according to one of the similarity measures introduced in the previous section. "Inherent" structure is thus structure that is stable with respect to sub-sampling. We cast this reasoning into the problem of finding the optimal number of clusters for a given dataset and clustering algorithm: look for the largest k such that partitions into k clusters are stable. Note that rather than choosing just the number of clusters, one can extend the scope of the search to a set of variables where structure is most apparent, i.e. stable. This is performed elsewhere [9].

We consider a generic clustering algorithm that receives as input a dataset (or similarity/dissimilarity matrix) and a parameter k that controls either directly or indirectly the number of clusters that the algorithm produces. This input convention is applicable to hierarchical clustering algorithms: given k, cut the tree so that k clusters are produced. We want to characterize the stability for each value of k. This is accomplished by clustering sub-samples of the data, and then computing the similarity between pairs of sub-samples according to the similarity between the labels of the points common to both sub-samples. The result is a distribution of similarities for each k. The algorithm is presented in Figure 2. The distribution of the similarities is then compared for different values of k
Input: X {a dataset}, k_max {maximum number of clusters}, num_subsamples {number of subsamples}
Output: S(i, k) {list of similarities for each k and each pair of sub-samples}
Require: A clustering algorithm: cluster(X, k); a similarity measure between labels: s(L1, L2)
f = 0.8
for k = 2 to k_max do
  for i = 1 to num_subsamples do
    sub1 = subsamp(X, f) {a sub-sample with a fraction f of the data}
    sub2 = subsamp(X, f)
    L1 = cluster(sub1, k)
    L2 = cluster(sub2, k)
    Intersect = sub1 ∩ sub2
    S(i, k) = s(L1(Intersect), L2(Intersect)) {Compute the similarity on the points common to both subsamples}
  end for
end for
Figure 2: The Model explorer algorithm.
(Figure 3). In our numerical experiments (Section 4) we found that, indeed, when the structure in the data is captured by a partition into k clusters, many sub-samples have similar clustering, and the distribution of similarities is concentrated close to 1.
Remark 3.1 For the trivial case k = 1, all clusterings are the same, so there is no need for any computation in this case. In addition, the value of f should not be too low; otherwise not all clusters are represented in a sub-sample. In our experiments the shape of the distribution of similarities did not depend very much on the specific value of f.
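The procedure of Figure 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the clustering routine is passed in as a function, the similarity is a pair-counting Jaccard coefficient (any of the similarity measures could be substituted), and `gap_cluster` is a hypothetical toy clusterer for one-dimensional data that exists only to make the example runnable.

```python
import random


def jaccard(l1, l2):
    """Pair-counting Jaccard similarity of two labelings of the same points."""
    n11 = n10 = n01 = 0
    for i in range(len(l1)):
        for j in range(i + 1, len(l1)):
            same1, same2 = l1[i] == l1[j], l2[i] == l2[j]
            n11 += same1 and same2
            n10 += same1 and not same2
            n01 += same2 and not same1
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0


def model_explorer(data, cluster, k_max, num_subsamples=100, f=0.8, seed=0):
    """Return {k: [similarity for each pair of sub-samples]} for k = 2..k_max."""
    rng = random.Random(seed)
    n = len(data)
    m = int(round(f * n))  # size of each sub-sample
    sims = {}
    for k in range(2, k_max + 1):
        sims[k] = []
        for _ in range(num_subsamples):
            sub1 = rng.sample(range(n), m)
            sub2 = rng.sample(range(n), m)
            lab1 = dict(zip(sub1, cluster([data[i] for i in sub1], k)))
            lab2 = dict(zip(sub2, cluster([data[i] for i in sub2], k)))
            common = sorted(set(sub1) & set(sub2))  # points in both sub-samples
            sims[k].append(jaccard([lab1[i] for i in common],
                                   [lab2[i] for i in common]))
    return sims


def gap_cluster(points, k):
    """Toy 1-D clusterer (hypothetical): cut the sorted values at the k-1 largest gaps."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    ranks = sorted(range(1, len(order)),
                   key=lambda r: points[order[r]] - points[order[r - 1]],
                   reverse=True)[:k - 1]
    cuts = set(ranks)
    labels = [0] * len(points)
    current = 0
    for r, i in enumerate(order):
        if r in cuts:
            current += 1
        labels[i] = current
    return labels
```

For data with two well-separated groups, the similarities returned for k = 2 should all be close to 1, while larger k typically gives a wider spread.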
4 Experiments
In this section we describe experiments on artificial and real data. We chose to use data where the number of clusters is apparent, so that one can be convinced of the performance of the algorithm. In all the experiments we show the distribution of the correlation score; equivalent results were obtained using other scores as well. The sampling ratio, f, was 0.8 and the number of pairs of solutions compared for each k was 100. As a clustering algorithm we use the average-link hierarchical clustering algorithm.15 The advantage of using a hierarchical clustering method is that the same
Figure 3: Left: histogram of the correlation similarity measure; right: overlay of the cumulative distributions for increasing values of k.
set of trees can be used for all values of k, by looking at different levels of the tree each time. To tackle the problem of outliers, we cut the tree such that there are k clusters, each of them not a singleton (thus the total number of clusters can be higher than k). This is extended to consider partitions that contain k clusters, each of them larger than some threshold. This helps enhance the stability in the case of a good value of k, and de-stabilizes clustering solutions for higher k, making the transition from highly similar solutions to a wide distribution of similarities more pronounced.
We begin with the data depicted in Figure 1, which is a mixture of four Gaussians. The histogram of the score for varying values of k is plotted in Figure 3. We make several observations regarding the histogram. At k = 2 it is concentrated at 1, since almost all the runs discriminated between the two upper and two lower clusters. At k = 3 most runs separate the two lower clusters, and at k = 4 most runs found the "correct" clustering, which is reflected in the distribution of scores still concentrated near 1. For k > 4 there is no longer one preferred solution, as is seen by the wide spectrum of similarities. We remark that if the clusters were well separated, or the clusters arranged more symmetrically, there would not have been a preferred way of clustering into 2 or 3 clusters as is the case here; in that case the similarity for k = 2, 3 would have been low, and increased for k = 4. In such cases one often observes a bimodal distribution of similarities.
The next dataset we considered was the yeast DNA microarray data of Eisen et al.1 We used the MYGD functional annotation to choose the 5 functional classes that were most learnable by SVMs,16 and that were noted by Eisen et al. to cluster well.1 We looked at the genes that belong uniquely to these 5 functional classes. This gave a dataset with 208 genes and 79 features (experiments) in the following classes: (1)
Figure 4: First three principal components of the yeast microarray data. The legend identifies the symbols that represent each functional class. Class number corresponds to the numbers given in the listing of the classes in the text.
Figure 5: Dendrogram for yeast microarray data. Numbers indicate the functional class represented by each cluster. The horizontal line represents the lowest level at which partitions are still highly stable.
TGTAAAACGACGGCCAGTAGGAGTATCTAGCCCAAGCAATA
Figure 3. Sample input data from Mayo test submission in XML format. This sample input shows the submission of a forward PCR primer used, and is stored as experimental data in PharmGKB.
A simple nucleotide difference (SND) defines a position in a region of interest of a reference sequence where bases differ from the bases in the corresponding location of a tested sequence. A SND is simple because the bases in the variant segment must be contiguous, rather than located in different parts of the genome. A SND differs from a single nucleotide polymorphism (SNP) in that there is no frequency restriction in the definition of a SND. In contrast, when scientists perform SNP detection assays, it is common practice to filter out SNPs that have allele frequencies that are less than some threshold percentage (e.g., 10 percent). Also, in the spirit of dbSNP, a SND can refer to an insertion, a deletion, or a variable number of repeats, as well as to a single nucleotide difference. The convention in PharmGKB for specifying where the difference is located is to identify the position in the reference sequence that precedes the variant site (the position upstream in the 5' direction). This approach provides a consistent method for describing variant positions across all polymorphism types. Figure 3 shows a representative sample of data from the Mayo test submission. It shows the submission of a forward PCR primer used in an experiment. Annealing positions are based on a numbering scheme that was previously specified in a sequence coordinate system for the reference sequence.
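The "preceding position" convention can be illustrated with a short sketch. It assumes the reference and tested sequences are already aligned to the same length with '-' as the gap character and that reference positions are numbered from 1; the function name and the tuple layout are hypothetical and do not reflect the PharmGKB data model.

```python
def find_snds(ref_aligned, test_aligned):
    """Return a list of (preceding_ref_position, ref_bases, test_bases).

    Each entry describes one contiguous run of differing columns. The position
    reported is that of the last reference base before the run, so substitutions,
    insertions and deletions are all located the same way.
    """
    if len(ref_aligned) != len(test_aligned):
        raise ValueError("sequences must be aligned to the same length")
    snds = []
    ref_pos = 0  # 1-based position of the last reference base consumed
    i, n = 0, len(ref_aligned)
    while i < n:
        if ref_aligned[i] == test_aligned[i]:
            if ref_aligned[i] != "-":
                ref_pos += 1
            i += 1
            continue
        start_pos = ref_pos  # position preceding the variant site
        j = i
        while j < n and ref_aligned[j] != test_aligned[j]:
            if ref_aligned[j] != "-":
                ref_pos += 1
            j += 1
        snds.append((start_pos,
                     ref_aligned[i:j].replace("-", ""),
                     test_aligned[i:j].replace("-", "")))
        i = j
    return snds
```

With this convention an insertion after reference base 3 and a substitution of reference base 4 are both reported at position 3, which is what makes the description uniform across polymorphism types.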
6 Concluding Remarks and Future Work
Given the potential impact of pharmacogenetic research and the vast quantities of data that are likely to result from efforts to link genotype to phenotype, the NIH has begun a program that encourages collaboration among investigators and that mandates public sharing of data. The value of PharmGKB as a resource for sharing pharmacogenetic experimental data and knowledge lies not only in its commitment to public dissemination of data, but also in its demonstration of the use of knowledge representation techniques to organize pharmacogenetic knowledge and data. There is currently no standard data model for pharmacogenetic knowledge, and without standards for names and meanings of terms, it is difficult to share information in computer-based systems. Thus, the ontology effort is essential to the
success of this project, and may contribute to ontology development done by others who work in this area. Our ontology development process is a process of iterative development and communication between bioinformatics professionals and other collaborators, including molecular biologists, chemists, clinical pharmacologists, and clinicians. Our bottom-up approach to modeling experimental data allows us to take a staged-delivery approach in software development. We can provide software that is usable to a few groups initially, and then extend it in a controlled fashion. However, our top-down approach to knowledge modeling also encourages us to consider the broader picture in the early stages. Our ontology comprises the data model for experimental data, and the domain conceptual knowledge that provides controlled-vocabulary information and other knowledge that supports queries. These two parts are integrated in PharmGKB, but it is useful to distinguish them because the former is essential for communication with our collaborators who submit data, and the latter is essential for management of shared concepts in the system. Together, these two parts form the ontology that may be reusable in other settings in the field of pharmacogenetics. Future work on the PharmGKB ontology includes (1) expansion of content to broaden the scope, (2) enhancement of constraint representation in the ontology to support automated or semi-automated data validation, (3) extension of change logging features to facilitate change management, (4) development of merging techniques to support the process of merging the production version of the knowledge base with the development version when a new version is released, and (5) enhancement of methods that help users to query PharmGKB in an intuitive manner to obtain genotype-phenotype associations.
Acknowledgements
PharmGKB is financially supported by grants from the National Institute of General Medical Sciences (NIGMS), the National Human Genome Research Institute (NHGRI) and the National Library of Medicine (NLM) within the National Institutes of Health (NIH). This work is supported by the NIH/NIGMS Pharmacogenetics Research Network and Database grant U01GM61374, and by Stanford University's Children's Health Initiative. JMS is supported by National Library of Medicine grant LM07033.
References
1. Long RM, Giacomini KM. Announcement. June 1, 2001 http://www.nigms.nih.gov/pharmacogenetics/editors.html
2. RFA GM-00-003, April 7, 2000 http://grants.nih.gov/grants/guide/rfa-files/RFA-GM-00-003.html
3. MA Rothstein, PG Epps, "Ethical and legal implications of pharmacogenomics" Nature Reviews Genetics 2, 228-231 (2001)
4. Webster's New Collegiate Dictionary, 9th edition, Ontology, p. 825 (Springfield, MA: Merriam-Webster, 1991)
5. N Guarino, "Formal ontology and information systems" Proceedings of FOIS '98, Trento, Italy, June 6-8, 1998. (Amsterdam, IOS Press, 1998) pp. 3-15
6. C Price, M O'Neil, TE Bentley, PJB Brown, "Exploring the ontology of surgical procedures in the Read Thesaurus" Methods of Information in Medicine 37, 420-5 (1998)
7. The Gene Ontology Consortium, "Gene ontology: tool for the unification of biology" Nature Genetics 25, 25-9 (2000)
8. D Fensel, "Ontologies and electronic commerce" IEEE Intelligent Systems January/February, 8 (2001)
9. "Medical Subject Headings" http://www.nlm.nih.gov/mesh/meshhome.html
10. MA Musen, "Domain ontologies in software engineering: Use of Protege with the EON architecture" Methods of Information in Medicine 37(4-5), 540-50 (1998)
11. "Welcome to the Protege project" http://www.smi.stanford.edu/projects/protege/
12. "PharmGKB Investigators" http://www.pharmgkb.org/investigators.html
13. "Query PharmGKB" http://www.pharmgkb.org/PharmGKB/query
14. "Pharmacogenetics Research Network and Knowledge Base First Annual Scientific Meeting" April 25, 2001 http://pub.nigms.nih.gov/pharmacogenetics
15. "Enzyme nomenclature" http://www.chem.qmw.ac.uk/iubmb/enzyme
16. "HUGO Gene Nomenclature Committee" http://www.gene.ucl.ac.uk/nomenclature
A SOFM Approach to Predicting HIV Drug Resistance
R. Brian Potter^a, Sorin Draghici
Department of Computer Science, Wayne State University, Detroit, MI 48202
The self-organizing feature map (SOFM or SOM) neural network approach has been applied to a number of life sciences problems. In this paper, we apply SOFMs in predicting the resistance of the HIV virus to Saquinavir, an approved protease inhibitor. We show that a SOFM predicts resistance to Saquinavir with reasonable success based solely on the amino acid sequence of the HIV protease mutation. The best single network provided 69% coverage and 68% accuracy. We then combine a number of networks into various majority voting schemes. All of the combinations showed improved performance over the best single network, with an average of 85% coverage and 78% accuracy. Future research objectives are suggested based on these results.
1 Introduction
1.1 Overview
The human immunodeficiency virus (HIV-1), the causative agent of acquired immune deficiency syndrome (AIDS), has been the subject of extensive research in recent years. A good, although somewhat dated, introduction to AIDS research is provided by Watson et al.1 HIV-1 infection has been approached via many treatment pathways. One of the first was the use of Azidothymidine (AZT) to inhibit the synthesis of the HIV provirus in vivo. Unfortunately, the HIV virus was able to mutate in order to resist AZT, eventually overcoming its therapeutic benefits. Two other popular methods of treating the HIV virus are attacking the reverse transcriptase responsible for synthesizing the DNA provirus from the retroviral RNA, and inhibiting the HIV protease responsible for splicing the primary polyproteins produced by the HIV virus into the active proteins necessary for its replication. Both of these approaches also eventually fail due to mutation of the viral genome, leading to protease inhibitor resistant viral strains. Most current therapies involve combinations of drugs aimed at inhibition of both the reverse transcriptase and the protease.
Artificial neural network (ANN) based self-organizing maps were developed by Kohonen.2 SOFM algorithms belong to the unsupervised learning, competitive network class of ANNs. An input vector is introduced to the network, after which a winning neuron is determined and the weight vectors of all neurons within a specified neighborhood of the winning neuron are updated.3 In this way, SOFMs are useful for clustering related patterns together. When patterns in the training set are labelled, clusters containing these labelled patterns can then be used to identify unknown patterns.
This laboratory has previously applied SOFM clustering to the HIV drug resistance problem.4 Resistance to the protease inhibitor Indinavir was studied first by applying supervised learning techniques to protein structural data for various HIV protease mutants to predict Indinavir IC90 values. Only limited success was obtained, primarily due to an insufficient number of mutations with corresponding Indinavir IC90 values available from the literature with which to train the classifier. An SOFM was used to segment the same data into clusters of Indinavir-resistant mutants and non-resistant mutants based on structural features. We were able to divide all reported HIV mutants into several categories based on their 3-dimensional molecular structures and the pattern of contacts between the mutant protease and Indinavir. Our classifier shows reasonable prediction performance, being able to predict the drug resistance of previously unseen mutants with an accuracy of between 60% and 70%. We believe that this performance can be greatly improved once more data becomes available. The results support the hypothesis that structural features of the HIV protease can be used in antiviral drug treatment selection and drug design.
The goal of this research is to build a SOFM to predict the resistance of known mutations of HIV protease to Saquinavir, a protease inhibitor related to Indinavir that is also approved for use in the treatment of HIV infection. No attempt is made to understand the mechanism or reasons why certain mutations are or are not resistant to Saquinavir, only to predict such resistance based solely on the amino acid sequence of HIV protease mutants, a small number of which have reported Saquinavir IC90 values.
a Please send correspondence to this author at [email protected].
Our hope is that this early work will ultimately enable clinicians to prescribe HIV treatments based on drug resistance predictions.
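The winner-and-neighborhood update described in this overview can be sketched as follows. This is a generic illustration, not the network software used in the study: the Euclidean winner selection, the square (Chebyshev) neighborhood on the output grid, and the linear decay of the learning rate and neighborhood radius are all assumptions, as are the function names.

```python
import random


def winner(weights, x):
    """Grid coordinate of the neuron whose weight vector is closest to x (Euclidean)."""
    return min(weights,
               key=lambda rc: sum((w - xi) ** 2 for w, xi in zip(weights[rc], x)))


def train_sofm(patterns, rows, cols, lr=0.9, radius=None, iterations=10, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    radius = max(rows, cols) if radius is None else radius
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        frac = 1.0 - t / iterations          # assumed linear decay schedule
        cur_lr = lr * frac
        cur_radius = int(round(radius * frac))
        for x in patterns:
            win = winner(weights, x)
            for (r, c), w in weights.items():
                # Update every neuron within the current neighborhood of the winner.
                if max(abs(r - win[0]), abs(c - win[1])) <= cur_radius:
                    for d in range(dim):
                        w[d] += cur_lr * (x[d] - w[d])
    return weights
```

After training, each pattern is assigned to the cluster given by `winner(weights, pattern)`, so labelled patterns that fall on a neuron can be used to label that cluster.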
1.2 Related Work
Self-organizing maps have been used successfully in a wide variety of life science applications. Kaartinen et al. have successfully used a SOFM to discriminate between human blood plasma lipoprotein lipids (LDL and HDL cholesterol, triglycerides) and furthermore to cluster plasma samples into different lipoprotein lipid risk profiles.5 Makipaa et al. have applied SOFMs to the clustering and subsequent classification of blood glucose data from insulin-dependent diabetic patients.6 Santos-Andre and Roque da Silva combined a SOFM with a multi-layer perceptron to provide radiologists with a "second opinion" in the diagnosis of breast cancer.7 Christodoulou and Pattichis have developed medical diagnostic systems for the assessment of electromyographic (EMG) signals necessary for the diagnosis and monitoring of patients with neuromuscular disorders, and carotid plaques based on ultrasound images of patients with pulmonary disease. The systems comprised multiple SOFM classifiers whose results were combined using majority voting and SOFM-derived confidence measures.8,9 Finally, Golub et al. were able to distinguish between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) from SOFM clustering of gene expression monitoring data.10
2 Experimental Detail
2.1 Data Preparation
Only thirty-two patterns (HIV Protease mutants) were found in the literature with reported IC90 drug resistance values.11 These patterns were supplemented with 910 reported HIV protease mutants obtained from the Los Alamos National Laboratory HIV Sequence Database (http://hiv-web.lanl.gov/), along with the wild type HIV protease sequence. Netprep, a command line Java program, was written to convert the amino acid sequence of a protein or peptide segment (a string of alpha characters) into normalized numeric patterns suitable for input to a neural network. The input to Netprep is a file containing one peptide sequence per line, with each residue separated by a comma. The first pattern in the file is the wild type. For each residue, all of the patterns are compared to the wild type. Patterns that match the wild type at that residue are assigned a value of zero. Residues that differ from the wild type are ordered by frequency of occurrence. They are then assigned a value between 0 and 1 based on dividing (0,1] into n equal increments, where n is the number of different mutations from the wild type for that residue. For instance, if the wild type is V, and there are four mutations across all of the input patterns, say N, L, I, and A, N may be assigned a value of 1, L a value of .75, I a value of .5, and A a value of .25. Once these numeric assignments are made, each pattern is normalized and written to an output file. The researcher may optionally specify at runtime a percentage of the patterns to withhold from training. All the patterns are processed as described above, after which the appropriate number of patterns to be withheld are randomly selected and output to a separate holdout file. The remaining patterns are used as input to the neural network. For the research described in this paper, ten percent of the 911 unclassified patterns were withheld. The 32
patterns with resistance values were all used, as described in the next section. We were interested not in predicting the resistance of a particular mutant, but rather in classifying a mutant as having high, medium, or low resistance to saquinavir. We defined low resistance for a mutant as having less than a four-fold resistance to saquinavir as compared to the resistance of the wild type. High resistance was defined as greater than ten-fold resistance to saquinavir as compared to the resistance of the wild type. Having defined these cutoffs, twelve of the 32 patterns with IC90 values were classified as having low resistance, three with medium resistance (between 4- and 10-fold resistance), and the remaining patterns classified as exhibiting high resistance. The actual range of resistance values was from 0.33-fold to 269.33-fold (see Table 1).
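The numeric encoding performed by Netprep can be sketched in Python (Netprep itself is a Java program whose source is not shown here). Two points are assumptions made for illustration: the most frequent mutation at a residue receives the value 1, and "normalized" means scaling each pattern to unit Euclidean length. The function names are hypothetical.

```python
import math
from collections import Counter


def parse_line(line):
    """One peptide sequence per line, residues separated by commas."""
    return line.strip().split(",")


def residue_values(column, wild_type):
    """Map each residue seen at one position to a value in [0, 1].

    The wild-type residue maps to 0. The n distinct mutations are ranked by
    frequency and mapped to n/n, (n-1)/n, ..., 1/n (most frequent first, an
    assumption about the ordering direction).
    """
    counts = Counter(res for res in column if res != wild_type)
    ranked = [res for res, _ in counts.most_common()]
    n = len(ranked)
    values = {wild_type: 0.0}
    for rank, res in enumerate(ranked):
        values[res] = (n - rank) / n
    return values


def normalize(vector):
    """Scale to unit Euclidean length (assumed); an all-zero pattern is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def encode_patterns(sequences):
    """sequences[0] is the wild type; every sequence is a list of residues of equal length."""
    wild = sequences[0]
    tables = [residue_values([seq[pos] for seq in sequences], wild[pos])
              for pos in range(len(wild))]
    return [normalize([tables[pos][res] for pos, res in enumerate(seq)])
            for seq in sequences]
```

With wild type V and mutations N, L, I and A occurring 4, 3, 2 and 1 times, `residue_values` reproduces the assignment in the text: N = 1, L = .75, I = .5, A = .25.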
2.2 Training
A leave-one-out cross-validation strategy was used due to the scarcity of classified patterns. Thirty-one of the 32 patterns with resistance values were added to 800+ patterns remaining after holdout on the data set obtained from Los Alamos. The patterns with resistance values allowed us to identify clusters of mutants as high, medium or low resistance to saquinavir. Clusters with conflicting assignments were classified as 'mixed', and those with no assignment were classified as 'none'. In all, 36 networks were trained a total of 32 times (one for each leave-one-out pattern to be tested), for a total of 1152 runs. See Table 3 for a complete listing of the networks. To summarize, networks with output matrices of 12x12, 10x10, 8x8, 6x6, 5x5, 4x4, and 3x3 were trained using initial learning rates of 0.9-0.5 and initial neighborhoods corresponding to the dimensionality of their output matrix (e.g., an initial neighborhood of 12 for the 12x12 matrix). All networks except one were trained using 10 iterations. The 10x10 matrix was also trained using 50 iterations, an initial learning rate of 0.7, and an initial neighborhood of 10. The results of this test were then compared to the same conditions and 10 iterations to see if increasing the number of iterations would improve the performance of the network.
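The cluster labelling rule used during training can be written down compactly. The sketch below assumes that each cluster is identified by some hashable id (for example the grid coordinate of a winning neuron) and that the labelled training patterns have already been assigned to clusters; the function name is illustrative.

```python
def label_clusters(all_clusters, assignments):
    """Label each cluster from the labelled patterns that fall into it.

    assignments is an iterable of (cluster_id, resistance_label) pairs for the
    patterns with known resistance. A cluster gets that label if all of its
    labelled patterns agree, 'mixed' if they conflict, and 'none' if it
    contains no labelled pattern.
    """
    seen = {}
    for cluster_id, label in assignments:
        seen.setdefault(cluster_id, set()).add(label)
    labels = {}
    for cluster_id in all_clusters:
        found = seen.get(cluster_id, set())
        if not found:
            labels[cluster_id] = "none"
        elif len(found) == 1:
            labels[cluster_id] = next(iter(found))
        else:
            labels[cluster_id] = "mixed"
    return labels
```

In a leave-one-out run, the labels would be built from the 31 retained patterns and the held-out pattern would then be looked up in the resulting dictionary.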
3 Results and Discussion
3.1 Single Network Performance
Once each network was trained, the lone test pattern was run through the network. If the pattern was assigned to a 'mixed' cluster or to one with no
Mutation: Wild Type L10I K14R N37D M46I F53L A71V G73S V77I L90M L10I E35D M36I R41K I62V L63P A71V G73S I84V L90M I93L L10I I15V M36I G48V I54V I62V V82A L10I I15V M36I G48V I54V I62V K14R I15V N37D F53L A71V G73S L90M K14E M36V G48V L63P A71V T74S V82A I15V R41K L63P A71T G73S L90M G48V L63P T74A K20I M36I L63P A71T G73S L90M L10I E35D R41K I62V L63P A71V G73S I84V L90M I93L K14R R41K L63P V77I L90M I93L L10I K20M L63P A71T V77I L90M I93L N37D R57K D60E L63P A71V G73S L90M I93L I15V D30N E35D M36I R41K L63P L63P T74S L90M L63P L90M K14R R41K L63P V77I I93L L10V I62V G73S L90M L63P T74A V77I L63P L90M N37D L63P A71V G73S L90M I93L L10I L63P A71T V77I I93L I15V E35D R41K L63P K14R/K L63P I93I/L K14E L63P A71V I15V L63P L10I L63T A71T L63P A71V L90M L63A G48V I54V L90M (ref. 12) G48V I84V L90M (ref. 12)
IC90 (uM): 0.03 8.08 6.00 1.18 0.92 0.58 0.58 0.37 0.80 0.42 0.34 0.21 0.20 0.20 0.03 0.09 0.08 0.07 0.07 0.07 0.06 0.06 0.06 0.06 0.06 0.06 0.04 0.05 0.02 0.02 0.01 1.50 0.90
Fold resistance: 1 269.33 200 39.33 30.67 19.33 19.33 12.33 26.67 14 12.67 7 6.67 6.67 1 3 2.67 2.33 2.33 2.33 2 2 2 2 2 2 1.33 1.67 0.67 0.67 0.33 50 30
Table 1: Resistance values of HIV Protease mutants to Saquinavir. The fold resistance was calculated as a ratio between the IC90 value of the mutant and the IC90 value of the wild type.
b All mutations were obtained from Winters et al.,11 except as noted.
Actual \ Predicted    L     M     H
L                     -     FP    FP
M                     FN    -     FP
H                     FN    FN    -
Table 2: Truth table for determining false positives and false negatives. Actual classifications are on the left, classifications predicted by the SOFM are across the top.
label, then the pattern was not classified. Otherwise, a predicted resistance classification would be assigned based on the label of the cluster in which the pattern was placed. We defined a false positive (FP) as a mutation that was classified as being more resistant than it actually was based on its n-fold resistance value. For instance, a false positive condition exists if the mutant's IC90 value as reported causes the mutant to be defined as low resistance (i.e., the mutant is less than four-fold more resistant to saquinavir than the wild type) and the network assigns to that mutant a label of medium or high resistance. Conversely, if a mutant is reported as more resistant than the label assigned by the network, a false negative (FN) condition exists. Table 2 summarizes this logic as a truth table. For each network, the 32 test patterns are identified as correctly classified, FP, FN, or not classified (if they are assigned to a 'mixed' or unlabelled cluster). Then the coverage and accuracy of the network are calculated. Coverage is defined as the ratio of test patterns that were classified (i.e., assigned to a labelled cluster) to total test patterns. Accuracy is defined as the ratio of patterns that were correctly classified to the total number of patterns classified. For our purposes, both are expressed as percentages. A third number that has been calculated for each network is what we call the network's score:
Score = Coverage * Accuracy * 100
The score allows us to compare networks based on a single number. Obviously, there are other ways one may calculate a score that weights the contribution of coverage and accuracy differently. For our purposes, we will treat them as equal contributions to the overall score of the network, although we will also discriminate by coverage before attempting to find the network with the best accuracy. Our results are summarized in Table 3.
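The definitions above (resistance classes, false positives and negatives, coverage, accuracy and score) translate directly into code. The sketch below treats coverage and accuracy as fractions between 0 and 1 so that the score formula yields values on the 0 to 100 scale reported in the tables; the function names are illustrative.

```python
RANK = {"low": 0, "medium": 1, "high": 2}


def resistance_class(fold):
    """Low: below 4-fold; high: above 10-fold; medium: between 4- and 10-fold."""
    if fold < 4:
        return "low"
    if fold > 10:
        return "high"
    return "medium"


def outcome(actual, predicted):
    """Classify one test pattern as 'correct', 'FP', 'FN' or 'unclassified'."""
    if predicted in ("mixed", "none"):
        return "unclassified"
    if predicted == actual:
        return "correct"
    # Predicted more resistant than reported: false positive; otherwise false negative.
    return "FP" if RANK[predicted] > RANK[actual] else "FN"


def evaluate(pairs):
    """pairs: list of (actual_label, predicted_label). Returns (coverage, accuracy, score)."""
    results = [outcome(actual, predicted) for actual, predicted in pairs]
    classified = [r for r in results if r != "unclassified"]
    coverage = len(classified) / len(results)
    accuracy = classified.count("correct") / len(classified) if classified else 0.0
    return coverage, accuracy, coverage * accuracy * 100
```

As a check against the reported figures, 69% coverage and 68% accuracy give 0.69 * 0.68 * 100, which rounds to a score of 47.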
The network with the best overall performance and also the best coverage was the 8x8 output matrix with an initial learning rate of 0.6. The most accurate network was the 8x8 output matrix with an initial learning rate of 0.5. This network produced 100% accuracy, but provided only 31% coverage. Note that there are other networks
[Table 3: complete listing of the trained networks and their results.]
Output Matrix    Coverage    Accuracy    Score
12x12            41%         59%         25
10x10            44%         64%         29
8x8              46%         71%         32
6x6              37%         78%         29
5x5              20%         76%         16
4x4              9%          95%         8
3x3              1%          100%        1
Table 4: Average performance of networks by size of output matrix.
which produced 100% accuracy, but all of these networks exhibited very poor coverage (less than 10%) and were rejected from serious consideration. Overall, it was observed (see Table 4) that the networks with 8x8 output matrices performed best (average score of 32) and also provided the best coverage (average of 46%). Networks with 12x12, 10x10, 6x6 and 5x5 output matrices also performed reasonably well. The networks with smaller output matrices had very high accuracy, but their coverage was quite poor (again, less than 10%). It was also observed that increasing the number of iterations during training did not improve network performance, but actually degraded performance for the test case (10x10 output matrix, 0.7 initial learning rate, 50 iterations).
3.2 Majority Voting Schemes
The performance of the best network allowed for better-than-random accuracy (68%) and acceptable coverage of 69%. The most accurate network had 100% success for those patterns that it was able to classify, but provided only marginal coverage at 31%. Certainly for such a critical application as predicting HIV drug resistance, we would want better performance. One possibility is to make use of multiple networks at once using a majority voting scheme. In majority voting, the results of presenting a pattern to a number of networks are tallied, and the majority classification is taken as correct. In situations where one or more networks fail to classify the pattern (e.g., the pattern is assigned to a 'mixed' or unlabelled cluster), only the outputs of the networks that successfully classify the pattern are used. In the case of a tie (there were none for the schemes that we explored), the lowest drug resistance classification was selected. That is, we considered the risk of trying a drug treatment that did not work to be lower than the risk of missing a potentially effective drug treatment.
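The voting rule just described, including the treatment of unclassified outputs and the tie-break toward the lowest resistance class, can be sketched as follows. The label strings and the function name are illustrative assumptions.

```python
from collections import Counter

ORDER = ["low", "medium", "high"]  # lowest resistance first


def majority_vote(predictions):
    """Combine the outputs of several networks for one pattern.

    Outputs such as 'mixed' or 'none' are ignored. Returns None when no
    network classified the pattern. Ties are resolved in favour of the
    lowest resistance class.
    """
    votes = Counter(p for p in predictions if p in ORDER)
    if not votes:
        return None
    top = max(votes.values())
    tied = [label for label in ORDER if votes.get(label, 0) == top]
    return tied[0]
```

For example, `majority_vote(["high", "high", "low", "mixed"])` yields `"high"`, while a one-to-one tie between `"high"` and `"low"` yields `"low"`.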
Voting Scheme                         Coverage    Accuracy    Score
Majority of 6 Most Accurate           84%         85%         71
Majority of Best + 3 Most Accurate    88%         79%         70
Majority of 4 Best Score              84%         70%         59
Best Single Network (c)               69%         68%         47
Most Accurate Single Network          31%         100%        31
Table 5: Comparison of scores for various majority voting schemes.
Three schemes were tested and compared to the best single network and the most accurate single network. The first scheme was a combination of the six most accurate networks: 8x8-0.5, 6x6-0.7, 6x6-0.5, 5x5-0.9, 5x5-0.8, and 5x5-0.6 (the number after the dash is the initial learning rate). The second scheme combined the best single network with the three most accurate networks: 6x6-0.7, 6x6-0.5, and 8x8-0.5. Again, those networks with 100% accuracy but very low coverage (the networks with 4x4 and 3x3 output matrices) were ignored. Our final scheme combined the results of the four networks with the best overall scores: 8x8-0.6, 10x10-0.9, 10x10-0.6, and 12x12-0.9. Perrone claims that the performance of a combiner (e.g., a majority voting scheme) is never worse than the average of the individual classifiers, but not necessarily better than the best classifier.13 In our case, all of the majority voting schemes outperformed the single best network (see Table 5). The average coverage across the three voting schemes was 85%, the average accuracy of the three was 78%, and the average score was 67. This represents a significant improvement over the single best network (69%, 68%, and 47, respectively).
4 Conclusions and Further Work
This research explored the possibility of using self-organizing feature maps to predict drug resistance in HIV-1 infected patients based only on the peptide sequence of the HIV protease mutant strain. This differs from previous work, which attempted to predict drug resistance based on structural features of the HIV protease.4 This paper shows that the single best classifier found* produces acceptable results (69% coverage and 68% accuracy), but to produce a predictive system suitable for clinical use, multiple networks configured in a majority voting scheme may be necessary. The best scheme was the six most accurate networks, with coverage of 84%, accuracy of 85%, and a score of 71. All majority voting schemes outperformed the single best network.

There are many opportunities for further research on using SOFMs for predicting drug resistance. In the case of HIV drug resistance, there are additional drugs (e.g., Indinavir and Nelfinavir) and drug combinations that may be explored. The difficulty with this work, and with work on other HIV treatments, is the lack of publicly available clinical data (IC90 values). Christodoulou and Pattichis have also incorporated confidence measures for weighting individual network results in majority voting schemes,8 which may be applied to the HIV drug resistance problem. Finally, SOFMs may be applied to the treatment of diseases caused by other retroviruses, such as human T-cell leukemia virus (HTLV-1) and hairy cell leukemia (HTLV-2), as well as by DNA viruses such as hepatitis B and herpes.

*Best single network was 8x8 output matrix, 0.6 initial learning rate, initial neighborhood of 8, 10 iterations; most accurate single network was 8x8 output matrix, 0.5 initial learning rate, initial neighborhood of 8, 10 iterations.

References
1. James D. Watson, Michael Gilman, Jan Witkowski, and Mark Zoller. Recombinant DNA, 2nd Ed., pages 485-509. Scientific American Books, New York, 1992.
2. T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg, 1995.
3. Martin T. Hagan, Howard B. Demuth, and Mark Beale. Neural Network Design, pages 14.10-14.16. PWS Publishing Company, Boston, 1996.
4. Sorin Draghici, Lonnie Cumberland, and Ladislau C. Kovari. Correlation of HIV protease structure with indinavir resistance: a data mining and neural network approach. In Proceedings of SPIE 2000, volume 4057-40, Orlando, Florida, 2000.
5. Jouni Kaartinen, Yrjo Hiltunen, P.T. Kovanen, and Mika Ala-Korpela. Application of self-organizing maps for the detection and classification of human blood plasma lipoprotein lipid profiles on the basis of 1H NMR spectroscopy data. NMR in Biomedicine, 11:168-176, 1998.
6. Mikko Makipaa, Pekka Heinonen, and Erkki Oja. Using the SOM in supporting diabetes therapy. Helsinki University of Technology, Finland, June 4-6, 1997.
7. T.C.S. Santos-Andre and A.C.R. da Silva. A neural network made of a Kohonen's SOM coupled to a MLP trained via backpropagation for the diagnosis of malignant breast cancer from digital mammograms. In IJCNN '99, volume 5, pages 3647-3650, 1999.
8. C. I. Christodoulou and C. S. Pattichis. Medical diagnostic systems using ensembles of neural SOFM classifiers. In Proceedings of ICECS '99, volume 1, pages 121-124, 1999.
9. C. I. Christodoulou and C. S. Pattichis. Unsupervised pattern recognition for the classification of EMG signals. IEEE Transactions on Biomedical Engineering, 46(2):169-178, Feb 1999.
10. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
11. Mark A. Winters, Jonathan M. Schapiro, Jody Lawrence, and Thomas C. Merigan. Human immunodeficiency virus type 1 protease genotypes and in vitro protease inhibitor susceptibilities of isolates from individuals who were switched to other protease inhibitors after long-term saquinavir treatment. Journal of Virology, 72(6):5303-5306, 1998.
12. Raymond F. Schinazi, Brendan A. Larder, and John W. Mellors. Mutations in retroviral genes associated with drug resistance: 1999-2000 update. International Antiviral News, 7(4):46-69, 1999.
13. M. P. Perrone. Averaging/modular techniques for neural networks. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 126-129, Cambridge, Massachusetts, 1999. MIT Press.
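As a rough illustration of the majority voting idea discussed in the conclusions of the preceding paper, the sketch below combines the outputs of several classifiers into one prediction. It is not the authors' implementation: the function names, the toy stand-in "networks", and the choice to return no prediction on a tied vote are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label with the most votes, or None if there is no clear winner."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    # Assumption of this sketch: a tie for first place yields no prediction.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def ensemble_predict(classifiers: Sequence[Callable[[str], str]],
                     sequence: str) -> Optional[str]:
    """Collect one vote per classifier for a sequence and combine the votes."""
    return majority_vote([classify(sequence) for classify in classifiers])


# Hypothetical stand-ins for six trained networks. Each maps a peptide
# sequence to a label using an arbitrary, made-up rule; real networks
# would be trained SOFM classifiers.
toy_networks = [
    lambda s: "resistant" if "V" in s else "susceptible",
    lambda s: "resistant" if s.count("L") >= 2 else "susceptible",
    lambda s: "resistant" if s.startswith("P") else "susceptible",
    lambda s: "resistant" if len(s) > 5 else "susceptible",
    lambda s: "resistant" if "M" in s else "susceptible",
    lambda s: "susceptible",
]
```

Calling `ensemble_predict(toy_networks, sequence)` returns the label that most of the six stand-in classifiers agree on. With an even number of voters a split vote is possible, which is why this sketch returns `None` in that case; how ties were actually handled is not stated in the text above.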
AUTOMATING DATA ACQUISITION INTO ONTOLOGIES FROM PHARMACOGENETICS RELATIONAL DATA SOURCES USING DECLARATIVE OBJECT DEFINITIONS AND XML

DANIEL L. RUBIN, MICHEAL HEWETT, DIANE E. OLIVER, TERI E. KLEIN, AND RUSS B. ALTMAN
Stanford Medical Informatics, MSOB X-215, Stanford, CA 94305-5479 USA
E-mail: [email protected], [email protected]

Ontologies are useful for organizing large numbers of concepts having complex relationships, such as the breadth of genetic and clinical knowledge in pharmacogenomics. But because ontologies change and knowledge evolves, it is time consuming to maintain stable mappings to external data sources that are in relational format. We propose a method for interfacing ontology models with data acquisition from external relational data sources. This method uses a declarative interface between the ontology and the data source, and this interface is modeled in the ontology and implemented using XML schema. Data is imported from the relational source into the ontology using XML, and data integrity is checked by validating the XML submission with an XML schema. We have implemented this approach in PharmGKB (http://www.pharmgkb.org/), a pharmacogenetics knowledge base. Our goals were to (1) import genetic sequence data, collected in relational format, into the pharmacogenetics ontology, and (2) automate the process of updating the links between the ontology and data acquisition when the ontology changes. We tested our approach by linking PharmGKB with data acquisition from a relational model of genetic sequence information. The ontology subsequently evolved, and we were able to rapidly update our interface with the external data and continue acquiring the data. Similar approaches may be helpful for integrating other heterogeneous information sources in order to make the diversity of pharmacogenetics data amenable to computational analysis.
1 Introduction
1.1 Pharmacogenetics and the need to connect diverse data
Connecting genotype and phenotype data is the quest of pharmacogenetics,* a discipline that seeks to understand how inherited genetic differences among people influence their response to drugs. Discovering important relationships between genes and drugs could lead to personalized medicine, where drug therapy is customized according to the genetic constitution of the patient. Thus, there is great interest in rapidly acquiring genotype and phenotype data in many individuals, and clinical trials in the future will routinely collect genotype as well as phenotype information.1 Modern experimental methods such as high-throughput DNA sequencing techniques and gene-expression microarrays are contributing detailed genetic and phenotypic information at a rapid rate.2,3 These abundant and diverse data are a rich source for developing a comprehensive picture of relationships among genes and drugs, but they also create new and complex problems for data integration and interpretation. The plethora of diverse databases having genomic,4-7 cellular,8 and phenotype information9 exacerbates this complexity. Even within a given class of database, such as those containing genetic sequence data, the organization, terminologies, and data models differ.6,7,10 It is difficult to integrate heterogeneous databases, and standards are not easily adopted.3

In response to the need for an integrated resource for pharmacogenetics research, the National Institutes of Health funded the Pharmacogenetics Research Network and Knowledge Base initiative, including the pharmacogenetics knowledge base (PharmGKB).11 The goal of the PharmGKB project is to develop a knowledge base that can become a national resource containing high-quality, publicly accessible pharmacogenetics data that connect genotype, molecular/cellular phenotype, and clinical phenotype information. The challenge for PharmGKB is to integrate a wide scope of genetic and phenotypic information.

*We will consider the term "pharmacogenomics" to be equivalent to "pharmacogenetics."

1.2 Integrating data in ontologies
To integrate diverse genetic, cellular phenotypic, and clinical information, it is necessary to develop a data model that specifies the pertinent concepts, the semantics of these concepts, and the relationships among them. Because biological understanding evolves, and new types of information continue to emerge after a database design is established, the data model changes. However, when the data model changes, the links to outside sources of data must be updated, which can be a time-consuming process.

Ontologies are models that describe concepts and the relationships among them, combining an abstraction hierarchy of concepts with a semantic network of relationships. Ontologies are flexible and highly expressive, and have been useful for building knowledge bases in biology,12-15 as well as in the PharmGKB project.16 A disadvantage of ontologies is that network and hierarchical data models are very different from flat tabular relational models, and ontologies are not easily integrated with relational data sources; yet the latter are predominant in most biology databases4-7,17 and experimental laboratories today. This is not a problem when the ontologies are relatively stable, do not change once data acquisition begins, and are manually curated to ensure integrity of the data.14,15 But while developing the ontology for PharmGKB, it became clear that it will continue to change as our understanding of the concepts and relationships in pharmacogenetics data evolves. Furthermore, many biomedical scientists think about their data in terms of tables (a relational view), not in terms of ontologies. Our challenge, therefore, is to develop a robust interface between relational data acquisition and the PharmGKB ontology. We also sought a method that would automate updating this interface when the ontology changes.
1.3 XML and data exchange
Extensible Markup Language (XML18) is useful as a data representation scheme19-21 and for exchanging data between resources and databases.22-24 XML provides a general framework for exchanging data between resources because it is extensible, readable by humans, unambiguously parsed by computers, and can be formally defined using a document type definition (DTD) or XML schema. XML schema25 is a more powerful language for defining XML formats. XML schema is superior to a DTD for expressing constraints because XML schemas specify not only the structure but also the data type of each element and attribute. XML schemas are written in XML, and thus are self-describing and easier to understand than a DTD. XML schemas are also extensible, permitting authors to develop customized constraints.

Data integration requires access to a variety of data sources through a single mediated schema. A major difficulty with integrating data from outside sources is the laborious manual construction of semantic mappings between the source schemas and the mediated schema. It is also necessary to validate the incoming data against the legal ranges for each field in the importing database. If we were to develop an XML schema to serve as the mediating schema, this would address the problem of validating the structure and content of incoming data. But we would still need a way of defining the content of the XML schema. Ideally, the XML schema should be defined from information in the PharmGKB ontology. We have developed a method for using an ontology to define a mediating XML schema.

2 Method
2.1 Overview of our method
Our method consists of several components that are shown schematically in Figure 1. The first component is the PharmGKB ontology, which contains the concepts (classes) that describe the domain of pharmacogenetics, and it also models the relationships among the classes (Figure 2, left side). Data are stored in the ontology by creating instances of these classes and storing the data in the appropriate slots (named attributes that store data) associated with the instances. To specify a relationship between instances, we connect them by assigning one instance to the slot value in the other instance. For example, a PCR assay submission has relationships to two instances: a forward PCR primer and a reverse PCR primer (Figure 2, right side). This relationship allows us to specify the particular primers used in a PCR assay. The second component of our system is the XML schema (Figure 1), which is derived from the ontology and used as an interface between data acquisition and the ontology. The ontology contains a declarative representation of data constraints that are used to define validation constraints on incoming data, and to create the XML schema. This component includes an XML parser that validates incoming XML
[Figure: schematic diagram. Labels: PharmGKB Ontology (instance-based storage); Data Entry Layer: HTML Form; Create Instances; XML Schema (derived from ontology), with sample markup <xsd:element name="Gene">, <xsd:complexType>, <xsd:sequence>; Application Layer: API Programs; XML Validation.]