EXPERT SYSTEMS IN CHEMISTRY RESEARCH

Markus C. Hemmer
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

© 2008 by Taylor & Francis Group, LLC. CRC Press is an imprint of Taylor & Francis Group, an Informa business.

No claim to original U.S. Government works. Printed in the United States of America on acid-free paper. 10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-1-4200-5323-4 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Hemmer, Markus C.
Expert systems in chemistry research / Markus C. Hemmer.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-5323-4 (hardback : alk. paper)
1. Chemistry--Data processing. 2. Chemistry--Research. I. Title.
QD39.3.E46.H46 2007
542'.85633--dc22    2007031226
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Dedicated to my father, who has aided and accompanied my scientific career all these years.
Contents

Preface .... xvii
Acknowledgments .... xix
Trademark Information .... xxi

Chapter 1  Introduction .... 1
1.1 Introduction .... 1
1.2 What We Are Talking About .... 1
1.3 The Concise Summary .... 3
1.4 Some Initial Thoughts .... 3
References .... 8

Chapter 2  Basic Concepts of Expert Systems .... 9
2.1 What Are Expert Systems? .... 9
2.2 The Conceptual Design of an Expert System .... 10
2.3 Knowledge and Knowledge Representation .... 12
    2.3.1 Rules .... 12   2.3.2 Semantic Networks .... 14   2.3.3 Frames .... 16   2.3.4 Advantages of Rules .... 18   2.3.4.1 Declarative Language .... 18   2.3.4.2 Separation of Business Logic and Data .... 18   2.3.4.3 Centralized Knowledge Base .... 18   2.3.4.4 Performance and Scalability .... 19   2.3.5 When to Use Rules .... 19
2.4 Reasoning .... 20
    2.4.1 The Inference Engine .... 20   2.4.2 Forward and Backward Chaining .... 22   2.4.3 Case-Based Reasoning .... 22
2.5 The Fuzzy World .... 24
    2.5.1 Certainty Factors .... 24   2.5.2 Fuzzy Logic .... 25   2.5.3 Hidden Markov Models .... 26   2.5.4 Working with Probabilities — Bayesian Networks .... 27   2.5.5 Dempster-Shafer Theory of Evidence .... 28
2.6 Gathering Knowledge — Knowledge Engineering .... 29
2.7 Concise Summary .... 31
References .... 32
Chapter 3  Development Tools for Expert Systems .... 35
3.1 Introduction .... 35
3.2 The Technical Design of Expert Systems .... 35
    3.2.1 Knowledge Base .... 35   3.2.2 Working Memory .... 35   3.2.3 Inference Engine .... 36   3.2.4 User Interface .... 36
3.3 Imperative versus Declarative Programming .... 37
3.4 List Processing (LISP) .... 40
3.5 Programming Logic (PROLOG) .... 41
    3.5.1 PROLOG Facts .... 41   3.5.2 PROLOG Rules .... 42
3.6 National Aeronautics and Space Administration’s (NASA’s) Alternative — C Language Integrated Production System (CLIPS) .... 43
    3.6.1 CLIPS Facts .... 44   3.6.2 CLIPS Rules .... 45
3.7 Java-Based Expert Systems — JESS .... 47
3.8 Rule Engines — JBoss Rules .... 48
3.9 Languages for Knowledge Representation .... 49
    3.9.1 Classification of Individuals and Concepts (CLASSIC) .... 50   3.9.2 Knowledge Machine .... 51
3.10 Advanced Development Tools .... 53
    3.10.1 XpertRule .... 55   3.10.2 Rule Interpreter (RI) .... 56
3.11 Concise Summary .... 57
References .... 58

Chapter 4  Dealing with Chemical Information .... 61
4.1 Introduction .... 61
4.2 Structure Representation .... 61
    4.2.1 Connection Tables (CTs) .... 61   4.2.2 Connectivity Matrices .... 62   4.2.3 Linear Notations .... 63   4.2.4 Simplified Molecular Input Line Entry Specification (SMILES) .... 63   4.2.5 SMILES Arbitrary Target Specification (SMARTS) .... 64
4.3 Searching for Chemical Structures .... 64
    4.3.1 Identity Search versus Substructure Search .... 64   4.3.2 Isomorphism Algorithms .... 65   4.3.3 Prescreening .... 66   4.3.4 Hash Coding .... 66   4.3.5 Stereospecific Search .... 67   4.3.6 Tautomer Search .... 67   4.3.7 Specifying a Query Structure .... 68
4.4 Describing Molecules .... 69
    4.4.1 Basic Requirements for Molecular Descriptors .... 70   4.4.1.1 Independency of Atom Labeling .... 71
    4.4.1.2 Rotational/Translational Invariance .... 71   4.4.1.3 Unambiguous Algorithmically Computable Definition .... 71   4.4.1.4 Range of Values .... 71   4.4.2 Desired Properties of Molecular Descriptors .... 72   4.4.2.1 Reversible Encoding .... 73   4.4.3 Approaches for Molecular Descriptors .... 73   4.4.4 Constitutional Descriptors .... 73   4.4.5 Topological Descriptors .... 74   4.4.6 Topological Autocorrelation Vectors .... 74   4.4.7 Fragment-Based Coding .... 75   4.4.8 3D Molecular Descriptors .... 76   4.4.9 3D Molecular Representation Based on Electron Diffraction .... 77   4.4.10 Radial Distribution Functions .... 77   4.4.11 Finding the Appropriate Descriptor .... 78
4.5 Descriptive Statistics .... 79
    4.5.1 Basic Terms .... 79   4.5.1.1 Standard Deviation (SD) .... 79   4.5.1.2 Variance .... 79   4.5.1.3 Covariance .... 80   4.5.1.4 Covariance Matrix .... 80   4.5.1.5 Eigenvalues and Eigenvectors .... 80   4.5.2 Measures of Similarity .... 81   4.5.3 Skewness and Kurtosis .... 83   4.5.4 Limitations of Regression .... 85   4.5.5 Conclusions for Investigations of Descriptors .... 86
4.6 Capturing Relationships — Principal Components .... 87
    4.6.1 Principal Component Analysis (PCA) .... 87   4.6.1.1 Centering the Data .... 89   4.6.1.2 Calculating the Covariance Matrix .... 89   4.6.2 Singular Value Decomposition (SVD) .... 91   4.6.3 Factor Analysis .... 94
4.7 Transforming Descriptors .... 95
    4.7.1 Fourier Transform .... 95   4.7.2 Hadamard Transform .... 96   4.7.3 Wavelet Transform .... 96   4.7.4 Discrete Wavelet Transform .... 97   4.7.5 Daubechies Wavelets .... 98   4.7.6 The Fast Wavelet Transform .... 99
4.8 Learning from Nature — Artificial Neural Networks .... 102
    4.8.1 Artificial Neural Networks in a Nutshell .... 103   4.8.2 Kohonen Neural Networks — The Classifiers .... 105   4.8.3 Counterpropagation (CPG) Neural Networks — The Predictors .... 107   4.8.4 The Tasks: Classification and Modeling .... 109
4.9 Genetic Algorithms (GAs) .... 110
4.10 Concise Summary .... 112
References .... 115

Chapter 5  Applying Molecular Descriptors .... 119
5.1 Introduction .... 119
5.2 Radial Distribution Functions (RDFs) .... 119
    5.2.1 Radial Distribution Function .... 119   5.2.2 Smoothing and Resolution .... 120   5.2.3 Resolution and Probability .... 122
5.3 Making Things Comparable — Postprocessing of RDF Descriptors .... 123
    5.3.1 Weighting .... 123   5.3.2 Normalization .... 124   5.3.3 Remark on Linear Scaling .... 124
5.4 Adding Properties — Property-Weighted Functions .... 125
    5.4.1 Static Atomic Properties .... 125   5.4.2 Dynamic Atomic Properties .... 126   5.4.3 Property Products versus Averaged Properties .... 126
5.5 Describing Patterns .... 128
    5.5.1 Distance Patterns .... 129   5.5.2 Frequency Patterns .... 129   5.5.3 Binary Patterns .... 130   5.5.4 Aromatic Patterns .... 130   5.5.5 Pattern Repetition .... 130   5.5.6 Symmetry Effects .... 130   5.5.7 Pattern Matching with Binary Patterns .... 131
5.6 From the View of an Atom — Local and Restricted RDF Descriptors .... 131
    5.6.1 Local RDF Descriptors .... 132   5.6.2 Atom-Specific RDF Descriptors .... 132
5.7 Straight or Detour — Distance Function Types .... 133
    5.7.1 Cartesian RDF .... 133   5.7.2 Bond-Path RDF .... 133   5.7.3 Topological Path RDF .... 134
5.8 Constitution and Conformation .... 135
5.9 Constitution and Molecular Descriptors .... 136
5.10 Constitution and Local Descriptors .... 139
5.11 Constitution and Conformation in Statistical Evaluations .... 140
5.12 Extending the Dimension — Multidimensional Function Types .... 145
5.13 Emphasizing the Essential — Wavelet Transforms .... 147
    5.13.1 Single-Level Transforms .... 150   5.13.2 Wavelet-Compressed Descriptors .... 151
5.14 A Tool for Generation and Evaluation of RDF Descriptors — ARC .... 151
    5.14.1 Loading Structure Information .... 153   5.14.2 The Default Code Settings .... 153   5.14.3 Calculation and Investigation of a Single Descriptor .... 154   5.14.4 Calculation and Investigation of Multiple Descriptor Sets .... 155   5.14.5 Binary Comparison .... 155
    5.14.6 Correlation Matrices .... 155   5.14.7 Training a Neural Network .... 155   5.14.8 Investigation of Trained Network .... 157   5.14.9 Prediction and Classification for a Test Set .... 157
5.15 Synopsis .... 157
    5.15.1 Similarity and Diversity of Molecules .... 162   5.15.2 Structure and Substructure Search .... 162   5.15.3 Structure–Property Relationships .... 162   5.15.4 Structure–Activity Relationships .... 162   5.15.5 Structure–Spectrum Relationships .... 162
5.16 Concise Summary .... 163
References .... 165

Chapter 6  Expert Systems in Fundamental Chemistry .... 167
6.1 Introduction .... 167
6.2 How It Began — The DENDRAL Project .... 167
    6.2.1 The Generator — CONGEN .... 168   6.2.2 The Constructor — PLANNER .... 168   6.2.3 The Testing — PREDICTOR .... 169   6.2.4 Other DENDRAL Programs .... 171
6.3 A Forerunner in Medical Diagnostics .... 171
6.4 Early Approaches in Spectroscopy .... 175
    6.4.1 Early Approaches in Vibrational Spectroscopy .... 176   6.4.2 Artificial Neural Networks for Spectrum Interpretation .... 177
6.5 Creating Missing Information — Infrared Spectrum Simulation .... 178
    6.5.1 Spectrum Representation .... 178   6.5.2 Compression with Fast Fourier Transform .... 179   6.5.3 Compression with Fast Hadamard Transform .... 179
6.6 From the Spectrum to the Structure — Structure Prediction .... 179
    6.6.1 The Database Approach .... 181   6.6.2 Selection of Training Data .... 181   6.6.3 Outline of the Method .... 182   6.6.3.1 Preprocessing of Spectrum Information .... 182   6.6.3.2 Preprocessing of Structure Information .... 182   6.6.3.3 Generation of a Descriptor Database .... 182   6.6.3.4 Training .... 182   6.6.3.5 Prediction of the Radial Distribution Function (RDF) Descriptor .... 183   6.6.3.6 Conversion of the RDF Descriptor .... 184   6.6.4 Examples for Structure Derivation .... 184   6.6.5 The Modeling Approach .... 187   6.6.6 Improvement of the Descriptor .... 188   6.6.7 Database Approach versus Modeling Approach .... 189
6.7 From Structures to Properties .... 190
    6.7.1 Searching for Similar Molecules in a Data Set .... 191
    6.7.2 Molecular Diversity of Data Sets .... 193   6.7.2.1 Average Descriptor Approach .... 194   6.7.2.2 Correlation Approach .... 194   6.7.3 Prediction of Molecular Polarizability .... 199
6.8 Dealing with Localized Information — Nuclear Magnetic Resonance (NMR) Spectroscopy .... 201
    6.8.1 Commercially Available Products .... 201   6.8.2 Local Descriptors for Nuclear Magnetic Resonance Spectroscopy .... 202   6.8.3 Selecting Descriptors by Evolution .... 205   6.8.4 Learning Chemical Shifts .... 206   6.8.5 Predicting Chemical Shifts .... 207
6.9 Applications in Analytical Chemistry .... 208
    6.9.1 Gamma Spectrum Analysis .... 208   6.9.2 Developing Analytical Methods — Thermal Dissociation of Compounds .... 209   6.9.3 Eliminating the Unnecessary — Supporting Calibration .... 215
6.10 Simulating Biology .... 217
    6.10.1 Estimation of Biological Activity .... 217   6.10.2 Radioligand Binding Experiments .... 218   6.10.3 Effective and Inhibitory Concentrations .... 219   6.10.4 Prediction of Effective Concentrations .... 221   6.10.5 Progestagen Derivatives .... 221   6.10.6 Calcium Agonists .... 223   6.10.7 Corticosteroid-Binding Globulin (CBG) Steroids .... 224   6.10.8 Mapping a Molecular Surface .... 226
6.11 Supporting Organic Synthesis .... 229
    6.11.1 Overview of Existing Systems .... 230   6.11.2 Elaboration of Reactions for Organic Synthesis .... 232   6.11.3 Kinetic Modeling in EROS .... 233   6.11.4 Rules in EROS .... 233   6.11.5 Synthesis Planning — Workbench for the Organization of Data for Chemical Applications (WODCA) .... 234
6.12 Concise Summary .... 236
References .... 239

Chapter 7  Expert Systems in Other Areas of Chemistry .... 247
7.1 Introduction .... 247
7.2 Bioinformatics .... 247
    7.2.1 Molecular Genetics (MOLGEN) .... 248   7.2.2 Predicting Toxicology — Deductive Estimation of Risk from Existing Knowledge (DEREK) for Windows .... 249   7.2.3 Predicting Metabolism — Meteor .... 251   7.2.4 Estimating Biological Activity — APEX-3D .... 251   7.2.5 Identifying Protein Structures .... 254
7.3 Environmental Chemistry .... 257
    7.3.1 Environmental Assessment — Green Chemistry Expert System (GCES) .... 257   7.3.2 Synthetic Methodology Assessment for Reduction Techniques .... 258   7.3.3 Green Synthetic Reactions .... 259   7.3.4 Designing Safer Chemicals .... 260   7.3.5 Green Solvents/Reaction Conditions .... 261   7.3.6 Green Chemistry References .... 261   7.3.7 Dynamic Emergency Management — Real-Time Expert System (RTXPS) .... 262   7.3.8 Representing Facts — Descriptors .... 262   7.3.9 Changing Facts — Backward-Chaining Rules .... 263   7.3.10 Triggering Actions — Forward-Chaining Rules .... 263   7.3.11 Reasoning — The Inference Engine .... 264   7.3.12 A Combined Approach for Environmental Management .... 265   7.3.13 Assessing Environmental Impact — EIAxpert .... 266
7.4 Geochemistry and Exploration .... 267
    7.4.1 Exploration .... 267   7.4.2 Geochemistry .... 268   7.4.3 X-Ray Phase Analysis .... 268
7.5 Engineering .... 269
    7.5.1 Monitoring of Space-Based Systems — Thermal Expert System (TEXSYS) .... 269   7.5.2 Chemical Equilibrium of Complex Mixtures — CEA .... 270
7.6 Concise Summary .... 271
References .... 274

Chapter 8  Expert Systems in the Laboratory Environment .... 277
8.1 Introduction .... 277
8.2 Regulations .... 277
    8.2.1 Good Laboratory Practices .... 278   8.2.1.1 Resources, Organization, and Personnel .... 278   8.2.1.2 Rules, Protocols, and Written Procedures .... 278   8.2.1.3 Characterization .... 278   8.2.1.4 Documentation .... 278   8.2.1.5 Quality Assurance .... 279   8.2.2 Good Automated Laboratory Practice (GALP) .... 279   8.2.3 Electronic Records and Electronic Signatures (21 CFR Part 11) .... 280
8.3 The Software Development Process .... 281
    8.3.1 From the Requirements to the Implementation .... 282   8.3.1.1 Analyzing the Requirements .... 282   8.3.1.2 Specifying What Has to Be Done .... 282   8.3.1.3 Defining the Software Architecture .... 282   8.3.1.4 Programming .... 282   8.3.1.5 Testing the Outcome .... 283
    8.3.1.6 Documenting the Software .... 283   8.3.1.7 Supporting the User .... 283   8.3.1.8 Maintaining the Software .... 283   8.3.2 The Life Cycle of Software .... 283
8.4 Knowledge Management .... 287
    8.4.1 General Considerations .... 287   8.4.2 The Role of a Knowledge Management System (KMS) .... 288   8.4.3 Architecture .... 289   8.4.4 The Knowledge Quality Management Team .... 290
8.5 Data Warehousing .... 290
8.6 The Basis — Scientific Data Management Systems .... 293
8.7 Managing Samples — Laboratory Information Management Systems (LIMS) .... 295
    8.7.1 LIMS Characteristics .... 296   8.7.2 Why Use a LIMS? .... 297   8.7.3 Compliance and Quality Assurance (QA) .... 297   8.7.4 The Basic LIMS .... 298   8.7.5 A Functional Model .... 298   8.7.5.1 Sample Tracking .... 298   8.7.5.2 Sample Analysis .... 299   8.7.5.3 Sample Organization .... 299   8.7.6 Planning System .... 299   8.7.7 The Controlling System .... 300   8.7.8 The Assurance System .... 300   8.7.9 What Else Can We Find in a LIMS? .... 301   8.7.9.1 Automatic Test Programs .... 301   8.7.9.2 Off-Line Client .... 301   8.7.9.3 Stability Management .... 301   8.7.9.4 Reference Substance Module .... 302   8.7.9.5 Recipe Administration .... 302
8.8 Tracking Workflows — Workflow Management Systems .... 302
    8.8.1 Requirements .... 303   8.8.2 The Lord of the Runs .... 303   8.8.3 Links and Logistics .... 304   8.8.4 Supervisor and Auditor .... 304   8.8.5 Interfacing .... 305
8.9 Scientific Documentation — Electronic Laboratory Notebooks (ELNs) .... 305
    8.9.1 The Electronic Scientific Document .... 307   8.9.2 Scientific Document Templates .... 309   8.9.3 Reporting with ELNs .... 310   8.9.4 Optional Tools in ELNs .... 310
8.10 Scientific Workspaces .... 312
    8.10.1 Scientific Workspace Managers .... 313   8.10.2 Navigation and Organization in a Scientific Workspace .... 315   8.10.3 Using Metadata Effectively .... 315
    8.10.4 Working in Personal Mode .... 319   8.10.5 Differences of Electronic Scientific Documents .... 319
8.11 Interoperability and Interfacing .... 320
    8.11.1 eXtensible Markup Language (XML)-Based Technologies .... 320   8.11.1.1 Simple Object Access Protocol (SOAP) .... 321   8.11.1.2 Universal Description, Discovery, and Integration (UDDI) .... 321   8.11.1.3 Web Services Description Language (WSDL) .... 321   8.11.2 Component Object Model (COM) Technologies .... 321   8.11.3 Connecting Instruments — Interface Port Solutions .... 322   8.11.4 Connecting Serial Devices .... 322   8.11.5 Developing Your Own Connectivity — Software Development Kits (SDKs) .... 324   8.11.6 Capturing Data — Intelligent Agents .... 325   8.11.7 The Inbox Concept .... 327
8.12 Access Rights and Administration .... 328
8.13 Electronic Signatures, Audit Trails, and IP Protection .... 329
    8.13.1 Signature Workflow .... 329   8.13.2 Event Messaging .... 331   8.13.3 Audit Trails and IP Protection .... 331   8.13.4 Hashing Data .... 331   8.13.5 Public Key Cryptography .... 332   8.13.5.1 Secret Key Cryptography .... 333   8.13.5.2 Public Key Cryptography .... 333
8.14 Approaches for Search and Reuse of Data and Information .... 333
    8.14.1 Searching for Standard Data .... 334   8.14.2 Searching with Data Cartridges .... 334   8.14.3 Mining for Data .... 335   8.14.4 The Outline of a Data Mining Service for Chemistry .... 336   8.14.4.1 Search and Processing of Raw Data .... 336   8.14.4.2 Calculation of Descriptors .... 337   8.14.4.3 Analysis by Statistical Methods .... 337   8.14.4.4 Analysis by Artificial Neural Networks .... 337   8.14.4.5 Optimization by Genetic Algorithms .... 338   8.14.4.6 Data Storage .... 338   8.14.4.7 Expert Systems .... 338
8.15 A Bioinformatics LIMS Approach .... 338
    8.15.1 Managing Biotransformation Data .... 339   8.15.2 Describing Pathways .... 340   8.15.3 Comparing Pathways .... 342   8.15.4 Visualizing Biotransformation Studies .... 343   8.15.5 Storage of Biotransformation Data .... 344
8.16 Handling Process Deviations .... 344
    8.16.1 Covered Business Processes .... 345   8.16.2 Exception Recording .... 346   8.16.2.1 Basic Information Entry .... 346   8.16.2.2 Risk Assessment .... 346
    8.16.2.3 Cause Analysis .... 347   8.16.2.4 Corrective Actions .... 347   8.16.2.5 Efficiency Checks .... 348   8.16.3 Complaints Management .... 348   8.16.4 Approaches for Expert Systems .... 349
8.17 Rule-Based Verification of User Input .... 350
    8.17.1 Creating User Dialogues .... 350   8.17.2 User Interface Designer (UID) .... 351   8.17.3 The Final Step — Rule Generation .... 354
8.18 Concise Summary .... 354
References .... 358

Chapter 9  Outlook .... 361
9.1 Introduction .... 361
9.2 Attempting a Definition .... 361
9.3 Some Critical Considerations .... 362
    9.3.1 The Comprehension Factor .... 363   9.3.2 The Resistance Factor .... 363   9.3.3 The Educational Factor .... 363   9.3.4 The Usability Factor .... 364   9.3.5 The Commercial Factor .... 365
9.4 Looking Forward .... 365
Reference .... 366

Index .... 367
Preface

Sitting in the breakfast room of my hotel, I thought — and not for the first time — about this term expert. Would I consider myself a specialist in jam, just because I eat it every morning at breakfast? And then — why not? Isn't a consumer a specialist because of his experience with consuming products? Certainly, he would not be considered an expert on the raw materials, the production process, or the quality control in jam production. Still, he is the consumer, so he knows the most about consuming jam.

Whatever constitutes an expert, one of the fascinating topics of computer science arises from the question of how to take advantage of the expert's knowledge in a computer program. Wouldn't it be great if we could transfer some of the expert's knowledge and reasoning to a computer program and use this program for education and problem solving? If we continue thinking in this direction, we encounter a serious problem: We all know that human reasoning and decision making are a result of knowledge, experience, and intuition. How can something like intuition be expressed in logical terms? The answer is that it cannot. In fact, knowledge and reasoning cannot really be expressed in static terms, since they are the result of a complex combination of all three properties. However, computer software is able to store facts. If we were able to describe the relationships between facts and the more complex topics of knowledge, experience, and intuition, we could imagine software that reasons and makes decisions.

At the beginning of a lecture about artificial intelligence at the university, I explained to my students that this topic is basically simple, since it has to do with our perception of the world rather than with the sometimes complex computational point of view. One of them looked at me and asked, "Why then do I have to learn it here?" The answer is, "Because simpler things may be much more difficult to understand due to their general nature." This is the very crux of expert systems: On the one hand, we have a somewhat complex logic that we have to develop and encode in a computer program to make things easier for the expert sitting in front of the screen; on the other hand, the more generalized context of the expert might be even harder to understand than the program running in the background. And there is still a big gap between the expert and the expert system.

During my Internet research for this book I stumbled across a nice phrase in a presentation from Joy Scaria from the Biological Sciences Group of the Birla Institute of Technology and Science in Pilani, India. In an introductory slide, she stated, "Bioinformatics is … complicating biology with introducing algorithms, scripts, statistics, and confusing softwares so that no one understands it anymore.…" Sometimes, I think, this applies to expert systems and the underlying field of artificial intelligence as well. We tend too often to complicate things rather than to simplify them. Consequently, one of my goals in writing this book was to simplify things as far as possible; unfortunately, I did not succeed entirely.
It lies in the nature of the topic that the mathematical, information technology, and regulatory aspects have to be formulated in a more complex manner, whereas the conceptual aspects can be described in a more entertaining fashion. The book is ultimately meant to be a good mixture of scientific literature and a captivating novel. Nothing remains but to wish you an exciting journey into chemistry's future.

Markus C. Hemmer
Bonn, Germany
Acknowledgments

This book is ultimately the result of a series of discussions and contributions around the scientific idea of expert systems, and particularly around simplifying matters. A number of people contributed to this goal in one way or another, and I would like to express my gratitude to them: Dr. Joao Aires de Sousa (Department of Chemistry, New University of Lisbon, Caparica, Portugal), Dr. Jürgen Angerer (Institute and Outpatient Clinic of Occupational, Social and Environmental Medicine, University of Erlangen, Germany), Ulrike Burkard (Institute for Didactics in Physics, University of Bremen, Germany), Dr. Roberta Bursi (N.V. Organon, The Netherlands), Dr. Antony N. Davies (Division of Chemistry and Forensic Science, University of Glamorgan, United Kingdom), Dr. Thomas Engel (Chemical Computing Group AG, Köln, Germany), Dr. Thorsten Fröhlich (Waters GmbH, Frechen), Dr. Johann Gasteiger (Computer Chemistry Center, University of Erlangen, Germany), Dr. Wolfgang Graf zu Castell (Institute of Biomathematics and Biometrics, GSF, Neuherberg, Germany), Alexander von Homeyer (Computer Chemistry Center, University of Erlangen, Germany), Dr. Ulrich Jordis (Institute for Applied Synthesis Chemistry, University of Vienna, Austria), Michael McBrian (Advanced Chemistry Development, Inc., Toronto, Canada), Dr. Reinhard Neudert (Wiley-VCH, Weinheim, Germany), Dr. Livia Sangeorzan (Department of Computer Science, University Transilvania of Brasov, Romania), Dr. Thomas Sauer (Institute of Mathematics, University of Erlangen, Germany), Dr. Axel Schunk (Institute for Didactics of Chemistry, University of Frankfurt Main, Germany), Dr. Christoph Schwab (Molecular Networks GmbH, Erlangen, Germany), Dr. Valentin Steinhauer (Danet Group, Darmstadt, Germany), Dr. Jürgen Sühnel (Bioinformatics Group, Institute of Molecular Biology, Jena, Germany), Dr. Lothar Terfloth (Computer Chemistry Center, University of Erlangen, Germany), and Dr. Heiner Tobschall (Department of Applied Geology, University of Erlangen, Germany).

Parts of the research in this work were supported by the German Federal Ministry of Education and Research (BMFT), the German National Research and Education Network (DFN), the German Academic Exchange Service (DAAD), the National Cancer Institute (U.S. National Institutes of Health), and Waters Corporation, Milford, Massachusetts.
Trademark Information

Apache® is a registered trademark of Apache Software Foundation.
CA® and CAS® are registered trademarks of the American Chemical Society.
Citrix® is a registered trademark of Citrix Systems, Inc. Citrix®, the Citrix logo, ICA®, Program Neighborhood®, MetaFrame®, WinFrame®, VideoFrame®, MultiWin®, and other Citrix product names referenced herein are trademarks of Citrix Systems, Inc.
CleverPath® and Aion® are registered trademarks of Computer Associates International, Inc., New York.
Contergan® is a registered trademark of Grünenthal GmbH, Aachen, Germany.
CORINA® and WODCA® are registered trademarks of Molecular Networks GmbH, Erlangen, Germany.
EXSYS® and EXSYS CORVID® are registered trademarks of EXSYS Inc., Albuquerque, New Mexico.
HTML®, XML®, XHTML®, and W3C® are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.
HyperChem® is a registered trademark of Hypercube, Inc.
IBM®, DB2®, OS/2®, Parallel Sysplex®, and WebSphere® are trademarks of IBM Corporation in the United States and/or other countries.
Java® is a registered trademark of Sun Microsystems, Inc.
JavaScript® is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape®.
JBoss® is a registered trademark of Red Hat, Inc., New Orleans, Louisiana.
Jess® is a registered trademark of Sandia National Laboratories.
Loom® is a registered trademark of the University of Southern California.
MDL® and ISIS® are trademarks of Elsevier MDL.
Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint®, ODBC®, OLE®, .NET®, ActiveX®, and SQL Server® are registered trademarks of Microsoft Corporation.
MOLGEN® is a registered trademark of Stanford University.
MySQL® is a trademark of MySQL AB, Sweden.
NUTS® is a registered trademark of Acorn NMR, Inc.
Oracle® is a registered trademark of Oracle Corporation.
SAP® is a registered trademark of SAP AG, Germany.
SMILES™ is a trademark and SMARTS® is a registered trademark of Daylight Chemical Information Systems Inc., Aliso Viejo, California.
Sun®, Sun Microsystems®, Solaris®, Java®, JavaServer Web Development Kit®, and JavaServer Pages® are trademarks or registered trademarks of Sun Microsystems, Inc.
UNIX®, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.
VAX® and VMS® are registered trademarks of Digital Equipment Corporation, Maynard, Massachusetts.
V-Modell® is a registered trademark of Bundesministerium des Innern (BMI), Germany.
All other product names mentioned in this book are trademarks of their respective owners.
1  Introduction
1.1 Introduction

Allen Newell — a researcher in computer science and cognitive psychology in the School of Computer Science at Carnegie Mellon University — was involved in the design of some of the earliest developments in artificial intelligence. In 1976 he gave a speech at Carnegie Mellon University that was later published in his essay "Fairy Tales," from which the following statement was taken [1]:

    Exactly what the computer provides is the ability not to be rigid and unthinking but, rather, to behave conditionally. That is what it means to apply knowledge to action: It means to let the action taken reflect knowledge of the situation, to be sometimes this way, sometimes that, as appropriate.

What is interesting about this statement is that it apparently contradicts our perception of computers: Computers usually carry the stigma of not being as reasonable and adaptive as the human brain. However, if we think about what characterizes a human decision, we find that exactly what Newell described as behaving conditionally is the very basis of reasoning, evaluation, and assessment. A second fundamental idea behind this phrase is the concept of applying knowledge to action. Although theoretical research is important for modeling a scientific basis, the application of scientific ideas in the best sense is what finally leads to advancement, a fundamental concept of evolution on Earth. Expert systems in particular use conditional methodologies and are designed to apply knowledge to practical problems.

Knowledge and experience play a major role in the handling of scientific information. Computers are indispensable tools for processing and retrieving the huge amounts of laboratory data, and expert systems can aid an expert in making decisions about a certain problem. Human experts rely on experience as well as on knowledge. Experience can be regarded as a specialized kind of knowledge created by a complex interaction of rules and decisions. Instead of representing knowledge in a static way, rule-based systems represent knowledge in terms of rules that lead to conclusions. Computer software will never be able to replace the human expert in interpreting information. However, expert systems can assist the human expert by organizing information and by making estimations and predictions.
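To make the idea of rules that lead to conclusions slightly more concrete, here is a minimal sketch in Python. It is not taken from any system discussed in this book; the rule contents, facts, and function names are invented for illustration. It merely shows knowledge held as if-then rules and a naive forward-chaining loop that applies them to known facts, which is one simple way for a program to "behave conditionally" in Newell's sense.

# Minimal illustration (invented example): knowledge held as if-then rules
# rather than as static statements, and a naive forward-chaining loop that
# applies the rules to a set of known facts until no new conclusions appear.

rules = [
    ({"IR band near 1700 cm-1"}, "carbonyl group present"),
    ({"carbonyl group present", "broad OH band"}, "carboxylic acid likely"),
]

def forward_chain(facts, rules):
    """Fire every rule whose conditions are satisfied until nothing changes."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)  # the rule fires: knowledge applied to action
                changed = True
    return derived

print(forward_chain({"IR band near 1700 cm-1", "broad OH band"}, rules))
# prints a set that now also contains the two derived conclusions

Real inference engines, which are discussed in Chapters 2 and 3, are of course far more elaborate, but the basic motif of matching conditions against facts and firing conclusions is the same.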
1.2 What We Are Talking About

The patient reader will find that this book is unconventional in several respects. This is partially due to the fact that expert systems are an unconventional topic and require, from time to time, a different point of view than one would expect at the outset.
This book is primarily written for scientists — particularly chemists and biochemists — and the thread leading through this book is the application of expert systems and similar software in chemistry, biochemistry, and related research areas. I know that scientists expect an overview of existing research before a book or article turns to what is new. We will follow this approach, however, only where it is necessary for the context of the new content that is presented. The information technology aspects are described to an extent that allows a scientist to understand the principles; they are not intended as instructions on how to develop expert systems. Nevertheless, we will deal with a series of mathematical aspects from cheminformatics and chemometrics that are required to represent chemical information in a computer. The mathematical techniques are introduced at a reasonable level and provide the background needed to understand the examples of expert systems and their applications that are introduced afterward. The expert systems introduced are selected to cover different application areas. Even though the list is far from complete, we will find some historical systems as well as more recent noncommercial and commercial software. Since applying expert systems requires integrating them with other software, we will finally have a look at the laboratory software environment. This final topic will give the reader an idea of the challenges that occur in the practical implementation of such systems in research laboratories and, hopefully, will support the reader's own ideas for developing or applying expert systems.

The following gives a summary of what to find in this book. Later in this chapter, we will start thinking about some initial aspects of intelligence to get a framework for what we want to address. Chapter 2 gives an overview of the ideas and concepts underlying the term expert systems; this is a necessary basis for understanding the approaches that are subsequently described. At this point, we will focus on two domains: the conceptual background and the scientific methodologies that support the concepts. Chapter 3 provides a concise summary of technical design, programming paradigms, and development tools for expert systems. Chapter 4 covers technologies for representing and processing chemical information in a computer, as well as a summary of supporting technologies that typically appear in expert systems. Chapter 5 deals with a particular method for describing molecules in a computer, which acts as a representative of molecular descriptors, a topic of particular importance when dealing with chemistry in software; it introduces applications for the different molecular descriptors described in the previous chapter and closes with the introduction of a software package developed for the investigations. Chapter 6 covers expert systems and their application in chemistry research, starting with the historical forerunners of these systems and then focusing on different application areas. Chapter 7 comprises expert system applications in areas related to chemistry research, such as bioinformatics and industrial areas. Chapter 8 deals with the software environment in the laboratory, gives an overview of typical software packages, and describes requirements that have to be taken into account when using expert systems in conjunction with the existing laboratory software. Chapter 9 closes with a generic definition and a critical assessment of expert systems and an outlook.
1.3 The Concise Summary

A book about expert systems in chemistry has to bring two disciplines together: chemistry and information technology (IT). Since chemists might be unfamiliar with the considerable number of IT aspects of this topic (and vice versa), I have added an additional section to each chapter: the "Concise Summary." The concise summary gives the reader a chance to revisit the important terms and concepts described in the context of the respective chapter. It might also be helpful for reexamining the new concepts and making sure that the new information has been appropriately digested before starting with the next chapter.
1.4 Some Initial Thoughts

Before we go into some detail on the theoretical background of expert systems, we should take some time to get familiar with some of the concepts of dealing with information in a computer. I decided to use one of the newer developments in chemistry research to focus on a problem that is essential for many aspects of computational chemistry: the topic of molecular descriptors. Molecular descriptors describe a molecule in a way that is appropriate for processing in a computer program. As the term suggests, a molecular descriptor describes certain features of a molecule. We could now start focusing on the details and algorithms of descriptors. We will do this later, but let us try another approach here. Let us take a step back and have a more general view of the meaning of the term descriptor. A typical example is a spectrum; spectra describe specific features or the behavior of a molecule under defined circumstances. An infrared spectrum describes the movement of a molecule under infrared radiation, whereas a nuclear magnetic resonance spectrum describes how a molecule's nuclei behave when the surrounding magnetic field changes. All of these descriptors have something in common: They cover just a specific portion of the characteristics of a molecule, and that is the reason why so many different descriptors exist.
One of the approaches for using molecular descriptors is to compare an artificial descriptor with one evolving from a measurable physical or chemical interaction. As an example, we are able to record an infrared spectrum for a compound. If we use an artificial descriptor, which is calculated from properties that are essential for the thermal movement of a molecule, and we compare this artificial descriptor with a measured infrared spectrum, we are able to find a correlation between these descriptors. Finding a correlation allows us — in the best of all worlds — to convert one descriptor into the other. In fact, it would be possible to derive an infrared spectrum from an artificial descriptor.
What good is it to relate an artificial and a measured descriptor? At this point, we have to think about the application of measured descriptors. Infrared spectra are usually used to confirm the structure of a compound: If an infrared spectrum for an unknown compound is measured and we find a similar spectrum in a database, the probability is high that the measured unknown compound is identical to the database compound. Here is an important fact about this approach: We do not need to know the details of infrared spectroscopy; the simple fact that the spectra are similar suggests the similarity of the compounds. This approach is typical for artificial
intelligence systems: We are not looking at the details, but we try to make use of the bigger picture to solve our problem. Expert systems use this approach for problem solving by separating technical aspects clearly from the domain aspects. Consequently, the domain experts are seldom the software development experts who create an expert system. However, developers of expert systems are able to create a reusable shell that allows domain experts to enter knowledge in an understandable fashion.
Facts, like descriptors, are seldom directly interpretable. Looking again at an infrared spectrum, it might be possible to assign each peak to a certain movement in a molecule; however, these facts do not really help to solve the problem of structure elucidation. Chemists know that a complete structure cannot be derived from an infrared spectrum. This is simply because thermal movement and constitution — even though they are related to some extent — are two different things. But there is an interesting detail: Nearly all structures produce their own unique spectrum. Shouldn't an infrared spectrum then be specific enough to provide all the information needed to derive a structure? Structure elucidation with an infrared spectrum cannot be performed simply by interpreting the individual peaks one after another. As a result, we need some mechanism that is able to find patterns in infrared spectra or to use the entire spectrum as a pattern, as is done in conventional infrared software to identify a compound. If we are not able to obtain a structure from the details, we can either try to use patterns or even the entire data set for comparison. This is how infrared spectra are typically used: as patterns for structure elucidation.
The best pattern-matching machine currently available is the human brain. How else can we explain that we recognize a person by looking at them from behind? Imagine a typical situation: You walk in town along a street heading for your favorite restaurant to have your lunch break. The street is full of people, and for a short moment you see the face of someone smiling at you. The person passes by; you ponder for a moment and then turn your head to look at the crowd of people from behind. In almost any case it would be easy for you to recognize the person from the back — quite normal, isn't it? Now let us think about software that should be able to do exactly the same: recognize a person from the back while having only a picture of the person's face. Very few techniques would be able to handle such a task in a short time frame, and no conventional software would succeed. For the brain, additionally supported by parallel processing, it is not much of a problem to pick out the right person within a few seconds.
The brain works more effectively because it is not working with all the details. It does not memorize the detailed structure of the iris, nor does it count the number of hairs or look at each hair individually. The brain is working with more or less complex patterns instead; the average hair color is categorized into very simple classes, like blond, brown, black, or red; the shape of the clothes is stored as, for example, a shirt, jacket, or coat. Patterns seem to be very important for analyzing complex problems, and expert systems particularly take advantage of those and other concepts to deal with experimental data. A typical example is artificial neural networks (ANN), which are complex statistical models that leverage certain mechanisms known from the functioning of the
mammalian brain for analyzing patterns. One advantage of pattern matching is that we can try to mimic human behavior with the simplest algorithm if we are able to define the pattern appropriately. However, as we all know, our way of describing the world is not based on clear facts. Almost any result we get from our perception is subject to a certain fuzziness. For instance, gray hair may have different shadings or color tones that can be further classified. The same applies to experimental data: Almost every result from an experiment is unsharp, as chemists know well when they try to classify a peak in a spectrum. Since computer programs are strictly logical systems, which (at first sight) are unable to handle unsharp data, we need to introduce a technology that is capable of generalizing a fact in order to assign it to a specific value. These so-called fuzzy techniques, which are able to handle not just black-and-white problems but also the natural shades of gray, allow us to create systems that support the concept of decision making with fuzzy data.
Another aspect of the situation just described is this: In many cases we are not even able to specify the exact criteria for how our brain classifies things. The term elegant, for instance, would be hard to describe with a few facts. Still, if we are talking about elegance, we all seem to know what is meant. And here is another interesting point: Although it is easy to define blond hair in a series of parameters (e.g., brightness, saturation, luminance) for an image processing system, with the term elegance we are already at the next level of interpretation. We definitely know what elegance means to us, but it is obvious that there exist different opinions about being elegant. This knowledge has three aspects:
• We have knowledge about what elegance means to us.
• Each individual has different perceptions and understandings of this term.
• There is something common about elegance, or else we would not be able to use this term in our conversations.
Let us stay for a moment with the first aspect: How do we know what elegance is? Obviously this has something to do with the experience that we obtain when we learn to use such a term. We apply rules to our decision as to whether something is elegant or not. An interesting question would now be, are we able to express this knowledge in a computer program? Since we used rules to memorize the meaning, it is straightforward to define rules in the program. We have seen already that behaving conditionally is one of the strengths of a computer program; in fact, if–then constructions can be found in any programming language and most probably in every piece of software. Simple if–then rules are helpful in describing hard facts. But there are situations where we need to describe facts in a more general context, like in the aforementioned example. A concept of elegance, for instance, might be better described in a system of concepts and relationships. This is one of the available methods we will deal with later.
Another important aspect of the situation just described is the action that followed the perception: You saw the person smiling at you and decided to turn around. This decision was derived from a series of perceptions and memorized patterns. The first aspect is the smiling face; your experience tells you that this is a sign of sympathy, which causes you to be more open and also to perceive sympathy for the person.
The second aspect is the appearance, derived from the face, the clothes, and the behavior of the person. Everything that makes a person appear friendly to you is included in the decision process. You will automatically rank the facts; for instance, the eyes might be more important to you than the hair. Once all patterns are evaluated and ranked, you will summarize the outcomes. The next step is to include a certain probability (e.g., maybe the person did not smile at you but at another person) and to account for a certain fuzziness (e.g., you still cannot be sure whether there was an intent behind the smile). Finally, you will conclude with a decision as to whether to turn around or not. The inference mechanism is again something that has been trained and is affected by experience: Maybe you will not turn around because you had a bad experience with a similar situation before. External facts are also accounted for (e.g., you do not have the time to talk to the person). All of these influences have to be covered if we want to create a computer system that mimics human reasoning and decision making. An important aspect of what we have done so far is the generalization: We tried to create generic concepts, rules, and relationships to describe the situation and the course of action of the beholder.
Things in chemistry and related sciences are usually a bit simpler than the example previously given. This is due to their systematic nature. Sciences are nothing other than the attempt to create a system around observations. This systematic approach makes things a lot easier. We already have a series of rules in the sciences that can be more or less directly transformed into a rule that a computer program can interpret. A Diels-Alder reaction, for instance, is clearly defined by its discoverers as a reaction between a diene and a dienophile. We can now create a hierarchy — going from the generalized view to the detail — to describe a Diels-Alder reaction:
Transformation — Reaction — Dienophilic — Diels-Alder
The more general the point of view is (the higher the level), the less applicable a rule becomes for describing a detail. On the other hand, a more specific description lets us derive the underlying general aspects. An example is the definition of life. In a specific rule we can define life from our experience with life forms on our planet: consumption, metabolization, and growth are some commonly used criteria for life from a discrete point of view. If we define a rule on the basis of these criteria in a computer program, it would look like this:
IF a system consumes AND transforms AND grows THEN it is a life form.
Let us now query our software with the concept of fire. Fire also consumes raw material, transforms it into degradation products, and is able to grow. Our software would evaluate that fire is a life form; still, it is not understood as life. Obviously, our definition was not correct, and we have to be a bit more specific. Let us include the criterion of reproduction. Still, the software would evaluate fire as a life form,
since it is able to create offspring. However, now we are running into another issue: Members of some species, such as ant workers, do not reproduce since they are a sterile variety of the species; still, they are considered forms of life. There are now two ways to approach this problem. We can try to include all necessary constraints and exceptions that lead exactly to our expected result: Ant workers are life forms; fire is not. However, it can be foreseen that we will run into other similar situations where we have to add more and more constraints and exceptions, and at some point, we would find that we have an unreasonable number of criteria. Another approach is to generalize the definition. Probably one of the most general scientific definitions of life can be derived from the essay Chance and Necessity, written by the French biochemist Jacques Monod [2]. Monod derives two basic concepts from a biochemical point of view:
• Proteins cover controlling, organizational, and energetic aspects of biochemical transformations.
• Nucleic acids store the information required by proteins for the transformation.
If we focus on the two aspects of control and information, we are able to derive all of the specific requirements mentioned before and are able to distinguish clearly between living and nonliving systems, as in the case of fire. Interestingly, the separation of control and information leads exactly to one of the important differentiators between conventional computer programs and expert systems: the separation between the information pool (i.e., the knowledge base) and the controlling (i.e., executing) environment. This might be one of the underlying reasons why expert systems are able to behave somewhat intelligently. However, other authors are more competent to go deeper into this kind of philosophical discussion; let us return to the more practical aspects.
With these initial thoughts, we are now prepared to go into some details of the theory of expert systems. For an expert system to handle real-world problems, we need to incorporate a series of supporting technologies, in particular to deal with the following:
• Describing a fact, property, or a situation; this leads us to the topic of Mathematical Descriptors that are able to represent such information in a computer program.
• Finding, defining, and searching for patterns; Pattern Recognition and Pattern Matching are essential technologies supporting searches in large data sets.
• Handling natural fuzziness; experimental data are subject to a certain fuzziness that is addressed by Fuzzy Logic and similar techniques.
• Estimating the likelihood of properties or events; Probability Theory and related approaches let us handle the likelihood of proposed results.
• Describing knowledge in a computer system; the topic of Knowledge Representation deals with specifying knowledge rather than static facts.
• Deducing from facts and knowledge; Inference Engines allow us to apply existing knowledge to new situations.
The following chapter shows how we can approach these requirements.
References
1. Newell, A., Fairy Tales, AI Magazine, 13, 46, 1992.
2. Monod, J., Chance and Necessity: An Essay on the Natural Philosophy of Modern Biology, Alfred A. Knopf, New York, 1971.
2 Basic Concepts of Expert Systems
2.1 What Are Expert Systems?

Expert systems are computer programs that are derived from artificial intelligence (AI) research. AI's scientific goal is to understand intelligence by building computer programs that exhibit intelligent behavior. It is concerned with the concepts and methods of symbolic inference, or reasoning, by a computer and with how the knowledge used to make those inferences is represented inside the machine. However, there is a difference between conventional software algorithms and expert systems. Whereas conventional algorithms have a clearly defined result, an expert system may provide no answer or just one with a certain probability. The methodology used here is heuristic programming, and, depending on the point of view, the terms expert system and knowledge-based or rule-based system are often used synonymously. (Although there is still discussion about what distinguishes expert systems from knowledge-based systems, or which term is the correct one, we will dispense with this formality, as it is not useful in the context of understanding the principles, and will use both terms more or less synonymously.)

Expert systems are computer programs that aid an expert in making decisions about a certain problem. Knowledge-based systems can generally be defined as computer systems that store knowledge in the domain of problem solution. Human experts rely on experience as well as on knowledge. This is the reason why problem-solving behavior cannot be performed using simple algorithms. Experience can be regarded as a specialized kind of knowledge created by a complex interaction of rules and decisions. The term expert system is typically used for programs in which the knowledge base contains the knowledge used by human experts, in contrast to knowledge-based systems, which use additional information sources to solve a problem.

An expert system is software that typically operates with rules that are evaluated to predict a result for a certain input. For the generation of rules, prior knowledge about the relationship between query and output data is necessary. Inductive learning processes, typically performed by feeding the system with experimental data of high reliability, can establish such a relationship. As a consequence, expert systems comprise at least a knowledge base that contains predefined knowledge in the domain of problem solution (e.g., assignments of bands in spectra). Many expert systems contain a knowledge base in the form of a decision tree that is constructed from a series of decision nodes connected by branches. For instance, in expert systems developed for the interpretation of vibrational spectra, decision trees are typically used in a sequential manner. Similar to the interpretation of a spectrum
by an expert, decisions can be extended to global problems or restricted to special ones; a larger tree can include more characteristics of a spectrum and can be considerably more accurate in decision making. By combining other AI techniques, such as pattern recognition, fuzzy logic approaches, or artificial neural networks, many problems in chemistry can be solved automatically by expert systems.
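To make the idea of such a sequential decision tree more concrete, the following minimal C++ sketch shows how a tree for interpreting vibrational spectra could be organized. The node structure, the wavenumber ranges, and the conclusions are hypothetical and serve only to illustrate the principle of traversing decision nodes connected by branches; a real system would, of course, use validated band assignments.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A spectrum is reduced here to a list of absorption band positions (cm-1).
using Spectrum = std::vector<double>;

// Returns true if the spectrum shows a band within [low, high].
static bool hasBand(const Spectrum &s, double low, double high) {
    for (double band : s)
        if (band >= low && band <= high) return true;
    return false;
}

// A decision node asks one question about the spectrum and branches on the answer.
struct DecisionNode {
    std::function<bool(const Spectrum &)> test; // question asked at this node
    std::string conclusion;                     // used if the node is a leaf
    DecisionNode *yes = nullptr;                // branch if the test succeeds
    DecisionNode *no = nullptr;                 // branch if the test fails
};

// Sequential traversal: follow branches until a leaf is reached.
static std::string interpret(const DecisionNode *node, const Spectrum &s) {
    while (node->yes != nullptr || node->no != nullptr)
        node = node->test(s) ? node->yes : node->no;
    return node->conclusion;
}

int main() {
    // Leaves carry the (hypothetical) conclusions.
    DecisionNode ester{{}, "ester suspected"};
    DecisionNode ketone{{}, "ketone or aldehyde suspected"};
    DecisionNode alcohol{{}, "alcohol suspected"};
    DecisionNode none{{}, "no characteristic group identified"};

    // Inner nodes ask for characteristic bands (illustrative ranges only).
    DecisionNode coSingle{[](const Spectrum &s) { return hasBand(s, 1000, 1300); },
                          "", &ester, &ketone};
    DecisionNode carbonyl{[](const Spectrum &s) { return hasBand(s, 1650, 1800); },
                          "", &coSingle, nullptr};
    DecisionNode hydroxyl{[](const Spectrum &s) { return hasBand(s, 3200, 3600); },
                          "", &alcohol, &none};
    carbonyl.no = &hydroxyl;

    Spectrum query = {2950.0, 1735.0, 1240.0};              // hypothetical band positions
    std::cout << interpret(&carbonyl, query) << std::endl;  // prints "ester suspected"
    return 0;
}

A larger tree would simply contain more such nodes; the traversal logic stays the same, which is why decision trees scale well for this kind of sequential interpretation.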
2.2 The Conceptual Design of an Expert System

Let us think about how we can make a first hypothetical approach to what an expert system would constitute. A standard model in the science of knowledge management is the knowledge pyramid, which basically describes the quantitative and logical relationships among data, information, knowledge, and action (please also refer to Figure 2.1).

• Data: These represent the basis of the knowledge pyramid and are a result of measurements, observations, and calculations. They are usually available in a large amount (as the broad basis indicates) and are usually meaningless if they are disconnected from their context. The raw data points of a spectrum are a typical example: These data points are useless without additional data describing the measurement technique, parameters, conditions, and attributes, such as a sample identifier. The additional data that describe the context are usually referred to as metadata (i.e., data about data).
• Information: If we start to work on data by converting them into a different representation, analyzing them, or interpreting them, we create information. With our example of a spectrum, we could already see the step of creating a graph from the measured data as a first approach to creating information from data. A graph can be interpreted much more efficiently than the raw data points.
• Knowledge: The concept of knowledge is a bit different from the previous ones. Knowledge is something that can result from information. The continued interpretation of a spectrum graph (i.e., information) leads to a certain probability that a peak in the graph is related to a particular — usually structural — feature. This observation leads to the experience that the peak relates to a property. By validating this result with experimental methods, we gain knowledge about this relationship.
• Action: The final step in the concept of a knowledge pyramid is applying the knowledge to make decisions and to finally take certain actions. Without applying the knowledge, the entire approach is purely theoretical.

Now that we have defined the basic terms, the next logical step would be to think about the transition from data to information, knowledge, and action. Let us select three methods for turning data into information: calculation, analysis, and interpretation. Calculation and analysis can be appropriately performed using algorithms based on linear — or imperative — programming. There is also a series of software algorithms available for the automatic interpretation of data; however, in many cases a human expert is still required to validate the results. In contrast to that, the transition of
Figure 2.1 The relationship among data, information, knowledge, and action is described by the knowledge pyramid. The transition from data to information is performed by calculation, analysis, and interpretation performed by computer software, whereas in the transition from information to knowledge, human interpretation, experience, and intuition play a major role. The final step of reasoning leads to actions to be taken. Amount, context, and patterns can be considered as theoretical factors; in contrast to that, real data have to be described by concepts of certainty, probability, and fuzziness.
information to knowledge and action can barely be solved by conventional software products. Here the subject-matter expert (SME, or simply expert) comes into play. For instance, the concept of experience plays a significant role in the interpretation of data and in gaining knowledge from information. After a certain amount of training improves the experience and knowledge, another term evolves: intuition. A spectroscopist might be able to intuitively derive from a look at a spectrum whether a compound type is probable or not. He does this by recognizing a pattern within part of the spectrum or even in the entire spectrum. Finally, a human expert performs reasoning by combining the available information with his experience, his intuition, and his knowledge to draw conclusions and to finally take the appropriate actions.
How can we approach these issues in a computer program? The first step is to have a look at the factors and constraints that play a role in the transitions. Starting again with data, the first factor we have to take into account is the context. Data outside of their context and without metadata describing their meaning are basically useless. A second factor is the amount. In particular with computers, data are produced in such quantities that only computer software can handle them appropriately. What is needed here are algorithms that allow us to reduce the amount of data to the important facts that can be further analyzed. One approach to analyzing data is to find patterns, similarities, and correlations. Statistical analysis and certain methods developed in the field of AI, or soft computing, can be easily implemented in computer software to perform these tasks more or less automatically. However, we have to face the fact that most data come from experimental observations. The results from these observations rarely follow a linear and straightforward model; experimental data exhibit a certain fuzziness, usually as a result of the
complexity of the observed system. It is particularly interesting that the concept of probability has gained so much use in a scientific area like quantum physics, which seems at first sight to be an exact science. A quantum physicist has to deal with systems of such complexity that well-defined states are rarely found in observations and probabilities have to be used instead. Probabilities are used to describe the likelihood of events, usually based on the frequency of their appearance. A similar — but nevertheless different — factor is the concept of certainty. Only data, information, and knowledge of sufficient certainty are adequate for deriving a decision that finally triggers an action. Taking these factors into account, we can now try to investigate the elementary concepts of expert systems.
2.3 Knowledge and Knowledge Representation

The term expert system is often used synonymously in the literature with the term knowledge-based system to point out a more general approach. Knowledge-based systems use not only human expertise but also other available information sources. However, expert or knowledge-based systems are different from other computer information systems. Knowledge engineering is one aspect of information system engineering, but it focuses on the epistemological status of information that is defined as knowledge. Instead of representing knowledge in a static way, rule-based systems represent knowledge in terms of rules that lead to conclusions. A simple rule-based system consists of a set of if–then rules, a collection of facts, and an interpreter controlling the application of the rules by the given facts. Other important knowledge representation techniques are frames and semantic networks [1].
2.3.1 Rules

Rule-based programming is one of the most commonly used techniques for developing expert systems. In this programming paradigm, rules are used to represent heuristics, which specify a set of actions to be performed for a given situation. A rule is composed of an if-portion and a then-portion. The if-portion of a rule is a series of patterns that specify the facts, or data, that cause the rule to be applicable. The process of matching facts to patterns is called pattern matching. A simple if–then rule in a login procedure can look like the following:

if (userName = "Administrator") then grantAccess

In this case, access is granted only if the user logs in as administrator; the decision is based on a single condition. A combination of two conditions leading to a decision can look like the following:

if (userName = "Administrator" AND password = "admin") then grantAccess
Here, the AND combines both facts and leads to a positive result only if both the first and the second fact are true. By combining more facts, complex rule sets can be set up (the braces indicate groups of commands):

if (userName = "Administrator") then
{
    if (password = "admin") then
    {
        if (systemStatus = "online") then
            grantAccess
        else
            showMessage("System is offline")
    }
    else
        showMessage("Password is wrong")
}
else
    denyAccess

These program statements are hard coded; that is, each rule is placed directly in the program code. This is how reasoning would look in a conventional program. As mentioned before, an important differentiator between conventional programs and expert systems is the separation of rules from the program logic.

A somewhat more complex example — still developed in a conventional programming language — is shown in the following. This is the C++ code of a function that evaluates whether a ring system in a compound is aromatic or not, based on the general definition that an aromatic system is characterized by coplanarity, cyclic conjugation, and the Hückel rule (the number of π-electrons is 4n + 2, where n is any positive integer beginning with zero):

bool Molecule::AromaticRing(const Ring &ring)   // Ring is assumed to hold the ring's atom data
{
    // Hückel rule: the number of π-electrons must be 4n + 2
    if ((ring.piElectrons - 2) % 4 != 0)
        return (false);

    for (int i = 0; i < ring.nAtoms - 2; i++)
    {
        // coplanarity: x-coordinates of neighboring atoms must match
        if (ring.coordX[i] != ring.coordX[i + 1])
            return (false);
        // cyclic conjugation: neighboring bond orders must alternate
        if (ring.bondOrder[i][i + 1] == ring.bondOrder[i + 1][i + 2])
            return (false);
    }
    return (true);
}

We will not go into all details of this C++ code but will focus on the evaluation part. The function Molecule::AromaticRing requires an input of the ring atoms (ring) and returns a Boolean value (bool) of true or false. The first statement checks whether the Hückel rule applies or not; the system returns false (and leaves the function) if the modulus of π-electrons – 2 divided by 4 is not zero. The second and third
evaluations take place within the loop (the for statement) that runs over the atoms of the ring. The second if statement returns false if the x-coordinates of any atom pair are unequal; in this case the system would not be planar. The third if statement evaluates to false if the bond orders of neighboring bonds are equal — that is, no cyclic conjugation appears. If the function is never left by any of the return (false) statements, it finally returns true in the last statement.

This example shows how a rule could be implemented in a conventional program; however, a rule-based system would typically just include the statement for calling the rule rather than the inherent logic. For instance, consider the following:

(rule AromaticRing
    if (ring aromatic)
    then (stability = high)
)

This rule evaluates whether a ring system is aromatic and, if true, asserts a high stability for the compound without showing how the evaluation is actually achieved. This is typical for so-called declarative programming languages, whereas the first example is written in an imperative language that includes the evaluation logic. Declarative languages are also easier to understand since they are more descriptive. We will have a closer look at these languages later in this book. What we need to know right now is that in an expert system the evaluation of if (ring aromatic) may call the C++ function shown before; however, both pieces of code will reside in separate modules — that is, rules and evaluation logic are separated. This allows the rule base to be changed and extended — using a more descriptive language — without the need to change the underlying imperative logic, which is written in a language that is harder to understand. These types of if–then rules are most often used for rule bases; however, there are other forms of knowledge representation in a computer.
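The following C++ sketch illustrates this separation in a very reduced form: the declarative side only names a condition and a conclusion, while the imperative side registers named condition functions that the engine looks up at run time. The Molecule and Ring stand-ins, the registry, and the rule structure are hypothetical simplifications, not the interface of any particular expert system shell.

#include <functional>
#include <iostream>
#include <map>
#include <string>

// Hypothetical, heavily simplified molecule data; stands in for the class used above.
struct Ring { int piElectrons = 6; };
struct Molecule {
    Ring ring;
    bool AromaticRing(const Ring &r) const { return (r.piElectrons - 2) % 4 == 0; }
};

// The "imperative" side: named condition functions registered once.
using Condition = std::function<bool(const Molecule &)>;
std::map<std::string, Condition> conditionRegistry = {
    {"ring aromatic", [](const Molecule &m) { return m.AromaticRing(m.ring); }}
};

// The "declarative" side: a rule only names a condition and a conclusion.
struct Rule { std::string condition; std::string conclusion; };

int main() {
    Rule aromaticRule{"ring aromatic", "stability = high"};
    Molecule benzeneLike;   // hypothetical input

    // The engine looks up the condition by name; the rule itself stays free of C++ logic.
    if (conditionRegistry.at(aromaticRule.condition)(benzeneLike))
        std::cout << "Rule fired: " << aromaticRule.conclusion << std::endl;
    return 0;
}

Because the rule refers to the condition only by name, the rule base can be edited in a descriptive language while the registered functions remain untouched — exactly the separation discussed above.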
2.3.2 Semantic Networks

Semantic networks are a specialized form of knowledge representation in a computer. The basis is a formal model describing concepts and their relations in a directed graph; concepts are represented by nodes that are connected by their relations. The term semantic network was proposed by Ross Quillian in his Ph.D. thesis in 1966 and was published a year later [2]. As a linguist, Quillian was looking for a way to represent the meaning of words in an objective manner as a psychological model for the architecture of human semantic knowledge. The basic assumption is that the meaning of a word can be represented by a series of verbal associations. A semantic network can simply be constructed by connecting concepts with one or more relational terms. As an example, let us look at the concepts and relations in Figure 2.2. Starting with the concept molecule, we can define a relationship to structure with the relational term consists of. Following the route, we can relate structure to substructure and to functional group. Here we introduce the new relational term defines to describe the relationship between functional group and reactivity. As we follow through all cross-references, we build up a complex picture of the concept of molecule and its relations to other concepts, like biological activity.
Figure 2.2 A semantic network of a molecule describing the relationship between entities with the relational terms consists of or defines and their attributes with the term has. Following the route from the molecule we can relate structure to substructure, functional group, and atom. The atom entity provides attributes, such as partial charge or polarizability, which finally are part of the definition of biological activity.
The semantic network reflects the way human knowledge is structured, and the knowledge of a chemist can be represented by creating such a graph of concepts and relationships; getting from a molecule to its biological activity is defined in such a graph. This generic approach can be used for any specific instance of a molecule and not only allows storing concepts and relationships in an effective manner but also opens the way for deductive reasoning by systematically traversing the nodes of the graph. Semantic networks have several features that make them particularly useful [3,4]:

• They define hierarchy and inheritance between concepts easily in a network format.
• Their network structure can be dynamically adapted to new information.
• Their abstract relationships represent conclusions and cause–effects and allow for deductive reasoning.
• They are an effective and economic approach for storing and retrieving information.

Semantic networks represent ontologies, and the inherent knowledge can be retrieved by using automatic reasoning methods [5]. An ontology in computer science is a data model that represents a set of concepts within a domain and the relationships between those concepts. Semantic networks have been successfully applied in various applications in bioinformatics.
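As a minimal illustration of how such a network can be stored and traversed in software, the following C++ sketch encodes a fragment of the network from Figure 2.2 as an adjacency list of labeled relations and derives, by simple recursive traversal, everything a molecule consists of. The data structures and the traversal strategy are illustrative assumptions; real ontology tools use far richer models.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// An edge in the semantic network: a relational term pointing to another concept.
struct Relation { std::string term; std::string target; };

// The network is stored as an adjacency list: concept -> outgoing relations.
using SemanticNetwork = std::map<std::string, std::vector<Relation>>;

// Collect every concept reachable from 'start' via the given relational term
// (a very simple form of deductive traversal).
static void reachable(const SemanticNetwork &net, const std::string &start,
                      const std::string &term, std::set<std::string> &found) {
    auto it = net.find(start);
    if (it == net.end()) return;
    for (const Relation &r : it->second) {
        if (r.term == term && found.insert(r.target).second)
            reachable(net, r.target, term, found);
    }
}

int main() {
    // A fragment of the network shown in Figure 2.2.
    SemanticNetwork net = {
        {"molecule",         {{"consists of", "structure"}}},
        {"structure",        {{"consists of", "substructure"}}},
        {"substructure",     {{"consists of", "functional group"}}},
        {"functional group", {{"consists of", "atom"}, {"defines", "reactivity"}}},
        {"atom",             {{"has", "partial charge"}, {"has", "polarizability"}}}
    };

    std::set<std::string> parts;
    reachable(net, "molecule", "consists of", parts);
    for (const std::string &p : parts)
        std::cout << "molecule consists of: " << p << std::endl;
    return 0;
}

The same traversal with other relational terms (defines, has) would let the program deduce, for instance, which properties ultimately contribute to biological activity.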
One of the most well-known ontologies is hosted by the Open Biomedical Ontologies (OBO) consortium and is called the Gene Ontology project. This project provides a vocabulary to describe gene functions and gene product attributes for organisms, as well as mammalian phenotypes and cell types [6–9]. Gene Ontology organizes biological terms into three directed acyclic graphs that cover cellular component, molecular function, and biological process, each of which consists of concepts connected by is a or part of relationships. Since 2007 the biological process ontology incorporates a complete tree where each concept in the hierarchy has at least one is a relationship to the top node. The following example shows the definition of a concept (called term in the Gene Ontology project):

[Term]
id: GO:0000079
name: regulation of cyclin-dependent protein kinase activity
namespace: biological_process
def: "Any process that modulates the frequency, rate or extent of CDK activity." [GOC:go_curators]
synonym: "regulation of CDK activity" EXACT []
is_a: GO:0000074 ! regulation of progression through cell cycle
is_a: GO:0045859 ! regulation of protein kinase activity

Each term requires a unique identifier (id), a name, as well as at least one is_a relationship to another term. Additional data are optional: in the example, a namespace, a definition, and a synonym. The namespace modifier, for instance, allows the definition of a group in which the term is valid and can also be defined for other modifiers, like the is a relationship. The current project contains more than 20,000 terms for biological processes, cellular components, and molecular functions. The ontologies can be downloaded in different formats, including eXtensible Markup Language (XML) and a schema for the open-source relational database MySQL.

A series of other projects is available from Open Biomedical Ontologies. Among these, the Generic Model Organism Project (GMOD) comprises several organism databases and is intended to develop reusable components suitable for creating new databases for the biology community. The Microarray Gene Expression Data (MGED) project covers microarray data generated by functional genomics and proteomics experiments. The Sequence Ontology (SO) is a part of the Gene Ontology project for developing ontologies for describing biological sequences. An overview of recent applications of semantic networks in bioinformatics is given by Hsing and Cherkasov [10].
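Since the OBO flat-file format is line oriented, even a small program can read such term definitions. The following C++ sketch parses a simplified [Term] stanza into an in-memory record; it is a minimal, hypothetical reader for illustration only and ignores most of the tags and conventions handled by the official Gene Ontology tools.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// A minimal in-memory representation of an OBO [Term] stanza.
struct GoTerm {
    std::string id;
    std::string name;
    std::vector<std::string> isA;   // ids of parent terms
};

// Parse a (simplified) OBO flat file from a stream; tag-value lines are "key: value".
static std::map<std::string, GoTerm> parseObo(std::istream &in) {
    std::map<std::string, GoTerm> terms;
    GoTerm current;
    bool inTerm = false;
    std::string line;
    while (std::getline(in, line)) {
        if (line == "[Term]") {                              // start of a new stanza
            if (inTerm && !current.id.empty()) terms[current.id] = current;
            current = GoTerm();
            inTerm = true;
        } else if (inTerm) {
            auto pos = line.find(": ");
            if (pos == std::string::npos) continue;
            std::string key = line.substr(0, pos);
            std::string value = line.substr(pos + 2);
            if (key == "id") current.id = value;
            else if (key == "name") current.name = value;
            else if (key == "is_a")                          // keep only the parent id
                current.isA.push_back(value.substr(0, value.find(' ')));
        }
    }
    if (inTerm && !current.id.empty()) terms[current.id] = current;
    return terms;
}

int main() {
    std::istringstream obo(
        "[Term]\n"
        "id: GO:0000079\n"
        "name: regulation of cyclin-dependent protein kinase activity\n"
        "is_a: GO:0000074 ! regulation of progression through cell cycle\n"
        "is_a: GO:0045859 ! regulation of protein kinase activity\n");
    auto terms = parseObo(obo);
    for (const auto &parent : terms["GO:0000079"].isA)
        std::cout << "GO:0000079 is_a " << parent << std::endl;
    return 0;
}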
2.3.3 Frames

Frames are a technique for representing knowledge in a modular way that is ideally suited for object-oriented programming. The term frames was first proposed by Marvin Minsky in the early 1970s [11].
Figure 2.3 A frame-based representation of a compound. The frame compound includes so-called slots that define attributes and relationships between multiple frames. The example shows how the compound’s slots application and risk are connected to specific slots in the generic frames applications and risks.
Frames represent an object as a group of attributes and relationships (Figure 2.3). Each relationship is stored in a separate slot. The contents of a slot may be data types, functions, or procedures. Frames may be linked to other frames, providing inheritance. Frames have the following general form:

(frame framename
  (type frametype)
  (attributes
    (slots
      (slotname slotvalue)
      (slotname slotvalue)
      ...
    )
  )
)

A specific instance of a frame describing a solvent might look like the following:

Frame         Dimethylformamide
Application   Solvent
Risk          Flammable
Appearance    Clear liquid
Availability  Catalog

Fields named Application, Risk, Appearance, and Availability are called slots; they represent the relations between frames. Frames are similar to object-oriented programming; however, there is an important difference between a frame and an
object-oriented approach. In the object-oriented paradigm, objects are constructed in a hierarchical manner that focuses on the data exchange between specific instances of classes. They are constructed dynamically but follow a procedural approach focusing on messaging between instances. Frames represent a declarative point of view that is more or less static and focuses on the relationships between frames. However, both concepts rely on inheritance — that is, the capability of deriving a frame, or class, from another. Frames are created taking into account expectations that are typical for the human notion of information. For a chemist, for instance, the term compound would raise the question of its application rather than of its systematic name. Slots can represent variables of different types (e.g., integer, real, symbol, interval) or functions (e.g., procedure, link) used in basic algorithms. The majority of systems using case-based reasoning, which is described later in this chapter, work with frames.
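A frame and its inheritance link can be sketched in a few lines of C++. In the following, a frame is reduced to a name, an optional parent, and a map of slot values; slot lookup falls back to the parent frame, which mimics the inheritance of default values. The slot names and values follow the dimethylformamide example above, but the generic Solvent parent frame and the small API are assumptions made for illustration.

#include <iostream>
#include <map>
#include <string>

// A frame holds named slots and may inherit from a parent frame.
struct Frame {
    std::string name;
    const Frame *parent = nullptr;              // inheritance link
    std::map<std::string, std::string> slots;   // slot name -> slot value

    // Slot lookup falls back to the parent frame (inheritance).
    std::string get(const std::string &slot) const {
        auto it = slots.find(slot);
        if (it != slots.end()) return it->second;
        return parent ? parent->get(slot) : "";
    }
};

int main() {
    // Generic frame for solvents; specific compounds inherit its default slots.
    Frame solvent{"Solvent", nullptr,
                  {{"Application", "Solvent"}, {"Availability", "Catalog"}}};
    Frame dmf{"Dimethylformamide", &solvent,
              {{"Risk", "Flammable"}, {"Appearance", "Clear liquid"}}};

    std::cout << dmf.name << " / Application: " << dmf.get("Application")
              << " / Risk: " << dmf.get("Risk") << std::endl;
    return 0;
}

The dimethylformamide frame answers the Application query through its parent — the same behavior a chemist would expect when asking what a compound is used for.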
2.3.4 Advantages of Rules

There are several advantages to using rule engines in expert systems.

2.3.4.1 Declarative Language

The key advantage of rule engines is the declarative language. Declarative programming allows expressing problems in an easy and understandable way. Rules are much easier to read and verify than the usual code in a programming language. Since rule systems are capable of solving very complex problems, they provide solutions in a simple manner, reducing the complexity of a given problem to a simple set of answers or even just a single one. This simplified representation often makes it much easier to explain why a decision was made than it is with results from other AI systems like neural networks. By creating domain-specific languages that model the problem domain, rules can look very close to natural language. They lend themselves to logic that is understandable to domain experts who may be nontechnical, like auditors.

2.3.4.2 Separation of Business Logic and Data

Conventional object-oriented programming approaches tend to couple data with business logic. These entities are clearly separated in rule systems; business logic is much more easily maintained and managed as a separate entity. This is particularly important for environments where business rules change on a regular basis, which is in fact typical for business areas in the scientific field. Typical examples are the regulations that authorities like the U.S. Food and Drug Administration (FDA) impose on the pharmaceutical industry in drug development.

2.3.4.3 Centralized Knowledge Base

Storing business rules separately from data supports the creation of a centralized knowledge base: a single point of authentic knowledge, which can be used at multiple sites and on different occasions, reducing maintenance and validation effort and costs.
Auditing can be done more effectively at a single site, and decisions and decision paths are easy to explain since conclusions of the inference engine can be logged.

2.3.4.4 Performance and Scalability

Pattern-matching algorithms are usually fast compared with complex statistical models or AI approaches. In particular, Rete and its performance-improved variation Leaps recall previous matches and are quite efficient as long as no major changes occur in the input data sets.
2.3.5 When to Use Rules

In any situation where no conventional programming approaches exist to solve the problem, rule engines might be an alternative. Here are some typical reasons to decide for a rule-based approach.

• Conventional algorithms, commercial software, or solution approaches are not available. Scientific problems often arise from new analytical results and insights. Many of the problems resulting from new findings are rarely covered by the literature, not to mention software algorithms. Rule-based systems are one of the alternatives for an easier development of expert systems.
• A conventional approach would lead to instability and undetermined results due to the complexity of the resulting program code. The more complex a problem is, the more relations have to be considered. Translating these relationships into static program code becomes tedious due to the required nested code statements. The larger the code, the higher the chance of unexpected results, which are hard to reproduce and to explain. Extracting rules from the code base allows storing them in a maintainable fashion and reduces the risk of unwanted results.
• The complexity of the problem would lead to code that is not maintainable; in particular, if rules are built in using if–then approaches, the complexity of the code may increase dramatically. If code contains more than 20 lines of if–then statements, it is usually a good idea to think about a rule-based approach.
• The business rules change regularly; updating rules in a conventional program code is tedious and time consuming. Rule-based systems allow rule changes and updates independent of the back-end program.
• The release cycle of a commercial software product is too long; if a commercial solution is used, the release cycles are usually too long to allow a fast adaptation to a change in the problem-solving process. An individual rule base can be updated much more easily, either using in-house personnel or the software provider's professional software services.
• Domain or subject matter experts are nontechnical; most domain experts are not programmers. The problem is well understood by nontechnical domain experts but is not fully understood by the programmers. Consequently, the domain expert would have to train the developer until he becomes the domain expert — or the other way around — which would
contradict any effective business model. The declarative programming of a rule base allows nontechnical staff to understand rules and to create new rules without requiring a complete software development background.

Finally, the centralized knowledge base is often a good reason for a decision toward rule-based systems. Scientific processes in an enterprise are often the same or at least very similar. Regulations concerning the use of scientific results for commercial or academic purposes are standardized all over the world. Centralizing knowledge and rules does not just make maintenance easier; it is also the basis for effective auditing and for regulatory and legal compliance.

Rule-based systems are quite effective for problem solving; however, they work dynamically in the sense of rule interpretation and updating. If a static problem is to be solved, a simple look-up table might provide better performance while retaining the ease of maintenance and updating. The following drawbacks have to be taken into account:

• Whereas individual rules are relatively simple, their logical relations and interactions within a large set of rules are usually not straightforward. Complex rule-based systems make it difficult to understand how individual rules contribute to the overall problem-solving strategy.
• Another aspect of reality that we often have to deal with is incomplete and uncertain knowledge. Most rule-based expert systems are capable of representing and reasoning with incomplete and uncertain knowledge.
• The generic rule-based expert system does not provide the ability to learn from experience. Unlike human experts, an expert system does not automatically modify its knowledge base, adapt it to a problem, or add new rules.

The amount of data is increasing exponentially. On March 11, 2002, the Chemical Abstracts Service (CAS) and REGISTRY databases contained 19,446,779 organic and inorganic substances as well as 17,787,700 sequences — more than 37 million total registrations. On March 13, 2007, five years later, the count was 31,010,591 organic and inorganic substances and 58,693,490 sequences — nearly 90 million total registrations. This is equivalent to an increase of about 60% in substances and a more than threefold increase in sequences within five years. The current numbers can be retrieved from the World Wide Web [12].

Rule-based systems are a part of a solution but do not constitute the entire solution. Problem solving in scientific areas is quite often part of a large workflow that incorporates dozens or hundreds of individual processes. A rule base is a part of an expert system, which itself is a part of an extensive process. Looking, for instance, at a drug discovery process, rule bases might be used to predict the docking behavior of a new molecule; however, they would not be able to predict adverse reactions.
2.4 Reasoning

2.4.1 The Inference Engine

In rule-based systems, an interpreter module controls the application of the rules and, thus, the system's activity. In a basic cycle of activity (i.e., the recognize–act cycle)
the system first checks to find all the rules whose conditions hold. In a second step one rule is selected, and the actions that apply to the rule are performed. The selection of a rule is based on fixed conflict-resolution strategies. The actions that are chosen lead either to a final decision or to an adaptation of the existing rule environment. By continuous repetition a final decision is made or a final state of the rule system is established.

An expert system tool provides a mechanism, the inference engine, which automatically matches facts against patterns and determines which rules are applicable. Essentially, the inference engine performs three steps at each iteration: (1) comparing the current rule to stored or given patterns; (2) selecting the most appropriate rule; and (3) executing the corresponding action. This process continues until no applicable rules remain.

The inference engine is the brain of a rule-based system; it manages a large number of facts and rules, matches rules against facts (data), and performs or delegates actions based on the outcome. Whereas simple if–then conditions suffice for small or unique data, complex data require specialized pattern-matching algorithms. Pattern-matching algorithms consider a group of data (a pattern) at the same time rather than single values. For instance, the recognition of patterns in a spectrum may be performed by comparing groups of peaks that represent characteristic structures or substructures. Patterns found in a database spectrum are mapped to peaks occurring in a query spectrum, and their similarity is assessed by various statistical and related methods [13,14].

There are two prerequisites for the application of pattern methods. First, the patterns are retrieved from calculated data, and, thus, the accuracy and reliability of a pattern in a given context must be validated. In terms of descriptors, the accuracy relies mainly on the experimental conditions or the raw data used for calculation. The second requirement is a suitable similarity measure for the comparison of patterns. Once the patterns are defined and the quality of the experimental base data is good, pattern-recognition methods are valuable. Nevertheless, if patterns change irregularly and cannot be explicitly defined, the similarity measure no longer describes the difference between query and experimental pattern even if a fuzzy logic approach is implemented. In general, systems based on the comparison of patterns can provide a series of candidates, and, finally, the expert has to decide on the target compound by experience or by using additional information.

One approach that is implemented in most expert system shells is the Rete algorithm. Rete was developed by Charles Forgy of Carnegie Mellon University [15]. It consists of a generalized logic for matching facts against rules. The rules consist of one or more conditions and a set of actions that are executed if a set of facts matches the conditions. Rete works with directed acyclic graphs (Retes) that represent the rule sets as a network of nodes, each of which represents a single pattern. New facts entering the system propagate along the network, and each node matches its pattern against the facts. When a pattern is successfully matched, the node is annotated. If all facts match all of the patterns for a given rule, the corresponding action is triggered. Rules are defined by subject-matter experts using a high-level rules language.
They are collected into rule sets that are then translated at run time into an executable Rete.
The Rete algorithm stores not just exact matches but also partial matches; this avoids reevaluating the complete set of facts when the rule base is changed. Additionally, it uses a decision-node-sharing technique, which eliminates certain redundancies.
2.4.2 Forward and Backward Chaining

From a theoretical point of view, two kinds of rule-based systems can be distinguished: forward-chaining systems and backward-chaining systems. A forward-chaining system starts with initial facts and uses the rules to draw new conclusions or take certain actions. In a backward-chaining system the process starts with a hypothesis, or target, that is to be proved, and tries to find rules that would allow concluding that hypothesis. Forward-chaining systems are primarily data driven, whereas backward-chaining systems are target driven.

Let us consider an example consisting of four rules made up of several conditions (i.e., server status, user type, user name) and actions (i.e., deny access, grant access, connect to server):

if (server = "online" AND usertype = "User") then denyAccess
if (server = "online" AND usertype = "Administrator") then grantAccess
if (user = "Jill") then usertype = "Administrator"
if (grantAccess) then connectToServer

Suppose we need to check whether the server is connected, given that the server is online and Jill tries to log on. With the forward-chaining technique, a rule that applies is searched for in the first iteration. Rule 3 applies, and the user type is defined as Administrator. In the second iteration rule 2 applies and access is granted, which finally leads to the application of rule 4 in the third iteration.

In backward chaining the rules are evaluated starting from the conclusion (i.e., the server is connected) and finding out the necessary conditions. In the first iteration, access must be granted (rule 4) to connect to the server. To grant access, the server must be online — one of the given assertions — and the user must be an administrator (rule 2). Since the online criterion is one of the given assertions, rule 3 provides the remaining condition to be fulfilled.
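A naive forward-chaining loop over these four rules can be written in a few lines of C++. In the following sketch, facts and conclusions are plain strings, a rule fires when all its conditions are present in the fact set, and the loop repeats until no new facts are derived; this is an illustrative toy engine under those assumptions, not how a production inference engine such as a Rete-based one is implemented.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A rule fires when all of its conditions are present in the fact base
// and then adds its conclusion as a new fact.
struct Rule {
    std::vector<std::string> conditions;
    std::string conclusion;
};

// Naive forward chaining: apply rules repeatedly until no new facts appear.
static std::set<std::string> forwardChain(std::set<std::string> facts,
                                          const std::vector<Rule> &rules) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Rule &r : rules) {
            bool applicable = true;
            for (const std::string &c : r.conditions)
                if (facts.count(c) == 0) { applicable = false; break; }
            if (applicable && facts.insert(r.conclusion).second)
                changed = true;   // a new fact was derived; iterate again
        }
    }
    return facts;
}

int main() {
    // The four rules from the example above, written as condition lists.
    std::vector<Rule> rules = {
        {{"server online", "usertype User"}, "denyAccess"},
        {{"server online", "usertype Administrator"}, "grantAccess"},
        {{"user Jill"}, "usertype Administrator"},
        {{"grantAccess"}, "connectToServer"}
    };

    // Initial facts: the server is online and Jill tries to log on.
    std::set<std::string> facts = {"server online", "user Jill"};
    facts = forwardChain(facts, rules);

    std::cout << "Server connected: "
              << (facts.count("connectToServer") ? "yes" : "no") << std::endl;
    return 0;
}

Running the sketch reproduces the three iterations described above: rule 3 establishes the administrator role, rule 2 grants access, and rule 4 connects the server. A backward-chaining engine would instead start from the goal fact connectToServer and search for rules whose conclusions match it.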
2.4.3 Case-Based Reasoning

Case-based reasoning (CBR) is a concept for problem solving based on solutions for similar problems. The central module of a CBR system is the case memory, which stores previously solved problems in the form of a problem description and a problem solution. One of the first comprehensive approaches to this method was described by Aamodt and Plaza in 1994 [16]. The process of case-based reasoning consists of four steps:
• Retrieval: Problems stored in the case memory are compared with a new target problem. Each case, consisting of a problem description, is compared to the new problem and assessed for its similarity.
• Reuse: The solution from the case memory is mapped to the target problem and is adapted to best fit the new target problem.
• Revision: After mapping the solution to the target problem, the new solution is validated by testing or simulation. Any necessary corrections are made and included in the new solution.
• Retaining: After successful adaptation, the new solution is stored as a new case in the case memory.

Cases can be represented in a variety of formalisms like frames, objects, predicates, semantic nets, and rules. Cases are usually indexed to allow fast and efficient retrieval. Several guidelines on indexing have been proposed by CBR researchers [17,18]. Both manual and automated methods have been used to select indices. Choosing indices manually involves deciding a case's purpose with respect to the aims of the reasoner and deciding under what circumstances the case will be useful. Indices can be automatically constructed based on several paradigms:

• Properties: Cases are indexed by properties — typically one- or multidimensional experimental or mathematical descriptors. Descriptors allow a specific definition of the case according to its relevant features.
• Differences: Indexing cases based on the differences between cases requires the software to analyze similar cases and to evaluate which features of a case differentiate it from other similar cases. The features that differentiate cases best are chosen as indices.
• Similarities: Cases are investigated for a common set of features. The features that are not common are used as indices to the cases.

Indices may also be created by using inductive learning methods — for instance, artificial neural networks — for identifying predictive features that are then used as indices. However, despite the success of many automated methods, Janet Kolodner believes that people tend to do better at choosing indices than algorithms, and therefore for practical applications indices should be chosen manually [19].

The case base is organized in a structure that allows efficient search and retrieval. Several case-memory models have been proposed. The two most widely used methods are the dynamic memory model of Schank and Kolodner [20,21] and the category-exemplar model of Porter and Bareiss [22]. The dynamic memory model uses memory organization packets (MOPs), either based on instances that represent specific cases or on abstractions representing generalized forms of instances. The category-exemplar model takes advantage of categories and semantic relations organized in a network. This network represents a background of general domain knowledge that enables explanatory support for some CBR tasks. The retrieval algorithm relies on the indices and the organization of the memory to direct the search to potentially useful cases. Several algorithms have
been implemented to retrieve appropriate cases: for example, serial search [23], hierarchical search [24], and simulated parallel search [25]. Case-based reasoning has been applied successfully in a variety of applications. Leng et al. developed a two-level case-based reasoning architecture for predicting protein secondary structure [26]. They divided the problem into two levels: (1) reasoning at the protein level and using the information from this level to focus on a more restricted problem space; and (2) reasoning at the level of internal structures decomposed as segments from proteins. Conclusions from the second level were then combined to make predictions at the protein level. An article on practical reasoning in chemistry was published by Kovac [27].
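The retrieval step can be sketched in a few lines of code. The following Python fragment is a minimal, illustrative example; the case data, descriptor names, and solutions are hypothetical and not taken from any of the systems cited above. It indexes each case by numerical property descriptors and retrieves the stored case most similar to a new target problem by a nearest-neighbor comparison; the retained case is simply appended to the case memory.

import math

# Each case couples a problem description (property descriptors) with a solution.
case_memory = [
    {"descriptors": {"mol_weight": 78.1, "log_p": 2.1}, "solution": "workup A"},
    {"descriptors": {"mol_weight": 86.2, "log_p": 3.9}, "solution": "workup B"},
]

def distance(a, b):
    # Euclidean distance between two descriptor dictionaries with identical keys.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def retrieve(target, memory):
    # Retrieval: return the stored case whose descriptors are closest to the target.
    return min(memory, key=lambda case: distance(case["descriptors"], target))

target = {"mol_weight": 92.1, "log_p": 2.7}
best = retrieve(target, case_memory)          # the reuse step would now adapt best["solution"]
print(best["solution"])

# Retaining: after revision, the adapted solution is stored as a new case.
case_memory.append({"descriptors": target, "solution": "adapted workup"})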
2.5 The Fuzzy World

Though an expert system consists primarily of a knowledge base and an inference engine, a couple of other technologies have to be used to deal with several issues that are inherent to the processing of natural data. Problem solving in expert systems is often based on experimental data. The accuracy of the data that are used for the knowledge base in an expert system is normally controlled by experienced personnel; inaccurate, diffuse, and unspecific data are eliminated manually. To process larger sets of potential source data for knowledge bases, a method must be used that takes inaccuracies as well as the natural fuzziness of experimental data into account — ideally automatically and without the help of an expert. Let us consider an example to illustrate this: John has a meeting at 8:30 a.m. at his company. He leaves his house at 8 a.m. heading toward the bus station. It takes him 5 minutes to reach the station, where the bus departs at 8:10; this will allow him to be right on time to attend the meeting. What if John leaves his house at 8:05 a.m. instead? According to logical rules, he should reach the bus in time. However, experience says that this is not certain, and he may well miss the meeting. Obviously, we think about these exact times with a certain fuzziness. Nobody would be really surprised if a dinner guest arrives at 8:01 p.m. when he has been invited to arrive at 8:00. That is why systems that can handle this kind of uncertainty are mandatory elements of an expert system. Uncertainty in expert systems can be handled with a variety of approaches: certainty factors, fuzzy logic, and Bayesian theory.
2.5.1 Certainty Factors

Certainty theory is an approach to inexact reasoning that describes uncertain information by a certainty factor [28]. Certainty factors are used as a degree of confirmation of a piece of evidence. Mathematically, a certainty factor is the measure of belief minus the measure of disbelief. Again, using an example with John, an uncertain rule could be the following:

if (JohnsDeparture = 8.00) then (JohnsAtMeeting equals 0.8)
Figure 2.4 Simple mathematical functions can be used to relate certainty factors with a corresponding outcome. Each outcome — early, on time, late — is associated with a function. Early arrival is defined for a time frame between 8:20 and 8:30, with certainty factors between 0 and 1. An input time of 8:25 with a certainty factor of 1 would result in an early arrival. The smaller the certainty factor, the wider the time frame for arrival. The same applies to the other functions.
The rule says that if John leaves the house at 8:00, the certainty that he will reach the bus in time and attend the meeting is 80%. Depending on the time frame, we can create functions that relate certainty factors with the time of arrival. Figure 2.4 shows how the aforementioned example of John's arrival can be described by means of a transfer function. Even though the concept of certainty factors is not derived from a formal mathematical basis, it is the most commonly used method to describe uncertainty in expert systems. This is mainly due to the fact that certainty factors are easy to compute and can be used to effectively reduce search by eliminating branches with low certainty. However, it is difficult to produce a consistent and accurate set of certainty factors. In addition, they are not consistently reliable; a certainty factor may produce results opposite to those of probability theory.
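A minimal Python sketch illustrates both ideas: the certainty factor as belief minus disbelief, and a simple triangular transfer function of the kind shown in Figure 2.4 that maps an arrival time to a certainty factor for the outcome "early". The time window and all numbers are illustrative assumptions.

def certainty_factor(measure_of_belief, measure_of_disbelief):
    # A certainty factor is the measure of belief minus the measure of disbelief.
    return measure_of_belief - measure_of_disbelief

def triangular(x, left, peak, right):
    # Simple transfer function: 0 outside [left, right], rising linearly to 1 at the peak.
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# "Early" arrival modeled over the 8:20-8:30 window (times as decimal hours).
cf_early = triangular(8 + 25 / 60, 8 + 20 / 60, 8 + 25 / 60, 8 + 30 / 60)
print(round(cf_early, 2))                      # 1.0: an 8:25 arrival is early with full certainty
print(round(certainty_factor(0.9, 0.1), 2))    # 0.8, comparable to the 80% certainty in John's rule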
2.5.2 Fuzzy Logic

Problem solving in computational chemistry is often based on experimental data that may be more or less accurate. Experienced scientists can control the accuracy of the data that are used for the knowledge base in an expert system; inaccurate, diffuse, and unspecific data are manually eliminated. For example, the main reasons for inaccuracies in vibrational spectra are instrumental errors, measuring conditions, and poor purity of the sample. Although most of these errors can be kept small by adequate handling of samples and instruments, random errors that lead to a loss in precision cannot be completely avoided. Moreover, errors may not necessarily concern the entire spectrum; instead, a certain important spectral range may be affected. Complex dependencies of spectral signals on the chemical environment of the corresponding substructure lead to a certain fuzziness of the spectral signal. This is the reason why strict assignment rules (e.g., a strong peak between 3000 cm–1
and 3100 cm–1 in the infrared spectrum proves the presence of at least one benzene ring) cannot be applied in most cases. Accuracy checks are generally performed by comparing measured data with data from certified reference materials. When measured data are not accurate because of relative or systematic errors, or a lack of precision (noise), the comparison between measured data and reference values cannot lead to any useful conclusion in an expert system. To process larger sets of potential source data for knowledge bases, a method must be used that takes inaccuracies as well as the natural fuzziness of experimental data into account — ideally automatically and without the help of an expert. Problems of uncertainty and inaccuracy can be addressed by using statistical and stochastic methods that have been described before [29,30]. The fuzzy logic approach provides a mathematical framework for the representation and calculation of inaccurate data in AI methods [31,32]. Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the values between exactly true and exactly false. The general principle in fuzzy logic is that a reference value x0 is associated with a fuzzy interval dx, and experimental data within an interval of x0 ± dx are identified as reference data. Since natural, or experimental, data are always inaccurate, and the representation of knowledge is quite similar to that in fuzzy logic, expert systems have to use fuzzy logic or similar techniques [33]. In a computer system based on the fuzzy logic approach, fuzzy intervals for reference values are defined a priori. The term fuzzy logic leads to the association of a logic that is fuzzy itself. In fact, the term concerns the logic of fuzziness — not a logic that itself is fuzzy. Fuzzy sets and logic are used to represent uncertainty, which is crucial for handling natural systems. A fuzzy expert system is an expert system that uses a collection of fuzzy membership functions and rules to reason about data. Unlike conventional expert systems, which are mainly symbolic reasoning engines, fuzzy expert systems are oriented toward numerical processing. However, there is another aspect of using fuzzy logic in the context of expert systems. The conclusion of a fuzzy rule requires converting the result back into objective terms before it can be used in a nonfuzzy method. If we describe John's probability of arriving on time by using fuzzy logic, the result from the inference engine might be high, medium, or low. If we want to use this result in an algorithm that calculates the most probable arrival time, the conclusions from the inference engine have to be translated to a real number. This process is called defuzzification. Defuzzification is done by mapping the confidence factors that result from a fuzzy outcome onto the membership function. The defuzzified value can then be calculated by different methods, like the center of gravity, the smallest or largest maximum, or the mean of maximum of the function.
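A brief Python sketch of these two steps follows; the reference value, the width of the fuzzy interval, and the sampled output values are illustrative assumptions rather than data from the text. The first function grades how well a measured value matches a reference within a fuzzy interval x0 ± dx; the second performs center-of-gravity defuzzification over a set of (value, membership) pairs.

def fuzzy_match(x, x0, dx):
    # Degree of membership of a measured value x relative to a reference x0:
    # 1.0 at x0, decreasing linearly to 0.0 at the borders of the interval x0 +/- dx.
    return max(0.0, 1.0 - abs(x - x0) / dx)

# A measured IR band at 3085 cm-1 compared with an assumed 3050 cm-1 reference, dx = 50 cm-1.
print(round(fuzzy_match(3085.0, 3050.0, 50.0), 2))    # 0.3

def defuzzify_centroid(samples):
    # Center-of-gravity defuzzification over (value, membership) pairs.
    numerator = sum(value * membership for value, membership in samples)
    denominator = sum(membership for _, membership in samples)
    return numerator / denominator if denominator else 0.0

# Fuzzy outcome "arrival time" sampled over a time axis with memberships from the rules.
arrival = [(8.35, 0.2), (8.40, 0.7), (8.45, 0.9), (8.50, 0.4)]
print(round(defuzzify_centroid(arrival), 3))          # ~8.434, a single crisp arrival time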
2.5.3 Hidden Markov Models

Hidden Markov models (HMMs) are statistical models based on the Markov property [34]. A stochastic process has the Markov property if the conditional probability
distribution of future states is independent of the path of the process. A typical example is a dice game: Each step in the process of a dice game is independent of former states of the game. In contrast to that, each turn in a card game depends on previously played cards. HMMs are statistical models for systems that behave like a Markov chain, which is a discrete-time stochastic process with the Markov property. A Markov chain is a sequence of events, the probability of each of which depends only on the event immediately preceding it. An HMM represents stochastic sequences as Markov chains where the states are not directly observed but are associated with a probability density function. The generation of a random sequence is then the result of a random walk in the chain. HMMs are dynamic models, in the sense that they are specially designed to account for some macroscopic structure of the random sequences. In statistical pattern recognition, random sequences of observations have traditionally been considered as the result of a series of independent draws from one or several Gaussian densities. To this simple statistical modeling scheme, an HMM adds the specification of some statistical dependence between the Gaussian densities from which the observations are drawn. In a hidden Markov model the states are less simple, and instead of having a single known outcome they can have several possible outcomes. The model is called hidden because the outcome of any given state is uncertain. HMMs have been used successfully in a number of bioinformatics applications, such as the modeling of proteins and nucleic acids or the quantitative analysis of biological sequence data using statistical approaches [35,36].
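A minimal Python sketch of such a model follows. The two hidden states, the observation alphabet, and all probabilities are purely illustrative assumptions; the forward recursion computes the probability of an observed sequence summed over all hidden state paths.

# Hypothetical two-state HMM: hidden residue states emitting observable residue classes.
states = ["helix", "coil"]
start = {"helix": 0.6, "coil": 0.4}
transition = {"helix": {"helix": 0.7, "coil": 0.3},
              "coil":  {"helix": 0.4, "coil": 0.6}}
emission = {"helix": {"hydrophobic": 0.8, "polar": 0.2},
            "coil":  {"hydrophobic": 0.3, "polar": 0.7}}

def forward(sequence):
    # Forward algorithm: probability of the observation sequence under the model.
    alpha = {s: start[s] * emission[s][sequence[0]] for s in states}
    for observation in sequence[1:]:
        alpha = {s: emission[s][observation] *
                    sum(alpha[p] * transition[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

print(round(forward(["hydrophobic", "hydrophobic", "polar"]), 4))   # ~0.14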
2.5.4 Working with Probabilities — Bayesian Networks

Bayesian networks are based on Bayes' theorem, which gives a mathematical framework for describing the probability of an event that may have been the result of any of two or more causes [37]. The question is this: What is the probability that the event was the result of a particular cause, and how does it change if the cause changes? Bayesian networks are statistical models for describing probabilistic dependencies for a set of variables. They trace back to an eighteenth-century theorem found by Thomas Bayes, who first established a mathematical basis for probability inference [38]. Bayes' theorem is based on two different states:
P(A|B) = P(B|A) P(A) / P(B)    (2.1)
This equation describes the probability P of state A existing given state B. To calculate this probability, Bayes used the probability of B existing given that A exists, multiplied by the probability that A exists, and normalized by the probability that B exists. This admittedly complicated explanation can be interpreted as follows: For an existing state B, what is the probability that state B is caused by state A? The importance of this theorem is that probabilities can be derived without specific knowledge of P(A|B), if information about P(B|A), P(A), and P(B) is available.
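A small worked example in Python makes Equation 2.1 concrete. The numbers are illustrative assumptions rather than values from the text: A stands for "John left the house at 8:00" and B for "John reached the meeting on time".

def bayes(p_b_given_a, p_a, p_b):
    # Equation 2.1: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

p_b_given_a = 0.8   # assumed probability of being on time given an 8:00 departure
p_a = 0.5           # assumed prior probability that John leaves at 8:00
p_b = 0.6           # assumed overall probability of being on time

# Probability that John left at 8:00, given that he reached the meeting on time.
print(round(bayes(p_b_given_a, p_a, p_b), 3))   # 0.667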
The essence of the Bayesian approach is to provide a mathematical rule explaining how the belief in a hypothesis changes in light of new evidence [39]. Back to John for an example: John has to reach his bus not only once, but every morning. The experience he gains by leaving his house at different times each day affects his sense of the probability of reaching his bus. In a Bayesian analysis, a set of observations should be seen as something that changes opinion. In other words, Bayesian theory allows scientists to combine new data with their existing knowledge or expertise. A Bayesian network can be used to model the dependencies between variables that directly influence each other, which are usually few. The rest of the variables are assumed conditionally independent. A Bayesian network is a directed graph in which each node is annotated with quantitative probability information. It is constructed by selecting a set of variables that define the nodes of the network. The nodes are connected via directed links that indicate their parent-child relationships, and each node has a conditional probability distribution that quantifies the effect of the parents on the node. Dynamic Bayesian networks (DBNs) are directed graph models of stochastic processes. They generalize HMMs by representing the hidden and observed state in terms of state variables, which can have complex interdependencies. The graphical structure provides an easy way to specify these conditional independencies and, hence, to provide a compact parameterization of the model. Although this approach has a clear mathematical basis for inexact reasoning, it requires the gathering of data about all possible states of a problem, which is rarely feasible in practice. One of the most noted expert systems developed using a Bayesian approach for inexact reasoning is PROSPECTOR, a system designed for mineral exploration [40]. F. V. Jensen provides a complete overview of Bayesian networks and decision graphs [41].
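The following Python sketch shows the smallest possible version of such a network, a single directed link Departure -> OnTime, with an assumed prior for the parent node and an assumed conditional probability table for the child; none of the numbers come from the text. Marginalization gives P(OnTime), and Bayes' theorem is then used to reason against the direction of the arc.

# Hypothetical two-node Bayesian network: Departure -> OnTime.
p_departure = {"8:00": 0.7, "8:05": 0.3}      # prior distribution of the parent node
p_ontime_given = {"8:00": 0.8, "8:05": 0.4}   # conditional probability table of the child

# Marginalization: P(OnTime) = sum over parent values d of P(OnTime | d) * P(d).
p_ontime = sum(p_ontime_given[d] * p for d, p in p_departure.items())
print(round(p_ontime, 2))                     # 0.68

# Diagnostic reasoning with Bayes' theorem: P(Departure = 8:00 | OnTime).
print(round(p_ontime_given["8:00"] * p_departure["8:00"] / p_ontime, 2))   # 0.82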
2.5.5 Dempster-Shafer Theory of Evidence

The Dempster-Shafer theory, also known as the theory of belief functions, is a generalization of the Bayesian theory of subjective probability [30,42]. Whereas the Bayesian theory requires probabilities for each question of interest, belief functions allow us to base degrees of belief for one question on probabilities for a related question. These degrees of belief may or may not have the mathematical properties of probabilities; how much they differ from probabilities will depend on how closely the two questions are related. The Dempster-Shafer theory is based on two ideas: (1) the idea of obtaining degrees of belief for one question from subjective probabilities for a related question; and (2) Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. To illustrate the idea of obtaining degrees of belief for one question from subjective probabilities for another, suppose John has subjective probabilities for the reliability of his alarm clock. Let us say the probability that his alarm clock shows the right time is 0.9 and the probability that it is unreliable is 0.1. Suppose he will get to his bus if he leaves his house at 8:00 in the morning. This statement, which is true if his alarm clock is reliable, is not necessarily false if the
clock is unreliable. So the reliability of the clock alone justifies a 0.9 degree of belief that he will reach the bus but only a zero degree of belief (not a 0.1 degree of belief) that he will miss the bus. This zero does not mean that he will always reach the bus in time, as a zero probability would; it merely means that there is no reason for John to believe that he will miss the bus. The 0.9 and the zero together constitute a belief function. In summary, we obtain degrees of belief for one question (i.e., Will John reach the bus in time?) from probabilities for another question (i.e., Is the alarm clock reliable?). Dempster's rule begins with the assumption that the questions for which we have probabilities are independent with respect to our subjective probability judgments, but this independence is only a priori; it disappears when conflict is discerned between the different items of evidence.
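Dempster's rule itself can be sketched in a few lines of Python. The first mass function encodes the alarm-clock evidence from the example (0.9 committed to "reach", 0.1 left uncommitted to the whole frame); the second mass function is an assumed, independent item of evidence introduced here only to show the combination step.

from itertools import product

FRAME = frozenset({"reach", "miss"})

def combine(m1, m2):
    # Dempster's rule of combination for mass functions over frozenset focal elements.
    combined, conflict = {}, 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b              # mass assigned to disjoint sets
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}

m_clock = {frozenset({"reach"}): 0.9, FRAME: 0.1}    # evidence from the alarm clock
m_bus = {frozenset({"reach"}): 0.7, FRAME: 0.3}      # assumed second item of evidence

for focal, mass in combine(m_clock, m_bus).items():
    print(sorted(focal), round(mass, 2))             # belief in "reach" rises to 0.97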
2.6 Gathering Knowledge — Knowledge Engineering

Ultimately, one of the most important problems in dealing with expert systems is the stage of getting the knowledge of a human expert into the expert system. The subject of this task is knowledge engineering [43]. Knowledge engineering is still the bottleneck in the development of expert systems. It is difficult for a chemist to explicitly state the knowledge he is using in a form that is suitable for computational processing. For a spectroscopist it would be hard to write down his experience in interpreting nuclear magnetic resonance spectra on a sheet of paper; a synthesis chemist would have the problem of defining the basic reasoning for his retrosynthesis approach in if–then rules; and a metabonomics expert would not be able to define the rules for his experience when he is proposing a metabolic pathway for a drug based on hundreds of individual results. One of the reasons that a scientist currently working on a problem would have difficulty explicitly stating his knowledge is that information for rules has to be abstracted, and this is a challenging task with a specific problem in mind. Three main questions arise from this problem:

(1) How to derive a rule from the intuitive decision-making skills of an expert: This problem can be addressed by allowing the expert himself to use the software without requiring special technical or development skills. Declarative languages are the first step in this direction. Intuitive and easy-to-use graphical user interfaces can support the expert considerably with this task. Although a few methods for guiding rule entry have been proposed, the problem of intuitive knowledge remains.

(2) How to assess the logic of the acquired rules for conflicts, overlaps, and gaps: The logic for conflict handling is closely related to the subject of interest. In many cases, only the expert is able to find the reason for a conflict, which usually emerges during validation of the system. However, software can again assist the expert by presenting the conflict in an understandable way that is easy for the expert to evaluate.

(3) How to generalize the explicit knowledge for reuse in similar cases: The only solution to this issue is to separate the development of the rule base from a specific problem.
In a knowledge engineering process, the scientist would be interviewed and posed representative problems. Based on his responses, the knowledge he applies needs to be understood and encoded in the form of the knowledge representation used: the rules. In a second stage, the scientist would then need to validate the programmed rules to make sure that the outcome from the expert system is valid in the problem domain. This cycle needs continuous repetition until the system performs in the desired manner. Depending on the complexity of the domain, knowledge engineering could take anywhere from a few days to a few years. Expert system tools have been created that provide support in the creation of this knowledge and carry out checks on the completeness and correctness of the knowledge represented in the system. The process of knowledge engineering can be considerably improved through graphical knowledge representation tools. A number of methods for structuring rule sets have traditionally been used, such as trees, spider diagrams, and concept maps. These methodologies aim to model the application by establishing a hierarchy of rule sets. The developer can add control rules to force the flow of the inference engine through a tree view or a diagram. Modern applications allow the knowledge to be structured into intuitive units of knowledge, each of which is represented as a decision tree or a case table, and an overall summary map representing the structure of the entire knowledge base for a particular task. Based on these examples, the requirements for knowledge engineering from a software point of view can be summarized as follows:

• Declarative languages and intuitive and easy-to-use graphical user interfaces are required for problem assessment.
• Multiple knowledge representations shall be available to express knowledge in the most appropriate way.
• Structuring elements shall be available for easy construction, analysis, and adaptation of hierarchies and relationships.
• Methods like rule induction shall be available to convert tables or lists of examples into decision trees.
• Tools are required that support the expert in finding the reason for a conflict, a gap, or an overlap by presenting the conflict in an understandable way that is easy for the expert to evaluate.
• Experts shall be relieved of their day-to-day work and shall have the chance to concentrate on the knowledge engineering process.
• Tools shall be available to test and validate the inserted knowledge, preferably on the basis of constructed test cases.
• Revision and approval of the system shall be supported by an appropriate user management system.

Knowledge engineering is related to many computer science domains such as artificial intelligence, databases, data mining, and decision support. The quality of the primary data for the knowledge base determines the validity and reliability of the results. However, knowledge engineering is just a part of the task for an expert system. Expert systems rely on a series of supporting technologies that ensure that
complex data can be managed and evaluated in an efficient manner. The following chapters will cover some of these techniques.
2.7 Concise Summary

Backward Chaining is a problem-solving procedure that starts with a statement and a set of rules leading to the statement and then works backward, matching the rules with information from a database of facts until the statement can be either verified or proven wrong.

Bayes' Theorem is a mathematical framework describing the probability that an event was the result of a particular cause and how it changes if the cause changes.

Bayesian Networks are statistical models for describing probabilistic dependencies for a set of variables based on Bayes' theorem.

Case-Based Reasoning (CBR) is a problem-solving system that relies on stored representations of previously solved problems and their solutions.

Certainty Factors are used as a degree of confirmation of a piece of evidence. Mathematically, a certainty factor is the measure of belief minus the measure of disbelief.

Data are facts that are printed, stored, visualized, or otherwise made available in a delocalized and temporally decoupled form and that can be preserved in a documentation process.

Dempster-Shafer Theory of Evidence is a generalization of the Bayesian theory based on degrees of belief rather than probabilities.

Expert System Shells are software packages that facilitate the building of knowledge-based systems by providing a knowledge representation scheme and an inference engine. The developer adds domain knowledge.

Expert Systems are application programs making decisions or solving problems in a particular field by using knowledge and analytical rules defined by experts in the field. They aid an expert in making decisions for a certain problem. An expert system typically operates with rules that are evaluated to predict a result for a certain input.

Forward Chaining is a problem-solving procedure that starts with a set of rules and a database of facts and works to a conclusion based on facts that match all the premises set forth in the rules.

Frames are a technique for representing knowledge in a group of attributes and relationships, which are stored in slots.

Hidden Markov Models (HMMs) are statistical models based on the assumption that the probability of future states of a process is independent of its path (Markov property).

If–Then Rules are collected in a rule base and describe problem situations and the actions an expert would perform in those situations.

Inference Engine is the processing portion of an expert system. With information from the knowledge base, the inference engine provides the reasoning ability that derives inferences, or conclusions, on which the expert system acts.

Information is the process of perception and cognition and is typically followed by interpretation.
Knowledge is related to individuals and is stored in the cognitive system, the brain. It is subject as well as purpose oriented and is related to a certain context. Consequently, knowledge carriers are all individuals belonging to an enterprise.

Knowledge Acquisition defines the gathering of expertise from a human expert for entry into an expert system.

Knowledge Engineering is the process of gathering knowledge from subject-matter experts and transforming it into a computable format.

Knowledge Induction describes the generation of knowledge either directly (communication), indirectly (via information), or through self-induction (by training).

Knowledge Logistics ensures that knowledge is available in adequate form at any time and place.

Knowledge Representation describes the notation or formalism used for coding the knowledge to be stored in a knowledge-based system.

Knowledge-Based Systems use stored knowledge to solve problems in a specific domain.

Pattern Matching is the process of matching facts against patterns to determine rules that are applicable.

Rule-Based Systems, instead of representing knowledge in a static way, represent knowledge in terms of rules that lead to conclusions. A simple rule-based system consists of a set of if–then rules, a collection of facts, and an interpreter (inference engine) controlling the application of the rules by the given facts.

Semantic Networks are a form of knowledge representation in a directed graph, where the nodes represent concepts and the edges, or connectors, describe their relations.

Subject-Matter Experts (SMEs) are persons who have knowledge of a specific subject or domain.

Task refers to some goal-oriented, problem-solving activity, whereas domain refers to the area within which the task is being performed.

Task Domain is the area of human intellectual endeavor to be captured in an expert system.
References
1. Barr, A. and Feigenbaum, E.A., Eds., The Handbook of Artificial Intelligence, Vol. 1. William Kaufman Inc., Los Altos, CA, 1981. 2. Quillian, M.R., Word Concepts: A Theory and Simulation of Some Basic Semantic Capabilities, Behavioral Sci., 12, 410, 1967. 3. Shastri, L., Why Semantic Networks? in Principles of Semantic Networks: Explorations in the Representation of Knowledge, Sowa, J.F., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1991, 109. 4. Schubert, L.K., Semantic Nets Are in the Eye of the Beholder, in Principles of Semantic Networks: Explorations in the Representation of Knowledge, Sowa, J.F., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1991, 95. 5. Lehmann, F., Semantic Networks, Comp. Mathemat. Applicat., 23, 1, 1992. 6. Open Biomedical Ontologies, The OBO Foundry, http://obo.sourceforge.net/. 7. Harris, M.A., et al., The Gene Ontology (GO) database and informatics resource, Nucleic Acids Res., 32, D258, 2004.
8. Smith, C.L., Goldsmith, C.A., and Eppig, J.T., The Mammalian Phenotype Ontology as a Tool for Annotating, Analyzing and Comparing Phenotypic Information, Genome. Biol., 6, R7, 2005. 9. Bard, J., Rhee, S.Y., and Ashburner, M., An Ontology for Cell Types, Genome. Biol., 6, R21, 2005. 10. Hsing, M. and Cherkasov, A., Integration of Biological Data with Semantic Networks, Current Bioinformatics, 1(3), 1, 2006. 11. Minsky, M.L., Form and Content in Computer Science, ACM Turing Lecture, JACM, 17, 197, 1970. 12. American Chemical Society, Chemical Abstracts Service, Registry Number and Substance Counts, http://www.cas.org/cgi-bin/cas/regreport.pl. 13. Jalsovszky, G. and Holly, G., Pattern Recognition Applied to Vapour-Phase Infrared Spectra: Characteristics of nuOH Bands, J. Mol. Struct., 175, 263, 1988. 14. Sadtler Research Laboratories, Spectroscopy Group, Cambridge, MA, http://www. sadtler.com. 15. Forgy, C., Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem, Artificial Intelligence, 19, 17, 1982. 16. Aamodt, A. and Plaza, E., Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches, AI Communications, 7(1), 39, 1994. 17. Birnbaum, L. and Collings, G., Remindings and Engineering Design Themes: A Case Study in Indexing Vocabulary, in Proceedings of the Second Workshop on Case-Based Reasoning, Pensacola, FL, 1989, 47. 18. Hammond, K.J., On Functionally Motivated Vocabularies: An Apologia, in Proceedings of the Second Workshop on Case-Based Reasoning, Pensacola, FL, 1989, 52. 19. Kolodner, J.L., Case-Based Reasoning, Morgan Kaufmann, San Mateo, CA, 1993. 20. Kolodner, J.L., Maintaining Organization in a Dynamic Long-Term Memory, Cognitive Science, 7, 243, 1983. 21. Schank, R., Dynamic Memory: A Theory of Reminding and Learning in Computers and People, Cambridge University Press, Cambridge, UK, 1982. 22. Porter, B.W. and Bareiss, E.R., PROTOS: An Experiment in Knowledge Acquisition for Heuristic Classification Tasks, in Proceedings of the International Meeting on Advances in Learning, Les Arcs, France, 1986, 159. 23. Navichandra, D., Exploration and Innovation in Design: Towards a Computational Model, Springer Verlag, New York, 1991. 24. Maher, M.L. and Zhang, D.M., CADSYN: Using Case and Decomposition Knowledge for Design Synthesis, in Artificial Intelligence in Design, Gero, J.S., Ed., ButterworthHeinmann, Oxford, 1991. 25. Domeshek, E., A Case Study of Case Indexing: Designing Index Feature Sets to Suit Task Demands and Support Parallelism, in Advances in Connectionnist and Neural Computation Theory, Vol.2: Analogical Connections, Barenden, J. and Holyoak, K., Eds., Norwood Publishing, NJ, 1993. 26. Leng, B., Buchanan, B.G., and Nicholas, H.B., Protein Secondary Structure Prediction Using Two-Level Case-Based Reasoning, J. Comput. Biol., 1(1), 25, 1994. 27. Kovac, J., Theoretical and Practical Reasoning in Chemistry, Found. Chem., 4, 163, 2002. 28. Shortliffe, E.H., Computer Based Medical Consultations: MYCIN, Elsevier, New York, 1976. 29. Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, New York, 1988. 30. Shafer, G.A., A Mathematical Theory of Evidence, Princeton University Press, Princeton, NJ, 1976. 31. Negoita, C.V. and Ralescu, D., Simulation, Knowledge-Based Computing, and Fuzzy Statistics, Van Nostrand Reinhold, New York, 1987.
32. Zadeh, L.A., Fuzzy Sets, Information and Control, 8, 338, 1965. 33. Zadeh, L.A., Fuzzy Sets as the Basis for a Theory of Possibility, Fuzzy Sets Syst., 1, 3, 1978. 34. Baum, L.E., et al., A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, Ann. Math. Statist., 41, 164, 1970. 35. Durbin, R., et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1st ed., Cambridge University Press, Cambridge, UK, 1999. 36. Pachter, L. and Sturmfels, B., Eds., Algebraic Statistics for Computational Biology, 1st ed., Cambridge University Press, Cambridge, UK, 2005. 37. Jensen, F.V., Bayesian Networks and Decision Graphs, 1st ed., Springer, New York, 2001. 38. Bayes, T., An Essay towards Solving a Problem in the Doctrine of Chances, by the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S., Philosophical Transactions of the Royal Society of London, 53, 370, 1763. 39. Dempster, A.P., A Generalization of Bayesian Inference, Journal of the Royal Statistical Society, Series B, 30, 205, 1968. 40. Duda, R., et al., Development of a Computer-Based Consultant for Mineral Exploration, SRI Report, Stanford Research Institute, Stanford, CA, 1977. 41. Jensen, F.V., Bayesian Networks and Decision Graphs, 1st ed., Springer, New York, 2001. 42. Shafer, G., Perspectives on the Theory and Practice of Belief Functions, International Journal of Approximate Reasoning, 3, 1, 1990. 43. Negnevitsky, M., Artificial Intelligence: A Guide to Intelligent Systems, Addison Wesley, 2004.
3 Development Tools for Expert Systems
3.1 Introduction

Conventional programming languages are designed for the procedural manipulation of data. Humans, however, often solve complex problems using very abstract, symbolic approaches, which cannot be directly implemented in conventional languages. Although abstract information can be modeled in these languages, considerable programming effort is required to transform the information to a format usable with procedural programming paradigms. One of the results of research in the area of artificial intelligence has been the development of techniques that allow the modeling of information at higher levels of abstraction. These techniques are embodied in languages or tools that allow programs to be built that closely resemble human logic in their implementation and are therefore easier to develop and maintain. Most expert systems are developed via specialized software tools called shells. These shells come equipped with an inference mechanism — backward chaining, forward chaining, or both — and require knowledge to be entered according to a specified format. We will have a closer look at several tools for developing expert systems in this chapter.
3.2 The Technical Design of Expert Systems

Expert systems consist of several modules (Figure 3.1).
3.2.1 Knowledge Base

The knowledge base contains knowledge on a particular domain in the form of logic rules or similar representation technologies. Rules are stored as declarative statements and can be added, changed, and edited using a knowledge engineering module. Rules are formulated in a cause-and-effect manner that is similar to how an expert would formulate his knowledge. The rules are either obtained from a human expert in the knowledge engineering process or entered by the expert himself. The definition of rules is the most common technique used today for representing knowledge in an expert system.
3.2.2 Working Memory

Specific information on a current problem is represented as case facts and is entered in the expert system's working memory. The working memory contains both the facts
Figure 3.1 The general design of an expert system comprises the user-oriented parts, the user interface and the working memory, which stores specific information for a current problem as facts. The knowledge base contains knowledge on a particular domain in the form of logic rules or other representation technologies. The inference engine performs the reasoning by using the knowledge to perform an evaluation similar to the way a human reasons with available information. Inference engine and knowledge base are handled by a knowledge engineering module for defining rules and controlling the inference mechanism. A series of underlying technologies, such as searches or pattern matching, supports the process of inference.
entered by the user from questions asked by the expert system and facts inferred by the system. The working memory could also acquire information from databases, spreadsheets, or sensors and could be used by the expert system to derive additional information about the problem by using the general knowledge contained in the knowledge base.
3.2.3 Inference Engine

The analog of human reasoning in the expert system is the inference engine. The role of the inference engine is to work with the available information contained in the working memory and the general knowledge contained in the knowledge base to derive new information about the problem. This process is similar to the way a human reasons with available information to arrive at a conclusion. The inference engine, working either in the backward- or forward-chaining mode, will attempt to conclude new information about the problem from available information until some goal is reached or the problem is solved.
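A minimal forward-chaining loop can be sketched in a few lines of Python. The rule contents are hypothetical and only illustrate the mechanism: every rule whose premises are all present in the working memory fires and asserts its conclusion as a new fact, and the cycle repeats until no rule adds new information.

rules = [
    {"if": {"coplanar", "cyclic_conjugated", "hueckel_electrons"}, "then": "aromatic"},
    {"if": {"aromatic"}, "then": "high_stability"},
]

working_memory = {"coplanar", "cyclic_conjugated", "hueckel_electrons"}

changed = True
while changed:                                   # repeat until no new facts are inferred
    changed = False
    for rule in rules:
        if rule["if"] <= working_memory and rule["then"] not in working_memory:
            working_memory.add(rule["then"])     # the conclusion becomes a new fact
            changed = True

print(sorted(working_memory))   # now also contains 'aromatic' and 'high_stability'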
3.2.4 User Interface

The function of the user interface is to present questions and information to the operator and to supply the operator's responses to the inference engine. Any values
entered by the user must be received and interpreted by the user interface. Some responses are restricted to a set of possible legal answers; others are not. The user interface checks all responses to ensure that they are of the correct data type. Any responses that are restricted to a legal set of answers are compared against these legal answers. Whenever the user enters an illegal answer, the user interface informs the user that his answer was invalid and prompts him to correct it. Communication between the user interface and the inference engine is typically performed through the use of a user interface control block, which is passed between the two. Besides providing final results or conclusions, both human experts and expert systems can explain how they arrived at their results. This capability is often important because the types of problems to which expert systems are applied require that a justification of the results be provided to the user. For example, an expert system that recommends some antibiotic treatment for a patient would need to explain to the physician how this recommendation was formulated. Expert systems also often have the capability of explaining why a given question is being asked. When an individual consults with a human expert, the conversation is highly interactive, and on occasion, the individual may ask why a certain line of reasoning is being pursued. The explanation given can make the user feel more comfortable with the line of questioning and also can help to clarify what issues the expert believes are important for the problem. Several programming languages and tools are available for developing an expert system. The most important ones will be described here.
3.3 Imperative versus Declarative Programming

Conventional programming is typically performed in the form of sequential statements, each of which changes the state of variables in the program. This method of program development is called imperative programming. A common method of imperative programming is procedural programming, where the code consists of sequential calls to functions, methods, and subroutines, all of which are abstracted as procedures. Another paradigm is functional programming, which focuses on the application of mathematical functions rather than the changing states of variables. Figure 3.2 gives an overview of the different programming paradigms and examples of programming languages. A code example of an imperative language (C++) is shown below. This is a function that normalizes a vector according to the Euclidean L2 norm — that is, it divides each component of a vector by the square root of the sum of the squared vector components. The function is called with two parameters, one of which is a pointer (*) to a multidimensional variable (Vector) that represents the input vector that has to be normalized. The other parameter is the length of the vector (N). The code declares a variable (Norm) and initializes it (i.e., assigns a start value of zero). It then defines a loop (for statement) that runs over all components (i) of the vector to sum the squared components. It then calculates the square root (sqrt) of the sum. In a second loop, it again runs over all vector components to divide them by the square
[Figure 3.2 layers, from top to bottom: Expert Systems (Shells); Descriptive Languages; Application-Specific (Macro Languages, VBA); Scripting (JavaScript, VB, Python, Perl); Imperative, Object-Oriented (C++, C#, Java, Python, Perl); Imperative, Procedural (FORTRAN, Common LISP, C); Machine (Assembler); Computers.]
Figure 3.2 A schematic model of programming paradigms shows how the complexity of programming techniques increases from the computer at the base to the expert systems at the top. Expert systems are designed for extensive user interaction and require descriptive languages that are on a higher level (i.e., closer to the user) than those actually processed by the computer, such as machine languages. Imperative and object-oriented paradigms support the sequential processing and the data and functional modeling of a program. Most scripting languages are based on such a layer, whereas macro languages are typically intended to perform user interactions automatically. The example of Visual Basic for Applications (VBA) combines imperative programming with application-specific functionality. Finally, descriptive languages no longer include any basic business logic but are adapted to the expert using the system.
root (Norm). The if statement ensures that the division is performed only if the numerator, the vector component itself, is not zero:

#include <cmath>   // for sqrt; assumed here to make the fragment self-contained

void EuclNormF(double* Vector, int N)
{
    // Sum the squares of all vector components.
    double Norm = 0;
    for (int i = 0; i < N; i++)
    {
        Norm += Vector[i] * Vector[i];
    }
    // Euclidean (L2) norm of the vector.
    Norm = sqrt(Norm);
    // Divide each nonzero component by the norm.
    for (int i = 0; i < N; i++)
    {
        if (Vector[i])
            Vector[i] /= Norm;
    }
}
This is a procedural algorithm that is performed in a sequential manner and leads to an explicit answer. The program code may be called from other code and may include function calls, such as the square root function sqrt. Other programming paradigms are logic programming and declarative programming. Logic programming languages describe the business logic of a program but do not specify the way the computation is performed — that is, they do not contain algorithms the way imperative programming languages do. A typical example is PROLOG, which is described later. Closely related to the latter is declarative programming, which describes states rather than how to achieve them. Additionally, the syntax is modeled after the way humans describe a state and thus is more declarative than that of other languages. Logic and declarative programs explicitly specify a goal that has to be achieved and leave the implementation of the algorithm to other software. A simple example is Structured Query Language (SQL), which is the most popular computer language for creating, updating, and retrieving data from relational database management systems. An SQL select statement, for instance, specifies what data are to be retrieved from a database but does not describe the process of how this is actually done; this task is left to the database management software. An example is shown here:

SELECT name
FROM compounds
WHERE compounds.mol_wt < 250
ORDER BY name ASC;

This is a statement that retrieves names (name) from a table of compounds in a database, restricting the results to those compounds that have a molecular weight (mol_wt) less than 250, and presents the results in ascending order of the names. The statement does not include any algorithm or function that defines how the search is actually done; the database server incorporates the business logic to do this. In addition, this is a declarative statement in the sense that it is more or less self-explanatory. A clear classification of programming languages according to these or other programming paradigms is often not possible. This is, on the one hand, due to small differences in the definitions of the paradigms. For instance, a different definition of a declarative program includes just functional and logic programming languages, and often the term declarative is used simply to distinguish languages from imperative ones. On the other hand, most logic programming languages are able to describe algorithms and implementation details. Similarly, it is possible to write programs in a declarative style even in an imperative programming language by using declarative names for variables and excluding nondeclarative details from the main code. However, a declarative language, like all natural languages, has a syntax describing how the words in the language may be combined and a semantics describing how sentences in the language correspond to the result of a program. In the following sections, we will examine some of the programming languages that are of importance for the development of expert systems.
3.4 List Processing (LISP)

LISP is a programming language that was developed in 1959 at the Massachusetts Institute of Technology [1]. Its origin goes back to the development of formula translator programming language (Fortran) subprograms for symbolic calculations. At that time, John McCarthy had the idea of developing an interpreter for symbolic calculations, which formed the basis for LISP. It uses two data structures — single scalar values, or atoms; and single or nested lists — that form an associative array. These data structures are created dynamically without the need for reserving computer memory. The dynamic approach makes declarations unnecessary and allows variables to represent arbitrary objects, independent of their data type or structure. LISP programs can run using interpreters or compilers, which allows the programmer to combine the advantages of both worlds in a single program. LISP was used widely in the 1970s for developing expert systems. However, the low availability of LISP on a wide variety of conventional computers, the high cost of state-of-the-art LISP tools, and the poor integration of LISP with other languages led the National Aeronautics and Space Administration's (NASA's) computing department to the development of another expert system shell: C Language Integrated Production System (CLIPS) (see Section 3.6). Since LISP has been available in different variants, the American National Standards Institute (ANSI) standardized a dialect of the LISP programming language in 1994 called Common Lisp (CL) [2]. Common Lisp is a language specification rather than an implementation. An example follows:

(defun square (x) (* x x))

(let ((x 3) (y 7))
  (+ (square x) y))

The code defines a function (square) that returns the square of a variable (x), assigns (let) values to the variables x and y, and performs a calculation by squaring x and adding it to y. The result in this case would be 16. As the name list processing suggests, the strength of this language lies in the definition, manipulation, and evaluation of lists that are defined in a simple statement:

(:benzene "aromatic" :cyclohexane "cycloaliphatic")

This is a property list, where keywords (those preceded by a colon) have values assigned that can be retrieved by applying the retrieval function (getf) to the list:

(getf '(:benzene "aromatic" :cyclohexane "cycloaliphatic") :benzene)
"aromatic"

A few chemistry-specific dialects have been developed, one of which is the ChemLisp language interpreter in Apex-3D, an expert system for investigating structure–activity relationships. ChemLisp represents a special dialect of the LISP language. ChemLisp accesses main data structures and modules containing basic algorithmic functions.
3.5 Programming Logic (PROLOG)

Logic programming and one of its representatives, programmation et logique (programming and logic; PROLOG), take a declarative approach to writing computer programs. PROLOG was designed by a team around Alain Colmerauer, a French computer scientist, in the 1970s with the goal of creating a tool for communicating with computer software in a natural language [3]. The development was divided into two parts: a system for automatic deduction, developed by Jean Trudel and Philippe Roussel, and a natural language interface, taken over by Colmerauer and Robert Pasero [4]. A PROLOG program consists of a database including facts and rules, which are evaluated systematically when a query is defined by a user. The result of this evaluation can be either positive or negative. The outcome is positive if a logical deduction can be made; it is negative if the current database does not allow a deduction. PROLOG, as a declarative programming language, allows communication with a PROLOG program in a kind of natural language. In contrast to imperative programming languages, like C++, C#, or Java, the rule-based approach does not require a fixed-path solution but creates the solution by deduction from rules. In general, logic programs use two abstract concepts: (1) truth; and (2) logical deduction. A PROLOG program can be asked whether a fact is true or whether a logical statement is consistent with the rules stored in the program. These questions can be answered independently of concrete hard-coded algorithms. Facts consist of a name (or functor), similar to a variable name in other programming languages, and one or more arguments that define the corresponding values.
3.5.1 PROLOG Facts

PROLOG facts might be the following:

compounds(benzene, aromatic).
compounds(toluene, aromatic).
compounds(xylene, aromatic).
compounds(hexane, aliphatic).

The first fact indicates that benzene is an aromatic member of a series of compounds. Another example is the following:

boilingPoint(benzene, 353.2).

where benzene has a value in the list of boiling points. Note that the facts can be arranged in any appropriate way. For instance,

aromaticCompounds(benzene).

would be another way of declaring that benzene is an aromatic compound.
3.5.2 PROLOG Rules

A PROLOG rule is an if–then construction; in PROLOG notation this would look like the following:

aromaticCompound(X) :-
    coplanar(X),
    cyclicConjugated(X),
    hueckelElectrons(X).

The rule is interpreted as

X is an aromatic compound if
X is coplanar AND
X is cyclic conjugated AND
X follows the Hueckel rule (4n+2 pi electrons).
Here X is a variable; the notation requires an uppercase initial character. In PROLOG, variables are not similar to those in conventional programming languages. If a set of facts and rules has been defined, a PROLOG program can be executed, which leads to a question-and-answer scenario. PROLOG tries to interpret the facts and rules to answer with either yes, no, or a predefined answer. For example, the following could be asked:

?- compounds(benzene, aromatic).

which would result in the answer yes, since this has been stored as a fact before. Questions can have variables in them, which may get bound to a particular value when PROLOG tries to answer the question. PROLOG will display the resulting bindings of all the variables in the question. So for a variable CompoundType (note the uppercase notation for variables), we might have the following:

?- compounds(toluene, CompoundType).
CompoundType = aromatic

We can also ask the question the other way around:

?- compounds(Compound, aromatic).
Compound = benzene
Compound = toluene
Compound = xylene

Or we can ask for all compounds:

?- compounds(Compound, CompoundType).
Compound = benzene
CompoundType = aromatic
Compound = toluene
CompoundType = aromatic
Compound = xylene
CompoundType = aromatic
Compound = hexane
CompoundType = aliphatic

PROLOG systematically goes through all its facts and rules and tries to find all the ways it can associate variables with particular values so that the initial query is satisfied. As an example of how rules are used, suppose we ask the following question:

?- aromaticCompound(A).

and we have defined the facts

coplanar(cyclopentadiene).
coplanar(benzene).
hueckelElectrons(benzene).
hueckelElectrons(cyclohexene).
cyclicConjugated(benzene).

and the rule stated previously. In this case, PROLOG will respond with

A = benzene

PROLOG matches aromaticCompound(A) against the head of the rule aromaticCompound(X). The first condition in the rule, coplanar(X), can be satisfied with cyclopentadiene; that is, cyclopentadiene is bound to A. The next condition, cyclicConjugated(X), does not apply, so PROLOG goes back to the first condition, where it now binds benzene to A. PROLOG checks all facts that match the rule (the three conditions) until it finds an answer; the only compound matching all three conditions for aromaticity is benzene. The concept of other programming languages for expert systems is very similar to the one just introduced; however, the syntax and the individual functionalities differ.
3.6 National Aeronautics and Space Administration's (NASA's) Alternative — C Language Integrated Production System (CLIPS)

CLIPS is a programming language shell developed in 1985 at the NASA Johnson Space Center [5]. The original intent for CLIPS was to improve knowledge about the construction of expert system tools and to form the basis for replacing the existing commercial tools. Just one year later it became apparent that CLIPS, as a low-cost expert system tool, would be an ideal replacement for existing systems based on LISP. The first version made available to groups outside of NASA was version 3.0. Originally, the primary representation methodology in CLIPS was a forward-chaining rule language based on the Rete algorithm. In version 5.0 of CLIPS, procedural programming,
as found in C, and object-oriented programming, like in C++, were introduced. The latter is implemented in the CLIPS Object-Oriented Language (COOL). Version 6.0 of CLIPS further supported the development of modular programs as well as improved integration between object-oriented and rule-based programming capabilities. Now, CLIPS is available in version 6.24 and is maintained independently from NASA as public domain software [6]. CLIPS allows for handling a wide variety of knowledge and supports three programming paradigms: (1) rule based; (2) object oriented; and (3) procedural. Rule-based programming represents knowledge as heuristics, which specify a set of actions to be performed for a given situation. The object-oriented approach allows complex systems to be modeled as modular components that can be reused for other purposes. Procedural programming is supported in a similar manner as in conventional programming languages. CLIPS program code can be embedded within procedural code written in C, Java, Fortran, and ADA, and it provides a series of protocols for integration with software written in other programming languages. Although CLIPS is written in C, it can be installed on different operating systems, like Windows 95/98/NT/2000/XP, MacOS X, and UNIX, without code changes. It can be ported to systems with ANSI-compliant C or C++ compilers. CLIPS can be easily extended, ported, and integrated with existing systems and databases. These are some of the reasons why CLIPS has been well accepted in the scientific community throughout government, industry, and academia. The development of CLIPS has strongly supported the acceptance of expert system technology for a wide range of applications and diverse computing environments. NASA uses these tools extensively to solve problems in scientific and military areas, and several governmental institutions, universities, and private companies are taking advantage of this language. Additional software tools have been created to support the development of expert systems. A good example is the parallel production system (PPS), developed by Frank Lopez during his thesis at the University of Illinois [7]. PPS is designed for creating modular expert systems on single or multiprocessor architectures. The resulting modules, called expert objects, are knowledge sources that include communication features for the expert. When communicating with a module, the expert does not actually see the physical architecture of the computer, since PPS adapts automatically to it. Users may create a graph of how the expert objects are to communicate on a blackboard structure. In addition, the language of PPS is simple but powerful. Lopez, who later became the founder of Silicon Valley One (http://www.siliconvalleyone.com), also developed OPS-2000. This is an interactive, rule- and object-based software development environment and the first knowledge engineering tool designed for the C++ programming language. Let us have a closer look at a simple interaction with the CLIPS command interface.
3.6.1 CLIPS Facts The (assert) command can be used to add facts to the system:
CLIPS> (assert (aromatic))
The user interface responds with a statement giving the number of facts (starting with zero) currently available in the system. The facts can be listed by using the (facts) command:
CLIPS> (facts)
f-0 (aromatic)
For a total of 1 fact.
A list of facts can also be added in a single statement:
CLIPS> (assert (aliphatic) (polycyclic) (linear))
CLIPS> (facts)
f-0 (aromatic)
f-1 (aliphatic)
f-2 (polycyclic)
f-3 (linear)
For a total of 4 facts.
In this case, facts such as (aromatic) and (aliphatic) have no names; they are single-field facts. To name a fact, the following syntax can be used:
CLIPS> (assert (compoundTypes aromatic aliphatic polycyclic linear))
CLIPS> (facts)
f-0 (aromatic)
f-1 (aliphatic)
f-2 (polycyclic)
f-3 (linear)
f-4 (compoundTypes aromatic aliphatic polycyclic linear)
For a total of 5 facts.
3.6.2 CLIPS Rules A rule in CLIPS uses the following syntax:
CLIPS> (defrule stability
   (compoundTypes aromatic $?)
   =>
   (assert (stability high)))
Here (defrule) is the command for defining a new rule, which has the header stability. The next line is the left-hand side of the rule, here consisting of a single condition; it matches any compoundTypes fact whose first value is aromatic, with the multifield wildcard $? standing for the remaining values. The => notation is similar to a then statement in a conventional programming language. The result of a successful match is to perform the action described in the last line, which is the creation of a new fact. The system interprets this rule as follows:
If compound type is aromatic, then stability is high.
The user interaction scenario is similar to the one described previously. We can have a look at the rule by using the (ppdefrule) command:
CLIPS> (ppdefrule stability)
(defrule MAIN::stability
   (compoundTypes aromatic $?)
   =>
   (assert (stability high)))
Looking at the facts currently stored, we will get the following:
CLIPS> (facts)
f-0 (aromatic)
f-1 (aliphatic)
f-2 (polycyclic)
f-3 (linear)
f-4 (compoundTypes aromatic aliphatic polycyclic linear)
For a total of 5 facts.
To execute our program, we just need to enter the (run) command and to check the facts:
CLIPS> (run)
CLIPS> (facts)
f-0 (aromatic)
f-1 (aliphatic)
f-2 (polycyclic)
f-3 (linear)
f-4 (compoundTypes aromatic aliphatic polycyclic linear)
f-5 (stability high)
For a total of 6 facts.
The program took the given facts, matched them against the given rules, and created a new fact. Basically, any programmable action can take place as a result of a positive rule match. For instance, another set of rules could be evaluated; facts could be added, deleted, or changed; or even an external program could be called to trigger any other external action, like a data analysis.
CLIPS has a series of extensions to integrate additional technologies, many of them also developed by NASA staff. Also, several scientific institutions have worked on extensions. FuzzyCLIPS, for instance, was developed by the Integrated Reasoning Group at the NRC Institute for Information Technology. FuzzyCLIPS enhances CLIPS by providing a fuzzy reasoning capability that can integrate with CLIPS facts and the CLIPS inference engine. It allows representing and manipulating fuzzy facts
and rules. It handles fuzzy reasoning, supports uncertainty concepts, and allows combining nonfuzzy and fuzzy algorithms in the rules and facts of an expert system. DYNAmic CLIPS Utilities (DYNACLIPS) is another set of tools that can be linked with CLIPS, providing dynamic knowledge exchange and agent tools implemented as a set of libraries.
Expert systems developed with CLIPS may be executed as follows:
• Interactively, using a text-oriented command line interface or a Windows interface with mouse interaction.
• Batch-oriented, using a text file that contains a series of commands that can be automatically read from a file when CLIPS is started.
• Embedded in an external program, where calls can take advantage of the CLIPS program. In this case, the user interfaces are provided by the external program.
The generic CLIPS interface works similarly to the one previously described for PROLOG. The knowledge base is usually created with a standard text editor and is saved as one or more text files. Executing CLIPS then loads the knowledge base and begins with the question–answer user interaction scenario. The CLIPS user interface allows for changing rules and facts on the fly, for viewing the current state of the system, and for tracing the execution. A window interface is available for the Macintosh, Windows, and X Window environments.
3.7 Java-Based Expert Systems — JESS Java Expert System Shell (JESS) is a superset of the CLIPS programming language for the Java platform developed by Ernest Friedman-Hill of Sandia National Laboratories, a National Nuclear Security Administration laboratory, in late 1995 [8]. JESS provides rule-based programming for automating an expert system and is often referred to as an expert system shell. Rather than a procedural paradigm, where a single program has a loop that is activated only one time, the declarative paradigm used by JESS continuously applies a collection of rules to a collection of facts by a process called pattern matching. Rules can modify the collection of facts, or they can execute any Java code. Each JESS rule engine can incorporate a series of facts, which are stored in the working memory. Rules react to additions, deletions, and changes to the working memory. JESS distinguishes between pure facts, which are defined by JESS, and shadow facts, which are connected to Java objects. Facts are based on templates, similar to Java classes, which consist of a name and a series of slots that represent the properties of the class. A definition of a template might look like this:
(deftemplate molecule “A generic molecule.”
   (slot name)
   (slot formula)
   (slot nAtoms (type INTEGER))
   (slot state (default liquid)))
A fact is then defined by using the (assert) command:
(assert (molecule (name Benzene) (formula C6H6) (nAtoms 12)))
Using a shadow fact that derives from a Java class called Molecule requires a reference to this class in the template:
(deftemplate Molecule (declare (from-class Molecule)))
where Molecule is the template name as well as the name of the corresponding Java class. If both are identical, the template automatically generates slots that correspond to the properties of the Java class. If the Java class provides a method getFormula(), the corresponding slot (formula) is automatically generated in the template. JESS uses a special Java class (java.beans.Introspector) to find the property names. A JESS rule is declared similarly to the CLIPS notation (see previous example).
JESS can be used to build Java servlets, Enterprise JavaBeans (EJBs), applets, and full applications that use knowledge in the form of declarative rules to draw conclusions and to make inferences. Since many rules may match many inputs, there are few effective general-purpose matching algorithms. The JESS rules engine uses the Rete algorithm for pattern matching. The most current version, JESS 7.0, includes a series of new tools, like an integrated development environment for rules that features tools for creating, editing, visualizing, monitoring, and debugging rules. JESS 7.0 executes rules written both in its own expressive rule language and in eXtensible Markup Language (XML).
3.8 Rule Engines — JBoss Rules JBoss Rules (also referred to as Drools) is an open-source business rules engine designed for implementing business rules according to business policies [9]. It consists basically of a rules engine that allows for viewing and managing business rules encoded in a software application as well as validating the implemented rules against documented business rules. JBoss Rules is more correctly classified as a production rule system, which works with a memory of current states and details of a problem, a rule base, and an interpreter (i.e., inference engine), which applies the rules to each fact entering the memory. The system implements both Rete and leaps pattern-matching algorithms. Rules are stored in the production memory, and the facts that the inference engine matches against are stored in the working memory. A typical syntax is as follows:
rule “Strength of Ternary Acids”
when
   Molecule(nOxygen) && Molecule(nOxygen > Molecule(nHydrogen) + 2)
then
   Strength = “strong”
else
   Strength = “weak”;
end
The strength of a ternary acid is estimated from the difference between the numbers of oxygen and hydrogen atoms. Facts are asserted by a simple assignment statement, similar to object-oriented programming languages:
Molecule aspirin = new Molecule(“Acetylsalicylic acid”);
aspirin.setOxygen(4);
aspirin.setHydrogen(8);
Here, aspirin is an instance of a molecule and receives two values for the number of oxygen and hydrogen atoms. A fact that is asserted is directly examined for matches against the rules; consequently, aspirin would be interpreted as a weak acid. Facts are asserted into the working memory, where they may then be modified or retracted. A system with a large number of rules and facts may result in many rules being true for the same fact assertion; these rules are said to be in conflict. The execution order of conflicting rules is managed using a conflict resolution strategy based on, for instance, the time when the fact has been asserted. The most current version, 3.0, of JBoss Rules is available on the World Wide Web.
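The production-rule machinery described here (a working memory of facts, a production memory of rules, and a conflict resolution strategy) can be illustrated with a small Python sketch. This is a deliberately simplified toy, not the JBoss Rules implementation; all names and the encoding of the acid heuristic are invented for illustration.

class Rule:
    def __init__(self, name, condition, action, salience=0):
        self.name = name            # rule identifier
        self.condition = condition  # function: fact -> bool
        self.action = action        # function: fact -> derived fact
        self.salience = salience    # priority used for conflict resolution

def run(rules, working_memory, max_cycles=100):
    for _ in range(max_cycles):
        # Conflict set: every (rule, fact) pair whose condition matches.
        conflict_set = [(r, f) for r in rules for f in working_memory if r.condition(f)]
        if not conflict_set:
            break
        # Conflict resolution: highest salience first, then the most recent fact.
        rule, fact = max(conflict_set,
                         key=lambda rf: (rf[0].salience, working_memory.index(rf[1])))
        new_fact = rule.action(fact)
        if new_fact is None or new_fact in working_memory:
            break                   # nothing new was derived; stop
        working_memory.append(new_fact)
    return working_memory

# The ternary-acid heuristic from the text, split into two rules.
rules = [
    Rule("strong acid",
         lambda f: f.get("type") == "acid" and f["nOxygen"] > f["nHydrogen"] + 2,
         lambda f: {"strength": (f["name"], "strong")}),
    Rule("weak acid",
         lambda f: f.get("type") == "acid" and f["nOxygen"] <= f["nHydrogen"] + 2,
         lambda f: {"strength": (f["name"], "weak")}),
]
working_memory = [{"type": "acid", "name": "acetylsalicylic acid",
                   "nOxygen": 4, "nHydrogen": 8}]
print(run(rules, working_memory))   # adds {'strength': ('acetylsalicylic acid', 'weak')}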
3.9 Languages for Knowledge Representation From a formal point of view, a knowledge representation language (KRL) incorporates syntactic (i.e., notational) and inferential aspects. The syntactic aspect covers the explicit format used for storing knowledge, whereas the inferential part deals with the use of knowledge for reasoning. Knowledge representation languages typically encompass business logic, production rules, semantic networks, and frames. One of the examples is KRL, developed by Bobrow and Winograd from Stanford University [10]. KRL supports knowledge representation in different declarative forms based on conceptual objects that form a network. KRL is basically an interpreter that allows LISP expressions to be formulated in a more expressive fashion. It includes an inference engine with a backward-chaining mechanism. KL-ONE is a frame-like family of knowledge representation approaches in the tradition of semantic networks [11]. It was developed to overcome some of the drawbacks of semantic networks and represents conceptual information in a structured network for inheritance. The frames — in the KL-ONE approach called concepts — are arranged in a class-like hierarchy that includes the relations between frames. Frames are typically inherited from super classes.
3.9.1 Classification of Individuals and Concepts (CLASSIC) CLASSIC is a successor of the KL-ONE approach. The framework allows representing concepts, attributes, objects, and rules either in primitive frames or in an object-oriented manner [12]. Concepts underlie an automatic generalization, and objects are automatically made instances of all concepts for which they pass a membership test. CLASSIC notation uses the following terminology to represent concepts, instances, and attributes:
• Roles are attributes, such as author, color, or material, and are stored in the frames’ slots.
• Individuals are instances or objects that have roles (i.e., slots) assigned to them, such as molecule, container, or matrix.
• Concepts combine individuals and attributes to describe an item.
• Rules provide the guiding principles for computing individuals based on their roles.
A series of operators — such as AND, FILLS, ONE-OF, ALL, and AT-LEAST — allow for comparison and for defining restrictions on attributes. Individuals are specified in the following manner:
Wasp -> SPECIES
Philanthotoxin -> TOXIN-VENOM
Wasp-Philanthotoxin -> (AND TOXIN
   (FILLS insect Wasp)
   (FILLS toxin Philanthotoxin)
   (FILLS LD50 2.4)
   (FILLS species Vespidae)
   (FILLS type Venom))
The first two statements define instances of an object: “Wasp is a species”; “Philanthotoxin is a toxin of the class of venoms.” The third statement defines a specific instance of the wasp philanthotoxin and adds several attributes in terms of slots. A concept may be defined as follows:
VENOM-SPECIES (AND VENOM-PROPERTY (ONE-OF Apidae Mutillidae Vespidae Formicidae))
VENOM-TOXICITY (AND VENOM-PROPERTY (ONE-OF High Medium Low))
TOXIN => (AND MOLECULE
   (AT-LEAST 1 toxicity) (ALL toxicity VENOM-TOXICITY)
   (AT-LEAST 1 species) (ALL species VENOM-SPECIES))
WASP-TOXINE (AND TOXIN (FILLS species Vespidae))
Here, the two venom properties — species and toxicity — are assigned a list of possible values. The concept of a toxin is then derived from the concept of a molecule and includes certain constraints for toxicity and species. Finally, a specific instance of the wasp toxine receives a constraint for the corresponding species. This kind of knowledge representation is designed for applications where rapid responses to questions are essential rather than expressive power. CLASSIC is available in the original LISP version that was developed for research purposes, as well as in C and C++; the C++ version is called NeoCLASSIC.
3.9.2 Knowledge Machine Knowledge Machine (KM) is a frame-based knowledge representation language similar to KRL and other KL-ONE representation languages such as Loom and CLASSIC [13–15]. In KM, a frame denotes either a class (i.e., type) or an instance (i.e., individual). Frames have slots, or binary predicates, in which the fillers are axioms about the slot’s value. These axioms have both declarative and procedural semantics, allowing for procedural inference. Besides the knowledge representation aspects, KM also provides features for reasoning with a variety of representations, such as frames, patterns, contexts, constraints, situations, and metaclasses. A main difference between KM and other approaches is its ability to handle value sets instead of single values for the search and assignment of slots. This enables coreferences between statements at different levels of abstraction to be correctly determined. KM includes a mechanism for producing justifications for the reasoning process based on a proof tree. Similar to object-oriented programming, KM provides classes and instances thereof. Properties of the members of a class are described by the following:
(every <class> has
   (<slot1> (<expr11> <expr12> ...))
   (<slot2> (<expr21> <expr22> ...))
   ...)
For example, the class Steroid could look as follows:
(every Steroid has
   (fusedRing-count (4))
   (stereocenters-count ((a Number)))
   (properties ((a Spectrum) (a Descriptor))))
defining a fixed value for the number of fused rings, a number of stereocenters, which is itself a class of type Number, and two properties derived from the classes Spectrum and Descriptor. We can now define the class Spectrum in the following manner:
(every Spectrum has
   (type (a SpectrumType))
   (nucleus (a Nucleus)))
A specific instance of a spectrum can look as follows:
(*HNMR has
   (instance-of (Spectrum))
   (type *NMR)
   (nucleus *H)
   (frequency (400))
   (solvent (*CDCL3)))
We can then generate a steroid instance and query the slots by using the following syntax:
> (a Steroid)
(_Steroid0)
> (the properties of _Steroid0)
(_Spectrum1 _Descriptor7)
The first expression creates an anonymous instance, or Skolem constant, from the class Steroid at run time, denoted by the underscore and an instance number. These anonymous instances are also created for the Spectrum and the Descriptor. A named instance can be created by
(*Cholesterol has
   (instance-of Steroid)
   (stereocenters-count (8))
   (Spectrum (*HNMR)))
A query is expressed in the form the slot of instance:
> (the stereocenters-count of (*Cholesterol))
(8)
This type of query has a single access path, since it derives its answer in a single step. A query with an access path of 2 would look as follows:
> (the type of (Spectrum of (*Cholesterol)))
(*HNMR)
These access paths are already a simple form of a rule. Additional conditional expressions may be used and inserted in a slot, forming a more complex rule:
(every Steroid has
   (diffusion (
      (if ((the value of (the lipophilicity of (Self)) > 3)
           and (the polarizability of (Self)) = *High))
      then *Medium
      else *Low))))
In this example, a steroid tends to diffuse through a membrane if its lipophilicity exceeds a certain value and the polarizability is high; otherwise, the diffusion would be described as low. An instance may inherit information from multiple generalizations. The semantics of KM dictate that the instance will acquire all the properties of all its generalizations; that is, all axioms applying to its generalizations also apply to the instance. Consequently, KM will merge information from all generalizations when computing the slot values of the instance. This process, called unification, is one of the important differences between KM and object-oriented programming languages. Unification of inherited information is fundamental for building a knowledge base in a modular and reusable fashion. Instead of incorporating all information in a single frame, unification distributes information and allows it to be applied to concepts outside the original context.
3.10 Advanced Development Tools Several noncommercial solutions have been in place for several decades. Among them, BABYLON is a modular development environment for expert systems, created at GMD (Forschungszentrum Informationstechnik GmbH) in Germany, which is now integrated into the Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. [16]. BABYLON is composed of frames, constraints, a PROLOG-like logic formalism, and a description language for diagnostic applications. It is implemented in Common LISP and runs on Mac and UNIX. The Generic Expert System Tool (GEST) was a blackboard system approach internationally licensed by the Georgia Tech Research Institute for expert system applications [17]. GEST can be used in a variety of problem domains and supports backward and forward chaining. Its knowledge representation schemes include frames, rules, and procedures. Support is also present for fuzzy logic and certainty factors. It offers a mouse- and menu-driven Symbolics user interface and runs on Symbolics Lisp machines (Genera 7.2) and Sun platforms. Most of the noncommercial solutions have not been maintained over time, but even commercial solutions disappeared from time to time due to company mergers. An example is Aion Corporation, the developer of the Aion Development System (ADS). ADS was an expert system shell that supported forward and backward chaining, an object-oriented knowledge representation, and graphics, and it integrated with other programming languages, like C and Pascal. In 1992 Aion Corporation merged with AICorp Inc. to form Trinzic Corporation. Just three years later, Platinum Technology Inc. acquired Trinzic Corporation, which operated for approximately four years as a unit of Platinum Technology. Finally, Platinum Technology International Inc. was acquired by Computer Associates International Inc. in May 1999. Computer Associates has a series of products in the market that support expert system development and knowledge management. A remainder of the original Aion development is CleverPath Aion Business Rules Expert (BRE), which is an advanced
rules engine to capture knowledge in manageable rules for business processes [18]. Equipped with a powerful inference engine, BRE supports dynamic rules that allow a fast adaptation to changing business scenarios and comes in a component-based development design. This proven solution reduces development and maintenance efforts through easy-to-maintain decision tables, dynamic rules, and a sophisticated inference engine. Particularly interesting with BRE is the rule manager, which enables both business and information technology (IT) users to build and maintain rules without programming. Rules are presented in intuitive formats and can be created and adapted in a Windows-like user interface. The vocabulary can be adapted to the actual business approach. BRE allows organization of rule projects — entities that allow the user to organize rules in a hierarchical manner that looks similar to a file-and-folder view. A rule definition wizard guides users through the steps of rule definition. BRE also provides features for rule validation using test cases as well as an administrative approval process including reviewing and deployment of rules. All of these features are covered by a user management system that ensures that individual users have the right to access, create, modify, and approve rules in their domain. BRE thus provides a typical system for business applications. The most current version, 10.2, has extended database and platform support.
EXSYS Inc. in Albuquerque, New Mexico, provides several interesting tools for expert system development [19]. For instance, EXSYS Professional is a rule-based expert system shell and educational tool that comes with many examples and a good tutorial on developing expert systems. It features backward chaining, forward chaining, and fuzzy logic. It provides a SQL interface to link to databases, integrates with spreadsheet programs, and runs on DOS, Windows, Macintosh, UNIX, and VAX platforms. EXSYS CORVID is a knowledge automation expert system development tool where decision steps are described by building logic diagrams. It provides object-structured knowledge representation, backward and forward chaining, and fuzzy logic. It can run either stand alone or via Java applet or servlet. It supports database integration and help desk systems and runs on Microsoft (MS) Windows 95/98/NT/2000/ME/XP; the servlet runs on servers with Apache Tomcat or compatible servlet engines with Java Runtime v1.3 or higher.
Other companies, like Intellicorp located in California, did not continue to develop their tools. Intellicorp meanwhile markets diagnostic and system management software for organizations using software from SAP AG [20]. IntelliCorp continued to market the LISP-based knowledge engineering environment (KEE), which was originally introduced in August 1983, until 1999, when it was announced that KEE would be maintained but no longer developed [21]. KEE is a frame-based expert system environment supporting dynamic inheritance, multiple inheritance, and polymorphism. It provides a knowledge base development tool, utilities for the interface with the user, and graph drawing tools for the knowledge base and execution. The methods are written in LISP, and the inference engine provides both forward and backward chaining. The system allows for linking to external databases and provides an extensive graphical user interface.
NEXPERT OBJECT, a system originally developed by Neuron Data (later renamed to Blaze software), was one of the first systems written in Pascal and
introduced on an Apple Macintosh personal computer in 1985, at a time when most available products were coded in LISP and only available on large computers [22]. In 1992, a major version, 3.0, was released, and some 25,000 installations were in use, mainly for modeling business tasks, corporate decision making, and building knowledge repositories. NEXPERT OBJECT represented domain-specific knowledge explicitly as networks of rules and class-object hierarchies. The rule-based and object-based representations are tightly integrated in a common design environment and are supported by visual editors and graphical browsers. The system’s inference engine offers one of the largest sets of reasoning techniques. Besides forward and backward chaining, examples are as follows:
• Nonmonotonic reasoning: The ability to infer alternate conclusions from new facts.
• Defeasible reasoning: The ability to suspend or retract deductions.
• Multiple inheritance: To propagate information through class-object hierarchies.
• Exception and uncertainty handlers.
NEXPERT OBJECT was probably one of the most mature systems concerning the integration capabilities with software on a corporate level. It offered an application programming interface (API) that supports industry-standard programming languages, such as C, C++, Smalltalk, or LISP. Neuron Data was acquired by Brokat in 2000 and was subsequently sold to HNC Inc., which merged in 2002 with the Fair Isaac Corporation.
3.10.1 XpertRule A particularly interesting software package available on the market is XpertRule, a series of Windows-based expert system development tools that use genetic algorithms for optimization [23]. XpertRule Knowledge Builder, for instance, is an environment for developing and deploying knowledge-based applications and components. Knowledge Builder supports a broad range of knowledge representations and graphical knowledge building blocks to define them. In particular, decision making (Decisioning) is supported for diagnostic applications, selection processes, assessment, monitoring, and workflow applications. The knowledge in such applications is represented by rules or decision trees. The decisions are derived from attributes, which are captured from the user through question–answer dialogs or derived from calculations. Decision trees relate a decision to a number of attributes; a table of cases contains a list of examples or rules, each showing how an outcome or decision relates to a combination of attribute values. The system uses case-based reasoning for diagnostic applications, as well as fuzzy logic for applications like performance assessment and diagnostics. Knowledge Builder uses graphical knowledge representation to simplify the process of capturing business knowledge. Several additional software packages are available in the XpertRule family; among them, XpertRule Miner is a dedicated data mining software product, which allows deriving decision-tree rules or patterns from data files using data-mining technologies.
Analyser is a machine-learning add-on to XpertRule that uses genetic algorithms to optimize solutions (http://www.attar.com). XpertRule is widely used in industrial and scientific applications, one of them performed in cooperation with NASA and Rockwell Aerospace: NASA’s Contamination Control Engineering Design Guidelines Expert System, developed by Rockwell International’s Space Systems Division. It was created for education in contamination control processes and is designed as an interactive guide to assist with quantifying contamination for sensitive surfaces. The tool enables the user to quantify molecular and particulate contamination requirements for solar arrays, thermal control surfaces, or optical sensors [24].
3.10.2 Rule Interpreter (RI) RI is a program that supports the development and execution of rule scripts written in the rule description language (RDL) [25]. RDL combines substructure search with descriptor-oriented selection, incorporates Boolean logic, and allows the generation of a tree-like decision structure. RI was written in Delphi under MS Windows and derived from the OASIS SAR system [26,27]. Rule scripts operate on substances defined in a data file in either SMILES (simplified molecular input line entry specification) or CMP (compound) format. The conventional SMILES notation as developed by Weininger [28] provides a basic description of molecules in terms of two-dimensional chemical graphs. The CMP file format developed with the OASIS system [29] provides separate logical records for information about connectivity, three-dimensional structure, electronic structure from quantum-chemical molecular–orbital computations, as well as physicochemical and experimental toxicological data. The output of RI consists of SMILES files or binary CMP files that include toxicochemical classes based on descriptors, together with toxic potency estimates.
The RDL uses SMILES as a basic format and includes extensions. An RDL script consists of two definition sections and an application section. The Define Section contains definitions of screens in an extended SMILES format that includes qualifiers. Qualifiers are enclosed in braces in the structure definition, succeeding the atom or bond entries to which they belong. They may denote ionic state, hybridization, participation in rings, number of adjacent protons, and chiral parity for an atom. An example is as follows:
c1c{H}c(C{sp3})c{H}cc1Cl; SMILES: c1cc(C)ccc1Cl
It specifies the presence of a para-chlorotoluene moiety and excludes any ortho-substituents. Bond qualifiers define a particular bond configuration, cis or trans for a double bond. Descriptor qualifiers can be used to restrict values of, for instance, physicochemical descriptors of an atom: a qualifier like (–0.2 < q) restricts a charge descriptor to partial atomic charges greater than –0.2. In addition to predefined descriptors, reserved keys can be used for dynamic descriptors that depend on the screen context. Examples are the reserved key enumerate, which denotes the frequency of the preceding substructure occurring in a molecule, and distance, which denotes the distance between the geometric centers of two substructures.
The Rule Section contains the definitions of rules that are applied to a substructure. The rule is enclosed in quotation marks and uses the logical operators AND, OR, and NOT to return either TRUE or FALSE as a result. Rules are organized in expressions, and their priority of execution is determined by the enclosing brackets. A rule identifier ensures the unique identification and reuse of rules in other expressions. An example is as follows:
Rbenz: ‘Benz’ and not ‘Polyarom’
where Rbenz is the identifier for the rule, and Benz and Polyarom are definitions for benzene and polyaromatic substructures. The Apply Section represents the business logic and uses the definitions and rules described in the previous sections. The decision scheme is constructed of assignment, conditional, or compound statements. An assignment statement assigns either a numerical value or the results of a screening definition to a molecular descriptor, similar to conventional programming languages. Conditional statements represent the nodes in a decision tree in the form of if–then–else statements and can be combined as compound statements. A text editor provides special functionality for creating and editing rule script files, as well as for syntax checking and script application. The software incorporates a two-dimensional chemical model builder that uses SMILES as input. RI has been applied in different areas, such as for determining acute fish toxicity and for determining androgen receptor binding affinity. Though RI is a script-based system requiring some expertise in writing and evaluating scripts and rules, it is a good example of a robust and flexible tool for screening of chemical databases and inventories.
3.11 Concise Summary
Action is one of the possible results of the reasoning process in an expert system. In contrast to assignments, actions are triggered processes that lead to a change in the software environment.
C Language Integrated Production System (CLIPS) is a programming language shell designed for rule-based, object-oriented, and procedural software development.
CLIPS Facts are assignment statements in CLIPS representing an actual state or property in the form: (assert (X value1 value2 …)).
CLIPS Rules are conditional statements in CLIPS used in the inference process in the form: (defrule rulename (if-condition) => (then-action)).
Declarative Programming is a programming paradigm that describes computations by specifying explicit objectives rather than the procedure on how to achieve them. Examples are SQL, PROLOG, and XSLT (see also Imperative Programming).
Expert System Shell is a suite of software that allows construction of a knowledge base and interaction through use of an inference engine.
Imperative Programming is a programming paradigm that describes computation as sequential statements that change a program state. Examples are FORTRAN, Pascal, C, and Visual Basic (see also Declarative Programming).
Inference Engine is a module of an expert system that performs reasoning using rules from the knowledge base and facts from the working memory to process them according to specified business rules.
JBoss Rules is an open-source production rule system designed for implementing business rules according to business policies.
JESS is a superset of the CLIPS programming language for the Java platform using pattern matching to continuously apply rules to facts.
Knowledge Base is a part of the expert system’s memory that stores domain-specific knowledge in rules, frames, or other types of knowledge representations.
List Processing (LISP) is a programming language that uses an interpreter for symbolic calculations based on single scalar values (atoms) and linked lists (lists).
Production Rule System is a computer program using productions, which include a basic representation of knowledge and rules. Productions can be executed and triggered to perform an action.
Programmation en Logique, or Programming in Logic (PROLOG), is a declarative programming language based on facts and rules that are evaluated systematically to form a logical deduction.
PROLOG Facts are assignment statements in PROLOG representing an actual state or property in the form: property(X, value).
PROLOG Rules are conditional statements in PROLOG used in the inference process in the form: Then-Statement(X) :- if-condition1(X), if-condition2(X), ….
Rule Description Language (RDL) is a language developed for the Rule Interpreter (RI) software that combines substructure search with descriptor-oriented selection.
Rule Interpreter (RI) is a program that supports the development and execution of rule scripts, written in the rule description language (RDL).
Working Memory is a part of the expert system’s memory that stores specific information on a current problem represented as facts.
XpertRule is a series of Windows-based expert system development tools, which use genetic algorithms for optimization.
References
1. Winston, P.H. and Horn, B.K., Lisp, Addison-Wesley, Reading, MA, 1989. 2. Steele, G.L., Common Lisp: The Language, 2nd ed., Digital Press, Bedford, MA, 1990. 3. Colmerauer, A., et al., Un système de communication homme-machine en français, Internal report, Groupe d'Intelligence Artificielle, Université Aix-Marseille II, June 1973. 4. Colmerauer, A., Metamorphosis Grammars, in Natural Language Communication with Computers, Bolc, L., Ed., Springer Verlag, Berlin, 1978. 5. Giarratano, J.C. and Riley, G.D., Expert Systems, Principles and Programming, 4th ed., Thomson Course Technology, Boston, 2005. 6. Riley, G., CLIPS — A Tool for Building Expert Systems, http://www.ghg.net/clips/CLIPS.html. 7. Lopez, F., The Parallel Production System, Ph.D. thesis, University of Illinois at Urbana-Champaign, 1987.
8. Friedman-Hill, E., Jess in Action — Java Rule-Based Systems, 1st ed., Manning Publications Co., New York, 2003. 9. JBoss, JBoss Rules, http://www.jboss.com/products/rules. 10. Bobrow, D. and Winograd, T., An Overview of KRL, a Knowledge Representation Language, Cognitive Sci., 1, 1, 1977. 11. Brachman, R.J. and Schmolze, J., An Overview of the KL-ONE Knowledge Representation System, Cognitive Sci., 9, 171, 1985. 12. Brachman, R.J., et al., Living with CLASSIC: When and How to Use a KL-ONE-Like Language, in Principles of Semantic Networks: Explorations in the Representation of Knowledge, Sowa, J.F., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1991, 401. 13. MacGregor, R. and Bates, R., The LOOM Knowledge Representation Language, Tech. Report RS-87-188, Information Sciences Institute, University of Southern California, Los Angeles, CA, 1987. 14. Clark, P. and Porter, B., Building Concept Representations from Reusable Components, in Proceedings of the 14th National Conference on Artificial Intelligence, Providence, RI, 1997, 369. 15. Clark, P. and Porter, B., KM — The Knowledge Machine: Users Manual, http://www.cs.utexas.edu/users/mfkb/km/userman.pdf. 16. Christaller, T., Di Primio, F., Schnepf, U., and Voss, A., The AI-Workbench BABYLON — An Open and Portable Development Environment for Expert Systems, Academic Press, New York, 1992. 17. Gilmore, J.F., Roth, S.P., and Tynor, S.D., A Blackboard System for Distributed Problem Solving: Blackboard Architectures and Applications, Academic Press, San Diego, 1989. 18. Computer Associates International, Inc., New York, CA Aion Business Rules Expert, http://www.ca.com. 19. EXSYS Inc., Albuquerque, New Mexico, http://www.exsys.com/. 20. Intellicorp, Menlo Park, CA, http://www.intellicorp.com. 21. Kehler, T. and Clemenson, G., KEE: The Knowledge Engineering Environment for Industry, Systems and Software, 3(1), 212, 1984. 22. Turner, S.R., Nexpert Object, IEEE Expert: Intelligent Systems and Their Applications, 6, 72, 1991. 23. XpertRule Software Ltd., Leigh, UK, http://www.xpertrule.com. 24. Tribble, A.C. and Boyadjian, B.V., An Expert System for Aerospace, PC AI Magazine, 11(3), 16, 1997. 25. Karabunarliev, S., et al., Rule Interpreter: A Chemical Language for Structure-Based Screening, J. Mol. Struct. (Theochem), 622, 53, 2003. 26. Mekenyan, O., Karabunarliev, S., and Bonchev, D., The OASIS Concept for Predicting the Biological Activity of Chemical Compounds, J. Math. Chem., 4, 207, 1990. 27. Mekenyan, O., et al., A New Development of the OASIS Computer System for Modeling Molecular Properties, Comp. Chem., 18, 173, 1994. 28. Weininger, D., SMILES — A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules, J. Chem. Inf. Comput. Sci., 28, 31, 1988. 29. Prickett, S.E. and Mavrovouniotis, M.L., Construction of Complex-Reaction Systems. 1. Reaction Description Language, Comp. Chem. Engin., 21, 1219, 1997.
4 Dealing with Chemical Information
4.1 Introduction Artificial intelligence, or soft computing, plays an increasingly important role in evaluating scientific data. In particular, analytical chemistry, cheminformatics, and bioinformatics depend on the power of artificial intelligence approaches to store and use the experience and knowledge of the experts in a company efficiently. The extensive use of computational methods leads to a steady increase of data that are barely manageable, even with a team of scientists. Consequently, automated intelligent analysis techniques for primary data are becoming essential, particularly in the research areas of combinatorial chemistry, chemometrics, cheminformatics, and bioinformatics. This chapter provides an overview of techniques used in combination with knowledge- or rule-based approaches.
4.2 Structure Representation Databases for chemical structures store machine-readable representations of the two-dimensional (2D) or three-dimensional (3D) structure models of chemical compounds. Most of the research in the field of representing and searching 2D and 3D chemical structures is based on the Cambridge Structural Database, which is maintained by the Cambridge Crystallographic Data Centre. The Cambridge Structural Database contains structures of small organic and organometallic compounds. The Chemical Abstracts Service (CAS) initiated the first investigations on processing chemical structures. The two-dimensional structure of a chemical compound defines the topology of a molecule — that is, which atoms are present and how they are connected by bonds. To represent 2D chemical structures in machine-readable form, the topology has to be defined in a unique manner.
4.2.1 Connection Tables (CTs) Early approaches used fragmentation codes, which consist of a list of substructures (fragments) defining the molecule [1]. This approach did not cover any information about the connectivity of the substructures, and the assignment of substructures was done manually. A more thorough approach was the connection table (CT), which consists of two arrays: (1) the atom list, which defines the atom symbols or atomic numbers of a molecule; and (2) the bond list, which contains the bonded atoms and the bond order. A connection table is a comprehensive description of the topology of a molecular graph [2]. It can either be redundant, listing each bond in the molecule twice, or nonredundant, where each bond is listed once. To ensure that connection
tables are unique representations of molecules, the numbering of atoms follows a set scheme. Connection tables are the most common representation and are supported by most structure editor software.
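As a simple illustration, a connection table can be held in two arrays, an atom list and a bond list, as described above. The following Python sketch uses a made-up, minimal representation (not one of the standard file formats) for acetic acid with implicit hydrogen atoms:

# A minimal connection table: an atom list plus a bond list.
# Atom numbering starts at 1; hydrogens are implicit in this sketch.
atoms = ["C", "C", "O", "O"]        # acetic acid, CH3-C(=O)-OH (heavy atoms only)

# Nonredundant bond list: (atom_i, atom_j, bond_order), each bond listed once.
bonds = [(1, 2, 1),                 # C1-C2, single bond
         (2, 3, 2),                 # C2=O3, double bond
         (2, 4, 1)]                 # C2-O4, single bond

def redundant_table(atoms, bonds):
    """Expand the nonredundant bond list into a redundant connection table,
    listing every bond twice (once from the point of view of each atom)."""
    table = {i + 1: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        table[i].append((j, order))
        table[j].append((i, order))
    return table

print(redundant_table(atoms, bonds))
# {1: [(2, 1)], 2: [(1, 1), (3, 2), (4, 1)], 3: [(2, 2)], 4: [(2, 1)]}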
4.2.2 Connectivity Matrices Atom and bond matrices can be used for calculations. A special form of a connection table is the adjacency matrix, which contains information about the connections between atoms without atom types and bond orders. The adjacency matrix, M, of a chemical structure is defined by the elements Mij, where Mij is 1 if atoms i and j are bonded and zero otherwise (Figure 4.1a). In addition, a distance matrix contains information about the distances between atoms. The distance matrix, D, of a chemical structure is usually defined by the elements Dij, where Dij is the length (number of bonds) of the shortest path from atom i to atom j; zero is used if atoms i and j are not part of the same connected component (Figure 4.1b). Distance matrices may also be calculated for real three-dimensional (Euclidean) distances between atoms using the Cartesian coordinates of the atom positions (Figure 4.1c). These matrices allow the calculation of descriptors that account for the shape and conformation of molecules. If conformation is not required or desired, the bond path distance matrix is an alternative, which sums up the bond lengths between the atoms (Figure 4.1d). We will investigate applications of these matrices in the next chapter.

(a) Adjacency (thionyl chloride, O=SCl2; atoms O1, Cl2, S3, Cl4)
      O1   Cl2   S3   Cl4
O1     —    0    1    0
Cl2    0    —    1    0
S3     1    1    —    1
Cl4    0    0    1    —

(b) Distance
      O1   Cl2   S3   Cl4
O1     —    2    1    2
Cl2    2    —    1    2
S3     1    1    —    1
Cl4    2    2    1    —

(c) Cartesian Distance
      O1     Cl2    S3     Cl4
O1     —    2.664  1.421  2.664
Cl2  2.664    —    2.020  3.065
S3   1.421  2.020    —    2.020
Cl4  2.664  3.095  2.020    —

(d) Bond Path Distance
      O1     Cl2    S3     Cl4
O1     —    3.441  1.421  3.441
Cl2  3.441    —    2.020  4.040
S3   1.421  2.020    —    2.020
Cl4  3.441  4.040  2.020    —

Figure 4.1 Matrices can be used for describing chemical structures in different fashions. The adjacency matrix (a) of thionyl chloride shows whether two atoms are bonded to each other. The distance matrix (b) describes the number of bonds between two elements of a structure. The Cartesian distance matrix (c) contains the real three-dimensional (Euclidean) distances between atoms calculated from the Cartesian coordinates of the atom positions. The bond path distance matrix (d) contains the sum of bond lengths between two atoms and is, in contrast to the Cartesian matrix, independent of the conformation of the molecule.
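The matrices of Figure 4.1 can be computed directly from such an atom and bond list. The following Python sketch is a minimal illustration (the helper names are invented, and no cheminformatics toolkit is assumed); it reproduces the adjacency, distance, and bond path distance matrices for thionyl chloride:

# Build adjacency, topological distance, and bond path distance matrices
# from an atom list and a bond list (thionyl chloride: O1, Cl2, S3, Cl4).
INF = float("inf")

atoms = ["O", "Cl", "S", "Cl"]
# (atom_i, atom_j, bond_length_in_angstrom); numbering starts at 1
bonds = [(1, 3, 1.421), (2, 3, 2.020), (3, 4, 2.020)]

n = len(atoms)
adjacency = [[0] * n for _ in range(n)]
topo = [[0 if i == j else INF for j in range(n)] for i in range(n)]    # bond counts
path = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]  # bond lengths

for i, j, length in bonds:
    a, b = i - 1, j - 1
    adjacency[a][b] = adjacency[b][a] = 1
    topo[a][b] = topo[b][a] = 1
    path[a][b] = path[b][a] = length

# Floyd-Warshall: shortest paths over all atom pairs, for both matrices.
for k in range(n):
    for i in range(n):
        for j in range(n):
            topo[i][j] = min(topo[i][j], topo[i][k] + topo[k][j])
            path[i][j] = min(path[i][j], path[i][k] + path[k][j])

print(adjacency)  # adjacency matrix, Figure 4.1a (diagonal printed as 0 here)
print(topo)       # topological distance matrix, Figure 4.1b
print(path)       # bond path distance matrix, Figure 4.1d (O1 to Cl2: 1.421 + 2.020 = 3.441)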
4.2.3 Linear Notations Another approach for representing 2D chemical structures is the linear notation. Linear notations are strings that represent the 2D structure as a more or less complex set of characters and symbols. Characters represent the atoms in a linear manner, whereas symbols are used to describe information about the connectivity [3]. The most commonly used notations are the Wiswesser line notation (WLN) and the simplified molecular input line entry specification (SMILES) [2]. The WLN, invented by William J. Wiswesser in 1949, was the first line notation capable of precisely describing complex molecules [4]. It consists of a series of uppercase characters (A–Z), numerals (0–9), the ampersand (&), the hyphen (-), the oblique stroke (/), and a blank space.
4.2.4 Simplified Molecular Input Line Entry Specification (SMILES) SMILES was developed by David Weininger at the U.S. Environmental Research Laboratory in Duluth, Minnesota, in 1986 [5]. The simplicity of this notation made it a readily used exchange format for chemical structures. The coding starts with an arbitrary atom and follows the bonds along the molecule. Whereas single bonds are implicit, double and triple bonds are described by the symbols = and #. Atoms are described with their atomic symbol starting with an uppercase character; lowercase characters are reserved for aromatic atoms, and hydrogen atoms are usually omitted. Chain branches are enclosed in parentheses, ring closures are indicated by matching digits, and stereochemistry is described by the symbols /, \, and @. Some examples are shown in Table 4.1.

Table 4.1 Examples of Chemical Compounds and Their SMILES Notation
Trivial Name          SMILES Notation
Ethane                CC
Carbon dioxide        O=C=O
Hydrogen cyanide      C#N
Triethylamine         CCN(CC)CC
Acetic acid           CC(=O)O
Cyclohexane           C1CCCCC1
Benzene               c1ccccc1
Hydronium ion         [OH3+]
E-difluoroethene      F/C=C/F
L-alanine             N[C@@H](C)C(=O)O
D-alanine             N[C@H](C)C(=O)O
Nicotine              CN1CCC[C@H]1c2cccnc2
Vitamin A             C/C(=C\CO)/C=C/C=C(/C)\C=C\C1=C(C)CCCC1(C)C
4.2.5 SMILES Arbitrary Target Specification (SMARTS) SMARTS is an extension to SMILES specifically developed for substructure searching [6]. It is a notation that allows specifying substructures using rules that are extensions of SMILES. In the SMILES language, there are two fundamental types of symbols — atoms and bonds — that allow specifying a molecular graph. In SMARTS, the labels for the atoms and bonds are extended to include logical operators and special atomic and bond symbols; these allow atoms and bonds to be more general. For example, the atomic symbol [C,N,O] in SMARTS is an atom list for aliphatic C, N, or O atoms; the bond symbol ~ (tilde) matches any bond.
4.3 Searching for Chemical Structures Structure search, also known as identity search, can be used to find out whether the substance is already known or to find references and information on that structure to discover what other work has already been performed on that structure. The advantage of structure search over text searching is that it will find any occurrence of the structure in the database regardless of whether the substance has numerous different names. Whereas fragmentation codes and line notations are excellently suited for data transfer over the Internet — they are just strings — they are not unique in every case and, thus, are unfavorable for structure searches. Connection tables are unique representations for structure and substructure searching ensuring both precise indexing and retrieval. One of the first approaches for substructure search was published by Feldmann et al. [7]. They used connection table representations of 2D chemical structures to search for particular substructures within these larger structures. The resulting system — the National Institutes of Health–Environmental Protection Agency (NIH– EPA) Chemical Information System — started in 1973 as a joint project in mass spectrometry and structure searching between the NIH and the EPA [8]. In many cases scientists search for substructures rather than structures to find candidates with a similar chemical structure or compounds with a common partial structure.
4.3.1 Identity Search versus Substructure Search The difference between algorithms for identity search and substructure search is significant. Finding a structure in a database can be performed by comparing two molecules with a unique enumeration of atoms. Several techniques have been published for unique enumeration of atoms in a molecule, of which the Morgan algorithm [9] is one of the most well known. To prove the identity of consistently enumerated molecules, a direct superimposition by atom-by-atom matching algorithms is performed. However, a unique enumeration does not solve the problem of substructure search. Superimposition of substructures on structures would require the mapping of any combination of molecular graphs to find a graph isomorphism; this is a tedious and time-consuming process. Because the rate of search is always one of the most important limitations for database applications, substructure search should incorporate additional preprocessing steps that restrict the number of molecules to be compared in an atom-by-atom matching algorithm.
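The core idea of the Morgan algorithm, iteratively refining atom values by extended connectivity until no additional atoms can be distinguished, can be sketched as follows. This is a simplified Python illustration of the principle only; the published algorithm includes further tie-breaking rules that are omitted here:

# Extended-connectivity ranking in the spirit of the Morgan algorithm.
# neighbors[i] lists the atoms bonded to atom i (0-based indices).
def morgan_ranks(neighbors):
    # Initial invariant: the degree (number of connections) of each atom.
    ec = [len(nbrs) for nbrs in neighbors]
    n_classes = len(set(ec))
    while True:
        # Next-order extended connectivity: sum of the neighbors' current values.
        new_ec = [sum(ec[j] for j in nbrs) for nbrs in neighbors]
        new_classes = len(set(new_ec))
        if new_classes <= n_classes:
            break                    # no further discrimination gained
        ec, n_classes = new_ec, new_classes
    # Atoms are then ordered by decreasing extended connectivity; remaining
    # ties would be resolved by additional properties (element, bond order).
    return sorted(range(len(neighbors)), key=lambda i: -ec[i])

# Example: 2-methylbutane, C1-C2(-C5)-C3-C4
neighbors = [[1], [0, 2, 4], [1, 3], [2], [1]]
print(morgan_ranks(neighbors))       # the branching atom (index 1) comes first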
A substructure is simply a collection of atoms with no order of atoms or bonds being implied; each atom and bond of a molecule occurs only once in the substructure. We normally think of a substructure as a set of atoms interconnected by bonds in a chemically reasonable way. However, a substructure object can be used to represent much more or less than an ordinary substructure; it can also be used to represent less conventional collections, like the number of double bonds in a structure, all of the atoms with an odd number of protons, and so forth. In other words, a substructure object is just an arbitrary set of atoms and bonds; it is up to the programmer using a substructure object to determine the relevance of the set. Finding a substructure, or molecular fragment, in one or more structures in a chemical database corresponds to finding a subgraph isomorphism. The number of structures in the database that needs to be examined can be reduced by preprocessing the database either by clustering the database entries with a common substructure or with a given hyperstructure (e.g., substructure screening) or by using special key codes (e.g., hash codes) to describe the database information relevant to substructures. Finally, the molecular fragment must be superimposed on the database structure at various positions, a process typically performed by the previously mentioned atom-by-atom matching algorithms.
4.3.2 Isomorphism Algorithms A connection table is a simple representation of a labeled graph where the nodes and edges of the graph are the atoms and the bonds of the structure, respectively. In 1976 graph theory was used for describing 2D chemical structures, leading finally to the use of isomorphism algorithms for searching chemical structures [10]. An isomorphism algorithm is used in graph theory to determine the extent to which two graphs can be mapped onto each other; this is done by permutation through the vertices of the graph. Three methods are typical:
• Graph isomorphism algorithms can be used for mapping a (query) full structure onto a structure to determine identity between two structures.
• Subgraph isomorphism algorithms can find a substructure within a structure.
• Maximum common subgraph isomorphism algorithms are used to locate the largest part that two structures have in common. These algorithms are used to find similar structures.
A subgraph isomorphism algorithm starts at an initial atom of the query and tries to match it with a database structure. When two atoms are successfully matched, it continues on adjacent atoms; when a mismatch with the query is found, the algorithm traces back to the last matching atom. This method is called atom-by-atom mapping and continues until the query substructure is completely found in the database structure. These and similar approaches for subgraph isomorphism algorithms have been published [11–13]. Due to the nature of this approach, subgraph isomorphism algorithms are time consuming; subgraph isomorphism is a combinatorial problem belonging to the nondeterministic polynomial time complete (NP-complete) class of problems, for which no efficient (polynomial time) general solution is known. Several authors suggested improvements to reduce the
computational effort, like the concept of parallelism published by Wipke and Rogers in 1984. Parallelism eliminates backtracking by performing individual computational tasks for each atom in a query substructure [14]. Tasks that contain a mismatch end, whereas the other tasks continue. This approach benefits strongly from computers that support parallel processing.
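The atom-by-atom mapping with backtracking described above can be sketched in a few lines of Python. The sketch matches element labels and bond existence only (bond orders and other atom properties are ignored), and the data structures are simplified assumptions rather than a real database format:

# Naive backtracking search for a subgraph isomorphism.
# A molecule is given as (elements, neighbor sets); indices are 0-based.
def subgraph_match(query, target, mapping=None):
    q_elem, q_nbrs = query
    t_elem, t_nbrs = target
    mapping = mapping or {}
    if len(mapping) == len(q_elem):
        return mapping                       # every query atom is mapped
    qa = len(mapping)                        # next query atom to map
    for ta in range(len(t_elem)):
        if ta in mapping.values():
            continue                         # target atom already used
        if q_elem[qa] != t_elem[ta]:
            continue                         # element labels must agree
        # Every already-mapped neighbor of qa must also be bonded to ta.
        if any(nb in mapping and mapping[nb] not in t_nbrs[ta]
               for nb in q_nbrs[qa]):
            continue
        mapping[qa] = ta                     # tentative assignment
        result = subgraph_match(query, target, mapping)
        if result:
            return result
        del mapping[qa]                      # mismatch downstream: backtrack
    return None

# Query: a C-O fragment; target: ethanol, C-C-O
query = (["C", "O"], [{1}, {0}])
ethanol = (["C", "C", "O"], [{1}, {0, 2}, {1}])
print(subgraph_match(query, ethanol))        # {0: 1, 1: 2}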
4.3.3 Prescreening One of the most successful attempts may be prescreening, a method that reduces the number of molecules that require the full subgraph search [3]. Prescreening uses bit strings encoding the presence or absence of a fragment in the query and the database molecules. If the database structure bit string contains all the bits set in the query bit string, then that molecule undergoes a subgraph isomorphism algorithm. The screen assignment for database structures is performed on uploading the structure and can be done automatically; therefore, the screen searching part of the substructure searching method is very efficient. However, the definition of the fragments to be encoded may vary from task to task.
The substructure search problem can be reduced to a search for all chemical structures containing fragments isomorphic to a query fragment. Testing whether or not a chemical structure contains an isomorphic substructure can be formalized as a subgraph isomorphism problem. Several practical algorithms have been developed for chemical structures. However, substructure search will take a very long time if such an algorithm is applied to all chemical structures in a database. To avoid testing all chemical structures, substructure screens are used. Substructure screens are defined and used as follows. Each chemical structure in a database is associated with a bit vector; each bit represents the presence (1) or absence (0) of a predefined chemical fragment. These screen fragments are chosen according to the structural relevance for a given data set of molecules. The query structure is analyzed for screen structures, and an equivalent screen vector is produced. If a database structure contains a fragment isomorphic to one of the screen structures, the corresponding bit is set to 1. Consequently, every bit set in the query molecule must appear in the database molecule bit vector to ensure the presence of the corresponding screen fragment. Comparing the screen vectors of query and database molecules can reduce the search effort dramatically if the screen fragments have been selected adequately. Although substructure screens have been used effectively in existing database systems, no rules exist for compiling a set of effective screen fragments for multipurpose applications. At that point, the chemist is called upon to select screen fragments for a given data set of chemical structures in the most efficient way — that is, by minimizing the number of fragments used as a substructure screen and by optimizing the set of fragments. In fact, substructure screens can be adapted to specific molecule databases.
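A substructure screen of this kind can be sketched with Python integers used as bit vectors. The fragment detection itself is assumed to be available elsewhere (for example, through a subgraph isomorphism test); the fragment list and the example assignments below are invented for illustration:

# Substructure prescreening with bit vectors.
# Each screen fragment occupies one bit position; a molecule's screen vector
# has a bit set for every fragment it contains.
SCREEN_FRAGMENTS = ["benzene ring", "carboxylic acid", "amide", "chlorine"]

def screen_vector(fragment_hits):
    """fragment_hits: indices of screen fragments found in the molecule
    (in a real system this comes from a subgraph isomorphism test)."""
    bits = 0
    for index in fragment_hits:
        bits |= 1 << index
    return bits

def passes_screen(query_bits, molecule_bits):
    # Every bit set in the query must also be set in the database molecule;
    # only molecules passing this cheap test go on to atom-by-atom matching.
    return query_bits & molecule_bits == query_bits

database = {
    "aspirin":    screen_vector({0, 1}),     # benzene ring + carboxylic acid
    "benzamide":  screen_vector({0, 2}),
    "chloroform": screen_vector({3}),
}
query = screen_vector({0, 1})                # looking for an aromatic acid
candidates = [name for name, bits in database.items()
              if passes_screen(query, bits)]
print(candidates)                            # ['aspirin']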
4.3.4 Hash Coding Hash coding is a scheme for providing rapid access to data items that are distinguished by some key. Each data item is associated with a key (e.g., a substructure). A
hash function is applied to the key of the item, and the resulting hash value is used as an index to select one out of a number of hash buckets in a hash table. The table contains pointers to the original items. If, when adding a new item, the hash table already has an entry at the indicated location, the key of the entry must be compared with the given key. If two keys hash to the same value (hash collision), an alternative location is used — usually the next free location cyclically following the indicated one. For optimum performance, the table size and hash function must be tailored to the number of entries and the range of keys to be used. The hash function usually depends on the table size; consequently, when the table size changes, the hash values must be recalculated and the table rebuilt completely.
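The bucket scheme described above, including the collision strategy of taking the next free location cyclically, can be sketched as follows. This is a deliberately simplified Python illustration; a real chemical registry would hash a canonical structure code with a tailored hash function, and the key and record names below are invented:

# A fixed-size hash table with collision handling by cyclic probing:
# on a collision, the next free location is used.
TABLE_SIZE = 11                              # tailored to the expected number of entries

def make_table():
    return [None] * TABLE_SIZE               # each slot holds (key, pointer) or None

def insert(table, key, pointer):
    index = hash(key) % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (index + step) % TABLE_SIZE   # cyclic probing
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, pointer)
            return slot
    raise RuntimeError("hash table full; it must be rebuilt with a larger size")

def lookup(table, key):
    index = hash(key) % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (index + step) % TABLE_SIZE
        if table[slot] is None:
            return None                      # key not present
        if table[slot][0] == key:
            return table[slot][1]            # pointer to the original item
    return None

table = make_table()
insert(table, "c1ccccc1", "record-4711")     # key could be a canonical SMILES
print(lookup(table, "c1ccccc1"))             # record-4711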
4.3.5 Stereospecific Search

When stereospecific search options are used, only compounds that include stereochemical information are regarded. The stereo property of atoms is defined by specifying wedged and dashed bonds. The resulting stereo descriptor is calculated internally. If no stereo center has been specified, the stereo configuration is usually presumed as undefined. Stereo bonds can be defined in several ways:

• Undefined — configuration is unknown.
• Relative S configuration defined.
• Relative R configuration defined.
• Either stereo configuration (mixture of enantiomers).
The latter case can be used to specify racemic mixtures with a stereo either bond. Stereo either bonds are interpreted as a racemic mixture of R and S enantiomers and are also used for epimers. An atom is considered a potential stereo center if it binds to at least three substituents; the fourth may be an implicit hydrogen atom. The atom is handled as an explicit stereo center if at least three different nonhydrogen atoms are bonded. Recognition of stereochemical information can be restricted to several search modes, such as:

• Stereo inselective: Disables the stereo check.
• Isomer selective: Checks only whether the database molecule has stereo information at the same positions as the query (independent of the stereo descriptors).
• Relative configuration: Finds all molecules in the database with the same stereochemistry and all their enantiomers, but no mixtures.
• Absolute configuration: In this mode, only molecules with the same stereochemistry are found.
4.3.6 Tautomer Search

In contrast to resonance effects, where delocalization of π-electrons occurs, tautomeric effects lead to isomeric structures that differ significantly in the relative positions of their atoms. Tautomerism always involves formation and dissection of σ-bonds in the course of a transformation, together with a change in geometry. A
tautomer search algorithm enables the recognition of molecules capable of proton tautomerism, where a proton changes its connection while a double bond moves to an adjacent position. To identify potential tautomeric structures, an additional hash code is stored on the database for each structure. This hash code represents the molecule transformed to a canonical structure. Each query is also transformed to the canonical structure, before the atom-by-atom matching algorithms are applied.
4.3.7 Specifying a Query Structure

Commercially available structure editors provide several tools that allow a query to be defined with specific or general components. The back-end software has to transfer this information to the substructure search module. Transferring the structure and special information is done via standardized molecule file formats that include or are enhanced by special code conventions for the additional information. The substructure search module interprets the special conventions and translates them into an internal substructure representation. Examples of special query information are as follows:

• A positive or negative charge on a node of the structure
• A particular atom at a site or a generic atom
• The bond type, or left unspecified
• Free sites
• Stereochemistry
• Generic atom lists
• Stereoisomers
• Tautomers
• Charged compounds
• Isotopes
Besides specific search parameters, the substructure search modules often include formula or formula-range search as well as mass and mass-range search algorithms.

Many structures can be drawn in various manners, even if they represent the same molecule. To achieve a unique representation of structures in a database, a normalization algorithm is required to convert ambiguous structures into a unique form. For identity searches and substructure searches, all structures are usually normalized. Normalization is performed for each database entry just before uploading the structure, as well as for queries before sending the query to the search module.

Stereoselective substructure search can be performed at various levels of detail. The stereo structure query is created by using special conventions, like wedged and dashed bonds, to indicate the chirality of atoms. By default, the stereochemistry in the query is relative. Advanced features are necessary for stereospecific substructure search requests to search for a chemical fragment while controlling, for example, the degree of substitution of atoms, the coordination number (i.e., number of substituents) of atoms, the cyclicity of atoms and bonds (i.e., whether or not they are part of a ring), and the chemical environment of the substructure.
If the structure search retrieves too many results, there are a number of ways to refine or limit the result:

• Change the drawn structure to be more specific.
• Add another chemical structure to the search strategy.
• Combine the structure search with a text search or other searches.

Query atoms can be defined either as single atoms or as atom lists. Atom lists are represented in brackets, such as [C,N,O,P]. In this case, the atoms in the list are allowed at the corresponding position. Query atom lists can be negated (NOT [N,O]) to exclude atom types at certain positions. It is possible to define multiple query atom lists at several positions. Several other options are provided by most of the commercially available structure editors:

• The number of substituents for variably bonded atoms can be restricted by using a substitution-count atom property.
• The number of bonds belonging to a ring can be specified for an atom.
• Bond types allow a specification of whether the bond belongs to a ring or a chain system.
• In substructure queries, each atom that is not explicitly defined is interpreted as an any-atom (A). In most cases, it is not necessary to define any-atoms because open valences are interpreted automatically as any-atoms (including hydrogen). However, because the bond type for an implicit any-atom is a single bond by default, any-atoms are necessary for specifying another bond type that connects the any-atom. The explicit specification of the any-atom includes all elements except hydrogen.
• Molecular formulas can be used as additional search criteria. Either complete formulas — similar to sum formula identity search — or partial formulas and atomic ranges can be specified, such as C5-8H8-12O2.
• Molecular masses or ranges of masses can be defined to restrict the search results to molecules of a certain mass.
• Charge-selective searches enable one to distinguish between charged and uncharged compounds.
• For semipolar bonds, nitro and nitroso groups are sometimes drawn with double bonds instead of a single bond with formal charges. To enable finding groups with two double bonds as well as those with formal charges, nitro and nitroso groups are normalized.
4.4 Describing Molecules

The investigation of molecular structures and their properties is one of the most fascinating topics in chemistry. Since the first alchemy experiments, scientists have created a language consisting of symbols, terms, and notations to describe compounds, molecules, and their properties. This language was refined to give a unique notation known today by scientists all over the world. The increasing use of computational
methods made it necessary to implement this language in computer software. Although a trained scientist understands the conventional language easily, new approaches have become necessary for describing molecular information in computer software. A new kind of language for computational chemistry has evolved: the language of molecular descriptors. Let us start with a basic definition of a molecular descriptor:

A molecular descriptor represents a certain property or a set of properties of a molecule in a way that is suitable for computational processing.
It is important to note that a molecular descriptor does not necessarily represent an entire chemical structure; it does not even necessarily describe structural features at all. As scientists look at a structural drawing, they might have different perceptions depending on the task they have to solve. A synthesis chemist will have a special perception for the reactive centers, whereas a spectroscopist might be interested in functional groups or mass fragments. This difference in perceiving a structural drawing has to be reflected in a molecular descriptor. Consequently, molecular descriptors are closely related to the task to be solved.

Let us stay with the example of the spectroscopist and assume he is a specialist in infrared spectroscopy. One of the features of a structural drawing he would pay special attention to is hydroxyl groups, because his experience tells him that hydroxyl groups can usually be seen as a broad band at the beginning of an infrared spectrum. He translates this information almost automatically into a hydroxyl band in the spectrum. In fact, the infrared spectrum is nothing else than a molecular descriptor that represents a certain property of the molecule: its vibrational behavior under infrared radiation. Seen from this standpoint, several well-known structural descriptors are already used in the day-to-day laboratory work of a scientist.

In contrast to that, molecular descriptors in the context of computational chemistry are values, vectors, or matrices that are calculated from one or more measured or calculated properties of a molecule. For ease of understanding, let us define those descriptors recorded as a result of an analytical technique as experimental descriptors; we will talk about artificial descriptors when we refer to those that are calculated. Experimental descriptors emerge from a fixed experimental design, and their appearance is subject to the physical or chemical limitations of the measurement technique. The advantage of artificial descriptors is that they can be adjusted and fine-tuned easily to fit a task due to their purely mathematical nature. The only limitation to this approach is the scientist's imagination. However, there are several constraints to be taken into account when selecting or constructing a molecular descriptor. Todeschini and Consonni pointed these out in their book Handbook of Molecular Descriptors [15]. Let us have a closer look at these constraints.
4.4.1 Basic Requirements for Molecular Descriptors

There are four basic requirements for a molecular descriptor.
4.4.1.1 Independency of Atom Labeling

Labels or numbers assigned to atoms are not related at all to a molecular property and, thus, should not have an effect on the descriptor. This applies particularly to the sequence of atoms, which is most often defined by the sequence in which atoms are drawn in the structure editor software. Unless a special algorithm is used for unique enumeration, like the Morgan algorithm [9] or the Jochum-Gasteiger canonical renumbering [16], the numbering of atoms is purely arbitrary. The Morgan algorithm assigns an integer label to every atom in the molecule and updates the atom labels iteratively so that topologically equivalent atoms have the same labels. In the initial phase, every atom obtains a label for the number of connected bonds. In subsequent iterative steps, every atom label is updated as the sum of its current label and the labels of adjacent atoms. After the n-th iteration, each atom obtains a label that contains information on the nodes that are within n bonds around it. The computed labels are used as a necessary condition to find topologically equivalent nodes in the graph structure. The algorithm does not work for structures whose atoms are all connected by the same number of bonds (regular graphs). However, except for a few special cases, like fullerenes, such regular graphs are rarely found in chemical compounds.

4.4.1.2 Rotational/Translational Invariance

A descriptor has to be unaffected by changes in the absolute coordinates of atoms. There are two cases to distinguish here:

• Conformational changes (rotation of parts of a molecule around a bond axis): A descriptor might be independent of conformational changes; however, it may be desired that a descriptor reflects rotations around single bonds.
• Entire molecule rotation: A descriptor should definitely not be affected by rotation of the entire molecule. The way atom coordinates are calculated is more or less arbitrary, since it depends on a starting point for calculation.

4.4.1.3 Unambiguous Algorithmically Computable Definition

Molecular descriptors can only be managed by computer software in an effective manner. Consequently, the calculation shall be computationally unambiguous and reasonably fast. The same algorithm shall result in the same descriptor as long as the molecule properties of interest do not change.

4.4.1.4 Range of Values

Values of a descriptor shall be in a suitable numerical range. It is always hard to work with data that may appear in any arbitrary range or may be infinite. It is a good approach to define minimum and maximum values for a descriptor in advance, which makes comparison easier and avoids calculating numbers in an unpredictable range when further processing is performed.
Another important requirement is a fixed size of the descriptor. Descriptors shall be comparable by using statistical methods, artificial neural networks, or other mathematical approaches that rely on vectors or matrices of the same dimension.
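As an illustration of the atom-labeling requirement, the following Python sketch implements the iterative Morgan-style relabeling described in Section 4.4.1.1 on a hydrogen-depleted adjacency list; the example molecule and the fixed iteration count are simplifying assumptions.

```python
# Sketch of iterative Morgan-style labeling on a hydrogen-depleted graph.
# Atoms start with their connectivity (number of bonds) and are relabeled
# iteratively with the sum of their own and their neighbors' labels, so that
# topologically equivalent atoms keep identical labels.

def morgan_labels(adjacency: dict, iterations: int = 3) -> dict:
    labels = {atom: len(neighbors) for atom, neighbors in adjacency.items()}
    for _ in range(iterations):
        labels = {
            atom: labels[atom] + sum(labels[nb] for nb in adjacency[atom])
            for atom in adjacency
        }
    return labels

# 2-propanol heavy-atom skeleton: C1-C2(-O)-C3
adjacency = {
    "C1": ["C2"],
    "C2": ["C1", "C3", "O"],
    "C3": ["C2"],
    "O":  ["C2"],
}
print(morgan_labels(adjacency))
# In this purely topological sketch, C1, C3, and O obtain identical labels;
# element types would be handled by the extended (canonical) variants.
```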
4.4.2 Desired Properties of Molecular Descriptors

Todeschini and Consonni also pointed out a series of important desired characteristics [15]. Further required characteristics of molecular descriptors were discussed by Randic [17]. Let us have a closer look at some of these.

• Structural Interpretation — It is helpful if a descriptor can be interpreted in a way that allows features of the chemical structure, or even the entire chemical structure, to be derived. This is particularly interesting if a molecular descriptor behaves like a spectrum; in this case we could talk about an artificial spectrum. Several approaches exist that use such artificial spectra to compare them with experimental spectra — for instance, to derive experimental spectra that do not exist in a database.
• Descriptor/Property Correlation — This is one of the basic tasks for a descriptor; descriptors are usually helpful for predicting properties or comparing properties of chemical compounds.
• Correlation with Other Molecular Descriptors — This applies mainly to artificial descriptors. A correlation with experimental descriptors is helpful to produce missing experimental data.
• Reflection of Structural Changes — This applies if the molecular descriptor is intended to represent a structural feature. Most of the properties of a molecule are, in fact, related to the structure. However, it depends on the definition of a chemical structure; two- or three-dimensional representations, molecular surface, charge distribution, and polarizability are all different features of a structure, and the extent of their effects on, for instance, chemical binding is quite different. A 2D structural drawing, for example, is of minor importance for the biological activity of a compound, which is mainly affected by the surface and the charge distribution of a structure. Inner structural changes in a large protein might not affect the shape and charge distribution at the surface; thus, a descriptor for biological activity should not react to these changes.
• Restriction to Molecule Classes — A molecular descriptor shall be applicable to all potential molecules. This is the ideal case; in practice, this is barely achievable when working with vectors or matrices. A typical example is the size of a molecule: The larger the molecule, the fuzzier is the influence of its detail features on the desired property. To calculate vectors or matrices of fixed dimension — which is a requirement described already — the resulting descriptor has to be independent of the size of the molecule. Consequently, a descriptor that provides a fine representation of a small or medium-sized structure will usually fail to represent a protein or a gene. It is like looking at a painting from 200 feet; details get fuzzy, and it gets hard to compare the view with the one obtained standing in front of the painting.
• Discrimination among Isomers — Isomerism and, in particular, stereochemistry have a profound influence on macroscopic properties of a compound. Whereas rotation around a single bond might not be a desired feature for a descriptor used for structure searching, steric hindrance might affect the reactivity dramatically. The ideal algorithm for molecular descriptors will allow these properties to be switched on or off as desired.

4.4.2.1 Reversible Encoding

The ideal descriptor can be decoded to obtain the original chemical structure or the properties that have been used to calculate the descriptor. Although this is definitely desired, the real world shows that the information used to calculate the descriptor is usually too complex to use at its full size. A descriptor shall have a reasonable size for effective computation, and this is mostly achieved by reducing the information to the facts that are of major importance for the task. The need for a fixed descriptor dimension also contradicts this requirement.
4.4.3 Approaches for Molecular Descriptors

A molecular or compound property consisting of a single value can already be used as a descriptor. Molecular weight, sum of atomic polarizabilities, mean atomic van der Waals volume, mean atomic polarizability, and number of atoms are typical examples. These properties can be used in the form of a single component, or they can be combined in a vector consisting of multiple different properties to fit the task in the most appropriate manner. An example is an application published by Wagener et al. for the prediction of the effective concentration of carboquinones as anticarcinogenic drugs [18]. For the prediction they used four physicochemical properties: (1) the molar refractivity index; (2) hydrophobicity; (3) field effect; and (4) resonance effect. By calculating the contributions of different residues on para-benzoquinone derivatives and feeding them into a back-propagation neural network, they were able to predict the concentration of the drug that still showed an anticarcinogenic effect.
4.4.4 Constitutional Descriptors

Constitutional descriptors represent the chemical composition of a molecule in a generic way. They are independent of molecular connectivity and geometry. Typical constitutional descriptors are number, sum, mean, or average of constitutional properties. Examples are number of a particular atom type or bond type, number of particular ring systems, sum of electronegativities, mean atomic van der Waals volumes, and average molecular weight. Constitutional descriptors are particularly attractive because of their simplicity; it is easy to understand their meaning in the context, and they can be easily calculated from atom lists or textual representations. One of the first generic approaches for calculating constitutional descriptors comes from Free and Wilson in 1964 [19]:
$$ P = P_0 + \sum_{k}^{N} c_k I_k \qquad (4.1) $$
Here, P is a property calculated from the sum of N structural features I_k, each multiplied by a dynamic or constant factor c_k, added to a base contribution P_0. Constitutional descriptors are widely used for investigating quantitative structure/activity relationships (QSAR) for drug discovery.
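A minimal Python sketch of Equation 4.1 is shown below; the group contributions and the base value are invented for illustration and are not fitted to any real data set.

```python
# Sketch of a Free-Wilson-type model (Equation 4.1): a property is estimated as
# a base value plus the sum of group contributions c_k weighted by indicator
# variables I_k (1 if the structural feature is present, 0 otherwise).
# The contribution values below are invented for illustration only.

def free_wilson(p0: float, contributions: dict, indicators: dict) -> float:
    return p0 + sum(contributions[k] * indicators.get(k, 0) for k in contributions)

contributions = {"OH": 0.8, "Cl": -0.3, "phenyl": 1.2}   # hypothetical c_k values
molecule = {"OH": 1, "phenyl": 1}                        # fragments present (I_k)

print(free_wilson(p0=2.0, contributions=contributions, indicators=molecule))
# -> 2.0 + 0.8 + 1.2 = 4.0
```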
4.4.5 Topological Descriptors

Topological indices are calculated using data on the connectivity of atoms within a molecule. Consequently, these descriptors contain information about the constitution, size, shape, branching, and bond types of a chemical structure, whereas bond lengths, bond angles, and torsion angles are neglected. The first published topological descriptor is the Wiener index, named after Harry Wiener [20]. It calculates the sum of the number of minimum bonds n between all nonhydrogen atoms i and j:
$$ W = \frac{1}{2} \sum_{i,j}^{N} n_{ij} \qquad (4.2) $$
The factor in front of the sum term ensures that bond paths are only counted once. The Wiener index is particularly useful to describe quantitative structure–property relationships (QSPR). An example is the correlation of saturation vapor pressure with the chemical structure. Since the Wiener index does not distinguish atom types, those correlations can only be achieved for homologous series of compounds. Another series of successfully applied topological descriptors is derived from graph theory using atom connectivity information of a molecule. An example is the connectivity index developed by Randic [21]. In its simple form,

$$ \chi = \sum_{i,j}^{N} (D_i D_j)^{-1/2} \qquad (4.3) $$
D denotes the degrees of the atoms i and j of a molecular graph, and the sum goes over all adjacent atoms. The exponent was originally chosen by Randic to be −1/2; however, it has since been treated as an adjustable parameter to optimize the correlation between the descriptor and particular classes of organic compounds. The connectivity index was later extended by Kier and Hall to applications with connected subgraphs [22].
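The following Python sketch computes the Wiener index (Equation 4.2) and the simple Randic connectivity index (Equation 4.3) for a hydrogen-depleted molecular graph given as an adjacency list; the n-butane example is illustrative.

```python
# Sketch of the Wiener index (Eq. 4.2) and the simple Randic connectivity index
# (Eq. 4.3) for a hydrogen-depleted molecular graph given as an adjacency list.
from collections import deque

def shortest_path_lengths(adjacency, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for nb in adjacency[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return dist

def wiener_index(adjacency):
    total = 0
    for a in adjacency:
        total += sum(shortest_path_lengths(adjacency, a).values())
    return total // 2            # each path counted twice (factor 1/2 in Eq. 4.2)

def randic_index(adjacency):
    degree = {a: len(nbs) for a, nbs in adjacency.items()}
    total = 0.0
    for a, nbs in adjacency.items():
        for b in nbs:
            if a < b:            # count each bond only once
                total += (degree[a] * degree[b]) ** -0.5
    return total

# n-butane heavy-atom skeleton: C1-C2-C3-C4
adjacency = {"C1": ["C2"], "C2": ["C1", "C3"], "C3": ["C2", "C4"], "C4": ["C3"]}
print(wiener_index(adjacency))              # 10
print(round(randic_index(adjacency), 3))    # 1.914
```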
4.4.6 Topological Autocorrelation Vectors

Topological autocorrelation vectors are calculated from the two-dimensional structure of a molecule that can be expressed as a molecular graph. One of the original
approaches was developed by Moreau and Broto, called the autocorrelation of a topological structure (ATS) [23]. ATS describes how a property is distributed along the topological structure:
$$ A_d = \sum_{i,j}^{N} \delta_{ij} \left( p_i \, p_j \right)_d \qquad (4.4) $$
The properties p of the atoms i and j are considered for a particular topological distance d. δij is a Kronecker delta that represents additional constraints or conditions. The topological distance may also be replaced by the Euclidean distance, thus accounting for two- or three-dimensional arrangement of atoms. Three-dimensional spatial autocorrelation of physicochemical properties has been used to model the biological activity of compound classes [24]. In this case, a set of randomly distributed points is selected on the molecular surface, and all distances between the surface points are calculated and sorted into preset intervals. These points are used to calculate the spatial autocorrelation coefficient for particular molecular properties, such as the molecular electrostatic potential (MEP). The resulting descriptor is a condensed representation of the distribution of the property on the molecular surface.
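A minimal Python sketch of a topological autocorrelation vector in the spirit of Equation 4.4 is given below; atom pairs (including i = j at distance zero) are binned by their topological distance, and the atomic property values are illustrative.

```python
# Sketch of a topological autocorrelation vector (Eq. 4.4): for each topological
# distance d, the products of an atomic property p_i * p_j are summed over all
# atom pairs separated by exactly d bonds. Property values are illustrative.
from collections import deque

def topo_distances(adjacency, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for nb in adjacency[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return dist

def autocorrelation(adjacency, prop, max_d):
    ats = [0.0] * (max_d + 1)
    atoms = list(adjacency)
    for i, a in enumerate(atoms):
        dist = topo_distances(adjacency, a)
        for b in atoms[i:]:                  # each pair counted once
            d = dist[b]
            if d <= max_d:
                ats[d] += prop[a] * prop[b]
    return ats

# Acetic acid heavy atoms: C1(methyl)-C2(-O1)(=O2); electronegativity-like values
adjacency = {"C1": ["C2"], "C2": ["C1", "O1", "O2"], "O1": ["C2"], "O2": ["C2"]}
prop = {"C1": 2.5, "C2": 2.5, "O1": 3.5, "O2": 3.5}
print(autocorrelation(adjacency, prop, max_d=2))
```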
4.4.7 Fragment-Based Coding

A widely used method for calculating a molecular descriptor is substructure- or fragment-based coding. The molecule to be encoded is divided into several substructures that represent the typical information necessary for the task. Many authors have used this method for the automated interpretation of spectra with artificial neural networks [25], in expert systems for structure elucidation [26], with pattern recognition methods [27], and with semiempirical calculations [28]. A binary vector is typically used to define simply the presence or absence of functional groups that exhibit important spectral features in the corresponding spectrum. The main disadvantage of this method is that the number of substructures cannot be generally restricted. In various publications, the number of substructures differs from 40 to 229, depending on the more or less subjective point of view of the problem. Affolter et al. showed that a simple assignment of infrared-relevant substructures and corresponding infrared signals does not describe the spectrum–structure correlation with adequate accuracy [29]. This is mainly due to the effect of the chemical environment on the shape and position of absorption signals. An improvement of the simple substructure approach is the method fragment reduced to an environment that is limited (FREL) introduced by Dubois et al. [30]. Several centers of the molecule are described, including their chemical environment. By taking the elements H, C, N, and O and the halogens into account and combining all bond types — single, double, triple, aromatic — the authors found descriptors for 43 different FREL centers that can be used to characterize a molecule. To characterize the arrangement of all atoms in a molecule, the entire molecule can be regarded as a connectivity graph where the edges represent the bonds and
the nodes represent the atoms. It is possible to calculate a descriptor that defines the constitution of a molecule independently of conformational changes by adding the number of bonds or the sum of bond lengths between all pairs of atoms. The resulting descriptor is not restricted regarding the number of atoms. Clerc and Terkovics [31] used this method based on the number of bonds for the investigation of Quantitative Structure/Property Relationships (QSPR). Using the sum of bonds (or sum of distances) eliminates redundant information, but also certain molecular features. Such a descriptor no longer characterizes complex relationships, such as the correlation between structure and spectrum. Further information about molecular descriptors can be found at the free on-line resource “Molecular Descriptors” [32].
4.4.8 3D Molecular Descriptors

Molecules are usually represented as 2D graphs or 3D molecular models. Although two-dimensional representations are extremely helpful for communicating and reporting, most of the compound properties depend on the spatial arrangement of the atoms in a molecule. Three-dimensional coordinates of atoms in a molecule are sufficient to describe the spatial arrangement of atoms; however, they exhibit two major disadvantages: (1) They depend on the size of a molecule; and (2) they do not describe any additional properties, like partial charges or polarizability. The first attribute is important for computational analysis of data. Even a simple statistical function (e.g., the correlation) requires the information to be represented in equally sized vectors of a fixed dimension. The solution to this problem is a mathematical transformation of the Cartesian coordinates of a molecule to a vector of fixed length. The second point can be overcome by including the desired properties in the transformation algorithm. A descriptor for the three-dimensional arrangement of a molecule can be derived from the Cartesian coordinates of the atoms. Reliable coordinates can be calculated quite easily by semiempirical or molecular mechanics (i.e., force-field) methods by using molecular modeling software. Fast 3D structure generators are available that combine rules and force-field methods to calculate Cartesian coordinates from the connection table of a molecule (e.g., CORINA [33]). Let us summarize the three important prerequisites for a 3D structure descriptor: It should be (1) independent of the number of atoms, that is, the size of a molecule; (2) unambiguous regarding the three-dimensional arrangement of the atoms; and (3) invariant against translation and rotation of the entire molecule. Further prerequisites depend on the chemical problem to be solved. Some chemical effects may have an undesired influence on the structure descriptor if the experimental data to be processed do not account for them. A typical example is the conformational flexibility of a molecule, which has a profound influence on a 3D descriptor based on Cartesian coordinates. The application in the field of structure–spectrum correlation problems in vibrational spectroscopy requires that a descriptor contains physicochemical information related to vibration states. In addition, it would be helpful to gain the complete 3D structure from the descriptor or at least structural information (descriptor decoding).
4.4.9 3D Molecular Representation Based on Electron Diffraction

One way to obtain 3D molecular descriptors is to derive them from molecular transforms, which are generalized scattering functions used to create theoretical scattering curves in electron diffraction studies. A molecular transform serves as the functional basis for deriving the relationship between a known molecular structure and x-ray and electron diffraction data. The general molecular transform is
$$ I(s) = \sum_{j}^{N} f_j \cdot e^{2\pi i \, \mathbf{r}_j \cdot \mathbf{s}} \qquad (4.5) $$
where I(s) represents the intensity of the radiation scattered in various directions s by a collection of N atoms located at points r_j. f_j is a form factor taking into account the directional dependence of scattering from a spherical body of finite size. The value s measures the wavelength-dependent scattering, s = 4π·sin(ϑ/2)/λ, where ϑ is the scattering angle and λ is the wavelength of the electron beam. In 1931, Wierl suggested using this equation in a modified form [34]. Soltzberg and Wilkins introduced a number of simplifications to obtain a binary code [35]. They considered only the zero crossings of I(s) in the range 0–31 Å⁻¹ and divided the s range into 100 equal intervals, each described by a binary value equal to 1 if the interval contains a zero crossing and 0 otherwise. The resulting code is a 100-dimensional binary vector. Considering a rigid molecule, neglecting the instrumental conditions, and replacing the form factors by the atomic number, A, they obtained
$$ I(s) = \sum_{i}^{N-1} \sum_{j>i}^{N} A_i A_j \, \frac{\sin(s \, r_{ij})}{s \, r_{ij}} \qquad (4.6) $$
where I(s) is the scattered electron intensity, A_i and A_j are the atomic numbers, r_ij are the interatomic distances between the ith and jth atoms, respectively, and N is the number of atoms. Gasteiger et al. returned to the initial I(s) curve and maintained the explicit form of the curve [36]. For A they substituted various physicochemical properties such as atomic mass, partial atomic charges, and atomic polarizability. To obtain descriptors of uniform length, the intensity distribution I(s) is made discrete by calculating its value at a sequence of, for example, 32 or 64 evenly distributed values of s in the range of 0–31 Å⁻¹. The resolution of the molecule representation increases with a higher number of values. The resulting descriptor is the 3D MoRSE (Molecular Representation of Structures based on Electron diffraction) code.
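The following Python sketch computes a 3D MoRSE-type descriptor according to Equation 4.6, sampling I(s) at evenly spaced values of s and using atomic numbers as weights; the water geometry and the chosen sampling parameters are illustrative assumptions.

```python
# Sketch of a 3D MoRSE-type descriptor (Eq. 4.6): the scattering-like intensity
# I(s) is evaluated at evenly spaced s values, weighting atom pairs by an atomic
# property (here the atomic number). Coordinates are rough illustrative values.
import math

def morse_descriptor(coords, weights, n_points=32, s_max=31.0):
    atoms = list(coords)
    descriptor = []
    for k in range(n_points):
        s = k * s_max / (n_points - 1)
        total = 0.0
        for i in range(len(atoms) - 1):
            for j in range(i + 1, len(atoms)):
                r = math.dist(coords[atoms[i]], coords[atoms[j]])
                sinc = 1.0 if s * r == 0 else math.sin(s * r) / (s * r)  # -> 1 as s*r -> 0
                total += weights[atoms[i]] * weights[atoms[j]] * sinc
        descriptor.append(total)
    return descriptor

coords = {"O": (0.0, 0.0, 0.0), "H1": (0.96, 0.0, 0.0), "H2": (-0.24, 0.93, 0.0)}
weights = {"O": 8, "H1": 1, "H2": 1}          # atomic numbers as weights
print([round(v, 2) for v in morse_descriptor(coords, weights, n_points=8)])
```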
4.4.10 Radial Distribution Functions

3D MoRSE codes are valuable for conserving molecular features, but it is not possible to interpret them directly. This drawback leads to investigations of related types of descriptors. Steinhauer and Gasteiger picked up the idea of radial distribution
functions (RDF) used in x-ray scattering investigations [38,39]. Radial distribution functions, well known in physicochemistry and physics, describe the distance distribution of points in three-dimensional space. This suggests applying this function to a three-dimensional molecular model. The radial distribution function
$$ g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B \left( r - r_{ij} \right)^2} \qquad (4.7) $$
describes an ensemble of N atoms in 3D space in a spherical volume of radius r. g(r) is the probability distribution of the atom distances r_ij between atoms i and j. The exponential term contains the smoothing parameter B, which determines the width of the individual peaks in the probability distribution. B can be interpreted as a temperature factor defining the movement of atoms. A number of discrete intervals is usually used to calculate g(r). Equation 4.7 meets all the requirements for a 3D structure descriptor: Within a predefined distance range, it is independent of the number of atoms (the size of a molecule), unique concerning the three-dimensional arrangement, and invariant against translation and rotation of the entire molecule. A slight modification of the general form of an RDF leads to a molecular descriptor, the radial distribution function (RDF) code, which includes atom properties that address characteristic atom features in the molecular environment. Fields of application for RDF codes are the simulation of infrared spectra and the derivation of molecular structure information from infrared spectra [40,41].
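A minimal Python sketch of a property-weighted radial distribution function descriptor based on Equation 4.7 is shown below; the geometry, the atomic property, and the smoothing and sampling parameters are illustrative choices rather than recommended settings.

```python
# Sketch of a radial distribution function (RDF) descriptor (Eq. 4.7), here
# weighted by an atomic property; distances r are sampled at discrete intervals.
import math

def rdf_descriptor(coords, prop, n_points=64, r_max=8.0, B=100.0):
    atoms = list(coords)
    g = [0.0] * n_points
    for k in range(n_points):
        r = (k + 1) * r_max / n_points
        for i in range(len(atoms) - 1):
            for j in range(i + 1, len(atoms)):
                rij = math.dist(coords[atoms[i]], coords[atoms[j]])
                g[k] += prop[atoms[i]] * prop[atoms[j]] * math.exp(-B * (r - rij) ** 2)
    return g

# Rough formaldehyde geometry (Angstrom) and atomic numbers as the property
coords = {"C": (0.0, 0.0, 0.0), "O": (1.21, 0.0, 0.0),
          "H1": (-0.55, 0.94, 0.0), "H2": (-0.55, -0.94, 0.0)}
prop = {"C": 6, "O": 8, "H1": 1, "H2": 1}
descriptor = rdf_descriptor(coords, prop, n_points=16, r_max=4.0)
print([round(v, 2) for v in descriptor])
```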
4.4.11 Finding the Appropriate Descriptor

According to the previous definition, molecular descriptors represent properties of a molecule. The most important issue is to eliminate properties that are not relevant for the task to be solved. This is similar to analytical methodology, where unwanted effects are eliminated or suppressed by using chemical or physical methods, like chemical modifiers in chromatography or background compensation in spectroscopy. Finding the appropriate descriptor for the representation of chemical structures is one of the basic problems in chemical data analysis. Several methods have been developed for the description of molecules, including their chemical or physicochemical properties [42]. The wide variety of molecular structures usually requires a reduction of information together with the encoding process. Additionally, structural features should be encoded including properties that have a profound influence on the features to be investigated, such as molecular symmetry, physicochemical bond, or atomic or molecular properties like charge distribution, electronegativity, and polarizability of the compounds. Radial distribution functions (not to be confused with radial basis functions (RBF), a term introduced by Broomhead and Lowe in 1988 for a type of function used in neural networks employing a hidden layer of radial units [37]) and their derivatives have been found
to be particularly useful due to their flexibility. There are several ways of generating descriptors based on the concept of radial distribution functions:

• RDF descriptors can be restricted to specific atom types or distance ranges to represent complete or partial information in a certain chemical environment, for instance, to describe steric hindrance in a reaction or structure–activity relationships of a molecule.
• RDF descriptors are interpretable by using simple rules. Consequently, they provide a possibility to convert the vector into the corresponding 3D structure.
• RDF descriptors provide further valuable information, for example about bond distances, ring types, planar and nonplanar systems, and atom types. This is a valuable consideration for computer-assisted descriptor elucidation.

The next chapter deals with the development and modifications of RDF descriptors.
4.5 Descriptive Statistics

Application of statistics in expert systems is a topic that fills more than a single book. However, some of the investigations presented in the next chapters are based on methods of descriptive statistics. The terms and basic concepts of importance for the interpretation of these methods should be introduced first. Algorithms and detailed descriptions can be found in several textbooks [43–47].
4.5.1 Basic Terms

Before we look at some details of the analysis process, let us recall some important basic statistical terms.

4.5.1.1 Standard Deviation (SD)

The SD of a data set is a measure of how spread out the data are; it is essentially the average distance from the mean of the data set to a point, calculated from the sum of squared distances of each data point from the mean divided by n − 1 (the sample size minus one):
$$ \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (4.8) $$
4.5.1.2 Variance

This is another measure of the spread of data in a data set. It is in fact almost identical to the standard deviation; it is simply the standard deviation squared.
$$ \operatorname{var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sigma^2 \qquad (4.9) $$
4.5.1.3 Covariance

Whereas standard deviation and variance are purely one-dimensional, covariance is measured between two dimensions. Instead of taking squares of the distances of the one-dimensional data x, we take the product of distances for two variables, x and y. For each data item, multiply the difference of the x value and the mean of x by the difference between the y value and the mean of y.
$$ \operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (4.10) $$
A positive covariance indicates that both dimensions increase together, whereas if the covariance is negative, then as one dimension increases, the other decreases. If the covariance is zero, the two dimensions are independent of each other.

4.5.1.4 Covariance Matrix

This is an extension of the covariance concept to multidimensional data. It is a matrix that shows all covariance values calculated between the individual dimensions. A covariance matrix for three dimensions, x, y, and z, would look as follows:
$$ \begin{pmatrix} \operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\ \operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\ \operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z) \end{pmatrix} \qquad (4.11) $$
Since the sequence of the dimensions in the covariance does not affect the result, that is, cov(x,y) = cov(y,x), the covariance matrix is symmetric. The main diagonal shows the self-products and, thus, the variances of the individual dimensions.

4.5.1.5 Eigenvalues and Eigenvectors

Multiplying a matrix A with a vector v transforms this vector from its original position. When multiplying matrices with vectors, we have to consider a special case: If the resulting vector is a scalar multiple of the original vector, the factor λ is called an eigenvalue and the vector is an eigenvector of the matrix A; in mathematical terms,

$$ A v = \lambda v \qquad (4.12) $$
In this case the length of the vector changes by λ, but not its orientation. In other words, the eigenvectors of a matrix are the nonzero vectors that, when multiplied by the matrix, give scalar multiples of themselves, and the corresponding scalar factors are the eigenvalues of the matrix. Eigenvectors arise from transformations and can only be found for square matrices; for an n × n matrix there are at most n linearly independent eigenvectors. All eigenvectors of a symmetric matrix (such as the covariance matrix) are mutually orthogonal; that is, data can be represented in terms of
these orthogonal vectors instead of expressing them in terms of the x and y axes. Due to the fact that scaling does not change the direction of a vector, scaling of eigenvectors still leads to the same eigenvalue. Consequently, eigenvectors are usually scaled to have a length of 1 to be comparable, resulting in the unit eigenvectors. If we consider all n eigenvectors, we can write an eigenvector matrix V
$$ V = \begin{pmatrix} v^{(1)} & v^{(2)} & \cdots & v^{(n)} \end{pmatrix} = \begin{pmatrix} v_1^{(1)} & v_1^{(2)} & \cdots & v_1^{(n)} \\ v_2^{(1)} & v_2^{(2)} & \cdots & v_2^{(n)} \\ \vdots & \vdots & & \vdots \\ v_n^{(1)} & v_n^{(2)} & \cdots & v_n^{(n)} \end{pmatrix} \qquad (4.13) $$
We can combine all of them by using the single matrix equation

$$ A V = V \Lambda \, , \qquad (4.14) $$
where Λ is a diagonal matrix of all eigenvalues:
$$ \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_n \end{pmatrix} \qquad (4.15) $$
The diagonal eigenvalues in this matrix make up the eigenvalue spectrum. The decomposition

$$ A = V \Lambda V^{-1} \qquad (4.16) $$

helps in characterizing many of the behaviors of A in various applications.
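A quick numerical illustration of Equations 4.12 through 4.16 can be written with NumPy (assumed to be available); the symmetric matrix below is an arbitrary example.

```python
# Numerical check of A v = lambda v (Eq. 4.12) and of the eigendecomposition
# A = V Lambda V^-1 (Eq. 4.16) for an arbitrary symmetric example matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigenvalues, V = np.linalg.eig(A)        # columns of V are unit eigenvectors

for lam, v in zip(eigenvalues, V.T):
    print(np.allclose(A @ v, lam * v))   # True for each eigenpair

Lambda = np.diag(eigenvalues)            # diagonal matrix of eigenvalues (Eq. 4.15)
print(np.allclose(A, V @ Lambda @ np.linalg.inv(V)))   # True
```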
4.5.2 Measures of Similarity

It is useful to characterize a set of values that has a tendency to cluster around a central value by its centered moments of distribution, M_n (the sums of the nth integer powers of the deviations from that value). Although most descriptors exhibit no centered distribution, the similarity in the general trend of distribution makes them comparable to other descriptors. Several general measures for describing the similarity of descriptors are valuable. The root mean square (RMS) error is calculated from the mean squared individual differences of the components of two descriptors, g_A := [g_A1 ... g_An] and g_B := [g_B1 ... g_Bn] (n is the number of components), and is used as the default measure of similarity for descriptor database searches:
$$ RMS = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( g_{Aj} - g_{Bj} \right)^2} \qquad (4.17) $$
Many investigations rely on descriptor sets rather than individual descriptors. An obvious way to compare an individual descriptor with a descriptor set is to calculate an average set descriptor ḡ(r) from the sum of all vector components of the descriptors g_i(r) divided by the number of descriptors L:
$$ \bar{g}(r) = \frac{1}{L} \sum_{i=1}^{L} g_i(r) \qquad (4.18) $$
In other words, each component of the descriptors g_i(r) in a data set is summed and divided by the number of molecules L. ḡ(r) represents the mean descriptor and, thus, a kind of mean molecule of the data set. The next step is to calculate the difference between each descriptor g_i(r) and the average descriptor ḡ(r). This value is defined by the average descriptor deviation δg_i for an individual descriptor i with the components j,
$$ \delta g_i = \frac{1}{n} \sum_{j=1}^{n} \left| g_{ij} - \bar{g}_j \right| \qquad (4.19) $$
and represents a robust estimator for the variability of a descriptor within its data set. The sum of all average descriptor deviations divided by the number of descriptors L leads to the average diversity Δg:

$$ \Delta g = \frac{1}{L} \sum_{i=1}^{L} \delta g_i \qquad (4.20) $$
The average diversity represents an estimator for the molecular diversity of a data set and is calculated for each descriptor in a data set. The correlation measures the relation between two or more variables and goes back to works performed in the late nineteenth century [48]. The most frequently used type of correlation is the product–moment correlation according to Pearson [49]. The Pearson correlation determines the extent to which values of two variables are linearly related to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used. A high correlation can be approximated by a straight regression line — sloped upward or downward — that is determined by the minimum sum of the squared distances of all the data points. One of the descriptors, gA or gB, can be regarded as the dependent variable. Consequently, two regression lines with different slopes can exist. By defining
$$ a_{AB} = \sum_{j=1}^{n} \left( g_{Aj} \, g_{Bj} \right) - n \, \bar{g}_A \bar{g}_B \qquad (4.21) $$
the slopes b_AB and b_BA can be calculated:

$$ b_{AB} = \frac{a_{AB}}{\sum_{j=1}^{n} g_{Aj}^2 - n \, \bar{g}_A^2} \qquad (4.22) $$

$$ b_{BA} = \frac{a_{AB}}{\sum_{j=1}^{n} g_{Bj}^2 - n \, \bar{g}_B^2} \qquad (4.23) $$
The geometric mean of both regression lines reflects the correlation coefficient R:

$$ R = \operatorname{sgn}(a_{AB}) \sqrt{b_{AB} \, b_{BA}} \qquad (4.24) $$
The squared correlation coefficient R², the coefficient of determination, represents the proportion of common variation in the two variables (i.e., the magnitude of the relationship).
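The following Python sketch collects the similarity measures discussed above for descriptors stored as plain lists of equal length: the RMS difference (Equation 4.17), the average descriptor and average descriptor deviation (Equations 4.18 and 4.19, with the deviation taken here as a mean absolute difference), the average diversity (Equation 4.20), and the Pearson correlation coefficient; the example vectors are invented.

```python
# Similarity measures for equally sized descriptor vectors (illustrative data).
import math

def rms(gA, gB):                                        # Eq. 4.17
    n = len(gA)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gA, gB)) / n)

def average_descriptor(descriptors):                    # Eq. 4.18
    L = len(descriptors)
    return [sum(col) / L for col in zip(*descriptors)]

def descriptor_deviation(g, g_mean):                    # Eq. 4.19 (mean absolute)
    return sum(abs(a - b) for a, b in zip(g, g_mean)) / len(g)

def average_diversity(descriptors):                     # Eq. 4.20
    g_mean = average_descriptor(descriptors)
    return sum(descriptor_deviation(g, g_mean) for g in descriptors) / len(descriptors)

def pearson_r(gA, gB):                                  # product-moment correlation
    n = len(gA)
    mA, mB = sum(gA) / n, sum(gB) / n
    cov = sum((a - mA) * (b - mB) for a, b in zip(gA, gB))
    sA = math.sqrt(sum((a - mA) ** 2 for a in gA))
    sB = math.sqrt(sum((b - mB) ** 2 for b in gB))
    return cov / (sA * sB)

dataset = [[0.1, 0.4, 0.9, 0.3], [0.2, 0.5, 0.8, 0.2], [0.0, 0.3, 1.0, 0.4]]
print(rms(dataset[0], dataset[1]), average_diversity(dataset),
      pearson_r(dataset[0], dataset[1]))
```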
4.5.3 Skewness and Kurtosis

An important prerequisite for the use of descriptive statistics is the shape of the distribution of variables, that is, the frequency of values from different ranges of the variable. It is assumed in multiple regression analysis that the residuals — predicted minus observed values — are normally distributed. The significance of correlation coefficients of a particular magnitude will change depending on the size of the sample from which they were computed. Because a regression line is determined by minimizing the sum of squares of distances, outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, thus, the value of the correlation. This effect has been described as the King Kong effect [50].

Let us assume we want to evaluate the relationship between the height and the weight of individuals in a horde of gorillas by using simple correlation. For most of the adult animals we will find a height between 150 and 170 centimeters, whereas the weight ranges between 130 and 140 kilograms. A typical correlation coefficient is 0.85, which shows a trend in the relationship. Now we find King Kong, the chief of the horde, with a remarkable height of 190 centimeters and a weight of 188 kilograms; including these data in our correlation chart results in a correlation coefficient of 0.95 (Figure 4.2a). The reason for this increase in correlation on including a value that is obviously an outlier lies in the distribution of values. Figure 4.2b shows that a standard distribution no longer exists, and thus the basic assumptions for the correlation coefficient do not apply. Descriptive statistics can provide information about how well the distribution can be approximated by the normal distribution with skewness and kurtosis (the latter term was first used by Pearson) [49].
[Figure 4.2: (a) weight (kg) plotted against height (cm) with regression lines excluding (r = 0.85) and including (r = 0.95) the King Kong data point; (b) frequency distribution of the height values.]
Figure 4.2 The King Kong effect in unweighted linear regression. (a) The inclusion of a significant outlier can lead to an increase in the correlation coefficient; even the graphical representation shows that the bulk of the data is no longer well represented by the regression line when the outlier is included. The reason is the distance of the bulk data to the outlier. (b) The distribution of the data shown in (a) no longer shows Gaussian behavior due to the outlier included in the data set. Without weighting the individual data points, one of the basic requirements for applying linear regression — standard distribution of data points — is no longer given.
The skewness, S, of a descriptor reflects the symmetry of the distribution of its components relative to the Gaussian distribution:
$$ S = \frac{n \cdot M_3}{(n-1) \cdot (n-2) \cdot \sigma^3} \qquad (4.25) $$
with
$$ M_k = \sum_i (x_i - \bar{x})^k \qquad (4.26) $$
Figure 4.3 Graphical representation of kurtosis, K, and skewness, S, in comparison to the Gaussian (standard) distribution (upper left). The right-hand side shows leptokurtic (peaked) and platykurtic (flattened) distributions as well as positively skewed distribution (fronting) and negatively skewed distribution (tailing).
as the kth moment, and

$$ \sigma = \sqrt{M_2 / (n-1)} \qquad (4.27) $$
as the unbiased sample estimate of a population variance. S is clearly different from zero when the distribution is asymmetric — whereas normal distributions should be perfectly symmetric. A positive skewness indicates a fronting, whereas a negative skewness indicates a tailing in the distribution. The kurtosis, K, measures the flatness of a distribution related to the Gaussian distribution
$$ K = \frac{n \cdot (n+1) \cdot M_4 - 3 \cdot (M_2)^2 \cdot (n-1)}{(n-1) \cdot (n-2) \cdot (n-3) \cdot \sigma^4} \qquad (4.28) $$
If the kurtosis is substantially different from zero, then the distribution is either flatter (platykurtic, K < 0) or more peaked (leptokurtic, K > 0) than the Gaussian distribution that should have a kurtosis of zero (Figure 4.3).
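A minimal Python sketch of the sample skewness and kurtosis according to Equations 4.25 through 4.28 is given below; the data vector is arbitrary test data.

```python
# Sample skewness (Eq. 4.25) and kurtosis (Eq. 4.28) for a descriptor vector.
import math

def central_moment(x, k):
    mean = sum(x) / len(x)
    return sum((xi - mean) ** k for xi in x)                 # M_k as in Eq. 4.26

def sample_std(x):
    return math.sqrt(central_moment(x, 2) / (len(x) - 1))    # Eq. 4.27

def skewness(x):
    n, s = len(x), sample_std(x)
    return n * central_moment(x, 3) / ((n - 1) * (n - 2) * s ** 3)

def kurtosis(x):
    n, s = len(x), sample_std(x)
    m2, m4 = central_moment(x, 2), central_moment(x, 4)
    return (n * (n + 1) * m4 - 3 * m2 ** 2 * (n - 1)) / ((n - 1) * (n - 2) * (n - 3) * s ** 4)

data = [0.2, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.4]
print(round(skewness(data), 3), round(kurtosis(data), 3))
```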
4.5.4 Limitations of Regression

The major conceptual limitation of regression techniques is that one can only ascertain relationships but can never be sure about the underlying causal mechanism. As mentioned already, multiple linear regression analysis assumes that the relationship between variables is linear. In practice, this assumption is rarely met. Fortunately, multiple regression procedures are not greatly affected by minor deviations from
this assumption. If curvature in the relationships is evident — determined by skewness and kurtosis — one may consider either transforming the variables or explicitly allowing for nonlinear components. The skewness as well as the kurtosis of measured or calculated values will be nonzero in many cases, even if the underlying distribution is in fact more or less symmetrical. The significance of S can be estimated by the standard deviation of S itself. Unfortunately, this depends on the shape of the distribution. As an approximation, the standard deviation of the skewness (Equation 4.25) in the idealized case of a normal distribution
$$ \sigma \cong \pm \sqrt{15/n} \qquad (4.29) $$
is used as a lower limit for S. A similar approach can be applied to the kurtosis with a Gaussian standard deviation
$$ \sigma \cong \pm \sqrt{96/n} \qquad (4.30) $$
Thus, for a vector consisting of 128 components the skewness limit is about 0.35, and the kurtosis limit is about 0.85. Whereas the mean is a measure of the tendency of a variable within its confidence intervals, the width of the confidence interval itself depends on the sample size and on the variation of data values. On one hand, the larger the sample size, the more reliable is the mean. On the other hand, the larger the variation, the less reliable is the mean. The calculation of confidence intervals is based on the assumption that the variable is normally — or at least equally — distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large. The concept of squared distances has important functional consequences on how the value of the correlation coefficient reacts to various specific arrangements of data. The significance of correlation is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution and that the variability of the residual values is the same for all values of the independent variable. However, Monte Carlo studies have shown that meeting these assumptions closely is not crucial if the sample size is very large. Serious biases are unlikely if the sample size is 50 or more; normality can be assumed if the sample size exceeds 100.
4.5.5 Conclusions for Investigations of Descriptors

Consequently, statistical investigations of descriptors should be performed on vectors containing at least 128 components and training sets containing at least 50 compounds. Similar conditions related to the sample size apply to investigations with neural networks, which are, in fact, nothing more than a more complex statistical algorithm. Since a network minimizes an overall error, the proportion of types of data in the set is critical. A network trained on a data set with 900 good cases and 100 bad ones will bias its decision toward good cases, as this allows the algorithm
to lower the overall error, which is much more heavily influenced by the good cases. If the representation of good and bad cases is different in the real population, the network's decisions may be wrong. In such circumstances, the data set may need to be crafted to take account of the distribution of data (e.g., the less numerous cases can be replicated, or some of the numerous cases can be removed). Often, the best approach is to ensure an even representation of different cases and then to interpret the network's decisions accordingly. With an insufficient number of training data, the leave-one-out technique can be applied. In this procedure all available data are used for the training of the network, except the one object for which a prediction or classification has to be performed. This method can be applied iteratively for each object in the data set.
4.6 Capturing Relationships — Principal Components

The idea behind statistics is based on the analysis of data sets in terms of the relationships between the individual points in that data set. If we summarize the correlation between two variables in a scatter plot, a regression line would show the linear relationship between the variables. We can now define a variable that approximates the regression line and characterizes the essential relationship between the two variables. In this case, we have reduced the relationship to a single factor, which is actually a linear combination of the two variables. There are two noteworthy techniques for such an approach: (1) principal component analysis (PCA); and (2) factor analysis.
4.6.1 Principal Component Analysis (PCA)

PCA is one of the methods for simplifying a data set by reducing the number of dimensions in a data set. It goes back to work by Pearson in 1901 [51]. In mathematical terms, PCA is an orthogonal linear transformation of a data set to a new coordinate system. The new coordinate system may, for instance, show the greatest variance on the first coordinate (i.e., the first principal component), the second greatest variance on the second coordinate, and so on. The graphical representation of this coordinate system then shows relationships between data points by grouping them into particular areas of the coordinate system. PCA can be used to identify patterns in data and to express the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where graphical representation is not available, PCA is a powerful tool for analyzing data. Once the patterns are found in the data, they can be compressed by reducing the number of dimensions without significant loss of information — a technique used, for instance, in image compression. With the basic concepts of covariance and eigenvectors, a principal component analysis can be performed with the following steps. Suppose we have calculated distance-based molecular descriptors represented as vectors for a data set of n compounds. If we write the components of the descriptors as column vectors, we can create a matrix of compounds, where each column represents a compound and each row contains the individual values of the descriptor:
$$ G = \begin{pmatrix} g_1^{(1)} & g_1^{(2)} & \cdots & g_1^{(n)} \\ g_2^{(1)} & g_2^{(2)} & \cdots & g_2^{(n)} \\ \vdots & \vdots & & \vdots \\ g_m^{(1)} & g_m^{(2)} & \cdots & g_m^{(n)} \end{pmatrix} , \qquad (4.31) $$
or the individual descriptors
$$ g^{(k)} = \begin{pmatrix} g_1^{(k)} & g_2^{(k)} & \cdots & g_m^{(k)} \end{pmatrix}^{T} \qquad (4.32) $$
In principal component analysis we regard each vector as representing a single point in an m-dimensional Euclidean space. If we look at the relationship between the first and second component of each descriptor (m = 2), we obtain a graph with n points (one for each descriptor). Obviously g1 and g2 are related to each other. The first objective of PCA (Figure 4.4) is to specify this relationship with a straight line, which is called the principal axis y1 of a new coordinate system for the points. This axis is determined by finding the largest variation in the data points. The second axis of this new coordinate system is perpendicular to the first axis. This example is based on two components; we can extend it to all components of the descriptor by adding axes that are perpendicular to the preceding axis. The third axis will be perpendicular to both the first and second axes and will correspond to the third largest variation in the data, and so on for the remaining m-3 axes. We assumed that the observations along a single row in the descriptor matrix are independent (different compounds) but that some values within a column vector are
Figure 4.4 Construction of a new coordinate system for PCA. The relationship of the first and the second component of two descriptors is displayed in the left-hand diagram. The data points are distributed, but show a trend along a straight line. In PCA, two new axes are introduced, one determined by the largest variation (y1), and the other one (y2) is perpendicular to the first.
correlated (distances). The existence of such correlation is a basic requirement for PCA. We can eventually achieve a simplification of the data by discarding dimensions that correspond to small variations in the data. If we work with experimental data, the principal axes corresponding to smaller variations will represent the noise in the data. That is, reducing the dimensionality by discarding the small variations will have a denoising effect on the data. In our example of descriptors, the small variations may correspond to underlying physical properties that naturally produce small changes. For instance, if we look at a conformational change of a protein structure, we may find small changes in the double bond lengths of aromatic rings due to the conformational stress. This effect can be eliminated by discarding the corresponding principal components. If we do this for all components of our descriptors, we will find the first principal component corresponding to the linear combination of the data rows in the descriptor matrix that shows the largest variation in the data. The next principal component is another linear combination of data rows that accounts for the next largest variation in the data set, and so forth. The procedure for performing a principal component analysis is described in the following sections.

4.6.1.1 Centering the Data

To make the principal components comparable, we center the data to find the first principal component as a direction from the origin of our coordinate system (Figure 4.4). We achieve this by calculating the mean of the values in each row to get m mean values (one for each descriptor component) and subtracting them from the descriptor matrix to get
G = \begin{pmatrix} g_1^{(1)} - \bar{g}_1 & g_1^{(2)} - \bar{g}_1 & \cdots & g_1^{(n)} - \bar{g}_1 \\ g_2^{(1)} - \bar{g}_2 & g_2^{(2)} - \bar{g}_2 & \cdots & g_2^{(n)} - \bar{g}_2 \\ \vdots & \vdots & & \vdots \\ g_m^{(1)} - \bar{g}_m & g_m^{(2)} - \bar{g}_m & \cdots & g_m^{(n)} - \bar{g}_m \end{pmatrix}    (4.33)
Each row in this matrix will now sum to zero.

4.6.1.2 Calculating the Covariance Matrix

The covariance for two components i and j can be calculated as
s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} \left( g_i^{(k)} - \bar{g}_i \right) \left( g_j^{(k)} - \bar{g}_j \right)    (4.34)
or, since we are using centered data, where \bar{g}_i = \bar{g}_j = 0, we can simplify to
s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} g_i^{(k)} g_j^{(k)}    (4.35)
We collect the covariances of all pairs of components in a symmetric m × m covariance matrix S, with the entry at row i and column j containing s_ij, while the main diagonal contains the variances of the individual rows.
S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1m} \\ s_{21} & s_{22} & \cdots & s_{2m} \\ \vdots & \vdots & & \vdots \\ s_{m1} & s_{m2} & \cdots & s_{mm} \end{pmatrix}    (4.36)
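The centering and covariance steps (Equations 4.33 through 4.36) might be sketched in Python with NumPy as follows; the descriptor matrix G is a randomly generated stand-in, so all variable names and values are illustrative assumptions rather than data from the text.

    import numpy as np

    # Hypothetical descriptor matrix: m components (rows) x n compounds (columns)
    G = np.random.rand(8, 20)

    # Center each row by subtracting its mean (Equation 4.33)
    G_centered = G - G.mean(axis=1, keepdims=True)

    # Covariance matrix S (m x m) with entries s_ij (Equations 4.34/4.35)
    n = G.shape[1]
    S = (G_centered @ G_centered.T) / (n - 1)

    # Equivalent check against NumPy's built-in covariance
    assert np.allclose(S, np.cov(G))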
The next steps are summarized in the following:
• Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Scale the eigenvectors to obtain unit eigenvectors.
• Select the eigenvector with the highest eigenvalue to define the principal component of the data set, that is, the one showing the most significant relationship between the data dimensions.
• Sort the eigenvectors in order of decreasing eigenvalues, which gives the components in order of significance.
• Optionally, ignore the components of lesser significance. This loses some information, and the larger the discarded eigenvalues are, the more information is lost. However, it reduces the final data set to fewer dimensions than the original; in fact, the number of dimensions is reduced by the number of eigenvectors left out.
• Form a feature vector; this is constructed by taking the eigenvectors that shall be kept from the list of eigenvectors and forming a matrix with these eigenvectors in the columns: FV = (eig1 eig2 … eign).
• Derive the new data set by taking the transpose of the feature vector and multiplying it on the left of the transposed original data set:
ND = RV × AD    (4.37)
where RV is the matrix with the eigenvectors in the columns transposed (eigenvectors are now in the rows), with the most significant eigenvector at the top, and AD is the transposed mean-adjusted descriptor matrix, where data items are in the columns and each row holds a separate dimension. ND is the new data set, with data items in columns and dimensions along rows. This procedure gives us the original data in terms of the eigenvectors we chose. If the original data set had two axes, x and y, they are now represented in terms of eigenvectors. In the case of keeping all eigenvectors for the transformation, we get the original data set rotated so that the eigenvectors are the axes. In conclusion, we have seen that the coefficients generating the linear combinations that transform the original variables into uncorrelated variables are the eigenvectors of the covariance matrix. The fraction of an eigenvalue out of the sum of all eigenvalues of the covariance matrix represents the amount of variation accounted for by the corresponding eigenvector. The underlying intuition for this statement is that
the magnitude of each eigenvalue represents the length of the corresponding eigenvector. The length of the eigenvector is a measure of the degree of common variation it accounts for in the data or, better yet, the intensity of a trend in the data. Empirically, when PCA works well, the first two eigenvalues usually account for more than 50% of the total variation in the data.
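Continuing the hypothetical matrices S and G_centered from the previous sketch, the eigenvector steps and the projection of Equation 4.37 might look roughly like this; it is a simplified illustration, not a complete PCA implementation.

    import numpy as np

    # Eigenvectors and eigenvalues of the symmetric covariance matrix S
    eigenvalues, eigenvectors = np.linalg.eigh(S)   # columns are unit eigenvectors

    # Sort in order of decreasing eigenvalues (most significant component first)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Keep only the first p components to form the feature vector FV
    p = 2
    FV = eigenvectors[:, :p]

    # New data set (Equation 4.37): ND = RV x AD,
    # RV = transposed feature vector, AD = mean-adjusted descriptor matrix
    RV = FV.T
    AD = G_centered
    ND = RV @ AD          # p x n matrix: compounds expressed in the new basis

    # Fraction of the total variation accounted for by the retained components
    explained = eigenvalues[:p].sum() / eigenvalues.sum()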
4.6.2 Singular Value Decomposition (SVD)

One mathematical procedure that can be used in the context of PCA is singular value decomposition (SVD). SVD decomposes a matrix A created from m compounds, each having n descriptor components, by

A_{m \times n} = U_{m \times m} S_{m \times n} (V^T)_{n \times n}    (4.38)
where U is the unitary matrix of the principal components, sorted from the highest to the lowest value; V is the unitary matrix of loadings, containing the loadings of the principal components in the columns; and S is a diagonal matrix containing the singular values s associated with the variance in the direction of each principal component, in decreasing order:
S = \begin{pmatrix} s_1 & 0 & \cdots & 0 & 0 \\ 0 & s_2 & & & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & s_m & 0 \end{pmatrix}_{m \times n}    (4.39)
We can discard the principal components with small variance and make a reduced matrix \tilde{S}:

\tilde{S} = \begin{pmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & s_r \end{pmatrix}_{r \times r}    (4.40)
For a square, symmetric matrix X, singular value decomposition is equivalent to diagonalization, or solution of the eigenvalue problem. The product of U and S yields the matrix of coefficients T_{m \times n} in the new basis:

T_{m \times n} = U_{m \times m} S_{m \times n}    (4.41)
Inserting this in Equation 4.38 results in

A_{m \times n} = T_{m \times n} (V^T)_{n \times n},    (4.42)

and multiplying by V yields

A_{m \times n} V_{n \times n} = T_{m \times n} (V^T)_{n \times n} V_{n \times n}    (4.43)

and

A_{m \times n} V_{n \times n} = T_{m \times n}.    (4.44)
PCA is equivalent to finding the SVD of X and can be used to obtain the regression vector estimate ŵ from

A^T A \hat{w} - A^T y = 0    (4.45)
by replacing A with Equation 4.38:

(U S V^T)^T U S V^T \hat{w} - (U S V^T)^T y = 0    (4.46)

and

V S U^T U S V^T \hat{w} = V S U^T y.    (4.47)

Multiplying both sides by V^T

V^T \cdot V S U^T U S V^T \hat{w} = V^T \cdot V S U^T y    (4.48)

we can rearrange to

V V^T \cdot U U^T \cdot S^2 V^T \hat{w} = V V^T \cdot S U^T y.    (4.49)

Eliminating the identity matrices V V^T and U U^T that do not change the results, we can reduce this equation to

S^2 V^T \hat{w} = S U^T y    (4.50)

or

V^T \hat{w} = S^{-1} U^T y.    (4.51)

Multiplying both sides by V yields

\hat{w} = V S^{-1} U^T y.    (4.52)
The regression vector estimate allows us to predict molecular properties from descriptors using a simple linear regression model. This empirical approach assumes a fundamental relationship between the descriptor and the property. Predicting a molecular property by using descriptors for three-dimensional structures only makes sense if the molecular property is somehow related to the arrangement of atoms. However, by including the appropriate physicochemical atomic properties, that is, those that account for the molecular property to be predicted, we can readily evaluate any existing structure–property correlation.
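A minimal sketch of the regression vector estimate of Equation 4.52, assuming an illustrative descriptor matrix A (compounds in rows, descriptor components in columns) and a property vector y; the reduced SVD is used so that S is invertible, and the random data serve only as placeholders.

    import numpy as np

    # Illustrative data: 20 compounds, 8 descriptor components, one property per compound
    A = np.random.rand(20, 8)
    y = np.random.rand(20)

    # Reduced singular value decomposition A = U S V^T (Equation 4.38)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Regression vector estimate w = V S^-1 U^T y (Equation 4.52)
    w_hat = Vt.T @ np.diag(1.0 / s) @ U.T @ y

    # Predicted properties for the training compounds
    y_pred = A @ w_hat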
Linear transforms that utilize orthogonal basis vectors and provide compression — such as Fourier or wavelet transforms — can be used for regression. Any descriptor consisting of n components can be written in terms of n basis vectors. In a matrix definition, we would write

A_{m \times n} = T_{m \times n} \Psi_{n \times n}    (4.53)

with A as the matrix of descriptors and T as the matrix of coefficients. Ψ is a matrix of basis vectors, each of which represents a row in the matrix. The regression term would look like

A^T A \hat{w} - A^T y = 0    (4.54)
Since we assumed orthogonal basis vectors, it holds that Ψ^T Ψ = I. If we compress information, we can reduce the number of coefficients by discarding some columns to obtain a reduced matrix T. We can now replace A by TΨ to get

\Psi^T T^T T \Psi \hat{w} = \Psi^T T^T y    (4.55a)

\Rightarrow \Psi \Psi^T T^T T \Psi \hat{w} = \Psi \Psi^T T^T y    (4.55b)

\Rightarrow \Psi \hat{w} = (T^T T)^{-1} T^T y    (4.55c)

\Rightarrow \hat{w} = \Psi^T (T^T T)^{-1} T^T y    (4.55d)

With the reduced matrix of coefficients, the regression vector estimate can be written as

\hat{w} = \Psi^T \hat{w}_t,    (4.56)

where

\hat{w}_t = (T^T T)^{-1} T^T y    (4.57)
is a regression vector in the reduced matrix of coefficients, that is, those that were retained in the transform domain. The transform domain generally allows a more efficient discrimination of descriptor features since collinearity is removed and the variance of reduced coefficients is larger than in the original domain. PCA and SVD are strongly related: whereas SVD provides a factorization of a descriptor matrix, PCA provides a nearly parallel factoring, due to the analysis of eigenvalues of the covariance matrix. Singular value decomposition is a valuable approach to matrix analysis. Compared to the original matrix, the SVD of a descriptor matrix reveals its geometric aspects and is more robust to numerical errors.
This procedure transforms the original data so that they are expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified each data point as a combination of the contributions from each of those lines. In contrast to the original coordinate system, which does not indicate how a data point relates to the rest of the data, PCA tells us exactly whether a data point sits above or below the trend lines. Working with a reduced set of eigenvectors removes the contribution of the discarded vectors and leaves just the data of higher significance. Thus, the original data can be back-transformed without losing significant information. PCA can be used, for instance, to select a smaller set of descriptors to be used for multivariate regression and artificial neural networks. A set of descriptors can be divided into subclasses, such as geometrical, topological, or constitutional. A principal component analysis including the required output can then be performed for each subclass, and the descriptor nearest to the output can be selected, checking whether it improves the predictive power of a model.
4.6.3 Factor Analysis

Principal component analysis is used to reduce the information in many variables into a set of weighted linear combinations of those variables; it does not differentiate between common and unique variance. If latent variables that contribute to the common variance in a set of measured variables have to be determined, factor analysis (FA) is a valuable statistical method, since it attempts to exclude unique variance from the analysis. FA is a statistical approach that can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions, or factors. It involves finding a way of condensing the information contained in a number of original variables into a smaller set of dimensions, or factors, with a minimum loss of information [52]. There are three stages in factor analysis: (1) generation of a correlation matrix for all the variables, consisting of an array of correlation coefficients of the variables; (2) extraction of factors from the correlation matrix based on the correlation coefficients of the variables; and (3) rotation of factors to maximize the relationship between the variables and some of the factors. FA requires a set of data points in matrix form, and the data must be bilinear — that is, the rows and columns must be independent of each other. The result of FA reveals the factors that are of highest significance for the data set. A general rule for determining the number of factors is the eigenvalue criterion: factors that exhibit an eigenvalue greater than 1 usually account for at least the variance of one of the variables used in the analysis. If the number of variables is small, the analysis may result in fewer factors than really exist in the data, whereas a large number of variables may produce more factors meeting the criteria than are meaningful.
4.7 Transforming Descriptors

Mathematical transforms are widely used in nearly all application areas where mathematical functions are involved. The basic idea behind transforms is rather simple: A transform represents any function as a summation of a series of terms of increasing frequency. In other words, data varying in space or time can be transformed into a different domain called the frequency space. In the original physical context, the term frequency describes a variation in time of periodic motion or behavior. Frequency in another context might have a different meaning. In image processing, frequency may be related to variation in brightness or color across an image; that is, it is a function of spatial coordinates rather than time. However, the basic idea of frequency is the variation of a value or quantity along a certain continuous or discrete variable. Transform techniques in computational chemistry allow separating important aspects of a signal from unimportant ones. For instance, some typical applications of Fourier transforms in spectroscopy are smoothing or filtering to enhance the signal-to-noise ratio, resolution enhancement, changing spectral line shapes, generation of integrals or derivatives, and data compression. Transforms of molecular descriptors generally create a different functional representation of the descriptor that allows easier analysis, automatic interpretation, or compression of molecular data. Before we have a quick look at three of the most important transform methods, we should keep the following in mind. The mathematical theory of transformations is usually related to continuous phenomena; for instance, the Fourier transform is more exactly described as the continuous Fourier transform (CFT). Experimental descriptors, such as signals resulting from instrumental analysis, as well as calculated artificial descriptors require an analysis on the basis of discrete intervals. Transformations applied to such descriptors are usually indicated by the term discrete, such as the discrete Fourier transform (DFT). Similarly, efficient algorithms for computing those discrete transforms are typically indicated by the term fast, such as the fast Fourier transform (FFT). We will focus in the following on the practical application — that is, on discrete transforms and fast transform algorithms.
4.7.1 Fourier Transform

Fourier transform is a well-known technique for signal analysis, which breaks down a signal into constituent sinusoids of different frequencies. Another way to think of Fourier analysis is as a mathematical technique for transforming a descriptor from the spatial domain into a frequency domain. The theory of Fourier transforms is described in several textbooks, such as Boas [53], and is not discussed here in detail. Fourier analysis of descriptors is performed by the DFT, which is the sum of the descriptor g(r) over all distances r multiplied by a complex exponential. With a descriptor consisting of n components expressed in its discrete form g[x] (x is the index of a discrete component), the DFT can be written as

g[u] = \sum_{x=0}^{n-1} g[x] \cdot e^{-i 2\pi x u / n}, \quad u \in \{0, \ldots, n-1\}    (4.58)
or, for the two-dimensional case, with n distance components and m property components (x and y are the indices of the distance and property components, respectively), as
g[u, v] = \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} g[x, y] \cdot e^{-i 2\pi (xu/m + yv/n)}    (4.59)
The complex exponential can be broken down into real and imaginary sinusoidal components. The results of the transform are the Fourier coefficients g[u] (or g[u,v]) in frequency space. Multiplying the coefficients with sinusoids of the corresponding frequencies yields the constituent sinusoidal components of the original descriptor. The DFT is usually implemented in a computer program using the FFT. A standard FFT algorithm based on the Danielson-Lanczos formula provides normal and inverse transforms using a Fourier matrix factored into a product of a few sparse matrices [54]. An array of real numbers is transformed into an array of vector lengths of the real and imaginary parts of the complex Fourier array. The source vector is copied into a temporary array of length 2n containing the real and imaginary parts alternating, with the imaginary part set to zero. The FFT requires n log n operations for an n-dimensional vector.
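The DFT of Equation 4.58 can be applied to a descriptor with standard FFT routines; the sketch below uses NumPy's FFT on an artificial descriptor and illustrates a simple compression by discarding high-frequency coefficients — the descriptor itself is an arbitrary example.

    import numpy as np

    # Illustrative descriptor with n = 128 components
    n = 128
    r = np.linspace(0.0, 10.0, n)
    g = np.exp(-((r - 3.0) ** 2)) + 0.5 * np.exp(-((r - 7.0) ** 2))

    # Discrete Fourier transform (Equation 4.58) via the FFT
    G = np.fft.fft(g)

    # Simple compression: keep only the 16 lowest-frequency coefficients
    G_compressed = np.zeros_like(G)
    G_compressed[:8] = G[:8]
    G_compressed[-8:] = G[-8:]

    # Back-transform; the real part approximates the original descriptor
    g_approx = np.fft.ifft(G_compressed).real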
4.7.2 Hadamard Transform

The Hadamard transform is an example of a generalized class of DFTs that performs an orthogonal, symmetric, involutory linear operation on dyadic (i.e., power of two) numbers. The transform uses a special square matrix: the Hadamard matrix, named after the French mathematician Jacques Hadamard. Similarly to the DFT, we can express the discrete Hadamard transform (DHT) as

F(u) = \sum_{x=0}^{N-1} f(x) (-1)^{\sum_{i=0}^{n-1} b_i(x) b_i(u)}    (4.60)
where F(u) is the transform of the signal f(x), and b_i(x) denotes the ith bit in the binary representation of x. Instead of sines and cosines, square wave functions define the transformation matrix of the Hadamard transform. The fast Hadamard transform (FHT) is generally preferred over the DFT due to faster calculation and because it operates with real instead of complex coefficients. A typical application is the compression of spectra, which are decomposed into small square-wave components.
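A minimal sketch of the discrete Hadamard transform for a signal of dyadic length, building the Hadamard matrix by the Sylvester construction; this illustrates the principle only and is not an optimized fast Hadamard transform.

    import numpy as np

    def hadamard_matrix(n):
        # Hadamard matrix of order n (n must be a power of two), Sylvester construction
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    # Illustrative signal of dyadic length N = 8
    f = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])
    N = f.size

    H = hadamard_matrix(N)
    F = H @ f              # forward Hadamard transform (Equation 4.60, up to normalization)
    f_back = (H @ F) / N   # the transform is involutory up to the factor N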
4.7.3 Wavelet Transform

Like the FFT, the fast wavelet transform (FWT) is a fast, linear operation that operates on a data vector whose length is an integer power of two (i.e., a dyadic vector), transforming it into a numerically different vector of the same length. Like the FFT, the FWT is invertible and in fact orthogonal; that is, the inverse transform, when viewed as a matrix, is simply the transpose of the transform. Both the FFT and the discrete wavelet transform (DWT) can be regarded as a rotation in function space:
from the input space or time domain — where the basis functions are the unit vectors, or Dirac delta functions in the continuum limit — to a different domain. For the FFT, this new domain has sinusoid and cosinusoid basis functions. In the wavelet domain, the basis functions, or wavelets, are somewhat more complicated. In contrast to the basis functions of the Fourier transform, individual wavelet functions are quite localized in space and frequency. Unlike sines and cosines, there is no single unique set of wavelets; in fact, there are infinitely many possible sets of different compactness, smoothness, and localization. The major advantage achieved by wavelets is the ability to perform local analysis, that is, to analyze a localized area of a larger signal. Wavelet analysis is capable of revealing aspects of data that other signal analysis techniques miss, like trends, breakdown points, self-similarity, and discontinuities in higher derivatives. In addition, wavelet analysis is able to compress or to denoise a signal without appreciable degradation. FWTs can be implemented quite efficiently; the calculation time of algorithms performing wavelet transformations increases only linearly with the length of the transformed vector. A special kind of wavelet was developed by Ingrid Daubechies [63]. Daubechies wavelets are base functions of finite length and represent sharp edges by a small number of coefficients. They have a compact support; that is, they are zero outside a specific interval. There are many Daubechies wavelets, which are characterized by the length of the analysis and synthesis filter coefficients. Wavelet transforms are useful for compression of descriptors for searches in binary descriptor databases and as alternative representations of molecules for neural networks in classification tasks.
4.7.4 Discrete Wavelet Transform

Similar to the DFT, the DWT can be defined as the sum over all distances r of the descriptor g(r) (expressed in its discrete form g[x]) multiplied by individually scaled basis functions Ψ_{d,t}:

g[w] = \sum_{x=0}^{n-1} \left( g[x] \cdot \Psi_{d,t}[x] \right), \quad w \in \{0, \ldots, n-1\}    (4.61)
The basis functions, or wavelets, Ψd,t are dilated and translated versions of a wavelet mother function. A set of wavelets is specified by a particular set of numbers, called wavelet filter coefficients. To see how a wavelet transform is performed, we will take a closer look at these coefficients that determine the shape of the wavelet mother function. The basic idea of the wavelet transform is to represent any arbitrary function as a superposition of basis functions, the wavelets. As mentioned already, the wavelets Ψ(x) are dilated and translated versions of a mother wavelet Ψ0. Defining a dilation factor d and a translation factor t, the wavelet function Ψ(x) can be written as
\psi(x) = \frac{1}{\sqrt{d}}\, \psi_0\!\left( \frac{x - t}{d} \right), \quad d, t \in \mathbb{R},\ d > 0    (4.62)
For efficient calculations, dyadic dilations (d = 2^j) of an integer dilation level j (usually called a resolution level) and dyadic translations (t = kd) with an integer translation level k are used. Rearranging Equation 4.62 yields

\psi_{j,k}(x) = \frac{1}{\sqrt{2^j}}\, \psi_0\!\left( \frac{x}{2^j} - k \right), \quad j, k \in \mathbb{Z}    (4.63)
The scaling of the mother wavelets Ψ0 is performed by the dilation equation, which is, in fact, a function that is a linear combination of dilated and translated versions of it:

\phi(x) = \sqrt{2} \sum_{k=0}^{K-1} c_k \cdot \phi(2x - k)    (4.64)
where ϕ(x) is the scaling function (sometimes called a father wavelet), and the c_k are the wavelet filter coefficients [c_0, ..., c_{K−1}] of a set of K total filter coefficients. The filter coefficients must satisfy linear and quadratic constraints to ensure orthogonality and to enable reconstruction of the original data vector by applying an inverse transformation matrix.
4.7.5 Daubechies Wavelets

Wavelet filters of the Daubechies type generally exhibit a fractal structure and are self-similar; that is, they consist of multiple fragments similar to the complete mother wavelet. Their value in scientific research lies in their ability to represent polynomial behavior [55]. This class of wavelets includes members ranging from highly localized to highly smooth. In the transform algorithm, the coefficient array [c_0, ..., c_{K−1}] can be regarded as a filter that is placed in a transformation matrix, which is applied to the raw data vector. The coefficients are arranged within the transformation matrix in two dominant patterns that are shifted through the entire matrix. We know these patterns from signal processing as the quadrature mirror filters (QMFs) H and G. In the simplest (and most localized) member D4 of the Daubechies family, the four coefficients [c_0, c_1, c_2, c_3] represent the low-pass filter H that is applied to the odd rows of the transformation matrix. The even rows perform a different convolution by the coefficients [c_3, −c_2, c_1, −c_0] that represent the high-pass filter G. H acts as a coarse filter (or approximation filter) emphasizing the slowly changing (low-frequency) features, and G is the detail filter that extracts the rapidly changing (high-frequency) part of the data vector. The combination of the two filters H and G is referred to as a filter bank. According to a proposal of Ingrid Daubechies, the notation DK will be used for a Daubechies wavelet transform with K coefficients. Actually, D2 is identical to the simplest wavelet of all, the so-called Haar wavelet, and, thus, is not originally a member of the Daubechies family.
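A single decomposition level with the D4 filter bank might be sketched as follows; the low-pass coefficients correspond to Equation 4.68 and the high-pass coefficients to Equation 4.69, and periodic (wrap-around) boundary handling is assumed for simplicity.

    import numpy as np

    # D4 low-pass (H) and high-pass (G) filter coefficients
    sqrt3 = np.sqrt(3.0)
    h = np.array([1 + sqrt3, 3 + sqrt3, 3 - sqrt3, 1 - sqrt3]) / (4 * np.sqrt(2.0))
    g = np.array([h[3], -h[2], h[1], -h[0]])

    def dwt_d4_level(c):
        # One level of the D4 wavelet transform with periodic boundaries.
        # Returns coarse (approximation) and detail coefficients, each half the length of c.
        n = len(c)
        coarse = np.empty(n // 2)
        detail = np.empty(n // 2)
        for i in range(n // 2):
            window = [c[(2 * i + k) % n] for k in range(4)]
            coarse[i] = np.dot(h, window)
            detail[i] = np.dot(g, window)
        return coarse, detail

    # Illustrative dyadic-length descriptor
    signal = np.sin(np.linspace(0.0, 4 * np.pi, 64))
    C1, D1 = dwt_d4_level(signal)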
Figure 4.5 Shape of Daubechies wavelet and scaling functions (wavelet function and scaling function φ for D2, D4, D6, and D8) with different numbers of coefficients. Both functions become smoother with increasing number of coefficients. With more coefficients, the middle of the wavelet functions and the left side of the scaling function deviate more and more from zero. The number of coefficients defines the filter length and the number of required calculations.
Figure 4.5 gives an idea of the shape of Daubechies wavelet and scaling functions with different numbers of coefficients. The scaling function and wavelet function become smoother when the number of coefficients (filter length) is increased. The general behavior for longer filter length functions is that the scaling function is significantly nonzero at the left of the nonzero interval, whereas the wavelet is significantly nonzero in the middle. Applying the filter coefficients to the scaling function (Equation 4.64) leads to the wavelet equations
\psi(x) = \sqrt{2} \sum_{k=0}^{K-1} (-1)^k c_{-k} \cdot \phi(2x - k)    (4.65)

\psi(x) = \sqrt{2} \sum_{k=0}^{K-1} c_k \cdot \psi(2x - k)    (4.66)
where ψ(x) is the actual wavelet, the c_{−k} are the low-pass filter coefficients defined by the scaling function ϕ, and the c_k are the high-pass filter coefficients defined by the wavelet function ψ.
4.7.6 The Fast Wavelet Transform

An efficient way to implement the concept of QMFs was developed in 1989 by Mallat, leading to the FWT, which requires only n operations for an n-dimensional vector [56]. The Mallat algorithm is in fact a classical scheme known in the signal processing community as a two-channel subband coder.
In the FWT — analogous to the FFT — a signal in the wavelet domain is represented by a series of coefficients ck(j) at a certain resolution level j (cf. Equation 4.24). The sum of data obtained from the original descriptor at different resolution levels j is defined as the signal at resolution level 0:
g^0(r) = \sum_{k=0}^{K-1} c_k^{(J)} \sqrt{2^J}\, \phi_{J,k}(r) + \sum_{j=1}^{J} \sum_{k=0}^{K-1} d_k^{(j)} \sqrt{2^j}\, \psi_{j,k}(r)    (4.67)
where ϕ_{J,k}(r) and ψ_{j,k}(r) are the scaling and wavelet function, respectively; c_k^{(j)} and d_k^{(j)} are the corresponding scale and wavelet coefficients, respectively, at the jth resolution level; and J is the highest resolution level chosen for the transform. The scaling function ϕ is determined by the low-pass QMF and thus is associated with the coarse components, or approximations, of the wavelet decomposition. The wavelet function ψ is determined by the high-pass filter, which also produces the details of the wavelet decomposition. By iterative application of the FWT to the high-pass filter coefficients, a shape emerges that is an approximation of the wavelet function. The same applies to the iterative convolution of the low-pass filter, which produces a shape approximating the scaling function. Figure 4.6 and Figure 4.7 display the construction of the scaling and wavelet functions, respectively:
\phi[n] = \frac{1}{4\sqrt{2}} \begin{pmatrix} 1+\sqrt{3} & 3+\sqrt{3} & 3-\sqrt{3} & 1-\sqrt{3} \end{pmatrix}    (4.68)

\psi[n] = \frac{1}{4\sqrt{2}} \begin{pmatrix} 1-\sqrt{3} & -3+\sqrt{3} & 3+\sqrt{3} & -1-\sqrt{3} \end{pmatrix}    (4.69)
The DWT descriptor is usually represented by the set of coefficients C(j) (≡ c_k^{(j)}) containing the low-pass filtered (coarse) part of the descriptor, and the set D(j) (≡ d_k^{(j)}) containing the high-pass (detail) part of the descriptor. C(0) represents the original descriptor in its discrete form at resolution level 0. The FWT of a descriptor C(0) containing n components at resolution level 1 results in a transformed descriptor consisting of k (= n/2) coarse coefficients C(1) and k detail coefficients D(1) (Figure 4.8). The next resolution level leads to the decomposition of the coarse coefficients C(1) into C(2) and D(2), whereas the detail coefficients D(1) of the first level remain unchanged (Figure 4.9). Each additional resolution level — up to the highest resolution level J — decomposes the coarse coefficients and leaves the detail coefficients unchanged. The remaining coarse coefficient set C(J) cannot be decomposed further; it consists of just four components. J is determined by the size n of the original vector with J = log2(n) − 2. Consequently, a wavelet-transformed descriptor can be represented by either a single-level (j = 1) or a multilevel (j ≤ J) decomposition.
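The multilevel decomposition described above might be sketched by iterating a single-level transform on the coarse part only; the function dwt_d4_level from the earlier sketch is reused, and the stopping criterion follows the J = log2(n) − 2 rule given in the text.

    import numpy as np

    def multilevel_dwt(signal, single_level):
        # Iterate a single-level wavelet transform on the coarse coefficients.
        # Returns the final coarse set C(J) and the detail sets D(1)...D(J).
        n = len(signal)
        J = int(np.log2(n)) - 2          # highest resolution level (coarse set keeps 4 components)
        coarse = np.asarray(signal, dtype=float)
        details = []
        for _ in range(J):
            coarse, detail = single_level(coarse)
            details.append(detail)
        return coarse, details

    # Example: decompose a 256-component descriptor down to C(J) with 4 components
    descriptor = np.cos(np.linspace(0.0, 8 * np.pi, 256))
    C_J, D_sets = multilevel_dwt(descriptor, dwt_d4_level)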
Figure 4.6 Construction of scaling function ϕ with low-pass D4 filter coefficients (above). Below: the functional representation of the low-pass filter coefficients (left) and their refinement by iterative calculation (increasing resolution level j) leading to an approximation of the scaling function ϕ (right).
Figure 4.7 Construction of wavelet function ψ with high-pass D4 filter coefficients (above). Below: the functional representation of the high-pass filter coefficients (left) and their refinement by iterative calculation (increasing resolution level j) leading to an approximation of the wavelet function ψ (right).
Figure 4.8 Schematic image of wavelet decomposition. The QMFs H and G are applied to the raw signal. The signal is downsampled, leading to sets of coarse and detail components of half the size.
Figure 4.9 Schematic image of wavelet decomposition. The QMFs H and G are applied iteratively to each coarse coefficient set. The raw signal C(0) is downsampled, leading to sets of coarse (C) and detail (D) coefficients of half the size.
Wavelet transforms are a relatively new field of data processing, but they have proven to be a valuable addition to the analyst's collection of tools. Therefore, they should be introduced here; a detailed discussion of how the transformation procedure is applied to descriptors can be found in the next chapter. A general overview of wavelet transformations is described by Strang [57]; the mathematical details are published by DeVore and Lucier [58]; and a review of applications in chemistry is given by Leung et al. [59].
4.8 Learning from Nature — Artificial Neural Networks

The development efforts on expert systems in the 1970s and 1980s pointed out a particular weakness: the inability to mimic certain important capabilities of the human brain, like association and learning aptitude. To achieve these capabilities in a computer program, it seemed to be necessary to build systems comprising an architecture similar to that of the human brain. Information processing in the mammalian brain is performed by neurons. A human brain contains about a hundred billion neurons that are connected among one another via dendrites and axons. A dendrite carries the incoming electrochemical signal and leads it to the nucleus of the neuron, whereas the axon delivers processed signals to other neurons or a target tissue. The functioning of a human brain is not yet understood in full detail; however, a simplified model is sufficient for our purpose. In this model, neurons are connected among one another with dendrites, each of which has an individual strength or thickness. All incoming signals from the dendrites taken together have to exceed a certain threshold before the nucleus starts processing and initiates a response on the axon. The axon is connected to dendrites of further neurons, and all of these neurons work together in a huge network. This simple behavior can be easily modeled by mathematical algorithms, and the outcome is a surprisingly efficient technology, capable of learning, mapping, and modeling: the artificial neural network (ANN).
Artificial neural networks have been applied successfully across an extraordinary range of classification, prediction, association, and mapping problem domains. The success of this technique can be attributed to a few key factors:
• Artificial neural networks are able to derive empirical models from a collection of experimental data. This applies in particular to complex, nonlinear relationships between input and output data.
• Artificial neural networks learn from examples. Once fed with representative data, they are able to model the relationships between the data.
• The trained network is capable of generalizing from these examples to other input data that were not given during training.
4.8.1 Artificial Neural Networks in a Nutshell

An ANN is a simplified model of the human brain consisting of several layers of neurons that pass signals to each other depending on the input signals that they receive (Figure 4.10). A single neuron j consists of N dendrites, each of which receives an incoming signal x_i. Each dendrite has an associated weight w that simulates the strength of the connection. A simple algorithm summarizes the products of signal x and weight w to form a Net value for the neuron:

Net_j = \sum_{i=1}^{N} w_{ji} \cdot x_i    (4.70)
Figure 4.10 Schematic image of an artificial neuron. The input data x are calculated with their connective weight w to form the Net value of the neuron. A transfer function is applied to mimic the threshold of the biological neuron. The out value represents the outcome of the process, which is fed to another artificial neuron.
The Net value undergoes a transfer function that mimics the signal threshold from the biological neuron. A typically used transfer function has a sigmoidal shape expressed by

\sigma(t) = \frac{1}{1 + e^{-t}}    (4.71)
We can introduce this function to change the bias of the outgoing signal out_j of a neuron:

out_j = \frac{1}{1 + e^{-(\alpha_j \cdot Net_j - \vartheta_j)}}    (4.72)
By introducing the factor α for the Net values of the neuron we are able to define the steepness of the function. The ϑ value affects the relative shift of the function on the x-axis. The out value may act as an input value for another neuron or as a final result. An ANN is created by connecting multiple neurons in multiple layers. Typically, each neuron in the input layer receives the input data and is connected to each neuron of the next layer; this architecture repeats down to the output layer, which finally contains the output values (Figure 4.11). The layers between the input and output layer are referred to as hidden layers, and the number of hidden layers as well as the number of neurons in each layer can be adapted to the task. Training can now be performed in different ways. The feed-forward training propagates the values through the individual layers of the network. This process can be repeated in multiple iterations, or epochs, each of which adapts the weights, or neural connections, until the training is finished. The back-propagation training calculates the difference to a desired output at the end of each epoch and corrects the input values accordingly before the next training epoch starts.
Figure 4.11 A simple neural network consists of input units that receive the incoming data, a hidden layer of neurons, and the output neurons that finally provide the results of processing. The weights are values between 0 and 1 representing the strength of connectivity between the neurons. Typically, all neurons are connected to all neurons of the next layer.
This type of training is referred to as supervised learning, since the relationship between input and output training data must be known in advance, whereas in unsupervised learning no prior knowledge about this relationship is required. A system that is able to learn in unsupervised mode derives its knowledge from a set of experimental data — for example, ANNs learn to model a relationship between molecular structures and their experimental spectra without having any prior knowledge about the relationship between these two types of data. Once a neural network is trained — that is, its weights are adapted to the input data — we can present a new input vector and predict a value or vector in the output layer. Although the training procedure can be quite time consuming, once trained the network produces an answer (i.e., prediction) almost instantaneously. A more detailed discussion of applications of neural networks in chemistry and drug design can be found in Zupan and Gasteiger [60].
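The behavior of a single artificial neuron (Equations 4.70 through 4.72) can be sketched in a few lines; the input signals, weights, and the parameters α and ϑ below are arbitrary illustration values.

    import numpy as np

    def neuron_output(x, w, alpha=1.0, theta=0.0):
        # Forward pass of a single artificial neuron: Net value (Eq. 4.70),
        # then sigmoidal transfer function with steepness alpha and shift theta (Eq. 4.72)
        net = np.dot(w, x)
        return 1.0 / (1.0 + np.exp(-(alpha * net - theta)))

    # Seven input signals and their connection weights (arbitrary values)
    x = np.array([0.2, 0.8, 0.1, 0.5, 0.9, 0.3, 0.7])
    w = np.array([0.1, 0.4, 0.3, 0.2, 0.6, 0.5, 0.2])

    out = neuron_output(x, w, alpha=2.0, theta=0.5)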
4.8.2 Kohonen Neural Networks — The Classifiers

A type of neural network that has proved successful in a series of applications is based on self-organizing maps (SOMs) or Kohonen neural networks [61]. Whereas most networks are designed for supervised learning tasks (i.e., the relationship between input and output must be known in the form of a mathematical model), Kohonen neural networks are designed primarily for unsupervised learning, where no prior knowledge about this relationship is necessary [62,63]. The Kohonen neural network is similar to a matrix of vectors, each of which represents a neuron. If we arrange the neurons in a matrix and look from the top, we can actually think of a map of neurons where the connectors that have been adapted in the previous model are fixed and depend only on the topological distance of the neurons (Figure 4.12). What is adapted in this model are the components of the vectors in the neural network, which are in this case called weights.
Figure 4.12 Transition from a simple neural network to a Kohonen network. By arranging the neurons in a map, we can assign relationships between the neurons by means of topological distances that replace the strength of connectivity. Instead of feeding every neuron with all values from an input vector, we can now place the vectors in the third dimension (right). Instead of adapting the connector strength, the components of these vectors are adapted.
Figure 4.13 A Kohonen network in three dimensions is a combination of neuron vectors, in which the number of components n matches the ones in the input vector v. The weights w in the Kohonen network are adapted during training. The most similar neuron is determined by the Euclidean distance; the resulting neuron is the central neuron, from which the adaptation of the network weights starts.
Each input is also a weight vector whose components determine (1) the neuron where the adaptation of the neural network starts; and (2) how the components of this neuron are adapted. The neuron where adaptation starts is usually referred to as the central neuron, C, or winning neuron. Different methods exist to determine the central neuron. A typical approach is to calculate for all neurons in the network the Euclidean distance between the weights of the particular neuron and the input vector:
d_{vw} = \sqrt{ \sum_{i=1}^{n} \left( v_i - w_{ji} \right)^2 }    (4.73)
where vi is the ith component of the n-dimensional input vector, and wji is the ith component of neuron j (Figure 4.13). The neuron j in the network with the smallest Euclidean distance is the central neuron C:
C = \arg\min_j \left\{ \left\| v_i - w_{ji} \right\| \right\}, \quad j = 1, 2, \ldots, m    (4.74)
The weights of the central neuron are then adjusted to the weights of the input vectors. The amount of adjustment of neurons surrounding the central neuron is actually determined by their topological distance. The neuron weights are corrected by
w_{ji}(t+1) = w_{ji}(t) + f_L \left( v_i - w_{ji}(t) \right)    (4.75)
where t is an integer of a discrete learning time coordinate and f_L is the neighborhood kernel, a Gaussian function:

f_L = \eta(t) \cdot e^{ -\frac{ \left\| d_c - d_j \right\|^2 }{ 2 \rho(t)^2 } }    (4.76)
Figure 4.14 A Kohonen network as seen from the top with the neuron vectors lying out of plane. After the central neuron has been determined and adapted, the surrounding neurons are affected by means of their topological distance to the central neuron. In a planar Kohonen network, the edges are not connected to each other; a central neuron at the edge affects fewer neurons in the neighborhood than a central neuron in the middle. The toroidal arrangement ensures that every neuron has the same number of neighbors; thus, the central neuron equally affects its neighborhood.
where ||d_c − d_j|| is the Euclidean distance between the central neuron c and the current neuron j on the map. The learning rate η(t) and the learning radius ρ(t) decrease linearly with time; that is, the size of the winner's neighborhood reduces continuously during the training process. Kohonen networks can be arranged in toroidal shape; that is, both ends of each plane are connected to each other so that the complete map forms a torus. Consequently, each neuron has the same number of neighbors, and a central neuron at the edge of the plane influences neurons at the other end of the plane (Figure 4.14).
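A single training step of a Kohonen network (Equations 4.73 through 4.76) might be sketched as follows for a planar map; the time decay of learning rate and radius and the toroidal wrap-around are omitted for brevity, and all values are illustrative.

    import numpy as np

    def kohonen_step(weights, v, eta=0.5, rho=1.5):
        # One training step: find the central (winning) neuron and adapt the map.
        # weights: array of shape (rows, cols, n), one n-dimensional weight vector per neuron
        # v: n-dimensional input vector
        # Central neuron = smallest Euclidean distance to the input (Eqs. 4.73/4.74)
        dist = np.linalg.norm(weights - v, axis=2)
        c = np.unravel_index(np.argmin(dist), dist.shape)

        # Gaussian neighborhood kernel over the topological distance (Eq. 4.76)
        rows, cols = np.indices(dist.shape)
        topo2 = (rows - c[0]) ** 2 + (cols - c[1]) ** 2
        f_L = eta * np.exp(-topo2 / (2.0 * rho ** 2))

        # Weight correction (Eq. 4.75)
        weights += f_L[:, :, None] * (v - weights)
        return c

    # Illustrative 10 x 10 map of 8-dimensional neurons and one input vector
    rng = np.random.default_rng(0)
    W = rng.random((10, 10, 8))
    central = kohonen_step(W, rng.random(8))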
4.8.3 Counterpropagation (CPG) Neural Networks — The Predictors

An enhanced concept of Kohonen networks is the CPG neural network, first introduced by Hecht-Nielsen [64]. The CPG network consists basically of a Kohonen layer and an additional output layer. The input layer contains the input objects (e.g., molecular descriptors). The output layer contains the variables to be predicted, such as a one- or multidimensional property of the corresponding molecules. Additionally, a topological map layer [65,66] may be added that contains classes for the individual test cases (Figure 4.15). The training process of this network is composed of two steps: (1) an unsupervised learning is performed by the Kohonen layer; and (2) a supervised learning is performed by the output layer. During iterative training of a CPG neural network, an epoch is a single pass through the entire training set. The iterative algorithm runs through a number of epochs, each of which executes each training case by (1) selecting the central neuron in the Kohonen layer (i.e., the one with the smallest Euclidean distance to the input vector); (2) adjusting the weights of the central neuron and its neighborhood in the Kohonen layer to the input vector; and (3) adjusting the weights of the central neuron and its neighborhood in the output layer to the output vector.
Figure 4.15 Scheme of a multilayer Kohonen network including input layer, output layer, and a topological map layer for classification of input data. The dimension of input and output vectors is equal to the dimension of the corresponding neurons in the network.
Kohonen networks are trained using an iterative algorithm: Starting with an initially random set of radial centers, the algorithm gradually adjusts them to reflect the clustering of the training data. The algorithm uses a learning rate that decays with time and affects the amount of the adjustment. This ensures that the centers settle down to a compromise representation of the cases that cause that neuron to win. The learning radius takes the topological neighborhood of the central neuron into account. Like the learning rate, the learning radius decays over time so that initially the entire topological map is affected; approaching the end of the training, the neighborhood will finally be reduced to the central neuron itself. The iterative training procedure adapts the network in a way that similar input objects are also situated close together on the topological map. The network’s topological layer can be seen as a two-dimensional grid, which is folded and distorted into the n-dimensional input space to preserve the original structure as well as possible. Clearly, any attempt to represent an n-dimensional space in two dimensions will result in loss of detail; however, the technique is useful to visualize data that might otherwise be hard to understand. Once the network is trained, the topological map represents a classification sheet. Some or all of the units in the topological map may be labeled with class names. If the distance is small enough, then the case is assigned to the class. A new vector presented to the Kohonen network ends up in its central neuron in the topological map layer. The central neuron points to the corresponding neuron in the output layer. A CPG neural network is able to evaluate the relationships between input and output information and to make predictions for missing output information.
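Building on the Kohonen step above, a counterpropagation training step might be sketched by adding an output layer that is adapted toward the target vector around the same winning neuron; this is an illustrative outline under simplified assumptions, not the published algorithm.

    import numpy as np

    def cpg_step(kohonen_w, output_w, x, y, eta=0.5, rho=1.5):
        # One CPG training step: unsupervised adaptation of the Kohonen layer toward the
        # input x, supervised adaptation of the output layer toward the target y,
        # both centered on the same winning neuron.
        dist = np.linalg.norm(kohonen_w - x, axis=2)
        c = np.unravel_index(np.argmin(dist), dist.shape)

        rows, cols = np.indices(dist.shape)
        topo2 = (rows - c[0]) ** 2 + (cols - c[1]) ** 2
        f_L = eta * np.exp(-topo2 / (2.0 * rho ** 2))

        kohonen_w += f_L[:, :, None] * (x - kohonen_w)   # Kohonen layer: toward input
        output_w += f_L[:, :, None] * (y - output_w)     # output layer: toward target
        return c

    # Illustrative layers: 10 x 10 map, 8-dimensional inputs, 3-dimensional outputs
    rng = np.random.default_rng(1)
    K = rng.random((10, 10, 8))
    O = rng.random((10, 10, 3))
    winner = cpg_step(K, O, rng.random(8), rng.random(3))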
Figure 4.16 Different tasks solved by ANNs for the data analysis of a multidimensional object. Classification performs the assignment of input objects X to predefined classes y. Modeling creates a functional relationship between the input objects and other multidimensional data. Mapping allows for reducing the input objects to a usually two-dimensional plane. Association allows assigning input objects to other multidimensional data on the basis of their relationships.
4.8.4 The Tasks: Classification and Modeling

Problem solving with ANNs can be divided into four main categories (Figure 4.16): (1) Classification is where classes are defined a priori for the training data, and the objective is to determine the class to which a given input object belongs. (2) Modeling has the general objective of transforming one- or multidimensional variables into a representation of different dimensionality. In the simplest case, a model can be created as a mathematical function that represents the input data in a different way. (3) Mapping allows the reduction of multidimensional objects to a space of different dimension. (4) Association refers to finding relationships between two different types and dimensionalities of data to predict new data for a given input. ANNs can actually perform a number of these tasks at the same time, although most commonly used networks perform only a single one. Although neural networks have been extensively investigated, the main efforts in chemistry research focus on the appropriate representation of data for neural networks. For instance, finding the adequate descriptor for the representation of chemical structures is one of the basic problems in chemical data analysis. The solution to these problems is a mathematical transformation of the molecular data into a vector of fixed length. On the basis of these vectors, several methods of data analysis can be performed: statistical evaluation, evaluation of complex relationships, and fast and effective simulation and prediction of molecular features.
4.9 Genetic Algorithms (GAs)

A GA is a programming technique that mimics biological evolution to find true or approximate solutions to optimization and search problems. GAs are a specific instance of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover. GAs were developed by John Holland and his team [67]. GAs are optimization techniques that search a set of alternative solutions to find the best one. Solutions are represented in a vector, which is usually called a chromosome in the style of biological evolution. The basic constituents of the GA are as follows:
• Chromosome: Data and solutions in the GA are represented as chromosomes. The chromosome might consist of bit values or real numbers, depending on the task to be solved. Bit values are used to represent the presence or absence of features, whereas real numbers are used in purely numerical optimization approaches.
• Fitness function: The fitness function is a mathematical function or a computer algorithm that describes the quality of a solution. It is necessary for the final decision as to whether an optimization was achieved. The fitness function should be able to distinguish between the individual chromosomes but should be able to recognize similarities, too.
• Selection algorithm: Chromosomes are selected from a pool for subsequent changes. In most expert systems selections are made according to numerical ranks. Each chromosome is assigned a numerical rank based on fitness, and selection is based on this ranking rather than on absolute difference in fitness. The advantage of this method is that it can prevent very fit individuals from gaining dominance early at the expense of less fit ones, which would reduce the population's genetic diversity and might hinder attempts to find an acceptable solution.
• Genetic operators: They are responsible for changing the chromosomes by either mutation or recombination. As in biological evolution, point mutations switch a particular component of the chromosome vector, whereas recombination exchanges components of two chromosomes to produce a new mixed chromosome (Figure 4.17).
• Termination criterion: The algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.
The basic procedure in a genetic algorithm is as follows:
(1) Create a (usually) random population of chromosomes that represent potential solutions.
(2) Use the fitness function to calculate the fitness of each chromosome.
(3) Mutate and recombine.
(a) Use the selection algorithm to select pairs of chromosomes.
(b) Create offspring by recombination of chromosomes.
(c) Mutate the chromosomes.
Figure 4.17 Schematic view of binary chromosomes undergoing a point mutation or crossover mutation. Point mutation is performed by changing a single component of a parent. The resulting offspring is identical except for a single component. Crossover is the exchange of a series of components in two chromosomes at a particular intersection of components in the chromosome.
(d) Replace the current generation of chromosomes with the new ones.
(e) Calculate the fitness for each new chromosome.
(4) Repeat step 3 until the termination criterion is fulfilled.
The resulting chromosomes should represent an optimum solution based on the fitness function and the termination criterion. GAs are applied in biogenetics, computer science, engineering, economics, chemistry, manufacturing, mathematics, physics, and other fields. A typical area is the library design in combinatorial chemistry, where large sets of reactions are performed to form a wide variety of products that can be screened for biological activity. The design of a library can be reactant based, using optimized groups of reactants without considering the products' properties, or product based, which selects reactants most likely to produce products with the desired properties. Product-based design creates more diverse combinatorial libraries with a greater chance to obtain a usable product; however, the technique is more complicated and time consuming. An approach for the automatic design of product-based combinatorial libraries was published by Gillet [68]. He optimized properties such as molecular diversity, toxicity, absorption, distribution, and metabolism, as well as structural similarity to create combinatorial libraries of high molecular diversity and maximum synthetic efficiency. Another application of GAs was published by Aires de Sousa et al.; they used genetic algorithms to select the appropriate descriptors for representing structure–chemical shift correlations in the computer [69]. Each chromosome was represented by a subset of 486 potentially useful descriptors for predicting H-NMR chemical shifts. The task of a fitness function was performed by a CPG neural network that used the subset of descriptors encoded in the chromosome for predicting chemical shifts. Each proton of a compound is presented to the neural network as a set of descriptors, obtaining a chemical shift as output. The fitness function was the RMS error for the chemical shifts obtained from the neural network and was verified with a cross-validation data set.
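The basic procedure above might be sketched for a binary chromosome and a toy fitness function (maximizing the number of set bits); the rank-based selection and all parameter values are illustrative assumptions.

    import random

    CHROMOSOME_LENGTH = 20
    POPULATION_SIZE = 30
    GENERATIONS = 50
    MUTATION_RATE = 0.02

    def fitness(chrom):
        # Toy fitness function: count the set bits
        return sum(chrom)

    def crossover(a, b):
        # Single-point crossover producing one offspring
        point = random.randint(1, CHROMOSOME_LENGTH - 1)
        return a[:point] + b[point:]

    def mutate(chrom):
        # Point mutation: flip each bit with a small probability
        return [1 - bit if random.random() < MUTATION_RATE else bit for bit in chrom]

    # (1) Random initial population
    population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LENGTH)]
                  for _ in range(POPULATION_SIZE)]

    for _ in range(GENERATIONS):
        # (2) Rank the population by fitness
        population.sort(key=fitness, reverse=True)
        # (3) Select the better half as parents, recombine and mutate
        parents = population[:POPULATION_SIZE // 2]
        offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(POPULATION_SIZE)]
        # (d) Replace the current generation with the new one
        population = offspring

    best = max(population, key=fitness)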
4.10 Concise Summary

3D Molecular Descriptors are molecule representations based on Cartesian coordinates and Euclidean distances.
Adjacency Matrix is a chemical structure representation in a matrix consisting of binary components that indicate whether two atoms are bonded or not.
Artificial Descriptor is a molecular descriptor that is calculated from molecular properties. Due to its pure mathematical nature, it can be adjusted and fine-tuned to fit to a task.
Artificial Neural Networks (ANNs) are simplified models of the human brain consisting of several layers of artificial neurons that pass signals to each other depending on the input signals they receive. They are applied in classification, prediction, association, and mapping problem domains that exhibit nonlinear and complex relationships.
Autocorrelation Vectors (topological autocorrelation vectors, autocorrelation of a topological structure, ATS) are molecular descriptors for a property distribution along the topological structure.
Average Descriptor Deviation is a statistical measure for the deviation between descriptors in a data set and the average descriptor.
Average Diversity is a statistical measure for the diversity of descriptors in a data set calculated by the sum of all average descriptor deviations divided by the number of descriptors.
Centered Moment of Distribution is a statistical method for characterizing a set of values that has a tendency to cluster around a centered value by the sums of the nth integer powers of the values.
Chromosome in a genetic algorithm is a representation for data and solutions consisting of real numbers or bit values that represent the presence or absence of features.
Classification is the task of assigning data into predefined categories to emphasize their similarities in a particular aspect.
Connection Table is a comprehensive description of the topology of a molecular graph.
Connectivity Index (Randic index) is a molecular descriptor based on the sum of degrees over all adjacent atoms in a molecular graph.
Constitutional Descriptor is a type of molecular descriptor that represents the chemical composition of a molecule in a generic way. It is independent from molecular connectivity and geometry.
Correlation (product–moment correlation, Pearson correlation) is a statistical measure for the relation between two or more sets of variables.
Counterpropagation (CPG) Neural Networks are a type of ANN consisting of multiple layers (i.e., input, output, map) in which the hidden layer is a Kohonen neural network. This model eliminates the need for back-propagation, thereby reducing training time.
Daubechies Wavelets are basic functions for the wavelet transform, which are self-similar and have a fractal structure, used to represent polynomial behavior.
Descriptor Interpretation refers to the evaluation of molecular descriptors to derive features of their underlying chemical structure or the entire chemical structure.
Descriptor/Descriptor Correlation is the property of an artificial molecular descriptor to correlate with at least one experimental descriptor.
Descriptor/Property Correlation is a requirement for a molecular descriptor to correlate with at least one property of a molecule.
Distance Matrix is a chemical structure representation in a matrix consisting of either Euclidean distances or the sum distances along the shortest bond path between two atoms.
Diversity in chemistry refers to the differences of molecules in a structural aspect, characteristics, or properties. Diversity is a main goal in creating combinatorial compound libraries to achieve maximum chemical, physicochemical, and biological variety (see also Similarity).
Experimental Descriptor is a descriptor that results from an analytical technique, such as spectrometry. Experimental descriptors emerge from a fixed experimental design, and their appearance is subject to the physical or chemical limitations of the measurement technique.
Fast Wavelet Transform (FWT) is a fast algorithm for wavelet transforms that requires only n operations for an n-dimensional vector.
Fitness Function is a mathematical function or a computer algorithm that describes the quality of a solution in a genetic algorithm.
Fourier Transform is a mathematical linear operation that decomposes a function into a continuous spectrum of its frequency components as a sum of sinusoids and cosinusoids.
Fragment-Based Coding (substructure-based coding) is a code resulting from dividing a molecule into several substructures that represent typical groups.
Genetic Algorithms (GAs) are a programming technique to find true or approximate solutions to optimization and search problems. They are a specific instance of evolutionary algorithms that use techniques inspired by evolutionary biology.
Genetic Operator is the method for changing chromosomes by either mutation or recombination in a GA.
Graph Isomorphism is a method from mathematical graph theory that can be used for mapping a structure onto another to determine the identity between two structures.
Hash Coding (hashing) is a scheme for providing rapid access to data items that are distinguished by some key; it is used to store and search entities, like substructures, according to their key values.
Isomer Discrimination is the property of a molecular descriptor to distinguish between structural isomers. Ideally, this property is configurable.
Isomorphism Algorithm is a mathematical graph theory method to determine whether two graphs can be mapped onto each other by permutation through the vertices of the graph.
Kier and Hall Index is a special form of the connectivity index that allows optimizing the correlation between the descriptor and particular classes of organic compounds.
Kohonen Neural Networks, or self-organizing maps (SOMs), are a type of ANN designed for unsupervised learning, where no prior knowledge about relationships in the data is necessary.
Kurtosis is a statistical measure for describing whether a distribution of values is flatter (platykurtic) or more peaked (leptokurtic) than the Gaussian distribution.
Linear Notation is a type of structure descriptor that represents the 2D structure of a molecule as a string of characters, which represent the atoms in a linear manner, and symbols that describe connectivity.
Mallat Algorithm is a classical scheme known in the signal processing community as a two-channel subband coder.
Maximum Common Subgraph Isomorphism is a method from mathematical graph theory used to locate the largest part that two structures have in common in order to find similar structures.
Modeling refers to the generation of a mathematical or virtual model that describes relationships between data and allows for deriving information.
Molecular Descriptor is a value, vector, or multidimensional mathematical representation that represents a certain property or a set of properties of a molecule in a way that is suitable for computational processing.
Morgan Algorithm is a mathematical method for canonical (unique) numbering of atoms in a molecule based on iterative indexing of atoms according to the number of their attached bonds.
Mother Wavelet is a finite-length waveform that is applied in scaled and translated copies (wavelets) in a wavelet transform.
Multiple Layer Networks are ANNs designed in multiple tiers, each of which processes different information that has to be correlated among one another.
Mutation is a genetic operator in GAs that changes one or more components of a chromosome.
Prescreening is a method for structure searches that uses bit strings encoding the presence or absence of a fragment in the query to reduce the number of molecules that require the full subgraph search.
Quadrature Mirror Filter (QMF) is a set of coefficients for transformation algorithms that act as a filter when applied to a raw data vector.
Recombination is a genetic operator in GAs that exchanges one or more components of two chromosomes to produce a new mixed chromosome.
Reversible Decoding is a desired property of a molecular descriptor: the ability to decode it mathematically to obtain the underlying chemical structure or the properties that have been used to calculate the descriptor.
Root Mean Square (RMS) is a statistical measure for the deviation between two sets of variables, calculated from the mean of the squared individual differences of the variables.
Rotational Invariance describes the independence of a molecular descriptor from partial or complete rotation of the molecule — independent of the absolute Cartesian coordinates of the atoms.
Selection Algorithm is a method for selecting chromosomes from a pool for subsequent modifications in a GA.
Similarity in chemistry refers to the analogy of molecules in a structural aspect, characteristics, or properties (see also Diversity).
Skewness is a statistical measure for describing the symmetry of a distribution of values relative to the Gaussian distribution.
SMILES Arbitrary Target Specification (SMARTS) is an extension to the SMILES linear notation developed for specifying queries in substructure searches.
Simplified Molecular Input Line Entry Specification (SMILES) is a simplistic line notation for describing chemical structures as a set of characters, numbers, and symbols that represent atoms, bonds, and stereochemistry.
Subgraph Isomorphism is a method from mathematical graph theory to find a substructure within a structure.
Termination Criterion is the criterion in a GA that terminates the genetic operations when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.
Topological Indices are molecular descriptors based on connectivity data of atoms within a molecule. These descriptors contain information about the constitution, size, shape, branching, and bond types of a chemical structure, whereas bond lengths, bond angles, and torsion angles are neglected.
Translational Invariance describes the independence of a molecular descriptor from translation of the entire molecule — independent of the absolute values of the Cartesian coordinates of the atoms.
Wavelet Transform is a mathematical linear operation that decomposes a function into a continuous spectrum of its frequency components. Wavelet basis functions are localized in space and frequency.
Wavelets are dilated and translated versions of a mother wavelet in wavelet transforms.
Wiener Index is the first published topological descriptor, calculated as the sum of the minimum numbers of bonds between all pairs of nonhydrogen atoms.
Wiswesser Line Notation is the first line notation capable of precisely describing complex molecules; it consists of a series of uppercase characters, numerals, and symbols (i.e., ampersand, hyphen, the oblique stroke, and a blank space).
5 Applying Molecular Descriptors
5.1 Introduction
Let us evaluate the different properties and applications of a molecular descriptor while keeping the aforementioned requirements for descriptors in mind. We will focus on a particular descriptor type: the radial distribution function (RDF). RDF descriptors grew out of the research area of structure–spectrum correlations but are far more than simple alternative representations of molecules. The flexibility of these functions from a mathematical point of view allows them to be applied in several other contexts. This chapter will give a theoretical overview of RDF descriptors as well as their application for the characterization of molecules, in particular for similarity and diversity tasks.
5.2 Radial Distribution Functions (RDFs)
Radial distribution functions can be adapted quite flexibly to the desired representation of molecules. The RDF functions developed can be divided into several groups regarding the basic function type, the distance range of calculation, the type of distance information, the dimensionality, and the postprocessing steps. Most of the varieties of RDF descriptors introduced in this chapter can be combined arbitrarily to fit the required task. As a consequence, more than 1,400 useful descriptors can be derived from radial functions [1]. The molecules used for the calculation of descriptors are shown in the figures.
5.2.1 Radial Distribution Function
The general RDF is an expression for the probability distribution of distances r_ij, each of which is measured between two points i and j, within a three-dimensional (3D) space of N points:

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(r - r_{ij})^{2}}     (5.1)
The exponential term leads to a Gaussian distribution around the distance rij with a half-peak width depending on the smoothing parameter B.
The general RDF can be easily transformed to a basic molecular descriptor by applying it to the 3D coordinates of the atoms in a molecule. In molecular terms, rij is the distance between the atoms i and j in an N-atomic molecule. g(r) is usually calculated for all unique pairs of atoms (denoted by i and j) in a certain distance range divided into equidistant intervals. Thus, the function g(r) is usually represented by its discrete form of an n-dimensional vector [g(r1), g(r 2),..., g(rn)] calculated between r1 and rn. In this case, the RDF is considered as a molecular descriptor (RDF descriptor) for the three-dimensional arrangement of atoms in a molecule. RDFs have certain characteristics in common with the 3D Molecular Representation of Structures Based on Electron Diffraction (MoRSE) code. In fact, the theory of RDF is related to the theoretical basis of 3D MoRSE functions. In 1937, Degard used the exponential term in the RDF to account for the experimental angular limitations in electron diffraction experiments [2]. The function I(s) can be understood as the Fourier transform of rij·g(r):
r_{ij} \cdot g(r) = C \cdot \int_{s=0}^{\infty} I(s) \cdot \sin(s \cdot r_{ij}) \, ds     (5.2)
where C is an integration constant. Rearranging Equation 5.2 yields

g(r) = C \cdot \int_{s=0}^{\infty} I(s) \cdot \frac{\sin(s \cdot r_{ij})}{r_{ij}} \, ds     (5.3)
In electron diffraction experiments, the intensity is the Fourier transform of 4πr_{ij}²g(r) and is related to the electron distribution in the molecule [3]. The Fourier transform of a 3D MoRSE code leads to a frequency pattern but lacks a most important feature of RDF descriptors: the frequency distribution. In contrast to the corresponding RDF descriptors, 3D MoRSE codes can hardly be interpreted directly. Nevertheless, 3D MoRSE codes lead to similar results when they are used with methods where direct interpretability is not required.
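To make the discrete form of Equation 5.1 concrete, the following minimal sketch — an illustration, not the original implementation — computes a Cartesian RDF descriptor from a list of 3D atom coordinates; the function name, grid defaults, and example coordinates are assumptions chosen for demonstration.

# Hedged sketch of Eq. 5.1: a Cartesian RDF descriptor sampled on a distance grid.
# All names, default values, and the example coordinates are illustrative.
import numpy as np

def rdf_descriptor(coords, r_max=10.0, n_points=128, B=100.0):
    """Return the distance grid (Angstrom) and the RDF vector g(r)."""
    coords = np.asarray(coords, dtype=float)
    r_grid = np.linspace(0.0, r_max, n_points)
    g = np.zeros(n_points)
    n_atoms = len(coords)
    for i in range(n_atoms - 1):
        for j in range(i + 1, n_atoms):
            r_ij = np.linalg.norm(coords[i] - coords[j])   # Euclidean distance
            g += np.exp(-B * (r_grid - r_ij) ** 2)          # Gaussian peak at r_ij
    return r_grid, g

# Usage with three dummy atom positions (Angstrom)
r, g = rdf_descriptor([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])

Sampling the Gaussian of every atom pair on a common grid is what turns the continuous probability distribution into the n-dimensional descriptor vector discussed above.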
5.2.2 Smoothing and Resolution
Whereas the complete RDF vector describes a probability distribution, the individual vector components are related to the relative frequencies of atom distances in the molecule. Thus, the individual g(r) values are plotted in a frequency dimension whereas r lies in the distance dimension (Figure 5.1). The smoothing parameter B can be interpreted as a temperature factor; that is, (2B)^{-1/2} is the root mean square amplitude of vibration for the atoms i and j. As the exponent must be dimensionless, the unit of B is Å–2 when r is measured in Å. Let us have a closer look at this parameter and its effects on a descriptor.
RDF descriptors are calculated for a certain distance range. This range depends primarily on the size of the largest molecule to be calculated. It is not mandatory to cover the entire molecule size to describe a certain feature of a compound. Since each atom is regarded as a center of distance measurement, even a descriptor shorter than the maximum distance in the molecule still covers the entire molecule.
Figure 5.1 RDF descriptor calculated for a polycyclic system. The complete RDF function represents a probability distribution; the individual peaks are related to the relative frequencies of atom distances in the molecule.
Choosing a distance range that covers the size of the majority of molecules in a data set is acceptable for most applications of RDF descriptors. The RDF descriptor length may be limited if only short distance information is important. For instance, a maximum distance of 4 Å is usually sufficient to describe a reaction center.
The parameter B determines the line width of the peaks in an RDF. B depends on the resolution of distances, that is, the step size used for calculating the RDF descriptor components. With decreasing B and increasing number of atoms, an RDF descriptor usually exhibits increasing overlap of peaks. Overlaps in RDF descriptors are not necessarily a disadvantage; overlap can be a desired property for the processing of RDF descriptors with methods that rely on interpolation, such as artificial neural networks (ANNs). The relationship between B and the resolution Δr is

B \propto (\Delta r)^{-2}     (5.4)
Thus, with increasing B, the resolution increases and the step size for an RDF descriptor decreases. Figure 5.2 shows the differences in a Cartesian RDF descriptor calculated between 1 and 2 Å with a smoothing parameter between 25 Å–2 and 1000 Å–2. With the corresponding resolution between 0.1 Å and about 0.032 Å, the half-peak width of an intense maximum in the Euclidean L2-normalized RDF lies around 0.05 and 0.2 Å; the maximum width is about 0.2–0.4 Å. B is equivalent to a temperature factor that defines the vibration of the atoms and the uncertainty in their positions. If the distance is measured in Angstrom, the resulting unit of B is Å–2. B is ideally calculated from the resolution of the resulting descriptor. It is important to take two considerations into account when determining a smoothing parameter:
[Figure 5.2 shows four panels calculated with B = 1000 Å–2, B = 400 Å–2, B = 100 Å–2, and B = 25 Å–2.]
Figure 5.2 Effect of smoothing parameter on the resolution and intensity distribution of an RDF descriptor in the distance range between 1 and 2 Å. The lines indicate the discrete components.
1. Fuzziness and interpolation: Some methods, such as ANNs (and other artificial intelligence [AI] methods), rely on a certain fuzziness of the input data. Using exact distances is not only contradictory to physical reality (i.e., molecules at absolute zero) but also restricts the ability of neural network methods to produce new information by interpolation. In particular, a certain fuzziness is a prerequisite for flexibility in structure evaluation.
2. Computation time: A high precision in the representation of atom positions may lead to a significant increase in computation time, since the same maximum distance requires a larger number of components, which may significantly affect the calculation or comparison of hundreds of thousands of descriptors.
In practice, the resolution will be a compromise between gaining the necessary resolution for the discrimination of distances and providing adequate flexibility for interpolation methods and acceptable computation times.
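As a rough illustration of Equation 5.4, the snippet below derives B from a chosen step size, assuming a proportionality constant of 1; the constant and the function name are assumptions, not values prescribed by the text.

# Hedged sketch: choosing B from the distance resolution via Eq. 5.4, B ∝ (Δr)^-2,
# assuming a proportionality constant of 1 (illustrative, not prescribed).
def smoothing_parameter(delta_r: float) -> float:
    """Return B in Angstrom^-2 for a grid step size delta_r in Angstrom."""
    return 1.0 / delta_r ** 2

print(smoothing_parameter(0.1))  # 100.0 A^-2 for a 0.1 A step size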
5.2.3 Resolution and Probability
A typical effect that occurs when changing the resolution of an RDF is a shift in peak intensity distribution. The higher the coincidence of a real distance with the calculated point, the more precise is the representation of the relative frequency of the distance. For example, an RDF descriptor with a resolution of 0.2 Å represents a real distance of 1.51 Å with two points at 1.4 and 1.6 Å, both lying on the tails of the theoretical peak. The RDF does not appropriately represent the peak maximum at 1.51 Å.
Consequently, the frequency of the corresponding distance is lower than in a descriptor calculated with a resolution of 0.1 Å, which would contain a peak at 1.5 Å. This kind of inappropriate representation of individual frequencies occurs throughout the entire descriptor, and the effect on the accuracy of the descriptor is unpredictable. Consequently, the resolution of the distance dimension affects the accuracy of the probability dimension, leading to the following effects:
• Low-resolution descriptors are less appropriate for algorithms relying on the comparison of numerical values. This applies, for instance, to pattern-matching algorithms, where the query pattern contains real distances, whereas the descriptor contains interpolated maxima.
• High-resolution descriptors are less appropriate for algorithms relying on interpolation. This applies to comparing RDF descriptors among one another for similarity searches or processing the descriptors with neural networks, where a high resolution decreases the effects of interpolation.
5.3 Making Things Comparable — Postprocessing of RDF Descriptors
In many cases, a final processing of the descriptors is necessary to gain comparability, to reduce information, as well as to emphasize or to suppress certain regions. We will divide these postprocessing operations into weighting and normalization, or scaling.
5.3.1 Weighting
Weighting emphasizes or suppresses effects in an RDF descriptor at certain distances. A weighted general RDF descriptor,

g(r) = f_{W} \cdot \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(r - r_{ij})^{2}}     (5.5)

contains a weighting factor f_W, which can result in linear distance weighting by using

f_{W} = r     (5.6)

f_{-W} = r^{-1}     (5.7)

or in exponential distance weighting with

f_{W} = e^{-r}     (5.8)

f_{-W} = e^{r}     (5.9)
where ƒW and ƒ–W represent the weight factors for increasing and decreasing weighting, respectively. Weighting is particularly useful to emphasize peaks in the high distance region, where there are usually smaller intensities due to the lower frequency of atomic distances.
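A minimal sketch of how these weighting factors (Equations 5.6 through 5.9) might be applied component-wise to an already computed descriptor vector; the mode names and the single-peak placeholder descriptor are illustrative assumptions.

# Hedged sketch: distance-weighting factors applied over the distance grid r.
# Mode names and the placeholder descriptor are assumptions for illustration.
import numpy as np

def weight(r, mode="linear_up"):
    """Weighting factor f_W over the distance grid r (Angstrom)."""
    if mode == "linear_up":
        return r                                # f_W = r      (Eq. 5.6)
    if mode == "linear_down":
        return 1.0 / np.maximum(r, 1e-6)        # f_W = r^-1   (Eq. 5.7), guarded at r = 0
    if mode == "exp_down":
        return np.exp(-r)                       # f_W = e^-r   (Eq. 5.8)
    if mode == "exp_up":
        return np.exp(r)                        # f_W = e^r    (Eq. 5.9)
    raise ValueError(mode)

r = np.linspace(0.0, 10.0, 128)
g = np.exp(-100.0 * (r - 2.5) ** 2)             # placeholder descriptor with one peak
g_weighted = weight(r, "linear_up") * g         # emphasizes the high-distance region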
5.3.2 Normalization
Vectors are usually scaled or normalized in computational data processing. For a normalized general RDF descriptor,

g(r) = \frac{1}{f_{N}} \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(r - r_{ij})^{2}}     (5.10)

the maximum norm

f_{N} = g_{max}     (5.11)

normalizes the function to the peak with the highest probability g_max, whereas the Euclidean L2-norm (or vector norm; L2 denotes the Lebesgue integral — an extension of the Riemann integral for nonnegative functions — over quadratically integrable functions) is

f_{N} = \sqrt{\sum_{r} g(r)^{2}}     (5.12)
that is, normalizing the vector to a total peak area of 1. Normalization is performed before processing descriptors with statistical methods or neural networks. The weight and normalization functions are available for one- and two-dimensional (2D) descriptors. However, multidimensional calculations are performed technically in one dimension; that is, each descriptor contains multiple one-dimensional vectors, such as [x0 ,x1,..., xn , y0 ,y1,..., yn , ...]. Consequently, distance-related functions like transforms are performed only in the distance dimension, whereas general functions like weighting and normalization are calculated for an entire descriptor; for example, normalization takes place on the entire vector instead of on the individual vectors of the first dimension.
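As a small illustration of the two normalizations, the sketch below scales a descriptor vector by the maximum norm (Equation 5.11) and by the Euclidean L2 norm (Equation 5.12); the array values are dummy data.

# Hedged sketch of the maximum norm and the Euclidean L2 norm; dummy data only.
import numpy as np

def normalize_max(g):
    """Scale so that the highest peak equals 1 (Eq. 5.11)."""
    return g / np.max(g)

def normalize_l2(g):
    """Scale by the Euclidean vector norm (Eq. 5.12)."""
    return g / np.linalg.norm(g)

g = np.array([0.2, 1.6, 0.8, 0.4])
print(normalize_max(g), normalize_l2(g))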
5.3.3 Remark on Linear Scaling
Normalization reduces the information linearly against the individual descriptor instead of against the complete data set. Several authors have mentioned this as a problem, in particular, concerning the use of descriptors with neural networks. This can be overcome by using linear scaling methods, which use factors that are usually related to the complete set of data.
However, scaling has two general drawbacks. Each data set needs its own set of scaling factors; that is, the data set must be investigated in advance; a fact that is
contradictory to the concept of automatic analysis. Additionally, if RDF descriptors are altered based on the entire data set, the information of an individual molecule is affected by outliers (i.e., exotic or untypical compounds) to the same extent as it is affected by typical compounds. In particular, this can lead to a loss or an overestimation of information at certain distances. Due to the fact that peaks in an RDF descriptor are related to each other — as the distances are related to the combination of bonds in a molecule — a scaled descriptor no longer describes the original molecule.
5.4 Adding Properties — Property-Weighted Functions
One of the most important improvements for a radial distribution function is the possibility to introduce chemical properties of atoms or entire molecules to describe physicochemical, physical, or biological properties of a chemical structure rather than just the three-dimensional structure. All of the RDF functions presented already can be calculated including properties p of the individual atoms. These properties can be inserted in a preexponential term as a product, leading to the property-weighted RDF descriptor, sometimes called RDF code:

g(r, p) = \sum_{i}^{N-1} \sum_{j>i}^{N} p_{i} p_{j} \cdot e^{-B(r - r_{ij})^{2}}     (5.13)
This method transforms the frequency dimension into a property-weighted frequency dimension. The selection of atomic properties determines the characterization of the atoms within an RDF descriptor. In particular, the classification of molecules by a Kohonen network is influenced by the choice of atomic property. We can distinguish between static and dynamic atom properties.
• Static atomic properties are constant for a given atom type and are independent of the chemical neighborhood.
• Dynamic atomic properties are calculated for the atoms in their specific chemical environment; they change dynamically with variation of the atom environment in a molecule.
5.4.1 Static Atomic Properties
Examples of valuable static atomic properties are the atomic number, atomic mass (AMU), Pauling electronegativity, first ionization potential (V) of the atom in the ground state, atomic radius (pm), covalent radius (pm), and atomic volume (cm³/mol). Static atomic properties are helpful to simplify interpretation rules for RDF descriptors. The product p_i · p_j in Equation 5.13 for a given atom pair can be easily calculated, and the relations between the heights of individual peaks can be predicted. This approach is valuable for structure or substructure searches in a database of descriptors. If a descriptor is calculated for a query molecule and if molecules with similar skeleton structures exist in the database, they will be found due to the unique
positions of the peaks. However, atom types are described by (1) slight differences in bond length, which lead to peak shifts usually not significant for unique characterization; and (2) the product of atom properties, which may have a significance depending on the atom property values. Using static atomic properties allows controlling the effect on the peak height depending on the chosen property. Calculating a Cartesian descriptor with atomic volume as static property allows, for instance, emphasizing chlorine atoms in the descriptor. The descriptor can become an indicator for almost any property that can be attributed to an atom.
5.4.2 Dynamic Atomic Properties
One of the major advantages of dynamic atomic properties in this context is that they account for valuable molecular information beyond the raw 3D data of atoms. Dynamic atomic properties depend on the chemical environment of the atoms. Typical examples are atom polarizability, molecular polarizability, residual electronegativity, partial atomic charges, ring-strain energies, and aromatic stabilization energies. Some of these properties are of high interest for spectrum–structure correlation investigations due to their influence on bonds.
Substructures containing strongly polarized bonds, like carbonyl groups or halogen atoms, exhibit a characteristic pattern in the descriptor with electronegativity as atomic property. The effect of strongly polarized atoms is not restricted to the neighborhood of this atom; all distances in the molecule are affected. Molecular polarizability is of importance for the biological activity of compounds. It quantifies the effect of distortion of a molecule in a weak external field, as generated by charges appearing throughout a reaction.
The partial atomic charge is one of the most interesting properties for structure derivation. Atomic charges may have both positive and negative values, and the resulting RDF descriptors usually exhibit a characteristic shape. The influence of partial charges on the RDF strongly depends on the presence of heteroatoms. For instance, carbonyl groups lead to extreme differences in the charge distribution over an entire molecule, which leads to a descriptor with a characteristic positive–negative peak distribution (Figure 5.3). This example shows the different influences of atomic charges on the RDF descriptor, particularly in the chemical neighborhood of the carbonyl group. The negative partial atomic charge of the oxygen in the carbonyl group and the resulting strong positive charge of the carbonyl carbon atom affect several peaks. In particular, the C=O distance at 1.21 Å is strongly emphasized toward negative values.
5.4.3 Property Products versus Averaged Properties
Although some products of properties are related to physical quantities (e.g., according to Coulomb theory, the product of charges is related to the force between them), other products of properties may not have a physical meaning. In this case, a variant of Equation 5.13 may be applied that uses the mean of the properties:

g(r, p) = \sum_{i}^{N-1} \sum_{j>i}^{N} \frac{p_{i} + p_{j}}{2} \cdot e^{-B(r - r_{ij})^{2}}     (5.14)
[Figure 5.3: RDF descriptor (distance probability versus distance in Å) with peak assignments C=O, C=C, C1…O, C2/6…C3/5, C2/6…C7, and C3/5…C7 for the molecule shown in the inset.]
Figure 5.3 RDF descriptor calculated with Cartesian distances and the partial atomic charge as dynamic property. The charge distribution affects the probability in both the positive and negative direction. The strong negative peaks correspond to atom pairs with charges of different sign.
The effects of replacing the property product with a mean property are obvious: A property product with particularly small or large values in one of the properties will lead to a significant decrease or increase in peak intensities. This has an amplifying effect on those peaks that originate from atom pairs with strongly different properties — often pairs of heteroatoms with nonheteroatoms. In contrast, Equation 5.14 leads to an attenuating effect for those atom pairs because the properties are averaged. This behavior is intensified if negative values occur, for instance with partial atomic charges: Pairs of positively and negatively charged atoms will lead to negative peaks, whereas uniformly charged atom pairs will remain on the positive part of the frequency axis.
Figure 5.4 shows that the behavior of the preexponential term — and, thus, the resulting peak amplitude — depends on both the magnitude and the sign of the participating properties. Pairs with C1 exhibit a similar shape in distribution, with generally smaller products of properties than averages. Pairs with O8 lead to an inverse behavior of products and averages due to the strong negative charge of the oxygen atoms. Consequently, the resulting RDF descriptors — here called the amplified and the attenuated RDF — exhibit quite different shapes.
The attenuated RDF descriptors typically exhibit more, or more significant, peaks than the amplified variant when dynamic properties are used. In particular, dynamic properties of carbon atoms are often small in comparison with those of the heteroatoms. Thus, peaks related to carbon atoms are also small, whereas the peaks related to the heteroatoms are intensified. In general, the property product leads to a descriptor that emphasizes extreme differences in properties — thus the name amplified. In most of the investigations presented in this volume, the product of properties was used to take advantage of this effect for the specific task.
[Figure 5.4: bar charts of the product and the average of partial atomic charges for atom pairs in trans-1,2-cyclohexanediol. The underlying values are:]
(a) Pairs with C1 — Atom: C2, C3, C4, C5, C6, O7, O8; Product: 0.007, –0.002, –0.004, –0.004, –0.002, –0.032, –0.032; Average: 0.081, 0.029, 0.016, 0.016, 0.029, –0.154, –0.154.
(b) Pairs with O8 — Atom: C1, C2, C3, C4, C5, C6, O7; Product: –0.032, –0.032, 0.009, 0.020, 0.020, 0.009, 0.152; Average: –0.154, –0.154, –0.207, –0.220, –0.220, –0.207, –0.390.
Figure 5.4 (a) Distribution of product and average of partial charges for nonhydrogen atoms paired with carbon atom 1 in trans-1,2-cyclohexanediol (hydrogen atoms are omitted). (b) Distribution of product and average of partial charges for nonhydrogen atoms paired with oxygen atom 8 (below) in trans-1,2-cyclohexanediol (hydrogen atoms are omitted).
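The following hedged sketch contrasts the amplified (property product, Equation 5.13) and attenuated (property average, Equation 5.14) variants for a toy system; the coordinates and partial charges are dummy values, not taken from the figure.

# Hedged sketch of a property-weighted RDF: product (Eq. 5.13) vs. average (Eq. 5.14).
# Coordinates, charges, and all names are illustrative dummy values.
import numpy as np

def property_rdf(coords, props, r_max=10.0, n_points=128, B=100.0, mode="product"):
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    r_grid = np.linspace(0.0, r_max, n_points)
    g = np.zeros(n_points)
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r_ij = np.linalg.norm(coords[i] - coords[j])
            if mode == "product":
                w = props[i] * props[j]          # amplified variant (Eq. 5.13)
            else:
                w = 0.5 * (props[i] + props[j])  # attenuated variant (Eq. 5.14)
            g += w * np.exp(-B * (r_grid - r_ij) ** 2)
    return r_grid, g

coords = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [2.4, 0.5, 0.0]]
charges = [0.45, -0.38, 0.05]                    # e.g., partial atomic charges
_, amplified = property_rdf(coords, charges, mode="product")
_, attenuated = property_rdf(coords, charges, mode="average")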
5.5 Describing Patterns
One of the main advantages of RDF descriptors is that they are interpretable. A simple approach to the recognition of structural features is to search for patterns, which provide valuable information in RDF descriptors for similarity searches. The importance of recognizing patterns in chemical data has already been emphasized.
Some simple adaptations of radial functions lead to useful descriptors for pattern recognition. Although the components of the previously introduced RDF descriptors are calculated for a continuous range of distance intervals, RDF pattern functions rely on the δ-function

\delta(r - r_{ij}) = \begin{cases} 1, & \text{if } r = r_{ij} \\ 0, & \text{else} \end{cases}     (5.15)
This function simply restricts the calculation to those distances that actually occur in the molecule. The condition (r = rij) is calculated with a certain fuzziness or tolerance limit that is either determined by the resolution of the function or defined by the user. Three types can be distinguished: (1) distance patterns; (2) frequency patterns; and (3) binary patterns.
5.5.1 Distance Patterns
The RDF distance pattern function

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(r - r_{ij}) \cdot e^{-B(r - r_{ij})^{2}}     (5.16)
represents a kind of reduced RDF descriptor that considers only the actual distances rij. The Gaussian distribution is suppressed, and B only affects the frequency of the distance rij. Consequently, each nonzero value of an RDF distance pattern represents a single entry — or a sum of entries containing the same values — of a conventional real-distance matrix. Throughout this text, the term distance matrix will be used for a matrix containing the real distances rather than the number of distances. This function provides a simpler pattern for distance probability of a molecule. It is usually represented in peak style rather than line style.
5.5.2 Frequency Patterns
The frequency pattern function

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(r - r_{ij})     (5.17)
is a simplified form of Equation 5.16 that is subject to the same restriction as given in Equation 5.15 but does not contain an exponential term. It calculates the absolute frequency of distances instead of the relative frequency. Jakes and Willett introduced an approach similar to frequency pattern functions [4]. The authors used the frequency of certain distances based on 3D interatomic distances for substructure and fragment screening. Frequency patterns are usually used without scaling or normalization in pattern-matching algorithms.
5.5.3 Binary Patterns
The binary pattern function

g(r) = \delta(r - r_{ij}), \quad i \in \{1, ..., N-1\}, \; j \in \{2, ..., N\}, \; j > i     (5.18)
is a simplified form of Equation 5.17 and results in a binary vector that merely defines the absence (0) or presence (1) of a distance within the range r. This function is a basis for pattern search-and-match algorithms.
5.5.4 Aromatic Patterns
Some chemical structures exhibit typical distances that occur independently of secondary features, which mainly affect the intensity distribution. In particular, aromatic systems contain at least a distance pattern of ortho-, meta-, and para-carbon atoms in the aromatic ring. A monocyclic aromatic system additionally shows a typical frequency distribution. Consequently, Cartesian RDF descriptors for benzene, toluene, and the xylene isomers show a typical pattern for the three C-C distances of the ortho-, meta-, and para-positions (1.4, 2.4, and 2.8 Å, respectively) within a benzene ring. This pattern is unique and indicates a benzene ring. Additional patterns occur for the substituted derivatives (3.8 and 4.3 Å) that are also typical for phenyl systems. The increasing distance of the methyl groups in meta- and para-xylene is indicated by a peak shift at 5.1 and 5.8 Å, respectively. These types of patterns are primarily used in rule bases for the modeling of structures, explained in detail in the application for structure prediction with infrared spectra.
5.5.5 Pattern Repetition
Due to the increase in the number of peaks with increasing size of a molecule, simple patterns in the small distance range may coincide with other patterns. However, molecular RDF functions contain all distances from a particular atom to every other atom in the molecule. Thus, patterns emerge throughout the entire descriptor. For instance, a sulfide bridge in an organic compound is characterized by at least two distances, those of the 1,2- and 1,3-neighboring carbon atoms, which is a unique indicator for the occurrence of C-S bridges (Figure 5.5).
5.5.6 Symmetry Effects
Another interesting effect occurs in RDF descriptors of highly symmetric compounds. An RDF typically exhibits decreasing probabilities (i.e., peak heights) with increasing distance. In largely linear molecules, the C-C distances occur often, whereas the maximum distance occurs only once. Highly symmetric systems like fullerene exhibit an opposite distribution. A high structural symmetry generally leads to a reduction of peaks in the RDF descriptor due to repetitions of distances. In this case, an RDF behaves similarly to many types of spectra.
Figure 5.5 Typical pattern of a sulfide-bridge in the cat pheromone Felinine. (Cartesian RDF, 128 components). Two typical distances, 1.84 and 2.75 Å, are a characteristic feature for the presence of the sulfide bridge.
5.5.7 Pattern Matching with Binary Patterns
Patterns and other characteristics of RDF descriptors that seem to indicate unique features are not easy to determine. However, compiled carefully, they are helpful tools for a quick recognition of substructures. A pattern search algorithm based on binary pattern descriptors can then be used for substructure search. Whereas binary pattern descriptors exclusively contain information about the presence or absence of distances, frequency pattern descriptors additionally contain the frequency of distances. Frequency pattern descriptors are valuable for direct comparison of structural similarities. For instance, a substructure can be assumed to exist if the frequencies in a substructure pattern also occur in the query descriptor. Bond patterns can be used in a similar pattern search approach to determine structural similarities; in this case, bond-path RDF descriptors are used.
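A minimal sketch of how a binary distance pattern (Equation 5.18) could serve as a screening filter: a target molecule can only contain the query substructure if every occupied distance bin of the query is also occupied in the target. The bin width, function names, and toy coordinates are illustrative assumptions.

# Hedged sketch: binary distance pattern (Eq. 5.18) and a containment test for screening.
# Bin width, names, and coordinates are illustrative.
import numpy as np

def binary_pattern(coords, r_max=10.0, n_bins=128):
    """Bit vector marking which distance bins are occupied in the molecule."""
    coords = np.asarray(coords, dtype=float)
    bits = np.zeros(n_bins, dtype=bool)
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r_ij = np.linalg.norm(coords[i] - coords[j])
            k = int(r_ij / r_max * n_bins)
            if k < n_bins:
                bits[k] = True
    return bits

def may_contain(query_bits, target_bits):
    """A target can only contain the query if every query bit is also set in the target."""
    return bool(np.all(target_bits[query_bits]))

fragment = [[0.0, 0.0, 0.0], [1.8, 0.0, 0.0]]                   # e.g., a single C-S distance
molecule = [[0.0, 0.0, 0.0], [1.8, 0.0, 0.0], [3.1, 1.0, 0.0]]
print(may_contain(binary_pattern(fragment), binary_pattern(molecule)))

Such a screen only rules candidates out; matches would still need confirmation by a full comparison, in the spirit of the prescreening step described earlier for substructure searches.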
5.6 From the View of an Atom — Local and Restricted RDF Descriptors
Instead of describing a complete molecule, we can restrict the calculation of an RDF descriptor to parts of a molecule or just a single atom. This is particularly helpful if the property we want to model relies on partial structure information rather than on the entire molecule, such as in nuclear magnetic resonance spectroscopy. The distance range for which an RDF descriptor is calculated may cover the entire molecule — as with the functions introduced already — or a certain fraction of the molecule or individual atoms. RDF descriptors can be divided into two types.
5.6.1 Local RDF Descriptors
In molecular RDF descriptors,

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} f(r)     (5.19)
the radial function f(r) (i.e., any of the previously introduced exponential or Kronecker terms) is calculated for all pairs of atoms denoted by i and j; it covers the entire molecule. In contrast, local, or atomic, RDF descriptors run over all pairs of a predefined atom j with every other atom i:
g(r) = \sum_{i}^{N} f(r), \quad j = \text{const.}, \; i \neq j     (5.20)
The resulting descriptor can be regarded as isolated from the molecular RDF that contains the sum of N possible descriptors. Thus, every N-atomic molecule can have N local RDF descriptors. However, local descriptors can cover the entire molecule, depending on the predefined maximum distance of the function; their center is just localized on a single atom. In particular, local RDF descriptors are useful for characterizing the chemical environment of an atom and can be applied, for example, to represent the environment of protons in 1H-NMR spectrum–structure correlations and for the evaluation of steric hindrance at a reaction center.
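A minimal sketch of a local RDF descriptor per Equation 5.20, summing the radial function only over pairs formed with one fixed center atom; names and coordinates are illustrative.

# Hedged sketch of a local (atomic) RDF descriptor per Eq. 5.20.
# All names, defaults, and coordinates are illustrative.
import numpy as np

def local_rdf(coords, center, r_max=8.0, n_points=128, B=100.0):
    coords = np.asarray(coords, dtype=float)
    r_grid = np.linspace(0.0, r_max, n_points)
    g = np.zeros(n_points)
    for i in range(len(coords)):
        if i == center:
            continue
        r_ij = np.linalg.norm(coords[i] - coords[center])
        g += np.exp(-B * (r_grid - r_ij) ** 2)
    return r_grid, g

# e.g., a descriptor centered on atom 0 (such as a proton in a 1H-NMR application)
r, g = local_rdf([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.1, 0.8, 0.0]], center=0)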
5.6.2 Atom-Specific RDF Descriptors
By restricting RDF descriptors to certain atom types, an atom-specific RDF descriptor,

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(t_{ij}) \cdot e^{-B(r - r_{ij})^{2}}     (5.21)
is calculated with the condition δ(t_ij) depending on the types t_i and t_j of the atoms i and j, respectively. Two cases can be distinguished. The ignore mode excludes a certain atom type t and is subject to the condition

\delta(t_{ij}) = \begin{cases} 1, & \text{if } t_{ij} \neq t \\ 0, & \text{else} \end{cases}     (5.22)
The resulting descriptor is simpler than a complete one and is useful if the ignored atom type has no meaning for the task to be represented — for instance, by ignoring hydrogen atoms with t ≡ H (ignore mode).
The opposite approach is a descriptor that is calculated in exclusive mode for a certain atom type with the inverse condition

\delta(t_{ij}) = \begin{cases} 1, & \text{if } t_{ij} = t \\ 0, & \text{else} \end{cases}     (5.23)
where t is the only atom type that has to be regarded in the calculation. In particular, exclusive RDF descriptors are useful for characterizing skeleton structures, for example, with carbon as exclusive atom (t ≡ C).
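The sketch below illustrates one possible reading of the ignore and exclusive modes (Equations 5.22 and 5.23): in ignore mode a pair is skipped if either atom is of type t, and in exclusive mode only pairs in which both atoms are of type t contribute. This interpretation, and all names and values, are assumptions made for illustration.

# Hedged sketch of atom-type filtering (ignore/exclusive modes); symbols and
# coordinates are dummy values, and the pair-filtering rule is one interpretation.
import numpy as np

def atom_specific_rdf(coords, symbols, atom_type, mode="ignore",
                      r_max=10.0, n_points=128, B=100.0):
    coords = np.asarray(coords, dtype=float)
    r_grid = np.linspace(0.0, r_max, n_points)
    g = np.zeros(n_points)
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            pair_has_type = atom_type in (symbols[i], symbols[j])
            if mode == "ignore" and pair_has_type:
                continue                                  # skip pairs involving type t
            if mode == "exclusive" and not (symbols[i] == symbols[j] == atom_type):
                continue                                  # keep only pairs of type t
            r_ij = np.linalg.norm(coords[i] - coords[j])
            g += np.exp(-B * (r_grid - r_ij) ** 2)
    return r_grid, g

coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.0, 0.0]]
symbols = ["C", "C", "H"]
_, skeleton = atom_specific_rdf(coords, symbols, "C", mode="exclusive")  # carbon skeleton
_, no_hydrogens = atom_specific_rdf(coords, symbols, "H", mode="ignore")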
5.7 Straight or Detour — Distance Function Types
We have seen before that different types of matrices can be used for characterizing a molecule. Depending on which matrix is used, the distance r_ij in a radial function can represent either the Cartesian distance, a bond-path distance, or simply the number of bonds between two atoms. Consequently, we obtain three groups of RDF descriptors.
5.7.1 Cartesian RDF
The Cartesian RDF uses the distances r_ij calculated from the Cartesian coordinates (x, y, z) in three-dimensional space:

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(r - r_{ij})^{2}}     (5.24)

r_{ij} = \sqrt{(x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2} + (z_{i} - z_{j})^{2}}     (5.25)
These functions map real three-dimensional information onto a one-dimensional function. This type of function strongly depends on the exact and consistent calculation of the Cartesian coordinates of the atoms and the conformational flexibility of a molecule. When r is measured in Å, the unit of a smoothing parameter B is Å–2. The previously shown descriptors rely on this distance measure.
5.7.2 Bond-Path RDF
The bond-path RDF is calculated using the sum of the bond lengths along the shortest bond path b_ij (instead of r_ij) between two atoms i and j:

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(b - b_{ij})^{2}}     (5.26)
Figure 5.6 Scheme of the calculation of bond-path distances introduced in Equation 5.27. k represents the bond-path levels for spheres s between atoms i and j.
b_{ij} = r_{ik(1)} + \min \left( \sum_{s=1}^{S_{b}-1} r_{k(s)k(s+1)} \right)     (5.27)
where S_b indicates the number of bond spheres s surrounding the atom i, and r_{ik(1)} is the bond length to the directly bonded atom k in the shortest path. To define the atom index k, the algorithm first evaluates the path with the smallest number of bonds between the target atoms i and j. r_{k(s)k(s+1)} is the bond length between two atoms k(s) and k(s+1) of the current and the next sphere s, respectively (Figure 5.6). One can imagine this type of function as a representation of a molecule flattened along the individual atom pair that is actually calculated. Therefore, these functions are independent of conformational changes in a molecule. Again, the unit of the smoothing parameter B is Å–2.
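A hedged sketch of the bond-path distance of Equation 5.27: the path with the fewest bonds between two atoms is found by breadth-first search over the bond graph, and the bond lengths along that path are summed. If several minimum-bond paths exist, this simplified version takes the first one found; the bond list and all names are illustrative.

# Hedged sketch: bond-path distances via breadth-first search over the bond graph.
# The bond list, lengths, and names are illustrative; ties between equally short
# paths are resolved arbitrarily in this simplification.
from collections import deque

def bond_path_distances(n_atoms, bonds):
    """bonds: list of (i, j, bond_length). Returns {(i, j): path length in Angstrom}."""
    neighbors = {i: [] for i in range(n_atoms)}
    for i, j, length in bonds:
        neighbors[i].append((j, length))
        neighbors[j].append((i, length))
    dist = {}
    for start in range(n_atoms):
        seen = {start: 0.0}                    # accumulated bond lengths from start
        queue = deque([start])
        while queue:                           # BFS: fewest bonds first
            a = queue.popleft()
            for b, length in neighbors[a]:
                if b not in seen:
                    seen[b] = seen[a] + length
                    queue.append(b)
        for end, d in seen.items():
            if end > start:
                dist[(start, end)] = d
    return dist

# Example: a three-atom chain C0-C1-C2 with 1.54 A bonds
print(bond_path_distances(3, [(0, 1, 1.54), (1, 2, 1.54)]))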
5.7.3 Topological Path RDF
The topological-path RDF is derived from the bond-path RDF. This type simply uses the number of bonds n_ij (instead of r_ij) between the atom pairs along the shortest path:

g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(n - n_{ij})^{2}}     (5.28)

n_{ij} = n_{ik(1)} + \min \left( \sum_{s=1}^{S_{b}-1} n_{k(s)k(s+1)} \right)     (5.29)
It provides a kind of bond-frequency pattern. It is also independent of conformational flexibility. In this case, the smoothing parameter B has the unit 1 (one). The statement that the bond-path RDF is independent of conformational changes relies on the precision of the Cartesian coordinates of the atoms and the accuracy of the calculation. In practice, the bond-path functions of two conformers are extremely similar but seldom coincidental.
Depending on the task, topological path descriptors are represented in either peak mode (i.e., similar to a distance pattern) or in the conventional smoothed mode containing the probability distribution. With analysis methods that rely on interpolation, like neural networks, the smoothed representation is necessary to preserve interpolation features.
5.8 Constitution and Conformation
The recognition of differences in molecular structures — the characterization of structural similarity — is a special feature of RDF descriptors. Changes in the constitution of a molecule will generally lead to changes in peak positions. For instance, a typical Cartesian RDF descriptor of a linear alkane shows periodic peaks — essentially the sum of the C-C distances. Small changes in the structure can lead to a series of changes in the descriptor. Some of the typical effects on a Cartesian RDF descriptor are as follows:
• New side chains lead to a change in the periodicity of the peaks in the descriptor.
• Changes in constitutional isomerism can lead to extremely different descriptors.
• The descriptor reacts to shortening of the maximum distance more sensitively with smaller molecules.
• Distances of directly bonded atoms remain virtually unchanged, whereas an increase in branching increases the splitting of the peaks, particularly if different orientations of the terminal groups in a side chain occur.
• Decreasing saturation leads to significant changes in the range of the short bond distances; additionally, new peaks can show up that result from the altered orientation of groups in the neighborhood.
Conformational variations have significant effects on the shape of the entire descriptor because of the overall change in interatomic distances. For instance, the RDF descriptors of the chair and boat conformers of cyclohexane can be distinguished by the frequency of distances (i.e., in a first approximation, by the intensity of the peaks). Some of the effects previously described are valuable for automatic RDF interpretation. In fact, this sensitivity is an elementary prerequisite in a rule base for descriptor interpretation. However, since many molecular properties are independent of the conformation, the sensitivity of RDF descriptors can be an undesired effect. Conformational changes occur through several effects, such as rotation, inversion, configuration interchange, or pseudo-rotation, and almost all of these effects occur more or less intensely in Cartesian RDF descriptors. If a descriptor needs to be insensitive to changes in the conformation of the molecule, bond-path descriptors or topological bond-path descriptors are more appropriate candidates.
Figure 5.7 shows a comparison of the Cartesian and bond-path descriptors. A typical feature of Cartesian RDF descriptors is a (at least virtual) decrease in characteristic information with increasing distance. The influence of the short distance range (in particular, the bond information) dominates the shape of a Cartesian RDF. In contrast, the bond-path descriptor is generally simpler; it exhibits fewer peaks that are more likely to show a similar distribution over the entire distance range, and it shows no domination of peaks in a particular distance range.
Figure 5.7 Comparison of Cartesian and bond-path RDF descriptor (256 components each) of a cyclohexanedione derivative. The bond-path descriptor exhibits sharper peaks in particular single bond-distance patterns and is generally larger than the corresponding Cartesian descriptor.
The topological-distance descriptor is a quite rough representation of the shape of a molecule because the original distance information is reduced to integers (the number of bonds). However, a normalized topological-distance descriptor represents a probability distribution, which is useful for analysis methods relying on interpolation.
5.9 Constitution and Molecular Descriptors
The appropriate representation of a molecule with RDF descriptors finally depends on the question of what the term similarity should describe in the given context. Let us have a look at the following example. Figure 5.8 shows three possible configurations of stereoisomers for a Ruthenium complex with a sulfur-dominated coordination sphere, a compound that serves as a model for nitrogenase. Searching for this molecule in a molecule database of high diversity, a scientist would expect all stereoisomers to be recognized as the same compound. However, a synthesis chemist might expect a different result, since the catalytic behavior of the stereoisomers is different. A drug designer might be interested in their docking behavior with biological receptors, which is a completely different task. The particular use or application obviously defines the meaning of similarity. Figures 5.9a through 5.9c show how the Cartesian, the bond-path, and the topological-path RDF descriptors for the Ruthenium complex can be used to characterize these differences in similarity.
Cartesian RDF descriptors cover the three-dimensional arrangement of atoms; these descriptors are suited to represent steric differences that may affect different behavior in chemical reactions. Whereas the initial bond distance range is similar, the peaks in the high distances reveal the differences in the 3D arrangement of atoms.
Figure 5.8 Stereoisomers of a Ruthenium complex with sulfur-dominated coordination sphere. Staggered trans-N conformer with equatorial and axial phenyl group (a); crossed trans-N conformer with equatorial phenyl groups (b); butterfly cis-N conformer with equatorial phenyl groups (c). All molecules were geometry optimized with an MM+ force field in a Newton-Raphson algorithm [5].
Figure 5.9 (a) Molecular Cartesian RDF descriptors for the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). The descriptors show a series of differences emphasizing the different spatial orientations of atoms in the molecules. (b) Molecular bond-path RDF descriptors for the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). Only small differences occur in these descriptors, indicating a similarity of the compounds. The small changes in several bond lengths are due to the different stereochemical stresses in the molecules with different orientations. (c) Molecular topological-path RDF descriptors for the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). Topological-path descriptors suppress any orientation or stereochemical stress, leading to identical characterization of the compounds.
This descriptor can be applied to describe the differences in docking behavior. The bond-path descriptors are much more similar among each other because they represent a 2D arrangement — the molecule flattened along the actually calculated atom pairs — rather than a 3D one. Staggered and butterfly configurations exhibit nearly the same descriptors; the crossed configuration shows a slight shift in the high distance range. These differences occur due to the high resolution of the descriptors (0.05 Å); the descriptor reacts sensitively to small changes in the high distance range (±0.01 Å, due to slightly different ring strains) in the C-S and C-N distances found by the MM+ method. The topological bond-path descriptors are identical; they represent the structural similarity perfectly.
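Judgments like the ones above can also be quantified; the following illustrative sketch compares two descriptor vectors by the Pearson correlation coefficient and the root mean square deviation, with dummy data standing in for real descriptors.

# Hedged sketch: comparing two descriptor vectors by correlation and RMS deviation.
# The two Gaussian vectors are dummy data, not descriptors from the figures.
import numpy as np

def compare(g1, g2):
    g1, g2 = np.asarray(g1, float), np.asarray(g2, float)
    corr = np.corrcoef(g1, g2)[0, 1]            # Pearson correlation coefficient
    rms = np.sqrt(np.mean((g1 - g2) ** 2))      # root mean square deviation
    return corr, rms

grid = np.linspace(0.0, 10.0, 128)
a = np.exp(-100.0 * (grid - 2.5) ** 2)
b = np.exp(-100.0 * (grid - 2.6) ** 2)
print(compare(a, b))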
5.10 Constitution and Local Descriptors
The Cartesian RDF seems to represent the biological activity of the Ruthenium complex. In any case, the descriptor is quite complex and cannot easily be compared with those of other molecules with a similar ligand arrangement and similar biological potency. Another approach is based on a local descriptor that specifies the chemical environment of the reaction center: the Ruthenium atom. Local, or atomic, RDF descriptors are suitable for characterizing an individual atom in its chemical environment. They are usually not appropriate for investigations of diverse data sets, since each molecule of N atoms can have N local descriptors. A typical application of local descriptors is the characterization of steric hindrance at reaction centers. This can be performed using a consistent numbering of the atoms of the reactants. In the following experiment, the Ruthenium atom of each conformer shown in Figure 5.8 was the first atom in the data file, and the local RDF descriptors for atom 1 (Ru) were calculated. Figures 5.10a through 5.10c show the results for the Cartesian RDF descriptors. The local Cartesian RDF descriptors of the stereoisomers are generally more similar to one another than the molecular ones.
Figure 5.10 (a) Local Cartesian RDF descriptors calculated on the Ruthenium atom in the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). In contrast to the descriptors shown in Figure 5.9a, local Cartesian descriptors clearly indicate the change in the chemical environment of the Ruthenium atom at around 3.8 and 4.5 Å. (b) Local bond-path RDF descriptors calculated on the Ruthenium atom for the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). Whereas the local Cartesian descriptor in Figure 5.10a exhibits two major changes in the chemical environment of the Ruthenium atom, the differences in the local bond-path descriptors occur together at 4.5 Å, which is due to the flattening of molecules with local descriptors. (c) Local topological-path RDF descriptors calculated on the Ruthenium atom for the stereoisomers shown in Figure 5.8. Major deviations are emphasized (B = 400 Å–2, 256 components). The local topological-path descriptor suppresses orientation, stereochemical stress, and changes; it merely shows a pattern of spheres in the environment of the Ruthenium atom.
In particular, they exhibit two patterns that describe the different ligand spheres of the stereoisomers in the distance range of 2.5–3.7 Å. The local bond-path descriptors are again nearly identical — except for a slight deviation in the range of the C-S and C-N distances — and indicate that all stereoisomers belong to the same compound. The local topological-path descriptors are coincident and provide a pattern that simply describes the distribution of bonded atoms in the environment of the Ruthenium atom. The discussion in this section briefly described the use of local descriptors. Another application of local descriptors is the characterization of atoms in nuclear magnetic resonance (NMR) spectroscopy. This is described later with an application for the prediction of chemical shifts in 1H-NMR spectroscopy, where protons were represented by their local RDF descriptors.
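A local descriptor differs from the molecular one only in that the pair sum is restricted to a single center atom. The following is a minimal sketch, assuming — as in the experiment above — that the Ruthenium atom is the first atom in the coordinate list (index 0); it is an illustration, not the ARC code.

```python
import numpy as np

def local_rdf(coords, center=0, b=400.0, n_components=256, r_max=7.5):
    """Local (atomic) RDF: Gaussian-smoothed distances from one center atom to all others."""
    coords = np.asarray(coords, dtype=float)
    r_i = np.linalg.norm(coords - coords[center], axis=1)   # distances to the center atom
    r_i = np.delete(r_i, center)                             # drop the zero self-distance
    r_grid = np.linspace(0.0, r_max, n_components)
    return np.exp(-b * (r_grid[:, None] - r_i[None, :]) ** 2).sum(axis=1)
```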
5.11 Constitution and Conformation in Statistical Evaluations
Changes in constitution and conformation also have significant effects on the statistical evaluation of data sets.
Figure 5.11 Data set of 14 amines for the investigation of statistical parameters. The set contains two groups of similar compounds (1-5 and 6-10) but exhibits overall a medium diversity.
Figure 5.12 Distribution of correlation coefficients of each descriptor with the ASD for the data set shown in Figure 5.11. The trend limit for correlation coefficients R > 0.8 is emphasized (RDF, 256 components, B = 400 Å–2).
The following examples with small molecule data sets describe these effects. The molecule set used for the first experiment consists of 14 primary, secondary, and tertiary amines (Figure 5.11). The descriptors should be similar with respect to the amino functionality, whereas the different N-substituents should be emphasized. Figure 5.12 shows how the individual descriptor of each compound correlates with the average set descriptor (ASD):
\bar{g}(r) = \frac{1}{L} \sum_{i=1}^{L} g_i(r)    (5.30)
Figure 5.13 Distribution of descriptor skewness for the data set shown in Figure 5.11 (RDF, 256 components, B = 400 Å–2).
Since the ASD represents the entire data set, the correlation coefficient between a compound descriptor and the ASD describes how well the compound fits into the data set. In addition, three distance modes — Cartesian, bond-path, and topological-path distances — are compared. Cartesian RDF descriptors are usually quite sensitive to small constitutional changes in the molecule. The bond-path descriptors exhibit less sensitivity, whereas topological-path descriptors only indicate extreme changes in the entire molecule or in the size of the molecule. Figure 5.12 shows the statistical trend limit (R between 1 and 0.8), which is the range of correlation coefficients that indicates a probable trend; in this context, it marks the range of similar compounds. The correlation coefficients between the topological descriptors do not change significantly, except for the short hydrazine molecule (14) with its N-N bond, which is atypical for this data set. In bond-path and Cartesian descriptors, double bonds (5, 11, 12, 13) lead to significant differences because of the shorter bond lengths. However, most of the compounds appear within the statistical trend limit and are indicated to be similar, even with the more sensitive Cartesian RDF descriptor. Correlation coefficients between two RDF descriptors do not necessarily reflect the diversity of compounds. They do not distinguish between obvious structural differences, since the reliability of the correlation coefficient itself depends on the symmetry of the distribution within a descriptor — that is, on the skewness and kurtosis. Figure 5.13 shows the distribution of the skewness of the descriptors; the kurtosis shows basically the same trend, although it exhibits a higher sensitivity. Whereas the skewness of Cartesian RDF descriptors reacts quite insensitively to changes in the data set (except for hydrazine, 14), significant changes occur in bond-path descriptors when the molecule becomes more compact (e.g., the sequence 2-13-4) and when the frequency of side chains changes (e.g., 7, 9 and 8, 10).
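The statistical evaluation described here is straightforward to reproduce. The sketch below is an illustration rather than the ARC code; it computes the ASD of Equation 5.30 and, for each descriptor, its correlation coefficient to the ASD, its skewness, and its kurtosis (NumPy and SciPy are assumed to be available).

```python
import numpy as np
from scipy.stats import skew, kurtosis, pearsonr

def evaluate_descriptor_set(descriptors):
    """descriptors: array of shape (L, n_components), one RDF descriptor per molecule."""
    descriptors = np.asarray(descriptors, dtype=float)
    asd = descriptors.mean(axis=0)                          # average set descriptor, Eq. 5.30
    r_to_asd = np.array([pearsonr(d, asd)[0] for d in descriptors])
    return {
        "asd": asd,
        "correlation_to_asd": r_to_asd,                     # R > 0.8 is taken as the trend limit
        "skewness": skew(descriptors, axis=1),
        "kurtosis": kurtosis(descriptors, axis=1),          # measures the flatness of the distribution
    }
```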
Figure 5.14 Data set of 21 phosphorus compounds for the investigation of statistical parameters. The set contains phosphorus in several oxidation states but exhibits generally a high diversity. R1 and R2 for compounds 7-14 are: 7 (H, NO2); 8 (NO2, H); 9 (H, Cl); 10 (Cl, H); 11 (NH2, H); 12 (H, NH2); 13 (H, OCH3); 14 (NH2, COOEt).
As the diversity of the compounds in the data set increases, the ability of correlation coefficients to identify structural differences may decrease. Another data set of diverse phosphorus compounds (Figure 5.14) illustrates this. The correlation coefficients between the individual RDF descriptors and the ASD (Figure 5.15) show no significant differences between compounds 7-14 because of the high diversity of the data set. In addition, compounds 2 and 3 as well as compounds 19 and 20 are indicated as similar within the data set. One major difference, between the ethyl esters 6-14 and the methyl ester 15, is only indicated by the bond-path RDF descriptor, which reacts sensitively to the additional carbonyl group of compound 15. In contrast to the correlation coefficients, the distribution of skewness (Figure 5.16) indicates the differences in structures 2 and 3 and 15-18, whereas the phosphorus esters 6-15 are recognized as similar. It should be remarked that the bond-dependent descriptors (i.e., bond-path and topological RDF) generally recognize changes in ring size, heteroatoms, and double bonds, whereas the Cartesian RDF descriptors show significant deviations with the size of a molecule and the number of rings.
Figure 5.15 Distribution of correlation coefficients between RDF descriptors and their ASD for the data set shown in Figure 5.14. The trend limit for correlation coefficients R > 0.8 is emphasized (RDF, 256 components, B = 400 Å–2). The correlation coefficients indicate the similarity (7-14) but are mainly influenced by the size of the molecule.
Figure 5.16 Distribution of the skewness of RDF descriptors for the data set shown in Figure 5.14 (256 components, B = 400 Å–2). The skewness values indicate the similarity of compounds 7-14 but are mainly influenced by the size of the molecule.
These experiments show that correlation coefficients between individual descriptors and the ASDs are valuable for indicating similarity and diversity of data sets and compounds. Since the effects of symmetry of distribution may be significant, it is recommended that correlation coefficients be handled together with skewness or kurtosis. The skewness and kurtosis of the descriptors both show the same trends, although the kurtosis — in particular, the flatness of distribution — is generally more sensitive
than the skewness of the distribution. However, both skewness and kurtosis are more sensitive indicators of constitutional and conformational changes in a molecule than correlation coefficients are.
5.12 Extending the Dimension — Multidimensional Function Types
We have seen that additional atom properties allow a discrimination of molecules beyond the three-dimensional structure. However, we will find cases where the information content of enhanced RDF descriptors is still not sufficient for a certain application. In particular, if the problem to be solved depends on more than a few parameters, it may be necessary to divide the information that is summarized in the one-dimensional RDF descriptor. Though the RDF descriptors introduced previously are generated in one dimension, it is generally possible to calculate multidimensional descriptors. In this case, we can extend the function into a new property dimension by simply introducing the property into the exponential term:
g(r, p') = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-[B(r - r_{ij})^2 + D(p' - p'_{ij})^2]}    (5.31)

This two-dimensional RDF descriptor is calculated depending on the distance r and an additional property p′. In this case, p′ij is a property difference calculated in the same fashion as the Cartesian distance rij; in fact, p′ij can be regarded as a property distance. Much in the same way as B influences the resolution of the distance dimension, the property-smoothing parameter D affects the resolution — and, thus, the half-peak width — in the property dimension. D is measured in inverse squared units [p′]–2 of the property p′. Two-dimensional RDF descriptors preserve additional information about an arbitrary atomic property p′ that appears in the second dimension of the descriptor. This descriptor can also be calculated with two atomic properties by including atom properties p of the individual atoms according to Equation 5.13:

g(r, p, p') = \sum_{i}^{N-1} \sum_{j>i}^{N} p_i p_j \cdot e^{-[B(r - r_{ij})^2 + D(p' - p'_{ij})^2]}    (5.32)
The probability-weight properties pi and pj are, and should be, primarily independent of the property p′ that defines the second dimension of the descriptor. In other words, the distance dimension g(r) is separated from the property dimension g(p′), and, additionally, the probability is weighted by the property p. Figure 5.17 shows an example of a 2D RDF descriptor calculated for a simple molecule. The 2D RDF has a distance axis and a property axis, the latter showing the partial atomic charge distribution. Since the probability-weight function p is omitted, the function simply splits into distance space and property space. The distance axis is equivalent to a one-dimensional RDF, whereas the property axis shows the charge distribution at a certain distance. The two intense peaks represent the C–H and C··H′
Figure 5.17 Two-dimensional RDF descriptor of ethene calculated with Cartesian distances in the first and the partial atomic charge as property for the second dimension. Instead of the one-dimensional descriptor with four peaks, the six distances occurring in ethene are clearly divided into the separate property and distance dimensions.
distances, indicating the high symmetry of the molecule. They exhibit high charge differences (i.e., negative values) between carbon and hydrogen atoms. The remaining four peaks represent the C=C distance with low probability and the three H-H distances, all of which have a low charge difference. Figure 5.18 shows a 2D RDF descriptor of a more complex molecule. A direct interpretation would be more difficult; however, the value of separating distance and property becomes more obvious. A 2D RDF descriptor with distance dimension n and property dimension m can be treated like a one-dimensional vector containing m descriptors of length n. This makes it easy to compare two-dimensional descriptors using the same algorithms as for one-dimensional vectors. Since we can use two atomic properties, the question arises as to which property would be appropriate for the second dimension and which one would then be the right one to weight the probabilities. We cannot answer this question generally, but the functional influence of the two properties is quite different:
• The weight-property term pi · pj is a product of properties of the actual atom pair. Thus, the difference in the properties is averaged and is largely determined by the higher value, especially if the difference is high. Additionally, this can lead to negative values in the descriptor if one atom exhibits a negative property (e.g., the charge product of two oxygen atoms is usually positive).
Figure 5.18 Two-dimensional RDF descriptor of the polycyclic ring system shown below calculated with Cartesian distances in the first and the partial atomic charge as property for the second dimension.
• The second-dimension property term p′ – p′ij is a difference of two properties and is equivalent to the difference of Cartesian coordinates; the distance matrix is actually a difference matrix. This always leads to positive values because of the squared term in the exponential.
In general, 2D RDF functions are less valuable for describing structural similarity if the diversity of the compounds is high. However, a one-dimensional RDF might not be sensitive enough for low-diversity data sets. If it is necessary to distinguish between more or less similar molecules and a property with negative values is used, it is a good idea to use it as the weight property. This enhances the probability space or, more exactly, divides the probabilities into positive and negative dimensions. Typically, charge-weighted 2D RDF functions are appropriate descriptors for evaluating complex relationships between structures and properties. An example of the prediction of effective concentrations of biologically active compounds using this function follows.
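As an illustration of Equation 5.31, the following sketch computes a two-dimensional RDF over a distance grid and a property-difference grid. The value of D, the grid limits, and the function name are not taken from the text; they are assumptions chosen only for demonstration.

```python
import numpy as np

def rdf_2d(coords, prop, b=400.0, d=100.0, n_r=128, r_max=10.0, n_p=32, p_lim=0.4):
    """Two-dimensional RDF g(r, p'): distance dimension plus a property-difference dimension."""
    coords = np.asarray(coords, dtype=float)
    prop = np.asarray(prop, dtype=float)                    # e.g. partial atomic charges
    i, j = np.triu_indices(len(coords), k=1)                # unique atom pairs
    r_ij = np.linalg.norm(coords[i] - coords[j], axis=1)
    p_ij = prop[i] - prop[j]                                # property "distance" of each pair
    r_grid = np.linspace(0.0, r_max, n_r)
    p_grid = np.linspace(-p_lim, p_lim, n_p)
    g = np.zeros((n_r, n_p))
    for rij, pij in zip(r_ij, p_ij):                        # Gaussian in both dimensions (Eq. 5.31)
        g += np.exp(-b * (r_grid[:, None] - rij) ** 2 - d * (p_grid[None, :] - pij) ** 2)
    return g                                                # weight each term by p_i * p_j for Eq. 5.32
```

Flattening the resulting matrix yields the one-dimensional vector of m property slices of length n mentioned above, so the same comparison algorithms can be reused.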
5.13 Emphasizing the Essential — Wavelet Transforms
The transformation of an RDF, in particular the wavelet transform, can enhance or suppress features that are typical or atypical for a given task. Two decomposition methods are useful.
Figure 5.19 Comparison of the coarse-filtered D20 transformed RDF (128 components) with the original Cartesian RDF (256 components). The transformed RDF represents a smoothed descriptor containing all the valuable information in a vector half the size of the original RDF descriptor.
1. Coarse-Filtered One-Level Decomposition: The coarse-filtered (low-pass) coefficients C(1) of the first resolution level can be regarded as a smoothed representation with only half the size of the original descriptor. We can use it as a compressed descriptor for similarity searches in binary databases. Figure 5.19 shows a coarse-filtered wavelet transform of an RDF descriptor. The outcome is an appropriate but reduced representation of the original descriptor.
2. Detail-Filtered One-Level Decomposition: The detail-filtered (high-pass) coefficients D(1) of the first resolution level represent a new type of descriptor that reveals special aspects of the data, like trends, breakdown points, and discontinuities in higher derivatives. It is useful as an alternative molecular representation for neural networks in classification and prediction tasks (Figure 5.20).
The resolution level can be chosen arbitrarily between 1 and J. Any valid combination of coarse and detail coefficients at a certain resolution level that leads to a descriptor of the same size is possible. For example, an original RDF descriptor with 256 components (i.e., J = 6) can be decomposed up to the resolution level j = 3, and the wavelet transform (WLT) can then be represented as C(3) + D(3) + D(2) + D(1). This method allows the refinement to be chosen individually. A combined coarse- and detail-filtered complete decomposition consists of the coarse coefficients C(J) of the last resolution level and all the detail coefficients D(J), D(J–1), ..., D(1). Figure 5.21 displays the combination of the coarse and the detail coefficients filtered up to the highest possible resolution level (j = 6 with 256 components in the original RDF descriptor).
Figure 5.20 Detail-filtered D20 transformed RDF at resolution level j = 1.
Figure 5.21 Combination of the coarse- and detail-filtered D20 transformed RDF descriptor performed at the highest resolution level (j = 6). The transform is an alternative representation of an RDF in the wavelet domain. This signal consists of the coefficients C(6) + D(6) + D(5) + D(4) + D(3) + D(2) + D(1).
In this case, the final (and only) coarse coefficient C(6) consists of four components. The descriptor size does not change in this case; this combination is just an alternative representation of the untransformed RDF descriptor. Consequently, it leads to the same experimental results with neural networks. Due to the different shapes, however, the results of a statistical evaluation usually differ between the transformed and the raw descriptor. The coarse- or detail-filtered wavelet transform — or a combination of the coefficients for an RDF descriptor at a certain resolution level — is applied right before any postprocessing, such as normalization.
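The decompositions above map directly onto a standard wavelet library. The sketch below uses PyWavelets (an assumption; ARC uses its own implementation) with a Daubechies wavelet of 20 coefficients; PyWavelets will warn that a level-6 decomposition of 256 points exceeds the recommended depth for such a long filter, which only means that boundary effects reach the deepest coefficients.

```python
import numpy as np
import pywt

descriptor = np.random.rand(256)          # stand-in for a 256-component Cartesian RDF descriptor

# One-level decomposition: coarse (low-pass) C(1) and detail (high-pass) D(1), 128 components each
c1, d1 = pywt.dwt(descriptor, "db20", mode="periodization")

# Complete decomposition and recombination C(6) + D(6) + D(5) + ... + D(1)
coeffs = pywt.wavedec(descriptor, "db20", mode="periodization", level=6)
wlt = np.concatenate(coeffs)              # 4 + 4 + 8 + 16 + 32 + 64 + 128 = 256 components
```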
Figure 5.22 Molecular structures of cholesterol (left) and cholesterol chloroacetate (right).
Wavelet-transformed RDF descriptors can exhibit detailed information about 3D arrangements in a compressed form. As compressed representations of structures, wavelet-transformed RDF descriptors react less significantly to changes in the position of single peaks (i.e., individual atomic distances) than to changes in the entire shape (i.e., the molecular constitution) of the original descriptor. The reason is the specific way in which the transformation takes place by dilation and translation of the wavelet mother function. This behavior occurs because a complete decomposition leads to (in this case five) self-similar regions in the resulting descriptor, and the overall deviations between the different descriptors become smaller. Consequently, the wavelet-transformed descriptor emphasizes small deviations in very similar descriptors. Special applications of wavelet-transformed RDF descriptors using this feature are presented later in this book.
5.13.1 Single-Level Transforms
RDF descriptors can be transformed either completely or partially. If only a one-stage D20 transform (i.e., j = 1) is applied, the resulting descriptor can reveal discontinuities — and, thus, differences between two molecules — that are not seen in the normal RDF descriptor. This is shown with an example of the RDF descriptors of cholesterol and cholesterol chloroacetate (Figure 5.22) that were encoded with a Cartesian RDF and a one-stage high-pass filtered D20 transform (Figure 5.23). The Cartesian RDF exhibits the differences between the two molecules, but the overall shape is quite similar, leading to a high correlation coefficient of 0.96. The transformed and high-pass filtered RDF emphasizes discontinuities — in particular, opposite slopes — of the nontransformed descriptor and leads to a strongly decreased correlation coefficient of 0.83. The question as to which descriptor is better finally depends on the task.
Figure 5.23 Overlay of the D20 transformed (above) and the normal Cartesian RDF descriptors (below) of cholesterol (dark) and cholesterol chloroacetate (light). Discontinuities (framed area) in the Cartesian RDF are revealed by the high-pass filtered transform, leading to a significant decrease in the correlation coefficient.
The nontransformed RDF produces the better result when searching for a steroid in a data set of high diversity. The filtered transform is well suited for searching for a particular steroid in a set of steroids, where the nontransformed RDF would be too unspecific.
5.13.2 Wavelet-Compressed Descriptors
Low-pass, or coarse-filtered, wavelet transforms are valuable as compressed representations of RDF descriptors for fast similarity searches in binary databases. Decreasing the resolution generally reduces the size of an RDF descriptor. However, coarse-filtered wavelet transforms, which are already half the size of a nontransformed descriptor, conserve more information than an RDF descriptor of the same reduced size. Figure 5.24 shows a comparison of a filtered transform and an RDF descriptor. Even though both functions have the same size (i.e., the same number of components), the RDF used for the transform originally had a higher resolution and size. This is the reason why additional peaks appear in the transform that do not occur in the RDF descriptor. The additional information primarily affects analysis methods that rely on the appearance of individual peaks rather than on the shape of the entire descriptor. Thus, statistical parameters like correlation coefficients and skewness are affected only to a minor extent. Nevertheless, coarse-filtered wavelet transforms lead to an increase in valuable information.
5.14 A Tool for Generation and Evaluation of RDF Descriptors — ARC
The investigation and evaluation of descriptors is cumbersome without appropriate software.
Figure 5.24 Comparison of the low-pass filtered D20 transformed RDF (128 components) of the original Cartesian RDF (256 components) with a Cartesian RDF of half the resolution (128 components). Remarkable differences are indicated in bold.
During an investigation of the use of radial distribution functions as chemical descriptors, a software package was developed that helps in calculating and selecting the appropriate descriptor for a given task: Algorithms for Radial Coding (ARC), developed at the University of Erlangen-Nuremberg in 2001 [2]. ARC is designed for the generation, investigation, and analysis of RDF descriptors and for applying them to prediction and classification problems, primarily in the field of chemistry. ARC provides centralized access to a wide variety of RDF descriptors with an extensive set of optional parameters. On the basis of these descriptors, several methods of data analysis can be performed: ARC incorporates statistical evaluation features and neural networks for the evaluation of complex relationships and provides features for a fast and effective simulation and prediction of chemical properties by means of RDF descriptors. ARC incorporates multilayer Kohonen networks derived from the original Self-Organizing Map Program Package (SOM_PAK) algorithms from the research team of Kohonen, developed at the Helsinki University of Technology in Espoo, Finland [6]. Some additional features, like multilayer technology and the processing of toroidal-shaped networks, were included in the algorithms of ARC. ARC calculates different RDF descriptors by using the connection tables and Cartesian coordinates given in the input file. It was designed with two goals: (1) to enable fast and efficient access to RDF descriptors; and (2) to provide a user-friendly interface for a detailed investigation, evaluation, and application of these descriptors. ARC is not an expert system in the classical sense; it does not provide question-and-answer strategies or a generic rule base. It includes nonlinear optimization algorithms and guides the user through the necessary steps for the creation and evaluation of descriptors. In contrast to typical expert systems, ARC is written in C++ and includes a rule base in binary format, which can be created with information from different file formats, such as MDL Molfile, Brookhaven Protein Database, Gasteiger Cleartext, JCAMP (DX, JDX, JCM, and CS), binary files (molecule sets, databases, code
bases), neural network files, and matrices (distance, bond, bond path, topological). Rules are written in a basic declarative manner and are stored in binary or text format. ARC was compiled for a Windows environment with a multiple document interface (MDI) technology that enables the user to open multiple independent data windows (child windows) inside the main workspace.
5.14.1 Loading Structure Information
Molecule data in ARC are represented in a data set structure, which stores the relationships between molecular data and subsequent dialogs, windows, or routines. Each time single or multiple files are opened, a data set node appears in a tree view that contains subnodes for the molecules as well as data calculated for the entire molecule set. Nevertheless, each window contains its own data and can be manipulated independently of the corresponding data set. Each file entry consists of a single compound and may include several subwindows (e.g., molecule, spectrum, descriptor). Depending on the configuration, multiple selected files are either loaded as individual entries or collected as a single data set.
5.14.2 The Default Code Settings
ARC allows descriptor settings to be defined in various ways. The code method defines the general method for the calculation of descriptors; available methods are RDF, distance pattern, binary pattern, simple pattern, MoRSE descriptor, and 2D RDF. The distance mode defines the mode for distance calculation; available modes are Cartesian distances, bond-path distances, and topological distances. Descriptors may be calculated on particular atoms: exclusive mode restricts the calculation to the selected atom type, and in ignore mode the selected atom type is ignored when calculating the descriptor. In partial-atom mode an atom number has to be given instead of the atom type. The second atom property is available if 2D RDF is selected as the code method. The class type defines the keyword for a class vector for neural network training and prediction. Depending on the number of components of the class, either a multicomponent class vector (e.g., a spectrum) or a single-component class property is allocated. The number of components automatically defines the number of weights used in the ANNs. The dimensions of a descriptor are defined in two groups — Cartesian distance and 2D property — where the minimum, maximum, and resolution of the vector in the first dimension of the descriptor can be defined. The track bars adapt automatically to changes; for example, the resolution is calculated and minimum–maximum dependencies are corrected. When the binary checkbox is clicked, only selections are possible that result in a dyadic vector length (i.e., the dimension is a power of two, 2^n). This feature avoids the complicated adjustment of all settings needed to obtain a binary vector, which is necessary for transformations. The 2D property group allows similar settings for two-dimensional descriptors. These descriptors allow negative ranges (e.g., for partial atomic charges), which divide the complete range into a negative and a positive section. Additional postprocessing options like weighting, normalization, fast Fourier, or fast wavelet transforms with 2, 12, and
20 coefficients can be applied to the resulting descriptor. The software also allows frequently used sets of parameters to be saved and restored for later use.
5.14.3 Calculation and Investigation of a Single Descriptor
In the auto calculation mode the descriptors are automatically calculated when a file is loaded. On double-clicking a molecule within a data set, a descriptor page is created that initially consists of a single descriptor sheet. Every descriptor page can contain multiple sheets with different types of information (e.g., graphical, textual, tables) that are displayed and saved in different formats. Text sheets are tabulated descriptors containing header lines and descriptor information. They are designed for use with spreadsheet or statistical programs. Vector sheets contain the vectors in single lines separated by tabulators and are used as raw input files for other programs, such as neural networks. The info sheet contains information about how the descriptor has been calculated. Distance, bond, path, and topological matrices can be generated, and class vectors, like spectra, can be displayed. Additional hypertext sheets are displayed in an online browser format and are saved in Hypertext Markup Language (HTML) format. The chart control bar allows the descriptor to be changed on the fly in a similar fashion as with the code settings. The only difference is that the actual descriptor can be manipulated to visualize the changes directly. Each chart page contains its own descriptor and calculator; that is, the original descriptor settings in the data set are not affected by the manipulations. With this feature the behavior of the descriptor can be investigated in detail to find the appropriate code types and settings. The second area of the chart control bar is a button collection for several manipulations of the chart itself, such as auto adaptation of the descriptor size to the minimum and maximum distance of the original molecule. The atom pairs button enables a special mode that displays a list of atom pairs when the mouse is clicked on a certain descriptor point; the corresponding distance is calculated, and a list of atom pairs together with their original values from the distance matrix is displayed. The statistics button calculates statistical parameters for the actual descriptor and for a superimposed descriptor, if available. The peak areas button separates the individual peaks and calculates the peak areas. The transform buttons enable the display of an additional transformed descriptor. When a descriptor page is displayed, it is possible to superimpose the descriptor of any other molecule that is available in the file tree — independently of whether the molecule belongs to the same data set or not. In addition, several functions can be calculated and displayed together with the original descriptor. The superimposed descriptor is calculated with the current code settings of the original descriptor and can be compared by different methods, such as subtraction, average, or weighted average. Most of the manipulations that can be performed with the chart control bar are applied automatically to both descriptors. An exception is the change of the resolution, which affects only the original descriptor. With this feature, a direct comparison of different resolutions can be performed by selecting the same molecule for the overlay operation and then changing the resolution.
5.14.4 Calculation and Investigation of Multiple Descriptor Sets
A complete set of descriptors can be calculated by highlighting the node of the corresponding file and selecting data set in the context menu. If not already performed, the data set will be calculated and displayed as a table of all molecules and their basic statistical parameters, like variance, skewness, or kurtosis. Additionally, an ASD is calculated and displayed at the top of the table. The data set table can be saved in different file formats, including binary files for fast searches. Writing a binary set saves the entire data set in a native binary format. This allows for fast loading and searching in large molecule data sets, usually at a processing speed of more than 10,000 molecules per second. Descriptors are saved automatically with the molecules — that is, they do not have to be recalculated when the file is loaded again.
5.14.5 Binary Comparison
An entry in a data set table can be compared directly, via root mean square or correlation coefficient, with one of the default binary databases. A new window opens that contains two areas with three-dimensional molecule models; the left one displays the original molecule of the selected data set entry. The window shows the molecule and a match list containing the most similar entries of the database according to the selected similarity criterion (descriptor), with their sequential numbers, names, and calculated similarity measures. Another type of binary file is the binary database. This type is similar to the binary molecule set but is especially designed for fast searches of similar descriptors and the retrieval of the corresponding molecules. An example of the use of binary databases is the prediction of a descriptor from the neural network after a reverse training.
5.14.6 Correlation Matrices
A correlation matrix of all the descriptors in a data set can be calculated, including a regression chart displaying the regression lines of the entries selected in the correlation matrix and a descriptor chart that shows both descriptors in comparison to the ASD. The regression chart contains two axes, one for each descriptor. The graph consists of correlation points (the probability values of each vector component in relation to the second descriptor) as well as two regression lines, one for each descriptor understood as an independent variable (bxy and byx, respectively). By clicking either a correlation point or a descriptor at a certain position, a list of the corresponding atom pairs is displayed in the descriptor chart. The correlation matrix provides its own context menu to find the best and worst correlation coefficients without searching the complete matrix.
5.14.7 Training a Neural Network
Each data set provides its own neural network. The first time a neural network is created, the actual data set is used as training set as well as test set; a separate test set can be loaded after training. ARC provides a special feature to automatically divide
a loaded data set easily into a training and a test set. However, the entries in the test set can also be selected manually. Although classification is not necessary to train a network, it can be performed independently of the prediction. Classification is based on human-readable class files that can be created and edited from within the software or by using external editors. If single-component properties are available, a maximum of 32 classes can be defined either automatically or manually. Classification can be performed directly with a training set via the context menu. The auto classify command enables classes to be calculated according to exponential, decadic, logarithmic, decadic-logarithmic, or linearly distributed properties. The range of properties is divided into the number of classes defined in the window. This enables the user to find the optimum distribution with the optimum number of classes. Once classes or properties are available in the training set, the training parameters can be selected: the net dimension and number of epochs, the learn radius and learn rate, and the initialization parameters. Neurons can be arranged in a rectangular or quadratic network, as well as in a toroidal mode; that is, the left and right sides as well as the upper and lower sides of the topological map are connected to form a closed toroidal plane. By default, the input layer contains the descriptors, and the output layer contains the corresponding properties or property vectors. In this case only the input layer is responsible for selecting the central neuron, but the weights in both layers are adapted. By putting the network into reverse mode, the input and output layers are exchanged; in this case, the output layer decides on the central neuron. Thus, if a property is available in the training set, the training can be performed on the basis of this property rather than on the descriptor. A prediction will then provide a descriptor instead of a property. This feature allows, for instance, switching between spectrum simulation and structure prediction with a single mouse click. The minima and maxima of the learn radius and learn rate can be selected via track bars, and their decrease during learning is calculated automatically. With the auto adapt checkbox the radius is adapted automatically to the dimension of the network; the minimum is set to zero, and the maximum covers the complete network. In rectangular networks the maximum equals the maximum dimension in the X or Y direction, whereas in toroidal networks the half dimension is used. The auto adapt setting ensures that the complete network is covered when the first epoch is executed. Additionally, the distance function for the correction of the neurons in the environment can be set to a cylinder, cone, or Gaussian function. The initialization of the network is calculated with pseudo-random numbers with a user-definable standard deviation. If the network works in reinitialize mode, an initialization takes place each time training is started; otherwise, the previous neuron weights are retained, and the network is handled like a pretrained one. Once the parameters have been selected, the training can be started by clicking the train (or train reverse) button. The performance of the training can be visualized by checking the show performance option. In this case a chart appears that shows the decrease of the overall Euclidean distance after each epoch. Training may be interrupted after each epoch, showing the actual training results.
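The toroidal topology mentioned above is easy to make concrete: on a wrapped map the distance between two neurons is measured the shorter way around each edge, which is why the maximum useful learn radius is half the map dimension. The following is a minimal sketch of this distance — an illustration of the concept, not ARC's code.

```python
import numpy as np

def toroidal_distance(p, q, shape):
    """Grid distance between neurons p and q on a toroidal (wrapped) Kohonen map."""
    delta = np.abs(np.asarray(p) - np.asarray(q))
    wrapped = np.minimum(delta, np.asarray(shape) - delta)   # opposite edges are joined
    return float(np.sqrt((wrapped ** 2).sum()))

print(toroidal_distance((0, 0), (9, 9), (10, 10)))           # 1.414..., neighbors across the seam
```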
5.14.8 Investigation of a Trained Network
When the training is finished, a topological map layer is generated and colored in the order of the classes defined by the user. By clicking on a colored square in the topological map, its contents can be investigated in a separate window. This window contains a preview of the input (Kohonen) layer, the output layer (if a property vector was used), and the three-dimensional models of the corresponding molecules for which the descriptor has been calculated.
5.14.9 Prediction and Classification for a Test Set
A trained network can be used to classify or to predict properties or property vectors of a test set. This is done by selecting the test set window of the network form. As mentioned before, as long as no other test set has been loaded, the training and test set entries are identical; at that stage the only difference between both sets is which entries are checked. The context menu contains the prediction methods that are available for the particular training conditions. The set can either be self-tested (recall test), or the properties (or property vectors) can be predicted for individual objects in a data set or for the complete data set. In a recall test, or if a property vector was available in the input file but was not used for training, a chart displays the predicted property vector in comparison to the original vector. If a training has been performed in reverse mode, a descriptor command is available — instead of a property command — which opens a chart containing a comparison of two descriptors. In contrast to a property vector, a descriptor can be searched for directly in a binary descriptor database (e.g., to search for corresponding structures). The result window then contains a hit list and two three-dimensional molecule models: one displaying the original molecule of the test set entry (if available), and the other showing the molecule of the currently selected entry in the hit list of similar molecules.
5.15 Synopsis
RDF descriptors have been selected to exemplify how descriptors can be applied and what problems might arise during the selection of the appropriate descriptor for a task. Similar problems and solution approaches can be expected for any other type of molecular descriptor. RDF descriptors may be used in any combination to fit the required task. For instance, it is possible to calculate a multidimensional descriptor based on bond-path distances, restricted to nonhydrogen atoms, in the shape of a frequency pattern. Consequently, more than 1,400 different descriptors are available. A final summary of RDF descriptor types, their properties, and applications is given in Table 5.1. This section summarizes typical applications, some of which are described in detail in the next chapter.
Table 5.1 Summary of the Specification, Features, and Applications of RDF Descriptors

Radial Distribution Function (RDF)
Specification: g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-B(r - r_{ij})^2}; probability distribution of distances between points in a three-dimensional space.
Features/Remarks: Typically calculated for a continuous range of distance intervals.
Applications: Particle and electron density distribution.

RDF Descriptor
Specification: Probability distribution of distances between atoms of a three-dimensional model of a molecule.
Features/Remarks: Represents three-dimensional information in a one-dimensional mathematical vector.
Applications: 3D-molecule-based applications.

3D MoRSE Code
Specification: I(s) = \sum_{i}^{N-1} \sum_{j>i}^{N} \sin(s \cdot r_{ij}) / (s \cdot r_{ij}); represents the scattered electron intensity on a molecular beam.
Features/Remarks: Noninterpretable.
Applications: QSPR, IR spectrum simulation.

RDF Distance Pattern
Specification: g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(r - r_{ij}) \cdot e^{-B(r - r_{ij})^2}, with \delta(r - r_{ij}) = 1 if r = r_{ij} and 0 otherwise; relative frequency distribution restricted to real distances.
Features/Remarks: The Gaussian distribution is suppressed, and B only affects the frequency of the distance r_{ij}.
Applications: Recognition of structure patterns, similarity, and substructure search.

Frequency Pattern
Specification: g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(r - r_{ij}); absolute frequency of distances instead of the relative frequency.
Features/Remarks: Displays the absolute number of distances.
Applications: Recognition of structure patterns, substructure search.

Binary Pattern
Specification: g(r) = \delta(r - r_{ij}); binary vector that defines the absence (0) or presence (1) of a distance.
Features/Remarks: Displays the absence or presence of distances.
Applications: Recognition of structure patterns, substructure search.

Atom-Specific RDF
Specification: g(r) = \sum_{i}^{N-1} \sum_{j>i}^{N} \delta(t_{ij}) \cdot e^{-B(r - r_{ij})^2}, with \delta(t_{ij}) = 1 if t_{ij} = t (exclusive mode) or t_{ij} \neq t (ignore mode) and 0 otherwise; calculated excluding a certain atom type t or exclusively for a certain atom type.
Features/Remarks: Exclusion of atom-specific distance information; restriction to atom-specific skeletons.
Applications: Similarity searches, carbon skeleton searches, reduced molecular representation.

Molecular RDF Descriptor
Specification: Calculated over all unique pairs of atoms in a molecule.
Features/Remarks: Covers all local RDF descriptors.
Applications: Default application.

Local RDF Descriptor
Specification: Calculates all pairs of a predefined atom with every other atom (j = const., i ≠ j).
Features/Remarks: Covers the environment of one atom in a molecule.
Applications: Atom environment characterization, NMR spectroscopy, reactivity.

Proton RDF Descriptor
Specification: Special geometric descriptor for proton NMR spectroscopy based on a local RDF descriptor for a proton.
Features/Remarks: Characterization of influences of the chemical environment on a proton.
Applications: 1H-NMR spectroscopy.

Proton Double-Bond Descriptor
Specification: Incorporates double bonds up to the 7th sphere centered on the proton; includes the radian angle between the double-bond plane and the proton.
Features/Remarks: Describes the influence of the electronic current in double bonds.
Applications: 1H-NMR chemical shift prediction.

Proton Shielding Descriptor
Specification: Incorporates single bonds up to the 7th sphere centered on the proton; includes the radian angle between the single bond and the proton.
Features/Remarks: Describes shielding and unshielding by single bonds.
Applications: 1H-NMR chemical shift prediction.

Proton Axial Descriptor
Specification: Incorporates atoms three non-rotatable bonds away from the proton and belonging to a six-membered ring; includes the dihedral radian angle between the bond and the proton.
Features/Remarks: Accounts for axial and equatorial positions of protons bonded to cyclohexane-like rings.
Applications: 1H-NMR chemical shift prediction.

Cartesian RDF
Specification: Based on the Euclidean distance between atom pairs, r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}.
Features/Remarks: Includes three-dimensional information; conformation dependent.
Applications: Default applications for 3D molecule representation.

Bond-Path RDF
Specification: Based on the sum of the bond lengths between atom pairs along the shortest bond path.
Features/Remarks: Includes bond-distance information; largely conformation independent.
Applications: Conformation-sensitive applications.

Topological-Path RDF
Specification: Based on the number of bonds between atom pairs along the shortest path.
Features/Remarks: Includes topological bond information; conformation independent.
Applications: Conformation-sensitive applications independent of distances.

Property-Weighted RDF (amplified)
Specification: g(r, p) = \sum_{i}^{N-1} \sum_{j>i}^{N} p_i p_j \cdot e^{-B(r - r_{ij})^2}; transforms the frequency dimension into a (product) property-weighted frequency dimension.
Features/Remarks: Amplifies the effects of strong property differences between atom pairs.
Applications: QSPR, IR spectrum simulation.

Property-Weighted RDF (attenuated)
Specification: g(r, p) = \sum_{i}^{N-1} \sum_{j>i}^{N} \frac{p_i + p_j}{2} \cdot e^{-B(r - r_{ij})^2}; transforms the frequency dimension into a (mean) property-weighted frequency dimension.
Features/Remarks: Attenuates the effects of strong property differences between atom pairs.
Applications: QSPR, IR spectrum simulation.

Two-Dimensional RDF
Specification: g(r, p') = \sum_{i}^{N-1} \sum_{j>i}^{N} e^{-[B(r - r_{ij})^2 + D(p' - p'_{ij})^2]}; calculated depending on the distance r and an additional property p' in the second dimension of the descriptor.
Features/Remarks: Preserves additional information about an atomic property p'.
Applications: QSAR, prediction of complex properties.

Property-Weighted Two-Dimensional RDF
Specification: g(r, p, p') = \sum_{i}^{N-1} \sum_{j>i}^{N} p_i p_j \cdot e^{-[B(r - r_{ij})^2 + D(p' - p'_{ij})^2]}; includes atom properties as weight factors in the first dimension.
Features/Remarks: Includes an additional property-weighted frequency distribution.
Applications: QSAR, prediction of complex properties.

Wavelet Transforms (WLT)
Specification: g_0(r) = \sum_{k=0}^{K-1} c_k^{(J)} 2^{J} \phi_{J,k}(r) + \sum_{j=1}^{J} \sum_{k=0}^{K-1} d_k^{(j)} 2^{j} \psi_{j,k}(r).

Coarse-filtered WLT at resolution level j = 1 (one-level decomposition)
Specification: Coarse-filtered (low-pass) wavelet coefficients C(1) of the first resolution level.
Features/Remarks: Smoothed representation with half the size of the original descriptor.
Applications: Compressed descriptor for similarity searches in binary databases.

Detail-filtered WLT at resolution level j = 1 (one-level decomposition)
Specification: Detail-filtered (high-pass) wavelet coefficients D(1) of the first resolution level.
Features/Remarks: Reveals special aspects of data, like trends, breakdown points, and discontinuities in higher derivatives.
Applications: Classification and prediction tasks.

Coarse- and detail-filtered WLT at the highest resolution level J (complete decomposition)
Specification: Combination of the coarse wavelet coefficients C(J) of the last resolution level and all the detail wavelet coefficients D(J), D(J-1), ..., D(1).
Features/Remarks: Alternative representation of the untransformed RDF descriptor with the same experimental results, except in statistical calculations.
Applications: Similar to untransformed RDF.

Low- and detail-filtered WLT at an arbitrary resolution level j
Specification: User-defined combination of coarse and detail coefficients.
Features/Remarks: Allows the refinement to be chosen individually.
Applications: Classification and prediction tasks.
5.15.1 Similarity and Diversity of Molecules
The terms similarity and diversity can have quite different meanings in chemical investigations. Describing the diversity of a data collection with a generally valid measure is almost impossible. Descriptor flexibility allows the characterization of similarity by means of statistics for different tasks. The statistical evaluation of descriptors shows that correlation coefficients should be interpreted together with the symmetry of the distribution. In contrast to correlation coefficients, skewness and kurtosis are sensitive indicators of constitutional and conformational changes in a molecule. This feature allows a more precise evaluation of the structural similarity or diversity of molecular data sets.
5.15.2 Structure and Substructure Search
Descriptors based on pattern functions are helpful tools for a quick recognition of substructures. A pattern-search algorithm based on binary pattern descriptors can then be used for substructure search. However, patterns and other characteristics of descriptors that seem to indicate unique features should be investigated carefully. With these descriptors, 3D similarity searches for complete structures or substructures in large databases are possible and computationally very efficient. In addition, descriptors can serve as the basis for a measure of the diversity of compounds in large data sets, a topic that is of high interest in combinatorial chemistry.
5.15.3 Structure–Property Relationships
Molecular descriptors enable the correlation of 3D structures with some of their physicochemical properties. An example is the mean molecular polarizability — a property related to the distance information in a molecule, as is the stabilization of a charge due to polarizability. Molecular descriptors based on three-dimensional distances can correlate well with this property. In particular, transformed molecular descriptors enable predictions with reasonable error and are well suited for automatic interpretation systems.
5.15.4 Structure–Activity Relationships
One-dimensional descriptors are often not able to describe a property as complex as biological activity in a data set of high similarity. Two-dimensional descriptors, in particular those including dynamic atom properties such as partial charge-weighted functions, are able to characterize biological activity. Although the correlation of structures and biological activities with the aid of molecular descriptors leads to reliable predictions, the matter of structure–activity relationships is a complex problem depending on many atomic and molecular properties and experimental conditions. In practice, several descriptors are used in combination.
5.15.5 Structure–Spectrum Relationships
Structure–spectrum relationships, in particular the prediction of 3D molecules from spectral information and the prediction of spectral properties, such as 1H-NMR
chemical shifts, can be handled quite well with 3D molecular descriptors. Investigations have shown that infrared spectra and the RDF descriptors of the corresponding molecules can be correlated well by ANNs. Therefore, they can be applied to the simulation of infrared spectra from structure information as well as to the prediction of reliable 3D molecules from their infrared spectra. For the prediction of 3D structures from infrared spectra, two approaches were introduced. A database approach relies on a database of descriptors that can be compiled for any existing or hypothetical compound. The major advantage of this technique is that the structure space can be increased while the infrared spectrum space remains restricted. The second approach, the modeling approach, can be applied if adequate structures and their molecular descriptors are not available. In this case, the predicted structure is modeled from the most similar structure found in a database. Another approach has been developed for the prediction of chemical shifts in 1H-NMR spectroscopy. In this case, special proton descriptors were applied to characterize the chemical environment of protons. It can be shown that 3D proton descriptors in combination with geometric descriptors can successfully be used for the fast and accurate prediction of 1H-NMR chemical shifts of organic compounds. The results indicate that a neural network can make predictions of at least the same quality as those of commercial packages, especially for rigid structures where 3D effects are strong. The performance of the method is remarkable considering the relatively small data set that is required for training. A particularly useful feature of the neural network approach is that the system can easily be dynamically trained for specific types of compounds.
5.16 Concise Summary
3D Molecular Representation of Structures Based on Electron Diffraction (MoRSE) Function is a molecular descriptor representing the scattered electron intensity from a molecular beam.
Amplified RDF is a descriptor resulting from using a property product as a factor for the exponential term in an RDF descriptor, leading to an amplification of peaks that originate from atom pairs with strongly different atom properties.
Attenuated RDF is a descriptor resulting from using a property average as a factor for the exponential term in an RDF descriptor, leading to an attenuation of peaks that originate from atom pairs with strongly different atom properties.
Bond Path is the sum of the bond lengths along the shortest bond path between two atoms.
Cartesian Coordinates are defined by three values — x, y, z — in three-dimensional Cartesian space.
Cartesian Descriptors use the distances calculated from the Cartesian coordinates in a three-dimensional structure.
Dynamic Atomic Properties depend on the chemical environment of the atom and are characteristic for the molecule. Examples are partial atomic charge, atom polarizability, and partial electronegativity.
Euclidean L2-Norm is a mathematical normalization method that normalizes a vector to a total peak area of 1.
Fast Fourier Transform (FFT) is a fast algorithm to compute the discrete Fourier transform (DFT) and its inverse.
Fast Wavelet Transform (FWT) is a fast mathematical algorithm to decompose a function into a sequence of coefficients based on an orthogonal basis of small finite waves, or wavelets.
Local Descriptors (atomic descriptors) run over all pairs of a predefined atom with every other atom.
Molecular Descriptors are calculated over all pairs of atoms and cover the entire molecule.
Normalization is a mathematical method that divides multiple sets of data by a common variable to compensate for the variable's effect on the data set and to make multiple data sets comparable.
One-Dimensional Descriptors are calculated on a single property of a molecule. The result is numerically a vector and graphically a one-dimensional function, visualized in two dimensions.
Pattern Repetition is an effect in molecular RDF descriptors that leads to a unique pattern of distances originating from unique bond lengths between heteroatoms and carbon atoms, like C-S bridges.
Property-Weighted Descriptors are molecular descriptors that use a specific atom property to weight the individual function values.
Property Smoothing Parameter (D) is an exponential factor that defines the width of the Gaussian distribution around a peak in the property space of a multidimensional RDF descriptor. It can be interpreted as a measure of deviation describing the uncertainty of atom properties within a molecule.
Radial Distribution Function (RDF) is a mathematical expression describing the probability distribution of distances between points in a three-dimensional space.
RDF Binary Pattern is a simplified form of RDF frequency patterns that results in a binary vector defining the absence or presence of distances in a molecule.
RDF Descriptor is a molecular descriptor that uses the 3D coordinates of the atoms in a molecule to describe the probability distribution of distances in a three-dimensional molecule.
RDF Distance Pattern represents a reduced RDF descriptor that considers only the actual distances in a molecule and suppresses the Gaussian distribution.
RDF Frequency Pattern is a simplified form of an RDF distance pattern calculating the absolute frequencies of distances instead of the relative frequency.
RDF Pattern Function is an RDF descriptor in which calculations are restricted to actual distances in a molecule.
Smoothing Parameter (B) is an exponential factor that defines the width of the Gaussian distribution around a peak in the distance space of an RDF descriptor. It can be interpreted as a temperature factor that describes the movement of atoms within a molecule.
Static Atomic Properties are characteristic for an atom type but are independent of the individual molecule. Examples are atomic number, atomic volume, or ionization potential.
Symmetry Effect is an effect in molecular RDF descriptors that leads to increasing peak heights with increasing distance, reflecting a high symmetry in the system.
Topological Path describes the number of bonds along the shortest bond path between two atoms.
Two-Dimensional Descriptors are calculated on two different properties of a molecule, each of which is represented in a single mathematical dimension. The result is numerically a matrix and graphically a two-dimensional function, visualized in three dimensions.
Wavelet-Transformed Descriptor is a transformation of a descriptor into the frequency space to enhance or suppress characteristic features of a molecule. Wavelet transforms of RDF descriptors can be used to enhance or suppress features that are typical for a given task as well as to represent 3D arrangements of atoms in a molecule in a compressed form.
Weighting is a mathematical method that assigns factors to individual values or a series of values in a data set.
6 Expert Systems in Fundamental Chemistry
6.1 Introduction With the knowledge about the principles of expert systems and chemical information processing from the previous chapters, we are now ready to look at the program packages that combine these technologies to form expert systems. A series of expert systems has been developed and successfully applied in the area of chemistry in the last four decades, since the first expert system was published. This chapter gives an overview on noncommercial and commercial systems as well as some detail on the underlying mechanisms for selected examples. The expert systems were selected according to specific application areas, starting with the early Dendritic Algorithm (DENDRAL) project and covering spectroscopic applications, analytical chemistry, property prediction, reaction prediction, and synthesis planning.
6.2 How It Began — The DENDRAL Project
One of the first approaches for expert systems in general was developed by Stanford University at the request of the National Aeronautics and Space Administration (NASA) in 1965 [1]. At that time, NASA was planning to send an unmanned spacecraft to Mars that included a mass spectrometer for the chemical analysis of Martian soil. NASA requested software that would be able to automatically interpret the mass spectra to derive molecular structures. To solve this task, the Stanford team had to encode the expertise of a mass spectroscopist. This was the starting point for the DENDRAL project, initiated in 1965 by Edward Feigenbaum, Nobel Prize winner Joshua Lederberg, and Bruce Buchanan, in cooperation with the chemist Carl Djerassi. The basic approach was to design software that mimics the concept of human reasoning and allows the formalization of scientific knowledge.
DENDRAL consists basically of two subprograms: Heuristic DENDRAL and Meta-DENDRAL. Heuristic DENDRAL uses the mass spectrum and other available experimental data to produce a set of potential structure candidates. Meta-DENDRAL is a knowledge acquisition system that receives the structure candidates and corresponding mass spectra to propose hypotheses for explaining the correlation between them. These hypotheses are validated in Heuristic DENDRAL.
DENDRAL uses a set of rule-based methods to deduce the molecular structure of organic chemical compounds from chemical analysis and mass-spectrometry data. This process includes detecting structural fragments, generating structural formulas from these fragments, and verifying them by determining the correlation between
them and experimental spectra using additional spectral, chemical, or physicochemical information. DENDRAL is written in the list-processing language LISP and consists of a series of programs that are described in the following sections.
6.2.1 The Generator — CONGEN
The CONstrained GENeration program (CONGEN) is a generic approach that is able to provide a set of graphs, or structures, that incorporate a given set of nodes, or atoms, of a given type, or element [2]. In chemical language, CONGEN provides all structures with a given substructure including a variety of constraints; that is, it generates all isomers for a given structure. Constraints can be substructures, ring size, number of hydrogen atoms, and number of isoprene substructures. The first step is to define superatoms, which constitute substructures that are handled as individual nodes in the graph. A CONGEN user can work interactively — for instance, to retrieve the existing solutions at any stage or to define further constraints if the number of results is too high.
A successor of CONGEN is Generation with Overlapping Atoms (GENOA), which introduced several improvements to functionality and interactivity [3]. GENOA allows for the overlap of substructures as well as for defining limitations for a series of possible substructures. It reduces the number of constraints and gives the ability to retrieve intermediate results. CONGEN and GENOA can handle any structure and can enumerate the isomers of a molecular formula, and they are able to generate structures with more restrictive constraints (e.g., isomers with specific molecular fragments). Both GENOA and CONGEN use heuristic rather than systematic algorithms.
6.2.2 The Constructor — PLANNER Whereas the generation phase is interactive concerning the constraint definition, the construction, or planning, module uses DENDRAL algorithms to create constraints automatically. The first version was designed to interpret the mass spectrum and to derive substructures using the static rules in the knowledge base. In subsequent versions, the rules were created by a rule generator and were refined and separated for several compound types. The start information for PLANNER is the skeleton structure for the particular compound class, the definition of expected bond breakages in a structure, and the peak information of the experimental mass spectrum. The mass spectrum can be interpreted by the module MOLION, which attempts to detect the parent ion and to derive the sum formula. MOLION uses the given skeleton structure of the compound class and creates constraints for potential or unlikely positions of side chains at the main structure. The rules in PLANNER use a set of peaks from the mass spectrum to ascertain either a substructure or a fragmentation pattern based on the skeleton structure of the compound class. The potential positions of side chains are determined by predicting the resulting mass and charge numbers from a proposed fragmentation pattern and comparing these with the ones from the original mass spectrum.
6.2.3 The Testing — PREDICTOR
Testing of hypotheses is necessary if the outcome of the construction phase is not unique. The program PREDICTOR consists of a series of production rules defining the theoretical behavior of a compound in the mass spectrometer. These rules are applied to each candidate structure, producing a hypothetical mass spectrum, which is compared to the experimental mass spectrum. The rules used for the prediction phase are similar to those in the construction phase but are more generic. They use a proposed structure to derive potential fragmentation patterns. This is accomplished by creating the parent ion from the structure candidate and storing it in an ion list. The production rules are applied and lead to a series of ionic structure fragments that are appended to the list. In the next step, fragmentation is predicted for each ion, and a hypothetical mass spectrum is created and added to a spectrum list. The spectrum list contains the ion, the mass and charge value, and its relative abundance. The relative abundance is required since the same ion may occur multiple times as a result of fragmentation of different substructures. If all ions in the ion list are computed, the spectrum list contains all data for the hypothetical mass spectrum of the candidate structure.
Structure representation in this system is similar to a connection table. Each structure receives a unique compound name and a list of atoms. The atom list contains the unique atom names, the elements, a sequential atom number, a list of nonhydrogen bond partners represented by their atom numbers, the number of multiple bonds attached to the atom, and the number of hydrogen atoms. The representation of phenol is shown in Figure 6.1a. The first atom line (C1 C 1 (2 3 7) 1 0) is interpreted in the following way: The atom with the unique name C1 (1) is a carbon atom with the sequential number 1; (2) has three neighbors, which are identified in the atom list by the numbers 2, 3, and 7; (3) has one multiple bond; and (4) has zero hydrogen atoms. Figure 6.1b shows an example of the phenyl substructure, where the first atom line (C1 C 1 (2 x 6) 1 -) is read as follows: The atom with the unique name C1 (1) is a carbon atom with the sequential number 1; (2) has three neighbors, two of which are identified in the atom list by the numbers 2 and 6, and one is a bond to an atom outside the fragment; (3) has one multiple bond; and (4) the number of hydrogen atoms is not defined (can be ignored).
A production rule consists of a condition (an expression that has to be evaluated) and an action that is triggered if the condition evaluates to true. Examples for conditions are ISIT(X), which yields true if X is a substructure of the fragment under investigation; WHERE(X), which retrieves a list of the numbers of those atoms in the fragment that map to substructure X; and DB(A,B),
(Figure 6.1a: phenol, atoms numbered 1-7, with the hydroxyl oxygen as atom 2 bonded to C1; the corresponding atom list reads:)

(PHENOL (C1 C 1 (2 3 7) 1 0)
        (O2 O 2 (1) 0 1)
        (C3 C 3 (1 4) 1 1)
        (C4 C 4 (3 5) 1 1)
        (C5 C 5 (4 6) 1 1)
        (C6 C 6 (5 7) 1 1)
        (C7 C 7 (1 6) 1 1) )

(Figure 6.1b: a phenyl residue with ring atoms numbered 1-6 and one bond leaving the fragment at C1; the corresponding atom list reads:)

(PHENYL (C1 C 1 (2 x 6) 1 -)
        (C2 C 2 (1 3) 1 1)
        (C3 C 3 (2 4) 1 1)
        (C4 C 4 (3 5) 1 1)
        (C5 C 5 (4 6) 1 1)
        (C6 C 6 (1 5) 1 1) )
Figure 6.1 (a) The matrix representation of molecules in the PREDICTOR system of the DENDRAL suite is similar to a connection table. Each structure receives an atom list. The first line contains an identifier (C1), the element symbol (C), and a sequential number (1). Following is a list of nonhydrogen bond partners represented by their atom numbers (2, 3, 7), the number of multiple bonds (1) attached to the atom, and the number of hydrogen atoms (0). (b) The matrix representation of residues in the PREDICTOR system of the DENDRAL suite. The example shows a phenyl residue including in the first line an identifier, the element symbol, and a sequential number, similar to the example in Figure 6.1a. The list of nonhydrogen bond partners represented by their atom numbers (2, x, 6) indicates two known partners and one unknown; the number of hydrogen atoms is indicated as unknown (–).
which yields true if A and B are connected via a double bond. Any other LISP function can be used as a condition. The action part of the rule contains one or more functions, including function parameters, that are triggered if the condition becomes true. In the general form, the action part consists of a label, a bond-cleavage function for fragmentation, an intensity defining the frequency of the bond breakage, and an optional transfer function that indicates the transfer of hydrogen or radicals. An example is as follows:

((WHERE ESTROGEN)
 (B (BREAKBND ((14. 15) (13. 17))) 100 (HTRANS -1 0))
 (D (BREAKBND ((9. 11) (14. 13) (16. 17))) 100 (HTRANS 2 -1))
 (C (BREAKBND ((9. 11) (14. 13) (15. 16))) 100 (HTRANS 1 -0))
 (E (BREAKBND ((11. 12) (8. 14))) 100 (HTRANS -1 0))
 (F (BREAKBND ((9. 11) (8. 14))) 100 (HTRANS -1 0))
((WHERE PHENOL)
 (B (BREAKBND ((1. 3) (1. 7))) 100 (HTRANS 1 3))

Any LISP function or any other production rule can be triggered.
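The atom-list representation described above (e.g., the line (C1 C 1 (2 3 7) 1 0) for phenol) can be illustrated with a short Python sketch. The parser and the AtomEntry class below are hypothetical and merely mirror the format as it is explained in the text; they are not original DENDRAL code.

import re
from dataclasses import dataclass

@dataclass
class AtomEntry:
    name: str             # unique atom name, e.g. "C1"
    element: str          # element symbol
    number: int           # sequential atom number
    neighbors: list       # non-hydrogen bond partners ("x" = bond leaving the fragment)
    multiple_bonds: int   # number of multiple bonds at this atom
    hydrogens: object     # attached hydrogens as int, or None if undefined ("-")

def parse_atom_line(line):
    # Parse one PREDICTOR-style atom line such as "(C1 C 1 (2 3 7) 1 0)".
    m = re.match(r"\((\S+)\s+(\S+)\s+(\d+)\s+\(([^)]*)\)\s+(\d+)\s+(\S+)\)", line)
    name, element, number, nbrs, mult, hyd = m.groups()
    neighbors = [n if n == "x" else int(n) for n in nbrs.split()]
    hydrogens = None if hyd in ("-", "–") else int(hyd)
    return AtomEntry(name, element, int(number), neighbors, int(mult), hydrogens)

print(parse_atom_line("(C1 C 1 (2 3 7) 1 0)"))   # phenol, atom C1
print(parse_atom_line("(C1 C 1 (2 x 6) 1 -)"))   # phenyl residue, atom C1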
6.2.4 Other DENDRAL Programs
MSRANK is a program that compares the predicted mass spectra and ranks them according to their fitness to the experimental spectrum. The fitness function can be selected by the user. MSPRUNE is an extension that works with a list of candidate structures from CONGEN and the mass spectrum of the query molecule to predict typical fragmentations for each candidate structure. Predictions that deviate greatly from the observed spectrum are pruned from the list. MSRANK uses improved rules to rank the remaining structures according to the number of predicted peaks found in the experimental data, weighted by measures of importance for the processes producing those peaks. Further developments in the DENDRAL project are REACT, a program that predicts potential reactions of a candidate with another structure, and MSPRUNE, which uses additional constraints to predict the most reliable fragmentation. DENDRAL proved to be fundamentally important in demonstrating how rule-based reasoning can be developed into powerful knowledge engineering tools [4].
6.3 A Forerunner in Medical Diagnostics
The most common understanding of the term diagnosis is the recognition of a disease by interpreting a series of symptoms in combination with experimental results from physiological analysis. The diagnostic knowledge required to solve a stated diagnosis can be easily captured in an expert system. Most diagnostic problems have a finite list of solutions, which can be reached with a limited amount of information. Even complex diagnostic problems can be easily represented by a decision tree. Another aspect of diagnosis addresses the biochemical aspects of a disease — that is, the causes that lead to an observable disease, behavior, or physiological condition. Diagnostic expert systems have the typical design of an interrogation, where the expert enters symptoms, observations, and experimental physiological data, and the expert system narrows down the potential solutions until a disease can finally be proposed. Supporting technologies can help to interpret results from experimental data. An example is the determination of proteins in human serum.
Serum is obtained by separating the cellular components of coagulated blood via centrifugation and usually contains all soluble components, except for the coagulation factors. Human serum consists of 6.5 to 8% proteins, which can be separated via gel electrophoresis into five protein fractions. About 60% of the protein fraction consists of albumin, which is a biological indicator for toxins. Serum albumins are produced in the liver and can already form adducts with toxins before they pass into the blood. For instance, aflatoxins — a group of mycotoxins produced by several microscopic fungi during the degradation of food — bind to albumins in the liver, and the resulting adduct is detectable in the blood serum. Consequently, albumins represent at least a part of the toxin metabolism in the liver. The human serum albumin consists of 584 amino acids and has a molecular mass of 66,000; it occurs in a healthy body at levels
(Figure 6.2: the original figure shows an electropherogram and densitometric traces with the protein fractions labeled albumin (Ab), α1, α2, β, and γ; the bands are annotated with macroglobulin, haptoglobin, β-lipoprotein, transferrin, complement C3, and the immunoglobulins IgA, IgM, and IgG, for panels (a) and (b).)
Figure 6.2 (a) Electropherogram of human serum proteins (left) with areas of protein fractions and densitometric diagram (right). (b) Densitometric diagram of an electropherogram showing deviations from the typical values in (a). Diagram (b) clearly shows an increase in the γ-globulin fraction, together with an increase of the α- and β-globulin fractions and a relative decrease of albumin. The results support the diagnosis of inflammation — potentially cirrhosis of the liver.
between 35 and 60 g/L of blood. Another group of important serum proteins are the globulins, which consist of lipids and proteins; they are responsible for lipid transportation in blood and also include the antibodies produced in the lymphocytes. Due to their differences in charge, serum proteins can be separated via electrophoresis, and the stainable lipid fraction can be quantified using photometric methods. An electropherogram (Figure 6.2) provides information about the relative abundance of serum proteins and is particularly useful for differential diagnosis — that is, to decide between two diseases exhibiting similar symptoms or physiological data. As with all medical diagnosis tasks, the results from serum protein investigations provide evidence for a suspected disease rather than being unique indicators for a disease. However, in combination with other diagnostic findings, these results can validate a diagnosis.
An example of an electropherogram of human serum proteins for a healthy patient is shown in Figure 6.2a; Figure 6.2b shows the electropherogram of a patient exhibiting all signs of an inflammation on physical examination. The increase of γ-globulins as antibodies with a concomitant relative decrease in albumin (narrow peak) indicates inflammation. A comparison with a previously examined patient having a clear indication for cirrhosis of the liver suggests the same finding. In addition to the diagnostic questions and the appropriate inference, a diagnostic expert system can provide methods for the automatic interpretation of electropherograms or other experimental results. Similarity between two one-dimensional vectors can be easily obtained by statistical methods or neural networks.
One of the first expert systems for medical diagnosis — the program MYCIN — was developed at Stanford in the 1970s [5]. MYCIN assisted a physician with diagnosis and was able to recommend appropriate treatment for certain blood infections. The original intent with MYCIN was to investigate how humans make decisions based on incomplete information, which is particularly important in medical emergency cases, where a fast decision has to be made even by nonspecialized or less experienced physicians. A diagnosis of infections usually involves growing cultures of the infecting organism, which is not an option if a fast decision has to be made. Although tests at Stanford Medical School showed that MYCIN was able to outperform physicians, it was actually never used in practice because of the ethical and legal implications of using expert systems in medicine.
MYCIN consists of a knowledge base, a dynamic patient database, a consultation program, an explanation program, and a knowledge acquisition program for knowledge engineering. The decision process consists of four parts: MYCIN (1) decides if the patient has a significant infection; (2) attempts to determine the bacteria involved; (3) selects one or more drugs for appropriate treatment; and (4) presents the optimal application of drugs based on additional patient data. MYCIN is written in LISP and represents its knowledge as a set of initially 450 if–then rules, and it presents certainty factors as output. A MYCIN rule has a rule number, one or more conditions, and a conclusion with a certainty factor. The basic construction is

(defrule <number>
  if (<parameter> <context> <operation> <value>)
  if ...
  then <certainty> (<parameter> <context> <operation> <value>))

Each if statement includes a parameter considered in a certain context that is tested for a value based on the operation. The then statement has the same syntax but provides a certainty factor. This construction basically makes no difference between the structure of the question and the answer and allows for the application of other rules as a result of a rule — although at the cost of losing modularity and clarity of the rule base. An example for a MYCIN rule is as follows:

(defrule 71
  if (gram organism is pos)
     (morphology organism is coccus)
     (growth-conformation organism is clumps)
  then 0.7 (identity organism is staphylococcus))

This rule is interpreted as follows: If the stain of the organism is gram-positive, and the morphology of the organism indicates the type coccus, and the growth conformation of the organism is clumps, then 0.7 is suggestive evidence that the identity of the organism is staphylococcus.
The 0.7 is the certainty that the conclusion will be true given the evidence. MYCIN uses certainty factors to rank the rules or outcomes; it will abandon a search once the certainty factor is less than 0.2. If the evidence is uncertain, the certainties of the individual pieces of evidence are combined with the certainty of the rule to give the certainty of the conclusion. Besides rules, MYCIN asks several general questions, like the name or weight of a patient, and tries to find out whether or not the patient has a serious infection. Once these questions have been answered, the system focuses on particular blood disorders and validates each statement in backward-chaining mode. A part of a typical dialogue with MYCIN based on the rules would look as follows:

Enter the identity of ORGANISM-1
> unknown
Is ORGANISM-1 a rod or coccus?
> coccus
Gram stain of ORGANISM-1?
> grampos
Did ORGANISM-1 grow in clumps?
> yes

These rules are used to reason backward; that is, MYCIN starts with a hypothesis that needs to be validated and then works backward, searching the rules in the rule base that match the hypothesis. The hypothesis can either be verified with a certainty factor or be proven wrong. After a series of further questions, an intermediate outcome might be the following:

My therapy recommendation will be based on the following probable infections and potential causative organism(s):
> the identity of ORGANISM-1 may be STAPHYLOCOCCUS
> ...

If the statement is verified, MYCIN attempts to check whether bacteria are involved based on previously stored facts. For instance, if an infection with Staphylococcus is identified, MYCIN would present a series of antibiotics. Finally, a therapy with appropriate drugs is suggested based on the previous outcome and additional questions:
On a scale of 0 to 4, where higher numbers indicate increasing severity, how would you rate patient's degree of sickness?
> 3
Does patient have a clinically significant allergic reaction to any antimicrobial agent?
> no
Patient's weight in kg?
> 70

A final recommendation could be:

[REC 1] My primary therapy recommendation is as follows:
Give GENTAMICIN; Dose: 119 mg (1.7 mg/kg) q8h IV for 10 days

At this point the user can ask for alternative treatments if the recommendation is not acceptable. At any stage the system is able to explain why it asks a particular question or how a conclusion was reached. A series of improvements have been developed on the basis of MYCIN or by using it as a precursor for new developments. One of the drawbacks of this system is that the rules for domain knowledge and problem-solving strategies were mixed and hard to manage. A later development called NEOMYCIN addressed these issues by providing an explicit disease taxonomy in a frame-based system. NEOMYCIN was able to distinguish between general classes of diseases and specific ones, gathering information to differentiate between two disease subclasses. An expert system shell developed in the MYCIN project is EMYCIN, which was used to develop other expert systems. One of these systems is PUFF, designed for the interpretation of pulmonary function data. Another outcome was the ventilator manager (VM) program, developed as a collaborative research project between Stanford University and Pacific Medical Center in San Francisco within the scope of a Ph.D. thesis by Lawrence M. Fagan [6]. VM was designed to interpret on-line quantitative data in the intensive care unit. The system measures the patient's heart rate, blood pressure, and the status of operation of a mechanical ventilator that assists the patient's breathing. Based on this information, the system controls the ventilator and makes necessary adjustments.
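The way certainty factors are handled in MYCIN-style systems can be sketched as follows. This Python example follows the commonly published MYCIN conventions (the rule certainty is scaled by the weakest piece of evidence, two results supporting the same conclusion are combined as cf1 + cf2(1 - cf1), and search is abandoned below 0.2); it is an illustration of the idea, not code from MYCIN itself, and the numbers are hypothetical.

def rule_cf(evidence_cfs, rule_certainty):
    # CF contributed by one rule: the weakest piece of evidence, scaled by the
    # rule's own certainty; applied only if the evidence passes the 0.2 threshold.
    weakest = min(evidence_cfs)
    return rule_certainty * weakest if weakest > 0.2 else 0.0

def combine(cf1, cf2):
    # Combine two positive CFs supporting the same conclusion (classic MYCIN form).
    return cf1 + cf2 * (1.0 - cf1)

# Rule 71 from the text: three certain observations, rule certainty 0.7
cf_rule71 = rule_cf([1.0, 1.0, 1.0], 0.7)   # 0.7
# A hypothetical second rule also suggesting staphylococcus with CF 0.4
print(combine(cf_rule71, 0.4))              # 0.82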
6.4 Early Approaches in Spectroscopy A typical application area of expert systems and their supporting technologies is spectroscopy. Since spectra require interpretation, they are ideally suited for automated analysis with or without the aid of a spectroscopist. Particularly vibrational spectra, like infrared spectra, are subject to interpretation with rules and experience. A series of monographs and correlation tables exist for the interpretation of vibrational spectra [7–10]. The relationship between frequency characteristics and structural features is rather complicated, and the number of known correlations between
vibrational spectra and structures is very large. In many cases, it is almost impossible to analyze a molecular structure without the aid of computational techniques. Existing approaches mainly rely on the interpretation of vibrational spectra by mathematical models, rule sets, decision trees, or fuzzy logic approaches. The following section introduces them briefly.
6.4.1 Early Approaches in Vibrational Spectroscopy
Many expert systems designed to assist the chemist in structural problem solving were based on the approach of characteristic frequencies. Gribov and Orville-Thomas made a comprehensive consideration of conditions for the appearance of characteristic bands in an infrared spectrum [11]. Gribov and Elyashberg suggested different mathematical techniques in which rules and decisions are expressed in an explicit form, and Elyashberg pointed out that in the discrete modeling of the structure–spectrum system symbolic logic is a valuable tool in studying complicated objects of a discrete nature [12–14]. Zupan showed that the relationship between the molecular structure and the corresponding infrared spectrum could be represented conditionally by a finite discrete model [15]. He expressed these relationships as if–then rules in the knowledge base of an expert system. Systems using these logical rules can be found in several reviews and publications [16–21]. In 1981 Woodruff and colleagues introduced the expert system PAIRS, a program that is able to analyze infrared spectra in the same manner as a spectroscopist would do [22]. Chalmers and colleagues used an approach for automated interpretation of Fourier transform Raman spectra of complex polymers [23]. Andreev and Agirov developed the expert system for the interpretation of infrared spectra (EXPIRS) [24]. EXPIRS provides a hierarchical organization of the characteristic groups, recognized by peak detection in discrete frames. Plamen et al. introduced a computer system that performs searches in spectral libraries and systematic analysis of mixture spectra [25]. It is able to classify infrared spectra with the aid of linear discriminant analysis, artificial neural networks (ANNs), and the method of k-nearest neighbors.
The elucidation of structures with rule-based systems requires a technique to assemble a complete structure from predicted substructure fragments. Several techniques and computer programs have been proposed under the generic name computer-assisted structure elucidation (CASE). Lindsay et al. introduced the first program that was able to enumerate all acyclic structures from a molecular formula [1]. This program was the precursor for some of the first expert systems for structure elucidation: CONGEN and GENOA. These programs can handle any structure and enumerate the isomers of a molecular formula and are able to generate structures with more restrictive constraints, such as isomers with specified molecular fragments. However, both GENOA and CONGEN use more heuristic than systematic algorithms. Several CASE programs based on a more systematic structure generation technique are the structure generators CHEMICS, ASSEMBLE, and COMBINE [26–28]. Dubois and colleagues studied the problem of overlapping fragments. They developed the program DARC-EPIOS, which can retrieve structural formulas from overlapping 13C-NMR data [29]. The COMBINE software uses similar techniques,
whereas GENOA uses a more general technique based on the determination of all possible combinations of nonoverlapping molecular fragments. All the CASE programs previously described generate chemical structures by assembling atoms or molecular fragments. Another strategy is structure reduction, which is structure generation by removing bonds from a hyperstructure that initially contains all the possible bonds between all the required atoms and molecular fragments. Programs based on the concept of structure reduction are COCOA and GEN [30,31].
6.4.2 Artificial Neural Networks for Spectrum Interpretation The interpretation of an infrared spectrum based on strict comparison can be rather complex and ambiguous. The appearance of characteristic vibrations for a group in different molecules depends on geometric and force parameters (i.e., the structural environment of the group). Although group frequencies often occur within reasonably narrow limits, changes in the chemical environment and physicochemical effects may cause a shift of the characteristic bands due to the mixing of vibrational modes. Additionally, different functional groups may absorb in the same region and can only be distinguished from each other by other characteristic infrared bands that occur in nonoverlapping regions. The problem of the interpretation of vibrational spectra is to calculate all possible combinations of substructures that may be present in a molecule consistent with the characteristic frequencies of a given infrared spectrum. Elyashberg and his team showed with an example that the infrared spectrum-structure correlation, as simply expressed by the characteristic frequency approach, does not allow one to establish the structure unambiguously due to a lack of information in characteristic frequencies [32]. They pointed out that the use of ANNs appears to be particularly promising. Artificial neural networks do not require any information about the relationship between spectral features and corresponding substructures in advance. The lack of information about complex effects in a vibrational spectrum (e.g., skeletal and harmonic vibrations, combination bands) does not affect the quality of a prediction or simulation performed by a neural network. Great attention has been paid in recent decades to the application of ANNs in vibrational spectroscopy [33,34]. The ANN approach applied to vibrational spectra allows the determination of adequate functional groups that can exist in the sample as well as the complete interpretation of spectra. Elyashberg reported an overall prediction accuracy using ANNs of about 80% that was achieved for general-purpose approaches [35]. Klawun and Wilkins managed to increase this value to about 95% [36]. Neural networks have been applied to infrared spectrum interpreting systems in many variations and applications. Anand introduced a neural network approach to analyze the presence of amino acids in protein molecules with a reliability of nearly 90% [37]. Robb used a linear neural network model for interpreting infrared spectra in routine analysis purposes with a similar performance [38]. Ehrentreich et al. used a counterpropagation (CPG) network based on a strategy of Novic and Zupan to model the correlation of structures and infrared spectra [39]. Penchev and colleagues compared three types of spectral features derived from infrared peak tables for their ability to be used in automatic classification of infrared spectra [40].
Back-propagation networks have been used in supervised learning mode for structure elucidation [41,42]. A recall test with a separate data set confirms the quality of training. Novic and Zupan doubted the benefits of back-propagation networks for infrared spectroscopy and introduced the use of Kohonen and CPG networks for the analysis of spectra-structure correlations.
6.5 Creating Missing Information — Infrared Spectrum Simulation
Gasteiger and coauthors suggested another approach [43]. They developed the previously described three-dimensional (3D) Molecular Representation of Structures Based on Electron Diffraction (MoRSE) descriptor, derived from an equation used in electron diffraction studies. This descriptor allows the 3D structure of a molecule to be represented by a constant number of variables. By using a fast 3D structure generator, they were able to study the correlation between any three-dimensional structure and infrared spectra using neural networks. Steinhauer et al. used radial distribution function (RDF) codes as structure descriptors together with the infrared spectrum to train a CPG neural network, thus modeling the complex relationship between a structure and its infrared spectrum [44]. They simulated an infrared spectrum by using the RDF descriptor of the query compound to determine the central neuron in the Kohonen layer. The corresponding infrared spectrum in the output layer represents the simulated spectrum. Selzer et al. described an application of this spectrum simulation method that provides rapid access to arbitrary reference spectra [45]. Kostka et al. described a combined application of spectrum prediction and reaction prediction expert systems [46]. The combination of the reaction prediction system Elaboration of Reactions for Organic Synthesis (EROS) and infrared spectrum simulation proved to be a powerful tool for computer-assisted substance identification [47].
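A minimal sketch of the spectrum-simulation lookup described above: after training, the RDF descriptor of the query compound selects the central (winning) neuron in the Kohonen layer, and the spectrum stored at the same position in the output layer is returned. The array shapes and the random "trained" weights below are hypothetical placeholders, not a real trained network.

import numpy as np

def winning_neuron(kohonen_weights, descriptor):
    # Index of the neuron whose weight vector is closest (Euclidean) to the query.
    distances = np.linalg.norm(kohonen_weights - descriptor, axis=-1)
    return np.unravel_index(np.argmin(distances), distances.shape)

def simulate_spectrum(kohonen_weights, output_weights, descriptor):
    # Read the simulated IR spectrum stored in the output layer at the position
    # of the winning Kohonen neuron.
    row, col = winning_neuron(kohonen_weights, descriptor)
    return output_weights[row, col]

# Hypothetical trained network: 10x10 neurons, 128-point RDF input, 128-point spectrum output
rng = np.random.default_rng(0)
kohonen = rng.random((10, 10, 128))   # Kohonen (input) layer weights
output = rng.random((10, 10, 128))    # CPG output layer holding IR spectra
query_rdf = rng.random(128)
print(simulate_spectrum(kohonen, output, query_rdf).shape)   # (128,)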
6.5.1 Spectrum Representation Neural network methods require a fixed length representation of the data to be processed. Vibrational spectra recorded usually fulfill this requirement. With most applications in vibrational spectroscopy, the spectral range and resolution are fixed, and a comparison of spectra from different sources is directly possible. Appropriate scaling of the spectra allows handling different resolutions to obtain the same number of components in a descriptor. Digitized vibrational spectra typically contain absorbance or transmission values in wave-number format. Most of the spectrometers provide the standardized spectral data format JCAMP-DX developed by the Working Party on Spectroscopic Data Standards from the International Union of Pure and Applied Chemistry (IUPAC) [48]. Preprocessing of spectra usually includes methods for background correction, smoothing and scaling, or normalization. The simplest methods are the scaling of the spectrum relative to the maximum intensity (typically set to 1) or the vector sum normalization. The problem in computational processing of spectra is the high number of data points delivered by the spectrometer software. An adequate reduction of information is necessary to ensure reasonable calculation times with artificial intelligence (AI) computational methods.
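The two simple preprocessing options mentioned above, scaling to the maximum intensity and vector-sum normalization, can be written in a few lines of Python with NumPy; the numbers are arbitrary example absorbances.

import numpy as np

def scale_to_max(spectrum):
    # Scale so that the strongest band has intensity 1.
    return spectrum / np.max(spectrum)

def vector_sum_normalize(spectrum):
    # Divide by the sum of all intensities so that the values add up to 1.
    return spectrum / np.sum(spectrum)

absorbances = np.array([0.02, 0.10, 0.45, 0.30, 0.08])
print(scale_to_max(absorbances))                 # maximum becomes 1.0
print(vector_sum_normalize(absorbances).sum())   # 1.0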
6.5.2 Compression with Fast Fourier Transform
A simple data reduction technique is to divide the spectrum into sections and to calculate the mean values of the absorbances in these sections. The number of sections determines the resolution of the data-reduced spectrum. A mathematical way for the reduction of spectra is the fast Fourier transform (FFT). By applying the FFT to a spectrum — or generally, to a periodic function — its values are decomposed into a series of sines and cosines, resulting in a set of Fourier coefficients. Each coefficient leads to a more detailed representation of the original spectrum; the more of these coefficients are used in a reverse transformation, the higher is the similarity of the back-transformed spectrum to the original one. Reducing the number of coefficients used for the back transformation compresses the spectrum. In practice, those coefficients that do not increase the resolution of a back-transformed spectrum significantly are set to zero. The number of remaining coefficients determines the resolution of the reduced spectrum.
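A sketch of this kind of FFT compression in Python/NumPy, assuming the simplest selection rule of keeping the lowest-frequency coefficients and zeroing the rest; the test signal is synthetic.

import numpy as np

def fft_compress(spectrum, n_keep):
    # Keep only the first n_keep Fourier coefficients, set the rest to zero,
    # and transform back to obtain the compressed (smoothed) spectrum.
    coeffs = np.fft.rfft(spectrum)
    coeffs[n_keep:] = 0.0
    return np.fft.irfft(coeffs, n=len(spectrum))

x = np.linspace(0.0, 1.0, 512)
spectrum = np.exp(-((x - 0.3) / 0.02) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.05) ** 2)
approx = fft_compress(spectrum, 64)
print(np.max(np.abs(spectrum - approx)))   # small reconstruction error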
6.5.3 Compression with Fast Hadamard Transform The fast Hadamard transform (FHT) leads to results similar to those of the FFT. Instead of sines and cosines, square wave functions define the transformation matrix of the Hadamard transform (Figure 6.3). The FHT is generally preferred due to faster calculation and because it operates with real instead of complex coefficients. The FHT of infrared spectra follows a linear reduction of the spectrum into 512 intervals, each of which represents the corresponding mean intensity. The widths of the intervals are set to 20 cm–1 in the high frequency range (4000 to 2000 cm–1) and to 4 cm–1 in the low frequency range (2000 to 352 cm–1) of the spectrum. Applying the FHT produces 512 Hadamard coefficients. The first 64 coefficients represent the spectrum; the remaining coefficients are discarded, or set to zero. The advantage is the considerably shorter representation (i.e., 64 instead of 512 values) with a reasonably good reproduction of the original spectrum. Other studies have shown that there is no essential difference whether the data reduction is made by calculating the mean of spectrum sections or by reducing the Hadamard coefficients [49].
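The Hadamard compression can be sketched with a textbook fast Walsh–Hadamard transform. The implementation below uses the natural (Hadamard) coefficient ordering and a random 512-point vector as a stand-in for the interval-averaged spectrum, so it illustrates the principle rather than the exact routine used in the study.

import numpy as np

def fwht(values):
    # Fast Walsh-Hadamard transform; the length must be a power of two.
    a = np.array(values, dtype=float)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def fht_compress(spectrum, n_keep=64):
    # Compressed representation: the first n_keep Hadamard coefficients.
    return fwht(spectrum)[:n_keep]

def fht_reconstruct(coeffs, full_length=512):
    # Back transformation: pad with zeros and apply the transform again, divided by N.
    padded = np.zeros(full_length)
    padded[:len(coeffs)] = coeffs
    return fwht(padded) / full_length

reduced = np.random.default_rng(1).random(512)   # spectrum averaged into 512 intervals
approx = fht_reconstruct(fht_compress(reduced, 64), 512)
print(approx.shape)                              # (512,)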
6.6 From the Spectrum to the Structure — Structure Prediction Since infrared spectroscopy monitors the vibrations of atoms in a molecule in 3D space, information on the 3D arrangement of the atoms should somehow be contained in an infrared spectrum. An infrared spectrum itself, as well as other spectra, constitutes a molecular descriptor. The relationships between the 3D structure and the infrared spectrum are rather complex so that no attempts have yet been successful in deriving the 3D structure of a molecule directly from the infrared spectrum. Training a Kohonen neural network with a molecular descriptor and a spectrum vector models the rather complex relationship between a molecule and an infrared spectrum. This relationship is stored in the Kohonen network by assigning the weights through a competitive learning technique from a suitable training set of
(Figure 6.3: the panels show the back-transformed signal for 128, 64, 32, 16, 8, and 4 retained Hadamard coefficients.)
Figure 6.3 The FHT of an infrared spectrum decomposes the spectrum into square wave functions. With decreasing number of kept coefficients, the signal becomes more and more rough.
structures and infrared spectra. A Kohonen network may operate in two directions, since input and output are reversible. Input of a 3D structure results in the output of an infrared spectrum. In reverse mode, an RDF descriptor is obtained on input of an infrared spectrum. The simulation of infrared spectra with CPG networks has already been described. The more interesting application is obvious: the input of a query infrared spectrum into a reverse-trained Kohonen network provides a structure descriptor. The question now is whether it is possible to obtain a 3D structure from this descriptor. An RDF descriptor cannot be back-transformed by explicit mathematical equations to provide the Cartesian coordinates of a 3D structure. However, we will focus on two other methods. The first method relies on the availability of a large, diverse descriptor database and is called the database approach. The second method, the modeling approach, is a modeling technique designed to work without appropriate descriptor databases.
6.6.1 The Database Approach
RDF descriptors exhibit a series of unique properties that correlate well with the similarity of structure models. Thus, it would be possible to retrieve a similar molecular model from a descriptor database by selecting the most similar descriptor. It may sound strange to use a database retrieval method again to elucidate the structure, and the obvious question is: Why not directly use an infrared spectra database? The answer is simple. Spectral library identification is extremely limited: about 28 million chemical compounds are reported in the literature, but only about 150,000 spectra are available in the largest commercial database. However, in most cases scientists work in a well-defined area of structural chemistry. Structure identification can then be restricted to special databases that already exist. The advantage of the prediction of a descriptor and a subsequent search in a descriptor database is that we can enhance the descriptor database easily with any arbitrary compound, whether or not a corresponding spectrum exists. Thus, the structure space can be enhanced arbitrarily, or extrapolated, whereas the spectrum space is limited. This simple fact is the major advantage of the database approach — which is also the basis for the modeling approach — over conventional spectra catalog searches. This approach is designed for use with an expert system and user-supplied databases. It generally provides a fast prediction within 1 to 20 seconds and a higher success rate than any other automated method of structure elucidation with infrared spectra.
6.6.2 Selection of Training Data The molecules and infrared spectra selected for training have a profound influence on the radial distribution function derived from the CPG network and on the quality of 3D structure derivation. Training data are typically selected dynamically; that is, each query spectrum selects its own set of training data by searching the most similar infrared spectra, or most similar input vector. Two similarity measures for infrared spectra are useful:
1. The Pearson correlation coefficient between query spectrum and database spectrum may be used to describe the general shape of an infrared spectrum. It does not overestimate unspecific deviations, like strong band-intensity differences, or differences due to sample preparation errors or impurities.
2. The root mean square (RMS) error between two spectra reacts more sensitively to global intensity differences and small changes, for instance, in the fingerprint region.
Small intensity differences between two spectra are usually not of great value in routine analysis. However, differences in the fingerprint region of an infrared spectrum may lead to significantly different interpretation. This applies in particular to solid analysis, where polymorphism and conformational change affect the shape of an infrared spectrum. When significant strong bands (e.g., the carbonyl band) dominate, they may determine the significance of discrimination. In these cases, the correlation coefficient usually leads to better results.
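Both similarity measures are straightforward to compute; the following Python sketch uses NumPy and two short artificial spectra for illustration.

import numpy as np

def pearson_r(a, b):
    # Pearson correlation coefficient between two spectra.
    return float(np.corrcoef(a, b)[0, 1])

def rms_error(a, b):
    # Root mean square difference between two spectra.
    return float(np.sqrt(np.mean((a - b) ** 2)))

query = np.array([0.10, 0.40, 0.90, 0.30, 0.05])
candidate = np.array([0.12, 0.38, 0.95, 0.28, 0.06])
print(pearson_r(query, candidate), rms_error(query, candidate))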
6.6.3 Outline of the Method
The following steps outline the method for the database approach. The investigation can be performed with Algorithms for Radial Coding (ARC), presented in chapter 5 of this volume.
6.6.3.1 Preprocessing of Spectrum Information
Commercially available infrared spectrum databases represent spectra in different resolutions and lengths. To unify the spectra and to achieve appropriate performance, the spectra can be compressed using, for instance, an FHT according to the method of Novic and Zupan [50]. These authors represented the infrared spectrum by 128 absorbance values between 3500 and 552 cm–1. They divided the spectrum into two parts represented with different resolutions: 40 cm–1 between 3500 and 2020 cm–1, and 16 cm–1 in the more significant region between 1976 and 552 cm–1.
6.6.3.2 Preprocessing of Structure Information
The chemical structures for the corresponding spectra are usually not available as 3D coordinates. Consequently, the Cartesian coordinates have to be calculated from the connection tables. One of the useful programs for this task is the 3D structure generator CORINA [51].
6.6.3.3 Generation of a Descriptor Database
Having the three-dimensional coordinates of the atoms in the molecules, we can convert these into Cartesian RDF descriptors of 128 components (B = 100 Å–1). To simplify the descriptor, we can exclude hydrogen atoms, which do not essentially contribute to the skeleton structure. Finally, a wavelet transform can be applied using a Daubechies wavelet with 20 filter coefficients (D20) to compress the descriptor. A low-pass filter on resolution level 1 results in vectors containing 64 components. These descriptors can be encoded in binary format to allow fast comparison during descriptor search.
6.6.3.4 Training
The training of the CPG neural network is carried out in the following steps:
1. For each experiment, a set of the 50 infrared spectra that are most similar to the query spectrum — using the correlation coefficient or the root mean square between the spectra as similarity criterion — is compiled for training (Figure 6.4a).
2. Cartesian RDF descriptors of 128 components (B = 100 Å–1) are calculated for each structure without hydrogen atoms (see the sketch following this list).
3. The descriptors are transformed by Daubechies wavelet decomposition with 20 filter coefficients (D20).
4. A low-pass filter on resolution level 1 is applied, and the resulting vector containing 64 components is encoded in binary format.
5. The Kohonen network is trained in reverse mode with the infrared spectra and corresponding descriptors (Figure 6.4b). Reverse-mode training allows exchanging the Kohonen and the output layers in the network. Consequently, the central neuron is determined by the infrared spectrum — or more generally, the property vector — rather than the molecular descriptor.
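Step 2 of this procedure, the Cartesian RDF descriptor, can be sketched as follows. The function uses the plain, unweighted form g(r) = Σ exp(-B(r - r_ij)²) over all heavy-atom pairs, with 128 sample points and B = 100 as stated in the text; the maximum radius and the toy geometry are arbitrary assumptions, and the wavelet compression of steps 3 and 4 is omitted.

import numpy as np

def cartesian_rdf(coords, elements, n_points=128, r_max=12.8, b=100.0):
    # Cartesian RDF descriptor: a Gaussian-smoothed distribution of all
    # interatomic distances, sampled at n_points radii (hydrogens excluded).
    heavy = [i for i, el in enumerate(elements) if el != "H"]
    xyz = np.asarray(coords, dtype=float)[heavy]
    r = np.linspace(0.0, r_max, n_points)
    g = np.zeros(n_points)
    for i in range(len(xyz)):
        for j in range(i + 1, len(xyz)):
            d = np.linalg.norm(xyz[i] - xyz[j])
            g += np.exp(-b * (r - d) ** 2)
    return g

# Hypothetical toy geometry: three heavy atoms and one hydrogen
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0), (2.6, -0.9, 0.3)]
elements = ["C", "C", "O", "H"]
print(cartesian_rdf(coords, elements).shape)   # (128,)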
6.6.3.5 Prediction of the Radial Distribution Function (RDF) Descriptor
After training, the query infrared spectrum is loaded into the Kohonen network, and the corresponding predicted RDF descriptor is retrieved (Figure 6.4c).
Figure 6.4 (a) Compilation of a training data set for deriving a 3D structure of a compound from its infrared spectrum. A database of infrared spectra and corresponding RDF descriptors is searched for the 50 spectra that are most similar to the query infrared spectrum. Connection tables are retrieved, and 3D coordinates as well as physicochemical atom properties are calculated. These data are used to calculate molecular descriptors, which are combined with the infrared spectra from the database into a training set. (b) Training of a CPG network for deriving a 3D structure of a compound from its infrared spectrum is performed with 50 infrared spectra and their corresponding RDF descriptors.
(Figure 6.4c: the query infrared spectrum enters the trained CPG network, the predicted RDF code is compared against an RDF code database of about 250,000 compounds, and the most similar entry is selected as the initial model.)
Figure 6.4 (continued) (c) Derivation of the 3D structure of a compound from its infrared spectrum. After training, the query infrared spectrum is used to predict the RDF descriptor, and a structure database is searched for the most similar descriptor. The corresponding structure is retrieved as the initial model.
6.6.3.6 Conversion of the RDF Descriptor A molecule with an RDF descriptor most similar to the one retrieved from the neural network is searched in the binary descriptor database using the minimum RMS error or the highest correlation coefficient between the descriptors.
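The descriptor search itself reduces to a nearest-neighbor lookup. A possible sketch with both criteria (minimum RMS error, maximum correlation coefficient) over a random stand-in descriptor matrix follows; the database contents are hypothetical.

import numpy as np

def best_match(db_descriptors, predicted, criterion="rms"):
    # Index of the database descriptor closest to the predicted one,
    # by minimum RMS error or by maximum correlation coefficient.
    if criterion == "rms":
        scores = np.sqrt(np.mean((db_descriptors - predicted) ** 2, axis=1))
        return int(np.argmin(scores))
    corr = [np.corrcoef(row, predicted)[0, 1] for row in db_descriptors]
    return int(np.argmax(corr))

db = np.random.default_rng(2).random((1000, 64))      # stand-in compressed descriptors
query = db[123] + 0.01 * np.random.default_rng(3).random(64)
print(best_match(db, query), best_match(db, query, "corr"))   # both most likely 123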
6.6.4 Examples for Structure Derivation
To get an idea of such a prediction process, we will look at a result for the prediction of a monosubstituted benzene derivative. The data set for this experiment contains 50 benzene derivatives and their spectra–descriptor pairs. Figure 6.5 shows the infrared spectrum of the query compound reduced to 128 components by the FHT method described earlier. The spectrum contains some bands that may indicate the existence of an aromatic system and of halogen atoms. However, the spectrum is not particularly characteristic. The query compound is considered as unknown; that is, only the infrared spectrum is used for prediction. The prediction of a molecule is performed by a search for the most similar descriptors in a binary descriptor database. The database contains compressed, low-pass filtered, D20-transformed RDF descriptors of 64 components each. The descriptors originally used for training (Cartesian RDF, 128 components) were compressed in the same way before the search process. Figure 6.6 shows the two-dimensional (2D) images of the eight molecules with the highest similarity to the predicted descriptor. In this case, the query compound
(Figure 6.5: infrared spectrum plotted as absorbance versus wave number from 3500 to 500 cm–1.)
Figure 6.5 The infrared spectrum of a query compound compressed by Hadamard transform for the prediction of benzene derivatives by a CPG neural network. The spectrum exhibits some typical bands for aromatic systems and chlorine atoms.
(Figure 6.6: 2D structures of the eight best matching benzene derivatives with descriptor correlation coefficients of 0.9926, 0.9911, 0.9761, 0.9746, 0.9739, 0.9734, 0.9710, and 0.9605.)
Figure 6.6 Benzene derivatives predicted by a CPG neural network (low-pass D20 Cartesian RDF, 128 components). The 2D images of the eight best matching structures from the descriptor database are shown together with the correlation coefficients between their descriptors and the one predicted from the neural network.
was contained in the binary descriptor database and was identified exactly, with a correlation coefficient higher than 0.99, which typically indicates an exact match. Because predicted descriptors result from an interpolation process, an exact coincidence (r = 1.00) is actually never found. Figure 6.7 and Figure 6.8 show another example of correct predictions for bicyclic compounds, displaying query infrared spectra and the eight best matching
(Figure 6.7: infrared spectrum plotted as absorbance versus wave number from 3500 to 500 cm–1.)
Figure 6.7 The infrared spectrum of a query compound compressed by Hadamard transform for the prediction of bicyclic compounds by a CPG neural network.
(Figure 6.8: 2D structures of the eight best matching bicyclic compounds with RMS errors of 0.112, 0.120, 0.206, 0.216, 0.245, 0.248, 0.251, and 0.259 between their descriptors and the predicted one.)
Figure 6.8 Prediction of a bicyclic compound by CPG neural networks (low-pass D20 Cartesian RDF, 128 components). Eight best matching structures from the descriptor database and the RMS errors between their descriptor and the one predicted from the Kohonen network. The structure belonging to the query spectrum was found at the lowest RMS error of 0.122.
molecules. In these results, the software could always identify the molecule — the first one in the list — belonging to the original query infrared spectrum. These examples used the RMS error between the RDF descriptor from the database and the one predicted from the neural network as similarity measures. In fact, no significant difference can be found between the RMS error and correlation coefficient for the results in these experiments. The results prove the ability of the database approach to make correct predictions for a wide range of compounds if the compounds are available in the RDF descriptor database. Because of the previously mentioned fact that the RDF descriptor database can be compiled with any arbitrary compound, a prediction for any spectrum is generally possible.
6.6.5 The Modeling Approach

The database approach enables the prediction of structures that are already available in a descriptor database of arbitrary molecules. If the database contains no identical but only similar molecules, the modeling approach may still provide a correct prediction. This approach is an enhancement of the previously described method and uses a modeling process for optimizing the prediction (Figure 6.9) [52]. The modeling process comprises the following steps (a code sketch of the resulting loop follows the list):
1. The most similar molecular descriptor is retrieved from a database in the same way as in the database approach. The retrieved molecule is referred to as the initial model.
2. The initial model is transformed by altering atom types and by removal or addition of atoms. With the addition of atoms, the corresponding change in bond length is considered.
3. After each transformation step, an RDF descriptor is calculated and again compared to the one derived from the CPG neural network.
Figure 6.9 The initial model is transformed (i.e., change of atom type, removal or addition of atoms, shift of atom positions). After each manipulation, an RDF descriptor is calculated and again compared to the one derived from the CPG network. The best-fitting RDF descriptor determines the final model.
4. If the similarity of the two descriptors increases, the manipulated model is used for subsequent manipulations until no further improvement of similarity can be achieved.
5. The resulting molecular model is referred to as the final model.
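The following Python sketch outlines the iterative loop described in the list above under strongly simplified assumptions: generate_candidates (producing manipulated models by atom-type change, addition, or elimination) and rdf_descriptor (recalculating the descriptor after each manipulation) are hypothetical placeholders for the operations of the actual program.

```python
import numpy as np

def similarity(a, b):
    """Correlation coefficient used as improvement criterion."""
    return np.corrcoef(a, b)[0, 1]

def refine_model(initial_model, predicted_descriptor,
                 generate_candidates, rdf_descriptor):
    """Greedy refinement of the initial model toward the predicted descriptor.

    generate_candidates(model) -> iterable of manipulated models (hypothetical)
    rdf_descriptor(model)      -> RDF descriptor of a 3D model (hypothetical)
    """
    model = initial_model
    best = similarity(rdf_descriptor(model), predicted_descriptor)
    improved = True
    while improved:
        improved = False
        for candidate in generate_candidates(model):
            score = similarity(rdf_descriptor(candidate), predicted_descriptor)
            if score > best:            # keep the manipulation only if it helps
                model, best, improved = candidate, score, True
                break                   # continue manipulating the improved model
    return model                        # the "final model"
```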
As in the database approach, the structure descriptor is calculated without hydrogen atoms, which can be added implicitly after the decoding process. However, the positions of the hydrogen atoms are stored and used later as potential vectors pointing to new atoms. Besides the previously used similarity criteria (RMS and R) for RDF descriptors, the difference in number and position of peaks between two descriptors can be applied as an improvement criterion. The peak positions are an important piece of information in an RDF descriptor. Two RDF descriptors that exhibit the same peak positions must have the same distance distribution in the molecule and, thus, a similar basic structure. The differences in the RDF descriptor can then be attributed to the atomic properties. When atomic properties are used that are independent of the chemical neighborhood, this kind of comparison is useful for finding initial models that contain similar skeletons and that can be optimized through alteration of atom types and shifting operations. The tolerance of the method can be chosen freely. The initial model chosen and the criterion for similarity of the RDF descriptors determine the strategy for optimization.
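A minimal sketch of such a peak-based criterion is given below; it counts local maxima above a noise threshold and compares their positions. The threshold and tolerance values are arbitrary illustrations, not parameters taken from the original work.

```python
import numpy as np

def peak_positions(descriptor, threshold=0.05):
    """Indices of local maxima in an RDF descriptor above a noise threshold."""
    d = np.asarray(descriptor)
    return np.array([i for i in range(1, len(d) - 1)
                     if d[i] > d[i - 1] and d[i] > d[i + 1] and d[i] > threshold])

def peak_difference(desc_a, desc_b, tolerance=1):
    """Peaks in desc_a without a counterpart in desc_b within +/- tolerance bins,
    plus the difference in total peak count."""
    pa, pb = peak_positions(desc_a), peak_positions(desc_b)
    unmatched = sum(1 for p in pa if not np.any(np.abs(pb - p) <= tolerance))
    return unmatched + abs(len(pa) - len(pb))
```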
6.6.6 Improvement of the Descriptor

The initial model is altered in several ways to adapt the resulting RDF descriptor to the one obtained from the CPG neural network. For this task, a program was developed that contains empirical rules for the optimization of an RDF descriptor. The program searches for the molecule having the RDF descriptor most similar to the one retrieved from the CPG neural network, using different similarity criteria. Although several similarity criteria may be chosen, some rules must be considered:
• The extrema of an RDF are useful if a more or less similar skeleton structure exists in the database containing the initial models. In a first approximation, the peak positions are independent of the chosen atomic properties. The atomic properties, on the other hand, define the peak heights and are therefore important for the chemical nature of the atom pairs or their chemical environments.
• Using the correlation coefficient or the RMS error, deviations of the entire descriptor, including the peak heights, are considered. In this case, the similarity of the initial model depends not only on the skeleton structure but also on the atom types and chemical environments occurring in the corresponding molecule. Correlation coefficients are used if similar substructures are contained in the database used for the initial models.
After an initial model is retrieved, the molecule is manipulated by addition, removal, and changing of atom types. The sequence of optimization steps is dynamically adapted to the properties of the RDF descriptor, the chosen atomic properties, and the similarity criterion. The following properties are altered:
• The change of the atom type mainly affects the peak heights in the RDF descriptor. Through variation of all nonhydrogen atoms in the initial model, the program tries to find a minimum deviation between the RDF descriptor of the initial model and the one predicted from the CPG neural network. After each variation, the 3D structure has to be regenerated. The number of repetitions for alteration can be chosen freely. For the reasons mentioned previously, the alteration of the atom types is usually the first step if the number and positions of the function extrema are chosen as similarity criterion. In this case, a preceding alteration of all heteroatoms to carbon can optionally be performed, which facilitates finding the carbon skeleton of the query molecule.
• The addition or elimination of atoms leads to changes in number, position, and height of the peaks in an RDF descriptor. The elimination is restricted to the parts of a molecule where the skeletal structure remains intact, that is, where no skeletal bond is broken. The minimum is determined through stepwise elimination of all atoms in a definable number of iterations. The addition of atoms is restricted to the former positions of the hydrogen atoms, which are potential positions for new atoms. At first, carbon atoms are added, and the atom type is varied later on. The change of the bond length is considered by recalculating the 3D model; the same applies to the removal of atoms. The decision as to whether atoms are removed or added is determined by the number of peaks in the two RDF descriptors and the differences in the maximum distance.

The sequence of the optimization steps is determined primarily by the chosen similarity criterion. Using nonscaled RDF descriptors with atomic properties that are independent of the chemical neighborhood, the peak heights can be used to derive information about the atom types contained in the molecular structure. Using the extrema criterion, the program tries to approximate the RDF descriptor through addition and removal of atoms, depending on the observed difference in the number of extrema. With nonphysicochemical atomic properties, the program performs a change of atom types preceding the addition and removal of atoms.
6.6.7 Database Approach versus Modeling Approach

The modeling approach is generally useful if the query structure is not contained in the database. Although the methods for structure prediction from infrared spectra introduced here can give quite accurate proposals, several considerations have to be taken into account:
1. An appropriate database for the initial model requires an extremely high structural diversity to cover all of the structural features that can occur in the compound type to be investigated. Although diversity is important, it is generally a good idea to generate type-specific databases that best represent the searched structure type.
2. Molecules can exist in different conformations, depending on the chemical and physical environment. Therefore, infrared spectra contain information
only about the actual state of the molecule during the recording of the spectrum. The individual stereochemical nature of the final model is primarily defined by the information contained in the database. A fast 3D structure generator such as CORINA tries to find a conformation of low energy, which is not necessarily equivalent to the one found in the matrix of interest. If several conformations occur in the initial model database, the chance of finding the correct conformation is high. However, another conformation of the same molecule contained in the database is normally not used as an initial model, and the derivation will provide a wrong structure.
3. Additionally, the similarity measure between RDF descriptors of two complex molecules should be interpreted differently from that of simpler molecules. With an increasing number of atoms, the number of interatomic distances grows, and the interpretation of the RDF descriptor for model building becomes less reliable.
4. The computation times are much higher than in the database approach; the recalculations in the modeling process must be performed on each relevant initial model found in the database. Depending on the number of operations, this leads to between approximately 500 and 5000 recalculations of new 3D models and RDF descriptors for each initial model. With about 100,000 compounds in the binary database for the initial models, this can result in several million calculations per prediction if several initial models are to be considered. The method can be improved by integrating a fast 3D structure generator into the prediction software. In this case, a reliable 3D structure is calculated directly after each modeling operation.
Despite these considerations, the prediction of structures is possible even with infrared spectra of low quality. This is due to the fact that the quality of structure proposals depends on the presence of appropriate molecules with a similar conformation in the database rather than on the reliability of the infrared spectrum. The database approach is more straightforward and is well suited if user-defined databases with similar structures can be compiled. As previously mentioned, the advantage of the database approach is that the database can be compiled individually without needing the corresponding infrared spectra. This method can be seen as an intelligent database search. The success rate of this method depends on the chosen descriptor parameters. With the experimental conditions previously described, the database approach requires no experience in interpreting RDF descriptors and is therefore well suited for routine analysis.
6.7 From Structures to Properties

The exploration of large amounts of molecular data in a search for consistent patterns, correlations, and other systematic relationships can help to reveal hidden information in a set of molecules. Finding adequate descriptors for the representation of chemical structures is one of the most important requirements for this task. RDF descriptors are tools for analyzing chemical data sets for molecular patterns and systematic relationships with the help of statistical analyses and
self‑organizing neural networks. This section describes how to apply neural networks and how to interpret the results to gain information about chemical structures and their properties.
6.7.1 Searching for Similar Molecules in a Data Set

We have seen that RDF descriptors are one-dimensional representations of the 3D structure of a molecule. A classification of molecular structures containing characteristic structural features shows how effectively the descriptor preserves the 3D structure information. For this experiment, Cartesian RDF descriptors were calculated for a mixed data set of 100 benzene derivatives and 100 cyclohexane derivatives. Each compound was assigned to one of these classes, and a Kohonen neural network was trained with these data. The task for the Kohonen network was to classify the compounds according to their Cartesian RDF descriptors. Figure 6.10 shows the result in terms of the topological map of the Kohonen network, where each color indicates a class. The topological maps shown in this work are screenshots from the ARC software. The map is a numbered grid of fields, each of which represents a neuron. Each neuron may contain up to 64 training molecules, which are indicated by colored squares. This representation allows the user to access each training compound via a mouse click. The Kohonen network assigned each descriptor to one of the neurons (i.e., the central neuron) according to its similarity with the neuron data. Similar RDF descriptors appear in the same region of the topological map and form a cluster. The individual structures are indicated by squares colored according to their corresponding classification mark. The result shows a clear separation of benzene and cyclohexane derivatives.
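The flavor of this experiment can be reproduced with a few lines of Python; here the open-source MiniSom package stands in for the Kohonen network implementation, and random arrays merely substitute for real Cartesian RDF descriptors of benzene and cyclohexane derivatives.

```python
import numpy as np
from minisom import MiniSom  # third-party self-organizing map implementation

# Stand-in data: 100 "benzene" and 100 "cyclohexane" descriptors, 128 components each
rng = np.random.default_rng(1)
benzene = rng.normal(0.0, 1.0, (100, 128))
cyclohexane = rng.normal(0.5, 1.0, (100, 128))
data = np.vstack([benzene, cyclohexane])
labels = ["benzene"] * 100 + ["cyclohexane"] * 100

som = MiniSom(10, 10, 128, sigma=1.5, learning_rate=0.5, random_seed=1)
som.train_random(data, 5000)

# Map each training compound to its central neuron and inspect the clustering
occupancy = {}
for descriptor, label in zip(data, labels):
    occupancy.setdefault(som.winner(descriptor), set()).add(label)
conflicts = [n for n, classes in occupancy.items() if len(classes) > 1]
print(f"{len(occupancy)} occupied neurons, {len(conflicts)} conflict neurons")
```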
Figure 6.10 Results of the classification with 100 benzene derivatives (dark) and 100 cycloaliphatic compounds (light). The topological map shows a clear distinction between compounds with planar benzene ring systems and nonplanar cyclic systems (rectangular network, Cartesian RDF, 128 components).
Obviously, a Kohonen network is able to distinguish between planar and nonplanar structures via their descriptors. The value of this classification is that, if we present a new unclassified compound to the trained network, its central neuron indicates whether it is a benzene or a cyclohexane derivative. The training for such an experiment is performed within seconds on a conventional personal computer. The prediction for a new compound takes just a few milliseconds; the network is able to assign thousands of structures to their proper structure type within a few seconds. A comparable substructure search with conventional algorithms would take at least several hours. The method is therefore ideally suited for automated analysis in high-throughput screening tasks.

However, the task just described is a quite simple one, and the question is whether a classification is still successful and accurate with other structure types. If we extend a training data set with another class of structures, a new cluster occurs. The next experiment shows this with a small set of steroids included in the training set. Some of the neurons containing descriptors of cyclic aliphatic compounds split, and the steroid descriptors appear. With two neurons, (6.9) and (7.8), a conflict occurs: both neurons are occupied by steroid as well as cycloaliphatic derivatives. Any unknown compound that hits such a neuron cannot be assigned properly to one of the classes defined in the training. A conflict can be resolved by increasing the size of the network or by separating the classes with multiple networks, each of which solves its own classification task. Investigating the conflict neurons and the ones in their neighborhood reveals an interesting phenomenon: some of these neurons, (6.5), (7.9), and (8.9), as well as the conflict neurons, contain polycyclic aliphatic compounds, obviously because of their higher similarity to steroids. By including a new type of compound, the network found a new class.

This is a simple example of how the investigation of trained neural networks leads to new conclusions about chemical structures. In this case, the conclusion is obvious, since it is easy to find the reason for the assignment; in several cases, however, the reason is not obvious. Due to the logical mathematical framework, the assignment performed by neural networks is always correct; if a result is unexpected, the defect lies in the predefined classification, or the descriptor does not properly represent the task.

The next example shows how the complexity of an RDF descriptor might influence the classification. The compiled data set consisted of benzene derivatives, phosphorus compounds, and amines. The Cartesian RDF descriptors were calculated once including all atoms and a second time without hydrogen atoms. The left-hand image in Figure 6.11 shows the classification with normal RDF descriptors. Two remarkable situations occur:
1. A lack of discrimination between some phosphorus compounds and amines (upper-right corner); all compounds in this region are aliphatic.
2. A similar effect occurs near benzene derivatives and phosphorus compounds in the lower-left corner of the map; all compounds in this region are aromatic.
Figure 6.11 Results of the classification with Cartesian RDF descriptors for 24 benzene derivatives, 20 phosphorus compounds, and 11 amines, calculated including (left) and ignoring (right) hydrogen atoms (rectangular network, Cartesian RDF, 256 components).
In contrast to the expected classification, the obvious effect is the discrimination between aliphatic and aromatic compounds. We achieve a better result by excluding unnecessary information: a clear discrimination occurs when hydrogen atoms are excluded from the calculation (Figure 6.11, right). Whereas the normal descriptor describes the aliphatic character of the compounds, the hydrogen-excluded descriptor correctly emphasizes the differences in heteroatoms. These examples show the typical workflow for optimizing the discrimination of compounds with RDF descriptors:
• Selection of training compounds that represent the task properly.
• Selection of a descriptor that incorporates properties related to the task.
• Definition of the classification according to the task.
• Investigation of the topological map for inconsistent classifications.
• Redefinition of the classification, if necessary.
• Refinement of the descriptor by inclusion or exclusion of information.
The flexibility of RDF descriptors is a basic requirement to achieve the expected results.
6.7.2 Molecular Diversity of Data Sets

Gaining information about the similarity of compounds is just one part of the problem in modern high-throughput chemistry. In the areas of drug design and drug specification, the similarity of structural features is of interest for retrieving structures
with defined biological activity from a database, whereas the diversity of structures may be of interest for the synthesis of new drugs; with increasing variety of structures in a data set, the chance of finding a new route of synthesis for a compound with similar biological properties increases. Describing the diversity of a data collection with a unique measure is almost impossible. Such a measure would depend on its relationship to a generally valid reference data set, which is hard to define. In fact, the terms similarity and diversity can have quite different meanings in chemical investigations. In the simplest case, similarity concerns structural features, which are, in fact, easy to determine. Similarity in a more general chemical context typically includes additional properties and is in most cases hard to describe as an individual feature. Most of the methods that have been introduced for the estimation of molecular similarity are based on substructure [53], topological [54], and graph-theoretical approaches [55] (for an overview of similarity measures see, for example, Willett [56] or Johnson and Maggiora [57]). However, 3D distance measures have seldom been used for similarity purposes [58]. The use of RDF descriptors suggests a way to calculate a single value that describes the diversity of a data set by means of descriptive statistics. A series of statistical algorithms is available for evaluating the similarity of larger data sets. By statistical analysis, the diversity, or similarity, of two data sets can be characterized. Two methods are straightforward.

6.7.2.1 Average Descriptor Approach

One way to characterize an individual descriptor within its data set is to calculate the average descriptor for the data set,
\bar{g}(r) = \frac{1}{L} \sum_{i=1}^{L} g_i(r) \qquad (6.1)
The average descriptor \bar{g}(r) is the sum of all descriptors g_i(r) divided by the number of molecules L. The average of the differences between each descriptor and the average descriptor can act as a diversity measure, the average diversity ∆g,

\Delta g = \frac{1}{L} \sum_{i=1}^{L} \delta g_i \qquad (6.2)

where δg_i denotes the deviation of the descriptor g_i(r) from the average descriptor.
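A minimal sketch of these two quantities, assuming the descriptors are stored as rows of a NumPy array; taking the per-descriptor deviation δg_i as the RMS difference to the average descriptor is an assumption made here for illustration.

```python
import numpy as np

def average_descriptor(descriptors):
    """Average set descriptor (ASD), Equation 6.1: mean over all L descriptors."""
    return descriptors.mean(axis=0)

def average_diversity(descriptors):
    """Average diversity, Equation 6.2: mean deviation from the average descriptor.

    The per-descriptor deviation is taken here as the RMS difference to the ASD;
    the exact form of this difference is an assumption.
    """
    asd = average_descriptor(descriptors)
    deviations = np.sqrt(np.mean((descriptors - asd) ** 2, axis=1))
    return deviations.mean()

# Hypothetical data set: 185 descriptors with 128 components each
rng = np.random.default_rng(2)
codes = rng.random((185, 128))
print(average_diversity(codes))
```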
6.7.2.2 Correlation Approach

A more sophisticated statistical method for the characterization of diversity is to examine how the individual descriptors of the compounds correlate with the average set descriptor (ASD). In data sets of low diversity, the correlation coefficients will tend toward 1.0, and their distribution will be narrow.
Figure 6.12 Distribution of the correlation coefficients between individual descriptors and the ASD for high-diversity data (185 arbitrarily chosen organic compounds) and low-diversity data (185 benzene derivatives).
An investigation of two data sets illustrates the application of these methods. The first set contains 185 monosubstituted benzene derivatives with arbitrary side chains, representing a collection of low diversity. The second set of 185 randomly chosen compounds from a spectra database covers molecules of between about 20 and 150 atoms and represents the high-diversity set. Figure 6.12 shows an example for the correlation approach. The data set of low diversity shows an average correlation coefficient of 0.94 with a small standard deviation of 0.03. The high-diversity data set shows a typical trend toward smaller correlation coefficients (average correlation 0.84) and a relatively high standard deviation of 0.11.

The mean and standard deviation of the correlation coefficients seem to be a reliable diversity measure. However, as mentioned in the theoretical section, the reliability of the correlation coefficient itself depends on the symmetry of the distribution within a descriptor; skewness and kurtosis should be regarded if a data set has to be classified as similar or diverse. Due to its characteristic shape, almost any raw RDF descriptor is skewed: it typically exhibits asymmetric tailing and is leptokurtic (more peaked than the Gaussian distribution; kurtosis > 0). Depending on the descriptor type, the size, and the symmetry of a molecule, a raw RDF descriptor may also show asymmetric fronting or platykurtic behavior (flatter than the Gaussian distribution; kurtosis < 0). As this general behavior applies to most RDF descriptors, it is acceptable to neglect this skewness and to assume a skewed standard distribution within the descriptor set. Checking the skewness distribution for outliers provides a fast method to determine whether a few individual structures do not fit into the data set (Figure 6.13). The three outliers with a descriptor skewness of 4.0 and higher are (1) hydrazine, (2) thionyl chloride, and (3) ammonium chloride: three compounds that are not representative of the wide variation of organic structures in the remaining data set.
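The distributions discussed here can be computed with standard statistics routines, as in the following sketch; the random arrays merely stand in for real low- and high-diversity descriptor sets.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def diversity_statistics(descriptors):
    """Correlation to the ASD plus per-descriptor skewness and kurtosis."""
    asd = descriptors.mean(axis=0)
    corr = np.array([np.corrcoef(d, asd)[0, 1] for d in descriptors])
    skw = skew(descriptors, axis=1)
    kurt = kurtosis(descriptors, axis=1)   # excess kurtosis (0 for a Gaussian)
    return {
        "mean_correlation": corr.mean(), "std_correlation": corr.std(),
        "mean_skewness": skw.mean(), "std_skewness": skw.std(),
        "mean_kurtosis": kurt.mean(), "std_kurtosis": kurt.std(),
    }

# Hypothetical low- and high-diversity descriptor sets (185 x 128 each)
rng = np.random.default_rng(3)
low = rng.normal(1.0, 0.1, (185, 128))
high = rng.normal(1.0, 0.5, (185, 128))
for name, data in [("low", low), ("high", high)]:
    print(name, diversity_statistics(data))
```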
Figure 6.13 Distribution of descriptor skewness for high- and low-diversity data sets. The deviation in skewness of the high-diversity set is about two times the one in the low-diversity set. Additionally, three outlying descriptors with extremely different skewness can be found in the upper part of the figure (185 benzene derivatives, 200 arbitrarily chosen organic compounds).
Table 6.1 Statistical Evaluation of the Low-Diversity (A) and High-Diversity (B) Data Sets

Data Set                                           (A)             (B)
Average correlation coefficient                    0.936           0.837
Average skewness                                   1.62            1.32
Average kurtosis                                   5.45            4.90
Average deviation to average set descriptor *      0.006           0.011
Average deviation in correlation coefficients *    0.029           0.114
Average deviation in skewness *                    0.33            0.72
Average deviation in kurtosis *                    1.40            3.09
Assumed category                                   Low Diversity   High Diversity

Note: * Appropriate data for diversity estimation. Set A consists of 185 monosubstituted benzene derivatives containing arbitrary side chains. Set B includes 185 randomly selected compounds from an infrared database, covering molecules of a size between 20 and 150 nonhydrogen atoms.
What is more important for a diversity evaluation is that, with increasing diversity of a descriptor collection, the mean deviation in skewness should increase. In fact, the mean skewness of the two data sets investigated is similar and does not clearly indicate a difference between the sets (Table 6.1), whereas the deviation in skewness of the high-diversity data set is about twice that of the low-diversity data set. The distribution of the kurtosis of the data sets leads to a similar result.
Figure 6.14 Distribution of descriptor deviation against the ASD for normal (above) and detail D20 transformed RDF (below) for a data set containing 100 benzene derivatives (black) and 100 cyclohexane derivatives (gray).
Whereas the mean correlation coefficient is significantly lower in the arbitrary data set, the mean skewness and mean kurtosis are similar. Though the latter values do not clearly indicate a difference between the data sets (they just indicate a similar symmetry and flatness of distribution), the deviations from the average behavior describe the diversity of the data sets properly: the average deviations in skewness and kurtosis are about twice as high in the arbitrary data set as in the set of benzene derivatives. The ASD and the combination of deviations in correlation coefficients, skewness, and kurtosis provide the most reliable measure for similarity and diversity of data sets.

Wavelet-transformed RDF descriptors can enhance or suppress typical features of descriptors, even filtered, or compressed, ones. This behavior also covers the diversity and similarity of molecules in a data set. The experiment in Figure 6.14 shows results from a single data set compiled from two types of compounds: 100 benzene derivatives followed by 100 monocyclic cyclohexane derivatives. The distribution of the deviations of the individual descriptors from the ASD indicates the diversity of the two data sets. With Cartesian RDF descriptors, the deviations against the ASD exhibit a higher similarity within the benzene derivative set than within the cyclohexane set. One of the reasons is that the aromatic system is rigid and the resulting aromatic pattern in the descriptor is always the same, whereas the conformational flexibility in the cyclohexane systems can lead to quite different distance patterns. The results for a high-pass, or detail-filtered, D20 transformed Cartesian RDF descriptor are different. The filtering process reduces the descriptor to half the size (64 components) of the original Cartesian RDF (128 components). The deviation against the average descriptor is not only smaller for both types of compounds; it is additionally possible to distinguish clearly between them by using the mean descriptor deviation. The transformed descriptors characterize the two types of compounds in a more specific way.
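A single-stage Daubechies-20 decomposition of a 128-component descriptor can be obtained, for example, with the PyWavelets package; using the periodization mode, which yields the 64-component low-pass and high-pass vectors mentioned in the text, is an assumption about how the transform was configured.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(4)
descriptor = rng.random(128)            # stand-in for a Cartesian RDF descriptor

# Single-stage D20 transform; 'periodization' keeps each output at 64 components
low_pass, high_pass = pywt.dwt(descriptor, "db20", mode="periodization")
print(low_pass.shape, high_pass.shape)  # (64,) (64,)
```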
Figure 6.15 Difference in classification for a data set containing 100 benzene derivatives (black) and 100 cyclohexane derivatives (gray) encoded with Cartesian RDF descriptors (above: 64, 32, and 16 components; ∆r = 0.2, 0.4, and 0.8 Å) and with high-pass filtered D20 transformed RDF descriptors (below: 32, 16, and 8 components; ∆r = 0.2, 0.4, and 0.8 Å) between 0 and 12.8 Å (rectangular Kohonen network; ∆r: discrete step size for the calculation). The classification quality decreases significantly with decreasing vector size and resolution. Although the resolution decreases significantly in the wavelet transform, the classification is only marginally affected.
The reduction of the descriptor size (i.e., the decrease in resolution) usually has a profound influence on the ability of the descriptor to characterize a molecule. Even though compressed, or filtered, wavelet transforms of descriptors have a reduced size, they preserve the similarity information well and in a much more efficient way. Figure 6.15 shows results from an experiment in which a Kohonen neural network classifies the same data set (100 benzene derivatives plus 100 monocyclic cyclohexane derivatives) according to ring type. The reduction in descriptor size and resolution of the Cartesian RDF descriptors leads to a significant decrease in the quality of classification. The high-pass D20 transformed descriptors, although half the size of the Cartesian RDF, are suited for classification even down to extremely short vectors with a resolution of just 0.8 Å (B = 1.5625 Å–2) of the original descriptor. This result should not lead to the conclusion that a step size of 0.8 Å is recommended. It should simply demonstrate that wavelet transforms are able to represent
structural similarity in a shorter data vector. In practice, the recommendation given here is to use a length of 128 components for the original descriptor. The transformed descriptor then reduces to 64 components but represents the structures with almost the same quality as the original descriptor.
6.7.3 Prediction of Molecular Polarizability

Many physicochemical properties of a molecule are implicitly related to its three-dimensional structure. RDF descriptors seem to be a valuable tool for the prediction of properties that are hard to calculate. This section shows an example. Polarizability (i.e., static dielectric polarizability), α, is a measure of the linear response of the electronic cloud of a chemical species to a weak external electric field of strength E. For isotropic molecules, the dipole moment, µ, is
µ = αE
(6.3)
In the general case, polarizability is anisotropic; it depends on the orientation of the molecule with respect to the external electric field. To consider the orientation, α is expressed by a function, the atom polarizability tensor, that defines the induced dipole moment for each possible direction of the electric field. This atom polarizability tensor describes the distortion in the nuclear arrangement in a molecule (i.e., the tendency of polarization in three dimensions). The polarizability determined in an experiment is an average polarizability; it is the sum of the polarizabilities in three principal directions that are collinear with the external field [59]. The mean molecular polarizability, αmol, quantifies the ease with which an entire molecule undergoes distortion in a weak external field. In chemical reactions, the attack of a reagent generates a charge. The displacement of the electron distribution in the reactant from the equilibrium will induce a dipole that stabilizes the presence of a charge. Thus, the mean molecular polarizability serves as a descriptor for the investigation of chemical reactions [60]. A physical property directly related to polarizability is the refractive index. Retardation of light within a substance is due to the interaction of light with the electrons, and high refractivity indicates that a molecule has a tendency to polarize easily.

Let us have a look at the prediction of the mean molecular polarizability αmol by neural networks and RDF descriptors. αmol can be calculated from additive contributions of the static polarizabilities αi of the individual atoms i,

\alpha_{mol} = \sum_{i=1}^{N} \alpha_i \qquad (6.4)
as described by Gasteiger and Hutchings [61]. The author estimated the static polarizability α with a method published by Kang and Jhon [62]:

\alpha_{mol} = \sum_{i=1}^{N} 0.5^{\,n_{ij}} \alpha_i \qquad (6.5)
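As an illustration of the purely additive scheme of Equation 6.4, the sketch below sums atomic increments over a molecular formula. The increment values are rough placeholders for illustration only and are not the parameters of Kang and Jhon used in the cited work.

```python
# Additive estimate of the mean molecular polarizability (Equation 6.4).
# The atomic increments below are illustrative placeholders only; the original
# work uses parameters from Kang and Jhon, which are not reproduced here.
ATOMIC_POLARIZABILITY = {  # in cubic angstroms, rough order of magnitude
    "H": 0.4, "C": 1.0, "N": 1.1, "O": 0.8, "Cl": 2.2,
}

def mean_molecular_polarizability(atom_counts):
    """Sum additive atomic contributions over the molecular formula."""
    return sum(ATOMIC_POLARIZABILITY[el] * n for el, n in atom_counts.items())

# Example: chlorobenzene, C6H5Cl
print(mean_molecular_polarizability({"C": 6, "H": 5, "Cl": 1}))
```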
Figure 6.16 Correlation between calculated and predicted molecular polarizability for 50 benzene derivatives encoded with one-stage filtered D20 transformed Cartesian RDF (128 components). The regression line is y = 0.8736x + 2.334 (R2 = 0.9025); the standard deviation of the prediction error is 0.6 Å3.
To allow the network to predict new properties by interpolation (i.e., to produce empty neurons), the network size is set to about 1.5 times the number of compounds. In contrast to the previous experiments, we will use a higher number of epochs for training to allow sufficient adaptation of the empty neurons. The next example illustrates the different representation of normal and wavelet-transformed RDF descriptors. The first 50 training compounds were selected from a set of 100 benzene derivatives; the remaining 50 compounds were used for the test set. The compounds were encoded as low-pass filtered D20 Cartesian RDF descriptors, each with a length of 64 components, and were divided linearly into eight classes of mean molecular polarizability between 10 and 26 Å3. The topological map layer after training of a Kohonen network shows a reasonable clustering, a first hint that a reliable prediction can be achieved. Figure 6.16 displays the correlation between predicted and experimental values of molecular polarizability for the set of 50 test compounds. The correlation coefficient of 0.95 (the coefficient of determination R2 is 0.902) indicates a clear relationship between the descriptor and the property. The standard deviation of the error of prediction is 0.6 Å3; that is, 99% of the predictions (3σ) lie within an error of about 1.8 Å3. The correlation between predicted and experimental values when nontransformed descriptors are used shows just a trend (R > 0.8, R2 = 0.651) to characterize polarizability; the slope of the regression line cannot be determined clearly. In addition, the standard deviation increases from 0.6 to 2.5 Å3 (± 7.5 Å3 for 99% of the predictions). Since the mean molecular polarizability is a property that is related to the distance information in a molecule, as is the stabilization of a charge due to polarizability, it is reasonable that RDF descriptors correlate with this property.
6.8 Dealing with Localized Information — Nuclear Magnetic Resonance (NMR) Spectroscopy

Fast and accurate predictions of NMR chemical shifts of organic compounds are of high interest for automatic structure elucidation and for the analysis of combinatorial libraries. Several approaches for the estimation of NMR chemical shifts have been published. NMR chemical shifts can be predicted by ab initio calculations [63,64]. However, the computation time required is considerable, and for most organic compounds the accuracy is comparable to that of faster empirical methods. The available empirical methods for the prediction of 1H-NMR spectra are based on data collected from experimental spectra of organic compounds [65]. One approach relies on a database of structural environments of atoms, which are usually represented by Hierarchical Organization of Spherical Environments (HOSE) codes [66]. The prediction of chemical shifts is based on the experimental values of the most similar atoms found and is used in commercially available program packages [67–69]. Another approach is based on tabulated chemical shifts compiled for different types of nuclei, corrected with additive contributions from neighboring functional groups or substructures [70]. A third approach, using neural networks and linear regression modeling, has been developed for the prediction of 13C-NMR chemical shifts using topological, physicochemical, or geometric parameters [71]. The predictions have been reported to be at least in the same range of accuracy as those obtained by the other methods [72], and neural network methods seemed to give generally better results than regression methods [73]. The method introduced in this chapter uses a similar approach for the prediction of 1H-NMR chemical shifts.
6.8.1 Commercially Available Products

ChemNMR is a product from Upstream Solution in Switzerland, a company founded in 1997 as a spin-off from the Federal Institute of Technology (ETH) Zurich. ChemNMR performs prediction of NMR shift values based on 4000 parameters for 13C-NMR and about 3000 parameters for 1H-NMR. Chemical shifts are predicted using additivity rules and several strategies of approximation. For a given molecule, the appropriate substructures are automatically assigned following a hierarchical list. Ring systems not available in the data set are approximated by embedded rings or disassembled into acyclic substructures. In the case of 1H-NMR, shifts of about 90% of all CHn groups can be predicted with a mean deviation of 0.2–0.3 ppm.

ACD/H-NMR from Advanced Chemistry Development (ACD) Labs calculates 1H-NMR spectra at any basic frequency. The system uses 3D molecular structure minimization and Karplus relationships to predict proton–proton coupling constants. The software recognizes spectral differences among diastereotopic protons, cis-trans isomers of alkenes, and syn-anti isomers of amides, oximes, hydrazones, and nitrosamines. The base data set includes more than 1,000,000 experimental chemical shifts and 250,000 experimental coupling constants. ACD/H-NMR uses an algorithm based on intramolecular interaction parameters to quantify intramolecular interactions in new organic structures and to predict their chemical shifts.
Other systems, such as HyperNMR from Hypercube Inc., calculate magnetic shielding and nuclear spin coupling constants for the nuclei of the molecular system, using a quantum mechanical description of the electronic structure. HyperNMR uses semiempirical typed neglect of differential overlap (TNDO) methods to compute magnetic shielding and nuclear spin coupling constants.

The determination of protein NMR assignments requires a series of different NMR experiments to be executed; the more NMR data are available, the better the assignments are, but the more time is needed to analyze all these experiments. Several approaches for the automated assignment of protein NMR spectra use optimization algorithms for increasing the quality of results, including ANNs, simulated annealing, and genetic algorithms (GAs) [74–78]. A more complex approach is achieved with AutoAssign, a constraint-based expert system for automating the analysis of backbone resonance assignments of small proteins using triple-resonance NMR spectra [79]. It was originally developed in LISP and was later extended to a client-server architecture with a Java-based graphical user interface. AutoAssign is implemented in the C++, Java2, and Perl programming languages and is supported on UNIX and Linux operating systems. AutoAssign automates the assignments of HN, NH, CO, C-alpha, C-beta, and H-alpha resonances in non-, partially-, and fully-deuterated samples. AutoAssign uses a problem-solving strategy derived from the basic procedure developed by Wüthrich and colleagues for the analysis of homonuclear spectra [80]. It combines symbolic constraint satisfaction methods with a domain-specific knowledge base to exploit the logical structure of the sequential assignment problem, the specific features of the various NMR experiments, and the expected chemical shift frequencies of different amino acids.
6.8.2 Local Descriptors for Nuclear Magnetic Resonance Spectroscopy

The prediction of chemical shifts in 1H-NMR spectroscopy is usually more problematic than in 13C-NMR. Experimental conditions can have an influence on the chemical shifts in 1H-NMR spectroscopy, and structural effects are difficult to estimate. In particular, stereochemistry and 3D effects have been addressed in the context of empirical 1H-NMR chemical shift prediction only in a few specific situations [81,82]. Most of the available databases lack stereochemical labeling, assignments for diastereotopic protons, and suitable representations for the 3D environment of hydrogen nuclei [83]. This is the point where local RDF descriptors seemed to be a promising tool. The approach presented here uses a combination of physicochemical, topological, and geometric information [84]. The geometric information is based on local proton RDF descriptors that characterize the chemical environment of the proton. CPG neural networks established the relationship between protons and their 1H-NMR chemical shifts. Four different types of protons were treated separately regarding their chemical environment: protons belonging to aromatic systems, nonaromatic π-systems, rigid aliphatic substructures, and nonrigid aliphatic substructures. Each proton was represented by a fixed number of descriptors. The mathematical flexibility of RDF descriptors and their use for specific representations have been mentioned before. The descriptors used in this approach have been
adapted specifically to the task of describing a proton in its environment, including typical NMR-relevant features. Geometric descriptors were based on local RDF descriptors (see Equation 5.20) for the proton j,

g_H(r) = \sum_{i}^{N(4)} q_i \, e^{-B(r - r_{ij})^2}, \qquad (6.6)
where i denotes an atom up to four nonrotatable bonds away from the proton, and N(4) is the total number of those atoms. A bond is defined as nonrotatable if it belongs to a ring, to a π-system, or to an amide functional group. qi is the partial atomic charge of atom i, and rij is the 3D distance between the proton j and the atom i. Figure 6.17 shows an example of a charge-weighted proton RDF; each 3D distance contributes proportionally to qi to a peak in this descriptor. Values of gH(r) at fixed points are used as descriptors of the proton. Some modifications can be applied to Equation 6.6 to represent further geometric features. The electronic influence of double bonds can be incorporated by

g_D(r) = \sum_{i}^{D(7)} \frac{1}{r_D^2} \, e^{-B(r - a_D)^2} \qquad (6.7)
where i is now a double bond up to the seventh sphere (D(7)) of nonrotatable bonds centered on the proton, rD is the distance of the proton to the center of the double bond, and aD is the radian angle between the plane defined by the bond and the distance rD (Figure 6.18a).
Figure 6.17 Example for a local RDF descriptor for proton 6 used in the prediction of chemical shifts and distances (B = 20 Å–2).
Figure 6.18 Special distance measures for the characterization of proton environments. (a) distance rD and radian angle aD to double bonds; (b) distance rS and radian angle aS to single bonds; (c) dihedral radian angle a3 to the third bond from the hydrogen atom.
Shielding and unshielding by single bonds can be encoded using

g_S(r) = \sum_{i}^{S(7)} \frac{1}{r_S^2} \, e^{-B(r - a_S)^2} \qquad (6.8)
where i is a single bond up to the seventh sphere (S(7)) of nonrotatable bonds centered on the proton, and rS and aS are the corresponding distance and angle, respectively (Figure 6.18b). To account for axial and equatorial positions of protons bonded to cyclohexane-like rings, the function
g_3(r) = \sum_{i}^{N(3)} e^{-B(r - a_3)^2} \qquad (6.9)
can be used, where i is an atom three nonrotatable bonds away from the proton (N(3) atoms in total) and belonging to a six-membered ring, and a3 is a dihedral radian angle (Figure 6.18c). The descriptors are calculated with a fixed resolution and distance range, for instance:
• gH(r) at 15 evenly distributed points between 1.4 and 4 Å with B set to 20 Å–2.
• gS(r) and gD(r) at seven evenly distributed points between 0 and π/2 (B = 1.15 Å–2).
• g3(r) at 13 evenly distributed points between 0 and π (B = 2.86 Å–2).
All local descriptors are L2-normalized. Additional properties or descriptors can be introduced to characterize physicochemical, geometric, and topological properties of the proton environment:
• Physicochemical Properties: Useful one-dimensional physicochemical properties in this context are partial atomic charge, effective polarizability,
and sigma electronegativity of the protons and the atoms in their vicinity. These descriptors have to be scaled linearly between 0 and 1 for the cases of the training set, and the same scaling factors have to be applied to the prediction set.
• Geometric Descriptors: The geometry of the chemical environment of the proton can be described by the minimum and maximum bond angles centered on the atom adjacent to the proton. These two descriptors are usually sufficient for aromatic and nonrigid aliphatic protons, whereas aromatic π-protons characterized by the proton RDF descriptor gH(r) can be used to describe nonaromatic π-protons. If the influence of cis-trans isomerism is important, the number of nonhydrogen atoms at the cis and trans positions can be used as additional geometric descriptors. For rigid aliphatic protons, usually all of these geometric descriptors are valid.
• Topological Descriptors: These are based on the analysis of the connection table and the physicochemical properties. They are related to purely topological aspects, such as the number of carbon atoms in a particular sphere around the proton, the number of oxygen atoms in a sphere, or the number of atoms in a sphere that belong to an aromatic system. A topological RDF descriptor based on the sum of bond lengths can also be used for any atom in a particular sphere around the proton, including the partial atomic charge.

Some topological and physicochemical descriptors will not provide significant differences for certain compound classes. However, around 90 different descriptors are useful for aromatic protons, more than 100 for nonrigid aliphatic and π-protons, and nearly 200 for rigid aliphatic protons. Here the understandable question arises as to how we can select the appropriate ones for our task. An optimization technique that has been introduced previously will help in this task: genetic algorithms.
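Before turning to the selection step, a minimal sketch of the charge-weighted proton RDF of Equation 6.6 is shown below, sampled at 15 points between 1.4 and 4 Å with B = 20 Å–2 as suggested above; the atom list with coordinates and partial charges is a hypothetical input format, and the selection of atoms within four nonrotatable bonds is assumed to have been done beforehand.

```python
import numpy as np

def proton_rdf(proton_xyz, atoms, b=20.0, n_points=15, r_min=1.4, r_max=4.0):
    """Charge-weighted local RDF g_H(r) of Equation 6.6.

    atoms: iterable of (partial_charge, xyz) for atoms up to four
           nonrotatable bonds from the proton (selection not shown here).
    """
    r_grid = np.linspace(r_min, r_max, n_points)
    g = np.zeros(n_points)
    for q_i, xyz_i in atoms:
        r_ij = np.linalg.norm(np.asarray(xyz_i) - np.asarray(proton_xyz))
        g += q_i * np.exp(-b * (r_grid - r_ij) ** 2)
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g   # L2 normalization as described above

# Hypothetical three-atom environment: (partial charge, coordinates in angstroms)
atoms = [(-0.3, (0.0, 0.0, 1.8)), (0.1, (1.0, 1.0, 1.0)), (0.2, (2.0, 0.5, 0.0))]
print(proton_rdf((0.0, 0.0, 0.0), atoms))
```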
6.8.3 Selecting Descriptors by Evolution

It has been reported that models developed with selected subsets of descriptors can be more accurate and robust than those using all possible descriptors [84]. To select the appropriate descriptors for each of the four classes of protons, we can compare the prediction results for a selected set of descriptors with those obtained when all descriptors are used. GAs are an excellent tool for this task. The general approach of the GA has been briefly introduced; detailed explanations can be found in the original publications, and a comprehensive review of GAs is available in the literature [85]. We will focus here on applying GAs in the context of NMR shift prediction. For the selection of descriptors, a GA simulates the evolution of a population, which in our case consists of descriptor subsets. Each individual of the population represents a subset of descriptors and is defined by a chromosome of binary values. The chromosome has as many genes as there are potentially useful descriptors: in the reference mentioned above, 92 for the aromatic group, 119 for nonrigid aliphatic, 174 for rigid aliphatic, and 101 for nonaromatic π-protons. The genes are binary representations of the presence of a descriptor: a value of 1 indicates that the corresponding descriptor is included in the subset; otherwise it takes 0 (Figure 6.19).
Figure 6.19 Specification of a subset of descriptors from a pool of possible descriptors by a chromosome. A cross-over mutation leads to an exchange of descriptor availability in two chromosomes.
The GA now simulates an evolution of the entire population. The size of the population corresponds to the number of selected descriptor subsets. Each population undergoes an evolution for a fixed number of generations:
• Each individual is combined with another randomly chosen individual by performing a cross-over mutation of the binary genes.
• Two new offspring are generated for each cross-over.
• The scoring of each chromosome is performed by a CPG neural network that uses the subset of descriptors encoded in the chromosome for predicting chemical shifts. The score function of a chromosome (fitness function) is the RMS error of the chemical shifts obtained with a cross-validation set.
• Chromosomes with lower RMS errors are considered fitter than those with higher ones and are selected for mating. In each generation, half of the individuals survive (the fittest individuals), and the other half die.
• Each of the surviving individuals is combined with another randomly chosen surviving individual, and again two new offspring are generated.
This process continues until a termination criterion is reached: either a maximum number of generations or a satisfactory fitness level for the population.
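The selection loop can be captured in a short sketch; the fitness function below is only a placeholder for the cross-validated RMS error of a trained CPG network, which is the part that cannot be reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)

def fitness(chromosome):
    """Placeholder for the cross-validated RMS error of a CPG network trained
    with the descriptor subset encoded by the chromosome (lower is better)."""
    return rng.random() + 0.01 * chromosome.sum()   # dummy score for illustration

def crossover(parent_a, parent_b):
    """Single-point cross-over producing two offspring."""
    point = rng.integers(1, len(parent_a))
    child_a = np.concatenate([parent_a[:point], parent_b[point:]])
    child_b = np.concatenate([parent_b[:point], parent_a[point:]])
    return child_a, child_b

def select_descriptors(n_descriptors=92, population_size=20, generations=50):
    population = rng.integers(0, 2, (population_size, n_descriptors))
    for _ in range(generations):
        # rank by fitness; the fitter half survives, the other half dies
        population = population[np.argsort([fitness(c) for c in population])]
        survivors = population[: population_size // 2]
        offspring = []
        for parent in survivors:
            mate = survivors[rng.integers(len(survivors))]
            offspring.extend(crossover(parent, mate))
        population = np.vstack([survivors, offspring])[:population_size]
    population = population[np.argsort([fitness(c) for c in population])]
    return population[0]    # best chromosome: the selected descriptor subset

print(select_descriptors().sum(), "descriptors selected")
```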
6.8.4 Learning Chemical Shifts

If the appropriate descriptors have been selected, a data set can be compiled for the training of the CPG neural network. A series of several hundred 1H-NMR chemical shifts for protons of different molecular structures is required for such training. Typically, the available data set is divided into a training and a test (cross-validation) set; that is, one part is used for training, whereas the rest of the data set is treated as unknown and used to derive the predictive quality of the method. Since the selection of training data determines the predictive quality, some key factors have to be accounted for:
• The data set has to be consistent concerning the measurement parameters, such as instrument parameters, conditions, or solvent.
• It is usually a good idea to restrict the selection to compounds containing a limited set of typical elements, for instance, avoiding organometallic or complex compounds.
• Another critical factor is protons whose chemical shifts strongly depend on experimental conditions, such as protons connected to heteroatoms, which vary in chemical shift depending on the concentration of the sample.
• Still, the data sets have to be designed as diverse as possible to cover as many types of protons as possible.
• If we want to address multiple classes of compounds, the training set has to be divided into one for each class, and the same has to be done for the prediction set.
After testing the results with the cross-validation set, we may end up with different descriptors and different network parameters for each of the types of protons. Since the networks only have to be trained once, we can reuse the final configuration for all predictions within the same compound class. The following section gives an example.
6.8.5 Predicting Chemical Shifts

An interesting example of the predicted chemical shift values is given in Figure 6.20. Even though no sulfur-containing heterocycle was in the training set, the prediction for the hydrogen atom bonded to the heterocycle is good (8.53 vs. 8.42 ppm). Such predictions are possible due to the use of physicochemical and topological descriptors that generalize atom types to their inherent physicochemical properties. Using these models, the global mean absolute error for the prediction set is 0.25 ppm with a standard deviation of 0.25 ppm. A mean absolute error of 0.19 ppm can be obtained for 90% of the cases. Figure 6.21 shows a plot of the predicted chemical shifts against the observed values. The performance of the method is remarkable considering the relatively small data set on which it was based. A particularly useful feature of the neural network approach is that the system can easily be retrained for specific types of compounds. Assigned spectra of related structures can be added to the training set, and the training can be repeated using the same descriptors or reselecting descriptors from the pool of possible descriptors. Improved results can be expected for similar compounds.
Figure 6.20 Prediction of 1H-NMR chemical shifts for a molecule by CPG neural networks.
Figure 6.21 Plot of observed chemical shifts against the predictions of the neural networks for the protons of the prediction set.
6.9 Applications in Analytical Chemistry

Interpretation is one of the regular tasks in the analytical laboratory. Nearly all data from instruments, whether single values, vectors, diagrams, or images, have to be interpreted. Many expert systems contain a knowledge base in the form of a decision tree that is constructed from a series of decision nodes connected by branches. In expert systems developed for the interpretation of spectra, decision trees are typically used in a sequential manner. As in the interpretation of a spectrum by an expert, decisions can be broadened to global problems or restricted to special ones; a larger tree can include more characteristics of a spectrum and can be considerably more accurate in decision making.
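As a minimal illustration of such a sequential decision tree, the following sketch chains yes/no decision nodes over simple spectral features; the band positions and conclusions are invented for illustration and are not rules from an actual expert system.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class DecisionNode:
    """A node of a sequential interpretation tree: a test on the spectrum and two branches."""
    test: Callable[[dict], bool]
    if_true: Union["DecisionNode", str]
    if_false: Union["DecisionNode", str]

    def interpret(self, spectrum: dict) -> str:
        branch = self.if_true if self.test(spectrum) else self.if_false
        return branch if isinstance(branch, str) else branch.interpret(spectrum)

# Invented example rules: band positions in cm-1 mapped to coarse conclusions
tree = DecisionNode(
    test=lambda s: s.get("band_3000_3100", False),       # aromatic C-H stretch?
    if_true=DecisionNode(
        test=lambda s: s.get("band_730_770", False),      # monosubstitution pattern?
        if_true="monosubstituted aromatic ring likely",
        if_false="aromatic system likely"),
    if_false="no aromatic system indicated",
)
print(tree.interpret({"band_3000_3100": True, "band_730_770": True}))
```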
6.9.1 Gamma Spectrum Analysis

Gamma spectroscopy is a radiochemical measurement method that allows the identification and quantitative determination of the activity of radionuclides that emit gamma radiation or x-rays. The equipment used in gamma spectroscopy includes an energy-sensitive radiation detector, such as a semiconductor, scintillator, or proportional counter, and a multichannel analyzer. The energies and the photon yields are characteristic of specific nuclides. Gamma spectroscopy was for many years found only in high-tech laboratories. The growing demand for instruments that can immediately identify radionuclides on site led to the development of handheld devices that became commercially available in the last decade. Typical application areas are transport inspection, waste monitoring, leakage control, scrap monitoring, and the search for radioactive materials in restricted areas, such as airports. Nuclear medicine and research use gamma spectroscopy for determining radionuclide purity. Most radioactive sources produce gamma rays of various energies and intensities. These emissions are analyzed with a multichannel analyzer that produces a gamma energy spectrum. A detailed analysis of this spectrum allows the identity and quantity of the gamma emitters present in the source to be determined.
One of the expert systems developed for qualitative and quantitative radionuclide identification in gamma spectrometry is SHAMAN [86]. This system can be coupled with gamma spectrum-analysis systems such as SAMPO, which was developed for automating the qualitative identification of radionuclides as well as for determining the quantitative parameters of the spectrum components [87]. SHAMAN uses a library of 2600 radionuclides and 80,000 gamma lines, as well as a rule base consisting of 60 inference rules. The rule base includes energies, relative peak intensities, genesis modes, half-lives, and parent–daughter relationships as criteria and can be extended by the user. The expert system has been evaluated using test cases from environmental samples and standard sources and produced good results even for difficult spectra or in cases where the concentrations of radionuclides are very low and side effects like self-absorption usually decrease the quality of determination. SHAMAN takes these phenomena into account and produces reliable and accurate results. SHAMAN has also been evaluated as an automated radionuclide identifier within the scope of the Comprehensive Nuclear-Test-Ban Treaty (CTBT), which was established in 1996, covers constraints on the development and qualitative improvement of nuclear weapons, and has achieved strong worldwide support.
6.9.2 Developing Analytical Methods — Thermal Dissociation of Compounds
Atomic absorption spectrometry (AAS) relies on the atomization of substances in an appropriate medium, like a flame, a plasma, or a graphite tube, and on the capability of the free atoms to absorb light of a specific wavelength [89]. One of the usual techniques for the determination of trace elements is the heated graphite atomizer (HGA), which consists of a graphite tube connected as a resistor in a high-current electrical circuit (Figure 6.22). The sample is typically introduced in solution, and a temperature program dries the sample before the atomization step. Atomization is performed by applying a strong electrical current to the graphite tube, which heats up at around 2000 K/s, thus atomizing the sample and releasing free atoms into the gas phase. The reaction mechanisms in a heated graphite atomizer are essential for the development and optimization of an appropriate analytical method; consequently, the atomization mechanisms in heated graphite tubes have been the subject of several investigations.
Figure 6.22 An HGA used for the atomic spectrometric determination of trace elements in solid or dissolved samples. The graphite tube is connected as a resistor between two graphite contact cylinders. Quartz window seals are attached to both ends of the primary beam plane, and the entire unit is water cooled.
The reaction that leads to free atoms is mainly determined by the reductive properties of carbon. The reactions in a graphite tube that finally lead to atomization start in most cases from the oxides. The following mechanisms of atomization are conceivable:

• Thermal solid-phase dissociation: MO(s) → M(g) + O(g)
• Thermal gas-phase dissociation: MO(s/l) ⇌ MO(g) → M(g) + O(g)
• Thermal halide dissociation: MX(s/l) ⇌ MX(g) → M(g) + X(g)
• Reduction by graphite carbon: MO(s/l) + C → M(s/l) + CO(g), with M(s/l) ⇌ ½ M₂(g)
• Reduction by gaseous intermediate carbides: M(g) + C(s) → MC(g), followed by MO(s/l) + MC(g) ⇌ 2 M(g) + CO(g)
Thermal dissociation in the solid phase takes place if the decomposition point of the product is significantly lower than its vaporization point. In the opposite case, thermal dissociation takes place in the gas phase. Halides usually have a significantly lower dissociation temperature than the corresponding oxides; that is, dissociation in the solid phase is barely possible. The extremely high electron density in a graphite tube at temperatures above 1200°C can lead to a reduction of stable metal oxides of, for instance, iron, chromium, and manganese in the solid or liquid phase. This reduction is typically observed at temperatures around 500°C lower than the dissociation point of the oxides. The last reaction is the dissociation of carbides in the gas phase:
MC(s/l) → M(g) + C(s/l)
It is determined by the dissociation enthalpy ΔH°_D,

ΔH°_D = ΔH°_B(M) − ΔH°_B(MC)   (6.10)

where ΔH°_B is the enthalpy of formation in the condensed phase; if ΔH°_B(MC) for the metal carbide is negative, the process of carbide formation is exothermic. For stable carbides,

ΔH°_D > ΔH°_B(M)   (6.11)
applies, whereas both enthalpies are similar for unstable carbides. The dissociation under these conditions usually produces monatomic fragments. The partial pressures of the particular species are extremely small (10⁻⁷ to 10⁻⁸ bar), and the probability of recombination is relatively low. The time in which free atoms can be detected depends on factors like thermal convection, diffusion, gas expansion, and condensation in colder parts of the graphite tube. Diffusion, expansion, and convection effects are relatively stable and under the same conditions account for merely about 15% of the atom loss. Condensation, however, has a significant influence on the amount of free atoms in the gas phase. In particular, the irregular temperature distribution along the tube during the atomization process as well as the delay in the heating rate of the gas within the tube lead to recondensation effects that remove the atoms from the measurement process. The dissociation energy of a compound in the solid or gas phase plays an important role in the atomization process, and thermodynamic and kinetic relationships have to be taken into account. However, the conditions in a graphite tube heated up to more than 2000 K are so extreme that only few reference data exist.
Figure 6.23 The temperature of the gas phase and the graphite wall in a heated graphite atomizer may differ significantly during the atomization process. The gas phase shows a delay of several hundred milliseconds until it reaches the temperature of the graphite wall. The highlighted area shows the initial range of the atomization signal, where the first atoms are formed.
Herzberg et al. showed with the help of coherent anti-Stokes Raman scattering (CARS) thermometry that heating at a rate of 1700 K/s causes a delay of several hundred milliseconds between the temperature of the gaseous phase and that of the tube wall (Figure 6.23) [90]. However, in the first 200 to 300 ms of the atomization phase the delay is minimal, and only beyond 300 ms is a roughly proportional delay observed. Let us have a look at the thermodynamic approach [91]. In transition state theory, the reaction rate for a monomolecular dissociation mechanism depends on the partial pressure of the metal oxide, p_MO(g). If we incorporate the loss of atoms p′_M(g), we obtain for the reaction rate

dp_M(g)/dt = k_1 · p_MO(g) − p′_M(g)   (6.12)
If we assume a linear decrease of atom density from the center (where the sample is injected) to the ends of the graphite tube, the loss can be approximated by
p′_M(g) = k_V · p_M(g)   (6.13)
with k_V as a temperature-independent loss constant. The partial pressure of the metal will be constant for the short time of equilibrium; that is, for a given temperature we obtain

dp_M(g)/dt = 0   (6.14)

p_M(g) = (k_1/k_V) · p_MO(g)   (6.15)
Let us now have a look at the atomization process. The measured absorbance in atomic absorption spectrometry is proportional to the concentration of free atoms in the gas phase. By introducing a constant k_A for the relationship between absorbance and concentration, the absorbance A_T at temperature T would be

A_T = (k_A · k_1/k_V) · p_MO(g)   (6.16)
The rate constant k_1 for the monomolecular reaction can be calculated by

k_1 = (kT/h) · e^(−ΔG_A°*/RT) = (kT/h) · e^(ΔS_A°*/R) · e^(−ΔH_A°*/RT)   (6.17)
where k and h are the Boltzmann and Planck constants, respectively, ΔG°* is the free activation enthalpy, ΔH°* is the activation enthalpy, and ΔS°* is the activation entropy. We now introduce an equilibrium constant K_p for the relation between the partial pressure of the metal oxide and the activity of the condensed phase a(s), which, according to van't Hoff's equation, is

ln K_p = −ΔH°/(RT) + C   (6.18)
where ΔH° is the change in standard enthalpy of the phase transition and C is an integration constant. If we make the valid approximation of replacing the activation enthalpy by the activation energy ε_A, we can introduce Equations 6.17 and 6.18 into Equation 6.16 to get

ln A_T = −(ε_A + ΔH°)/(RT) + ΔS_A°/R + ln(k_A · a_s · kT · C′/(h · k_V))   (6.19)

or, rearranged,

ln A_T = −(ε_A + ΔH°)/(RT) + A_0   (6.20)

where A_0 collects the temperature-independent terms of Equation 6.19.
This equation describes the fundamental relationship between the measured absorbance and the activation energy required for the atomization process. Now back to the problem previously defined. We can now measure the absorbance during the initial phase of the atomization process, where the relationship between ln A and 1/T can be assumed to be linear, and calculate the activation energy required for the atomization step from the slope of a graph of log A against 1/T. By comparing the resulting activation energy with published activation energies from a database, we can predict the mechanism of the initial atomization step.
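This slope evaluation is straightforward to implement. The following Python sketch (not taken from the original work; the absorbance readings and the reference activation energies are invented for illustration) fits ln A against 1/T over the initial signal range and assigns the closest tabulated mechanism:

import numpy as np

R = 8.314  # J/(mol K)

# Hypothetical absorbance readings from the initial range of an atomization signal.
T = np.array([1600.0, 1650.0, 1700.0, 1750.0, 1800.0])   # temperature in K
A = np.array([0.012, 0.021, 0.035, 0.056, 0.088])         # measured absorbance

# According to Equation 6.20, ln A = -(eps_A + dH)/(R*T) + A0, so the slope of
# ln A versus 1/T equals -(eps_A + dH)/R.
slope, intercept = np.polyfit(1.0 / T, np.log(A), 1)
activation_energy = -slope * R / 1000.0                    # lumped term in kJ/mol

# Illustrative reference values only; a real system would query a database here.
reference = {"reduction of metal oxide": 250.0,
             "thermal dissociation of gaseous oxide": 400.0,
             "thermal dissociation of gaseous halide": 150.0}
mechanism = min(reference, key=lambda m: abs(reference[m] - activation_energy))
print(f"activation energy = {activation_energy:.0f} kJ/mol -> {mechanism}")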
Figure 6.24 Logarithms of absorbance plotted against the inverse temperature for iron and lead. Iron shows a straight regression line with a constant slope that indicates a single step atomization during the observation. Lead shows two segments with different slopes due to a change of the atomization process during the observation phase from solid lead to the atomization of the metal dimer in the gas phase.
Figure 6.25 Schematic view of a Kohonen network trained with 64 absorbance values from the initial range of the atomization signal and a class component for the atomization mechanism.
Figure 6.24 shows schematic views of two such graphs for iron and lead. Interestingly, lead exhibits two different segments — that is, two different activation energies. This is due to a change in the atomization process at higher temperatures, where the metal dimer becomes gaseous and atomizes in the gas phase. The first activation energy corresponds to the enthalpy of vaporization, whereas the second corresponds to the dissociation energy of the metal–metal bond. The absorbance values from the initial range of the atomization peaks are nothing other than a descriptor for the atomization process. Consequently, we can use these values to train a Kohonen neural network for predicting the atomization mechanism in heated graphite tube atomizers (Figure 6.25). Figure 6.26 shows the result of such a prediction, where the training set was divided into five classes for the final atomization step. The atomization mechanisms for different metal solutions can be accurately predicted.
[Figure 6.26 map regions: reduction of metal oxide; thermal dissociation of gaseous oxide; thermal dissociation of gaseous halide; thermal dissociation of metal carbide; thermal dissociation of metal dimer. Central neurons are marked for test-set metals including Cr, Mo, V, Cd, Pb, Mg, Sn, Co, Cu, Fe, and Ni.]
Figure 6.26 The Kohonen map of a network trained with 64 absorbance values from the initial range of atomization signals of different compounds shows a clear separation into four areas of atomization processes. The only conflict occurs with thermal dissociation of metal carbides and metal dimers. The map indicates the central neurons for metals from an independent test set.
Extending this solution with appropriate data about the compounds to be analyzed — like the melting and boiling points of the metals, oxides, halides, and carbides — we can use the predicted mechanism to propose an appropriate temperature program for the heated graphite atomizer. Integrating all of these results with simple rules for matrix modifiers, and embedding them in a question-and-answer dialog for the analytical chemist, yields an expert system that provides analytical methods for graphite furnace atomic absorption spectrometry.
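The descriptor-based prediction step can be sketched with a small self-organizing map. The following Python code is a minimal illustration only; the 64-point absorbance curves, the class labels, and the map size are invented and do not reproduce the training data or network used in the original study:

import numpy as np

rng = np.random.default_rng(1)
n_points, side = 64, 8

# Hypothetical descriptors: 64 absorbance values from the initial signal range.
# The curve shape is invented; its steepness stands in for the mechanism class.
def descriptor(mechanism_id):
    t = np.linspace(0.01, 1.0, n_points)
    return t ** (1 + mechanism_id) + rng.normal(0, 0.02, n_points)

classes = ["oxide reduction", "gaseous oxide dissociation", "gaseous halide dissociation",
           "carbide dissociation", "dimer dissociation"]
X = np.array([descriptor(c) for c in range(len(classes)) for _ in range(20)])
y = np.array([c for c in range(len(classes)) for _ in range(20)])

# Train a small Kohonen (self-organizing) map on the descriptors.
grid = np.array([(i, j) for i in range(side) for j in range(side)])
w = rng.random((side * side, n_points))
for epoch in range(30):
    lr = 0.5 * (1 - epoch / 30)
    sigma = 3.0 * (1 - epoch / 30) + 0.5
    for x in X[rng.permutation(len(X))]:
        win = np.argmin(((w - x) ** 2).sum(axis=1))
        h = np.exp(-((grid - grid[win]) ** 2).sum(axis=1) / (2 * sigma ** 2))[:, None]
        w += lr * h * (x - w)

# Label each neuron with the majority class of the descriptors it wins.
wins = np.array([np.argmin(((w - x) ** 2).sum(axis=1)) for x in X])
neuron_class = {n: np.bincount(y[wins == n]).argmax() for n in set(wins)}

# Predict the mechanism of a new curve from its winning neuron (or the nearest labeled one).
test = descriptor(3)
win = np.argmin(((w - test) ** 2).sum(axis=1))
if win not in neuron_class:
    labeled = np.array(list(neuron_class))
    win = labeled[np.argmin(((grid[labeled] - grid[win]) ** 2).sum(axis=1))]
print("predicted mechanism:", classes[neuron_class[win]])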
6.9.3 Eliminating the Unnecessary — Supporting Calibration
X-ray fluorescence spectroscopy is a widely used analytical method for the automated sequential analysis of major and trace elements in metals, rocks, soils, and other usually solid materials. The advantage of this method is that no separation is necessary. The sample is irradiated with an x-ray beam that causes secondary fluorescence in the sample, which can be measured at a certain wavelength to determine the concentration of the element in the sample [92]. However, methods that do not require separation before analysis have a drawback in common: interelemental effects. Physical or chemical separation of the analyte from its matrix avoids interelemental effects but is time-consuming, expensive, or tedious. Physical methods have the general advantage that interelement effects can be calculated. In x-ray fluorescence spectroscopy this is performed by taking the absorption of the primary beam and of the secondary fluorescence radiation into account. Figure 6.27 shows a profile of a sample during irradiation. If we look at the fluorescence radiation that originates from a point at a certain sample depth, we can define the intensity of the primary (excitation) beam in the layer dl by Beer's law of absorption:

I(λ_0) = I_0(λ_0) · e^(−μ·ρ·d/sin φ)   (6.21)
Figure 6.27 Schematic view of scattering and diffraction of an x-ray beam penetrating a sample. The primary beam (λ_0) enters at an angle φ and leads to a secondary fluorescence beam (λ_i) at the sample depth d. A scattered primary beam emerges at an angle ψ to the sample plane.
where I(λ_0) is the intensity of the primary beam at wavelength λ_0 in the layer dl, and I_0(λ_0) is the intensity of the primary beam at the primary wavelength λ_0, corrected by the exponential term that takes the mass absorption coefficient μ, the density ρ, and the incident angle φ into account. The intensity of the fluorescence beam I(λ_i) of element i at wavelength λ_i also obeys Beer's law, and both can be combined in a single expression,

I(λ_i) = I(λ_0) · e^(−μ(λ_0)·ρ·d/sin φ − μ_i(λ_i)·ρ·d/sin ψ)   (6.22)

which additionally includes the fluorescent beam angle ψ. The detector will also receive incoherent scatter at wavelength λ_0*:

λ_0* − λ_0 = 0.0243 · (1 − cos Φ)   (6.23)
Several additional effects determine the count rate at the detector, as shown in Figure 6.28. Taking all effects into account requires complex calculations based on iterative measurements and corrections for all analytes and the interfering matrix components. An example of a comparatively simple correction model is the following equation:
c_i = [D_i − Σ_l (f_l · c_l)] + E_i · [(r_i − f_b · r_b(i)) / (r_s − f_b · r_b(s))] · [1 + Σ_j ((μ_j(λ_eff) + (sin φ/sin ψ) · μ_j(λ_i)) / (μ_i(λ_eff) + (sin φ/sin ψ) · μ_i(λ_i)) − 1) · c_j]   (6.24)

The intercept correction, the ratio of background-corrected count rates, and the interelement sum correspond to the terms (a), (b), and (c) discussed below.
Figure 6.28 Effects in a sample irradiated with x-rays. The characteristic (fluorescence) emission is the desired analytical effect that needs to be separated from the other secondary emissions. The primary x-ray beam scatters coherently (without loss in energy) and incoherently (losing energy) and is recorded to a small extent by the detector. The characteristic fluorescence beam not only undergoes absorption by other metals in the sample but may also be excited by secondary or tertiary fluorescence from other elements.
The correction model according to de Jongh is basically a linear regression model for calibration with c_i as the concentration of the analyte i, intercept D_i, and slope E_i multiplied by the measured signal, in this case the count rate r. The additional terms are a correction factor for the intercept (a) for an element l that shows line overlap, the correction of the measured count rate against internal correction channels s (b), and a complex term (c) that takes the mass absorption coefficients μ at a particular wavelength λ, the geometry (primary beam angle φ and fluorescence beam angle ψ), as well as the concentration c_j of the interfering element j into account. Needless to say, this calibration is an iterative procedure in which the analyte and all interfering elements have to be measured and iteratively corrected until convergence is established. A CPG neural network can help to find the appropriate interelement coefficients by training the network with pairs of descriptors, one of which contains the raw count rates for the interfering elements, whereas the other contains experimentally determined interelement coefficients (Figure 6.29). Trained with values of a well-defined type of chemical matrix, the network is able to predict the interelement coefficients that can finally be used to correct the calibration graph used for determining the element concentrations (Figure 6.30).
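A counterpropagation-style network of this kind can be sketched in a few lines of Python. The data below are random stand-ins (ten count rates in, one hundred interelement coefficients out), and the network size and training schedule are arbitrary choices, not the configuration used in the text:

import numpy as np

rng = np.random.default_rng(2)
n_in, n_out, n_neurons = 10, 100, 25   # count rates in, interelement coefficients out

# Hypothetical training pairs: raw count rates for ten major elements and the
# experimentally determined interelement coefficients for the same samples.
X = rng.random((200, n_in))
true_map = rng.random((n_in, n_out))
Y = X @ true_map + rng.normal(0, 0.01, (200, n_out))   # stand-in for measured coefficients

# Counterpropagation: a Kohonen input layer plus an output layer trained alongside it.
w_in = rng.random((n_neurons, n_in))
w_out = np.zeros((n_neurons, n_out))
for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)
    for x, ycoef in zip(X, Y):
        winner = np.argmin(((w_in - x) ** 2).sum(axis=1))
        w_in[winner] += lr * (x - w_in[winner])         # adapt input (Kohonen) layer
        w_out[winner] += lr * (ycoef - w_out[winner])   # adapt output layer toward coefficients

# Prediction: the winning neuron's output weights are the estimated coefficients.
x_new = rng.random(n_in)
coefficients = w_out[np.argmin(((w_in - x_new) ** 2).sum(axis=1))]
print(coefficients[:5])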
6.10 Simulating Biology

6.10.1 Estimation of Biological Activity
A main topic in molecular pharmacology deals with the interaction between a receptor molecule and an agonist, either alone or in the presence of a competing antagonist. In a simple case, one molecule of agonist binds reversibly to a receptor
Figure 6.29 Schematic view of a CPG neural network trained with vector pairs. The input vector consists of ten countrates for major elements in rock samples, whereas the output vector contains the interelement coefficients µ for all major elements. After training, the network is able to predict the interelement coefficients for the given sample matrix.
molecule to form an active agonist–receptor complex, which generates a pharmacological response while the agonist remains bound. Specialized RDF descriptors are valuable for characterizing the behavior of an agonist and indicating its biological activity. Details on receptor kinetics can be found in several textbooks, such as the one by Kenakin [93]. A brief introduction to the concepts of effective and inhibitory concentrations is given here.
6.10.2 Radioligand Binding Experiments
Competitive radioligand binding experiments are used to determine whether a drug binds to a receptor and to investigate the interaction of low-affinity drugs with receptors. These experiments measure the binding of a single concentration of a labeled ligand in the presence of various concentrations of unlabeled ligand. The basis is a simple first-order kinetics model,

Ligand + Receptor ⇌ Ligand–Receptor   (equilibrium constant K_D)
[Figure 6.30 plots count rate (kcps) against concentration (ppm); the uncorrected data give r² = 0.028, the corrected calibration r² = 0.930.]
Figure 6.30 Calibration graph of a trace element before and after correction with interelement coefficients predicted by a CPG neural network. Using raw uncorrected countrates from the instrument (gray line) leads to a poor correlation inappropriate for analytical determination. After correction with the predicted matrix-specific interelement coefficients, the calibration leads to a reasonable regression line.
with the law of mass action

K_D = [Ligand][Receptor] / [Ligand–Receptor]   (6.25)
When ligand and receptor collide due to diffusion, they remain bound together for a random amount of time influenced by the affinity of the receptor and ligand for one another. After dissociation, the ligand and receptor are the same as they were before binding. Equilibrium is reached when the rate of association at which new ligand-receptor complexes are formed equals the rate of dissociation. KD is the concentration of ligand that occupies half of the receptors at equilibrium. Thus, a small KD means that the receptor has a high affinity for the ligand, whereas a large KD means that the receptor has a low affinity for the ligand.
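The practical meaning of K_D can be illustrated with the fractional receptor occupancy that follows from Equation 6.25 together with the conservation of receptor sites. The short Python sketch below is a generic illustration with an invented K_D value, not data from the text:

# Fractional occupancy = [L] / ([L] + K_D), derived from Equation 6.25
# with [Receptor_total] = [Receptor] + [Ligand-Receptor].
def occupancy(ligand_nM: float, kd_nM: float) -> float:
    return ligand_nM / (ligand_nM + kd_nM)

kd = 2.0  # nM, hypothetical high-affinity receptor
for conc in (0.2, 2.0, 20.0, 200.0):
    print(f"[L] = {conc:6.1f} nM -> occupancy = {occupancy(conc, kd):.2f}")

At a ligand concentration equal to K_D the occupancy is exactly 0.5, in line with the definition above.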
6.10.3 Effective and Inhibitory Concentrations
A radioligand binding experiment is performed with a single concentration of radioligand and, typically, 12 to 24 concentrations of unlabeled compound until equilibrium is reached. The results are used to determine a competitive binding curve (Figure 6.31), with the total radioligand binding expressed as femtomoles bound per milligram of protein (or number of binding sites per cell) on the ordinate and the logarithmic concentration of unlabeled drug on the abscissa. The maximum in this curve is a plateau at a value equal to radioligand binding in the absence of the competing unlabeled drug (total binding), whereas the minimum
Figure 6.31 Determination of the inhibitory concentration (IC50) in a radioligand binding experiment. The top of the curve indicates the value equal to radioligand binding in the absence of the competing unlabeled drug (Total Binding). The bottom of the curve equals nonspecific binding (Nonspecific Binding). The concentration of unlabeled drug that results in radioligand binding halfway between the upper and lower plateaus is the IC50.
is a plateau equal to nonspecific binding. The difference between the top and bottom plateaus is the specific binding. The concentration of unlabeled drug that results in radioligand binding halfway between the upper and lower plateaus is the inhibitory concentration (IC) at 50% binding, the IC50. The IC50 value in a radioligand-binding assay can be defined as the molar concentration of competing ligand (agonist or antagonist) that reduces the specific binding of a radioligand by 50%. In a functional assay, the IC50 is the molar concentration of antagonist that reduces the response to a fixed concentration of agonist to 50% of its original level. If the labeled and unlabeled ligands compete for a single binding site, the law of mass action determines the steepness of the competitive binding curve. The curve descends from 90% specific binding to 10% specific binding with an 81-fold increase in the concentration of the unlabeled drug. More simply, nearly the entire curve will cover two log units (a 100-fold change in concentration). Nonlinear regression is used to fit the competitive binding curve and determine the log(IC50); a minimal fitting sketch is given after the following list. Three factors determine the value of IC50:
1. The equilibrium dissociation constant KD for binding of the unlabeled drug: That is, the concentration of the unlabeled drug that will bind to half the binding sites at equilibrium in the absence of radioligand or other competitors; KD is proportional to the IC50 value. If KD is low (i.e., the affinity is high), the IC50 value will also be low.
2. The concentration of the radioligand: A higher concentration of radioligand will require a larger concentration of unlabeled drug to compete for the binding; increasing the concentration of radioligand will increase the IC50 value without changing KD.
3. The affinity of the radioligand for the receptor: It takes more unlabeled drug to compete for a tightly bound radioligand than for a loosely bound radioligand; using a radioligand with a higher affinity will increase the IC50 value.
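As referenced above, the log(IC50) is obtained by nonlinear regression. The following Python sketch fits a four-parameter logistic model to synthetic binding data; all numbers are invented, and the model form is a common generic choice rather than the specific procedure of the text:

import numpy as np
from scipy.optimize import curve_fit

# Four-parameter logistic model of a competitive binding curve:
# binding(x) drops from 'top' (total binding) to 'bottom' (nonspecific binding),
# with x = log10 concentration of the unlabeled drug and midpoint log_ic50.
def binding(logc, top, bottom, log_ic50, slope):
    return bottom + (top - bottom) / (1.0 + 10 ** (slope * (logc - log_ic50)))

# Synthetic measurements (fmol/mg protein) over a concentration series.
logc = np.linspace(-10, -4, 13)
rng = np.random.default_rng(3)
data = binding(logc, 1200.0, 150.0, -7.0, 1.0) + rng.normal(0, 30, logc.size)

params, _ = curve_fit(binding, logc, data, p0=[data.max(), data.min(), -7.5, 1.0])
top, bottom, log_ic50, slope = params
print(f"log(IC50) = {log_ic50:.2f}  ->  IC50 ≈ {10 ** log_ic50:.2e} M")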
KD can be calculated from the IC50 value using the equation of Cheng and Prusoff [94]. The molar concentration of an agonist that produces 50% of the maximum possible response for that agonist is the effective concentration (EC) at 50%, the EC50. In most investigations the logarithms of the effective or inhibitory concentration, pEC50 (pIC50), are used. For an antagonist, selectivity may be expressed as the ratio of its relative affinities for each receptor type; this is exclusively a drug-dependent value. For an agonist, selectivity may be expressed as the ratio of potencies (e.g., EC50 values) that activate the receptors in a particular functional system. However, this value will be both drug and tissue dependent — because it depends on the coupling efficiency of the tissue — and does not allow for receptor classification.
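For completeness, the Cheng and Prusoff conversion in its commonly used form can be written down directly; the concentrations in the example below are hypothetical:

# Cheng-Prusoff relation in its common form: K_i = IC50 / (1 + [L]/K_d),
# where [L] is the radioligand concentration and K_d its dissociation constant.
def cheng_prusoff_ki(ic50_nM: float, radioligand_nM: float, kd_radioligand_nM: float) -> float:
    return ic50_nM / (1.0 + radioligand_nM / kd_radioligand_nM)

# Hypothetical assay: IC50 = 100 nM measured with 2 nM radioligand (K_d = 1 nM).
print(f"K_i ≈ {cheng_prusoff_ki(100.0, 2.0, 1.0):.1f} nM")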
6.10.4 Prediction of Effective Concentrations
The biological activity of a molecule depends on its ability to interact with a specific binding site on the corresponding receptor. In most cases, biological activity correlates directly with binding affinity (BA). The binding affinity, usually characterized by the binding constant KD for the specific receptor, depends on the presence or absence of particular functional groups and the overall three-dimensional structure of the molecule. Stereoisomerism may play an important role in this respect: molecules with the same chemical composition but a different spatial orientation of their substituents at critical points (e.g., the C-5 position in steroids) may have totally different binding properties and biological effects. Thus, 5α-reduced dihydrotestosterone (DHT) is a potent androgen with a strong affinity for intracellular androgen receptors, whereas its 5β-epimer does not bind to these receptors. Isomerization can therefore lead either to inactivation or to a change in the biological properties of the original molecule. The evaluation of the complicated structure–activity relationships of molecules suggests using a descriptor that can incorporate additional molecular information. Three experiments will show that 2D RDF descriptors in particular are valuable for this task.
6.10.5 Progestagen Derivatives
Progestagens are the precursors of progestins, sex hormones essential for preparing the uterus for implantation of a fertilized ovum during pregnancy. The most important progestin is progesterone. The primary therapeutic uses of progestins are in contraceptive pills and in hormone replacement regimens, where they counter the proliferative effects of estrogens. Antiprogestins are used for therapeutic abortion. A compilation of 44 progestagen derivatives (Figure 6.32) was used for the following experiments together with their BAs to receptor proteins in MCF-7 cells, a cell line of human breast cancer cells [95]. This data set was originally published by Bursi et al. [96]. In contrast to the previous experiments, this data set was energy minimized with the Tripos force field and has AM1 charges. Compounds were rigidly superimposed on the C and D steroidal rings. Due to the high similarity of the compounds, one-dimensional RDF descriptors are not able to describe such a complex property as the biological activity. On the one hand, they would react too sensitively to primary structural features (e.g., the
Figure 6.32 Base structure for a set of 44 progestagen derivatives used for training a CPG neural network for the prediction of biological activity.
Figure 6.33 3D view of a charge-weighted 2D RDF descriptor for a progestagen derivative encoded with effective atom polarizability (α) in the second dimension (128 × 40; 5120 components).
existence of double bonds); on the other hand, they are too insensitive to enable a reliable discrimination between these similar compounds. Thus, for the following experiments a two-dimensional RDF descriptor was used. The effective atom polarizability was chosen as the second property in a charge-weighted distribution. Figure 6.33 shows a 2D view of the descriptor used in the training and prediction process of a Kohonen network. Because only 44 compounds were available, the leave-one-out technique was applied, where each compound was predicted by using the remaining 43 compounds in the training set. Figure 6.34 shows the correlation between experimental and predicted effective concentrations. The correlation coefficient is quite high but less reliable: the few training compounds do not cover the range of effective concentrations appropriately, and the
[Figure 6.34 regression line: y = 0.8047x + 0.0355, R² = 0.967; both axes in effective concentration/10⁶ nM/L.]
Figure 6.34 Correlation between calculated and predicted effective concentrations (EC50) for 44 progestagen derivatives with the leave-one-out technique. The standard deviation of the error of prediction is 0.1 nM/L.
values do not exhibit a standard distribution. The correlation graph shows six outliers (about 13%) parallel to the regression line. Even though there is some uncertainty in the slope of the regression line, a clear trend occurs. The standard deviation of the error of prediction is 0.1 nM/L; in fact, about 60% of the effective concentrations can be predicted within a reasonable limit of 0.05 nM/L.
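The leave-one-out protocol itself is generic and easy to reproduce. In the Python sketch below, a simple nearest-neighbor regressor stands in for the trained network, and the descriptors and effective concentrations are random placeholders rather than the actual data set:

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 44 descriptor vectors (e.g., flattened 2D RDF descriptors)
# and their measured effective concentrations.
X = rng.random((44, 5120))
ec50 = rng.random(44) * 1.8          # nM/L, invented values

# Leave-one-out: predict each compound from the remaining 43.
# A 1-nearest-neighbor regressor stands in for the trained CPG/Kohonen network.
predicted = np.empty_like(ec50)
for i in range(len(X)):
    train = np.delete(np.arange(len(X)), i)
    nearest = train[np.argmin(((X[train] - X[i]) ** 2).sum(axis=1))]
    predicted[i] = ec50[nearest]

residuals = predicted - ec50
print(f"standard deviation of the error of prediction: {residuals.std():.2f} nM/L")
print(f"correlation r^2: {np.corrcoef(predicted, ec50)[0, 1] ** 2:.3f}")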
6.10.6 Calcium Agonists
An agonist is a signaling molecule — hormone, neurotransmitter, or synthetic drug — that binds to a receptor, inducing a conformational change that produces a response such as contraction, relaxation, secretion, or a change in enzyme activity. Calcium agonists interact with the charge-dependent ion channels in cell membranes and enable Ca ion transfer. Table 6.2 lists a set of compounds that act as calcium agonists together with their effective concentrations on a logarithmic scale (pEC50). The data set consists of 36 pyrrole derivatives containing a phenyl group with different substituents. The compounds were classified into five groups according to their effective concentrations, and a charge-weighted 2D RDF descriptor with atom polarizability in the second dimension (5120 components) was calculated. The classification by the Kohonen network results in two clearly separated clusters of high- and low-activity compounds. The three classes of intermediate biological activity do not form a proper cluster; however, a classification into high-, intermediate-, and low-activity compounds is sufficient for screening biological activity. Figure 6.35 shows the correlation of the experimental values with the predicted ones. As with the previous experiment, the correlation coefficient is quite high. Some of the predicted values exhibit a deviation from the experimental value that exceeds the standard deviation of 0.6.
Table 6.2 Pyrrole Derivative Structures and Effective Concentrations for 36 Calcium Agonists (R1 and R2 denote the substituents on the phenyl-substituted pyrrole scaffold depicted in the original table)

R1                  R2     pEC50
H                   H      –2.83
CH3                 H      –1.96
CH3                 CH3    –2.86
CH2C6H11            H       3.31
CH2CH2Ph            H       2.07
CH2Ph               H       3.56
CH2Ph               F       2.94
CH2(4′-F-Ph)        F       2.94
CH2(4′-F-Ph)        H       2.63
CH2(4′-NH2Ph)       H      –3.08
CH2(2′-NO2Ph)       H       1.06
CH2(4′-NO2Ph)       H       1.46
Ph                  H      –1.74
F                   H      –4.67
Cl                  H      –2.36
Cl                  Cl     –1.10
Cl                  NO2    –1.83
Br                  F      –1.51
Br                  H      –1.89
I                   H      –1.51
CF3                 H      –1.30
NHC6H11             H       3.31
NHPh                H       2.93
NH-pyrid-2′-yl      H       2.04
OCH3                H      –5.24
OCH2-Ph             H       0.12
O-Ph                H       1.06
O(4′-NO2Ph)         H      –0.04
OCO(2′-OH-Ph)       H      –1.07
OSO2(4′-MePh)       H      –4.93
SCH2Ph              H       1.06
S(4′-NO2Ph)         H       0.21
SO2CH2-Ph           H      –3.86
SOCH2-Ph            H      –1.16
SOPh                H      –1.07
S-Ph                H       0.94
Compared with the previous experiment, this leads to a relatively high uncertainty in the error of prediction, particularly since a logarithmic scale is used. However, many of the effective concentrations are predicted within a reasonable error, and the regression line shows a clear trend.
6.10.7 Corticosteroid-Binding Globulin (CBG) Binding Steroids
Because of their lipophilic properties, free steroid molecules are only sparingly soluble in water. In biological fluids, they usually occur either in a conjugated form or bound to proteins in a noncovalent, reversible binding. In plasma, nonconjugated steroids occur mostly bound to carrier proteins. They are bound in a rather unspecific form to plasma albumin (up to 50% of the bound fraction) and in a more stringent, stereospecific form to CBG. CBG is a plasma glycoprotein that binds glucocorticoids with high affinity; it is responsible for the transport and bioavailability of these hormones. In this final investigation of structure–activity relationships, a
[Figure 6.35 regression line: y = 0.8847x − 0.0184, R² = 0.9499; axes: predicted versus experimental log effective concentration (pEC50).]
Figure 6.35 Correlation between calculated and predicted effective concentrations (EC50) for 36 Ca agonists with the leave-one-out technique. The standard deviation is 0.59.
[Figure 6.36 regression line: y = 0.9847x − 0.0961, R² = 0.9877; axes: predicted versus experimental effective concentration (pEC50).]
Figure 6.36 Correlation between calculated and predicted effective concentrations pEC50 for 31 CBG binding steroids with the leave-one-out technique. The standard deviation is 0.12.
well-investigated data set of 31 CBG binding steroids was used [97]. The 2D RDF descriptor was adapted to the property range of the compounds. Figure 6.36 shows the correlation between predicted and experimental data. Although six outliers occur in the correlation plot, they exhibit the same trend as the regression line. The results of the prediction are generally better than in the previous experiments. One of the reasons might be the higher precision of effective concentrations for this well-investigated data set; prediction accuracy depends on the precision of biological activity data.
The correlation of structures and biological activities with the aid of 2D RDF descriptors leads to reliable predictions. Nevertheless, the matter of structure–activity relationships is a complex problem depending on many atomic and molecular properties and experimental conditions. It is advisable to use several descriptors in combination with 2D RDF descriptors and to adjust the latter by an adequate selection of calculation conditions for the specific task.
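For readers who want to experiment with such descriptors, the following Python sketch implements one plausible form of a charge-weighted, property-resolved radial distribution function. The functional form, the smoothing parameters, the grids, and the three-atom molecule are assumptions for illustration and are not necessarily identical to the 2D RDF formulation used in this chapter:

import numpy as np

# One plausible form of a charge-weighted 2D RDF descriptor g(r, p):
# g(r, p) = sum over atom pairs i<j of q_i*q_j * exp(-B*(r - r_ij)^2) * exp(-D*(p - p_ij)^2)
# where p_ij is a second atomic-pair property (here the mean atom polarizability).
def rdf_2d(coords, charges, polariz, r_grid, p_grid, B=100.0, D=25.0):
    g = np.zeros((len(r_grid), len(p_grid)))
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = np.linalg.norm(coords[i] - coords[j])
            p_ij = 0.5 * (polariz[i] + polariz[j])
            w = charges[i] * charges[j]
            g += w * np.outer(np.exp(-B * (r_grid - r_ij) ** 2),
                              np.exp(-D * (p_grid - p_ij) ** 2))
    return g.ravel()   # flattened descriptor, e.g., 128 x 40 = 5120 components

# Invented three-atom example (coordinates in Å, partial charges, polarizabilities).
coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [1.8, 1.0, 0.0]])
charges = np.array([-0.4, 0.3, 0.1])
polariz = np.array([0.8, 1.2, 1.0])
descriptor = rdf_2d(coords, charges, polariz,
                    np.linspace(0.5, 13.0, 128), np.linspace(0.5, 2.5, 40))
print(descriptor.shape)   # (5120,)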
6.10.8 Mapping a Molecular Surface
Even though it has been known for 100 years or more that stereoselectivity plays an important role in drug activity, chiral drugs have been developed and used as racemates while neglecting the fact that they comprise mixtures of two or more compounds that may have quite different pharmacological properties [98,99]. An example with dramatic consequences is Thalidomide (2-(2,6-dioxo-3-piperidyl)isoindole-1,3-dione, Contergan). Thalidomide was developed by the German pharmaceutical company Grünenthal for the treatment of morning sickness in pregnant women, and it was stated to be extensively tested [100]. The molecule has a chiral center and can be produced in its pure (R)- or (S)-enantiomeric forms. The compound was originally sold as a racemic mixture even though initial animal studies indicated that the enantiomers have different biological properties. In 1959 an extremely rare type of birth defect began appearing across Germany: babies were born with hands and feet attached directly to the body, a condition known as phocomelia. Although these were first thought to be isolated cases, more and more similar cases of pregnant women who had taken the drug giving birth to babies with abnormally short limbs, eye and ear defects, or malformed internal organs were reported in the following years. By the end of 1961, a German newspaper published a letter by a German pediatrician, who described more than 150 children with malformations and associated them with Thalidomide given to their mothers [101,102]. After this disclosure, Grünenthal withdrew the drug from the German market. An estimated 8000 to 12,000 infants were born with deformities caused by Thalidomide, and of those only about 5000 survived beyond childhood. The effects of Thalidomide were subject to investigations for several decades. The results showed that Thalidomide can produce dysgenesis of fetal organs when the fetus is exposed to this drug 20 to 36 days after conception, which is during the organogenic period of human development [103]. The timing of exposure to Thalidomide during fetal development correlates with specific fetal damage, resulting in phocomelia (shortening of limbs) and amelia (absence of one or more limbs), kidney and heart dysgenesis, brain dysfunction, and death. For several years it was unclear whether any of the actions of racemic Thalidomide could be separated out using a pure enantiomer. Early investigations showed that calming and sleep-inducing effects have been associated with the R-enantiomer, whereas teratogenic effects are more closely associated with the S-enantiomer. However, since both enantiomers undergo rapid interconversion under physiological conditions, the use of pure (R)-thalidomide instead of the racemate would not have prevented the Contergan tragedy from happening.
Very limited access to pure enantiomers in the past has been responsible for this unsatisfactory state of affairs. During the last 20 years, significant achievements have made it possible to perform stereoselective synthesis and analysis. Today, novel chiral drugs are as a rule developed as single enantiomers. Yet studies of old racemic drugs are still designed, performed, and published without mention of the fact that two or more compounds are involved. In recent years, a number of old racemic drugs have been reevaluated and reintroduced into the clinical area as the pure, active enantiomer (the eutomer). Though in principle correct, the clinical benefit of this shift from a well-established racemate to a pure enantiomer often seems to be limited and sometimes exaggerated. Racemic drugs with a deleterious enantiomer that does not contribute to the therapeutic effect (the distomer) may have been sorted out in the safety evaluation process. However, in the future any pharmacological study of racemic drugs must include the pure enantiomers. This will generate new, valuable information on stereoselectivity in drug action and interaction. If we want to describe biological activity, we need to have a closer look at the surface of a molecule. The shape of a molecule as well as properties on molecular surfaces such as hydrophobicity, hydrogen bonding potential, and the electrostatic potential have a profound influence on many physical, chemical, or biological properties. A molecule must have a certain shape to fit into a receptor, and the properties on the surface of a ligand must correspond to those on the surface of a receptor. The study of the geometry of molecular surfaces and of the distribution of electronic and other properties on molecular surfaces may, therefore, give important insights into mechanisms of interactions of molecules and their influence on the properties of compounds. One of the important surface properties is the molecular electrostatic potential (MEP), which describes the work involved when a unit positive charge is brought from infinity to a certain point near a molecule. It can be calculated by quantum mechanical procedures of various degrees of sophistication or by a simple empirical point-charge model using partial charges of the atoms in a molecule. MEPs give detailed information for studies on chemical reactivity or pharmacological activity of a compound. The spatial distribution and the values of the electrostatic potential determine the attack of an electrophilic or nucleophilic agent as the primary event of a chemical reaction. The three-dimensional distribution of the electrostatic potential is largely responsible for the binding of a substrate molecule at the active site of a biologically active receptor. The three-dimensional nature of the electrostatic potential makes it difficult to simultaneously visualize its spatial distribution and its magnitude. However, we can deal with the electrostatic potential on the van der Waals surface of a molecule, which is of major importance for the molecular contact between a ligand and a receptor. The electrostatic potential can be calculated from point charges derived from partial atomic charges, the latter of which can be calculated by iterative partial equalization of orbital electronegativity (PEOE), a well-established empirical method for the rapid calculation of charge distributions [104]. The next problem is how to represent the spatial distribution of the electrostatic potential.
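Before turning to the representation problem, the point-charge estimate mentioned above can be sketched in a few lines of Python. The coordinates and partial charges below are invented, the physical prefactor is omitted, and PEOE itself is not implemented; the snippet only illustrates the Coulomb point-charge sum:

import numpy as np

# Point-charge model of the molecular electrostatic potential (MEP):
# V(r) is proportional to the sum over atoms i of q_i / |r - r_i|.
# Here we work in simple units (e / Å) and skip the Coulomb prefactor for brevity.
def mep(point, atom_coords, partial_charges):
    d = np.linalg.norm(atom_coords - point, axis=1)
    return float((partial_charges / d).sum())

# Invented water-like fragment: coordinates in Å and PEOE-style partial charges.
coords = np.array([[0.000, 0.000, 0.000],    # O
                   [0.960, 0.000, 0.000],    # H
                   [-0.240, 0.930, 0.000]])  # H
charges = np.array([-0.66, 0.33, 0.33])

# Sample the potential at a few observation points on a sphere around the fragment.
for theta in np.linspace(0, np.pi, 4):
    p = np.array([2.0 * np.sin(theta), 0.0, 2.0 * np.cos(theta)])
    print(f"point {np.round(p, 2)} -> MEP = {mep(p, coords, charges):+.3f} e/Å")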
This is usually done by choosing an observation point in the vicinity of the molecule and looking at the van der Waals surface from this point. This kind of
Figure 6.37 Cephalosporin C, a compound found in mold cultures of Cephalosporium acremonium.
a parallel linear projection of the molecular electrostatic potential is helpful but covers only a single observation point. We can now calculate multiple projections for a series of observation points to get the overall picture of the electrostatic potential. In addition, we can use the mapping capabilities of Kohonen neural networks for handling this nonlinear projection. The following shows an example. Figure 6.37 shows the two-dimensional structure of cephalosporin C, a compound first isolated from mold cultures of Cephalosporium acremonium in 1948. Cephalosporin C exhibits a weak antibacterial effect, but the modification of side chains of the molecule generates cephalosporins having diverse antibacterial activity [105]. Cephalosporins are β-lactam antibiotics, semisynthetically produced from cephalosporin C. Figure 6.38 shows schematically how the two-dimensional structure is turned into valuable information. The two-dimensional structure is the common representation of a molecule in the chemical language as it is obtained from structure editor software. This representation does not include explicit information about the spatial arrangement of atoms. Several atoms in this molecule can appear in different spatial orientations, leading to different enantiomers. The three-dimensional model of cephalosporin (Figure 6.38, upper left) is based on Cartesian coordinates (xyz-triples) for each atom. It is usually calculated by force-field methods that take the repulsion between the atoms into account. Biological activity is mainly determined by the surface of the molecule (Figure 6.38, lower left). A biologically active molecule docks to an active center (e.g., a protein). The shape of the active center and the distribution of electrostatic fields along the surface determine whether a molecule fits or not. This representation is calculated by special software and is colored according to electronic properties of the surface. Kohonen neural networks are able to map the surface of a molecule together with its electrostatic potential distribution onto a 2D plane: the Kohonen map. These maps can be compared since their size is independent of the size of the molecule's surface. The lower right images in Figure 6.38 show an example: a pattern of the electrostatic potential mapped into a 2D plane as performed by a Kohonen neural network. This visual representation can be used as a descriptor in a search for compounds with similar biological activity. Cephalosporium acremonium has since been reclassified by Gams, who suggested the name Acremonium chrysogenum [106].
Figure 6.38 The data–information–knowledge cycle for a chemical structure and its biological activity. Starting with the 2D structure, we are able to calculate a 3D model and the surface properties, such as partial charge distribution, at the molecular surface. The surface can be mapped onto the topological layer of a Kohonen network by using partial charge values from evenly distributed points around the surface, thus obtaining a representation of the molecular surface in a fixed dimensionality. The surface maps can be compared to find molecules with similar biological activity.
Comparing the original 2D structure with the descriptor shows that the information content of the descriptor has not just increased considerably but also that the information is much more specific and valuable for the task of determining biological activity. Even though the descriptor seems to be an unusual kind of chemical structure, it is nothing other than a molecule in another chemical language. Transforming a query structure and all of the molecules of a data set into their descriptors allows a data set to be searched not only for structures but also for properties, such as biological activities.
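The surface-mapping idea can be illustrated with a toy example. In the Python sketch below, points on a sphere stand in for a van der Waals surface, each carrying an invented electrostatic potential value; the map size, training schedule, and data are arbitrary illustrative choices, not the setup used for the cephalosporin example:

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical surface sample: points on a sphere standing in for a van der Waals
# surface, each carrying an electrostatic potential value.
n_pts = 500
pts = rng.normal(size=(n_pts, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
potential = pts[:, 2] * 0.3 + rng.normal(0, 0.02, n_pts)   # invented MEP values

# Train a Kohonen map on the 3D coordinates; the potential is projected afterward.
side = 10
grid = np.array([(i, j) for i in range(side) for j in range(side)])
w = rng.normal(size=(side * side, 3))
for epoch in range(40):
    lr, sigma = 0.4 * (1 - epoch / 40), 3.0 * (1 - epoch / 40) + 0.5
    for p in pts[rng.permutation(n_pts)]:
        win = np.argmin(((w - p) ** 2).sum(axis=1))
        h = np.exp(-((grid - grid[win]) ** 2).sum(axis=1) / (2 * sigma ** 2))[:, None]
        w += lr * h * (p - w)

# Fixed-size surface map: average potential of the points each neuron wins.
wins = np.array([np.argmin(((w - p) ** 2).sum(axis=1)) for p in pts])
surface_map = np.full(side * side, np.nan)
for neuron in range(side * side):
    mask = wins == neuron
    if mask.any():
        surface_map[neuron] = potential[mask].mean()
print(surface_map.reshape(side, side).round(2))

The resulting 10 × 10 array has a fixed size regardless of how many surface points were sampled, which is what makes such maps comparable between molecules.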
6.11 Supporting Organic Synthesis
The synthesis of organic compounds poses particular problems for expert systems. The variety of potential reactions requires specialized knowledge and, thus, an extensive rule base. The first major effort in developing a computer-based method for synthetic planning was reported by Corey and Wipke in 1969 [107]. The resulting program, Organic Chemical Simulation of Synthesis (OCSS), used fundamental heuristics by choosing the most general and powerful principles and reactions available in organic synthesis at the time. OCSS was the predecessor of the better-known program Logic and Heuristics Applied to Synthetic Analysis (LHASA).
6.11.1 Overview of Existing Systems
LHASA is a program for synthesis planning, an expert system to assist chemists in designing efficient routes to target molecules for organic synthesis. LHASA was developed by the research groups of E. J. Corey and A. K. Long at Harvard University, later including the research team of A. P. Johnson at Leeds University [108]. LHASA uses six basic strategies for retrosynthetic analysis:
1. Transform-based strategy searches for simplifying retrosynthetic reactions.
2. Mechanistic transforms convert the product into a reactive intermediate.
3. Structure-goal (S-goal) searches for a potential starting material, building block, or initiating chiral component.
4. Topological strategy to identify strategic bonds that can lead to major simplifications of retrosynthetic reactions.
5. Stereochemical strategy, which uses stereoselective reactions to reduce stereocomplexity.
6. Functional group strategy, which investigates one or more functional groups for disconnection.
LHASA searches its way toward synthesizing known and unknown compounds using a chemical knowledge base. Since LHASA operates in a rigorously retrosynthetic fashion, the knowledge base contains information about retroreactions rather than reactions. The current version, LHASA 20.3, contains 2271 transforms and 495 tactical combinations. The software runs on UNIX and Linux environments only. SYNCHEM is a heuristic search program for retrosynthetic analysis introduced by Gelernter and colleagues in 1977, shortly after LHASA [109]. SYNCHEM originally took advantage of the Aldrich catalog of starting materials containing around 3000 compounds. More recent versions contain more than 5000 compounds and 1000 reaction schemes. Unlike LHASA, SYNCHEM was developed as a self-guided system that does not incorporate feedback from a chemist. Chiral Synthon (CHIRON) was developed by S. Hannessian particularly for recognizing chiral substructures in a target molecule as well as their access from the chiral pool [110]. Version 5 includes a database of more than 200,000 compounds including commercially available compounds, 5000 selected literature data, and more than 1000 biologically active compounds. Search for Starting Materials (SESAM) was an approach published by Mehta, Barone, and Chanon in 1998 for identifying synthons based on skeletal overlaps with potential starting materials; it is particularly useful for terpene skeleton recognition [111]. Simulation and Evaluation of Chemical Synthesis (SECS) was developed by Wipke and uses heuristic methods similar to LHASA but puts special emphasis on stereochemistry, topology, and energy minimization [112]. Starting Material Selection Strategies (SST) was an effort by the same team that uses pattern recognition to find starting materials for a given target [113]. SST uses three strategies:
1. Constructive synthesis incorporates the starting material in the target product.
2. Degradative synthesis requires significant modifications to incorporate the starting material in the target product.
3. Remote relationship synthesis performs several bond-forming and bond-cleaving operations.
Reaction routes in SST receive scores for mapping the starting material onto the target. SYNSUP-MB is a heuristic program developed particularly for industrial application by M. Bersohn of Toronto University in cooperation with Sumitomo Chemical Co. [114]. It includes a database with 2500 reactions and allows very fast automated simulation, handling 22,000 reactions per hour with moderately complex target molecules including multiple stereocenters. Though the user may define constraints on reaction routes, like the maximum number of reaction steps, the search is conducted without any user interaction. The Knowledge base-Oriented System for Synthesis Planning (KOSP) was developed by Satoh and Funatsu and uses reaction databases for the retrosynthesis approach [115]. KOSP provides strategic site-pattern perception and precursor skeleton generation and evaluates retrosynthetic schemes. This system is based on knowledge bases in which reactions are abstracted by structural or stereo characteristics of reaction sites and their environments. EROS is a program developed for the simulation of organic reactions. EROS explores the pathways that given starting materials will follow during a reaction and attempts to predict the products that will be obtained from those reactions. The rule base is built on empirical models mainly derived from literature or database information. EROS covers different application areas: laboratory organic synthesis, process synthesis, combinatorial chemistry, environmental chemistry, metabonomics, and mass-spectrum simulation. Development of EROS started in 1973 at the University of Munich, and the first version was described five years later in a publication by Gasteiger and Jochum [116]. The early approach — programmed in PL/1 and later transcribed to formula translator programming language (Fortran) — was designed to address two types of applications: reaction simulation and synthesis design. Whereas reaction simulation is a forward-driven approach that starts with reactants to find potential products, synthesis design is a backward-driven (retrosynthesis) approach that starts with the products and attempts to evaluate potential candidates that can serve as reactants. As the program's capabilities were extended, the Gasteiger research team came to the conclusion that reaction prediction and synthesis design can be managed more efficiently in two separate systems [117]. The synthesis design approach was separated, leading to the development of the Workbench for the Organization of Data for Chemical Applications (WODCA), described below. The team continued development of EROS focusing on reaction simulation and separated the knowledge base from the business logic [118]. One of the natural outcomes of the EROS project was a module for the simulation of mass spectra: MAss Spectra SIMulatOr (MASSIMO) [119]. Mass-spectrum fragmentation requires technologies and constraints similar to the ones used in synthesis. The most current version, EROS 7, was reengineered in C++ and incorporates a
knowledge base written either in C++ or in Tool Command Language (Tcl), a scripting language. EROS 7 has been used in production since 1998.
6.11.2 Elaboration of Reactions for Organic Synthesis
EROS is based on three fundamental concepts: reactors, phases, and modes. A reactor is defined as a container where reactions occur at the same time. These containers are a generic concept of an entity that stores the reaction. In the real world, these are equivalent to a physical container like a flask, a reaction tank, or the fragmentation space in a mass spectrometer. Each change in a reaction requires a new reactor to be instantiated. For instance, if starting materials are added to a container and the mixture is then heated, EROS requires two reactors: one for the reactant entry and another for the heating period, where no further substances are added. Also the simulation of mass spectra using MASSIMO algorithms is performed in an individual reactor. A phase describes the location, environment, and matrix in which a reaction occurs, such as organic and aqueous phase, or a physiological compartment such as blood or a kidney. A phase is usually characterized by a homogeneous concentration of starting materials. Since reactors can incorporate more than one phase, the transitions occurring between the phases have to be considered. In a reactor consisting of an organic and an aqueous phase, both phases are modeled individually, but the transition of compounds between the phases is handled like a reaction with a rate corresponding to the rate of diffusion. The mode of a reaction defines how the starting materials of a reaction are combined to take the potential reactions into account. EROS provides several modes, which are selected mainly according to the concentration of starting materials. With high concentrations of starting materials, the mode MIX is selected. This mode ensures that all combinations of starting materials are considered in the generation of reactions. For instance, if three starting materials, A, B, and C, are given, nine reactions will be considered: three for monomolecular or pseudo-monomolecular reactions of the reactants (decomposition); three for the combinations A + B, A + C, and B + C; and three for potential dimerization or polymerization of the reactants. In the case of lower concentrations where dimerization or polymerization is unlikely, the mode MIX_NO_A_A is chosen. It is similar to MIX but eliminates the dimerization or polymerization reactions. At very low concentration levels, the mode MONOMOLEC is used, which additionally eliminates the intermolecular reactions and considers only monomolecular or pseudo-monomolecular reactions of the starting materials. Additional modes consider special reaction conditions or environments. The mode TUBE is designed for a laminar flow tube reactor (LFTR), which allows reactions between products but not between products and reactants. Consequently, these reactions are ignored in the TUBE mode. Another difference from MONOMOLEC is that special reaction kinetics is not particularly considered, since a turbulent flow in the tube reactor has similar kinetics to a stirred tank reactor. The mode SURFACE takes the special conditions for reactions in a phase interface into account, where molecules react one after another. This mode is also used
for modeling combinatorial chemistry experiments, where each molecule in a first set consecutively reacts with each one of a second set. This behavior can be achieved by specifying the first phase with the mode SURFACE, whereas the second phase receives the mode INERT, which does not allow any reaction. The mode INERT can also be used for intermediate containers with molecules that shall not be incorporated in a reaction.
6.11.3 Kinetic Modeling in EROS
EROS handles concurrent reactions with a kinetic modeling approach, where the fastest reaction has the highest probability to occur in a mixture. The data for the kinetic model are derived from relative or sometimes absolute reaction rate constants. Rates of different reaction paths are obtained by evaluation mechanisms included in the rule base that lead to partial differential equations for the reaction rate. Three methods are available that cover the integration of the differential equations: the GEAR algorithm, the Runge-Kutta method, and the Runge-Kutta-Merson method [120,121]. The estimation of a reaction rate is not always possible. In this case, probabilities for the different reaction pathways are calculated based on probabilities for individual reaction steps.
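To make the integration step concrete, the following Python sketch applies a classical fourth-order Runge-Kutta scheme (one of the integrator families named above) to an invented pair of concurrent first-order reactions; the rate constants and species are purely illustrative and unrelated to any EROS rule base:

import numpy as np

# Classical fourth-order Runge-Kutta step applied to the invented concurrent
# reactions A -> B (rate constant k1) and A -> C (rate constant k2).
def rk4_step(f, y, t, dt):
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

k1_rate, k2_rate = 0.8, 0.2   # hypothetical rate constants, 1/s

def rates(t, y):
    a, b, c = y
    return np.array([-(k1_rate + k2_rate) * a, k1_rate * a, k2_rate * a])

y = np.array([1.0, 0.0, 0.0])   # initial concentrations of A, B, C
for step in range(100):
    y = rk4_step(rates, y, step * 0.05, 0.05)
print("concentrations after 5 s:", y.round(3))
# The faster path dominates: B ends up near k1/(k1 + k2) of the converted A,
# mirroring the rule that the fastest reaction is the most probable one.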
6.11.4 Rules in EROS
A rule in EROS consists of a global part describing the name, reactors, phases, and kinetics and a reaction part that defines one or more reaction rules. An example of a simple EROS rule for bond formation and cleavage in Tcl looks as follows:

set a1 [center 1]
set a2 [center 2]
set a3 [center 3]
set a4 [center 4]
# Reactant function
set flag [prop A_QSIG $a2 qsigr2]
# Reaction generator
change_bond_order $a1 $a2 -1
change_bond_order $a3 $a4 -1
change_bond_order $a2 $a3 1
change_bond_order $a1 $a4 1
# Product function
set flag [prop A_QSIG $a3 qsigp3]
# Calculation of reactivity
set var_reactivity [expr (2.5 + $qsigr2 - $qsigp3) * 1.e-6]
return OK

The set functions define the atoms that are centers of the reaction. The reactant function includes information on physicochemical properties of the reaction centers that shall be accounted for. In the above case, the property
A_QSIG describes the partial atomic sigma charge of reaction center 2, which is assigned to the variable qsigr2. The reaction generator defines two statements for bond cleavage, reducing the bond orders between reaction centers 1,2 and 3,4 by 1, and two statements for the creation of new single bonds, increasing the bond orders between centers 2,3 and 1,4 by 1. The product function retrieves the partial atomic sigma charge of reaction center 3 and assigns it to the variable qsigp3. Finally, the reactivity is calculated using the partial sigma charges of both reaction centers. The rule returns OK if no error occurred and provides the reactivity value for further evaluation of the reaction probability.
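For readers who do not use Tcl, the following Python sketch mirrors the bookkeeping that the reaction generator above performs on a minimal connection table; the bond-order changes and the reactivity expression are taken from the rule, whereas the data structure and the charge values are invented for illustration:

# Bond orders between the four reaction centers, keyed by sorted atom pairs
bonds = {(1, 2): 1, (3, 4): 1}

def change_bond_order(bonds, i, j, delta):
    key = tuple(sorted((i, j)))
    bonds[key] = bonds.get(key, 0) + delta
    if bonds[key] == 0:
        del bonds[key]        # an order of zero means the bond no longer exists

# Reaction generator: cleave the bonds 1-2 and 3-4, form the bonds 2-3 and 1-4
for i, j, delta in [(1, 2, -1), (3, 4, -1), (2, 3, 1), (1, 4, 1)]:
    change_bond_order(bonds, i, j, delta)

qsigr2, qsigp3 = 0.05, -0.02                       # illustrative sigma charges
reactivity = (2.5 + qsigr2 - qsigp3) * 1.0e-6      # same expression as in the rule
print(bonds, reactivity)                           # {(2, 3): 1, (1, 4): 1} ...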
6.11.5 Synthesis Planning — Workbench for the Organization of Data for Chemical Applications (WODCA)

The program system WODCA is designed for the interactive planning of organic syntheses in a retrosynthetic approach. It was the result of research on reaction prediction with the software EROS, performed by Gasteiger and colleagues. WODCA was derived from EROS to separate the reaction simulation from the synthesis planning approach [122–126]. The concept behind retrosynthesis is planning a synthesis in a reverse manner — that is, starting at the product level and breaking the product down into simpler molecules until cheap and commercially available starting materials are generated.

An important part of an automated system for synthesis planning is the search for strategic bonds to dissect a target molecule into potential synthesis precursors. WODCA offers four disconnection strategies to recognize strategic bonds in the target compound. Each strategy uses different rules for strategic bond evaluation and rates each bond's suitability as a dissection point on a scale from 0 to 100, where 100 corresponds to the highest probability of bond cleavage. The algorithmic foundations are physicochemical properties of atoms and bonds that are calculated by empirical methods, as well as structural constraints. The software takes advantage of available electronic catalogs, which serve as a data basis of commercially available chemicals, and in its most recent version, 5.2, provides interfaces to reaction databases such as Theilheimer, ChemInformRX, and MDL SPORE.

Like much noncommercial software, WODCA has grown over many years and thus incorporates modules written in different programming languages, such as C, C++, and scripting languages. It runs on Linux systems only; Windows 98/2000/XP clients require the installation of WODCA on a Linux server and the use of HOBLink X11, which is bundled with the software. WODCA works fully interactively and has a graphical user interface that guides the user through the individual synthesis steps.

The workflow with WODCA starts with entering a target structure, the reaction product. The software automatically performs an identity search in the database to identify suitable starting materials. If no starting materials are found, the user can start a similarity search in the database. Similarity searches include 40 different criteria, such as the following:

• Ring System: Analyzes for identical ring systems, including heteroatoms.
• Carbon Skeleton: Maps the largest identical carbon skeleton.
• Reduced Carbon Skeleton: Same as carbon skeleton but ignores multiple bonds.
• Ring Systems and Substitution Pattern: Searches for identical ring systems with identical substitution.
• Element and ZH Exchange: Halogen atoms and groups like OH, SH, and NH2 are considered to be equivalent.
• Substitution Patterns: Covers 12 different substitution patterns that typically occur. These patterns are based on replacing atom groups with chlorine atoms to treat them as similar. Four criteria can be defined for the atom groups to be replaced: atoms directly attached to aromatic rings, atoms attached to carbon-carbon multiple bonds, bond orders to substituents, and multiple substitution by heteroatoms.

Further similarity criteria include conversion of the target by hydrolysis, oxidation, reduction, elimination, or ozonolysis. The converted target compound is again compared with the database molecules to find similar structures.

The second step is the identification of strategic bonds in the target compound to dissect the target into suitable synthesis precursors (Figure 6.39). Whether a bond is considered strategic depends on two criteria: (1) the structure of the precursors must be simpler than that of the target (topology criterion); and (2) the reaction from the precursor to the target should be simple and not unusual. WODCA offers four different disconnection strategies to identify strategic bonds in the target: (1) aliphatic bonds; (2) aromatic substitution; (3) carbon-heteroatom bonds; and (4) polycyclic compounds. Each strategy provides its own rules and results in a ranking factor for a bond between 0 and 100. The ranking is relative to the current chemical environment of the bond, because the physicochemical properties are calculated dynamically for each chemical environment rather than relying on static bond properties.
Figure 6.39 Strategic bonds as found by a synthesis design program for a target compound (product), leading to a potential precursor. The numbers in the left structure indicate the relative probability of bond dissection based on dynamic physicochemical properties of the chemical environment; the value of 100 indicates the highest probability, leading to the precursor shown on the right-hand side.
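The following Python sketch merely indicates how a 0 to 100 ranking of candidate bonds could be organized from precomputed physicochemical bond properties; the property names, values, and weights are hypothetical and do not reproduce WODCA's actual disconnection rules:

def rank_bond(bond_properties, weights):
    # Each property is assumed to be scaled to the range 0..1 beforehand
    score = sum(weights[name] * bond_properties.get(name, 0.0) for name in weights)
    return max(0, min(100, round(100.0 * score / sum(weights.values()))))

# Hypothetical values for a strongly activated bond, as in Figure 6.39
example_bond = {"charge_difference": 0.9,
                "bond_polarizability": 0.7,
                "fragment_stabilization": 0.95}
weights = {"charge_difference": 1.0,
           "bond_polarizability": 0.5,
           "fragment_stabilization": 1.5}
print(rank_bond(example_bond, weights))            # 89, i.e., a strong dissection candidate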
After the disconnection strategy is defined, the system indicates the strategic bonds together with their ranks. The user can now analyze the precursor or verify the disconnection by performing a reaction substructure search in any of the interfaced reaction databases. To perform a search in the reaction database, the user can define the bond sphere to be considered as the identity criterion. The first sphere, for instance, includes the bonds attached to the atoms of the strategic bond. A hit is presented as a reaction with additional information from the reaction database, such as reaction conditions, yield, and references. If the precursor is accepted by the user, the molecule can be transferred to a synthesis tree view. The user can then analyze the first precursor for strategic bonds, verify the next precursor, and add it to the synthesis tree until a suitable starting material is reached.

The active development, support, and distribution of WODCA was stopped in 2005. The distributor, Molecular Networks — a spin-off of Gasteiger's research team — is currently developing a Web-based expert system called Retrosynthesis Browser (RSB) for retrosynthetic analysis of a given target compound. RSB scans reaction databases to suggest new synthetic routes and simultaneously searches in catalogs of available starting materials for the proposed precursors.
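The bond-sphere criterion used for the reaction substructure search can be pictured with a small Python sketch on a toy molecular graph; the helper function is purely illustrative and not part of WODCA:

def bonds_in_sphere(bonds, strategic_bond, n_spheres=1):
    # 'bonds' is a set of atom-index pairs; sphere 1 contains all bonds
    # attached to the two atoms of the strategic bond, sphere 2 adds the
    # bonds attached to those bonds' atoms, and so on.
    selected = {tuple(sorted(strategic_bond))}
    frontier_atoms = set(strategic_bond)
    for _ in range(n_spheres):
        new_bonds = {tuple(sorted(b)) for b in bonds
                     if b[0] in frontier_atoms or b[1] in frontier_atoms}
        selected |= new_bonds
        frontier_atoms = {atom for b in new_bonds for atom in b}
    return selected

# Toy graph; the strategic bond connects atoms 1 and 2
bonds = {(1, 2), (2, 3), (3, 4), (4, 5), (1, 6), (6, 7)}
print(bonds_in_sphere(bonds, (1, 2), n_spheres=1))   # {(1, 2), (2, 3), (1, 6)}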
6.12 Concise Summary

Agonist is a signaling molecule (i.e., hormone, neurotransmitter, or synthetic drug) that binds to a receptor leading to a response, such as contraction, relaxation, secretion, or change in enzyme activity.

Antagonist is a substance that inhibits the function of an agonist by blocking the agonist's specific receptor site.

Atomic Absorption Spectrometry (AAS) is a quantitative spectroscopic method based on the ability of free atoms, produced in an appropriate medium, like a flame, plasma, or a heated graphite tube, to absorb radiation of an atom-specific wavelength.

Binding Affinity is a measure describing the ability of a molecule to interact with a receptor. It is typically characterized by the binding constant for its specific receptor.

Chiral Synthon (CHIRON) is a program particularly developed for recognizing chiral substructures in a target molecule for organic synthesis.

Coherent Anti-Stokes Raman Scattering (CARS) Thermometry is a technique for temperature measurement in high-temperature environments using a third-order nonlinear optical process involving a pump and a Stokes frequency laser beam that interacts with the sample and generates a coherent anti-Stokes frequency beam.

Computer-Assisted Structure Elucidation (CASE) is a paradigm that covers techniques and computer programs for the elucidation of structures with rule-based systems.

CONstrained GENeration (CONGEN) is software for generating all isomers for a given structure and is part of the DENDRAL project.
Corticosteroid-Binding Globulin (CBG) is a plasma glycoprotein that binds glucocorticoids with high affinity; it is responsible for transport and bioavailability of these hormones. DARC-EPIOS is an automated software for retrieving structural formulas from overlapping 13C-NMR data. Database Approach is a specific method for deriving the molecular structure from an infrared spectrum by predicting a molecular descriptor from an artificial neural network and retrieving the structure with the most similar descriptor from a structure database. Dendritic Algorithm (DENDRAL) is one of the first expert systems designed for automatic interpretation of the mass spectra to derive molecular structures. Effective Concentration (EC50) in a radioligand-binding assay is the molar concentration of an agonist that produces 50% of the maximum possible response for that agonist. Elaboration of Reactions for Organic Synthesis (EROS) is a program for reaction prediction in organic synthesis. It explores pathways that given starting materials will follow during a reaction path and attempts to predict the products that will be obtained from those reactions. Electrostatic Potential is the work required to move a unit positive charge from infinity to a point near the molecule. Enantiomers are stereoisomers that are not superimposable, or mirror images of each other. Enantiomers have identical chemical properties except when they are involved in reactions with chiral centers, such as in biological systems. The only difference in physical properties is their ability to rotate plane-polarized light in opposite directions. EXPIRS is an expert system for the interpretation of infrared spectra based on hierarchical organization of the characteristic groups. Fast Fourier Transform Compression is a method that uses Fourier transformation to decompose spectra into a series of Fourier coefficients, to reduce them, and to backtransform them to achieve a compressed version of the spectrum. Fast Hadamard Transform (FHT) is a mathematical transformation of signals or vectors similar to Fourier transform but is based on square wave functions rather than sines and cosines. Fast Hadamard Transform (FHT) Compression is a method that uses Hadamard transformation to decompose spectra into a series of Hadamard coefficients, to reduce them, and to backtransform them to achieve a compressed version of the spectrum. Generation with Overlapping Atoms (GENOA) is software for generating all isomers for a given structure and is a successor of CONGEN with improved constraint handling. Heated Graphite Atomizer (HGA) is a device used in atomic absorption spectrometry for atomization of compounds in graphite tube, which is connected as a resistor in a high electrical current circuit. Inhibitory Concentration (IC50) in a radioligand-binding assay is the molar concentration of competing ligand that reduces the specific binding of a radioligand by 50%.
JCAMP-DX is a standardized file format for the representation of spectra and chromatograms, maintained by the International Union of Pure and Applied Chemistry (IUPAC).

Local RDF Descriptor is a geometric descriptor based on radial distribution functions and designed for the characterization of individual atoms in a molecule in their chemical environment.

Logic and Heuristics Applied to Synthetic Analysis (LHASA) is an expert system for synthesis planning that assists chemists in designing efficient routes to target molecules for organic synthesis.

Mass Absorption Coefficients are element-specific coefficients describing the linear absorption in relation to the density of the absorber.

Mean Molecular Polarizability quantifies the ease with which an entire molecule undergoes distortion in a weak external field.

Modeling Approach is a specific method for deriving the molecular structure from an infrared spectrum by predicting a molecular descriptor from an artificial neural network, searching for the most similar descriptor in a structure database, and using an iterative method for structure adaptation until its descriptor matches the predicted one.

Molecular Electrostatic Potential (MEP) is the electrostatic force describing the relative polarity of a molecule and is affected by dipole moment, electronegativity, and partial charges of a molecule. It gives detailed information for studies on chemical reactivity or pharmacological activity of a compound.

MOLION is software designed to detect the parent ion in a mass spectrum and to derive the sum formula. It is part of the DENDRAL project.

MSPRUNE is an extension to MSRANK that works with a list of candidate structures from CONGEN and the mass spectrum of the query molecule to predict typical fragmentations for each candidate structure.

MSRANK is part of the DENDRAL software suite; it compares the predicted mass spectra and ranks them according to their fit to the experimental spectrum.

MYCIN is one of the first expert systems for medical diagnosis; it supports physicians in the diagnostic process.

NEOMYCIN is a successor of MYCIN providing explicit disease taxonomy in a frame-based system.

PAIRS is a program that analyzes infrared spectra in a manner similar to that of a spectroscopist and was designed for automated interpretation of Fourier transform Raman spectra of complex polymers.

Partial Equalization of Orbital Electronegativity (PEOE) is an iterative procedure to calculate charge distribution in a molecule based on electronegativities of bonded atoms and the electrostatic potential created by electron transfer that acts against further electron transfer.

PLANNER is a construction module in the DENDRAL project for automatic constraint generation.

Polarizability (Static Dielectric Polarizability) is a measure of the linear response of the electronic cloud of a chemical species to a weak external electric field of particular strength.
PREDICTOR is part of the DENDRAL software suite that uses rules to produce a hypothetical mass spectrum for a candidate structure, which is compared to the experimental mass spectrum. Progestagens are sex hormones essential for preparing the uterus for implantation of a fertilized ovum during pregnancy. They are precursors to progestins. Progestins are sex hormones used in contraceptive pills and in hormone replacement regimens, where they counter the proliferative effects of estrogens. Radioligand Binding Experiments are used to determine whether a drug binds to a receptor and to investigate the interaction of low-affinity drugs with receptors based on radioactive marking. REACT is a program from the DENDRAL projects that predicts potential reactions of a candidate structure with another structure. Retrosynthesis is a synthesis approach that starts at the product level and breaks it into simpler molecules until cheap and commercially available starting materials are generated. Simulation and Evaluation of Chemical Synthesis (SECS) is a program for organic synthesis that uses heuristic methods similar to LHASA but puts special emphasis on stereochemistry, topology, and energy minimization. Search for Starting Materials (SESAM) is software for organic synthesis that identifies synthons based on skeletal overlaps with potential starting materials. SHAMAN is an expert system developed for qualitative and quantitative radionuclide identification in gamma spectrometry. Spectrum Simulation is a method for creating spectra from information about the chemical structure of a molecule, for which none exist; this is typically supported by prediction technologies, such as artificial neural networks. Starting Material Selection Strategies (SST) is a program for organic synthesis that uses pattern recognition to find starting materials to a given target. Structure Reduction is a technique for structure generation based on removal of bonds from a hyperstructure that initially contains all the possible bonds between all the required atoms and molecular fragments. SYNCHEM is a self-guided heuristic search program for retrosynthetic analysis of compounds in organic synthesis. Workbench for the Organization of Data for Chemical Applications (WODCA) is a synthesis planning program based on a retrosynthesis approach that searches for strategic bonds to dissect a target molecule into potential synthesis precursors. X-ray Fluorescence Spectroscopy is an analytical method for automated sequential analysis of major and trace elements in metals, rocks, soils, and other usually solid materials. The technique is based on the absorption of a primary x-ray beam that leads to secondary fluorescence that is specific for the atoms in a compound.
References
1. Lindsay, R.K., et al., Applications of Artificial Intelligence for Organic Chemistry — The DENDRAL Project, McGraw-Hill, New York, 1980. 2. Carhart, R.E., et al., An Approach to Computer-Assisted Elucidation of Molecular Structure, J. Am. Chem. Soc., 97, 5755, 1975.
3. Carhart, R.E., et al., GENOA: A Computer Program for Structure Elucidation Utilizing Overlapping and Alternative Substructures, J. Org. Chem., 46, 1708, 1981. 4. Lindsay, R.K. et al., DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation, Artificial Intelligence, 61, 209, 1993. 5. Buchanan, B.G. and Shortliffe, E.H., Eds., Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley, Reading, MA, 1984. Out of print; electronically available at http://www.aaaipress.org/Classic/Buchanan/buchanan.html. 6. Fagan, L.M., et al., Representation of Dynamic Clinical Knowledge: Measurement Interpretation in the Intensive Care Unit, in Proc. 6th Intern. Joint Conf. on Artif. Intellig., Tokyo, 1979, 260. 7. Bellamy, L.J., The Infrared Spectra of Complex Molecules, Wiley, Chichester, 1975. 8. Dolphin, D. and Wick, A., Tabulation of Infrared Spectral Data, Wiley, New York, 1977. 9. Pretsch, E. et al., Tables of Spectral Data for Structure Determination of Organic Compounds, Springer-Verlag, Berlin, 1989. 10. Lin-Vien, D. et al., Infrared and Raman Characteristic Frequencies of Organic Molecules, Academic Press, New York, 1991. 11. Gribov, L.A. and Orville-Thomas, W.J., Theory and Methods of Calculation of Molecular Spectra, Wiley, Chichester, 1988. 12. Gribov, L.A. and Elyashberg, M.E., Computer-Aided Identification of Organic Molecules by Their Molecular Spectra, Crit. Rev. Anal. Chem., 8, 111, 1979. 13. Elyashberg, M.E., Infrared Spectra Interpretation by the Characteristic Frequency Approach, in The Encyclopedia of Computational Chemistry, Schleyer, P.V.R., et al., Eds., John Wiley and Sons, Chichester, 1998, 1299. 14. Grund, R., Kerber, A., and Laue, R., MOLGEN, ein Computeralgebra-System für die Konstruktion molekularer Graphen, MATCH , 27, 87, 1992. 15. Zupan, J., Algorithms for Chemists, Wiley & Sons, New York, 1989. 16. Luinge, H.J., Automated Interpretation of Vibrational Spectra, Vib. Spectrosc., 1, 3, 1990. 17. Warr, W., Computer-Assisted Structure Elucidation. Part 2. Indirect Database Approaches and Established Systems, Anal. Chem., 65, 1087A, 1993. 18. Elyashberg, M.E., Serov, V.V., and Gribov, L.A., Artificial Intelligence Systems for Molecular Spectral Analysis, Talanta, 34, 21, 1987. 19. Funatsu, K., Susuta, Y., and Sasaki, S., Application of IR-Data Analysis Based on Symbolic Logic to the Automated Structure Elucidation, Anal. Chim. Acta, 220, 155, 1989. 20. Wythoff, B., et al., Computer-Assisted Infrared Identification of Vapor-Phase Mixture Components, J. Chem. Inf. Comput. Sci., 31, 392, 1991. 21. Andreev, G.N. and Argirov, O.K., Implementation of Human Expert Heuristics in Computer Supported Infrared Spectra Interpretation, J. Mol. Struct., 347, 439, 1995. 22. Woodruff, H.B. and Smith, G.M., Computer Program for the Analysis of Infrared Spectra, Anal. Chem., 52, 2321, 1980. 23. Claybourn, M., Luinge, H.J., and Chalmers, J.M., Automated Interpretation of Fourier Transform Raman Spectra of Complex Polymers Using an Expert System, J. Raman Spectrosc., 25, 115, 1994. 24. Andreev, G.N., Argirov, O.K., and Penchev, P.N., Expert System for Interpretation of Infrared Spectra, Anal. Chim. Acta, 284, 131, 1993. 25. Plamen, N., et al., Infrared Spectra Interpretation by Means of Computer, Traveaux Scientifiques d’Universite de Plovdiv, 29, 21, 2000. 26. Funatsu, K., Miyabayaski, N., and Sasaki, S., Further Development of Structure Generation in Automated Structure Elucidation System CHEMICS, J. Chem. Inf. 
Comp. Sci., 28, 18, 1988.
27. Shelley, C.A., et al., An Approach to Automated Partial Structure Expansion, Anal. Chim. Acta, 103, 121, 1978. 28. Kalchhauser, H. and Robien, W., CSearch — A Computer Program for Identification of Organic Compounds and Fully Automated Assignment of C-13 Nuclear MagneticResonance Spectra, J. Chem. Inf. Comput. Sci., 25, 103, 1985. 29. Carabedian, M., Dagane, I., and Dubois, J.E., Elucidation by Progressive Intersection of Ordered Substructures from Carbon-13 Nuclear Magnetic Resonance, Anal. Chem., 60, 2186, 1988. 30. Christie, B.D. and Munk, M.E., Structure generation by reduction: A new strategy for computer assisted structure elucidation, J. Chem. Inf. Comput. Sci., 28, 87, 1988. 31. Bohanec, S. and Zupon, J., Structure generation of constitutional isomers from structural fragments, J. Chem. Inf. Comput. Sci., 31, 531, 1991. 32. Elyashberg, M.E., et al., X-Pert: A User-Friendly Expert System for the Molecular Structure Elucidation by Spectral Methods, Anal. Chim. Acta, 337, 265, 1997. 33. Munk, M.E., Madison, M.S., and Robb, E.W., The Neural Network as a Tool for Multispectral Interpretation, Chem. Inf. Comput. Sci., 35, 231, 1996. 34. Luinge, H.J., van der Maas, J.H., and Visser, T., Partial Least Squares Regression as a Multivariate Tool for the Interpretation of Infrared Spectra, Chemom. Intell. Lab. Syst., 28, 129, 1995. 35. Elyashberg, M.E., Expert Systems for Molecular Spectral Analysis, Anal. Khim., 47, 698, 1992. 36. Klawun, C. and Wilkins, C.L., Joint Neural Network Interpretation of Infrared and Mass Spectra, J. Chem. Inf. Comput. Sci., 36, 69, 1996. 37. Anand, R., et al., Analyzing Images Containing Multiple Sparse Pattern with Neural Networks, in Proc. of IJCAI-91, Sidney, 1991. 38. Robb, E.W. and Munk, M.E., A Neural Network Approach to Infrared Spectrum Interpretation, Mikrochim. Acta (Wien), 1, 131, 1990. 39. Ehrentreich, F., et al., Bewertung von IR-Spektrum-Struktur-Korrelationen mit Counterpropagation-Netzen, in Software-Entwicklung in der Chemie 10, Gasteiger, J., Ed., Gesellschaft Deutscher Chemiker, Frankfurt/Main, 1996. 40. Penchev, P.N., Andreev, G.N., and Varmuza, K., Automatic Classification of Infrared Spectra Using a Set of Improved Expert-Based Features, Anal. Chim. Acta, 388, 145, 1999. 41. Ricard, D., et al., Neural Network Approach to Structural Feature Recognition from Infrared Spectra, J. Chem. Inf. Comput. Sci, 33, 202, 1993. 42. Meyer, M. and Weigel, T., Interpretation of Infrared Spectra by Artificial Neural Networks, Anal. Chim. Acta, 265, 183, 1992. 43. Gasteiger, J., et al.‚ Chemical Information in 3D Space, J. Chem. Inf. Comput. Sci., 36, 1030, 1996. 44. Steinhauer, L., Steinhauer, V., and Gasteiger, J.‚ Obtaining the 3D Structure from Infrared Spectra of Organic Compounds Using Neural Networks, in Software-Development in Chemistry 10, Gasteiger, J., Ed., Gesellschaft Deutscher Chemiker, Frankfurt/Main, 1996, 315. 45. Selzer, P., et al., Rapid Access to Infrared Reference Spectra of Arbitrary Organic Compounds: Scope and Limitations of an Approach to the Simulation of Infrared Spectra by Neural Networks, Chem.- Eur. J., 6, 920, 2000. 46. Kostka, T., Selzer, P., and Gasteiger, J., Computer-Assisted Prediction of the Degradation Products and Infrared Spectra of s-Triazine Herbicides, in Software-Entwicklung in der Chemie 11, Fels, G. and Schubert, V., Eds., Gesellschaft Deutscher Chemiker, Frankfurt/Main, 1997, 227.
47. Gasteiger, J., et al., A New Treatment of Chemical Reactivity: Development of EROS, an Expert System for Reaction Prediction and Synthesis Design, Topics Curr. Chem., 137, 19, 1987. 48. IUPAC Committee on Printed and Electronic Publications, Working Party on Spectroscopic Data Standards (JCAMP-DX), http://www.iupac.org/ 49. Affolter, C., Baumann, K., Clerc, J.T., Schriber, H., and Pretsch, E., Automatic Interpretation of Infrared Spectra, Microchim. Acta, 14, 143–147, 1997. 50. Novic, M. and Zupan, J., Investigation of Infrared Spectra-Structure Correlation Using Kohonen and Counterpropagation Neural Network, J. Chem. Inf. Comput. Sci., 35, 454, 1995. 51. Sadowski, J. and Gasteiger, J., From Atoms and Bonds to Three-Dimensional Atomic Coordinates: Automatic Model Builders, Chem. Rev., 93, 2567, 1993. 52. Hemmer, M.C. and Gasteiger, J., Prediction of Three-Dimensional Molecular Structures Using Information from Infrared Spectra, Anal. Chim. Acta, 420, 145, 2000. 53. Turner, D.B., Tyrell, S.M., and Willett, P., Rapid Quantification of Molecular Diversity for Selective Database Acquisition, J. Chem. Inf. Comput. Sci., 37, 18, 1997. 54. Lipkus, A.H., Exploring Chemical Rings in a Simple Topological-Descriptor Space, J. Chem. Inf. Comput. Sci., 41, 430, 2001. 55. Jørgensen, A.M. and Pedersen, J.T., Structural Diversity of Small Molecule Libraries, J. Chem. Inf. Comput. Sci., 41, 338, 2001. 56. Willett, P., Three-Dimensional Chemical Structure Handling, John Wiley & Sons, New York, 1991. 57. Johnson, M.A. and Maggiora, G.M., Concepts and Applications of Molecular Similarity, John Wiley & Sons, New York, 1990. 58. Jakes, S.E. and Willett, P., Pharmacophoric Pattern Matching in Files of 3-D Chemical Structures: Selection of Inter-atomic Distance Screens, J. Molec. Graphics, 5, 49, 1986. 59. Exner, O., Dipole Moments in Organic Chemistry, G. Thieme, Stuttgart, 1975. 60. Gasteiger, J. and Hutchings, M.G., Quantitative Models of Gas-Phase Proton Transfer Reactions Involving Alcohols, Ethers, and their Thio Analogs. Correlation Analyses Based on Residual Electronegativity and Effective Polarizability, J. Amer. Chem. Soc. 106, 6489, 1984. 61. Gasteiger, J. and Hutchings, M.G., Quantification of Effective Polarisability. Applications to Studies of X-ray Photoelectron Spectroscopy and Alkylamine Protonation, J. Chem. Soc. Perkin, 2, 559, 1984. 62. Kang, Y.K. and Jhon, M.S., Additivity of Atomic Static Polarizabilities and Dispersion Coefficients, Theor. Chim. Acta, 61, 41, 1982. 63. Czernek, J. and Sklenář, V., Ab Initio Calculations of 1H and 13C Chemical Shifts in Anhydrodeoxythymidines, J. Phys. Chem., 103, 4089, 1999. 64. Barfield, M. and Fagerness, P.J., Density Functional Theory GIAO Studies of the C13, N-15, and H-1 NMR Chemical Shifts in Aminopyrimidines and Aminobenzenes: Relationships to Electron Densities and Amine Group Orientations, J. Am. Chem. Soc., 119, 8699, 1997. 65. Williams, A., Recent Advances in NMR Prediction and Automated Structure Elucidation Software, Current Opinion in Drug Disc. Devel., 3, 298, 2000. 66. Bremser, W.‚ HOSE — A Novel Substructure Code, Anal. Chim. Acta, 103, 355, 1978. 67. Advanced Chemistry Development Inc., Canada, http://www.acdlabs.com/. 68. CambridgeSoft Corp., Cambridge, MA, http://www.cambridgesoft.com. 69. Upstream Solutions GmbH, Zurich, Switzerland, http://www.upstream.ch. 70. Bürgin-Schaller, R. and Pretsch, E., A Computer Program for the Automatic Estimation of 1 H-NMR Chemical Shifts, Anal. Chim. Acta, 290, 295, 1994.
71. Ball, J.W. Anker, L.S., and Jurs, P.C., Automated Model Selection for the Simulation of Carbon-13 Nuclear Magnetic Resonance Spectra of Cyclopentanones and Cycloheptanones, Anal. Chem., 63, 2435, 1991. 72. Doucet, J.P., et al., Neural Networks and 13C NMR Shift Prediction, J. Chem. Inf. Comput. Sci., 33, 320, 1993. 73. Meiler, J., Meusinger, R., and Will, M., Fast Determination of 13C-NMR Chemical Shifts Using Artificial Neural Networks, J. Chem. Inf. Comput. Sci., 40, 1169, 2000. 74. Hare, B. J. and Prestegard, J.H., Application of Neural Networks to Automated Assignment of NMR Spectra of Proteins, J. Biomol. NMR, 4, 35–46, 1994. 75. Bernstein, R., et al., Computer-Assisted Assignment of Multidimensional NMR Spectra of Proteins: Application to 3D NOESY-HMQC and TOCSY-HMQC Spectra, J. Biomol. NMR, 3, 245, 1993. 76. Morelle, N., et al., Computer Assignment of the Backbone Resonances of Labeled Proteins Using Two-Dimensional Correlation Experiments, J. Biomol. NMR, 5, 154, 1995. 77. Wehrens, R., et al., Sequential Assignment of 2D-NMR Spectra of Proteins Using Genetic Algorithms, J. Chem. Inf. Comput. Sci., 33, 245, 1993. 78. Zimmerman, D.E., et al., Automated Analysis of Protein NMR Assignments Using Methods from Artificial Intelligence, J. Mo. Biol., 269, 592, 1997. 79. Moseley, H.N.B., Monleon, D., and Montelione, G.T., Automatic Determination of Protein Backbone Resonance Assignments from Triple Resonance NMR Data, Methods in Enzymology, 339, 91, 2001. 80. Wüthrich, K., NMR of Proteins and Nucleic Acids, John Wiley & Sons, New York, NY, 1986. 81. Ivanciuc, O., et al., 13C NMR Chemical Shift Prediction of the sp3 Carbon in a Position Relative to the Double Bond in Acyclic Alkenes, J. Chem. Inf. Comput. Sci., 37, 587, 1997. 82. Hönig, H.‚ An Improved 13C-NMR-Shift Prediction Program for Polysubstituted Benzenes and Sterically Defined Cyclohexane Derivatives, Magn. Reson. Chem., 34, 395, 1996. 83. Martin, N.H., Allen, N.W., and Moore, J.C., An Algorithm for Predicting the NMR Shielding of Protons over Substituted Benzene Rings, J. Molec. Graph. Model., 18, 242, 2000. 84. Aires de Sousa, J., Hemmer, M.C., and Gasteiger, J., Prediction of 1H NMR Chemical Shifts Using Neural Networks, Anal. Chem. 74, 80, 2002. 85. Jones, G., Genetic and Evolutionary Algorithms, in Encyclopedia of Computational Chemistry, Schleyer, P.V.R., et al., Eds., John Wiley & Sons, Chichester, UK, 1998, 1127. 86. Aarnio, P.A., Application of the Nuclide Identification System SHAMAN in Monitoring the Comprehensive Test Ban Treaty, J. Radioanal. Nucl. Chem., 235, 95, 1998. 87. Aarnio, P., Nikkinen, M., and Routti, J., Gamma Spectrum Analysis Including NAA with SAMPO for Windows. J. Radioanal. Nucl. Chem., 193, 179, 1995. 88. Keller, P.E. and Kouzes, R.T., Gamma Spectral Analysis via Neural Networks, Nuclear Science Symposium and Medical Imaging Conference, IEEE Conference Record, 1, 341, 1994. 89. Welz, B. and Sperling, M., Atomic Absorption Spectrometry, 3rd ed., Wiley-VCH, Weinheim, Germany, 1998. 90. Herzberg, J., et al., CARS Thermometry in a Transversely Heated Graphite Tube Atomizer Used in Atomic Absorption Spectrometry, Applied Physics, 61, 201, 1995. 91. Sturgeon, R.E., Chakrabarti, C.L., and Bertels, B.C., Atomization in Graphite-Furnace Atomic Absorption Spectrometry, Anal. Chem., 47, 1250, 1975. 92. Williams, K.L., Introduction to X-Ray Flourescence Spectrometry, Allen and Unwin, London, 1987.
93. Kenakin, T., Pharmacologic Analysis of Drug-Receptor Interaction, 3d ed., Lippincott-Raven Press, 1997. 94. Cheng, Y. and Prusoff, W.H., Relationship between the Inhibitory Constant (Ki) and the Concentration of Inhibitor which Causes 50 Per cent Inhibition (IC50) of an Enzymatic Reaction, Biochem. Pharmacol., 22, 3099, 1973. 95. Berkink, E.W., et al., Binding of Progestagens to Receptor Proteins in MCF-7 Cells, J. Steroid Biochem., 19, 1563, 1983. 96. Bursi, R., et al., Comparative Spectra Analysis (CoSA): Spectra as Three-Dimensional Molecular Descriptors for the Prediction of Biological Activities, J. Chem. Inf. Comput. Sciences, 39, 861, 1999. 97. Cramer, D., Patterson, D.E., and Bunce, J.D., Recent Advances in Comparative Molecular Field Analysis (CoMFA), J. Am. Chem. Soc., 110, 5959, 1988. 98. Waldeck, B., Three-Dimensional Pharmacology, a Subject Ranging from Ignorance to Overstatements, Pharmacol. Toxicol., 93, 203, 2003. 99. Koren, G., Pastuszak, A., and Ito, S., Drugs in Pregnancy, N. Engl. J. Med., 338, 1128, 1998. 100. Dally, A., Thalidomide: Was the Tragedy Preventable?, Lancet, 351, 1197, 1998. 101. Vanchieri, C., Preparing for the Thalidomide Comeback, Ann. Intern. Med., 127, 951, 1997. 102. Silverman, W.A., The Schizophrenic Career of a “Monster Drug,” Pediatrics, 110, 404, 2002. 103. Dencker, L., Susceptibility in Utero and upon Neonatal Exposure, Food Addit. Contam., 15, 37, 1998. 104. Gasteiger, J. and Marsili, M., Iterative Partial Equalization of Orbital Electronegativity — A Rapid Access to Atomic Charges, Tetrahedron, 36, 3219, 1980. 105. Tollnick, C., et al., Investigations of the Production of Cephalosporin C by Acremonium Chrysogenum, Advances in Biochem. Engineering/Biotechnology, 86, 1, 2004. 106. Gams, W., Cephalosporium-artige Schimmelpilze (Hyphomycetes), Gustav Fischer Verlag, Stuttgart, Germany, 1971. 107. Corey, E.J. and Wipke, W.T., Computer-Assisted Design of Complex Organic Syntheses, Science, 166, 178, 1969. 108. Corey, E.J., Howe, W.J., and Pensak, D.A., Computer-Assisted Synthetic Analysis. Methods for Machine Generation of Synthetic Intermediates Involving Multistep LookAhead, J. Am. Chem. Soc., 96, 7724, 1974. 109. Gelernter, H.L., et al., Empirical Explorations of SYNCHEM, Science, 197, 1041, 1977. 110. Hanessian, S., Franco, J., and Larouche, B., The Psychobiological Basis of Heuristic Synthesis Planning — Man, Machine and the Chiron Approach, Pure Appl. Chem., 62, 1887, 1990. 111. Mehta, G., Barone, R., and Chanon, M., Computer-Aided Organic Synthesis — SESAM: A Simple Program to Unravel “Hidden” Restructured Starting Materials Skeleta in Complex Targets, Eur. J. Org. Chem., 18, 1409, 1998. 112. Wipke, W.T., Ouchi, G.I., and Krishnan, S., Simulation and Evaluation of Chemical Synthesis — SECS — Application of Artificial Intelligence Techniques, Artif. Intell., 11, 173, 1978. 113. Wipke, W.T. and Rogers, D.J., Artificial Intelligence in Organic Synthesis. SST: Starting Material Selection Strategies. An Application of Superstructure Search, Chem. Inf. Comput. Sci., 24, 71, 1984. 114. Takahashi, M., et al, The Performance of a Noninteractive Synthesis Program, J. Chem. Inf. Comput. Sci., 30, 436, 1990. 115. Satoh, K. and Funatsu, K., A Novel Approach to Retrosynthetic Analysis Using Knowledge Bases Derived from Reaction Databases, J. Chem. Inf. Comput. Sci., 39, 316, 1999.
116. Gasteiger, J. and Jochum, C, EROS — A Computer Program for Generating Sequences of Reactions, Topics Curr. Chem., 74, 93, 1978. 117. Gasteiger, J., et al., A New Treatment of Chemical Reactivity: Development of EROS, an Expert System for Reaction Prediction and Synthesis Design, Topics Curr. Chem., 137, 19, 1987. 118. Ihlenfeldt, W.-D. and Gasteiger, J., Computer-Assisted Planning of Organic Syntheses: The Second Generation of Programs, Angew. Chem., 34, 2613, 1995. 119. Gasteiger, J., Hanebeck, W., and Schulz, K.-P., Prediction of Mass Spectra from Structural Information, J. Chem. Inf. Comput. Sci., 32, 264, 1992. 120. Gear, C.W., Numerical Initial Value Problems in Ordinary Differential Equations, Prentice Hall, Englewood Cliffs, NJ, 1971. 121. Press, W.H., et al., Numerical Recipes, The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 1989. 122. Gasteiger, J., et al., Similarity Concepts for the Planning of Organic Reactions and Syntheses, J. Chem. Inf. Sci., 32, 701, 1992. 123. Röse, P. and Gasteiger, J., Automated Derivation of Reaction Rules for the EROS 6.0 System for Reaction Prediction, Anal. Chim. Acta, 235, 163, 1990. 124. Gasteiger, J., et al., Computer-Assisted Synthesis and Reaction Planning in Combinatorial Chemistry, Persp. Drug Discov. Design, 20, 1, 2000. 125. Pförtner, M. and Sitzmann, M., Computer-Assisted Synthesis Design by WODCA (CASD), in Handbook of Chemoinformatics — From Data to Knowledge in 4 Volumes, vol. 4, Gasteiger, J., Ed., Wiley-VCH, Weinheim, 2003, 1457. 126. Fick, R., Gasteiger, J., and Ihlenfeldt, W.D., Synthesis Planning in the 90s: The WODCA System, in E.C.C.C.1 Computational Chemistry F.E.C.S. Conference, Nancy, France, Bernardi, F., and Rivail, J.-L., Eds., AIP Press, Woodbury, New York, 1995, 526.
7 Expert Systems in Other Areas of Chemistry
7.1 Introduction

The majority of expert systems developed in the area of chemistry have been applied in a laboratory environment. The advantage of these systems is the assistance they provide to the laboratory scientist throughout a given experiment. They can assist in the planning and monitoring of the experiment and in interpreting test data. This is particularly interesting where chemistry knowledge is required in one or more of the manifold interdisciplinary research areas, like biology, biochemistry, pharmacology, ecology, and engineering. We will deal with the application of expert systems in some of these areas in this chapter.
7.2 Bioinformatics

Bioinformatics is the interdisciplinary research area between the biological and computational sciences. Similar to cheminformatics, bioinformatics deals with the computational processing and management of biological information, particularly proteins, genes, whole organisms, or ecological systems. Most bioinformatics research has been performed on the analysis of biological data, although a growing number of projects deal with the effective organization of biological information. Bioinformatics addresses several fields:

• Molecular Genetics: The structure and function of genes at a molecular level.
• Proteomics: Protein identification, characterization, expression, and interactions.
• Structural Genomics: Protein structure determination, classification, modeling, and docking.
• Functional Genomics: Development and application of global experimental approaches to assess gene function by making use of the information provided by structural genomics.
• Metabonomics: Quantitative measurement of the dynamic metabolic response of living systems to drugs, environmental changes, and diseases.
• Metabolomics: Cataloging and quantification of metabolites found in biological fluids under different conditions.

The following sections provide an overview of applications of expert systems and related software in these fields.
7.2.1 Molecular Genetics (MOLGEN)

One of the first outcomes of bioinformatics research was the Molecular Genetics (MOLGEN) project, which evolved from the same projects at Stanford University in artificial intelligence and knowledge engineering that produced the Dendritic Algorithm (DENDRAL) system. MOLGEN started in 1975 in the Heuristic Programming Project in a group around Edward Feigenbaum and was the result of the theses of Mark Stefik and Peter Friedland, completed in 1979 [1]. The goal of MOLGEN was to model the experimental design activity of scientists in molecular genetics. The authors started with the assumption that most experiments are designed by taking advantage of related work or previous experiments and adapting them to the particular experimental context. As with DENDRAL, this approach depends on domain-specific knowledge in the field of molecular biology and good heuristics to select among alternative implementations.

The decision to choose molecular biology as a prototype field was related to the enormous amounts of data produced by the increasingly available techniques in molecular biology. It could be foreseen that the existing search methodologies would not be able to cope with this mass of data. MOLGEN was thought to be able to remedy this by applying rules that were able to predict strategic directions for analysis [2]. The knowledge base of MOLGEN was created by software engineers and several molecular biologists and was an exemplary starting point for the cooperation of molecular biology and computer science that finally led to the discipline of bioinformatics.

MOLGEN comprises expert knowledge on enzymatic methods, nucleic acids, detection methods, and a number of tools for automated sequence analysis. The special module sequence (SEQ) was designed as an interactive program for nucleic acid sequence analysis, included different analytical methods, and was able to perform homology searches [3]. A series of other programs have been developed in the MOLGEN suite, such as MAP, which allows the generation of restriction enzyme maps of DNA structures from segmentation data [4], and SAFE, for the prediction of restriction enzymes from amino acid sequence data. The second phase of development, beginning in 1980, included additional analytical tools and extended the knowledge base.

At that time, MOLGEN and the available genetic databases were offered to the scientific community as GENET on the same computer resource that hosted the DENDRAL system [5,6]. The success of this software, in particular at biotech companies that frequently used these resources, led Feigenbaum and the facility manager, Thomas Rindfleisch, to the decision to restrict access to noncommercial users only. As a result of this restriction, Feigenbaum and others founded IntelliGenetics with the goal of commercializing the MOLGEN suite and gene databases to the emerging biotechnology market as BIONET. Within a few years BIONET included most of the important databases for molecular biology, including the National Institutes of Health (NIH) genetic sequence database GenBank, the NIH DNA sequence library, the European Molecular Biology Laboratory (EMBL) nucleotide sequence library, the National Biomedical Research Foundation Protein Identification Resource (NBRF-PIR), and the Swiss-Prot protein sequence database. IntelliGenetics changed its name to IntelliCorp in 1984.
In May 1986, Amoco bought voting control of the IntelliGenetics subsidiary, and the company became a wholly owned subsidiary of Amoco Technology Corporation in 1990. Four years
later Oxford Molecular Group in England bought IntelliGenetics to form a bioinformatics division. Founded in 1989, Oxford Molecular grew aggressively through the acquisition of software and experimentation-based companies, such as Chemical Design Limited and the Genetics Computer Group. At that time, the IntelliGenetics product range included GeneWorks, PC/GENE, and the IntelliGenetics Suite for DNA and protein sequence analysis, BIONET, and the GENESEQ database of protein and nucleic acid sequences extracted from worldwide patent documents. In 2000, the Oxford Molecular software businesses were acquired by Pharmacopeia Inc., which had already created a sound scientific software segment with the acquisition of Molecular Simulations Inc. and Synopsys Scientific Systems Ltd. Pharmacopeia combined its software subsidiaries in June 2001 to form Accelrys, which separated from Pharmacopeia in 2004.
7.2.2 Predicting Toxicology — Deductive Estimation of Risk from Existing Knowledge (DEREK) for Windows

Lhasa Ltd. was founded in 1983 to continue the development of the Logic and Heuristics Applied to Synthetic Analysis (LHASA) system for the design of complex organic molecule syntheses originating from Harvard University. The aim of the founding members was to fund and support the development and refinement of the generalized retrosynthetic reactions in the LHASA knowledge base. Lhasa Ltd. was set up as a nonprofit company, with the idea that customers of the software actively participate in the development of existing and new applications. As interest in the LHASA system diminished, Lhasa Ltd. focused more on the development of software for the area of toxicology, particularly the prediction of genotoxic, mutagenic, and carcinogenic properties of compounds. The current membership of Lhasa Ltd. almost exclusively supports the development and refinement of its flagship products, Deductive Estimation of Risk from Existing Knowledge (DEREK) for Windows and Meteor.

The predecessor of DEREK for Windows was Electric DEREK, originally created by Schering Agrochemical Company in the United Kingdom in 1986 [7]. DEREK was later adopted by Lhasa Ltd., allowing toxicologists from other companies to contribute data and knowledge to refine and improve the software. DEREK for Windows was created in 1999 using the original DEREK approach as its basis [8].

DEREK for Windows is a knowledge-based application for predicting the toxic hazard of chemicals, designed in particular for high-throughput screening for genotoxicity and mutagenicity, the prediction of carcinogenic hazards, and the assessment of skin-irritating properties of compounds. In contrast to a conventional database management system, the knowledge-based DEREK for Windows contains expert knowledge about quantitative structure–activity relationships (QSAR) and toxicology and applies these rules to make predictions about the toxicity of chemicals, typically when no experimental data are available [9,10].

The workflow with DEREK starts with entering a molecular structure via structure editor software or by importing files in one of the following formats: MDL Molfile, SD file, or MDL ISIS Sketch files (.skc). In the second step, the operator selects the species for which toxicological information is to be predicted. On processing
these data, DEREK attempts to predict potential chromosome damage, genotoxicity, mutagenicity, and skin sensitization and presents details on hazardous fragments that have toxic potential. The program applies QSAR rules and other expert knowledge to the input data to derive a conclusion about the potential toxicity of a compound. Each rule describes the relationship between a structural feature, or toxicophore, and its associated toxicity (a schematic sketch of this kind of structural-alert matching follows at the end of this section). In addition to carcinogenicity, toxicological end points currently covered by the system include mutagenicity, skin sensitization, irritancy, teratogenicity, and neurotoxicity. The operator can verify the predictions or look at details supporting the evidence, such as the following:

• The toxicophore, that is, the feature or group responsible for the toxic properties.
• The mechanistic description of the toxic effects or transformation of the compound, including literature references.
• The toxicological details of structural analogs, including data about species, assay, and literature reference.

This information allows evaluating the redesign of compounds to reduce toxicity from a toxicophore standpoint. Using Structure Data (SD) files — a file format that basically includes multiple MDL Molfiles — allows the user to process compounds in batch mode using an automated (AutoDerek) function. The user can then scroll through the results for the individual compounds. Results can be saved as a report in Rich Text Format (RTF), which can be opened by most word processing software; as a modified SD file, which allows importing the information into other databases or software; or as tab-delimited text, which is suitable for spreadsheet programs and statistics software.

The software can either run as a stand-alone version or in a two-tier implementation where knowledge base and client software are installed separately, either on a single computer or by running the knowledge-base server on a separate machine in the network. The client software runs on Microsoft Windows NT, Windows 2000, or Windows XP operating systems with minimal memory requirements.

DEREK for Windows is helpful for identifying potential toxicity risks for pharmaceutical drug candidates, intermediates, or impurities at an early stage in development. It particularly supports regulatory submissions under the High Production Volume (HPV) Challenge Program and the Registration, Authorization, and Evaluation of Chemicals (REACH) regulatory framework to evaluate the mammalian toxicity of chemicals. The HPV Challenge Program, defined in 1998 by the U.S. Environmental Protection Agency (EPA), calls for the toxicological assessment of substances that are manufactured or imported into the United States in quantities of more than one million pounds (around 450 tons) per year [11]. REACH is a similar, but more restrictive, approach adopted by the European Commission in 2003 [13]. REACH requires companies that manufacture or import more than one ton of a chemical substance per year into the European Union to register it in a central database, including exposure estimation and risk characterization. The European Parliament and the Council finally adopted the package, and REACH entered into force on June 1, 2007.
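DEREK's knowledge base itself is proprietary, so the following Python sketch only illustrates the general pattern of matching structural alerts (toxicophores) against a query structure; it uses the open-source RDKit toolkit and two invented example alerts rather than actual DEREK rules:

from rdkit import Chem

# Invented example alerts; real toxicophore definitions are far more elaborate
alerts = {
    "aromatic nitro group (example alert)": "[c][N+](=O)[O-]",
    "primary aromatic amine (example alert)": "[NX3;H2][c]",
}

def screen(smiles):
    mol = Chem.MolFromSmiles(smiles)
    hits = []
    for name, smarts in alerts.items():
        pattern = Chem.MolFromSmarts(smarts)
        if mol.HasSubstructMatch(pattern):
            hits.append(name)
    return hits

print(screen("O=[N+]([O-])c1ccccc1"))   # nitrobenzene triggers the nitro alert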
7.2.3 Predicting Metabolism — Meteor

Another program available from Lhasa Ltd. is Meteor, software that allows predicting the metabolic pathways of xenobiotic compounds [13–15]. Again, the program uses expert knowledge in the form of rules to predict the metabolic pathway of compounds. The predictions are presented in metabolic trees, similar to a synthesis tree, and may be filtered according to the likelihood of metabolites. Starting with a structure entered with MDL ISIS/Draw or imported as an MDL Molfile or ISIS Sketch file (.skc), the user can define constraints to restrict the reported metabolites. Unconstrained generation of metabolic pathways from a compound would lead to a combinatorial explosion of the number of metabolites. Constraints can be defined for the following:

• Reasoning method (absolute or relative).
• Likelihood of a prediction (probable, plausible, equivocal, doubted).
• Species where the metabolism is expected.
• Relationship between lipophilicity and drug metabolism, by linking to an external log P calculation program.
The software also allows comparisons between potentially competing biotransformations and provides comments and literature references as evidence to support its predictions. The resulting metabolic transformation tree view allows evaluating the metabolic path for potentially harmful metabolites. Meteor can be interfaced with the MetaboLynx mass spectrometer software from Waters Corporation to integrate mass spectrometry data from metabolism studies directly. MetaboLynx is part of the Waters MassLynx Application Managers, a suite of mass spectrometry instrument software [16]. It is designed for automated metabolism studies with data from LC/MS or LC/MS/MS time-of-flight (TOF) experiments. MetaboLynx is able to detect peaks in an LC/MS data file resulting from in vitro or in vivo biotransformation and provides a list of elemental formulae for unidentified components in a mass spectrum. Meteor uses these data to filter the list of predicted metabolites. System requirements are similar to DEREK for Windows. Other rule-based systems include MetabolExpert from CompuDrug Inc., designed for initial estimation of the structural formula of metabolites, and MetaDrug from GeneGo. MetaDrug splits query compounds into metabolites, runs them through QSAR models, and visualizes them in pathways, cell processes, and disease networks. An extensive overview on the commercially available systems was given by Ekins [17].
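The likelihood-based filtering of a metabolic tree can be pictured with the following Python sketch; the data structures, names, and example metabolites are invented and do not reflect Meteor's internal representation:

LEVELS = {"probable": 3, "plausible": 2, "equivocal": 1, "doubted": 0}

class Metabolite:
    def __init__(self, name, likelihood, children=None):
        self.name = name
        self.likelihood = likelihood
        self.children = children or []

def prune(node, threshold="plausible"):
    # Keep only branches whose likelihood is at or above the threshold
    if LEVELS[node.likelihood] < LEVELS[threshold]:
        return None
    kept = [prune(child, threshold) for child in node.children]
    node.children = [child for child in kept if child is not None]
    return node

tree = Metabolite("parent drug", "probable", [
    Metabolite("N-dealkylated metabolite", "probable"),
    Metabolite("ring-hydroxylated metabolite", "plausible",
               [Metabolite("glucuronide conjugate", "equivocal")]),
])
prune(tree)
print(len(tree.children))                 # 2: both first-level metabolites survive
print(tree.children[1].children)          # []: the equivocal conjugate was pruned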
7.2.4 Estimating Biological Activity — APEX-3D

One of the most comprehensive applications for investigating biological activity is Apex-3D, developed for Molecular Simulations Inc. [18,19] and later integrated in the Insight II environment of Accelrys Inc. Apex-3D is an expert system for investigating structure–activity relationships (SARs). The software is able to derive SAR and QSAR models for three-dimensional (3D) structures, which can be used for activity classification and prediction. The
foundation of the Apex-3D methodology is the automated identification of biophores (pharmacophores), that is, the steric and electronic features of a biologically active compound that are responsible for binding to a target structure. These biophores are used to build rules for the prediction of biological activity and for creating search queries to identify new leads in a database of 3D structures. A biophore can also be a starting point for deriving a 3D QSAR model. Apex-3D is divided into several modules:

• Computational Chemistry Module: Computes quantum-chemical and other atomic and molecular indexes; in addition, performs clustering of conformers for flexible compounds.
• Data Management Module: A management system for the internal database providing storage of 3D structures, structural parameters, and activity data.
• Frame System: Uses frames for the representation of chemical information that are implemented using a built-in ChemLisp language interpreter. ChemLisp represents a special dialect of the list processing (LISP) language.
• Rule Management Module: Allows creation, editing, and management of qualitative or quantitative rules stored in the system's knowledge base.
• Inductive Inference Module: Performs generation of rules on the basis of structure and activity data; it is based on algorithms of the logical structural approach [20] and provides tools for automated selection of biophores (pharmacophores) and interactive building of 3D QSAR models. The module performs statistical evaluation of the predictive and discriminating power of selected biophores and models.
• Deductive Inference Module: Provides prediction of biological activity based on the rules stored in the knowledge base.
• Query Generator: Generates biophore queries for databases from MDL Information Systems Inc. (Elsevier MDL) to find new compounds satisfying biophore definitions.

SARs in Apex-3D are represented either as qualitative rules or as quantitative rules. A qualitative rule has the following form:

IF structure S contains biophoric pattern B
THEN structure S has activity A with probability P

Here the system tests whether the structure contains the biophoric pattern B and calculates the biological activity with a probability. The biophoric (or pharmacophoric) pattern describes the steric and electronic features that enable a biologically active compound to bind to a target structure. A quantitative rule is as follows:

IF structure S contains biophoric pattern B
AND pattern B has QSAR model A=F(B,S)
THEN structure S has activity A

In this case, the system tests whether the structure contains the biophoric pattern B and checks whether the pattern has an associated QSAR model. If both conditions are true, the system calculates the activity using the QSAR model. Besides the statistical prediction of biological activity based on the rules, the deductive inference module provides an explanation of predictions. It produces a three-dimensional graphic display of the biophores found in the analyzed structures and allows superimposition of compounds with the same biophore. Visualization is performed via integration with the Insight II environment from Accelrys Inc. [21], a molecular modeling environment with a powerful graphical interface.

The chemical structure representation in Apex-3D is based on the concept of a descriptor center that represents a part of the hypothetical biophore. Descriptor centers can be atoms, sets of atoms, pseudoatoms, or substructures that participate in ligand–receptor interactions. The interaction is derived from electrostatic, hydrophobic, dispersion force, and charge-transfer information that comes from quantum-chemical calculations or from atomic contributions to hydrophobicity or molar refractivity. A descriptor center finally consists of a structural part (e.g., atom, multiple similar atoms, pseudoatoms, fragments) and a property. To define this structural part efficiently, the line notation SLang is used, which is similar to the simplified molecular input line entry specification (SMILES) and SMILES arbitrary target specification (SMARTS) [12]. Examples of structural parts and their representations in SLang are as follows:

• A list of heteroatoms: [N,P,O,S,F,CL,BR,I]
• A carboxylic group: C{3}(=O)-OH
• Any aromatic ring: A01:[A,*]0:A01
• CNH in a peptide unit: CH(-N0H)(-C=O)-C

These structural fragments are represented in frames by using a combination of ChemLisp and SLang in the following form:
(frame pseudoatom1
  (type frametype)
  (slots
    (name symbol)
    (pattype symbol : atom set-of-atoms)
    (comment string)
    (select form)
    (actions form)
    (distance2d symbol)
  )
)
The slots define the name or symbol of the pseudoatom (name) and whether the pseudoatom is a single atom or a set of atoms (pattype); they also include ChemLisp procedures for identifying the real atoms that define the pseudoatom (select) and for assigning pseudoatom properties (actions), as well as ChemLisp functions, for example for defining a pseudobond between real atoms and pseudoatoms (distance2d). Descriptor center information for a molecule is stored in a property matrix that contains the indices of all descriptor centers in the molecule and a distance matrix that covers all distances between pairs of descriptor centers. These matrices are finally used to identify the biophores, which then correspond to subsets of the matrices. Biophore identification is performed using the structural information and statistical criteria for assessing the probability of activity prediction. Structure matching is based on the selection of maximal common patterns of biophoric centers, which leads to a compatibility graph [23]. Vertices of this graph correspond to pairs of equivalent centers, whereas edges correspond to pairs of centers having equivalent distances. Two centers are considered equivalent if they have at least one equal property within a preset tolerance. Activity prediction can be trained in Apex-3D on the basis of identified biophores to provide the best estimation for a particular type of compound. Activities for new molecules are then predicted from a dynamically created training set of their own. Predictions can be made as a classification of the new compound into predefined classes or by calculating a quantitative value based on 3D QSAR models present in the knowledge base. Results are reported interactively using the prediction viewer, one molecule at a time. Apex-3D provides an explanation of why an activity was assigned, based on the biophores contained in the molecule and present in the knowledge base.
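The compatibility-graph matching described above can be illustrated with a short sketch. The following Python fragment is not part of Apex-3D; it is a minimal, hypothetical illustration in which each molecule is given as a list of descriptor centers with numerical properties and a precomputed distance matrix, and the largest common pattern is approximated with a greedy clique search rather than an exact maximal-common-subgraph algorithm.

from itertools import combinations

def centers_equivalent(ca, cb, prop_tol=0.1):
    # Two descriptor centers are equivalent if they share at least one
    # property whose values agree within a preset tolerance.
    shared = set(ca["properties"]) & set(cb["properties"])
    return any(abs(ca["properties"][p] - cb["properties"][p]) <= prop_tol
               for p in shared)

def compatibility_graph(mol_a, mol_b, dist_tol=0.5):
    # Vertices: pairs (i, j) of equivalent centers taken from molecules A and B.
    # Edges: two vertices are connected if the corresponding center-center
    # distances in both molecules agree within a tolerance.
    vertices = [(i, j)
                for i, ca in enumerate(mol_a["centers"])
                for j, cb in enumerate(mol_b["centers"])
                if centers_equivalent(ca, cb)]
    edges = set()
    for (i1, j1), (i2, j2) in combinations(vertices, 2):
        if i1 == i2 or j1 == j2:
            continue
        if abs(mol_a["dist"][i1][i2] - mol_b["dist"][j1][j2]) <= dist_tol:
            edges.add(frozenset([(i1, j1), (i2, j2)]))
    return vertices, edges

def common_pattern(vertices, edges):
    # Greedy approximation of the largest clique, i.e., the maximal common
    # pattern of biophoric centers shared by the two molecules.
    best = []
    for start in vertices:
        clique = [start]
        for v in vertices:
            if v not in clique and all(frozenset([v, u]) in edges for u in clique):
                clique.append(v)
        if len(clique) > len(best):
            best = clique
    return best

A candidate biophore would correspond to the centers of the first molecule appearing in the returned clique; a production system would additionally apply the statistical criteria mentioned above before accepting it.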
7.2.5 Identifying Protein Structures Proteins are important for a variety of biological functions, and the knowledge about their structures is essential for evaluating interaction mechanisms, understanding diseases, and designing biologically active compounds. The typical technique for determining protein structures is x-ray crystallography, which is based on the detection of diffraction patterns from crystals of the purified protein. This process has several drawbacks:
• Growing a protein crystal is a tedious and often unsuccessful process.
• If a protein crystal can be grown, it is usually small and fragile and contains solvents and impurities that can barely be separated.
• If a good crystal can be obtained, the crystal structure usually does not reflect the shape of the protein in its natural environment.
• If a diffraction pattern can be obtained, it contains information only about the intensity of the diffracted wave; the phase information cannot be determined experimentally and must be estimated by other means (usually referred to as the phase problem).
• The selection of points from which intensities are collected is limited, leading to a limited resolution, which reduces the differentiation between atoms.
• An electron density map can be obtained by a Fourier transform of the observed intensities and approximated phases, showing the electron density distribution around the atoms of a unit cell of the protein structure. An initial structure can be determined from this electron density map; however, the maps are typically noisy and blurred, and evaluating such a map is a tedious process.
• The resulting structure can be used to improve the phases and to create a better map, which can be reinterpreted. The whole process can go through many cycles, and the complete interpretation may take days to weeks.
The step of interpreting the electron density map and building an accurate model of a protein remains one of the most difficult to improve. There are several approaches that mimic the reasoning of a crystallographer in map interpretation. One of the early attempts at model building from an electron density map for proteins was the CRYSALIS project, a system that maintains domain-specific knowledge through a hierarchy of production rules [24,25]. Another approach is molecular scene analysis, which is based on spatial semiautomated geometrical analysis of electron density maps [26]. One of the most successful approaches is the TEXTAL system, designed by a team around Kreshna Gopal in 1998 [27]. TEXTAL is designed to build protein structures automatically from electron density maps and is primarily a case-based reasoning program; that is, problem solving is based on solutions to similar previously solved problems. The stored structures are searched for matches for all regions in the unknown structure; this is a computationally expensive method, since the optimal rotation between two regions has to be taken into account. TEXTAL therefore uses a filtering method based on feature extraction as a presearch before it applies an extensive density correlation. The filter method uses a k-nearest neighbor algorithm based on weighted Euclidean distances to learn and predict similarity; however, domain expert interaction is required to decide whether two electron density regions are similar. Since candidates may occur in any orientation, TEXTAL uses rotation-invariant statistical properties, such as average density and standard deviation. Rotation-invariant properties are extracted from a database of regions within a set of electron density maps for known proteins. The regions are restricted to those centered on known Cα coordinates, which are selected by the C-alpha pattern recognition algorithm (CAPRA) module [28]. CAPRA uses a feed-forward neural network
for predicting the Cα atoms in the backbone and creates a linear chain with these atoms. The remaining backbone and side chains are searched for similar patterns on each Cα-centered region along the predicted chains. This process is similar to the decision-making process crystallographers use to interpret electron density maps: main-chain tracing followed by side-chain modeling. The techniques used take into account many of the constraints and criteria that a crystallographer would apply, and the created model can be refined by a crystallographer. The similarity of density patterns can be estimated by applying different descriptors. Rotation-invariant statistical descriptors are the mean, standard deviation, skewness, and kurtosis of the density distribution within a spherical region. Another approach is based on moments of inertia; an inertia matrix is calculated and the eigenvalues are retrieved for the three mutually perpendicular moments of inertia. Either the eigenvalues or their ratios provide information about the electron density distribution in a particular region. Other descriptors are the distance to the center of mass of the region and the geometry of the density. Each descriptor can be parameterized by using predefined radii of 3Å, 4Å, 5Å, and 6Å, resulting in four versions of each descriptor. The database of regions is derived from normalized maps of 200 proteins and provides around 50,000 spherical regions (5Å radius) centered on Cα atoms. After identification, similar regions are weighted according to their relevance in describing patterns of electron density. The system uses the SLIDER algorithm to weight the matched regions based on their relative similarity within the protein structure [29]. Finally, the coordinates of the local side chains and backbone atoms of the matching Cα region are transformed and placed into a new map. Several postprocessing steps are used to refine the model: (1) reorientation of residues to prevent steric hindrance; (2) fine orientation of atoms to optimize their fit to the electron density while preserving geometric constraints like typical bond distances and angles; and (3) sequence alignment to correct for falsely assigned amino acids [30]. TEXTAL is a typical example of a system using a variety of previously described technologies to mimic the decision-making process of domain experts in protein crystallography. The system has been quite successful in determining various protein structures, even with average-quality data. TEXTAL was originally programmed in a variety of languages, such as the formula translator programming language (Fortran), C, C++, Perl, and Python. The system has since been incorporated into the Python-based Hierarchical Environment for Integrated Xtallography (PHENIX) crystallographic computing environment, developed at Lawrence Berkeley National Lab and hosted by the Computational Crystallography Initiative (CCI) [31]. CCI is part of the Physical Biosciences Division at Lawrence Berkeley National Laboratory and focuses on the development of computational tools for high-throughput structure determination. PHENIX is an outcome of an international collaboration among Los Alamos National Laboratory, the University of Cambridge, and Texas A&M University and is funded by the NIH [32]. TEXTAL also successfully demonstrates the value of collaboration and continuous improvement in the research community.
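The rotation-invariant statistics and the weighted k-nearest neighbor filter described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not TEXTAL code: the feature weights, the treatment of the remaining descriptors (moments of inertia, distance to the center of mass, geometry of the density), and the radius parameterization are omitted, and the data layout is hypothetical.

import numpy as np

def region_descriptors(density_values):
    # Rotation-invariant statistics of the electron density sampled inside a
    # spherical region: mean, standard deviation, skewness, and kurtosis.
    d = np.asarray(density_values, dtype=float).ravel()
    mu = d.mean()
    sigma = max(d.std(), 1e-12)        # guard against a constant region
    z = (d - mu) / sigma
    return np.array([mu, sigma, (z**3).mean(), (z**4).mean()])

def knn_filter(query_features, region_db, weights, k=5):
    # Rank stored regions by weighted Euclidean distance in feature space and
    # return the k nearest candidates; only these are passed on to the
    # expensive density-correlation step.
    q, w = np.asarray(query_features), np.asarray(weights)
    scored = [(float(np.sqrt(np.sum(w * (q - np.asarray(f))**2))), region_id)
              for region_id, f in region_db]
    return sorted(scored, key=lambda t: t[0])[:k]

Here, region_db would hold one (identifier, feature vector) pair for each of the roughly 50,000 stored Cα-centered regions, and the weights expressing the relative relevance of the individual features are assumed to be given.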
7.3 Environmental Chemistry 7.3.1 Environmental Assessment — Green Chemistry Expert System (GCES) In 1990 the U.S. Congress approved the Pollution Prevention Act, which established a national policy to prevent or reduce pollution at its source whenever feasible. The EPA took this opportunity to devise creative strategies to protect human health and the environment. They developed the program Green Chemistry, which helps in “promoting innovative chemical technologies that reduce or eliminate the use or generation of hazardous substances in the design, manufacture, and use of chemical products” [33]. Green Chemistry covers the design of new chemical products and processes that reduce or eliminate the use and generation of hazardous substances. The program is based on 12 principles (adopted from [34]): 1. Prevent waste. 2. Design safer chemicals and products. 3. Design less hazardous chemical syntheses. 4. Use renewable feedstocks. 5. Use catalysts, not stoichiometric reagents. 6. Avoid chemical derivatives. 7. Maximize atom economy. 8. Use safer solvents and reaction conditions. 9. Increase energy efficiency. 10. Design chemicals and products to degrade after use. 11. Analyze in real time to prevent pollution. 12. Minimize the potential for accidents. Based on these principles, the Green Chemistry initiative supports basic research to provide the chemical tools and methods necessary to design and develop environmentally sound products and processes. The EPA funded a series of basic research projects that consider impacts to human health and the environment in the design of chemical syntheses and other areas of chemistry. One of the outcomes of these efforts is the Green Chemistry Expert System (GCES) [35]. GCES allows users to build a green chemical process, to design a green chemical, or to survey the field of green chemistry. The system is equally useful for new and existing chemicals and their synthetic processes. GCES runs on Microsoft Windows 3.1 or higher and requires minimum hardware. It was developed using Microsoft Access 2.0 relational database and provides a runtime version. GCES consists of five modules: • Synthetic Methodology Assessment for Reduction Techniques (SMART): A program adapted from the EPA’s SMART review process and designed to quantify and categorize hazardous materials used in a manufacturing process. • Green Synthetic Reactions: Provides a searchable database of synthetic processes and identifies alternative processes published to replace more hazardous materials with less hazardous ones.
• Designing Safer Chemicals: An information module that gives details on the design of safer chemicals. • Green Solvents/Reaction Conditions Database: Provides detailed information about solvents and reaction conditions and supplies selected solvent properties that help to identify alternative, less hazardous solvents. • Green Chemistry Reference Sources: Allows searching in a database for literature references to the four modules and other Green Chemistry references.
7.3.2 Synthetic Methodology Assessment for Reduction Techniques
SMART is a nonregulatory approach for using chemistry to achieve pollution prevention [36]. It operates in parallel with the EPA's New Chemicals Program within the Office of Pollution Prevention and Toxics. The New Chemicals Program originates from the U.S. Toxic Substances Control Act and was established to help manage the potential risk of commercially available chemicals. SMART was originally developed for these new chemicals but was later applied to existing chemical processes. The SMART assessment evaluates manufacturing methods described in new chemicals submissions and is intended to recommend green chemistry approaches to reduce pollution at the source prior to commercial production of a new chemical substance. SMART is part of the review of premanufacture notifications received under the Toxic Substances Control Act. If the manufacturer uses or produces hazardous substances, the EPA makes suggestions on a voluntary basis to encourage the manufacturer to review its processes and to reduce the hazardous materials related to these processes. The purpose of the SMART module in GCES is to help chemical manufacturers review their processes and perform similar analyses during the course of process development prior to filing a premanufacture notification with the EPA. The SMART assessment techniques are applicable to both new and existing manufacturing processes. The SMART interrogation process starts by asking for basic data, such as the reaction yield and the number of batches produced per year. A reaction identifier allows the entered data to be stored and retrieved. In the second step, the user enters process chemicals, each of which may be retrieved from the existing database via the Chemical Abstracts Service (CAS) number or, for previously stored new data, an internal identifier. The SMART module uses SMILES notation to represent molecular structures by strings of symbols. Substance data include name, SMILES string, molecular weight, role (e.g., feedstock, product, solvent), and quantity. GCES includes a database with basic information on about 60,000 chemicals listed in CAS. If the CAS number is in the database, GCES automatically retrieves the name, molecular weight, and SMILES notation for a structure. New substances can be identified via SMILES string and can be added to the database. After entering all reaction data, a SMART assessment can be performed. The program then performs a series of mass-balance calculations and provides the waste quantification, hazard classification, and a qualitative level of concern. The algorithms cover single-step reactions that produce a single chemical product; the software is not applicable to reactions with multiple products or to polymer reactions. In such cases, the individual reactions of a synthetic sequence have to be calculated sequentially.
The results are presented as amounts of waste for different hazard categories (i.e., chemical tiers) as a percentage of the annual production volume. The categories, or tiers, are as follows:
• Tier 1 contains a small set of chemicals with exceptionally hazardous effects, such as dioxins and phosgene.
• Tier 2 includes the chemicals covered by different sections of the U.S. Emergency Planning and Community Right-to-Know Act (EPCRA), particularly those classified as extremely hazardous, as well as chemicals having functional groups associated with high toxicity, such as acid chlorides, alkoxysilanes, epoxides, and isocyanates.
• Tier 3 chemicals are those of either unknown or intermediate toxicity, particularly those not covered by the other tiers.
• Tier 4 contains substances that pose little or no risk of harm under normal usage conditions, like water, sodium chloride, or nitrogen.
At this stage SMART allows summary information to be retrieved about the EPA-regulated chemicals in the reaction, that is, those covered in tier 2. The summary provides production amount and waste information and indicates the individual EPCRA and other regulations for the substance. The system can provide suggestions for alternatives from the green chemistry point of view as well as a level-of-concern assessment, which presents several concern statements about the assignment of used or produced substances to one or more of the tiers or about the waste amount. The output of a level-of-concern analysis might look as follows:
• A Tier 1 chemical (phosgene) is used.
• EPA regulated chemicals (i.e., phosgene, toluene, bisphenol A, ethyl amine) are used.
• High levels of tier 2 waste (i.e., toluene, bisphenol A, ethyl amine, HCl, product) are present.
• High levels of tier 1 + tier 2 + tier 3 (i.e., phosgene, toluene, bisphenol A, ethyl amine, HCl, product) waste are present.
• The total waste is excessive.
The evaluation can also point to several general improvements in the procedure, for instance, improving solvent recovery, increasing yield, or reducing the excess of unrecoverable substances. Although these statements are not intended to provide details about how the improvements can be achieved, they are helpful for designing the entire process in an ecologically sound manner and for understanding the EPA's process of reviewing premanufacture notifications.
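As an illustration of the kind of mass-balance bookkeeping described above, the following Python sketch groups the waste of a single-step reaction by hazard tier and reports it as a percentage of the annual production volume. The tier assignments, data layout, and threshold are hypothetical simplifications; the actual SMART algorithms and the GCES substance database are more elaborate.

# Hypothetical tier assignments; GCES looks these up in its internal database.
TIERS = {"phosgene": 1, "toluene": 2, "bisphenol A": 2, "ethyl amine": 2,
         "HCl": 3, "water": 4, "sodium chloride": 4}

def smart_waste_summary(process_chemicals, annual_production_kg):
    """process_chemicals: list of (name, kg_per_year, role) tuples.
    Everything that is not isolated product counts as waste; the waste is
    grouped by tier and expressed as percent of annual production volume."""
    waste_by_tier = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
    for name, kg, role in process_chemicals:
        if role == "product":
            continue
        tier = TIERS.get(name, 3)          # unknown toxicity defaults to tier 3
        waste_by_tier[tier] += kg
    return {tier: 100.0 * kg / annual_production_kg
            for tier, kg in waste_by_tier.items()}

def level_of_concern(waste_percent, limit=50.0):
    # A crude stand-in for the concern statements shown above.
    concerns = []
    if waste_percent[1] > 0:
        concerns.append("A tier 1 chemical is used or produced.")
    if waste_percent[1] + waste_percent[2] + waste_percent[3] > limit:
        concerns.append("High levels of tier 1 + tier 2 + tier 3 waste are present.")
    return concerns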
7.3.3 Green Synthetic Reactions
The Green Synthetic Reactions module is a database query module for finding references related to selected synthetic processes, particularly those that act as examples of
green chemistry alternatives to replace conventional industrial syntheses. The following search fields are available:
• Pollution Prevention Comments: General comments, such as safety, concern, inexpensive.
• Key words: Cover a series of predefined terms of importance for the reaction, such as aromatic, oxidation, hydrolysis, rearrangement.
• Status: Refers to the status of the manufacturing process and includes terms like pilot, plant, production, or patent.
• Reference: Allows querying the individual citations for any words or numbers.
Results are presented on a form that contains the full reference, keywords, P2 comments, status, and vendor, as in the following example:
Reference: 1) Cusumano, James, "New technology and the environment," CHEMTECH, August 1992, 22(8), pp. 482-489. 2) Kirk-Othmer, Encyclopedia of Chemical Technology, 3rd ed., 1983, Vol. 15, pp. 357-376.
Key words: C4-Oxidation; Methyl methacrylate (MMA); Methanol; Oxygen
P2 Comments: Two-stage catalytic oxidation (isobutylene to methacrolein and then to methacrylic acid; final esterification with methanol); complex catalyst system.
Status: Commercially used in Japan.
Vendor: Mitsubishi Rayon Co.
7.3.4 Designing Safer Chemicals
Designing Safer Chemicals is a module that provides qualitative information about the toxicities of compounds within certain chemical classes or with certain uses. The module helps estimate the qualitative toxicity of a particular substance based on SARs. It describes mechanisms of toxicity and is able to suggest structural modifications that may reduce toxicity. The module provides a search interface as well as directories for chemical classes, characteristics, and uses of substances. Particularly interesting is an expert system for designing safer substances based on chemical classes. Selecting, for instance, the chemical class polymers gives access to the designing safer polymers entry, which leads to a series of questions (answers are marked in bold):
The polymer is cationic or potentially cationic: No
Of C, H, N, O, S, and Si, two elements are integral: Yes
Halogens are absent from the polymer: No
Halogens are only covalently bound to C: Yes
If present, Na, Mg, Al, K, and Ca are only monatomic cations: Absent
If present, Li, B, P, Ti, Mn, Fe, Ni, Cu, Zn, Sn, and Zr are cumulatively present in < 0.2 weight percent: Absent
All other elements are present exclusively as impurities or in formulation: Yes
Polymer is made exclusively from feedstocks listed for the Polyester (e3) exemption: No
Polymer has NAVG MW ≥ 10,000 (an e2 polymer): Yes
Oligomeric content of the e2 polymer is ≤ 5% below 1,000 daltons and ≤ 2% below 500 daltons: Yes
The e2 polymer absorbs its weight in water: No
The polymer you are manufacturing is safer than most and may qualify for the polymer exemption from TSCA Section 5 reporting requirements for new chemicals. Further information regarding the Polymer Exemption Rule, a technical Guidance Manual, and the Chemistry Assistance Manual for Premanufacture Notices can be obtained from the TSCA Hotline at (202) 554-1404. (If the polymer decomposes, degrades, or depolymerizes, it may not qualify for the polymer exemption.)
The Designing Safer Chemicals part of GCES can be extended with estimation software from the EPA for bioaccumulation (bcfwin), biodegradation (biowin), and aquatic toxicity (ecowin).
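The dialog above is essentially an ordered checklist in which each answer either keeps the exemption path open or ends it. The following Python sketch shows one way such a checklist can be represented and evaluated; the questions are paraphrased and simplified from the dialog, and the actual Polymer Exemption Rule criteria are more detailed, so this is an illustration rather than a reimplementation of the GCES module.

# Each entry pairs a paraphrased question with the answer required to keep
# the exemption path open.
CHECKLIST = [
    ("polymer is cationic or potentially cationic", False),
    ("at least two of C, H, N, O, S, Si are integral elements", True),
    ("halogens, if present, are only covalently bound to carbon", True),
    ("number-average molecular weight is at least 10,000 (e2 polymer)", True),
    ("oligomer content is <= 5% below 1,000 Da and <= 2% below 500 Da", True),
    ("polymer absorbs its own weight in water", False),
]

def may_qualify_for_exemption(answers):
    """answers: dict mapping question text to True/False as entered by the
    user. Returns (verdict, failing_question); the polymer may qualify only
    if every answer matches the expected value."""
    for question, expected in CHECKLIST:
        if answers.get(question) != expected:
            return False, question
    return True, None

A production system would, of course, phrase each question exactly as in the regulation and attach explanatory text and references to every outcome.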
7.3.5 Green Solvents/Reaction Conditions
The Green Solvents/Reaction Conditions module includes a database of physicochemical properties for more than 600 solvents of varying hazard and is primarily designed to find alternative solvents with properties similar to the one under consideration. The database does not contain toxicity data but does provide regulatory information as well as global warming and ozone depletion potentials. The user may retrieve general information about green solvents and reaction conditions, search for physical and chemical properties, or browse the solvents database. The information for each solvent includes physicochemical properties, such as molecular weight, boiling point, melting point, specific gravity, vapor pressure, water solubility, log octanol/water partition coefficient, Henry's Law constant, flash point, explosion limits, and dielectric constant. The database also provides numerical values for the ozone depletion potential and the global warming potential, which are calculated relative to carbon dioxide or to CFC-11 (CCl3F). The EPA regulations for the solvents are indicated.
7.3.6 Green Chemistry References This module covers references to material published in green chemistry and related fields and provides a simple search interface to retrieve references by key terms or by
category and subcategory. Search fields are author, title, journal, or a set of predefined keywords. Although GCES is far from providing comprehensive information, it is a good example of a query-based expert system. More information and a download version of the software can be found on the EPA’s Web site (http://www.epa.gov/gcc).
7.3.7 Dynamic Emergency Management — Real-Time Expert System (RTXPS)
The Real-Time Expert System (RTXPS) was developed by K. Fedra and L. Winkelbauer [37]. It was designed for on-site dynamic emergency management in the case of technological and environmental hazards, including early warning for events such as toxic or oil spills, floods, and tsunamis. It incorporates control and assessment tasks, including coordination of first response, recovery, restoration, and clean-up operations, and provides teaching and training applications. The system can implement operating manuals or checklists based on standard operating procedures or protocols that assist the operator, and it tracks all events, communications, and outcomes of an evaluation process. RTXPS uses forward-chaining inference to process context-sensitive rules that are based on descriptors. The rule interpretation results in triggering actions that are presented to the user and may invoke function calls for computing, data entry, and data display. RTXPS provides extensive reporting and communication mechanisms that allow external communication via e-mail or fax and the automatic generation and online update of Web pages for public information. RTXPS uses natural language syntax for descriptors, rules, and actions, supported by a script language for developing the knowledge base.
7.3.8 Representing Facts — Descriptors Descriptors are used to store the facts of RTXPS. These descriptors can be either directly entered or can be assigned by the inference engine as a result of data evaluation. The descriptor includes enumerated methods to update its values in the appropriate context. A descriptor basically incorporates values and units, questions, as well as links to rules, functions, or models. A simple example of a descriptor is as follows: DESCRIPTOR retention_time U days V very_small[ 0, 360] / V small [ 360, 1080] / V medium [1080, 1800] / V large [1800, 3600] / V very_large[3600, 7200] / R 7777007 / Q What is the average retention time, in days, Q for the reservoir ? retention time is the theoretical
Q period the average volume of water spends in the reservoir,
Q estimated as the ratio of volume to through flow.
ENDDESCRIPTOR
This descriptor has a unique identifier (retention_time), provides symbolic values (V) that are each associated with a particular range of numerical values in days (U), and contains a link to a particular rule (R). The (Q) section defines the text used for the interaction with the user.
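The mapping from a numerical answer to the symbolic value of a descriptor can be sketched as follows. The Python fragment below mirrors the retention_time descriptor shown above; it is a hypothetical illustration of the mechanism, not RTXPS code, and the half-open ranges resolve the overlapping interval boundaries in the listing.

RETENTION_TIME = {
    "unit": "days",
    "values": [("very_small", 0, 360), ("small", 360, 1080), ("medium", 1080, 1800),
               ("large", 1800, 3600), ("very_large", 3600, 7200)],
    "rule": 7777007,
    "question": "What is the average retention time, in days, for the reservoir?",
}

def symbolic_value(descriptor, number):
    # Map a numerical answer onto the symbolic value defined by the (V) ranges,
    # so that rules can test the descriptor in either symbolic or numeric form.
    for name, low, high in descriptor["values"]:
        if low <= number < high:
            return name
    return None

print(symbolic_value(RETENTION_TIME, 25))   # -> very_small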
7.3.9 Changing Facts — Backward-Chaining Rules
Backward-chaining rules define how values of descriptors are affected by values of other descriptors, by data entry, or by results from a simulation model. Backward-chaining rules can result in the assignment of descriptor values or can define their relative or incremental changes. Additionally, rules can include basic calculations and can control the inference strategy for a given context. An example of a rule is as follows:
RULE 1020231
IF average_reservoir_depth == small
AND retention_time < 30
THEN reservoir_stratification = unlikely
ENDRULE
The rule uses two descriptors (average_reservoir_depth and retention_time) and evaluates them in either symbolic form (small) or numerical form (< 30). If both conditions are true, a new descriptor value is set. Another use of the backward-chaining capabilities of RTXPS is to provide a few summary variables that describe large scenarios, which would otherwise generate large data volumes. The scenario can be reduced to selected facts that are relevant for describing the event, such as the level of exposure, the contamination area, the number of people involved, or a hazard classification of the event. These values are then used to trigger the appropriate actions.
7.3.10 Triggering Actions — Forward-Chaining Rules The forward-chaining rules define the sequence of actions by setting an action status value depending on values of descriptors or status values of other actions. The system distinguishes the following rule types for forward chaining: • The rule assigns a value to the specified action if no value exists. RULE IF ACTION(Zero_Action) == done THEN ACTION(First_Action) => ready ENDRULE
• The rule assigns a value to the specified action, overriding an existing value if necessary.
RULE
IF ACTION(First_Action) == ignored
THEN ACTION(First_Action) =>> done
ENDRULE
• The rule enables or disables the use of actions assigned to the specified group.
RULE
IF TRUE
THEN GROUP(0002) => enable
ENDRULE
• The rule repeats a specified action.
RULE
IF DESCRIPTOR(score) < 100
THEN ACTION(Test_Score) => repeat
ENDRULE
• The rule repeats all actions in the group.
RULE
IF ACTION(Test_Score) == done
AND DESCRIPTOR(score) < 50
THEN GROUP(0002) => repeat
ENDRULE
• The rule assigns a value to the specified descriptor.
RULE
IF ACTION(Test_Score) == done
AND DESCRIPTOR(score) < 50
THEN DESCRIPTOR(test) = failed
ENDRULE
7.3.11 Reasoning — The Inference Engine
The inference engine recursively compiles all information required as input to the rules. It evaluates the rules, updates the target descriptor where applicable, and triggers the appropriate actions. This kind of inference process assists the operator in specifying scenario parameters and provides estimates for cases where data are available. The flexibility to use both qualitative symbolic and quantitative numerical methods in a single application allows the system to be responsive to experimental data and to observations or constraints entered by the user. A knowledge-base browser gives the user the opportunity to navigate through the tree structure of the knowledge base within the context of the particular problem. The inference tree displays sets of rules linked to a list of descriptors and allows their definitions to be inspected.
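To illustrate the repeated rule evaluation described above, the following Python sketch implements a minimal forward-chaining loop over condition and action functions operating on a fact dictionary. It is a didactic approximation under assumed data structures, not the RTXPS engine, which additionally performs backward chaining, manages action groups and repeats, and reacts to events in real time. The example rule corresponds to the first forward-chaining rule type listed in Section 7.3.10.

def forward_chain(rules, facts, max_passes=100):
    # Repeatedly evaluate all rules against the current facts until no rule
    # changes anything. Each rule is a (condition, action) pair: the condition
    # is a predicate over the fact dictionary, the action mutates it.
    for _ in range(max_passes):
        changed = False
        for condition, action in rules:
            before = dict(facts)
            if condition(facts):
                action(facts)
                if facts != before:
                    changed = True
        if not changed:
            break
    return facts

# Equivalent of: IF ACTION(Zero_Action) == done THEN ACTION(First_Action) => ready
rules = [
    (lambda f: f.get("ACTION:Zero_Action") == "done" and "ACTION:First_Action" not in f,
     lambda f: f.update({"ACTION:First_Action": "ready"})),
]
print(forward_chain(rules, {"ACTION:Zero_Action": "done"}))
# -> {'ACTION:Zero_Action': 'done', 'ACTION:First_Action': 'ready'}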
RTXPS provides a combination of data analysis, reasoning, and communication options that can be used as a framework for integration with other environmental information management systems.
7.3.12 A Combined Approach for Environmental Management
The application of expert systems in the domain of environmental management is particularly useful for assisting human experts in their decisions for environmentally sound procedures at reasonable cost. A computer-aided system for environmental compliance auditing was proposed by N. M. Zaki and M. Daud from the Faculty of Engineering at the University Putra in Malaysia [38]. The system performs environmental monitoring for impact assessment and incorporates an environmental database management system. The foundation for developing this system was the Environmental Impact Assessment program, which required reports on new projects to be approved by the Department of Environment before project implementation. In 1987, Malaysia enacted the Environmental Quality Order, which required the evaluation of every development project regarding its potential environmental impact. The aim of the environmental impact assessment is to assess the overall impact on the environment of development projects proposed by the public and private sectors. The objectives of environmental impact assessment are as follows:
• Examine and select the best from the project options available.
• Identify and incorporate into the project plan appropriate abatement and mitigating measures.
• Predict significant residual environmental impacts.
• Determine the significance of the residual environmental impacts predicted.
• Evaluate the environmental costs and benefits of the project to the community.
The Malaysian Department of Environment recommended that the different project phases of exploration, development, operation, and rehabilitation be evaluated with respect to their environmental, biological, and socioeconomic impact. Once the possible environmental impacts are assessed, the project initiator must identify and indicate the possible mitigation measures to be taken with the purpose of controlling environmental pollution. Compliance auditing is performed during the assessment to check whether the project complies with environmental protection standards. An example of an expert system in the environmental domain is a hybrid expert system consisting of a geographic information system (GIS) [39] and RTXPS, as previously described. The system is an integration of a real-time forward-chaining expert system and a backward-chaining system for decision support using simulation models and the GIS. RTXPS was based on the results of the international research project High-Performance Computing and Networking for Technological and Environmental Risk Management (HITERM), which was developed under the European Strategic Program of Research and Development in Information Technology for high-performance computing and networking. HITERM integrates high-performance computing on parallel machines and workstation clusters with a decision-support approach based on a hybrid expert system. Typical applications are in the domain of technological risk assessment and
chemical emergencies in fixed installations or transportation accidents. The system is based on client-server architecture to integrate the various information sources in an operational decision-support system. In this model, RTXPS serves as a framework that is connected to a number of servers used for computing of incoming data from mobile clients, which can be used for on-site data acquisition. The expert system works in real time and controls communication between all persons involved in an environmental and technological risk situation. It guides and advises the user using information from several databases and presents material safety data sheets for hazardous substances. The system simulates risky situations by using various simulation models and predicts the environmental impact. All input information is verified for completeness, consistency, and plausibility. The simulation model is then selected using a simple screening mechanism on the available information. The selected model interprets the results and converts them into guidance and advice for the operators. The embedded simulation models include different release types including pool evaporation, atmospheric dispersion with wind field models, fire and explosion models, and soil contamination. The system logs data entry, user input, decisions, model selection, results, and communication activities. The recorded information can be used for training purposes and planning risk assessment tasks.
7.3.13 Assessing Environmental Impact — EIAxpert
EIAxpert is a rule-based expert system for environmental impact assessment, particularly designed for evaluating development projects at an early stage. EIAxpert is a generic, data-driven tool that can be adapted to different application domains. It was designed for the assessment of water resources development projects in the Cambodian, Laotian, Thai, and Vietnamese parts of the Mekong Basin. The system relies on assessment rules derived from the Asian Development Bank's Environmental Guideline Series [40]. The generic EIAxpert system includes the following:
• Database and editor tools for water resources development projects, allowing alternatives for a project to be compared.
• A GIS including satellite imaging that covers the entire river basin and the areas immediately affected by individual projects.
• Databases covering information on meteorology, hydrography, water quality, and wastewater treatment.
• A knowledge base with checklists, rules, background information, guidelines, and instructions for the scientist.
• An inference engine that guides the scientist through project assessment in a menu-driven dialog.
• A report generator for summarizing the impact assessment and for generating hardcopy reports.
EIAxpert rules allow external models to be invoked as part of the inference procedure. This includes models for impacts of reduced water flow, wastewater discharges,
changes in land use, irrigation water demand, or changes to the groundwater. The knowledge base for the lower Mekong basin contains more than 1000 rules and is linked to a hypertext generator for creating Web pages with explanation of terms and concepts, background information, and instructions for the user. A knowledge-base browser allows inspection of individual rules and recursively traces the symbolic reasoning mechanism of the system.
7.4 Geochemistry and Exploration
7.4.1 Exploration
The area of geological exploration is particularly complex, since a huge amount of data is collected during a geological survey. In 1977 the Stanford Research Institute published the expert system PROSPECTOR, which aids geophysicists in the interpretation of these data [41]. PROSPECTOR focused on the exploration of ore deposits and used a knowledge base containing five different models describing various mineral deposits and more than 1000 rules. The software requires a characterization of the particular deposit of interest, including information on the geological environment, structural controls, and the types of minerals present or suspected. This information is then compared with the existing models, and an evaluation is made of similarities, differences, and missing information. An explanation engine allows the user to investigate how conclusions are drawn. Finally, the system attempts to assess the potential presence of a given mineral deposit. PROSPECTOR was tested in the field in 1980, which finally resulted in the discovery of a $100 million molybdenum deposit. The success of this system inspired the development of a series of other commercial expert system packages that use approaches similar to PROSPECTOR. An example is DIPMETER, which determines the subsurface geological structure of a given site by interpreting dipmeter logs [42]. The system uses knowledge about dipmeter data and basic geology to uncover features in the data that aid in the identification of geological structures. This capability is of particular importance in oil or mineral exploration. SECOFOR was developed in 1983 by Teknowledge at the request of Elf Aquitaine as a kind of exception-management system for geological exploration [43]. The reasons for this request were twofold. First, each failure in the on-site exploration process causes high costs, often in the range of several hundred thousand dollars per day. Second, experts who are able to handle the exception are rarely available in time. Elf Aquitaine wanted to address these issues with an expert system that is able to manage exception data, to provide solutions, and to act as a replacement for an expert until one becomes available. SECOFOR was specifically developed to address drilling problems that may cause shutdowns for several days or weeks. The system uses information about the geological formations at the site, the conditions of the current problem, and historical information about other problems experienced in the past. It performs a diagnosis of the problem, produces a recommendation to correct the problem based on previous experience, and provides advice on modifications to current practices to avoid the problem in the future.
7.4.2 Geochemistry
Geochemistry has a series of fascinating methods that allow geochemical fingerprinting of past tectonic environments by analysis of major and trace elements in ancient volcanic rocks. Geochemists use quantitative or empirically derived geochemical discriminant diagrams. The expert system approach enables geochemical evidence to be integrated with geological, petrological, and mineralogical evidence in identifying the eruptive setting of ancient volcanic rocks. The Expert System for the Characterization of Rock Types (ESCORT), published by J. A. Pearce in 1978, provides methodologies for combining geochemical and nongeochemical probabilities [44]. The knowledge base of ESCORT contains probability data for magma types and includes dispersion matrices of probabilities for various nongeochemical criteria, from which geochemistry-based probabilities can be calculated using probability density functions. The knowledge base is derived from a geochemical database of around 8000 experimental data points and provides a set of editable a priori probabilities. The inference engine is based on Bayes' Decision Rule, adapted to take into account uncertainties in geological evidence, which enables the different probabilities to be combined numerically. Upper and lower probability thresholds are used to decide whether an interpretation is likely or unlikely. The outputs are probabilities for each tectonically defined magma type, including confidence values for data that are outside the typical range. The system was shown to overcome many of the ambiguities usually associated with geochemical discrimination diagrams.
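The following Python sketch illustrates how a priori probabilities for magma types can be combined numerically with several pieces of evidence and screened against probability thresholds. It is a naive Bayes-style illustration under the assumption of independent evidence; ESCORT's actual treatment of dispersion matrices, probability density functions, and uncertainty in the geological evidence is considerably more elaborate, and all data in the example call are hypothetical.

def combine_probabilities(priors, evidence_likelihoods, lower=0.05, upper=0.75):
    """priors: {magma_type: a priori probability}.
    evidence_likelihoods: one dict per criterion (geochemical or
    nongeochemical) mapping magma_type to P(observation | magma_type).
    Returns normalized posteriors and a coarse verdict per magma type."""
    posterior = dict(priors)
    for likelihood in evidence_likelihoods:
        for magma_type in posterior:
            posterior[magma_type] *= likelihood.get(magma_type, 1e-6)
    total = sum(posterior.values()) or 1.0
    posterior = {m: p / total for m, p in posterior.items()}
    verdict = {m: ("likely" if p >= upper else "unlikely" if p <= lower else "undecided")
               for m, p in posterior.items()}
    return posterior, verdict

# Hypothetical example with two magma types and two independent criteria.
post, verdict = combine_probabilities(
    {"island_arc": 0.5, "within_plate": 0.5},
    [{"island_arc": 0.8, "within_plate": 0.2},
     {"island_arc": 0.7, "within_plate": 0.4}])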
7.4.3 X-Ray Phase Analysis
X-ray phase analysis is used for the identification of mineral phases in rocks, soils, clays, or mineral industrial materials. The phase analysis of clays is particularly difficult because these materials generally consist of a mixture of different phases, such as mixed-layer and individual clay minerals, and associated minerals, such as calcite and quartz. Plançon and Drits proposed an expert system for the identification of clays based on x-ray diffraction (XRD) data [45]. This expert system is capable of identifying associated minerals, individual clay minerals, and mixed-layer minerals. It can further provide an approximate structural characterization of the mixed-layer minerals and can perform a structural determination of the mixed-layer minerals by comparing experimental x-ray diffraction patterns with patterns calculated for different models. The phase analysis is based on the comparison of XRD patterns recorded for three states of the sample: dried at room temperature, heated at 350°C, and solvated with ethylene glycol. The identification of associated minerals is performed by the ASSOCMIN module on all diffraction peaks whose positions and intensities do not change after glycolation or heating. The observed data are compared with data for the most commonly associated minerals that are stored in the system. Results for the minerals found are displayed, and the user finally has to decide whether the result is useful or not. The identification of individual clay minerals is done with the module INDVCLAY, which allows the user to select a clay mineral from a list for phase analysis. When a clay mineral is selected, the module displays the main features of its XRD pattern, its behavior after heating or ethylene-glycol treatment, and the minerals that
provide similar patterns. The user can check the presence of the mineral in the sample by entering experimental diffractogram data, which results in a statistical analysis. The determination of the mixed-layer clay minerals is done by the NATMIX module, which makes decisions based on the comparison of the three states — dried, heated, and solvated. The comparison is based on rules that allow predictions to be made for the different types of mineral layers due to the different behavior in the three states. The system also proposes adding additional specific experiments that further refine the characterization of the layer composition. The system covers the entire family of mixed-layer clay minerals and determines their nature. Other modules support the structural characterization of the mixed-layer minerals. The STRUCMIX module determines the mean abundance of each layer and the range of interaction between these layers. CALCMIX proposes experimental data for reflections located in different diffraction domains and allows the nature and structure of the mixed-layer minerals to be confirmed by calculating theoretical XRD patterns and matching them with the experimental pattern.
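A simplified version of the three-state comparison on which rules such as those in NATMIX rely can be sketched as follows. The Python fragment classifies the behavior of a single basal reflection from its spacings in the dried, glycolated, and heated states; the tolerance, the behavior classes, and the example values are illustrative assumptions and do not reproduce the rules of the system described above.

def classify_reflection(d_dried, d_glycolated, d_heated, tol=0.3):
    """Basal spacings in Angstrom for the three sample states. Returns a
    coarse behavior class that identification rules could test."""
    expands = (d_glycolated - d_dried) > tol      # shifts on glycolation
    collapses = (d_dried - d_heated) > tol        # shifts on heating
    if expands and collapses:
        return "expandable"        # swelling (e.g., smectite-like) layers
    if collapses:
        return "collapsible"       # collapses on heating only
    if expands:
        return "glycol_sensitive"  # expands on glycolation only
    return "stable"                # unchanged: candidate associated mineral

print(classify_reflection(12.5, 17.0, 10.0))   # -> expandable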
7.5 Engineering
7.5.1 Monitoring of Space-Based Systems — Thermal Expert System (TEXSYS)
Space-based systems usually require constant monitoring of performance and condition; human operators scan telemetry from the system and watch for deviations in conditions or expected performance. The process requires multiple operators and is expensive, particularly for large space-based systems where monitoring has to be performed in real time. By automating fault detection and isolation, recovery procedures, and control of these dynamic systems, the need for direct human intervention may be reduced. In 1988 the National Aeronautics and Space Administration (NASA) worked on the Systems Autonomy Demonstration Project with the goal of developing and validating an expert system for real-time control and fault detection, isolation, and recovery of a complex prototype space subsystem. The project was a joint effort between NASA's Ames Research Center and Johnson Space Center. One of the outcomes of this project was the Thermal Expert System (TEXSYS), a computer program that exerts real-time control over a complicated thermal-regulatory system that includes evaporators, condensers, a pump, valves, and sensors [46]. TEXSYS observes differences between actual and expected conditions and analyzes these differences to determine whether a given condition signifies a malfunction in a component or at the system level. It then takes corrective action, such as opening or closing a valve. The knowledge base captured engineering expertise on the particular thermal-regulatory system in 340 rules. TEXSYS is integrated in a multitier architecture in which the expert system interacts with conventional controlling hardware and software via an intermediate integration layer. The thermal-regulatory system that was used for evaluating the prototype of TEXSYS consists of a thermal control system, or thermal bus, that works like a Carnot refrigerator. In such a system, a heat-acquisition device absorbs heat from an external source, changing the state of anhydrous ammonia from liquid to mixed liquid and vapor. The mixture is separated by a centrifugal pump, which sends
vapor to condensers in a heat-rejection device and liquid back to the heat-acquisition device. A regulating valve on the vapor line maintains a constant pressure and temperature. In normal operation, the thermal bus is balanced by controlling the valve setting and pump power. In a real-time scenario a fast response from TEXSYS has to be ensured, usually within seconds. To achieve fast response times, the data to be processed by TEXSYS are filtered according to their change characteristics; for instance, steadily or slowly changing data are eliminated. TEXSYS is able to identify all seven known system-level faults as well as component-level faults chosen by thermal engineers. A separate library of generic components, including their characteristics, ensures that TEXSYS can be adapted to different hardware configurations. Changes in the thermal bus hardware were handled by creating a new mathematical model in TEXSYS, choosing components from this library and connecting them as in the schematic diagram of the hardware. New data from the intermediate, or integration, layer of TEXSYS are placed into the model at sensor locations and are then processed both by active values, across connections, and by rules, across components. The insertion of a datum at a given location in the model may then result in a chain of inferences about the behavior of the system. TEXSYS was a prototype developed in the course of the project for Space Station Freedom, which was later integrated into the International Space Station project. The selected thermal bus was the Boeing Aerospace Thermal Bus System, which was the baseline thermal architecture for the Space Station Freedom external thermal bus. The nonlinearity of this particular thermal bus architecture made conventional dynamic numerical simulation infeasible. In test runs with the Boeing Aerospace Thermal Bus System at the NASA Johnson Space Center in 1989, TEXSYS successfully performed real-time monitoring and control during all nominal modes of operation. TEXSYS noticed and reacted to all of the 17 required bus faults, informed the operator of a possible blockage fault, and toggled the isolation valve to confirm the fault. Toggling the valve restored normal flow, indicating a previously stuck valve.
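The filtering of telemetry by change characteristics mentioned above can be illustrated with a short sketch. The following Python generator passes on only readings whose rate of change exceeds a threshold; the data layout, threshold, and per-sensor bookkeeping are assumptions made for illustration and do not describe the actual TEXSYS integration layer.

def filter_telemetry(stream, rate_threshold=0.01):
    """stream: iterable of (sensor_id, timestamp, value) tuples in time order.
    Yields only readings whose rate of change exceeds the threshold, so the
    expert system reasons about significant events instead of steady data.
    The first reading of each sensor only initializes its baseline."""
    last = {}
    for sensor_id, t, value in stream:
        if sensor_id in last:
            t_prev, v_prev = last[sensor_id]
            rate = abs(value - v_prev) / max(t - t_prev, 1e-9)
            if rate >= rate_threshold:
                yield sensor_id, t, value
        last[sensor_id] = (t, value)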
7.5.2 Chemical Equilibrium of Complex Mixtures — CEA
Information about the composition of chemical equilibria is helpful for drawing conclusions on the theoretical thermodynamic properties of a chemical system. These properties can be applied to the design and analysis of technical equipment, such as turbines, compressors, engines, and heat exchangers. The NASA Lewis Research Center, later renamed the NASA John H. Glenn Research Center, has focused for several decades on the development of methods for calculating complex compositions and equilibria and on applying them to a number of problems arising in the engineering of propulsion jet engines for aircraft and spacecraft. In 1962 they developed a computer program for chemical equilibrium calculation, which was extended in several stages and was finally reported in 1994 [47,48]. The resulting program, Chemical Equilibrium with Applications (CEA), is used to obtain chemical equilibrium compositions of complex mixtures, which are applied in several areas:
• Obtaining chemical equilibrium compositions for assigning thermodynamic states on the basis of temperature, pressure, density, enthalpy, entropy, shock tube parameters, or detonations.
• Obtaining the transport properties of complex mixtures.
• Calculating theoretical rocket performance for finite- or infinite-area combustion chambers.
• Calculating Chapman-Jouguet detonations, which proceed at a velocity at which the reacting gases reach sonic velocity.
• Calculating shock tube parameters for both incident and reflected shocks.
CEA requires two types of data that are common to all problems: thermodynamic data and thermal transport property data. These two data sets include approximately 1340 gaseous and condensed species as reaction products and thermal transport property data for 155 gaseous species. The data sets may be extended or edited by the user. Problem input consists of seven categories of input data sets in a general free-form format. A simple example defining the reactant composition is as follows:
reactant
fuel CH4 wt%=30 t,k=373
fuel C6H6 wt%=70 t,k=373
oxidant Air wt%=100 t,k=88 h,j/mol=55
Here, two fuel components, methane and benzene, are defined with their weight percentages and their temperature in Kelvin. The oxidant is liquid air at 88 K with an enthalpy of 55 joules per mole. The problem to be solved is defined according to different problem areas, such as temperature and pressure problems, entropy and volume problems, rocket or shock problems, or detonation problems. A simple problem can be defined as follows:
problem detonation t,k=298.15
which is interpreted as a detonation problem at 25°C. The program produces various outputs, including tables of thermodynamic and equilibrium data and information about the iteration procedures. The report provides particular information on rocket performance, detonation, and shock parameters that helps in deciding on an appropriate rocket design in the engineering process.
7.6 Concise Summary
Apex-3D is an expert system for investigating SARs and quantitative structure–activity relationships for three-dimensional structures.
Bioinformatics is the interdisciplinary research area between the biological and computational sciences dealing with computational processing and management of biological information, particularly on proteins, genes, whole organisms, or ecological systems.
Biophore (pharmacophore) refers to the steric and electronic features of a biologically active compound that are responsible for binding to a target structure. Chemical Equilibrium with Applications (CEA) is an expert system developed by NASA for determining compositions in chemical equilibria for deriving thermodynamic properties of a chemical system in propulsion jet engines. Cheminformatics is a research area dealing with computational processing and management of chemical information, particularly information acquisition, chemical descriptors, and computerized analysis of chemistry data. Deductive Estimation of Risk from Existing Knowledge (DEREK) for Windows is a knowledge-based software for predicting genotoxicity, mutagenicity, and carcinogenicity for high-throughput screening processes. EIAxpert is a rule-based expert system for environmental impact assessment, particularly designed for assessment of development projects at an early stage. Environmental Protection Agency (EPA) is a U.S. governmental organization founded in 1970 in response to the growing public demand for higher standards in environmental protection. Expert System for the Characterization of Rock Types (ESCORT) is an expert system based on Bayesian rules providing probabilities for the occurrence of rock types based on geochemical and nongeochemical data. Functional Genomics refers to the development and application of global experimental approaches to assess gene functions by making use of the information provided by structural genomics. Geographic Information System (GIS) is a computer program that allows storing, editing, analyzing, sharing, and displaying geographic information, primarily for the use in scientific investigations, resource management, asset management, and planning. Green Chemistry is a program from the EPA that helps to promote chemical technologies that reduce or eliminate the use or generation of hazardous substances. Green Chemistry Expert System (GCES) is an expert system developed within the scope of the EPA Green Chemistry program. It includes modules for SMART, reaction databases, chemicals design, solvents database, and literature references for green chemistry. High Performance Computing and Networking for Technological and Environmental Risk Management (HITERM) is a project that integrates high-performance computing with a decision support approach based on a hybrid expert system and is designed for applications such as technological risk assessment, chemical emergencies, or transportation accidents with respect to environmental damage. High Production Volume (HPV) Challenge is a program defined in 1998 by the EPA for toxicological assessment of substances that are manufactured or imported into the United States. Metabolomics deals with the catalogization and quantification of metabolites found in biological fluids under different conditions. Metabonomics refers to the quantitative measurement of the dynamic metabolic response of living systems as a response to drugs, environmental changes, and diseases.
Meteor is knowledge-based software that allows the prediction of metabolic pathways of xenobiotic compounds.
Molecular Genetics studies the structure and function of genes at a molecular level.
MOLGEN is an expert system developed to model the experimental design activity of scientists in molecular genetics.
PROSPECTOR is an expert system for exploration of ore deposits based on the geological environment, structural controls, and the types of minerals present or suspected.
Proteomics is a scientific research area covering protein identification, characterization, expression, and interactions.
Quantitative Structure–Activity Relationships (QSARs) refer to scientific methods for correlating chemical structures quantitatively with biological activity or chemical reactivity.
Quantitative Structure–Property Relationships (QSPRs) refer to scientific methods for correlating chemical structures quantitatively with chemical, physicochemical, or physical molecular properties.
Real-Time Expert System (RTXPS) is a system designed for on-site dynamic emergency management in the case of technological and environmental hazards, including early warning for events such as toxic or oil spills, floods, and tsunamis.
Registration, Evaluation, and Authorization of Chemicals (REACH) is a European law that entered into force in 2007. It requires that companies that manufacture or import more than one ton of a chemical substance per year into the European Union register the substance in a central database and include exposure estimation and risk characterization.
SECOFOR is an expert system supporting exception management for drilling process failures during geological exploration.
Structural Genomics is a bioinformatics area dealing with protein structure determination, classification, modeling, and the investigation of docking behavior.
Structure–Activity Relationships (SARs) describe the correlation between a chemical structure and biological activity or chemical reactivity in a qualitative manner.
Synthetic Methodology Assessment for Reduction Techniques (SMART) is a nonregulatory program of the EPA for achieving pollution prevention in the production of new chemicals. It includes a review process for the environmental assessment and hazard classification of new chemicals.
TEXSYS is an expert system developed by NASA to monitor conditions and determine malfunctions in thermal bus components of propulsion jet engines.
X-Ray Diffraction (XRD) is an analytical technique that takes advantage of diffraction patterns that emerge when a single crystal or a powder of crystalline material is exposed to x-rays.
X-Ray Phase Analysis is an analytical technique for the identification of mineral phases in rocks, soils, clays, or mineral industrial material based on XRD.
8 Expert Systems in the Laboratory Environment
8.1 Introduction

As we have seen in the previous chapters, expert systems have been successfully applied to a wide variety of problems in research and industrial chemistry. Since many of these systems are applied in a laboratory environment, they have to coexist with a series of other software types in the laboratory. Several issues should be addressed:

• Several requirements and technical considerations have to be taken into account if expert systems are implemented into an existing laboratory software environment. The requirements depend on the application area as well as on the applicable regulations for data management.
• Laboratories are subject to certain regulations that ensure that the results produced by a laboratory — and, finally, the products delivered to a customer — conform to generally accepted standards. This also affects the development process for software operating in those regulated environments.
• Laboratories generate a vast amount of data and information that has to be organized, managed, and distributed by data management systems.
• Data management systems provide the basis for decisions; they have to be interfaced with existing expert systems.
• Documentation is a significant part of the decision process. In almost any case of a decision made by a team, a document is required that comprises the consolidated information relevant to the decision subject.

This chapter will give an overview of applicable regulations, software development issues, and software types encountered in the laboratory. We will address some typical situations where the technologies described in the previous chapters can be useful to support laboratory data management.
8.2 Regulations

Software running in the laboratory has to meet certain regulatory standards to be operated in an acceptable fashion. The pharmaceutical industry, for example, is subject to a series of requirements, most of which are covered by regulations of the U.S. Food and Drug Administration (FDA) in the Code of Federal Regulations (CFR) under Title 21, abbreviated as 21 CFR Part x [1–4]. These federal rules have become a de facto standard followed by companies all over the world. The most important FDA regulations for the chemistry laboratory environment are as follows:
• 21 CFR Part 58, describing the good laboratory practices (GLP)
• 21 CFR Parts 210, 211, and others, covering the good manufacturing practices (GMP)
• 21 CFR Part 11, describing requirements for electronic records and electronic signatures

Parts 210, 211, and the related parts are sometimes collectively referred to as the GMPs, and the abbreviation GxP is often used to address all good practices; besides GLP and GMP, it includes good clinical practice, good automated manufacturing practice, and good documentation practice. The most important regulations concerning the management of electronic data are described in the next sections.
8.2.1 Good Laboratory Practices

GLP generally refers to a system of management controls for laboratories and research organizations to ensure the consistency and reliability of results, as outlined in the Organisation for Economic Co-operation and Development (OECD) Principles of GLP and in national regulations. Whereas the FDA and the U.S. Environmental Protection Agency (EPA) instituted GLP regulations, the OECD published GLP Principles in 1981, which now apply to the 30 member states of the OECD. The regulations set out the rules for good practice and help scientists to perform their work in compliance with internal standards and external regulations rather than evaluating the scientific content or value of the research. GLP focuses on the following areas.

8.2.1.1 Resources, Organization, and Personnel

GLP regulations require that the structure of the research organization and the responsibilities of the research personnel are clearly identified. This includes well-defined qualification and training of staff members, sufficient facilities and equipment, and the qualification, calibration, and regular maintenance of equipment.

8.2.1.2 Rules, Protocols, and Written Procedures

GLP requires that the main steps of research studies are described in a study plan and that documentation is performed in a manner that makes it possible to repeat studies and obtain similar results. The routine procedures are described in written standard operating procedures (SOPs).

8.2.1.3 Characterization

GLP stresses the essential knowledge about the materials used during the study. For studies that evaluate the properties of pharmaceutical compounds during the preclinical phase, it is a prerequisite to have details about the test item and the test system — often an animal or plant — to which it is administered.

8.2.1.4 Documentation

GLP requires that the raw data are documented in a way that reflects the procedures and conditions of the study. Final study reports and the scientific interpretation of the
results are the responsibility of the study director, who must ensure that the contents of the report describe the study accurately. Storage of records must ensure searchability and safekeeping during the retention period.

8.2.1.5 Quality Assurance

Quality assurance (QA) as defined by GLP is a team of persons charged with assuring the management that GLP compliance has been attained within the laboratory. They are organized independently of the operational and study program and function as witnesses to the entire research process.
8.2.2 Good Automated Laboratory Practice (GALP)

The EPA established GALP. The purpose of GALP is to establish a uniform set of procedures to assure that all data used by the EPA are reliable and credible. GALP particularly refers to Laboratory Information Management Systems (LIMS), a type of software that manages data for research and production in a sample-oriented manner. However, GALP uses the term LIMS in a broader sense, referring generally to automated laboratory systems that collect and manage data, including communication components, operating system software, database management systems, and application software involved in entering, recording, modifying, and retrieving data. The GALP guidance is built on six principles:
1. Laboratory management must provide a method of assuring the integrity of all data. Communication, transfer, manipulation, and the storage/recall process all offer potential for data corruption. The demonstration of control necessitates the collection of evidence to prove that the system provides reasonable protection against data corruption.
2. The formulas and decision algorithms employed by the software must be accurate and appropriate. Users cannot assume that the test or decision criteria are correct; those formulas must be inspected and verified (a minimal sketch of such a verification follows after this list).
3. A critical control element is the capability to track raw data entry, modification, and recording to the responsible person. This capability uses a password system or equivalent to identify the time, date, and person or persons entering, modifying, or recording data.
4. Consistent and appropriate change controls, capable of tracking the software operations, are a vital element in the control process. All changes must follow carefully planned procedures, be properly documented, and, when appropriate, include acceptance testing.
5. Procedures must be established and documented for all users to follow. Control of even the most carefully designed and implemented software will be thwarted if the user does not follow these procedures. This principle implies the development of clear directions and SOPs, the training of all users, and the availability of appropriate user support documentation.
6. The risk of software failure requires that procedures be established and documented to minimize and manage such failures. Where appropriate, redundant systems must be installed, and periodic system backups must be
performed at a frequency consistent with the consequences of the loss of information resulting from a failure. The principle of control must extend to planning for reasonable unusual events and system stresses.
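The second principle, in particular, implies that any calculation or decision rule embedded in laboratory software should be checked against independently established reference values rather than taken on trust. The following Python sketch illustrates one way such a verification might look; the calibration formula, the reference cases, and the tolerance are hypothetical placeholders and are not part of the GALP guidance itself.

import math

def back_calculated_concentration(response: float, slope: float,
                                  intercept: float, dilution: float) -> float:
    """Back-calculate an analyte concentration from a detector response
    using a linear calibration (response = slope * concentration + intercept)."""
    return (response - intercept) / slope * dilution

# Reference cases with independently calculated expected results
# (hypothetical values, used only to illustrate the verification step).
REFERENCE_CASES = [
    # (response, slope, intercept, dilution, expected concentration)
    (1050.0, 10.0, 50.0, 1.0, 100.0),
    (1050.0, 10.0, 50.0, 2.5, 250.0),
    (55.0, 10.0, 50.0, 1.0, 0.5),
]

def verify_formula() -> bool:
    """Return True only if every reference case reproduces its expected value."""
    for response, slope, intercept, dilution, expected in REFERENCE_CASES:
        result = back_calculated_concentration(response, slope, intercept, dilution)
        if not math.isclose(result, expected, rel_tol=1e-9):
            print(f"FAILED: got {result}, expected {expected}")
            return False
    return True

if __name__ == "__main__":
    print("Formula verification passed:", verify_formula())

In practice, such checks are documented test cases rather than ad hoc scripts, but the principle is the same: the decision criteria built into the software are inspected and demonstrated, not assumed.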
8.2.3 Electronic Records and Electronic Signatures (21 CFR Part 11)

The regulation 21 CFR Part 11, issued by the FDA, focuses on the safety of electronic records and signatures and provides criteria for their acceptance by the FDA, which apply to all companies intending to develop, produce, market, and sell products in the United States. The rule is divided into several parts covering the following typical requirements: (1) system validation, including copies of electronic records, record safety and archiving, audit trail, compliance with sequencing of events, and authority checks; and (2) controls for identification. The FDA defines electronic records in the following way:

• "Records that are required to be maintained under predicate rule requirements and that are maintained in electronic format in place of paper format."
• "Records that are required to be maintained under predicate rules, that are maintained in electronic format in addition to paper format, and that are relied on to perform regulated activities."
• "Electronic signatures that are intended to be the equivalent of handwritten signatures, initials, and other general signings required by predicate rules. Part 11 signatures include electronic signatures that are used, for example, to document the fact that certain events or actions occurred in accordance with the predicate rule (e.g., approved, reviewed, and verified)."

In other words, all records that are required to be maintained by predicate rules and that are maintained in electronic format, and all regulated activities that rely on electronic records, are considered to be 21 CFR Part 11 relevant. The following is a summary of procedures that have to be implemented in a system that complies with 21 CFR Part 11:

• Procedures and controls exist to ensure the authenticity, integrity, and, when appropriate, the confidentiality of electronic records and to ensure that the signer cannot readily repudiate the signed record as not genuine.
• The system is validated to verify accuracy, reliability, consistent intended performance, and the ability to discern invalid or altered records.
• It is able to make available accurate and complete (data, audit trail, and metadata) copies of records in electronic form for inspection, review, and copying.
• Records are protected so that they are accurate and readily retrievable throughout the required retention period.
• The system produces a secure, computer-generated, date- and time-stamped audit trail to independently record operator entries and actions that create, modify, or delete electronic records.
• In the case of record changes and deletions, the original values are maintained and not obscured. Historical information is available in the audit trail.
• The identity of the individual creating, modifying, or deleting the record is maintained.
• The date and time stamp are maintained.
• The audit trail information is retained as long as the data it pertains to.
• The audit trail information is available for QA and agency review and copying.
• Access to the system is restricted and maintained:
− Access to the system is limited to authorized individuals — authority checks are automatically performed by the system.
− The user list is documented.
− Access status is periodically reviewed to assure that it is current.
• Authority checks are conducted and documented to ensure that only authorized persons
− Can use the system
− Electronically sign a record
− Access the operation or computer system input or output device
− Alter a record
− Perform operations
• Electronic signatures and handwritten signatures executed to electronic records are linked or associated to their respective electronic records to ensure that the signature cannot be excised, copied, or otherwise transferred to falsify an electronic record. The link association is retained as long as the record is kept.
• Each electronic signature is unique to one individual and is not reused by or reassigned to anyone else.
• There are procedures to electronically deauthorize lost, stolen, missing, or otherwise potentially compromised devices that bear or generate code and password information and to issue temporary or permanent replacements using suitable, rigorous controls.
• There are safeguards to prevent unauthorized use of passwords or identification codes.
• Unauthorized attempts to use the system are detected and reported in an immediate and urgent manner to the system security unit and, as appropriate, to organizational management.

Even though several software vendors promote their products as "Part 11 compliant" systems, this claim is simply incorrect. 21 CFR Part 11 requires both procedural controls and administrative controls in addition to the technical controls that a software vendor can offer. What a vendor can offer is a compliant-ready solution; that is, a software application that does not disrupt a compliant environment.
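The requirement that an electronic signature be inseparably linked to its record can be pictured as a signature manifest that is bound to the exact record content, for example by a cryptographic hash. The following Python sketch is a conceptual illustration only, using invented names and a minimal data format; an actual Part 11 environment additionally relies on validated software, controlled time sources, and the procedural and administrative controls mentioned above.

import hashlib
import json
from datetime import datetime, timezone

def sign_record(record_content: bytes, full_user_name: str, reason: str) -> dict:
    """Create an electronic-signature manifest that is bound to the exact
    content of the record via a SHA-256 hash (illustrative sketch only)."""
    return {
        "record_sha256": hashlib.sha256(record_content).hexdigest(),
        "signed_by": full_user_name,
        "signed_at_utc": datetime.now(timezone.utc).isoformat(),
        "reason": reason,          # e.g., "approved", "reviewed", "verified"
    }

def signature_is_valid(record_content: bytes, manifest: dict) -> bool:
    """The signature no longer matches if the record content has been altered
    or if the manifest is copied onto a different record."""
    return manifest["record_sha256"] == hashlib.sha256(record_content).hexdigest()

# Usage sketch with a hypothetical record
record = json.dumps({"sample": "S-001", "result": 4.27, "unit": "mg/L"}).encode()
manifest = sign_record(record, "Jane Q. Analyst", "reviewed")
assert signature_is_valid(record, manifest)
tampered = record.replace(b"4.27", b"5.27")
assert not signature_is_valid(tampered, manifest)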
8.3 The Software Development Process

To create software that performs according to the original requirements as well as to external and internal regulations, we need to keep certain processes in mind. Let us have a look at a generally accepted procedure for creating new software.
8.3.1 From the Requirements to the Implementation

A software development project typically starts with the definition of the business requirements. These requirements can be collected by sales personnel, project managers, or business development, or they can come from internal resources. The typical commercial procedure involves a request for proposal from a customer, which is a document describing the goals, acceptance criteria, and the environment in which the software is intended to run. This document is evaluated by the software vendor, involving a team from different departments, such as sales, marketing, business development, and product management. The team sorts the incoming requests, discusses the strategic impacts, and involves the software engineering department for a preliminary assessment of the feasibility of the project. If a decision is made to start the development project, the customer is informed about that decision in a statement of work. At this point, one or more product managers will establish and maintain a well-defined process, involving the following steps.

8.3.1.1 Analyzing the Requirements

The most important task in software product development is specifying the technical requirements that are derived from the user requirements. Although a user knows exactly what he expects from software, he usually does not have the skills to create a design for the software. One of the tasks in documenting the requirements is to extend incomplete specifications and to eliminate ambiguous or contradictory ones. The outcome is usually called a requirements document or release plan and contains only requirements from the user or customers, in a form that allows deriving more detailed specification documents.

8.3.1.2 Specifying What Has to Be Done

Specification is the task of precisely describing the software to be written. It is a good rule to spend considerably more time on specification than on programming. Specification can be divided into several categories describing the functionality, the user interface design, or the technical aspects of the software.

8.3.1.3 Defining the Software Architecture

The architecture of a software system refers to an abstract representation of the system, concerning the underlying hardware, the operating system, the components of the software, the interfacing with other software, and the readiness for future improvements.

8.3.1.4 Programming

Programming (implementation) refers to the coding of the software in one or more appropriate programming languages. Even though this seems to be the most obvious part of software engineering, it is by far not the largest portion; usually, the design effort far exceeds the programming effort.
8.3.1.5 Testing the Outcome

Testing the software is performed at several stages. The first test is done by the software engineer, who tests the code he created; this is usually referred to as the developer test. If several software engineers are working on different parts of a software project, the next step is the integration test, which verifies that the code created by the different software engineers works appropriately together. Finally, the entire software is tested in two ways. Software verification ensures that the software conforms to the specifications, whereas software validation concerns the software performing according to the original requirements (a small traceability sketch at the end of this section illustrates the distinction).

8.3.1.6 Documenting the Software

Documentation concerns basically two areas: (1) specification documents — produced during the software development process — covering the internal design of the software for the purpose of future maintenance and enhancement; and (2) documentation addressing the application of the software, such as user guides, reference manuals, training documentation, and release notes.

8.3.1.7 Supporting the User

Software users are occasionally resistant to change and avoid using software with which they are unfamiliar. Supporting training classes can build excitement about new software and confidence in using it. Starting with the more enthusiastic users and introducing the neutral users in mixed classes are typical approaches to finally incorporate the entire organization into adopting the new software.

8.3.1.8 Maintaining the Software

Maintaining and enhancing software to cope with new requirements or application areas is a critical step for most commercial software vendors. New requirements often lead to a change in application and in the required positioning of the software. In almost any case, the initial design of software does not cover all potential application areas. Adding code that does not fit the original design may require significant changes or even a complete redesign by a software engineer.
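To make the distinction between verification and validation more concrete, the following sketch traces hypothetical user requirements through functional specification items to test results: verification asks whether every specification item has a passing test, and validation (in a deliberately simplified reading) asks whether every user requirement is covered by verified specification items. All identifiers are invented for illustration; real projects use their own numbering schemes and far richer traceability matrices.

# Hypothetical identifiers and items; real projects use their own numbering.
user_requirements = ["UR-1", "UR-2"]

functional_specs = {            # spec item -> user requirement it refines
    "FS-1.1": "UR-1",
    "FS-1.2": "UR-1",
    "FS-2.1": "UR-2",
}

test_cases = {                  # test case -> (spec item exercised, outcome)
    "TC-101": ("FS-1.1", "pass"),
    "TC-102": ("FS-1.2", "pass"),
    "TC-201": ("FS-2.1", "fail"),
}

def verified_specs() -> set:
    """Verification view: specification items demonstrated by a passing test."""
    return {spec for spec, outcome in test_cases.values() if outcome == "pass"}

def unverified_specs() -> set:
    return set(functional_specs) - verified_specs()

def unvalidated_requirements() -> set:
    """Validation view (simplified): user requirements without any verified spec item."""
    covered = {functional_specs[spec] for spec in verified_specs()}
    return set(user_requirements) - covered

if __name__ == "__main__":
    print("Unverified specification items:", unverified_specs() or "none")
    print("Requirements without verified coverage:", unvalidated_requirements() or "none")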
8.3.2 The Life Cycle of Software

The aforementioned procedure covers the entire lifetime of software, which is usually referred to as the software development life cycle (SDLC). In particular, commercial software development is subject to certain regulations that are compiled as a set of standards. Several formal models can be applied to structure this process. One of these is the V-model; its name is derived from the shape of the graphical representation used to describe the required processes (Figure 8.1). The V-model defines a uniform procedure for information technology product development, is the standard for many organizations dealing with software development, and is used in commercial organizations as well as in federal
organizations, such as the German federal administration and defense projects. It is a software development project management model similar to generic project management methods, such as PRojects IN Controlled Environments (PRINCE2), and describes methods for project management as well as methods for system development. The V-model was developed in 1992 to regulate the software development process within the German federal administration. It was extended in 1997, called the V-model 97, and the most current version, the V-model XT, was finalized in February 2005 [5].

Figure 8.1 The V-model covers the entire process of software development, from the business requirements to the implementation of the software in its intended operational environment. The process starts with a statement of work that defines the effort required for the design, engineering, development, production, testing, and prototyping of a software system. The software documentation process covers the user requirements as well as the functional and technical aspects of the software development. After development, different test phases are performed to verify the installation, operation, and performance of the software. A business acceptance from the customer finally leads to the implementation of the software.

The left tail of the V in the graphical scheme (Figure 8.1) represents the specification stream, where the system specifications are defined. The right tail of the V represents the testing stream, where the system is tested against the specifications defined on the left tail. The bottom of the V represents the development stream. The specification stream mainly consists of the following:

• The user requirements specification, also referred to as the requirements document or release plan, describes the requirements from an operator's point of view. This high-level document does not give any details about how the software is actually constructed, except that it may contain desired user interface designs to address a requirement.
• Functional specification is derived from the user requirements specification and includes the description of the functionality that is needed to fulfill the requirements. It may either be a list of functions or include proposals for user interfaces, describing the individual functionality in context to match the operator's requirements. In the latter case, the corresponding document is often referred to as a system specification, which may be created in addition to the functional specification. Which procedure is followed depends on the internal documentation standards of the development organization.
• Technical specification, or technical design, is a document that translates the functional or system specification into technical terms useful for the software engineer or developer. The technical specification includes information for the developer on how to program the required functionality in the architectural context of the existing system.

The development stream consists of the following:

• Development covers the coding phase of a software engineer based on the technical specification.
• Operational test, or developer test, is a phase for verifying that the individual code pieces or modules created by a software engineer are working according to the technical specification. It also includes integration tests for those pieces, ensuring that they work appropriately in the entire software package.
• Source code reviews are performed by software engineers who are independent of the development process. These reviews make sure that the source code includes comments and remarks according to internal standards.

The testing stream generally consists of the following:

• Installation qualification is a process that ensures that the software has been installed and configured correctly according to manufacturers' instructions, SOPs, and organizational guidelines. The scope of this qualification can be very broad or somewhat narrow depending on the size and type of system being deployed. It is typically related to the technical specification for the installation process. A typical installation qualification includes the following:
− Hardware compatibility.
− Correct installation and configuration of all systems of the supporting software environment — that is, the operating system, servers (e.g., application server, database server), and other components of a multitier system.
− Correct installation and configuration of the application.
• Operational qualification is based on the intended use of the system and the training of system users. This process ensures that the system functions as required through the use of documented test cases and procedures exercising significant system functionality. The tests during this phase are documented with detailed pass–fail criteria addressing the functionality of the system based on the functional or system specification. The customer's
SOPs help to narrow the scope of the operational qualification by reducing the test cases to the procedures approved in the SOPs. The operational qualification is performed by an independent team that may consist of qualified customers and may include qualified vendor personnel. The team has to be independent of the operator or developer groups to allow an impartial assessment of the fitness of the application.
• Performance qualification ensures that the system behaves appropriately in the intended routine application and is checked against the user requirements specification. This step is performed by the end-user community and is the essential step toward the business acceptance of the software. The performance qualification includes a subset of the procedures used during the operational qualification to reduce the duplication of effort in rewriting similar test cases and procedures. The procedures used during the performance qualification typically focus on areas that pose the greatest risk if they were to operate incorrectly; these areas are defined during a risk assessment procedure.

SDLC documentation is subject to a strict approval process that is defined in the software vendor's SOPs and can be audited by customers and regulatory organizations. Documents created under the SDLC are versioned and include either handwritten or electronic signatures for approval from the responsible departments. The approval also involves a member of the QA department who is independent of the project team. Different versions of documents may be created either as full documents, in which the most recent version includes the entire documentation, or as delta documents, which contain just the changes with respect to the previous version.

The release of software is documented with a release certificate created by the QA team in cooperation with product management, which also creates the release notes for each product version. This document describes the enhancements and modifications finally implemented in the new product version. Prominent chapters of the release notes are — if applicable to the respective product version — new functions, modules, and configurations; changed functionality; functions that are no longer supported; as well as technical implications that require a change in the software environment (e.g., a new version of a database server). Release notes often include a list of known issues that have not been solved in the released version.

Even if software meets all of the user requirements and the functional and technical specifications, it might still be subject to several business implications. Those cases include any business impact outside the immediate application area, such as information technology (IT) requirements, security issues, an increased need for maintenance, and pricing. The V-model includes a final business acceptance phase that is intended to cover all potential implications as well as possible. For this and numerous other reasons, it is good practice to involve the end-user community, IT personnel, and management during all phases of a development project.

The final phase in the SDLC covers the implementation — that is, the systematic approach to effectively integrate the software into the processes of an organization. Depending on the size of the software, the number of end-users, and the number of interfaces to other software systems, the implementation phase can be anywhere
between days and several years. Typical implementation phases for some of the larger laboratory data management systems described later in this chapter range between 12 and 36 months. Experience shows that it is rather unlikely that the implementation will occur without obstacles. This applies particularly to expert systems, where the process of knowledge integration is a dynamic one, typically covering the entire lifetime of a product. With this information about the software development process, we will now look at some general concepts of data and knowledge management.
8.4 Knowledge Management

Chemical and related industries have to face the serious problem of research information overload, which originates from new instrumental technologies as well as from the cheminformatics and bioinformatics software presented in previous chapters. Bioinformatics in particular — including genomics, proteomics, metabonomics, and the other research areas — not only supports laboratory research in managing the terabytes of data being produced in the research processes but also produces additional data during, for instance, drug target identification, drug lead validation, and optimization. At some point another management problem becomes apparent: the organizational practice of structuring the production and sharing of knowledge within an organization. The term knowledge typically covers the human knowledge acquired by gathering information and experience. From an enterprise point of view, knowledge resides mainly with the human resources of an enterprise as well as in a few electronic knowledge databases. If it is possible to locate the knowledge resources within a company, an effective distribution of knowledge can take place. An electronic knowledge management system (KMS) can support an enterprise in this respect. This section will provide an overview of the basic concepts and requirements of this type of software.
8.4.1 General Considerations

An organization can be described in terms of knowledge by considering its points of importance in decision making. A good way is to first establish the knowledge centers. A knowledge center is a repository of knowledge related to a specific domain, and all decisions in the domain are generally derived from the knowledge stored in the particular knowledge center. As an example, we may consider the following base knowledge centers for an enterprise:

• Research covers the knowledge domain of scientific or technical investigations and the analysis of their outcomes.
• Development includes knowledge on the use of results from research and their practical application.
• Production focuses on the implementation of results from development, usually for commercial purposes.
• Marketing includes knowledge on the commercialization and presentation of products in the market.
• Sales provides knowledge of market requirements and the adaptation of products and solutions to a specific customer.
• Customers have knowledge about the application purposes of a product and provide the requirements for product development.
• Administration and management cover knowledge about the organizational structure of the product vendor.

For each organization these knowledge centers differ in number, relevance, and content. Further classification of base knowledge centers into subcenters might be appropriate; for example, the research center may have the subcenters of principal research and method development or analytical services. The knowledge centers do not necessarily reflect the organizational structure of the organization; they reflect the important centers of decision making based on the domain of specific knowledge. Even a KMS itself might be considered to be a knowledge center.
8.4.2 The Role of a Knowledge Management System (KMS)

The role of a KMS can be described by a few basic features:

• They capture explicit knowledge generated within the organization in a centralized or in multiple decentralized knowledge repositories.
• They structure this knowledge according to domain- or decision-specific aspects.
• They support the retrieval, distribution, extension, and modification of knowledge by including ideas, innovations, and suggestions. The latter feature is supported by offering the possibility to find the right person for a specific problem by consulting a software component that identifies domain experts and their availability (e.g., the Yellow Pages). Another way is a message board, which captures knowledge that is not explicitly expressed.
• They support the classification of knowledge for efficient retrieval; classification can be achieved with metadata in a simple hierarchy and is usually adapted to the application of the knowledge; the hierarchy reflects the structure of the content centers.
• They store information based on the degree of use of the knowledge, the opinion of the authors and subject matter experts, and the frequency of knowledge use.
• They monitor the entry of knowledge and support the process of eliminating obsolete knowledge to keep the knowledge database up to date.
• They provide ways of automated delivery of personalized information to people, based on their profiles, interests, expertise area, and experience area (functions such as event calendars, message boards, and newsletters).
• They provide support for decision making and process optimization.
• They support the training of employees for knowledge improvement; computer-aided training may be integrated to support the training of new and existing employees.
• They transform implicit knowledge — that is, knowledge that is only partly or not at all explicitly expressed — into corporate explicit knowledge and make it available to the entire organization by centralizing it into a shareable repository that is accessible to all interested employees.
8.4.3 Architecture

A KMS provides access to and distribution of information and knowledge that is available in the entire enterprise. The main goals of such a system are to manage the entry, access, and distribution of primary (raw data), secondary (information), and tertiary (knowledge) data within an enterprise. The main modules in such a system are as follows:

• Document Repository: This serves as the information pool. The basic document database is the essential data and information pool for the KMS.
• Human Resources Database: This system is the basis of the Yellow Pages, which provide information about the location and distribution of human knowledge in the enterprise.
• Search Engine: The search system basically consists of a search engine capable of retrieving information from one of the aforementioned databases. Either a separate search engine for each database or a centralized system is conceivable.
• Intelligence System: This provides special search and data representation algorithms to give access to hidden information, consistencies, or patterns in the available information. Different systems can be applied: data-mining algorithms, pattern-matching algorithms, intelligent agents, and other artificial intelligence (AI) approaches.
• Visualization System: This should ensure the adequate representation of the information retrieved from an information pool. It covers highly specialized representations, like functional mathematical graphs, multidimensional graphs, and molecules, as well as default representations in formatted texts or tables. In addition, some kind of three-dimensional (3D) information, like 3D views of departments, workflows, and operational procedures, can be helpful.
• Report Generator: Finally, the user who has accessed and consolidated the information needs a report tool that presents the search results in a printable style ready for distribution in paper-based form.

Additionally, two layers should be included in a KMS: (1) the access (or administrative) layer, which defines the access of users according to different levels of permission; and (2) the user layer, which defines the access from the user's point of view. In this layer the user defines which type of information should be searched for and how it is presented in the visualization system and the report generator. The configurations of the access and user layers are responsible for the final output of a search request. The access layer is a predefined layer that has a simple administrative character. An access layer restricts the availability of information for the entire enterprise and, thus, for each potential user of the KMS.
The user interface layer is defined directly by the user of the system and has a lower priority compared with the access layer. The user can exclude or include specific information types, like reports, newsletters, or spectra, to meet his specific information requirements. In addition, the user should be able to define the presentation style on the screen as well as in the report.
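The interplay of the two layers can be pictured as two filters applied in sequence: the administrative access layer removes what a user is not permitted to see, and the user-defined layer then narrows the remainder to the information types the user has asked for. The following Python sketch illustrates this idea; the item types, knowledge centers, and clearance model are invented for the example and are not prescribed by any particular KMS product.

from dataclasses import dataclass

@dataclass
class KnowledgeItem:
    title: str
    item_type: str          # e.g., "report", "newsletter", "spectrum"
    knowledge_center: str   # e.g., "research", "production"
    restricted: bool        # set by the administrative access layer

# Hypothetical repository content
repository = [
    KnowledgeItem("Weekly NMR summary", "report", "research", restricted=False),
    KnowledgeItem("Process deviation memo", "report", "production", restricted=True),
    KnowledgeItem("IR reference spectrum", "spectrum", "research", restricted=False),
]

def access_layer(items, user_clearance):
    """Administrative layer: hide restricted items from users without clearance
    for the corresponding knowledge center."""
    return [i for i in items
            if not i.restricted or i.knowledge_center in user_clearance]

def user_layer(items, wanted_types):
    """User-defined layer: include only the information types the user asked for."""
    return [i for i in items if i.item_type in wanted_types]

# A researcher without production clearance who only wants reports and spectra
visible = user_layer(access_layer(repository, user_clearance={"research"}),
                     wanted_types={"report", "spectrum"})
for item in visible:
    print(item.title)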
8.4.4 The Knowledge Quality Management Team

A KMS is a software solution of high complexity. To ensure the fault- and error-free performance of this system, a special group of employees, the knowledge quality management team, watches over the informational and operational integrity of the system. According to standard regulations, like GLP for laboratories, the customer has to provide a team of authorities that ensures the reliability of the entire system. Regulations and requirements from standards organizations can be used as a template for the performance of the knowledge quality management team.
8.5 Data Warehousing

Another topic closely related to intelligent data management is the concept of a data warehouse. A data warehouse can be described as a system optimized for information retrieval. The term data warehouse was originally defined by Bill Inmon in the following way [18]: "A Data Warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." According to this definition, a data warehouse has some similarity to an expert system from the perspective of decision making. Most software belongs to the category of operational systems, comprising an architecture that reflects its application and containing current data that are volatile — that is, changing regularly with a process (working data). Data warehouses are informational systems characterized by the following features:

• Subject-Oriented: Data warehouses contain data in a structured way reflecting the business objectives of a department or a company. The subject orientation supports the analysis and the production of knowledge.
• Integrated: Because the warehouse relies on data, information, and knowledge produced by other software applications or humans, a data warehouse is a solution that is well integrated into the software environment of an organization. A unique representation of data is required for effective analytical processing of the large amounts of information stored in a data warehouse.
• Time-Variant: Data warehouses store data for a longer period of time and contain current as well as historic data. This allows the analysis of how data change with time and the evaluation of trends. Data warehouses often contain sequences of snapshots taken periodically from operational data.
• Nonvolatile: Data warehouses retain their data; that is, they comprise an essentially nonvolatile repository that supports the upload but not the deletion of data.

Data warehouses are basically organized two-dimensionally, reflecting time and granularity. Data loaded from operational systems are regarded as current data that
are aggregated under certain constraints into a form of lightly summarized data, usually reflecting a short-term summary representation, such as weekly or monthly reports. The summary can be aggregated further to yield long-term summarized data — for instance, aggregated on a yearly basis. This granularity is stored in the data warehouse and supports the fast and efficient retrieval and analysis of data. Current and aggregated data are additionally described by metadata that support data organization and retrieval. The importance of data usually decreases with time; a data warehouse therefore provides scheduling mechanisms for outsourcing those data to slower archiving components or to external data repositories. However, data stored in external systems are still considered part of the data warehouse. The architecture of a data warehouse comprises the following general layers:
1. Incoming data from operational systems as well as those from external sources, collectively referred to as source systems, are stored in a data staging area. Several modules support the sorting, categorization, transformation, combination, and dereplication of data stored in the staging area to prepare them for use in the data warehouse. The data staging area does not provide any query or visual representation of data.
2. The presentation server provides the query capabilities and the visualization of data to the end-users and to other applications.
The entire process of data transition from the source systems to the presentation servers via the data staging area is usually referred to as extract, transform, and load (ETL). The detailed architecture can be divided into the following sections (Figure 8.2):

• Data load managers are a collection of modules performing the extraction of raw data and metadata, the transformation to the unique data warehouse representation, and the upload of data into the data warehouse.
• Data warehouse managers are modules performing data analysis in terms of consistency and dereplication, indexing, aggregate generation, and the archiving of data. In addition, data warehouse manager modules may generate query profiles to determine appropriate indices and aggregations.
• Query managers include the functionality for the management of user queries and are connected to the front-end systems for the end-user.
• Front-ends provide information to the end-user for decision making. They usually comprise query and reporting tools and may be supported by data mining and online analytical processing (OLAP) tools.

Data warehouses provide the following generic data storage areas:

• Operational data sources include operational systems, relational databases, and data from other external sources.
• The data staging area stores the incoming data for subsequent aggregation by an ETL process.
• Metadata are extracted from the staging area; they are used for producing summary tables, for query purposes, and to provide the data categorization to the end-user.
• Aggregated data are stored in the predefined lightly and highly summarized data areas generated by the warehouse manager.
• Data beyond the predefined lifetime of highly summarized data are stored in a long-term archive or on external media.

Figure 8.2 The architecture of a data warehouse comprises data load, data warehouse, and query management modules. The data load manager extracts and transforms operational data and transfers them to the staging area. A data warehouse manager analyzes the data from the staging area for inconsistencies and duplicates, generates aggregate data and metadata, and manages the archiving of data. The query manager handles data queries and is connected to the front end for the end-user.

The process of ETL starts with the definition of the relevant source information from the operational system databases, such as files, data tables, or data fields. These data can be renamed for use in the data warehouse repositories. Additionally, the schedule for subsequent automated extraction is defined. The transformation process includes several processing steps for removing inconsistencies and unifying the data representation. Typical processes are the assignment of inconsistent fields from multiple data sources to a unique field or the conversion of data to common formats, such as for textual and numerical data, units, and abbreviated data. Data cleansing is the process of identifying relevant data, resolving data conflicts with multiple data sources, and dealing with missing or incorrect data at a single source. ETL is part of a broader business intelligence process that includes other technologies, like OLAP and data mining. OLAP is an approach to provide answers to multidimensional queries, as they typically arise in business reporting processes for sales, marketing, and management processes such as budgeting and forecasting.
Data mining is an extension of linear searching, mainly based on pattern recognition in large data sets, and is of particular interest in combination with expert system capabilities in scientific areas.
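A minimal illustration of the extract, transform, and load process described above is sketched below in Python: two hypothetical operational sources use different field names and units, the transformation step maps them onto a common warehouse schema in the staging area, and a lightly summarized (monthly) aggregate is then derived. Field names, units, and values are invented for the example.

from collections import defaultdict
from datetime import date

# Extract: raw records as they might arrive from two operational systems.
source_a = [{"sample": "S-1", "analyte": "Pb", "conc_mg_per_L": 0.012, "measured": "2007-06-04"}]
source_b = [{"id": "S-2", "analyte": "Pb", "conc_ug_per_L": 9.5, "date": "2007-06-06"}]

def transform(records, mapping, unit_factor=1.0):
    """Rename inconsistent fields to the warehouse schema and unify units (mg/L)."""
    staged = []
    for rec in records:
        staged.append({
            "sample_id": rec[mapping["sample_id"]],
            "analyte": rec[mapping["analyte"]],
            "conc_mg_per_L": rec[mapping["conc"]] * unit_factor,
            "measured_on": date.fromisoformat(rec[mapping["date"]]),
        })
    return staged

# Transform: both sources are mapped onto one representation in the staging area.
staging_area = (
    transform(source_a, {"sample_id": "sample", "analyte": "analyte",
                         "conc": "conc_mg_per_L", "date": "measured"})
    + transform(source_b, {"sample_id": "id", "analyte": "analyte",
                           "conc": "conc_ug_per_L", "date": "date"}, unit_factor=1e-3)
)

# Load and lightly summarize: monthly mean concentration per analyte.
groups = defaultdict(list)
for row in staging_area:
    groups[(row["analyte"], row["measured_on"].strftime("%Y-%m"))].append(row["conc_mg_per_L"])

lightly_summarized = {key: round(sum(vals) / len(vals), 6) for key, vals in groups.items()}
print(lightly_summarized)   # {('Pb', '2007-06'): 0.01075}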
8.6 The Basis — Scientific Data Management Systems

A scientific data management system combines generic data management capabilities with features required for the effective creation, search, and management of scientific information. Although it may include more specialized techniques for performing these tasks, it is principally a management system for electronic records. An electronic record represents a record in digital format. Even though the components and requirements of an electronic record are defined by its application, a system for creating and maintaining electronic records basically has the following components:

• Content relates to the factual information of a record and consists of any electronic format. Typical formats used in chemistry include texts, tables, spreadsheets, images, structures, reactions, spectra, descriptors, protein sequences, and gene sequences.
• Context refers to additional data or information that show how the record is related to its application as well as to other records or systems. Context is often described by metadata (i.e., data about data) or meta-information (i.e., information about information).
• Structure addresses the technical characteristics of the record related to the format, layout, or organization of records. Examples are file format, category, page layout, hyperlinks, headers, and footnotes.

In addition to these basic components, a scientific data management system usually must address several of the regulatory standards previously described. Some of the required features are as follows:
• Identification of the author or source of electronic data.
• Identification of the user or system that performs modification of data.
• Ability to witness electronic records by a second user.
• Ability to track creation, modification, and deletion of data, including date, time, user, and reason (audit trail).
• Capability to lock a particular record to avoid successive changes.
• Linking to or cross-referencing supporting data.
• Functionality for independent review by an identifiable knowledgeable person other than the author.
• A date- and time-stamp mechanism that is independent of the control of the author or source of the electronic data.
• Generation of an audit trail for each electronic record that documents the author, source, and activities just described and that is permanently linked to that record.
• An archiving mechanism that permits retrieval, review, and interpretation of the electronic data as of the date it was recorded.
• A retention time scheduler for retaining records as long as defined per SOPs.
Authentication is usually performed by entering a unique user identification (ID) and password. If data are transferred automatically from external software, the source system must either provide a similar mechanism to identify a valid user or log on to the record manager with a unique identifier. In any case, data entering a compliant-ready record manager require this identification. The authentication needs to be logged with a date and time stamp in an audit trail. The same requirement applies to the modification of data that already exist in the record, if the user is not already in an authenticated session. In fully regulated environments, a reason for the modification has to be supplied before changes can be made. For operations requiring an electronic signature, the following must be included within the functionality of a record manager:

• The signature must be linked to the record it was applied to; separating the signature from the record basically invalidates the record.
• All electronic signatures contain the following information: the full user name, the date and time of signing to the second, and the reason for the signature.
• All components of the signature must appear on human-readable reports of the record.
• Each signature requires entering a user ID and password on the first signing and only the password for subsequent signings.
• If a session is not continuous, both components of the signature (i.e., user ID and password) are required for signing.
• If the system is an open system — that is, working over a nonsecured network connection — digital signature encryption is required when executing an electronic signature.

Witnessing of electronic data is accomplished by electronically routing the file to the witness. That person will access the file, review its entire contents, and indicate that he read and understood the content by applying an electronic signature. The process is similar to the authentication performed with a unique user ID and a password. The system will automatically and independently provide a date and time stamp and will log this event. The file will then be routed back to the authoring party, or the author will be sent an electronic notice confirming that the signature has been completed. In any future legal or administrative proceeding, the witness should be able to testify as to the authenticity of the electronic data record. Consequently, the witness must at a minimum be able to verify that the file and electronic data existed on the date it was reviewed and that he read and understood the content of the file at that time. After reviewing the electronic data, the witness should be in a position to testify at the time of the legal proceeding about the contents and meaning of the entry.

Scientific data management requires a complete audit trail to be part of the application; the audit trail must be electronic, always on, unchangeable, and able to work automatically without user intervention. The audit trail is linked to the record being changed and must be kept for the full record retention time.
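The audit-trail and witnessing requirements just described can be summarized in a short sketch of a record object that keeps an append-only trail of who did what, when, and why. This is only an illustration of the information such a trail typically carries; the class and method names are invented, and a real record manager would enforce immutability, authentication, and retention at the database and application-server level rather than in a single Python class.

from datetime import datetime, timezone

class AuditedRecord:
    """Sketch of an electronic record that keeps its audit trail with the record.
    Names are illustrative; a production record manager enforces this elsewhere."""

    def __init__(self, record_id, content, author):
        self.record_id = record_id
        self._content = content
        self._trail = []                         # append-only list of audit entries
        self._log(author, "create", reason="initial entry")

    def _log(self, user, action, reason, old=None, new=None):
        # The date and time stamp is applied by the system, not by the user.
        self._trail.append({
            "utc": datetime.now(timezone.utc).isoformat(),
            "user": user, "action": action, "reason": reason,
            "old_value": old, "new_value": new,
        })

    def modify(self, user, new_content, reason):
        if not reason:
            raise ValueError("A reason is required for modifying a record.")
        self._log(user, "modify", reason, old=self._content, new=new_content)
        self._content = new_content              # original value remains in the trail

    def witness(self, witness_user):
        self._log(witness_user, "witness", reason="read and understood the content")

    @property
    def audit_trail(self):
        return tuple(self._trail)                # read-only view for QA or agency review

rec = AuditedRecord("NB-2007-0042", "Melting point 128-129 C", author="jdoe")
rec.modify("jdoe", "Melting point 128.5-129.5 C", reason="transcription error corrected")
rec.witness("asmith")
for entry in rec.audit_trail:
    print(entry["utc"], entry["user"], entry["action"], entry["reason"])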
8.7 Managing Samples — Laboratory Information Management Systems (LIMS)

Laboratories usually have to face a high data flow that needs to be organized. The laboratory industry has responded to this challenge by automating instrumentation. However, the individual instrument software still uses different data formats and software interfaces. To manage laboratory data and to streamline data handling, data management systems have been introduced [6–17].

The forerunners of the systems now called Laboratory Information Management Systems first appeared in the late 1960s as in-house software solutions. The intention was to help streamline the data flow from frequently performed laboratory tests and to transcribe the results to a centralized data repository. In the 1970s, custom-built systems became available. These early custom systems were one-off solutions designed by independent systems development companies to run in specific laboratories. The complexity of these systems had increased to allow them to facilitate the transfer of large quantities of data in electronic format from specific laboratory instruments.

At the same time as customized LIMS implementations were being developed by system solution providers, initial efforts were made to create commercial LIMS products. These extensive research efforts resulted in the first commercial solutions, formally introduced in the early 1980s. Such commercial LIMS were proprietary systems, often developed by analytical instrument manufacturers to run on their instruments. These commercial systems were typically developed for a particular industry and still required considerable customization to meet specific laboratories' needs.

Parallel to the rise of commercial LIMS, the processing speed and capabilities of third-party software increased, laboratories switched more and more to small workstations and personal computers, and the LIMS environment had to follow this trend. During the 1990s, the term LIMS became universally recognized, and more and more standard solutions from commercial software vendors became available. The systems were refined to combine the individual transfer of laboratory test data into the transfer and consolidation of results, providing laboratory managers an overall view of the laboratory and a platform to assist them in running it.

Today, several software solutions exist, offering the capability to meet the specific requirements of different laboratories and specific industrial environments. Many of today's most popular commercial LIMS take advantage of open systems architectures and platforms to offer client–server capabilities and enterprise-wide access to lab information. The most recent developments around LIMS have been to migrate from proprietary commercial systems toward an open systems approach and to further integrate laboratory data management into the global knowledge management strategies of organizations.

However, many LIMS implementations have been less than successful, thus creating a negative perception among small to medium-sized organizations. In-house LIMS, which are still being developed by many organizations, can take considerable time and resources to implement. Unfortunately, this negative perception effectively cuts many laboratories off from the substantial benefits that a
well-designed and installed LIMS can bring to a laboratory. The main benefit of a LIMS is a drastic reduction of paperwork and improved data recording, leading to higher efficiency and increased quality of reported analytical results. Many of the traditional systems were developed in house and therefore remained isolated solutions useful for just a few specialists. Today's LIMS are advanced, modular, state-of-the-art software tools that help to improve and enhance business practices throughout the enterprise.

The LIMS of tomorrow will incorporate or integrate with knowledge-based systems, linking scientific expert systems, research databases, and production planning systems. This vision implies immediate global access to all measurement results via intranet or Internet, compliance with federal regulations, as well as the availability of open interfaces and public data formats to integrate already existing information systems.

The acquisition of a LIMS is a strategic purchase for a laboratory. The implementation of a LIMS impacts not only the existing IT environment but also the existing workflow and the philosophy of how laboratory processes are performed. As a result, laboratory organizations have to face various problems when implementing a LIMS.
8.7.1 LIMS Characteristics

A LIMS is a computer-based system used by many types of testing laboratories to store, process, track, and report various types of analytical data and records. The goal is to streamline the storage of result data based on samples or sets of samples (batches) and to improve the reporting processes associated with these samples. A LIMS mirrors all management processes and activities in quality control and QA. It assists the QA in a laboratory in the following tasks:

• Order registration and order processing through the entire working cycle
• Rendering the testing scope more dynamic
• Monitoring of complex sampling orders
• Compiling of complex sampling orders
• Generation and documentation of analysis results for evaluation
• Documentation of results over a period of several batches
• Scheduling of laboratory activities
• Monitoring of the release and evaluation of products and test goods
• Generation of various certificates and reports
• Evaluation of results by means of trend monitoring
• Supervision and classification of suppliers
• Control of packaging material
• Supervision of the handling of data by means of an audit trail function
• Reservation of material and control of the execution of stability studies
• Assistance in the processing of stability and complaints management, recipes, and reference substances
These features are described in more detail in the following sections.
8.7.2 Why Use a LIMS?

A LIMS supports different kinds of organizations, from research and development (R&D) to quality management. It serves core analytical tasks like planning, procedure control, and automated QA. This ranges from data management at the point of collection, evaluation, interfacing, archiving, and retrieval of information to documentation for R&D and quality management. Special functions typically supported include stability testing, complaints management, recipe administration, and reference substances. The substantial benefits of a LIMS for laboratory management are as follows:

• Improving sample-oriented workflows
• Automating repetitive tasks
• Tracking samples, analysts, instruments, and work progress
• Maintaining data quality, defensibility, and validity
• Increasing accuracy of analytical results
8.7.3 Compliance and Quality Assurance (QA)

Control is the essential objective behind most data management principles. Effective management and operation of an automated laboratory cannot be assured unless the use and design of the LIMS are consistent with principles intended to ensure LIMS control. Although accuracy and reliability of data must be ensured by a control-based system of management, the most effective management systems invoke the participation of those employees affected by the control process.

According to the GALP guidelines described earlier, laboratory management is responsible for the use and management of the LIMS. This implies that all LIMS support personnel and users are completely familiar with their responsibilities and assigned duties as described in the organization's SOPs. Laboratory management is also responsible for ensuring that appropriate professionals are hired and assigned to the task, coupled with appropriate training, to ensure that all users are able to use the LIMS effectively. Most commercial laboratories rely on a three-part strategy for compliance:

1. Users are provided with clear operating instructions, manuals, and SOPs to enable them to perform assigned system functions.
2. Sufficient training to clarify these instructions is provided to users.
3. Users able to meet operation requirements are eligible to perform these LIMS functions.
An additional QA unit monitors LIMS activities as described in the GALP guidelines. The QA unit's responsibilities are primarily inspection, audit, and review of the LIMS and its data. An organizational plan is developed to define lines of communication, reporting, inspection, and review of the LIMS and its data. The QA unit must be entirely separate from and independent of the personnel engaged in the direction and conduct of a study and should report to laboratory management.
8.7.4 The Basic LIMS

The essential element of a LIMS is a relational database in which laboratory data are logically organized for rapid storage and retrieval. In principle, a LIMS plans, guides, and records the passage of a sample through the laboratory, from its registration, through the workflow of analyses and the validation of data (acceptance or rejection), to the presentation or filing of the analytical results.

The LIMS software basically consists of two elements: (1) the routines for the functional parts; and (2) the database. For the latter, commercially available database software is usually used, which may include certain functional parts, such as production of graphs and report generation. The database is subdivided into a static and a dynamic part. The static part comprises the elements that change very little with time (e.g., the definition of analytical methods), whereas the dynamic part relates to clients, samples, planning, and results (a minimal schema along these lines is sketched below, after the list of basic functions). Basic LIMS functions include the following:

• Sample login
• Sample labeling
• Analysis result entry
• Final report generation
• Sample tracking, including chain of custody (auditing)
• QA and quality control tracking and charting functions
• Operational reporting, including backlog and sample statistics
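The following sketch illustrates the static/dynamic split with a deliberately reduced, in-memory SQLite schema. The table and column names are invented for this example and do not reflect any commercial LIMS.

# Illustrative sketch of a minimal LIMS schema (static vs. dynamic part).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Static part: changes rarely (analytical methods and their limits)
CREATE TABLE method (
    method_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    unit        TEXT,
    lower_limit REAL,
    upper_limit REAL
);
-- Dynamic part: clients, samples, and results entered during daily work
CREATE TABLE sample (
    sample_id   INTEGER PRIMARY KEY,
    client      TEXT NOT NULL,
    logged_in   TEXT NOT NULL,              -- sample login timestamp
    status      TEXT DEFAULT 'registered'   -- registered, in work, released
);
CREATE TABLE result (
    result_id   INTEGER PRIMARY KEY,
    sample_id   INTEGER REFERENCES sample(sample_id),
    method_id   INTEGER REFERENCES method(method_id),
    value       REAL,
    entered_by  TEXT,
    entered_at  TEXT
);
""")

# Sample login and result entry, followed by a simple backlog report
conn.execute("INSERT INTO method VALUES (1, 'pH', '', 6.5, 7.5)")
conn.execute("INSERT INTO sample (client, logged_in) VALUES ('QC lab', '2007-11-13')")
conn.execute("INSERT INTO result (sample_id, method_id, value, entered_by, entered_at) "
             "VALUES (1, 1, 7.1, 'analyst1', '2007-11-13')")
backlog = conn.execute(
    "SELECT COUNT(*) FROM sample WHERE status != 'released'").fetchone()[0]
print("Samples not yet released:", backlog)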
It is important to distinguish between LIMS and solutions that, at first sight, offer some of the individual features associated with a LIMS. Most commonly, LIMS are compared to scientific data management systems, which may cover the requirements for data management and data security but do not provide the sample-oriented processes that are typical for LIMS.
8.7.5 A Functional Model

A LIMS solution offers ways to address several different operational issues. These issues can be split into the three key functional areas of sample tracking, sample analysis, and sample organization. Within each of these areas, different specific functions exist. Whereas some of these functions are “must haves” for all operational environments, others are “nice to haves” and are more dependent on the specific environment in which the LIMS is operating.

8.7.5.1 Sample Tracking

Tracking of samples addresses issues around the monitoring of a specific sample or batch of samples to ensure that their processing is properly executed, documented, and completed and that the associated workload is correctly managed. Typical specific functions include creating and reading bar codes of samples, direct access to the SOPs, managing workload and outstanding analyses, and sample analysis planning.
8.7.5.2 Sample Analysis

Sample analysis requires that sufficient test data have been obtained for a sample, such that all internal and external requirements are met. It also covers the accuracy and applicability of test data. Related specific functions include manual and automated result entry, validating against method specifications, checking of detection limits, and instrument management.

8.7.5.3 Sample Organization

The information structure of sample data and results affects the usability of the system and the possibility to readily retrieve required sample data. An effective sample organization ensures that sample analysis is available and understandable for end users and auditors. Specific functions include statistical processing and trending, reporting, and management of fault cases (exceptions).

A modern LIMS consists of various units, which can be divided into several functional sections: (1) the planning system, which covers the handling of administrative data and product standards; (2) the controlling system, which covers order generation, sampling, laboratory processing, decision making, and generation of certificates; and (3) the assurance system, which covers evaluation of data, maintenance and calibration of instruments, as well as support for audits and inspections.
8.7.6 Planning System

The planning system serves to handle basic data such as the following:

• General data, which consist of data referring to, for example, employees, result units, sampling units, storage conditions, and specific sampling instructions
• Laboratory data, which consist of data referring to, for example, laboratories, laboratory levels, sampling instructions, laboratory costs, and general test parameters
• Enterprise data, which consist of data referring to, for example, products, customers, suppliers, countries, and cost centers; these data are usually the responsibility of external departments and can be transferred via software interfaces from other systems

Specific basic data exist to control the order processing of specific samples. This describes the sample information from order generation to order approval and report generation. The information that controls the order processing consists of the type of order generated, the scope of testing required for that order, and the distribution lists for the different reports, which can be automatically generated at different levels throughout the order processing. One standard can control different types of testing, with each type having its own distinct testing strategy. System configuration data allow the customer to set system flags that are specific to their needs.
8.7.7 The Controlling System

Within the controlling system, orders are compiled, generated, and processed. Orders contain all data necessary for monitoring the control procedures of a specific batch. The order generation functionality provides a high degree of flexibility and consequently supports various requirements. Routine orders with a fixed testing scope can be generated. After the user enters product, batch, and testing scope, the order is compiled by the system with the help of the parameters that have been assigned in the standard. The data from the standard are transferred to the order and can be modified and completed there. Additional data not assigned in the standard might be entered later. After completion of the data entry, the orders are generated. Alternatively, the order can be initialized and generated via an interface, for instance, to a production planning system. In this case, specific processes or events on the production planning system side trigger the quality assurance order in the LIMS.

The support of the workflow in the laboratory is the central feature of integrated LIMS functionality. After the order is generated, it is allocated to the respective laboratories according to various parameters that are relevant to the order. The beginning of the processing of the orders in the respective laboratories can be registered in the system. Usually it can be defined in the LIMS whether the determined results are checked and compared automatically with the limits that have previously been assigned to a standard. In this case, the system is capable of making a proposal about how the sample or the entire batch could be evaluated (a minimal sketch of such a check is given at the end of this section).

A modern LIMS supports a retrospective evaluation of data. It is possible to present the result of a specific test in a graph together with the results of former analyses. This function can be used for trend analyses — for example, for the rating of suppliers or for observing the trend of a specific test parameter.

The effective times of the laboratory work can be entered in the system and used for cost calculations. The LIMS has the ability to calculate the costs of processing orders. It is possible to provide detailed reports associated with processing a particular product or set of products by entering into the order the cost center, the charged and expected times, and the costs defined in the standard.

The results of the analyses are evaluated and released by the person responsible for the respective product. Often the evaluation is carried out in a single step and by one person, called a one-step decision. For other products, such as in the pharmaceutical industry, a confirmation of the scientific decision is required from the laboratory manager and the QA officer, called a two-step decision. LIMS usually provide features for delegating decisions to other persons or authorities. If required, the sample results can be presented on a certificate, such as a certificate of analysis.
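The automatic comparison of determined results with the limits assigned to a standard can be pictured with a short sketch. The specification format and the decision labels below are illustrative assumptions, not the behavior of a specific product.

# Sketch of an automatic limit check with an evaluation proposal.
def propose_evaluation(results, specification):
    """Compare determined results with the limits assigned to the standard
    and propose how the sample (or batch) could be evaluated."""
    failures = []
    for parameter, value in results.items():
        low, high = specification[parameter]
        if not (low <= value <= high):
            failures.append((parameter, value, low, high))
    proposal = "release" if not failures else "reject"
    return proposal, failures

# Example: two test parameters checked against limits taken from the standard
spec = {"assay [%]": (98.0, 102.0), "water [%]": (0.0, 0.5)}
res = {"assay [%]": 99.4, "water [%]": 0.7}
print(propose_evaluation(res, spec))
# ('reject', [('water [%]', 0.7, 0.0, 0.5)])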
8.7.8 The Assurance System

The assurance system contains the following:

1. Evaluations, including statistical analysis of the test results; cost calculations; and overviews of the use of equipment and laboratories.
2. Audit and inspection functions, which support the control of the manufacturing of products, for instance, for review by an external contract laboratory.
Quality regulation charts, also called regulation charts or control charts, permit the continuous control of precision and correctness in a defined control period with the help of a simple graphic presentation. For this purpose, one or more control samples are additionally analyzed in each series of analyses. The results of these analyses, or the statistical identification data derived from these results, are continuously entered in a diagram. The usage of quality control charts in the field of quality assurance is based on the assumption that the determined results are normally distributed. Typical control charts used in a LIMS for routine analysis are, for example, the Shewhart charts for mean and blank value control, the retrieval frequency control chart, and the range and single-value control chart [19]. Quality regulation charts can be displayed graphically in the system or exported to spreadsheet programs.
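As an illustration of the arithmetic behind a Shewhart mean control chart, the following sketch derives limits from a control period and classifies a new control value. Placing warning and action limits at plus or minus two and three standard deviations is a common convention assumed here, and all numbers are invented.

# Sketch of a Shewhart mean chart evaluation, assuming normally distributed
# control-sample results; warning/action limits at +/-2s and +/-3s.
from statistics import mean, stdev

def shewhart_limits(control_values):
    m, s = mean(control_values), stdev(control_values)
    return {"center": m,
            "warning": (m - 2 * s, m + 2 * s),
            "action": (m - 3 * s, m + 3 * s)}

def check_point(value, limits):
    lo, hi = limits["action"]
    wlo, whi = limits["warning"]
    if not (lo <= value <= hi):
        return "out of control"
    if not (wlo <= value <= whi):
        return "warning"
    return "in control"

history = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9]   # control period
limits = shewhart_limits(history)
print(check_point(10.05, limits))   # in control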
8.7.9 What Else Can We Find in a LIMS?

A series of other modules are available in many commercial LIMS applications.

8.7.9.1 Automatic Test Programs

For regular tests, a LIMS offers the functionality of automatic test programs that are used for processes that can be planned exactly. Examples are calibration of instruments, audits, and assessment of environmental impact at regular intervals. After an automatic test program has been initiated, it starts working on a predefined relative date or time to compile and generate processes on the basis of the respective definitions in the standard.

8.7.9.2 Off-Line Client

An off-line client allows the user to work with the LIMS without having direct access to the production database, such as in on-site investigations or production facilities. Data from the production system are stored temporarily on the computer system running the off-line client and can later be transferred by docking back to the production system. The management and administration of the individual off-line clients are performed by the production system.

8.7.9.3 Stability Management

Data about the long-term stability of products are important for QA in production. In the past, stability studies were planned and documented on simple spreadsheets or paper-based systems. A stability data module is an application dedicated to
supporting the entire workflow of automatic stability sample data management. This includes sampling, storage, order generation, testing, result entry, evaluation, and documentation. The basis for planned stability studies is the definition of test plans, which serve as templates for the processing of tests at specific time intervals. The test plan is linked to a particular standard containing definitions regarding the scope and requirements of the tests.

8.7.9.4 Reference Substance Module

Modern software for managing the storage of all substances used in the different analytical fields of a corporation's QA is of great interest for many customers and helps to solve logistics problems. A reference substance module stores information on substances for analytical testing (i.e., standards, references, and impurities). This module includes support for ordering the substances from the supplier and for the dispatch of analytical reports.

8.7.9.5 Recipe Administration

A modern LIMS includes a module designed for the administration of recipes for the various products. This includes specifying exact quantities or relationships of the individual components of a product. With a recipe administration module, each recipe is subdivided into two different manufacturing phases: a static part (i.e., the gross-mixing ratio) and a dynamic part. The static part describes which quantities of the individual components are used for the manufacturing of the product. The quantities are specified as absolute units or as ratios. The dynamic part describes dependencies of the quantities of the components on different parameters, such as time, temperature, pH value, or pressure. With the recipe module, various calculations and evaluations can be carried out.
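The static part of a recipe, the gross-mixing ratio, lends itself to a simple calculation sketch. The component names and ratios below are purely illustrative, and the dynamic dependencies (time, temperature, pH value, pressure) are deliberately left out.

# Sketch: scaling the static part of a recipe (mixing ratios) to a batch size.
def scale_recipe(ratios, batch_size):
    """ratios: component -> relative amount; returns absolute amounts that
    sum to batch_size while preserving the mixing ratio."""
    total = sum(ratios.values())
    return {component: batch_size * part / total
            for component, part in ratios.items()}

recipe = {"active ingredient": 1.0, "filler": 7.5, "binder": 1.5}
print(scale_recipe(recipe, batch_size=250.0))   # amounts in kg, for example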
8.8 Tracking Workflows — Workflow Management Systems

Workflow management is gaining increasing attention, since it allows a combined view focusing on data as well as on applications and processes. In particular, an efficient logistics of laboratory information requires systems of high flexibility concerning the wide variety of data formats and representations. Consequently, laboratory workflow management is one of the greatest challenges for IT providers.

Laboratory research projects comprise a number of unique challenges for the designers of information systems. The experimental results are complex and involve several stages and numerous links among each other. There are many different reagents, types of experiments, instrumental singularities, and conditions to process. The data formats themselves are complex and are frequently interrelated in unusual ways. Additionally, requirements are subject to a rapid rate of change as old techniques are refined and new techniques are introduced. As a laboratory scales up, the management software that covers the complete workflow becomes the crucial factor in operational efficiency.
A critical requirement for a large laboratory is software to control the laboratory workflow while managing the data produced in the laboratory. This software covers manual as well as automated laboratory activities, including experiment scheduling and setup, robot control, raw-data capture and archiving, multiple stages of preliminary analysis and quality control, and release of final results. Additionally, it should make the coordination of these activities both intellectually manageable and operationally efficient.
8.8.1 Requirements

The typical distribution of information and data in laboratories is event oriented. For instance, each property of a material is related directly to several steps of testing. With each experimental step, its intrinsic information and related data are stored in one place, but information about the entire laboratory process is scattered among different locations. This provides a linear record of laboratory activity but an unfavorable representation of the entire process. In particular, retrieving information about a laboratory order requires a detailed knowledge of the underlying workflow. Additionally, because workflows can change frequently, a detailed knowledge of workflow changes and workflow history is also needed. This is a common problem with laboratory notebooks: When the notebook is stored in a database, the process is consolidated in an event-oriented scheme, and application programs may have to be changed each time the workflow changes.

In the current state of the art, workflow management is typically distributed among individual application programs, each of which is responsible for a small part of the workflow. An individual program contributes to the overall flow of work implicitly through its interactions with users and laboratory databases. This solution is suitable for simple protocols that do not change. Modern high-throughput routine and research laboratories demand a software system for explicitly recording the laboratory workflow, organizing the data transfer between the application programs involved, and providing support for the management of activities. A particularly interesting example is an Analytical Workflow Management (AWM) system that covers these demands as laboratories become more and more automated, as protocols become more complex and flexible, and as throughput and data volume increase.
8.8.2 The Lord of the Runs

An AWM tracks the progress of laboratory samples through the entire course of an analysis. The laboratory receives a continual stream of samples, each of which is subject to a series of tests and analyses. A second key requirement for workflow management is to control what happens as the workflow's steps are carried out: The right programs have to execute in the right order, user input has to be obtained at the right points, and so on.

To avoid the event-oriented management previously mentioned, an AWM provides a view that is targeted to a laboratory order — that is, in which information is associated with the underlying process rather than individual laboratory events. In
fact, this view is a virtual one and has nothing to do with the internal representation of laboratory orders. Internally, observed variables — such as analysis and calculation results, spectra, chromatograms, molecules, and images — are individually handled and stored, whereas in a report they are attributes of a laboratory order. Using the report, a scientist can retrieve individual results without necessarily knowing which step performed the measurement. In this fashion, the view isolates the scientist from the details of the entire laboratory workflow. Because the definition of the report depends on the workflow and its history, different instances of laboratory orders can have different attributes in the report. This rather dynamic behavior reflects the flexibility demanded of workflow management systems.
8.8.3 Links and Logistics

Whereas linear workflows can be handled by conventional information management systems, an AWM must support the order–suborder relationships that arise frequently in laboratory workflows. This applies to a set of samples that is linked to or derived from a single sample as well as to grouped samples. The workflow management system must be able to keep an eye on the progress of the group while still understanding the relations to the individual samples, linked samples, and the superior order. Consequently, the AWM must accommodate complex, multistep workflow schemes and must allow multiple concurrent executions with the same data inputs without losing the focus on the samples being processed in the laboratory.

Workflow management systems provide lists of the possible workflow activities that the user can perform next. The user can take responsibility for the execution of a pending workflow activity or can initiate computer-based work to execute the activity by selecting it from the work list. Several laboratory activities (e.g., recalibration procedures, stability studies) have to be executed at regular intervals; these can be handled by the AWM using time-based triggers.
8.8.4 Supervisor and Auditor

Many commercial laboratories are legally bound to record event histories; in particular, software in the pharmaceutical research laboratory has to be compliant ready with respect to the compliance rules stated earlier. Consequently, the AWM must maintain an audit trail, or event history, of all workflow activity. It records what was done, when it was done, who did it, and what the results were. The AWM must be able to quickly retrieve information about any order, sample, or activity for daily operations. The history is also used to explore the cause of unexpected workflow results, to generate reports on workflow activity, and to discover workflow bottlenecks during process reengineering. This information has to be prepared for the user by a separate or an integrated exception management system. The AWM must therefore support queries and views on a historical database. Most of these queries can be divided into the following categories: (1) queries that look up a particular experimental step; (2) queries that examine the workflow history of a particular laboratory order; and (3) report-generation queries to produce summaries of laboratory activity.
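A minimal sketch of such an event history and the three query categories might look as follows. It is an illustrative outline only; the field names are assumptions rather than the data model of an actual AWM.

# Sketch of an AWM event history with the three query categories.
from dataclasses import dataclass
from datetime import datetime, timezone
from collections import Counter
from typing import List

@dataclass(frozen=True)
class WorkflowEvent:
    order_id: str      # the laboratory order the activity belongs to
    step: str          # e.g. "sample preparation", "HPLC run", "review"
    user: str
    result: str
    timestamp: datetime

class EventHistory:
    def __init__(self):
        self._events: List[WorkflowEvent] = []

    def record(self, order_id, step, user, result):
        self._events.append(WorkflowEvent(order_id, step, user, result,
                                          datetime.now(timezone.utc)))

    # (1) look up a particular experimental step
    def by_step(self, step):
        return [e for e in self._events if e.step == step]

    # (2) workflow history of a particular laboratory order
    def by_order(self, order_id):
        return sorted((e for e in self._events if e.order_id == order_id),
                      key=lambda e: e.timestamp)

    # (3) report generation: summary of laboratory activity
    def activity_summary(self):
        return Counter(e.step for e in self._events)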
8.8.5 Interfacing

An AWM has to operate with a variety of database and information management systems, such as LIMS, expert systems, archive management, exception management systems (XMSs), electronic records management (ERM), and electronic laboratory notebooks (ELNs). A good laboratory workflow management solution incorporates several features of these systems or is designed as an integrated system. Many of the commercial products cannot support applications with high-throughput workflows. However, high-throughput workflows are characteristic of laboratories in the field of bioinformatics and biotechnology as well as of combinatorial chemistry. Because of automation in sample handling, analysis, instrumentation, and data capture, transfer rates in the laboratory increase dramatically. The basic requirements of a laboratory workflow management system are the ability to read, store, archive, analyze, and visualize complex-structured data in standardized formats, data links to other laboratory information systems, and the possibility to run it on diverse platforms as a multitier application.
8.9 Scientific Documentation — Electronic Laboratory Notebooks (ELNs)

“Scientific work is basically worthless without appropriate documentation”: This statement is a fundamental rule valid for all scientific areas. The complexity of scientific work demands sufficiently sophisticated, efficient documentation. As the main documentation vehicle in the laboratory, the scientific notebook must meet complex, often rigorous, requirements. But what does appropriate mean for scientific documentation? Obviously, simple documentation is not enough; there are additional requirements for a scientific notebook.

Some companies sell preprinted paper notebooks, usually containing 100 to 150 pages, square grid or horizontally ruled, serially numbered in the upper outside corner, and bound at the left margin. Loose-leaf or spiral notebooks are not acceptable, because pages can be intentionally inserted or removed or accidentally ripped out. Any such action opens up the possibility that someone will question the authenticity of the data. Here is a summary of the most important laboratory notebook rules:

• The notebook has to provide a complete record of activities in a manner that allows coworkers to exactly repeat the work and obtain the same results without having the original author around.
• Each entry of a notebook should be signed. At least one other worker, who is competent to understand the work, should regularly examine and witness the entries by signing and dating each page examined. Unsigned, undated, or nonwitnessed pages are virtually worthless. A long delay between the signing of the page by the inventor and by the witness will raise doubts about the authenticity of the document.
• Errors should not be erased or obliterated beyond recognition. They should be crossed out and replaced by new entries. All errors and mistakes should be explained and signed as they occur.
• Entries should not be changed at a later date. Instead, a new entry should be created, pointing out any change.
• Pages should never be removed from the notebook.
• Pages or areas on pages should never be left blank or incomplete.
• The notebook should be regarded as a legal document, and, as such, its use should be controlled. When completed, it should be stored in a safe place. It should not be treated as a freely available publication.
• The notebook should follow processes similar to those described in the FDA GLP guidelines.

An electronic solution has to cover these requirements but can bring significant advantages to some of these points:

• Improved organization: Electronic systems provide an easier way to organize data according to different classification criteria, like laboratories, projects, and studies.
• More complete information: With electronic systems, the user can be obliged to enter certain information for notebook entries, such as additional metadata. This makes scientific documentation and workflows more consistent. No important information is missing.
• Signatures: Electronic signature workflows can be managed in a very fast and simple manner; it is not necessary to collect paper notebooks for signing — it can be done from a desktop at any time.
• Safety: Electronic information can be stored using safekeeping electronic features. Access can be monitored and restricted via role-based user management.
• Accessibility: Electronic information can be centrally stored and easily distributed worldwide. Access can be managed via sophisticated user management. More importantly, searching an electronic system is much more efficient than with paper notebooks.

Besides covering these basic requirements, an electronic solution can save considerable time in different respects:

• Saving time when entering information, particularly the incorporation of information from other software systems and instruments.
• Streamlining the notebook signature process. The lab manager does not have to collect paper notebooks at the end of the week, read through all volumes, sign them, and redistribute them back to the laboratories; instead, everything can be done from the desktop computer in the laboratory manager's office.
• Accessing the actual work status in all laboratories. This is an additional issue related to the previous one; the lab manager does not have to wait until the end of the week but can sign entries right after they have been created. He can get the actual work status of the laboratories at any time with a few mouse clicks.
• Searching the experiments of all of the scientists in an organization. Searching within an electronic system makes it easy to find important information:
− Before starting a new investigation, consider (1) What has been done before? (2) Were similar experiments, or even the same experiment, performed before?
− During the work, consider (1) Are there similar results that prove your own results? (2) Are there existing discussions or interpretations that can be used in your own work?
− After the work, discuss the outcome of your investigation with researchers in distributed labs.
• Organizing your data more effectively. An electronic solution can store a huge set of widely varying data. Since each user may have a different view on data, it is important to follow an easy concept for classifying and categorizing entries. Data visualization in electronic solutions can be structured on the fly by using metadata.
• Improving communication. Electronic solutions can provide messaging utilities to communicate directly about results stored in the database, including additional comments and reviews. This is the first step to knowledge management in the laboratory.

If we take these requirements into account, we can specify further details for an electronic solution: the ELN. Even though the market for ELNs is steadily growing, the conceptual design and common understanding of this software is still in its infancy. This is in contrast to a mature software product, like a LIMS. Even LIMS never really earned this label; however, since it deals with very specialized information, mainly in the quality control area, people have gained a clear understanding during the nearly thirty years of LIMS history of what to expect from this software. The ELN has just successfully crossed from what marketing calls the early-adopter to the pragmatist market, and there is still confusion about what an ELN should and can do. The following sections describe a conceptual design of an ELN rather than a commercially available system. Only a few commercial software products are capable of dealing with more or less all of the requirements described here.
8.9.1 The Electronic Scientific Document

Since documenting work is a major aspect, an ELN should provide at least a documentation capability similar to conventional word processing software. We will refer to this capability as the electronic scientific document to distinguish it from other features of an ELN. In the ELN, the electronic scientific document serves the role of a conventional laboratory paper notebook: It is a container for laboratory data bound by regulatory guidelines. As an electronic regulatory vehicle, the electronic scientific document must meet the GxP and 21 CFR Part 11 requirements mentioned earlier as well as the requirements governing paper laboratory notebooks. As a result, it must do the following:
• Keep a historical record of data: To maintain a historical record, an ELN prevents users from deleting data physically. Instead, it marks the data as deleted and keeps the data in history.
• Keep versions of data: When users change and save data, the ELN creates and numbers new versions of those data.
• Follow a signature workflow: As a system of checks and balances, users with appropriate rights must sign off on ELN data at the proper times and in the proper order.

If we want to take a scientific workflow into account, we can organize an electronic scientific document into sections, each of which represents the notes for a work step. Each section in the electronic scientific document then requires at least a status and a version and may have additional information attached. Sections may have different states (a minimal sketch of these states as a simple state machine follows the list of supporting information below):

• Modifiable: The default state; this entry is still under construction by the author and is not yet in a signature workflow (it cannot be signed).
• Released: If the author decides that the entry is finished, he will release it. In the released state, the section can be signed by a signer.
• Approved: This section has been signed.
• Locked: A special state that is triggered by any valid author. It prohibits changes to this section by another person. The section may be unlocked either by the author who applied the lock or by an administrator.
• History: The final state of a section. Sections having this state have been either replaced or approved.
• Deleted: The special state of a section that has been explicitly deleted.
• Version: The version number of the section.

In the ideal case, a section will disappear if it is deleted or if a new version is created. However, a page cannot be removed or content erased beyond recognition in paper laboratory notebooks. Consequently, the electronic scientific document should provide a special view showing all versions of sections as well as the deleted ones. In this view, the status of a section has to be clearly indicated.

It is useful to allow for adding supplementary information to a section that may not be part of a printed report but helps the operator in organizing and searching through the document contents. The following supporting information is useful:

• Metadata (data about data): Every piece of information that describes the content of a section in detail. Examples are literature references, sample number, and instrument ID. Metadata shall be searchable and are important for the retrieval of experiments.
• Attachments: Any type of file information that can be attached to a section, similar to e-mail attachments. Examples are texts in a generic format, chemical structures, spectra, chromatograms, and images. These attachments should be searchable using the appropriate search functionality, such as full-text search, structure search, or spectrum search.
• Supporting data formats for nontextual sections: Additional proprietary or standardized open data formats that are required by the ELN for appropriate visualization of nontext data. Examples are the Molfile format for structure display, the JCAMP format for spectra and chromatograms, or native binary formats for representing sections in a proprietary visual format. In contrast to attachments, these data formats are required by the ELN for the appropriate visualization of contents.
• Comments: Any textual comments from coworkers or approvers that are helpful for identifying contents or for discussion threads. Comments should typically not appear in the printed report, but they should be searchable, too.
• Links: Active links to another section, electronic scientific document, software, or Web-based systems via uniform resource locator. Ideally, the ELN shall allow the link target to be opened directly from within the document.
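As announced above, the section states can be sketched as a small state machine. The transition table below is an illustrative interpretation of the states described in this section, not a normative specification.

# Sketch of the section life cycle as a simple state machine.
ALLOWED_TRANSITIONS = {
    "modifiable": {"released", "locked", "deleted"},
    "released":   {"approved", "modifiable"},   # e.g. returned for rework
    "approved":   {"history"},
    "locked":     {"modifiable"},               # unlocked by author or admin
    "history":    set(),                        # final state
    "deleted":    set(),                        # kept, but marked as deleted
}

class Section:
    def __init__(self):
        self.state = "modifiable"
        self.version = 1

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state

    def new_version(self) -> None:
        """Replacing a section sends the old content to history and starts
        a new, modifiable version (nothing is physically deleted)."""
        self.version += 1
        self.state = "modifiable"

s = Section()
s.transition("released")
s.transition("approved")
s.transition("history")
print(s.state, s.version)   # history 1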
8.9.2 Scientific Document Templates

A scientific workflow usually consists of multiple work steps, some of which will be regularly repeated. An ELN ideally allows documenting these repetitions in the most effective manner by providing templates both for entire electronic scientific documents and for individual sections. Predefined templates for entire documents are usually linked to users working in the same project or laboratory. A document template contains predefined sections, each of which typically represents a step in the workflow. The sections may be retrieved directly from an existing application, like spreadsheet or word processing software, or they are created, edited, and organized by an implemented template management system. Document templates are used to fill a new electronic scientific document automatically with predefined entries according to the scientist's workflow. A new document created on the basis of a template might then be used in different ways:

• Standard template: The template simply provides the predefined sections on creation of the document. Users are then allowed to add, remove, or shift sections according to their requirements.
• Form-based template: The document is created from the template, but users are not allowed to add, remove, or shift entries; only creation of new entry versions is allowed. This function is intended to completely replace existing paper forms as typically used in routine laboratories.
• Workflow-based template: The template constituting the document does not just contain the predefined sections but also incorporates information about the sequence in the scientific workflow. In this case, the user is not allowed to deviate from the preset workflow; for instance, he has to start with the first section in the document and needs to have it witnessed before he can work on the next one.

Form- and workflow-based templates may be combined in a single template.
Document templates are managed in a template editor, which allows editing, copying, versioning, and release of templates. A user who creates and edits templates would need to have special access rights or administrative rights on the template management module. To avoid confusion in the selection of templates for creating new electronic scientific documents, the ELN should provide functionality to assign templates to individual users, laboratories, or team members. As with the scientific document itself, version control and audit trails are required for the creation or modification of templates. In addition to the template functionality, ELN users should have a way to reuse previously created electronic scientific documents or sections when starting a new laboratory experiment. This kind of clone functionality should optionally include all supporting information described above.
8.9.3 Reporting with ELNs

Reporting with an ELN covers the entire visible content of a scientific document. This includes all sections, independent of data format, such as text, tables, images, spectra, chromatograms, 2D and 3D chemical structures, reactions, and protein sequences. Reports from electronic scientific documents may be created in different formats, and two of them are particularly important:

1. Notebook reports include all section versions as well as previously deleted sections, each with additional comments and signature information, in a style similar to a paper laboratory notebook.
2. Publication-like reports include only the final versions of sections and represent the final state of a document as it is used for submission or publication.
Reports from electronic scientific documents require predefined headers and footers to be included that show the same administrative data (e.g., author, date, time, project) and signature information as with hardbound paper notebooks. If document reports are required for submission to regulatory authorities like the FDA, these headers and footers are predefined in the system and are automatically included in a report without user interaction; a scientist is then also not allowed to change this information.

A scientific workspace allows creating reports on a less restrictive basis, for instance for intermediate or internal reporting purposes and for documentation that does not underlie the FDA regulations as previously described. In the ideal case, conventional word processing software can be embedded into a scientific workspace to directly edit and report information in the same software package. Creating a report document in a scientific workspace also allows the preview of any file entry to be included, for instance via drag and drop from the file tree to the document. Workspace reports may optionally include placeholders for administrative data, similar to the electronic scientific document. However, in contrast to the latter, this information may be changed or edited.
8.9.4 Optional Tools in ELNs

The ultimate goal for an ELN would be to interface all laboratory software that potentially provides content for a scientific document, to provide a single unique portal for the scientist to enter, search, and document information. Although there is a chance to technically realize this, the reality is different, mainly due to the commercial aspects of the software market. However, it is helpful to integrate several optional tools into an ELN environment.

One of the tasks closely related to documentation is the simple calculations that have to be performed to prepare an experiment. The number of calculations performed, for instance, in the organic synthesis laboratory is quite small, but those that are required are very important. The calculations associated with the conversion of the starting materials to the product are based on the assumption that the reaction will follow simple ideal stoichiometry. In calculating the theoretical and actual yields, it is assumed that all of the starting material is converted to the product. The first step in calculating yields is to determine the limiting reactant. The limiting reactant in a reaction that involves two or more reactants is usually the one present in the lowest molar amount based on the stoichiometry of the reaction. This reactant will be consumed first and will limit any additional conversion to product. These calculations, which are simple rules of proportion, are subject to calculation errors due to their multiple dependencies. Any reaction entered in a scientific document for a synthesis chemist should have an optional stoichiometry table (Figure 8.3).

Figure 8.3 A stoichiometry table that supports a chemist in the tedious calculations for reactions. The calculation grid contains fields for type of compound, name, phase, molecular weight, and so forth. The calculation starts with selecting a limiting (L) compound from any of the reactants, that is, the compound that shall be consumed completely in the reaction process. By entering the amount of this compound, the table calculates all values for the other compounds, such as molarity, mass, and optionally volumes derived from given densities or concentrations of the solutions. After the reaction, the product weight is entered, and the experimental yield is calculated automatically.

Creating a stoichiometry table automatically retrieves all necessary information, like molecular mass and formula, from the reaction entered. A stoichiometry table then allows the following:

• Defining either a reactant or a product as the limiting compound
• Entering mass and purity for solid compounds
• Entering volume, concentration, and purity for solutions
• Entering volume, density, and purity for solutions
• Defining excess factors
• Entering experimental masses for yield calculation

When all necessary data are entered, the stoichiometry table calculates all missing information, like required substance amounts, masses, and yields. A stoichiometry table then acts like a small specialized calculator that accompanies the synthesis chemist. A minimal sketch of the underlying arithmetic follows.
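The arithmetic behind such a table reduces to a few proportions. The following sketch determines the limiting reactant and the experimental yield for a hypothetical reaction A + 2 B → C; the molar masses, purities, and amounts are invented for illustration.

# Sketch of the stoichiometry-table arithmetic: limiting reactant and yield.
def moles(mass_g, molar_mass, purity=1.0):
    return mass_g * purity / molar_mass

def percent_yield(product_mass_g, product_molar_mass, limiting_moles,
                  stoich_factor=1.0):
    theoretical_mass = limiting_moles * stoich_factor * product_molar_mass
    return 100.0 * product_mass_g / theoretical_mass

# A + 2 B -> C; amounts per the stoichiometry of the reaction
n_A = moles(10.0, molar_mass=122.1)               # ~0.082 mol
n_B = moles(25.0, molar_mass=180.2, purity=0.98)  # ~0.136 mol, but 2 eq needed
limiting = min(n_A, n_B / 2.0)                    # lowest molar amount per equivalent
print("experimental yield: %.1f %%"
      % percent_yield(8.9, product_molar_mass=150.0, limiting_moles=limiting))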
Another typical situation in the R&D phase is that the scientist does not know enough about a structure to store it in a conventional structure database. In this case, generic structures can be used, which contain some placeholders (e.g., residues, superatoms, molecular masses, labels) on a part of the structure in place of the atoms or groups that are not yet known. An ELN in the research environment needs to provide means for storing this information and searching for it in combination with the known structure query features.

Markush structures follow a similar approach. They were named after Eugene Markush, who included such structures in a U.S. patent in the 1920s (U.S. Patent 1506316). In general, a Markush structure is a chemical structure with multiple functionally equivalent chemical entities (residues) allowed in one or more parts of the compound. Residues are structure fragments of not fully defined structures. The knowledge of these structure fragments is important to the analyst in evaluating a reaction path or a metabolic pathway. The functionality required for an ELN to handle these structures is a specialized structure editor allowing creation and visualization of residues, definition of residues as real structures, and combined search for substructures in both the compound and the residue. Some additional features in the structure viewer help to mark and emphasize residues of the structure, allow overlapping residues, and label residues. A flag indicates whether the residues are displayed in the current context or not. If more than one residue is available, it is then possible to show or hide individual residues of a structure. The structure editor may be independent of a primary editor that is able to handle complete structures; that is, structures are created with an external standard editor, whereas residue definitions are performed with an embedded tool. Since most databases are not designed for storing incomplete information, an ELN has to provide an internal format to store incomplete structures. In fact, it stores every complete part of a structure in a conventional database and keeps the additional information about the missing parts.

Finally, access to an ELN via a Web client is desirable. However, many of the additional functions would be hard or even impossible to implement in a Web viewer. At least from the documentation and reviewing perspective, a Web Retrieval Client is an alternative. Such a module provides searching and retrieval of complete document reports and the current state of experiments, as well as review and signature features for documents.
8.10 Scientific Workspaces

By providing the features just described, we covered the most important requirements for the electronic scientific document. There is another aspect of scientific
documentation in the real world: the preparation of documents. The aforementioned requirements cover the documentation of experimental work; hence, the document creation underlies certain formal regulations. However, in scientific reality there are a lot of situations where documents have to be carefully prepared. This is particularly true for internal reports, publications, and documents that have to be submitted to regulatory authorities. The preparation of documents is typically done on the local hard drives of desktop computers. This procedure has several drawbacks:

• Documents created on local drives are usually not backed up regularly and automatically. The risk of losing important information due to software crashes or hard disk failures is considerably high.
• The security of documents from an intellectual property (IP) standpoint is at the very least questionable. This applies particularly to patent-relevant fields. According to U.S. patent policy, the date of invention is relevant for deciding who is the patent owner. This is different, for instance, from the European patent policy, where the date of filing the patent with the European Patent Office constitutes the patent ownership. However, creating documents on a local file system does not usually ensure the appropriate time stamp and does not allow applying electronic signatures, both of which are necessary for patent-relevant documentation.
• Organizing document files on a file system requires defining the document hierarchy, typically represented in the folder structure of the hard drive. This hierarchy is static and has to be defined in advance. Reorganization of multiple documents into a new folder hierarchy is tedious, requires copying or linking files, and leads to a deficit in organization structure.
• Sharing files in a team for discussion or revision is poorly supported on local file systems. This requires additional communication mechanisms that have to be provided by additional software, like e-mail programs.

Using shared drives is a workaround for some of these requirements but does not provide any significant improvements to protect IP or to organize files in a more effective manner. One way to cope with this situation is using a scientific data management system, as described already. Another concept that shall be described here is the scientific workspace.

Scientific workspaces are containers designed for personal preparation of data as well as for effective organization, sharing, and publishing of information within a team. A scientific workspace is very similar to the well-known file system organization in a tree view, except that all data are stored in a secure database and it has more powerful features for organizing and sharing information. In addition to the tree view, a scientific workspace provides an editing area, where files selected in the tree view can be previewed or edited.
8.10.1 Scientific Workspace Managers

In contrast to the scientific data management system, the focus of scientific workspaces is on information rather than data.
Workspaces can be used as personal workspaces replacing the local file system on a desktop computer or as shared workspaces for a team. A scientific workspace allows creating or uploading any kind of document or file and editing it within an ELN. Scientific workspace documents may be created from word processors or other software. Workspaces can additionally contain specialized information — like images, spectra, structures, reactions, and reaction pathways — each of which is at least available as a visual representation.

Workspace entries are created by default in a personal mode. In personal mode, the document is visible only to the author until the entry is ready to be published to a team. A workspace document can be released to other users of the workspace. At this stage, the document is made available for review to a predefined list of users according to a publication policy, and only users with appropriate access rights are able to see the document. This access restriction does not apply to administrators; just as IT administrators are allowed to access any file on a local file system of a desktop computer in a network, the ELN must provide means for ELN administrators, for instance, to delete files that are in personal mode if the author no longer has access to the system.

A scientific workspace editor is a tool that allows scientists to keep their work data supporting the ELN in a separate secure database area. It allows users to create, upload, organize, and share data in a usually less regulated manner than with electronic scientific documents. The principal intention of this tool is to allow the user to enter, work on, and share data in an intermediate secure repository. The user can enter data and keep them private until explicit release and does not need to keep data on paper or on a local file system. In summary, a scientific workspace editor allows the following:

• Definition of file types that can be uploaded or created, for instance, integrated templates for a word processing application.
• Creation of files using the predefined file types.
• Upload of files from local or shared file systems.
• Creation of metadata for file classification.
• Organization of files in a tree hierarchy, where tree nodes represent metadata of the file.
• Automatic assignment of metadata by uploading data to tree nodes.
• Keeping data private until explicit release by the author.
• Easy sharing of data with other permitted users of the scientific workspace and with external systems or groups.
• Easy creation of intermediate reports with a conventional word processing application.

This concept requires the following:

• Direct integration of common and frequently used authoring applications, like word processors, spreadsheet software, and image editors.
• Versatile viewers for text, images, chromatograms, spectra, structures, reactions, and other types of scientific information.
• Direct transfer of documents to electronic scientific documents in the ELN.
• Extensive access management on file type basis for easy distribution of information and collaboration.
• Intelligent handling of metadata dependencies for easier metadata input.
8.10.2 Navigation and Organization in a Scientific Workspace
The transition from using a file system to using a scientific workspace editor is much easier if the user recognizes similarities. That is why a conventional tree view similar to the various types of file explorer applications should be part of a workspace editor. The tree view then serves as the primary navigation tool for accessing all files stored in the workspace database. It displays a tree of currently opened records in a style just like in conventional explorers, where folders can be collapsed or expanded and files can be moved and copied. In addition, the file tree contains specific popup menus to enable fast access to folder- and file-related operations. However, instead of representing the folder structure in a static way, the hierarchic structure of a navigation tree in a scientific workspace is dynamic. The system uses metadata that can be attached to any file entry to categorize the files in the navigation tree. These metadata are either defined by a system administrator together with the file type or by the user during his work. Metadata that are defined on the file type level may be mandatory; that is, a file based on the particular file type cannot be created without entering the corresponding metadata. This ensures that file types with a specific meaning are categorized consistently throughout an enterprise. The scientific workspace can store a huge amount of diverse information, like a file system. In shared workspaces different users may need different categorizations of files. A typical example is a team of scientists working on analytical method development. Whereas one scientist may be responsible for separation of sample components for analysis using different digestion techniques, another may be working on improvement of the analytical instrument conditions, and a third may validate the method for the routine application. In this case, the three scientists would have different requirements for organizing the files or documents supporting the development of the method. This can be accomplished by allowing the user to define which metadata he wants to be represented in the file tree view. From a technical point of view, metadata are constructed with a meta-key that represents the name or category of the data and a value. By changing the meta-key structure in his current session, each user would be able to see (1) only the documents he is interested in and (2) the files categorized in an appropriate sequence. Figure 8.4 shows an example of different metadata structures displayed for different users.
8.10.3 Using Metadata Effectively
A workspace by itself has a flat data structure with a minimum of predefined hierarchy. The data organization can be structured on the fly by using the metadata of each file. Since each file may have several metadata that can be used for categorization, the user defines the file tree representation within each session and each workspace independently of other views. Different user-specific tree view settings can then be stored and selected for later use.
Figure 8.4 Example of metadata structures displayed for different users. The left-hand side shows documents that have been organized by the user according to method, sample, and type of document. The first document, Recovery1.doc, shows up in folder LC-MS (Method), A-121 (Sample), and Recovery (Type). A second user (right-hand side) might have a different view on the same data. He organized his documents in the sequence of method, type, sample, and temperature and sees only documents in the folder that provide the corresponding metadata. The repository for both users is the same, but the views are different.
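To make the idea of metadata-driven views more concrete, the following is a minimal sketch in Python; the data model, field names, and function are illustrative assumptions and not the interface of any particular workspace product. Each entry carries a set of meta-key/value pairs, and a tree is derived from whatever meta-key sequence the user selects, so two users obtain two different views of the same repository, as in Figure 8.4.

# Minimal sketch of metadata-driven tree views (assumed data model).
entries = [
    {"file": "Recovery1.doc",
     "meta": {"Method": "LC-MS", "Sample": "A-121", "Type": "Recovery"}},
    {"file": "Run1.doc",
     "meta": {"Method": "LC-MS", "Sample": "A-121", "Type": "Results", "Temperature": "15C"}},
]

def build_tree(entries, key_order):
    """Group entries into a nested dictionary following the user's meta-key sequence."""
    tree = {}
    for entry in entries:
        node = tree
        for key in key_order:
            value = entry["meta"].get(key)
            if value is None:      # entry lacks this meta-key and is hidden from this view
                node = None
                break
            node = node.setdefault(value, {})
        if node is not None:
            node.setdefault("_files", []).append(entry["file"])
    return tree

# Two users, two views of the same repository (compare Figure 8.4).
print(build_tree(entries, ["Method", "Sample", "Type"]))
print(build_tree(entries, ["Method", "Type", "Sample", "Temperature"]))

Storing the chosen key order per user and per session would reproduce the stored, user-specific tree view settings mentioned above.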
Metadata of files can be modified or extended by different mechanisms:
• Defining metadata within each workspace independently and storing them together with the workspace.
• Defining metadata on the file type level and storing them together with the file type definition.
• Automatic assignment of metadata by uploading files to a folder in the tree structure.
• Automatic assignment or modification of metadata with drag-and-drop operations in the tree view.
If metadata are predefined for a workspace or on file-type level, the user is prompted to enter missing metadata, either manually or via selection from a pick list. Predefined pick lists help to avoid misspellings, multiple ambiguous definitions,
different abbreviations, or different wording, which would result in double entries for the same data. If mandatory metadata exist, the system will refuse to create or import the file without the appropriate metadata entered. An easier way to automatically assign metadata is to create or upload data directly to a folder by using the context menu. In this case, all the metadata underlying the folder are automatically added to the entry. This is a major difference from conventional file systems: If a file is moved from one folder to another, the information that this file once belonged to the original folder is lost. This might be intentional but in most cases is not. With the described metadata concept, this information is inherent to the file and will only change or drop out on explicit action of the operator. The same applies to drag-and-drop operations with files in the navigation tree. Dragging a file and dropping it into another folder will cause a change of metadata for the file, and the system will ask the user to confirm the change. In this case, the user has several options:
• He can keep the previously assigned metadata, which virtually results in a copy of the file to another folder. However, the file is not really copied but now appears in two different folders, each of which contains a file referring to the same file instance. Consequently, changing one of the files would also change the second one appearing in the other folder.
• He can keep the previously assigned metadata and create a new file with the new metadata. In this case, he explicitly copies the file to another folder to create a second instance of the file, which is independent of the original one.
• He can change the metadata — that is, delete the previous metadata and assign the new ones. This looks virtually like moving a file from one folder to another. In fact, the file does not really move to a new location but is just displayed in a new folder since its metadata changed.
Let us consider an example of an investigation of metabolic pathways with animal species. We would organize our tree view with the default meta-key Species for each metabolite structure uploaded to the workspace. Let us assume that the metabolite structure has not yet been confirmed. If we drag a structure file from the Mouse folder and drop it into the Rat folder, the system would ask us whether we want to keep the previously assigned metadata Mouse or not. Now we have three possibilities:
1. Keeping Mouse would indicate that the unconfirmed metabolite found in the mouse has also been found in the rat. Similar experimental results lead to the conclusion that both metabolites are identical.
2. Removing Mouse would mean that the metabolite found in the mouse is a different one — the original assignment was perhaps a simple mistake — and that it is actually the rat that shows this metabolite.
3. Copying the file to the Rat folder indicates that the experimental results suggest identical metabolites for both mouse and rat; however, this might change later on when the complete structures have been verified. In this case, a copy ensures that changes found in the rat are not automatically reflected in the results for the mouse.
Metadata are most easily provided in pick lists that appear as dropdown menus in the user interface. New files in the system would typically receive multiple metadata. To stay with the previous example, each new metabolite would receive metadata for Species, Matrix, Tissue, and so forth. There are certain cases in which previously selected metadata (primary metadata) may affect the number of entries in a pick list of other metadata (secondary metadata). Selecting the primary metadata would then lead to exclusion of several nonapplicable secondary metadata or exclusion of an entire meta-key. As a typical example, investigations for metabolite identification are performed in different experimental systems, each of which may provide different tissues. Let us define the experimental systems In Vivo and In Vitro, and the tissues Whole Blood, Liver, and Nerve Tissue. Some of these secondary metadata may be shared by both primary metadata, and some of them are used for just one primary metadata. For instance, a blood sample can be taken from a living animal, and there might be a chance to take a sample of liver tissue without significantly harming the animal; however, taking a sample of nerve tissue is usually not possible without killing the animal. That is, the combination of In Vivo with Whole Blood and Liver makes sense, whereas for In Vitro all the tissues apply. A solution is the definition of dependencies between metadata, which has consequences for the input of metadata when an object is created or uploaded. If a value is selected from the pick list for any metadata defined as primary data, the available values for the secondary metadata are modified — usually reduced — according to the definition of dependencies. If only a single value remains valid for secondary data, the corresponding value is automatically selected. This feature is important because the pick lists are best defined uniquely and consistently to avoid multiple pick lists with similar meaning. All of the tissues in the aforementioned example would be defined in a single list, which might grow to several hundred values. By defining dependencies, this list is reduced if other metadata are selected. Another feature that becomes important for metadata management is metadata grouping. As an example, a blood sample is taken from a mouse and a new metabolite is found. We can assign the metadata Mouse, Whole Blood, and In Vivo to this metabolite. We now do a second investigation where we unfortunately have to kill the mouse, take a brain sample, and find the same metabolite. We would now assign Mouse, Brain, and In Vitro to this structure. If we looked now at the metabolite's metadata, it would list the meta-keys and their values:
Species   Mouse
Tissue    Whole Blood
Tissue    Brain
System    In Vivo
System    In Vitro
We would not be able to distinguish, for instance, whether the metabolite was found in the blood of the living or the dead mouse. At this point we need a mechanism that allows us to group the entered metadata so that the context becomes clear.
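The following Python sketch illustrates both mechanisms under assumed data structures (the dependency table, function name, and grouping layout are not taken from any specific product): a dependency table reduces the pick list of a secondary meta-key once a primary value is selected, and grouped metadata keep each investigation's key/value pairs together so that the context stays unambiguous.

# Metadata dependencies: primary selections reduce secondary pick lists (assumed model).
TISSUES = ["Whole Blood", "Liver", "Nerve Tissue"]
DEPENDENCIES = {
    ("System", "In Vivo"):  {"Tissue": ["Whole Blood", "Liver"]},
    ("System", "In Vitro"): {"Tissue": TISSUES},
}

def pick_list(secondary_key, full_list, primary_key=None, primary_value=None):
    """Return the reduced pick list for a secondary meta-key, given a primary selection."""
    if primary_key is None:
        return full_list
    allowed = DEPENDENCIES.get((primary_key, primary_value), {})
    return allowed.get(secondary_key, full_list)

print(pick_list("Tissue", TISSUES, "System", "In Vivo"))    # ['Whole Blood', 'Liver']

# Metadata grouping: one group per investigation instead of one flat list of values.
metabolite_metadata = [
    {"Species": "Mouse", "Tissue": "Whole Blood", "System": "In Vivo"},
    {"Species": "Mouse", "Tissue": "Brain", "System": "In Vitro"},
]

With the grouped representation it remains clear that Whole Blood belongs to the in vivo investigation and Brain to the in vitro one.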
8.10.4 Working in Personal Mode
In many cases, documents are prepared on a local hard drive. If a scientist writes a draft report for his lab manager, he usually does not want to distribute this document before finalizing it. This has nothing to do with seeing the document as private property but instead with the fact that an unfinished document would raise unnecessary discussions. For a scientific workspace to behave accordingly, a personal mode would be active by default when creating or uploading a document. Only the author would be able to see the file in the tree view, and when he finishes the document, he would simply release it to other users of the same scientific workspace, for instance, his lab manager. This procedure makes sure that every contribution to a scientific experiment is captured at the time of creation in a secure database. In addition, IT administrators are provided with a special right to see personal mode files and to release them. This feature ensures that no uncontrolled accumulation of personal files occurs and that files are available even if the original author is not around. Personal and released files are indicated in the tree view to distinguish them easily. If a file is released to other scientific workspace members, the file can also be made available for review to a predefined list of users that are not members of the workspace. This is done by introducing a publishing level for a file. Only users having the corresponding access level are able to see published files of a scientific workspace to which they usually would have no access at all. This feature is independent of any other user membership; that is, publishing a file will make it available to anyone who has access to the software, provided he has the corresponding access level.
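A compact way to express these visibility rules is a single check that is evaluated whenever the tree view is built. The sketch below is an assumption about how such a rule could be written — state names, role names, and field names are invented for illustration and are not the access model of a specific ELN.

# Assumed visibility rule for workspace files in personal, released, or published state.
def can_see(file, user):
    if file["state"] == "personal":
        return user["id"] == file["author"] or "admin" in user["roles"]
    if file["state"] == "released":
        return user["id"] in file["workspace_members"] or "admin" in user["roles"]
    if file["state"] == "published":
        return user["access_level"] >= file["publishing_level"]
    return False

draft = {"state": "personal", "author": "u42",
         "workspace_members": ["u42", "u07"], "publishing_level": 2}
print(can_see(draft, {"id": "u07", "roles": [], "access_level": 1}))   # False
print(can_see(draft, {"id": "u42", "roles": [], "access_level": 1}))   # True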
8.10.5 Differences of Electronic Scientific Documents
Electronic scientific documents and scientific workspaces are primarily independent of each other but are used in the same environment. Consequently, they share several common resources, such as user, laboratory, and project data, metadata, and pick lists. Both concepts use a similar technology for predefined templates and file types, respectively. The most important differences between the two systems are as follows. Electronic scientific documents are report-type documents created and maintained in a restrictive manner according to paper lab notebook rules, GxP, and 21 CFR Part 11 requirements. They are assembled using individual sections, each of which typically represents a step in a scientific workflow, and they can be created from predefined templates using word processing applications, as well as specific editors integrated in an ELN environment. Scientific workspaces are subject to less restrictive regulations and do not necessarily need witnessing capabilities but provide an audit trail for any change in the workspace database. They are containers for different file-based information similar to a local file system but are stored in a secure database according to IP protection policies. Scientific workspaces also allow creation of content by default in a personal mode; that is, only the author (and administrators with appropriate rights) can see the file. They allow deletion of files; that is, the corresponding record in the database is removed. The audit trail for the workspace notes the change, but no file version is kept.
8.11 Interoperability and Interfacing
Special emphasis has to be placed on interfacing electronic data management systems with expert systems, specific client applications, or other data management systems. The focus lies on open and standardized communication technology. An interface is a device or system used for interaction of unrelated software entities. The final goal of an interface is to transfer information from one system to another. Transferring information from instruments to software can be performed in several ways, including, but not limited to, the following:
• Data can be transferred via data exchange technologies, such as eXtensible Markup Language (XML), and interface languages, like Structured Query Language (SQL).
• Data can be embedded with their original application using ActiveX, Object Linking and Embedding (OLE), or Dynamic Data Exchange technologies.
• Files can be captured and interpreted by a data management system.
• Printouts from the source software can be captured and integrated.
• Instruments can be directly connected to and controlled by the data management system.
Electronic communication is not restricted to the problem of the native device interface. To gain access to all information relevant for a laboratory process, the internal laboratory workflow must be taken into account. Consequently, a workflow analysis is one of the most important steps in implementing instrument ports.
8.11.1 eXtensible Markup Language (XML)-Based Technologies
Many modern client-server models rely on XML-based Web services as the primary communication technology between client and application server. Web services consist of small discrete units of code, each of which handles a limited set of tasks. They exchange their messages in XML, a universal description language for transferring data on the Internet or an intranet. XML is defined through public standards organizations such as the World Wide Web Consortium. The advantage of Web services is that they can be called across platforms and operating systems and are primarily independent of programming language. They allow the following types of communication:
• Client-to-client for client applications to share data in a standardized way.
• Client-to-server for communication of clients with their application servers that are hosting the business logic.
• Server-to-server for data transfer between different application servers or between database server and application server.
• Service-to-service for sequential operations with multiple Web services.
Web services are invoked over the Internet (or an intranet) by industry-standard protocols and procedures including Simple Object Access Protocol (SOAP),
Universal Description, Discovery, and Integration (UDDI), and Web Services Description Language (WSDL).
8.11.1.1 Simple Object Access Protocol (SOAP)
This is a lightweight protocol intended for exchanging structured information in a decentralized, distributed environment. SOAP uses XML technologies to define an extensible messaging framework, which provides a message construct that can be exchanged over a variety of underlying protocols. The framework was designed to be independent of any particular programming model and other implementation-specific semantics.
8.11.1.2 Universal Description, Discovery, and Integration (UDDI)
This is a specification defining a SOAP-based Web service for locating Web services and programmable resources on a network. UDDI provides a foundation for developers and administrators to readily share information about internal services across the enterprise and public services on the Internet.
8.11.1.3 Web Services Description Language (WSDL)
This defines an XML-based grammar for describing network services as a set of endpoints that accept messages containing either document- or procedure-oriented information. The operations and messages are described abstractly. They are bound to a concrete network protocol and message format to define an endpoint. Related concrete endpoints are combined into abstract endpoints, or services. WSDL is extensible to allow the description of endpoints and their messages regardless of what message formats or network protocols are being used to communicate.
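As a concrete illustration of the message construct, the following Python sketch posts a SOAP 1.1 envelope to a Web service using only the standard library. The endpoint URL, the namespace, and the GetRecordStatus operation are hypothetical placeholders and not part of any published ELN interface.

# Hedged sketch: calling a hypothetical SOAP operation with the Python standard library.
import urllib.request

ENDPOINT = "http://eln.example.com/services/RecordService"   # hypothetical endpoint
ENVELOPE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetRecordStatus xmlns="http://example.com/eln">
      <recordNumber>4711</recordNumber>
    </GetRecordStatus>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "http://example.com/eln/GetRecordStatus"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))    # the XML response carrying the record status

In practice, the WSDL description of such a service allows client toolkits to generate this plumbing automatically.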
8.11.2 Component Object Model (COM) Technologies
COM is the standardized communication technology in the Microsoft Windows world. COM technology allows integration of software on different levels, some of which are Windows Clipboard integration, ActiveX plug-ins, and OLE. The use of this standard interface allows software to do the following:
• Integrate software using OLE, including access to all formats provided by the source application (e.g., Molecule [MOL] files from ISIS/Draw, Rich Text Format [RTF] files from word processing applications, Comma Separated Value [CSV] formats from worksheet applications).
• Edit content live in its original application via ActiveX technology, used, for instance, with Microsoft Office applications; structure, spectrum, and chromatogram data; and Adobe Portable Document Format (PDF) files.
• Integrate the Windows Clipboard, including access to the different formats available from the source application, like text or graphics formats.
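To show what COM automation looks like in practice, here is a small sketch that drives Excel through its COM interface from Python using the pywin32 package; it assumes Windows with Excel and pywin32 installed, and the file path is a placeholder. The same Dispatch mechanism underlies the OLE and ActiveX integration scenarios listed above.

# Sketch of COM automation with pywin32 (Windows only; assumes Excel is installed).
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")   # attach to the Excel COM server
excel.Visible = False
workbook = excel.Workbooks.Add()
sheet = workbook.Worksheets(1)
sheet.Cells(1, 1).Value = "Sample ID"
sheet.Cells(1, 2).Value = "A-121"
workbook.SaveAs(r"C:\temp\sample.xlsx")                  # placeholder path
workbook.Close()
excel.Quit()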
8.11.3 Connecting Instruments — Interface Port Solutions
Connecting instruments to software in either an active or passive way can be a delicate issue. Handling of native instrument data, standardized file formats, and, not least, the hardware and software implementation place high demands on software. To implement instrument interfaces in laboratories, several details have to be considered. Connecting an instrument is usually done by accessing a physical interface, such as a serial or parallel port. Basically, two types of ports can be distinguished:
• Unidirectional ports function in a single direction, usually retrieving data from a physical instrument port into the data management system.
• Bidirectional ports work in two directions and provide, in addition to data retrieval, semiautomated or fully automated control of the instrument.
With unidirectional ports the instrument leads the processing of data, and the data management system waits for information from the instrument. Unidirectional ports do not attempt to control the instrument software. The most sophisticated solution is to connect directly to an instrument using a bidirectional port. In this case work lists or single measurement orders are compiled by the data management software during the routine workflow, and measurements are triggered directly.
8.11.4 Connecting Serial Devices
Sometimes the connectivity to simple instruments, like balances or pH meters, poses special problems, mainly due to the following facts:
• Small devices often do not provide any computer network connectivity.
• Most of these devices communicate via serial ports; that is, a computer, or a device server, is required to connect the device to the computer network.
• Many small devices do not have specific application software.
• Small devices produce single or few values that need to be integrated into an appropriate experimental context.
Most analytical hardware, like balances, communicates with a computer via the serial port (RS232). The serial port is an asynchronous port that transmits one bit of data at a time; that is, the communication can start and stop at any time. Data sent through an asynchronous transmission contain a start bit and a stop bit, helping the receiving end know when it has received all of its data. Consequently, a broad concept is required that includes three parts: (1) connecting the device to the computer network; (2) supporting the data transfer between software and the serial device; and (3) allowing for receiving either continuous or single data from the serial device. We will refer to this concept as serial device support (SDS). To connect software to a serial device, the following components are required:
• A serial device provides single or continuous data on a serial port. Typical devices include balances, pH meters, autotitrators, conductivity meters, and plate readers. Many serial devices do not provide additional software for controlling the device.
• A serial device server connects the serial port to the network. The device server is a piece of hardware that provides an IP address, thus making the serial device act like a computer in the network. Consequently, no additional computer is needed, and the device can be plugged directly into the network. Some device servers optionally provide a virtual serial port, which makes the serial device appear to the software as if it were attached to a local standard COM port.
• The computer network provides connectivity between computers.
• A serial device support module links the device server to the software via the network. SDS acts as a device switch and routes the device to the software application. Additionally, SDS is responsible for converting the serial signal into a decimal value.
Using data from a device with a serial port in a software application usually requires a computer that runs the target software on the one hand and provides a serial connector — usually referred to as a COM port — on the other. If the target software is running on a computer in a network, the device with the serial port must somehow be connected to the network. SDS allows connecting a serial device to the network without requiring a dedicated computer next to the device. When a serial device server is available on the network, SDS can connect to the server and (1) switch between multiple connected serial devices; (2) route the data from the device server to the target software; and (3) convert the bitwise signal into decimal values. Before going fully electronic with serial devices, infrastructure and workflow considerations have to be taken into account. Depending on the existing laboratory infrastructure, SDS can be used in different scenarios. If a serial device server is used, the computer running the target software might be placed somewhere in the network, for instance, in another room. However, depending on the user's workflow and data integrity requirements, it might be necessary to place a computer near the serial device, especially if (1) data manipulation between the device and the computer has to be precluded; (2) multiple users are alternately using the device and device usage scheduling is not available; or (3) unique user or sample identification at the device location is not feasible. A wireless infrastructure allows for the use of tablet personal computers (PCs). In other cases — where only a single user is accessing the device, device schedules are in place, or data integrity between the device and a network computer is not critical — the computer running the target software might be placed elsewhere in the network. To ensure data integrity as required by regulatory authorities, the following information has to be present at the balance location:
• User identification in a typical SDS scenario is performed by the target software.
• Sample and batch identification can be performed either at the balance, using a bar code reader or keyboard input, or by running the target software on a mobile device, like a tablet PC.
• Device identification is done during SDS configuration. Each time a user selects a device, the identification is unique.
• Device validity has to be managed by external software.
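On the software side, reading such a device typically boils down to opening the (possibly virtual) COM port, triggering a reading, and converting the ASCII response into a decimal value. The sketch below uses the pyserial package and is an assumption for illustration: port name, baud rate, and the balance's print command depend entirely on the instrument and device server at hand.

# Hedged sketch: reading a weight from a balance over a serial (or virtual) COM port.
import serial  # pyserial package, assumed to be installed

with serial.Serial(port="COM1", baudrate=9600, bytesize=8,
                   parity=serial.PARITY_NONE, stopbits=1, timeout=2) as port:
    port.write(b"P\r\n")              # placeholder print command; instrument specific
    raw = port.readline()             # e.g. b"+   12.3456 g\r\n"
    text = raw.decode("ascii", errors="ignore").strip()
    # Convert the character stream into a decimal value, as the SDS module would do.
    value = float("".join(ch for ch in text if ch.isdigit() or ch in "+-."))
    print(value)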
8.11.5 Developing Your Own Connectivity — Software Development Kits (SDKs)
If no out-of-the-box interface solution is available, there is still a chance to get the required connectivity by doing a little programming, with the support of the software vendor or developer. The software vendor can basically provide two options:
1. An application programming interface (API) is a documented set of programming methods that allows a software engineer to interface with software from within his own program.
2. A software development kit (SDK) is a set of development tools that exposes certain features of a software product for reuse in one's own program. An SDK usually includes an API.
While an API can simply be a document describing already existing interfaces, an SDK is a module that is developed separately — and often independently — from the source code of a software product. Most modern software uses Web services to support interoperability between different software connected over a computer network. Web services provide an interface that is described in a machine-processable format such as WSDL. Other systems interact with the Web service in a manner prescribed by its interface using messages, which may be enclosed in a SOAP envelope. These messages are typically conveyed using Hypertext Transfer Protocol (HTTP) and normally comprise XML in conjunction with other Web-related standards. Software applications written in various programming languages and running on various platforms can use Web services to exchange data over computer networks like the Internet in a manner similar to interprocess communication on a single computer. Microsoft developed for its operating system a specific communication mechanism called .NET Remoting. This is a generic approach for different applications to communicate with one another, in which .NET objects are exposed to remote processes, thus allowing interprocess communication. The applications can be located on the same computer, on different computers on the same network, or even on computers across separate networks. In fact, .NET Remoting uses Web service calls. To avoid manipulation of calls from outside the system, each remoting method uses an additional parameter — the access key — which is generated by the SDK client and prevents access through other Web service clients and from other hosts. The access key is typically changed every second and has a limited lifetime. Another commonly used form of communication in the Microsoft environment is Visual Basic for Applications (VBA). VBA is an implementation of Microsoft's Visual Basic built into all Microsoft Office applications — including the Apple Mac OS versions — and some other Microsoft applications such as Microsoft Visio. This technology has been partially implemented in some other applications such as AutoCAD and WordPerfect. It expands the capabilities of earlier application-specific macro programming languages such as WordBasic and can be used to control almost all aspects of the host application, including manipulating user interface features such as menus and toolbars, and allows working with custom user forms or dialogue boxes.
VBA can also be used to create import and export filters for various file formats, such as the Open Document Format (ODF). As its name suggests, VBA is closely related to Visual Basic but can typically only run code from within a host application rather than as a stand-alone application. It can, however, be used to control one application from another, for example, automatically creating a Word report from Excel data. VBA is functionally rich and flexible. It has the ability to use ActiveX/COM dynamic link libraries, and the more recent versions support class modules. An example of VBA code that retrieves the status of a record and writes it to a cell in a Microsoft Excel worksheet may look like the following:

Sub GetStatusOfRecord()
    Dim sdk As Object
    Dim recordNumber As Integer
    ' Instantiate the SDK object and define its TCP/IP communication port
    Set sdk = CreateObject("SDK.InternalInterface")
    sdk.SetPort 8288
    ' Read the record number from cell A2 and write the record status to cell B2
    recordNumber = Range("A2").FormulaR1C1
    Range("B2").FormulaR1C1 = sdk.GetStatusOfRecord(recordNumber)
End Sub

The code declares and instantiates an SDK object (sdk), defines its Transmission Control Protocol (TCP)/IP communication port (8288), retrieves the record number from cell A2 in the Excel worksheet, calls the SDK function with this parameter to retrieve the status of the corresponding record, and writes it to cell B2. A typical SDK allows using specific functionality of the software outside its client in program code. An SDK may be restricted to just communication or can provide reuse of functionality, like sending out data and receiving some calculated values.
8.11.6 Capturing Data — Intelligent Agents
A generic solution for transferring data from any type of software is print capturing. Print capturing basically adds a special printer to the operating system that can be used to print results to the target software. The major advantage of this method is that almost any software provides print capability, which is particularly useful if no converters or data viewers are available. A drawback of this method is that only image information or printer-specific file formats, like PostScript, are transferred; the raw data are lost. In those cases where external software has no standardized interface or interface specifications are not available, communication can be performed via transferring files. File transfer can be done on a shared drive, and the target application requires a mechanism to take and interpret the file. A useful technology is the agent (e.g., intelligent agent, software agent) that comes from the field of AI. The agent in the AI field is software characterized by sensors that perceive changes in the software environment and react to these changes by acting in a predefined way. AI makes further differentiation between rational and knowledge-based agents, but we will reduce this definition to match the goal of our task.
In the following, the term software agent, or simply agent, refers to a relatively small application that is able to (1) retrieve one or more files from a shared drive; (2) convert files from a usually proprietary format to a standardized format; and (3) parse the proprietary or the converted files for valuable information. By default, an agent is a generic tool that has to be configured for the specific task. The configuration of an agent includes specification of the source computer (computer name or IP address) and source folder, as well as optional error folders where data are backed up in case of failures. The files to be transferred are identified using rules for file extensions or file name patterns. Agents provide several timer features to define polling intervals — that is, how often the agent checks for changes in the source folder. Wait intervals can be specified to compensate for system delays on the shared drive, and optional deletion timers regularly clean up the error folders. Other rules specify the number of files required for a file collection to be complete. Agents provide an interface to plug in different types of converters. File converters translate data from instrument vendor-specific or industry-standard formats into native or human-readable formats. Common industry-standard formats are, for instance, JCAMP (e.g., DX, JDX, CS), ANDI, AIA, NetCDF, MDL Molfile and related formats (e.g., SD, SDF), ASCII formats (ASC, TXT, CSV), and XML-based formats. Most of the commercially available instrument software can import or export one of these formats. Conversion can either be done using a converter implemented in the application server of the target software or by specifying an executable for streaming conversion. Parsing is the process of splitting up a stream of information into its constituent pieces, often called tokens. In the context of instrument interfacing, parsing refers to scanning a binary or human-readable format from either physical files or data streams to split it into its various data and their attributes or metadata. The parsing of files is done using regular expressions. The expressions contain a keyword that appears in the source file and define what parts of the text preceding or following the keyword are extracted. A keyword may be set as mandatory, meaning that the agent does not process the data if certain metadata are missing. Agents are also able to split data sets containing multiple results into individual subsets and to process only the subsets. Parsing is possible for different character encodings (e.g., UTF-8 or ISO 8859-1) to ensure encoding independence. In case of failures, the agent can store files that have not been transferred or that could not be parsed in a specified backup folder. Integrated e-mail communication ensures that IT administrators are informed about the failure. Since agents work automatically, they provide special features that ensure the integrity of data:
• Agents report data to the target application server only when new or changed data exist.
• Agents can check for the existence of certain files and metadata before processing the data.
• Agents may split files that exceed a specific size to prevent data overflow in the computer memory.
• Agents can handle data overflow and can send data to the application server in a timely, correct, and traceable manner.
• Agents support data buffering: they hold data communication with the target application server in case of system failure and restore it as soon as the system is operational again.
• Agents restart automatically in case of conversion or parsing failure.
If agents run in regulated environments, the agent management software comprises additional features for restricted access to agent creation as well as versioning and signature capabilities for agent configurations. Agents are able to establish and maintain data communication between software and a device or instrument and to communicate with the target software application server asynchronously. They are installed either on the computer that runs the source application or on a separate server if multiple regular transfer tasks are expected.
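The core loop of such an agent is small. The following Python sketch is an illustration under assumed conventions — the shared-drive paths, the JCAMP-like keyword expressions, and the stand-in print statement for the upload are placeholders, not the configuration format of a real agent product.

# Minimal polling agent: pick up matching files, parse keywords, move failures aside.
import re
import shutil
import time
from pathlib import Path

SOURCE = Path(r"\\labserver\export")            # hypothetical shared folder
ERRORS = Path(r"\\labserver\export\errors")     # hypothetical error folder
PATTERN = "*.jdx"                               # file name rule
KEYWORDS = {                                    # assumed keyword expressions
    "title": re.compile(r"##TITLE=(.+)"),
    "sample_id": re.compile(r"##SAMPLE_ID=(.+)"),   # treated as mandatory below
}

def parse(path):
    text = path.read_text(errors="ignore")
    data = {}
    for key, expression in KEYWORDS.items():
        match = expression.search(text)
        data[key] = match.group(1).strip() if match else None
    if data["sample_id"] is None:               # mandatory metadata missing
        raise ValueError("missing sample ID")
    return data

def poll_once():
    for path in SOURCE.glob(PATTERN):
        try:
            record = parse(path)
            print("transfer to application server:", record)   # stand-in for the upload
            path.unlink()
        except Exception:
            shutil.move(str(path), str(ERRORS / path.name))    # back up failed files

while True:
    poll_once()
    time.sleep(30)                              # polling interval of 30 seconds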
8.11.7 The Inbox Concept
Once an agent has transferred data to an application server, the client software requires some functionality to view and use these data. One solution is the concept of a generic inbox. Similar to an e-mail inbox, this feature allows the client application to receive converted files sent from a software agent. The inbox concept is particularly interesting for documentation systems, like ELNs. According to paper laboratory notebook rules, a scientist is not permitted to write in another notebook besides his own. The same applies to ELN software: An external system like an agent will not be able to insert data into an electronic scientific document without user interaction. The inbox here plays the role of a database cache between an external system and the secured notebook database. The scientist can take a look at the inbox items and decide which item he wants to insert into his document. The idea behind the inbox concept is to provide a very easy mechanism for transferring data to a notebook, much as it works with a paper notebook. Let us consider an example of how we would insert spectral data into a notebook in a paper-based laboratory:
1. Request a spectrum for a sample from the spectroscopy laboratory, usually by filling out some form with details about the sample and the measurement conditions. 2. Receive the spectrum as paper printout from the laboratory. 3. Decide which information in the spectrum is relevant for our conclusion. 4. Cut out a part of the spectrum showing the relevant information. 5. Add annotations to the spectrum (e.g., indicating the important peaks). 6. Stick the spectrum to the appropriate position in the notebook, and sign the page.
The procedure for the electronic notebook can be very similar:
1. Request a spectrum electronically by using a service request wizard, which acts like a dynamic form where we fill out or select the details for our request. 2. Receive an electronic message indicating that new results from an external system are available.
3. Open the inbox by clicking a message link to get a preview of the already converted spectrum and the metadata delivered by the instrument software. 4. Zoom or select a particular area, annotate the peaks, and drag it to the electronic scientific document.
Exactly the view that has been prepared by zooming, selecting, or annotating the spectrum is transferred to the document. This procedure gives easier access to the data, saves time, and avoids transcription errors. There are several advantages of this electronic cut and paste:
• A request may be created while working in a particular document; available results can then be checked directly from within the document editor.
• If just the inbox is open, there is no need to search for the associated document. The spectrum is simply transferred, and the appropriate document opens automatically.
• The converted data are transferred with the spectrum to the document. This makes the document independent of external software and ensures that the visualization can be changed later on, for instance, by zooming out of the spectrum or selecting another area.
• Metadata delivered by the external system (e.g., sample ID, measurement conditions) are transferred with the spectrum. They can be viewed at any time and can be used in search queries.
• If proprietary data cannot be converted and no preview is available, the data can still be attached to a document as an associated file.
One approach to automating electronic requests for information from other software systems is the concept of service requests, which are mechanisms to create definable requests for external data. Service requests are predefined by key users of the systems, whereas the end user simply calls a wizard for one of the predefined service requests, fills out the data required by the external software, and sends it to the target software. The wizard application translates the request to XML format, which is typically understood by the SDK of the external software. Results from a service request are automatically converted, parsed, and transferred to the inbox by use of configurable software agents.
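The XML payload produced by such a wizard can be very plain. The sketch below builds one with Python's standard ElementTree module; the element and attribute names are illustrative assumptions, not a published service-request schema.

# Assumed serialization of a service request before it is handed to the external SDK.
import xml.etree.ElementTree as ET

request = ET.Element("ServiceRequest", type="NMR-Spectrum")
ET.SubElement(request, "SampleID").text = "A-121"
ET.SubElement(request, "Solvent").text = "CDCl3"
ET.SubElement(request, "Requestor").text = "u42"

payload = ET.tostring(request, encoding="unicode")
print(payload)
# <ServiceRequest type="NMR-Spectrum"><SampleID>A-121</SampleID>...</ServiceRequest>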
8.12 Access Rights and Administration Electronic data management software has to provide a user management system on different levels. Whereas authentication is typically performed via user ID and password, access rights are defined via membership of users in departments, teams, and roles. The following structure shows a potential organization: • A role is a group of access rights that either defines access to individual modules (e.g., user management system, template creation module) or to individual task-related functions (e.g., IT administrator, reviewer, approver) of the software.
• A department (e.g., laboratory, unit, division) defines a group of users, each of which has explicit rights on content editing and signing for electronic records. Each user may be a member of multiple departments, having different rights in each department.
• A user defines the individual's details (e.g., name, user ID, e-mail address) as well as his membership in at least one department and at least one role.
• A team is an optional group of either departments or users with individual content editing and signing rights within the team.
In the simplest model, a document or file is created in a department that defines the users and their access rights according to the static organization of an institution or company. In contrast, a team represents a usually temporary new constellation of users. This is helpful for the typical study or project organization, where several scientists from different departments work for a limited time on a collaborative task. Teams can cover multiple laboratories or users, each of which may have rights in the team that differ from those in their default department. A record, file, or document created within a team derives its access rights from the team's definition rather than from the department. The important difference from file-based storage is that the record itself inherits the access rights, rather than deriving them from the software environment. Since electronic records derive their rights and restrictions from the department or team in which they were created, a user might at the same time work on two records, one of which is subject to GxP rules and requires an electronic signature and one of which does not.
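A minimal object model makes the inheritance of rights explicit; the class and field names below are assumptions for illustration, not a vendor schema. The essential point is that each record stores a reference to the department or team in which it was created and asks that context for a user's rights.

# Assumed sketch of the department/team model with record-level rights.
from dataclasses import dataclass, field

@dataclass
class Department:
    name: str
    rights: dict = field(default_factory=dict)    # user ID -> set of rights

@dataclass
class Team:
    name: str
    rights: dict = field(default_factory=dict)    # may differ from the department rights

@dataclass
class Record:
    record_id: str
    context: object                               # the Department or Team it was created in

    def rights_for(self, user_id):
        return self.context.rights.get(user_id, set())

lab = Department("Analytical Lab", {"u42": {"edit"}})
study = Team("Study 07-123 (GxP)", {"u42": {"edit", "sign"}})
print(Record("R-0001", lab).rights_for("u42"))     # {'edit'}
print(Record("R-0002", study).rights_for("u42"))   # {'edit', 'sign'}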
8.13 Electronic Signatures, Audit Trails, and IP Protection The review and signature process for electronic records may become very lengthy and may impact overall enterprise productivity dramatically. Well-defined data management software overcomes this deficiency by streamlining the reviewing and witnessing process with permanent logical record status management and scheduling. One of the important elements of such a strategy is an incorporated messaging system. This messaging system might be similar to or even integrated with conventional e-mail software. It allows the scientists to send records to the reviewer or approver by creating signature requests. On the other side, an approver is able to review the record in a timely fashion, to sign it, to request additional information before signing it, or to reject the signature giving an appropriate statement.
8.13.1 Signature Workflow
ELNs in particular rely on this type of signature functionality on the document or section level. However, whether signatures are necessary, how many signatures are required, and who the responsible persons are have to be configurable at the department or team level. Well-designed systems allow signing by an approver without the need for the current user to log off from his session. Electronic signatures require entering user ID and password in order to be equivalent to handwritten signatures and are independent of the currently logged-in user.
Several features can streamline a signature workflow:
• Opening a signature request in the messaging system leads the approver directly to the record or document to be signed.
• A signature request for multiple records in the same context generates a single message for the approver instead of multiple messages.
• The approver is able to directly see the content to be signed by highlighting the corresponding section in a document or by filtering the records to be signed and hiding other records.
• The approver may need to quickly switch between multiple records in the same context or sections in a document.
• The number of signatures for records or documents needs to be configurable according to institutional policies. It must be possible to switch off the signature feature for individual records.
• Signatures need to be configurable concerning the level of detail of a record. This applies particularly to electronic documents, where signatures may be applied to sections before the entire document is signed.
• When an author releases a record or document for signature, the application prompts the user to approve all sections that require individual signatures.
• Whenever a signature is performed, the person requesting the signature is automatically notified about the signature.
• The approver is able to request additional information — via free text message — from the author who created the record. Either the owner of the corresponding record or the user who sent the request is notified about the request for information.
• The approver is able to reject a signature. In this case, the approver has to enter a reason for rejection. Either the owner of the corresponding record or the user who sent the request is notified about the rejection, including the reason for rejection. A rejected record must be modified before a new request on the same record can be sent, or the record must be deleted. In the latter case, the approver is informed about the deletion.
• The reason for a signature, rejection, or modification can be selected from a pick list consisting of items from a predefined collection of acceptable reasons. The reason from the pick list may be set as mandatory. Optional comment fields are available.
• The user who released a record for signature must be able to cancel the release status, thus withdrawing a signature request. The signature request for the record needs to be invalidated by removing the request from the approver's task list. A new message may optionally be sent to the approver informing him about the revocation.
The reviewing and signing process for electronic scientific documents can be considerably improved when keeping these requirements in mind. One critical remark on this topic: According to FDA guidelines, an electronic signature shall be equivalent to a handwritten signature. The approver certifies with the signature that he has read, understood, and accepted the entire content of a document. With electronic systems it is common practice to apply signatures
to electronic records that cannot be reviewed or understood in their entirety, such as liquid chromatography-mass spectrometry (LC-MS) spectra in a native binary format. What is signed in this case is not the content but the fact that this content has been created with a certain instrument using certain conditions. There is a fine line between applying signatures to an electronic document and to a generic electronic record. A generic electronic record might not need a signature at all as long as the record itself is not subject to submission or required to prove the contents of a human-readable report that finally contains the most important information: the reasoning.
8.13.2 Event Messaging One of the positive side effects of an electronic solution is that it does not necessarily affect the current workflow if reviews or signatures have to be applied. The current work status can be made available during the routine use of the system, and a signature can be applied to a record while the scientist is working on another. A messaging system is a central communication port in data management software and, again, is particularly useful in documentation systems. A messaging system includes a calendar, making it easier for lab managers to review their tasks based on a certain date. It works similarly to an e-mail system consisting of three communication ports:
1. Messages: Writing, sending, and receiving messages like in an e-mail system; additionally, links to documents or records can be inserted, and the recipient can directly open the associated file.
2. Notifications: These are sent and received automatically if the record status changes, as well as from various other modules if certain events occur.
3. Signature Requests: These are messages concerning signature tasks and deadlines; links to the corresponding records are automatically included.
A conventional e-mail system would not be able to perform all the required tasks; however, it can be effectively integrated with an internal messaging system.
8.13.3 Audit Trails and IP Protection
Basically, all changes in a database that is subject to FDA regulations must be audit trailed. The software has to provide means for creating complete audit trail reports of records; of the creation of document or file templates; and of changes in user, department, role, and permission management. Audit trail reports are ideally created for use in Web applications to be viewed with conventional Web browsers. Whereas the level of detail for the audit trail report may be configurable, the audit trail itself must be hard-coded in the application server that hosts the business logic. All data that are stored in an FDA compliance-ready database require basic IP protection mechanisms like hash coding and encryption.
8.13.4 Hashing Data Encryption is the process of taking sensitive data and scrambling it so that the data is unreadable. The reverse process is called decryption, wherein the encrypted
unreadable data is transformed into the original data. There are two main forms of encryption: (1) single-key, or symmetric, encryption; and (2) key-pair, or asymmetric, encryption. A hash function or hash code takes data of any length as input and produces a digest or hash of a fixed length, usually 128 or 160 bits long. The resultant hash is then used to represent — not to replace — the original piece of data. For a hash function to be able to represent the original data it must be (1) consistent — the same input will always produce the same output; (2) unpredictable — given a particular hash it is virtually impossible to reverse the hashing process and obtain the original data; and (3) volatile — a slight change of the input data (e.g., the change of one bit) will produce a distinct change in the hash code. When a signature is performed, the system calculates a hash code for the record including the unique ID of the approver, date and time of signature, and reason for signature. The resulting hash code is similar to a unique fingerprint of the record and its signature. The hash string is then encrypted using symmetric (with an internal key) or asymmetric (using an internal and an external key) methods.
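The fingerprint calculation can be illustrated with Python's standard hashlib module. The field layout below is an assumption, and SHA-256 is used as a modern stand-in for the 128- or 160-bit digests mentioned above; the point is that record content and signature details enter a single hash whose properties (consistent, unpredictable, volatile) make it a fingerprint.

# Sketch of a signature fingerprint: record content plus signature details in one hash.
import hashlib

def signature_hash(record_bytes, approver_id, timestamp, reason):
    digest = hashlib.sha256()
    digest.update(record_bytes)
    digest.update(approver_id.encode("utf-8"))
    digest.update(timestamp.encode("utf-8"))
    digest.update(reason.encode("utf-8"))
    return digest.hexdigest()

h1 = signature_hash(b"final report ...", "approver-17", "2007-11-13T14:12:54", "Approval")
h2 = signature_hash(b"final report ...", "approver-17", "2007-11-13T14:12:54", "Approval")
h3 = signature_hash(b"final report ..!", "approver-17", "2007-11-13T14:12:54", "Approval")
print(h1 == h2)   # consistent: identical input always gives the identical hash
print(h1 == h3)   # volatile: a one-character change gives a completely different hash

Encrypting the resulting hash string with an internal (symmetric) or an internal and external (asymmetric) key then yields the protected signature value described above.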
8.13.5 Public Key Cryptography
Public key cryptography is fast becoming the foundation for on-line commerce and other applications that require security and authentication. The use of this kind of cryptography requires a public key infrastructure (PKI) to publish and manage public key values. To prepare the activities for the implementation of PKI features, an evaluation of commercial hardware and software solutions is necessary. The implementation of PKI features can generally be divided into several steps:
• Evaluation of commercial software packages that provide basic functionality for PKI.
• Connection of a biometric device.
• Access handling with the biometric device.
• Implementation of digital signatures for short-term data (e.g., spectra).
• Implementation of digital signatures for long-term data and database contents (e.g., archived or LIMS data).
The present concept deals with the first part of the entire PKI concept: software evaluation, biometric device implementation, and biometric access control. In its simplest form, a PKI is a system for publishing the public key values used in public key cryptography. There are two basic operations common to all PKIs. Certification is the process of binding a public key value to an individual, organization, or other entity, or even to some other piece of information, such as a permission or credential. Validation, on the other hand, is the process of verifying that a certificate is still valid. How these two operations are implemented is the basic defining characteristic of all PKIs.
8.13.5.1 Secret Key Cryptography
Secret key cryptography is the classical form of cryptography. Two parties A and B that want to share secure information use the same key for encryption and decryption, which requires prior communication between A and B over a secure channel.
8.13.5.2 Public Key Cryptography
In contrast to secret key cryptography, public key cryptography is based on separate encryption and decryption keys, one of which can be published. Anyone can use that public key to encrypt a message that only the owner of the private key can decrypt. In practice, computing a public key cipher takes much longer than encoding the same message with a secret key system. This has led to the practice of encrypting messages with a secret key system such as the Data Encryption Standard (DES) and then encoding the secret key itself with a public key system such as RSA, in which the public key system transports the secret key. The very nature of public key cryptography permits a form of message signing. Suppose party A publishes its decryption key and keeps its encryption key secret. When A encrypts a message, anyone can decrypt it using A's public decryption key and, in doing so, can be sure that the message could only have been encrypted by A, since A is the sole possessor of the encryption key. A has effectively signed the message. Typically, rather than encrypting the entire message with a public key scheme, the message to be signed is hashed using a cryptographic hash function, and the hash is encrypted. A cryptographic hash function maps an arbitrary-length message to a fixed number of bits. Digitally signing a message using hashes is a two-step process: the message is first hashed; the hash result is encrypted using a public key scheme, and the message is then transmitted along with its encrypted hash. To verify the signature, the recipient hashes the message himself, decrypts the transmitted hash, and compares the pair of hash values. The signature is valid if the two values match; otherwise the message was somehow altered, perhaps maliciously, in transit.
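The hash-then-sign procedure can be sketched with the Python cryptography package (assumed to be installed); key size, padding scheme, and the sample message are arbitrary choices for illustration. The private key signs a digest of the message, and anyone holding the public key can verify that the message has not been altered.

# Hash-then-sign and verification with RSA (cryptography package assumed available).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"Batch 0815 released for stability testing."
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(message, pss, hashes.SHA256())   # message is hashed internally

try:
    public_key.verify(signature, message, pss, hashes.SHA256())
    print("signature valid")
except InvalidSignature:
    print("message was altered in transit")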
8.14 Approaches for Search and Reuse of Data and Information
Searching in electronic systems is important for reuse and confirmation of data as well as for documentation of investigations and procedures. A data management system has to support searches for different types of information; each of these search functions may be combined with the others. Searches can be defined to run automatically, for instance, when a module opens. Each user should be allowed to store his individual search queries by simply assigning a query name. Such a query can be reused later or set as the default query, which is automatically executed when the corresponding search module is activated.
8.14.1 Searching for Standard Data Standard search capabilities for data management systems include the following: • Administrative information, such as identifier, status, title, laboratory, project, author, date/time of creation, signers, date and time of signature, and comments. • Full text content in documents, files, and attachments; definable on section or file type level. • Metadata of sections, documents, and files. • Comments on documents and sections. In addition to this standard search functionality, special content types require more sophisticated search technologies, such as the following: • Stereoselective identity search. • Stereoselective substructure search, including relative and absolute stereochemistry, isomer, tautomer, any-atom, any-bond, atom lists, substituents, ring bonds, and chain bonds. • Reaction search, including atom mapping and bond-specific search. • Chemical residue and Markush structure search. • Molecular formula search and molecular mass search. • Chromatogram and spectrum search for IR, UV, Raman, nuclear magnetic resonance (NMR), MS, among others. • Base or protein sequence search. • Image content and image annotation search. This is definitely an incomplete list of potentially interesting search queries but shows that even a generic documentation system like an ELN requires at least interfacing capabilities to provide this search functionality to a scientist. Any of these searches may be combined with either chemical or nonchemical searches. Spectra, chromatograms, structures, and reactions appear in the hit list; for a combined search, the result shown in the hit list might be a combination of text with one of these data types.
8.14.2 Searching with Data Cartridges Particularly helpful technologies for providing specific search functionality are data cartridges, which are a bundled set of tools that extend database servers for handling new search capabilities for specific database content. Typically, a single data cartridge deals with a particular information domain. A cartridge consists of a number of different components that work together to provide complete database capabilities within the information domain. A cartridge extends the server, and the new capabilities are tightly integrated with the database engine. The cartridge interface specification provides interfaces to, for example, the query parser, optimizer, and indexing engine. Each of these subsystems of the server learns about the cartridge capabilities
through these interfaces. Data cartridges can be accessed using database languages like SQL as well as programming or script languages. Particularly interesting for the chemist are chemistry cartridges for storing and retrieving structures, substructures, and reactions. For instance, structures created in an ELN with a structure editor can be stored in the internal database using an external chemistry cartridge, which is installed separately. The system can then take advantage of the additional search functionality provided by the cartridge. This is important for registration of structures in an external database. Structure registration is the process of entering structural information in a centralized repository, usually a structure database. These repositories serve as a pool for providing structure information that has been created in other departments of a company. Structure databases are set up according to the individual needs of a department or company. They consist of a common representation of a structure in a standardized file format, such as MolFile, SDF, reaction (RXN) (MDL), JCAMP (International Union of Pure and Applied Chemistry), or simplified molecular input line entry specification. Any additional data can be stored with the structure depending on the context; typical examples are structure properties, reaction conditions, and literature references. Since the external database might be used as a general repository, an ELN must also allow for registering structures and reactions created in an electronic document to avoid switching between different systems and reentering information. Searching for structures or reactions in an external repository is an alternative to entering the structure directly into the scientific document. This is especially helpful if the structure is complex and not easy to author. Another application is the search for structures stored in an external system — for instance, to find already performed identical or similar reactions. The search query is typically entered via commonly used structure editors that provide the required standard file format. The search query can be performed on individual databases or on all connected databases, including the internal one, at the same time. The results of a structure or reaction query are displayed in a hit list, ideally within an ELN software. The user can then create a new section in the electronic scientific document using any structure from the hit list, including metadata delivered by the database. By transferring the structure, an additional unique identifier from the external database needs to be kept; that is, the created section has a hyperlink to the external system that allows opening the default structure viewer directly from the section.
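As a sketch of how such a cartridge is used from the SQL layer, the snippet below registers a stand-in matching function with SQLite. The operator name substructure_match, the table layout, and the naive string-based matcher are illustrative assumptions, not the syntax of any particular commercial cartridge.

import sqlite3

def naive_substructure_match(structure: str, query: str) -> int:
    # Placeholder for the cartridge's real subgraph-isomorphism search.
    return int(query in structure)

conn = sqlite3.connect(":memory:")
conn.create_function("substructure_match", 2, naive_substructure_match)
conn.execute("CREATE TABLE compound_registry (compound_id TEXT, project TEXT, structure TEXT)")
conn.execute("INSERT INTO compound_registry VALUES ('CPD-0001', 'P42', 'CC(=O)Oc1ccccc1C(=O)O')")

# Combined chemical and nonchemical search through the (hypothetical) cartridge operator
rows = conn.execute(
    """SELECT compound_id FROM compound_registry
       WHERE substructure_match(structure, ?) = 1 AND project = ?""",
    ("c1ccccc1", "P42"),
).fetchall()
print(rows)   # [('CPD-0001',)]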
8.14.3 Mining for Data The progressive use of computational methods in chemistry laboratories leads to an amount of data that is barely manageable even by a team of scientists. Specialized methods in instrumental analysis, like the combination of chromatography with mass spectrometry, may produce several hundred megabytes of data in every run. Combinatorial approaches support this trend. These data can barely be evaluated in detail without help of computational methods; in fact, it becomes more and more interesting to produce generalized information
from large sets of data. As a consequence, the quality of information retrieved during a search depends no longer on the quantity and size of primary information but on an automated intelligent analysis of primary sources. This is where data mining comes into play. The task of data mining in a chemical context is to evaluate chemical data sets in search of patterns and common features to find information that is somehow inherent to the data set but not obvious. One of the differences between data mining and conventional database queries is that the characterization of chemicals is performed with the help of secondary data that are able to categorize data in a more general way and help in finding patterns and relationships. It would be an unsuccessful approach to try to keep all potentially useful information about a chemical substance in a structure database. Thus, the extraction of relevant information from multiple data sources and the production of reliable secondary information are important for data mining. In the last decades, methods have been developed to describe quantitative structure–activity relationships and quantitative structure–property relationships, which deal with the modeling of relationships between structural and chemical or biological properties. Assessing the similarity of two compounds concerning their biological activity is one of the central tasks in the development of pharmaceutical products. A typical application is the retrieval of structures with defined biological activity from a database. Biological activity is of special interest in the development of drugs. The diversity of structures in a data set of drugs is of interest in the exploration of new compounds with a given biological activity — with increasing diversity, the chance to find a new compound with similar biological properties is higher. The similarity of structural features is of importance for retrieving a compound with similar biological properties. In fact, the term similarity can have quite different meanings in chemical approaches. Similarity does not necessarily refer just to structural features — which are, in fact, easy to determine — but includes additional properties, some of which are not just simple numbers or vectors but have to be generated using multiple techniques. Molecular descriptors for the representation of chemical structures are one of the basic problems in chemical data mining.
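To make descriptor-based similarity concrete, the short sketch below computes a Tanimoto coefficient between two fragment fingerprints. The bit sets are hypothetical and stand in for whatever fixed-length descriptors a real data mining service would calculate.

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of descriptor bits."""
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

# Hypothetical fragment-based fingerprints (bit positions of present fragments)
compound_a = {3, 17, 42, 56, 91}
compound_b = {3, 17, 42, 88}
print(round(tanimoto(compound_a, compound_b), 3))   # 0.5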
8.14.4 The Outline of a Data Mining Service for Chemistry Data mining can be defined as a process of exploration of large amounts of data in search of consistent patterns, correlations, and other systematic relationships between queries and database objects. The tasks of a data mining engine can be divided into the following classes. 8.14.4.1 Search and Processing of Raw Data The most convenient raw data that are available for molecules are connection tables. As described previously, a connection table simply consists of a list of the atoms that constitute the molecule and information about the connectivity of the atoms — that is, basic information about the atom (i.e., atom symbol or number) and the bonds to other atoms. Other types of information can be calculated or derived from this basic information. Starting with a connection table, the two- and three-dimensional model
of molecules can be calculated. These models are the basis for the calculation of secondary data, like physicochemical properties of the atoms in a molecule. These properties are calculated for the specific spatial arrangement of the atoms and, thus, are different not only for each atom but also for each individual molecule. A data mining system can use those properties either as single values or in a combined form to find data that cannot be searched otherwise. 8.14.4.2 Calculation of Descriptors By using the 3D arrangement of atoms in a molecule and the calculated physicochemical properties of these atoms, it is possible to calculate molecular descriptors. Since the descriptor is typically a mathematical vector of a fixed length, we can use it for a fast search in a database, provided that the database contains the equivalent descriptor for each data set and that the descriptor is calculated for the query. We have seen before that particularly similarity and diversity can be excellently expressed with molecular descriptors. 8.14.4.3 Analysis by Statistical Methods Because of their fixed length, descriptors are valuable representations of molecules for use in further statistical calculations. The most important methods used to compare chemical descriptors are linear and nonlinear regression, correlation methods, and correlation matrices. Since patterns in data can be hard to find in data of high dimension, where graphical representation is not available, principal component analysis (PCA) is a powerful tool for analyzing data. PCA can be used to identify patterns in data and to express the data in such a way as to highlight their similarities and differences. Similarities or diversities in data sets and their properties data can be identified with the aid of these techniques. 8.14.4.4 Analysis by Artificial Neural Networks If statistical methods fail to solve a chemical problem, artificial neural networks can be used for analyzing especially nonlinear and complex relationships between descriptors. The important tasks for neural networks in data mining are as follows: • Classification: Assigning data to predefined categories. • Modeling: Describing complex relationships between data by mathematical functions. • Auto-Association: Extrapolation and prediction of new data using already learned relationships. Some of these neural networks are self-adaptive auto-associative systems; that is, they learn by example, which is an ideal technique for data mining. For each query, a set of similar training data is collected from the data source, the neural network is trained, and a prediction of properties is then performed on a dynamic basis. The properties may be again used for searching. Since training requires some additional time — usually between 5 and 30 seconds — this search takes longer than a linear search; however, the results are of much higher quality than with any other search technology.
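The following sketch applies PCA to a small matrix of fixed-length descriptors using a singular value decomposition in NumPy. The descriptor values are invented; a production system would of course operate on far larger matrices.

import numpy as np

# Rows = molecules, columns = descriptor components (hypothetical values).
X = np.array([
    [1.2, 0.7, 3.1, 0.2],
    [1.0, 0.9, 2.8, 0.3],
    [4.5, 3.8, 0.9, 2.2],
    [4.7, 3.6, 1.1, 2.0],
])

Xc = X - X.mean(axis=0)                  # center each descriptor column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                   # projection onto the first two principal components
explained = (s ** 2) / np.sum(s ** 2)    # fraction of variance captured per component

print(scores.round(2))                   # similar molecules end up close together in PC space
print(explained.round(3))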
8.14.4.5 Optimization by Genetic Algorithms Genetic algorithms are computer algorithms used for the optimization of data. Genetic algorithms work on the basis of the model of biological genetic processes (i.e., recombination, mutation, and natural selection of data). The optimization capabilities can be used to identify the correct search query parameters. This is typically done off line, since genetic algorithms usually belong to the slowest algorithms in the group of the described AI methods. However, once identified, the optimum query parameters may be stored and recalled for later searches. 8.14.4.6 Data Storage The effective access of raw data that has been saved in previous experiments is important for repetition of experiments and to avoid unnecessary calculations. Information that was proved to be useful can be saved and retrieved for later use with other data sets. It is usually recommended to store raw data (input data), descriptors, and query data in separate database entities. By separation of primary structure data, secondary data, and query data, a simple retrieval of a previous experiment as well as the use of already calculated descriptors is possible. 8.14.4.7 Expert Systems One of the tasks of a data mining service is the appropriate visualization that reveals correlations and patterns and allows validating the chosen parameters by applying the methods to new subsets of data. The visualization of complex relationships between individual data and data sets is important for the usability of an expert system. Though it is easy to present statistical data in graphs and correlation matrices, complex results of data analysis must be presented in an interactive 3D environment. Virtual reality software, like the virtual reality modeling language, may ease the interpretation process for large data sets. Results of data analysis can be linked to the representations of source data as well as to representations of related data within the same graph. Though most of the previously mentioned systems work automated, the most important part for an expert system is the interaction with the domain expert to verify the quality of search results and to track the decision processes that the automated software part used for finding the underlying correlation. Expert systems are the final step in data analysis.
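A compact illustration of the genetic optimization of query parameters described at the beginning of this subsection is given below. The three descriptor weights, the target values, and the fitness function are hypothetical placeholders for whatever retrieval-quality measure a real system would evaluate.

import random

random.seed(1)

# Toy fitness: how close a set of query weights comes to a known optimum.
TARGET = [0.8, 0.1, 0.6]   # hypothetical optimal descriptor weights

def fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

def mutate(weights, rate=0.2):
    return [min(1.0, max(0.0, w + random.uniform(-rate, rate))) for w in weights]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

population = [[random.random() for _ in range(3)] for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                               # natural selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]                         # recombination + mutation
    population = parents + children

print([round(w, 2) for w in max(population, key=fitness)])  # close to TARGET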
8.15 A Bioinformatics LIMS Approach Although the previously described LIMS are typically used in areas of quality control that underlie stronger regulations, only few successful approaches have been presented in the research area. Although a regulated laboratory has clear guidelines, research laboratories regularly encounter situations in which decisions have to be made on the basis of the actual experimental results. Implementing a standard LIMS in a research laboratory would limit the flexibility needed by a researcher. A LIMS for bioinformatics particularly would require the ability to organize huge amounts of data in a team-oriented fashion, in which each team member has its
personal filter and view on the results. We addressed this already in the workspace approach described previously. Another similarity is the ability to store, manage, and search incomplete data. The following describes an approach for data management in metabonomics.
8.15.1 Managing Biotransformation Data Metabolic pathways play an important role in the interpretation of genomic data. With the emergence of potentially very large data sets from high-throughput gene expression and proteomics experiments, there is a recognized need to relate such data to known networks of biochemical processes and interactions. The investigation of metabolic pathways belongs to the research area of metabonomics, which studies how the metabolic profile of a complex biological system changes in response to stresses like disease, toxic exposure, or dietary change. In support of these investigations, a well-structured yet flexible data model is required. Metabolic pathways are a way of describing molecular entities and their interrelationships. From a data management point of view, a container for chemical structures has to be established that meets several requirements: • Handling chemical structure objects including their metadata. • Handling of links and functional relationships between the structure entities. • Availability of interfaces to structure databases and a reporting system. Pathway investigations within the biotransformation area cover a broad range of laboratory techniques. Lead drug substances are applied to different cell tissues, animals, and humans. The metabolism (i.e., intake, reaction, and excretion) of a drug is observed using different types of samples (i.e., urine, plasma, tissues extractions). It is also common that the active substance is a metabolite of the administered drug. The samples are typically analyzed by high-performance LC-MS systems dividing and purifying the sample fractions and giving indication about number and type of metabolites. Due to the lack of explicit information on the molecule structure, the metabolites are primarily characterized by mass differences and major ions, both derived from mass spectra. For instance, since hydroxylation of a xenobiotic is a common starter for a detoxification process, a mass difference of +16, for example, is likely to be assumed as hydroxy group added to the molecule, although the position of the group is not known in the early stages of a study. The drug under investigation is also administered in a radio-labeled form, making the identification among similar organism immanent substances easier. Software that manages biotransformation data has to provide several specific functionalities. The system allows handling of not fully determined structures. Usually NMR measurements are performed for obtaining more detailed information about the molecule structures. For localization of metabolites within organisms and organs, immune-staining techniques in tissue slices are also used. The goal of the investigation is also to propose a pathway within the different organisms and conditions. Therefore a pathway managing tool is a main functionality of biotransformation management software. Biotransformation management systems allow the
scientists to enter all type of data into a database at the time when they appear. In addition to the data coming from instruments, structures, pathways, text, tables, and other data must be stored in the system. The information can be updated and refined during further investigation process. Scientific data such as chromatograms or spectra can be displayed within the software, and annotations can be made. The application provides the scientist with comprehensive search capabilities and acts as a knowledge base. Since these systems are typically used in a patent-relevant process, the user administration will allow defining access restrictions to data, such as view-only access for scientists. The result of a study is documented in a final report, which has to be supported by the system.
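The mass-difference reasoning described above can be sketched as a simple lookup of common biotransformations. The mass shifts are nominal values, the tolerance window is arbitrary, and a real system would work with exact masses, isotope patterns, and instrument-specific tolerances.

# Hypothetical lookup of common biotransformations by observed mass shift (Da).
COMMON_SHIFTS = {
    +16: "hydroxylation (+O)",
    +14: "methylation (+CH2)",
    -14: "demethylation (-CH2)",
    +80: "sulfation (+SO3)",
    +176: "glucuronidation (+C6H8O6)",
}

def propose_transformations(parent_mass, metabolite_mass, tol=0.5):
    """Return plausible transformations for an observed parent/metabolite mass pair."""
    delta = metabolite_mass - parent_mass
    return [name for shift, name in COMMON_SHIFTS.items() if abs(delta - shift) <= tol]

print(propose_transformations(385.5, 401.5))   # ['hydroxylation (+O)']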
8.15.2 Describing Pathways Biochemical pathways are representative of the kind of complex data that are generally best handled in an object-oriented manner. They are ideally modeled as object graphs and typical access patterns involve traversals through the graph. In addition, many-to-many relationships are common between participating objects. A key issue concerns the treatment of roles a molecular entity can adopt. Rather than performing multiple inheritance from a protein and catalyst class (static multiple inheritance), for example, to derive an enzyme subclass, we instead combine the role object “catalysis” with the metabolic entity “protein” through aggregation (dynamic multiple inheritance). Thus, the metabolic entity maintains a list of the roles it can assume along with the context for each role; for instance, a protein may assume the role of catalyst for a given instance of a reaction class. In this way, the protein object can adopt differing roles (e.g., catalyst, transporter) in different scenarios. Also, an object can acquire additional roles easily as they are discovered without having to modify or rebuild its class definition. In particular, the benefits of the approach are becoming apparent through its application to regulation of metabolism. The model is extensible to the representation of less well-defined regulatory networks and is flexible enough to handle the problem of incomplete or missing data. For example, not all the steps in a cascade leading to the eventual phosphorylation by a protein kinase may be known, but the process as a whole can be treated as an object that is a composite of those individual steps — some or all of which might be unspecified. The module for representing biotransformations, or reaction pathway editor, is similar to the ones used for visualizing sequences of chemical reactions but provides some additional functionality to represent relationships between molecular entities in a more efficient manner. This module visualizes a sequence of structures, starting from a simple one-step reaction down to complex metabolic pathways. In contrast to specialized software — for instance, those described before for the synthesis planning or metabonomics area — this tool needs to be specially adapted to the task of reporting. This is also a typical component of the previously described scientific workspace, in which structures are stored and organized according to their context. A pathway editor then simply allows any of the stored structures to be inserted into the pathway diagram, taking advantage of the metadata that come with the structure. The pathway diagram contains structures, incomplete structures, their relationships,
and optionally selected metadata to be displayed with the structures. A schematic view, displaying graphical symbols instead of real structures, helps structures and their connections to be easily rearranged. Switching to normal view shows the complete pathway, which can be incorporated into a document. Connectors are a generic form of connectivity information that can be assigned to a pair of structures. Connectors may be shifted; the start and end points may be attached to other objects, and open connectors are explicitly allowed for intermediate results. When dealing with complex pathways, functionality is required to automatically arrange the structures and to bring them into a reportable form. Usual techniques are layered-digraph and force-directed layouts. The layered-digraph layout arranges pathways basically linearly if they contain single one-to-one connectors. In case of n-to-n connectors the pathway is arranged in a circular or compact way. The force-directed layout arranges pathways in a directed left-to-right mode and is useful for systems with a single predecessor and multiple successors. A pathway diagram may finally be previewed, printed, or incorporated in a document by dragging it to a report container. Figure 8.5 explains the application of the data structure to metabolic pathway management. The container handles an arbitrary number of molecular entities, their relationships, and conditions. The container has a unique identifier and is capable of handling metadata belonging to the entire collection. Molecular structures can be entered as full structures, generic structures, and unknown structures. Generic structures may have arbitrary conditions on atoms or molecular fragments. A placeholder for a molecule represents unknown structures. The system is able to handle many-to-many relationships and allows insertion, deletion, replacement, and exchange of molecular entities and their connectivity information. The container serves as a repository for logically dependent molecular entities. It can handle an arbitrary number of these entities together with their interdependencies, meta-information, and associated data. The objects represent the molecular information in a generic way; that is, information about these objects must not necessarily be complete (e.g., chemical structure not yet available). It consists basically of five information blocks:
1. Identifiers: Uniquely identify the molecular entity in the data repository and provide a descriptive name for identification of molecules in the user interface and reports. A classifier is used to categorize the object in a collection; classifiers may be metabolite, intermediate, or reactant.
2. Reference Data: The system does not have to take care of the raw data or associated data that belong to a molecular entity. The reference data section manages pointers to reference information that is kept in individual databases, like molecular raw data, analytical results, or spectroscopic or chromatographic data.
3. Metadata: Represent any additional information on the molecular entities, IUPAC name, Chemical Abstracts Service name, remarks, and comments.
4. Connectivity: This section holds all information necessary to represent n-to-m relationships between the entities in the same container as well as
links to objects or collections outside the container. Each relationship has an arbitrary number of metadata, such as reaction conditions.
5. Associated Data: Basically one- or n-dimensional properties that can be assigned to a molecular entity. One-dimensional data are typically chemical or physicochemical molecular properties (e.g., molecular polarizability), whereas n-dimensional data include spectra and molecular descriptors for the compounds.
Figure 8.5 Schematic view of a pathway container. The parent structure is the starting point for metabolic investigations. From there, several metabolites are identified in different species and tissues. Each connector stores the type of relationship between metabolites together with information about the route of transformation. Each metabolite includes administrative data (e.g., name, identifier), molecular metadata such as residue information, structure metadata (e.g., formula, mass), and links to records of experimental data obtained for the structure.
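A minimal sketch of such a container in Python is shown below. The class and field names are illustrative assumptions, not those of a real product; the point is how entities with possibly unknown structures, their many-to-many connectors, and the associated metadata blocks can be held together.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MolecularEntity:
    identifier: str                                       # unique ID in the repository
    name: str                                             # descriptive name for UI and reports
    classifier: str                                       # "metabolite", "intermediate", "reactant", ...
    structure: Optional[str] = None                       # connection table; None if not yet known
    metadata: dict = field(default_factory=dict)          # IUPAC/CAS names, remarks, comments
    reference_data: dict = field(default_factory=dict)    # pointers to spectra, chromatograms, raw data
    associated_data: dict = field(default_factory=dict)   # properties, descriptors, spectra vectors

@dataclass
class Connector:
    source: str                                           # identifier of the precursor entity
    target: str                                           # identifier of the product entity
    metadata: dict = field(default_factory=dict)          # route, reaction conditions, ...

@dataclass
class PathwayContainer:
    identifier: str
    metadata: dict = field(default_factory=dict)          # metadata of the entire collection
    entities: dict = field(default_factory=dict)          # identifier -> MolecularEntity
    connectors: list = field(default_factory=list)        # n-to-m relationships

    def add_entity(self, entity: MolecularEntity) -> None:
        self.entities[entity.identifier] = entity

    def connect(self, source: str, target: str, **metadata) -> None:
        self.connectors.append(Connector(source, target, metadata))

pathway = PathwayContainer("STUDY-001", metadata={"species": "mouse"})
pathway.add_entity(MolecularEntity("M0", "Parent structure", "reactant", structure="<connection table>"))
pathway.add_entity(MolecularEntity("M1", "Metabolite 1", "metabolite"))   # structure still unknown
pathway.connect("M0", "M1", route="oral", mass_shift=+16)
print(len(pathway.entities), len(pathway.connectors))    # 2 1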
8.15.3 Comparing Pathways Each metabolite saved to the database is stored, including its connectivity information and metadata. Using a diagram template, any existing set of metabolites having at least one metadata in common may be mapped onto diagram template. This mapping allows an easy comparison of different pathways and highlights the differences between them. If a primary pathway is mapped onto a reference pathway, the system interprets the mapping in the following manner:
• All identical metabolites are transferred to their respective positions on the reference pathway. • All existing connectors are kept according to the connectivity information of the metabolites in the primary pathway. • A metabolite that does not exist in the reference pathway is left out of the scheme. Mapping a set of metabolites from one species onto a pathway created with another species, would allow one to see which metabolites are identical and which are missing. A diagram template may also be used to create a default pathway. As mentioned before, each pathway keeps its own information about the position of metabolites, and each created pathway may serve as basis for a template. A pathway diagram needs to be printable and easily transferred to a report container, such as a document. Finally, a pathway diagram may be exported either as image or in XML format including the structure (connection tables) and connectivity information of the molecules.
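The mapping rules above can be sketched as a small function. Pathways are represented here simply as dictionaries of metabolite identifiers and diagram positions; all identifiers, positions, and connectors are hypothetical.

def map_onto_reference(primary, primary_connectors, reference):
    """Transfer identical metabolites to their reference positions, keep the primary
    connectors, and report metabolites that are missing from the primary pathway."""
    positions = {m: reference[m] for m in primary if m in reference}
    connectors = [(s, t) for s, t in primary_connectors if s in positions and t in positions]
    missing = sorted(set(reference) - set(primary))
    return positions, connectors, missing

reference_pathway = {"M0": (0, 0), "M1": (1, 0), "M2": (2, 0), "M3": (1, 1)}
rat_pathway = {"M0": (0, 0), "M1": (1, 1), "M3": (2, 1)}
rat_connectors = [("M0", "M1"), ("M0", "M3")]

positions, connectors, missing = map_onto_reference(rat_pathway, rat_connectors, reference_pathway)
print(positions)   # reference positions for M0, M1, M3
print(missing)     # ['M2'] -> metabolite not observed in this species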
8.15.4 Visualizing Biotransformation Studies During the investigation of a biotransformation pathway of a lead drug compound, many different types of data accumulate, such as documents, images, chromatograms, mass spectra, NMR spectra, structures, substructures, generic structures, and result tables. A study editor module serves as the main user interface for the scientist and supports the following functions:
• Storing of data objects in the database
• Navigation within the database data objects
• Categorizing data objects
• Reorganizing data objects
• Preview of data objects
• Modification or annotation of data objects
• Transfer of data objects to reports
Due to the variety of data, it is important to provide a comprehensive way of structuring and categorizing data objects in a simple, preferably automatic way. Again we can refer to the previously described concept of the workspace, where a flat data structure without predefined hierarchy is used and data visualization can be structured on the fly by using the metadata that each data object contains. The study editor provides a navigation tree based on metadata similar to the workspace. Structuring of the data view can then either be defined by a system administrator or adapted by the individual user. Structuring metadata — that is, those that are used in the tree view — may optionally be set as mandatory, ensuring the same common view for different users. An inherent advantage of this method is that data that are automatically parsed and uploaded by software agents already bring all required metadata and are dynamically sorted into the tree. Data uploaded manually to a folder automatically inherit the folder's metadata.
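This on-the-fly structuring by metadata can be sketched as a small grouping function; the metadata keys, file names, and tree layout below are purely illustrative.

from collections import defaultdict

objects = [
    {"name": "run-01.lcms", "metadata": {"species": "rat",   "matrix": "plasma"}},
    {"name": "run-02.lcms", "metadata": {"species": "rat",   "matrix": "urine"}},
    {"name": "nmr-07.fid",  "metadata": {"species": "human", "matrix": "plasma"}},
]

def build_tree(items, structuring_keys):
    """Nest data objects under the values of the chosen structuring metadata."""
    if not structuring_keys:
        return [item["name"] for item in items]
    key, rest = structuring_keys[0], structuring_keys[1:]
    groups = defaultdict(list)
    for item in items:
        groups[item["metadata"].get(key, "unspecified")].append(item)
    return {value: build_tree(members, rest) for value, members in groups.items()}

print(build_tree(objects, ["species", "matrix"]))
# {'rat': {'plasma': ['run-01.lcms'], 'urine': ['run-02.lcms']}, 'human': {'plasma': ['nmr-07.fid']}}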
8.15.5 Storage of Biotransformation Data Saving a biotransformation in the system stores all relevant data in the main database (administrative data, connectivity information) and in the chemistry databases (structure, spectra, other instrumental data and metadata). Optionally, the biotransformation system may be integrated with an electronic notebook solution. When started from the notebook, the biotransformation system creates a graphical representation of the metabolic pathway and sends it together with administrative data and metadata to an electronic scientific document. Metadata are an important concept of the biotransformation system. They provide additional information about data objects and can be customized. Metadata consist at least of a name and a value. They may additionally contain flags, which describe special conventions for processing of metadata with the software. Administrators of the biotransformation system may define a set of metadata for a series of data types by name and default value. In addition, the user is able to define lists of metadata that can be used in certain visible list fields, like dropdown lists. The functionality of the system contains as a major part the capability to read and write data files from different manufacturers and common standard formats. Usually the support of a specific data format is a user requirement that is implemented for a specific product. Once a converter is developed for a specific product, it is available for other products. The internal binary data storage is maintained by a common library that is accessible directly by data conversion libraries. Data conversion libraries have two modes of operation: (1) Import of data means reading those data from their source and converting them into the current internal data formats; and (2) export of data means writing data from internal data storage to destination (files) in a format being supported by other software. Both ways should transport the data with no or minimum loss of information. The aim is to fulfill the FDA’s requirement of true and complete transfer; that is, information must be transferred completely and numerical data must be converted accurately.
8.16 Handling Process Deviations Production, transport, and storage of products underlie several SOPs, which define the parameters and required results from tests at different stages. Since the world is not as perfect as described in an SOP, there are deviations from the expected outcomes, in which some of the results from testing do not conform to the specified parameters. The results are referred to as nonconformities or exceptions. Expert systems are ideally suited to support the handling of exceptions, particularly since rules are already available in the SOPs and final decisions have to be made in interaction with a domain expert. Exceptions are usually divided into different classes, each of which has a series of root causes. Table 8.1 shows examples of classes and their root causes in the laboratory. The quality control unit that is responsible for the release of a product has finally to decide whether or not the charge of the product can be released and sent to the customer. This decision requires all information about exceptions that occurred during the production process. An exception also requires a corrective measure to address
Table 8.1 Definition of Classes and Potential Root Causes for Exceptions in a Laboratory
Out of Specifications (OOS): Sterility; Activity/potency; Final container labeling
Process/System: Not clearly defined (or wrong); Design; Validation
Equipment/Measuring Devices/IS-Systems: Not available; Calibration/validation; Not suitable for purpose; Preventive maintenance; Breakdown
Rooms: Calibration/validation; Not suitable for purpose; Preventive maintenance; Breakdown
Documentation/Specifications/Instructions: Not available; False/not complete/not updated; Not clearly defined or too complex
Personnel: Training; Qualification; Person-related mistake
Nonconforming Source Material: Expired; Not released; Not suitable for purpose
Supplier/Contractor: Unnotified changes; Nonconforming product
the issue in the next case. Since pharmaceutical and chemical companies are routinely inspected by national and international authorities, corrective measures on the findings from inspections by third parties as well as from internal inspections have to be considered. A comprehensive exceptions management system covers business processes such as exceptions, deviations, corrective measures and commitments, changes, and decision support for these cases. It allows tracking of exceptions from entry to conclusion and makes sure that the corresponding management processes are performed within a predefined period. All information and data entered in this system have to be monitored, traced back, and controlled for their relevance to product quality and safety to allow early detection of trends.
8.16.1 Covered Business Processes The business processes covered by an exceptions management system are corrective and preventive actions, audit by third parties and regulatory organizations, internal
assessments, and change control. Exception management and corrective and preventive actions request any information necessary for the facility procedure until the exception is completely described, particularly (1) statements to assess the risk of an exception for the business process; (2) closure dates for interim disposition, failure investigations, root cause analysis, and corrective actions; and (3) the final disposition. The system finally performs an efficiency check: after closure of corrective actions of a systematic exception, the same exception is not allowed to occur for a defined period of time; otherwise, the system will not allow complete closure of this exception processing. A systematic exception requires links to corrections and corrective actions of previously occurred exceptions; that is, the information is entered only at the first exception but appears at all linked exceptions. Reports have to be generated including data and trend analysis in a predefined schedule, such as a monthly report including exceptions per process, process step, product, and department, and typically include cost and amount information. Specific trend reports may be created for plant-specific inquiries. Exception management also requires some automated messaging capabilities to inform responsible parties of action items prior to target date completion. This includes the capability for the responsible party to reply to the message directly to the manager responsible for corrective actions. Finally, any entry of an exception in the exceptions management system creates a standard deviation report, which can be archived for a definable retention period.
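The efficiency check described here can be sketched as a simple date-window rule; the 90-day window and the dates below are arbitrary examples.

from datetime import date, timedelta

EFFICIENCY_WINDOW = timedelta(days=90)   # hypothetical window length

def may_close(corrective_actions_closed, recurrences, today):
    """Allow final closure only if the exception did not recur within the window
    and the window has elapsed."""
    window_end = corrective_actions_closed + EFFICIENCY_WINDOW
    recurred = any(corrective_actions_closed <= d <= window_end for d in recurrences)
    return (not recurred) and today >= window_end

closed = date(2007, 1, 15)
print(may_close(closed, [], today=date(2007, 5, 1)))                   # True
print(may_close(closed, [date(2007, 2, 20)], today=date(2007, 5, 1)))  # False: recurred in window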
8.16.2 Exception Recording Recording an exception involves several steps (Figure 8.6). 8.16.2.1 Basic Information Entry The exception receives a unique identifier, title, and description, as well as date and time of occurrence. A multilevel classification is performed according to predefined categories, such as out-of-specification, materials, equipment, rooms, and personnel. The next level includes a specification of the affected samples, charges, products, or systems. Any additional information is entered or is attached as electronic files. 8.16.2.2 Risk Assessment The risk assessment requires information about potential risks, effects, and impacts on current processes and environments. Process risk assessment is mainly determined by the probability of process interruption, system down times, and restart of a system. Typical chemical risk assessments comprise identification of hazards for personnel, customers, and environment, qualitative assessment of potential adverse consequences of the contaminant, and evidence of their significance. The previously described systems for toxicology estimation are useful software modules in this process. Environmental risks require additional dose-response assessments as well as quantification of exposure to determine the dose that individuals will receive. Finally, a qualitative assessment of the probability for recurrence of the exception is performed, or — in the case of systematic occurrence — the exception is linked to similar cases.
[Figure 8.6 flow: Create Exception → Information Entry → Risk Assessment → Failure Investigation → Cause Analysis (Direct Cause, Contributing Cause, Root Cause) → Corrective Actions → Efficiency Check (Automatic or Manual) → Close Exception]
Figure 8.6 Flow diagram for exception management (for details see text).
8.16.2.3 Cause Analysis Cause analysis is usually divided into three types: (1) direct causes; (2) contributing causes; and (3) root causes. The direct cause of an incident is the immediate event or condition that caused the incident. Contributing causes are events or conditions that collectively increase the likelihood of the direct cause but that are not the main factors causing the incident. Root causes are the fundamental events or conditions underlying the direct and contributing causes. Corrective measures for root causes will prevent the recurrence of the incident. In simple cases, root causes include materials or equipment deficiencies or their inappropriate handling. More complex examples are management failures, inadequate competencies, omissions, nonadherence to procedures, and inadequate communication. Root causes can typically be attributed to an action or lack of action by a group or individual. 8.16.2.4 Corrective Actions Corrective actions address the root cause and may be as simple as recalibration of an instrument or as complex as procedural or organizational changes. Depending on this complexity, reports on corrective actions may require more detailed cause analysis and extended risk assessment. Actions that require multiple steps are compiled in action lists, which may be predefined for problems that are straightforward to handle.
8.16.2.5 Efficiency Checks Efficiency checks are performed in regular intervals after the exception occurred. These checks are based on the expected outcome of a process where the exception occurred and either are of simple binary nature or incorporate measures of deviation that allow the next occurrence of the exception to be predicted. Exception may be closed either manually or automatically. Manual closure is performed by a responsible person and requires appropriate statements for reasoning. Automatic closure is usually defined with a time interval in which the exception is not allowed to recur. The handling of exceptions usually requires changes in predefined procedures. In regulated environments, these changes have to be requested and approved before they can take place. An electronically documented change request involves communication features for informing responsible personnel and decision makers automatically. This allows not only for fast revision and response to the change request but also for real-time tracking of the status of a change request. The corresponding change control module provides different access levels, such as data entry for requesters, approval for decision makers, and read-only access for informal purposes. It tracks change requests according to target dates and alerts in case of target overdue and provides graphic capabilities for trending of defined variables. The need for documenting exceptions and how they are handled requires a series of features for audits performed by regulatory institutions or customers. Audits generally require a consistent traceability of observations, states, commitments, and corrective actions. An audit module not only reports these factors but also allows for scheduling internal assessments and audits including resource planning.
8.16.3 Complaints Management There are situations in any business where complaints by customers have to be addressed and managed. The reasons for complaints can be multifold — for example, quality concerns, contamination or lack of purity of a product, or simply broken packaging. Professional handling of complaints requires a management infrastructure that includes customer data, production data, reasons, corrective measures, and information about how to handle and solve the problem. A complaints management module supports all processes of this sensitive task with the following benefits:
• Monitoring of the processing of complaints.
• Transparency of the complaints procedure.
• Traceability of systematic errors.
• Initiation of subsequent activities to address the root cause of the complaint.
• Applying the corrective measures to future complaints.
A complaints management module allows the processing of complaints that are caused by internal, external, or manufacturer reasons. Such systems control all steps involved in management of electronic complaints: the compilation of addresses of all participants, detailed descriptions of the defects, communication of actions, recording protocols, and setting deadlines. The system provides an automated controlling
module that coordinates all necessary actions from the registration, reporting of actions, and evaluation to documentation. The individual steps are outlined following: • An operator registers a complaint and collects important details about the process, such as complaint description, reporting instance or person, address, origin, manufacturer involved, product, and batch. • At this time the person in charge is informed automatically with the complaint report, and the system requires an acknowledgement of reception. • After all information is entered, the complaint is categorized and compared with existing ones. If a match is found, the resolution strategy is proposed by the system. • The distribution of the complaint report to the respective departments may also trigger further activities, such as a retest of a stored reference sample. The results of these tests are collected by the system for further evaluation. • Corrective measures are entered — or those found in previously recorded complaints are accepted — and a specific time frame is defined for solving the root cause. • Administrative costs are entered or calculated. • Finally, the system provides intermediate and final reports including corrective measures, dates for the elimination of the root cause, and costs. A reply to the customer or a specific reduced customer report can also be formulated automatically or derived from the internal reports. The main difference between exception management and complaints management is that the latter usually involves a customer, which requires at least two different reporting methods.
8.16.4 Approaches for Expert Systems Even though several software solutions exist for exception or complaints management — most of them as a module of a LIMS — none of them takes advantage of technologies used in expert systems. Most of the existing systems rely heavily on textual information that has to be interpreted by a domain expert. Interpretation could be done by an expert system module, if the description is entered in a formal fashion that allows automatic parsing and interpretation. The same applies to the rules already available in the SOPs. These documents are usually not in a form that allows automatic interpretation by a computer program. Setting up an SOP in a predefined manner — preferably with supporting editor software — would allow parsing the rules and creating a knowledge base automatically from this information. Finally, exception or complaints management is a strongly regulated process, where responsible domain experts will have to make a final decision in a complex context. The interrogation capabilities of expert systems can considerably support this decision process, while linear and straightforward decisions and action can be performed automatically.
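As a sketch of how formally authored SOP rules could be turned into a small knowledge base, the snippet below matches an exception record against condition/action rules. The rule vocabulary, field names, and actions are invented for illustration and do not reflect an existing SOP format or product.

RULES = [
    {"if": {"class": "Out of Specifications", "parameter": "sterility"},
     "then": "reject batch and escalate to the quality control unit"},
    {"if": {"class": "Equipment", "root_cause": "calibration"},
     "then": "recalibrate the instrument and repeat the measurement"},
]

def matching_actions(exception_record, rules=RULES):
    """Return the actions of all rules whose conditions match the record."""
    actions = []
    for rule in rules:
        if all(exception_record.get(key) == value for key, value in rule["if"].items()):
            actions.append(rule["then"])
    return actions

record = {"class": "Equipment", "root_cause": "calibration", "instrument": "HPLC-12"}
print(matching_actions(record))   # ['recalibrate the instrument and repeat the measurement']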
8.17 Rule-Based Verification of User Input The requirement for automatic interpretation of SOPs mentioned above leads us to another approach that is of general use for any step in a laboratory workflow. If we look at an operator entering data, we have to keep several critical sources of errors in mind: typing errors, data type errors, format errors, and data limit errors. One valuable solution based on expert system technology is a system that verifies the data entered by the operator on the basis of rules. Verification of user input in a data management system is a typical request from departments with regulated workflows. The basic requirements for such systems are as follows:
• Ability to enter data in a set of data fields
• Verification of field input in terms of format requirements and upper and lower limits
• Verification of reliability of inputs for an entire set of fields
• Creation of adequate entries in reporting systems
• Transfer of fielded data to external software, like LIMS
Such systems are an enhancement to various data management systems, like LIMS and ELNs. An input certification system usually does not (1) replace any existing LIMS or quality control system functionality that goes beyond the requirements of calculation, certification, and exception handling of direct inputs; (2) create any representation of data beyond textual or tabulated or graphical result entries; or (3) contain any third-party calculation or data interpretation. However, such systems may include additional functionality that even makes the development of user interfaces unnecessary: automatic form creation. Figure 8.7 shows the outline of the system described in the following.
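A minimal sketch of rule-based verification of a single field is given below, covering the error sources just listed. The specification keys, the pH example, and the error messages are illustrative assumptions rather than the actual rule format of the system described here.

FIELD_SPEC = {"label": "pH", "type": "numeric", "decimals": 2,
              "lower_limit": 1.0, "upper_limit": 14.0}

def verify_field(raw_value: str, spec: dict):
    """Return (ok, message) for one user input checked against its specification."""
    try:
        value = float(raw_value)                      # catches typing and data type errors
    except ValueError:
        return False, f"{spec['label']}: '{raw_value}' is not a number"
    if round(value, spec["decimals"]) != value:       # format error
        return False, f"{spec['label']}: more than {spec['decimals']} decimal places"
    if not spec["lower_limit"] <= value <= spec["upper_limit"]:   # limit error
        return False, f"{spec['label']}: {value} outside {spec['lower_limit']} to {spec['upper_limit']}"
    return True, "ok"

for entry in ("7.25", "7.253", "17.2", "7,2"):
    print(entry, verify_field(entry, FIELD_SPEC))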
8.17.1 Creating User Dialogues A verification system can automatically create user dialogs or forms for input of fielded data on the basis of a human-readable textual specification, preferably in XML. This specification contains the following information: • Input field label appearing as textual label in tabulated results. • Field description used as a hint text or help text in context sensitive help. • Input field type describing the data format of input, such as numeric, alphanumeric, or date and time. • Field formats specifying the textual representation, such as specific date and time formats. • Limits for input fields, such as upper and lower limit for numeric input or specific text formats for alphanumeric input. The specification is preferably documented in XML format and can be customized by the user or administrator of the system. The verification module reads the specification and creates the corresponding input user interface dynamically rather than using any hard-coded user dialogue. A dynamic user dialogue is created with a set of
Figure 8.7 Architecture of a system for user input verification. The system comprises two administrative user interfaces for design of dialogues, design of rules, and the dynamically generated interface for the end-user. The user interface designer (UID) is used to create the end-user dialogue, which is stored as XML-based format in the UID database. This dialogue includes the required business logic for calculations and verification of the data entered. The rule designer allows definition and management of rules for any exception occurring during the end-user session. The end-user interface is generated on the fly by the user interface interpreter. If the dialogue appears on the screen, it already includes the business logic for calling the calculator and the examiner module, the latter of which is verified via the rule interpreter the results entered by a user. If the verification fails, the end-user interface is automatically regenerated to match the new situation.
generic program codes that create the individual dialogue components, such as text fields, edit fields, and dropdown dialogues. On interpreting the user interface specification, the system compiles the generic program codes on the fly into an executable code in the working memory of the computer and incorporates the business logic from the rule base. The development of such a knowledge-based system can be divided into three phases. Phase I includes design of a user interface, generation of a user interface, verification of user input, as well as generation of textual or visual representation of completed inputs. Phase II incorporates calculations performed during or after input of data. Phase III includes the extension to rule-based handling of exceptions occurring during user input.
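The dynamic interpretation of an XML dialogue specification can be sketched as follows. The XML vocabulary, attribute names, and example fields are invented for illustration and are not the schema used by the system described in the text.

import xml.etree.ElementTree as ET

SPEC = """
<dialogue name="SampleEntry">
  <field id="sample_id" type="alphanumeric" label="Sample ID"/>
  <field id="weight"    type="numeric" label="Weight [mg]" lower="0" upper="5000"/>
  <field id="analyst"   type="dropdown" label="Analyst" items="A. Smith;B. Jones"/>
</dialogue>
"""

def interpret(spec_xml: str):
    """Build an in-memory description of the dialogue instead of hard-coding it."""
    root = ET.fromstring(spec_xml)
    widgets = []
    for field in root.iter("field"):
        widget = {"id": field.get("id"), "type": field.get("type"), "label": field.get("label")}
        if field.get("lower") is not None:
            widget["limits"] = (float(field.get("lower")), float(field.get("upper")))
        if field.get("items"):
            widget["items"] = field.get("items").split(";")
        widgets.append(widget)
    return {"name": root.get("name"), "widgets": widgets}

print(interpret(SPEC)["widgets"][1])
# {'id': 'weight', 'type': 'numeric', 'label': 'Weight [mg]', 'limits': (0.0, 5000.0)}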
8.17.2 User Interface Designer (UID) The UID is a tool that helps administrators create a new user dialogue. It allows definition of the following:
• Key field types (e.g., numeric field, alphanumeric field, dropdown field)
• Labels appearing as an edit box label
• Description used as hint texts for the user
• Unique identifier for referencing the field
• Groups, group labels, and descriptions
• Vectors, which are grouped sets of numeric or alphanumeric fields
• Input matrices, which are sets of two-dimensional vectors
The UID provides the ability to save a definition to a database or to a local file system in XML format. The basic key field types that have to be handled are numeric, alphanumeric, currency, date and time, dropdown lists, and checkbox fields. Each field type includes labels appearing as an edit box label, descriptions used as hint texts for the user, as well as a unique identifier for referencing the field. • Numeric fields have a definable format including length of integer and fractional parts. In addition, the validation includes upper and lower limit and a default value. • Alphanumeric fields have a definable format including uppercase and lowercase, capital, and title-case character settings. Reformatting is applied during user input. • Currency fields have a definable format including an optional currency character and formatting setting for positive and negative numbers. Reformatting is applied during user input. • Date and time fields have a selectable set of formats including settings for day, month, year, hours, minutes, and seconds, as well as time zone. • Dropdown fields contain a list of predefined entries and corresponding descriptions and hint texts. They have a default entry and an option as to whether the default entry is selected or not. • Checkbox fields contain a definition of the size of the checkbox and a default value defining whether the box is checked or not. • Optionally, the fields in the user dialogue can be grouped. Grouped input fields appear framed (GroupBox in Windows dialogues) and contain a label and description. • Field labels appear as a label on top of the field entry. They contain a (preferably single word) description of the field contents. • Descriptions are used in context sensitive help and in hint boxes to describe the function of the field in detail. Input fields can occur in three different types: (1) fields, in which individual fields are primarily independent from other fields; (2) vectors, a collection of onedimensional fields of the same type (and restrictions) having a fixed length; and (3) matrices, a collection of two-dimensional fields arranged in rows and columns, each having a fixed length. Each field, vector and matrix includes a unique identifier as a reference for calculations. The types are described in detail:
• Fields are primarily independent from other fields. For calculations, unique field identifiers are used to reference fields. • Vectors are grouped sets of numeric or alphanumeric fields. The individual fields are referred to as components. Each component underlies the same restrictions for formats or upper and lower limits. Statistical calculations like standard deviation or correlation can be performed on a vector. • Matrices are a set of two-dimensional vectors. Each row or column underlies the same restrictions for formats or upper and lower limits. Statistical calculations like standard deviation or correlation can be performed on a row or column in a matrix by using the 2D statistical functions. The User Interface Interpreter is a module that is able to interpret the XML definition file created by the UID. The generation of the forms leads to a consistency check of the field entries before the data are accepted. Generating Windows forms requires an algorithm for positioning and sizing of fields and labels on a user dialogue. In addition, grouping of input fields is possible via group box representation. The interpreter is able to generate the following:
• A Windows form containing the dialogue
• Group boxes according to group definition for input fields
• Input fields (edit boxes) and assignment of format properties
• Dropdown fields
• Checkbox fields
• Labels as description for input fields
• Mouse-over and mouse-click events
The examiner is a module that allows verification of input format and limits for an input field or a calculated field. Verification is performed on the basis of predefined verification criteria, like formats, numerical limits, and field length. The calculator handles all calculations performed during input or after acceptance of the form data. It consists of the following: • A mathematical library holding all the static mathematical functions, like sum and logarithm. • A parser, which is able to analyze a dynamic (user-defined) calculation formula; this component requires access to the XML UID database. • An interpreter that combines static and dynamic formulas and provides a result. The calculator may deliver results to the user interface directly, (e.g. to provide a result for a calculation field) or to the user interface object generator. Calculations are performed on a set of values from input fields of a dialogue. Calculations include simple mathematical calculations like basic arithmetic functions (e.g., addition, subtraction, multiplication, division), basic statistical functions (e.g., sum, average, products, sum of products, sum of squares, minimum, maximum,
standard deviation, variance, linear regression, correlation), and logical functions (e.g., AND, OR, NOT, XOR).

The object generator creates the visual representation of the resulting completed form in either (1) a textual (optionally tabulated) representation, to be used with systems that provide text interpretation or full-text search, or (2) a graphical representation, to be used as a direct visualization of the completed form for reporting. The formatter is a subcomponent of the object generator that formats textual results as tables according to predefined templates.
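A rough illustration of how the calculator and the examiner might cooperate is given below: a user-defined formula is evaluated against named field values with the help of a small static function library, and the result is checked against predefined limits. The formula syntax, the field identifiers (F1, F2), and the limits are invented for the example; a production parser would be considerably stricter and would read its definitions from the UID database.

```python
import math
import statistics

# Static mathematical library available to user-defined formulas.
STATIC_FUNCTIONS = {
    "sum": sum,
    "avg": statistics.mean,
    "stdev": statistics.stdev,
    "min": min,
    "max": max,
    "log": math.log10,
}

def calculate(formula, field_values):
    """Evaluate a user-defined formula against the current field values.
    Only the whitelisted functions and the field identifiers are visible
    to the expression; a real parser would also validate the syntax."""
    namespace = {"__builtins__": {}}
    namespace.update(STATIC_FUNCTIONS)
    namespace.update(field_values)
    return eval(formula, namespace)

def verify(value, lower, upper):
    """Examiner-style check of a calculated field against predefined limits."""
    return lower <= value <= upper

# Hypothetical example: net weight from the two numeric fields F1 and F2.
fields = {"F1": 12.3456, "F2": 98.7654}
net = calculate("F2 - F1", fields)
print(net, verify(net, 0.0, 500.0))
```

In a production system the dynamic formula would be parsed and validated rather than handed to a general-purpose evaluator, but the division of labor between static library, parser, and examiner is the same.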
8.17.3 The Final Step — Rule Generation

LIMS usually work tightly with SOPs or a method management system. These systems provide the documented rules for exception management in sample data analysis and batch evaluation. The concept of user input verification may replace the rule-based calculation functionality in existing LIMS. The information logistics required to solve this task is based on the following assumptions:

• An SOP or method management system provides the capability to define rules and to transfer the rule information to a rule generator.
• The rule generator creates sets of rules defined in a generic and human-readable format.
• The interface generator allows creation of a human-readable format including all definitions and dependencies derived from the rules.
• The calculation module interprets the calculation commands and processes the results.
• Based on the results, the rule generator decides whether to apply rules to a new set of input parameters (e.g., a new input interface).

By combining a rule generator with an input generator and a calculator, the system is able to apply the rules of exception management to the user interface. Again, we could take advantage of a formal electronic creation of standard operating procedures that allows rules to be parsed and a knowledge base to be created automatically.
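A minimal sketch of this rule layer, under the assumption of a very simple condition/action format, might look as follows. The rule conditions, the field names (net, specification), and the actions are hypothetical and serve only to show how calculated results could trigger a new input interface or a signature request.

```python
# Hypothetical rule set as an SOP or method management system might export it:
# human-readable condition/action pairs.  Field names ("net", "specification")
# and the actions are invented for the example.
RULES = [
    {"id": "R1", "when": "net > specification",
     "then": "generate a retest dialogue with duplicate weighing fields"},
    {"id": "R2", "when": "net <= 0",
     "then": "flag the record and request a supervisor signature"},
]

def apply_rules(results):
    """Return the actions of every rule whose condition holds for the
    current calculation results."""
    fired = []
    for rule in RULES:
        # The condition is evaluated against the result fields only; a
        # production system would reuse the calculator's formula parser.
        if eval(rule["when"], {"__builtins__": {}}, dict(results)):
            fired.append((rule["id"], rule["then"]))
    return fired

if __name__ == "__main__":
    for rule_id, action in apply_rules({"net": 86.4, "specification": 80.0}):
        print(rule_id, "->", action)
```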
8.18 Concise Summary

21 CFR Part 11 is a rule in the FDA Code of Federal Regulations published in 1997 that establishes the criteria by which the FDA considers electronic records to be equivalent to paper records and electronic signatures equivalent to traditional handwritten signatures.
21 CFR Part 210 is a rule in the FDA Code of Federal Regulations covering GMP in the manufacturing, processing, packing, or holding of drugs.
21 CFR Part 211 is a rule in the FDA Code of Federal Regulations covering GMP for finished pharmaceuticals.
Administrative Data are a subset of metadata describing the content of an information object, such as identifiers, author, and date of creation.
Application Programming Interface (API) is a documented set of programming methods that allows a software engineer to interface with software from within his own program.
Approval refers to the witnessing process for a document or record.
Audit Trail is a function of a data management system that maintains a permanent record of selected changes in the system, typically changes in data. The audit trail typically includes the identification of who performed the change, when the change occurred, why the change occurred, and the before and after values.
Authentication in software development is the process of uniquely identifying a user, typically via user ID and password (log-in).
Client–Server Architecture refers to a two-tier architecture model where client (user interface, input logic, visualization) and server (business logic, database) are separated from each other and are connected via a computer network.
Code of Federal Regulations (CFR) covers the general and permanent rules published in the Federal Register by the executive departments and agencies of the federal government.
Component Object Model (COM) is an object-oriented system for creating binary software components that can interact. COM is the foundation technology for Microsoft's OLE (compound documents) and ActiveX (Internet-enabled components).
Data Cartridge is a bundled set of tools that extends database clients and servers with new capabilities. Typically, a single data cartridge deals with a particular information domain.
Decryption is the process of decoding previously encrypted electronic data to convert it back to the original format.
Department (laboratory, unit, division) in software administration defines a group of users, each of whom has explicit rights for editing and signing electronic content.
Digital Signature is an electronic signature based on cryptographic methods of originator authentication, computed by using a set of rules and a set of parameters such that the identity of the signer and the integrity of the data can be verified.
Electronic Laboratory Notebook (ELN) is laboratory software for authoring, managing, and sharing electronic information for the purpose of scientific documentation.
Electronic Record refers to any combination of text, graphics, data, audio, images, or other information represented in digital form that is created, modified, maintained, archived, retrieved, or distributed by a computer system.
Electronic Scientific Document is a concept of ELN software that serves the role of a container for documented laboratory data that is subject to the regulatory guidelines for paper laboratory notebooks.
Electronic Signature is the legally binding equivalent of an individual's handwritten signature.
Encryption is the process of scrambling sensitive electronic data into unreadable formats by using a secure and unpredictable mechanism.
File Converter is software or a software library that translates data from instrument vendor-specific or industry-standard formats into native or human-readable formats.
Generic Inbox is a concept that allows for receiving converted files sent from a software agent in the client application, reviewing the results, and transferring them to a record or document.
Generic Structure is a chemical structure that contains placeholders instead of atoms for parts of the structure; the placeholders may consist of structural, textual, or alphanumerical information.
Good Automated Laboratory Practice (GALP) was developed to establish guidelines for automated data management in laboratories that supply data to the EPA.
Good Laboratory Practice (GLP) refers to a system of controls for laboratories conducting nonclinical studies to ensure the quality and reliability of test data, as outlined in the OECD principles of GLP and national regulations such as 21 CFR Part 58.
Good Manufacturing Practice (GMP) refers to a set of rules covering practices in the manufacturing of drugs, specifically 21 CFR Part 210, 21 CFR Part 211, and 21 CFR Part 11.
Hash Coding (hashing) is a method of converting data into a small unique representation that serves as a digital fingerprint of the data.
Hypertext Markup Language (HTML) is the coded format language used for creating hypertext documents on the World Wide Web and for controlling how Web pages appear.
Interface is a device or system used for the interaction of unrelated software entities.
Laboratory Information Management System (LIMS) is a type of laboratory software that is typically designed to manage sample-oriented data entry and workflows in the quality control area of commercial production laboratories.
Laboratory Workflow Management System (LWMS) is software that controls laboratory workflow while managing the data produced in the laboratory. This software covers manual as well as automated laboratory activities, including experiment scheduling and setup, robot control, raw-data capture and archiving, multiple stages of preliminary analysis and quality control, and release of final results. An Analytical Workflow Management (AWM) system is a specific instance of a laboratory workflow management system.
Markush Structure is a chemical structure with multiple functionally equivalent chemical entities (i.e., residues) allowed in one or more parts of the compound. Markush structures are a specific instance of generic structures.
Messaging in electronic data management systems refers to a set of software functionalities including messages (writing, sending, and receiving messages), notifications about events and status changes of records, and signature requests that inherit business logic from the signature workflow.
Meta-Key is the unique identifier for metadata.
Meta-Value refers to the value of metadata and is kept together with the meta-key.
N-Tier Architecture (multitier architecture) is a software architecture model where multiple software components serve different purposes and are physically separated, such as client–server architectures.
Object Linking and Embedding (OLE) is a Microsoft technology used to embed one application into another, thus preserving the native data used to create the embedded content as well as information about its format.
Parsing is the process of splitting up a stream of information into its constituent pieces, often called tokens.
Pathway Editor is a tool to visualize a sequence of structures in a multistep reaction or a metabolic pathway.
Personal Mode is a specific concept of scientific workspaces that allows creating data that are not visible to or searchable by other persons, except for authorized IT personnel.
Print Capturing is a data acquisition technology based on a special printer that can be used from client software to print results to a target software.
Records Retention refers to the time an electronic record has to be kept for regulatory purposes.
Release refers to the process of making documents or records available for the approval cycle.
Residue (Structure) is a generic part of a structure containing a label, a value, or a substructure.
Role in software administration is a group of access rights that defines access either to individual modules or to individual task-related functions of the software.
Scientific Data Management System (SDMS) is software that implements generic data management capabilities and features for the creation, search, and management of scientific information.
Scientific Workspaces are a specialized concept of electronic laboratory notebook software that is subject to less restrictive regulations than electronic scientific documents. They are containers designed for personal preparation of data as well as for effective organization, sharing, and publishing of information within a team.
Serial Device Support (SDS) is a concept that supports the data transfer between applications and a serial device connected to the network. It allows data to be received from serial ports, whether they send streaming (continuous) or single data.
Signature Workflow refers to the process of release, signature request, and signature of electronic data in a sequential manner according to internal policies.
Simple Object Access Protocol (SOAP) is a lightweight protocol intended for exchanging structured information in a decentralized, distributed environment.
Software Agent (Intelligent Agent, Agent) in the field of AI is software characterized by sensors that perceive changes in the software environment and react to these changes by acting in a predefined way.
Software Development Kit (SDK) is a set of development tools that provides access to certain internal functions of a software product. These functions can be implemented in another program. An SDK usually includes an API.
Specification in a software development life cycle is the task of precisely describing the software to be written.
Stoichiometry refers to the calculation of the quantitative relationships between reactants and products in a chemical reaction, for instance, to determine required amounts of reactants or to calculate yields of products.
Team (Group) in software administration defines an optional group of either departments or users with individual content editing and signing rights within the team.
Templates are a concept of ELN software for providing predefined documents, forms, or workflow-driven design of a document.
Universal Description, Discovery, and Integration (UDDI) is a specification defining a SOAP-based Web service for locating Web services and programmable resources on a network.
U.S. Food and Drug Administration (FDA) is a consumer protection agency founded in the scope of the U.S. Congress Food and Drugs Act of 1906. The FDA's main goals are the protection of public health by safety monitoring of products and by making accurate, science-based information on products and health improvement publicly available.
User (operator) in software administration defines the details of individuals using the software as well as their membership in departments and their assignment to a role.
Validation in software development refers to a process providing evidence as to whether the software and its associated products and processes satisfy system requirements, solve the right problem, and satisfy intended use and user needs.
Verification in software development refers to a process providing objective evidence as to whether the software and its associated products and processes conform to requirements, satisfy standards, and successfully complete each life cycle activity.
Versioning refers to the generation of updated records or documents and the management of previous versions according to regulatory guidelines.
Visual Basic for Applications (VBA) is an implementation of Microsoft's Visual Basic built into all Microsoft Office applications (including Apple Mac OS versions) and some other Microsoft applications such as Microsoft Visio.
V-Model defines a uniform procedure for IT product development. It is designed as a guide for planning and executing development projects, taking into account the entire system life cycle. It defines the results to be achieved in a project and describes the actual approaches for developing these results.
Web Services Description Language (WSDL) defines an XML-based grammar for describing network services as a set of endpoints that accept messages containing either document- or procedure-oriented information.
XML (eXtensible Markup Language) is a subset of SGML constituting a particular text markup language for the exchange of structured data between two computer programs.
9 Outlook
9.1 Introduction

Expert systems have now been in use for more than 40 years and have spread into nearly all imaginable areas of application. Evolving from the research area of artificial intelligence (AI), they can be found today in commercial applications throughout the scientific and technical domains. We have seen that this technology is not just stand-alone software but is supported by many technologies from mathematics, cheminformatics, bioinformatics, AI, and related areas.

In particular, cheminformatics has gained increasing awareness in a wide scientific community in recent decades. One of the fundamental research tasks in this area is the development and investigation of molecular descriptors, which represent a molecule and its properties as a mathematical vector. These vectors can be processed and analyzed with mathematical methods, allowing extensive amounts of molecular information to be processed in computer software. The most important feature of molecular descriptors is their ability to represent three-dimensional (3D) molecular information in a one- or two-dimensional vector that can be analyzed by means of statistics and other mathematical techniques. It has been shown that descriptors are useful tools for characterizing the similarity and diversity of compounds, evaluating structure–property and structure–activity relationships, and investigating spectrum–structure correlations. In particular, the analysis of mathematically transformed descriptors is capable of revealing aspects of data, like trends, breakdown points, discontinuities, and self-similarity, which are rarely exposed by other signal analysis techniques. In addition, mathematical transforms can be used to compress descriptors without appreciable loss of information. The compressed representation is ideally suited for fast similarity searches in descriptor databases.

The use of molecular descriptors in autoassociative artificial neural networks appears to be a valuable supplement to a descriptor generator. In particular, descriptors can help to store the increasing amount of molecular information effectively and to analyze it quickly and with high reliability. The combination of rule-based methods with neural networks, statistical analysis, pattern recognition techniques, and fuzzy logic is an important step toward more human-like decision making by expert systems.
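As a small reminder of how these pieces fit together, the sketch below compresses two synthetic descriptor vectors with a single level of a Haar-like averaging filter and compares the compressed forms by their correlation coefficient. The one-level averaging stands in for the full wavelet treatment discussed in earlier chapters, and the descriptor values are invented for the example.

```python
from math import sqrt

def haar_approximation(descriptor):
    """One level of a Haar-like low-pass filter: pairwise averages, which
    halve the length of the descriptor (length assumed to be even)."""
    return [(descriptor[i] + descriptor[i + 1]) / 2.0
            for i in range(0, len(descriptor), 2)]

def correlation(x, y):
    """Pearson correlation coefficient used as a similarity measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic descriptor vectors standing in for radial distribution function codes.
d1 = [0.0, 0.1, 0.4, 0.9, 1.0, 0.6, 0.2, 0.0]
d2 = [0.0, 0.2, 0.5, 0.8, 0.9, 0.5, 0.1, 0.0]
print(correlation(haar_approximation(d1), haar_approximation(d2)))
```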
9.2 Attempting a Definition

Considering everything we have heard about expert systems, we can attempt to define some minimal requirements that would constitute an expert system, or a knowledge-based system:
• An expert system contains an explicit, identifiable knowledge base, including knowledge rather than data or information. We found at the beginning of this book that knowledge differentiates itself from data and information mainly by its repetitive character and its applicability to new problems. The knowledge base is defined in a somewhat self-explanatory language that allows non-information technology (IT) experts to manage the knowledge. Another key requirement for the knowledge base is that it is separated from the controlling and reasoning mechanisms.
• Since the reasoning mechanism is separated from the knowledge, it has to deal with generic semantic objects that require definition. The inference mechanism adapts to changes in the knowledge base, allowing for dynamic adaptation to new situations. This is in contrast to conventional programs that execute in a linear manner based on available data. In the ideal case, expert systems incorporate some explanatory features that allow retracing of reasoning.
• Expert systems rely on a series of supporting technologies, allowing them to deal with the natural fuzziness of data, estimation of patterns and relationships, human-like categorization of data and information, human-like learning and decision making, and optimization features that allow automatic selection of information relevant to the task.
• Expert systems are literally applications rather than theoretical research topics. Their application addresses parts of a larger context that incorporates a series of different data, information, and knowledge management systems. Interoperability is therefore one of the main features of an expert system. On the other hand, expert systems are not necessarily developed in descriptive programming languages as long as they are able to present the knowledge in a descriptive manner. Expert system shells and descriptive languages are merely tools that, in some but not all cases, allow development to be done more efficiently.
9.3 Some Critical Considerations

Expert system applications have met with both success and failure. Although expert systems rose tremendously in popularity until the mid-1980s, their popularity then declined until the turn of the millennium. Even though it is hard to identify all factors that led to this decline, several key issues are apparent:

• Poor acceptance by domain experts
• High costs and unclear business benefits
• Lack of maintenance and of functionality supporting the maintenance of expert systems
• Insufficient interoperability with existing systems
Interestingly, some of these factors apply more or less to human experts, too; acceptance by other domain experts, high costs, and insufficient "interoperability" (communication) are typical issues to deal with in any community. However, in contrast to human experts, a computer program has to run perfectly in order to gain acceptance from humans. Let us address some of these reasons in a bit more detail.
9.3.1 The Comprehension Factor

The concept underlying expert systems is, as hopefully has been shown in this book, a quite simple one. The basic ideas of separating rules from programming logic, the reasoning mechanisms, and the supporting concepts like fuzzy logic and even artificial neural networks contain nothing particularly complicated; in fact, they are as simple as most good approaches. If we take the term complicated to mean something that is hard to understand, the previous statement is definitely true. However, not being complicated does not imply that the problem-solving procedures are not complex, for instance, in the way these techniques deal with information. Many of the supporting techniques work iteratively on huge amounts of data. The complexity of such processing might cause us to lose track of the details, but that is exactly what computers handle best. Computers are the quick dummies, so to speak.
9.3.2 The Resistance Factor

The basic concepts of expert systems do not deal that much with details but with a more general point of view. One reason for resistance in the scientific community is the difficulty of appreciating a generalized design when it is viewed from the detailed problem in which the user is currently involved. However, important breakthroughs in the sciences in the second half of the last century show that generalized approaches are gaining more and more importance for understanding the complex relationships found in real-world observations. Two examples are the description of the behavior of nonlinear dynamic systems by chaos theory and the understanding of natural nonequilibrium processes through the thermodynamics of irreversible systems. Bioinformatics in particular is typically confronted with complex relationships that require generalized models, because experimental details neither describe a higher-level effect appropriately nor are useful for induction without a generic approach.

Expert systems are still a mystery to a large part of the scientific community. This is partially due to the deliberate mystification that unfortunately can be found in many scientific areas. Additional confusion comes with the discussion about whether such systems should be called expert systems or knowledge-based systems, which is basically a valid discussion but does not really help to raise the level of acceptance. One of the most obvious factors is the question of which expert would like to be replaced by an expert system. It is important to understand that expert systems rely on experts and that without the knowledge base they are essentially useless.
9.3.3 The Educational Factor

Improving the acceptance of such systems is, on the one hand, a task for educational institutions, such as universities, corporate training facilities, and other teaching organizations. On the other hand, it is a task for the computational science community, and the AI community in general, to make the software systems and their underlying technology easier to understand. This starts with the user interface and continues with software training and appropriate documentation.
In the field of education, many expert system applications are embedded in intelligent tutoring systems (ITS) by using techniques from adaptive hypertext and hypermedia. An ITS is a system that provides individualized tutoring or instruction. In contrast to other expert systems, an ITS covers multiple knowledge areas:

• The domain knowledge, including the knowledge of the particular topic or curriculum.
• The student knowledge, which covers the current state of knowledge of the learner.
• The strategy knowledge, which refers to the methods of instruction and presentation of the domain knowledge.

This outline of requirements was introduced by Derek H. Sleeman and J. R. Hartley in 1973 [1]. An ITS basically works on the differences between a student's approach to a problem and the solution stored in the domain knowledge: it analyzes the differences, informs the student about the outcome, and updates the student knowledge base. As a result of the analysis, the system can do the following:

• Assess the points of improvement for the student.
• Decide on changes in the sequence of learning topics.
• Adapt the learning strategy.
• Modify the presentation of the topic.
The results of this assessment are applied to the learning course. As a consequence, an ITS requires diagnostic algorithms based on dynamic principles rather than static facts.
9.3.4 The Usability Factor

Expert systems evolved as a specific technology developed on computer platforms that were, at least in the eighties, not at all easy to use and understand. Their availability was restricted to professionals and, not least, the operating systems and required tools were barely affordable. Even though this has changed during the last decades, as more and more systems were made available on more common and cheaper platforms, the AI community still tends to develop systems using less conventional tools and platforms.

One of the general requirements for software is user friendliness. Expert systems whose essential features are still based on complicated command-line input are simply no longer acceptable in a world where nearly all operating systems provide mouse- and menu-driven interaction, which is far more intuitive for the user. This certainly does not apply to the development expert, who is usually much faster on the command line than with a mouse. However, as stated before, the term software application focuses on the term application; ultimately, such systems should aid us in problem solving rather than serve as a playground for software engineers. As long as user interface design is not conceived as a substantial part of the software development process, we will struggle to propagate the usefulness and advantages of unconventional software approaches to a wider user community.
9.3.5 The Commercial Factor

The purchase of an expert system must have some justification. Justifying an expert system is most efficiently done by comparing current processes with those supported by expert systems. Faster decision making and reaction, decreased workload, assistance in evaluating situations, and handling of situations that are too complex to be completely understood by a single person are some of the main reasons for establishing expert systems in the scientific environment. There may also be cases where an expert system can perform at least part of a process when the human expert is not available. This applies to emergency cases as well as to educational situations where unique expertise can be made available to company sites located in other geographic areas. Another important fact is the persistence of the human expertise gained through years of experience from working on the problem: human experts who leave or retire from the company can incorporate at least part of their knowledge in a knowledge base.

Whereas human experts may be scarce, and hence expensive, expert systems may be quite inexpensive. Expert systems have to be affordable and capable of running on desktops or lower-cost parallel computing platforms. The software industry is challenged to make expert systems available at affordable prices, at least for academic areas, education, or even personal use. The increasing popularity of expert systems will finally increase their return on investment.

Validation of expert systems still poses a problem that remains to be solved. This is mainly due to the dynamic nature of the reasoning process and the multiplicity of cases that can be solved by an expert system. Whereas the functionality of a linear software system provides the basis for test cases, an expert system can barely be covered comprehensively by a few hundred test cases. Expert systems, as well as most AI software, have to be transitioned carefully from the implementation to the production phase and require continuous quality control through experimentation as part of the standard maintenance procedure. However, the same applies to a series of other systems, such as knowledge management systems, data mining systems, and, finally, human experts, and it is no reason for a general rejection of expert systems in production environments.
9.4 Looking Forward

The answer to the question of whether expert systems will replace an expert is still no, or at least not until beaming, floating cars, and cyborgs have become standard technologies. Expert systems are designed to aid an expert in decision making rather than to replace him; that is, the expert is still sitting in front of the computer. Sophisticated problems can rarely be solved without the experience of a human expert, and a lot of this experience cannot be forced into logical rules. There are still issues that have not been addressed with expert systems, such as the human common sense needed in some decision making, the creativity used to solve problems in unusual circumstances, and intuition, the power of the unconscious, which has never really been addressed by any artificial method.
An expert system still simply serves as a powerful amplifier of the human intellect. However, the existing systems show that it can be quite an effective amplifier. Expert systems and their supporting technologies from AI, like knowledge management approaches, artificial neural networks, fuzzy logic, and others, are playing a more and more important role in the application areas of computational chemistry, chemometrics, bioinformatics, and combinatorial chemistry. And this trend does not just apply to scientific areas but also to our day-to-day life. Owners of digital cameras and of modern washing machines usually possess a piece of fuzzy logic that is used to reduce image blur (in the camera) or to control rotation speed (in the washing machine). This would have been inconceivable in the first decades of the last century.

Even if there are still technical or conceptual problems to solve in the context of expert systems, it would be unreasonable to believe that we will not be able to solve them. The real issues with intelligent systems still lie in the future: the more we succeed in creating intelligent systems, the more scientific, legal, and social problems will arise. Then we (again) have to show that we are able to deal with an upcoming technology at a point in time when we are not yet used to it. All of the critical considerations described above boil down to a simple fact: expert systems, as well as other AI approaches, are definitely different from conventional software, in the same way that humans differ from expert systems. What is needed is a change in the perception of these systems, since they mark the beginning of a transition from nonintelligent to intelligent systems. We will be better off seeing this transition as a matter of fact, so that we are prepared for the future.
Reference
1. Hartley, J. and Sleeman, D., Towards more intelligent teaching systems, Int. J. Man-Machine Studies, 2, 215, 1973.
Index A AAS, See Atomic absorption spectrometry ab initio calculations, 201 Absolute configuration, 67 Absorbance, relation to activation energy, 213 Acceptance business, 284, 286 criteria by FDA, 280, 282 of expert systems, 362–363 procedure in LIMS, 298 Acceptance testing, 279 Access control, 332–333 Access layer, in knowledge management, 289 Access level, in scientific workspaces, 319 Access rights, See System access ACD/H-NMR software, See Advanced Chemistry Development/H-NMR software Acremonium chrysogenum, 228 Action applying knowledge to, 10 concise summary, 57 Activation energy, 213–214 ActiveX, 320–321, 325, 355 ADA programming language, 44 Adaptation of Kohonen neural networks, 108 of neuron weights, 104 Adjacency matrix as connectivity matrix, 62 concise summary, 112 Adjustment, of weights, 104–107 in counterpropagation neural networks, . 107 in Kohonen neural networks, 106 Administrative data, 299, 310, 342 concise summary, 354 Administrative layer, in knowledge management, 289 ADS, See Aion Development System Advanced Chemistry Development . (ACD)/H-NMR software, 201 Aflatoxin, albumin–adduct in liver, 171 Agent, See Software agent Aggregated data, 292 Agonist receptor, 217–218, 223 concise summary, 236 AI, See Artificial intelligence Aion Development System (ADS), 53 Albumins, 171–172
Algorithms for Radial Coding (ARC), 152–157 binary comparison, 155 code settings, 153 correlation matrices, 155 descriptor calculation, 154 multiple descriptors, 155 structure prediction, 182 ALL operator, 50 Alphanumeric fields, in user interface design, 352 American National Standards Institute (ANSI), 40 Amplified descriptor vs. attenuated descriptor, 127 concise summary, 163 Analysis vs. calculation and interpretation, 10 factor, 94 principal component, 87–91 Analytical chemistry applications, 208–216; See also Specific techniques Analytical method development, 209 Analytical workflow management (AWM), 303–305; See also Workflow management systems AND operator conditional, 13 in CLASSIC, 50 Anisotropic molecules, 199 ANN, See Artificial neural network Anonymous instance, See Skolem constant ANSI, See American National Standards Institute Antagonist receptor, 217–218, 221 concise summary, 236 Antiprogestins, 221 Any-atom (A), 69 Apex-3D expert system, 40, 251–254 API, See Application programming interface Application programming interface (API), 324 alternatives, 326 in NEXPERT OBJECT, 55 concise summary, 355 Approval; See also Electronic signature concise summary, 355 in software development life cycle, 286 Approximation filter, See Low-pass filter Aquatic toxicity assessment, 261 ARC, See Algorithms for Radial Coding Archive, See Data archive Aromatic pattern, 130 Aromaticity, 43; See also Hückel rule
367
5323X.indb 367
11/13/07 2:13:09 PM
368 Artificial dendrite, 105 Artificial descriptor, 3 advantages, 70 concise summary, 112 Artificial intelligence (AI) for scientific evaluation, 61 scientific goal, 9 Artificial neural network (ANN), 4–5, 102–109 concise summary, 112 in data mining, 337 investigation, 157 training, 155–156 for spectrum interpretation, 177–178 Artificial neuron, 104 ASD, See Average Set Descriptor Asian Development Bank, 266 ASSEMBLE software, 176 Assert command in rules, 44 in JESS, 48 Assignment statement, in rule interpretation, . 57 Association as human factor, 102 auto-, 337 in artificial neural networks, 103, 109 rate of, 219 ASSOCMIN module, 268 Assurance system, in LIMS, 300–301 Asymmetric encryption, 332 AT LEAST operator, 50 Atmospheric dispersion, 266 Atom-by-atom matching, 64 Atom list, 69 Atom matrix, 62 Atom polarizability tensor, 199 Atom-specific descriptor, 132 Atom type as descriptor, 73 modeling of, 189 Atomic absorption spectrometry (AAS) applications, 209–215 concise summary, 236 Atomic descriptors, See Local descriptors Atomic mass, 125 Atomic number, 73, 125 Atomic polarizability, 73, 222 Atomic properties, 125–128; See also Specific properties difference of, 147 dynamic, 126 product vs. average, 127–128 static, 125 weight, 146 Atomic radius, 125 Atomic static polarizability, 199 Atomic volume, 73, 125 Atomization mechanism, 209–210
5323X.indb 368
Expert Systems in Chemistry Research Atomization process, thermodynamical, 213 ATS, See Autocorrelation of a topological structure Attenuated descriptor vs. amplified descriptor, 127 concise summary, 163 Audit trail, 293, 329, 331 concise summary, 355 in electronic laboratory notebooks, 310 and intellectual property protection, 331 regulations, 280–281 in scientific data management systems, . 294 in workflow management systems, 304 Authentication, 294, 328 concise summary, 355 with public key cryptography, 332 AutoAssign expert system, 202 Autocorrelation of a topological structure (ATS) as descriptor, 75 concise summary, 112 Autocorrelation vectors as descriptors, 74 concise summary, 112 AutoDerek function 250; See also Deductive estimation of risk from existing knowledge Automatic test programs, in LIMS, 301 Average correlation coefficient, 196 Average descriptor, 194 Average descriptor deviation in statistics, 82 concise summary, 112 Average diversity in statistics, 82, 194 concise summary, 112 Average kurtosis, 196 Average properties, 126–127 Average Set Descriptor (ASD), 82, 141–142, 194, 196 Average skewness, 196 AWM, See Analytical workflow management Axon, 102
B BA, See Binding affinity BABYLON development environment, 53 Back-propagation neural network applications of, 73, 104 for structure elucidation, 178 Backward chaining, 22 concise summary, 31 in MYCIN, 174 rules in RTXPS, 263 Basis function in mathematical transforms, 97 radial (RBF), 78
11/13/07 2:13:09 PM
369
Index Bayes’ Theorem, 27 decision rule, 268 concise summary, 31 Bayesian network theory of, 27–28 concise summary, 31 Beer’s law of absorption, 215 Belief functions, See Dempster-Shafer theory Bidirectional port, 322 Binary descriptor database, 152, 157 Binary pattern concise summary, 164 as descriptor, 130 vs. frequency pattern, 131 Binary vectors as fragment representation, 75 in genetic algorithms, 206 Binding affinity (BA) in ligand-receptor interactions, 221 concise summary, 236 Bioaccumulation assessment, 261 Bioinformatics, 247 concise summary, 271 expert systems for, 247–256 and knowledge management, 287 laboratory information management for, 338–344 semantic networks, 16 Biological activity by radioligand binding experiments, . 217–218 prediction of, 251–254 Biological evolution, 110 Biological neuron, 103–104 Biometric access control, 332 BIONET suite, 248–249 Biophore, 252–254 concise summary, 272 pattern, 252 Biotransformation data management, 339–334 data storage, 344 prediction of, 251 visualization, 343 Bit strings, for prescreening, 66 Boeing Aerospace thermal bus system, 270 Boltzmann constant, 213 Bond cleavage, ranking of, 235 Bond frequency pattern, 134 Bond-path, 134 Bond-path descriptor, 133 vs. Cartesian descriptor, 136 concise summary, 163 Bond-path distance matrix, 62 Bond sphere, 134, 236 Bond type as descriptor, 73 for query specification, 69
5323X.indb 369
Boolean logic, 26, 56 value (bool), 13 Brain, See Human brain BRE, See Business Rules Expert Brookhaven Protein Database, 152 Business acceptance, 284, 286 Business intelligence process, 292 Business logic in generated user dialogs, 351 logical programming of, 39 in knowledge representation, 49 in Rule Interpreter, 57 separation from data, 18, 231, 351 Business requirements, 284 Business rules engine, 48 Business Rules Expert (BRE), 53–54 Butterfly conformer, 137
C C language integrated production system (CLIPS), 43–47 concise summary, 57 facts, 44–45 rules, 45–46 C, programming language, 38, 44 C#, programming language, 38 C++, programming language, 38 code example, 13, 37 in rule-based programming, 12 C-a pattern recognition algorithm (CAPRA), 255–256 Calcium agonists, 223 CALCMIX module, 269 Calculation vs. analysis and interpretation, 10 in ARC software, 153–155 of bond-path distances, 134 chemical equilibrium, 271 of confidence intervals, 86 with connectivity matrices, 62 of descriptors, 62, 337 of inaccurate data, 26; See also Fuzzy logic of reactions, See Stoichiometry of reactivity, 233 of secondary data, 337 of shock tube parameters, 271 in user input verification, 351, 353–354 Calibration, 215–217 algorithm for XRF, 216 in good laboratory practices, 278 graph, 219 in laboratory information management, 299, 301 in root cause analysis, 345 Cambridge Crystallographic Data Centre, 61 CAPA, See Corrective and preventive actions
11/13/07 2:13:10 PM
370 CAPRA, See C-a pattern recognition algorithm Carbides in atomic absorption spectroscopy formation of, 210 dissociation of, 211 Carboquinones, as anticarcinogenic drugs, 73 Carcinogenic effect, prediction of, 73 Carnot refrigerator, 269 CARS, See Coherent anti-Stokes Raman scattering Cartesian coordinates, 62 concise summary, 163 conversion to distance, 133 vector transformation, 76 Cartesian descriptor, 163 Cartesian distance, 62 vs. bond-path distance, 136 in radial distribution functions, 127 in statistical evaluation, 141–142, 144 Cartridge, See Data cartridge CAS, See Chemical abstracts service CASE, See Computer-assisted structure elucidation Case-based reasoning (CBR), 22–24 concise summary, 31 in TEXTAL, 255 in Xpert Rule, 55 Cause analysis, 347 Cause and effect, 15, 35 CBG, See Corticosteroid-binding globulin CBR, See Case-based reasoning CCI, See Computational crystallography initiative CEA, See Chemical equilibrium with applications Centered moments of distribution, 81 concise summary, 112 Centering of data, 89 Central neuron, 106–107 Centralized knowledge base, 18–19 Cephalosporin, 228 Cephalosporium acremonium, 228 Certainty factor (CF), 24–25 concise summary, 31 in MYCIN, 173–174 Certificate of analysis (CoA), 300 CF, See Certainty factor CFR, See Code of Federal Regulations CFT, See Continuous Fourier transform Chance and Necessity (Jaques Monod), 7 Change control, 279, 346, 348 Chapman-Jouguet detonation, See Detonations Characteristic frequency approach, 176–177 Charge distribution, 72, 128, 145, 227 Charge-weighted descriptor, 222 Checkbox fields, in user interface design, 352 Chemical abstracts service (CAS), 20, 60, 258 Chemical equilibrium with applications (CEA), 270–271
5323X.indb 370
Expert Systems in Chemistry Research Chemical information system, 64 Chemical shift prediction, 207; See also . 1H-NMR spectroscopy Chemical structures, See Structures CHEMICS software, 176 Cheminformatics, 61, 361 concise summary, 272 Chemistry cartridge, See Data cartridge ChemLisp, 40, 252, 254 ChemNMR software, 201 Chemometrics, 2 Chiral Synthon (CHIRON), 230 CHIRON, See Chiral synthon Chromosome binary, 111, 205 concise summary, 112 damage prediction, 250 in genetic algorithms, 110, 206 CL, See Common LISP Class in CPG neural networks, 108 of exceptions, 345 formation by data extension, 192 in programming, 47–48 in knowledge management, 51–52 type in ARC, 153 vector, 153–154 Class-object hierarchies, 55 CLASSIC, See Classification of individuals and concepts Classification, 109, 156 of amines, 193 in ARC software, 157 of biological activity, 223 concise summary, 112 conflicts in, 192 in CPG neural networks, 108–109 in data mining, 337 of exceptions, 346 of hazards, 258, 263 of infrared spectra, 177 of knowledge, 288 of molecules, 125 of phosphorus compounds, 193 of ring systems, 191 with wavelet transforms, 97, 148, 198 Classification of individuals and concepts (CLASSIC), 49 Classification sheet, 108 CleverPath, See Business Rules Expert Client-server architecture, 320 concise summary, 355 Client-server communication, 320 CLIPS, See C language integrated production system CLIPS object-oriented language (COOL), . 43 CMP file format, 56
11/13/07 2:13:10 PM
Index C-NMR prediction of structures, 176 prediction of chemical shifts, 201 CoA, See Certificate of analysis Coarse coefficients, 100, 102, 148 Coarse filter, See Low-pass filter COCOA software, 177 Code of Federal Regulations (CFR), 277–281 21 CFR Part 11, 280–281, 354 21 CFR Part 58 (GLP), 278–279, 290, 306 21 CFR Part 210 and 211 (GMP), 278 concise summary, 354, 356 Coefficient of determination (R 2), 83 Coherent anti-Stokes Raman scattering (CARS) thermometry, 212 concise summary, 236 COM, See Component object model COM port, 323 Combinatorial chemistry, 61, 111, 162, 231 COMBINE software, 176 Comma separated value (CSV) file format, 321, 326 Common LISP (CL), 40, 53; See also List processing Common variance, 94 Complaints management, 297, 348–349 Compliance, See Code of Federal Regulations Compliant-ready solutions, 281, 294 Component, principal, 87; See also Principal component analysis (PCA) Component object model (COM), 321 concise summary, 355 Compound statement, in rule interpretation, 57 Comprehensive nuclear-test-ban treaty (CTBT), 209 Compression of descriptors, 97 Fourier, 179 Hadamard, 179 of images, 87 of patterns, 87 of spectra, 96 by transformation, 95 wavelet, 151, 198 Computational chemistry, 3, 25, 70 module in Apex-3D, 252 transforms in, 94 Computational crystallography initiative (CCI), 256 Computer assisted structure elucidation (CASE), 176 concise summary, 236 Concepts abstract, 41 in CLASSIC, 50 generic, 6 graph of, 15 in KL-ONE, 49 13
5323X.indb 371
371 and relations, 5 in semantic networks, 14 Concise summary, 3 Condensation effects in graphite tubes, 211 Conditional behavior, 1 Conditional statements, in rule interpretation, 57; See also Rules Confidence interval, 86 Conflicts in data, 292 in rules, 49 rules for handling, 29 resolution strategies, 21 in topological maps, 192, 215 Conformation with Cartesian matrices, 62, 76 vs. constitution, 135 effects on structure prediction, 189–190 in statistical evaluation, 140–145 Conformational flexibility, 71, 134 descriptor independence, 76, 134, 197 during receptor binding, 223 of protein structures, 89 CONGEN, See Constrained generation software Connection table, 61–62, 336 as graph, 65 concise summary, 112 in DENDRAL, 170 for substructure searches, 64 Connectivity graph, 75 Connectivity index, 74 concise summary, 112 Connectivity matrix, 62 Connectors in artificial neural networks, 105 in biotransformations, 341–343 in semantic networks (relations), 15 Constitution and conformation, 135–136 and local descriptors, 139–140 and molecular descriptors, 136–138 statistical evaluation, 140–145 Constitutional descriptor, 73–74 concise summary, 116 Constrained generation (CONGEN) software, 168, 176 Constraint-based expert system, 202 Constraints in DENDRAL, 168, 171 of filter coefficients , 98 in KL-ONE, 51 in Meteor, 251 in SYNSUP-MB, 231 in WODCA, 234 Constructive synthesis approach, 231 Contamination control engineering design guidelines expert system, 55 Content, in electronic record, 293
11/13/07 2:13:10 PM
372 Contergan, See Thalidomide Context in electronic records, 293 generalization of, 5 in knowledge management, 51 in MYCIN, 173 relevance for data, 10–11 Context sensitive help, 350, 352 Context sensitive rules, 262 Continuous Fourier transform (CFT), 95 Control and information, 7 in quality assurance, 297 real-time, 269 Control charts, See Quality regulation charts Control management, 278–281 Controlling complaints management, 348–349 contamination, 56 dynamic systems, 269 environmental, 256 of inference strategy, 30, 36, 263 instrument, 320, 322–323 laboratory workflow, 301 software, 235 Controlling system, in LIMS, 300 Conventional programming languages, 35, 37 Converter, See File converter COOL, See CLIPS object-oriented language Coplanarity, 13, 42–43 CORINA software, 76, 182, 190 Corrective actions, 347–348 Corrective and preventive actions (CAPA), . 345 Correlation, 82 concise summary, 112 as requirement for principal component analysis, 88–89 significance of, 83–86 as similarity measure, 181 Correlation approach, 194–195 Correlation coefficient (R), 83 Correlation matrices in ARC, 155 in factor analysis, 94 Corticosteroid-binding globulin (CBG), 224 concise summary, 237 Coulomb theory, 126 Counterpropagation (CPG) neural network, 107–108 for biological activity prediction, 221–223 for chemical shift prediction, 205–206 concise summary, 112 for prediction of interelement coefficients, 217–218 for structure/spectrum correlation, 177, 178–179 for structure prediction, 184
5323X.indb 372
Expert Systems in Chemistry Research Covalent radius, 125 Covariance, 80 Covariance matrix, 80 calculation of, 89–90 CPG, See Counterpropagation neural network Cross-validation, 206 Crossed conformer, 137 Crossover mutation, 110–111, 206 Cryptographic hash function, 333 CRYSALIS project, 255 Crystallography, 254–256 CSV, See Comma separated value (CSV) file format CT, See Connection table CTBT, See Comprehensive nuclear-test-ban treaty Currency fields, in user interface design, 352 Cyclic conjugation, 13, 14
D Danielson-Lanczos formula, 96 DARC-EPIOS software, 176 Data aggregated, 292 archived, 292 associated, 342 concise summary, 31 context of, 10 granularity of, 291 vs. information and knowledge, 10–11 meta-, See Metadata nonvolatile, 290 operational, 292 as record, See Electronic record separation from business logic, 18, 231, 351 summarized, 292 witnessing, 294 Data capture, 325 Data cartridge, 334–335 concise summary, 355 Data cleansing, 292 Data compression concise summary, 237 by principal component analysis, 87 by transformation, 95–96, 179 Data conversion streaming, 326 libraries for, 344 Data–information–knowledge cycle, 229 Data input tracking of, 279 verification of, 350 Data integrity, 279 Data load manager, 291 Data management system, See Scientific data management system Data mining, 291, 293, 335–338
11/13/07 2:13:11 PM
Index Data reduction, by transformation, 198; See also Data compression Data staging area, 291 Data warehouses, 290–293 architecture of, 219–292 definition of, 290 extract, transform, and load (ETL) process, 291 features of, 290 presentation server, 291 Database approach, in structure prediction, . 181 vs. modeling approach, 189–190 concise summary, 237 Date fields, in user interface design, 352 Date/time stamp, 280 Daubechies, Ingrid, 97 Daubechies transform, See Wavelet transform Daubechies wavelets, 99–100 shape of, 100 concise summary, 117 DBN, See Dynamic Bayesian network de Jongh correction model, 216–217 Decision as case table, 55 concept of, 5–6 criteria in regulations, 279 documentation for, 277 explanation of, 18 human, 1 as logic diagrams, 54 rule derivation from, 29 Decision graph, 28; See also Decision trees Decision making (decisioning), 55 in analytical chemistry, 208 in crystallography, 256 in data warehousing, 290 in environmental chemistry, 265 in knowledge management, . 287–288 in medical diagnostics, 173 Decision-node-sharing technique, 22 Decision nodes, See Decision trees Decision rule, See Bayes’ Theorem Decision table, 54 Decision trees in development environments, 55, 57 in medical diagnostics, 171 for structuring knowledge, 30 Decision support, See Decision making Declarative programming, 20, 39 concise summary, 57 with frames, 18 vs. imperative programming, 37 in JESS, 41 in KM, 51 in KRL, 49 vs. logical programming, 39
5323X.indb 373
373 programming languages for, 29–30 in PROLOG, 41 with rules, 18 Decryption, 331 concise summary, 355 Deductive estimation of risk from existing knowledge (DEREK), 249–250 Deductive inference, 252–253 Defeasible reasoning, 55 Defrule command, 45 Defuzzification, 26; See also Fuzzy logic Degradative synthesis approach, 231 Degree of common variation, 90 Degrees of belief, 28 Delta document, 286 Dempster-Shafer theory, 28 concise summary, 31 DENDRAL project, 167–171 Dendrite, 102 Dendritic Algorithm, See DENDRAL project Department in user administration, 329 concise summary, 355 DEREK, See Deductive estimation of risk from existing knowledge Descriptive programming, See Declarative programming Descriptive statistics, 78–86 Descriptor, 3 artificial vs. experimental, 3, 70 decoding of, 76 for fact representation, 262–263 improvement for structure prediction, . 188 interpretation of, 117 molecular, See Molecular descriptors prerequisites for, 76 selection of, 78 Descriptor center concept, 253 Descriptor database, 182 Descriptor/Descriptor correlation to average set descriptor, 141, 144 concise summary, 113 Descriptor elucidation, computer-assisted, 79 Descriptor/Property correlation for molecular descriptors, 72 concise summary, 113 Design review, 284 Designing safer chemicals module, 260 Detail coefficients, 103 Detail filter, See High-pass filter Detonations, equilibrium compositions for, 271 Developer test, See Operational test Development, See Software development DFT, See Discrete Fourier transform DHT, See Discrete Hadamard transform Diagnosis, 171 Diagnostics, See Medical diagnostics
11/13/07 2:13:11 PM
374 Diagonal matrix of eigenvalues, 81 in singular value decomposition, 91 Diagonalization, 91 Diastereotopic protons, 201 Difference matrix, 147 Difference indexing, in case-based reasoning, 23 Diffraction in X-ray fluorescence, 216 pattern, 255 Diffusion effects, in graphite tubes, 211 Digital signature; See also Electronic signature concise summary, 355 Dihydrotestosterone, 221 Dilation equation, 98; See also Wavelet transforms DIPMETER expert system, 267 Dipole moment, 199 Dirac delta function, 97 Directed acyclic graphs, 21 Disconnection strategies, 235 Discrete Fourier transform (DFT), 96 Discrete Hadamard transform (DHT), 96 Discrete learning time coordinate, 109 Discrete wavelet transform (DWT), 97–98 Discrimination of isomers, 73 of atomic distances, 122 of compound types, 193 of molecules, 145 significance of, in infrared spectra, 181 Discrimination diagrams, in geology, 268 Dissociation constant, 220 Dissociation energy, 211 Distance dimension, 120 Distance mode, in ARC software, 153 Distance matrix, 62, 113 Distance pattern, 129, 164 Distomer, 227 Distribution centered moments of, 81 distance, 78 flatness of, 85 frequency, 120 Gaussian, 119 intensity, in electron diffraction, 77 probability, 28 property, on molecular surfaces, 75 radial, See Radial distribution function shapes of, 85 spatial, 227 symmetry of, 84 Diversity average, 82 concise summary, 113 comparison of data sets, 195–198 molecular, 193–199 of descriptors, 194–199
5323X.indb 374
Expert Systems in Chemistry Research vs. similarity, 162 statistical evaluation, 196 of transformed descriptors, 197 Division, in user administration, See Department Document repository, 289 Document templates, 309–310 Document versioning, 308 Documents drawbacks of file-based storage, 313 FDA guidelines, 278–279 rules for, 305–306 scientific, See Electronic scientific document Drools, See JBoss Rules Dropdown fields, in user interface design, 352 DWT, See Discrete wavelet transform Dyadic dilation, 98 Dyadic vector, 96 DYNACLIPS, See Dynamic CLIPS utilities Dynamic atomic properties, 126 concise summary, 163 vs. static atomic properties, 125 Dynamic Bayesian network (DBN), 28 Dynamic CLIPS utilities, 47 Dynamic data exchange, 320 Dynamic emergency management, 262
E Effective concentration (EC), 219 prediction of, 221 concise summary, 237 Efficiency checks, in exception management, 348 EIA, See Environmental impact assessment program EIAxpert system, 266 Eigenvalue criterion, in factor analysis, 94 Eigenvalue spectrum, 81 Eigenvalues, 80 of electron density distribution, 256 in principal component analysis, 90–91 Eigenvector matrix, 81 Eigenvectors, 80–81 in principal component analysis, 87, 90–91 reduced, 94 EJB, See Enterprise JavaBean Elaboration of reactions for organic synthesis (EROS), 232–234 Electron density, in graphite tubes, 211 Electron density distribution of, 255–256 Electron density map, 255 Electron diffraction, for descriptors, 76 Electron distribution, 120 Electronegativity, as descriptor, 73, 205 Electronic laboratory notebook (ELN), . 305–312 advantages, 306 concise summary, 355 data transfer to (inbox concept), 327–328
Index metadata concept, 308 optional tools, 311–312 regulations, 307–308 for reporting, 310 rules for, 305–306 search capabilities, 334 time-saving aspects, 306–307 Electronic records access rights, 329 concise summary, 355 definition of, 280 elements of, 293 Electronic scientific documents, 307–309 concise summary, 355 vs. electronic records, 331 report types, 310 vs. scientific workspaces, 319 states of, 308 templates for, 309 versioning of, 308 Electronic signatures applicability of, 330–331 concise summary, 355 regulations, 280–281 request for, 330, 331 requirements, in scientific data management, 294 workflow for, 329–331 Electropherogram, 172–173 Electrostatic potential, 227 concise summary, 237 Elimination, in synthesis planning, 235 ELN, See Electronic laboratory notebook EMBL, See European Molecular Biology Laboratory Emergency management, 262 Emergency Planning and Community Right-to Know Act (EPCRA), 259 Empirical models, 102 EMYCIN program, 175 Enantiomerism, 226 concise summary, 237 pharmaceutical effect, 226–227 in search queries, 67 Encryption concise summary, 355 of electronic signatures, 294, 331 Energy of activation, 213 Engineering, applications, 269–271 Enterprise JavaBean (EJB), 48 Enthalpy of activation, 213 of formation, 211 of vaporization, 214 Entropy of activation, 213 Enumeration, of atoms 64 Environmental chemistry, 257–267 hazard assessment, 262
375 impact assessment, 265–267 risk assessment, 346 Environmental impact assessment (EIA) program, 265 Environmental management, 265–266 EPA, See U.S. Environmental Protection Agency EPCRA, See Emergency Planning and Community Right-to Know Act Epoch in neural network training, 107 in counterpropagation neural networks, 110 Equilibrium in gas phase reactions, 212–213 in ligand-receptor interaction, 219 Equilibrium composition, 270–271 Equilibrium dissociation constant, 220 EROS, See Elaboration of reactions for organic synthesis ESCORT, See Expert system for the characterization of rock types ETL, See Extract, transform, and load process Euclidean distance, 62, 108 Euclidian L 2-Norm concise summary, 163 for descriptor normalization, 124 program code example for, 37–38 European Molecular Biology Laboratory (EMBL), 248 Eutomer, 227 Event messaging, 331 Evidence, theory of, See Dempster-Shafer theory Evolution, See Biological evolution Examiner module, in user input verification, . 353 Exception management, 344–348 efficiency checks, 348 recording, 346 root causes, 345 Exception management system for geological exploration, 267 for workflow deviations, 304 Exclusive mode in atom-specific descriptors, 133 in ARC software, 153 Expansion effects, in graphite tubes, 211 Experience, 5 Experimental data empirical models from, 103, 105 fuzziness of, 5, 24–26 noise in, 89 prediction of, 72 Experimental descriptor vs. artificial descriptor, 3, 70 concise summary, 117 Expert domain, 4 subject matter, 11, 32 resistance factor of, 365
376 Expert system for the characterization of rock types (ESCORT), 268 Expert system for the interpretation of infrared spectra (EXPIRS), 176 Expert system shells, 4, 35; See also Individual software concise summary, 31 vs. conventional programming, 38 as development tools, 31 Expert systems application of, 2 basic concepts of, 8 conceptual design of, 10 constraint-based, 202 critical considerations for, 362–365 commercial factors of, 365 definition of, 361–362 development tools for, 35 education in, 363–364 frame-based, 50 future of, 365–366 fuzzy, 26 hybrid, 265 vs. knowledge-based systems, 12 real-time, 262, 269 resistance to, 363 vs. rule-based systems, 12 technical design of, 35–37 usability of, 364 Exploration, 267 Explosions models, 266; See also Detonations EXSYS Professional expert system shell, 54 Extensible markup language (XML) concise summary, 358 in Gene ontology project, 16 for rules in Java expert system shell, 48 for software interoperability, 320–321 in user interfaces design, 351 Extract, transform, and load (ETL) process, . 292
F FA, See Factor analysis Face recognition, 4 Factor analysis (FA), 94 Facts, 4 in CLIPS, 44–45 forward chaining mechanism, 22 inference mechanism, 21 in JBoss Rules, 48–49 in JESS, 47 in PROLOG, 41 in RTXPS (descriptors), 262 in rule-based systems, 12–13 in working memory, 35–36 Fast Fourier transform (FFT), 96, 164; See also Fourier transform
Expert Systems in Chemistry Research Fast Hadamard transform (FHT), 97; See also Hadamard transform Fast wavelet transform (FWT), 97, 117; See also Wavelet transform Father wavelet, See Scaling function FDA, See U.S. Food and Drug Administration Feature Vector, 89 Feedforward network, 104, 255 Fetal abnormity, 226 FFT, See Fast Fourier transform FHT, See Fast Hadamard transform Field effect, as descriptor, 73 Fields, in user interfaces, 353 File converter; See also Data conversion in software agents, 326 concise summary, 355 File sharing, 313 Files vs. database, 313 FILLS operator, 50 Filter bank, 98 Filtering, with transforms, 95 Final model, 187–188 Finite-area combustion chambers, 271 Fitness function, 110–111 concise summary, 113 Flatness, of distribution, See Kurtosis Fluorescence, See X-ray fluorescence spectroscopy For-statement, 13 Force-directed layout, 341 Form-based document template, 309 Form factors, in electron diffraction, 76 Formula, See Molecular formula Formula translator (FORTRAN) language, 40 Forward chaining, 22 concise summary, 31 real-time system, 265 rules in RTXPS, 263–264 Fourier coefficients, 92, 179 Fourier matrix, 96 Fourier transform, 92, 96 discrete, 95 fast, 179 for compression of spectra, 178 concise summary, 117 of electron density maps, 255 vs. wavelet transform, 97 Fragment, See Molecular fragments Fragment-based coding, 75 concise summary, 113 in structure search, 66 Fragment reduced to an environment that is limited (FREL), 75 Fragmentation codes, 60 Frames, 16–17 in Apex-3D, 252, 254 code example, 16 concise summary, 31
Index in KM, 50 representation, 17 Free activation enthalpy, 213 Free-Wilson descriptor, 73 FREL, See Fragment reduced to an environment that is limited Frequency, 94 Frequency dimension, 120, 125 Frequency pattern bond-, 134 vs. binary pattern, 131 concise summary, 164 as descriptor, 129 from 3D-MoRSE code, 120 Frequency space, 94 Fronting, in standard distribution, 84 Functional genomics in bioinformatics, 247 concise summary, 272 Functional programming, 37 Functional specification, 284, 285 Functor, in PROLOG, 41 Fuzziness; See also Fuzzy logic in experimental data, 11, 24 in human perception, 5–6 in pattern descriptors, 129 as prerequisite for interpolation, 122 Fuzzy logic, 25–26 FuzzyCLIPS, 46–47 FWT, See Fast wavelet transform
G GA, See Genetic algorithm GALP, See Good automated laboratory practices Gamma spectroscopy, 206–209 Gas-phase dissociation, 210 Gaussian density, in hidden Markov models, 26 Gaussian distribution, 84 Gaussian function, 109 GCES (Green chemistry expert system), See Green chemistry program GEAR algorithm, 233 GEN software, 177 GenBank sequence database, 248 Gene Ontology project, 15 Generalization concept of, 6 in artificial neural networks, 103 Generation with overlapping atoms (GENOA), 168, 176 Generic expert system tool (GEST), 53 Generic inbox concept as software interface, 327–328 concise summary, 356 Generic model organism project (GMOD), 15 Generic structure concise summary, 356
for fragment metadata, 341 for incomplete molecules, 312 GENET suite, 248 Genetic algorithm (GA), 113–115 concise summary, 113 in data mining, 337 for descriptor selection, 205–206 Genetic operator in genetic algorithms, 110 concise summary, 113 GENOA, See Generation with overlapping atoms Genomics, See Functional genomics Geochemistry applications, 267–269 Geographic information system (GIS), 265 Geological exploration, 267 Geometric descriptors, 203–205 Geometric mean, 83 GEST, See Generic expert system tool GIS, See Geographic information system Global warming potential, 261 Globulin fraction, in serum proteins, 172 GLP, See Good laboratory practices Glucocorticoids, 224 GMOD, See Generic model organism project GMP, See Good manufacturing practices Good automated laboratory practices (GALP), 279–280 concise summary, 356 six principles of, 279 quality assurance unit for, 297 Good laboratory practices (GLP), 278–279 concise summary, 356 for electronic laboratory notebooks, 307–308 for scientific workspaces, 319 Good manufacturing practices (GMP), 278 Graph isomorphism, 64–65, 113 Graphite tubes, See Heated graphite atomizers Green chemistry program, 257 expert system (GCES), 257–262 references, 261–262 solvents, 261 synthetic reactions, 259–260 GxP, 278; See also Specific good practices
H Hadamard transform, 96 for compression of spectra, 178, 180 examples, 185–186 Halide dissociation, 210 Hash-coding, 66 concise summary, 118 of electronic signatures, 331–332 Hash collision, 66 Hashing, See Hash-coding Hazard categorization, 259 Heated Graphite Atomizers (HGA); See also Atomic absorption spectrometry
378 in atomic absorption spectrometry, 209 construction of, 210 temperature profile in, 212 Heuristic DENDRAL, See DENDRAL project Heuristic programming project, 248 HGA, See Heated Graphite Atomizer Hidden layer, 104 Hidden Markov model (HMM), 26–27, 31 Hierarchical environment for integrated crystallography, 256 Hierarchical organization of spherical environments (HOSE), 256 Hierarchical search, in case-based reasoning, 24; See also Decision trees Hierarchy dynamic tree-based, 316–317 frame-based, in KL-ONE, 49 for generalization, 6 in knowledge engineering, 30 in knowledge management, 288 of production rules, 255 in semantic networks, 15 static file-based, 313 High-pass filter, 98 High production volume (HPV) challenge program, 250 HITERM research project, 265 HMM, See Hidden Markov model 1H-NMR spectra calculation of, 201 local descriptors for, 132, 202–205 prediction of chemical shifts for, 111, 207–208 selection of descriptors for, 205–206 HOSE, See Hierarchical organization of spherical environments HPV, See High production volume (HPV) challenge program HTML, See Hypertext markup language HTTP, See Hypertext transfer protocol Hückel rule, 13 Human brain as pattern matching engine, 4–5 information processing in, 102–103 Human resources, in knowledge management, 289 Hydrolysis, in synthesis planning, 235 Hydrophobicity, as descriptor, 73 Hydroxylation, as detoxification process, . 339 HyperNMR, 202 Hyperstructure, 64 Hypertext markup language (HTML), 356 Hypertext transfer protocol (HTTP), 324
I IC50, See Inhibitory concentration
Expert Systems in Chemistry Research IDE, See Integrated development environment Identity search, 64–65 If–then statement concise summary, 31 conditional, 5 deficiencies, 19, 21, 29 in MYCIN, 173 in PROLOG, 42 in rule-based systems, 12–13 in Rule Interpreter software, 57 in vibrational spectroscopy, 176 Ignore mode in atom-specific descriptors, 132 in ARC software, 153 Imperative programming, 10 concise summary, 57 vs. declarative programming, 37–39 example of, 37–38 programming languages for, 38 Implementation, in software development, 282; See also Programming Implementation plan, 284 Improvement, See Optimization Inaccuracy of data, 26 Inbox, See Generic inbox concept Incoherent scatter, 216 Individuals in CLASSIC, 49 in Genetic algorithms, 114 INDVCLAY module, 268 Inference deductive, 253 inductive, 252 probability, 27 procedural, 51 symbolic, See Reasoning Inference chain, 270 Inference control, 36 Inference engine, 21 concise summary, 31 deductive, 252 as module in expert systems, 36 in RTXPS, 264–265 Inference tree, 264 Infinite-area combustion chamber, 271 Information aspect for defining life forms, 7 conceptual, 49 concise summary, 31 vs. data and knowledge, 10–11 uncertain, See Certainty factors unification of inherited, 53 Information management in knowledge management, 289 in data warehousing, 290 in scientific data management, 293 in laboratories, 296 in workflow management, 303
Index in electronic documentation, 306 in scientific research, 313–315 Information modeling, 35 Information processing, in human brain, 102 Information reduction by mapping, 109 by scaling, 124 by transformation, 179 Information search, in data management systems, 334 Infrared spectrum (IS) as descriptors, 3–4 for spectrum/structure correlation, 75, 176–177 fuzziness of, 25–26 simulation of, 78, 178–179 structure prediction from, 78, 179–190 Inheritance in semantic networks, 15 multiple, 53 Inhibitory concentration (IC), 219–220, 237 Initial models in structure prediction, 183–184 alteration of, 187–188 diversity of, 189–190 Initial phase, in atomization process, 212 Interpretation of standard operating procedures, 350 of spectra, 177 vs. analysis and calculation, 10 Input layer, in artificial neural networks, . 108 Input vectors in artificial neural networks, 105–106 similarity of, 181 Installation qualification (IQ), 284 Instrument interface serial device, 322–323 regulations for, 232 Integrated development environment (IDE), 48; See also Expert system shells Intellectual property (IP) protection, 313 Intelligence artificial, See Artificial intelligence business, See Business intelligence process understanding of, 9 Intelligence system, in knowledge management, 289 Intelligent agent, See Software agent Intelligent tutoring system (ITS), 364 Interelement effects, in calibration, 215 Interelement coefficients, prediction of, 217–218, 219 Interface concise summary, 356 definition of, 320 designer, See User interface designer instrument, See Instrument interface
software, See Software interface natural language, 41 user, See User interface Interface ports, 322 Interface programming, See Application programming interface International Space Station, 270 International Union of Pure and Applied Chemistry (IUPAC), 178 Interoperability, See Software interface Interpolation in artificial neural networks, 122–123, . 135 effects on correlation coefficient, 186 prediction of new properties by, 200 Interpretation automated text, 349 in analytical chemistry, 237 of data, 9–11 of descriptors, 72, 125, 135 of electropherograms, 173 of electron diffraction patterns, 255–256 of genomic data, 339 of geochemical data, 268 of geophysical data, 267 of rules, 20; See also Inference of spectra, 75, 175–177 Interrogation process in environmental assessment, 260–261 in medical diagnostics, 171, 173–175 SMART, 258 Inverse transform, 96, 98; See also Mathematical transform Ionization potential, as descriptor, 125 IP, See Intellectual property IQ, See Installation qualification IR, See Infrared spectrum Irritancy, prediction of, 250 Isomer discrimination, 73, 113 Isomer generation, 168, 176 Isomer selective search, 67 Isomerism constitutional, 135 cis-trans, 205 effects of, 221, 226–227 stereo-, 136–137, 139–140 vs. tautomerism, 67–68 Isomorphism algorithms, 65–66, 113 Isotopes, identification of radioactive, 209 Isotropic molecules, 199 ITS, See Intelligent tutoring system IUPAC, See International Union of Pure and Applied Chemistry
J Java expert system shell (JESS), 47 Java programming language, 38
JBoss Rules, 48–49 JCAMP-DX file format, 178, 309 JESS, See Java expert system shell Jochum-Gasteiger canonical renumbering, 71 Justification, in knowledge management, 51
K k-nearest neighbor method, 176, 255 Karplus relationships, 201 KE, See Knowledge engineering KEE, See Knowledge engineering environment Kier and Hall index, 74, 118; See also Connectivity index Kinetic modeling, 233 King Kong effect, 83–84 KL-ONE frame-based system, 49 KM, See Knowledge machine KMS, See Knowledge management system Knowledge, 10 application to action, 1 concise summary, 32 domain, 364 vs. data and information, 10–11 from experimental data, 105 and experience, 9 implicit, 289 strategic, 364 Knowledge acquisition, 32 Knowledge base, 7 in expert system design, 35–36 centralized, 18–19 concise summary, 58 Knowledge base-oriented system for synthesis planning (KOSP), 231 Knowledge-based systems, See Expert systems Knowledge builder, See Xpert Rule Knowledge center, 287–288 Knowledge distribution, See Knowledge management Knowledge engineering (KE), 29–30, 32, 44 Knowledge engineering environment (KEE), 54 Knowledge induction, 32 Knowledge machine (KM), 51–53 Knowledge management, 287–288 Knowledge management system (KMS), . 288–289 Knowledge pyramid, 10–11 Knowledge quality management team, 290 Knowledge representation concise summary, 32 fuzzy, 26 by frames, 16–17 object-oriented, 50, 53 by procedures, 53 by rules, 12, 32 by semantic networks, 14–15 Knowledge representation language (KRL), 49
Expert Systems in Chemistry Research Kohonen neural network, 105–107 concise summary, 114 multilayer, 108, 152 for prediction of atomization mechanisms, 214 for prediction of effective concentration, 223–226 for similarity search, 191 for structure prediction, 179–180 for surface mapping, 228–229 Kohonen map, See Topological map KOSP, See Knowledge base-oriented system for synthesis planning KRL, See Knowledge representation language Kronecker delta, 75 Kurtosis, 83–86 concise summary, 114 as statistical descriptor, 256
L L 2 norm, 37, 124, 163 b-Lactam antibiotics, See Cephalosporin Laboratory, in user administration, See Department Laboratory information management systems (LIMS), 295–302 assurance system in, 301 basic functions of, 298 benefits of, 297 compliance requirements for, 297 concise summary, 356 as controlling system, 300 history of, 295 optional modules in, 301 planning system in, 299 relation to GALP, 279 for sample organization, 299 vs. scientific data management systems, 298 Laboratory notebooks; See also Electronic laboratory notebooks rules for, 305–306 vs. electronic laboratory notebooks, 327 Laboratory workflow, cost calculation for, 300 Laboratory workflow management system (LWMS), 302–303, 356; See also Workflow management systems Laminar flow tube reactor (LFTR), 232 Languages, See Programming languages Law of mass action, 219 Layered-digraph layout, 341 LC, See Liquid chromatography Leaps algorithm, 48 Learning from examples, 103 Learning rate and radius, in artificial neural networks, 106–108 Leave-one-out technique, 87, 222 Lebesgue integral, 124
Index Leptokurtic distribution, 85, 195; See also Kurtosis LFTR, See Laminar flow tube reactor LHASA, See Logic and heuristics applied to synthetic analysis Library design, 111 Life, definition of, 6–7 Life cycle, See Software development life cycle Ligand-receptor interaction, 218–219, 253 Likelihood, See Probability Limiting reactant, 311 LIMS, See Laboratory information management systems Linear combination of variables, 84, 89 of wavelets, 98 Linear notation, 63, 114 Linear regression for calibration, 217 limitations of, 85–86 lines, 83 Linear scaling, drawbacks of, 124–125 Linear transformation, 87, 92 Lipophilicity, 53, 224, 251 Liquid chromatography (LC), 251 List processing (LISP), 40, 58 Local analysis, by wavelet transforms, 97 Local descriptors, 132 concise summary, 164 for NMR spectroscopy, 202–205 Log P, See Lipophilicity Logic assessment for rules, 29 business, See Business logic fuzzy, See Fuzzy logic separation from rules, 13–14, 18 Logic and heuristics applied to synthetic analysis (LHASA), 230, 249 Logical programming, 39; See also Programming and logic Loom knowledge representation language, 51 Loop, in programming, 38 Low-pass filter, 98 LWMS, See Laboratory workflow management system
M Maintenance of equipment, 278 of expert systems, 362, 365 of instrument, 299 preventive, 345 of software, 283 Mallat algorithm, 99, 114 Mammalian brain, See Brain MAP module, 248 Mapping, in CPG neural networks, 103, 109
381 Markov chain, 27 Markov model, See Hidden Markov models Markush structure, 312 concise summary, 356 Mass absorption coefficients, 217 concise summary, 238 Mass spectra simulator (MASSIMO), 231–232 Mass spectrometry (MS) in metabonomics, 251, 339 time-of-flight (TOF), 251 MASSIMO, See Mass spectra simulator Mathematical descriptor, See Descriptor Mathematical transforms, 94–102 Matrix fields, in user interfaces, 353 Matrix of loadings, in singular value decomposition, 91 Maximum common subgraph isomorphism, 65, 114 Maximum norm, 124 MCF-7 cells, 221 MDI, See Multiple document interface Mean, statistical, 256 Mean molecular polarizability, 199, 238; See also Molecular polarizability Medical diagnostics, 171–172 Memory organization packet (MOP), 23 MEP, See Molecular electrostatic potential Messaging, 331, 356 Metabolic pathways; See also Biotransformation comparison, 342–343 management, 339 prediction of, 251 storage of, 344 visualization, 251, 343 Metabolites, mapping of, 343 Metabolomics, 247, 272 Metabonomics, 247, 272 Meta-DENDRAL, See DENDRAL project Meta-key, 315, 356 Meta-value, 356 Metadata automatic assignment of, 316 for biotransformation studies, 344 dependencies of, 318 in data warehousing, 292 in electronic scientific documents, 308 grouping of, 318 mandatory, 317 primary, 318 secondary, 318 in scientific workspaces, 315 Meteor software, 251 Method development, See Analytical method development Microarray gene expression data (MGED), 16 Mineral phase identification, 268 MO, See Molecular-orbital Mode of reaction, in EROS software, 232
382 Modeling with artificial neural networks, 109, 114 concise summary, 118 Modeling approach, 187–188 vs. database approach, 189–190 concise summary, 238 Molar refractivity, as descriptor, 73, 253 Molecular biology, 248 Molecular descriptors, 69–70 constitutional, 73 correlation with experimental descriptors, 72 concise summary, 69 definition of, 70 interpretability of, 72 isomer discrimination with, 72 online resource, 76 radial, See Radial distribution function requirements for, 70–71 reversible decoding of, 72 rotational invariance of, 71 selection of, 78 three-dimensional, 75 topological, 73–74 unambiguity of, 71 Molecular electrostatic potential (MEP), 75, 238 Molecular formula, as search criterion, 68–69, 334 Molecular fragments; See also Substructure coding of, 61, 75 detection from mass spectra, 167 frame representation of, 253–254 prediction of, 176 as residues, 312 screening for, 66, 129 superimposition of, 65 Molecular genetics, 247, 273 Molecular Genetics project (MOLGEN), 248–249 Molecular graph, 61 Molecular mass, as descriptor, 73 Molecular orbital (MO) computations, 56 Molecular properties, as descriptors; See also Atomic properties aromatic stabilization energy, 126 charge distribution, 72, 78 electronegativity, 73, 78, 125, 126 electrostatic potential, 75, 227–228 field effect, 73 hydrogen bonding potential, 227 hydrophobicity, 73, 227, 253 ionization potential, 125 mass, 73, 261, 301, 311 molar refractivity, 73, 253 number of ring systems, 73 polarizability, 72, 76, 78, 126, 162, 199–200, 222, 342 reactivity, 15 resonance effect, 73
Expert Systems in Chemistry Research ring-strain energy, 126 symmetry, 78 Molecular polarizability as descriptor, 126 prediction of, 199–200 Molecular representation of structures based on electron diffraction (MoRSE), 77 concise summary, 163 for infrared spectrum simulation, 178 Molecular shape, 227 Molecular surface as descriptor, 72, 229 mapping, 226–229 property distribution on, 227 Molecular transform, 77 Molecular weight, See Molecular mass MOLGEN, See Molecular Genetics project MOLION module, 168 Monod, Jaques, 6–7 MOP, See Memory organization packet Morgan algorithm for atom enumeration, 64 concise summary, 114 functioning of, 71 MoRSE, See Molecular representation of structures based on diffraction Mother wavelet, 97; See also Wavelet transform concise summary, 114 scaling of, 98 MS, See Mass spectrometry Multidimensional descriptors, 145–147 Multilayer network, See Multiple layer network Multitier architecture, See N-tier architecture Multiple document interface (MDI), 153 Multiple inheritance, 55 Multiple layer network, 104; See also Counterpropagation neural network construction of, 108 concise summary, 114 Mutagenicity, prediction of, 250 Mutation in genetic algorithms, 110–111 concise summary, 114 MYCIN, diagnostic expert system, 173–175 MySQL, See Open-source relational database scheme
N N-tier architecture, 357 National Aeronautics and Space Administration (NASA), 43 Ames Research Center, 269 CLIPS project, 43 contamination control expert system, 55 DENDRAL project, 167–171 John H. Glenn Research Center, 270 Johnson Space Center, 43, 269, 270
Index Lewis Research Center, 270 thermal expert system, 269 National Biomedical Research Foundation (NBRF), 248 National Institutes of Health (NIH), 248, 256 chemical information system, 64 DNA sequence library, 248 GeneBank project, 248 PHENIX project, 256 NATMIX module, 269 Natural fuzziness, 7, 362; See also Fuzzy logic Natural language, 18 vs. declarative language, 39 for computer communication, 41 Natural language syntax, in expert systems, 262–264 NBRF, See National Biomedical Research Foundation Negated list, See NOT list Neighborhood kernel, 106 NeoCLASSIC framework, 51 NEOMYCIN expert system, 175 .NET remoting, 324 Net value, in artificial neural networks, 103 Neural Network, See Artificial neural network Neuron, 102–106 artificial, 103 biological, 102 central (winning), 106 Neurotoxicity, prediction of, 250 New Chemicals Program, 258 Newton-Raphson algorithm, 137 NEXPERT OBJECT software, 54–55 NIH, See National Institutes of Health NMR, See Nuclear magnetic resonance Nonconformities, See Exception management Nondeterministic polynomial (NP) time complete problem, 65 Nonmonotonic reasoning, 55 Nonspecific binding, 220 Normal distribution, See Standard distribution Normalization concise summary, 164 of structure data, 68 of vectors, 124 NOT list, 69 Notebook, See Electronic laboratory notebooks Notifications, process, 331 NP complete problem, See Nondeterministic polynomial time complete problem Nuclear magnetic resonance (NMR), 201–208; See also Specific Nuclei Nuclear medicine, 208 Numeric fields, in user interface design, 352
O OASIS SAR system, 56
383 Object generator, in user input verification, 354 Object linking and embedding (OLE); See also Component object model concise summary, 357 for interoperability, 320 for software integration, 321 Object-oriented programming in CLIPS programming language, 44 vs. frame-based, 17–18 with frames, 50 in knowledge representation language, 51 as programming paradigm, 38 vs. rules-based programming, 19 OBO, See Open biomedical ontologies OCSS, See Organic chemical simulation of synthesis ODF, See Open document format OECD, See Organisation for Economic Cooperation and Development Off-line client, 301 Offsprings, in genetic algorithms, 110–111, 206 OLAP, See Online analytical processing OLE, See Object linking and embedding One-dimensional descriptor; See also Descriptor concise summary, 164 vs. two-dimensional descriptors, 145 One-level decomposition, 148 ONE-OF operator, 50 Online analytical processing (OLAP), 291, 292 Ontologies definition, 15 in semantic networks, 15–16 Out of specification, as exception class, 345 Open biomedical ontologies (OBO), 15 Open document format (ODF), 325 Open-source relational database scheme (MySQL), 16 Operational data, in data warehousing, 291, 292 Operational qualification (OQ), 284, 285–286 Operational test, 285 OPS-2000 development environment, 44 Optical sensors, contamination requirements, 56 Optimization of analytical methods, 209–210 descriptor, 188–189 with genetic algorithms, 55, 110–111, 338 nonlinear, 152 process, 288 production, 258–259 OQ, See Operational qualification Order-suborder relations, 304 Organic chemical simulation of synthesis (OCSS), 229 Organic reactions abstracted, 231 effect of molecular polarizability on, 199 effect of molecular electrostatic potential on, 227
kinetics of, 233 prediction of, 167, 171 simulation of, 232 Organic synthesis; See also Retrosynthesis evaluation of, 232 planning of, 229, 234 Organisation for Economic Co-operation and Development (OECD), 278 Orthogonal transformation, See Principal component analysis Orthogonality constraints for filter coefficients, 98 of eigenvectors, 80–81 as property, 93 transforms, 96 Out value, in neural networks, 105 Outliers effects on linear regression, 83–84 effects on descriptor scaling, 125 Output layer, in artificial neural networks, 108 Oxidation in synthesis planning, 235 Ozone depletion potential, 261 Ozonolysis, in synthesis planning, 235
P PAIRS expert system, 176 Paper laboratory notebook; See also Electronic laboratory notebooks basic requirements, 305 rules for conducting, 308 Parallel production system (PPS), 44 Parallelism, in substructure search, 65 Parent ion, detection of, 168 Parent structure, in metabonomics, 342 Parsing, 326 concise summary, 357 Partial atom mode, in ARC software, 153 Partial atomic charge, 126, 145–147, 203–205 Partial equalization of orbital electronegativity (PEOE), 227 concise summary, 238 Passwords, regulations, 281; See also Authentication Patents; See also Intellectual property Pathway editor, 340–341; See also Metabolic pathways concise summary, 357 Pattern function concise summary, 164 Pattern identification, 87 Pattern matching, 4 with inference engines, 20–21 concise summary, 32 with binary descriptors, 131 Pattern recognition, 74 with descriptors, 129
Expert Systems in Chemistry Research Pattern repetition in descriptors, 130 concise summary, 164 Pauling electronegativity, 125 PCA, See Principal component analysis PDF, See Portable document format Pearson correlation, See Correlation PEOE, See Partial equalization of orbital electronegativity Perception, 5 Performance qualification (PQ), 284, 286 Perl scripting language, 38, 202, 256 Personal mode concise summary, 357 in scientific workspaces, 314, 319 Pharmacophore, See Biophore Phase analysis, See X-ray phase analysis Phase approach, in EROS, 232 Phase transition, 213 PHENIX, See Python-Based Hierarchical Environment for Integrated Xtallography Phocomelia, 226 Physicochemical properties; See also Specific properties atomic, See Atomic properties molecular, See Molecular properties in descriptors, 78 in 1H-NMR spectroscopy, 204–205 distribution along structures, 75 in property-weighted descriptors, 125 PKI, See Public key infrastructure Planar neural network, 107 Planck constant, 213 PLANNER module, 168 Planning system, in LIMS, 299 Platykurtic distribution, 85; See also Kurtosis Point-charge model, 227 Point mutation, 110–111 Polarizability, 199; See also Molecular polarizability static dielectric, 126 concise summary, 238 Pool evaporation, 266 Population of chromosomes, 110; See also Genetic algorithms Population variance, 85 Portable document format (PDF), 321 Postprocessing of descriptors, 123–124 PPS, See Parallel production system PQ, See Performance qualification PRI, See Protein identification resource Precursor, in retrosynthesis, 234 Prediction with ARC software, 157 of atomization mechanisms, 213–215 of biological activity, 217–220, 251–254 of carcinogenic effect, 73, 249 of chemical shifts, 201, 207–208
Index of chromosome damage, 250 of effective concentrations, 221–226 of interelement coefficients, 217 of genotoxicity, 249 of mineral layers, 269 of molecular polarizability, 199–200 of mutagenicity, 250 of restriction enzymes, 248 of structures, 179–190 of toxicity, 249–250 PREDICTOR module, 169 Premanufacture notification, 258 Preprocessing of spectra, 182 in structure search, 64 Prescreening, 65 concise summary, 119 PRINCE2, See Projects in controlled environments Principal component analysis (PCA), 87–93 in data mining, 337 Print capturing, 325 concise summary, 357 Probability, 27; See also Bayesian networks and reliability, 28 Procedural algorithm, 39 Procedural programming, 35 Process deviations, 344–349 exceptions, 344 complaints, 348–349 risk assessment, 346 root causes, 345 Product management, in software development, 282, 286 Product-moment correlation, See Correlation Production Rule System, 48 concise summary, 58 in TEXTAL, 255 Progestagens, 221–222 Programming, 282 declarative, See Declarative programming heuristic, 248 imperative, See Imperative programming logical, 39 object-oriented, 16, 17–18 paradigms for, 38 procedural, 35, 37 rule-based, 12–13 in software development process, 282 vs. specification, 282 Programming and Logic (PROLOG), 41–43 concise summary, 58 facts, 41 rules, 42 Programming interface, See Application programming interface Programming languages, 38; See also Specific languages
Programming paradigms in CLIPS, 44 overview on, 38 Programming shells, See Expert system shells Projects in controlled environments (PRINCE2), 284 PROLOG, See Programming and Logic Properties atomic, See Atomic properties molecular, See Molecular properties physicochemical, See Physicochemical properties Property indexing, in case-based reasoning, 23 Property list, in LISP 30 Property matrix, 254 Property smoothing parameter, 145 concise summary, 164 Property-weighted descriptors, 125–128; See also Specific properties concise summary, 164 Property vector, 157 Propulsion jet engines, 270 PROSPECTOR expert system, 28, 267; See also Bayesian networks Protein crystal growing, 255 Protein identification, 254–256 Protein identification resource (PIR), 248 Protein NMR, 202; See also Nuclear magnetic resonance Protein sequence database (Swiss-Prot), 248 Protein structure generation, 255 Proteomics, 247 concise summary, 273 Proton environment, 204; See also H-NMR spectroscopy Proton NMR, See H-NMR spectroscopy Public key cryptography, 333 Public key infrastructure (PKI), 332 Publishing level, in scientific workspaces, 319 PUFF diagnostic expert system, 175 Python-Based Hierarchical Environment for Integrated Xtallography (PHENIX), 256
Q QA, See Quality assurance QAU, See Quality assurance unit QMF, See Quadrature mirror filter QSAR, See Quantitative structure/activity relationship QSPR, See Quantitative structure/property relationship Quadrature mirror filter (QMF), 99 concise summary, 119 Quality assurance (QA) GLP guidelines, 279 in software development, 286 Quality assurance unit (QAU), 297
386 Quality regulation charts, 301 Quantitative structure/activity relationships (QSAR), 73, 249 3D-, 251–252 concise summary, 273 Quantitative structure/property relationship (QSPR), 73, 75 concise summary, 273 Query manager, in data warehousing, 291 Query structure, specification, 67
R R, See Correlation coefficient R2, See Coefficient of determination Racemate, 226 search for, 67 Radial Basis Function (RBF), 77 Radial distribution function (RDF), 77, 119–135, 158 amplified, 160 applications summary, 162–163 aromatic patterns of, 130 atom-specific, 132, 159 attenuated, 161 bond-path, 133–134, 160 binary pattern, 130, 158 Cartesian, 133 as descriptor (RDF Code), 77, 158 distance pattern, 129, 158 frequency pattern, 130, 158 local, 132, 158 molecular, 158 multidimensional, 145–147 property-weighted, 160–161 proton, 159–160 synopsis of, 157–163 topological, 134–135, 160 two-dimensional, 161 wavelet, 147–151, 161 Radioactive isotope identification, 209 Radioligand binding, 218–219 Radioligand binding experiments, 238 Radionuclide in gamma spectroscopy, 209 identification 267, 306 Randic index, See Connectivity index Rang control chart, 301 Ranking of facts, 5 of bond cleavage, 235 Rate of association, 219 Raw data, 289; See also Data RBF, See Radial Basis Function RD, See Requirement document RDF, See Radial distribution function RDF Code, 77 RDL, See Rule description language
Expert Systems in Chemistry Research REACH framework, See Registration, authorization, and evaluation of chemicals REACT module, 171 Reaction, organic, See Organic reactions Reaction centers descriptor for, 121 descriptor for steric hindrance at, 132 local descriptor for, 139 Reaction mechanisms in heated graphite atomizers, 210 prediction of, 214–215 Reaction rate, 212 Reactor approach, in reaction prediction, 232 Reduction of data, See Data reduction of metal oxides in gaseous phase, 210–211 pollution, 258 in synthesis planning, 235 Real-time expert system (RTXPS), 262–265 Reasoning, 20 concept of, 9 leading to actions, 11 by multiple inheritance, 55 Receptor; See also Biological activity agonist binding, 223 androgen, binding affinity, 57, 221 binding, 217–219, 227 ligand complex, 219 Recall test, 157 Recipe administration, in LIMS, 302 Recombination, in genetic algorithms, 110, 114, 338 Record, See Electronic record Record retention, 357 Reduction of data, See Data reduction by graphite carbon, 210 of information, See Information reduction by intermediary carbides, 211 in synthesis planning, 235 Reference substance module, in LIMS, 302 Refractivity, 199 Refractivity index, See Molar refractivity index Registration, authorization, and evaluation of chemicals (REACH), 250 REGISTRY database, 20; See also Chemical abstracts (CA) Regression, See Linear regression Regular expressions, for file parsing, 326 Regulations, 277–281; See also Code of Federal Regulations Reinitialize mode, in ARC software, 156 Relationships and concepts, 5 data–information–knowledge, 10–11 in frames, 17 between rules, 20 in semantic networks, 14–15
Index Release, of software, 357 Release certificate, 286 Release notes, 283 Reliability, in probability theory, 28–29 Report generator, in knowledge management, 289 Reporting with electronic laboratory notebooks, 310 intermediate, 314 Requirement document (RD), 282 Research personnel, regulations for, 278 Residue in Markush structures, 312 concise summary, 357 Resistance, factors for expert systems, 363 Resolution, reduction in descriptor space, 92 Resolution level, in wavelet transforms, 98, 148 Resonance effect, as descriptor, 73 Retaining process, in case-based reasoning, 23 Retardation of light, 199 Rete algorithm, 21–22 in CLIPS, 43 in JBoss Rules, 48 in JESS, 48 Retes, See Rete algorithm Retrieval frequency control chart, 301 Retrieval process, in case-based reasoning, 23 Retrosynthesis approach, 230, 234–235 concise summary, 239 Retrosynthesis browser (RSB), 236 Reuse process, in case-based reasoning, 23 Reverse mode in ARC software, 156–157 in counterpropagation neural networks, 183 Revision process, in case-based reasoning, 23 RI, See Rule Interpreter Rich text format (RTF), 250, 321 Riemann integral, 124 Ring systems, as descriptor, 73 Risk assessment environmental, 262 technological, 265 in exception management, 346 RMS, See Root mean square Rock type characterization, 268 Roles concise summary, 357 in CLASSIC, 49 in user administration, 328 Root cause, 345, 347–349 Root mean square (RMS), 81 concise summary, 114 as similarity measure, 181 Rotational invariance concise summary, 114 of descriptors, 71 of statistical descriptors, 256 RS232, See Serial port
R/S enantiomers, See Enantiomers RSB, See Retrosynthesis browser RTF, See Rich text format (RTF) RTXPS, See Real-time expert system Rule-based systems; See also Expert systems definition, 12 concise summary, 32 vs. conventional programming, 14 Rule description language (RDL) in rule interpreter, 55–56 concise summary, 58 Rule designer, 351 Rule engines advantages of, 18–19 in JBoss Rules, 48 in JESS, 47 Rule generator, in user input verification, 354 Rule Interpreter (RI), 56–57 concise summary, 58 for user input verification, 351 Rule manager in Apex-3D, 252 in BRE, 53 Rules advantages of, 18–20 in Apex-3D, 252–253 in CEA, 271 in CLASSIC, 50 in DENDRAL, 170 derivation from expert skills, 29 in EROS, 233 for knowledge representation, 12–14 in MYCIN, 173–174 predicate, 280 in PROLOG, 42 qualitative vs. quantitative, 252–253 in RTXPS, 263–264 Runge-Kutta method, 233
S SAFE module, 248 Sample control, 303–304 Sample estimate of population variance, 85 Sample management, 279, 299; See also Laboratory information management system Sample tracking, 298 SAMPO spectrum analysis system, 209 SAR, See Structure/activity relationship Satellite imaging, 266; See also Geographic information system Scaling of mother wavelets, 98 function (father wavelet), 98 function, shape of, 100 drawbacks of linear, 124–125 of spectra, 178
388 Scattering coherent, 217; See also Coherent anti-Stokes Raman scattering incoherent, 217 in electron diffraction, 77 Raman, 212 in X-ray fluorescence, 216 Scientific data management system (SDMS), 293–294 concise summary, 357 vs. laboratory information management systems, 298 regulatory standards, 293 search capabilities, 334 Scientific document, 305–306; See also Electronic scientific documents Scientific workspaces, 312–319 concise summary, 357 vs. electronic scientific documents, 319 file operations in, 317 vs. file systems, 315 metadata structure, 315, 316 navigation in, 315 personal mode, 314 Scoring of chromosomes, 206 Scripting languages, 38 SD, See Standard deviation SD file format, See Structure data file format SDK, See Software development kit SDLC, See Software development life cycle SDMS, See Scientific data management system SDS, See Serial device support concept Search with artificial neural networks, 337 chemical structure, See Structure search in data management systems, 334 with data cartridges, 334–335 Search engine, in knowledge management, 289 Search for starting materials (SESAM), 230 SECOFOR system, 267 Secondary fluorescence, 216 Secret-key cryptography, 333 SECS, See Simulation and evaluation of chemical synthesis Selection algorithm, in genetic algorithms, 110, 114 Self-organizing map (SOM), 105 Self-organizing map program package (SOM_ PAK), 152 Self-similarity, of Daubechies wavelets, 98 Semantic networks, 14–16 concise summary, 32 for knowledge representation, 49 Semiempirical calculations, 75 Semipolar bonds, in structure searches, 69 SEQ, See Sequence module Sequence alignment, 256 Sequence module (SEQ), 248
Expert Systems in Chemistry Research Sequence ontology (SO), 16 Serial device server, 323 Serial device support (SDS) concept, 322, 357 Serial port, 322 Serial search, in case-based reasoning, 24 Serum proteins, human, 171–172 Service request scenario, 327–328 SESAM, See Search for starting materials SHAMAN expert system, 209 Shapes of standard distributions, 86; See also Kurtosis Shewhart charts, 301 Shielding in 1H-NMR, 204 Shock tube parameters, 271 Sigmoidal function, 105 Signal processing, 98–99; See also Wavelet transforms Signal-to-noise ratio, 95 Signature workflow; See also Electronic signature in electronic laboratory notebooks, 329–330 concise summary, 357 Significance of correlation coefficients, 83, 86 of data, 94 Similarity; See also Diversity concise summary, 115 vs. diversity, 162 of electron density patterns, 256 of infrared spectra, 181 measures of, 81 molecular, 136–138, 147, 194 of patterns, 87 search, 191, 234–235 self-, 98 of RDF descriptors, 187–188, 194–199 with wavelet-transforms, 197 Similarity indexing, in case-based reasoning, . 23 Simple object access protocol (SOAP), 321 concise summary, 357 Simplified molecular input line system (SMILES), 63 concise summary, 115 in rule interpreter, 56 vs. SLang line notation, 253 Simulated annealing, 202; See also Nuclear magnetic resonance Simulated parallel search, in case-based reasoning, 23 Simulation of infrared spectra, 77, 178 of mass spectra, 232 of organic reactions, 229, 231 of organic synthesis, 230 of risk situations, 266 Simulation and evaluation of chemical synthesis (SECS), 230
Index Single-value control chart, 301 Singular value decomposition (SVD), 91–93 Skewness, 83–85 concise summary, 115 distribution of, 195 as statistical descriptor, 256 Skin sensitization, prediction of, 250 Skolem constant, 52 SLang line notation, 253–254 SLIDER algorithm, 256 Slots in Apex-3D, 253–254 in frame-based programming, 16–18 in JESS, 47 in KM, 51–52 in CLASSIC, 50 SMART assessment, See Synthetic methodology assessment for reduction techniques SMARTS, See SMILES arbitrary target specification SME (Subject matter expert), See Expert SMILES, See Simplified molecular input line system SMILES arbitrary target specification (SMARTS), 64 concise summary, 114 vs. SLang line notation, 253 Smoothing, by transformation, 95 Smoothing parameter, 78, 119 effects on descriptor resolution, 120–122 concise summary, 164 SO, See Sequence ontology SOAP, See Simple object access protocol Soft-computing, See Artificial intelligence Software, GALP guidelines for, 279–280 Software agents, 325–327 concise summary, 357 data buffering with, 327 file splitting with, 326 inbox concept, 327 in knowledge management systems, 289 Software architecture, definition of, 282 Software development kit (SDK), 324–325, . 357 Software development life cycle (SDLC), 283–287 Software development process, 281–283 documentation for, 284–285 V-Model, 284 approval process for, 286 implementation phase in, 286–287 programming in, 282 requirement specification for, 282 verification vs. validation, 283 Software interfaces; See also Application programming interface agent-based, 326 cartridge-based, 334–335
389 component object model, 321 development tools for, 324–325 Software interoperability, 320–328 Software maintenance, 283 Solar arrays, contamination requirements, 56 Solid-phase dissociation, 210 Solvents, hazard estimation, 261 SOM, See Self-organizing map SOM_PAK, See Self-organizing map program package SOP, See Standard operating procedure Space-based systems, monitoring of, 269 Space Station Freedom (SSF), 270 Spatial autocorrelation, See Autocorrelation vectors Specification software, 282; See also Software development process concise summary, 357 Spectra; See also Specific Spectra types as descriptors, 70 interpretation of, 177 preprocessing of, 182 representation of, 178–179 simulation of, 178 SQL, See Structured query language Squared correlation coefficient, See Coefficient of determination SSF, See Space Station Freedom SST, See Starting material selection strategies Stability management, in LIMS, 301–302 Staggered conformer, 137 Standard deviation (SD), 79, 256 Standard distribution, 84–86 Standard enthalpy, 213 Standard operating procedures (SOP) in dynamic emergency management, 262 GALP guidelines, 278 in software development, 284 Starting material selection strategies (SST), 230 Statement of work, in software projects, 282, 284 Static atomic properties, 125–126 concise summary, 164 vs. dynamic atomic properties, 125 Static dielectric polarizability, See Molecular polarizability Statistics descriptive, 79–86 limitations of, 85–86 prerequisites for, 83 sample, 298 Stereo center, specification in structure queries, 67 Stereoisomerism; See also Enantiomerism of drugs, 226 impact on biological activity, 221 of nitrogenase models, 136–137 Stereoselectivity, 226
390 Stereospecific search, 67 Stoichiometry concise summary, 358 module, 311–312 Strategic bond, 235 STRUCMIX module, 269 Structural genomics in bioinformatics, 247 concise summary, 273 Structure, in electronic records, 293 Structure/activity relationship (SAR), 162, 221, 251, 260 concise summary, 273 Structure construction, in DENDRAL, 168 Structure data (SD), file format, 250 Structure databases, 61 Structured query language (SQL), 320 Structure editor, for generic structures, 312 Structure elucidation with infrared spectra, 4 expert systems for, 75 computer-assisted, 176 Structure fragment, See Molecular fragment Structure hypothesis testing, 169 Structure prediction from infrared spectra, 179–190 benzene derivatives, 185 bicyclic compounds, 186 Structure preprocessing, 182 Structure/property relationship, 162 Structure reduction, 177, 239 Structure registration, 335 Structure representation in DENDRAL, . 169–170 Structure search, 64–69 with descriptors, 162 in external systems, 335 Structure/spectrum correlation, 76 Structure/spectrum relationship, 163 Subgraph isomorphism, 65–66, 115 Subject matter expert (SME), See Expert Subjective probability, theory of, 28 Substitution pattern, 235 Substructure; See also Molecular fragments Substructure screening, 66 Substructure search, 64–65 with descriptors, 162 in external systems, 335 Superimposition of atoms, See Atom-by-atom matching of biophores, 253 Supervised learning vs. supervised learning, 105 in CPG neural networks, 107 Surface, See molecular surface SVD, See Singular value decomposition Swiss-Prot protein sequence database, 248 Symbolic inference, See Reasoning
Expert Systems in Chemistry Research Symmetric encryption, 332 Symmetry effects in descriptors, 130 concise summary, 165 Symmetry of distribution, See Skewness SYNCHEM software, 230 SYNSUP-MB software, 231 Synthesis, See Organic synthesis Synthesis filter coefficients, in wavelet transformation, 97 Synthesis planning, 234; See also Retrosynthesis Synthesis tree, 236 Synthetic methodology assessment for reduction techniques (SMART), 258–259 System access authority check, 281 biometric, 332 permission management, 328–329 System specification, 284–285 System validation, 280 Systems autonomy demonstration project, 269
T Tailing, in standard distribution, 85 Task domain, 32 Tautomer search, 67–68, 334 Tautomerism, 67 Tcl, See Tool command language TCP, See Transmission control protocol Team in user administration, 329 concise summary, 358 Team administration, 329 Technical specification (technical design), 284, 285 Technological hazard assessment, 262 Technological risk assessment, 265 Technologies, supporting expert systems, 7 Temperature factor, 78; See also Radial distribution functions Temperature profile of graphite tubes, . 212 Templates concise summary, 358 for scientific documentation, 309 editor for, 310 Teratogenic effects of Thalidomide, 226 prediction of, 250 Termination criterion in genetic algorithms, 111, 206 concise summary, 115 Test set, in ARC software, 155–156 Testing acceptance, 279 of hypothesis, 169 software, 283, 284–285
stability, 301–302 substance, See Laboratory information management system TEXSYS, See Thermal expert system TEXTAL software, 255–256 Thalidomide, 226 Theory of evidence, See Dempster-Shafer theory Thermal bus, 269 Thermal control surfaces, contamination requirements, 56 Thermal dissociation, 210 Thermal expert system (TEXSYS), 269–270 Thermal transport property data, 271 3D molecular descriptor, See Molecular descriptor 3D-MoRSE Code, See Molecular representation of structures based on electron diffraction 3D-QSAR, See Quantitative structure/activity relationships 3D spatial autocorrelation, 75 Threshold, in neural networks, 102 Time fields, in user interface design, 352 Time stamp, in electronic records, 280 TNDO, See Typed neglect of differential overlap TOF, See Mass spectrometry Token, 326 Tool command language (Tcl), 232–233 Topological autocorrelation vectors, 74–75, 116 Topological descriptors, 74 bond-path, 140 in 1H-NMR spectroscopy, 205 local, 131–132 Topological distance, in Kohonen neural networks, 105–107 Topological index, 74, 115 Topological map, 107 of ring systems, 191 of diverse compounds, 193 Topological path concise summary, 164 Topological path descriptor, 134–135 Toroidal neural network, 107 Total binding, 219–220 Toxic Substances Control Act, 258 Toxicity assessment, 260 Toxicological assessment, 250 Toxicophore, 250 Training Kohonen neural networks, 107 counterpropagation neural network, 182–183 data selection for, 181, 206 key factors in 1H-NMR spectroscopy, 206–207 reverse mode, 157 Transfer function in artificial neurons, 103–104 in certainty theory, 25 Transform, See Mathematical transform
Transition state theory, 212 Translation factor, 97 Translational invariance of descriptors, 71 concise summary, 115 Transmission control protocol (TCP), 325 Transport properties of complex mixtures, 271 Trend, in statistics, 142 Triple-resonance NMR, 202; See also Nuclear magnetic resonance 21 CFR, See Code of Federal Regulations Two-channel subband coder, 99 Two-dimensional descriptors, 146–147 concise summary, 164 for biological activity prediction, 222 2D structure representation, 62–64 Two-step decision, 300 Typed neglect of differential overlap (TNDO), 202
U UDDI, See Universal description, discovery, and integration UI, See User interface UID, See User interface designer Unbiased sample estimate, 85 Uncertainty, 26; See also Fuzzy logic Unidirectional port, 322 Unification, in KM, 53 Unique variance, 94 Unit, See Department Unit vector, 97 Unitary matrix, in Singular value decomposition, 91 Universal description, discovery, and integration (UDDI), 321, 358 Unshielding in 1H-NMR, 204 Unsupervised learning vs. supervised learning, 105 in CPG neural networks, 107 URS, See User requirements specification U.S. Environmental Protection Agency (EPA) Chemical Information System, 64 HPV challenge program, 250 Green chemistry program, 257 Good laboratory practices (GLP) guidelines, 278 U.S. Food and Drug Administration (FDA), 278 concise summary, 356 good laboratory practices program, 278 guidelines for electronic signatures, 330–331 guidelines for electronic records, 280–281 regulations, 277; See also Code of Federal Regulations requirement of true and complete transfer, 344 User administration, 328–329
User dialog, creation of, 350 User ID, See Authentication User input verification, 350–354 User interface (UI) automatic form creation in, 350 design specification of, 282, 284–285 graphical, need for expert systems, 29, 30 in expert system design, 36–37 relevance for expert systems, 364 for research studies, 343 symbolics, 53 User interface designer (UID), 351 User interface interpreter, 353 User interface generator, 354 User interface layer, in knowledge management, 289 User requirements specification (URS), 284
V V-Model, 283–285, 358; See also Software development life cycle Validation concise summary, 358 expert system, 365 system, 280 vs. verification, 283 Validation plan, 284 van der Waals surface, 227 van der Waals volume, as descriptor, 73 van’t Hoff equation, 213 Variance, 79–80 vs. covariance, 80 population. 85 in singular value decomposition, 90 VBA, See Visual basic for applications Vector norm, See Euclidian L 2-Norm Vectors, in user interfaces, 353 Ventilator manager (VM), 175 Verification concise summary, 358 of user input, 350 vs. validation, 283 Versioning of documents, 308 concise summary, 358 Vibrational spectra infrared, See Infrared spectrum Raman, 176, 212 Vibrational spectroscopy, applications, 176–177 Virtual reality software, 338 Visual basic application (VBA), 324–325 concise summary, 358 as programming language, 38 sample code, 325 Visualization system, in knowledge management, 289 VM, See Ventilator manager
W Wavelet, 93, 115; See also Wavelet transform Wavelet compression, 151, 198 Wavelet decomposition, 100 Wavelet equations, 99 Wavelet filter coefficients, 98 Wavelet functions, construction of, 101 Wavelet mother function, 97 Wavelet transform (WLT), 96–102 concise summary, 115 for classification, 198 coarse-filtered, 148–150 of descriptors, 147–151 detail-filtered, 149–150 for diversity evaluation, 197 vs. Fourier transform, 96–97 single-level, 150–151 Web retrieval client, for electronic documents, 312 Web service, 320, 324 Web services description language (WSDL), 321, 358 Weight-property, 146 Weight vector, 106 Weighting of descriptors, 123–124 concise summary, 164 Weights adjustment of, 106 molecular, See Molecular weight in artificial neural networks, 103–104 in counterpropagation neural networks, . 107 in Kohonen neural networks, 105 Wiener Index, 74, 115 Winning neuron, See Central neuron Wiswesser line notation (WLN), 63 concise summary, 115 Witnessing of electronic data, 294 WLN, See Wiswesser line notation WLT, See Wavelet transform Workflow-based document template, 309 Workbench for the organization of data for chemical applications (WODCA), 231, 234–236 Workflow management systems, 302–305 analytical, 303–304 interfacing of, 305 order-suborder relations, 304 requirements, 303 sample tracking, 303 Working memory, 35–36, 58 Working party on spectroscopic data standards, 178 Workspace, See Scientific workspace World Wide Web Consortium, 320 WSDL, See Web services description language
X Xenobiotic compounds, 251, 339 XML, See Extensible markup language XMS, See Exception management system XpertRule, 55–56, 58 X-ray crystallography, 254 X-ray diffraction concise summary, 273 molecular transform, 77 for phase analysis, 268
X-ray fluorescence spectroscopy, 215–217 calibration, 216 concise summary, 239 sample effects, 217 X-ray phase analysis, 268–269, 273 XRD, See X-ray diffraction
Y Yield, calculation of, 311