DATA HANDLING IN SCIENCE AND TECHNOLOGY -VOLUME
5
PCs for chemists
Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim

Other volumes in this series:
Volume 1 Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2 Chemometrics: A Textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valkó and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
PCs for Chemists
edited by J. ZUPAN
Boris Kidrič Institute of Chemistry, Hajdrihova 19, 61115 Ljubljana, Yugoslavia
ELSEVIER Amsterdam - Oxford - New York - Tokyo
1990
ELSEVIER SCIENCE PUBLISHERS B.V., Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
Distributors for the United States and Canada: ELSEVIER SCIENCE PUBLISHING COMPANY INC 655, Avenue of the Americas New York, NY 10010, U.S.A.
ISBN 0-444-88623-0
© Elsevier Science Publishers B.V., 1990

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V./Physical Sciences & Engineering Division, P.O. Box 330, 1000 AH Amsterdam, The Netherlands.

Special regulations for readers in the USA - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the publisher.

No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical (medical) standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality or value of such product or of the claims made of it by its manufacturer.

Printed in The Netherlands
CONTENTS

INTRODUCTION

1 WORD PROCESSORS DEVOTED TO SCIENTIFIC PUBLISHING (by W. T. Wipke)
1.1 Introduction
1.2 Methods of obtaining graphics
1.2.1 Character matrix graphics
1.2.2 Inclusion graphics
1.2.3 Formatter graphics
1.2.4 On-screen integrated graphics
1.3 ChemText graphics
1.3.1 Scientific fonts
1.3.2 Drawing general shapes
1.3.3 Drawing molecules
1.3.4 Image insertion
1.3.5 ChemText images are computable
1.3.6 Image import
1.3.7 The integrated chemical text processor
1.4 The impact of scientific information processors on science
1.4.1 Impact on authors
1.4.2 Impact on publishers
1.4.3 Impact on readers
1.4.4 Impact on journals
1.5 References

2 DATABASES AND SPREADSHEETS (by D. L. Massart, N. Vanden Driessche and A. Van Dessel)
2.1 Introduction
2.2 How to make and use a database with dBASE III PLUS
2.3 Programming in dBASE
2.4 How to use a LOTUS spreadsheet
2.5 Programming in LOTUS
2.6 Conclusion
2.7 References

3 PRINCIPAL COMPONENT ANALYSIS OF CHEMICAL DATA (by K. Varmuza and H. Lohninger)
3.1 Introduction
3.2 Multivariate chemical data
3.3 Display of multivariate data
3.3.1 Principal components
3.3.2 Display of a set of objects
3.4 Application
3.5 Software
3.6 References

4 MANIPULATION OF CHEMICAL DATA BASES BY PROGRAMMING (by J. Zupan)
4.1 Introduction
4.2 Programming procedures
4.3 Handling chemical structures with PC
4.3.1 General
4.3.2 Editing a structure
4.3.3 Representation of chemical structures
4.3.4 Sub- and super-structure search
4.3.5 Update and retrieval in direct access files using hash algorithm
4.4 Spectra representation in the computer
4.4.1 General
4.4.2 Peak tables
4.4.3 Organization of full-curve or reduced representations of spectra
4.5 Conclusion
4.6 References

5 REDUCTION OF THE INFORMATION SPACE FOR DATA COLLECTIONS (by M. Razinger and M. Novič)
5.1 Introduction
5.2 Fast Fourier and fast Hadamard transformation
5.3 Reduction of the coefficients
5.4 Reduction of representations
5.4.1 Smooth curves
5.4.2 Discrete spectra
5.4.3 2-dimensional patterns
5.5 Conclusion
5.6 References

6 PROLOG ON PCs FOR CHEMISTS (by H. Moll and J. T. Clerc)
6.1 Introduction
6.2 Database
6.2.1 General
6.2.2 Exploring the database
6.3 Elements of the PROLOG
6.3.1 Simple rules
6.3.2 Backtracking/instantiation
6.3.3 Recursion
6.3.4 Arithmetics
6.3.5 Control of backtracking
6.3.6 Modifying the database
6.3.7 Obtaining output
6.4 Refining the program
6.4.1 General
6.4.2 Manipulating lists
6.4.3 Sorting
6.5 Conclusion
6.6 References

7 REACTION PATHWAYS ON A PC (by E. Fontain, J. Bauer and I. Ugi)
7.1 Introduction
7.2 The deductive solution of chemical problems and the theory of the BE- and R-matrices
7.3 The hierarchic classification of chemical reactions
7.4 Reaction generators
7.5 Examples
7.6 Conclusion
7.7 References

8 DATA ACQUISITION IN CHEMISTRY (by H. Lohninger and K. Varmuza)
8.1 Introduction
8.2 Concept of computerized data acquisition
8.2.1 Basic concepts of signal processing
8.2.2 Noise
8.3 Signal conditioning
8.3.1 Level shifting
8.3.2 Linearization
8.4 Analog-to-digital conversion
8.4.1 Reference voltage sources
8.4.2 Sample and hold circuits
8.4.3 Principles of analog-to-digital conversion
8.5 Interfaces
8.5.1 Backplane interface (bus)
8.5.2 RS-232
8.5.3 IEEE-488
8.6 Software
8.7 References

9 PCs AND NETWORKING (by E. Ziegler)
9.1 Introduction
9.2 The use of personal computers
9.3 Networking of computers
9.3.1 Different types of networks
9.3.2 Transfer media, network topologies
9.3.3 Ethernet LANs
9.4 The operation of PCs within networks
9.4.1 Asynchronous line connection
9.4.2 Ethernet connection
9.5 The Ethernet LAN of the Max-Planck-Institutes in Muelheim
9.6 The evolution of distributed systems
9.7 References

10 THE FUTURE OF PERSONAL COMPUTING IN CHEMISTRY (by G. C. Levy)
10.1 Introduction
10.2 Computational environment in mid 1990s
10.3 Conceptual trends
10.3.1 Program development environment
10.3.2 Automated programming
10.3.3 Data and knowledge management
10.3.4 Scientific computation
10.4 Applications to NMR and molecular structure computation
10.4.1 A near term prognosis
10.4.2 NMR spectrometry and personal computing
10.4.3 3D molecular structures from NMR
10.5 References

INDEX
INTRODUCTION
Very often the chemist's first encounter with personal computers (PCs) is similar to that of a child coming across LEGO bricks for the first time: at a friend's home, watching the enthusiastic owner building 'real things' from the BASIC box. Inevitably, the owner tries to convince the onlooker to buy such a beautiful plaything. If the chemist is not a believer in the-world-without-computers (some children do not like to play with LEGO bricks), he or she will probably start thinking about a purchase, and sooner or later (irrespective of whether this was because of need, enthusiasm, status, or peer pressure) the wonderful gadget will be in his or her office. Not surprisingly, the reaction of the chemist after playing with the computer for a while will be very similar to the reaction of a youngster playing with LEGO bricks. The desire to build larger houses, more realistic spacecraft, or the more sophisticated objects shown in the accompanying booklet, wisely supplied with the box by the manufacturer, grows more intense each day.

Unfortunately, chemists, unlike children who can be satisfied by buying a larger set, are much harder to please; they must buy more hardware, more application programs, better compilers, expert systems, editors, and database management systems, to name but a few. The practising chemist would like to have useful and reliable results too. However, confusion caused by contradictory advice from computer 'experts', an enormous amount of analogous software, almost illegible and incoherent instruction manuals, software that does not react as it should, frequent changes of versions of old familiar products, etc., may convince the chemist that in spite of the promised benefits the results obtained are not worth the effort required to master the PC. Consequently, the whole 'PC-project' ends up on the spare desk in the office, in the expectation of better times or the attentions of an enthusiastic undergraduate.
The present book is a short '10-chapter-handbook' on how, when, and with what kind of software a chemist should use a PC. I have tried to order the chapters in the book to follow the 'natural' way in which chemists become familiar with their PCs. They usually start by editing manuscripts, proposals, reports, and the like. They get used to a simple editor but become irritated by the lack of some badly needed features, and purchase another editor, which once more leads to difficulties because of differing commands and the use of special keys. They then find out that the 'new' editor lacks useful features the old one has, get angry again, buy a third editor, etc. The same story may happen with a BASIC compiler, database management, statistical package, spreadsheet program, or some other general purpose software. Not only chemists, but almost all PC users have gone through these frustrations. To avoid such situations, ten contributions, each describing one or two different applications together with a suggestion for the best choice, are collected in this book.

The first chapter is dedicated to scientific word and text processing. Wipke describes the philosophy, purposes and abilities of word processors. The chapter is not written with the intent to promote a particular make of word processor, but to show the scientist what to look for when choosing a new product or evaluating an existing one. A careful reader will notice a subtle difference between the layout of the first chapter and that of the rest. In order to show the high quality manuscript that any chemist can produce in his or her own office without the assistance of artists, typists, editors, etc., Wipke has supplied his manuscript in 'camera-ready' form made on a PC-based word processor, hence the minor differences.
In the second chapter, Massart and coworkers show the use of dBASE and LOTUS 1-2-3 for chemical applications. The two packages, representatives of a large group of database management and spreadsheet programs, respectively, were selected because they are probably the most widely used packages in such applications.

Once the database management and/or spreadsheet programs are used regularly, the need for programming in the chemical laboratory will diminish considerably. However, some special tasks such as sophisticated statistical procedures, feature extraction, clustering, etc. are seldom incorporated into packages designed for general database or spreadsheet handling. For example, the separation of complex, multivariate measurements into a number of classes, from which predictions about the properties of unknown samples can be made, and projections from a multi- to a two-dimensional space are mainly available as dedicated packages. Varmuza and Lohninger give an introduction to one of the most useful chemometric methods in this respect - that of principal component analysis. In addition to a description of the method, they give examples of how to use it and a number of references to the relevant software.

Databases of chemical structures and collections of various kinds of spectra are also of great interest to chemists. Complex tasks such as structure searches, spectral-structure correlations, isomer generation, search for chemical reactions, and many others are only possible with special software rarely available to the average user. Thus, in many laboratories the researchers are forced to design, program, and implement their own solutions to these problems. In the fourth chapter Zupan describes how high-level languages can be employed for programming in order to solve problems concerning chemical structures (molecular graphs) and various kinds of spectra.

The programming of spectral and structural 'database management' is strongly connected with data and information reduction, especially if implemented on PCs. In spite of the fact that space on hard disks is not a severe restriction anymore, a much better overall performance of the system can be achieved if compressed representations are used in place of complete ones. However, the computational performance and the quality of the results should be weighed against each other. All the above aspects of data reduction are addressed in the chapter by Razinger and Novič.

Once chemists are quite familiar with the computer, and even resort to programming, they are eager to try new and more advanced ways to handle their data. One such possibility is described in the chapter on Prolog, the language of artificial intelligence. Moll and Clerc introduce the basics of Prolog and its most distinctive features using an informative didactic approach, gradually building up a TLC database.
One of the most advanced topics of chemistry handled by computers, the search for reaction pathways, can now also be performed on a PC. The fundamentals and the implementation of a reaction-pathway-finding program, together with examples of reactions that the program has found, are described by Ugi and coworkers.

In many cases, linking the instrument to the computer is a very important requirement for the chemist; hence, knowledge of the techniques and of how to do it is badly needed. Essential information on the types of experiments, conditions of measurements, hardware requirements, and many other aspects of data acquisition by PCs is given in the chapter by Lohninger and Varmuza.

After performing so many useful operations and handling so much data on a PC, the chemist would then like to have more computing power at his or her disposal for running programs such as GAUSSIAN, would like to access larger databases and electronic journals, and would like to use international mail and participate in teleconferencing, etc. In other words, the networking of PCs to mainframe computers and to international data nets is being sought. Ziegler's chapter contains many helpful suggestions on how such links can be made and describes a real example of a network organized at the Max-Planck-Institut in Muelheim. In this way the reader is presented with a picture of what can be achieved by networking PCs to other computers.

At the end, Levy looks into the bright future of personal computing, when hyper-networking and Cray capability will be within easy reach of all chemists, which gives computational and experimental chemistry new perspectives. We are all hoping that such optimistic trends will come to fruition in the near future; in the meantime we are shopping around for the PC 'LEGO bricks' - program packages and hardware accessories - to enable better, faster, and more sophisticated data and information handling.

It must be emphasized again that the activities and solutions of the problems described in the book should not be regarded as the very best or the final ones, but rather as suggestions and words of encouragement to chemists tackling their own problems. Once firm goals have been set, and suitable software is on hand to achieve them, a good manual for the particular software can turn the PC into a great helper for many chemists. However, without a sound knowledge of chemistry (viz.: the ability to solve the problem without the computer) no PC hardware and its software can help.
So why use computers at all? The answer is straightforward: the computer avoids trivial errors, looks at all possibilities, excludes biased conclusions, and provides suggestions and complex information retrieved from data collections, which allows the time saved to be spent in streamlining the research.

For a newcomer it is always hard to make the right choice based mainly on advertisements or an associate's suggestions. Successful use of software, be it word processor, database management, or operating system, strongly depends on how the user 'feels' with or 'adapts' to the product. This is a matter of personal preference, and a product which seems convenient to one user may seem the opposite to another. The term 'user-friendly', which is used to describe (too) many products on the software market, must be tested and regarded as being 'friendly-to-me-personally'.

To select a software package from the large variety of mainly equivalent products is not an easy task. Many factors must be considered: the hardware at your disposal, the knowledge one has of the subject, the results one wishes to obtain, the output one wishes to produce, the frequency of use, the after-sales service by the supplier, the quality of the manual, etc. Of course, the enormous amount of software on the market has its bright side as well. Sooner or later each user will find the products (editor, spreadsheet, compiler, statistics, communication protocol, etc.) that best suit his or her requirements. It is true, however, that this does not happen immediately after the PC is placed in the office or laboratory.

Once the package best suited to the requirements has been acquired, new/updated versions of it should not be purchased immediately upon release, since de-bugging is seldom 100% complete at that stage. If the purchase is not urgent because of some badly needed feature, it is best to wait for test reports, reviews, and the next release.

Very often the manuals are the weakest point of many excellent packages. The manual will be your first and most often your only assistant and helper. A good manual should not only describe all possibilities for operating the system, but should also provide clear instructions on how to start using the system - possibly with actual examples - and should include a comprehensive troubleshooting guide. In other words, the manual should be useful from the very beginning until the system has been mastered, and beyond.
There are many topics (such as searching chemical literature, Current Contents, or Citation Index; the choice of hardware and maintaining the system; use of statistical packages, etc.) that could have been included in the book, but at some point a selection of the most common, and at the same time the most important, aspects of PCs in chemistry must be made. After all, the readers will judge whether too many vital topics were omitted.

The initial idea for this book was put forward by Professor Dušan Hadži, one of the earliest promoters of the use of computers in the field of chemistry. Initially the book was intended to cover all aspects of the use of computers in chemistry with the emphasis on number-crunching and information systems. However, with the rapid advance of the applications of PCs, the scope and intention of the book have been gradually moulded into the present form. The discussions with Professor Luc Massart during my stay in Brussels at VUB have greatly contributed to my understanding of the universal role of PCs in chemistry in general and chemometrics in particular. Finally, I would like to thank all my colleagues and coworkers at the 'Boris Kidrič' Institute of Chemistry for discussions and exchanges of opinion that have added pieces of information and different points of view on many problems towards which I would probably have had a very single-minded and biased attitude.

At the end, a standard phrase about my gratitude to the family should stand; instead of it, I will rather promise to spend more time at home in the future, which means no more writing or editing of books, at least not in the next few months.
Ljubljana 1989
Jure Zupan
1 WORD PROCESSORS DEVOTED TO SCIENTIFIC PUBLISHING
W. TODD WIPKE
Department of Chemistry, University of California, Santa Cruz, CA 95064, USA

1.1. INTRODUCTION

The scientific publication process is essential for communication of research results. This process ranges from informal intra-group research reports to formal papers for scientific journals. Typically the author generates a hard copy manuscript, a graphic artist generates drawings and graphs, and the figures are assembled at the end of the manuscript with a separate sheet of figure captions. The paper is refereed in this format, and the author revises it also in this format. Finally the publisher keyboards the text, has artists draw some of the figures, the figures are photographed, and a galley is printed for the text and separately for the figures. The author reviews the galley and indicates errors, and the publisher revises the typesetting, then prints the document. This is a laborious process in which errors can creep in during every manual manipulation of the information. (ref. 1)

Chemists, particularly those involved with computers, welcomed the opportunity to use the computer for text processing. (ref. 1a,2) Initially, the tools were crude. Some printers in computer centers could only print upper case letters. The appearance of microcomputers with desk-top dot matrix printers and letter-quality daisy-wheel printers gave the chemist the opportunity to create excellent looking hard copy with upper and lower case characters, but without drawings and with a limit of 96 characters. WordStar™ under the CP/M operating system; Magic Windows™, SuperTex™ and AppleWrite™ under the AppleDOS operating system; and MacWrite™ under the Macintosh operating system were some of the early tools for text processing. The IBM PC increased the offerings to include many new software packages with increased power. Finally, desk-top laser printers have made possible hard copy with 300 dots per inch resolution.
All of these developments enabled the chemist to create and print text of very high quality. But many manuscripts in chemistry have more space devoted to figures, equations, and tables than to normal text. Since the effort required to generate and paste in the figures far exceeded the effort to type in the text, chemists did not find text processing effectively solving their problems. Early word processors simply left space for the figures, and the figure had to be drawn, cut out, and glued into that space. Manually moving figures from one draft to another in large manuscripts (e.g., theses) was so much trouble that intermediate drafts were often received without the figures attached. Of course, for manuscripts that had all figures at the end, transferring figures from one draft to another was easier, but reading such a manuscript was more difficult. A conventional word processor was not much help to an organic chemist making up an exam in organic chemistry, where there were more chemical structures than words. If the figures also contained typed text, one was required to type part, draw part, then type on the drawing, then draw, then type, etc.
There also were problems with text processors in that they did not have a wide enough range of special symbols for scientific usage. The highest quality daisy-wheel printers were limited to 96 characters and often could not stop in the middle of a manuscript to change print wheels. Thus one was required to draw these special characters in the manuscript manually. This of course was another source of error and omission in preliminary drafts. Finally, the special symbols that were available were generally printer dependent, making it difficult for someone with a different printer to obtain the same symbols.

1.2. METHODS OF OBTAINING GRAPHICS

1.2.1. Character Matrix Graphics

Several approaches to incorporating graphics into word processors have been taken. The first method is to create a special character set containing lines at special angles which can be used in building chemical structures, boxes, etc. T3, Volkswriter Scientific, The Egg, and many others use this method, the basis of which is illustrated in Fig. 1.1.

[Fig. 1.1. Special characters for creating structures in a 7x9 dot matrix.]

In order to enter a chemical structural diagram, one must select the right building block characters and position them in the right character cell positions so they meet and create the desired diagram. Generally such systems provide methods for recording keystrokes so an entire diagram can be called up as a macro with one key. Often one finds a particular diagram simply cannot be created from the given characters. The six-membered ring A in Fig. 1.2 is reasonably equal-sided, but it is impossible with certain scientific text processors to generate the same diagram rotated by 90 degrees; instead one gets a squashed diagram as shown in B in Fig. 1.2.

[Fig. 1.2. Character matrix graphics are simple, but limited. One cannot always make an attractive looking structural diagram. Panels: A, B.]

The advantages of character matrix graphics are that the method is simple for the software vendor to add to a normal text processor, and that it can be fast because the graphics are simple characters.
The disadvantages are that this method cannot generate all the diagrams one needs, and that it is very difficult to create diagrams in this way. It is basically like solving jigsaw puzzles. The diagrams cannot be scaled and cannot be easily modified, e.g., to change a six-membered ring to a seven-membered ring.
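The character-cell approach described above can be illustrated with a minimal Python sketch. It is not taken from any of the products named in the text: the glyph set (`/`, `\`, `|`), the `HEXAGON_MACRO` list and the helper names are hypothetical stand-ins for a vendor's special character set and recorded-keystroke macro. The sketch places building-block characters into fixed cells and replays the same "macro" at two positions.

```python
def blank_grid(rows, cols):
    """Create an empty page of character cells."""
    return [[" "] * cols for _ in range(rows)]


# A recorded "macro": (row offset, column offset, glyph) for each cell
# of a small six-membered ring. The layout is an assumption for illustration.
HEXAGON_MACRO = [
    (0, 1, "/"), (0, 2, "\\"),
    (1, 0, "|"), (1, 3, "|"),
    (2, 1, "\\"), (2, 2, "/"),
]


def stamp(grid, macro, top, left):
    """Replay a macro with its upper-left corner at (top, left)."""
    for row_offset, col_offset, glyph in macro:
        grid[top + row_offset][left + col_offset] = glyph


def render(grid):
    """Join the cells into printable lines."""
    return "\n".join("".join(row).rstrip() for row in grid)


grid = blank_grid(3, 9)
stamp(grid, HEXAGON_MACRO, 0, 0)
stamp(grid, HEXAGON_MACRO, 0, 5)
picture = render(grid)
print(picture)
```

Because every stroke must occupy a whole cell, the ring can only be moved in whole-cell steps, and a scaled or rotated ring would need a different set of glyphs, which is the limitation the text describes.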
1.2.2. Inclusion Graphics

Some text processors are able to leave space for a diagram and show the empty space on the screen, but the graphics are contained in an include file that is read at print time. Mass-11, Unilogic's Scribe™, FinalWord™, and various extensions to DEC's All-in-One™ system can include graphics in this way. The include file must be intelligible to the printer; consequently this method is very device dependent. The word processor really does not "understand" the contents of the include file, it simply sends it to the printer "as is". This is not a WYSIWYG approach, i.e., what you see on the screen is not what you get. A further difficulty is that one document consists of many files, thus complicating transmission of a document across communication lines. Again, one may receive a document without some of the figures, or with some out-of-date figures, even though technically the figures are integrated into the document (really only at print time).

1.2.3. Formatter Graphics

The very sophisticated TeX formatter system can represent graphics as a series of vectors and other graphic instructions which one must code into the manuscript, much as one would give instructions to do graphics in a BASIC program. TeX is not a WYSIWYG system. One only sees the results when the manuscript is "printed" to a printer or to the screen. It is most laborious to create a graphical image in this way. No drawing tools come with the TeX system at the time of this writing, although it would not be difficult to create a stand-alone system for this purpose. The major advantage of TeX is that it handles a very wide range of printing devices. Another advantage for mathematical equations is that it has built into its procedures the wisdom of Knuth regarding the best way to format mathematical equations. For example, in order to generate equation 1 using TeX, one types and sees on the screen the following coding:

$$m\circ n = \sum_{b=1}^q \sum_{c=1}^r F_{j_b+k_c} \eqno(1)$$

The main point here is that to the typist or anyone looking at the computer file, it is not at all like looking at the final output, and doing chemical structures in TeX is even more complex. TeX is difficult to learn and very slow in execution on a PC.

Equation 1 as it appears in print:

$$m \circ n = \sum_{b=1}^{q} \sum_{c=1}^{r} F_{j_b + k_c} \qquad (1)$$
1.2.4. On-Screen Integrated Graphics

The most flexible systems allow the scientist to draw images with a pointing device such as a "mouse", with freedom to scale the diagram and then insert it into the document and see the graphic on the screen with the text. From the beginning, the Macintosh, with the combination of two programs, MacDraw and MacWrite, indirectly provided this capability through transfer of graphics via the Clipboard. The "standard graphics toolbox" of the Macintosh encouraged developers to provide the ability to transfer graphics via the Clipboard. On large Unix-based systems, the Interleaf system also integrates graphics with text in a very general manner. On the IBM PC family, ChemText™ provides this capability in one program. (ref. 3) ChemText was developed by chemists for chemists to meet the extra demands of the graphically oriented field of chemistry and to integrate the information flow in the chemical laboratory and office. Because ChemText is unique in its degree of integration, it makes an appropriate model to study in detail. Let us then focus our attention in the next sections on how ChemText integrates graphics with text.
1.3. CHEMTEXT GRAPHICS

ChemText operates on an IBM PC or clone with 640 Kbyte memory, a mouse, and a variety of printers ranging from Epson to HP LaserJet and PostScript printers such as the Apple LaserWriter, NEC LC-890, and Linotype. Most importantly, ChemText allows one to create drawings using arbitrary vectors that one "draws" using a pointing device such as a mouse or tablet. Thus, any diagram that one can draw on paper can be drawn in ChemText in a natural way. Drawing takes place in the "Main Menu" or in the "Molecule Editor," "Form Editor," or "Reaction Editor" (see Fig. 1.3). Mathematical equations such as equation 1 can also be entered in the "Main Menu" drawing area. On screen, equation 1 appears exactly as it does in print.
[Fig. 1.3. Diagram of the ChemText editors. Recoverable labels: Document Editor (9 windows), Main Menu, Form Editor, Form Text Editor.]
1.3.1. Scientific Fonts

ChemText provides the wide range of special symbols and fonts for scientific documents illustrated in Fig. 1.4. These characters appear on screen and print on every supported printer. When scientists exchange electronic ChemText manuscripts they will still see the proper symbols, even on different printers. In contrast, Macintosh manuscript fonts can get switched when transferred to another Macintosh system with a different arrangement of fonts in system files.
[Fig. 1.4. The scientific fonts available in ChemText that can be printed on any supported printer.]
6
In addition to these proportional spacing fonts, there is a fixed-pitch typewriter font in ChemText 1.2 that is not shown here. There is also a wide range of text sizes as illustrated in Fig. 1.4a.
(Type samples: point sizes 6, 8, 10, 12, and 18 in Times Roman and Helvetica, together with undersized Times Roman and Large Roman.)
Fig. 1.4a. The large range of type sizes in ChemText facilitates making slides, posters, and interesting figures. Both normal and bold are shown.

1.3.2. Drawing General Shapes
One can draw arcs, circles, lines, arrows, boxes, and text. The style of pen can be changed, analogous to changing pen points with a Leroy Lettering Set. The style of arrows, lines, and boxes can also be varied (see Fig. 1.5).

Fig. 1.5. Pen, line, arrow, and box styles available in ChemText. Any combination can be selected.

Special options are available to require lines to be tangent, perpendicular, at 45°, or to meet exactly at a point. These special conditions are important because the resolution of the color graphic screen is only 320 pixels wide by 200 pixels high, or approximately 40 pixels/inch, whereas the printing resolution is 300 x 300 dots/inch. One cannot rely on the human eye looking at the screen to make sure lines meet, and after all, why should one? It is simple to say "make these lines meet" and the computer has the means for making this happen! Such features were previously found only in CAD systems such as AutoCad. ChemText has sufficiently good CAD features that many in chemical engineering are creating small engineering drawings (Fig. 1.6) with it using a special set of templates created by this author.
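The arithmetic behind this argument can be sketched in a few lines. The following is not ChemText code: the helper names (printer_dots_per_screen_pixel, snap_endpoints) and the tolerance parameter are assumptions made for illustration only. It shows that one screen pixel at about 40 pixels/inch spans several printer dots at 300 dots/inch, and how a "make these lines meet" rule might be enforced in coordinates rather than by eye.

```python
from dataclasses import dataclass

SCREEN_PPI = 40     # approximate screen resolution quoted in the text
PRINTER_DPI = 300   # laser printer resolution quoted in the text


@dataclass
class Line:
    """A line segment in drawing coordinates (inches, by assumption)."""
    x1: float
    y1: float
    x2: float
    y2: float


def printer_dots_per_screen_pixel(screen_ppi=SCREEN_PPI, printer_dpi=PRINTER_DPI):
    """Number of printer dots covered by one screen pixel along one axis."""
    return printer_dpi / screen_ppi


def snap_endpoints(a, b, tolerance):
    """Hypothetical 'make these lines meet' helper.

    If the start of b lies within `tolerance` of the end of a, move the
    start of b exactly onto the end of a and return True; otherwise leave
    both lines unchanged and return False.
    """
    dx = b.x1 - a.x2
    dy = b.y1 - a.y2
    if (dx * dx + dy * dy) ** 0.5 <= tolerance:
        b.x1, b.y1 = a.x2, a.y2
        return True
    return False
```

With a tolerance of one screen pixel (1/40 inch), two lines that look joined on screen but differ by a hundredth of an inch are forced to share an endpoint, so the gap cannot reappear at printer resolution.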
Fig. 1.6. Sample chemical engineering drawing done using ChemText with engineering template set. This was created on a 9" CGA display.

1.3.3. Drawing Molecules
The molecule editor is a chemically responsible drawing system. It "knows" the periodic table, the valence of atoms, that bonds end at atoms so all bonds to an atom automatically meet at a point, and can add or remove hydrogens. Templates are available for common ring systems, functional groups, and chains. The editor "knows" how to join these by joining nodes, or by joining edges (fusing). It also has an algorithm called "CLEAN" to make the structure symmetrical with equal length bonds, etc. Once a chemical structure is composed, it can be saved as a molecular connection table, called a MolFile. (ref. 4) This file is a valid chemical representation of the molecule that can be used for computing molecular energies via other programs. (ref. 5,6) Or the structure can be transported back to the "Main Menu" and composed with other such molecules together with arcs, boxes, ellipses, lines, arrows, and text to make a figure. Fig. 1.7 shows such a composite image.
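To make the idea of a connection table concrete, here is a minimal sketch. It is not the MolFile format (which is proprietary, ref. 4) and not the Molecule Editor's algorithm; the class ConnectionTable, its methods, and the small valence table are all assumptions chosen for illustration. It shows how storing atoms and bonds, rather than a picture, lets a program work out how many hydrogens each atom needs.

```python
# Typical valences for a few elements. This short table is an assumption
# for the sketch; a real editor would consult a fuller periodic table.
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


class ConnectionTable:
    """Toy connection table: a list of atoms plus a list of bonds."""

    def __init__(self):
        self.atoms = []   # element symbols, indexed from 0
        self.bonds = []   # tuples (atom_index, atom_index, bond_order)

    def add_atom(self, symbol):
        """Append an atom and return its index."""
        self.atoms.append(symbol)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Record a bond of the given order between atoms i and j."""
        self.bonds.append((i, j, order))

    def bond_order_sum(self, i):
        """Sum of the orders of all bonds that end at atom i."""
        return sum(order for a, b, order in self.bonds if i in (a, b))

    def implicit_hydrogens(self, i):
        """Hydrogens needed to fill the typical valence of atom i."""
        missing = TYPICAL_VALENCE[self.atoms[i]] - self.bond_order_sum(i)
        return max(missing, 0)

    def formula_counts(self):
        """Element counts, including the implicit hydrogens."""
        counts = {}
        for i, symbol in enumerate(self.atoms):
            counts[symbol] = counts.get(symbol, 0) + 1
            hydrogens = self.implicit_hydrogens(i)
            if hydrogens:
                counts["H"] = counts.get("H", 0) + hydrogens
        return counts
```

Entering ethanol as three heavy atoms (C, C, O) joined by two single bonds is enough for formula_counts to report two carbons, six hydrogens, and one oxygen, which is the kind of bookkeeping a chemically aware editor can do and a plain drawing program cannot.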
1.3.4. Image Insertion
Whatever is on the "Main Menu" screen is eligible to be an image. A movable border is positioned to select that part of the screen that will be used. The bordered region can then be saved as an image. The F7 function key transports us to the "Document Editor" where one enters normal text (Fig. 1.3). With the cursor positioned at the desired place, one can select "Insert Image", either by typing <Ctrl-I><Ctrl-I> or by pointing at the word "Insert" and then pulling down the "Insert Menu" to "Image". Whatever image was on the Main Menu within the boundary is inserted into the document at the current cursor location.
Fig. 1.7. Illustration of overlaying orbitals, text and arrows on a chemical structural diagram.

Images can be inserted in a number of different ways:

Fill-in: Image floats to next page if there is not room on that page and text fills the remainder of the page.
No-Fill-in: If there is not room for the image on this page, a page break occurs and text does not fill out the page.
In-Line: The image is inserted in the line of text and text continues on the other side of the image.
Page: The image is a full page.
Rotated Page: The image is a full page rotated by 90 degrees.
When you insert the image, the text splits apart just the correct distance for the image being inserted and you see the text and the image. The image scrolls with the text. In fact, images behave exactly like text--you can copy text containing an image, move it, or delete it. Images can even be placed in headers and footers. Even completely textual entities such as tables can be advantageously inserted as a figure, because as a figure, it can be stretched horizontally or vertically by a simple mouse dragging motion. Chemical structural diagrams or mathematical equations can then also be entries in the table. Mathematical equations can be easily composed in the Main Menu where one can move symbols with a mouse until the equation looks right. This author knows of no other system that gives that flexibility. Equation 1 looks on screen exactly as it does in print. It is WYSIWYG: you do see on the screen exactly what will be printed. To modify an image, one simply positions the cursor on the image icon and hits F7 to return to the Main Menu with the old image. Changes can then be made and one returns via F7 to the Document Editor. One simply deletes the old image with the "Del" key and inserts the new one in its place.
1.3.5. ChemText Images Are Computable
When one inserts an image of a molecule from the Molecule Editor into a ChemText document, one actually inserts a computable representation of the structure. As this is a new concept, let me illustrate with an example. Suppose you have a ChemBase data base of your compounds (see Fig. 1.8). You pull out the structure of your new compound and insert it into your manuscript, which you send me on a disk or in an electronic message as a single ChemText file. First, I can read your paper on my CRT screen without ever printing it. But relevant to the point, I can place the cursor on the icon of the figure of your molecule, carry that image back to the Main Menu, and write a MolFile, (ref. 4) then send that MolFile to PRXBLD, (ref. 7,8) MM2, CHEMLAB, (ref. 9) and calculate an energy, or display the molecule in various ways, etc. Thus, your document contains not only pictures of the molecules or reactions, it contains computable representations that can be used for further research, edited, or stored in a molecule or reaction data base. Similarly, I could pull out the image of a reaction, write a RxnFile (ref. 10), and use it to search the current literature using REACCS (ref. 11) on a mainframe computer or ChemBase (ref. 12,13) on a PC.

Fig. 1.8. ChemBase form transferred directly into ChemText. Transferring molecules from data base to report eliminates errors in redrawing. (Form fields shown: Name, Eperuane-8,15,18-triol; formula, C20H38O3; ID, 4202-027; Date, 7/30/86; Ref, Tet., 1965, 1175.)
1.3.6. Image Import
So far, we have only discussed molecules and reactions or other diagrams that the chemist might draw. This, however, neglects the fact that many images come from instruments as spectra of all kinds: MS, NMR of 1H and 13C, UV, IR, etc. (ref. 14) Images also come from other programs such as SPACFIL, (ref. 15) ORTEP, (ref. 16) PLUTO, (ref. 17) CHEMLAB, (ref. 18) ADAPT, (ref. 19) synthesis programs, and even Lotus-1-2-3 and instrument data analysis programs. ChemText provides utilities that allow one to capture images from laboratory instruments, (ref. 14) Lotus-1-2-3, AutoCad, and data analysis programs such as RS/1™. (ref. 20) These utilities convert the drawing commands of the source program into the drawing commands used by ChemText. Molecular Design Limited defined the MDL MetaFile as an intermediate representation of graphical information and ChemText accepts this format. Figure 1.9 shows a proton NMR spectrum imported into ChemText.
Fig. 1.9. Proton NMR spectrum imported directly into ChemText. Chemical structures can be superimposed along with text, arrows, boxes, circles, etc.

One can thus capture a spectrum, convert it for ChemText, read it in and overlay a structural diagram, arrows, boxes, text, etc., to make interpretation of the spectrum easier for the reader. (See Fig. 1.9) Similarly, one can generate a space-filling model, or surface diagram in CHEMLAB and incorporate that as an image. (see Fig. 10)
Fig. 1.10. Drawing captured from CAS Online, processed by CASKit-1 and directly imported into ChemText and scaled.

Diagrams from on-line searches of CAS (ref. 21) and DARC (ref. 22) can also be utilized as images in a manuscript completely electronically through a utility called CAS-Kit™. (ref. 23) Literature structures can be directly imported without human redrawing, thus eliminating that particular source of errors. Many internal reports are generated with pasted-up dot-matrix prints of graphic screen dumps, but by going through ChemText, printing can be done on a laser printer, generating much higher quality structural diagrams (see Fig. 1.11).

Fig. 1.11. Figure captured from DARC online and processed by CASKit-1 to permit direct import of the graphic into ChemText.

1.3.7. The Integrated Chemical Text Processor
Fig. 1.12 presents an overview of the role that a truly integrated document processor can play in interfacing the chemist with chemical information from a variety of sources including data bases, chemical instruments, the literature, other scientists, electronic journals, data analysis programs, computational chemistry programs, and the chemist's own creativity to form a composite document. The document is a single file, but can contain computable objects such as molecules and reactions, as well as spectra, etc.

Fig. 1.12. Overall interaction of chemist with all sources of chemical information through an integrated system like ChemText. (Sources shown around the chemist: RS/1, Lotus, DARC, CAS, AutoCad, REACCS, MACCS, CNMR, CHEMLAB, ChemBase, E-mail, Journal (TCM).)
1.4. THE IMPACT OF SCIENTIFIC INFORMATION PROCESSORS ON SCIENCE I was pleased that these innovations could finally simplify the task of writing research publications. Everything one could want in document processing had been delivered. It was not surprising that document preparation was now simpler and faster and that how I wrote was changed. I was surprised that these innovations actually changed what I wrote. Now it is obvious: make the use of figures as convenient as the use of words and one tends to use more figures than before. The resulting document is shorter, more interesting to the reader, and more interesting to the writer. 1.4.1. Impact on Authors When scientists have sophisticated tools for drawing that can directly be incorporated into their manuscripts it often means scientists create their own figures without intervention of graphic
artists. Since the figures are directly inserted by the scientist, figures do not get inserted into the wrong place. This directness clearly eliminates sources of errors, speeds results, and minimizes expenses. The ability of the scientist to interact with an image as it is designed also enhances the scientist's creativity and control of the graphics. However, the scientist now needs the knowledge of how to create attractive diagrams without professional assistance. Fortunately, scientists obtain this from reading textbooks and the literature. They generally have a sense of what "looks right". But the large range of fonts, emphasis, and characters available requires self-discipline if one is to avoid an unpleasant-looking document. With integrated graphics, intermediate drafts always contain all the graphics and figures. Even if one copies just a part of the manuscript, one will obtain that part with its integrated graphics. This assists proofreading and rewriting, because one has the "whole document" integrated. Mathematical equations present a particularly interesting situation. The major difficulty has been indicating to the typesetter in an unambiguous way which character to select. When the author is creating the camera-ready copy, this communication problem is eliminated, as is the need for time-consuming and difficult proofreading of galley proofs, but it moves the responsibility of making the equation attractive in spacing and placement from the publisher to the author. This burden has already been assumed by organic chemists for chemical reactions and chemical structures, and camera-ready copy is expected by author and publisher alike. Tetrahedron and Tetrahedron Letters, two very popular organic chemistry journals, are now exclusively camera-ready-copy (CRC) based. The ever increasing symposium series books are also CRC. Thus it appears that chemists are willing to accept this responsibility.
In section 1.4.4, we will discuss additional innovations that minimize this burden. An obvious impact is that decisions that used to be made by copy editors at the publisher now fall on the author and the staff assisting the author. If scientists do their own drawing and typing, what does their secretary do? Will the secretary be motivated to keep pace at the same rate as the scientist? Since retyping is eliminated, a higher percentage of a secretary's effort goes to formatting, style, and higher level functions. 1.4.2. Impact on Publishers
Much of the gain in desk top publishing is lost if the publishers proceed to treat the hard copy in the same way as a typewritten copy, namely to rekey the text and recreate the figures. The results of an ACS survey of ACS authors indicated that 1) in generating drafts for their manuscripts, 59% of the authors use word processing systems, and 2) 93% of the authors produce some of their manuscripts in final form by some type of word processing system. (ref. 24) The types of computers used were 71% microcomputer, 12% stand-alone word processor, 6% minicomputer, 4% mainframe computer. The first experiment by the ACS required the authors to enter very low level typesetting codes. Authors found this cumbersome. In the most recent experiment, reported at the New Orleans ACS National Meeting, authors submitted manuscripts on disk in their own word processor format. (Ref.
24) A service agency was employed to remove all formatting information, then ACS staff reinserted formatting information in the ACS format. The report concluded that it was more expensive to receive papers on disk than to key in the papers from hard copy in the first place. (ref. 24) This cost evaluation did not take into consideration the fact that software exists to convert between word processors preserving formatting information, the cost of errors introduced in rekeying, or the cost to the author of proofreading. The authors and readers are often overlooked in publisher cost studies. Thus it is not surprising to see authors leading publishers in innovation. Clearly, if the author's electronic manuscript could be used directly in the publication process, one could achieve maximum efficiency, minimum publication times, minimum cost, and minimum number of errors. Obviously this would change the nature of jobs in the publishing industry, which may explain why faster progress has not been observed. Typesetting can be done with minimal training, whereas handling computer-readable manuscripts requires significantly higher levels of training. In the field of computer science, where communication via electronic networks is common, collaborative generation of reports, papers, and proposals has occurred, primarily through the use of formatters such as PUB, SCRIBE, or TROFF. Throughout its history the Stanford SUMEX project routinely received contributions to its annual report via electronic mail in PUB and more recently SCRIBE. In mathematics, Pergamon Press has begun Applied Mathematics Letters which encourages authors to submit manuscripts as a TEX computer file using the AMS-TEX macro package with the amtr.sty style file.
1.4.3. Impact on Readers
The largest impact on readers is that they will be receiving much "fresher" journals as the publication time is reduced. This will hasten the feedback cycle and interaction of research groups. A second impact is that they will receive materials that were simply not available before, e.g., executable programs, three-dimensional models, source code, and parameter sets in a form directly usable by them. Readers will essentially receive "more useful" information faster and with fewer errors than they currently receive.

1.4.4. Impact on Journals
In chemistry, Pergamon's Tetrahedron Computer Methodology (TCM) (first issue July 1, 1988) is the first journal to specifically request computer-readable manuscripts as the preferred mode of submission. TCM accepts ASCII, ChemText (PC), and Microsoft Word (Macintosh) submissions. Algorithmic formatting assures consistent style throughout and tools are now available to relieve the author from worrying about style of front material, headings, references, and bibliographic format and punctuation. (Ref. 25) TCM is the first scientific journal to publish simultaneously in hard copy and electronic media. Executable programs, data sets, source code, molecular coordinates, execution traces--all can be published in the electronic form as part of a paper appearing in the hard copy form. This new journal enables publishing information that is not
possible or not practical to publish in hardcopy. TCM provides information in a form most useful to the reader. Computer chemists can actually now publish a true experimental section from which other scientists can reproduce their results. Readers will better understand what research was actually done, and will be able to build on it. It is quite an exciting development that can dramatically improve the field of computer chemistry. (Ref. 26) Ray Dessy gave a futuristic scenario of electronic publishing (Ref. 2) that has now largely come true with Tetrahedron Computer Methodology. A very complex organic chemistry book was completely typeset with desk-top publishing. (Ref. 27) Technology will not be the rate-limiting ingredient, but rather human traditions and habits. The reader is directed to Refs. 2 and 26 for an overview of other aspects of technology such as CD-ROM.
1.5 REFERENCES
1 The camera-ready copy for this paper was prepared from one ChemText file without any cut, paste, or retouch. The display used was a 9" Color Graphic Adapter, the lowest resolution available today. A PostScript NEC LC-890 laser printer with resolution of 300x300 dots per inch was used to generate the CRC.
1a Wipke, W. T. "Evolution of Molecular Graphics". In Graphics for Chemical Structures. Integration with Text and Data; Warr, W. A., Ed.; American Chemical Society, Symposium Series No. 341: 1987, pp 1-8.
2 Dessy, R. A. "Scientific Word Processing". Chemometrics and Intelligent Laboratory Systems 1987, 1, 309-319.
3 Snider, B. "ChemText. Version 1.10". J. Am. Chem. Soc. 1987, 109 (23), 7240-7241.
4 The MolFile format is proprietary to Molecular Design Limited.
5 Wipke, W. T. "Molecular Design's Integrated System for Drug Design". In The Aster Guide to Computer Applications in the Pharmaceutical Industry; Aster Publishing: Springfield, Oregon, 1984, pp 149-166.
6 Wipke, W. T. "Computer Modeling in Research and Development". Cosmetics and Toiletries 1984, 99 (October), 73-82.
7 Wipke, W. T.; Verbalis, J.; Dyott, T. "Three-Dimensional Interactive Model Building", Presented at the 162nd National Meeting of the ACS, Los Angeles, August 1972.
8 Wipke, W. T. "Computer-Assisted Three-Dimensional Synthetic Analysis". In Computer Representation and Manipulation of Chemical Information; Wipke, W. T.; Heller, S. R.; Feldmann, R. J.; Hyde, E., Eds.; John Wiley and Sons, Inc.: 1974, pp 147-174.
9 Jacobson, H. S.; Pearlstein, R. A.; Hopfinger, A. J.; Tripathy, S. K.; Orchard, B.; Potenzone, Jr., R.; Doherty, D.; Grigoras, S. "Applications of Molecular Modeling to Problems in Chemical Research and Development". Scientific Computing and Automation 1984, Nov/Dec.
10 The RxnFile format is proprietary to Molecular Design Limited.
11 "Computer System Searches Chemical Reactions". Chem. Eng. News April 1982, 60 (15), 92.
12 Seither, C.; Cohen, P. "Designing New Compounds with a Personal Computer Database". Am. Lab. 1986, 18 (9), 40-47.
13 Cohan, P. "Current Technologies in Chemical Structure and Substructure Searching Using Microcomputers"; Online 87 Information Proceedings, Learned Information, New Jersey, 8-10 December 1987, pp 533-545.
14 Manfre, R.; Dumont, L. "Analytical Spectra in Scientific Documents". Nature 1987, 329, 1143.
15 Smith, G. M.; Gund, P. "Computer-Generated Space-Filling Molecular Models". J. Chem. Inf. Comput. Sci. 1978, 18 (4), 207-210, based on code of Warme, P. K.; Comput. Biomed. Res. 1977, 10, 75.
16 Johnson, C. "Ortep: A Fortran Thermal-Ellipsoid Plot Program for Crystal Structure Illustrations"; Technical Report, ORNL-3794, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA, 1970.
17 Motherwell, S. Adobe Systems Corporation, Postscript Reference Manual, Reading, MA, 1985.
18 Hopfinger, A. J. "Computational chemistry, molecular graphics and drug design". Pharmacy International 1984, 5, 224-228.
19 Jurs, P. C.; Isenhour, T. L. Chemical Applications of Pattern Recognition; Wiley: New York, 1975.
20 Godowski, B. "An Application of Analytical Software in Laboratory Research". Am. Lab. 1986, 18 (9), 98-103.
21 Dittmar, P. G.; Stobaugh, R. E.; Watson, C. E. "The Chemical Abstracts Service Chemical Registry System". J. Chem. Inf. and Comput. Sci. 1976, 16, 111-121.
22 Attias, R. "DARC Substructure Search System: A New Approach to Chemical Information". J. Chem. Inf. Comput. Sci. 1983, 23, 102-108.
23 Available for the IBM PC from DH Limited, 100 Segre Place, Santa Cruz, CA 95060.
24 Love, R. A.; Robinson, S. F. "Submission of Computer-Readable Book Manuscripts", Presented in Symposium on Electronic Submission of Manuscripts for Publication at the 194th National ACS Meeting, Aug 30, 1987, New Orleans.
25 Wipke, W. T. "REFFORM: An Automatic Reference and Bibliography Formatting System". Tetrahedron Comput. Method. 1988, 1 (1), 87-92.
26 Bowman, C. M.; Nosal, J. A.; Rogers, A. E. "Effect of New Technology on Information Transfer in the 1990s". J. Chem. Inf. Comp. Sci. 1987, 27, 147-151.
27 McMurry, S. Study Guide and Solutions Manual for Organic Chemistry; 2nd edition; Brooks/Cole: Pacific Grove, CA. This 532 page book was typeset entirely with ChemText.
ACKNOWLEDGEMENT The author gratefully acknowledges receipt of a Senior Award from the Alexander von Humboldt Stiftung and the discussions with Professor Ivar Ugi during this period.
2 DATABASES AND SPREADSHEETS
Desire Luc MASSART, Nadine Vanden DRESSCHE and Ann Van DESSEL Farmaceutisch Instituut, Laarbeeklaan 103, B-1090 Brussels, Belgium
2.1 INTRODUCTION
Both databases and spreadsheets are ways of computerizing tables, but their purposes differ. Databases are used when the tables are needed for the retrieval of information, while spreadsheets are made for calculation purposes. If someone has a certain amount of data and wants to be able to retrieve information from them (what is the wavelength at which Cu can be measured?), sort them (which β-blockers have a pK higher than 9?), or order them (which lanthanide has the highest stability constant with EDTA?), then he or she will use a database program. With spreadsheets, on the other hand, distributions can be obtained, statistical (and many other) calculations can be performed, and tables can be rearranged easily. Some examples that will clarify the differences in application are given below.
2.2 HOW TO MAKE AND USE A DATABASE WITH dBASE III PLUS

As an illustration, we will construct a database about atomic absorption using dBASE III PLUS. There are many database programs available for microcomputers, but the most popular are certainly dBASE II and the more recent dBASE III PLUS and dBASE IV versions. All are products from Ashton-Tate (ref. 1). The data we want to bring into a table are the recommended data for graphite furnace analysis from the Perkin-Elmer manual (ref. 2). Table 2.1 shows the data for one element as they are present in the manual.
The purpose is to put the data for all the elements in the manual in a computerized database so that one can retrieve information from them in a simple manner. In what follows we will try to give an idea of how one constructs such a database, without giving complete instructions on how exactly to proceed and without discussing all the possibilities of dBASE III PLUS. For such details we refer to the manual.
Table 2.1 Recommended HGA analytical conditions for: THALLIUM

Wavelength (nm):                        276.8
Slit (nm):                              0.7
Tube/site:                              Pyro/platform
Matrix modifier:                        1% H2SO4
Pretreatment temp. (°C):                600
Atomization temp. (°C):                 1400
Characteristic mass (pg/0.0044 A·s):    7.0

Comments:
1 Diluent used to obtain data: 0.2% HNO3.
2 An electrodeless discharge lamp is available for this element.
3 Alternatively, a matrix modifier consisting of 13'% HNO3 may be useful.
The first operation is to decide on the format the data will take in the database. Different formats are possible. They are:
- numerical (symbol n); the wavelength is an example,
- alphanumerical (symbol c), which means that the information is present in words, as is the case for the type of tube, or in mixed words and numbers, as is the case for the modifier,
- logical (symbol l), which means that the value is yes/no (or true/false). Whether or not an electrodeless discharge lamp exists can be treated in this way,
- date (symbol d); day, month and year are entered. The field is always 8 characters long,
- memo (symbol m), which can contain text up to 5000 characters and is stored in a separate file.
The first decision to make when preparing a database is to select the information which will be entered. In our case it is quite clear that we should enter the name of the element, wavelength, slit, tube and site (as separate variables), matrix modifier, pretreatment (ashing) and atomization temperature, characteristic mass, diluent, and availability of an EDL lamp. We can also add one or two columns for additional comments, which would allow us to enter remark 3 from the Comments shown in Table 2.1. The choice of the format is not always evident. For instance, the matrix modifier can be entered as one alphanumerical variable or as two variables, namely an alphanumerical variable with only the name of the modifier and a numerical one for its concentration. Since the concentration has no meaning when it is separated from the name of the modifier, we have decided on the former format (one alphanumerical variable). The first stage of the implementation of any database is to create a master file containing all data. This requires the command create, entering of a file name (here: FLAMELESS), and the specification of the record structure (format of the record). When working with the dBASE program this would mean the following sequence:

dbase       to invoke the dBASE program,
create      this is typed in on the command line (where the pointer is),
flameless   any acceptable file name.
After this the dBASE program will display on the monitor the scheme shown at the top of Figure 2.1. The text put in by the user is at the right of the field numbers, the rest is put on the screen by the program. The input is always typed in the highlighted fields. After the fields constituting together one record have been specified (when the screen shows 14/13 - see Fig. 2.1, bottom) the formatting is completed. One indicates this by simply pushing the RETURN button. This leads to the question: INPUT DATA NOW?
(Lower display of Fig. 2.1; bytes remaining: 3826)

Field  Field Name  Type       Width  Dec
  1    ELEMENT     Character      2
  2    WAVELENGTH  Numeric        5    1
  3    SLIT        Numeric        3    1
  4    TUBE        Character     10
  5    SITE        Character     10
  6    MODIFIER    Character     15
  7    TASHING     Numeric        4    0
  8    TATOM       Numeric        4    0
  9    CHAMASS     Numeric        5    2
 10    DILUENT     Character     15
 11    EDL         Logical        1
 12    REMARK1     Character     50
 13    REMARK2     Character     50

Fig. 2.1 Initial display for creating a database format, waiting for the record structure to be entered (top). After the sequence of requests for a chosen format has been entered, the monitor shows the lower display. At the top of both screens a short description (help) of the commands for editing the formats is given.
If one answers Y, then the monitor displays the questionnaire form as shown in Figure 2.2. Now the data can be entered, first for the first record, which would contain the information about thallium, then for the second record for a second element, etc. When the data for all the elements wanted in the database have been entered, one closes the file, which is then automatically saved.
Fig. 2.2 Display for entering the first record of the newly established database format (fields ELEMENT, WAVELENGTH, SLIT, TUBE, SITE, MODIFIER, TASHING, TATOM, CHAMASS, DILUENT and EDL). The boxes at the top of the screen give a short description of how the data can be entered and modified.
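For readers more familiar with SQL than with dBASE, the create-then-enter sequence can be sketched with Python's built-in sqlite3 module. This is an analogy, not dBASE code: the SQL column types are an assumed mapping of the dBASE field types of Fig. 2.1, the logical field is stored as 0/1, and the diluent text is one assumed way of writing comment 1 of Table 2.1.

```python
import sqlite3

# An in-memory database stands in for the FLAMELESS master file.
conn = sqlite3.connect(":memory:")

# Record structure modelled on Fig. 2.1 (assumed type mapping).
conn.execute(
    """
    CREATE TABLE flameless (
        element    TEXT,     -- character field
        wavelength REAL,     -- numeric, nm
        slit       REAL,     -- numeric, nm
        tube       TEXT,
        site       TEXT,
        modifier   TEXT,
        tashing    INTEGER,  -- pretreatment (ashing) temperature
        tatom      INTEGER,  -- atomization temperature
        chamass    REAL,     -- characteristic mass
        diluent    TEXT,
        edl        INTEGER,  -- logical field: 1 = yes, 0 = no
        remark1    TEXT,
        remark2    TEXT
    )
    """
)

# First record: thallium, with the values of Table 2.1.
thallium = ("Tl", 276.8, 0.7, "pyro", "platform", "H2SO4 1 %",
            600, 1400, 7.0, "HNO3 0.2 %", 1, "", "")
conn.execute(
    "INSERT INTO flameless VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    thallium,
)
conn.commit()

# Read back a few fields to confirm that the record was stored.
row = conn.execute(
    "SELECT element, tashing, tatom FROM flameless"
).fetchone()
```

As in dBASE, the structure is declared once and every later record must fit it, which is why the choice between one modifier field and two has to be made before any data are entered.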
It is possible to add or change the information in the database at a later time. This is done by commands such as:
- APPEND: add more records at the end of the file.
- INSERT: insert more records at some specific place in the file.
- DELETE: delete records; these records can be identified by number or by content. For instance, if for some reason one wants to delete all records that refer to tubes using pyrolytic graphite, one would type in delete for tube = 'pyro'. dBASE does not immediately delete them completely: they can still be recalled at this stage. A pack command is required for permanent deletion.
- MODIFY STRUCTURE: add additional fields.

When one has finalized the database one can go to the second stage, namely the query stage. In this stage one uses the database to retrieve information. This can be done in ways ranging from very simple to rather complex. The simplest way is to apply the list command. There are several ways to do this. For instance, one can type:
use b:flameless
list        (or: display all)
The above commands result in showing on the screen the contents of the complete database. On the screen this information is rearranged in tabular form (Figure 2.3).
Fig. 2.3 Database displayed with a 'display all' command.
Other possibilities for listing parts of the database are:
- list record 3, which will show the content of record No. 3,
- list structure, which will show the structure of the datafile,
- list files on B:, which will show the database files stored on the disk in disk drive B. Alternatively one can write dir B: or even dir B:*.*. In the latter case all the files on B are displayed, including non-database files if those are present.
A more directed search is also possible and, in fact, it is only then that the database program begins to be of interest in practice. For instance, if one requires the information only for thallium, but does not know the record number, one can use the following display or list command:

display for element = "Tl"

All the information about Tl will then be returned. If, however, one is interested only in the ashing and atomization temperatures, one can write:

display element, tashing, tatom for element = "Tl"

and the message on the screen will be:

Tl   600   1400

Some other examples:

display element for edl

returns all the elements for which there is an electrodeless lamp available, while

display element for .not. edl

would return all the elements for which this is not the case, and

display for "H2SO4"$modifier

would return all the information for those elements in which H2SO4 of any concentration is present in the matrix modifier, etc. The "..."$field expression means that the program will search for the word given between the two " signs anywhere within the named field (modifier in our case). A query display for modifier = "H2SO4" would yield no result, since the information in this column is 'H2SO4 1 %'.
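The difference between an equality test and the embedded-string search can be tried out in a short sketch. The code below is plain Python rather than dBASE: the display helper is a hypothetical stand-in for the dBASE command of the same name, Python's == is an exact comparison (used here to mirror the behaviour described above), and the second record, "Xx", is a made-up placeholder rather than real analytical data.

```python
records = [
    # Values for Tl are those of Table 2.1.
    {"element": "Tl", "modifier": "H2SO4 1 %",
     "tashing": 600, "tatom": 1400, "edl": True},
    # "Xx" is a made-up placeholder record, not real analytical data.
    {"element": "Xx", "modifier": "",
     "tashing": 0, "tatom": 0, "edl": False},
]


def display(records, fields, condition):
    """Return the chosen fields of every record that satisfies condition."""
    return [tuple(r[f] for f in fields) for r in records if condition(r)]


# display element, tashing, tatom for element = "Tl"
temps = display(records, ["element", "tashing", "tatom"],
                lambda r: r["element"] == "Tl")

# display element for edl   /   display element for .not. edl
with_edl = display(records, ["element"], lambda r: r["edl"])
without_edl = display(records, ["element"], lambda r: not r["edl"])

# display for "H2SO4"$modifier : substring search within the field
contains = display(records, ["element"],
                   lambda r: "H2SO4" in r["modifier"])

# display for modifier = "H2SO4" : whole-field comparison finds nothing
exact = display(records, ["element"],
                lambda r: r["modifier"] == "H2SO4")
```

Here temps holds the single row for Tl, contains finds Tl because "H2SO4" occurs inside the modifier text, and exact is empty because no modifier field consists of "H2SO4" alone.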
Another option of dBASE is that one can sort the contents of a database file. This means that, depending on the type of the field one wants to sort on, a new database file is created in which the records are reordered alphabetically, chronologically or numerically. Suppose that one wants to reorder the current database file according to the field element. Because element is defined as a character field, a new file will be created with all the records reordered alphabetically. The command to do this is: sort to sortelem on element
where ’sortelem’ is the name of the new sorted database file.
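As a rough analogue of the sort command, the following Python sketch builds a new, reordered list and leaves the original untouched, just as dBASE writes a new file. The records and the helper name sort_to are assumptions made for illustration only.

```python
# Illustrative records; only the field name "element" is taken from the text.
records = [
    {"recno": 1, "element": "Tl"},
    {"recno": 2, "element": "Cd"},
    {"recno": 3, "element": "Au"},
]


def sort_to(source, field):
    """Return a new list ordered on the given field; the source list is not modified."""
    return sorted(source, key=lambda record: record[field])


sortelem = sort_to(records, "element")
print([r["element"] for r in sortelem])
```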
2.3 PROGRAMMING IN dBASE
Until now we have used the database for a very simple purpose, namely to extract information from a single file. However, it is also possible to connect several files. Let us suppose that we want to use dBASE for the following problem. In atomic absorption spectroscopy (AAS), one has to choose between the flame and the (flameless) graphite tube methods. The flame method does not have as low a detection limit as the graphite tube, but it is easier to handle, less prone to interferences and more robust. For that reason the user's strategy will often be to apply the flame method above a certain concentration limit and the flameless method below it. The flame method has its own experimental characteristics and we suppose that we have another database file in which the characteristics of the flame method are given per element. In that case, we would like the consultation to go like this:
- input of the element names and the lowest concentrations to be analyzed in the samples,
- retrieval of the conditions of the flame method for the elements in question,
- checking whether the flame determination limit is reached,
- if it is, a message that the method will be flame AAS, together with the experimental conditions,
- if it is not, a message that the method cannot be flame AAS and that the conditions for flameless AAS are the following.
The following set of commands makes this possible, although without producing the messages:
Commands
   use b:flame
   store 'Tl' to melem
   store 0.02 to mconc
   locate for element = melem
   ? mconc < concen
The first command, use, opens the database file 'flame' on disk drive B; this file contains the characteristics of the flame method. Then the element 'Tl' and the concentration one wants to analyze (0.02 µg/ml) are stored in two memory variables, 'melem' and 'mconc', respectively. Before one can examine whether or not the determination limit is reached, one has to move the record pointer to the record for thallium, which is done with the locate command. The last statement asks dBASE to compare the concentration one wants to analyze with the determination limit for the element (stored in 'concen', a field of the flame database file) and to report whether the expression is .T. (True) or .F. (False). If it is False, the concentration to be analyzed is not below the determination limit and one can obtain the conditions for the flame method by typing the command display. If, however, the concentration is lower than the limit, the flameless method must be used. To obtain the conditions for the flameless method, one then has to open the database file containing the characteristics of the flameless method and use the display command.
A very interesting feature of dBASE is that one can now write a program that carries out a number of commands automatically, and adds the messages. Such a procedure (called program or macro procedure) is given below:
Program
   * Open both database files, each in its own work area
   set talk off
   set echo off
   select 1
   use flameless
   select 2
   use flame
   store '1' to again
   do while again = '1'
      clear
      accept 'What is the element you want to analyze? ' to melem
      accept 'The lowest concentration you want to analyze in the samples: ' to mconc
      locate for element = melem
      * accept stores the answer as text, so val() converts it to a number
      if val(mconc) < concen
         ? 'The conc. you want to analyze is below the limit for the flame method.'
         ? 'You will have to use the flameless method.'
         ? 'The conditions for the flameless method are the following:'
         select 1
         display all for element = melem
      else
         ? 'You can use the flame method.'
         ? 'The conditions are the following:'
         display
      endif
      accept 'Do you want to analyze another element? Y (=Yes) or N (=No) ' to YN
      if YN = 'N'
         again = '2'
      endif
      * Return to the flame file before the next query
      select 2
   enddo
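The decision taken inside the loop can be summarized in a short Python sketch. It is not a translation of the dBASE program: the two dictionaries stand in for the flame and flameless files, the determination limit of 0.5 and the condition strings are hypothetical placeholders, and the function name choose_method is an assumption.

```python
# Stand-ins for the two database files; all stored values are hypothetical.
FLAME = {"Tl": {"concen": 0.5, "conditions": "flame conditions for Tl"}}
FLAMELESS = {"Tl": {"conditions": "flameless conditions for Tl"}}


def choose_method(melem, mconc):
    """Return (method, conditions) for an element and the lowest concentration to analyze."""
    record = FLAME.get(melem)  # plays the role of: locate for element = melem
    if record is None:
        raise KeyError(f"no flame record for element {melem!r}")
    if mconc < record["concen"]:
        # Below the flame determination limit: consult the flameless file
        return "flameless", FLAMELESS[melem]["conditions"]
    return "flame", record["conditions"]


print(choose_method("Tl", 0.02))
print(choose_method("Tl", 1.0))
```

Unlike the dBASE listing, the sketch reports explicitly when an element is missing from the flame file, a case the original program does not handle.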
The language, as can be seen, is specific to dBASE but closely resembles structured BASIC. The main elements of this 'programming' language are:
- the IF ... ELSE ... ENDIF modules, which introduce alternatives: if some condition is satisfied, the program follows a certain route; if not (else), another route is taken,
- the DO WHILE ... ENDDO modules, which carry out the same operation several times. They play the role of BASIC's FOR ... NEXT or FORTRAN's DO loop commands.
The example we have given above was a very simple one and, in fact, we could have decided to store all the information in a single datafile. Even then a program would have been preferable for the retrieval.
However, suppose that the alternatives are not only flame and graphite tube AAS, but that one also has XRF and ICP at one's disposal and that the decision criterion is not only a determination limit, but also the type of sample. Clearly, in this case different files are needed. There will be files per element, but also files per sample type, and a database program will clearly not be of much help. Alternatively, an expert system approach could be tried out.
2.4 HOW TO USE A LOTUS SPREADSHEET
There are many spreadsheets available. One of the best known is LOTUS 1-2-3 (ref. 3) and we will use it to show how spreadsheets can be used.
As an application of the use of spreadsheets, we will make a table for determining the ruggedness by a partial factorial experiment, using a Plackett Burman design (ref. 4). The ruggedness or robustness of a method describes the ease with which it can be transposed to another laboratory without developing large inter-laboratory errors and also, the tendency it has to yield reproducible results in time. A method is not rugged when small variations in experimental parameters have large effects on the response.
For instance, suppose one measures the optical absorption of a solution in a colorimetric procedure. This procedure specifies that the measurement should be carried out at pH 2.5 and room temperature. On verification, it is found that a clearly different result is obtained at pH 2.4 and that differences are also observed between temperatures of 20°C and 25°C. The method will display higher variability between laboratories and between days than expected, because it is not rugged. Ruggedness is measured by imposing small variations on the experimental parameters and recording the results.
Let us suppose we have developed an AAS method and we want to test its robustness. The following seven parameters have been identified as possible sources of variation: amount of water, reaction time, distillation rate, distillation time, n-Heptane, Aniline, and status of the reagent. For short, we will call these 7 variables or factors AoW, Rt, Dr, Dt, n-H, Anl and SoR, and perform a partial factorial experiment consisting of 8 experiments. The partial factorial experiment is described in Table 2.2.
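Table 2.2 Partial factorial experiment for seven factors

Exp.   Factors                                     Measurement
       AoW   Rt    Dr    Dt    n-H   Anl   SoR
1      +     +     +     +     +     +     +       Y1
2      +     +     -     +     -     -     -       Y2
3      +     -     +     -     +     -     -       Y3
4      +     -     -     -     -     +     +       Y4
5      -     +     +     -     -     +     -       Y5
6      -     +     -     -     +     -     +       Y6
7      -     -     +     +     -     -     +       Y7
8      -     -     -     +     +     +     -       Y8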
The + and - signs denote levels of the factors AoW, Rt, Dr, Dt, n-H, Anl, SoR. They are called the nominal and the extreme values, respectively. The nominal value (+) is the value specified in the procedure whose ruggedness one wants to evaluate. For instance, if the amount of water is 2, the nominal (+) value of AoW is 2 and the extreme (-) one could be 5, while the nominal value (+) of SoR is 'new' and the extreme one (-) might be 'used' (Table 2.3). The third experiment, for example, is carried out in such a way that factors AoW, Dr, and n-H take their nominal values while the others are at the extreme level. One observes that the + and - signs do not have their usual meaning, since the value for AoW - is higher than for AoW +.
The object of the spreadsheet is the following. First, we want to fill in the names of the variables and the values of the + and - levels (Table 2.3) in one part of the spreadsheet and have the spreadsheet make the combinations shown in Table 2.2. In practice, this table can easily be made by hand, but one must take great care not to fill in a nominal value where an extreme value is required. The spreadsheet approach avoids such transcription errors.
Table 2.3 Nominal and extreme values (ref. 5) that will be used in our example for spreadsheet manipulation.

Variable              Nominal value (+)   Extreme value (-)
Amount of water       2                   5
Reaction time         0                   15
Distillation rate     2                   6
Distillation time     90                  46
n-Heptane             210                 190
Aniline               8                   12
Status of reagent     new                 used
When one starts LOTUS 1-2-3 (by typing in 123), one is confronted with a blank worksheet or table, which is shown in Figure 2.4. The letters A, B, C, ... AA, AB, ... etc. identify the columns of the LOTUS table while the numbers 1, 2, 3, ... identify its rows. There are 256 columns and 8192 rows available. Each space identified by a column letter and a row number (A1, G45, IE1, for example) is called a 'cell'. The cell IV8192, for example, is the last cell in the lower right corner of the entire table. At the beginning all cells are empty and one can enter data, headings, equations, etc. in each of them.
In order to actually input the data into the cells, one first uses one of the three lines above the worksheet. These three lines are called the control panel. Its first line gives the address of the cell on which the pointer is located, and that cell is highlighted. The contents of that cell are also given. At this stage in the example all cells are still empty, so that no value is displayed on the first line of the control panel. To enter data in a specific cell, one must move the pointer to that cell. This is done by pressing the Right, Down, etc. keys on the keyboard or, for instance, by using the page down key.
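The lettering of the columns follows a simple rule that can be checked with a few lines of Python. The sketch below is an illustration only; the function names column_label and cell_address are assumptions and are not part of LOTUS.

```python
def column_label(index):
    """Convert a 1-based column index into spreadsheet letters (1 -> A, 27 -> AA)."""
    label = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def cell_address(column, row):
    """Combine a 1-based column index and a row number into an address such as G45."""
    return f"{column_label(column)}{row}"


print(cell_address(1, 1))       # first cell of the worksheet
print(cell_address(256, 8192))  # last cell of the worksheet
```

With 256 columns the last column label is IV, which agrees with the address IV8192 quoted above for the lower right corner of the table.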
Fig. 2.4 Lotus initial screen. The three command lines are invoked by typing the slash /, while going back to READY mode is achieved by pressing the Esc key.
There are several possibilities to speed up the movement, such as the GO TO key (F5 key) followed by the address of the cell wanted, for instance GO TO Y136. The next two lines are used when the MENU option is selected by typing in the slash '/'. The main MENU is written on the first of these two lines. At this stage of the example (first you have to type: /) it reads as follows: Worksheet Range Copy Move File Print Graph Data System Quit.
In the right hand corner of the worksheet, next to the control panel, is the MODE indicator. It indicates the current mode of operation (for example, entering data, selecting items from a menu, editing).
When the MODE indicator shows MENU, this means that one can choose from the menu. This is done in a very simple way. On the second line, one of the options will be highlighted and at the beginning this will be 'Worksheet'. At the same time, the third line shows a submenu corresponding to the selected option. If the Worksheet option is highlighted the submenu reads: Global Insert Delete Column Erase Titles Window Status.
When the pointer is moved to other main menu items, the third line each time displays the appropriate submenu. One can call the submenu from the main menu by typing the first letter of the word or by highlighting the option and pressing RETURN.
Let us now first see how we can enter the data for the nominal and extreme values in LOTUS. First, we type in headings (labels) which identify columns and rows. Table 2.3 should look as shown in Figure 2.5. We can put the table anywhere, but in this example we would like it to occupy cells C6-E6, C7-E7, ... C14-E14. This means that, for instance, D6 contains the label Nominal, while E14 will contain the value of the extreme level of variable 7 (SoR). To enter the first label, move the pointer to cell C6, type 'Variable' and press RETURN. In the same way enter the labels Nominal and Extreme in D6 and E6. The worksheet identifies all these names as labels by the fact that they begin with a letter. Labels beginning with a number can also be entered and are then identified by a preceding ' sign. Then one can enter the variable names.
LOTUS contains many possibilities for improving the presentation. For instance, if long names such as 'Amount of water' are entered, the name will be longer than its cell permits and will spill from column C into D. It is then possible to widen column C. We will not explain all the options in LOTUS, but we will explain this one, since it shows how LOTUS works. First one moves the pointer to somewhere in column C. When the final 'widening' command is given, LOTUS automatically assumes that it has to perform this operation for that column. Then press / (the slash). This tells LOTUS that the MODE indicator, which was on READY, has to change to MENU, and it first calls the main menu. In this main menu one should select Worksheet, which is done by moving the pointer to highlight the word Worksheet and then pressing the RETURN key.
The submenu for Worksheet now becomes the main menu and one can select an option in this menu with the pointer. One should select Column. When one does so, the control panel contains the message Set-Width. Now one types the required number of characters, 20 for instance, and after entering this, one automatically leaves the MENU mode and returns to the READY mode.
Fig. 2.5 Display of the LOTUS spreadsheet after the table headings have been typed in.
The next operation is then to enter the data, namely the variable names and for each variable its nominal and extreme value. This is done by moving the pointer to the correct column and typing them in. As an example we shall input data for 7 variables from the literature (ref. 5). After typing the labels and 2 values for each variable (see Table 2.3) the LOTUS display should be as shown in Figure 2.6. One has not yet obtained the table of values to be used in the experimental design, which is the first step of the ruggedness procedure, but just the ordinary table of data.
In the second step, we want LOTUS to tell us which experiments to perform. LOTUS should create Table 2.2, which describes those experiments, automatically, to avoid manual transcription errors, as explained above. First the headings of the table should be made, and to do this we will use one of the most important features of LOTUS, namely so-called ranges. A range is a group of cells. For instance, the column of cells B1 to B7, or the block of cells in the rectangle B to D, 1 to 7, are ranges. Several operations can be carried out on these ranges, for instance copying the range somewhere else in the worksheet, erasing the data in that range, etc.
In the present case, what we would like to do is to copy the variable names present in cells C8 to C14 to the row I6 to O6. To explain how ranges work let us first suppose that we want to copy range C8-C14 into D24-D30. First one moves the pointer to the cell from which the range starts (C8). Then one selects the main menu by pressing the / key. The mode indicator changes from READY to MENU. Then in the main menu one selects the Copy option. The control panel indicates the content of the cell on which the pointer is and also prompts:
Enter range to copy from: C8..C8
The first C8 is what is called anchored (it cannot be changed immediately) but the second is not. By moving the pointer the second C8 changes. In this way one can move the pointer to C14 and then, by pressing the RETURN key, anchor the C8..C14 range. The control panel now reads: Enter range to copy to: C8
This C8 is also highlighted, but not anchored. In this case, one moves to cell D24, so that the control panel states: Enter range to copy to: D24
Pressing the RETURN key now causes execution of the Copy command: the contents of D24-D30 are displayed and are equal to those of C8-C14. What we want to do next is a little more difficult, since a column must be copied as a row. First one moves the pointer to C8, then by pressing / one turns on the MENU mode, selects Range with the pointer and Transpose in the submenu, moves the pointer to C14 and anchors the C8..C14 range by pressing RETURN, then moves to I6 and anchors it by pressing RETURN. This activates the Transpose command, which copies the labels to the required range. Now a heading is made for the experiments. In cell G8 we enter the label Exp.1 and in the cells G9 ... G15 the labels for experiments 2 ... 8. In the cell H6 one puts the heading 'Variable :'. The LOTUS table should be like the one shown in Figure 2.6.
Fig. 2.6 LOTUS screen after typing in 7 variable names and their nominal and extreme values from (ref. 5).
Eight experiments will be carried out altogether. Instead of entering the values, the following entries are made in the corresponding cells (for example, +D11 is typed into the cell L14):

Row   G       H       I      J      K      L      M      N      O
6             Var.:   AoW    Rt     Dr     Dt     n-H    Anl    SoR
8     Exp.1           +D8    +D9    +D10   +D11   +D12   +D13   +D14
9     Exp.2           +D8    +D9    +E10   +D11   +E12   +E13   +E14
10    Exp.3           +D8    +E9    +D10   +E11   +D12   +E13   +E14
11    Exp.4           +D8    +E9    +E10   +E11   +E12   +D13   +D14
12    Exp.5           +E8    +D9    +D10   +E11   +E12   +D13   +E14
13    Exp.6           +E8    +D9    +E10   +E11   +D12   +E13   +D14
14    Exp.7           +E8    +E9    +D10   +D11   +E12   +E13   +D14
15    Exp.8           +E8    +E9    +E10   +D11   +D12   +D13   +E14
The LOTUS table must be filled with the Youden design, but instead of doing this with plus and minus signs, one enters the nominal and extreme values of the experiments. To achieve this, one enters the addresses of the cells in which the required information is stored. Each cell address has to be preceded by a plus sign, because it must be made clear that one enters values and not labels. The pointer is moved to I8 and one types +D8 and presses RETURN. The nominal value for the first variable in the first experiment is now displayed in cell I8. One continues to fill the entire table as shown above. The worksheet can be made easier to read by adding lines that separate the column labels and the numbers. To create a line under the labels one moves the cell pointer to cell C7 and types \ and then =. In LOTUS the backslash (\) serves as a repeating label prefix. Whatever is typed after the backslash is repeated until it fills the cell. After pressing RETURN, cell C7 contains a row of equal signs (=). To continue the double line across the worksheet from cell C7 to cell O7 one can use the / Copy command.
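What the cell references achieve can be illustrated with a short Python sketch: each + or - sign of the design is replaced by the nominal or extreme value of the factor concerned. The level values are those of Table 2.3 as read from the source, the sign pattern is the one implied by the cell references above, and the names DESIGN and build_experiments are assumptions made for this illustration.

```python
FACTORS = ["AoW", "Rt", "Dr", "Dt", "n-H", "Anl", "SoR"]

# Levels as listed in Table 2.3: nominal (+) and extreme (-) values
NOMINAL = {"AoW": 2, "Rt": 0, "Dr": 2, "Dt": 90, "n-H": 210, "Anl": 8, "SoR": "new"}
EXTREME = {"AoW": 5, "Rt": 15, "Dr": 6, "Dt": 46, "n-H": 190, "Anl": 12, "SoR": "used"}

# One string per experiment, one sign per factor, in the order of FACTORS
DESIGN = [
    "+++++++",
    "++-+---",
    "+-+-+--",
    "+----++",
    "-++--+-",
    "-+--+-+",
    "--++--+",
    "---+++-",
]


def build_experiments(design=DESIGN):
    """Replace every sign by the corresponding nominal or extreme value."""
    table = []
    for signs in design:
        row = [
            NOMINAL[factor] if sign == "+" else EXTREME[factor]
            for factor, sign in zip(FACTORS, signs)
        ]
        table.append(row)
    return table


for number, row in enumerate(build_experiments(), start=1):
    print(f"Exp.{number}", row)
```

As in the worksheet, changing a value in the level tables changes every experiment that uses it, so no value has to be copied by hand.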
To create a line between the columns one moves the cell pointer to the column where one wants such a vertical line, e.g. to cell F6, types ' and then | and presses RETURN. To create a line from cell F6 to cell F15 the / Copy command is used. Then one narrows the column with the following string of commands: / Worksheet Column Set-Width, Enter column width: 2
We want to make the worksheet easier to read by separating the columns for the variables by vertical lines. Therefore we first have to insert columns. Move the cell pointer to cell J6 and select / Worksheet Insert Column. The control panel specifies: Enter range of columns to insert: J6..J6; press RETURN.
A column is inserted at the left of column J, and the content of column J moves to column K. Now a vertical line is created in J6 in the same way as was done in F6. The procedure of inserting a column and creating a vertical line is repeated in columns L, N, P, R, T and V. The LOTUS table should then look as shown partially in Figure 2.7.
Fig. 2.7 LOTUS table with Youden experimental design. To obtain the above table, the columns were narrowed, a few labels were shortened and a vertical line was added.
The experimental design is now ready to be carried out and, once this has been done, one can go to the third step in the procedure, namely to determine the effect of each of the variables. These effects are computed by using the equations (ref. 4):
$$D_{AoW} = \frac{Y_1 + Y_2 + Y_3 + Y_4}{4} - \frac{Y_5 + Y_6 + Y_7 + Y_8}{4}$$
or
$$D_{AoW} = \frac{(Y_1 + Y_2 + Y_3 + Y_4) - (Y_5 + Y_6 + Y_7 + Y_8)}{4}$$
i.e. one sums the results for which the nominal value was used for the variable in question, sums the results for which the extreme value was used, and takes the difference between the two. In these equations $D_{AoW}$ is the effect of the amount of water and $Y_1$ is the result of experiment 1. One can consider the difference between a result obtained at the nominal level of AoW and one obtained at the extreme level (for instance $Y_1 - Y_5$) to be one estimate of the effect of AoW and a second such difference (for instance $Y_2 - Y_6$) to be another one. Since the equation contains four such estimates, one divides the result by 4.
Let us now investigate how LOTUS can do this. The first thing to do is straightforward: one creates a column (in W) for the experimental results under the label Result and, when the experiments have been carried out, fills in the results in W8-W15. For the label Effect one creates a row (row 17) and one would like to obtain the effects of AoW to SoR in I17 to U17 under the respective variable labels. Another of the most important features of LOTUS is now used. In I17 one puts the equation for $D_{AoW}$, in M17 the one for $D_{Dr}$, etc. The result will be that as soon as the experimental values needed to compute the equations have been filled in column W, the results of the computations will appear in row 17.
Let us now see how this is done in practice. There are in fact two ways to do it, namely by using the pointer or by typing. The latter has our preference. One first moves the pointer to cell I17. Then one types the pertinent equation using the addresses of the cells in which the required information is stored. For instance, the first element in the equation for $D_{AoW}$ is $Y_1$. This is found in W8 and therefore in the worksheet equation W8 replaces $Y_1$. The equation in I17 should be:
+(W8+W9+W10+W11-W12-W13-W14-W15)/4
and in M17:
+(W8+W10+W12+W14-W9-W11-W13-W15)/4
When this has been typed in and the RETURN key pressed after each equation, the cell displays the result of computing this equation. If one wants to look at the equation in a certain cell, one moves the pointer to that cell and the equation is displayed on the second line of the control panel.
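The same computation can be sketched in Python for all seven factors at once. The sign pattern is the one implied by the worksheet entries for the eight experiments, the results passed in at the end are invented purely to exercise the code, and the function name effects is an assumption.

```python
FACTORS = ["AoW", "Rt", "Dr", "Dt", "n-H", "Anl", "SoR"]

# One string per experiment, one sign per factor, in the order of FACTORS
DESIGN = [
    "+++++++",
    "++-+---",
    "+-+-+--",
    "+----++",
    "-++--+-",
    "-+--+-+",
    "--++--+",
    "---+++-",
]


def effects(results):
    """Return, for each factor, (sum at nominal level - sum at extreme level) / 4."""
    if len(results) != len(DESIGN):
        raise ValueError("one result per experiment is required")
    computed = {}
    for column, factor in enumerate(FACTORS):
        total = 0.0
        for signs, y in zip(DESIGN, results):
            total += y if signs[column] == "+" else -y
        computed[factor] = total / 4
    return computed


# Invented results Y1 ... Y8, for illustration only
print(effects([1, 2, 3, 4, 5, 6, 7, 8]))
```

For the factor AoW the loop reproduces the worksheet formula in I17, and for Dr the one in M17.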
2.5 PROGRAMMING IN LOTUS
In the same way that it is possible to program in dBASE, it is also possible to do this in LOTUS. Such programs are called macros. Macros are based on the principle that all procedures in LOTUS consist of a sequence of keystrokes. A macro is a collection of the keystrokes that make up the procedure one wants to automate. To explain the programming procedure, let us consider again the ruggedness program. Until now, we have made a LOTUS procedure for a specific Youden design consisting of 8 experiments and suitable for 7 variables. However, there are also designs available for other numbers of variables, namely for 4N-1 variables, where N is an integer. Let us consider how to write a macro for the situation where designs with 3 and 7 variables must be possible. The LOTUS program must be able to do what was described above, but additionally it must choose the correct design. Since our purpose is not to make a complete ruggedness program, but to give an idea of how to make a LOTUS program, we will consider only two steps, namely the input of the nominal and extreme values, and the construction of the experimental matrix.
To use the table in a macro one has to save the worksheet as a file. After pressing / File Save, one is prompted to enter the name under which one wants to save the file. Each file must be given a unique name. Type the file name and press RETURN. The mode indicator changes to WAIT; when it changes back to READY, the file is saved. As an example, suppose one wants to carry out a ruggedness test for 3 variables following the Youden design, in the same way as demonstrated before for 7 variables. This worksheet is saved under the file name RUD3V and the first one under the name RUD7V. One begins with a blank worksheet on the screen (go to Worksheet and select Erase).
In cell A1 one types: Ruggedness test for 3 or 7 variables, (Return).
Then one makes the macro procedure as described below. Go to cell IE1 and type the first statement, then move to the consecutive cells in column IE and write the following statements:

IE1:  {goto}AA1~
IE2:  For how many variables do you want to make the ruggedness test (3 or 7)?
IE3:  {goto}AA2~
IE4:  {get AA2}
IE5:  {if AA2="3"}{branch IE9}
IE6:  {if AA2="7"}{branch IE11}
IE7:  {beep}{branch \A}
IE8:  leave empty cell
IE9:  {goto}A1~/FCCERUD3V~
IE10: leave empty cell
IE11: {goto}A1~/FCCERUD7V~
Before one can use this macro, one must give it a name. Macro names always contain two characters. The first is the backslash (\), and the second is a single letter: A, ..., Z. Select / Range Name Create, enter \A as the name, and enter IE1 as the range. It is a good idea to put a label in the worksheet to help remember that this macro was named A. One can put macro names in the cell to the immediate left of the macro. Move to ID1 and enter \A. A macro executes the keystrokes in the cell named by the user and then proceeds to the cell immediately below. It continues going down through consecutive cells, in the same column, until it reaches a blank cell or a cell containing a value entry.
Macros include keystrokes such as RETURN, GOTO, etc. so that the user does not have to carry them out. However, keys such as RETURN, GOTO, DOWN, ... cannot be entered directly in a cell. For each of these keys, there is a key indicator, e.g. the indicator for RETURN is ~, for GOTO {goto}, etc.
In this example, the macro first moves the cell pointer to cell AA1 with the goto command. Here the user is asked for how many variables he or she wants to make the ruggedness test (the answer is supposed to be either 3 or 7). Then the macro moves the cell pointer to cell AA2, where the user has to enter his or her answer, i.e., 3 or 7. The instructions {get ...}, {if}, {branch} and {beep} are advanced macro commands:
- {get cell address} halts macro execution temporarily, prompts for an input and stores the characters entered as a label in the specified cell (AA2 in the above example),
- {if command} conditionally executes the command following the if. If the command within {if command} is true, the macro continues to execute the macro instructions that follow in the same cell. If the command within {if command} is false, the macro executes the macro instructions in the cell below,
- {branch cell address} continues executing the macro instructions located in the specified cell (in the above macro these addresses are IE9 in the first and IE11 in the second example),
- {beep} sounds the computer's bell or tone.
Thus if the user's answer is 3, the cell pointer moves to cell IE9. If the answer is 7, the cell pointer moves to cell IE11. If the answer is neither 3 nor 7, the {beep} command will be executed and the question will be asked again, because the command following the {beep} is the instruction to run the same macro program, named \A, again.
One can also use a macro to execute a sequence of commands. For example, the second part of the command in cell IE9, /FCCE RUD3V, is actually an abbreviation for the series of keystrokes: / File Combine Copy Entire RUD3V. This command incorporates an entire worksheet into the current worksheet at the location of the cell pointer (here A1). File RUD3V is retrieved and the table displayed, ready for data entry. In cell IE11 the file RUD7V will be retrieved and displayed (if the user answered 7). If there is no file named RUD7V the system will display an ERROR message. A macro is invoked by holding down the macro key and pressing the appropriate letter. Therefore one enters in cell A2: Invoke macro A (alt A).
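The control flow of the macro (ask, test the answer, branch, or beep and ask again) can be written out as a small Python sketch. It only mirrors the logic; the function name run_macro and the idea of passing the user's answers as a list are assumptions made so that the example can run without a keyboard.

```python
# Worksheet files named in the text, keyed by the answer the user may give
DESIGN_FILES = {"3": "RUD3V", "7": "RUD7V"}


def run_macro(answers):
    """Return (file to combine, number of beeps) for a sequence of typed answers."""
    beeps = 0
    for answer in answers:
        # Counterpart of {if AA2="3"}{branch IE9} and {if AA2="7"}{branch IE11}
        if answer in DESIGN_FILES:
            return DESIGN_FILES[answer], beeps
        # Counterpart of {beep}{branch \A}: signal the error and ask again
        beeps += 1
    raise ValueError("no valid answer (3 or 7) was given")


print(run_macro(["3"]))
print(run_macro(["5", "x", "7"]))
```

A dictionary lookup replaces the two {if} cells, so supporting a further design would only require one more entry.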
The program is then saved under the name RM (ruggedness test macro).
2.6 CONCLUSION
Programs such as dBASE II, III PLUS, or IV, and LOTUS allow the user to work easily with databases and to perform calculations that can be summarized in tabular form, respectively. It should be noted that it is also possible to connect these programs to other software. For instance, to use SPSS/PC+ (ref. 6), one can enter the data first into a spreadsheet and perform corrections in this spreadsheet. Another possibility is to use dBASE as a receptacle of knowledge which can be changed by the user, and to connect this to an expert system in which the knowledge that cannot be changed by the end user is embedded.
2.7 REFERENCES
1 dBASE III PLUS, dBASE IV, Trademarks by Ashton Tate Corporation, Colorado, USA, tel. 1-800-137-3329.
2 Zeeman/3030 Atomic Absorption Spectrometer, Operator's Manual, Perkin Elmer.
3 LOTUS, Trademark by the Lotus Development Corporation, Colorado, USA, tel. 1-800-345-1043.
4 D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman, 'Chemometrics: A Textbook', Elsevier, Amsterdam, 1988, 20, p. 102-106.
5 T. Grant, A. Vernimont, 'Use of Statistics to Develop and Evaluate Analytical Methods', AOAC, Arlington, USA (1985), example on p. 80, Table 20.
6 SPSS/PC+, Trademark by the SPSS Inc., 444 North Michigan Avenue, Suite 3000, Chicago 60611, Illinois, tel. 1-312-319-3315.
3 PRINCIPAL COMPONENT ANALYSIS OF CHEMICAL DATA
Kurt VARMUZA and Hans LOHNINGER Technical University of Vienna, Institute for General Chemistry, Lehargasse 4/152, A-1060 Vienna, Austria
3.1 INTRODUCTION
A trend towards more complex problems and the availability of automated instruments are novel aspects of modern scientific research. When complex problems are investigated it is usually necessary to characterize an "object" (e.g. a sample, a reaction, a fact) not by one parameter (measurement, feature) only but by several parameters. The aim of the investigation is often to obtain a better insight into the problem treated, in a qualitative rather than a quantitative manner. In chemistry such demands for an exploratory data analysis frequently arise in connection with analytical work on complex samples, e.g. environmental samples, and also in the field of structure-property relationships. With modern instruments, sometimes called intelligent, a great amount of data can easily be obtained from samples. The bottleneck in this work is the data interpretation. The discipline of chemometrics provides a number of methods to deal with such problems. In recent years many of these methods have become available for PCs, either in statistical software packages or in specific software developed by chemists. A variety of chemometric methods is, in principle, now available at the chemist's own desk. Although the chemist usually is not an expert in multivariate statistics, he/she is forced to use such methods because of the complexity of actual problems in chemistry.
In this introductory chapter some fundamentals of interpretation and processing of multivariate data are outlined. Because of the limited space we focus on a user-oriented description of basic aspects of principal component analysis (PCA). PCA is an excellent tool for exploratory data analysis in chemistry. A number of surveys on the subject have already been published and it is strongly recommended to refer to a selection of them (ref. 1-7).
3.2 MULTIVARIATE CHEMICAL DATA
Multivariate data is defined as follows: in a set of n objects each object is characterized by p features $x_{ik}$. Optionally, for each object q properties $y_{il}$ may be given (index i: 1, 2, ... n; index k: 1, 2, ... p; index l: 1, 2, ... q).
The features are numeric and they can originate from:
- measured data (concentration of mixture components, peak heights in a spectrum, etc.); or
- the scientific literature (experimental or computed data, e.g. dipole moment, solubility, etc.); or
- computed data (e.g. derived from molecular structure, molecular descriptors such as the number of double bonds, volume of the molecule, etc.); or
- numerical data derived from the data mentioned above.
The property data can be:
- numerical, continuous property data (e.g. chemical activity); or
- category data (a category is a designated group (class) of objects, e.g. ketones, Scotch whisky, etc.).
Such data can be arranged in two matrices as shown in Figure 3.1.
Fig. 3.1 Multivariate data: a matrix of features and a matrix of properties.
This general mathematical scheme can frequently be applied in chemistry. In all the cases shown in Table 3.1, a relationship between features and properties evidently exists but is not explicitly known. In these examples a single feature is not sufficient to assign an object to a certain class; therefore a multivariate interpretation system has to be adopted. Two-dimensional multivariate data (variables X1, X2) can be visualized geometrically; each object corresponds to a point in an X1-X2 coordinate system. If the number of variables becomes higher than 3, an exact visualization of the data structure is not possible, but the concept of data representation is not affected: each object is considered to be a point in a p-dimensional feature space; the coordinates of a point are given by the features x1, x2, ... xp of that object. (Random variables are denoted here by capital letters, actual values by small letters.)
Table 3.1 Typical multivariate data in chemistry.

OBJECTS              FEATURES                          PROPERTY (data interpretation problem)
Samples              Concentrations of components      Origin of samples or
                     and/or analytical data            prediction of a (unmeasurable) property
Chemical compounds   Spectral features and/or          Recognition of molecular substructures and/or
                     molecular descriptors and/or      prediction of a chemical or biological activity
                     physico-chemical properties
The essential goal of the handling of multivariate data is to reduce the number of dimensions. This is not achieved by selecting the most suitable pair of features, but by computing new coordinates through an appropriate transformation of the feature space. In most cases the new variables Z are determined by a linear combination of the features Xk:

$$Z = c_0 + c_1 X_1 + c_2 X_2 + \dots + c_p X_p$$
The coefficients co to cp determine the transformation of the feature space. There is a fundamental requirement for a sensible interpretation of multivariate data: the positions of the objects and their distances in the feature space must reflect the similarities of categories and properties. This leads to the conclusion that the decisive step is the choice or generation of suitable features. A successful feature generation requires the use of chemical knowledge. Mathematical processing alone can seldom lead to useful results. Chemometric methods cannot generate new information, they can only make already present information available or readable. Gaining new, supplementary knowledge can only be achieved by experiments. Typical applications considering chemical multivariate data are illustrated in Figure 3.2.
1. Cluster analysis investigates the existence of natural groups (clusters) of objects (Fig. 3.2a). When clusters can be found, similarities between the members of a cluster have to be established.
2. The following data interpretation problems address objects belonging to two or more categories:
- Is there a possible separation of categories using the chosen features (Fig. 3.2b)?
- Is there a possibility of developing a simple algorithm (classifier), allowing new, unknown objects to be correctly classified (Fig. 3.2c)?
- Is it useful to describe certain classes with the help of simple geometrical models (Fig. 3.2d)? New, unknown objects, located outside the boundaries of the models in question, can then be identified as outliers. The axes of a model can be related to properties.
3. Feature selection methods identify useful features for the problem in question (Fig. 3.2e).
4. Mathematical relationships between the features in question and a continuous property can be investigated, aiming at a better understanding of the system and the possibility of predicting the property for new objects (Fig. 3.2f).
Fig. 3.2 Typical applications using chemical multivariate data (schematically shown for 2-dimensional data): cluster analysis (a), separation of categories (b), discrimination by a decision plane and classification of unknowns (c), modelling categories and principal component analysis (d), feature selection (X2 is not relevant for category separation) (e), relationship between a continuous property Y and the features X1 and X2 (f).
Typical data sets in chemistry contain 20 to 100 objects with 3 to 20 features. This small number of objects is not sufficient for a reasonably secure estimation of probability densities. Hence the application of ’parametric methods’ is not possible. The use of ’non parametric methods’ that make no assumptions about the underlying statistical distribution of data is necessary. These methods, however, do not allow for statements about the confidence of the results.
3.3 DISPLAY OF MULTIVARIATE DATA
The most important method for exploratory analysis of multivariate data is ’reduction of the dimensionality and graphical representation’ of the data. The most widely applied technique is the projection of the data points onto a suitable plane, spanned by the first two principal component vectors. This type of projection preserves (in mathematical terms) a maximum of information on the data structure. This method, which is essentially a rotation of the coordinate system, is also referred to as “eigenvector projection” or “Karhunen-Loeve projection” (ref. 8).
3.3.1 Principal components
The data set A shown in Table 3.2 and Figure 3.3 will be used to discuss some characteristics of principal components from the user’s point of view. The data in Figure 3.3 are mean-centered; translation does not affect the principal components because only variances are considered. The variability of the data set is partly represented by the variance vx1 of feature X1 and partly by the variance vx2 of feature X2:

variance of feature X1          vx1     6.39     74.1 %
variance of feature X2          vx2     2.23     25.9 %
total variance of data (sum)    vtot    8.62    100.0 %
Generally, by rotation of the coordinate system it is possible to obtain directions Z with a higher variance than either vx1 or vx2. The sum of the variances for a direction Z and the orthogonal direction remains constant (and is, of course, equal to vtot). The direction Z having the largest variance is called the first principal component (PC1). The second principal component (PC2) is orthogonal to PC1; for multidimensional data PC2 is defined as the direction of maximal remaining variance. A principal component projection is usually limited to the first two or three principal components.

Fig. 3.3 2-dimensional data set A, mean-centered (numerical values in Table 3.2).

Table 3.2 Data set A.

Object number     x1       x2
1                 2.0      3.0
2                 3.0      4.5
3                 4.0      3.5
4                 4.5      4.5
5                 5.0      6.0
6                 5.5      4.5
7                 6.5      6.5
8                 7.0      6.0
9                 9.0      5.0
10                10.0     8.0
mean              5.65     5.15
variance          6.39     2.23
corr. coeff.      0.79

Figure 3.4 shows the values of the variance for all angles from 0 to 360 degrees (computed in steps of 1 degree). Maximum variance is at an angle of 27 degrees:

variance of first principal component     vPC1    7.94     92.1 %
variance of second principal component    vPC2    0.68      7.9 %
total variance of data (sum)              vtot    8.62    100.0 %
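The rotation experiment behind Figure 3.4 can be repeated in a few lines. The following is only a minimal sketch in Python with NumPy (our choice of language, not one used in the chapter), using the values of data set A from Table 3.2. The helper name `variance_along` is hypothetical, and variances are computed with n-1 in the denominator, which is the convention that reproduces the variances quoted in Table 3.2.

```python
import numpy as np

# Data set A (Table 3.2): 10 objects, features x1 and x2
x1 = np.array([2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.5, 7.0, 9.0, 10.0])
x2 = np.array([3.0, 4.5, 3.5, 4.5, 6.0, 4.5, 6.5, 6.0, 5.0, 8.0])
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                      # mean-centering

def variance_along(angle_deg, data):
    """Variance of the centered data projected onto a direction at angle_deg."""
    a = np.radians(angle_deg)
    direction = np.array([np.cos(a), np.sin(a)])   # vector of length 1
    return (data @ direction).var(ddof=1)          # n-1 in the denominator

angles = np.arange(0, 180)                   # directions repeat after 180 degrees
variances = np.array([variance_along(a, Xc) for a in angles])
best = int(angles[variances.argmax()])       # angle of maximum variance
v_max = float(variances.max())
v_orth = float(variance_along(best + 90, Xc))  # orthogonal direction

print(f"maximum variance {v_max:.2f} at {best} degrees")
print(f"orthogonal direction {v_orth:.2f}, sum {v_max + v_orth:.2f}")
```

Scanning all angles is only practical for two features; it is shown here because it makes the definition of PC1 as the direction of maximum variance tangible.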
Fig. 3.4 Variance of data set A during rotation of the coordinate system (in polar coordinates). Maximum variance vPC1 is obtained for the direction PC1 (at an angle of 27 degrees). vx1 and vx2 are the variances of the original features X1 and X2. PC1 and PC2 are the two principal components of the data.
For multidimensional data it is not possible to determine the direction of greatest variance by experimenting with all possible angles. An eigenvector analysis of the covariance matrix is applied instead. The mathematical background of this method will not be discussed in detail here; a chemist will usually make use of commercially available software for this computation. The coordinates in the rotated coordinate system with the axes PC1 and PC2 will be denoted as U1 and U2; they are linear combinations of all original features (X1, X2, ... Xp) and are called scores. For 2-dimensional data:

$$U_1 = b_{11} X_1 + b_{12} X_2$$
$$U_2 = b_{21} X_1 + b_{22} X_2$$

For data set A:

$$U_1 = 0.887\,X_1 + 0.462\,X_2$$
$$U_2 = -0.462\,X_1 + 0.887\,X_2 \qquad (3.3)$$
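The coefficients in equation (3.3) can be checked with an eigenvector analysis of the covariance matrix of data set A. The sketch below assumes Python with NumPy and uses `numpy.linalg.eigh`; it is one possible way to do the computation, not the software the authors used. Because the sign of an eigenvector is arbitrary, each loading vector is flipped so that its largest component is positive; this convention is our assumption, chosen to match the signs printed in equation (3.3).

```python
import numpy as np

# Data set A (Table 3.2)
X = np.array([[2.0, 3.0], [3.0, 4.5], [4.0, 3.5], [4.5, 4.5], [5.0, 6.0],
              [5.5, 4.5], [6.5, 6.5], [7.0, 6.0], [9.0, 5.0], [10.0, 8.0]])
Xc = X - X.mean(axis=0)                      # mean-centered features

C = np.cov(Xc, rowvar=False)                 # 2 x 2 covariance matrix (n-1 denominator)
eigenvalues, eigenvectors = np.linalg.eigh(C)

order = np.argsort(eigenvalues)[::-1]        # largest variance first
variances = eigenvalues[order]
loadings = eigenvectors[:, order].T          # row j holds the loadings of PC j

# Sign convention (assumption): largest component of each loading vector positive
for j in range(loadings.shape[0]):
    k = np.argmax(np.abs(loadings[j]))
    if loadings[j, k] < 0:
        loadings[j] = -loadings[j]

scores = Xc @ loadings.T                     # column 0: u1, column 1: u2

print("variances of PC1, PC2:", np.round(variances, 2))
print("loadings:")
print(np.round(loadings, 3))
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues can be reported directly as the variances of the principal components.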
Computation of the score U1 for a data point (x1, x2) is equivalent to the projection of that point onto the axis PC1 (Fig. 3.5). The direction of PC1 is defined by a vector with length 1 and the components b11 and b12.
Fig. 3.5 Projection of a data point (x1, x2) onto the principal component axes.
The relation

$$b_{11}^2 + b_{12}^2 = 1$$

is also applicable to all other principal components. The components bjk (k = 1, 2, ... p) of a principal component vector j (eigenvector) are called loadings, factor score coefficients or eigenvector coefficients. The loadings indicate how much a feature is 'loaded into' a principal component, that is, how relevant a variable is for calculating a score. Loadings and scores depend on the scaling of the original data. For the interpretation of results that have been obtained from data with numerous features, it may be useful to represent the loadings of the first two (or three) principal components graphically, as shown in Figure 3.6.
Fig. 3.6 Graphical representation of the loadings of principal components for data set A (loadings of first PC: 0.89 and 0.46; loadings of second PC: -0.46 and 0.89).
A principal component analysis is reasonable only when the intrinsic dimensionality is much smaller than the dimensionality of the original data. This is the case for features related by high absolute values of the correlation coefficients. Whenever correlation between features is small, a significant direction of maximum variance cannot be found (Fig. 3.7); all principal components participate in the description of the data structure; hence a reduction of data by principal component analysis is not possible.
Fig. 3.7 Variance during rotation of the coordinate system for data sets with different intrinsic dimensionality (polar coordinates, values calculated in steps of 1 degree). A: corr. coeff. 0.955, variances x1: 6.39, x2: 6.86, PC1: 12.95, PC2: 0.30. B: corr. coeff. 0.136, variances x1: 6.39, x2: 5.24, PC1: 6.79, PC2: 4.85.
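For two features the effect of the correlation coefficient on the principal component variances can be written in closed form. The worked equation below is a sketch added for illustration: it is the standard eigenvalue formula for a 2 x 2 covariance matrix whose off-diagonal element is r·sqrt(vx1·vx2), with r denoting the correlation coefficient. It is evaluated with the correlation coefficients and feature variances quoted for the two data sets of Fig. 3.7.

```latex
\lambda_{1,2} \;=\; \frac{v_{x1} + v_{x2}}{2} \;\pm\;
\sqrt{\left(\frac{v_{x1} - v_{x2}}{2}\right)^{2} + r^{2}\, v_{x1}\, v_{x2}}

% Data set A of Fig. 3.7 (r = 0.955, v_{x1} = 6.39, v_{x2} = 6.86):
\lambda_{1,2} = 6.625 \pm \sqrt{0.055 + 39.979} = 6.625 \pm 6.327
\quad\Rightarrow\quad v_{PC1} \approx 12.95,\; v_{PC2} \approx 0.30

% Data set B of Fig. 3.7 (r = 0.136, v_{x1} = 6.39, v_{x2} = 5.24):
\lambda_{1,2} = 5.815 \pm \sqrt{0.331 + 0.619} = 5.815 \pm 0.975
\quad\Rightarrow\quad v_{PC1} \approx 6.79,\; v_{PC2} \approx 4.84
```

With a high correlation nearly all of the total variance (12.95 of 13.25) falls on PC1, whereas with a low correlation the two principal components share it almost equally, so dropping PC2 would discard a large part of the data structure. The small deviation in the last digit for data set B arises from the rounded input values.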
Whenever data belongs to a homogeneous category, it is often possible to correlate the score for the first principal component (or another direction in the projection) with a property of the objects (e.g. in investigations of structure-activity relationships). This method is referred to as principal component regression (PCR); it often represents an advantage over the multiple regression method. In multiple regression several features are correlated with a property; when the features are highly correlated with each other, this method often fails (ref. 9, 10). Whenever data belongs to different known categories a principal component model can be calculated for each category. This technique is used in the method SIMCA for classification and modelling; quantitative correlations between the model parameters (axes) and external properties can be established (ref. 9, 10).
3.3.2 Display of a set of objects
The data set B in Table 3.3 contains 3 features (p = 3) and 10 objects (n = 10). The data structure is not evident when the original data are represented in matrix form. Figures 3.8a and 3.8b show the data in plots of two pairs of features (feature/feature-plots); they allow an insight into the data structure. Such graphical illustrations would become too complex for a higher number of features.
Before applying principal component analysis the data have been autoscaled (Z-transformation):

$$x_{\text{new}} = \frac{x_{\text{old}} - m_x}{s_x} \qquad (3.5)$$
mx is the arithmetic mean, sx is the standard deviation of feature X, calculated over all objects. The new feature has a mean of 0 and a standard deviation (and variance) of 1. This transformation is widely recommended; it removes the effect of weighting that arises, due to arbitrary units.
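Equation (3.5) translates directly into code. The following is a minimal sketch in Python with NumPy; the function name `autoscale` and the small example matrix are hypothetical and serve only to show two features recorded in very different units. The standard deviation is taken with n-1 in the denominator, an assumption that is consistent with the variances listed for the data sets in this chapter.

```python
import numpy as np

def autoscale(X):
    """Z-transformation: subtract the column mean, divide by the column standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                 # arithmetic mean of each feature
    s = X.std(axis=0, ddof=1)          # sample standard deviation of each feature
    return (X - m) / s

# Hypothetical example: two features measured in very different units
X = np.array([[1.5, 800.0],
              [2.0, 900.0],
              [3.0, 600.0],
              [4.5, 350.0]])
Z = autoscale(X)

print(np.round(Z, 3))
```

After the transformation both columns carry the same weight in any variance-based method, regardless of the units in which they were originally recorded.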
Table 3.3 Data set B.

Object      Features                   Scores for princ. comp.
number      x1       x2       x3       u1 (PC1)    u2 (PC2)
1           1.5      1.5      8.0      -2.40       -0.07
2           2.0      3.5      9.0      -1.90        1.00
3           3.0      2.0      6.0      -1.40       -0.37
4           4.5      3.5      3.5      -0.06       -0.33
5           5.5      5.0      5.0       0.25        0.62
6           6.5      1.5      4.0      -0.24       -1.20
7           7.0      3.5      2.0       0.90       -0.79
8           7.5      5.0      3.0       1.20        0.07
9           8.0      7.0      4.5       1.50        1.30
10          9.0      5.5      1.0       2.20       -0.22
Mean        5.45     3.80     4.60
Variance    6.80     3.34     6.32
% variance  41.3     20.3     38.4

Princ. comp.   % variance   Accum.      Loadings
number                      variance    b1        b2        b3
1              77.5         77.5         0.640     0.503    -0.581
2              20.3         97.8        -0.133     0.817     0.561
3               2.2        100.0        -0.757     0.281    -0.590
The results of the principal component computations show a relatively high variance for PC1 (77.5 % of total variance). This means that the data points are mainly distributed along a single direction. For real chemical data this direction would hopefully correlate with a chemical or physical factor influencing the data structure. The high accumulated variance of 97.8 % of total variance preserved by the first two principal components supports the assumption that a graph of the data using the scores u1 and u2 is representative of the data structure (score/score-plot, Fig. 3.8c).

Fig. 3.8 Data set B: feature/feature-plots (a, b) and score/score-plots (c, d). Panels: (a) x2 (20.3 % variance) versus x1 (41.3 % variance); (b) x3 (38.4 % variance) versus x1; (c) u2 (20.3 % variance) versus u1 (77.5 % variance); (d) u3 (2.2 % variance) versus u1.

The loadings of PC1 show that the first principal component is influenced by all three features to a similar extent, directly proportionally for X1 and X2, inversely proportionally for X3. The second principal component, with 20.3 % of the total variance, has less impact on the data structure than PC1. The third principal component provides almost no supplementary information on the data (Fig. 3.8d). In the case of experimental data PC3 could possibly be attributed to measurement errors.
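The percentages discussed above can be recomputed from the feature values of Table 3.3. The sketch below assumes Python with NumPy and performs the principal component analysis as an eigenvector analysis of the covariance matrix of the autoscaled data (which is the correlation matrix of the original features). It is an illustration under these assumptions, not the program used by the authors, and the signs of the computed loadings and scores may be reversed relative to Table 3.3 because the sign of an eigenvector is arbitrary.

```python
import numpy as np

# Data set B (Table 3.3): columns x1, x2, x3
X = np.array([[1.5, 1.5, 8.0], [2.0, 3.5, 9.0], [3.0, 2.0, 6.0], [4.5, 3.5, 3.5],
              [5.5, 5.0, 5.0], [6.5, 1.5, 4.0], [7.0, 3.5, 2.0], [7.5, 5.0, 3.0],
              [8.0, 7.0, 4.5], [9.0, 5.5, 1.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscaling, eq. (3.5)
R = np.cov(Z, rowvar=False)                        # equals the correlation matrix of X

eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]              # largest variance first
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order].T                # one row per principal component

percent = 100.0 * eigenvalues / eigenvalues.sum()  # % of total variance per PC
accumulated = np.cumsum(percent)
scores = Z @ loadings.T                            # columns u1, u2, u3

for j, (p, a) in enumerate(zip(percent, accumulated), start=1):
    print(f"PC{j}: {p:5.1f} % of variance, accumulated {a:5.1f} %")
```

Plotting the first column of `scores` against the second gives the score/score-plot of Fig. 3.8c, possibly mirrored along one or both axes.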
The principal component plot of the objects allows a visual cluster analysis. The distances between data points in the projection, however, may differ considerably from the actual distance values. This will be the case when variances of the third and following principal components cannot be left out of consideration. A serious interpretation should include the application of at least another cluster analysis method (ref. 11,12).
3.4 APPLICATION
An example from the field of environmental analytical chemistry is chosen to demonstrate the application of principal component analysis. The data is taken from an investigation on air pollution in the city of Vienna (Austria) by polycyclic aromatic hydrocarbons (PAH) (ref. 13).
The sampling station for this investigation was located near the center of the city, at the intersection of two main streets carrying heavy traffic of the order of 50,000 vehicles per 24 hours. The data set consists of the concentrations of nine PAHs in 24 different air samples (two samples per month). Concentrations of PAHs are significantly higher in winter than in summer. Principal component analysis was applied to investigate seasonal variations of concentration profiles and to establish a relationship between sources and the occurrence of certain compounds. Because the concentrations of fluorene and pyrene are considerably higher than the others (by approximately a factor of 5), these two features were divided by 5. Then the data were normalized to a constant sum for each sample, because we are interested in relative concentrations rather than in absolute values. Table 3.4 shows the loadings of the first, second and third principal components; the others have very small variances. In the principal component plots in Figure 3.9 the samples are represented by different symbols related to the average temperature during the sampling time. The scores computed for the first principal component clearly distinguish the cold and warm seasons.
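The two preprocessing steps described above, scaling of fluorene and pyrene and normalization to a constant sum, might look as follows in Python with NumPy. This is only a sketch: the concentration values are invented for illustration and are not the data of ref. 13, the column order is an assumption, and the constant sum is set to 1 because the text does not state which constant was used.

```python
import numpy as np

# Hypothetical concentrations (arbitrary units) of nine PAHs in three air samples.
# The numbers are invented for illustration; they are NOT the data of ref. 13.
# Assumed column order: fluorene, pyrene, then the seven remaining compounds.
conc = np.array([
    [50.0, 45.0, 8.0, 6.0, 9.0, 12.0, 7.0, 6.0, 8.0],
    [30.0, 28.0, 2.0, 1.5, 2.5, 3.0, 1.0, 1.5, 2.0],
    [40.0, 35.0, 5.0, 4.0, 6.0, 8.0, 4.0, 3.5, 5.0],
])

scaled = conc.copy()
scaled[:, :2] /= 5.0                                   # fluorene and pyrene divided by 5

# Normalize each sample (row) to a constant sum of 1 (assumed constant)
profiles = scaled / scaled.sum(axis=1, keepdims=True)

print(np.round(profiles, 3))
```

After this step each row describes a concentration profile, so samples with the same relative composition but different total pollution levels become identical.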
Table 3.4 First three principal components for PAH data.

Compound (feature)            Loadings
                              PC1       PC2       PC3
fluorene                       0.540     0.024     0.299
pyrene                         0.605     0.083    -0.096
cyclopenta[cd]pyrene          -0.357     0.793     0.181
benz[a]anthracene             -0.106     0.051     0.063
chrysene + triphenylene       -0.080    -0.300     0.613
benzo[b+j+k]fluoranthene      -0.333    -0.506    -0.045
benzo[a]pyrene                -0.261    -0.081    -0.146
indeno[cd]pyrene              -0.099    -0.091    -0.227
benzo[ghi]perylene             0.091     0.027    -0.644

% variance                    76.9      15.8       3.8
% accumulated variance        76.9      92.7      96.5
The first principal component (Fig. 3.10) is mainly determined by high positive loadings for the first two compounds and negative loadings for most of the others; that is, samples on the right side (summer side) in Figure 3.9 have high relative concentrations of fluorene and pyrene. In contrast to the summer samples, the winter samples are enriched in high-molecular-weight, mostly carcinogenic PAHs.
The second principal component has a high positive loading for cyclopenta[cd]pyrene, a compound mainly emitted by cars. High negative loadings occur for chrysene + triphenylene and for the benzo-fluoranthene isomers; these compounds are typical pollutants from coal combustion. The second principal component can therefore be related to the relative amounts of pollution by heating and car traffic. High positive scores for PC2 (upper part in Fig. 3.9) can be attributed to a large influence of traffic, high negative scores to a large influence of heating.
Fig. 3.9 Principal component plots of PAH data (u2, 15.8 % variance, and u3, 3.8 % variance, each plotted against u1, 76.9 % variance). Each point corresponds to an air sample; the symbols indicate the average temperature during sampling, and outliers are marked.
Two winter samples appear as outliers. Both can be explained by unusual weather conditions on the day of sampling. Results obtained by exploratory data analysis should be considered only as proposals. The chemist is responsible for taking into account other facts about the problem in question and for the final conclusion.
Fig. 3.10 Loadings of the first two principal components of the PAH data (PC1: 76.9 % variance; PC2: 15.8 % variance), plotted against PAH number.
3.5 SOFTWARE
Principal component analysis and other methods of multivariate statistics are contained in general statistics program packages like SPSS (Statistical Package for the Social Sciences), SAS (Statistical Analysis System) and BMDP (Bio-medical Discrimination Programs). Originally these programs were developed for use on mainframe computers, but they are now available in PC versions. The interactive mode of the PC versions is an advantage for the less experienced user. A detailed description of these comprehensive software packages is beyond the scope of this chapter. An introduction and a comparison of the methods from the viewpoint of pattern recognition applications in chemistry is given by Wolff and Parsons (ref. 14).
A number of software packages for this field have been developed by chemists in order to fulfil their special needs in interpreting chemical data. These programs should not be considered as substitutes for the large general statistics packages; they focus on special methods, are easier to handle and are usually cheaper. Furthermore, dedicated software containing PCA modules for processing measured data exists for a small number of analytical instruments. A selection of commercial programs is given below (references indicate reviews or descriptions; prices are given for orientation only).
ARTHUR: Infometrix, 2 0 0 Sixth Ave. 833, Seattle, Wash. 98121, USA; ca. $ 7000. This package of Fortran programs has been developed at the Department of Chemistry of the University of Washington, Seattle. For many years it was the most widely used program for PCA and other pattern recognition applications in chemistry. ARTHUR has been written for mainframe computers, but PC versions are also used (e.g.: TNO CIVO Institute, P.O.Box 360, NL-3700 AJ Zeist, The Netherlands). An overview of ARTHUR is given in Wolff and Parsons (ref. 14).
CLAS: Central Laboratory for Clinical Chemistry, University Hospital, P.O.Box 30001, NL-9700 RB Groningen, The Netherlands; Dfl 1000. Multivariate classification methods (ref. 15).
CLEOPATRA: Elsevier Scientific Software, P.O.Box 211, NL-1000 AE Amsterdam, The Netherlands; Dfl 3100. A set of programs as an aid in teaching chemometric methods (ref. 16, 17).
EIN*SIGHT: Infometrix, 2100 Sixth Ave. 833, Seattle, Wash. 98121, USA; $ 300. Exploratory analysis of multivariate data, graphics-oriented (ref. 18).
PARVUS: Elsevier Scientific Software, P.O.Box 211, NL-1000 AE Amsterdam, The Netherlands; Dfl 1325. Package for supervised pattern recognition and handling of multivariate data (ref. 19).
QSAR-PC: Biosoft, 22 Hills Road, Cambridge, CB2 1JP, U.K.; $ 200. Programs for investigation of property-activity relationships by regression analysis.
SIMCA: EMX, P.O.Box 336, S-95125 Lulea, Sweden; $ 2200. Multivariate data analysis by SIMCA (principal component models of classes) and PLS (partial least squares) (ref. 20).
Chemists sometimes overestimate the amount of work necessary for writing a simple PCA program. The eigenvector computation causes no problems if published source code is used (ref. 21-23). If data matrices have to be entered manually, existing spreadsheet programs are convenient. Programming simple graphics (scatter plots, bar graphs) is well supported by Pascal or Basic. The run-time of a compiled program on an AT-compatible PC is roughly 10 to 20 s for a principal component analysis of data from 100 objects with 10 features.
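To give an impression of how little code a simple PCA program needs, the sketch below finds the first principal component by power iteration, one of the simplest eigenvector methods. It is written in Python with NumPy as an illustration only; it is not the algorithm of the published source codes cited above (ref. 21-23), and the function name `first_principal_component` is hypothetical. Data set A from Table 3.2 serves as test data.

```python
import numpy as np

def first_principal_component(X, iterations=100):
    """Largest eigenvalue and eigenvector of the covariance matrix by power iteration."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # mean-centering
    C = Xc.T @ Xc / (len(X) - 1)                     # covariance matrix
    b = np.ones(C.shape[0]) / np.sqrt(C.shape[0])    # arbitrary start vector of length 1
    for _ in range(iterations):
        b = C @ b                                    # repeated multiplication ...
        b /= np.linalg.norm(b)                       # ... and rescaling to length 1
    variance = float(b @ C @ b)                      # Rayleigh quotient = eigenvalue
    return variance, b

# Data set A (Table 3.2)
A = np.array([[2.0, 3.0], [3.0, 4.5], [4.0, 3.5], [4.5, 4.5], [5.0, 6.0],
              [5.5, 4.5], [6.5, 6.5], [7.0, 6.0], [9.0, 5.0], [10.0, 8.0]])
v_pc1, b_pc1 = first_principal_component(A)

print(f"variance of PC1: {v_pc1:.2f}")
print("loadings of PC1:", np.round(b_pc1, 3))
```

Power iteration converges quickly when the first eigenvalue is much larger than the second, as is the case here; it would fail only if the start vector happened to be orthogonal to PC1. Further components can be obtained by subtracting the contribution of PC1 from the covariance matrix and repeating the iteration.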
3.6 REFERENCES
1 M. Forina, S. Lanteri, C. Armanino, 'Chemometrics in food chemistry', in Topics in current chemistry, Vol. 141, p. 91, Springer-Verlag, Berlin (1987).
2 P.C. Jurs, 'Computer software applications in chemistry', Wiley, New York (1986).
3 J.R. Llinas, J.M. Ruiz, 'Multivariate analysis of chemical data sets with factorial methods', in G. Vernin, M. Chanon (eds.), 'Computer aids in chemistry', Ellis Horwood Limited, Chichester, England (1986), p. 200.
4 D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, 'Chemometrics: a textbook', Elsevier, Amsterdam (1988).
5 M.A. Sharaf, D.L. Illman, B.R. Kowalski, 'Chemometrics', Wiley, New York (1986).
6 K. Varmuza, 'Pattern recognition in chemistry', Springer-Verlag, Berlin (1980).
7 S. Wold, K. Esbensen, P. Geladi, Chemometrics and Intelligent Laboratory Systems, 2, (1987), 37.
8 J. Joliffe, 'Principal component analysis', Springer, Berlin (1986).
9 S. Clementi, G. Cruciani, G. Curti, Anal. Chim. Acta 191, 149 (1986).
10 S. Wold, C. Albano, W.J. Dunn III, U. Edlund, K. Esbensen, P. Geladi, S. Hellberg, E. Johansson, W. Lindberg, M. Sjoestroem, 'Multivariate data analysis in chemistry', in B.R. Kowalski (ed.), 'Chemometrics, mathematics and statistics in chemistry', D. Reidel Publishing Company, Dordrecht, Holland (1984), p. 17.
11 D.L. Massart, A. Dijkstra, L. Kaufman, 'Evaluation and optimization of laboratory methods and analytical procedures', Elsevier, Amsterdam (1980).
12 D.L. Massart, L. Kaufman, 'The interpretation of analytical chemical data by the use of cluster analysis', Wiley, New York (1983).
13 J. Jaklin, P. Krenmayr, K. Varmuza, Fresenius Z. Anal. Chem. 331, 479 (1988).
14 D.D. Wolff, M.L. Parsons, 'Pattern recognition approach to data interpretation', Plenum Press, New York (1983).
15 J.B. Hemel, H. Van der Voet, F.R. Hindriks, D.A. Doornbos, Trends in Anal. Chem. 6, (1987), 192.
16 N. Bratchell, H.J.H. MacFie, Chemometrics and Intelligent Laboratory Systems, 1, (1987), 124.
17 G. Guiochon, Trends in Anal. Chem. 6, no. 6, XXIII, (1987).
18 R.L. Erskine, Chemometrics and Intelligent Laboratory Systems, 2, (1987), 302.
19 M. Forina, R. Leardi, Trends in Anal. Chem. 7, (1988), 53.
20 W. Dunn, Chemometrics and Intelligent Laboratory Systems, 1, (1987), 126.
21 K.J. Johnson, 'Numerical methods in chemistry', Marcel Dekker, New York (1980).
22 W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T. Vetterling, 'Numerical Recipes', Cambridge Univ. Press, Cambridge (1986).
23 B.G.M. Vandeginste, C. Sielhorst, M. Gerritsen, Trends in Anal. Chem. 7, (1988), 286.
ACKNOWLEDGEMENT
We thank Johannes Jaklin for providing us with analytical data and Elke Schneider for her help in preparing the manuscript.
MANIPULATION OF CHEMICAL DATA BASES BY PROGRAMMING

Jure ZUPAN 'Boris Kidrič' Institute of Chemistry, Hajdrihova 19, YU-61115 Ljubljana, Yugoslavia
4.1 INTRODUCTION
The dilemma whether or not one should know (or learn) one of the most commonly used high level languages (Basic, Fortran, Pascal, C, etc.) is still haunting many practicing chemists, researchers, students and educators alike. There are arguments on both sides, spanning from absolute rejection (almost everything chemists need can be bought on the software market or at least developed by their computer departments) to fierce advocacy of the 'do-it-yourself' approach (arguing that chemists using application packages as 'black-box' tools do not know how their data are actually processed). As always, the response of chemists will depend on the needs, abilities, goals, and general policy prevailing in the working environment, staff, laboratory, and not least on the opinion of the head of the group. Probably the most important factor in such a decision is the goal towards which the particular project or even the laboratory is oriented. For routine work with well established procedures, the 'black-box' approach is doubtless very convenient, useful, and above all, error safe. On the other hand, research on the frontiers of any field requires at least a minimal knowledge of programming in one high level language. Fundamental research that requires a lot of data handling but uses only bought software can seldom lead to completely new discoveries. If new facts are sought, the relevant data should be treated in a completely new way at at least one point of the data handling process, which requires programming your own routines.
There are, of course, many other occasions when knowledge of programming is very useful, especially if ’interfacing’ programs for reformatting the data for transfer between two standard packages, minor changes in existing programs, or new special purpose procedures are needed. It is worthwhile to keep in mind that programming in a high level language is not much more complicated than programming in dBase or LOTUS, to mention only two of the most commonly used packages offering their own programming languages. In short, knowing how to program and when to use this skill is very beneficial. Many problems can be solved in a simpler, faster, and more economical way than when the user must explain the problem to a programmer, check the obtained results, and iterate the procedure until the calculated results are satisfactory. Frequent small changes in a program can be especially annoying if done through an intermediate person.
Additionally, it should be remembered that, in order to attract buyers, most of the stand-alone computer packages are designed as general purpose products. This means that a package is programmed to handle as many different problems (within its scope, of course) as possible and as such cannot be equally well suited for all of them. This is not to say that general packages on the market are deficient or even not worth buying; on the contrary, they are mostly very useful and much more user friendly than the majority of ‘home made’ programs. What we want to say is that for special applications the general purpose software may not offer the very best data handling procedure available. At this point the programming ’know-how’ becomes very valuable.
4.2 PROGRAMMING PROCEDURES
Years ago, the choice of a programming language was a very simple one. Chemistry was dominated almost entirely by the Fortran language. Today, the choice is mainly made between Fortran, Pascal, and Basic. Because each of them has its own advantages and drawbacks, it is hard to single out any of them as the best one. Basic is easy to learn (syntax and rules) and is good at graphics if you have a CGA (Color Graphic Adapter) or EGA (Enhanced Graphic Adapter) card in the PC, but long programs become rather difficult to maintain and update, handling I/O with files is inconvenient, and almost every computer brand has its own Basic dialect. Pascal is mainly an algorithmic language with medium I/O capability, which makes it not the best choice if a lot of file manipulation and communication is planned. There are several Pascal graphic packages (Borland Turbo Graphic Box, for example) offering diverse graphic procedures, making Pascal very attractive for many young chemists. As the situation stands now, the majority of 'home made' programs in chemistry are still written in Fortran. Some of the reasons are historical (it is hard to switch to another language) and some are based on the high level of standardization, which makes Fortran the most portable language. It is nice to transfer source code from a mainframe to a PC, compile it there, and then run it without most of the problems usually encountered when transferring programs written in other languages. Due to its standardization, Fortran is a very conservative language as far as graphics and screen manipulation procedures are concerned. As for Pascal, many software houses offer special subroutines for graphics (Microsoft Windows, MetaWindows, etc.).
Development of any program is carried out in the following steps:
1. designing a solution for the problem in an algorithmic (procedural) way,
2. writing the selected algorithm in a high level language (source file) using a text editor,
3. compiling the source file with the corresponding compiler (some compilers require compilation in 2 or 3 passes),
4. correcting the source file for typographical and syntax errors if the compiler finds some (repetition of steps 2 to 4 until no errors are detected by the compiler),
5. linking the compiled program (file.obj) with other object files and libraries using the system linker to obtain the executable file (file.exe),
6. running the 'exe' version and applying test data,
7. comparing the results with the expected ones and repeating the entire procedure from point 2 in the above scheme until the obtained results are consistent with the expected ones,
8. running the real application.
As can be seen, writing one's own application is not an easy or fast task. It can be learned only by practice. Doubtless, the most difficult and tiresome part of programming is tracking down logical (procedural or algorithmic) errors in the source code. This part is called ’debugging’. The use of a debugging option, if it is offered by the compiler, is of great help. It enables the programmer to trace and monitor changes of variables, arrays, and program flow. Additionally, using a debug option, the programmer can change the values of variables during the execution of the program, etc. If the compiler does not have a debug option, a number of write statements reporting the values of variables must be included in the source, which makes finding errors a difficult and time consuming process.
It has to be mentioned that a big step towards unification and standardization of programming has been made by the Microsoft compilers of version 4.0 or higher (MS Pascal 4.0, MS Basic 6.0, MS Fortran 4.1, MS C 5.1, MS Macro Assembler 5.1), which allow object files produced from sources written in different languages to be linked together into a single executable file. For example, compiled Pascal procedures can be linked with object files obtained from Fortran source code, or vice versa. Additionally, some software and hardware producers (Hercules, Microsoft, Borland, for example) offer graphic packages containing Basic, Pascal, Fortran, and assembler routines for graphics in different graphic environments (CGA, EGA, VGA, Hercules, etc.), which can easily be incorporated into the program code.
In the following paragraphs, some of the procedures specific to chemistry that must usually be programmed will be described from two aspects. The first is a condensed description of the problem, the second a basic procedure necessary to program the task.
4.3 HANDLING CHEMICAL STRUCTURES WITH PC
4.3.1 General

Most chemists will agree that the chemical structure is a common denominator in the majority of chemical work. It therefore seems natural to discuss the ways in which chemical structures can be handled (input, output, displayed, compared, searched, ranked, etc.) by computers in general (ref. 1) and by personal computers in particular.
Usually, in a chemical laboratory someone comes up with the idea of organizing a collection of chemical structures and to link it with a specific application. Due to the lack of general purpose packages enabling chemists to create a data base of structures according to the specific needs, the chemists are forced to 'reinvent the wheel', starting to build such a system from scratch. To avoid the situation, we shall discuss the procedures (editing and representation of chemical structures, sub and superstructure search, etc.) and ways (linking structure generation with files containing structures, making access to structure related features easier, etc.) needed to prepare a custom tailored data file of chemical structures.
4.3.2 Editing a structure

Editing a chemical structure using a computer means building, changing, storing, copying, downloading, or otherwise interactively manipulating chemical structures with commands familiar to chemists. Figure 4.1 shows the process of editing the structure of 3-amino cyclohexanone using different commands from the menu displayed on the screen. Each selection of the chemist is immediately displayed on the screen so that he or she can closely follow the assembling of the structure. Once the desired structure is generated, the user should be able to use its representation (the connection table) in many different ways: to store it, to combine it with other structures, to supplement it with textual information, to decompose it into fragments, to add it to a collection, to use it as a target or query compound in different searches or procedures, to use it in different applications such as simulation of spectra or determination of properties, to calculate its molecular formula, to draw it on a plotter, etc.
4.3.3 Representation of chemical structures

Connectivity matrix and connection table. The most frequently used forms for representing chemical structures in the computer are the connectivity matrix (CM) and the connection table (CT). In the CM, the diagonal element Cii is the chemical symbol of the i-th atom, while the off-diagonal elements Cij represent the bond orders between the i-th and j-th atom. Figure 4.2a shows the CM of 3-amino cyclohexanone.
Fig. 4.1 Building a chemical structure (3-amino cyclohexanone in this example) with commands partially selected from the menu and partially typed in by the user. The numbering of atoms is important for two reasons: first, for fast addressing of atoms in the editing process and second, for comparing the retrieved structures with the on-screen structure. The particular software was developed in the author's lab. (Menu items visible on the screens: CHAIN, RING, ATOM, BOND, BRIDGE, CT, DELETE, INSERT, MENU; typed commands such as 'CHAIN: 1 AT 5, 1 AT 3' and 'ATOM: 8 NH2, 7 O'.)
It can be seen that much of the information in the CM is redundant (each bond is listed twice) and that a large portion of the matrix is empty (elements are equal to zero). This indicates that the structure can be represented more economically with a table of constant width w. Such a representation requires only wN instead of N^2 variables, N being the number of atoms. The i-th row of the new representation stores the w data associated with the i-th atom (chemical symbol of the element, sequential numbers of its neighbors and the types of the bonds to them). Such a representation is called the connection table of a chemical structure or CT (Fig. 4.2).
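Fig. 4.2 The connectivity matrix (CM) of 3-amino cyclohexanone and the corresponding connection table (CT)

Connectivity matrix (CM):

     1  2  3  4  5  6  7  8
1    C  1  0  0  0  1  0  0
2    1  C  1  0  0  0  0  0
3    0  1  C  1  0  0  0  0
4    0  0  1  C  1  0  0  1
5    0  0  0  1  C  1  0  0
6    1  0  0  0  1  C  2  0
7    0  0  0  0  0  2  O  0
8    0  0  0  1  0  0  0  N

Connection table (CT); each row holds the atom symbol followed by four (neighbor, bond type) pairs, unused positions being zero:

Atom  Symbol  Nb1  Bond1  Nb2  Bond2  Nb3  Bond3  Nb4  Bond4
1     C       2    1      6    1      0    0      0    0
2     C       1    1      3    1      0    0      0    0
3     C       2    1      4    1      0    0      0    0
4     C       3    1      5    1      8    1      0    0
5     C       4    1      6    1      0    0      0    0
6     C       1    1      5    1      7    2      0    0
7     O       6    2      0    0      0    0      0    0
8     N       4    1      0    0      0    0      0    0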
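The relation between the two representations can be illustrated with a minimal Python sketch. It assumes the CM is held as a list of rows with the element symbol on the diagonal and uses a table width of w = 9 (symbol plus four neighbor/bond pairs); the function name `cm_to_ct` and the data layout are illustrative choices, not part of the original software.

```python
# Connectivity matrix of 3-amino cyclohexanone as in Fig. 4.2:
# diagonal = element symbol, off-diagonal = bond order (0 = no bond).
CM = [
    ["C", 1, 0, 0, 0, 1, 0, 0],
    [1, "C", 1, 0, 0, 0, 0, 0],
    [0, 1, "C", 1, 0, 0, 0, 0],
    [0, 0, 1, "C", 1, 0, 0, 1],
    [0, 0, 0, 1, "C", 1, 0, 0],
    [1, 0, 0, 0, 1, "C", 2, 0],
    [0, 0, 0, 0, 0, 2, "O", 0],
    [0, 0, 0, 1, 0, 0, 0, "N"],
]


def cm_to_ct(cm, max_neighbors=4):
    """Convert a connectivity matrix into a fixed-width connection table.

    Each CT row is [symbol, nb1, bond1, ..., nb4, bond4]; atom numbers are
    1-based and unused neighbor slots are filled with zeros.
    """
    width = 1 + 2 * max_neighbors
    table = []
    for i, row in enumerate(cm):
        entry = [row[i]]  # diagonal element: chemical symbol
        neighbors = [(j + 1, order) for j, order in enumerate(row)
                     if j != i and order != 0]
        if len(neighbors) > max_neighbors:
            raise ValueError(f"atom {i + 1} has more than {max_neighbors} neighbors")
        for number, order in neighbors:
            entry += [number, order]
        entry += [0] * (width - len(entry))  # pad to constant width w
        table.append(entry)
    return table


CT = cm_to_ct(CM)
for atom_number, ct_row in enumerate(CT, start=1):
    print(atom_number, ct_row)
```

The rows printed for atoms 4 and 6 carry three neighbors each (the amino nitrogen and the carbonyl oxygen, respectively), in agreement with the CT of Fig. 4.2.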
Fragment code. In computerized chemical information systems, especially in connection with other structure handling algorithms (substructure search, structure generation, etc.), some sort of fragment code must be used. Structural fragments can be defined on the basis of chemical properties, of the statistical frequency of occurrence in chemical compounds, or from a purely formal mathematical, i.e. graph-theoretical, aspect (ref. 2,3) with little relevance to physical or chemical properties. Sometimes, the graph-theoretical definition of fragments is supplemented with basic chemical properties like the type of atoms, bond orders, etc. Another approach is to label selected chemical fragments with easily recognizable features (for example: -OH or -C(=O)- group, aromatic ring, structural skeletons, etc.) with consecutive numbers, letters, or special characters. The full structure representation is then either a list of all fragments present or a list of all different structural features. If different fragments are labeled as fi, then structure S can be written as a set of m fragments:

$$S = (f_1, f_2, \ldots, f_m) \tag{4.1}$$
For a given structure the number and type of different (or of all) fragments depend entirely on the definition of fragments. For any kind of structure handling procedure the definition of fragments should be unique for all structures in the collection.
In order to be as simple as possible, the fragments are usually defined as structures with one atom in the center and a number of layers of neighbors up to which the fragment is still defined. The first layer of neighbors consists of bonds and atoms directly bonded to the central atom, the second layer consists of bonds and atoms bonded to the atoms in the first layer and not yet taken into account, and so on. Fragments should be stored in tables and described precisely enough that any structure can be decomposed into a unique set of them. The representation of all possible fragments in the form of a CM or CT (with all details: atoms, bonds, connections up to a certain layer) is very voluminous. To be precise, the limiting factor is not the computer space for storing fragments, but the time for scanning the entire collection of fragments for each atom of the query structure in order to find the match. However, the fragment coding can be improved considerably by an adequate definition of fragments and a wise restriction of coding elements.

In this paragraph a procedure for unique encoding of atom-centered fragments into 32-bit strings is described. The length of 32 bits imposes a limitation on the size of the fragments (number of neighbors to be considered) (ref. 4). If longer bit strings are used, more general atom-centered fragments can be encoded. The reverse procedure enables decoding of any 32-bit string into the corresponding atom-centered fragment. The main limitation (which is not too restrictive for organic chemistry) is the maximum number of 4 neighbors at each atom. The atoms, of course, can be more than 4-valent, provided that the excess valences are used up by multiple bonds (in fragments like -SO2- and -NO2). An additional limitation excludes hydrogen atoms from encoding. Only non-hydrogen atoms are coded and treated as atoms. Hydrogens are (or can be) added at the end of the procedure to each atom as required by its unsaturated valence.
The encoding starts at the right side of the 32-bit word, at its least significant part. The first 3 bits are used for encoding the type of the central atom, while the following 4 times 7 bits are used consecutively for each neighbor of the central atom. If the central atom has only two or three neighbors, only two or three 7-bit strings are encoded, respectively. Each 7-bit string is coded using the same scheme: the consecutive 2, 3, and 2 bits (starting at the most significant part of the 7-bit string) represent the bond between the central atom and the neighbor, the atom type of the neighbor, and the number of second-layer neighbors bonded to this particular neighbor. The 32nd bit is always empty (equal to 0), which ensures that the ID number of the fragment is always positive. In order to obtain a unique coding scheme, the order for coding the neighbors has to be established. First, using the formula

$$NS_i = 32 \cdot \mathrm{BOND} + 4 \cdot \mathrm{ATOM} + \mathrm{NEIGHBORS} \tag{4.2}$$
a 7-bit number NSi is calculated for each neighbor i. The parameter 'BOND' specifies the bond type between the central atom and the neighbor (BOND = 1, 2, or 3 for single, double and triple bond, respectively); the value 'ATOM' (in the range 1 - 7, with carbon = 1, oxygen = 2, etc.) represents the type of the atom; while 'NEIGHBORS' is the number of non-hydrogen second-layer neighbors (0-3). The value of NSi can be between 0 (no neighbor) and 127. In order to obtain a unique (smallest possible) identification number ID for the fragment, all NSi are sorted in descending order (NS1 ≥ NS2 ≥ NS3 ≥ NS4) and added to the code for the central atom type CATOM:

$$ID = \mathrm{CATOM} + 8\,NS_1 + 1024\,NS_2 + 131072\,NS_3 + 16777216\,NS_4 \tag{4.3}$$
The numerical factors (2^3, 2^10, 2^17 and 2^24, respectively) in equation (4.3) are used for placing the corresponding NSi at the proper positions in the 32-bit string. Besides the ease of encoding, the advantage of this fragment code is that the topology of fragments can be directly reproduced from the ID number of a fragment. The described scheme with all coding possibilities is shown in Figure 4.3. The described fragment code can be modified according to the user's needs. With larger strings, larger fragments can be encoded or the atoms can be determined more specifically.

A disadvantage of all fragment codes is a non-unique description of structures; hence, two identical lists of fragments do not mean that the corresponding compounds have identical structures. To confirm the identity, a time-consuming atom-by-atom comparison must be invoked. Fortunately, the substructure search in a large file of chemical structures requires such a tedious comparison to be performed only on the list of compounds whose fragment codes contain the fragment codes of the query as a subset. The number of compounds on this 'good list' is in most searches orders of magnitude smaller than in the entire collection.
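Equations (4.2) and (4.3) and the reverse (decoding) procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the atom codes follow the table of Fig. 4.3, while the dictionary `ATOM_CODES`, the labels `"Hal"` and `"other"`, and the function names are hypothetical and not taken from the program described in the text.

```python
# Atom type codes as listed in Fig. 4.3 (0 is reserved for "no atom").
ATOM_CODES = {"C": 1, "O": 2, "N": 3, "S": 4, "P": 5, "Hal": 6, "other": 7}


def neighbor_code(bond, atom, continuations):
    """Equation (4.2): 7-bit code NS of one first-layer neighbor."""
    if not 1 <= bond <= 3:
        raise ValueError("bond must be 1, 2 or 3")
    if not 0 <= continuations <= 3:
        raise ValueError("a neighbor may have 0-3 further non-hydrogen neighbors")
    return 32 * bond + 4 * ATOM_CODES[atom] + continuations


def encode_fragment(central_atom, neighbors):
    """Equation (4.3): neighbors is a list of (bond, atom, continuations)."""
    if len(neighbors) > 4:
        raise ValueError("at most 4 neighbors can be encoded")
    codes = sorted((neighbor_code(*n) for n in neighbors), reverse=True)
    frag_id = ATOM_CODES[central_atom]
    for position, ns in enumerate(codes):
        frag_id += ns << (3 + 7 * position)  # factors 2^3, 2^10, 2^17, 2^24
    return frag_id


def decode_fragment(frag_id):
    """Recover the central atom and its neighbors from a fragment ID."""
    symbols = {code: name for name, code in ATOM_CODES.items()}
    central = symbols[frag_id & 0b111]
    neighbors = []
    for position in range(4):
        ns = (frag_id >> (3 + 7 * position)) & 0x7F
        if ns:
            neighbors.append((ns >> 5, symbols[(ns >> 2) & 0b111], ns & 0b11))
    return central, neighbors


# Carbonyl carbon of 3-amino cyclohexanone: two ring carbons (single bonds,
# one further neighbor each) and a doubly bonded oxygen (end atom).
carbonyl = encode_fragment("C", [(1, "C", 1), (1, "C", 1), (2, "O", 0)])
print(carbonyl)                   # 4888129
print(decode_fragment(carbonyl))  # ('C', [(2, 'O', 0), (1, 'C', 1), (1, 'C', 1)])
```

For the carbonyl carbon the neighbor codes are 72, 37 and 37, so that ID = 1 + 8·72 + 1024·37 + 131072·37 = 4888129.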
Layout of the 32-bit word (bit 0 is the least significant bit):

Bits     Content
31       always 0
30-24    4th neighbor (Bond, Atom, Nc)
23-17    3rd neighbor (Bond, Atom, Nc)
16-10    2nd neighbor (Bond, Atom, Nc)
9-3      1st neighbor (Bond, Atom, Nc)
2-0      central atom (Atom)

2 bits: Bond
00  No bond
01  Single bond
10  Double bond
11  Triple bond

3 bits: Atom (central atom and neighbors)
000  No atom
001  Carbon
010  Oxygen
011  Nitrogen
100  Sulphur
101  Phosphorus
110  Halogen
111  Any other atom

2 bits: No. of continuations Nc
00  End atom
01  1 atom
10  2 atoms
11  3 atoms

Fig. 4.3 Encoding structural fragments into 32-bit strings. The encoding starts at the least significant part (right) and proceeds towards the left, yielding the smallest numbers for the simplest fragments.
4.3.4 Sub- and super- structure search
At the very bottom level of most structure handling algorithms two structures are compared atom by atom and bond by bond (ref. 1). However, the preprocessing steps, the I/O conditions, the constraints in the query or in the reference structures, and the requirements for a match or failure differ considerably from application to application. The most frequently used structure manipulating procedures are substructure and superstructure searches. In the substructure search the query (input) is small compared to the reference structures. The goal is to find all structures from the reference file that contain the query as a substructure. In the superstructure search the investigated (input) structure is large compared to the structures in the reference file, which is usually much shorter than in the case of substructure search. The purpose of the superstructure search is to identify all structures (skeletons, substituents, parts) from the reference file that fit into the query structure (Fig. 4.4). A superstructure search consists of a number of substructure searches, in each of which the 'reference' file contains only one structure, namely the query. The number of substructure searches is equal to the number of structures in the reference file of the superstructure search.

As already mentioned, the comparison of structures is a tedious and time-consuming procedure. Therefore, the part where atom-by-atom and bond-by-bond comparisons are made should be executed on a file containing as few structures as possible. To achieve this, a fast scan of the long reference file should be made to select only the structures that possibly contain the query.
Usually this is done in two steps: first, the query structure Sx is decomposed into fragments (see equation 4.1):

$$S_x = (f_1, f_2, f_3, f_4) \tag{4.4}$$

and second, using the inverted file of fragments, a group of structures containing, among others, all fragments of Sx is selected.
Fig. 4.4 A substructure (left) and a superstructure (right) search. In the substructure search the query is usually smaller compared with the structures in the reference file, while in the case of superstructure search the reference file is much shorter and contains only small fragments and/or skeletons.
$$\begin{aligned}
S_1 &= (f_1, f_2, f_a, f_b, f_3, f_4, f_c) \\
S_2 &= (f_1, f_2, f_3, f_4) \\
&\;\;\vdots \\
S_k &= (f_m, f_1, f_2, f_3, f_4)
\end{aligned}$$
The obtained group of k possible candidates is much shorter than the entire file. On the k candidates an atom-by-atom and bond-by-bond comparison must be made (Fig. 4.5).
Fig. 4.5 Inverted file of fragments contains identification numbers of all reference structures having the same fragment in the same record. Decomposition of the query structure into fragments and scanning the inverted file yields a short file of possible candidates. (Diagram labels: query structure, fragments, fragment IDs, inverted file, IDs of reference structures.)
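The two-step screening can be sketched in Python as follows. The sketch assumes that the inverted file fits into memory as a dictionary mapping a fragment ID to the set of structure IDs containing it; the fragment and structure numbers used in the example are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(fragment_lists):
    """Map each fragment ID to the set of structures that contain it."""
    inverted = defaultdict(set)
    for structure_id, fragments in fragment_lists.items():
        for fragment_id in fragments:
            inverted[fragment_id].add(structure_id)
    return inverted


def candidates(inverted, query_fragments):
    """Return structures containing ALL fragments of the query.

    Only these candidates need the atom-by-atom, bond-by-bond comparison.
    """
    result = None
    for fragment_id in set(query_fragments):
        ids = inverted.get(fragment_id, set())
        result = set(ids) if result is None else result & ids
        if not result:  # no structure can match any more
            break
    return result or set()


# Hypothetical collection: structure ID -> list of fragment IDs.
collection = {
    1: [11, 12, 21, 22, 13, 14, 23],
    2: [11, 12, 13, 14],
    3: [11, 12, 13],
    4: [30, 11, 12, 13, 14],
}
inverted_file = build_inverted_file(collection)
good_list = candidates(inverted_file, [11, 12, 13, 14])
print(sorted(good_list))  # [1, 2, 4]
```

Structure 3 lacks fragment 14 and is therefore rejected without any atom-by-atom comparison.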
The IBM PC compatible program GEN (Fig. 4.1), designed and written in the author's laboratory, has an option for downloading (to a sequential permanent file) the set of atom-centered fragments of each currently edited structure, coded into 32-bit strings as described in paragraph 4.3.3. The option enables the generation of a file containing lists of fragment codes for a collection of structures.
4.3.5 Update and retrieval in direct access files using hash algorithm
The formation of an inverted file of structural fragments for a large collection of structures can be made via a hashing algorithm applied to the fragment ID numbers. After a chemical structure is decomposed into fragments and the fragments are encoded into 32-bit ID numbers, the question arises how to find (how to access) the record where the information about a particular fragment is stored. Because they are too large, the 32-bit long numbers (of the order of 10^9) are not usable as addresses for direct access. The same problem is encountered if the 'key' information for access is the chemical name of the compound or fragment ('ADAMANTANE' or 'CARBONYL', for example). Before any large number or alphanumeric 'key' is used for addressing a record in the direct access file, it must be transformed into a number between 1 and N, N being the length of this file. The procedure employed for such a transformation is called a 'hash' algorithm (ref. 5).

The problem of how to transform an arbitrary alphanumeric string into a large number is easily solved by chopping the string into small parts of equal length (usually 4 bytes long) and then XOR-ing the parts into a single large number. For example, the key 'ADAMANTANE' yields a number LARGE by the following procedure:

LARGE = 'ADAM' XOR 'ANTA' XOR 'NE  '    (4.5)
The XOR bit operation (0011... XOR 0101... = 0110...) is a preprogrammed function available in almost all high-level languages such as Fortran, Pascal, etc. In the described way, a character string of any length can be transformed into a 4-byte string which in effect can be regarded as a large integer number. It is interesting to note that the order in which the individual parts are XOR-ed together does not change the final result. Hash algorithms are widely used in many applications and there are a number of different approaches to transforming a long (large) number or multi-byte string into a short address in a unique way.
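The chopping and XOR-ing of equation (4.5) can be sketched in Python as follows. The padding of the last part with blanks and the big-endian interpretation of each 4-byte part are assumptions made for this illustration; the function name `fold_key` is hypothetical.

```python
def fold_key(key, width=4):
    """Fold an ASCII key into one integer by XOR-ing 'width'-byte parts."""
    data = key.encode("ascii")
    if len(data) % width:
        data += b" " * (width - len(data) % width)  # pad last part with blanks
    large = 0
    for start in range(0, len(data), width):
        part = int.from_bytes(data[start:start + width], "big")
        large ^= part
    return large


LARGE = fold_key("ADAMANTANE")  # 'ADAM' XOR 'ANTA' XOR 'NE  '
print(LARGE)
```

Because XOR is commutative and associative, a key whose 4-byte parts are merely permuted (for example 'ANTAADAMNE') folds to the same number; this is one source of the address collisions discussed in the text.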
Any such algorithm must inevitably cause several different input keys to produce an identical address. This effect, which is inherent to all hash algorithms, is known as the 'address collision', and programmers must provide a way to calculate the consecutive addresses (an address increment) until an adequate address is reached.
Figure 4.6 shows how hash algorithm works in the case of collisions. If hashing of a given key produces the address where the information about another key is stored a new address must be calculated and the content checked again. The procedure is repeated until an empty (for update of new items) record is reached or the record containing the information of the identical key is reached. In order to check the identities of keys the complete reference key must be stored in each record.
Fig. 4.6 Hash algorithm produces another address whenever the collision of two different keys occurs on the same address. (Diagram labels: keys PROPENE, ADAMANTANE and BENZENE; hashing; hash address; address increment; file records with address, key and data.)
One of the most commonly used hash algorithms employs twin prime numbers (two consecutive odd numbers that are both primes) and the modulo function. If KEY is a large number (a fragment ID or the XOR-ed parts of a chopped long string) and NP the length of the direct access file, then the address ADDR and the increment INCR can be obtained by the following equations:

ADDR = MOD(KEY-1,NP) + 1
INCR = MOD(KEY-1,NP-2) + 1

The only requirement is that the length of the direct access file NP is set to the larger of the two twin prime numbers (NP and NP-2). In any case, the length of a direct access file for which hashing is employed should be chosen about 10-20 % larger than the expected number of records to be actually stored. Such a surplus of empty space guarantees a reasonable access time to empty and/or correctly addressed records. Programmers must be aware that the number of collisions increases sharply after the file is more than 85 % full. The full algorithm for the direct access of information described by the large number or character string KEY can be written as follows:
A1:  ADDR = MOD(KEY-1,NP) + 1;
     INCR = MOD(KEY-1,NP-2) + 1;
A2:  read(file, rec=ADDR) KEYREF, list;
     if KEYREF = 0 then return;          (no information for KEY found)
     if KEYREF = KEY then return;        (search returns 'list')
     if KEYREF ≠ KEY then
          ADDR = ADDR + INCR;
          if ADDR > NP then ADDR = ADDR - NP;
          continue at A2;
This algorithm can be applied either for the update of new items in the file or for the retrievals. It is evident that all records from the entire file must be repositioned again if the old file becomes too small and must be extended, i.e. new length NP must be used in the retrieval and update. Therefore, a careful study and realistic estimation of the needs must be made in advance.
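The algorithm above can be sketched in Python as follows. The sketch makes several assumptions for the sake of a self-contained example: the direct access file is replaced by an in-memory list with addresses 1 to NP, an empty record is marked by None rather than by KEYREF = 0, and keys are positive integers. The class and method names are hypothetical.

```python
def is_prime(n):
    """Trial division, sufficient for file lengths of practical size."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True


class HashFile:
    """In-memory stand-in for a direct access file of NP records."""

    def __init__(self, np_length):
        if not (is_prime(np_length) and is_prime(np_length - 2)):
            raise ValueError("NP must be the larger of two twin primes")
        self.np = np_length
        self.records = [None] * (np_length + 1)  # index 0 unused

    def _addresses(self, key):
        addr = (key - 1) % self.np + 1           # ADDR = MOD(KEY-1,NP) + 1
        incr = (key - 1) % (self.np - 2) + 1     # INCR = MOD(KEY-1,NP-2) + 1
        for _ in range(self.np):                 # NP prime: all records visited
            yield addr
            addr += incr
            if addr > self.np:
                addr -= self.np

    def update(self, key, data):
        """Store data under key; return the address actually used."""
        for addr in self._addresses(key):
            record = self.records[addr]
            if record is None or record[0] == key:
                self.records[addr] = (key, data)  # full key kept for checking
                return addr
        raise RuntimeError("file is full")

    def retrieve(self, key):
        """Return the stored data, or None if KEY is not in the file."""
        for addr in self._addresses(key):
            record = self.records[addr]
            if record is None:
                return None
            if record[0] == key:
                return record[1]
        return None


table = HashFile(13)                          # 11 and 13 are twin primes
first = table.update(5, "fragment A")         # address 5
second = table.update(18, "fragment B")       # collides at 5, increment 7
print(first, second, table.retrieve(18))      # 5 12 fragment B
```

Keys 5 and 18 both hash to address 5; the second one is moved by its increment of 7 to address 12, which is the situation depicted in Fig. 4.6.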
4.4 SPECTRA REPRESENTATION IN THE COMPUTER
4.4.1 General
Another very broad field in chemistry not adequately covered on the software market is the handling of spectral collections. There are, of course, a number of instrument producers that provide spectra handling software for their own instruments and 'data stations'. Unfortunately, their software is mostly neither open for the users to modify nor documented adequately to take full advantage of it. The worst examples of such software do not even allow the user access to the 'raw' data produced by the instrument and offer no possibility to transfer the data to other computers where processing according to the users' needs can be done. Potential buyers must beware of such products, especially if they intend to work intensively on their own measured data, which is mainly the case in R&D laboratories.
The choice of a proper spectra representation is very critical when designing an information or expert system based on a particular spectroscopy. It influences the speed, efficiency and, of course, the reliability of the system (ref. 6). In spite of the increasing computation power (space and speed) installed in today's laboratories, the problem of spectra representation is more serious when considering the implementation of the information system on a PC than on a mainframe computer. Besides the number of spectra one wants to handle with the system, the type of spectroscopy (infrared, NMR, mass, etc.), the goal for which the spectra are collected (identification of compounds, prediction of properties, structure elucidation, etc.), and the way the spectra are collected (link with the instrument, manual digitization, transfer from the mainframe, etc.) are the deciding factors according to which the representation of spectra should be determined. A good spectra representation should:
- contain as much relevant information as possible about the structure of recorded compounds,
- be short enough to ensure economical handling of large amounts of data,
- allow good reproduction of the original spectrum from its representation,
- enable retrieval and identification of spectra based on the query represented identically,
- enable prediction of structural features and different types of properties,
- allow the coding of representations of groups of spectra in the same way as individual spectra, etc.
There are still some other requirements that a representation should fulfill, but they are mainly of a more specific nature. There is no representation that would satisfy all requirements; hence, the representation must be selected in a kind of trial-and-error procedure guided by good spectroscopic knowledge.
4.4.2 Peak tables
Probably the most common representation for all kinds of spectra used in computerized information and expert systems is the peak table. This very simple representation consists of a table containing all (or a certain number of the most significant) peaks appearing in the spectrum. Each peak is usually described by its position and intensity, but more information (half width, multiplicity, shape type, etc.) can be added if needed. Such tables are very convenient for peak-by-peak search if the inverted files containing ID numbers of reference spectra are at hand. These files must be generated in advance (Fig. 4.7).

The problem with the peak-table representation is that the retrieved match is a rather inconvenient starting point for the evaluation of the experiment. A comparison between the full-curve query spectrum and the retrieved one(s), represented as peak table(s), is almost impossible. In order to assure a better comparison, a link from the table representation to the original (full-curve) reference spectrum must be maintained. However, even if such a link is implemented, we must be aware that results retrieved by ranking peak tables are worse than results obtained by comparing full-curve spectra.

The second problem inherently associated with peak search in the inverted file of 'peak vs. ID numbers' is the tolerance limit within which such a retrieval should be carried out. If the intervals in which the peaks are 'inverted' are broad, the search will probably yield the correct answer but the list of produced matches will be rather long. On the other hand, if the tolerance interval is narrow, the correct spectrum can be lost even if only one peak is not matched due to the experimental error in the query or reference peak table. To overcome this problem a number of reduction methods (see Chapter 5) can be applied to obtain reduced representations of spectra (ref. 7).

Fig. 4.7 Inverted file for retrieval of infrared spectra generated from peak tables for fast searching by peak positions (in the tolerance region ±10 cm-1). In the example, infrared spectrum No. 648 has the peak table 760, 830, 900, 1035, 1110, 1135, 1375, 1450, 1730, 2930 cm-1; the addresses of the inverted file run from 200 to 4000 cm-1 in steps of 10 cm-1, and the peak at 1730 cm-1 enters the ID number 648 into the records 1720, 1730 and 1740.
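An inverted file of peak positions with a tolerance interval can be sketched in Python as follows. The address step of 10 cm-1, the tolerance of one address on either side, the ranking of matches by the number of matched peaks, and spectrum No. 7 are assumptions made for this illustration only.

```python
from collections import Counter, defaultdict


def address_of(position, step=10):
    """Round a peak position (cm-1) to the nearest record address."""
    return int(position / step + 0.5) * step


def build_peak_index(peak_tables, step=10, tolerance=10):
    """Enter each spectrum ID into all records within the tolerance."""
    index = defaultdict(set)
    for spectrum_id, peaks in peak_tables.items():
        for position in peaks:
            center = address_of(position, step)
            for address in range(center - tolerance, center + tolerance + 1, step):
                index[address].add(spectrum_id)
    return index


def search(index, query_peaks, step=10):
    """Rank reference spectra by the number of matched query peaks."""
    hits = Counter()
    for position in query_peaks:
        for spectrum_id in index.get(address_of(position, step), ()):
            hits[spectrum_id] += 1
    return hits.most_common()


peak_tables = {
    648: [760, 830, 900, 1035, 1110, 1135, 1375, 1450, 1730, 2930],
    7: [760, 1600, 1730],  # hypothetical second reference spectrum
}
peak_index = build_peak_index(peak_tables)
matches = search(peak_index, [1738, 2925, 905])
print(matches)  # [(648, 3), (7, 1)]
```

Widening `tolerance` makes each spectrum appear in more records, so fewer correct spectra are lost but the list of matches grows, which is the trade-off described above.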
4.4.3 Organization of full-curve or reduced representations of spectra

Representations. From the formal point of view, the full-curve and reduced representations of spectra, as well as all handling of them, are identical. The only difference is the length (dimensionality) of the representations. The most important aspect of the 'reduced spectral representation' (for the reduction of the spectral curves see Chapter 5) is the possibility to work with a significantly smaller number of variables compared to the number of intensity values of the full-curve representation. It is assumed, of course, that the reduced representation carries only slightly less information than the original full-curve spectrum.
Two quantities are most commonly evaluated (calculated) during spectra comparison: the first one is the similarity Sij between two spectra and the second one the representation of a group of spectra. The similarity between two spectra is used for retrieval, ranking, clustering, structure prediction, simulation, etc., while the representation of a group is mainly used for linking together structurally similar compounds or compounds with similar properties, or for the extraction of significant features. The underlying assumption in the evaluation of both, the Sij and the representation of a group, is that compounds with similar properties have similar structural features and thus produce similar spectra. In view of the fact that no strict rule for a quantitative definition of similarity between structures exists, it is hard to justify the above assumption. However, many valuable results can be obtained using the correlation between the similarity of properties (structures) and the similarity of spectra. If the reduced representation of a spectrum i is written in a 'vector' form Ri as

$$R_i = (r_1, r_2, r_3, \ldots, r_m) \tag{4.7}$$

then the similarity Sij between two 'spectra' Ri and Rj can be expressed as the inverse of the distance dij between the corresponding representations.
The distance between two points Ri and Rj in the representation space can be any nonnegative, real, commutative function that satisfies the triangle inequality (ref. 8). Usually, when comparing spectra, Euclidean or Manhattan distances are employed. The generalized form of both, the Minkowski distance, can be written as follows:

$$d_{ij} = \left( \sum_{k=1}^{m} \left| x_{ki} - x_{kj} \right|^{p} \right)^{1/p} \tag{4.9}$$

where x_ki is the k-th component of the representation Ri and m is the dimensionality of the measurement space (representation). For p = 1 and p = 2, the Manhattan and Euclidean distances are obtained, respectively (ref. 8). Once the distance (similarity) between individual spectra is defined, a ranked list of the most similar matches to the query, or any other related quantity, can be obtained by scanning over the entire reference collection.

The representation of a group of spectra must emphasize the common spectral properties of the linked compounds very clearly, otherwise the extraction of relevant structural features becomes very difficult if not impossible. One of the most commonly used (although not the best) representations for a group of objects is their average:
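$$G = \frac{1}{n} \sum_{i=1}^{n} R_i = (g_1, g_2, \ldots, g_m) \tag{4.10}$$

where n denotes the number of spectra in the group. Usually much better, even though harder to obtain, is a weighted average:

$$G' = W \cdot G = (w_1 g_1, w_2 g_2, \ldots, w_m g_m) = (g'_1, g'_2, \ldots, g'_m) \tag{4.11}$$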
with weights wi (values between 0 and 1) expressing the importance of each specific component xi in the representation scheme. Adequate weights for a reduced representation are harder to obtain than for a full-curve one because for the latter a number of spectra-structure correlations are available. The weights for reduced representations can be obtained by a trial-and-error procedure on a number of known cases using some standard clustering method (ref. 9,10) for checking the results.

Handling large collections. Once the representation G (or G') of a group is established in the same m-dimensional space as the objects (spectra or their reduced representations), the distance between the groups and/or objects can be evaluated using equation (4.9). Although the full-curve or reduced spectra are mainly exploited in a sequential way (i.e. one after another through the entire collection), the most efficient way to handle them is a hierarchical organization. Figure 4.8 shows hierarchically organized spectra. Although the space used for a hierarchical organization is twice that required for the sequential one, the loss is more than compensated by a significant gain in efficiency and quality of retrieval.
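The Minkowski distance of equation (4.9), a sequential ranked retrieval, and the (weighted) average of a group can be sketched in Python as follows. Representations are assumed to be plain lists of numbers of equal length; the function names and the small example collection are hypothetical.

```python
def minkowski(a, b, p=2):
    """Equation (4.9): p = 1 gives Manhattan, p = 2 Euclidean distance."""
    if len(a) != len(b):
        raise ValueError("representations must have the same dimensionality")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def rank(query, collection, p=2):
    """Sequential scan: (distance, ID) pairs, most similar first."""
    return sorted((minkowski(query, rep, p), sid)
                  for sid, rep in collection.items())


def group_average(members):
    """Component-wise average of the members of a group."""
    n = len(members)
    return [sum(component) / n for component in zip(*members)]


def weighted(group, weights):
    """Component-wise product of weights and the group representation."""
    return [w * g for w, g in zip(weights, group)]


references = {"a": [3.0, 4.0], "b": [1.0, 0.0]}   # hypothetical reduced spectra
ranking = rank([0.0, 0.0], references)
G = group_average(list(references.values()))
G_weighted = weighted(G, [1.0, 0.5])
print(ranking)        # [(1.0, 'b'), (5.0, 'a')]
print(G, G_weighted)  # [2.0, 2.0] [2.0, 1.0]
```

The sequential scan needs one distance evaluation per reference spectrum, which is the cost the hierarchical organization discussed in the text is meant to reduce.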
Fig. 4.8 Hierarchically organized spectral data base. Full and empty circles represent single spectra, Ri, and groups of spectra, Gj, respectively. The update or retrieval always starts at the root and proceeds toward the leaves (Ri's). Some of the clusters (A, B, C) contain spectra of compounds having easily recognizable structural features in common. For an object X travelling through such a cluster, the common structural feature can be predicted.
The most outstanding property of an N-member hierarchy is that each individual object (spectrum in this case) can be reached from the root in approximately log2 N comparisons. The actual value depends on how well the hierarchy (tree) is balanced, but even for trees that are far from being perfectly balanced, the average number of comparisons is very small compared to the number of spectra in the entire collection. The scope of the book is too limited to explain the details of how such an organization can actually be achieved. The interested reader is referred to the relevant references (ref. 11,12). It has to be said, however, that a hierarchical organization shows its full potential when large numbers of items are to be handled. By 'large number' we understand collections containing several thousand spectra or more.
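The descent from the root toward the leaves can be illustrated with a small Python sketch. It is not the tree-building procedure of refs. 11 and 12; it merely assumes a ready-made binary tree in which every group node stores the average of its members and retrieval follows the closer of the two descendants at each level. The class `Node` and the four example spectra are hypothetical.

```python
import math


class Node:
    """A leaf (single spectrum) or a group with two descendants."""

    def __init__(self, vector, left=None, right=None, label=None):
        self.vector = vector  # representation R_i or group average G_j
        self.left = left
        self.right = right
        self.label = label


def retrieve(root, query):
    """Walk from the root to a leaf, always taking the closer branch."""
    node, comparisons = root, 0
    while node.left is not None and node.right is not None:
        d_left = math.dist(query, node.left.vector)
        d_right = math.dist(query, node.right.vector)
        comparisons += 2
        node = node.left if d_left <= d_right else node.right
    return node.label, comparisons


# Four hypothetical reduced spectra arranged in a balanced tree.
r1 = Node([0.0, 0.0], label="R1")
r2 = Node([0.0, 2.0], label="R2")
r3 = Node([10.0, 0.0], label="R3")
r4 = Node([10.0, 2.0], label="R4")
g1 = Node([0.0, 1.0], r1, r2)   # average of R1 and R2
g2 = Node([10.0, 1.0], r3, r4)  # average of R3 and R4
root = Node([5.0, 1.0], g1, g2)

found, used = retrieve(root, [9.0, 0.4])
print(found, used)  # R3 4
```

With two distance evaluations per level, a balanced tree of N spectra is traversed in about 2·log2 N evaluations instead of the N needed by a sequential scan.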
4.5 CONCLUSION
In any field, be it chemistry, medicine, archeology, economy, or any other one, there is always a need for programming a specific problem on one's own. It is true that such a piece of code cannot be a substitute for a professional software package, but it can in many cases shorten tedious work and/or accelerate the solution of a troublesome situation or even solve the entire problem. In spite of the fact that such programs are very seldom passed to other persons or groups and are not measured with the same yardstick as the packages on the market, they should nevertheless use sound algorithms and proper methods to attack the specific problems, which in turn requires knowledge of basic algorithms and fundamentals of programming. The above is true particularly in science, where a number of programs are written daily with very specific needs in mind. Trying to be at the top of a field, scientists try to treat their data in a unique way at one point at least of the data handling process, a requirement that by definition cannot be met by bought software.
4.6 REFERENCES

1. N.A.B. Gray, 'Computer-Assisted Structure Elucidation', John Wiley, New York, 1986, chapters 7 and 9.
2. K.A. Ross, C.R.B. Wright, 'Discrete Mathematics', Prentice-Hall International, Inc., Second Edition, London, 1988.
3. F. Harary, 'Graph Theory', Addison Wesley, Reading, 1972, chapters 2, 4, and 13.
4. J. Zupan, 'Algorithms for Chemists', John Wiley, Inc., Chichester, 1989.
5. D.E. Knuth, 'The Art of Computer Programming', Vol. 3, Sorting and Searching, Addison Wesley, Reading, Second printing, 1975, p. 506.
6. J. Zupan, Ed., 'Computer-supported Spectroscopic Data Bases', Ellis Horwood, Inc., Chichester, 1986.
7. J. Zupan, S. Bohanec, M. Razinger, M. Novic, Reduction of the Information Space for Data Collections, Anal. Chim. Acta, 210, (1988), 63-72.
8. K. Varmuza, 'Pattern Recognition in Chemistry', Springer Verlag, Berlin, 1980, p. 25.
9. B. Everitt, 'Cluster Analysis', Heinemann Educational Books, London, 1977.
10. D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, 'Chemometrics: A Textbook', Elsevier, Amsterdam, 1988, p. 371.
11. J. Zupan, 'Clustering of Large Data Sets', Research Studies Press (Wiley), Chichester, 1982.
12. J. Zupan, M.E. Munk, Hierarchical Tree based Storage, Retrieval, and Interpretation of Infrared Spectra, Anal. Chem., 57, (1985), 1609.
5 REDUCTION OF THE INFORMATION SPACE FOR DATA COLLECTIONS
Marko RAZINGER and Marjana NOVIC 'Boris Kidric' Institute of Chemistry, Hajdrihova 19, YU-61115 Ljubljana, Yugoslavia
5.1 INTRODUCTION
Handling large numbers of complex multivariate data (spectra, structures, images, sounds, etc.) for storing, searching, evaluating, comparing, etc., has become a routine task in many computer applications. With the increasing computing power of personal computers these tasks are bound to be carried out extensively on such machines as well. Although quite large quantities of complex data can be stored on large hard disks, the actual bottleneck is still the computation time needed to handle large numbers of complex objects. The computation time depends on the number of variables (components) with which each object is represented. The general form of a multivariate object which is stored in a collection of equally represented items is the vector form:
$$X = (x_1, x_2, x_3, \ldots, x_N) \tag{5.1}$$
Because the first phase of manipulating such data is usually a fast scan of the entire collection, a highly compressed representation of uniformly coded data is essential in order to accelerate the handling. After the search has reduced the collection to a smaller group in which the target object is supposed to be, the full (extended) representation of the objects can be invoked if necessary for further manipulation. In the next sections we shall discuss the use of two methods, the Fast Fourier Transformation (FFT) and the Fast Hadamard Transformation (FHT), for the reduction of object representations, and show by some examples of 1- and 2-dimensional patterns (spectra, images) how the explained procedures can be used in order to estimate the most appropriate reduction scheme for a given problem. It has to be emphasized that any reduction of the information space means a loss of information; hence, great care must be taken in each application to avoid losing the relevant part.
5.2 FAST FOURIER AND FAST HADAMARD TRANSFORMATION
The algorithm for the FFT (the 'reverse butterfly' in our case) is well known (ref. 1, 2) and will not be discussed here in detail. The FHT, on the other hand, has often been neglected in spite of some advantages it offers. Because both transformations 'rotate' the time domain into the frequency space and vice versa, the only conceptual difference between them is the choice of basis vectors (sine and cosine functions vs. Walsh or 'box' functions). In general, a rotation or transformation without a translation can be written in the following form (ref. 3):
$$Y = AX \tag{5.2}$$
In equation (5.2), X is the original vector represented by N components, Y is the new, rotated vector represented by the same number of components, and A is the rotation matrix. In the case of the Fourier transformation the rotation matrix A (eq. 5.2) is written as a matrix W:
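$$
W = \|W_{jk}\| =
\begin{pmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & u^{1} & u^{2} & u^{3} & \cdots & u^{N-1} \\
1 & u^{2} & u^{4} & u^{6} & \cdots & u^{N-2} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
1 & u^{N-2} & u^{N-4} & u^{N-6} & \cdots & u^{2} \\
1 & u^{N-1} & u^{N-2} & u^{N-3} & \cdots & u^{1}
\end{pmatrix}
\tag{5.3}
$$

where the matrix element W_jk (indices j and k are in the range between 0 and N-1) is equal to:

$$W_{jk} = u^{jk}, \qquad u^{N} = 1$$

so that u is an N-th root of unity and the exponents in (5.3) are reduced modulo N.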
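In the case of the Hadamard transformation the rotation matrix H can be generated from the initial H_0 = 1 using the simple recursion formula (ref. 4):

$$H_{k+1} = \begin{pmatrix} H_k & H_k \\ H_k & -H_k \end{pmatrix}$$

which makes the first three higher transformation matrices equal to:

$$
H_1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
H_2 = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix},
$$

$$
H_3 = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{pmatrix}
$$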
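The recursion can be sketched in a few lines of Python (not the language of the listings in this chapter; the function name hadamard is an illustrative assumption). The sketch assumes the doubling scheme in which each matrix consists of four copies of the previous one with the lower right copy negated, which is the scheme that reproduces the 2 x 2, 4 x 4 and 8 x 8 matrices printed here.

```python
def hadamard(k):
    """Return the Hadamard matrix H_k as a list of rows, starting from H_0 = [[1]]."""
    h = [[1]]
    for _ in range(k):
        top = [row + row for row in h]                   # [ H   H ]
        bottom = [row + [-v for v in row] for row in h]  # [ H  -H ]
        h = top + bottom
    return h
```

Calling hadamard(3) gives the 8 x 8 matrix; its rows are mutually orthogonal, which is why the same matrix, divided by N, also serves for the reverse transformation.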
The algorithms for both fast transformations are basically equivalent. As can be seen clearly from Figure 5.1, the FHT algorithm is a subset of the FFT algorithm; therefore it is not surprising that the FHT is faster than the FFT. The main difference between the two algorithms lies in the fact that even for a real input the FFT always needs complex arithmetic, while the FHT operates entirely in the real space. This fact alone speeds up the FHT by at least a factor of 2, not to mention the savings in computer memory (see the specifications of arrays in both subroutines, FFT and FHT, shown in Figure 5.1).
For the transformation of 1-dimensional objects represented as vectors (arrays), the corresponding routine (FFT or FHT) is called only once, with the object stored in the input array 'xreal' (Figure 5.1). On the other hand, when 2-dimensional objects, represented numerically as matrices, have to be transformed, the subroutines are called 2N times: first N times, once for each row of the original data matrix, and then N times again, once for each column of the matrix with the already transformed rows. It has to be remembered that in the second part of the Fourier transformation the input arrays (the coefficients obtained in the first part) have both real and imaginary components. For the reverse transformation the same routines (source codes) can be used in both FFT and FHT. However, for the reverse Fourier transformation the real and imaginary arrays of the coefficients (which are now the input) should be divided by N (the number of coefficients) and the imaginary array must be conjugated (multiplied by -1), while in the case of the reverse Hadamard transformation only a division of the N real coefficients by N is necessary.
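A minimal Python sketch of this calling scheme, for the Hadamard case only so that everything stays real. The function names bit_reversed, fht, inverse_fht and fht2 are illustrative assumptions; the loops follow the butterfly of Figure 5.1, preceded by the optional bit-inversion reshuffling of the input, and the data length is assumed to be a power of two.

```python
def bit_reversed(i, bits):
    """Return the integer whose `bits`-bit binary pattern is that of i, reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fht(values, reshuffle=True):
    """Fast Hadamard transform of a sequence whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    if reshuffle:
        a = [values[bit_reversed(i, bits)] for i in range(n)]
    else:
        a = list(values)
    l = 1
    while l < n:                      # one pass per butterfly stage
        step = 2 * l
        for m in range(l):
            for i in range(m, n, step):
                j = i + l
                a[i], a[j] = a[i] + a[j], a[i] - a[j]
        l = step
    return a


def inverse_fht(coeffs, reshuffle=True):
    """Reverse transform: the same routine, followed by a division by N."""
    n = len(coeffs)
    return [c / n for c in fht(coeffs, reshuffle)]


def fht2(matrix):
    """2-D transform: every row first, then every column of the result."""
    rows = [fht(row) for row in matrix]
    cols = [fht(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

For an N x N matrix, fht2 calls fht 2N times, as stated in the text; the reverse 2-D transformation would apply fht2 again and divide every element by N*N.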
5.3 REDUCTION OF THE COEFFICIENTS
The advantage of the transformed objects over the original ones in data reduction schemes lies in the order induced in the sequence of coefficients. This order is correlated with frequency: while in the original data the information is more or less uniformly distributed over the whole sequence, in the transformed object the first few low-frequency coefficients contain the information about the rough contours of the original object and the high-frequency coefficients describe the details. In both Fourier and Hadamard transforms the most important part of the information can be retained after back-transformation with a proper choice of coefficients. It is, of course, a matter of discussion which part of the information is the most important one: within the same set of objects, different problems will require different parts of the information to be regarded as the most important. It is not the purpose of this chapter to discuss this; here, we shall accept as 'the most important' part of the information the most prominent feature of the pattern, which is contained in the low-frequency coefficients.
Information is gradually withdrawn from the pattern as the number of coefficients used for the back-transformation is reduced. The zero-order coefficient, c_0, of both transforms, Fourier and Hadamard, carries the sum (integral) of all elements of the original representation of the object and does not contribute any other information (ref. 5). Figure 5.2 shows the order of importance of the other coefficients for both transformations.
      SUBROUTINE FFT(N,A,B)
      DIMENSION A(N),B(N)
      PI = 3.141592654
      MR = 0
      DO 12 M = 1,N-1
      L = N/2
   10 IF((MR+L).LE.N-1) GO TO 11
      L = L/2
      GO TO 10
   11 MR = MOD(MR,L) + L
      IF(MR.LE.M) GO TO 12
      XR = A(MR+1)
      A(MR+1) = A(M+1)
      A(M+1) = XR
      XI = B(MR+1)
      B(MR+1) = B(M+1)
      B(M+1) = XI
   12 CONTINUE
      L = 1
    1 CONTINUE
      IF(L.GE.N) RETURN
      ISTEP = 2*L
      DO 3 M = 1,L
      FI = -PI*(M-1)/L
      C = COS(FI)
      S = SIN(FI)
      DO 2 I = M,N,ISTEP
      J = I + L
      XR = A(J)*C - B(J)*S
      XI = B(J)*C + A(J)*S
      A(J) = A(I) - XR
      A(I) = A(I) + XR
      B(J) = B(I) - XI
      B(I) = B(I) + XI
    2 CONTINUE
    3 CONTINUE
      L = ISTEP
      GO TO 1
      END

      SUBROUTINE FHT(N,A)
      DIMENSION A(N)
C     Reshuffling of input data using the bit inversion is not
C     necessary in FHT, but is recommended for easier reduction
C     of the measurement space (cut-off of coefficients)
      MR = 0
      DO 12 M = 1,N-1
      L = N/2
   10 IF((MR+L).LE.N-1) GO TO 11
      L = L/2
      GO TO 10
   11 MR = MOD(MR,L) + L
      IF(MR.LE.M) GO TO 12
      XR = A(MR+1)
      A(MR+1) = A(M+1)
      A(M+1) = XR
   12 CONTINUE
      L = 1
    1 CONTINUE
      IF(L.GE.N) RETURN
      ISTEP = 2*L
      DO 3 M = 1,L
      DO 2 I = M,N,ISTEP
      J = I + L
      XR = A(J)
      A(J) = A(I) - XR
      A(I) = A(I) + XR
    2 CONTINUE
    3 CONTINUE
      L = ISTEP
      GO TO 1
      END
Fig. 5.1 Comparison of the FORTRAN source codes of the FFT and FHT algorithms, respectively. Besides the fact that the FHT does not need the evaluation of cosine and sine functions (or the use of complex arithmetic), it saves time and computer memory by applying the transformation only to real coefficients.
Fig. 5.2 Order of importance of the coefficients. The reduction of FCs (a) starts with the middle coefficient of the real and imaginary parts and then propagates, as indicated by arrows, symmetrically towards the first and last coefficients, which are the most important ones (the first one is blackened, the next most important ones are darkened in half tone). The reduction of HCs (b) starts with the last coefficient and continues backwards towards the most important coefficients, which are painted in the same way as for the FCs.
In the Fourier transform the least significant coefficient is in the middle of the series, and the closer the coefficients lie to either end of the series (towards c_0 and c_{N-1}), the greater is their information content. In the Hadamard transform, on the other hand, the least important coefficient is the last one of the series.
In order to reconstruct the object from the truncated set of coefficients, the back-transformation (FFT or FHT) must be made on the same number of coefficients as the forward transformation, i.e. with N coefficients. The only difference between the full and the reduced back-transformation is that in the latter a certain number of FCs or HCs are set to zero.
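The two truncation orders can be sketched in Python as follows. For brevity the sketch uses a plain (slow) discrete Fourier transform instead of the FFT of Figure 5.1, with the same sign convention; the function names dft, truncate_fourier and truncate_hadamard are illustrative assumptions, and the Hadamard coefficients are assumed to be in the order in which the last coefficient is the least important one.

```python
import cmath


def dft(values, inverse=False):
    """Plain O(N^2) discrete Fourier transform; the inverse divides by N."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for m in range(n):
        total = 0
        for k, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(total / n if inverse else total)
    return out


def truncate_fourier(coeffs, kept):
    """Keep the `kept` coefficients nearest to both ends; zero the middle ones."""
    n = len(coeffs)
    order = sorted(range(n), key=lambda m: min(m, n - m))  # 0, 1, n-1, 2, n-2, ...
    keep = set(order[:kept])
    return [c if m in keep else 0 for m, c in enumerate(coeffs)]


def truncate_hadamard(coeffs, kept):
    """Keep the first `kept` coefficients; zero the rest."""
    return list(coeffs[:kept]) + [0] * (len(coeffs) - kept)
```

In both cases the truncated array keeps its full length N, so the back-transformation is carried out on N coefficients. An odd value of `kept` in truncate_fourier retains c_0 together with complete pairs from both ends, which keeps the reconstruction of a real object real.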
5.4 REDUCTION OF REPRESENTATIONS
The purpose of this section is to show the reader how different degrees of reduction of the information space (the number of new variables) influence the final reproduction. As the procedures for the reduction depend on the type of data, i.e. smooth curves, discrete distributions, and 2-dimensional patterns, we shall discuss each of them in an appropriate subsection. The property common to all of them is the same representation in which they can be treated by the computer (see equation 5.1).
5.4.1 Smooth curves
Under the term 'smooth curves' we understand any curves obtained by computerized instruments that are represented as intensities sampled at equidistant intervals. In such curves not only the positions of the maxima (peaks), but also the shape of the curve between them is important. Typical examples of such objects in chemistry are infrared (IR) spectra of compounds, represented with N intensities, or chromatograms sampled at N equidistant time intervals. All such measurements are of course written as measurement vectors X (eq. 5.1). In this paragraph an infrared spectrum written as an N-dimensional vector will be considered (see expression 5.1). In the language of spectroscopy, the question we shall explore is: how many intensity points, compared with the original recording, are really needed to represent the infrared spectrum so that it can still be uniquely identified (retrieved) in a collection, or to enable the extraction of some structural features of the compound from the truncated representation?
The original representation of the infrared spectrum in this example is a set of 512 equidistant intensity values (ref. 6). In order to show the reduction of the information content, the reduction of Fourier and Hadamard coefficients (FCs and HCs) in the transform is carried out to the extreme. In Figure 5.3 the spectrum is reproduced from a reduced number of coefficients obtained with the FFT and FHT of the 512-intensity-point curve.
How the spectroscopic information (number, positions, intensities, and shapes of peaks) changes with a decreasing number of coefficients taken into account can be seen clearly. The FT introduces noise waves in the flat areas of spectra very early (64 FCs in Fig. 5.3), while the use of the HT causes intensity cut-offs (128 HCs in Fig. 5.3). The loss of resolution is about the same for both transformations: neighboring peaks merge into one at the same stage of coefficient reduction in both transforms. Because the spectral resolution depends on the terms with the highest frequency present in the FT or HT, reducing the representation by the same percentage of high-frequency coefficients will have the same effect on the resolution for both transformations. With regard to the interpretability of the infrared spectra, it was found experimentally (ref. 6 and 7) that a representation of somewhere between 64 and 128 coefficients (closer to the 128 side) is the minimum required by an experienced spectroscopist for the determination of specific structural features. This does not mean that any infrared spectrum could be completely interpreted from a curve consisting of only 128 intensity points; however, it has been shown recently that identical clustering in a small group of spectra can be obtained even with a far more reduced representation containing only about 3% of the original set (ref. 7).
5.4.2 Discrete spectra
Besides smooth curves (for example IR spectra, chromatograms, DTA curves, etc.), where the shape of the curve is important, there are other types of experiments where the important information is only the position and intensity of the peaks (notably other spectroscopies such as 13C NMR and mass spectroscopy). The main purpose of the FFT or FHT of discrete spectra is not the reduction of storage space, because a tabular representation of peaks would usually be short enough (see Chapter 4). Nevertheless, an efficient (with respect to computation time) manipulation and comparison of spectra calls for a short and uniform representation, so the use of reduced transforms is justified and helpful, especially when searching spectral collections resident on personal computers.
Fig. 5.3 The IR spectrum of methyl vinyl ether-maleic anhydride copolymer represented by different numbers of FCs and HCs. The original consists of 512 intensity values. After the transformation, the coefficients are truncated step by step as shown in Fig. 5.2; the number of the remaining coefficients (3, 5, 9, 17, 33, 65, 129, 257) is shown beside the respective steps. The reduced representations are then transformed back to 512 intensity values and plotted.
In Figure 5.4 two examples of the reproduction of discrete spectra using different degrees of reduction are shown. In order to retain all the peaks, the coefficients must not be cut off above the frequency corresponding to the distance between the two closest peaks. This holds true for both FFT and FHT, but the latter gives better reproductions because it does not introduce noise waves around the peaks as the FFT does.
Fig. 5.4 Example of the reduction of information space in discrete spectra: a mass (a) and a 13C NMR (b) spectrum, chosen arbitrarily from an in-house spectral database, are represented similarly to the reproductions of the IR spectrum shown in Fig. 5.3. Only three reproductions, with 65, 129, and 257 coefficients, respectively, are shown in each case.
5.4.3 2-dimensional patterns
The reduction of Fourier or Hadamard coefficients of 2-D transforms is analogous to the 1-D problem, in the same way as the 2-D transformation described in paragraph 5.2 is analogous to the 1-D one. Instead of truncating a number of coefficients of the transform vector as in 1-D, in the 2-D transform it is necessary to truncate the whole transform matrix by omitting a number of columns and rows of coefficients (Fig. 5.5).
Fig. 5.5 Order of importance of the FCs (a) and HCs (b) for the reduction in 2-D transforms. The most important coefficients are in the first row and column, shown in black. The reduction starts with the least important coefficients, shown in half tone, and propagates in the direction shown by arrows.
The effect of various degrees of reduction on a 2-D pattern is shown in Figure 5.6. The pattern is taken from a simulated 2-D NMR spectrum (ref. 8). The peaks are drawn as contours at different levels to show the consequence of the reduction on the original intensities. Comparing the results of the two transformations, the advantage of the FHT is less noise, while the FFT gives a more faithful reproduction of the original peak shapes.

When it is not necessary to retain the differences in the object surface (the intensities in the matrix representation), a special way of describing 2-D objects by their outline is possible. In this case the object is numerically represented by two vectors (sets of X and Y coordinates) instead of the whole matrix. It is necessary to transform just the two vectors separately. Consequently, the procedure of transformation and reduction is only twice as slow as for 1-D objects if the number of coefficients N is equal, while for 2-D objects in matrix representation the speed of the procedure is reduced by a factor of N^2.

The most obvious feature of the original pattern in Figure 5.7 is its semantic content (sema = sign). The sign is immediately recognizable as 'number seven'. It can be clearly seen that this information is still recognizable if the original pattern is reproduced with only 7% (1/16 part) of the original number of coefficients. It has to be said, however, that in the field of chemistry the 'intrinsic' information contained in the experimental data cannot be recognized as easily as in the above example; consequently, much more work (usually trial and error) is necessary to find an appropriate ratio for the reduction of the transformed representation.
5.5 CONCLUSION
The described examples demonstrate the advantages of FFT and FHT for the reduction of the information space for a variety of patterns. A reproduction using one half of the coefficients leaves an object practically indistinguishable from the original (see Fig. 5.3). Even after very drastic truncations the objects are still recognizable, as is evident from the example in Figure 5.7. Drastically truncated IR spectra might not seem to be recognizable at first glance (see Fig. 5.3). However, for certain applications like clustering (ref. 7) they still contain enough information to give the same results as the original spectra.
Fig. 5.7 Contours of the sign '7' represented by different numbers of FCs and HCs. The original contour is represented by 256 coordinate pairs. The sets of x-s and y-s are first transformed using FFT and FHT, then the FCs and HCs are reduced, and finally the reduced representations are transformed back to the 256 pairs of coordinates.
It was shown that both FFT and FHT are adequate for the reduction of the representation of complex objects. Both transforms behave very similarly with regard to the information content, and the advantage of using the FHT, which is about 8 times faster, is obvious. Before the amount of reduction of the representation (with respect to the original set) is chosen, a study of the selectivity of the final representation is desirable. Finally, not only on personal computers but also in large-scale information systems handling hundreds of thousands of complex data (such as spectra or images), a compressed form of representation can economize in all handling routines and thus increase the output of the system significantly.
5.6 REFERENCES
1 E.O. Brigham, 'The Fast Fourier Transform', Prentice-Hall, Englewood Cliffs, NJ, (1973),
2 N. Schaefer, M. Bertuch, 'Butterfly-Algorithmus', c't, 1986 (8), p. 44-50,
3 T.R. Brunner, R.C. Williams, C.L. Wilkins, P.J. McCombie, Hadamard Transformed Carbon-13 Nuclear Magnetic Resonance Spectra - Pattern Recognition Analysis, Anal. Chem., 46, (1974), 1798-1802,
4 B.R. Kowalski, C.F. Bender, The Hadamard Transform and Spectral Analysis by Pattern Recognition, Anal. Chem., 45, (1973), 2234-2239,
5 M. Shridhar, A. Badreldin, 'High Accuracy Character Recognition Algorithm Using Fourier and Topological Descriptors', Pattern Recognition, 17, (1984), 515-524,
6 J. Zupan, M. Novic, 'Hierarchical Ordering of Spectral Databases', in 'Computer Supported Spectroscopic Databases', Ed. J. Zupan, Ellis Horwood Int. Publ., (1986), p. 42-63,
7 J. Zupan, S. Bohanec, M. Razinger, M. Novic, Reduction of the Information Space for Data Collections, Anal. Chim. Acta, 210, (1988), 63-72,
8 P. Pfaendler, G. Bodenhausen, Strong Coupling Effects in z-Filtered 2-D NMR Correlation Spectra, J. Magn. Reson., 72, (1987), 475-492.
6 PROLOG ON PCs FOR CHEMISTS
Hans MOLL and Jean Thomas CLERC Pharmaceutical Institute, University of Berne, Baltzerstrasse I, CH-3012 Berne, Switzerland
6.1 INTRODUCTION
In an expert system some essential components can be identified. The first is a set of facts which specify relations between defined objects. This part is often referred to as the data base. The second is a collection of rules which specify how to reach conclusions by combining the facts. In most cases a rule is equivalent to a statement about relations between object clauses, or it may be given as a relation between not fully defined objects. Thus, the boundary line between rules and facts is rather ill-defined.
In the general case several rules may be applicable in a given situation. This may lead to a conflict which has to be resolved. An expert system will therefore need rules that decide which rules shall be applied in which sequence. Thus, we need a priority scheme for rules. This means that in addition to rules about objects, there will be rules about rules, i.e. meta-rules. Furthermore, the selection of a given rule from the set of applicable rules may lead to a dead end in the reasoning process. This necessitates a mechanism to go back to the most recent branching point and to take an alternative choice from the set of applicable rules (if there is one). This is referred to as backtracking.

These mechanisms may be specified in any computer language. However, in some specialized languages some of the components are included in the language definition. An example is the language PROLOG, which automatically keeps track of the reasoning process and backtracks automatically. It uses a simple priority scheme for the rules: any rule takes precedence over all other rules specified later in the list of rules. The PROLOG user is thus relieved of the burden of keeping detailed track of the reasoning process and of the administrative atrocities of the backtracking process, and he may easily set the priority of his rules through their sequence in the program.

The PROLOG programmer supplies a description of the problem he wants to solve by specifying its components and the relations holding between them. He is not required to give a detailed stepwise recipe on how to arrive at a solution. The PROLOG system will systematically explore all the facts and rules given to find an acceptable solution. If one exists within the frame of the rules and facts specified, it will be found. However, nothing can be said about the time needed to arrive at a solution or the time needed to conclude that no solution exists. This depends heavily on how the rules are formulated by the user and on their sequence.

In the following we will show how a comparatively simple problem can be specified in PROLOG and at the same time give an informal tutorial-type introduction to some of the basic features of the language. No attempt is made to present computationally optimal solutions; we have rather chosen examples emphasizing the philosophical particularities of PROLOG as a non-procedural programming language. The examples given do not rely on a particular implementation of PROLOG. Any implementation conforming to the minimum standard described by Clocksin and Mellish (ref. 1) will be suitable. It is further assumed that the reader is sufficiently familiar with his computer to get his PROLOG system started.
6.2 DATABASE
6.2.1 General
Let us assume that we intend to build a very simple expert system to handle the problems associated with the separation of drugs with thin layer chromatography (TLC) (ref. 2-4).
The most basic fact of this set of problems is that every compound has one retention factor in a given separation system. To make things easy we will encode the different separation systems with the codes ta, tb, ... and use retention factors multiplied by 100 and truncated to an integer value. To distinguish them from the conventional retention factors Rf we will call them hrf values. Furthermore, for reasons to be given later, we will use lower case characters for names. Our objects are the compounds to be separated and the hrf values they have in the various separation systems. Thus, we relate the compound, identified by its name, with the pair consisting of separation system and hrf value. We will call these relations hrf. Thus, the fact that the compound aconitine in the TLC system encoded as tb has the hrf value 45 is specified as
hrf(aconitine,(tb,45)).

Note that the items are separated by commas, and that the statement ends with a full stop. The parenthesized combination of TLC system code and retention value is called a structure. Seen from outside the parentheses it is just one single object.
To get started we will enter the following short data base of facts into the system:

hrf(aconitine,(tb,45)).
hrf(acetanilide,(tb,35)).
hrf(aconitine,(ta,25)).

The easiest way to put these data into the computer is to instruct PROLOG to consult with the user, who will supply information via the keyboard. The command to initiate this process is

consult(user).

input at the PROLOG system prompt. In most implementations the prompt is ?-, which signals that the PROLOG system is ready to accept input for immediate processing from the user. Information input after the consult command is not for immediate processing; the system will rather store it in the internal data base. To indicate this fact most systems then use a different prompt (or sometimes no prompt at all). After the consult prompt we may now type in the list of facts, ending each line with a carriage return. Be sure to input each statement exactly as given above. In particular, use lower case letters and do not forget the period at the end. In most implementations the consulting session with the user is ended by entering the character ctrlh (the ASCII end-of-file mark). The system should then return to the standard state and display the respective prompt again (?- in most implementations).
6.2.2 Exploring the database

What can we now do with this small model data base? We may, for instance, ask the system whether aconitine in the system tb has a hrf value of 45. To do this we enter at the system prompt ?- the question

hrf(aconitine,(tb,45)).

PROLOG now searches through the data base to find an entry which matches our input exactly. If it finds one, it returns yes on the screen (or true or something equivalent, depending on the implementation). If, however, we ask

hrf(aconitine,(ta,55)).

the answer will be no, because this statement conflicts with the respective fact. If we ask

hrf(barbitone,(tb,15)).
the answer will also be no, even though the system does not know whether this statement is correct. The answer no indicates 'not provable with the information available' rather than 'false'.

Applications of the above type are rather trivial. We may, however, submit more sophisticated questions, as for example 'Is there a compound that in system ta has a hrf value of 25?'. PROLOG uses the convention that a string beginning with a capital letter represents a variable. Initially a variable represents an as yet undefined object. During the reasoning process an object might become associated with this variable, i.e. the variable becomes instantiated. Thus, if we ask

hrf(Compound,(ta,25)).

PROLOG tries again to match the input statement with the entries in the data base. The variable Compound is instantiated to whatever object happens to occupy the respective position in the data base entry. If a match is found, the value assigned to the variable is put out. In our example the result (its presentation may again vary slightly with different implementations) will be

Compound = aconitine
There may be other solutions to the problem at hand. The system thus asks whether we are satisfied with the solution just given or whether it should try to find alternate solutions. In some implementations the system poses the respective question explicitly, awaiting a specified response (generally yes or no). Other systems just wait for the user to act. The convention in these systems is often that a carriage return signals acceptance of the solution, whereas input of a ; indicates that alternate solutions are to be found. In the present case, there are no other solutions. Thus, if we ask for more we get the answer no. Another interesting question may be

hrf(aconitine,X).

which translates into 'Which combinations of TLC systems and hrf values are known for aconitine?'. As an answer we get as the first solution

X = tb,45

and upon backtracking (by asking for alternate solutions)

X = ta,25

and finally no to indicate that there are no other possibilities to satisfy our conjecture.
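The matching of a question against the stored facts can be imitated in a few lines of Python. This is only a sketch of the idea, not of how a PROLOG system is implemented; the names Var, HRF, match and solutions are illustrative assumptions. A question is written as a nested tuple in which Var objects play the role of variables, and the facts are scanned from the top of the data base downwards.

```python
class Var:
    """A PROLOG-style variable, identified by its name."""

    def __init__(self, name):
        self.name = name


# The three facts of the text, in the order in which they were entered.
HRF = [
    ("aconitine", ("tb", 45)),
    ("acetanilide", ("tb", 35)),
    ("aconitine", ("ta", 25)),
]


def match(pattern, fact, bindings):
    """Match a question against one fact.

    Returns the (possibly extended) variable bindings, or None on a mismatch.
    """
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            return bindings if bindings[pattern.name] == fact else None
        return {**bindings, pattern.name: fact}
    if isinstance(pattern, tuple):
        if not isinstance(fact, tuple) or len(pattern) != len(fact):
            return None
        for sub_pattern, sub_fact in zip(pattern, fact):
            bindings = match(sub_pattern, sub_fact, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == fact else None


def solutions(pattern):
    """Yield one set of bindings per matching fact, scanning from the top."""
    for fact in HRF:
        bindings = match(pattern, fact, {})
        if bindings is not None:
            yield bindings
```

In this sketch an exhausted generator corresponds to the answer no, a single empty set of bindings to yes, and asking for the next item of the generator to typing ; at the prompt.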
6.3 ELEMENTS OF THE PROLOG
6.3.1 Simple rules

The goal of TLC is to separate compounds. A very simple first approach to rules governing the separability of two compounds may be given as follows:

Two compounds A and B are separable in a TLC system T if:
A has in system T the hrf value of X and
B has in the same system T the hrf value of Y and
X and Y are different.    (6.1)
With the name sep for the relation 'separable', the operator :- to mean if, the operator \= to mean not equal, and the comma , to stand for and, the above statements (6.1) about the separability of two compounds translate into PROLOG as:

sep(A,B,T) :- hrf(A,(T,X)), hrf(B,(T,Y)), X \= Y.
This rule may again be input from the keyboard by using the command consult(user). at the system prompt ?-.
6.3.2 Backtracking/Instantiation

Let us now examine in more detail what happens if we enquire about the separability of aconitine and acetanilide in TLC system tb by asking

sep(aconitine,acetanilide,tb).

PROLOG tries to match our input with the heads of the rules it knows. This is successful with the rule just entered and results in the variables A, B and T in the rule becoming instantiated to aconitine, acetanilide and tb, respectively. After this PROLOG tests whether the given conditions can be satisfied from the data base. With the variables A and T being instantiated, the first condition has now become hrf(aconitine,(tb,X)), which matches with the first entry in the hrf list, namely hrf(aconitine,(tb,45)). This sets X to 45. The second condition is now hrf(acetanilide,(tb,Y)). This matches with the second entry, which is hrf(acetanilide,(tb,35)), setting Y to 35. The last condition requires 45 to be not equal to 35, which is satisfied. Thus, PROLOG concludes that, according to the rules and facts given, the relation sep evaluates to true and outputs the respective answer.

Another interesting question is 'Are there any pairs of compounds C1 and C2 separable in an unspecified TLC system S?'. We thus enter at the system prompt

sep(C1,C2,S).
This again matches the head of our separability rule, but does not result in any variables becoming instantiated. It just establishes equivalence between the variable names C1, C2, S and A, B, T respectively. The first condition matches hrf(aconitine,(tb,45)), resulting in aconitine being assigned to A (and C1), tb to T (and S) and 45 to X. This modifies the second condition to hrf(B,(tb,Y)), similar to the previous example. The matching process for the second condition again starts at the beginning of the data base. The first match is again with hrf(aconitine,(tb,45)), which sets B to aconitine and Y to 45. The third condition, X not being equal to Y, is obviously not satisfied with both variables instantiated to 45, and thus fails. PROLOG now initiates backtracking. It moves back to the most recent decision, which was the matching of the second condition with the first hrf entry. It first undoes its results, i.e. it sets B and Y free, and then tries the next possibility for hrf. This results in matching hrf(B,(tb,Y)) with hrf(acetanilide,(tb,35)). This assigns acetanilide to B and 35 to Y. Now the third condition is satisfied, and PROLOG answers by putting out the present assignments for C1 (or A), C2 (or B) and S (or T). If we request more solutions, PROLOG enters another backtracking cycle. The last decision made was that 45 is not equal to 35. There is obviously no other way to satisfy this condition, so backtracking moves one step deeper to the matching of hrf(B,(tb,Y)) with hrf(acetanilide,(tb,35)). The results of this matching process are undone (B and Y become free again) and the next hrf entry is used. However, hrf(B,(tb,Y)) does not match hrf(aconitine,(ta,25)), because ta does not match tb. The result is thus another failure. As there are no other hrf entries in the data base, backtracking has to go again one step further down to the matching of hrf(A,(T,X)) with hrf(aconitine,(tb,45)). Undoing the assignments sets A, T and X free. The next entry in the hrf list is hrf(acetanilide,(tb,35)), which is matched to hrf(A,(T,X)), resulting in A = acetanilide, T = tb and X = 35. The second condition then becomes hrf(B,(tb,Y)). Beginning again at the start of the list of hrf entries in the data base, the first match is with hrf(aconitine,(tb,45)), which sets B to aconitine and Y to 45. The third condition is satisfied with 35 not equal to 45, and PROLOG has found another solution. From the chemist's point of view this solution is the same as the first one. However, from the point of view of PROLOG, the two solutions are distinct, as we did not supply PROLOG with a rule stating that the order of the two compounds is irrelevant. Further backtracking does not produce other solutions.
The important thing to note in the previous examples is how PROLOG temporarily associates variables with values. This process is referred to as 'instantiation'. This binding of a variable to a constant (or to another variable) holds until the respective step is undone by backtracking or until a new question is entered.
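The backtracking sequence traced above can be reproduced with a short sketch. It assumes the three hrf facts quoted in the walkthrough, in that order, and the first version of the sep rule with the test X \= Y. The helper all_pairs is not part of the original program; it is a hypothetical name introduced here to collect, with the standard predicate findall, every solution that repeated backtracking would deliver.

```prolog
% Mini data base from the text: hRf values (Rf * 100) per TLC system.
hrf(aconitine,   (tb, 45)).
hrf(acetanilide, (tb, 35)).
hrf(aconitine,   (ta, 25)).

% Two compounds are separable in system T if their hRf values differ.
sep(A, B, T) :-
    hrf(A, (T, X)),
    hrf(B, (T, Y)),
    X \= Y.

% Hypothetical helper: gather all solutions in the order in which
% backtracking finds them.
all_pairs(Pairs) :-
    findall(C1-C2-S, sep(C1, C2, S), Pairs).
```

The question all_pairs(P). should return the two solutions discussed in the text, first aconitine with acetanilide and then the same pair in reversed order, both in system tb.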
Our mini data base contains two factual statements about the same compound aconitine. In a more realistic data base there will be many such cases. Specifying the same compound in several facts is somewhat redundant. It would be more economical to lump together the TLC data for a given compound in just one single statement. We would like to be able to manipulate all Rf data and TLC system codes as one piece of information and still have access to the individual entries. PROLOG supplies such a data structure, the list. A list is an ordered collection of entries, which can be addressed as a whole, and where the first entry (the head of the list) can be separated from the remainder (the tail of the list). The most common way to represent a list in PROLOG is to enclose its entries (the members of the list) between square brackets, separated by commas. If we call the new relation hrfl, our compressed data base thus can be represented as

hrfl(aconitine,[(tb,45),(ta,25)]).
hrfl(acetanilide,[(tb,35)]).

A list can be assigned to a variable, i.e. a variable can be instantiated to a list. Furthermore, the operator | serves to separate the head of a list from its tail. Thus, the PROLOG statement

[A|B] = [a,b,c]

results in A being instantiated to a, and B to [b,c]. Matching [A|B] to [a] produces A = a and B = [] (the empty list), and [A|B] does not match with []. Note that a list is not the same thing as a set. The sequence of the entries is important in a list, but is of no relevance in a set.
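The head and tail notation can be tried out with a one-line sketch. The predicate name split is an assumption made for this illustration only; it merely wraps the pattern [Head|Tail].

```prolog
% split(List, Head, Tail): separate the first entry of List from the
% remainder. There is no clause for the empty list, so split([],_,_) fails.
split([Head|Tail], Head, Tail).
```

Asking split([a,b,c],A,B). gives A = a and B = [b,c], split([a],A,B). gives A = a and B = [], and split([],A,B). fails, in line with the three cases described above.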
6.3.3 Recursion

To be able to work with lists in PROLOG we need some tools for the manipulation of lists. In particular, we will need a rule to determine whether a given element A is a member of the list L. An elegant way to answer this question goes as follows. If L is the empty list, then A is not a member of the list L. In PROLOG we indicate this with the condition fail, which always fails. Furthermore, this decision is final, there are no other possibilities to check, so we should never backtrack over this statement. The PROLOG operator to cut the backtracking path is the !, called the cut. If PROLOG ever backtracks to a !, it stops further backtracking and leaves the respective rule for good. If the element A is the head of the list L, then A is obviously a member of the list. Thus the predicate member(A,L) will test whether A is the head of the list L. If this test fails, we have to check whether A is a member of the tail. To do this we simply apply the member rule again, but this time specifying the tail of the list as the list argument. Using the PROLOG convention that a variable whose value we do not care about is represented by the character _ (underline), the translation into a PROLOG program results in:
member(_,[]):-
    !,
    fail.
member(A,[A|_]).
member(A,[_|Tail]):-
    member(A,Tail).
The program is again put into the system using consult(user). This very simple recursive program is quite powerful. Obviously we can ask whether b is a member of the list [a,b,c] by typing in member(b,[a,b,c]).
The first subrule does not match, so PROLOG tries the second one. This amounts to matching member(b,[a|[b,c]]) with member(A,[A|_]), which also fails. The third subrule reduces to member(b,[_|[b,c]]) and results in calling member(b,[b,c]). The first subrule again fails, but the second one results in success, and we get the answer yes. If the first argument to member is not a member of the list, we will eventually reduce the list to the empty list. Then the first subrule applies, resulting in a general fail of the member predicate. If we ask member(X,[a,b,c]). we get the answer X = a. Upon backtracking we are further presented with the alternate solutions X = b and X = c. The question member(x,[a,B,c]) leads to the unique answer B = x. To make use of this new tool we have to restructure our data base as indicated above, namely by entering the facts about the TLC behavior of our two model compounds in the new form. To enter the data we call again consult(user). and enter:

hrfl(aconitine,[(tb,45),(ta,25)]).
hrfl(acetanilide,[(tb,35)]).
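The behaviour of the membership rule described above can be checked with the following sketch. The rule is renamed memb here, an assumption made only to avoid a clash with the member predicate that many current PROLOG systems already supply in their list library; the three clauses are otherwise those given in the text.

```prolog
% memb(A, L): A is a member of the list L.
memb(_, []) :- !, fail.            % empty list: no member, decision is final
memb(A, [A|_]).                    % A is the head of the list
memb(A, [_|Tail]) :- memb(A, Tail). % otherwise look in the tail
```

The questions memb(b,[a,b,c]). and memb(d,[a,b,c]). are answered with yes and no respectively. The question memb(X,[a,b,c]). yields X = a, X = b and X = c on backtracking, and memb(x,[a,B,c]). binds B to x.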
To retrieve a single entry from the list, i.e. to duplicate the function formerly performed by hrf, we need a new rule which picks one TLC data set. If the head of our new rule exactly duplicates the old one, the only thing to do is to enter the new rule and to delete the old one. Thus, the simplest solution becomes
hrf(Name,(Sys,Rf)):-
    hrfl(Name,List),
    member((Sys,Rf),List).
Prolog allows for replacing an old version of a rule by a new one. If instead of calling consult we call reconsult, any new rule typed in will erase all old rules with the same head. Thus to update the system we call reconsult(user) and type in the new hrf rule. The program behaves exactly as before, but the basic data compilation has become more compact and will be much easier to maintain and update.
6.3.4 Arithmetics
One of the problems with the current version of the program is due to the fact that the comparison of the Rf values requires an exact match. Experimental values will always have some random error. We should therefore compare the Rf values with a tolerance. We thus need a rule to decide whether two Rf values are sufficiently close to be considered as equal. This rule, which we will call rfmatch, should check whether the value of A is within a tolerance window of 2*Tol centered at the value of B. To implement such a rule we need to do some arithmetic to determine the position of the limits of the tolerance window. PROLOG is not an arithmetic language. Thus handling of numerical calculations is somewhat clumsy. The standard arithmetic operators are part of the language. However, they are interpreted as specifying a relation between numbers rather than as commands to perform a calculation. In particular, the operator = signifies equality, but does not trigger the evaluation of an arithmetic expression. The PROLOG statement A = 2 + 3 assigns the structure 2 + 3 to the variable A rather than the number 5. If we want PROLOG to evaluate an arithmetic expression, we have to instruct PROLOG to do so by using the operator is. Thus, PROLOG considers the statement 5 = 2 + 3 as false, whereas 5 is 2 + 3 is true.
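The difference between = and is can be seen in a minimal sketch. The two predicate names unify_demo and evaluate_demo are hypothetical and serve only this illustration.

```prolog
% =/2 matches terms without evaluating them;
% is/2 evaluates its right-hand side first.
unify_demo(A)    :- A = 2 + 3.    % A becomes the structure 2+3
evaluate_demo(A) :- A is 2 + 3.   % A becomes the number 5
```

After unify_demo(A). the variable A holds the structure 2+3, after evaluate_demo(A). it holds the number 5. Accordingly, the question 5 = 2 + 3. is answered with no and 5 is 2 + 3. with yes.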
With these peculiarities of PROLOG in mind, we formulate our rfmatch rule as follows:

rfmatch(A,B,Tol):-
    Low is B-Tol,
    High is B+Tol,
    A >= Low,
    A =< High.
This new rule should now be added to the program (use consult). Wherever in another rule Rf values are compared, the new predicate rfmatch has to be used and the list of arguments has to be extended to include the tolerance level Tol. The sep rule explicitly tests for inequality of Rf values. To update this rule we simply replace the expression X \= Y with not rfmatch(X,Y,Tol) and add the variable Tol to the list of arguments. Thus we arrive at

sep(A,B,T,Tol):-
    hrf(A,(T,X)),
    hrf(B,(T,Y)),
    not rfmatch(X,Y,Tol).

Using reconsult to input the new sep rule does not erase the old version, as the heads of the two rules are not equivalent: they differ in the number of arguments. In the hrf predicate a test for equality of Rf values is made implicitly by the PROLOG system. This makes modification more difficult. In order to replace the equality test with the new version we first need to move the respective test from the system level to the rule level. Thus, we retrieve from the data base the theoretical value Rfth, which is then compared to the specified value given as argument. So we add the following new version:

hrf(Name,(Sys,Rf),Tol):-
    hrfl(Name,List),
    member((Sys,Rfth),List),
    rfmatch(Rf,Rfth,Tol).
The rfmatch rule works perfectly if all three arguments are instantiated, i.e. if they have all been assigned numerical values. However, if one or more arguments are free, the system, being unable to do arithmetic with uninstantiated variables, will report an error.
6.3.5 Control of backtracking
To keep the rule generally applicable we have to provide alternative rules for the case where one or more arguments are free. In this case we just match A with B without doing any calculations. To realize this alternate rule we need (a) a way to test whether a variable is instantiated and (b) the operator or. For (a) there is a built-in predicate var(A) which evaluates to true if A is a currently uninstantiated variable. The or combination of truth values is symbolized by putting a semicolon (;) between them. The new subrule is

rfmatch(A,B,Tol):-
    (var(A);var(B);var(Tol)),
    A = B.
The alternative rule requires a higher priority than the previous one; it has to be used first. The two subrules together will produce the following results. If all three arguments are instantiated, the first condition in the first subrule fails and upon backtracking the second subrule is called and evaluated. The result depends on the values specified for the arguments as before. If, however, either A or B (but not both) is uninstantiated, then the first subrule applies. The value assigned to the one variable is transferred to the other, previously uninstantiated variable. If both A and B are uninstantiated, then both remain free, but are tied together: they become shared variables. If in another part of the rule a value is assigned to one of the two shared variables, the other one will automatically get the same value. The last case, where A and B are instantiated but Tol is not, may again lead to trouble. The first line of the first subrule applies, as Tol is free. If A and B match exactly, everything is fine. If, however, the values of A and B do not match, then the second condition fails and backtracking takes us into the second subrule. As Tol is not instantiated, the attempt to do arithmetic again results in an error stop. To cure this problem we have to prevent PROLOG from ever entering the second subrule when the first condition of the first subrule has been successfully met. Thus, we have to insert a cut ! right at the end of the first line in subrule one to prevent PROLOG from ever reaching subrule two once it has established that subrule one is applicable.
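The combined effect of the two subrules and the cut can be checked with the sketch below. It assumes that the comparisons in the arithmetic subrule are A >= Low and A =< High, as follows from the definition of the tolerance window.

```prolog
% Subrule one: at least one argument is free, so no arithmetic is done.
% The cut keeps PROLOG from ever entering subrule two in this case.
rfmatch(A, B, Tol) :-
    ( var(A) ; var(B) ; var(Tol) ),
    !,
    A = B.
% Subrule two: all arguments are numbers; test the tolerance window.
rfmatch(A, B, Tol) :-
    Low  is B - Tol,
    High is B + Tol,
    A >= Low,
    A =< High.
```

With these definitions rfmatch(33,35,3). succeeds and rfmatch(30,35,3). fails. The question rfmatch(X,35,3). binds X to 35, and rfmatch(30,35,Tol). simply fails instead of stopping with an arithmetic error.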
6.3.6 Modifying the database

There is now the problem of getting the final version of subrule one, namely

rfmatch(A,B,Tol):-
    (var(A);var(B);var(Tol)),
    !,
    A = B.

to its proper place in the data base. As it has higher priority, it has to go before the subrule already in memory. Using reconsult to enter the new subrule will erase the existing one. With consult we cannot specify where the new rule shall be inserted. Additions to the data base can be made with the predicates from the assert(X) family, which insert the argument X (which can be an arbitrarily complex structure) into the data base. With asserta(X) the argument goes to the first place, with assertz(X) it goes in last. In the present case we will thus enter at the system prompt

asserta((rfmatch(A,B,Tol):- (var(A);var(B);var(Tol)), !, A = B)).
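A minimal sketch of this update is given below. Two points are assumptions and not taken from the text: the dynamic declaration, which many current PROLOG systems require before clauses of a predicate loaded from a file may be added at run time, and the helper name add_guard, which merely wraps the asserta call. The extra pair of parentheses around the rule is needed so that the whole rule is passed as a single argument.

```prolog
% Assumption: declare rfmatch/3 as modifiable at run time.
:- dynamic rfmatch/3.

% Arithmetic subrule, already present in the data base.
rfmatch(A, B, Tol) :-
    Low  is B - Tol,
    High is B + Tol,
    A >= Low,
    A =< High.

% Hypothetical helper: put the subrule for free arguments in first place.
add_guard :-
    asserta(( rfmatch(A, B, Tol) :-
                  ( var(A) ; var(B) ; var(Tol) ),
                  !,
                  A = B )).
```

After calling add_guard. the question rfmatch(X,35,3). is answered with X = 35, because the newly inserted subrule is tried before the arithmetic one.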
To check whether everything has worked out as expected we may request a listing of the various rules in the data base. This is done by entering the command listing with the rule name as the argument. For example,

listing(hrfl).

will produce the list of the TLC data base, and

listing(member).

outputs the rule about list membership. When putting out a program list most PROLOG implementations do not use the variable names originally input. Instead they use an underline character followed by an arbitrary sequence number to identify the variables. This sometimes makes the program listings rather difficult to read.
Using the predicates consult(user), reconsult(user), asserta(X) and assertz(X) for data input is not very convenient if more than a few lines have to be entered. PROLOG can, however, use standard ASCII files for data input, using consult or reconsult with the respective file name as an argument. As file handling conventions differ very much between different implementations, we will not go deeper into the matter. The user will have to consult the manuals of the implementation in use. To further experiment with and extend our little program we suggest retyping the program, using a suitable text editor, to produce a standard ASCII text file. We will assume that this file has the name TLCPROG.PRO and that it may be loaded without specifying the standard extension PRO. We furthermore suggest extending the TLC data base in order to have available a wider variety of test cases. For more flexibility it is suggested to put the data base in a separate file named TLCDATA.PRO. In the program development phase the data collection will be updated infrequently and will always be used in conjunction with the program. The program, however, will be modified quite often. It is thus convenient to include the command to load the program in the data file. For startup we then just have to load the data file, which will automatically load the program file. If modifications are made to the program file, we need only to reconsult the program file. Normally, the entries in a file to be consulted are not executed immediately, they are just put into the PROLOG data base. If we want a command to be executed, we have to include the respective prompt in the file. Thus, at the end of the data file TLCDATA.PRO we add the line ?-reconsult(tlcprog).
6.3.7 Obtaining output

It might be useful to get some indication as to what is happening during the loading of the program files. We would, for instance, like to know which file is currently consulted and get a message when the consulting of a given file has ended. To output a character string on the screen PROLOG has the command write. The argument to write can be a variable or a character string. If a character string begins with a capital letter or if it contains characters that might be misinterpreted as operators, it has to be enclosed in single quotes. Write outputs the argument without a leading or trailing carriage return. A new line on the screen can be initiated by giving the command nl (for new line) without an argument. Thus, we include the respective commands in the data base file. As we want them to be executed immediately, we also supply the prompt. To start up the system, once PROLOG is loaded and started we just have to enter the command consult(tlcdata), which will load the Rf data base as well as the program proper. If a modification of the program becomes necessary, one may jump out of PROLOG into an editor program and modify the file TLCPROG.PRO, which, after returning to the PROLOG system, can be reloaded using the command reconsult(tlcprog). With some PROLOG implementations it is not possible to call an editor program from within the PROLOG system. In this case one has to exit from the PROLOG system and restart after updating the program.
6.4 REFINING THE PROGRAM
6.4.1 General
The data base and the program on which we will base the following discussions are given in Table 6.1 and Table 6.2. Lines beginning with /* are comments; the end of the comment is indicated by */. With the existing version of the PROLOG program the predicate hrf allows for retrieving the Rf values for the compounds in the data base. Backtracking supplies this information for other TLC systems. The same predicate used with the variable for the Rf value instantiated returns the names of compounds with Rf values within the range specified by the tolerance window. The predicate sep checks whether there is a TLC system that separates the two compounds specified.
Table 6.1 File TLCDATA.PRO: TLC data base version 1.

/* Screen message */
?-nl,nl,write('consulting TLCDATA'),nl.

/* TLC data base */
hrfl(a,[(ta,10),(tb,20),(tc,30)]).
hrfl(b,[(ta,12),(tb,28)]).
hrfl(c,[(tb,30),(tc,40)]).
hrfl(d,[(ta,30),(tc,42)]).

/* Screen message */
?-nl,write('TLCDATA loaded'),nl.

/* Screen message */
?-nl,write('consulting TLCPROG'),nl.

/* Loads the program file */
?-consult(tlcprog).

/* Screen message */
?-nl,write('TLCPROG loaded'),nl.
Table 6.2 File TLCPROG.PRO: TLC program version 1.

/* hrf retrieves one tlc data set from the data base. Name identifies the
   compound name, Sys is the tlc system code, Rf is the retention
   factor * 100, the matching of the rf values is to +/- Tol */
hrf(Name,(Sys,Rf),Tol):-
    hrfl(Name,List),
    member((Sys,Rfth),List),
    rfmatch(Rf,Rfth,Tol).

/* rfmatch checks whether the rf values A and B match with a tolerance
   of +/- Tol */

/* To cope with uninstantiated variables */
rfmatch(A,B,Tol):-
    (var(A);var(B);var(Tol)),
    !,
    A = B.

rfmatch(A,B,Tol):-
    Low is B-Tol,
    High is B+Tol,
    A >= Low,
    A =< High.

/* sep checks whether compounds A and B are separable in tlc system T,
   i.e. whether the retention values differ by more than Tol. */

/* Checks whether the first argument is a member of the list given as
   the second argument */
member(_,[]):-
    !,
    fail.
member(A,[A|_]).
member(A,[_|Rest]):-
    member(A,Rest).
A useful extension to our little program consists of a module that, using the predicates already implemented, performs these operations for a list of compounds or a list of Rf values respectively. If only the compound list Clist is instantiated, the new module should check whether there is a TLC system which separates the compounds specified and should instantiate the Rf list to the Rf values expected. If the Rf list Rlist is instantiated, the program should return in the compound list possible assignments for the Rf values specified. In both cases backtracking should produce alternate solutions, if they exist. If both lists are specified, the system should check whether the assignment is compatible with the data base. We will now stepwise design such a program module, which we will name ident1.
The three operations listed above are quite similar: they involve the correlation of compound names with Rf values, as performed by the predicate hrf. As arguments we will have the two lists, Clist and Rlist, the TLC system code and the tolerance level. Corresponding elements of the two lists will have to satisfy the relation hrf. We therefore check the relation hrf for the heads C1 and R1 of the respective lists and then recur on the tails Ctail and Rtail. The recurrence stops when both lists have been reduced to the empty list. Thus:

ident1([],[],_,_).
ident1([C1|Ctail],[R1|Rtail],Sys,Tol):-
    hrf(C1,(Sys,R1),Tol),
    ident1(Ctail,Rtail,Sys,Tol).
This rule already works quite nicely. Given compound names it retrieves the Rf values in one system. Upon backtracking the Rf values in other systems are produced. From the definition of rfmatch it is obvious that the setting of Tol is irrelevant in this case. Instantiating Sys limits the result to the system specified. If the list of Rf values is instantiated, the names of possible compounds are assembled in Clist. The setting of Tol controls the precision of the matching process. Backtracking supplies alternate assignments, if they exist. If a rather large value for Tol is specified, a flaw in the current program will become obvious. Sometimes, a given compound will match with more than one Rf value, and the same name will appear more than once in Clist. This can be corrected by adding a rule which states that a compound should be added to Clist only if it is not yet a member of Clist. This, however, presents a problem. As we process the lists by cutting away their heads, we have no direct access to the parts already processed.
We can, however, concurrently with cutting down Clist by removing its head C1, build up another list C by adding C1 as the new head. Before joining C1 to the list C we make sure that C1 is not yet a member of this list. Initially, the list C is empty or uninstantiated. We now have

ident1([],[],_,_,_).
ident1([C1|Ctail],[R1|Rtail],Sys,Tol,C):-
    hrf(C1,(Sys,R1),Tol),
    not member(C1,C),
    ident1(Ctail,Rtail,Sys,Tol,[C1|C]).
With this modification any compound name will appear on the list at most once. Thus, assignment of the same compound to two or more Rf values is prevented. For the inverse application, where compound names are specified, nothing changes, as long as the same name does not appear two or more times on the list of compound names. If a list of compound names is given, we would like to know whether the Rf values for all components are sufficiently different to ensure complete separation. Thus, we have to check whether the currently processed compound C1 is separable from all other compounds. There are various ways to specify the respective condition. One may check C1 against all compounds already processed (using the auxiliary list C) or against the compounds not yet processed (using Ctail). Alternatively one could assemble an auxiliary list for the Rf values retrieved so far (analogous to the list C in the previous problem). The most appropriate choice is to use the list of compounds not yet processed, as in the inverse application (Rf values specified) this list is always empty, so there will be no interferences. Thus, the question to add says: 'Is the currently processed compound C1 separable from all compounds in list Ctail?'. We will put this rule in a separate predicate with the name sepall. Its arguments are, of course, C1, Ctail, Sys, and Tol. The complete module for processing lists of compound names and Rf values now is

ident1([],[],_,_,_).
ident1([C1|Ctail],[R1|Rtail],Sys,Tol,C):-
    hrf(C1,(Sys,R1),Tol),
    not member(C1,C),
    sepall(C1,Ctail,Sys,Tol),
    ident1(Ctail,Rtail,Sys,Tol,[C1|C]).

with the predicate sepall to be defined.
To check for separability we will use the already existing predicate sep. Again all members of the list Ctail have to be tested, i.e. the complete list has to be processed. As before, we could work on the head of the list and then recur on the tail until the list is exhausted. To present some variation, we will, however, use a different approach, namely scanning through the list by backtracking. Applying the member predicate to the list Ctail, we retrieve one member and check whether it is separable from the compound being processed using the predicate sep. We will specify the respective condition in such a way that it will fail if the compounds are separable, i.e. we use not sep rather than sep. Thus, if everything is ok, we will get a fail and thus initiate backtracking. PROLOG will then redo the member predicate to produce an alternate solution by retrieving the next member of the list. This is then again tested for separability. This process continues until all members of the list Ctail have been processed. Then redoing member produces a fail and backtracking moves down to sepall.
To stop the process, we provide a second rule for sepall which is then called. This second rule always succeeds, so sepall as a whole succeeds if all compound pairs are separable. If, however, we hit upon a non-separable pair of compounds, not sep will succeed, and this should result in a fail for the rule sepall. To get this behavior, the statement not sep is followed by the statement fail. But in the given case we have to prevent PROLOG from calling the second subrule, which always succeeds. This can be achieved by putting a cut ! between not sep and fail. This prevents backtracking when the final fail predicate is reached and leads to failure of the predicate sepall. The listing of the predicate sepall is as follows
sepall(C1,Ctail,Sys,Tol):-
    member(C2,Ctail),
    not sep(C1,C2,Sys,Tol),
    !,
    fail.
sepall(_,_,_,_).
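The pieces developed so far can be put together into a small runnable sketch, using the data of Table 6.1. Several points are assumptions made for this illustration: the membership rule is renamed memb to avoid a clash with library predicates, the standard negation operator \+ is written in place of not, and both the two-argument and the three-argument form of hrf are kept side by side because sep and ident1 call different versions.

```prolog
% TLC data base (Table 6.1): compound, list of (system, hRf) pairs.
hrfl(a, [(ta,10),(tb,20),(tc,30)]).
hrfl(b, [(ta,12),(tb,28)]).
hrfl(c, [(tb,30),(tc,40)]).
hrfl(d, [(ta,30),(tc,42)]).

% List membership as in the text (renamed memb).
memb(_, []) :- !, fail.
memb(A, [A|_]).
memb(A, [_|Tail]) :- memb(A, Tail).

% Tolerant comparison of two hRf values.
rfmatch(A, B, Tol) :- ( var(A) ; var(B) ; var(Tol) ), !, A = B.
rfmatch(A, B, Tol) :- Low is B - Tol, High is B + Tol, A >= Low, A =< High.

% Exact retrieval (used by sep) and tolerant retrieval (used by ident1).
hrf(Name, (Sys,Rf)) :- hrfl(Name, List), memb((Sys,Rf), List).
hrf(Name, (Sys,Rf), Tol) :-
    hrfl(Name, List), memb((Sys,Rfth), List), rfmatch(Rf, Rfth, Tol).

% A and B are separable in system T if their hRf values do not match.
sep(A, B, T, Tol) :- hrf(A, (T,X)), hrf(B, (T,Y)), \+ rfmatch(X, Y, Tol).

% C1 must be separable from every compound in Ctail (scan by backtracking).
sepall(C1, Ctail, Sys, Tol) :-
    memb(C2, Ctail), \+ sep(C1, C2, Sys, Tol), !, fail.
sepall(_, _, _, _).

% Correlate compound names with hRf values; C collects names already used.
ident1([], [], _, _, _).
ident1([C1|Ctail], [R1|Rtail], Sys, Tol, C) :-
    hrf(C1, (Sys,R1), Tol),
    \+ memb(C1, C),
    sepall(C1, Ctail, Sys, Tol),
    ident1(Ctail, Rtail, Sys, Tol, [C1|C]).
```

The question ident1([a,b],Rlist,Sys,5,[]). first yields Sys = tb with Rlist = [20,28]; system ta is rejected because the values 10 and 12 lie within the tolerance of 5. In the inverse direction, ident1(Clist,[30],tc,1,[]). returns Clist = [a]. Note that this inverse use relies on the first clause of memb, which treats a still uninstantiated Ctail like the empty list.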
6.4.2 Manipulating lists

In a further improvement we might sort the two lists for increasing Rf values. For demonstration purposes we will select a non-optimal solution which will, however, illustrate some new and interesting principles. First of all we need a predicate to sort the members of a list. A simple way to do this is to select two adjacent elements from the list and check whether they are in sequence. If they are not, then they are interchanged before reinserting them into the result list (this is the well known bubble sort algorithm). Picking elements from a list and putting them back can be perceived as a special case of the concatenation of two lists. We will therefore first work on the predicate append which joins two lists together. The reasoning behind the append predicate goes as follows. When joining list L1 to list L2 to give list L3, the head H1 of list L1 will also be the head of the result list L3. The tail L3tail of the result list is obtained by joining the tail L1tail of list L1 to the list L2. The boundary condition which terminates the recursion is when the first list has become empty. Then the result list will be identical to the second list. Thus we get

append([],L,L).
append([H1|L1tail],L2,[H1|L3tail]):-
    append(L1tail,L2,L3tail).

Append as defined above is quite a versatile and interesting predicate. First of all it can be used to join two lists together. For instance append([a,b],[c,d],[a,b,c,d]) evaluates to true. Any one of the three arguments may be left uninstantiated. PROLOG will then assign the appropriate values to it. More interesting is the case where both first arguments are left uninstantiated. Upon backtracking append then produces successively all possible partitions of the third list. Thus, append(X,Y,[a,b,c]) will first result in X = [] and Y = [a,b,c]. Backtracking then brings X = [a] and Y = [b,c]. As a next solution we get X = [a,b] and Y = [c]. Finally we obtain X = [a,b,c] and Y = []. Further attempts to backtrack will result in failure and produce the answer no. For our sorting algorithm we will use a variation of this application, namely append(X,[A,B|Y],Z) with Z being instantiated. This results in adjacent members of Z being assigned to A and B. For example append(X,[A,B|Y],[1,2,3,4]) results in A and B being assigned to 1 and 2, 2 and 3, and 3 and 4 respectively. Further backtracking will fail.
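Both uses of append can be tried out with the sketch below. The predicate is renamed app, an assumption made to avoid a clash with the append that many current systems provide in their list library; partitions and adjacent_pairs are hypothetical helpers that collect with findall what repeated backtracking would deliver.

```prolog
% app(L1, L2, L3): L3 is the list L1 followed by the list L2.
app([], L, L).
app([H1|L1tail], L2, [H1|L3tail]) :-
    app(L1tail, L2, L3tail).

% All ways of cutting List into a front part X and a back part Y.
partitions(List, Parts) :-
    findall(X-Y, app(X, Y, List), Parts).

% All pairs of adjacent elements of List, in list order.
adjacent_pairs(List, Pairs) :-
    findall(A-B, app(_, [A,B|_], List), Pairs).
```

The question partitions([a,b,c],P). lists the four partitions in the order given in the text, and adjacent_pairs([1,2,3,4],Q). gives Q = [1-2,2-3,3-4].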
6.4.3 Sorting
To realize a sort algorithm based on the append predicate we proceed as follows. The name of the predicate is sort; it has two arguments, namely the list L to be sorted and the list R to receive the result, i.e. the sorted list. Within sort we will call the predicate order which checks whether its two arguments are in the correct sequence. First, we call append in the way explained above to pick two adjoining elements A and B. Then we call order with the arguments A and B interchanged. Thus, order fails if A and B are in the correct sequence. If this is the case, we backtrack to the previous call to append to process the next pair from the list L. If order(B,A) succeeds, this indicates that the two elements are to be interchanged. Thus we call append again, this time to insert the two elements being processed in reversed order, putting the corrected list as an intermediate result into the list S. Now we call sort again to process the intermediate list S. The procedure ends when we have processed the last pair. An attempt to pick the next pair from the list results in the failure of the append predicate and subsequently of the first subrule of sort. We then move to the second subrule which transfers the intermediate result into the second argument. Either the two elements under consideration are in the right order or they need to be interchanged; there is never a third possibility. Thus, in the first subrule, once order has succeeded and control passes into the second half of the program, we must never allow backtracking into the first part. This is ensured by placing a cut right after the order call. So, the predicate sort becomes
sort(L,R):-
    append(X,[A,B|Y],L),
    order(B,A),
    !,
    append(X,[B,A|Y],S),
    sort(S,R).
sort(L,L).

order(B,A):-
    B < A.
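A runnable sketch of this sorting scheme follows. Two names are changed, and both changes are assumptions of this illustration: the sorting predicate is called bsort because sort is a fixed built-in in many current PROLOG systems, and append is called app to keep the sketch independent of any list library.

```prolog
% app(L1, L2, L3): L3 is the list L1 followed by the list L2.
app([], L, L).
app([H|T], L2, [H|T3]) :- app(T, L2, T3).

% order(B, A) succeeds if B is smaller than A.
order(B, A) :- B < A.

% bsort(L, R): R is the list L sorted in ascending order.
bsort(L, R) :-
    app(X, [A,B|Y], L),      % pick two adjacent elements A and B
    order(B, A),             % succeeds only if they are out of sequence
    !,                       % never go back to try another pair of L
    app(X, [B,A|Y], S),      % reinsert them in reversed order
    bsort(S, R).             % continue with the intermediate list S
bsort(L, L).                 % no pair out of sequence: L is sorted
```

The question bsort([3,1,2],R). produces R = [1,2,3] as its only solution.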
Our planned application calls for sorting one list and making the same exchanges of elements in the other list. Simply duplicating the calls for append with the second list as argument will not work, because the sort predicate relies on backtracking to produce all possibilities. We want all list operations always to be performed on both lists. Backtracking, however, redoes only the most recent operation. We have to ensure that sort always backtracks on both lists. This is most easily realized by combining the append operation for both lists into one single predicate, on which sort can backtrack. This predicate, which we will name append2, simply doubles the pattern of append and may be written as

append2([],L,L,[],M,M).
append2([X|L1],L2,[X|L3],[Y|M1],M2,[Y|M3]):-
    append2(L1,L2,L3,M1,M2,M3).
Replacing all calls to append by append2 and adjusting names and number of arguments in the sort predicate we arrive at sort2:

sort2(L,R,L1,R1):-
    append2(X,[A,B|Y],L,X1,[A1,B1|Y1],L1),
    order(B,A),
    !,
    append2(X,[B,A|Y],S,X1,[B1,A1|Y1],S1),
    sort2(S,R,S1,R1).
sort2(L,L,L1,L1).
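The parallel sorting of two lists can be checked with the following sketch. It assumes the order predicate defined earlier (B < A) and uses the name app2 for the doubled append, a renaming made here only for brevity.

```prolog
% app2: the pattern of append applied to two lists in parallel.
app2([], L, L, [], M, M).
app2([X|L1], L2, [X|L3], [Y|M1], M2, [Y|M3]) :-
    app2(L1, L2, L3, M1, M2, M3).

% order(B, A) succeeds if B is smaller than A.
order(B, A) :- B < A.

% sort2(L, R, L1, R1): R is L in ascending order; R1 is L1 with the
% same exchanges of elements applied.
sort2(L, R, L1, R1) :-
    app2(X, [A,B|Y], L, X1, [A1,B1|Y1], L1),
    order(B, A),
    !,
    app2(X, [B,A|Y], S, X1, [B1,A1|Y1], S1),
    sort2(S, R, S1, R1).
sort2(L, L, L1, L1).
```

The question sort2([30,10,20],R,[c,a,b],R1). gives R = [10,20,30] and R1 = [a,b,c]: each compound name stays attached to its Rf value. Both input lists are expected to have the same length.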
Now, list L is sorted in ascending order into list R, and the same sequence changes are made in list L1 to give list R1. The new predicates may now be incorporated into the program to produce sorted lists for the Rf values and the respective compound names by defining a new predicate ident. Ident will first call ident1 to retrieve the values for the uninstantiated variables and then use sort2 to put them in order.

ident(Clist,Rlist,Sys,Tol,Csort,Rsort):-
    ident1(Clist,Rlist,Sys,Tol,[]),
    sort2(Rlist,Rsort,Clist,Csort).
6.5 CONCLUSION
This paragraph closes our informal presentation of PROLOG. The introduction given here is neither complete nor precise; it does not take into account many important details, formal rules and exceptions, and many interesting and useful standard predicates are not mentioned. In particular, only the very basic input and output operations have been used, as the various implementations available today are extremely different in this respect. Our aim was to present to the reader some of the unique features of PROLOG, hopefully motivating readers to start their own experiments with the language. As a starter we present in Table 6.3 a further improved (but still definitely suboptimal) version of the TLC program. There are many more features that should be added. For instance it might be extended to process partially instantiated lists for compound names, useful when some compounds are already identified. Partially instantiated lists of Rf values may come in handy if one has to select an internal standard for quantification. Adding detection information and rules on how to use them (e.g. which detection reactions can be sequentially applied on the same plate) will increase the program's selectivity. A help module should be available to inform the user on the conventions used, on the meaning of the various parameters, and to explain the codes used for the tlc systems (and for the detection reactions). Furthermore, the explanation component essential in an expert system is completely missing.
Table 6.3 TLC program version 2.
/* Data base. */
/* Insert your rf data here in front of the Rf catch-all */

/* General tlc data processing module:
   Clist:  List of compounds to be separated
   Rlist:  List of observed Rf values
   Sys:    TLC system code
   ITol:   Tolerance value for identification
   STol:   Tolerance value for separation
   CRlist: Sorted list of compound names
   RXlist: Sorted list of Rf values found
   RTlist: Sorted list of Rf values expected
   Call with Clist instantiated returns tlc system which separates the given compounds with Rf differences > STol. Value of ITol is irrelevant.
   Call with Rlist and Sys instantiated returns possible assignments of the tlc spots. Theoretical Rf values differing less than ITol from experimental values are considered as matching. Value of STol is irrelevant. */
ident(Clist,Rlist,Sys,ITol,STol,CRlist,RXlist,RTlist) :-
    ident1(Clist,Rlist,Sys,ITol,STol,CRtemp,RRtemp,[]),
    sort3(Rlist,RXlist,CRtemp,CRlist,RRtemp,RTlist).

/* Aux is an auxiliary list to hold the names of the compounds identified so far */
ident1([],[],_,_,_,[],[],_).
ident1([C1|Ctail],[R1|Rtail],Sys,ITol,STol,[C1|CRtail],[RR1|RRtail],Aux) :-
    hrf(C1,(Sys,RR1),R1,ITol),
    new(C1,Aux),
    sepall(C1,Ctail,Sys,STol),
    ident1(Ctail,Rtail,Sys,ITol,STol,CRtail,RRtail,[C1|Aux]).

/* Checks whether the currently processed compound C1 has already been identified as a possible component of the mixture */
new('?',_) :- !.
new(C1,Aux) :- member(C1,Aux), !, fail.
new(_,_).
/* hrf retrieves one tlc data set from the data base. Name identifies the compound name, Sys is the tlc system code, Rf is the retention factor * 100, the matching of the rf values is to +/- Tol */
hrf(Name,(Sys,Rfth),Rfexp,Tol) :-
    hrf1(Name,Rflist),
    member((Sys,Rfth),Rflist),
    rfmatch(Rfexp,Rfth,Tol).
/* rfmatch checks whether the rf values A and B match with a tolerance of +/- Tol */

/* To cope with uninstantiated variables */
rfmatch(A,B,Tol) :-
    (var(A) ; var(B) ; var(Tol)), !,
    A = B.

/* Some implementations of PROLOG know unsigned integers only. If Tol > B then the negative limit is interpreted as a very large positive number. This subrule deals with this case. */
rfmatch(A,B,Tol) :-
    B =< Tol, !,
    High is B + Tol,
    A =< High.
rfmatch(A,B,Tol) :-
    Low is B - Tol,
    High is B + Tol,
    A >= Low, A =< High.

/* sep checks whether compounds A and B are separable in tlc system T, i.e., whether the retention values differ by more than Tol. */
/* checks whether the first argument is a member of the list given as the second argument */
member(_,[]) :- !, fail.
member(A,[A|_]).
member(A,[_|Rest]) :- member(A,Rest).

/* sepall tests whether the compound C1 is separable in tlc system Sys from all compounds given in list Ctail, i.e. whether the respective Rf values differ by more than Tol. */
sepall(C1,Ctail,Sys,Tol) :-
    member(C2,Ctail),
    not(sep(C1,C2,Sys,Tol)), !,
    fail.
sepall(_,_,_,_).

/* append3 joins 3 sets of 3 lists together. Argument 1 is joined to argument 2 to give argument 3. The same operation is performed with arguments 4,5,6 and 7,8,9, respectively */
/* sort3 sorts argument 1 into argument 2. Identical sequence changes are performed on argument 3 to give argument 4, and on argument 5 to give argument 6. All 6 arguments are lists. The range of allowed components of argument 1 as well as the sort sequence depends on the predicate order. */
sort3(L,R,L1,R1,L2,R2) :-
    append3(X,[A,B|Y],L,X1,[A1,B1|Y1],L1,X2,[A2,B2|Y2],L2),
    order(B,A), !,
    append3(X,[B,A|Y],S,X1,[B1,A1|Y1],S1,X2,[B2,A2|Y2],S2),
    sort3(S,R,S1,R1,S2,R2).
sort3(L,L,L1,L1,L2,L2).

/* Order establishes the ordering sequence for predicate sort3. The current version sorts integer values into ascending order. */
order(A,B) :- A < B.
6.6 REFERENCES
1 W.F. Clocksin, C.S. Mellish, 'Programming in Prolog', 3rd Edition, Springer Verlag, Berlin 1987.
2 F. Geiss, 'Die Parameter der Dünnschichtchromatographie', Friedrich Vieweg & Sohn, Braunschweig 1972.
3 A.C. Moffat, Editor, 'Clarke's Isolation and Identification of Drugs', 2nd Edition, The Pharmaceutical Press, London 1986.
4 A.C. Moffat, J.P. Franke et al., 'Thin Layer Chromatographic Rf Values of Toxicologically Relevant Substances on Standardized Systems', VCH Verlagsgesellschaft mbH, D-6940 Weinheim 1987.
7
REACTION PATHWAYS ON A PC
Eric FONTAIN, Johannes BAUER and Ivar UGI Organisch-Chemisches Institut der Technischen Universität München, Lichtenbergstrasse 4, D-8046 Garching, W. Germany
7.1 INTRODUCTION
Leopold Kronecker's statement 'Die ganze Zahl schuf der liebe Gott, alles übrige ist Menschenwerk' ('God created the integers, all else is the work of man') provides a classification for the diverse uses of the computer (ref. 1). The best-known uses of computers in chemistry rely on floating point computation. Numerical quantum chemistry, chemometrics and the collection and evaluation of experimental data, e.g. in X-ray crystallography, modern spectroscopy (advanced NMR, MS, IR) and chemical dynamics, are major areas where floating point computation is indispensable. The logical and combinatorial chemical applications of computers are based on integer computation. Storage, retrieval and manipulation of data as well as deductive operations by computers are in that category. Chemical documentation, information oriented synthesis design, and the deductive solution of chemical problems are examples. In the deductive solution of chemical problems the solutions of the individual problems are deduced from general principles. The use of large computers is almost always advantageous for floating point computation, whereas for integer computation small computers may be preferable, due to their immediate response in the interactive mode of operation that is essential in this category of applications. This chapter is devoted to computer programs for PCs. These solve chemical problems by generating reaction pathways and are based on the theory of the BE- and R-matrices, an algebraic model of constitutional chemistry.
7.2 THE DEDUCTIVE SOLUTION OF CHEMICAL PROBLEMS AND THE THEORY OF THE BE- AND R-MATRICES
In chemistry the deductive method does not seem to be generally applicable. Traditionally, most chemical problems are solved through reasoning by analogy, based on detailed knowledge and experience in chemistry. The information-oriented computer programs for the solution of chemical problems attempt to simulate the chemists' reasoning by analogy. An essential prerequisite for a deductive approach to chemistry would be a theory that affords the prediction of the existence of molecular systems and their interconversions. The theory of the BE- and R-matrices, an algebraic model of the logical structure of constitutional chemistry, is precisely such a theory (ref. 2, 3). A brief outline of the theory of the BE- and R-matrices is given below. The essence of this theory is expressed by its fundamental equation

$$ B + R = E \qquad (7.1) $$
It represents the conversion of an ensemble of educt molecules EM(B) into an isomeric ensemble of product molecules EM(E) by a chemical reaction. The ensembles of molecules (EM) at the Beginning and the End of the reaction are described by their BE-matrices B and E. The so-called reaction matrix R corresponds to a pattern of valence electron redistribution, or also a scheme of bond breaking/making during the reaction (ref. 4). The BE-matrix B of an EM with n atoms is an n x n symmetric matrix with non-negative integer entries; the rows/columns of B are assigned to the individual atoms of EM(B), e.g. the i-th row/column to the atom Ai. The off-diagonal entries bij are the formal covalent bond orders of the bonds Ai-Aj between the atoms Ai and Aj, and the diagonal entries bii are the numbers of the lone valence electrons at atom Ai. The off-diagonal entries rij = rji of the R-matrix R indicate the changes in formal bond order between the atoms Ai and Aj, and the diagonal entries rii of R denote alterations in the placement of lone valence electrons.
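For example, the reaction:

H-C≡N|  →  H-N≡C|

is described by the matrix equation:

$$
\underbrace{\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 3 \\ 0 & 3 & 2 \end{pmatrix}}_{B_{HCN}}
+
\underbrace{\begin{pmatrix} 0 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & -2 \end{pmatrix}}_{R_{HCN \to HNC}}
=
\underbrace{\begin{pmatrix} 0 & 0 & 1 \\ 0 & 2 & 3 \\ 1 & 3 & 0 \end{pmatrix}}_{E_{HNC}}
\qquad (7.2)
$$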
Here the first row/column of each matrix refers to the H-atom, the second to the C-atom, and the third to the N-atom.

(a) The equation B + R = E represents a chemical reaction only if the negative entries rij < 0 of R coincide with positive entries bij ≥ |rij| of B (mathematical fitting condition).

(b) A BE-matrix represents a real molecule or EM if all rows/columns are allowable valence schemes of the respective chemical elements (valence chemical boundary conditions).
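The HCN to HNC example and the fitting condition (a) can be checked numerically. The following is a minimal Python sketch, not taken from the programs described in this chapter; the function names `add` and `fits` are illustrative, and the atoms are indexed H, C, N as in the text.

```python
# BE-matrix of HCN and R-matrix of HCN -> HNC; atom order H, C, N.
B = [[0, 1, 0],
     [1, 0, 3],
     [0, 3, 2]]    # H-C single bond, C-N triple bond, lone pair at N
R = [[0, -1, 1],
     [-1, 2, 0],
     [1, 0, -2]]   # net change: H-C bond lost, H-N bond gained,
                   # two lone electrons gained at C and lost at N


def add(B, R):
    """Return E = B + R, entry by entry."""
    return [[b + r for b, r in zip(brow, rrow)] for brow, rrow in zip(B, R)]


def fits(B, R):
    """Condition (a): each negative r_ij must meet an entry b_ij >= |r_ij|."""
    return all(b >= -r
               for brow, rrow in zip(B, R)
               for b, r in zip(brow, rrow)
               if r < 0)


E = add(B, R)
print(E)           # BE-matrix of HNC
print(fits(B, R))  # does R fit B?
print(fits(E, R))  # does the same R fit the product matrix?
```

Condition (b) is not covered here, because it needs the table of allowable valence schemes for each element.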
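In general, the addition of R-matrices does not commute under these conditions. An R-matrix R is decomposable into its r basis elements Rk that denote the elementary mechanistic steps of a reaction.

$$ R = R_1 + R_2 + \dots + R_k + \dots + R_r \qquad (7.3) $$

The decomposition of R(HCN → HNC)

$$
\begin{pmatrix} 0 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & -2 \end{pmatrix}
=
\begin{pmatrix} 0 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}
+
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -2 \end{pmatrix}
$$

and the sequence:

$$ B + R_1 = B_1, \qquad B_1 + R_2 = E $$

corresponds to the reaction mechanism of the following scheme:

[Scheme: R1 breaks the H-C bond and leaves both bonding electrons at C; R2 forms the H-N bond from the lone electron pair at N.]

Note that the permuted sequence of R1 and R2,

$$ B + R_2 = B_2, \qquad B_2 + R_1 = E $$

would comply with R = R2 + R1, but would violate the aforementioned conditions (a) and (b) (ref. 4).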
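The effect of the order of the mechanistic steps can be illustrated with a short Python sketch. It is only an illustration under a simplifying assumption: instead of a full valence scheme check for condition (b), it tests a single rule, namely that hydrogen carries at most one covalent bond. The names `add`, `bond_order_sum` and `hydrogen_ok` are hypothetical.

```python
# HCN -> HNC, atom order H, C, N; R = R1 + R2 as in the decomposition above.
B = [[0, 1, 0], [1, 0, 3], [0, 3, 2]]
R1 = [[0, -1, 0], [-1, 2, 0], [0, 0, 0]]   # H-C bond breaks, electrons stay at C
R2 = [[0, 0, 1], [0, 0, 0], [1, 0, -2]]    # lone pair at N forms the H-N bond


def add(X, Y):
    """Entry-wise sum of two square matrices."""
    return [[x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)]


def bond_order_sum(M, i):
    """Sum of the formal bond orders at atom i (off-diagonal entries of row i)."""
    return sum(m for j, m in enumerate(M[i]) if j != i)


def hydrogen_ok(M, h=0):
    """Assumed stand-in for condition (b): hydrogen has at most one bond."""
    return bond_order_sum(M, h) <= 1


B1 = add(B, R1)            # intermediate of the sequence R1, then R2
B2 = add(B, R2)            # intermediate of the permuted sequence
print(add(B1, R2) == add(B2, R1))   # both orders end at the same E
print(hydrogen_ok(B1), hydrogen_ok(B2))
```

Both sequences reach the same matrix E, but in this sketch only the intermediate B1 passes the hydrogen test; in B2 the hydrogen atom would be bonded to C and N at the same time.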
Since no electrons are generated or disappear during a chemical reaction, we have

$$ \sum_{i,j} r_{ij} = 0 \qquad (7.5) $$

whereas

$$ D(B,E) = \sum_{i,j} |r_{ij}| = \sum_{i,j} |e_{ij} - b_{ij}| \qquad (7.6) $$

is twice the number of valence electrons that are involved in the conversion of EM(B) into EM(E). D(B,E) has the mathematical properties of a distance and is the so-called chemical distance (CD) between EM(B) and EM(E) (ref. 2). It depends on the correlation of the atoms in these EM, and with a fixed indexing of the atoms in EM(B), CD is a function

$$ F(P) = D(B, P \cdot E \cdot P^{-1}) \qquad (7.7) $$

of the permutations of the atoms in EM(E); here P is a permutation matrix that permutes the rows/columns of E (ref. 2-8).
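A small Python sketch may make the dependence of the chemical distance on the atom correlation concrete. It is a brute-force illustration only: it tries every renumbering of the atoms of EM(E) that keeps each chemical element in place, which is feasible for a handful of atoms and is not the algorithm of ref. 5-7. The function names and the hydroxide/proton example are assumptions made for this sketch.

```python
from itertools import permutations


def chemical_distance(B, E):
    """D(B,E): sum of |e_ij - b_ij| over all matrix entries."""
    return sum(abs(e - b) for brow, erow in zip(B, E) for b, e in zip(brow, erow))


def permute(E, p):
    """Renumber the atoms of E: position i receives old atom p[i]."""
    n = len(E)
    return [[E[p[i]][p[j]] for j in range(n)] for i in range(n)]


def minimal_chemical_distance(B, E, elements):
    """Smallest D over all permutations that map each element onto itself."""
    n = len(B)
    candidates = (p for p in permutations(range(n))
                  if all(elements[p[i]] == elements[i] for i in range(n)))
    return min((chemical_distance(B, permute(E, p)), p) for p in candidates)


# HCN -> HNC (atom order H, C, N): four valence electrons are redistributed.
B_hcn = [[0, 1, 0], [1, 0, 3], [0, 3, 2]]
E_hnc = [[0, 0, 1], [0, 2, 3], [1, 3, 0]]
print(chemical_distance(B_hcn, E_hnc))

# Hypothetical ensemble of a proton and a hydroxide ion (atom order H, H, O).
# B bonds the first hydrogen to O, E the second one.
elements = ["H", "H", "O"]
B_oh = [[0, 0, 1], [0, 0, 0], [1, 0, 6]]
E_oh = [[0, 0, 0], [0, 0, 1], [0, 1, 6]]
print(chemical_distance(B_oh, E_oh))
print(minimal_chemical_distance(B_oh, E_oh, elements))
```

With the indexing as given, the second example appears to break one O-H bond and make another; exchanging the two hydrogen atoms shows that both matrices describe the same ensemble.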
7.3 THE HIERARCHIC CLASSIFICATION OF CHEMICAL REACTIONS
A hierarchic classification of chemical reactions is implied by the theory of the BE- and R-matrices (ref. 9). This classification is a powerful device in the deductive computer assisted solution of chemical problems.
Chemical reactions are first classified according to the minimal number of valence electrons that must be redistributed during the conversion of the educts into the products; this is their minimal chemical distance (MCD). The determination of the MCD of a chemical reaction (ref. 5-7) also yields an atom-by-atom mapping of the educts and the products, and it identifies the reactive centers whose union is the so-called core of the reaction. On the next level of the hierarchical classification of chemical reactions we have the irreducible R-matrices that represent the categories of reactions. The reactions of a category have in common a pattern of redistribution of valence electrons at the core of the reaction, i.e. the same characteristic pattern of 'electron pushing arrows'. Such arrows are customarily used in the chemical literature on reaction mechanisms. This pattern of electron redistribution is also representable by an irreducible R-matrix. An irreducible R-matrix is obtained from an ordinary R-matrix by removal of all rows/columns without non-zero entries. The pattern of electron redistribution and bond breaking/making shown in Figure 7.1 belongs to the most highly populated category of organic reactions; about 50 % of organic reactions proceed by this pattern (ref. 10) that is represented by the irreducible R-matrix (ref. 11,12).
$$
\begin{pmatrix} 0 & -1 & 0 & 1 \\ -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 \\ 1 & 0 & -1 & 0 \end{pmatrix}
$$
Fig. 7.1 The most frequent category of organic reactions.
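The reduction of an ordinary R-matrix to its irreducible form can be sketched in a few lines of Python. The function name `irreducible` and the five-atom example are illustrative assumptions; the 4 x 4 pattern is the four-centre matrix of Fig. 7.1, read with bonds 1-2 and 3-4 broken and bonds 2-3 and 1-4 made.

```python
def irreducible(R):
    """Drop all rows/columns of a symmetric R-matrix that contain only zeros.

    Returns the reduced matrix and the indices of the atoms that were kept,
    i.e. the reactive centres.
    """
    keep = [i for i, row in enumerate(R) if any(row)]
    return [[R[i][j] for j in keep] for i in keep], keep


# Hypothetical five-atom ensemble: atom 2 does not take part in the reaction.
R5 = [[0, -1, 0, 0, 1],
      [-1, 0, 0, 1, 0],
      [0, 0, 0, 0, 0],
      [0, 1, 0, 0, -1],
      [1, 0, 0, -1, 0]]

R4, centres = irreducible(R5)
for row in R4:
    print(row)
print(centres)
```

Because an R-matrix is symmetric, a row without non-zero entries always goes with an empty column, so testing the rows is sufficient.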
A category of reactions with a characteristic irreducible R-matrix is a set of basis reactions. The basis reactions correspond to the traditional classification of 'organic reactions'. A basis reaction is best characterized in graph theoretical terms (ref. 13). The educts and the products of a basis reaction are expressed by a graph (see Fig. 7.2) whose nodes correspond to the reactive centers and whose lines indicate the bond orders of the covalent bonds that are directly affected by the reaction. The basis reactions of R4,4' - all of these represent known chemical reactions - may serve as examples.
Down the hierarchy of chemical reactions, the basis reactions are further partitioned by specifying the chemical elements at the reactive centers. Finally we reach the individual reactions by stating the covalent bonds and atoms outside the core of the reaction. A more formalistic classification of the basis reactions is achieved through their T-matrices (ref. 14). The T-matrix of a chemical reaction is the difference between the adjacency matrices A of the products and the educts of a chemical reaction.
$$ T = A(E) - A(B) \qquad (7.8) $$

The number AT of rows/columns with non-zero entries and the total number BT of non-zero entries in a T-matrix may be used to distinguish between basis reactions of the same category.
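As an illustration, the T-matrix and the two counts AT and BT can be computed from a pair of BE-matrices as follows. This Python sketch assumes that the adjacency matrix has an entry 1 wherever the BE-matrix has a positive off-diagonal bond order and 0 elsewhere; the function names are hypothetical.

```python
def adjacency(BE):
    """Adjacency matrix of an EM: 1 for bonded atom pairs, 0 otherwise."""
    n = len(BE)
    return [[1 if i != j and BE[i][j] > 0 else 0 for j in range(n)]
            for i in range(n)]


def t_matrix(B, E):
    """T = A(E) - A(B)."""
    AB, AE = adjacency(B), adjacency(E)
    return [[e - b for b, e in zip(brow, erow)] for brow, erow in zip(AB, AE)]


def t_counts(T):
    """Return (AT, BT): non-empty rows and non-zero entries of T."""
    a_t = sum(1 for row in T if any(row))
    b_t = sum(1 for row in T for t in row if t != 0)
    return a_t, b_t


# HCN -> HNC, atom order H, C, N.
B = [[0, 1, 0], [1, 0, 3], [0, 3, 2]]
E = [[0, 0, 1], [0, 2, 3], [1, 3, 0]]
T = t_matrix(B, E)
print(T)
print(t_counts(T))
```

Unlike the R-matrix, the T-matrix ignores bond orders and lone electrons; it records only which atoms change their neighbours.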
Fig. 7.2 Reactions can be represented by graphs. Here, six reactions of R4,4' are shown as examples.
The so-called intact BE-matrix is another device in the computer assisted classification of chemical reactions (ref. 9).
7.4 REACTION GENERATORS
Molecular systems or chemical reactions are the solutions of the chemical problems that are processed by the problem-solving chemical computer programs. The unknown 'X' of such problems can be found by solving the equation B + R = E, either from a given BE-matrix B, or a given R-matrix R. These solutions are produced by the so-called reaction generators (RG) that are the backbone of our problem-solving chemical computer programs (ref. 15).
An RG of type I (RG I) produces from a given BE-matrix B those pairs (R,E) that comply with B + R = E under the mathematical fitting conditions (a) and the valence chemical boundary conditions (b), while an RG of type II (RG II) produces from a given R-matrix R all pairs (B,E) under the above conditions. Accordingly, an RG I elaborates all chemical reactions that an EM(B) can undergo, or by which it can be formed, whereas all reactions that have in common the same electron shift pattern, as given by R, are manufactured by an RG II. Thus the RG I and RG II are complementary devices and so are the computer programs that contain them.
In our early chemical computer programs we used reaction generators that first manufactured chemical reactions from fixed sets of three to five irreducible R-matrices according to the mathematical fitting conditions and then they applied the valence chemical boundary conditions in order to eliminate the reactions involving forbidden reactants (ref. 16). In our recent programs IGOR (ref. 11,12) and RAIN (ref. 17,18) we use reaction generators that are guided by so-called transition tables. For every chemical element that is considered the allowable valence schemes in the educts and the products are stated, as well as their permissible transitions during chemical reactions. The transition tables may be a standard set, or be defined ad hoc. The transition tables for Carbon atoms shown in Figures 7.3 and 7.4 may serve as an illustration.
Fig. 7.3 Transition table of a C-atom in the Streith reaction.
Figure 7.3 shows a transition table that has been used for the Streith reaction (ref. 7,18,19); it refers to valence schemes that occur in stable carbon compounds. Figure 7.4 is a transition table that also permits the valence schemes of short-lived transient species in some reaction mechanisms. Note that in Figure 7.4 the diagonal entries of the unstable valence schemes indicate that these must disappear during the reaction. Furthermore, an empty row in a transition table means that the respective valence scheme must not exist in the educts, whereas an empty column points to a valence scheme that is forbidden in the products. The transition tables do not only enforce the valence chemical boundary conditions, but they also may be used to specify the allowed direction of a reaction (ref. 14). In the case of RG I we have a fixed assignment of the chemical elements to the rows/columns of the BE-matrix B. Thus it is even possible to use different transition tables for a given chemical element in the distinct rows/columns.
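The way a transition table encodes permitted and forbidden changes can be sketched as a small data structure. The Python example below is purely illustrative: the valence scheme labels and the table entries are invented for the sketch and do not reproduce Figures 7.3 or 7.4, and the function names are hypothetical.

```python
# Hypothetical transition table for carbon: each educt valence scheme (row)
# maps to the set of product valence schemes (columns) it may turn into.
TABLE = {
    "sp3":     {"sp3", "carbene", "cation"},
    "carbene": {"sp3"},             # no diagonal entry: must disappear
    "cation":  {"sp3", "carbene"},  # no diagonal entry: must disappear
    "anion":   set(),               # empty row: not allowed in the educts
}


def allowed(table, educt, product):
    """Is the transition educt scheme -> product scheme permitted?"""
    return product in table.get(educt, set())


def forbidden_in_educts(table):
    """Schemes with an empty row."""
    return [s for s, row in table.items() if not row]


def forbidden_in_products(table):
    """Schemes with an empty column."""
    return [s for s in table if not any(s in row for row in table.values())]


def reflect(table):
    """Mirror the table about its main diagonal (reverse every transition)."""
    mirrored = {s: set() for s in table}
    for educt, row in table.items():
        for product in row:
            mirrored.setdefault(product, set()).add(educt)
    return mirrored


print(allowed(TABLE, "sp3", "carbene"), allowed(TABLE, "carbene", "carbene"))
print(forbidden_in_educts(TABLE), forbidden_in_products(TABLE))
print(sorted(reflect(TABLE)["carbene"]))
```

The mirrored table describes the same transitions read in the opposite direction, which is what a search starting from the product side requires.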
Fig. 7.4 Transition table of a C-atom for a network of reaction mechanisms
In contrast, we have a different approach with RG II, because there is no predefined assignment of chemical elements to the rows/columns. In addition, for each row/column a set of chemical elements with their respective transition tables can be specified (ref. 11,20).
The transition table guided RG I (TRG I) of the present version of RAIN operates as follows: First the allowable rows/columns of the matrix E are derived from the corresponding rows/columns of B according to the transition tables that apply. Then the differences of the rows/columns of B and E are checked to see whether they yield a matrix that qualifies as a chemically meaningful R-matrix, as specified by the user. An alternative to the above TRG I is currently being implemented and tested. Here, a set of R-components is derived from B and the applicable transition tables. These R-components are then used to manufacture the R-matrices, and hence the matrices E. As soon as it is known which of the two approaches to a TRG I is combinatorially more effective, it will be used in the TRG I of RAIN and other chemical computer programs that generate solutions of the equation B + R = E from a given BE-matrix B.
Since 1979 a TRG II has been used as the 'engine' of IGOR (ref. 11). The user of IGOR defines an R-matrix R and selects for each of its rows/columns a set of chemical elements whose valence chemical roles are described by their transition tables. For each row/column of (B,E) the union of the allowed transition tables is formed. From these the valence schemes, their transitions, and the compatible chemical elements are found for the row/column pairs of B and E. Now all of the BE-pairs (B,E) are generated by exhaustive combination of the valence schemes that had been elaborated for the rows/columns. Within the hierarchic classification of chemical reactions, the level of basis reactions is thus reached. Subsequently chemical elements are assigned to the rows/columns of (B,E), and finally the unused valences of the reactive centers are supplied with residues as specified by the user.
The theory of the BE- and R-matrices serves as the theoretical foundation of computer programs for the deductive solution of a variety of chemical problems. Such programs have been under development at our institute since 1971 (ref. 1). Our present generation of interactive problem-solving FORTRAN 77 programs for PCs evolved in the past three years from our earlier batch type PL/I programs for mainframe computers (ref. 12).
FORTRAN 77 was chosen as the computer language for IGOR and RAIN for the following reasons: Among generally available languages FORTRAN 77 still has the highest degree of portability. The solution of numerical (combinatorial) problems requires a numerically oriented language. FORTRAN 77 is not lacking any feature that is needed for the algorithms of IGOR and RAIN. The existence of very elaborate and mature compilers is a further advantage of FORTRAN 77. About 1985 it became clear that inexpensive PC type computers would play a major role in computer assisted chemistry. IGOR (Interactive Generation of Organic Reactions) and RAIN (Reaction And Intermediate Network) are our first interactive problem-solving chemical computer programs that have been implemented for MS-DOS systems. The program IGOR2, a more advanced version of IGOR, will become available as a public domain program by mid 1989. A detailed manual of IGOR will be published by J. Bauer in Tetrahedron Computer Methodology, and the program will be distributed on diskettes by the same journal. RAIN2 will follow analogously towards the end of 1989. The interactive mode of operation is preferable, because it exploits the capabilities of man and machine most efficiently. The logical and combinatorial operations are left to the computer, whereas all decisions that require chemical knowledge, experience and intuition are for the expert user. Thus the customary, often time-consuming, more or less arbitrary selection procedures and estimates of reactivity are avoided. Accordingly, for problem-solving in chemistry a well-structured interactive logic-oriented approach is particularly suitable for PCs and inexpensive workstations. In essence, IGOR is a TRG II and a set of service modules. Besides a user-friendly graphic input-output system, the most important service modules enable the user to avoid the processing of excessively large amounts of data by imposing as many further boundary conditions as are needed. For instance, any constitutional features and substructures of the participating reactants can be demanded, or be forbidden. The input of IGOR is an R-matrix that represents a pattern of electron flow within an ensemble of reactants, and the associated changes in bonding. From this and the further particulars that the user may introduce, IGOR generates pairs (B,E) that correspond to descriptions of chemical reactions. The output of IGOR can be substantially reduced by forbidding molecules with certain substructures that may be specified ad hoc or be taken from a standard set. IGOR operates according to a hierarchical classification of chemical reactions (ref. 9,12), and the user can interact with IGOR at all levels of the hierarchy. The guidance of the user by the hierarchical classification of chemical reactions is illustrated in Figure 7.5 by the discovery of the extrusion of CO2 from o-formyloxy-ketones (ref. 12).
[Figure: panels 'R-matrices', 'Def. of element vector', 'Further boundary condition', 'New reaction'.]

Fig. 7.5 Search for a new reaction along the hierarchic classification of chemical reactions.

Since 1980 IGOR has internally used a canonical representation of chemical reactions that is based on a linear combination B + kE (k is an integer greater than the maximum entry of any B or E, e.g. k = 10). The latter is subjected to a canonical indexing procedure by an extended version of the algorithm CANON (ref. 21,22).
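The linear combination B + kE packs both matrices of a reaction into a single integer matrix from which B and E can be recovered, provided k exceeds every entry. A minimal Python sketch of this packing follows; the function names `combine` and `split` are illustrative, and the canonical indexing step itself is not shown.

```python
def combine(B, E, k=10):
    """Pack a reaction (B, E) into the single matrix B + k*E."""
    largest = max(max(map(max, B)), max(map(max, E)))
    if k <= largest:
        raise ValueError("k must exceed every entry of B and E")
    return [[b + k * e for b, e in zip(brow, erow)] for brow, erow in zip(B, E)]


def split(M, k=10):
    """Recover (B, E) from B + k*E; valid because 0 <= b_ij < k."""
    B = [[m % k for m in row] for row in M]
    E = [[m // k for m in row] for row in M]
    return B, E


# HCN -> HNC, atom order H, C, N.
B = [[0, 1, 0], [1, 0, 3], [0, 3, 2]]
E = [[0, 0, 1], [0, 2, 3], [1, 3, 0]]
M = combine(B, E)
print(M)
print(split(M) == (B, E))
```

Since a renumbering of the atoms acts on B and E simultaneously, indexing the combined matrix indexes the whole reaction rather than educts and products separately.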
When a zero matrix is used as the R-matrix, we have B = E, and IGOR generates constitutional formulas. Thus the user can apply IGOR to generate molecules in accordance with some ad hoc definitions and restrictions.
For instance, IGOR can be ordered to produce all five-membered cyclic molecules with the empirical formula C2H2S2O2, where the atoms have their customary valence schemes. The valence isomers of cyclooctatetraene (ref. 11, 12), the 1,3-dipoles (ref. 12), and the 278 conceivable five-membered cyclic phosphorylating reagents (ref. 12) have been generated by IGOR in this operating mode (Fig. 7.6).
Fig. 7.6 Sydnone and its analogs.
IGOR generates from the irreducible 10 x 10 'cyclic' R-matrix (ref. 23) the basis reaction 1 + 2 → 3, a [6+4] cycloaddition (Fig. 7.7).
The hierarchic classification of chemical reactions also provides criteria for judging the degree of novelty of unprecedented reactions. Among the 'new reactions' that have been 'discovered' by IGOR, and have been experimentally realized, the conversion of 4 + 5 via 6 into 7 has, up to now, the highest degree of novelty (Fig. 7.8). The reaction 4 + 5 → 6 → 7 is a realization of the above abstract scheme.
RAIN is a computer program that finds the reaction pathways for interconverting EM(B) and EM(E). These pathways may correspond to the mechanistic pathways of chemical reactions, or to multistep sequences of chemical reactions, depending on the nature of the valence schemes that are considered. If the valence schemes are confined to those of stable compounds, a program like RAIN will generate sequences of chemical reactions, such as bilaterally generated synthetic pathways (ref. 21); networks of reaction mechanisms are obtained when the valence schemes of transient intermediates (e.g. carbenes, radicals, carbocations, carbanions) are also included.
Fig. 7.7 A basis reaction of R'10,10.

Fig. 7.8 A new reaction of the R'10,10 class.
Scheme of the essential functions of RAIN
- find for each atom its permissible valence schemes in the product,
- manufacture all combinations of these valence schemes,
- generate for each combination of valence schemes the rows/columns of the respective BE-matrix,
- check the rows/columns of the BE-matrix for compliance with any chemically meaningful R-matrix,
- check these BE-matrices for compliance with the given boundary conditions and optional selection criteria, and
- represent canonically the molecules that correspond to these BE-matrices and enter them into the network of RAIN.
Previously, we had planned to generate the networks of reaction pathways between EM(B) and EM(E) as follows: The atomic indexings of EM(B) and EM(E) are correlated by minimization of the chemical distance D(B,E) (ref. 7). Then the R-matrix E - B = R is decomposed into its components R = R1 + R2 + ... + Rr, and the components are used to construct the network according to B + R1 = B1, etc.
This method was, however, discarded, because it would neglect too many worthwhile pathways outside the minima of chemical distance. Therefore an entirely different approach is preferred. The aforementioned TRG I is used to elaborate trees of reaction pathways from EM(B) and EM(E) until they merge. Since the reaction pathways have an orientation that is enforced by the transition tables, the latter must be reflected about their main diagonals when the trees originate from the product side. The chemical constitution of the species generated is represented in canonical form. Thus the identity of any two species is immediately noticed. This not only eliminates redundancies but, more importantly, indicates any nodes at which the two trees merge. In this context the distinct resonance structures of a molecule and its tautomeric forms can be treated as equivalent.
The efficiency of this type of bilateral development of the networks is ensured by keeping track of the chemical distance that the individual trees cover, and by counting the intermediates on the pathways from EM(B) to EM(E).
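The bilateral development of two trees until they merge can be sketched as a generic bidirectional search. The Python example below is an illustration under stated assumptions, not the RAIN algorithm: species are plain strings standing in for canonical representations, the toy network is invented (its labels merely echo the compounds 8 to 11 of the example in section 7.5), and the functions `expand` and `bilateral_search` are hypothetical.

```python
def expand(frontier, step, seen):
    """Grow one tree by one level; return the newly found species."""
    new = []
    for species in frontier:
        for neighbour in step(species):
            if neighbour not in seen:
                seen.add(neighbour)
                new.append(neighbour)
    return new


def bilateral_search(start, goal, successors, predecessors, max_levels=10):
    """Grow trees from both ends and return the species where they merge."""
    seen_f, seen_b = {start}, {goal}
    front_f, front_b = [start], [goal]
    for _ in range(max_levels):
        if seen_f & seen_b or not (front_f or front_b):
            break
        # Expand the smaller non-empty frontier to keep both trees small.
        if front_f and (not front_b or len(front_f) <= len(front_b)):
            front_f = expand(front_f, successors, seen_f)
        else:
            front_b = expand(front_b, predecessors, seen_b)
    return seen_f & seen_b


# Invented toy network: 8 -> 9 -> 10 -> 11 plus one dead end "x".
FORWARD = {"8": ["9", "x"], "9": ["10"], "10": ["11"]}
BACKWARD = {}
for educt, products in FORWARD.items():
    for product in products:
        BACKWARD.setdefault(product, []).append(educt)

merge = bilateral_search("8", "11",
                         lambda s: FORWARD.get(s, []),
                         lambda s: BACKWARD.get(s, []))
print(merge)
```

Because every species is stored under a single canonical key, a set lookup is enough to notice that the two trees have met; the backward step plays the role of the reflected transition tables.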
7.5 EXAMPLES
The output of RAIN can be optionally reduced by the user, by restricting at each step the numbers of the reactive centers, of the electrons that are redistributed, of the bonds that are broken/made, and of the atoms that change their adjacencies. Before being displayed, the networks that are generated by RAIN are subjected to an ordering subroutine that arranges the nodes of the network to have as few intersecting connecting lines as possible.
The application of RAIN is illustrated in Figure 7.9 by the rearrangement of a benzocyclobutane 8 into an isochromanone 11 that has recently been studied by Kametani et al. (ref. 14, 25).
Fig. 7.9 Electrocyclic tandem reaction as generated by RAIN. Rearrangement of a benzocyclobutane 8 into an isochromanone 11 via intermediate compounds 9 and 10.
According to RAIN, only one reaction pathway, namely the electrocyclic tandem reaction 8 → 9 → 10 → 11, leads from 8 to 11 under the conditions given below.
1. The hydrogen atoms, the carbon atoms of the benzene ring and the OR group are not reactive centers, with the exception of the labelled atoms (see below).
2. The transition tables are as shown in Figure 7.10.

Fig. 7.10 Transition tables of C and O for 8 → 9 → 10 → 11 in the reaction shown in Fig. 7.9
3. The R-matrices have the following upper bounds:
- rank of an irreducible R-matrix ≤ 6,
- number of off-diagonal entries ≤ 12,
- off-diagonal entries |rij| ≤ 1,
- number of off-diagonal entries rij per row/column ≤ 2.

4. The T-matrices have the following upper bounds:
- rank of an irreducible T-matrix ≤ 4,
- number of off-diagonal entries ≤ 4,
- number of off-diagonal entries per row/column ≤ 2.
Fig. 7.11 The reaction pathways of pyrrole and dichlorocarbene to 3-chloropyridine as generated by RAIN.
5. Constitutional specifications:
- rings with four and more members are allowed,
- Bredt's rule is followed,
- triple bonds and allenic bonds are allowed from 10-membered rings on,
- the maximum numbers of charged atoms and connected heteroatoms are two, and
- the atoms that are labelled by asterisks may participate in the reaction, but must stay covalently connected.

As shown in Figure 7.11, the accepted reaction mechanism of the formation of the 3-chloropyridine system 17 from a pyrrole ring 12 and dichlorocarbene (ref. 26) is also generated from the educts and products by RAIN under suitably chosen conditions.
7.6 CONCLUSION
Computer programs for the solution of chemical problems on the basis of a mathematical model of constitutional chemistry are only feasible in an interactive mode, where the capabilities of man and machine are exploited best. Interactive computer programs are preferably used in a one-man/one-machine situation, which is becoming widely available with the introduction of the new generation of powerful PCs, like the 80286/80386 series of the various manufacturers. The popularity of problem-solving computer programs has been low, and thus the general acceptance of such computer programs has been sluggish. With the availability of low-cost/high-power PCs and new chemical software for these machines, the use of computers in chemical research and education will, finally, take off.
7.7 REFERENCES
1 H. Weyl, Philosophie der Mathematik und Naturwissenschaft, Wiss. Buchgesellschaft, Darmstadt 1966, 3rd Edition, p. 51.
2 J. Dugundji, I. Ugi, Top. Curr. Chem., 39, (1973), 19; and I. Ugi, J. Ind. Chem. Soc., 62, (1989), 761.
3 Proceedings of the ICCCRE 1985, Eds. J. Brandt, I. Ugi, Huethig Verlag, Heidelberg, 1988.
4 I. Ugi, J. Bauer, J. Brandt, J. Friedrich, J. Gasteiger, C. Jochum, W. Schubert, Angew. Chem., 91, (1979), 99; Angew. Chem. Int. Ed. Engl., 18, (1979), 111.
5 C. Jochum, J. Gasteiger, I. Ugi, Angew. Chem., 92, (1980), 503; Angew. Chem. Int. Ed. Engl., 19, (1980), 495.
6 C. Jochum, J. Gasteiger, I. Ugi, J. Dugundji, Z. Naturforsch., 37B, (1982), 1205.
7 M. Wochner, J. Brandt, A.v. Scholley, I. Ugi, Chimia, 42, (1988), 217.
8 M. Wochner, I. Ugi, Theochem., 165, (1988), 229.
9 J. Brandt, J. Bauer, R.M. Frank, A.v. Scholley, Chem. Scripta, 18, (1981), 53; and J. Brandt, A.v. Scholley, Comput. Chem., 7, (1983), 57.
10 J. Bart, E. Garagnani, Z. Naturforsch., 31B, (1976), 1646; ibid. 32B, (1977), 455, 465, 578.
11 J. Bauer, I. Ugi, J. Chem. Res. (S), 11, (1982), 298; and (M), (1982), 3101, 3201.
12 J. Bauer, R. Herges, E. Fontain, I. Ugi, Chimia, 39, (1985), 43; see also: I. Ugi, J. Bauer, E. Fontain, J. Götz, G. Hering, P. Jacob, B. Landgraf, R. Karl, P. Lemmen, R. Schneiderwind-Stöcklein, R. Schwarz, P. Sluka, Chem. Scripta, 26, (1986), 205.
13 A.T. Balaban, Rev. Roumaine Chim., 12, (1967), 875.
14 E. Fontain, Dissertation Techn. Univ. Munich, 1987.
15 E. Fontain, J. Bauer, I. Ugi, Anal. Chim. Acta, 210, (1988), 173.
16 J. Blair, J. Gasteiger, C. Gillespie, P.D. Gillespie, I. Ugi, in 'Computer Representation and Manipulation of Chemical Information', Ed. W.T. Wipke, S.R. Heller, R.J. Feldman, E. Hyde, John Wiley, New York, 1971, p. 129; see also: J. Gasteiger, M.G. Hutchings, B. Christoph, L. Gann, C. Hiller, P. Löw, M. Marsili, M. Saller, K. Yuki, Top. Curr. Chem., 137, (1987), 19.
17 E. Fontain, J. Bauer and I. Ugi, Chem. Letters, 1987, 37.
18 E. Fontain, J. Bauer and I. Ugi, Z. Naturforsch., 42B, (1987), 889.
19 G. Augelmann, H. Fritz, G. Riks, J. Streith, J. Chem. Soc. Chem. Comm., 1982, 112; and A. Defoin, G. Augelmann, H. Fritz, G. Geoffroy, C. Schmidlin, J. Streith, Helv. Chim. Acta, 68, (1985), 1998.
20 J. Bauer, Diss. Techn. Univ. Munich 1981.
21 W. Schubert, I. Ugi, J. Amer. Chem. Soc., 100, (1978), 37.
22 W. Schubert, I. Ugi, Chimia, 33, (1979), 183.
23 I. Ugi, J. Bauer, R. Baumgartner, E. Fontain, D.D. Forstmayer, S. Lohberger, Pure & Appl. Chem., 60, (1988), 1573; and D. Forstmeyer, J. Bauer, E. Fontain, R. Herges, R. Hermann, I. Ugi, Angew. Chem., 100, (1988), 1618; Angew. Chem. Int. Ed. Engl., 27, (1988), 1558.
24 I. Ugi, J. Bauer, J. Brandt, J. Friedrich, J. Gasteiger, C. Jochum, W. Schubert and J. Dugundji, in 'Computational Methods in Chemistry', Ed. J. Bargon, Plenum Press, New York, 1980.
25 K. Shishido, E. Shitara, K. Fukumoto, T. Kametani, J. Amer. Chem. Soc., 107, (1985), 5810.
26 G. Magnanini, Ber. dtsch. chem. Ges., 20, (1887), 2608; and C.W. Rees, C.E. Smithen, J. Chem. Soc., 1961, 928, 938.
8 DATA ACQUISITION IN CHEMISTRY
Hans LOHNINGER and Kurt VARMUZA Technical University of Vienna, Institute for General Chemistry, Lehargasse 4/152, A-1060 Vienna, Austria
8.1 INTRODUCTION
This chapter gives a survey of the fundamental principles of data acquisition. A thorough treatment of the technical details and tricks involved is beyond the scope of this text; instead we intend to show the capabilities and limits of digital data acquisition and to explain the technical terms involved. Readers who want a closer view of the topic should also consult the textbooks (ref. 1-5).
8.2 CONCEPT OF COMPUTERIZED DATA ACQUISITION
Most phenomena which are of interest to chemists are inherently non-electrical. Electrical signals, on the other hand, are easy to process and to interface to measurement devices. It is therefore desirable to convert all signals of interest into electrical signals which are proportional to the parameters under investigation. Assuming that all major calculation and processing of signals is done by electrical means (electronic amplifiers, computers, etc.), we can define an input transducer as a device which converts a non-electrical signal into an electrical signal, whereas an output transducer converts an electrical signal into a non-electrical one.
After the variable of interest has been converted into an electrical signal, this signal is processed by the standard components of a data acquisition system. The components are largely the same for all types of analog data acquisition, although the specific layout of a system may differ. These differences are mainly caused by different requirements for conversion speed and accuracy (ref. 6). Figure 8.1 shows a typical arrangement of components in a data acquisition system. The electrical signals from the transducers (T1...T3) are amplified by preamplifiers (A1...A3). The amplified signals are connected to a multiplexer, which can switch between several analog signals. The output of the multiplexer is then fed to the analog-to-digital converter (ADC) subsystem, which may be composed of several units. The result of the analog-to-digital conversion is then presented to the computer via a digital interface.
Fig. 8.1 Typical configuration of a data acquisition system (T1...T3 are transducers, A1...A3 are amplifiers).
8.2.1 Basic concepts of signal processing
Everyone using computers to acquire analog data has to keep in mind that a computer is a digital device, whereas most signals of interest are analog in nature. This contrast has some peculiar consequences which can lead to completely wrong results if the underlying laws are not well understood.
Resolution of analog-to-digital converters. The resolution of an ADC is defined as the logarithm (base 2) of the number of discrete steps the converter can produce. It is measured in bits.
Dynamic range of the signal. The dynamic range of a signal is defined as the ratio of the largest possible value to the smallest possible value (usually the noise limit). The dynamic range of a signal often exceeds the resolution of the analog-to-digital converter. In this case an amplifier with digital gain control must be inserted before the analog-to-digital conversion takes place. The gain-controlled amplifier adjusts the signal level to the operating range of the ADC. If this measure is not taken, the effective resolution of the ADC decreases for low-level signals.
Sampling rate. In theory it is sufficient to sample a signal at a rate twice as high as its highest frequency component (Nyquist's theorem). For two reasons, however, this theorem is not a practical guide to the data rate in real applications. First, signals which exhibit rapid steps (e.g. a square wave signal) have frequency components up to a hundred times the base frequency. This means that one either has to limit the bandwidth of the signal or to sample at unreasonably high data rates. Second, the cut-off of a band-limiting filter is not sharp enough to remove all frequencies above its edge frequency. Therefore some degree of over-sampling is necessary.
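The definitions of resolution and dynamic range given above can be put into numbers with a few lines of code. The following is only a minimal sketch; the function names and the example values (a 10 V signal with a 10 microvolt noise limit, a signal using 1/16 of the ADC range) are assumptions chosen for illustration and are not taken from a particular instrument.

```python
import math

def resolution_bits(steps):
    """Resolution in bits: base-2 logarithm of the number of discrete steps."""
    return math.log2(steps)

def dynamic_range_bits(largest, smallest):
    """Dynamic range (largest value / smallest value), expressed in bits."""
    return math.log2(largest / smallest)

def effective_bits(adc_bits, signal_peak, full_scale):
    """Bits actually exercised by a signal that fills only part of the ADC range."""
    return adc_bits - math.log2(full_scale / signal_peak)

# An ADC with 4096 discrete steps has a resolution of 12 bits.
adc_bits = resolution_bits(4096)

# Hypothetical signal: 10 V maximum, noise limit 10 microvolts.
signal_bits = dynamic_range_bits(10.0, 10e-6)

# Hypothetical low-level signal: 0.625 V peak on a 10 V input range.
low_level_bits = effective_bits(12, 0.625, 10.0)

print(adc_bits, round(signal_bits, 1), low_level_bits)
```

In this assumed case the signal spans roughly 20 bits while the converter offers 12, and a signal that uses only one sixteenth of the input range is digitized with 8 effective bits. This is the situation in which a gain-controlled amplifier is inserted before the ADC.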
In order to determine the optimal sampling rate without knowing the frequency spectrum of a signal, it is often sufficient to consider the desired time resolution of the shortest event in the signal. Consider the acquisition of a signal produced by a scanning mass spectrometer. Usually the signal coming from one single mass lasts about 1 msec. In order to get a good representation of this part of the signal it is necessary to take about 20 samples on the peak. This means that the sampling frequency should be around 20 kHz. The anti-aliasing filter should be set to a cut-off frequency of 10 kHz.
Too low a sampling rate results in a phenomenon called 'aliasing'. If a signal is reconstructed from sampled data which were corrupted by aliasing, the reconstructed signal gives a wrong image of the original data. An example should clarify this. Figure 8.2a shows a sine signal of 100 Hz which is sampled at a rate of 120 Hz (Fig. 8.2b). If the original signal is reconstructed from the sampled data, a signal with a frequency of 20 Hz is obtained (Fig. 8.2c).
Fig. 8.2 Effect of aliasing. The signal shown in (a) is sampled at too low a data rate (sampling times are indicated by arrows on trace (b)). Trace (c) shows the incorrect reconstruction of the signal from the digitized values.
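Both the rule of thumb for the sampling rate and the aliasing example can be checked numerically. The sketch below is illustrative only; the helper names are assumptions, and the alias frequency is computed by folding the signal frequency back to the nearest multiple of the sampling rate.

```python
import math

def sampling_rate(event_duration_s, samples_per_event):
    """Sampling rate needed to place a given number of samples on the shortest event."""
    return samples_per_event / event_duration_s

def alias_frequency(f_signal, f_sample):
    """Apparent frequency of a sine of f_signal after sampling at f_sample."""
    return abs(f_signal - f_sample * round(f_signal / f_sample))

def sample_sine(freq_hz, f_sample, n_samples):
    """Values of a unit sine wave taken at the sampling instants k / f_sample."""
    return [math.sin(2.0 * math.pi * freq_hz * k / f_sample)
            for k in range(n_samples)]

# Mass spectrometer example: 20 samples on a peak lasting 1 msec.
rate = sampling_rate(1e-3, 20)

# Aliasing example: a 100 Hz sine sampled at 120 Hz appears as 20 Hz.
apparent = alias_frequency(100.0, 120.0)

# The samples of the 100 Hz sine coincide with those of an inverted 20 Hz sine.
fast = sample_sine(100.0, 120.0, 12)
slow = sample_sine(20.0, 120.0, 12)
identical = all(abs(a + b) < 1e-9 for a, b in zip(fast, slow))

print(rate, apparent, identical)
```

The last check shows why the reconstruction goes wrong: at the sampling instants the two sine waves cannot be distinguished, so no processing after digitization can recover the original 100 Hz signal.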
8.2.2 Noise
Noise is a central problem in data acquisition and is often difficult to handle. A good survey of noise phenomena and of countermeasures against them is given by A. Rich (ref. 7,8).
Types of noise. Any electronic system contains (and produces) many types of noise. Basically two aspects of noise can be differentiated:
- intrinsic noise is the noise that originates from the signal generating process (e.g. thermal noise, or ion statistics),
- interference noise is picked up from outside an electronic system. This type of noise may be due to natural interferences (e.g. lightning) or disturbances from other electronic equipment (e.g. a nearby radio transmitter). Table 8.1 gives a short survey of possibilities for noise pick-up.
Table 8.1 Possibilities of noise pick-up

Noise source                Coupling          Receiver
radio waves                 signal path       transducer
logic signals               electric field    signal wire
line transients             magnetic field    preamplifier
changing magnetic fields                      reference
Intrinsic noise can only be minimized by selecting low noise components or by using statistical methods if it is possible to make repeated measurements. Normally, the user of a data acquisition system has no influence on its construction. Therefore we will only consider the interference noise in the discussion below. A short survey on methods for noise reduction is given by Smit (ref. 9).
Measures against noise pick-up. Noise pick-up always involves three elements: a source of noise, a coupling medium and a receiver. In order to solve a noise problem it is necessary to eliminate at least one of these elements. Usually neither the noise source nor the receiver can be easily removed. Therefore the only practical way to reduce noise is to eliminate or reduce the coupling.
Common impedance noise. Common impedance noise develops if several devices share a common impedance. Consider the following example (Fig. 8.3): The power supplies of two devices are connected as indicated in Figure 8.3a. They have part of the ground wire in common, which means that the potential of the ground terminal of each device is influenced by the current drawn by the other device. The voltage drop across the ground wire is due to its finite resistance. Figure 8.3b shows how to connect two instruments correctly.
Fig. 8.3 Common impedance noise: (a) common impedance may cause noise; (b) correctly connected instruments.
Capacitively coupled noise. Any two objects form a capacitor ('stray capacitance') which establishes a path for high frequency signals. Noise which is capacitively coupled into a circuit is basically current noise which is converted to voltage by an impedance between the two objects. Let us consider a cable which is 10 meters long and consists of 2 wires. If it has a capacitance of 100 pF/m the overall capacitance is 1 nF. A signal of 1 kHz and 10 V amplitude on wire A induces about 600 mV on wire B if the impedance of wire B is 10 kOhm. This example shows that long parallel signal lines are inherently susceptible to noise coupling if no further countermeasures are taken.
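The figure of about 600 mV in the cable example can be reproduced with a simple model. The sketch below assumes that the stray capacitance and the impedance of wire B form a series voltage divider driven by the signal on wire A; this model and the function name are our assumptions for illustration, not a complete description of a real cable.

```python
import math

def coupled_voltage(v_source, freq_hz, capacitance_f, impedance_ohm):
    """Voltage across the receiving impedance of a capacitive voltage divider."""
    # Reactance of the stray capacitance at the given frequency.
    x_c = 1.0 / (2.0 * math.pi * freq_hz * capacitance_f)
    # Magnitude of the divider ratio R / |R - j*Xc|.
    return v_source * impedance_ohm / math.hypot(impedance_ohm, x_c)

# Cable example: 10 m at 100 pF/m gives 1 nF of stray capacitance.
capacitance = 10 * 100e-12

# 10 V at 1 kHz on wire A, 10 kOhm impedance on wire B.
v_noise = coupled_voltage(10.0, 1e3, capacitance, 10e3)
print(round(v_noise, 3))
```

Because the reactance of the stray capacitance falls with rising frequency, the same divider passes far more noise at higher frequencies, and a lower impedance on wire B reduces the coupled voltage.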
In order to prevent capacitive coupling between two objects, a conducting shield connected to a reference potential (usually ground) should be inserted between them. If shielding is not applied in the right way it can even introduce additional noise coupling. Therefore some rules for the application of electrostatic shields are given below:
- the shielding should be connected to the reference potential of the signal. If the signal is grounded then the shielding should be connected to signal ground at the side of the signal source. If the signal is not grounded the shield should be connected to the signal reference potential and not to the ground,
- in order to avoid ground loops the shielding must not be connected to the reference potential more than once,
- the connecting wire between shield and reference potential should be of low impedance, otherwise the noise current captured by the shield would cause a voltage drop across this wire which induces noise on its own.
Inductively coupled noise. Strong magnetic fields are developed by lines carrying heavy current (power supply lines) or by inductors (motors, power transformers etc.). Inductively coupled noise is voltage noise and cannot be eliminated by a low receiver impedance. In general inductively coupled noise is generated any time some sort of loop exists. Among the most common (unwanted) loops are ground loops. The example in Figure 8.4 may clarify this.
Instrument 1 (signal source) and instrument 2 (12 bit data acquisition system) are 100 m apart. In order to reduce capacitively coupled noise the signal line is shielded, and the shield is erroneously connected to ground at both ends of the line. This leads to a loop which is formed by the shield, the connections to ground and ground itself. Suppose that a signal of 10 V is to be transmitted to the data acquisition system and that there exists a potential difference of 1 V at 50 Hz (mains) between the two ground points. For a typical transmission wire (resistance 0.01 Ohm/m, mutual inductance 1 microH/m) this potential difference causes a current of 1 Ampere to flow on the shield. This current induces a noise voltage of 31 mV at 50 Hz on the signal line; in other words, the 12 bit data acquisition system can be utilized only up to 8 bits. The potential difference in a heavy-industry environment can exceed 50 Volts; therefore it is nearly impossible to transmit a signal if the shield is grounded in the wrong way.
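The numbers in the ground loop example follow from Ohm's law and the induction law for a sinusoidal current. The sketch below repeats the arithmetic; it assumes that the shield behaves as a pure resistance at 50 Hz, and the function names are hypothetical.

```python
import math

def ground_loop_noise(length_m, resistance_per_m, mutual_inductance_per_m,
                      v_ground, freq_hz):
    """Return (shield current in A, induced noise voltage in V) for a ground loop."""
    shield_resistance = length_m * resistance_per_m
    shield_current = v_ground / shield_resistance
    mutual_inductance = length_m * mutual_inductance_per_m
    # Voltage induced by a sinusoidal current: 2 * pi * f * M * I.
    v_noise = 2.0 * math.pi * freq_hz * mutual_inductance * shield_current
    return shield_current, v_noise

def usable_bits(v_signal, v_noise):
    """Number of bits that remain above the noise level."""
    return math.log2(v_signal / v_noise)

current, noise = ground_loop_noise(100.0, 0.01, 1e-6, 1.0, 50.0)
bits = usable_bits(10.0, noise)
print(round(current, 2), round(noise * 1000.0, 1), round(bits, 1))
```

With these inputs the shield carries about 1 A, the induced noise is about 31 mV and slightly more than 8 bits of the 12 bit converter remain usable, in agreement with the values quoted in the text.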
Fig. 8.4 Ground loop. A ground loop is formed if both ends of a shield are connected to ground potential. The correct way is to connect the shield to ground at one side only.
Shielding of magnetic fields is much more elaborate than shielding of electric fields, as magnetic field lines penetrate conducting materials. In general it is easier to remove a source of magnetic field than to shield it. As the flux density of a magnetic field decreases with increasing distance, a good countermeasure is to keep enough distance from a potential source of inductively coupled noise. In order to avoid interfering magnetic noise which arises from cables conducting large currents, these wires should be twisted. If the currents in the two wires are equal and of opposite direction, their magnetic fields cancel. Another countermeasure is to transmit signals on twisted pairs of wires by using differential amplifiers. If magnetic coupling exists, the voltages induced in the two wires are of the same size and are thus cancelled at the receiver. Furthermore, as the level of the induced voltage increases with the area of a loop, a twisted pair of wires ensures that induced voltages are kept low.
8.3 SIGNAL CONDITIONING
Each transducer requires a special form of excitation and conditioning of its output signal. This conditioning depends on the electrical parameters of the specific transducer and includes procedures such as amplification, level shifting, galvanic isolation, linearization, filtering etc. All the steps involved in signal conditioning can be done by analog electronic devices. Recently, however, there has been a trend to perform as much of the conditioning as possible with the help of the computer. Nevertheless there are some processes which cannot be done by software (e.g. galvanic isolation or amplification).
8.3.1 Level shifting
Many transducers give an output signal which exhibits only minor changes over the whole operating range and have an offset from zero. In order to amplify such low level signals it is necessary to shift the output level towards zero. This can be accomplished by various means, for example by bridge circuits or by using instrumentation amplifiers (a short survey on instrumentation amplifiers is given by J. R. Riskin (ref. 10)). Fig. 8.5 shows an example of a circuit using an operational amplifier. The signal is both shifted and amplified. Level shifting by operational amplifiers is of course limited to the voltage range within the power supplies of the amplifier.
Fig. 8.5 Level shifting. This circuit shows a simple way to shift the output level of a transducer. The input signal Uin is shifted and amplified according to the formula shown in the figure.
8.3.2 Linearization
Most transducers have non-linear transfer characteristics. In order to get a proper result this non-linear transfer curve has to be linearized. There are two ways to do so.
Analog linearization. Linearization by electronic means can be done by inserting a non-linear network into the signal path (ref. 11). This network has a non-linear response which is the inverse of the transducer transfer characteristic. The net effect is a linearized output. The disadvantages of this method are the reduction of sensitivity and the extra noise which is produced by the linearizing network.
Digital linearization. Linearization can be effectively performed in the computer during signal analysis. The raw data can be linearized either by using a look-up table or by calculating the linearized value with a function which is the inverse of the transfer characteristic of the transducer. Digital linearization has the advantage of being very versatile and flexible. It can be adjusted to any type of non-linear response and can be re-scaled if conditions or transducers are changed. However, linearization by computational means can lead to poor accuracy if the transfer characteristic of a transducer is exponential and the resolution of the ADC is too low.
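The two digital approaches mentioned above, a look-up table and an inverse function, can be sketched as follows. The transducer is purely hypothetical: we assume its output voltage u is the square root of the measured quantity x, so that the inverse function is x = u * u. The calibration points and function names are likewise assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical calibration table for a transducer with u = sqrt(x).
CAL_U = [float(i) for i in range(11)]        # output voltages 0, 1, ..., 10
CAL_X = [float(i * i) for i in range(11)]    # measured quantity 0, 1, 4, ..., 100

def linearize_inverse(u):
    """Linearize by evaluating the inverse of the assumed transfer characteristic."""
    return u * u

def linearize_table(u):
    """Linearize by linear interpolation between neighbouring calibration points."""
    if not CAL_U[0] <= u <= CAL_U[-1]:
        raise ValueError("value outside the calibrated range")
    # Index of the calibration point just above u (clamped at the table end).
    i = min(bisect_right(CAL_U, u), len(CAL_U) - 1)
    u0, u1 = CAL_U[i - 1], CAL_U[i]
    x0, x1 = CAL_X[i - 1], CAL_X[i]
    return x0 + (x1 - x0) * (u - u0) / (u1 - u0)

print(linearize_inverse(2.5), linearize_table(2.5))
```

For u = 2.5 the inverse function gives 6.25, whereas the coarse table gives 6.5. The table is exact only at its calibration points, so its spacing has to be chosen according to the curvature of the transfer characteristic and the accuracy required.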
8.4 ANALOG-TO-DIGITAL CONVERSION
8.4.1 Reference voltage sources
Each of the possible ways for analog-to-digital conversion requires some sort of reference, usually a reference voltage. The performance of the reference voltage source has a major influence on the overall performance of a data acquisition system.
There are two classes of reference elements which are based on different principles: bandgap elements and Zener diodes.
Bandgap elements are based on the principle that the inherent negative temperature coefficient of semiconducting material (-2 mV/deg.C for silicon) is compensated by adding a voltage which has an equal positive temperature coefficient. The resulting output voltage corresponds to the difference of energy levels between the conduction band and the valence band of the semiconducting material (about 1.25 V for silicon).
Zener diodes are reverse biased diodes. At a particular voltage the reverse current increases sharply, and the voltage at this point is virtually independent of the current through the diode. Zener diodes are available for a wide range of voltages (3 to 200 V). The temperature coefficient varies with the Zener voltage from negative to positive values and crosses zero at approximately 5.6 V. For very low requirements it can be sufficient to use just a normal Zener diode.
Most of the commonly used reference elements are integrated circuits which contain all the auxiliary circuitry necessary to ensure temperature stability and the proper output voltage. Extra temperature stability is achieved by integrating a heater on the semiconductor device.
8.4.2 Sample and hold circuits
Several types of ADCs require a constant input voltage during analog-to-digital conversion. If the input signal varies by more than half the least significant bit during conversion, the accuracy of the result is decreased. This fact considerably limits the maximum signal frequency which can be digitized. Figure 8.6 shows the relationship between conversion time, desired resolution and maximum signal frequency. As can be seen, the maximum allowable signal frequency is only 9 Hz if a 12-bit ADC with a conversion time of 10 microseconds is used.
In order to overcome this effect a special circuit called 'sample-and-hold' (S&H) or 'track-and-hold' (T&H) is inserted into the signal path just before the AD converter. This circuit has two states, one in which the output of the circuit follows the input signal ('track') and one in which the output signal is held constant no matter what the input signal does ('hold').
Fig. 8.6 Dependency of conversion time, resolution and maximal signal frequency.
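The order of magnitude of this limit can be estimated from the maximum slope of a sine wave. The sketch below assumes a sine whose peak-to-peak amplitude equals the full input range of the ADC and requires that it changes by no more than a given fraction of one least significant bit while the input is not held; the function name and these assumptions are ours and may differ from the conventions used for Figure 8.6.

```python
import math

def max_signal_frequency(n_bits, t_open_s, allowed_lsb=0.5):
    """Highest sine frequency that changes by at most allowed_lsb within t_open_s.

    A full-scale sine has a maximum slope of pi * f * full_scale, and one LSB
    is full_scale / 2**n_bits, so the full-scale value cancels out.
    """
    return allowed_lsb / (math.pi * t_open_s * 2 ** n_bits)

# 12-bit ADC, 10 microseconds conversion time, no sample-and-hold.
f_half_lsb = max_signal_frequency(12, 10e-6)
f_one_lsb = max_signal_frequency(12, 10e-6, allowed_lsb=1.0)

# Same ADC behind a sample-and-hold with a switching time of 100 nsec.
f_with_hold = max_signal_frequency(12, 100e-9)

print(round(f_half_lsb, 1), round(f_one_lsb, 1), round(f_with_hold))
```

Under these assumptions the limit is about 4 Hz for the half-LSB criterion and about 8 Hz if a change of one LSB is tolerated, the same order of magnitude as the value read from the figure. Replacing the conversion time by a switching time of 100 nsec raises the limit by a factor of 100.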
Before the start of an AD conversion the S&H circuit is switched into hold mode, which freezes the current level of the input signal. After the AD conversion the S&H is switched back into track mode. If the time needed for switching from track to hold (typically 100 nsec) is shorter than the conversion time of the ADC, the maximum allowed signal frequency is increased accordingly. There are two exceptions where the application of sample and hold circuits is not useful or even pointless. First, if the conversion time of an AD converter is less than the switching time of the S&H circuit, an added S&H would decrease the performance of the AD converter. This could be the case with fast flash converters. Second, integrating types of AD converters do not need a constant input voltage during conversion, hence no sample and hold circuit is needed if averaged digitized values are tolerable.
8.4.3 Principles of analog-to-digital conversion
The techniques for analog-to-digital conversion can be classified into three fundamental principles, which differ in conversion speed and in the sophistication of their physical implementation.
'Word-at-a-time' conversion. This technique (flash converters) converts an analog input signal into digital form in one very short step (5 to 100 nsec) by comparing the input signal with all possible values at the same time. Therefore a high number of comparators (2^n - 1 for a resolution of n bits) is needed to produce the result. As the number of comparators increases exponentially with the resolution, this technique is limited to moderate resolution (typically 8 bit). Figure 8.7 shows the internal construction of a flash converter. The input signal is routed to 2^n - 1 comparators. Each comparator compares the input signal to a reference voltage which is derived from a resistor ladder. If the input signal is larger than the reference voltage, the comparator gives a logical 1 as result. The binary result is produced by a priority encoder which determines the number of comparators which have a logical 1 as result.
Another representative of this type of converters is the 'cascade converter' which utilizes two flash converters. The first converter digitizes the input signal with a rough resolution (e.g. 6 bit). The result is converted back into analog form and the resulting voltage is subtracted from the input signal. The second flash converter digitizes this difference and the two results are then combined to form a digitized output which has higher resolution than could be obtained by a single converter. Flash converters are mainly used in very fast measurement devices with low to medium requirements on resolution (e.g. digital sampling oscilloscopes or video digitizers).
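The principle of the flash converter can be imitated in software. The following sketch assumes an ideal resistor ladder with equally spaced reference voltages and models the priority encoder simply as a count of the comparators that output a logical 1; the function name and the example voltages are hypothetical.

```python
def flash_convert(v_in, v_ref, n_bits):
    """Simulate an ideal n-bit flash converter with 2**n - 1 comparators."""
    n_comparators = 2 ** n_bits - 1
    # Reference voltages tapped from an ideal, equally divided resistor ladder.
    thresholds = [k * v_ref / 2 ** n_bits for k in range(1, n_comparators + 1)]
    # All comparators decide at the same time: 1 if the input exceeds the tap.
    outputs = [1 if v_in > t else 0 for t in thresholds]
    # The priority encoder reports how many comparators give a logical 1.
    return sum(outputs)

# 3-bit converter with a 5 V reference: 7 comparators, taps every 0.625 V.
print(flash_convert(2.6, 5.0, 3))

# An 8-bit converter already needs 255 comparators.
print(flash_convert(5.0, 5.0, 8))
```

In hardware all comparisons happen simultaneously, which explains the speed of this converter type; the loop in the sketch only stands in for that parallelism. The exponential growth of the comparator count with the number of bits is what limits the resolution.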
Fig. 8.7 Flash converter.
'Digit-at-a-time' conversion. With this technique only one binary digit is evaluated at a time. The conversion procedure is started by comparing the input signal with a reference voltage of half the full scale range. If the input signal is larger than the reference voltage the digit is set to 1 and the reference is subtracted from the input signal; otherwise the digit is set to 0 and the input signal is not altered. In the next step the reference voltage is halved, and the procedure is repeated until the desired resolution is achieved. The resulting train of 1's and 0's represents the binary form of the analog signal.
This technique is called successive approximation, as the digitized value is better approximated with each step of the conversion. Compared with the 'word-at-a-time' method, the method of successive approximation needs only ld(n) reference voltages, but it is slower than the above mentioned technique by a factor of ld(n) (n = number of discrete steps, ld = logarithm to base 2). Figure 8.8 shows the principal circuitry which is involved in successive approximation and the course of the reference voltage during conversion.
ADCs based on successive approximation are the most popular converters as they are both cheap and fast enough to sample signals in the kHz range.
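The successive approximation procedure described above translates almost directly into code. The sketch below follows the textual description (compare, subtract, halve the reference); it is not a model of a particular converter chip, and the function name is an assumption.

```python
def successive_approximation(v_in, full_scale, n_bits):
    """Convert v_in into n_bits binary digits, one digit per step."""
    bits = []
    reference = full_scale / 2.0   # first comparison at half the full scale range
    residue = v_in
    for _ in range(n_bits):
        if residue > reference:
            bits.append(1)
            residue -= reference   # subtract the reference from the input signal
        else:
            bits.append(0)         # input signal is left unaltered
        reference /= 2.0           # halve the reference for the next digit
    # Assemble the digits (most significant first) into an integer code.
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return bits, code

bits, code = successive_approximation(3.3, 5.0, 8)
print(bits, code)
```

An 8-bit conversion needs only eight comparisons, one per digit, instead of the 255 simultaneous comparisons of a flash converter with the same resolution.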
Fig. 8.8 Successive approximation converter (DAC: digital-to-analog converter, SAR: successive approximation register).
'Level-at-a-time' conversion. The third technique uses only one reference voltage but n steps to get the digitized result. The methods involved are usually based on counters and are of an integrating nature. These types of AD converters tend to be slow (typically 100 ms) but give very high resolution (up to 21 bit). Two types are discussed here.
Voltage-to-frequency converters (V/F). The applied input voltage is converted into an oscillation with a frequency proportional to the input voltage. The resulting frequency is measured within fixed time intervals. This type of ADC has a large dynamic range which can be adjusted by varying the counting period. The linearity is typically about 14 bit. If the counting period equals the period of the line frequency, interference from the mains can be suppressed effectively. The V/F converter is often used in GC-integrators, as the dynamic range of the converter meets the large dynamic range of the commonly used flame ionization detector.
Dual-slope converters. These types of converters consist of an integrator, a comparator and a timer (Fig. 8.9). The result is obtained in two phases. In phase 1 the input signal is applied to the integrator and the voltage of the integrator increases with time. After a fixed period of time the signal is disconnected and a reference voltage of opposite sign is applied to the integrator (beginning of phase 2). The voltage at the integrator output decreases until it crosses zero. The time needed to reach zero level is proportional to the voltage of the input signal. The accuracy of dual-slope converters is fairly good (up to 20 bit), as both the input signal and the reference voltage are integrated with the same integrator. Dual-slope converters are mainly used in digital voltmeters.
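The two phases of a dual-slope conversion can be simulated with a simple counting loop. The sketch below is an idealized model under our own assumptions (a constant input voltage, a perfect integrator, one clock tick per loop pass); the function name and the parameter `rc` for the integrator time constant are hypothetical.

```python
def dual_slope_convert(v_in, v_ref, n_phase1, rc=1.0):
    """Return the phase 2 count of an idealized dual-slope converter."""
    if v_in < 0.0 or v_ref <= 0.0:
        raise ValueError("this sketch handles v_in >= 0 and v_ref > 0 only")
    # Phase 1: integrate the input signal for a fixed number of clock ticks.
    integrator = 0.0
    for _ in range(n_phase1):
        integrator += v_in / rc
    # Phase 2: integrate the reference of opposite sign, count ticks to zero.
    count = 0
    while integrator > 0.0:
        integrator -= v_ref / rc
        count += 1
    return count

# 2.5 V input, 10 V reference, 1000 ticks in phase 1.
print(dual_slope_convert(2.5, 10.0, 1000))

# A different integrator time constant changes the result by at most one count.
print(dual_slope_convert(2.5, 10.0, 1000, rc=3.7))
```

The count is approximately n_phase1 * v_in / v_ref. The time constant of the integrator appears in both phases and cancels, which illustrates why integrating both voltages with the same integrator leads to good accuracy.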
8.5 INTERFACES
8.5.1 Backplane-interface (bus)
The simplest way (from the point of view of the electronic circuit designer) to route the acquired data into computer memory is via a direct bus interface. This method is very powerful because it allows very fast acquisition of data. Nevertheless the application of this method requires a good background knowledge of hardware. If data rates are not very high (less than 1...10 kHz) it is more flexible to use a standardized interface (RS-232, IEEE-488).
Fig. 8.9 Dual slope converter.
8.5.2 RS-232
The RS-232-C standard defines a serial interface which is commonly used. In its minimum configuration it is possible to transmit data over 15 meters with only three wires. For larger distances a modem is necessary. Data transmission is performed by sending the individual bits of a data byte one after the other. The data bits are enclosed by special bits (start bit, parity bit, stop bit) which provide proper synchronization and error detection in the receiver. The form of the transfer is called the protocol. This protocol defines the baud rate (bits per second = baud), the number of stop bits and an optional parity bit. Figure 8.10 shows an example of transmitted data. A logical 0 is transmitted by a voltage level of +3...+12 V and a logical 1 is transmitted by a voltage level of -3...-12 V. The voltage range between -3 V and +3 V is not defined.
Fig. 8.10 Data transmission via RS-232.
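The framing of a single data byte can be written down explicitly. The sketch below builds the bit sequence for one character (start bit, data bits with the least significant bit first, optional parity bit, stop bits) and maps it to line voltages using the levels given above. The function names and the choice of 12 V as drive level are assumptions for illustration.

```python
def rs232_frame(byte, data_bits=8, parity="even", stop_bits=1):
    """Return the logical bits of one asynchronous frame, in order of transmission."""
    data = [(byte >> i) & 1 for i in range(data_bits)]   # least significant bit first
    frame = [0] + data                                   # the start bit is a logical 0
    if parity == "even":
        frame.append(sum(data) % 2)       # makes the total number of 1's even
    elif parity == "odd":
        frame.append(1 - sum(data) % 2)   # makes the total number of 1's odd
    frame += [1] * stop_bits              # stop bits are logical 1
    return frame

def line_levels(frame, level=12.0):
    """Map logical bits to line voltages: logical 0 positive, logical 1 negative."""
    return [level if bit == 0 else -level for bit in frame]

frame = rs232_frame(0x41)   # ASCII 'A' with even parity and one stop bit
print(frame)
print(line_levels(frame))
```

With 8 data bits, a parity bit and one stop bit, eleven bits are sent for every data byte; without parity the frame shortens to ten bits.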
Although the RS-232 standard defines a 25-pin D-shell connector, many hardware manufacturers implement the RS-232 interface by using a 9-pin D-shell connector. Figure 8.11 shows the two commonly used connectors and their pin assignment. Two equal devices are connected via an RS-232 cable by connecting TxD1 with RxD2, TxD2 with RxD1 and the ground pins of both devices. If hardware handshake is desired, the signals CTS and RTS have to be connected too (CTS1 with RTS2, CTS2 with RTS1).
When using an RS-232 interface one has to bear in mind that the transmission rates are restricted to about 1 kbyte/second. On the other hand it is possible to transmit data over very long distances (round the world) by using a modem if low data rates (100 bytes/second) are sufficient.
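The byte rates quoted above follow from the baud rate and the number of bits per frame. The sketch below assumes baud rates of 9600 and 1200, which are common settings but are not given in the text, and a frame of one start bit, eight data bits and one stop bit; the function name is hypothetical.

```python
def bytes_per_second(baud, data_bits=8, parity=False, stop_bits=1):
    """Net data rate of an asynchronous serial line in bytes per second."""
    bits_per_frame = 1 + data_bits + (1 if parity else 0) + stop_bits
    return baud / bits_per_frame

direct_line = bytes_per_second(9600)   # assumed direct connection at 9600 baud
modem_line = bytes_per_second(1200)    # assumed modem connection at 1200 baud
print(direct_line, modem_line)
```

At 9600 baud the line delivers 960 bytes per second, i.e. about 1 kbyte/second, and a 1200 baud modem link delivers 120 bytes per second. An ADC sampling at 20 kHz with two bytes per sample would produce 40 kbyte/second, far more than such a line can carry.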
Fig. 8.11 Pin assignment of RS-232 connectors (DB25P and DB9S). Signals: PGND ... protective ground; TxD ... transmit data; RxD ... receive data; RTS ... request to send; CTS ... clear to send; DSR ... data set ready; GND ... signal ground; DCD ... data carrier detect; DTR ... data terminal ready; RI ... ring indicator.
8.5.3 IEEE-488
The IEEE-488 standard is a widely accepted standard for data transmission between laboratory equipment. Data are transferred in parallel. The IEEE-488 bus is very sophisticated and flexible. It is sometimes called 'General Purpose Interface Bus' (GPIB) or 'Hewlett Packard Interface Bus' (HPIB), as this bus was introduced by Hewlett Packard. Each device which is connected to the bus has a unique address and has the right to listen or talk to other instruments. One instrument is set up as a bus controller which manages data transfer on the bus. Physically the IEEE bus consists of 24 signal lines (8 data, 8 control, 8 ground lines). The connection scheme is shown in Figure 8.12.
Fig. 8.12 Pin assignment of IEEE connectors. Data lines: DIO1 ... DIO8. Control lines: EOI ... end or identify; DAV ... data valid; NRFD ... not ready for data; NDAC ... not data accepted; IFC ... interface clear; SRQ ... service request; ATN ... attention; REN ... remote enable. GND ... signal ground.
The IEEE-488 bus has one major drawback in everyday applications. The overall performance of the bus depends on the slowest device connected to it, which means that one slow instrument can slow down the data transfer rates within a system to an unacceptable level (e.g. think of a hard disk which is connected to the computer on the same bus).
8.6 SOFTWARE
Many software products are available for data acquisition and/or data manipulation. Most of the packages are specific to the interface card used, and most of them are only a collection of subroutines which allow the user to control the specific hardware.
Nevertheless, there are some packages on the market which allow a more general approach to data acquisition. A few products are mentioned below.
ASYST. ASYST is probably the most comprehensive of the packages mentioned. It not only supports several types of cards but also allows the user to analyze the data and to create graphical representations of them. It uses a FORTH-like threaded interpretive language which provides a great deal of speed and flexibility. The software package can be bought in modules, which allows the user to set up a data acquisition system according to his needs.
ASYST, MacMillan Software Co., 866 Third Ave., New York, NY 10022
LABPAC. LABPAC is sold by the manufacturer of the LabMaster board (TecMar) and supports only this card. LABPAC consists of a collection of assembly language subroutines which can be called from high level languages like BASIC, Pascal or FORTRAN.
LABPAC, Scientific Solutions Inc. (Tecmar), 6225 Cochran Rd., Cleveland, OH 44139
PCLAB. PCLAB is sold by the manufacturer of the DT2801 board (Data Translation Inc.) and supports only this card. It consists of an assembly language library which can be called from a high level language.
PCLAB, Data Translation Inc., 100 Locke Dr., Marlboro, MA 01752
SALT. SALT is a threaded interpretive language which is interfaced to BASIC. Thus it exhibits both the fast response of assembly language programs and the interactive character of BASIC programs. SALT was written by a researcher who emphasized the ability to control instruments. SALT supports only the TecMar LabMaster hardware.
SALT, Sam Fensten, 4949 South Woodlawn Ave., Chicago, IL 60615
8.7 REFERENCES
1 A.J. Diefenderfer, 'Principles of Electronic Instrumentation', Saunders, London, 1979,
2 J. Millman, 'Microelectronics: Digital and Analog Circuits and Systems', McGraw-Hill, New York, 1982,
3 D.H. Sheingold (ed.), 'Transducer Interfacing Handbook', Analog Devices, Norwood, Massachusetts, 1980,
4 U. Tietze, Ch. Schenk, 'Halbleiter-Schaltungstechnik', Springer, Berlin, 1985,
5 H. Zander, 'Analog-Digital-Wandler in der Praxis', Markt & Technik, 1983,
6 T. Fleming, 'Design Considerations for a Data Acquisition System', Harris Corporation, USA, 1982,
7 A. Rich, 'Understanding Interference-type Noise', Analog Devices, Norwood, Massachusetts, 1984,
8 A. Rich, 'Shielding and Guarding', Analog Devices, Norwood, Massachusetts, 1984,
9 H.C. Smit, 'Computer-based Estimation of Noisy Analytical Signals', Conference on Computer Based Methods in Analytical Chemistry (COBAC) IV, Graz, Austria, 1986,
10 J.R. Riskin, 'A User's Guide to IC Instrumentation Amplifiers', Analog Devices 1984, Norwood, Massachusetts, 1983,
11 D.H. Sheingold (ed.), 'Nonlinear Circuits Handbook', Analog Devices, Norwood, Massachusetts, 1976.
9 PCs AND NETWORKING
Engelbert ZIEGLER Max-Planck-Institut fuer Kohlenforschung, D-4330 Muelheim/Ruhr, POB 011 325, W. Germany
9.1 INTRODUCTION
For more than two decades the use of centralized computing facilities was the most economic approach to applying computers within a company or an institution. Grosch's law, named after one of the early pioneers in the field, described this situation: the price of a computer increases with the square root of its compute power. In other words, for twice the price a central computer with four times the CPU power could be purchased; thus, one big centralized computer system was more cost-effective than a number of smaller computers with equivalent total power. Furthermore, connecting individual systems to form a network was a difficult task: no accepted standards for the required hardware and software existed. In most cases only rather low-speed connections, via asynchronous RS 232C lines, could be implemented.
In the course of the last few years this situation has changed dramatically through the rapid progress in semiconductor technology. (This progress becomes evident in the cost of main memory, for example, which has decreased by a factor of 20000 within 20 years!) An average mainframe computer system in 1975 was equipped with about 0.5 to 1 MB of main memory, 50 to 100 MB of disk storage and 0.5 MIPS (= Million Instructions Per Second) of compute power. These numbers match the specifications of today's personal computers very well. Grosch's law is no longer valid, except perhaps in the field of supercomputers: in many situations a number of smaller systems is more cost-effective than one large system. In addition, industry-wide standards for networking have been developed; sophisticated software facilitates the transparent usage of computer networks in accordance with the slogan 'the network is the system'.
Since there is no longer an economic reason for the big centralized computer system, there is no longer a necessity for different applications and non-cooperating groups of users to share the same computer system. Furthermore, software development work, which very often causes heavy fluctuations of the total load on a system, can be separated from production work by assigning it to a separate computer system of the same type. These possibilities match very well with the general trend towards personal computing, where each user owns a PC or a graphics workstation.
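The arithmetic behind Grosch's law and the quoted fall in memory cost can be checked with a short calculation. The following Python sketch is an illustration only; the function names and the proportionality constant k are assumptions introduced here and do not come from the text.

```python
def grosch_power(price, k=1.0):
    """Compute power under Grosch's law: price ~ sqrt(power), so power = k * price**2.

    k is an arbitrary proportionality constant (assumed here to be 1).
    """
    return k * price ** 2


def yearly_factor(total_factor, years):
    """Constant yearly factor that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


# One central machine for 4 price units versus four small machines for 1 unit each.
central = grosch_power(4.0)
distributed = 4 * grosch_power(1.0)
print(central / distributed)   # ratio of compute power for the same total price

# A 20000-fold drop in memory cost within 20 years, expressed as a yearly factor.
print(round(yearly_factor(20000, 20), 2))
```

Under these assumptions the central machine delivers four times the power of the four small machines for the same money, and the memory cost figure corresponds to a fall by a factor of roughly 1.64 per year.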
9.2 THE USE OF PERSONAL COMPUTERS
Because of their obvious advantages personal computers are used in all kinds of organizations and for a wide variety of applications. Accessibility and cost are comparable with those of a dialog terminal connected to a multi-user computer system. The main advantages, however, can be seen in the wealth of high-quality software products that are available today, and in the comfort of a high I/O bandwidth between CPU and display screen, allowing for instantaneous updates of the screen's contents. Furthermore, owning a PC provides independence from other sub-organizations, e.g. from a corporate computing center.
On the other hand, besides limitations in compute power, storage capacity and output devices, PCs have more serious shortcomings in the following areas:
Real-time applications. For most types of PCs a wide variety of computer cards from different vendors is available to connect laboratory instruments to a PC (see Chapter 8). The PC operating systems used today, however, e.g. MS-DOS, are not built for and are not suited to real-time applications. The multi-task real-time operating systems, e.g. RSX-11M, that have been available with the traditional 16-bit minicomputers made it possible to design sophisticated, very modular software systems for an application, even for complex instrumental setups in the laboratory. With today's PCs only rather simple experiments can be computerized. The normal single-user/single-task operating system does not even allow for any postprocessing (or plotting) of data of a previous experiment once a new experiment has been started. The throughput of an analytical instrument can be limited seriously if postprocessing and the run of a new sample cannot proceed simultaneously.
Administration of data. If PCs are selected and purchased on an individual basis without coordination, isolated and incompatible solutions will emerge. Information that is concentrated in a single computer system in the case of a centralized system is distributed over many separate stations, perhaps on diskettes handled by various coworkers. The collection and administration of information that may be important to the operation of an organization is difficult or impossible to manage under such circumstances.
Protection of data. A PC needs to be locked physically. Otherwise any person who has access to the room where the PC is located will be able to access the data files on the PC's hard disk.
Most of these shortcomings of PCs, however, can be overcome if the PC is integrated appropriately into a computer network.
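The throughput limitation described under 'Real-time applications' above can be made concrete with a small calculation. The Python sketch below is only an illustration: the function names and the run and postprocessing times are assumed values, not measurements from the text.

```python
def samples_per_hour_single_task(run_min, post_min):
    """Single-task system: the next run must wait until postprocessing is finished."""
    return 60.0 / (run_min + post_min)


def samples_per_hour_multi_task(run_min, post_min):
    """Multi-task system: postprocessing overlaps the next run, so the slower step limits."""
    return 60.0 / max(run_min, post_min)


# Assumed example: a 20 min instrument run followed by 10 min of postprocessing.
print(samples_per_hour_single_task(20, 10))
print(samples_per_hour_multi_task(20, 10))
```

With these assumed times the single-task system handles two samples per hour and the overlapped operation three, so the instrument itself becomes the only limit once both activities can run simultaneously.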
9.3 NETWORKING OF COMPUTERS
Networking of PCs, and of computers in general, allows for the exchange of information between the networked computers, the nodes. Data can be collected and assembled into commonly accessible data banks. Special peripherals, like high-performance printers, but also specialized processing hardware, e.g. for FFT applications, can be shared by many users. In short, the integration of PCs in networks, if efficiently implemented, combines the strengths of a centralized system with the advantages of personal computing.
9.3.1 Different types of networks
Within Wide-Area Networks (WANs) computers are connected over long distances via public telecommunication lines provided by the national PTTs. WANs are especially useful for accessing host computers with large databases, e.g. for the retrieval of bibliographic information. Another use of WANs is internationally operated electronic mail systems. In contrast to a WAN, a Local-Area Network (LAN) connects the computers within an organization via privately owned communication lines, e.g. in the form of coaxial cables or fiber optic cables. Normally, a LAN is restricted geographically, e.g. limited by a maximum allowable distance between two nodes. It is possible, however, to connect such local segments via bridges over long distances, for instance by making use of satellite links. By these means worldwide LANs consisting of many local segments can be built. In the following paragraphs only LANs will be discussed.
9.3.2 Transfer media, network topologies
The transfer media presently possible in LANs (simple twisted-pair wires, coaxial cables, and fiber optic cables) are characterized in more detail in Table 9.1.
Table 9.1 Characteristics of transfer media for networking
Twisted-pair cable      Coaxial cable       Fiber optic
low speed               medium speed        high bandwidth
sensitive to noise      easy interfacing    insensitive to noise
low cost                                    expensive interfacing
(1) Twisted-pair wire is the cheapest transfer medium, but it is sensitive to noise from the environment and is normally (depending on the distances) limited to lower transmission speeds. Therefore this type of cabling is used mainly for asynchronous serial data transmission with RS 232C interfaces, e.g. for the connection of dialog terminals to computers. Networks with star topology can be built this way, where small computers or PCs are connected terminal-like to a central computer system. With special care, however, the Ethernet protocol, as used with coaxial cables, with a serial transmission rate of 10 Mbits per second can be implemented too.
(2) Coaxial cable is less sensitive to noise and is suited for higher transmission speeds than twisted-pair cable. Baseband techniques (serial bit-by-bit transmission) as well as (frequency modulated) broadband techniques are possible. Types of possible communication techniques in coaxial cables are listed in Table 9.2.
Table 9.2 Types of communication techniques
Broadband technique               Baseband technique
modulated high frequency          bit-by-bit serial transmission
high speed (image processing!)    limited speed (10 to 20 Mbits/s)
expensive interfacing             easy interfacing
Today, coaxial cable with baseband transmission is the most frequently used technique in implementations of LANs, in Ethernet LANs as well as in token ring networks. Transmission rates are mostly between 6 and 16 Mbits/s. Interfacing to coaxial cable is rather simple, especially with Ethernet: new nodes can be connected to an existing cable without interrupting the operation of the network.
(3) Fiber optic cables are especially suited for high-speed data transmission. The new FDDI ('Fiber Distributed Data Interface') standard specifies a transmission rate of 100 Mbits/s. A comparison of the Ethernet and FDDI standards is given in Table 9.3.
Interfacing nodes to fiber optic networks is more complicated and expensive; therefore, up to now fiber optic lines are mainly used to bridge larger distances between local segments of networks.
Table 9.3 Characteristics of Ethernet and of FDDI standards
Ethernet                        Fiber Distributed Data Interface (FDDI)
coaxial cable                   fiber optic
baseband technique              broadband technique
bus topology                    token-ring topology
10 Mbits/s                      100 Mbits/s
1500 m within local segment     100 km
accepted industry standard      future additional standard
Fig. 9.1 Different topologies of computer networks: (a) point-to-point connections, (b) hierarchical star structure, (c) ring, (d) bus topology.
Topologies. The traditional way of connecting smaller computers to a central mainframe system has been the star topology, sometimes in the form of hierarchical trees. Other networks have been built with many point-to-point connections and with routing nodes. More modern approaches are ring and bus topologies. Different topologies of computer networks are shown in Figure 9.1.
In token rings the right to access the transfer medium is passed as a token sequentially from node to node within the ring system. By this means a maximum response time to a send/receive request can be guaranteed. This is not the case in bus topologies, where the nodes are connected to a passive cable which is accessed randomly. A request will be granted only if no other message is being transferred on the cable. A collision detection mechanism has to be applied for the case of two requests being initiated simultaneously. The Ethernet protocol applies such a method.
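The difference between the two access methods can be illustrated with a small model. The Python sketch below is a deliberately simplified illustration and not an implementation of any standard: the function names, the timing parameters and the doubling back-off window are assumptions made here for the sake of the example.

```python
import random


def token_ring_max_wait(n_nodes, max_hold_s, token_pass_s):
    """Worst-case wait for the token in a simplified ring model.

    Assumes every other node holds the token for at most max_hold_s and that
    each hand-over to the next node takes token_pass_s.
    """
    return (n_nodes - 1) * max_hold_s + n_nodes * token_pass_s


def bus_slots_until_success(n_contenders, rng, max_slots=100_000):
    """Count time slots until exactly one contender transmits alone on the bus.

    All contenders start in slot 0; after a collision each colliding node waits
    a random number of slots drawn from a window that doubles per attempt.
    """
    attempts = [0] * n_contenders
    next_slot = [0] * n_contenders
    for slot in range(max_slots):
        senders = [i for i in range(n_contenders) if next_slot[i] == slot]
        if len(senders) == 1:
            return slot
        for i in senders:  # collision: every sender backs off
            attempts[i] += 1
            window = 2 ** min(attempts[i], 10)
            next_slot[i] = slot + 1 + rng.randrange(window)
    raise RuntimeError("no successful transmission within max_slots")


# Token ring: a fixed upper bound in seconds (all three arguments are assumed values).
print(token_ring_max_wait(20, 0.0012, 0.0001))

# Bus with random access: one result per trial, with no fixed upper bound.
rng = random.Random(1)
print([bus_slots_until_success(5, rng) for _ in range(5)])
```

In this model the ring yields a single number that can be quoted as a guarantee, whereas the bus only yields a distribution of waiting times that depends on how many nodes happen to contend.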
9.3.3 Ethernet LANs
Initially the Ethernet specifications (Table 9.3) were promoted by the companies Xerox, Digital Equipment and Intel. They are based on a bus topology with passive coaxial cables and baseband data transmission. The physical transfer speed is specified as 10 Mbits/s. Today, Ethernet has become an industry-wide de-facto standard for the lower transport levels within LANs. In order to function with given computer operating systems, higher levels of layered software have to be implemented, for example DECnet for DEC computers, or TCP/IP for more general solutions. Based on Ethernet specifications, LANs can be built with about 1500 m as the maximum distance between any two nodes. With bridges, however, it is possible to extend the range of a network by connecting several local segments, even over very large distances, by fiber optic, microwave, or satellite links.
Ethernet and real-time applications. Because of the random access method used there is no guaranteed response time within an Ethernet LAN. All nodes have equal priority for getting access to the physical cable. According to the Ethernet specifications the maximum uninterruptible length of a block of data on the cable is 1500 bytes, corresponding to a transfer time of 1.2 ms. If data, e.g. from an active experiment, is to be transferred in real-time, delays in the order of milliseconds must be tolerable. This should not be a problem if data can be buffered in the local memory of the nodes. If, however, strict timing for the synchronization of processes on different nodes is a requirement, other protocols, e.g. the MAP protocol developed for an industrial manufacturing environment, should be applied instead of Ethernet.
Within the Ethernet LAN of the Max-Planck-Institutes in Muelheim, W. Germany, dedicated nodes (LSI-11 processors) are used to acquire data in real-time from mass spectrometers and chromatographs (see Fig. 9.2). This data is buffered in the main memory of these nodes and transferred blockwise to a PDP-11 or to VAX computers for further processing. By this means a hierarchical partitioning of real-time tasks is possible, where the sub-tasks are assigned to different nodes within the network. A hierarchical structure, distributed over several processors, can be built for a real-time application that is very similar to a multi-task system, e.g. RSX-11M, on a single processor.
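The 1.2 ms figure quoted above follows directly from the block length and the line rate, and the same arithmetic gives a feeling for the local buffer a data acquisition node needs. The Python sketch below is an illustration only; the function names, the sampling rate, the sample size and the assumed worst-case delay are hypothetical values chosen for the example.

```python
import math

ETHERNET_BITS_PER_S = 10_000_000   # 10 Mbits/s, as specified for Ethernet
MAX_BLOCK_BYTES = 1500             # maximum uninterruptible block on the cable


def block_transfer_time_s(n_bytes=MAX_BLOCK_BYTES, bits_per_s=ETHERNET_BITS_PER_S):
    """Time one block occupies the cable, ignoring all protocol overhead."""
    return n_bytes * 8 / bits_per_s


def buffer_bytes_needed(sample_rate_hz, bytes_per_sample, worst_delay_ms):
    """Local memory needed to bridge a network delay without losing samples."""
    return math.ceil(sample_rate_hz * bytes_per_sample * worst_delay_ms / 1000)


print(block_transfer_time_s() * 1000)        # transfer time in milliseconds
# Assumed example: 10 kHz sampling, 2 bytes per sample, 50 ms worst-case delay.
print(buffer_bytes_needed(10_000, 2, 50))    # bytes of buffer required
```

For these assumed values a buffer of about 1 kByte is enough, which suggests why buffering in the local memory of the nodes makes delays in the millisecond range tolerable.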
9.4 THE OPERATION OF PCs WITHIN NETWORKS
A user of a PC can have access to network resources in different ways, depending on the type of hardware connection used and on the way the PC is operated.
9.4.1 Asynchronous line connection
The easiest way to connect a PC to another computer is the use of terminal interfaces, i.e. the RS 232C asynchronous line connection, allowing for transfer rates of 9600 or 19200 bits/s in most cases. Every PC and every mainframe offers such interfaces, ready to be plugged in without much complication. Additionally, however, some pieces of software on both sides of the connection are needed to make the hardware work. Usually one of two alternative modes of operation is possible, each requiring the appropriate software:
PC as terminal. The PC can be used as a terminal to another multi-user computer system, the host. For this purpose an emulation program is needed for the PC to make it look like a terminal to the host. In this mode of operation all application programs are executed by the CPU of the host system. Because of the 'intelligence' of the PC, different 'dumb' terminals can be simulated depending on the emulation program used: the PC may look, for example, like a simple VT100-type terminal or like a sophisticated Tektronix 1207 graphics terminal. Various emulation products are available in the marketplace, as advertised in many PC journals.
File transfer between computers. The hardware link is used for the exchange of information in the form of data files between the two connected computers. The program KERMIT is in widespread use for this purpose: after being started on the PC, the program first logs into the partner system, starts its counterpart on that system and then transfers data files as requested by the user.
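To put the line rates quoted in this section into perspective, the time needed to move a file can be estimated for an asynchronous line and for Ethernet. The Python sketch below is a rough illustration: it assumes 10 bits on the line per byte for the asynchronous case (one start bit, eight data bits, one stop bit) and ignores all protocol overhead, including that of KERMIT and of the Ethernet frame format. The function names and the file size are hypothetical.

```python
def async_transfer_time_s(n_bytes, bits_per_s, bits_per_char=10):
    """Transfer time over an asynchronous line (assumed: start + 8 data + stop bits)."""
    return n_bytes * bits_per_char / bits_per_s


def ethernet_transfer_time_s(n_bytes, bits_per_s=10_000_000):
    """Transfer time at the raw Ethernet rate, ignoring framing and contention."""
    return n_bytes * 8 / bits_per_s


file_bytes = 96_000  # assumed size of a data file
print(async_transfer_time_s(file_bytes, 9600))    # seconds at 9600 bits/s
print(async_transfer_time_s(file_bytes, 19200))   # seconds at 19200 bits/s
print(ethernet_transfer_time_s(file_bytes))       # seconds at 10 Mbits/s
```

Under these assumptions the file needs 100 s at 9600 bits/s and well under a tenth of a second at the raw Ethernet rate; real transfers are slower in both cases because of protocol overhead.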
9.4.2 Ethernet connection
If the PC is connected directly to the coaxial cable of an Ethernet LAN, the PC can be used in the same modes described above, but with the much higher transmission speed of the Ethernet. A higher level of network protocol, however, has to be applied, e.g. DECnet-DOS, Novell NetWare, or similar products. Because of the high speed of the Ethernet connection more sophisticated modes of PC operation can be implemented.
A larger computer system, or a PC that is well equipped with disk storage, can act as a server system for a number of PCs. In this case room on the disk storage of the server system is provided for virtual hard-disks of the PCs. This can be accomplished through a container file on the server system, holding the files written by the PC. Instead of using a container file, appropriate software can be used to convert the PC file format into the file format of the server before writing (and backwards when reading). By this means files stored by the PC user can be accessed from the PC as well as from the server system. In both of these modes the PC user has transparent access to other resources of the network as well, like printers and special peripherals.
Since the PC's hard disk in reality is physically part of the server's disk, it is included in the normal backup/restore procedures that are in use on the server system. If several users share the same PC, they may still have separate 'virtual hard-disks'. Data protection is further improved by requiring an individual 'key diskette' for each user and additionally a password to the server system. It is, however, possible to organize 'virtual hard-disks' for sharable access to data from a group of PC users.
By the tight integration of PCs into a network with a variety of resources most of the shortcomings of PCs described in Section 9.2 will be overcome.
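The idea of a container file that holds a 'virtual hard-disk' can be sketched in a few lines. The Python example below is a minimal illustration under stated assumptions and does not describe any of the products named above: the class name ContainerDisk, the 512-byte block size and the in-memory stand-in for the server file are all hypothetical.

```python
import io

BLOCK_SIZE = 512  # assumed block size of the virtual disk


class ContainerDisk:
    """A virtual disk stored as fixed-size blocks inside one backing file."""

    def __init__(self, backing, n_blocks):
        self.backing = backing
        self.n_blocks = n_blocks
        # Reserve the full size of the virtual disk in the container file.
        backing.seek(0)
        backing.write(b"\x00" * (n_blocks * BLOCK_SIZE))

    def _check(self, index):
        if not 0 <= index < self.n_blocks:
            raise IndexError("block index outside the virtual disk")

    def write_block(self, index, data):
        """Store up to one block of data, padded with zero bytes."""
        self._check(index)
        if len(data) > BLOCK_SIZE:
            raise ValueError("data does not fit into one block")
        self.backing.seek(index * BLOCK_SIZE)
        self.backing.write(data.ljust(BLOCK_SIZE, b"\x00"))

    def read_block(self, index):
        """Return the full block stored at the given index."""
        self._check(index)
        self.backing.seek(index * BLOCK_SIZE)
        return self.backing.read(BLOCK_SIZE)


# The BytesIO object stands in for a container file on the server's disk.
container = io.BytesIO()
disk = ContainerDisk(container, n_blocks=8)
disk.write_block(3, b"chromatogram 17")
print(disk.read_block(3).rstrip(b"\x00"))
```

Because the whole virtual disk is an ordinary file from the server's point of view, it is saved and restored together with all other server files, which is the backup advantage noted above. A real product would in addition have to carry the block requests over the network and control access per user.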
9.5 THE ETHERNET LAN OF THE MAX-PLANCK-INSTITUTES IN MUELHEIM
For the two Max-Planck-Institutes in Muelheim, W. Germany, an Ethernet LAN is in operation (see Fig. 9.2), consisting at the time of this writing of eight MicroVAX CPUs in the computing center and of about 20 computers used for real-time data acquisition and experimental control in many laboratories, all connected to a coaxial cable 700 m in length. The central systems are organized in two local clusters (LAVC systems) with main memory of 6 to 16 MBytes per CPU and a total disk capacity of 4.8 GBytes. Users have access to the central machines from more than 160 terminals via terminal servers or a terminal-port multiplexer.
Applications are assigned to dedicated processors, or are distributed over several processors. For example, chromatography data processing (ref. 1) makes use of satellite processors for data acquisition in the laboratories and of VAX systems for the evaluation, postprocessing and archiving of chromatograms. In the case of mass spectrometry (ref. 2) satellite computers are used to acquire spectra, a PDP-11 system to collect all spectra of a series of spectra, e.g. for a GC/MS combination, and one of the VAX systems for the evaluation and interpretation of spectra, for library search and archiving.
Several PCs, using KERMIT software, are connected to the network through asynchronous lines; other PCs, applying the concept of 'virtual hard-disks', and a few diskless workstations (VAXstation 2000) are connected via Thinwire-Ethernet to the main Ethernet cable. (Thinwire-Ethernet is a cheaper version of an Ethernet implementation as far as cabling and interfacing are concerned, but with the same functionality and speed.) This network approach not only provides the advantages of distributing different applications and different groups of users over several
Fig. 9.2 Computer network at the Max-Planck-Institutes in Muelheim: An Ethernet coaxial cable connects the eight MicroVAX CPUs of the computing center with computer systems in various laboratories. The PDP-11s are operated with RSX-11M software, whereas special real-time software is used with the MSDAT and SADAT satellite computers for mass spectrometry and chromatography. The NMR laboratory is connected via two ASPECT computers. A Thinwire-Ethernet branch serves for connecting PCs and VS2000 workstations. Users have access to the network through terminal server ports for 128 terminals and through 50 multiplexer ports. External networks can be reached through a VAX interface into the public DATEX-P net.
computer systems as mentioned earlier. Furthermore, software licences, which are cheaper for smaller machines, are purchased only for those systems where they are really needed. Most important, however, is the possibility of low-cost extensions of the entire system with small increments in CPU power and other resources.
9.6 THE EVOLUTION OF DISTRIBUTED SYSTEMS
In the short history of computer science two major generations could be observed in terms of computing center systems: until the late nineteen-sixties computing centers with batch operation and with no interactive dialog facilities were predominant. In the second generation these systems were replaced by multi-user systems with dialog terminals. Today, we are in the transition to a third generation, where most of the CPU processing is moved from the central site to the individual user by replacing the dumb dialog terminals with intelligent workstations. These stations are self-contained units in many respects, but need the support of central servers for certain tasks: a file-server will provide access to shared, large data bases; a print-server will organize the output to shared printers; a compute-server will provide big compute power for special applications. Other types of servers can easily be imagined, for data banks, archiving, Artificial Intelligence (AI) languages, etc. The software for this type of transparent networked system is already available; but it will take some time until the conventional hardware can be replaced and until this evolution has occurred within the brains of all users and especially of all managers of computer systems.
9.7 REFERENCES
1 E. Ziegler, G. Schomburg, The application of COLACHROM, a new command language for chromatographic data processing, J. Chromatography, 290 (1984) 339-350.
2 E. Ziegler, D. Henneberg, H. Lenk, B. Weimann, I. Wronka, A computer system for measuring fast-scan low-resolution mass spectra, Anal. Chim. Acta, 144 (1982) 1-12.
10 THE FUTURE OF PERSONAL COMPUTING IN CHEMISTRY
George C. LEVY NIH Resource, CASE Center and Department of Chemistry, Syracuse University, Center for Science and Technology, Syracuse, New York 13244-4100, USA
10.1 INTRODUCTION
In the first five years of personal computing, significant changes in computational resources, graphic capabilities, languages and operating systems all led to increased applicability of PCs to chemical problems. However, until recently, most utilization of PCs in advanced chemical computation has been isolated from applications designed for high-level workstation computers, venerable departmental VAX systems, or mainframes. This results largely from the differing basic characteristics of these computer systems and is exemplified in Table 10.1, which compares typical PC and workstation computers in the period 1987-1988. Note that by this time the workstation computer had achieved most of the characteristics of a large VAX system, with added high-speed color graphics and a multi-window, multi-processing operating system, most usually a version of UNIX.
10.2 COMPUTATIONAL ENVIRONMENT IN MID 1990s
Most recently, PCs have been gaining the capabilities of workstation computers and indeed, current workstation computers are also emulating PC-DOS to support applications developed for the smaller computers. This trend should be fully developed by 1992-1993 when high-end PCs will achieve capabilities as shown in Table 10.2.
Table 10.1 Comparison of typical PC and workstation computer, 1987-1988
PC
Workstation
CPU word length (bits) CPU speed: MIPS(1) MFLOPS(2)
8-16
32(64)
1 GByte virtual, demand paged
Actual RAM memory size