DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 6
Scientific Computing and Automation (Europe) 1990
DATA HANDLING IN SCIENCE AND TECHNOLOGY Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim
Other volumes in this series:
Volume 1 Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2 Chemometrics: A textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valkó and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
Volume 6 Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands. Edited by E.J. Karjalainen
Scientific Computing and Automation (Europe) 1990 Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands
edited by
E.J. KARJALAINEN Department of Clinical Chemistry, University of Helsinki, SF-00290 Helsinki, Finland
ELSEVIER Amsterdam - Oxford - New York - Tokyo
1990
ELSEVIER SCIENCE PUBLISHERS B.V. Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
Distributors for the United States and Canada: ELSEVIER SCIENCE PUBLISHING COMPANY INC. 655 Avenue of the Americas, New York, NY 10010, U.S.A.
Library of Congress Cataloging-in-Publication Data
Scientific Computing and Automation (Europe) Conference (1990 : Maastricht, Netherlands)
Scientific computing and automation (Europe) 1990 : proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, the Netherlands / edited by E. Karjalainen.
p. cm. -- (Data handling in science and technology ; v. 6)
Includes bibliographical references and index.
ISBN 0-444-88949-3
1. Science--Data processing--Congresses. 2. Technology--Data processing--Congresses. 3. Electronic digital computers--Scientific applications--Congresses. 4. Computer engineering--Congresses. I. Karjalainen, E. (Erkki) II. Title. III. Series.
Q183.9.S3 1990 502.85--dc20 90-22010 CIP
ISBN 0-444-88949-3
© Elsevier Science Publishers B.V., 1990
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V./ Physical Sciences & Engineering Division, P.O. Box 330, 1000 AH Amsterdam, The Netherlands.
Special regulations for readers in the USA - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the publisher.
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical (medical) standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality or value of such product or of the claims made of it by its manufacturer.
This book is printed on acid-free paper. Printed in The Netherlands
Contents
Preface
I. Scientific Visualization and Supercomputers
1. An overview of visualization techniques in computational science: State of the delivered art at the National Center for Supercomputing Applications
Hardin J, Folk M.
2. The application of supercomputers to chemometrics
Hopke PK.
3. Parallel computing of resonance Raman intensities using a transputer array
Efremov RG.
4. A user interface for 3D reconstruction of computer tomograms or magnetic resonance images
Frühauf M.
5. Automatic correspondence finding in deformed serial sections
Zhang YJ.
6. Biological applications of computer graphics and geometric modelling
Barrett AN, Summerbell D.
II. Statistics
7. Experimental optimization for quality products and processes
Deming SN.
8. Experimental design, response surface methodology and multi criteria decision making in the development of drug dosage forms
Doornbos DA, Smilde AK, de Boer JH, Duineveld CAA.
9. The role of exploratory data analysis in the development of novel antiviral compounds
Lewi PJ, Van Hoof J, Andries K.
10. Some novel statistical aspects of the design and analysis of Quantitative Structure Activity Relationship studies
Borth DM, Dekeyser MA.
11. Site-directed computer-aided drug design: Progress towards the design of novel lead compounds using “molecular” lattices
Lewis RA, Kuntz ID.
12. A chemometrics / statistics / neural networks toolbox for MATLAB
Haario H, Taavitsainen V-M, Jokinen PA.
III. Data Analysis and Chemometrics
13. Neural networks in analytical chemistry
Kateman G, Smits JRM.
14. Electrodeposited copper fractals: Fractals in chemistry
Hibbert DB.
15. The use of fractal dimension to characterize individual airborne particles
Hopke PK, Casuccio GS, Mershon WJ, Lee RJ.
16. Use of a rule-building expert system for classifying particles based on SEM analysis
Hopke PK, Mi Y.
17. Partial Least Squares (PLS) for the prediction of real-life performance from laboratory results
Lewi PJ, Vekemans B, Gypen LM.
18. Dynamic modelling of complex enzymatic reactions
Ferreira EC, Duarte JC.
19. From chemical sensors to bioelectronics: A constant search for improved selectivity, sensitivity and miniaturization
Coulet PR.
20. A Turbo Pascal program for on-line analysis of spontaneous neuronal unit activity
Gaál L, Molnár P.
IV. Laboratory Robotics
21. Automation of screening reactions in organic synthesis
Josses P, Joux B, Barrier R, Desmurs JR, Bulliot H, Ploquin Y, Metivier P.
22. A smart robotics system for the design and optimization of spectrophotometric experiments
Settle Jr FA, Blankenship J, Costello S, Sprouse M, Wick P.
23. Laboratory automation and robotics - Quo vadis?
Linder M.
24. Report of two years long activity of an automatic immunoassay section linked with a laboratory information system in a clinical laboratory
Dorizzi RM, Pradella M.
V. LIMS and Validation of Computer Systems
25. An integrated approach to the analysis and design of automated manufacturing systems
Maj SP.
26. A universal LIMS architecture
Mattes DC, McDowall RD.
27. Designing and implementing a LIMS for the use of a quality assurance laboratory within a brewery
Dickinson K, Kennedy R, Smith P.
28. Selection of LIMS for a pharmaceutical research and development laboratory - A case study
Broad LA, Maloney TA, Subak Jr EJ.
29. A new pharmacokinetic LIMS-system (KINLIMS) with special emphasis on GLP
Timm U, Hirth B.
30. Validation and certification: Commercial and regulatory aspects
Murphy M.
31. Developing a data system for the regulated laboratory
Yendle PW, Smith KP, Farrie JMT, Last BJ.
VI. Standards Activities
32. Standards in health care informatics, a European AIM
Noothoven van Goor J.
33. EUCLIDES, a European standard for clinical laboratory data exchange between independent medical information systems
Sevens C, De Moor G, Vandewalle C.
34. Conformance testing of graphics standard software
Ziegler R.
VII. Databases and Documentation
35. A system for creating collections of chemical compounds based on structures
Bohanec S, Tusar M, Tusar L, Ljubic T, Zupan J.
36. TICA: A program for the extraction of analytical chemical information from texts
Postma GJ, van der Linden B, Smits JRM, Kateman G.
37. Databases for geodetic applications
Ruland D, Ruland R.
38. Automatic documentation of graphical schematics
May M.
VIII. Tools for Spectroscopy
39. Developments in scientific data transfer
Davies AN, Hillig H, Linscheid M.
40. Hypermedia tools for structure elucidation based on spectroscopic methods
Farkas M, Cadisch M, Pretsch E.
41. Synergistic use of multi-spectral data: Missing pieces of the workstation puzzle
Wilkins CL, Baumeister ER, West CD.
42. Spectrum reconstruction in GC/MS. The robustness of the solution found with Alternating Regression (AR)
Karjalainen EJ.
Author Index
Subject Index
Preface
The second European Scientific Computing and Automation meeting, SCA 90 (Europe), was held in June 1990 in Maastricht, the Netherlands. This book contains a broad selection of the papers presented at the meeting.
Science is getting more specialized. It is divided into narrow special subjects. But there are other forces at work. The computer is bringing new unity to science. Computers are used for making measurements, interpreting the data, and filing the results. Mathematical models are coming into wider use. The computer-based tools are common to many scientific fields, so SCA tries to concentrate on the common tools that are useful in several disciplines.
Computers can produce numbers at a furious pace. Trying to see what is going on during the computing process is a frustrating experience. It is like trying to take a sip of water from a fire hose. A new discipline, scientific visualization, is evolving to help the researcher in his attempt to come to grips with the numbers. The opening talk was given by Hardin and Folk from the National Center for Supercomputing Applications (NCSA) at the University of Illinois. The main element of their presentation was the dramatic color animations produced on supercomputers and workstations. They describe the public-domain visualization tools used to produce them.
Supercomputers are useful in chemometrics. Hopke gives examples of problems that benefit from the distribution of the computations into a number of parallel processor units. The parallel computer used by Efremov is different. He installed a number of transputers in an AT-type PC to calculate Raman spectra. The AT was speeded up by a factor of 200!
Medical personnel need tools to manipulate three-dimensional images. Frühauf covers the design of intuitive user interfaces for tomography and MR images. Zhang analyzes three-dimensional structures from light microscopy of serial sections.
He shows how a series of images from deformed tissue sections can be linked together into a complete three-dimensional structure. The geometric laws of the growth process in embryonic limbs are described by Barrett and Summerbell. It is possible to describe a morphological process with a small number of parameters in a geometric model.
Statistical methods are needed in all scientific disciplines. Deming describes the role of experimental design in his article. Doornbos et al. use experimental design to optimize different dosage forms of drugs. Lewi shows how chemometric tools are used in industry for designing drugs. Borth handles statistical problems with censored data in quantitative structure-activity relationships (QSAR) and drug design. Lewis and Kuntz describe how the idea of molecular lattices is used to find novel lead compounds in drug design. Haario, Taavitsainen and Jokinen have developed a statistical toolbox for use in chemometrics. The routines are built in MATLAB, a matrix-based language for mathematical programming.
Chemometrics is a term used to cover a broad field of computer applications in chemistry. It covers statistics, expert systems and many types of mathematical modeling. The chemical applications of neural networks are handled in a tutorial by Kateman et al. Hibbert describes uses of fractals in chemistry. Hopke analyzes the surface texture of individual airborne smoke particles by fractals. Ferreira optimizes the manufacturing process that produces ampicillin. Compounds that could not be measured directly are estimated from a dynamic model. Coulet gives a broad tutorial on biosensors, where mathematical models are used to obtain more specific measurements. Gaál and Molnár describe a computer program for analyzing the electrical activity of single neurons in pharmaceutical research.
Laboratories use analyzers built for a fixed purpose. A programmable arm, a laboratory robot, can be programmed by the user for many operations in the laboratory. In principle, most tasks in the laboratory can be automated with robots. The early users have often found the programming costs high. Still, there are uses where the flexibility of robots is needed. Josses et al. show how a pharmaceutical company uses laboratory robotics to develop new methods for organic synthesis. Settle et al. describe how expert systems are linked with laboratory robotics. Linder ponders the philosophy of robotics and recommends using independent workstations with local intelligence.
Dorizzi and Pradella describe the interfacing of immunoassay pipetting stations to a small commercial LIMS system. The gain in productivity for a rather small investment was impressive.
Laboratory Information Management Systems (LIMS) were one of the main themes of the meeting. People feel that LIMS is still a problem. The interest ranges all the way from single-instrument users to problems of management. The development of computer software often requires large projects. Maj describes a formal approach to the analysis and design of automated manufacturing systems. Mattes and McDowall emphasize the role of system architecture for the long-term viability of LIMS systems. Dickinson et al. analyze a LIMS development project for a quality assurance laboratory in a brewery. A project for the choice of a commercial LIMS supplier for a multinational company is described by Broad et al. The selected LIMS is used in a pharmaceutical research laboratory. Timm and Hirth developed a custom-made LIMS system for documenting pharmacokinetic measurements. The KINLIMS system emphasizes the requirements of GLP. Murphy describes the legal and commercial aspects of validation and certification processes. Yendle et al. give a practical example of how an industrial software project developed and documented a chromatography integration package. One of the goals in product design was to facilitate independent validation by the user.
Standards are needed for building larger systems. If we do not use proper standards, the wheel has to be reinvented every time. Noothoven van Goor reports how the AIM (Advanced Informatics in Medicine) program is catalyzing the development of standards for medical informatics in Europe. Sevens, De Moor and Vandewalle give an example in the EUCLIDES project, which defines standards for the interchange of clinical laboratory data. Ziegler shows how computer graphics standards are tested for conformance to the original specification.
The costs of computer data storage are decreasing. The result is that databases are bigger. Large on-line databases for literature searches are accessible to all researchers. Still, there is a need for specialized local databases. Bohanec et al. present a suite of computer programs that handle collections of spectral and other chemical data. The information is accessed by the chemical structures and indexed by descriptors of molecular features derived from connectivity tables. Postma describes how databases can be built up automatically from chemical literature with an experimental parser program called TICA. The program is limited to interpreting abstracts about titration methods. A computer database for geodetic applications is demonstrated by Ruland. The system combines all phases of the work. Technical documents can be generated by computers. May gives examples of algorithms for producing drawings.
Users of spectroscopic instruments have many needs.
They need tools for data interchange and powerful software as part of the instrument. Davies, Hillig, and Linscheid describe their experiences with standards for exchange of spectroscopic data. Vendor formats are replaced by general formats like JCAMP-DX. The interpretation of spectral data can be supported by new hypertext-based tools described by Farkas, Cadisch and Pretsch. The HyperCard-based tools should be very useful in chemical education. Wilkins builds "hyphenated" instruments by combining FT-IR spectroscopy and mass spectrometry with gas chromatography. The combination produces more information than the single instruments. More advanced instrument software is needed to fully utilize the information produced by the combination. Karjalainen describes how overlapping spectra from hyphenated instruments are dissected into distinct components. Validation and quality assurance aspects of the spectrum decomposition process are emphasized.
Many people contributed to the success of the meeting. Members of the program board, Prof. D.L. Massart, Prof. Chr. Trendelenburg, Prof. R.E. Dessy, Dr. R. McDowall, and Prof. M.J. de Matos Barbosa, contributed their experience to the selection of topics and papers. I want to express my sincere thanks to them; without their help the congress would not have been the meeting for multiple disciplines that it was. Fiona Anderson from Nature magazine handled the exhibition, a central element in the meeting. Robi Valkhoff with her team of conference organizers from Reunion kept the congress in good order. I want to thank all contributors for their papers. Finally, thanks to Ulla Karjalainen, Ph.D., for laying out the pages using QuarkXPress on a Macintosh II computer.
September, 1990
Erkki J. Karjalainen Helsinki
Scientific Visualization and Supercomputers
Illustrations to Chapter 1 by J. Hardin and M. Folk
Color plate 5. An ab initio chemistry study of catalysis showing the shape of the electron potential fields surrounding a disassociating Niobium trimer. NCSA Image / Harrell Sellers, NCSA.
Color plate 6. An image generated during an earlier run of the ab initio chemistry study points out a problem in part of the researcher’s code. NCSA DataScope / Harrell Sellers, NCSA.
Illustrations to Chapter 4 by M. Frühauf
Photos taken from the screen, using a normal camera, were provided by the authors. The 35 mm slides were scanned into a Macintosh II computer using a BarneyScan slide scanner. Page layout was done with QuarkXPress; color separations were made using the Spectre series programs.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 1
An Overview of Visualization Techniques in Computational Science: State of the Delivered Art at the National Center for Supercomputing Applications
J. Hardin and M. Folk
National Center for Supercomputing Applications, University of Illinois, USA
Introduction
Over the last decade we have witnessed the emergence of a number of useful techniques and capabilities in the field of scientific computing. At NCSA, we have focused on the interactions between the simultaneous emergence of massive computational engines such as the Cray line of supercomputers, the development of increasingly functional workstations that sit on the user’s desktop, and the use of visual techniques to make sense of the work done on these systems. The keynote presentation given at the 1990 SCA (Europe) Conference in Maastricht was in the form of a report, or update, on this work.
The kind of full color, 3D, fully rendered, animated images that have been produced from supercomputer numerical simulations of developing tornadic clouds (see Color plate 1) represent the state of the art in visualization techniques. A 10 minute video of this required a team of visualization specialists working for weeks in collaboration with the scientists who produced the simulation program. The video hardware the images were produced on is part of an advanced digital production system that, with dedicated hardware and software, cost between 1/2 and 1 million dollars. For most sections of the animation, individual frames took 10 to 20 minutes to render. At thirty frames a second, the 10 minute video required close to 20,000 frames. The voice-over required the use of the above-mentioned video production facilities and professional support. All in all, this is a tour de force of current technologies, techniques and technicians. Obviously, not everyone has access to such resources. But increasing numbers of researchers have seen the benefits of visualizing their data and want to use such techniques in their work.
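To give a feel for the scale of the storm video, the figures quoted above can be combined in a short calculation. This is only a rough sketch: the frame rate, video length and per-frame render times come from the text, while the assumption that every frame is rendered one after another on a single system is ours.

```python
# Back-of-the-envelope rendering budget for the storm video.
# Inputs are the figures quoted in the text; serial rendering on a single
# system is an assumption made for this sketch.

FRAMES_PER_SECOND = 30
VIDEO_MINUTES = 10
RENDER_MINUTES_PER_FRAME = (10, 20)  # quoted range for most sections


def frame_count(minutes, fps):
    """Number of frames in a video of the given length."""
    return int(minutes * 60 * fps)


def render_hours(frames, minutes_per_frame):
    """Total serial rendering time in hours."""
    return frames * minutes_per_frame / 60


frames = frame_count(VIDEO_MINUTES, FRAMES_PER_SECOND)
low, high = (render_hours(frames, m) for m in RENDER_MINUTES_PER_FRAME)

print(f"frames: {frames}")
print(f"render time: {low:.0f} to {high:.0f} hours")
print(f"which is roughly {low / 24:.0f} to {high / 24:.0f} days")
```

The exact frame count is 18,000, consistent with the "close to 20,000" quoted. Under the serial assumption the rendering alone would take several thousand hours, which helps explain why such a production is out of reach for most researchers.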
The need for such capabilities comes from the mountains of data produced by numerical simulations on high performance computing systems, and from equally endless sets of data generated by satellites, radio telescopes, or any observational data collecting device. As the access to high performance systems increases, through government programs like the NSF Centers in the U.S., or just through the rapidly decreasing cost of these systems, so does the need for visualization techniques. Recognizing that such needs of researchers have increased dramatically in the past few years, but that most researchers work with limited resources, the question we want to pursue is: what can be done currently, by naive users, on their desktops, specifically with NCSA-developed software that we can give to users now? In one sense we are asking: how many of the techniques seen in the storm simulation can be done interactively on existing installed hardware systems, and by what proportion of the user population? And what changes when we start talking about interactive analysis of data versus the presentation of data that we have in the storm video?
(See the separate Color plates.)
Desktop visualization
In answering these questions, we moved from the video to programs running on two computers that represent low and middle range capabilities for today’s users: the Macintosh II and the SGI Personal Iris. The first demonstration was of NCSA DataScope, a data analysis package developed for the Macintosh II line, part of a set of such tools developed at NCSA by the Software Tools Group (STG). The choice of the Macintosh emphasizes the importance of the user interface in providing a wide base of users access to methods of visual data analysis. It is an axiom among the developers of these tools that scientists do not like to program. Chemists like to do chemistry; physicists like to figure out physics; structural engineers like to solve engineering problems. In all these things, computer simulations of chemical processes and physical systems allow researchers to test their ideas, uncover new knowledge, and design and test new structures. The data from these simulations, or from experiments or remote sensing devices, may be complex enough, or simply large enough, that visualization techniques will be useful in its analysis. And those techniques themselves may be complex and sophisticated. But the focus of the scientist remains on the science, not the methods used to generate scientific knowledge. The user needs to be able to easily mobilize sophisticated visualization techniques and apply them to the data at hand, without becoming skilled in the means and methods of computer graphics. A good user interface makes this possible.
With DataScope (Color plate 2) the user can start with the raw data in 2 dimensional array form, displayed on the screen as a spreadsheet. By choosing a menu option the user can display the data as a 2D color raster image, and apply a more revealing palette of colors. The user is now able to see the entire data set, perhaps half a megabyte of data, at a glance.
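The step from a 2D array of numbers to a color raster image can be sketched in a few lines. The following is a minimal illustration of the general idea, not DataScope's actual code: the function names, the linear scaling and the simple blue-to-red palette are all assumptions made for the example.

```python
# Illustrative only: maps a 2D array of floats to 8-bit palette indices and
# then to RGB triples. Function names and the palette are hypothetical.

def make_palette(size=256):
    """Build a blue-to-red palette as a list of (r, g, b) tuples."""
    return [(i * 255 // (size - 1), 0, 255 - i * 255 // (size - 1))
            for i in range(size)]


def to_indices(data, vmin=None, vmax=None, levels=256):
    """Scale each value linearly to a palette index in [0, levels - 1].

    Values outside [vmin, vmax] are clamped to the ends of the palette.
    """
    flat = [v for row in data for v in row]
    vmin = min(flat) if vmin is None else vmin
    vmax = max(flat) if vmax is None else vmax
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant data
    return [[min(levels - 1, max(0, round((v - vmin) / span * (levels - 1))))
             for v in row]
            for row in data]


def to_rgb(indices, palette):
    """Look up each palette index to obtain an RGB raster."""
    return [[palette[i] for i in row] for row in indices]


data = [[0.0, 0.5, 1.0],
        [1.5, 2.0, 2.5]]
palette = make_palette()
indices = to_indices(data)
image = to_rgb(indices, palette)
```

Passing a narrower `vmin`/`vmax` window than the full data range spends the whole palette on a smaller interval of values, which is one simple way to bring out detail in a region of interest.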
Shockwaves in a computational fluid dynamics simulation can be easily identified by shape and changes in color across the shock boundaries (Color plate 3). The pulse of current in a MOSFET device, and the depth it reaches in the substrate, is apparent in the results of an electronic device simulation (Color plate 4). The shape of the electron potentials surrounding dissociating atoms in an ab initio chemistry study of catalysis is clear to see (Color plate 5). This is the basic data-revealing step that is common to all visualization techniques: the mass of numbers has been transformed into a recognizable image or visual pattern.
The next step is to allow the user to further investigate both the image and the numbers. By pointing and clicking with the mouse, regions of interest in the image can be chosen and the corresponding numbers in the dataset are highlighted. The two views on the data, the spreadsheet and the image, are now synchronized. It is then possible to see the actual floating point values associated with particular regions or points in the image. The researcher can also do transformations of the data set by typing an equation into a notebook window. In this way a discrete derivative can be obtained, or a highpass filter run across the data set. Then the new data set can be immediately imaged and compared to the original. Other operations are possible, all of which allow the user to interrogate the data by moving between the images and the raw numbers easily.
The next tool that was demonstrated was NCSA Image. This tool was developed for investigating data images through various interactive manipulations of the color palette, and to view sequences of images that may, for instance, represent a series of time steps in a simulation. By dynamically changing the palette, or color table, associated with an image, a researcher can look for unexpected contrasts or changes in values that had not been foreseen. The colors associated with a particular region of the data set that may represent the edge of a vortex, or a boundary in the simulation, can be zoomed in on, and the color contrast enhanced for a finer view of detail in that area.
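The two transformations mentioned above, a discrete derivative and a highpass filter, can be written compactly for a 2D data set. The sketch below is a generic illustration under our own assumptions (a forward difference along rows, and a highpass defined as the value minus a local mean); it does not reproduce the notebook syntax of the NCSA tools.

```python
# Generic sketches of two array transformations; the definitions chosen
# here (forward difference, local-mean subtraction) are assumptions.

def discrete_derivative(data):
    """Forward difference along each row: d[i][j] = x[i][j+1] - x[i][j]."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)]
            for row in data]


def highpass(data, radius=1):
    """Subtract from each value the mean of its neighborhood.

    The neighborhood is a (2*radius + 1) square window, clipped at the
    edges of the array.
    """
    rows, cols = len(data), len(data[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [data[a][b]
                      for a in range(max(0, i - radius), min(rows, i + radius + 1))
                      for b in range(max(0, j - radius), min(cols, j + radius + 1))]
            out_row.append(data[i][j] - sum(window) / len(window))
        out.append(out_row)
    return out


# A step edge, loosely standing in for a sharp front in the data.
step = [[0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0]]
edges = discrete_derivative(step)

# A constant field has no high-frequency content.
flat_field = [[2.0] * 4 for _ in range(3)]
residual = highpass(flat_field)
```

The derivative is nonzero only where the step occurs, and the highpass of a constant field is zero everywhere, which is the behavior one would expect before imaging the transformed data next to the original.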
An animation reveals the dynamic relations of variables and the development of a physical system over time. Combinations of such techniques allow the researcher to develop as many ways of interrogating the data as possible. So far we have only discussed the data analysis aspect of visualization techniques. But equally important for researchers developing and testing their numerical simulations is the ability of images like the ones described above to be used as debugging took Without such images, researchers are often reduced to sampling the huge data sets their codes generate, and are not able to see when boundary conditions have not been properly established, or when unexpected oscillations have been generated by their models. A quick look at an image of the systems output can alert the researcher to such problems, and, on occasion, also point in the direction of a solution. An example of this was the interesting image generated by an ab initio chemistry program under developmentby a researcher at NCSA. Color plate 6, above, shows a stage in the dissociation of a Niobium himer. The atom (ion?) is moving off to the right, and leaving the dimer behind. Color plate 6 shows an image of an earlier run done during the development of that same code. By looking at this the developer not only was able to easily see that the program had gone wrong, but was also able to narrow down the search for where the problem lay. Instead of, as the
researcher put it “having to slog my way through the entire wave function from beginning to end,” he was able to go directly to the section of the code responsible for the problem.
Not shown, but also part of the NCSA tool suite for the Macintosh, were NCSA PalEdit, which allows users to interactively manipulate and construct color palettes, and NCSA Layout, which is used to annotate images and compose them for slides that can be used to communicate a researcher’s work and findings to colleagues (Color plate 7). This last tool adds presentation capabilities to the data analysis and code debugging uses of such visualization techniques.
Standards for data formats
This is a good place to point out that a key feature of any scientific visualization system is interoperability among the tools and the scientists’ data-producing software. Whether the data is from simulations or from instrumental observations, it should be easy for scientists both to get it into a form that the tools can work with, and to get it from where it is generated to the platforms where the tools reside. Furthermore, to the extent that the tools can operate on similar kinds of data, the transfer of data from one tool to another should, from the user’s perspective, be trivial. For this it is necessary to provide standard data models that all of the tools understand, a file format that accommodates these models, is extensible to future models and is transportable across all platforms, and simple user interfaces that enable scientists with very little effort to store and retrieve their data. To satisfy these requirements NCSA has developed a format called HDF.
HDF is a self-describing format, allowing an application to interpret the structure and contents of a file without any outside information. Each data object in a file is tagged with an identifier that tells what it is, how big it is, and where it can be found. A program designed to interpret certain tag types can scan a file containing those types and process the corresponding data. In addition to the primitive tag types, a grouping mechanism makes it possible to describe, within the file, commonality among objects. User interfaces and utilities exist that make it easy for scientists’ programs, as well as the tools, to read and write HDF files. Currently there are sets of routines for reading and writing 8-bit and 24-bit raster images, multi-dimensional gridded floating point data, polygonal data, annotations, and general record-structured data.
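The idea of a self-describing file, in which each object carries a tag, a size and a location, can be sketched in a few lines of Python. The byte layout, the tag numbers and the function names below are invented for illustration and are not the actual HDF layout or the HDF library interface.

```python
import struct

# Hypothetical layout (NOT real HDF): an object count, then one descriptor
# per object (tag, offset, length), then the payloads themselves.
COUNT = struct.Struct(">I")
DESC = struct.Struct(">HII")


def write_container(objects):
    """objects: list of (tag, payload_bytes). Returns the file contents."""
    offset = COUNT.size + DESC.size * len(objects)
    descriptors, body = [], []
    for tag, payload in objects:
        descriptors.append(DESC.pack(tag, offset, len(payload)))
        body.append(payload)
        offset += len(payload)
    return COUNT.pack(len(objects)) + b"".join(descriptors) + b"".join(body)


def read_objects(blob, wanted_tags):
    """Scan the descriptors and return only the tag types this reader knows."""
    (count,) = COUNT.unpack_from(blob, 0)
    found = []
    for i in range(count):
        tag, offset, length = DESC.unpack_from(blob, COUNT.size + i * DESC.size)
        if tag in wanted_tags:
            found.append((tag, blob[offset:offset + length]))
    return found


TAG_NOTE, TAG_FLOATS, TAG_PALETTE = 1, 2, 3      # made-up tag numbers
blob = write_container([
    (TAG_NOTE, b"run 17"),
    (TAG_FLOATS, struct.pack(">3d", 1.5, 2.5, 3.5)),
    (TAG_PALETTE, bytes(range(6))),
])
notes_and_data = read_objects(blob, {TAG_NOTE, TAG_FLOATS})
values = struct.unpack(">3d", notes_and_data[1][1])
```

A reader that only understands notes and floating point data skips the palette object without needing to know what it contains, which is the property that lets dissimilar objects share one file.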
For example, by placing a few HDF calls in their program a user can have their numerical simulation, running on a supercomputer, generate HDF files of the output data. These files can then be moved, say by the common Unix method of file transfer (ftp), to a Macintosh where NCSA DataScope is running. The file can then be loaded into DataScope without any changes having to be made by the user. This exemplifies the desired transportability of files between machines and operating systems mentioned above. In addition, since each HDF file can contain a number of dissimilar objects, researchers can use HDF files to help organize their work. Upon completing the work
session described above in NCSA DataScope, the user could save the original floating point data set, the interpolated raster image of that data generated by DataScope, any analytic functions or notes typed during that session into the notebook, and the palette that the user had found most useful to view the data image with, all in one HDF file. Upon returning, a day or a week later, to the same problem, clicking on that data set launches NCSA DataScope and loads the data, image, palette, and notes.
3D visualization
Moving off of the Macintosh platform requires a user interface that retains as much of the Mac’s ease of use as possible. The emerging X-windows environment from MIT, while not providing a complete graphic user interface, does provide a standard form of windows on a variety of platforms. The NCSA STG has developed much of the functionality described above in the X-windows environment, and in the analysis of what is commonly referred to as regularly gridded data, has moved into 3D functionality. NCSA XImage and NCSA XDataSlice are tools that allow users on platforms that support X-windows to use visualization techniques in the analysis of their data. A number of techniques are used to look at data cubes, including tiling multiple windows, animating a 2D image along the 3rd dimension (which can be done with NCSA Image also, see above), isosurface rendering, and the use of the ‘slicer dicer’ feature (Color plate 8). This last method allows the user to cut into a 3 dimensional data set and see arbitrary planes represented as pseudocolored surfaces in a 3D representation. This method combines flexibility in choice of data area to view, and rapid response to user choices, increasing the interactivity of the analysis.
The kind of data that is often the output of finite element or finite difference analyses, what is commonly referred to as polygonal or mesh data or points in space, presents heavier demands on the computational resources available on the desktop. (None of the tools discussed above allow the presentation or analysis of such data unless it is first translated into a grid form.) The need to rapidly interpret a large set of points, connectivity information and associated scalar values into a relatively complex 3D image requires that much of this be done in hardware if interactive speeds are to be achieved. The cost of such capabilities is rapidly decreasing, making tools that were considered esoteric 3 years ago standard items today.
With this rapidly decreasing cost and the resulting increasing availability of systems capable of handling the demands of interactive 3D data visualization and analysis, the NCSA STG has developed a tool called PolyView (for Polygon Viewer) which runs on the SGI Personal Iris. The Personal Iris has a graphics pipeline in hardware and runs Unix, one of the emerging standards that NCSA has adopted. While the Personal Iris does not currently fully support a portable standardized graphic user interface, it is moving toward a complete X-windows environment in the near future. PolyView currently uses
the native windowing system to construct the all-important user interface. This allows naive users to load 3D data sets and display them as points, lines, or filled polygons (Color plate 9). The user can adjust shading, rotate the object, zoom in and out, and perform palette manipulations of associated scalars, for instance to investigate the heating of the base plate heat sink soldered to an induction coil (see Color plates 10-11). In this example, the heat generated by various power levels applied to the coil is simulated and the heat levels are mapped to a palette, just as was done in the 2D examples above. The heat level is then easy to discriminate at various points on the 3D coil and plate. Interactive manipulation of the palette allows the user to focus on a particular range of values. The user can then select a point or region that appears interesting and ask that the actual values of the variable be displayed in floating point form.
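Mapping scalar values to a palette, and narrowing the mapped range to focus on values of interest, can be sketched as follows. This is a generic linear mapping written for illustration; the function name, the 256-entry palette and the temperature values are assumptions, and the sketch does not reproduce how PolyView or NCSA Image implement palette manipulation.

```python
def to_palette_index(value, lo, hi, n_colors=256):
    """Map a scalar to a palette index; values outside [lo, hi] are clamped."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    return min(n_colors - 1, int(t * n_colors))


temps = [20.0, 55.0, 60.0, 65.0, 120.0]     # made-up heat levels
# Whole data range spread over the palette: the middle values look alike.
full_range = [to_palette_index(t, 20.0, 120.0) for t in temps]
# Palette focused on 55..65: the same three values now span the whole palette.
focused = [to_palette_index(t, 55.0, 65.0) for t in temps]
```

Narrowing the range trades away detail outside the chosen interval (everything below 55 or above 65 collapses to the end colors) in exchange for contrast inside it.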
Conclusion
This then was the brief update of the state of the delivered art at NCSA. Over the last year, indeed, since the talk at the beginning of the summer, work on bringing advanced visualization techniques closer to more users has progressed rapidly. Work at NCSA has pushed the limits of what can be done by users versus specialists. For instance, video production systems have decreased rapidly in price and more and more of the functionality once reserved to the large team efforts exemplified by the storm video is available in systems quickly approaching the desktop. These developments, and similar advances in user interface design and implementation in the area of scientific visualization, mean continuing rapid improvements in the functionality available to a growing base of researchers.
Acknowledgements
The authors wish to thank all those who helped make this rather complicated demonstration and talk possible: Erkki Karjalainen, the conference chair, for the invitation and his enthusiastic assistance during the conference; Robi Valkhoff and Victorine Bos of CAOS and Keith Foley of Elsevier for their patience and gracious assistance setting up the presentation and solving problems, some of the authors’ making, that cropped up; and the staff of the MECC in Maastricht for their timely and professional technical assistance.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 2
The Application of Supercomputers to Chemometrics
P.K. Hopke
Department of Chemistry, Clarkson University, Potsdam, NY 13699-5810, USA
Abstract
The improvements in speed and memory of computers have made applications of chemometric methods routine, with many procedures easily accomplished using microcomputers. What problems are sufficiently large and complex that the largest and most powerful computers are needed? One major class of problems is those methods based on sampling strategies. These methods include robust regression, bootstrapping, jackknifing, and cluster analysis of very large data sets. Parallel supercomputing systems are particularly well suited to this type of problem since each processor can be used to independently analyze a sample so that multiple samples can be examined simultaneously. Examples of these problems will be presented along with measures of the improvement in computer throughput resulting from the parallelization of the problem.
1. Introduction
During the last ten years, there have been revolutionary changes in the availability of computing resources for solving scientific problems. Computers range from quite powerful microcomputers now commonplace in the analytical laboratory up to “supercomputers” that are able to solve large numerical problems impossible to consider before their advent. A computer is not “super” simply because of its high clock speed. Supercomputers achieve their computational throughput through a combination of short cycle time and a computing architecture that permits the simultaneous accomplishment of multiple tasks. There are several types of architectures and these differences lead to differences in the kinds of problems that each may solve in an optimal manner. In the past, computers with multiple processor units have generally been used to provide simultaneous services to a number of users. In the supercomputers, multiple processors are assigned to a single job. Also in different machines, the processors are used in very different configurations.
Computers can be classified in terms of the multiplicity of instruction streams and the multiplicity of data streams [Flynn, 1966]. Computers can have a single instruction stream (SI) or a multiple instruction stream (MI) and a single data stream (SD) or a multiple data stream (MD). A traditional computer uses a single stream of instructions operating on a single stream of data and is therefore an SISD system. The MISD class is in essence the same as an SISD machine and is not considered to be a separate class [Schendel, 1984].
The SIMD processors can be divided into several types including array processors, pipeline processors, and associative processors. Although there are differences in the details of the process between these types of systems, they all work by simultaneously performing the same operation on many data elements. Thus, if the serial algorithm can be converted so that vector operations are used, then a number of identical calculations equal to the length of the vector can be made for a single set of machine instructions. Many of the supercomputers such as the CRAY and CDC CYBER 200 systems employ this vectorization to obtain their increase in computational power. It is necessary to rewrite the application program to take advantage of the vector capabilities. The nature of these programming changes will be different for each machine. Thus, although there has been a major effort to improve the standardization of programming languages, machines with unique capabilities require special programming considerations.
Finally, there are MIMD systems where independent processors can be working on different data streams. Examples of this type of system are the loosely-coupled array processor systems (LCAP) at IBM [Clementi, 1988]. In these systems, the host processor controls and coordinates the global processing with a “master or host program” while the heavy calculations are carried out on the independent array processors running “slave programs”.
The slave processors are not directly connected to one another and the slave programs are not coordinated with each other; the host program is responsible for control over the slave programs and therefore for their coordination and synchronization. Other multiple processor systems like the Alliant FX/8 use multiple processors with a shared memory. In the Alliant system, there are two types of parallel processors, interactive processors and computational elements. The interactive processors execute the interactive user jobs and run the operating system, thus providing the link to multiple users of the system. The eight computational elements can work in parallel on a single application with the system performing the synchronization and scheduling of the elements. These computational elements are such that within the parallel structure, vectorized operations can be performed. Thus, depending on the problem and program structure, this system can be working on single or multiple data. The FORTRAN compiler automatically performs the optimization of the parallel and vector capabilities. This compiler control of optimization means fewer coordination problems for the programmer. However, it makes it more difficult to know exactly how the code is being optimized.
Figure 1. Schematic outline of a parallel system consisting of a master host computer and a series of slave array processors.
2. Applications for parallel systems
The objective of this paper is to examine the utility of a parallel computer system for the implementation of statistical algorithms that are based on processing a large number of samples. Some aspects of parallel computing are illustrated by the implementation of algorithms for cluster analysis and robust regression. Similar applications to cross-validation and the bootstrap are also possible.
The system which will be considered here is an MIMD system that consisted of a central host processor (master) connected by channels to 10 array processors (slaves). This system is termed a loosely-coupled array processor (LCAP) system and was implemented at the IBM Research Center at Kingston, NY [Clementi, 1988]. The particular system was designated as LCAP-1. The host processor was an IBM 3081 and the slaves were Floating Point Systems FPS-164 array processors (AP’s), linked as in Figure 1. This architecture is “coarse-grained”, in contrast with “fine-grained” systems that contain many but less powerful processors.
The way to use the LCAP system is to let the host processor control and coordinate the global processing with a “master” or “host program” while the heavy calculations are carried out on the AP’s running “slave programs”. The slave processors are not directly connected, and the slave programs are not coordinated with each other: the host program is responsible for control over the slave programs and therefore for their coordination and synchronization. Software running on the LCAP system must consist of a host FORTRAN program that calls subroutines (also in FORTRAN) to run on the slaves. There may be several such subroutines, that may also make use of their own subroutines. All these routines are
grouped into a single slave program, that is run on some or all of the slaves. The way by which duplication of the slave processing is avoided is either to have different data used on the different slaves or to execute different subroutines (or both). Parallel execution is controlled by a specific set of instructions added to the master program as well as to the slave routines. A precompiler is then used (on the host computer) to generate the master and slave program source code. Subsequently, the master program and AP routines are compiled, linked and loaded for the run. For a detailed description of the LCAP system, see Clementi and Logan [1985]. A guide to the precompiler is given by Chin et al. [1985].
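The host/slave division of labour described above can be sketched in Python. This is only an analogy: threads from the standard library stand in for the array processors, the function names are hypothetical, and nothing here reproduces the LCAP precompiler directives or its FORTRAN interface.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_routine(chunk):
    """Work done independently on one slave: here, a partial sum of squares."""
    return sum(x * x for x in chunk)


def master(data, n_slaves=10):
    """Host program: give each slave different data, then combine the results."""
    chunks = [data[i::n_slaves] for i in range(n_slaves)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        # The slaves never talk to each other; only the host sees all results.
        partials = list(pool.map(slave_routine, chunks))
    return sum(partials)


total = master(list(range(1, 101)))
```

Every slave runs the same routine, and duplication of work is avoided by handing each one a different slice of the data, which is the first of the two options mentioned in the text.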
3. A sequential algorithm for clustering large data sets
Cluster analysis is the name given to a large collection of techniques to find groups in data. Usually the data consist of a set of n objects, each of which is characterized by p measurement values, and one is looking for a “natural grouping” of the objects into clusters. A clustering method tries to form groups in such a way that objects belonging to the same group are similar to each other, while objects belonging to different groups are rather dissimilar. The most popular approaches to clustering are the hierarchical methods, that yield a tree-like structure, and the partitioning methods that we will focus on here. In the latter approach, one wants to obtain a single partition of the n objects into k clusters. Usually k is given by the user, although some methods also try to select this number by means of some criterion.
One of the ways to construct a partition is to determine a collection of k central points in space (called “centrotypes”) and to build the clusters around them. The MASLOC method [see Massart et al., 1983] is somewhat different, because it searches for a subset of k objects that are representative of the k clusters. Next, the clusters are constructed by assigning each object of the data set to the nearest representative object. The sum of the distances of all objects to their representative objects measures the quality of the clustering. Indeed, a small value of this sum indicates that most objects are close to the representative object of their cluster. This observation forms the basis of the method. The k representative objects are chosen in such a way that the sum of distances from all the objects of the data set to the nearest of these is as small as possible. It should be noted that, within each cluster, the representative object minimizes the total distance to the cluster’s members. Such an object is called a medoid, and the clustering technique is referred to as the k-medoid method.
Implementing the k-medoid method poses two computational problems. The first is that it requires a considerable amount of storage capacity, and the second that finding an optimal solution involves a large number of calculations, even for relatively small n. In practice it is possible to run an exact algorithm for data sets of up to about 50 objects. When using a heuristic algorithm that yields an approximate solution, one can deal with about 300 or 400 objects.
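The k-medoid objective, and the reason an exact solution is only practical for small n, can be made concrete with a brute-force sketch. Exhaustive search over all subsets of k objects is used here purely as a stand-in; it is not the exact algorithm used by MASLOC, and the Euclidean distance and the example points are assumptions.

```python
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_distance(objects, medoids):
    """Sum over all objects of the distance to the nearest representative object."""
    return sum(min(distance(o, objects[m]) for m in medoids) for o in objects)


def exact_k_medoids(objects, k):
    """Try every subset of k objects; the number of subsets explodes with n."""
    return min(combinations(range(len(objects)), k),
               key=lambda medoids: total_distance(objects, medoids))


points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0),
          (10.0, 0.0), (10.0, 1.0), (10.0, 2.0)]
best = exact_k_medoids(points, 2)        # indices of the two medoids
best_value = total_distance(points, best)
```

With six objects there are only 15 candidate pairs, but the count of candidate subsets grows combinatorially with n, which is why sampling strategies become attractive for large data sets.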
An extension for solving much larger problems has been developed by Kaufman and Rousseeuw [1986]. The corresponding program is called CLARA (from Clustering LARge Applications). It extracts a random sample of, say, m objects from the data set, clusters it using a heuristic k-medoid algorithm, and then assigns all the remaining objects of the data set to one of the found medoids. The set of k medoids that have just been found is then complemented with m - k randomly selected objects to form a new sample, that is then clustered in the same way. Then all of the objects are assigned to the new medoids. The value of this new assignment (sum of the distances from each object to its medoid) is calculated and compared with the previous value. The medoids corresponding to the smallest value are kept, and used as the basis for the next sample. This process can be repeated for a given number of samples, or until no improvement is found for several iterations. The CLARA program was written in a very portable subset of FORTRAN and implemented on several systems, using variable sample sizes and numbers of samples.
The computational advantages of CLARA are considerable. First of all, the k-medoid method is applied to much smaller sets of objects (typically the entire data set might consist of several hundreds of thousands of objects, while a sample would only contain m). The number of calculations being of the order of a quadratic function of the number of objects clustered, this considerably reduces the computation time. The actual reduction depends, of course, on the number of samples that are considered. Another advantage concerns the storage requirements. As seen above, the k-median method is based upon the distances between all objects that must be clustered. The total number of distances is also a quadratic function: for a set of 1,000 objects there are 499,500 such distances, which occupy a sizeable part of central memory.
Of course it is possible to store the distances on an external device or to calculate them each time they are needed, but this would greatly increase the computation time. In the CLARA method the samples to be clustered contain few objects, and therefore few distances must be stored. It is true that during the assignment of the entire data set, the distance of each object to each of the k medoids must be calculated. However, only the sum of the minimal distances must be stored, and not the individual distances. After the last iteration, the assignment of all objects to the final set of medoids is carried out once again, in order to obtain the resulting partition of the entire data set.
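The sampling loop described above can be sketched as follows. This is a minimal illustration and not the CLARA program itself: the heuristic k-medoid step is replaced by an exhaustive search inside each small sample, the data are one-dimensional, and all function names and parameter values are assumptions.

```python
import random
from itertools import combinations


def n_distances(n):
    """Number of pairwise distances among n objects: n(n - 1)/2."""
    return n * (n - 1) // 2


def dist(a, b):
    return abs(a - b)            # one-dimensional data keeps the sketch short


def cost(data, medoids):
    """Sum of distances from every object to its nearest medoid (indices)."""
    return sum(min(dist(x, data[m]) for m in medoids) for x in data)


def cluster_sample(data, sample, k):
    """Stand-in for the heuristic k-medoid step: exhaustive search in the sample."""
    return min(combinations(sample, k),
               key=lambda ms: sum(min(dist(data[i], data[m]) for m in ms)
                                  for i in sample))


def clara(data, k, sample_size, n_samples, seed=0):
    rng = random.Random(seed)
    indices = range(len(data))
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        if best_medoids is None:
            sample = rng.sample(indices, sample_size)
        else:
            # Keep the best medoids and complement them with m - k random objects.
            others = [i for i in indices if i not in best_medoids]
            sample = list(best_medoids) + rng.sample(others, sample_size - k)
        medoids = cluster_sample(data, sample, k)
        value = cost(data, medoids)      # assignment of the entire data set
        if value < best_cost:
            best_medoids, best_cost = medoids, value
    return best_medoids, best_cost


data = [float(v) for v in range(20)] + [float(v) for v in range(100, 120)]
medoids, value = clara(data, k=2, sample_size=8, n_samples=5, seed=1)
```

Only the distances within each sample of 8 objects are ever compared pairwise (28 distances per sample), whereas the full set of 1,000 objects mentioned in the text would need `n_distances(1000)` of them.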
4. Application of parallel processing to the clustering problem
The method just described lends itself quite well to parallel processing, and in particular to the LCAP computer system. Each processor can be independently running the k-medoid method on a particular sample. Of course, this requires the k-medoid code to be available on each of the slave processors. Two strategies can be employed to take advantage of the additional throughput that becomes available through the parallelization. In the first strategy, each slave processor
receives a sample and the host program waits until all samples have been analyzed. After clustering its sample, the slave processor also assigns each object of the entire data set to the closest of the found medoids. The sum of the distances of the objects to the chosen medoids is also calculated. The medoids of the best sample obtained to date are then included in the next batch of samples. In this strategy a large number of samples can be analyzed, but there are often idle processors, waiting for the last sample to be completed. This is because the computation time of the k-medoid algorithm depends strongly on the sample it works on, and is therefore quite variable.
In the second strategy, the host waits for any sample to be finished, compares the objective value with the best one obtained so far, updates the current set of medoids (if required), and uses the best medoids for the next sample to be run on the currently idle slave processor. Thus there is only a very short waiting time, from the moment a slave processor completes a sample until the next sample is initiated. However, some slave processors may then still be working on samples now known not to include the currently best medoids.
Both of these strategies require very little communication between host and slave processors. At the beginning of a run, data on all objects, as well as the necessary code, are sent to each of the slaves. This software consists of the sample selection, the distance calculation, the k-medoid method, and the assignment routines. Once a sample is clustered, only the case numbers of the medoids (which amount to a few integers) and the total value of the clustering are sent back to the host. The only processing carried out by the host at this point is comparing the total value with the currently best one, and if it is less, replacing the best set of medoids by the new set. The host sends back the (possibly modified) best set of medoids, allowing the slave to generate a new sample.
Finally, at the end (when no more samples must be drawn) the final clustering of the entire data set is determined. Initial results were obtained using a created data set with a known structure. The data set includes three well-defined clusters and several types of outliers. It has been found that for these data strategy 2 is more efficient than strategy 1. Increasing the number of objects appears to improve the performance of strategy 1, and does not seem to slow down strategy 2.
In both strategies 1 and 2, the host processor has very little work to do. An alternative third strategy is to let it do the assignment job for each set of medoids coming from a sample. In this way, part of the code (the assignment routine) is kept in the host processor. Unfortunately, this also increases the probability that the host is busy at the instant a slave returns its set of medoids, forcing it to wait until it can obtain a new set. A possible way around this problem is to have that slave start a new sample using the previous best set of medoids. In the latter strategy the coordination between host and slaves is more complex, and it is in fact better suited for a system with a section of shared memory, in which the best set of medoids found so far can be stored. In such a situation,
measures must be taken to avoid retrieving medoids from the shared memory by one of the slaves while another is depositing its results. In general, it appears that the selection of a strategy must take two factors into account:
a. The amount of communication between host and slaves should be adapted to the system. For example in the LCAP system, in which communication is a limiting factor, it should be reduced to a minimum.
b. The workload given to the slaves should be balanced as well as possible, avoiding idle slaves or duplication of work.
Naturally, other considerations may also be in order, for instance having to do with restrictions on the storage capabilities of the host and the slaves. The results of these studies of parallelization of this cluster algorithm have also permitted the more intensive study of the CLARA algorithm. In these studies [Hopke and Kaufman, 1990], it was found that a strategy of fewer, larger samples provided partitions that were closer to the optimum solution obtained by solving the complete problem.
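The difference in idle time between the first two strategies can be illustrated with a small scheduling simulation. It models only the waiting behaviour, not the clustering itself or the exchange of medoids; the sample durations and function names are made up for illustration.

```python
import heapq


def batch_makespan(durations, n_slaves):
    """Strategy 1: the host waits for a whole batch before starting the next."""
    total = 0.0
    for start in range(0, len(durations), n_slaves):
        total += max(durations[start:start + n_slaves])
    return total


def greedy_makespan(durations, n_slaves):
    """Strategy 2: each slave receives a new sample as soon as it becomes idle."""
    free_at = [0.0] * n_slaves           # time at which each slave is next idle
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


# Made-up computation times: most samples are quick, two are slow.
durations = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 9.0, 1.0]
s1 = batch_makespan(durations, 4)
s2 = greedy_makespan(durations, 4)
```

In the batch schedule every slow sample stalls three other slaves until it finishes, while in the second schedule the quick samples keep flowing to whichever slave is free, which matches the qualitative observation that strategy 2 was the more efficient one for the test data.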
5. A parallel algorithm for robust regression
In reviewing other statistical techniques suitable for parallelization, algorithms that also proceed by repeated sampling were considered. One of these is for robust regression analysis. In the classical linear regression model

$$y_i = x_{i1}\theta_1 + x_{i2}\theta_2 + \dots + x_{ip}\theta_p + e_i \qquad (i = 1, 2, \dots, n)$$

the errors $e_i$ are assumed to be independent and normally distributed with zero mean and constant variance. The $x_1, \dots, x_p$ are called explanatory variables, and $y$ is the response variable. The aim of regression analysis is to estimate the unknown regression coefficients $\theta_1, \theta_2, \dots, \theta_p$ from a sample of n data points $(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)$. The conventional method is least squares, defined by

$$\min_{(\hat\theta_1, \dots, \hat\theta_p)} \sum_{i=1}^{n} r_i^2$$

where the residuals $r_i$ are given by

$$r_i = y_i - x_{i1}\hat\theta_1 - \dots - x_{ip}\hat\theta_p$$

The least squares technique has been very popular throughout because the solution can be obtained explicitly by means of some matrix algebra, making it the only feasible method in the pre-computer age (note that least squares was invented around 1800). Moreover,
the least squares estimator is the most efficient when the errors $e_i$ are indeed normally distributed. However, real data often contain one or more outliers (possibly due to recording or transcription mistakes, misplaced decimal points, exceptional observations caused by earthquakes or strikes, or members of a different population), which may exert a strong influence on the least squares estimates, often making them completely unreliable. Such outliers may be very hard to detect, especially when the explanatory variables are outlying, because such “leverage points” do not necessarily show up in the least squares residuals. Therefore, it is useful to have a robust estimate that can withstand the effect of such outliers. The least median of squares method (LMS) is defined by

$$\min_{(\hat\theta_1, \dots, \hat\theta_p)} \; \operatorname{median}_{i=1,\dots,n} \; r_i^2$$

[Rousseeuw 1984]. It has a high breakdown point, because it can cope with up to 50% of outliers. By this we mean that the estimator remains trustworthy as long as the “good” data are in the majority. (It is clear that the fraction of outliers may not exceed 50%, because then it would become impossible to distinguish between the “good” and the “bad” points.)
To calculate the LMS estimates, Rousseeuw and Leroy [1987] use the program PROGRESS (the latter name stands for Program for Robust reGRESSion). The algorithm can be outlined as follows: select at random p observations out of the n and solve the corresponding system of p linear equations with p variables:

$$x_{i1}\theta_1 + \dots + x_{ip}\theta_p = y_i \qquad (i \text{ running over the } p \text{ selected observations})$$

The solution gives a trial estimate of the coefficients, denoted by $(\hat\theta_1, \dots, \hat\theta_p)$. Then calculate the objective value

$$\operatorname{median}_{i=1,\dots,n} \; r_i^2$$

where the residuals correspond to this trial estimate:

$$r_i = y_i - x_{i1}\hat\theta_1 - \dots - x_{ip}\hat\theta_p \qquad (i = 1, 2, \dots, n)$$

This procedure is carried out many times, and the estimate is retained for which the objective value is minimal. In the example of Figure 2, the model is $y_i = \theta_1 x_i + \theta_2 + e_i$ with n = 9 and p = 2, so we consider samples with two observations. The line determined
by the sample (g,h) yields a large objective value, as does the line passing through (f,h). The line corresponding to (f,g) gives a much smaller objective value, and will be selected by the algorithm. Note that both f and g are “good” points, whereas h is an outlier. In general, the number of replications is determined by requiring that the probability that at least one of the samples is “good” is at least 95%. When n and p are small all combinations of p points out of n may be considered, corresponding to the algorithm of Steele and Steiger (1986). Once the optimal $(\hat\theta_1, \dots, \hat\theta_p)$ has been found, the algorithm uses it to assign a weight of 1 (“good”) or 0 (“bad”) to each of the n data points. Subsequently, the points with weight 1 may be used in a classical least squares regression.
Figure 2. Example of simple regression with nine points. There are two distinct outliers in the lower right corner.
When implementing this method on the LCAP system, it was again necessary to minimize the amount of host-slave communication. This minimization is easier to achieve for PROGRESS than it was for CLARA, because now each sample (of p points) is independent of the previous ones, unlike CLARA where the new sample was built around the best medoids found from the earlier samples. Again, several strategies are possible for exploiting the parallel computer structure. In the parallel versions of CLARA discussed above, the host was continuously informed of the best objective value found so far, and it was directly involved in the sample selection process. In order to obtain good system performance, our LCAP implementation of PROGRESS is somewhat different. Indeed, the amount of computation for a single sample is almost constant, and relatively small since it comes down to solving a system of p linear equations and computing the median of n numbers.
Thus, to send each sample from the host to the slave and then return the objective value involves considerable communication time relative to the single sample computation time. Therefore, an alternative strategy was chosen, in which the number of samples to be used was simply divided by the number of available slaves (10 in our case). Each slave then processes that number of samples (as soon as it is finished with one sample, it immediately proceeds with the next) and reports the best result upon completion. The random number generator for each slave is provided with a different seed to ensure that different sets of samples are used. In this strategy, a large number of calculations are performed in parallel with only a minimum of communication needed to initialize the system and to report the final results to the host. At the end, the host merely has to select the best solution. In this strategy, the parallel algorithm yields exactly the same solution as the
18
scqucntial one, providcd the latter uses the samc random samplcs. Morcovcr, the computation timc is esscntially divided by thc number of slaves.
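The strategy described above can be illustrated with a minimal Python sketch for simple regression (p = 2). This is not the PROGRESS code itself: the function names, the use of seeds 0 to 9 and the sequential loop that stands in for the ten slaves are all assumptions made for illustration. Each "slave" draws its own random pairs of points, evaluates the median of squared residuals, and returns only its best result; the "host" selects the overall minimum. The helper `replications` applies the 95% rule for the number of samples, where `eps` is an assumed fraction of outliers.

```python
import math
import random
from statistics import median


def replications(p, eps, prob=0.95):
    """Smallest m such that at least one of m samples of p points is
    outlier-free with probability >= prob (eps = assumed outlier fraction)."""
    good = (1.0 - eps) ** p
    if good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - good))


def lms_worker(points, n_samples, seed):
    """One 'slave': process n_samples random pairs with its own seed and
    return the best (objective, slope, intercept) it found."""
    rng = random.Random(seed)
    best = (math.inf, 0.0, 0.0)
    for _ in range(n_samples):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line: no slope, skip this sample
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        objective = median((y - slope * x - intercept) ** 2 for x, y in points)
        best = min(best, (objective, slope, intercept))
    return best


def lms_parallel(points, total_samples, n_workers=10):
    """'Host': divide the samples over the workers and keep the best result.
    The loop is sequential here; each call could run on its own processor."""
    per_worker = math.ceil(total_samples / n_workers)
    results = [lms_worker(points, per_worker, seed) for seed in range(n_workers)]
    return min(results)
```

Because the only data returned by a worker is a single tuple, the communication cost is independent of the number of samples each worker processes, which is the property the text relies on.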
6. Some other parallelizable statistical techniques

In both problems discussed above, a large number of samples must be processed (almost) independently using identical code. At the beginning, the necessary code as well as all the data are sent to the slave processors. Subsequently, communication between host and slaves can be kept to a minimum. The same characteristics can also be found in other classes of statistical techniques, which can therefore be implemented in a similar way. The bootstrap [Diaconis and Efron 1983] is a method of determining parameter estimates and confidence intervals by considering a large number of samples obtained by drawing (with replacement) the same number of objects as in the original data set. (The idea is that such resampling is more faithful to the data than simply assuming it to be normally distributed.) Each of the samples may be processed by a slave, while at the end the estimate and/or confidence interval are constructed by the host. In the jackknife and some cross-validation techniques, the objects of the data set are excluded one at a time. The objective of the jackknife is to obtain better (less biased) parameter estimates and to set sensible confidence limits in complex situations [see Mosteller and Tukey 1977]. The purpose of cross-validation is to evaluate the performance of decision rules (an example is the leave-one-out procedure in discriminant analysis). Both techniques repeatedly carry out the same calculations (in fact, n times) and are therefore also well suited for parallel computation.
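As an illustration of how the bootstrap fits the same host-slave pattern, the following Python sketch divides the resamples over a number of workers, each with its own seed, and lets the host build a percentile interval from the pooled estimates. The function names, the choice of a simple percentile interval and the sequential loop standing in for the slaves are assumptions for this example, not a description of an implementation from the study.

```python
import random
from statistics import mean


def bootstrap_worker(data, n_resamples, seed, stat=mean):
    """One 'slave': draw n_resamples samples with replacement (same size as
    the data) and return the statistic computed on each of them."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]


def bootstrap_interval(data, total_resamples=1000, n_workers=10,
                       level=0.95, stat=mean):
    """'Host': pool the workers' estimates and read off a percentile interval.
    Returns (estimate on the original data, lower limit, upper limit)."""
    per_worker = total_resamples // n_workers
    estimates = []
    for seed in range(n_workers):  # each call could run on its own processor
        estimates.extend(bootstrap_worker(data, per_worker, seed, stat))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[int((1.0 - tail) * len(estimates)) - 1]
    return stat(data), lower, upper
```

A jackknife or leave-one-out cross-validation could be distributed in the same way by giving each worker a block of the n indices to exclude.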
7. Conclusion

Computing systems have greatly improved during the last years. Computers available today range from quite powerful micro and minicomputers up to supercomputers that are able to solve large and complex numerical problems. These supercomputers achieve their computational performance through a combination of advanced processors and an architecture that permits efficient algorithms. It appears that a variety of statistical methods that are based on the consideration of a large number of samples are ideally suited for parallel implementation. The development of parallel architectures thus opens new possibilities for these computationally intensive procedures.
Acknowledgement

The author wishes to thank Drs. L. Kaufman and P. Rousseeuw of the Vrije Universiteit Brussel for their collaboration in the studies presented in this work, the IBM Research Center in Kingston for the opportunity to use the LCAP system, and Drs. D. Logan and S. Chin of IBM for their assistance in implementing these algorithms. This work was supported in part by the U.S. National Science Foundation through Grants INT 85 15437 and ATM 89 96203.
References

Chin S, Domingo L, Carnevali A, Caltabiano R, Detrich J. Parallel Computation on the LCAP Computer System: A Guide to the Precompiler. Technical Report. Kingston, New York 12401: IBM Corporation, Data Systems Division, 1985.
Clementi E. Global scientific and engineering simulations on scalar, vector and parallel LCAP-type supercomputers. Phil Trans R Soc Lond 1988; A326: 445-470.
Clementi E, Logan D. Parallel Processing with the Loosely Coupled Array of Processors System. Kingston, New York 12401: IBM Corporation, Data Systems Division, 1985.
Diaconis P, Efron B. Computer-Intensive Methods in Statistics. Scientific American 1983; 248: 116-130.
Flynn MJ. Very High Speed Computing Systems. Proceedings of the IEEE 1966; 54: 1901-1909.
Hopke PK, Kaufman L. The Use of Sampling to Cluster Large Data Sets. Chemometrics and Intelligent Laboratory Systems 1990; 8: 195-205.
Kaufman L, Rousseeuw P. Clustering Large Data Sets. In: Gelsema E, Kanal L, eds. Pattern Recognition in Practice II. Amsterdam: Elsevier/North-Holland, 1986: 425-437 (with discussion).
Massart DL, Plastria F, Kaufman L. Non-Hierarchical Clustering with MASLOC. Pattern Recognition 1983; 16: 507-516.
Mosteller F, Tukey JW. Data Analysis and Regression. Reading, Massachusetts: Addison-Wesley, 1977.
Rousseeuw PJ. Least Median of Squares Regression. Journal of the American Statistical Association 1984; 79: 871-880.
Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. New York: Wiley-Interscience, 1987.
Schendel U. Introduction to Numerical Methods for Parallel Computers. Chichester, England: J. Wiley & Sons, Ltd., 1983.
Steele JM, Steiger WL. Algorithms and Complexity for Least Median of Squares Regression. Discrete Applied Mathematics 1986; 14: 93-100.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 3
Parallel Computing of Resonance Raman Intensities Using a Transputer Array

R.G. Efremov
Shemyakin Institute of Bioorganic Chemistry, USSR Academy of Sciences, ul. Miklukho-Maklaya 16/10, 117871 Moscow, V-437, USSR
1. Introduction

Recent progress in optical spectroscopy of biological molecules is closely connected to the growth of computer power. Computational approaches are of great importance in solving the problems of spectral processing, interpretation and simulation. For example, computing the resonance Raman (RR) intensities of large molecules provides detailed information about the equilibrium geometry and dynamics of the resonant excited electronic state. The traditional approach to calculating the RR cross section involves a summation over all the vibrational levels of the resonant electronic state. Such a sum-over-state algorithm performs a direct search for the complete set of excited state parameters that generate the best fit to the experimental RR data. Usually, when there are only a few RR active modes, the sum-over-state method is much more efficient than alternative approaches [1]. In general, however, a standard sequential sum-over-state procedure can be computationally intractable for large (especially biological) molecules, because the algorithm calculates the Raman cross sections of N vibrational modes of the molecule, and for each mode there are N logically nested loops. Such a technique therefore demands powerful computer resources. The problem becomes especially processing-intensive if a hard optimization is required, as is the case when the initial estimates of the excited state parameters are not known exactly. An alternative approach to this task is based on the fact that the intensities of all modes can be calculated independently on different parallel processors. The transputer provides an ideal unit for inexpensive, high power parallel computers which can perform the algorithm in reasonable time for real molecules of biological interest.
The transputer T800 manufactured by INMOS [2] combines a fast 32-bit RISC (reduced instruction set) processor (10 MIPS), fast static memory (4 Kb of on-chip RAM), a 64-bit floating point coprocessor (which can operate concurrently with the central processor) and four very fast (20 Mbit/s) bidirectional serial communication links on one chip. Transputers use the links for synchronous point-to-point communications with each other. The links can be switched so as to create any network configuration. Some applications of transputer arrays in the fields of computational chemistry, physics and biophysics have been recently described [3-5]. These studies clearly demonstrated the efficiency of the transputer architecture. It was shown that even IBM AT or microVAX computers equipped with transputer boards can be useful for such processing-intensive problems as Monte Carlo [3], direct SCF [4] and biomolecular energy [5] calculations. This work presents a method of parallel sum-over-state computing of resonance Raman and absorption cross sections using a transputer array. We have applied this approach to direct modeling of the experimental absorption spectrum of adenosine triphosphate (ATP) and the RR excitation profile of ATP in water solution. As a result, the set of excited state parameters of ATP that provides the best fit to the experimental data has been obtained.
2. Computing resonance Raman and absorption cross sections

The resonance Raman cross section in the Condon approximation can be written as the conventional vibronic sum over states [1]:

$$\sigma_{i \to f}(E_L) = C_1 M^4 E_S^3 E_L \left| \sum_v \frac{\langle f | v \rangle \langle v | i \rangle}{\varepsilon_v - \varepsilon_i + E_0 - E_L - i\Gamma} \right|^2 \tag{1}$$

Here $|i\rangle$, $|v\rangle$ and $|f\rangle$ are the initial, intermediate, and final vibrational states; $\varepsilon_i$ and $\varepsilon_v$ are the energies (in cm^-1) of the states $|i\rangle$ and $|v\rangle$; $M$ is the electronic transition length; $E_L$ and $E_S$ are the energies of the incident and scattered photons; $E_0$ is the energy separation between the lowest vibrational levels of the ground and excited electronic states (zero-zero energy); $\Gamma$ is the homogeneous linewidth (in cm^-1), and $C_1$ is a constant. The corresponding expression for the absorption cross section, with constant $C_2$, is:

$$\sigma_A(E_L) = C_2 M^2 E_L \sum_v \frac{\Gamma \, |\langle v | i \rangle|^2}{(\varepsilon_v - \varepsilon_i + E_0 - E_L)^2 + \Gamma^2} \tag{2}$$
The absorption cross section $\sigma_A$ (Å²/molecule) is related to the molar absorptivity $\epsilon$ (M^-1 cm^-1) by $\sigma_A = 2.303 \times 10^{19}\, \epsilon / N_A$, where $N_A$ is Avogadro's number. In the simplest approach the vibrational frequencies and normal coordinates are identical in the ground and excited states, and a system of nmod vibrational modes can be treated as a collection of nmod independent pairs of harmonic oscillators with frequencies $\Omega_k$ (k = 1, ..., nmod). Therefore, multidimensional Franck-Condon factors can be written as the products of one-dimensional overlaps:

$$\langle f | v \rangle = \prod_{j=1}^{nmod} \langle f_j | v_j \rangle$$

and

$$\varepsilon_v - \varepsilon_i = \sum_{j=1}^{nmod} \hbar \Omega_j (v_j - i_j)$$
So, equations (1) and (2) for the fundamental resonance Raman and absorption cross sections are given more explicitly as follows:

$$\sigma_R(E_L) = C_1 M^4 E_S^3 E_L \left| \sum_{v_1} \sum_{v_2} \cdots \sum_{v_{nmod}} \frac{\langle 1 | v_1 \rangle \langle v_1 | 0 \rangle \prod_{j=2}^{nmod} |\langle v_j | 0 \rangle|^2}{\sum_{j=1}^{nmod} \hbar \Omega_j v_j + E_0 - E_L - i\Gamma} \right|^2 \tag{3}$$

$$\sigma_A(E_L) = C_2 M^2 E_L \Gamma \sum_{v_1} \sum_{v_2} \cdots \sum_{v_{nmod}} \frac{\prod_{j=1}^{nmod} |\langle v_j | 0 \rangle|^2}{\left( \sum_{j=1}^{nmod} \hbar \Omega_j v_j + E_0 - E_L \right)^2 + \Gamma^2} \tag{4}$$

Here the Raman active mode has subscript 1, and the initial state is taken as the lowest vibrational level of every mode. One-dimensional Franck-Condon factors can be calculated with a recursive relation [6]. If there are no changes in vibrational frequencies ($\Omega$) in the excited electronic state, then the factors for each mode $v_k$ can be shown to be only a function of its displacement $d_k$ in the excited state. In equations (3) and (4), nmod vibrational modes are included in the summation over those quantum numbers $v_i$ = 0, 1, 2, ... of the upper electronic state for which the product of Franck-Condon factors exceeds a cutoff level (we used ~$10^{-4}$ to $10^{-5}$ times the magnitude of the zero-zero transition).

Raman and absorption cross sections also depend on environment effects, because in the condensed phase different scatterers may be either in different initial quantum states or in slightly different surroundings. Such phenomena lead to inhomogeneous broadening of RR excitation profiles and absorption spectra. The vibrational modes which are active in resonance Raman do not undergo large displacements and frequency shifts upon excitation. Therefore, little error results from neglecting thermal effects [1]. Variations in the local environment can be taken into account if a Gaussian distribution of zero-zero energies with standard deviation $\theta$ (in cm^-1) is proposed to describe the site broadening effects [7]:
where $\langle E_0 \rangle$ is an average zero-zero energy (in cm^-1). A similar equation is also true for the absorption cross section.
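The sum over states for the absorption cross section, the Franck-Condon cutoff and the Gaussian site broadening can be sketched together in a few lines of Python. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: the one-dimensional factors are evaluated with the closed form for displaced oscillators of equal frequency, $|\langle v|0\rangle|^2 = e^{-s} s^v / v!$ with $s = d^2/2$ for a dimensionless displacement $d$ (rather than with the recursive relation of [6]); the prefactor $C_2 M^2$ is set to 1; the broadened cross section is taken as the homogeneous one averaged over a normalized Gaussian in $E_0$ with mean $\langle E_0 \rangle$ and standard deviation $\theta$; and all function and parameter names are hypothetical.

```python
import math


def fc_squared(v, d):
    """|<v|0>|^2 for a displaced harmonic oscillator with unchanged frequency
    (assumed closed form; d is a dimensionless displacement)."""
    s = 0.5 * d * d
    return math.exp(-s) * s ** v / math.factorial(v)


def absorption(e_l, e0, gamma, modes, vmax=10, cutoff=1e-5):
    """Homogeneous absorption cross section with C2 * M^2 set to 1.
    modes is a list of (frequency in cm^-1, displacement) pairs."""
    zero_zero = math.prod(fc_squared(0, d) for _, d in modes)
    total = 0.0

    def visit(k, energy, weight):
        nonlocal total
        if k == len(modes):
            total += weight * gamma / ((energy + e0 - e_l) ** 2 + gamma ** 2)
            return
        freq, d = modes[k]
        for v in range(vmax + 1):
            w = weight * fc_squared(v, d)
            if w < cutoff * zero_zero:
                continue  # product of factors below the cutoff level
            visit(k + 1, energy + v * freq, w)

    visit(0, 0.0, 1.0)
    return e_l * total


def broadened_absorption(e_l, e0_mean, theta, gamma, modes, n=41, span=4.0):
    """Average of the homogeneous cross section over a Gaussian distribution
    of zero-zero energies, evaluated on n equally spaced points."""
    step = 2.0 * span * theta / (n - 1)
    weighted, norm = 0.0, 0.0
    for i in range(n):
        e0 = e0_mean - span * theta + i * step
        w = math.exp(-((e0 - e0_mean) ** 2) / (2.0 * theta ** 2))
        weighted += w * absorption(e_l, e0, gamma, modes)
        norm += w
    return weighted / norm
```

Since each excitation energy (and, for Raman, each mode) is evaluated independently, calls such as `absorption(e_l, ...)` for different `e_l` are natural units of work to distribute over separate processors.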
3. Hardware and software environment

The hardware used in the present work is shown in Figure 1. It consists of five INMOS T800-25 transputers, each with 1 Mb RAM (the transputer array), of which one node (the root transputer) is attached to a host personal computer AT 386-25 running under MS-DOS. A file server program (3L, Ltd.) placed on the PC controls the access of the transputer to the disk, the screen, the keyboard, etc. The transputer processor has four INMOS links to connect it with other transputers. Each link has two channels, one in each direction. They provide synchronized, unidirectional communication. The hardware configuration as well as the logical interconnections between the processors and tasks were described by a configuration language included in the 3L Parallel FORTRAN package [8]. As can be seen from Figure 1, the server task placed on the host PC is not directly connected to the application program. The filter task is interposed between them. It runs in parallel with the server program and the application task and passes on messages traveling in both directions.

Figure 1. Hardware configuration of the transputer array.

Such a configuration file has the form:

    PROCESSOR Host
    PROCESSOR Root
    PROCESSOR T1
    ..........
    PROCESSOR T4
    WIRE ? Host[0] Root[0]
    WIRE ? Root[1] T1[0]
    WIRE ? Root[2] T2[0]
    WIRE ? Root[3] T3[0]
    WIRE ? T1[1] T4[0]
    TASK Afserver INS=1 OUTS=1
    TASK Filter INS=2 OUTS=2 DATA=10K
    TASK

... 0) are different from those in the lower part (Y < 0). Thus at any given growth stage the model is finally represented by
$$S_1 = A e^{-B_0 (c - x^2)} \quad \text{for } Y > 0,\; r = 1, 2, \ldots, 6$$

and

$$S_2 = A e^{-B_1 (c - x^2)} \quad \text{for } Y < 0,\; r = 7, 8, \ldots, 11$$
Since the model is changing with time (t) we replace A, $B_0$ and $B_1$ above by A(t), $B_0(t)$ and $B_1(t)$. This process was found to give a fit of the simulated shape to the observed shape of better than 10% over the time period shown [15]. The simulation is continued up to t = 48 h. By this stage the geometry of the limb outline is visibly growing more complex. The forces modelling the limbs are certainly more varied and include both rapid elongation of proximal skeletal elements [16, 17] and complex morphogenetic movements involving limb and flank [18]. The conclusions that may be drawn from the model at this stage suggest that the rate of expansion is exponential, with the central section constrained by the parameters given in equation 2. This confirmation now enables us to consider an extension of the model to a 3D situation.
4. The 3D model

In three dimensions the shape of the bud is most closely represented by a semi-ellipsoid (Fig. 3) whose basic equation is

Notice that we can select sections through this model which closely resemble those of the 2D model (see above). We again construct a series of radial vectors with reference to a specified central origin and from our previous model conclude that appropriate functions for initiation of the 3D growth simulation should be of the form
We also concluded that the simulation and any subsequent analysis would be helped by adopting a finite element approach. For our case a finite element is defined as a 'patch' on the surface of the semi-ellipsoid defined by a plane through an arbitrary number of points on the surface (Fig. 3). Our approach is to determine the normal to the plane and associate the functions defined above [Equation 4] with each normal. At each stage of growth we therefore extend the normals by an amount calculated from the function [Equation 4]. The extension of each of these normals gives a set of points which defines a new surface. To facilitate a rapid calculation of this surface we use a least squares fitting procedure applied to a variable number of neighbouring points and repeat the process over the growth period. The algorithm for this process can be expressed as a 4 stage procedure.

1. Generate a mesh of points on the surface of the semi-ellipsoid [Equation 3].
2. Determine the equations of the planar elements together with their normals, through an arbitrary number of neighbouring points. (The arbitrariness is determined by the user requirements in terms of the required resolution and accuracy.)
3. Establish a set of radial vectors from a specified origin to each normal position at the centre of the planar elements and apply the set of exponential growth functions [Equation 4] to each vector to generate new vectors.
4. Reconstitute the new surface from the ends of the vectors generated in 3, using the least squares fitting procedure.

Figure 3. The semi-ellipsoid, radial vector and surface patch with corresponding surface normal used as controlling parameters in the simulation process.
Steps 2-4 are repeated until the end of the simulation period. Figure 4 shows the initial surface generated using step 1 of the algorithm and 2,000 vectors. To test the robustness of the model we allowed the algorithm to cycle between steps 2 and 4 over a period of 4 h. To account for the fact that variations in the growth rate occur at different parts of the surface, we modify our exponential growth parameters according to the values of x and y in a given quadrant, i.e., we vary the rate according to position. Figure 5 shows biphasal growth during the first 2 hours and Figure 6 shows the result of introducing quadriphasal growth up to 10 hours. The plotting software used was the standard GINO package installed on a SUN Workstation. From Figure 6 we can see that the surface shows two sharp “ridge” type features across the centre. Attempts to extend the growth period beyond this cause folding along the ridges. The plotting function is then no longer single-valued in these regions, which causes difficulties with producing output, and consequently the standard wire frame model had to be abandoned. The fold running in the long axis has a real anatomical correlate, and any representative model must be able to cope with this type of situation, so we required plotting software suited to handling it. The most obvious approach is either to section the model into two parts with the division along the fold line or to reconstitute the surface using hidden surface algorithms.
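The first three steps of the procedure can be sketched as follows. Because Equations 3 and 4 are not reproduced here, the sketch makes its own assumptions: the semi-ellipsoid is taken in the standard form x²/a² + y²/b² + z²/c² = 1 with z ≥ 0, a planar element is defined by exactly three neighbouring points, and the growth law is a plain exponential scaling of the radial vector with a hypothetical rate parameter. Step 4 (the least squares refit of the surface) is omitted. All names are illustrative.

```python
import math


def semi_ellipsoid_mesh(a, b, c, n_theta=12, n_phi=6):
    """Step 1: mesh of points on the upper half (z >= 0) of the ellipsoid
    x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 (assumed form of the surface)."""
    mesh = []
    for i in range(n_phi + 1):
        phi = 0.5 * math.pi * i / n_phi          # elevation, 0 .. pi/2
        row = []
        for j in range(n_theta):
            theta = 2.0 * math.pi * j / n_theta  # azimuth, 0 .. 2*pi
            row.append((a * math.cos(theta) * math.cos(phi),
                        b * math.sin(theta) * math.cos(phi),
                        c * math.sin(phi)))
        mesh.append(row)
    return mesh


def patch_normal(p, q, r):
    """Step 2: unit normal of the plane through three neighbouring points."""
    u = tuple(qi - pi for pi, qi in zip(p, q))
    v = tuple(ri - pi for pi, ri in zip(p, r))
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(sum(k * k for k in n))
    return tuple(k / length for k in n)


def grow(point, rate, dt):
    """Step 3: extend the radial vector from the origin to `point` by the
    factor exp(rate * dt) (hypothetical exponential growth law)."""
    scale = math.exp(rate * dt)
    return tuple(scale * k for k in point)
```

Making `rate` a function of x and y, as the text describes for biphasal and quadriphasal growth, would only change the argument passed to `grow` for each element.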
Figure 4. The initial surface generated by the computer as a wire frame representation.
Figure 5. The representation of early growth with a symmetric biphasal variation.
Before continuing the discussion, however, we show how the modelling process even at its present stage helps us to formulate an analysis for monitoring cellular behaviour. Of particular interest to biologists are the types of forces acting on the cells responsible for growth. Based on our current observations we are able to formulate a set of equations describing the growth and combine these within a Newtonian framework to derive an expression for the surface forces relative to a specific origin. Our basic equations may be written as:

$$S = A e^{-B(x^2 + y^2)}, \qquad [A = at,\; B = bt,\; M = mt] \tag{5}$$

where a, b and m are constants. Initial observations show that M varies linearly with time. We may write the force acting on a finite element over a time t as

$$F = \frac{d}{dt}(Mv)$$

where v is the velocity of the element. We have averaged this value to the speed of the centre of each finite element with respect to time; then

$$F = mv + mt\dot{v}$$

Figure 6. As Figure 5 with an additional growth variation. (Note longitudinal ridge development which has an observed anatomical correlate.)

From (5),

$$v = \dot{S} = a e^{-bt(x^2 + y^2)} \{1 - bt(x^2 + y^2)\}$$

$$\therefore \; \dot{v} = -ab(x^2 + y^2)\, e^{-bt(x^2 + y^2)} \{2 - bt(x^2 + y^2)\}$$

thus

$$F = am\, e^{-bt(x^2 + y^2)} \left[ 1 - 3bt(x^2 + y^2) + b^2 t^2 (x^2 + y^2)^2 \right]$$
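The closed form for the force can be checked numerically. The short Python sketch below assumes S = at·exp(−bt(x² + y²)) and M = mt as stated above, and compares F = am·exp(−bt(x² + y²))[1 − 3bt(x² + y²) + b²t²(x² + y²)²] with a central finite-difference evaluation of d(Mv)/dt. The function names, the step size and the parameter values in the usage note are arbitrary illustrative choices, not fitted values.

```python
import math


def surface(t, x, y, a, b):
    """S = A exp(-B (x^2 + y^2)) with A = a t and B = b t."""
    return a * t * math.exp(-b * t * (x * x + y * y))


def force_closed(t, x, y, a, b, m):
    """Closed-form F obtained from F = d(Mv)/dt with M = m t and v = dS/dt."""
    r2 = x * x + y * y
    u = b * t * r2
    return a * m * math.exp(-u) * (1.0 - 3.0 * u + u * u)


def force_numeric(t, x, y, a, b, m, h=1e-3):
    """The same force from nested central differences in t."""
    def velocity(tt):
        return (surface(tt + h, x, y, a, b) - surface(tt - h, x, y, a, b)) / (2.0 * h)

    def momentum(tt):
        return m * tt * velocity(tt)

    return (momentum(t + h) - momentum(t - h)) / (2.0 * h)
```

For example, with a = 1, b = 0.5, m = 2 at t = 1.5 and (x, y) = (0.6, 0.8), the two functions agree to within the accuracy of the finite-difference step.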
The force has therefore been expressed in terms of the positional values x and y and the parameters a, b and m. Values for these parameters will be established by matching the model against the experimental observations, i.e., we adjust the parameter values until a representative model is found for all values of t. We will also be able to determine the rates of change of the forces over the surface from the expressions for ∂F/∂x and ∂F/∂y respectively. Clearly the graphics representation of the model, which enables us to evaluate hypotheses on growth rates by comparing the simulated shape with that observed experimentally, is a major feature of the modelling process. Aside from the hidden surface approach (Fig. 7) for graphics display of the model, we are also investigating an alternative approach which has proved successful in representing biological type structures and would appear to offer considerable potential for our own studies. This approach, based upon the use of hyperquadrics, has been extensively researched by Hanson [19], and we present a brief outline of the method together with some initial results. Hyperquadrics may be considered as roughly analogous to splines in that they are able to render a high resolution surface fitted to a set of points mapped in three dimensions, just as splines give a high resolution line fitted to a set of points mapped in two dimensions. The basic hyperquadric equation is

$$\sum_{a=1}^{N} \sigma_a \, |H_a(x)|^{\gamma_a} = 1$$

with

$$H_a(x) = \sum_{i=1}^{D} r_{ai} x_i + d_a$$

where $x$ is a D-dimensional vector, $r_{ai}$ and $d_a$ are constants, $\sigma_a = \pm 1$ (depending on whether the required hyperquadric is elliptic or hyperbolic in nature) and $\gamma_a$ is a user specified parameter.
Figure 7. A preliminary 3D reconstruction using hidden surface methods.
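In the 3D case we may write the equations as

$$\sum_{a=1}^{N} \left| S_a (A_a x + B_a y + C_a z + D_a t) \right|^{\gamma_a} = 1$$

or in parametric form:

$$x = r(\theta, \phi) \cos\theta \cos\phi \cos\psi, \quad y = r(\theta, \phi) \sin\theta \cos\phi \cos\psi, \quad z = r(\theta, \phi) \sin\phi \cos\psi, \quad t = \sin\psi$$

and solve for $r(\theta, \phi)$.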
For a more detailed description of the algorithms for solving the above equations see Hanson [19]. Figure 8 shows a number of 3D shapes based upon the above equations for varying $\gamma_a$. The figure shows a variety of shapes which strongly resemble those of certain types of virus and approximate to the type of representation we require for our own model. Figure 9 shows a number of surfaces generated using the 3D representations of the equation for various parameter values of S and a. We are currently seeking to reconcile the parameter values with those of the model so that a meaningful display of the modelling process can be achieved.

Figure 8. An example of shapes closely resembling those seen in certain virus representations using the method of hyperquadrics [19].
5. Conclusions

The 2D modelling has provided suitable information for initiating a 3D simulation of growth of the chick wing bud over a 30 hour period. We have been able to derive a number of basic equations which will enable us to model the forces shaping the surface at any stage of the growth process. Promising methods for these and other biological modelling exercises are being investigated.
Acknowledgements

We thank Mr. Karl Matthews, a Computer Science project student, for developing the 3D plotting software used to produce the results shown in Figure 7.
Figure 9. Preliminary surfaces generated on a microcomputer using hyperquadric methodology.
References

1. Wolpert L. Positional information and the spatial pattern of cellular differentiation. J theoret Biol 1969; 25: 1-47.
2. Wolpert L. Positional information and pattern information. Curr Top Dev Biol 1969; 6: 183-224.
3. French V, Bryant PJ, Bryant SV. Pattern regulation in epimorphic fields. Science 1976; 193: 969-981.
4. Gierer A, Meinhardt H. A theory of biological pattern formation. Kybernetik 1972; 12: 30-39.
5. Murray JD. A prepattern formation mechanism for animal coat markings. J theoret Biol 1981; 88: 161-199.
6. Gordon R. Computational embryology of the vertebrate nervous system. In: Geisow MJ, Barrett AN, eds. Computing in Biological Science. Amsterdam: Elsevier, 1983: 23-70.
7. Odell G, Oster G, Burnside B, Alberch P. A mechanical model for epithelial morphogenesis. J math Biol 1980; 9: 291-295.
8. Ede DA, Law JT. Computer simulation of vertebrate limb morphogenesis. Nature (Lond) 1969; 221: 244-248.
9. Mitolo V. Un programma in Fortran per la simulazione dell'accrescimento e della morfogenesi. Boll Soc ital Biol sper 1971; 41: 18-20.
10. Searls RL, Janners MV. The initiation of limb bud outgrowth in the embryonic chick. Devl Biol 1971; 24: 198-213.
11. Hamburger V, Hamilton HL. A series of normal stages in the development of the chick embryo. J Morph 1951; 88: 49-92.
12. Spemann H. Embryonic development and induction. New Haven: Yale University Press, reprinted Hafner, New York, 1938.
13. Goodwin BC, Cohen MH. A phase-shift model for the spatial and temporal organisation of developing systems. J theoret Biol 1969; 25: 49-107.
14. Barrett AN, Burdett IDJ. A three-dimensional model reconstruction of pole assembly in Bacillus subtilis. J theoret Biol 1981; 92: 127-139.
15. Barrett AN, Summerbell D. Mathematical modelling of growth processes in the developing chick wing bud. Comput Biol Med 1984; 14: 411-418.
16. Summerbell D. A descriptive study of the rate of elongation and differentiation of the skeleton of the developing chick wing. J Embryol exp Morph 1976; 35: 241-260.
17. Archer CW, Rooney P, Wolpert L. The early growth and morphogenesis of limb cartilage. Prog clin Biol Res 1983; 110: 267-278.
18. Searls RL. Shoulder formation; rotation of the wing, and polarity of the wing mesoderm and ectoderm. Prog clin Biol Res 1983; 110: 165-174.
19. Hanson AJ. Hyperquadrics: smoothly deformable shapes with polyhedral bounds. Computer Vision, Graphics and Image Processing 1988; 44: 191-210.
Statistics
CHAPTER 7
Experimental Optimization for Quality Products and Processes

S.N. Deming
Department of Chemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204-5641, USA
Adapted, in part, from Stanley N. Deming, “Quality by Design”, CHEMTECH 1988; 18(9): 560-566.
1. Introduction

Over half a century ago, in 1939, William Edwards Deming wrote in the introduction to Walter A. Shewhart’s famous book on quality control [1], “Most of us have thought of the statistician’s work as that of measuring and predicting and planning, but few of us have thought it the statistician’s duty to try to bring about changes in the things that he measures. It is evident, however, ... that this viewpoint is absolutely essential if the statistician and the manufacturer or research worker are to make the most of each other’s accomplishments.” Although Shewhart’s methods were taught before and during World War II, after the war the methods were largely ignored in the west [2]. In postwar Japan, however, the methods were taught by W. E. Deming and others, and were responsible, in part, for reestablishing the Japanese economy. Genichi Taguchi, a Japanese engineer, was especially successful in using fractional factorial designs in innovative ways to reduce variation in a product or process at the design stage [3,4]. After almost half a century, western industry finally seems ready to accept the statistician’s long-standing offer of help. “There is much to be done if [industries are] to survive in the new economic age. We statisticians have a vital role to play in the transformation that is needed to make our industry competitive in the world economy” [5].
2. Quality

The focus of the current industrial transformation is “quality,” a word that has several meanings. Two meanings which are critical for both quality planning and strategic business planning relate to (a) product or process advantages: features such as “inherent uniformity of a production process,” “fuel consumption of an engine,” and “millions of instructions per second (MIPS) of a computer” determine customer satisfaction; and (b) product or process deficiencies: “field failures,” “factory scrap or rework,” and “engineering design changes” lead to customer dissatisfaction [6]. It is clear from these determinants of quality that “[the] key to improved quality is improved processes. ... Processes make things work. Thousands of processes need improvement, including things not ordinarily thought of as processes, such as the hiring and training of workers. We must study these processes and find out how to improve them. The scientific approach, data-based decisions, and teamwork are key to improving all of these processes” [5]. In the past, some industries achieved quality not by making the process produce good product but rather by separating the good from the bad based on mass inspection of the outgoing product. Only in rare situations is mass inspection the appropriate response to deficient quality. In general, “Inspection is too late, ineffective, costly. ... Scrap, downgrading, and rework are not corrective action on the process. Quality comes not from inspection, but from improvement of the process” [7].

Figure 1. Percent impurity vs. batch number for a chemical process.
3. Statistical process control

Consider a chemical process that produces not only the desired compound in high yield but also a relatively small amount of an undesirable impurity. Discussions between the producer and consumer of this material suggest that an impurity level of up to 2.0% can be tolerated. A higher impurity level is unacceptable. By mutual consent, a specification level of ≤ 2.0% is set. Figure 1 plots the percent impurity vs. batch number for this chemical process [8]. Most of the time the percent impurity is < 2.0%, but about one batch in five is outside the specification. This can be costly for the manufacturer if there is no other customer willing to purchase “out of spec” material. These out-of-spec batches might be kept in a holding area until they can be reworked, usually by blending with superior grade material. But storage and rework are costly and ultimately weaken the competitive position of the manufacturer. Figure 1 is a way of letting the process talk to us and tell us how it behaves [9]. It seems to be saying that: on the average, the impurity level is below 2.0%; there is some variation from batch to batch; and the process behaves consistently (a moving average wouldn’t appear to go up or down much with time, and the variation seems to be fairly constant with time). These ideas are confirmed in the statistical “x-bar” and “r” charts shown in Figures 2 and 3, respectively. To construct these charts, the group of 100 batches has been subdivided into 25 sequential subgroups of four. For each subgroup, the average, x̄ (pronounced “x-bar”), and the range (r = greatest reading minus least reading in the subgroup) have been calculated. The resulting values are plotted as a function of subgroup number. (Subgroups are necessary, in part, to obtain estimates of the range.) As expected, the average percent impurity doesn’t go up or down very much with time, and the range (a measure of variation) is fairly constant with time.
Based on these observations, we make the assumption that the process is stable and place the middle, unlabelcd dashed line in Figure 2-the “grand average,” the “average of avcragcs.” We also estimate the standard deviation (s) and plot the three-sigma limits. The subgroup avcragcs will lie within these limits -99.7% of the time if the process is stablc. Only rarcly would a subgroup average lie outside these limits if the process is “in statistical control” (that is, if the process is stablc). Similar thrcc-sigma limits can be placed on the range. In Figure 3 only the upper control limit is labclcd-the lowcr control limit is zero. The unlabelcd dashcd line represents thc avcragc subgroup range. Only vcry rarcly (-0.3% of the time) would a subgroup rangc be grcatcr than the uppcr control limit if thc process is in statistical control. It is absolutcly csscntial to undcrstand that these control limits are a manifestation of Lhc proccss spcaking to us, tclling us how it behaves. These control limits do not represent how we would like the process to behave. It is common but misguided practice to
14
Lower C o n t r o l Limit
0
5
10 15 Subgroup Number
20
25
Figure 2. X-bar (mean) control chart from the data in Figure 1
0
5
10 15 Subgroup Number
Figurc 3. R (range) control chart froin the data in Figure 1.
20
25
I I0
15
20 25 Subgroup Number
30
35
Figure 4. Effcct of an out-or-control situation on the x-bar control chart.
draw on control charts lines that represent our wishes. These lines can have no effect on the behavior of the process. Control charts are useful because they offer a way of letting the process tell us when it has changed its behavior. In Figure 4 it is clear that something significant happened at subgroup numbers 27-29. The process has clearly “gone out of control.” So many excursions so far away from the control limit in such a short time would be highly unlikely from a statistical point of view if the process were still operating as it was before. Such excursions suggest that there is some assignable cause (Shewhart) or special cause (W. Edwards Deming) for the observed effect. Because these excursions are undesirable in this example (most of the individual batches produced would probably be unfit for sale), it is economically important to discover the assignable cause and prevent its occurrence in the future.
One of the most powerful uses of control charts is their ability to tell us when the process is in statistical control so we can leave it alone and not tamper with it. Another use of control charts is to show when special causes are at work so that steps can be taken to discover the identity of these special causes and use them to improve the process. However, there are two difficulties with this second use of control charts. First, we have to wait for the process to speak to us. This passive approach to process optimization isn’t very efficient. If the process always stays in statistical control, we won’t learn anything and the process can’t get better. It is not enough just to be in statistical control: our product must become “equal or superior to the quality of competing products” [6]. We probably can’t wait for the process to speak to us. We must take action now. Second, when the process does speak to us (when it goes out of control), it tantalizes us by saying “Hey! I’m behaving differently now. Try to find out why.” Discovering the reason the process behaves differently requires that we determine the cause of a given effect. Discovering which of many possible causes is responsible for an observed effect, an activity that continues to puzzle philosophers, is often incredibly difficult.
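The control limits discussed in this section can be illustrated with a minimal sketch. It assumes subgroups of five measurements (the subgroup size is not stated in the text), uses the commonly tabulated chart constants for that size, and works on made-up data; the function names are hypothetical.

```python
# Tabulated Shewhart chart constants for subgroups of size 5 (assumed size).
A2, D3, D4 = 0.577, 0.0, 2.114

def xbar_r_limits(subgroups):
    """Return (lower, center, upper) limits for the x-bar and R charts."""
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_average = sum(means) / len(means)   # the "average of averages"
    average_range = sum(ranges) / len(ranges)
    return {
        "xbar": (grand_average - A2 * average_range, grand_average,
                 grand_average + A2 * average_range),
        "range": (D3 * average_range, average_range, D4 * average_range),
    }

def flag_subgroups(subgroups, limits):
    """1-based numbers of subgroups whose average lies outside the x-bar limits."""
    lower, _, upper = limits["xbar"]
    flagged = []
    for number, group in enumerate(subgroups, start=1):
        mean = sum(group) / len(group)
        if mean < lower or mean > upper:
            flagged.append(number)
    return flagged

# Hypothetical baseline data: three subgroups of five measurements each.
baseline = [[10, 11, 9, 10, 12], [9, 10, 11, 10, 10], [11, 10, 10, 9, 12]]
limits = xbar_r_limits(baseline)

# A later subgroup that has shifted upward is flagged against the old limits.
new_data = baseline + [[14, 15, 13, 14, 16]]
print(limits["xbar"], flag_subgroups(new_data, limits))
```

The limits are computed from the process data alone, which mirrors the point made above: they describe how the process behaves, not how we would like it to behave.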
4. Experimental optimization
In an excellent paper on cause-and-effect relationships, Paul W. Holland [10] concludes that “[the] analysis of causation should begin with studying the effects of causes rather than ... trying to define what the cause of a given effect is.” That is a powerful conclusion. It is a recommendation that the technologist intentionally produce causes (changes in the way the process is operated) and see what effects are produced (changes in the way the process behaves). With such designed experiments, information can be obtained now. We can make the process talk to us. We can ask the process questions and get answers from it. We don’t have to wait.
Figure 5 contains the results of a set of experiments (open circles) designed to discover the effect of temperature on impurity. The right side of Figure 5 shows the presumed “cause and effect” relationship between impurity level and temperature. From the shape of the fitted curve, it would appear that, at least insofar as impurity level is concerned, our current operating temperature of 270 is not optimal. But there are two reasons why it is not optimal: not only is the level of impurity relatively high, but also the amount of variation of impurity level with temperature is relatively high.
“Set-point control” is almost never set-point control. We might set the controller to maintain a temperature of 270.0, but time constants within the control loop and variations in mixing, temperature, flow rates, or sensors prevent the controller from maintaining a temperature of exactly 270.0. In practice, the temperature of the process will fluctuate around the set point. This variation in temperature is represented by the black horizontal bar along the temperature axis in Figure 5. Variations in temperature will be transformed by the process into variations in impurity level. The relationship between percent impurity and temperature is rather steep in the region of temperature = 270. When the temperature wanders to lower levels, the percent impurity will be high. When the temperature wanders to higher levels, the percent impurity will be low. This variation in impurity is represented by the black vertical bar along the impurity axis in Figure 5. The left side of Figure 5 suggests that variation in temperature will, over time, be transformed into variations in percent impurity and will result in a run chart similar to
Figure 5. Results of a set of experiments designed to determine the influence of temperature on percent impurity. (Horizontal axis: temperature.)
that shown in Figure 1 when the process is operated in the region of temperature = 270. How could we decrease the variation in percent impurity? One way would be to continue to operate at temperature = 270 but use a controller that allows less variation in the temperature. This would decrease the width of the black horizontal bar in Figure 5 (variation in temperature), which would be transformed into a shorter black vertical bar (variation in impurity). The resulting run chart would show less variation. But this is a “brute force” way of decreasing the variation.
There is another way to decrease the variation in this process. Other effects being equal, it is clear from the right side of Figure 5 that we should change our process’s operating conditions to a temperature of about 295 if we want to decrease the level of impurity. In this example, there is an added benefit from working at these conditions. This benefit is shown in Figure 6. Not only has the level of impurity been reduced, but the variation in impurity has also been reduced! This is because the relationship between impurity and temperature is not as steep in the region of temperature = 295. When the process is operated in this region, it is said to be “rugged” or “robust” with respect to changes in temperature: the process is relatively insensitive to small changes in temperature [11]. This principle of ruggedness is one aspect of the Taguchi philosophy of quality improvement [12]. We make the process insensitive to expected random variations.
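The way a steep or flat response curve transmits temperature variation into impurity variation can be sketched numerically. The curve below is purely hypothetical (it is not the fitted curve of Figure 5), as are the temperature standard deviation and the function names; only the qualitative behavior is meant to match the text.

```python
import random
import statistics

def impurity(temperature):
    """Hypothetical response curve: steep near 270, nearly flat near 295."""
    return 0.5 + 0.002 * (temperature - 297.0) ** 2

def simulate(set_point, temperature_sd=1.0, n_batches=2000, seed=0):
    """Mean and standard deviation of impurity when the temperature
    fluctuates around the set point (normal fluctuations are an assumption)."""
    rng = random.Random(seed)
    values = [impurity(rng.gauss(set_point, temperature_sd))
              for _ in range(n_batches)]
    return statistics.mean(values), statistics.stdev(values)

mean_270, sd_270 = simulate(270.0)
mean_295, sd_295 = simulate(295.0)
print(f"set point 270: mean {mean_270:.3f}, sd {sd_270:.3f}")
print(f"set point 295: mean {mean_295:.3f}, sd {sd_295:.3f}")
```

With the same temperature fluctuation at both set points, the flatter region gives both a lower mean impurity and a much smaller spread, which is the ruggedness argument in miniature.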
Figure 6. Improved control limits for process operated near temperature = 295. (Horizontal axis: temperature; the specification limit is marked SPEC.)
If the process were operated at this new temperature, the corresponding control chart would be similar to that shown in Figure 7, and the corresponding run chart would look like Figure 8. Some persons criticize run charts like Figure 8 as belonging to “gold-plated processes,” processes that are “better than they need to be.” In some cases the criticism might be justified, for example, if the economic consequences of operating at the higher temperature were not justified by the economic consequences of producing such good product. But in many cases there are three arguments that speak in favor of these gold-plated processes. First, management doesn’t have to spend time with customer complaints, and no one wastes time on nonproductive “fire-fighting.” Second, improvement of product often opens new markets. And third, there is a lot of “elbow room” between the percent impurity produced by the improved process and the original specification limit in Figure 8. If the process starts to drift upward (perhaps a heat exchanger is fouling and causing the impurity level to increase), within-spec material can still be produced while the special cause is discovered and eliminated.
5. Quality by design
Consideration of quality in manufacturing should begin before manufacturing starts. This
Figure 8. Percent impurity vs. batch number for a chemical process operated near temperature = 295. (Horizontal axis: batch number; the specification limit is marked SPEC.)
is “quality by design” [13]. Just as there is a producer-consumer relationship between manufacturing and the customer, so, too, is there a producer-consumer relationship between R&D and manufacturing. The manufacturing group (the consumer) should receive from R&D (the producer) a process that has inherent good quality characteristics. In particular, R&D should develop, in collaboration with manufacturing, a process that is rugged with respect to anticipated manufacturing variables [14]. Experimentation at the manufacturing stage is orders of magnitude more costly than experimentation at the R&D stage. As Kackar [13] has pointed out, “[It] is the designs of both the product and the manufacturing process that play crucial roles in determining the degree of performance variation and the manufacturing cost.”
Data-based decisions, whether at the R&D level or at the manufacturing level, often require information that can be obtained most efficiently using statistical design of experiments. Creating such designs requires teamwork among researchers and statisticians. Researchers would agree that it is important for statisticians to understand the fundamentals of the production process. Statisticians would agree that it is important for researchers to understand the fundamentals of experimental design. As Box has stated [15], “If we only follow, we must always be behind. We can lead by using statistics to tap the enormous reservoir of engineering and scientific skills available to us. ... Statistics should be introduced ... as a means of catalyzing engineering and scientific reasoning by way of [experimental] design and data analysis. Such an approach ... will result in greater creativity and, if taught on a wide enough scale, could markedly improve quality and productivity and our overall competitive position.” The statistical literature is filled with information about experimental design and optimization [16-76] and can be consulted for details.
References
1. Shewhart WA. Statistical Method from the Viewpoint of Quality Control. The Graduate School, The Agriculture Department, Washington, DC, 1939.
2. Godfrey AB. The History and Evolution of Quality in AT&T. AT&T Technical Journal 1986; 65(2): 4-20.
3. Bendell A, Disney J, Pridmore WA, Eds. Taguchi Methods: Applications in World Industry. IFS Publications, Springer-Verlag, London, 1989.
4. Ross PJ. Taguchi Techniques for Quality Engineering: Loss Function, Orthogonal Experiments, Parameter and Tolerance Design. McGraw-Hill Book Company, New York, NY, 1988.
5. Joiner BL. The Key Role of Statisticians in the Transformation of North American Industry. Am Stat 1985; 39: 224-227.
6. Juran JM. Juran on Planning for Quality. Macmillan, New York, NY, 1988, pp. 4-5.
7. Deming WE. Quality, Productivity, and Competitive Position. Center for Advanced Engineering Study, Massachusetts Institute of Technology, Cambridge, MA, 1982, p. 22.
8. Grant EL, Leavenworth RS. Statistical Quality Control. 6th ed., McGraw-Hill, New York, NY, 1988.
9. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, NY, 1978, pp. 1-15.
10. Holland PW. Statistics and Causal Inference. J Am Stat Assoc 1986; 81: 945-960. (See also “Comments” and “Reply,” pp. 961-970.)
11. Deming SN. Optimization of Experimental Parameters in Chemical Analysis. In: De Voe JR, Ed. Validation of the Measurement Process. ACS Symposium Series no. 63, American Chemical Society, Washington, DC, 1977, pp. 162-175.
12. Taguchi G. Introduction to Quality Engineering: Designing Quality into Products and Processes. Asian Productivity Organization, Kraus International Publications, White Plains, NY, 1986.
13. Kackar RN. Off-Line Quality Control, Parameter Design, and the Taguchi Method. J Qual Technol 1985; 17: 176-209.
14. Walton M. The Deming Management Method. Dodd, Mead & Co., New York, NY, 1986, pp. 131-157.
15. Box GEP. Technometrics 1988; 30: 1-18.
16. Anderson VL, McLean RA. Design of Experiments: A Realistic Approach. Dekker, New York, 1974.
17. Anon. ASTM Manual on Presentation of Data and Control Chart Analysis. Committee E-11 on Statistical Methods, ASTM Special Technical Publication 15D, American Society for Testing and Materials, 1916 Race Street, Philadelphia, PA 19103, 1976.
18. Barker TB. Quality by Experimental Design. Dekker, New York, NY, 1985.
19. Bayne CK, Rubin IB. Practical Experimental Designs and Optimization Methods for Chemists. VCH Publishers, Deerfield Beach, FL, 1986.
20. Beale EML. Introduction to Optimization. Wiley, New York, NY, 1988.
21. Beveridge GSG, Schechter RS. Optimization: Theory and Practice. McGraw-Hill, New York, NY, 1970.
22. Box GEP, Draper NR. Empirical Model-Building and Response Surfaces. Wiley, New York, 1987.
23. Box GEP, Draper NR. Evolutionary Operation: A Method for Increasing Industrial Productivity. Wiley, New York, 1969.
24. Cochran WG, Cox GM. Experimental Designs. Wiley, New York, NY, 1950.
25. Cornell J. Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data. Wiley, New York, NY, 1981.
26. Daniel C, Wood FS. Fitting Equations to Data. Wiley-Interscience, New York, NY, 1971.
27. Daniel C. Applications of Statistics to Industrial Experimentation. Wiley, New York, NY, 1976.
28. Davies OL, Ed. Design and Analysis of Industrial Experiments. 2nd ed., Hafner, New York, NY, 1956.
29. Davis JC. Statistics and Data Analysis in Geology. Wiley, New York, 1973.
30. Deming SN, Morgan SL. Experimental Design: A Chemometric Approach. Elsevier, Amsterdam, The Netherlands, 1987.
31. Deming WE. Out of the Crisis. Center for Advanced Engineering Study, Massachusetts Institute of Technology, Cambridge, MA, 1986.
32. Deming WE. Some Theory of Sampling. Dover, New York, NY, 1950.
33. Deming WE. Statistical Adjustment of Data. Dover, New York, NY, 1943.
34. Diamond WJ. Practical Experiment Designs. 2nd ed., Van Nostrand Reinhold, New York, NY, 1989.
35. Draper NR, Smith H. Applied Regression Analysis. 2nd ed., Wiley, New York, 1981.
36. Duncan AJ. Quality Control and Industrial Statistics. Revised ed., Irwin, Homewood, IL, 1959.
37. Dunn OJ, Clark VA. Applied Statistics: Analysis of Variance and Regression. 2nd ed., Wiley, New York, 1987.
38. Fisher Sir RA. Statistical Methods for Research Workers. Hafner, New York, NY, 1970.
39. Fisher Sir RA. The Design of Experiments. Hafner, New York, NY, 1971.
40. Fletcher R. Practical Methods of Optimization. 2nd ed., Wiley, New York, NY, 1987.
41. Hacking I. The Emergence of Probability. Cambridge University Press, Cambridge, ENGLAND, 1975.
42. Havlicek LL, Crain RD. Practical Statistics for the Physical Sciences. American Chemical Society, Washington, DC, 1988.
43. Himmelblau DM. Process Analysis by Statistical Methods. Wiley, New York, 1970.
44. Hunter JS. Applying Statistics to Solving Chemical Problems. CHEMTECH 1987; 17: 167-169.
45. Ishikawa K. Guide to Quality Control. Asian Productivity Organization, 4-14, Akasaka 8-chome, Minato-ku, Tokyo 107, JAPAN, 1982. Available from UNIPUB, Box 433 Murray Hill Station, New York, NY 10157, (800) 521-8110.
46. Juran JM, Editor-in-Chief. Juran’s Quality Control Handbook. 4th ed., McGraw-Hill, New York, NY, 1988.
47. Khuri AI, Cornell JA. Response Surfaces: Designs and Analyses. ASQC Quality Press, Milwaukee, WI, 1987.
48. Kowalski BR, Ed. Chemometrics: Theory and Application. ACS Symposium Series 52, American Chemical Society, Washington, DC, 1977.
49. Mallows CL, Ed. Design, Data and Analysis: by Some Friends of Cuthbert Daniel. Wiley, New York, NY, 1987.
50. Mandel J. The Statistical Analysis of Experimental Data. Wiley, New York, NY, 1963.
51. Massart DL, Kaufman L. The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, 1983.
52. Massart DL, Vandeginste BGM, Deming SN, Michotte Y, Kaufman L. Chemometrics, a Textbook. Elsevier Science Publishers, Amsterdam, 1987.
53. Mendenhall W. Introduction to Linear Models and the Design and Analysis of Experiments. Duxbury Press, Belmont, CA, 1968.
54. Miller JC, Miller JN. Statistics for Analytical Chemistry. Wiley, New York, 1988.
55. Montgomery DC. Design and Analysis of Experiments. 2nd ed., Wiley, New York, NY, 1984.
56. Moore DS. Statistics: Concepts and Controversies. Freeman, San Francisco, CA, 1979.
57. Natrella MG. Experimental Statistics. National Bureau of Standards Handbook 91, Washington, DC, 1963.
58. Neter J, Wasserman W. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. Irwin, Homewood, IL, 1974.
59. Norman GR, Streiner DL. PDQ Statistics. B.C. Decker, Toronto, 1986.
60. Scherkenbach WW. The Deming Route to Quality and Productivity: Road Maps and Roadblocks. ASQC Quality Press, Milwaukee, WI, 1986.
61. Scholtes PR. The Team Handbook: How to Use Teams to Improve Quality. Joiner Associates, Inc., 3800 Regent St., P.O. Box 5445, Madison, WI 53705-0445, (608) 238-4134, 1988.
62. Sharaf MA, Illman DL, Kowalski BR. Chemometrics. Wiley, New York, 1986.
63. Shewhart WA. Economic Control of Quality of Manufactured Product. Van Nostrand, New York, NY, 1931.
64. Small BB. Statistical Quality Control Handbook. Western Electric, Indianapolis, IN, 1956.
65. Snedecor GW, Cochran WG. Statistical Methods. 7th ed., The Iowa State University Press, Ames, IA, 1980.
66. Stigler SM. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA, 1986.
67. Taylor JK. Quality Assurance of Chemical Measurements. Lewis Publishers, Chelsea, MI, 1987.
68. Wernimont GT. (Spendley W, Ed.), Use of Statistics to Develop and Evaluate Analytical Methods. Association of Official Analytical Chemists, Washington, DC, 1985.
69. Wheeler DJ. Keeping Control Charts. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN, 37919, (615) 584-5005, 1985.
70. Wheeler DJ. Tables of Screening Designs. 2nd ed., Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN, 37919, (615) 584-5005, 1989.
71. Wheeler DJ. Understanding Industrial Experimentation. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN, 37919, (615) 584-5005, 1987.
72. Wheeler DJ, Chambers DS. Understanding Statistical Process Control. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN, 37919, (615) 584-5005, 1986.
73. Wheeler DJ, Lyday RW. Evaluating the Measurement Process. Statistical Process Controls, Inc., 7026 Shadyland Drive, Knoxville, TN, 37919, (615) 584-5005, 1984.
74. Wilson EB Jr. An Introduction to Scientific Research. McGraw-Hill, New York, NY, 1952.
75. Youden WJ. Statistical Methods for Chemists. Wiley, New York, NY, 1951.
76. Youden WJ, Steiner EH. Statistical Manual of the Association of Official Analytical Chemists. Association of Official Analytical Chemists, Washington, DC, 1975.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990. © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 8
Experimental Design, Response Surface Methodology and Multi Criteria Decision Making in the Development of Drug Dosage Forms
D.A. Doornbos, A.K. Smilde, J.H. de Boer, and C.A.A. Duineveld
University Centre for Pharmacy, University of Groningen, The Netherlands
1. Introduction
If a patient consults a doctor, he or she has a fair chance of leaving the doctor’s office with a prescription. Even in The Netherlands (where the consumption of medicines is relatively low) that chance is about 50%. So almost everybody will know that a drug (the active substance) is not administered in a pure form but has been formulated into a dosage form. This may be a fast disintegrating tablet, a sustained release formulation, a suppository, an ointment, an injectable, and there are many more. These dosage forms have in common that they are prepared with several excipients, that during the production process many process variables influence the properties of the end-product and, most important, that their quality must be excellent and constant. This not only means that their chemical composition must conform to specifications, but more specifically that those properties that influence the response of the organism or the target organ must meet criteria of constant value: they must be rugged. Frequently several criteria must be met simultaneously and often some of these criteria are conflicting.
To sum up: drug dosage forms are produced from a mixture of one or more drugs and some additives, many process variables influence the quality of the dosage form, and criteria for that quality are set and must be maintained. As Stetsko [1] stated: “Pharmaceutical scientists are often confronted with the problem of developing formulations and processes for difficult products and must do so in spite of competing objectives. Pressures, placed on the scientist to balance variables and meet these objectives, can be compounded when limited funds, time and resources require rapid and accurate development activities. Statistical experimental design provides an economical way to efficiently gain the most information while expending the least amount of experimental effort.”
TABLE 1
Factors and optimization criteria for tablets (direct compression).
Compositional factors (quantitative and/or qualitative): filler, binder, disintegrant, lubricant.
Process variables: tablet machine, compression force, mixing time, relative humidity, glidant (y/n).
Criteria: crushing strength, friability, disintegration time, dissolution rate, tablet weight variance, keepability, robustness of these properties to variation of compositional and process variables.
2. Drug dosage forms, factors and criteria
For each type of dosage form a list of the most important composition and process variables can be composed. Moreover, for each dosage form optimization criteria can be given. Some examples are given in Table 1. It can be expected that some or all of the factors interact. This means that the magnitude of the response of a criterion value to a change of the level of a certain factor depends on the level of one or more other factors (a second, third or higher order interaction). In such a case a univariate search for an optimum level of the (compositional or process) factors does not guarantee a real optimum. A multivariate search is indicated: one does not optimize the factors separately, one after the other, but varies the factor levels according to a pre-planned design in a sequential or a simultaneous mode.
In the pharmaceutical literature some examples can be found of the application of the sequential Simplex Method, as advocated by Spendley et al. [2], and the Modified Simplex Method as proposed by Nelder and Mead [3]. Although this is an efficient method if factors are optimized for one single criterion, provided that there are no multiple optima, it is not the first choice if the optimization aims at a number of criteria simultaneously, as is often the case with drug dosage forms. In High Performance Liquid Chromatography attempts were made to solve the multicriteria problem by combining single criteria like Resolution, Analysis time and Number of peaks into a composite criterion, e.g., the well known Berridge Criterion [4], but even a combination of so few single responses gives rise to an ambiguous solution, as has been shown by Debets et al. [5]. This will be even more the case in drug formulation studies.
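Why a univariate search can miss the optimum when factors interact can be shown with a small sketch. The response function, the factor levels and the function names below are hypothetical and chosen only so that the two factors interact; the univariate search is a single pass (first factor, then second factor).

```python
def response(x1, x2):
    """Hypothetical criterion with a strong x1*x2 interaction (a diagonal ridge)."""
    return -(x1 - x2) ** 2 - 0.1 * (x1 + x2 - 10) ** 2

LEVELS = range(0, 11)  # assumed factor levels 0, 1, ..., 10

def one_factor_at_a_time(start=(0, 0)):
    """Optimize x1 with x2 fixed at its start level, then x2 with x1 fixed."""
    x1, x2 = start
    x1 = max(LEVELS, key=lambda level: response(level, x2))
    x2 = max(LEVELS, key=lambda level: response(x1, level))
    return x1, x2

def simultaneous_search():
    """Evaluate every combination of levels and keep the best one."""
    candidates = ((a, b) for a in LEVELS for b in LEVELS)
    return max(candidates, key=lambda point: response(*point))

univariate = one_factor_at_a_time()
multivariate = simultaneous_search()
print(univariate, response(*univariate))
print(multivariate, response(*multivariate))
```

In this made-up case the single univariate pass stops on the flank of the ridge, far from the combination found when both factors are varied together. An exhaustive grid is used here only for clarity; a pre-planned design would need far fewer runs.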
3. Examples of the use of sequential simplex, factorial design, and mixture design techniques in drug formulation studies
3.1 The sequential simplex
An example of the use of a combined criterion in a Modified Simplex optimization of a tablet formulation can be found with Bindschaedler and Gurny [6]. They studied the effect of compression force and Avicel/Primojel ratio on a criterion that was a weighted sum of tablet hardness and disintegration time. Zierenberg [7] used the Modified Simplex method to study the release of clenbuterol (a single criterion) from a polyacrylate system for transdermal application. The factors were concentration of clenbuterol and film thickness. In both studies a good formulation was achieved, but it must be recognized that they worked with simple two-factor systems, the Simplex being a triangle that can be followed visually in two-dimensional space, while in most formulation studies far more factors are involved.
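A weighted-sum criterion of the kind mentioned above can be sketched as follows. The weights, units and tablet values are hypothetical and are not taken from the cited study; the point is only that one number can hide quite different combinations of the underlying responses.

```python
def combined_criterion(hardness, disintegration_time,
                       weight_hardness=1.0, weight_time=-0.05):
    """Weighted sum of two responses (weights are assumed, not from the source).

    Hardness counts in favor of a formulation, disintegration time against it.
    """
    return weight_hardness * hardness + weight_time * disintegration_time

# Two hypothetical tablets: (hardness in kg, disintegration time in s).
tablet_a = (8.0, 60.0)
tablet_b = (10.0, 100.0)

score_a = combined_criterion(*tablet_a)
score_b = combined_criterion(*tablet_b)
print(score_a, score_b)
```

Both hypothetical tablets receive essentially the same score although one is harder and disintegrates more slowly than the other, so the combined value alone does not say which trade-off was obtained.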
3.2 Factorial design and fractional factorial design
Quite a few studies on the use of factorial design in formulation research have been reported. In the period 1981-1989 we counted ca. 60 papers. In Tables 2, 3 and 4 the objects of these studies are summarized. Among the designs used are full factorial designs with 2, 3, 4, 5 or even 6 factors, mostly at 2 levels, but incidentally at 3 or 4 levels. There were some designs with 3, 4 or 5 factors at 2 levels and another factor at 3 levels. The smallest design was a $2^2$ factorial design; the largest number of experiments was 96 in a $2^5 \times 3$ design. In some studies a factorial design was augmented with a Star Design to give a Central Composite Design.

TABLE 2
Dosage forms optimized in papers 1981-1989.
Fast disintegrating tablets
Effervescent tablets
Slow-release tablets
Sintered tablets
Lozenges
Capsules
Solid dispersions
Suspensions
Solutions

TABLE 3
Unit Operations studied in papers 1981-1989.
Tabletting by direct compression
Granulation
Microencapsulation
Film coating

TABLE 4
Optimization criteria for tablets.
Crushing strength
Friability
Disintegration time
Weight variation
Dissolution behaviour
Bioavailability
Chemical stability
Solubility

There are two papers that deserve special mention. In a recently published study Chowhan et al. [8] evaluated the effect of 4 process variables at 3 levels each on friability, dissolution, maximum attainable crushing strength and tablet weight variation. In a full $3^4$ factorial design 81 experiments must be performed, but with the aid of a computer program for Computer Optimized Experimental Design (COED) the 22 most informative factor combinations were selected. Quadratic response surface models were calculated for each criterion. Optimum regions were found by pairwise overlaying the individual contour plots. The other one is the recent paper by Chariot et al. [9], in which they describe the way they selected a set of experiments from a factorial lay-out according to a D-optimal design, using the program NEMROD. Instead of the 72 experiments according to a $2^3 \times 3^2$ factorial design, 11 experiments were selected, allowing the estimation of a regression equation with 5 linear, 2 quadratic and 3 two-factor interaction terms.
In some papers factorial designs have been fractionated to $2^{4-1}$, $2^{5-2}$ or $2^{6-3}$ designs, thereby strongly reducing the experimental effort at the expense of information about higher order interactions. In most studies the factorial designs were only used to identify the factors that most significantly influenced the response under study. But in a number of cases a mathematical model for each of the responses was postulated and the model fitted to the data, thus applying Response Surface Methodology. A response surface depicts one response, e.g., tablet crushing strength, as a function of some independent variables, e.g., compression force and concentration of binder, in general compositional and process variables. The goal of response surface studies is to obtain a regression model that provides a means of mathematically evaluating changes in the response due to changes in the independent variables. Mostly experimental models are used: polynomials of first, but preferably second or third, order to describe curvature of the (hyper)surface. From the response surface (for two factors a three-dimensional surface results, mostly depicted as a two-dimensional contour plot), optimal regions for the responses can be predicted. The prediction error will depend on the chosen design, on the error in the measured response at the design points and on the quality of the fit of the postulated model.
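The run counts quoted in this section follow directly from enumerating the level combinations. The sketch below does so and also builds a half fraction of a two-level design; the function names are hypothetical, and the half fraction uses the textbook construction (last factor set to the product of the others), which is not necessarily the fractionation used in the cited papers.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """All level combinations; levels_per_factor gives the number of levels per factor."""
    return list(product(*(range(n) for n in levels_per_factor)))

def half_fraction(n_factors):
    """Two-level half fraction: a full design in n_factors - 1 factors, with the
    last factor set to the product of the others (coded levels -1 and +1)."""
    runs = []
    for base in product((-1, 1), repeat=n_factors - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + (last,))
    return runs

print(len(full_factorial([2, 2, 2, 2, 2, 3])))  # five factors at 2 levels, one at 3
print(len(full_factorial([3, 3, 3, 3])))        # four factors at 3 levels
print(len(full_factorial([2, 2, 2, 3, 3])))     # three at 2 levels, two at 3
print(len(half_fraction(4)))                    # half of a 16-run two-level design
```

The half fraction halves the number of runs at the price of confounding: in this construction the effect of the last factor cannot be separated from the highest-order interaction of the others.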
3.3 Mixture designs
The above mentioned studies with factorial and fractional factorial designs have in common that, with some exceptions, only the effects of process variables and qualitative compositional variables have been studied. However, in drug formulations in most cases quantitative compositional variables have significant effects. In those cases where quantitative compositional variables have been studied they were treated as factorial variables,
Figure 1. Contour plot and levels of crushing strength (kg) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right). (Triangle vertices: β-lactose, Avicel, α-lactose 1 aq.)
Figure 2. Contour plot and levels of disintegration time (s) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right). (Triangle vertices: β-lactose, Avicel, α-lactose 1 aq.)
thereby neglecting the advantages of the mixture design approach. In Mixture Design Methodology use is made of the fact that the fractions of the components sum up to one:

$$\phi_1 + \phi_2 + \phi_3 = 1$$

Given a polynomial of a certain degree, fewer experiments than would be necessary with a factorial design suffice for the estimation of the coefficients of that polynomial, using this mixture constraint.
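The saving that the mixture constraint brings can be sketched by counting coefficients and by fitting a quadratic mixture model from a six-point design. The sketch assumes the standard Scheffé canonical polynomial and a simplex-lattice design (pure components plus 50:50 blends); the response values and function names are hypothetical.

```python
from itertools import combinations
from math import comb

def n_coefficients_factorial(q, degree):
    """Full polynomial of the given degree in q independent factors."""
    return comb(q + degree, degree)

def n_coefficients_mixture(q, degree):
    """Scheffe canonical polynomial of the given degree in q mixture components."""
    return comb(q + degree - 1, degree)

def fit_scheffe_quadratic(pure, binary):
    """Coefficients from responses at pure components and 50:50 binary blends.

    pure:   {i: response at pure component i}
    binary: {(i, j): response at the 50:50 blend of components i and j}
    """
    beta = dict(pure)
    for (i, j), y in binary.items():
        beta[(i, j)] = 4.0 * y - 2.0 * (pure[i] + pure[j])
    return beta

def predict(beta, fractions):
    """Evaluate the quadratic mixture model at a composition summing to one."""
    q = len(fractions)
    y = sum(beta[i] * fractions[i] for i in range(q))
    for i, j in combinations(range(q), 2):
        y += beta[(i, j)] * fractions[i] * fractions[j]
    return y

# Hypothetical responses for a three-component mixture.
pure = {0: 10.0, 1: 20.0, 2: 30.0}
binary = {(0, 1): 18.0, (0, 2): 20.0, (1, 2): 25.0}
beta = fit_scheffe_quadratic(pure, binary)

print(n_coefficients_factorial(3, 2), n_coefficients_mixture(3, 2))
print(predict(beta, (0.5, 0.5, 0.0)), predict(beta, (1 / 3, 1 / 3, 1 / 3)))
```

For three components a second-degree model has six coefficients under the mixture constraint, against ten when the three fractions are treated as independent factorial variables, so six well-chosen blends are enough to determine it.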
Figure 3. Contour plot and levels of friability (%) of placebo tablets containing sodium starch glycolate as a disintegrant, compressed at 10 kN (left) or 20 kN (right).
Figure 4. Combined contour plot of crushing strength, disintegration time and friability of placebo tablets containing sodium starch glycolate as disintegrant, compressed at 10 kN (left) or 20 kN (right). (Triangle vertices: β-lactose, Avicel, α-lactose 1 aq.)
Up till now ca. 10 papers have appeared on the use of mixture designs in drug formulation studies and in studies into solubility of drugs in mixed solvents; this is relatively few compared with the large number of publications on applications of mixture designs in chromatography, in particular liquid chromatography. An example can be found with Van Kamp et al. [10, 11], who studied several combinations of disintegrants, filler-binders and fillers as pseudocomponents with drugs and who added compression force as a process variable.
In our research group we developed software for the optimization of liquid and thin-layer chromatographic separations, the program POEM [12]. In cooperation with the department of Pharmaceutical Technology of our University Centre for Pharmacy we developed at the same time the program OMEGA [13] for optimization of drug formulations using the mixture design technique for binary, ternary or quaternary mixtures or mixtures with pseudocomponents. In both programs several statistical criteria can be selected to evaluate the quality of the models that can be chosen: the programs offer the choice between linear, quadratic, special cubic and cubic models. Moreover, one can choose between the use of one single criterion, combined criteria (only for POEM) or the use of the MCDM technique (see below).
4. Sequential designs versus simultaneous designs
One must bear in mind that in Response Surface Methodology based on simultaneous designs, each response will be represented by a separate response surface. If several criteria are deemed important, then in each design point all criteria should be measured and modelled, if necessary with models of different order. This will result in a number of regression equations and the corresponding response surfaces. These equations describe the whole factor space studied. The advantages of the use of Response Surface Methodology over sequential methods like Simplex are:
- knowledge of the dependence of criteria on factors is obtained over the whole factor space studied;
- as many criteria can be studied as are deemed important, without extra experimental effort;
- optimal design theory allows an optimal spread of design points over the factor space.
5. Combinations of mixture and factorial designs: sparse designs
If not only process variables but also compositional variables influence the responses, with the ultimate possibility of all variables interacting, a combined design must be used, as shown in Figure 5. This will increase the experimental effort considerably, unless efficient fractionation can be accomplished. So far in the literature on formulation research only factorial designs have been fractionated; we thought it would be challenging to develop fractionated combined mixture-factorial designs
Figure 5. Combined mixture factorial design for a three-component mixture with two process variables. Extra design points are used to judge quality of fit.
TABLE 5
Simple scheme for model choice with mixture factorial designs.
Mixture terms: m1, m2, m3, m1*m2, m1*m3, m2*m3, m1*m2*m3.
Process-variable terms: 1, f1, f2, f1*f2.

Figure 6. Starting-point for projection and rotation to find sparse designs. A $2^3$ factorial design is projected on the plane containing the mixture triangle.
Figure 7. Contraction of design points outside the mixture triangle to the boundaries of the triangle.
Figure 8. The combination design resulting from a $2^{5-1}$ design, using the contraction pictured in Figure 7. Left-right and lower-upper triangles represent the factor levels -1 and +1.
Figure 9. Rotation and contraction of all design points to the boundaries of the triangle to construct a sparse design.
with a concomitant hierarchy of polynomial models. From Table 5 the models can be constructed. Optimality of the developed designs should be judged; we will use the measures of G-, V- and D-optimality. We will restrict our research to four-component mixtures and three process variables. In Figures 6 to 10 two suggested strategies are shown for a three-component mixture and two process variables. Figure 6 shows the starting point, Figure 7 a successive projection and Figure 8 the resulting sparse design. The second strategy is shown in Figure 6 and Figure 9: a rotation over 30° followed by a projection. Figure 10 shows the resulting sparse design.
Figure 10. The combination design resulting from a 2^(5-1) design, using the rotation and contraction pictured in Figure 9. Left-right and lower-upper triangles represent the factor levels -1 and +1.
6. Criteria; multi criteria decision making
As was said in one of the preceding sections, in optimization studies of drug formulations often several criteria must be met simultaneously. Combined criteria do not offer an unambiguous solution to the multicriteria problem; the same value of the criterion can be found with an indefinite number of combinations of the controllable factors. The solution we have chosen for chromatography [Smilde et al. 14] as well as for drug formulation studies [de Boer et al. 15] is Multi Criteria Decision Making (MCDM), based on the concept of Pareto-optimality. The MCDM method does not make preliminary assumptions about the weighting factors; the various responses are considered explicitly. MCDM makes predictions about mixtures in the whole factor space, therefore it cannot be used in combination with sequential optimization methods. It can easily be understood for two criteria and a three-component mixture. After the selection of models for the criteria and regression on the values of the criteria in the design points, the factor space, a triangle, is scanned with a predefined step size, e.g., 2%. In each scan point both criteria are calculated and their values used as coordinates in a two-dimensional graph. From the resulting set of points (each of them representing a mixture composition) the Pareto-optimal (PO) points are selected: a point is Pareto-optimal if there exists no other point in the design space which yields an improvement in one criterion without causing a degradation in the other. By evaluating quantitatively the pay-off between the two criteria a choice can be made between the PO points and the mixture compositions belonging to them. The MCDM method was implemented in the programs POEM and OMEGA. Using the technique, there proved to be a slight drawback. In the conventional MCDM method
Figure 11. Target-MCDM: target value for disintegration time set at 200 s, crushing strength maximized. (Horizontal axis: disintegration time (s).)
Figure 12. Tolerance-MCDM: the MCDM plot. (Horizontal axis: disintegration time (s).)
are not PO but have a predefined maximum deviation in the criterion values compared to the PO points.
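The grid scan and Pareto selection described in this section can be sketched as follows. This is a minimal illustration and not the POEM or OMEGA implementation: the two response models `crushing_strength` and `disintegration_time` are hypothetical placeholder polynomials, both criteria are assumed to be maximized (the disintegration time is negated on the assumption that a short time is desirable), and the step size of 2% follows the example in the text.

```python
import numpy as np

def simplex_grid(step=0.02):
    """All three-component mixtures (x1, x2, x3) with x1 + x2 + x3 = 1 on a regular grid."""
    n = round(1 / step)
    return np.array([(i / n, j / n, (n - i - j) / n)
                     for i in range(n + 1) for j in range(n + 1 - i)])

def pareto_mask(criteria):
    """Boolean mask of Pareto-optimal rows; every column is to be maximized."""
    mask = np.ones(len(criteria), dtype=bool)
    for k, c in enumerate(criteria):
        # c is dominated if another point is at least as good in every
        # criterion and strictly better in at least one
        dominated = np.all(criteria >= c, axis=1) & np.any(criteria > c, axis=1)
        mask[k] = not dominated.any()
    return mask

# Hypothetical fitted response models (assumptions for illustration only)
def crushing_strength(x):
    return 50 * x[:, 0] + 80 * x[:, 1] + 30 * x[:, 2] + 60 * x[:, 0] * x[:, 1]

def disintegration_time(x):
    return 300 * x[:, 0] + 400 * x[:, 1] + 100 * x[:, 2]

grid = simplex_grid(0.02)                       # scan points in the mixture triangle
criteria = np.column_stack([crushing_strength(grid),
                            -disintegration_time(grid)])
po = pareto_mask(criteria)
pareto_mixtures = grid[po]                      # compositions of the PO points
```

Plotting the two columns of `criteria[po]` against each other gives the pay-off curve from which a compromise composition can be chosen.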
[Ternary diagram; vertices: alpha-lactose, beta-lactose, dried potato starch.]
7. Conclusion
Sequential optimization strategies for drug dosage forms have found limited applicability; factorial and mixture designs, however, have successfully been used. Promising new efficient combined factorial and mixture designs are being developed. The MCDM to be used with simultaneous optimization strategies allows decisions to be made based on the pay-off between optimization criteria. Quality of drug formulations will be improved using these techniques.
Acknowledgement
We are indebted to G.E.P. Box and J.S. Hunter for providing us with the idea of projection designs (Fig. 6).
References
1. Stetsko G. Drug Dev Ind Pharm 1986; 12: 1109-1123.
2. Spendley W, Hext GR, Himsworth FR. Technometrics 1962; 4: 441.
3. Nelder JA, Mead R. Comput J 1965; 7: 308.
4. Berridge JC. J Chromatogr 1982; 244: 1.
5. Debets HJG, Bajema RL, Doornbos DA. Anal Chim Acta 1983; 151: 131.
6. Bindschaedler C, Gurny R. Pharm Acta Helv 1982; 57(9): 251-255.
7. Zierenberg B. Acta Pharm Techn 1985; 31(1): 17-21.
8. Chowhan ZT, Amaro AA. Drug Dev Ind Pharm 1988; 14(8): 1079-1106.
9. Chariot M, Lewis GA, Mathieu D, Phan-Tan-Luu R, Stevens HNE. Drug Dev Ind Pharm 1988; 14(15-17): 2535-2556.
10. Van Kamp HV. Thesis Groningen 1987.
11. Van Kamp HV, Bolhuis GK, Lerk CF. Pharm Weekblad Sci Ed 1987; 9: 265-273.
12. Predicting Optimal Eluens Composition (POEM), Copyright Research Group Chemometrics (Head: Prof DA Doornbos), University Centre for Pharmacy Groningen, The Netherlands. (Demo available).
13. Optimal Mixture Evaluation with Graphical Applications (OMEGA), Copyright Research Group Chemometrics (Head: Prof DA Doornbos), University Centre for Pharmacy Groningen, The Netherlands. (Demo available).
14. Smilde AK, Knevelman, Coenegracht PMJ. J Chromatogr 1986; 369: 1.
15. De Boer JH, Smilde AK, Doornbos DA. Acta Pharm Technol 1988; 34(3): 140-143.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 9
The Role of Exploratory Data Analysis in the Development of Novel Antiviral Compounds
P.J. Lewi, J. Van Hoof, and K. Andries
Janssen Research Foundation, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
1. Introduction
Exploratory data analysis is a part of the inductive approach to scientific discovery. Where deductive methods proceed from established models to planned experiments and the collection of new experimental facts, induction works the other way around. Here data are collected and analysed in order to obtain a tentative working hypothesis. Induction and deduction are complementary. Once a working hypothesis has been obtained by induction, it must be verified and confirmed by independent investigators. Eventually it may gain the status of an established model and become part of the deductive process. This has been described as the ‘arch of knowledge’ [Oldroyd 1986]. Induction as a scientific method was founded by Francis Bacon [1620] at almost the same time as René Descartes developed his method of deduction [1637]. It appears that inductive methods are applied most profitably in those areas where formal models are lacking. This is the case in many fields of biology and medicine, especially in the search for novel therapeutic agents. The development of a new medicine requires the synthesis of some 4,000 new synthetic chemical compounds yearly, each of which has to be tested in numerous batteries of screening tests. This is essentially a Baconian approach to scientific discovery. Incidentally, it was Francis Bacon who first proposed to collect experimental observations into ‘tables’, which are then to be explored systematically in order to yield a ‘first vintage of a law’. Although Bacon was a practising lawyer, his method led him to remarkable insights, for example that ‘heat is related to motion’ [Quinton 1980]. The exploratory method is also referred to as the Edisonian approach. Indeed, the development of a practical incandescent lamp by T.A. Edison seems to have been the result of several thousand trials, rather than of theoretical considerations.
Multivariate data analysis as a method of exploration depends on the tabulation of adequate data, as well as on the visualization of relevant relationships in these data. Often these relationships are not apparent to the unaided eye. For this reason, we need an instrument for looking into tabulated data. Such an instrument may be called a ‘datascope’, by analogy with the microscope and telescope. A drop of water may appear clear and limpid to the naked eye no matter how long we look at it. But under the microscope, a whole microcosmos is revealed, enough to have allowed its inventor Antoni Van Leeuwenhoek (1632-1723) to write more than 500 communications to the Royal Society in London. A datascope may be thought of as a personal computer equipped with appropriate software for multivariate data analysis. In this paper we describe how the ‘datascope’ has led to a relevant discovery in the search for effective antiviral drugs. The exploratory data analysis described in this paper has been performed by means of the SPECTRAMAP program. SPECTRAMAP is a trademark of Janssen Pharmaceutica NV. Information about this software can be obtained from the first author.
2. Method
The method of exploratory data analysis that is discussed here has been called spectral map analysis (SMA). Originally, SMA was developed for the visualization (or mapping) of activity spectra of chemical compounds that have been tested in a battery of pharmacological assays [Lewi 1976, 1989]. Activity spectra are represented in our laboratory in the form of bar charts representing the effective doses of a given compound with respect to the individual tests [Janssen, 1965]. Some compounds possess very similar spectra of activity, even if they differ in average potency. Other compounds may have widely dissimilar activity spectra, even when they have the same average potency. The problem of classifying compounds with respect to their biological activity spectra is a multidimensional problem, which can be solved by factor analytic methods. Basically, SMA involves the following steps: (1) logarithmic transformation of a data table X with n rows and m columns, which produces Y, (2) subtraction of the corresponding row- and column-means from each element in the transformed table Y (double-centering), yielding Z, (3) calculation of the variance-covariance matrix V from Z, (4) extraction of orthogonal factors F from the variance-covariance matrix V, (5-6) calculation of the coordinates of the rows and the columns along the computed factor axes, which are represented by the factor scores S and the factor loadings L, and finally (7) biplot of the rows and columns in a plane spanned by the first two factors. In algebraic notation, the procedure can be written as:
$$Y = \log X \tag{1}$$
$$Z_{ij} = Y_{ij} - Y_{i.} - Y_{.j} + Y_{..} \tag{2}$$
$$V = \frac{1}{n} Z'Z \tag{3}$$
$$\Lambda = F'VF \quad \text{with} \quad F'F = I \tag{4}$$
where I is the unit matrix of factor space and where Λ is a diagonal matrix which carries the factor variances (eigenvalues) on its principal diagonal. We assume that factors are arranged in decreasing order of their corresponding factor variances. The above algorithm is equivalent to a singular value decomposition (SVD) of the double-centered logarithmic matrix Z, as it can be shown that the table Z can be reconstructed from the factor coordinates S and L [Mandel 1982]:
$$Z = SL'$$
The biplot of SMA is a representation of the rows and columns of the data table in a plane diagram spanned by the first two columns of S and L [Gabriel 1971]. Note that the scaling of factor scores S and of factor loadings L in steps (5-6) is symmetrical in the sense that their variances are equal to the square roots of the factor variances (singular values).
SMA differs from logarithmic principal components analysis [PCA, Hotelling 1933] only in the second step (2) of the algorithm (Fig. 1). Ordinary PCA uses column-centering:
$$Z_{ij} = Y_{ij} - Y_{.j}$$
instead of the double-centering which is applied in SMA. Although the algorithmic distinction between PCA and SMA may seem trivial, its implication is far-reaching [Schaper and Kaliszan 1987]. In SMA we obtain that both the representations of the rows and the columns of the data table are centered about the origin of factor space. In column-centered PCA we generally find that only the representations of the rows are centered about the origin of factor space. In terms of effects of compounds observed in a battery of tests, SMA corrects simultaneously for differences of average potency between compounds and for differences of average sensitivity between tests. Hence, in SMA all absolute aspects of the data are removed as a result of double-centering. What remains are differential aspects, which can be expressed in terms of ratios (as a result of the preliminary logarithmic transformation). These differential aspects are called contrasts. They refer to the specificities or preferences of the various compounds for the different tests. Vice versa, contrasts can also be understood as specificities or preferences of the
[Figure 1 flow diagram. Principal Components Analysis (PCA): logarithms, then subtraction of column-means, leaving size + contrasts. Spectral Map Analysis (SMA): logarithms, then subtraction of row- and column-means, leaving contrasts. Both then proceed through: variance-covariance matrix, extraction of factors, coordinates along factor axes, biplot.]
Figure 1. Schematic diagram of principal component analysis (PCA) and of spectral map analysis (SMA). The distinction, which lies in the type of centering (column-wise or both row- and column-wise), has far-reaching implications as is explained in the text.
various tests for the different compounds. Stated otherwise, SMA analyses interactions, which are always mutual, between compounds and tests [Lewi 1989]. It is erroneous to maintain that the second and third factors of PCA are identical to the first and second factors of SMA. Indeed, the factors extracted from column-centered PCA usually contain a mixture of the size component and of the contrasts (Fig. 1). The size component accounts for differences in average potency of the compounds. In PCA, this size component cannot be readily separated from the components of contrasts, although the first component usually expresses the largest part of the size component.
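The steps of the method can be sketched in a few lines of NumPy. This is an illustrative sketch, not the SPECTRAMAP program: it is unweighted, it obtains the factors from an SVD of Z rather than from the variance-covariance matrix, and the symmetric scaling is assumed to mean that scores and loadings are each multiplied by the square roots of the singular values of Z. The function name `spectral_map` and the flag `double_centering` are hypothetical.

```python
import numpy as np

def spectral_map(X, n_factors=2, double_centering=True):
    """Log transform, centering, SVD and symmetrically scaled biplot coordinates."""
    Y = np.log(np.asarray(X, dtype=float))        # step (1): logarithms
    Z = Y - Y.mean(axis=0, keepdims=True)         # column-centering (ordinary PCA)
    if double_centering:
        # row-centering the column-centered table gives
        # Z_ij = Y_ij - Y_i. - Y_.j + Y_..  (step (2) of SMA)
        Z = Z - Z.mean(axis=1, keepdims=True)
    U, sv, Ft = np.linalg.svd(Z, full_matrices=False)   # Z = U diag(sv) F'
    root = np.sqrt(sv[:n_factors])
    S = U[:, :n_factors] * root                   # row coordinates (factor scores)
    L = Ft[:n_factors].T * root                   # column coordinates (factor loadings)
    explained = sv**2 / np.sum(sv**2)             # share of total variance per factor
    return Z, S, L, explained
```

With all factors retained, `S @ L.T` reproduces Z. Multiplying any row or column of X by a constant leaves Z unchanged when `double_centering=True`, which is the sense in which differences in average potency and average sensitivity are removed.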
3. Application
Common cold or influenza is caused by rhinovirus infection. There are 100 different types of rhinoviruses, each with its own antigenic characteristics. In our laboratory all 100 serotypes of rhinoviruses have been tested against a panel of 15 antiviral compounds. This resulted in a table with 100 rows and 15 columns. The values in this table express the concentration of a particular substance required to inhibit half of the viral particles in a culture of a given serotype [Andries et al. 1990]. Inhibitory concentrations are inversely related to the potency of a compound in a given test or, alternatively, to the sensitivity of a test for a given compound. From these data it appeared that the compounds differ strongly
[Figure 2 plot title: SPECTRAMAP, 100 Rhinoviruses and 15 Antiviral Compounds]
Figure 2. Spectral map derived from a table of 100 viral serotypes (hexagons) and 15 compounds (squares). Areas of hexagons and squares are proportional to the average sensitivity of the serotypes and to the average potency of the compounds. Serotypes and compounds that are at a distance from the center and in the same direction show high contrasts as a result of their specific interaction. The separation of the 100 serotypes into two distinct groups formed the basis of a new hypothesis for the mode of interaction of antiviral compounds with the rhinoviruses.
in their activity spectra against the 100 serotypes, and this irrespective of their average potency. Looking at the data, it was not clear why some of the compounds were active against a particular group of serotypes while leaving the other serotypes intact, and why some other compounds had no effect on the former group while being active against the latter. This, of course, presents an ideal problem for SMA, since the interest is in specific interactions of therapeutic agents with biological systems, independently of the average potency of the compounds and independently of the average sensitivity of the viral serotypes. The data were analyzed according to the SMA method described above after transformation of the original data into reciprocal values. This is required by the fact that inhibitory concentrations of the compounds are inversely related to their antiviral activity. The resulting biplot is shown in Figure 2. Three reading rules apply to this SMA biplot. First, hexagons refer to the 100 rows (serotypes) of the table, while the squares identify the 15 columns (compounds). Second, areas of the hexagons are proportional to the average sensitivity of the viruses while the areas of the squares are proportional to the average potency of the compounds. Most importantly, the third rule defines the positions of the hexagons and squares in the biplot.
Those serotypes that are selectively destroyed by a particular compound will be attracted by it. Those serotypes that are left untouched by the compound will be repelled by it. Similarly, because of the mutuality of interactions, compounds are attracted on the biplot by a serotype against which they are selectively active, while they are repelled by a serotype that is not inhibited by them. The center of the biplot is indicated by a small cross near the middle of the plot and represents the point which is devoid of contrast. Compounds that are close to the center are active against most of the 100 serotypes. The further away from the center, the greater the specificity of compounds for the various serotypes, and vice versa. Compounds that are at a great distance from one another show large contrasts. Conversely, serotypes that are far apart also exhibit large contrasts. The two-dimensional biplot of Figure 2 accounts for 70 percent of the total variance in the logarithmically transformed and double-centered data table.
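The role of the reciprocal transformation and the meaning of the 70 percent figure can be made explicit with a short derivation. This is a sketch in the notation of Section 2, with x_ij taken to be the inhibitory concentration; the label Z^(recip) and the symbols λ_k (the diagonal elements of Λ) are introduced here for illustration only.

$$\log(1/x_{ij}) = -\log x_{ij} \quad\Rightarrow\quad Z^{(\mathrm{recip})}_{ij} = -\left(Y_{ij} - Y_{i.} - Y_{.j} + Y_{..}\right) = -Z_{ij}$$

$$\frac{\lambda_1 + \lambda_2}{\sum_k \lambda_k} \approx 0.70$$

Taking reciprocals therefore leaves the factor variances unchanged but reverses the sign of every contrast. Only after this reversal does a serotype lying in the same direction as a compound indicate selective sensitivity, which is what the attraction rule assumes.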
4. Result and discussion
The most striking feature of the biplot of Figure 2 is the separation of the 100 serotypes into two distinct groups. This immediately suggested the existence of two classes of serotypes. Left from the center we find a group of serotypes which is more sensitive to the compounds displayed at the left (including the WIN compound). On the right from the center we observe a larger group of serotypes which is more sensitive to compounds on the right (among which the MDL and DCF compounds). When the chemical structures of the individual compounds were compared with their position on the biplot, it appeared that molecules on the left contain long aliphatic chains, while those on the right possess polycyclic structures. Molecules that were active against most or all serotypes and which appeared near the center of the map possess both features, i.e., at the same time an aliphatic and a cyclic part. It has been established that antiviral compounds bind to a hydrophobic pocket inside the protein envelope of the virus [Andries 1990]. From our exploratory analysis it was induced that there are two different types of drug-binding pockets which have evolved from a common ancestor. One type of pocket is more elongated and is present in the leftmost group of serotypes. The other type of pocket is wider and appears in the rightmost group of serotypes. This working hypothesis is supported by differences among the amino-acid sequences of the proteins that line the walls of the two pockets. It is also strengthened by differences between the clinical symptoms that are associated with common cold infections produced in the two groups of serotypes. All these observations tend to confirm the existence of two drug-defined groups of rhinoviruses [Andries 1990]. A practical implication of the two-group hypothesis is the clear directions that can be given to organic chemists for synthesis of appropriate compounds that can bind to both types of rhinoviruses.
Another benefit lies in the rational selection of a small subset of rhinoviruses which can serve as an effective reduced screening panel. This greatly simplifies the work involved in screening newly synthesized antiviral compounds. The inductive exploratory approach also demonstrated that synthetic drugs can produce relevant insight into the structure of large proteins.
References
Andries K, Dewindt B, Snoeks J, Wouters L, Moereels H, Lewi PJ, Janssen PAJ. Two groups of rhinoviruses revealed by a panel of antiviral compounds present sequence divergence and differential pathogenicity. J Virology 1990; 64: 1117-1123.
Bacon F. Novum Organon Scientiarum. 1620. Modern edition: London: William Pickering, 1899.
Descartes R. Discours de la méthode pour bien conduire sa raison et pour découvrir la vérité dans les sciences. La dioptrique, les météores et la géométrie. Jan Maire, Leyden, 1637.
Gabriel KR. The biplot graphic display of matrices with applications to principal components analysis. Biometrika 1971; 58: 453-467.
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933; 24: 417-441.
Janssen PAJ, Niemegeers CJE, Schellekens KHL. Is it possible to predict the clinical effects of neuroleptic drugs (major tranquilizers) from animal data? Arzneim Forsch (Drug Res) 1965; 15: 104-117.
Lewi PJ. Spectral Map Analysis. Analysis of contrasts, especially from log-ratios. Chemometrics Intell Lab Syst 1989; 5: 105-116.
Lewi PJ. Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 1976; 26: 1295-1300.
Mandel J. Use of the singular value decomposition in regression analysis. The Am Statistician 1982; 36: 15-24.
Oldroyd D. The arch of knowledge. An introductory study of the history of philosophy and methodology of science. New York, NY: Methuen, 1986.
Quinton A. Francis Bacon. Oxford: Oxford University Press, 1980.
Schaper K-J, Kaliszan K. Application of statistical methods to drug design. In: Mutschler E, Winterfeldt E, eds. Trends in medicinal chemistry. Proc 9th Int Symp Med Chem, Berlin. Weinheim, FRG: VCH, 1987.
CHAPTER 10
Some Novel Statistical Aspects of the Design and Analysis of Quantitative Structure Activity Relationship Studies
D.M. Borth and M.A. Dekeyser
Research Laboratories, Uniroyal Chemical Ltd, P.O. Box 1120, Guelph, Ontario, N1H 6N3, Canada
Abstract
A case study in developing Quantitative Structure Activity Relations (QSAR) is presented. (QSAR involves using statistical techniques to correlate molecular structure with biological activity.) Two novel statistical aspects of the study are emphasized: (i) the use of a new technique for selecting molecules to synthesize and test in order to maximize the structure vs. activity information for a given amount of chemical synthesis effort, and (ii) the use of a statistical technique called censored data regression to include information from molecules for which only a lower bound on the measure of biological activity (ED50) was available.
1. Introduction
2,4-Diphenyl-1,3,4-oxadiazin-5-ones have good acaricidal activity [Dekeyser et al., 1987], especially against the twospotted spider mite, Tetranychus urticae, which is a serious plant pest. This paper is concerned with the development of quantitative structure-activity relationships (QSAR) required to obtain the most acaricidally active member in 2-(4-methylphenyl)-4-(substituted)phenyl-1,3,4-oxadiazin-5-ones (Fig. 1). Acaricidal activity in this series was first discovered with the compound in Figure 1, with R = H. A previous QSAR study revealed that replacement of the methyl (CH3) group resulted in greatly reduced acaricidal activity; thus, our attention was devoted to introducing various substituents, R, in place of H. The number of possible substituents is very large and it would be impractical to synthesize and test them all. This was the basic reason for adopting the QSAR approach. The strategy in the QSAR study reported in this paper was (i) to select a statistically meaningful yet synthetically practical subset of substituents, (ii) to synthesize the various analogues, (iii) to biologically test the compounds made, (iv) to statistically analyze the data to develop an equation relating biological activity to the physicochemical parameters of the substituents, (v) to use the equation to predict the activity of the analogues not yet made, and (vi) to synthesize and test a number of analogues (including those predicted to be most active) to validate the predictions. The emphasis in this paper is on some novel aspects of items (i) and (iv). More details of the entire study are reported by Dekeyser and Borth [1990].
Figure 1. Chemical structure to be optimized by choice of substituent R.
2. Substituent selection
A list of 433 substituents of known physicochemical parameter values was rated according to difficulty of synthesis. Of the 433 potential analogues, 78 were judged to be impractical to synthesize. The remaining 355 compounds were rated for difficulty on a scale from 1 to 8, with a rating of 1 indicating a relatively easy analogue to synthesize. (A rating of 8 indicates that a compound is expected to take about 8 times as long to synthesize as a compound with a rating of 1.) The primary sources for the physicochemical parameters were Hansch and Leo [1979] and Exner [1978]. For some substituents, values not previously reported were obtained by interpolation and extrapolation from values for closely related substituents [Relyea, 1989]. From this list a set of 20 compounds was chosen for synthesis, using a method of selection developed by Borth et al. [1985]. The selection method took into account: (i) the estimated difficulty of synthesis for each potential analogue, and (ii) the amount of information (defined as expected change in statistical entropy) which each potential analogue provides for making predictions of activity over the whole set of 355 practical mono-substituted analogues. Essentially, the selection method is based on balancing statistical efficiency against synthesis difficulty, yielding the most information for a given cost. That is, some selections are better statistically, in that they will provide better coverage of the parameter space, and will allow more accurate predictions of activity for the 355 - 20 = 335 analogues not actually synthesized. However, a selection based on statistical efficiency alone may result in a heavy weighting towards compounds which are difficult to synthesize. The set of analogues chosen for synthesis consists of the first twenty in Table 1.
This table also gives the relative synthesis difficulties for each compound as well as the QSAR parameters and the ED50's. (Of course, the ED50's were not known at the time that the substituents were selected.) The average difficulty rating is 1.5, and the maximum is 4.
The average difficulty rating over the entire set of feasible compounds was 6.32. Also, for the sake of comparison, a selection was made assuming equal synthesis difficulty for all analogues. This selection is given in Table 2. The average difficulty rating for the compounds chosen in this way is 6.25, which is more than 4 times the average for Table 1. The statistical efficiency of the two selections can be evaluated in various ways. One criterion is the relative average standard error of prediction over the set of 355 feasible compounds. For the selection in Table 1 this is 0.76, vs. 0.51 for the selection in Table 2. Another criterion is the maximum relative standard error of prediction over the set of 355 feasible compounds. For the selection in Table 1 this is 1.45, vs. 0.700 for the selection in Table 2. Thus if the selection in Table 2 were used, the average synthesis difficulty would increase by a factor of 4, while the average standard error of prediction would decrease by 33% and the maximum standard error of prediction would decrease by 52%. Hence the loss in statistical efficiency is much less than the gain in ease of synthesis. Note, also, that as shown in Borth et al. [1985], the same statistical efficiency with less total synthesis effort could have been achieved by adding more easy-to-synthesize compounds. This was not done because the statistical efficiency of the 20 compounds selected was deemed to be adequate.

TABLE 1 Physical data, synthesis difficulty rating and efficacy data for compounds.
(Values are listed column by column in source order. The π, F, R, MR and Hd columns each hold 37 entries; the substituent, difficulty rating and ED50 columns are incomplete or ambiguous, and entries marked (?) are uncertain readings.)
No.: 1 to 37
Substituent: m-NO2, p-NO2, p-OCH3, o-SO2C6H5 (?), o-F, p-F, o-Br, p-Br, m-OCH2C6H5, p-OCH2C6H5, p-C2H5 (?), o-NH2, m-NH2, p-OH, m-OCH3, p-SO2CH3, m-F, o-Cl, o-C2H5 (?), m-C2H5, m-SCH3, p-C6H5, p-COOC2H5, p-COOH, m-CH2C6H5 (?), m-C6H5 (?), p-OCONHCH3
Difficulty rating: 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 8, 1, 4, 1, 1, 2, 2, 2, 2, 8, 8, 8, 8, 8
π: 0.00, 0.84, 0.52, 0.60, 2.01, 2.39, 2.51, -1.30, -5.96, 0.11, 0.11, 0.22, -0.03, 0.27, 0.00, 0.15, 0.84, 1.19, 1.66, 1.66, 1.10, -1.40, -1.29, -0.61, 0.12, -1.20, 0.22, 0.76, 1.39, 0.99, 0.64, 1.74, 0.46, -0.32, 2.01, 1.92, -0.42
F: 0.00, -0.05, -0.04, -0.04, -0.07, 0.10, -0.17, 0.02, 0.87, 0.84, 0.66, 0.67, 0.26, 0.71, 0.54, 0.43, 0.55, 0.44, 0.21, 0.21, -0.05, 0.02, 0.02, 0.29, 0.25, 0.54, 0.43, 0.51, -0.06, -0.05, 0.19, 0.08, 0.33, 0.33, -0.05, 0.08, 0.41
R: 0.00, -0.11, -0.04, -0.13, 0.00, -0.07, 0.04, -0.68, -0.02, 0.09, 0.04, 0.11, -0.53, 0.12, -0.32, -0.37, -0.18, -0.21, -0.15, -0.43, -0.10, -0.59, -0.24, -0.66, -0.18, 0.18, -0.13, -0.16, -0.09, -0.04, -0.07, -0.09, 0.12, 0.12, 0.00, -0.03, -0.15
MR: 0.0, 4.7, 4.7, 4.7, 29.0, 24.3, 25.7, 4.2, 20.2, 6.0, 6.0, 6.0, 6.5, 32.2, 4.4, -0.4, 7.6, 7.6, 30.7, 30.7, 9.4, 4.2, 4.2, 1.5, 6.5, 12.5, -0.4, 4.8, 9.4, 9.4, 13.0, 24.3, 16.2, 5.9, 29.0, 24.3, 15.3
Hd: 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1
ED50: 152, >10,000, 33, 123, 66, 25, 905, 453 (?), >1,000, >10,000, >10,000, >823, 29, 655, 917, 425, >1,000, 171, 31, 826, 151, >1,000, >1,000, >1,000, 728, 714, 1,000 (?), >1,000, 120, 116, 117, 902, >2,251, >4,000, 30, 34, >1,000

TABLE 2 Selection and actual difficulty ratings which result from basing selection on statistical efficiency alone.
Substituent: (names not recoverable)
Difficulty rating: 4, 4, 4, 8, 8, 4, 4, 1, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4

The following is a brief description of the mathematical method used for substituent selection. Let x_i be the row vector (corresponding to the ith substituent) such that the proposed mathematical model relating structure to activity is Log(1/ED50_i) = x_i β, where β is the vector of unknown coefficients to be estimated. (For a more specific definition of x_i, see Equation 3 and the following discussion in the next section.) The selection process is iterative, as shown in Figure 2, and is based on the following criterion for adding a substituent to the selection:
and the following closely related criterion for deleting a compound from the selection:
where D_i is the difficulty rating for the ith substituent and X_i is a matrix whose rows are the vectors x_1, x_2, ..., x_i corresponding to the substituents currently selected. The numerators of the above expressions are equal to the statistical information (specifically the expected entropy change) due to adding or deleting the ith
substituent to or from the group of substituents already selected. Note that the quantity
$$\sqrt{x_i \left(X_{i-1}' X_{i-1}\right)^{-1} x_i'}$$
is proportional to the standard error of prediction for the ith substituent [Draper and Smith, 1981] assuming data is available for substituents 1, 2, ..., i-1. The constant of proportionality is σ, the residual standard deviation from the fitted model. Of course this is unknown at the substituent selection stage, but is assumed to apply equally to all substituents. Thus the addition criterion, i.e., Equation 1, has the intuitively appealing property that in considering substituents of equal synthesis difficulty, the one for which the uncertainty of prediction is the greatest, given the substituents already in the selection, will be chosen. As discussed by Borth [1975], this is related to a principle of the statistical design of experiments which applies quite generally. Similar considerations apply to the deletion criterion, i.e., Equation 2. As shown in Figure 2, the selection process is an iterative one in which the selection is improved by alternately adding and deleting substituents. The starting point for this process was generated by clustering the 355 feasible substituents into 20 clusters (the number of substituents to be selected) based on the elements of the vector x_i for each substituent. The initial selection was the substituent in each cluster with the lowest difficulty rating. In the case of a tie, the substituent closest to the cluster center was chosen. The SAS procedure FASTCLUS [SAS Institute, 1988] was used for this purpose. Full details of the statistical justification of the selection method are given by Borth et al. [1985].
[Figure 2 flow diagram boxes: (CLUSTER ANALYSIS); ADD ANALOGUE WITH MAXIMUM ...; DELETE ANALOGUE WITH MINIMUM ...]
Figure 2. Flow diagram of substituent selection process.
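The alternating add and delete steps can be illustrated with a short sketch. It is not the published algorithm of Borth et al. [1985]: the information measures below are assumptions, taken to be the log-determinant changes log(1 + x(X'X)^-1 x') for an addition and -log(1 - x(X'X)^-1 x') for a deletion, each divided by the difficulty rating. The function names, the stopping rule and the `start` argument (standing in for the cluster-based initial selection) are hypothetical.

```python
import numpy as np

def info_gain_add(x, xtx_inv):
    """Assumed information gain from adding row x: log(1 + x (X'X)^-1 x')."""
    return np.log1p(x @ xtx_inv @ x)

def info_loss_delete(x, xtx_inv):
    """Assumed information loss from deleting row x: -log(1 - x (X'X)^-1 x')."""
    h = min(x @ xtx_inv @ x, 1.0 - 1e-12)   # leverage, clipped to keep the log finite
    return -np.log1p(-h)

def exchange_selection(X, difficulty, start, max_iter=100):
    """Alternately add the best and delete the worst candidate per unit difficulty."""
    selected = list(start)
    for _ in range(max_iter):
        inv = np.linalg.inv(X[selected].T @ X[selected])
        candidates = [i for i in range(len(X)) if i not in selected]
        add = max(candidates,
                  key=lambda i: info_gain_add(X[i], inv) / difficulty[i])
        selected.append(add)
        inv = np.linalg.inv(X[selected].T @ X[selected])
        drop = min(selected,
                   key=lambda i: info_loss_delete(X[i], inv) / difficulty[i])
        selected.remove(drop)
        if drop == add:          # the exchange no longer changes the selection
            break
    return sorted(selected)
```

Here each row of `X` plays the role of a vector x_i and `difficulty` holds the ratings D_i. With equal difficulties every exchange leaves the determinant of X'X unchanged or larger, so the sketch then behaves like a simple D-optimal exchange procedure.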
3. Statistical analysis
Statistical analyses were carried out to relate Log(1/ED50) to the following physicochemical parameters: π (lipophilicity), F (inductive effect), R (resonance effect), Hd (a zero-one indicator variable describing hydrogen donating effect) and MR (molar refractivity). The initial data analyzed consisted of the first 28 compounds in Table 1. These compounds are the twenty selected as described in the previous section, plus 8 compounds which were made prior to the selection. However, the o-CH3 analogue was found to be extremely inactive and did not fit the regression model and was deleted from the analysis. The remaining compounds in Table 1 were not used in developing the initial regression model, but were used to validate the model. Subsequently the model was refitted using all of the data (with the exception of the o-CH3 analogue). Some of the data in Table 1 consist of a lower bound for the ED50 rather than a specific value. This is because less than 50% control was exhibited at all rates tested. Statistically, this is referred to as censored data. Rather than attempting to obtain ED50's for these very inactive compounds, the data were analyzed using a special technique for regression analysis which allows for censored observations [Aitkin, 1981 and Wolynetz, 1979]. The computations were carried out using a SAS procedure called LIFEREG [SAS Institute Inc., 1988] on a Compaq 386/20 computer. This technique recovers a substantial amount of the information which would be lost if the censored observations were merely deleted from the analysis. However, the reporting of the statistical analysis results is different from that of standard regression packages. Initially, the mathematical model considered for the data was as follows:
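\[ \log(1/\mathrm{ED}_{50}) = \beta_0 + \beta_1 \pi' + \beta_2 \pi'^2 + \beta_3 F' + \beta_4 R' + \beta_5 (\mathrm{OMR})' + \beta_6 (\mathrm{MMR})' + \beta_7 (\mathrm{PMR})' \qquad (3) \]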
where the prime indicates that the variable has been standardized by subtracting the mean and dividing by the standard deviation calculated over the compounds included in the analysis. The variables OMR, MMR and PMR take on the MR value for the substituent when the substituent is in the corresponding position and take on the value 0 otherwise. (Equation 3 was used at the substituent selection stage. Specifically, xi = (1, πi, πi², Fi, Ri, OMRi, MMRi, PMRi) was used in Equations 1 and 2. Standardization of the variables was not required at the selection stage, although it is desirable at the analysis stage.) After preliminary data analysis, it was discovered that π and π² were not significant, but that the inclusion of the hydrogen donating indicator variable as well as a term involving (PMR)'² significantly improved the fit. Thus the final model is
Table 3 summarizes the statistical analysis of compounds 1, and 3 to 28 of Table 1, as fitted to equation (4). The mean and standard deviation are provided for converting from actual to coded values, e.g., F' = (F - 0.304)/0.305 and (PMR)'

into 2 dimensions, the pair t1, t2 by no means determines the vector x uniquely. To restrict the search we can use our a priori knowledge of the possible values by giving lower and upper values a ≤ x ≤ b for the components of x, or impose some linear constraints, Ax = c. A unique solution x can be obtained by formulating the search for x as an optimization problem with the constraints given above. We require that the solution is near a group of observations with some desirable properties. For instance, denote by x0 the mean value of the observations labelled as 'good' in the score plot. We then would like to minimize the Euclidean distance ||x - x0|| between x and x0. Denote further by T a polygonal target domain in the score plot. The point x is now obtained as the solution of the quadratic programming problem
\[ \text{Minimize } \|x - x_0\|^2 \]
with the constraints
\[ x^T P \ \text{inside}\ T, \qquad a \le x \le b \]
Figure 1. The PLS-scores of the data set for quadratic programming example (plot title: constrained projection on a PLS-plot).
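The constrained projection described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the polygonal target T is taken to be convex, so that "x^T P inside T" becomes a set of linear inequalities on x, and the problem is solved with Dykstra's alternating projection method, chosen here for brevity and not necessarily the solver used by the authors. The loading matrix `P`, the bounds and the polygon edges in the usage note are made-up values.

```python
def project_halfspace(x, a, b):
    """Euclidean projection of x onto the half-space {z : a.z <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)  # already feasible
    norm2 = sum(ai * ai for ai in a)
    return [xi - violation * ai / norm2 for xi, ai in zip(x, a)]


def project_box(x, lower, upper):
    """Euclidean projection of x onto the box lower <= z <= upper."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def dykstra(x0, projections, n_iter=200):
    """Approximate the point of the intersection of convex sets closest to x0."""
    x = list(x0)
    corrections = [[0.0] * len(x0) for _ in projections]
    for _ in range(n_iter):
        for k, project in enumerate(projections):
            y = [xi + ci for xi, ci in zip(x, corrections[k])]
            x = project(y)
            corrections[k] = [yi - xi for yi, xi in zip(y, x)]
    return x


def constrained_projection(x0, P, edges, lower, upper):
    """Minimize ||x - x0||^2 subject to lower <= x <= upper and P^T x inside T.

    P is an n x 2 loading matrix (hypothetical values in the example).
    edges is a list of (g, h) pairs; the convex polygon T is {t : g.t <= h}.
    """
    projections = [lambda z: project_box(z, lower, upper)]
    for g, h in edges:
        # The score-space inequality g.(P^T x) <= h is the x-space half-space (P g).x <= h.
        a = [sum(P[i][j] * g[j] for j in range(len(g))) for i in range(len(P))]
        projections.append(lambda z, a=a, h=h: project_halfspace(z, a, h))
    return dykstra(x0, projections)
```

As a usage example, with `P = [[1, 0], [1, 0], [0, 1]]` the scores are t1 = x1 + x2 and t2 = x3. Taking the target rectangle 2 ≤ t1 ≤ 3, 0 ≤ t2 ≤ 1 as `edges = [([-1, 0], -2), ([1, 0], 3), ([0, -1], 0), ([0, 1], 1)]`, the bounds 0 ≤ x ≤ 5 and the starting point x0 = (0.5, 0.5, 0.5), the call moves x0 along (1, 1, 0) until t1 = 2.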
The hydrolase catalyzed group transfer reaction may be described by more than one mechanism [Kasche et al., 1984]. In the case with only one acyl-enzyme intermediate (Fig. 1) the expressions for kcat and Km (constants of the Michaelis-Menten equation) for the formation of ampicillin and phenylglycine, given in Table 1, depend on the rate constants of the hypothesized mechanism and on the concentration of
TABLE 1
Expressions for kcat and Km for the formation of PG and ampicillin.
Product
PG
Amp.
Km
kcal
1'
k3[H2
k2
k2+ k3[H20]+ k;[6APAI k:J x kJ6AF'Al L
( k3[H20]+ kj[6AF'Al)( kl( k,+ k3[H20]
k-l
+ k,J( KL + [6APA I)
+ kj[6APAl)
x KL
k2 + k3[ H, 0 3 + kj[6AF'Al
6-APA. Furthermore, the integration of the Michaelis-Menten equation may become even more difficult and dependent on the instantaneous concentrations of the reactants. Moreover, expressions for the enzyme reactor kinetics are not easily available. To describe the system dynamics, design the reactor and implement control strategies, methodologies other than traditional kinetics may therefore be advantageous, particularly if they do not require rigorous prior knowledge of the mechanism and kinetics of the system. This is exactly the objective of methodologies that model biological systems using on-line estimation techniques, as described by Bastin and Dochain [1990]. In principle these methodologies may be extended to enzymatic reactions; this extension is developed in the following sections, using the enzymatic synthesis of ampicillin as a typical case study.
2. Description of the enzymatic process
The reaction scheme under consideration (Fig. 2) is based on the work of V. Kasche [1986]. The most important reaction of this scheme is the synthesis of ampicillin:
i) MPG + 6-APA → Amp + MeOH   (rate φ1)
In the case here considered the enzyme substrate, phenylglycine (PG), was substituted by its methyl ester (MPG), because when using MPG the reaction proceeds much more quickly. However MPG is also a substrate for the enzyme and we must consider its hydrolytic reaction which can also proceed chemically:
ii) MPG + H2O → PG + MeOH   (rate φ2)
To complete the description of the system we have yet to write a reaction for the enzymatic deacylation of ampicillin by penicillin acylase:
iii) Amp + H2O → 6-APA + PG   (rate φ3)
Figure 2. Reaction scheme for ampicillin synthesis. 6-APA = aminopenicillanic acid, Amp = ampicillin, MPG = methyl ester of phenylglycine, PG = phenylglycine, MeOH = methanol.
The process was operated in continuous mode in a continuous stirred tank reactor with the enzyme penicillin amidase (E.C. 3.5.1.11, from E. coli) immobilized with glutaraldehyde (a gift from CIPAN-Antibioticos de Portugal). The two substrates MPG and 6-APA were added at various feed rates at constant inlet concentrations (100 mM and 40 mM, respectively). The fact that PG is also present in the feed, due to the chemical hydrolysis of MPG in aqueous solution, was also taken into account. Experimental conditions were: temperature 37°C, pH 6.5, enzyme activity 19.41 UI, reactor volume 50 ml.
3. Kinetics state space model
The design of the observer derived in section 4 is based on a kinetic state space model obtained from mass balance considerations. In stirred tank reactors (STR), the process is assumed to be completely mixed (composition homogeneous in the reactor). Therefore the following standard continuous STR state space model can be written:
with:
S1 the 6-APA concentration
S2 the MPG concentration
P1 the PG concentration
P2 the ampicillin concentration
P3 the methanol concentration
φi the reaction rates
ki the yield coefficients
D the dilution rate, defined as the quotient between the influent volumetric flow rate and the volume of the medium
Fi the substrate feed rates.
4. Design of observers for biochemical processes
Equations (1.a) to (1.e) may be seen as a special case of the following general nonlinear state space model, written in matrix form
where:
ξ (dimension n) is the state vector;
K (dimension n x p) is a matrix involving either yield coefficients or "0" or "1" entries;
φ (dimension p) is a vector of reaction rates;
U (dimension n) is a vector representing inlet flow rates of substrates.
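For the model (1), these notations may be written as:
\[ K^T = \begin{bmatrix} -k_1 & -1 & 0 & k_5 & k_6 \\ 0 & -1 & k_3 & 0 & k_7 \\ k_2 & 0 & k_4 & -1 & 0 \end{bmatrix}, \qquad U^T = [\,F_1 \ \ F_2 \ \ F_3 \ \ 0 \ \ 0\,] \]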
It is considered that the matrices K and U are known, that the vector φ is unknown, and that p state variables are measured on-line (ξ1 denotes the vector of these measurements). The problem addressed here is to design an observer for the on-line estimation of the "n - p" nonmeasured state variables (ξ2 is the corresponding vector). A basic structural property of the state space model (2) [Bastin, 1988] allows the following rearrangement:
with obvious definitions of K1, K2, U1, U2. It must be noted that K1 is full rank (rank(K1) = p). There exists a state transformation:
where A is a solution of the matrix equation
\[ A K_1 + K_2 = 0 \qquad (5) \]
such that the state-space model (2) is equivalent to
\[ dZ/dt = -DZ + (A U_1 + U_2) \qquad (6) \]
and equation (3.a). Then the following asymptotic observer can be derived from (4) and (6):
\[ d\hat{Z}/dt = -D\hat{Z} + (A U_1 + U_2) \qquad (7.a) \]
\[ \hat{\xi}_2 = \hat{Z} - A \xi_1 \qquad (7.b) \]
where the symbol ^ denotes estimated values. A proof of the asymptotic convergence of this estimation algorithm may be found in Bastin and Dochain [1990]. The main advantages of this algorithm are its simplicity, by comparison for instance with the extended Kalman filter, and the fact that it allows the on-line estimation of state variables without knowledge (or estimation) of the reaction rates. These properties have previously been shown to be very effective for various fermentation processes [see e.g., Dochain et al., 1988].
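The reason the reaction rates drop out of the observer can be made explicit in a few lines. The sketch below assumes that model (2) has the usual mass-balance form dξ/dt = Kφ - Dξ + U, partitioned into measured (ξ1) and nonmeasured (ξ2) parts, and that the state transformation is Z = ξ2 + Aξ1; both forms are inferred from the surrounding definitions rather than quoted from the text.

```latex
\begin{align*}
\frac{d\xi_1}{dt} &= K_1\varphi - D\xi_1 + U_1, \qquad
\frac{d\xi_2}{dt} = K_2\varphi - D\xi_2 + U_2, \\
Z &= \xi_2 + A\,\xi_1, \\
\frac{dZ}{dt} &= (K_2 + A K_1)\,\varphi - D\,(\xi_2 + A\,\xi_1) + A U_1 + U_2 \\
              &= -DZ + A U_1 + U_2 \qquad \text{since } A K_1 + K_2 = 0 .
\end{align*}
```

Under these assumptions the dynamics of Z depend only on the dilution rate and the feed rates, so Z can be integrated without knowing φ, and the nonmeasured states follow algebraically from Z and the measurements ξ1.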
5. Experimental validation
The application of the estimator algorithm defined above was considered for the case of ampicillin synthesis. As on-line measurements of MPG and 6-APA (p = 2) are available by high performance liquid chromatography (HPLC), the state reconstruction is made possible.
Hence:
and
\[ A = \begin{bmatrix} -k_3/k_1 & k_3 \\ k_5/k_1 & 0 \\ (k_6 - k_7)/k_1 & k_7 \end{bmatrix} \qquad (9) \]
The transformed state is then defined as follows:
\begin{align*}
Z_1 &= P_1 - (k_3/k_1)\,S_1 + k_3 S_2 && (10.a) \\
Z_2 &= P_2 + (k_5/k_1)\,S_1 && (10.b) \\
Z_3 &= P_3 + [(k_6 - k_7)/k_1]\,S_1 + k_7 S_2 && (10.c)
\end{align*}
Then the estimator algorithm (7.a-b) is as follows:
\begin{align*}
d\hat{Z}_1/dt &= -D\hat{Z}_1 - (k_3/k_1)\,F_1 + k_3 F_2 + F_3 && (11.a) \\
d\hat{Z}_2/dt &= -D\hat{Z}_2 + (k_5/k_1)\,F_1 && (11.b) \\
d\hat{Z}_3/dt &= -D\hat{Z}_3 + [(k_6 - k_7)/k_1]\,F_1 + k_7 F_2 && (11.c) \\
\hat{P}_1 &= \hat{Z}_1 + (k_3/k_1)\,S_1 - k_3 S_2 && (11.d) \\
\hat{P}_2 &= \hat{Z}_2 - (k_5/k_1)\,S_1 && (11.e) \\
\hat{P}_3 &= \hat{Z}_3 - [(k_6 - k_7)/k_1]\,S_1 - k_7 S_2 && (11.f)
\end{align*}
The computer implementation of the estimator requires a discrete-time formulation. This can be done simply by replacing the time derivative by a finite difference. A first order forward Euler approximation was used:
The yield coefficients used are listed in Table 2. Coefficients kl to k5 have been obtained with an identification study from various batch experiments, while k6, k7 have been obtained from stoichiometric considerations.
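The discrete-time estimator can be sketched in a few lines of Python. The sketch assumes the forward Euler update Z(t + T) = Z(t) + T·dZ/dt applied to equations (11.a-c), followed by the algebraic reconstruction (11.d-f). The function name, the argument names and the dictionary of yield coefficients are illustrative and are not taken from the original implementation.

```python
def observer_step(z, s1, s2, dilution, feeds, k, dt):
    """One sampling period of the asymptotic observer (illustrative sketch).

    z        : current estimates [Z1, Z2, Z3] of the transformed state
    s1, s2   : on-line measurements of 6-APA and MPG
    dilution : dilution rate D (1/min)
    feeds    : (F1, F2, F3) feed rates of 6-APA, MPG and PG
    k        : dict with the yield coefficients k1, k3, k5, k6, k7
    dt       : sampling period T (min)

    Returns (z_next, estimates); estimates = [PG, Amp, MeOH] at the current time.
    """
    f1, f2, f3 = feeds
    # Matrix A of equation (9); its columns multiply the 6-APA and MPG terms.
    a = [
        [-k["k3"] / k["k1"], k["k3"]],
        [k["k5"] / k["k1"], 0.0],
        [(k["k6"] - k["k7"]) / k["k1"], k["k7"]],
    ]
    u2 = [f3, 0.0, 0.0]  # only PG (the first nonmeasured state) is fed
    # Equations (11.d-f): nonmeasured states from Z and the measurements.
    estimates = [zi - (row[0] * s1 + row[1] * s2) for zi, row in zip(z, a)]
    # Equations (11.a-c) discretized with a forward Euler step.
    z_next = [
        zi + dt * (-dilution * zi + row[0] * f1 + row[1] * f2 + u2i)
        for zi, row, u2i in zip(z, a, u2)
    ]
    return z_next, estimates
```

The function would be called once per sampling period with the latest HPLC measurements. Note that the reaction rates φ never appear, which is the property the observer is built on.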
TABLE 2
Yield coefficients.
k1     k2     k3     k4     k5     k6     k7
3.36   0.250  1.41   12.0   0.300  0.300  1.41
The on-line state variables used by the observer are shown in Figure 3. The dilution rate and the inlet flow rates are shown in Figure 4 and Figure 5, respectively. The estimates of PG, ampicillin and methanol are shown in Figures 6, 7 and 8. Good agreement between the estimates and the validation data (not used by the observer) was observed. For the estimate of methanol concentration a validation data set was not available.
6. Conclusions
The experimental validation of an asymptotic observer for the estimation of ampicillin, phenylglycine and methanol concentrations was presented in this paper. As was demonstrated, an important feature of this algorithm is that it provides on-line estimates without requiring rigorous prior knowledge of the system kinetics. Furthermore, the (specific) reaction rates, considered as time varying parameters, can be estimated on-line [Ferreira et al., 1990]. The main practical interest of these observers, used as 'software sensors', is that they constitute a valuable alternative where reliable sensors for on-line measurement of the main state variables are lacking. They are very cheap compared to the expensive and complex analytical methods usually used for the measurement of compounds like antibiotics. This software tool constitutes an important step towards the use of adaptive control methodologies on these enzymatic processes.
Figure 3. On-line measured state variables (T = 5 min).
Figure 4. Dilution rate profile. Axes: dilution rate D (1/min) versus time (min).
Figure 5. Inlet flow rate. Axes: feed rates U (mmol/l/min) versus time (min).
Figure 6. Estimation of nonmeasured state variable: concentration P1 (mmol/l), PG estimated and PG off-line data, versus time (min). (Off-line measured values not used by the estimator.)
Figure 7. Estimation of nonmeasured state variable: concentration P2 (mmol/l), Amp estimated and Amp off-line data, versus time (min). (Off-line measured values not used by the estimator.)
Figure 8. Estimation of nonmeasured state variable: concentration P3 (mmol/l) versus time (min). (Off-line measured values not used by the estimator.)
This study has been supported by the Biotechnology Action Programme of the Commission of the European Communities and by Programa Mobilizador C&T, JNICT-Portugal.
Acknowledgements
The authors thank E. Santos, L. Tavares and V. Sousa for experimental work on the kinetics of ampicillin synthesis. We also thank G. Bastin for his advice and discussions on the work.
References
Bastin G. State estimation and adaptive control of multilinear compartmental systems: theoretical framework and application to biotechnological processes. In: New Trends in Nonlinear Systems Theory. Lecture Notes on Control and Information Science, no. 122. Springer Verlag, 1988: 341-352.
Bastin G, Dochain D. On-line estimation and adaptive control of bioreactors. Elsevier, 1990 (in press).
Dochain D, De Buyl E, Bastin G. Experimental validation of a methodology for on line estimation in bioreactors. In: Fish NM, Fox RI, Thornhill NF, eds. Computers Applications in Fermentation Technology: Modelling and Control of Biotechnological Processes. Elsevier, 1988: 187-194.
Ferreira EC, Feyo de Azevedo S, Duarte JC. Nonlinear estimation of specific reaction rates and state observers for a ampicillin enzymatic synthesis. To be presented at 5th European Congress on Biotechnology, Copenhagen, 1990.
Kasche V. Mechanism and yields in enzyme catalyzed equilibrium and kinetically controlled synthesis of β-lactam antibiotics, peptides and other condensation products. Enzyme Microb Technol 1986; 8, Jan: 4-16.
Kasche V, Haufler U, Zollner R. Kinetic studies on the mechanism of the penicillin amidase-catalysed synthesis of ampicillin and benzylpenicillin. Hoppe-Seyler's Z Physiol Chem 1984; 365: 1435-1443.
Sheldon R. Industrial synthesis of optically active compounds. Chemistry & Industry 1990; 7: 212-219.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 19
From Chemical Sensors to Bioelectronics: A Constant Search for Improved Selectivity, Sensitivity and Miniaturization
P.R. Coulet
Laboratoire de Génie Enzymatique, UMR 106 CNRS, Université Claude Bernard, Lyon 1, 43 Boulevard du 11 Novembre 1918, F-69622 Villeurbanne Cedex, France
Abstract
The need in various domains for real time information urgently requires the design of new sensors exhibiting a high selectivity and a total reliability in connection with smart systems and actuators. This explains the strong interest in chemical sensors, and particularly in biochemical sensors, for such a purpose. These sensors mainly consist of a highly selective sensing layer capable of highly specific molecular recognition, intimately connected to a physical transducer. They can be used directly for the analysis of complex media. When the target analyte to be monitored is present and reaches the sensing layer, a physical or chemical signal occurs which is converted by the transducer into an electrical output signal. This signal, treated in a processing system, leads to a directly exploitable result. Enzyme electrodes are the archetype of the first generation of biosensors, now commercially available. New generations are based on novel and promising transducers like field effect transistors or optoelectronic devices. Efforts are being made to improve the selectivity and sensitivity of the sensing layer, to explore new concepts in transduction modes and to miniaturize both the probes and the smart signal processing systems. Groups including specialists in biomolecular engineering, microelectronics, optronics, computer sciences and automation, capable of developing a comprehensive interdisciplinary approach, will have a decisive lead in challenging areas in the near future, especially in two of them where a strong demand exists: medical sensing and environmental monitoring.
1. Introduction
Improvements in the control and automation of industrial processes are urgently needed, especially in biotechnology processes [1], not only to increase both quality and productivity but also to favor waste-free operations. However, analyses performed on line or very close to the process still remain difficult. Process analytical chemistry, which is an alternative to time-consuming conventional analysis performed in central laboratories, is raising a lot of interest, and new generations of sensors, mainly chemical and biochemical, appear as promising tools in this field [2]. In the domain of health, monitoring vital parameters in critical care services, using for example implantable sensors, still remains a challenging goal. The growing awareness throughout the world of our strong dependence on the environment also appears as a powerful stimulus for developing new approaches and new concepts. All these factors offer a real opportunity for boosting a fruitful interdisciplinary approach towards the "bioelectronics frontier".
2. Chemical sensors
A chemical sensor can be defined as a device in which a sensing layer is intimately integrated within, or closely associated to, a physical transducer able to detect and monitor specifically an analyte [3]. As a matter of fact, it is quite difficult to stick to a fully unambiguous definition, and we will consider a chemical sensor as a small probe-type device which can be associated with a more or less sophisticated signal processing system including, for instance, a digital display of results, a special output for computers, etc. This device must provide direct information about the chemical composition of its environment. Ideally, it must exhibit a large autonomy, be reagentless, and respond rapidly, selectively and reversibly to the concentration or activity of chemical species for long periods. Stability and sensitivity to interferent species are in fact the two main bottlenecks to overcome in designing chemical sensors and particularly biosensors.
3. Biosensors
Biosensors can be considered as highly sophisticated chemical sensors which incorporate in their sensing layer some kind of biological material conferring to the probe a very high selectivity.
Affinity and specific requirements for the biomolecules to be fully active
As a matter of fact, it is interesting to take advantage of the different types of biomolecules which are capable of molecular recognition and may present a strong affinity for other compounds. Among them, the most interesting couples are:
- enzyme / substrate,
- antibody / antigen,
- lectin / sugar,
- nucleic acids / complementary sequences,
to which we can add chemoreceptors from biological membranes. Microorganisms, animal or plant whole cells and even tissue slices can also be incorporated in the sensing layer. Up to now, the two main classes widely used in the design of biosensors are enzymes and antibodies. Enzymes are highly specialized proteins which specifically catalyze metabolic reactions in living organisms. They can be isolated and purified, and are used in vitro for analytical purposes in conventional methods. Antibodies are naturally produced by animals and human beings reacting against foreign substances. They can be obtained by inducing their production, for instance in rabbits or mice, and collected for use as analytical reagents in immunoassays. It must be kept in mind that most of these biological systems have extraordinary potentialities but are also fragile and must be used in definite conditions. For instance, most enzymes have an optimal pH range where their activity is maximal, and this pH zone has to be compatible with the characteristics of the transducer. Except for very special enzymes capable of withstanding temperatures above 100°C for several minutes, most biocatalysts must be used in a quite narrow temperature range (15°C-40°C). In most situations an aqueous medium is required, and this has to be taken into account when specific applications are considered in gaseous phases or organic solvents, for instance. Stability of the bioactive molecule is certainly the main factor to consider and is in most cases an intrinsic property of the biological material, very difficult to modify.
Biomolecular sensing and transducing mode
Two main phenomena acting in sequence have to be considered for designing a biosensor (Fig. 1):
- the selective molecular recognition of the target molecule and
- the occurrence of a first physical or chemical signal consecutive to this recognition, converted by the transducer into a second signal, generally electrical,
with a transduction mode which can be either
- electrochemical,
- thermal,
- based on a mass variation,
- or optical.
Molecular recognition and selectivity
Prior to examining the different possible combinations between bioactive layers and transducers, two points must be underlined. The first one concerns the intrinsic specificity of the biomolecule involved in the recognition process. For instance, if enzymes are considered, this specificity may strongly vary depending on the spectrum of the substrates they can accept. Urease is totally specific for urea; glucose oxidase is also very specific for β-D-glucose and oxidizes the alpha anomer at a rate lower than 1%; but other systems like alcohol oxidases or alcohol dehydrogenases accept several primary alcohols as substrates, and amino acid oxidases will respond to a large spectrum of amino acids as well. For antibodies this specificity can be strongly enhanced by using monoclonal antibodies, now widely produced in many laboratories. The second important point is the degree of bioamplification obtained when molecular recognition occurs. If the bioactive molecule present in the sensing layer is a biocatalyst, a variable amount of product will be obtained in a short time depending on its turnover: this corresponds to an amplification at the step generating the physicochemical signal. By contrast, using antibodies to detect antigens, or vice versa, is not normally a biocatalysis phenomenon, and this will have to be taken into account in the choice of the transducer.
Figure 1. Schematic configuration of a chemical or biosensor.
Immobilization of the biological system in the sensing layer
The simplest way of retaining bioactive molecules in the immediate vicinity of the tip of a transducer is to trap them on its surface, covered by a permselective membrane. This has been used in a few cases, but in most of the devices described in the literature the bioactive molecules are immobilized in the sensing layer. Different methods of immobilization have been available for several years [4], derived from the preparation of bioconjugates [5] now widely used in enzyme immunoassays and also from the development of heterogeneous enzymology [6]. The two main methods consist either in embedding the biomolecule inside the sensing layer coating the sensitive part of the transducer or in grafting it covalently onto a preexisting membrane maintained in close contact with the transducer tip. Details will be given with the description of enzyme electrodes. Several classes of biosensors have been described based on the different types of transducers. A short description will be given for each of them, with some emphasis on electrochemical transduction and enzyme electrodes, historically the first type of biosensor now on the market.
3.1 Biosensors based on electrochemical transduction. Enzyme electrodes
Associating an enzyme with an electrochemical transducer, thereby greatly increasing the selectivity of an amperometric electrode, was proposed by Clark and Lyons more than 25 years ago [7]. Since this first attempt, a large variety of enzyme electrodes have been described and a very abundant literature has been published on this subject, periodically reviewed [8, 9]. Glucose determination using glucose oxidase in the sensing layer is obviously the most popular system in the enzyme electrode field. Several explanations can be found for this: a high selectivity and the fact that glucose oxidase contains FAD. This cofactor, involved in the oxidoreduction cycle, is tightly bound to the enzyme, which appears as a real advantage for designing a reagentless probe when compared to NADH based reactions with specific dehydrogenases. In the latter case NADH acts as a cosubstrate which must be supplied in the reaction medium. Besides the wide demand for glucose tests, not only in clinical biochemistry but also in fermentations or cell cultures, the very high stability of this enzyme appears as the key factor for its wide use in the design of most of the biosensors described today. The principle of glucose determination using the enzyme glucose oxidase as biocatalyst is the following:
β-D-glucose + O2 + H2O → gluconic acid + H2O2   (catalyzed by glucose oxidase)
Under first order kinetic conditions, glucose oxidase activity is directly proportional to glucose concentration according to the simplified Michaelis-Menten model of enzyme kinetics. Theoretically this activity can be followed by either the consumption of O2, the appearance of H+ from gluconic acid, or the appearance of H2O2. In practice, the appearance of H+ is very difficult to monitor because buffered media are used, so the systems which have been described and which have led to the design of commercially available instruments are based on either O2 or H2O2 monitoring or on the use of mediators. Other oxidases, specific for different metabolites and leading to hydrogen peroxide, can also be used, making the system really versatile.
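The first order condition mentioned above can be illustrated numerically. In the Michaelis-Menten model the rate is v = Vmax·S/(Km + S), which reduces to v ≈ (Vmax/Km)·S when the substrate concentration S is much smaller than Km. The sketch below compares the two expressions; the function names are hypothetical and the parameter values used in the usage note are arbitrary illustration values, not data for glucose oxidase.

```python
def michaelis_menten_rate(s, v_max, k_m):
    """Reaction rate from the Michaelis-Menten equation."""
    return v_max * s / (k_m + s)


def first_order_rate(s, v_max, k_m):
    """Linear (first order) approximation, valid when s is much smaller than k_m."""
    return (v_max / k_m) * s


def relative_linearity_error(s, v_max, k_m):
    """Relative overestimate made by the linear approximation at concentration s."""
    exact = michaelis_menten_rate(s, v_max, k_m)
    approx = first_order_rate(s, v_max, k_m)
    return (approx - exact) / exact
```

Algebraically the relative error equals S/Km, so with Km = 10 and S = 0.1 (same units) the linear calibration overestimates the rate by 1%. This is one way to judge how far the linear range of an enzyme electrode can be expected to extend.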
Enzyme electrodes based on O2 detection
The consumption versus time of O2, which is involved as a stoichiometric coreactant in the glucose oxidation process, can be measured with a pO2 Clark electrode. A very easy to prepare sensing layer comprises glucose oxidase immobilized by copolymerization with bovine serum albumin inside a gelatin matrix, using glutaraldehyde as cross linking reagent. The bioactive layer is then used for coating a polypropylene gas-selective membrane covering a platinum cathode [10]. This type of detection has been extended to other oxidases for the measurement of various metabolites. Its main advantage is that it is insensitive to many interferent species.
Enzyme electrodes based on H2O2 detection
This type of detection generally leads to more sensitive devices but is subject to interferences. In our group we have chosen to use immobilized enzyme membranes closely associated to an amperometric platinum electrode with a potential poised at +650 mV vs Ag/AgCl reference [11]. Hydrogen peroxide is oxidized at this potential and the current thus obtained can be directly correlated to the concentration of the target molecule, here the oxidase substrate (Fig. 2).
Figure 2. Amperometric enzyme membrane electrode.
Enzymes were first immobilized on collagen membranes through acyl-azide groups. Collagen is made of a triple helix protein existing naturally in the form of fibrils which may be rearranged artificially into a film. The activation procedure was performed as follows: lateral carboxyl groups from aspartate and glutamate residues involved in the collagen structure were first esterified by immersion in methanol containing 0.2 M HCl for one week at 20-22°C. After washing, the membranes were placed overnight in a 1% (v/v) hydrazine solution at room temperature. Acyl azide formation was achieved by dipping the membranes into 0.5 M NaNO2/0.3 N HCl for 3 min. A thorough and rapid washing with a buffer solution (the same as used for the coupling step) provides activated membranes ready for enzyme coupling. Two types of coupling could be performed: random or asymmetric. For random immobilization, membranes are directly immersed in the coupling buffer solution in which one or several enzymes have been dissolved, and surface covalent binding occurs on both faces with randomly distributed molecules [12, 13]. It is also possible to immobilize different enzymes on each face using a specially designed coupling cell [14]. Before use, the enzymic membranes are washed with 1 M KCl for 15 min. Such a procedure allows special sensing layers to be prepared for bienzyme electrodes. For instance, a maltose electrode could be prepared using glucoamylase bound onto one face and glucose oxidase on the other face. The hydrolysis of maltose, which is the target analyte, occurs on one face, leading to the formation of glucose which crosses the membrane and is oxidized by glucose oxidase bound on the other face, in contact with the platinum anode detecting hydrogen peroxide. With this approach, an improvement in sensitivity could be obtained with air-dried electrodes [15].
More recently, new types of polyamide supports from Pall Industry S.A., France have been selected to prepare bioactive membranes efficiently. One, called Biodyne Immunoaffinity membrane, is supplied in a preactivated form allowing an easy and very fast coupling of enzymes. The different methods recommended by the supplier for coupling antigens and antibodies have been adapted in our group for designing tailor-made sensing layers [16]. Briefly, in the routine procedure, enzyme immobilization is achieved by simple membrane wetting: 20 microliters of the enzyme solution are applied on each side of the membrane and left to react for 1 min. Prior to their use, the membranes are washed with stirring in the chosen KCl containing buffer.
Measurements in real samples. The problem of interferences.
Many papers in fact deal with laboratory experiments in reconstituted media, and in this case enzyme electrodes work quite well. However, when they are immersed in complex mixtures of unknown composition, which is the normal situation for operation, many problems occur. For instance, at the fixed potential, other substances like ascorbic acid, uric acid, etc. can be oxidized, thus yielding an undesirable current and biased results. More drastic situations can be encountered, with rapid inhibition or inactivation of the enzyme by undesirable substances like heavy ions, thiol reagents, etc. In this case only a pretreatment of the sample will be efficient in protecting the bioactive probe. Let us focus on the problem of electrochemical interferences, which can be overcome without pretreatment of the sample by different approaches. As already mentioned, O2 electrodes are not subject to interferences, but their detection limit is rather high, which may be a real drawback in many cases.
Use of a differential system. When using hydrogen peroxide detection, a differential measurement with a two-electrode system can be used for the automatic removal of interferent currents. The two electrodes are equipped with the same type of membrane, but the active electrode bears the grafted glucose oxidase whereas a plain membrane without enzyme is associated with the compensating electrode. The current at the active electrode is due to the oxidation of hydrogen peroxide generated by the enzyme and possibly to the presence of electroactive interferent species. The current at the compensating electrode is due only to the interferences and can be continuously subtracted, leading to a signal directly correlated to the actual analyte concentration. A microprocessor-based analyzer using this principle is now on the French market [17]. Besides glucose, different analytes such as L-lactate can be assayed in complex mixtures using the same approach, with lactate oxidase leading to hydrogen peroxide production [18]. The bioactive tip of such enzyme electrodes can be considered as a compartmentalized enzyme system where membrane characteristics and hydrodynamic conditions play a prominent role in the distribution of the product, here the electroactive species, on both sides of the membrane. This can be of fundamental interest for improving the performance of these sensors, and models have been explored in this direction [19].
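The subtraction performed by a differential two-electrode system can be illustrated with a short sketch. It is not the firmware of the analyzer cited in [17]; the function names `corrected_signal` and `to_concentration`, the current values and the linear calibration slope are assumptions introduced here for illustration only.

```python
def corrected_signal(i_active, i_compensating):
    """Subtract the compensating-electrode current (interferences only)
    from the active-electrode current (H2O2 plus interferences),
    sample by sample."""
    if len(i_active) != len(i_compensating):
        raise ValueError("both electrodes must be sampled in step")
    return [a - c for a, c in zip(i_active, i_compensating)]


def to_concentration(i_net, slope):
    """Convert net current to concentration, assuming a linear
    calibration of the form current = slope * concentration."""
    return [i / slope for i in i_net]


# Hypothetical currents in nA: a constant 5 nA interferent current
# appears on both electrodes and cancels in the difference.
active = [5.0, 15.0, 25.0]
compensating = [5.0, 5.0, 5.0]
net = corrected_signal(active, compensating)
concentrations = to_concentration(net, slope=2.0)  # assumed 2 nA per unit
```

Because the interferent contribution is measured rather than estimated, the correction follows the sample composition continuously, which is the point of the differential arrangement.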
Multilayer systems. Interferences can also be removed using a multilayer membrane associated with hydrogen peroxide detection. This has been described for glucose and also extended to other analytes such as L-lactate, with an analyzer based on the work of L. C. Clark Jr. and his group [20] using membranes prepared by sandwiching a glutaraldehyde-treated lactate oxidase solution between a special cellulose acetate membrane and a 0.01 micron pore layer of polycarbonate Nuclepore membrane. The main role of the latter membrane is to exclude proteins and other macromolecules from passing into the bioactive layer. The cellulose acetate membrane allows only molecules of the size of hydrogen peroxide to cross and contact the platinum anode, thus preventing interferences by ascorbic acid, uric acid, etc. Glucose oxidase can be used as the final enzyme of a multienzyme sequence for the assay of analytes which cannot be directly oxidized. If sucrose has to be measured, for instance, it must first be hydrolyzed into α-D-glucose and fructose by invertase. After mutarotation, β-D-glucose is oxidized by glucose oxidase. In this case, endogenous glucose present in the sample will interfere, and an elegant approach has been proposed by Scheller and Renneberg to circumvent this drawback [21]. A highly sophisticated multilayer system has been designed by these authors. Briefly, an outer enzyme layer, acting as an anti-interference layer with entrapped glucose oxidase and catalase, degrades the hydrogen peroxide formed when glucose is present into non-responsive products. A dialysis membrane separates this anti-interference layer from the sucrose indicating layer: thus sucrose can reach the second active layer, where it is converted by the bienzyme system into hydrogen peroxide, which is measured by the indicating electrode (for a review see ref. [22]).
Mediated systems. The O2 involved in oxidase reactions can be replaced by electron mediators such as ferrocene or its derivatives. A pen-size biosensor for glucose monitoring in whole blood (ExacTech), based on this principle, has recently been launched. It mainly involves a printed carbon enzyme electrode with the mediator embedded in it; it can be operated whatever the O2 content and is not subject to interferences at the chosen potential (+160 mV vs SCE) [23].
3.2 Biosensors based on thermal and mass transduction
As already underlined, when the molecular recognition of the target molecule occurs, the resulting physical or chemical signal must be converted by the transducer into an electrical signal, which is processed to obtain a readable displayed result. Numerous attempts to find a “universal” transducer, i.e., one matching any kind of reaction, have been reported.
Heat variation appears as a signal resulting from practically any chemical or biochemical reaction. Danielsson and Mosbach [24] have developed calorimetric sensors based on temperature sensing transducers arranged in a differential setting. Enzymatic reactions occurring in microcolumns allowed heat variation to be measured between the inlet and outlet with thermistors. Mass variation following molecular recognition also appears very attractive as a universal signal to be transduced, especially for the antigen-antibody reaction, where no biocatalysis occurs. Piezoelectric devices sensitive to mass, density or viscosity changes can be used as transducers. Briefly, the change in the oscillation frequency can be correlated to the change in interfacial mass. Quartz based piezoelectric oscillators and surface acoustic wave (SAW) devices are the two types currently used [25, 26]. Besides sensing in the gas phase, recent work deals with the possibility of using such devices in the liquid phase with bound immunoreagents for improving selectivity.
4. The new frontier: microelectronics and optronics
Solid-state microsensors
Attempts to miniaturize biosensors are regularly reported, and field effect transistors (FETs) appear as excellent candidates for achieving such a goal [27]. Several methods for selectively depositing immobilized enzyme membranes on the pH-sensitive gate of an ion selective field effect transistor (ISFET) have been described, as exemplified in Figure 3. Kuriyama and coworkers [28] have proposed, for instance, a lift-off technique which can be briefly summarized as follows. A layer of photoresist is first deposited on the whole wafer but removed selectively from the gate region. The surface is then silanized and an albumin-glutaraldehyde-enzyme mixture is spin coated. After enzyme cross-linking, the photoresist layer is removed by acetone treatment and only the enzymic membrane remains on the sensitive area. In the example reported here, a silicon on sapphire (SOS) wafer was used. The chip dimensions were only 1.6 x 8 mm with a membrane thickness of about 1 micrometer.

Figure 3. Lift-off method for enzyme membrane deposition in FET-based biosensors (labels in the figure: gold layer, photoresist, enzyme immobilized membrane) [Ref. 28].
Optical transduction and fiber optic sensors
The design of sensors based on fiber optics and optoelectronics has attracted growing interest in recent years, especially for waveguides associated with chemical or biochemical layers [29-34]. The two main methods involve either the direct attachment of the sensing layer to the optical transducer or its coupling on or in a closely associated membrane. For this purpose, various types of supports have been utilized: cellulose, polyvinylchloride, polyamide, silicone, polyacrylamide, glutaraldehyde-bovine serum albumin membranes, and more recently Langmuir-Blodgett films.
Membrane biosensor based on bio- and chemiluminescence. In our group, a novel fiber optic biosensor based on bioluminescence reactions for ATP and NADH and on chemiluminescence for hydrogen peroxide has recently been developed [34]. For example, in the reaction catalyzed by the firefly luciferase, ATP concentration can be directly correlated to the intensity of light emitted according to the reaction:

\[ \mathrm{ATP} + \text{luciferin} \longrightarrow \mathrm{AMP} + \text{oxyluciferin} + \mathrm{PP_i} + \mathrm{CO_2} + \text{light} \]

Based on the same principle, NAD(P)H measurements can be performed with a marine bacterial system involving two enzymes, an oxidoreductase and a luciferase respectively:

\[ \mathrm{NAD(P)H} + \mathrm{H^+} + \mathrm{FMN} \longrightarrow \mathrm{NAD(P)^+} + \mathrm{FMNH_2} \]
\[ \mathrm{FMNH_2} + \text{R-CHO} + \mathrm{O_2} \longrightarrow \mathrm{FMN} + \text{R-COOH} + \mathrm{H_2O} + \text{light} \]
The specific enzymic membrane was prepared from polyamide membranes according to the method already described for enzyme electrodes. The membrane is maintained in close contact with one end of a glass fibre bundle by a screw-cap (Fig. 4). The other end is connected to the photomultiplier of a luminometer (Fig. 5). Provided the measurement cell is light-tight, concentrations of ATP and NADH in the nanomolar range can easily be detected with such a device, which can be equipped with a flow-through cell [35]. The system is specific and very sensitive. In contrast with other optical methods, there is no need for a light source or monochromators, and extension to several metabolites with associated NADH dependent dehydrogenases has also been realized [36].
Figure 4. Schematic representation of the luminescence fiber-optic biosensor; (a) septum; (b) needle guide; (c) thermostated reaction vessel; (d) fiber bundle; (e) enzymatic membrane; (f) screw-cap; (g) stirring bar; (h) reaction medium; (i) black PVC jacket; (j) O-ring [Ref. 34].

Figure 5. Overall setup of the fibre optic biosensor [Ref. 36].
5. Trends and prospects
Improvements of existing sensors or the design of new ones will depend on advances in the different domains related to sensing layers, transducers and signal processing systems. For the sensing layer the demand is still for more selectivity and more stability. This can be achieved by different approaches, outlined briefly and not exhaustively below.
Protein engineering
The possibility of modifying the structure of proteins by site directed mutagenesis is actively studied in many laboratories. Such modifications can increase the thermostability, which may be a decisive advantage for designing highly reliable biosensors [37].
Catalytic monoclonal antibodies (catMABs)
CatMABs are new biologically engineered tools capable of both molecular recognition and catalysis. To obtain such biomolecules, a molecule which resembles the transition state in catalysis must be stable enough to be used as a hapten for inducing antibody formation. The complementarity determining region of the antibody may then behave as an enzyme-like active site. These tailor-made biocatalysts appear very promising when target molecules cannot be sensed by available enzymes [38, 39].
Biological chemoreceptors
Recently, Buch and Rechnitz [40] have described a chemoreceptor based biosensing device derived from electrophysiology experiments using antennules of the blue crab. A signal could be recorded for a concentration of kainic acid and quisqualic acid as low as M. In an excellent review, Tedesco et al. pointed out that one of the central problems with the use of biological membrane receptors is the need for a transduction mechanism. In living organisms, the nicotinic receptor (ion-permeability mechanism) and the β-adrenergic receptor system (with cyclic AMP acting as a messenger within the cell) are the two major classes involved in complex sequences. Incorporating reconstituted molecular assemblies, with the natural receptor as a key component, in a suitable bioactive layer associated with a transducer still appears extremely difficult, taking into account not only the extraction and purification of the receptor itself but also its stability and functionality after reconstitution [41].
Supramolecular chemistry
Supramolecular chemistry has been widely developed by Lehn and his group in the past decade [42]. It has been possible to design artificial receptor molecules containing intramolecular cavities into which a substrate may fit, leading to cryptates, which are inclusion complexes. Molecular behaviour at the supramolecular level holds promise not only in molecular recognition but also in mimicked biocatalysis and transport. This appears as the first age of “chemionics”, with the expected development of molecular electronics, photonics or ionics.
Novel transducing modes; miniaturization
A lot of effort is also dedicated to miniaturization. Solid-state microsensors using integrated circuit fabrication technology appear as excellent candidates to bridge the gap between biology and electronics in an elegant way, and enzyme field effect transistors (ENFETs) are promising. However, problems of stability of the transducer itself and of encapsulation are not completely solved. It must be pointed out that miniaturization of the biosensing tip or array is also a prerequisite for most in vivo applications, as is the use of reagentless techniques or at least of systems incorporating previously added reagents with possible regeneration in situ. On this track, a very elegant approach was the use of a hollow dialysis fiber coupled with an optical fiber, as reported by Schultz and his group for glucose analysis. A continuous biosensor with a lectin as bioreceptor and fluorescein labeled dextran competing with glucose could be obtained by placing the bioreagents within the miniature hollow fiber dialysis compartment [43].
Signal processing system, multifunction probe or probe array
Finally, improvements can also be expected in the treatment of the signal, with smart systems adapted from other techniques to chemical sensors or biosensors. As an example, an array of sensors with different specificities, leading to a specific result pattern treated in real time, would certainly be of major interest in urgent situations when a prompt decision has to be taken.
6. Conclusion
The interdisciplinary approach necessary for the decisive breakthroughs expected in analytical monitoring is now recognized as compulsory. This is not so easy, since specialists from apparently distant disciplines must share a common language and have partially overlapping interfaces in order to develop new concepts and new analytical tools in a synergistic way. In the near future, biomimetic devices taking advantage of the extraordinary possibilities of Nature in the domain of selectivity via biomolecular recognition, of the tremendous evolution of electronics and optics in the past decades, and of the unequalled rapidity of real-time information processing will certainly be a milestone of the “Bioelectronics era”. Some years may still be necessary to know if the golden age is ahead...
References
1. Glajch JL. Anal Chem 1986; 58: 385A-394A.
2. Riebe MT, Eustace DJ. Anal Chem 1990; 62: 65A-71A.
3. Janata J, Bezegh A. Anal Chem 1988; 60: 62R-74R.
4. Mosbach K, Ed. Methods in Enzymology, Vol 44. New York: Academic Press, 1976: 999 p.
5. Avrameas S, Guilbert B. Biochimie 1972; 54: 837-842.
6. Wingard LB Jr, Katchalsky E, Goldstein L, Eds. Immobilized enzyme principles. Applied Biochemistry and Bioengineering, Vol 1. New York: Academic Press, 1976: 364 p.
7. Clark LC Jr, Lyons C. Ann NY Acad Sci 1962; 102: 29-45.
8. Guilbault GG. Analytical uses of immobilized enzymes. Marcel Dekker, 1984: 453 p.
9. Turner APF, Karube I, Wilson GS, Eds. Biosensors, Fundamentals and applications. Oxford University Press, 1987: 770 p.
10. Romette JL. GBF Monographs 1987; Vol. 10: 81-86.
11. Thévenot DR, Sternberg R, Coulet PR, Laurent J, Gautheron DC. Anal Chem 1979; 51: 96-100.
12. Coulet PR, Julliard JH, Gautheron DC. French Patent 2,235,153, 1973.
13. Coulet PR, Julliard JH, Gautheron DC. Biotechnol Bioeng 1974; 16: 1055-1068.
14. Coulet PR, Bertrand C. Anal Lett 1979; 12: 581-587.
15. Bardeletti G, Coulet PR. Anal Chem 1984; 56: 591-593.
16. Assolant-Vinet CH, Coulet PR. Anal Lett 1986; 19: 875-885.
17. Coulet PR. GBF Monographs 1987; Vol. 10: 75-80.
18. Bardeletti G, Séchaud F, Coulet PR. Anal Chim Acta 1986; 187: 47-5.
19. Maïsterrena B, Blum LJ, Coulet PR. Biotechnol Letters 1986; 8: 305-310.
20. Clark LC Jr, Noyes LK, Grooms TA, Gleason CA. Clinical Biochem 1984; 17: 288-291.
21. Scheller F, Renneberg R. Anal Chim Acta 1983; 152: 265-269.
22. Scheller FW, Schubert F, Renneberg R, Muller H-S, Jänchen M, Weise H. Biosensors 1985; 1: 135-160.
23. Cass AEG, Francis DG, Hill HAO, Aston WJ, Higgins IJ, Plotkin EV, Scott LDL, Turner APF. Anal Chem 1984; 56: 667-671.
24. Danielsson B, Mosbach K. In: Turner APF, Karube I, Wilson GS, Eds. Biosensors, Fundamentals and applications. Oxford University Press, 1987: 575-595.
25. Guilbault GG, Ngeh-Ngwainbi J. In: Guilbault GG, Mascini M, Eds. Analytical uses of immobilized compounds for detection, medical and industrial uses. NATO ASI Series, Dordrecht, Holland: D. Reidel Publishing Co., 1988: 187-194.
26. Ballantine D, Wohltjen H. Anal Chem 1989; 61: 704A-715A.
27. Van der Schoot BH, Bergveld P. Biosensors 1987/88; 3: 161-186.
28. Nakamoto S, Kimura I, Kuriyama T. GBF Monographs 1987; Vol. 10: 289-290.
29. Wolfbeis OS. Pure Appl Chem 1987; 59: 663-672.
30. Arnold MA, Meyerhoff ME. CRC Critical Rev Anal Chem 1988; 20: 149-196.
31. Seitz WR. CRC Critical Rev Anal Chem 1988; 19: 135-173.
32. Dessy RE. Anal Chem 1989; 61: 1079A-1094A.
33. Coulet PR, Blum LJ, Gautier SM. J Pharm & Biomed Anal 1989; 7: 1361-1376.
34. Blum LJ, Gautier SM, Coulet PR. Anal Lett 1988; 21: 717-726.
35. Blum LJ, Gautier SM, Coulet PR. Anal Chim Acta 1989; 226: 331-336.
36. Gautier SM, Blum LJ, Coulet PR. J Biolum Chemilum 1990; 5: 57-63.
37. Nosoh Y, Sekiguchi T. Tibtech 1990; 8: 16-20.
38. Schultz PR. Angew Chem Int Ed Engl 1989; 28: 1283-1295.
39. Green BS, Tawfik DS. Tibtech 1989; 7: 304-310.
40. Buch RM, Rechnitz GA. Anal Lett 1989; 22: 2685-2702.
41. Tedesco JL, Krull UJ, Thompson M. Biosensors 1989; 4: 135-167.
42. Lehn J-M. Angew Chem Int Ed Engl 1988; 27: 89-112.
43. Schultz JS, Mansouri S, Goldstein I. Diabetes Care 1982; 5: 245-253.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 20
A Turbo Pascal Program for On-line Analysis of Spontaneous Neuronal Unit Activity
L. Gaál and P. Molnár
Pharmacological Research Centre, Chemical Works of Gedeon Richter Ltd., 1475 Budapest, P.O. Box 27, Hungary
1. Introduction
Extracellular unit activity measurement is a widely used method in neuroscience, and especially in pharmacological research. A large number of papers prove the power of this and related methods [1-4]. Although extracellular unit activity measurement is one of the simplest methods of microelectrophysiology, the experimenter needs extensive experience with the conventional experimental setup and way of evaluation [1]. There are no suitable programs available for IBM PCs that provide full support in cell identification and in recording of extracellular neuronal activity together with experimental manipulation and evaluation of data. On the other hand, the experimental conditions should be varied according to the needs of the task. A data acquisition and analysis program should fulfill the above-mentioned requirements as much as possible [5]. The simplicity and high capacity of a measurement and the regulations of Good Laboratory Practice (GLP) require further solutions in pharmacological research. These reasons prompted us to create a program which meets most of our requirements. The main goals in the development of the program were:
- to support cell identification,
- to allow real-time visualization and analysis of neuronal activity,
- to record experimental manipulation,
- to support measurement and evaluation according to GLP,
- to create an easy-to-use and flexible program.
2. Methods and results
The program, named IMPULSE, was written in Turbo Pascal 5.5, because it is a high-level language, supports assembler-language routines and has excellent graphical capabilities [6]. The source code is well structured and documented; thus, it may be modified to suit particular research needs. It contains carefully optimized assembly-language routines for the critical acquisition steps to allow high-speed sampling. A new, fully graphic command interface was developed that can be used by single-keystroke commands or keyboard menu selection; in addition, it contains a sophisticated context-sensitive help available at all times. This command interface allows the easy creation of different menu structures containing not only menu points but also fill-in form fields. The hardware requirements of the program are summarized in Table 1.

TABLE 1
Hardware requirements of IMPULSE 3.0.
- IBM PC, XT, AT compatible computer
- Graphics adapter and monitor
- Hard disk
- 640 K memory
- TL-1 series acquisition system
- Epson compatible printer

The inputs of the program are the following:
- analogue signal of the extracellular amplifier,
- TTL pulse of an event detector,
- TTL pulses of any instrument (e.g., iontophoretic pump),
- keyboard.
Figure 1. The problem of spontaneous spike sampling. Action potentials of spontaneously active neurons are random. The detection of a spike is done by an event detector (or window discriminator). This device sends a trigger when the input signal exceeds the discrimination level. Using the conventional way of sampling (i.e., sampling started by the trigger), only a part of a spike can be sampled (upper right panel: posttrigger sampling). To sample the whole spike, the pretrigger part of the signal should also be acquired (bottom right panel: pre- and posttrigger sampling).
Figure 2. Screen dump of IMPULSE <Spike> menu and screen. The left windows contain (downwards) the already sampled spike, firing pattern and interspike time distribution (IHTG) of a cerebellar Purkinje neuron. The right window is the active window showing the current spike. The amplitude of the current spike and the status of the sampling mode are displayed in a dataline. (See text for the details of the <Spike> menu.)
The first technical problem to be solved was the acquisition of spontaneous spikes by sampling the full signal (including the pre-spike signal as well). The principle of the problem is shown in Figure 1. Our solution was the use of a ring buffer. When the user starts the sampling of a spike, the output of the AD converter is loaded continuously into the ring buffer. The signal of the event detector stops the acquisition after a certain delay. At this time the ring buffer contains the pre- and posttrigger part of the signal. IMPULSE can be characterized by the following functions:
1. SPIKE: shape and amplitude of spontaneous action potentials with pretrigger intervals, averaging possibility.
2. PATTERN: time-series of spikes with real amplitudes.
3. IHTG: interspike time distribution of action potentials.
4. FHTG: frequency histogram of neuronal activity.
5. CONFIGURATION: parameters of hardware environment, sampling, displaying and protocols.
The first three functions serve cell identification.
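The ring-buffer acquisition described above can be sketched as follows. This is an illustration in Python rather than the authors' Turbo Pascal and assembly routines, which are not shown in the text; the class `RingBuffer`, the function `capture_spike` and the parameters `level`, `size` and `post` are hypothetical names, and the external event detector is emulated in software by a simple threshold test.

```python
class RingBuffer:
    """Fixed-size circular buffer holding the most recent samples."""

    def __init__(self, size):
        self.size = size
        self.data = [0] * size
        self.head = 0    # index where the next sample is written
        self.count = 0   # number of valid samples stored so far

    def push(self, sample):
        self.data[self.head] = sample
        self.head = (self.head + 1) % self.size
        self.count = min(self.count + 1, self.size)

    def snapshot(self):
        """Return the stored samples, oldest first."""
        if self.count < self.size:
            return self.data[:self.count]
        return self.data[self.head:] + self.data[:self.head]


def capture_spike(samples, level, size, post):
    """Load samples continuously into the ring buffer. When a sample
    exceeds `level` (the emulated trigger), acquire `post` further
    samples and stop, so the buffer holds pre- and posttrigger data."""
    ring = RingBuffer(size)
    remaining = None  # stays None until the trigger has fired
    for s in samples:
        ring.push(s)
        if remaining is None:
            if s > level:
                remaining = post
        else:
            remaining -= 1
        if remaining == 0:
            break
    return ring.snapshot()


# Hypothetical AD converter output with one spike peaking at 9
signal = [0, 1, 0, 1, 2, 9, 7, 3, 1, 0, 0]
spike = capture_spike(signal, level=5, size=6, post=3)
```

In this sketch the delay between trigger and stop is expressed as a number of posttrigger samples; the ratio of `post` to `size` sets how much of the captured window lies before the trigger.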
Figure 3. Screen dump of IMPULSE <IHTG> menu and screen (see legend of Fig. 2). The active window shows the progress of IHTG sampling. Neurons of the same type can be characterized by a certain distribution of interspike times; thus, IHTG is very helpful in cell identification.
The SPIKE function calls a menu (Fig. 2) which contains the instructions for spike sampling. Spikes can be sampled continuously or one by one, and can be averaged and displayed with real or normalized amplitude in a window or on the full screen (Zoom). The amplitude is measured for every spike. Selected spikes can be stored for further evaluation. The PATTERN function serves the visualization of the firing pattern of the neuron, which is very helpful information in cell identification. Spikes are represented by vertical lines (the length is proportional to the amplitude of the spike), and the time between successive spikes is represented by the distance between lines. Sampled patterns can be stored for further evaluation. The IHTG function (Fig. 3) is a powerful new tool in cell identification and also in the study of neuronal activity. It measures the time between successive spikes (interspike time) and displays its current distribution. Sampled IHTGs can be stored for further evaluation. The main line of the experiment is the registration of the activity (i.e., the firing rate) of the neuron studied; thus, the FHTG function is the central one. The number of spikes is integrated over a certain timebase (bin) and is displayed continuously in the form of a frequency histogram (Fig. 4).
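The two histograms can be illustrated with a minimal sketch, assuming that spike times are available as integer milliseconds. The function names and the bin settings are hypothetical and do not reproduce the IMPULSE source code; values falling outside the histogram range are simply ignored here.

```python
def interspike_intervals(spike_times):
    """Time between successive spikes."""
    return [t1 - t0 for t0, t1 in zip(spike_times, spike_times[1:])]


def histogram(values, bin_width, n_bins):
    """Count values into n_bins bins of equal width starting at zero."""
    counts = [0] * n_bins
    for v in values:
        index = int(v // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def ihtg(spike_times, bin_width, n_bins):
    """Interspike time histogram: distribution of interspike times."""
    return histogram(interspike_intervals(spike_times), bin_width, n_bins)


def fhtg(spike_times, bin_width, n_bins):
    """Frequency histogram: number of spikes per time bin."""
    return histogram(spike_times, bin_width, n_bins)


# Hypothetical spike times in milliseconds
spikes = [100, 350, 400, 900, 1200, 1250, 2600]
rate = fhtg(spikes, bin_width=1000, n_bins=3)     # spikes per 1 s bin
intervals = ihtg(spikes, bin_width=100, n_bins=6)  # 100 ms interval bins
```

Both displays are derived from the same list of spike times, which is why they can be updated simultaneously during a recording.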
Figure 4. Screen dump of IMPULSE <FHTG> menu and screen. Extracellular unit activity measurement of a noradrenergic locus coeruleus neuron. Upper windows show the spike and the firing pattern sampled simultaneously with the recording of FHTG. Markers above the histogram and the lines of different patterns in the histogram indicate the onset of drug administrations. See the text for details.
Other functions can be used simultaneously (e.g., the variation of spike shape and pattern can be displayed and stored as illustrated in Fig. 4). This means that the measurement is not restricted to the recording of firing rate; the registration of spike shape and of variations in firing properties is also included. Experimental manipulation (generally drug administration) can be marked and connected to the FHTG record in two ways:
1. Keyboard input: the number keys represent the onset of different types of drug administration.
2. TTL pulses of any instrument: the switching on and off of a maximum of five devices (e.g., iontophoretic pump).
The markers are stored together with the FHTG data. They are displayed over the FHTG graph to facilitate the exact evaluation of effects (Fig. 4). The length of FHTG recording is limited only by the free memory of the computer (e.g., if the bin is one second, more than 5 hours of continuous recording is possible). The parameters of the experiment (e.g., sampling rate, time base of FHTG, etc.) can be varied over a wide range and stored in a configuration file. By means of this variability the program makes it possible to satisfy a broad range of requirements. The storage of parameters provides complete registration of the experimental conditions, allowing an experiment to be reproduced under the same conditions.
Data are stored on hard disk in standard binary files. The file contains the frequency histogram, markers, shapes of certain spikes, the patterns and IHTGs selected, and any comments. Neuronal activity and the effects of the experimental manipulations recorded (hereafter referred to as drug administration) are evaluated and printed in final GLP form by an independent program, EVALEXT. The main functions of EVALEXT are illustrated in Figure 5. Data can be edited (e.g., to remove artifacts) in graphical form. Drug administration can be commented and also edited in a full window editor. Data can be divided into intervals. This is done either automatically (intervals of predefined length connected to the markers) or interactively, using graphical selection of intervals (Fig. 6). The intervals represent the different states of neuronal [...]

Figure 5. Main functions of the evaluation program, EVALEXT. EVALEXT evaluates the data created by IMPULSE. Data can be edited and divided into intervals, which are the inputs of statistical analysis. The recorded spikes, patterns and IHTGs can be displayed and printed. The results are presented in a protocol according to the prescriptions of GLP.

Figure 6. Screen dump of EVALEXT menu and screen. Data can be divided into intervals. Markers above the histogram and the different patterns of the histogram represent the already defined intervals. The two cursors allow an interactive selection of intervals according to the particular needs of the user. Statistical comparison is performed between the selected intervals.
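The per-interval quantities that appear in the protocol of Figure 7 (mean firing rate per bin, SEM, and percentage of baseline) can be sketched as below. This is not the EVALEXT code: the function names, the interval boundaries and the spike counts are invented for illustration, and the formula used for the standard error of the percent value (SEM scaled by the baseline mean) is an assumption, since the text does not state how it is computed. The ANOVA step is omitted.

```python
import math


def mean_and_sem(counts):
    """Mean spike count per bin and standard error of the mean."""
    n = len(counts)
    mean = sum(counts) / n
    if n < 2:
        return mean, float("nan")
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, math.sqrt(variance / n)


def summarize(bins, intervals, baseline="Baseline"):
    """bins: spike counts per bin of the frequency histogram.
    intervals: name -> (start, stop) bin indices, stop exclusive."""
    stats = {name: mean_and_sem(bins[a:b]) for name, (a, b) in intervals.items()}
    base_mean = stats[baseline][0]
    return {
        name: {
            "mean_fr": m,
            "sem": s,
            "percent": 100.0 * m / base_mean,  # baseline taken as 100%
            "sep": 100.0 * s / base_mean,      # assumed scaling of the SEM
        }
        for name, (m, s) in stats.items()
    }


# Hypothetical histogram: four baseline bins, then four bins after a drug
bins = [10, 12, 8, 10, 4, 6, 5, 5]
intervals = {"Baseline": (0, 4), "Drug A": (4, 8)}
result = summarize(bins, intervals)
```

With these made-up counts the firing rate after the hypothetical drug falls to half of the baseline value.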
Figure 7. Protocol of an extracellular unit activity experiment. The head of the protocol contains the description of the experiment. The results of ANOVA and the detailed effects of different drugs are presented. “Mean FR” is an average value of spike numbers per bin for the given interval. “SEM” is the standard error of the mean. “Percent” values are expressed as the percentage of the baseline activity (considered as 100%). “SEP” is the standard error of the “Percent” value. The stars indicate a significant difference from the baseline value at p [...]

    [...] = 100 THEN 190
    IF PPT < 100 THEN 150
- if Pb or Ag present
150 SHAKER
    CENTRIFUGE
    AG::SAVE.DECANT.FOR.TEST
    AG::TEST.DECANT.FOR.REMAINING.PPT
    VERTICLE.LED
    IF PPT >= 200 THEN 160
    AG::IF PPT.DETECTED.IN.DECANT
    GOTO 175
- all Pb removed
160 RACK.1.INDEX=DECANT.TUBE
    PUT.INTO.RACK.1
- all Ag removed
175 AG::IF.NO.PPT.DETECTED.IN.DECANT
    PUT.INTO.RACK.1
    GOTO 200
- no class I ions present
190 UNKNOWNS=UNKNOWN.SALE
    DECANT.TUBE=1
    T$ = 'CLASS I IONS NOT PRESENT'
    PUT.ONTO.RACK.1
    HEAT.OFF
200 FE::ADD.1ML.NH3.BUFF

Figure 6. EASYLAB program PRETEST.
3.5 Databases
Two types of databases are used by the supervisor system. One contains the structured data required to develop a procedure, such as wavelengths and extinction coefficients for analytes, while the other contains validated analytical procedures for phosphate analyses. The nature and quantity of the former type of information is such that it is best stored in a database format rather than in the form of heuristic rules in a knowledge base. The latter database is used to store automated procedures that have previously been validated by the system. Before attempting to develop a method, the supervisor expert system interacts with the procedures database to see if a suitable procedure exists for the determination requested by the analyst. If a procedure exists, the analyst is informed, the LUO’s and the values for the parameters are loaded into the robot controller, the determinations are performed and the results are displayed to the analyst for final approval.

3.6 Data structure
The present prototype system transfers data and information among the software components through the use of ASCII text files. The component programs have the ability to read and write this fundamental type of file. In some instances it has been necessary to write simple driver programs in QUICK BASIC to create the files and to open the communications links between computers. While the use of ASCII files may seem primitive, it has been effective in developing the prototype system. The use of an object-oriented expert system will change the format of the common data structure and should increase the efficiency of communications among system components.
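Exchanging parameters between components through an ASCII text file can be illustrated with a small sketch. The file layout used here (one `KEY=VALUE` pair per line), the function names and the file name are assumptions made for the example; the text does not describe the actual file format of the prototype, whose drivers were written in QUICK BASIC.

```python
import os
import tempfile


def write_parameters(path, params):
    """Writer side: one KEY=VALUE pair per line, plain ASCII."""
    with open(path, "w", encoding="ascii") as handle:
        for key, value in params.items():
            handle.write(f"{key}={value}\n")


def read_parameters(path):
    """Reader side: parse the same file back into a dictionary.
    All values come back as text; the reader decides how to convert."""
    params = {}
    with open(path, "r", encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            key, value = line.split("=", 1)
            params[key] = value
    return params


# Hypothetical hand-over of measurement settings between two components
path = os.path.join(tempfile.mkdtemp(), "procedure.txt")
write_parameters(path, {"ANALYTE": "CUSO4", "WAVELENGTH_NM": 840})
loaded = read_parameters(path)
```

A format of this kind is readable by any program that can open a text file, which is the practical advantage of ASCII files noted above, at the cost of losing type information.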
4. Automated phosphate determination
Efforts to date have concentrated on developing and testing the software required to control the laboratory apparatus and instrumentation. Dictionary entries (“words”) for the LUO’s required to calibrate and determine ortho-phosphate concentrations by the vanadomolybdophosphoric colorimetric method have been defined and validated.
An example of the work in progress is the analysis of the variances involved with the mixing, transfer and absorption measurement. Aqueous copper sulfate solutions, known to be chemically stable over long periods of time, were used for this study in order to separate the variances associated with the automated system components from the uncertainties associated with the reactions of the phosphate analysis. Table 2 shows the results of the copper sulfate experiments.

TABLE 2
Variances of laboratory operations. Dilution of CuSO4 · 6 H2O stock solution; CuSO4 measured at 840 nm.
$s^2_{\mathrm{TOTAL}} = s^2_{\mathrm{MIX}} + s^2_{\mathrm{ABS}} + s^2_{\mathrm{SIP}}$
$s^2_{\mathrm{SIP}}$   = 1.17 x 10^-5
$s^2_{\mathrm{ABS}}$   = 1.66 x 10^-5
$s^2_{\mathrm{MIX}}$   = 0.86 x 10^-5
$s^2_{\mathrm{TOTAL}}$ = 3.69 x 10^-5

In order to obtain the variance associated with the spectrophotometric measurement, a single copper sulfate solution was measured 40 times at a wavelength of 840 nm over a period of 30 minutes. One absorption value represents the average of 5 measurements. Next, absorbance measurements were obtained for 40 aliquots taken from a single test tube using the sipper workstation of the automated system. This experiment included the variances associated with the spectrophotometer absorption measurement and with the transfer of the sample from the test tube to the spectrophotometer cell. The variance for the sample transfer was then obtained by subtracting the spectrophotometer variance from the variance of the single test tube experiment. Finally, a variance was obtained for the total process by pipetting a specified amount of stock copper sulfate solution into 40 different test tubes, adding a constant amount of distilled water to each, mixing the solutions with a vortex mixer, sipping one sample from each test tube into the absorption cell and obtaining an absorption measurement. The variance of the dilution process is calculated by subtracting the variances for measurement and transfer from the variance for the total process, which includes dilution. Our results indicate that the largest error is associated with the spectrophotometer absorbance measurement. This analysis of the variances of the individual tasks comprising an automated process illustrates the function of the controller software in an automated design and analysis.
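The subtraction scheme used to separate the variance components can be written out as a short sketch. The function name `variance_components` is hypothetical. The variance of the single test tube experiment is not reported in the text; the value 2.83 x 10^-5 used below is back-calculated here as the sum of the ABS and SIP entries of Table 2 and serves only to show that the arithmetic is consistent with the table.

```python
def variance_components(var_abs, var_single_tube, var_total):
    """Separate additive variance contributions by subtraction.

    var_abs         : repeated readings of one solution (measurement only)
    var_single_tube : aliquots sipped from one tube (measurement + transfer)
    var_total       : separately diluted tubes (measurement + transfer + mixing)
    """
    var_sip = var_single_tube - var_abs    # sample transfer (sipper)
    var_mix = var_total - var_single_tube  # dilution and mixing
    return {"ABS": var_abs, "SIP": var_sip, "MIX": var_mix}


# ABS and TOTAL are taken from Table 2; the single-tube value is the
# back-calculated sum ABS + SIP (an assumption, not a reported result).
components = variance_components(
    var_abs=1.66e-5,
    var_single_tube=2.83e-5,
    var_total=3.69e-5,
)
largest = max(components, key=components.get)
```

The subtraction is valid only if the three error sources are independent, so that their variances add; under that assumption the components recovered here match the SIP and MIX entries of Table 2.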
In the next phase of the project the experimental design and data analysis software components will be interfaced to the controller software, thus permitting the analyst to design experiments and transfer the procedures directly to the automated system for execution.

The LUO’s required to obtain a calibration curve for the vanadomolybdophosphoric acid determination of ortho-phosphates were implemented, and calibration curves for three replicate sets of calibration solutions prepared by the automated system were obtained (Fig. 7). The results verified that uncertainties in absorption measurements are largest at the extremes of the plot, below 0.20 AU and above 0.80 AU. The calibration was both linear and reproducible for concentrations between these extreme absorption values. If the absorption value of a sample solution is greater than 0.80 AU, the system will calculate the dilution required to give an optimum absorption value of 0.434, perform the required dilution and measure the absorption of the diluted solution. If an absorption value is below 0.200 AU, the system will warn the analyst and suggest another method for the phosphate determination, the stannous chloride method, which is reported to have a lower detection limit than the vanadomolybdophosphoric acid method.

Figure 7. Calibration plots (calibration plot overlay; horizontal axis: phosphate (PO4^3-) concentration).

Work in progress is concerned with selecting and optimizing the critical experimental parameters for the vanadomolybdophosphoric acid determination. Factors under consideration are: (1) mixing and reaction times, (2) pH of the sample solutions, (3) wavelength for absorption measurements, (4) the presence of interferences, particularly silicates, in the sample solution and (5) the temperature of the sample solution. The two programs mentioned in the preceding section, DESIGN-EASE™ and EXPERIMENTAL DESIGN™, are currently being evaluated for use in designing the experiments. At the present time it appears that a full factorial method can be used for the design due to the limited number of experimental parameters. When the critical factors have been determined, a sequential simplex method such as SIMPLEX-V™ will be interfaced to the laboratory system to automatically determine optimum values for the critical experimental parameters. Once these values have been determined, the system will be calibrated with a series of standard solutions and the phosphate concentrations of water samples will be determined.

The two other colorimetric methods and a potentiometric method for phosphate determinations will be studied using the techniques developed for the vanadomolybdophosphoric acid method. All four of the methods for phosphate determinations will then be compared. Information concerning the optimum conditions (concentration ranges, interferences, etc.) will be incorporated into the chemical knowledge base of the supervisor system.
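The absorbance rule described earlier in this section (dilute above 0.80 AU to reach 0.434 AU, warn below 0.200 AU) can be written as a small decision function. The sketch below is hypothetical: the function name and return values are invented for illustration, and the dilution factor assumes that absorbance scales linearly with concentration (Beer-Lambert behaviour), which the text does not state explicitly.

```python
OPTIMUM_AU = 0.434   # target absorbance after dilution (from the text)
LOW_AU = 0.200       # below this the analyst is warned (from the text)
HIGH_AU = 0.80       # above this the sample is diluted (from the text)


def plan_measurement(absorbance):
    """Return (action, dilution_factor) for a measured absorbance.

    Assumes absorbance is proportional to concentration, so the
    dilution factor is simply absorbance / OPTIMUM_AU.
    """
    if absorbance > HIGH_AU:
        return ("dilute", absorbance / OPTIMUM_AU)
    if absorbance < LOW_AU:
        # The text suggests the stannous chloride method in this case.
        return ("warn_suggest_stannous_chloride", None)
    return ("accept", None)


for reading in (0.15, 0.50, 0.95):
    print(reading, plan_measurement(reading))
```

In a real system the first reading above 0.80 AU may itself lie outside the linear range, so the computed factor would only be a first estimate that is checked by the repeat measurement.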
5. Summary
The intelligent automated system allows the analyst to focus on the chemistry involved in automating a method by providing expertise in experimental design, data analysis and automated procedures. The interface between the analyst and the system facilitates the communication required to design and validate the operation of automated methods of analysis. The proposed system permits analysts to design and implement rugged automated methods for replication in many laboratories. The creation, electronic transfer and implementation of this technology can improve the reliability of interlaboratory data. Every attempt is being made to use existing, commercial software packages in the development of the system. The hardware and software components will be made as compatible as possible to facilitate efficient configuration, optimization and transfer of automated methods.
Acknowledgements
This work was funded in part by National Science Foundation Research in Undergraduate Institutions Grant # CHE-8805930, a matching funds equipment grant from Zymark Corporation, Hopkinton, MA, and a grant from local research funds of the VMI Foundation.
References
1. Franson M, ed. Standard Methods for the Examination of Water and Wastewater. 16th Edition. Washington, DC: American Public Health Association, 1985, 440.
2. Granchi MP, Biggerstaff JA, Hillard LJ, Grey P. Spectrochimica Acta 1987; 42: 169-180.
3. VP-Expert (Version 2.2). Berkeley, CA: Paperback Software, 1989.
4. STAT-EASE. Hennepin Square, Suite 191, 2021 East Hennepin Ave., Minneapolis, MN 55413, 1989.
5. Statistical Programs. 9941 Rowlett, Suite 6, Houston, TX 77075, 1989.
6. Deming SN, Morgan SL. Anal Chem 1974; 9: 1174.
7. Massart DL, Dijkstra A, Kaufman L. Evaluation and Optimization of Laboratory Methods and Analytical Procedures. Amsterdam: Elsevier, 1978, 213-302.
8. Morgan SL, Deming SN. Anal Chem 1974; 46: 1170-1181.
9. Plackett RL, Burman JP. Biometrika 1946; 33: 305-325.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 23
Laboratory Automation and Robotics - Quo Vadis?
M. Linder
Mettler-Toledo AG, Greifensee, Switzerland
1. Introduction
Growing demands of society have promoted the development of many aspects of analytical chemistry. Breakthroughs in the field of microelectronics and microcomputer science have expanded the scope of analytical instrumentation. Sophisticated instruments provide valuable analytical information. However, they require qualified operators. This situation often forces trained technicians and scientists to perform repetitive tasks rather than delegate them to less skilled personnel.

Improving productivity is of top priority in most laboratories. There is a distinct trend in industry toward more effort in R&D and Quality Control. Often staff numbers are constrained and cannot expand to match the ever increasing workloads and demands for laboratory support. The performance of an analytical laboratory is judged by the quality of the results and the speed at which they are produced. Automation permits laboratories to maintain a high quality standard that conforms to the Good Laboratory Practice (GLP) guidelines and creates challenging work for professionally trained personnel.

Laboratory automation involves two areas: instrument automation and laboratory management automation. The means of achieving management automation is a Laboratory Information Management System [1], generally known by the abbreviation LIMS. The main topic of this contribution is the discussion of instrument automation.
2. Objectives of laboratory automation in chemical analysis
Laboratory automation is a means of achieving objectives, some of which can be justified by economic factors such as cost savings. Economic justification is obvious for procedures such as extended operations (more than 8 hours a day) and unattended operations. There are other objectives and motivating factors beyond cost savings that lead to several benefits and increased performance. They are summarized in Table 1.

TABLE 1
Objectives/benefits of laboratory automation.

Objective                                  Parameter / Benefits
Cost reduction and improved productivity   Savings in personnel and material
                                           Unattended/extended operations
                                           Reliability of the instruments
                                           Ease of operation
Accuracy and increased quality of          High degree of instrument control
measurements and results                   Improved specificity/sensitivity of the procedures
                                           Increased throughput allows multiple measurements
Reliability/availability                   Solid design
                                           High MTBF (Mean Time Between Failure)
                                           Low MTTR (Mean Time To Repair)
                                           Proven technique
Safety                                     Reduced risks to employees and the environment
                                           Precise control of the measurement process
                                           No breakdowns
Speed                                      Reduced analysis time
                                           Reduced turnaround time
Flexibility                                Easy adaptation to different procedures

Improving laboratory quality and productivity has become an urgent strategy in many organizations. The leading chemical, pharmaceutical, food, energy and biotechnology companies face intense, world-wide competition. To meet this challenge, the strategies demand the following: (i) develop innovative new products that fulfill the customers' needs, (ii) manufacture products that meet higher quality expectations and standards, (iii) improve the organizational productivity and (iv) reduce risks to employees and the environment. These strategies rely on the ability of the industrial laboratories to provide increased analytical support for decision making in research, product development and quality control.
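The two reliability parameters listed in Table 1, MTBF and MTTR, are commonly combined into a steady-state availability figure, MTBF / (MTBF + MTTR). The short sketch below illustrates that standard relation; the function name and the example hours are hypothetical and are not taken from the chapter.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical instrument: fails on average every 990 h, repaired in 10 h.
print(f"availability = {availability(990.0, 10.0):.3f}")
```

The relation makes explicit why both a high MTBF and a low MTTR appear under the same objective: either one alone does not guarantee that the instrument is available when samples arrive.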
3. Elements of the analytical process
The tasks in the analytical chemical laboratory can be divided into three general areas (Fig. 1a): (i) sample preparation, (ii) analytical measurement and (iii) data evaluation. For the further discussion of automation in chemical analysis this simple scheme needs to be refined. It must consist of the following elements (Fig. 1b): (i) sampling, (ii) sample preparation and sample handling (which includes analyte release and analyte separation procedures), (iii) analytical measurement (including standardization and calibration), (iv) data acquisition and data processing, (v) data validation and decision and (vi) documentation. The data validation element may include a decision step that allows feedback control to each element of the automated system. In considering the potential benefits of automation, each item above should be addressed. However, sampling will often be an external operation not amenable to automation.

Figure 1. The analytical process: a) Basic elements of the analytical process. b) Elements of automated chemical analysis.
4. Status of laboratory automation
Research & Development in analytical chemistry in the sixties and seventies created new and improved measurement techniques. An industry for analytical instruments with highly automated measurement procedures has emerged. This business has now grown to annual sales of about $5 billion world-wide. The enormous developments in computer technology over the past decade have resulted in a completely new generation of computer applications. This is also apparent in the field of laboratory automation. It has allowed the analytical laboratory to automate data handling and documentation. Whole Laboratory Information Management Systems (LIMS) automate sample management and record-keeping functions.
This leaves sample preparation as the weak element in automated analysis. Sample preparation is highly application dependent. Several non-trivial operations are necessary to bring the sample into the right state for analytical measurement. An incoming sample may be inhomogeneous, too concentrated, too dilute, contaminated with interfering compounds, unstable under normal laboratory conditions or in another state that prevents direct analysis. However, sample preparation has a high potential for automation. A typical example is the determination of water content by the Karl Fischer titration method [2]. Individual preparation steps for the different types of sample are necessary, whereas the titration procedure is always the same (using either a dedicated volumetric or coulometric titrator).

Sample preparation is probably the most critical factor for the accuracy and precision of analytical results and for sample throughput and turnaround time. The productivity of laboratory resources (personnel and instruments) and laboratory safety are mainly influenced by sample preparation procedures. Manual sample preparation is subject to human variability, labor intensive and thus expensive, tedious and time-consuming, often dangerous (it exposes people to hazardous environments) and difficult to reproduce after personnel changes.
5. Approaches to laboratory automation
Two fundamentally different approaches to automation can be distinguished: (i) the automation of the status quo and (ii) the change of the procedure for easier automation (possibly including the technological principle or the method of analysis). The strict automation of existing procedures is not always successful. The duplication of manual methods with a machine has its limits. Tailoring the techniques and procedures to the automation is a better approach to gain maximum benefit. Streamlined procedures, lower implementation costs, operational economies, better data and faster throughput can often be achieved by critically reviewing the existing methods.

The change of a technological principle or of the method of analysis is a long term approach, which is a possibility whenever a technological breakthrough occurs. Examples of this kind are the evolution of balance technology from mechanical to electronic balances and the introduction of robotics in analytical chemistry. The search for new assay methods and new instrumentation that reduce the amount of labor for sample preparation is an ongoing process. However, alternative methods are often not known or, for legislative reasons, they may not be allowed.

At this point it is worth mentioning the concept of process analytical chemistry (PAC) [3]. Unlike traditional analytical chemistry, which is performed in sophisticated laboratories by highly trained specialists, process analytical chemistry is performed on the front lines of the chemical process industry. The analytical instruments are physically and operationally a part of the process. The output data are used immediately for process control and optimization. The major difference between the strategies of PAC and traditional analytical chemistry is shown in Figure 2.

Figure 2. Different strategies of traditional analytical chemistry and process analytical chemistry (panels: off-line, at-line, on-line and in-line analysis).

Depending on how process analyzers are integrated in the process, the procedures can be classified as at-line, on-line or in-line measurements. In at-line analysis, a dedicated instrument is installed in close proximity to the process unit. This permits faster sample processing without too much loss of time caused by sample transportation. In on-line analysis, sampling as well as sample preparation are completely automated and form an integral part of the analyzing instrument. The difficult process of sampling can be avoided completely by in-line analysis, where one or more selective sensing devices, e.g., ion-selective electrodes, are placed in direct contact with the process solution or gas. The choice between off-line, at-line or on-line procedures depends on the sampling frequency imposed by the time constant of the process, the complexity of the samples and the availability of the necessary sensors.

Automatic methods applied to the analysis of a series of samples can be divided into two general categories: (i) discrete or batch methods and (ii) continuous-flow methods. In batch methods each sample is kept in a separate vessel in which the different analytical stages (dilution, reagent addition, mixing, measurement) take place through mechanical processes. In continuous-flow methods the samples are introduced at regular intervals into a carrier stream containing a suitable reagent. The injected sample forms a zone which disperses and reacts with the components of the carrier stream. The flow then passes through a flow cell of a detector.
The shape and magnitude of the resulting recorded signal reflect the concentration of the injected analyte along with kinetic and thermodynamic information on the chemical reactions taking place in the flowing stream. Flow methods are gaining more importance these days, especially in process analytical chemistry. The introduction of unsegmented-flow methods in 1974, now referred to as Flow Injection Analysis (FIA) [4], has remarkably simplified the necessary equipment. Due to speed and reagent economy, most common colorimetric, electroanalytical and spectroscopic methods have been adapted to FIA. In addition, sample preparation techniques such as solvent extraction, dialysis and gas diffusion have also been realized with flow injection analysis.

In the last ten years, the problem of automated sample preparation and manipulation has been addressed by the use of flexible laboratory robots and dedicated sample handling systems, such as autodiluters. Laboratory robotics, commercially introduced in 1982 [5] as an alternative to manual sample handling, brings several benefits. Analytical results are more reliable. Users are getting them faster, more safely and often at lower cost than before. Various objectives of laboratory automation, discussed in Section 2, can be achieved.
Such robots are manipulators in the form of articulated flexible arms with various possible geometric configurations (Cartesian, cylindrical, spherical or rotary), designed to move objects and programmable for different tasks.

These latest developments provide the final piece for automated chemical analysis. Tying the automated sample handling system to the chemical instrumentation and data handling network leads to a complete system approach for a totally automated laboratory.

Automation procedures in industry traditionally have required a large quantity of identical, repetitive operations to justify the large initial investments of automation. This fixed or dedicated automation is most suitable for those processes where production volumes are high and process change-overs are low. This explains the success of automation in clinical analysis. The clinical laboratory is dealing mainly with two types of sample, blood and urine, whereas most industrial laboratories have to work on a wide spectrum of products. In addition, changing needs (new products and new analyses) are typical in modern laboratories. Laboratory robotics provides flexible automation able to meet these changing needs. Flexible automation systems are programmed by individual users to perform multiple procedures. They have to be reprogrammed to accommodate new or revised methods. As illustrated in Figure 3, flexible automation bridges the gap between manual techniques and specialized, dedicated automation.

Figure 3. Flexible automation (vertical axis: number of samples per day; horizontal axis: complexity of procedure; regions: manual, flexible automation, dedicated automation).

Automation of entire laboratory procedures that are unique to the compound and matrix being analyzed seems to be an impossible task. The diversity of sample materials is reflected in the number of assay methods that are in use. However, the constituents of a method are common partial steps or building blocks that occur in one form or another in many assays.
Examples of the most frequent unit operations are given in Table 2. Laboratory procedures can be represented as a sequence of unit operations, each specified by a set of execution parameters. This suggests that these unit operations should be automated in order to automate individual analyses. This flexible approach takes advantage of the assembly line concept of manufacturing automation. From this step by step approach it is possible to derive a conceptual design of an automated analysis system that can run individual samples as well as small and large series of samples. These findings are in contrast to the developments in clinical chemistry where, with the automation of serial analysis, entire assay methods have been automated.
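The idea of a procedure as a sequence of unit operations, each with its own execution parameters, can be illustrated with a small data structure. The sketch below is purely hypothetical: the class, the handler functions and the parameter names are invented for illustration and do not describe any vendor's software.

```python
from dataclasses import dataclass, field


@dataclass
class UnitOperation:
    """One building block of a procedure plus its execution parameters."""
    name: str
    params: dict = field(default_factory=dict)


def run_procedure(steps, handlers):
    """Execute the unit operations in order and collect a simple log."""
    log = []
    for step in steps:
        handler = handlers[step.name]        # look up the station for this step
        log.append(handler(**step.params))   # pass its execution parameters
    return log


# Hypothetical stations; real ones would drive balances, burettes, etc.
handlers = {
    "weighing": lambda target_g: f"weighed {target_g} g",
    "dissolution": lambda solvent, volume_ml: f"dissolved in {volume_ml} mL {solvent}",
    "dilution": lambda factor: f"diluted 1:{factor}",
    "direct_measurement": lambda quantity: f"measured {quantity}",
}

procedure = [
    UnitOperation("weighing", {"target_g": 0.5}),
    UnitOperation("dissolution", {"solvent": "water", "volume_ml": 50}),
    UnitOperation("dilution", {"factor": 10}),
    UnitOperation("direct_measurement", {"quantity": "absorbance"}),
]

log = run_procedure(procedure, handlers)
print("\n".join(log))
```

A different assay then only needs a different list of steps and parameters, while the handlers for the shared unit operations are reused.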
TABLE 2
Common unit operations in analytical chemistry.

Unit Operation                  Description, Example
Weighing                        Quantitative measurement of sample mass using a balance
Grinding                        Reducing sample particle size
Dispensing                      Adding exact amounts of reagent using a burette
Dissolution                     Dissolving solid sample in an appropriate solvent
Dilution                        Adjusting concentration of liquid sample
Solid-liquid and liquid-liquid  Separating unwanted components of the sample
extraction
Direct measurement              Direct measurement of physical properties (pH, conductivity, absorbance, fluorescence, etc.)
Data reduction                  Conversion of raw analytical data to usable information (peak integration, spectrum analysis, etc.)
Documentation                   Creating records and files for retrieval (printouts, graphs, ASCII files)
With the combination of a limited number of automated self-cleaning units and the corresponding infrastructure of transport mechanisms and electronic control, it was shown as early as 1976 by METTLER that a large part of the analytical workload in an industrial laboratory becomes amenable to automation [6]. The instrument developed employs several types of units for automation of the basic operations for sample preparation, a sample transport mechanism, an entry and weighing station, a central control minicomputer and a line printer for result documentation. The actual configuration could be tailored to individual laboratory requirements.

Figure 4. PyTechnology (Zymark Corporation).
The laboratory robotic systems of today use the same concept. The individual unit operations are automated with dedicated workstations, and the robotic arm transfers samples from one station to another according to user programmed procedures. Zymark Corporation, the leading manufacturer of laboratory robots, has implemented this idea of dedicated laboratory workstations for unit operations in their PyTechnology [7]. In the PyTechnology architecture each laboratory station is rigidly mounted on a wedge-shaped locating plate, called a PySection. Each of the PySections has the necessary hardware and software to allow the rapid installation and operation of a particular unit operation such as weighing, pipetting, mixing or centrifugation. The PySections are attached to the robot on a circular locating base plate (Fig. 4). By indicating the location of the particular PySection to the robot, the section is ready to run. Once a PySection has been put in place, the Zymate robot will be able to access all working positions on that particular PySection without any additional robot teaching or positioning programming by the user. This leads to a very rapid system set-up and the ability to easily reconfigure the system for changing applications.

This new system architecture allows centralized method development. As long as several users have the same PySections, standardization and transfer of assay methods from one laboratory to another is easy. Based on this approach, several dedicated turn-key solutions for common sample preparation problems such as solid phase extraction, automated dilution and membrane filtration are offered today.

The reliability of robots has been improved in the last few years. Modern robotic systems use both feedforward and feedback techniques, such as tactile sensing of the robot’s grip and position sensing switches, to assure reliable operation.
6. Workstation concept of analytical instrumentation
In spite of the numerous benefits that complete automation can provide to an analytical laboratory, the process of implementing such systems is often difficult. Based on many years of experience in manufacturing instruments for analytical and process systems, we advocate the following general workstation concept (Fig. 5) as opposed to total centralization. A workstation may consist of the following main components: (i) an external general purpose computer, (ii) an analytical instrument and (iii) a sample changer or a robotic sample preparation system including dedicated workstations for the necessary unit operations. These on-site workstation computers can be linked to a LIMS. Each workstation can be developed independently with little danger of adverse interactions with others. Workstations of this kind are not simple. Most of them will have some or all of the tasks of sample identification, method selection, sample preparation, running the analysis, data reduction and evaluation, result validation, reporting and record-keeping.
Figure 5. General workstation concept for automated chemical analysis (workstation computer linked to LIMS/LAN).
Workstations must also provide calibration methods (running standards and blanks) and maintenance procedures (wash and rinse). They should provide operator help and include procedures to deal with failures and emergencies. The instrument control is preferably not under the complete guidance of the external computer. The preferred scheme is to pass control parameters from the main workstation computer to the internal computer systems of the analytical instrument and the robot, which normally run without close supervision by the master computer. This workstation concept can be found in many commercially available analytical systems. The user may start with the basic unit, the analytical instrument, and can later improve versatility and flexibility with the addition of a Personal Computer with corresponding software packages.
7. Analytical instrumentation of the future
Based on the discussions in the previous sections on the state of the art and the benefits and problems of laboratory automation, the necessary requirements for analytical instruments for the next decade can be formulated. The instruments must be easy to operate and the man-machine interface must be transparent also to less skilled users. High performance and high throughput are obvious features. The instruments must provide the necessary
mechanical, electronic and software interfaces to allow the link with master computer, sample changer, robotic sample preparation and robotic transport systems, balance, barcode reader, LIMS and other computer networks. The instruments must allow method storage. Methods must be up- and downloadable to and from a computer. The instrument software includes sophisticated data reduction and data evaluation schemes using chemometrics. Federal laws and GLP require intermediate storage of data, and the ever-changing needs for the analysis of new types of samples require solutions that are flexible with regard to sample preparation and data evaluation and presentation.

The following trends for the immediate future can be seen today: (i) more automation of sample preparation, (ii) more reliable laboratory robots (either dedicated turn-key systems or general-purpose systems), (iii) flexible analytical instruments that partially include and thus automate sample preparation, (iv) analysis moving closer to the process (more dedicated workstations), (v) improved communications (common data structures, common protocols), (vi) more flexibility for changing needs and (vii) links with other information systems (corporate systems, production systems). However, these developments depend on the instrument and LIMS suppliers to provide the necessary equipment, and on corporate management and finally the user to accept these developments in automated chemical analysis.

A laboratory has several alternatives for the implementation of analytical systems. It can buy either a total system or some subunits. Complete systems from one vendor have the advantage of a clearly defined systems responsibility but may not fulfill all the requirements of the user. Table 3 summarizes the advantages and disadvantages of the different approaches in system implementation.

TABLE 3
Alternatives in system implementation.

Alternative         Systems Responsibility   Advantages                        Disadvantages
Buy total system    Vendor                   Ready made solution               Few adaptations to specific requirements
                                                                               possible (price)
Buy all subunits    User                     Flexible solution                 Interfacing problems
                                                                               Service
                                                                               Systems responsibility
Buy/make subunits   User                     Optimal solution takes            Interfacing problems
                                             consideration of application      Time costs
                                             and organizational aspects
Make total system   User                     Latest technology                 Time costs
                                             Optimal solution                  Service
                                                                               Documentation

In the long range future we will see new methods for analysis that drastically reduce sample preparation. The chemical process industry will have automated batch production with in-line measurement of the important process parameters. However, this needs new robust selective sensing devices. The addition of natural language processing and vision systems to robots will open new areas for automation. Continued growth in the power of microcomputers will offer exciting prospects to laboratory automation. New chemometric tools will provide complex database management, artificial intelligence and multivariate statistics.

As the tasks assigned to laboratory instruments become more complicated, it becomes necessary for the system to make “if-then-else” decisions (rules) during the daily operation. Intelligent instruments must be able to adapt to changes in experimental conditions and to make appropriate modifications in their procedures without human intervention. This type of intelligent feedback to the different elements of the automated system can be realized with expert systems. The instrument must be able to learn from its past experience. The advent of this kind of sophistication promises a continued revolution in the future practice of laboratory automation.

These are some trends, and partially visions, perceived by the author. They may be obvious or new to the reader, or perhaps they may be incorrect. In any event, these are exciting times to be an analytical chemist and to participate in the laboratory revolution.
References
1. McDowall RD, ed. Laboratory Information Management Systems - Concepts, Integration and Implementation. Wilmslow, UK: Sigma Press, 1987.
2. Scholz E. Karl Fischer Titration - Determination of Water. Berlin, Heidelberg, New York: Springer-Verlag, 1984.
3. Riebe MT, Eustace DJ. Process Analytical Chemistry - An Industrial Perspective. Anal Chem 1990; 62(2): 65A.
4. Valcarcel M, Luque de Castro MD. Flow-Injection Analysis, Principles and Applications. Chichester, England: Ellis Horwood Limited, 1987.
5. Strimaitis JR, Hawk GL, eds. Advances in Laboratory Automation Robotics. Vol. 1 to Vol. 5, Zymark Corporation, Hopkinton, MA, USA.
6. Arndt RW, Werder RD. Automated Individual Analysis in Wet Chemistry Laboratory. In: Foreman JK, Stockwell PB, eds. Topics in Automatic Analytical Chemistry Vol. 1, p. 73, Horwood, Chichester, 1979.
7. Franzen KH. Labor-Roboter - die neueste Entwicklung: PyTechnology. GIT 1987: 450.
CHAPTER 24
Report of Two Years Long Activity of an Automatic Immunoassay Section Linked with a Laboratory Information System in a Clinical Laboratory
R.M. Dorizzi (1) and M. Pradella (2)
(1) Clinical Chemistry Laboratory, Hospital of Legnago, 37045 Legnago (VR), Italy
(2) Clinical Biochemistry Chair, Padua University, Padua, Italy
Abstract
The organization of an immunoassay section developed in 1987 and capable of automatically processing 17 different RIA and EIA tests is described. The section is equipped with a Kemtek 1000 Sample Processor, a spectrophotometer, a 16-well NE 1600 gamma counter, and an IBM PC/XT running commercial software for data reduction and PC-SYSLAB, a home-made program which interlinks the section with the LIS. After the entry of the patients and of the required tests, worklists for the daily workload are transferred to a floppy disk and a hard copy is simultaneously printed for reference. The samples are dispensed by the Kemtek 1000 according to the worklist and, after the end of the procedure, radioactivity or absorbance measurements are transferred over an RS-232 interface to the IBM PC. After the calculation of the doses, PC-SYSLAB links the doses to the identification numbers of the patients and then transfers these to LDM/SYSLAB. Medical validation is helped by the printing of a complete table of results. An archive for the thyroid panel (containing today about 1,200 patients) can be searched for the presence of known patients. Every year more than 50,000 endocrinological tests have been reported without any manual entry and transcription of data by 2 medical technologists and 1 laboratory physician.
1. Introduction
During this century diagnosis in medicine has changed radically. At the beginning of the century historical information was of paramount importance and physical examination made a modest contribution; in the fifties analytical data were added to historical and physical data, while in the current era the information provided by instruments is very often of critical importance [1].
An avalanche of commercial instruments, automated to different extents, was marketed to execute the wide array of required tests. The introduction of the AutoAnalyzer in 1957 started a revolution in clinical chemistry. It placed reasonably reliable and rapid analysis of many blood analytes within reach of most hospital laboratories and made possible the production of large quantities of reliable data in a short time with very reasonable labour requirements [2, 3]. The AutoAnalyzer led the way to the admission screen proposed by A. Burlina in Italy, consisting of a biochemical screen, physical examination and history collection [4]. The analytical instruments exist as “automation islands” within the laboratory with little interconnection between them. Today it is essential to overcome this problem and effectively bridge the gap between the “islands” [5]. In 1987 we started a project to optimize the workflow of the immunoassay section in processing 17 RIA and EIA tests, employing the common instrumentation of a clinical chemistry laboratory, and to link the section to LDM/SYSLAB.
2. Materials and methods
The laboratory of Legnago Hospital acquired a Technicon LDM/SYSLAB system in 1985: it is a “turnkey” system using two SEMS 16/65 minicomputers with 1 Mbyte of memory and twin 20 Mbyte disk drives. Three external laboratories are connected to LDM/SYSLAB by standard phone lines via modem. LDM is connected, directly or via an intelligent terminal, to terminals, instruments and printers. In our laboratory routine assays are performed by RIA for estriol, HPL, progesterone, 17 beta estradiol, testosterone, T4, fT4, TSH, ferritin, CEA, PSA, insulin and digoxin, and by EIA for FSH, LH, prolactin and alpha-fetoprotein. The immunoassay section of our laboratory is equipped with standard instruments: a Kemtek 1000 Sample Processor (Kemble, UK); a mechanical fluids aspirator MAIA-SEP (Ares-Serono, Italy); a spectrophotometer Serozyme II (Ares-Serono, Italy); and a 16-well gamma counter (Thorn EMI, UK) connected to an IBM PC. A commercial program (Gammaton, Guanzate, Italy) calculates the doses by standard procedures (Wilkins's 4-parameter, spline, linear regression, Rodbard's weighted logit-log, point-to-point) and executes on-line Quality Control procedures.
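As an illustration of what a dose calculation involves, the following is a minimal sketch of a four-parameter logistic (4PL) calibration curve and its inversion, a common form of four-parameter curve in immunoassay data reduction. It is an assumption that this corresponds to the 4-parameter option of the commercial program; the function names and the parameter values are hypothetical and are not taken from the paper.

```python
def four_pl(conc, a, b, c, d):
    """Response predicted by a 4PL curve at concentration conc (> 0).

    a: response at zero dose, d: response at infinite dose,
    c: concentration at the inflection point, b: slope factor.
    """
    return d + (a - d) / (1.0 + (conc / c) ** b)


def dose_from_response(resp, a, b, c, d):
    """Invert the 4PL curve: concentration that gives response resp.

    Valid only for responses strictly between d and a.
    """
    lo, hi = min(a, d), max(a, d)
    if not lo < resp < hi:
        raise ValueError("response outside the range of the curve")
    return c * ((a - d) / (resp - d) - 1.0) ** (1.0 / b)


# Hypothetical curve parameters (counts per minute vs. concentration).
params = dict(a=12000.0, b=1.2, c=5.0, d=400.0)
counts = four_pl(2.5, **params)
dose = dose_from_response(counts, **params)
```

In practice the four parameters would first be fitted to the standards of each run; that fitting step is omitted here.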
3. Manual organization of immunoassay section
The samples obtained from the 4 connected laboratories were entered in the LDM/SYSLAB by the laboratory clerks. Two full-time medical technologists and one part-time laboratory physician supervisor worked in the section; the shift of the 2 medical technologists was 8 a.m.-2 p.m. from Monday to Saturday. On the day a specific assay was run, the identification number of every sample was manually registered on a paper sheet. Samples and reagents were dispensed by the Kemtek 1000 and the samples were further processed according to the manufacturer's recommendations. The supernatant was discarded by a MAIA-SEP mechanical aspirator and then the radioactivity or the absorbance of the tubes was measured by a gamma counter or by a spectrophotometer respectively. After technical and medical validation, the results were transferred to the hand-made worklists and then to the computer-made worklists. Finally the results were manually entered in the LDM. The organization of the work, although partially automated, was tedious and time consuming for the 2 medical technologists, who had to enter the final dose results, i.e., the most delicate procedure, at the very end of their work shift. Clerical mistakes were therefore possible, and it was very difficult for the physician in charge of the section to evaluate the results carefully, especially for the thyroid panel (thyroxine, free thyroxine, TSH), the pregnancy panel (estriol and HPL), the female gonadal function panel (estradiol, progesterone, testosterone), etc. This represents a crucial problem in the modern clinical chemistry laboratory, since it is widely accepted that the data generated by the laboratory become ACTUAL information only when they are utilized in patient care. The laboratory physician plays an essential role at this stage in providing fast and accurate results to the clinician [5].
4. Computerized organization of immunoassay section
4.1 PC-SYSLAB package
The PC-SYSLAB package requires at least two microcomputers: one is directly linked to the LDM system, and the other is connected, through a switch, alternately to a radioactivity detector or to a spectrophotometer.
4.2 Patient names and worksheet uploading
A BASIC program simultaneously emulates a VDU peripheral and a line printer. The patient names and the worklists are "printed" to the microcomputer (TELE-PC, Televideo Systems Inc., Sunnyvale, CA, USA), where they are stored as a sequential text file. Two PASCAL programs process this file: one updates the patient name random-access file (FITRAV), and one builds the test request list (LISTA). Each request is identified by a workplace and a test number.
4.3 Test list printing and result acquisition
A BASIC program builds and prints the specific test lists. The samples are sorted in the list order and the assay is performed exactly as previously described. The medical technologists place the sample tubes on the racks of the Kemtek following the test list.
4.4 Downloading of results to LDM
A BASIC program builds the results list from a sequential file recorded by the Gammaton software. The results lists, merged and sorted, are sent to LDM by the Tele-PC through a program that emulates an automated instrument. Simultaneously, a table containing all patients’ results is printed. This table is checked by the laboratory physician in charge of the section, who verifies the clinical consistency of the results and updates the patient data base of thyroid panel results. When necessary, he prepares interpretive comments using an LDM/SYSLAB terminal.
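The linking step can be pictured with a short sketch. It assumes that tubes are counted in the same order as the printed test list, so that each dose can be paired with a sample by position; the function names, record layout and values are hypothetical, and the original programs were written in BASIC and PASCAL, not Python.

```python
def link_results(worklist, doses):
    """Pair each worklist entry with the dose measured at the same position.

    worklist: list of (sample_id, test_name) in dispensing order.
    doses: list of computed doses in counting order.
    """
    if len(worklist) != len(doses):
        raise ValueError("worklist and result file differ in length")
    return [
        {"sample_id": sid, "test": test, "dose": dose}
        for (sid, test), dose in zip(worklist, doses)
    ]


def merge_and_sort(*result_lists):
    """Merge several per-assay result lists and sort by sample, then test."""
    merged = [row for results in result_lists for row in results]
    return sorted(merged, key=lambda row: (row["sample_id"], row["test"]))


# Hypothetical worklists and doses for two assays of the thyroid panel.
tsh = link_results([(1042, "TSH"), (1007, "TSH")], [2.1, 0.4])
t4 = link_results([(1007, "T4"), (1042, "T4")], [98.0, 110.0])
table = merge_and_sort(tsh, t4)
```

The length check is the only safeguard in this sketch; a mismatch between the worklist and the result file stops the transfer rather than shifting results onto the wrong patients.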
4.5 Looking for knowns
Known patients are recorded in a random indexed file using a “public domain” program (PC-FILE, Buttonware Inc., Bellevue, WA, USA). A small alphabetic index is then extracted from this data base, and a PASCAL program looks up each name in the test list with a binary search algorithm. The output of this program is a list named KNOWNS OF TODAY.
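A binary search over a sorted index of names can be sketched as follows. This is an illustration in Python using the standard `bisect` module, not the authors' PASCAL program; the names and the helper functions are hypothetical.

```python
from bisect import bisect_left


def is_known(index, name):
    """Binary search a sorted list of names; True if name is present."""
    pos = bisect_left(index, name)
    return pos < len(index) and index[pos] == name


def knowns_of_today(index, todays_names):
    """Return today's patients that already appear in the archive index."""
    return [name for name in todays_names if is_known(index, name)]


# Hypothetical alphabetic index extracted from the archive.
archive_index = sorted(["BIANCHI MARIA", "ROSSI ANNA", "VERDI LUCA"])
today = ["ROSSI ANNA", "NERI PAOLO", "BIANCHI MARIA"]
found = knowns_of_today(archive_index, today)
```

Each lookup takes a number of comparisons proportional to the logarithm of the index size, which suits an archive of roughly a thousand patients on a small microcomputer.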
5. Results
Since 1987, 17 of the 25 immunological assays performed in the section have been run using the described procedure. Every day, 6 days/week, 2 medical technologists perform a mean of 6 tests, for a total of about 150-200 patient tests per day and more than 50,000 tests/year. After some months of debugging, PC-SYSLAB has been used without interruption since October 1987. It was out of service for only a few days, owing to breakdowns of LDM and of the hard disk of the IBM PC connected to the counters. Table 1 shows the time saved by the adoption of PC-SYSLAB for the processing of the pregnancy panel (estriol and HPL). Since 6 tests are assayed every day, the total daily gain is about 2 work hours, depending on the number of samples for every assay.
TABLE 1
Immunoassay computerization benefits.

OPERATION          NO COMPUTER   COMPUTER
Worklist           20 min        10 min
Dispensation       28 min        28 min
Input of results   25 min        5 min
Total              73 min        43 min

6. Discussion and conclusion
LDM showed overall a good performance in producing conventional and bar-coded sample labels and worksheets, in data acquisition from analytical instruments and in printing patient reports. However, the LDM system does not easily perform calculations and data base functions. The first immunoassays were developed in endocrinology and nuclear medicine laboratories not familiar with automation and little oriented toward standardization and simplification of techniques. The wide range of methodologies adopted by laboratories exhibited little uniformity, was extremely labour intensive and relied to a high degree on very skilled staff. In the 1980s several firms marketed very complex instruments, requiring a lot of maintenance, to automate isotopic and non-isotopic immunoassays. The majority of these instruments were “closed” and proved very expensive, requiring dedicated reagents. Since this market is greatly fragmented, there is no commercial pressure toward linking the immunoassay section to the LIS, which is also hampered by the well-known absence of a standard protocol of communication between equipment and the laboratory information system [9, 10].
Very few cases of a microcomputer linked with a commercial “turnkey” LIS are reported in the literature. Davies and Mills [6] linked a clinical chemistry centrifugal analyzer, and Hobbs et al. [7] linked, by a Commodore 4032 microcomputer, a DEC 11/23 minicomputer used for immunoassays. Only the implementation reported by Hobbs et al. was designed to add a “high level” function (patient data base) to LIS data management. The use of microcomputers allows laboratory data to be transferred to many widely available programs for more sophisticated processing. PC-SYSLAB uses “public domain” software for data base functions and list editing; in other applications, such as serology [11], it also uses a spreadsheet for result calculations.
In summary, microcomputers are able to reduce the technology gap in small laboratories, for which the cost of a commercial LIS ($1,000 per patient bed) is a frustrating factor. They are also able to do several computing tasks in larger laboratories [12]. They allow laboratory personnel to take advantage of well documented, error free, powerful programs such as word processors, data bases, spreadsheets and graphic packages.
References
1. Strandjord PE. Laboratory medicine - excellence must be maintained. In: Bermes EW, Ed. The clinical laboratory in the new era. Washington: AACC Press, 1985.
2. Burtis CA. Advanced technology and its impact on the clinical laboratory. Clin Chem 1987; 33: 352-7.
3. Valcarcel M, Luque de Castro MD. Automatic methods of analysis. Amsterdam: Elsevier, 1988.
4. Burlina A. La logica diagnostica del laboratorio. Padova: Piccin, 1988.
5. McDowall RD. Introduction to Laboratory Information Management Systems. In: McDowall RD, Ed. Laboratory Information Management Systems. Wilmslow: Sigma, 1987.
6. Davies C, Mills RJ. Development of a data manager linking a Baker Encore to a LDM computer. Ann Clin Biochem 1987; 24: S1-51-52.
7. Hobbs DR, Lloyd GC, Alabaster C, Davies KW. The use of a DEC 11/23 minicomputer to allow the entry of immunoassays result into a Technicon LDM/Syslab data management system. Ann Clin Biochem 1987; 24: S1-52-54.
8. Forrest GC. A general review of automated RIA. In: Hunter WM, Corrie JET, Eds. Immunoassays for clinical chemistry. Edinburgh: Churchill-Livingstone, 1983.
9. Blick KE, Tiffany TO. Tower of Babel has interfacing lessons for labs. CCN 1990; 16 (4): 5.
10. McDonald CJ, Hammond WE. Standard formats for electronic transfer of clinical data. Ann Intern Med 1989; 110: 333-5.
11. Pradella M. Personal unpublished data, 1988.
12. McNeely MDD. Microcomputer applications in the clinical laboratory. Chicago: ASCP Press, 1987.
LIMS and Validation of Computer Systems
CHAPTER 25
An Integrated Approach to the Analysis and Design of Automated Manufacturing Systems
S.P. Maj*
50 Gowing Rd, Mulbarton, Norwich NR14 8AT, UK
Abstract
All organisations should have an information strategy and hence a long term view to prevent the fragmented and unco-ordinated introduction of computer-based production systems. This is especially true when it is recognised that companies successfully using Computer Integrated Manufacture (CIM), Flexible Manufacturing Systems (FMS) etc. can be considered to be in a minority. Further, the technology employed should be part of a computer-based system that integrates, both vertically and horizontally, the total manufacturing complex. What is perhaps needed is an integrated method applicable to the analysis and design of such automated manufacturing systems.
1. Introduction
Even though the technology exists, few manufacturing complexes have succeeded in fully integrating the production cycle [1]. Communication systems, conveyor belts etc. allow the physical integration of manufacturing activities, with distributed databases serving to integrate the information. However, production systems vary considerably in the basic production cycle of product/production ratio, from job, batch and line to continuous, each with its own associated production techniques. A major problem is optimal and reliable functional integration to give a consistent and cohesive production cycle [2]. This is especially so when a system is converted from manual/semi-automatic data processing to a fully integrated information management system. The total manufacturing system must be considered, thus avoiding the dangers of having ‘islands of automation’.
A large range of structured methods exists for the analysis and design of computer based systems [3]. Whilst these methods are perhaps adequate for commercial applications, they are largely unused by production engineers, and further it is considered that they lack the rigour demanded in the specification of critical applications. This is especially true when it is recognised that many major software systems contain errors that were introduced during analysis, design and implementation. Formal specification techniques, based on mathematical theory, are more suitable and can be used [4]. They are however complex. Because of the wide range of processing criteria found in manufacturing complexes, a combination of structured methods of analysis and formal specification is considered best suited. What is needed, without doubt, is an integrated approach that ensures verification and validation to the required standard.

* Independent Consultant
2. Manufacturing integration
Computer Integrated Manufacture (CIM) can be considered as the use of computers to plan and control the total manufacturing activities within an organisation. Horizontal integration links the activities and processes that start with the design of a product and finish with its delivery to the customer. Vertical integration links the detailed design and manufacturing activities that make up the horizontal dimension, through a hierarchy of control, to the strategic plans of the organisation. Vertical activities are typically discontinuous, e.g., corporate strategy. CIM can be said to include a family of activities. The three major components are Computer Aided Engineering (CAE), Computer Aided Quality Assurance (CAQA) and Computer Aided Production Management (CAPM). CAE is often taken to include Computer Aided Design (CAD) and Computer Aided Manufacture (CAM). Thus computer-based systems are available for the entire spectrum from product design and manufacturing systems design to process planning, monitoring and control. It should be recognised that fully exploited CIM demands both communications and information integration.
3. Hardware
Rapid advances in electronic device fabrication have taken us from the thermionic valve (large, expensive, inefficient and unreliable) to solid state transistors integrated during manufacture onto a single piece of semiconductor, i.e., integrated circuits. Advanced fabrication techniques now employ submicron lithography. Future trends can be clearly identified as higher packing densities, higher operating speeds and new semiconductor materials. As a programmable, general purpose device, the microprocessor is made application specific only by the associated software. General purpose devices command large production volumes with minimal unit cost.
Data exchange between individual units of computer-controlled equipment is typically achieved by hard-wired, point-to-point cables with specialised electronics, i.e., ‘islands of automation’. These manufacturer dependent, closed communications systems are being replaced by the International Standards Organisation (ISO) Reference Model for Open Systems Interconnection (OSI). The OSI model addresses the problems of providing reliable, manufacturer independent, data transparent communication services [5]. Networks (local and wide area) provide the framework for distributed data processing, i.e., a network with a high degree of cohesion and transparency in which the system consists of several autonomous processors and data stores supporting processes and databases in order to achieve an overall goal. A practical realisation is the Manufacturing Automation Protocols (MAP) initiative. The OSI system allows for a number of alternative protocols, each of which provides a means of achieving a specific distributed information processing function in an open manner. The specific application services required can be selected along with different modes of operation and classes of service. The MAP set of protocols has been selected to achieve open systems interconnection within an automated manufacturing plant. The result is networked manufacturing cells of intelligent instruments, computers etc., linked via a gateway to the corporate network of other on-site functions.
The question is why systems fail and why software costs can represent in excess of 80% of the total system cost [6]. Modern computer hardware is extremely reliable and can be made fault tolerant. The problem is that errors are introduced and errors propagate. Further, large software systems are not static. They exist in a constantly changing environment requiring perfective, adaptive and corrective maintenance.
4. Software
Computer Integrated Manufacture is supported by software that must include a database management system. Traditional file-based systems had the intrinsic problems of file proliferation and chronological inconsistency. A database, however, can be defined as a collection of non-redundant data shareable between different application systems. The database conceptual schema is the description of all data to be shared by the users. The external schema of each user is a specific local view, or subset, of the global schema as required by that particular application. Shared yet selective access to non-redundant data allows a schema change to reach every user with the appropriate access, and hence reduces program maintenance costs.
The developments in networking and distributed systems have made the distributed database a practical solution. A distributed database is a collection of logically related data distributed across several machines interconnected by a computer network. Thus an application program operating on a distributed database may access data stored at more than one machine. The advantages of distribution include each site having direct control over its local data, with a resulting increase in data integrity and data processing efficiency. In comparison, the centralised approach requires the data to be transferred from each site to the host computer, with the subsequent communication overhead. The distributed system is a natural solution for geographically dispersed organisations. The need to provide a logically integrated but physically distributed information system is the basis for a distributed database [7-9]. The system software must provide high independence from the distributed environment. Relational databases in particular have been successful at providing data independence and hence system transparency.
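The distinction between a conceptual schema and the external schemas of individual applications can be sketched with a relational database. The example below uses Python's standard `sqlite3` module on a single in-memory database, so it illustrates only the schema levels and not distribution; the table, view and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Conceptual (global) schema: one non-redundant table shared by all applications.
conn.execute(
    "CREATE TABLE part (part_no TEXT PRIMARY KEY, description TEXT, "
    "stock_qty INTEGER, unit_cost REAL)"
)
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?, ?)",
    [("P-100", "bearing", 40, 2.50), ("P-200", "shaft", 5, 19.00)],
)
# External schema for a stores application: a local view without cost data.
conn.execute(
    "CREATE VIEW stores_view AS SELECT part_no, description, stock_qty FROM part"
)
# External schema for a costing application: stock valuation only.
conn.execute(
    "CREATE VIEW costing_view AS "
    "SELECT part_no, stock_qty * unit_cost AS stock_value FROM part"
)

# A change to the shared data is seen through every view that covers it.
conn.execute("UPDATE part SET stock_qty = 10 WHERE part_no = 'P-200'")
stores = conn.execute("SELECT * FROM stores_view ORDER BY part_no").fetchall()
costing = conn.execute("SELECT * FROM costing_view ORDER BY part_no").fetchall()
```

Because both applications read the same stored row, the stock figure cannot drift between them, which is the non-redundancy argument made in the text.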
5. Systems analysis and design
Manufacturing systems are complex. Computer Aided Production Management (CAPM) is concerned with manufacturing planning and control. The earlier systems were concerned primarily with inventory control only. Later developments encompassed production planning. These combined approaches are known as Materials Requirements Planning (MRP). Manufacturing Resource Planning (MRP II) differs from MRP in placing less emphasis on material planning and more on resource planning and control. The difficulties associated with MRP must not be underestimated [10]. Other systems include Kanban, Just In Time (JIT) and Optimised Production Technology (OPT).
Whichever system is used, it is considered that the production of complex systems requires the use of the software or system life cycle. This consists of a series of distinct stages, each stage having clearly defined activities. Typically the stages are: statement of requirements, requirements analysis, system specification, system design, detailed design, coding, integration, implementation and maintenance. Methods applicable to smaller systems, if simply scaled up, result in computer based data processing systems that are overdue, unreliable, expensive and difficult to maintain. Many methods, with varying degrees of complexity, have been developed, such as Information Engineering and Structured Design and Analysis [11]. These have largely been in the context of commercial applications. Most methods employ the basic principles of stepwise, top-down decomposition, in which the stepwise refinement allows the deferment of detailed considerations by the use of abstraction to suppress and emphasise detail as appropriate. They all attempt to be understandable, expressive, implementation independent and generally applicable. Progression through the system development life cycle consists of a series of transformations from the user statement of requirements to the detailed design [12].
This involves documentation employing different notations appropriate to the requirements of each stage. The statement of requirements document will be in natural language with some graphics for clarity. As a result of the complex semantics of English (or any other natural language), this document will be ambiguous and incomplete and will contain contradictions. From this document the requirements analysis stage has to produce a requirements specification to be used as a reference document for all subsequent work and for final acceptance testing prior to handover. As such it has to be complete, consistent and unambiguous. Progression through the development cycle reduces the natural language content, with a corresponding increase in diagrammatic notations. The output of each stage is a specification for the following stage, from which the appropriate design is made.
Verification is the process of ensuring that the design of each stage is correct with respect to the specification of the preceding stage, i.e., is the product right? Validation ensures design integrity in that the final design should satisfy the initial user requirements, i.e., is it the right product? [13-16].
6. Structured systems analysis and design
All organisations should have an information strategy and hence a long term view to prevent the fragmented and unco-ordinated introduction of computer-based systems. The considerable conceptual, organisational and technical difficulties with regard to the successful implementation of CIM must not be underestimated. Further, laboratories, for example, are subject to Good Laboratory Practice (GLP) regulations [17]. These regulations include the definition, generation and retention of raw data, Standard Operating Procedures (SOPs) etc. The concept of quality assurance is to produce automated data systems that meet the user requirements and maintain data integrity.
The distinct phases of the system life cycle are considered to give only minimal guidance. The enhancements to this basic framework came by demand. However, one of the biggest problems in system work is ‘navigation’: who does what, when, where and how? What is needed is a method to act as a procedural template giving comprehensive guidance. The Structured Systems Analysis and Design Method (SSADM), legally owned by the Central Computing and Telecommunications Agency (CCTA), is an integrated set of standards for the analysis and design of computer based systems [18]. It is a generally applicable method, suitable for widely differing project circumstances, with clearly defined structure, procedures and documentation. The structure consists of clearly defined tasks of limited scope, clearly defined interfaces and specified products. The procedures use proven, usable techniques and tools with detailed rules and guidelines for use. The three different views, based on functions, data and events, give a complete and consistent system view with intrinsic documentation in the structure and procedures. Productivity gains are due to the standard approach to project planning, with known techniques and clearly defined user needs.
SSADM addresses the problems of confidentiality, data integrity and availability. Project quality is ensured by early error detection and correction with readable, portable and maintainable solutions thus ensuring verification and validation to the required standard.
7. Formal specification techniques
Whilst structured methods of analysis are adequate for commercial and business organisations, they lack the rigour demanded by on-line critical applications. Errors can be introduced into software through the incorrect specification of the environment and through the failure of the design to match the specification. Errors are introduced and propagate through the design process, with unacceptable consequences in life-critical applications. Formal techniques are mathematical systems to generate and manipulate abstract symbols [20]. They can be used to describe mathematically correct system specifications and software designs, together with the techniques for verification and validation. There are several types of formal technique, such as Z and VDM, but typically they consist of a language to provide the domain description and a deductive apparatus for the manipulation of the abstract symbols. The languages have an alphabet to define the symbols to be used; rules of grammar to define how the symbols may be combined in order to write acceptable strings of symbols or well formed formulae (wff), i.e., rules of syntax; and the interpretation of the language onto the domain of interest by the rules of semantics. The deductive apparatus defines the axioms, i.e., wffs that can be written without reference to other wffs, and also the rules of inference that allow wffs to be written as a consequence of other wffs. Because of the complexity of proof procedures, the original aim of fully automated proof systems, to demonstrate automatically that programs meet their specification without the need for testing, has been achieved to a limited extent only.
Using a formal technique (Z) it has been possible to write a limited specification for an automated analyser (ion selective electrode measurements) to act as a behaviour model: states and operations. Theorem syntax was checked for consistency. In recognition that most formal systems are incomplete, only limited completeness checks were performed.
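A behaviour model made of states and operations can be pictured, informally, with a small executable sketch. This is not the author's Z specification, which is not reproduced in the paper, and Python assertions are no substitute for formal proof; the state variables, the invariant and the two operations below are hypothetical and chosen only to show the shape of such a model.

```python
from dataclasses import dataclass, field


@dataclass
class Analyser:
    """Hypothetical analyser state: a calibrated flag and stored readings."""
    calibrated: bool = False
    readings: dict = field(default_factory=dict)  # sample id -> millivolts

    def invariant(self):
        # State invariant: readings may exist only once calibration is done.
        return self.calibrated or not self.readings

    def calibrate(self):
        # Operation with no precondition; afterwards calibrated is True.
        self.calibrated = True
        assert self.invariant()

    def measure(self, sample_id, millivolts):
        # Precondition: calibrated, and sample not already measured.
        if not self.calibrated or sample_id in self.readings:
            raise ValueError("precondition of measure violated")
        self.readings[sample_id] = millivolts
        assert self.invariant()


analyser = Analyser()
analyser.calibrate()
analyser.measure("S1", 58.2)
```

In a formal notation the invariant and preconditions would be stated as predicates and proved to hold for every operation, rather than checked at run time as they are here.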
8. Discussion
Work to date indicates that structured methods of analysis (SSADM) may be suitable for an integrated approach to the analysis and design of automated manufacturing systems. The procedural method helps to ensure system quality, i.e., the features and characteristics that bear on its ability to satisfy total system requirements. Formal specification techniques, used selectively and appropriately, can serve for the specification and design of critical applications, together with reduced testing. The use of natural language is not eliminated but is used to enhance the techniques as appropriate. Formal techniques can perhaps be considered as tools in the repertoire of SSADM, thereby allowing the complete integration of the business system with the more critical demands of industrial processes, i.e., Total Integrated Manufacturing (TIM).
Acknowledgements
In acknowledgement of Dr. R. Dowsing and Mr. A. Booth of the University of East Anglia.
References
1. Barker K. CAPM - little cause for optimism. Production Engineer 1984; November: 12.
2. Woodcock K. The best laid plans of MRP II's. Technology Vol 8(14): 10.
3. Connor D. Information Systems Specification and Design Road Map. Prentice Hall, 1985, Chapter 1.
4. Denvir T. Introduction to Discrete Mathematics for Software Engineering. Macmillan, 1986, Chapter 1.
5. Halsall F. Data Communications, Computer Networks and OSI. Addison-Wesley, 2nd Ed, 1988, Chapter 10.
6. Sommerville I. Software Engineering. International Computer Science Series, Addison-Wesley, 1987, Chapter 1.
7. Kleinrock L. Distributed Systems. Computer November 1985; 90-103.
8. Van Rensselaer C. Centralize? Decentralize? Distribute? Datamation 1979; April: 90-97.
9. Bender M. Distributed Databases: Needs and Solutions. Mini-micro Systems 1982; October: 229-235.
CHAPTER 26
A Universal LIMS Architecture
D.C. Mattes¹ and R.D. McDowall²
¹SmithKline Beecham Pharmaceuticals, P.O. Box 1539, King of Prussia, PA 19446, USA and ²Wellcome Research Ltd., Langley Court, Beckenham, Kent, BR3 3BS, UK
In this presentation we will discuss the need for, and demonstrate the benefit of, providing a consistent LIMS architecture in order to meet user requirements. To discuss this issue we will first lay some groundwork by addressing the following:
- What are we really trying to manage in a LIMS?
- What is Architecture?
From these points we will move on to present a model architecture for LIMS and the system design and implementation that this architecture leads to.
The first point to be considered is that in a LIMS we should be managing INFORMATION and not just data. Information is: data that has been processed into a form that is meaningful to the recipient or user and is of real or perceived value in current or prospective decision processes. This definition begins to emphasize the difference between DATA and INFORMATION. The following example demonstrates the difference. The following numbers are a set of DATA:

-42 -0.5 35 69 98 126
3 4 5 6 7 8

Although this set of DATA is accurate it is of little value. Why? Because there is no information. If we add the following then we have some valuable information:
Hydrocarbon Boiling Points

Data      Carbon Chain
-42       3 (Propane)
-0.5      4 (Butane)
35        5 (Pentane)
69        6 (Hexane)
98        7 (Heptane)
126       8 (Octane)
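The step from bare numbers to the table above can be mirrored in a few lines of code. The sketch below attaches each value to its carbon chain and compound name; the class and variable names are hypothetical, and the unit (degrees Celsius) is an assumption, since the text does not state one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoilingPoint:
    """One value together with the context that makes it information."""
    carbons: int
    name: str
    value: float  # unit not stated in the text; degrees Celsius is assumed


# The bare data, exactly as listed in the text.
data = [-42, -0.5, 35, 69, 98, 126]
chain = [3, 4, 5, 6, 7, 8]
names = ["Propane", "Butane", "Pentane", "Hexane", "Heptane", "Octane"]

# Adding context: each number is tied to the compound it describes.
table = [BoilingPoint(c, n, v) for c, n, v in zip(chain, names, data)]
by_name = {row.name: row for row in table}

# With context attached a question can be answered; the bare list cannot do this.
boils_above_20 = [row.name for row in table if row.value > 20]
```

Nothing was added to the six measurements themselves; only the relationships between them and the compounds were recorded.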
In this last table, in addition to adding data, I have also added CONTEXT, or relationships, to the data. Having the same data with no structure or relationships would be useless (except perhaps as a puzzle). Therefore in order to have information we must have DATA and CONTEXT (or the relationships of the data). We can say that more information value is NOT more DATA but more CONTEXT. Often we have plenty of data, which leads management and the organization to the point at which they are “drowning in data” and need “meaningful information”.
In most organizations which we represent as scientists, a key function is to convert data into meaningful and useful information. How does this conversion take place? The following is a simple representation of a cycle which goes on within the scientific process.
a. We execute experiments which generate data.
b. We perform analysis on data which generates information.
c. We apply intelligence to information to gain knowledge.
d. We design experiments based on knowledge to (see a).
Today most LIMS only address the DATA part of this process, and at times minimal analysis. What is the source of this deficiency (i.e., not addressing the entire process): the capability of the systems or the implementation of the systems? In reality it is probably a mixture of both, but I think that “Some LIMS CAN’T and most LIMS DON’T”.
In order to reach this goal of integration we have developed a LIMS model. This model is based on a simple architecture which allows a LIMS to be implemented and integrated with the organizational requirements while not overlooking functions which are required for the system to be successful. We discuss this model in the light of an ARCHITECTURE, which is an orderly design, presented in various views for different audiences, in order to implement a system which meets ALL perceived needs. We can draw on the construction industry as an example to explain architecture.
To meet the needs of a buyer an architect will prepare drawings which describe the finished product. These drawings are designed for several audiences or different groups involved in the “building”.
The OWNER needs to know what a building will contain to ensure that it meets the required FUNCTIONS. This is the USER. The DESIGNER needs to understand the required RELATIONSHIPS of these functions to ensure a structurally stable building. This is the SYSTEM ANALYST. The BUILDER needs to understand the TOOLS and FOUNDATION in order to construct a finished building which meets the goals of the initial architectural drawings which the owner approved. This is the SYSTEM PROGRAMMER. The point of the architecture is to keep the various views of the “building” in synchronization in order to assure a satisfactory product. In the LIMS world this model provides an architectural view to communicate to all groups involved in the implementation of the LIMS and assure a satisfactory product. In the development of this architecture we had several objectives which we felt a model like this could fulfill.
- To define the scope of the LIMS.
- To define the organization of the LIMS.
- To facilitate communication between diverse technical groups about LIMS.
- To provide a tool for training in the implementation and use of a LIMS.
The components of the LIMS model are:
- Database (a common data repository).
- Data Collection (the path for data to enter the LIMS).
- Data Analysis (modules which read and write to the repository as they “generate information”).
- Data Reporting (the path for information to leave the LIMS for external use).
- Lab Management (the tools for the management of activities and resources within the lab).
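The five components listed above can be sketched as a handful of small Python classes built around one shared repository. This is only an illustrative sketch of the component roles, not the design of any real LIMS; every class, method and field name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Common data repository shared by all other components."""
    results: list = field(default_factory=list)      # raw data
    information: dict = field(default_factory=dict)  # derived information


class DataCollection:
    """Path for data to enter the LIMS."""

    def __init__(self, db):
        self.db = db

    def collect(self, sample_id, value):
        self.db.results.append((sample_id, value))


class DataAnalysis:
    """Reads data from the repository and writes information back to it."""

    def __init__(self, db):
        self.db = db

    def mean_per_sample(self):
        grouped = {}
        for sample_id, value in self.db.results:
            grouped.setdefault(sample_id, []).append(value)
        self.db.information = {s: sum(v) / len(v) for s, v in grouped.items()}
        return self.db.information


class DataReporting:
    """Path for information to leave the LIMS for external use."""

    def __init__(self, db):
        self.db = db

    def report(self):
        return [f"{s}: {m:.2f}" for s, m in sorted(self.db.information.items())]


class LabManagement:
    """Tools for managing activities and resources within the lab."""

    def __init__(self, db):
        self.db = db

    def workload(self):
        # Number of recorded results per sample, a crude activity measure.
        counts = {}
        for sample_id, _ in self.db.results:
            counts[sample_id] = counts.get(sample_id, 0) + 1
        return counts
```

In this sketch every component holds a reference to the same `Database` object, which mirrors the idea of a common repository with separate paths for data entering and information leaving.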
A pictorial representation of the model is included in Figure 1. This model can help us look at the complete picture of a LIMS and therefore begin to apply the LIMS to meeting a larger part of the scientific process discussed previously. In Figure 2 a similar yet different picture of the model presents a potential designer’s view of this model. This view increases the level of detail regarding the various components and outlines the organization or relationships of these components. In Figure 3 a similar picture, i.e., one based on the same model, is presented which is the structure of an actual LIMS application. This application, CUTLAS (Clinical Unit Testing and Lab Automation System), was designed around an architectural model. As a result the application has been very successful, as demonstrated by the following facts:
- The system went from introduction to full production in less than 6 months.
- The functions and relationships of the system components have been easily understood by the user community.
- The system has been in operation for 5 years.
- The system has proved to be extensible. It has been enhanced through the integration of new functions which in no way impacted the “running system”.
Figure 1. The LIMS model. Arrows represent data and information flow.
Figure 2. The designer’s view of the LIMS model identifies the “structural relationship” of system components. (Panel labels: Distributed Environment; Host Environment.)
Figure 3. The high-level structure of the CUTLAS-LIMS application.
To summarize, from this presentation it is important to walk away with the following key ideas:
1. A LIMS should be focused on managing INFORMATION and not just data.
2. The ARCHITECTURE behind a LIMS is an important requirement for a successful system. It provides a critical foundation.
3. The LIMS MODEL or architecture presented here has been successfully used in systems development.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 27
Designing and Implementing a LIMS for the Use of a Quality Assurance Laboratory Within a Brewery
K. Dickinson¹, R. Kennedy², and P. Smith¹
¹Sunderland Polytechnic, Sunderland, UK and ²Vaux Breweries, Sunderland, UK
1. Introduction This paper presents an overview of the Vaux Laboratory Information Database (VaLID) developed by the authors for use in the quality assurance laboratories of Vaux Breweries in the North East of England. The LIMS has been developed in-house and this paper discusses some of the problems inherent in this development. The use of a structured systems analysis and design methodology such as SSADM is presented as one method of overcoming some of these problems.
2. Background The authors are currently involved in a jointly funded project between Vaux Breweries, Sunderland and Sunderland Polytechnic. The purpose of the project is to develop a LIMS for use within the brewery whilst carrying out research into LIMS and their application elsewhere. The project is funded by Sunderland Polytechnic, Vaux Breweries and the National Advisory Board of the UK government. It is one of a number of research projects into User Friendly Decision Support Systems for Manufacturing Management. Vaux Breweries are the second largest regional brewery in the UK. They produce a range of beers, lagers and soft drinks. The entire production process from raw material to packaged product takes place on a single site. The primary role of the laboratories within the brewery is to ensure the quality of the product at each stage of the manufacturing process. There are a total of three laboratories, each specialising in a particular area. An analytical chemistry laboratory carries out analyses on a wide variety of samples, obtaining a number of analytical results which are used in the day-to-day production process. A microbiological laboratory assesses the microbiological integrity of process materials and final products, as well as ensuring plant hygiene.
There are approximately 70 different types of samples and 50 different analyses carried out on a routine basis. It is estimated that a total of 250,000 analyses are performed per year by the three laboratories with an average of 5 analyses per sample.
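The workload figures above imply a rough sample throughput, which the short calculation below makes explicit. The figure of 250 working days per year is an assumption introduced here for illustration; it is not stated in the paper.

```python
# Figures quoted in the text
analyses_per_year = 250_000
analyses_per_sample = 5

# Assumption for illustration only: the laboratories work 250 days a year
working_days_per_year = 250

samples_per_year = analyses_per_year / analyses_per_sample
samples_per_day = samples_per_year / working_days_per_year

print(f"samples per year: {samples_per_year:.0f}")
print(f"samples per working day (assumed {working_days_per_year} days): {samples_per_day:.0f}")
```

On these estimates the three laboratories together handle on the order of 50,000 samples a year.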
3. The VaLID system The VaLID system has been developed on 80286-based PCs connected via an Ethernet network to an 80386 fileserver. The final system will have 8 such PCs linked into the system. The PC workstations are used to enter data in the laboratories and to query the data in production. All of the data is held on the central fileserver. The system has been developed using the commercial Paradox database system and runs under the Novell operating system. The system consists of six major modules as follows:
3.1 Sample registration module The sample registration module is used by the laboratory staff to enter the details of samples as they arrive in the laboratory. The user selects a sample type from a menu and is prompted for the sample identifying information. The information requested depends upon the sample selected and the group to which that sample belongs. For example, a bright beer tank sample requires brew number, tank, and quality to be input, whilst a packaged product requires this information in addition to product name, package size, best before date, etc. A master sample registration table records the fact that a sample has been logged in, together with the sample’s status and the date and time.
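The type-dependent prompting described above can be sketched as a lookup from sample type to required fields. This is a minimal sketch, not the VaLID implementation (which was built in Paradox); the function, field and status names are hypothetical, and only the fields the text names explicitly are listed.

```python
from datetime import datetime

# Fields named in the text for a bright beer tank sample
COMMON_FIELDS = ("brew_number", "tank", "quality")

REQUIRED_FIELDS = {
    "bright_beer_tank": COMMON_FIELDS,
    # The text ends this list with "etc."; further fields are not reproduced here.
    "packaged_product": COMMON_FIELDS
    + ("product_name", "package_size", "best_before_date"),
}

# Stand-in for the master sample registration table
master_registration = []


def register_sample(sample_type, details, now=None):
    """Log a sample in, refusing it if identifying information is missing."""
    required = REQUIRED_FIELDS[sample_type]
    missing = [name for name in required if name not in details]
    if missing:
        raise ValueError(f"missing fields for {sample_type}: {missing}")
    entry = {
        "sample_id": len(master_registration) + 1,
        "sample_type": sample_type,
        "status": "registered",               # hypothetical status value
        "logged_in_at": now or datetime.now(),
        "details": dict(details),
    }
    master_registration.append(entry)
    return entry
```

Keeping the required fields in a table rather than in code means a new sample type can be added by editing data only, which suits a laboratory whose samples change over time.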
3.2 Analysis data entry module This module allows the analysis results to be recorded. A system of worksheets has been designed which holds results for one or more analyses. The analyses have been grouped according to location and time of data entry so as to allow laboratory staff to continue with a system similar to the previous manual one. When a worksheet is selected all the outstanding samples scheduled for data entry via that worksheet are obtained and the user may select one or more for which he wishes to enter data. After a number of discussions with users, and demonstrations of possible methods of data entry, a spreadsheet-like method was decided upon as being the most appropriate method for the majority of worksheets. After data has been entered it is checked against a set of specifications. The specifications allow a maximum and minimum value to be defined for each analysis, the exact value depending upon up to two of the sample registration details.
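The specification check can be illustrated with a table keyed on the analysis and up to two registration details, falling back to a more general entry when no specific one exists. This is a sketch under assumptions: the choice of `quality` and `tank` as the two details, the fallback order, and all limit values are invented for illustration.

```python
# (analysis, quality, tank) -> (minimum, maximum); all values are invented
SPECIFICATIONS = {
    ("co2", "lager", None): (2.4, 2.8),
    ("co2", "bitter", None): (1.0, 1.6),
    ("haze", None, None): (0.0, 0.8),
}


def find_limits(analysis, quality=None, tank=None):
    """Look up limits, trying the most specific key first."""
    candidates = (
        (analysis, quality, tank),
        (analysis, quality, None),
        (analysis, None, None),
    )
    for key in candidates:
        if key in SPECIFICATIONS:
            return SPECIFICATIONS[key]
    return None


def check_result(analysis, value, quality=None, tank=None):
    """Classify one entered value against its specification."""
    limits = find_limits(analysis, quality, tank)
    if limits is None:
        return "no specification"
    minimum, maximum = limits
    if minimum <= value <= maximum:
        return "in specification"
    return "out of specification"
```

For example, with the invented limits above the same CO2 reading of 2.5 is in specification for a lager and out of specification for a bitter.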
3.3 Validation Analysis results are not available outside the laboratory until after they have been validated by laboratory supervisors. Validation consists of viewing a worksheet with analysis results displayed and those out of specification highlighted. The supervisor may either hold analyses or allow them to be validated after which they may be accessed by production personnel.
3.4 Reporting A number of pre-defined reports linking data from various worksheets have been defined. These are produced on a daily basis as part of a daily laboratory report. A period report giving summary statistics may also be produced. ‘Pass chits’ are automatically produced when certain analyses are validated within specification allowing the product sampled to move to the next stage of production.
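The validation step of section 3.3 and the pass chits of section 3.4 can be sketched together as a simple status change on each result. The status names, the `Result` fields and the chit wording are hypothetical; the paper does not say which analyses trigger a pass chit, so that set is passed in as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Result:
    sample_id: int
    analysis: str
    value: float
    in_specification: bool
    status: str = "entered"  # entered -> validated or held


def out_of_specification(worksheet):
    """Results a supervisor would see highlighted."""
    return [r for r in worksheet if not r.in_specification]


def review(result, approve):
    """Supervisor decision: validate the result or hold it."""
    result.status = "validated" if approve else "held"
    return result


def visible_to_production(worksheet):
    """Only validated results leave the laboratory."""
    return [r for r in worksheet if r.status == "validated"]


def pass_chits(worksheet, chit_analyses):
    """One chit per validated, in-specification result of a qualifying analysis."""
    return [
        f"PASS sample {r.sample_id} ({r.analysis})"
        for r in worksheet
        if r.status == "validated"
        and r.in_specification
        and r.analysis in chit_analyses
    ]
```

A held result stays invisible to production staff and produces no chit, which is the behaviour the text describes.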
3.5 Query module The main method by which production staff may obtain data is via the query system. This system is currently still under development but will eventually allow data to be selected via a number of parameters including analysis or sample registration details.
3.6 Archiving To improve performance, data not required on a regular basis may be archived. This data is still available but is not accessed during regular queries.
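The archiving behaviour can be sketched with two record lists, one live and one archived, where regular queries read only the live list. The date-based cutoff and all names here are assumptions; the paper does not say how records are chosen for archiving.

```python
from datetime import date


def archive_old(live, archive, cutoff):
    """Move records logged before `cutoff` from the live list to the archive."""
    keep = []
    for record in live:
        if record["logged"] < cutoff:
            archive.append(record)
        else:
            keep.append(record)
    live[:] = keep  # modify the live list in place


def query(live, archive, predicate, include_archive=False):
    """Regular queries scan live data only; archived data is searched on request."""
    source = live + archive if include_archive else live
    return [record for record in source if predicate(record)]
```

Regular queries get faster because they scan fewer records, while archived data remains reachable through `include_archive=True`.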
4. LIMS choices The VaLID system is a custom-made LIMS developed in house. This was one of three possible options available when installation of a LIMS was being considered, the others being to buy an ‘off the shelf’ commercial LIMS or to commission a software house to develop a system to Vaux specifications. By far the simplest and preferred option should be to purchase a commercial LIMS. There are a number of these now available on the market and careful consideration of these systems must be made before a decision to develop a unique system is taken. After considering a number of commercial systems on the market it was felt that they could not meet the specific needs of Vaux; in particular, the need to include some of the particular requirements of the production department was thought to be difficult to meet. If a tailored LIMS was to be developed it was felt that a system developed on site was more likely to meet the specific requirements since the development staff would be in much closer communication with the end user than if a system had been developed externally. Obviously this option is only available if the necessary computing skills exist internally within the organisation.
5. Potential problems when developing a custom-made LIMS Care must be taken when developing a LIMS. Software development is one of the most difficult tasks which can be undertaken by an organisation, and careful planning and design are required if the project is to be successful. A number of potential problems arise which are common to many software development projects. A full analysis of the requirements of the system needs to be made and hardware and software chosen which will be able to meet these needs. A common cause of software failure is an underestimation of the data requirements or an overestimation of system performance. This may give rise to a conflict of requirements versus constraints which should be resolved as early as possible. The requirements of the laboratory must be balanced against the constraints of cost and facilities available. A realistic set of requirements needs to be determined if the development is to be successful. Careful analysis of the work involved in achieving these requirements needs to be made and a realistic timescale obtained. Often in software development the work involved is underestimated, leading to systems which run over time and, therefore, over budget. It has been estimated that approximately 90% of software projects run beyond their budgeted timescale. As well as being realistic, requirements must accurately reflect the needs of the laboratory. This requires good communication between those developing the system and laboratory staff. The LIMS needs to be fully and unambiguously specified and these specifications understood by all concerned. The needs of laboratories are constantly changing as new samples, analyses and methods are undertaken. A successful LIMS needs to be flexible enough to be modified to meet these changing needs. Development staff unfamiliar with a laboratory set-up may fail to incorporate facilities for such modifications if they are not fully specified.
The system must be designed with flexibility in mind. A LIMS, like any other software system, needs to be fully documented. Often documentation is left until a system has been developed which may lead to poor or incomplete documents. It has been shown that it is far more efficient to develop documentation whilst a system is being developed. This documentation should include both user and program manuals.
6. Suggestions for avoidance of potential problems Many of the problems discussed above result either directly or indirectly from a lack of communication between the user and the system developer. It is vital that the software engineer fully understands the operation of the laboratory and that laboratory staff understand the specifications being proposed. The computer personnel can gain knowledge of the laboratory from discussion with laboratory personnel and through studying the written standard operating procedures. It is unlikely, however, that such knowledge will encompass all the needs of the laboratory. It is therefore important to have regular meetings between computer personnel and laboratory personnel to discuss the developing specifications. During development of the VaLID system a thorough understanding of the laboratory operations was gained by ‘shadowing’ laboratory staff at work and studying the written procedures. Once sufficient detail had been gathered a detailed specification was prepared and discussed with all concerned. This resulted in a number of modifications and refinements until a system to which all agreed was obtained. During such meetings it was found to be advantageous to use some of the techniques from the Structured Systems Analysis and Design Methodology (SSADM).
Figure 1. Example of a data flow diagram. (Recoverable diagram labels: 1 Sample Collector, Sample Registration; 2 Laboratory Staff; 3 Supervisor, Check Results; 4 Laboratory Staff, Archive Data; 5 Laboratory Staff.)
7. SSADM SSADM is a formal methodology for use in the analysis, design and development of computer systems. It has six stages, each of which has defined inputs and outputs. The outputs from each stage should be agreed upon before continuing to the next stage. The six stages of SSADM are Analysis, Specification of Requirements, Selection of System Options, Logical Data Design, Logical Process Design, and Physical Design. A number of techniques are available for use within each of these stages. These stages, when used with the defined techniques, provide a step-by-step approach to system design. The techniques result in a non-contradictory, precise specification of the system. The deliverables at the end of each stage also form a base for documentation of the completed system. The techniques each have precise rules laid down so that the ambiguities associated with natural language descriptions are avoided.
Figure 2. Example of an entity life history diagram.
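The rule that the outputs of one stage are agreed before the next stage begins can be pictured as a simple gate. The sketch below is only an illustration of that rule using the six stage names from the text; the `Project` class and its methods are hypothetical and are not part of SSADM itself.

```python
STAGES = (
    "Analysis",
    "Specification of Requirements",
    "Selection of System Options",
    "Logical Data Design",
    "Logical Process Design",
    "Physical Design",
)


class Project:
    """Tracks which stage outputs have been agreed, in order."""

    def __init__(self):
        self.agreed = []

    @property
    def current_stage(self):
        # The next stage whose outputs still need agreement, or None when done.
        if len(self.agreed) < len(STAGES):
            return STAGES[len(self.agreed)]
        return None

    def agree_outputs(self, stage):
        # Refuse to sign off a stage out of sequence.
        if stage != self.current_stage:
            raise ValueError(
                f"cannot agree {stage!r}; current stage is {self.current_stage!r}"
            )
        self.agreed.append(stage)
```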
8. Data flow diagrams, entity life histories and logical data structures These three techniques provide a diagrammatic representation of the system being considered. This can be either the existing physical system or the proposed computer system. They can be easily and quickly modified and restructured until an agreed specification has been reached. Such modification is often lengthy and difficult with written descriptions. Figure 1 shows the data flow diagram for data associated with a Bright Beer Tank (BBT) sample within the VaLID system. This data flow was shown to be extremely similar for all beer samples within the system. The diagram shows how analysis data moves around the laboratory. It does not, however, show the order in which each of the functions occurs. To achieve this an Entity Life History is drawn up (Fig. 2). This shows the order in which each of the functions takes place and what has to have occurred before data can move from one location to another. A Logical Data Structure is used to show the relationship between items of data, or entities, within the system. Figure 3 shows a simplified Logical Data Structure for a BBT sample, relating it to a number of analysis groups. All of these techniques proved extremely useful during development of the system. A full description of the SSADM techniques is beyond the scope of this paper but the interested reader may refer to any of the standard text books on the subject. SSADM was primarily designed for use with commercial data processing applications and some of the techniques and stages involved are, therefore, not applicable to the LIMS application. However, following the stages and techniques laid down provides a framework from which a successful system can be developed.
Figure 3. Example of a logical data structure diagram. (Recoverable diagram labels: BBT Sample; CO2 Analysis; Haze Analyses; Scaba Analyses.)
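An entity life history and a logical data structure translate fairly directly into code: the former as a table of permitted event orderings, the latter as an entity that owns groups of related records. The sketch below assumes an ordering (registered, results entered, validated or held, archived) inferred from the module descriptions in section 3; the actual content of Figure 2 is not recoverable, so this ordering and all names are assumptions.

```python
from dataclasses import dataclass, field

# Entity life history: which event may follow which (assumed ordering)
ALLOWED_NEXT = {
    None: {"registered"},
    "registered": {"results_entered"},
    "results_entered": {"validated", "held"},
    "held": {"validated"},
    "validated": {"archived"},
    "archived": set(),
}


@dataclass
class BBTSample:
    """Logical data structure: one sample related to several analysis groups."""
    brew_number: int
    tank: str
    quality: str
    analyses: dict = field(
        default_factory=lambda: {"CO2": [], "Haze": [], "Scaba": []}
    )
    history: list = field(default_factory=list)

    def advance(self, event):
        """Record an event only if the life history permits it at this point."""
        last = self.history[-1] if self.history else None
        if event not in ALLOWED_NEXT[last]:
            raise ValueError(f"{event!r} cannot follow {last!r}")
        self.history.append(event)
```

Here the data flow diagram would say where results travel, while `ALLOWED_NEXT` captures what the text says the data flow diagram cannot: the order in which things must happen.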
9. Conclusion The decision to implement a LIMS within a laboratory is one of the most important strategic decisions which can be made. This decision should not be entered into lightly or without planning and fully specifying the proposed system. Wherever possible commercial systems should be sought which meet the specified requirements. If a decision is made to develop a specific LIMS then care must be taken to ensure a structured approach is made. A number of formal methods are available which provide a structured approach to software development, of which SSADM is only one. The use of such formal methods can help to ensure that no ambiguities exist, either in the system itself or in the perception of the system between the laboratory and software engineer. The important point is that full communication takes place between the LIMS developer and the laboratory at all stages of development.
CHAPTER 28
Selection of LIMS for a Pharmaceutical Research and Development Laboratory - A Case Study
L.A. Broad¹, T.A. Maloney², and E.J. Subak Jr²
¹Analytical Chemistry Department, Pfizer Central Research, Sandwich, Kent, UK and ²Analytical Research Department, Pfizer Central Research, Groton, CT, USA
Abstract The Analytical Chemistry laboratories of Pfizer Central Research on both sides of the Atlantic are jointly evaluating, selecting and implementing LIMS in both departments. A two-phased strategy was devised by an international project team to meet the particular challenges of selecting a single LIMS for our research and development based laboratories that would satisfy both groups. Throughout the evaluation, the commitment of the future users of the system has been actively encouraged by involving them directly in the requirements analysis and evaluation, primarily through the use of workgroups. The two-phased strategy has been highly successful, resulting in the selection of a single vendor LIMS, and we now proceed into implementation with the support and commitment of the project team and future users.
1. Introduction The selection of a commercial Laboratory Information Management System (LIMS) product for the analytical research and development departments of a pharmaceutical organisation is a challenging process. The flexibility and functionality needed extend beyond that normally required in a QC environment. A system is required that can accommodate the unexpected, unknown sample for which no tests or specifications are yet defined, that can handle evolving methods and specifications, and that can be configured to model the roles and structure of various different laboratories. The two Analytical Chemistry departments of Pfizer Central Research in Sandwich, UK and Groton, USA have been tackling this problem as a joint project with the objective of evaluating, selecting and implementing a single LIMS product at both sites. This paper will firstly provide an outline of the departments involved in this project, a description of the international project team, and its joint objectives and constraints. The two phases of the LIMS evaluation process itself will then be described, highlighting experiences gained. Finally, the plans for implementation of the selected system are outlined.
2. Background The two Analytical departments of Pfizer Central Research in Sandwich, Kent, UK and Groton, Connecticut, USA are primarily research oriented. The common mission is the development and application of the analytical chemistry associated with bringing new drug candidates to market. There is an increasing need to assist the analyst in the easy and efficient capture of data, to facilitate the transformation of these data into information for both internal and regulatory use, and to share this information between the departments. In practice, the two departments are functionally very similar, but some organisational differences are apparent. Both comprise a number of project development, resource and technology groups responsible for all analytical activities from initial discovery of a drug through to its marketing and launch. This includes in-process, final QC and stability testing of the drug substance and its dosage forms. Both departments are also responsible for collating data for world-wide registration purposes. Historically most information generated in the laboratories was stored on paper. Many staff in both groups have, however, become accustomed to using a variety of statistical, structure drawing, word-processing and spreadsheet packages. In addition, a number of in-house developed databases have been implemented, handling such functions as sample tracking, storage of stability protocols and results, stability scheduling and logging clinical supplies’ expiry and disposition. Neither group has yet, however, interfaced instruments to its databases. As a result of the use of databases, the value of electronic information management, from day-to-day work and ad-hoc queries through to the collation of data for regulatory submissions, has quickly been appreciated by all levels of staff. By 1987 both departments recognised the need for LIMS and began to prepare for evaluation.
Since there is close trans-Atlantic co-operation and frequent transfer and sharing of information, the decision was made to work jointly in an international project to evaluate, select and implement LIMS. This approach was seen as preferable to either independently selecting, maybe different, LIMS products at each site or even to having one site conduct the evaluation and selection on behalf of the other. A project team was established, comprising a LIMS coordinator on both sides of the Atlantic and additional management involvement. The responsibilities of the team are to coordinate the joint project, set its objectives and to plan and recommend decisions to senior management for endorsement.
3. Objectives, constraints and challenges The selection and implementation of LIMS has been widely discussed in the literature, including general perspectives [1] and specific examples from a variety of organisations [2-5]. Most publications consider implementations at a single site with fairly routine sample handling needs: the challenges of a joint, international project in a pharmaceutical research organisation have not received attention. The paramount objective of this joint effort was to establish a unified system in Sandwich and Groton. The goal was to identify a system that would be flexible enough to accommodate the research and development nature of both laboratories. The team also anticipated the need to implement the system over many months, maybe years, so a system was needed that could be developed in a stepwise fashion. One major question to be answered, however, was whether to buy a LIMS or develop a system in-house. The decision to purchase a system was relatively easy: Pfizer is a pharmaceutical company and, while having excellent computer support, its resources cannot meet the demands required for full scale LIMS development, nor does it have the many years of expertise that the commercial suppliers can call on. The joint objective therefore became to select and purchase a product from the same vendor for both sites. In addition to the constraint of selecting a system that suited both sites, there were a number of other practical constraints. The LIMS would have to run on DEC VAX computers, a corporate and connectivity requirement. In speaking to some potential vendors there was often criticism of this ‘must’. However, the consideration of the hardware is a vital issue in any selection. For a large, two-site installation such as was planned, the full and committed support of the computing groups was needed, which necessarily required abiding by their policies and recommendations.
The connectivity offered by the existing Pfizer Central Research trans-Atlantic DEC computer network was also required. The system would also have to be compatible with the VAX based chromatographic data acquisition system installed at both our sites. Overall, it had to be possible to transfer data easily between chromatography and LIMS systems on a given site and across the Atlantic. There are many challenges associated with an international LIMS project, both administrative and technical. Obviously the geographical separation provides project administration challenges for the international team. In order to overcome these there is regular electronic mail and telephone communication between the project coordinators, and international meetings are held on an approximately quarterly basis. These meetings are used to make major project decisions and to develop and agree strategies and timetables for the project. As has already been identified, the selection of a commercial LIMS product for a research and development laboratory presents its own series of technical challenges. The flexibility and functionality needed extend beyond that usually required in a QC environment. A system is required that can accommodate the unexpected, unknown sample for which no tests or specifications are yet defined, that can handle evolving methods and specifications, and that can be configured to model a range of laboratory disciplines. These challenges are increased when the same LIMS product has to be suitable for multiple laboratories, each with its different product types, roles, organisational structure and, therefore, LIMS requirements. The two departments are functionally similar but do have differences in organisational structure. Throughout the evaluation, selection and implementation process it was necessary to work toward obtaining a product that suited both departments equitably. A further challenge for the international project team is to balance the joint project needs and plans against the background of all other work in the two departments. It is often necessary to justify priorities and resources for the LIMS project against the competing requirements for the time of staff involved. This can be further complicated when the extent or timing of available resource differs between the two departments. Considerable flexibility is required to schedule joint LIMS activities against this changing background. Finally, there is the challenge of developing, and sustaining, user commitment in a long-term project such as this. The joint process for the selection and implementation of LIMS has been designed specifically to include significant participation and contributions from the potential users of that system.
4. Strategy Having made the decision to select a single vendor system it was clear that a thorough evaluation of those available would be required to determine which was the most suitable and whether that system could be implemented to meet our requirements. The basic strategy developed for our evaluation comprised two main phases. Firstly, a preliminary evaluation would be conducted of all commercially available systems that met the original constraints. Both departments would conduct similar evaluations, essentially concurrently, and reach a joint decision as to the leading contender for further evaluation. The second phase would then be based on an extensive on-site evaluation of the selected product. Again, the evaluation would be conducted on both sides of the Atlantic. In addition to the ‘hands-on’ demonstrations and evaluations, both of these evaluation phases would include significant preparation and other associated research, including analysing and developing requirements, literature review and discussions with vendors. These activities would involve many people outside the international project team: the team decided to implement a workgroup approach throughout the evaluation and implementation process with the objective of developing user involvement in, and commitment to, the project.
5. Phase 1 preparation The details of the first phase of the project were developed at a first international meeting. During this meeting the project team developed the strategy and timing for evaluation of the commercially available LIMS products that met our original constraints. In preparation for the first evaluation phase, each department had previously researched its basic LIMS requirements through review of the departmental structure, operation, sample and information flow, and through interviews with staff and brainstorming sessions. Joint brainstorming sessions were then held during the first international meeting to identify, merge and organise the LIMS requirements. These requirements were then classed as essential or desirable and were also organised into six general categories. This classification was designed to help structure the evaluation and to provide areas of study for workgroups. Following this preparation, the strategy developed for the Phase 1 evaluation included the following activities:
(i) The participating vendors would be invited to demonstrate their systems at both sites for two days. During these sessions each workgroup would have its own session with the vendor and the vendor would be invited to make an overview presentation of the system.
(ii) Sales and system literature would be obtained from all vendors and reviewed. Copies of all available manuals (system manager manuals, user guides and reference manuals) would be requested for each site.
(iii) A joint LIMS Request For Proposal would be prepared and submitted to all the vendors that were to participate in our selection process. Their proposals would be available at both sites for review.
(iv) The project team would then select the most suitable product for extended demonstration (some 6 months) on site.
(v) At the end of the extended evaluation a joint recommendation for purchase would be prepared.
Throughout the process, both departments would enjoy the same opportunity to evaluate the products.
6. Request for proposal The Request For Proposal was designed to obtain from the vendors the information nceded to reduce the number of systems for final evaluation. It also provided the opportunity to describe to the vendors the joint nature of the evaluation and to emphasise the need for their cooperation in meeting joint demonstration requirements and schedules on both sides of the Atlantic. The document, some 20 pages, was prepared during a second international meeting and contained six sections; (i) the mission of the departments, (ii) the existing laboratory information management systems-this included an overview of the departments on both
sides of the Atlantic, including their interactions and sample and information flow, (iii) background and objectives-a discussion of the current LIMS and automation strategy and an outline of laboratory instrumentation, connectivity and data import and export requirements, (iv) LIMS functionality required-a list summarising our essential and desirable requirements, separated into six categories, (v) selection and implementation approach-this contained information for vendors such as requirements for the on-site demonstrations and (vi) response to RFP-here was described the response required from the vendor.
7. Requirements & workgroup categories
The initial joint LIMS requirements were organised into six general categories:
(i) Sample management: this considered information available to the laboratory exclusive of that obtained by analysis, for example sample identity, description, tracking, scheduling and status.
(ii) Data management: this was concerned with information gathered from sample testing and included the recording, processing, manipulation, validation, storage, retrieval and reporting of data.
(iii) Quality management: this was a term used to include two areas. Firstly, electronic quality assurance functions such as data validation, results validation, method performance and system performance. Secondly, method and specification management, looking for a database and tools for the preparation, validation, storage, revision, indexing and performance monitoring of analytical methods and specifications, and also at on-line validation of test results against specifications.
(iv) Instrument management: an assessment of the database and tools for the identification, indexing, validation, calibration and maintenance of instruments and associated electronic data transfer (with error checking).
(v) Users and customers: this covered searching and reporting functions and the user interface and ergonometrics, for example on-line help, screen displays and response times.
(vi) Technology and validation: in this category was considered the technology of the LIMS database and tools, for example the database management system structure, archival and retrieval functions, audit trailing and also regulatory needs in the area of computer system validation.
8. The formation of workgroups
Since the project team aimed to develop user commitment primarily through the use of workgroups, these workgroups were established in the earliest stages of the evaluation. In both departments the participants were invited from various groups and represented all
levels of staff, from technician to management. In this way a wide range of experience and expertise was included in the LIMS evaluation process. Both departments were free to structure and staff the workgroups to fit in with local organisation and resource availability. The use of workgroups allowed a large number of people to participate directly in the project. In Sandwich, for example, approximately one third of the department was involved in one or more workgroups during Phase 1. During the first phase, each department had a group of people studying each of the six categories outlined previously. Each workgroup included a local coordinator and secretary and reported its findings through minutes, presentations or meetings to its international LIMS coordinator. The international LIMS coordinator usually attended the meetings to provide tutorials, background information, information from other workgroups, both local and trans-Atlantic, or other information as needed. Otherwise the workgroups were given the freedom to conduct the meetings as they saw fit.
9. Workgroup objectives and activities-Phase 1
The main feature of the first phase was the series of two day on-site demonstrations of the products. The first objectives of the workgroups were to take the basic joint lists of requirements and functions derived from earlier brainstorming sessions and to develop these requirements in order to (i) provide input into a more detailed functional specification and (ii) produce a list of questions to be put to the vendors during demonstration of their systems. The workgroups were then expected to (i) attend the demonstrations and (ii) report their findings-the answers to the questions formed the basis of the workgroups’ summary reports. The use of predefined question lists during the vendor demonstrations enabled the workgroups to obtain the information they needed in addition to the information that the vendors selectively chose to give them. It also helped the groups obtain corresponding information from all vendors. At the end of this first phase the level of user involvement and activity was high. The degree of understanding of LIMS was growing steadily. The level of detail reached in developing the requirements and reviewing the vendor systems was excellent, and this provided a large part of the input into the subsequent selection decision process.
10. Phase 1: Selection decision
At the end of Phase 1, the leading system was identified by means of a thorough and structured decision analysis process which took place during the third joint meeting of the international project team. The Decision Analysis technique, named Kepner-Tregoe after its founders [6], comprises a series of steps as follows: (i) define the objective-“to select a vendor for further evaluation”, (ii) establish criteria-some time was spent formulating
twenty criteria that could be used to sort the systems and also reflected the most important requirements of the system, (iii) weight the criteria-it was found easiest to start with all criteria weighted 5 on a scale of 1-10, and then to increase or decrease weightings from that starting position, (iv) scoring-the systems were scored out of 10, with 5 representing a ‘satisfactory’ score; the overall weighted scores were then calculated for each system, at which point an overall winner emerged, (v) consider the adverse consequences-any potential problems or adverse consequences with the decision obtained from the scoring exercise were then considered and, finally, (vi) take an overview of all the available data and make the best balanced decision-the final decision was to progress with the single system that had scored highest in the first part of the process. Information in support of the decision making was drawn from project team and workgroup findings arising from the on-site demonstrations, from the vendors’ responses to our Request For Proposal, from a review of sales and system literature, and from additional project team research. The selected vendor system was then progressed to Phase 2 of the process, with the objective of gaining a more detailed understanding of the system and its suitability for both departments. A further objective was to identify any additional requirements to be specified in the system before making a purchase.
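The weighting and scoring steps (iii) and (iv) amount to a weighted sum per system. The following Python sketch illustrates that arithmetic only; the criteria names, weights and scores are invented for illustration and are not the twenty criteria used by the project team, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int  # 1-10; the team started every criterion at 5 and adjusted


def weighted_score(criteria, scores):
    """Return the sum of weight * score over all criteria.

    `scores` maps criterion name -> score out of 10 (5 = 'satisfactory').
    """
    total = 0
    for c in criteria:
        s = scores[c.name]
        if not 0 <= s <= 10:
            raise ValueError(f"score for {c.name!r} must be within 0-10")
        total += c.weight * s
    return total


def rank_systems(criteria, all_scores):
    """Rank systems by weighted score, highest first."""
    totals = {name: weighted_score(criteria, s) for name, s in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical criteria, weights and scores, for illustration only.
criteria = [
    Criterion("sample tracking", 8),
    Criterion("instrument interfacing", 5),
    Criterion("audit trail", 7),
]
all_scores = {
    "System A": {"sample tracking": 7, "instrument interfacing": 5, "audit trail": 6},
    "System B": {"sample tracking": 5, "instrument interfacing": 8, "audit trail": 5},
}
ranking = rank_systems(criteria, all_scores)
```

The ranking covers only the numerical part of the method. Steps (v) and (vi), the review of adverse consequences and the final balanced decision, remain a matter of judgement and are not modelled here.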
11. Phase 2: Preparation and activities
The strategy proposed for Phase 2 centred on installing the selected product on site, at both sites, for a detailed technical evaluation. Following a meeting with the vendor at the end of 1988 it was agreed that, in Sandwich, the configuration and installation for on-site evaluation would be performed in much the same way as it would be for a permanent installation. The main features would include (i) a preparatory range-finding and scheduling meeting at Sandwich to determine requirements for the configuration, (ii) two 3-day configuration meetings at the vendor site to prepare the system for installation, (iii) a 2-day programming training course to enable instrument interfacing to be evaluated in-house and (iv) a 5-day installation and training at Sandwich. Once installed, the system would be evaluated by means of comprehensive and representative worked examples and scenarios. In Groton, a similar on-site evaluation was also conducted. There were slight differences in the structuring of the collaboration with the vendor and the details of the in-house activities, but the objectives were the same.
12. Workgroup activities-Phase 2
As in the first phase of this project, there was extensive workgroup involvement. The initial LIMS requirements had been identified during Phase 1. During preparation for the
onsite evaluation, the objectives of the workgroups were to develop these requirements in more detail. In Sandwich, much of this work was developed by a core workgroup comprised of selected members of the earlier workgroups. Prior to the first configuration meeting the workgroups identified sample types, representative specifications and reports, required database fields, example test procedures, calculations and specifications, and reporting requirements. Unique sample identification numbering systems were also developed. During this time the workgroups also developed worked examples, scenarios and other exercises to be used during the hands-on evaluation period and identified various aspects that could challenge the system. The workgroups contributed to collation of specimen documentation for reference during configuration-a document containing the identified methods, specifications, database fields and report types to be accommodated in the evaluation system was compiled for the vendor to use as reference. The workgroups were then expected to attend hands-on evaluation of the system during the six months, report their findings and input into the final decision. During Phase 2 the workgroups were often able to identify the exceptions to routine QC sampling, testing and reporting that would be most likely to offer a challenge to the system.
13. Configuration process
The configuration process was similar at both sites but, for clarity, only the Sandwich process will be described here. The first configuration meeting was held in January 1989 at Sandwich. During this meeting the dates for the remaining configuration and installation meetings were agreed and Pfizer were actioned to prepare for the next meeting: it was necessary to develop various unique identification schemes and to identify database fields and formats. The two 3-day configuration meetings were then held at the vendor site. During these meetings the basic vendor system was configured to match the requirements developed at Sandwich. The main activities were as follows: (i) the database fields were built into the system, (ii) our unique numbering schemes were incorporated, (iii) example screens were prepared for most basic functions, (iv) a skeleton menu structure was configured, (v) some specifications and tests were configured, (vi) examples of various types of customer report were generated for use as templates and (vii) a provisional security structure was configured. By the end of the meetings the software had been configured successfully to accommodate our starting requirements for sample tracking, method and specification management, result entry and reporting. A Pfizer programmer also attended the final part of the second meeting to be trained in the interfacing and data transaction languages, offering the opportunity for a more informed evaluation of direct instrument interfacing.
14. Installation and on-site evaluation activities
The software was installed on-site at Sandwich in May 1989 by the vendor. Hardware for instrument interfacing was also supplied. Installation took less than a day: the remainder of the installation week was spent on user and system management training. To structure the evaluation, two major scenarios were proposed for examination. These scenarios would be based on ‘real’ work and would identify and involve associated genuine samples and data. The first scenario was designed to look at how the system could be used for work associated with products in the advanced stage of development; a current product was used as the worked example. The evaluation considered aspects such as sample logging and tracking, configuring and using methods and specifications, and reporting. In addition to the drug substance and dosage forms, consideration was given to how the system would be used for associated raw materials, in-process samples, excipients and comparative agents, providing a comprehensive scenario for a representative advanced candidate. The fewest problems were anticipated in implementing LIMS for advanced candidates, but it was the area where it was essential that the system could accommodate our requirements. The second scenario considered early development projects. The ability of the system to handle projects from the earliest development sample, through to formal stability, was assessed. To simulate this, the early development of a candidate, examined retrospectively, was used as a model. Greater problems were anticipated in handling these early samples, since there can be rapid changes in sample types, methods and specifications. It was the area of work that would probably be implemented last, but a high degree of confidence was still required that the system could accommodate such work to a reasonable extent.
15. Hands-on activities
A range of representative test procedures were analysed and translated into tests. A series of product specifications were entered to allow evaluation of the use, maintenance and revision of specifications. Screens, menus and standard reports were developed to assess how easy it would be to routinely modify and configure the system in house in a live implementation. Custom retrievals and reports were developed, again to assess the ease and flexibility of operation. During the evaluation the core workgroup held a series of meetings and examined the system on-line. The users were able to get hands-on experience of the system and were able to identify more detailed requirements, questions and concerns. Members of the core workgroup logged in a number of samples to the system, using real laboratory data from completed work. The use of the instrument interface was examined on-line through the
interfacing of an analytical balance and UV spectrophotometer, and the development of a program to enable content uniformity and dissolution of capsules to be tested.
16. Experiences, observations and benefits
In general it was found that the configuration developed and evaluated could accommodate most of the sample types, with varying degrees of ease and elegance. Most of the basic functionality required was provided. Because of the limited training and experience available, first efforts at setting up tests, specifications, screens and reports were time consuming but not difficult. In general, most methods or test procedures could be modelled and translated into tests but, in particularly complex cases, the need for custom work was identified. As experience was gained with the system so also some of the future implementation issues began to emerge. As was expected, one of the challenges is the ‘unknown’ sample: we soon recognised that an implication of using specifications for early development candidates would be the need to be able to create and update specifications at short notice to accommodate these new, unknown, samples and to keep up with test and specification updates. The need to further consider the issue of responsibilities for system updates in a live implementation was recognised. Would test and specification management be restricted to LIMS system managers and implementers or would specified laboratory staff be able to maintain their own records? The experience of interfacing instruments was extremely worthwhile. Workgroup users were able to gain a valuable insight into practical aspects of online data acquisition and were able to provide valuable feedback on ergonometric issues such as how to manage the sharing of interfaces between users. The two-phase strategy, in particular the extended evaluation phase, has been extremely successful and has a number of benefits. An in-depth understanding of the functionality, flexibility, ease of use and scope of the system has been gained.
A better understanding of the product now enables consideration of the role of the vendor LIMS product within a wider information management strategy: the long term implementation and evolution of the system can be planned. The evaluation has helped to assess the resource implications for implementation: it was recognised that the resource required throughout our implementation would depend partly on how many requirements are met by the off-the-shelf purchase and how much customisation (either in-house or by consultancy) is required. The groups are now better placed to assess this and can also estimate and plan for the operational support of the system, including user training. As the role of the workgroups also now changes from evaluation to implementation, the benefits of the workgroup approach become increasingly evident. Through their involvement in all stages of analysis, specification and evaluation, the workgroup members have developed a good understanding of LIMS in general and a working knowledge
of the selected product. They have a greater awareness of the likely impact of LIMS in the workplace and realise the need to consider the future implementation of LIMS when, for example, evaluating instruments for purchase, writing methods or automating procedures. There is also an enthusiasm for the benefits that they expect from the implementation of LIMS in their area. The project team has collaborated throughout the selection process in order to achieve the joint objectives and to make the best use of available resources on both sides of the Atlantic. Where appropriate, certain aspects of the evaluation have been shared, to be examined in greater detail by one or other group, for example where the requirements of both departments were the same. In other areas of study it has been necessary for both groups to evaluate the system to ensure that specific local requirements were properly addressed and assessed. Through this collaboration both groups have a clearer idea of how to meet our unification objectives with this system. A clearly defined objective at the start of this project was to implement the same LIMS product at both our sites. What was not clearly defined at that stage was how similar the local implementations of that single product should be. Knowing the technology as well as the goals, it is possible now to identify pre-requisites to developing this technical unity.
17. Completion of Phase 2
At the conclusion of Phase 2 of the evaluation it was agreed that the system under evaluation should be purchased for implementation in our two departments. Phase 2 has provided the confidence that the system can be implemented for handling our more routine work, for example stability and clinical trial samples. There is less certainty about how the system will handle non-QC samples, but preliminary evaluations have been encouraging. From an understanding of the expected enhancements and the vendor’s mission and product commitment, there is confidence that the system can form the nucleus of a joint LIMS implementation. Most of the basic requirements are met satisfactorily, either as delivered functions or achievable by customisation.
18. Future plans-Implementation
Both laboratories place value in an implementation that is modular in nature and expect to prototype the selected LIMS in a portion of each Department prior to implementation laboratory-wide. The objectives in prototyping include continued collaboration and user involvement through workgroups to (i) learn the LIMS product, (ii) uncover problems earlier with less impact on laboratory operations, (iii) gain implementation experience, (iv) identify future product development and implementation requirements, both shared and site-specific, (v) enhance the user interface and develop training requirements and (vi) determine system validation needs.
The first area of effort in Phase 3 is to prove the LIMS configuration. Then follows a refinement of the requirements for advanced candidates, i.e., for stability work, followed by implementation in that area. Effort in other associated areas such as interfacing with our chromatographic data acquisition system will also be started.
19. Conclusion
The two-phase evaluation strategy has been highly successful, resulting in the selection of a single vendor LIMS for our two departments. The selection decision has been made with a well developed understanding of requirements and of the evaluated product. The evaluation has also given an insight into a number of implementation issues and both groups now proceed into the next phase with the support and commitment of the project team and future users. It is our certainty that the international effort and user commitment will continue to contribute to the goal of a unified LIMS for the two departments and continue to knit the two departments together as a whole.
Acknowledgement
The authors thank Dr. J. C. Berridge for his help during the preparation of this manuscript.
References
1. McDowall RD. Laboratory Information Management Systems. Wilmslow, England: Sigma Press, 1987.
2. Berthrong PG. Computerization in a Pharmaceutical QC Laboratory. Am Lab 1984; 16(2): 20.
3. Cooper EL, Turkel EJ. Performance of a Paperless Laboratory. Am Lab 1988; 20(3): 42.
4. Henderson AD. Use of the Beckman CALS System in Quality Control. Anal Proc 1988; 25: 147.
5. Dessy RE. Laboratory Information Management Systems: Part II. Anal Chem 1983; 55(2): 211A.
6. Kepner CH, Tregoe BB. The New Rational Manager. Princeton, New Jersey, USA: Princeton Research Press, 1981.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 29
A New Pharmacokinetic LIMS-System (KINLIMS) with Special Emphasis on GLP
U. Timm and B. Hirth
Pharmaceutical Research and Technology Management Departments, F. Hoffmann-La Roche Ltd, 4002 Basel, Switzerland
Summary
A new, ‘in-house’-developed, laboratory information and management system for pharmacokinetic studies (KINLIMS) is presented. Chromatography data systems are connected via PCs, terminal servers and Ethernet with a central DEC-VAX computer. Bioanalytical data acquired in the laboratories are sent electronically to a central ORACLE database for GLP-conform data handling. The data are then exported ‘on-line’ to RS/1 programs for subsequent kinetic treatment. The main benefits of KINLIMS with respect to ‘Good Laboratory Practice’ may be summarized as follows:
- No ‘off-line’ data transfer steps involved between acquisition and pharmacokinetic treatment.
- KINLIMS ensures adherence to GLP with respect to data handling/manipulation as laid down in departmental SOPs.
- All data manipulation steps and administration activities associated with data handling are documented in an electronic GLP-journal.
- A sophisticated built-in authorization hierarchy ensures GLP-conforming system handling.
- Future oriented, system-independent and secure concept for the archiving of KINLIMS-produced data.
- An automated validation procedure allows rapid re-validation of all vital KINLIMS functions.
1. Introduction
Pharmacokinetic studies dealing with the time course of absorption, distribution and excretion of a drug are an important step in the development of new pharmaceuticals.
Drug analysis in biological fluids is an indispensable part of these investigations and represents a time-consuming and cost-intensive factor during drug development. Modern data systems are, therefore, widely used to improve the efficiency of collection, reduction, and documentation of analytical data, with concomitant reduction in cost per sample analyzed. The next step in computerisation of analytical laboratories is the introduction of laboratory information management systems (LIMS), and this is now being intensively investigated. The aim of these new systems is to co-ordinate the automation of analytical instruments and data handling and extend this to the distribution of information, management and control within the overall structure of an organisation. According to our experience, none of the available commercial LIMS-systems fulfils all the data processing and management requirements encountered in trace drug analysis in biological fluids. For this reason, a LIMS-system with special emphasis on pharmacokinetic studies (KINLIMS) has been developed in our company. The main advantages of the new system may be summarized as follows:
- Easy and quick input of parameters from all kinds of pre-clinical and clinical studies.
- All data import, handling and export steps are in compliance with Good Laboratory Practice (GLP) regulations.
- Duration of analysis is considerably decreased by avoiding manual data transfer steps between acquisition and pharmacokinetic treatment of data.
KINLIMS is a very complex and sophisticated system (it needed 4 man-years for developing the concept and realizing the application). A detailed description of all aspects of the system would obviously be beyond the scope of this paper. For this reason, the paper briefly describes the concept and functionality, and then concentrates on a single aspect of the new system, namely the benefits of KINLIMS with respect to GLP.
2. Brief description of the system
KINLIMS is installed on a central DEC-VAX computer and uses VMS as the operating system, ORACLE as central database and RS/1 for graphics and statistics. The application was developed ‘in-house’, using the 4th generation development tool UNIFACE. According to the hardware concept shown in Figure 1, chromatography data systems in the laboratories are connected via personal computers (PCs) and terminal servers with the central VAX computer, using Ethernet for data transport. Up to now, three different chromatography data systems are supported by the system, including SP 4200 integrators (Spectra Physics), Nelson 4430 XWZ data systems (Perkin-Elmer) and Nelson 2600 data systems (Perkin-Elmer). In order to import the acquired data in the form of standardized daily reports to the host computer, all data systems have been equipped with user programs, which were developed in our own laboratories. Transfer of daily reports to the central computer is achieved by the protocol-driven programs Kermit (SP 4200, Nelson 2600) and HP Datapass software (Nelson 4430 XWZ).
Figure 1. KINLIMS hardware concept. (Diagram labels: VAX/VMS, Ethernet, terminal server, terminals, laser printer, chromatography data systems.)
VT-200 compatible terminals or PCs with terminal emulation are used to handle the application and to provide the system with information about the kinetic study, investigated drug, etc. Data output is possible by means of local printers or central QMS-, PostScript- and LN03-laser printers.
3. Overview of KINLIMS-functions
KINLIMS offers three different types of functions (project, management and system). To perform a function, the user must have obtained the corresponding authorization from his KINLIMS-manager (see paragraph 4.4). The project-functions represent the heart of the application and are applied to laboratory projects (see Fig. 2). A lab-project represents a complete pharmacokinetic study or only a part of it (e.g. the urine samples). For test purposes (training, validation, etc.), it is possible to define test projects which may be deleted after use. Only three persons (supervisor, analyst, pharmacokineticist), together with their seniors and deputies, have access to the data and can carry out project-functions (details are given below). The status of a lab-project can be either ‘planned’, ‘ongoing’, ‘closed’, or ‘archived’, depending on the progress within the lab-project. During initialization, the status is ‘planned’. After activation, the status changes to ‘ongoing’ and the lab-project is now ready for data acquisition. No further data handling is possible in lab-projects with status ‘closed’, while only archived lab-projects can be deleted from the system.
Figure 2. KINLIMS project-functions. For each project-function, the name of the activity, the status of the lab-project and the necessary authorization are shown (S = Supervisor, A = Analyst, P = Pharmacokineticist, I = Import).
During the initialization phase, a new lab-project is defined and all relevant information concerning investigated drug, design of the kinetic study, involved samples (including calibration and quality control (QC) samples), and analytical methodology is entered into the system. Inputs are made either directly via input masks, or by selecting the information from pre-defined dictionaries, which are maintained by KINLIMS-managers. The initialization procedure has been specifically designed for treatment of pharmacokinetic studies and allows an easy and rapid input of sample descriptions from all kinds of preclinical and clinical studies, such as experimental kinetic studies, toxicokinetic studies, tolerance studies, bioavailability studies, randomized multiple dose studies, etc. After activation by the supervisor, the new lab-project is ready for data acquisition. Analytical data are imported in the form of daily reports generated by chromatography data systems and transferred electronically to the host computer. The daily reports contain only reduced data from a single day, namely names, concentrations and qualifying remarks for all three sample types, as well as information about the quality of the calibration for that particular day. Daily reports can also be corrected under GLP-control after acquisition. However, corrections must be justified by means of a comment (details are given later). The stored data can be selectively retrieved from the database, displayed in 'working tables' and treated statistically in compliance with GLP (exclusion of invalid data, removal of statistical outliers, calculation of means and relative standard deviations for replicate determinations, statistics with calibration and QC data). The quality of the data
can be evaluated by exporting the data to RS/1 and generating graphical outputs, such as quality control charts, cumulation curves, etc. For laboratory management, various reports, tables, lists, etc., can be generated and displayed on the screen or printed out on local and central printers. After data release by the supervisor, analytical reports for calibration, QC and unknown samples can be generated. The pharmacokineticist transfers the analytical end data ‘on-line’ into his private ORACLE-account and evaluates kinetic parameters by means of pharmacokinetic programs based on RS/1. Completed lab-projects may be closed and then archived for long-term data storage. Archived lab-projects can be removed from the database leaving only some cardinal data in the system to allow the management of archived projects. Management-functions are used to gain management data from all individual lab-projects, e.g. overview and description of existing or archived lab-projects, number of analysed samples per year, number of released concentration values per year with respect to various parameters, such as applied analytical methodology, involved species and biological fluids, etc. System-functions are used to maintain the application. The authorized KINLIMS-manager can define new users, edit existing user definitions, maintain dictionaries for species, biological fluids, etc., and run the automated re-validation procedure.
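The lab-project life cycle described in this section (‘planned’, ‘ongoing’, ‘closed’, ‘archived’) can be read as a small state machine. The Python sketch below is one possible reading, not the KINLIMS implementation (which was built with UNIFACE on ORACLE). All class and method names are hypothetical, and refusing data import for ‘planned’ projects is an assumption drawn from the statement that a project is ready for acquisition only after activation. The special case of test projects, which may be deleted after use, is not modelled.

```python
from enum import Enum


class Status(Enum):
    PLANNED = "planned"
    ONGOING = "ongoing"
    CLOSED = "closed"
    ARCHIVED = "archived"


# Allowed forward transitions, as read from the text of this section.
NEXT_STATUS = {
    Status.PLANNED: Status.ONGOING,   # activation by the supervisor
    Status.ONGOING: Status.CLOSED,    # closing a completed lab-project
    Status.CLOSED: Status.ARCHIVED,   # long-term archiving
}


class LabProject:
    def __init__(self, name):
        self.name = name
        self.status = Status.PLANNED  # status during initialization
        self.daily_reports = []

    def advance(self):
        """Move to the next status; archived projects cannot advance."""
        if self.status not in NEXT_STATUS:
            raise ValueError(f"{self.name}: no transition from {self.status.value}")
        self.status = NEXT_STATUS[self.status]

    def import_daily_report(self, report):
        """Accept data only while the lab-project is 'ongoing'."""
        if self.status is not Status.ONGOING:
            raise PermissionError(f"{self.name}: import needs status 'ongoing'")
        self.daily_reports.append(report)

    def can_delete(self):
        """Only archived lab-projects may be removed from the system."""
        return self.status is Status.ARCHIVED


project = LabProject("demo study")       # hypothetical project name
project.advance()                        # planned -> ongoing
project.import_daily_report({"day": 1})  # placeholder report content
project.advance()                        # ongoing -> closed
project.advance()                        # closed -> archived
```

Keeping the transitions in a single table makes it straightforward to check that no function can skip a status, for example archiving a project that was never closed.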
4. Benefits of system with respect to GLP
The Federal Good Laboratory Practices regulations for all non-clinical laboratory studies, and the impending Good Clinical Practices (GCP) regulations require that all analytical data for pivotal studies included in an IND/NDA submission meet specific criteria for acceptability. KINLIMS plays a central role during the analysis of pivotal studies and, therefore, underlies the principles of Good Laboratory Computing Practice (GLCP). Considerable effort has been invested to incorporate major GLCP principles into the concept of KINLIMS, including:
- Data integrity: all analytical data and supporting information is maintained in a secure and consistent manner through all steps between import and archiving.
- System integrity: the system is developed, maintained, operated and used according to the highest standards of computer technology.
- Traceability: suitable controls have been incorporated into the system to ensure that data handling is performed in compliance with GLP and that particular activities, including project-, management- and system-functions, are performed by the correct people.
The following six paragraphs indicate in which way these major GLCP principles were realised.
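As an illustration of the traceability principle, the Python sketch below shows how a correction to a stored value might be refused unless it carries a justifying comment, and how each accepted correction could be written to an append-only journal. It is a minimal sketch under stated assumptions: the paper does not describe the internal design of the electronic GLP-journal, so the record fields, class names and function names here are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class JournalEntry:
    timestamp: datetime
    user: str
    action: str
    comment: str


class GlpJournal:
    """Append-only record of data-handling steps (hypothetical design)."""

    def __init__(self):
        self._entries = []

    def record(self, user, action, comment=""):
        entry = JournalEntry(datetime.now(timezone.utc), user, action, comment)
        self._entries.append(entry)
        return entry

    def entries(self):
        return tuple(self._entries)  # read-only copy for callers


def correct_value(values, sample, new_value, user, comment, journal):
    """Correct one stored concentration; a justifying comment is mandatory."""
    if not comment.strip():
        raise ValueError("a correction must be justified by a comment")
    old_value = values[sample]
    values[sample] = new_value
    journal.record(user, f"corrected {sample}: {old_value} -> {new_value}", comment)


journal = GlpJournal()
values = {"QC-1": 10.2}  # invented sample name and concentration
correct_value(values, "QC-1", 10.4, "analyst1", "dilution factor was mistyped", journal)
```

Because the comment is checked before the value is changed, a rejected correction leaves both the data and the journal untouched.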
[Flowchart: the PC sends the daily report to the VAX; data are extracted and a temporary file is prepared; the temporary file is checked (okay? NO / YES); on YES the imported data are stored in the database and a message is sent to the PC that the data import was successful.]
Figure 3. Data import.
4.1 On-line data transfer
Before KINLIMS was introduced to our laboratories, daily reports were only available in the form of printouts. Pocket calculators were used for data processing and final data had to be entered manually into kinetic programs. In KINLIMS no ‘off-line’ data transfer step is involved between acquisition and pharmacokinetic treatment, thus avoiding any time-consuming and sometimes faulty transcription from raw data to final pharmacokinetic report. Figure 3 shows schematically the import of acquired data into the central database. After the connection to the host computer has been established automatically (user name and password are required), the transfer routine is started. All relevant information is extracted from the imported daily reports, stored in a temporary file and tested for syntax and logic. In case of errors, the import program sends error messages to the chromatography data system and rejects the imported data. Otherwise, the temporary data are stored in the database and the user receives a message that the data import was successful. In Figure 4, the on-line export of data from KINLIMS into other programs is shown schematically. Quality of acquired data can be graphically evaluated in the following way: after starting an RS/1-session via KINLIMS, the data are transferred temporarily from ORACLE to RS/1 and processed by means of RPL-procedures. At the end of the RS/1-session, the system automatically returns to KINLIMS. For end data treatment, a number of interfaces to pharmacokinetic programs or text-systems have been developed. The data are transferred in the form of ORACLE-tables from the database into the private
[Diagram labels: KINLIMS; RS/1; Raw Data Treatment (Plot Quality of Calib-Data; Plot QC Control Chart; Cumulation Procedure for Calib.-Samples; Urine Samples; Plot Quality of Replicate Determinations); Interface for Released Unknown Sample Data; Enddata Treatment (Pharmacokinetics; Elsfit; Interface to Text-Systems; Table).]
Figure 4. Data export.
accounts of end users. RPL-procedures are started, picking up the exported data and preparing suitable RS/1-tables for direct data input into kinetic programs based on RS/1 (INDEPEND, ELSFIT), or into text-systems.
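The import step of section 4.1 (stage the extracted data, test for syntax and logic, then either reject everything or store everything) can be sketched as follows. This is only an illustration: the semicolon-separated line format, the `Record` fields and the function names are assumptions, since the paper does not describe the daily-report layout; the remark codes are those listed under Table 1.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record layout, not taken from the paper.
    sample_no: str
    conc: float
    remark: str = ""


# Qualifying remarks as listed in the legend of Table 1 ("" means no remark).
KNOWN_REMARKS = {"", "NOP", "BLC", "OUT", "EXC", "CLE", "AAR"}


def parse_line(line: str) -> Record:
    """Syntax check: expect 'sample;concentration;remark' (assumed format)."""
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Record(parts[0], float(parts[1]), parts[2])  # float() may raise ValueError


def import_daily_report(lines, database):
    """Stage all records first; store them only if every check passes."""
    staged, errors = [], []
    for number, line in enumerate(lines, start=1):
        try:
            record = parse_line(line)
        except ValueError as exc:
            errors.append(f"line {number}: {exc}")
            continue
        # Logic check: the remark must be one of the known codes.
        if record.remark not in KNOWN_REMARKS:
            errors.append(f"line {number}: unknown remark {record.remark!r}")
        staged.append(record)
    if errors:
        return False, errors          # reject the whole report
    database.extend(staged)           # commit the staged data
    return True, ["data import successful"]
```

The point of staging is that a single faulty line rejects the whole daily report, so the database never holds a partially imported report.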
4.2 Data handling according to GLP
KINLIMS ensures adherence to GLP with respect to data manipulation / handling, as laid down in departmental standard operating procedures (SOP). All analytical laboratories work according to the same quality criteria with respect to treatment, rejection and documentation of data. This may be illustrated by the exclusion of invalid raw-data within the KINLIMS-system. All laboratory data systems linked to KINLIMS carefully monitor the quality of data during acquisition and, if necessary, flag invalid concentration values with ‘qualifying remarks’. For example, concentrations falling outside the calibrated range receive the remark ‘OUT’ (above calibrated range) or ‘BLC’ (below limit of calibration), respectively, as demonstrated in Table 1. All imported data flagged with a qualifying remark are then identified by KINLIMS as invalid data and excluded automatically from any further data treatment. Exclusion of suspicious data is also possible after data import at the KINLIMS-level. For example, the user may manually flag statistical outliers with ‘EXC’. It is also possible to reject all data from a daily report, or to correct individual sample names or qualifying remarks in daily reports by means of a special GLP-correction routine. For GLP-reasons,
TABLE 1
Exclusion of invalid raw-data in ‘WORKING TABLES’.

SAMPLE NO  SAMPL TIME  DATE      CONC FOUND
UA100      0m          26.03.90      -2.43
UA100      0m          27.03.90       3.24
           REM A: NOP, BLC
UA101      10m         26.03.90     100.24
UA101      10m         27.03.90     130.11
UA101      10m         28.03.90      83.56
           REM A: OUT   REM B: EXC   CONC MEAN: 130.11   RSD* (%): -      N: 1
UA102      30m         26.03.90     210.98
UA102      30m         27.03.90     205.13
UA102      30m         28.03.90     312.45
           REM B: EXC   CONC MEAN: 208.06   RSD* (%): 1.99   N: 2

* Relative standard deviation
REMARK A (imported together with values by the laboratory data system)
NOP: No Peak Found
BLC: Below Limit of Calibration
OUT: Out of Quality Range
EXC: Excluded from further treatment
CLE: Calibration Level Excluded
AAR: Acquired After Release
REMARK B (manually set by the user on KINLIMS-level)
EXC: Excluded from further data treatment
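The exclusion rule illustrated by Table 1 can be sketched in a few lines: any value carrying a REMARK A or REMARK B is left out, and the mean, relative standard deviation and N are computed from the remaining values. The function name and the tuple layout are illustrative assumptions, as is the use of the sample standard deviation (n - 1); that choice appears to reproduce the UA102 entries of Table 1 (mean about 208.06, RSD about 1.99 %, N = 2) when the value 312.45 is taken to be the one flagged ‘EXC’.

```python
from statistics import mean, stdev


def summarise(values):
    """values: iterable of (conc, rem_a, rem_b) tuples (assumed layout).

    Returns (mean, rsd_percent, n) over the values that carry no remark.
    """
    valid = [conc for conc, rem_a, rem_b in values if not rem_a and not rem_b]
    n = len(valid)
    if n == 0:
        return None, None, 0           # nothing left, e.g. both values flagged
    m = mean(valid)
    # RSD is undefined for a single value; Table 1 prints "-" in that case.
    rsd = 100 * stdev(valid) / m if n > 1 else None
    return m, rsd, n


# Sample at 30 min from Table 1; the EXC assignment is an assumption.
ua102 = [(210.98, "", ""), (205.13, "", ""), (312.45, "", "EXC")]
ua102_mean, ua102_rsd, ua102_n = summarise(ua102)
```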
[Screen display. Upper window, ‘Comment for GLP-Journal’: ‘Please insert a COMMENT for: Daily report 1T080789.UT excluded’. Lower window, ‘GLP-Journal, data correction’:]

Date         Time   User    Activity                         Comment
09-AUG-1990  14:49  Dr. XY  Daily Rep. 1T080789.UT exclude   Problems with analytical method
10-AUG-1990  08:34  Dr. AB  ...                              ...
11-AUG-1990  11:12  Dr. CD
12-AUG-1990  15:48  Dr. EF

Figure 5. GLP-Journal.
TABLE 2
List of activities recorded in the GLP-Journal.

ACTIVITY                                              COMMENT
Activate lab-project                                  No
Import of daily report                                No
Extend lab-project by new parameters                  No
Extend lab-project by new samples                     No
Exclude data from further treatment                   Yes
Withdraw ‘Exclude data from further treatment’        Yes
Exclude daily report from further treatment           Yes
Modify header of daily report                         Yes
Rename quality control sample                         Yes
Rename unknown sample                                 Yes
Change qualifying remark of quality control sample    Yes
Change qualifying remark of unknown sample            Yes
Release data                                          No
Withdraw ‘Release data’                               Yes
Close lab-project                                     No
Archive lab-project                                   No
all handling steps associated with modification or rejection of data are documented in an electronic GLP-journal, as described in paragraph 4.3.
4.3 Electronic GLP-journal
One major aim of GLCP is the inclusion of a history into the LIMS, showing which person was responsible for the various activities carried out on particular items of information. For this reason, all important activities in ongoing lab-projects are documented with date, time, user and type of activity, in an electronic GLP-journal. In the case of critical data handling steps, the system even asks for a comment, which is also recorded in the GLP-journal as shown in Figure 5. The GLP-journal is divided into three sections, dealing with lab-project activities, working table activities and corrections of daily reports. The user has only read-access to the entries and, therefore, cannot overwrite or delete inputs in the GLP-journal. Table 2 shows all activities subject to GLP-control and indicates in which cases a comment is requested by the system.
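The behaviour described in section 4.3 (entries with date, time, user and activity; a mandatory comment for critical steps; read-only access for users) might be modelled as below. This is a minimal sketch, not the KINLIMS implementation: the class and method names are hypothetical, and `COMMENT_REQUIRED` holds only three of the activities that Table 2 marks with ‘Yes’.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    timestamp: datetime
    user: str
    activity: str
    comment: str


# Subset of the activities for which Table 2 requests a comment.
COMMENT_REQUIRED = {
    "Exclude data from further treatment",
    "Exclude daily report from further treatment",
    "Withdraw 'Release data'",
}


class GlpJournal:
    """Append-only journal: entries can be added but not changed or deleted."""

    def __init__(self):
        self._entries = []

    def record(self, user, activity, comment="", when=None):
        # Critical data handling steps must be justified by a comment.
        if activity in COMMENT_REQUIRED and not comment.strip():
            raise ValueError(f"a comment is required for: {activity}")
        entry = Entry(when or datetime.now(), user, activity, comment)
        self._entries.append(entry)
        return entry

    @property
    def entries(self):
        # Immutable copy; callers get read-access only.
        return tuple(self._entries)
```

A call such as `journal.record("Dr. XY", "Exclude daily report from further treatment", "Problems with analytical method")` would correspond to the first line of the journal shown in Figure 5.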
4.4 Authorization hierarchy
A sophisticated built-in authorization hierarchy ensures GLP-compliant system handling and guarantees a high standard of data security. Four authorization levels have been installed to control access to the system (Fig. 6).
[Diagram of the four authorization levels.
First Level, Operating System (VMS): VMS-User-ID (username, password).
Second Level, VMS-Identifiers: KINLIMS-Owner, KINLIMS-Manager, KINLIMS-User.
Third Level, Functions (KINLIMS): Project, Data Import, Project-Definition.
Fourth Level*, Project: Supervisor, Analyst, Pharmacokineticist.
* Data access also granted to seniors and deputies.]
Figure 6. Authorization levels.
The first two levels are controlled by VMS. All users require a VMS user identification and a valid password and must be authorized for KINLIMS. Only users managing the application receive the VMS-identifier ‘KINLIMS-manager’, and have access to programs and VMS-identifiers. The third authorization level concerns global KINLIMS-functions such as ‘project’, ‘project definition’, ‘data import’, ‘management’ and ‘system’. Depending on the responsibility within KINLIMS, a user may be authorized for one or more of these functions. All staff members obtain authorization for project-functions, while only laboratory supervisors are also authorized to define new lab-projects. Only staff-members with special training obtain authorization for import of daily reports. The management function is dependent on the position of the user in the organisation: managers and group leaders can search through all lab-projects of their group members, while a laboratory supervisor can only manage his own lab-projects. For security reasons, only two persons in each department (KINLIMS-manager and his deputy) are authorized to carry out system-functions. The fourth level regulates the privileges within a particular lab-project. Only three people (supervisor, analyst, and pharmacokineticist), together with their seniors and deputies, have access to the data, and can perform project-functions. However, according to the different responsibilities in the pharmacokinetic study, they have different privileges within the lab-project, as shown in Figure 2.
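The four levels can be read as a chain of checks that must all pass. The sketch below illustrates that reading only; the data classes, field names and the `may_access` function are assumptions, and the real system delegates the first two levels to VMS rather than to application code.

```python
from dataclasses import dataclass, field

# Identifier names as they appear in Figure 6.
KINLIMS_IDENTIFIERS = {"KINLIMS-Owner", "KINLIMS-Manager", "KINLIMS-User"}


@dataclass
class User:
    name: str
    vms_authenticated: bool                              # level 1: user name and password
    vms_identifiers: set = field(default_factory=set)    # level 2: VMS-identifiers
    functions: set = field(default_factory=set)          # level 3: global functions


@dataclass
class LabProject:
    name: str
    members: dict    # level 4: user name -> role (supervisor, analyst, ...)


def may_access(user, function, project=None):
    """Return True only if every authorization level grants access."""
    if not user.vms_authenticated:
        return False
    if not (user.vms_identifiers & KINLIMS_IDENTIFIERS):
        return False
    if function not in user.functions:
        return False
    # Project-specific work additionally requires membership in the project.
    if project is not None and user.name not in project.members:
        return False
    return True
```

Role-dependent privileges inside a lab-project (the differences between supervisor, analyst and pharmacokineticist) are not modelled here; they would be a further lookup on `project.members[user.name]`.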
4.5 Archiving of data
For GLP reasons, all KINLIMS-produced data from pharmacokinetic studies must be retained for a period of at least 10 years after the last introduction of the pharmaceutical.
[Diagram labels: Data Loader; Electronic Archives; Cassettes, Optical Discs, Tapes.]
Figure 7. KINLIMS archiving concept.
Keeping all the data on-line would need a huge database and lead to serious response-time problems. For this reason, a secure, system-independent and future-oriented concept for the archiving of KINLIMS-produced data was developed (Fig. 7). A large report-file, including the lab-project description, all imported daily reports, the working tables (showing all data manipulations) and the GLP-journal, is produced together with an ASCII-file containing all released end data of the lab-project. Report-file and end data-file are sent on-line to the electronic archive and stored finally on tapes, cassettes, optical discs, etc. During the dearchiving process the report-file is sent back into a defined VMS-account and can be displayed on the screen or printed out without the need for any special utilities. The dearchived end data-file is loaded into a private ORACLE table which may be re-used for data transfer into pharmacokinetic programs, as already described. The main advantages of the archiving concept may be summarized as follows:
- Secure long-term data storage: several copies of the archived data files are produced and stored at different locations. Stored data are protected against modification and accidental loss. Only authorized persons have access to the archive rooms and are allowed to dearchive data.
- System-independent data storage: KINLIMS-data are not archived with their data structure and, therefore, need not be restored into the database during dearchiving. Interpretation of the dearchived ASCII report- and end data-files is possible without any special tools, and is not dependent on VMS, ORACLE, UNIFACE or the KINLIMS-application itself.
- Future-oriented data-storage: Long-term storage of KINLIMS-produced data saves space in expensive theftproof, watertight, and fireproof storage rooms. Dearchiving is reasonably fast and is completed in less than 2 hours.
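The idea of system-independent archiving (a plain report-file plus a plain end data-file that can be read back without the original database) can be sketched as follows. File names, the CSV layout and the function names are assumptions for illustration; the SHA-256 checksums are an addition not mentioned in the paper, included only to show one way in which several stored copies could be compared.

```python
import csv
import hashlib
from pathlib import Path


def archive_project(directory, report_text, end_data):
    """Write a plain-text report-file and an ASCII end data-file.

    end_data: iterable of (sample_no, conc_mean) rows (assumed layout).
    Returns a checksum per file so that copies can be verified later.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    report = directory / "report.txt"
    report.write_text(report_text, encoding="ascii")
    end_file = directory / "end_data.csv"
    with end_file.open("w", newline="", encoding="ascii") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_no", "conc_mean"])
        writer.writerows(end_data)
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in (report, end_file)}


def dearchive_end_data(directory):
    """Read the end data back; no database or special tool is needed."""
    path = Path(directory) / "end_data.csv"
    with path.open(newline="", encoding="ascii") as handle:
        return list(csv.DictReader(handle))
```

Because both files are plain ASCII, they remain readable with any text viewer even if the original application is no longer available, which is the property the archiving concept relies on.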
4.6 Automated validation
All LIMS-systems working under GLP conditions have to be revalidated at regular time intervals, or after relevant changes in hard- and software. Because of the complexity of the application, re-validation of all vital KINLIMS-functions is a tedious and time-consuming task. For this reason, a procedure for automatic validation has been developed, speeding up the re-validation process and needing only minimal input by the user. In the ‘Learning Mode’, the operator prepares validation modules for all relevant lab-project functions, including initialization of a validation test-project, import of test data sets, data correction, data handling, data reporting, data export, data security, and data archiving. In the ‘Executive Mode’, execution of one or more validation modules is started and the application is tested automatically.
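One way to picture the two modes is as record and replay: in the learning mode a function is run on test data and the reviewed result is stored; in the executive mode the stored cases are replayed and compared. The paper does not say how validation modules are stored or executed, so the class below is an assumed interpretation, with hypothetical names throughout.

```python
class ValidationModule:
    """Record-and-replay sketch of a validation module (assumed design)."""

    def __init__(self, name, function):
        self.name = name
        self.function = function
        self.cases = []              # list of (args, expected_result)

    def learn(self, *args):
        """Learning mode: run once and keep the result as the reference."""
        expected = self.function(*args)
        self.cases.append((args, expected))
        return expected

    def execute(self):
        """Executive mode: replay all cases and report any deviations."""
        failures = []
        for args, expected in self.cases:
            actual = self.function(*args)
            if actual != expected:
                failures.append((args, expected, actual))
        return failures


def run_validation(modules):
    """Run several modules; return the failures found, keyed by module name."""
    report = {}
    for module in modules:
        failures = module.execute()
        if failures:
            report[module.name] = failures
    return report
```

An empty report after a hardware or software change would indicate that the replayed functions still give the reference results; any non-empty entry points to the module that needs attention.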
Acknowledgements
The authors wish to thank Mr. U. Blattler, Mr. A. Rook (Multihouse, The Netherlands) and Mr. Y. Scherlen for their participation during development of the KINLIMS-system, Dr. H. Eggers, Dr. J. Kneer and Mr. M. Zell for helpful and stimulating discussions, and Dr. H. A. Welker and Mr. G. Zaidman for developing RPL-procedures in connection with KINLIMS. Thanks are also due to Mr. H. Suter for drawing the figures. Finally, the authors are grateful to Dr. D. Dell for continual encouragement during development of the system and for correcting the manuscript.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 30
Validation and Certification: Commercial and Regulatory Aspects
M. Murphy
Sema Group Systems Ltd, Wilmslow, UK
1. Introduction
There has been a significant upsurge of interest in the subject of computer system validation over the last few years, largely triggered by the pronouncements of regulatory bodies. Much of this interest has been fearful, concentrating on the difficulties, timescales and costs involved in validation, and exacerbated by the relative lack of knowledge which scientific professionals have of computers and their works. I would like to try and dispel some of this fear and trepidation, so my intention in this paper is to look at the wider reasons for going through a validation exercise, and to show the various benefits which an organisation can obtain from its successful completion. In order to achieve these ends this paper will be divided into four main areas, namely:
- a review of the current position in terms of regulation and certification, and the likely future trends;
- a discussion of what validation actually is, and what it is intended to achieve;
- a brief look at how one actually goes about doing it, and
- a consideration of the benefits which flow from the completion of the process.
2. Regulatory situation report: Now and in the future I do not propose here to re-cover the well trodden ground of developments in the regulator’s position on laboratory computer systems. Nevertheless, a quick look at the status quo will help to set the framework for the rest of the paper. The present position is that the US FDA and EPA, and the UK Department of Health regard it as essential for certain categories of laboratory computer systems to be validated. This applies to systems involved in the pre-clinical safety assessment of drugs (human and veterinary), food additives, cosmetics, agro-chemicals, etc. Even though this covers a relatively small subset of what might be called ‘commercial science’ this requirement has been enough to worry very large numbers of people. The global nature of the markets affected by these requirements is such that there is little comfort to be
gained from the fact that most other nations are sticking strictly to the OECD requirements on GLP, which tend not to say much about computer systems at all. Those of you who are looking to the regulators for a degree of comfort may find it in the fact that no regulatory position has been adopted on the use of computers in other parts of the product development cycle, namely research (as opposed to development), clinical trialing or manufacturing. Be warned, however, that it does not appear that this situation will obtain in perpetuity. On one hand, meetings have already taken place in the UK to consider whether a similar set of requirements should be built into GMP inspections, a development that the US would be unlikely to ignore. On the other hand, the CANDA project (in which NDAs are being submitted to FDA electronically, albeit on a trial basis) currently mixes data from validated and unvalidated systems (e.g., clinical), a position which cannot be tenable in the long term.
Having seen that the scope of mandatory regulation is likely to increase, what about the ‘voluntary’ sector? There is no doubt that the major commercial change affecting manufacturing industry in the late 1980s and early 1990s is the emphasis on quality. In the mid 1980s quality was seen by both buyers and sellers as an optional extra which was available if you were prepared to pay for it. This is no longer true! In today’s markets it is necessary to be able to demonstrate quality in a commodity in order to be able to sell it at all. In science related industries this not only increases the laboratory work loads (thereby increasing the reliance on computer systems) but also raises the need to demonstrate the reliability of the data and information produced. Presently this development is having the effect of requiring more and more organisations to seek certification or accreditation to recognised quality standards such as ISO 9000 (or some equivalent).
The rapid approach of the European Single Market in 1992, with its emphasis on mutual recognition of test results, is creating additional pressure for the laboratories of manufacturing companies to seek accreditation for the ability to carry out specific CEN or CENELEC tests. This pressure can only continue, and will be increased by the implications of Product Liability legislation and so forth. What has this to do with the validation of computer systems? Simply this: the accreditation authorities have begun to take notice of the work that has been done in both the pharmaceutical and computer industries, and are seriously considering its application to the laboratories they assess. The day when validation and ‘Good Computing Practice’ become a requirement in these areas cannot be long delayed. If we look at the position in purely commercial terms then a number of lessons can be drawn:
1. The pressure of work on laboratories is going to increase with the increasing need to demonstrate quality and safety in products.
2. The falling numbers of science graduates mean that this can only be accommodated by increased efficiency, which inevitably involves the use of computers.
3. Market and regulatory pressures will increasingly require those computer systems to be validated in order to demonstrate the reliability of the information they produce. The bottom line therefore is that if your computer systems are not validated, or worse still cannot be validated, your commercial position will be seriously and increasingly impaired.
3. What is validation and what are we validating?
Having established that the failure to validate laboratory computer systems is (at best) commercially undesirable, it would be good to consider the basics of what exactly validation is. The commonly accepted definition of validation, for regulatory purposes at least, is that produced by the IEEE, namely: “(1) The process of evaluating a system at the end of the development process to assure compliance with user requirements. (2) The process of evaluating software at the end of the software development process to ensure compliance with software requirements”.
Clearly there is some room for interpretation in this definition; if that were not so there would be much less worry and discussion about the whole subject. It is also worth remembering that the definition was not coined with GLP specifically in mind. It is therefore worthwhile to look at the definition and provide a more concrete interpretation. The first difficult point to arise is the word “system”. What does this mean? It must mean more than just software, or there would be no need for part 2 of the definition; what is the GLP interpretation of the term? I suggest that the term “system” needs to be interpreted widely, and includes the software, the hardware, the documentation and the people involved. The second difficult area is what is meant by “the end of the development process”. This has been interpreted as meaning the end of system or integration testing, i.e., before the system goes off to the user site. I suggest that this interpretation is inadequate as it excludes consideration of the hardware on which the system will run, a proportion of the documentation (e.g., SOPs) and the knowledge and training of the people who are actually going to use the system. That being so, “the end of the development process” must mean the end of commissioning, i.e., the point at which the system is installed in its target environment. This, I believe, is the commonly accepted view. The only remaining problem with part 1 of the definition is the term “user requirements”. It is likely that the originators of the definition meant this to mean conformity to the agreed specification, but will this do for a GLP environment (whether formally
regulated or not)? Clearly conformity with the specification is important, in the sense that the system does what it should, but we need to consider also whether the system provides functions or allows actions which would not be acceptable for GLP purposes. It seems necessary to presume the existence of a user requirement for certain functions to be present (such as audit trail) almost regardless of what the specification actually says. Given these interpretations we can provide an expression of part 1 of the definition of validation in terms of a concrete set of questions to be answered. These are:
1. Are the functions of the system restricted to those which are acceptable under GLP?
2. Within that, does the system provide the functions that the user requires?
3. Are those functions properly documented in terms of user manuals, SOPs, operators guides, etc?
4. Are the people who will use the system (including computer operations staff, if any) properly trained?
5. Does the installed system actually work in its production environment?
It might be thought that if part 1 of the definition of validation covers all of these points then part 2 is redundant. Part 2 refers to “the process of evaluating software at the end of the software development process to ensure compliance with software requirements”. Since this cannot mean user requirements, which are covered by part 1, we are forced to conclude that this is related to the requirement that laboratory equipment needs to be properly designed and produced, so as to be capable of maintenance to the standard of the best of current good practice. Expressed bluntly, if perhaps contentiously, this boils down to the simple question: “have the software developers done their job properly?” This is probably the part of validation which most worries the scientist and most annoys the computer specialist. The latter often objects strongly to a perceived implication of incompetence (much as many scientists did in the early days of GLP itself). The scientist, on the other hand, is frequently conscious of his lack of knowledge. The problem is not made easier by the fact that there is no single, universal statement of what constitutes good practice in system development. Nevertheless there is a degree of consensus on what should and should not be done in a properly run development project. It is therefore possible to break the second part of the definition down into a further series of concrete questions. They are these:
1. Are there SOPs (even if referred to by a different name) covering the software development process?
2. Do they require rigorous specification and design in detail?
3. Do they require thorough testing?
4. Do they require comprehensive quality assurance and quality control?
5. Do they require rigorous change management processes?
6. Do they require the use of suitably qualified personnel?
7. Have the SOPs been consistently and demonstrably applied?
In thus breaking the definition of validation down into discrete and specific questions I hope to have provided a practical basis from which validation can commence. Before moving on to consider the mechanics of validation, however, I would like to make one further point. Many people start to think about validating systems at the end of the development. Even the briefest consideration of the questions that need to be asked makes it obvious that this is far too late. If you do not consider validation and quality from the outset you are almost certainly doomed. If the system is not built to be capable of being validated then no amount of validation will help you.
4. The mechanics of validation
Now that we have determined the objectives of the study (for that is what validation is) we can start to look at the way we are going to go about doing it. I do not propose to talk about this topic in great detail as it deserves a paper to itself (and we will shortly hear Hr. Ziegler presenting just such a paper). It is nevertheless necessary to give an overview of the subject in order to provide the basis for discussing the positive benefits which arise from validating a system.
The first essential is to determine what you are going to regard as raw data and what you are not. In a GLP context this decision is fundamental to the validation of a system since any system which has no dealings with raw data may well be excluded from the need for validation. This is not to say, however, that deciding that all raw data will be on paper removes all need for validation as, for example, it is virtually certain that you will need to push electronic copies of the data through such things as statistics packages in order to create reduced data for reports. In any event most people with terminals or PCs on their desk will refer to electronic copies of data before they will walk a hundred meters to go and find the original paper.
The next step is to define the system you are going to validate. This activity has two components: the first being to define the boundaries of the system, and the second to determine the components within that boundary. The definition of the boundary, which I often refer to as the “domain of compliance”, is important partly because we want to be sure that we cover all parts of the system which handle raw data, and partly to ensure that we do not expend effort on systems that do not require it.
The identification of the component parts of the system is important because we need to know exactly what skills and knowledge are required to make an adequate assessment and because it helps in estimating the time and effort required to do it, but more of that later. Before discussing what we are going to do with all this ‘configuration’ information it is worth considering the importance of selecting the right team. The question of skills and
experience is just as important as in any other GLP study. The information gathered during the project phases already described can be placed alongside the list of questions to be addressed in order to determine the expertise necessary to obtain reliable answers. Whilst it is impossible here to provide hard and fast guidelines on team selection it is possible to make some observations which may be helpful:
1. Do not be afraid to involve your specialist computing staff. In-house specialists can provide an independent assessment of externally produced software, or it may be possible to get such an assessment of an internally produced system from some other part of the organisation. It is essential to have a competent external assessment of the software, and equally to have the developers around to answer questions as they arise.
2. Be sure you get the application aspects of the system (i.e., what it does) assessed by people who have been and will continue to be involved in the areas being assessed. Just as you would not ask a junior technician to validate (e.g.) an advanced spectroscopy package, remember that rank does not confer omniscience, and the junior staff often have valid and useful points to make.
3. Remember those whom the laboratory exists to serve! This may not sound like a GLP issue, but if you send to Regulatory Affairs information which they need, or choose, to re-manipulate before submitting to licensing authorities your compliance might be questioned.
4. Last, but by no means least, remember the Quality Assurance Unit. They are, after all, mandated to act as the guardians of GLP.
The last important point to be made about staffing a validation exercise is that it is likely to require significant effort, particularly if the system in question is a large one. To get the job done in a reasonable time it is essential that management make a commitment to provide the necessary resources. The definition of the scope of the system to be validated can be used to estimate the resources required, and such an estimate will make the management commitment easier to obtain. The assignment of individual people to the validation study will normally result in their allocation to look at specific parts of the system. Once that allocation has been made it is possible for the sub-teams thus formed to identify the detailed yardsticks against which compliance is to be measured.
This is important, as it is no good coming up with results like "it looks ok"; after all, you would never do that with a new chemical. The degree of detail required, and the degree of specificity, will vary from case to case, so that it may be appropriate for example to have a single, generally agreed yardstick against which to measure the adequacy of a program specification. As a rough rule of thumb yardsticks for inspections will be more general (if not necessarily less detailed) than those for actual tests. It is, of course, almost certain that the validation exercise will involve both these elements.
All the points I have covered so far in this section are concerned with the planning of the study. The completion of these tasks will put you in a position to actually start to do the work, knowing what to look for, how to find it, and (to a degree) how to assess it. Before beginning the study proper there is one other job I would recommend you to do, which is to devise some method of quantifying the results. Conformity to GLP requirement is rarely a black and white issue, and an agreed means of measuring the shades of grey can bring additional benefits as I shall shortly seek to show. I propose to say very little about the ‘doing’ part of the validation exercise. This is not to say that it is unimportant, but it is essentially a question of putting into effect what has already been planned. One point I would make is this: keep good records of the observations made and the test results obtained. It is unlikely that this ‘raw data’ will be subject to retrospective scrutiny in the way your laboratory data might, but it will certainly be used, and it might also serve to impress any passing inspector.
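As an illustration of what "quantifying the results" might look like, the sketch below scores each finding against its yardstick and summarises conformity per system component. The paper prescribes no particular scheme; the 0 to 4 scale, the field names and the example components are all hypothetical.

```python
from dataclasses import dataclass

MAX_SCORE = 4   # hypothetical scale: 0 = non-compliant ... 4 = fully compliant


@dataclass
class Finding:
    component: str    # part of the system that was inspected or tested
    yardstick: str    # agreed criterion the part was measured against
    score: int        # shade of grey between 0 and MAX_SCORE


def conformity(findings):
    """Percentage of the maximum attainable score, per component."""
    scores = {}
    for finding in findings:
        if not 0 <= finding.score <= MAX_SCORE:
            raise ValueError(f"score out of range: {finding}")
        scores.setdefault(finding.component, []).append(finding.score)
    return {component: 100 * sum(values) / (MAX_SCORE * len(values))
            for component, values in scores.items()}


def worst_first(findings):
    """Order findings so that the weakest results are addressed first."""
    return sorted(findings, key=lambda finding: finding.score)


findings = [
    Finding("data import", "audit trail present", 4),
    Finding("data import", "user manual adequate", 2),
    Finding("data export", "SOP available", 1),
]
summary = conformity(findings)     # {'data import': 75.0, 'data export': 25.0}
```

Keeping the individual findings, rather than only the summary, preserves the ‘raw data’ of the validation study for later prioritisation and re-validation.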
5. Reaping the benefits
So far this paper has contained a large proportion of doom and gloom, lightened only by the possibility of keeping the regulators happy, and, maybe, even staying in business. Clearly these benefits are not sufficient or the subject of validation would not have caused the worry that it so obviously has. So where are the other benefits? If we choose to begin with the technical benefits the first must be the discipline it imposes on the developers. Any developers who recognise that the system is going to be validated, and properly validated at that, would be fools not to take steps to ensure that their work will meet the challenge. The results of this approach must have many tangible benefits for the laboratory and its staff:
1. The rate of system failure will be lower than might otherwise have been the case, which leads to reduced data loss. The savings to be achieved here will vary greatly depending on the way the system operates. At the very least there will be less need to re-enter data lost or corrupted by system failure, and hence less scope to query the validity of such ‘raw’ data. In more advanced systems where data is being acquired directly from instruments there will be savings to be made from reduced re-analysis and the re-preparation of samples, and less danger of total data loss due to the fact that re-analysis is impossible for whatever reason. Lastly, of course, there are many instances in which speed of analysis and reporting is crucial, for example, flow manufacturing environments, where total system loss is wholly unacceptable.
2. Another technical benefit arises from the fact that the system is likely to have been better built if it is known that detailed validation is to follow. Such systems are generally much easier to modify when the need arises. This means that not only can any failures be more swiftly remedied, but the changes that become necessary during the life of a system can be more quickly and hence more cheaply applied. The bottom line is faster response to change at reduced costs.
3. Talk of changes to systems raises the question of re-validation, which is a requirement after any change to hardware or software, as well as periodically if there is a long period during which the system is static. The existence of a detailed, segmented validation protocol makes it much easier to re-validate selected parts of the system and to evaluate the results obtained; a further benefit to users anxiously awaiting a new or upgraded facility. This is a specific instance of the actual use of the raw data accumulated during the initial validation of the system.
4. Another use of this raw data comes from the fact that no system is ever going to be perfect. In quantifying the conformity of parts of the system to the agreed yardsticks there will always be cases where a better result might have been hoped for. Not only will the quantification of these instances allow their correction or enhancement to be prioritised, but the raw data will be invaluable to the developers in actually doing something about the problems which have been identified.
5. The benefit to the people involved in the validation exercise cannot be over-emphasised. Not only will they become familiar in detail with ‘their’ part of the system, but they will develop confidence in it. This will not only make them champions of the system among their colleagues, but it will make them ideally placed to act as centres of expertise, and to assist in the training of those around them. This enhancement of knowledge and confidence will, in itself, make the system more effective by virtue of the fact that people will use it more willingly and with a lower incidence of error.
6. All of these points lead to the fact that the system will produce more reliable and consistent data, and hence better and more timely information. Aside from the fact that this will improve the reputation and influence of the laboratory, it is likely that wider and different uses will be found for the information produced, thereby beginning an upward spiral which can only benefit all concerned.
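The quantification and prioritisation mentioned in point 4 can be illustrated with a minimal sketch. The paper does not prescribe a scoring scheme, so everything here is an assumption made for illustration: the 0.0 to 1.0 scale, the segment names, and the `SegmentResult` and `prioritise` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentResult:
    segment: str   # one part of the segmented validation protocol (hypothetical name)
    score: float   # observed conformity, 0.0 (none) to 1.0 (full); assumed scale
    target: float  # agreed yardstick for this segment, on the same scale


def prioritise(results):
    """Return the segments that miss their yardstick, largest shortfall first."""
    short = [r for r in results if r.score < r.target]
    return sorted(short, key=lambda r: r.target - r.score, reverse=True)


results = [
    SegmentResult("audit trail", 0.6, 0.9),
    SegmentResult("backup and recovery", 1.0, 0.9),
    SegmentResult("access control", 0.8, 0.9),
]

for r in prioritise(results):
    print(f"{r.segment}: shortfall {r.target - r.score:.2f}")
```

Keeping one record per protocol segment also suits re-validation: after a change, only the affected segments need to be re-scored and compared with their earlier values.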
6. Summary and conclusions
In the course of this paper I have sought to show that the pressures of regulation and the commercial need for laboratory accreditation are increasing. There seems little doubt that this will continue to be the case and that science-based industries will need to conform to survive.
It is clear that the validation of laboratory computer systems is fundamental to meeting the requirements of both regulatory and accreditation authorities. Although the definition of validation adopted for official purposes contains few specifics, I believe it can be boiled down to the basic questions of “does it operate as GLP would require?” and “can it be maintained and enhanced without prejudice to its satisfactory operation?”. Based on that premise I have tried to show that the way to complete validation thoroughly in a reasonable time is to define properly the system to be considered, identify the skills required for the validation, define the tests and inspections and the yardsticks against which they are to be assessed, and quantify the results. Finally, I hope I have shown that the benefits to be accrued from this exercise go far beyond merely keeping the authorities happy, and are fundamental to the success of the system, the laboratory it serves and the business served, in turn, by the laboratory.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 31
Developing a Data System for the Regulated Laboratory
P.W. Yendle, K.P. Smith, J.M.T. Farrie, and B.J. Last
VG Data Systems, Tudor Road, Hanover Business Park, Altrincham, Cheshire, WA14 5RZ, UK
1. Introduction
The ability to demonstrate Good Laboratory Practice (GLP) and high standards of quality control to regulatory authorities is increasingly a requirement for many analytical laboratories [1]. Since computer data systems form an integral part of the modern laboratory, validation of such systems can form a major part of the demonstration of GLP, and it is essential that such systems do not compromise standards of quality in use in the laboratory [2,3]. The regulatory bodies place the responsibility for validation of computer systems with the laboratory. There is, however, a great deal that the vendors of such systems can do to assist in this process, and users of laboratory data systems are increasingly demanding evidence of high quality control during software development, and support for their validation procedures.
This paper describes the adoption of a software development environment suitable for production of software for use in the regulated laboratory, with examples taken from the development of the XChrom chromatography data system. The software development life cycle (SDLC) adopted is illustrated, and the implications of the requirements of validation to both vendor and user are discussed.
2. Validation and verification
Although often taken as synonymous in everyday use, the terms validation and verification have well-defined meanings in the context of software development [4]:
Validation: The process of evaluating software at the end of the software development process to ensure compliance with software requirements.
Verification:
1. The process of determining whether or not the products of a given phase of the software development cycle fulfill the requirements established during the previous phase.
2. Formal proof of program correctness.
3. The act of reviewing, inspecting, testing, checking, auditing, or otherwise establishing whether or not items, processes, services or documents conform to specified requirements.
From these definitions it can be seen that it is possible for a user to validate a system (in the strictest sense of the word) with no input from the supplier of the system, since validation is a "black-box" technique performed on the finished product. Indeed, it can be argued that validation can only be performed by the user, since only the user knows the requirements of the system in their particular environment. The supplier can, however, simplify the user's task by providing skeleton validation protocols on which the user can base their own validation procedures, and by providing example data sets and expected results from these for testing.
Although validation of a system in the user environment is essential, this black-box approach alone is not sufficient to satisfy the requirements of many regulatory bodies, which require evidence that a system is both validated and verified. Since verification requires both examination of the internals of a data system (sometimes referred to as grey- and white-box testing) and demonstration of quality control during development of the system, this can only be achieved with the full co-operation of the software developer. Furthermore, the system can only be truly verified if the complete software development life cycle (SDLC) is designed with verification in mind.
3. The software development life cycle
The SDLC is the complete process of software development from conception of an application through release of the product to its eventual retirement. Although the exact SDLC adopted will vary among projects (the SDLC adopted for XChrom development is represented schematically in Fig. 1), the following stages can usually be identified: Analysis, Specification, Design, Implementation, Testing, Release, and Support. Each of these stages is discussed in detail below.
Figure 1. Schematic of software development life cycle (SDLC) adopted for XChrom development.
3.1 Analysis
The analysis stage of the SDLC involves identification of the scope of an application, and will itself consist of several stages. For the developer of commercial data systems, a common stage of analysis will be identification of market niches that are vacant. This will typically involve reviewing the existing products of both the developer and competitors, and will often be combined with a review of the state-of-the-art of both hardware and software. As an example, conception of the XChrom data system was prompted by several major trends in the development of both hardware and software. The intention was to utilise the availability of high performance graphics workstations with powerful networking and distributed processing facilities (which had previously been beyond the budget of the average laboratory), and to follow the industry trend to open systems by use of the X Window System [5] and relational database management systems supporting Structured Query Language (SQL; [6]).
Having identified an application in concept, its feasibility on both technical and marketing grounds must be checked. By documenting the concept and having it reviewed by technical staff, sales and support staff, existing customers and potential users, the initial direction that development should take (and indeed whether development should proceed at all) can be identified. It has been shown [7] that the costs of identification of errors (including incorrect analysis of a system) escalate rapidly as the SDLC proceeds, and so the aim of the analysis stage is to identify any erroneous concepts at the earliest possible instant.
3.2 Specification
The specification stage of the SDLC involves formal summary of the analysis stage, and a detailed description of the requirements of the system from the point of view of the user. This description should be formally documented in the Software Requirements Specification (SRS). A major part of validation of the system will involve demonstrating that the requirements in the SRS have been met by the system, and so the specification stage of the SDLC should also produce the System Validation Plan (SVP), which formally sets out how compliance with requirements will be tested.
To identify errors in specification at the earliest possible stage, both the SRS and SVP should be thoroughly and formally reviewed. Although this review process may be performed internally by the developer, it makes sense to involve users in this process, since both SRS and SVP should be expressed from the user's point of view. This may pose logistic problems when (as in the case of XChrom) the user community is widely distributed geographically. The approach adopted in XChrom development to overcome these problems has been to conduct external informal review prior to formal internal review, with the views of the external reviewers being collated and expressed by a representative at the internal review.
3.3 Design
The design stage is the first stage in converting the requirements of the users (identified at the specification stage) to computer code, although little or no code will be produced in this stage (the possible exception being for prototyping). Of the numerous decisions to be made at the design stage, perhaps the most important is how the design itself will proceed. There are various "methodologies" for formalising the design process, often embodied in Computer Aided Software Engineering (CASE) tools to assist the developer. For a system such as XChrom that combines elements of real-time software engineering, object-oriented programming and relational database design, no single methodology fulfills all design requirements, but similar design processes can be applied to all elements of the system.
System level design involves identification of functional modules into which the system can be divided. Within these modules the various layers of the software need to be identified, and then the contents and the interface between the layers can be designed. As in the case of analysis and specification, all stages of design need to be documented and reviewed, to ensure earliest possible identification of potential errors in the system.
3.4 Implementation
The implementation stage of the SDLC involves committing the detailed design to computer code, using the programming language(s) specified in the design stage. In order to check the quality of coding, it is necessary to specify coding standards that any piece of code can be verified against. In the case of XChrom these standards cover topics ranging from preferred layout style (e.g., standard module and function headers) and external standards to adhere to (e.g., ANSI C) to allowed data types and mechanisms for error handling. Development and enforcement of such standards requires tactful planning, since there is the risk of offending professional pride, for which software engineers are notorious [8]. One method for minimising problems here is to make code reviews a peer-group activity to reduce friction, although ultimately the outcome of peer-group review will of course itself need to be reviewed by project managers.
One aim of coding standards should be to promote "robust" or "defensive" programming [9], in which the programmer attempts to cater for all possible eventualities in a piece of code, even those that are considered extremely unlikely. In the development of XChrom, protocols have been designed to encourage defensive programming, including mechanisms for internal status checking, parameter checking and comprehensive internal memory management.
In addition to code reviews to ensure compliance with coding standards, source code can be subjected to automated analysis using CASE tools. VAX Source Code Analyzer (Digital Equipment Corporation) and the Unix utility lint have been used on the XChrom project for this purpose. Source code can be further checked by ensuring that (where appropriate) it will compile and execute on a range of hardware platforms using a range of compilers. When possible, code for XChrom is compiled and checked on Digital Equipment Corporation VAX and RISC hardware (running VMS and Ultrix respectively), Hewlett Packard HP-9000 series hardware (running HP-UX), and on a range of compilers on personal computers running MS-DOS.
In a multi-platform development, it is essential to employ both code and module management systems to ensure consistent versions of software across all platforms. For XChrom a central code and module management system has been established using VAX Code Management System and VAX Module Management System (Digital Equipment Corporation). All revisions to code must be checked through this central system, which allows the nature, date and author(s) of all modifications to be tracked.
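The defensive style described above (internal status checking and parameter checking) might look roughly like the following sketch. It is written in Python purely for illustration and is not taken from XChrom; the `Status` codes and the `peak_area` function are hypothetical names, and the trapezoidal area calculation is only a stand-in for a real processing routine.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status codes returned alongside every result.
    OK = 0
    BAD_PARAMETER = 1
    TOO_FEW_POINTS = 2


def peak_area(times, signal):
    """Trapezoidal area under a sampled signal, with defensive checks.

    Returns (status, area); area is None unless status is Status.OK.
    """
    # Parameter checking: reject inputs the caller "would never" pass.
    if times is None or signal is None:
        return Status.BAD_PARAMETER, None
    if len(times) != len(signal):
        return Status.BAD_PARAMETER, None
    if len(times) < 2:
        return Status.TOO_FEW_POINTS, None
    # Time stamps must be strictly increasing.
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        return Status.BAD_PARAMETER, None

    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (signal[i] + signal[i - 1]) * (times[i] - times[i - 1])
    return Status.OK, area


status, area = peak_area([0.0, 1.0, 2.0], [0.0, 2.0, 0.0])
print(status, area)
```

Returning an explicit status from every call, rather than assuming success, is what allows the caller to perform the internal status checking mentioned in the text.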
3.5 Testing
Although represented here as a single stage of the SDLC, testing actually comprises several stages. The lowest level of testing (sometimes referred to as "white-box" testing) is performed during implementation, and is encouraged by defensive programming, and enforced by source code analysis and examination of code using interactive debugging tools.
Once a particular module of a system has been implemented (and white-box tested) a test harness can be constructed for the module. This consists of a piece of code that exercises all of the functionality provided by that module, without the module being inserted into the system itself. This level of testing is sometimes described as "grey-box" testing.
Once all of the modules in a system have been grey-box tested, integration of these modules can begin. As individual modules are integrated together, the composite modules can themselves be grey-box tested. During this integration testing it is desirable to ascertain whether program flow is passing between modules as predicted by the system design, and this can be achieved using CASE tools for profiling or coverage analysis. VAX Performance and Coverage Analyzer has been used during XChrom development for this purpose.
Once module integration and integration testing is complete, the system as a whole can be validated against the SRS, using the SVP. Once this procedure has been performed once it can be simplified by the use of regression testing, i.e., the results of one test can be compared to those of previous identical tests to ensure validity, without demonstrating validity from first principles. This testing can be greatly simplified by the use of CASE tools for regression testing (such as VAX Test Manager; Digital Equipment Corporation), although during XChrom development it has not yet been possible to find an automated test manager that will adequately test all aspects of a highly interactive graphical application.
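The idea of regression testing, comparing the results of a test run with those of a previous identical run, can be sketched as follows. This is an illustrative outline only and does not represent VAX Test Manager or any XChrom tooling; the `compare_to_baseline` function, the result names and the tolerance are assumptions made for the example.

```python
import math


def compare_to_baseline(results, baseline, rel_tol=1e-9):
    """Return (name, expected, actual) for every result that differs
    from the stored baseline of a previous, identical test run."""
    failures = []
    for name, expected in baseline.items():
        actual = results.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append((name, expected, actual))
    # Results with no baseline are reported too, so new outputs get reviewed.
    for name in results.keys() - baseline.keys():
        failures.append((name, None, results[name]))
    return failures


# Hypothetical values recorded when the system was validated from first principles.
baseline = {"peak_1_area": 12.5, "peak_2_area": 3.0}

# Hypothetical values produced by the same test on a later build.
current = {"peak_1_area": 12.5, "peak_2_area": 3.1}

for name, expected, actual in compare_to_baseline(current, baseline):
    print(f"REGRESSION {name}: expected {expected}, got {actual}")
```

A comparison of this kind only shows that behaviour has not changed; the baseline itself still has to be established once by full validation against the SVP.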
3.6 Release
Although regression testing is the simplest method of system testing during development, the system should be completely validated using the SVP at least once prior to release to users. Before general release to all users the system needs to undergo "live" testing in a laboratory situation. It is not, of course, possible to perform such testing in a regulated environment, but given the current concerns over data system validation, we have found that many regulated users will assign resources to live testing of new releases of software. It is essential that the developer promotes a strong working relationship with such users.
Release of a new version of a system may require new user documentation (manuals, etc). Although not discussed here, development of user documentation should parallel development of the software. Documentation should pass through its own specification (based on the SRS), design, implementation and testing phases, with appropriate review at each stage. In our experience, review and validation of documentation is more troublesome than the equivalent validation of software, since the evaluation of documentation relies on subjective measures such as style and other aesthetic concerns as much as objective measures such as factual accuracy.
In addition to user documentation, a new release of a system will require an installation protocol and a validation protocol. The installation protocol should be supported by appropriate documentation and software to allow the user to install and configure the system to the needs of their laboratory. The validation protocol should consist of documentation and test data to allow the user to perform basic system validation, which can be augmented by the user's own validation and regression testing. In addition, details of verification and validation procedures employed during development should be made available to those users that require them.
3.7 Support
Once the system has been released, the supplier can support the regulated user in a number of ways. Training of users of the system will promote correct use of the system and explain modes of operation, simplifying the design of standard operating procedures by the users. This can be augmented by technical support (both remote and on-site) both to assist in routine use of the system and to monitor feedback by users of the system. Support for a regulated user should provide a formal system for monitoring and recording user feedback, and should provide a formal escalation mechanism in the event that the user is not satisfied with the technical support provided.
Specific support for user validation can be provided in a number of ways. In addition to providing a basic validation protocol at release to all users that require it, the supplier can collate information and results from users' own validation procedures. This information should be regularly distributed to users (in the case of XChrom this is achieved through a user newsletter and user group meetings). In addition, where appropriate, support can be provided for hardware that requires routine validation (in the case of XChrom, annual revalidation of the VG Chromatography Server acquisition device may be performed either by VG or the user).
4. Implications for the software developer
Production of a data system for the regulated laboratory requires development of a system that can be both validated and verified. We have seen that this requires development and adoption of an SDLC which promotes software verification, and this has both advantages and disadvantages to the software developer.
4.1 Advantages
The principal advantage of developing a system in this way is that it aims to provide the user with the system they want, which should lead to satisfied users and (hopefully) an expanding user base. In addition, adoption of latest standards and techniques of software engineering should promote feelings of professional pride (identified as a possible problem in Section 3.4) in software engineers, and therefore produce a more satisfied and motivated development staff.
The verified software produced should be of a higher quality, and this should lead to reduced maintenance effort on the part of the developer. The effort liberated in this way may be used to offset the extra effort required in adopting a verifiable SDLC, or may be channeled into development of new products. The reduction in maintenance, coupled with the tight definition of each stage of the development process, should simplify project management, and enable the developer to make most efficient use of the development team.
4.2 Disadvantages
The principal disadvantage to the developer is the increased development effort required in the adoption of a verifiable SDLC. Since the distribution of effort throughout the SDLC is not uniform (more staff are required during implementation than specification, for instance), this increase in effort cannot be met simply by recruitment of more staff, but must also result in an increase in the length of the development cycle. Both factors increase the cost of development, as does the need to purchase CASE tools necessary for verified development.
In addition to evidence of validation and verification, regulatory bodies may also require access to commercially sensitive material such as proprietary algorithms and source code. The only circumstances under which access to such material can be justified are where a developer refuses to verify software (unlikely given the current concerns of regulated laboratories) or ceases support for a system. In the case of XChrom, the latter possibility has been covered by provision of an escrow facility. The source code and development documentation is lodged with an independent third party, and a legal agreement is drawn up which allows the user access to this in the event that the developer ceases support of the system. In our experience the provision of an escrow facility satisfies the requirement of regulatory bodies for access to proprietary information, while restricting the access to this information to essential cases only.
5. Implications for the laboratory
Although development of a verified system is intended to meet the requirements of the regulated laboratory, there are both advantages and disadvantages to the users of such systems.
5.1 Advantages
The principal advantage is of course that demonstrating that the system meets the requirements of regulatory authorities should be greatly simplified. The effort that the user will need to dedicate to this activity should be reduced, and the system validation procedure should take the shortest possible time. The system should also be more reliable, reducing any down-time or effort spent on trouble-shooting, and generally increasing user confidence in the system. This should in turn promote use and understanding of the system, which will hopefully increase the efficiency of the laboratory.
5.2 Disadvantages
The principal disadvantages to the user will be a reflection of the increased effort committed by the developer, such as increased cost of the system. In addition, the increased length of the development cycle will mean that any requested changes to the system will take longer, and so response time for verified user-requested modifications will increase.
There may also be a disadvantage in the performance of the initial releases of the system. Since the product will be extensively modularised and layered, there will necessarily be more code to execute than in a system which has not been designed with verification in mind. Furthermore, initial versions will contain a high proportion of internal test code (see Section 3.4 above) which has to be executed and therefore reduces performance of the system. Once the system has been extensively tested in a "live" situation, many of these internal tests can be identified as redundant, and removed from the system. System performance may therefore be expected to increase with successive releases.
6. Conclusions
Validation of computer systems in the laboratory may be performed by the user without support from the system developer. Such validation will, however, need to be augmented by system verification in order to satisfy the requirements of most regulatory authorities. This verification can only be performed with the full support of the supplier, as it needs to be designed into the system, and incorporated into all stages of the software development life cycle (SDLC).
Development and adoption of an SDLC which allows system verification requires commitment of increased resources by the developer. In the short term this may provide problems for both developer and user, but these minor inconveniences should be accepted, since in the long term development of a verifiable data system is to the mutual advantage of both parties.
Acknowledgements
The authors would like to thank David Giles (Logica Communications and Electronic Systems Ltd, Stockport, England) for information and discussion on the adoption of a verifiable SDLC, for designing the SDLC shown in Figure 1, and for producing that figure.
References
1. Good Laboratory Practice: The United Kingdom Compliance Program. Department of Health, London, 1989.
2. Good Laboratory Practice: The Application of GLP Principles to Computer Systems. Department of Health, London, 1989.
3. Computerized Data Systems for Nonclinical Safety Assessment. Drug Information Association, Maple Glen PA, 1988.
4. IEEE Standard Glossary of Software Engineering Terminology. ANSI/IEEE Std 729-1983. IEEE Inc, New York, 1983.
5. Scheifler R, Newman R. X Window System Protocol, Version 11. Massachusetts Institute of Technology, 1985.
6. Date CJ. A Guide to the SQL Standard. Addison-Wesley, Reading MA, 1987.
7. Grady RB, Caswell DL. Software Metrics: Establishing a Company-Wide Program. Prentice-Hall, New Jersey, 1987.
8. Weinberg GM. The Psychology of Computer Programming. Van Nostrand Reinhold, 1971.
9. Sommerville I. Software Engineering. Addison-Wesley, Wokingham, 1989.
Standards Activities
CHAPTER 32
Standards in Health Care Informatics, a European AIM
J. Noothoven van Goor
Commission of the European Communities, DG XIII-F/AIM, Brussels, Belgium
Abstract
To create a common market in Europe it is not sufficient to support common understanding and cooperation in research and basic technological developments. Harmonized applications and common solutions for practical problems are of equal priority. The Commission of the European Communities thus undertook a number of programmes with the aim of advancing informatics and telecommunications in important application fields. One of these programmes is concerned with medicine and health care. It is called AIM, Advanced Informatics in Medicine.
Compared to other fields, the introduction of informatics in health care is a late and slow process. It is often said that the problem of medical informatics is the medical information itself. This information is potentially as complicated as life itself. Based on the universal one-patient-one-physician relation, the organization of health care is shallow and extended, while the information is extremely diversified. Moreover, the nature of the information may be both of vital importance and of private interest. Therefore, both security and privacy should be guaranteed to the highest degree. On the other hand, the technologies of informatics and telecommunications can support medicine and health care only when a certain level of standardization is accomplished. What is more, standardization is also a condition for industrial developments of applications.
The current AIM action supports 42 projects. Five of these aim directly at making proposals for standards, while most of the others have standardization as their second or third objectives. From the other side the Commission has mandated CEN/CENELEC to take standardization action in the field of medical informatics. EWOS, the European Workshop for Open Systems, will play an intermediate role. Finally, EFMI, the European Federation for Medical Informatics, founded a Working Group on the subject.
Figure 1. Computers entered the hospital in two corners: CAE, computer aided equipment; CAA, computer aided administration; POS, patient oriented systems; T, time; UPK, use of pertinent knowledge.
1. Introduction
From a historical point of view computer systems entered the hospitals in two corners (see Fig. 1): first, as a device supporting the performance of medical equipment, and secondly, in about the same period, as an automatic card tray in the accounting department supporting simple office duties. Soon after that, development started of medical equipment with a computer as an essential part. For computer tomography, digital angiography, ultrasonic imaging, nuclear medicine, magnetic resonance, and analyzers, dedicated computers are indispensable. At the other end, the automatic card tray developed into billing systems, accounting systems, and departmental management systems. The overall situation is still characterized by stand-alone systems which are not connected to each other, and which conceivably could not even be connected. This arrangement reflects that of relations in health care in general. Of old, health care is characterized by small operational units, final responsibility at the base, individual patients, no authority structures; in short, a great multiple of the basic and primary one-patient-one-doctor relation.
Historically this social situation led to a significant characteristic of medical information as a derivative of medical language. For a long time the use of medical information as expressed in medical language was practically limited to the primary one-patient-one-doctor relation. Neither need nor possibility existed for a wide dissemination, and therefore medical language remained individual, diversified and not generally defined. What is more, medical language is usually about exceptions, and in some cases could only express the inherent uncertainties of the medical issue.
In the past decade technical means to integrate and to communicate became available on a large scale. The diversification of the medical information, however, forms the main obstacle to the use of these means in health care. At a Conference on Scientific Computing and Automation it should be emphasized that the highly sophisticated features in operation or in development in current computer systems are scarcely of any use in the administration oriented systems for health care. The problem of medical informatics is the character of the medical information itself.
However, not only communication over distances should be considered. The current generation of systems can accommodate knowledge bases and make accumulation, and therefore communication of information over time durations, a useful perspective. The combination of both communication modes (over distance and over time duration) would create possibilities of systems in health care that are more oriented towards the patient. Indeed, originating from the isolated corner of a single medical apparatus, computers now serve IMACS, image archiving and communicating systems. These systems are coupled to a number of medical devices for receiving the image data, receive alphanumeric patient data from the administration systems, and should transmit the information to other places. The orientation of these systems is towards the patient. At the other corner the computer that originally replaced the card tray in the accounting department will gradually become a knowledge based system that supports the administrative staff in planning and accounting as well as the medical staff in deciding on diagnosis and treatment. Also here an orientation towards the patient will occur.
Eventually, any overall system or combination of systems will combine characteristics of the threc comers and handle data from the three sources: the test results, the administrative data, and the data concerning clinical judgements and the like. A harmonization of semantics and syntaxes might mean an effort for the individual physician. However, it is a condition for extended knowledge bases, it facilitates epidemiology, it will support professional education, it will be at the foundation of advanced health care policy, and it will be given an enthusiastic welcome by third party payers.
2. The AIM programme
By promoting international cooperation in research and development the Commission of the European Communities aims at a number of objectives. Of primary importance are the realization of a common and uniform market and the creation of opportunities on that market for its own industries. In the Commission two Directorates General are charged with the task of promoting research and development, viz. DG XII for fundamental research and life sciences, and
DG XIII for informatics and telecommunications. The plans of the Commission are periodically unfolded and updated in the so-called Framework Programme of Research and Development. The application areas of the informatics and telecommunications technologies are also included in the Framework Programme: for the application in road traffic and transport the programme DRIVE, and for that in education the programme DELTA. The AIM Programme, Advanced Informatics in Medicine, aims at the promotion of the applications of these technologies in health care and medicine. For the AIM Programme, as for the other programmes, an initial Exploratory Phase was defined, and needs were indicated for a subsequent Main Phase and a conclusive Evaluation Phase. The general objectives adopted were: the improvement of the efficacy of health care; the reinforcement of the position of the European Community in the field of medical, biological and health care informatics; and the realization of a favourable climate for a fast implementation and a proper application of informatics in health care. Furthermore, it was considered that the costs of medical care are high and still rising, and that the applications of informatics and telecommunications form an ideal opportunity to improve the quality, the accessibility, the efficacy and the cost-effectiveness of this care. By broad consensus a Workplan was defined in which three Action Lines are drawn. The first pertains to the development of a common conceptual framework for cooperation; the second is composed of five more technical chapters, which are listed below; and the third covers the non-technological factors. The chapters of the second Action Line are: the medical informatics climate; data structures and medical records; communications and functional integration; the integration of knowledge-based systems in health care; and advanced instrumentation and services for health care and medical research.
The main activity of the AIM Programme is to subsidize projects that should fulfil the tasks or parts of the tasks as described in the Workplan. In this “cost-shared” model the Commission pays half the costs of the projects. These projects should be undertaken by international consortia formed for the purpose. At least one of the partners of a consortium should be a commercial enterprise in one of the member states, and at least a second partner should come from another member state. A further requirement was that at least one of the partners be either an institute or a company profoundly concerned with medicine or health care. Furthermore, the consortia could have partners from one or more EFTA countries (European Free Trade Association; the countries are Austria, Switzerland, Iceland, Norway, Sweden, and Finland). The costs incurred by EFTA partners are not refunded by the Commission. The desired intensification of the use of advanced IT services in health care could hardly be conceived without a harmonization of protocols, syntaxes, and semantics. The development of common standards is one of the main requirements for a common market for IT systems, and therefore a primary mission of the AIM Programme.
Most of the 42 projects in the current AIM phase have the development of standards in their particular fields of interest as an important objective. Five projects even undertook to formulate complete sets of proposals for standards.
3. Standardization
In Europe the national institutes for technical normalization founded a platform for cooperation called CEN, Comité Européen de Normalisation. It applies both the formal and the practical procedures to obtain European standards. Within CEN a great number of TCs, Technical Committees, are charged with the direct responsibility for standardization in the various areas. The members of a TC are delegated by the national institutes. Observers from international organizations concerned with the specific field may participate in the meetings. Institutes analogous to CEN are CENELEC for electrotechnical products, and ETSI for telecommunications. To confer on demarcations between their operating areas they formed a joint Information Technology Steering Committee, ITSTC. Recently EWOS, European Workshop for Open Systems, was founded. By organizing rather informal workshops and forming small project teams EWOS intends to achieve quick results. On the one hand these are practical recommendations for standards which are fed into the formal CEN procedures, and on the other hand EWOS supports the actual implementation of standards. Any institution can become a member of EWOS. In 1989 the Commission of the European Communities, DG XIII-E, issued a mandate to the standardization institutes to explore the general aspects of standardization in medical informatics in order to define the requirements in this area. The ITSTC decided that CEN coordinate and be primarily responsible for the total work, and that EWOS undertake a part of it, namely the standardization of the transaction of medical data. CEN formed a Technical Committee for Medical Informatics to make a first classification of the area. This TC 251 had a first meeting in June 1990. Apart from the delegates of many countries, representatives of the AIM office, of EFMI (see below), and of COCIR (the committee of manufacturers of radiological equipment) participated.
From many nominees the TC chose a Project Team, PT 001, to do the actual proposing work. A first report is to be expected soon. Already in March 1990, thanks to their rapid procedures, EWOS formed a Project Team, PT 007, of six experts. They started work immediately, and a first draft of their report was issued.
4. EFMI, European Federation for Medical Informatics
Generally the participation in the discussions about standards for technical products is limited to a relatively small number of manufacturers. To identify a set of suppliers in the
vast area of health care informatics is difficult, however. The usual actors form a varied group which ranges from computer researchers to medical practitioners, and from salesmen of systems to project managers. They discuss the desired standards, and their representatives form the CEN TC 251. In 1975 the national societies for medical informatics founded EFMI, European Federation for Medical Informatics. The main activity is the organization of a yearly congress MIE, Medical Informatics Europe. In 1988 EFMI started a Working Group on Standardization in Health Care Informatics. So far the Working Group has held five meetings. The participants, either as contact persons of the national societies or as individual experts, exchange views and inform each other of standardization developments and projects. The EFMI Working Group proved to be an ideal instrument to identify in the various countries those experts that are interested in standardization, and to get them enthusiastic about international cooperation. In this way and via relations in the AIM projects it was possible to activate many national circles to participate in the CEN meetings. In the world of standardization cooperation is a first requirement. It is considered beneficial when the same experts are members of the necessarily great number of committees. From the outside the persons in the consortia of the AIM projects, the members of the EFMI Working Group, and the delegates and representatives in the CEN and EWOS committees seem to form an inextricable tangle and a closed shop. However, they will welcome anybody who, like themselves, is willing to make an effort towards standardization.
5. Conclusions
Standardization is a condition for the wide-scale use of health care and medical informatics and for the creation of a common market. In the last two years three important categories, namely the Commission of the European Communities with their programmes and their mandates, the medical informaticians via their European professional federation, and the national normalization institutes through their European committee, have shown themselves to be aware of this problem and have taken action. As a result, a number of AIM projects, the CEC mandates to CEN and EWOS, the EFMI Working Group on standardization, the Technical Committee of CEN, and the Project Teams of CEN and EWOS are working on the subject. Because of personal unions and good mutual relations an excellent cooperation is achieved.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 33
EUCLIDES, a European Standard for Clinical Laboratory Data Exchange between Independent Medical Information Systems
C. Sevens (1), G. De Moor (2), and C. Vandewalle (1)
(1) Department of Clinical Chemistry, Vrije Universiteit Brussel, Brussel and (2) Medical Informatics Department, Rijksuniversiteit Gent, Gent, Belgium
Clinical laboratory medicine has developed in the past two decades into a major medical specialty. Its primary goal being to inform, it may actually miss its point because of the inability to transfer an ever increasing load of data in a timely and correct manner. Almost all clinical laboratories provide computerized information but the way they exchange data with the clinicians who request the tests differs considerably. Clinical laboratories vary in their structure according to their status. They are hospital-integrated, university-associated, or stand-alone, mostly private. In a hospital where various computers are used (administration, laboratories, clinical departments, pharmacy) the interconnection may present real difficulties. In private settings, one is tempted to create unique electronic bonds between users, private practitioners, and the laboratory. This type of connection, which implies compatible software and hardware, is of course vendor oriented, not compatible with other systems, and limits the practitioners’ freedom of choice of services. The need for standardized vendor-independent laboratory exchange systems is obvious.
Euclides is a project within the exploratory phase of the AIM programme (Advanced Informatics in Medicine) of the Commission of the European Communities. It was started in 1989 in Belgium within two university centers, the Medical Informatics Department at the Rijksuniversiteit Gent and the Department of Clinical Chemistry at the Vrije Universiteit Brussel. There are in total thirteen partners belonging to seven countries: Belgium, France, Greece, Ireland, Italy, Norway and the UK. G. De Moor is the project leader. The project is aimed at the production of a standard suitable for use in at least four areas of application:
1. two-way routine transmission of laboratory messages (requests, results) between remote computers in primary care physicians’ offices, hospitals and laboratories (hospital, university or private).
2. the transmission of data for external quality assessment programmes.
3. the use of the semantic model to interface analysers with laboratory computers.
4. forwarding of anonymous, aggregated data to public authorities for financial analysis and budgetary control.
In the pilot phase, which should not exceed a duration of sixteen months, three main subprojects are conducted simultaneously.
1. The message handling system (MHS)
Euclides has chosen to use the powerful existing X.400 MHS standard recommended by the CCITT (Comité Consultatif International Télégraphique et Téléphonique). The standard is based on the International Organization for Standardization (ISO) Open Systems Interconnection (OSI) model. Euclides will be implemented on top of the 1984 version. However, the use of the 1988 version, which enhances universality and as such matches the goals of Euclides even better, will be recommended. The MHS (Fig. 1) can easily be compared to a postal service. A sender uses the facilities of a user agent (UA) to create envelopes. These contain messages, not to be deciphered by strangers, which should be delivered to another user or receiver. The message transfer system consists of routes created between sorting areas, the message transfer agents (MTA). The 1988 version, not yet fully commercially available, improves the routing and personalizes the service [1]. The Euclides message transfer system put forward in the prototyping phase (Fig. 2) will involve the public domain provided by national administrations as well as the private
Figure 1. MHS model (1984). Labels: message transfer system, message handling system, message handling environment.
Figure 2. EUCLIDES prototype.
domain. The latter can contain MTAs, such as a front-end computer to a large clinical laboratory system, as well as local UAs with full X.400 functionality or remote UAs in combination with an MTA from the public or the private domain. Within this subproject, security and data protection are major issues. Authorization, authentication, encryption of all or part of the messages and the use of check functions to ensure data integrity are all considered in the Euclides standards. The general security service of X.400 will of course be used. At the users’ ends, however, Euclides will not attempt to consider security measures such as physical security of the local hardware or access control within the local system software. These are the responsibilities of the users.
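The postal-service analogy can be made concrete with a small sketch. It is not X.400 and not Euclides code: the class names, the "user@domain" address format and the static routing table are assumptions made purely for illustration. A user agent hands an envelope to its message transfer agent, which either delivers it locally or relays it to the next agent.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    sender: str       # hypothetical address format "user@domain"
    recipient: str
    content: bytes    # opaque to the transfer system, like a sealed letter


@dataclass
class MessageTransferAgent:
    """A 'sorting area': delivers locally or relays to the next agent."""
    domain: str
    routes: dict = field(default_factory=dict)      # domain -> next MTA
    mailboxes: dict = field(default_factory=dict)   # user -> list of envelopes

    def transfer(self, env, hops=None):
        hops = (hops or []) + [self.domain]
        user, domain = env.recipient.split("@")
        if domain == self.domain:
            self.mailboxes.setdefault(user, []).append(env)
            return hops   # list of domains the envelope passed through
        next_mta = self.routes.get(domain)
        if next_mta is None:
            raise LookupError(f"no route from {self.domain} to {domain}")
        return next_mta.transfer(env, hops)


class UserAgent:
    """Creates envelopes on behalf of a user and submits them to an MTA."""

    def __init__(self, address, mta):
        self.address = address
        self.mta = mta

    def send(self, recipient, content):
        return self.mta.transfer(Envelope(self.address, recipient, content))

    def inbox(self):
        user = self.address.split("@")[0]
        return self.mta.mailboxes.get(user, [])
```

In this sketch a physician's private-domain agent could route through a public-domain agent to the laboratory's front-end agent, mirroring the mix of public and private domains described for the prototype.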
2. Syntax
In every language rules are used in communication exchanges but are not explicitly described at every exchange. The set of rules, or syntax, is part of the hidden knowledge [2]. The purpose of the Euclides syntax message is to mimic reality and avoid the embedded overhead of syntax rules in each exchange. An object-oriented approach has been adopted. The kernel of the system is the information exchange unit (I.E.U.) (Fig. 3) composed of a header, a body and a trailer, all
Figure 3. EUCLIDES information exchange unit. Legend: >: sequence, O: selection, *: iteration.
mandatory features. Although structurally similar, the syntax (system) messages are quite distinct from the data messages. The former contain meta-data, i.e., syntax rules about the objects of the data message. The latter contain the data themselves and their relationships. Each dialogue between sender and receiver within Euclides starts with one or more syntax messages laying the ground rules for the exchange. When data messages are transmitted they follow the syntax rules but do not contain them. Figure 4 shows the format of a data message set. The label is the unique identifier of the data message, e.g., test request. The body contains the low-level objects which are qualified as mandatory, conditional, optional or prohibited, e.g., patient, analyte, specimen. Each object in turn is composed of attributes which contain the actual values of the message. From the point of view of the clinical pathologist, they are the alphanumeric values attributed to the result of a chemical test.
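The split between syntax messages (rules) and data messages (values) can be illustrated with a minimal sketch. The class and function names are hypothetical, the header and trailer of the I.E.U. are left out, and conditional objects are treated like optional ones because the conditions themselves are not described here.

```python
from dataclasses import dataclass
from enum import Enum


class Qualifier(Enum):
    MANDATORY = "mandatory"
    CONDITIONAL = "conditional"   # treated like OPTIONAL in this sketch
    OPTIONAL = "optional"
    PROHIBITED = "prohibited"


@dataclass
class SyntaxMessage:
    """Meta-data: which objects a labelled data message may carry."""
    label: str
    rules: dict     # object name -> Qualifier


@dataclass
class DataMessage:
    """The data themselves; the rules travel separately."""
    label: str
    objects: dict   # object name -> {attribute: value}


def check(data, syntax):
    """Return a list of rule violations (empty when the message conforms)."""
    if data.label != syntax.label:
        return [f"label {data.label!r} has no syntax rules"]
    errors = []
    for name, qualifier in syntax.rules.items():
        present = name in data.objects
        if qualifier is Qualifier.MANDATORY and not present:
            errors.append(f"missing mandatory object {name!r}")
        if qualifier is Qualifier.PROHIBITED and present:
            errors.append(f"prohibited object {name!r} present")
    for name in data.objects:
        if name not in syntax.rules:
            errors.append(f"unknown object {name!r}")
    return errors
```

Because the rules are sent once at the start of a dialogue, every later data message only has to carry its label, objects and attribute values.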
Figure 4. EUCLIDES data message set. Legend: >: sequence, O: selection, *: iteration.
3. Semantics
Two problems are tackled in this part of the project, terminology and classification. Existing literature is rather scarce and does not meet the expectations of Euclides [3]. Lists of tests do not address clinical pathology as a whole but limit themselves to one or two subspecialties like clinical chemistry, haematology, immunology, microbiology, toxicology. On the other hand, classification systems of clinical laboratory procedures have been set up with goals like financial purposes or as an aid to medical diagnostics. The Euclides lists are being set up by compiling existing nomenclatures with the content of laboratory guides. These include tests in all subspecialties and are gathered from representative clinical laboratories from all over Europe. Synonyms and acronyms were
identified, unique codes were given, standard units were considered and basic common rules for a standard nomenclature system were developed. Items and their relationships have been chosen in accordance with the syntax rules and with the aim of being universal. The lists are currently being translated into the languages of the partners in the project and even beyond these borders. All the features described here will make it possible for any user to send messages in his usual terminology, in his own language. The messages will be transported by the message handling system using the Euclides codes. The Euclides standard resources translate the message into the receiver’s own terminology and/or language. The use of different units of measurement and their conversion have also been taken into account. At the time of writing a minimum basic data set comprising about 1,000 tests (or analytes), 100 units of measurement and 70 types of specimens has been extracted for use in the prototyping phase. It is already available in four languages.
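The idea of sending in one's own terminology and receiving in another can be sketched as a pair of lookup tables around a shared code, plus a unit conversion. Everything in the tables below is invented for illustration: the code "E0001", the terms and the table layout are assumptions and are not taken from the Euclides lists. The only external fact used is the molar mass of glucose (180.16 g/mol), which gives 1 mmol/L = 18.016 mg/dL.

```python
# Hypothetical tables: synonyms in each language map onto one shared code.
LOCAL_TO_CODE = {
    "en": {"glucose": "E0001", "blood sugar": "E0001"},
    "fr": {"glucose": "E0001", "glycémie": "E0001"},
}
PREFERRED_TERM = {
    "en": {"E0001": "glucose"},
    "fr": {"E0001": "glycémie"},
}
# Factor that converts a value in the given unit to the reference unit (mmol/L).
TO_REFERENCE = {
    ("E0001", "mmol/L"): 1.0,
    ("E0001", "mg/dL"): 1 / 18.016,
}


def encode(term, lang):
    """Sender side: local term (any listed synonym) -> shared code."""
    return LOCAL_TO_CODE[lang][term.lower()]


def decode(code, lang):
    """Receiver side: shared code -> preferred term in the receiver's language."""
    return PREFERRED_TERM[lang][code]


def convert(code, value, from_unit, to_unit):
    """Convert a result between two units listed for the same analyte."""
    in_reference = value * TO_REFERENCE[(code, from_unit)]
    return in_reference / TO_REFERENCE[(code, to_unit)]
```

Only the code travels through the message handling system; the term and unit shown to the receiver are chosen at the receiving end.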
4. Current development and conclusion
Based on the preliminary research conducted over the past year within the three subprojects described in this paper, the implementation phase has started. A software package is being developed, called the “Euclides bridge” to provide the functions and tables
Figure 5. EUCLIDES bridge. Labels: compress/decompress system, encrypt/decrypt system.
necessary for communication with local systems. The objective is to implement Euclides next to and in communication with existing systems without having to make any major modifications to the local software or to interfere with it. The dialogue between partners is initiated in the locally available bridge. The syntax rules are called into action and the mapping of the local file to the Euclides I.E.U. format (dialogue tables) takes place (Fig. 5). These dialogue tables are used to obtain the values of the local data message. There is one dialogue table for each local system. After the creation of the dialogue table, the local data are translated into the Euclides syntax, which is checked for errors. External packages (which are not part of the bridge) are called to compress/decompress, encrypt/decrypt and transmit the message.
The Euclides project, after a year of existence, is entering the implementation phase. The work that has been achieved in three areas can be summarized as follows:
1. a standard message handling system, X.400, is being used for direct application in a medical domain;
2. a flexible object-oriented syntax, applicable to all fields of clinical laboratory data exchange, has been created; its particular features are to convey metadata and to avoid the burden of systematic overhead of syntactic rules;
3. multiple lists of clinical laboratory objects have been elaborated; they can match any local files, they make the local data transferable to another party and as such are of universal use.
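The bridge steps described in this section (mapping through a dialogue table, translation, error check, then hand-over to external packages) can be sketched as follows. This is an illustration under stated assumptions, not the Euclides bridge: the field names and the mandatory-object rule are invented, JSON stands in for the I.E.U. format, Python's zlib stands in for the external compression package, and encryption and transmission are omitted.

```python
import json
import zlib

# Hypothetical dialogue table for one local system:
# local field name -> (Euclides object, attribute).
DIALOGUE_TABLE = {
    "pat_no": ("patient", "identifier"),
    "test": ("analyte", "code"),
    "spec": ("specimen", "type"),
}
MANDATORY_OBJECTS = {"patient", "analyte"}   # assumed syntax rule


def to_euclides(local_record):
    """Map a local record onto objects and attributes via the dialogue table."""
    objects = {}
    for local_field, value in local_record.items():
        if local_field not in DIALOGUE_TABLE:
            continue   # unmapped fields stay in the local system
        obj, attr = DIALOGUE_TABLE[local_field]
        objects.setdefault(obj, {})[attr] = value
    return objects


def check_syntax(objects):
    """Raise ValueError when a mandatory object is absent."""
    missing = MANDATORY_OBJECTS - set(objects)
    if missing:
        raise ValueError(f"missing mandatory objects: {sorted(missing)}")


def outbound(local_record):
    """Local record -> checked message -> compressed bytes for transmission."""
    objects = to_euclides(local_record)
    check_syntax(objects)
    payload = json.dumps(objects, sort_keys=True).encode("utf-8")
    return zlib.compress(payload)   # stand-in for the external package


def inbound(wire):
    """Compressed bytes -> message objects at the receiving bridge."""
    return json.loads(zlib.decompress(wire).decode("utf-8"))
```

Keeping one dialogue table per local system means that the local software itself stays untouched; only the table changes from site to site.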
References
1. Schicker P. Message Handling Systems, X.400. In: Stefferud E, Jacobsen O-J, Schicker P, Eds. Message Handling Systems and Distributed Applications. Elsevier Science Publishers B.V. (North-Holland)/IFIP, 1989: 3-41.
2. Adapted from Collins Dictionary of the English Language, 1985.
3. CPT4, SNOMED. Institut Pasteur, ASTM, ICD9-CM. CAP Chemistry.
CHAPTER 34
Conformance Testing of Graphics Standard Software
R. Ziegler
Fraunhofer-Arbeitsgruppe für Graphische Datenverarbeitung (FhG-AGD), Wilhelminenstraße 7, D-6100 Darmstadt, FRG
Abstract
The widespread acceptance of graphics standards like GKS, GKS-3D, CGM, and PHIGS as international standards for computer graphics has led to a software market offering many implementations of these standards, even before they become official. This paper describes conformance testing of GKS and CGI implementations. The first testing service was established for GKS. The development of this service was finished in 1989. The testing service for CGI is still under development and a prototype is now available. The testing services for both standards rely heavily on visual checks by a human tester. Thus an important consideration is whether the human judgement of the correctness of pictures can be replaced, at least in part, by automatic processes.
1. Introduction
People who buy an implementation of a graphics standard like GKS, GKS-3D, CGM, or PHIGS require a guarantee of its correct functioning and its compliance with the standard. Test suites are needed that test implementations to determine whether the functions perform correctly and whether the language bindings (and data encodings) have been implemented. It is desirable to have the test suite available at the time the standardization process of a given graphics standard has finished. This means that both standard and test suite have to be developed in parallel. The foundations for the certification of graphics standard software were laid at a workshop held at Rixensart near Brussels in 1981 [Thom-84]. Experts on graphics and experts on certification of software systems discussed the issue of how to apply certification to graphics software. In 1985, the first phase of the conformance testing service programme (CTS1) was launched by the Commission of the European Communities. Within CTS1 the testing service for the Graphical Kernel System (GKS) was developed.
Two years later the second phase of the CTS programme (CTS2) started. Within CTS2 a project is now underway to develop testing tools for the emerging Computer Graphics Interface (CGI) standard. Implementations of a given graphics standard can be tested only by an accredited testing laboratory [ISO-90]. The client of an implementation can get the test suite in order to apply a pre-test. This enables him to make changes to the implementation to correct any errors. The formal test of the implementation, which leads to a final test report, is then executed by the testing laboratory. The results of the execution of the tests are listed in detail within this final report. The certification authority tests whether the criteria for issuing a certificate are fulfilled. Testing laboratories for GKS testing are GMD (Germany F.R.), NCC (United Kingdom), AFNOR (France), NIST (United States), and IMQ (Italy). NCC (United Kingdom) and AFNOR (France) are currently discussing the establishment of such laboratories for CGI testing.
2. GKS conformance testing
The Graphical Kernel System (GKS) [ISO-85] is the first international standard for programming computer graphics applications. GKS defines a standardized interface between application programs and a graphical system. Moreover, it is a unified methodology for defining graphical systems and their concepts. GKS supports graphics output, interactive operator input, picture segmentation and segment manipulation. The graphical standard GKS covers a wide field of 2D applications in interactive computer graphics. The first testing service for graphics software was developed for testing implementations of GKS. The GKS document is essentially an informal specification written in natural language. Interpreting the specification in terms of a computer program requires human intellectual effort. Errors can occur. The testers take the complete GKS implementation, subject it to an intensive test suite, and hope to discover errors. The test suite is a sequence of test programs and is based on a simple model which involves two interfaces: the application interface between GKS and an application program, and the human interface where GKS output is observed and input devices are operated. The testing strategy for GKS involves five distinct test series [KiPf-90]:
- data consistency,
- data structure,
- error handling,
- input/output, and
- metafile.
The data consistency test series examine the GKS description tables which describe the configuration and certain properties of the GKS implementation. The tests check the values in the description tables for consistency and conformity with the GKS standard.
The data structure test series ensure that the values in the GKS state lists are manipulated correctly. This is done by setting, modifying and inquiring the values.
The error handling test series produce error situations and then check that an error mechanism in line with the GKS standard is supported by the implementation.
The input/output test series provide a check of the GKS implementation as a whole. This is done through a comprehensive set of tests which cover all the input and output capabilities of GKS. The graphical output is produced by the tests and checked against a set of reference pictures which show the expected appearance of the test output on the display. The Evaluator’s Manual gives a list of items to be checked for each picture. It describes the process of execution for the tests accurately. The input is tested by a set of defined operator actions which should produce specific results on a display. Figure 1 shows the reference picture for the test of polymarker representation. It has to be checked whether the individual representations are drawn with the attributes described under the headings marker type (mkt), marker size factor (msf) and colour index (colix). The ‘bundled’ and ‘individual’ drawn markers have to be identical for each index (ix).
Figure 1. Polymarker representation test (frame ‘set polymarker representation’; columns: ix, mkt, msf, colix, bundled, individual).
The metafile test series check that the GKS metafile is used correctly. Metafiles are created and checkpoints are used to enable visual comparison between screen output and reference pictures. The metafiles are then interpreted and the sessions are interrupted at exactly the same checkpoints so that the output from metafile interpretation can be checked against the reference pictures.
The GKS state list entries from the generating and interpreting sessions are also checked to ensure that they are identical. The test programs in each series are grouped into sets, each of which contains the cumulative test programs for a specific level of GKS. The five test series test different areas of the GKS standard and are of different complexity. Each test series assumes that the previous test series ran without major errors. If the description tables that are examined in the first test series cannot be inquired, it would not make sense to run the data structure test series, as this test series gets its information about the workstation under test from the workstation description table. Therefore the test programs should be executed in the order listed above. The current GKS test suite only tests GKS implementations with FORTRAN language binding. For the GKS C language binding a pilot
version is developed. Clients can buy the GKS test tools from the national testing laboratories. These laboratories offer a list of all tested and certified GKS implementations.
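The rule that each of the five test series assumes the previous one ran without major errors can be sketched as a simple ordered driver. The three-level verdict ("pass", "minor", "major") and the function name are assumptions made for this illustration; the actual GKS test suite is a set of FORTRAN test programs and is not reproduced here.

```python
SERIES_ORDER = [
    "data consistency",
    "data structure",
    "error handling",
    "input/output",
    "metafile",
]


def run_suite(series_runners):
    """Run the series in the prescribed order and stop after a major error.

    `series_runners` maps each series name to a callable that returns
    "pass", "minor" or "major" (a hypothetical verdict scale).
    """
    report = {}
    blocked = False
    for name in SERIES_ORDER:
        if blocked:
            report[name] = "not run"   # later series depend on earlier ones
            continue
        verdict = series_runners[name]()
        report[name] = verdict
        blocked = verdict == "major"
    return report
```

For example, if the description tables cannot be inquired, the first series would report a major error and the remaining four would be marked as not run.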
3. CGI conformance testing
The Computer Graphics Interface (CGI) [ISO-89] defines the interface from a graphics system to a graphical device. A CGI virtual device may be a hardware device or a software implementation. A specific implementation is bound to an environment (like hardware, operating system, control software) and may be influenced by other controlling interfaces in the environment. These dependencies have to be taken into account and increase the complexity of the tests. CGI defines control, output, segment, input, and raster functions. This set of functions covers the whole GKS functionality (CGI as GKS workstation) and in addition provides raster functionality. It is expected that CGI will become an International Standard at the beginning of 1991. A project is now underway to develop testing tools for the emerging CGI standard in parallel with the standardization process. The development team aims to build on the experience gained in constructing and using the GKS tools.
3.1 Test system structure
The definition of the CGI standard covers several parts. Besides the functional description there exist additional standards for data encodings (binary, character, clear text) and language bindings (FORTRAN, Pascal, Ada, C). The CGI testing service aims to build testing tools which cover all these requirements: CGI implementations with different language bindings as well as implementations with different data encodings. The evaluation of the requirements leads to the CGI test system structure illustrated in Fig. 2. The main components are the description of the implementation under test, the test case database and the test suite interpreter (TSI). Valid CGI implementations (which conform to the standard) can differ in their functionality (profiles) and in the capabilities of the virtual device. Therefore the CGI standard defines description tables which describe a given implementation. The test system component ‘description of the implementation under test’ reflects all description tables defined within the CGI standard. Furthermore, certain state list entries which are noted as implementation dependent (e.g., the bundle representations) are included. This description must be set independently of the test. A program called the inquiry tool will call the necessary inquiry functions to gain the information. If necessary the entries will be set manually. This effort is needed to have the information available in a file in a unique and well-defined format. The test cases are selected (‘selection’) depending on the description of the CGI implementation under test.
Figure 2. CGI test system structure.
The ‘test case database’ contains all implemented test cases. The structure of the database is a directory structure subdivided according to the functional parts of the CGI standard (control, output, input, segments, raster). During runtime the complete test set for a specific implementation under test will be selected. Each selected test case contains additional documentation (help utility) which describes and documents the test.
Example: Following is the help text for Test Case ‘Polymarker Geometry-qpe and Posit ion ’: TARGET: Type and Position of the Polymarker primitive are checked here. DISPLAY This is done by drawing several markers. All markers are centrcd horizontally and their position is marked by annotation lines. They are drawn within one box and the type is described by annotation text. CHECKS: Please check, that 5 markers are visible. The 5 markers should be centred to the position annotated by lines and a surrounding box. The marker typcs should be (from top to bottom) describcd by annotation text: Plus Sign, Star, Circle, X, Dot. The CGI test suite is written in a self-defined C-like test description language (pseudo code). Selected test cases are interpretcd by a test suite interpreter (TSI) which interprets the pseudo code according to the language binding or data encoding of the implementation under test. Thus the tests are portable to different environments. The available prototype can interpret pseudo code to an implementation realized as a procedural C language binding. Until the end of the project (mid of 1991) additionally the TSI will be able to interpret to FORTRAN, binary and character encoding. The standards for the Pascal and Ada language binding and the clear text encoding won’t be considered, because up to now their standardization process has not been started. The selection of the test cases depends on the description of the implementation under test. The ‘test results’ are collected within a file and contain pass/fail answcrs and additional remarks.
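How test cases might be selected from the database and how pass/fail answers could be collected is sketched below. This is not the TSI: the `Case` record, the `supported_parts` key and the two callbacks are hypothetical names chosen for the illustration, and the real description of an implementation contains far more than a set of functional parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    part: str             # control, output, input, segments or raster
    automatic: bool       # False: needs a visual check by the human tester
    help_text: str = ""   # on-line documentation shown to the tester


def select(database, description):
    """Keep the test cases whose functional part the implementation supports."""
    supported = description["supported_parts"]
    return [case for case in database if case.part in supported]


def run(selected, execute, ask_tester):
    """Collect (name, pass/fail, remark) for every selected test case.

    `execute(case)` runs a case and returns True/False for automatic cases;
    `ask_tester(case)` returns (passed, remark) from the human tester.
    """
    results = []
    for case in selected:
        if case.automatic:
            passed, remark = execute(case), ""
        else:
            execute(case)   # draw the picture, then ask for a judgement
            passed, remark = ask_tester(case)
        results.append((case.name, "pass" if passed else "fail", remark))
    return results
```

The returned list corresponds to the test results file with its pass/fail answers and additional remarks.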
3.2 Test picture design
The testing strategy is similar to the GKS testing strategy (no metafile tests), but the design of the test software was changed in some major parts [BrRo-89]. The CGI test suite includes automatic testing and visual checking. Data consistency/structure and error handling can be tested automatically; the main test is done by visual checking of the output by a human tester. The test suite developers decided to cover as many aspects as possible of the CGI implementation under test. However, the test software must be designed in such a way that the tester does not become bored or tired. The requirements for the designer of the test software were therefore to keep the tests interesting, simple, and uncluttered. Visual cues (self-annotating test pictures) have to be included to aid the judgement of the tester (see Fig. 3). Finally, redundant tests have to be avoided.
The specification of the CGI test suite does not rely on the concept of reference pictures (such as was used in GKS). The wide range of functionality of CGI in addition to diverse hardware capabilities makes the selection of appropriate reference pictures impossible. Test cases interpreted by the TSI generate visual output. The human tester has to examine whether the result corresponds to the required behaviour defined within the standard.

The CGI test system satisfies all requirements according to the design and specification criteria. The ‘polymarker representation’ test within the GKS testing (see Fig. 1) can be applied to CGI testing, too. However, the design must be changed. The describing columns (ix, mkt, msf, colix) will be removed. The test documentation will contain this description (see the previous example). Visual cues will be added to evaluate the correct positioning and sizing. This example illustrates the application of the design criteria for CGI testing.

Consequently all test pictures are very simple with self-annotating visual cues. The annotation utilities use the POLYLINE, POLYMARKER and TEXT functions and certain attribute functions. One possibility for using cues is boxing, i.e., a POLYLINE drawn round the primitives to be checked.

Each test case is documented explicitly. The human tester can get additional help from this documentation, which is available either on a separate screen (on-line documentation) or within a comprehensive manual.

The TSI interprets the test case. Furthermore the TSI manages the reporting of the test results and the on-line test documentation (if feasible). Thus a user interface was designed to handle interaction with the tester and to access the test documents, so as to present information to the tester. Finally, information concerning the tester’s assessment of whether a test has passed or failed is passed to the automatic report generator.

Figure 3. Test of polymarker. (Five markers in one box, annotated from top to bottom: Plus Sign, Star, Circle, X, Dot.)
As mentioned above, the CGI test system is available as a prototype. This first version will be able to test CGI devices covering the GKS level 0a functionality realized as ‘C’ language binding implementations.
3.3 Application of automatic testing
The realized CGI test system includes visual checking and automatic testing. The output/input test, which is the main part of the test, is done by visual checking. Automatic tests are performed to check whether a certain profile is implemented, whether the inquiry functions deliver correct information and whether the error handling mechanism behaves correctly. These tests concentrate on checking the consistency of description table entries and the default/current settings of state list entries, but not on output visible on the screen.

The described scheme as used in the GKS and CGI validation service relies strongly on human checking: it is thus subjective and limited by the accuracy of the human eye and brain. An important consideration is whether the human judgement of the correctness of pictures within the CGI test system can be replaced, at least in part, by some automatic processes. The need is certainly great: CGI defines a much richer set of primitives and attributes than GKS, and the demands on the human tester will be severe. Thus we evaluated whether automatic testing of visual output could be applied and integrated within the CGI test system.

How can the test program ‘see’ the generated visual output? The first approach we made [Brod-89] was to examine the raster image; we call this testing at the raster interface. CGI includes a function (GET PIXEL ARRAY) that returns the colour of each pixel within a specified rectangle. The raster interface is thus a practical point at which to examine graphical data.

Our preferred approach is to analyse the raster data generated by the implementation in response to a test program. We define a number of conditions that must be satisfied, and analyse the raster data to verify whether this is indeed the case. For example, we develop a number of conditions that characterise a line. These conditions include a set of pixels that must not be illuminated, a set of pixels that must be illuminated, and relationships between the remaining pixels. We have looked in detail at the issue of whether a sequence of pixels adequately represents a given line, and have derived a characterisation based on human visual assessment.
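As an illustration of what such conditions might look like, the following Python sketch checks a monochrome raster against two simple rules. The rules and the tolerance `tol` are assumptions made for this example only; they are not the characterisation derived in [Brod-89].

```python
import math


def dist_to_segment(px, py, x0, y0, x1, y1):
    """Euclidean distance from the point (px, py) to the segment (x0, y0)-(x1, y1)."""
    dx, dy = x1 - x0, y1 - y0
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - x0, py - y0)
    t = ((px - x0) * dx + (py - y0) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (x0 + t * dx), py - (y0 + t * dy))


def check_line(raster, x0, y0, x1, y1, tol=1.0):
    """Check a raster (raster[y][x] is 0 or 1) against an ideal line.

    Returns (ok, reason). Both conditions are illustrative assumptions.
    """
    lit = [(x, y) for y, row in enumerate(raster)
           for x, value in enumerate(row) if value]
    # Condition 1: pixels further than tol from the ideal line must not be lit.
    for x, y in lit:
        if dist_to_segment(x, y, x0, y0, x1, y1) > tol:
            return False, f"stray pixel at ({x}, {y})"
    # Condition 2: every sample point on the ideal line needs a lit pixel
    # within tol, so that the line has no visible gaps.
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        sx = x0 + (x1 - x0) * i / steps
        sy = y0 + (y1 - y0) * i / steps
        if not any(math.hypot(x - sx, y - sy) <= tol for x, y in lit):
            return False, f"gap near ({sx:.1f}, {sy:.1f})"
    return True, "ok"


# A 5 x 5 raster holding a perfect diagonal from (0, 0) to (4, 4).
diagonal = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
```

The first rule corresponds to the set of pixels that must not be illuminated; the second is one possible way to express a relationship between the remaining pixels.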
Furthermore we developed similar characterisations of other primitives (line primitives, polymarker, polygon primitives, cell array) and have defined certain test methods. Similarly, we developed conditions that must be satisfied in certain graphical operations. We have looked at the attribute binding mechanism (bundle representations), the clipping mechanism, all segment mechanisms (creation, deletion, display, transformation, copy, detection, inheritance) and all raster functions. Generally these conditions relate to the raster data before and after the operation. A trivial example is segment (in)visibility. The condition for segment invisibility is that the raster data should be identically zero after the operation.

Furthermore clipping (see Fig. 4) is well suited to automatic testing of pixel maps. We draw a picture using a set of output primitives and attributes, with clipping set to ‘off’. The pixel map is stored, and the screen cleared. We repeat the process exactly, but with clipping this time set to ‘on’. The pixel map is again retrieved. The area outside the clipping rectangle must be cleared to the background colour, and the interior must match on a pixel-by-pixel basis with the interior of the original.

We have seen that many graphical operations, such as segment visibility, have a very simple effect on the pixel map of a raster device and so can be checked by automatic means. Our work has shown, however, that care is needed in this comparison process to allow for rounding errors, particularly in the discretisation step.

Figure 4. Clipping test.

The automatic testing of output primitives has not been evaluated completely. The attributes (e.g., line type) have no precise definition within the standard. Thus the definition of evaluation conditions can be very subjective. Additional work must be done to solve this problem. In the case of CGI devices which do not provide the function GET PIXEL ARRAY (allowed in CGI) we have to find an additional way of pixel readback, e.g., picture capture by a camera (camera input). To decide whether camera input can be used as pixel
readback depends on the question ‘Can we predict the camera frame buffer from a given graphical output generated by CGI?’. To answer this question we made some experiments with different sets of graphical output on a monochrome screen. The camera ‘sees’ the physical reality. In fact the pixels on a certain screen were not square: the x-size was smaller than the y-size. Therefore the horizontal lines seemed to be thicker than the vertical lines. The camera ‘sees’ this thicker line. This is also true for the visual tester. We did not investigate whether users accept these deficiencies.

Unfortunately an additional problem arose. As mentioned above, the CGI output was limited to colour indices 0 and 1. In contrast the camera device provided a range of 256 grey levels (0..255). The experiments showed that, depending on the adjustment of the camera, one pixel is ‘lighter’ than another pixel. That means the camera ‘sees’ a different light intensity depending on whether one device pixel hits exactly one camera pixel, the edge of two camera pixels or the corner of four camera pixels.

These are the kinds of problems which still have to be resolved. One major problem is to find the function which defines the mapping from one CGI device pixel to the camera pixels (different light intensities). The second major problem is the calibration of the camera (calibration phase). We have to analyse how the physical capabilities of the device under test (pixel aperture and size) have to be fed into our camera input pipeline. All in all, if we find feasible solutions (which can be implemented) we can apply the same automatic test methods described above (testing at the raster interface). Then conformance testing by automatic means will become more and more important and indeed will be included within the test suite of CGI testing.
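The clipping and segment invisibility conditions described earlier in this section lend themselves to a compact sketch. The Python functions below compare two pixel maps, assumed to have been read back with GET PIXEL ARRAY or a comparable mechanism. The `border` parameter, which skips pixels next to the clipping boundary, is an assumed way of allowing for rounding errors and is not taken from the test suite itself.

```python
def check_clipping(unclipped, clipped, rect, background=0, border=1):
    """Compare pixel maps (lists of rows) drawn with clipping off and on.

    rect = (xmin, ymin, xmax, ymax) is the clipping rectangle in pixel
    coordinates, inclusive. Pixels within `border` of the rectangle edge
    are not judged, to allow for rounding in the discretisation step.
    """
    xmin, ymin, xmax, ymax = rect
    for y, (row_u, row_c) in enumerate(zip(unclipped, clipped)):
        for x, (u, c) in enumerate(zip(row_u, row_c)):
            inside = (xmin + border <= x <= xmax - border
                      and ymin + border <= y <= ymax - border)
            outside = (x < xmin - border or x > xmax + border
                       or y < ymin - border or y > ymax + border)
            if inside and c != u:
                return False  # interior must match the original
            if outside and c != background:
                return False  # exterior must be cleared to background
    return True


def segment_invisible(pixel_map, background=0):
    """Segment invisibility: the raster data is identically background."""
    return all(value == background for row in pixel_map for value in row)


# An 8 x 8 example: a horizontal line at y = 4, clipped to x in 2..5.
unclipped = [[1 if y == 4 else 0 for x in range(8)] for y in range(8)]
clipped = [[1 if (y == 4 and 2 <= x <= 5) else 0 for x in range(8)]
           for y in range(8)]
result = check_clipping(unclipped, clipped, rect=(2, 2, 5, 5))
```

With `border=0` the comparison becomes strictly pixel-by-pixel, which is the literal condition stated in the text.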
4. Conclusions
This paper has described the test systems for testing GKS and CGI implementations. The GKS testing service is the first testing service in the area of graphics standards. Experience has been gained since the GKS testing service was established. This experience showed that conformance testing services are very useful for making implementations of graphics standards conform to the respective standard.

The developers of the CGI testing tools aimed to build on the experience gained in building and using the GKS testing tools. In fact, the CGI standard changed and functions were added (e.g., the inquiry functions) during the standardization process. Therefore the test specification had to be adapted to these changes. Not least, the fact that neither a language binding nor a data encoding has been standardized yet made (and still makes) matters worse.

We described the issues and possible solutions of automatic testing. In general, conformance testing of graphics standard software should be automated as much as possible, so that less subjective judgement is required. This could be realized if standards were more precise. The GKS and CGI standards leave some room for the implementors, e.g., there is no definition of line end styles. Thus the developers of automatic tests have to decide whether the generated test pictures are a “good” or a “bad” representation.
Finally, the experience of developing the automatic tests for CGI testing showed that those tests which check graphical operations (e.g., segment visibility, clipping) are “safe” candidates. The geometric tests of output primitives are critical (but realisable) and need further research. The use of a camera for accessing the pictures of an implementation under test requires a larger set of verification strategies, including image processing capabilities.
Acknowledgement
I want to thank all those who helped in the preparation of this report. In particular, I thank Alexander Bolloni, who spent a lot of time and work investigating the feasibility of automatic testing of CGI at the raster interface. Thanks also to Ann Roberts, Ken Brodlie and Roger Boyle, who spent many hours discussing and experimenting during my visit to the University of Leeds.
References
[Brod-89] Brodlie KW, Goebel M, Roberts A, Ziegler R. When is a Line a Line? Eurographics ‘89, Participants Edition. 1989; 427-438.
[BrRo-89] Brodlie KW, Roberts A. Visual Testing of CGI. Internal Report. School of Computer Science, University of Leeds, 1989.
[KiPf-90] Kirsch B, Pflueger C. Conformance Testing for Computer Graphics Standards. Internal Project Report (CTS2-CGI-072), 1990.
[ISO-85] ISO/DIS 7942. Graphical Kernel System (GKS). Functional Description. 1985.
[ISO-89] ISO/DIS 9636. Computer Graphics Interface (CGI). Functional Specification. 1989.
[ISO-90] ISO/SC24/WG5 N47. Conformance Testing of Implementations of Graphics Standards. DP text, 1990.
[Thom-84] Thompson K. Graphics certification at the European Community level. Computer Graphics & Applications 1984; 8(1): 59-61.
Databases and Documentation
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 35
A System for Creating Collections of Chemical Compounds Based on Structures
S. Bohanec¹, M. Tusar², L. Tusar², T. Ljubic¹, and J. Zupan¹
¹‘Boris Kidric’ Institute of Chemistry, Ljubljana and ²SRC Kemija, Ljubljana, Yugoslavia
Abstract
A system for creating collections of properties of chemical compounds based on their structures is described. This system enables the chemist to handle (characterize, save, retrieve, change, etc.) correlations between selected features and parts of chemical structures. At the present stage of development, the system can handle only spectral features. Using the structure editor as the central module of the system, the chemist can generate any structure and search for all features saved in the databases with respect to it. Some features in the structure-defined databases are already built into the system (13C NMR basic spectra collection, IR spectra collection). There are also substructure-defined correlation tables (a table of 13C NMR basic chemical shifts and increments defined by the neighborhood of an isotopic atom, similar tables for 1H NMR spectroscopy, etc.). The user can add new features to the system or delete existing ones. It is also possible to add, correct, or delete individual data items.
1. Introduction
The common goal of most chemical information systems can be described as a search for possible connections in the relation:

structure <-> features

A feature is any structure-dependent characteristic of the compound in question (physical, chemical, biological, pharmacological, ecological, etc.). Some examples of features are spectra, activities, melting points, boiling points or merely chemical names of compounds. Typically, a database is composed of structures of chemical compounds and corresponding features.
Figure 1. Structure-dependent databases can either have the structures attached to all records in each database (top row) or maintain only links to a central structure database (bottom row).
The system we are describing was designed as a general and efficient tool for helping the chemist in the process of solving structure-oriented problems. A very popular example of such a task is the identification and characterization of chemical compounds on the basis of different spectra [1,2].
The structure editor, which connects different databases with other components (information and/or expert system modules), is a central part of our system. Due to a unique representation of chemical structures handled by all parts of the system (structure editor, databases, information and/or expert system modules), the transfer of results and the connections between different parts or modules become easy and transparent for the user.
2. General concept
To extract a chemical structure from different features, or the properties of interest from chemical structures, we need different databases, a general system for building and maintaining these databases and providing links and communication between them, and finally a number of information and/or expert systems to obtain specific information.

The databases [3, 4] consist of uniformly composed records. In principle, each record should contain a chemical structure and the corresponding information (Fig. 1, top row), most frequently a spectrum (13C NMR, IR, mass, 1H NMR, etc.) or some other complex (multivariate) information (ecologically dangerous effects and properties, recipes, technological parameters, etc.). All such data (often called supplemental information) contained in different databases are linked with the central structure database and can be accessed through it (Fig. 1, bottom row).

The system for building and maintaining databases [5, 6], with the structure editor as its central part, is used for building new structures and for deleting, correcting, searching, and checking old structures, etc. The output of the structure editor is the connection table [1] of the handled structure, while other components handle the checking, handling, and updating of records with supplemental information. The system for building and maintaining the databases must perform a number of tasks completely hidden from (transparent to) the user, such as generation and update of inverted files, establishing links between different databases and keys, decomposition of structures into fragments, normalizing, base-line correction, smoothing, peak detection of spectra, etc.

The chemical information and/or expert systems [3, 7-17] enable the chemist to use various activities offered by the system combined with data pooled from the databases, and finally to combine partial results or resulting files obtained from different modules into complex information.
The information and expert systems may range from very simple data searches in various databases to complex simulations (spectra simulation) and complete or partial structure predictions, etc. It is evident that most of these systems must be connected to an efficient chemical structure manipulation system.
Figure 2. General scheme of our system. It is always possible to access the information and/or expert systems and the structure editor, and through the editor any collection of the structures in the databases. The modules are also accessible directly or from the structure editor.

The following information and expert systems are included in our system:
- UPGEN (for building, checking, correcting, and organizing databases, and for selecting compounds according to common structure characteristics or other properties),
- VODIK (for 1H NMR spectra simulation and for the supplementation of correlation tables),
- SIMULA (for 13C NMR spectra simulation),
- IRSYS (for managing the structure and IR spectra collection),
- ADI (for the supplementation of tables of 13C NMR chemical shifts),
- GENSTR (for building all possible structures from structure fragments generated with the structure editor or obtained from the CARBON or IRSYS systems).
Figure 3. Structure editor as seen on the PC monitor with the displayed p-bromo acetophenone structure, which was built using the commands RING, RBOND, CHAIN, ATOM, BOND. (Menu commands visible on the screen: ATOM, BRIDGE, BOND, CHAIN, ERASE, INSERT, RING, ROPEN, RBOND, NUMBERS, UNDO, NEW, SAVE, LOAD, RENAME, DELETE, SEARCH, EKS-SYS, FILE.)
3. System description
A general scheme of the whole system is represented in Figure 2. Every information and expert system (on the left) is accessible directly or via the structure editor, which is the central part of the whole system. The databases (on the right) can also be accessed via the structure editor or from the information and expert systems (on the left) that require data from the databases. The information and expert systems can be used sequentially one after another or in cycles. For example, the results from the first system are input data for the second one, etc. All logical operations (AND, OR, XOR, and NOT) can be applied to files, and the resulting files can be used again as input files of other systems. One specific application which has employed a number of different parts (modules) of our system is described in section 5.

As already mentioned, the structure editor is a central part of the whole system [5]. With simple commands (CHAIN, BOND, RING, etc.) understandable to the chemist, any chemical structure up to a certain size (in our case the limit is set to 60 non-hydrogen atoms) can be built (Fig. 3).
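The logical operations on result files can be pictured with Python sets of structure identification numbers. The hit lists and the collection size below are invented for illustration; in the described system the operands are files produced by the modules.

```python
# Hypothetical hit lists, e.g. from a SEARCH run and from a UPGEN selection.
hits_search = {12, 164, 300, 512}
hits_upgen = {164, 512, 777}

# All structure IDs in an assumed collection of 1000 structures,
# needed as the universe for NOT.
all_ids = set(range(1, 1001))

both = hits_search & hits_upgen         # AND
either = hits_search | hits_upgen       # OR
exactly_one = hits_search ^ hits_upgen  # XOR
not_in_search = all_ids - hits_search   # NOT (relative to the collection)
```

Each result is again a set of IDs, so it can serve as the input of a further operation or module, which is the chaining behaviour the text describes.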
Figure 4. Connection table of p-bromo acetophenone. Each row of the connection table contains the data of one (non-hydrogen) atom: identification number of the atom, chemical symbol of the atom, identification numbers of the first neighboring atoms and the types of bonds to the first neighboring atoms.
Transparently to the user, the connection table of any structure is maintained throughout the editing process (Fig. 4). As a matter of fact, all structures in the database are handled in this form throughout the entire system. With the commands SAVE, LOAD, RENAME and DELETE the structures can be saved to or loaded from temporary files, and the temporary files can be renamed and/or deleted. The structure that is currently active in the structure editor can be used in three different ways: first, searched for (SEARCH command) in the central collection of structures or in any partial one that was previously generated as output of another module (SEARCH for spectra, for example); second, used as an input for an expert or information system (EKS-SYS command); or, third, written to a permanent file (FILE command).

A result of the SEARCH (with a complete structure or a substructure) in the collection is the list of the identification numbers of structures that match the query structure. If the sought structure is a substructure, the SEARCH will yield all appearances of it in any structure, which means that there can be more than one hit for a single reference structure. Atom-to-atom connections between the query and reference structure are given for all hits, which makes SEARCH a good tool for studying the symmetry of compounds.
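A connection table of the kind described here can be sketched as a Python dictionary. The atom numbering below is chosen for this example and is not copied from Figure 4; the encoding of bond types as integers (1 single, 2 double) and the Kekule form of the ring are likewise assumptions.

```python
# atom id -> (chemical symbol, [(neighbour id, bond order), ...])
# p-bromo acetophenone, hydrogen atoms omitted.
connection_table = {
    1: ("C", [(2, 2), (6, 1), (7, 1)]),   # ring carbon bearing the acetyl group
    2: ("C", [(1, 2), (3, 1)]),
    3: ("C", [(2, 1), (4, 2)]),
    4: ("C", [(3, 2), (5, 1), (10, 1)]),  # ring carbon bearing bromine
    5: ("C", [(4, 1), (6, 2)]),
    6: ("C", [(5, 2), (1, 1)]),
    7: ("C", [(1, 1), (8, 2), (9, 1)]),   # carbonyl carbon
    8: ("O", [(7, 2)]),
    9: ("C", [(7, 1)]),                   # methyl carbon
    10: ("Br", [(4, 1)]),
}


def is_consistent(table):
    """Every bond must be listed from both ends with the same bond order."""
    for atom, (_, neighbours) in table.items():
        for other, order in neighbours:
            if other not in table or (atom, order) not in table[other][1]:
                return False
    return True


def heavy_atom_counts(table):
    """Count the non-hydrogen atoms per element."""
    counts = {}
    for symbol, _ in table.values():
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts
```

A check such as `is_consistent` is the sort of bookkeeping an editor would have to do after every command to keep the table valid during editing.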
In our scheme the expert systems are used for: spectra simulation (SIMULA and VODIK for the simulation of 13C NMR and 1H NMR spectra), generation of possible structures from some substructures (GENSTR), decomposition of structures into atomic centered fragments, and classification of structures according to common fragments (decomposition and classification are described in the next section as a part of the UPGEN system).

The edited chemical structures can be down-loaded to files. The structures in these files can be accessed sequentially or directly. In the first type of file the connection tables of structures are saved one after another as alphanumeric records. This form is suitable for structure transfer, particularly between personal computers. Direct access files and inverted files [1], with structures classified according to common structural characteristics, are more suitable for versatile processing of structural data. The inverted direct access files enable fast searching through large collections of structures, especially in the case of substructure searching, and fast access to partial supplemental information associated with only parts of structures (fragments), etc.
4. Database improvements
While a chemical system is being used in the quest for various information, data that are inadequate (misleading, faulty, incomplete, completely wrong, duplicates, etc.) can be found, or at least suspected, at any stage. The UPGEN system enables the user to handle such cases and to maintain the database. In order to keep the databases in an adequate state, the database manager (this can be any user if our system is implemented on a PC) has direct access to any database to perform one of the following actions:

- adding new data,
- deleting data,
- correcting old data,
- organizing (classifying) the whole database,
- deleting the whole database and preparing empty files for a new database.
After every correction or input of a new chemical structure the system automatically checks whether this new structure already exists in the collection. If it does, the user can choose between abandoning the update and incorporating the new structure (and supplemental information, if any) into the collection. Chemical structures entered using the module UPGEN are decomposed into atomic centered fragments and classified according to different characteristics (heteroatoms, bonds, topology, etc.). The identification number of the structure is written to records of the inverted file as shown in Figure 5. The procedure by which the structures are decomposed into fragments and stored in the inverted file is as follows:
Figure 5. An example of the decomposition of a chemical structure (ID = 164) into atomic centered fragments and of writing the identification number to different records in the inverted file of fragments. The code representing each fragment is calculated from the bit-mapped pattern of atomic centered fragments [1]. From each code the position of the corresponding record in the inverted file is determined by a hash algorithm [1].
- decomposition of the structures into atomic centered fragments,
- coding each fragment by formation of 64-bit mapped patterns [1],
- obtaining one number for each fragment from the bit-mapped patterns by the XOR function,
- calculation of a proper hash address for each number representing a fragment,
- saving the structure’s ID number to that address in the inverted file.
Each ID number of a chemical structure is stored in as many records of the inverted file as there were different atomic centered fragments found in that structure.
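The steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the bit layout in `bit_pattern`, the XOR folding of the two 32-bit halves, the modulo hash and the table size are all hypothetical stand-ins for the algorithms of [1], and hash collisions are ignored.

```python
from collections import defaultdict

TABLE_SIZE = 7919  # assumed number of records in the inverted file
ELEMENT_BITS = {"C": 0, "O": 1, "N": 2, "Br": 3}  # assumed element codes


def fragments(table):
    """One atom-centred fragment per atom: the centre symbol plus the
    sorted (neighbour symbol, bond order) pairs of its first neighbours."""
    for symbol, neighbours in table.values():
        yield symbol, tuple(sorted((table[n][0], order)
                                   for n, order in neighbours))


def bit_pattern(fragment):
    """Pack a fragment into a 64-bit pattern (illustrative layout only)."""
    centre, neighbours = fragment
    pattern = 1 << ELEMENT_BITS[centre]
    for slot, (symbol, order) in enumerate(neighbours):
        field = (ELEMENT_BITS[symbol] << 2) | order
        pattern |= field << (8 + 8 * slot)
    return pattern & 0xFFFFFFFFFFFFFFFF


def fragment_code(pattern):
    """Obtain one number per fragment by XOR of the two 32-bit halves."""
    return (pattern >> 32) ^ (pattern & 0xFFFFFFFF)


def hash_address(code):
    """Map a fragment code to a record position in the inverted file."""
    return code % TABLE_SIZE


def add_structure(inverted_file, structure_id, table):
    """Write the structure ID to the record of each different fragment."""
    for fragment in set(fragments(table)):
        address = hash_address(fragment_code(bit_pattern(fragment)))
        inverted_file[address].add(structure_id)


# Two small connection tables: atom id -> (symbol, [(neighbour, bond order)]).
ethanol = {1: ("C", [(2, 1)]), 2: ("C", [(1, 1), (3, 1)]), 3: ("O", [(2, 1)])}
methanol = {1: ("C", [(2, 1)]), 2: ("O", [(1, 1)])}

inverted_file = defaultdict(set)
add_structure(inverted_file, 164, ethanol)
add_structure(inverted_file, 165, methanol)
```

Looking up the record of a query fragment then returns every structure ID that contains it, which is what makes substructure screening fast.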
5. An application of our system
Finally, we would like to describe a problem that was solved in our laboratory using the system discussed. During work on the 13C NMR spectroscopy of furan derivatives it was ascertained that the system did not contain enough data for any of these compounds (neither complete spectra nor corrections of chemical shifts due to the neighborhoods of the observed atoms) for a correct 13C NMR spectra simulation.
Figure 6. Simulated spectrum of 2-methyl furan. Incomplete tables were used, with missing increments for some substituents of furan. The number of missing data is expressed by STATUS.
The case started when the structure of 2-methyl furan was built with the structure editor. It was then established that there was no such structure in the collection of assigned 13C NMR spectra (the SEARCH module was used) and that the simulated spectrum (the SIMULA system was used) was not correct due to inadequate data in the tables of increments (Fig. 6).

In any structure, the chemical shift of an isotopic carbon atom is simulated by adding to the basic chemical shift (the standard chemical shift $A_z$ in equation (1)) [17] the increments produced by all substituents. These increments depend on the type of substituent, the presence of other substituents, and the relative position of the substituent with respect to the isotopic atom:

$$D_i = A_z + \sum_k B_{kj} \qquad (1)$$

$D_i$ is the chemical shift of the $i$th carbon atom, $A_z$ is the basic chemical shift for functional group $z$, and $\sum_k B_{kj}$ is the sum of increments due to the substituents. The system can recognize 150 different substituents [17, 22] and then determines the corresponding increment for each such substituent at distance $j$ ($\alpha$, $\beta$, $\gamma$ or $\delta$) from the isotopic carbon atom $i$.
TABLE 1
Increments B_kj (k is the position of the substituent and j is the position of the isotopic carbon i in the furan ring) for furan rings with substituents in position 2 or 5.

Substituent      Increments (ppm)
                 B22=B55   B23=B54   B24=B53   B25=B52
C(sp3)             9.2      -2.8       0.7      -1.0
-C=C-              7.6       6.8       3.0       2.9
-CHO              10.8      11.7       3.0       5.7
-CO-               6.8      10.1       2.4       4.8
-CO-O-CH3          1.8       8.0       2.0       3.4
The database of chemical shifts and increments used in SIMULA was taken from the literature [18-20] and at present contains about 40,000 values. Nevertheless, adequate shifts and increments for furan derivatives were not at hand. In our case two types of data were missing for the simulation of 13C NMR chemical shifts. The first type represents the influence of furan on the chemical shift of a methyl group substituted in position 2 of the furan ring, and the second type represents the influence of the methyl group, as a substituent of the furan ring in the same position, on the chemical shifts of all carbon atoms in this ring (for the α, β, γ and δ positions [17]). In this case only the first of the two possible positions in furan (2 or 5 and 3 or 4) was of interest to us.

The missing increments were determined with the ADI system and by generation of a small specialized database (the UPGEN system was used). All compounds with a furan ring were extracted from the existing collection and from the literature [21] (the SEARCH and UPGEN systems were used). A new small collection of 10 assigned spectra was used as input to the ADI system. It is necessary to emphasize that only assigned spectra should be included in such a collection, otherwise new increments cannot be determined. In the first run of the ADI program new increments can be calculated only from the cases where exactly one increment is missing in the simulation of one shift (STATUS = 1). In the second and subsequent runs, however, after the increments obtained from previous runs have been entered into the tables, the other increments can be determined as well. To be precise, the increments determined in the described process are written to a special temporary file. Only after the simulation has been checked with a number of cases are the new increments entered permanently into the tables with a special command. The increments for some substituents on furans obtained with the described procedure are given in Table 1.
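The run-by-run determination of increments can be sketched as a small fixed-point loop. The function below is an assumed reading of the procedure, not the ADI program itself: averaging several estimates of the same increment is our own choice, and the example shifts and increment keys are invented for illustration.

```python
def determine_increments(assigned_shifts, known):
    """Determine missing increments from assigned shifts, run by run.

    assigned_shifts: list of (observed_shift, basic_shift, [increment keys]).
    known: dict mapping increment key -> value in ppm.
    Returns a dict with the newly determined increments only.
    """
    known = dict(known)  # do not modify the caller's tables
    new = {}
    changed = True
    while changed:  # one pass corresponds to one run of the program
        changed = False
        estimates = {}
        for observed, basic, keys in assigned_shifts:
            missing = [k for k in keys if k not in known]
            if len(missing) == 1:  # STATUS = 1: exactly one increment missing
                rest = sum(known[k] for k in keys if k in known)
                estimates.setdefault(missing[0], []).append(
                    observed - basic - rest)
        for key, values in estimates.items():
            # Assumption: several estimates of one increment are averaged.
            known[key] = new[key] = sum(values) / len(values)
            changed = True
    return new


# Invented toy data: increment "a" can be found in the first run,
# increment "b" only in the second run, once "a" is known.
toy_shifts = [
    (152.0, 143.0, ["a"]),
    (160.0, 143.0, ["a", "b"]),
]
found = determine_increments(toy_shifts, known={})
```

Returning the new values separately from the permanent tables mirrors the temporary file that is only merged after the simulation has been checked.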
These data relate only to substituents in position 2 or 5 of the furan ring. Besides the increments for the methyl group, a complete set of data for four additional substituents was obtained.
Figure 7. Simulated spectrum of 2-methyl furan obtained with improved tables of chemical shifts. Experimental chemical shifts for this compound are: D(C-2) = 152.0 ppm, D(C-3) = 105.7 ppm, D(C-4) = 110.5 ppm, D(C-5) = 141.0 ppm, D(C-6) = 12.9 ppm.
The basic chemical shifts, A_z, in positions 2 or 5, A(C-2,5), and 3 or 4, A(C-3,4), of the furan ring are 143.0 and 109.9 ppm, respectively [18]. With the same procedure the ADI system was able to determine the increment to the chemical shift of the methyl group due to the furan ring in the α position. The increment is 15.0 ppm to the basic chemical shift of -2.3 ppm for alkanes.

With all necessary increments (B_kj for furans and alkanes) determined in due process by the ADI system, the 13C NMR spectrum of 2-methyl furan was simulated again. The simulated values (Fig. 7) are very similar to the real values, which were in the meantime obtained from the literature [20]. The denotation of atoms in Figure 7 corresponds to that in Figure 6. The differences between the simulated and corresponding experimental chemical shifts are small and amount to 0.2, 1.4, 0.1, 1.0, and 0.2 ppm for the positions C-2, C-3, C-4, C-5, and C-6, respectively, the standard deviation of a difference being 0.5 ppm.

As shown in the above example, using the module ADI the table of increments can be successfully supplemented for any [22] functional group of the user’s choice. However, besides the ADI system, a representative collection of assigned 13C NMR spectra of structures (compounds) containing the functional group is mandatory.
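The additive scheme of equation (1) can be followed by hand for 2-methyl furan. The sketch below uses only the basic shifts and increments quoted in the text and in the first row of Table 1; the dictionary names are our own. With the increment B24 = 0.7 ppm as it appears in Table 1, the C-4 value comes out as 110.6 ppm.

```python
# Basic chemical shifts A_z in ppm, as quoted in the text.
BASIC = {"furan C-2/5": 143.0, "furan C-3/4": 109.9, "alkane C": -2.3}

# Increments B_2j in ppm for an sp3 carbon substituent in position 2
# (first row of Table 1), keyed by the ring position j.
METHYL_ON_RING = {2: 9.2, 3: -2.8, 4: 0.7, 5: -1.0}

# Increment of the furan ring on the methyl carbon (alpha position).
RING_ON_METHYL = 15.0

# Experimental shifts in ppm, as quoted in the caption of Figure 7.
EXPERIMENTAL = {"C-2": 152.0, "C-3": 105.7, "C-4": 110.5,
                "C-5": 141.0, "C-6": 12.9}


def simulate_2_methyl_furan():
    """Apply D_i = A_z + sum of increments to every carbon atom."""
    shifts = {}
    for position in (2, 3, 4, 5):
        key = "furan C-2/5" if position in (2, 5) else "furan C-3/4"
        shifts[f"C-{position}"] = round(
            BASIC[key] + METHYL_ON_RING[position], 1)
    shifts["C-6"] = round(BASIC["alkane C"] + RING_ON_METHYL, 1)
    return shifts


simulated = simulate_2_methyl_furan()
differences = {atom: round(abs(simulated[atom] - EXPERIMENTAL[atom]), 1)
               for atom in simulated}
```

Each carbon here carries exactly one substituent increment, so the sum in equation (1) reduces to a single term per atom.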
6. Conclusion
We hope that the approach explained here, a combination of databases with information and/or expert systems in which the links are provided by a set of powerful structure handling tools, has been shown convincingly. Without a flexible structure handling capability no database, expert system, or knowledge base can be fully exploited. The extension of this work is aimed towards a system similar to UPGEN but with much more power in dealing with general types of databases containing (besides the chemical structures) arbitrary other data, enabling cross-links between the structural data and different textual, numeric, spectral and other types of databases.

The presented system is implemented and runs on IBM PC/XT/AT/PS or compatible computers under a VGA/EGA/Hercules graphics environment. In part (infrared spectra and chemical structures) the described system is additionally implemented on the Institute’s MicroVAX system under the VMS OS and can be accessed via JUPAK (the official Yugoslav data communication net) at no charge. For the access procedure and arrangements contact the authors.
References
1. Zupan J. Algorithms for Chemists. Chichester: John Wiley & Sons, 1989.
2. Gray NAB. Computer-Assisted Structure Elucidation. New South Wales: John Wiley & Sons, 1986.
3. Zupan J, Ed. Computer-Supported Spectroscopic Databases. Chichester: Ellis Horwood Int., 1986.
4. Bremser W, Ernst L, Franke B, Gerhards R, Hardt A. Carbon-13 NMR Spectral Data. Weinheim: Verlag Chemie, 3rd ed., 1981.
5. Zupan J, Bohanec S. Creation and Use of Chemical Data Bases with Substructure Search Capability. Vestn Slov Kem Drust 1987; 34(1): 71-81.
6. Zupan J, Razinger M, Bohanec S, Novic M, Tusar M, Lah L. Building Knowledge into an Expert System. Chem Intell Lab Syst 1988; 4: 307-314.
7. Zupan J, Novic M, Bohanec S, Razinger M, Lah L, Tusar M, Kosir I. Expert System for Solving Problems in Carbon-13 Nuclear Magnetic Resonance Spectroscopy. Anal Chim Acta 1987; 200: 333-345.
8. Lindsay RK, Buchanan BG, Feigenbaum EA, Lederberg J. Applications of Artificial Intelligence for Organic Chemistry: The Dendral Project. McGraw-Hill, New York, 1980.
9. Picchiottino R, Sicouri G, Dubois JE. DARC-SYNOPSYS Expert System. Production Rules in Organic Chemistry and Application to Synthesis Design. In: Hippe Z, Dubois JE, Eds. Computer Science and Data Bank. Polish Academy of Sciences, Warsaw, 1984.
10. Milne GWA, Fisk CL, Heller SR, Potenzone R. Environmental Uses of the NIH-EPA Chemical Information System. Science 1982; 215: 371.
11. Milne GWA, Heller SR. NIH-EPA Chemical Information System. J Chem Inf Comput Sci 1980; 20: 204.
12. Zupan J, Penca M, Razinger M, Barlic B, Hadzi D. KISIK, Combined Chemical Information System for a Minicomputer. Anal Chim Acta 1980; 112: 103.
13. Sasaki S, Abe H, Hirota Y, Ishida Y, Kuda Y, Ochiai S, Saito K, Yamasaki K. CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds. J Chem Inf Comput Sci 1978; 18: 211.
14. Oshima T, Ishida Y, Saito K, Sasaki S. CHEMICS-UBE, A Modified System of CHEMICS. Anal Chim Acta 1980; 122: 95.
15. Shelley CA, Munk ME. CASE, Computer Model of the Structure Elucidation Process. Anal Chim Acta 1981; 133: 507.
16. Robien W. Computer-Assisted Structure Elucidation of Organic Compounds III: Automatic Fragment Generation from 13C-NMR Spectra. Mikrochim Acta, Wien, 1987; 1986-II: 271-279.
17. Lah L, Tusar M, Zupan J. Simulation of 13C NMR Spectra. Tetrahedron Computer Methodology 1989; 2(2): 5-15.
18. Pretsch E, Clerc JT, Seibl J, Simon W. Tabellen zur Strukturaufklärung organischer Verbindungen mit spektroskopischen Methoden. Berlin: Springer-Verlag, 1976.
19. Brown DW. A Short Set of C-13 NMR Correlation Tables. J Chem Education 1985; 62(3): 209-212.
20. Stothers JB. Carbon-13 NMR Spectroscopy. New York and London: Academic Press, 1972.
21. Johnson LF, Jankowski CW. Carbon-13 NMR Spectra, A Collection of Assigned, Coded, and Indexed Spectra. Wiley & Sons, 1972.
22. With the described simulation of 13C-NMR spectra, 65 different functional groups with their associated basic chemical shifts were determined. In the spectra simulation of [17] only 20 different functional groups were determined.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990
© 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 36
TICA: A Program for the Extraction of Analytical Chemical Information from Texts
G.J. Postma, B. van der Linden, J.R.M. Smits, and G. Kateman
Department of Analytical Chemistry, University of Nijmegen, Toernooiveld, 6525 ED Nijmegen, The Netherlands
Abstract
A program for the extraction of factual and methodological information from abstract-like texts on analytical chemical methods is described. The system consists of a parser/interpreter and a frame-based reasoning system. The current domain is inorganic redox titrimetry. Some results are given. Possible sources of information on analytical chemical methods are discussed.
1. Introduction
Within Analytical Chemistry the analytical instruments are being equipped with computers. These computers often have knowledge and reasoning systems connected to the techniques in which the instrument is used. They can assist in the development of the analytical method and can perform the actual management of the analytical procedure [1-3]. In the future analytical instruments are expected to become automatic analytical units for which the only information needed is the analyte and information on the sample to be analysed.
For the development of an analytical procedure usually the literature (or some in-house method database) is checked first, and if a suitable method is found an expert system is consulted for the fine-tuning or modification of the procedure to the situation at hand. If there is no directly applicable method, an expert system can be used for the development of a method. Such an expert system has to be continuously updated with new knowledge. For this task, besides a human expert, literature data could also be used in combination with some learning expert system.
Analytical literature databases that are directly accessible and usable by computer hardly exist. Information on analytical methods has to be manually extracted from the literature. In the secondary literature, such as Chemical Abstracts and Analytical Abstracts [4, 5], much analytically interesting information is also not directly accessible. In Analytical Abstracts (the online database) there are a number of search fields that determine (index) the function of their contents. For instance, the field ANALYTE determines that the chemical in that field is used in the corresponding article as analyte (has the role of analyte). There are also fields like CONCEPT and MATRIX, but for most chemicals, data, equipment, etc., the role is not directly searchable and accessible. The procedural information must be extracted by human effort, too. Still, this information exists in the abstract or can be inferred from the abstract. Chemical Abstracts lacks even these search fields, which are interesting from an analytical chemical point of view.
The text analysis system outlined in this article aims at automating the extraction of factual and procedural analytical method information from texts. The information contained in descriptions of analytical methods can be subdivided into factual and procedural information. Factual information is data on, e.g., the analyte, the working range of the procedure, the accuracy of the procedure, the composition of the reagents, etc. Procedural information entails all the actions that have to be performed, including information on the roles of the chemicals, solutions, instruments, etc. that participate in the actions and information on the circumstances under which the actions take place.
This information can be extracted from text by means of Natural Language Processing techniques. Text analysis consists of morphological, syntactic, semantic and discourse analysis (Fig. 1). Morphological analysis deals with the structure of words, their inflections and how lexemes can be derived. Syntactic analysis deals with the relative ordering of words within sentences and sentence elements in terms of their syntactic classes. Semantic analysis produces information on the meaning and function of the various sentence elements and their relationships. After the semantic analysis of sentences some kind of semantic representation is produced in which the meaning of the individual sentences is represented as fully as possible. The semantic representations of the sentences serve as input for the discourse analysis. During the discourse analysis various kinds of intra- and inter-sentential references and ambiguities are resolved, and some kind of discourse representation is produced by comparing the input with background knowledge on the domain of the subject of the text.
Figure 1. Natural language text processing parts: text analysis comprises morphological, syntactic, semantic and discourse analysis.
2. The program TICA
The program TICA consists of two parts. The first part performs the sentence analysis; the second part performs the discourse analysis and extracts the information. For sentence analysis we have chosen the method introduced by Riesbeck and Schank [6] and Gershman [7]. The morphological, syntactic and semantic analyses are performed concurrently. Initially we chose a semantic representation close to that of Schank (the Conceptual Dependency theory) [8]. His theory uses a limited set of types of concepts (Actions, Picture Producers, Properties and Relations), and these types of concepts are subdivided into a limited set of members. All the concepts that appear in a sentence are represented by means of these basic concepts. This representation proved to be too abstract and distant from the actual meaning and use of the sentences and was difficult to handle. For this reason a case-based representation is currently used and implemented (see Fillmore [9] for the original ideas and, e.g., Nishida [10] for an adapted and extended set of cases). Cases are relations mainly between the main verb of a sentence or clause and the other elements of the sentence or clause. These relations represent the semantic function that the various elements fulfil in relation to the main verb. These relations can be the Actor of the action represented by the verb, the Object, the Location, the Manner, the Instrument, the Purpose, the Goal or product, etc. These case relations are in most cases linked to the syntactic functions or leading prepositions of the various sentence elements and are determined by the semantic class of the main verb and by that of the main part (noun or verb) of the sentence element. The verbs are not represented by means of a limited set of primitive Actions but are used as they are, sometimes replaced by a synonym. The semantic representation is stored in frames. After the sentence analysis the discourse analysis is performed.
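The case-based representation can be made concrete with a small sketch. It is written in Python for illustration only (TICA itself is written in Prolog), and the preposition table, the class and the function names are assumptions, not the rules used by the program, which also consult the semantic classes of the verb and of the sentence element.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from leading prepositions to case relations.
PREPOSITION_CASES = {"with": "instrument", "in": "location", "for": "purpose"}


@dataclass
class CaseFrame:
    """Frame for one clause: the main verb plus its case relations."""
    verb: str
    cases: dict = field(default_factory=dict)

    def add(self, case, filler):
        self.cases[case] = filler


def frame_for_passive(subject, verb, prepositional_phrases):
    """Build a frame for a passive clause: the subject fills the Object case."""
    frame = CaseFrame(verb)
    frame.add("object", subject)
    for preposition, noun_phrase in prepositional_phrases:
        case = PREPOSITION_CASES.get(preposition, "unknown")
        frame.add(case, noun_phrase)
    return frame


# "The unreacted ferrocyanate is titrated with ascorbic acid."
frame = frame_for_passive(
    "unreacted ferrocyanate", "titrate", [("with", "ascorbic acid")]
)
print(frame)
```

The verb is kept as it is ("titrate") rather than being reduced to a primitive action, which mirrors the design choice described above.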
For the discourse analysis a relatively simple 'script' approach (Schank [11, 12] and Cullingford [13, 14]) is implemented in a frame-based reasoning system for the description of the background knowledge. The principle of this approach is that texts frequently describe a story in which the sentences and sentence parts describe a sequence of events and states. These events are ordered and this order can be captured in a script. Within a script about a certain story there are a number of possible routes or tracks describing different event sequences which lead from the start to the end of the story. A script can furthermore be subdivided into small units of related events and states, called scenes or episodes. A text about an analytical method can also be captured within a script, e.g., titration. A titration consists of a limited number of analytical actions, but the existence and order of these actions can differ along different routes such as a direct titration and a backtitration.
The discourse analysis part of the program takes care of the reference resolution and uses the script information about the domain to determine the function and meaning of the sentences within the text. After this all relevant information is extracted. The relevant information is determined by marking all relevant concepts in the knowledge base. The program is written in Prolog. More information about the program and the semantic representation can be found in reference 15.
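How a script with several tracks can be used to classify a text may be sketched as follows. The event names and the matching rule (observed events must occur in a track in the same order) are illustrative assumptions; the actual script knowledge of TICA is held in a frame-based Prolog system.

```python
# Hypothetical titration script with two tracks; event names are illustrative.
TITRATION_SCRIPT = {
    "direct titration": ["prepare sample", "titrate analyte", "detect end point"],
    "back titration": [
        "prepare sample",
        "add excess reagent",
        "titrate unreacted reagent",
        "detect end point",
    ],
}


def is_subsequence(observed, track):
    """True if the observed events occur in the track in the same order."""
    remaining = iter(track)
    # 'event in remaining' advances the iterator, so order is enforced.
    return all(event in remaining for event in observed)


def matching_tracks(observed, script):
    """Return the names of all tracks consistent with the observed events."""
    return [name for name, track in script.items() if is_subsequence(observed, track)]


observed_events = ["add excess reagent", "titrate unreacted reagent"]
print(matching_tracks(observed_events, TITRATION_SCRIPT))
```

Events that are not mentioned in a text are simply skipped by the matcher, which reflects the fact that abstracts rarely state every step of a procedure.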
3. Results and discussion
The current program is capable of analysing short abstract-like texts within the domain of redox titrimetry. An example of such a text is:
The determination of iodine.
1. Samples containing 62-254 mg iodine are reduced with an excess of 0.1 N potassium ferrocyanate.
2. The ferrocyanate is oxidized by the iodine to the ferricyanate.
3. The unreacted ferrocyanate is titrated with ascorbic acid.
4. The titration is carried out in a solution buffered with bicarbonate.
5. The indicator is 2-hydroxyvariamin blue.
6. A solution of it is prepared by mixing 1 g of 2-hydroxyvariamin blue with 500 ml sodium chloride solution.
7. A portion of this mixture weighing 0.3-0.9 g is used for each titration.
8. The standard deviation is 0.11 %.
Some of the questions that are resolved by the program are:
- What is the analyte?
- What is the function of the second sentence?
- What is the type of titration?
- What is the titrant?
- To what does "it" in sentence 6 refer?
- To what does "this mixture" in sentence 7 refer?
The extracted information is:
- Analyte: I2
- Working-range: 62-254 mg
- Method: backtitration
- Titrant: ascorbic acid
- Reagent: K4Fe(CN)6
- Reagent-concentration: 0.1 N
- Indication-method: indicator
- Indicator: 2-hydroxyvariamin blue
- etc.
Besides factual information, procedural information such as the preparation of the indicator solution can also be extracted. This procedural information can be represented using a recursive frame representation consisting of: frame, attribute, value. In this representation 'attribute' stands for some property or case and 'value' is the value of the property or case and can be a frame itself. If in the above example the preparation of the indicator consisted of the mentioned mixing followed by some filtration, this could be represented in the simplified nested list form of Figure 2.
[Preparation,
  [object, ['Indicator solution']],
  [method,
    [mix,
      [object, ['2-hydroxyvariamin blue' ...]],
      [applied, [solution, [has-part, ['sodium chloride' ...]]]],
      [output, ['mix output 1']],
      [followed-by,
        [filter,
          [object, ['mix output 1']],
          [output, ['indicator solution']]]]]]].
Figure 2. An example of procedural information represented in a nested list form.
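The recursive frame, attribute, value structure of Figure 2 maps naturally onto nested dictionaries. The following Python sketch is only an illustration of that idea (the program itself uses Prolog lists); the key "frame" and the helper function are assumptions, and the parts elided with "..." in the figure are left out.

```python
# Figure 2 re-expressed as nested Python dicts; attribute names follow the figure.
preparation = {
    "frame": "preparation",
    "object": "indicator solution",
    "method": {
        "frame": "mix",
        "object": "2-hydroxyvariamin blue",
        "applied": {"frame": "solution", "has-part": "sodium chloride"},
        "output": "mix output 1",
        "followed-by": {
            "frame": "filter",
            "object": "mix output 1",
            "output": "indicator solution",
        },
    },
}


def action_sequence(frame):
    """Follow the 'followed-by' attribute and return (action, object, output) steps."""
    steps = []
    while frame is not None:
        steps.append((frame["frame"], frame["object"], frame["output"]))
        frame = frame.get("followed-by")
    return steps


for action, obj, output in action_sequence(preparation["method"]):
    print(f"{action}: {obj} -> {output}")
```

Because a value may itself be a frame, the chain of actions can be recovered by a simple traversal, and the output of one step ('mix output 1') reappears as the object of the next.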
TABLE 1
Result of an analysis of 40 abstracts from Analytical Abstracts.

Type of information                           % (n = 40)
working range                                 48
detection limit                               37
conditions analyte (matrix, interferents)     66
complete sample pretreatment (*)              65
main reagent (*)                              98
figures on main reagent (*)                   73
complete method description                   85
method performance figures                    52

* means: relative to those abstracts that contain the type of information.
The procedural information can be stored in relational tables such as described by Nishida et al. [10, 16] or represented along the method used by the TOSAR system for organic synthesis representation and storage (Fugmann et al. [17]), extended by specific case information of the participants of each action (reaction, process, etc.). If the information is to be used by an expert system linked to, e.g., a robot and/or analytical instrument, the information could be transferred directly in the form of frames.
There are different sources of text on analytical methods. The current program is being developed for abstracts. One of the drawbacks of abstracts is that they are not complete; this follows, of course, from the nature of abstracts. But even when a basic set of types of method description data is selected, these data are frequently not present. The results of an investigation of 40 abstracts from Analytical Abstracts are presented in Table 1. The abstracts are taken from the end of 1988 and the start of 1989. The percentages of 'complete descriptions of sample pretreatment' and 'complete analytical method description' are rough: the completeness is only evaluated using general knowledge on the analytical techniques and not by comparing the abstracts with the articles themselves. The category 'main reagent' includes reagents for the production of coloured compounds which are measured, eluents for chromatography and titrants. In the category 'complete method description' the main reagent (if it exists) is not included in the evaluation. The division of method description data into the presented types can of course be improved, but the incorporation of the most important information about the applied or developed analytical methods in abstracts facilitates better access to these methods. This study will be continued.
The predominant source of information is the article itself.
When the same list of types of method description data is used, a manual pilot study on 6 randomly selected articles describing one or more analytical methods from 5 different frequently used journals on Analytical Chemistry shows that for none of the articles can all the information be found in the Material and Method section. In most cases the complete article must be analysed in order to obtain all relevant information. This was also observed in three randomly chosen recent articles of the Journal of the Association of Official Analytical Chemists. Although it is even possible to extract information from graphs [18], the fully automatic extraction of information from complete articles is seen as troublesome at the moment. Perhaps for a number of articles reasonable results can be obtained by combining text analysis techniques for the Material and Method section with a keyword search on relevant factual information (such as method statistics) and textual analysis of the environment of the keywords found. The situation would be improved if all relevant method information (also) appeared in one closed section. Another source of information is Official Methods of Analysis [19]. A drawback of this source is that only a small number of the published methods are included in this volume (after extensive testing and, if necessary, modification) and that the methods are not recent (because of the evaluation procedures).
4. Conclusions
It is possible to extract information from short texts on a subdomain of Analytical Chemistry with the methods presented. Further work will be undertaken to incorporate other fields of analytical chemistry. The extraction of all relevant method information from articles will be difficult because the information is spread all over the article.
References
1. Goulder D, Blaffert T, Blokland A, et al. Expert Systems for Chemical Analysis (ESPRIT Project 1570). Chromatographia 1988; 26: 237-243.
2. van Leeuwen JA, Buydens LMC, Vandeginste BGM, Kateman G. Expert Systems in Chemical Analysis. Trends in Analytical Chemistry 1990; 9: 49-54.
3. Isenhour TL, Eckert SE, Marshall JC. Intelligent Robots - The Next Step in Laboratory Automation. Analytical Chemistry 1989; 61: 805A-814A.
4. The American Chemical Society. Chemical Abstracts, Chemical Abstracts Service, Columbus, USA.
5. Analytical Abstracts, the Royal Society of Chemistry, Letchworth, Herts, England.
6. Riesbeck CK, Schank RC. Comprehension by Computer. Technical Report 78, Yale University, New Haven, 1976.
7. Gershman AV. Knowledge-based Parsing. Research Report 156. Department of Computer Science, Yale University, New Haven, 1979.
8. Schank RC. Conceptual Information Processing. Fundamental Studies in Computer Science, volume 3. Amsterdam: North-Holland Publishing Company, 1975.
9. Fillmore C. Some problems for case grammar. In: O'Brien RJ, Ed. Report of the twenty-second annual round table meeting on linguistics and language studies. Monograph Series on Languages and Linguistics, no. 24. Georgetown University Press, Washington, DC, 1971: 35-56.
10. Nishida F, Takamatsu S. Structured-information extraction from patent-claim sentences. Information Processing & Management 1982; 18: 1-13.
11. Schank RC. SAM, a story understander. Research Report 43. Department of Computer Science, Yale University, New Haven, 1975.
12. Schank RC, Abelson RP. Scripts, Plans, Goals and Understanding. An Inquiry into Human Knowledge Structures. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1977.
13. Cullingford R. Script Applications. Computer Understanding of Newspaper Stories. Technical Report 116, Yale University, New Haven, 1978.
14. Cullingford R. SAM. In: Schank RC, Riesbeck CK, eds. Inside Computer Understanding: Five Programs Plus Miniatures. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1981.
15. Postma GJ, van der Linden B, Smits JRM, Kateman G. TICA: a system for the extraction of data from analytical chemical text. Chemometrics and Intelligent Laboratory Systems, accepted.
16. Nishida F, Takamatsu S, Fujita Y. Semiautomatic Indexing of Structured Information of Text. Journal of Chemical Information and Computer Sciences 1984; 24: 15-20.
17. Fugmann R, Nickelsen H, Nickelsen I, Winter JH. Representation of Concept Relations Using the TOSAR System of the IDC: Treatise III on Information Retrieval Theory. Journal of the American Society for Information Science 1974; 25: 287-307.
18. Rozas R, Fernandez H. Automatic Processing of Graphics for Image Databases in Science. Journal of Chemical Information and Computer Sciences 1990; 30: 7-12.
19. Official Methods of Analysis. Williams S, ed. The Association of Official Analytical Chemists Inc., Arlington, Virginia, 1984.
CHAPTER 37
Databases for Geodetic Applications
D. Ruland¹ and R. Ruland²*
¹Siemens AG, Dept. ZU S3, D-8000 München, FRG and ²Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94309, USA
Abstract
Geodetic applications, even for a defined project, consist of various different activities and access a vast amount of heterogeneous data. Geodetic activities need a hybrid and heterogeneous hardware environment. This paper gives a brief introduction to the geodetic data flow using a sample application in survey engineering. It states a general multi-level integration model providing an open system architecture. The model yields the GEOMANAGER project, whose data management aspect is addressed by this paper.
1. Introduction
1.1 Preface
Over the last decade data handling in applied geodesy and surveying has changed dramatically. What used to be the fieldbook is now a portable computer, and the fieldbook keeper has been replaced by a microprocessor and an interface. Further down the data processing line one can see the same changes: least-squares adjustments used to require the computational power of mainframes; now there is a multitude of sophisticated program systems available which run on Personal Computers (PC) and provide an even more elegant human interface. There are also solutions available for automated data preprocessing, i.e., for the data handling and preparation from the electronic fieldbook to the creation of input files for the least-squares adjustments [FrPuRRu87, RRuFr86]. However, an equally important step has not found much consideration in geodetic discussions and publications: the integration of the geodetic data flow, i.e., the management of geodetic data in large projects.
This paper summarizes the geodetic activities and the data flow using a sample and representative geodetic application. A two-level integration model is introduced, consisting of communication integration and information integration. Whereas an integrated communication system can be implemented using today's market standard components, information integration requires a customization of new database management systems. The goals, requirements, and solutions for geodetic database management systems, especially for the GEOMANAGER of our sample application, are emphasized.
* Work supported by the Department of Energy, contract DE-AC03-SF00515
1.2 Data flow
The geodetic data flow is summarized in Figure 1. First, the readings are stored in measurement instruments or data collectors. The data are then uploaded and prepared by DATA PREPARATION programs, yielding a measurement data file for each considered observable. These measurement data are raw measurement data, which must be processed by PREPROCESSING programs yielding reduced data. To do so, the preprocessing programs need to access the calibration data. Furthermore, point identifiers are normalized to standard point identifiers, i.e., aliases or synonyms are replaced by standard identifiers. The preprocessed measurement data form the input of various DATA ANALYSIS programs, which compute (new) coordinates for the considered points. For each point all sets of new and previously measured coordinates are stored. Each set of coordinates refers to a common or measurement-specific underlying coordinate system. Thus, a huge amount of highly structured data is generated and accessed by various activities, from data collection in the field to time-consuming data analysis programs.
Figure 1. Geodetic data flow.
Due to the nature of geodetic applications, the geographical sites of the activities are widely spread. The observables are collected in the field using portable microcomputers (e.g., HP Portable Plus or 71 computers) running the specialized data collection programs. The observation data are either manually entered or the data collectors are interfaced with the survey instruments (e.g., KERN E2 or WILD T3000 theodolites) to transmit bi-directional signals. Preparation and preprocessing of the collected field data are executed on a departmental cluster of workstations and personal computers. Hence, the field data collection is connected off-line to the cluster. Other activities, like the calibration of survey instruments, are performed at sites located several miles away from the cluster. The data analysis programs run mainly on the cluster. Only some special analysis programs still need a mainframe computer. In summary, most of the geodetic activities are performed on the PC/WS cluster.
As shown, the different geodetic actions deal with various data which can be classified as follows (see Fig. 2):
Measurement Data (data concerned with different observables)
- Height Data
- Distance Data
- Direction Data
Calibration Data (data about instruments)
- Tape Data
- Rod Data
- Circle Data
Point Data
- Point Identification Data (synonym identifiers)
- Coordinate Data
- Coordinate System Data
Figure 2. Data classification.
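The preprocessing step of Section 1.2 (reduction of raw data with calibration data and normalization of point identifiers) can be sketched in a few lines. This is a sketch under stated assumptions: the alias table, the calibration constants and the simple scale-plus-offset correction are hypothetical and are not taken from the GEONET or GEOMANAGER software.

```python
# Hypothetical alias table and calibration constants (illustrative values only).
SYNONYMS = {"P1A": "P001", "PT-1": "P001", "MON7": "P007"}
EDM_CALIBRATION = {"additive_m": 0.002, "scale": 1.000010}


def standard_id(point_id, synonyms):
    """Replace an alias by its standard point identifier (unchanged if unknown)."""
    return synonyms.get(point_id, point_id)


def preprocess(raw_records, synonyms, calibration):
    """Turn raw distance readings into reduced data with standard identifiers."""
    reduced = []
    for rec in raw_records:
        # Assumed correction model: scale factor plus additive constant.
        distance = rec["distance_m"] * calibration["scale"] + calibration["additive_m"]
        reduced.append(
            {
                "from": standard_id(rec["from"], synonyms),
                "to": standard_id(rec["to"], synonyms),
                "distance_m": round(distance, 4),
            }
        )
    return reduced


raw = [{"from": "PT-1", "to": "MON7", "distance_m": 100.0}]
print(preprocess(raw, SYNONYMS, EDM_CALIBRATION))
```

Keeping the synonym table and the calibration data outside the function reflects the classification above, in which point identification data and calibration data are separate data classes shared by several programs.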
1.3 Enhanced data integration
The GEONET data management approach [FrPuRRu87] was developed to handle the huge amount of geodetic data originating during the construction survey and subsequent realignment surveys of the Stanford Linear Collider (SLC), built by the Stanford Linear Accelerator Center (SLAC) [Er84]. The SLC is a high energy physics particle collider for research into the behavior and properties of the smallest constituents of matter. During the construction alone, some 100,000 coordinates had to be determined [OrRRu85, Pi86]. The GEONET approach was based on a hierarchical database management (DBM) concept which required the hardwiring of data structures. Nevertheless, GEONET proved to be very successful and has found many applications in the high energy physics survey and alignment community. However, the concept does not provide the flexibility needed for easy assimilation of changing requirements, easy integration of new tools, or the establishment of new data relationships. Therefore, future projects like the Superconducting Super Collider (SSC), which will produce an at least 20-fold increase in the amount of data and will show more complex data relationships due to an increase of observables and more sophisticated and complex mathematical modelling, will require new concepts. This situation triggered the project GEOMANAGER.
2. Integration of the geodetic data flow
As pointed out in the introduction, an integration of the data flow among the various geodetic activities is necessary. The major goals and requirements of an integration of the data flow result from the following characteristics:
- Geodetic software tools have different communication interfaces.
- Geodetic software tools (e.g., data gathering, data analysis programs, etc.) use a huge amount of different data.
- Geodetic software tools share the same data.
- Geodetic software tools run in different project environments.
- New geodetic software tools must be easily integrated.
- Geodetic data are highly structured and need heterogeneous types.
- Geodetic data have various complex consistency constraints.
An integration must provide an open system architecture for an easy integration of new tools and instruments. The geodetic integration concept provided by the GEOMANAGER project consists of two levels:
- Communication integration
- Information integration
Information integration requires an integrated communication management.
Communication integration emphasizes a full interfacing of all used computers and instruments. The interfaces must be suitable for the required communication. The requirements of the main interfaces of the sample application are:
- Interface: Survey instruments and data collection computers
  Special-purpose low-level signal transmission communication
- Interface: WS/PC cluster and data collector computers
  Transmission of small amounts of field data files
- Interface: WS/PC cluster
  High-speed local area network
- Interface: WS/PC cluster and mainframes
  Transmission of large amounts of various data
Interfacing a hybrid computer environment can use today's well-equipped communication standards. But in some cases (e.g., interfacing survey instruments) the customization of special interface boxes is required.
The major goal of the information integration is a unified high-level data management, such that all activities can access the data on a high level of abstraction and in a unified way. Information integration is best fulfilled by a database approach, providing the following concepts:
- Conceptual data centralization
- Data redundancy elimination
- Data sharing
- Data independence
- High level interfaces
- Open system architecture
The GEOMANAGER's database resides on the WS/PC cluster, because all major geodetic activities take place here (see Fig. 3).
Figure 3. Hybrid computer environment (components shown: PC/WS cluster, shared plotters, shared printers, shared disk storage, gateway, mainframe, data collectors, E2/T3000 theodolite).
Databases provide some further well-known functions and capabilities, which are also required by geodetic applications. They are not discussed here [DRuRRu87]. There are various problems and aspects in applying a geodetic database system. Because of space limitations, we focus only on the following aspects:
- Data modelling
- Database interface
3. Data modelling
As already pointed out, geodetic data are highly structured and use heterogeneous data types. However, traditional data models do not support all relationship and data types, nor the more sophisticated data abstraction concepts. These limited data modelling capabilities complicate the database design process and the database usage. The lack of semantics becomes more important the more complex the data structure of the application is (especially in more sophisticated "nonstandard" database applications, like engineering design, office automation, geographic applications, etc.). Furthermore, geodetic tools run in a wide range of project environments using database systems based on different data models. Thus, the same application data structure must be modelled in different data models, which causes redundant database design processes. These gaps between applications and traditional data models are bridged by semantic data models. We use the Entity-Relationship model (ER model) extended by the data abstraction concepts of aggregation and generalization hierarchies. Extended ER schemes are developed for the following major geodetic data classes:
- distance measurement data
- height measurement data
- point data.
3.1 ER schemes for the sample geodetic application
In Figures 4, 5 and 6, ER diagrams are given for distance measurement data, height measurement data, calibration data, and point data. These ER schemes contain 17 entity types and relationship types, respectively. Because of space limitations, and since the ER diagrams are self-explanatory, only a few aspects are pointed out in the following. The entity types DISTANCE-MEASUREMENT, TAPE-METHOD, EDM-METHOD, DISTINVAR-METHOD, and INTERFEROMETER-METHOD describe the distance measurement data. Entities of the latter four entity types specialize the distance measurement data by adding properties of a specific method. A DISTANCE-MEASUREMENT entity describes the method-independent properties. It must be related to exactly one entity of exactly one METHOD entity type. Thus, DM-METHOD represents a generalization among the generalized DISTANCE-MEASUREMENT entity type and the 4 individual METHOD entity types. The entity types HEIGHT-MEASUREMENT and READING describe the height measurement data. Since a height measurement consists of several readings, which are existence and identification dependent, a PART-OF relationship type represents these associations. The entity types TAPE-INSTRUMENT, EDM-INSTRUMENT, DISTINVAR-INSTRUMENT-WIRE, and INTERFEROMETER-INSTRUMENT, as well as ROD-INSTRUMENT, describe the calibration data of the instruments used for distance and height measurements, respectively. These entity types are connected to the entity types describing the measurement data. Notice that the relationship types USED-ROD-1 and USED-ROD-2 are the only relationship types with attributes. The attributes represent the raw and reduced readings on the two scales on each of the two rods used. Finally, the entity types POINT, SYNONYM, COORDINATE, and COORDINATE-SYSTEM describe the point data. Each point owns several coordinate data sets. Notice that the relationship type SAME-SERIES is the only recursive relationship type. It relates coordinates which result from the same measurement epoch.
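The DM-METHOD generalization described above can be made concrete with a small sketch. It is an illustration only and not part of GEOMANAGER: the class names mirror the entity types named in the text, while every attribute (`tape_id`, `instrument_id`, `distance_m`, and so on) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Union


# Method-specific entity types. Only the entity-type names come from the
# text; every attribute below is a hypothetical placeholder.
@dataclass
class TapeMethod:
    tape_id: str


@dataclass
class EdmMethod:
    instrument_id: str


@dataclass
class DistinvarMethod:
    wire_id: str


@dataclass
class InterferometerMethod:
    instrument_id: str


Method = Union[TapeMethod, EdmMethod, DistinvarMethod, InterferometerMethod]
METHOD_TYPES = (TapeMethod, EdmMethod, DistinvarMethod, InterferometerMethod)


@dataclass
class DistanceMeasurement:
    """Method-independent properties plus exactly one method entity."""
    from_point: str
    to_point: str
    distance_m: float
    method: Method

    def __post_init__(self) -> None:
        # Enforce "exactly one entity of exactly one METHOD entity type".
        if not isinstance(self.method, METHOD_TYPES):
            raise TypeError("a distance measurement needs exactly one method entity")


m = DistanceMeasurement("P1", "P2", 12.345, EdmMethod("EDM-7"))
print(type(m.method).__name__)  # EdmMethod
```

Holding the method as a single required field is one way to express the "exactly one" constraint; a relational schema would need an explicit consistency check instead.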
Figure 4. ER diagram: distance measurement data.
Figure 5. ER diagram: height measurement data.
Figure 6. ER diagram: point data (CoordinateSystem with Zero Point, x-Axis, and y-Axis).
4. Database interface
The database interface is based on the data model used and must meet the data access and manipulation requirements of the geodetic tools. GEOMANAGER’s interface is a hybrid data interface, combining descriptive and procedural elements. First of all, the interface supports elementary operations to access sets of entities or relationships of a single type. The entities or relationships must be qualified by their identifiers. Thus, the elementary operations support a procedural interface. However, most geodetic tools need access to aggregates of associated entities of several types. Thus, operations for accessing data aggregates must be supported by the interface. These aggregate operations define a descriptive interface. Its design is based on the following properties of geodetic applications. First, for each geodetic tool a set of generic data aggregate types accessed by this tool can be specified. Hence, the set of data aggregates used is known in advance. Second, some geodetic applications do not have any direct access to the database provided by the communication system (e.g., data analysis programs running on mainframe computers). Other existing geodetic tools do not yet support any database interface; they use their own dedicated file structures. Thus, the interface supports a pre-defined set of parametrized access modules for data aggregates. In a first step, the data aggregate type is specified. If data aggregates are retrieved or modified, their qualification is also given. The specification model for qualification statements is derived from predicate logic extended by concepts for handling hierarchies of object classes. This information is especially used by the transaction management for concurrency control and recovery. The second step depends on the communication mode. If direct access is possible, then the specified data aggregates can be retrieved, modified, or written using elementary operations. Thus, this second access step is a procedural one. If no direct access is possible, the retrieved data aggregates are downloaded from the database in a data stream using a standardized interchange format. The interchange format is derived from the database scheme. If data aggregates are entered, they must be given as a data stream, which is uploaded to the database.
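The two interface levels can be illustrated with a toy sketch. Nothing here reflects GEOMANAGER's actual interface: the in-memory store, the function names, the attributes, and the use of JSON as a stand-in for the unspecified interchange format are all assumptions made for illustration.

```python
import json

# Toy in-memory "database": entity type -> {identifier: attribute dict}.
# The entity-type names follow the text; attributes and identifiers are
# hypothetical.
DB = {
    "HEIGHT-MEASUREMENT": {"HM1": {"epoch": "1990-06"}},
    "READING": {
        "R1": {"part_of": "HM1", "value": 1.2031},
        "R2": {"part_of": "HM1", "value": 1.2029},
        "R3": {"part_of": "HM9", "value": 0.8800},
    },
}


def get_entities(entity_type, identifiers):
    """Elementary (procedural) operation: fetch entities of ONE type by id."""
    table = DB[entity_type]
    return {i: table[i] for i in identifiers if i in table}


def height_measurement_aggregate(hm_id):
    """Predefined, parametrized access module for one aggregate type.

    The caller states WHAT is wanted (a height measurement with all of its
    readings); the module decides how to collect it.
    """
    measurement = get_entities("HEIGHT-MEASUREMENT", [hm_id])[hm_id]
    reading_ids = sorted(
        rid for rid, r in DB["READING"].items() if r["part_of"] == hm_id
    )
    return {
        "id": hm_id,
        "measurement": measurement,
        "readings": get_entities("READING", reading_ids),
    }


def download(aggregate):
    """No direct access: serialize the aggregate as a data stream.

    JSON stands in for the standardized interchange format, which the text
    does not specify.
    """
    return json.dumps(aggregate, sort_keys=True)


def upload(stream):
    """Inverse of download: parse a data stream back into an aggregate."""
    return json.loads(stream)


agg = height_measurement_aggregate("HM1")
print(sorted(agg["readings"]))       # ['R1', 'R2']
print(upload(download(agg)) == agg)  # True
```

A tool with direct access would call `get_entities` itself; a mainframe program without access would only ever see the stream produced by `download`.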
5. Conclusion
In this paper, the need for data flow integration in geodetic applications is shown. The goal of this paper is:
- to provide some understanding of geodetic activities and of the geodetic data flow
- to introduce a two-level integration model
- to show the problems in applying software tools (i.e., DBMSs) in today’s market place for this “non-standard” application
- to evaluate the general potential of geodetic data flow integration.
The proposed integration model provides an open system architecture and has two integration levels:
- Communication integration
- Information and data integration
This paper addresses the information and data integration level. The requirements are:
- Access to a huge amount of data by the tools
- Tool migration among various projects
- Open system architecture
- Highly structured data
- Complex consistency constraints
The goals of the information and data integration are to provide the following, using a database approach:
- Conceptual data centralization
- Data redundancy elimination
- Data sharing
- Data independence and high level database interfaces
However, DBMSs are commonly used in commercial applications, and not frequently in "non-standard" applications such as engineering and scientific applications. The paper mentioned three problems in using DBMSs in geodetic applications, i.e., in the GEOMANAGER project:
- Data modelling. Entity-Relationship schemes, extended by aggregation and generalization hierarchies, are developed for our sample application.
- Database interface. A hybrid, i.e., procedural and descriptive, database interface is developed for accessing simple entities/relationships as well as complex data aggregates. Furthermore, up- and downloading of data is possible.
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 38
Automatic Documentation of Graphical Schematics
M. May
Academy of Sciences of the GDR, Central Institute of Cybernetics and Information Processes (ZKI), Kurstraße 33, DDR-1086 Berlin, GDR
Abstract
Graphical documentation is still one of the less supported, time-consuming and error-prone engineering activities, even in an era of sophisticated CAE/CAD systems and tools. Among graphical documents, schematic diagrams represent a specific class of 2D drawings. They are characterized mainly by structural information, i.e., by graphical (macro-)symbols and their interconnection lines. Schematic drawings are not true-to-scale graphics and may thus be derived automatically from a description containing only simple structural information. The resulting layout problem for graphical schematics applies to various branches in design automation. The entire and very difficult layout process may be decomposed into scheme partitioning, placement of the graphical symbols, and routing of interconnection lines. There is no unique layout algorithm for this problem. We found that different types of diagrams are to be classified and generated by different techniques. These new layout methods have proven to be very efficient in various engineering applications such as system design and graphical documentation, e.g., in automation, electronics, technology, and software engineering. Schematics that have been generated by our Computer Aided Schematics (CAS) approach are electric and wiring diagrams, logic schematics, flow charts and technological schematics.
1. Introduction
When designing a complex technical system or process, the engineer or designer starts by fixing its general structure and main functions. No matter whether structural, functional or implementational design is considered, he or she has to think over system components, subfunctions, and procedures and how they interconnect and interact. In most of the technical disciplines, such as electronics and automation or process and software engineering, schematic diagrams are used to depict these difficult interrelations. Typical schematics from those application fields are electric and wiring diagrams, logic and technological schematics, block diagrams, control and programming flow charts, graphs and networks. It is widely accepted that the manual preparation and updating of diagrams is a time-consuming, error-prone, costly, and not very creative job. This makes schematics generation a task to be intensively assisted by the computer. However, despite some promising results, mainly in electric and logic diagram drawing, the automatic generation of general schematics did not receive much attention in the CAD/CAE community. To cope with the inherent layout problems, a more general view and unifying approach is necessary.
2. Schematics structure
Schematics are not true-to-scale graphics. Essentially, they are characterized by graphical symbols, by interconnections (lines) between input and output connectors (pins) of these symbols, and possibly some lettering. The detailed arrangement of these graphical constituents on the layout area (display or sheet of paper) is of secondary importance. Consequently, schematics are mainly determined by their structure, which can be described in different ways. Most frequently, an interconnection list is used, specifying the pins to be connected and their relative positions on the symbols they belong to. Figure 1 shows two possible structure descriptions for schematics and a corresponding graphical representation. Future efforts are supposed to concentrate on standardized structural schematics interfaces. Additionally, there are user-oriented structural description languages that also support hierarchical design [1]. It should be noted that structural schematics descriptions need not necessarily be the result of manual input but may result from CAD preprocesses.
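An interconnection list of the kind described above might be represented as follows. This is a minimal sketch under assumptions: the symbol, pin, net and line-type names echo the legend of Figure 1, but the `Pin` record, the coordinates, and the helper functions are invented for illustration and do not reproduce the formats shown in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    symbol: str  # symbol the pin belongs to, e.g. "S1"
    name: str    # pin name on that symbol, e.g. "P1"
    dx: int      # position relative to the symbol origin (grid units)
    dy: int


# Interconnection list: net name -> (line type, pins to be connected).
# All coordinates are made up for illustration.
NETS = {
    "V1": ("T2", [Pin("S1", "P1", 4, 1), Pin("S2", "P1", 0, 1), Pin("S3", "P1", 0, 2)]),
    "V2": ("T1", [Pin("S2", "P2", 4, 1), Pin("S3", "P2", 0, 0)]),
}


def symbols_of(nets):
    """Derive the set of symbols used by an interconnection list."""
    return sorted({pin.symbol for _, pins in nets.values() for pin in pins})


def pins_per_symbol(nets):
    """Group connected pins by symbol (a symbol-oriented view of the same data)."""
    result = {}
    for net, (_, pins) in nets.items():
        for pin in pins:
            result.setdefault(pin.symbol, []).append((pin.name, net))
    return result


print(symbols_of(NETS))             # ['S1', 'S2', 'S3']
print(pins_per_symbol(NETS)["S2"])  # [('P1', 'V1'), ('P2', 'V2')]
```

Note that the description carries no absolute positions: everything a layout program needs to start from is the connectivity and the pin offsets on each symbol.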
3. The layout problem
The major objective in the automatic generation (layout) of schematic diagrams is to produce correct, readable, and easily comprehensible graphics while taking into account the conditions imposed by standards and the aesthetic viewpoint. These objectives are reflected by the layout requirements, such as grouping together strongly interconnected symbols and subschemes, maintenance of the main signal or information flow, small interconnection length, few intersections and bends in the line routing, and uniform utilization of the layout area.
Figure 1. Two possible structure descriptions for graphical schematics (from structure description to graphical representation). Legend: S - symbol; P - interconnection point (pin); V - interconnection net; T - interconnection (line) type.
Formalizing the general layout problem for schematics is a very complex task [2]. So, we give only a few remarks on those parts of the model that have to be specified. Without loss of generality, the symbols are considered to have a rectangular boundary where the pins are located on integer coordinates of a supposed grid. Nets are subsets of pins to be connected by certain line structures. In general these structures are trees. The layout area is represented by a section of the Euclidean plane. It is typical for schematics to embed the interconnection lines into a rectangular grid, which is called rectilinear routing. Furthermore, besides some application-specific layout rules there exist some general ones. Symbols and lines must not overlap. Orthogonal line intersections are generally allowed. Complex schematics have to be divided into several sheets of a given format. Intersheet references ought to be done by connectors. By the layout rules an admissible layout is defined. The most difficult modelling part is the selection of an appropriate goal function for layout optimization [2]. Usually, a function of the symbol (pin) positions and line tracing has to be minimized. Because of its inherent complexity this global optimization problem must be simplified in two ways.
Figure 2. A grid schematic.
First, schematics must be classified into groups with similar characteristics, such that within one group the same layout techniques can be applied. Secondly, it is indispensable to decompose the general layout problem into easier subproblems that can be treated sequentially. Generally, these subproblems are: decomposition of complex schematics, symbol placement, and line routing.
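To give a feeling for what a goal function over the line tracing might look like, here is a small sketch. It is not the goal function of [2]: the choice of terms (total rectilinear length plus weighted bends) and the weight `bend_weight` are assumptions picked from the layout requirements listed in this section.

```python
def path_length(points):
    """Total length of a rectilinear polyline given as grid points."""
    return sum(abs(x2 - x1) + abs(y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def bends(points):
    """Number of corners: vertices where the direction of travel changes."""
    count = 0
    for a, b, c in zip(points, points[1:], points[2:]):
        horizontal_in = a[1] == b[1]   # segment a-b keeps y constant
        horizontal_out = b[1] == c[1]  # segment b-c keeps y constant
        if horizontal_in != horizontal_out:
            count += 1
    return count


def layout_cost(paths, bend_weight=2.0):
    """Hypothetical goal function: line length plus weighted bend count."""
    return sum(path_length(p) + bend_weight * bends(p) for p in paths)


paths = [
    [(0, 0), (3, 0), (3, 2)],  # length 5, one bend
    [(0, 1), (5, 1)],          # length 5, straight
]
print(layout_cost(paths))  # 12.0
```

A complete goal function would also have to account for symbol positions and line crossings; the point of the sketch is only that each candidate layout is reduced to a single number that can be compared.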
4. Types of graphical schematics
Looking at the variety of schematic diagrams we can distinguish between three classes, each representing similar layout characteristics [2].
4.1 Grid schematics
They are characterized by graphical symbols of nearly equal size, where the symbols Ki are represented by their enclosing rectangle. Hence, these symbols are arranged on grid points of an (equidistant) rectangular grid (matrix), where each symbol is assigned to exactly one grid point (matrix element). The space on the layout area not occupied by symbols is available for line routing. A sample grid schematic is illustrated in Figure 2. The main criteria for laying out a grid schematic are interconnection length and bend (corner) minimization. Typical representatives of grid schematics are:
- block diagrams
- simple logic diagrams
- hydraulic schematics
- technological diagrams
- graphs and networks.
Figure 3. A row schematic.
In principle, any diagram can be considered a grid schematic if each grid point represents an area bigger than the size of the biggest symbol. However, if symbols differ considerably in size, this model results in a very inefficient space utilization. Sometimes the grid model may be too restrictive even for equal-sized symbols, since a certain degree of freedom in symbol displacement often tends to reduce line corners and thus to improve the readability of a graphic.
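The space-utilization argument can be checked with simple arithmetic. The sketch below assumes that every grid cell is sized to the largest symbol width and height and ignores the extra space reserved for routing; the function name and the symbol sizes are hypothetical.

```python
def grid_utilization(symbol_sizes):
    """Fraction of the occupied grid cells' area actually covered by symbols.

    symbol_sizes: list of (width, height). Each grid cell is made as large
    as the largest width and the largest height occurring in the list.
    """
    cell_w = max(w for w, _ in symbol_sizes)
    cell_h = max(h for _, h in symbol_sizes)
    used = sum(w * h for w, h in symbol_sizes)
    return used / (len(symbol_sizes) * cell_w * cell_h)


equal = [(4, 3)] * 4
mixed = [(4, 3), (4, 3), (12, 9), (2, 2)]

print(grid_utilization(equal))            # 1.0
print(round(grid_utilization(mixed), 3))  # 0.315
```

With one large symbol among small ones, less than a third of the cell area is used, which is the inefficiency the text refers to.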
4.2 Row schematics
Here the symbols are arranged on consecutive parallel rows [2, 3] of a certain width. We restrict our consideration to vertical rows. The row width depends on the symbol length. Symbols are supposed to have similar length but arbitrary height. Usually, row schematics are signal flow oriented representations, i.e., there exists a preferred direction (from left to right) along which the signal flow is to be followed. Besides aesthetic line routing, the main layout objective for row schematics is to maintain this signal flow, resulting in a reduction of line crossings and feedbacks. Between two symbol rows there is always another row (channel) left for embedding the interconnection lines. This special topology allows usage of very efficient (channel) routing procedures.
Figure 4. A free schematic.
Examples of row schematics are:
- (arbitrary) logic diagrams
- programming and control flow charts
- Petri nets
- signal flow graphs
- relay ladder diagrams
- state-transition diagrams.
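Whether a given row assignment maintains the left-to-right signal flow can be checked mechanically. The following sketch counts feedback connections; treating a connection as feedback whenever its target does not lie in a later row than its source is an assumption of this example, and the symbol names are hypothetical.

```python
def feedback_connections(row_of, connections):
    """Return the directed connections that do not advance to a later row.

    row_of:      symbol -> row index, counted from left (0) to right
    connections: iterable of (source_symbol, target_symbol)
    """
    return [(s, t) for s, t in connections if row_of[t] <= row_of[s]]


# Hypothetical three-row logic diagram with one feedback line.
row_of = {"IN": 0, "AND": 1, "FF": 2}
connections = [("IN", "AND"), ("AND", "FF"), ("FF", "AND")]

print(feedback_connections(row_of, connections))  # [('FF', 'AND')]
```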
4.3 Free schematics
This is the most general class of diagrams. Symbols are allowed to take any size. There is no restriction of the placement area, i.e., the free plane is available for symbol arrangement and line tracing. This requires considerable effort in designing acceptable layout procedures. A typical free schematic is depicted in Figure 4. As for grid schematics, the major layout objective is aesthetic and complete line routing as well as uniform utilization of the layout area. Representatives of this class are:
- electric circuit schematics
- wiring diagrams
- general block diagrams
- entity-relationship diagrams
- engineering schematics.
5. Automatic scheme generation
Generally, the entire generation cycle of schematic diagrams is separated into three layout steps:
1) scheme decomposition
2) symbol placement
3) line routing.
5.1 Scheme decomposition
Decomposition means deciding which symbols of a schematic have to be assigned to one sheet and determining intersheet connections. We distinguish between a priori and a posteriori decomposition. In the a priori approach, also called partitioning, symbols have to be decomposed into groups prior to symbol placement and line routing. The objective is to group those symbols on a sheet that belong strongly together, such that the number of intersheet connectors is minimized. Before partitioning, the diagram has to be replaced by an appropriate graph model (e.g., weighted star, clique or hypergraph). After this transformation the decomposition is obtained by size-constrained clustering [2], similar to IC partitioning [4]. Schematics usually do not exceed the size of several hundred symbols. Efficient placement (and routing) procedures often manage to handle the entire diagram without a priori decomposition. In this case the diagram can be generated without taking into consideration the format of the output sheets. An a posteriori decomposition then tears the overall picture into several subgraphics of sheet size. This may include a slight (local) displacement of those symbols that are cut by sheet boundaries. If routing on the overall schematic is too expensive, a similar a posteriori decomposition can be performed immediately after symbol placement. In this way the entire routing problem reduces to routing on single sheets. A posteriori decomposition can be applied to all types of diagrams, but it is especially suited to grid and row schematics [2, 5].
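The partitioning objective stated above, few intersheet connectors under a sheet-size limit, can be evaluated for any candidate assignment of symbols to sheets. The sketch below only scores a given partition; it does not implement the size-constrained clustering of [2]. The hypergraph-style net model, the function names, and the data are assumptions.

```python
from collections import Counter


def intersheet_nets(nets, sheet_of):
    """Nets whose symbols lie on more than one sheet (they need connectors)."""
    return [name for name, symbols in nets.items()
            if len({sheet_of[s] for s in symbols}) > 1]


def respects_capacity(sheet_of, capacity):
    """Size constraint: no sheet may hold more than `capacity` symbols."""
    return all(n <= capacity for n in Counter(sheet_of.values()).values())


# Hypothetical schematic: four symbols connected in a ring by four nets.
nets = {"V1": ["S1", "S2"], "V2": ["S2", "S3"],
        "V3": ["S3", "S4"], "V4": ["S1", "S4"]}

neighbours_together = {"S1": 0, "S2": 0, "S3": 1, "S4": 1}
interleaved = {"S1": 0, "S3": 0, "S2": 1, "S4": 1}

print(intersheet_nets(nets, neighbours_together))  # ['V2', 'V4']
print(len(intersheet_nets(nets, interleaved)))     # 4
print(respects_capacity(interleaved, capacity=2))  # True
```

Both partitions satisfy the size constraint, but keeping strongly connected symbols together halves the number of nets that cross a sheet boundary.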
5.2 Symbol placement
Placement is the most critical part in schematics layout, differing much from placement problems appearing in PCB and IC design [4]. The objective is not a very compact layout but a placement that allows complete and aesthetically pleasing line routing. For grid schematics, placement can be transformed to the standard (NP-hard) Quadratic Assignment Problem, for which heuristic solution techniques are well known. Placement for row schematics is usually divided into three steps: assigning symbols to rows, ordering the symbols within their rows, and detailed placement [3, 6]. The first step leads to a modified version of the Feedback Arc Set Problem [2] and the second one to the Crossing Number Problem in multi-partite graphs [7]. Detailed placement is obtained by local displacement operations that maximize the number of straight line segments in routing. An alternative approach to row placement consists in determining and joining horizontal signal chains of symbols, which results in solving a modified Optimal Linear Arrangement Problem. The most difficult placement problem is that for free schematics. It has not yet been deeply investigated. It is related to building block placement and floorplan design in circuit layout [4, 5]. Hence, force-directed and min-cut algorithms could be adapted to symbol placement [2]. However, in this approach it is hardly possible to take into account conditions imposed by routing requirements. Nevertheless, this technique can be applied to obtain an admissible symbol arrangement with acceptable global characteristics. Local properties can be considered more appropriately in a sequential placement algorithm, where in one step exactly one symbol is placed on the sheet in a position that allows easy line routing. Here, global effects are very often neglected when local decisions are taken. Consequently, we suggest a combination of both views, resulting in a hierarchical (bottom-up) placement for free schematics. In this way, subschematics are built and merged step by step until the overall diagram is generated. Using this method the routing could be performed in a similar hierarchical way on each subschematic.
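The second row-placement step, ordering symbols within their rows to reduce crossings, can be illustrated for two adjacent rows. The crossing count below follows directly from the definition; the reordering uses the barycenter heuristic, a common technique from layered graph drawing that is swapped in here for illustration and is not claimed to be the method of [7]. All names and data are hypothetical.

```python
def count_crossings(left_order, right_order, edges):
    """Count pairwise crossings of straight lines between two adjacent rows."""
    pos_l = {s: i for i, s in enumerate(left_order)}
    pos_r = {s: i for i, s in enumerate(right_order)}
    pairs = [(pos_l[a], pos_r[b]) for a, b in edges]
    crossings = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (l1, r1), (l2, r2) = pairs[i], pairs[j]
            # Two lines cross when their end points are in opposite order.
            if (l1 - l2) * (r1 - r2) < 0:
                crossings += 1
    return crossings


def barycenter_order(left_order, right_order, edges):
    """Reorder the right row by the mean position of each symbol's neighbours."""
    pos_l = {s: i for i, s in enumerate(left_order)}
    neighbours = {s: [] for s in right_order}
    for a, b in edges:
        neighbours[b].append(pos_l[a])

    def key(s):
        ns = neighbours[s]
        # Unconnected symbols keep their current position as sort key.
        return sum(ns) / len(ns) if ns else right_order.index(s)

    return sorted(right_order, key=key)


left = ["A", "B", "C"]
right = ["X", "Y", "Z"]
edges = [("A", "Z"), ("B", "Y"), ("C", "X")]

print(count_crossings(left, right, edges))      # 3
better = barycenter_order(left, right, edges)
print(better)                                   # ['Z', 'Y', 'X']
print(count_crossings(left, better, edges))     # 0
```

The heuristic gives no optimality guarantee in general; in a multi-row schematic it would typically be applied row by row in several sweeps.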
5.3 Line routing
Routing means the automatic generation of interconnection lines between the symbols of a schematic. Since lines on most diagrams are made of orthogonal segments, we restrict our consideration to rectilinear routing. The objective of routing is to embed each interconnection net under certain restrictions in the routing area (rectangular grid). An embedding of a net is a rectilinear Steiner tree whose leaves are exactly the net pins. Trees with a minimum total length and/or number of bends are to be found. Whereas different placement techniques are necessary for the different classes of diagrams, routing can be treated in a universal approach. However, this does not mean that certain routing problems could not be solved more efficiently by specific techniques. The most flexible (universal) strategy for routing on general schematics is to connect net by net and to reduce the Steiner tree layout to sequential routing of two-point structures using path finding algorithms [4, 5]. Its most popular representative is the Lee algorithm [8], operating on a matrix where each matrix element (cell) corresponds to exactly one grid point. It is a breadth-first search algorithm, sometimes called the wave propagation method. This algorithm operates on a very general class of monotone path cost functions and may be adapted to the specific needs of schematics routing [2]. Based on this principle, a universal line router [9], CARO (Computer Aided Routing), has been developed and applied to many different diagram and document types. Emphasis has been put on performance rate and speed by dynamic sorting of pins and nets, anti-blocking technique, directed target search, and multi-level routing hierarchy. Figure 5 shows a section of an electric diagram automatically routed by CARO. Due to the complexity of this routing task a two-level hierarchy was used.
Figure 5. A CARO routing result.
The other major strategy for generating two-point connections is the so-called line search method [4, 10]. Unlike the Lee algorithm, in which a path is represented by a sequence of grid points, the line search algorithms search for a path as a sequence of line segments. Starting simultaneously from the source and target point, horizontal and vertical lines are expanded until they hit an obstacle or the routing boundary. From these lines, perpendicular extension lines are again constructed, etc., until a polyline originating from the source intersects one from the target. This line search strategy was used in [11].
The third strategy for two-point routing is to exploit the principle of pattern routing, i.e., to find simple-shaped interconnection paths efficiently (e.g., straight lines, corners, u- and z-shaped paths). Pattern routing is especially suited to schematics, but the number of topological routing patterns grows exponentially with the number of line segments. Thus, pattern routing is applicable either for rather simple diagrams or may be used as an initial step in complex routing procedures [11, 12].
Besides these sequential (net by net) routing algorithms for general diagrams there exist highly efficient (semi-parallel) channel routing techniques [4], especially for diagrams with rather regular structure. A channel is a rectangular section of the routing area with pins only on two opposite sides. For grid and row schematics the decomposition of the entire routing area into channels is obtained in a natural way. The assignment of interconnections to channels, called global routing, can easily be adopted from the IC global routing methods [4]. During channel routing a subset of nets is embedded simultaneously with a minimum number of bends. Furthermore, channel routing on schematics should aim at a minimum number of line crossings rather than at the conventional minimum channel width [2, 5].
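The wave propagation idea behind the Lee algorithm described earlier in this section can be sketched in a few lines. This is a textbook-style breadth-first search with unit step cost on a small obstacle grid; it is not CARO, it does not minimize bends, and the grid encoding and function name are assumptions.

```python
from collections import deque


def lee_route(grid, source, target):
    """Shortest rectilinear path between two free cells of a routing grid.

    grid[y][x] == 1 marks a blocked cell (symbol or existing line).
    Returns the path as a list of (x, y) cells, or None if no path exists.
    """
    height, width = len(grid), len(grid[0])
    dist = {source: 0}
    queue = deque([source])
    # Wave expansion: label every reachable cell with its distance.
    while queue:
        x, y = queue.popleft()
        if (x, y) == target:
            break
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < width and 0 <= ny < height
                    and grid[ny][nx] == 0 and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    if target not in dist:
        return None
    # Backtrace: walk from the target along strictly decreasing labels.
    path = [target]
    while path[-1] != source:
        x, y = path[-1]
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if dist.get(nb) == dist[(x, y)] - 1:
                path.append(nb)
                break
    path.reverse()
    return path


grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
path = lee_route(grid, (0, 0), (4, 3))
print(len(path) - 1)                         # 7
print(lee_route([[0, 1, 0]], (0, 0), (2, 0)))  # None
```

To route a multi-pin net sequentially, a router of this kind would mark each finished path as blocked for other nets and connect the remaining pins one at a time; a bend-aware cost function would replace the unit step cost.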
6. Application fields
The first indications of using layout techniques in computer aided schematics design can be found in the early sixties. Lee [8] presented a small electric diagram automatically routed by his well-known breadth-first search path finding algorithm. In the sixties we also find first efforts to use line generation [10, 13-17], symbol placement [18, 19] and partitioning [20] in the documentation of logic schematics. With the rapidly growing complexity of electronic components in the seventies, these first experiences were extended to integrated circuit design. In producing electric diagrams, refined layout techniques were developed, resulting in more aesthetic and readable drawings. One example is diagram-specific line routing by recognizing simple line patterns (pattern routing). So far the main application field of schematics layout has been the automatic generation and documentation of logic and electric diagrams. This is due to the pioneering role of microelectronics in the CAD/CAE/CAM era. However, CAS techniques are on their way to penetrating into new engineering disciplines. Several efforts are being made to document automation and control systems automatically [2, 21, 22]. Promising results have been obtained in graphical programming and documentation of programmable controllers by logic diagrams and control flow charts [21, 23]. Another field of interest in CAS is design specification and documentation by schematics, such as programming flow charts [24], data flow diagrams [25], PERT networks [26], Petri nets [6], and entity-relationship diagrams [27]. First experiences have been made in the automatic documentation of workshop drawings by applying channel routing to the generation of dimension lines [28].
Figure 6. An automatically generated sketch of a logic diagram.
Furthermore, in engineering as well as in scientific research we encounter the problem of embedding a graph or network in the plane. Again, schematics layout techniques provide an appropriate graphical representation. Finally, single CAS components such as a line router can be used as additional and efficient drawing tools in any graphics system [2].
7. Results
Our first application of schematics layout was the automatic documentation and re-documentation of programmable controllers by logic diagrams [21, 29]. Logic diagrams belong to the class of row schematics. Here new graph theoretic placement and channel routing methods were used to derive the correct graphics from the control programs describing the schematics structure. Figure 6 illustrates a sketch of an automatically generated logic diagram.
Figure 7. An automatically documented backboard wiring diagram.
Figure 8. An automatically routed electric diagram.
Based on the general routing system CARO, a number of different application packages for design and documentation purposes were developed. In Figure 7 we have the documentation result of a backboard wiring diagram. This graphic was derived automatically from structural information supplied by the preceding hardware design process. Furthermore, in Figure 8 an electric diagram is presented which was routed by CARO. Based on CARO, the system FLOWCAD [23] for efficient graphical programming and documentation of programmable controllers by so-called control flow charts was developed. Here interactive and/or automatic symbol placement, automatic routing of directed interconnections, and text handling were combined. Figure 9 shows a typical control flow chart generated by FLOWCAD. The application results presented show that general layout tools are in a position to generate a wide range of schematic documentation very efficiently. In our examples the efficiency in schematics documentation compared to conventional drawing methods increased by about 300% to 6,000%.
Figure 9. A control flow chart generated by FLOWCAD.
8. Conclusion
The promising results mentioned should make computer aided schematics a field of more intensive research including experts from different engineering and scientific disciplines. Future CAS developments will comprise new mathematical models and methods, rule-based techniques, generative computer graphics, flexible schematics description techniques, interfacing and the integration of postprocessing, such as list generation, simulation and manufacturing. In this way CAS is going to become a standard tool in many design and documentation processes. A so-called CAS system comprising all of these aspects is under development at the ZKI. Among these activities we focus our attention on new layout techniques (e.g., bus routing, free placement, partitioning, hierarchical design) as well as new application fields.
References
1. Plessow M, Simeonov P. Netlike schematics and their structure description. Proc VII Bilateral Workshop GDR-Italy on Informatics in Industrial Automation, Berlin, 31 Oct.-4 Nov. 1989: 144-163.
2. May M. On the layout of netlike schematics. (In German). Doctoral thesis (B). Berlin: Academy of Sciences, 1989.
3. May M. Computer-generated multi-row schematics. Comp-Aided Design 1985; 17(1): 25-29.
4. Ohtsuki T, ed. Layout design and verification. Amsterdam-New York-Oxford-Tokyo: North-Holland, 1986.
5. May M, Nehrlich W, Weese M, eds. Layout design - mathematical problems and procedures. (In German). Berlin: ZKI/AdW, 1988.
6. Rouzeyre B, Alali R. Automatic generation of logic schemata. Proc COMPINT'85 Conf 1985: 414-420.
7. May M, Szkatula K. On the bipartite crossing number. Control and Cybernetics 1988; 17(1): 85-98.
8. Lee CY. An algorithm for path connection and its application. IRE Trans on Electr Comp 1961; EC-10: 346-365.
9. May M, Doering S, Kluge S, Thiede F, Vigerske W. Automatic line routing on system documentations. ZKI-Report 82-1/90, Berlin, 1990.
10. Hightower DW. A solution to line-routing problems on the continuous plane. Proc 6th Design Autom Workshop 1969: 1-24.
11. Venkataraman VV, Wilcox CD. GEMS: an automatic layout tool for MIMOLA schematics. Proc 23rd Design Autom Conf 1986: 131-137.
12. Brennan RJ. An algorithm for automatic line routing on schematic drawings. Proc 12th Design Autom Conf 1975: 324-330.
13. Warburton CR. Automation of logic page printing. IBM Data Systems Division Techn Report No 00,720, 1961.
14. Dehaan WR. The Bell Telephone Laboratories automatic graphic schematic drawing program. Proc 3rd Design Autom Workshop 1966: 1-25.
15. Friedman TD. Alert: a program to produce logic designs from preliminary machine descriptions. IBM Research Report RC-1578, 1966.
16. Balducci EG. Automated logic implementation. Proc 23rd Nat Conf of the ACM 1968: 223-240.
17. Wise DK. LIDO-an integrated system for computer layout and documentation of digital electronics. Proc Int Conf on Comp Aided Design 1969: 72-81.
18. Rocket FA. A systematic method for computer simplification of logic diagrams. IRE Int Convention Record 1961; Part 2: 217-223.
19. Kalish HM. Machine aided preparation of electrical diagrams. Bell Lab Record 1963; 41(9): 338-345.
20. Roth JP. Systematic design of automata. Prep of the Fall Joint Computer Conf 1965; 27(1): 1093-1100.
21. May M. CAS approach to graphical programming and documentation of programmable controllers. Prep 4th IFAC Symp on Comp-Aided Design in Control Systems. Beijing, 23-25 Aug. 1988: 262-268.
22. Barker HA, Chen M, Townsend. Algorithms for transformations between block diagrams and signal flow graphs. Proc 4th IFAC Symp on Comp-Aided Design in Control Systems. Beijing, 23-25 Aug. 1988: 231-236.
23. May M, Thiede F. Rechnergestuetzter Entwurf und Dokumentation von SPS mittels Steuerungsablaufplaenen. Tagungsbeitr Rechnergest Entwurf binaerer Steuerungen. Dresden, 15. Mai 1990: 25-26.
24. Yamada A, Kawaguchi A, Takahashi K, Kato S. Microprogramming design support system. Proc 11th Design Autom Conf 1974: 137-142.
25. Batini C, Nardelli E, Tamassia R. A layout algorithm for data-flow diagrams. IEEE Trans Softw Eng 1986; SE-12(4): 538-546.
26. Sandanadurai R. Private communication. Dec. 1984.
27. Batini C, Talamo M, Tamassia R. Computer aided layout of entity-relationship diagrams. J Syst Softw 1984; 4: 163-173.
28. Iwainsky A, Kaiser D, May M. Computer graphics and layout design in documentation processes. To appear in: Computers and Graphics 1990; 14(3).
29. May M, Mennecke P. Layout of schematic drawings. Syst Anal Model Simul 1984; 1(4): 307-338.
Tools for Spectroscopy
E.J. Karjalainen (Editor), Scientific Computing and Automation (Europe) 1990 © 1990 Elsevier Science Publishers B.V., Amsterdam
CHAPTER 39
Developments in Scientific Data Transfer
A.N. Davies, H. Hillig, and M. Linscheid
ISAS, Institut für Spektrochemie und angewandte Spektroskopie, D-4600 Dortmund 1, FRG
Abstract
The development and publication of the JCAMP-DX standard transfer format for infrared spectra has opened up a new era of format standardization in spectroscopy. The acceptance of the standard by IR instrument manufacturers has ensured its broad implementation and has opened the way for a multitude of data exchange and comparison possibilities previously limited to the mass spectroscopists with their EPA format. The simplicity and success of this standard have brought about the call for, and the development of, similar JCAMP-DX compatible standard formats for structure information, NMR spectra, UV/Vis spectra, mass spectra and crystallographic information, so that data exchange between scientists can now be a simple matter of decoding ASCII files with standard software. In this paper we present some of our work as the Data Standards Test Center for the German Unified Spectroscopic Database Project, where we look into the current state of the implemented software handling these transfer standards and at future developments. Some of the benefits of this standardization are also discussed.
1. Background
The German government initiative “Informationssystem-Spektroskopie” has been running for several years with the aim of producing high quality spectroscopic databases and multi-spectroscopic software packages to make these data available. The Institut für Spektrochemie und angewandte Spektroskopie (ISAS) in Dortmund, FRG, has taken on a number of roles within this project:
i. Software development for spectra quality control, valuation and exchange.
ii. Spectral data quality evaluation in the fields of NMR and infrared spectroscopy.
iii. Collaboration in the development of an X-windows version of the software package “SpecInfo”.
iv. Coordination of University projects with Chemical Concepts GmbH.
v. Development of sections of the “SpecInfo” package, concentrating on the infrared components and advising on algorithm quality.
We are, however, primarily responsible for spectra collection and evaluation in the fields of infrared and NMR spectroscopy. Through the work carried out in this area the project team at ISAS has become all too acutely aware of the problems currently prevalent in the field of scientific data transfer. Following a ‘call for spectra’ in the forerunner project “Spektrendatenbanken-Verbundsystem”, scientists ran into difficulties when their request for data in any format on any media resulted in such a diverse collection of magnetic tapes and Winchester discs landing at the collection institute that a significant amount of the submitted data was not readable and ended up being returned to the submitting organizations. When ISAS took over the task of coordinating the collection and distribution of spectroscopic and other data, it was soon obvious that a significant amount of work was needed towards standardizing transfer formats.
Figure 1. Severe problems exist in the field of scientific data transfer between scientists with different operating systems and data stations.
2. Some problems and solutions
This diversity of internal formats has led to the somewhat sad situation where two scientists within the same company or research institute often find themselves unable to communicate with one another. The incompatibility of internal data storage formats implemented by different manufacturers is like a wall between scientists and prohibits the free flow of information (Fig. 1). One possible solution is of course to purchase equipment from a single manufacturer only, but this is rarely a viable option, as no manufacturer can possibly provide for all the needs of a large organization, and the dangers involved in becoming dependent on a single supplier of equipment are all too obvious. Many organizations have solved this problem by introducing their own organization-wide data storage format and writing their own software to convert the manufacturers' formats present within the organization into this common format. Such ‘Island’ solutions make life somewhat easier for those on the ‘Islands’ but require constant software maintenance. They also present the same difficulties when contact with the outside world is required, as in our project, where the same lack of conformity exists (Fig. 2).
Figure 2. Some organizations have introduced ‘Island’ solutions to overcome the data transfer problems internally, but this does not help external data transfer and the solution requires constant software maintenance.
Fortunately, standards for data transfer are now being developed and implemented which should make life much easier. In 1988 McDonald and Wilks published the specifications for a data transfer standard for infrared spectra and interferograms [1]. This format was developed under the Joint Committee on Atomic and Molecular Physical Data (JCAMP) and was given the name JCAMP-DX. The implementation of this standard by infrared equipment manufacturers and software houses now provides an alternative to the ‘single supplier’ solution, allowing transfer of spectra between infrared systems regardless of internal system format (Figs. 3 and 4).
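Why an ASCII transfer format of this kind is easy to decode with standard software can be illustrated with a minimal Python sketch that reads header records of the form ##LABEL=value. The sample header values and the function name parse_labeled_records are hypothetical, the rule that case, spaces, dashes, slashes and underlines in labels are ignored is an assumption about the specification [1], and the spectral data table itself is not handled.

```python
def parse_labeled_records(text):
    """Collect '##LABEL=value' records into a dictionary.

    Labels are normalized (upper case; spaces, dashes, slashes and
    underlines dropped), which is an assumption about the standard.
    """
    records = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue  # data lines and comments are skipped in this sketch
        label, separator, value = line[2:].partition("=")
        if not separator:
            continue  # not a labeled record
        key = "".join(ch for ch in label.upper() if ch not in " -/_")
        records[key] = value.strip()
    return records


# Hypothetical header, not taken from a real instrument file.
SAMPLE = """##TITLE=Hypothetical test spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##END=
"""

header = parse_labeled_records(SAMPLE)
print(header["DATATYPE"], header["XUNITS"])
```

Because every record is plain text, the same few lines work on any operating system, which is the property the ‘Island’ formats lack.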
Figure 3. The introduction of data transfer standards solves the problem of transfer of data between different operating systems.
Figure 4. The organization type ‘Island’ solutions can also communicate with one another by using a standard transfer format.
Figure 5. Two infrared spectral curves showing the problem of wrongly implemented software.
3. Some problems with the solution!
Unfortunately the JCAMP organization did not watch over the implementation of the standard amongst manufacturers, and this has led to a rather piecemeal and sometimes catastrophic array of so-called JCAMP-DX compatible software. The errors that have come to our attention were detailed recently [2] and include such unexpected problems as the failure of infrared manufacturers to tell the difference between Transmittance and Transmission. A more subtle error was the failure of a major manufacturer's software to convert the fractional laser wavelength x-axis increment of another manufacturer into the regular x-axis required by its own internal format. The rounding error produced resulted in a severe shift in the infrared band position at low wavenumbers (Fig. 5). A recent development in the right direction has been the interest taken in the development and maintenance of JCAMP-DX standards for specific techniques within the sphere of ASTM E49 (Computerization of Materials Property Data) [3].
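The effect of such a rounding error can be sketched numerically. The Python example below assumes that the abscissa is rebuilt as first_x + i * delta_x and that the receiving software rounds delta_x to two decimals; the spectral range, the number of points and the rounding precision are hypothetical values chosen for illustration and are not those of the software discussed in [2].

```python
def rebuild_x_axis(first_x, delta_x, n_points):
    """Rebuild the abscissa from its first value and a constant increment."""
    return [first_x + i * delta_x for i in range(n_points)]


# Hypothetical spectrum running from 4000 down to 400 cm^-1.
FIRST_X, LAST_X, N_POINTS = 4000.0, 400.0, 3736

exact_delta = (LAST_X - FIRST_X) / (N_POINTS - 1)  # fractional increment
rounded_delta = round(exact_delta, 2)              # assumed loss of precision

exact_axis = rebuild_x_axis(FIRST_X, exact_delta, N_POINTS)
shifted_axis = rebuild_x_axis(FIRST_X, rounded_delta, N_POINTS)

# The deviation is zero at the first point and accumulates along the axis.
shift = [abs(a - b) for a, b in zip(exact_axis, shifted_axis)]
print(f"shift at first point: {shift[0]:.2f} cm^-1")
print(f"shift at last point:  {shift[-1]:.2f} cm^-1")
```

Under these assumptions the deviation grows linearly with the point index, so a file that starts at high wavenumbers shows the largest displacement (roughly 14 cm^-1 here) at the low-wavenumber end.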
4. The demand for more standards
The usefulness of the JCAMP-DX standard has been clearly shown by the free adoption of the software written for infrared spectrometers by manufacturers in other fields of spectroscopy. That JCAMP has not succeeded in publishing standards in other fields two years after the first publication has meant that UV/Vis and NMR spectra are currently being exchanged with the JCAMP-DX label ##DATATYPE=Infrared Spectrum so that the available software will handle the data! This is obviously not desirable, and this year has seen much activity in the field of new JCAMP-DX compatible standard formats. The first to be published will be JCAMP-CS, a standard format for the exchange of chemical structures [4]. Several drafts currently exist for a JCAMP-DX standard in NMR spectroscopy [5, 6], which differ essentially only in the proposed method of storing the data curves themselves. The American proposal wishes to adopt a binary data storage format, whereas the European proposal retains the ASCII option to allow the data and header information to remain as one file and, more importantly, to avoid the enormous problems associated with transferring binary files between computers with different operating systems and internal word lengths.
The reason behind the pressure to allow binary storage in a transport format is one of file size. It is a generally held belief that a spectrum or FID coded as an ASCII file would be far bigger than the same spectrum or FID coded in binary. To probe this assumption several format tests were carried out, and the preliminary results are given below. Three NMR data files were taken and coded into several ASCII formats allowed by the JCAMP-DX specifications for infrared spectra and into a new format we have developed for NMR spectra. The three files were:
1. TESTFID, a noisy high resolution FID (noisy data sets cause size problems with JCAMP-DX ASDF data compression formats due to the lack of correlation between neighboring data points).
2. TESTSPEC, the transformed version of TESTFID.
3. TESTFID2, an FID with a good signal to noise ratio.
Figure 6. NMR file sizes relative to the original binary file, showing the reduction in size when good quality information is coded in the JCAMP-DX DIFDUP format. (Formats shown: BINARY, INTEGER, CLEANED, DIFDUP (X++(Y..Y)), DIFDUP (X++(R..R)), DIFDUP (X++(I..I)), DIFDUP (X++(RI..RI)); data sets: TESTSPEC, TESTFID, TESTFID2.)
Figure 7. The reduction in transfer time is even more drastic when binary and JCAMP-DX DIFDUP files are compared, due to the necessity of inserting control codes into binary files for transfer purposes. Transfers were between a 20 MHz AT-386 compatible personal computer running Kermit-MS Version 2.32/A, 21 Jan. 1989, and a DEC GPX running Vax Kermit-32, over an RS-232C line at 9600 baud.
All three data sets consisted of alternating real and imaginary data points, and only the Y-values are present in the original data set. The data were first converted to a fixed integer format and then compressed to remove superfluous blanks from the data file. The data were then converted to a standard JCAMP-DX DIFDUP format [1]. Normal FID data sets actually contain paired real and imaginary points, so there is no direct correlation between the value of a data point and that of its nearest neighbor in the data file, which is an assumption in JCAMP-DX DIFDUP coding. As far as the coding is concerned, this lack of correlation has an effect similar to that of large, fast, random noise signals and increases the file storage size significantly. To introduce some degree of correlation between neighboring points, the two data sets in each file were then separated into two files and coded independently in JCAMP-DX DIFDUP format. Finally, a new idea on FID coding was tested in which the two curves were DIFDUP encoded by applying the algorithm to each curve independently but leaving the pair-wise point storage. This format was given the new ASDF code (X++(RI..RI)) (see [1]).
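The influence of point ordering on difference coding can be illustrated with a simplified Python sketch. It is not the JCAMP-DX DIFDUP character coding itself: differences are simply written as signed decimal integers and their characters counted. The synthetic FID, its parameters and all function names are hypothetical.

```python
import math


def differences(values):
    """First value followed by successive differences (the idea behind DIF coding)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def text_length(numbers):
    """Characters needed to write the numbers as signed decimal integers."""
    return sum(len(str(n)) for n in numbers)


# Hypothetical, noise-free synthetic FID with real and imaginary channels.
N_POINTS = 512
real = [round(10000 * math.cos(0.05 * i) * math.exp(-i / 400)) for i in range(N_POINTS)]
imag = [round(10000 * math.sin(0.05 * i) * math.exp(-i / 400)) for i in range(N_POINTS)]

# Storage order of the original files: R I R I ...
interleaved = [v for pair in zip(real, imag) for v in pair]

# (a) differences taken across the interleaved sequence
naive = text_length(differences(interleaved))
# (b) the two curves separated and coded independently
separate = text_length(differences(real)) + text_length(differences(imag))
# (c) each curve differenced independently, but stored pair-wise
pairwise_diffs = [d for pair in zip(differences(real), differences(imag)) for d in pair]
pairwise = text_length(pairwise_diffs)

print("interleaved:", naive, "separate:", separate, "pair-wise:", pairwise)
```

In this sketch variant (c) needs exactly as many characters as variant (b) while keeping real and imaginary points together in one file, which is the motivation given above for the pair-wise form.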
The preliminary results were surprising and encouraging (Fig. 6). The worst case scenario showed only a doubling of the storage space required for the ASCII DIFDUP files, and for the good quality FID the ASCII format reduced the storage space relative to the binary file by more than a factor of 2! This would seem to invalidate the argument that binary files are necessary because of the excessive storage requirement for ASCII files, and leaves only the disadvantage of a non-transportable transport format if binary is adopted for data storage. For a transport format the network transport times are often more important than the actual file size, and here an even bigger advantage is shown by the good TESTFID2 DIFDUP file over the original binary file, as the transfer program needs to insert control codes into the binary file to facilitate transfer (Fig. 7).
5. Other new standards
Several other JCAMP compatible formats are also currently under discussion, including a proposal from the American Society of Mass Spectroscopists (ASMS) for a Mass Spectra Standard [7] and a specification for X-ray diffractograms from the International Centre for Diffraction Data [8]. Anyone interested in contributing to the development of these standards should contact the author for further information.
6. Standards for multidimensional experiments
These standards are excellent for single dimensional data, but the expansion of multidimensional experiments has revealed a weakness in the original JCAMP-DX concept. This can best be explained by taking the guidelines of the Coblentz Society for infrared reference spectra for GC-IR [9]. The spectral evaluations committee of the Coblentz Society names 34 mandatory labels and an additional 5 desired labels for each IR reference spectrum. Taking this as the norm for good data content, at least 39 lines of text should be added to each spectral curve. For a typical GC-IR experiment very little information actually changes between subsequent spectra except the retention time of the measurement and perhaps the oven temperature, but as that is programmed the information could be detailed at the beginning of the experiment anyway. This means that, if cross-referencing between spectra were possible, each spectrum in a GC-IR experiment following the first could be coded with only a two line header instead of 39, the two lines referencing the initial spectrum header file and the time of measurement or retention time. This type of block structure has been published in the Standard Molecular Data (SMD) Format developed by the European chemical and pharmaceutical companies [10]. Here complex files containing Scopes, Sections, Blocks, and Subblocks, all inter-referable, are defined, and a block structuring format of this nature is required within JCAMP-DX.
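The saving offered by cross-referenced headers follows from simple arithmetic, sketched below in Python. The counts of 39 and 2 header lines are taken from the text; the number of spectra in the run and the function name are hypothetical.

```python
FULL_HEADER_LINES = 39       # 34 mandatory + 5 desired labels per spectrum
REFERENCE_HEADER_LINES = 2   # reference to the first header + retention time


def header_lines(n_spectra, cross_referenced):
    """Total number of header lines for a run of n_spectra spectra."""
    if n_spectra == 0:
        return 0
    if not cross_referenced:
        return n_spectra * FULL_HEADER_LINES
    # Only the first spectrum carries the full header.
    return FULL_HEADER_LINES + (n_spectra - 1) * REFERENCE_HEADER_LINES


N_SPECTRA = 500  # hypothetical GC-IR run
print(header_lines(N_SPECTRA, cross_referenced=False))
print(header_lines(N_SPECTRA, cross_referenced=True))
```

For the assumed run of 500 spectra the header text shrinks from 19500 lines to 1037 lines.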
7. Conclusion
It can be seen from the flowering of new standards that great interest exists in the standardization of data transfer. The problems with the early JCAMP-DX implementations have shown the need for a watchdog organization for these standards, and with the involvement of ASTM and hopefully other standards organizations the future of improved data transfer looks good.
References
1. McDonald RS, Wilks PA. JCAMP-DX: A Standard Form for Exchange of Infrared Spectra in Computer Readable Form. Applied Spectroscopy 1988; 42(1): 151-162.
2. Davies AN, Hillig H, Linscheid M. JCAMP-DX, A Standard? Software-Development in Chemistry 4, J Gasteiger Ed. Springer Verlag, 1990.
3. McDonald RS. Private Communication. 7 May 1990.
4. Gasteiger J, Hendriks BMP, Hoever P, Jochum C, Somberg H. JCAMP-CS: A Standard Exchange Format for Chemical Structure Information in Computer Readable Form. Applied Spectroscopy, in press.
5. Davies AN. Proposal for a JCAMP-DX NMR Spectroscopy Standard. ISAS Dortmund, Postfach 10 13 52, 4600 Dortmund 1, FRG.
6. Thibault CG. Proposed JCAMP-DX Standard for NMR Data. Software Dept., Bruker Instruments Inc., Manning Park, Billerica MA 01821, USA.
7. Campbell S, Christopher R, Davis TS, Hegedus JKJ, James C, Onstot J, Stranz DD, Watt JG. A Data Exchange Format for Mass Spectrometry. ASMS; c/o David D. Stranz, Hewlett Packard, Scientific Instr. Div., 1601 California Ave., Palo Alto, California 94304, USA.
8. Dismore PF, Hamill GP, Holomany M, Jenkins R, Schreiner WN, Snyder RL, Toby RH (chair). Specifications for Storing X-Ray Diffractograms in a JCAMP-DX Compatible Format. Draft Document, September 1989; PDF-3 Task Group, JCPDS-International Centre for Diffraction Data, Swarthmore, PA, USA.
9. Kalasinsky KS, Griffiths PR, Gurka DF, Lowry SR, Boruta M. Coblentz Society Specifications for Infrared Reference Spectra of Materials in the Vapour Phase above Ambient Temperature. Applied Spectroscopy 1990; 44(2): 211-215.
10. Latest version: Barnard JM. Draft Specification for Revised Version of the Standard Molecular Data (SMD) Format. J Chem Inf Comput Sci 1990; 30: 81-96.
CHAPTER 40
Hypermedia Tools for Structure Elucidation Based on Spectroscopic Methods
M. Farkas, M. Cadisch, and E. Pretsch
Department of Organic Chemistry, Swiss Federal Institute of Technology, CH-8092 Zurich, Switzerland
Summary
SpecTool, a software package with hypermedia features, is a collection of 1H-NMR, 13C-NMR, MS, IR and UV/VIS data and reference spectra, as well as heuristic rules and computer programs used for the interpretation of such spectra. It can be looked at as an “electronic book”. Some of the pages contain only navigation tools and others mainly numerical and/or graphical information. From some pages programs can be started. A high degree of flexibility and forgivingness is built in, so that the same piece of information can be obtained in many different ways. It supports associative searching, i.e., “browsing and looking for something relevant”, a feature of “hypermedia” that is hardly possible with other types of computer programs. A series of simpler navigation tools helps to promote flexible usage and to avoid the feeling of “being lost” in the system. A distinct feature of SpecTool is that, in contrast to expert systems, it does not make decisions. It just shows or calculates data as proposals or aids for the decisions of the user.
1. Introduction
The interpretation of molecular spectra for the structure elucidation of organic compounds relies mainly on empirical correlations and heuristic rules as well as reference data and reference spectra. The necessary information is spread out over many printed volumes for the most relevant techniques: MS (mass spectrometry), NMR (nuclear magnetic resonance), IR (infrared) and UV/VIS (spectroscopy in the ultraviolet and visible spectral range). Because of the complementary nature of the information available from the individual spectroscopic methods, a multimethod view is especially powerful [1]. Although printed media are still the most frequently used sources of information, computer programs are more and more widely applied for various subtasks of the structure elucidation process. Most available programs, including spectroscopic databases, are out of reach of the majority of potential users. User interfaces that are not adequate for occasional users create a further barrier. Today's spectroscopists thus have a wealth of fragmented pieces of information distributed over many books and spectra catalogs as well as databases and other computer programs (running on different computers in different environments). Finding the necessary information often means manual searches in various books and catalogs and often takes a large amount of work.
The purpose of this contribution is to present a medium which accommodates the information necessary for the spectroscopist's everyday work within one unique environment. Its usage is as simple as using a book. The system contains interfaces to external programs and will at a later date also contain interfaces to external databases. It can be viewed as an electronic book which is also capable of performing calculations. It is a tool supporting the decisions of the spectroscopist. On purpose, however, it has no decision making features within the system. In this paper the overall structure and the most important features at the present state of the development are described.
2. Hardware and software
Software and hardware for the development of hypermedia became available recently [2, 3]. In hypermedia virtually any links can be made between discrete pieces of information (including computer programs). They can contain a large amount of information (data, spectra, programs) within one unique environment. Macintosh computers are rather widely available in chemical laboratories and their price is at the low end of computers for which adequate hypermedia tools are available.
Hypermedia is the combination of multimedia and hypertext in a computer system. Hypertext means that the user may read the information in more than one sequential way. The information is stored in many chunks. Pieces of information that are related to each other are connected by links. In a hypertext document the pieces of information are therefore embedded in a network of links. Information is provided both by what is stored in each node and by the way information nodes are linked to each other. Reading the document, the user can follow these links according to his interests and information needs. “Multimedia”, i.e., graphics, animation and sound, helps to present the data in a more flexible way and helps the user to be comfortable in the system.
Hypercard is a “hypermedia” developing and running tool kit for the object-based programming language HyperTalk. What the user sees on the screen is a card. On a card there can be text (stored in fields), pictures and so-called “buttons”. An action is invoked by clicking on a button. If the same object occurs on different cards, instead of storing it for each occurrence separately, it may be stored at one location called the background. Objects stored in the background are visible on all cards belonging to this particular background. The above-described objects are stored together in a file that is called a stack. Different stacks can easily be linked together. Buttons, fields, cards, backgrounds and stacks may contain scripts, as HyperTalk programs are called. If scripts are attached to objects they may create active areas on the screen that react, e.g., to actions performed with the mouse by the Hypercard user. HyperTalk scripts may modify card pictures, show or hide texts or buttons, show animation, play sounds, do calculations, call compiled programs written in other programming languages or navigate to other cards or stacks.
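The relationship between stacks, backgrounds, cards and buttons described above can be sketched as a small data model. The Python classes below are only an illustration and not HyperTalk; the class and attribute names, the card “Aromatic alcohols” and the use of a Python callable in place of a script are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Button:
    name: str
    script: Callable[[], str]  # stands in for a HyperTalk script

    def click(self) -> str:
        """Invoke the action attached to the button."""
        return self.script()


@dataclass
class Background:
    name: str
    buttons: List[Button] = field(default_factory=list)


@dataclass
class Card:
    name: str
    background: Background
    buttons: List[Button] = field(default_factory=list)

    def visible_buttons(self) -> List[Button]:
        """Background objects are stored once but appear on every card."""
        return self.background.buttons + self.buttons


@dataclass
class Stack:
    name: str
    cards: List[Card] = field(default_factory=list)


shared = Background("data cards", [Button("back", lambda: "go to previous card")])
stack = Stack("HNMR", [
    Card("Aliphatic alcohols", shared, [Button("CNMR", lambda: "go to CNMR stack")]),
    Card("Aromatic alcohols", shared),
])

for card in stack.cards:
    print(card.name, [button.name for button in card.visible_buttons()])
```

Storing the “back” button once in the background and resolving it at display time mirrors the space saving described in the text: a change to the background is seen on all cards that share it.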
3. Results and discussion
At present the system is still under development. Many features and overall structures have been established. A part of the reference data and reference spectra has been entered into the system. This experimental version of SpecTool uses over 5 MBytes of disk storage. A CD-ROM is envisaged as the storage medium for the distributed version.
3.1 Organization
The file structure of the system has been designed to provide transparency and to serve the development and maintenance of the system (Fig. 1). The main organization groups are the individual spectroscopic methods. A further group contains a collection of “tools”, i.e., programs for activities other than just navigation or information presentation. External programs which can be called from the Hypercard environment form a further group. Finally there is an overall organization unit, the “manager”.
Figure 1. Organization of SpecTool.
The logical structure, i.e., the structure seen by the user, can look quite different. It is designed to achieve transparency from the user's point of view. Several logical structures, e.g., for several types of users, can easily be added to one existing physical structure. At present one such logical structure exists. The top level structure, which appears to the user when he starts the system, can be imagined as a hierarchical network of tables of contents (Fig. 2). Every node is the table of contents of the next deeper level. A step to such a sublevel (achieved by a mouse-click on the corresponding item) shows the table of contents of the sub-sublevels. Technically this step can be done by a jump to another card or by the display of a further field within the same card. With this simple hierarchic organization the user can address a huge amount of data with a few steps. At the same time the inspection of the possibilities for the next step is fast, since only a limited amount of information is displayed on the screen at any moment. This structure is not only efficient but also easy to use intuitively, i.e., it avoids the feeling of getting lost in the system.
Figure 2. Logical organization of SpecTool.
The real strength and user friendliness are achieved by adding further connections between selected points of the hierarchy (represented symbolically with dotted lines in Fig. 2). These allow “orthogonal” movements. For example, the user can enter the system by selecting the Reference Data submenu (Fig. 3 and Fig. 4 top) and here the “HNMR of Alcohols” sub-submenu (Fig. 3 and Fig. 4 middle). With another step he can arrive at the data for aliphatic alcohols (Fig. 4 bottom). If he is now interested in 13C-NMR, IR, MS or UV/VIS data (all at the same hierarchical level but within different methods), or in reference spectra of the same type of compound, he does not need to go back in the hierarchy. With one mouse click he can address any of these items (bold lines in Fig. 3 and bottom line in Fig. 4 bottom).
Figure 3. Example of a simple navigation, showing some possible orthogonal movements at the data card “HNMR, Aliphatic alcohols” (bold lines). (Nodes shown: Top level menu; Reference Data menu with compound classes (Alkanes, Alkenes, ..., Alcohols, ...) and methods (MS; CNMR; HNMR; IR; UV/VIS); HNMR Reference Spectra; Aliphatic Alcohols Data for HNMR, IR and CNMR.)
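The difference between moving through the hierarchy and an “orthogonal” movement can be sketched in Python as follows. The card model, the menu names used for the path and the function names are assumptions made for illustration; they do not reproduce the actual HyperTalk implementation.

```python
from typing import NamedTuple

METHODS = ("MS", "CNMR", "HNMR", "IR", "UV/VIS")


class DataCard(NamedTuple):
    method: str
    compound_class: str
    subclass: str


def hierarchical_path(card):
    """Menus passed on the way from the top level down to a data card."""
    return [
        "Top level menu",
        "Reference Data menu",
        f"{card.method} submenu: {card.compound_class}",
        f"{card.method} data: {card.subclass}",
    ]


def clicks_via_hierarchy(start, target):
    """Steps needed when climbing up to the first shared menu and down again."""
    a, b = hierarchical_path(start), hierarchical_path(target)
    shared = 0
    while shared < min(len(a), len(b)) and a[shared] == b[shared]:
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def orthogonal_move(card, method):
    """One-click jump to the same compound class under another method."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return card._replace(method=method)


hnmr = DataCard("HNMR", "Alcohols", "Aliphatic alcohols")
cnmr = orthogonal_move(hnmr, "CNMR")
print(cnmr)
print(clicks_via_hierarchy(hnmr, cnmr), "steps via the hierarchy, 1 via the direct link")
```

In this model the direct link replaces two steps up and two steps down by a single click, which is the saving the dotted lines in the logical structure stand for.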
3.2 Navigation
Virtually any connections can be made between the cards. The only limitation is that the user must not be confused by being offered too many possible choices. The purpose of the design of the various navigation tools is thus to achieve the maximum number of possible choices while avoiding confusing complexity. One part of the navigation is a collection of structured paths through the system (these were described in the previous section). Further navigation tools are presented here.
On the lower part of the cards navigation buttons are included (Fig. 4 bottom). The buttons on the left-hand side lead to the corresponding reference data for the other spectroscopic methods (HNMR is written in inverse style because this card belongs to HNMR). Two general navigation buttons are found on each card of every stack. The first one is the “back-arrow” on the right-hand side (Fig. 4 bottom), which allows browsing backwards through the previously seen cards. Another button, “myCds” (= my cards), can be used to mark a card, i.e., to note the card name with its path name in a file. At any time this file can be consulted, and direct access to any of the marked cards is possible by clicking on its name. This feature is analogous to putting a bookmark at some pages in a printed book. The left-hand and right-hand arrow buttons are for browsing within logically connected pages. If there are no more related pages, a stop-bar appears (Fig. 4 bottom). Two final buttons are navigation tools related to the structure of the reference data files. The upward arrow leads to a logically higher level, from a data card (Fig. 4 bottom) to a submenu card (Fig. 4 middle). The system automatically saves the submenu (of the type shown in Figure 4 middle) which was opened last, and the upward arrow leads to this submenu card. The button “toMenu” brings the user to a main menu (Fig. 2).
Figure 4. Navigation within SpecTool. Top: Reference Data Menu. The user can select a chemical class (top part) and a spectroscopic method (bottom line). Middle: Card selected by choosing “-OH” and “HNMR” in the Reference Data Menu. Bottom: Card selected by choosing “Alkyl” of “Aliphatic Alcohols”.
As stated above, whenever sensible, direct access from one data card to the corresponding cards for the other spectroscopic methods is possible. Thus clicking on CNMR on the card shown in Figure 4 bottom directly shows the 13C-NMR data for aliphatic alcohols. In some cases no 1:1 correspondence is sensible. In such cases the selection of the corresponding method leads one level higher, i.e., to the submenu card “Alcohols” of the selected method, corresponding to the one shown in Figure 4 middle. Submenu cards exist for all compound classes of each method. A jump between them is therefore always possible. This overall organization results in three main types of cards:
1. Cards serving mainly or exclusively navigation
2. Cards mainly presenting data or spectra
3. Cards on which programs can be started to perform some kinds of calculations (see 3.4)
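The behaviour of the “back-arrow” and “myCds” buttons described in this section can be sketched with a history list and a bookmark list. The Python class below is a minimal illustration; its name, its methods and the in-memory bookmark list (SpecTool notes marked cards in a file) are assumptions.

```python
class Navigator:
    """Tracks the current card, the cards seen before it, and marked cards."""

    def __init__(self, start_card):
        self.current = start_card
        self.history = []    # previously seen cards, most recent last
        self.bookmarks = []  # cards marked with "myCds"

    def go_to(self, card):
        """Jump to a card, remembering where the user came from."""
        self.history.append(self.current)
        self.current = card

    def back(self):
        """Back-arrow: return to the previously seen card, if any."""
        if self.history:
            self.current = self.history.pop()
        return self.current

    def mark(self):
        """myCds: note the current card so it can be reopened directly."""
        if self.current not in self.bookmarks:
            self.bookmarks.append(self.current)

    def open_bookmark(self, index):
        """Direct access to a marked card by its position in the list."""
        self.go_to(self.bookmarks[index])


nav = Navigator("Reference Data menu")
nav.go_to("HNMR: Alcohols")
nav.go_to("HNMR: Aliphatic alcohols")
nav.mark()
nav.back()
nav.back()
nav.open_bookmark(0)
print(nav.current)
```

Keeping the history separate from the bookmarks means that browsing backwards never removes a marked card, just as a bookmark stays in a printed book while the reader leafs back.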
3.3 Selection and presentation of the information on a card
In many cases tables are of interest which contain so many items that only a part of them can be displayed on the screen. Such tables are placed in scrolling fields. Scrolling is the equivalent of a linear search in a printed table. SpecTool provides further possibilities for locating a table entry more efficiently. First of all, coarse indices are added to such tables (Fig. 5). They are either permanently on the cards or, if there is not enough room for them, they can be blended in upon a mouse-click. Clicking on an item in the coarse index scrolls the field to the region of this item.
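How a click on a coarse index entry can be turned into a scroll position is sketched below in Python with a binary search over a sorted table. The short isotope list, the index marks and the function name are illustrative assumptions and not the contents of the SpecTool table.

```python
import bisect

# Illustrative excerpt of a table ordered by increasing mass: (mass, isotope).
ROWS = [
    (1.0078, "1H"), (2.0141, "2H"), (12.0000, "12C"), (13.0034, "13C"),
    (14.0031, "14N"), (15.9949, "16O"), (34.9689, "35Cl"), (36.9659, "37Cl"),
    (78.9183, "79Br"), (80.9163, "81Br"),
]
MASSES = [mass for mass, _ in ROWS]

# Hypothetical coarse index: lower bounds of the mass ranges offered to the user.
COARSE_INDEX = [0, 10, 20, 50]


def scroll_position(mark):
    """Row to which the field scrolls when a coarse index mark is clicked."""
    return bisect.bisect_left(MASSES, mark)


for mark in COARSE_INDEX:
    row = scroll_position(mark)
    print(f"mass >= {mark}: first row shown is {ROWS[row][1]}")
```

A second index, such as the alphabetical list of element symbols, could be served in the same way by a second sorted key list that points into the same rows.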
Figure 5. Table of naturally occurring isotopes. Top: Card as displayed upon opening. Right side: table entries ordered according to increasing masses in a scrolling field. Left side of the table: coarse index for the selection of mass ranges. Leftmost: a scrolling field with an alphabetically ordered list of element symbols and the corresponding coarse index. Bottom: Upon selection of an element in the table of isotopes the isotope abundances are displayed graphically.
If sensible, various indices can be added to one table. The table of isotopes (Fig. 5) exhibits three different indices. The table is ordered according to increasing masses of the elements. A coarse index on the left-hand side of the table can be used for the selection of a mass range (this saves scrolling time). The next index is an alphabetically ordered list of the element symbols. Finally, the leftmost column is a coarse index of this list. With these tools the selection of an element can be accomplished in various ways. Another example is shown in Figure 6. Here about 250 proton chemical shifts are compiled within one table. The primary order is a list of substituents (y-axis) and skeletons (x-axis). A coarse index of the substituents is blended in through a mouse-click on the button “to choose” in the table heading (Fig. 6 middle). Such an order is not ideal if
[Figure 6: table “1H-Chemical Shifts in Monosubstituted Alkanes”]