RELATIONAL MANAGEMENT and DISPLAY of SITE ENVIRONMENTAL DATA

David W. Rich, Ph.D.
LEWIS PUBLISHERS
A CRC Press Company
Boca Raton   London   New York   Washington, D.C.
Library of Congress Cataloging-in-Publication Data

Rich, David William, 1952–
    Relational management and display of site environmental data / David W. Rich.
        p. cm.
    Includes bibliographical references and index.
    ISBN 1-56670-591-6 (alk. paper)
    1. Pollution—Measurement—Data processing. 2. Environmental monitoring—Data processing. 3. Database management. I. Title.
TD193 .R53 2002
628.5′028′7—dc21                                        2002019441
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2002 by CRC Press LLC
Lewis Publishers is an imprint of CRC Press LLC

No claim to original U.S. Government works
International Standard Book Number 1-56670-591-6
Library of Congress Card Number 2002019441
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
PREFACE
The environmental industry is changing, along with the way it manages data. Many projects are making a transition from investigation through remediation to ongoing monitoring. Data management is evolving from individual custom systems for each project to standardized, centralized databases, and many organizations are starting to realize the cost savings of this approach. The objective of Relational Management and Display of Site Environmental Data is to bring together in one place the information necessary to manage the data well, so everyone, from students to project managers, can learn how to benefit from better data management.

This book has come from many sources. It started out as a set of course notes to help transfer knowledge about earth science computing, and especially environmental data management, to our clients as part of our software and consulting practice. While it is still used for that purpose, it has evolved into a synthesis of theory and a relation of experience in working with site environmental data. It is not intended to be the last word on the way things are or should be done, but rather to help people learn from the experience of others, and avoid mistakes whenever possible.

The book has six main parts plus appendices. Part One provides an overview of the subject and some general concepts, including a discussion of system data content. Part Two covers system design and implementation, including database elements, user interface issues, and implementation and operation of the system. Part Three addresses gathering the data, starting with an overview of site investigation and remediation, progressing through gathering samples in the field, and ending with laboratory analysis. Part Four covers the data management process, including importing, editing, maintaining data quality, and managing multiple projects. Part Five is about using the data once it is in the database. It starts with selecting data, and then covers various aspects of data output and analysis, including reporting and display; graphs; cross sections and similar displays; a large chapter on mapping and GIS; statistical analysis; and integration with other programs. Part Six discusses problems, benefits, and successes with implementing a site environmental data management system, along with an attempt to look into the future of data management and environmental projects. Appendices include examples of a needs assessment, a data model, a data transfer standard, typical constituent parameters, some exercises, a glossary, and a bibliography.

A number of people have contributed directly and indirectly to this book, including my parents, Dr. Robert and Audrey Rich; Dr. William Fairley, my uncle and professor of geology at the University of Notre Dame; and Dr. Albert Carozzi, my advisor and friend at the University of Illinois. Numerous coworkers and friends at Texaco, Inc., Shell Oil Company, Sabine Corporation, Grant Environmental, and Geotech Computer Systems, Inc. helped bring me to the point professionally where I could write this book. These include Larry Ratliff, Jim Thomson, Dr. James L. Grant, Neil Geitner, Steve Wampler, Jim Quin, Cathryn Stewart, Bill Thoen, Judy Mitchell, Dr. Mike Wiley, and other Geotech staff members who helped with the book in various ways. Friends
in other organizations have also helped me greatly in this process, including Jim Reed of RockWare, Tom Bresnahan of Golden Software, and other early members of the Computer Oriented Geological Society. Thanks also go to Dr. William Ganus, Roy Widmann, Sherron Hendricks, and Frank Schultz of Kerr-McGee for their guidance. I would also like to specifically thank those who reviewed all or part of the book, including Cathryn Stewart (AquAeTer), Bill Thoen (GISNet), Mike Keester (Oklahoma State University), Bill Ganus and Roy Widmann (Kerr-McGee), Mike Wiley (The Consulting Operation), and Sue Stefanosky and Steve Clough (Roy F. Weston). The improvements are theirs. The errors are still mine.

Finally, my wife, business partner, and best friend, Toni Rich, has supported me throughout my career, hanging in there through the good times and bad, and has always done what she could to make our enterprise successful. She’s also a great proofreader.

Throughout this book a number of trademarks and registered trademarks are used. The registered trademarks are registered in the United States, and may be registered in other countries. Any omissions are unintentional and will be remedied in later editions. Enviro Data and Spase are registered trademarks of Geotech Computer Systems, Incorporated. Microsoft, Office, Windows, NT, Access, SQL Server, Visual Basic, Excel, and FoxPro are trademarks or registered trademarks of Microsoft Corporation. Oracle is a registered trademark of Oracle Corporation. Paradox and dBase are registered trademarks of Borland International, Incorporated. IBM and DB2 are registered trademarks of International Business Machines Corporation. AutoCAD and AutoCAD Map are registered trademarks of Autodesk, Incorporated. ArcView is a registered trademark of Environmental Systems Research Institute, Incorporated. Norton Ghost is a trademark of Symantec Corporation. Apple and Macintosh are registered trademarks of Apple Computer, Incorporated. Sun is a registered trademark and Sparcstation is a trademark of Sun Microsystems. Capability Maturity Model and CMM are registered trademarks of The Software Engineering Institute of Carnegie Mellon University. Adobe and Acrobat are registered trademarks of Adobe Systems. Grapher is a trademark and Surfer is a registered trademark of Golden Software, Inc. RockWare is a registered trademark and RockWorks and Gridzo are trademarks of RockWare, Inc. Intergraph and GeoMedia are trademarks of Intergraph Corporation. Corel is a trademark and Corel Draw is a registered trademark of Corel Corporation. UNIX is a registered trademark of The Open Group. Linux is a trademark of Linus Torvalds. Use of these products is for illustration only, and does not signify endorsement by the author.

A Web site has been established for updates, exercises, and other information related to this book. It is located at www.geotech.com/relman. I welcome your comments and questions. I can be reached by email at [email protected].

David W. Rich
AUTHOR
David W. Rich is founder and president of Geotech Computer Systems, Inc. in Englewood, CO. Geotech provides off-the-shelf and custom software and consulting services for environmental data management, GIS, and other technical computing projects.

Dr. Rich received his B.S. in Geology from the University of Notre Dame in 1974, and his M.S. and Ph.D. in Geology from the University of Illinois in 1977 and 1979, with his dissertation on “Porosity in Oolitic Limestones.” He worked for Texaco, Inc. in Tulsa, OK and Shell Oil Company in Houston, TX, exploring for oil and gas in Illinois and Oklahoma. He then moved to Sabine Corporation in Denver, CO as part of a team that successfully explored for oil in the Minnelusa Formation in the Powder River Basin of Wyoming. He directed the data management and graphics groups at Grant Environmental in Englewood, CO, where he worked on several projects involving soil and groundwater contaminated with metals, organics, and radiologic constituents. His team created automated systems for mapping and cross section generation directly from a database. In 1986 he founded Geotech Computer Systems, Inc., where he has developed and supervised the development of custom and commercial software for data management, GIS, statistics, and Web data access.

Environmental projects with which Dr. Rich has been directly involved include two Superfund wood treating sites, three radioactive material processing facilities, two hazardous waste disposal facilities, many municipal solid waste landfills, two petroleum refineries, and several mining and petroleum production and transportation projects. He has been the lead developer on three public health projects involving blood lead and related data, including detailed residential environmental measurements. In addition he has been involved in many projects outside of the environmental field, including a real-time Web-based weather mapping system, an agricultural GIS analysis tool, and database systems for petroleum exploration and production data, paleontological data, land ownership, health care tracking, parts inventory and invoice printing, and GPS data capture.

Dr. Rich has been using computers since 1970, and has been applying them to earth science problems since 1975. He was a co-founder and president of the Computer Oriented Geological Society in the early 1980s, and has authored or co-authored more than a dozen technical papers, book chapters, and journal articles on environmental and petroleum data management, geology, and computer applications. He has taught many short courses on geological and environmental computing in several countries, and has given dozens of talks at various industry conventions and other events. When he is not working, Dr. Rich enjoys spending time with his family and riding his motorcycle in the mountains, and often both at the same time.
CONTENTS
PART ONE - OVERVIEW AND CONCEPTS ..... 1
CHAPTER 1 - OVERVIEW OF ENVIRONMENTAL DATA MANAGEMENT ..... 3
    Concern for the environment ..... 3
    The computer revolution ..... 5
    Convergence - Environmental data management ..... 7
    Concept of data vs. information ..... 8
    EMS vs. EMIS vs. EDMS ..... 8
CHAPTER 2 - SITE DATA MANAGEMENT CONCEPTS ..... 11
    Purpose of data management ..... 11
    Types of data storage ..... 12
    Responsibility for data management ..... 18
    Understanding the data ..... 19
CHAPTER 3 - RELATIONAL DATA MANAGEMENT THEORY ..... 21
    What is relational data management? ..... 21
    History of relational data management ..... 21
    Data normalization ..... 22
    Structured Query Language ..... 26
    Benefits of normalization ..... 30
    Automated normalization ..... 31
CHAPTER 4 - DATA CONTENT ..... 35
    Data content overview ..... 35
    Project technical data ..... 36
    Project administrative data ..... 39
    Project document data ..... 41
    Reference data ..... 42
    Document management ..... 43
PART TWO - SYSTEM DESIGN AND IMPLEMENTATION ..... 47
CHAPTER 5 - GENERAL DESIGN ISSUES ..... 49
    Database management software ..... 49
    Database location options ..... 50
    Distributed vs. centralized databases ..... 56
    The data model ..... 59
    Data access requirements ..... 61
    Government EDMS systems ..... 63
    Other issues ..... 64
CHAPTER 6 - DATABASE ELEMENTS ..... 69
    Hardware and software components ..... 69
    Units of data storage ..... 75
    Databases and files ..... 76
    Tables (“databases”) ..... 76
    Fields (columns) ..... 78
    Records (rows) ..... 79
    Queries (views) ..... 79
    Other database objects ..... 80
CHAPTER 7 - THE USER INTERFACE ..... 85
    General user interface issues ..... 85
    Conceptual guidelines ..... 86
    Guidelines for specific elements ..... 90
    Documentation ..... 91
CHAPTER 8 - IMPLEMENTING THE DATABASE SYSTEM ..... 93
    Designing the system ..... 93
    Buy or build? ..... 97
    Implementing the system ..... 99
    Managing the system ..... 103
CHAPTER 9 - ONGOING DATA MANAGEMENT ACTIVITIES ..... 107
    Managing the workflow ..... 107
    Managing the data ..... 109
    Administering the system ..... 110
PART THREE - GATHERING ENVIRONMENTAL DATA ..... 115
CHAPTER 10 - SITE INVESTIGATION AND REMEDIATION ..... 117
    Overview of environmental regulations ..... 117
    The investigation and remediation process ..... 119
    Environmental Assessments and Environmental Impact Statements ..... 121
CHAPTER 11 - GATHERING SAMPLES AND DATA IN THE FIELD ..... 123
    General sampling issues ..... 123
    Soil ..... 126
    Sediment ..... 127
    Groundwater ..... 127
    Surface water ..... 130
    Decontamination of equipment ..... 131
    Shipping of samples ..... 131
    Air ..... 131
    Other media ..... 132
    Overview of parameters ..... 133
CHAPTER 12 - ENVIRONMENTAL LABORATORY ANALYSIS ..... 139
    Laboratory workflow ..... 139
    Sample preparation ..... 140
    Analytical methods ..... 141
    Other analysis issues ..... 145
PART FOUR - MAINTAINING THE DATA ..... 149
CHAPTER 13 - IMPORTING DATA ..... 151
    Manual entry ..... 151
    Electronic import ..... 153
    Tracking imports ..... 163
    Undoing an import ..... 164
    Tracking quality ..... 165
CHAPTER 14 - EDITING DATA ..... 167
    Manual editing ..... 167
    Automated editing ..... 168
CHAPTER 15 - MAINTAINING AND TRACKING DATA QUALITY ..... 173
    QA vs. QC ..... 173
    The QAPP ..... 173
    QC samples and analyses ..... 175
    Data quality procedures ..... 181
    Database support for data quality and usability ..... 186
    Precision vs. accuracy ..... 187
    Protection from loss ..... 188
CHAPTER 16 - DATA VERIFICATION AND VALIDATION ..... 191
    Types of data review ..... 191
    Meaning of verification ..... 191
    Meaning of validation ..... 193
    The verification and validation process ..... 193
    Verification and validation checks ..... 194
    Software assistance with verification and validation ..... 195
CHAPTER 17 - MANAGING MULTIPLE PROJECTS AND DATABASES ..... 199
    One file or many? ..... 199
    Sharing data elements ..... 201
    Moving between databases ..... 201
    Limiting site access ..... 202
PART FIVE - USING THE DATA ..... 203
CHAPTER 18 - DATA SELECTION ..... 205
    Text-based queries ..... 205
    Graphical selection ..... 207
    Query-by-form ..... 210
CHAPTER 19 - REPORTING AND DISPLAY ..... 213
    Text output ..... 213
    Formatted reports ..... 214
    Formatting the result ..... 216
    Interactive output ..... 223
    Electronic distribution of data ..... 224
CHAPTER 20 - GRAPHS ..... 225
    Graph overview ..... 225
    General concepts ..... 226
    Types of graphs ..... 227
    Graph examples ..... 228
    Curve fitting ..... 232
    Graph theory ..... 233
CHAPTER 21 - CROSS SECTIONS, FENCE DIAGRAMS, AND 3-D DISPLAYS ..... 235
    Lithologic and wireline logs ..... 235
    Cross sections ..... 237
    Profiles ..... 238
    Fence diagrams and stick displays ..... 239
    Block diagrams and 3-D displays ..... 240
CHAPTER 22 - MAPPING AND GIS ..... 243
    Mapping concepts ..... 243
    Mapping software ..... 251
    Displaying data ..... 254
    Contouring and modeling ..... 256
    Specialized displays ..... 262
CHAPTER 23 - STATISTICS AND ENVIRONMENTAL DATA ..... 269
    Statistical concepts ..... 269
    Types of statistical analyses ..... 273
    Outliers and comparison with limits ..... 275
    Toxicology and risk assessment ..... 277
CHAPTER 24 - INTEGRATION WITH OTHER PROGRAMS ..... 279
    Export-import ..... 279
    Digital output ..... 282
    Export-import advantages and disadvantages ..... 282
    Direct connection ..... 283
    Data warehousing and data mining ..... 285
    Data integration ..... 286
PART SIX - PROBLEMS, BENEFITS, AND SUCCESSES ..... 287
CHAPTER 25 - AVOIDING PROBLEMS ..... 289
    Manage expectations ..... 289
    Use the right tool ..... 290
    Prepare for problems with the data ..... 291
    Plan project administration ..... 292
    Increasing the chance of a positive outcome ..... 292
CHAPTER 26 - SUCCESS STORIES ..... 293
    Financial benefits ..... 293
    Technical benefits ..... 295
    Subjective benefits ..... 296
CHAPTER 27 - THE FUTURE OF ENVIRONMENTAL DATA MANAGEMENT ..... 299
PART SEVEN - APPENDICES ..... 301
APPENDIX A - NEEDS ASSESSMENT EXAMPLE ..... 303
APPENDIX B - DATA MODEL EXAMPLE ..... 307
    Introduction ..... 307
    Conventions ..... 307
    Primary tables ..... 308
    Lookup tables ..... 312
    Reference tables ..... 321
    Utility tables ..... 324
APPENDIX C - DATA TRANSFER STANDARD ..... 327
    Purpose ..... 327
    Database background information ..... 327
    Data content ..... 328
    Acceptable file formats ..... 332
    Submittal requirements ..... 334
    Non-conforming data ..... 335
APPENDIX D - THE PARAMETERS ..... 337
    Overview ..... 337
    Inorganic parameters ..... 338
    Organic parameters ..... 340
    Other parameters ..... 347
    Method reference ..... 348
APPENDIX E - EXERCISES ..... 357
    Database redesign exercise ..... 357
    Data normalization exercise ..... 359
    Group discussion - data management and your organization ..... 360
    Database redesign exercise solution ..... 360
    Data normalization exercise solution ..... 361
    Database software exercises ..... 361
APPENDIX F - GLOSSARY ..... 363
APPENDIX G - BIBLIOGRAPHY ..... 407
INDEX ..... 419
PART ONE - OVERVIEW AND CONCEPTS
CHAPTER 1 OVERVIEW OF ENVIRONMENTAL DATA MANAGEMENT
Concern for our environment has been on the rise for many years, and rightly so. At many industrial facilities and other locations, toxic or potentially toxic materials have been released into the environment in large amounts. While the health impact of these releases has been quite variable and, in some cases, controversial, it is clearly important to understand the impact or potential impact of these releases on the public, as well as on the natural environment. This has led to increased study of the facilities and the areas around them, which has generated a large amount of data. More and more, people are looking to sophisticated database management technology, together with related technologies such as geographic information systems and statistical analysis packages, to make sense of this data. This chapter discusses the increasing concern for the environment and the growth of computer technology to support environmental data management, and then offers some general thoughts on environmental data management in an organization.
CONCERN FOR THE ENVIRONMENT

The United States federal government has been regulating human impact on the environment for over a century. Section 13 of the River and Harbor Act of 1899 made it unlawful (with some exceptions) to put any refuse matter into navigable waters (Mackenthun, 1998, p. 20). Since then hundreds of additional laws have been enacted to protect the environment. This regulation occurs at all levels of government, from international treaties, through federal and state governments, to individual municipalities. Often this multiple regulatory oversight results in a maze of regulations that makes even legitimate efforts to improve the situation difficult, but it has definitely increased the effort to clean up the environment and keep it clean.

Through the 1950s the general public had very little awareness of or concern about environmental issues. In the 1960s concern for the environment began to grow, helped at least in part by the book Silent Spring by Rachel Carson (Carson, 1962). The ongoing significance of this book is highlighted by the fact that a 1994 edition has a foreword by then Vice President Al Gore. In this book Ms. Carson brought attention to the widespread and sometimes indiscriminate use of DDT and other chlorinated hydrocarbons, organic phosphates, arsenic, and other materials, and the impact of this use on ground and surface water, soil, plants, and animals. She cites examples of workers overcome by exposure to large doses of chemicals, and changes in animal populations after use of these chemicals, to build the case that widespread use of these materials is harmful. She also discusses the link between these chemicals and cancer.
Rachel Carson’s message about concern for the environment came at a time, the 1960s, when America was ready for a “back-to-the-earth” message. With the youth of America and others organizing to oppose the war in Vietnam, the two causes fit well together and encouraged each other’s growth. This was reflected in the music of the time, with many songs in the sixties and seventies discussing environmental issues, often combined with sentiments against the war and nuclear power. The war in Vietnam ended, but the environmental movement lives on.

There are many examples of rock songs of the sixties and seventies discussing environmental issues. In 1968 the rock musical Hair warned about the health effects of sulfur dioxide and carbon monoxide. Zager and Evans in their 1969 song In The Year 2525 talked about taking from the earth and not giving back, and in 1970 the Temptations discussed air pollution and many other social issues in Ball of Confusion. Three Dog Night also warned about air pollution in their 1970 songs Cowboy and Out in the Country. Perhaps the best example of a song about the environment is Marvin Gaye’s 1971 song Mercy Mercy Me (The Ecology), in which he talked about oil polluting the ocean, mercury in fish, and radiation in the air and underground. In 1970 Joni Mitchell told farmers not to use DDT in her song Big Yellow Taxi, and the incomparable songwriter Bob Dylan got into the act with his 1963 song A Hard Rain’s A-Gonna Fall, warning about poison in water and global hunger. It’s not a coincidence that this time frame overlaps all of the significant early environmental regulations.

A good example of an organized environmental effort that started in those days and continues today is Earth Day. Organized by Senator Gaylord Nelson and patterned after teach-ins against the war in Vietnam, the first Earth Day was held on April 22, 1970, and an estimated 20 million people around the country attended, according to television anchor Walter Cronkite. In the 10 years after the first Earth Day, 28 significant pieces of federal environmental legislation were passed, along with the establishment of the U.S. Environmental Protection Agency (EPA) in December of 1970. The first major environmental act, the National Environmental Policy Act of 1969 (NEPA), predated Earth Day, and had the stated purposes (Yost, 1997) of establishing harmony between man and the environment; preventing or eliminating damage to the environment; stimulating the health and welfare of man; enriching the understanding of ecological systems; and establishment of the Council on Environmental Quality. Since that act, many laws protecting the environment have been passed at the national, state, and local levels.

Evidence that public interest in environmental issues is still high can be found in the public reaction to the book A Civil Action (Harr, 1995). This book describes the experience of people in the town of Woburn, Massachusetts. A number of people in the town became ill and some died due to contamination of groundwater with TCE, an industrial solvent. This book made the New York Times bestseller list, and was later made into a movie starring John Travolta. More recently, the movie Erin Brockovich, starring Julia Roberts, covered a similar issue in California involving Pacific Gas and Electric and hexavalent chromium in groundwater causing serious health problems. Public interest in the environment is exemplified by the various watchdog organizations that track environmental issues in great detail. A good example of this is Scorecard.org (Environmental Defense, 2001), a Web site that provides a very large amount of information on environmental current events, releases of toxic substances, environmental justice, and similar topics. For example, on this site you can find the largest releasers of pollutants near your residence. Sites like this definitely raise public awareness of environmental issues.

It’s also important to point out that the environmental industry is big business. According to reports by the U.S. Department of Commerce and Environmental Business International (as quoted in Diener, Terkla, and Cooke, 2000), the environmental industry in the U.S. in 1998 had $188.7 billion in sales, up 1.6% from the previous year. It employed 1,354,100 people in 115,850 companies. The worldwide market for environmental goods and services for the same period was estimated to be $484 billion.
Figure 1 - The author (front row center) examining state-of-the-art punch card technology in 1959
THE COMPUTER REVOLUTION

In parallel with growing public concern for the environment has been the growth of technology to support a better understanding of environmental conditions. While people have been using computing devices of some sort for over a thousand years, and mainframe computers since the 1950s (see the Environmental Computing History Timeline sidebar), the advent of personal computers in the 1980s made it possible to use them effectively on environmental projects. For more information on the history of computers, see Augarten (1984) and Evans (1981). Discussions of the history of geological use of computers are contained in Merriam (1983, 1985). With the advent of Windows-based, consumer-oriented database management programs in the 1990s, the tools were in place to create an environmental data management system (EDMS) to store data for one or more facilities and use it to improve project management.

Computers have assumed an increasingly important role in our lives, both at work and at home. The average American home contains more computers than bathtubs. From electronic watches to microwave ovens, we are using computers of one type or another for a significant percentage of our waking hours. In the workplace, computers have changed from big number crunchers cloistered somewhere in a climate-controlled environment to something that sits on our desk (or our lap). No longer are computers used only for massive computing jobs that could not be done by hand; they are now replacing the manual way of doing our daily work. This is as true in the earth science disciplines as anywhere else. Consequently, industry sages have suggested that those who do not have computer skills will be left behind in the next wave of automation of the workplace. At the least, those who are computer aware will be in a better position to evaluate how computers can help them in their work.

Environmental Computing History Timeline
1000 BC – The abacus was invented (still in use).
1623 – The first mechanical calculator was invented by German professor Wilhelm Schickard.
1834 – Charles Babbage began work on the Analytical Engine, which was never completed.
1850 – Charles Lyell was the first person to use statistics in geology.
1876 – Alexander Graham Bell patented the telephone.
1890 – Herman Hollerith built the Tabulating Machine, the first successful mechanical calculating machine.
1899 – The River and Harbor Act was the first environmental law passed in the United States.
1943 – The Mark I, an electromechanical calculator, was developed.
1946 – ENIAC (Electronic Numerical Integrator and Computer) was completed. (Dick Tracy’s wrist radio also debuted in the comic strip.)
1947 – The transistor was invented by Bardeen, Brattain, and Shockley at Bell Labs.
1951 – UNIVAC, the first commercial computer, became available.
1952 – Digital plotters were introduced.
1958 – The integrated circuit was invented by Jack Kilby at Texas Instruments.
1962 – Rachel Carson’s Silent Spring was published, starting the environmental movement.
1965 – An IBM white paper on computerized contouring appeared.
1969 – The National Environmental Policy Act (NEPA) was enacted.
1970 – The first Earth Day was held.
1970 – Relational data management was described by Edgar Codd.
1971 – The first microprocessor, the Intel 4004, was introduced.
1973 – SQL was introduced by Boyce and Chamberlin.
1977 – The Apple II, the first widely accepted personal computer, was introduced.
1981 – IBM released its Personal Computer, the computer that legitimized small computers for business use.
1984 – The Macintosh computer was introduced, the first significant use of a graphical user interface on a personal computer.
1985 – Windows 1.0 was released.
1990 – Microsoft shipped Windows 3.0, the first widely accepted version.
1994 – Netscape Navigator was released by Mosaic Communications, leading to widespread use of the World Wide Web.

The growth that we have seen in computer processing power is related to Moore’s law (Moore, 1965; see also Schaller, 1996), which states that the capacity of semiconductor memory doubles every 18 months. The price-performance ratio of most computer components meets or exceeds this law over time. For example, I bought a 10 megabyte hard drive in 1984 for $800. In 2001 I could buy a 20 gigabyte hard drive for $200, a price-performance increase of 8000 times in 17 years. This averages to a doubling about every 16 months. Over the same time, PC processing speed increased from 4 megahertz for $5000 to 1000 megahertz for $1000, an increase of 1250 times, or a doubling about every 20 months. These average to about 18 months. So computers become twice as powerful every year and a half, obeying Moore’s law.

Unlike 10 or especially 20 years ago, it is now usual in industrial and engineering companies for each employee to have a suitable computer on his or her desk, and for that computer to be networked to other people’s computers and often a server. This computing environment is a good base on which to build a data management system.
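For readers who want to check the doubling-time arithmetic above, it follows directly from the price-performance ratios and the 17-year (204-month) interval given in the example; no other assumptions are needed:

doubling time = (months elapsed) / log2(price-performance ratio)

Hard drive: ratio = ($800 / $200) × (20 GB / 10 MB) = 4 × 2,000 = 8,000
            204 / log2(8,000) = 204 / 12.97 ≈ 16 months per doubling

Processor:  ratio = ($5,000 / $1,000) × (1,000 MHz / 4 MHz) = 5 × 250 = 1,250
            204 / log2(1,250) = 204 / 10.29 ≈ 20 months per doubling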
As the hardware has developed, so has the data management software. It is now possible to outfit an organization with the software for a client-server data management system starting at $1,000 or $2,000 a seat. Users probably already have the hardware. Adding software customization, training, support, and other costs still allows a powerful data management system to be put in place for a cost which is acceptable for many projects.

In general, computers perform best when problem solving calls for either deductive or inductive reasoning, and poorly when using subjective reasoning. For example, calculating a series of stratigraphic horizon elevations where the ground level elevation and the depth to the formation are known is an example of deductive reasoning. Computers perform optimally on problems requiring deductive reasoning because the answers are precise, requiring explicit computations. Estimating the volume of contamination or contouring a surface is an example of inductive reasoning. Inductive reasoning is less precise, and requires a skilled geoscientist to critique and modify the interpretation. Lastly, the feeling that carbonate aquifers may be more complex than clastic aquifers is an example of subjective reasoning. Subjective reasoning uses qualitative data and is the most difficult of all for computer analysis. In such instances, the analytical potential of the computer is secondary to its ability to store and graphically portray large amounts of information. Graphic capabilities are requisite to earth scientists in order to make qualitative data usable for interpretation.

Another example of appropriate use of computers relative to types of reasoning is the distinction between verification and validation, which is discussed in detail in Chapter 16. Verification, which checks compliance of data with project requirements, is an example of deductive logic. Either a continuous calibration verification sample was run every ten samples or it wasn’t. Validation, on the other hand, which determines the suitability of the data for use, is very subjective, requiring an understanding of sampling conditions, analytical procedures, and the expected use of the data. Verification is easily done with computer software. How far software can go toward complete automation of the validation process remains to be seen.
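As a minimal sketch of the deductive case described above, the stratigraphic horizon calculation can be expressed directly as a database query. The BORINGS table and its column names here are hypothetical, used only for illustration and not part of any particular system:

-- Deductive calculation: horizon elevation = ground elevation - depth to horizon.
-- BORINGS is a hypothetical table used only for this illustration.
SELECT
    boring_id,
    ground_elevation_ft,
    depth_to_horizon_ft,
    ground_elevation_ft - depth_to_horizon_ft AS horizon_elevation_ft
FROM BORINGS;

The computer can evaluate such a query unambiguously; deciding whether the resulting surface is geologically reasonable, like validation of analytical data, remains an interpretive task for a person.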
CONVERGENCE - ENVIRONMENTAL DATA MANAGEMENT

Efficient data management is taking on increased importance in many organizations, and yours is probably no exception. In the words of one author (Diamondstone, 1990, p. 3):

    Automated measuring equipment has provided rapidly increasing amounts of data. Now, the challenge before us is to assure sufficient data uniformity and compatibility and to implement data quality measures so that these data will be useful for integrative environmental problem solving.

This is particularly true in organizations where many different types of site investigation and monitoring data are coming from a variety of different sources. Fortunately, software tools are now available which allow off-the-shelf programs to be used by people who are not computer experts to retrieve this data in a format that is meaningful to them. According to Finkelstein (1989, p. 3):

    Management is on the threshold of an explosive boom in the use of computers. A boom initiated by simplicity and ease of use. Managers and staff at all levels of an organization will be able to design and implement their own systems, thereby dramatically reducing their dependence on the data processing (DP) department, while still ensuring that DP maintains central control, so that application systems and their data can be used by others in the business.

With the advent of relatively easy to use software tools such as Microsoft Windows and Microsoft Access, it is even more true now that individuals can have a much greater role in
satisfying their own data management needs. It is important to develop a data management approach that makes efficient use of these tools to solve the business problem of managing data within the organization. The environmental data management system that will result from implementation of a plan based on this approach will provide users with access to the organization’s environmental data to satisfy their business needs. It will allow them to expand their data retrievals as their needs change and as their skills develop.

As with most business decisions, the decision to implement a data management system should be based on an analysis of the expected return on the time and money invested. In the case of an office automation project, some of the return is tangible and can be expressed in dollar savings, and some is intangible savings in efficiency in everyday operations. In general, the best approach for system implementation is to look for leverage points in the system where a great return can be had for a small cost. The question becomes: How do you improve the process to get the greatest savings?

Often some examples of tangible returns can be identified within the organization. The benefits can best be seen from analyzing the impact of the data management system on the whole site investigation and remediation process. For example, during remediation you might be able, by more careful tracking and modeling of the contamination, to decrease the amount of waste to be removed or water to be processed. You may also be able to decrease the time required to complete the project and save many person-years of cost by making quality data available in a standardized format and in a timely fashion. For smaller sites, automating the data management process can provide savings by repetition. Once the system has been set up for one site and people trained to use it, that effort can be re-used on the next site.

The intangible benefits of a data management system are difficult to quantify, but subjectively can include increased job satisfaction of project workers, a higher quality work product, and better decision making. The cumulative financial and intangible return on investment of these various benefits can easily justify reasonable expenditures for a data management system.
CONCEPT OF DATA VS. INFORMATION

It is important to recognize that there is a difference between numbers and letters stored in a computer and useful information. Numbers stored in a computer, or printed out onto a sheet of paper, may not themselves be of any value. It is only when those numbers are presented in a form that is useful to the intended audience that they become useful information.

The keys to making the transition from data to information are organization and access. It doesn't matter if you have a file of all the monitoring wells ever drilled; if you can't get the information you want out of the file, it is useless. Before any database is created, careful attention should be paid to how the data is going to be used, to ensure that the maximum use can be received from the effort. Statistics and graphics can be tremendously helpful in perceiving relationships among different variables contained in the data. As the power and ease-of-use of both general business programs and technical programs for statistics and graphics improves, it will become common to take a good look at the data as a set before working with individual members of the set.

The next step is to move from information to knowledge. The difference between the two is understanding. Once you have processed the information and understand it, it becomes knowledge. This transition is a human activity, not a computer activity, but the computer can help by presenting the information in an understandable manner.
EMS VS. EMIS VS. EDMS

A final overview issue to discuss is the relationship between EMS (environmental management systems), EMIS (environmental management information systems), and site EDMS (environmental data management systems). An EMS is a set of policies and procedures for managing
environmental issues for an organization or a facility. An EMIS is a software system implemented to support the administration of the EMS (see Gilbert, 1999). An EMIS usually has a focus on record keeping and reporting, and is implemented with the hope of improving business processes and practices. A site environmental data management system (EDMS) is a software system for managing data regarding the environmental impact of current or former operations. An EDMS overlaps partially with an EMIS. For an operating facility, the EDMS is a part of the EMIS. For a facility no longer in operation, there may be no formal EMS or EMIS, but an EDMS is necessary to facilitate monitoring and cleanup.

Data is or Data are?
Is “data” singular or plural? In this book the word data is used as a singular noun. Depending on your background, you may not like this. Many engineers and scientists think of data as the plural of “datum,” so they consider the word plural. Computer people view data as a chunk of stuff, and, like “chunk,” consider it singular. In one dictionary I consulted (Webster, 1984), data as the plural of datum was the third definition, with the first two being synonyms for “information,” which is always treated as singular. It also states that common usage at this time is singular rather than plural, and that “data can now be used as a singular form in English.” In Strunk and White (1935), a style manual that I use, the discussion of singular vs. plural nouns uses an example of the contents of a jar. If the jar contains marbles, its contents are plural. If it contains jam, its content is singular. You decide: Is data jam or marbles?
CHAPTER 2 SITE DATA MANAGEMENT CONCEPTS
The size and complexity of environmental investigation and monitoring programs at industrial facilities continue to increase. Consequently, the amount of environmental data, both at operating facilities and orphan sites, is growing as well. The volume of data often exceeds the capacity of simple tools like paper reports and spreadsheets. When that happens, it is appropriate to implement a more powerful data management system, and often the system of choice is a relational database manager. This section provides a top-down discussion of management of environmental data. It focuses on the purpose and goals of environmental data management, and on the types and locations of data storage. These issues should always be resolved before an electronic (or in fact any) data management system is implemented.
PURPOSE OF DATA MANAGEMENT Why manage data electronically? Or why even manage it at all? Clear answers to these questions are critical before a successful system can be implemented. This section addresses some of the issues related to the purpose of data management. It all comes down to planning. If you understand the goal to be accomplished, you have a better chance of accomplishing it. There is only one real purpose of data management: to support the goals of the organization. These goals are discussed in detail in Chapter 8. No data management system should be built unless it satisfies one or more significant business or technical goals. Identification of these goals should be done prior to designing and implementing the system for two reasons. One reason is that the achievement of these goals provides the economic justification for the effort of building the system. The other reason is that the system is more likely to generate satisfactory results if those results are understood, at least to a first approximation, before the system is implemented and functionality is frozen. Different organizations have different things that make them tick. For some organizations, internal considerations such as cost and efficiency are most important. For others, outside appearances are equally or more important. The goals of the organization must be taken into consideration in the design of the system so that the greatest benefit can be achieved. Typical goals include: Improve efficiency – Environmental site investigation and remediation projects can involve an enormous amount of data. Computerized methods, if they are well designed and implemented,
can be a great help in improving the flow of data through the project. They can also be a great sink of time and effort if poorly managed. Maximize quality – Because of the great importance of the results derived from environmental investigation and remediation, it is critical that the quality of the output be maximized relative to the cost. This is not trivial, and careful data storage, and annotation of data with quality information, can be a great help in achieving data quality objectives. Minimize cost – No organization has an unlimited amount of money, and even those with a high level of commitment to environmental quality must spend their money wisely to receive the greatest return on their investment. This means that unnecessary costs, whether in time or money, must be minimized. Electronic data management can help contain costs by saving time and minimizing lost data.

Environmental problems are complex problems. Complex problems have simple, easy-to-understand wrong answers. From Environmental Humor by Gerald Rich (1996), reprinted with permission

People tend to start working on a database without giving a lot of thought to what a database really is. It is more than an accumulation of numbers and letters. It is a special way to help us understand information. Here are some general thoughts about databases: A database is a model of reality – In many cases, the data that we have for a facility is the only representation that we have for conditions at that facility. This is especially true in the subsurface, and for chemical constituents that are not visible, either because of their physical condition or their location. The model helps us understand the reality – In general, conditions at sites are nearly infinitely complex. The total combination of geological, hydrological and engineering factors usually exceeds our ability to understand it without some simplification. Our model of the site, based on the data that we have, helps us to perform this simplification in a meaningful way. This understanding helps us make decisions – Our simplified understanding of the site allows us to make decisions about actions to be taken to improve the situation at the site. Our model lets us propose and test solutions based on the data that we have, identify additional data that we need, and then choose from the alternative solutions. The clearer the model, the better the decisions – Since our decisions are based on our data-based model, it follows that we will make better decisions if we have a clear, accurate, up-to-date model. The purpose of a database management system for environmental data is to provide us the information to build accurate models and keep them current. Clearly information technology, including data management, is important to organizations. Linderholm (2001) reports the results of a study that asked business executives about the importance of information technology (IT) to their business. 70% reported that it was absolutely essential, and 20% said it was extremely valuable. The net increase in revenue attributable to IT, after accounting for IT costs, was estimated to be 20%, clearly a good return. 70% said that the role of IT in business strategy is increasing. In the environmental business the story must be similar, but perhaps not as strong.
If you were to survey project managers today about the importance of data management on their projects, probably the percentage that said it was essential or extremely valuable would be less than the 90% quoted above, and maybe less than 50%. But as the amount of data for sites continues to grow, this number will surely increase.
TYPES OF DATA STORAGE Once the purpose of the system has been determined, the next step is to identify the data to be contained in the system and how it is to be stored. Some data must be stored electronically, while
other data might not need to be stored this way. Implementers should first develop a thorough understanding of their existing data and storage methods, and then make decisions about how electronic storage can provide an improvement. This section will cover ways of storing site environmental data. The content of an EDMS will be discussed in Chapter 4.
Hard copy Since its inception, hard copy data storage has been the lifeblood of the environmental industry. Many organizations have thousands of boxes of paper related to their projects. The importance of this data varies greatly, but in many organizations, it is not well understood. A data management system for hard copy data is different from a data management system for digital data such as laboratory analytical results. The former is really a document management system, and many vendors offer software and other tools to build this type of system. The latter is more of a technical database issue, and can be addressed by in-house generated solutions or off-the-shelf or semi-custom solutions from environmental software vendors.
LAB REPORTS Laboratory analyses can generate a large volume of paper. Programs like the U.S.E.P.A. Contract Lab Program (CLP) specify deliverables that can be hundreds of pages for one sampling event. This paper is important as backup for the data, but these hundreds of pages can cause a storage and retrieval problem for many organizations. Often the usable data from the lab event, that is, the data actually used to make site decisions, may be only a small fraction of the paper, with the rest being quality assurance and other backup information.
DERIVED REPORTS Evaluation of the results of laboratory analysis and other investigation efforts usually results in a printed report. These reports contain a large amount of useful information, but over time can also become a storage and retrieval problem.
Electronic There are many different ways of organizing data for digital storage. There is no “right” or “wrong” way, but there are approaches that provide greater benefits than others in specific situations. People store environmental data a lot of different ways, both in database systems and in other file types. Here we will discuss two non-database ways of storing data, and several different database system designs for storing data.
TEXT FILES AND WORD PROCESSOR FILES The simplest way to manage environmental data is in text files. These files contain just the information of interest, with no formatting or information about the data structure or relationships between different data elements. Usually these files are encoded in ASCII, which stands for American Standard Code for Information Interchange and is pronounced like as′-kee. For this reason they are sometimes called ASCII files. Text files can be effectively used for storing and transferring small amounts of data. Because they lack “intelligence” they are not usually practical for large data sets. For example, in order to search for one piece of data in a text file you must look at every word until you find the one you are looking for, rather than using a more efficient method such as indexed searching used by data management programs. A variation on text files is word processor files, which contain some formatting and structure resulting from the word processing program that created them. An example of this would be the data in a table in a report. Again this works well only for small amounts of data.
SPREADSHEETS Over the years a large amount of environmental data has been managed in spreadsheets. This approach works for data sets that are small to medium in size, and where the display and retrieval requirements are relatively simple. For large data sets, a database manager program is usually required because spreadsheets have a limit to the number of rows and columns that they contain, and these limits can easily be exceeded by a large data set. For example, Lotus 123 has a limit of about 16,000 rows of data, and Excel 97 has a limit of 65,536 rows. Spreadsheets do have their place in working with environmental data. They are particularly useful for statistical analysis of data and for graphing in a variety of ways. Spreadsheets are for doing calculations. Database managers are for managing data. As long as both are used appropriately, the two together can be very powerful. The problem with spreadsheets occurs when they are used in situations where real data management is required. For example, it’s not unusual for organizations to manage quarterly groundwater monitoring data using spreadsheets. They can do statistics on the data and print reports. Where the problem becomes evident is when it becomes necessary to do a historical analysis of the data. It can be very difficult to tie the data together. The format of the spreadsheets may have evolved over time. The file for one quarter may be missing or corrupted. Suddenly it becomes a chore to pull all of the data together to answer a question such as “What is the trend of the sulfate values over the last five years?”
DATABASE MANAGERS For storing large amounts of data, and where immediate calculations are not as important, database managers usually do a better job than spreadsheets, although the capabilities of spreadsheets and databases certainly overlap somewhat. The better database managers allow you to store related data in several different tables and to link them together based on the contents of the data. Many database manager programs have a reputation for not being very easy to use, partly because of the sheer number of options available. This has been improved with the menu-driven interfaces that are now available. These interfaces help with the learning curve, but data management software, especially database server software, can still be very difficult to master. Many database manager programs provide a programming language, which allows you to automate tasks that you perform often or repeatedly. It also allows you to configure the system for other users. This language provides the tools to develop sophisticated applications programs for nearly any data handling need, and provides the basis for some commercial EDMS software. Database managers are usually classified by how they store and relate data. The most common types are flat files, hierarchical, network, object-oriented, and relational. Most use the terminology of “record” for each object in the database (such as a well or sample location) and “field” for each type of information on each object (such as county or collection date). For information on database management concepts see Date (1981) and Rumble and Hampel (1984). Sullivan (2001) quotes a study by the University of California at Berkeley that humans have generated 12 exabytes (an exabyte is over 1 million terabytes, or a million trillion bytes) of data since the start of time, and will double this in the next two and a half years. Currently, about 20% of the world’s data is contained in relational databases, while the rest is in flat files, audio, video, pre-relational, and unstructured formats.
Flat file A flat file is a two-dimensional array of data organized in rows and columns similar to a spreadsheet. This is the simplest type of database manager. All of the data for a particular type of object is stored in a single file or table, and each record can have one instance of data for each field. A good analogy is a 3"×5" card file, where there is one card (record) for each item being tracked in the database, and one line (field) for each type of information stored.
Flat file database managers are usually the cheapest to buy, and often the easiest to use, but the complexity of real-world data often requires more power than they can provide. In a flat file storage system, each row represents one observation, such as a boring or a sample. Each column contains the same kind of data. An example of a flat file of environmental data is shown in the following table:

Well  Elev  X     Y     SampDate  Sampler  As   AsFlag    Cl   ClFlag    pH
B-1   725   1050  681   2/3/96    JLG      .05  not det                  6.8
B-1   725   1050  681   5/8/96    DWR      .05  not det   .05  not det   6.7
B-2   706   342   880   11/4/95   JAM      3.7  detected  9.1  detected  5.2
B-2   706   342   880   2/3/96    JLG      2.1  detected  8.4  detected  5.3
B-2   706   342   880   5/8/96    DWR      1.4  detected  7.2  detected  5.8
B-3   714   785   1101  2/3/96    JLG      .05  not det                  8.1
B-3   714   785   1101  5/8/96    CRS      .05  not det   .05  not det   7.9

Figure 2 - Flat file of environmental data
In this table, each line is the result of one sampling event for an observation well. Since the wells were sampled more than once, and analyzed for multiple parameters, information specific to the well, such as the elevation and location (X and Y), is repeated. This wastes space and increases the chance for error since the same data element must be entered more than once. The same is true for sampling events, represented here by the date and the initials of the person doing the sampling. Also, since the format for the analysis results requires space for each value, if the value is missing, as it is for some of the chloride measurements, the space for that data is wasted. In general, flat files work acceptably for managing small amounts of data such as individual sampling events. They become less efficient as the size of the database grows. Examples of flat file data management programs are FileMaker Pro (www.filemaker.com) and Web-based database programs such as QuickBase (www.quickbase.com).
Hierarchical In the hierarchical design, the one-to-many relationship common to many data sets is formalized into the database design. This design works well for situations such as multiple samples for each boring, but has difficulty with other situations such as many-to-many relationships. This type of program is less common than flat files or relational database managers, but is appropriate for some types of data. In a hierarchical database, data elements can be viewed as branches of an inverted tree. A good example of a hierarchical database might be a database of organisms. At the top would be the kingdom, and underneath that would be the phyla for each kingdom. Each phylum belongs to only one kingdom, but each kingdom can have several phyla. The same continues down the line for class, order, and so on. The most important factor in fitting data into this scheme is that there must be no data element at one level that needs to be under more than one element at a higher level. If a crinoid could be both a plant and an animal at the same time, it could not be classified in a hierarchical database by phylogeny (which biological kingdom it evolved from). Environmental site data is for the most part hierarchical in nature. Each site can have many monitoring wells. Each well can have many samples, either over time or by depth. Then each sample can be analyzed for multiple constituents. Each constituent analysis comes from one specific sample, which comes from one well, which comes from one site. A data set which is inherently hierarchical can be stored in a relational database manager, and relational database managers are somewhat more flexible, so pure hierarchical database managers are now rare.
Network In the network data model, multiple relationships between different elements at the same level are easy to manage. Hypertext systems (such as the World Wide Web) are examples of managing data this way. Network database managers are not common, but are appropriate in some cases, especially those in which the interrelationships among data are complex. An example of a network database would be a database of authors and articles. Each author may have written many articles, and each article may have one or more authors. This is called a “many-to-many” relationship. This is a good project for a network database manager. Each author is entered, as is each article. Then the links between authors and articles are established. The data elements are entered, and then the network established. Then an article can be called up, and the information on its authors can be retrieved. Likewise, an author can be named, and his or her articles listed. A network data topology (geometric configuration) can be stored in a relational database manager. A “join table” is needed to handle the many-to-many relationships. Storing the above article database in a relational system would require three tables, one for authors, one for articles, and a join table with the connections between them.
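To make the join table idea concrete, here is a minimal SQL sketch of the author and article example; the table and field names are invented for illustration and are not from any particular system.

CREATE TABLE Authors (
  AuthorID   INTEGER PRIMARY KEY,
  AuthorName TEXT
);

CREATE TABLE Articles (
  ArticleID INTEGER PRIMARY KEY,
  Title     TEXT
);

-- The join table resolves the many-to-many relationship:
-- each row links one author to one article.
CREATE TABLE ArticleAuthors (
  ArticleID INTEGER REFERENCES Articles (ArticleID),
  AuthorID  INTEGER REFERENCES Authors (AuthorID),
  PRIMARY KEY (ArticleID, AuthorID)
);

-- List the articles written by a particular author.
SELECT Articles.Title
FROM Authors, ArticleAuthors, Articles
WHERE Authors.AuthorName = 'Smith'
AND ArticleAuthors.AuthorID = Authors.AuthorID
AND Articles.ArticleID = ArticleAuthors.ArticleID;

Listing the authors of a given article is the mirror-image query, joining in the other direction through the same join table.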
Object oriented This relatively recent invention stores each data element as an object with properties and methods encapsulated (wrapped up) into each object. This is a deviation from the usual separation of code and data, but is being used successfully in many settings. Current object-oriented systems do not provide the data retrieval speed on large data sets provided by relational systems. Using this type of software involves a complete re-education of the user, since different terminology and concepts are used. It is a very powerful way to manipulate data for many purposes, and is likely to see more widespread use. Some of the features of object-oriented databases are described in the next few paragraphs. Encapsulation – Traditional programming languages focus on what is to be done. This is referred to as “procedural programming.” Object-oriented programming focuses on objects, which are a blend of data and program code (Watterson, 1989). In a procedural paradigm (a paradigm is an approach or model), the data and the programs are separate. In an object-oriented paradigm, the objects consist of data that knows what to do with itself, that is, objects contain methods for performing actions. This is called encapsulation. Thus, instead of applying procedures to passive data, in object-oriented programming systems (OOPS), methods are part of the objects. Some examples of the difference between procedural systems and OOPS might be helpful. In a procedural system, the data for a well could contain a field for well type, such as monitoring well or soil boring. The program operating on the data would know what symbol to draw on the map based on the contents of that field. In an OOPS the object called “soil boring” would include a method to draw its symbol, based on the data content (properties) of the object. Properties of objects in OOPS are usually loosely typed, which means that the distinction between data types such as integers and characters is not rigorously defined. This can be useful when, as is often the case, a numeric property such as depth to a particular formation needs to be filled with character values such as NP (not present) or NDE (not deep enough). For another illustration, imagine modeling a rock or soil body subject to chemical and physical processes such as leaching or neutralization using an OOPS. Each mineral phase would be an object of class “mineral,” while each fluid phase would be an object of class “fluid.” Methods known to the objects would include precipitation, dissolution, compaction, and so on. The model is given an initial condition, and then the objects interact via messages triggering methods until some final state is reached. Inheritance – Objects in an OOPS belong to classes, and members of a particular class share the same methods. Also, similar classes of objects can inherit properties and methods from an existing class. This feature, called inheritance, allows a building-block approach to designing a
database system by first creating simple objects and then building on and combining them into more complex objects. In this way, an object called "site" made up of "well" objects would know how to display itself with no additional programming. Message Passing – An object-oriented program communicates with objects via messages, and objects can exchange messages as well. For example, an object of class "excavated material" could send a message to an object of class "remediation pit" which would update the property "remaining material" within object "remediation pit." Polymorphism – A method is part of an object, and is distinct from messages between objects. The objects "well" and "boring" could both contain the method "draw yourself," and sending the "draw yourself" message to one or the other object will cause a similar but different result. This is referred to as polymorphism. Object-oriented programming directly models the application, with messages passed between objects being the analog of real-world processes (Thomas, 1989). Software written in this way is easier to maintain because programmers, other than the author, can easily comprehend the program code. Since program code is easily reusable, development of complex applications can be done more quickly and smoothly. Encapsulation, message passing, inheritance, and polymorphism give OOPS developers very different tools from those provided by traditional programming languages. Also, OOPS often use a graphical user interface and large amounts of memory, making them more suitable to high-end computer systems. For these reasons, OOPS have been slow in gaining acceptance, but they are gaining momentum and are considered by many to be the programming system of the future. Examples of object-oriented programming languages include Smalltalk developed by Xerox at the Palo Alto Research Center in the 1970s (Goldberg and Robson, 1983); C++, which is a superset of the C programming language (Stroustrup, 1986); and HyperCard for the Macintosh. NeXTstep, the programming environment for the NeXT computer, also uses the object-oriented paradigm. There are several database management programs that are designed to be object oriented, which means that their primary data storage design is to store objects. Also, a number of relational database management systems have recently added object data types to allow object-oriented applications to use them as data repositories, and are referred to as Object-Relational systems.
Relational Relational database managers and SQL are discussed in much greater detail in Chapter 3, and are described here briefly for comparison with other database manager types. In the relational model, data is stored in one or more tables, and these tables are related, that is, they can be joined together, based on data elements within the tables. This allows storage of data where there may be many pieces of one type of information related to one object (one-to-many relationship), as well as other relationships such as hierarchical and many-to-many. In many cases, this has been found to be the most efficient form of data storage for large, complicated databases, because it provides efficient data storage combined with flexible data retrieval. Currently the most popular type of database manager program is the relational type. A file of monitoring well data provides a good example of how real-world data can be stored in a relational database manager. One table is created which contains the header data for the well including location, date drilled, elevation, and so on, with one record for each well. For each well, the driller or logger will report a different number of formation tops, so a table of formation tops is created, with one record for each top. A unique identifier such as well ID number relates the two tables to each other. Each well can also have one or more screened intervals, and a table is created for each of those data types, and related by the same ID number. Each screened interval can have multiple sampling events, with a description for each, so another table can be created for these sample events, which can be related by well ID number and sample event number. Very complex
systems can be created this way, but it often will take a program, written in the database language, to keep track of all the relationships and allow data entry, updating, and reporting. The most popular way of interacting with relational database managers is Structured Query Language (SQL, sometimes incorrectly pronounced "sequel," see below). SQL provides a powerful, flexible way of retrieving, adding and changing data in a relational system. A typical SQL query might look like this:

SELECT X_COORD, Y_COORD, COLLAR_ELEV - MUDDY_TOP, SYMBOL_CODE
FROM WELLS, TOPS
WHERE MUDDY_TOP > 100
AND WELLS.WELL_ID = TOPS.WELL_ID

This query would produce a list where the first column is the X coordinate for a well, the second column is the Y coordinate, the third column is the difference between the well elevation and the depth to the top of the Muddy Formation, and the fourth column is the symbol code. Only the wells for which the Muddy Formation is deeper than 100 would be listed. The X and Y coordinates, the elevation, and the symbol code come from the WELLS table and the Muddy Formation top comes from the TOPS table. The last line is the "relational join" that hooks the two tables together for this query, based on WELL_ID, which is a field common to both tables. A field that relates two tables like this is called a "key." The output from this query could be sent to a mapping program to make a contour map of the Muddy structure. Most relational database managers have SQL as their native retrieval language. The rest usually have an add-in capability to handle SQL queries.
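For readers who want to see the table structures that a query like this assumes, the following sketch shows one possible layout for the WELLS and TOPS tables. The field names match the query above, but the data types and the single MUDDY_TOP column are assumptions made for illustration; a fully normalized design would more likely store one row per formation top, with fields for the formation name and its depth, as described in the text.

CREATE TABLE WELLS (
  WELL_ID     INTEGER PRIMARY KEY,  -- one record per well
  X_COORD     NUMERIC,
  Y_COORD     NUMERIC,
  COLLAR_ELEV NUMERIC,
  SYMBOL_CODE TEXT                  -- symbol to draw on the map
);

CREATE TABLE TOPS (
  WELL_ID   INTEGER REFERENCES WELLS (WELL_ID),  -- key back to the well
  MUDDY_TOP NUMERIC                              -- depth to the top of the Muddy Formation
);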
XML XML (eXtensible Markup Language) was developed as a data transfer format, and has become increasingly popular for exchanging data on the Internet, especially in business-to-business transactions. The use of XML for transferring data is discussed in Chapter 24. Database management products are now starting to appear that use XML as their data storage format. (For example, see Dragan, 2001.) As of this writing these products are very immature. One example product costs $50,000 and does not include a report writer module. This is expected with a new product category. What is not clear is whether this data storage approach will catch on and replace more traditional methods, especially relational data management systems. This may be unlikely, especially since relational software vendors are adding the capability to store XML data in their programs, which are much more mature and accepted. Given that XML was intended as a transfer format, it’s not clear that it is a good choice for a storage format. It will be interesting to see if database products with XML as their primary data storage type become popular.
RESPONSIBILITY FOR DATA MANAGEMENT A significant issue in many organizations is who is responsible for data management. In some organizations, data management is centralized in a data management group. In others, the project team members perform the data management. Some organizations outsource data management to a consultant. Finally, some groups use a combination of these approaches. Each of these options will be discussed briefly, along with its pros and cons. Dedicated data management group – The thinking in this approach is that the organization can have a group of specialists who manage data for all of the projects. This is often done in conjunction with the group (which may be the same people) that performs validation on the data. The advantages of this are that the people develop specialized skills and expertise that allows them to manage the data efficiently. They can respond to requests for custom processing and output, because they have mastered the tools that they use to manage data. The disadvantage is that they
may not have hands-on knowledge of the project and its data, which may be necessary to recognize and remedy problems. They need to be kept in the loop, such as when a new well is drilled, or when the laboratory is asked to analyze for a different suite of constituents, so that they can react appropriately. Data management by the project team – Here the focus is on the benefit of project knowledge rather than data management expertise. The people managing the data are likely to be involved in decisions that affect the data, and should be in a position to respond to changes in the data being gathered and reported. They might have problems, though, when something is asked of them that is beyond their data management expertise, because that is only part of what they are expected to do. Outsourcing to a consultant – A very common practice is to outsource the data management to a consultant, often the one that is gathering the data. Then the consultant has to decide between one of the previous approaches. This may be the best option when staff time or expertise is not available in-house to do the data management. The price can be loss of control over the process. A team effort – Sometimes the best solution is a team effort. In this scenario, project team members are primarily responsible for the data management process. They are backed up by data management specialists, either in-house or through a consultant, to help them when needs change or problems occur. Project staff may outsource large data gathering, data entry, or cleanup projects, especially when it is necessary to “catch up,” for example, to bring a lot of historical data from disparate sources into a comprehensive database to satisfy some specific project requirements. The team approach has the potential to be the strongest, because it allows the project team to leverage the strengths of the different available groups. It does require good management skills to keep everyone on the same page, especially as deadlines approach.
UNDERSTANDING THE DATA It is extremely important that project managers, data administrators, and users of the software have a complete understanding of the data in the database and how it is to be used. It is important for them to understand the data structure. It is even more important for them to understand the content. The understanding of the structure refers to how data elements from the real world are placed in fields in the database. Many DBMS programs allow comments to be associated with fields, either in the data model, or on editing forms, or both. These comments can assist the user with understanding how the data is to be entered. Additional information on data structure is contained in Chapter 4. Once you know where different data elements are to go, you must also know what data elements in the database mean. This is true of both primary data elements (data in the main tables) and coded values (stored in lookup tables). The content can be easily misunderstood. A number of issues regarding data content are discussed in Parts Three and Four.
CHAPTER 3 RELATIONAL DATA MANAGEMENT THEORY
The people using an EDMS will often come to the system with little or no formal training or experience in environmental data management. In order to provide a conceptual framework on which they can build their expertise in using the system, this section provides an overview of the fundamentals of relational management of environmental data. This section starts with a discussion of the meaning and history of relational data management. This is followed by a description of breaking data apart with data normalization, and using SQL to put it back together again.
WHAT IS RELATIONAL DATA MANAGEMENT? Relational data management means organizing the data based on relationships between items within the database. It involves designing the database so that like data elements are grouped into tables together. This process is called data normalization. Then the data can be joined back together for retrievals, usually using the SQL data retrieval language. The key elements of relational data storage are:

• Tables – Database objects containing records and fields where the data is stored. Examples: Samples, Analyses.
• Fields – Data elements (columns) within the table. Examples: Parameter Name, Value.
• Records – Items being stored (rows) within the table. Example: Arsenic for last quarter.
Each of these items will be discussed in much greater detail later.
HISTORY OF RELATIONAL DATA MANAGEMENT Prior to 1970, data was primarily managed in hierarchical and networked storage systems. Edgar Codd, an IBM Fellow working at the San Jose research lab, became concerned about protecting users from needing to understand the internal representation of the data, and published a paper entitled "A Relational Model of Data for Large Shared Data Banks" (Codd, 1970). This set off a debate in industry about which model was the best. IBM developed System/R and a team at Berkeley developed INGRES, both prototype relational data management systems built in the mid-1970s.
SQL was the database language for System/R, and was first described in Boyce and Chamberlain (1973). SQL is sometimes pronounced "sequel." This is not really correct. In the early 1970s Edgar Codd and others in IBM's research center were working on relational data management, and Donald Chamberlain and others developed a language to work with relational databases (Gagnon, 1998). This language was called Structured English Query Language, or SEQUEL. This was later revised and renamed SEQUEL/2. SQL as it is currently implemented is not the same as these early languages. And IBM stopped using the name SEQUEL for legal reasons. So it is not correct to call SQL "sequel." It's better to just say the letters, but many call it "sequel" anyway. In 1977, a contract programming company called Software Development Laboratories (SDL) developed a database system for the Central Intelligence Agency for a project called "Oracle." The company released the program, based on System/R and SQL, in 1979. This became the first commercial relational database management system, and ran on IBM mainframes and Digital VAX and UNIX minicomputers. SDL assumed the name Oracle for both the company and the product. IBM followed in 1981 with the release of SQL/DS, and with DB2 in 1983 (Finkelstein, 1989). Oracle and DB2 are still active products in widespread use. The American National Standards Institute (ANSI) accepted SQL as a standardized fourth generation (4GL) language in 1986, with several revisions since then.
DATA NORMALIZATION Often the process of storing complex types of data in a relational database manager is improved by a technique called data normalization. Usually the best database designs have undergone this process, and the result is referred to as a "normalized data model." The normalization process separates data elements into a logical grouping of tables and fields.
Definition Normalization of a database is a process of grouping the data elements into tables in a way that reduces needless duplication and wasted space while allowing maximum flexibility of data retrieval. The concepts of data normalization were developed by Edgar Codd, and the first three steps of normalization were described in his 1972 paper (Codd, 1972). These steps were expanded and developed by Codd and others over the next few years. A good summary of this work is contained in Date (1981). Figure 3 shows an example of a simple normalized data model for site environmental data. This figure is known as an entity-relationship diagram, or E-R diagram. In Figure 3, and in other E-R diagrams used in this book, each box represents a table, with the name of the table shown at the top; the tables are the "entities" of the diagram. The other words in each box represent the fields (attributes) in that table. The lines between the boxes represent the relationships between the tables, and the fields used for the relationships. All of the relationships shown are "one-to-many," signified by the number one and the infinity symbol on the ends of the join lines. That means that one record in one table can be related to many records in the other table.
The Five Normal Forms Most data analysts recognize five levels of normalization. These forms, called First through Fifth Normal Form, represent increasing levels of organization of the data. Each level will be discussed briefly here, with examples of a simple environmental data set organized into each form.
Figure 3 - Simple normalized data model for site environmental data
Tools are now available to analyze a data set and assist with normalizing it. The Access Table Analyzer Wizard (see Figure 14, later) is one example. These tools usually require that the data be in First Normal Form (rows and columns, no repeating groups) before they can work with it. In this section we will go through the process of normalizing a data set. We will start with a flat file of site analytical data in a format similar to the way it would be received from a laboratory. This is shown in Figure 4.

Well  Elev  X     Y     SampDate  Sampler  As   AsFlag    Cl   ClFlag    pH
B-1   725   1050  681   2/3/96    JLG      .05  not det                  6.8
B-1   725   1050  681   5/8/96    DWR      .05  not det   .05  not det   6.7
B-2   706   342   880   11/4/95   JAM      3.7  detected  9.1  detected  5.2
B-2   706   342   880   2/3/96    JLG      2.1  detected  8.4  detected  5.3
B-2   706   342   880   5/8/96    DWR      1.4  detected  7.2  detected  5.8
B-3   714   785   1101  2/3/96    JLG      .05  not det                  8.1
B-3   714   785   1101  5/8/96    CRS      .05  not det   .05  not det   7.9

Figure 4 - Environmental data prior to normalization - Problems: Repeating Groups, Redundancy
First Normal Form – First we will convert our flat file to First Normal Form. In this form, data is represented as a two-dimensional array of data, like a flat file. Unlike some flat files, a first normal form table has no repeating groups of fields. In the flat file illustration in Figure 4, there are columns for arsenic (As), chloride (Cl), and pH, and detection flags for arsenic and chloride. This design requires that space be allocated for every constituent for every sample, even though some constituents were not measured. Converting this table to First Normal Form results in the configuration shown in Figure 5.
Well  Elev  X     Y     SampDate  Sampler  Param  Value  Flag
B-1   725   1050  681   2/3/96    JLG      As     .05    not det
B-1   725   1050  681   2/3/96    JLG      pH     6.8
B-1   725   1050  681   5/8/96    DWR      As     .05    not det
B-1   725   1050  681   5/8/96    DWR      Cl     .05    not det
B-1   725   1050  681   5/8/96    DWR      pH     6.7
B-2   706   342   880   11/4/95   JAM      As     3.7    detected
B-2   706   342   880   11/4/95   JAM      Cl     9.1    detected
B-2   706   342   880   11/4/95   JAM      pH     5.2
B-2   706   342   880   2/3/96    JLG      As     2.1    detected
B-2   706   342   880   2/3/96    JLG      Cl     8.4    detected
B-2   706   342   880   2/3/96    JLG      pH     5.3
B-2   706   342   880   5/8/96    DWR      As     1.4    detected
B-2   706   342   880   5/8/96    DWR      Cl     7.2    detected
B-2   706   342   880   5/8/96    DWR      pH     5.8
B-3   714   785   1101  2/3/96    JLG      As     .05    not det
B-3   714   785   1101  2/3/96    JLG      pH     8.1
B-3   714   785   1101  5/8/96    CRS      As     .05    not det
B-3   714   785   1101  5/8/96    CRS      Cl     .05    not det
B-3   714   785   1101  5/8/96    CRS      pH     7.9

Figure 5 - Environmental data in first normal form (no repeating groups) - Problem: Redundancy
In this form, there is a line for each constituent that was measured for each well. There is no wasted space for the chloride measurements for B-1 or B-3 for 2/3/96. There is, however, quite a bit of redundancy. The elevation and location are repeated for each well, and the sampler's initials are repeated for each sample. This redundancy is addressed in Second Normal Form. Second Normal Form – In this form, redundant data is moved to separate tables. In the more formal terminology of data modeling, data in non-key columns must be fully dependent on the primary key for the table. Key columns are the columns that uniquely identify a record. In our example, the data that uniquely identifies each row in the analytical table is a combination of the well, the sampling date, and the parameter. The above table has data, such as the elevation and the sampler, which is not dependent on the entire compound key, but on only part of the key. Elevation depends on well but not on sample date, and sampler depends on well and sample date but not on parameter. In order to convert our table to Second Normal Form, we must separate it into three tables as shown in Figure 6. Third Normal Form – This form requires that the table conform to the rules for First and Second Normal Form, and that all non-key columns of a table be dependent on the table's primary key and independent of one another. Once our example data set has been modified to fit Second Normal Form, it also meets the criteria for Third Normal Form, since all of the non-key values are dependent on the key fields and not on each other. Fourth Normal Form – The rule for Fourth Normal Form is that independent data entities cannot be stored in the same table where many-to-many relationships exist between these entities. Many-to-many relationships cannot be expressed as simple relationships between entities, but require another table to express this relationship. Our tables as described above meet the criteria for Fourth Normal Form.
Stations
Well  Elev  X     Y
B-1   725   1050  681
B-2   706   342   880
B-3   714   785   1101

Samples
Well  SampDate  Sampler
B-1   2/3/96    JLG
B-1   5/8/96    DWR
B-2   11/4/95   JAM
B-2   2/3/96    JLG
B-2   5/8/96    DWR
B-3   2/3/96    JLG
B-3   5/8/96    CRS

Analyses
Well  SampDate  Param  Value  Flag
B-1   2/3/96    As     .05    not det
B-1   2/3/96    pH     6.8
B-1   5/8/96    As     .05    not det
B-1   5/8/96    Cl     .05    not det
B-1   5/8/96    pH     6.7
B-2   11/4/95   As     3.7    detected
B-2   11/4/95   Cl     9.1    detected
B-2   11/4/95   pH     5.2
B-2   2/3/96    As     2.1    detected
B-2   2/3/96    Cl     8.4    detected
B-2   2/3/96    pH     5.3
B-2   5/8/96    As     1.4    detected
B-2   5/8/96    Cl     7.2    detected
B-2   5/8/96    pH     5.8
B-3   2/3/96    As     .05    not det
B-3   2/3/96    pH     8.1
B-3   5/8/96    As     .05    not det
B-3   5/8/96    Cl     .05    not det
B-3   5/8/96    pH     7.9

Figure 6 - Environmental data in second normal form - Problems: Compound Keys, Repeated Values
Fifth Normal Form – Fifth Normal Form requires that you be able to recreate exactly the original table from the tables into which it was decomposed. Our tables do not meet this criterion, since we cannot duplicate the repeating groups (value and flag for As, Cl, and pH) from our original table because we don't know the order in which to display them. In order to overcome this, we need to add a table with the display order for each parameter.

Param  Parameter  Order
1      As         1
2      Cl         2
3      pH         3

Figure 7 - Parameters table for fifth normal form
The Flag field can be handled in a similar way. Note that when we handle the flag as a lookup table (a table containing values and codes), we are required to have values for that field in all of the records, or else records will be lost in a join at retrieval time. We have used “v” for detected value for the pH results, but a different flag could be used if desired to signify that these are different from the chemical values. Removal of compound keys – In order to minimize redundant data storage and to decrease the opportunity for error, it is often useful to remove compound keys from the tables and replace them with an artificial key that represents the removed values. For example, the Analyses table has a compound key of Well and SampDate. This key can be replaced by a SampleID field, which points to the sample table. Likewise, the parameter name can be removed from the Analyses table and a code inserted in its place. These keys can be AutoNum fields maintained by the system, and
don’t have to have any meaning relative to the data other than that the data is assigned to that key. Keys of this type are called synthetic keys and, because they are not compound, are also called simple keys. In many data management systems, some activities are made much easier by simple synthetic keys. For example, joining records between tables, and especially more complicated operations like pivot tables, can be much easier when there is a single field to use for the operation. The final result of the conversion to Fifth Normal Form is shown in Figure 8. The SampleID field in the Samples table is called a primary key, and must be unique for each record in the table. The SampleID field in the Analyses table is called a foreign key because it is the primary key in a different (foreign) but related table.
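To show what this final design might look like in SQL terms, here is a minimal data definition sketch of the five tables in Figure 8. It is only an illustration of the primary and foreign keys discussed above; the data types, the generic SQL syntax (rather than Access AutoNumber fields), and the renaming of the Order field are assumptions for this sketch, not a prescription.

CREATE TABLE Stations (
  WellID INTEGER PRIMARY KEY,   -- simple synthetic key for each well
  Well   TEXT,
  Elev   NUMERIC,
  X      NUMERIC,
  Y      NUMERIC
);

CREATE TABLE Samples (
  SampleID INTEGER PRIMARY KEY,                    -- primary key
  WellID   INTEGER REFERENCES Stations (WellID),   -- foreign key to Stations
  SampDate DATE,
  Sampler  TEXT
);

CREATE TABLE Parameters (
  Param     INTEGER PRIMARY KEY,
  Parameter TEXT,
  SortOrder INTEGER   -- display order; renamed here because ORDER is a reserved word in SQL
);

CREATE TABLE Flags (
  Flag TEXT PRIMARY KEY,   -- 'u' = not det, 'v' = detected
  Name TEXT
);

CREATE TABLE Analyses (
  SampleID INTEGER REFERENCES Samples (SampleID),    -- foreign key to Samples
  Param    INTEGER REFERENCES Parameters (Param),    -- foreign key to Parameters
  Value    NUMERIC,
  Flag     TEXT REFERENCES Flags (Flag)              -- coded detection flag
);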
STRUCTURED QUERY LANGUAGE Normalization of a data model usually breaks the data out into multiple tables. Often to view the data in a way that is meaningful, it is necessary to put the data back together again. You can view this as “de-normalizing” or reconstituting the data. The tool most commonly used to do this is Structured Query Language, or SQL. The following sections provide an overview of SQL, along with examples of how it is used to provide useful data.
Overview of SQL The relational data model provides a way to store complex data in a set of relatively simple tables. These tables are then joined by key fields, which are present in more than one table and allow the data in the tables to be related to each other based on the values of these keys. Structured Query Language (SQL) is an industry-standard way of retrieving data from a relational database management system (RDBMS). There are a number of good books available on the basics of SQL, including van der Lans (1988) and Trimble and Chappell (1989).
How SQL is used In the simplest sense, an SQL query can take the data from the various tables in the relational model and reconstruct them into a flat file again. The benefit is that the format and content of the resulting grid of data can be different each time a retrieval is performed, and the format of the output is somewhat independent of the structure of the underlying tables. In other words, the presentation is separate from the contents. There are two parts to SQL: the data definition language (DDL) for creating and modifying the structure of tables, and the data manipulation language (DML) for working with the data itself. Access and other graphical data managers replace the SQL DDL with a graphical system for creating and editing table structures. This section will discuss SQL DML, which is used for inserting, changing, deleting, and retrieving data from one or more relational tables. The SQL keywords for changing data are INSERT, UPDATE, and DELETE. As you would expect, INSERT places records in a table, UPDATE changes values in a table, and DELETE removes records from a table. Data retrieval using SQL is based on the SELECT statement. The SELECT statement is described in a later section on queries. SQL is a powerful language, but it does take some effort to learn it. In many cases it is appropriate to hide this complexity from users. This can be done using query-by-form and other techniques where users are asked for the necessary information, and then the query is generated for them automatically.
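As a brief illustration of these DML keywords, the statements below add, change, and remove records in the tables from the normalization example. The specific values are made up, and date literal syntax varies between database products (standard SQL is shown; Access, for example, would use #8/7/96#).

-- Add a new sampling event for well 1.
INSERT INTO Samples (SampleID, WellID, SampDate, Sampler)
VALUES (8, 1, DATE '1996-08-07', 'DWR');

-- Correct a pH value that was entered incorrectly.
UPDATE Analyses
SET Value = 6.9
WHERE SampleID = 1 AND Param = 3;

-- Remove an analysis that was reported in error.
DELETE FROM Analyses
WHERE SampleID = 2 AND Param = 2;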
Stations
WellID  Well  Elev  X     Y
1       B-1   725   1050  681
2       B-2   706   342   880
3       B-3   714   785   1101

Samples
WellID  SampleID  SampDate  Sampler
1       1         2/3/96    JLG
1       2         5/8/96    DWR
2       3         11/4/95   JAM
2       4         2/3/96    JLG
2       5         5/8/96    DWR
3       6         2/3/96    JLG
3       7         5/8/96    CRS

Flags
Flag  Name
u     not det
v     detected

Parameters
Param  Parameter  Order
1      As         1
2      Cl         2
3      pH         3

Analyses
SampleID  Param  Value  Flag
1         1      .05    u
1         3      6.8    v
2         1      .05    u
2         2      .05    u
2         3      6.7    v
3         1      3.7    v
3         2      9.1    v
3         3      5.2    v
4         1      2.1    v
4         2      8.4    v
4         3      5.3    v
5         1      1.4    v
5         2      7.2    v
5         3      5.8    v
6         1      .05    u
6         3      8.1    v
7         1      .05    u
7         2      .05    u
7         3      7.9    v

Figure 8 - Environmental data in fifth normal form with simple keys and coded values
Data retrieval is based on the SQL SELECT statement. The basic SELECT statement syntax (the way it is written) is:

SELECT field list
FROM table list
WHERE filter expression

An example of this would be an SQL query to extract the names and depths of the borings (stations) deeper than 10 feet:

SELECT StationName, Depth
FROM Stations
WHERE Depth > 10

The SELECT and FROM clauses are required. The WHERE clause is optional. Data retrieval in Access and other relational database managers such as Oracle and SQL Server is based on this language. Most queries are more complicated than this. The following is an example of a complicated multi-table query created by Enviro Data, a product from Geotech Computer Systems, Inc. This query outputs data to be used for a report of analytical results.
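Before looking at that query, a simpler multi-table example may help connect SELECT to the normalized tables built earlier in this chapter. The query below is a sketch (not one of the book's figures) that joins the Fifth Normal Form tables of Figure 8 back into a flat view much like the original laboratory file:

SELECT Stations.Well, Samples.SampDate, Samples.Sampler,
       Parameters.Parameter, Analyses.Value, Flags.Name
FROM Stations, Samples, Analyses, Parameters, Flags
WHERE Samples.WellID = Stations.WellID        -- join Samples to Stations
AND Analyses.SampleID = Samples.SampleID      -- join Analyses to Samples
AND Parameters.Param = Analyses.Param         -- decode the parameter number
AND Flags.Flag = Analyses.Flag                -- decode the detection flag
ORDER BY Stations.Well, Samples.SampDate, Parameters.Parameter

Because the presentation is separate from the storage, the same tables can feed many differently shaped retrievals of this kind.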
Figure 9 - Data display resulting from a query
SELECT DISTINCTROW Stations.StationName, Stations.StationType, Stations.Location_CX,
    Stations.Location_CY, Stations.GroundElevation, Stations.StationDate_D,
    Samples.SampleType, Samples.SampleTop, Samples.SampleBottom, Samples.Sampler,
    Samples.SampleDate_D, Parameters.ParameterNumber, Parameters.LongName,
    Analytic.Value, Analytic.AnalyticMethod, Analytic.AnalyticLevel,
    Analytic.ReportingUnits, Analytic.Flag, Analytic.Detect, Analytic.Problems,
    Analytic.AnalDate_D, Analytic.Lab, Analytic.ReAnalysis AS Expr1
FROM StationTypes INNER JOIN (Stations INNER JOIN (Samples INNER JOIN
    ([Parameters] INNER JOIN Analytic
    ON Parameters.ParameterNumber = Analytic.ParameterNumber)
    ON Samples.SampleNumber = Analytic.SampleNumber)
    ON Stations.StationNumber = Samples.StationNumber)
    ON StationTypes.StationType = Stations.StationType
WHERE (((Stations.StationName) Between [Forms]![AnalyticReport]![StartingStation]
    And [Forms]![AnalyticReport]![EndingStation])
    AND ((Samples.SampleTop) Between [Forms]![AnalyticReport]![LowerElev]
    And [Forms]![AnalyticReport]![UpperElev])
    AND ((Samples.SampleBottom) Between [Forms]![AnalyticReport]![LowerElev]
    And [Forms]![AnalyticReport]![UpperElev])
    AND ((Parameters.ParameterNumber) Between [Forms]![AnalyticReport]![LowerParam]
    And [Forms]![AnalyticReport]![UpperParam]))
ORDER BY Stations.StationName, Samples.SampleTop;

Users are generally not interested in all of this complicated SQL code. They just want to see some data. Figure 9 shows the result of a query in Access.
Figure 10 - Query that generated Figure 9
The SQL statement to generate Figure 9 is shown in Figure 10. It’s beyond most users to type queries like this. Access helps these users to create a query with a grid-type display. The user can drag and drop fields from the tables at the top to create their result. Figure 11 shows what the grid display looks like when the user is creating a query.
Figure 11 - Grid display of the query in Figure 10
Figure 12 - Query-by-form in the Enviro Data customized database system (This software screen and others not otherwise noted are courtesy of Geotech Computer Systems)
Even this is too much for some people. The complexity can be hidden even more in a customized system. The example in Figure 12 allows the users to enter selection criteria without having to worry too much about tables, fields, and so on. Then if users click on List they will get a query display similar to the one in Figure 9. In this system, they can also take the results of the query (the code behind this form creates an SQL statement just like the one above) and do other things such as create maps and reports.
BENEFITS OF NORMALIZATION The data normalization process can take a lot of effort. Why bother? The answer is that normalizing your data can provide several benefits, including greater data integrity, increased storage efficiency, and more flexible data retrieval.
Data integrity A normalized data model improves data integrity in several ways. The most important is referential integrity. This is an aspect of managing the data that is provided by the data management software. With referential integrity, relationships can be defined which require that a record in one table must be present before a record in a related table can be added. Usually the required record is a parent record (the “one” side of the one-to-many), which is necessary before a child record (the “many” side) can be entered. An example would be preventing you from entering a sample from a well before you have entered the well. More importantly, referential integrity can prevent deletion of a record when there are related records that depend on it, and this can be a great tool to prevent “orphan” records in the database. A good introduction to referential integrity in Access can be found in Harkins (2001b).
A related concept is entity integrity. This requires that every row be uniquely identified, meaning that each primary key value in the table must be unique and non-null. It is always a good idea for tables to have a primary key field and for entity integrity to be enforced. Another contributor to data integrity is the use of lookup tables, which is a capability of the relational database design. Lookup tables are tables where values that are repeated in the database are replaced by codes in the main data tables, and another table contains the codes and the lookups. An example is types of stations in the Stations table. Values for this field might include "soil boring," "monitoring well," and so on. These could be replaced by codes like "s" and "m," respectively, in the main table, and the lookup table would contain the translation between the code and the value. It is much more likely that a user will mistype "detected" than "2." Errors are even less likely if they are picking the lookup value off a list. They can still pick the wrong one, but at least errors from misspelling will be minimized.
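As a sketch of how these integrity rules can be declared, the statements below define a StationTypes lookup table and tie a Stations table to it. The table and field names echo the Enviro Data query shown earlier but are only illustrative, and the exact mechanism varies by product; in Access, for example, referential integrity is normally set in the Relationships window rather than typed as SQL.

-- Lookup table: one row per allowed station type code.
CREATE TABLE StationTypes (
  StationType TEXT PRIMARY KEY,   -- entity integrity: the code must be unique and non-null
  Description TEXT                -- e.g., 's' = soil boring, 'm' = monitoring well
);

-- Main table: every StationType code entered here must already exist in the
-- lookup table (referential integrity), and a type that is in use cannot be deleted.
CREATE TABLE Stations (
  StationNumber INTEGER PRIMARY KEY,
  StationName   TEXT,
  StationType   TEXT REFERENCES StationTypes (StationType)
);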
Efficient storage Despite the increase in organization, the data after normalization as shown in Figure 8 actually takes less room than the original flat file. The data normalization process helps remove redundant data from the database. Because of this it may take less space to store the data. The original flat file contained 293 characters (not including field names) while the Fifth Normal Form version has only 254 characters, a decrease of 13% with no decrease in information content. The decrease in size can be quite a bit more dramatic with larger data sets. For example, in the new normalized model adding data for another analysis would take up only about 6 characters, where in the old model it would be about 34 characters, a savings in size of 82%.
Flexible retrieval When the data is stored in a word processor or spreadsheet, the output is usually a direct representation of the way the data is stored in the file. In a relational database manager, the output is usually based on a query, which provides an opportunity to organize and format the data a different way for each retrieval need.
AUTOMATED NORMALIZATION Normalization of a database is such an important process that some database programs have tools to assist with the normalization process. For example, Microsoft Access has a Table Analyzer Wizard that can help you split a table into multiple normalized tables based on the data content. To illustrate this, Figure 13 shows a very simple flat file of groundwater data with several stations, sample dates, and parameters. This file is already in First Normal Form, with no repeating groups, which is the starting point for the Wizard. The Table Analyzer Wizard was then run on this table. Figure 14 shows the result of this analysis. The Wizard has created three tables, one with the value, one with the station and sample date, and one with the parameter, units, and flag. It has also added several key fields to join the tables. The user can then modify the software’s choices using a drag and drop interface. The Wizard then steps you through a process for correcting typographical errors to improve the normalization process. For comparison, Figure 15 shows how someone familiar with the data would normalize the tables, creating primary tables for stations, samples, and analyses, and a lookup table for parameters. (Actually, units and flags should probably be lookup tables too.)
Figure 13 - Data set for the Table Analyzer Wizard
Figure 14 - Wizard analysis of groundwater data
Figure 15 - Manual normalization of groundwater data
For this data set, the Wizard didn’t do very well, missing the basic hierarchical nature of the data. With other data sets with different variability in the data it might do better. In general the best plan is for people familiar with the data and the project needs to design the fields, tables, and relationships based on a thorough understanding of the natural relationships between the elements being modeled in the database. But the fact that Microsoft has provided a tool to assist with the normalization process highlights the importance of this process in database design.
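As a hedged sketch of the structure just described for Figure 15 (illustrative field names only; a production design would carry many more fields), the manually normalized tables might be declared like this:
CREATE TABLE Stations (
  StationID   INTEGER PRIMARY KEY,
  StationName VARCHAR(50) NOT NULL
);
CREATE TABLE Samples (
  SampleID   INTEGER PRIMARY KEY,
  StationID  INTEGER NOT NULL REFERENCES Stations (StationID),
  SampleDate DATE    NOT NULL
);
CREATE TABLE Parameters (                -- lookup table
  ParameterID   INTEGER PRIMARY KEY,
  ParameterName VARCHAR(40) NOT NULL,
  Units         VARCHAR(10)
);
CREATE TABLE Analyses (
  AnalysisID  INTEGER PRIMARY KEY,
  SampleID    INTEGER NOT NULL REFERENCES Samples (SampleID),
  ParameterID INTEGER NOT NULL REFERENCES Parameters (ParameterID),
  Value       FLOAT,
  Flag        VARCHAR(2)
);
The REFERENCES clauses capture the hierarchical station-to-sample-to-analysis relationships that the Wizard missed.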
CHAPTER 4 DATA CONTENT
A database can contain just about anything. The previous chapter discussed what a database is and how the data can be stored. A bigger issue is what data the database will contain. It is also important to make sure that database users understand the data and use it properly.
DATA CONTENT OVERVIEW The data contained in an EDMS can come from a wide variety of sources and can be used in many different ways. For some published examples, see Sara (1974), especially chapters 7 and 11. There is a wide range of data needs for the different projects performed by the typical environmental organization. With projects ranging from large sites with hundreds of wells and borings to small service stations with minimal data, and staffing levels from dozens of people on one project to many projects for one person, the variability of the data content needed for environmental projects is very great. Even once a data management system is in place, additional data items may be identified over time, as other groups and individuals are identified and become involved in the data management system, and as different project needs arise. Since different groups have differing data storage needs, a primary design goal of the data management system must be to accommodate this diversity while fostering standardization where possible. Sets of needs must be considered during design and development to avoid unnecessary limitations for any group. While there is always a trade-off between flexibility, ease of use, and development cost in creating a data management system, a well-designed data model for information storage can be worth the initial expense. Typical data components of an EDMS include four areas. The first three are project related: Project Technical Data, Project Administrative Data, and Project Document Data. The fourth, Reference Data, is not specific to an individual project, but in many cases has relationships to project-specific data. Often project data is the same as site data, so the terms are used interchangeably here. Figure 16 shows the main data components for the EDMS. This is just one way of organizing data. There are probably other equally valid ways of assigning a hierarchy to the data. Different data components are covered here in different levels of detail. The primary focus of this book is on project technical data, and especially data related to site investigation and remediation. This data is described briefly here, and in greater detail throughout the book. Other data components are covered briefly in this section, and have been well discussed in other books and articles. For each category of data, the major data components will be listed, along with, in many cases, comments about data management aspects of those components.
Figure 16 - Overview of the EDMS data model, showing enterprise data divided into project data (technical, administrative, and documents) and reference data
PROJECT TECHNICAL DATA This category covers the technical information used by project managers, engineers, and geoscientists in managing, investigating, and remediating sites of environmental concern. Many acronyms are used in this section. The environmental business loves acronyms. You can find the meanings of these acronyms in Appendix A.
Product information and hazardous materials In an operating facility, tracking of materials within and moving through the facility can be very important from a health and safety perspective. This ranges from purchasing of materials through storage, use, and disposal. There are a number of aspects of this process where data management technology can be of value. The challenge is that in many cases the data management process must be integrated with the process management within the facility. Material information – The information to be stored about the materials themselves is roughly the same for all materials in the facility. Examples include general product information; materials management data; FIFRA labeling; customer usage; chemical information; and MSDS management, including creation, access, and updating/maintenance. Material usage – Another set of data is related to the usage of the materials, including shelf life tracking; recycling and waste minimization and reporting; and allegation and incident reports related to materials. Also included would be information related to keeping the materials under control, including source inventory information; LDR; pollution prevention; TSCA and asbestos management; and exception reports.
Hazardous wastes When hazardous materials move from something useful in the facility to something to be disposed of, they become hazardous wastes. This ranges from things like used batteries to toxic mixed wastes (affectionately referred to with names like “trimethyl death”). Waste handling – Safe handling and disposal of hazardous chemicals and other substances involves careful tracking of inventories and the shipping process. There is a significant recordkeeping component to working with this material. Data items include waste facility permit applications; waste accumulation and storage information (usually for discrete items like batteries); and waste stream information (more for materials that are part of a continuous process).
If you put a drop of wine in a gallon of hazardous waste, you get a gallon of hazardous waste. If you put a drop of hazardous waste in a gallon of wine, you get a gallon of hazardous waste. Rich (1996) Waste disposal – If the waste is to be disposed onsite, then detailed records must be kept of the process, and data management can be an important component. If the waste is to be shipped offsite, then waste manifesting; waste shipping; NFPA/HMIS waste labels; and hazardous waste reports are important issues.
Environmental releases Inevitably, undesirable materials make it from the facility into the environment, and the EDMS can help track and report these releases. Types of releases – Water quality and drinking water management cover issues related to supposedly clean water. Wastewater management and pretreatment management cover water that is to be released beyond the facility. Once the material makes it into the groundwater, then groundwater management becomes important as discussed in the next section. Release issues – Regarding the releases themselves, issues include emissions management; air emissions inventory; NPDES permits; discharges and stormwater monitoring; stormwater runoff management; leak detection and LDAR; toxic chemical releases and TRI; and exceedence monitoring and reporting.
Site investigation and remediation The data gathered for site investigation and remediation is the major topic of this book. This data will be discussed briefly here, and in much more detail in other places in the book, especially Chapter 11 and Appendix B. Site data – Site data covers general information about the project such as location; type (organics, metals, radiologic, mixed, etc.); ownership; status; and so on. Other data to be stored at the site level includes QA and sampling plans; surveys; field data status; various other components of project status information; and much other information below the site level. Geology and hydrogeology – Geology and hydrogeology data includes surface geology information such as geologic maps, as well as subsurface information from wells, borings, and perhaps geophysical surveys. There are two kinds of geology/hydrogeology data for a site. The first could be considered micro data. This data is related to a particular boring or surface sample, and includes physical, lithological, and stratigraphic information. This information can be considered part of the sampling process, and can be included in the chemistry portion of the EDMS. The other type of data can be considered macro. This information covers more than an individual sample. It would include unit distribution and thickness maps; outcrop geology; facies maps (where appropriate); hydraulic head maps; and other data that is site-wide or nearly so. Stations – This is information related to locations of sampling such as monitoring wells; borings; surface water and air monitoring locations; and other sampling locations. In the EDMS, station data is separated into sample-related data and other data. The sample-related data is discussed below. Other station-related data includes surveys; boring logs; wellbore construction; and stratigraphy and lithology on a boring basis (as opposed to by sample). Some primary data elements for stations can cause confusion, including station coordinates and depths. For example, the XY coordinates used to describe station locations are based on some projection from latitude-longitude, or measurements on the surface of the earth from some (hopefully) known point (see Chapter 22). Either way, the meaning of the coordinates depends on reference information about the projection system of known points. Another example is depths,
which are sometimes stored as measured depths, and other times as elevations above sea level. With both of these examples, interpretation of numbers in fields in the database depends on knowledge of some other information.
When control equipment fails, it will do so during inspection. Rich (1996)
Samples – Information gathered for each sample includes sample date; frequency; depth; matrix; and many other things. This information is gathered from primary documents generated as part of the sampling process, such as the sample report and the Chain of Custody (COC). Quality control (QC) samples are an integral part of the sampling process for most environmental projects. An area where data management systems have great potential is in the interface between the field sampling event, the laboratory, and the centralized database; and for event planning; field management; and data checking.
Analyses – The samples are then analyzed in the field and/or in the laboratory for chemical, physical, geotechnical, and sometimes geophysical parameters. QC data is important at the analyses level as well.
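As a hedged illustration of the depth issue above, suppose the Stations table carries a surveyed ground elevation and each sample records its depth below ground surface (hypothetical field names). Converting a sample depth to an elevation then requires both pieces of information, and is only meaningful if they share the same datum and units:
SELECT st.StationName, s.SampleDate, s.SampleDepth,
       st.GroundElevation - s.SampleDepth AS SampleElevation
FROM Stations AS st
  INNER JOIN Samples AS s ON s.StationID = st.StationID;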
Cartography This category includes map-related data such as site maps; coordinates; air and satellite photos; and topography. It is a broad category. For example, site maps can include detailed maps of the site as well as regional maps showing the relationship of the site to geographic or other features. Implementation of this category in a way that is accessible to all the EDMS users requires providing some map display and perhaps editing capabilities as part of the system. This can be done by integrating an external mapping program such as ArcView, MapInfo, or Enviro Spase, or by inserting a map control (software object) like GeoObjects or MapObjects into the Access database. Data displayed on maps presented this way could include not only the base map information such as the site outline; buildings; etc. but also data from the EDMS, including well locations; sample counts; analytical values; and physical measurements like hydraulic head.
Coverages Some site data is discrete, that is, it is gathered at and represents a specific location in space. Other data represents a continuous variable across all or part of the site, and this type of data is referred to as a coverage. Examples of coverage data include surface elevation or the result of geophysical surveys such as gravity or resistivity. Often a single variable is sampled at discrete locations selected to represent a surface or volume. These locations can be on a regular grid or at irregular locations to facilitate sampling. In the first case, the locations can be stored implicitly by specifying the grid geometry and then listing the values. In the second case the coordinates need to be stored with each observation.
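One hedged way to store the two cases relationally (illustrative names): for a regular grid the geometry is stored once and only the values are listed, while irregular observations carry their own coordinates.
-- Regular grid: store the geometry once, then only the values
CREATE TABLE GridDefinitions (
  GridID   INTEGER PRIMARY KEY,
  OriginX  FLOAT,            -- coordinates of the grid origin
  OriginY  FLOAT,
  CellSize FLOAT,            -- spacing between grid nodes
  Columns  INTEGER,
  Rows     INTEGER
);
CREATE TABLE GridValues (
  GridID INTEGER NOT NULL REFERENCES GridDefinitions (GridID),
  Col    INTEGER NOT NULL,
  Row    INTEGER NOT NULL,
  Value  FLOAT,
  PRIMARY KEY (GridID, Col, Row)
);

-- Irregular locations: store the coordinates with each observation
CREATE TABLE CoverageObservations (
  ObservationID INTEGER PRIMARY KEY,
  X     FLOAT NOT NULL,
  Y     FLOAT NOT NULL,
  Value FLOAT
);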
Models Models are spatial data that result from calculations. The EDMS can work with models in two ways. The first is to store and output data to be used in the modeling process. An example would be to select data and export an XYZ file of the current concentration of some constituent for use in a contouring program. This feature should be a standard part of the EDMS. The other model component of the EDMS is storage and display of the results of modeling. Most modeling projects result in a graphical output such as a contour map or three-dimensional display, or a numerical result. Usually these results can be saved as images in a file in vector (lines,
polygons, etc.) or raster (pixels) format, or in numeric fields in tables. These graphics files or other results can be imported into the EDMS and displayed upon request. Examples of models often used in environmental projects include air dispersion; surface water and watershed configuration and flow; conceptual hydrologic; groundwater fate and transport; subsurface (2-D and 3-D) contouring; 3-D modeling and visualization; statistics, geostatistics, and sampling plan analysis; cross sections; volume estimates; and resource thicknesses and stripping ratios. Some of these are covered in more detail in later chapters.
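For the export side mentioned above, producing an XYZ file for a contouring program can be a simple query whose output is written to a text file. This is a hedged sketch with hypothetical field names, and date literal syntax varies by database product:
SELECT st.XCoord AS X, st.YCoord AS Y, a.Value AS Z
FROM Stations AS st
  INNER JOIN Samples    AS s ON s.StationID = st.StationID
  INNER JOIN Analyses   AS a ON a.SampleID  = s.SampleID
  INNER JOIN Parameters AS p ON p.ParameterID = a.ParameterID
WHERE p.ParameterName = 'Benzene'
  AND s.SampleDate = '2001-06-15';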
Other issues Other data elements – The list of possible data elements to store for remediation projects is almost endless. Other data elements and capabilities include toxicity studies and toxicology; risk assessment; risk analysis and risk management; biology and ecology; SCADA; remediation planning and remedial plans; design calculations; and geotechnical designs. Summary information – Often it is useful to be able to see how many items of a particular kind of data are stored in the database, such as how many wells there are at a particular site, or how many arsenic analyses in the last year exceeded regulatory limits. There are two ways to provide this information, live or canned. In live count generation, the records are counted in real time as the result of a query. This is the most flexible approach, since the counts can be the result of nearly any query, and usually the most accurate, since the results are generated each time from the base data. With canned count generation, the counting is done separately from the display, and the results stored in a table. This is useful when the counts will be used often, and doing the counting takes a long time. It has the disadvantage that the counts can become “stale” when the primary data changes and the counts are not updated. Some systems use a combination of both approaches.
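A hedged sketch of the two approaches, with illustrative names (the date function and exact syntax vary by database product):
-- Live count: generated from the base data each time it is requested
SELECT COUNT(*) AS WellCount
FROM Stations
WHERE SiteID = 12 AND StationType = 'm';

-- Canned count: the same number computed on a schedule and stored
-- in a summary table, where it can become stale between refreshes
DELETE FROM SummaryCounts WHERE CountName = 'MonitoringWells_Site12';
INSERT INTO SummaryCounts (CountName, CountValue, CountDate)
SELECT 'MonitoringWells_Site12', COUNT(*), CURRENT_DATE
FROM Stations
WHERE SiteID = 12 AND StationType = 'm';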
PROJECT ADMINISTRATIVE DATA Project administrative data covers a wide variety of types of data related to managing and administering projects. Examples of project administrative and management data elements include general site information; project management data; health and safety; and employee information. Some of this data consists of numbers and text items that should be stored using database management software. Other parts of this data are more document oriented, and might be a better fit in a document management system as discussed in a later section.
Site information Project overview – This could include site profile (overview) reports and descriptions; status reports and other general project information; and process descriptions and history. Ownership and permits – Items in this category include ownership and related administrative information; property transfers and related contact information; and information about permitting, permit management and tracking. Infrastructure – Possible infrastructure data elements include building and equipment information such as type; location; and operating history. Operations and maintenance – This is a broad area with much opportunity for data management assistance. This includes data related to requirements management; certifications, inspections, and audits; QA planning and documentation; continuous improvement and performance management; lockout/tagout surveys for equipment (where equipment that is not working properly is removed from service); energy analysis; emergency response; hazard analysis and tracking; collection systems; and residuals and biosolids management
Laboratory operations – If the site has one or more onsite laboratories, this would include data related to the LIMS; lab data checking; EDDs; and Web data delivery.
Project management Project management information can be stored as part of the general site database, or in specialized project management or accounting software. Budgets – Typical information in this category includes planned expenditures vs. actual expenditures; progress (% complete); costs on an actual and a unit basis, including EH&S (employee health and safety) expenses; sampling and analysis costs; and fee reports. Schedules – Schedule items can be regulatory items such as report due dates and other deadlines; or engineering items like work breakdown and critical path items. Reimbursement and fund recovery – For some projects it is possible to obtain partial reimbursement for remediation expenses from various funds such as state trust funds. The database can help track this process. Emission reduction credits – Facilities are required to meet certain emission criteria, and to encourage them to meet or even exceed these criteria they can be issued emission reduction credits that can be sold to facilities that are having trouble meeting their criteria. The second facility can then use these credits to minimize penalties. Customers and vendors – This would include information such as customer’s and vendor’s names and contact information; purchasing data; and electronic procurement of products and services. Other issues – There are many other project management data items that could be stored in the database. These include project review reports, records of status meetings, and so on.
Health and safety Tracking employee health and training is another fruitful area for a computerized data management approach. Because the health and safety of people is at stake, great care must be taken in data gathering, manipulation, and reporting of health and safety information. Facility information – Information to be tracked at the facility level includes process safety; hazard assessments; workplace safety plans and data; fire control and alarm systems; safety inspections and audits; and OSHA reports. Employee information – There is a large amount of information in this category that needs to be tracked, including safety, confined space, and other specialized training; accident/illness/injury tracking and reporting; individual exposure monitoring; and workers’ compensation and disability coverage and claims. General health information – This category includes industrial hygiene; occupational medicine; environmental risk factors; toxicology and epidemiology; and monitoring of the health exposure and status of onsite and offsite individuals.
Personnel/project team Project staff – Data categories for project staff include the health and safety information discussed above, as well as organization information; personnel information; recruiting, hiring, and human resource information; demographics; training management and records; and certifications and fit tests. Others – Often information similar to that stored for project staff must be stored for others such as contractor and perhaps laboratory personnel.
Incident tracking and reporting Despite the best efforts of project personnel, incidents of various types occur. The database system can help minimize incidents, and then track them when they do occur. Planning – Data to be stored prior to an incident includes emergency plans and information on emergency resources. Responding – This item covers things such as emergency management information and mutual aid (cooperation on emergency response). Tracking – This includes incident tracking; investigation and notification; “near miss” tracking; spill tracking; agency reports; and complaint tracking and response.
Regulatory status Regulatory status information for a facility or project includes quality assurance program plans; corrective action plans and corrective actions; rulings; limits (target values) and reporting times; Phase 1, Form R, and SARA (Superfund) reporting; Clean Air Act management; right to know management; project oversight and procedures; ROD and other limits; and certifications (such as ISO 9000 and 14000).
Multi-plant management Large organizations often have a large number of facilities with various environmental issues, each with a different level of urgency. Even big companies have limited resources, both financial and personnel-wise, and are unable to pay full attention to all of the needs of all of the facilities. Unfortunately, this can lead to a “brush-fire” approach to project management, where the projects generating the most heat get the most attention. Methods exist to evaluate facilities and prioritize those needing the most attention, with the intention of dealing with issues before they become serious problems.
Public affairs, lobbying, legislative activities The response of the public to site activities is increasing greatly in importance, with activities such as environmental justice lawsuits requiring more and more attention from site and corporate staff. A proactive approach of providing unbiased information to the public can, in some cases, minimize public outcry. A data management system can help organize these activities. Likewise, contacts with legislators and regulators can be of value, and can be managed using the database system.
PROJECT DOCUMENT DATA In terms of total volume of data, the project document data is probably the largest amount of data that will need to be managed within any organization. It may also be the most diverse from the viewpoint of its current physical configuration and its location. Options for storing document data are discussed in a later section of this chapter. The distinction between this category and the previous one about administrative data is that the previous data elements are part of the operations process, while items in this category are records of what happened. These items include compliance plans and reports; investigative reports (results); agreements; actual vs. planned activities; correspondence; drawings; and document management, control, and change tracking.
REFERENCE DATA Often it is desirable to provide access to technical reference information, both project-specific and general, on a timely basis. This includes boilerplate letters and standardized report templates. Reference data looks similar to project document data in its physical appearance, but is not associated with a specific project. Storing this type of information in a centralized location can be a great time saver in getting the work out. The storage options for this data are mostly the same as those for project document data. One difference is that some of the components of this data type may be links to external data sources, rather than data stored within the organization. An example of this might be a reference source for updated government regulations accessed from the Internet via the company intranet. Other than that, the reference data will usually be handled in the same way as project document data.
Administrative General administrative data can be similar to project data, but for the enterprise rather than for a specific project. This data includes timesheets; expense reports; project and task numbers; policies; correspondence numbers; and test reports. This category can also include information about shared resources, both internal and external, such as personnel and equipment availability; equipment service records; contractors; consultants; and rental equipment.
Regulatory Regulatory compliance information can be either general or project-specific. The project specific information was discussed above. In many cases it is helpful to store general regulatory compliance data in the database, including regulatory limits; reporting time guidelines; analyte suites; and regulatory databases (federal, state, local), including copies of regulations and decisions, like the Code of Federal Regulations. Regulatory alerts and regulatory issue tracking can also be helped with the use of a database.
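As a hedged sketch (illustrative names; actual limits vary by jurisdiction, matrix, and analyte, and the analysis and the limit must be in the same units), storing limits as reference data lets a single query flag results that exceed them:
CREATE TABLE RegulatoryLimits (
  ParameterID INTEGER     NOT NULL,
  Program     VARCHAR(20) NOT NULL,  -- e.g. a state groundwater standard
  LimitValue  FLOAT       NOT NULL,
  Units       VARCHAR(10),
  PRIMARY KEY (ParameterID, Program)
);

-- Flag analytical results above the limit for one program
SELECT st.StationName, s.SampleDate, p.ParameterName,
       a.Value, rl.LimitValue
FROM Analyses AS a
  INNER JOIN Samples          AS s  ON s.SampleID = a.SampleID
  INNER JOIN Stations         AS st ON st.StationID = s.StationID
  INNER JOIN Parameters       AS p  ON p.ParameterID = a.ParameterID
  INNER JOIN RegulatoryLimits AS rl ON rl.ParameterID = a.ParameterID
WHERE rl.Program = 'StateGW'
  AND a.Value > rl.LimitValue;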
Documentation Documentation at the enterprise level includes the information needed to run projects and the organization itself, from both the technical and administrative perspective. Technical information – Reference information in this category includes design information and formulas; materials engineering information; engineering guidelines; and other reference data. Enterprise financial information – Just as it is important to track schedules and budgets for each project, it is important to track this information for the enterprise. This is especially true for finances, with databases important for tracking accounting information, requests for proposals and/or bids; purchase orders and AFEs; employee expenses; and so on. Document resources – Sometimes a significant amount of time can be saved by maintaining a library of template documents that can be modified for use on specific projects. Examples include boilerplate letters and standardized reports. QA data – This category includes manuals; standard operating procedures for office, facility, and field use; and other quality documents. Quality management information can be general or specific. General information can include Standard Operating Procedures and other enterprise quality documents. Specific information can be project wide, such as a Quality Assurance Project Plan, or more specific, to the level of an individual analysis. Specific quality control information is covered in more detail in Chapter 15.
News and information – It is important for staff members to have access to industry information such as news and events. Sometimes this is maintained in a database, but more commonly nowadays it is provided through links to external resources, especially Web pages managed by industry publications and organizations.
DOCUMENT MANAGEMENT In all of the above categories, much of the data may be contained in existing documents, and often not in a structured database system. The biggest concern with this data is to provide a homogeneous view of very diverse data. The current physical format of this data ranges from digital files in a variety of formats (various generations of word processing, spreadsheet, and other files) through reports and oversized drawings. There are two primary issues that must be addressed in the system design: storage format and organization.
Storage options There are many different options for storing document data. There may be a part of this data that should be stored using database management software, but it usually consists largely of text documents, and to a lesser degree, diagrams and drawings. The text documents can be handled in five ways, depending on their current physical format and their importance. These options are presented in decreasing order of digital accessibility, meaning that the first options can be easily accessed, searched, edited, and so on, while the capability to do this becomes more limited with the later options. Many document management systems use a combination of these approaches for different types of documents. The first storage option, which applies to documents currently available in electronic form, is to keep them in digital form, either in their original format or in a portable document format such as Acrobat .pdf files (a portable document format from Adobe). Digital storage provides the greatest degree of flexibility in working with the documents, and provides relatively compact data storage (a few thousand bytes per page). The remaining four storage options apply to documents currently available only in hard copy form. The second and third options involve scanning the documents into digital form. In the second option, each document is scanned and then submitted to optical character recognition (OCR) to convert it to editable text. Current OCR software has an error rate of 1% or less, which sounds pretty good until you apply this to the 2000 or so characters on the average text page which gives 20 errors per page. Running the document through a spell-checker can catch many of these errors, but in order to have a usable document, it really should be manually edited as well. This can be very time-consuming if there are a large number of documents to be managed. Also, some people are good at this kind of work and some aren’t, so the management of a project to perform this conversion can be difficult. This option is most appropriate for documents that are available only in hard copy, but must be stored in editable form for updates, changes, and so on. It also provides compact storage (a few thousand bytes per page) if the scanned images are discarded after the conversion process. The third storage option, which is the second involving scanning, omits all or most of the OCR step. Instead of converting the documents to text and saving the text, the scanned images of the documents are saved and made available to users. In order to be able to search the documents, the images are usually associated with an index of keywords in the document. These keywords can be provided manually by reviewers, or automatically using OCR and specialized indexing software. Then when users want to find a particular document they can search the keyword index, locate the document they want, and then view it on the screen or print it. This option is most appropriate when the documents must be available online, but will be used for reference and not edited. It requires a large amount of storage. The theoretical storage size for an 8½ by 11 inch black and
white page at 300 dots per inch is about one megabyte, but with compression, the actual size is usually a few hundred thousand bytes, which is about a hundred times larger than the corresponding text document. The fourth storage option is to retain the documents in hard copy, and provide a keyword index similar to that in the third option. Creating the keyword index requires manually reviewing the documents and entering the keywords into a database. This may be less time-consuming than scanning and indexing. When the user wants a document, he or she can scan the keyword index to identify it, and then go and physically find it to work with it. This option requires very little digital storage, but requires a large amount of physical storage for the hard copies, and is the least efficient for users to retrieve documents. It is the best choice for documents that must be available, but will be accessed rarely. There is a final option for documents that do not justify any of the four previous options. That is to throw them away. On a practical level this is the most attractive for many documents, but for legal and other reasons is difficult to implement. A formal document retention policy with time limits on document retention can be a great help in determining which documents to destroy and when. For oversized documents, the options for storage are similar to those for page-sized documents, but the tools for managing them are more limited. For example, for a document in Word format, anyone in the organization can open it in Word, view it, and print it. For a drawing in AutoCAD format, the user must have AutoCAD, or at least a specialized CAD viewing program, in order to access it. To further complicate the issue, that drawing is probably a figure that should be associated with a text document. Part of the decision process for selecting document storage tools should include an evaluation of the ability of those tools to handle both regular and oversized documents, if both types are used in the organization.
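For the two keyword-index options, the index itself fits naturally in the database. The following is a hedged sketch with illustrative table and field names:
CREATE TABLE Documents (
  DocumentID  INTEGER PRIMARY KEY,
  Title       VARCHAR(200),
  StorageForm VARCHAR(20),   -- e.g. 'scanned image' or 'hard copy'
  Location    VARCHAR(200)   -- file path or physical shelf location
);
CREATE TABLE DocumentKeywords (
  DocumentID INTEGER     NOT NULL REFERENCES Documents (DocumentID),
  Keyword    VARCHAR(50) NOT NULL,
  PRIMARY KEY (DocumentID, Keyword)
);

-- Find every document indexed under a given keyword
SELECT d.Title, d.StorageForm, d.Location
FROM Documents AS d
  INNER JOIN DocumentKeywords AS k ON k.DocumentID = d.DocumentID
WHERE k.Keyword = 'closure report';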
Organization and display Once the decision has been made about the storage format(s) for the documents, then the organization of the documents must be addressed. Organization of the documents covers how they are grouped, where they are stored, and how they are presented. One possibility for grouping the documents is to organize them by type of document. Another is to divide them into quality-related and quality-unrelated documents. Documents that are in a digital format should be stored on a server in a location visible to all users with a legitimate need to see them. They can be stored in individual files or in a document management system, depending on the presentation system selected. The presentation of the data depends mostly on the software chosen. The options available for the type of data considered here include a structured database management system, a specialized document management system, the operating system file system, or a hypertext system such as the hypertext markup language (HTML) used for the Internet and the company intranet. A database management system can be used for document storage if an add-in is used to help manage the images of scanned pages. This approach is best when very flexible searching of the keywords and other data is the primary consideration. Specialized document management systems are programs designed to overcome many of the limitations of digital document storage. They usually have data capture interfaces designed for efficient scanning of large numbers of hard-copy documents. They may store the documents in a proprietary data structure or in a standard format such as .tif (tagged image file format). Some can handle oversized documents such as drawings and maps. They provide indexing capabilities to assist users in finding documents, although they vary in the way the indices work and in the amount of OCR that is done as part of the indexing. Some provide manual keyword entry only, while others will OCR some or all of the text and create or at least start the index for you. Document management systems work best when there is a large number of source documents with a similar
physical configuration, such as multi-page printed reports. They do less well when the source documents are very diverse (printed reports, hard copy and digital drawings and maps, word processor and spreadsheet digital documents, etc.) as in many environmental organizations. This software might be useful if the decision is made to scan and index a large volume of printed documents, in which case it might be used to assist with the data capture. The file system built into the operating system provides a way of physically storing documents of all kinds. With modern operating systems, it provides at least some multi-user access to the data via network file sharing. This is the system currently used in most organizations for storing and retrieving digital files, and many may wish to continue with this approach for file storage. (For example, an intranet approach relies heavily on the operating system file storage capabilities.) Where this system breaks down is in retrieving documents from a large group, since the retrieval depends on traversing the directory and file structure and locating documents by a (hopefully meaningful) file name. In this case, a better system is needed for organizing and relating documents. Hypertext systems have been growing in popularity as a powerful, flexible way of presenting large amounts of data where there may be great variation in the types of relationships between data elements. Some examples of hypertext systems include the Hypercard program for the Apple Macintosh and the Help System for Microsoft Windows. A more recent and widely known example of hypertext systems is the HTML system used by the World Wide Web (and its close relative, the intranet). All hypertext systems have as a common element the hyperlink, which is a pointer from one document into another, usually based on highlighted words or images. The principal advantage of this approach is that it facilitates easy movement of the user from one resource to another. The main disadvantage is the effort required to convert documents to HTML, and the time that it takes to set up an intuitive system of links. This approach is rapidly becoming the primary document storage system within many companies.
PART TWO - SYSTEM DESIGN AND IMPLEMENTATION
CHAPTER 5 GENERAL DESIGN ISSUES
The success of a data management task usually depends on the tool used for that task. As the old saying goes (often attributed to the psychologist Abraham Maslow), “When all you have is a hammer, everything looks like a nail.” This is as true in data management as in anything else. People who like to use a word processor or a spreadsheet program are likely to use the tool they are familiar with to manage their data. But just as a hammer is not the right tool to tighten a screw, a spreadsheet is not the right tool to manage a large and complicated database. A database management program should be used instead. This section discusses the design of the database management tool, and how the design can influence the success of the project.
DATABASE MANAGEMENT SOFTWARE Database management programs fall into two categories, desktop and client-server. The use of the two different types and decisions about where the data will be located are discussed in the next section. This section looks at the database applications themselves and briefly describes their features and benefits. The major database software vendors put a large amount of effort into expanding and improving their products, so these descriptions are a snapshot in time. For an overview of desktop and Web-based database software, see Ross et al. (2001). Older database systems were, for the most part, based on dBase, or at least the dBase file format. dBase started in the early days of DOS, and was originally released as dBase II because that sounded more mature than calling it 1.0. If anyone tells you that they have been doing databases since dBase 1, you know they are bluffing. dBase was an interpreted application, meaning that its code was translated and executed line by line each time it was run rather than compiled into machine language ahead of time, which was slow on those early computers. This created a market for dBase compilers, of which FoxPro was the most popular. Both used a similar data format in which each data table was called a database file or .dbf. Relationships were defined in code, rather than as part of the data model. Much data has been, and still is in some cases, managed in this format. These files were designed for single-user desktop use, although locking capabilities were added in later versions of the software to allow shared use. Nowadays Microsoft Access dominates the desktop database market. This program provides a good combination of ease of use for beginners and power for experts. It is widely available, either as a stand-alone product or as part of the Office desktop suite. Additional information on Access can be found in books by Dunn (1994), Jennings (1995), and others, and especially in journals such as PC Magazine and Access/Visual Basic Advisor. Access has a feature that is common to almost all successful database programs, which is a programming language that allows users to
automate tasks, or even build complete programs using the database software. In the case of Access, there are actually two programming models: a macro language that is falling out of favor, and a full programming language. The programming language is called Visual Basic for Applications (VBA), and is a fairly complete development environment. Since Access is a desktop database, it has limitations relative to larger systems. Experience has shown that for practical use, the software starts to have performance problems when the largest table in a database starts to reach a half million to a million records. Access allows multiple users to share a database, and no programming is required to implement this, but a dozen or so concurrent users is an upper limit for heavy usage database scenarios. An alternative to Access is Paradox from Corel (www.corel.com). This is a programmable, relational database system, and is available as part of the Corel Office Suite. Paradox is a capable tool suitable for a complex database project, but the greater acceptance of Access makes Paradox an unlikely choice in the environmental business, where file sharing is common and Access is widespread. The next step up from Access for many organizations is Microsoft SQL Server. This is a full-scale client-server system with robust security and a larger capacity than Access. It is moderately priced and relatively easy to use (for enterprise software), and increases the capacity to several million records. It is easy to attach an Access front end (user interface) to a SQL Server back end (data storage), so the transition to SQL Server is relatively easy when the data outgrows Access. This connection can be done using ODBC (Open DataBase Connectivity) or other connection methods. For even larger installations, Oracle or IBM’s DB2 offer industrial-strength data storage, but with a price and learning curve to match. These can also be connected to the desktop using connection methods like ODBC, and one front-end application can be set up to talk to data in these databases, as well as to data in Access. Using this approach it is possible to create one user interface that can work with data in all of the different database systems. A new category of database software that is beginning to appear is open-source software. Open-source software refers to programs where the source code for the software is freely available, and the software itself is often free as well. This type of software is popular for Internet applications, and includes such popular programs as the Linux operating system and the Apache Web server. Two open-source database programs are PostgreSQL and MySQL (Jepson, 2001). These programs are not yet as robust as commercial database systems, but are improving rapidly. They are available in commercial, supported versions as well as the free open-source versions, so they are starting to become options for enterprise database storage. And you can’t beat the price. Another new category of database software is Web-based programs. These programs run in a browser rather than on the desktop, and are paid for with a monthly fee. Current versions of these programs are limited to a flat-file design, which makes them unsuitable for the complex, relational nature of most environmental data, but they might have application in some aspects of Web data delivery.
Examples of this type of software include QuickBase from the authors of the popular Quicken and QuickBooks financial software (www.quickbase.com), and Caspio Bridge (www.caspio.com).
DATABASE LOCATION OPTIONS A key decision in designing a data management system is where the data will reside. Related to this are a variety of issues, including what hardware and software will provide the necessary functionality, who will be responsible for data entry and editing, and who will be responsible for backup of the database.
Stand-alone The simplest design for a database location is stand-alone. In this design, the data and the software to manage it reside on the computer of one user. That computer may or may not be on a network, but all of the functionality of the database system is local to that machine. The hardware and software requirements of a system like this are modest, requiring only one computer and one license for the database management software. The software does not need to provide access for more than one user at a time. One person is in control of the computer, software, and data. For small projects, especially one-person projects, this type of design is often adequate. For larger projects where many people need access to the data, the single individual keeping the data can become a bottleneck. This is particularly true when the retrievals required are large or complicated. The person responsible for the data can end up spending most or all of his or her time responding to data requests from users. When the data management requirements grow beyond that point, the stand-alone system no longer meets the needs of the project team, and a better design is required.
Shared file Generally the next step beyond a stand-alone system is a shared file system. In a shared file system, the server (or any computer) stores the database on its hard drive like any other file. Clients access the file using database software on their computers the same way they would open any other file on the server. The operating system on the server makes the file available. The database software on the client computer is responsible for handling access to the database by multiple users. An example of this design would be a system in which multiple users have Microsoft Access on their computers, and the database file, which has an extension of .mdb, resides on a server, which could be running Windows 95/98/ME or NT/2000/XP. When one or more users is working in the database file, their copy of Access maintains a second file on the server called a lock file. This file, which has an extension of .ldb, keeps track of which users are using the database and what objects in the database may be locked at any particular time. This design works well for a modest number of users in the database at once, providing adequate performance for a dozen or so users at any given time, and for databases up to a few hundred thousand records.
Client-server When the load on the database increases to the point where a shared file system no longer provides adequate performance, the next step is a client-server system. In this design, a data manager program runs on the server, providing access to the data through a system process. One computer is designated the server, and it holds the data management software and the data itself. This system may also be used as the workstation of an individual user, but in high-volume situations this is not recommended. More commonly, the server computer is placed on a network with the computers of the users, which are referred to as clients. The software on the server works with software on the client computers to provide access to the data. The following diagram covers the internal workings of a client-server EDMS. It contains two parts, the Access component at the top and the SQL Server part at the bottom. In discussing the EDMS, the Access component is sometimes called the user interface, since that is the part of the system that users see, but in fact both Access and SQL Server have user interfaces. The Access user interface has been customized to make it easy for the EDMS users to manage the data in ways useful to them. The SQL Server user interface is used without modifications (as provided by Microsoft) for data administration tasks. Between these user interfaces are a number of pieces that work together to provide the data management capabilities of the system.
Figure 17 - Client-server EDMS data flow diagram. On the client side, the Access user interface provides lookup table maintenance, electronic import, manual entry, table view, data review, record counts, maps, subset creation, graphs, formatted reports, file export, and subset database creation, working through a selection screen, Access queries and modules, and Access attachments. These pass through the security system (read/write for data administrators, read-only for other users) to the server tables in the server volume, which is administered through the SQL Server or Oracle user interface, including volume maintenance and backup/restore.
Discussion of this diagram will start at the bottom and work toward the top, since this order starts with the least complicated parts (from the user’s perspective, not the complexity of the software code) and moves to the more complicated parts. That means starting on the SQL Server side and working toward the client side. This sequence provides the most orderly view of the system. In this diagram, the part with a gray background represents SQL Server, and the rest of the box is Access. The basic foundation of the server side is the SQL Server volume, which is actually a file on the server hard drive that contains the data. The size of this volume is set when the database is set up, and can be changed by the administrator as necessary. Unlike many computer files, it will not grow automatically as data is added. Someone needs to monitor the volume size and the amount of data and keep them in synch. The software works this way because the location and structure of the file on the hard drive is carefully managed by the SQL Server software to provide maximum performance (speed of query execution). The database tables are within the SQL Server volume. These tables are similar in function and appearance to the tables in Access. They contain all of the data in the system, usually in normalized database form. The data in the tables is manipulated through SQL queries passed to the SQL Server software via the ODBC link from the clients. Also stored in the SQL Server volume can be triggers and stored procedures to increase performance in the client-server system and to enforce referential integrity. If they wish, users can see the tables in the SQL Server volume through the database window in Access, but their ability to manipulate the data depends on the privileges that they have through the security system. A System Administrator should back up data in the SQL Server tables on a regular basis (at least daily). The interface between the EDMS and the SQL Server tables is through a security system that is partly implemented in SQL Server and partly in Access. Most users should have read-only permission for working with the data; that is, they will be able to view data but not change it. A small group of users, called data administrators, should be able to modify data, which will include importing and entering data, changing data, and deleting data.
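As a hedged example of the server-side objects mentioned above, a simple stored procedure (Transact-SQL syntax, with hypothetical table and field names) can do the filtering on the server so that only the requested rows travel to the client:
CREATE PROCEDURE GetStationAnalyses
  @StationID INTEGER
AS
BEGIN
  -- Return all analytical results for one station, oldest first
  SELECT s.SampleDate, p.ParameterName, a.Value, a.Flag
  FROM Samples AS s
    INNER JOIN Analyses   AS a ON a.SampleID = s.SampleID
    INNER JOIN Parameters AS p ON p.ParameterID = a.ParameterID
  WHERE s.StationID = @StationID
  ORDER BY s.SampleDate, p.ParameterName;
END;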
The actual connection between Access and SQL Server is done through attachments. Attachments in Access allow Access to see data that is located somewhere other than the current Access .mdb file as if it were in the current database. This is the usual way of connecting to SQL Server, and also allows us to provide the flexibility of attaching to other data sources. Once the attachments are in place, the client interaction with the database is through Access queries, either alone or in combination with modules, which are programs written in VBA. Various queries provide access to different subsets of the data for different purposes. Modules are used when it is necessary to work with the data procedurally, that is, line by line, instead of the whole query as a set.
Distributed The database designs described above were geared toward an organization where the data management is centralized, that is, where the data management activities are performed in one central location, usually on a local area network (LAN). With environmental data management, this is not always the case. Often the data for a particular facility must be available both at the facility and at the central office. The situation becomes even more complicated when the central office must manage and share data with multiple remote sites. This requires that some or all of the data be made available at multiple locations. The following sections describe three ways to do this: wide-area networks, distributed databases with replication, and remote communication with subsets. The factors that determine which solution is best for an organization include the amount of data to be managed, how fresh the data needs to be at the remote locations, and whether full-time communication between the facilities is available and the speed of that communication. Wide-area networks – In situations where a full-time, high-speed communication link is or can be made available (at a cost which is reasonable for the project), a wide-area network (WAN) is often the best choice. From the point of view of the people using it, the WAN hardware and software connect the computers just as with a LAN. The difference is that instead of all of the computers being connected directly through a local Ethernet or Token Ring network, some are connected through long-distance data lines of some sort. Often there are LANs at the different locations connected together over the WAN. The connection between the LANs is usually done with routers on either end of the long-distance line. The router looks at the data traffic passing over the network, and data packets which have a destination on the other LAN are routed across the WAN to that LAN, where they continue on to their destination. There are several options for the long-distance lines forming the WAN between the LANs. This discussion will cover some popular existing and emerging technologies. This is a rapidly changing industry, and new technologies are appearing regularly. At the high end of connectivity options are dedicated leased-line services such as T1 (or in cases of very high data volume, T3) or frame relay. These services are connected full-time, and provide high to moderate speeds ranging from 56 kilobits per second (kbps) to 1 megabit per second (mbps) or more. These services can cost $1000 per month or more. This is proven technology, and is available, at a cost, to nearly any facility. Recently, newer digital services have become available for some areas. Integrated Services Digital Network (ISDN) provides 128 kbps for around $100 per month. Digital Subscriber Line (DSL) provides connectivity ranging in speed from 256 kbps to 1.5 mbps or more. Prices can be as low as $40 per month, so this service can be a real bargain, but service is limited to a fairly short distance from the telephone company central office, so it’s not available to many locations. Cable modem service promises to give DSL a run for its money. It is not widely available right now, especially for business locations, and when it is, the focus will likely be more on residential than business service, since that is where cable is currently connected.
Another option is standard telephone lines and analog modems. This is sometimes called POTS (plain old telephone service). This provides 56 kbps, or more with modem pooling, and the connection is made on demand. The cost is relatively low ($15 to $30 per month), and the service is available nearly everywhere. In order to have WAN-level connectivity, you should have a full-time connection of about 1 mbps or faster. If the connection speed available is less than this, another approach should be used. Distributed databases with replication – There are several situations where a client-server connection over a WAN is not the best solution. One is where the connection speed is too low for real-time access. The second is where the data volume is extremely high, and local copies make more sense. In either situation, distributed databases can make sense. In this design, copies of the database are placed on local servers at each facility, and users work with the local copies of the data. This is an efficient use of computer resources, but raises the very important issue of currency of the data. When data is entered or changed in one copy, the other copy is no longer the most current, and, at some point, the changes must be transferred between the copies. Most high-end database programs, and now some low-end ones, can do this automatically at specified intervals. This is called replication. Generally, the database manager software is smart enough to move only the changed data (sometimes called “dirty” records) rather than the whole database. Problems can occur when users make simultaneous changes to the same records at different locations, so this approach rapidly becomes complicated. Remote communication with subsets – Often it is valuable for users to be able to work with part of the database remotely. This is particularly useful when the communication line is slow. In this scenario, users call in with their remote computers and attach to the main database, and either work with it that way or download subsets for local use. In some software this is as easy as selecting data using the standard selection screen, then instructing the EDMS to create a subset. This subset can be created on the user’s computer, then the user can hang up, attach to the local subset, then use the EDMS in the usual way, working with the subset. This works for retrieving data, but not as well for data entry or editing, unless a way is provided for data entered into the subset to be uploaded to the main database.
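To put these connection speeds in perspective, a rough calculation (ignoring protocol overhead and compression) is useful. A 10-megabyte database subset is about 80 megabits, which takes roughly 80 / 0.056, or about 1,400 seconds (nearly 24 minutes), over a 56 kbps modem; about 10 minutes over 128 kbps ISDN; and under a minute over a 1.5 mbps DSL or T1 line. The slower the line, the stronger the case for replication or downloaded subsets rather than live access to the central database.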
Internet/intranet

Tools are now available to provide access to data stored in relational database managers using a Web browser interface. At the time of this writing, in the opinion of the author, these tools are not yet ready for use as the primary interface in a sophisticated EDMS package. Specifically, the technology to provide an intuitive user interface with real-time feedback is too costly or impossible to build with current Web development tools. Vendors are working on implementing that capability, but the technology is not currently ready for prime time, at least for everyday users, and the current technology of choice is client-server.

It is now feasible, however, to provide a Web interface with limited functionality for specific applications. For example, it is not difficult to provide a public page with summaries of environmental data for plants. The more “canned” the retrieval is, the easier it is to implement in a browser interface, although allowing some user selection is not difficult. In the near future, tools like Dynamic HTML, Active Server Pages, and better Java applets, combined with universal high-speed connections, will make it much easier to provide an interactive user interface hosted in a Web browser. At that time, EDMS vendors will certainly provide this type of interface to their clients’ databases.

The following figure shows a view of three different spectra provided by the Internet and related technologies. There are probably other ways of looking at it, but this view provides a framework for discussing products and services and their presentation to users. In these days of multi-tiered applications, this diagram is somewhat of an over-simplification, but it serves the purpose here.
Figure 18 - The Internet spectrum (three spectra running from Local on the left to Global on the right: Applications, from stand-alone through shared files and client-server to Web-enabled and Web-based; Data, from proprietary through commercial to public domain; and Users, from desktops and laptops through PDAs and similar devices to public portals)
The overall range of the diagram in Figure 18 is from Local on the left to Global on the right. This range is divided into three spectra for this discussion. The three spectra, which are separate but not unrelated, are applications, data, and users.

Applications – Desktop computer usage started with stand-alone applications. A program and the data that it used were installed on one computer, which was not attached to others. With the advent of local area networks (LANs), and in some organizations wide-area networks (WANs), it became possible to share files, with the application running on the local desktop and with the data residing on a file server. As software evolved and data volumes grew, software was installed on both the local machine (client) and the server, with the user interface operating locally, and data storage, retrieval, and sometimes business logic operating on the server. With the advent of the Internet and the World Wide Web, sharing on a much broader scale is possible. The application can either reside on the client computer and communicate with the Web, or run on a Web server. The first type of application can be called Web-enabled. An example of this is an email program that resides locally, but talks through the Web. Another example would be a virus-scanning program that operates locally but goes to the Web to update its virus signature files. The second type of application can be called Web-based. An example of this would be a browser-based stock trading application. Many commercial applications still operate in the range between stand-alone and client-server. There is now a movement of commercial software to the right in this spectrum, to Web-enabled or Web-based applications, probably starting with Web-enabling current programs, and then perhaps evolving to a thin-client, browser-based model over time. This migration can be done with various parts of the applications at different rates depending on the costs and benefits at each stage. New technologies like Microsoft’s .NET initiative are helping accelerate this trend.

Data – Most environmental database users currently work mostly with data that they generate themselves. Their base map data is usually based on CAD drawings that they create, and the rest of their project data comes from field measurements and laboratory analyses, which the data manager (or their client) pays for and owns. This puts them toward the left end of the spectrum in the above figure. Many vendors now offer both base map and other data, either on the Web or on CD-ROM, which might be of value to many users. Likewise, government agencies are making more and more data, spatial and non-spatial, available, often for free. As vendors evolve their software and Web presence, they can work toward integrating this data into their offerings. For example, software could be used to load a USGS or Census Bureau base map, and then display sites of environmental concern obtained from the EPA. Several software companies provide tools to make it possible to
serve up this type of data from a modified Web server. Revenue can be obtained from purchase or rental of the application, as well as from access to the data.

Users – The World Wide Web has opened up a whole new world of options for computing platforms. These range from the traditional desktop computers and laptops through personal digital assistants (PDAs), which may be connected via wireless modem, to Web portals and other public access devices. Desktops and laptops can run present and future software, and, as most are becoming connected to the Internet, will be able to support any of the computing models discussed above. PDAs and other portable devices promise to provide a high level of portability and connectivity, which may require re-thinking data delivery and display. Already there are companies that integrate global positioning systems (GPS) with PDAs and map data to show you where you are. Other possible applications include field data gathering and delivery, and a number of organizations provide this capability. Web portals include public Internet access (such as in libraries and coffee shops) as well as other Internet-enabled devices like public phones. This brings up the possibility that applications (and data) may run on a device not owned by or controlled by the client, and argues for a thin-client approach.

This is all food for thought as we try to envision the evolution of environmental software products and services (see Chapter 27). What is clear is that the options for delivery of applications and data have broadened significantly, and must be considered in planning for future needs.
Multi-tiered The evolution of the Internet and distributed computing has led to a new deployment model called “multi-tiered.” The three most common tiers are the presentation level, the business logic level, and the data storage level. Each level might run on a different computer. For example, the presentation level displayed to the user might run on a client computer, using either client-server software or a Web browser. The business logic level might enforce the data integrity and other rules of the database, and could reside on a server or Web server computer. Finally, the data itself could reside on a database server computer. Separating the tiers can provide benefits for both the design and operation of the system.
DISTRIBUTED VS. CENTRALIZED DATABASES

An important decision in implementing a data management system for an organization performing environmental projects for multiple sites is whether the databases should be distributed or centralized. This is particularly true when the requirements for various uses of the data are taken into consideration. This issue will be discussed here from two perspectives: first that of the data, and then that of the organization. From the perspective of the data and the applications, the options of distributed vs. centralized databases are illustrated in Figures 19 and 20. Clearly it is easier for an application to be connected to a centralized, open database than to a diverse assortment of data sources. The downside is the effort required to set up and maintain a centralized data repository.
Figure 19 - Connection to diverse, distributed data sources (applications such as mapping, project management, statistics, reporting, validation, graphing, planning, and Web access each searching for data in a scattered assortment of sources, including GIS coverages, lab deliverables, spreadsheets, CAD files, legacy systems, ASCII files, word processing files, chain-of-custody forms, field notebooks, hard copy files, and regulatory reports)
Figure 20 - Connection to a centralized open database (the same applications – mapping, validation, reporting, statistics, graphing, project management, Web access, and planning – all connecting to a single centralized open database)
Figure 21 - Distributed vs. centralized databases (on the left, “The Dilemma”: a client with several sites, each handled by different consultants and consultant offices using a mix of spreadsheets, Access databases, do-it-yourself systems, and assorted EDMS tools; on the right, “The Solution”: the sites, consultants, labs, and Web access all working with one centralized database, managed in the diagram by Consultant 1 together with Geotech)
The choice of distributed vs. centralized databases can also be viewed from the perspective of the organization. This is illustrated in Figure 21. The left side of the diagram shows the way the data for environmental projects has traditionally been managed. The client, such as an industrial company, owns several sites with environmental issues. One or more consultants, labeled C1, C2, etc., manage each site, and each consultant may manage the project from various offices, such as C2A, C2B, etc. Each consultant office might use a different tool to manage the data. For example, for Site 1, consultant C1 may use an Excel spreadsheet. Consultant C2, working on a different part of the project, or on the same issues at a different time, may use a home-built database. Other consultants working on different sites use a wide variety of different tools. If people in the client organization, either in the head office or at one of the sites, want some data from one monitoring event, it is very difficult for them to know where to look. Contrast this with the right side of the diagram. In this scenario, all of the client’s data is managed in a centralized, open database. The data may be managed by the client, or by a consultant, but the data can be accessed by anyone given permission to do so. There are huge savings in efficiency, because everyone knows where the data is and how to get at it. The difficult challenge is getting the data into the centralized database before the benefits can be realized.
Figure 22 - Example of a simplified logical data model
THE DATA MODEL The data model for a data management system is the structure of the tables and fields that contain the data. Creating a robust data model is one of the most important steps in building a successful data management system (Walls, 1999). If you are building a data management system from scratch, you need to get this part right first, as best you can, before you proceed with the user interface design and system construction. Many software designers work with data models at two levels. The logical data model (Figure 22) describes, at a conceptual level, the data content for the system. The lines between the boxes represent the relationships in the model. The physical data model (Figure 23) describes in detail exactly how the data will be stored, with names, data types, and sizes for all of the fields in each table, along with the relationships (key fields which join the tables) between the tables. The overall scope of the logical data model should be identified as early in the design process as possible. This is particularly true when the project is to be implemented in stages. This allows identification of the interactions between the different parts of the system so that dependencies can be planned for as part of the detailed design for each subset of the data as that subset is implemented. Then the physical data model for the subset can be designed along with the user interface for that subset. The following sections describe the data structure and content for a relational EDMS. This structure and content is based on a commercial system developed and marketed by Geotech Computer Systems, Inc. called Enviro Data. Because this is a working system that has managed hundreds of databases and millions of records of site environmental investigation and monitoring data, it seems like a good starting point for discussing the issues related to a data model for storing this type of data.
Figure 23 - Table and field display from a physical data model
Data structure The structure of a relational EDMS, or of any database for that matter, should, as closely as possible, reflect the physical realities of the data being stored. For environmental site data, samples are taken at specific locations, at certain times, depths, and/or heights, and then analyzed for certain physical and chemical parameters. This section describes the tables and relationships used to model this usage pattern. The section after this describes in some detail the data elements and exactly how they are used so that the data accurately reflects what happened. Tables – The data model for storing site environmental data consists of three types of tables: primary tables, lookup tables, and utility tables. The primary tables contain the data of interest. The lookup tables contain codes and their expanded values that are used in the primary tables to save space and encourage consistency. Sometimes the lookups contain other useful information for the data elements that are represented by the coded values. The utility tables provide a place to store various data items, often related to the operation and maintenance of the system. Often these tables are not related directly to the primary tables. For the most part, the primary data being stored in the EDMS has a series of one-to-many (also known as parent-child or hierarchical) relationships. It is particularly fortunate that these relationships are one-to-many rather than many-to-many, since one-to-many relationships are handled well by the relational data model, and many-to-many are not. (Many-to-many relationships can be handled in the relational data model. They require adding another table to track the links between the two tables. This table is sometimes called a join table. We don’t have to worry about that here.)
The primary tables in this system are called Sites, Stations, Samples, and Analyses. The detailed content of these tables is described below. Sites contains information about each facility being managed in the system. Stations stores data for each location where samples are taken, such as monitoring wells and soil borings. (Note that what is called a station in this discussion is called a site in some system designs.) Samples represents each physical sample or monitoring event at specific stations, and Analyses contains specific observed values or analytical results from the samples. Relationships – The hierarchical relationships between the tables are obvious. Each site can have one or more stations, each station has one or more samples, and each sample is analyzed for one or more, often many, constituents. But each sulfate measurement corresponds to one specific sampling event for one specific location for one specific site. The lookup relationships are one-to-many also, with the “one” side being the lookup table and the “many” side being the primary table. For example, there is one entry in the StationTypes table for monitoring wells, with a code of “mw,” but there can be (and usually are) many monitoring wells in the Stations table.
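To make this structure concrete, here is a minimal sketch of what a physical data model along these lines might look like, expressed as SQL table definitions. The field names, types, and sizes are illustrative assumptions, not the actual Enviro Data schema, and a real system would carry many more fields in each table:

    CREATE TABLE Sites (
        SiteNumber INTEGER PRIMARY KEY,
        SiteName   VARCHAR(60)
    );

    -- Lookup table: one row per station type, such as "mw" for monitoring well.
    CREATE TABLE StationTypes (
        StationTypeCode VARCHAR(4) PRIMARY KEY,
        Description     VARCHAR(40)
    );

    CREATE TABLE Stations (
        StationNumber   INTEGER PRIMARY KEY,
        SiteNumber      INTEGER NOT NULL REFERENCES Sites (SiteNumber),
        StationTypeCode VARCHAR(4) REFERENCES StationTypes (StationTypeCode),
        StationName     VARCHAR(40)
    );

    CREATE TABLE Samples (
        SampleNumber  INTEGER PRIMARY KEY,
        StationNumber INTEGER NOT NULL REFERENCES Stations (StationNumber),
        SampleDate    DATETIME,
        SampleTop     NUMERIC(10,2)   -- depth to top of sampled interval
    );

    CREATE TABLE Analyses (
        AnalysisNumber INTEGER PRIMARY KEY,
        SampleNumber   INTEGER NOT NULL REFERENCES Samples (SampleNumber),
        Parameter      VARCHAR(40),   -- constituent or field parameter name
        ResultValue    NUMERIC(18,6),
        Units          VARCHAR(20)
    );

The REFERENCES clauses express the one-to-many relationships described above: a station must belong to an existing site, a sample to an existing station, and an analysis to an existing sample.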
Data content This section will discuss briefly the data content of the example EDMS. This material will be covered in greater detail in Appendix B. Sites – A Site is a facility or project that will be treated as a unit. Some projects may be treated as more than one site, and sometimes a site can be more than one facility, but the use of the site terminology should be consistent within the database, or at least for each project. Some people refer to a sampling location as a site, but in this discussion we will call that a station. Stations – A Station is a location of observation. Examples of stations include soil borings, monitoring wells, surface water monitoring stations, soil and stream sediment sample locations, air monitoring stations, and weather stations. A station can be a location that is persistent, such as a monitoring well which is sampled regularly, or can be the location of a single sampling event. For stations that are sampled at different elevations (such as a soil boring), the location of the station is the surface location for the boring, and the elevation or depth component is part of the sampling event. Samples – A Sample is a unique sampling event or observation for a station. Each station can be sampled at various depths (such as with a soil boring), at various dates and times (such as with a monitoring well), or, less commonly, both. Observations, which may or may not accompany a physical sample, can be taken at a station at a particular time, and in this model would be considered part of a sample event. Analyses – An Analysis is the observed value of a parameter related to a sample. This term is intended to be interpreted broadly, and not to be limited to chemical analyses. For example, field parameters such as pH, temperature, and turbidity also are considered analyses. This would also include operating parameters of environmental concern such as flow, volume, and so on. Lookups – A lookup table is a table that contains codes that are used in the main data tables, and the expanded values of those codes that are used for selection and display. Utilities – The system may contain tables for tracking internal information not directly related to the primary tables. These utility tables are important to the software developers and maybe the system and data administrators, but can usually be ignored by the users.
DATA ACCESS REQUIREMENTS The user interface provides a number of data manipulation functions, some of which are read/write and the rest are read-only.
Read-write

The functions that require read/write access to the database are:

Electronic import – This function allows data administrators to import analytical and other data. Initially the data formats supported will be the three formats defined in the Data Transfer Standard. Other import formats may be added as needed. This is shown in Figure 17 as a single-headed arrow going into the database, but in reality there is also a small flow of data the other way as the module checks for valid data.

Manual entry – The hope is that the majority of the data that will be put in the system will be in digital format that can be imported without retyping. However, there will probably be some data that will need to be manually entered and edited, and this function will allow data administrators to make those entries and changes.

Editing – Data administrators will sometimes need to change data in the database. Such changes must be done with great care and be fully documented.

Lookup table maintenance – One of the purposes of the lookup tables is to standardize the entries to a limited number of choices, but there will certainly be a need for those tables to evolve over time. This feature allows the data administrators to edit those tables. A procedure will be developed for reviewing and approving those changes before entry.

Verification and validation – Either as part of the import process or separately, data validators will need to add or change validation flags based on their work (a brief sketch follows this list).

Data review – Data review should accompany data import and entry, and can be done independently as well. This function allows data administrators to look at data and modify its data review flag as appropriate, such as after validation.
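As a small illustration of the verification and validation step, the underlying operation is often little more than a flagged update. This is only a sketch; the table and field names (Analyses, ValidationFlag, ReviewStatus, LabBatch) are assumptions, not a prescribed schema:

    -- Mark the results from one laboratory batch as validated,
    -- and record the qualifier assigned by the data validator.
    UPDATE Analyses
    SET ValidationFlag = 'J',
        ReviewStatus   = 'validated'
    WHERE LabBatch = 'B2001-1107-03';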
Read-only

The functions that require read-only access to the database are:

Record counts – This function is a useful guide in making selections. It should provide the number of selected items whenever a selection criterion is changed (see the example query after this list).

Table view – This generalized display capability allows users to view the data that they have selected. This might be all of the output they need, or they might wish to proceed to another output option, once they have confirmed that they have selected correctly. They can also use this screen to copy the data to the clipboard or save it to a file for use in another application.

Formatted reports – Reports suitable for printing can be generated from the selection screen. Different reports could be displayed depending on the data element selected.

Maps – The results of the selection can be displayed on a map, perhaps with the value of a constituent for each station drawn next to that station and a colored dot representing the value. See Chapter 22 for more information on mapping.

Graphs – The most basic implementation of this feature allows users to draw a graph of constituent values as a function of time for the selected data. They should be able to graph multiple constituents for one station or one constituent for several stations. More advanced graphing is also possible as described in Chapter 20.

Subset creation – Users should be able to select a subset of the main database and export it to an Access database. This might be useful for providing the data to others, or to work with the subset when a network connection to the database is unavailable or slow.

File export – This function allows users to export data in a format suitable for use in other software needing data from the EDMS. Formats need to be provided for the data needs of the other software. Direct connection without export-import is also possible.
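To show the kind of statement that sits behind the record count and table view functions, here is a sketch in SQL Server syntax; the site number, constituent, and field names are illustrative assumptions:

    -- Count the analyses that match the current selection criteria.
    SELECT COUNT(*)
    FROM Analyses
    INNER JOIN Samples  ON Analyses.SampleNumber = Samples.SampleNumber
    INNER JOIN Stations ON Samples.StationNumber = Stations.StationNumber
    WHERE Stations.SiteNumber = 12
      AND Analyses.Parameter  = 'Benzene'
      AND Samples.SampleDate >= '1999-01-01';

Replacing COUNT(*) with a list of fields turns the same selection into the data set behind the table view, report, map, or graph.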
GOVERNMENT EDMS SYSTEMS A number of government agencies have developed systems for managing site environmental data. This section describes some of the systems that are most widely used. STORET (www.epa.gov/storet) – STORET (short for STOrage and RETrieval) is EPA’s repository for water quality, biological, and physical data. It is used by EPA and other federal agencies, state environmental agencies, universities, private citizens, and others. It is one of EPA’s two data management systems containing water quality information for the nation's waters. The other system, the Legacy Data Center, or LDC, contains historical water quality data dating back to the early part of the 20th century and collected up to the end of 1998. It is being phased out in favor of STORET. STORET contains data collected beginning in 1999, along with older data that has been properly documented and migrated from the LDC. Both LDC and STORET contain raw biological, chemical, and physical data for surface and groundwater collected by federal, state, and local agencies, Indian tribes, volunteer groups, academics, and others. All 50 states, territories, and jurisdictions of the U.S., along with portions of Canada and Mexico, are represented in these systems. Each sampling result is accompanied by information on where the sample was taken, when the sample was gathered, the medium sampled, the name of the organization that sponsored the monitoring, why the data was gathered, and much other information. The LDC and STORET are Web-enabled, so users can browse both systems interactively or create files to be downloaded to their computer for further use. CERCLIS (www.epa.gov/superfund/sites/cursites) – CERCLIS is a database that contains the official inventory of Superfund hazardous waste sites. It contains information on hazardous waste sites, site inspections, preliminary assessments, and remediation of hazardous waste sites. The EPA provides online access to CERCLIS data. Additionally, standard CERCLIS site reports can be downloaded to a personal computer. CERCLIS is a database and not an EDMS, but can be of value in EDMS projects. IRIS (www.epa.gov/iriswebp/iris/index.html) – The Integrated Risk Information System, prepared and maintained by the EPA, is an electronic database containing information on human health effects that may result from exposure to various chemicals in the environment. The IRIS system is primarily a collection of computer files covering individual chemicals. These chemical files contain descriptive and quantitative information on oral reference doses and inhalation reference concentrations for chronic non-carcinogenic health effects, and hazard identification, oral slope factors, and oral and inhalation unit risks for carcinogenic effects. It is a database and not an EDMS, but can be of value in EDMS projects. ERPIMS (www.afcee.brooks.af.mil/ms/msc_irp.htm) – The Environmental Resources Program Information Management System (ERPIMS, formerly IRPIMS) is the U.S. Air Force system for validation and management of data from environmental projects at all Air Force bases. The project is managed by the Air Force Center for Environmental Excellence (AFCEE) at Brooks Air Force Base in Texas. ERPIMS contains analytical chemistry samples, tests, and results as well as hydrogeological information, site/location descriptions, and monitoring well characteristics. 
AFCEE maintains ERPTools/PC, a Windows-based software package that has been developed to help Air Force contractors in collection and entry of their data, validation, and quality control. Many ERPIMS data fields are filled by codes that have been assigned by AFCEE. These codes are compiled into lists, and each list is the set of legal values for a certain field in the database. Air Force contractors use ERPTools/PC to prepare the data, including comparing data to these lists, and then submit it to the main ERPIMS database at Brooks. IRDMIS (aec.army.mil/prod/usaec/rmd/im/imass.htm) – The Installation Restoration Data Management Information System (IRDMIS) supports the technical and managerial requirements of the Army's Installation Restoration Program (IRP) and other environmental efforts of the U.S. Army Environmental Center (USAEC, formerly the U.S. Toxic and Hazardous Materials Agency). (Don’t confuse this AEC with the Atomic Energy Commission, which is now the Department of
Energy.) Since 1975, more than 15 million data records have been collected and stored in IRDMIS with information collected from over 100 Army installations. IRDMIS users can enter, validate, store, and retrieve the Army’s geographic; geological and hydrological; sampling; chemical; and physical analysis information. The system covers all aspects of the data life cycle, including complete data entry and validation software using USAEC and CLP QA/QC methods; a Web site for data submission and distribution; and an Oracle RDMS with menu-driven user interface for standardized reports, geographical plots, and plume modeling. It provides a fully integrated information network of data status and disposition for USAEC project officers, chemists, geologists, contracted laboratories, and other parties, and supports Geographical Information Systems and other third-party software. USGS Water Resources (http://water.usgs.gov/nwis) – This is a set of Web pages that provide access to water resources data collected at about 1.5 million locations in all 50 states, the District of Columbia, and Puerto Rico. The U.S. Geological Survey investigates the occurrence, quantity, quality, distribution, and movement of surface and groundwater, and provides the data to the public. Online access to data on this site includes real-time data for selected surface water, groundwater, and water quality sites; descriptive site information for all sites with links to all available water data for individual sites; water flow and levels in streams, lakes, and springs; water levels in wells; and chemical and physical data for streams, lakes, springs, and wells. Site visitors can easily select data and retrieve it for on-screen display or save it to a file for further processing.
OTHER ISSUES Creating and maintaining an environmental database is a serious undertaking. In addition to the activities directly related to maintaining the data itself, there are a number of issues related to the database system that must be considered.
Scalability Databases grow with time. You should make sure that the tool you select for managing your environmental data can grow with your needs. If you store your data in a spreadsheet program, when the number of lines of data exceeds the capacity of the spreadsheet, you will need to start another file, and then you can’t easily work with all of your data. If you store your data in a standalone database manager program like Access, when your data grows you can relatively easily migrate to a more powerful database manager like SQL Server or Oracle. The ability of software and hardware to handle tasks of different sizes is called scalability, and this requirement should be part of your planning if there is any chance your project will grow over time.
Security The cost of building a large environmental database can be hundreds of thousands of dollars or more. Protect this investment from loss. Ensure that only authorized individuals can get access to the database. Make adequate backups frequently. Be sure that the people who are working with the database are adequately trained so that they do a good job of getting clean data into the database, and that the data stays there and stays clean. Instill an attitude of protecting the database and keeping its quality up so that people can feel comfortable using it.
Access and permissions Most database manager programs provide a system for limiting who can use a database, and what actions they can perform. Some have more than one way of doing this. Be sure to set up and
use an access control system that fits the needs of your organization. This may not be easy. You will have to walk a thin line between protecting your data and letting people do what they need to do. Sometimes it’s better to start off more restrictive than you think you need to, and then grant more permissions over time, than to be lenient and then need to tighten up, since people react better to getting more power rather than less. Also be aware that security and access limitations are easier to implement and manage in a client-server system than in a stand-alone system, so if you want high security, choose SQL Server or Oracle over Access for the back-end.
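In a client-server system, much of this control is applied with the database manager’s own permission statements. The sketch below is in SQL Server syntax; the role names and tables are placeholders, and the roles themselves would be set up through the server’s administrative tools:

    -- Project staff can read the data but not change it.
    GRANT SELECT ON Analyses TO project_readers;
    GRANT SELECT ON Samples  TO project_readers;

    -- Data administrators maintain the primary tables.
    GRANT SELECT, INSERT, UPDATE, DELETE ON Analyses TO data_administrators;

    -- Make sure read-only users can never delete analytical results.
    DENY DELETE ON Analyses TO project_readers;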
Activity tracking

To guarantee the quality of the data in the database, it is important to track what changes are made to the data, when they are made, who made them, and why they were made. A simple activity tracking system would include an ActivityLog table in the database to allow data administrators to track data modifications. On exit from any of the data modification screens, including importing, editing, or reviewing, an activity log screen appears. The program reports the name of the data administrator and the activity date. The data administrator must enter a description of the activity and the name of the site that was modified. The screen should not close until an entry has been made. Figure 24 shows an example of a screen for this type of simple system. The system should also provide a way to select and display the activity log. Figure 25 shows an example of a selection screen and report of activity data. In this example, the log can be filtered on Administrator name, Activity Date, or Site. If no filters are entered, the entire log is displayed.

Another option is a more elaborate system that keeps copies of any data that is changed. This is sometimes called a shadow system or audit log. In this type of system, when someone changes a record in a table, a copy of the unchanged record is stored in a shadow table, and then the change is made in the main table. Since most EDMS activity does not involve a lot of changes, this does not increase the storage as much as it might appear, but it does significantly increase the complexity of the software.
Figure 24 - Simple screen for tracking database activity
Figure 25 - Output of activity log data
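A minimal version of the simple tracking system described above could be built around a single table. The sketch below is in SQL Server syntax, and the field names and sizes are assumptions for illustration:

    CREATE TABLE ActivityLog (
        LogNumber     INTEGER IDENTITY PRIMARY KEY,  -- assigned automatically
        Administrator VARCHAR(40),
        ActivityDate  DATETIME,
        SiteName      VARCHAR(60),
        Description   VARCHAR(255)
    );

    -- One row is written each time a data modification screen is closed.
    INSERT INTO ActivityLog (Administrator, ActivityDate, SiteName, Description)
    VALUES ('jsmith', GETDATE(), 'Rad Industries',
            'Imported November groundwater results; corrected two unit codes');

The selection screen and report in Figure 25 are then simply filtered queries against this table.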
Database maintenance

There are a number of activities that must be performed on an ongoing or at least occasional basis to keep an EDMS up and running. These include:

Backup – Backing up data in the database is discussed in Chapter 15, but must be kept in mind as part of ongoing database maintenance.

Upgrades – Both commercial and custom software should be upgraded on a regular basis. These upgrades may be required due to a change in the software platform (operating system, database software) or to add features and fix bugs. A system should be implemented so that all users of the EDMS receive the latest version of the software in a timely fashion. For enterprises with a large number of users, automated tools are available to assist the system administrator with distributing upgrades to all of the appropriate computers without having to visit each one. Web-based tools are beginning to appear that provide the same functionality for all users of software programs that support this feature. Either of these approaches can be a great time saver for a large enterprise system.

Other maintenance – Other maintenance activities are required, both on the client side and the server side. For example, on the client side, Access databases grow in size with use. You should occasionally compact your database files. You can do this on some set schedule, such as monthly, or when you notice that a file has grown large, such as larger than 5 megabytes (5000 KB). Occasionally problems will occur with Access databases due to power failures, system crashes, etc.
When this happens, first exit Access, then shut down Windows, power down the computer, and restart. If you get errors in the database after that, you can have Access repair and compact the database. In the worst case (if repairing does not work), you should obtain a new copy of the database program from the original source, and restore your data file from a backup.

System maintenance will be required on the server database as well, and will generally be performed by the system administrator with assistance from the vendor if necessary. These procedures include general maintenance of the server computer, user administration, database maintenance, and system backup. The database is expected to grow as new data is received for sites currently in the database, and as new sites are added. At some point in the future it will be necessary to expand the database, and the storage devices that hold it, to accommodate the anticipated increase in data volume. The system administrator should monitor the system to determine when the database size needs to be increased.
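On the server side, routine backups are usually scheduled rather than run by hand. Here is a sketch in SQL Server syntax, with the database name and backup path as placeholders only:

    -- Full backup of the EDMS database to a file on a dedicated backup drive.
    -- WITH INIT overwrites the previous backup set in that file.
    BACKUP DATABASE EnviroDB
    TO DISK = 'E:\Backups\EnviroDB_Full.bak'
    WITH INIT;

A statement like this can be run on a schedule by the server’s job scheduling facility so that backups happen without operator intervention.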
CHAPTER 6 DATABASE ELEMENTS
A number of elements make up an EDMS. These elements include the computer on the user’s desk, the software on that computer, the network hardware and software, and the database server computer. They also include the components of the database management system itself, such as files, tables, fields, and so on. This chapter covers the important elements from these two categories. This presentation focuses on how these objects are implemented in Access (for standalone use) and SQL Server (for client-server), two popular database products from Microsoft. A good overview of Access for both new and experienced database users can be found in Jennings (1995). More advanced users might be interested in Dunn (1994). More information on SQL Server can be found in Nath (1995); England and Stanley (1996); and England (1997). More information on database elements can be found in Dragan (2001), Gagnon (1998), Harkins (2001a, 2001b), Jepson (2001), and Ross et al. (2001).
HARDWARE AND SOFTWARE COMPONENTS A modern networked data management system consists of a number of hardware and software components. These items, which often come from different manufacturers and vendors, all must work together for the system to function properly.
The desktop computer It is obvious that in order to run a data management system, either client-server or stand-alone, you must have a computer, and the computer resources must be sufficient to run the software. Data management programs can be relatively large applications. In order to run a program like this you must have a computer capable of running the appropriate operating system such as Windows. This section describes the desktop hardware and software requirements for either a client-server or stand-alone database management system. Other than the network connection, the hardware requirements are the same.
DESKTOP HARDWARE The computer should have a large enough hard drive and enough random access memory (RAM) to be able to load the software and run it with adequate performance, and data management software can have relatively high requirements. For example, Microsoft Access has the greatest resource requirements of any of the Microsoft Office programs. At the time of this writing, the minimum and recommended computer specifications for adequate performance using the data management system are as shown in Figure 26.
Computer – Minimum: 200 megahertz Pentium processor; Recommended: 500 to 1000 megahertz Pentium processor
Hard drive – Minimum and Recommended: adequate for software and local data storage, at least 1 gigabyte
Memory – Minimum: 64 megabytes RAM; Recommended: 128 megabytes RAM
Removable storage – Minimum: 3.5” floppy, CD-ROM; Recommended: 3.5” floppy, CD-RW, Zip drive
Display – Minimum: VGA 800x600; Recommended: XGA 1024x768 or better
Network – Minimum: 10 megabits per second; Recommended: 100 megabits per second
Peripherals – Minimum: printer; Recommended: high-speed printer, scanner
Figure 26 - Suggested hardware specifications
Probably the most important requirement is adequate random access memory (RAM), the chips that provide short-term storage of data. The amount of RAM should be increased on machines that are not providing acceptable performance. If increasing the RAM does not increase the performance to a level appropriate for that user’s specific needs, then replacing the computer with a faster one may be required. It is important to note that the hardware requirements to run the latest software, and the computer processing power of standard systems available at the store, both become greater over time. Computers that are more than three years or so old may be inadequate for running the latest version of the database software. A brand-new, powerful computer including a monitor and printer sells for $1000 or less, so it doesn’t make sense to limp along on an underpowered, flaky computer. Don’t be penny-wise and pound-foolish. Be sure that everyone has adequate computers for the work they do. It will save money in the long run. An important distinction to keep in mind is the difference between memory and storage. A computer has a certain amount of system memory or RAM. It also has a storage device such as a hard drive. Often people confuse the two, and say that their computer has 10 gigabytes of memory, when they mean disk storage.
DESKTOP SOFTWARE Several software components are required in order to run a relational database management system. These include the operating system, networking software (where appropriate), database management software, and the application.
Operating system Most systems used for data management run one of the Microsoft operating systems: Windows 95, 98, ME, or NT/2000/XP. All of these systems can run the same client data management software and perform pretty much the same. Apple Macintosh systems are present in some places, but are used mostly for graphic design and education, and have limited application for data management due to poor software availability. UNIX systems (including the popular open-source version, Linux) are becoming an increasingly viable possibility, with serious database systems like Oracle and DB2 now available for various types of UNIX.
Networking software If the data is to be managed with a shared-file or client-server system, or if the files containing a single-user database are to be stored on a file server computer, the client computer will need to run networking software to make the network interface card work. In some cases the networking software is part of the operating system. This is the case with a Windows network. In other cases the networking will be done with a separate software package. Examples include Novell Netware
and Banyan Vines. Either way, the networking software will generally be loaded during system startup, and after that it can pretty much be ignored; its job is simply to make network file server and network database server resources available. This networking software is described in more detail in the next section.
Database management software The next software element in the database system is the database management software itself. Examples of this software are Microsoft Access, FoxPro, and Paradox. This software can be used by itself to manage the data, or with the help of a customized application as described in the next section. The database application provides the user interface (the menus and forms that the user sees) and can, in the case of a stand-alone or single-user system, also provide the data storage. In a client-server system, the database software on the client computer provides the user interface, and some or all of the data is stored on the database server computer somewhere else on the network. If the data to be managed is relatively simple, the database management software by itself is adequate for managing it. For example, a simple table of names and addresses can be created and data entered into it with a minimum of effort. As the data model becomes more complicated, and as the interaction between the database and external data sources becomes more involved, it can become increasingly difficult to perform the required activities using the tools of the software by itself. At that point a specialized application may be required.
Application

When the complexity of the database or its interactions exceeds the capability of the general-purpose database manager program, it is necessary to move to a specialized vertical market application. This refers to software specialized for a particular industry segment. An EDMS represents software of this type. This type of system is also referred to as COTS (commercial off-the-shelf) software. Usually the vertical market application will provide pre-configured tables and fields to store the data, import routines for data formats common in the industry, forms for editing the data, reports for printing selected data, and export formats for specific needs. Using off-the-shelf EDMS software can give you a great head start in building and managing your database, relative to designing and building your own system.
The network Often the EDMS will run on a network so people can share the data. The network has hardware and software components, which are discussed in the following sections.
NETWORK HARDWARE The network on which the EDMS operates has three basic components, in addition to the computers themselves: network adapters, wiring, and hubs. These network hardware components are shown in Figure 27. The network adapters are printed circuit boards that are placed in slots in the client and server computers and provide the electronic connection between the computer and the network. The type of adapter card used depends on the kind of computer in which it is placed, and the type of network being used.
Figure 27 - The EDMS network hardware diagram (client computers, each with a network adapter, connect through a network hub to the server, which has its own network adapter)
The wiring also depends on the type of network being used. The two most common types of wiring are twisted pair and coaxial, usually thin Ethernet. Twisted pair is becoming more common over time due to lower cost. Most twisted pair networks use Category 5 (sometimes called Cat5) cable, which is similar to standard telephone wiring, but of higher quality. There is usually a short cable that runs between the computer and a wall plate, wiring in the walls from the client’s or server’s office to a wiring closet, and then another cable from the wall plate or switch block in the wiring closet to the hub. The hub is a piece of hardware that takes the cables carrying data from the computers on the network and connects them together physically. Depending on the type of network and the number of computers, other hardware may be used in place of or in addition to the hub. This might include network switches or routers. The network can run at different speeds depending on the capability of the computers, network cards, hubs, wiring, and so on. Until recently 10 megabits per second was standard for local area networks (LANs), and 56 kilobits per second was common for wide-area networks (WANs). Increasingly, 100 megabits per second is being installed for LANs and 1 megabit per second or faster is used for WANs.
EDMS NETWORK SOFTWARE

There are a number of software components required on both the client and server computers in order for the EDMS to operate. Included in this category are the operating system, transport protocols, and other software required just to make the computer and network work. The operating system and network software should be up and running before the EDMS is installed.
Figure 28 - The EDMS network software components (Access front-ends on the client computers communicate through ODBC drivers with the SQL Server process on the server; SQL queries and incoming data flow to the server, query results and outgoing data flow back to the clients, and the server also handles data storage along with backup and restore)
The major networked data management software components of the EDMS are discussed in this section from an external perspective, that is, looking at the various pieces and what they do, but not at the detailed internal workings of each. The important parts of the internal view, especially of the data management system, will be provided in later sections. On the client computers in a client-server system, the important components for data management provide the user interface and communication with the server. On the server, the software completes the communication and provides storage and manipulation of the data. For a stand-alone system, both parts run on the client computer. The diagram in Figure 28 shows the major data management software components for a client-server system, based on Access as a front-end and SQL Server as a back-end. On the client computers, the user interface for the EDMS can be provided by a database such as Microsoft Access, or can be written in a programming language like Visual Basic, Power Builder, Java, or C++. The advantage of using a database language is ease of development and flexibility. The advantage of a compiled language is code security, and perhaps speed, although speed is less of a distinguishing factor than it used to be. The main user interface components are forms and menus for soliciting user input and forms and reports for displaying output. Also provided by Access on the desktop are queries to manipulate data and macros and modules (both of which are types of programs) to control program
operation and perform various tasks. Customized components specific to the EDMS, if any, are contained in an Access .mdb file which is placed on the client computer during setup and which can be updated on a regular basis as modifications are made to the software. Through this interface, the user should be able to (with appropriate privileges) import and check data, select subsets of the data, and generate output, including tables, reports, graphs, and maps. To communicate data with the server, the Access software works with a driver, which is a specialized piece of software with specific capabilities. In a typical EDMS this driver uses a data transfer protocol called Open DataBase Connectivity (ODBC). The driver for communicating with SQL Server is provided by Microsoft as part of the Access software installation, although it may not be installed as part of the standard installation. Drivers for other server databases are available from various sources, often the vendor of the database software. There are two parts to the ODBC system in Windows. One part is ODBC administration, which can be accessed through the ODBC icon in Control Panel. This part provides central management of ODBC connections for all of the drivers that are installed. The second part consists of individual drivers for specific data sources. There are two kinds of ODBC drivers, single-tier and multi-tier. The single-tier drivers provide both the communication and data manipulation capabilities, and the data management software for that specific format itself is not required. Examples of single-tier drivers include the drivers for Access, dBase, and FoxPro data files. Multi-tier drivers provide the communication between the client and server, and work with the database management software on the server to provide data access. Examples of multi-tier drivers include the drivers for SQL Server and Oracle. The server side of the ODBC communication link is provided by software that runs on the server as an NT/2000/XP process. The SQL Server process listens for data requests from clients across the network via the ODBC link, executes queries locally on the server, and sends the results back to the requesting client. This step is very important, because the traffic across the network is minimized. The requests for data are in the form of SQL queries, which are a few hundred to a few thousand characters, and the data returned is whatever was asked for. In this way the user can query a small amount of data from a database with millions of records and the network traffic would be just a few thousand characters. Some EDMS software packages can work in either stand-alone or client-server mode. In the first case it uses a direct link to the Jet database engine when working with an Access database. In the second case, the EDMS uses the SQL Server multi-tier driver to communicate between the user interface in Access and SQL Server on the server. When users are attached to a local Access database, all of the processing and data flow occurs on the client computer. When connected to the server database the data comes from the server.
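To illustrate why this matters, the entire request that crosses the network might be nothing more than the text of a query like the one below (table and field names are illustrative), and only the matching rows come back:

    -- Sent across the ODBC link as plain text and executed entirely on the server.
    SELECT Stations.StationName, Samples.SampleDate, Analyses.ResultValue
    FROM Stations, Samples, Analyses
    WHERE Samples.StationNumber = Stations.StationNumber
      AND Analyses.SampleNumber = Samples.SampleNumber
      AND Stations.SiteNumber   = 12
      AND Analyses.Parameter    = 'Sulfate';

A table with millions of analyses stays on the server; the network carries only the query text and the handful of sulfate results that satisfy it.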
The server SERVER HARDWARE The third hardware component of the EDMS, besides client computers and the network, is the database server. This is a computer, usually a relatively powerful one, which contains the data and runs the server component of the data management software. Usually it runs an enterprise-grade operating system such as Windows NT/2000/XP or UNIX. In large organizations the server will be provided or operated by an Information Technology (IT) or similar group, while in smaller organizations data administrators or power users in the group will run it. The range of hardware used for servers, especially running Windows NT/2000/XP, is great. NT/2000/XP can run on a standard PC of the type purchased at discount or office supply stores. This is actually a good solution for small groups, especially when the application is not mission critical, meaning that if the database becomes unavailable for short periods of time the company won’t be shut down.
Figure 29 - Example administrative screen from Microsoft SQL Server
For an organization where the amount of use of the system is greater, or full-time availability is very important, a computer designed as a server, with redundant and hot-swappable (can be replaced without turning off the computer) components, is a better solution. This can increase the cost of the computer by a factor of two to ten or more, but may be justified depending on the cost of loss of availability.
SERVER SOFTWARE The client-based software components described above are those that users interact with. System administrators also interact with the server database user interface, which is software running on the server computer that allows maintenance of the database. These maintenance activities include regular backup of the data and occasional other maintenance activities including user and volume administration. Software is also available which allows many of these maintenance activities to be performed from computers remote from the server, if this is more convenient. An example screen from SQL Server is shown in Figure 29.
UNITS OF DATA STORAGE The smallest unit of information used by computers is the binary bit (short for BInary digiT). A bit is made up of one piece of data consisting of either a zero or a one, or more precisely, the electrical charge is on or off at that location in memory. All other types of data are composed of one or more bits. The next larger common unit of storage is the byte, which contains eight bits. One byte can represent one of 256 different possibilities (two raised to the eighth power). This allows a byte to represent any one of the characters of the alphabet, the numbers and punctuation symbols, or a large number of other characters. For example, the letter A (capital A) can be represented by the byte 01000001. How each character is coded depends on the coding convention used. The two most common are ASCII (American Standard Code for Information Interchange) used on personal
computers and workstations, and EBCDIC (Extended Binary Coded Decimal Interchange Code) used on some mainframes. The largest single piece of data that can be handled directly by a given processor is called a word. For an 8-bit machine, a word is the same as a byte. For a 16-bit system, a word is 16 bits long, and so on. A 32-bit processor is faster than a 16-bit processor of the same clock speed because it can process more data at once, since the word size is twice as big. For larger amounts of data, the amount of storage is generally referred to in terms of the number of bytes, usually in factors of a thousand (actually 1024, or 2^10). Thus one thousand bytes is one kilobyte, one million is one megabyte, one billion is one gigabyte, and one trillion is one terabyte. As memory, mass storage devices, and databases become larger, the last two terms are becoming increasingly important.
DATABASES AND FILES As discussed in Chapter 5, databases can be described by their logical data model, which focuses on data and relationships, and their physical data model, which is how the data is stored in the computer. All data in a modern computer is stored in files. Files are chunks of related data stored together on a disk drive such as a hard disk or floppy disk. The operating system takes care of managing the details of the files such as where they are located on the disk. Files have names, and files in DOS and Windows usually have a base name and an extension separated by a period, such as Mydata.dbf. The extension usually tells you what type of file it is. Older database systems often stored their data in the format of dBase, with an extension of .dbf. Access stores its data and programs in files with the extension of .mdb for Microsoft DataBase, and can store many tables and other objects in one file. Most Access developers build their applications with one .mdb file for the program information (queries, forms, reports, etc.) and another for the data (tables). Larger database applications have their data in an external database manager such as Oracle or SQL Server. The user does not see this data as files, but rather as a data source available across the network. If the front end is running in Access, they will still have the program .mdb either on their local hard drive or available on a network drive. If their user interface is a compiled program written in Visual Basic, C, or a similar language, it will have an extension of .exe. We will now look at the remaining parts of a database system from the point of view of a stand-alone Access database. The concepts are about the same for other database software packages. Access databases contain six primary objects. These are tables, queries, forms, reports, macros, and modules. These objects are described in the following sections.
TABLES (“DATABASES”) The basic element of storage in a relational database system is the table. Each table is a homogeneous set of rows of data describing one type of real-world object. In some older systems like dBase, each table was referred to as a database file. Current usage tends more toward considering the database as the set of related tables, rather than calling one table a database. Tables contain the following parts: Records – Each line in a table is called a record, row, entity, or tuple. For example, each boring or analysis would be a record in the appropriate table. Records are described in more detail below. Fields – Each data element within a record is called a field, column, or attribute. This represents a significant attribute of a real-world object, such as the elevation of a boring or the measured value of a constituent. Fields are also described in more detail below.
Figure 30 - Join Properties form in Microsoft Access
Relationships – Data in different tables can be related to each other. For example, each analysis is related to a specific sample, which in turn is related to a specific boring. Relationships are usually based on key fields. The database manager can help in enforcing relationships using referential integrity, which requires that defined relationships be fulfilled according to the join type. Using this capability, it would be impossible to have an analysis for which there is no sample. Join types – A relationship between two tables is defined by a join. There are two kinds of joins, inner joins and outer joins. In an inner join, matching records must be present on both sides of the join. That means that if one of the tables has records that have no matching records in the other, they are not displayed. An outer join allows unmatched records to be displayed. It can be a left join or a right join, depending on which table will have unmatched records displayed. Figure 30 shows an example of defining an outer join in Access. In this example, a query has been created with the Sites and Stations tables. The join based on the SiteNumber field has been defined as an outer join, with all records from the Sites table being displayed, even if there are no corresponding records in the Stations table. This outer join is a left join. Figure 31 shows the result of this query. There are stations for Rad Industries and Forest Products Co., but none for Refining, Inc. Because of the outer join there is a record displayed for Refining, Inc. even though there are no stations.
Figure 31 - Result of an outer join query
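In SQL terms, the two join types described above might look like the following sketch. The SiteNumber key comes from the example; the other field names are assumed for illustration.

-- Inner join: only sites that have at least one station appear in the result
SELECT Sites.SiteName, Stations.StationName
FROM Sites INNER JOIN Stations
ON Sites.SiteNumber = Stations.SiteNumber;

-- Left outer join: every site appears, with blank (Null) station fields
-- for sites, such as Refining, Inc., that have no stations yet
SELECT Sites.SiteName, Stations.StationName
FROM Sites LEFT JOIN Stations
ON Sites.SiteNumber = Stations.SiteNumber;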
FIELDS (COLUMNS) The fields within each record contain the data of each specific kind within that record. These are analogous to the way columns are often used in a spreadsheet, or the blanks to be filled out on a paper form. Data types – Each field has a data type, such as numeric (several possible types), character, date/time, yes/no, object, etc. The data type limits the content of the field to just that kind of data, although character fields can contain numbers and dates. You shouldn’t store numbers in a character field, though, if you want to treat them as numbers, such as performing arithmetic calculations on them. Character fields are the most common type of field. They may include letters, numbers, punctuation marks, and any other printable characters. Some typical character fields would be SiteName, SampleType, and so on. Numeric is for numbers on which calculations will be performed. They may be either positive or negative, and may include a decimal point. Numeric fields that might be found in an EDMS are GroundElevation, SampleTop, etc. Some systems break numbers down further into integer and floating point numbers of various degrees of precision. Generally this is only important if you are writing software, and less important if you are using commercial programs. It is important to note that Microsoft programs such as Excel and Access have an annoying feature (bug) that refuses to save trailing zeros, which are very important in tracking precision. If you open a new spreadsheet in Excel, type in 3.10, and press Enter, the zero will go away. You can change the formatting to get it back, but it’s not stored with the number. The best way around this is to store the number of decimals with each result value, and then format the number when it is displayed. Date is pretty obvious. Arithmetic calculations can often be performed on dates. For example, the fields SampleDate and AnalysisDate could be included in a table, and could be subtracted from each other to find the holding time. Date fields in older systems are often 8 characters long (MM/DD/YY), while more modern, year 2000 compliant systems are 10 characters (MM/DD/YYYY). There is some variability in the way that time is handled in data management systems. In some database files, such as dBase and FoxPro .dbf files, date and time are stored in separate fields. In others, such as Access .mdb files, both can be stored in one field, with the whole number representing the date and the decimal component containing the time. The dates in Access are stored as the number of days since 12/30/1899, and times as the fraction of the day starting at midnight, such that .5 is noon. The way dates are displayed traditionally varies from one part of the world to another, so as we go global, be careful. On Windows computers, the date display format is set in the operating system under Start/Settings/Control Panel/Regional Settings. Logical represents a yes/no (true/false) value. Logical fields are one byte long (although it actually takes only one bit to store a logical value). ConvertedValue could be a logical field that is true or false based on whether or not a value in the database has been converted from its original units. Data domain – Data within each field can be limited to a certain range. For example, pH could be limited to the range of 0 to 14. 
Comprehensive domain checking can be difficult to implement effectively, since in a normalized data model, pH is not stored in its own field, but in the same Value field that stores sulfate and benzene, which certainly can exceed 14. That means that this type of domain analysis usually requires programming. Value – Each field has a value, which can be some measured amount, some text attribute, etc. It is also possible that the value may be unknown or not exist, in which case the value can be set to Null. Be aware, however, that Null is not the same as zero, and is treated differently by the software.
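As a rough illustration of the two points above, the following Access SQL sketches show display-time formatting using a stored decimal count, and a simple programmed domain check. The Results table and its field names (ResultValue, Decimals, ParameterCode) are assumptions for illustration, not part of a specific data model.

-- Display-time formatting: keep the numeric value plus a Decimals count,
-- then rebuild the reported precision (3.1 stored with Decimals = 2 is
-- shown as "3.10"). Format and String are VBA functions available in
-- Access queries.
SELECT ResultValue,
       Format(ResultValue,
              IIf(Decimals > 0, "0." & String(Decimals, "0"), "0"))
         AS DisplayValue
FROM Results;

-- Simple domain check: flag pH results outside the 0 to 14 range
SELECT *
FROM Results
WHERE ParameterCode = 'PH'
  AND (ResultValue < 0 OR ResultValue > 14);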
Figure 32 - Oracle screen for setting field properties
Key fields – Within each table there should be one or more fields that make each record in the table unique. This might be some real-world attribute (such as laboratory sample number) or a synthetic key such as a counter assigned by the data management system. A primary key has a unique value for each record in the table. A field in one table that is a primary key in another table is called a foreign key, and need not be unique, such as on the “many” side of a one-to-many relationship. Simple keys, which are made up of one field, are usually preferable to compound keys made up of more than one field. Compound keys, and in fact any keys based on real data, are usually poor choices because they depend on the data, which may change. Figure 32 shows an Oracle screen for setting field properties.
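A minimal Access (Jet) SQL sketch of these ideas might look like the following, assuming simplified Stations and Samples tables; the table design and names are hypothetical.

-- Each table gets a synthetic, system-assigned primary key (COUNTER is the
-- Access/Jet autonumber type). The foreign key in Samples enforces
-- referential integrity: a sample cannot point at a station that does not exist.
CREATE TABLE Stations (
  StationID    COUNTER CONSTRAINT pkStations PRIMARY KEY,
  StationName  TEXT(50)
);

CREATE TABLE Samples (
  SampleID     COUNTER CONSTRAINT pkSamples PRIMARY KEY,
  StationID    LONG,
  SampleDate   DATETIME,
  CONSTRAINT fkSamplesStations FOREIGN KEY (StationID)
    REFERENCES Stations (StationID)
);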
RECORDS (ROWS) Once the tables and fields have been defined, the data is usually entered one record at a time. Each well in the Stations table or groundwater sample in the Samples table is a record. Often the size of a database is described by the number of records in its tables.
QUERIES (VIEWS) In Access, data manipulation is done using queries. Queries are based on SQL, and are given names and stored as objects, just like tables. The output of a query can be viewed directly in an editable, spreadsheet-like view, or can be used as the basis of a form or a report. Access has six types of queries: Select – This is the basic data retrieval query. Cross-tab – This is a specialized query for summarizing data.
Figure 33 - Simple data editing form
Make table – This query is used to retrieve data and place it into a new table. Update – This query changes data in an existing table. Append – This query type adds records to an existing table. Delete – These queries remove records from a table, and should be used with great care!
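The following Access SQL sketches suggest what each query type might look like for environmental data; the table and field names are assumptions for illustration only.

-- Select: basic data retrieval
SELECT StationName, SampleDate, ParameterCode, ResultValue
FROM Results
WHERE ParameterCode = 'BENZENE';

-- Cross-tab: stations down the side, parameters across the top
TRANSFORM Avg(ResultValue)
SELECT StationName
FROM Results
GROUP BY StationName
PIVOT ParameterCode;

-- Make table: copy a subset of the data into a new table
SELECT * INTO Results2001
FROM Results
WHERE SampleDate BETWEEN #1/1/2001# AND #12/31/2001#;

-- Update: change data in existing records
UPDATE Results SET Units = 'ug/L' WHERE Units = 'ppb';

-- Append: add records from another table with the same structure
INSERT INTO Results SELECT * FROM ImportedResults;

-- Delete: remove records (use with great care)
DELETE FROM Results WHERE SampleDate < #1/1/1990#;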
OTHER DATABASE OBJECTS The other types of database objects in an Access system are forms, reports, macros, and modules. Forms and reports are for entering and displaying data, while macros and modules are for automating operations.
Forms Forms in data management programs such as Access are generally used for entering, editing, or selecting data, although they can also be used as menus for selecting an activity. Forms for working with data use a table or a query as a data source.
Figure 34 - Advanced data editing form
Figure 35 - Example of a navigation form
Figure 33 shows an example of a simple form for editing data from one table. Data editing forms can be much more complicated than the previous example. Figure 34 shows a data editing form with many fields and a subform, which allows for many records in the related table to be displayed for each record in the main table. Figure 35 shows a form used for navigation. Users click on one of the gray rectangles with their mouse to open the appropriate form for what they want to do. Sometimes the data entry forms can be combined with navigation capabilities. The following form is mostly a data entry form, with data fields and a subform, but it also allows users to navigate to a specific record. They do this by opening a combo box, and selecting an item from the list. The form then takes them to that specific record. Forms are a very important part of a database system, since they are usually the main way users interact with the system.
Figure 36 - A form combining data entry and navigation
Figure 37 - Report of analytical data
Reports Reports are used for displaying data, usually for printing. Reports use a table or query as a data source, the same way that forms do. The main differences are that the data on reports cannot be edited, and reports can better handle large volumes of data using multiple pages. Figure 37 shows a typical report of analytical data. Reports will be covered in much more detail in Chapter 19.
Macros, modules, subroutines, and functions Much of the power in modern data management programs comes from the ability to program them for specific needs. Some programs, like Access, provide more than one way to tell the program what to do. The two ways in Access are macros and modules. Macros are like stored keystrokes, and are used to automate procedures. Modules are more like programs, and can also be used to automate activities. Modules have some advantages over macros. Many Access developers prefer modules to macros, but since macros are easier to learn, they are frequently used, especially by beginners. Microsoft encourages use of modules instead of macros for programming their applications, and suggests that support for macros may be removed in future versions. Figure 38 shows an Access macro in the macro-editing screen. This macro minimizes the current window and displays the STARTUP form.
Figure 38 - Access macro example
Modules provide the programming power behind Access. They are written in Access Basic, a dialect of Visual Basic for Applications (VBA). VBA is a complete, powerful programming language that can do nearly anything that any programming language can do. VBA is customized to work with Access data, which makes it easy to write sophisticated applications to work with data in Access tables. Figure 39 shows the Access screen for editing a module. The subroutine shown re-spaces the print order in the Parameters table with an increment of five so new parameters can be inserted in between. A module can have two kinds of code in it, subroutines (also known as “subs”) and functions, both of which are referred to as procedures. Both are written in VBA. The difference is that a function returns a value, and a sub does not. Otherwise they can do exactly the same thing.
Figure 39 - Access module example
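A procedure of the kind described for Figure 39 might be sketched in VBA roughly as follows. The Parameters table and PrintOrder field follow the description in the text, but the code itself is only an illustration, not the code shown in the figure.

' Walk the Parameters table in print order and renumber PrintOrder in
' steps of five so new parameters can be inserted between existing ones.
Public Sub RespacePrintOrder()
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim lngOrder As Long

    Set db = CurrentDb()
    Set rs = db.OpenRecordset( _
        "SELECT PrintOrder FROM Parameters ORDER BY PrintOrder")

    lngOrder = 5
    Do While Not rs.EOF
        rs.Edit
        rs!PrintOrder = lngOrder
        rs.Update
        lngOrder = lngOrder + 5
        rs.MoveNext
    Loop

    rs.Close
    Set rs = Nothing
    Set db = Nothing
End Sub

' A function differs only in that it returns a value, for example:
Public Function NextPrintOrder() As Long
    NextPrintOrder = Nz(DMax("PrintOrder", "Parameters"), 0) + 5
End Function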
Figure 40 - SQL Server screen for editing a trigger
Triggers and stored procedures There is another kind of automation that a database manager program can have. This is associating activity with specific events and data changes. Access does not provide this functionality, but SQL Server and Oracle do. You can associate a trigger with an event, such as changing a data value, and the software will run that action when that event happens. Entering and editing triggers can be done in one of two ways. The programs provide a way to create and modify triggers using the SQL Data Definition Language by entering commands interactively. They also provide a user interface for drilling down to the trigger as part of the table object model and entering and editing triggers. This interface for SQL Server is shown in Figure 40. A stored procedure is similar except that it is called explicitly rather than being associated with an event.
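As a rough sketch, a SQL Server trigger and stored procedure might look like the following; the Results table and its columns are assumed for illustration.

-- Trigger: whenever rows in Results are updated, stamp the change date
CREATE TRIGGER trg_Results_Update
ON Results
AFTER UPDATE
AS
BEGIN
  UPDATE r
  SET r.ModifiedDate = GETDATE()
  FROM Results AS r
  INNER JOIN inserted AS i ON r.ResultID = i.ResultID
END
GO

-- Stored procedure: does its work only when called explicitly
CREATE PROCEDURE usp_ResultCount @StationID int
AS
  SELECT COUNT(*) AS NumResults
  FROM Results
  WHERE StationID = @StationID
GO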
Calculated fields One of the things that computers do very well is perform calculations, and often retrieving data from a database involves a significant amount of this calculation. People who use spreadsheets are accustomed to storing formulas in cells, and then the spreadsheet displays the result. It is tempting to store calculated results in the database as well. In general, this is a bad idea (Harkins, 2001a), despite the fact that this is easy to do using the programmability of the database software. There are several reasons why this is bad. First, it violates good database design. In a well-designed database, changing one field in a table should have no effect on other fields in the table. If one field is calculated from one or more others, this will not be the case. The second and main reason is the risk of error due to redundant data storage. If you change one data element and forget to change the calculated data, the database will be inconsistent. Finally, there are lots of other ways to achieve the same thing. Usually the best way is to perform the calculation in the query that retrieves the data. Also, calculated controls can display the result on the fly. There are exceptions, of course. A data warehouse contains extracted and often calculated data for performance purposes. In deeply nested queries in Access, memory limitations sometimes required storing intermediate calculations in a table, and then performing more queries on the intermediate results. For the most part, however, if your design involves storing calculated results, you will want to take a hard look at whether this is the best way to proceed.
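For example, the holding time mentioned earlier could be computed in the retrieval query rather than stored, along the lines of this sketch (the Analyses table name and fields are assumed):

-- Holding time computed on the fly instead of being stored in the table.
-- DateDiff is a VBA function available in Access queries.
SELECT SampleID, SampleDate, AnalysisDate,
       DateDiff("d", SampleDate, AnalysisDate) AS HoldingTimeDays
FROM Analyses;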
CHAPTER 7 THE USER INTERFACE
The user interface defines the interaction between the user and the computer for the tasks to be performed by the software. Good user interface design requires that the software be presented from the point of view of what the user wants to do rather than what the computer needs to know to do it. Also, with user interfaces, less is more. The more the software can figure out and do for the user without asking, the better. This section provides information that may be helpful in designing a good user interface, or in evaluating a user interface designed by others.
GENERAL USER INTERFACE ISSUES The user interface should be a modern graphical user interface with ample guidance for users to make decisions about the action required from them so that they can perform the tasks offered by the system. In addition, online help should be provided to assist them should they need additional information beyond what is presented on the software screens. The primary method for navigating through the system should be a menu, forms, or a combination of both. These should have options visible on the screen to take users to the areas of the program where they can perform the various activities. An example of a main menu form for an EDMS is shown in Figure 41. Users select actions by pressing labeled buttons on forms. Pressing a button will usually bring up another form for data input or viewing. Each form should have a button to bring them back to the previous form, so they can go back when they are finished or if they go to the wrong place. Forms for data entry and editing should have labeled fields, so it is clear from looking at the screen what information goes where. Data entry screens should have two levels of undo. Pressing the Escape key once will undo changes to the current field. Pressing it again will undo all changes to the current record. Multiple levels of undo can significantly increase the users’ confidence that they can recover from problems. The user interface should provide guidance to the users in two ways. The first is in the arrangement and labeling of controls on forms. Users should be able to look at a form and get enough visual clues so they know what to do. The second type of user interface guidance is tool tips. Tool tips are little windows that pop up when the cursor moves across a control providing the user with guidance on using that control. Consistency and clarity are critical in the user interface, especially for people who use the software on a part time or even an occasional basis.
The illusion of simplicity comes from focusing on only one variable. Rich (1996)
Figure 41 - Example of a menu form with navigation buttons
Figure 42 - Tool tip
CONCEPTUAL GUIDELINES An environmental data management program is intended to have a useful life of many years. During both the initial development and ongoing maintenance stages, it is likely that many different people will make additions and modifications to the software. This section is intended to provide guidance to those individuals so that the resulting user interface is as seamless as possible. The primary focus of these guidelines is on ease of use and consistency. These two factors, combined with a high level of functionality, will lead to a positive user experience with the software and acceptance of the system in the organization. A number of questions to be answered regarding user interface design are listed in Cooper (1995, p. 20). His advice for the answers to these questions can be broken down into two premises. The first is that the software should be presented from the point of view of what the user wants to do (manifest model) rather than what the computer needs to know to do it (implementation model). The second premise is that with user interfaces, less is more. The more the software can figure out and do for the user without asking, the better.
This section uses Cooper’s questions as a framework for discussing the user interface issues for an EDMS. The target data management system is a client-server system with Microsoft Access as the front end and SQL Server as the back-end, but the guidelines apply equally well for other designs. Answers are provided for a typical system design, but of course these answers will vary depending on the implementation details. It is important to note that the tools used in addressing user interface issues must be those of the system in which it is operating. Some of this material is based on interviews with users in the early stages of system implementation, and some on discussions with designers and experienced users, so it represents several perspectives. What should be the form of the program? – The data management system consists of tables, queries, forms, reports, macros, and modules in Microsoft Access and SQL Server to store and manipulate the data and present the user interface. For the most part, the user will see forms containing action buttons and other controls. The results of their actions will be presented as an Access form or report window. A recurring theme with users is that the system must be easy to learn and use if people are going to embrace it. This theme must be kept in mind during system design and maintenance. Every attempt should be made to ensure that the software provides users with the guidance that they need to make decisions and perform actions. The software should help them get their work done efficiently. How will the user interact with the program? – The primary interaction with the user is through screen forms with buttons for navigation and various data controls for selection and data entry. In general users should be able to make selections from drop-down lists rather than having to type in selections, and as much as possible the software should remember their answers to questions so that those answers can be suggested next time. The most common comment from users related to the user interface is that the system should be easy to use. People feel that they don't have much time to learn a new system. In order to gain acceptance, the new system will need to save time, or at least provide benefits that outweigh the costs of setup and use. Another way of saying this is that the software should be discoverable. The user should be able to obtain clues about what to do by looking at the screen. The example shown in Figure 43 shows a screen from a character-mode DOS interface. This interface is not discoverable. The user is expected to know what to type in to make the program do something. In this example, what the user tried didn’t work. The next example, Figure 44, shows a major improvement. Users need no a priori knowledge of what to do. They can look at the screen and figure out what to do by reading their options. Of course, even a good idea can have flaws. In this example, the flow is a little illogical, expecting users to click on Start to stop (shut down) their computer, but the general idea is a great improvement. The transition to a discoverable interface, especially at the operating system level, which was originally popularized by the Apple Macintosh computer and later by the Microsoft Windows operating system, has made computer use accessible to a much wider audience. How can the program’s function be most effectively organized? – The functions of the program are organized by the tasks to be performed. 
In most cases, users will start their session by selecting a project, and perhaps other selection criteria, and then select an action to perform on the selected data. Where a set of answers is required, the questions they are asked should be presented in a clear, logical sequence, either on a single form or as a series of related “wizard”-like screens. How will the program introduce itself to first-time users? – An example of program introduction would be for the program to display an introductory (splash) screen followed by the main menu. An on-screen “tour” or tutorial screen as shown in Figure 45 can be very helpful in getting a new user up and running fast.
Figure 43 - Example of an interface that is not “discoverable”
Figure 44 - Example of a “discoverable” interface
Figure 45 - On-screen “tour” or tutorial
A printed tutorial can also perform this function, but experience has shown that people are more likely to go through the tutorial if it is presented on-screen. This satisfies the “instant gratification” requirement of someone who has just acquired a new software program and wants to see it do something right away (the “out of box” experience). After that, users can take the time to learn the program in detail, get their data loaded, and use the software to perform useful work. How can the program put an understandable and controllable face on technology? – The software must make the user comfortable in working with the data. The user interface must make it easy to experiment with different approaches to retrieving and displaying data. This allows people to find the data selection and presentation approach that helps them best communicate the desired message. Figures 46 and 47 provide interesting examples of the good and the bad of using multiple display windows to help the user feel comfortable working with data. The multiple windows showing different ways of looking at the data would confuse some users. Others would be thrilled to be able to look at their data in all of these different ways. There is certainly a personal preference issue regarding how the software presents data and lets the user work with it. The software should provide the option of multiple on-screen displays, and users can open as many as they are comfortable with.
Figure 46 - Database software display with many data elements
Figure 47 - GIS software display with many data elements
Several things can be done in the user interface to support these usability objectives. As discussed above, the software should be discoverable. It should also be recoverable, so that users can back out of any selection or edit that they may make incorrectly. Users should be provided with information so that they can predict the consequences of their actions. For example, the software should give clues about how long a process will take, and the progress of execution. Any process that will take a long time should have a cancel button or other way to terminate processing. This goal can sometimes be difficult to accomplish, but it’s almost always worth the effort. How can the program deal with problems? – When an error occurs, the program should trap the error. If possible the software should deal with the error without user intervention. If this is not possible, then the program should present a description of the error to the user, along with options for recovery. The software designers should try to anticipate error conditions and prepare the software and the users to handle them. How will the program help infrequent users become more expert? – The user should be able to determine how to perform their desired action by looking at the screen. Tool tips should be provided for all controls to assist in learning, and context-sensitive help at the form level should be provided to make more information available should the user require it. How can the program provide sufficient depth for expert users? – The forms-based menu system should provide users with the bulk of the functionality needed to perform their work. Those wishing to go beyond this can be trained to use the object creation capabilities of the EDMS to make their own queries, forms, and reports for their specific needs. This is a major benefit of using a database program like Access rather than a compiled language like Visual Basic to build the EDMS front-end. Training is important at this stage so users can be steered away from pitfalls that can be detrimental to the quality of their output. For example, a section in Chapter 15 discusses data retrieval and output quality issues. Another way of looking at this issue is that the program should be capable of growing with the user.
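The progress feedback and cancel option described above can be provided in several ways; one hedged example for an Access front end is the built-in status bar meter. The routine below is only a sketch, and a true cancel option would also require a form with a button that sets a flag checked inside the loop.

' Show progress in the Access status bar while processing records.
' DoEvents lets Windows respond to user actions (such as a cancel
' button on a form) while the loop is running.
Public Sub ImportWithProgress(lngTotal As Long)
    Dim lngDone As Long

    SysCmd acSysCmdInitMeter, "Importing records...", lngTotal
    For lngDone = 1 To lngTotal
        ' ... process one record here ...
        SysCmd acSysCmdUpdateMeter, lngDone
        DoEvents
    Next lngDone
    SysCmd acSysCmdRemoveMeter
End Sub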
GUIDELINES FOR SPECIFIC ELEMENTS Automated import, data review, manual data entry, lookup table maintenance, and related administrative activities should be done in forms, queries, and modules called up from menu choices. The rest of the interaction with the system is normally done through the selection screen. This screen is a standardized way for the user to select a subset of the data to work with. They should be able to select a subset based on a variety of data elements. The selection process starts with a base query for each of the data elements. The selection screen then appends to the SQL “WHERE” clause based on the items they have selected on the screen. This query is then saved, and can be used as the basis for further processing such as retrieving or editing data. The system should make it easy, or at least possible, to add new functions to the system as specific needs are identified. The user interface is the component of the EDMS that interacts with the people using the software. The user interface for a data management system consists of five principal parts: Input, Editing, Selection, Output, and Maintenance. Input – This section of the user interface allows data to be put into the system. In an EDMS, this involves file import and manual input. File import allows data in one or more specified formats to be brought into the system. The user interface component of file import should let the user select the location and name of the file, along with the format of the file being imported. They should be able to specify how various import options like parameter and unit conversion will be handled. Manual input allows data to be typed into the system. The procedures for file import and manual input must provide the necessary level of quality assurance before the data is considered ready for use. Editing – It is necessary to provide a way to change data in the system. The user interface component of data editing consists of presenting the various data components and allowing them to
be changed. It is also critical that the process for changing data be highly controlled to prevent accidental or intentional corruption of the data. Data editing procedures should provide a process to assure that the changes are valid. Components of the data management software, such as referential integrity, lookup tables, and selections from drop-down menus, can help with this. Selection – The two most important parts of an EDMS are getting the data in and getting the data out. Especially in larger organizations, getting the data in is done by data administrators who have been trained and have experience in carefully putting data into the system. Getting the data out, however, is often done by project personnel who may not be as computer literate, or at least not database experts. The user interface must address the fact that there will be a range of types of users. At one extreme is the type of user who is not familiar or comfortable with computers, and may never be. In the middle are people who may not have had much experience with data management, but will learn more over time. At the high extreme are power users who know a lot about data management coming in and want to roll their sleeves up and dig in. The software should make all of these types of users comfortable in selecting and outputting data. A query by form (QBF) selection screen is one way to accomplish this. Output – Once a subset of the data has been selected, the software should allow the data to be output in a variety of formats. For a system like this, almost all of the output follows a selection step. The selection for output involves choosing the data content (tables and fields), record subset (certain sites, stations, sample dates, etc.) and output format. The software should provide a set of standard (canned) output formats that can be chosen from the QBF selection screen. These can range from relatively unformatted lists to formalized reports, along with graphs and maps, and perhaps output to a file for further processing. Maintenance – All databases require maintenance, and the larger the database (number of records, number of users, etc.), the more maintenance is required. For an EDMS, the maintenance involves the server and the clients. The largest item requiring maintenance in the server component of an EDMS is data backup, which is discussed in more detail in Chapter 15. Another server task is maintenance of the list of users and passwords. This must be done whenever people or assignments change. Also, the data volumes in the database manager may need to be changed occasionally as the amount of data increases. This is usually done by a computer professional in IS. Some maintenance of the client component of the EDMS is usually required. Access databases (.mdb files) grow over time because temporary objects such as queries are not automatically removed. Consequently, maintenance (compacting) of the .mdb files must be performed on an occasional basis, which can vary from weekly to monthly depending on the level of use. Also, occasionally .mdb files become corrupted and need to be repaired. This can be automated as well. Finally, as improvements are made to the EDMS system, new versions of the program file containing the front end will need to be distributed to users, and a simple process should be developed to perform this distribution to a large number of users with a minimum of effort.
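A query by form selection screen of the kind described at the beginning of this section might build its WHERE clause along these lines. This VBA sketch assumes a form named frmSelect with a few typical controls and field names, all of which are hypothetical.

' Build a WHERE clause from whatever the user filled in on the selection
' form; the result can be appended to a base query for output.
Public Function BuildWhereClause() As String
    Dim strWhere As String

    With Forms!frmSelect
        If Not IsNull(.cboSite) Then
            strWhere = strWhere & " AND SiteNumber = " & .cboSite
        End If
        If Not IsNull(.cboStation) Then
            strWhere = strWhere & " AND StationName = '" & .cboStation & "'"
        End If
        If Not IsNull(.txtStartDate) Then
            strWhere = strWhere & " AND SampleDate >= #" & .txtStartDate & "#"
        End If
    End With

    If Len(strWhere) > 0 Then
        strWhere = Mid(strWhere, 6)   ' drop the leading " AND "
    End If
    BuildWhereClause = strWhere
End Function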
DOCUMENTATION The three main factors that lead to a satisfactory user experience with software are an intuitive user interface, accessible and clear documentation, and effective training. The data management system should have two primary documentation types, hard copy and online. The online documentation consists of two parts, the user interface and the help file. The hard copy documentation and help file will be described in the following sections.
Hard copy The hard copy documentation should consist of two parts, a tutorial section and a reference section. The tutorial section should take the user through the process of using the software and
working with data. The reference section should cover the various aspects of working with the system, and have a goal of anticipating and answering any questions the user might have in working with the software. Both sections should have appropriate illustrations to help the user understand what is being described. In addition to covering the day-to-day operation of the system, the documentation should also cover maintenance procedures for both the client and server sides of the system.
Help file The data management system should be installed with a complete online help file using the standard Windows Help System. It can be based on the hard copy documentation, with any modifications necessary to adapt that documentation to the online format. Help screens should be provided for each form in the system (form-level context-sensitivity).
CHAPTER 8 IMPLEMENTING THE DATABASE SYSTEM
This chapter addresses the process of getting a database management system up and running. Topics covered include designing the system, installing the system, and other issues related to the transition from design and construction to ongoing use.
DESIGNING THE SYSTEM The design of the database system is important if the goals of the system are to be satisfied. This section covers a number of issues related to designing the system.
General design goals Before designing a database system, the goals and needs for the system should be clearly identified. To design a usable system, both general design goals such as those described in the literature and goals specific to the organization must be taken into account. This section presents the design goals for an EDMS from a general perspective. The objectives of database design are generally the same for all organizations and all database systems. Stating these goals can help with implementation decisions because the choice that furthers the greatest number of goals is usually the right choice. Some of the material in this section is discussed in Jennings (1995, pp. 828-829) and Yourdon (1996). The most important aspect of designing a database management system is determining the business problem to be solved. Once this has been done, many of the design decisions are much easier to make. In designing and implementing this system, the following goals should be considered.
Organization goals that should be addressed include:
• Fulfilling the needs of the organization for information in a timely, consistent, and economical manner.
• Working within the hardware and software standards of the organization, and, as much as is practical, using existing resources to minimize cost.
• Accommodating expansion of the database to adapt to changing organizational needs.
Planning for flexibility must be a part of the system design. Business rules have longevity, but processes change. The technology that models the business process must be able to change in order to have any longevity.
Data management goals that should be considered include:
• Providing rapid access to data required by each user category. This includes providing a user interface that is easy to learn and easy to use, yet flexible and powerful enough to allow any kind of data retrieval.
• Eliminating or minimizing the duplication of data across the organization.
• Easing the creation or integration of data entry, data review, editing, display, and reporting applications that efficiently serve the needs of the users of the database.
• Preserving data for future use. Any significant data available in digital form should be stored in the database, and backup and recovery procedures should be in place to preserve the content.
Database security goals include:
• Maintaining the integrity of the database so that it contains only validated, auditable information.
• Preventing access to the database by unauthorized people.
• Permitting access only to those elements of the database information that individual users or categories of users need in the course of their work.
• Allowing only authorized people to add or edit information in the database.
• Tracking modifications to the data.
Quality goals for the system could be:
• Designation of responsibility for decisions about data included in the database.
• Designation of responsibilities for decisions about data gathering.
• Use of approved data collection procedures.
Database project management issues include:
• Responsibilities for data management should be clearly defined.
• The system should provide the container for data. However, project managers should decide how to use it based on the business model for their project, since the level of detail that is appropriate may vary from project to project.
• Potential uses for the data should be identified so that the quality of the data gathered will match the intended use for that data.
• Objectives of the data gathering effort should be clearly and unambiguously established.
• Where several organizations are separately collecting data for multimedia assessments, it is essential that efforts be made to identify and discuss the needs of the principal users of the data. This includes establishment of minimum data accuracy.
• Once the intended uses for the data are defined, then a quality control program should be designed and implemented.
• Data needs should be periodically reviewed in the light of increased understanding of environmental processes, or changes in toxicological, climatic, or other environmental conditions.
• To get the full use of a database system, correct data collection procedures should be used to achieve the highest possible quality for the data being entered into the database.
• Accepted measurement and data handling methodologies should be used whenever possible.
• Well-tested measurement practices and standard reference materials should be developed if not already in use. This will allow adequate quality control practices to be implemented in measurement programs.
• Protocols for data measurement and management should be periodically reviewed.
• The quality of the collected data should be carefully reviewed and documented before the data is made available for general use.
• When data-gathering procedures are changed, care should be taken to assure that the old data can be correlated with the new set with no loss in continuity.
• Information on existing data programs and data and measurement standards should be disseminated widely to the data user community.
Determine and satisfy needs It is possible to develop a standard procedure for completing the project on time and on budget. This procedure has several steps, which will be described here. Many of these steps are discussed in more detail in later sections, but are included here to make sure that they are considered in the planning process.
Assess the needs of the organization – This is probably the most important step in the process, and it should continue on into implementation. One good approach is to select a cross section of potential users and other interested parties within the organization and interview them from a prepared questionnaire. (See Appendix A.) This questionnaire should be prepared for each project based on a template from previous similar projects. The questions on the form progress from general to specific in order to elicit each user's needs and interests for the system. An important factor in selecting technology is the attitudes of your organization and the individuals in it toward the adoption of new technology. Moore (1991), in his popular book Crossing the Chasm, describes the Technology Adoption Life Cycle, and groups people by their response to technology. Gilbert (1999) has related these groups to software implementation in environmental organizations. These groups are Innovators, Early adopters, Early majority, Late majority, and Laggards. The chasm lies between the Early adopters and the Early majority, and it is difficult to move technology across it. This concept is important in selling and then managing a technology project in your organization. You should analyze the decision makers and prospective users of the technology, and choose technology at the appropriate level of innovation that will be comfortable (or at least not too uncomfortable) for them.
Create a plan – Based on the results of the questionnaire, the implementation team should work with a small group of individuals within the target user group to develop a data management plan. This plan serves several major purposes. It provides a road map for the long-term direction of the project. It serves as a design document for the initial release of the system. And it helps facilitate discussion of the various issues to be addressed in developing and implementing the system. This often-overlooked step is critical to project success. Fail to plan, plan to fail.
Develop the data model – This is also a critical step in the process. In this step, which can be done in parallel with or subsequent to creation of the plan, the implementation team and the users work to make sure that the data content of the system meets the needs of the users.
Perform software modifications – If the needs assessment or the data model design identifies changes that need to be made in the software, they are performed in this step. If the database software is based on an open system using standard tools, these changes are usually quite straightforward. Of course, the level of effort required by this step is proportional to the number and scope of the changes to be made.
A key action that must be carried out throughout all of the above steps is to communicate with the future users. Too often, the team writing the software creates what they think the users want, only to find out weeks or months later that the users' needs are totally different. Frequent communication between the developers and users can help prevent frustration and project failure.
Test, then test again – Once the data model and software functionality have been implemented, the system must be fully tested. The installation team should test the software during and after the modifications and remedy any difficulties encountered at that time. Once the team members are satisfied that they have found all of the problems that they can, the software must be tested in the different and often more varied environment of the client site. It is better to start out with a small group of knowledgeable users, and then expand the user base as the number of problems encountered per unit of use time decreases. When the client and the implementation team agree that the problem rate is at an acceptable level, the software can be released for general use.
Document – Good documentation is important for a successful user experience. Some users prefer and will read a written manual. Others prefer online help. Both must be provided.
Train – Most users learn best from a combination of formally presented material and hands-on use. It is useful to have materials to teach classes for a variety of different types of users, and these materials can be modified prior to presentation to reflect the anticipated use pattern of the target users. A facility suitable for hands-on training must be provided in a location that is convenient for the students.
Support – When the user has a problem, which they will despite the best development and testing efforts, there must be a mechanism in place for them to get help.
The actual execution of these steps varies from client to client, but by following this process, the project has the greatest chance for success.
Prepare a plan A data management plan is intended to provide guidance during the detailed design and implementation of an EDMS. Design of a computerized data management system should begin with a survey of the data management needs of the organization. It should integrate knowledge about the group's information management needs gathered during the needs assessment phase, together with the necessary hardware and software components to satisfy as many of the data management needs as possible. It is intended that this plan be revised on a regular basis, perhaps semi-annually, as long as the data management system is in use. The expected life of a typical data management system is five to ten years. After that period, it can be anticipated that technology will have changed sufficiently and it will be appropriate to replace the system with a new tool. It is reasonable to expect that the data from the old system can be transported into the new system. Even during the life of the software it will be necessary to make changes to the system to accommodate changes in computer technology and changes in data needs. This is particularly true now, as the computer industry is undergoing a change from traditional client-server systems to browser-based Internet and intranet systems. Since the data management plan often involves an incremental development process where functionality is added over time, you should expect that lessons learned from early deployment will be incorporated into later development. Finally, the level of detail provided in the plan may vary in the discussion of the different data types to be stored. Additional detail will be added as software development progresses. You should allow the plan to evolve to address these and other, perhaps unanticipated, system changes.
A typical data management plan might contain the following sections:
Section 1 – System Requirements
Section 2 – System Design
Section 3 – Implementation Plan
Section 4 – Resource Requirements
Appendix A – Database Fundamentals
Appendix B – User Interface Guidelines
Appendix C – Preliminary Data Model
Appendix D – Preliminary System Diagrams
Appendix E – Data Transfer Standard
Appendix F – Coded Values
Appendix G – Data Management Survey Results
Appendix H – Other Issues and Enhancements
Appendix I – References
The plan should contain a discussion of all of the important issues. The level of detail may vary depending on the particular item and its urgency. For example, at the planning stage the data content of the system may be outlined in broad terms. After identification of the data components that are the most significant to potential users the data content can be filled in with more detail. The next step toward implementation of the database management system is the detailed design.
Design the system in detail The next step after finalizing the data management plan will usually be a detailed system design. This detailed design will identify and respond to various issues related to each data type being addressed. The plan should be viewed as a national road map. The national map provides an overview of the whole country, showing the relationships between different areas and the high-level connections between them, such as the Interstate highways. Just as a national map may be accompanied by more detailed sub-maps of some metropolitan areas, which contain adequate detail to take you wherever you want to go in those areas, the detailed system design provides the greater detail needed to build each part of the system. In many cases, the detailed design is not prepared entirely in one step, but parts of the system are designed in detail prior to implementation, in an evolving process. Figure 48 shows an example of this iterative process. This example shows a preliminary data model that was designed and then submitted to prospective users for comment. The result of the meeting to discuss the data model was the notes on the original data model shown in the figure. After several sessions of this type, the final data model was completed, which is shown at a reduced scale in Figure 49. This figure illustrates the complexity that resulted from the feedback process. It's important to catch as many errors as possible in this stage of the design process. Conventional wisdom in the software development business is that the cost to fix an error that is found at the end of the implementation process is a factor of 80 to 100 greater than to fix the same error early in the design process. It's worthwhile to beat the design process to death before you proceed with development, despite the natural desire to move on. One or two extra design sessions after all involved think they are happy with the design will usually pay for themselves many times over.
BUY OR BUILD? After the needs have been determined and a system design developed, you will need to decide whether to buy existing software or write your own (or have it written for you). A number of factors enter into this decision. The biggest, of course, is whether there is software that you can buy that does what you want. The closer existing software functionality matches your needs, the easier the decision is. Usually, it is more cost-effective when you can buy rather than build, for several reasons, mostly related to the number of features relative to the cost of acquiring the system. There is a cultural component to the decision, with some organizations preferring to write software, and others, perhaps with less interest or confidence in their development capabilities, opting to buy when possible. There may be times when you have to bite the bullet and write software when there is no viable alternative, and the benefits justify the cost.
Figure 48 - Intermediate step in the detailed design process
Confidence is the feeling you have before you understand the situation. Rich (1996)
Figure 49 - Reduced data model illustrating the complexity resulting from the detailed design process in the previous figure
There is a definite trend in the environmental consulting business away from custom database software development. This is due to two primary reasons. The first is that commercial software has improved to the point that it often satisfies all or most project needs out of the box. The second is more scrutiny of project budgets, with clients less willing to pay for software development unless it is absolutely necessary. Abbott (2001) has stated that 31% of software development projects are canceled before they are completed, and 53% ultimately cost 189% or more of their original budgets. Software projects completed by large companies typically retain about 42% of the features originally proposed. Buying off-the-shelf software decreases the chance of one of these types of project failure. It would be helpful to be able to estimate the cost of implementing a database system. Vizard (2001) states that $7 of every $10 spent on software goes into installing and integrating the software once it is purchased. Turning this around, for every dollar spent on purchasing or developing the software, about two more are spent getting it up and running. So for a rough estimate of the project cost, take the cost of the software and triple it to get the total cost of the implementation.
IMPLEMENTING THE SYSTEM Once the system is selected or designed, there are a number of tasks that must be completed to get it up and running. These basic tasks are the same whether you are buying or building the software.
Acquire and install the hardware and software The process of selecting, purchasing, and installing a data management system, or of writing one, can be quite involved. You should be sure that the software selected or developed fits the needs of the individuals and the organization, and that it will run on existing hardware. It may be necessary to negotiate license and support agreements with the vendor. Then it will be necessary to install the software on users’ computers and perhaps on one or more servers.
Special considerations for developing software If you are building the software instead of buying it, it is important to follow good software development practices. Writing quality software is a difficult, error-prone process. For an overview of software quality assurance, see Wallace (2000). Here are a few tips that may help: Start with a requirements plan – Developing good software starts with carefully identifying the requirements for the finished system as described above (Abbott, 2001). According to Abbott, 40 to 60% of software defects and failures result from bad requirements. Getting everyone involved in the project to focus on developing requirements, and getting developers to follow them, can be very difficult, but the result is definitely worthwhile. Abbott quotes statistics that changes that occur in the development stage cost five times as much as those that occur during requirements development, and once the product is in production, the cost impact is a factor of 100. Use the best tool for the job – Choose the development environment so that it fits project needs. If code security is important, a compiled language such as Visual Basic may be a good choice. If flexibility and ease of change is important, a database language like Access is best. Manage change during development – Even with the best plan, changes will occur during development, and managing these changes is critical. On all but the smallest projects, a formal change order process should be used, where all changes are documented in writing, and all stakeholders sign off on every change before the developer changes code. A good guideline is ANVO (accept no verbal orders). Use prototypes and incremental versions – Developers should provide work examples to users early and throughout the process, and solicit feedback on a regular basis, to identify problems as early as possible in the development process. Then the change order process can be used to implement modifications. Manage the source code – Use source code management software or a formal process for source code management to minimize the chance for conflicting development activities and lost work, especially on projects with multiple developers. Implement a quality program – There are many different types of quality programs for software development. ISO 9000 for quality management and ISO 14000 for environmental management can be applied to EDMS software development and use. TQM, QFD, SQFD, and CMM are other examples of quality programs that can be used for any software development. Which program you choose is less important than that you choose one and stick to it. Write modular code and reuse when possible – Writing software in small pieces, testing these pieces thoroughly, and then assembling the pieces with further testing is a good way to build reliable code. Where possible centralize calculations in functions rather than spreading them across forms, reports, and code. Each module should have one entry and one exit point. Use syntax checking – Take advantage of the syntax checking of your development environment. Most languages provide a way for the program to make sure your syntax is valid before you move on, either by displaying a dialog box, or underlining the text, or both. Format your code – Use indentation, blank lines, and any other formatting tools you can think of to make your code easier to read. Explicitly declare variables – Always require that variables be explicitly declared in the declarations section of the code. 
In Visual Basic and VBA you can require this with the statement Option Explicit at the top of each module, and you can have this entered automatically by turning on Require Variable Declaration in the program options. (A short code sketch at the end of this list illustrates this and several of the other suggestions.)
Watch variable scope – The scope of variables (where in the code they are active) can be tricky. Where possible, declare the scope explicitly, and avoid reusing variable names in subroutines in case you get it wrong. Keeping the scope as local as possible is usually a good idea. Be very careful about modifying global variables within a function.
Design tables carefully – Every primary table should have a unique, system-assigned ID and an updated date to track changes.
Be careful about error handling – This may be the one item that separates amateur code from professional code. A good programmer will anticipate error conditions that could arise and provide error trapping that helps the user understand what the problem is. For example, if the detection limit is needed for a calculation but is not present, an EDMS error message like “Please enter the detection limit” is much more helpful than a system message like “Invalid use of Null.”
Use consistent names – The more consistent you are in naming variables and other objects, the easier it will be to develop and support the code. There are several naming conventions to choose from, and which one you use is not as important as using it consistently. An example of one such system can be found in Microsoft (2001). For example, it is a good idea to name variables based on their data type and the data they contain, such as txtStationName for the text field that contains station names. Avoid field and variable names that are keywords in the development language or in any other environment where they may be used.
Document your code – Document code internally and provide programmer documentation for the finished product. Each procedure should start with a comment that describes what the procedure does, as well as its input and output variables. Inline comments in the code should be plentiful and clear. Don’t assume that it will be obvious what you are doing (or, more importantly, why you are doing it this way instead of some other way), especially when you try to get clever.
Don’t get out of your depth – Many advanced computer users have an exaggerated view of their programming skills. If you are building a system that has a lot riding on it, be sure the person doing the development is up to the task.
Don’t forget to communicate – On a regular basis, such as weekly or monthly depending on the length of the project, or after each new section of the system is completed, talk to the users. Ask them whether the direction you are taking is what they need. If so, proceed. If not, stop work and talk about directions, expectations, solutions, and so on, before you write one more line of code. This could make the difference between project success and failure.
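To make several of these suggestions concrete, here is a minimal Access VBA sketch. It is a hypothetical example rather than code from any particular EDMS, and the function and variable names are invented for illustration. It shows Option Explicit, an explicitly scoped and consistently named variable, a calculation centralized in one function, and error trapping that gives the user a meaningful message rather than the system’s “Invalid use of Null.”

    Option Explicit   ' require explicit declaration of every variable in this module

    ' Centralized calculation: report non-detects at half the detection limit.
    ' Keeping the formula in one function avoids repeating it on forms and reports.
    Public Function HalfDetectionLimit(ByVal varDetLimit As Variant) As Double
        If IsNull(varDetLimit) Then
            ' A friendly EDMS message instead of the system "Invalid use of Null"
            MsgBox "Please enter the detection limit.", vbExclamation
            HalfDetectionLimit = 0
        Else
            HalfDetectionLimit = CDbl(varDetLimit) / 2
        End If
    End Function

    Public Sub ExampleUsage()
        Dim dblHalf As Double              ' local scope; name shows type and content
        dblHalf = HalfDetectionLimit(0.005)
        Debug.Print dblHalf                ' prints the result in the Immediate window
    End Sub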
Test The system must be thoroughly tested prior to release to users. This is usually done in two parts, alpha testing and beta testing, and both are quite valuable. Each type of testing should have a test plan that is carefully documented and methodically followed.
Alpha testing – Alpha testing is testing of the software by the software developer after it has been written, but before it is delivered to users. Alpha testing is usually performed incrementally during development, and then comprehensively immediately before the software is released for beta testing. For commercial software purchased off the shelf, this should already have been done before the software is released to the general public. For custom software being made to order, this stage is very important and, unfortunately, in the heat of project deadlines, may not get the attention it deserves. The test plan for alpha testing should exercise all features of the software, and when a change is made, all previous tests should be re-run (regression testing). Test items should include logic tests, coverage (the software works for all cases), boundary tests (if appropriate), satisfaction of requirements, inputs vs. outputs, and user interface performance. It is also important to test the program in all of the environments where the software is going to be deployed. If some users will be using Access 97 running under Windows 95, and others Access 2000 and Windows 2000, then a test environment should be set up for each one. A program like Norton Ghost, which lets you easily restore various hard drive images, can be a great time saver in setting up test systems. A test machine with removable drives for different operating system versions can be helpful as well, and is not expensive to set up. Don’t assume that a feature that does what you want in one environment will work the same (or at all) in a different one.
Beta testing – After the functionality is implemented and tested by the software author, the software should be provided to willing users and tested on selected computers. There is nothing
like real users and real data to expose the flaws in the software. The feedback from the beta testers should then be used to improve the software, and the process repeated as necessary until the list of problem reports each time is very short, ideally zero. The test plan for beta testing is usually less formal than that for alpha testing. Beta testers should use the software in a way similar to how they would use the final product. A caveat here is that since the software is not yet fully certified it is likely to contain bugs. In some cases these bugs can result in lost or corrupted data. Beta testers should be very careful about how they use the results obtained by the software. It is best to consider beta testing as a separate process from their ordinary workflow, especially early in the beta test cycle.
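As an illustration of the kind of repeatable check that belongs in an alpha test plan, here is a minimal sketch of a regression test written in VBA. It assumes the hypothetical HalfDetectionLimit function from the earlier sketch; in a real test plan there would be one such routine for each area of the program, re-run after every change.

    Option Explicit

    ' A tiny, repeatable regression test; Debug.Assert halts in the IDE on failure.
    Public Sub TestHalfDetectionLimit()
        Debug.Assert HalfDetectionLimit(0.005) = 0.0025   ' typical value
        Debug.Assert HalfDetectionLimit(1) = 0.5          ' whole number
        Debug.Assert HalfDetectionLimit(0) = 0            ' boundary case
        Debug.Print "HalfDetectionLimit tests passed at " & Now()
    End Sub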
Document An important part of a successful user experience is for users to be able to figure out how to get the software to do the things that they need to do. Making the software discoverable, so that visual cues on the screen help users figure out what to do, can help with this. Unfortunately this is not always possible, so users should be provided with documentation to help them over the rough spots. Often a combination of printed and online documentation provides the best source of information for the user in need. The response to documentation varies tremendously from user to user. Some people, when faced with a new software product, will take the book home and read it before they start using the program. Other users (perhaps most) will refuse to read the manual and will call technical support with questions that are clearly answered in both the manual and the help file. This is a personality issue, and both types of help must be provided.
Train Environmental professionals are usually already very busy with their existing workloads. The prospect of finding time to learn new software is daunting to many of these people. The implementation phase must provide adequate training on use of the system, while not requiring a great time commitment from those learning it. These somewhat contradictory goals must be accommodated in order for the system to be accepted by users. New training tools on the horizon, such as Web-based, self-paced training, show promise in helping people learn with a minimum impact on their time and the organization’s budget. Training should be provided for three classes of users. System administrators manage the system itself. Data administrators are responsible for the data in the system. Users access the data to assist them in their work. Usually the training of administrators involves the greatest time commitment, since they require the broadest knowledge of the system, but since it is likely to be an important part of their job, they are usually willing to take the time to learn the software.
System administrators – Training for system administrators usually includes an overview of the database tools such as Access, SQL Server, or Oracle; the implementation of the data management system using these tools; and operation of the system. It should cover installation of the client software, and maintenance of the server system including volume management, user administration, and backup and restoration of data.
Data administrators – Training for data administrators should cover primarily data import and editing, including management of the quality tracking system. Particular emphasis should be placed on building a thorough understanding of the data and how it is reported by the laboratories, since this is a place where a lot of the problems can occur. The data administrator training should also provide information on making enhancements to the system, such as customized reports for specific user needs.
Users – Users should be trained on operation of the system. This should include some of the theory behind the design of the system, but should mostly focus on how they can accomplish the
tasks to make their jobs easier, and how to get the most out of the system, especially in the area of data selection and retrieval. It should also include instructions for maintenance of files that may be placed on the user’s system by the software.
MANAGING THE SYSTEM In implementing a data management system, there are many other issues that should be considered both internal to and outside of the organization implementing the system. A few of these issues are addressed here.
Licensing the software Software licensing describes the relationship between the owner of the software and the users. Generally you don’t buy software, you pay for the right to use it, and the agreement that covers this is called the software license. This license agreement may be a signed document, or it may be a shrink-wrap agreement, where your using the software implies agreement to the terms. Software licenses can have many forms. A few are described briefly here. Organizations implementing any software should pay attention to exactly what rights they are getting for the money they are spending. Computer license – When software is licensed by computer, the software may be installed and used on one computer by whichever user is sitting at that computer. This was a useful approach when many organizations had fewer computers than users. Implicit in this licensing is that users cannot take the software home to use there, or to another computer. User license – Software can be licensed by user. In this situation, the software is licensed to a particular person. That person can take the software to whichever computer he or she is using at any particular time, including a desktop, laptop, and, in some license agreements, home computer. This type and the previous one are sometimes called licensing the software by seat. Concurrent user license – If the software is licensed by concurrent user, then licenses are purchased for the maximum number of users who will be using the software at any one time. If an organization had users working three non-overlapping shifts, then it could buy one-third the number of licenses as it has users, since only one third would be using the software at once. Software licensed this way should have a license manager, which is a program that tracks and reports usage patterns. If not, the organization or software vendor should perform occasional audits of use to ensure that the right number of licenses are in force. Server license – When a large component of the software runs on a server rather than on client computers, it can be licensed to each server on which it runs, with no attention paid to the number of users. Site license – This is similar to a server license, except that the software can be run on any computers at one specific facility, or site. Company-wide license – This extends the concept of a site license to the whole company. Periodic lease – In this approach, there is usually no up-front cost, but there is a periodic payment, such as monthly or annually. Usually this is used for very expensive software, especially if it requires a lot of maintenance and support, because this can all be rolled into the periodic fee. Pay per use – A variation on the periodic lease is pay per use, where you pay for the amount of time the software is used, to perform a specific calculation, look up a piece of reference information, and so on. As with the periodic lease, this is most popular for very expensive software. Application server – With the advent of the Internet and the World Wide Web, a new option for software licensing has become available. In this model, you pay little or no up-front fee, and pay as you go, as in the previous example. The difference is that most or all of the program is running on a server at the service provider, and your computer runs only a Web browser, perhaps
with some add-ins. The company providing this service is called an application service provider or ASP. There are many variations and combinations of these license types in order to fit the needs of the customer and software vendor. Purchasers of software can sometimes negotiate better license fees by suggesting a licensing model that fits their needs better.
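As a sketch of how simple a home-grown license manager can be, the following hypothetical VBA function counts active sessions in a shared table (here called tblLicenseUse, with one row per user currently running the application; both the table and the approach are assumptions for illustration) and compares the count to the number of concurrent seats purchased. It assumes an Access database with a reference to the DAO object library.

    Option Explicit

    ' Hypothetical concurrent-license check: tblLicenseUse holds one row per
    ' active session; refuse to start a new session when all seats are in use.
    Public Function LicenseAvailable(ByVal lngSeatsPurchased As Long) As Boolean
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT Count(*) AS ActiveSeats FROM tblLicenseUse", dbOpenSnapshot)
        LicenseAvailable = (rs!ActiveSeats < lngSeatsPurchased)
        rs.Close
    End Function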
Interfaces with other organizations There may be a variety of groups outside of the environmental organization that can be expected to interact with the environmental database in various different ways. During the detailed design phase of software implementation, there should be a focused effort to identify all such groups and to determine their requirements for interaction with the system. The following sections list some of these related organizations that might have some involvement in the EDMS. Information Technology – In many organizations, especially larger ones, an information technology group (IT, or sometimes called IS for Information Services) is charged with servicing the computer needs of the company. This group is usually responsible for a variety of areas ranging from desktop computers and networking systems to company mainframes and accounting software. In discussions with IT, it is often clear that it has resources that could be useful in building an EDMS. The network that it manages often connects the workstations on which the user interface for the software will run. It may have computers running data management software such as Oracle and Microsoft SQL Server, which could provide the back-end for data storage. Finally, it has expertise in many areas of networking and data management, which could be useful in implementing and supporting the data management system. It is also often the case that IT personnel are very busy with their current activities and do not, in general, have very much time available to take on a new project. On the other hand, their help will be needed in a variety of areas, the largest being support in implementing the back-end database. The people responsible for implementing the EDMS should arrange for a liaison person from IT to be assigned to the project, with his or her time funded by the environmental organization if necessary, to provide guidance as the project moves ahead. Help will probably be needed from others during design, implementation, and ongoing operations, and a mechanism should be put in place to provide access to these people. In implementing data management systems in organizations where IT should be involved, one thing stands out as the most important in helping smooth the relationship. That thing is early and ongoing involvement of IT in the selection, design, and implementation process. When this happens, the implementation is more likely to go smoothly. The reasons for lack of communication are often complicated, involving politics, culture differences, and other non-technical factors, but in general, effort in encouraging cooperation in this area is well rewarded. Operating divisions – Different organizations have different reporting structures and responsibilities for environmental and other data management. Often the operating divisions are the source of the data, or at least have some interest in the gathering and interpretation of the data. Once again, the most important key is early and frequent communication. Remote facilities – Remote facilities can have data management needs, but can also provide a challenge because they may not be connected directly to the company network, or the connection may not be of adequate speed to support direct database connections. These issues must be part of the planning and design process if these people are to have satisfactory access to the data. Laboratories – Laboratories provide a unique set of problems and opportunities in implementing an EDMS. 
In many cases, the laboratories are the primary source of the data. If the data can be made to flow efficiently from the laboratory into the database, a great step has been taken toward an effective system. Unfortunately, many laboratories have limited resources for responding to requests for digital data, especially if the format of the digital file is different from what they are used to providing.
There are several things that can be done to improve the cooperation obtained from the labs:
• Communicate clearly with the laboratory.
• Provide clear and specific instructions on how to deliver data.
• Be consistent in your data requirements and delivery formats.
• Be an important customer. Many companies have cut down on the number of laboratories that they use so that the amount of work going to the remaining labs is of sufficient volume to justify compliance by the lab with data transfer standards.
• Choose a lab or labs that are able to comply with your needs in a cost-effective way.
• Be constantly on guard for data format and content changes. Just because they got it right last quarter does not ensure that they will get it right this time.
Appendix C contains an example of a Data Transfer Standard document that can be used to facilitate communication with the lab about data structure and content. A great timesaving technique is to build a feedback loop between the data administrator and the lab. The EDMS software can be used to help with this. The data administrator and the lab should be provided with the same version of the software. The data administrator maintains the database with updated information on wells, parameter name spellings, units, and so on, and provides a copy of the database to the lab. Before issuing a data deliverable, the lab imports the deliverable into the database. It then remedies any problems indicated by the import process. Once the data imports cleanly, it can be sent to the data administrator for import into the main database. This can be a great time-saver for the busy data administrator, because most data deliverables will import cleanly on the first try. This capability is described in more detail in Chapter 13. Consultants – Consultants provide a special challenge in implementing the database system. In some cases they can act like or actually be laboratories, generating data. In other cases they may be involved in quality assurance aspects of the data management projects. It is often helpful for the consultants working on a project to be using the same software as the rest of the project team to facilitate transfer of data between the two. If that can’t be done, at least make sure that there is a format common to the two programs to use for transferring the data. Regulators – In many (perhaps most) cases, regulators are the true “customers” of the data management project. Usually they need to receive the data after it has been gathered, entered, and undergone quality assurance procedures. The format in which the data is delivered is usually determined by the requirements of the regulators, and can vary from a specific format tailored to their needs, to the native format in which it is stored in the EDMS. In the latter case they may request to be provided with a copy of the EDMS software so that they can easily work with the data. Companies managing the monitoring or cleanup project may view this with mixed feelings. It is usually to their benefit to keep the regulators happy and to provide them with all of the data that they request. They are often concerned, however, that by providing the regulators with the data and powerful software for exploring it, the regulators may uncover issues before the company is ready to respond. Usually on issues like this, the lawyers will need to be involved in the decision process. Auditors – Quality assurance auditors can work with the data directly, with subsets of the data, or with reports generated from the system. Sometimes their interest is more in the process used in working with the data than in the content of the data itself. In other cases they will want to dive into the details to make sure that the numbers and other information made it into the system correctly. The details of how this is done are usually spelled out in the QAPP (quality assurance program plan) and should be followed scrupulously.
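Returning to the feedback loop with the laboratory described above, the check that the lab (or the data administrator) runs before a deliverable is accepted can be as simple as a query comparing the incoming file against the reference lists. The following VBA sketch is hypothetical; the staging and reference table names (tblImportStaging and tblParameters) and the field names are invented for illustration.

    Option Explicit

    ' Hypothetical pre-import check: list parameter names in the staging table
    ' that do not match the reference list, so they can be fixed before import.
    Public Sub CheckParameterNames()
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT DISTINCT s.ParameterName " & _
            "FROM tblImportStaging AS s LEFT JOIN tblParameters AS p " & _
            "ON s.ParameterName = p.ParameterName " & _
            "WHERE p.ParameterName Is Null", dbOpenSnapshot)
        Do While Not rs.EOF
            Debug.Print "Unmatched parameter: " & rs!ParameterName
            rs.MoveNext
        Loop
        rs.Close
    End Sub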
Managing the implementation project Implementing a database system, whether you are buying or building, is a complex project. Good project management techniques are important if you are going to complete the project successfully. It is important to develop schedules and budgets early in the implementation project,
and track performance relative to plan at regular intervals during the project. Particular attention should be paid to programming, if the system is being built, and to data cleanup and loading, as these are areas where time and cost overruns are common. For managing the programming component of the project, it is important to maintain an ongoing balance between features and time/cost. If work is ahead of schedule, you might consider adding more features. If, as is more often the case, the schedule is in trouble, you might look for features to eliminate or at least postpone. To control the data cleanup time, pay close attention to where the time overruns are occurring. Repetitive tasks such as fixing systematic data problems can often be automated, either external to the database using a spreadsheet or custom programs, or using the EDMS to help with the cleanup. If that is not enough, and the project is still behind, it may be necessary to prioritize the order of data import, and import either the cleanest data or the data that is most urgently needed first, and delay import of other data to stay on schedule.
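As an example of the kind of repetitive cleanup that is worth automating, the hypothetical VBA routine below corrects one systematic spelling problem in a staging table before import. The table, field, and values are invented for illustration; a real cleanup script would more likely be driven by a lookup table of known corrections.

    Option Explicit

    ' Hypothetical automated cleanup: standardize a recurring misspelling
    ' in the staging table before the data is imported.
    Public Sub FixParameterSpelling()
        Dim db As DAO.Database
        Set db = CurrentDb
        db.Execute "UPDATE tblImportStaging SET ParameterName = 'Benzene' " & _
                   "WHERE ParameterName = 'Benzine'", dbFailOnError
        Debug.Print db.RecordsAffected & " rows corrected"
    End Sub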
Preparing for change It is important to remember that any organization can be expected to undergo reorganizations, which change the administrative structure and the required activities of the people using the system. Likewise, projects undergo changes in management philosophy, regulatory goals, and so on. Such changes should be expected to occur during the life of the system. The design of the system must allow it to be easily modified to accommodate these changes. The key design component in preparing for change is flexibility in how the database is implemented. It should be relatively easy to change the organization of the data, moving sites to different databases, stations between sites, how values and flags are displayed, and so on. Effort expended early in the database design and implementation may be repaid manyfold later during the stressful period resulting from a major change, especially in the face of deadlines. The Boy Scouts were right: “Be Prepared.” All of the issues above should be considered and dealt with prior to and during the implementation project. Many will remain once the system is up and running.
CHAPTER 9 ONGOING DATA MANAGEMENT ACTIVITIES
Once the data management system has been implemented, the work is just starting. The cost of ongoing data management activities, in time and/or financial outlay, will usually exceed the implementation cost of the system, at least if the system is used for any significant period of time. These activities should be taken into account in calculating the total cost of ownership over the lifetime of the system. When this calculation is made, it may turn out that a feature in the software that appears expensive up-front may actually cost less over time relative to the labor required by not having the feature. Also, these ongoing activities must be both planned for and then performed as part of the process if the system is expected to be a success. These activities include managing the workflow, managing the data, and administering the system. Many of these activities are described in more detail later in this book.
MANAGING THE WORKFLOW For large projects, managing the workflow efficiently can be critical to project success. Data flow diagrams and workflow automation can help with this.
Data flow diagrams Usually the process of working with the data involves more than one person, often within several organizations. In many cases those individuals are working on several projects, and it can be easy to lose track of who is supposed to do what. A useful tool to keep track of this is to create and maintain a data flow diagram for each project (if this wasn’t done during the design phase). These flow diagrams can be distributed in hard copy, or made available via an intranet page so people can refer to them when necessary. This can be particularly helpful when a contact person becomes unavailable due to illness or other reason. If the flowchart has backup names at each position, then it is easier to keep the work flowing. An example of a data flow diagram is shown in Figure 50.
Workflow automation Tools are now becoming available that can have the software help with various aspects of moving the data through the data management process. Workflow automation software takes responsibility for knowing the flow of the data through the process, and makes it happen.
Figure 50 - Data flow diagram for a project. The flowchart traces the flow of work and data among the named project participants (project manager, site manager, field technician, laboratory, data administrator, hydrologist reviewer, report preparer, and third parties), from the sampling plan and scheduled or routine sampling through sample collection, field measurements, chain-of-custody preparation, delivery of samples to the lab, laboratory analysis, test import into the EDMS, resolution of any problems with the lab, import into the main database, reference list updates, table and report preparation, and interpretation.
In a workflow automated setting, after the laboratory finishes its data deliverable the workflow automation system sends the data to the data administrator, who checks and imports the data. The project manager is automatically notified and the appropriate data review process initiated. After this the data management software generates the necessary reports, which, after appropriate review, are sent to the regulators. At the present time, bits and pieces are available to put a system together that acts this way, but the level of integration and automation can be expected to improve in the near future.
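Even without a dedicated workflow product, some of this can be approximated with the EDMS itself. The hypothetical VBA sketch below records a status change for a deliverable in a tracking table and sends a notification message; the table, field, and address names are invented for illustration, and a commercial workflow tool would do considerably more.

    Option Explicit

    ' Hypothetical workflow step: mark a deliverable as imported in a tracking
    ' table and notify the project manager that the data is ready for review.
    Public Sub AdvanceDeliverable(ByVal lngDeliverableID As Long)
        Dim db As DAO.Database
        Set db = CurrentDb
        db.Execute "UPDATE tblDeliverables SET Status = 'Imported', " & _
                   "StatusDate = Now() WHERE DeliverableID = " & lngDeliverableID, _
                   dbFailOnError
        DoCmd.SendObject acSendNoObject, , , "projectmanager@example.com", , , _
            "Deliverable " & lngDeliverableID & " imported", _
            "The data has been imported and is ready for review.", False
    End Sub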
MANAGING THE DATA There are a number of tasks that must be performed on an ongoing basis as the data management system is being used. These activities cover getting the data in, maintaining it, and getting it out. Since it costs money to gather, manage, and interpret data, project managers and data managers should spend that money wisely on activities that will provide the greatest return on that investment. Sara (1994, p. 1-11) has presented the “Six Laws of Environmental Data”:
1. The most important data is that which is used in making decisions, therefore only collect data that is part of the decision making process.
2. The cost of collecting, interpreting, and reporting data is about equal to the cost of analytical services.
3. About 90% of data does not pass from the first generation level (of data use and interpretation) to the next (meaning that it is never really used).
4. There is significant operational and interpretive skill required in moving up the generation ladder.
5. Data interpretation is no better than the quality control used to generate the original data.
6. Significant environmental data should be apparent.
Convert historical data The historical data for each project must be loaded before the system can be used on that project for trend and similar analyses. After that, current data must be loaded for that project on an ongoing basis. For both of these activities, data review must be performed on the data. An important issue in the use of the EDMS is developing a process for determining which data is to be loaded. This includes the choice of which projects will have their data loaded and how much historical data will be loaded for each of those projects. It also includes decisions about the timing of data loading for each project. Most organizations have limited resources for performing these activities, and the needs of the various projects must be balanced against resource availability in order to load the data that will provide the greatest return on the data loading investment. For historical data loading, project personnel will need to identify the data to be loaded for projects, and make the data available to whoever is loading it. This will require anywhere from a few hours to several weeks or more for each project, depending on the amount of the data and the difficulty in locating it. Significant data loading projects can cost in the hundreds of thousands of dollars.
Import data from laboratories or other sources For most projects this is the largest part of managing the EDMS. Adequate personnel must be assigned to this part of the project. This includes working with the laboratories to ensure that clean data is delivered in a useful format, and also importing the data into the database. There must be enough well trained people so that the work goes smoothly and does not back up to the point where it affects the projects using the system. Importing data is covered in more detail in Chapter 13.
Murphy’s law of thermodynamics: Things get worse under pressure. Rich (1996)
Manage the review status of all data Knowing where the data came from and what has happened to it is very important in order to know how much you can trust it. The EDMS can help significantly with this process. Management of the review status is discussed in depth in Chapter 15.
Select data for display or editing Many benefits can be obtained by moving enterprise environmental data to a centralized, open database. With all of the data in one place, project personnel will need to work with the data, and it will be unusual for them to need to look at all of the data at once. They will want to select parts of the data for analysis and display, so the selection system becomes more important as the data becomes more centralized. See Chapter 18 for more information on selecting data.
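A typical selection is a small subset of the database: one station, one constituent, one time period. The VBA sketch below is a hypothetical example of such a selection (the table, field, station, and parameter names are invented); in practice the EDMS would build a query like this from choices the user makes on a selection screen.

    Option Explicit

    ' Hypothetical selection: benzene results for one station over one year.
    Public Sub SelectResults()
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT SampleDate, Value, Units FROM tblResults " & _
            "WHERE StationName = 'MW-1' AND ParameterName = 'Benzene' " & _
            "AND SampleDate BETWEEN #1/1/2001# AND #12/31/2001# " & _
            "ORDER BY SampleDate", dbOpenSnapshot)
        Do While Not rs.EOF
            Debug.Print rs!SampleDate, rs!Value, rs!Units
            rs.MoveNext
        Loop
        rs.Close
    End Sub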
Analyze the data Once the data is organized and easy to find, the next step is to analyze it to get a better understanding of what the data is telling you about the site. Organizations that implement data management systems often find that, once they are managing the data more efficiently and spending less time on it, they can spend more time analyzing it. This can provide great benefits to the project. Map analysis of the data is discussed in Chapter 22, and statistical analysis in Chapter 23.
Generate graphs, maps, and reports Building the database is usually not the goal of the project. The goal is using the data to make decisions. This means that the benefits will be derived by generating output, either using the EDMS directly or with other applications. Several chapters in Part Five cover various aspects of using the data to achieve project benefits. The important point to be made here is that, during the planning and implementation phases, as much or more attention should be paid to how the data will be used compared to how it will be gathered, and often this is not done.
Use the data in other applications Modern EDMS products usually provide a broad suite of data analysis and display tools, but they can’t do everything. The primary purpose of the EDMS is to provide a central repository for the data, and the tools to get the data in and out. It should be easy to use the data in other applications, and software interface tools like ODBC are making this much easier. Integration of the database with other programs is covered in Chapter 24.
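As a sketch of what this looks like in practice, the hypothetical example below uses VBA in Excel to pull a selection of results from the EDMS over ODBC and drop it onto a worksheet. The data source name “EDMS” and the table and field names are assumptions for illustration.

    Option Explicit

    ' Hypothetical example of another application (Excel) reading EDMS data
    ' through ODBC, assuming an ODBC data source named "EDMS" has been set up.
    Public Sub PullResultsIntoExcel()
        Dim cn As Object, rs As Object
        Set cn = CreateObject("ADODB.Connection")
        cn.Open "DSN=EDMS"
        Set rs = cn.Execute("SELECT StationName, SampleDate, Value " & _
                            "FROM tblResults WHERE ParameterName = 'Benzene'")
        ActiveSheet.Range("A2").CopyFromRecordset rs   ' paste the results
        rs.Close
        cn.Close
    End Sub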
ADMINISTERING THE SYSTEM During and after installation of the EDMS, there are a number of activities that must be performed, mostly on an ongoing basis, to keep the system operational. Some of the most
important of these activities are discussed here. A significant amount of staff time may be required for these tasks. In some cases, consultants can be substituted for internal staff in addressing these issues. Time estimates for these items can be difficult to generate because they are so dependent on the volume of data being processed, but are important for planning and allocating resources, especially when the project is facing a deadline.
System maintenance The EDMS will require ongoing maintenance in order to keep it operational. This is true to a small degree of the client component and to a larger degree of the server part. Client computer maintenance – For the client computers, this might include installing new versions of the software, and, for Access-based systems, compacting and perhaps repairing the database files. As new versions of the program file are released containing bug fixes and enhancements, these will need to be installed on each client computer, and especially for large installations an efficient way of doing this must be established. Compacting should be done on a regular basis. Repairing the data files on the client computers may occasionally be required should one of them become corrupted. This process is not usually difficult or time intensive. System maintenance will require that each user spend one hour or less each month to maintain his or her database. Server maintenance – For the server, there are several maintenance activities that must be performed on a regular basis. The most important and time-consuming is backing up system data as discussed below and in Chapter 15. Also, the user database for the system must be maintained as users are added and removed, or as their activities and data access needs change. Finally, with most server programs the database volume must be resized occasionally as the data content increases. System administrators and data administrators will need to spend more time on their maintenance tasks than users. System administrators should expect to spend at least several hours each week on system maintenance. The time requirements for data administrators will depend to a large extent on the amount of data that they are responsible for maintaining. Importing data from a laboratory can take a few minutes or several hours, depending on the complexity of the data, and the number of problems that must be overcome. Likewise, data review time for each data set can vary widely. Gathering and inputting existing data from hard copy can be very time-consuming, and then reviewing that data requires additional time. A complete data search and entry project for a large site could take several people weeks or more. Project or management personnel will need to decide which sites will be imported, in which order, and at what rate, based on personnel available.
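For Access-based systems, compacting can be scripted so that it is not forgotten. The sketch below is a hypothetical example using the DAO CompactDatabase method; the file paths are invented, the file being compacted must not be open, and the compacted copy is written to a new file name.

    Option Explicit

    ' Hypothetical maintenance routine: compact a (closed) Access data file
    ' into a new copy. Run this from a separate database or script.
    Public Sub CompactDataFile()
        DBEngine.CompactDatabase "C:\EDMS\SiteData.mdb", "C:\EDMS\SiteData_compact.mdb"
    End Sub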
Software and user support A significant contributor to the success of an EDMS is the availability of support on the system after installation. Even the best software requires good support in order to provide a satisfactory user experience. As the system is used, people will need help in using the system efficiently. This will include hot-line support so they have someone to call when they have a problem. Sometimes their problem will be that they can’t figure out how to use all the features of the system. Other times they will identify problems with the software that must be fixed. Still other times they will identify enhancements that might be added in a later release of the software. The organization should implement a system for obtaining software support, either through power users, a dedicated support staff, the software vendor, or consultants. There are two primary software support needs for an EDMS: system issues and feature issues. System issues involve running the software itself. This includes network problems, printing problems, and other similar items. These usually occur at the start of using the product. This usually requires an hour or less per user, both for the client and the support technician, to overcome any of these problems. Recurrence rate, after preliminary shakedown, is relatively low.
The second type of support is on software features. This ranges from usage issues like import problems to questions about how to make enhancements. This support for each user usually peaks in the first week or two of intense use, and tapers off after that, but never goes away entirely. The amount of support depends greatly on the way the software is used and the computer literacy level of the user. Some users require an hour of support or less to be up and running; others need four hours or more in the first couple of months. Adequate training on theoretical and hands-on aspects of the software can cut the support load by about half. Users and the support organization should expect that the greatest amount of support will be required shortly after each user starts using the software. Staging the implementation so that not all of the users are new users at once can help with the load on the support line. It also allows early users to help their neighbors, which can be an efficient way of providing support in some situations. There are a number of different types of support required for each user.
Initial hands-on support – After the software is delivered to the users, the development staff or support personnel should be onsite for a period of time to assist with any difficulties that are encountered. Technical personnel with access to reference resources should back up these people to assist with overcoming any obstacles.
Telephone/email support – Once the system is up and running, problems will almost certainly be encountered. Often the resolution of these problems is simple for someone with a good understanding of how the system operates. These problems usually fall into two categories: user error (misuse of the software) and software error. In the case of user error, two steps are required. The first is to help the user overcome the current obstacle. The second is to analyze the user error to determine whether it could be avoided in the future with changes to the user interface, written documentation, help system, or training program. If the problem is due to software error, it should be analyzed for seriousness. Based on that analysis, the problem should be addressed immediately (if there is the potential for data loss or the problem leads to very inefficient use of the software) or added to a list of corrections to be performed as part of a future release.
Troubleshooting – If software problems are identified, qualified personnel must be available to address them. This usually involves duplicating the problem if possible, identifying the cause, determining the solution, scheduling and performing the fix, and distributing the modified code.
Power user development Once the system is operational, some people may express an interest in becoming more knowledgeable about using the software, and perhaps in learning to expand and customize the system. It is often to the organization’s advantage to encourage these people and to support their learning more about the system, because this greater knowledge reduces dependence on support staff, and perhaps consultants. A system should be put in place for developing the advanced capabilities of these people, often referred to as power users. This system might include specialized training for their advanced needs, and perhaps also specialized software support for them. Development of power users is usually done individually as people identify themselves as this type of user. This will require their time for formal and informal training to expand their knowledge of the system.
Enhancements and customization After the EDMS has been installed and people are using the system, it is likely that they will have ideas for improving the system. A keystone of several popular management approaches to quality and productivity is continuous improvement. An EDMS can certainly benefit from this process. Users should be encouraged to provide feedback on the system, and, assuming that the software has the flexibility and configurability to accommodate it, changes should be made on an ongoing basis so that over time the system becomes a better and better fit to users’ needs. The
organization should implement a system for gathering users’ suggestions, ranking them by the return on any cost that they may entail, and then implementing those that make good business sense. There is a conflict between the need for continuous improvement and the need to control the versions of the software in use. Too many small improvements can lead to differing and sometimes incompatible versions of the software. Too few revisions can result in an unacceptably long time until improvements are made available to users. A compromise must be found based on the needs of individual users, the development team, and the organization.
Backup of data to protect from loss The importance of backing up the database cannot be overemphasized. This should be done a minimum of every day. In some organizations, Information Services staff will do this as part of their services. Loss of data can be extremely costly. More information on the backup task can be found in Chapter 15.
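For a small Access-based system without Information Services support, even a very simple scripted copy is better than no backup at all. The sketch below is a hypothetical example; the paths are invented, and the data file must be closed while the copy is made. Larger client/server systems should use the backup tools that come with the database server.

    Option Explicit

    ' Hypothetical nightly backup: copy the data file to a date-stamped name.
    Public Sub BackupDataFile()
        FileCopy "C:\EDMS\SiteData.mdb", _
                 "D:\Backups\SiteData_" & Format(Date, "yyyymmdd") & ".mdb"
    End Sub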
PART THREE - GATHERING ENVIRONMENTAL DATA
CHAPTER 10 SITE INVESTIGATION AND REMEDIATION
The site investigation and remediation process is usually the reason for site environmental data management. The results of the data management process can provide vital input to the decision-making process. This chapter provides an overview of the regulations that drive the site investigation and remediation process, some information on how the process works under the major environmental regulations, and how data management and display are involved in the different parts of the process. Related processes are environmental assessments and environmental impact statements, which can also be aided by an EDMS.
OVERVIEW OF ENVIRONMENTAL REGULATIONS The environmental industry is driven by government regulations. These regulations have been enacted at the national, state, or local level. Nearly all environmental investigation and remediation activity is performed to satisfy regulatory requirements. A good overview of environmental regulations can be found in Mackenthun (1998). The following are some of the most significant environmental regulations: National Environmental Policy Act of 1969 (NEPA) – Requires federal agencies to consider potentially significant environmental impacts of major federal actions prior to taking the action. The NEPA process contains three levels of possible documentation: 1) Categorical Exclusion (CATEX), where no significant effects are found, 2) Environmental Assessment (EA), which addresses various aspects of the project including alternatives, potential impacts, and mitigation measures, and 3) Environmental Impact Statement (EIS), which covers topics similar to an EA, but in more detail. Clean Air Act of 1970 (CAA) – Provides for the designation of air quality control regions, and requires National Ambient Air Quality Standards (NAAQS) for six criteria pollutants (particulate matter, sulfur dioxide, carbon monoxide, ozone, nitrogen dioxide, and lead). Also requires National Emission Standards for Hazardous Air Pollutants (NESHAPs) for 189 hazardous air pollutants. The act requires states to implement NAAQS, and requires that source performance standards be developed and attained by new sources of air pollution. Occupational Safety and Health Act of 1970 – Requires private employers to provide a place of employment safe from recognized hazards. The act is administered by the Occupational Safety and Health Administration (OSHA).
Bad regulations are more likely to be supplemented than repealed. Rich (1996)
Endangered Species Act of 1973 (ESA) – Provides for the listing of threatened or endangered species. Any federal actions must be evaluated for their impact on endangered species, and the act makes it illegal to harm, pursue, kill, etc. a listed endangered or threatened species. Safe Drinking Water Act of 1974 (SDWA) – Protects groundwater aquifers and provides standards to ensure safe drinking water at the tap. It makes drinking water standards applicable to all public water systems with at least 15 service connections serving at least 25 individuals. Requires primary drinking water standards that specify maximum contamination at the tap, and prohibits certain activities that may adversely affect water quality. Resource Conservation and Recovery Act of 1976 (RCRA) – Regulates hazardous wastes from their generation through disposal, and protects groundwater from land disposal of hazardous waste. It requires criteria for identifying and listing of hazardous waste, and covers transportation and handling of hazardous materials in operating facilities. The act also covers construction, management of, and releases from underground storage tanks (USTs). In 1999, 20,000 hazardous waste generators regulated by RCRA produced over 40 million tons of hazardous waste (EPA, 2001b). RCRA was amended in 1984 with the Hazardous and Solid Waste Amendments (HSWA) that required phasing out land disposal of hazardous waste. Toxic Substances Control Act of 1976 (TSCA) – Requires testing of any substance that may present an unreasonable risk of injury to health or the environment, and gives the EPA authority to regulate these substances. Covers the more than 60,000 substances manufactured or processed, but excludes nuclear materials, firearms and ammunition, pesticides, tobacco, food additives, drugs, and cosmetics. Clean Water Act of 1977 (CWA) – Based on the Federal Water Pollution Control Act of 1972 and several other acts. Amended significantly in 1987. This act, which seeks to eliminate the discharge of pollutants into navigable waterways, has provisions for managing water quality and permitting of treatment technology. Development of water quality standards is left to the states, which must set standards at least as stringent as federal water quality standards. Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (CERCLA, Superfund) – Enacted to clean up abandoned and inactive hazardous waste sites. Creates a tax on the manufacture of certain chemicals to create a trust fund called the Superfund. Sites to be cleaned up are prioritized as a National Priority List (NPL) by the EPA. Procedures and cleanup criteria are specified by a National Contingency Plan. The NPL originally contained 408 sites, and now contains over 1300. Another 30,000 sites are being evaluated for addition to the list. Emergency Planning and Community Right-to-Know Act of 1986 (EPCRA) – Enacted after the Union Carbide plant disaster in Bhopal, India in 1984, in which release of methyl isocyanate from a chemical plant killed 2,000 and impacted the health of 170,000 survivors, this law requires industrial facilities to disclose information about chemicals stored onsite. Pollution Prevention Act of 1990 (PPA) – Requires collection of information on source reduction, recycling, and treatment of listed hazardous chemicals.
Resulted in a Toxic Release Inventory for facilities including amounts disposed of onsite and sent offsite, recycled, and used for energy recovery. These regulations have contributed significantly to improvement of our environment. They have also resulted in a huge amount of paperwork and other expenses for many organizations, and explain why environmental coordinators stay very busy.
THE INVESTIGATION AND REMEDIATION PROCESS The details of the site investigation and remediation process vary depending on the regulation under which the work is being done. Superfund was designed to remedy mistakes in hazardous waste management made in the past at sites that have been abandoned or where a sole responsible party cannot be determined. RCRA deals with sites that have viable operators and ongoing operations. The majority of sites fall into one of these two categories. The rest operate under a range of regulations through various different regulatory bodies, many of which are agencies in the various states.
CERCLA CERCLA (Superfund) gives the EPA the authority to respond to releases or threatened releases of hazardous substances that may endanger human health and the environment. The three major areas of enforcement at Superfund sites are: achieving site investigations and cleanups led by the potentially responsible party (PRP) or parties (PRP lead cleanups, meaning the lead party on the project is the PRP); overseeing PRP investigation and cleanup activities; and recovering from PRPs the costs spent by EPA at Superfund cleanups (Fund lead cleanups). The National Contingency Plan of CERCLA describes the procedures for identification, evaluation, and remediation of past hazardous waste disposal sites. These procedures are preliminary assessment and site inspection; Hazard Ranking System (HRS) scoring and National Priority List (NPL) site listing; remedial investigation and feasibility studies; record of decision; remedial design and remedial action; construction completion; operation and maintenance; and NPL site deletion. Site environmental data can be generated at various steps in the process. Additional information on Superfund enforcement can be found in EPA (2001a). Preliminary assessment and site inspection – The process starts with investigations of site conditions. A preliminary assessment (PA) is a limited scope investigation performed at each site. Its purpose is to gather readily available information about the site and surrounding area to determine the threat posed by the site. The site inspection (SI) provides the data needed for the hazard ranking system, and identifies sites that enter the NPL site listing process (see below). SIs typically involve environmental and waste sampling that can be managed using the EDMS. HRS scoring and NPL site listing – The hazard ranking system (HRS) is a numerically based screening system that uses information from initial, limited investigations to assess the relative potential of sites to pose a threat to human health or the environment. The HRS assigns a numerical score to factors that relate to risk based on conditions at the site. The four risk pathways scored by HRS are groundwater migration; surface water migration; soil exposure; and air migration. HRS is the principal mechanism EPA uses to place uncontrolled waste sites on the National Priorities List (NPL). Identification of a site for the NPL helps the EPA determine which sites warrant further investigation, make funding decisions, notify the public, and serve notice to PRPs that EPA may begin remedial action. Remedial investigation and feasibility studies – Once a site is on the NPL, a remedial investigation/feasibility study (RI/FS) is conducted at the site. The remedial investigation involves collection of data to characterize site conditions, determine the nature of the waste, assess the risk to human health and the environment, and conduct treatability testing to evaluate the potential performance and cost of the treatment technologies that are being considered. The feasibility study is then used for the development, screening, and detailed evaluation of alternative remedial actions. The RI/FS has five phases: scoping; site characterization; development and screening of alternatives; treatability investigations; and detailed analyses. The EDMS can make a significant contribution to the site characterization component of the RI/FS, which often involves a significant amount of sampling of soil, water, and air at the site. The EDMS serves as a repository of the data,
as well as a tool for data selection and analysis to support the decision-making process. Part of the site characterization process is to develop a baseline risk assessment to identify the existing or potential risks that may be posed to human health and environment at the site. The EDMS can be very useful in this process by helping screen the data for exceedences that may represent risk factors. Record of decision – Once the RI/FS has been completed, a record of decision (ROD) is issued that explains which of the cleanup alternatives will be used to clean up the site. This public document can be significant for data management activities because it often sets target levels for contaminants that will be used in the EDMS for filtering, comparison, and so on. Remedial design and remedial action – In the remedial design (RD), the technical specifications for cleanup remedies and technologies are designed. The remedial action (RA) follows the remedial design and involves the construction or implementation phase of the site cleanup. The RD/RA is based on specifications described in the ROD. The EDMS can assist greatly with tracking the progress of the RA and determining when ROD limits have been met. Construction completion – A construction completion list (CCL) helps identify successful completion of cleanup activities. Sites qualify for construction completion when any physical construction is complete (whether or not cleanup levels have been met), EPA has determined that construction is not required, or the site qualifies for deletion from the NPL. Operation and maintenance – Operation and maintenance (O&M) activities protect the integrity of the selected remedy for a site, and are initiated by the state after the site has achieved the actions and goals outlined in the ROD. The site is then determined to be operational and functional (O&F) based on state and federal agreement when the remedy for a site is functioning properly and performing as designed, or has been in place for one year. O&M monitoring involves inspection; sampling and analysis; routine maintenance; and reporting. The EDMS is used heavily in this stage of the process. NPL site deletion – In this final step, sites are removed from the NPL once they are judged to no longer be a significant threat to human health and the environment. To date, not many sites have been delisted.
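As an illustration of the kind of screening the EDMS can do against ROD target levels or other action levels, here is a hypothetical VBA sketch. The table and field names (tblResults and tblActionLevels) are invented; a production system would also match on units, medium, and site.

    Option Explicit

    ' Hypothetical exceedence screen: list results above the action level
    ' stored for each parameter in a lookup table.
    Public Sub ScreenExceedences()
        Dim rs As DAO.Recordset
        Set rs = CurrentDb.OpenRecordset( _
            "SELECT r.StationName, r.ParameterName, r.SampleDate, r.Value, " & _
            "a.ActionLevel FROM tblResults AS r INNER JOIN tblActionLevels AS a " & _
            "ON r.ParameterName = a.ParameterName " & _
            "WHERE r.Value > a.ActionLevel " & _
            "ORDER BY r.StationName, r.SampleDate", dbOpenSnapshot)
        Do While Not rs.EOF
            Debug.Print rs!StationName, rs!ParameterName, rs!Value, rs!ActionLevel
            rs.MoveNext
        Loop
        rs.Close
    End Sub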
RCRA The EPA’s Office of Solid Waste (OSW) is responsible for ensuring that currently generated solid waste is managed properly, and that currently operating facilities address any contaminant releases from their operations. In some cases, accidents or other activities at RCRA facilities have released hazardous materials into the environment, and the RCRA Corrective Action Program covers the investigation and cleanup of these facilities. Additional information on RCRA enforcement can be found in EPA (2001b). As a condition of receiving a RCRA operating permit, active facilities are required to clean up contaminants that are being released or have been released in the past. EPA, in cooperation with the states, verifies compliance through compliance monitoring, educational activities, voluntary incentive programs, and a strong enforcement program. The EDMS is heavily involved in compliance monitoring and to some degree in enforcement actions. Compliance monitoring – EPA and the states determine a waste handler’s compliance with RCRA requirements using inspections, record reviews, sampling, and other activities. The EDMS can generate reports comparing sampling results to regulatory limits to save time in the compliance monitoring process. Enforcement actions – The compliance monitoring process can turn up violations, and enforcement actions are taken to bring the waste handler into compliance and deter further violations. These actions can include administrative actions, civil judicial actions, and criminal actions. In addition, citizens can file suit to bring enforcement actions against violators or potential violators.
One important distinction from a data management perspective between CERCLA and RCRA projects is that CERCLA projects deal with past processes, while RCRA projects deal with both past and present processes. This means that the EDMS for both projects needs to store information on soil, groundwater, etc., while the RCRA EDMS also might store information on ongoing processes such as effluent concentrations and volumes, and even production and other operational information.
Other regulatory oversight

While many sites are investigated and remediated under CERCLA or RCRA, other regulatory oversight is also possible. The EPA has certified some states to oversee cleanup within their boundaries. In some cases, other government agencies, including the armed forces, oversee their own cleanup efforts. In general, the technical activities performed are pretty much the same regardless of the type of oversight, and the functional requirements for the EDMS are also the same. The main exception is that some of these agencies require the use of specific reporting tools as described in Chapter 5.
ENVIRONMENTAL ASSESSMENTS AND ENVIRONMENTAL IMPACT STATEMENTS

The National Environmental Policy Act of 1969 (NEPA), along with various supplemental laws and legal decisions, requires federal agencies to consider the environmental impacts and possible alternatives of any federal actions that significantly affect the environment (Mackenthun, 1998, p. 15; Yost, 1997, p. 1-11). This usually starts with an environmental assessment (EA). The EA can result in a determination that an environmental impact statement (EIS) is required, or in a finding of no significant impact (FONSI). The EIS is a document that is prepared to assist with decision making based on the environmental consequences and reasonable alternatives of the action. The format of an EIS is recommended in 40 CFR 1502.10, and is normally limited to 150 pages. Often there is considerable public involvement in this process.

One important use of environmental assessments is in real estate transactions. The seller and especially the buyer want to be aware of any environmental liabilities related to the property being transferred. These assessments are broken into phases. The data management requirements of EAs and EISs vary considerably, depending on the nature of the project and the amount and type of data available.

Phase 1 Environmental Assessment – This process involves evaluation of existing data about a site, along with a visual inspection, followed by a written report. It is similar to a preliminary assessment and site inspection under CERCLA, and can satisfy some CERCLA requirements such as the innocent landowner defense. The Phase 1 assessment process is well defined, and guidelines such as Practice E-1527-00 from the American Society for Testing and Materials (ASTM 2001a, 2001b) are used for the assessment and reporting process. There are four parts to this process: gathering information about past and present activities and uses at the site and adjoining properties; reviewing environmental files maintained by the site owner and regulatory agencies; inspection of the site by an environmental professional; and preparation of a report identifying existing and potential sources of contamination on the property. The work involves document searches and review of air photos and site maps. Often the source materials are in hard copy not amenable to data management. Public and private databases are available to search ownership, toxic substance release, and other information, but this data is usually managed by its providers and not by the person performing the search. Phase 1 assessments for a small property are generally not long or complicated, and can cost as little as $1,000.
Phase 2 Investigation – If a Phase 1 assessment determines that the presence of contamination is likely, the next step is a Phase 2 assessment. The primary differences are that Phase 1 relies on existing data, while in Phase 2 new data is gathered, usually in an intrusive manner, and the Phase 2 process is less well defined. This can involve sampling soil, sediment, and sludge and installation of wells for sampling groundwater. This is similar to remedial investigation and feasibility studies under CERCLA. If the assessment progresses to the point where samples are being taken and analyzed, then the in-house data management system can be of value.

Phase 3 Site Remediation and Decommissioning – The final step of the assessment process, if necessary, is to perform the cleanup and assess the results. Motivation for the remediation might include the need to improve conditions prior to a property transfer, to prevent contamination from migrating off the property, to improve the value of the property, or to avoid future liability. Monitoring the cleanup process, which can involve ongoing sampling and analysis, will usually involve the EDMS.
CHAPTER 11 GATHERING SAMPLES AND DATA IN THE FIELD
Environmental monitoring at industrial and other facilities can involve one or more different media. The most common are soil, sediment, groundwater, surface water, and air. Other media of concern in specific situations might include dust, paint, waste, sludge, plants and animals, and blood and tissue. Each medium has its own data requirements and special problems. Generating site environmental data starts with preparing sampling plans and gathering the samples and related data in the field. There are a number of aspects of this process that can have a significant impact on the resulting data quality. Because the sampling process is specific to the medium being sampled, this chapter is organized by medium. Only the major media are discussed in detail.
GENERAL SAMPLING ISSUES

The process of gathering data in the field, sending samples to the laboratory, analyzing the data, and reporting the results is complicated and error-prone. The people doing the work are often overworked and underpaid (who isn’t), and the requirements to do a good job are stringent. Problems that can lead to questionable or unusable data can occur at any step of the way. The exercise (and in some cases, requirement) of preparing sampling plans can help minimize field data problems. Field sampling activities must be fully documented in conformance with project quality guidelines. Those guidelines should be carefully thought out and followed methodically. A few general issues are covered here. The purpose of this section is not to teach field personnel to perform the sampling, but to help data management staff understand where the samples and data come from in order to use it properly. In all cases, regulations and project plans should be followed in preference to statements made here. Additional information on these issues can be found in ASTM (1997), DOE/HWP (1990a), and Lapham, Wilde, and Koterba (1985).
Taking representative samples

Joseph (1998) points out that the basic premise of sampling is that the sample must represent the whole population, and quotes the law of statistical regularity as stating that “a set of subjects taken at random from a large group tends to reproduce the characteristics of that large group.” But the sample is only valid if the errors introduced in the sampling process do not invalidate the results for the purpose intended for the samples. Analysis of the samples should result in no bias and minimum random errors.
Figure 51 - Types of sampling patterns: simple random sampling, judgment sampling, grid (systematic) sampling, stratified sampling, random grid sampling, and two-stage sampling
The size of the sample set is directly related to the precision of the result. More samples cost more money, but give a more reliable result. If you start with the precision required, then the number of samples required can be calculated:
n = Ns² / (N(B²/4) + s²)
where n is the number of samples, N is the size of the population, s is the standard deviation of the sample, and B is the desired precision, expressed as an acceptable bound on the estimation error (the factor of 4 in the denominator corresponds to a confidence level of roughly 95%). According to Joseph (1998), the standard deviation can be estimated by taking the largest value of the data minus the smallest and dividing by four. A small worked example of this calculation appears at the end of this section.

There are several strategies for laying out a sampling program. Figure 51, modified after Adolfo and Rosecrance (1993), shows six possibilities. Sampling strategies are also discussed in Sara (1994, p. 10-49). In simple random sampling, the chance of selecting any particular location is the same. With judgment sampling, sampling points are selected based on previous knowledge of the system to be sampled. Grid sampling provides uniform coverage of the area to be studied. In stratified sampling, sample locations are based on the existence of discrete areas of interest, such as aquifers and confining layers, or disposal ponds and the areas between them. Random grid sampling combines uniform coverage of the study area with a degree of random selection of each location, which can be useful when access to some locations is difficult. With two-stage sampling, secondary sampling locations are based on results of primary stage samples. In the example shown, primary sample A had elevated values, so additional samples were taken nearby, while primary sample B was clean, so no follow-up samples were taken.

Care should be taken so that the sample locations are as representative as possible of the conditions being investigated. For example, well and sample locations near roadways may be influenced by salting and weed spraying activities. Also, cross-contamination from dirty samples must be avoided by using procedures like sampling first from areas expected to have the least contamination, then progressing to areas expected to have more.
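To make the sample-size formula concrete, here is a minimal Python sketch that applies the formula above along with the range-based estimate of s; the example values are invented for illustration.

import math

def estimate_std_dev(values):
    """Rough estimate of s as (largest value - smallest value) / 4,
    per Joseph (1998) as cited in the text."""
    return (max(values) - min(values)) / 4.0

def required_samples(population_size, std_dev, bound):
    """n = N*s^2 / (N*(B^2/4) + s^2), rounded up because a fraction
    of a sample cannot be collected."""
    n = (population_size * std_dev ** 2) / (
        population_size * bound ** 2 / 4.0 + std_dev ** 2)
    return math.ceil(n)

# Illustrative screening values (e.g., mg/kg) from an earlier walkover survey.
previous_values = [12.0, 48.0, 95.0, 31.0, 140.0]
s = estimate_std_dev(previous_values)          # (140 - 12) / 4 = 32
print(required_samples(population_size=400, std_dev=s, bound=10.0))   # 38

Tightening the bound B quickly drives the required number of samples up, which is the cost versus precision tradeoff described above.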
Logbooks and record forms

Field activities must be fully documented using site and field logbooks. The site logbook stores information on all field investigative activities, and is the master record of those activities. The field logbook covers the same activities, but in more detail. The laboratory also should keep a logbook for tracking the samples after it receives them. The field logbook should be kept up-to-date at all times. It should include information such as well identification; date and time of sampling; depth; fluid levels; yield; purge volume, pumping rate, and time; collection methods; evacuation procedures; sampling sequence; container types and sample identification numbers; preservation; requested parameters; field analysis data; sample distribution and transportation plans; name of collector; and sampling conditions.

Several field record forms are used as part of the sampling process. These include Sample Identification and Chain of Custody forms. Also important are sample seals to preserve the integrity of the sample between sampling and when it is opened in the laboratory. These are legal documents, and should be created and handled with great care.

Sample Identification forms are usually a label or tag so that they stay with the sample. Labels must be waterproof and completed in permanent ink. These forms should contain such information as site name; unique field identification of sample, such as station number; date and time of sample collection; type of sample (matrix) and method of collection; name of person taking the sample; sample preservation; and type of analyses to be conducted.

Chain of Custody (COC) forms make it possible to trace a sample from the sampling event through transport and analysis. The COC must contain the following information: project name; signature of sampler; identification of sampling stations; unique sample numbers; date and time of collection and of sample possession; grab or composite designation; matrix; number of containers; parameters requested for analysis; preservatives and shipping temperatures; and signatures of individuals involved in sample transfer.
Velilind’s Laws of Experimentation:
1. If reproducibility may be a problem, conduct the test only once.
2. If a straight line fit is required, obtain only two data points.
McMullen (1996)

COC forms should be enclosed in a plastic cover and placed in the shipping container with the samples. When the samples are given to the shipping company, the shipping receipt number should be recorded on the COC and in the site logbook. All transfers should be documented with the signature, date, and time on the form. A sample must remain under custody at all times. A sample is under custody if it is in the sampler’s possession; it is in the sampler’s view after being in possession; it is in the possession of a traceable carrier; it is in the possession of another responsible party such as a laboratory; or it is in a designated secure area.
SOIL

Soil sampling must take into account that soil is a very complex physical material. The solid component of soil is a mix of inorganic and organic materials. In place in the ground, soil can contain one or more liquid phases and a gas phase, and these can be absorbed or adsorbed in various ways. The sampling, transportation, and analysis processes must be managed carefully so that analytical results accurately represent the true composition of the sample.
Soil sampling issues

Before a soil sample can be taken, the material to be sampled must be exposed. For surface or shallow subsurface soil samples this is generally not an issue, but for deeper subsurface samples this usually requires digging, which can be done using either drilling or drive methods. For unconsolidated formations, the hole can be made using augers (hollow-stem, solid flight, or bucket), drilling (rotary, sonic, directional), or jetting methods. For consolidated formations, rotary drilling (rotary bit, downhole hammer, or diamond drill) or cable tools can be used. Drive methods include cone penetrometers or direct push samplers. Sometimes it is useful to do a borehole geophysical survey after the hole is drilled. Examples of typical measurements include spontaneous potential, resistivity, gamma and neutron surveys, acoustic velocity, caliper, temperature, fluid flow, and electromagnetic induction.

Soil samples are gathered with a variety of tools, including spoons, scoops, shovels, tubes, and cores. The samples are then sealed and sent to the laboratory. Duplicates should be taken as required by the QAPP (quality assurance project plan). Sometimes soil samples are taken as a boring is made, and then the boring is converted to a monitoring well for groundwater, so both soil and water samples may come from the same hole in the ground.

Typical requirements for soil samples are as follows. The collection points should be surveyed relative to a permanent reference point, located on a site map, and referenced in the field logbook. A clean, decontaminated auger, spoon, or trowel should be used for each sample collected. Surface or air contact should be minimized by placing the sample in an airtight container immediately after collection. The sampling information should be recorded in the field logbook and any other appropriate forms. For subsurface samples, the process for verifying depth of sampling, the depth control tolerance, and the devices used to capture the samples should be as specified in the work plan. Care must be taken to prevent cross-contamination or misidentification of samples.

Sometimes the gas content of soil is of concern, and special sampling techniques must be used. These include static soil gas sampling, soil gas probes, and air sampling devices.
Groundwater or Ground Water?

Is “groundwater” one word or two? When used by itself, groundwater as one word looks fine, and many people write it this way. The problem comes in when it is written along with surface water, which is always two words, and ground water as two words looks better. Some individuals and organizations prefer it one way, some the other, so apparently neither is right or wrong. For any one writer, just as with “data is” vs. “data are,” the most important thing is to pick one and be consistent.

Special consideration should be given for soil and sediment samples to be analyzed for volatile organics (VOAs). The samples should be taken with the least disturbance possible, such as using California tubes. Use Teflon or stainless steel equipment. If preservatives are required, they should be added to the bottle before sampling. Samples for VOA analysis should not be split. Air bubbles cannot be present in the sample. The sample should never be frozen.
Soil data issues

Soil data is usually gathered in discrete samples, either as surface samples or as part of a soil boring or well installation process. Then the sample is sent to the laboratory for analysis, which can be chemical, physical, or both. Each sample has a specific concentration of each constituent of concern. Sometimes it is useful to know not only the concentration of a toxin, but also its mobility in groundwater. Useful information can be provided by a leach test such as TCLP (toxicity characterization leaching procedure), in which a liquid is passed through the soil sample and the concentration in the leachate is measured. This process is described in more detail in Chapter 12.

Key data fields for soil samples include the site and station name; the sample date and depth; COC and other field sample identification numbers; how the sample was taken and who took it; transportation information; and any sample QC information such as duplicate and other designations. For surface soil samples, the map coordinates are usually important. For subsurface soil samples, the map coordinates of the well or boring, along with the depth, often as a range (top and bottom), should be recorded. Often a description of the soil or rock material is recorded as the sample is taken, along with stratigraphic or other geologic information, and this should be stored in the EDMS as well.
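As a rough illustration only (not the data model given in the appendices), the key fields listed above might be represented in a record structure along these lines; the field names and units are assumptions.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SoilSample:
    """Hypothetical record holding the key soil sample fields named in the text."""
    site: str
    station: str
    sample_date: date
    depth_top: Optional[float] = None       # depth range, ft below ground (assumed unit)
    depth_bottom: Optional[float] = None
    coc_number: Optional[str] = None         # chain of custody / field sample ID
    sampling_method: Optional[str] = None
    sampled_by: Optional[str] = None
    qc_designation: Optional[str] = None     # e.g., field duplicate, split
    lithology_description: Optional[str] = None

# Example record for a subsurface sample from a soil boring.
sample = SoilSample(site="Plant A", station="SB-07", sample_date=date(2001, 6, 14),
                    depth_top=4.0, depth_bottom=6.0, qc_designation="field duplicate")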
SEDIMENT

Procedures for taking sediment samples are similar to those for soil samples. Samples should be collected from areas of least to greatest contamination, and from upstream to downstream. Sediment plumes and density currents should be avoided during sample collection.
GROUNDWATER

Groundwater is an important resource, and much environmental work involves protecting and remediating groundwater. A good overview of groundwater and its protection can be found in Conservation Technology Resource Center (2001). Groundwater accounts for more than 95% of all fresh water available for use, and about 40% of river flow depends on groundwater. About 50% of Americans (and 95% of rural residents) obtain all or part of their drinking water from groundwater. Groundwater samples are usually taken at a location such as a monitoring well, for an extended period of time such as quarterly, for many years. Additional information on groundwater sampling can be found in NWWA (1986) and Triplett (1997).
Figure 52 - Submersible sampling pump (Courtesy of Geotech Environmental Equipment)
Groundwater sampling issues

The first step in groundwater sampling is to select the location and drill the hole. Drilling methods are similar to those described above in the section on soil sampling, and soil samples can be taken when a groundwater well is drilled. Then the wellbore equipment such as tubing, screens, and annular material is placed in the hole to make the well. The tubing closes off part of the hole and the screens open the other part so water can enter the wellbore. Screening must be at the correct depth so the right interval is being sampled. Prior to the first sampling event, the well is developed. For each subsequent event it is purged and then sampled. The following discussion is intended to generally cover the issues of groundwater sampling. The details of the sampling process should be covered in the project work plan. Appropriate physical measurements of the groundwater are taken in the field. The sample is placed in a bottle, preserved as appropriate, chilled and placed in a cooler, and sent to the laboratory.

Well development begins sometime after the well is installed. A 24-hour delay is typical. Water is removed from the well by pumping or bailing, and development usually continues until the water produced is clear and free of suspended solids and is representative of the geologic formation being sampled. Development should be documented on the Well Development Log Form and in the site and field logbooks. Upgradient and background wells should be developed before downgradient wells to reduce the risk of cross-contamination.

Measurement of water levels should be done according to the sampling plan, which may specify measurement prior to purging or prior to sampling. Groundwater level should be measured to a specific accuracy (such as 0.05 ft) and with a specific precision (such as 0.01 ft). Measurements should be made relative to a known, surveyed datum. Measurements are taken with a steel tape or an electronic device such as a manometer or acoustical sounder. Some wells have a pressure transducer installed so water levels can be obtained more easily.
Figure 53 - Multi-parameter field meter (Courtesy of Geotech Environmental Equipment)
Some wells contain immiscible fluid layers, called non-aqueous phase liquids (NAPLs). There can be up to three layers, which include the water and lighter and heavier fluids. The lighter fluids, called light non-aqueous phase liquids (LNAPLs) or floaters, accumulate above the water. The heavier fluids, called dense non-aqueous phase liquids (DNAPLs) or sinkers, accumulate below the water. For example, LNAPLs like gasoline float on water, while DNAPLs such as chlorinated hydrocarbons (TCE, TCA, and PCE) sink. NAPLs can have their own flow regime in the subsurface separate from the groundwater. The amount of these fluids should be measured separately, and the fluids collected, prior to purging. For information on measurement of DNAPL, see Sara (1994, p. 10-75).

Purging is done to remove stagnant water in the casing and filter pack so that the water sampled is “fresh” formation water. A certain number of water column volumes (such as three) are purged, and temperature, pH, and conductivity must be monitored during purging to ensure that these parameters have stabilized prior to sampling (a small sketch of such a check appears below). Upgradient and background wells should be purged and sampled before downgradient wells to reduce the risk of cross-contamination. Information concerning well purging should be documented in the Field Sampling Log. Sampling should be done within a specific time period (such as three hours) of purging if recharge is sufficient, otherwise as soon as recharge allows.

The construction materials of the sampling equipment should be compatible with known and suspected contaminants. Groundwater sampling is done using various types of pumps, including bladder, gear, submersible rotor, centrifugal, suction, or inertial lift pumps, or with a bailer. Pumping is usually preferred over bailing because it takes less effort and causes less disturbance in the wellbore. An example of a submersible pump is shown in Figure 52.

Field measurements should be taken at the time of sampling. These measurements, such as temperature, pH, and specific conductance, should be taken before and after the sample is collected to check on the stability of the water during sampling. Figure 53 shows a field meter for taking these measurements. The field data (also known as the field parameters) is entered on the COC, and should be entered into the EDMS along with the laboratory analysis data. Sometimes the laboratory will enter this data for the client and deliver it with the electronic data deliverable.
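As an illustration of the stabilization check mentioned above, the following Python sketch tests whether the most recent field readings agree within a tolerance. The tolerances, window, and readings are placeholders, not values from any regulation; the project work plan always governs.

def has_stabilized(readings, tolerance, window=3):
    """True if the last `window` readings vary by no more than `tolerance`.
    A stand-in for the stabilization criteria a project work plan would define."""
    if len(readings) < window:
        return False
    recent = readings[-window:]
    return max(recent) - min(recent) <= tolerance

# Successive readings taken during purging (illustrative values only).
ph_readings = [7.9, 7.4, 7.2, 7.15, 7.14]
conductivity_us_cm = [640, 555, 530, 527, 526]

ready = (has_stabilized(ph_readings, tolerance=0.1)
         and has_stabilized(conductivity_us_cm, tolerance=10))
print("OK to sample" if ready else "Keep purging")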
At all stages of the sampling process it is critical that contamination of the samples be prevented. Contamination can be minimized by properly storing and transporting sampling equipment, keeping equipment and bottles away from sources of contamination, using clean hands and gloves to handle equipment and bottles, and carefully cleaning the purging and sampling equipment after use. If sampling is for VOAs (volatile organic analysis), then equipment or processes that can agitate and potentially volatilize samples should be avoided. Sampling methods such as bottom-filling bailers of stainless steel or Teflon and/or Teflon bladder pumps should be used.

Powell and Puls (1997) have expressed a concern that traditional groundwater sampling techniques, which are largely based on methods developed for water supply investigations, may not correctly represent the true values or extent of a plume. For example, the turbidity of a sample is often related to the concentration of constituents measured in the sample, and sometimes this may be due to sampling methods that cause turbulence during sampling, resulting in high concentrations not representative of in-situ conditions. Filtering the sample can help with this, but a better approach may be to use sampling techniques that cause less disturbance of materials in the wellbore. Small diameter wells, short screened intervals, careful device insertion (or the use of permanently installed devices), and low pump rates (also known as low-flow sampling) are examples of techniques that may lead to more representative samples.

Preservation and handling of the samples is critical for obtaining reliable analytical results. Groundwater samples are usually treated with a preservative such as nitric, sulfuric, or hydrochloric acid or sodium hydroxide (depending on the parameter) to stabilize the analytes, and then cooled (typically to 4°C) and shipped to the laboratory. The shipping method is important because the analyses must be performed within a certain period (holding time) after sampling. The preservation and shipping process varies for different groups of analytes. See Appendix D for more information about this.
Groundwater data issues

The sample is taken and often some parameters are measured in the field such as temperature, pH, and turbidity. Then the sample is sent to the laboratory for analysis. When the field and laboratory data are sent to the data administrator, the software should help the data administrator tie the field data to the laboratory data for each sampling event. Key data fields for groundwater data include the site and station name, the sample date and perhaps time, COC and other field sample identification numbers, how the sample was taken and who took it, transportation information, and any sample QC information such as duplicate and other designations. All of this data should be entered into the EDMS.
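A minimal sketch of how software might tie field measurements to laboratory results for the same sampling event, here by matching on station and sample date using pandas; the column names and values are invented for illustration.

import pandas as pd

field = pd.DataFrame({
    "station": ["MW-01", "MW-02"],
    "sample_date": ["2001-09-12", "2001-09-12"],
    "field_ph": [6.8, 7.3],
    "temperature_c": [14.2, 13.9],
})

lab = pd.DataFrame({
    "station": ["MW-01", "MW-01", "MW-02"],
    "sample_date": ["2001-09-12", "2001-09-12", "2001-09-12"],
    "parameter": ["Benzene", "Lead", "Benzene"],
    "value": [0.012, 0.003, 0.002],
    "units": ["mg/L", "mg/L", "mg/L"],
})

# One row per laboratory result, carrying the field parameters for that event.
events = lab.merge(field, on=["station", "sample_date"], how="left")
print(events)

In a real EDMS the join key would more likely be a sample or event identifier than a station and date pair, since more than one sample can be taken at a station on the same day.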
SURFACE WATER

Surface water samples have no purging requirements, but are otherwise sampled and transferred the same as groundwater samples. Surface water samples may be easier to acquire, so they may be taken more often than groundwater samples.
Surface water sampling issues

Surface water samples can be taken either at a specific map location, or at an appropriate location and depth. The location of the sample should be identified on a site map and described in the field logbook. Samples should progress from areas of least contamination to worst contamination and generally from upstream to downstream. The sample container should be submerged with the mouth facing upstream (to prevent bubbles in the sample), and sample
information should be recorded in the field logbook and any other appropriate forms. The devices used and the process for verifying depth and depth control tolerance should be as specified in the project work plan.
Surface water data issues

The data requirements for surface water samples are similar to groundwater samples. For samples taken in tidal areas, the status of the tide (high or low) should be noted.
DECONTAMINATION OF EQUIPMENT

Equipment must be decontaminated prior to use and re-use. The standard operating procedure for decontamination should be in the project work plan. The decontamination process is usually different for different equipment. The following are examples of equipment decontamination procedures (DOE/HWP 1990a). For nonsampling equipment such as rigs, backhoes, augers, drill pipe, casing, and screen, decontaminate with high pressure steam, and if necessary scrub with laboratory-grade detergent and rinse with tap water. For sampling equipment used in inorganic sampling, scrub with laboratory-grade detergent, rinse with tap water, rinse with ASTM Type II water, air-dry, and cover with plastic sheeting. For sampling equipment used in inorganic or organic sampling, scrub with laboratory-grade detergent, rinse with tap water, rinse with ASTM Type II water, rinse with methanol (followed by a hexane rinse if testing for pesticides, PCBs, or fuels), air-dry, and wrap with aluminum foil.
SHIPPING OF SAMPLES

Samples should be shipped in insulated carriers with either freezer forms (“blue ice”) or wet ice. If wet ice is used, it should be placed in leak-proof plastic bags. Shipping containers should be secured with nylon reinforced strapping tape. Custody seals should be placed on the containers to verify that samples are not disturbed during transport. Shipping should be via overnight express within 24 hours of collection so the laboratory can meet holding time requirements.
AIR

The data management requirements for air sampling are somewhat different from those of soil and water because the sampling process is quite different. While both types of data can (and often should) be stored in the same database system, different data fields may be used, or the same fields used differently, for the different types of data. As an example, soil samples will have a depth below ground, while air samples may have an elevation above ground. Typical air quality parameters include sulfur dioxide, nitrogen dioxide, carbon monoxide, ozone, and lead. Other constituents of concern can be measured in specific situations. Sources of air pollution include transportation, stationary fuel combustion, industrial processes, solid waste disposal, and others. For an overview of air sampling and analysis, see Patnaik (1997). For details on several air sampling methods, see ASTM (1997).
Air sampling issues

Concentrations of contaminants in air vary greatly from time to time due to weather conditions, topography, and changes in source input. Weather conditions that may be important
include wind conditions, temperature, humidity, barometric pressure, and amount of solar radiation. The taking and analysis of air samples vary widely depending on project requirements. Air samples can represent either outdoor air or indoor air, and can be acquired by a variety of means, both manual and automated. Some samples represent a specific volume of air, while others represent a large volume of air passed through a filter or similar device. Some air measurements are taken by capturing the air in a container for analysis, while others are done without taking samples, using real-time sensors. In all cases, a sampling plan should be established and followed carefully.

Physical samples can be taken in a Tedlar bag, metal (Summa) canister, or glass bulb. The air may be concentrated using a cryogenic trap, or compressed using a pump. For organic analysis, adsorbent tubes may be used. Adsorbent materials typically used include activated charcoal, Tenax (a porous polymer), or silica gel. Particulate matter such as dust, silica, metal, and carbon particles is collected using membrane filters. A measured volume of air is pumped through the filter, and the suspended particles are deposited on the filter. For water-soluble analytes such as acid, alkali, and some organic vapors, samples can be taken using an impinger, where the air is bubbled through water, and then the water is analyzed. Toxic gases and vapors can be measured using colorimetric dosimeters, where a tube containing a paper strip or granulated material is exposed to the air on one end, and the gas diffuses into the tube and causes the material to change color. The amount of color change reflects the concentration of the constituent being measured. Automated samples can be taken using ambient air analyzers. Care should be taken that the air being analyzed is representative of the area under investigation, and standards should be used to calibrate the analyzer.
Air data issues

Often air samples are taken at relatively short time intervals, sometimes as short as minutes apart. This results in a large amount of data to store and manipulate, and an increased focus on time information rather than just date information. It also increases the importance of data aggregation and summarization features in the EDMS so that the large volume of data can be presented in an informative way. Key data fields for air data include the site and station name, the sample date and time (or, for a sample composited over time, the start and end dates and times), how the sample was taken and who took it, transportation information if any, and any sample QC information such as duplicate and other designations.
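One way the aggregation described above might look in practice is this pandas sketch, which rolls simulated one-minute readings up to hourly summaries; the parameter, interval, and values are invented for illustration.

import numpy as np
import pandas as pd

# Simulated one-minute ozone readings for a single station (illustrative only).
times = pd.date_range("2001-07-01 00:00", periods=240, freq="min")
readings = pd.DataFrame({
    "timestamp": times,
    "ozone_ppb": 40 + 5 * np.sin(np.arange(240) / 30.0),
})

# Hourly mean, maximum, and count of readings, a typical summary for reporting.
hourly = (readings.set_index("timestamp")["ozone_ppb"]
          .resample("1h")
          .agg(["mean", "max", "count"]))
print(hourly)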
OTHER MEDIA

The variety of media that can be analyzed to gather environmental information is almost unlimited. This section covers just a few of the possible media. There certainly are many others routinely being analyzed, and more being added over time.
Human tissue and fluids

Exposure to toxic materials can result in the buildup of these materials in the body. Tracking this exposure involves measuring the concentration of these materials in tissue and body fluids. For example, hair samples can provide a recent history of exposure, and blood and urine analyses are widely used to track exposure to metals such as lead, arsenic, and cadmium. Often this type of data is gathered under patient confidentiality rules, and maintaining this confidentiality must be considered in implementing and operating the system for managing the data. Lead exposure in
children (and pregnant and nursing mothers) is of special interest since it appears to be correlated with developmental problems, and monitoring and remediating elevated blood lead is receiving much attention in some communities. The data management system should be capable of managing both the blood lead data and the residential environmental data (soil, paint, water, and dust) for the children. It should also be capable of relating the two even if the blood data is within the patient confidentiality umbrella and the residential environmental data is not.
Organisms

Because each level of the food chain can concentrate pollutants by one or more orders of magnitude, the concentration of toxins in biologic material can be a key indicator of those toxins in the environment. In addition, some organisms themselves can pose a health hazard. Both kinds of information might need to be stored in a database. Sampling procedures vary depending on the size of the organisms and whether they are benthic (attached to the bottom), planktonic (move by floating), or nektonic (move by swimming). For more information, see ASTM (1997).
Residential and workplace media

Increasingly, the environmental quality in homes and offices is becoming a concern. From toxic materials, to “sick building syndrome” and infectious diseases like anthrax and Legionnaires’ disease, the quality of the indoor environment is coming into question, and it is logical to track this information in a database. Other environmental issues relate to exposure to toxic materials such as lead. Lead can occur in paint, dust, plumbing, soil, and in household accessories, including such common objects as plastic mini-blinds from the local discount store. Concentration information can be stored in a database, and the EDMS can be used to correlate human exposure information with residential or workplace media information to assist with source identification and remediation.
Plant operating and other data

Often information on plant operations is important in management of the environmental issues at a facility. The relationship between releases and production volume, or chemical composition vs. physical properties, can be best investigated if some plant operating information is captured and stored in the EDMS. One issue to keep in mind is that the volume of this information can be great, and care should be taken to prevent degradation of the performance of the system due to the potentially large volume of this data. Sometimes it makes sense to store true operating data in one database and environmental data in another. However, due to the potential overlap of retrieval requirements, a combined database or duplicated storage for the data with dual uses is sometimes necessary.

Also, the reporting of plant operating data and its relationship to environmental factors often involves deriving results through calculations. For example, what is measured may be the volume of effluent and the concentration of some material in the effluent, but perhaps the operating permit sets limits on the total amount of material. The reporting process needs to multiply the volume times the concentration to get the amount, and perhaps scale that to different units. Figure 54 shows a screen for defining calculated parameters.
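The calculation just described can be illustrated with a small Python sketch; the 8.34 factor is the standard pounds-per-gallon-per-mg/L conversion used in water and wastewater work, while the function name and example numbers are assumptions.

def pollutant_load_lbs_per_day(concentration_mg_per_l, flow_mgd):
    """Mass discharge in pounds per day from effluent concentration (mg/L)
    and flow (million gallons per day), using the standard 8.34 factor."""
    return concentration_mg_per_l * flow_mgd * 8.34

# Illustrative numbers: 2.5 mg/L of a constituent in 1.2 MGD of effluent.
print(pollutant_load_lbs_per_day(2.5, 1.2))   # about 25 pounds per day

In an EDMS this kind of derived parameter is typically defined once and computed at reporting time, which is the idea behind the screen shown in Figure 54.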
OVERVIEW OF PARAMETERS

The EDMS is used in environmental investigation and remediation projects to track and understand the amount and location of hazardous materials in the environment. In addition, other parameters may be tracked to help better understand the amount and distribution of the contaminants.
Figure 54 - Screen for defining calculated parameters
This section briefly covers some of the environmental and related parameters that are likely to be managed in an EDMS. Other sources cover this material in much more detail. Useful reference books include Manahan (2000, 2001), Patnaik (1997), and Weiner (2000). Many Web sites include reference information on parameters. Examples include EPA (2000a), EXTOXNET (2001), Spectrum Labs (2001), NCDWQ (2001), SKC (2001), and Cambridge Software (2001). A Web search will turn up many more. This section covers inorganic parameters, organic parameters, and various other parameters commonly found in an EDMS.
Inorganic parameters

Inorganic compounds include common metals, heavy metals, nutrients, inorganic nonmetals, and radiologic parameters.

Common metals include calcium, iron, magnesium, potassium, sodium, and others. These metals are generally not toxic, but can cause a variety of water quality problems if present in large quantities.

Heavy metals include arsenic, cadmium, chromium, lead, mercury, selenium, sulfur, and several others. These metals vary significantly in their toxicity. For example, arsenic is quite poisonous, but sulfur is not. Lead is toxic in large amounts, and in much lower amounts is thought to cause developmental problems in small children. The toxicity of some metals depends on their chemical state. Mercury is much more toxic in organic compounds than as a pure metal. Many of us have chased little balls of mercury around after a thermometer broke, and suffered no ill effects, while some organic mercury compounds are so toxic that one small drop on the skin can cause almost instantaneous death. Hexavalent chromium (Cr6+) is extremely toxic, while trivalent chromium (Cr3+) is much less so. (See the movie Erin Brockovich.)

Nutrients include nitrogen and phosphorus (and the metal potassium). Nitrogen is present as nitrate (NO3-) and nitrite (NO2-). Nitrates can cause problems with drinking water, and phosphorus can pollute surface waters. Inorganic nonmetals include ammonia, bicarbonate, carbonate, chloride, cyanide, fluoride, iodide, nitrite, nitrate, phosphate, sulfate, and sulfide. With the exception of cyanide, these are not particularly toxic, but can contribute to low water quality if present in sufficient quantities.

Asbestos is an inorganic pollutant that is somewhat different from the others. Most toxic substances are toxic due to their chemical activity. Asbestos is toxic, at least in air, because of its physical properties. The small fibers of asbestos (a silicate mineral) can cause cancer when breathed into the lungs. It has not been established whether it is toxic in drinking water.
Radiologic parameters such as plutonium, radium, thorium, and uranium consist of both natural materials and man-made products. These materials were produced in large quantities for use in weapons and nuclear reactors, and many other uses (for example, lantern mantles and smoke detectors). They can cause health hazards through either chemical or radiologic exposure. Some radioactive materials, such as plutonium, are extremely toxic, while others such as uranium are less so. High levels of radioactivity (alpha and beta particles and gamma rays) can cause acute health problems, while long exposure to lower levels can lead to cancer. Radiologic parameters are differentiated by isotope number, which varies for the same element depending on the number of neutrons in the nucleus of each atom. For example, radium-224 and radium-226 have atomic weights of 224 and 226, respectively, but are the same element and participate in chemical reactions the same way. Different isotopes have different levels of radioactivity and different half-lives (how long it takes half of the material to undergo radioactive decay), so they are often tracked separately.

Inorganic pollutants in air include gaseous oxides such as carbon dioxide, sulfur dioxide, and the oxides of nitrogen, which cause acid rain and may contribute to atmospheric warming (the greenhouse effect). Chlorine atoms in the atmosphere can damage the ozone layer, which protects us from harmful ultraviolet radiation from the sun. Particulate matter is also significant, some of which, for example, sea salt, is natural, but much of which is man-made. Colloidal-sized particles formed by physical processes (dispersion aerosols) or chemical processes (condensation aerosols) can cause smog and health problems if inhaled.
Organic parameters

Organic compounds are compounds that contain carbon, usually with hydrogen and often with oxygen. Organics may contain other atoms as well, such as halides, nitrogen, and phosphorus. They are usually segregated into volatile organic compounds (VOCs) and semivolatile organic compounds (SVOCs). Hydrocarbons, chlorinated hydrocarbons, pesticides, and herbicides are also organic compounds.

The delineation between volatiles and semivolatiles is not as easy as it sounds. SW-846, the guidance document from EPA for analytical methods (EPA, 1980), describes volatile compounds as “compounds which have boiling points below 200°C and that are insoluble or slightly soluble in water.” Other references describe volatiles as those compounds that can be adequately analyzed by a purge and trap procedure. Unfortunately, semivolatiles are described altogether differently. SW-846 describes semivolatiles in its procedures as “neutral, acidic and basic organic compounds that are soluble in methylene chloride and capable of being eluted without derivatization.” No mention is made of the boiling points of semivolatile compounds, although it’s probably implicit that their boiling points are higher than volatile compounds (Roy Widmann, pers. comm., 2001).

VOCs are organic compounds with a high vapor pressure, meaning that they evaporate easily. Examples include benzene, toluene, ethylbenzene, and xylenes (collectively referred to as BTEX), acetone, carbon tetrachloride, chloroform, ethylene glycol, and various alcohols. Many of these compounds are used as industrial solvents and cleaning fluids.

SVOCs are organic compounds with a low vapor pressure, so they resist evaporation. They also have a higher boiling point than VOCs, greater than 200°C. Examples include anthracene, dibenzofuran, fluorene (not to be confused with the halogen fluorine), pentachlorophenol (PCP), phenol, polycyclic aromatic compounds (PAHs), polychlorinated biphenyls (PCBs, Aroclor), and pyrene. Some of these substances are used in manufacture of a wide variety of materials such as plastics and medicine. Others are degradation products resulting from exposure of other organics to the environment.

Halogenated compounds are organic compounds that have one or more of the hydrogens replaced with a halide like fluorine, chlorine, or bromine. For example, 1,2-dibromoethane has the first and second hydrogen replaced by bromine. One category of halogenated SVOCs, the polychlorinated biphenyls, was widely used in industry in applications such as cooling of
transformers until banned by TSCA in 1976. They have high chemical, thermal, and biological stability, which makes them very persistent in the environment.

Hydrocarbons consist of crude oil and various refined products. They have a wide range of physical and chemical properties, and are widely dispersed in the environment. In some situations, hydrocarbons are exempt from hazardous materials regulation. For example, crude oil is not currently considered hazardous if spilled at the wellhead, but it is if spilled during transportation.

Chlorinated hydrocarbons can pose a significant health risk by contaminating drinking water (Cheremisinoff, 2001). Three of these substances, trichloroethylene (TCE), trichloroethane (TCA), and tetrachloroethylene (perchloroethylene or PCE), are widely used industrial solvents, and are highly soluble in water, so a small quantity of material can contaminate a large volume of groundwater.

Pesticides (insecticides) are widely distributed in the environment due to agricultural and other uses, especially since World War II. Some pesticides such as nicotine, rotenone, and pyrethrins are naturally occurring substances, are biodegradable, and pose little pollution risk. Organochlorine insecticides such as DDT, dieldrin, and endrin were widely used in the 1960s, but are for the most part banned due to their toxicity and persistence in the food chain. These have been largely replaced by organophosphates such as malathion, and carbamates such as carbaryl and carbofuran.

Herbicides are also widely used in agriculture, and too often show up in drinking water. There are several groups of herbicides, including bipyridilium compounds (diquat and paraquat), heterocyclic nitrogen compounds (atrazine and metribuzin), chlorophenoxyls (2,4-D and 2,4,5-T), substituted amides (propanil and alachlor), nitroanilines (trifluralin), and others. By-products from the manufacture of pesticides and herbicides and the degradation products of these materials are also significant problems in the environment.

Organic pollutants in the air can be a significant problem, including direct effects such as cancer caused by inhalation of vinyl chloride, or the formation of secondary pollutants such as photochemical smog.
Other parameters

There are a number of other parameters that may be tracked in an EDMS. Some are pollutants, while others describe physical and chemical parameters that help better understand the site geology, chemistry, or engineering. Examples include biologic parameters, field parameters, geophysical measurements, operating parameters, and miscellaneous other parameters.

Biologic parameters (also called microbes, or pathogens if they cause disease) include fungi, protozoa, bacteria, and viruses. Pathogens such as Cryptosporidium parvum and Giardia cause a significant percentage (more than half) of waterborne disease. However, not all microbes are bad. For example, bacteria such as Micrococcus, Pseudomonas, Mycobacterium, and Nocardia can degrade hydrocarbons in the environment, both naturally and with human assistance, as a site cleanup method.

Field parameters fall into two categories. The first is parameters measured at the time that samples, such as water samples, are taken. These include pH, conductivity, turbidity, groundwater elevation, and presence and thickness of sinkers and floaters. In some cases, such as field pH, multiple observations may be taken for each sample, and this must be taken into consideration in the design of the database system and reporting formats. The other category of field parameters is items measured or observed without taking a sample. A variety of chemical and other measurements can be taken in the field, especially for air monitoring, and increasingly for groundwater monitoring, as sensors continue to improve.

Groundwater elevation is a special type of field observation. It can be observed with or without sampling, and obtaining accurate and consistent water level elevations can be very important in managing a groundwater project. This is because many other parameters react in
various ways to the level of the water table. Issues such as the time of year, amount of recent precipitation, tidal influences, and many other factors can influence groundwater elevation. The EDMS should contain sufficient field parameter information to assist with interpretation of the groundwater elevation data.

Geophysical measurements are generally used for site characterization, and can be done either on the surface or in a borehole. For some projects this data may be stored in the EDMS, while for others it is stored outside the database system in specialized geophysical software. Examples of surface geophysical measurements include gravity, resistivity, spontaneous potential (SP), and magnetotellurics. Borehole geophysical measurements include SP, resistivity, density, sonic velocity, and radiologic parameters such as gamma ray and neutron surveys.

Operating parameters describe various aspects of facility operation that might have an influence on environmental issues. Options for storage of this data are discussed in a previous section, and the parameters include production volume, fluid levels, flow rates, and so on. In some cases it is important to track this information along with the chemical data because permits and other regulatory issues may correlate pollutant discharges to production volumes. Managing operating parameters in the EDMS may require that the system be able to display calculated parameters, such as calculating the volume of pollutant discharge by multiplying the concentration times the effluent volume, as shown in Figure 54.

Miscellaneous other parameters include the RCRA characterization parameters (corrosivity, ignitability, reactivity, and toxicity) as well as other parameters that might be measured in the field or lab such as color, odor, total dissolved solids (TDS), total organic carbon (TOC), and total suspended solids (TSS). In addition, there are many other parameters that might be of importance for a specific project, and any of these could be stored in the EDMS. The design of the EDMS must be flexible enough to handle any parameter necessary for the various projects on which the software is used.
CHAPTER 12 ENVIRONMENTAL LABORATORY ANALYSIS
Once the samples have been gathered, they are usually sent to the laboratory for analysis. How well the laboratory does its work, from sample intake and tracking through the analysis and reporting process, has a major impact on the quality of the resulting data. This chapter discusses the procedures carried out by the laboratory, and some of the major laboratory analytical techniques. A basic understanding of this information can be useful in working effectively with the data that the laboratory generates. The laboratory business is a tough business. In the current market, the amount of analytical work is decreasing, meaning that the laboratories are having to compete more aggressively for what work there is. They have cut their profit margins to the bone, but are still expected to promptly provide quality results. Unfortunately this has caused some laboratories to close, and others to cut corners to the point of criminal activities to make ends meet. Project managers must be ever vigilant to make sure that the laboratories are doing their job with an adequate level of quality. The good news is that there are still many laboratories performing quality work.
LABORATORY WORKFLOW

Because laboratories process a large volume of samples and must maintain a high level of quality, the processes and procedures in the laboratory must be well defined and rigorously followed. Each step must be documented in a permanent record. Steps in the process include sample intake, sample tracking, preparation, analysis, reporting, and quality control.

Sample intake – The laboratory receives the samples from the shipper and updates the chain of custody. The client is notified that the samples have arrived, the shipping container temperature is noted, and the samples are scheduled for analysis.

Sample tracking – It is critical that samples and results be tracked throughout the processes performed in the laboratory. Usually the laboratory uses specialized software called a laboratory information management system or LIMS to assist with sample tracking and reporting.

Sample preparation – For most types of samples and analyses, the samples must be prepared before they can be analyzed. This is discussed in the following section.

Analysis – Hundreds of different analytical techniques are available for analyzing different parameters in various media. The most common techniques are described below.

Reporting – After the analyses have been performed, the results must be output in a format suitable for use by the client and others. An electronic file created by the laboratory for delivery to
the client is called an electronic data deliverable (EDD). Creation of the EDD should be done using the LIMS, but too often there is a data reformatting step, or even worse a manual transcription process, to get some or all of the data into the EDD, and these manipulation steps can be prone to errors (a small sketch of an EDD check appears at the end of this list).

Quality control – The need to maintain quality permeates the laboratory process. Laboratory quality control is discussed below, and aspects of it are also discussed in Chapter 15. Laboratory QC procedures are determined by the level required for each site. For example, Superfund projects generally require the highest level of QC and the most extensive documentation.
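Because these reformatting and transcription steps are error-prone, the receiving EDMS often checks an EDD before loading it. The following is a hypothetical Python sketch of such a check; the required columns and valid units are placeholders, since real EDD formats vary by laboratory and project.

import csv

REQUIRED_COLUMNS = {"station", "sample_date", "parameter", "value", "units"}
VALID_UNITS = {"mg/L", "ug/L", "mg/kg", "ug/kg"}   # placeholder list only

def check_edd(path):
    """Return a list of problems found in a comma-delimited EDD file."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return ["missing columns: " + ", ".join(sorted(missing))]
        for line_no, row in enumerate(reader, start=2):   # line 1 is the header
            if row["units"] not in VALID_UNITS:
                problems.append(f"line {line_no}: unexpected units {row['units']!r}")
            try:
                float(row["value"])
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {row['value']!r}")
    return problems

# Usage: print(check_edd("lab_results.csv")) -- an empty list means no problems found.
# A real check would also handle non-detect qualifiers, dates, and duplicate rows.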
SAMPLE PREPARATION

Most samples must be prepared before they can be analyzed. The preparation process varies depending on the sample matrix, the material to be analyzed, and the analytical method. The most important processes include extraction and cleanup, digestion, leaching, dilution, and filtering. Depending on the sample matrix, other procedures such as grinding and chemical manipulations may be required.

Extraction and cleanup – Organic analytes are extracted to bring them into the appropriate solvent prior to analysis (Patnaik, 1997). The extraction method varies depending on whether the sample is liquid or solid. Extraction techniques for aqueous samples include liquid-liquid (separatory funnel or continuous) and solid-phase. For solid samples, the methods include Soxhlet, supercritical fluid, and sonication. Some extraction processes are repeated multiple times (such as three) to improve the efficiency of extraction. Samples may undergo a cleanup process to improve the analysis process and generate more reliable results. Cleanup methods include acid-base partitioning, alumina column, silica gel, Florisil, gel-permeation, sulfur, and permanganate-sulfuric acid.

Digestion – Samples analyzed for metals are usually digested. The digestion process uses strong acids and heat to increase the precision and accuracy of the measurement by providing a homogeneous solution for analysis, by removing metals adsorbed to particles, and by breaking down metal complexes. Different digestion techniques are used depending on the analytical method and target accuracy levels.

Leaching – Sometimes, in addition to the concentration of a toxic substance in a sample, the mobility of the substance is also of concern, especially for material headed for disposal in a landfill. A leach test is used to determine this. Techniques used for this are the toxicity characterization leaching procedure (TCLP), the synthetic precipitate leaching procedure (SPLP), and the EP toxicity test (EPTOX). In all three methods, fluids are passed through the solid material such as soil, and the quantity of the toxic substance leached by the fluid is measured. TCLP uses a highly buffered and mildly acidic aqueous fluid. In SPLP the fluid is slightly more acidic, varies by geographic area (east or west of the Mississippi River), and is intended to more accurately represent the properties of groundwater. EPTOX takes longer, tests for fewer parameters, and is no longer widely used. The concentration of an analyte after leaching is not comparable to the total concentration in the sample, so leached analyses should be marked as such in the EDMS.

Dilution – Sometimes it is necessary to dilute the sample prior to analysis. Reasons for this include that the concentration of the analyte may be outside the concentration range where the analytical technique is linear, or other substances in the sample may interfere with the analysis (matrix interference). A record of the dilution factor should be kept with the result. Dilution affects the result itself as well as the detection limit for the result (Sara 1994, p. 11-11). For non-detected results, the reported result based on the detection limit will be increased proportionately to the dilution, and this needs to be considered in interpreting the results (see the sketch at the end of this section).

Filtering – The sample may or may not be filtered, either in the field or in the laboratory.
If the sample is not filtered, the resulting measurement is referred to as a total measurement, while if it is filtered, it is considered a dissolved result. For filtered samples, the size of the openings in the
Environmental Laboratory Analysis
141
filter (such as 1 micron) should be included with the result. Commonly, once a sample has been filtered it is preserved. This information should also be noted with the result.
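To make the dilution adjustment concrete, the following is a minimal sketch, using hypothetical values rather than any particular method's limits, of how a dilution factor raises the reporting limit for a non-detected result:

# Sketch: how a dilution factor raises the reported limit for a non-detect.
# The detection limit and dilution factors below are hypothetical.
def diluted_reporting_limit(method_detection_limit: float, dilution_factor: float) -> float:
    """A non-detect on a diluted sample can only be bounded at the
    detection limit multiplied by the dilution factor."""
    return method_detection_limit * dilution_factor

mdl = 0.005  # mg/L, hypothetical method detection limit
for df in (1, 10, 50):
    print(f"dilution {df:>2}x -> non-detect reported as < {diluted_reporting_limit(mdl, df):.3g} mg/L")

A result that would have been reported as "< 0.005 mg/L" undiluted becomes "< 0.25 mg/L" at a 50-fold dilution, which matters when the result is compared against a regulatory limit.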
ANALYTICAL METHODS Laboratories use many different methods to analyze for different constituents. Most of the guidance on this comes from the EPA. The main reference for analytical methods in the environmental field is the EPA’s SW-846 (EPA, 1980). SW-846 is the official compendium of analytical and sampling methods that have been evaluated and approved for use in complying with the RCRA regulations. It was first issued in 1980, and is updated regularly. The tables in Appendix D show the recommended analytical methods for various parameters. This section provides a general description of the methods themselves, starting with methods used mostly for inorganic constituents, followed by methods for organic analysis. Additional information on analytical methods can be found in many sources, including EPA (1980, 2000a), Extoxnet (2001), Spectrum Labs (2001), SKC (2001), Cambridge Software (2001), NCDWQ (2001), Scorecard.org (Environmental Defense, 2001), Manahan (2000, 2001), Patnaik (1997), and Weiner (2000).
Inorganic methods
Many different methods are used for analysis of inorganic constituents. The most common include titration, colorimetric, atomic absorption and emission spectrometry, ion-selective electrodes, ion chromatography, transmission electron microscopy, gravimetry, nephelometric, and radiochemical methods.
Titration is one of the oldest and most commonly used of the wet chemistry techniques. It is used to measure hardness, acidity and alkalinity, chemical oxygen demand, non-metals such as chlorine and chloride, iodide, cyanide, nitrogen and ammonia, sulfide and sulfite, and some metals and metal ions such as calcium, magnesium, bromate, and bromide. Titration can be used on wastewater, potable water, and aqueous extracts of soil and other materials. The method works by slowly adding a standard solution of a known concentration to a solution of an unknown concentration until the chemical reaction between the solutions stops, which occurs when the analyte of concern in the second solution has fully reacted. Then the amount of the first solution required to complete the reaction is used to calculate the concentration in the second. The completion of the reaction is monitored using an indicator chemical such as phenolphthalein that changes color when the reaction is complete, or with an electrode and meter. Titration methods commonly used in environmental analyses include acid-base, redox, iodometric, argentometric, and complexometric. Titration is relatively easy and quick to perform, but other techniques often have lower detection limits, so they are more useful for environmental analyses.
Colorimetric methods are also widely used in environmental analysis. Hardness, alkalinity, chemical oxygen demand, cyanide, chloride, fluoride, ammonia, nitrite, nitrogen and ammonia, phosphorus, phosphate and orthophosphate, silica, sulfate and sulfite, phenolics, most metals, and ozone are among the parameters amenable to colorimetric analysis. For the most part, colorimetric methods are fast and inexpensive. Aqueous substances absorb light at specific wavelengths depending on their physical properties. The amount of monochromatic light absorbed is proportional to the concentration, for relatively low concentrations, according to Beer's law. First the analyte is extracted, often into an organic solvent, and then a color-forming reagent is added. Filtered light is passed through the solution, and the amount of light transmitted is measured using a photometer. The result is compared to a calibration curve based on standard solutions to derive the concentration of the analyte.
Atomic absorption spectrometry (AA) is a fast and accurate method for determining the concentration of metals in solution. Aqueous and non-aqueous samples are first digested in nitric acid, sometimes along with other acids, so that all of the metals are in solution as metal nitrate salts. Then a heat source is used to vaporize the sample and convert the metal ions to atoms; light from a monochromatic source is passed through the vapor; and the amount of light absorbed is proportional to the concentration. A photoelectric detector is used to measure the remaining light, and digital processing is used to calculate the concentration. There are several AA methods, including direct (flame), graphite furnace, and platform techniques. Calibration is performed using the method of standard addition, in which various strengths of standard solutions are added to the sample, and the results are used to create a calibration curve. A lower detection limit can be obtained for many metals, such as cadmium, chromium, cobalt, copper, iron, lead, manganese, nickel, silver, and zinc, by using the chelation-extraction method prior to analysis. A chelating agent such as ammonium pyrrolidine dithiocarbamate (APDC) reacts with the metal, and the resulting metal chelate is extracted with methyl isobutyl ketone (MIBK), and then analyzed with AA. Analysis of arsenic and selenium can be enhanced using the hydride generation method, in which the metals in HCl solution are treated with sodium borohydride, then purged with nitrogen or argon and atomized for analysis. Cold vapor is a specialized AA technique for mercury, in which mercury and its salts are converted to mercury nitrate with nitric acid, then reduced to elemental form with stannous chloride. The mercury forms a vapor, which is carried in air into the absorption cell for analysis.
Inductively coupled plasma/atomic emission spectroscopy (ICP/AES) is a technique for simultaneous or sequential multi-element determination of elements in solution, allowing for the analysis of several metals at once. The ICP source is a high-powered radio frequency (RF) generator together with a quartz torch, water-cooled coil, nebulizer, spray chamber, and drain. An argon gas stream is ionized in the RF field, which is inductively coupled to the ionized gas by the coil. The sample is converted to an aerosol in the nebulizer, and injected into the plasma, where the analytes are ionized. The light emitted by the ions is then analyzed with a polychromatic or scanning monochromatic detector, and the results are compared to a curve based on standards to generate concentrations of the target analytes. Inductively coupled plasma/mass spectrometry (ICP/MS) is similar to ICP/AES except that after the analytes have been ionized, their mass spectra are used to identify and quantify the elements.
Ion-selective electrodes are useful for analysis of metals and anions, as well as dissolved gases such as oxygen, carbon dioxide, ammonia, and oxides of nitrogen. A sensing electrode specific to the analyte of interest is immersed in the sample solution, resulting in an electrical potential, which is compared to the potential of a reference electrode using a voltmeter. The voltage is proportional to the concentration of the analyte for which the electrode is designed. Solid samples must be extracted before analysis. Calibration is performed using the standard calibration method, standard addition method, or sample addition method.
Ion chromatography is an inorganic technique that can analyze for multiple parameters sequentially in one procedure.
Many common anions, including nitrate, nitrite, phosphate, sulfide, sulfate, fluoride, chloride, bromide, and iodide can be analyzed, as well as oxyhalides such as perchlorate and hypochlorite, weak organic acids, metal ions, and alkyl amines. The analytes are mixed with an eluent and separated chromatographically in an ion exchanger, and then measured with a conductivity detector based on their retention times and peak areas and heights. The samples are compared to calibration standards to calculate the concentrations. The most common eluent is a mixture of sodium carbonate and sodium bicarbonate, but other eluents such as sodium hydroxide can be used to analyze for different constituents. In some cases an electrical potential is applied to improve sensitivity and detection levels (auto suppression).
Transmission electron microscopy (TEM) is used for analysis of asbestos. The prepared sample is placed in the transmission electron microscope. An electron beam passes through the sample and generates a pattern based on the crystal structure of the sample. This pattern is then examined for the characteristic pattern of asbestos.
Gravimetry can be used for various analytes, including chloride, silica, sulfate, oil and grease, and total dissolved solids. The procedure varies from analyte to analyte, but the general process is to react the analyte of interest with one or more other substances, purify the resulting precipitate, and weigh the remaining material. For oil and grease, the sample is extracted, a known volume of fluid is separated, and the oil component weighed. For TDS, the sample is evaporated and the remaining solid is weighed.
Nephelometric analysis is used to measure turbidity (cloudiness). The sample is placed in a turbidimeter, and the instrument light source illuminates the sample. The intensity of scattered light is measured at right angles (or a range of angles) to the path of the incident light. The system is calibrated with four reference standards and a blank. The scattering of light increases with a greater suspended load. Turbidity is commonly measured in nephelometric turbidity units (NTU), which have replaced the older Jackson turbidity units (JTU).
Radiochemical methods cover a range of different techniques. These include radiochemical methodology (EPA 900 series), alpha spectroscopy (isotopic identification by measurement of alpha particle energy), gamma ray spectrometry (nuclide identification by measurement of gamma ray energy), gross alpha-beta counting (semiquantitative, or quantitative after wet chemical separation), gross alpha by co-precipitation, extraction chromatography, chelating resin, liquid scintillation counting, neutron activation followed by delayed neutron counting, electret ionization chambers, alpha track detectors, the radon emanation technique, and fluorometric methodology.
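Several of the instrumental methods above (colorimetric, AA, ICP, ion chromatography) derive a sample concentration by comparing an instrument response against a calibration curve built from standards. The following is a minimal sketch of that calculation using hypothetical standards and a simple linear least-squares fit; actual laboratory calibration follows the requirements of the specific method.

import numpy as np

# Hypothetical calibration standards: known concentrations (mg/L) and the
# instrument response (e.g., absorbance or peak area) measured for each.
std_conc = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
std_resp = np.array([0.002, 0.051, 0.098, 0.201, 0.497])

# Fit a straight line (response = slope * concentration + intercept),
# the usual working assumption within the linear range of the technique.
slope, intercept = np.polyfit(std_conc, std_resp, 1)

# Convert a sample's measured response back to a concentration.
sample_response = 0.153
sample_conc = (sample_response - intercept) / slope
print(f"estimated concentration: {sample_conc:.3f} mg/L")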
Organic methods Analytical methods for organic constituents include gas chromatography, gas chromatography/ mass spectrometry, high performance liquid chromatography, enzyme immunoassay, and infrared spectroscopy. Gas chromatography (GC) is the most widely used technique for determining organics in environmental samples. Many EPA organic methods employ this technique. The method can be optimized for compounds with different physical and chemical properties by selecting from various different columns and detectors. The sample is first concentrated using a purge and trap technique. Then it is passed through a capillary (or less commonly, packed) column. The capillary column is made of fused silica, glass, or stainless steel of various inside diameters. The smaller the diameter, the higher the resolution, but the smaller the sample size. Also, the longer the column, the higher the resolution. Then one of a number of different types of detectors is used to determine the concentration. General-purpose detectors include flame ionization detectors (FID) and thermal conductivity detectors (TCD). Other detectors are specific to certain analytes. Electron capture detectors (ECD) and Hall electrolyte conductivity detectors (HECD) are specific to halogen compounds. Nitrogen-phosphorus detectors (NPD) can be used for nitrogen-containing organics or organophosphorus compounds, depending on the detection mode used. Flame photometric detectors (FPD) are used for sulfur-containing organics, and less commonly for phosphorus compounds. Photoionization detectors (PID) are used for substances containing the carbon-carbon double bond such as aromatics and olefins. The equipment must be calibrated prior to running samples, using either an external standard or an internal standard. The external standard method uses just the standard solution, while the internal standard method uses one or more standard solutions added to equal volumes of sample extracts and calibration standards. Internal standards are more reliable, but require more effort. Gas chromatography/mass spectrometry (GC/MS) is one of the most versatile techniques for analysis of organics. Examples of GC/MS methods include the EPA methods 624, 8240, and 8260 for volatiles; and 625 and 8270 for semivolatiles. The analysis process is similar for the two, but the sample extraction and concentration is different. The analytical approach is to use chromatography to separate the components of a mixture, then mass spectrometry to identify the compounds. The sample is concentrated using purge and trap or thermal desorption. Next, the
chromatograph column is used as above to separate the compounds. Then the components are eluted from the column and ionized using electron-impact or chemical ionization. Finally, the mass spectra of the molecules are used to identify the compounds, based on their primary and secondary ions and retention times.
High performance liquid chromatography (HPLC) can be used to analyze more than two-thirds of all organic compounds. It is widely used in the chemical industry, but is relatively new to environmental analyses, and just a few EPA methods, such as some methods for pesticides and PAHs, employ it. The chromatograph equipment consists of a constant-flow pump, a high-pressure injection valve, the chromatograph column, a detector, and a chart recorder or digital interface to gather the data. A mobile liquid phase transports the sample through the column, where individual compounds are separated when they are selectively retained on a stationary liquid phase that is bonded to a support. There are several types of detectors, including ultraviolet, fluorescence, conductivity, and electrochemical. Calibration standards at various concentrations are compared against the data for the samples to determine the concentration.
Enzyme immunoassay analysis can be used to screen for various constituents such as pesticides, herbicides, PAHs, PCBs, PCP, nitro-organics, and other compounds. In this method, polyclonal antibodies specific to the desired analyte bind to the analyte in an analysis tube, competing with an analyte-enzyme conjugate also added to the tube, resulting in a color change of the material on the wall of the tube. The color change is inversely proportional to the concentration of the analyte due to competition with the conjugate. The color change can be compared visually to standards for qualitative determination, or analyzed with a spectrometer, which makes the result semiquantitative. Unlike most of the techniques described here, this one is suitable for use in the field as well as in the laboratory.
Infrared spectroscopy (IR) is used to analyze total petroleum hydrocarbons (TPH). The samples are extracted with a fluorocarbon solvent, dried with anhydrous Na2SO4, then analyzed with the IR spectrometer.
RCRA characterization
The Code of Federal Regulations, Title 40, Section 261.20 provides a definition of a waste as hazardous based on four criteria: corrosivity, ignitability, reactivity, and toxicity.
Corrosivity – A waste is considered corrosive if it is aqueous and has a pH less than or equal to 2 or greater than or equal to 12.5 measured with a pH meter, or has a corrosivity to steel of more than 6.35 mm/yr.
Ignitability – A solid waste is ignitable if the flash point is less than 60°C. Liquids are ignitable if their vapors are likely to ignite in the presence of ignition sources.
Reactivity – A waste is reactive if it is normally unstable and readily undergoes violent change, reacts violently with water, or forms potentially explosive mixtures or toxic gases, vapors, or fumes when combined with water. It is also reactive if it is a cyanide- or sulfide-bearing waste that can generate toxic gases, vapors, or fumes when exposed to pH conditions between 2 and 12.5.
Toxicity – A poisonous or hazardous substance is referred to as toxic. The toxicity of a waste is determined by the leaching methods described above. In addition, the EPA lists over 450 listed wastes, which are specific substances or classes of substances known to be hazardous.
Air analysis Air analysis requires different sample preparation and may require different analysis methods from those used for soil and water. The material for analysis is received in sorbent tubes, Tedlar bags, or Summa canisters. Then the samples can be analyzed using specialized equipment, often with a pre-concentration step or using standard analytical techniques such as GC/FID or GC/MS.
Different techniques are used for different analytes such as halogenated organics (GC with ECD), phosphorus and sulfur (FPD), nitrogen (NPD), aromatics and olefins (PID), and light hydrocarbons such as methane, ethane, and ethylene (TCD). Other compounds can be measured with HPLC, IR, UV or visible spectrophotometry, and gravimetry.
OTHER ANALYSIS ISSUES There are a number of other issues related to laboratory analysis that can impact the storage and analysis of the data.
Levels of analysis
The EPA has defined five levels for laboratory analyses. The levels, also called Data Quality Objective (DQO) levels, are based on the type of site being investigated, the level of accuracy and precision required, and the intended use of the data. The following list, based on DOE/HWP (1990b), summarizes the five levels:
Level I – Analysis example: qualitative or semiquantitative analysis; indicator parameters; immediate response in the field. Typical data use: site characterization; monitoring during implementation; field screening.
Level II – Analysis example: semiquantitative or quantitative analysis; compound specific; rapid turnaround in the field. Typical data use: site characterization; evaluation of alternatives; engineering design; monitoring during implementation; field screening.
Level III – Analysis example: quantitative analysis; technically defensible data; sites near populated areas; major sites. Typical data use: risk assessment; site characterization; evaluation of alternatives; engineering design; monitoring during implementation.
Level IV – Analysis example: quantitative analysis; legally defensible data; National Priorities List sites. Typical data use: risk assessment; site characterization; evaluation of alternatives; engineering design.
Level V – Analysis example: qualitative to quantitative analysis; method specific; unique matrices (e.g., pure water, biota, explosives, etc.). Typical data use: risk assessment; evaluation of alternatives; engineering design.
Holding times
Some analytes can degrade (change in concentration) after the sample is taken. For that reason, the analysis must be performed within a certain time period after sampling. This time is referred to as the holding time. Meeting holding time requirements is an important component of laboratory data quality. Some analytes have a holding time from sampling to analysis, while others will have one holding time before extraction and another before analysis. Samples for which holding time requirements are not met should be flagged, and in some cases the station must be re-sampled, depending on project requirements. Holding times for some common analytes are listed in Appendix D.
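A holding-time check is a natural candidate for automation in an EDMS. The following is a minimal sketch under the assumption that sample and analysis dates are stored with each result; the holding times and field names shown are hypothetical, not values from Appendix D.

from datetime import datetime, timedelta

# Hypothetical holding times (days) from sampling to analysis.
HOLDING_TIME_DAYS = {"Mercury": 28, "Nitrate": 2, "VOCs": 14}

def holding_time_exceeded(parameter: str, sample_date: datetime, analysis_date: datetime) -> bool:
    """Flag results analyzed after the allowed holding time has elapsed."""
    limit = HOLDING_TIME_DAYS.get(parameter)
    if limit is None:
        return False  # no holding time defined for this parameter
    return analysis_date - sample_date > timedelta(days=limit)

print(holding_time_exceeded("Nitrate",
                            datetime(2001, 6, 1, 9, 30),
                            datetime(2001, 6, 5, 14, 0)))  # True, so flag the result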
Detection limits Analytical methods cannot analyze infinitely small amounts of target parameters. For each method, analyte, and even each specific sample, the lowest amount that can be detected will vary, and this is called the detection limit. There are actually a number of different detection limits determined in different ways (Core Labs, 1996) that overlap somewhat in meaning. The detection limit (DL) in general means that the concentration is distinctly detectable above the concentration of a blank. The limit of detection (LOD) is the lowest concentration statistically different from the blank. The instrument detection limit (IDL) represents the smallest measurable signal above background noise. The method detection limit (MDL) is the minimum concentration detectable with 99% confidence. The reliable detection limit (RDL) is the lowest level for reliable decisions. The limit of quantitation (LOQ) is the level above which quantitative results have a specified degree of confidence. The reliable quantitation limit (RQL) is the lowest level for quantitative decisions. The practical quantitation limit (PQL) is the lowest level that can be reliably determined within specific limits of precision and accuracy. The contract required quantitation limit (CRQL) or the reporting limit (RL) is the level at which the laboratory routinely reports analytical results, and the contract required detection limit (CRDL) is the detection limit required by the laboratory contract. Some limits, such as MDL, PQL, and RL are used regularly, while other limits are used less frequently. The limits to be used for each project are defined in the project work plan. The laboratory will report one or more detection limits, depending on project requirements. When an analyte is not detected in a sample, there is really no value for that analyte to be reported, and the lab will report the detection limit and provide a flag that the analyte was not detected. Sometimes the detection limit will also be placed in the value field. When you want to use the data for a non-detected analyte, you will need to decide how to display the result, or how to use it numerically for graphing, statistics, mapping, and so on. For reporting, the best thing is usually to display the detection limit and the flag, such as “.01 u,” or the detection limit with a “less than” sign in front of it, such as “< .01.” To use the number numerically, it is common to use one-half of the detection limit, although other multipliers are used in some situations. It’s particularly important for averages and other statistical measures to be aware of how undetected results are handled.
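The two conventions mentioned above, reporting a non-detect as "less than" the detection limit and substituting one-half the detection limit for calculations, can be sketched as follows. The record layout is hypothetical, and the one-half multiplier is the common convention rather than a universal rule.

# Each result carries a value, a detection limit, and a detect flag
# (hypothetical record layout).
results = [
    {"value": 0.12,  "dl": 0.01, "detected": True},
    {"value": None,  "dl": 0.01, "detected": False},   # non-detect
    {"value": 0.034, "dl": 0.01, "detected": True},
]

def display(r):
    # Report non-detects as the detection limit with a "less than" sign.
    return f"{r['value']}" if r["detected"] else f"< {r['dl']}"

def numeric(r, nd_multiplier=0.5):
    # Common convention: use one-half the detection limit for statistics.
    return r["value"] if r["detected"] else r["dl"] * nd_multiplier

print([display(r) for r in results])
mean = sum(numeric(r) for r in results) / len(results)
print(f"mean with 1/2 DL substitution: {mean:.4f}")

Whatever substitution rule is used, it should be documented alongside any statistics derived from data containing non-detects.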
Significant figures
Significant figures, also called significant digits, should reflect the precision of the analytical method used (Sara, 1994, p. 11-13), and should be maintained through data reporting and storage. Unfortunately, some software, especially Microsoft programs such as Access and Excel, doesn't maintain trailing zeros, making it difficult to preserve significant digits. In this case, the number of significant figures or decimal places should be stored in a separate field. If you display fewer digits than were measured, then you are rounding the number. This is obvious if the digits you are dropping are not zero, such as rounding 3.21 to 3.2, but it is also true if you are dropping zeros. Sometimes this is appropriate and sometimes it is not. Samuel Johnson, the English writer of the 1700s, said, "Round numbers are always false."
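The following is a minimal sketch of the approach suggested above: store the significant figures in a separate field and use that field to rebuild the display value, so that a reported "2.50" does not silently become "2.5". The function and field names are illustrative.

from math import floor, log10

def format_sig_figs(value: float, sig_figs: int) -> str:
    """Render a stored numeric value using the reported significant figures,
    restoring trailing zeros that a numeric field would drop."""
    if value == 0:
        return "0"
    decimals = sig_figs - int(floor(log10(abs(value)))) - 1
    return f"{value:.{max(decimals, 0)}f}"

print(format_sig_figs(2.5, 3))     # "2.50" - trailing zero restored
print(format_sig_figs(0.0123, 2))  # "0.012"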
Data qualifiers
When the laboratory encounters a problem with an analysis, it should flag or qualify the data so that users of the data are aware of the problems. Different agencies require different flagging schemes, and these schemes can conflict with one another, which can cause problems for managing the data. Some typical qualifier flags are shown in the following table:

Code  Flag
*     Surrogate outside QC limits
a     Not available
b     Analyte detected in blank and sample
c     Coelute
d     Diluted
e     Exceeds calibration range
f     Calculated from higher dilution
g     Concentration > value reported
h     Result reported elsewhere
i     Insufficient sample
j     Est. value; conc. < quan. limit
l     Less than detection limit
m     Matrix interference
n     Not measured
q     Uncertain value
r     Unusable data
s     Surrogate
t     Trace amount
u     Not detected
v     Detected value
w     Between CRDL/IDL
x     Determined by associated method
z     Unknown
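Because flagging schemes differ between laboratories and agencies, an EDMS often translates incoming qualifiers into one standard project scheme during import. A minimal sketch, with an entirely hypothetical mapping, is shown below; any code that cannot be mapped is kept as delivered and flagged for manual review.

# Hypothetical mapping from one laboratory's qualifier codes to a
# project-standard set; unknown codes are kept and flagged for review.
LAB_TO_STANDARD = {"u": "ND", "j": "EST", "b": "BLK", "d": "DIL"}

def standardize_qualifier(lab_code: str) -> tuple[str, bool]:
    code = LAB_TO_STANDARD.get(lab_code.lower())
    if code is None:
        return lab_code, True   # second value: needs manual review
    return code, False

print(standardize_qualifier("J"))   # ('EST', False)
print(standardize_qualifier("zz"))  # ('zz', True)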
Reporting units
The units of measure or reporting units for each analysis are extremely important, because an analytical result without units is pretty much meaningless. The units reported by laboratories for one parameter can change from time to time or from laboratory to laboratory. There aren't any "standard" units that can be depended upon (with some exceptions, like pH, which is a dimensionless number), so the units must be carried along with the analytical value. The EDMS should provide the capability to convert to consistent units, either during import or preferably during data retrieval, since some output such as graphing and statistics require consistent units. Some types of parameters have specific issues with reporting units. For example, the amount of a radioactive constituent in a sample can be reported in either activity (amount of radioactivity from the substance) or concentration (amount of the substance by weight), and determining the conversion between them involves factors, such as the isotope ratios, which are usually site-specific. Data managers should take great care in preserving reporting units, and if conversions are performed, the conversions should be documented.
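A retrieval-time conversion can be sketched as follows. The example is limited to simple mass-per-volume units with assumed factors; the as-reported value and units stay untouched in the database, and only the retrieved copy is converted.

# Convert mass-per-volume concentrations to a requested unit at retrieval
# time; the stored value and reported units are left as delivered.
TO_MG_PER_L = {"mg/l": 1.0, "ug/l": 0.001, "ng/l": 0.000001}

def convert(value: float, from_unit: str, to_unit: str) -> float:
    f, t = from_unit.lower(), to_unit.lower()
    if f not in TO_MG_PER_L or t not in TO_MG_PER_L:
        raise ValueError(f"no conversion defined from {from_unit} to {to_unit}")
    return value * TO_MG_PER_L[f] / TO_MG_PER_L[t]

print(convert(250.0, "ug/L", "mg/L"))   # 0.25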
Laboratory quality control Laboratories operate under a Laboratory Quality Assurance Plan (LQAP), which they submit to their regulators, usually through the contractor operating the project. This plan includes policies and procedures, some of which are required by regulations, and others are determined by the laboratory. Some of the elements of laboratory quality control (QC) are described in Chapter 15. The laboratories are audited on some regular basis, such as annually, to ensure that they are conforming to the LQAP. Edwards and Mills (2000) have brought up a number of problems with laboratory generation and delivery of electronic data. One issue is systematic problems in the laboratory, such as misunderstanding of analysis criteria, scheduling and resource difficulties, and other internal process problems. Another problem area is that laboratories usually generate two deliverables, a hard copy and an electronic file. The hard copy is usually generated by the LIMS system, while the EDD may be generated using other methods as described above, leading to errors. A third problem area is the lack of standardization in data delivery. No standard EDD format has been accepted across the industry (despite some vendor claims to the contrary), so a lot of effort is expended in
satisfying various custom format descriptions, which can result in poor data quality when the transfer specification has not been accurately met. Finally, lack of universal access to digital data in many organizations can lead to delays in data access and delivery, which can result in project delays and staff frustration. A centralized EDMS with universal data access is the answer to this, but many organizations have not yet implemented this type of system. Sara (1994, p. 11-8) has pointed out that many laboratories have a problem with persistent contamination by low levels of organic parameters such as methylene chloride and acetone. These are common laboratory chemicals that are used in the extraction process, and can show up in analytical results as low background levels. The laboratory might subtract these values from the reported results, or report the values without correction, perhaps with a comment about possible contamination. Users of the data should be made aware of these issues before they try to interpret the data for these parameters.
PART FOUR - MAINTAINING THE DATA
CHAPTER 13 IMPORTING DATA
The most important component of any data management system is the data in it. Manual and automated entry, importing, and careful checking are critical components in ensuring that the data in the system can be trusted, at least to the level of its intended use. For many data management projects, the bulk of the work is finding, organizing, and inputting the data, and then keeping up with importing new data as it comes along. The cost of implementing the technology to store the data should be secondary. The EDMS can be a great time-saver, and should more than pay for its cost in the time saved and greater quality achieved using an organized database system. The time savings and quality improvement will be much greater if the EDMS facilitates efficient data importing and checking.
MANUAL ENTRY
Sometimes there is no way to get data into the system other than transcribing it from hard copy, usually by typing it in. This process is slow and error prone, but if it's the only way, and if the data is important enough to justify it, then it must be done. The challenge is to do the entry cost-effectively while maintaining a sufficient level of data quality.
Historical entry Often the bulk of manual entry is for historical data. Usually this is data in hard-copy files. It can be found in old laboratory reports, reports which have been submitted to regulators, and many other places.
DATA SELECTION - WHAT’S REALLY IMPORTANT? Before embarking on a manual entry project, it is important to place a value on the data to be entered. The importance of the data and the cost to enter it must be balanced. It is not unusual for a data entry project for a large site, where an effort is made to locate and input a comprehensive set of data for the life of the facility, to cost tens or hundreds of thousands of dollars. The decision to proceed should not be taken lightly.
LOCATING AND ORGANIZING DATA The next step, and often the most difficult, is to find the data. This is often complicated by the fact that over time many different people or even different organizations may have worked on the
project, and the data may be scattered across many different locations. It may even be difficult to locate people who know or can find out what happened in the past. It is important to locate as much of this historical data as possible, and then the portion selected as described in the previous section can be inventoried and input. Once the data has been found, it should be inventoried. On small projects this can be done in word processor or spreadsheet files. For larger projects it is appropriate to build a database just to track documents and other items containing the data, or include this information in the EDMS. Either way, a list should be made of all of the data that might be entered. This list should be updated as decisions are made about what data is to be entered, and then updated again as the data is entered and checked. If the data inventory is stored in the EDMS, it should be set up so that after the data is imported it can be tracked back to the original source documents to help answer questions about the origin of the data.
TOOLS TO HELP WITH CORRECT ENTRY There are a number of ways to enter the data, and these options provide various levels of assistance in getting clean data into the system. Entry and review process – Probably the most common approach used in the environmental industry is manual entry followed by visual review. In this process, someone types in the data, then it is printed out in a format similar to the one that was used for import. Then a second person compares every piece of data between the two pieces of paper, and marks any inconsistencies. These are then remedied in the database, and the corrections checked. The end result, if done conscientiously, is reliable data. The process is tedious for those involved, and care should be taken that those doing it keep up their attention to detail, or quality goes down. Often it is best to mix this work with other work, since it is hard to do this accurately for days on end. Some people are better at it than others, and some like it more than others. (Most don’t like it very much.) Double entry – Another approach is to have the data entered twice, by two different people, and then have special software compare the two copies. Data that does not match is then entered again. This technique is not as widely used as the previous one in the environmental industry perhaps because existing EDMS software does not make this easy to do, and maybe also because the human checking in the previous approach sounds more reliable. Scanning and OCR – Hardware and software are widely available to scan hard copy documents into digital format, and then convert it into editable text using optical character recognition (OCR). The tools to do this have improved immensely over the last few years, such that error rates are down to just a few errors per page. Unfortunately, the highest error rates are with older documents and with numbers, both of which are important in historical entry of environmental data. Also, because the formats of old documents are widely variable, it is difficult to fit the data into a database structure after it has been scanned. These problems are most likely to be overcome, from the point of view of environmental data entry, when there is a large amount of data in a consistent format, with the pages in good condition. Unless you have this situation, then scanning probably won’t work. However, this approach has been known to work on some projects. After scanning, a checking step is required to maintain quality. Voice entry – As with scanning, voice recognition has taken great strides in recent years. Systems are available that do a reasonable job of converting a continuous stream of spoken words into a word processing document. Voice recognition is also starting to be used for on-screen navigation, especially for the handicapped. It is probably too soon to tell whether this technology will have a large impact on data entry. Offshore entry – There are a number of organizations in countries outside the United States, especially Mexico and India, that specialize in high-volume data entry. They have been very successful in some industries, such as processing loan applications. Again, the availability of a large number of documents in the same format seems to be the key to success in this approach, and a post-entry checking step is required.
Figure 55 - Form entry of analysis data
Form entry vs. spreadsheet entry – EDMS programs usually provide a form-based system for entering data, and the form usually has fields for all the data at each level, such as site, station, sample, and analysis. Figure 55 shows an example of this type of form. This is usually best for entering a small amount of data. For larger data entry projects, it may be useful to make a customized form that matches the source documents to simplify input. Another common approach is to enter the data into a spreadsheet, and then use the import tool of the EDMS to check and import the data. Figure 56 shows this approach. This has two benefits. The EDMS may have better data checking and cleanup tools as part of the import than it does for form entry. Also, the person entering the data into the spreadsheet doesn’t necessarily need a license for the EDMS software, which can save the project money. Sometimes it is helpful to create spreadsheet templates with things like station names, dates, and parameter lists using cut and paste in one step, and then have the results entered in a second step.
Ongoing entry There may be situations where data needs to be manually entered on an ongoing basis. This is becoming less common as most sources of data involve a computerized step, so there is usually a way to import the data electronically. If not, approaches as described above can be used.
ELECTRONIC IMPORT The majority of data placed into the EDMS is usually in digital format in some form or other before it is brought into the system. The implementers of the system should provide a data transfer standard (DTS) so that the electronic data deliverables (EDDs) created by the laboratory for the EDMS contain the appropriate data elements in a format suitable for easy import. An example DTS is shown in Appendix C.
Figure 56 - Spreadsheet entry of analysis data
Automated import routines should be provided in the EDMS so that data in the specified format (or formats if the system supports more than one) can be easily brought into the system and checked for consistency. Data review tracking options and procedures must be provided. In addition, if it is found that a significant amount of digital data exists in other formats, then imports for those formats should be provided. In some cases, importing those files may require operator involvement if, for example, the file is a spreadsheet file of sample and analytical data but does not contain site or station information. These situations usually must be addressed on a case-by-case basis.
Historical entry Electronic entry of historical data involves several issues including selecting, locating, and organizing data, and format and content issues. Data selection, location, and organization – The same issues exist here as in manual input in terms of prioritizing what data will be brought into the EDMS. Then it is necessary to locate and catalog the data, whatever format it is in, such as on a hard drive or on diskettes. Format issues – Importing historical data in digital format involves figuring out what is in the files and how it is formatted, and then finding a way to import it, either interactively using queries or automatically with a menu-driven system. Most modern data management programs can read a variety of file formats including text files, spreadsheets, word processing documents, and so on. Usually the data needs to be organized and reformatted before it can be merged with other data already in the EDMS. This can be done either in its native format, such as in a spreadsheet, or imported into the database program and organized there. If each file is in a different format, then there can be a big manual component to this. If there are a lot of data files in the same format, it may be possible to automate the process to a large degree.
Content issues – It is very important that the people responsible for importing the data have a detailed understanding of the content of the data being imported. This includes knowing where the data was acquired and when, how it is organized, and other details like detection limits, flags, and units, if they are not in the data files. Great care must be exercised here, because often details like these change over time, often with little or no documentation, and are important in interpreting the data.
Ongoing entry
The EDMS should provide the capability to import analytical data in the format(s) specified in the data transfer standard. This import capability must be robust and complete, and the software and import procedures must address data selection, format, and content issues, and special issues such as field data, along with consistency checking as described in a later section.
Data selection – For current data in a standard format, importing may not be very time-consuming, but it may still be necessary to prioritize data import for various projects. The return on the time invested is the key factor.
Format and content issues – It may be necessary to provide other import formats in addition to those in the data transfer standard. The identification of the need to implement other data formats will be made by project staff members. The content issues for ongoing entry may be less than for historical data, since the people involved in creating the files are more likely to be available to provide guidance, but care must still be taken to understand the data in order to get it in right.
Field data – In the sampling process for environmental data there is often a field component and a laboratory component. More and more, the data is being gathered in the field electronically. It is sometimes possible to move this data digitally into the EDMS. Some hard copy information is usually still required, such as a chain of custody to accompany the samples, but this can be generated in the field and printed there. The EDMS needs to be able to associate the field data arriving from one route with the laboratory data from another route so both types of data are assigned to the correct sample.
Understanding duplicated and superseded data Environmental projects generate duplicated data in a variety of ways. Particular care should be taken with duplicated data at the Samples and Analyses levels. Duplicate samples are usually the result of the quality assurance process, where a certain number of duplicates of various types are taken and analyzed to check the quality of the sampling and analysis processes. QC samples are described in more detail in Chapter 15. A sample can also be reanalyzed, resulting in duplicated results at the Analyses level. These results can be represented in two ways, either as the original result plus the reanalysis, or as a superseded (replaced) original result plus the new, unsuperseded result. The latter is more useful for selection purposes, because the user can easily choose to see just the most current (unsuperseded) data, whereas selecting reanalyzed data is not as helpful because not all samples will have been reanalyzed. Examples of data at these two levels and the various fields that can be involved in the duplications at the levels are shown in Figure 57.
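A simple illustration of why the superseded representation helps with selection is sketched below. The record layout mirrors Figure 57 loosely, and it assumes a superseded flag of 0 marks the current result; both the field names and that convention are illustrative rather than fixed.

# Keep only current results, assuming superseded = 0 means the result has
# not been replaced by a reanalysis (field names are illustrative).
analyses = [
    {"sample_no": 1, "parameter": "Naphthalene", "superseded": 1, "value": 12.0},
    {"sample_no": 1, "parameter": "Naphthalene", "superseded": 0, "value": 15.0},
    {"sample_no": 1, "parameter": "Field pH",    "superseded": 0, "value": 6.8},
]

current = [a for a in analyses if a["superseded"] == 0]
for a in current:
    print(a["parameter"], a["value"])

Selecting on the superseded flag works for every parameter, whether or not it was ever reanalyzed, which is why it is more convenient than selecting on a "reanalysis" label.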
Obtaining clean data from laboratories
Having an accurate, comprehensive, historical database for a facility provides a variety of benefits, but requires that consistency be enforced when data is being added to the database. Matching analytical data coming from laboratories with previous data in a database can be a time-consuming process.
Duplicate - Samples Level (* = part of the unique index for water samples)

Sample No.  Station*  Sample Date*  Matrix*  Filtered*  Duplicate*  QC Code     Lab ID
1           MW-1      8/1/2000      Water    Total      0           Original    2000-001
2           MW-1      8/1/2000      Water    Total      1           Field Dup.  2000-002
3           MW-1      8/1/2000      Water    Total      2           Split       2000-003

Superseded - Analysis Level (* = part of the unique index for water analyses; Sample No. carries Station, Date, Matrix, Filtered, and Duplicate)

Sample No.*  Parameter Name*  Leach Method*  Basis*  Superseded*  Value Code  Dilution Factor  Reportable Result
1            Field pH         None           None    0            None
1            Field pH         None           None    1            None
1            Field pH         None           None    2            None
1            Field pH         None           None    3            None
1            Naphthalene      None           None    0            Original    1                N
1            Naphthalene      None           None    1            DL1         50               Y
1            Naphthalene      None           None    2            DL2         10               N

Figure 57 - Duplicate and superseded data
Variation in station names, spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data, or, even worse, does not get imported at all because of referential integrity constraints. An alternative is a time-consuming data checking and cleanup process with each data deliverable, which is standard operating procedure for many projects.
WORKING WITH LABS - STANDARDIZING DELIVERABLES The process of getting the data from the laboratory in a consistent, usable format is a key element of a successful data management system. Appendix C contains a data transfer standard (DTS) that can be used to inform the lab how to deliver data. EDDs should be in the same format every time, with all of the information necessary to successfully import the data into the database and tie it with field samples, if they are already there. Problems with EDDs fall into two general areas: 1) data format problems and 2) data content problems. In addition, if data is gathered in the field (pH, turbidity, water level, etc.) then that data must be tied to the laboratory data once the data administrator has received both data sets. Data format problems fall into two areas: 1) file format and 2) data organization. The DTS can help with both of these by defining the formats (text file, Excel spreadsheet, etc.) acceptable to the data management system, and the columns of data in the file (data elements, order, width, etc.). Data content problems are more difficult, because they involve consistency between what the lab is generating and what is already in the database. Variation in station names (is it “MW1” or “MW-1”?), spelling of constituent names, abbreviation of units, and problems with other data elements can result in data that does not tie in with historical data. Even worse, the data may not get imported at all because of referential integrity constraints defined in the data management system.
Figure 58 - Export laboratory reference file
USING REFERENCE FILES AND A CLOSED-LOOP SYSTEM While project managers expect their laboratories to provide them with “clean” data, on most projects it is difficult for the laboratory to deliver data that is consistent with data already in the database. What is needed is a way for the project personnel to keep the laboratory updated with information on the various data elements that must be matched in order for the data to import properly. Then the laboratory needs a way to efficiently check its electronic data deliverable (EDD) against this information prior to delivering it to the user. When this is done, then project personnel can import the data cleanly, with minimal impact on the data generation process at the laboratory. It is possible to implement a system that cuts the time to import a laboratory deliverable by a factor of five to ten over traditional methods. The process involves a DTS as described in Appendix C to define how the data is to be delivered, and a closed-loop reference file system where the laboratory compares the data it is about to deliver to a reference file provided by the database user. Users employ their database software to create the reference file. This reference file is then sent to the laboratory. The laboratory prepares the electronic data deliverable (EDD) in the usual way, following the DTS, and then uses the database software to do a test import against the reference file. If the EDD imports successfully, the laboratory sends it to the user. If it does not, the laboratory can make changes to the file, test it again, and once successful, send it to the user. Users can then import this file with a minimum of effort because consistency problems have been eliminated before they receive it. This results in significant time-savings over the life of a project. If the database tracks which laboratories are associated with which sites, then the creation of the reference file can start with selection of the laboratory. An example screen to start the process is shown in Figure 58. In this example, the software knows which sites are associated with the laboratory, and also knows the name to be used for the reference file. The user selects the laboratory, confirms the file name, and clicks on Create File. The file can then be sent to the laboratory via email or on a disk. This process is done any time there are significant changes to the database that might affect the laboratory, such as installation of new stations (for that laboratory’s sites) or changes to the lookup tables. There are many benefits to having a centralized, open database available to project personnel. In order to have this work effectively the data in the database must be accurate and consistent. Achieving this consistency can be a time-consuming process. By using a comprehensive data transfer standard, and the closed-loop system described above, this time can be minimized. In one organization the average time to import a laboratory deliverable was reduced from 30 minutes down to 5 minutes using this process. Another major benefit of this process is higher data quality. This increase in quality comes from two sources. The first is that there will be fewer errors in the data deliverable, and consequently fewer errors in the database, because a whole class of errors
related to data mismatches has been completely eliminated. A second increase in quality is a consequence of the increased efficiency of the import process. The data administrator has more time to scrutinize the data during and after import, making it easier to eliminate many other errors that would have been missed without this scrutiny.
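The laboratory-side test in this closed loop can be sketched as follows. The CSV layouts, column names, and the idea of a two-column reference file are assumptions for the sketch; the actual file formats would be whatever the DTS and the database software define.

import csv

def load_reference(path: str) -> dict[str, set[str]]:
    """Load a hypothetical reference file with columns record_type, value
    (for example 'station,MW-1' or 'parameter,Benzene')."""
    lookups: dict[str, set[str]] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            lookups.setdefault(row["record_type"], set()).add(row["value"])
    return lookups

def test_edd(edd_path: str, reference_path: str) -> list[str]:
    """Return mismatches between the EDD and the reference file so the
    laboratory can fix them before delivering the file."""
    ref = load_reference(reference_path)
    problems = []
    with open(edd_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # line 1 is the header
            if row["station"] not in ref.get("station", set()):
                problems.append(f"line {i}: unknown station {row['station']!r}")
            if row["parameter"] not in ref.get("parameter", set()):
                problems.append(f"line {i}: unknown parameter {row['parameter']!r}")
    return problems

# The lab corrects anything reported here, re-tests, and only then sends the EDD.
for problem in test_edd("edd.csv", "reference.csv"):
    print(problem)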
Automated checking Effective importing of laboratory and other data should include data checking prior to import to identify errors and to assist with the resolution of those errors prior to placing the data in the system. Data checking spans a range of activities from consistency checking through verification and validation. Performing all of the checks won’t ensure that no bad data ever gets into the database, but it will cut down significantly on the number of errors. The verification and validation components are discussed in more detail in Chapter 16. The consistency checks should include evaluation of key data elements, including referential integrity (existence of parents); valid site (project) and station (well); valid parameters, units, and flags; handling of duplicate results (same station, sample date and depth, and parameter); reasonable values for each parameter; comparison with like data; and comparison with previous data. The software importing the data should perform all of the data checks and report on the results before importing the data. It’s not helpful to have it give up after finding one error, since there may well be more, and it might as well find and flag all of them so you can fix them all at once. Unfortunately, this is not always possible. For example, valid station names are associated with a specific site, so if the site in the import file is wrong, or hasn’t been entered in the sites table, then the program can’t check the station names. Once the program has a valid site, though, it should be able to perform the rest of the checks before stopping. Of course, all of this assumes that the file being imported is in a format that matches what the software is looking for. If site name is in the column where the result values should be, the import should fail, unless the software is smart enough to straighten it out for you. Figure 59 shows an example of a screen where the user is being asked what software-assisted data checking they want performed, and how to handle specific situations resulting from the checking.
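One way to structure the checks so that the importer reports every problem at once, rather than stopping at the first error, is sketched below. The individual check functions, the row layout, and the lookup structure are hypothetical stand-ins for whatever the EDMS actually uses.

# Run every consistency check on every row and collect all of the problems
# before deciding whether to import (rows are parsed EDD records).
def check_station(row, lookups):
    if row["station"] not in lookups["stations"]:
        return f"unknown station {row['station']!r}"

def check_units(row, lookups):
    if row["units"] not in lookups["units"]:
        return f"unknown units {row['units']!r}"

def check_value(row, lookups):
    # Example of a parameter-specific reasonableness check.
    if row["parameter"] == "pH" and not (0 <= float(row["value"]) <= 14):
        return f"pH value {row['value']} out of range"

CHECKS = [check_station, check_units, check_value]

def run_checks(rows, lookups):
    problems = []
    for i, row in enumerate(rows, start=1):
        for check in CHECKS:
            message = check(row, lookups)
            if message:
                problems.append(f"row {i}: {message}")
    return problems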
Figure 59 - Screen for software-assisted data checking
Figure 60 - Screen for editing data prior to import
You might want to look at the data prior to importing it. Figure 60 shows an example of a screen to help you do this. If edits are made to the laboratory deliverable, it is important that a record be kept of these changes for future reference.
REFERENTIAL INTEGRITY CHECKING A properly designed EDMS program based on the relational model should require that a parent entry exist before related child entries can be imported. (Surprisingly, not all do.) This means that a site must exist before stations for that site can be entered, and so on through stations, samples, and analyses. Relationships with lookups should also be enforced, meaning that values related to a lookup, such as sample matrix, must be present and match entries in the lookup table. This helps ensure that “orphan” data does not exist in the tables. Unfortunately, the database system itself, such as Access, usually doesn’t give you much help when referential integrity problems occur. It fails to import the record(s), and provides an error message that may, or may not, give you some useful information about what happened. Usually it is the job of the application software running within the database system to check the data and provide more detailed information about problems.
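An application-level orphan check, run before anything is written to the database, can produce the detailed messages that the database engine's generic referential-integrity error does not. The sketch below uses illustrative table and field names.

def find_orphans(child_rows, parent_keys, key_field, parent_name):
    """List child records whose parent key is missing, so the import can
    report all of them before any data is written."""
    return [
        f"{parent_name} {row[key_field]!r} not found for record {i}"
        for i, row in enumerate(child_rows, start=1)
        if row[key_field] not in parent_keys
    ]

stations_in_db = {"MW-1", "MW-2"}
incoming_samples = [{"station": "MW-1"}, {"station": "MW1"}]  # "MW1" is a lab typo
print(find_orphans(incoming_samples, stations_in_db, "station", "station"))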
CHECKING SITES AND STATIONS When data is obtained from the lab it must contain information about the sites and samples associated with the data. It is usually not a good idea to add this data to the main data tables automatically based on the lab data file. This is because it is too easy to get bad records in these two tables and then have the data being imported associated with those bad records. In our experience, it is more likely that the lab has misspelled the station name than that you really drilled a new well, although obviously this is not always the case. It is better to enter the sites and stations first, and then associate the samples and analyses with that data during import. Then the import should check to make sure the sites and stations are there, and tell you if they aren’t, so you can do something about it. On many projects the sample information follows two paths. The samples and field data are gathered in the field. The samples go to the laboratory for analysis, and that data arrives in the electronic data deliverable (EDD) from the laboratory. The field data may arrive directly from the field, or may be input by the laboratory.
Figure 61 - Helper screen for checking station names
If the field data arrives separately from the laboratory data, it can be entered into the EDMS prior to arrival of the EDD from the laboratory. This entry can be done in the field on a portable computer or PDA, in a field office at the site, or in the main office. Then the EDMS needs to be able to associate the field information with the laboratory information when the EDD is imported. Another approach is to enter the sample information prior to the sampling event. Then the EDMS can check the field data and laboratory data for completeness as each arrives. The process needs to be flexible enough to accommodate legitimate changes resulting from field activities (well MW1 was dry), but also notify the data administrator of data that should be there but is missing. This checking can be performed on data at both the sample and analyses levels.
Figure 61 shows the software helping with the data checking process. The user has imported a laboratory data file that has some problems with station names. The program is showing the names of the stations that don't match entries already in the database, and providing a list of valid stations to choose from. The user can step through the problem stations, choosing the correct names. If they are able to correctly match all of the stations, the import can proceed. If not, they will need to put this import aside while they research the station names that have problems.
The import routine may provide an option to convert the data to consistent units, and this is useful for some projects. For other projects (perhaps most), data is imported as it was reported by the laboratory, and conversion to consistent units is done, if at all, at retrieval time. This is discussed in Chapter 19. The decision about whether to convert to consistent units during import should be made on a project-by-project basis, based on the needs of the data users. In general, if the data will be used entirely for site analysis, it probably makes sense to convert to consistent units so retrieval errors due to mixed units are eliminated. If the data will be used for regulatory and litigation purposes, it is better to import the data as-is, and do conversion on output.
CHECKING PARAMETERS, UNITS, AND FLAGS After the import routine is happy with the sites and stations in the file, it should check the other data, as much as possible, to try to eliminate inconsistent data. Data in the import file should be compared to lookup tables in the database to weed out errors. Parameter names in particular provide a great opportunity for error, as do reporting units, flags, and other data.
Figure 62 - Screen for entering defaults for required values
The system should provide screens similar to Figure 61 to help fix bad values, and flag records that have issues that can’t be resolved so that they can be researched and fixed. Note that comparing values against existing data like sites and stations, or against lookups, only makes sure that the data makes sense, not that it is really right. A value can pass a comparison test against a lookup and still be wrong. After a successful test of the import file, it is critical that the actual data values be checked to an appropriate level before the data is used. Sometimes the data being imported may not contain all of the data necessary to satisfy referential integrity constraints. For example, historical data being imported may not have information on sample filtration or measurement basis, or even the sample matrix, if all of the data in the file has the same matrix. The records going into the tables need to have values in these fields because of their relationships to the lookup tables, and also so that the data is useful. It is helpful if the software provides a way to set reasonable defaults for these values, as shown in Figure 62, so the data can be imported without a lot of manual editing. Obviously, this feature should be used with care, based on good knowledge of the data being imported, to avoid assigning incorrect values.
OTHER CHECKS There are a number of other checks that the software can perform to improve the quality of the data being imported. Checking for repeated import – In the confusion of importing data, it is easy to accidentally import, or at least try to import, the same data more than once. The software should look for this, tell you about it, and give you the opportunity to stop the import. It is also helpful if the software gives you a way to undo an import later if a file shouldn’t have been imported for one reason or another. Parameter-specific reasonableness – Going beyond checking names, codes, etc., the software should check the data for reasonableness of values on a parameter-by-parameter basis. For example, if a pH value comes in outside the range of 0 to 14, then the software could notice and complain. Setting up and managing a process like this takes a considerable amount of effort, but results in better data quality. Comparison with like data – Sometimes there are comparisons that can be made within the data set to help identify incorrect values. One example is comparing total dissolved solids reported by the lab with the sum of all of the individual constituents, and flagging the data if the difference
exceeds a certain amount. Another is to do a charge balance comparison. Again, this is not easy to set up and operate, but results in better data quality. Comparison with previous data – In situations where data is being gathered on a regular basis, new data can be compared to historical data, and data that differs from previous data by more than a certain amount (usually some number of standard deviations from the mean) is suspect. These data points are often referred to as outliers. The data point can then be researched for error, re-sampled, or excluded, depending on the regulations for that specific project. The field of statistical quality control has various tools for performing this analysis, including Shewhart and Cumulative Sum control charts and other graphical and non-graphical techniques. See Chapters 20 and 23 for more information.
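The reasonableness and historical-comparison checks described above can be sketched in a few lines. The ranges and the three-standard-deviation threshold shown here are illustrative only; real limits are parameter- and project-specific.

from statistics import mean, stdev

REASONABLE_RANGES = {"pH": (0.0, 14.0), "Temperature": (-10.0, 50.0)}

def range_check(parameter, value):
    """Return True if the value falls in the allowed range for the parameter."""
    low, high = REASONABLE_RANGES.get(parameter, (float("-inf"), float("inf")))
    return low <= value <= high

def outlier_check(value, history, k=3.0):
    """Flag a value more than k standard deviations from the historical mean."""
    if len(history) < 3:
        return False          # not enough history to judge
    m, s = mean(history), stdev(history)
    return s > 0 and abs(value - m) > k * s

print(range_check("pH", 15.2))                            # False - complain
print(outlier_check(480.0, [110.0, 95.0, 120.0, 105.0]))  # True - suspect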
CONTENT-SPECIFIC FILTERING At times there will be specific data content that needs to be handled in a special way whenever it is present in the import. For example, one project that we worked on had various problems over time with phenols. At different times the laboratory reported phenols in different ways. For this project, any file that contained any variety of phenol required specific attention. In another case, the procedure for a project specified that tentatively identified compounds (TICs) should not be imported at all. The database software should be able to handle these two situations, allowing records with specific data content to be either flagged or not imported. Figure 63 shows an example of a screen to help with this. Some projects allow the data administrator to manually select which data will be imported. This sounds strange to many people, but we have worked on projects where each line in the EDD is inspected to make sure that it should be imported. If a particular constituent is not required by the project plan, and the laboratory delivered it anyway, that line is deleted prior to import. In Figure 60 the checkbox near the top of the screen is used for this purpose. The software should allow deleted records to be saved to a file for later reference if necessary.
Figure 63 - Screen to configure content-specific filtering
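A content-filtering rule table like the one behind Figure 63 might be sketched as follows. The phenol and TIC rules mirror the examples in the text; the record layout and actions are assumptions, and skipped records are kept so they can be saved to a file for later reference.

RULES = [
    {"match": lambda p: "phenol" in p.lower(), "action": "flag"},
    {"match": lambda p: p.upper().startswith("TIC"), "action": "skip"},
]

def filter_records(records):
    """Apply content-specific rules, returning imported, flagged, and skipped records."""
    imported, flagged, skipped = [], [], []
    for rec in records:
        action = next((r["action"] for r in RULES if r["match"](rec["parameter"])), None)
        if action == "skip":
            skipped.append(rec)            # retain for the deleted-records file
        elif action == "flag":
            rec["review_flag"] = "content-rule"
            flagged.append(rec)
            imported.append(rec)
        else:
            imported.append(rec)
    return imported, flagged, skipped

imported, flagged, skipped = filter_records([
    {"parameter": "Phenol", "value": 0.05},
    {"parameter": "TIC: unknown hydrocarbon", "value": 1.1},
    {"parameter": "Benzene", "value": 0.002},
])
print(len(imported), "imported,", len(flagged), "flagged,", len(skipped), "skipped")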
Figure 64 - Screens showing results of a successful and an unsuccessful import
TRACKING IMPORTS Part of the administration of the data management task should include keeping records of the import process. After trying an import, the software should notify you of the result. Records should be kept of both unsuccessful and successful imports. Figures 65 and 66 are examples of reports that can be printed and saved for this purpose.
Figure 65 - Report showing an unsuccessful import
Figure 66 - Report showing a successful import
The report from the unsuccessful import can be used to resolve problems prior to trying the import again. At this stage it is helpful for the software to be able to summarize the errors so an error that occurs many times is shown only once. Then each type of error can be fixed generically and the report re-run to make sure all of the errors have been remedied so you can proceed with the import. The report from the successful import provides a permanent record of what was imported. This report can be used for another purpose as well. In the upper left corner is a panel (shown larger in Figure 67) showing the data review steps that may apply to this data. This report can be circulated among the project team members and used to track which review steps have been performed. After all of the appropriate steps have been performed, the report can be returned to the data administrator to enter the upgraded review status for the analyses.
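Summarizing the errors from an unsuccessful import so that each type appears once with a count can be as simple as the following sketch; the error descriptions here are made up for illustration.

from collections import Counter

def summarize_errors(errors):
    """errors is a list of (error_type, detail) tuples from an import attempt."""
    counts = Counter(err for err, _ in errors)
    for err, n in counts.most_common():
        print(f"{err}: {n} occurrence(s)")

summarize_errors([
    ("Unknown parameter code", "XYZ"),
    ("Unknown parameter code", "ABC"),
    ("Missing units", "row 14"),
])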
UNDOING AN IMPORT Despite your best efforts, sometimes data is imported that either should not have been imported or is incorrect. The database software should track the data that you import so that an Undo Import feature can remove it automatically if necessary. You might need to do this if you find out that a particular file that you imported has errors and is being replaced, or if you accidentally imported a file twice. An undo import feature should be easy to use but sophisticated, leaving samples in the database that have analyses from a different import, and undoing superseded values that were incremented by the import of the file being undone. Figure 68 shows a form to help you select an import for deletion.
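One possible shape for an undo-import step is sketched below, assuming each imported record carries the identifier of the import batch that created it. The table and column names are hypothetical, and a fuller version would also reset any superseded values that were incremented by the import being undone.

import sqlite3

def undo_import(conn, import_id):
    """Remove the analyses from one import, keeping samples that still have
    analyses from other imports."""
    # conn = sqlite3.connect("edms.db") in an actual session
    cur = conn.cursor()
    cur.execute("DELETE FROM analyses WHERE import_id = ?", (import_id,))
    cur.execute("""
        DELETE FROM samples
        WHERE import_id = ?
          AND sample_id NOT IN (SELECT DISTINCT sample_id FROM analyses)
    """, (import_id,))
    # A fuller version would also reset superseded values affected by this import.
    conn.commit()
    return cur.rowcount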
Figure 67 - Section of successful import report used for data review
Figure 68 - Form to select an import for deletion
TRACKING QUALITY A constant focus on quality should be maintained during the import process. Each result in the database should be marked with flags regarding lab and other problems, and should also be marked
with the level of data review that has been applied to that result. An example of a screen to assist with maintaining data review status is shown in Figure 75 in Chapter 15. If the import process is managed properly using software with a sufficiently sophisticated import tool, and if the data is checked properly after import, then the resulting data will be of a quality that makes it useful to the project. The old axiom of “garbage in, garbage out” holds true with environmental data. Another old axiom says “a job worth doing is worth doing well,” or in other words, “If you don’t have time to do it right the first time, how are you ever going to find time to do it again?” These old saws reinforce the point that the time invested in implementing a robust checking system and using it properly will be rewarded by producing data that people can trust.
CHAPTER 14 EDITING DATA
Once the data is in the database it is sometimes necessary to modify it. This can be done manually or using automated tools, depending on the task to be accomplished. These two processes are described here. Due to the focus on data integrity, a log of all changes to the data should be maintained, either by the software or manually in a logbook.
MANUAL EDITING Sometimes it is necessary to go into the database and change specific pieces of data content. Actually, modification of data in an EDMS is not as common as an outsider might expect. For the most part, the data comes from elsewhere, such as the field or the laboratory, and once it is in it stays the way it is. Data editing is mostly limited to correcting errors (which, if the process is working correctly, should be minimal) and modifying data qualifiers such as review status and validation flags. The data management system will usually provide at least one way to manually edit data. Sometimes the user interface will provide more than one way to view and edit data. Two examples include form view (Figure 69) and datasheet view (Figure 70).
Figure 69 - Site data editing screen in form view
Figure 70 - Site data editing screen in datasheet view
AUTOMATED EDITING If the changes involve more than one record at a time, then it probably makes sense to use an automated approach. For specific types of changes that are a standard part of data maintenance, this should be programmed into the system. Other changes might be a one-time action, but involve multiple records with the same change, so a bulk update approach using ad hoc queries is better.
Standardized tasks Some data editing activities are relatively common. For these activities, especially if they involve a lot of records to be changed or a complicated change process, the software should provide an automated or semi-automated process to assist the data administrator with making the changes. The examples given here include both a simple process and a complicated one to show how the system can provide this type of capability.
UPDATING REVIEW STATUS It’s important to track the review status of the data, that is, what review steps have been performed on the data. An automated editing step can help update the data as review steps are completed. Automated queries should allow the data administrators to update the review status flags after appropriate data checks have been made. An example of a screen to assist with maintaining data review status is shown in Figure 75 in Chapter 15.
REMOVAL OF DUPLICATED ENTRIES Repeated records can enter the database in several ways. The laboratory may deliver data that has already been delivered, either a whole EDD or part of one. Data administrators may import the same file twice without noticing. (The EDMS should notify them if they try to do this.) Data that has been imported from the lab may also be imported from a data validator with partial or complete overlap. The lab may include field data, which has already been imported, along with its data. However it gets in, this repeated data provides no value and should be removed, and records kept of the changes that were made to the database. However, duplicated data resulting from the quality control process usually is of value to the project, and should not be removed. Repeated information can be present in the database at the samples level, the analyses level, or both. The removal of duplicated records should address both levels, starting at the samples level, and then moving down to the analyses level. This order is important because removing repeated samples can result in more repeated analyses, which will then need to be removed. The samples component of the duplicated record removal process is complicated by the fact that samples have analyses underneath them, and when a duplicate sample is removed, the analyses should probably not be lost, but rather moved to the remaining sample. The software should help you do this by letting you pick the sample to which you want to move the analyses. Then the software should modify the superseded value of the affected analyses, if necessary, and assign them to the other sample.
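The search for candidate duplicate samples can be expressed as a grouping query over the fields used to match samples (site, station, matrix, sample date, top and base, and lab sample ID, as discussed below). The sketch below assumes hypothetical table and column names.

import sqlite3

DUPLICATE_SAMPLE_QUERY = """
    SELECT site_id, station_id, matrix, sample_date, top_depth, base_depth,
           lab_sample_id, COUNT(*) AS n
    FROM samples
    GROUP BY site_id, station_id, matrix, sample_date, top_depth, base_depth,
             lab_sample_id
    HAVING COUNT(*) > 1
"""

def candidate_duplicate_samples(conn):
    """Return groups of samples that may be duplicates, for the user to review."""
    # conn = sqlite3.connect("edms.db") in an actual session
    return conn.execute(DUPLICATE_SAMPLE_QUERY).fetchall()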
Figure 71 - Form for moving analyses from a duplicate sample
The analyses being moved may in fact represent duplicated data themselves, and the duplicated record removal at the analyses level can be used to remove these results. The analyses component of the duplicated record removal process must deal with the fact that, in some cases, redundant data is desirable. The best example is groundwater samples, where four independent observations of pH are often taken, and should all be saved. The database should allow you to specify for each parameter and each site and matrix how many observations should be allowed before the data is considered redundant.
The first step is to select the data for the duplicate removal process. Normally you will want to work with all of the data for a sampling event. Once you have selected the set of data to work on, the program should look for samples that might be repeated information. It should do this by determining samples that have the same site, station, matrix, sample date, top and base, and lab sample ID. Once the software has made its recommendations for samples you might want to remove, the data should be displayed for you to confirm the action. Before removing any samples, you should print a report showing the samples that are candidates for removal. You should then make notes on this report about any actions taken regarding removal of duplicated sample records, and save the printed report in the project file.
If a sample to be removed has related analyses, then the analyses must be moved to another sample before the candidate sample can be deleted. This might be the case if somehow some analyses were associated with one sample in the database and other analyses with another, and in fact only one sample was taken. In that case, the analyses should be moved to the sample with a duplicate value of zero from the one with a higher duplicate value, and then the sample with the higher duplicate value should be deleted. The software should display the sample with the higher duplicate value first, as this is the one most likely to be removed, and display a sample that is a likely target to move the analyses to. A screen for a sample with analyses to be moved might look like Figure 71. The screen has a notation that the sample has analyses, and provides a combo box, in gray, for you to select a sample to move the analyses to. If the sample being displayed does not have analyses, or once they have been moved to another sample, then it can be deleted. In this case, the screen might look like Figure 72.
Once you have moved analyses as necessary and deleted improper duplicates, the program should look for analyses that might contain repeated information. It can do this using the following process: 1) Determine all of the parameters in the selection set. 2) Determine the number of desired observations for each parameter. Use site-specific information if it is present. If it is not, use global information. If observation data is not available, either site-specific or global, for one or more parameters, the software should notify you, and provide the option of stopping or proceeding. 3) Determine which analyses for each parameter exceed the observations count.
Figure 72 - Form for deleting duplicate samples for a sample without analyses
Next, the software should recommend analyses for removal. The goal of this process is to remove duplicated information, while, for each sample, retaining the records with the most data. The program can use the following process: 1) Group all analyses where the sample and parameter are the same. 2) If all of the data is exactly the same in all of the fields (except for AnalysisNumber and Superseded), mark all but one for deletion. 3) If all of the data is not exactly the same, look at the Value, AnalyticMethod, AnalDate_D, Lab, DilutionFactor, QCAnalysisCode, and AnalysisLabID fields. If the records are different in any of these fields, keep them. For records that are the same in all of these fields, keep the record with the greatest number of other data fields populated, and mark the others for removal. (The user should be able to modify the program’s selections prior to deletion.)
Once the software has made its recommendations for analyses to be removed, the data should be displayed in a form such as that shown in Figure 73. In this example, the software has selected several analyses for removal. Visible on the screen are two Arsenic and two Chloride analyses, and one of each has been selected for removal. In this case, this appears appropriate, since the data is exactly duplicated. The information on this screen should be reviewed carefully by someone very familiar with the site. You should look at each analysis and the recommendation to confirm that the software has selected the correct action. After selecting analyses for removal, but before removing any analyses, you should print a report showing the analyses that have been selected for removal. You should save the printed report in the project file.
There are two parts to the Duplicated Record Removal process for analyses. The first part is the actual removal of the analytical records. This can be done with a simple delete query, after users are asked to confirm that they really want to delete the records. The second part is to modify the superseded values as necessary to remove any gaps caused by the removal process. This should be done automatically after the removal has been performed.
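The recommendation logic described above might be sketched as follows, using the field names from the text. Analyses are represented as simple records with assumed SampleID and Parameter keys, and the user would still review the recommendations before anything is deleted.

from itertools import groupby

KEY_FIELDS = ["Value", "AnalyticMethod", "AnalDate_D", "Lab",
              "DilutionFactor", "QCAnalysisCode", "AnalysisLabID"]

def recommend_removals(analyses):
    """Return analyses recommended for removal; nothing is deleted here."""
    to_remove = []
    keyfunc = lambda a: (a["SampleID"], a["Parameter"])
    for _, group in groupby(sorted(analyses, key=keyfunc), key=keyfunc):
        # Sub-group records whose key fields all match; differing records are kept
        matching = {}
        for a in group:
            sig = tuple(a.get(f) for f in KEY_FIELDS)
            matching.setdefault(sig, []).append(a)
        for same in matching.values():
            if len(same) > 1:
                # Keep the record with the most other fields populated
                same.sort(key=lambda a: sum(1 for v in a.values() if v not in (None, "")),
                          reverse=True)
                to_remove.extend(same[1:])
    return to_remove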
PARAMETER PRINT REORDERING This task is an example of a relatively simple process that the software can automate. It has to do with the order that results appear on reports. A query or report may display the results in alphabetical order by parameter name. The data user may not want to see it this way. A more useful order may be to see the data grouped by category, such as all of the metals followed by all of the organics. Or perhaps the user wants to enter some specific order, and have the system remember it and use it.
Figure 73 - Form for deleting duplicated analyses
A good way to implement a specific order is to have a field somewhere in the database, such as in the Parameters table, that can be used in queries to display the data in the desired order. For the case where users want the results in a specific order, they can manually edit this field until the order is the way they want it. For the case of putting the parameters in order by category, the software can also help. A tool can be provided to do the reordering automatically. The program needs to open a query of the parameters in order by category and name, and then assign print orders in increasing numbers from the first to the last. If the software is set up to skip some increment between successive values (assigning 10, 20, 30, and so on, for example), then the user can slip a new parameter into the middle without needing to redo the reordering process. The software can also be set up to allow you to specify an order for the categories themselves that is different from alphabetical, in case you want the organics first instead of the metals.
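A reordering tool along these lines might look like the sketch below, which orders parameters by category and name and assigns print-order values in increments of ten so a new parameter can be slipped in later. The category names and step size are illustrative.

def assign_print_order(parameters, category_order, step=10):
    """parameters is a list of dicts with 'name' and 'category' keys."""
    rank = {c: i for i, c in enumerate(category_order)}
    ordered = sorted(parameters,
                     key=lambda p: (rank.get(p["category"], len(rank)), p["name"]))
    for i, p in enumerate(ordered):
        p["print_order"] = (i + 1) * step   # 10, 20, 30, ... leaves room for insertions
    return ordered

params = [{"name": "Benzene", "category": "Organics"},
          {"name": "Arsenic", "category": "Metals"},
          {"name": "Lead", "category": "Metals"}]
for p in assign_print_order(params, ["Metals", "Organics"]):
    print(p["print_order"], p["category"], p["name"])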
Ad hoc queries Where the change to be made affects multiple records, but will be performed only once, or a small number of times over the life of the database, it doesn’t make sense to provide an automated tool, but manual entry is too tedious. An example of this is shown in Figure 74. The problem is that when the stations were entered, their current status was set to “z” for “Unknown,” even though only active wells were entered at that time. Now that some inactive wells are to be entered, the status of the existing stations needs to be set to “s” for “In service.” Figure 74 shows an update query to do this. The left panel shows the query in design view, and the right panel in SQL view. The data administrator has told the software to update the Stations table, setting the CurrentStatusCode field to “s” where it is currently “z.” The query will then make this change for all of the appropriate records in one step, instead of the data administrator having to make the change to each record individually. This type of ad hoc query can be a great time saver in the hands of a knowledgeable user. It should be used with great care, though, because of the potential to cause serious damage to the database. Changes made in this way should be fully documented in the activity log, and backup copies of the database maintained in case a change is made incorrectly.
Figure 74 - Ad hoc query showing a change to the CurrentStatusCode field
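Expressed as SQL run from a program rather than from a query design screen, the update in Figure 74 amounts to a single statement, sketched here with the table and field names used in the text. As the discussion above stresses, a change like this should be logged and run only against a backed-up database.

import sqlite3

def set_unknown_stations_in_service(conn):
    """Change station status from 'z' (Unknown) to 's' (In service) in one step."""
    # conn = sqlite3.connect("edms.db") in an actual session
    cur = conn.execute(
        "UPDATE Stations SET CurrentStatusCode = 's' WHERE CurrentStatusCode = 'z'"
    )
    conn.commit()
    return cur.rowcount   # number of station records changed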
CHAPTER 15 MAINTAINING AND TRACKING DATA QUALITY
If the data in your database is not of sufficient quality, people won’t (and shouldn’t) use it. Managing the quality of the data is just as important as managing the data itself. This chapter and the next cover a variety of issues related to quality terminology, QA/QC samples, data quality procedures and standards, database software support for quality analysis and tracking, and protection from loss. General data quality issues are contained in this chapter, and issues specific to data verification and validation in the next.
QA VS. QC Quality assurance (QA) is an integrated system of activities involving planning, quality control, quality assessment, reporting, and quality improvement to ensure that a product or service meets defined standards of quality with a stated level of confidence. Quality control (QC) is the overall system of technical activities whose purpose is to measure and control the quality of a product or service so that it meets the needs of users. The aim is to provide quality that is satisfactory, adequate, dependable, and economical (EPA, 1997a). In an over-generalization, QA talks about it and QC does it. Since the EDMS involves primarily the technical data and activities that surround it, including quantification of the quality of the data, it comes under QC more than QA. An EMS and the related EMIS (see Chapter 1), on the other hand, cover the QA component.
THE QAPP The quality assurance project plan (QAPP) provides guidance to the project to maintain the quality of the data gathered for the project. The following are typical minimum requirements for a QAPP for EPA projects:
Project management
• Title and approval sheet.
• Table of Contents – Document control format.
• Distribution List – Distribution list for the QAPP revisions and final guidance.
• Project/Task Organization – Identify individuals or organizations participating in the project and discuss their roles, responsibilities, and organization.
• Problem Definition/Background – 1) State the specific problem to be solved or the decision to be made. 2) Identify the decision maker and the principal customer for the results.
• Project/Task Description – 1) Hypothesis test, 2) expected measurements, 3) ARARs or other appropriate standards, 4) assessment tools (technical audits), 5) work schedule and required reports.
• Data Quality Objectives for Measurement – Data decision(s), population parameter of interest, action level, summary statistics, and acceptable limits on decision errors. Also, scope of the project (domain or geographical locale).
• Special Training Requirements/Certification – Identify special training that personnel will need.
• Documentation and Record – Itemize the information and records that must be included in a data report package, including report format and requirements for storage, etc.
Measurement/data acquisition
• Sampling Process Designs (Experimental Design) – Outline the experimental design, including sampling design and rationale, sampling frequencies, matrices, and measurement parameter of interest.
• Sampling Methods Requirements – Sample collection method and approach.
• Sample Handling and Custody Requirements – Describe the provisions for sample labeling, shipment, chain of custody forms, procedures for transferring and maintaining custody of samples.
• Analytical Methods Requirements – Identify analytical method(s) and equipment for the study, including method performance requirements.
• Quality Control Requirements – Describe routine (real-time) QC procedures that should be associated with each sampling and measurement technique. List required QC checks and corrective action procedures.
• Instrument/Equipment Testing Inspection and Maintenance Requirements – Discuss how inspection and acceptance testing, including the use of QC samples, must be performed to ensure their intended use as specified by the design.
• Instrument Calibration and Frequency – Identify tools, gauges and instruments, and other sampling or measurement devices that need calibration. Describe how the calibration should be done.
• Inspection/Acceptance Requirements for Supplies and Consumables – Define how and by whom the sampling supplies and other consumables will be accepted for use in the project.
• Data Acquisition Requirements (Non-direct Measurements) – Define the criteria for the use of non-measurement data such as data that comes from databases or literature.
• Data Management – Outline the data management scheme including the path and storage of the data and the data record-keeping system. Identify all data handling equipment and procedures that will be used to process, compile, and analyze the data.
Assessment/oversight
• Assessments and Response Actions – Describe the assessment activities for this project.
• Reports to Management – Identify the frequency, content, and distribution of reports issued to keep management informed.
Data validation and usability
• Data Review, Validation, and Verification Requirements – State the criteria used to accept or reject the data based on quality.
• Validation and Verification Methods – Describe the process to be used for validating and verifying data, including the chain of custody for data throughout the lifetime of the project.
• Reconciliation with Data Quality Objectives – Describe how results will be evaluated to determine if DQOs have been satisfied.

What Is Quality?
Take a few minutes, put this book down, get a paper and pencil, and write a concise answer to: “What is quality in data management?” It’s harder than it sounds.
“Quality … you know what it is, yet you don’t know what it is. But that’s self-contradictory. But some things are better than others, that is, they have more quality. But when you try to say what the quality is, apart from the things that have it, it all goes poof! There’s nothing to talk about. But if you can’t say what Quality is, how do you know what it is, or how do you know that it even exists? If no one knows what it is, then for all practical purposes it doesn’t exist at all. But for all practical purposes, it really does exist. … So round and round you go, spinning mental wheels and nowhere finding anyplace to get traction. What the h___ is Quality? What is it?”
Robert M. Pirsig, 1974 - Zen and the Art of Motorcycle Maintenance
Quality is a little like beauty. We know it when we see it, but it’s hard to say how we know. When you are talking about data quality, be sure the person you are talking to has the same meaning for quality that you do. As an aside, a whole discipline has grown up around Pirsig’s work, called the Metaphysics of Quality. For more information, visit www.moq.org.
There are many sources of information on how to write a QAPP. The EPA Web site (www.epa.gov) is a good place to start. It is usually not necessary to create the QAPP from scratch. Templates for QAPPs are available from a number of sources, and one of these templates can be modified for the needs of each specific project.
QC SAMPLES AND ANALYSES Over time, project personnel, laboratories, and regulators have developed a set of procedures to help maintain data quality through the sampling, transportation, analysis, and reporting process. This section describes these procedures and their impact on environmental data management. An attempt has been made to keep the discussion general, but some of the issues discussed here apply to some types of samples more than others. Material in this section is based on information in EPA (1997a); DOE/HWP (1990a, 1990b); and Core Laboratories (1996). This section covers four parts of the process: field samples, field QC samples, lab sample analysis, and lab calibration. There are several aspects of handling QC data that impact the way it should be handled in the EDMS. The basic purpose of QC samples and analyses is to confirm that the sampling and analysis process is generating results that accurately represent conditions at the site. If a QC sample produces an improper result, it calls into question a suite of results associated with that QC sample. The scope of the questionable suite of results depends on the samples associated with that QC sample. The scope might be a shipping cooler of samples, a sampling event, a laboratory batch, and so on. The questionable results must then be investigated further to determine whether they are still usable. Another issue is the amount and type of QC data to store. The right answer is to store the data necessary to support the use of the data, and no more or less. The problem is that different projects and uses have different requirements, and different parts of the data handling process can be done either inside or outside the database system. Once the range of data handling processes has been defined for the anticipated project tasks that will be using the system, a decision must be
made regarding the role of the EDMS in the whole process. Then the scope of storage of QC data should be apparent. Another issue is the amount of QC data that the laboratory can deliver. There is quite a bit of variability in the ability of laboratories and their LIMS systems to place QC information in the EDD. This is a conversation that you should have with the laboratory prior to selecting the laboratory and finalizing the project QC plan. A number of QC items involve analyses of samples which are either not from a specific station, or do not represent conditions at that station. Because the relational data model requires that each sample be associated with a station, stations must be entered for each site for each type of QC sample to be stored in the database. Multiple stations can be added where multiple samples must be distinguished. Examples of these stations include:
• Trip Blank 1
• Trip Blank 2
• Field Blank
• Rinseate Sample
• Equipment Blank
• Laboratory Control Sample
• Matrix Spike
• Matrix Spike Duplicate
For each site only a small number of these would normally be used. These stations can be excluded from normal data retrieval and display using QC codes stored with the data at the station level. A technical issue regarding QC sample data storage is whether to store the QC data in the same or separate tables from the main samples and analyses. Systems have been built both ways. As with the decision of which QC data to store, the answer to separate vs. integrated table design should be based on the intended uses of the data and the data management system. The following table summarizes some common QC sample types and the scope of samples over which they have QC influence. Where a term has multiple synonyms, only one is shown in the table. Some QC sample types are not included in the table because they are felt to primarily serve laboratory calibration rather than QC purposes, but this decision is admittedly subjective. Also, some types of QC samples can be generated either in the field or the laboratory, and only one is shown here.

Source               Sample type                  QC scope
Field Samples        Field duplicates             Sample event or batch
                     Split samples                Sample event or batch
                     Referee duplicates           Sample event or batch
                     Field sample spikes          Analytical batch
Field QC Samples     Trip blank                   Single cooler
                     Field blank                  Sample event or batch
                     Rinseate blank               Sample event or batch
                     Sampling equipment blank     Sample event or batch
Lab Sample Analyses  Matrix spike                 Analytical batch
                     Matrix spike duplicate       Analytical batch
                     Surrogate spikes             Analytical batch
                     Internal standard            Analytical batch
                     Laboratory duplicates        Analytical batch
                     Laboratory reanalyses        Analytical batch
Lab Calibration      Blank spike                  Analytical batch
                     Method blank                 Analytical batch
                     Instrument blank             Analytical batch
                     Instrument carryover blank   Analytical batch
                     Reagent blank                Analytical batch
                     Check sample                 Analytical batch
                     Calibration blank            Analytical batch
                     Storage blank                Analytical batch
                     Blind sample                 Analytical batch
                     Dynamic blank                Analytical batch
                     Calibration standard         Analytical batch
                     Reference standard           Analytical batch
                     Measurement standard         Analytical batch
Each of these QC items has some (at least potential) impact on the database system.
FIELD SAMPLES Field samples are the starting point for gathering both the primary data of interest as well as the QC data. This section covers QC items related to these samples themselves. Samples – Material gathered in the field for analysis from a specific location at a specific time. Field duplicates, Duplicate samples, Replicate samples – Two or more samples of the same material taken in separate containers, and carried through all steps of the sampling and analytical procedures in an identical manner. Duplicate samples are used to assess variance of the total method, including sampling and analysis. The frequency of field duplicates is project specific. There is an issue here that a major reason for sending duplicates to the laboratory is to track their performance. Marking the samples as duplicates might alert the laboratory that it is being checked, and cause it to deviate from the usual process. Using synthetic station names can overcome this, at the expense of making it more difficult to associate the data with the real station name in the database. A problem with a field duplicate impacts the samples from that day or that batch of field data. Split samples – Two or more representative portions taken from a sample or subsample and analyzed by different analysts or laboratories. Split samples are used to replicate the measurement of the variable(s) of interest to measure the reproducibility of the analysis. The difference between field duplicates and split samples is that splits start out as one sample, and duplicates as two. The frequency of split samples is project specific, with one for each 20 samples being common. Samples for VOC analysis should not be split. A problem with a split sample impacts the samples from that day or that batch of field data. Referee duplicates – Duplicate samples sent to a referee QA laboratory, if one is specified for the project. A problem with a referee duplicate impacts the samples from that day or that batch of field data. Field sample spikes, Field matrix spikes – A sample prepared by adding a known mass of target analyte to a specified amount of a field sample for which an independent estimate of target analyte concentration is available. Spiked samples are used, for example, to determine the effect of matrix interference on a method’s recovery efficiency. The frequency of these spikes is project specific. A problem with a matrix spike impacts the samples from that analytical batch.
FIELD QC SAMPLES Usually some samples are brought in from the field which do not represent in-situ conditions at the site, but are used in the QC process to help avoid bad data from contamination and other sources not directly related to actual site conditions. Trip blank – A clean sample of matrix that is carried to the sampling site and transported to the laboratory for analysis without having been unsealed and exposed to sampling procedures (as opposed to a field blank, which is opened or even prepared in the field). They measure the contamination, usually by volatile organics, from laboratory water, sample containers, site handling, transit, and storage. There is usually one trip blank per shipment batch (cooler). Trip blanks are usually used for water samples, but may also be sent with soil samples, in which case they are analyzed and reported as water samples. A problem with a trip blank impacts the contents of one cooler of samples. Field blank, Blank sample, Medium blank, Field reagent blank, Site blank – A clean sample (e.g., distilled water), carried to the sampling site, exposed to sampling conditions (e.g., bottle caps removed, preservatives added) and returned to the laboratory and treated as an environmental sample. This is different from trip blanks, which are transported to but not opened in the field. Field blanks are used to check for analytical artifacts and/or background introduced by sampling and analytical procedures. The frequency of field blanks is project specific. The term is also used for samples of source water used for decontamination and steam cleaning (DOE/HWP, 1990a, p. 17). A problem with a field blank impacts the samples from that day or that batch of field data. Rinseate blank, Equipment rinseates – A clean sample (e.g., distilled water or ASTM Type II water) passed through decontaminated sampling equipment before sampling, and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. Usually one rinseate sample is collected for each 10 samples of each matrix for each piece of equipment. A problem with a rinseate blank impacts the samples from that day or that batch of field data. Sampling equipment blank, Decontamination blank – A clean sample that is collected in a sample container with the sample-collection device after or between samples, and returned to the laboratory as a sample. Sampling equipment blanks are used to check the cleanliness of sampling devices. A problem with a sampling equipment blank impacts the contents of one sample event or batch.
LAB SAMPLE ANALYSES This section covers QC procedures performed in the laboratory, specifically involving the samples from the field. Matrix spike, Spiked sample, Laboratory spiked sample – A sample prepared by adding a known mass of target analyte to a specified amount of a field sample for which an independent estimate of target analyte concentration is available. Spiked samples are used, for example, to determine the effect of matrix interference on a method’s recovery efficiency. Usually one matrix spike is analyzed per sample batch, or 5 to 10% of the samples. Matrix spikes and matrix spike duplicates are associated with a specific analytical batch, and are not intended to represent a specific site or station. The matrix spike for a batch may be from a different laboratory client than some of the samples in the batch. A problem with a matrix spike impacts the samples from that analytical batch. Matrix spike duplicate – A duplicate of a matrix spike, used to measure the laboratory precision between samples. Usually one matrix spike duplicate is analyzed per sample batch. Percent differences between matrix spikes and matrix spike duplicates can be calculated. A problem with a matrix spike duplicate impacts the samples from that analytical batch. Surrogate spikes – Non-target analytes of known concentration that are added to organic samples prior to sample preparation and instrument analysis. They measure the efficiency of all
steps of the sample preparation and analytical method in recovering target analytes from the sample matrix, based on the assumption that non-target surrogate compounds behave the same as the target analytes. They are run with all samples, standards, and associated quality control. Spike recoveries can be calculated from spike concentrations. A problem with this type of sample impacts the samples from that analytical batch. Internal standard – Non-target analytes of known concentration that are added to organic samples following sample preparation but prior to instrument analysis (as opposed to surrogate spikes, which are added before sample preparation). They are used to determine the efficiency of the instrumentation in quantifying target analytes and for performing calculations by relative response factors. They are run with all samples, standards, and associated quality control. A problem with this type of sample impacts the samples from that analytical batch. Laboratory duplicates, Duplicate analyses or measurements, Replicate analyses or measurements, Laboratory replicates – The analyses or measurements of the variable of interest performed identically on two or more subsamples of the same sample. The results from duplicate analyses are used to evaluate analytical or measurement precision, including non-homogeneous sample matrix effects but not the precision of sampling, preservation, or storage internal to the laboratory. Typically lab duplicate analysis is performed on 5 to 10% of the samples. These terms are also used for laboratory reanalyses. A problem with this type of sample impacts the samples from that analytical batch. Laboratory reanalyses, Laboratory replicates – Repeated analyses of a single field sample aliquot that has been prepared by the same sample preparation procedure to measure the repeatability of the sample analysis. A problem with this type of analysis impacts the samples from that analytical batch. Maximum holding time – The length of time a sample can be kept under specified conditions without undergoing significant degradation of the analyte(s) or property of interest. Problems with holding times impact just those specific samples. Recovery efficiency – In an analytical method, the fraction or percentage of a target analyte extracted from a sample containing a known amount of the analyte. A problem with recovery efficiency impacts the samples from that analytical batch. Dilution factor – The numerical value obtained from dividing the new volume of a diluted sample by its original volume. This is a value to be tracked for the analysis, rather than a separate QC sample, although often one or more analytes are reported at more than one dilution. Professional judgment is then required to determine which result is most representative of the true concentration in the sample. Method of standard addition, MSA – Analysis of a series of field samples which are spiked at increasing concentrations of the target analytes. This provides a mathematical approach for quantifying analyte concentrations of the target analyte. It is used when spike recoveries are outside the QC acceptance limits specified by the method. This is more a lab calibration technique than a QC sample.
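The spike recoveries and percent differences mentioned above are commonly calculated along the lines of the following sketch. The exact formulas and acceptance limits are method- and project-specific, so treat these as illustrative rather than definitive.

def percent_recovery(spiked_result, sample_result, spike_added):
    """Matrix spike recovery, in percent."""
    return 100.0 * (spiked_result - sample_result) / spike_added

def relative_percent_difference(ms_result, msd_result):
    """RPD between a matrix spike and its duplicate, in percent."""
    return 100.0 * abs(ms_result - msd_result) / ((ms_result + msd_result) / 2.0)

print(round(percent_recovery(52.0, 2.0, 50.0), 1))        # 100.0 percent recovery
print(round(relative_percent_difference(52.0, 48.0), 1))  # 8.0 percent RPD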
LAB CALIBRATION A number of procedures are carried out in the laboratory to support the QC process, but do not directly involve the samples from the field. Blank spike, Spike – A known mass of target analyte added to a blank sample or subsample; used to determine recovery efficiency or for other quality control purposes. Blank spikes are used when sufficient field sample volumes are not provided for matrix spiking, or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch. Method blank – A clean sample containing all of the method reagents, processed simultaneously with and under the same conditions as samples containing an analyte of interest through all steps of the analytical procedure. They measure the combined contamination from reagent water or solid material, method reagents, and the sample preparation and analysis
procedures. The concentration of each analyte in the method blank should be less than the detection limit for that analyte. Method blanks are analyzed once per batch of samples, or 5 to 10% of the sample population, depending on the method specifications. A problem with this type of sample impacts the samples from that analytical batch. Instrument blank – A clean sample processed through the instrumental steps of the measurement process at the beginning of an analytical run, during, and at the end of the run. They are used to determine instrument contamination and indicate if corrective action is needed prior to proceeding with sample analysis. Normally one blank is analyzed per analytical batch, or as needed. A problem with this type of sample impacts the samples from that analytical batch. Instrument carryover blank – Laboratory reagent water samples which are analyzed after a high-level sample. They measure instrument contamination after analyzing highly concentrated samples, and are analyzed as needed when high-level samples are analyzed. A problem with this type of sample impacts the samples from that analytical batch. Reagent blank, Analytical blank, Laboratory blank, Medium blank – A sample consisting of reagent(s) (without the color forming reagent), without the target analyte or sample matrix, introduced into the analytical procedure at the appropriate point and carried through all subsequent steps. They are used to determine the contribution of the reagents and of the involved analytical steps to error in the observed value, to zero instruments, and to correct for blank values. They are usually run one per batch. A problem with this type of sample impacts the samples from that analytical batch. Check sample, QC check sample, Quality control sample, Control sample, Laboratory control sample, Laboratory control standard, Synthetic sample, LCS – An uncontaminated sample matrix spiked with known amounts of analytes usually from the same source as the calibration standards. It is generally used to establish the stability of the analytical system but may also be used to assess the performance of all or a portion of the measurement system. They are usually analyzed once per analytical batch or as per method specifications, although LCS duplicates are also sometimes run. A problem with this type of sample impacts the samples from that analytical batch. Calibration blank – Laboratory reagent water samples analyzed at the beginning of an analytical run, during, and at the end of the run. They verify the calibration of the system and measure instrument contamination or carry-over. A problem with this type of sample impacts the samples from that analytical batch. Storage blank – Laboratory reagent water samples stored in the same type of sample containers and in the same storage units as field samples. They are prepared, stored for a defined period of time, and then analyzed to monitor volatile organic contamination derived from sample storage units. Typically one blank is used for each sample batch, or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch. Blind sample, Double-blind sample – A subsample submitted for analysis with a composition and identity known to the submitter but unknown to the analyst and used to test the analyst’s or laboratory’s proficiency in the execution of the measurement process. A problem with this type of sample impacts the samples from that analytical batch. 
Dynamic blank – A sample-collection material or device (e.g., filter or reagent solution) that is not exposed to the material to be selectively captured but is transported and processed in the same manner as the sample. A problem with this type of sample impacts the samples from that analytical batch. Calibration standard, Calibration-check standard – A substance or reference material containing a known concentration of the target analytes used to calibrate an instrument. They define the working range and linearity of the analytical method and establish the relationship between instrument response and concentration. They are used according to method specifications. A problem with this type of sample impacts the samples from that analytical batch. The process of using these standards on an ongoing basis through the analytical run is called Continuous
Calibration Verification or CCV, and CCV samples are run at a project-specific frequency such as one per ten samples.
“If the outcome of a test is likely to change, conduct the test only once.” Rich (1996)
Reference standard – Standard of known analytes and concentration obtained from a source independent of the standards used for instrument calibration. They are used to verify the accuracy of the calibration standards, and are analyzed after each initial calibration or as per method specifications. A problem with this type of sample impacts the samples from that analytical batch. Measurement standard – A standard added to the prepared test portion of a sample (e.g., to the concentrated extract or the digestate) as a reference for calibrating and controlling measurement or instrumental precision and bias. Clean sample – A sample of a natural or synthetic matrix containing no detectable amount of the analyte of interest and no interfering material. This is more a lab material than a QC sample. Laboratory performance check solution – A solution of method and surrogate analytes and internal standards; used to evaluate the performance of the instrument system against defined performance criteria. This is more a lab material than a QC sample. Performance evaluation sample (PE sample), Audit sample – A sample, the composition of which is unknown to the analyst and is provided to test whether the analyst/laboratory can produce analytical results within specified performance limits. This is more a lab material than a QC sample. Spiked laboratory blank, Method check sample, Spiked reagent blank, Laboratory spiked blank – A specified amount of reagent blank fortified with a known mass of the target analyte; usually used to determine the recovery efficiency of the method. This is more a lab material than a QC sample.
METHOD-SPECIFIC QC Some analytical methods have specific QC requirements or procedures. For example, gas chromatograph analysis, especially when mass spectrometry is not used, can use two columns to generate two independent results, and the results from the second-column confirmation can be used for comparison. In many metals analyses, replicate injections, in which the results are run several times and averaged, can provide precision information and improve precision. Sample dilutions are another QC tool to help generate reproducible results. The sample is re-analyzed at one or more dilution levels to bring the target analyte into the analytical range of the instrument.
DATA QUALITY PROCEDURES For laboratory-generated analytical data, data quality is assured through data verification and validation procedures (Rosecrance, 1993). Data validation procedures developed by the EPA and state agencies provide data validation standards for specific program and project requirements. These data validation procedures usually refer to QA/QC activities conducted by a Contract Laboratory Program (CLP) EPA laboratory. They can occur at and be documented by the laboratory performing the analyses, or they can be performed by a third party independent from the laboratory and the client. Subsequent to receiving the data from the laboratory, the largest component of quality assurance as it applies to a data management system involves data import, data entry, and data editing. In all three cases, the data being placed in the system must be reviewed to determine if the data fulfills the project requirements and project specifications. Data review is a systematic process consisting of data check-in (chain of custody forms, sampling dates, entire data set received), data
entry into the database, checking the imported data against the import file, and querying and comparing the data to the current data set and to historical data. A data review flag should be provided in the database to allow users to track the progress of the data review, and to use the data appropriately depending on the status of the review. The details of how data review is accomplished and by whom must be worked out on a project-by-project basis. The software should provide a data storage location and manipulation routines for information about the data review status of each analytical value in the system. The system should allow the storage of data with different levels of data checking. A method should be provided for upgrading (and in rare cases, perhaps downgrading) the data checking status of each data item as additional checking is done, and for keeping a history of the various checking steps performed. This process can be used for importing or entering data with a low level of checking, then updating the database after the review has been performed to the required level.
Levels of data review Data review activities determine if analytical data fulfills the project requirements and project specifications. The extent of data review required for analyses will likely vary by project, and even by constituent or sample type within the project. The project managers should document the actual level of effort required for each data review step. Data review flags indicate that the project-specific requirements have been met for each step. Some projects may require third-party validation of the laboratory data. Flags in the data review table can be used to indicate if this procedure has been performed, while still tracking the data import and review steps required for entry into the data management system. It is also possible for various levels of data review to be performed prior to importing the data. Laboratories and consultants could be asked to provide data with a specified level of checking, and this information brought in with the data. The following table shows some typical review codes that might be associated with analytical data:

Data Review Code     Data Review Status
0                    Imported
1                    Vintage (historical) data
2                    Data entry checked
3                    Sampler error checked
4                    Laboratory error checked
5                    Consistent with like data
6                    Consistent with previous data
7                    In-house validation
8                    Third-party validation
Tracking review status The system should provide the capability to update a data review flag for subsets of analytical data that have undergone a specific step of data review, such as consistency checking, verification, or validation. The data review flag will alert users to the status of the data relative to the review process, and help them determine the appropriate uses for the data. A tool should be provided to allow the data administrators to update the review status codes after the appropriate data checks have been made to the analyses.
Figure 75 - An interface for managing the review status of EDMS data
Figure 75 shows an example of an interface for performing this update. Data that had previously been checked for internal consistency has now undergone a third-party validation process, and the data administrator has selected this data set and is updating the review status code. In this figure, the review status is being modified for all of the records in the set of selected data. It is helpful to be able to select data by batch or by chain of custody, and then modify the review status of that whole set at once.
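The bulk review-status update behind a screen like Figure 75 can be sketched as a single parameterized statement; here the selection is by laboratory batch, and the table, column, and code values are hypothetical.

import sqlite3

def upgrade_review_status(conn, lab_batch, new_status_code):
    """Upgrade the review status of every analysis in one laboratory batch."""
    # conn = sqlite3.connect("edms.db") in an actual session
    cur = conn.execute(
        "UPDATE analyses SET review_status = ? WHERE lab_batch = ?",
        (new_status_code, lab_batch),
    )
    conn.commit()
    return cur.rowcount

# Example: mark a batch as third-party validated (code 8 in the table above)
# upgrade_review_status(conn, "BATCH-01", 8)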
Documentation and audits In addition to data checking, a method of tracking database changes should also be instituted. The system should maintain a log of activity for each project. This log should permanently record all changes to the database for that project, including imports, data edits, and changes to the review status of data. It should include the date of the change, the name of the person making the change, and a brief description of the change. Occasional audits should be performed on the system and its use to help identify deficiencies in the process of managing the data. A remedy that adequately addresses the extent of the deficiency should follow the identification of the deficiencies. The details of the operation of the tracking and auditing procedures should be specified in the Quality Assurance Project Plan for each project.
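An activity log of this kind can be as simple as one table and one insert routine, sketched below with an assumed schema; what matters is that the date, the person, and a description of the change are captured for every import, edit, and review-status update.

import sqlite3
from datetime import datetime

def log_change(conn, project_id, user, description):
    """Add one permanent entry to the project activity log."""
    # conn = sqlite3.connect("edms.db") in an actual session
    conn.execute(
        "INSERT INTO activity_log (project_id, change_date, user_name, description) "
        "VALUES (?, ?, ?, ?)",
        (project_id, datetime.now().isoformat(timespec="seconds"), user, description),
    )
    conn.commit()

# Example: log_change(conn, "SITE-01", "jsmith", "Undid import 42 (duplicate EDD)")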
Data quality standards Quality has always been very important in organizations working with environmental data. Standards like NQA-1 and BS 7750 have made a major contribution to the quality of the results of site investigation and remediation. Recent developments in international standards can have an impact on the design and implementation of an EDMS, and the EDMS can contribute to successful implementation of these systems. ISO 9000 and ISO 14000 are families of international standards. Both families consist of standards and guidelines relating to management systems, and related supporting standards on terminology and specific tools. ISO 9000 is primarily concerned with “quality management,” while ISO 14000 is primarily concerned with “environmental management.” ISO 9000 and ISO 14000 both have a greater chance of success in an organization with good data management practices. A good source of information on ISO 9000 and ISO 14000 is www.iso.ch/9000e/9k14ke.htm. Another international standard that applies to environmental data gathering and management is ISO 17025, which focuses on improvements in testing laboratories (Edwards and Mills, 2000). It provides for preventive and corrective programs that facilitate client communication to resolve problems.
ISO 9000 ISO 9000 addresses the management of quality in the organization. The focus of ISO 9000 is on documentation, procedures, and training. A data management system can assist with providing procedures that increase the chance of generating a quality result. A newer version of ISO 9000, called QS 9000, includes some modifications for specific industries such as the automotive industry. The definition of "quality" in ISO 9000 refers to those features of a product or service that are required by the customer. "Quality management" is what the organization does to ensure that its products conform to the customer’s requirements, for example, what the organization does to minimize harmful effects of its activities on the environment.
ISO 14000 ISO 14000 encourages organizations to perform their operations in a way that has a positive, or at least not a negative, impact on the environment. An important component of this is tracking environmental performance, and an EDMS can help with this. In ISO 14000 there are ten principles for organizations implementing environmental management systems (Sayre, 1996). Some of these principles relate to environmental data management. In the following paragraphs each principle is followed by a brief discussion of how that principle relates to environmental data management and the EDMS software.
Recognize that environmental management is one of the highest priorities of any organization – This means that adequate resources should be allocated to management of environmental aspects of the business, including environmental monitoring data. Every organization that has operations with potential environmental impacts should have an efficient system for storing and accessing its environmental data. An EDMS is a powerful tool for this.
Establish and maintain communications with both internal and external interested parties – From the point of view of environmental data, the ability to quickly retrieve and communicate data relevant to issues that arise is critical, both on a routine and emergency basis. Communication between the EDMS and regulators is a good example of the former; the ad hoc query capability of an EDMS is a good example of the latter.
Determine legislative requirements and those environmental aspects associated with your activities, products, and services – Satisfaction of regulatory requirements is one of the primary purposes of an environmental data management system. An EDMS can store information
like the sampling intervals for monitoring locations and target concentration limits to assist with tracking and satisfying these requirements.
Develop commitment, by everyone in the organization, to environmental protection and clearly assign responsibilities and accountability – This is just as true of environmental data as any other aspect of the management process.
Promote environmental planning throughout the life cycle of the product and the process – Planning through the life cycle is important. Tracking is equally important, and an EDMS can help with that.
Establish a management discipline for achieving targeted performance – A key to achieving targeted performance is tracking performance, and using a database is critical for efficient tracking.
Provide the right resources and sufficient training to attain performance targets – Again, the tracking is an important part of this. You can't attain targets if you don't know where you are. Implementing an environmental data management system with the power and flexibility to get answers to important questions is critical. That system should also be easy to use, so that the resources expended on training generate the greatest return. A good EDMS will provide these features: power, flexibility, and ease of use.
Evaluate performance against policy, environmental objectives, and targets, and make improvements where possible – Again, the tracking is the important part, along with improvements, as discussed in the next principle.
Establish a process to review, monitor, and audit the environmental management system to identify opportunities for improvement in performance – This is a reflection of the Deming/Japanese/Incremental Improvement approach to quality, which has been popular in the management press. The key to this from a data management perspective is to implement open systems where small improvements are not rejected because of the effort to implement them. In an EDMS, a knowledgeable user should be able to create some new functionality, like a specific kind of report, without going through a complicated process involving formal interaction with an Information Technology group or the software vendor.
Encourage vendors to also establish environmental management systems – Propagating the environmental quality process through the industry is encouraged, and the ability to transfer data effectively can be important in this. This is even truer if it is looked at from the overall quality perspective. For example, a reference file system used to check laboratory data prior to delivery encourages the movement of quality data from the vendor (the lab) to the user within the organization.
ISO 17025 A new ISO standard called 17025 covers laboratory procedures for management of technical and quality records (Edwards and Mills, 2000). This standard requires laboratories to establish and maintain procedures for identification, collection, indexing, access, storage, maintenance, and disposal of quality and technical records. This includes original observations and derived data, along with sufficient information to establish an audit trail, calibration records, staff records, and a copy of each report issued for a defined period. The focus of the standard is on internal improvements within the laboratory, as well as corrective and preventative action programs that require client communications to satisfactorily resolve problems.
EPA GUIDANCE The U.S.E.P.A. has a strong focus on quality in its data gathering and analysis process. The EPA provides a number of guidance documents covering various quality procedures, as shown in the following table. Many are available on its Web site at www.epa.gov.
QA/G-0    EPA Quality System Overview
QA/G-3    Management Systems Review Process
QA/G-4    Data Quality Objectives Process
QA/G-4D   Decision Error Feasibility Trials (DEFT)
QA/G-4H   DQO Process for Hazardous Waste Sites
QA/G-5    Quality Assurance Project Plans
QA/G-5S   Sampling Designs to Support QA Project Plans
QA/G-6    Preparation of SOPs
QA/G-7    Technical Assessments for Data Operations
QA/G-8    Environmental Data Verification & Validation
QA/G-9    DQA: Practical Methods for Data Analysis
QA/G-9D   DQA Statistical Toolbox (DataQUEST)
QA/G-10   Developing a QA Training Program
QA/R-1    Requirements for Environmental Programs
QA/R-1ER  Extramural Research Grants
QA/R-2    Quality Management Plans
QA/R-5    Quality Assurance Project Plans
DATABASE SUPPORT FOR DATA QUALITY AND USABILITY Data review in an EDMS is a systematic process that consists of data check-in, data entry, screening, querying, and reviewing (comparing data to established criteria to ensure that data is adequate for its intended use), following specific written procedures for each project. The design of the software and system constraints can also make a great contribution to data quality. An important part of the design process and implementation plan for an EDMS involves the detailed specification of procedures that will be implemented to assure the quality of the data in the database. This part of the design affects all of the other parts of the detailed design, especially the data model, user interface, and import and export.
System tools – Data management systems provide tools to assist with maintaining data quality. For example, systems that implement transaction processing can help data quality by passing the ACID test (Greenspun, 1998):
Atomicity – The results of a transaction's execution are either all committed or all rolled back.
Consistency – The database is transformed from one valid state to another, and at no time is in an invalid state.
Isolation – The results of one transaction are invisible to another transaction until the transaction is complete.
Durability – Once committed, the results of a transaction are permanent and survive future system and media failures.
These features work together to help ensure that changes to data are made consistently and reliably. Most commercial data management programs provide these features, at least to some degree.
Data model – A well-designed, normalized data model will go a long way toward enforcing data integrity and improving quality. Referential integrity can prevent some types of data errors from occurring in the database, as discussed in Chapter 3. Also, a place should be provided in the data model for storage of data review information, along with standard flags and other QC information. During the detailed design process, data elements may be identified which would help with tracking quality-related information, and those fields should be included in the system.
User interface – A user interface must be provided with appropriate security for maintaining quality information in the database. In addition, all data entry and modification screens should
support a process of reviewing data in keeping with the quality procedures of the organization and projects. These procedures should be identified and software routines specified as part of the EDMS implementation.
Import and export – The quality component for data import involves verifying that data is imported correctly and associated correctly with data already in the database. This means that checks should be performed as part of the import process to identify any problems or unanticipated changes in the data format or content which could result in data being imported improperly or not at all. After the data has been imported, one or more data review steps should be performed which are appropriate to the quality level required for the data in the database for that particular project and expected data uses. Once the data is in the database with an appropriate level of quality, retrieving and exporting quality data requires that the selection process be robust enough to ensure that the results are representative of the data in the database, relative to the question being asked. This is described in a later section.
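Returning to the transaction processing mentioned under system tools above, the following sketch shows how an import step might use a transaction so that a failure leaves the database unchanged. The syntax is SQL Server style, and the table and column names are assumptions for illustration:

    -- Either both the sample and its analyses are committed, or neither is.
    BEGIN TRANSACTION;

    INSERT INTO SAMPLES (sample_id, station_id, sample_date)
    VALUES (1001, 'MW-1', '2002-03-15');

    INSERT INTO ANALYSES (sample_id, parameter, value, units)
    VALUES (1001, 'Sulfate', 1250, 'mg/L');

    -- In the import routine, any error raised above would trigger:
    --   ROLLBACK TRANSACTION;
    -- Otherwise the work is made permanent:
    COMMIT TRANSACTION;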
Data retrieval and quality Another significant component of quality control is data retrieval. The system should have safeguards to assure that the data being retrieved is internally consistent and provides the best representation of the data contained in the system. This involves attention to the use of flags and especially units so that the data delivered from the system is complete and internally consistent. This is easy to enforce for canned queries, but more difficult for ad hoc queries. Adequate training must be provided in the use of the software so that errors in data retrieval are minimized. Another way to say this is that the system should provide the answer that best addresses the intent of the question. This is a difficult issue, since the data delivery requirements can vary widely between projects.
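As one example of the kind of safeguard described above, a query along these lines (again with hypothetical table and column names) can flag parameters that are stored with more than one reporting unit, so that retrievals do not silently mix them:

    -- List parameters that appear with more than one unit in the database.
    SELECT parameter, COUNT(DISTINCT units) AS unit_count
    FROM ANALYSES
    GROUP BY parameter
    HAVING COUNT(DISTINCT units) > 1;

A report like this can be run routinely, or built into the canned queries so that mixed units are converted or at least made visible before the data leaves the system.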
PRECISION VS. ACCURACY Many people confuse precision, accuracy, and bias. Precision is the degree to which a set of observations or measurements of the same property, usually obtained under similar conditions, conform to themselves, which is also thought of as reproducibility or repeatability. Precision is usually expressed as standard deviation, variance, or range, in either absolute or relative terms, of a set of data. Accuracy is the degree of agreement between an observed value and an accepted reference value. Bias is the systematic or persistent distortion of a measurement process that deprives the result of representativeness (i.e., the expected sample measurement is different from the sample's true value). For a good discussion of precision and accuracy, see Patnaik (1997, p. 6). Accuracy as most people think of it includes a combination of random error (precision) and systematic error (bias) components due to sampling and analytical operations. EPA recommends that the term accuracy not be used, and that precision and bias be used to convey the information usually associated with accuracy. Figure 76, based on ENCO (1998), illustrates graphically the difference between precision and accuracy. From the perspective of the laboratory QC process, the method accuracy is based on the percent recovery of a known spike concentration from a sample matrix. The precision is based on the relative percent difference between the duplicate samples or duplicate spike samples.
Figure 76 - Illustration of precision vs. accuracy (four panels: inaccurate and imprecise, accurate and imprecise, inaccurate and precise, and accurate and precise)
PROTECTION FROM LOSS The final quality assurance component is protection of data from loss. Once data has been entered and reviewed, it should be available forever, or at least until a conscious decision is made that it is no longer needed. This protection involves physical measures such as a process for regular and reliable backups, and verification of those backups. It also involves an ongoing process of checking data contained in the database to assure that the data content remains as intended. This can involve checking data against previous reports or other methods designed to identify any improper changes to data, whether those changes were intentional or not.
Data security In order to protect the integrity and quality of the database, accessing the database and performing actions once the database is opened should be restricted to those with a legitimate business need for that access. Security methods should be developed for the EDMS that will designate what data each user is allowed to access. A desktop data management system such as Microsoft Access typically provides two methods of securing a database: setting a password for opening the database, and user-level security. The password option will protect against unauthorized user access. However, once the database is open, all database objects are available to the user. This level of security does not adequately protect sensitive data, prevent users from inadvertently breaking an application by changing code or objects on which the application depends, or keep users from inadvertently changing reviewed data. User-level security provides the ability to secure different objects in a database at different levels. Users identify themselves by logging into the database when the program is started. Permissions can be granted to groups and users to regulate database usage. Some database users, designated as data administrators, are granted permission to view, enter, or modify data. The data import and edit screens should only be accessible to these staff members. Other users should be restricted to just viewing the data. Access to tables can be restricted by group or user. In a client-server database system, such as one with Microsoft SQL Server or Oracle as a back-end, the server security model can provide an additional level of security. This is a fairly complicated subject, but implementing this type of security can provide a very high level of protection.
Figure 77 - SQL Server screen for modifying permissions on tables
For example, SQL Server provides a variety of security options to protect the server and the data stored on that server. SQL Server security determines who can log on to the server, the administrative tasks each user is allowed to perform, and which databases, database objects (tables, indexes, views, defaults, procedures, etc.), and data are available to each user. Figure 77 shows the SQL Server screen for editing permissions on Table objects. SQL Server login security can be configured for one of three security modes:
• SQL Server's own login validation process
• Windows NT/2000/XP authentication services
• A mixture of the two login types
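The table-level permissions set through a screen like the one in Figure 77 can also be expressed as SQL Server statements. The role and table names below are hypothetical, but the pattern is typical: data administrators can change data, general users can only read it, and nobody outside the administrative group can touch the tracking table:

    -- General users may only read analytical results.
    GRANT SELECT ON ANALYSES TO edms_users;

    -- Data administrators may also add and change records.
    GRANT SELECT, INSERT, UPDATE ON ANALYSES TO edms_admins;

    -- Block the general users role from modifying the activity log.
    DENY INSERT, UPDATE, DELETE ON ACTIVITY_LOG TO edms_users;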
The system administrator and a backup system administrator should be trained on how to add users with specific permissions, and change existing permissions, if necessary. To further track database additions or modifications, a dialogue box can pop up when a database administrator ends a session where data might have been changed. The box should contain the user name, site, and date, which are not editable, and a memo field, where the user can enter a description of the data modifications that were made. This information should be stored in a table to allow tracking of database activities. This table should be protected from edits so that the tracking information it contains is reliable. An example of this type of tracking is shown in Chapter 5. Some enterprise EDMS databases will be set up with multiple sites in one database. This is a good design for some organizations, because it can simplify the data management process, and allow for comparison across sites. However, often it may not be necessary for all users to have access to all sites. A user interface can be provided to limit the access of users to only specific sites. Figure 78 shows an example of a screen (in form and datasheet view) for assigning users to sites, and the software can then use this information to limit access as appropriate.
Figure 78 - Security screen for assigning users to sites
Backup Backing up the database is probably the most important maintenance activity. The basic rule of thumb is that when you have done enough work in the database that you wouldn't want to redo it, back up. Backup programs contain various options regarding scheduled vs. manual backups, backup media, compression features, etc., so you want to think through your backup strategy, then implement it. Then be sure to stay with it. The process for backing up depends on the type of database that you have. If you are running a stand-alone program like Access, you can use consumer-oriented backup tools, such as the one that comes with the operating system, or a third-party backup utility, to back up your database file. Most of these programs will not back up a file that is open, so be sure everyone who might have the database file open closes it before the backup program runs. If people keep the database open all the time, even though a backup program is running regularly, the file may not be backed up. Backing up a client-server database is more complicated. For example, there are several options for backing up the SQL Server database file. For the Windows NT/2000/XP system, the NT/2000/XP backup program can be used to back up the SQL Server database if the database file is closed during the backup process. SQL Server also contains a backup program that will back up the database files while they are open. The SQL backup program is located under Tools in SQL Enterprise Manager. The format for the SQL Server backup program is not compatible with NT/2000/XP format backups, and separate tapes must be used if both backup options are used. It is recommended that frequent backups of the database be scheduled using this program. Third-party backup programs are also available, and may provide more functionality, but at an additional cost. On a regular basis, probably daily, a backup of the data should be made to a removable medium such as tape, Zip disk, or writeable CD or DVD. The tapes or other media should be rotated according to a formal schedule so that data can be recovered if necessary from a previous date. A reliable staff member should be assigned responsibility for the backup process, which should include occasional testing to make sure that the backup tapes can be read if necessary. This task might be performed by someone in IS if that group is maintaining the server.
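As an illustration of the kind of command the SQL Server backup tools run or schedule (the database name and file path here are placeholders), a full backup can be performed with a statement like:

    -- Full backup of a hypothetical EDMS database to a disk file.
    BACKUP DATABASE edms
    TO DISK = 'D:\Backups\edms_full.bak'
    WITH INIT;  -- overwrite any previous backup set in the file

The resulting file can then be copied to tape or other removable media as part of the rotation schedule described above.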
CHAPTER 16 DATA VERIFICATION AND VALIDATION
For many projects, data verification and validation are significant components of the data management effort. A variety of related tasks are performed on the values in the database so that they are as accurate as possible, and so that their accuracy (or lack thereof) is documented. This maximizes the chance that data is useful for its intended purpose. The move toward structured data validation has been driven by the EPA Contract Laboratory Program (CLP), but the process is certainly performed on non-EPA projects as well. Verification and validation is not “one size fits all.” Different projects have different data quality objectives, and consequently different data checking activities. The purpose of this section is not to teach you to be a data validator, but rather to make you aware of some of the issues that are addressed in data validation. The details of these issues for any project are contained in the project’s quality assurance project plan (QAPP). The validation procedures are generally different for different categories of data, such as organic water analyses, inorganic water analyses, and air analyses.
TYPES OF DATA REVIEW Data review refers to the process of assessing and reporting data quality, and includes verification, validation, and data quality assessment. Data quality terms have different meanings to different people in the industry, and often people have different but strongly held beliefs in the definition of terms like validation, verification, data review, and checking. Hopefully when you use any of these terms, the person you are talking to is hearing what you mean, and not just what you say.
MEANING OF VERIFICATION EPA (2001c) defines verification as: Confirmation by examination and provision of objective evidence that specified requirements have been fulfilled. Data verification is the process of evaluating the completeness, correctness, and conformance/compliance of a specific data set against the method, procedural, or contractual requirements. Core Laboratories (1996) provides the following somewhat simpler definition of verification: Verification is the process of determining the compliance of data with method and project requirements, including both documentation and technical criteria.
Figure 79 - Data verification and validation components in the project life cycle (after EPA, 2001c). The flowchart runs from project planning through field activities and field documentation review; sample collection, management, preparation, and analysis (with the LIMS); sample receipt and data verification, producing data verification documentation and verified data; laboratory documentation review and focused data validation (as requested), producing a focused data validation report; data validation of field and analytical laboratory data, producing the data validation report and validated data; and finally data quality assessment.
Verification is sometimes used informally to refer to the process of checking values for consistency, reasonableness, spelling, etc., especially when it is done automatically by software. This is particularly important in a relational data management system, because referential integrity depends strongly on data consistency in order for the data to be imported and retrieved successfully.
MEANING OF VALIDATION EPA (2001c) defines validation as: Confirmation by examination and provision of objective evidence that the particular requirements for a specific use have been fulfilled. Data validation is an analyte- and sample-specific process that extends the evaluation of data beyond method, procedural, or contractual compliance (i.e., data verification) to determine the analytical quality of a specific data set. Core Laboratories (1996) provides the following definition of validation: Validation is the process of determining the usability of data for its intended use, including qualification of any non-compliant data. EPA (1998c) provides the following levels of data validation:
Level 0 Data Validation – Conversion of instrument output voltages to their scaled scientific units using nominal calibrations. May incorporate flags inserted by the data logger.
Level 1 Data Validation – Observations have received quantitative and qualitative reviews for accuracy, completeness, and internal consistency. Final audit reviews required.
Level 2 Data Validation – Measurements are compared for external consistency against other independent data sets (e.g., comparing surface ozone concentrations from nearby sites, intercomparing rawinsonde and radar profiler winds, etc.).
Level 3 Data Validation – Observations are determined to be physically, spatially, and temporally consistent when interpretive analyses are performed during data analysis.
Validation contains a subjective component, while verification is objective.
THE VERIFICATION AND VALIDATION PROCESS While it is possible to find definitions of verification and validation, it is not easy to draw the line between the two. Verification is the evaluation of performance against predetermined requirements, while validation focuses on the data needs of the project. The data must satisfy the compliance requirement (verification) in order to satisfy the usability requirement (validation). From a software perspective, perhaps it is best to view verification as something that software can do, and validation as something that people need to do, and that is where the data validator comes in. Data validators are professionals who spend years learning their trade. Some would say that validation is as much an art as a science, and a good validator has a “feel” for the data that takes a long time to develop. It is certainly true that an experienced person is more likely than an inexperienced person to identify problems based on minimal evidence. EPA (1996) describes validation as identifying the analytical error associated with a data set. This is then combined with sampling error to determine the measurement error. The measurement error is then combined with the sampling variability (spatial variability, etc.) to determine the total error or uncertainty. This total error is then used to evaluate the usability of the data. The validator is responsible for determining the analytical error and the sampling error. The end user is responsible for combining this with the sampling variability for the final assessment of data usability. Data validation is a decision-making process in which established quality control criteria are applied to the data. An overview of the process is illustrated in Figure 79. The validator should review the data package for completeness, assess the results of QC checks and procedures, and examine the raw data in detail to verify the accuracy of the information. The validation process involves a series of checks of the data as described in the next section. Each sample is accepted, rejected, or qualified based on these checks. Individual sample results that fail any of the checks are not thrown away, but are marked with qualifier codes so the user is aware of the problems. Accepted data can be used for any purpose. Rejected data, usually given a flag of “R,” should never be used. Qualified data, such as data that is determined to be estimated and given a “J” flag, can be used as long as it is felt to satisfy the data quality objectives, but should not be used
indiscriminately. The goal is to generate data that is technically valid, legally defensible, of known quality, and ultimately usable in making site decisions. Verification and validation requirements apply to both field and laboratory data.
If you're right 90% of the time, why quibble about the remaining 4%? Rich (1996)
Recently EPA (2001c) has been emphasizing a third part of the process, data quality assessment (DQA), which determines the credibility of the data. This has become increasingly important as more and more cases are being found of laboratory and other fraud in generating data, and in fact some laboratory operators have been successfully prosecuted for fraud. It can no longer be assumed that the laboratory and others in the process can be trusted. The DQA process involves looking at the data for clues that shortcuts have been taken or other things done which would result in invalid data. Examples of improper laboratory processes include failure to analyze samples and then fabricating the results (drylabbing); failure to conduct the required analytical steps; manipulating the sample prior to analysis, such as by fortification with additional analyte (juicing); manipulating the results during analysis, such as by reshaping a peak that is subtly out of specification (peak shaving or peak enhancement); and post-analysis alteration of results. EPA guidance documents provide warning signs to validators to assist with detecting these activities. Data verification consists of two steps. The first is identifying the project requirements for records, documentation, and technical specifications for data generation, and determining the location and source of these documents. The second is verifying that the data records that are produced or reported satisfy the method, procedural, or contractual requirements as per the field and analytical operational requirements, including sample collection, sample receipt, sample preparation, sample analysis, and data verification documentation review. The two outputs of data verification are the verified data and the data verification documentation. Data validation involves inspection of the verified data and verification documentation, a review of the verified data to determine the analytical quality of the data set, and production of a data validation report and qualified data. Documentation input to validation includes project-specific planning documents such as a QAPP or SAP, generic planning documents, field and laboratory SOPs, and published sampling and analytical methods. Data validation includes, as much as possible, the reasons for failure to meet the requirements, and the impact that this failure has on the usability of the data set.
VERIFICATION AND VALIDATION CHECKS This section describes a few typical validation checks, and comes from a variety of sources, including EPA (1996).
Data completeness – The validator should review the data package for completeness to ensure that it contains the required documents and forms.
Field QC samples – Field QC samples such as trip blanks, equipment blanks, and field duplicates are typically taken at a rate of about one per 20 field samples. The validator should confirm compliance, and use the QC results to identify the sampling error.
Holding times – For various matrices and types of analyses, the time from collection to extraction and analysis must not exceed a certain period. Exceedence of holding times is a common reason for qualifying data; a query sketch for this check appears after this list. Information on holding times for some analytes is given in Appendix D.
Equipment calibration – The appropriate project-specific or analysis-specific procedure must be used, both for initial and ongoing calibration, and the validator should check for this.
LCS and duplicates – The correct number of QC samples must be run at various stages of the process, and should agree with the primary samples within specific tolerances.
Figure 80 - Data entry screen for setting up validation parameters
Blanks – Blank samples should be run at appropriate intervals to identify contamination, and this should be confirmed by the validator.
Surrogates – Surrogates are compounds not expected in the sample, but expected to react similarly in analysis. Surrogates are added in known concentration to assist with calibration. The recovery of the surrogate must be within certain ranges.
Matrix effects – The validator should examine matrix spikes, matrix spike duplicates, surrogate spike recoveries, and internal standards responses to identify any unusual matrix effects.
Performance evaluation samples – Blind PE samples may be included in each sample group to help evaluate the laboratory's ability to identify and measure values during the sample analysis, and the validator should compare the analyzed results to the known concentrations to evaluate the laboratory performance.
Detection limits – The laboratory should be able to perform the analyses to the required detection limits, and if not, this may require qualification of the data.
In addition, there are a number of checks that should be performed for specific analytical techniques, such as for furnace AA and ICP.
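The holding-time check referenced above lends itself to automation. A minimal sketch, assuming hypothetical tables that store the collection date with each sample, the analysis date with each result, and a holding-time limit in days for each parameter and matrix, might be written in SQL Server style as:

    -- Flag results analyzed after the allowed holding time.
    SELECT s.sample_id, a.parameter, s.sample_date, a.analysis_date
    FROM SAMPLES s
    JOIN ANALYSES a ON a.sample_id = s.sample_id
    JOIN HOLDING_TIMES h ON h.parameter = a.parameter
                        AND h.matrix = s.matrix
    WHERE DATEDIFF(day, s.sample_date, a.analysis_date) > h.holding_days;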
SOFTWARE ASSISTANCE WITH VERIFICATION AND VALIDATION Data verification and validation is such an important operation for many projects that those projects must have tools and procedures to accomplish it. With the current focus on keeping the cost down on environmental projects, it is increasingly important that these tools and procedures allow the validation process to be performed efficiently. The EDMS software can help with this.
Figure 81 - Software screen for configuring and running validation statistics reports
Prior to validation Usually the data is verified before it is validated. The EDMS can be a big help with the verification process, providing checks for consistency, missing data, and referential integrity issues. Chapter 13 contains information on software assistance with checking during import. One useful approach is for the consistency component of the verification to be done as part of the standard import, and then to provide an option, after consistency checking, to import the data either directly into the database or into a validation subsystem. The validation subsystem contains tables, forms, and reports to support the verification and validation process. After validation, the data can then be moved into the database. To validate data already in the database, the data selection screen for the main database can provide a way to move the data to be validated into the validation table. Once the validation has been performed, information resulting from the validation, such as validation flags, can be added to those records in the main database. The validation system can provide Save and Restore options that allow users to move between multiple data sets. In the validation table, flagging edits and QC notations are made prior to entry into the main database. Once validation is completed, the associated analytical and field duplicate data can then be imported into the main database, and the validation table can be saved in a file or printed as documentation of the data validation process. Validation involves comparison of the data to a number of standard values, limits, and counts. These values vary by QC type, matrix, and site. Figure 80 shows a program screen for setting some of these values.
Visual validation Visual validation is the process of looking at the data and the supporting information from the laboratory and making the determination of whether the validation criteria have been met. The EDMS can help by organizing the data in various ways so that the visual validation can be done
efficiently. This is a combination of calculations and reporting. Figure 81 shows a software screen for configuring and running validation statistics reports.
VALIDATION CALCULATIONS The software can provide a set of calculations to assist the data validator. Examples include calculation of relative percentage difference (RPD) between lab and field duplicates and the primary samples, and calculation of the adequacy of the number of lab QC samples, such as calibration check verification (CCV) and laboratory control sample (LCS) analyses.
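As a sketch of one such calculation, the relative percent difference between a primary sample and its field duplicate can be computed directly in a query. The table layout, QC type codes, and the way duplicates are paired with their primary samples are assumptions here; percent recovery of a spike can be handled the same way:

    -- RPD between primary and duplicate results, in percent:
    -- RPD = |primary - duplicate| / ((primary + duplicate) / 2) * 100
    SELECT p.sample_id, p.parameter,
           ABS(p.value - d.value) / ((p.value + d.value) / 2.0) * 100 AS rpd
    FROM ANALYSES p
    JOIN ANALYSES d ON d.parent_sample_id = p.sample_id
                   AND d.parameter = p.parameter
                   AND d.qc_type = 'FD'   -- field duplicate
    WHERE p.qc_type = 'N';                -- normal (primary) sample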
VALIDATION REPORTING The key component of visual validation is inspection of the data. The software should provide reports to help with this. These reports include QC Exceedence, Summary of sample and QC control completeness, QC data summary by QC type (field dup., lab dup., LCS, CCV), and reports of quality control parameters used in RPD and recovery calculations.
STATISTICS Some of the reports used in the validation process are based on statistical calculations on the data, and in some cases on other data in the database from previous sampling events. Examples of these reports include:
Basic Statistics Report – This report calculates basic statistics, such as minimum, maximum, mean (arithmetic and/or geometric), median, standard deviation, mean plus 2 standard deviations, and upper 95th percentile by parameter for a selected date range.
Ion Balance Report – This report calculates the cation/anion percent difference. It can also calculate the percent differences of field parameters analyzed in the field and the lab, and might also add up the major constituents and compare the result to the amount of total dissolved solids (TDS) reported by the laboratory.
Trend Report – This is a basic report that compares statistics from a range of data in the database with the current data set in the validation table, and reports percent difference by parameter.
Comparison Report – This report flags data in the validation data set that is higher or lower than any analyses in the comparison dataset.
Figure 82 shows an example of one type of validation statistics report.
Figure 82 - Example of a validation statistics report (the L to the right means less than the mean)
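A sketch of the kind of query behind the Basic Statistics Report described above, again with assumed table and column names and SQL Server aggregate functions, could be as simple as:

    -- Basic statistics by parameter for a selected date range.
    SELECT a.parameter,
           MIN(a.value)    AS minimum,
           MAX(a.value)    AS maximum,
           AVG(a.value)    AS mean,
           STDEV(a.value)  AS std_dev,
           COUNT(*)        AS n
    FROM ANALYSES a
    JOIN SAMPLES s ON s.sample_id = a.sample_id
    WHERE s.sample_date BETWEEN '1995-01-01' AND '2001-12-31'
    GROUP BY a.parameter;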
Autoflagging The autoflagging process uses the results of calculations to set preliminary flags, which are then inspected by the validator and either accepted or modified. For example, the validation option can compare QC calculations against user-supplied project control limits. The data is then flagged based on data flagging options that are user-configurable. Flagging can then be reviewed and revised using edit screens included in the validation menu system.
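A minimal sketch of one autoflagging step, assuming a validation table with an RPD column and a hypothetical table of user-supplied control limits per parameter (the flag code and table names are illustrative, not a standard), might look like this in SQL Server style:

    -- Set a preliminary estimated-value flag where the duplicate RPD
    -- exceeds the project control limit for that parameter.
    UPDATE v
    SET v.validation_flag = 'J'
    FROM VALIDATION_RESULTS v
    JOIN CONTROL_LIMITS c ON c.parameter = v.parameter
    WHERE v.rpd > c.max_rpd;

The validator then reviews the preliminary flags and accepts or changes them before the data moves into the main database.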
CHAPTER 17 MANAGING MULTIPLE PROJECTS AND DATABASES
Often people managing data are working on several facilities at once, especially over the time frame of months or years. This raises several issues related to the site data, lookup tables and other related data, and ease of moving between databases. These issues include whether the data should be stored in one database or many, sharing data elements such as codes and lookups, moving between databases, and multi-site security.
ONE FILE OR MANY? If the data management system allows for storing multiple sites in the same database, then some decisions need to be made about how many databases to have, and how many and which sites to store in each. Even the concept of what constitutes a site can be difficult to define in some cases.
What is a site? While the usage of the term “site” in this book has been pretty much synonymous with “facility” or “project,” it is often not that simple. First, some people use “site” to mean a sample location, or what we are calling a “station,” including monitoring wells, soil borings, and so on. This is a difference in terminology, not concepts; it is just personal preference, and will be ignored here, with our apologies to those who use the term that way. A bigger issue, when “site” is used to mean “facility,” is deciding what constitutes a single site. The problem with defining a site (assuming the meaning of facility, and not sample location) can be illustrated with several examples. One example is a facility that has various different operations, administrative units, or environmental problems within it. Some large facilities have dozens of different environmental issues being dealt with relatively independently. For example, we have a client with a refinery site that it is investigating and remediating. A number of ponds and slag piles are being managed one way, and each has its own set of soil borings and monitoring wells. The operating facility itself has another set of problems and data associated with it. This project can be viewed as one site with multiple sub-parts, which is how the client is managing it, or as several separate sites.
Why does DC have the most lawyers per capita and New Jersey the most toxic waste dumps? New Jersey had first choice. Rich (1996)
The second case is where there are several nearby, related projects. One of our clients is remediating a nuclear processing facility. The facility itself has a number of ponds, railway areas, and building locations that must be excavated and shipped away. Over the years some tailings from the facility were used throughout the neighboring residential area for flower gardens and yards (unfortunately the tailings made good topsoil). And some material from the facility and the residential area made it into the local creek, which now requires remediation. Each of these is being managed differently, with different investigative techniques, supervision, and regulatory oversight. In this case the client has chosen to treat each area as a separate site from a data management perspective because it views them as different projects. Another example is a municipality that is using an EDMS to manage several solid waste landfills, which are, for the most part, managed separately. Two of the landfills are near each other. Each landfill has its own monitoring wells, and these can be easily assigned to the proper site. Some monitoring wells are used as background wells for both landfills, so they don't apply uniquely to either. In this case the client has elected to define a third “site” for the combined wells. At data selection time it can choose one landfill site plus the site containing the combined wells to obtain the data it needs. There is no “right” or “wrong” way to define a site. The decision should be made based on which approach provides the greatest utility for working with the data.
To lump or to split? Once you have decided what to call a site, you still have the problem of which sites to put in which databases, if your EDMS supports multiple sites. Part of the answer to this problem comes from what type of organization is managing the data, and who the data belongs to. The answer may be different for an industrial user managing data for its own sites than for a consulting company managing data for multiple clients. However, the most important issue is usually whether you need to make comparisons between sites. If you do, it will be easier if the sites are in the same database. If not, there may be no benefit to having them in the same database. In one case, a landfill company used its database system to manage groundwater data from multiple landfills. Its hydrogeologist thought that there might be a relationship between the turbidity of samples taken in the field and the concentration of contaminants reported by the lab. This company had chosen to “lump” its sites, and the hydrogeologist was able to perform some queries to compare values across dozens of landfills to see if there was in fact a correlation (there was). In this case, having all of the data in one database provided a big benefit. Consultants managing data for multiple clients have a different issue. It is unlikely that they will want to compare and report on data belonging to more than one client at a time. It is also likely that, at some point, a copy of the database will have to be provided to others, either voluntarily or through litigation, and the consultant certainly should not deliver data that it doesn't own. In this case it makes sense to have a different database for each different client. Then within that client's data a decision should be made whether to have one large database or multiple smaller ones. We recently visited a consulting company with 38 different SQL Server databases, one for each of its active clients. Another factor that often enters into the database grouping decision is geographic proximity. If several sites are nearby, there is a good chance that at some point they will be viewed as a unit, and there would be an advantage to having them in the same database. If they are in different parts of the country, it is less likely that, for example, you would want to put their values on the same map.
Figure 83 - A screen to help the user attach to different server databases.
The size of the resulting database can also have an impact on the decision. If the amount of data for each site will be very large, then combining several sites might not be a good idea because the size of the database might become unmanageable. A good understanding of the capacity of the database tool is important before making the decision.
SHARING DATA ELEMENTS It is useful to view the data in the database as consisting of the site data and supporting data. The site data consists of the site, station, sample, and analysis information. The supporting data includes the lookup tables like station types, units and unit conversions, parameter names, and so on. If you are managing several similar sites, then the supporting data could be similar. In the case of sites with a long list of constituents of concern, just managing the parameter table can take a lot of effort. If the sites are in one database, then sharing of the supporting data between them should not be an issue. If the sites are managed in separate databases, it may be desirable to have a way to move the supporting data between databases, or to propagate changes made in one database into other similar databases. One project we worked on involved multiple pipeline pumping stations across several states. The facilities were managed by different consultants, but the client wanted the data managed consistently across projects. In this case, the decision was made to store the data for each site in different databases because of the different companies responsible for each project. However, in order to keep the data management consistent, one data manager was assigned management of the parameter list and other lookups, and kept the other data administrators updated with changes so the database stayed consistent.
MOVING BETWEEN DATABASES Depending on the computing experience of the people using the databases, it might be of value to provide an easy way for users to move between databases. This is even more important in the case of client-server systems where connecting to the server database involves complicated command strings. Figure 83 shows an example of a screen that does this. Often it is helpful to be able to move data between databases. If the EDMS has a good data transfer system, such as using a formalized data transfer standard, then moving data from one database to another should not be difficult.
Figure 84 - Screen for assigning users to sites
LIMITING SITE ACCESS One of the issues that may need to be addressed if multiple sites are stored in one database is that not all users need to have access to all sites. Some users may need to work on one site, while others may need access to several or all of the sites. In this scenario, the software must provide a way for users to be assigned to one or more sites, and then limited to working with only those sites. Figure 84 shows a screen for assigning users to sites based on their Windows login ID. The software will then filter the data that each user sees to the sites to which each has been assigned.
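One way the software can apply such assignments, assuming a hypothetical table that pairs login IDs with site codes, is to filter every data query through it:

    -- Return only stations at sites assigned to the current user.
    SELECT st.site_code, st.station_name
    FROM STATIONS st
    JOIN USER_SITES us ON us.site_code = st.site_code
    WHERE us.login_id = 'DOMAIN\jsmith';   -- supplied by the application

The application fills in the login ID at run time, so users never see, and cannot select, data from sites they have not been assigned.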
PART FIVE - USING THE DATA
CHAPTER 18 DATA SELECTION
An important key to successful use of an EDMS is to allow users to easily find the data they need. There are two ways for the software to assist the user with data selection: text-based and graphical. With text-based queries, the user describes the data to be retrieved using words, generally in the query language of the software. Graphical queries involve selecting data from a graphical display such as a graph or a map. Query-by-form is a hybrid technique that uses a graphical interface to make text-based selections.
TEXT-BASED QUERIES There are two types of text-based queries: canned and ad hoc. The trade-off is ease of use vs. flexibility.
Canned queries Canned queries are procedures where the query is prepared ahead of time, and the retrieval is done the same way each time. An example would be a specific report for management or regulators, which is routinely generated from a menu selection screen. The advantage of canned selections is that they can be made very easy to use since they involve a minimum of choices for the user. The goal of this process is to make it easy to quickly generate the output that will be required most of the time by most of the users. The EDMS should make it easy to add new canned queries, and to connect to external data selection tools if required. Figure 85 shows an example of a screen from Access from which users can select pre-made queries. The different icons next to the queries represent the different query types, including select, insert, update, and delete. The user can execute a query by double-clicking on it. Queries that modify data (action queries), such as insert, update, and delete, display a warning dialog box before performing the action. Other than with the icons, this screen does not separate selection queries from action queries, which results in some risk in the hands of inexperienced or careless users.
Figure 85 - Access database window showing the Queries tab
Ad hoc queries Sometimes it is necessary to generate output with a format or data content that was not anticipated in the system design. Text selections of this type are called ad hoc queries (“ad hoc” is a Latin term meaning “for this”). These are queries that are created when they are needed for a particular use. This type of selection is more difficult to provide to users, especially casual users, in a way that they can use comfortably. It usually requires that users have a good understanding of the structure and content of the database, as well as a medium to high level of expertise in using the software, in order to perform ad hoc text-based queries. The data model should be included with the system documentation to assist them in doing this. Unfortunately, ad hoc queries also carry a high risk that the data retrieved may not be valid. For example, the user may not include the units for analyses, and the database may contain different units for a single parameter sampled at different times. The data retrieved will be invalid if the units are assumed to be the same, and there is no visible indication of the problem. This is particularly dangerous when the user is not seeing the result of the query directly, but using the data indirectly to generate some other result such as statistics or a contour map. In general, it is desirable to formalize and add to the menu as wide a variety of correctly formatted retrievals as possible. Then casual users are likely to get valid results, and “power users” can use the ad hoc queries only as necessary. Figure 86 shows an example of creation of an ad hoc text-based query. The user has created a new query, selected the tables for display, dragged the fields from the tables to the grid, and entered selection criteria. In this case, the user has asked for all “Sulfate” results for the site “Rad Industries” where the value is > 1000. Access has translated this into SQL, which is shown in the second panel, and the user can toggle between the two. The third panel shows the query in datasheet view, which displays the selected data. The design and SQL views contain the same information, although in Access it is possible to write a query, such as a union query, that can't be displayed in design view and must be shown in SQL. Some advanced users prefer to type in the SQL rather than use design view, but even for them the drag and drop can save typing and minimize errors.
Figure 86 - A text-based query in design, SQL, and datasheet views
GRAPHICAL SELECTION A second selection type is graphical selection. In this case, the user generates a graphical display, such as a map, of a given site, selects the stations (monitoring wells, borings, etc.), then retrieves associated analytical data from the database.
Figure 87 - Interactive graphical data selection
Figure 88 - Editing a well selected graphically
Figure 89 - Batch-mode graphical data selection
Geographic Information System (GIS) programs such as ArcView, MapInfo, and Enviro Spase provide various types of graphical selection capability. Some map add-ins that can be integrated with database management and other programs, such as MapObjects and GeoObjects, also offer this feature. There are two ways of graphically selecting data, interactive and batch. In Figure 87 the user has opened a map window and a list window showing a site and some monitoring wells. The user then double-clicked on one of the wells on the map, and the list window scrolled to show some additional information on the well. In Figure 88 a well was selected graphically, then the user called up an editing screen to view and possibly change data for that well. The capability of working with data in its spatial context can be a valuable addition to an EDMS. In Figure 89 the user wanted to work with wells in or near two ponds. The user dragged a rectangle to select a group of wells, and then individually selected another. Then the user asked the software to create a list of information about those wells, which is shown on the bottom part of the screen. In this case the spatial component was a critical part of the selection process. Selection based on distance from a point can also be valuable. The point can be a specific object, such as a well, or any other location on the ground, such as a proposed construction location. The GIS can help you perform these selections. Other types of graphical selection include selection from graphs and selections from cross sections. Some graphics and statistics programs allow you to create a graph, and then click on a point on the graph and bring up information about that point, which may represent a station, sample, or analysis. GIS programs that support cross section displays can provide a similar feature where a user can click on a soil boring in a cross section, and then call up data from that boring, or a specific sample for that boring.
Figure 90 - Example of query-by-form
QUERY-BY-FORM A technique that works well for systems with a variety of different user skill levels is queryby-form, or QBF. In this technique, a form is presented to the user with fields for some of the data elements that are most likely to be used for selection. The user can fill out as many of the fields as needed to select the subset that the user is interested in. The software then creates a query based on the selection criteria. This query can then be used as the basis for a variety of different lists, reports, graphs, maps, or file exports. Figure 90 shows an example of this method.
Figure 91 - Query-by-form screen showing selection criteria for different data levels
In this example, the user has selected Analyses in the upper right corner. Along the left side the user selected “Rad Industries” as the site, and “MW-1” as the station name. In the center of the screen, the user has selected a sample date range of greater than 1/1/1985, and “Sulfate” as the parameter. The lower left of the screen indicates that there are 16 records that match these criteria, meaning that there are 16 sulfate measurements for this well for this time period. When the user selected List, the form at the bottom of the screen was displayed showing the results. To be effective, the form for querying should represent the data model, but in a way that feels comfortable to the user. Also, the screen should allow the user to see the selection options available. Figure 91 shows four different versions of a screen allowing users to make selections at four different levels of the data hierarchy. The more defined the data model, the easier it is to provide advanced user-friendly selection. The Access query editor is very flexible, and will work with any tables and fields that might be in the database. However, the user has to know the values to enter into the selection criteria. If the fields are well defined and won’t change, then a screen like that shown in Figures 90 and 91 can provide selection lists to select values from. Figure 92 shows an example of a screen showing the user a list of parameter names to choose from.
Figure 92 - Query-by-form screen showing data choices
One final point to emphasize is that the quality of retrieved data depends on good selection practices. This was discussed above and in Chapter 15. Improper selection and display can produce results that are easy to misinterpret. Great care must be taken in system design, implementation, and user training so that the data retrieved accurately represents the answer to the question the user intended to ask.
CHAPTER 19

REPORTING AND DISPLAY
It takes a lot of work to build a good database. Because of this, it makes sense to get as much benefit from the data as possible. This means providing data in formats that are useful to as many aspects of the project as possible, and printed reports and other displays are one of the primary output goals of most data management projects. This chapter covers a variety of issues for reports and other displays. Graph displays are described in Chapter 20. Cross sections are discussed in Chapter 21, and maps and GIS displays in Chapter 22. Chapter 23 covers statistical analysis and display, and using the EDMS as a data source for other programs is described in Chapter 24.
TEXT OUTPUT

Whether the user has performed a canned or ad hoc query, the desired result might be a tabular display. This display can be viewed on the screen, printed, saved to a file, or copied to the clipboard for use in other applications. Figure 93 is an example of this type of display. This is the most basic type of retrieval, and is considered unformatted output: the data is there, but no particular presentation is associated with it.
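Saving such a display to a file is straightforward. The sketch below, with illustrative column names and values, writes a selection result to a tab-delimited text file, a format that opens cleanly in spreadsheets and pastes well through the clipboard.

    import csv

    # Header and rows as they might come back from a selection query (illustrative values).
    header = ["Station", "Sample Date", "Parameter", "Value", "Units"]
    rows = [
        ("MW-1", "2/26/1981", "Sulfate", 1255.0, "mg/l"),
        ("MW-1", "4/20/1981", "Sulfate", 1400.0, "mg/l"),
    ]

    # Tab-delimited text is a simple, widely readable unformatted output.
    with open("selection_output.txt", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)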
Figure 93 - Tabular display of output from the selection screen
Figure 94 - Banded report for printing
FORMATTED REPORTS

Once a selection has been made, another option is formatted output. The data can be sent to a formatted report for printing or electronic distribution. A formatted report is a template designed for a specific purpose and saved in the program. The report is based on a query or table that provides the data, and the report form provides the formatting.
Standard (banded) reports

Figure 94 is an example of a report formatted for printing. This example shows a standard banded report, where the data at different parent-child levels is displayed in horizontal bands across the page. This is the easiest type of report to create in many database systems, and is most useful when there is a large amount of information to present for each data element, because one or more lines can be dedicated to each result.
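A banded report is essentially a grouped listing. As a minimal sketch, with illustrative station, date, and result values, the loop below prints a station band, a sample-date band within it, and one detail line per result; a real report writer adds page layout, headers, and totals on top of the same grouping logic.

    from itertools import groupby

    # Illustrative result rows: (station, sample_date, parameter, value, units), sorted for grouping.
    results = [
        ("MW-1", "2/26/1981", "Nitrate", 1.7, "mg/l"),
        ("MW-1", "2/26/1981", "Sulfate", 1255.0, "mg/l"),
        ("MW-1", "4/20/1981", "Sulfate", 1400.0, "mg/l"),
    ]

    # One band per station, a band per sample date within it, and a detail line per result.
    for station, station_rows in groupby(results, key=lambda r: r[0]):
        print("Station: " + station)
        for sample_date, sample_rows in groupby(station_rows, key=lambda r: r[1]):
            print("  Sample date: " + sample_date)
            for _, _, parameter, value, units in sample_rows:
                print(f"    {parameter:<12} {value:>8} {units}")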
Cross-tab reports

Figure 95 shows a different organization called a cross-tab or pivot table report. In this layout, one element of the data is used to create the headers for columns. In this example, the sample event information is used as column headers.
Figure 95 - Cross-tab report with samples across and parameters down
Figure 96 - Cross-tab report with parameters across and samples down
Figure 96 is a cross-tab pivoted the other way, with parameters across and sample events down. In general, cross-tab reports are more compact than banded reports because multiple results can be shown on one line.
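The pivot itself is a simple reorganization of the selected rows. The sketch below, using illustrative parameter names and values, turns tall result rows into a cross-tab with one row per parameter and one column per sample event.

    # Illustrative tall-format rows: (parameter, sample_event, value).
    rows = [
        ("Field pH", "MW-1 2/26/1981", 7.8),
        ("Field pH", "MW-1 4/20/1981", 7.9),
        ("Sulfate",  "MW-1 2/26/1981", 1255.0),
        ("Sulfate",  "MW-1 4/20/1981", 1400.0),
    ]

    # Pivot: one row per parameter, one column per sample event.
    events = sorted({event for _, event, _ in rows})
    table = {}
    for parameter, event, value in rows:
        table.setdefault(parameter, {})[event] = value

    print("Parameter".ljust(12) + "".join(e.ljust(18) for e in events))
    for parameter, values in table.items():
        print(parameter.ljust(12) + "".join(str(values.get(e, "")).ljust(18) for e in events))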
Figure 97 - Data display options
Cross-tab reports provide a challenge regarding the display of field data when multiple field observations must be displayed with the analytical data. Typically there will be one result for each analyte (ignoring dilutions and reanalyses), but several observations of pH for each sample. In a cross-tab, the additional pH values can be displayed either as additional columns or additional rows. Adding rows usually takes less space than additional columns, so this may be preferred, but either way the software needs to address this issue.
FORMATTING THE RESULT

There are a number of options that can affect how the user sees the data. Figure 97 shows a panel with some of these options for how the data might be displayed. The user can select which regulatory limit or regulatory limit group to use for comparison, how to handle non-detected values, how to display graphs and handle field data, whether to include calculated parameters, how to display the values and flags, how to format the date and time, and whether to convert to consistent units and display regulatory limits.
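In code, these choices are easiest to manage when they travel together as one options object handed to the reporting routines. The sketch below is a hypothetical illustration of such a bundle; the option names and defaults are assumptions, not the fields of any particular product.

    from dataclasses import dataclass

    @dataclass
    class DisplayOptions:
        """Hypothetical bundle of the display choices a reporting panel might offer."""
        regulatory_limit: str = "Federal drinking water"  # limit type or group for comparison
        nondetect_display: str = "less_than"              # how non-detected values are shown
        include_calculated: bool = False                  # include calculated parameters
        show_flags: bool = True                           # display analytical flags with values
        date_format: str = "%m/%d/%Y"                     # date and time formatting
        convert_units: bool = True                        # convert to consistent units
        show_regulatory_limits: bool = True               # print the limits on the report

    options = DisplayOptions(nondetect_display="half_detection_limit")
    print(options)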
Regulatory limit comparison

For investigation and remediation projects, an important issue is comparison of analytical results to regulatory limits or target levels. These limits might be based on national regulations such as federal drinking water standards, state or local government regulations, or site-specific goals based on an operating permit or Record of Decision (ROD). Project requirements might be to display all data with exceedences highlighted, or to create a report with only the exceedences. For most constituents, the comparison is against a maximum value. For others, such as pH, both an upper and a lower limit must be met.

The first step in using regulatory limits is to define the limit types that will be used. Figure 98 shows a software screen for doing this. The user enters the regulatory limit types to be used, along with a code for each type. The next step is to enter the limits themselves. Figure 99 shows a form for doing this. Limits can be entered either as site-specific or for all sites. For each limit, the matrix, parameter, and limit type are entered, along with the upper and lower limits and units. The regulatory limit units are particularly important: they must be considered in later comparisons, and they should be taken into account in the conversion to consistent units described below.

There is one complication that must be addressed for limit comparison to be useful for many project requirements. Often the requirement is for different parameters, or groups of parameters, to be compared to different limit types on the same report. For example, the major ions might be compared to federal drinking water standards, while the organics may be compared to more stringent local or site-specific criteria. This requires that the software allow the use of different limits for different parameters. Figure 100 shows a screen for doing this. The user enters a name for the group, and then selects limits from the various limit types to use in that group.
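The comparison itself is simple once the limits are in hand. The sketch below shows one way it might be written, treating both the upper and the lower limit as optional so that a two-sided constituent such as pH is handled by the same function; the limit values shown are illustrative.

    def exceeds_limit(value, upper=None, lower=None):
        """Return True if the result falls outside the allowed range.

        Most constituents have only an upper limit; a few, such as pH,
        must also stay above a lower limit.
        """
        if upper is not None and value > upper:
            return True
        if lower is not None and value < lower:
            return True
        return False

    # Illustrative limits: a sulfate upper limit of 250 mg/l, and a pH range of 6.5 to 8.5 s.u.
    print(exceeds_limit(1255.0, upper=250.0))         # True - an exceedence to highlight
    print(exceeds_limit(7.8, upper=8.5, lower=6.5))   # False - within range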
Figure 98 - Form for defining regulatory limit types
Figure 99 - Form for entering regulatory limits
Figure 100 - Form for defining regulatory limit groups
Figure 101 - Selection of regulatory limit or group for reporting
After the limits and groups have been defined, they can be used in reporting. Figure 101 shows a panel from the selection screen where the user is selecting the limit type or group for comparison. The list contains both the regulatory limit types and the regulatory limit groups, so either one can be used at report time. The software code should be set up to determine which type of limit has been selected, and then retrieve the proper data for comparison.
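One simple way to handle the mixed list is to look the selection up as a group first and fall back to a plain limit type. The sketch below uses dictionaries as hypothetical stand-ins for the regulatory limit and limit group tables; the names and values are illustrative only.

    # Hypothetical stand-ins for the regulatory limit and limit group tables.
    limit_types = {
        "Federal drinking water": {"Sulfate": (250.0, None), "pH": (8.5, 6.5)},
        "Site-specific":          {"Sulfate": (100.0, None)},
    }
    limit_groups = {
        # A group chooses a limit type parameter by parameter.
        "Permit group A": {"Sulfate": "Site-specific", "pH": "Federal drinking water"},
    }

    def limits_for(selection, parameter):
        """Return (upper, lower) limits whether the selection is a limit type or a group."""
        if selection in limit_groups:
            limit_type = limit_groups[selection].get(parameter)
            if limit_type is None:
                return (None, None)
            return limit_types[limit_type].get(parameter, (None, None))
        return limit_types.get(selection, {}).get(parameter, (None, None))

    print(limits_for("Permit group A", "Sulfate"))          # (100.0, None)
    print(limits_for("Federal drinking water", "Sulfate"))  # (250.0, None)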
Value and flag

Analytical results contain much more information than just the measured value. A laboratory deliverable file may contain 30 or more fields of data for each analysis. In a banded report there is room to display all of this data. When the result is displayed in a cross-tab report, there is only one field for each result, but it is still useful to display some of this additional information. The items most commonly involved are the value, the analytical flag, and the detection limit. Different EDMS programs handle this in different ways, but one way is to use fields for reporting factor and reporting basis that are based on the analytical flag. Another way is to have a text field for each analysis containing exactly the formatting desired. Examples of reporting factor and reporting basis values, and how each result might look, are shown in the following table:

Basis code   Reporting basis                                       Reporting factor   Value   Flag   Detection limit   Result
v            Value only                                            1                  3.7     v      0.1               3.7
f            Flag only                                             1                  3.7     v      0.1               v
b            Both value and flag                                   1                  3.7     v      0.1               3.7 v
l            Less than sign (<) and detection limit or value       1                  3.7     u      0.1               < 0.1
g            Greater than sign (>) and detection limit or value    1                  3.7     u      0.1               > 0.1
d            Detection limit (times factor) and flag               1                  3.7     u      0.1               0.1 u
d            Detection limit (times factor) and flag               .5                 3.7     u      0.1               0.05 u
a            Average of values                                     1                  3.7     v      0.1               1.9
m            Dash (-) only                                         1                  3.7     v      0.1               -
The next table shows examples of some analytical flags and how the reporting factor and reporting basis might be assigned to each.
Flag code   Flag                                     Reporting factor   Reporting basis
b           Analyte detected in blank and sample     1                  v
c           Coelute                                  1                  v
d           Diluted                                  1                  v
e           Exceeds calibration range                1                  v
f           Calculated from higher dilution          1                  v
g           Concentration > value reported           1                  g
h           Result reported elsewhere                1                  f
i           Insufficient sample                      0                  v
j           Est. value; conc. < quan. limit          1                  b
l           Less than detection limit                1                  l
m           Matrix interference                      1                  v
n           Not measured                             0                  v
q           Uncertain value                          1                  v
r           Unusable data                            0                  f
s           Surrogate                                1                  v
t           Trace amount                             1                  d
u           Not detected                             0.5                l
v           Detected value                           1                  v
w           Between CRDL/IDL                         1                  v
x           Determined by associated method          1                  v
y           Calculated value                         0                  v
z           Unknown                                  1                  v
Finally, analyses can often have multiple flags, for example “uj,” but the result can only be displayed one way. The software needs to have an established priority for the reporting basis so that the display is based on the highest-priority format. Based on the previous basis code values, an example of the priority might be: f, l, g, b, d, v, a, and m. This means that for a flag of “bj” the basis codes would be “v” (from the “b” flag) and “b” (from the “j” flag). The “b” basis would have preference, so a less than sign (<) would not be displayed; instead, both the value and the flag would be shown.
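A sketch of this logic, assuming the factor and basis assignments from the tables above (only a handful of flags are included) and illustrative formatting strings:

    # A subset of the flag table above: flag -> (reporting factor, basis code).
    FLAG_TABLE = {"b": (1, "v"), "j": (1, "b"), "u": (0.5, "l"), "v": (1, "v")}

    # Basis priority from highest to lowest, as in the example in the text.
    BASIS_PRIORITY = ["f", "l", "g", "b", "d", "v", "a", "m"]

    def format_result(value, flags, detection_limit):
        """Pick the highest-priority basis among the flags and format the result."""
        entries = [FLAG_TABLE[f] for f in flags if f in FLAG_TABLE]
        if not entries:
            return str(value)
        factor, basis = min(entries, key=lambda e: BASIS_PRIORITY.index(e[1]))
        if basis == "f":
            return flags                                        # flag only
        if basis == "l":
            return "< " + str(detection_limit)                  # less than sign and detection limit
        if basis == "b":
            return str(value) + " " + flags                     # both value and flag
        if basis == "d":
            return str(detection_limit * factor) + " " + flags  # detection limit (times factor) and flag
        return str(value)                                       # "v" and anything else: value only

    print(format_result(3.7, "bj", 0.1))   # 3.7 bj  - the "b" basis outranks "v"
    print(format_result(3.7, "u", 0.1))    # < 0.1   - non-detect shown with a less than sign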
Sample Point ->     MW-1        MW-1
Sample Date ->      2/26/1981   4/20/1981   Units   Reg. Limit   Units
Parameters
Field pH            7.8         7.9         s.u.
Iron (Ferrous)      0.35        0.1         mg/l
Nitrate             1.7         1           mg/l
Potassium           6.9         6.6         mg/l
Sulfate             1255        1400        mg/l
Figure 104 - Reports with different levels of formatting for performance comparison
Formatting and performance

Keep in mind that asking the software to perform sophisticated formatting comes at a cost. In Figure 104, the panel on the top has formatted values and comparison to regulatory limits. Notice that a regulatory limit is displayed for sulfate, and both sulfate values are bolded and underlined because they exceed this limit. Also, for 4/20/1981 the value for iron shows the value and analytical flags, and the value for nitrate shows a less than sign (<) and the detection limit.